Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Suppress test failures due to logs after tests complete #6067

Merged
merged 9 commits into from
May 31, 2024

Conversation

Groxx
Copy link
Member

@Groxx Groxx commented May 28, 2024

"Fixes" test failures like this, by sweeping them under the rug:

2024/05/28 21:39:29 ----- Done -----
2024/05/28 21:39:29 Schema setup complete
PASS
coverage: 27.3% of statements in github.com/uber/cadence/client/..., github.com/uber/cadence/common/..., github.com/uber/cadence/host/..., github.com/uber/cadence/service/..., github.com/uber/cadence/tools/...

panic: Log in goroutine after TestIntegrationSuite has completed: 2024-05-28T21:39:39.501Z	DEBUG	Selected default store shard for tasklist	{"store-shard": "NonShardedStore", "wf-task-list-name": "9985f719-4b6a-4f0a-97c7-41a9e00d2414", "logging-call-at": "sharding_policy.go:100"}

goroutine 72245 [running]:
testing.(*common).logDepth(0xc0016ee680, {0xc00695ae00, 0xd5}, 0x3)
	/usr/local/go/src/testing/testing.go:1028 +0x6d4
testing.(*common).log(...)
	/usr/local/go/src/testing/testing.go:1010
testing.(*common).Logf(0xc0016ee680, {0x4a884b4, 0x2}, {0xc0021324b0, 0x1, 0x1})
	/usr/local/go/src/testing/testing.go:1061 +0xa5
go.uber.org/zap/zaptest.testingWriter.Write({{0x662be10?, 0xc0016ee680?}, 0xa0?}, {0xc003ecf400, 0xd6, 0x400})
	/go/pkg/mod/go.uber.org/[email protected]/zaptest/logger.go:130 +0x11e
go.uber.org/zap/zapcore.(*ioCore).Write(0xc0030ad920, {0xff, {0xc18db1a6dddeef12, 0xa3b5c069bd, 0x8268f40}, {0x0, 0x0}, {0x4af6670, 0x29}, {0x0, ...}, ...}, ...)
...

When the test completes, it will simply log to stderr rather than the test logger:

❯ go test -count 1 -v ./...
=== RUN   TestLoggerShouldNotFailIfLoggedLate
--- PASS: TestLoggerShouldNotFailIfLoggedLate (0.00s)
PASS
2024-05-29T16:20:50.742-0500	INFO	COULD FAIL TEST "TestLoggerShouldNotFailIfLoggedLate", logged too late: too late, orig{"logging-call-at": "testlogger_test.go:41", "log_stack": "github.com/uber/cadence/common/log/testlogger.(*fallbackTestCore).Write\n\t/User...
2024-05-29T16:20:50.742-0500	INFO	COULD FAIL TEST "TestLoggerShouldNotFailIfLoggedLate", logged too late: too late, with{"actor-id": "testing", "logging-call-at": "testlogger_test.go:42", "log_stack": "github.com/uber/cadence/common/log/testlogger.(*fallbackTestCore).Write\n\t/User...
ok  	github.com/uber/cadence/common/log/testlogger	0.586s

Ignoring the correctness part of the problem, this gives us the best of both worlds:

  • most logs are grouped by the test that produced them
  • logs produced due to incomplete shutdowns are still produced (as long as -v or some other test fails), but will not directly fail the test(s)

I am not overly fond of this, but some of our largest tests are quite flaky due to these logs, and that really isn't helping anything. Seems like a reasonable tradeoff, and we can move to an opt-in model eventually instead of applying it everywhere like this PR does.

I am not fond of this, but some of our largest tests are quite flaky
due to these logs, and that really isn't helping anything.
@Groxx
Copy link
Member Author

Groxx commented May 28, 2024

unfortunately not sure which test is producing these, AFAICT they're not testlogger-using, so I've just replaced everything I can find:

panic: Log in goroutine after TestIntegrationSuite has completed: 2024-05-28T22:34:12.141Z	ERROR	Persistent store operation failure	{"component": "matching-engine", "wf-domain-name": "integration-test-domain-a76efb8a-48f9-47fb-a74a-9e91f7713bd7", "wf-task-list-name": "integration-continue-as-new-workflow-timeout-test-tasklist", "wf-task-list-type": 0, "store-operation": "update-task-list", "error": "UpdateTaskList operation failed. Error: session has been closed", "logging-call-at": "task_reader.go:224"}
goroutine 74122 [running]:
testing.(*common).logDepth(0xc0014a2820, {0xc001b0c1c0, 0x1bb}, 0x3)
	/usr/local/go/src/testing/testing.go:1028 +0x6d4
testing.(*common).log(...)
	/usr/local/go/src/testing/testing.go:1010
testing.(*common).Logf(0xc0014a2820, {0x4a89a74, 0x2}, {0xc00732a290, 0x1, 0x1})
	/usr/local/go/src/testing/testing.go:1061 +0xa5
go.uber.org/zap/zaptest.testingWriter.Write({{0x7f1dc0d489c0?, 0xc0014a2820?}, 0x80?}, {0xc006340800, 0x1bc, 0x400})
	/go/pkg/mod/go.uber.org/[email protected]/zaptest/logger.go:130 +0x11e
go.uber.org/zap/zapcore.(*ioCore).Write(0xc008adac00, {0x2, {0xc18db4d9086c398b, 0x9eec277b0a, 0x826bf60}, {0x0, 0x0}, {0x4adb3e2, 0x22}, {0x0, ...}, ...}, ...)
	/go/pkg/mod/go.uber.org/[email protected]/zapcore/core.go:90 +0x193
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc005404210, {0xc008284400, 0x3, 0x4})
	/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:216 +0x210
go.uber.org/zap.(*Logger).Error(0xc008581080, {0x4adb3e2, 0x22}, {0xc008284400, 0x3, 0x4})
	/go/pkg/mod/go.uber.org/[email protected]/logger.go:203 +0x68
github.com/uber/cadence/common/log/loggerimpl.(*loggerImpl).Error(0xc0073c5620, {0x4adb3e2, 0x22}, {0xc0071f9480, 0x2, 0x2})
	/cadence/common/log/loggerimpl/logger.go:150 +0x19f
github.com/uber/cadence/service/matching/tasklist.(*taskReader).getTasksPump(0xc003ada000)
	/cadence/service/matching/tasklist/task_reader.go:224 +0xe17
github.com/uber/cadence/service/matching/tasklist.(*taskReader).Start.func2()
	/cadence/service/matching/tasklist/task_reader.go:141 +0x98
created by github.com/uber/cadence/service/matching/tasklist.(*taskReader).Start in goroutine 74130
	/cadence/service/matching/tasklist/task_reader.go:139 +0x38e
FAIL	github.com/uber/cadence/host	682.736s
FAIL

If this commit doesn't work, I'll have to see if I screwed up the implementation.

gonna rerun a few times and check output, I believe I should see "bad" logs or it's probably just lacking evidence that it fixed anything.

Copy link

codecov bot commented May 28, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 68.16%. Comparing base (5f45da5) to head (01b0a67).
Report is 2 commits behind head on master.

Additional details and impacted files

see 13 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5f45da5...01b0a67. Read the comment docs.

@Groxx
Copy link
Member Author

Groxx commented May 29, 2024

.... I think I'm going to sit on this for a while. I tried to construct a test that would show if I implemented this correctly, and it pretty randomly fails or succeeds in super duper confusing ways.
Even several seconds later, when waiting in a different test entirely, it still doesn't always mark t.done = true in the test for some reason, and doesn't always notice the log. Which seems insane from the implementation.

Not sure if my Go is screwing up or blending caches or if this log-detecting is just fundamentally unreliable for some reason.


edit: well. it's fundamentally unreliable. golang/go#67701
I'm not sure if this is something that'll change or not. it may fail too many tests...

@Groxx Groxx merged commit 4edac98 into cadence-workflow:master May 31, 2024
21 checks passed
@Groxx Groxx deleted the test-log-shenanigans branch May 31, 2024 18:18
timl3136 pushed a commit to timl3136/cadence that referenced this pull request Jun 6, 2024
…flow#6067)

"Fixes" test failures like this, by sweeping them under the rug:
```
2024/05/28 21:39:29 ----- Done -----
2024/05/28 21:39:29 Schema setup complete
PASS
coverage: 27.3% of statements in github.com/uber/cadence/client/..., github.com/uber/cadence/common/..., github.com/uber/cadence/host/..., github.com/uber/cadence/service/..., github.com/uber/cadence/tools/...

panic: Log in goroutine after TestIntegrationSuite has completed: 2024-05-28T21:39:39.501Z	DEBUG	Selected default store shard for tasklist	{"store-shard": "NonShardedStore", "wf-task-list-name": "9985f719-4b6a-4f0a-97c7-41a9e00d2414", "logging-call-at": "sharding_policy.go:100"}

goroutine 72245 [running]:
testing.(*common).logDepth(0xc0016ee680, {0xc00695ae00, 0xd5}, 0x3)
	/usr/local/go/src/testing/testing.go:1028 +0x6d4
testing.(*common).log(...)
	/usr/local/go/src/testing/testing.go:1010
testing.(*common).Logf(0xc0016ee680, {0x4a884b4, 0x2}, {0xc0021324b0, 0x1, 0x1})
	/usr/local/go/src/testing/testing.go:1061 +0xa5
go.uber.org/zap/zaptest.testingWriter.Write({{0x662be10?, 0xc0016ee680?}, 0xa0?}, {0xc003ecf400, 0xd6, 0x400})
	/go/pkg/mod/go.uber.org/[email protected]/zaptest/logger.go:130 +0x11e
go.uber.org/zap/zapcore.(*ioCore).Write(0xc0030ad920, {0xff, {0xc18db1a6dddeef12, 0xa3b5c069bd, 0x8268f40}, {0x0, 0x0}, {0x4af6670, 0x29}, {0x0, ...}, ...}, ...)
...
```
When the test completes, it will simply log to stderr rather than the test logger:
```
❯ go test -count 1 -v ./...
=== RUN   TestLoggerShouldNotFailIfLoggedLate
--- PASS: TestLoggerShouldNotFailIfLoggedLate (0.00s)
PASS
2024-05-29T16:20:50.742-0500	INFO	COULD FAIL TEST "TestLoggerShouldNotFailIfLoggedLate", logged too late: too late, orig{"logging-call-at": "testlogger_test.go:41", "log_stack": "github.com/uber/cadence/common/log/testlogger.(*fallbackTestCore).Write\n\t/User...
2024-05-29T16:20:50.742-0500	INFO	COULD FAIL TEST "TestLoggerShouldNotFailIfLoggedLate", logged too late: too late, with{"actor-id": "testing", "logging-call-at": "testlogger_test.go:42", "log_stack": "github.com/uber/cadence/common/log/testlogger.(*fallbackTestCore).Write\n\t/User...
ok  	github.com/uber/cadence/common/log/testlogger	0.586s
```

Ignoring the correctness part of the problem, this gives us the best of both worlds:
- most logs are grouped by the test that produced them
- logs produced due to incomplete shutdowns are still produced (as long as `-v` or some other test fails), but will not directly fail the test(s)

I am not overly fond of this, but some of our largest tests are quite flaky due to these logs, and that really isn't helping anything.  Seems like a reasonable tradeoff, and we can move to an opt-in model eventually instead of applying it everywhere like this PR does.
Groxx added a commit to cadence-workflow/cadence-go-client that referenced this pull request Nov 1, 2024
Fixing the flakiness that led to #1375.

The races in these tests were due to `t.Log` calls occurring after the test finishes,
because the workflow (and test suite and tests and...) does not wait for goroutines to shut down.

It's an annoying enough issue that I tackled it with gusto in cadence-workflow/cadence#6067
and it's probably worth porting over here too.
Though the underlying "shut down and do not wait" behavior is still extremely dangerous and needs to be fixed some day.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants