Suppress test failures due to logs after tests complete #6067

Groxx · 2024-05-28T22:17:31Z

"Fixes" test failures like this, by sweeping them under the rug:

2024/05/28 21:39:29 ----- Done -----
2024/05/28 21:39:29 Schema setup complete
PASS
coverage: 27.3% of statements in github.com/uber/cadence/client/..., github.com/uber/cadence/common/..., github.com/uber/cadence/host/..., github.com/uber/cadence/service/..., github.com/uber/cadence/tools/...

panic: Log in goroutine after TestIntegrationSuite has completed: 2024-05-28T21:39:39.501Z	DEBUG	Selected default store shard for tasklist	{"store-shard": "NonShardedStore", "wf-task-list-name": "9985f719-4b6a-4f0a-97c7-41a9e00d2414", "logging-call-at": "sharding_policy.go:100"}

goroutine 72245 [running]:
testing.(*common).logDepth(0xc0016ee680, {0xc00695ae00, 0xd5}, 0x3)
	/usr/local/go/src/testing/testing.go:1028 +0x6d4
testing.(*common).log(...)
	/usr/local/go/src/testing/testing.go:1010
testing.(*common).Logf(0xc0016ee680, {0x4a884b4, 0x2}, {0xc0021324b0, 0x1, 0x1})
	/usr/local/go/src/testing/testing.go:1061 +0xa5
go.uber.org/zap/zaptest.testingWriter.Write({{0x662be10?, 0xc0016ee680?}, 0xa0?}, {0xc003ecf400, 0xd6, 0x400})
	/go/pkg/mod/go.uber.org/[email protected]/zaptest/logger.go:130 +0x11e
go.uber.org/zap/zapcore.(*ioCore).Write(0xc0030ad920, {0xff, {0xc18db1a6dddeef12, 0xa3b5c069bd, 0x8268f40}, {0x0, 0x0}, {0x4af6670, 0x29}, {0x0, ...}, ...}, ...)
...

When the test completes, it will simply log to stderr rather than the test logger:

❯ go test -count 1 -v ./...
=== RUN   TestLoggerShouldNotFailIfLoggedLate
--- PASS: TestLoggerShouldNotFailIfLoggedLate (0.00s)
PASS
2024-05-29T16:20:50.742-0500	INFO	COULD FAIL TEST "TestLoggerShouldNotFailIfLoggedLate", logged too late: too late, orig{"logging-call-at": "testlogger_test.go:41", "log_stack": "github.com/uber/cadence/common/log/testlogger.(*fallbackTestCore).Write\n\t/User...
2024-05-29T16:20:50.742-0500	INFO	COULD FAIL TEST "TestLoggerShouldNotFailIfLoggedLate", logged too late: too late, with{"actor-id": "testing", "logging-call-at": "testlogger_test.go:42", "log_stack": "github.com/uber/cadence/common/log/testlogger.(*fallbackTestCore).Write\n\t/User...
ok  	github.com/uber/cadence/common/log/testlogger	0.586s

Ignoring the correctness part of the problem, this gives us the best of both worlds:

most logs are grouped by the test that produced them
logs produced due to incomplete shutdowns are still produced (as long as -v or some other test fails), but will not directly fail the test(s)

I am not overly fond of this, but some of our largest tests are quite flaky due to these logs, and that really isn't helping anything. Seems like a reasonable tradeoff, and we can move to an opt-in model eventually instead of applying it everywhere like this PR does.

I am not fond of this, but some of our largest tests are quite flaky due to these logs, and that really isn't helping anything.

Groxx · 2024-05-28T22:49:24Z

unfortunately not sure which test is producing these, AFAICT they're not testlogger-using, so I've just replaced everything I can find:

panic: Log in goroutine after TestIntegrationSuite has completed: 2024-05-28T22:34:12.141Z	ERROR	Persistent store operation failure	{"component": "matching-engine", "wf-domain-name": "integration-test-domain-a76efb8a-48f9-47fb-a74a-9e91f7713bd7", "wf-task-list-name": "integration-continue-as-new-workflow-timeout-test-tasklist", "wf-task-list-type": 0, "store-operation": "update-task-list", "error": "UpdateTaskList operation failed. Error: session has been closed", "logging-call-at": "task_reader.go:224"}
goroutine 74122 [running]:
testing.(*common).logDepth(0xc0014a2820, {0xc001b0c1c0, 0x1bb}, 0x3)
	/usr/local/go/src/testing/testing.go:1028 +0x6d4
testing.(*common).log(...)
	/usr/local/go/src/testing/testing.go:1010
testing.(*common).Logf(0xc0014a2820, {0x4a89a74, 0x2}, {0xc00732a290, 0x1, 0x1})
	/usr/local/go/src/testing/testing.go:1061 +0xa5
go.uber.org/zap/zaptest.testingWriter.Write({{0x7f1dc0d489c0?, 0xc0014a2820?}, 0x80?}, {0xc006340800, 0x1bc, 0x400})
	/go/pkg/mod/go.uber.org/[email protected]/zaptest/logger.go:130 +0x11e
go.uber.org/zap/zapcore.(*ioCore).Write(0xc008adac00, {0x2, {0xc18db4d9086c398b, 0x9eec277b0a, 0x826bf60}, {0x0, 0x0}, {0x4adb3e2, 0x22}, {0x0, ...}, ...}, ...)
	/go/pkg/mod/go.uber.org/[email protected]/zapcore/core.go:90 +0x193
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc005404210, {0xc008284400, 0x3, 0x4})
	/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:216 +0x210
go.uber.org/zap.(*Logger).Error(0xc008581080, {0x4adb3e2, 0x22}, {0xc008284400, 0x3, 0x4})
	/go/pkg/mod/go.uber.org/[email protected]/logger.go:203 +0x68
github.com/uber/cadence/common/log/loggerimpl.(*loggerImpl).Error(0xc0073c5620, {0x4adb3e2, 0x22}, {0xc0071f9480, 0x2, 0x2})
	/cadence/common/log/loggerimpl/logger.go:150 +0x19f
github.com/uber/cadence/service/matching/tasklist.(*taskReader).getTasksPump(0xc003ada000)
	/cadence/service/matching/tasklist/task_reader.go:224 +0xe17
github.com/uber/cadence/service/matching/tasklist.(*taskReader).Start.func2()
	/cadence/service/matching/tasklist/task_reader.go:141 +0x98
created by github.com/uber/cadence/service/matching/tasklist.(*taskReader).Start in goroutine 74130
	/cadence/service/matching/tasklist/task_reader.go:139 +0x38e
FAIL	github.com/uber/cadence/host	682.736s
FAIL

If this commit doesn't work, I'll have to see if I screwed up the implementation.

gonna rerun a few times and check output, I believe I should see "bad" logs or it's probably just lacking evidence that it fixed anything.

codecov · 2024-05-28T23:08:22Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 68.16%. Comparing base (5f45da5) to head (01b0a67).
Report is 2 commits behind head on master.

Additional details and impacted files

see 13 files with indirect coverage changes

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5f45da5...01b0a67. Read the comment docs.

Groxx · 2024-05-29T00:26:20Z

.... I think I'm going to sit on this for a while. I tried to construct a test that would show if I implemented this correctly, and it pretty randomly fails or succeeds in super duper confusing ways.
Even several seconds later, when waiting in a different test entirely, it still doesn't always mark t.done = true in the test for some reason, and doesn't always notice the log. Which seems insane from the implementation.

Not sure if my Go is screwing up or blending caches or if this log-detecting is just fundamentally unreliable for some reason.

edit: well. it's fundamentally unreliable. golang/go#67701
I'm not sure if this is something that'll change or not. it may fail too many tests...

…flow#6067) "Fixes" test failures like this, by sweeping them under the rug: ``` 2024/05/28 21:39:29 ----- Done ----- 2024/05/28 21:39:29 Schema setup complete PASS coverage: 27.3% of statements in github.com/uber/cadence/client/..., github.com/uber/cadence/common/..., github.com/uber/cadence/host/..., github.com/uber/cadence/service/..., github.com/uber/cadence/tools/... panic: Log in goroutine after TestIntegrationSuite has completed: 2024-05-28T21:39:39.501Z DEBUG Selected default store shard for tasklist {"store-shard": "NonShardedStore", "wf-task-list-name": "9985f719-4b6a-4f0a-97c7-41a9e00d2414", "logging-call-at": "sharding_policy.go:100"} goroutine 72245 [running]: testing.(*common).logDepth(0xc0016ee680, {0xc00695ae00, 0xd5}, 0x3) /usr/local/go/src/testing/testing.go:1028 +0x6d4 testing.(*common).log(...) /usr/local/go/src/testing/testing.go:1010 testing.(*common).Logf(0xc0016ee680, {0x4a884b4, 0x2}, {0xc0021324b0, 0x1, 0x1}) /usr/local/go/src/testing/testing.go:1061 +0xa5 go.uber.org/zap/zaptest.testingWriter.Write({{0x662be10?, 0xc0016ee680?}, 0xa0?}, {0xc003ecf400, 0xd6, 0x400}) /go/pkg/mod/go.uber.org/[email protected]/zaptest/logger.go:130 +0x11e go.uber.org/zap/zapcore.(*ioCore).Write(0xc0030ad920, {0xff, {0xc18db1a6dddeef12, 0xa3b5c069bd, 0x8268f40}, {0x0, 0x0}, {0x4af6670, 0x29}, {0x0, ...}, ...}, ...) ... ``` When the test completes, it will simply log to stderr rather than the test logger: ``` ❯ go test -count 1 -v ./... === RUN TestLoggerShouldNotFailIfLoggedLate --- PASS: TestLoggerShouldNotFailIfLoggedLate (0.00s) PASS 2024-05-29T16:20:50.742-0500 INFO COULD FAIL TEST "TestLoggerShouldNotFailIfLoggedLate", logged too late: too late, orig{"logging-call-at": "testlogger_test.go:41", "log_stack": "github.com/uber/cadence/common/log/testlogger.(*fallbackTestCore).Write\n\t/User... 2024-05-29T16:20:50.742-0500 INFO COULD FAIL TEST "TestLoggerShouldNotFailIfLoggedLate", logged too late: too late, with{"actor-id": "testing", "logging-call-at": "testlogger_test.go:42", "log_stack": "github.com/uber/cadence/common/log/testlogger.(*fallbackTestCore).Write\n\t/User... ok github.com/uber/cadence/common/log/testlogger 0.586s ``` Ignoring the correctness part of the problem, this gives us the best of both worlds: - most logs are grouped by the test that produced them - logs produced due to incomplete shutdowns are still produced (as long as `-v` or some other test fails), but will not directly fail the test(s) I am not overly fond of this, but some of our largest tests are quite flaky due to these logs, and that really isn't helping anything. Seems like a reasonable tradeoff, and we can move to an opt-in model eventually instead of applying it everywhere like this PR does.

Fixing the flakiness that led to #1375. The races in these tests were due to `t.Log` calls occurring after the test finishes, because the workflow (and test suite and tests and...) does not wait for goroutines to shut down. It's an annoying enough issue that I tackled it with gusto in cadence-workflow/cadence#6067 and it's probably worth porting over here too. Though the underlying "shut down and do not wait" behavior is still extremely dangerous and needs to be fixed some day.

Suppress test failures due to logs after tests complete

6824e2a

I am not fond of this, but some of our largest tests are quite flaky due to these logs, and that really isn't helping anything.

Groxx marked this pull request as ready for review May 28, 2024 22:20

Groxx requested review from Shaddoll, neil-xie, davidporter-id-au, shijiesheng, agautam478, jakobht, 3vilhamster, sankari165, dkrotx, taylanisikdemir and demirkayaender as code owners May 28, 2024 22:20

taylanisikdemir approved these changes May 28, 2024

View reviewed changes

wrap everything to see if this catches the last failures

9dae9a8

Groxx added 6 commits May 29, 2024 14:07

reliably failing ugh

5460f54

reliably fails

ef38319

go back to flaky

bc5926e

reliable

b66df6f

tested and working logger

10152da

a bit more testing to show detailed behavior

0f715d5

Groxx requested a review from taylanisikdemir May 29, 2024 21:40

fmt

01b0a67

taylanisikdemir approved these changes May 31, 2024

View reviewed changes

Groxx merged commit 4edac98 into cadence-workflow:master May 31, 2024
21 checks passed

Groxx deleted the test-log-shenanigans branch May 31, 2024 18:18

Groxx mentioned this pull request Nov 1, 2024

Restore race-checking tests cadence-workflow/cadence-go-client#1376

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Suppress test failures due to logs after tests complete #6067

Suppress test failures due to logs after tests complete #6067

Groxx commented May 28, 2024 •

edited

Loading

Groxx commented May 28, 2024 •

edited

Loading

codecov bot commented May 28, 2024 •

edited

Loading

Groxx commented May 29, 2024 •

edited

Loading

Suppress test failures due to logs after tests complete #6067

Suppress test failures due to logs after tests complete #6067

Conversation

Groxx commented May 28, 2024 • edited Loading

Groxx commented May 28, 2024 • edited Loading

codecov bot commented May 28, 2024 • edited Loading

Codecov Report

Groxx commented May 29, 2024 • edited Loading

Groxx commented May 28, 2024 •

edited

Loading

Groxx commented May 28, 2024 •

edited

Loading

codecov bot commented May 28, 2024 •

edited

Loading

Groxx commented May 29, 2024 •

edited

Loading