roachtest: reset `failuresSuppressed` in `resetFailures` #120497

renatolabs · 2024-03-14T15:04:17Z

In #119535, we introduced a resetFailures method that is called after a test failure when the test runner identifies that a VM was preempted while the test was running. This function makes sure that the VM preemption error is the only one visible to the issue poster, ensuring that whenever a VM is preempted, the test failure is routed to Test Eng.

However, there was one situation where we still would not get the routing right: when the test times out. In this situation, we set the internal failuresSuppressed field to true; in other words, resetFailures followed by the VM preempmtion error would not have the desired effect of making the VM preemption error visible to the issue poster. While #119535 solves the most common source of test timeouts due to preemption (a bug in the monitor), tests can run arbitrary code and can timeout if a VM is preempted.

In this commit, we reset failuresSuppressed in resetFailures to make sure that the VM preemption error is taken into account when reporting failures in arbitrary test timeouts.

Fixes: #120381

Release note: None

In cockroachdb#119535, we introduced a `resetFailures` method that is called after a test failure when the test runner identifies that a VM was preempted while the test was running. This function makes sure that the VM preemption error is the only one visible to the issue poster, ensuring that whenever a VM is preempted, the test failure is routed to Test Eng. However, there was one situation where we still would not get the routing right: when the test times out. In this situation, we set the internal `failuresSuppressed` field to true; in other words, `resetFailures` followed by the VM preempmtion error would not have the desired effect of making the VM preemption error visible to the issue poster. While cockroachdb#119535 solves the most common source of test timeouts due to preemption (a bug in the monitor), tests can run arbitrary code and can timeout if a VM is preempted. In this commit, we reset `failuresSuppressed` in `resetFailures` to make sure that the VM preemption error is taken into account when reporting failures in arbitrary test timeouts. Fixes: cockroachdb#120381 Release note: None

cockroach-teamcity · 2024-03-14T15:04:23Z

This change is

DarrylWong

So this solves the case where the VM is preempted but the monitor doesn't error whereas the previous PR solves the case where the monitor does error?

Am I understanding that correctly?

renatolabs · 2024-03-14T15:49:17Z

So this solves the case where the VM is preempted but the monitor doesn't error whereas the previous PR solves the case where the monitor does error?

Am I understanding that correctly?

I think it's hard to state this in terms of whether the monitor errors or not because not every test uses a monitor (the scrub failure this PR fixes doesn't, for example). The change here is meant to fix the "general" timeout case.

#119535 fixed the monitor such that, if a VM was preempted, the monitor would fail (instead of hanging until the test timed out). This was by far the most common type of test timeout due to VM preemption. It didn't change any behaviour as far as "what happens when a test times out", it just made timeouts due to VM preemption a lot less likely.

However, tests may timeout for other reasons: they are arbitrary code, so if they use an unbounded retry loop, for instance, they could also timeout in ways that are outside our control. What this current PR does is make sure that when the test does time out, we are still able to route the failure to test eng if there was a VM preemption.

Prior to this PR, the VM preemption would be identified but we would fail to add the failure record because the timeout happened first (and therefore failuresSuppressed was set to true). As a result, the GitHub issue poster would never see the VM preemption error and would just route the failure to the owner indicated in the test spec.

Does that make more sense?

DarrylWong · 2024-03-14T15:57:06Z

Ah yes that makes much more sense. I was trying to think of why a preempted VM wouldn't have the monitor error out, when there just isn't a monitor in the first place. That seems straightforward enough to me then. Thanks!

renatolabs · 2024-03-14T16:02:28Z

TFTR!

bors r=DarrylWong

craig · 2024-03-14T16:54:47Z

Build succeeded:

renatolabs requested a review from a team as a code owner March 14, 2024 15:04

renatolabs requested review from srosenberg and DarrylWong and removed request for a team March 14, 2024 15:04

DarrylWong approved these changes Mar 14, 2024

View reviewed changes

craig bot merged commit 24c998b into cockroachdb:master Mar 14, 2024
18 of 19 checks passed

renatolabs deleted the rc/roachtest-reset-failures-resets-suppressed branch March 14, 2024 17:01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

roachtest: reset `failuresSuppressed` in `resetFailures` #120497

roachtest: reset `failuresSuppressed` in `resetFailures` #120497

renatolabs commented Mar 14, 2024

cockroach-teamcity commented Mar 14, 2024

DarrylWong left a comment

renatolabs commented Mar 14, 2024 •

edited

Loading

DarrylWong commented Mar 14, 2024 •

edited

Loading

renatolabs commented Mar 14, 2024

craig bot commented Mar 14, 2024

roachtest: reset failuresSuppressed in resetFailures #120497

roachtest: reset failuresSuppressed in resetFailures #120497

Conversation

renatolabs commented Mar 14, 2024

cockroach-teamcity commented Mar 14, 2024

DarrylWong left a comment

Choose a reason for hiding this comment

renatolabs commented Mar 14, 2024 • edited Loading

DarrylWong commented Mar 14, 2024 • edited Loading

renatolabs commented Mar 14, 2024

craig bot commented Mar 14, 2024

roachtest: reset `failuresSuppressed` in `resetFailures` #120497

roachtest: reset `failuresSuppressed` in `resetFailures` #120497

renatolabs commented Mar 14, 2024 •

edited

Loading

DarrylWong commented Mar 14, 2024 •

edited

Loading