Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: reset failuresSuppressed in resetFailures #120497

Conversation

renatolabs
Copy link
Contributor

In #119535, we introduced a resetFailures method that is called after a test failure when the test runner identifies that a VM was preempted while the test was running. This function makes sure that the VM preemption error is the only one visible to the issue poster, ensuring that whenever a VM is preempted, the test failure is routed to Test Eng.

However, there was one situation where we still would not get the routing right: when the test times out. In this situation, we set the internal failuresSuppressed field to true; in other words, resetFailures followed by the VM preempmtion error would not have the desired effect of making the VM preemption error visible to the issue poster. While #119535 solves the most common source of test timeouts due to preemption (a bug in the monitor), tests can run arbitrary code and can timeout if a VM is preempted.

In this commit, we reset failuresSuppressed in resetFailures to make sure that the VM preemption error is taken into account when reporting failures in arbitrary test timeouts.

Fixes: #120381

Release note: None

In cockroachdb#119535, we introduced a `resetFailures` method that is called
after a test failure when the test runner identifies that a VM was
preempted while the test was running. This function makes sure that
the VM preemption error is the only one visible to the issue poster,
ensuring that whenever a VM is preempted, the test failure is routed
to Test Eng.

However, there was one situation where we still would not get the
routing right: when the test times out. In this situation, we set the
internal `failuresSuppressed` field to true; in other words,
`resetFailures` followed by the VM preempmtion error would not have
the desired effect of making the VM preemption error visible to the
issue poster. While cockroachdb#119535 solves the most common source of test
timeouts due to preemption (a bug in the monitor), tests can run
arbitrary code and can timeout if a VM is preempted.

In this commit, we reset `failuresSuppressed` in `resetFailures` to
make sure that the VM preemption error is taken into account when
reporting failures in arbitrary test timeouts.

Fixes: cockroachdb#120381

Release note: None
@renatolabs renatolabs requested a review from a team as a code owner March 14, 2024 15:04
@renatolabs renatolabs requested review from srosenberg and DarrylWong and removed request for a team March 14, 2024 15:04
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Contributor

@DarrylWong DarrylWong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this solves the case where the VM is preempted but the monitor doesn't error whereas the previous PR solves the case where the monitor does error?

Am I understanding that correctly?

@renatolabs
Copy link
Contributor Author

renatolabs commented Mar 14, 2024

So this solves the case where the VM is preempted but the monitor doesn't error whereas the previous PR solves the case where the monitor does error?

Am I understanding that correctly?

I think it's hard to state this in terms of whether the monitor errors or not because not every test uses a monitor (the scrub failure this PR fixes doesn't, for example). The change here is meant to fix the "general" timeout case.

#119535 fixed the monitor such that, if a VM was preempted, the monitor would fail (instead of hanging until the test timed out). This was by far the most common type of test timeout due to VM preemption. It didn't change any behaviour as far as "what happens when a test times out", it just made timeouts due to VM preemption a lot less likely.

However, tests may timeout for other reasons: they are arbitrary code, so if they use an unbounded retry loop, for instance, they could also timeout in ways that are outside our control. What this current PR does is make sure that when the test does time out, we are still able to route the failure to test eng if there was a VM preemption.

Prior to this PR, the VM preemption would be identified but we would fail to add the failure record because the timeout happened first (and therefore failuresSuppressed was set to true). As a result, the GitHub issue poster would never see the VM preemption error and would just route the failure to the owner indicated in the test spec.

Does that make more sense?

@DarrylWong
Copy link
Contributor

DarrylWong commented Mar 14, 2024

Ah yes that makes much more sense. I was trying to think of why a preempted VM wouldn't have the monitor error out, when there just isn't a monitor in the first place. That seems straightforward enough to me then. Thanks!

@renatolabs
Copy link
Contributor Author

TFTR!

bors r=DarrylWong

@craig
Copy link
Contributor

craig bot commented Mar 14, 2024

@craig craig bot merged commit 24c998b into cockroachdb:master Mar 14, 2024
18 of 19 checks passed
@renatolabs renatolabs deleted the rc/roachtest-reset-failures-resets-suppressed branch March 14, 2024 17:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

roachtest: VM preemption: scrub/index-only/tpcc/w=100 failed
3 participants