Retry flaky e2e tests at most 2 times #31682
Conversation
Size Change: 0 B. Total Size: 1.31 MB.
😄 - I think this will make our lives a bit easier, but only in the short term, as we'll be increasing technical debt.
What's the problem with a manual retry? It's good to get a sense of what's breaking from time to time and to try to fix it.
I'm not strongly against this, but I believe we first need a more scalable way to track unstable tests before doing it. Right now we rely too much on pinging folks every time something happens, which may not scale forever. @gwwar had some good ideas on this subject.
Manual retry has to re-run all the e2e tests, which can be very slow; running them once is already slow enough. I agree we should still try to alert when something fails so that we can fix it properly. But oftentimes such cases are extremely difficult to resolve and require deep domain knowledge of that specific test. I'm open to discussions/suggestions on how we can still alert on failing tests with retrying enabled (hence this is only a draft PR for now). I'm thinking maybe we can post a comment on the commit that has an intermittently failing test? We could go a step further and automatically tag the last contributor who worked on those tests to take a look.
I do think we should get retries going eventually (to automatically detect/mark flakiness), but I suspect we'll see some benefits from first figuring out how to automate a way to see which tests are failing, and testing out some ownership options for fixing them; e.g. an easily digestible dashboard plus some form of notifications (Slack/GH pings). There's some pretty low-hanging fruit already from sifting through recent e2e failures on trunk: any of these are likely flaky, since we can assume that most contributors verify that checks are green on their branch before merging. There was a related blog post by GitHub which was a decent read: https://github.blog/2020-12-16-reducing-flaky-builds-by-18x/.
When an e2e test fails intermittently, it usually means the test is bad and we should fix it. There are lots of cases where we're not appropriately waiting for a selector. Often, checking the screenshot artefact gives some good clues about what went wrong, and someone just needs to take the time to fix it.
Perfect example of a test failing when it runs too fast: Fix intermittent embeds failure.
@ellatrix I agree with all of that, but I don't think these are mutually exclusive. We should fix the intermittently failing tests, but we can also add retrying. The current problem is that contributors often get confused when there are failed tests in their PRs, having no idea whether they caused those failures. This makes them lose confidence in the PR checks, and maybe even ignore the failing tests. I suggest adding some retrying to the tests so that we can get those tests to pass in PRs, but we should also add some kind of alert to notify the right people if any of those tests fail intermittently. The latter part is still TBD, hence the reason this is still a draft PR.
Sure, it seems fine as long as we have a log somewhere of which tests have failed and how many times, with artefacts, so the data is not lost. It's sometimes also important to know when a test started failing. If we keep all this information, I'm ok with it.
I think this idea is a good complementary measure, but it does not replace the need to fix flaky tests at all. It may obscure that need if we don't surface them anymore.
I think it is better to have a central place for seeing these problems. In an ideal world, once we detect a test that is flaky (meaning it was restarted and passed) more than X times, we auto-create an issue and label it accordingly. I have no clue if this can be done, but it does not sound impossible. Notifying people is a system that only creates more notifications. All in all, the idea to auto-restart is solid: it will remove a blocker for all contributors, increase confidence in the failures (meaning the computer already "tried again", so it's probably you), and be a solution to the problem at hand, which is flakiness costing time and creating frustration.
It should be very possible, and probably not very difficult to do. We can do it via GitHub Actions and automatically create an issue for each flaky test. Whenever one is detected, we can add a new comment with when it happened, which commit, and the error message of the failed test.
The idea is to make sure each flaky test is being handled or assigned to at least one person, much like an auto-triaging system. In the GitHub post mentioned above, they recommended only tagging the person who wrote the flaky test, which doesn't seem like a bad idea IMO. A nice-to-have bonus would be a visual dashboard of all the flaky tests over time, so that we can monitor whether confidence in our tests is improving.
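As a rough illustration of that idea (not something this PR implements), a script loaded from an `actions/github-script` step after the e2e job could file or update issues from a flakiness report. The report file name, its shape, and the label below are all invented for the sketch:

```js
// Hypothetical sketch for an `actions/github-script` step, assuming the test
// runner wrote a `flaky-tests.json` report (an invented file and format).
const fs = require( 'fs' );

module.exports = async ( { github, context } ) => {
	const report = JSON.parse( fs.readFileSync( 'flaky-tests.json', 'utf8' ) );

	// Open issues already carrying the (illustrative) flaky-test label.
	const { data: openIssues } = await github.rest.issues.listForRepo( {
		...context.repo,
		state: 'open',
		labels: '[Type] Flaky Test',
	} );

	for ( const { testTitle, testPath, errorMessage } of report ) {
		const title = `[Flaky Test] ${ testTitle }`;
		const body =
			`Flaked on commit ${ context.sha } (${ new Date().toISOString() })\n\n` +
			`File: \`${ testPath }\`\n\n\`\`\`\n${ errorMessage }\n\`\`\``;
		const existing = openIssues.find( ( issue ) => issue.title === title );

		if ( existing ) {
			// Already tracked: append when and where it flaked again.
			await github.rest.issues.createComment( {
				...context.repo,
				issue_number: existing.number,
				body,
			} );
		} else {
			// First sighting: open a dedicated issue for this test.
			await github.rest.issues.create( {
				...context.repo,
				title,
				body,
				labels: [ '[Type] Flaky Test' ],
			} );
		}
	}
};
```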
My worry here is that pushing flaky tests away from the spotlight of a PR's checks, whether by auto-posting a comment on some past commit, by aggregating a list somewhere else, or what have you, is going to: decrease awareness of test flakiness; decrease its perceived severity; and foster a bystander effect by which most contributors, novice and seasoned alike, will disregard the issue entirely, "abstracting away" the problem and leaving it up to those most involved or diligent in the core team. I would rather no action be taken than merge this PR in its current form.
That said, what about the following hybrid approach? For every test that fails, we log that failure before letting Jest retry it (twice at most). If the test succeeds after retrying, it will show up as passing. However, at the end of the test suite we add a specific test whose purpose is to fail if any flakiness was logged. That way, all parties involved in that PR need to confront the failure, but now they are in a better position to diagnose it. If it is a flaky test, they can make a conscious decision to force-merge a PR which has otherwise passing tests. As a consequence, this might put a brake on the proliferation of new flaky tests.
Thoughts?
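A minimal sketch of what that "flakiness log" could look like with `jest-circus` (purely illustrative, not part of this PR): a custom test environment records every `test_retry` event into a global, and a final test fails if anything was recorded. Persisting the log across test files (e.g. writing it to disk) is left out for brevity, and the file names are made up.

```js
// flaky-test-environment.js (hypothetical file): extends the puppeteer
// environment and notes every test that jest-circus had to retry.
const PuppeteerEnvironment = require( 'jest-environment-puppeteer' );

class FlakyAwareEnvironment extends PuppeteerEnvironment {
	async handleTestEvent( event, state ) {
		if ( event.name === 'test_retry' ) {
			this.global.__FLAKY_TESTS__ = this.global.__FLAKY_TESTS__ || [];
			this.global.__FLAKY_TESTS__.push( event.test.name );
		}
		if ( super.handleTestEvent ) {
			await super.handleTestEvent( event, state );
		}
	}
}

module.exports = FlakyAwareEnvironment;
```

```js
// flakiness.test.js (hypothetical): fails the run if any retry was logged,
// so the flakiness still has to be confronted in the PR checks.
it( 'recorded no flaky tests in this run', () => {
	expect( global.__FLAKY_TESTS__ || [] ).toEqual( [] );
} );
```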
I do agree with @mcsf's point that, both by blindly retrying and by creating specific "flaky test" issues, we indirectly create a new problem for the core maintainers. Fixing tests is not "fun", and it "only" solves a generic project-wide problem, so I can foresee these issues aging there. On the other hand, many of these flaky-test issues may also be good first issues. Also, efforts by folks like @hellofromtonya to create a more stable and consistent testing team and testing focus may result in these issues being picked up and solved.
I like @mcsf's proposal because it gives the PR author a clear description of what they have to fix. Sometimes that fixing will be skipped by force-merging, but that action needs a justification. I worry that the PR author will often be far removed in expertise from the flaky test (imagine fixing a typo in a doc and being hit with a flaky e2e test from widgets). I am also afraid that we underestimate the number of requests for "force" merges, if that is what we aim for as a best practice.
I don't think either of the solutions will put a brake on the proliferation of flaky tests. These appear because the system we use to develop tests allows their flakiness to be invisible to the developer. They "proliferate" perhaps because there is a tension between the complexity we're testing and the simplicity of the tooling.
In conclusion, either of the "don't let it slide" directions (the automated issue creation and/or the flaky-tests test) works equally well towards nudging people to improve the health of the codebase, but the problem this PR tries to address is that we are probably wasting considerable time manually, blindly, annoyingly clicking a button: the "restart all jobs" button. For that problem, automating retries is a good idea, and it is better than nothing.
In my opinion, we should start by identifying the tests that are failing, the ratio of failures to passes, and classifying the reasons for the failures. Once we have a full picture of the current state of the e2e tests, we can discuss further steps. Trying to pass the same tests 3 times improves the optics for contributors, because they will see all checks green more often, but in practice it won't increase the level of confidence that the changes added in PRs won't cause regressions.
This sounds like a good plan 👍 Though I think it should only be based on the results of the tests that run on commits to trunk. PR test outcomes are often skewed by the code being a work in progress.
For anyone subscribed to this issue: I opened a follow-up draft PR as a proposal in #34432. Feel free to leave your feedback there!
Now that #34432 is merged, this becomes more feasible, right?
Isn't it an alternative approach, meaning this PR can be closed now?
Yep, this can be closed now. This PR is included in #34432.
🤦🏻 ← that's all I can say.
Description
Related to #33980.
It retries failed e2e tests at most 2 times (3 times counting the initial run). This is only possible after we migrated to `jest-circus`. This could be a controversial one, but IMO it does unblock us from some flaky tests immediately and makes our lives a bit easier when maintaining the tests.
If a test fails 3 times in a row, then we should definitely fix it. We should probably still flag any test that fails at least once though, so that we can take a look at it early.
This is only enabled in the CI environment (GitHub Actions), so that in local testing we are still alerted as soon as something fails.
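For reference, the change presumably boils down to a conditional call to Jest's `retryTimes()` (which requires the `jest-circus` runner) in the e2e setup file; the exact file path and guard below are illustrative:

```js
// packages/e2e-tests/config/setup-test-framework.js (illustrative path).
// Retry failed tests up to 2 extra times, but only on CI, so that local
// runs still fail fast on the first failure.
if ( process.env.CI ) {
	jest.retryTimes( 2 );
}
```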
How has this been tested?
Intentionally update an e2e test to fail intermittently and observe that it runs at most 3 times until it passes.
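One way to reproduce this locally is with a throwaway test that only passes on its final attempt (run with `CI=true` so the retry guard applies; not something to commit):

```js
// Module state persists across retries of the same test function, so the
// counter increments on every attempt: fails at 1 and 2, passes at 3.
let attempts = 0;

it( 'passes only on the third attempt', () => {
	attempts += 1;
	expect( attempts ).toBe( 3 );
} );
```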
Types of changes
New feature