Fix OOM slice removal race #2353

Merged: 1 commit, Aug 2, 2024

saschagrunert
Member

What type of PR is this?

/kind bug

What this PR does / why we need it:

If the slice has already been removed, we mostly encounter one of two errors:

  • get next line: No such device (os error 19)
  • open memory events file: /sys/fs/cgroup/test.slice/crio-$ID.scope/memory.events: No such file or directory (os error 2)

To avoid this race, we now check after either error whether the file still exists. If it does not, we assume an OOM.
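The check described above can be sketched roughly as follows. This is an illustration only, not the code from this PR: the `OomCheck` type and the `check_oom` and `oom_kill_count` helpers are hypothetical names, and the sketch assumes that an OOM is otherwise detected through a non-zero `oom_kill` counter in the cgroup v2 `memory.events` file.

```rust
use std::{fs, io, path::Path};

/// Hypothetical result type for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum OomCheck {
    /// The file was readable and `oom_kill` is zero or absent.
    NoOom,
    /// `oom_kill` is non-zero, or the file vanished while we tried to read it.
    Oom,
}

/// Hypothetical helper: inspect the `memory.events` file at `path`.
///
/// If reading fails (for example with ENOENT or ENODEV) and the file no
/// longer exists, the cgroup slice is assumed to have been removed after an
/// OOM kill. Any other read error is returned to the caller.
pub fn check_oom(path: &Path) -> io::Result<OomCheck> {
    match fs::read_to_string(path) {
        Ok(content) if oom_kill_count(&content) > 0 => Ok(OomCheck::Oom),
        Ok(_) => Ok(OomCheck::NoOom),
        // The race: the slice was removed between the event and our read.
        Err(_) if !path.exists() => Ok(OomCheck::Oom),
        Err(e) => Err(e),
    }
}

/// Parse the `oom_kill <n>` line; a missing or malformed line counts as 0.
fn oom_kill_count(content: &str) -> u64 {
    content
        .lines()
        .filter_map(|line| line.strip_prefix("oom_kill "))
        .filter_map(|value| value.trim().parse().ok())
        .next()
        .unwrap_or(0)
}
```

A caller would pass the path of the container's `memory.events` file. Treating a vanished file as an OOM is the assumption stated above; a file that still exists but cannot be read remains an ordinary error.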

Which issue(s) this PR fixes:

Fixes cri-o/cri-o#8411

Special notes for your reviewer:

None

Does this PR introduce a user-facing change?

Fixed a race on slice removal that could cause an out-of-memory event to go unreported.

Contributor

openshift-ci bot commented Aug 1, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: saschagrunert

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@codecov-commenter

codecov-commenter commented Aug 1, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 37.34%. Comparing base (4e0f474) to head (3b0829d).
Report is 644 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2353      +/-   ##
==========================================
- Coverage   37.53%   37.34%   -0.20%     
==========================================
  Files          15       15              
  Lines        1268     1264       -4     
  Branches      414      420       +6     
==========================================
- Hits          476      472       -4     
+ Misses        526      524       -2     
- Partials      266      268       +2     

@saschagrunert saschagrunert mentioned this pull request Aug 1, 2024
@saschagrunert
Member Author

@haircommander @rphillips PTAL

@saschagrunert saschagrunert force-pushed the oom branch 3 times, most recently from 9cad604 to bbca1a0 Compare August 2, 2024 08:23
@saschagrunert saschagrunert changed the title Fix OOM slice removal race WIP: Fix OOM slice removal race Aug 2, 2024
@saschagrunert saschagrunert force-pushed the oom branch 2 times, most recently from 1018592 to d6d9a93 Compare August 2, 2024 09:03
@saschagrunert saschagrunert changed the title WIP: Fix OOM slice removal race Fix OOM slice removal race Aug 2, 2024
@saschagrunert saschagrunert force-pushed the oom branch 2 times, most recently from ff0615e to db24515 Compare August 2, 2024 09:32
If the slice is already removed then we mostly encounter two different
errors:

- `get next line: No such device (os error 19)`
- `open memory events file: /sys/fs/cgroup/test.slice/crio-$ID.scope/memory.events: No such file or directory (os error 2)`

To avoid such a race we now check after the errors if the file still
exists. If not, then we assume an OOM.

Signed-off-by: Sascha Grunert <[email protected]>
@rphillips
Collaborator

Nice!
/lgtm

@openshift-ci openshift-ci bot added the lgtm label Aug 2, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit 7a44dfe into containers:main Aug 2, 2024
33 checks passed
@saschagrunert saschagrunert deleted the oom branch August 2, 2024 13:52
Development

Successfully merging this pull request may close these issues.

cri-tools (critest) case "runtime should output OOMKilled reason" flakes when being fixed