bootstrapper: prioritize etcd disk I/O #3114

Merged
msanft merged 7 commits into main from feat/bootstrapper/etcd-io-prio on May 22, 2024

@msanft msanft commented May 21, 2024

Context

etcd errors such as the notorious "etcdserver: leader changed" are most likely caused by etcd's disk activity competing with other disk-heavy operations during cluster initialization or joining, such as pulling container images. To mitigate this, we can prioritize the etcd process's disk operations over those of other processes by assigning it a higher I/O priority in the kernel. This change does not handle etcd pod restarts, which are out of scope for this PR.

How to move forward:
We'll need to observe (over, say, two weeks) whether this change is enough to get rid of the issues. If not, we'll probably need to use faster disks by default on Azure. This would likely mean switching to the 512 GiB P20 disk (~90 €/month), which satisfies etcd's requirements. That switch should be tested in a separate period so that the effect of each change can be measured on its own.

Proposed change(s)

  • Upon initializing, joining, or re-joining a Constellation cluster, find the etcd process on the node and set its I/O priority to the highest possible value, i.e. the real-time (RT) class with a priority level of 0 (the highest).

Additional info

Checklist

  • Run the E2E tests that are relevant to this PR's changes
  • Add labels (e.g., for changelog category)
  • Is PR title adequate for changelog?
  • Link to Milestone

@msanft msanft added this to the v2.17.0 milestone May 21, 2024
@msanft msanft requested review from burgerdev and daniel-weisse May 21, 2024 14:26
@msanft msanft requested a review from 3u13r as a code owner May 21, 2024 14:26

netlify bot commented May 21, 2024

Deploy Preview for constellation-docs canceled.

Name Link
🔨 Latest commit 451d51d
🔍 Latest deploy log https://app.netlify.com/sites/constellation-docs/deploys/664dde011bc687000865cde2

@msanft msanft force-pushed the feat/bootstrapper/etcd-io-prio branch from 11d3bcc to af411ae Compare May 22, 2024 08:41
@msanft msanft requested a review from burgerdev May 22, 2024 08:48
@daniel-weisse
Member

Let's get this merged so we can see whether it fixes things during our daily and weekly e2e tests.

@msanft msanft requested review from daniel-weisse and burgerdev May 22, 2024 09:31
@msanft msanft force-pushed the feat/bootstrapper/etcd-io-prio branch from e5e8166 to a1f883f Compare May 22, 2024 09:51
@msanft msanft requested a review from daniel-weisse May 22, 2024 11:58

Coverage report

| Package | Old | New | Trend |
| --- | --- | --- | --- |
| bootstrapper/cmd/bootstrapper | 0.00% | 0.00% | 🚧 |
| bootstrapper/internal/etcdio | 0.00% | 0.00% | 🆕 |
| bootstrapper/internal/initserver | 22.80% | 22.80% | 🚧 |
| bootstrapper/internal/joinclient | 39.70% | 39.70% | 🚧 |
| bootstrapper/internal/kubernetes | 46.40% | 38.30% | ↘️ |

@msanft msanft merged commit 9c100a5 into main May 22, 2024
17 checks passed
@msanft msanft deleted the feat/bootstrapper/etcd-io-prio branch May 22, 2024 14:12