-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reduce get/update node API calls in allocate-tunnel-addrs #9506
Reduce get/update node API calls in allocate-tunnel-addrs #9506
Conversation
9afaeef
to
e54e725
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In our environment, we only use BGP, didn't use tunnel, we found when create/delete/update a IPPool, all calico-node will get and update node many times, it will make k8s apiserver load very high.
Could you share some logs or watch events on the node so I can see precisely what node updates are being made?
Also, what types of IP pool changes are you making?
I am surprised that you are seeing the node updates occurring many times when changing IP pools and would like to understand it better in order to determine the correct fix.
I don't think the code in this PR is quite right as it stands - I've added some comments but this code is pretty nuanced and there are some larger restructurings I think we would need to make in order to do what you want here.
I think we don't need get and update node in every removeHostTunnelAddr, we can merge all changes and update node only once.
We do actually - if the update fails, then we need to re-query the Node otherwise any further attempts are guaranteed to fail. ErrorResourceConflict means "requery the object and try again".
If the update succeeds (which it should almost all of the time) then it should only send 1 request to the API server.
e54e725
to
0c45402
Compare
0c45402
to
26893b0
Compare
|
I found there is no way to config tunnel-ip-allocator log level for now, and I modify the code set log level to Debug. Create a IPPool to reproduce the problem apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
name: test
spec:
allowedUses:
- Workload
blockSize: 27
cidr: 127.0.0.0/26
ipipMode: Never
nodeSelector: kubernetes.io/hostname=='127.0.0.1'
vxlanMode: Never We can see the https://github.com/projectcalico/calico/blob/v3.29.0/libcalico-go/lib/backend/k8s/resources/node.go#L87 So if a IPPool changed(Create/Update/Delete), tunnel-ip-allocator will call Nodes().Get() 11 times and Nodes().UpdateStatus() 5 times. If there are 1000 nodes in the cluster, each IPPool change will performed 11000 get operations and 5000 update operations, these operations almost at the same time.This will put a lot of pressure on the apiserver. |
In our environments, we implement a IPAM controller, each node may use one or more IPPools(it depends how many pods on the node), each IPPool only select one node. The IPAM controller will create IPPool if the node join the cluster, and delete IPPool if the node leave the cluster. |
a88c68f
to
0153c9e
Compare
@caseydavenport I refactored the code according to the above logic,PTAL |
I think we could, in theory, refactor some of this code to natively use the Kubernetes Node API - this would avoid us needing to do this extra Get(). It's a larger change than I want to take on in this PR though, so let's plan on getting this merged and then considering follow-up PRs to make further improvements. |
/sem-approve |
@liuxu623 looks like static checks are failing - you should be able to run |
0153c9e
to
b38fba2
Compare
done |
/sem-approve |
@caseydavenport Sorry, I forgot update test file before. |
/sem-approve |
@liuxu623 since you added error returns to a few functions, static checks are now failing for the previous calls which do not check them, for example:
Could you please update the calls? |
@coutinhop fixed, thank you for your remind. |
/sem-approve |
99e7490
to
da59682
Compare
@coutinhop The test is fixed, I want to ask how to run the test locally, I tried to run the test but encountered |
/sem-approve |
@caseydavenport empty address test case fixed. |
/sem-approve |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This LGTM - just a minor comment about a couple of ignored errors.
/sem-approve |
Thanks @liuxu623 ! Going to merge this now. let's see how it does :) |
Description
Fix #9394
In our environment, we only use BGP, didn't use tunnel, we found when create/delete/update a IPPool, all calico-node will get and update node many times, it will make k8s apiserver load very high.
allocate-tunnel-addrs watch all IPPool resource in cluster, reconcileTunnelAddrs call 5 times removeHostTunnelAddr
I think we don't need get and update node in every removeHostTunnelAddr, we can merge all changes and update node only once.
Related issues/PRs
Todos
Release Note
Reminder for the reviewer
Make sure that this PR has the correct labels and milestone set.
Every PR needs one
docs-*
label.docs-pr-required
: This change requires a change to the documentation that has not been completed yet.docs-completed
: This change has all necessary documentation completed.docs-not-required
: This change has no user-facing impact and requires no docs.Every PR needs one
release-note-*
label.release-note-required
: This PR has user-facing changes. Most PRs should have this label.release-note-not-required
: This PR has no user-facing changes.Other optional labels:
cherry-pick-candidate
: This PR should be cherry-picked to an earlier release. For bug fixes only.needs-operator-pr
: This PR is related to install and requires a corresponding change to the operator.