Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce get/update node API calls in allocate-tunnel-addrs #9506

Merged
merged 6 commits into from
Dec 9, 2024

Conversation

liuxu623
Copy link
Contributor

@liuxu623 liuxu623 commented Nov 20, 2024

Description

Fix #9394

In our environment, we only use BGP, didn't use tunnel, we found when create/delete/update a IPPool, all calico-node will get and update node many times, it will make k8s apiserver load very high.

allocate-tunnel-addrs watch all IPPool resource in cluster, reconcileTunnelAddrs call 5 times removeHostTunnelAddr

	// If the update fails with ResourceConflict error then retry 5 times with 1 second delay before failing.
	for i := 0; i < 5; i++ {
		node, err := c.Nodes().Get(ctx, nodename, options.GetOptions{})
		if err != nil {
			logCtx.WithError(err).Fatalf("Unable to retrieve tunnel address for cleanup. Error getting node '%s'", nodename)
		}
                 ...
                // Update the node object.
		_, updateError = c.Nodes().Update(ctx, node, options.SetOptions{})
		if _, ok := updateError.(cerrors.ErrorResourceUpdateConflict); ok {
			// Wait for a second and try again if there was a conflict during the resource update.
			logCtx.WithError(err).Infof("Error updating node %s. Retrying.", node.Name)
			time.Sleep(1 * time.Second)
			continue
		}
		break
        }

I think we don't need get and update node in every removeHostTunnelAddr, we can merge all changes and update node only once.

Related issues/PRs

Todos

  • Tests
  • Documentation
  • Release note

Release Note

Reduce get/update node API calls in allocate-tunnel-addrs.

Reminder for the reviewer

Make sure that this PR has the correct labels and milestone set.

Every PR needs one docs-* label.

  • docs-pr-required: This change requires a change to the documentation that has not been completed yet.
  • docs-completed: This change has all necessary documentation completed.
  • docs-not-required: This change has no user-facing impact and requires no docs.

Every PR needs one release-note-* label.

  • release-note-required: This PR has user-facing changes. Most PRs should have this label.
  • release-note-not-required: This PR has no user-facing changes.

Other optional labels:

  • cherry-pick-candidate: This PR should be cherry-picked to an earlier release. For bug fixes only.
  • needs-operator-pr: This PR is related to install and requires a corresponding change to the operator.

@liuxu623 liuxu623 requested a review from a team as a code owner November 20, 2024 10:00
@marvin-tigera marvin-tigera added this to the Calico v3.30.0 milestone Nov 20, 2024
@marvin-tigera marvin-tigera added release-note-required Change has user-facing impact (no matter how small) docs-pr-required Change is not yet documented labels Nov 20, 2024
@liuxu623 liuxu623 changed the title Reduce get/update node API calls Reduce get/update node API calls in allocate-tunnel-addrs Nov 20, 2024
@liuxu623 liuxu623 force-pushed the reduce-update-node branch 2 times, most recently from 9afaeef to e54e725 Compare November 20, 2024 10:39
Copy link
Member

@caseydavenport caseydavenport left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In our environment, we only use BGP, didn't use tunnel, we found when create/delete/update a IPPool, all calico-node will get and update node many times, it will make k8s apiserver load very high.

Could you share some logs or watch events on the node so I can see precisely what node updates are being made?

Also, what types of IP pool changes are you making?

I am surprised that you are seeing the node updates occurring many times when changing IP pools and would like to understand it better in order to determine the correct fix.

I don't think the code in this PR is quite right as it stands - I've added some comments but this code is pretty nuanced and there are some larger restructurings I think we would need to make in order to do what you want here.

I think we don't need get and update node in every removeHostTunnelAddr, we can merge all changes and update node only once.

We do actually - if the update fails, then we need to re-query the Node otherwise any further attempts are guaranteed to fail. ErrorResourceConflict means "requery the object and try again".

If the update succeeds (which it should almost all of the time) then it should only send 1 request to the API server.

node/pkg/allocateip/allocateip.go Outdated Show resolved Hide resolved
node/pkg/allocateip/allocateip.go Outdated Show resolved Hide resolved
node/pkg/allocateip/allocateip.go Outdated Show resolved Hide resolved
@liuxu623
Copy link
Contributor Author

liuxu623 commented Nov 21, 2024

2024-11-21 07:43:30.798 [DEBUG][94] tunnel-ip-allocator/sync_client.go 594: New message from Typha. connection=&discovery.Typha{Addr:"10.251.252.18:5473", IP:"10.251.252.18", NodeName:(*string)(0xc00043b3f0)} envelope=syncproto.Envelope{Message:syncproto.MsgKVs{Num:1,KVs:[]{SerializedUpdate<Key:/calico/v1/ipam/v4/pool/127.0.0.0-26, Value:{"cidr":"127.0.0.0/26","ipip":"","ipip_mode":"","vxlan_mode":"","masquerade":false,"ipam":true,"disabled":false,"disableBGPExport":false}, Revision:229914083, TTL:0s, UpdateType:1>}}} myID=0x1 serverConnID=0x36 type="tunnel-ip-allocation"
2024-11-21 07:43:30.798 [DEBUG][94] tunnel-ip-allocator/ippool.go 79: Get Pool key from /calico/v1/ipam/v4/pool/127.0.0.0-26
2024-11-21 07:43:30.798 [DEBUG][94] tunnel-ip-allocator/sync_client.go 516: Decoded update from Typha connection=&discovery.Typha{Addr:"10.251.252.18:5473", IP:"10.251.252.18", NodeName:(*string)(0xc00043b3f0)} deserialized=api.Update{KVPair:model.KVPair{Key:model.IPPoolKey{CIDR:net.IPNet{IPNet:net.IPNet{IP:net.IP{0x7f, 0x0, 0x0, 0x0}, Mask:net.IPMask{0xff, 0xff, 0xff, 0xc0}}}}, Value:(*model.IPPool)(0xc0006259d0), Revision:"229914083", UID:(*types.UID)(nil), TTL:0}, UpdateType:0x1} myID=0x1 serialized=SerializedUpdate<Key:/calico/v1/ipam/v4/pool/127.0.0.0-26, Value:{"cidr":"127.0.0.0/26","ipip":"","ipip_mode":"","vxlan_mode":"","masquerade":false,"ipam":true,"disabled":false,"disableBGPExport":false}, Revision:229914083, TTL:0s, UpdateType:1> serverConnID=0x36 type="tunnel-ip-allocation"
2024-11-21 07:43:30.799 [DEBUG][94] tunnel-ip-allocator/dedupe_buffer.go 118: Updates received numUpdates=1
2024-11-21 07:43:30.799 [DEBUG][94] tunnel-ip-allocator/dedupe_buffer.go 125: Acquired lock queueLen=0
2024-11-21 07:43:30.799 [DEBUG][94] tunnel-ip-allocator/dedupe_buffer.go 172: No in flight value for key, adding to queue. key="/calico/v1/ipam/v4/pool/127.0.0.0-26"
2024-11-21 07:43:30.799 [DEBUG][94] tunnel-ip-allocator/dedupe_buffer.go 185: Queue transitioned to non-empty; signalling.
2024-11-21 07:43:30.799 [DEBUG][94] tunnel-ip-allocator/dedupe_buffer.go 299: Sending updates updates=[]api.Update{api.Update{KVPair:model.KVPair{Key:model.IPPoolKey{CIDR:net.IPNet{IPNet:net.IPNet{IP:net.IP{0x7f, 0x0, 0x0, 0x0}, Mask:net.IPMask{0xff, 0xff, 0xff, 0xc0}}}}, Value:(*model.IPPool)(0xc0006259d0), Revision:"229914083", UID:(*types.UID)(nil), TTL:0}, UpdateType:0x1}}
2024-11-21 07:43:30.799 [DEBUG][94] tunnel-ip-allocator/allocateip.go 186: Updated pool resource: IPPool(cidr=127.0.0.0/26)
2024-11-21 07:43:30.799 [DEBUG][94] tunnel-ip-allocator/allocateip.go 205: Stored data has been modified - trigger reconciliation
2024-11-21 07:43:30.799 [DEBUG][94] tunnel-ip-allocator/k8s.go 668: Performing 'Get' for Node(10.251.252.18)
2024-11-21 07:43:30.799 [DEBUG][94] tunnel-ip-allocator/node.go 125: Received Get request on Node type
2024-11-21 07:43:30.815 [DEBUG][94] tunnel-ip-allocator/k8s.go 683: Performing 'List' for IPPool model.ResourceListOptions
2024-11-21 07:43:30.815 [DEBUG][94] tunnel-ip-allocator/customresource.go 291: Received List request ListInterface=IPPool Resource="IPPools" Type="CustomResource"
2024-11-21 07:43:30.822 [DEBUG][94] tunnel-ip-allocator/list_pager.go 52: List() call completed, convert results ListInterface=IPPool Resource="IPPools" Type="CustomResource" pagedList=false
2024-11-21 07:43:30.822 [DEBUG][94] tunnel-ip-allocator/allocateip.go 719: Wireguard is not running on node 10.251.252.18, do not allocate a device IPv4 address
2024-11-21 07:43:30.822 [DEBUG][94] tunnel-ip-allocator/allocateip.go 571: Remove tunnel addresses node="10.251.252.18" type="wireguardTunnelAddress"
2024-11-21 07:43:30.822 [DEBUG][94] tunnel-ip-allocator/k8s.go 668: Performing 'Get' for Node(10.251.252.18)
2024-11-21 07:43:30.822 [DEBUG][94] tunnel-ip-allocator/node.go 125: Received Get request on Node type
2024-11-21 07:43:30.832 [DEBUG][94] tunnel-ip-allocator/ipam.go 1583: Releasing all IPs with handle 'wireguard-tunnel-addr-10.251.252.18'
2024-11-21 07:43:30.832 [DEBUG][94] tunnel-ip-allocator/k8s.go 668: Performing 'Get' for IPAMHandleKey(id=wireguard-tunnel-addr-10.251.252.18)
2024-11-21 07:43:30.832 [DEBUG][94] tunnel-ip-allocator/customresource.go 257: Get custom Kubernetes resource Key=IPAMHandle(wireguard-tunnel-addr-10.251.252.18) Resource="IPAMHandles" Revision=""
2024-11-21 07:43:30.832 [DEBUG][94] tunnel-ip-allocator/customresource.go 268: Get custom Kubernetes resource by name Key=IPAMHandle(wireguard-tunnel-addr-10.251.252.18) Name="wireguard-tunnel-addr-10.251.252.18" Namespace="" Resource="IPAMHandles" Revision=""
2024-11-21 07:43:30.836 [DEBUG][94] tunnel-ip-allocator/customresource.go 276: Error getting resource Key=IPAMHandle(wireguard-tunnel-addr-10.251.252.18) Name="wireguard-tunnel-addr-10.251.252.18" Namespace="" Resource="IPAMHandles" Revision="" error=ipamhandles.crd.projectcalico.org "wireguard-tunnel-addr-10.251.252.18" not found
2024-11-21 07:43:30.836 [DEBUG][94] tunnel-ip-allocator/validator.go 652: Validate IPv4 network: 10.251.252.18/26
2024-11-21 07:43:30.836 [DEBUG][94] tunnel-ip-allocator/k8s.go 593: Performing 'Update' for &{Key:Node(10.251.252.18) Value:0xc00120e700 Revision:229913853 UID:<nil> TTL:0s}
2024-11-21 07:43:30.836 [DEBUG][94] tunnel-ip-allocator/node.go 85: Received Update request on Node type
2024-11-21 07:43:30.843 [DEBUG][94] tunnel-ip-allocator/node.go 510: Loaded label annotations
2024-11-21 07:43:30.858 [DEBUG][94] tunnel-ip-allocator/allocateip.go 728: Wireguard is not running on node 10.251.252.18, do not allocate a device IPv6 address
2024-11-21 07:43:30.858 [DEBUG][94] tunnel-ip-allocator/allocateip.go 571: Remove tunnel addresses node="10.251.252.18" type="wireguardV6TunnelAddress"
2024-11-21 07:43:30.858 [DEBUG][94] tunnel-ip-allocator/k8s.go 668: Performing 'Get' for Node(10.251.252.18)
2024-11-21 07:43:30.858 [DEBUG][94] tunnel-ip-allocator/node.go 125: Received Get request on Node type
2024-11-21 07:43:30.862 [DEBUG][94] tunnel-ip-allocator/ipam.go 1583: Releasing all IPs with handle 'wireguard-v6-tunnel-addr-10.251.252.18'
2024-11-21 07:43:30.862 [DEBUG][94] tunnel-ip-allocator/k8s.go 668: Performing 'Get' for IPAMHandleKey(id=wireguard-v6-tunnel-addr-10.251.252.18)
2024-11-21 07:43:30.862 [DEBUG][94] tunnel-ip-allocator/customresource.go 257: Get custom Kubernetes resource Key=IPAMHandle(wireguard-v6-tunnel-addr-10.251.252.18) Resource="IPAMHandles" Revision=""
2024-11-21 07:43:30.862 [DEBUG][94] tunnel-ip-allocator/customresource.go 268: Get custom Kubernetes resource by name Key=IPAMHandle(wireguard-v6-tunnel-addr-10.251.252.18) Name="wireguard-v6-tunnel-addr-10.251.252.18" Namespace="" Resource="IPAMHandles" Revision=""
2024-11-21 07:43:30.867 [DEBUG][94] tunnel-ip-allocator/customresource.go 276: Error getting resource Key=IPAMHandle(wireguard-v6-tunnel-addr-10.251.252.18) Name="wireguard-v6-tunnel-addr-10.251.252.18" Namespace="" Resource="IPAMHandles" Revision="" error=ipamhandles.crd.projectcalico.org "wireguard-v6-tunnel-addr-10.251.252.18" not found
2024-11-21 07:43:30.867 [DEBUG][94] tunnel-ip-allocator/validator.go 652: Validate IPv4 network: 10.251.252.18/26
2024-11-21 07:43:30.867 [DEBUG][94] tunnel-ip-allocator/k8s.go 593: Performing 'Update' for &{Key:Node(10.251.252.18) Value:0xc0004dba40 Revision:229913853 UID:<nil> TTL:0s}
2024-11-21 07:43:30.867 [DEBUG][94] tunnel-ip-allocator/node.go 85: Received Update request on Node type
2024-11-21 07:43:30.871 [DEBUG][94] tunnel-ip-allocator/node.go 510: Loaded label annotations
2024-11-21 07:43:30.880 [DEBUG][94] tunnel-ip-allocator/allocateip.go 749: IPPool 'ipv4-pool-xxx' is not allowed for tunnel IPs.  Allowed uses: [Workload]
2024-11-21 07:43:30.880 [DEBUG][94] tunnel-ip-allocator/allocateip.go 749: IPPool 'ipv4-pool-xxx' is not allowed for tunnel IPs.  Allowed uses: [Workload]
2024-11-21 07:43:30.880 [DEBUG][94] tunnel-ip-allocator/allocateip.go 749: IPPool 'ipv4-pool-xxx' is not allowed for tunnel IPs.  Allowed uses: [Workload]
2024-11-21 07:43:30.881 [DEBUG][94] tunnel-ip-allocator/allocateip.go 749: IPPool 'ipv4-pool-xxx' is not allowed for tunnel IPs.  Allowed uses: [Workload]
2024-11-21 07:43:30.881 [DEBUG][94] tunnel-ip-allocator/allocateip.go 749: IPPool 'ipv4-pool-xxx' is not allowed for tunnel IPs.  Allowed uses: [Workload]
2024-11-21 07:43:30.881 [DEBUG][94] tunnel-ip-allocator/allocateip.go 749: IPPool 'test' is not allowed for tunnel IPs.  Allowed uses: [Workload]
2024-11-21 07:43:30.881 [DEBUG][94] tunnel-ip-allocator/allocateip.go 571: Remove tunnel addresses node="10.251.252.18" type="ipipTunnelAddress"
2024-11-21 07:43:30.881 [DEBUG][94] tunnel-ip-allocator/k8s.go 668: Performing 'Get' for Node(10.251.252.18)
2024-11-21 07:43:30.881 [DEBUG][94] tunnel-ip-allocator/node.go 125: Received Get request on Node type
2024-11-21 07:43:30.886 [DEBUG][94] tunnel-ip-allocator/ipam.go 1583: Releasing all IPs with handle 'ipip-tunnel-addr-10.251.252.18'
2024-11-21 07:43:30.886 [DEBUG][94] tunnel-ip-allocator/k8s.go 668: Performing 'Get' for IPAMHandleKey(id=ipip-tunnel-addr-10.251.252.18)
2024-11-21 07:43:30.886 [DEBUG][94] tunnel-ip-allocator/customresource.go 257: Get custom Kubernetes resource Key=IPAMHandle(ipip-tunnel-addr-10.251.252.18) Resource="IPAMHandles" Revision=""
2024-11-21 07:43:30.886 [DEBUG][94] tunnel-ip-allocator/customresource.go 268: Get custom Kubernetes resource by name Key=IPAMHandle(ipip-tunnel-addr-10.251.252.18) Name="ipip-tunnel-addr-10.251.252.18" Namespace="" Resource="IPAMHandles" Revision=""
2024-11-21 07:43:30.890 [DEBUG][94] tunnel-ip-allocator/customresource.go 276: Error getting resource Key=IPAMHandle(ipip-tunnel-addr-10.251.252.18) Name="ipip-tunnel-addr-10.251.252.18" Namespace="" Resource="IPAMHandles" Revision="" error=ipamhandles.crd.projectcalico.org "ipip-tunnel-addr-10.251.252.18" not found
2024-11-21 07:43:30.890 [DEBUG][94] tunnel-ip-allocator/validator.go 652: Validate IPv4 network: 10.251.252.18/26
2024-11-21 07:43:30.890 [DEBUG][94] tunnel-ip-allocator/k8s.go 593: Performing 'Update' for &{Key:Node(10.251.252.18) Value:0xc00120ea80 Revision:229913853 UID:<nil> TTL:0s}
2024-11-21 07:43:30.890 [DEBUG][94] tunnel-ip-allocator/node.go 85: Received Update request on Node type
2024-11-21 07:43:30.895 [DEBUG][94] tunnel-ip-allocator/node.go 510: Loaded label annotations
2024-11-21 07:43:30.904 [DEBUG][94] tunnel-ip-allocator/allocateip.go 749: IPPool 'ipv4-pool-xxx' is not allowed for tunnel IPs.  Allowed uses: [Workload]
2024-11-21 07:43:30.904 [DEBUG][94] tunnel-ip-allocator/allocateip.go 749: IPPool 'ipv4-pool-xxx' is not allowed for tunnel IPs.  Allowed uses: [Workload]
2024-11-21 07:43:30.904 [DEBUG][94] tunnel-ip-allocator/allocateip.go 749: IPPool 'ipv4-pool-xxx' is not allowed for tunnel IPs.  Allowed uses: [Workload]
2024-11-21 07:43:30.904 [DEBUG][94] tunnel-ip-allocator/allocateip.go 749: IPPool 'ipv4-pool-xxx' is not allowed for tunnel IPs.  Allowed uses: [Workload]
2024-11-21 07:43:30.904 [DEBUG][94] tunnel-ip-allocator/allocateip.go 749: IPPool 'ipv4-pool-xxx' is not allowed for tunnel IPs.  Allowed uses: [Workload]
2024-11-21 07:43:30.904 [DEBUG][94] tunnel-ip-allocator/allocateip.go 749: IPPool 'test' is not allowed for tunnel IPs.  Allowed uses: [Workload]
2024-11-21 07:43:30.904 [DEBUG][94] tunnel-ip-allocator/allocateip.go 571: Remove tunnel addresses node="10.251.252.18" type="vxlanTunnelAddress"
2024-11-21 07:43:30.904 [DEBUG][94] tunnel-ip-allocator/k8s.go 668: Performing 'Get' for Node(10.251.252.18)
2024-11-21 07:43:30.904 [DEBUG][94] tunnel-ip-allocator/node.go 125: Received Get request on Node type
2024-11-21 07:43:30.909 [DEBUG][94] tunnel-ip-allocator/ipam.go 1583: Releasing all IPs with handle 'vxlan-tunnel-addr-10.251.252.18'
2024-11-21 07:43:30.909 [DEBUG][94] tunnel-ip-allocator/k8s.go 668: Performing 'Get' for IPAMHandleKey(id=vxlan-tunnel-addr-10.251.252.18)
2024-11-21 07:43:30.909 [DEBUG][94] tunnel-ip-allocator/customresource.go 257: Get custom Kubernetes resource Key=IPAMHandle(vxlan-tunnel-addr-10.251.252.18) Resource="IPAMHandles" Revision=""
2024-11-21 07:43:30.909 [DEBUG][94] tunnel-ip-allocator/customresource.go 268: Get custom Kubernetes resource by name Key=IPAMHandle(vxlan-tunnel-addr-10.251.252.18) Name="vxlan-tunnel-addr-10.251.252.18" Namespace="" Resource="IPAMHandles" Revision=""
2024-11-21 07:43:30.913 [DEBUG][94] tunnel-ip-allocator/customresource.go 276: Error getting resource Key=IPAMHandle(vxlan-tunnel-addr-10.251.252.18) Name="vxlan-tunnel-addr-10.251.252.18" Namespace="" Resource="IPAMHandles" Revision="" error=ipamhandles.crd.projectcalico.org "vxlan-tunnel-addr-10.251.252.18" not found
2024-11-21 07:43:30.913 [DEBUG][94] tunnel-ip-allocator/validator.go 652: Validate IPv4 network: 10.251.252.18/26
2024-11-21 07:43:30.913 [DEBUG][94] tunnel-ip-allocator/k8s.go 593: Performing 'Update' for &{Key:Node(10.251.252.18) Value:0xc001212c40 Revision:229913853 UID:<nil> TTL:0s}
2024-11-21 07:43:30.913 [DEBUG][94] tunnel-ip-allocator/node.go 85: Received Update request on Node type
2024-11-21 07:43:30.917 [DEBUG][94] tunnel-ip-allocator/node.go 510: Loaded label annotations
2024-11-21 07:43:30.927 [DEBUG][94] tunnel-ip-allocator/allocateip.go 749: IPPool 'ipv4-pool-xxx' is not allowed for tunnel IPs.  Allowed uses: [Workload]
2024-11-21 07:43:30.927 [DEBUG][94] tunnel-ip-allocator/allocateip.go 749: IPPool 'ipv4-pool-xxx' is not allowed for tunnel IPs.  Allowed uses: [Workload]
2024-11-21 07:43:30.927 [DEBUG][94] tunnel-ip-allocator/allocateip.go 749: IPPool 'ipv4-pool-xxx' is not allowed for tunnel IPs.  Allowed uses: [Workload]
2024-11-21 07:43:30.927 [DEBUG][94] tunnel-ip-allocator/allocateip.go 749: IPPool 'ipv4-pool-xxx' is not allowed for tunnel IPs.  Allowed uses: [Workload]
2024-11-21 07:43:30.927 [DEBUG][94] tunnel-ip-allocator/allocateip.go 749: IPPool 'ipv4-pool-xxx' is not allowed for tunnel IPs.  Allowed uses: [Workload]
2024-11-21 07:43:30.927 [DEBUG][94] tunnel-ip-allocator/allocateip.go 749: IPPool 'test' is not allowed for tunnel IPs.  Allowed uses: [Workload]
2024-11-21 07:43:30.927 [DEBUG][94] tunnel-ip-allocator/allocateip.go 571: Remove tunnel addresses node="10.251.252.18" type="vxlanV6TunnelAddress"
2024-11-21 07:43:30.927 [DEBUG][94] tunnel-ip-allocator/k8s.go 668: Performing 'Get' for Node(10.251.252.18)
2024-11-21 07:43:30.927 [DEBUG][94] tunnel-ip-allocator/node.go 125: Received Get request on Node type
2024-11-21 07:43:30.932 [DEBUG][94] tunnel-ip-allocator/ipam.go 1583: Releasing all IPs with handle 'vxlan-v6-tunnel-addr-10.251.252.18'
2024-11-21 07:43:30.932 [DEBUG][94] tunnel-ip-allocator/k8s.go 668: Performing 'Get' for IPAMHandleKey(id=vxlan-v6-tunnel-addr-10.251.252.18)
2024-11-21 07:43:30.932 [DEBUG][94] tunnel-ip-allocator/customresource.go 257: Get custom Kubernetes resource Key=IPAMHandle(vxlan-v6-tunnel-addr-10.251.252.18) Resource="IPAMHandles" Revision=""
2024-11-21 07:43:30.932 [DEBUG][94] tunnel-ip-allocator/customresource.go 268: Get custom Kubernetes resource by name Key=IPAMHandle(vxlan-v6-tunnel-addr-10.251.252.18) Name="vxlan-v6-tunnel-addr-10.251.252.18" Namespace="" Resource="IPAMHandles" Revision=""
2024-11-21 07:43:30.936 [DEBUG][94] tunnel-ip-allocator/customresource.go 276: Error getting resource Key=IPAMHandle(vxlan-v6-tunnel-addr-10.251.252.18) Name="vxlan-v6-tunnel-addr-10.251.252.18" Namespace="" Resource="IPAMHandles" Revision="" error=ipamhandles.crd.projectcalico.org "vxlan-v6-tunnel-addr-10.251.252.18" not found
2024-11-21 07:43:30.936 [DEBUG][94] tunnel-ip-allocator/validator.go 652: Validate IPv4 network: 10.251.252.18/26
2024-11-21 07:43:30.936 [DEBUG][94] tunnel-ip-allocator/k8s.go 593: Performing 'Update' for &{Key:Node(10.251.252.18) Value:0xc000f5ca80 Revision:229913853 UID:<nil> TTL:0s}
2024-11-21 07:43:30.936 [DEBUG][94] tunnel-ip-allocator/node.go 85: Received Update request on Node type
2024-11-21 07:43:30.941 [DEBUG][94] tunnel-ip-allocator/node.go 510: Loaded label annotations

@liuxu623
Copy link
Contributor Author

liuxu623 commented Nov 21, 2024

I found there is no way to config tunnel-ip-allocator log level for now, and I modify the code set log level to Debug.

Create a IPPool to reproduce the problem

apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: test
spec:
  allowedUses:
  - Workload
  blockSize: 27
  cidr: 127.0.0.0/26
  ipipMode: Never
  nodeSelector: kubernetes.io/hostname=='127.0.0.1'
  vxlanMode: Never

We can see the Performing 'Get' for Node(xxx) appeared 6 times,
and Performing 'Update' for xxx appeared 5 times in above log.

https://github.com/projectcalico/calico/blob/v3.29.0/libcalico-go/lib/backend/k8s/resources/node.go#L87
In fact, k8s backend will get node before each update status.

So if a IPPool changed(Create/Update/Delete), tunnel-ip-allocator will call Nodes().Get() 11 times and Nodes().UpdateStatus() 5 times.

If there are 1000 nodes in the cluster, each IPPool change will performed 11000 get operations and 5000 update operations, these operations almost at the same time.This will put a lot of pressure on the apiserver.

@liuxu623
Copy link
Contributor Author

liuxu623 commented Nov 21, 2024

Could you share some logs or watch events on the node so I can see precisely what node updates are being made?

Also, what types of IP pool changes are you making?

I am surprised that you are seeing the node updates occurring many times when changing IP pools and would like to understand it better in order to determine the correct fix.

In our environments, we implement a IPAM controller, each node may use one or more IPPools(it depends how many pods on the node), each IPPool only select one node. The IPAM controller will create IPPool if the node join the cluster, and delete IPPool if the node leave the cluster.
IPPool changes frequently when the cluster scaling.

@liuxu623 liuxu623 force-pushed the reduce-update-node branch 2 times, most recently from a88c68f to 0153c9e Compare November 23, 2024 17:59
@liuxu623
Copy link
Contributor Author

liuxu623 commented Nov 23, 2024

  • Query IPAM data once at the start of the reconcile.
  • Use that IPAM data to determine which addresses are currently allocated for this node.
  • Assign / release addresses as needed to satisfy the desired state.
  • Update the Node in-memory with the currently allocated set of IP addresses.
  • If !DeepEqual, Update() the Node in the API.
  • If retErr != nil release all assigned IP addresses.

@caseydavenport I refactored the code according to the above logic,PTAL

@caseydavenport
Copy link
Member

https://github.com/projectcalico/calico/blob/v3.29.0/libcalico-go/lib/backend/k8s/resources/node.go#L87
In fact, k8s backend will get node before each update status.

I think we could, in theory, refactor some of this code to natively use the Kubernetes Node API - this would avoid us needing to do this extra Get().

It's a larger change than I want to take on in this PR though, so let's plan on getting this merged and then considering follow-up PRs to make further improvements.

@caseydavenport
Copy link
Member

/sem-approve

@caseydavenport caseydavenport added the docs-not-required Docs not required for this change label Nov 27, 2024
@marvin-tigera marvin-tigera removed the docs-pr-required Change is not yet documented label Nov 27, 2024
@caseydavenport
Copy link
Member

@liuxu623 looks like static checks are failing - you should be able to run make fix-changed to clean up the code formatting.

@liuxu623
Copy link
Contributor Author

liuxu623 commented Nov 28, 2024

@liuxu623 looks like static checks are failing - you should be able to run make fix-changed to clean up the code formatting.

done

@caseydavenport
Copy link
Member

/sem-approve

@caseydavenport caseydavenport self-assigned this Dec 3, 2024
@liuxu623
Copy link
Contributor Author

liuxu623 commented Dec 3, 2024

@caseydavenport Sorry, I forgot update test file before.

@coutinhop
Copy link
Member

/sem-approve

@coutinhop
Copy link
Member

@liuxu623 since you added error returns to a few functions, static checks are now failing for the previous calls which do not check them, for example:

pkg/allocateip/allocateip.go:91:23: Error return value is not checked (errcheck) 02:21
		reconcileTunnelAddrs(nodename, c, felixEnvConfig)

Could you please update the calls?

@liuxu623
Copy link
Contributor Author

liuxu623 commented Dec 4, 2024

@liuxu623 since you added error returns to a few functions, static checks are now failing for the previous calls which do not check them, for example:

pkg/allocateip/allocateip.go:91:23: Error return value is not checked (errcheck) 02:21
		reconcileTunnelAddrs(nodename, c, felixEnvConfig)

Could you please update the calls?

@coutinhop fixed, thank you for your remind.

@coutinhop
Copy link
Member

/sem-approve

@liuxu623
Copy link
Contributor Author

liuxu623 commented Dec 5, 2024

@coutinhop The test is fixed, I want to ask how to run the test locally, I tried to run the test but encountered operation Create is not supports on Node(test.node) error.

@caseydavenport
Copy link
Member

/sem-approve

@liuxu623
Copy link
Contributor Author

liuxu623 commented Dec 5, 2024

@caseydavenport empty address test case fixed.

@caseydavenport
Copy link
Member

/sem-approve

Copy link
Member

@caseydavenport caseydavenport left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM - just a minor comment about a couple of ignored errors.

node/pkg/allocateip/allocateip.go Outdated Show resolved Hide resolved
node/pkg/allocateip/allocateip.go Outdated Show resolved Hide resolved
node/pkg/allocateip/allocateip.go Outdated Show resolved Hide resolved
@caseydavenport
Copy link
Member

/sem-approve

@caseydavenport
Copy link
Member

Thanks @liuxu623 ! Going to merge this now. let's see how it does :)

@caseydavenport caseydavenport merged commit 5d7c855 into projectcalico:master Dec 9, 2024
3 checks passed
@liuxu623 liuxu623 deleted the reduce-update-node branch December 10, 2024 01:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
docs-not-required Docs not required for this change release-note-required Change has user-facing impact (no matter how small)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

add CALICO_DISABLE_TUNNEL env
4 participants