test: wait for pods to get IPs #16001
Conversation
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: prezha

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.

/ok-to-test
ok, so following the "side note" from the above, i think i've just confirmed that the issue is not with the cri-dockerd upgrade but is actually, probably, an issue/bug introduced in kubernetes 1.26.1+: with 1.26.0 we didn't and don't have that delay issue - ie, both pods get IPs within a few seconds and the multinode tests complete w/o issues (and without this pr's changes) - as shown in the screenshot below:
kvm2 driver with docker runtime
Times for minikube start: 52.6s 53.4s 55.8s 53.2s 53.8s
Times for minikube ingress: 24.7s 26.2s 25.7s 25.2s 24.2s

docker driver with docker runtime
Times for minikube start: 26.1s 27.1s 26.9s 25.7s 28.1s
Times for minikube ingress: 21.0s 21.0s 19.6s 20.5s 21.6s

docker driver with containerd runtime
Times for minikube start: 21.9s 22.8s 22.1s 23.4s 23.2s
Times for minikube ingress: 32.6s 31.6s 31.6s 32.6s 31.6s
These are the flake rates of all failed tests. Too many tests failed - see the test logs for more details.
thanks for this PR that fixes the failed test - is this something we can add to the minikube code itself as additional verification? like, if we do minikube start --wait=all, would we need to add a new Wait component to make sure it waits for all?
@prezha what is your guess as to why these tests started to fail around the same time as the cri-dockerd change?
Oh, that's interesting. Should we create an issue on the k8s repo?
@medyagh there could be a couple of things that interfere in a bad way - not sure which one exactly atm, but yeah: the same minikube current head (ie, containing the updated cri-dockerd v0.3.1) does work with k8s v1.26.0 (set via --kubernetes-version)

you are most probably right in saying that the problem surfaced with the cri-dockerd update to v0.3.1, but at that time the k8s default version was already set to v1.26.1 - based on https://storage.googleapis.com/minikube-builds/logs/15752/27907/Docker_Linux.html#fail_TestMultiNode%2fserial%2fDeployApp2Nodes, that run had both in play

so it could be due to a combination of cri-dockerd v0.3.1 and k8s v1.26.1 (as k8s v1.26.0 works with cri-dockerd v0.3.1)
How about we try a different CNI and see if that gets past the problem of pods getting IPs late?
@medyagh i've run a number of TestMultiNode tests with docker/docker on different OSs (opensuse tumbleweed, ubuntu 20.04/22.04, macos m1), with different images (busybox, agnhost) and different CNIs (flannel, calico and also older kindnet images), and they all had a similar issue - the difference is only in how big the delay is before all pods eventually get an ip, but the test, as it is now, would fail most of the time (at times it just gets lucky and passes)

tl;dr: i think this pr is valid and could help us mitigate this test's flakiness, as it explicitly waits for all the test pods to get IPs

details: as mentioned earlier, the test pods (ie, busybox) are both marked with the "Ready" status even though the pod on the worker node did not get an ip - example snapshot attached

now, based on the aforementioned Pod Lifecycle / Pod conditions, which i interpret as: if a pod is Ready, then that would probably also mean it has to have an ip, to be able to communicate over the network and serve requests

if i'm not wrong, kubernetes gets pod/container(s) status (amongst other things) via the kubelet's (generic, atm) PLEG module that talks to the cr (cri-dockerd, in this case)

now, looking at the code, i think that cri-dockerd separates a pod's status from whether it has an ip or not - eg, as seen in the snapshot above, docker can report that a container/pod is up/running even though it does not have an ip address

it then also seems that the kubelet/pleg accepts the status response from the runtime "as-is" and does not check whether the Ready pod also has an ip assigned - eg, the updateCache func calls the runtime's GetPodStatus, then gets the pod's ip(s) (if any, but does not check!) and then just sets the status as received in the runtime's response - see line 2884 of the increased-verbosity kubelet log for the snapshot above (level 7, attached as "kubelet-m02.log" below in full)
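to illustrate, a rough, simplified paraphrase of that flow (not the actual kubelet source - type and function names here are approximations):

```go
// Simplified paraphrase of the kubelet generic-PLEG flow described above
// (not the actual kubelet source; types and names are approximate).
type PodStatus struct {
	IPs []string // ip(s) reported by the runtime - may be empty
}

type Runtime interface {
	GetPodStatus(uid, name, namespace string) (*PodStatus, error)
}

// updateCache stores whatever status the runtime (cri-dockerd here)
// returned: an empty IP list is cached as-is, with no readiness downgrade
// and no re-check until the next relist.
func updateCache(rt Runtime, cache map[string]*PodStatus, uid, name, ns string) error {
	status, err := rt.GetPodStatus(uid, name, ns)
	if err != nil {
		return err
	}
	// the pod's ip(s) are taken "if any" - len(status.IPs) is never checked
	cache[uid] = status
	return nil
}
```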
now, i'm not sure if Ready should also mean being able to communicate (i think it should), and if so, whose "responsibility" it is to check whether the pod has an ip before stating it's finally Ready: the cri runtime api defines how a pod sandbox's status (including its network status) is reported, and cri-dockerd implements that call
what we see in the cri-dockerd logs (attached below) comes from the getIPs func, which is, in turn, called when building the pod sandbox status response - and that, i think, allows a sandbox to be reported as up/running even when no ip has been assigned yet
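to make that concrete, a hypothetical sketch of the separation (not cri-dockerd's actual code): the sandbox state comes from docker's view of the container, while the ip comes from a separate, best-effort lookup

```go
// Hypothetical sketch (not the actual cri-dockerd source): sandbox state
// and sandbox ip are derived independently, so the status response can
// report a running sandbox that carries no ip at all.
type SandboxStatus struct {
	Running bool   // derived from docker's container state
	IP      string // empty if no ip has been assigned yet
}

func podSandboxStatus(running bool, ips []string) *SandboxStatus {
	s := &SandboxStatus{Running: running}
	if len(ips) > 0 { // ip is filled in "if any" - like getIPs' result
		s.IP = ips[0]
	}
	// note: no error and no "not running" downgrade when len(ips) == 0
	return s
}
```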
finally, from the cri-dockerd upgrade changes from v0.3.0 to v0.3.1, it's not obvious (amongst thousands of lines changed, including a bunch of external dependencies) whether something directly changed in the logic that worked before, so it might also be something in one of the many libraries updated in this upgrade

attached are the kubelet (-v=7) and cri-dockerd logs related to the snapshot above

bottom line: i might be wrong in the above analysis, but until this is clarified (and fixed somewhere upstream, if needed), this pr should help us reduce the TestMultiNode tests' flakiness (by waiting for test pods to get IPs)
Thank you for the analysis and report on this - would you mind sharing this investigation in the cri-dockerd issue, in a format that they could possibly benefit from?
fixes #15870
related to Mirantis/cri-dockerd#163
i could replicate the original issue with --driver=docker and --container-runtime=docker on both linux and macos: the busybox pod on the control-plane node gets an IP after ~5sec, while the other busybox pod takes significantly longer but eventually (after ~40sec) gets an IP on the second node
with this pr we retry until both pods get IPs (max 120sec) so that the multinode tests don't fail because of this delay
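for illustration, a minimal sketch of such a wait using client-go (an assumption, not the PR's actual diff - the integration tests use their own kubectl-based helpers, and waitForPodIPs is a hypothetical name):

```go
import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForPodIPs polls until every pod matching the selector reports a
// PodIP, giving up after the 120s budget - mirroring the retry this pr adds.
func waitForPodIPs(ctx context.Context, c kubernetes.Interface, ns, selector string) error {
	return wait.PollImmediate(2*time.Second, 120*time.Second, func() (bool, error) {
		pods, err := c.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{LabelSelector: selector})
		if err != nil || len(pods.Items) == 0 {
			return false, nil // transient error or pods not created yet: retry
		}
		for _, p := range pods.Items {
			if p.Status.PodIP == "" {
				return false, nil // at least one pod still has no ip
			}
		}
		return true, nil // all pods have IPs
	})
}
```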
the screenshot below, from linux (similarly on macos), shows that there's a delay between the two pods getting an IP - marked with the red rectangle

on a side note, but i think still interesting: both pods were marked as having the Ready condition while only one had an ip, which probably should not happen (based on the Pod Lifecycle), but that's potentially a separate topic - marked with the amber rectangle on the screenshot; it's also the reason why we cannot just wait for the pods to become eg Ready, but have to retry instead

![linux](https://user-images.githubusercontent.com/6320846/223855009-fa277e91-b2ed-40ff-988b-886d00a4fb65.png)
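a small hedged illustration of that mismatch with client-go types (not from the PR): readiness and ip assignment are reported independently in PodStatus, so a wait on Ready alone can return too early

```go
import corev1 "k8s.io/api/core/v1"

// readyButNoIP reports the state seen in the screenshot: the pod carries
// the Ready condition while status.podIP is still unset, so waiting on
// Ready alone would succeed before the pod can actually be reached.
func readyButNoIP(p *corev1.Pod) bool {
	ready := false
	for _, c := range p.Status.Conditions {
		if c.Type == corev1.PodReady && c.Status == corev1.ConditionTrue {
			ready = true
		}
	}
	return ready && p.Status.PodIP == ""
}
```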