
OCPBUGS-48320,SDN-4930: Increase probe timeouts on UDN pod #29458

Merged: 1 commit merged into openshift:master on Jan 23, 2025

Conversation

tssurya
Contributor

@tssurya tssurya commented Jan 21, 2025

This PR increases the liveness and readiness probe failure threshold to 3, and also increases each probe's timeout to 3 seconds instead of the default of 1 second, which is quite aggressive.

We have seen failures of the following pattern:

Pod event: Type=Warning Reason=Unhealthy Message=Liveness probe failed: Get "http://[fd01:0:0:5::2ed]:9000/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers) LastTimestamp=2025-01-21 15:16:43 +0000 UTC Count=1
Pod event: Type=Normal Reason=Killing Message=Container agnhost-container failed liveness probe, will be restarted LastTimestamp=2025-01-21 15:16:43 +0000 UTC Count=1
Pod event: Type=Warning Reason=Unhealthy Message=Readiness probe failed: Get "http://[fd01:0:0:5::2ed]:9000/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers) LastTimestamp=2025-01-21 15:16:43 +0000 UTC Count=1
Pod event: Type=Warning Reason=Unhealthy Message=Readiness probe failed: Get "http://[fd01:0:0:5::2ed]:9000/healthz": read tcp [fd01:0:0:5::2]:33400->[fd01:0:0:5::2ed]:9000: read: connection reset by peer LastTimestamp=2025-01-21 15:16:43 +0000 UTC Count=1

where, seemingly at random, after 15 or 30 seconds the liveness probe times out waiting for headers. So we at least know the TCP connection was established, but kubelet did not receive the HTTP headers within the 1-second timeout; it is hard to tell why that is the case. With the timeout increased to 3 seconds, this PR has so far not hit this flake in CI even once.

We think increasing the failure threshold to 3 is safer on OCP, just as we already do for startup probes.

  1. https://prow.ci.openshift.org/view/gs/test-platform-results/logs/openshift-origin-29424-nightly-4.19-e2e-metal-ipi-ovn-ipv6-techpreview/1881666601044938752
  2. https://prow.ci.openshift.org/view/gs/test-platform-results/logs/openshift-origin-29424-nightly-4.19-e2e-azure-ovn-runc-techpreview/1881591233504088064
  3. https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_ovn-kubernetes/2314/pull-ci-openshift-ovn-kubernetes-master-e2e-gcp-ovn-techpreview/1849904878739001344
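For reference, the knobs being changed here correspond to the Kubernetes probe fields timeoutSeconds and failureThreshold. A rough sketch of the effect on worst-case detection latency (Python; the periodSeconds value of 10 is an assumed illustrative value, and the old values of 1 are taken from the description above, not from the diff):

```python
def detection_latency(period_s: int, timeout_s: int, failure_threshold: int) -> int:
    """Approximate worst-case time from the first failing probe attempt
    to the container being marked unhealthy: the last of
    failure_threshold consecutive probes is sent (failure_threshold - 1)
    periods after the first, and can take up to timeout_s to fail."""
    return (failure_threshold - 1) * period_s + timeout_s

# Old settings implied by the description (timeout 1s, threshold 1);
# periodSeconds=10 is an assumed value for illustration only.
old = detection_latency(period_s=10, timeout_s=1, failure_threshold=1)
# New settings from this PR (timeout 3s, threshold 3).
new = detection_latency(period_s=10, timeout_s=3, failure_threshold=3)
print(old, new)  # 1 23
```

Under these assumed values, the change trades roughly 20 seconds of extra detection latency for tolerance of transient probe slowness, which is exactly the flake pattern seen in the events above.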

@openshift-ci bot added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) on Jan 21, 2025
@tssurya
Contributor Author

tssurya commented Jan 21, 2025

/payload-job periodic-ci-openshift-release-master-nightly-4.19-e2e-vsphere-ovn-techpreview
/payload-job periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-ovn-ipv6-techpreview
/payload-job periodic-ci-openshift-release-master-nightly-4.19-e2e-azure-ovn-runc-techpreview
/payload-job periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-ovn-dualstack-techpreview

@tssurya
Contributor Author

tssurya commented Jan 21, 2025

/test e2e-gcp-ovn-techpreview

Contributor

openshift-ci bot commented Jan 21, 2025

@tssurya: trigger 4 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.19-e2e-vsphere-ovn-techpreview
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-ovn-ipv6-techpreview
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-azure-ovn-runc-techpreview
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-ovn-dualstack-techpreview

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/422b3d50-d809-11ef-9095-911f8eab4ccc-0

@openshift-ci openshift-ci bot requested review from danwinship and knobunc January 21, 2025 15:08
@tssurya
Contributor Author

tssurya commented Jan 21, 2025

/payload-job periodic-ci-openshift-release-master-nightly-4.19-e2e-vsphere-ovn-techpreview
/payload-job periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-ovn-ipv6-techpreview
/payload-job periodic-ci-openshift-release-master-nightly-4.19-e2e-azure-ovn-runc-techpreview
/payload-job periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-ovn-dualstack-techpreview

Contributor

openshift-ci bot commented Jan 21, 2025

@tssurya: trigger 4 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.19-e2e-vsphere-ovn-techpreview
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-ovn-ipv6-techpreview
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-azure-ovn-runc-techpreview
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-ovn-dualstack-techpreview

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/9f9edb10-d80c-11ef-9b1f-9851ec981cd1-0

@tssurya
Contributor Author

tssurya commented Jan 21, 2025

/payload-job periodic-ci-openshift-release-master-nightly-4.19-e2e-vsphere-ovn-techpreview
/payload-job periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-ovn-ipv6-techpreview
/payload-job periodic-ci-openshift-release-master-nightly-4.19-e2e-azure-ovn-runc-techpreview
/payload-job periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-ovn-dualstack-techpreview

Contributor

openshift-ci bot commented Jan 21, 2025

@tssurya: trigger 4 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.19-e2e-vsphere-ovn-techpreview
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-ovn-ipv6-techpreview
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-azure-ovn-runc-techpreview
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-ovn-dualstack-techpreview

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/d44a8ca0-d830-11ef-9237-b435d7189f1c-0

@tssurya force-pushed the increase-probe-timeouts branch 2 times, most recently from 3097cee to 2d3339e, on January 22, 2025 13:01
@tssurya
Contributor Author

tssurya commented Jan 22, 2025

/payload-job periodic-ci-openshift-release-master-nightly-4.19-e2e-vsphere-ovn-techpreview
/payload-job periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-ovn-ipv6-techpreview
/payload-job periodic-ci-openshift-release-master-nightly-4.19-e2e-azure-ovn-runc-techpreview
/payload-job periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-ovn-dualstack-techpreview

1 similar comment
@tssurya
Contributor Author

tssurya commented Jan 22, 2025

/payload-job periodic-ci-openshift-release-master-nightly-4.19-e2e-vsphere-ovn-techpreview
/payload-job periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-ovn-ipv6-techpreview
/payload-job periodic-ci-openshift-release-master-nightly-4.19-e2e-azure-ovn-runc-techpreview
/payload-job periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-ovn-dualstack-techpreview

Contributor

openshift-ci bot commented Jan 22, 2025

@tssurya: trigger 4 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.19-e2e-vsphere-ovn-techpreview
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-ovn-ipv6-techpreview
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-azure-ovn-runc-techpreview
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-ovn-dualstack-techpreview

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/f7edf6b0-d8c1-11ef-9154-d5ab7bd9ab9c-0

@tssurya
Contributor Author

tssurya commented Jan 22, 2025

/payload-job periodic-ci-openshift-release-master-nightly-4.19-e2e-vsphere-ovn-techpreview
/payload-job periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-ovn-ipv6-techpreview
/payload-job periodic-ci-openshift-release-master-nightly-4.19-e2e-azure-ovn-runc-techpreview
/payload-job periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-ovn-dualstack-techpreview

Contributor

openshift-ci bot commented Jan 22, 2025

@tssurya: trigger 4 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.19-e2e-vsphere-ovn-techpreview
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-ovn-ipv6-techpreview
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-azure-ovn-runc-techpreview
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-ovn-dualstack-techpreview

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/fa0f01f0-d8c1-11ef-906b-6dc76be18a87-0

Contributor

openshift-ci bot commented Jan 22, 2025

@tssurya: trigger 4 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.19-e2e-vsphere-ovn-techpreview
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-ovn-ipv6-techpreview
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-azure-ovn-runc-techpreview
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-ovn-dualstack-techpreview

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/0017a7a0-d8c2-11ef-9257-d12072dbf52d-0

@tssurya changed the title from "[WIP] Increase probe timeouts on UDN pod" to "Increase probe timeouts on UDN pod" on Jan 22, 2025
@openshift-ci bot removed the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) on Jan 22, 2025
@tssurya
Contributor Author

tssurya commented Jan 22, 2025

/label acknowledge-critical-fixes-only

@openshift-ci bot added the acknowledge-critical-fixes-only label (indicates the issuer of the label is OK with the policy) on Jan 22, 2025
@tssurya
Contributor Author

tssurya commented Jan 22, 2025

I have vetted 12 runs on this PR; all look good, with no sign of the restart issue.


openshift-trt bot commented Jan 22, 2025

Job Failure Risk Analysis for sha: 2d3339e

Job Name: Failure Risk

  • pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-serial: Medium
    [sig-imageregistry][Serial] Image signature workflow can push a signed image to openshift registry and verify it [apigroup:user.openshift.io][apigroup:image.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/serial]
    This test has passed 95.84% of 409 runs on release 4.19 [Overall] in the last week.

  • pull-ci-openshift-origin-master-e2e-aws-ovn-serial: Medium
    [sig-imageregistry][Serial] Image signature workflow can push a signed image to openshift registry and verify it [apigroup:user.openshift.io][apigroup:image.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/serial]
    This test has passed 95.84% of 409 runs on release 4.19 [Overall] in the last week.

  • pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-upgrade: Low
    [sig-node] static pods should start after being created
    This test has passed 77.46% of 213 runs on release 4.19 [Architecture:amd64 FeatureSet:default Installer:ipi Network:ovn NetworkStack:ipv4 Platform:aws SecurityMode:default Topology:single Upgrade:micro] in the last week.
    Open Bugs: Static pod controller pods sometimes fail to start [etcd]
    ---
    [sig-node] static pods should start after being created
    This test has passed 76.85% of 216 runs on release 4.19 [Architecture:amd64 FeatureSet:default Installer:ipi Network:ovn NetworkStack:ipv4 Platform:aws SecurityMode:default Topology:single Upgrade:micro] in the last week.
    Open Bugs: Static pod controller pods sometimes fail to start [etcd]

@tssurya force-pushed the increase-probe-timeouts branch 2 times, most recently from 5bdbd13 to c8ba555, on January 22, 2025 18:25
Signed-off-by: Surya Seetharaman <[email protected]>
@tssurya force-pushed the increase-probe-timeouts branch from c8ba555 to 0bc15f9 on January 22, 2025 18:26
@tssurya changed the title from "Increase probe timeouts on UDN pod" to "SDN-4930: Increase probe timeouts on UDN pod" on Jan 22, 2025
@openshift-ci-robot added the jira/valid-reference label (indicates that this PR references a valid Jira ticket of any type) on Jan 22, 2025
@openshift-ci-robot

openshift-ci-robot commented Jan 22, 2025

@tssurya: This pull request references SDN-4930 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target either version "4.19." or "openshift-4.19.", but it targets "openshift-4.18" instead.

In response to this:

[PR description quoted above]

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@tssurya
Contributor Author

tssurya commented Jan 22, 2025

/test e2e-gcp-ovn-techpreview

Contributor

@trozet trozet left a comment


/lgtm

@openshift-ci bot added the lgtm label (indicates that a PR is ready to be merged) on Jan 22, 2025
Contributor

openshift-ci bot commented Jan 22, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: trozet, tssurya

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on Jan 22, 2025
@knobunc
Contributor

knobunc commented Jan 22, 2025

/override ci/prow/e2e-aws-ovn-serial

Contributor

openshift-ci bot commented Jan 22, 2025

@knobunc: Overrode contexts on behalf of knobunc: ci/prow/e2e-aws-ovn-serial

In response to this:

/override ci/prow/e2e-aws-ovn-serial

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 4b05413 and 2 for PR HEAD 0bc15f9 in total

1 similar comment
@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 4b05413 and 2 for PR HEAD 0bc15f9 in total

@knobunc
Contributor

knobunc commented Jan 23, 2025

/override ci/prow/e2e-aws-ovn-serial

Contributor

openshift-ci bot commented Jan 23, 2025

@knobunc: Overrode contexts on behalf of knobunc: ci/prow/e2e-aws-ovn-serial

In response to this:

/override ci/prow/e2e-aws-ovn-serial

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@tssurya changed the title from "SDN-4930: Increase probe timeouts on UDN pod" to "OCPBUGS-48320,SDN-4930: Increase probe timeouts on UDN pod" on Jan 23, 2025
@openshift-ci-robot added the jira/severity-moderate label (referenced Jira bug's severity is moderate for the branch this PR is targeting) on Jan 23, 2025
@openshift-ci-robot

openshift-ci-robot commented Jan 23, 2025

@tssurya: This pull request references Jira Issue OCPBUGS-48320, which is invalid:

  • expected the bug to target the "4.19.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

This pull request references SDN-4930 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target either version "4.19." or "openshift-4.19.", but it targets "openshift-4.18" instead.

In response to this:

[PR description quoted above]

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot added the jira/invalid-bug label (indicates that a referenced Jira bug is invalid for the branch this PR is targeting) on Jan 23, 2025
@tssurya
Contributor Author

tssurya commented Jan 23, 2025

/jira refresh

@openshift-ci-robot added the jira/valid-bug label (indicates that a referenced Jira bug is valid for the branch this PR is targeting) on Jan 23, 2025
@openshift-ci-robot

openshift-ci-robot commented Jan 23, 2025

@tssurya: This pull request references Jira Issue OCPBUGS-48320, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.19.0) matches configured target version for branch (4.19.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @anuragthehatter

This pull request references SDN-4930 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target either version "4.19." or "openshift-4.19.", but it targets "openshift-4.18" instead.

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot removed the jira/invalid-bug label (indicates that a referenced Jira bug is invalid for the branch this PR is targeting) on Jan 23, 2025
@tssurya
Contributor Author

tssurya commented Jan 23, 2025

/tide refresh

@tssurya
Contributor Author

tssurya commented Jan 23, 2025

https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/29458/pull-ci-openshift-origin-master-e2e-gcp-ovn-techpreview/1882166582520582144

Though the job failed, it did not fail for reasons related to UDN; all network segmentation tests passed.

@knobunc knobunc merged commit 890f4fd into openshift:master Jan 23, 2025
24 of 30 checks passed
@openshift-ci-robot

@tssurya: Jira Issue OCPBUGS-48320: Some pull requests linked via external trackers have merged:

The following pull requests linked via external trackers have not merged:

These pull request must merge or be unlinked from the Jira bug in order for it to move to the next state. Once unlinked, request a bug refresh with /jira refresh.

Jira Issue OCPBUGS-48320 has not been moved to the MODIFIED state.

In response to this:

[PR description quoted above]

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Contributor

openshift-ci bot commented Jan 23, 2025

@tssurya: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name (commit) / Required / Rerun command:

  • ci/prow/okd-scos-e2e-aws-ovn (0bc15f9, required: false): /test okd-scos-e2e-aws-ovn
  • ci/prow/e2e-gcp-ovn-techpreview (0bc15f9, required: false): /test e2e-gcp-ovn-techpreview
  • ci/prow/e2e-aws-ovn-single-node-serial (0bc15f9, required: false): /test e2e-aws-ovn-single-node-serial
  • ci/prow/e2e-aws-ovn-single-node-upgrade (0bc15f9, required: false): /test e2e-aws-ovn-single-node-upgrade
  • ci/prow/e2e-aws-ovn-single-node (0bc15f9, required: false): /test e2e-aws-ovn-single-node

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
