Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[KEP-3022] Write the production readiness requirements to graduate to beta #3338

Merged
merged 15 commits into from
Jun 22, 2022

Conversation

sanposhiho
Copy link
Member

  • One-line PR description: Write the production readiness requirements to graduate to beta
  • Other comments:

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 5, 2022
@k8s-ci-robot k8s-ci-robot requested a review from ahg-g June 5, 2022 09:14
@k8s-ci-robot k8s-ci-robot added the kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory label Jun 5, 2022
@k8s-ci-robot k8s-ci-robot requested a review from Huang-Wei June 5, 2022 09:14
@k8s-ci-robot k8s-ci-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Jun 5, 2022
@sanposhiho
Copy link
Member Author

@wojtek-t wojtek-t self-assigned this Jun 6, 2022
@@ -292,13 +292,22 @@ rollout. Similarly, consider large clusters and how enablement/disablement
will rollout across nodes.
-->

It shouldn't impact already running workloads. It's an opt-in feature,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will review tomorrow - in the meantime, if you're targeting beta, please update the kep.yaml and corresponding prr file.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. will update.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are new requirements for the Test Plan. Please check the updated template.

###### What specific metrics should inform a rollback?

<!--
What signals should users be paying attention to when the feature is young
that might indicate a serious problem?
-->

A spike on metric `schedule_attempts_total{result="error|unschedulable"}` when pods using this feature are added.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also include a spike on scheduling_algorithm_duration_seconds, which would suggest that the feature is too slow.

Copy link
Member Author

@sanposhiho sanposhiho Jun 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍
will add a mention of plugin_execution_duration_seconds as well.

@@ -307,12 +316,16 @@ Longer term, we may want to require automated upgrade/rollback tests, but we
are missing a bunch of machinery and tooling and can't do that now.
-->

Yes. The behavior is changed as expected.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Describe the manual testing that was done :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok with this manual testing. Although, a test that most closely follows the question is when you actually do an upgrade.

So you start with an apiserver in version 1.24 (with feature disabled) and then upgrade to 1.25 with feature enabled and back.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I'll do another manual test.

@@ -327,6 +340,8 @@ checking if there are objects with field X set) may be a last resort. Avoid
logs or events for this purpose.
-->

The operator can query pods with `pod.spec.topologySpreadConstraints.minDomains` field set.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we could extend kubernetes/kubernetes#107556 to produce a metric.

But I wouldn't block beta graduation to this.

Copy link
Member Author

@sanposhiho sanposhiho Jun 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We definitely need kubernetes/kubernetes#107556...
I have not had enough time to work on this, but will make it a priority...

to produce a metric.

So, what metrics do you imagine like?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you create an issue so we can discuss there?

But it would be something about which filter plugins influenced a pod scheduling decision. One counter increment for each plugin.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, will create that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -363,18 +378,19 @@ These goals will help you determine what you need to measure (SLIs) in the next
question.
-->

- 99% of `plugin_execution_duration_seconds{plugin="PodTopologySpread"}` are within x milliseconds.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you have to provide an actual value instead of x

@@ -383,6 +399,8 @@ Describe the metrics themselves and the reasons why they weren't added (e.g., co
implementation difficulties, etc.).
-->

No.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, as described in the How can an operator determine if the feature is in use by workloads? question

@@ -466,8 +486,13 @@ details). For now, we leave it here.

###### How does this feature react if the API server and/or etcd is unavailable?

The feature doesn't affected because Pod Topology Spread plugin doesn't communicate with kube-apiserver or etcd
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The feature doesn't affected because Pod Topology Spread plugin doesn't communicate with kube-apiserver or etcd
The feature isn't affected because Pod Topology Spread plugin doesn't communicate with kube-apiserver or etcd

@@ -483,6 +508,9 @@ For each of them, fill in the following information by copying the below templat

###### What steps should be taken if SLOs are not being met to determine the problem?

- Check `plugin_execution_duration_seconds{plugin="PodTopologySpread"}` to see if latency increased.
- Check `schedule_attempts_total{result="error|unschedulable"}` to see if the number of attempts increased.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What should I do if I see problems in either of the metrics?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

@sanposhiho
Copy link
Member Author

@alculquicondor @wojtek-t
Updated. Please retake a look 🙏

stage: alpha
latest-milestone: "v1.24"
stage: beta
latest-milestone: "v1.25"
milestone:
alpha: "v1.24"
beta: "v1.25"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you remove the stable line?

We don't know yet if it will be done in 1.26 or not, but most likely 1.27+ (after 2 beta releases).

@@ -292,13 +292,22 @@ rollout. Similarly, consider large clusters and how enablement/disablement
will rollout across nodes.
-->

It shouldn't impact already running workloads. It's an opt-in feature,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are new requirements for the Test Plan. Please check the updated template.

@@ -307,12 +316,16 @@ Longer term, we may want to require automated upgrade/rollback tests, but we
are missing a bunch of machinery and tooling and can't do that now.
-->

Yes. The behavior is changed as expected.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok with this manual testing. Although, a test that most closely follows the question is when you actually do an upgrade.

So you start with an apiserver in version 1.24 (with feature disabled) and then upgrade to 1.25 with feature enabled and back.

@@ -327,6 +340,8 @@ checking if there are objects with field X set) may be a last resort. Avoid
logs or events for this purpose.
-->

The operator can query pods with `pod.spec.topologySpreadConstraints.minDomains` field set.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you create an issue so we can discuss there?

But it would be something about which filter plugins influenced a pod scheduling decision. One counter increment for each plugin.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 17, 2022
@sanposhiho
Copy link
Member Author

@alculquicondor
Thanks for reviewing. I updated as your suggestions.
(except #3338 (comment))

Please retake a look again. 🙏

extending the production code to implement this enhancement.
-->

- `k8s.io/kubernetes/pkg/scheduler/framework/plugins/podtopologyspread`: `2020-06-17` - `86%`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also mention the pod validation packages

We expect no non-infra related flakes in the last month as a GA graduation criteria.
-->

N/A
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

explain why:

possible explanations is that there are no new API endpoints and that the feature doesn't interact with other components, so E2E doesn't add extra value to integration tests.

@@ -383,6 +469,8 @@ Describe the metrics themselves and the reasons why they weren't added (e.g., co
implementation difficulties, etc.).
-->

Yes. It would be useful if we could see more details related to scheduler's decisions in metrics.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Link to the issue.

Maybe be more specific about decisions. In this case, we would like to know which Filters had impact on the scheduling of the pod.

@sanposhiho
Copy link
Member Author

@alculquicondor

Applied your new suggestions.

Copy link
Member

@alculquicondor alculquicondor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 17, 2022
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 17, 2022
@sanposhiho
Copy link
Member Author

@alculquicondor

I just updated the upgrade/rollback manual test section.
#3338 (comment)

@sanposhiho
Copy link
Member Author

@wojtek-t Could you take a look at this PR as well? 🙏

Copy link
Member

@johnbelamaric johnbelamaric left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm satisfied with the PRR answers, awaiting SIG approval.

@@ -4,3 +4,5 @@
kep-number: 3022
alpha:
approver: "@wojtek-t"
beta:
approver: "@wojtek-t"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wojtek is out, please change to me.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, thanks for taking over! 🙏

@johnbelamaric
Copy link
Member

I'm satisfied with the PRR answers, awaiting SIG approval.

Oh, I see SIG approval. Just change the PRR approver to me, so if all hell breaks loose Wojtek doesn't get blamed :-P

@sanposhiho
Copy link
Member Author

@johnbelamaric Updated.

@johnbelamaric
Copy link
Member

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, johnbelamaric, sanposhiho

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 21, 2022
@sanposhiho
Copy link
Member Author

@johnbelamaric @alculquicondor
Would you mind adding /lgtm? It's needed for merging.

@alculquicondor
Copy link
Member

/lgtm

Thanks!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 22, 2022
@k8s-ci-robot k8s-ci-robot merged commit 9b475dc into kubernetes:master Jun 22, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.25 milestone Jun 22, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants