tainting and untainting logic implemented via configuration #565
Hi @bilalcaliskan. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Force-pushed from 8d48e33 to e204891. |
I have modified the commit message, so there has been a lot of activity since the PR was opened. Sorry for that. |
/retest |
@bilalcaliskan: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message. |
/retest |
@bilalcaliskan: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message. |
/assign @xueweiz @andyxning |
We discussed this a long time ago, when we first started NPD. |
According to the remedy system section, NPD can add conditions to nodes that will eventually cause them to be tainted, and descheduler will then evict pods that don't tolerate the taint. Alternatively, we could take advantage of Taint Based Eviction instead of descheduler. Currently, descheduler supports RemovePodsViolatingNodeTaints and NPD can add conditions; nevertheless, there is no way to rely on only those two components to drain nodes automatically. Draino is still required, am I right? |
Conditions are mostly informative. A taint actually affects scheduling decisions, e.g. no more pods are scheduled to the node, or running pods are evicted from it. That kind of decision should be made by a cluster-level controller; otherwise, in the extreme case where half the nodes decide to taint themselves and evict pods, the cluster may not have enough resources to run those pods. |
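The concern about many nodes tainting themselves at once can be made concrete with a small sketch. This is an illustration only, not code from NPD or from any Kubernetes controller: the function name allowTaint and the 25% budget are hypothetical, chosen to show why a component with a cluster-wide view is better placed to make the decision than each node acting alone.

```go
package main

import "fmt"

// allowTaint reports whether a cluster-level controller should permit one
// more node to be tainted, given how many nodes are already tainted.
// maxFraction is a hypothetical budget (0.25 means at most 25% of nodes).
func allowTaint(totalNodes, taintedNodes int, maxFraction float64) bool {
	if totalNodes <= 0 {
		return false
	}
	// Count the node we are about to taint against the budget.
	return float64(taintedNodes+1) <= maxFraction*float64(totalNodes)
}

func main() {
	// With 10 nodes and a 25% budget, the third taint request is refused.
	for tainted := 0; tainted < 4; tainted++ {
		fmt.Printf("already tainted=%d allow=%v\n", tainted, allowTaint(10, tainted, 0.25))
	}
}
```

A per-node agent cannot evaluate a check like this because it only sees its own node, which is the gist of the argument for leaving tainting to a cluster-level controller.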
I agree 👍 I want to avoid using draino, so I wonder how to taint by a custom condition 🤔 |
NPD can work with the descheduler only with the predefined conditions (Ready, MemoryPressure...). So the current design is, in my opinion, definitely limited. I understand the decision to split responsibility, but the documentation is misleading. |
I agree with @azman0101: descheduler does not do tainting by itself. According to the remedy systems section we can use descheduler for that purpose, but sadly I guess there is just one option for that, and it's draino. @Random-Liu as far as I know, descheduler does the job only if the specified taints are already on the node. So we should use 3 different components to taint nodes on specific |
What is the status on this? |
I guess there is no more work left on the development side; it is just waiting for a maintainer review. |
@btiernay Could you help find a maintainer who has the time to push this over the finish line? |
@andyxning @wangzhen127 @xueweiz @vteratipally @mmiranda96 Are you available to help review this great addition and help to push it over the finish line? 🙏 |
Kindly requesting your review, one more time 🙏. Thank you 🙇. |
Hey, what is the current status of this PR? Would be great to see this merged 🙏🏻 |
/retest-required |
@bilalcaliskan: The following tests failed, say
Please help us cut down on flakes by linking to an open issue when you hit one in your PR. |
@bilalcaliskan Seems like tests are failing due to build issues. Maybe need to rebase? |
@bilalcaliskan Would be really nice to have this! |
Should we close this PR without merging it? I agree with @Random-Liu's #565 (comment). NPD does not necessarily want to affect the scheduling decisions. For example, we already have examples like Taint Nodes by Condition, in which the node controller taints nodes based on node conditions. This should be the recommended model. |
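For readers unfamiliar with Taint Nodes by Condition, the following sketch shows the kind of mapping involved. The taint keys are the well-known node.kubernetes.io/* keys; the types and the function taintKeysFor are hypothetical and only model the lookup, they are not the node controller's actual code. KernelDeadlock stands in for a custom NPD condition.

```go
package main

import "fmt"

// condition is a simplified stand-in for a node condition.
type condition struct {
	Type   string
	Status string // "True", "False", or "Unknown"
}

// builtinTaints maps built-in node conditions to well-known taint keys.
var builtinTaints = map[condition]string{
	{Type: "Ready", Status: "False"}:             "node.kubernetes.io/not-ready",
	{Type: "Ready", Status: "Unknown"}:           "node.kubernetes.io/unreachable",
	{Type: "MemoryPressure", Status: "True"}:     "node.kubernetes.io/memory-pressure",
	{Type: "DiskPressure", Status: "True"}:       "node.kubernetes.io/disk-pressure",
	{Type: "PIDPressure", Status: "True"}:        "node.kubernetes.io/pid-pressure",
	{Type: "NetworkUnavailable", Status: "True"}: "node.kubernetes.io/network-unavailable",
}

// taintKeysFor returns the taint keys implied by the given conditions.
// Conditions without an entry in builtinTaints produce no taint.
func taintKeysFor(conds []condition) []string {
	var keys []string
	for _, c := range conds {
		if key, ok := builtinTaints[c]; ok {
			keys = append(keys, key)
		}
	}
	return keys
}

func main() {
	conds := []condition{
		{Type: "MemoryPressure", Status: "True"},
		{Type: "KernelDeadlock", Status: "True"}, // custom condition set by NPD
	}
	fmt.Println(taintKeysFor(conds))
}
```

In this model the custom condition yields no taint, which is the gap several commenters in this thread are pointing at.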
That makes NPD pretty much useless IMO. Thank you for confirming. Plus, given that this PR has been open for 4 years, that raises even more concerns about using it in production. |
Agree with @nvermande's assessment. The lack of this capability makes NPD practically much harder to use, integrate, and be successful with. I can understand why actuation isn't in scope, but imo not having (un)tainting support means poorer ecosystem interop and more work for users. |
Can we introduce an "enabled" flag to allow users to determine whether the node-problem-detector should taint a node and affect scheduling decisions? By default, this feature can be set to false. Additionally, we can provide the considerations in the README to help users evaluate whether to enable this feature. |
@wangzhen127 the README suggests Draino (no commits on the master branch in 4 years; IMO it should be removed from the README), mediK8S (brings its own ecosystem), and MachineHealthCheck (related to ClusterAPI) as Remedy Systems next to Descheduler. The outdated Draino makes Descheduler "useless" IMO. So for me, rejecting this PR (for ~3 years now) feels like this project is EOL.
True, but I think that is up to the user of this feature, so it should be documented. An additional idea would be to add labels to the nodes like
True, but these are very basic IMO. Getting this PR merged would let users define their own. Examples that quickly come to mind could be:
There are several use cases where this feature makes absolute sense without creating a new controller, because in that case we could simply fork this project... |
As far as I know, NPD has been used by several cloud providers and products in production for many years. The reason why this approach is not recommended has been clearly stated previously. Adding this as an optional feature could work in some cases, but it could also be abused and eventually harm ourselves. Given that several people feel strongly about this feature, I suggest bringing this issue to the wider community in the sig-node weekly meeting for feedback. Please let me know when you plan to discuss this. Thanks! |
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /close |
@k8s-triage-robot: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Which issue(s) this PR fixes:
This PR solves issue #457.
What this PR does / why we need it:
This PR adds the ability to taint and untaint a node conditionally, under specific circumstances. With that improvement, node-problem-detector can be used in conjunction with descheduler. The user should specify taintEnabled, taintKey, taintValue, and taintEffect in config/kernel-monitor.json. If not specified, taintEnabled is false, so NPD will not taint any node. With that improvement, node-problem-detector also removes the taint once the problem is resolved.
Special notes for your reviewer:
This improvement requires the update verb in the node-problem-detector ClusterRole. If this PR is merged to master, the update verb must be added right here.
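The behaviour described above (four config fields, tainting off by default, taint removed once the problem clears) can be sketched as follows. This is a minimal model under stated assumptions, not the PR's implementation: the flat JSON layout, the types taintConfig and taint, the functions parseTaintConfig and reconcile, and the key example.com/kernel-problem are all hypothetical. Only the four field names come from the PR description. The sketch works on an in-memory taint list; writing the result back to a real node is what would need the update verb mentioned above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// taintConfig holds the four fields named in the PR description.
// The flat JSON layout is an assumption made for this sketch.
type taintConfig struct {
	TaintEnabled bool   `json:"taintEnabled"`
	TaintKey     string `json:"taintKey"`
	TaintValue   string `json:"taintValue"`
	TaintEffect  string `json:"taintEffect"`
}

// taint is a simplified stand-in for a node taint.
type taint struct{ Key, Value, Effect string }

// parseTaintConfig decodes the config; a missing taintEnabled stays false.
func parseTaintConfig(raw []byte) (taintConfig, error) {
	var cfg taintConfig
	err := json.Unmarshal(raw, &cfg)
	return cfg, err
}

// reconcile returns the taint list a node should carry: the configured taint
// is present while the problem exists and absent once it is resolved.
func reconcile(taints []taint, cfg taintConfig, problem bool) []taint {
	if !cfg.TaintEnabled {
		return taints // feature disabled: leave the node untouched
	}
	want := taint{Key: cfg.TaintKey, Value: cfg.TaintValue, Effect: cfg.TaintEffect}
	out := make([]taint, 0, len(taints)+1)
	for _, t := range taints {
		if t.Key == want.Key && t.Effect == want.Effect {
			continue // drop our own taint; it is re-added below if still needed
		}
		out = append(out, t)
	}
	if problem {
		out = append(out, want)
	}
	return out
}

func main() {
	raw := []byte(`{"taintEnabled": true, "taintKey": "example.com/kernel-problem",
		"taintValue": "true", "taintEffect": "NoSchedule"}`)
	cfg, err := parseTaintConfig(raw)
	if err != nil {
		panic(err)
	}
	taints := reconcile(nil, cfg, true)
	fmt.Println("problem detected:", taints)
	taints = reconcile(taints, cfg, false)
	fmt.Println("problem resolved:", taints)
}
```

Matching on key and effect keeps the function idempotent, so calling it repeatedly while a problem persists does not stack duplicate taints.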