TEP-0135: Per-PipelineRun (instead of per-workspace) affinity assistant #6543

Closed
lbernick opened this issue Apr 14, 2023 · 15 comments

@lbernick
Member

lbernick commented Apr 14, 2023

Feature request

A cluster-level, opt-in configuration option to allow all of a PipelineRun's pods to be scheduled to the same node.

Use case

  1. PipelineRuns with parallel TaskRuns using the same PVC-backed workspace. (This use case is addressed with the current affinity assistant.) Discussed in Difficult to use parallel Tasks that share files using workspace #2586

  2. PipelineRuns with TaskRuns using multiple PVC-backed workspaces. (This use case is not addressed with the current affinity assistant.)

Existing Workarounds

Potential drawbacks

This feature won't address the drawbacks already associated with the existing affinity assistant.

This also introduces a new drawback compared to the current affinity assistant: Some PipelineRuns need to have TaskRuns run on different nodes; for example, multi-arch builds. Discussed in more detail here

More details

Looking to gauge interest in this feature, as well as whether there is any interest in a configuration option that would prevent multiple PipelineRuns from running concurrently on the same node.

@skaegi has expressed interest in this feature as long as PipelineRuns can be moved to a new node in case a cluster operator needs to do node maintenance. (He is not interested in preventing multiple PipelineRuns from running on the same node, as this results in worse bin-packing.)

@lbernick lbernick added the kind/feature Categorizes issue or PR as related to a new feature. label Apr 14, 2023
@lbernick
Member Author

Chatted about this in Monday's API WG-- seemed like there was interest in this feature which was great!

@pritidesai expressed some concern about how this feature would handle a node going down. I think a node going down would kill pods running on it and cause taskruns to fail regardless of whether the existing affinity assistant is used or not, so I think this could make sense to address separately. (Looks like a contributor created a separate issue for this: #6558). @pritidesai would you be ok with descoping this requirement?

@QuanZhang-William and I brainstormed a few different ideas for how to build this feature. The clearest way is to create only one placeholder pod per pipelinerun and mount all workspace persistent volumes onto it, which Quan prototyped here: https://github.com/QuanZhang-William/pod-affinity. This requires a cluster operator to set the default storage class to one that uses WaitForFirstConsumer VolumeBindingMode, so that the PVs are not bound to nodes until the placeholder pod is scheduled, to prevent the PVs from being bound to different nodes. We would have to document that cluster operators need to make this change in order to use the per-pipelinerun affinity assistant, which isn't ideal but I think it could be workable.
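For reference, a minimal sketch of the kind of default StorageClass this would require is below; the name and provisioner are only examples, not something the prototype prescribes:

```yaml
# Illustrative default StorageClass using WaitForFirstConsumer, so that PVs
# are not bound to a node until the consuming (placeholder) pod is scheduled.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard-wffc   # hypothetical name
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: ebs.csi.aws.com   # example provisioner; use your cluster's CSI driver
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete
```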

I want to go into a bit more detail on the problem @skaegi brought up: a situation where an operator cordons a node in order to do maintenance on it.

  • Without the affinity assistant, you can cordon a node, wait for existing TaskRuns to finish before you delete any pods, and then the cluster autoscaler will trigger a scale up, creating a new node matching the node affinity terms of the original PV. Subsequent TaskRun pods get scheduled on the new node and the PipelineRun completes successfully.
  • With the affinity assistant, when you cordon a node, the PipelineRun gets stuck because new TaskRun pods have affinity for the assistant pod, but can't be scheduled onto the node, so the cluster autoscaler doesn't scale up. The cluster operator needs to delete the affinity assistant pod in order for the PipelineRun to complete successfully. (Cluster Autoscaler conflict with volumneClaim and or affinity-assistant #4699 has the same root cause.)
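For context, the affinity in question is a required pod-affinity term on each TaskRun pod pointing at the placeholder pod, roughly of the shape sketched below (label keys and values are illustrative, not necessarily the exact labels Tekton sets). Because the term is required during scheduling, a cordoned node holding the placeholder pod leaves new TaskRun pods unschedulable, and adding a node can't satisfy them:

```yaml
# Rough sketch of the affinity a TaskRun pod carries toward the affinity
# assistant placeholder pod (labels here are assumptions for illustration).
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/component: affinity-assistant   # assumed label
        topologyKey: kubernetes.io/hostname   # i.e. "same node as the assistant"
```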

@QuanZhang-William and I tried to come up with a few ways to get around this problem with a per-PipelineRun affinity assistant, but haven't come up with any great solutions yet. Here are some ideas we thought of:

  • Setting pod affinity for each pod in a PipelineRun: this deadlocks because the first pod cannot schedule. (More details in the original affinity design doc: "The RequiredDuringScheduling rule {LabelSelector: "service" In "S", TopologyKey: "zone"} only "works" once one pod from service S has been scheduled. But if all pods in service S have this RequiredDuringScheduling rule in their PodSpec, then the RequiredDuringScheduling rule will block the first pod of the service from ever scheduling, since it is only allowed to run in a zone with another pod from the same service. And of course that means none of the pods of the service will be able to schedule".) This also doesn't solve the node maintenance problem.
  • Setting pod affinity for each pod in a PipelineRun except the first in the DAG: this also does not work correctly; see Quan's experiment here
  • Using PreferredDuringSchedulingIgnoredDuringExecution affinity on the placeholder pod: this doesn't guarantee that pods get scheduled and PVCs get bound to the same node (sketched after this list)
  • Trying to delete and recreate the placeholder pod during deadlock situations or between TaskRuns: I think this will get complicated and I'm not sure how this would work exactly
  • Building a k8s scheduler plugin: The coscheduling plugin could work here, and I built one specific to Tekton; however, this would definitely depend on the k8s flavor and would have to be separate from the pipelines repo, and it may have some weird interactions with the cluster autoscaler (e.g. cluster-autoscaler does not support custom scheduling config kubernetes/autoscaler#4518). It's also very hard to debug.
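To make the PreferredDuringScheduling idea above concrete, the term would look roughly like the sketch below (labels illustrative). The scheduler treats it only as a scoring preference, so pods can still land on different nodes and their PVCs can still bind to different nodes:

```yaml
# Sketch of soft (preferred) pod affinity; the scheduler may still spread the
# pods across nodes, so co-location of PVC bindings is not guaranteed.
affinity:
  podAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              tekton.dev/pipelineRun: my-pipelinerun   # assumed label
          topologyKey: kubernetes.io/hostname
```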

@skaegi is this requirement table stakes for you or do you think we could descope it?

@lbernick
Member Author

I chatted with @pritidesai last week about concerns related to nodes going down. To summarize, Priti has general concerns about long CI/CD PipelineRuns failing due to transient issues and users having a limited ability to recover the work that has already been done. We explored whether it would make sense for a per-PipelineRun affinity assistant (PPAA) to handle node failures by moving a PipelineRun to a new node if the one it was running on went down. However, we don't currently handle node failures for PipelineRuns in general, so we agreed that it wouldn't make sense for the PPAA to provide stronger guarantees than what we currently have. We're also not really sure how we could implement this guarantee for PPAA. Therefore, we decided to descope this concern and address it separately, making sure to document this limitation for cluster operators.

@skaegi
Contributor

skaegi commented Apr 25, 2023

@lbernick i think that if we can't handle a cordon correctly (e.g. controlled NoSchedule) it says we're in trouble in general for robustness. Maybe #6558 is the higher priority as it's the simple case.

I recognize that the "Trying to delete and recreate the placeholder pod" game might get complex but I'd start there as we can do that in a backward compatible way and at least fix some cases a bit better.

Something like... for a PipelineRun, if its TaskRuns are both not running and cannot be scheduled on the affinity assistant's node then delete the affinity assistant pod and select a candidate TaskRun and re-run the affinity assistant placement algorithm. Better than nothing and might really help simple cases if we can begin to figure them out.

@lbernick
Member Author

lbernick commented Apr 25, 2023

@lbernick i think that if we can't handle a cordon correctly (e.g. controlled NoSchedule) it says we're in trouble in general for robustness. Maybe #6558 is the higher priority as it's the simple case.

We should definitely fix #6558 asap but I think it's a slightly separate problem

I recognize that the "Trying to delete and recreate the placeholder pod" game might get complex but I'd start there as we can do that in a backward compatible way and at least fix some cases a bit better.

Something like... for a PipelineRun, if its TaskRuns are both not running and cannot be scheduled on the affinity assistant's node then delete the affinity assistant pod and select a candidate TaskRun and re-run the affinity assistant placement algorithm. Better than nothing and might really help simple cases if we can begin to figure them out.

The tricky thing here is that a pod could be unschedulable for all sorts of reasons, and not all of them are deadlock situations where we'd want to recreate the affinity assistant. For example, what if a pod is just pending because it's waiting for the cluster autoscaler to scale up the number of nodes?

One way we could handle the node maintenance case specifically could be to watch nodes and see if any of them become "unschedulable". As far as I can tell a node only has "unschedulable=true" when it's cordoned or being drained (or not ready yet or about to be torn down by the autoscaler). If an unschedulable node has any affinity assistant pods, we could create new copies of them and then delete the old ones. (Create would happen first so that TaskRun pods can't schedule in between when a placeholder pod gets deleted and the new one gets scheduled.) It feels a bit hacky tbh; will probably need to design in more detail.

@skaegi Can you elaborate on what you mean by "affinity assistant placement algorithm"? The only affinity terms we apply to the affinity assistant pod itself are any affinity terms from the pipelinerun pod template, and the affinity term to repel other affinity assistants. k8s handles scheduling the placeholder pod from there.

@skaegi
Contributor

skaegi commented Apr 26, 2023

One way we could handle the node maintenance case specifically could be to watch nodes and see if any of them become "unschedulable". As far as I can tell a node only has "unschedulable=true" when it's cordoned or being drained (or not ready yet or about to be torn down by the autoscaler). If an unschedulable node has any affinity assistant pods, we could create new copies of them and then delete the old ones. (Create would happen first so that TaskRun pods can't schedule in between when a placeholder pod gets deleted and the new one gets scheduled.) It feels a bit hacky tbh; will probably need to design in more detail.

Yes exactly -- we might look to see if node.kubernetes.io/unschedulable and node.kubernetes.io/not-ready are set. Agreed, it is a bit hacky and will require work, but I think it will be an improvement over the current behavior, where we mostly just hang and eventually time out.
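For reference, a cordoned or draining node ends up looking roughly like the sketch below, which is what such a check could key off of (fields trimmed to the relevant bits):

```yaml
# What a cordoned node looks like (trimmed). `kubectl cordon` sets
# spec.unschedulable, and the corresponding taint is added automatically.
apiVersion: v1
kind: Node
metadata:
  name: worker-1   # illustrative
spec:
  unschedulable: true
  taints:
    - key: node.kubernetes.io/unschedulable
      effect: NoSchedule
    # A NotReady node additionally gets a node.kubernetes.io/not-ready taint.
```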

re: create/delete - Off the top of my head the order shouldn't matter, as the node is already unschedulable and the TaskRun pods should already have pod affinity to prevent them being created on an arbitrary node. Either way, that's part of the design.

Can you elaborate on what you mean by "affinity assistant placement algorithm"? The only affinity terms we apply to the affinity assistant pod itself are any affinity terms from the pipelinerun pod template, and the affinity term to repel other affinity assistants. k8s handles scheduling the placeholder pod from there.

It's a more advanced case, but ... the new, smarter placement algorithm that has yet to be written 😃. I'm hoping it could examine a TaskRun and match affinity requirements and possibly tolerations. I'm really not sure whether the current repelling aspect is correct or what it hopes to accomplish.

@lbernick
Member Author

@skaegi PTAL at #6584 and LMK if this seems like what you had in mind!

@lbernick
Member Author

/assign @QuanZhang-William

@tekton-robot
Collaborator

@lbernick: GitHub didn't allow me to assign the following users: QuanZhang-William.

Note that only tektoncd members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign @QuanZhang-William

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@QuanZhang-William
Member

/assign

@l-qing
Contributor

l-qing commented May 22, 2023

I am really looking forward to this feature.

I have a question: are we scheduling every Pod in the PipelineRun to the same node so that tasks can be built in parallel as much as possible?

If so,

  • Can tasks that do not mount PVC still be scheduled on any node, such as a simple notification task?
  • If a task only mounts ReadWriteMany PVC, can it also be scheduled on any node?

@lbernick
Member Author

@l-qing great questions! I forgot to comment on this issue earlier, but we've merged TEP-0135: Coscheduling PipelineRun pods as our design proposal for this feature. Please feel free to leave any feedback on the design proposal on this issue!

Can tasks that do not mount PVC still be scheduled on any node, such as a simple notification task?

I think with the initial implementation, we're going to ensure all pods are scheduled on the same node. We can definitely explore making the per-pipelinerun affinity assistant a bit smarter, especially if more users indicate they'd prefer to allow free scheduling of pods that don't mount PVCs. This will be implemented as an alpha feature initially, so we have some freedom to make changes if needed.

If a task only mounts ReadWriteMany PVC, can it also be scheduled on any node?

We currently don't take PVC access modes into account; we've been assuming so far that cluster-level constraints affect whether users are using ReadWriteMany or ReadWriteOnce PVCs, and their decision on whether or not to enable the affinity assistant depends on what type of PVCs they are using in general. Does your use case involve PipelineRuns that mix different access modes for PVCs, and if so, can you please go into more detail?

@l-qing
Contributor

l-qing commented May 25, 2023

@lbernick Thank you for your reply.

I discovered this issue when I encountered performance-related problems.

In my environment there is only one shared storage class, such as NFS or Ceph. If a large number of pipelines run in parallel (say 100 TaskRuns), a single PVC can end up with very poor performance.
I used fio to run 4k random read/write tests on the files and found that the overall throughput of the storage pool was only about 40 MB/s. Running fio tests in parallel across multiple PVCs shows that they actually share that same budget, so each PVC may only get 20+ MB/s.

This is especially true for the Node.js frontend pipelines: they need to download dependencies into the cache PVC and copy them between multiple PVC directories, so the storage pressure from a single pipeline is very high.
A single TaskRun may involve copying up to 1 GB of small files. If multiple pipelines of this kind are executed in parallel, they can even affect other pipelines; reading files from the PVC becomes very slow.

So I was wondering whether it's possible to mix different types of PVC.

For example, for a Node.js pipeline, the source code PVC could use local storage such as TopoLVM or OpenEBS, and only the cache PVC would use shared storage such as NFS or Ceph.
This way, at least the overall pipeline won't be slowed down by the parts that do very frequent disk reads and writes.

Kubernetes also optimizes for local storage, ensuring that Pods using a WaitForFirstConsumer PVC are scheduled onto the node the PVC was bound to.

When only one local-storage PVC is used, no extra action (such as setting pod affinity) is required for the pipeline to execute successfully. Only certain combinations of multiple local-storage PVCs can make the pipeline fail, because the PVCs get bound to different nodes.

I'm thinking about the ReadWriteMany and ReadWriteOnce access modes of PVCs in order to fully utilize the nodes in the cluster and avoid pipelines queueing because a single node runs out of resources.
Therefore, I'd expect pods that don't depend on a PVC to be able to be scheduled freely onto idle nodes.

@lbernick
Member Author

@l-qing thanks for the detailed response, super helpful!

To make sure I understand you correctly, it sounds like you're trying to do something similar to the user who reported #5275, where you have a PVC you're using to cache dependencies between pipelineruns. Each PipelineRun also needs somewhere to store the repo contents (not shared between PipelineRuns), and you could use the same PVC you're using for caching, but that's slow. You'd like to use the per-PipelineRun affinity assistant feature, when it's available, to be able to use a separate local storage PVC in each PipelineRun in addition to the cache PVC, and you can't currently do this because you can't use two PVCs in one TaskRun (i.e. the problem reported in #5275).

It sounds like you're concerned, though, that once you enable the per-PipelineRun affinity assistant, any pods that don't rely on one of these two PVCs will still be forced onto the same node as the other PipelineRun pods, causing inefficient cluster usage.

Does that sound right?

It sounds like we could make two improvements to the proposed design:

  • not forcing pods that don't use any PVCs onto the same node as the other pods
  • taking into account PVC access modes and applying affinity constraints only to pods using ReadWriteOnce PVCs, and not ReadWriteMany/ReadOnlyMany PVCs. I'm not sure if there are other considerations beyond the access mode we'd have to take into account, or if this would be useful for the existing affinity assistant as well.

I'm thinking for right now, we should take the simple approach proposed in the design and just run all the PipelineRun pods on the same node, see how well that works for people, and make these adjustments if necessary. However, I'm also open to updating the design proposal now. (I also got another comment on the tekton slack from a user who says "After going through everything I think we're leaning more towards switching out all our PVCs used for workspaces to use a ReadWriteMany and not have the option to mix with ReadWriteOnce which would then mean we can disable the affinity-assistant completely." So it sounds like others may have similar concerns.)

@l-qing
Contributor

l-qing commented May 27, 2023

@lbernick

you can't currently do this because you can't use two PVCs in one TaskRun (i.e. the problem reported in #5275).

Not all of it.
If I set disable-affinity-assistant to true, and there are only two PVCs:

  • ReadWriteOnce and WaitForFirstConsumer Code PVC
  • ReadWriteMany Cache PVC

It will work well, because after k8s binds the Code PVC for the first time, any subsequent use of that PVC will only be scheduled onto that node (see here).
A scheduling conflict can only happen if I have multiple WaitForFirstConsumer PVCs and they get bound to different nodes.
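For concreteness, a sketch of that two-PVC combination, with hypothetical storage class names:

```yaml
# Sketch of the setup described above: a node-local code PVC plus a shared
# cache PVC (storage class names are examples, not recommendations).
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: code-pvc
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: topolvm-ssd      # local storage, WaitForFirstConsumer
  resources:
    requests:
      storage: 5Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cache-pvc
spec:
  accessModes: [ReadWriteMany]
  storageClassName: nfs-shared       # shared storage
  resources:
    requests:
      storage: 20Gi
```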

Does that sound right?

Yes.
These two improvements would be enough, I think.

"leaning more towards switching out all our PVCs used for workspaces to use a ReadWriteMany"

Currently we do it this way as well, but we have indeed encountered performance issues caused by having a single storage class.
As long as it is shared storage there will always be a performance ceiling, and we cannot restrict how users use the workspace.


Regarding the frontend pipeline, our current workaround is that the Code PVC is only used for sharing data.

Each task copies its contents to the /tmp temporary directory before executing commands (such as downloading dependencies). Afterwards, it uses tar to compress the contents and put them back on the Code PVC. This avoids a large number of small-file reads and writes on the Code PVC, leaving only sequential reads and writes of large files, which keeps performance acceptable.
But this way of working is definitely not elegant.
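A minimal sketch of a step implementing this workaround, assuming a workspace named `code` backed by the Code PVC (names, image, and paths are illustrative):

```yaml
# Illustrative Tekton Task step: do the small-file work on node-local /tmp,
# then write a single archive back to the shared workspace.
apiVersion: tekton.dev/v1
kind: Task
metadata:
  name: npm-install-via-tmp   # hypothetical name
spec:
  workspaces:
    - name: code              # backed by the shared Code PVC
  steps:
    - name: install-deps
      image: node:18
      script: |
        #!/bin/sh
        set -e
        # Copy sources to local /tmp so the many small-file reads/writes
        # hit node-local disk instead of the shared PVC.
        mkdir -p /tmp/work
        cp -r "$(workspaces.code.path)/." /tmp/work
        cd /tmp/work
        npm ci
        # Return the result to the PVC as one large sequential write.
        tar -czf "$(workspaces.code.path)/build.tar.gz" -C /tmp/work .
```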

@lbernick lbernick changed the title Per-PipelineRun (instead of per-workspace) affinity assistant TEP-0135: Per-PipelineRun (instead of per-workspace) affinity assistant May 31, 2023
@lbernick
Member Author

@l-qing per-pipelinerun affinity assistant has now been implemented as an alpha feature. Please feel free to try it out and share feedback! You can enable this by setting "disable-affinity-assistant" to "true" and "coschedule" to "pipelineruns" (docs here). We incorporated your feedback that pods that don't bind PVCs should be able to be scheduled freely. I also created a follow-up issue to track your feedback about PVC access modes: #6989. I'm going to close this issue since the implementation is complete in alpha; if you'd like to provide feedback, please comment on #6990.
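For anyone finding this later, the corresponding settings live in the feature-flags ConfigMap; a minimal sketch (the linked docs are the authoritative reference):

```yaml
# Sketch of the feature-flags settings that enable per-PipelineRun coscheduling.
apiVersion: v1
kind: ConfigMap
metadata:
  name: feature-flags
  namespace: tekton-pipelines
data:
  disable-affinity-assistant: "true"
  coschedule: "pipelineruns"
```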
