
Tekton Queue. Concurrency #5835

Open
marniks7 opened this issue Dec 5, 2022 · 20 comments
Labels
kind/feature: Categorizes issue or PR as related to a new feature.
lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@marniks7

marniks7 commented Dec 5, 2022

Feature request

We would like the ability to create many PipelineRuns at once but execute them one by one (mostly), and sometimes concurrently.

Use case

Tekton can be used as a regular pipeline \ workflow engine for all sorts of activities.

Use case #1 - Chaos Engineering

Chaos Engineering - create PipelineRuns with some chaos, e.g. a deployment restart for each deployment present in the environment. Execute the PipelineRuns one after another; sometimes concurrent execution may be desirable.

Use case #2 - Load Testing by a single person

  • Create pipelineRun with load equal to 10 users
  • Create pipelineRun with load equal to 20 users
  • Create pipelineRun with load equal to 30 users

Create them at the same time, but run them one after another.

Use case #3 - Load Testing by a few people

There could be a single environment for load testing, but multiple people working on it. To control the load runs done by multiple people, each person can simply create a PipelineRun, and it will be executed when the previous load run has finished.
Alternative: ask in chat whether the server is free \ available.

Solution (we have)

What we have right now:
All PipelineRuns are created in the PipelineRunPending state. This is done manually in the PipelineRun YAML. We are considering Kyverno for an automated approach, but it doesn't support changing the spec.status field as of September 2022.
We have implemented a custom Kubernetes controller which handles PipelineRuns with the label queue.tekton.dev/name and removes the PipelineRunPending state when the previous PipelineRun has finished.

metadata:
  labels:
    queue.tekton.dev/name: env-maria-d3hs0

It is also possible to search for PipelineRuns in all namespaces, e.g. in the case of a namespace per person:

metadata:
  annotations:
    queue.tekton.dev/scope: cluster
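
Putting the pieces together, a queued run might look like the following minimal sketch (the name and pipelineRef are hypothetical; spec.status: PipelineRunPending is Tekton's standard mechanism for holding a run):

apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  name: load-20-users                  # hypothetical name
  labels:
    queue.tekton.dev/name: env-maria-d3hs0
  annotations:
    queue.tekton.dev/scope: cluster    # optional, for cluster-wide queues
spec:
  pipelineRef:
    name: load-test                    # hypothetical Pipeline
  status: PipelineRunPending           # hold the run until the queue controller releases it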

We didn't implement concurrent execution (e.g. running 2 PipelineRuns from the same queue at once) simply because the top use cases don't need it, but we may implement it in the future.
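
Releasing the next run in the queue amounts to clearing spec.status; our controller effectively performs the equivalent of this manual patch (the run name comes from the hypothetical sketch above):

kubectl patch pipelinerun load-20-users --type=json \
  -p '[{"op": "remove", "path": "/spec/status"}]'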

Other Notes

@marniks7 marniks7 added the kind/feature Categorizes issue or PR as related to a new feature. label Dec 5, 2022
@lbernick
Member

Related: #4903

@tekton-robot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 16, 2023
@vdemeester
Member

/remove-lifecycle stale

@tekton-robot tekton-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 24, 2023
@tekton-robot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 22, 2023
@tekton-robot
Collaborator

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 21, 2023
@tekton-robot
Collaborator

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen with a justification.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

@tekton-robot
Collaborator

@tekton-robot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen with a justification.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@khrm
Contributor

khrm commented Oct 21, 2023

/remove-lifecycle rotten

@tekton-robot tekton-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Oct 21, 2023
@khrm
Contributor

khrm commented Oct 21, 2023

/reopen

This is part of the Roadmap and quite important.
/lifecycle frozen

@tekton-robot
Collaborator

@khrm: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

This is part of the Roadmap and quite important.
/lifecycle frozen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tekton-robot tekton-robot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Oct 21, 2023
@khrm
Contributor

khrm commented Oct 21, 2023

@vdemeester Please reopen this.

@vdemeester
Member

/lifecycle frozen

@vdemeester vdemeester reopened this Oct 23, 2023
@github-project-automation github-project-automation bot moved this from Done to In Progress in Tekton Community Roadmap Oct 23, 2023
@sibelius

queueing would be awesome

@benoitschipper

This would be a great addition, so we could pool pipeline resources from multiple customers and just use queueing. The Pending state is not desirable, as timeouts might cause the pipelines to fail on Kubernetes clusters.

@sibelius

can we do this using resource requests and limits?

@benoitschipper

benoitschipper commented Apr 12, 2024

can we do this using resource requests and limits?

Yeah, you can use requests and limits for this. The problem is that if there are no resources left to run a pipeline, it goes into "Pending", which eventually times out, meaning people will get pipeline failures.
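
For context, this is roughly what requests and limits look like on a Tekton Task step; a minimal sketch assuming the v1 API (the Task name, image, and values are illustrative):

apiVersion: tekton.dev/v1
kind: Task
metadata:
  name: build                  # hypothetical Task
spec:
  steps:
    - name: run-build
      image: alpine:3.19
      script: |
        echo "doing work"
      computeResources:        # named "resources" in the older v1beta1 step schema
        requests:
          cpu: 500m
          memory: 512Mi
        limits:
          cpu: "1"
          memory: 1Gi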

If there were something like a queueing system, pipelines would instead never time out and would just wait until resources become available. This is obviously only for busy periods on the cluster. Hence the request for something like a queueing system.

I also found that there is a possibility to do something like a lease, but that needed additional self-made resources. I want to make this into a solution for our DevOps teams.

We currently give each DevOps team a certain amount of resources to perform pipeline-related tasks within a namespace. But that means a lot of resources are potentially wasted when some DevOps teams are not running their pipelines on certain days. We want to instead pool all the resources for our DevOps teams, so we can reserve some nodes for pipeline-related runtimes, share the resources, and utilize some sort of queueing. It's all efficiency and effectiveness related :)

Hope that makes sense :)

@sibelius

why does PENDING time out?

@benoitschipper

benoitschipper commented Apr 13, 2024

why does PENDING time out?

Kubernetes' scheduler is, due to lack of compute space within a set quota or overall compute capacity on the cluster, unable to schedule the pod on any node with the requested CPU/memory/storage, making the pod go into Pending mode.

https://kubernetes.io/docs/tasks/debug/debug-application/debug-pods/#my-pod-stays-pending

I think it might have something to do with the default timeout; the following is from searching the web and the Tekton docs:

Reasons for Pending State:

  • Insufficient Nodes: Your Kubernetes cluster lacks the physical nodes to accommodate the pod's CPU or memory requirements.
  • Resource Quotas: You might have resource quotas in place that limit the total number of pods or the amount of resources that can be used in a specific namespace.

Tekton Timeouts:

  • PipelineRun Timeout: Each Tekton PipelineRun has a configurable timeout. If the pod remains in "Pending" state beyond this timeout, the PipelineRun will fail with an error indicating that it timed out.
  • Default Timeout: Tekton has a global default timeout (usually 60 minutes) that acts as a catch-all if you haven't specified a PipelineRun-specific timeout.
  • Task Timeouts: You can even define timeouts at the individual Task level within your pipeline.

Customization:

  • Overriding Defaults: You can change the global default timeout by adjusting the default-timeout-minutes field in your Tekton configuration (config/config-defaults.yaml); see the sketch after this list.
  • Specific Timeouts: Set more tailored timeouts at the PipelineRun and individual Task levels to match the expected execution time of your pipeline steps.
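
Both knobs together, as a hedged sketch (the config-defaults ConfigMap name and key come from Tekton's documentation; the 120-minute value and the PipelineRun name are illustrative):

apiVersion: v1
kind: ConfigMap
metadata:
  name: config-defaults
  namespace: tekton-pipelines
data:
  default-timeout-minutes: "120"   # raise the cluster-wide default from 60
---
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  name: slow-run                   # hypothetical
spec:
  timeout: "2h"                    # per-run override of the default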

How to Find Out Specific Timeout Values

  • Examine PipelineRun: Use kubectl describe pipelinerun to see the timeout configured for that specific instance.
  • Pipeline Definition: If no timeout is set on the PipelineRun, check your Pipeline definition using kubectl describe pipeline.
  • Cluster-Wide Default: If there are no timeouts in either of the above, the cluster-wide default in Tekton's configuration applies (apparently 60 minutes).

Important Considerations:

  • Timeouts are crucial to prevent stalled pipelines from consuming resources indefinitely.
  • If the resource shortage is temporary, the pod might automatically start running once enough resources become available (before timing out).

@sibelius

can we increase the PENDING timeout?

or put these tasks in a "queue"?

like the error CouldNotGetTask

@benoitschipper

benoitschipper commented Jun 17, 2024

can we increase the PENDING timeout?

or put these tasks in a "queue"?

like the error CouldNotGetTask

A queueing mechanism in Tekton would be great, but this is the thread for that feature request, so it is not a possibility as of yet.

I am not sure if you can increase the "Pending" duration of all of your pods on the cluster, but that would be a workaround; I'm not sure there is a setting for it. Maybe setting terminationGracePeriodSeconds is an option, or a command that extends the life of a pod with a sleep 30. But this is not a great solution.
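
If waiting indefinitely is acceptable, one hedged workaround is to disable the PipelineRun timeout entirely; per the Tekton docs a timeout of 0 means no timeout, though this is worth verifying on your version (the run name is hypothetical):

apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  name: queued-run   # hypothetical
spec:
  timeout: "0"       # 0 disables the timeout, so a Pending run can wait for resources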

The best option would be a queueing mechanism for Tekton with some settings that allow you to manage the queue 🙂
