Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Schedule Pods in resource-constrained environments #734

Closed
imjasonh opened this issue Apr 5, 2019 · 4 comments · Fixed by #905
Closed

Schedule Pods in resource-constrained environments #734

imjasonh opened this issue Apr 5, 2019 · 4 comments · Fixed by #905
Labels
good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/feature Categorizes issue or PR as related to a new feature.

Comments

@imjasonh
Copy link
Member

imjasonh commented Apr 5, 2019

Expected Behavior

In a resource-constrained environment like a namespace with resource limits imposed (or just an insufficiently provisioned cluster), creating a TaskRun (Pod) that exceeds those limits should not fail the TaskRun, but should instead continually try to create the Pod until it either succeeds or times out.

Actual Behavior

Pods fail to start and the TaskRun is failed ~immediately.

Steps to Reproduce the Problem

  1. Define a namespace with resource constraints (e.g., 10 CPU, 10 GB RAM)
  2. Create 15 TaskRuns each requesting 1 CPU and 1 GB RAM, running hello world or something simple
  3. ~10 of those will be scheduled and will succeed, the rest will fail due to insufficient resources.

Additional Info

This is similar to how Jobs can handle Pod scheduling failures by retrying until they are successful.

It's unclear whether users would expect TaskRuns waiting for sufficient resources to queue in order of the time they were created, or whether they'd expect the Kubernetes scheduler to do whatever it needs to do to schedule the Pods. As an initial implementation it's probably fine to have Kubernetes schedule Pods, and not have to worry about enforcing FIFO.

@imjasonh imjasonh added kind/feature Categorizes issue or PR as related to a new feature. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. labels Apr 5, 2019
@ghost
Copy link

ghost commented Apr 17, 2019

/assign @sbwsg

@tekton-robot tekton-robot assigned ghost Apr 17, 2019
bobcatfish added a commit to bobcatfish/pipeline that referenced this issue Apr 22, 2019
Our TaskRun timeout end to end test has been intermittently failing
against PRs. After adding a lot of (terrible) log messages, I found out
that the reason TaskRuns weren't timing out was b/c the go routine
checking for a timeout was considering them timed out several
milliseconds before the reconcile loop would.

So what would happen was:

1. The timeout handler decided the run timed out, and triggered a
   reconcile
2. The reconcile checked if the run timed out, decided the run had a few
   more milliseconds of execution time and let it go
3. And the TaskRun would just keep running

It looks like the root cause is that when the go routine starts,
it uses a `StartTime` that has been set by `TaskRun.InitializeConditions`,
but after that, the pod is started and the `TaskRun.Status` is updated to use _the pod's_
`StartTime` instead - which is what the Reconcile loop will use from
that point forward, causing the slight drift.

This is fixed by no longer tying the start time of the TaskRun to the
pod: the TaskRun will be considered started as soon as the reconciler
starts to act on it, which is probably the functionality the user would
expect anyway (e.g. if the pod was delayed in being started, this delay
should be subtracted from the timeout, and in tektoncd#734 we are looking to be
more tolerant of pods not immediately scheduling anyway).

Fixes tektoncd#731

Co-authored-by: Nader Ziada <[email protected]>
bobcatfish added a commit to bobcatfish/pipeline that referenced this issue Apr 22, 2019
Our TaskRun timeout end to end test has been intermittently failing
against PRs. After adding a lot of (terrible) log messages, I found out
that the reason TaskRuns weren't timing out was b/c the go routine
checking for a timeout was considering them timed out several
milliseconds before the reconcile loop would.

So what would happen was:

1. The timeout handler decided the run timed out, and triggered a
   reconcile
2. The reconcile checked if the run timed out, decided the run had a few
   more milliseconds of execution time and let it go
3. And the TaskRun would just keep running

It looks like the root cause is that when the go routine starts,
it uses a `StartTime` that has been set by `TaskRun.InitializeConditions`,
but after that, the pod is started and the `TaskRun.Status` is updated to use _the pod's_
`StartTime` instead - which is what the Reconcile loop will use from
that point forward, causing the slight drift.

This is fixed by no longer tying the start time of the TaskRun to the
pod: the TaskRun will be considered started as soon as the reconciler
starts to act on it, which is probably the functionality the user would
expect anyway (e.g. if the pod was delayed in being started, this delay
should be subtracted from the timeout, and in tektoncd#734 we are looking to be
more tolerant of pods not immediately scheduling anyway).

Fixes tektoncd#731

Co-authored-by: Nader Ziada <[email protected]>
@ghost
Copy link

ghost commented Apr 23, 2019

Been working through some of the implementation details in a POC but want to drop current working notes here since I likely won't be able to work on it more until tomorrow.

  1. Catching a pod failure is relatively straightforward; checking the error message produced by the createPod() func in pkg/reconciler/v1alpha1/taskrun/taskrun.go reveals the reason. From here it's quick to parse out the error message and look for e.g. "exceeded quota" in the string. This relies on a somewhat brittle contract though. I'll also need to check for the different messages generated both by LimitRanges as well as ResourceQuotas since it looks like they both enforce resource limits on a pod. I'm currently looking around to see if there's a less brittle approach to this error checking.

  2. Once the resource constraint error is detected the pod then needs to be restarted. In my POC implementation this works by simply Enqueue()ing the TR to be re-assessed on the next reconcile loop. This results in many rapid reruns however when really it would be nicer to see an exponential backoff strategy similar to that used by k8s' job controller. The job controller uses a particular kind of workqueue to implement this ("ExponentialFailureRateLimiter") but TaskRun's controller Impl uses the "RateLimitedQueue", which is set up via knative's controller.NewImpl() func. So I'm looking at other alternatives to implement this.

tekton-robot pushed a commit that referenced this issue Apr 23, 2019
Our TaskRun timeout end to end test has been intermittently failing
against PRs. After adding a lot of (terrible) log messages, I found out
that the reason TaskRuns weren't timing out was b/c the go routine
checking for a timeout was considering them timed out several
milliseconds before the reconcile loop would.

So what would happen was:

1. The timeout handler decided the run timed out, and triggered a
   reconcile
2. The reconcile checked if the run timed out, decided the run had a few
   more milliseconds of execution time and let it go
3. And the TaskRun would just keep running

It looks like the root cause is that when the go routine starts,
it uses a `StartTime` that has been set by `TaskRun.InitializeConditions`,
but after that, the pod is started and the `TaskRun.Status` is updated to use _the pod's_
`StartTime` instead - which is what the Reconcile loop will use from
that point forward, causing the slight drift.

This is fixed by no longer tying the start time of the TaskRun to the
pod: the TaskRun will be considered started as soon as the reconciler
starts to act on it, which is probably the functionality the user would
expect anyway (e.g. if the pod was delayed in being started, this delay
should be subtracted from the timeout, and in #734 we are looking to be
more tolerant of pods not immediately scheduling anyway).

Fixes #731

Co-authored-by: Nader Ziada <[email protected]>
@ghost
Copy link

ghost commented Apr 25, 2019

I'm going to move this into a design doc. There are enough variables here to seed some discussion and it'd be good to get broader input before committing to one approach.

@ghost
Copy link

ghost commented Apr 29, 2019

I've started the design doc here including use cases, a draft implementation, some open questions and possible alternative implementations that I'm still working through.

wlynch pushed a commit to wlynch/pipeline that referenced this issue May 20, 2019
Our TaskRun timeout end to end test has been intermittently failing
against PRs. After adding a lot of (terrible) log messages, I found out
that the reason TaskRuns weren't timing out was b/c the go routine
checking for a timeout was considering them timed out several
milliseconds before the reconcile loop would.

So what would happen was:

1. The timeout handler decided the run timed out, and triggered a
   reconcile
2. The reconcile checked if the run timed out, decided the run had a few
   more milliseconds of execution time and let it go
3. And the TaskRun would just keep running

It looks like the root cause is that when the go routine starts,
it uses a `StartTime` that has been set by `TaskRun.InitializeConditions`,
but after that, the pod is started and the `TaskRun.Status` is updated to use _the pod's_
`StartTime` instead - which is what the Reconcile loop will use from
that point forward, causing the slight drift.

This is fixed by no longer tying the start time of the TaskRun to the
pod: the TaskRun will be considered started as soon as the reconciler
starts to act on it, which is probably the functionality the user would
expect anyway (e.g. if the pod was delayed in being started, this delay
should be subtracted from the timeout, and in tektoncd#734 we are looking to be
more tolerant of pods not immediately scheduling anyway).

Fixes tektoncd#731

Co-authored-by: Nader Ziada <[email protected]>
wlynch pushed a commit to wlynch/pipeline that referenced this issue May 20, 2019
Our TaskRun timeout end to end test has been intermittently failing
against PRs. After adding a lot of (terrible) log messages, I found out
that the reason TaskRuns weren't timing out was b/c the go routine
checking for a timeout was considering them timed out several
milliseconds before the reconcile loop would.

So what would happen was:

1. The timeout handler decided the run timed out, and triggered a
   reconcile
2. The reconcile checked if the run timed out, decided the run had a few
   more milliseconds of execution time and let it go
3. And the TaskRun would just keep running

It looks like the root cause is that when the go routine starts,
it uses a `StartTime` that has been set by `TaskRun.InitializeConditions`,
but after that, the pod is started and the `TaskRun.Status` is updated to use _the pod's_
`StartTime` instead - which is what the Reconcile loop will use from
that point forward, causing the slight drift.

This is fixed by no longer tying the start time of the TaskRun to the
pod: the TaskRun will be considered started as soon as the reconciler
starts to act on it, which is probably the functionality the user would
expect anyway (e.g. if the pod was delayed in being started, this delay
should be subtracted from the timeout, and in tektoncd#734 we are looking to be
more tolerant of pods not immediately scheduling anyway).

Fixes tektoncd#731

Co-authored-by: Nader Ziada <[email protected]>
nikhil-thomas pushed a commit to nikhil-thomas/pipeline that referenced this issue Aug 25, 2021
…pace-duplicate-test

Skip TestWorkspacePipelineRunDuplicateWorkspaceEntriesInvalid Test
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/feature Categorizes issue or PR as related to a new feature.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants