
TestPipelineRun/fan-in_and_fan-out is broken #375

Closed

adrcunha opened this issue Jan 9, 2019 · 15 comments · Fixed by #551
Comments

@adrcunha
Contributor

adrcunha commented Jan 9, 2019

The test is timing out after 10 minutes:

I0109 15:14:22.583]     --- FAIL: TestPipelineRun/fan-in_and_fan-out (600.37s)
I0109 15:14:22.583]         pipelinerun_test.go:153: Error waiting for PipelineRun helloworld-pipelinerun1 to finish: timed out waiting for the condition

This seems to occur only when running on Prow.

Sample log:

https://gubernator.knative.dev/build/knative-prow/pr-logs/pull/knative_build-pipeline/370/pull-knative-build-pipeline-integration-tests/1083012932113010688/

@bobcatfish
Collaborator

@adrcunha can you point us at or give us some details on what the clusters that these tests run against look like?

Seems like we're going to have to tackle this by either:

  1. trying to make our own clusters as similar as possible, or
  2. continuing to fiddle with this test in a PR to try to get more info about what's going on

@bobcatfish
Collaborator

bobcatfish commented Jan 9, 2019

This looks like an ominous error log:

googleapi: Error 400: The disk resource 'projects/knative-boskos-16/zones/us-central1-b/disks/gke-kbuild-pipeline-e2-pvc-3fba5047-1421-11e9-9528-42010a800053' is already being used by 'projects/knative-boskos-16/zones/us-central1-b/instances/gke-kbuild-pipeline-e2e--default-pool-3d89c5e2-jx0x', resourceInUseByAnotherResource
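
(Not run in this thread, but for reference: gcloud can show which instance is holding that disk, using the disk name, zone, and project straight from the error message above.)

# shows which instance(s) the persistent disk backing the PVC is attached to
gcloud compute disks describe gke-kbuild-pipeline-e2-pvc-3fba5047-1421-11e9-9528-42010a800053 \
  --zone=us-central1-b --project=knative-boskos-16 --format='value(users)'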

bobcatfish added a commit to bobcatfish/pipeline that referenced this issue Jan 9, 2019
Trying to gather more info for tektoncd#375 - does the test fail when it's run
on its own? (not 100% sure i got the syntax right haha just winging it)
@bobcatfish
Collaborator

I think it could be interesting to figure out when this started happening. Timeline:

  1. Test was added Nov 29: Link git outputs between tasks #270
  2. Last pass: test passed on Dec 20 (#343), in logs:

TestPipelineRun :PASS:

  3. First failure: test failed on Dec 30 (#352); only doc changes between these two from what I can see, so it looks like the integration tests were not run. In logs:

I1230 15:41:23.937]     --- FAIL: TestPipelineRun/fan-in_and_fan-out (600.32s)
I1230 15:41:23.937]     	pipelinerun_test.go:153: Error waiting for PipelineRun helloworld-pipelinerun1 to finish: timed out waiting for the condition

@bobcatfish
Collaborator

Hm, I don't think that timeline is quite correct (stuff is missing); the "closed" PR order doesn't seem to be working for this, so I'm gonna try again by looking at commits directly.

@bobcatfish
Collaborator

Okay, looking at the commits in order (https://github.com/knative/build-pipeline/commits/master) (assuming this is the right order), the first commit to fail is 5f28380

Logs (started at 2018-12-30 01:29 PST):
https://gubernator.knative.dev/build/knative-prow/pr-logs/pull/knative_build-pipeline/359/pull-knative-build-pipeline-integration-tests/1079308537055678466/

The last commit to pass, right before that, is:
1556faf

Logs (started at 2018-12-30 00:06 PST):
https://gubernator.knative.dev/build/knative-prow/pr-logs/pull/knative_build-pipeline/360/pull-knative-build-pipeline-integration-tests/1079287649333678082/

(I'm not 100% sure if this includes the additional check that Tide does right before merging 🤔 )

bobcatfish self-assigned this Jan 9, 2019
@bobcatfish
Collaborator

bobcatfish commented Jan 9, 2019

oooo i think i reproduced this locally by switching from zone us-central1-a to region us-central1 (which happened in 5f28380 via knative/test-infra#297)

The pod status:

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: 2019-01-09T22:52:02Z
    message: '0/3 nodes are available: 3 node(s) had no available volume zone.'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending

The taskrun status (looks like the same error as the repro in #377 (comment)):

  status:
    conditions:
    - lastTransitionTime: 2019-01-09T22:52:14Z
      message: 'pod status "PodScheduled":"False"; message: "0/3 nodes are available:
        3 node(s) had no available volume zone."'
      reason: Pending
      status: Unknown
      type: Succeeded
    podName: helloworld-pipelinerun1-create-file-kritis-pod-d29ac6
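
For reference (not run here), the same "no available volume zone" message can also be pulled from the scheduler events for that pod:

kubectl describe pod helloworld-pipelinerun1-create-file-kritis-pod-d29ac6
# or just the events for that pod:
kubectl get events --field-selector involvedObject.name=helloworld-pipelinerun1-create-file-kritis-pod-d29ac6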

How I created the cluster:

gcloud container clusters create --enable-autoscaling --min-nodes=1 --max-nodes=3 --scopes=cloud-platform --enable-basic-auth --no-issue-client-certificate --project=$PROJECT_ID --region=us-central1 --machine-type=n1-standard-4 --image-type=cos --num-nodes=1 --cluster-version=1.11.5 catzone
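
For comparison, the zonal equivalent is the same command with --zone instead of --region (the zone name is an assumption, i.e. whatever the pre-knative/test-infra#297 jobs used):

gcloud container clusters create --enable-autoscaling --min-nodes=1 --max-nodes=3 --scopes=cloud-platform --enable-basic-auth --no-issue-client-certificate --project=$PROJECT_ID --zone=us-central1-a --machine-type=n1-standard-4 --image-type=cos --num-nodes=1 --cluster-version=1.11.5 catzone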

@bobcatfish
Collaborator

bobcatfish commented Jan 10, 2019

Okay so I'm pretty sure I understand what's going on now. The test started failing because in knative/test-infra#297 we switched from using zonal clusters to using regional clusters.

Looking at the GKE regional cluster docs on persistent storage:

When you add persistent storage to your cluster, unless a zone is specified, GKE assigns the disk to a single zone. GKE chooses the zone at random. ... Once a persistent disk is provisioned, any Pods referencing the disk are scheduled to the same zone as the disk.

So any PVCs used by a pod must (just happen to) be created in the same zone the pod ends up running in.

Unfortunately we are creating 2 PVCs that each pod which declares an output needs:

  1. The PipelineRun controller creates a PVC for outputs
  2. The TaskRun controller creates a PVC for logs

It seems like what is happening is these are both getting created in different zones, and so any pod that tries to use both can't be scheduled.

I'm not 100% sure that's happening, b/c I'm not sure how to see which zone a PVC has been created in; however, I feel pretty confident about this because I experimented by:

  1. Only creating the PipelineRun controller PVC (was scheduled)
  2. Only creating the TaskRun controller PVC (was scheduled)
  3. Having each of the controllers create 2 PVCs (and the other creates none) (was scheduled)
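
(For the zone question above, one way to check, not something run here: the GCE PD provisioner labels each bound PV with the zone its disk was created in, assuming the failure-domain.beta.kubernetes.io/zone label convention used on GKE clusters of this vintage.)

# list PVs with the zone label as an extra column; the CLAIM column maps each PV
# back to the PipelineRun/TaskRun PVC, so two different zones here would confirm the theory
kubectl get pv -L failure-domain.beta.kubernetes.io/zone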

This is obviously not a great state to stay in, since it means Pipelines wouldn't work with a regional cluster!! However, long term, once we have knative/build#470 we'll stop using a PVC for logs, and @shashwathi has also talked about exploring not using PVCs for input/output linking either.

@adrcunha in the meantime can we have a flag exposed which will allow us to go back to using a zonal cluster instead of a regional cluster? i.e. something that will let us use the behaviour before knative/test-infra#297 so we can keep testing this over time

Once we are back in a state where we are creating 0 or 1 PVCs, we can go back to using the regional cluster.

@bobcatfish
Collaborator

@mattmoor has confirmed that this seems likely to be the problem, see also https://kubernetes.io/blog/2018/10/11/topology-aware-volume-provisioning-in-kubernetes/
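
The approach described in that post boils down to a StorageClass with volumeBindingMode: WaitForFirstConsumer, which delays provisioning until a pod is scheduled so the disk gets created in that pod's zone. A minimal sketch (assuming a cluster new enough for the feature, which went beta in 1.12, so not the 1.11.5 cluster above; the class name is illustrative):

kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: topology-aware-standard
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-standard
volumeBindingMode: WaitForFirstConsumer
EOF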

@adrcunha
Contributor Author

@adrcunha in the meantime can we have a flag exposed which will allow us to go back to using a zonal
cluster instead of a regional cluster? i.e. something that will let us use the behaviour before
knative/test-infra#297 so we can keep testing this
over time

I don't see why not. It's not super trivial; you'll have to change the flags in a few places. I recommend using an environment variable to turn on this behavior (so it's always turned on for build-pipeline e2e tests) instead of a flag. Feel free to send a PR for review.
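
A rough sketch of the kind of environment-variable switch being suggested (the variable names are illustrative, not the actual knative/test-infra code; the other flags are copied from the cluster-creation command above):

# in the cluster-creation script:
if [[ -n "${E2E_CLUSTER_ZONE:-}" ]]; then
  location_flag="--zone=${E2E_CLUSTER_ZONE}"                    # zonal, i.e. the pre-knative/test-infra#297 behaviour
else
  location_flag="--region=${E2E_CLUSTER_REGION:-us-central1}"   # regional, the current default
fi
gcloud container clusters create "${location_flag}" \
  --enable-autoscaling --min-nodes=1 --max-nodes=3 \
  --scopes=cloud-platform --enable-basic-auth --no-issue-client-certificate \
  --project="${PROJECT_ID}" --machine-type=n1-standard-4 --image-type=cos \
  --num-nodes=1 --cluster-version=1.11.5 catzone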

@bobcatfish
Collaborator

Feel free to send a PR for review

kk sounds good thanks @adrcunha !

@bobcatfish
Collaborator

btw shout out to @shashwathi who pretty much nailed the diagnosis in #377 (comment) 🎉 🙏

@adrcunha
Contributor Author

Feel free to send a PR for review
kk sounds good thanks @adrcunha !

Don't worry about this anymore; due to pressing matters (aka knative/test-infra#383), I'm working on this.

@bobcatfish
Collaborator

Oh great!! Thanks @adrcunha :D

bobcatfish assigned adrcunha and unassigned bobcatfish Jan 15, 2019
bobcatfish added a commit to bobcatfish/pipeline that referenced this issue Jan 27, 2019
We noticed early on that logs from init containers are often cleaned up
immediately by k8s, particularly if the containers are short running
(e.g. just echoing "hello world"). We started down a path to correct
that, which takes an approach based on Prow's entrypoint solution
(https://github.com/kubernetes/test-infra/tree/master/prow/cmd/entrypoint)
(even using the same image at the moment!) which wraps the user's
provided command and streams logs to a volume, from which the logs can
be uploaded/streamed by a sidecar.

Since we are using init containers for step execution, we can't yet use
sidecars, but we are addressing that in tektoncd#224 (also an entrypoint
re-writing based solution). Once we have that, we can add sidecar support,
starting with GCS as a POC (#107) and moving into other types.

In the meantime, to enable us to get logs (particularly in tests), we
had the taskrun controller create a PVC on the fly to hold logs. This
has two problems:
* The PVCs are not cleaned up so this is an unexpected side effect for
  users
* Combined with PVC-based input + output linking, this causes scheduling
  problems for the resulting pods (tektoncd#375)

Now that we want to have an official release, this would be a bad state
to release in, so we will remove this magical log PVC creation logic,
which was never our intended end state anyway.

Since we _do_ need the entrypoint rewriting and log interception logic
in the long run, this commit leaves most functionality intact, removing
only the PVC creation and changing the volume being used to an
`emptyDir`, which is what we will likely use for #107 (and this is how
Prow handles this as well). This means the released functionality will
be streaming logs to a location where nothing can read them, however I
think it is better than completely removing the functionality b/c:
1. We need the functionality in the long run
2. Users should be prepared for this functionality (e.g. dealing with
   edge cases around the taskrun controller being able to fetch an
   image's entrypoint)

Fixes tektoncd#387
bobcatfish added a commit to bobcatfish/pipeline that referenced this issue Jan 30, 2019
bobcatfish added a commit to bobcatfish/pipeline that referenced this issue Jan 30, 2019
bobcatfish added a commit to bobcatfish/pipeline that referenced this issue Jan 30, 2019
knative-prow-robot pushed a commit that referenced this issue Jan 31, 2019
bobcatfish added a commit to bobcatfish/pipeline that referenced this issue Feb 6, 2019
Verifying the pipelinerun example worked was disabled due to the combo
of tektoncd#375 and tektoncd#443, but now that we've removed the extra log PVC in tektoncd#443
we shouldn't run into this issue anymore :D
bobcatfish added a commit to bobcatfish/pipeline that referenced this issue Feb 6, 2019
knative-prow-robot pushed a commit that referenced this issue Feb 6, 2019
@bobcatfish
Collaborator

We have since fixed the double-PVC problem (#443), so we should now be able to re-enable this test.

@shashwathi
Contributor

/assign @shashwathi

shashwathi pushed a commit to shashwathi/build-pipeline that referenced this issue Feb 25, 2019
fixes tektoncd#375

- The test was disabled when the GKE test infrastructure switched to running in a regional
cluster and multiple PVCs were allocated across zones. The entrypoint (log) PVC
has been removed in the last release, and there is now only one PVC created, to
support sharing resources across taskruns
knative-prow-robot pushed a commit that referenced this issue Feb 25, 2019