TestPipelineRun/fan-in_and_fan-out is broken #375

adrcunha · 2019-01-09T18:36:50Z

The test is timing out after 10 minutes:

I0109 15:14:22.583]     --- FAIL: TestPipelineRun/fan-in_and_fan-out (600.37s)
I0109 15:14:22.583]         pipelinerun_test.go:153: Error waiting for PipelineRun helloworld-pipelinerun1 to finish: timed out waiting for the condition

This seems to occur only when running on Prow.

Sample log:

https://gubernator.knative.dev/build/knative-prow/pr-logs/pull/knative_build-pipeline/370/pull-knative-build-pipeline-integration-tests/1083012932113010688/

The text was updated successfully, but these errors were encountered:

bobcatfish · 2019-01-09T19:17:24Z

@adrcunha can you point as at or give us some details on what the clusters look like that these tests run against?

Seems like we're going to have to tackle this by either:

trying to make our own clusters as similar as possible or
keep fiddling with this test in a PR and try to get more info about what's going on

bobcatfish · 2019-01-09T19:18:01Z

This looks like an ominous error log:

googleapi: Error 400: The disk resource 'projects/knative-boskos-16/zones/us-central1-b/disks/gke-kbuild-pipeline-e2-pvc-3fba5047-1421-11e9-9528-42010a800053' is already being used by 'projects/knative-boskos-16/zones/us-central1-b/instances/gke-kbuild-pipeline-e2e--default-pool-3d89c5e2-jx0x', resourceInUseByAnotherResource

Trying to gather more info for tektoncd#375 - does the test fail when it's run on it's own? (not 100% sure i got the syntax right haha just winging it)

bobcatfish · 2019-01-09T19:39:33Z

I think it could be interesting to figure out when this started happening. Timeline:

Test was added Nov 29 Link git outputs between tasks #270

Last pass:
2. Test passed on Dec 20 #343, in logs:

TestPipelineRun :PASS:

First failure (only doc changes b/w these 2 from what I can see, so it looks like integration tests were not run):
3. Test failed on Dec 30 #352, in logs:

I1230 15:41:23.937]     --- FAIL: TestPipelineRun/fan-in_and_fan-out (600.32s)
I1230 15:41:23.937]     	pipelinerun_test.go:153: Error waiting for PipelineRun helloworld-pipelinerun1 to finish: timed out waiting for the condition

bobcatfish · 2019-01-09T19:56:01Z

Hm I don't think that timeline is quite correct (stuff is missing), the "closed" PR order doesn't seem to be working for this, gonna try again looking at commits directly.

bobcatfish · 2019-01-09T20:04:47Z

Okay looking at the commits in order (https://github.com/knative/build-pipeline/commits/master) (assuming this is the right order) the first commit to fail is 5f28380

Logs (started at 2018-12-30 01:29 PST):
https://gubernator.knative.dev/build/knative-prow/pr-logs/pull/knative_build-pipeline/359/pull-knative-build-pipeline-integration-tests/1079308537055678466/

The last commit to pass, right before that is:
1556faf

Logs (started at 2018-12-30 00:06 PST):
https://gubernator.knative.dev/build/knative-prow/pr-logs/pull/knative_build-pipeline/360/pull-knative-build-pipeline-integration-tests/1079287649333678082/

(I'm not 100% sure if this includes the additional check that Tide does right before merging 🤔 )

bobcatfish · 2019-01-09T22:56:29Z

oooo i think i reproduced this locally by switching from region us-central-1a to zone us-central-1 (which happened in 5f28380 via knative/test-infra#297

The pod status:

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: 2019-01-09T22:52:02Z
    message: '0/3 nodes are available: 3 node(s) had no available volume zone.'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending

The taskrun status (looks like the same error when i repro-ed in #377 (comment)):

  status:
    conditions:
    - lastTransitionTime: 2019-01-09T22:52:14Z
      message: 'pod status "PodScheduled":"False"; message: "0/3 nodes are available:
        3 node(s) had no available volume zone."'
      reason: Pending
      status: Unknown
      type: Succeeded
    podName: helloworld-pipelinerun1-create-file-kritis-pod-d29ac6

How I created the cluster:

gcloud container clusters create --enable-autoscaling --min-nodes=1 --max-nodes=3 --scopes=cloud-platform --enable-basic-auth --no-issue-client-certificate --project=$PROJECT_ID --region=us-central1 --machine-type=n1-standard-4 --image-type=cos --num-nodes=1 --cluster-version=1.11.5 catzone

bobcatfish · 2019-01-10T01:50:34Z

Okay so I'm pretty sure I understand what's going on now. The test started failing because in knative/test-infra#297 we switched from using zonal clusters to using regional clusters.

Looking at the GKE regional cluster docs on persistent storage:

When you add persistent storage to your cluster, unless a zone is specified, GKE assigns the disk to a single zone. GKE chooses the zone at random. ... Once a persistent disk is provisioned, any Pods referencing the disk are scheduled to the same zone as the disk.

So any PVCs used by a pod must (just happen to) be created in the same zone the pod ends up running in.

Unfortunately we are creating 2 PVCs that each pod which declares an output needs:

The PipelineRun controller creates a PVC for outputs
The TaskRun controller creates a PVC for logs

It seems like what is happening is these are both getting created in different zones, and so any pod that tries to use both can't be scheduled.

I'm not 100% sure that's happening b/c I'm not sure how to see which zone a PVC has been created in, however I feel pretty confident about this because I experimented by:

Only creating the PipelineRun controller PVC (was scheduled)
Only creating the TaskRun controller PVC (was scheduled)
Having each of the controllers create 2 PVCs (and the other creates none) (was scheduled)

This is obviously not a great state to stay in since it means Pipelines wouldn't work with a Regional cluster!! However for logs long term, once we have knative/build#470, we'll stop using a PVC for logs, and also @shashwathi has talked about exploring not using PVCs for input/output linking either.

@adrcunha in the meantime can we have a flag exposed which will allow us to go back to using a zonal cluster instead of a regional cluster? i.e. something that will let us use the behaviour before knative/test-infra#297 so we can keep testing this over time

Once we are back in a state where we are creating 0 or 1 PVCs, we can go back to using the regional cluster.

bobcatfish · 2019-01-10T01:58:20Z

@mattmoor has confirmed that this seems likely to be the problem, see also https://kubernetes.io/blog/2018/10/11/topology-aware-volume-provisioning-in-kubernetes/

adrcunha · 2019-01-10T02:35:44Z

@adrcunha in the meantime can we have a flag exposed which will allow us to go back to using a zonal
cluster instead of a regional cluster? i.e. something that will let us use the behaviour before
knative/test-infra#297 so we can keep testing this
over time

I don't see why not. It's not super trivial, you'll have to change the flags in a few places. I recommend using an environment variable to turn on this behavior (so it's always turned on for build-pipeline e2e tests) instead of a flag. Feel free to send a PR for review.

bobcatfish · 2019-01-10T23:23:08Z

Feel free to send a PR for review

kk sounds good thanks @adrcunha !

bobcatfish · 2019-01-11T19:22:42Z

btw shout out to @shashwathi who pretty much nailed the diagnosis in #377 (comment) 🎉 🙏

adrcunha · 2019-01-12T05:12:06Z

Feel free to send a PR for review
kk sounds good thanks @adrcunha !

Don't worry about this anymore, due to pressing matters (aka knative/test-infra#383) I'm working on this.

bobcatfish · 2019-01-15T14:27:51Z

Oh great!! Thanks @adrcunha :D

We noticed early on that logs from init containers are often cleaned up immediately by k8s, particularly if the containers are short running (e.g. just echoing "hello world"). We started down a path to correct that, which takes an approach based on Prow's entrypoint solution (https://github.com/kubernetes/test-infra/tree/master/prow/cmd/entrypoint) (even using the same image at the moment!) which wraps the user's provided command and streams logs to a volume, from which the logs can be uploaded/streamed by a sidecar. Since we are using init containers for step execution, we can't yet use sidecars, but we are addressing that in tektoncd#224 (also an entrypoint re-writing based solution). Once we have that, we can sidecar support, starting with GCS as a POC (#107) and moving into other types. In the meantime, to enable us to get logs (particularly in tests), we had the taskrun controller create a PVC on the fly to hold logs. This has two problems: * The PVCs are not cleaned up so this is an unexpected side effect for users * Combined with PVC based input + ouput linking, this causes scheduling problems for the resulting pods (tektoncd#375) Now that we want to have an official release, this would be a bad state to release in, so we will remove this magical log PVC creation logic, which was never our intended end state anyway. Since we _do_ need the entrypoint rewriting and log interception logic in the long run, this commit leaves most functionality intact, removing only the PVC creation and changing the volume being used to an `emptyDir`, which is what we will likely use for #107 (and this is how Prow handles this as well). This means the released functionality will be streaming logs to a location where nothing can read them, however I think it is better than completely removing the functionality b/c: 1. We need the functionality in the long run 2. Users should be prepared for this functionality (e.g. dealing with edge cases around the taskrun controller being able to fetch an image's entrypoint) Fixes tektoncd#387

We noticed early on that logs from init containers are often cleaned up immediately by k8s, particularly if the containers are short running (e.g. just echoing "hello world"). We started down a path to correct that, which takes an approach based on Prow's entrypoint solution (https://github.com/kubernetes/test-infra/tree/master/prow/cmd/entrypoint) (even using the same image at the moment!) which wraps the user's provided command and streams logs to a volume, from which the logs can be uploaded/streamed by a sidecar. Since we are using init containers for step execution, we can't yet use sidecars, but we are addressing that in #224 (also an entrypoint re-writing based solution). Once we have that, we can sidecar support, starting with GCS as a POC (#107) and moving into other types. In the meantime, to enable us to get logs (particularly in tests), we had the taskrun controller create a PVC on the fly to hold logs. This has two problems: * The PVCs are not cleaned up so this is an unexpected side effect for users * Combined with PVC based input + ouput linking, this causes scheduling problems for the resulting pods (#375) Now that we want to have an official release, this would be a bad state to release in, so we will remove this magical log PVC creation logic, which was never our intended end state anyway. Since we _do_ need the entrypoint rewriting and log interception logic in the long run, this commit leaves most functionality intact, removing only the PVC creation and changing the volume being used to an `emptyDir`, which is what we will likely use for #107 (and this is how Prow handles this as well). This means the released functionality will be streaming logs to a location where nothing can read them, however I think it is better than completely removing the functionality b/c: 1. We need the functionality in the long run 2. Users should be prepared for this functionality (e.g. dealing with edge cases around the taskrun controller being able to fetch an image's entrypoint) Fixes #387

Verifying the pipelinerun example worked was disabled due to the combo of tektoncd#375 and tektoncd#443, but now that we've removed the extra log PVC in tektoncd#443 we shouldn't run into this issue anymore :D

Verifying the pipelinerun example worked was disabled due to the combo of #375 and #443, but now that we've removed the extra log PVC in #443 we shouldn't run into this issue anymore :D

bobcatfish · 2019-02-21T02:00:57Z

We have since removed the double PVC issue (#443) so we should now be able to re-enable this test.

shashwathi · 2019-02-21T14:59:36Z

/assign @shashwathi

fixes tektoncd#375 - It was disabled when gke test infrastructure was run in multi regional cluster and multiple PVCs were allocated across regions. Entrypoint PVC has been removed in last release and there is only one PVC created to support sharing resources across taskruns

fixes #375 - It was disabled when gke test infrastructure was run in multi regional cluster and multiple PVCs were allocated across regions. Entrypoint PVC has been removed in last release and there is only one PVC created to support sharing resources across taskruns

adrcunha mentioned this issue Jan 9, 2019

Unbreak the E2E tests #370

Merged

bobcatfish mentioned this issue Jan 9, 2019

[WIP] [DO NOT MERGE] Only run fan-in fan-out test #377

Closed

bobcatfish self-assigned this Jan 9, 2019

bobcatfish assigned adrcunha Jan 10, 2019

bobcatfish mentioned this issue Jan 10, 2019

0.4 release must not try to create multiple PVCs across controllers #387

Closed

bobcatfish unassigned adrcunha Jan 10, 2019

bobcatfish assigned adrcunha and unassigned bobcatfish Jan 15, 2019

bobcatfish mentioned this issue Jan 27, 2019

Remove extra log PVC #443

Merged

bobcatfish unassigned adrcunha Feb 21, 2019

bobcatfish added this to the Pipelines 0.2 🎉 🎉 🎉 milestone Feb 21, 2019

knative-prow-robot assigned shashwathi Feb 21, 2019

shashwathi mentioned this issue Feb 25, 2019

Enable fan-in fan-out e2e test #551

Merged

3 tasks

knative-prow-robot closed this as completed in #551 Feb 25, 2019

mchmarny unassigned shashwathi Mar 7, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TestPipelineRun/fan-in_and_fan-out is broken #375

TestPipelineRun/fan-in_and_fan-out is broken #375

adrcunha commented Jan 9, 2019

bobcatfish commented Jan 9, 2019

bobcatfish commented Jan 9, 2019 •

edited

Loading

bobcatfish commented Jan 9, 2019

bobcatfish commented Jan 9, 2019

bobcatfish commented Jan 9, 2019

bobcatfish commented Jan 9, 2019 •

edited

Loading

bobcatfish commented Jan 10, 2019 •

edited

Loading

bobcatfish commented Jan 10, 2019

adrcunha commented Jan 10, 2019

bobcatfish commented Jan 10, 2019

bobcatfish commented Jan 11, 2019

adrcunha commented Jan 12, 2019

bobcatfish commented Jan 15, 2019

bobcatfish commented Feb 21, 2019

shashwathi commented Feb 21, 2019

TestPipelineRun/fan-in_and_fan-out is broken #375

TestPipelineRun/fan-in_and_fan-out is broken #375

Comments

adrcunha commented Jan 9, 2019

bobcatfish commented Jan 9, 2019

bobcatfish commented Jan 9, 2019 • edited Loading

bobcatfish commented Jan 9, 2019

bobcatfish commented Jan 9, 2019

bobcatfish commented Jan 9, 2019

bobcatfish commented Jan 9, 2019 • edited Loading

bobcatfish commented Jan 10, 2019 • edited Loading

bobcatfish commented Jan 10, 2019

adrcunha commented Jan 10, 2019

bobcatfish commented Jan 10, 2019

bobcatfish commented Jan 11, 2019

adrcunha commented Jan 12, 2019

bobcatfish commented Jan 15, 2019

bobcatfish commented Feb 21, 2019

shashwathi commented Feb 21, 2019

bobcatfish commented Jan 9, 2019 •

edited

Loading

bobcatfish commented Jan 9, 2019 •

edited

Loading

bobcatfish commented Jan 10, 2019 •

edited

Loading