-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TestPipelineRun/fan-in_and_fan-out is broken #375
Comments
@adrcunha can you point as at or give us some details on what the clusters look like that these tests run against? Seems like we're going to have to tackle this by either:
|
This looks like an ominous error log:
|
Trying to gather more info for tektoncd#375 - does the test fail when it's run on it's own? (not 100% sure i got the syntax right haha just winging it)
I think it could be interesting to figure out when this started happening. Timeline:
Last pass:
First failure (only doc changes b/w these 2 from what I can see, so it looks like integration tests were not run):
|
Hm I don't think that timeline is quite correct (stuff is missing), the "closed" PR order doesn't seem to be working for this, gonna try again looking at commits directly. |
Okay looking at the commits in order (https://github.com/knative/build-pipeline/commits/master) (assuming this is the right order) the first commit to fail is 5f28380 Logs (started at 2018-12-30 01:29 PST): The last commit to pass, right before that is: Logs (started at 2018-12-30 00:06 PST): (I'm not 100% sure if this includes the additional check that Tide does right before merging 🤔 ) |
oooo i think i reproduced this locally by switching from region The pod status:
The taskrun status (looks like the same error when i repro-ed in #377 (comment)):
How I created the cluster:
|
Okay so I'm pretty sure I understand what's going on now. The test started failing because in knative/test-infra#297 we switched from using zonal clusters to using regional clusters. Looking at the GKE regional cluster docs on persistent storage:
So any PVCs used by a pod must (just happen to) be created in the same zone the pod ends up running in. Unfortunately we are creating 2 PVCs that each pod which declares an output needs:
It seems like what is happening is these are both getting created in different zones, and so any pod that tries to use both can't be scheduled. I'm not 100% sure that's happening b/c I'm not sure how to see which zone a PVC has been created in, however I feel pretty confident about this because I experimented by:
This is obviously not a great state to stay in since it means Pipelines wouldn't work with a Regional cluster!! However for logs long term, once we have knative/build#470, we'll stop using a PVC for logs, and also @shashwathi has talked about exploring not using PVCs for input/output linking either. @adrcunha in the meantime can we have a flag exposed which will allow us to go back to using a zonal cluster instead of a regional cluster? i.e. something that will let us use the behaviour before knative/test-infra#297 so we can keep testing this over time Once we are back in a state where we are creating 0 or 1 PVCs, we can go back to using the regional cluster. |
@mattmoor has confirmed that this seems likely to be the problem, see also https://kubernetes.io/blog/2018/10/11/topology-aware-volume-provisioning-in-kubernetes/ |
I don't see why not. It's not super trivial, you'll have to change the flags in a few places. I recommend using an environment variable to turn on this behavior (so it's always turned on for build-pipeline e2e tests) instead of a flag. Feel free to send a PR for review. |
kk sounds good thanks @adrcunha ! |
btw shout out to @shashwathi who pretty much nailed the diagnosis in #377 (comment) 🎉 🙏 |
Don't worry about this anymore, due to pressing matters (aka knative/test-infra#383) I'm working on this. |
Oh great!! Thanks @adrcunha :D |
We noticed early on that logs from init containers are often cleaned up immediately by k8s, particularly if the containers are short running (e.g. just echoing "hello world"). We started down a path to correct that, which takes an approach based on Prow's entrypoint solution (https://github.com/kubernetes/test-infra/tree/master/prow/cmd/entrypoint) (even using the same image at the moment!) which wraps the user's provided command and streams logs to a volume, from which the logs can be uploaded/streamed by a sidecar. Since we are using init containers for step execution, we can't yet use sidecars, but we are addressing that in tektoncd#224 (also an entrypoint re-writing based solution). Once we have that, we can sidecar support, starting with GCS as a POC (#107) and moving into other types. In the meantime, to enable us to get logs (particularly in tests), we had the taskrun controller create a PVC on the fly to hold logs. This has two problems: * The PVCs are not cleaned up so this is an unexpected side effect for users * Combined with PVC based input + ouput linking, this causes scheduling problems for the resulting pods (tektoncd#375) Now that we want to have an official release, this would be a bad state to release in, so we will remove this magical log PVC creation logic, which was never our intended end state anyway. Since we _do_ need the entrypoint rewriting and log interception logic in the long run, this commit leaves most functionality intact, removing only the PVC creation and changing the volume being used to an `emptyDir`, which is what we will likely use for #107 (and this is how Prow handles this as well). This means the released functionality will be streaming logs to a location where nothing can read them, however I think it is better than completely removing the functionality b/c: 1. We need the functionality in the long run 2. Users should be prepared for this functionality (e.g. dealing with edge cases around the taskrun controller being able to fetch an image's entrypoint) Fixes tektoncd#387
We noticed early on that logs from init containers are often cleaned up immediately by k8s, particularly if the containers are short running (e.g. just echoing "hello world"). We started down a path to correct that, which takes an approach based on Prow's entrypoint solution (https://github.com/kubernetes/test-infra/tree/master/prow/cmd/entrypoint) (even using the same image at the moment!) which wraps the user's provided command and streams logs to a volume, from which the logs can be uploaded/streamed by a sidecar. Since we are using init containers for step execution, we can't yet use sidecars, but we are addressing that in tektoncd#224 (also an entrypoint re-writing based solution). Once we have that, we can sidecar support, starting with GCS as a POC (#107) and moving into other types. In the meantime, to enable us to get logs (particularly in tests), we had the taskrun controller create a PVC on the fly to hold logs. This has two problems: * The PVCs are not cleaned up so this is an unexpected side effect for users * Combined with PVC based input + ouput linking, this causes scheduling problems for the resulting pods (tektoncd#375) Now that we want to have an official release, this would be a bad state to release in, so we will remove this magical log PVC creation logic, which was never our intended end state anyway. Since we _do_ need the entrypoint rewriting and log interception logic in the long run, this commit leaves most functionality intact, removing only the PVC creation and changing the volume being used to an `emptyDir`, which is what we will likely use for #107 (and this is how Prow handles this as well). This means the released functionality will be streaming logs to a location where nothing can read them, however I think it is better than completely removing the functionality b/c: 1. We need the functionality in the long run 2. Users should be prepared for this functionality (e.g. dealing with edge cases around the taskrun controller being able to fetch an image's entrypoint) Fixes tektoncd#387
We noticed early on that logs from init containers are often cleaned up immediately by k8s, particularly if the containers are short running (e.g. just echoing "hello world"). We started down a path to correct that, which takes an approach based on Prow's entrypoint solution (https://github.com/kubernetes/test-infra/tree/master/prow/cmd/entrypoint) (even using the same image at the moment!) which wraps the user's provided command and streams logs to a volume, from which the logs can be uploaded/streamed by a sidecar. Since we are using init containers for step execution, we can't yet use sidecars, but we are addressing that in tektoncd#224 (also an entrypoint re-writing based solution). Once we have that, we can sidecar support, starting with GCS as a POC (#107) and moving into other types. In the meantime, to enable us to get logs (particularly in tests), we had the taskrun controller create a PVC on the fly to hold logs. This has two problems: * The PVCs are not cleaned up so this is an unexpected side effect for users * Combined with PVC based input + ouput linking, this causes scheduling problems for the resulting pods (tektoncd#375) Now that we want to have an official release, this would be a bad state to release in, so we will remove this magical log PVC creation logic, which was never our intended end state anyway. Since we _do_ need the entrypoint rewriting and log interception logic in the long run, this commit leaves most functionality intact, removing only the PVC creation and changing the volume being used to an `emptyDir`, which is what we will likely use for #107 (and this is how Prow handles this as well). This means the released functionality will be streaming logs to a location where nothing can read them, however I think it is better than completely removing the functionality b/c: 1. We need the functionality in the long run 2. Users should be prepared for this functionality (e.g. dealing with edge cases around the taskrun controller being able to fetch an image's entrypoint) Fixes tektoncd#387
We noticed early on that logs from init containers are often cleaned up immediately by k8s, particularly if the containers are short running (e.g. just echoing "hello world"). We started down a path to correct that, which takes an approach based on Prow's entrypoint solution (https://github.com/kubernetes/test-infra/tree/master/prow/cmd/entrypoint) (even using the same image at the moment!) which wraps the user's provided command and streams logs to a volume, from which the logs can be uploaded/streamed by a sidecar. Since we are using init containers for step execution, we can't yet use sidecars, but we are addressing that in tektoncd#224 (also an entrypoint re-writing based solution). Once we have that, we can sidecar support, starting with GCS as a POC (#107) and moving into other types. In the meantime, to enable us to get logs (particularly in tests), we had the taskrun controller create a PVC on the fly to hold logs. This has two problems: * The PVCs are not cleaned up so this is an unexpected side effect for users * Combined with PVC based input + ouput linking, this causes scheduling problems for the resulting pods (tektoncd#375) Now that we want to have an official release, this would be a bad state to release in, so we will remove this magical log PVC creation logic, which was never our intended end state anyway. Since we _do_ need the entrypoint rewriting and log interception logic in the long run, this commit leaves most functionality intact, removing only the PVC creation and changing the volume being used to an `emptyDir`, which is what we will likely use for #107 (and this is how Prow handles this as well). This means the released functionality will be streaming logs to a location where nothing can read them, however I think it is better than completely removing the functionality b/c: 1. We need the functionality in the long run 2. Users should be prepared for this functionality (e.g. dealing with edge cases around the taskrun controller being able to fetch an image's entrypoint) Fixes tektoncd#387
We noticed early on that logs from init containers are often cleaned up immediately by k8s, particularly if the containers are short running (e.g. just echoing "hello world"). We started down a path to correct that, which takes an approach based on Prow's entrypoint solution (https://github.com/kubernetes/test-infra/tree/master/prow/cmd/entrypoint) (even using the same image at the moment!) which wraps the user's provided command and streams logs to a volume, from which the logs can be uploaded/streamed by a sidecar. Since we are using init containers for step execution, we can't yet use sidecars, but we are addressing that in #224 (also an entrypoint re-writing based solution). Once we have that, we can sidecar support, starting with GCS as a POC (#107) and moving into other types. In the meantime, to enable us to get logs (particularly in tests), we had the taskrun controller create a PVC on the fly to hold logs. This has two problems: * The PVCs are not cleaned up so this is an unexpected side effect for users * Combined with PVC based input + ouput linking, this causes scheduling problems for the resulting pods (#375) Now that we want to have an official release, this would be a bad state to release in, so we will remove this magical log PVC creation logic, which was never our intended end state anyway. Since we _do_ need the entrypoint rewriting and log interception logic in the long run, this commit leaves most functionality intact, removing only the PVC creation and changing the volume being used to an `emptyDir`, which is what we will likely use for #107 (and this is how Prow handles this as well). This means the released functionality will be streaming logs to a location where nothing can read them, however I think it is better than completely removing the functionality b/c: 1. We need the functionality in the long run 2. Users should be prepared for this functionality (e.g. dealing with edge cases around the taskrun controller being able to fetch an image's entrypoint) Fixes #387
Verifying the pipelinerun example worked was disabled due to the combo of tektoncd#375 and tektoncd#443, but now that we've removed the extra log PVC in tektoncd#443 we shouldn't run into this issue anymore :D
Verifying the pipelinerun example worked was disabled due to the combo of tektoncd#375 and tektoncd#443, but now that we've removed the extra log PVC in tektoncd#443 we shouldn't run into this issue anymore :D
We have since removed the double PVC issue (#443) so we should now be able to re-enable this test. |
/assign @shashwathi |
fixes tektoncd#375 - It was disabled when gke test infrastructure was run in multi regional cluster and multiple PVCs were allocated across regions. Entrypoint PVC has been removed in last release and there is only one PVC created to support sharing resources across taskruns
fixes #375 - It was disabled when gke test infrastructure was run in multi regional cluster and multiple PVCs were allocated across regions. Entrypoint PVC has been removed in last release and there is only one PVC created to support sharing resources across taskruns
The test is timing out after 10 minutes:
This seems to occur only when running on Prow.
Sample log:
https://gubernator.knative.dev/build/knative-prow/pr-logs/pull/knative_build-pipeline/370/pull-knative-build-pipeline-integration-tests/1083012932113010688/
The text was updated successfully, but these errors were encountered: