sidecar task does not work with "gpu_limit" #180
Comments
Hi, I just tried the example again after upgrading flytekit, and I'm getting the same error, but in uppercase instead of lowercase. Do I need to update anything else?
Update: I've done some more troubleshooting. It looks like removing the .lowercase() also makes the cpu_limit invalid (before the fix, it worked and only the GPU limit was invalid).
Thanks for pointing this out, will try a different fix again. Sorry for the trouble 😓
@katrogan I think we will need to create a new lyft/flyte release so that he can deploy the new propeller.
@kumare3 lyft/flyte has already been updated with the propeller fix.
Closing the issue.
…lyteorg#180) * Added validation check for all models Signed-off-by: Yuvraj <[email protected]>
Signed-off-by: Flyte-Bot <[email protected]> Co-authored-by: flyte-bot <[email protected]>
* updated paths in docker-files Signed-off-by: Samhita Alla <[email protected]> * updated dockerfile in pima diabetes example Signed-off-by: Samhita Alla <[email protected]> * updated github workflows Signed-off-by: Samhita Alla <[email protected]>
…g#180) * Revert "Adopt flyteidl's ordered variable map change (flyteorg#158)" This reverts commit 6b9f1d4. Signed-off-by: Sean Lin <[email protected]>
…g#180) * Revert "Adopt flyteidl's ordered variable map change (flyteorg#158)" This reverts commit 7c31c1e. Signed-off-by: Sean Lin <[email protected]>
## Overview

This change adds support for OTLP and sampling to the otelutils tracer provider abstraction.

Adds support for OTLP. This is the recommended replacement for the deprecated jaeger exporter.

> "go.opentelemetry.io/otel/exporters/jaeger" is deprecated: This module is no longer supported. OpenTelemetry dropped support for Jaeger exporter in July 2023. Jaeger officially accepts and recommends using OTLP

OTLP supports grpc and http, which are added as separate exporter types and configs.

Adds initial [sampling](https://opentelemetry.io/docs/languages/go/sampling/) support to the top level open telemetry config. Defaults to parent sampler `always`, but also adds a config for [TraceIdRatioBased](https://pkg.go.dev/go.opentelemetry.io/otel/sdk/trace#TraceIDRatioBased). See [these docs](https://pkg.go.dev/go.opentelemetry.io/otel/sdk/trace#ParentBased) for behavior of parent sampler.

## Test Plan

Ran local sandbox with local jaeger all-in-one and verified

- [x] otlpgrpc with parent sampler always
- [x] otlpgrpc with parent sampler traceid (along with other defaults)
- [x] otlphttp with parent sampler traceid (along with other defaults)

Jaeger

```
❯ docker run --rm -e COLLECTOR_OTLP_ENABLED=true -p 16686:16686 -p 4317:4317 -p 4318:4318 jaegertracing/all-in-one:1.52
```

Flyte config

```
otel:
  type: otlpgrpc
  sampler:
    parentSampler: traceid
```

## Rollout Plan (if applicable)

This change is a no-op, so limited concerns merging. Next step is to pull into cloud repo, wire up to open telemetry collector, and enable sampling.

## Upstream Changes

Should this change be upstreamed to OSS (flyteorg/flyte)? If so, please check this box for auditing. Note, this is the responsibility of each developer. See [this guide](https://unionai.atlassian.net/wiki/spaces/ENG/pages/447610883/Flyte+-+Union+Cloud+Development+Runbook/#When-are-versions-updated%3F).

- [x] To be upstreamed

## Jira Issue

https://unionai.atlassian.net/browse/CLOUD-1565
I have an issue using "gpu_limit" with a sidecar task (I need the sidecar task in order to enable "privileged mode" on the pod). It fails with the following error:
Workflow[yolotrain:development:train.simple_gpu.SimpleWorkflow] failed. RuntimeExecutionError: max number of system retry attempts [31/30] exhausted. Last known status message: failed at Node[a]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [sidecar]: [Invalid] failed to create resource, caused by: Pod "zxq9zn3huq-a-0" is invalid: [spec.containers[0].resources.limits[gpu]: Invalid value: "gpu": must be a standard resource type or fully qualified, spec.containers[0].resources.limits[gpu]: Invalid value: "gpu": must be a standard resource for containers, spec.containers[0].resources.requests[gpu]: Invalid value: "gpu": must be a standard resource type or fully qualified, spec.containers[0].resources.requests[gpu]: Invalid value: "gpu": must be a standard resource for containers]
This is the code that I tested it with:
The same code using @python_task(gpu_limit="1") instead of the sidecar_task works fine.
As a workaround, I had to specify the limits inside generate_pod_spec_for_task() like this:
I'm running this on bare-metal hardware, with k8s v1.16.1 and the NVIDIA k8s device plugin 1.0.0-beta4.
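For readers hitting the same error, here is a minimal sketch of what the error message is asking for. It is not the snippet from the original report and it does not use flytekit's API. It assumes the GPUs are advertised under the resource name `nvidia.com/gpu` (the name the NVIDIA device plugin registers), and the helper names `is_valid_resource_name` and `build_container_resources` are hypothetical. The validity check is only a rough approximation of the Kubernetes rule quoted in the error ("must be a standard resource type or fully qualified").

```python
# Hypothetical helpers for illustration only; not part of flytekit or Kubernetes.

# A few standard container resource names (not an exhaustive list).
STANDARD_RESOURCES = {"cpu", "memory", "ephemeral-storage"}

# Assumption: the cluster exposes GPUs through the NVIDIA device plugin.
GPU_RESOURCE_NAME = "nvidia.com/gpu"


def is_valid_resource_name(name: str) -> bool:
    """Approximate the check from the error: standard name or domain-qualified."""
    return name in STANDARD_RESOURCES or "/" in name


def build_container_resources(cpu: str, memory: str, gpus: int) -> dict:
    """Build a container `resources` mapping with a fully qualified GPU name."""
    limits = {"cpu": cpu, "memory": memory, GPU_RESOURCE_NAME: str(gpus)}
    # Extended resources such as GPUs need requests equal to limits.
    return {"limits": limits, "requests": dict(limits)}


resources = build_container_resources(cpu="1", memory="1Gi", gpus=1)
```

A bare `gpu` key, as in the failing pod, does not pass this check, while every key produced by `build_container_resources` does. The resulting mapping has the same shape as the `resources` field of a container in a pod spec.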