sidecar task does not work with "gpu_limit" #180

Closed
giordyb opened this issue Feb 18, 2020 · 5 comments · Fixed by flyteorg/flytekit#88

giordyb commented Feb 18, 2020

I have an issue using "gpu_limit" with a sidecar task (I need a sidecar task in order to enable "privileged mode" on the pod); it fails with the following error:

Workflow[yolotrain:development:train.simple_gpu.SimpleWorkflow] failed. RuntimeExecutionError: max number of system retry attempts [31/30] exhausted. Last known status message: failed at Node[a]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [sidecar]: [Invalid] failed to create resource, caused by: Pod "zxq9zn3huq-a-0" is invalid: [spec.containers[0].resources.limits[gpu]: Invalid value: "gpu": must be a standard resource type or fully qualified, spec.containers[0].resources.limits[gpu]: Invalid value: "gpu": must be a standard resource for containers, spec.containers[0].resources.requests[gpu]: Invalid value: "gpu": must be a standard resource type or fully qualified, spec.containers[0].resources.requests[gpu]: Invalid value: "gpu": must be a standard resource for containers]
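(Editorial note, not part of the original report: Kubernetes only accepts standard resource names such as cpu and memory, or fully qualified extended names such as nvidia.com/gpu, which is why the bare "gpu" key fails API validation. A minimal sketch of the rule, with a hypothetical helper name:)

# Hedged illustration of the validation that rejects "gpu": extended
# resources need a domain prefix; standard names are a fixed set.
STANDARD_RESOURCES = {"cpu", "memory", "ephemeral-storage"}

def is_valid_resource_name(name: str) -> bool:
    # Roughly what the API server's check amounts to for this error.
    return name in STANDARD_RESOURCES or "/" in name

assert not is_valid_resource_name("gpu")           # rejected, as in the log
assert is_valid_resource_name("nvidia.com/gpu")    # accepted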

This is the code I tested it with:

# Imports for the legacy (pre-1.0) flytekit SDK used here; the k8s proto
# import path is assumed from the old flytesnacks sidecar example.
from flytekit.sdk.tasks import inputs, outputs, python_task, sidecar_task
from flytekit.sdk.types import Types
from flytekit.sdk.workflow import workflow_class, Input, Output
from k8s.io.api.core.v1 import generated_pb2

def generate_pod_spec_for_task():
    pod_spec = generated_pb2.PodSpec()
    primary_container = generated_pb2.Container(name="primary")
    # Privileged mode is the reason a sidecar task is used at all.
    primary_container.securityContext.privileged = True
    pod_spec.containers.extend([primary_container])
    return pod_spec

@inputs(a=Types.Integer)
@outputs(b=Types.Integer)
@sidecar_task(
    pod_spec=generate_pod_spec_for_task(), primary_container_name="primary", gpu_limit="1"
)
def add_one(wf_params, a, b):
    # TODO lets add a test that works with tensorflow, but we need it to be in
    # a different container
    b.set(a + 1)

@workflow_class
class SimpleWorkflow(object):
    input_1 = Input(Types.Integer)
    input_2 = Input(Types.Integer, default=5, help="Not required.")
    a = add_one(a=input_1)
    output = Output(a.outputs.b, sdk_type=Types.Integer)

The same code works fine when using @python_task(gpu_limit="1") instead of the sidecar task.
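(For comparison, a minimal sketch of the working variant, assuming the legacy-SDK imports above, including python_task:)

@inputs(a=Types.Integer)
@outputs(b=Types.Integer)
@python_task(gpu_limit="1")
def add_one_plain(wf_params, a, b):
    # Same logic, no custom pod spec; gpu_limit serializes correctly here.
    b.set(a + 1)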

As a workaround, I had to specify the limits inside generate_pod_spec_for_task() like this:


# Assumed: _lazy_k8s is flytekit's lazy-loaded k8s proto module, e.g.
# from flytekit.plugins import k8s as _lazy_k8s
Quantity = _lazy_k8s.io.apimachinery.pkg.api.resource.generated_pb2.Quantity

def generate_pod_spec_for_task():
    pod_spec = generated_pb2.PodSpec()
    # pod_spec.nodeSelector.update({"accelerator": "nvidia-gtx1080ti"})

    resource_requirements = generated_pb2.ResourceRequirements()

    # Fully qualified GPU resource name, set directly on the pod spec, so
    # flytekit's broken gpu_limit path is never exercised.
    resource_requirements.requests["nvidia.com/gpu"].CopyFrom(Quantity(string="1"))
    resource_requirements.limits["nvidia.com/gpu"].CopyFrom(Quantity(string="1"))
    resource_requirements.requests["cpu"].CopyFrom(Quantity(string="4"))
    resource_requirements.limits["cpu"].CopyFrom(Quantity(string="4"))
    resource_requirements.requests["memory"].CopyFrom(Quantity(string="20Gi"))
    resource_requirements.limits["memory"].CopyFrom(Quantity(string="20Gi"))

    primary_container = generated_pb2.Container(name="primary")
    primary_container.resources.CopyFrom(resource_requirements)
    primary_container.securityContext.privileged = True

    pod_spec.containers.extend([primary_container])
    return pod_spec

I'm running this on bare-metal hardware with Kubernetes v1.16.1 and the NVIDIA k8s device plugin 1.0.0-beta4.

giordyb changed the title from sidecar does not work with gpu_limit to sidecar task does not work with "gpu_limit" on Feb 18, 2020
kumare3 added the bug (Something isn't working) label on Mar 4, 2020

giordyb commented Mar 5, 2020

Hi,

I just tried the example again after upgrading flytekit, and I'm getting the same error but with the resource names uppercased instead of lowercased. Do I need to update anything else?

Workflow[yolotrain:development:object_detection.simple_gpu.SimpleWorkflow] failed. RuntimeExecutionError: max number of system retry attempts [31/30] exhausted. Last known status message: failed at Node[a]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [sidecar]: [Invalid] failed to create resource, caused by: Pod "v7c7qb7ldf-a-0" is invalid: [spec.containers[0].resources.limits[CPU]: Invalid value: "CPU": must be a standard resource type or fully qualified, spec.containers[0].resources.limits[CPU]: Invalid value: "CPU": must be a standard resource for containers, spec.containers[0].resources.limits[GPU]: Invalid value: "GPU": must be a standard resource type or fully qualified, spec.containers[0].resources.limits[GPU]: Invalid value: "GPU": must be a standard resource for containers, spec.containers[0].resources.requests[CPU]: Invalid value: "CPU": must be a standard resource type or fully qualified, spec.containers[0].resources.requests[CPU]: Invalid value: "CPU": must be a standard resource for containers, spec.containers[0].resources.requests[GPU]: Invalid value: "GPU": must be a standard resource type or fully qualified, spec.containers[0].resources.requests[GPU]: Invalid value: "GPU": must be a standard resource for containers]

Update: I've done some more troubleshooting; it looks like removing the .lowercase() call also makes cpu_limit invalid (before the fix, cpu_limit worked and only the GPU limit was invalid).
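(Editorial note, not from the thread: both failures are consistent with the plugin passing its internal resource-name identifiers to Kubernetes verbatim. A hedged sketch of the normalization a fix needs, with hypothetical names:)

# Hypothetical mapping from flytekit's short resource names to the strings
# Kubernetes accepts; enum-style names like "GPU" must never pass through.
K8S_RESOURCE_NAMES = {
    "cpu": "cpu",                # standard names stay lowercase
    "memory": "memory",
    "gpu": "nvidia.com/gpu",     # extended resources must be fully qualified
}

def to_k8s_resource_name(short_name: str) -> str:
    return K8S_RESOURCE_NAMES[short_name.lower()]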


katrogan commented Mar 6, 2020

Thanks for pointing this out, will try a different fix again. Sorry for the trouble 😓


kumare3 commented Mar 11, 2020

@katrogan I think we will need to cut a new lyft/flyte release so that he can deploy the new propeller.

katrogan commented

@kumare3 lyft/flyte has already been updated with the propeller fix


kumare3 commented Mar 30, 2020

Closing the issue

kumare3 closed this as completed on Mar 30, 2020
kumare3 added this to the 0.2.0 milestone on Mar 30, 2020