
[BUG] Ephemeral Storage request is not respected #4965

Closed
2 tasks done
robert-ulbrich-mercedes-benz opened this issue Feb 27, 2024 · 15 comments
Assignees
Labels
bug (Something isn't working) · flyteadmin (Issue for FlyteAdmin Service) · helm

Comments

@robert-ulbrich-mercedes-benz
Contributor

robert-ulbrich-mercedes-benz commented Feb 27, 2024

Describe the bug

When submitting a task with an ephemeral storage request set (e.g. to 100Gi), the request is carried over into the task definition; see the first attached screenshot.

The Kubernetes pod created from this request nevertheless only gets a 20Mi limit for ephemeral storage; see the second attached screenshot.

Expected behavior

The Kubernetes pod should have a limit on ephemeral storage equal to the requested amount of storage.

Additional context to reproduce

  1. The issue occurs on Flyte 1.10.7.
  2. The issue does not occur on Flyte 1.10.0.
  3. Run the following example code:
from flytekit import Resources, task, workflow


@task(
    requests=Resources(ephemeral_storage="100Gi"),
    limits=Resources(ephemeral_storage="100Gi"),
)
def hello_world(name: str) -> str:
    """Flyte task that returns a greeting.

    Returns:
        str: "Hello, <name>!"
    """
    return f"Hello, {name}!"


@workflow
def hello_wf(your_name: str) -> str:
    """Flyte workflow that runs the hello_world task.

    Returns:
        str: "Hello, <name>!"
    """
    return hello_world(name=your_name)
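To double-check what actually ends up on the pod, a quick sketch with the Kubernetes Python client can list the execution pods and print their container resources (this assumes cluster access and the kubernetes package; the namespace below is only an example, not our real one):

from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

# List the pods of the execution namespace and print the resources Kubernetes applied.
for pod in core.list_namespaced_pod(namespace="flytesnacks-development").items:
    for container in pod.spec.containers:
        print(pod.metadata.name, container.resources.requests, container.resources.limits)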

Screenshots

(Screenshot 1: the task definition showing the 100Gi ephemeral storage request)

(Screenshot 2: the resulting pod spec showing a 20Mi ephemeral storage limit)

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
@robert-ulbrich-mercedes-benz added the bug (Something isn't working) and untriaged (This issue has not yet been looked at by the Maintainers) labels Feb 27, 2024

welcome bot commented Feb 27, 2024

Thank you for opening your first issue here! 🛠


dosubot bot commented Feb 27, 2024

🤖

Based on the changes made between Flyte 1.10.0 and Flyte 1.10.7, it seems like there has been a shift in how ephemeral storage is handled. Specifically, the term Storage was replaced with Ephemeral Storage in the Helm Chart as per this pull request. Additionally, the option to use storage as a task resource was removed as per this pull request.

These changes could potentially be the reason why your ephemeral storage request is not being respected. However, without more information, it's hard to say for certain.

To help troubleshoot this issue, could you please provide the following information:

  1. The version of the Flytekit you're using.
  2. The configuration of your Flyte deployment, specifically the settings related to resource allocation and limits.
  3. The logs from the Flyte Propeller, which is the component responsible for translating Flyte tasks into Kubernetes pods.

This information will help us understand the context in which the task is being executed and why the ephemeral storage request might not be respected.

In the meantime, you might want to consider downgrading to Flyte 1.10.0 if that version was working as expected for your use case. Please note that this is just a workaround and not a long-term solution. We'll continue to investigate this issue to find a proper fix.

For more information about resource configuration in Flyte, you can refer to the Flyte documentation.


About Dosu

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

@Future-Outlier
Member

It's fixed here.
#4929
Thank you.

@robert-ulbrich-mercedes-benz
Contributor Author

robert-ulbrich-mercedes-benz commented Feb 28, 2024

Hi @Future-Outlier,

thanks for your quick response. I am addressing the bot's questions in my reply to you.

  1. We were using Flyte 1.10.7.
  2. I am not allowed to share the complete configuration, but I can say that we have no resource defaults set for ephemeral storage.
  3. Here are the relevant Flyte Propeller logs:

{ "insertId": "1qgk89aismv5o2by", "jsonPayload": { "json": { "node": "n0", "res_ver": "254879967", "tasktype": "helix-task", "src": "container_helper.go:329", "wf": "ndas:dev:msp_export_workflow.msp_export_specific_session_workflow", "routine": "worker-24", "exec_id": "fda883097f3dd44f992e", "ns": "ndas-dev" }, "msg": "Adjusted container resources [{map[cpu:{{2 0} {<nil>} 2 DecimalSI} ephemeral-storage:{{20971520 0} {<nil>} 20Mi BinarySI} memory:{{2 9} {<nil>} 2G DecimalSI}] map[cpu:{{2 0} {<nil>} 2 DecimalSI} ephemeral-storage:{{20971520 0} {<nil>} 20Mi BinarySI} memory:{{2 9} {<nil>} 2G DecimalSI}] []}]", "level": "info", "ts": "2024-02-27T13:59:37Z" }, "resource": { "type": "k8s_container", "labels": { "container_name": "flytepropeller", "project_id": "mb-adas-workflow-p-56e1", "pod_name": "flytepropeller-67ff8fdc9f-nbtbs", "cluster_name": "mbadas-prod-gke-workflow", "location": "europe-west4", "namespace_name": "flyte" } }, "timestamp": "2024-02-27T13:59:37.057502579Z", "severity": "INFO", "labels": { "k8s-pod/app_kubernetes_io/managed-by": "Helm", "k8s-pod/app_kubernetes_io/name": "flytepropeller", "k8s-pod/pod-template-hash": "67ff8fdc9f", "k8s-pod/helm_sh/chart": "flyte-core-v1.10.7", "k8s-pod/app_kubernetes_io/instance": "flyte-core", "compute.googleapis.com/resource_name": "gke-mbadas-prod-gke--mbadas-prod-gke--d5166da6-59c5" }, "logName": "projects/mb-adas-workflow-p-56e1/logs/stderr", "receiveTimestamp": "2024-02-27T13:59:39.647297684Z" }

{ "insertId": "a56xdll71u0g5c71", "jsonPayload": { "msg": "The resource requirement for creating Pod [ndas-dev/fda883097f3dd44f992e-n0-6] is [[{[memory]: [2G]} {[cpu]: [2]} {[ephemeral-storage]: [20Mi]}]]\n", "level": "info", "json": { "tasktype": "helix-task", "routine": "worker-24", "res_ver": "254879967", "src": "plugin_manager.go:179", "node": "n0", "exec_id": "fda883097f3dd44f992e", "ns": "ndas-dev", "wf": "ndas:dev:msp_export_workflow.msp_export_specific_session_workflow" }, "ts": "2024-02-27T13:59:37Z" }, "resource": { "type": "k8s_container", "labels": { "location": "europe-west4", "cluster_name": "mbadas-prod-gke-workflow", "pod_name": "flytepropeller-67ff8fdc9f-nbtbs", "namespace_name": "flyte", "project_id": "mb-adas-workflow-p-56e1", "container_name": "flytepropeller" } }, "timestamp": "2024-02-27T13:59:37.058019995Z", "severity": "INFO", "labels": { "k8s-pod/app_kubernetes_io/instance": "flyte-core", "k8s-pod/helm_sh/chart": "flyte-core-v1.10.7", "compute.googleapis.com/resource_name": "gke-mbadas-prod-gke--mbadas-prod-gke--d5166da6-59c5", "k8s-pod/app_kubernetes_io/name": "flytepropeller", "k8s-pod/pod-template-hash": "67ff8fdc9f", "k8s-pod/app_kubernetes_io/managed-by": "Helm" }, "logName": "projects/mb-adas-workflow-p-56e1/logs/stderr", "receiveTimestamp": "2024-02-27T13:59:39.647297684Z" }

We already downgraded and since then everything works fine again. But it is sad that we cannot update to a more recent version.

Looking at the piece of code that generates these logs, I wonder why the resource requests from our task are overwritten at all. That should not happen, and I am concerned that it will still happen even with #4929
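For illustration (this is just a sketch of the behavior the log suggests, not the actual flytepropeller code), the 100Gi-to-20Mi adjustment looks like the request simply being clamped to the configured platform limit:

def to_bytes(quantity: str) -> int:
    # Parse "100Gi" / "20Mi" style quantities (binary suffixes only, for the sketch).
    units = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}
    return int(quantity[:-2]) * units[quantity[-2:]]


def adjust(requested: str, platform_limit: str) -> str:
    # Keep the request only if it fits under the platform limit, otherwise clamp it.
    return requested if to_bytes(requested) <= to_bytes(platform_limit) else platform_limit


print(adjust("100Gi", "20Mi"))  # -> 20Mi, which is exactly what the pod got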

@Future-Outlier
Member

If you set the configuration with no limit on ephemeral storage, did you restart the Flyte cluster deployment?

@robert-ulbrich-mercedes-benz
Contributor Author

@Future-Outlier, maybe I do not understand your last statement correctly, but what we did was limit the ephemeral storage in the task decorator to 100Gi and then start the workflow with pyflyte. In the end, the limit of the Kubernetes container was set to only 20 MiB.

It is not an option for us to fully lift the limit on the ephemeral storage.

@Future-Outlier
Member

Future-Outlier commented Feb 28, 2024

@robert-ulbrich-mercedes-benz, what I mean is: did you

  1. edit the task resource field in the configmap, and
  2. restart your cluster with a command like
    kubectl rollout restart deployment flyte-sandbox -n flyte?

@robert-ulbrich-mercedes-benz
Contributor Author

Hi @Future-Outlier ,

it is hard for me to understand what your actual point is. It would make things easier for me if you could provide a few more sentences about your intentions.

We deployed Flyte using the Flyte Core Helm chart, so we do not have a flyte-sandbox deployment that I could restart. Which exact config map are you referring to? There are quite a lot of them in the Flyte Core Helm deployment. I could find the flyte-admin-base-config, which has a key `task_resource_defaults.yaml`:

task_resource_defaults.yaml:
----
task_resources:
  defaults:
    cpu: 500m
    ephemeralStorage: 500Mi
    memory: 500Mi
    storage: 500Mi
  limits:
    cpu: 128
    ephemeralStorage: 20Mi
    gpu: 16
    memory: 2000Gi
    storage: 10000Gi

This is the config on Flyte v1.10.7 in our dev environment. The question again is: why is the config provided in the task overwritten by the defaults? In my eyes it should rather be the other way around.

@Future-Outlier
Member

Hi @robert-ulbrich-mercedes-benz, sorry for the misunderstanding.
I didn't realize that you are using the Flyte Core Helm deployment.
I haven't used it before, so I can't respond to you immediately.
Can you come to the Slack channel and discuss this with us?
Someone there knows how to fix it, thank you.

@robert-ulbrich-mercedes-benz
Contributor Author

Hi @Future-Outlier,

okay, I will request help in the Slack channel then.

@pingsutw added the flyteadmin (Issue for FlyteAdmin Service) and helm labels and removed the untriaged (This issue has not yet been looked at by the Maintainers) label Feb 29, 2024
@eapolinario self-assigned this Feb 29, 2024
@eapolinario
Contributor

@robert-ulbrich-mercedes-benz, I was not able to reproduce the issue using the example in the description, only when I used pod templates. For example, let's say a task f is defined like so:

from flytekit import task
from flytekitplugins.pod import Pod
from kubernetes.client import V1Container, V1PodSpec, V1ResourceRequirements


@task(
    task_config=Pod(
        pod_spec=V1PodSpec(
            containers=[
                V1Container(
                    name="primary",
                    resources=V1ResourceRequirements(
                        requests={"ephemeral-storage": "1Gi"},
                        limits={"ephemeral-storage": "1Gi"},
                    ),
                ),
            ],
        ),
    ),
)
def f():
    pass

If the configured default ephemeral storage is set to any value, then that's what's used (since that's what passes the validation).

@robert-ulbrich-mercedes-benz
Contributor Author

Hi @eapolinario,

we also have default pod templates configured for our workflows, but those pod templates do not configure default ephemeral storage. However, as mentioned in this ticket, there are task resource defaults.

As mentioned, we can easily reproduce the issue.

Best regards

Rob Ulbrich

@eapolinario
Contributor

@robert-ulbrich-mercedes-benz , just to be clear, we have an outstanding bug in Flyte that essentially does not validate ephemeral storage values in pod templates. The moment we added a default value for ephemeral storage as a task resource default, that value was the one used by the task resource validation, regardless of the original value defined in the pod template.

Just to be crystal clear, let's say we have this task:

@task(
    task_config=Pod(
        pod_spec=V1PodSpec(
            containers=[
                V1Container(
                    name="primary",
                    resources=V1ResourceRequirements(
                        requests={"ephemeral-storage": "1Gi"},
                        limits={"ephemeral-storage": "1Gi"},
                    ),
                ),
            ],
        ),
    ),
)
def f():
    ...

These are the values as shown in the registered task template:

(Screenshot of the registered task template, taken 2024-03-08)

And task resource defaults are defined as such:

task_resources:
  defaults:
    ephemeralStorage: 500Mi
  limits:
    ephemeralStorage: 20Mi

Notice that we're just using the (nonsensical) values defined in 1.10.7.

Upon running that task f we see the following resources in the pod:

resources:
  limits:
    cpu: "2"
    ephemeral-storage: 20Mi
    memory: 200Mi
  requests:
    cpu: "2"
    ephemeral-storage: 20Mi
    memory: 200Mi

The values for cpu and memory are coming from the defaults.

This bug is being fixed by #5019. After that's released we'll see an error during registration if the values defined in the pod template do not pass task validation. FYI: we are planning a release, 1.11.0, for next week.
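Until 1.11.0 is out, one way to keep the registered task resources and the pod template consistent is to declare the same ephemeral storage in both places, so the value the task-resource validation sees matches the pod spec. This is only a sketch, not an official workaround; the 1Gi value is illustrative and still has to fit under the configured task_resources limits:

from flytekit import Resources, task
from flytekitplugins.pod import Pod
from kubernetes.client import V1Container, V1PodSpec, V1ResourceRequirements

# Illustrative value; it still has to satisfy the configured platform limits.
EPHEMERAL_STORAGE = "1Gi"


@task(
    task_config=Pod(
        pod_spec=V1PodSpec(
            containers=[
                V1Container(
                    name="primary",
                    resources=V1ResourceRequirements(
                        requests={"ephemeral-storage": EPHEMERAL_STORAGE},
                        limits={"ephemeral-storage": EPHEMERAL_STORAGE},
                    ),
                ),
            ],
        ),
    ),
    # Declaring the same value here means the task-resource validation sees it too.
    requests=Resources(ephemeral_storage=EPHEMERAL_STORAGE),
    limits=Resources(ephemeral_storage=EPHEMERAL_STORAGE),
)
def f():
    ...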

@davidmirror-ops
Contributor

@robert-ulbrich-mercedes-benz can you confirm this is fixed in your environment?

@robert-ulbrich-mercedes-benz
Contributor Author

Hey @davidmirror-ops,

yes, we do not see this issue anymore. It is fixed in our environment.
