[Limitation] SDK Compiler creates bigger YAML, quickly exceeding Kubeflow size limits #4170
Comments
/assign @Ark-kun
Ideally, the pipeline compiled from that code should be very small, since it only needs a single template. But in the currently compiled pipeline, each task gets its own template, which increases the workflow size. How limiting is this for your scenarios?
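As an aside, one way to see the per-task template duplication described here is to count templates versus DAG tasks in the compiled output. A minimal sketch, assuming the compiled output is an Argo workflow saved as pipeline.yaml (file name is illustrative):

import yaml

with open('pipeline.yaml') as f:
    workflow = yaml.safe_load(f)

templates = workflow['spec']['templates']
# DAG templates hold the task graph; the other templates are the per-step containers.
dag_tasks = [task for tpl in templates if 'dag' in tpl for task in tpl['dag']['tasks']]
print(f'{len(templates)} templates for {len(dag_tasks)} DAG tasks')

If the template count grows roughly one-for-one with the task count, every task is carrying its own template copy.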
Thanks @Ark-kun for your quick reply. Our real pipeline is much richer, with a mix of data preparation, training, and model evaluation. It uses multiple sources of data, so we run the same component (defined once) for each source. The whole pipeline can grow to more than 100 tasks. In the example I gave, I used the for loop only to reproduce the issue. In our real case, the pipeline can no longer run on the cluster, failing with the same error, if compiled with the new compiler.
That's great to hear. The real value of Pipelines starts to manifest when the pipelines get bigger.
Roughly, how many instances of the same components do you have?
Sorry to hear about the problem. A possible workaround is to strip the large component_spec annotations from the compiled workflow:

import yaml

with open('toto.yaml') as f:
    workflow = yaml.load(f)
for template in workflow['spec']['templates']:
    # Remove the bulky component_spec annotation if the template has one.
    template.setdefault('metadata', {}).setdefault('annotations', {}).pop(
        'pipelines.kubeflow.org/component_spec', None)
with open('toto_fixed.yaml', 'w') as f:  # open for writing, not reading
    yaml.dump(workflow, f)
Everyone else independently affected by this issue, please speak up. I'd like to know about your situation.
Removing from 1.0 project because this is intended behavior.
We are getting this error too. We are using kfp 0.5.1. The yaml created is 1.3 MB in size, and we have 400+ components in the pipeline. We are working on a benchmarking tool that has multiple datasets and sub-pipelines in the graph. Is there a plan to allow larger pipelines than currently allowed? This seems like a common use case.
Are they different components or component instances?
Unfortunately, this is a limitation of Kubernetes itself (and partially Argo). There is a limit on the size of any Kubernetes object. It was 512KiB some time ago, then 1MiB, and now 1.5MiB. It might be possible to increase the limit though: https://github.com/etcd-io/etcd/blob/master/Documentation/dev-guide/limit.md#request-size-limit
@radcheb Are you currently blocked by this issue?
@Ark-kun We are still using the older kfp release for now.
@Ark-kun we actually implemented and tested your workaround for squeezing the pipeline size, and it has been working with no problems. Thus, we upgraded to the newer kfp release.
@nikhil-dce you could use this workaround to reduce the compiled pipeline size:

import yaml

with open("big_pipeline.yaml") as f:
    workflow = yaml.load(f)
for template in workflow['spec']['templates']:
    annotations = template.setdefault('metadata', {}).setdefault('annotations', {})
    if 'pipelines.kubeflow.org/component_spec' in annotations:
        del annotations['pipelines.kubeflow.org/component_spec']
with open("smaller_pipeline.yaml", "w") as f:
    yaml.dump(workflow, f)
@Ark-kun shall we close this issue?
The hack workaround is appreciated. I think there should be an option to cull the metadata->annotations->component_spec in the kfp compiler.
We're hitting the problem in Kubeflow 1.0. The amount of data coming into the system is variable, so the DAG grows with it. We're on GKE, and according to this answer we're stuck:
Is it possible for Kubeflow to break the resources up into smaller chunks? Otherwise this is quite limiting.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/lifecycle frozen
Encountering this too, and following and trying out the suggestions above. Our situation is similar to several above: we use many instances of similar components. A full, complete run requires thousands of components built (dynamically) from those ~10 core component templates. We've been setting limits/requests to deal with OOM issues so far, but are still encountering this.
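For reference, in the KFP v1 DSL, per-task requests and limits can be set on each ContainerOp. A minimal sketch with an illustrative image and command (components created via func_to_container_op or load_component_from_* also return ContainerOps you can configure the same way):

from kfp import dsl

@dsl.pipeline(name='resource-limits-example')
def pipeline():
    # A trivial task; in practice this would be one of the ~10 core components.
    task = dsl.ContainerOp(
        name='train',
        image='python:3.7',
        command=['python', '-c', 'print("training...")'],
    )
    # Set requests and limits per task to keep pods from being OOM-killed.
    task.set_memory_request('2G')
    task.set_memory_limit('4G')
    task.set_cpu_request('1')
    task.set_cpu_limit('2')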
Might be a silly question, but where are you putting this YAML-stripping workaround?
It's expected to be placed after the compilation step, as a post-processing pass over the YAML file the compiler produces.
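A minimal sketch of that flow, assuming a trivial pipeline and illustrative file names (the real pipeline and paths will differ):

import yaml
import kfp
from kfp import dsl

@dsl.pipeline(name='example')
def my_pipeline():
    dsl.ContainerOp(name='hello', image='python:3.7',
                    command=['python', '-c', 'print("hello")'])

# 1. Compile the pipeline to an Argo workflow YAML as usual.
kfp.compiler.Compiler().compile(my_pipeline, 'pipeline.yaml')

# 2. Strip the large component_spec annotations from every template.
with open('pipeline.yaml') as f:
    workflow = yaml.safe_load(f)
for template in workflow['spec']['templates']:
    template.setdefault('metadata', {}).setdefault('annotations', {}).pop(
        'pipelines.kubeflow.org/component_spec', None)
with open('pipeline_small.yaml', 'w') as f:
    yaml.dump(workflow, f)

# 3. Upload or submit pipeline_small.yaml instead of the original file.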
My team is also affected by this issue. We're on KFP 1.5.1. Workflows with ~200 tasks run into the size limit. Interestingly, we seem to be running into this issue at the Argo level: the workflow is able to start executing, but after a few task completions it stops (screenshot attached below). I think what's happening is that the workflow object keeps growing as task statuses are recorded, until Argo can no longer update it within the size limit.
FWIW, I wrote a script to compute how much size is taken by all the fields: kfp_spec_field_sizes.py. I ran it on one of our large pipelines.
The suggested hack above to delete the component_spec annotation would reduce the total size by ~1/3! We'll try this out and see how much it helps.
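The script itself isn't included in this thread; a minimal sketch of that kind of field-size accounting (not necessarily what kfp_spec_field_sizes.py actually does) could look like this:

import sys
import yaml

def field_sizes(node, path=''):
    # Yield (path, serialized size) for every nested field in a YAML document.
    yield path or '<root>', len(yaml.dump(node))
    if isinstance(node, dict):
        for key, value in node.items():
            yield from field_sizes(value, f'{path}.{key}')
    elif isinstance(node, list):
        for i, item in enumerate(node):
            yield from field_sizes(item, f'{path}[{i}]')

with open(sys.argv[1]) as f:
    workflow = yaml.safe_load(f)

# Print the 20 largest fields first to see where the bytes go.
for path, size in sorted(field_sizes(workflow), key=lambda x: -x[1])[:20]:
    print(f'{size:>10}  {path}')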
@Ark-kun could you say more on what features we lose if we remove the component_spec annotation?
tl;dr One solution could be for KFP to support Argo's "workflow of workflows" feature, to allow arbitrarily sized workflows. Is there any chance KFP would adopt this? @Ark-kun, some more details:
Hi all, I just stumbled upon this error. Are you working on a solution, or is it expected that we use some kind of workaround?
@pogacsa there hasn't been any update on this for a while, it seems. FWIW, I worked around this by doing a variety of hacks to post-process the YAML generated by the KFP compiler. This is an easy one: #4170 (comment). Another big change my team made was to factor out fields that are common across all our templates.
We found another workaround when we figured out why the yaml gets so large: |
/cc @chensun
My team are using KFP for running very large pipelines (>20k pods) and we've got a solution for this. I'm in the process of getting approval to open-source it.
@tClelford Could you share some details about how your solution works? Do you post-process the KFP YAML and create a few Argo workflow template objects, or make use of Argo's workflow-of-workflows feature?
Hello @jli! Sorry for the delayed reply, I've been on paternity leave :) To be more accurate, we aren't using Argo workflow-of-workflows in KFP; rather, we've implemented the same pattern to get the same results. Basically, we split our massive pipeline into several child pipelines and wrote a "pipeline-runner" KFP component that submits each child pipeline as its own run and waits for it to finish. The bit we want to open-source is the pipeline-runner component, as it knows nothing about the content of the pipelines it's running. There are a couple of caveats/limitations to be aware of before you try using this to solve your kfp scale problem.
Hope this helps, give me a shout if you want more detail or if you'd like us to open-source our implementation.
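For anyone who wants to experiment with this pattern before an open-source release, a rough sketch of such a runner is below. It only uses the standard kfp.Client API; the host, file paths, and names are illustrative assumptions, not the actual implementation described above:

import kfp

def run_child_pipeline(host: str, package_path: str, experiment_name: str,
                       run_name: str, timeout_seconds: int = 3600):
    # Submit a pre-compiled child pipeline and block until it finishes.
    client = kfp.Client(host=host)  # KFP API endpoint reachable from the parent pipeline's pods
    experiment = client.create_experiment(experiment_name)
    run = client.run_pipeline(
        experiment_id=experiment.id,
        job_name=run_name,
        pipeline_package_path=package_path,  # child-pipeline YAML available to the runner
    )
    result = client.wait_for_run_completion(run.id, timeout=timeout_seconds)
    status = result.run.status
    if not status or status.lower() != 'succeeded':
        raise RuntimeError(f'Child pipeline {run_name} finished with status {status}')

The point of the pattern is that the parent workflow only carries one lightweight task per child pipeline, while each child is its own Argo workflow object that stays well under the size limit.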
Maybe :) I think it depends on whether KFP treats them as different pipelines under the hood. What I've observed is that it's the size of the generated YAML that causes it to fall over, so if it's still one pipeline, probably not.
Has there been any additional work in this area? We have a workflow with 460 tasks, and we've tried submitting the run in both KFP v1 and KFP v2 (splitting the workflow in two with the above feature), and we still get the same failure, likely hitting the 1.5 MB limit in etcd. Ideally the Argo team would need to de-construct the workflow before writing it to etcd and send it in smaller chunks, although I'm not sure the Kube API exposes such functionality to a controller. From the KFP side, there are probably still some more optimizations that can be done.
What steps did you take:
This problem is only noticed in 1.0.0-rc.3; it does not exist in 0.5.1. Given the same pipeline, the 1.0.0-rc.3 compiler creates a YAML file far bigger than the one created with the 0.5.1 compiler. Thus, this may be a problem for huge workflows that previously ran successfully and can no longer run when compiled with the new compiler.
What happened:
The pipeline compiled with 1.0.0-rc.3 fails to run with the following error: Workflow is longer than maximum allowed size. Size=1048602
What did you expect to happen:
The pipeline must run successfully.
Environment:
Kubeflow 1.0
Build commit: 743746b
Python 3.7 + kfp 1.0.0-rc.3
How did you deploy Kubeflow Pipelines (KFP)?
Anything else you would like to add: Reproducing the limit
For this same pipeline: if compiled with kfp 0.5.1 it produces a file of size 625 KB, while if compiled with kfp 1.0.0-rc.3 it produces a file of size 1129 KB and thus fails to run on the cluster.
With kfp 0.5.1 we can increase the size of this example pipeline up to 438 components; at 439 it fails to run (exceeded workflow size limit). With 1.0.0-rc.3 this limit decreases to 239 components because of the additional YAML size.
I am not sure whether it's a bug, but it's a huge limitation for complex training and data preparation workflows.
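The example pipeline itself isn't included in this extract; a hypothetical reconstruction of the kind of loop-based pipeline that reproduces the limit (the component, names, and task count are illustrative) might look like this:

import kfp
from kfp import dsl
from kfp.components import func_to_container_op

def echo(msg: str) -> str:
    print(msg)
    return msg

echo_op = func_to_container_op(echo)

@dsl.pipeline(name='size-limit-repro')
def repro_pipeline():
    # Each iteration adds another task to the compiled workflow; with the
    # 1.0.0-rc.3 compiler every task also carries its own template copy and
    # a component_spec annotation, so the YAML grows much faster than before.
    for i in range(300):
        echo_op(f'task {i}')

if __name__ == '__main__':
    kfp.compiler.Compiler().compile(repro_pipeline, 'repro_pipeline.yaml')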
/area sdk