This repository has been archived by the owner on Oct 23, 2023. It is now read-only.

Refactor kf-operator plugins configs and support setting different specs for different replica groups #386

Merged
merged 14 commits into from
May 4, 2023

Conversation

yubofredwang
Contributor

@yubofredwang yubofredwang commented Mar 31, 2023

TL;DR

This PR refactors the config fields for the kubeflow operator plugins: TFJob, MPIJob, and PyTorchJob. The change is backward incompatible, so we will keep the original plugin configs untouched and use the new configs in the v2 version of the plugins.

Type

  • Bug Fix
  • Feature
  • Plugin

Are all requirements met?

  • Code completed
  • Smoke tested
  • Unit tests added
  • Code documentation added
  • Any pending items have an associated Issue

Complete description

Code change includes:

  1. Move TFJob, MPIJob, and PyTorchJob from flyteidl/plugins/ to flyteidl/plugins/kubeflow/
  2. Create flyteidl/plugins/kubeflow/common.proto, which contains the definitions of RestartPolicy, RunPolicy, and CleanPodPolicy
  3. Add an xxxReplicaSpec to each job type so that replicas, image, resources, and restart_policy can be set per replica group
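
As a rough illustration of item 3, the per-replica-group spec could be modelled in Python as below. This is only a sketch: `ReplicaSpec`, `TFJobConfig`, and the `RestartPolicy` values are hypothetical names chosen for the example, not the generated flyteidl classes. Only the four fields (replicas, image, resources, restart_policy) come from the description above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional


class RestartPolicy(Enum):
    # Illustrative values only; the real enum is defined in common.proto.
    NEVER = 0
    ON_FAILURE = 1
    ALWAYS = 2


@dataclass
class ReplicaSpec:
    """Hypothetical mirror of one replica group's settings."""
    replicas: int = 1
    image: Optional[str] = None  # None stands for "use the task's default image"
    resources: Dict[str, str] = field(default_factory=dict)
    restart_policy: RestartPolicy = RestartPolicy.NEVER


@dataclass
class TFJobConfig:
    """Hypothetical TFJob config with one independent spec per replica group."""
    worker: ReplicaSpec = field(default_factory=ReplicaSpec)
    ps: ReplicaSpec = field(default_factory=ReplicaSpec)
    chief: ReplicaSpec = field(default_factory=ReplicaSpec)


# Each group can now carry its own replica count, resources, and restart policy.
config = TFJobConfig(
    worker=ReplicaSpec(replicas=3, resources={"cpu": "4", "memory": "8Gi"}),
    ps=ReplicaSpec(replicas=1, restart_policy=RestartPolicy.ON_FAILURE),
)
```

The point of the shape is that worker, ps, and chief groups no longer share a single set of resources, which is what the tracking issue asks for.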

Tracking Issue

fixes flyteorg/flyte#3308

Follow-up issue

@welcome

welcome bot commented Mar 31, 2023

Thank you for opening this pull request! 🙌

These tips will help get your PR across the finish line:

  • Most of the repos have a PR template; if not, fill it out to the best of your knowledge.
  • Sign off your commits (Reference: DCO Guide).

@codecov

codecov bot commented Mar 31, 2023

Codecov Report

Merging #386 (7ba6b07) into master (9189ff2) will increase coverage by 2.37%.
The diff coverage is n/a.

❗ Current head 7ba6b07 differs from pull request most recent head 96a45ea. Consider uploading reports for the commit 96a45ea to get more accurate results

@@            Coverage Diff             @@
##           master     #386      +/-   ##
==========================================
+ Coverage   76.11%   78.49%   +2.37%     
==========================================
  Files          18       18              
  Lines        1390     1195     -195     
==========================================
- Hits         1058      938     -120     
+ Misses        280      205      -75     
  Partials       52       52              
Flag Coverage Δ
unittests ?

Flags with carried forward coverage won't be shown. Click here to find out more.

see 18 files with indirect coverage changes

@hamersaw
Contributor

hamersaw commented Apr 7, 2023

Quick update per our offline discussion. The goal is to make these changes backwards compatible with the existing flytekit kubeflow operator APIs. The proposed IDL spec will therefore override the existing approach, and the kubeflow operator updates would look something like:

@task(
    task_config=TfJob(
        num_workers=2,  # @deprecated
        num_ps_replicas=1,  # @deprecated
        num_chief_replicas=1,  # @deprecated
        worker_spec=TfWorkflowSpec(
            replicas=3,  # backend job will use 3 replicas, overriding the 2 provided above
        ),
        # similar specs for ps and chief replicas
    ),
)
def foo():
    ...  # omitted

@task(
    task_config=TfJob(
        num_workers=2,  # @deprecated - but used because worker_spec is not specified
        num_ps_replicas=1,  # @deprecated - similarly used
        num_chief_replicas=1,  # @deprecated - similarly used
    ),
)
def foo():
    ...  # omitted

IIUC this is the cleanest way to implement these changes. Very interested in flytekit maintainer thoughts - cc @wild-endeavor @eapolinario.
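
The override rule in the two examples above can be sketched as a small resolution function. This is a minimal sketch under assumptions: `WorkerSpec` and `effective_replicas` are hypothetical names for illustration, and the comment does not say whether flytekit or flyteplugins would end up applying the rule.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorkerSpec:
    # Hypothetical stand-in for the per-replica-group spec used in the examples.
    replicas: Optional[int] = None


def effective_replicas(deprecated_count: int, spec: Optional[WorkerSpec]) -> int:
    """Prefer spec.replicas when it is set; otherwise fall back to the deprecated count."""
    if spec is not None and spec.replicas is not None:
        return spec.replicas
    return deprecated_count


print(effective_replicas(2, WorkerSpec(replicas=3)))  # spec wins: 3
print(effective_replicas(2, None))                    # no spec: 2
```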

The largest open question is how this should be written in wire format. There are two solutions:
(1) The proposed solution creates a new flyteidl proto message to support the new format. This will require flytekit to compile all new kubeflow jobs to use TaskType=1 (rather than the existing 0) in the TaskTemplate, denoting whether the custom struct uses the old or the new configuration definition. In this approach flytekit will be responsible for applying the override values, and the deprecation messages will be on the flytekit kubeflow API fields.
(2) Merge the new and existing proto messages. In this scenario the idl message will directly reflect the kubeflow APIs. In the above example, the idl message will have both num_workers and worker_spec.replicas. In this approach flyteplugins will be responsible for applying the override values, and the deprecation messages will be on both the flytekit kubeflow API fields and the flyteidl configuration proto.

@yubofredwang please feel free to correct this and / or clarify if there is more context here.

@hamersaw
Contributor

@yubofredwang we talked internally here; (hopefully) a quick summary. The best approach is probably to use a new proto message and increment the TaskType version to 1. This means deprecating the top-level replica counts will be done on the flytekit side. Does this sound reasonable to you? We should be able to move forward with this PR then, but I want to touch base before pushing it through.

@yubofredwang
Contributor Author

> @yubofredwang we talked internally here; (hopefully) a quick summary. The best approach is probably to use a new proto message and increment the TaskType version to 1. This means deprecating the top-level replica counts will be done on the flytekit side. Does this sound reasonable to you? We should be able to move forward with this PR then, but I want to touch base before pushing it through.

Got it. My understanding is we make the following changes:

  • flyteidl: keep the new protos separate from the old kf-operator protos
  • flyteplugins: support both TaskTypeVersion 0 and 1; the TaskTemplate will be in a different format for each version
  • flytekit: bump the flytekit version and use the new proto only in the new flytekit version
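
Supporting both versions on the backend amounts to dispatching on the task type version before decoding the custom config. The sketch below shows that idea in Python for consistency with the earlier examples. It is not the flyteplugins code, and the dictionary keys and function names are assumptions made for illustration.

```python
from typing import Any, Callable, Dict


def parse_v0(custom: Dict[str, Any]) -> Dict[str, int]:
    # Old format (assumed layout): replica count is a top-level field.
    return {"worker_replicas": int(custom.get("workers", 0))}


def parse_v1(custom: Dict[str, Any]) -> Dict[str, int]:
    # New format (assumed layout): replica count is nested in a per-group spec.
    spec = custom.get("worker_replicas", {})
    return {"worker_replicas": int(spec.get("replicas", 0))}


# One parser per supported task type version.
PARSERS: Dict[int, Callable[[Dict[str, Any]], Dict[str, int]]] = {
    0: parse_v0,
    1: parse_v1,
}


def parse_custom(task_type_version: int, custom: Dict[str, Any]) -> Dict[str, int]:
    """Decode the custom config using the parser that matches the version."""
    try:
        parser = PARSERS[task_type_version]
    except KeyError:
        raise ValueError(f"unsupported task type version: {task_type_version}")
    return parser(custom)


print(parse_custom(0, {"workers": 2}))
print(parse_custom(1, {"worker_replicas": {"replicas": 3}}))
```

Rejecting unknown versions explicitly, rather than falling back to one of the parsers, keeps a future version 2 from being silently decoded with the wrong layout.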

My question about this approach is what the user upgrade process should look like. There are three related pieces here: user code, flytekit, and flyteplugins.

Case1: code unchanged, flytekit unchanged, flyteplugins updated
In this case, the existing workflow works without a problem, since flyteplugins supports both versions.

Case2: code unchanged, flytekit updated, flyteplugins unchanged
In this case, the existing workflow will fail because flytekit does not accept the old config. Users should get an error message instructing them to update their code.

Case3: code updated, flytekit updated, flyteplugins unchanged
Same as Case1.

Case4: code updated, flytekit unchanged, flyteplugins updated
Not likely to occur, since users are not aware of the new flytekit interface.

@ByronHsu
Contributor

ByronHsu commented Apr 13, 2023

Regardless of the compatibility issues, the idl looks good to me!

@fg91 fg91 self-requested a review April 13, 2023 18:56
Signed-off-by: Yubo Wang <[email protected]>
Yubo Wang added 2 commits April 26, 2023 13:45
Signed-off-by: Yubo Wang <[email protected]>
Contributor

@hamersaw hamersaw left a comment

Looks great! My only nit is on the proto location with respect to v0 of the kubeflow operators. Not sure of the cleanest way to handle this (maybe this is it!) but it seems a little odd to have v0 and v1 in different locations.

@yubofredwang
Contributor Author

> Looks great! My only nit is on the proto location with respect to v0 of the kubeflow operators. Not sure of the cleanest way to handle this (maybe this is it!) but it seems a little odd to have v0 and v1 in different locations.

I put these into that specific location, plugins/kubeflow, for the following reasons:

  1. The three plugin protos share common parameters that are specific to the kubeflow plugins.
  2. The new plugins can keep the same names, such as "DistributedMPITrainingTask". If I put them under the same root directory, the compiler will complain about conflicts, since the messages are named the same.
  3. I am reluctant to rename the plugins to something like pytorch_v1.proto, because once we deprecate v0 we might have to rename them again. At that point we would need another code change in both the flytekit and flyteplugins repos.

@hamersaw hamersaw merged commit 5a3a44f into flyteorg:master May 4, 2023
@welcome

welcome bot commented May 4, 2023

Congrats on merging your first pull request! 🎉

eapolinario pushed a commit that referenced this pull request May 16, 2023
…ecs for different replica groups (#386)

* refactor kubeflow operators proto

Signed-off-by: Yubo Wang <[email protected]>

* add back the original proto for backward compatible

Signed-off-by: Yubo Wang <[email protected]>

* clean up comments

Signed-off-by: Yubo Wang <[email protected]>

* add kubeflow.rs

Signed-off-by: Yubo Wang <[email protected]>

* add elastic config

Signed-off-by: Yubo Wang <[email protected]>

* add command to MPI

Signed-off-by: Yubo Wang <[email protected]>

* add slots and command to mpi spec

Signed-off-by: Yubo Wang <[email protected]>

---------

Signed-off-by: Yubo Wang <[email protected]>
Co-authored-by: Yubo Wang <[email protected]>
eapolinario added a commit that referenced this pull request May 16, 2023
* added dynamic_job_spec_uri to dynamic workflow metadata and node execution closure (#360)

Signed-off-by: Daniel Rammer <[email protected]>

* Use TokenCache in ClientCredentialsTokenSourceProvider (#377)

* Init customTokenSource.refreshTime (#381)

Signed-off-by: Andrew Dye <[email protected]>

* added DataLoadingConfig to K8sPod message (#368)

Signed-off-by: Daniel Rammer <[email protected]>

* Add Reasons field to TaskExecutionClosure to track time-series of reasons (#382)

* added a time-series of reasons to the TaskExecution closure

Signed-off-by: Daniel Rammer <[email protected]>

* added docs

Signed-off-by: Daniel Rammer <[email protected]>

* actually finishing docs too

Signed-off-by: Daniel Rammer <[email protected]>

---------

Signed-off-by: Daniel Rammer <[email protected]>

* Create service for runtime metrics (#367)

* added span messages

Signed-off-by: Daniel Rammer <[email protected]>

* added endpoints to service

Signed-off-by: Daniel Rammer <[email protected]>

* generated mocks

Signed-off-by: Daniel Rammer <[email protected]>

* removed get task execution metrics rpc

Signed-off-by: Daniel Rammer <[email protected]>

* added EXECUTION_IDLE category

Signed-off-by: Daniel Rammer <[email protected]>

* updated PLUGIN_EXECUTION to PLUGIN_RUNTIME

Signed-off-by: Daniel Rammer <[email protected]>

* removed recorded_at on workflow and node level events

Signed-off-by: Daniel Rammer <[email protected]>

* added docs for task event reported_at field

Signed-off-by: Daniel Rammer <[email protected]>

* removed GetNodeExecutionMetrics endpoint - will implement later if necessary

Signed-off-by: Daniel Rammer <[email protected]>

* updated docs

Signed-off-by: Daniel Rammer <[email protected]>

* added reported_at for node execution events

Signed-off-by: Daniel Rammer <[email protected]>

* fixed typo

Signed-off-by: Daniel Rammer <[email protected]>

* fixed typos and removed dead code

Signed-off-by: Daniel Rammer <[email protected]>

* updated categories

Signed-off-by: Daniel Rammer <[email protected]>

* added workflow setup and teardown categories

Signed-off-by: Daniel Rammer <[email protected]>

* simplified span message and moved to flyteidl.core

Signed-off-by: Daniel Rammer <[email protected]>

---------

Signed-off-by: Daniel Rammer <[email protected]>

* Remove misleading token refresh logic from client credentials token source provider (#383)

* Out of core plugin (#378)

* Add backend plugin system service

Signed-off-by: Kevin Su <[email protected]>

* Add backend plugin system service

Signed-off-by: Kevin Su <[email protected]>

* nit

Signed-off-by: Kevin Su <[email protected]>

* nit

Signed-off-by: Kevin Su <[email protected]>

* nit

Signed-off-by: Kevin Su <[email protected]>

* nit

Signed-off-by: Kevin Su <[email protected]>

* update state

Signed-off-by: Kevin Su <[email protected]>

* update state

Signed-off-by: Kevin Su <[email protected]>

* dics

Signed-off-by: Kevin Su <[email protected]>

* Remove output prefix from get request

Signed-off-by: Kevin Su <[email protected]>

* update

Signed-off-by: Kevin Su <[email protected]>

* remove prev state

Signed-off-by: Kevin Su <[email protected]>

* update proto

Signed-off-by: Kevin Su <[email protected]>

* remove error message

Signed-off-by: Kevin Su <[email protected]>

* update comment

Signed-off-by: Kevin Su <[email protected]>

* make generate

Signed-off-by: Kevin Su <[email protected]>

* Rename the service

Signed-off-by: Kevin Su <[email protected]>

* nit

Signed-off-by: Kevin Su <[email protected]>

---------

Signed-off-by: Kevin Su <[email protected]>

* Feat: Add `ElasticConfig` message type for torch elastic training (#394)

* Add elastic config args to pytorch proto

Signed-off-by: Fabio Graetz <[email protected]>

* Add elastic config message type for torchrun training

Signed-off-by: Fabio Graetz <[email protected]>

---------

Signed-off-by: Fabio Graetz <[email protected]>
Co-authored-by: Fabio Grätz <[email protected]>
Co-authored-by: Ketan Umare <[email protected]>

* Retract 1.4.x (#397)

Signed-off-by: eduardo apolinario <[email protected]>
Co-authored-by: eduardo apolinario <[email protected]>

* Data addresses #minor (#391)

Signed-off-by: Yee Hing Tong <[email protected]>

* Refactor kf-operator plugins configs and support setting different specs for different replica groups (#386)

* refactor kubeflow operators proto

Signed-off-by: Yubo Wang <[email protected]>

* add back the original proto for backward compatible

Signed-off-by: Yubo Wang <[email protected]>

* clean up comments

Signed-off-by: Yubo Wang <[email protected]>

* add kubeflow.rs

Signed-off-by: Yubo Wang <[email protected]>

* add elastic config

Signed-off-by: Yubo Wang <[email protected]>

* add command to MPI

Signed-off-by: Yubo Wang <[email protected]>

* add slots and command to mpi spec

Signed-off-by: Yubo Wang <[email protected]>

---------

Signed-off-by: Yubo Wang <[email protected]>
Co-authored-by: Yubo Wang <[email protected]>

* add user_identifier (#388)

Signed-off-by: byhsu <[email protected]>
Signed-off-by: eduardo apolinario <[email protected]>
Co-authored-by: byhsu <[email protected]>
Co-authored-by: eduardo apolinario <[email protected]>

* Add envs to execution spec (#400)

Signed-off-by: Kevin Su <[email protected]>

* Support union and none type in flyteidl (#401)

* add support for Union Scalar

Signed-off-by: Yubo Wang <[email protected]>

* support union type and literals

Signed-off-by: Yubo Wang <[email protected]>

* change union type extraction

Signed-off-by: Yubo Wang <[email protected]>

---------

Signed-off-by: Yubo Wang <[email protected]>
Co-authored-by: Yubo Wang <[email protected]>
Co-authored-by: Kevin Su <[email protected]>

* Rename user_identity to execution_identity (#402)

Signed-off-by: byhsu <[email protected]>
Co-authored-by: byhsu <[email protected]>

* make generate

Signed-off-by: eduardo apolinario <[email protected]>

* Revert "Support union and none type in flyteidl (#401)"

This reverts commit 3284f61.

Signed-off-by: Eduardo Apolinario <[email protected]>

* We should not update flyteidl version in backend components in the case of this branch

Signed-off-by: eduardo apolinario <[email protected]>

---------

Signed-off-by: Daniel Rammer <[email protected]>
Signed-off-by: Andrew Dye <[email protected]>
Signed-off-by: Kevin Su <[email protected]>
Signed-off-by: Fabio Graetz <[email protected]>
Signed-off-by: eduardo apolinario <[email protected]>
Signed-off-by: Yee Hing Tong <[email protected]>
Signed-off-by: Yubo Wang <[email protected]>
Signed-off-by: byhsu <[email protected]>
Signed-off-by: Eduardo Apolinario <[email protected]>
Co-authored-by: Dan Rammer <[email protected]>
Co-authored-by: Andrew Dye <[email protected]>
Co-authored-by: Kevin Su <[email protected]>
Co-authored-by: Fabio M. Graetz, Ph.D <[email protected]>
Co-authored-by: Fabio Grätz <[email protected]>
Co-authored-by: Ketan Umare <[email protected]>
Co-authored-by: eduardo apolinario <[email protected]>
Co-authored-by: Yee Hing Tong <[email protected]>
Co-authored-by: Yubo Wang <[email protected]>
Co-authored-by: Yubo Wang <[email protected]>
Co-authored-by: ByronHsu <[email protected]>
Co-authored-by: byhsu <[email protected]>
eapolinario pushed a commit that referenced this pull request Jun 27, 2023
eapolinario added a commit that referenced this pull request Jun 28, 2023
* Adding support for structured dataset (#369)

Signed-off-by: pmahindrakar-oss <[email protected]>

* added dynamic_job_spec_uri to dynamic workflow metadata and node execution closure (#360)

Signed-off-by: Daniel Rammer <[email protected]>

* Use TokenCache in ClientCredentialsTokenSourceProvider (#377)

* Init customTokenSource.refreshTime (#381)

Signed-off-by: Andrew Dye <[email protected]>

* added DataLoadingConfig to K8sPod message (#368)

Signed-off-by: Daniel Rammer <[email protected]>

* Add Reasons field to TaskExecutionClosure to track time-series of reasons (#382)

* added a time-series of reasons to the TaskExecution closure

Signed-off-by: Daniel Rammer <[email protected]>

* added docs

Signed-off-by: Daniel Rammer <[email protected]>

* actually finishing docs too

Signed-off-by: Daniel Rammer <[email protected]>

---------

Signed-off-by: Daniel Rammer <[email protected]>

* Create service for runtime metrics (#367)

* added span messages

Signed-off-by: Daniel Rammer <[email protected]>

* added endpoints to service

Signed-off-by: Daniel Rammer <[email protected]>

* generated mocks

Signed-off-by: Daniel Rammer <[email protected]>

* removed get task execution metrics rpc

Signed-off-by: Daniel Rammer <[email protected]>

* added EXECUTION_IDLE category

Signed-off-by: Daniel Rammer <[email protected]>

* updated PLUGIN_EXECUTION to PLUGIN_RUNTIME

Signed-off-by: Daniel Rammer <[email protected]>

* removed recorded_at on workflow and node level events

Signed-off-by: Daniel Rammer <[email protected]>

* added docs for task event reported_at field

Signed-off-by: Daniel Rammer <[email protected]>

* removed GetNodeExecutionMetrics endpoint - will implement later if necessary

Signed-off-by: Daniel Rammer <[email protected]>

* updated docs

Signed-off-by: Daniel Rammer <[email protected]>

* added reported_at for node execution events

Signed-off-by: Daniel Rammer <[email protected]>

* fixed typo

Signed-off-by: Daniel Rammer <[email protected]>

* fixed typos and removed dead code

Signed-off-by: Daniel Rammer <[email protected]>

* updated categories

Signed-off-by: Daniel Rammer <[email protected]>

* added workflow setup and teardown categories

Signed-off-by: Daniel Rammer <[email protected]>

* simplified span message and moved to flyteidl.core

Signed-off-by: Daniel Rammer <[email protected]>

---------

Signed-off-by: Daniel Rammer <[email protected]>

* Remove misleading token refresh logic from client credentials token source provider (#383)

* Out of core plugin (#378)

* Add backend plugin system service

Signed-off-by: Kevin Su <[email protected]>

* Add backend plugin system service

Signed-off-by: Kevin Su <[email protected]>

* nit

Signed-off-by: Kevin Su <[email protected]>

* nit

Signed-off-by: Kevin Su <[email protected]>

* nit

Signed-off-by: Kevin Su <[email protected]>

* nit

Signed-off-by: Kevin Su <[email protected]>

* update state

Signed-off-by: Kevin Su <[email protected]>

* update state

Signed-off-by: Kevin Su <[email protected]>

* dics

Signed-off-by: Kevin Su <[email protected]>

* Remove output prefix from get request

Signed-off-by: Kevin Su <[email protected]>

* update

Signed-off-by: Kevin Su <[email protected]>

* remove prev state

Signed-off-by: Kevin Su <[email protected]>

* update proto

Signed-off-by: Kevin Su <[email protected]>

* remove error message

Signed-off-by: Kevin Su <[email protected]>

* update comment

Signed-off-by: Kevin Su <[email protected]>

* make generate

Signed-off-by: Kevin Su <[email protected]>

* Rename the service

Signed-off-by: Kevin Su <[email protected]>

* nit

Signed-off-by: Kevin Su <[email protected]>

---------

Signed-off-by: Kevin Su <[email protected]>

* Feat: Add `ElasticConfig` message type for torch elastic training (#394)

* Add elastic config args to pytorch proto

Signed-off-by: Fabio Graetz <[email protected]>

* Add elastic config message type for torchrun training

Signed-off-by: Fabio Graetz <[email protected]>

---------

Signed-off-by: Fabio Graetz <[email protected]>
Co-authored-by: Fabio Grätz <[email protected]>
Co-authored-by: Ketan Umare <[email protected]>

* Retract 1.4.x (#397)

Signed-off-by: eduardo apolinario <[email protected]>
Co-authored-by: eduardo apolinario <[email protected]>

* Data addresses #minor (#391)

Signed-off-by: Yee Hing Tong <[email protected]>

* Refactor kf-operator plugins configs and support setting different specs for different replica groups (#386)

* refactor kubeflow operators proto

Signed-off-by: Yubo Wang <[email protected]>

* add back the original proto for backward compatible

Signed-off-by: Yubo Wang <[email protected]>

* clean up comments

Signed-off-by: Yubo Wang <[email protected]>

* add kubeflow.rs

Signed-off-by: Yubo Wang <[email protected]>

* add elastic config

Signed-off-by: Yubo Wang <[email protected]>

* add command to MPI

Signed-off-by: Yubo Wang <[email protected]>

* add slots and command to mpi spec

Signed-off-by: Yubo Wang <[email protected]>

---------

Signed-off-by: Yubo Wang <[email protected]>
Co-authored-by: Yubo Wang <[email protected]>

* add user_identifier (#388)

Signed-off-by: byhsu <[email protected]>
Signed-off-by: eduardo apolinario <[email protected]>
Co-authored-by: byhsu <[email protected]>
Co-authored-by: eduardo apolinario <[email protected]>

* Add envs to execution spec (#400)

Signed-off-by: Kevin Su <[email protected]>

* Support union and none type in flyteidl (#401)

* add support for Union Scalar

Signed-off-by: Yubo Wang <[email protected]>

* support union type and literals

Signed-off-by: Yubo Wang <[email protected]>

* change union type extraction

Signed-off-by: Yubo Wang <[email protected]>

---------

Signed-off-by: Yubo Wang <[email protected]>
Co-authored-by: Yubo Wang <[email protected]>
Co-authored-by: Kevin Su <[email protected]>

* Rename user_identity to execution_identity (#402)

Signed-off-by: byhsu <[email protected]>
Co-authored-by: byhsu <[email protected]>

* Single literal in GetDataResponse (#404)

Signed-off-by: Yee Hing Tong <[email protected]>

* Add namespace to execution system metadata (#406)

Signed-off-by: Katrina Rogan <[email protected]>

* Add oauth2 http proxy client (#405)

Signed-off-by: byhsu <[email protected]>

* Rename externalPluginService to AgentService (#410)

* Rename externalPluginService to AgentService

Signed-off-by: Kevin Su <[email protected]>

* nit

Signed-off-by: Kevin Su <[email protected]>

---------

Signed-off-by: Kevin Su <[email protected]>

* Add external_plugin_service proto back to the idl (#413)

* Add external-plugin-service proto back to the idl

Signed-off-by: Kevin Su <[email protected]>

* update idl

Signed-off-by: Kevin Su <[email protected]>

* update idll

Signed-off-by: Kevin Su <[email protected]>

* update idll

Signed-off-by: Kevin Su <[email protected]>

* AsyncAgentService

Signed-off-by: Kevin Su <[email protected]>

---------

Signed-off-by: Kevin Su <[email protected]>

* Rerun make generate

Signed-off-by: eduardo apolinario <[email protected]>

---------

Signed-off-by: pmahindrakar-oss <[email protected]>
Signed-off-by: Daniel Rammer <[email protected]>
Signed-off-by: Andrew Dye <[email protected]>
Signed-off-by: Kevin Su <[email protected]>
Signed-off-by: Fabio Graetz <[email protected]>
Signed-off-by: eduardo apolinario <[email protected]>
Signed-off-by: Yee Hing Tong <[email protected]>
Signed-off-by: Yubo Wang <[email protected]>
Signed-off-by: byhsu <[email protected]>
Signed-off-by: Katrina Rogan <[email protected]>
Co-authored-by: pmahindrakar-oss <[email protected]>
Co-authored-by: Dan Rammer <[email protected]>
Co-authored-by: Andrew Dye <[email protected]>
Co-authored-by: Kevin Su <[email protected]>
Co-authored-by: Fabio M. Graetz, Ph.D <[email protected]>
Co-authored-by: Fabio Grätz <[email protected]>
Co-authored-by: Ketan Umare <[email protected]>
Co-authored-by: eduardo apolinario <[email protected]>
Co-authored-by: Yee Hing Tong <[email protected]>
Co-authored-by: Yubo Wang <[email protected]>
Co-authored-by: Yubo Wang <[email protected]>
Co-authored-by: ByronHsu <[email protected]>
Co-authored-by: byhsu <[email protected]>
Co-authored-by: Katrina Rogan <[email protected]>
eapolinario pushed a commit that referenced this pull request Sep 8, 2023
Development

Successfully merging this pull request may close these issues.

[Core feature] Allow setting separate resource configs for different replica types in Kubeflow Jobs
5 participants