Skip to content
This repository has been archived by the owner on Oct 9, 2023. It is now read-only.

Don't add master replica log link when doing elastic pytorch training #356

Merged
merged 2 commits into from
Jun 7, 2023

Conversation

fg91
Copy link
Member

@fg91 fg91 commented Jun 7, 2023

TL;DR

When doing torch elastic training, there is no so-called master replica in the resulting PytorchJob as opposed to when doing non-elastic pytorch distributed training.

Flyteplugins, however, still generates a log link for the non-existing master replica in case of elastic training. This PR fixes this.

Type

  • Bug Fix
  • Feature
  • Plugin

Are all requirements met?

  • Code completed
  • Smoke tested
  • Unit tests added
  • Code documentation added
  • Any pending items have an associated Issue

I built a propeller image and tested that the correct log links are shown both for elastic and the original non-elastic pytorch tasks.

Complete description

When doing "normal" non-elastic training, a flyte task looks like this:

@task(
    task_config=PyTorch(
        num_workers=3,
    )
)
def train():

The pytorch job that is created from this task definition looks like this:

apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  ...
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      ...
    Worker:
      replicas: 3
      ...

Notice that there is a so-called "master" replica and multiple workers.

In the Flyte console, a link to the master replica and to the 3 worker replicas logs is shown.

When using the new elastic training task (torchrun) ...

@task(
    task_config=Elastic(
        nnodes=3,
        nproc_per_node=1,
    ),
)
def train():

... the resulting pytorch job looks like this:

apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  ...
spec:
  elasticPolicy:
    ...
  pytorchReplicaSpecs:
    Worker:
      replicas: 3
      ...

Notice that there is no-more "master" replica.

Even though there is no "master" replica, currently the Flyte console still shows a log link for the master replica that doesn't exist.

This PR fixes this.

Tracking Issue

NA

Follow-up issue

NA

@fg91 fg91 force-pushed the fabio/fix/elastic-log-links branch from 984c99c to a4fbf94 Compare June 7, 2023 14:21
@codecov
Copy link

codecov bot commented Jun 7, 2023

Codecov Report

Merging #356 (173acea) into master (76a80ec) will increase coverage by 1.44%.
The diff coverage is 77.83%.

❗ Current head 173acea differs from pull request most recent head 63b4fcd. Consider uploading reports for the commit 63b4fcd to get more accurate results

@@            Coverage Diff             @@
##           master     #356      +/-   ##
==========================================
+ Coverage   62.76%   64.20%   +1.44%     
==========================================
  Files         148      148              
  Lines       12444    10289    -2155     
==========================================
- Hits         7810     6606    -1204     
+ Misses       4038     3072     -966     
- Partials      596      611      +15     
Flag Coverage Δ
unittests ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
go/tasks/plugins/webapi/agent/config.go 100.00% <ø> (ø)
.../plugins/k8s/kfoperators/common/common_operator.go 65.94% <34.88%> (-5.99%) ⬇️
go/tasks/plugins/k8s/dask/dask.go 85.77% <57.14%> (+1.85%) ⬆️
go/tasks/plugins/webapi/agent/plugin.go 68.22% <66.66%> (ø)
go/tasks/plugins/k8s/kfoperators/mpi/mpi.go 75.00% <81.48%> (+0.80%) ⬆️
go/tasks/plugins/k8s/ray/ray.go 77.48% <84.09%> (-0.09%) ⬇️
...s/plugins/k8s/kfoperators/tensorflow/tensorflow.go 77.77% <84.33%> (+3.02%) ⬆️
...o/tasks/plugins/k8s/kfoperators/pytorch/pytorch.go 78.08% <88.37%> (-1.45%) ⬇️
...tasks/pluginmachinery/flytek8s/container_helper.go 86.36% <88.88%> (+0.36%) ⬆️
go/tasks/pluginmachinery/flytek8s/pod_helper.go 69.81% <100.00%> (-3.21%) ⬇️
... and 1 more

... and 121 files with indirect coverage changes

@fg91 fg91 force-pushed the fabio/fix/elastic-log-links branch from 0709b32 to 63b4fcd Compare June 7, 2023 16:11
@fg91 fg91 marked this pull request as ready for review June 7, 2023 16:12
@fg91 fg91 requested a review from pingsutw June 7, 2023 16:12
@fg91 fg91 merged commit de1d1d3 into master Jun 7, 2023
@fg91 fg91 deleted the fabio/fix/elastic-log-links branch June 7, 2023 18:15
eapolinario pushed a commit that referenced this pull request Sep 6, 2023
…#356)

* Don't add master log link when doing elastic pytorch training

Signed-off-by: Fabio Graetz <[email protected]>

* Lint

Signed-off-by: Fabio Graetz <[email protected]>

---------

Signed-off-by: Fabio Graetz <[email protected]>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants