pytorch plugin implementation #112
Conversation
flytekit/sdk/tasks.py
cache=False,
timeout=None,
workers_count=1,
instance_storage_request="",
Any reason to call it "instance_"? Is it to signify that this request is per worker instead of for the entire task?
Yes, you've got it right: it's just to emphasize that the resources are requested per instance (for both the worker(s) and the master).
Maybe for the user we could provide an interface that says "master_cpu" / "worker_cpu", but for now map it internally. Eventually I think we want to use https://github.com/pytorch/elastic/tree/master/kubernetes, which will not require the master/worker model.
Actually, thinking more about it, let's replace "instance" -> "node", and that should be it. We could move to elastic later on?
@EngHabu I think what @igorvalko is trying to do is use the same cpu/mem for both master and worker. Initially he had all resources as part of the CRD, and I said he could re-purpose our "container" resources for one of them.
So Igor made them the same.
Now maybe the way to think about this is: does the container in "flyte" for a distributed job refer to the master or the worker?
I would like to avoid the Spark issue we had, where people would assign huge resources to the "driver" pod the same way they assigned them to the executor pods, but ended up barely using any on the driver pod. By essentially forcing users to use the same resources for both, are we getting into the same situation?
I have no hands-on experience with PyTorch, but it looks like the master in the PyTorch paradigm does the same kind of job as the workers, with some extra communication-setup work. Spark takes a different approach: the driver is (most often) a lightweight coordinator and all the heavy work happens on the executors.
Hence I assume it's uncommon to request uneven resources for the master and workers. Input from anyone with experience running distributed PyTorch would be very welcome.
From what I read, I think there are two possible paradigms: (1) the master node does what a worker does plus the coordination and communication, or (2) the master node does only the reduction. Either way, I think the master node needs the same or similar resource requirements as the workers.
(I am assuming we are using torch.nn.parallel.DistributedDataParallel(), which builds on top of the torch.distributed package and should use the NCCL backend according to https://pytorch.org/docs/stable/distributed.html#which-backend-to-use.)
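For context, here is a minimal, self-contained sketch of the kind of DistributedDataParallel setup being referred to; it is illustrative only and not code from this PR. It assumes the launcher (for example, the pytorch-operator) injects MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE into each replica, and it falls back to the Gloo backend when no GPU is available:

```python
# Minimal DistributedDataParallel sketch (illustrative only, not code from this PR).
# Assumes the launcher injects MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE
# into each replica's environment.
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel


def main():
    # NCCL for GPU training, Gloo as a CPU fallback.
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend)  # reads rank/world size from the environment

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = nn.Linear(128, 10).to(device)

    # Master and workers wrap the model identically, which is why their
    # resource requirements end up roughly the same.
    ddp_model = DistributedDataParallel(model)

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    inputs = torch.randn(32, 128, device=device)
    targets = torch.randint(0, 10, (32,), device=device)

    loss = nn.functional.cross_entropy(ddp_model(inputs), targets)
    loss.backward()  # gradients are all-reduced across replicas here
    optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Because every replica, master included, runs the same wrapped model, requesting identical resources for the master and workers is a reasonable default.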
Thanks everyone for the review. The latest changes should address all the requests.
Codecov Report
@@            Coverage Diff             @@
##           master     #112      +/-   ##
==========================================
+ Coverage   81.68%   81.84%   +0.15%
==========================================
  Files         209      211       +2
  Lines       13462    13654     +192
  Branches     1120     1125       +5
==========================================
+ Hits        10997    11175     +178
- Misses       2200     2213      +13
- Partials      265      266       +1
Continue to review full report at Codecov.
LGTM. Thank you for all the work @igorvalko
TL;DR
PyTorch plugin for Flyte that leverages https://github.com/kubeflow/pytorch-operator to run distributed PyTorch jobs on Kubernetes.
Type
Are all requirements met?
Complete description
PyTorch plugin for Flyte that leverages https://github.com/kubeflow/pytorch-operator to run distributed PyTorch jobs on Kubernetes.
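For background, here is a rough sketch of the PyTorchJob custom resource that kubeflow/pytorch-operator consumes, expressed as a Python dict. The API version, image, and resource values are illustrative assumptions; the exact manifest produced by the Flyte plugin may differ:

```python
# Illustrative shape of a PyTorchJob custom resource consumed by kubeflow/pytorch-operator.
# The apiVersion, image, and resource values are assumptions, not taken from this PR.
container = {
    "name": "pytorch",  # the operator's default/expected container name
    "image": "example.com/my-training-image:latest",
    "resources": {"requests": {"cpu": "4", "memory": "8Gi"}},
}

pytorch_job = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "PyTorchJob",
    "metadata": {"name": "example-distributed-training"},
    "spec": {
        "pytorchReplicaSpecs": {
            # One master plus N workers; in this PR both get the same per-instance requests.
            "Master": {
                "replicas": 1,
                "restartPolicy": "OnFailure",
                "template": {"spec": {"containers": [container]}},
            },
            "Worker": {
                "replicas": 2,
                "restartPolicy": "OnFailure",
                "template": {"spec": {"containers": [container]}},
            },
        }
    },
}
```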
Related PRs:
flyteorg/flyteidl#61
flyteorg/flyteplugins#85
Tracking Issue
NA
Follow-up issue
NA