
pytorch plugin implementation #112

Merged: 4 commits into flyteorg:master on May 28, 2020
Conversation

igorvalko (Contributor) commented May 8, 2020

TL;DR

Pytorch plugin for Flyte that leverages https://github.com/kubeflow/pytorch-operator to run distributed pytorch jobs on k8s.

Type

  • Bug Fix
  • Feature
  • Plugin

Are all requirements met?

  • Code completed
  • Smoke tested
  • Unit tests added
  • Code documentation added
  • Any pending items have an associated Issue

Complete description

Pytorch plugin for Flyte that leverages https://github.com/kubeflow/pytorch-operator to run distributed pytorch jobs on k8s.
Related PRs:
flyteorg/flyteidl#61
flyteorg/flyteplugins#85
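
For illustration, a minimal usage sketch of the plugin from a task author's point of view. The decorator name, import path, and per-replica parameter names below follow the review discussion later in this thread and are assumptions, not necessarily the merged API:

    # Hypothetical sketch: names below are assumptions based on this PR's review
    # discussion, not a reference for the merged API.
    from flytekit.sdk.tasks import pytorch_task  # assumed import path


    @pytorch_task(
        workers_count=2,                  # worker replicas; a master replica is implicit
        per_replica_cpu_request="4",      # resources are requested per replica (master and workers)
        per_replica_memory_request="8Gi",
        per_replica_gpu_request="1",
        cache=False,
        timeout=None,
    )
    def train(workflow_parameters):
        # Every replica launched by the kubeflow pytorch-operator runs this body;
        # the torch.distributed initialization and training loop go here.
        ...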

Tracking Issue

NA

Follow-up issue

NA

flytekit/plugins/__init__.py (resolved review thread, outdated)
cache=False,
timeout=None,
workers_count=1,
instance_storage_request="",
Collaborator:

Any reason to call it "instance_"? Is it to signify this request is per worker instead of for the entire task?

igorvalko (Contributor, author):

Yes, you've got it right: it's just to emphasize that resources are requested per instance (for both the worker(s) and the master).

Contributor:

Maybe for the user we could provide an interface that says "master_cpu" / "worker_cpu", but for now map it internally. Eventually I think we want to use https://github.com/pytorch/elastic/tree/master/kubernetes, which will not require the master/worker model.

Contributor:

Actually, thinking more about it, let's replace "instance" with "node", and that should be it. We could move to elastic later on?

igorvalko (Contributor, author):

Huh, I see that in the related docs (at least here and here) the processes in a distributed job are referred to as replicas. Hence, maybe let's go with per_replica_*? Any objections?

Contributor:

@EngHabu I think what @igorvalko is trying to do is use the same cpu/mem for both master and worker. Initially he had all resources as part of the CRD, and I said he could re-purpose our "container" resources for one of them. So Igor made them the same.
Now maybe the way to think about this is: the container in "flyte" for a distributed job refers to either the master or a worker?

Collaborator:

I would like to avoid the Spark issue we had, where people would assign huge resources to the "driver" pod the same way they assign them to the executor pods, but end up barely using any on the driver pod... By essentially forcing users to use the same for both, are we getting into the same situation?

igorvalko (Contributor, author) commented May 27, 2020:

I have no hands-on experience with pytorch, but it looks like the master in the pytorch paradigm does the same kind of work as the workers, plus some extra communication-arrangement work. Spark takes a different approach: the driver is (most often) a lightweight coordinator and all the heavy work happens on the executors.
Hence I assume it's uncommon to request uneven resources for the master and the workers. Input from anyone with experience running distributed pytorch would be very welcome.

Contributor:

From what I read, I think there are two possible paradigms: (1) the master node does what a worker does plus the coordination and communication, or (2) the master node does only the reduction. But I think either of these requires the master node to have the same or similar resource requirements as the workers.

Contributor:

(I am assuming we are using torch.nn.parallel.DistributedDataParallel(), which builds on top of the torch.distributed package and should supposedly use the NCCL backend, according to https://pytorch.org/docs/stable/distributed.html#which-backend-to-use.)
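
For reference, a minimal sketch of the torch.distributed / DistributedDataParallel setup being discussed, assuming the NCCL backend and the MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE environment variables that the pytorch-operator injects into each replica (the LOCAL_RANK handling below is an assumption, not something this PR defines):

    import os

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP


    def main():
        # With the default env:// init method, RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT
        # are read from the environment prepared by the pytorch-operator.
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ.get("LOCAL_RANK", 0))  # assumption: one GPU per process
        torch.cuda.set_device(local_rank)

        model = torch.nn.Linear(10, 1).cuda(local_rank)
        ddp_model = DDP(model, device_ids=[local_rank])
        optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

        inputs = torch.randn(8, 10).cuda(local_rank)
        targets = torch.randn(8, 1).cuda(local_rank)

        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(ddp_model(inputs), targets)
        loss.backward()  # gradients are all-reduced across replicas via NCCL
        optimizer.step()

        dist.destroy_process_group()


    if __name__ == "__main__":
        main()

In this model every replica (master and workers alike) executes the same training loop, which is why equal per-replica resource requests are a reasonable default.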

flytekit/sdk/tasks.py (resolved review thread, outdated)
flytekit/common/tasks/pytorch_task.py (3 resolved review threads, outdated)
igorvalko (Contributor, author):

Thanks everyone for the review. Latest changes should resolve all the requests.

igorvalko requested a review from EngHabu on May 25, 2020 at 19:25
codecov-commenter commented May 26, 2020

Codecov Report

Merging #112 into master will increase coverage by 0.15%.
The diff coverage is 91.89%.


@@            Coverage Diff             @@
##           master     #112      +/-   ##
==========================================
+ Coverage   81.68%   81.84%   +0.15%     
==========================================
  Files         209      211       +2     
  Lines       13462    13654     +192     
  Branches     1120     1125       +5     
==========================================
+ Hits        10997    11175     +178     
- Misses       2200     2213      +13     
- Partials      265      266       +1     
Impacted Files Coverage Δ
flytekit/sdk/tasks.py 75.00% <66.66%> (-0.68%) ⬇️
flytekit/models/task.py 92.43% <90.00%> (-0.11%) ⬇️
flytekit/common/tasks/pytorch_task.py 90.47% <90.47%> (ø)
tests/flytekit/unit/sdk/tasks/test_pytorch_task.py 97.05% <97.05%> (ø)
flytekit/common/constants.py 100.00% <100.00%> (ø)
flytekit/plugins/__init__.py 100.00% <100.00%> (ø)
flytekit/common/mixins/artifact.py 45.45% <0.00%> (-1.22%) ⬇️
flytekit/common/workflow_execution.py 44.23% <0.00%> (-0.67%) ⬇️
flytekit/common/nodes.py 60.83% <0.00%> (-0.59%) ⬇️
flytekit/__init__.py 100.00% <0.00%> (ø)
... and 6 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update c125cd7...e747c56.

kumare3 (Contributor) left a comment:

LGTM. Thank you for all the work @igorvalko

kumare3 merged commit 08cea51 into flyteorg:master on May 28, 2020
max-hoffman pushed a commit to dolthub/flytekit that referenced this pull request May 11, 2021
7 participants