pytorch plugin implementation #112
Conversation
flytekit/sdk/tasks.py
cache=False,
timeout=None,
workers_count=1,
instance_storage_request="",
Any reason to call it "instance_"? Is it to signify that this request is per worker instead of for the entire task?
Yes, you've got it right: it's just to emphasize that the resources are requested per instance (for both the worker(s) and the master).
Maybe for the user we could provide an interface that says "master_cpu" / "worker_cpu", but for now map it internally. Eventually I think we want to use https://github.com/pytorch/elastic/tree/master/kubernetes, which will not require the master/worker model.
Actually, thinking more about it, let's replace "instance" -> "node", and that should be it. We could move to elastic later on?
@EngHabu I think what @igorvalko is trying to do is use the same cpu/mem for both master and worker. Initially he had all resources as part of the CRD, and I said he could re-purpose our "container" resources for one of them.
So Igor made them the same.
Now maybe the way to think about this is: does the container in "flyte" for a distributed job refer to the master or the worker?
I would like to avoid the Spark issue we had, where people would assign huge resources to the "driver" pod the same way they assigned them to the executor pods, but ended up barely using any on the driver pod. By essentially forcing users to use the same resources for both, are we getting into the same situation?
I have no hands-on experience with PyTorch, but it looks like the master in the PyTorch paradigm does the same kind of job as the workers, with some extra communication-setup work. Spark takes a different approach: the driver is (most often) a lightweight coordinator and all the heavy work happens on the executors.
Hence I assume it's uncommon to request uneven resources for the master and workers. Input from anyone with experience running distributed PyTorch would be very welcome.
From what I read, I think there are two possible paradigms: (1) the master node does what a worker does plus the coordination and communication, or (2) the master node does only the reduction. Either way, I think the master node needs the same or similar resource requirements as the workers.
(I am assuming we are using torch.nn.parallel.DistributedDataParallel(), which builds on top of the torch.distributed package and should use the NCCL backend according to https://pytorch.org/docs/stable/distributed.html#which-backend-to-use.)
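For context, here is a minimal, self-contained sketch of the kind of DistributedDataParallel setup being referred to; it is illustrative only and not code from this PR. It assumes the launcher (for example, the pytorch-operator) injects MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE into each replica, and it falls back to the Gloo backend when no GPU is available:

```python
# Minimal DistributedDataParallel sketch (illustrative only, not code from this PR).
# Assumes the launcher injects MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE
# into each replica's environment.
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel


def main():
    # NCCL for GPU training, Gloo as a CPU fallback.
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend)  # reads rank/world size from the environment

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = nn.Linear(128, 10).to(device)

    # Master and workers wrap the model identically, which is why their
    # resource requirements end up roughly the same.
    ddp_model = DistributedDataParallel(model)

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    inputs = torch.randn(32, 128, device=device)
    targets = torch.randint(0, 10, (32,), device=device)

    loss = nn.functional.cross_entropy(ddp_model(inputs), targets)
    loss.backward()  # gradients are all-reduced across replicas here
    optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Because every replica, master included, runs the same wrapped model, requesting identical resources for the master and workers is a reasonable default.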
Thanks everyone for the review. The latest changes should address all the requests.
Codecov Report
@@            Coverage Diff             @@
##           master     #112      +/-   ##
==========================================
+ Coverage   81.68%   81.84%   +0.15%
==========================================
  Files         209      211       +2
  Lines       13462    13654     +192
  Branches     1120     1125       +5
==========================================
+ Hits        10997    11175     +178
- Misses       2200     2213      +13
- Partials      265      266       +1
Continue to review full report at Codecov.
LGTM. Thank you for all the work @igorvalko
TL;DR
PyTorch plugin for Flyte that leverages https://github.com/kubeflow/pytorch-operator to run distributed PyTorch jobs on Kubernetes.
Type
Are all requirements met?
Complete description
PyTorch plugin for Flyte that leverages https://github.com/kubeflow/pytorch-operator to run distributed PyTorch jobs on Kubernetes.
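For background, here is a rough sketch of the PyTorchJob custom resource that kubeflow/pytorch-operator consumes, expressed as a Python dict. The API version, image, and resource values are illustrative assumptions; the exact manifest produced by the Flyte plugin may differ:

```python
# Illustrative shape of a PyTorchJob custom resource consumed by kubeflow/pytorch-operator.
# The apiVersion, image, and resource values are assumptions, not taken from this PR.
container = {
    "name": "pytorch",  # the operator's default/expected container name
    "image": "example.com/my-training-image:latest",
    "resources": {"requests": {"cpu": "4", "memory": "8Gi"}},
}

pytorch_job = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "PyTorchJob",
    "metadata": {"name": "example-distributed-training"},
    "spec": {
        "pytorchReplicaSpecs": {
            # One master plus N workers; in this PR both get the same per-instance requests.
            "Master": {
                "replicas": 1,
                "restartPolicy": "OnFailure",
                "template": {"spec": {"containers": [container]}},
            },
            "Worker": {
                "replicas": 2,
                "restartPolicy": "OnFailure",
                "template": {"spec": {"containers": [container]}},
            },
        }
    },
}
```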
Related PRs:
flyteorg/flyteidl#61
flyteorg/flyteplugins#85
Tracking Issue
NA
Follow-up issue
NA