pytorch plugin implementation #112
Any reason to call it "instance_"? Is it to signify that this request is per worker instead of for the entire task?
Yes, you've got it right: it's just to emphasize that it's for requesting resources per instance (for both the worker(s) and the master).
Maybe for the user we could provide an interface that says "master_cpu" / "worker_cpu", but for now map it internally. Eventually I think we want to use https://github.com/pytorch/elastic/tree/master/kubernetes, which will not require the master/worker model.
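For illustration, a rough sketch of what that user-facing interface could look like, mapped internally onto per-instance requests. The `master_cpu` / `worker_cpu` names come from the comment above; the dataclass and the mapping function are assumptions, not the actual plugin API:

```python
# Hypothetical sketch only: the user-facing names come from the discussion above,
# everything else here is assumed and is not the actual plugin code.
from dataclasses import dataclass


@dataclass
class PyTorchJobConfig:
    workers: int
    master_cpu: str = "1"
    master_memory: str = "2Gi"
    worker_cpu: str = "1"
    worker_memory: str = "2Gi"


def to_instance_resources(cfg: PyTorchJobConfig) -> dict:
    # Internally map the master_*/worker_* fields onto per-instance requests.
    return {
        "master": {"cpu": cfg.master_cpu, "memory": cfg.master_memory},
        "worker": {"cpu": cfg.worker_cpu, "memory": cfg.worker_memory},
    }
```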
Actually, thinking more about it, let's replace "instance" -> "node", and that should be it. We could move to elastic later on?
Huh, I see that in the related docs (at least here and here) the processes in a distributed job are referred to as `replicas`. Hence, maybe let's go with `per_replica_*`? Any objections?
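For illustration, a hedged sketch of how a single set of `per_replica_*` requests could be fanned out to both the master and worker replica specs of a PyTorchJob-style CRD; the field names and structure here are assumptions, not the plugin's actual output:

```python
# Hypothetical sketch only: simplified CRD-like structure, not the real PyTorchJob schema.
def build_replica_specs(workers: int, per_replica_cpu: str, per_replica_memory: str) -> dict:
    # The same per-replica requests are applied to both the master and the workers.
    resources = {"requests": {"cpu": per_replica_cpu, "memory": per_replica_memory}}
    return {
        "pytorchReplicaSpecs": {
            "Master": {"replicas": 1, "resources": resources},
            "Worker": {"replicas": workers, "resources": resources},
        }
    }
```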
@EngHabu I think what @igorvalko is trying to do is use the same cpu/mem for both the master and the worker. Initially he had all resources as part of the CRD, and I said he could re-purpose our "container" resources for one of them. So Igor made them the same.
Now maybe the way to think about this is: does the container in "flyte" for a distributed job refer to the master or the worker?
I would like to avoid the Spark issue we had, where people would assign huge resources to the "driver" pod the same way they assigned them to the executor pods, but ended up barely using any on the driver pod... By essentially forcing users to use the same for both, are we getting into the same situation?
I have no hands-on experience with PyTorch, but it looks like the master in the PyTorch paradigm does the same kind of job as the workers, with some extra communication-setup work. Spark uses a different approach: the driver is (most often) a lightweight coordinator and all the heavy work happens on the executors.
Hence I assume it's uncommon to request uneven resources for the master and workers. Input from anyone with experience in distributed PyTorch would be very welcome.
From what I read, I think there are two possible paradigms: (1) the master node does what a worker does plus the coordination and communication, or (2) the master node does only the reduction. But I think either of these requires the master node to have the same/similar resource requirements as the workers.
(I am assuming we are using `torch.nn.parallel.DistributedDataParallel()`, which builds on top of the `torch.distributed` package and should supposedly use the NCCL backend, according to https://pytorch.org/docs/stable/distributed.html#which-backend-to-use.)
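For context, a minimal `DistributedDataParallel` sketch using the NCCL backend mentioned above; the environment variables are assumed to be injected by the operator/launcher, and the model is just a placeholder:

```python
# Minimal DDP sketch, assuming RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT are set by the launcher.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # NCCL is the recommended backend for multi-GPU training over the network.
    dist.init_process_group(backend="nccl", init_method="env://")

    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(16, 1).cuda(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])

    # ... training loop using ddp_model goes here ...

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```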