RFC-0023 Unified Memory for Pytorch #36
Conversation
Make FAQ sections Question/Answer more noticeable
Update formatting on FAQ section
cc @mcarilli
> 1. UVM is enabled
> 2. The `.to()` only changes the device and does not change any other tensor attribute.
>
> `Torch.cuda.uvm_to(Tensor t)`
I am not sure this is needed. Do you have specific use cases where the user would not want this behavior and would instead want the original `to()` behavior?
We agree with your statement, and would like to propose using the `copy` parameter of the `to()` function for a user to get the original behavior of the operation.

That is, if `tensor.to(device='cpu', copy=True)` is called, or the dtype or layout changes, we would force the tensor to allocate new storage and copy to the target device. Otherwise, if the `tensor.to()` only changes the device type and `copy=False` (the default), the operation would call a `move_` function that reuses the same memory but changes the DeviceType to the new device.

A use case for this would be a tensor that you want to be accessible on different devices, but that is allowed to change asynchronously.
To illustrate, the following would always copy (assuming the tensor starts on the CPU):

`tensor.to(device='cpu', copy=True)`
`tensor.to(device='cuda:0', copy=False, dtype=torch.float32)`

And this would call the move (zero copy):

`tensor.to(device='cuda:0', copy=False)`
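Taken together, a minimal sketch of how these rules might look from user code, assuming a UVM-enabled build; the zero-copy move on a device-only `.to()` is the proposed behavior, not what PyTorch does today, so the aliasing check is shown commented out:

```python
import torch

t = torch.randn(1024, device="cpu")  # with UVM enabled, this storage would be managed memory

a = t.to(device="cpu", copy=True)               # copy=True: always allocates new storage
b = t.to(device="cuda:0", dtype=torch.float64)  # dtype change: new storage plus a copy
c = t.to(device="cuda:0")                       # device-only change: proposed zero-copy move

# Proposed (hypothetical) outcome under UVM: c would alias t's managed storage.
# assert c.data_ptr() == t.data_ptr()
```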
> ## FAQ
>
> #### Should we have set_enabled_uvm(bool)? Will there be a use case where the user can set this to false?
Is there a technical reason for not providing this? Or is it just that we don't expect users to need this feature?

I could see a use case where a user does:
1. enable UVM
2. create the model and profile memory usage
3. relocate/optimize memory usage
4. disable UVM
5. run the training without UVM
This was the result of our "big switch" approach to enabling UVM mode. Since memory must be allocated using a different API, having it always on or always off made the design simpler.

Perhaps being able to mark a tensor as "managed" would help your use case above. We could add a new flag (`managed=True`) to the `to()` function that would be able to cast a tensor to a managed tensor. We could still have the big switches (and a big-switch disable), but those would just dictate the default value of the `managed` parameter here.

This would allow more user control over which tensors are managed. So, to recast (reallocate) a non-managed tensor as a managed tensor:

`tensor.to(device="cuda:0", managed=True)`

The reverse would also work.
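As a rough illustration of that idea, the round trip might look like the sketch below; the `managed=` keyword is hypothetical and does not exist in `torch.Tensor.to()` today:

```python
import torch

t = torch.randn(1024, device="cpu")       # ordinary, non-managed CPU tensor

# Hypothetical `managed=` keyword from the comment above (not a real argument today):
m = t.to(device="cuda:0", managed=True)   # reallocate into managed (UVM) storage
u = m.to(device="cpu", managed=False)     # reverse: back into ordinary, non-managed memory

# With the "big switch" enabled, managed=True would simply be the default value.
```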
> - No explicit cudaMemcpy in either direction
> - Set cudaMemPrefetchAsync()
> - Device to Host synchronization
> - There is no concept of a GPU "stream" on the host, so Device-to-Host copies have a required synchronization step attached. With UVM enabled, no explicit copy will be scheduled, but synchronization will still be required.
Could you clarify here what is expected from the end user? Does this mean that doing a `to("cpu")` and using the result straight away is unsafe? Or will the allocator make sure the data is available before we can read that memory?

If it is the first case, I think we should force a sync explicitly here unless the user passes `non_blocking=True`, meaning that the user takes on the responsibility of doing this sync.
That's correct. Honoring `non_blocking=True` also makes sense. The "move kernel" could check that and skip the device sync if directed to, passing on the responsibility.
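A short sketch of the synchronization contract being discussed, assuming UVM is enabled; the "no explicit copy" behavior is the proposal, while `non_blocking` and `torch.cuda.synchronize()` are existing APIs:

```python
import torch

x = torch.randn(1 << 20, device="cuda:0")

# Default path: the proposed "move kernel" would synchronize the device, so the
# result is safe to read immediately even though no explicit copy is scheduled.
y = x.to("cpu")
print(y.sum())

# non_blocking=True: the caller takes on the synchronization responsibility.
z = x.to("cpu", non_blocking=True)
torch.cuda.synchronize()  # ensure the data is actually coherent on the host before reading
print(z.sum())
```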
> We propose to treat `tensor.to(device)` as a special case when UVM is enabled and `enabled_uvm_move_on_copy` has been set. Additionally, the behavior of `tensor.uvm_to()` will match the behavior of `tensor.to()` when UVM is enabled and `enabled_uvm_move_on_copy` is True.
>
> ![Existing and proposed allocator usage](./RFC-0023-assets/copy-diagram.jpg)
I am not sure what it means here for the cpu Tensor not to exist anymore? The copy op does not modify the source in place, so I am not sure how this behavior would fit within pytorch.

Does this mean that there is no way to have `b = a.to("cuda")` (with `a` on cpu initially) and have:
- `a` keep behaving as a cpu Tensor
- `b` behave as a gpu Tensor
- `a` and `b` share memory (in the sense that inplace ops on either of them are reflected on the other)?

Effectively, this means that this `.to()` will only be a cross-device view op (meaning that the input and output share memory). The parallel to this in core is, for example, the `torch.view_as_real()` function, which is a view op whose input is complex and whose output is float dtype.
In line with the statement I made above, we are going to add the use of the `copy=True` parameter on the `.to()` function to allow an explicit copy to be made even when the tensor is managed by UVM.

As for the `torch.view_as_real()` parallel: we are treating a `.to()` as an explicit change of DeviceType for the tensor. Even though the memory address may be the same, `.to()` will trigger a prefetch to tell the driver to start moving the memory to the target device. Necessary synchronizations and other actions will be taken as needed to ensure correct coherency.
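A small sketch of what that would mean at the call site, under the assumption that UVM mode is on; the prefetch-on-`.to()` is the proposed behavior, and `copy=True` is the explicit opt-out described above:

```python
import torch

a = torch.randn(1024)            # CPU tensor; managed storage when UVM mode is on

# Proposed: no cudaMemcpy is scheduled; the tensor's DeviceType changes and a
# prefetch is issued so the driver starts migrating the managed pages to GPU 0.
b = a.to("cuda:0")

# Explicit escape hatch: copy=True still produces separate storage on GPU 0,
# even when the source is managed by UVM.
c = a.to("cuda:0", copy=True)
```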
One interesting thought for core composability is to see whether this RFC can be prototyped outside of the main repo for faster iteration. The extension point for the allocator should already be sufficient. Overriding kernels might be a bit trickier because the copy kernel has a lot of craft in it that might need to be replicated. Maybe it's not worth it, though.
> - In `at::empty_cpu()`, there is already a switch to determine which cpu allocator to use (pinned vs cpu). When UVM is enabled, this switch will always return the CachingManagedAllocator. There are likely a few other places (other empties) that get an allocator as well.
>
> - Per a find from the FBGEMM team, there is an issue with the context being moved from the device where the memory was allocated, and where it currently resides. We need to keep the original context, so we don’t create new ones unnecessarily on other devices. To account for this, at `at::empty` time, we will create two Storage structs:
Clarification: "context" here really refers to the `DataPtr` of the `Storage`. Since the `Storage` is bound to a device, when moving to a different device we need to create a new `Storage` instance and point it to the old one.
> Currently there are different allocators used for CPU (CPUAllocator), pinned host memory (CachingHostAllocator), and device memory (CUDACachingAllocator) allocations.
>
> - When UVM is enabled, we propose forgoing all traditional allocators and working with a newly developed CachingManagedAllocator.
> - This will provide benefits in a few areas. We can continue to develop and improve on Unified memory allocations without the risk of introducing performance degradation or errors to the existing CUDACachingAllocator. In addition, we expect the behavior of Managed memory to be different than that of traditional discrete GPU/CPU memory. For example, managed memory is available to all GPUs on all streams by default and can move between host and any number of GPUs at will. This will introduce intricacies to the caching methodology that may conflict with the existing caching allocators.
About the stream part: the reason the caching allocator has to keep track of streams for deallocation is to make deallocation synchronization-free and to be able to reuse the same buffer on another stream (after injecting a pair of events). I'd imagine you need to do the same here too.

But in general I agree on having a separate allocator and being able to customize its behavior.
@dzhulgakov Do you have any examples of using an allocator extension point? I haven't found much documentation on it, and I may be looking in the wrong areas. If you have any docs etc. that provide more clarification on this I'd love to look at it. Thanks!!
For the CPU allocator there's some proper wiring in the core library allowing the allocator used for new tensors to be overridden: https://github.com/pytorch/pytorch/blob/master/c10/core/Allocator.h#L210, and it works because new-tensor allocations go through it.

Unfortunately, I originally forgot that the CUDA allocator is not wired through this interface (some use cases do use it, though). That means that for immediate prototyping of this RFC we'd need to do something like modifying the code in place (i.e. in a temporary fork). The more extensible course of action would be to turn CUDACachingAllocator into a proper class and allow it to be overridden with a different implementation, like for the CPU one. Its interface would need to be broader than the base `Allocator` interface.

cc @ezyang
pytorch/pytorch#65365 adds an alternative backend for cuda allocations (and puts backends in the namespaces THC and CUDAMallocAsync). The design is not finalized (and the THC name definitely has to go), but probably we can decide what we need to do in that PR, as it will need to go in the core soon-ish, probably sooner than unified memory.
In pytorch/pytorch#65365, to avoid a vtable lookup and maintain existing inlining opportunities, I deliberately chose a non-polymorphic design. Each (inline) interface function dispatches to its backend statically, for example:

```cpp
inline void* raw_alloc_with_stream(size_t nbytes, cudaStream_t stream) {
  static auto f = (std::strcmp(allocatorBackend(), "native") == 0) ?
      THC::raw_alloc_with_stream : CudaMallocAsync::raw_alloc_with_stream;
  return f(nbytes, stream);
}
```

There are only two backends now, but macros could allow extensibility without too much boilerplate.

That being said, I'm not sure we need a full-blown unique allocator implementation to accommodate managed memory. My first instinct is that most (or all!) of what a cached managed memory allocator would need is identical to what the native allocator already does for cudaMalloc... so maybe we could just s/cudaMalloc/cudaMallocManaged/g in CUDACachingAllocator.cpp (or add an envvar-based conditional choice between them at all sites where cudaMalloc is currently called) and blow imaginary smoke off our finger guns. We'd need to add a few calls to expose managed memory's prefetch-to-gpu and prefetch-to-cpu functionality, but my point is I think most of the caching code could be shared with the native allocator.
Ah, thanks for the insight, @mcarilli. We were just debating whether a proper class for the allocator makes sense and hadn't considered the possible inlining benefits. That's interesting.

That was our first instinct as well! There are a few complications, or at least decisions, that have to be resolved. One is that the CUDACachingAllocator has a DeviceAllocator for each present CUDA device, and the caching is tracked per device. With managed memory the device is abstracted, so that allocator hierarchy makes less sense. Prefetch and device hints are used as data-locality suggestions, and those need to happen somewhere. Another question is whether we want to allow some designated tensors to be managed, or just have a big switch to do all or none. Our design proposes a single allocator for all devices (CPU too). Initially the caching will look like the existing CUDACachingAllocator, but there would be room for extension if needed (for CPU-focused data loading, for example).
What I'd like to see are some examples / case studies of when you would use UVM, as opposed to traditional management. What is it good for, and what is a typical user program that is using UVM going to look like? How would you have written this code in straight line C? If you want to be ROCm specific that's fine.
This RFC proposes to add Unified Virtual Memory (UVM) (or "Managed Memory") functionality utilizing the managed memory allocation APIs available in CUDA/ROCm.

The proposed changes to the front end and back end have been minimized as much as possible so that they have a very targeted effect when UVM is enabled and no effect at all when UVM is disabled, which will of course remain the default.

Please note that the details of these proposals are subject to revision given feedback from users and prototype testing. Please feel free to comment on the RFC with your feedback.