
UVM PR for Comment (DNM) #1101

Closed · wants to merge 5 commits

Conversation

@jayfurmanek (author):

No description provided.



def get_enabled_move():
    r"""Returns a bool indicating if Unified Virtual Memory is currently enabled."""
@shintaro-iwasaki commented on Sep 12, 2022:
Thanks for making a draft PR! Please let me add comments here.

What's the difference between "UVM" and "Move"? The description looks the same.

@jayfurmanek (author) replied:

Hi @shintaro-iwasaki!

Enabling "UVM" will cause all getAllocator calls to return the new Caching Managed Allocator.
Enabling "Move" will attempt to short-circuit copy calls that are determined only to be changing the device (as opposed to other tensor transformations, which would still require a copy).

@jayfurmanek (author):

The idea with enabling Move is to defer to the UVM driver to handle data placement. With Move off, PyTorch will always copy, even if UVM is on, from one managed allocation to another.
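A rough usage sketch, assuming the toggles described in this thread (set_enabled_uvm is shown in a later comment; set_enabled_move and the per-call behavior in the comments are assumptions by analogy with get_enabled_move above, not confirmed API):

import torch

torch.cuda.memory.set_enabled_uvm(True)    # getAllocator now returns the Caching Managed Allocator
torch.cuda.memory.set_enabled_move(True)   # assumed toggle: device-only .to() calls defer to the UVM driver

t = torch.randn(1024, 1024, device="cuda") # allocated through the managed allocator while UVM is on
t_cpu = t.to("cpu")                        # device-only change on managed storage: a prefetch hint,
                                           # not an explicit copy, when Move is on
t_gpu = t_cpu.to("cuda")                   # same on the way back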

@shintaro-iwasaki commented on Sep 12, 2022:

Thanks for your reply. So do you mean the following?

  • When "UVM" is on, the new UVM allocator will be used.
  • When "UVM" is off, the existing memory allocator will be used.
  • When "Move" is on, PyTorch checks whether the data is on UVM. If it is, the data is not explicitly copied; if it is not, the data is explicitly copied.
  • When "Move" is off, PyTorch always copies data explicitly, without checking whether it is on UVM.

Then, when would you want to disable "Move"? Is this check that heavy? Or do we sometimes want to copy data explicitly even when UVM is used?

@jayfurmanek (author) replied:

Almost.

> When "Move" is on, PyTorch checks whether the data is on UVM. If it is, the data is not explicitly copied; if it is not, the data is explicitly copied.

When Move is on, PyTorch does a few different checks before allowing the UVM driver to take control:

  • Is UVM enabled?
  • Is the storage ptr for the Tensor on a managed allocation?
  • Has the device changed in this to() command?
  • Are the other parameters unchanged? (value, layout, mem format)

These checks are in the aten/src/ATen/native/TensorConversions.cpp file (in to_will_move() and _to_move).
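A Python-level illustration (not the actual C++ in TensorConversions.cpp) of the conditions listed above; the helper name and the uvm_enabled/move_enabled/storage_is_managed flags are placeholders for state the PR tracks internally:

import torch

def would_defer_to_uvm(t, target_device, dtype=None, layout=None,
                       memory_format=None, *, uvm_enabled, move_enabled,
                       storage_is_managed):
    """Illustrative mirror of the checks described above; names are placeholders."""
    if not (uvm_enabled and move_enabled):
        return False                               # both toggles must be on
    if not storage_is_managed:
        return False                               # storage must sit on a managed allocation
    if torch.device(target_device) == t.device:
        return False                               # the device must actually change
    if dtype is not None and dtype != t.dtype:
        return False                               # a dtype change still needs a real copy
    if layout is not None and layout != t.layout:
        return False                               # likewise for layout
    if memory_format is not None:
        return False                               # ...and memory format
    return True                                    # defer data placement to the UVM driver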

@jayfurmanek (author):

We made Move a separate enable/disable mostly so that we could test each part separately.
There could be a scenario, however, where you know allowing the UVM driver to do copies will be slower but you still care about having tensors on managed memory to avoid running out of GPU memory.
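A minimal sketch of that scenario, assuming the same toggles as above (set_enabled_move is not confirmed API):

import torch

torch.cuda.memory.set_enabled_uvm(True)    # keep tensors on managed memory to avoid running out of GPU memory
torch.cuda.memory.set_enabled_move(False)  # ...but keep PyTorch's explicit managed-to-managed copies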

A reviewer commented:

> allowing the UVM driver to do copies will be slower but you still care about having tensors on managed memory to avoid running out of GPU memory.

I'm confused about how you could take advantage of managed memory to avoid running out of GPU memory but also avoid slow copies. Could you elaborate on which type of system you're referring to, and whether the cure for running out of memory relates to the 'move' functionality? Is 'move' enablement orthogonal to the functionality in the driver that pages data between GPU/CPU seamlessly as needed by the GPU?

@jayfurmanek (author) replied on Sep 15, 2022:

> whether the cure for running out of memory relates to the 'move' functionality?

No, that's not what I meant. Using managed allocations will help with running out of memory.

> Is 'move' enablement orthogonal to the functionality in the driver that pages data between GPU/CPU seamlessly as needed by the GPU?

Yes. The driver can page data in and out based on need (more memory is needed, or a chunk is being accessed).
Move is a function that, when enabled, defers PyTorch's explicit copying to the UVM driver. When it is off, PyTorch will copy from one managed chunk to another (as usual), where the target managed chunk was hinted to be on the GPU (or the CPU if you are going the other way).

On dGPUs, using Move may be a bit slower. In that case the data does need to get to GPU memory, and generally as soon as possible. In the move kernel we issue a prefetch to the target device instead of an explicit copy, and it is up to the driver to do the copy. The UVM driver also copies on page boundaries, which could be more than is needed. That said, prefetching managed chunks can be tuned, using events and hints, to come close to explicit copies performance-wise. NVIDIA has a blog post showing some of this: https://developer.nvidia.com/blog/maximizing-unified-memory-performance-cuda/

There is room for more tuning in the Move kernel that's here.
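One way to observe the effect described above on a dGPU might be a rough end-to-end timing, assuming the toggles discussed in this thread (set_enabled_move is not confirmed API):

import time
import torch

def end_to_end_seconds(t_cpu_resident):
    """Wall-clock time to move a CPU-resident tensor to the GPU and touch every element."""
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    t_gpu = t_cpu_resident.to("cuda")
    t_gpu.sum()                  # forces the pages to actually be resident on the GPU
    torch.cuda.synchronize()
    return time.perf_counter() - t0

torch.cuda.memory.set_enabled_uvm(True)
for move in (False, True):
    torch.cuda.memory.set_enabled_move(move)
    t = torch.randn(8192, 8192, device="cuda").to("cpu")  # managed allocation, currently resident on the CPU
    print(f"move={move}: {end_to_end_seconds(t):.3f} s")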

if (at::globalContext().userEnabledUVM())
  at::cuda::CachingManagedAllocator::raw_delete(data);
else
  c10::hip::HIPCachingAllocator::raw_delete(data);
@shintaro-iwasaki commented on Sep 12, 2022:

Does this PR intend to support fine-grained changes of UVM settings, like the following?

torch.cuda.memory.set_enabled_uvm(True)
# [Allocate Tensor1]
torch.cuda.memory.set_enabled_uvm(False)
# [Allocate Tensor2]
...

Because this seemingly reads a global setting to change the behavior of the destructor, I am afraid it can break if the setting is changed after the tensor is constructed.
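A minimal sketch of the failure mode being raised, extending the snippet above (assuming the destructor consults the flag at free time):

import torch

torch.cuda.memory.set_enabled_uvm(True)
t1 = torch.empty(1024, device="cuda")    # allocated by the Caching Managed Allocator
torch.cuda.memory.set_enabled_uvm(False)
del t1                                   # the destructor now sees UVM disabled and would route the
                                         # free to the regular caching allocator, not the managed one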

@jayfurmanek (author) replied:

Not yet. We can add some state to make that work, but we'd really like to be able to control it at the tensor level if possible, allowing some tensors to be managed and others not. That may solve the problem above in a more holistic way. We'd love to collaborate on that.

@shintaro-iwasaki:

Thank you! I added a few questions (see above). We'd also truly appreciate it if you could add the following:

  • Short examples (either in the code or just in the PR description) explaining how to use this functionality.
  • Tests for the real PR (though I understand this PR is a WIP and for demonstration purposes).

@jianyuh left a comment:

Thanks for working on UVM support in PyTorch core! There is a reference (hacky) implementation in FBGEMM (https://github.com/pytorch/FBGEMM/blob/main/fbgemm_gpu/src/cumem_utils.cu) with customized operators. I wonder if we can integrate more UVM features from there into PT core.

Comment on lines +1243 to +1244
//C10_CUDA_CHECK(cudaMemAdvise(ptr, size, cudaMemAdviseSetPreferredLocation, hint_device));
//C10_CUDA_CHECK(cudaMemAdvise(ptr, size, cudaMemAdviseSetAccessedBy, device));
A reviewer commented:

Should the default behavior be to set the preferred location to the CPU and the accessed-by device to the current device?

@jayfurmanek (author) replied:

Thanks for your interest! We debated this a bit, and it is an area that could certainly evolve with more testing/tuning.

t.sizes(), t.strides(), t.dtype().itemsize());

device_guard.set_index(cuda_device_index);
AT_CUDA_CHECK(cudaMemAdvise(

@jayfurmanek (author) replied:

Yes, but at the moment this is not exposed to the user as it is in FBGEMM, so we may not need all of them. Only the "move kernel" here uses it.

// request a prefetch to new device
uvm_cuda_mem_prefetch_async(iter.tensor(0), stream);

// An explicit sync is always needed when copying back to CPU
A reviewer commented:

I wonder why this explicit stream synchronization is required when we copy the UVM tensor to the CPU? It's good that we avoid the explicit copy when UVM is enabled and only change the device metadata.

@jayfurmanek (author) replied:

Because the CPU isn't bound by the implicit sync in the CUDA stream. This is kind of the big hammer, though. We might be able to use events to make sure we are in sync when going from GPU to CPU.
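A sketch of the ordering issue, expressed at the Python level (the PR performs the equivalent sync inside the move kernel, so a user would not write this call themselves):

import torch

t = torch.randn(1 << 20, device="cuda")    # managed allocation while UVM is on
c = t.to("cpu")                            # a prefetch to the CPU is enqueued on the CUDA stream
torch.cuda.current_stream().synchronize()  # the equivalent of the sync the move kernel performs:
print(c[0])                                # host-side reads must not race with the in-flight migration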

A reviewer commented:

I think this is consistent with .to('cpu') for CUDA tensors too, isn't it? E.g., unless you specify non_blocking=True, it syncs?

@jayfurmanek (author):

Link to the RFC:
pytorch/rfcs#36
