
UVM PR for Comment (DNM) #1101

Closed · wants to merge 5 commits

Conversation

@jayfurmanek (author):

No description provided.



def get_enabled_move():
    r"""Returns a bool indicating if Unified Virtual Memory is currently enabled."""
@shintaro-iwasaki commented on Sep 12, 2022:
Thanks for making a draft PR! Please let me add comments here.

What's the difference between "UVM" and "Move"? The description looks the same.

@jayfurmanek (author) replied:

Hi @shintaro-iwasaki!

Enabling "UVM" will cause all getAllocator calls to return the new Caching Managed Allocator.
Enabling "Move" will attempt to short-circuit copy calls that are determined only to be changing the device (as opposed to other tensor transformations, which would still require a copy).

@jayfurmanek (author):

The idea with enabling Move is to defer to the UVM driver to handle data placement. With Move off, PyTorch will always copy, even if UVM is on, from one managed allocation to another.
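A rough usage sketch, assuming the toggles described in this thread (set_enabled_uvm is shown in a later comment; set_enabled_move and the per-call behavior in the comments are assumptions by analogy with get_enabled_move above, not confirmed API):

import torch

torch.cuda.memory.set_enabled_uvm(True)    # getAllocator now returns the Caching Managed Allocator
torch.cuda.memory.set_enabled_move(True)   # assumed toggle: device-only .to() calls defer to the UVM driver

t = torch.randn(1024, 1024, device="cuda") # allocated through the managed allocator while UVM is on
t_cpu = t.to("cpu")                        # device-only change on managed storage: a prefetch hint,
                                           # not an explicit copy, when Move is on
t_gpu = t_cpu.to("cuda")                   # same on the way back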

@shintaro-iwasaki commented on Sep 12, 2022:

Thanks for your reply. So do you mean the following?

  • When "UVM" is on, the new UVM allocator will be used.
  • When "UVM" is off, the existing memory allocator will be used.
  • When "Move" is on, PyTorch checks whether the data is on UVM. If it is, the data is not explicitly copied; if it is not, the data is explicitly copied.
  • When "Move" is off, PyTorch always copies data explicitly, without checking whether it is on UVM.

Then, when would you want to disable "Move"? Is this check that heavy? Or do we sometimes want to copy data explicitly even when UVM is used?

@jayfurmanek (author) replied:

Almost.

> When "Move" is on, PyTorch checks whether the data is on UVM. If it is, the data is not explicitly copied; if it is not, the data is explicitly copied.

When Move is on, PyTorch does a few different checks before allowing the UVM driver to take control:

  • Is UVM enabled?
  • Is the storage ptr for the Tensor on a managed allocation?
  • Has the device changed in this to() command?
  • Are the other parameters unchanged? (value, layout, mem format)

These checks are in the aten/src/ATen/native/TensorConversions.cpp file (in to_will_move() and _to_move).
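A Python-level illustration (not the actual C++ in TensorConversions.cpp) of the conditions listed above; the helper name and the uvm_enabled/move_enabled/storage_is_managed flags are placeholders for state the PR tracks internally:

import torch

def would_defer_to_uvm(t, target_device, dtype=None, layout=None,
                       memory_format=None, *, uvm_enabled, move_enabled,
                       storage_is_managed):
    """Illustrative mirror of the checks described above; names are placeholders."""
    if not (uvm_enabled and move_enabled):
        return False                               # both toggles must be on
    if not storage_is_managed:
        return False                               # storage must sit on a managed allocation
    if torch.device(target_device) == t.device:
        return False                               # the device must actually change
    if dtype is not None and dtype != t.dtype:
        return False                               # a dtype change still needs a real copy
    if layout is not None and layout != t.layout:
        return False                               # likewise for layout
    if memory_format is not None:
        return False                               # ...and memory format
    return True                                    # defer data placement to the UVM driver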

@jayfurmanek (author):

We made Move a separate enable/disable mostly so that we could test each part separately.
There could be a scenario, however, where you know allowing the UVM driver to do copies will be slower but you still care about having tensors on managed memory to avoid running out of GPU memory.
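A minimal sketch of that scenario, assuming the same toggles as above (set_enabled_move is not confirmed API):

import torch

torch.cuda.memory.set_enabled_uvm(True)    # keep tensors on managed memory to avoid running out of GPU memory
torch.cuda.memory.set_enabled_move(False)  # ...but keep PyTorch's explicit managed-to-managed copies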

A reviewer commented:

> allowing the UVM driver to do copies will be slower but you still care about having tensors on managed memory to avoid running out of GPU memory.

I'm confused about how you could take advantage of managed memory to avoid running out of GPU memory but also avoid slow copies. Could you elaborate on which type of system you're referring to, and whether the cure for running out of memory relates to the 'move' functionality? Is 'move' enablement orthogonal to the functionality in the driver that pages data between GPU/CPU seamlessly as needed by the GPU?

@jayfurmanek (author) replied on Sep 15, 2022:

> whether the cure for running out of memory relates to the 'move' functionality?

No, that's not what I meant. Using managed allocations will help with running out of memory.

> Is 'move' enablement orthogonal to the functionality in the driver that pages data between GPU/CPU seamlessly as needed by the GPU?

Yes. The driver can page data in and out based on need (more memory is needed, or a chunk is being accessed).
Move is a function that, when enabled, defers PyTorch's explicit copying to the UVM driver. When it is off, PyTorch will copy from one managed chunk to another (as usual), where the target managed chunk was hinted to be on the GPU (or the CPU if you are going the other way).

On dGPUs, using Move may be a bit slower. In that case the data does need to get to GPU memory, and generally as soon as possible. In the move kernel we issue a prefetch to the target device instead of an explicit copy, and it is up to the driver to do the copy. The UVM driver also copies on page boundaries, which could be more than is needed. That said, prefetching managed chunks can be tuned, using events and hints, to come close to explicit copies performance-wise. NVIDIA has a blog post showing some of this: https://developer.nvidia.com/blog/maximizing-unified-memory-performance-cuda/

There is room for more tuning in the Move kernel that's here.
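One way to observe the effect described above on a dGPU might be a rough end-to-end timing, assuming the toggles discussed in this thread (set_enabled_move is not confirmed API):

import time
import torch

def end_to_end_seconds(t_cpu_resident):
    """Wall-clock time to move a CPU-resident tensor to the GPU and touch every element."""
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    t_gpu = t_cpu_resident.to("cuda")
    t_gpu.sum()                  # forces the pages to actually be resident on the GPU
    torch.cuda.synchronize()
    return time.perf_counter() - t0

torch.cuda.memory.set_enabled_uvm(True)
for move in (False, True):
    torch.cuda.memory.set_enabled_move(move)
    t = torch.randn(8192, 8192, device="cuda").to("cpu")  # managed allocation, currently resident on the CPU
    print(f"move={move}: {end_to_end_seconds(t):.3f} s")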

if (at::globalContext().userEnabledUVM())
  at::cuda::CachingManagedAllocator::raw_delete(data);
else
  c10::hip::HIPCachingAllocator::raw_delete(data);
@shintaro-iwasaki commented on Sep 12, 2022:

Does this PR intend to support fine-grained changes of UVM settings, like the following?

torch.cuda.memory.set_enabled_uvm(True)
# [Allocate Tensor1]
torch.cuda.memory.set_enabled_uvm(False)
# [Allocate Tensor2]
...

Because this seemingly reads a global setting to change the behavior of the destructor, I am afraid it can break if the setting is changed after the tensor is constructed.
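A minimal sketch of the failure mode being raised, extending the snippet above (assuming the destructor consults the flag at free time):

import torch

torch.cuda.memory.set_enabled_uvm(True)
t1 = torch.empty(1024, device="cuda")    # allocated by the Caching Managed Allocator
torch.cuda.memory.set_enabled_uvm(False)
del t1                                   # the destructor now sees UVM disabled and would route the
                                         # free to the regular caching allocator, not the managed one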

@jayfurmanek (author) replied:

Not yet. We can add some state to make that work, but we'd really like to be able to control it at the tensor level if possible, allowing some tensors to be managed and others not. That may solve the problem above in a more holistic way. We'd love to collaborate on that.

@shintaro-iwasaki:

Thank you! I added a few questions (see above). We'd also truly appreciate it if you could add the following:

  • Short examples (either in the code or just in the PR description) explaining how to use this functionality.
  • Tests for the real PR (though I understand this PR is a WIP and for demonstration purposes).

@jianyuh left a comment:

Thanks for working on UVM support in PyTorch core! There is a reference (hacky) implementation in FBGEMM (https://github.com/pytorch/FBGEMM/blob/main/fbgemm_gpu/src/cumem_utils.cu) with customized operators. I wonder if we can integrate more UVM features from there into PT core.

Comment on lines +1243 to +1244
//C10_CUDA_CHECK(cudaMemAdvise(ptr, size, cudaMemAdviseSetPreferredLocation, hint_device));
//C10_CUDA_CHECK(cudaMemAdvise(ptr, size, cudaMemAdviseSetAccessedBy, device));
A reviewer commented:

Should the default behavior be to set the preferred location to the CPU and the accessed-by device to the current device?

@jayfurmanek (author) replied:

Thanks for your interest! We debated this a bit, and it is an area that could certainly evolve with more testing/tuning.

t.sizes(), t.strides(), t.dtype().itemsize());

device_guard.set_index(cuda_device_index);
AT_CUDA_CHECK(cudaMemAdvise(

@jayfurmanek (author) replied:

Yes, but at the moment this is not exposed to the user as it is in FBGEMM, so we may not need all of them. Only the "move kernel" here uses it.

// request a prefetch to new device
uvm_cuda_mem_prefetch_async(iter.tensor(0), stream);

// An explicit sync is always needed when copying back to CPU
A reviewer commented:

I wonder why this explicit stream synchronization is required when we copy the UVM tensor to the CPU? It's good that we avoid the explicit copy when UVM is enabled and only change the device metadata.

@jayfurmanek (author) replied:

Because the CPU isn't bound by the implicit sync in the CUDA stream. This is kind of the big hammer, though. We might be able to use events to make sure we are in sync when going from GPU to CPU.
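A sketch of the ordering issue, expressed at the Python level (the PR performs the equivalent sync inside the move kernel, so a user would not write this call themselves):

import torch

t = torch.randn(1 << 20, device="cuda")    # managed allocation while UVM is on
c = t.to("cpu")                            # a prefetch to the CPU is enqueued on the CUDA stream
torch.cuda.current_stream().synchronize()  # the equivalent of the sync the move kernel performs:
print(c[0])                                # host-side reads must not race with the in-flight migration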

A reviewer commented:

I think this is consistent with .to('cpu') for CUDA tensors too, isn't it? E.g., unless you specify non_blocking=True, it syncs?

@jayfurmanek (author):

Link to the RFC:
pytorch/rfcs#36
