Allow GPU models to train on sparse matrices that exceed the size of available GPU memory #605
Use CUDA Unified Virtual Memory (UVM) for sparse matrices on the GPU. This allows GPU models to train on input sparse matrices that exceed the size of GPU memory, by letting CUDA page data to and from host memory on demand.
This has been tested on an ALS model with around 2B entries in the sparse matrix, on a GPU with 16GB of memory. Previously this OOM'ed, since we need around 32GB of GPU memory to store the sparse matrix and its transpose, but with this change training succeeded - and was around 20x faster on the GPU than on the CPU.
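For reference, a minimal sketch of the general technique: allocating the CSR arrays of a sparse matrix with `cudaMallocManaged` instead of `cudaMalloc`, so the driver can page them between host and device. The function and struct-free layout here are illustrative only, not the actual code in this PR; the `cudaMemAdvise` hint is an optional assumption about tuning, not something the PR necessarily does.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

#define CHECK_CUDA(call)                                                   \
    do {                                                                   \
        cudaError_t err = (call);                                          \
        if (err != cudaSuccess) {                                          \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                    \
                    cudaGetErrorString(err), __FILE__, __LINE__);          \
            return err;                                                    \
        }                                                                  \
    } while (0)

// Hypothetical helper: allocate the three CSR arrays in unified (managed)
// memory. The GPU can dereference these pointers directly; pages that do
// not fit in device memory are migrated from host RAM on demand, so the
// matrix can be larger than the GPU's physical memory.
cudaError_t alloc_managed_csr(int rows, long long nnz,
                              int** indptr, int** indices, float** data) {
    CHECK_CUDA(cudaMallocManaged((void**)indptr,  (rows + 1) * sizeof(int)));
    CHECK_CUDA(cudaMallocManaged((void**)indices, nnz * sizeof(int)));
    CHECK_CUDA(cudaMallocManaged((void**)data,    nnz * sizeof(float)));

    // Optional hint (assumed, not confirmed from the PR): the matrix is
    // read-only during training, so advise the driver it is read-mostly,
    // letting it keep read-only copies resident on the device.
    cudaMemAdvise(*indices, nnz * sizeof(int),   cudaMemAdviseSetReadMostly, 0);
    cudaMemAdvise(*data,    nnz * sizeof(float), cudaMemAdviseSetReadMostly, 0);
    return cudaSuccess;
}
```

With this kind of allocation, existing kernels that read the CSR arrays need no changes: the same pointers work on both host and device, and oversubscription is handled by the UVM paging machinery rather than by manual chunking.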