[BUG] Multi-gpu training notebook is giving error if we generate schema from core #651
Comments
This might be due to a change from dataloader. Specifically, NVIDIA-Merlin/dataloader@dbf8816 (and related NVIDIA-Merlin/dataloader@4301447). These dataloader changes did not make it to the 23.02 release (even though they are in the

$ docker run --rm -it --gpus all --net host -v ~/data:/workspace/data nvcr.io/nvidia/merlin/merlin-pytorch:23.02 bash
root@0810733-lcedt:/dataloader# git log
commit 02aad2124e247e6a4f229d6638eaaec0931aca8c (grafted, HEAD, tag: v23.02.00)
Author: Karl Higley <[email protected]>
Date: Mon Feb 13 15:23:40 2023 -0500
    Replace `nnzs` with `row_lengths` for clarity (#99)

If I install the problematic commit and run the multi-gpu notebook, the notebook fails:

$ docker run --rm -it --gpus all --net host -v ~/data:/workspace/data nvcr.io/nvidia/merlin/merlin-pytorch:23.02 bash
root@0810733-lcedt:/opt/tritonserver# cd /dataloader/
root@0810733-lcedt:/dataloader# git fetch origin 226ad6903a7abfb5c1288f20eaf7d91eb952e374
root@0810733-lcedt:/dataloader# git checkout 226ad6903a7abfb5c1288f20eaf7d91eb952e374
root@0810733-lcedt:/dataloader# pip install . --no-deps

It works again if I revert:
I didn't have time today to see which condition is failing and test out a solution. (My guess is the
I reopened this since I am getting an error when running the multi-gpu notebook.
I get the error "RuntimeError: CUDA error at: /usr/local/include/rmm/device_uvector.hpp:316: cudaErrorIllegalAddress an illegal memory access was encountered" when I run this notebook on a 2-GPU 32GB NGC instance with the 23.02 PyTorch container plus the main branch pulled and compiled for all six Merlin libraries (core, nvtabular, dataloader, models, systems, transformers4rec). The error is raised after executing the cell "! torchrun --nproc_per_node 2 pyt_trainer.py --path "/workspace/data/preproc_sessions_by_day" --learning-rate 0.0005". The full error message is below:
File "pyt_trainer.py", line 101, in
Original exception was:
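CUDA reports illegal memory accesses asynchronously, so the Python frame where the error surfaces is not necessarily the call that caused it. A minimal sketch of relaunching the same command with `CUDA_LAUNCH_BLOCKING=1`, which forces synchronous kernel launches, is shown below. The helper name `build_debug_launch` is hypothetical, and nothing in this thread confirms that this setting narrows down the dataloader problem; it is only a common first debugging step.

```python
import os
import shlex


def build_debug_launch(path, learning_rate=0.0005, nproc_per_node=2, base_env=None):
    """Return (cmd, env) for the torchrun call from the notebook cell.

    Hypothetical helper: it only assembles the command and environment.
    The script name and flags are copied from the cell quoted above.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Make kernel launches synchronous so the traceback points closer
    # to the operation that triggered the illegal access.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    cmd = [
        "torchrun",
        "--nproc_per_node", str(nproc_per_node),
        "pyt_trainer.py",
        "--path", path,
        "--learning-rate", str(learning_rate),
    ]
    return cmd, env


if __name__ == "__main__":
    cmd, env = build_debug_launch("/workspace/data/preproc_sessions_by_day")
    # Print the equivalent shell line instead of launching anything.
    print("CUDA_LAUNCH_BLOCKING=1", shlex.join(cmd))
    # To actually run it inside the container:
    #   import subprocess; subprocess.run(cmd, env=env, check=True)
```

Synchronous launches slow training down considerably, so this is only meant for reproducing the failure, not for regular runs.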
It looks like we are seeing an error again due to this change in the dataloader: NVIDIA-Merlin/dataloader@1452e82. If I check out the previous commit (a075ebfd2afc17b97bf8b271bebfbbe308f288e3), the notebook works. I'm not sure exactly which change in the problematic commit is causing the notebook to fail.
@jperez999 and @karlhigley FYI.
@edknv is working on NVIDIA-Merlin/dataloader#132, which might solve the multi-gpu error.
Unfortunately, NVIDIA-Merlin/dataloader#132 doesn't solve the problem for the notebook because we still get an error with list columns. See this issue for more details: NVIDIA-Merlin/dataloader#131.
Bug description
I am getting the following error when I run the multi-gpu training notebook
Steps/Code to reproduce bug
You need to run the 01 and 03 notebooks in this folder in order. For dataset generation you can use
Environment details
Additional context
I am using the merlin-pytorch:23.02 image with the latest main branches pulled from libs.
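Because the container image and the libraries pulled from main can be at different versions, it may help to record exactly what is installed when reporting this kind of failure. The sketch below uses only the standard library. The distribution names are my assumption of how the six Merlin libraries are packaged, and `installed_versions` is a hypothetical helper, not part of Merlin.

```python
from importlib import metadata

# Assumed distribution names for the six Merlin libraries.
MERLIN_DISTRIBUTIONS = (
    "merlin-core",
    "nvtabular",
    "merlin-dataloader",
    "merlin-models",
    "merlin-systems",
    "transformers4rec",
)


def installed_versions(names=MERLIN_DISTRIBUTIONS):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Running this inside the container before and after reinstalling a library from source gives a quick way to confirm which build the notebook is actually using.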