When I try to train the debug model in torchtitan on a single B200 machine, it only trains correctly when limited to a single GPU; with 2 or more GPUs I hit an NCCL error or a hang. Details:
//
// 1 GPU: it works!
//
(pytorch) [[email protected] ~/local/torchtitan (main)]$ with-proxy CONFIG_FILE="./train_configs/debug_model.toml" NGPU=1 CUDA_VISIBLE_DEVICES=0 ./run_llama_train.sh
...
trains as usual!
//
// 2 GPUs: NCCL error
//
(pytorch) [[email protected] ~/local/torchtitan (main)]$ with-proxy CONFIG_FILE="./train_configs/debug_model.toml" NGPU=2 CUDA_VISIBLE_DEVICES=0,1 ./run_llama_train.sh
...
[rank0]:2025-01-28 13:23:00,193 - root - INFO - Training starts at step 1, with local batch size 8, global batch size 16, sequence length 2048, total steps 10 (warmup 2)
[rank0]:[rank0]:[E128 13:23:00.788972321 ProcessGroupNCCL.cpp:1897] [PG ID 0 PG GUID 0(default_pg) Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
[rank0]:CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank0]:For debugging consider passing CUDA_LAUNCH_BLOCKING=1
[rank0]:Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[rank0]:
[rank0]:Exception raised from c10_cuda_check_implementation at /data/users/vasiliy/pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
[rank0]:frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7fdfa3b8d6c8 in /data/users/vasiliy/pytorch/torch/lib/libc10.so)
[rank0]:frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe4 (0x7fdfa3b23426 in /data/users/vasiliy/pytorch/torch/lib/libc10.so)
[rank0]:frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3be (0x7fdfa3fb9f4e in /data/users/vasiliy/pytorch/torch/lib/libc10_cuda.so)
[rank0]:frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7fdf86956606 in /data/users/vasiliy/pytorch/torch/lib/libtorch_cuda.so)
[rank0]:frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x7fdf86965770 in /data/users/vasiliy/pytorch/torch/lib/libtorch_cuda.so)
[rank0]:frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x606 (0x7fdf86966b96 in /data/users/vasiliy/pytorch/torch/lib/libtorch_cuda.so)
[rank0]:frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x148 (0x7fdf86967b78 in /data/users/vasiliy/pytorch/torch/lib/libtorch_cuda.so)
[rank0]:frame #7: <unknown function> + 0xdbbf4 (0x7fdf85adbbf4 in /home/vasiliy/.conda/envs/pytorch/lib/libstdc++.so.6)
[rank0]:frame #8: <unknown function> + 0x89e92 (0x7fdfa4889e92 in /lib64/libc.so.6)
[rank0]:frame #9: <unknown function> + 0x10ef20 (0x7fdfa490ef20 in /lib64/libc.so.6)
[rank0]:
[rank0]:Fatal Python error: Aborted
[rank0]:
[rank0]:Thread 0x00007fdcbce00640 (most recent call first):
[rank0]:  File "/home/vasiliy/.conda/envs/pytorch/lib/python3.11/threading.py", line 324 in wait
[rank0]:  File "/home/vasiliy/.conda/envs/pytorch/lib/python3.11/threading.py", line 622 in wait
[rank0]:  File "/home/vasiliy/.conda/envs/pytorch/lib/python3.11/site-packages/tqdm/_monitor.py", line 60 in run
[rank0]:  File "/home/vasiliy/.conda/envs/pytorch/lib/python3.11/threading.py", line 1038 in _bootstrap_inner
[rank0]:  File "/home/vasiliy/.conda/envs/pytorch/lib/python3.11/threading.py", line 995 in _bootstrap
[rank0]:
[rank0]:Thread 0x00007fdfa4a89400 (most recent call first):
[rank0]:  File "/data/users/vasiliy/pytorch/torch/autograd/grad_mode.py", line 85 in __exit__
[rank0]:  File "/data/users/vasiliy/pytorch/torch/utils/_contextlib.py", line 115 in decorate_context
[rank0]:  File "/data/users/vasiliy/pytorch/torch/distributed/fsdp/_fully_shard/_fsdp_param_group.py", line 288 in wait_for_unshard
[rank0]:  File "/data/users/vasiliy/pytorch/torch/distributed/fsdp/_fully_shard/_fsdp_param_group.py", line 335 in pre_forward
[rank0]:  File "/data/users/vasiliy/pytorch/torch/distributed/fsdp/_fully_shard/_fsdp_state.py", line 230 in _pre_forward
[rank0]:  File "/data/users/vasiliy/pytorch/torch/_dynamo/eval_frame.py", line 745 in _fn
[rank0]:  File "/data/users/vasiliy/pytorch/torch/distributed/fsdp/_fully_shard/_fsdp_state.py", line 62 in fsdp_hook_wrapper
[rank0]:  File "/data/users/vasiliy/pytorch/torch/nn/modules/module.py", line 1782 in inner
[rank0]:  File "/data/users/vasiliy/pytorch/torch/nn/modules/module.py", line 1855 in _call_impl
[rank0]:  File "/data/users/vasiliy/pytorch/torch/nn/modules/module.py", line 1749 in _wrapped_call_impl
[rank0]:  File "/data/users/vasiliy/torchtitan/train.py", line 309 in main
[rank0]:  File "/data/users/vasiliy/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355 in wrapper
[rank0]:  File "/data/users/vasiliy/torchtitan/train.py", line 436 in <module>
//
// full error P1720698802: https://www.internalfb.com/intern/paste/P1720698802/
//
// 4 GPUs: hangs indefinitely on first forward
//
(pytorch) [[email protected] ~/local/torchtitan (main)]$ with-proxy CONFIG_FILE="./train_configs/debug_model.toml" NGPU=4 CUDA_VISIBLE_DEVICES=0,1 ./run_llama_train.sh
...
[rank0]:2025-01-28 13:24:27,764 - root - INFO - CUDA memory usage for model: 0.01GiB(0.01%)
[rank0]:2025-01-28 13:24:27,764 - root - INFO - Training starts at step 1, with local batch size 8, global batch size 32, sequence length 2048, total steps 10 (warmup 2)
...hangs here!....
//
// full error https://gist.github.com/vkuzo/ce6547e13740b437ee93a1ebf58f7dc4
//
// 8 GPUs: same as 4 GPUs
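To narrow down whether this is specific to torchtitan/FSDP2 or a more general c10d/NCCL problem on this machine, a minimal multi-GPU all-reduce smoke test might help. This is a sketch, independent of torchtitan, and assumes it is launched with `torchrun` (which sets `RANK` and `LOCAL_RANK`):

```python
# nccl_smoke_test.py -- minimal c10d/NCCL sanity check, independent of torchtitan.
# Launch with e.g.: torchrun --nproc_per_node=2 nccl_smoke_test.py
import os

import torch
import torch.distributed as dist


def main():
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    # One all-reduce per rank; if multi-GPU NCCL is broken on this node,
    # this should hang or hit the same illegal-memory-access error.
    x = torch.ones(1024, device="cuda") * (rank + 1)
    dist.all_reduce(x)
    torch.cuda.synchronize()
    print(f"rank {rank}: all_reduce ok, x[0] = {x[0].item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Running it with `NCCL_DEBUG=INFO` (and `CUDA_LAUNCH_BLOCKING=1`, as the error message suggests) may also show which transport NCCL selects on this node and where it fails.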
@wconstab Is NCCL supported on B200 already? I've heard about NCCL issues on B200 from other people as well.
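For reference, a minimal way to check which NCCL version a given PyTorch build links against and what device capability it sees (a sketch; exact output will vary by build):

```python
import torch

# NCCL version this PyTorch build links against
print("torch:", torch.__version__)
print("nccl:", torch.cuda.nccl.version())

# Device capability as seen by PyTorch (a B200 should report compute capability 10.0)
# and the CUDA arch list this build was compiled for
print("device:", torch.cuda.get_device_name(0))
print("capability:", torch.cuda.get_device_capability(0))
print("arch list:", torch.cuda.get_arch_list())
```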
cc: @yifuwang