When I try to train the debug model in torchtitan on a single B200 machine, it only trains correctly when limited to a single GPU; with 2 or more GPUs I hit an NCCL error or a hang. Details:
//
// 1 GPU: it works!
//
(pytorch) [[email protected] ~/local/torchtitan (main)]$ with-proxy CONFIG_FILE="./train_configs/debug_model.toml" NGPU=1 CUDA_VISIBLE_DEVICES=0 ./run_llama_train.sh
...
trains as usual!
//
// 2 GPUs: NCCL error
//
(pytorch) [[email protected] ~/local/torchtitan (main)]$ with-proxy CONFIG_FILE="./train_configs/debug_model.toml" NGPU=2 CUDA_VISIBLE_DEVICES=0,1 ./run_llama_train.sh
...
[rank0]:2025-01-28 13:23:00,193 - root - INFO - Training starts at step 1, with local batch size 8, global batch size 16, sequence length 2048, total steps 10 (warmup 2)
[rank0]:[rank0]:[E128 13:23:00.788972321 ProcessGroupNCCL.cpp:1897] [PG ID 0 PG GUID 0(default_pg) Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
[rank0]:CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank0]:For debugging consider passing CUDA_LAUNCH_BLOCKING=1
[rank0]:Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[rank0]:
[rank0]:Exception raised from c10_cuda_check_implementation at /data/users/vasiliy/pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
[rank0]:frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7fdfa3b8d6c8 in /data/users/vasiliy/pytorch/torch/lib/libc10.so)
[rank0]:frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe4 (0x7fdfa3b23426 in /data/users/vasiliy/pytorch/torch/lib/libc10.so)
[rank0]:frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3be (0x7fdfa3fb9f4e in /data/users/vasiliy/pytorch/torch/lib/libc10_cuda.so)
[rank0]:frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7fdf86956606 in /data/users/vasiliy/pytorch/torch/lib/libtorch_cuda.so)
[rank0]:frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x7fdf86965770 in /data/users/vasiliy/pytorch/torch/lib/libtorch_cuda.so)
[rank0]:frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x606 (0x7fdf86966b96 in /data/users/vasiliy/pytorch/torch/lib/libtorch_cuda.so)
[rank0]:frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x148 (0x7fdf86967b78 in /data/users/vasiliy/pytorch/torch/lib/libtorch_cuda.so)
[rank0]:frame #7: <unknown function> + 0xdbbf4 (0x7fdf85adbbf4 in /home/vasiliy/.conda/envs/pytorch/lib/libstdc++.so.6)
[rank0]:frame #8: <unknown function> + 0x89e92 (0x7fdfa4889e92 in /lib64/libc.so.6)
[rank0]:frame #9: <unknown function> + 0x10ef20 (0x7fdfa490ef20 in /lib64/libc.so.6)
[rank0]:
[rank0]:Fatal Python error: Aborted
[rank0]:
[rank0]:Thread 0x00007fdcbce00640 (most recent call first):
[rank0]:  File "/home/vasiliy/.conda/envs/pytorch/lib/python3.11/threading.py", line 324 in wait
[rank0]:  File "/home/vasiliy/.conda/envs/pytorch/lib/python3.11/threading.py", line 622 in wait
[rank0]:  File "/home/vasiliy/.conda/envs/pytorch/lib/python3.11/site-packages/tqdm/_monitor.py", line 60 in run
[rank0]:  File "/home/vasiliy/.conda/envs/pytorch/lib/python3.11/threading.py", line 1038 in _bootstrap_inner
[rank0]:  File "/home/vasiliy/.conda/envs/pytorch/lib/python3.11/threading.py", line 995 in _bootstrap
[rank0]:
[rank0]:Thread 0x00007fdfa4a89400 (most recent call first):
[rank0]:  File "/data/users/vasiliy/pytorch/torch/autograd/grad_mode.py", line 85 in __exit__
[rank0]:  File "/data/users/vasiliy/pytorch/torch/utils/_contextlib.py", line 115 in decorate_context
[rank0]:  File "/data/users/vasiliy/pytorch/torch/distributed/fsdp/_fully_shard/_fsdp_param_group.py", line 288 in wait_for_unshard
[rank0]:  File "/data/users/vasiliy/pytorch/torch/distributed/fsdp/_fully_shard/_fsdp_param_group.py", line 335 in pre_forward
[rank0]:  File "/data/users/vasiliy/pytorch/torch/distributed/fsdp/_fully_shard/_fsdp_state.py", line 230 in _pre_forward
[rank0]:  File "/data/users/vasiliy/pytorch/torch/_dynamo/eval_frame.py", line 745 in _fn
[rank0]:  File "/data/users/vasiliy/pytorch/torch/distributed/fsdp/_fully_shard/_fsdp_state.py", line 62 in fsdp_hook_wrapper
[rank0]:  File "/data/users/vasiliy/pytorch/torch/nn/modules/module.py", line 1782 in inner
[rank0]:  File "/data/users/vasiliy/pytorch/torch/nn/modules/module.py", line 1855 in _call_impl
[rank0]:  File "/data/users/vasiliy/pytorch/torch/nn/modules/module.py", line 1749 in _wrapped_call_impl
[rank0]:  File "/data/users/vasiliy/torchtitan/train.py", line 309 in main
[rank0]:  File "/data/users/vasiliy/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355 in wrapper
[rank0]:  File "/data/users/vasiliy/torchtitan/train.py", line 436 in <module>
//
// full error P1720698802: https://www.internalfb.com/intern/paste/P1720698802/
//
// 4 GPUs: hangs indefinitely on first forward
//
(pytorch) [[email protected] ~/local/torchtitan (main)]$ with-proxy CONFIG_FILE="./train_configs/debug_model.toml" NGPU=4 CUDA_VISIBLE_DEVICES=0,1 ./run_llama_train.sh
...
[rank0]:2025-01-28 13:24:27,764 - root - INFO - CUDA memory usage for model: 0.01GiB(0.01%)
[rank0]:2025-01-28 13:24:27,764 - root - INFO - Training starts at step 1, with local batch size 8, global batch size 32, sequence length 2048, total steps 10 (warmup 2)
...hangs here!....
//
// full error https://gist.github.com/vkuzo/ce6547e13740b437ee93a1ebf58f7dc4
//
// 8 GPUs: same as 4 GPUs
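To narrow down whether this is specific to torchtitan/FSDP2 or a more general c10d/NCCL problem on this machine, a minimal multi-GPU all-reduce smoke test might help. This is a sketch, independent of torchtitan, and assumes it is launched with `torchrun` (which sets `RANK` and `LOCAL_RANK`):

```python
# nccl_smoke_test.py -- minimal c10d/NCCL sanity check, independent of torchtitan.
# Launch with e.g.: torchrun --nproc_per_node=2 nccl_smoke_test.py
import os

import torch
import torch.distributed as dist


def main():
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    # One all-reduce per rank; if multi-GPU NCCL is broken on this node,
    # this should hang or hit the same illegal-memory-access error.
    x = torch.ones(1024, device="cuda") * (rank + 1)
    dist.all_reduce(x)
    torch.cuda.synchronize()
    print(f"rank {rank}: all_reduce ok, x[0] = {x[0].item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Running it with `NCCL_DEBUG=INFO` (and `CUDA_LAUNCH_BLOCKING=1`, as the error message suggests) may also show which transport NCCL selects on this node and where it fails.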
@wconstab Is NCCL supported on B200 already? I've heard about NCCL issues on B200 from other people as well.
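For reference, a minimal way to check which NCCL version a given PyTorch build links against and what device capability it sees (a sketch; exact output will vary by build):

```python
import torch

# NCCL version this PyTorch build links against
print("torch:", torch.__version__)
print("nccl:", torch.cuda.nccl.version())

# Device capability as seen by PyTorch (a B200 should report compute capability 10.0)
# and the CUDA arch list this build was compiled for
print("device:", torch.cuda.get_device_name(0))
print("capability:", torch.cuda.get_device_capability(0))
print("arch list:", torch.cuda.get_arch_list())
```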
cc: @yifuwang