
Problem setting up inter-node communication in deepspeed, seems to be an infiniband issue? #35

Open
Alexis-BX opened this issue Nov 20, 2024 · 0 comments

Hello,
I am trying to set up multi-node training with deepspeed (0.12.2), torch (2.0.1), and the PDSH (2.35) launcher.
I installed PDSH by hand and everything works fine on a single node.
However, in a two-node setting the code hangs at the torch initialization step.
Digging into the details, it seems the nodes never establish inter-node communication (the master times out waiting for the workers on the second node to connect).
I assume this is an NCCL/InfiniBand issue, probably misconfigured on my side, but I am having trouble validating this hypothesis because most diagnostic tools require root.

Do you have any intuition as to why this is not working, or solutions/options/configurations I should explore? Can you see any issues in the NCCL config below, or could you confirm that everything works on your side?
Thanks for the help!

Below are the NCCL environment variables set:

export NCCL_SOCKET_IFNAME=ib0
export NCCL_DEBUG_SUBSYS=ALL
export NCCL_DEBUG=INFO
export NCCL_IB_HCA=mlx5_0,mlx5_2
export NCCL_ASYNC_ERROR_HANDLING=1
export NCCL_BLOCKING_WAIT=1
export NCCL_LAUNCH_MODE=PARALLEL
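
Since most IB diagnostic tools need root, one thing that can be checked without it is whether the interface named in NCCL_SOCKET_IFNAME exists on each node and whether the master's rendezvous port is reachable over plain TCP. A minimal sketch (the master hostname and port 29500 here are placeholder values; substitute your actual MASTER_ADDR/MASTER_PORT):

```python
import os
import socket

# Placeholder defaults; substitute your actual master node and port.
master_addr = os.environ.get("MASTER_ADDR", "t007-007")
master_port = int(os.environ.get("MASTER_PORT", "29500"))

# Check that the interface named in NCCL_SOCKET_IFNAME exists on this
# node (no root needed).
ifnames = [name for _, name in socket.if_nameindex()]
print("ib0 present:", "ib0" in ifnames)

# Check plain TCP reachability of the master's rendezvous port; the
# initial torch.distributed rendezvous only needs TCP, not IB verbs.
try:
    with socket.create_connection((master_addr, master_port), timeout=5):
        print("TCP connection to master succeeded")
except OSError as e:
    print("TCP connection to master failed:", e)
```

If ib0 is missing on the second node, or the TCP connection fails, the hang is happening before NCCL's IB transport is even involved.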

The code I am using to test this:

import os

import torch
import torch.distributed

# print('MASTER_PORT = ', os.getenv('MASTER_PORT'))
# print('MASTER_ADDR = ', os.getenv('MASTER_ADDR'))
# print('WORLD_SIZE = ', os.getenv('WORLD_SIZE'))
# print('RANK = ', os.getenv('RANK'))
# print('LOCAL_RANK = ', os.getenv('LOCAL_RANK'))

print('About to initialize PyTorch Distributed...', flush=True)
torch.distributed.init_process_group(backend='nccl') # <=== Hangs here
print('Completed initialization of PyTorch Distributed', flush=True)

print('Entering barrier...', flush=True)
torch.distributed.barrier()
print('Done with barrier', flush=True)
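
One way to narrow down whether the hang is NCCL/IB-specific is to run the same rendezvous over the TCP-only gloo backend. If this variant completes while the nccl version hangs, the rendezvous itself is fine and the problem is likely in NCCL's transport setup. A sketch (the setdefault lines are only so it also runs as a single process; under the launcher, the real environment variables take precedence):

```python
import os
import datetime

import torch.distributed

# Fallback values for standalone single-process runs; the deepspeed/PDSH
# launcher's environment variables take precedence when set.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")

# Same rendezvous as the nccl script, but over TCP-only gloo, with a
# short timeout so it fails fast instead of hanging indefinitely.
torch.distributed.init_process_group(
    backend="gloo",
    timeout=datetime.timedelta(seconds=60),
)
print("gloo init OK, rank", torch.distributed.get_rank(), flush=True)
torch.distributed.barrier()
torch.distributed.destroy_process_group()
```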

Command used to run:
deepspeed --hostfile [absolute path]/[jobid]-hosts [absolute path]/test.py

Sample hostfile:

t007-007 slots=8
t007-008 slots=8
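
One PDSH-specific pitfall worth ruling out: environment variables exported in the launching shell are not automatically propagated to the other node's processes. DeepSpeed's documented mechanism for this is a `.deepspeed_env` file (in the home or current directory) of VAR=VALUE lines that the launcher exports on every node. A sketch mirroring the variables above:

```
NCCL_SOCKET_IFNAME=ib0
NCCL_DEBUG=INFO
NCCL_DEBUG_SUBSYS=ALL
NCCL_IB_HCA=mlx5_0,mlx5_2
```

If the second node's NCCL_DEBUG output is missing from the logs, the variables likely never reached it.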