Process hangs on 'Setting up PyTorch plugin "bias_act_plugin"...' when using multiple GPUs #41
Comments
No, it definitely shouldn't take long; about the same as on a single GPU. I'd try a couple of things if the problem persists:
If you're running Docker, you should NOT need CUDA_VISIBLE_DEVICES separately. I think it's enough to configure the available devices using the […]
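(A minimal sketch, not from the thread itself, for checking which devices the process actually ends up seeing once CUDA_VISIBLE_DEVICES or the container's device configuration has taken effect; it only uses standard torch.cuda calls.)

```python
# Quick device-visibility check (illustrative; not part of the repo).
import os

import torch

# CUDA_VISIBLE_DEVICES only has an effect if it is set before CUDA is initialized.
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES", "<not set>"))
print("visible GPU count    =", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"  cuda:{i} -> {torch.cuda.get_device_name(i)}")
```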
@nurpax Thank you! For posterity: I left the […]
How do you do it on Windows? It used to work perfectly, but when I ran the project a few days later it got stuck.
I'm not sure of the exact location and don't have Windows access right now, but here's how you should be able to figure it out: change […]
Then run, for example, generate.py with default options and check the logs. On my computer it prints something like this: […]
This should reveal the Windows location for you.
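(If digging through the logs is awkward, a hedged alternative is to probe the usual cache locations directly. This sketch assumes a default setup; the hard-coded paths are just the ones reported in this thread, plus the TORCH_EXTENSIONS_DIR override that PyTorch honours.)

```python
# Best-effort guess at where torch.utils.cpp_extension puts JIT-built plugins.
import os

candidates = [
    os.environ.get("TORCH_EXTENSIONS_DIR"),            # explicit override, if set
    os.path.expanduser("~/.cache/torch_extensions"),   # common Linux default
    os.path.join(os.environ.get("LOCALAPPDATA", ""),   # Windows path reported below
                 "torch_extensions", "torch_extensions", "Cache"),
]

for path in candidates:
    if path and os.path.isdir(path):
        print("Found extension cache:", path)
```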
Thank you! The cache for Windows can be found in 'C:\Users\<user_name>\AppData\Local\torch_extensions\torch_extensions\Cache'. I was able to delete it, but I also had to reinstall ninja to build bias_act_plugin again. In the end, it worked.
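(For anyone who would rather script the cleanup: a sketch under the assumption that the cache root found above is correct. The plugin gets recompiled on the next run, which is where ninja comes in.)

```python
# Delete cached bias_act_plugin builds so they get recompiled from scratch.
import glob
import os
import shutil

cache_root = os.path.expanduser("~/.cache/torch_extensions")  # adjust to your cache location

# Builds can sit under a per-Python/CUDA subfolder (e.g. py310_cu121), so search recursively.
for plugin_dir in glob.glob(os.path.join(cache_root, "**", "bias_act_plugin"), recursive=True):
    print("removing", plugin_dir)
    shutil.rmtree(plugin_dir)
```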
Removing the stale lock file ~/.cache/torch_extensions/py310_cu121/bias_act_plugin/3cb576a0039689487cfba59279dd6d46-nvidia-geforce-rtx-2060/lock worked for me.
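(A less destructive variant, assuming only the lock files left behind by a killed run are the problem: remove just those and keep the compiled plugins. Only do this while nothing else is building extensions.)

```python
# Remove leftover 'lock' files without deleting the compiled extensions themselves.
from pathlib import Path

cache_root = Path.home() / ".cache" / "torch_extensions"  # adjust to your cache location

for lock_file in cache_root.rglob("lock"):
    print("removing", lock_file)
    lock_file.unlink()
```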
Original issue description:

I added these lines to train.py as lines 13 and 14 (right under `import os`): […]

I tested the process with --gpus 1, and it spent a few minutes on 'Setting up PyTorch plugin "bias_act_plugin"...' but then proceeded to train. However, with --gpus 4 it has been hanging on this line for an hour and a half.

Here's the `nvidia-smi` printout as well: […] As you can see, three of the GPUs (2, 3, 4) are at 100% utilization while the first one (0) is at 0%. The memory usage does not seem to be changing.

Do I just need to be more patient? On one GPU it really only took a couple of minutes to begin training.

EDIT: note that the device IDs (0, 2, 3, 4) are not consecutive.
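(The exact lines added to train.py are not shown above. Purely as a hypothetical reconstruction, based on the CUDA_VISIBLE_DEVICES discussion earlier in the thread and the non-consecutive device IDs, they may have looked something like the following; whatever they were, they have to run before anything initializes CUDA.)

```python
# Hypothetical example only; the issue does not show the actual lines that were added.
import os

# Restrict the process to the four non-consecutive GPUs (must happen before CUDA is initialized).
os.environ["CUDA_VISIBLE_DEVICES"] = "0,2,3,4"
```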