[Bug]: Multi-Node Tensor-Parallel broken by #11256 when TP > cuda_device_count per node #12132
Closed
Labels: bug
Your current environment
Running in a Docker container (Kubernetes) on (4) GH200 nodes, 1 GPU per node.
Model Input Dumps
```bash
python3 -m vllm.entrypoints.openai.api_server --model /models/my-model \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.95 \
    --served-model-name my-model \
    --trust-remote-code \
    --api-key "NONE" \
    --rope-scaling '{"rope_type":"dynamic","factor":4.0}' \
    --enable-prefix-caching \
    --max-model-len 131072
```
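For reference, the same parallel settings expressed through the offline Python entrypoint should exercise the same platform check (a sketch only; the rope-scaling and server-specific flags are omitted, and the model path is the placeholder from the command above):

```python
from vllm import LLM

# Same multi-node layout as above: 4-way tensor parallelism spread over
# 4 nodes with a single GH200 (1 GPU) each.
llm = LLM(
    model="/models/my-model",
    tensor_parallel_size=4,
    gpu_memory_utilization=0.95,
    trust_remote_code=True,
    enable_prefix_caching=True,
    max_model_len=131072,
)
```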
🐛 Describe the bug
@youkaichao, it looks like #11256 forces --tensor-parallel-size to be <= the per-node GPU count:
https://github.com/vllm-project/vllm/blob/main/vllm/platforms/cuda.py#L156
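For context, my reading of that line is roughly the following (a paraphrased sketch, not the literal vLLM source; `validate_tp_size` is a made-up name and `torch.cuda.device_count()` stands in for vLLM's stateless device-count helper): the configured world size is compared against the GPUs visible on the local node only.

```python
import torch


def validate_tp_size(tensor_parallel_size: int) -> None:
    # Paraphrase of the check linked above: the requested world size is
    # compared against the GPU count of the *local* node.
    local_gpus = torch.cuda.device_count()
    if tensor_parallel_size > local_gpus:
        # With (4) nodes x (1) GH200 and --tensor-parallel-size 4, this
        # always trips even though 4 GPUs exist across the cluster.
        raise RuntimeError(
            f"tensor_parallel_size={tensor_parallel_size} exceeds the "
            f"{local_gpus} CUDA device(s) visible on this node.")
```

v0.6.4.post1 accepted this cross-node layout, so the per-node comparison looks like the regression.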
Currently testing `main`: with (4) nodes and (1) GPU per node it fails, while the same model/code/execution works perfectly in v0.6.4.post1.