[Usage]: Running Tensor Parallel on TPUs on Ray Cluster #12058

Open
1 task done
BabyChouSr opened this issue Jan 14, 2025 · 6 comments
Labels: ray (anything related with ray), tpu (Related to Google TPUs), usage (How to use vllm)

Comments

@BabyChouSr

BabyChouSr commented Jan 14, 2025

Your current environment

The output of `python collect_env.py`
(test_hf_qwen pid=17527, ip=10.130.4.26) Environment Information:
(test_hf_qwen pid=17527, ip=10.130.4.26) Collecting environment information...
(test_hf_qwen pid=17527, ip=10.130.4.26) PyTorch version: 2.6.0.dev20241126+cpu
(test_hf_qwen pid=17527, ip=10.130.4.26) Is debug build: False
(test_hf_qwen pid=17527, ip=10.130.4.26) CUDA used to build PyTorch: None
(test_hf_qwen pid=17527, ip=10.130.4.26) ROCM used to build PyTorch: N/A
(test_hf_qwen pid=17527, ip=10.130.4.26) 
(test_hf_qwen pid=17527, ip=10.130.4.26) OS: Ubuntu 22.04.4 LTS (x86_64)
(test_hf_qwen pid=17527, ip=10.130.4.26) GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
(test_hf_qwen pid=17527, ip=10.130.4.26) Clang version: 14.0.0-1ubuntu1.1
(test_hf_qwen pid=17527, ip=10.130.4.26) CMake version: version 3.31.2
(test_hf_qwen pid=17527, ip=10.130.4.26) Libc version: glibc-2.35
(test_hf_qwen pid=17527, ip=10.130.4.26) 
(test_hf_qwen pid=17527, ip=10.130.4.26) Python version: 3.11.9 (main, Apr 19 2024, 16:48:06) [GCC 11.2.0] (64-bit runtime)
(test_hf_qwen pid=17527, ip=10.130.4.26) Python platform: Linux-5.19.0-1022-gcp-x86_64-with-glibc2.35
(test_hf_qwen pid=17527, ip=10.130.4.26) Is CUDA available: False
(test_hf_qwen pid=17527, ip=10.130.4.26) CUDA runtime version: No CUDA
(test_hf_qwen pid=17527, ip=10.130.4.26) CUDA_MODULE_LOADING set to: N/A
(test_hf_qwen pid=17527, ip=10.130.4.26) GPU models and configuration: No CUDA
(test_hf_qwen pid=17527, ip=10.130.4.26) Nvidia driver version: No CUDA
(test_hf_qwen pid=17527, ip=10.130.4.26) cuDNN version: No CUDA
(test_hf_qwen pid=17527, ip=10.130.4.26) HIP runtime version: N/A
(test_hf_qwen pid=17527, ip=10.130.4.26) MIOpen runtime version: N/A
(test_hf_qwen pid=17527, ip=10.130.4.26) Is XNNPACK available: True
(test_hf_qwen pid=17527, ip=10.130.4.26) 
(test_hf_qwen pid=17527, ip=10.130.4.26) CPU:
(test_hf_qwen pid=17527, ip=10.130.4.26) Architecture:                    x86_64
(test_hf_qwen pid=17527, ip=10.130.4.26) CPU op-mode(s):                  32-bit, 64-bit
(test_hf_qwen pid=17527, ip=10.130.4.26) Address sizes:                   48 bits physical, 48 bits virtual
(test_hf_qwen pid=17527, ip=10.130.4.26) Byte Order:                      Little Endian
(test_hf_qwen pid=17527, ip=10.130.4.26) CPU(s):                          240
(test_hf_qwen pid=17527, ip=10.130.4.26) On-line CPU(s) list:             0-239
(test_hf_qwen pid=17527, ip=10.130.4.26) Vendor ID:                       AuthenticAMD
(test_hf_qwen pid=17527, ip=10.130.4.26) Model name:                      AMD EPYC 7B12
(test_hf_qwen pid=17527, ip=10.130.4.26) CPU family:                      23
(test_hf_qwen pid=17527, ip=10.130.4.26) Model:                           49
(test_hf_qwen pid=17527, ip=10.130.4.26) Thread(s) per core:              2
(test_hf_qwen pid=17527, ip=10.130.4.26) Core(s) per socket:              60
(test_hf_qwen pid=17527, ip=10.130.4.26) Socket(s):                       2
(test_hf_qwen pid=17527, ip=10.130.4.26) Stepping:                        0
(test_hf_qwen pid=17527, ip=10.130.4.26) BogoMIPS:                        4499.99
(test_hf_qwen pid=17527, ip=10.130.4.26) Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr arat npt nrip_save umip rdpid
(test_hf_qwen pid=17527, ip=10.130.4.26) Hypervisor vendor:               KVM
(test_hf_qwen pid=17527, ip=10.130.4.26) Virtualization type:             full
(test_hf_qwen pid=17527, ip=10.130.4.26) L1d cache:                       3.8 MiB (120 instances)
(test_hf_qwen pid=17527, ip=10.130.4.26) L1i cache:                       3.8 MiB (120 instances)
(test_hf_qwen pid=17527, ip=10.130.4.26) L2 cache:                        60 MiB (120 instances)
(test_hf_qwen pid=17527, ip=10.130.4.26) L3 cache:                        480 MiB (30 instances)
(test_hf_qwen pid=17527, ip=10.130.4.26) NUMA node(s):                    2
(test_hf_qwen pid=17527, ip=10.130.4.26) NUMA node0 CPU(s):               0-59,120-179
(test_hf_qwen pid=17527, ip=10.130.4.26) NUMA node1 CPU(s):               60-119,180-239
(test_hf_qwen pid=17527, ip=10.130.4.26) Vulnerability Itlb multihit:     Not affected
(test_hf_qwen pid=17527, ip=10.130.4.26) Vulnerability L1tf:              Not affected
(test_hf_qwen pid=17527, ip=10.130.4.26) Vulnerability Mds:               Not affected
(test_hf_qwen pid=17527, ip=10.130.4.26) Vulnerability Meltdown:          Not affected
(test_hf_qwen pid=17527, ip=10.130.4.26) Vulnerability Mmio stale data:   Not affected
(test_hf_qwen pid=17527, ip=10.130.4.26) Vulnerability Retbleed:          Mitigation; untrained return thunk; SMT enabled with STIBP protection
(test_hf_qwen pid=17527, ip=10.130.4.26) Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
(test_hf_qwen pid=17527, ip=10.130.4.26) Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
(test_hf_qwen pid=17527, ip=10.130.4.26) Vulnerability Spectre v2:        Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
(test_hf_qwen pid=17527, ip=10.130.4.26) Vulnerability Srbds:             Not affected
(test_hf_qwen pid=17527, ip=10.130.4.26) Vulnerability Tsx async abort:   Not affected
(test_hf_qwen pid=17527, ip=10.130.4.26) 
(test_hf_qwen pid=17527, ip=10.130.4.26) Versions of relevant libraries:
(test_hf_qwen pid=17527, ip=10.130.4.26) [pip3] mypy-extensions==1.0.0
(test_hf_qwen pid=17527, ip=10.130.4.26) [pip3] numpy==1.26.4
(test_hf_qwen pid=17527, ip=10.130.4.26) [pip3] nvidia-cublas-cu12==12.4.5.8
(test_hf_qwen pid=17527, ip=10.130.4.26) [pip3] nvidia-cuda-cupti-cu12==12.4.127
(test_hf_qwen pid=17527, ip=10.130.4.26) [pip3] nvidia-cuda-nvrtc-cu12==12.4.127
(test_hf_qwen pid=17527, ip=10.130.4.26) [pip3] nvidia-cuda-runtime-cu12==12.4.127
(test_hf_qwen pid=17527, ip=10.130.4.26) [pip3] nvidia-cudnn-cu12==9.1.0.70
(test_hf_qwen pid=17527, ip=10.130.4.26) [pip3] nvidia-cufft-cu12==11.2.1.3
(test_hf_qwen pid=17527, ip=10.130.4.26) [pip3] nvidia-curand-cu12==10.3.5.147
(test_hf_qwen pid=17527, ip=10.130.4.26) [pip3] nvidia-cusolver-cu12==11.6.1.9
(test_hf_qwen pid=17527, ip=10.130.4.26) [pip3] nvidia-cusparse-cu12==12.3.1.170
(test_hf_qwen pid=17527, ip=10.130.4.26) [pip3] nvidia-nccl-cu12==2.21.5
(test_hf_qwen pid=17527, ip=10.130.4.26) [pip3] nvidia-nvjitlink-cu12==12.4.127
(test_hf_qwen pid=17527, ip=10.130.4.26) [pip3] nvidia-nvtx-cu12==12.4.127
(test_hf_qwen pid=17527, ip=10.130.4.26) [pip3] pyzmq==26.2.0
(test_hf_qwen pid=17527, ip=10.130.4.26) [pip3] torch==2.6.0.dev20241126+cpu
(test_hf_qwen pid=17527, ip=10.130.4.26) [pip3] torch-xla==2.6.0+git39e67b5
(test_hf_qwen pid=17527, ip=10.130.4.26) [pip3] torchvision==0.20.0.dev20241126+cpu
(test_hf_qwen pid=17527, ip=10.130.4.26) [pip3] transformers==4.47.1
(test_hf_qwen pid=17527, ip=10.130.4.26) [pip3] triton==3.1.0
(test_hf_qwen pid=17527, ip=10.130.4.26) [conda] numpy                     1.26.4                   pypi_0    pypi
(test_hf_qwen pid=17527, ip=10.130.4.26) [conda] nvidia-cublas-cu12        12.4.5.8                 pypi_0    pypi
(test_hf_qwen pid=17527, ip=10.130.4.26) [conda] nvidia-cuda-cupti-cu12    12.4.127                 pypi_0    pypi
(test_hf_qwen pid=17527, ip=10.130.4.26) [conda] nvidia-cuda-nvrtc-cu12    12.4.127                 pypi_0    pypi
(test_hf_qwen pid=17527, ip=10.130.4.26) [conda] nvidia-cuda-runtime-cu12  12.4.127                 pypi_0    pypi
(test_hf_qwen pid=17527, ip=10.130.4.26) [conda] nvidia-cudnn-cu12         9.1.0.70                 pypi_0    pypi
(test_hf_qwen pid=17527, ip=10.130.4.26) [conda] nvidia-cufft-cu12         11.2.1.3                 pypi_0    pypi
(test_hf_qwen pid=17527, ip=10.130.4.26) [conda] nvidia-curand-cu12        10.3.5.147               pypi_0    pypi
(test_hf_qwen pid=17527, ip=10.130.4.26) [conda] nvidia-cusolver-cu12      11.6.1.9                 pypi_0    pypi
(test_hf_qwen pid=17527, ip=10.130.4.26) [conda] nvidia-cusparse-cu12      12.3.1.170               pypi_0    pypi
(test_hf_qwen pid=17527, ip=10.130.4.26) [conda] nvidia-nccl-cu12          2.21.5                   pypi_0    pypi
(test_hf_qwen pid=17527, ip=10.130.4.26) [conda] nvidia-nvjitlink-cu12     12.4.127                 pypi_0    pypi
(test_hf_qwen pid=17527, ip=10.130.4.26) [conda] nvidia-nvtx-cu12          12.4.127                 pypi_0    pypi
(test_hf_qwen pid=17527, ip=10.130.4.26) [conda] pyzmq                     26.2.0                   pypi_0    pypi
(test_hf_qwen pid=17527, ip=10.130.4.26) [conda] torch                     2.6.0.dev20241126+cpu          pypi_0    pypi
(test_hf_qwen pid=17527, ip=10.130.4.26) [conda] torch-xla                 2.6.0+git39e67b5          pypi_0    pypi
(test_hf_qwen pid=17527, ip=10.130.4.26) [conda] torchvision               0.20.0.dev20241126+cpu          pypi_0    pypi
(test_hf_qwen pid=17527, ip=10.130.4.26) [conda] transformers              4.47.1                   pypi_0    pypi
(test_hf_qwen pid=17527, ip=10.130.4.26) [conda] triton                    3.1.0                    pypi_0    pypi
(test_hf_qwen pid=17527, ip=10.130.4.26) ROCM Version: Could not collect
(test_hf_qwen pid=17527, ip=10.130.4.26) Neuron SDK Version: N/A
(test_hf_qwen pid=17527, ip=10.130.4.26) vLLM Version: N/A (dev)
(test_hf_qwen pid=17527, ip=10.130.4.26) vLLM Build Flags:
(test_hf_qwen pid=17527, ip=10.130.4.26) CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
(test_hf_qwen pid=17527, ip=10.130.4.26) GPU Topology:
(test_hf_qwen pid=17527, ip=10.130.4.26) Could not collect
(test_hf_qwen pid=17527, ip=10.130.4.26) 
(test_hf_qwen pid=17527, ip=10.130.4.26) LD_LIBRARY_PATH=/home/ray/anaconda3/lib/python3.11/site-packages/cv2/../../lib64:/home/ray/anaconda3/lib/python3.11/site-packages/cv2/../../lib64::/usr/lib/x86_64-linux-gnu/:/home/ray/anaconda3/lib
(test_hf_qwen pid=17527, ip=10.130.4.26) OMP_NUM_THREADS=1
(test_hf_qwen pid=17527, ip=10.130.4.26) CUDA_VISIBLE_DEVICES=
(test_hf_qwen pid=17527, ip=10.130.4.26) CUDA_VISIBLE_DEVICES=
(test_hf_qwen pid=17527, ip=10.130.4.26) TORCHINDUCTOR_COMPILE_THREADS=1
(test_hf_qwen pid=17527, ip=10.130.4.26) TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_ray

How would you like to use vllm

I want to run tensor-parallel inference on TPUs in a Ray cluster. The Ray cluster appears to detect the accelerators we need, but when vLLM initializes its Ray cluster it does not seem to know that, so it does not reuse the TPUs the cluster has already picked up. How do people usually set this up? Thanks!

Code:

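import ray
from vllm import LLM

# Reserve all 4 TPU chips plus the v4-8 head resource for this task.
@ray.remote(resources={"TPU": 4, "TPU-v4-8-head": 1})
def test():
    # The model name must be passed as a quoted string.
    llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", enforce_eager=True, max_model_len=8192, tensor_parallel_size=4)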

Error:

(test_hf pid=1616, ip=10.130.0.8) INFO 01-15 09:04:53 config.py:510] This model supports multiple tasks: {'generate', 'score', 'embed', 'classify', 'reward'}. Defaulting to 'generate'.
(test_hf pid=1616, ip=10.130.0.8) Connecting to existing Ray cluster at address: 10.130.2.110:6379...
(test_hf pid=1616, ip=10.130.0.8) Calling ray.init() again after it has already been called.
Traceback (most recent call last):
  File "/tmp/ray/session_2025-01-09_10-24-52_724484_545/runtime_resources/working_dir_files/_ray_pkg_0e6ac5e67e3f89c8/experiments/llama_test.py", line 75, in <module>
    raise e
  File "/tmp/ray/session_2025-01-09_10-24-52_724484_545/runtime_resources/working_dir_files/_ray_pkg_0e6ac5e67e3f89c8/experiments/llama_test.py", line 73, in <module>
    ray.get(ref)
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/worker.py", line 2691, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/worker.py", line 871, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::test_hf() (pid=1616, ip=10.130.0.8)
  File "/tmp/ray/session_2025-01-09_10-24-52_724484_545/runtime_resources/working_dir_files/_ray_pkg_0e6ac5e67e3f89c8/experiments/llama_test.py", line 59, in test_hf
    classifier = AutoClassifier.from_model_path(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/ray/session_2025-01-09_10-24-52_724484_545/runtime_resources/working_dir_files/_ray_pkg_0e6ac5e67e3f89c8/marin/processing/classification/classifier.py", line 281, in from_model_path
    return cls._MODEL_NAME_TO_CLS_DICT[key](model_name_or_path, attribute_name, model_type, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/ray/session_2025-01-09_10-24-52_724484_545/runtime_resources/working_dir_files/_ray_pkg_0e6ac5e67e3f89c8/marin/processing/classification/classifier.py", line 213, in __init__
    self.llm = LLM(model=model_name, enforce_eager=True, max_model_len=8192, tensor_parallel_size=4)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/vllm-0.6.6.post1/vllm/utils.py", line 986, in inner
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/vllm-0.6.6.post1/vllm/entrypoints/llm.py", line 230, in __init__
    self.llm_engine = self.engine_class.from_engine_args(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/vllm-0.6.6.post1/vllm/engine/llm_engine.py", line 515, in from_engine_args
    executor_class = cls._get_executor_cls(engine_config)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/vllm-0.6.6.post1/vllm/engine/llm_engine.py", line 453, in _get_executor_cls
    initialize_ray_cluster(engine_config.parallel_config)
  File "/opt/vllm/vllm-0.6.6.post1/vllm/executor/ray_utils.py", line 300, in initialize_ray_cluster
    raise ValueError(
ValueError: Current node has no TPU available. current_node_resource={'ray-marin-us-central2-worker-2c310153-tpu': 1.0, 'CPU': 118.0, 'memory': 328490690150.0, 'object_store_memory': 32641751449.0, 'accelerator_type:TPU-V4': 1.0, 'node:10.130.0.8': 1.0}. vLLM engine cannot start without TPU. Make sure you have at least 1 TPU available in a node current_node_id='70354097fbebce320701224b766747b2c30936f9c1edf1d930d7723b' current_ip='10.130.0.8'.

@BabyChouSr BabyChouSr added the usage How to use vllm label Jan 14, 2025
@robertgshaw2-redhat
Collaborator

How did you install vLLM?

@BabyChouSr
Author

BabyChouSr commented Jan 14, 2025

Thanks for your quick reply!

Since #11695 hadn't been merged as of 0.6.6.post1, I use a small hack to install the requirements-tpu.txt dependencies manually in my Docker image. Here are the Dockerfile steps:

ARG VLLM_VERSION=0.6.6.post1

RUN sudo apt update && sudo apt install unzip -y
RUN sudo mkdir -p /opt/vllm
RUN sudo chown -R $(whoami) /opt/vllm
RUN cd /opt/vllm && curl -sLO "https://github.com/vllm-project/vllm/archive/refs/tags/v${VLLM_VERSION}.zip" && unzip v${VLLM_VERSION}.zip

WORKDIR /opt/vllm/vllm-${VLLM_VERSION}
RUN pip uninstall torch torch-xla -y
RUN sudo apt-get install libopenblas-base libopenmpi-dev libomp-dev -y
RUN pip install -r requirements-common.txt
RUN pip install "cmake>=3.26" ninja packaging "setuptools-scm>=8" wheel jinja2
RUN pip install --no-cache-dir torch_xla[tpu]@https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-2.6.0.dev20241126-cp311-cp311-linux_x86_64.whl -f https://storage.googleapis.com/libtpu-releases/index.html
RUN pip install torchvision==0.20.0.dev20241126+cpu torch==2.6.0.dev20241126+cpu --extra-index-url https://download.pytorch.org/whl/nightly/cpu
RUN pip install jax==0.4.36.dev20241122 jaxlib==0.4.36.dev20241122 -f https://storage.googleapis.com/jax-releases/jax_nightly_releases.html -f https://storage.googleapis.com/jax-releases/jaxlib_nightly_releases.html
RUN VLLM_TARGET_DEVICE="tpu" python3 setup.py develop

@ruisearch42 ruisearch42 added the ray anything related with ray label Jan 15, 2025
@ruisearch42
Collaborator

ruisearch42 commented Jan 15, 2025

Looks like Ray recognized 'accelerator_type:TPU-V4', but somehow the 'TPU' resource count was not auto-detected correctly. Maybe try debugging like this: #10155 (comment)
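To make that observation concrete, here is a minimal sketch that inspects the `current_node_resource` dictionary copied from the error message in the report. The helper names `tpu_count` and `accelerator_types` are hypothetical, and this is not vLLM's actual check; it only illustrates that the node advertises an `accelerator_type:TPU-V4` marker while carrying no plain `TPU` entry.

```python
# Resource dictionary copied verbatim from the ValueError in the report.
current_node_resource = {
    "ray-marin-us-central2-worker-2c310153-tpu": 1.0,
    "CPU": 118.0,
    "memory": 328490690150.0,
    "object_store_memory": 32641751449.0,
    "accelerator_type:TPU-V4": 1.0,
    "node:10.130.0.8": 1.0,
}


def tpu_count(resources: dict[str, float]) -> float:
    """Hypothetical helper: return the plain 'TPU' count, or 0 if absent."""
    return resources.get("TPU", 0.0)


def accelerator_types(resources: dict[str, float]) -> list[str]:
    """Hypothetical helper: list the 'accelerator_type:*' markers present."""
    prefix = "accelerator_type:"
    return [key[len(prefix):] for key in resources if key.startswith(prefix)]


print("TPU count:", tpu_count(current_node_resource))
print("Accelerator types:", accelerator_types(current_node_resource))
```

Running the same two helpers on the resource dictionary of a node where vLLM starts correctly would show whether a `TPU` key is expected there.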

@BabyChouSr
Author

Thanks for the help @ruisearch42, and hope you've been doing well! Here are some extra details that might help us debug. In the Ray remote function itself, I added the following:

print("TPU IDs: {}".format(ray.get_runtime_context().get_accelerator_ids()["TPU"]))
print("TPU_VISIBLE_CHIPS: {}".format(os.environ["TPU_VISIBLE_CHIPS"]))

The first line prints 0,1,2,3 as expected. The second line raises a KeyError because TPU_VISIBLE_CHIPS is not set as an environment variable.
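For this kind of debugging, a lookup that tolerates a missing variable keeps the task alive so the remaining diagnostics still print. The following is a small sketch using only the standard library; the helper name `read_env` and the `<unset>` placeholder are illustrative choices, not part of Ray or vLLM.

```python
import os


def read_env(name: str, default: str = "<unset>") -> str:
    """Return the variable's value, or a placeholder when it is not set."""
    return os.environ.get(name, default)


# Prints the placeholder instead of raising KeyError when the variable is missing.
print("TPU_VISIBLE_CHIPS: {}".format(read_env("TPU_VISIBLE_CHIPS")))
```

Using `os.environ.get` rather than `os.environ[...]` makes the unset case visible in the log without aborting the remote function.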

@BabyChouSr
Author

Another update: I spun up a new v4-8 instance manually, without Ray. Running vllm serve Qwen/Qwen2.5-7B-Instruct --tensor-parallel-size 4 works in that case. So the failure seems to occur because the instance connects to the existing Ray cluster and somehow does not pick up the TPUs correctly.

@bvrockwell
Contributor

@richardsliu @dyli-google

@ruisearch42 ruisearch42 added the tpu Related to Google TPUs label Jan 16, 2025