
[Bug]: OpenGVLab/InternVL2-Llama3-76B: view size is not compatible with input tensor's size and stride #8630

Closed
erkintelnyx opened this issue Sep 19, 2024 · 11 comments · Fixed by #11979
Labels: bug (Something isn't working), rocm, stale

Comments

@erkintelnyx

erkintelnyx commented Sep 19, 2024

Your current environment

The output of `python collect_env.py`
PyTorch version: 2.5.0.dev20240726+rocm6.1
Is debug build: False
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: 6.1.40091-a8dbc0c19

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: 17.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-6.1.2 24193 669db884972e769450470020c06a6f132a8a065b)
CMake version: version 3.26.4
Libc version: glibc-2.31

Python version: 3.9.19 (main, May  6 2024, 19:43:03)  [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-117-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: AMD Instinct MI100 (gfx908:sramecc+:xnack-)
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: 6.1.40093
MIOpen runtime version: 3.1.0
Is XNNPACK available: True

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Byte Order:                           Little Endian
Address sizes:                        48 bits physical, 48 bits virtual
CPU(s):                               16
On-line CPU(s) list:                  0-15
Thread(s) per core:                   1
Core(s) per socket:                   16
Socket(s):                            1
NUMA node(s):                         1
Vendor ID:                            AuthenticAMD
CPU family:                           25
Model:                                1
Model name:                           AMD EPYC 7713 64-Core Processor
Stepping:                             1
CPU MHz:                              2000.000
BogoMIPS:                             4000.00
Virtualization:                       AMD-V
Hypervisor vendor:                    KVM
Virtualization type:                  full
L1d cache:                            1 MiB
L1i cache:                            1 MiB
L2 cache:                             8 MiB
L3 cache:                             16 MiB
NUMA node0 CPU(s):                    0-15
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Mitigation; safe RET
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw perfctr_core invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr wbnoinvd arat npt lbrv nrip_save tsc_scale vmcb_clean flushbyasid pausefilter pfthreshold v_vmsave_vmload vgif umip pku ospke vaes vpclmulqdq rdpid fsrm arch_capabilities

Versions of relevant libraries:
[pip3] mypy==1.7.0
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] optree==0.9.1
[pip3] pytorch-triton-rocm==3.0.0+21eae954ef
[pip3] pyzmq==26.2.0
[pip3] torch==2.5.0.dev20240726+rocm6.1
[pip3] torchvision==0.20.0.dev20240726+rocm6.1
[pip3] transformers==4.43.2
[pip3] triton==3.0.0
[conda] No relevant packages
ROCM Version: 6.1.40093-bd86f1708
Neuron SDK Version: N/A
vLLM Version: 0.6.1.post2@a8c1d161a7d87dbc6c7cccfce303dcbe2e4ed6be
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
Could not collect

Model Input Dumps

err_execute_model_input_20240919-094504.pkl.zip

🐛 Describe the bug

When I start the model via:
vllm serve OpenGVLab/InternVL2-Llama3-76B --tensor-parallel-size 8 --max-model-len 8000

I get:

  File "/vllm-workspace/vllm/worker/model_runner_base.py", line 116, in _wrapper
    return func(*args, **kwargs)
  File "/vllm-workspace/vllm/worker/model_runner.py", line 1590, in execute_model
    hidden_or_intermediate_states = model_executable(
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1735, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1746, in _call_impl
    return forward_call(*args, **kwargs)
  File "/vllm-workspace/vllm/model_executor/models/internvl.py", line 488, in forward
    vision_embeddings = self._process_image_input(image_input)
  File "/vllm-workspace/vllm/model_executor/models/internvl.py", line 471, in _process_image_input
    image_embeds = self.extract_feature(image_input["data"])
  File "/vllm-workspace/vllm/model_executor/models/internvl.py", line 395, in extract_feature
    vit_embeds = self.vision_model(pixel_values=pixel_values)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1735, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1746, in _call_impl
    return forward_call(*args, **kwargs)
  File "/vllm-workspace/vllm/model_executor/models/intern_vit.py", line 356, in forward
    encoder_outputs = self.encoder(inputs_embeds=hidden_states)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1735, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1746, in _call_impl
    return forward_call(*args, **kwargs)
  File "/vllm-workspace/vllm/model_executor/models/intern_vit.py", line 298, in forward
    hidden_states = encoder_layer(hidden_states)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1735, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1746, in _call_impl
    return forward_call(*args, **kwargs)
  File "/vllm-workspace/vllm/model_executor/models/intern_vit.py", line 267, in forward
    hidden_states = hidden_states + self.attn(
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1735, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1746, in _call_impl
    return forward_call(*args, **kwargs)
  File "/vllm-workspace/vllm/model_executor/models/intern_vit.py", line 203, in forward
    x = x.transpose(1, 2).view(B, N, -1)
RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/envs/py_3.9/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/conda/envs/py_3.9/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/vllm-workspace/vllm/engine/multiprocessing/engine.py", line 318, in run_mp_engine
    engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
  File "/vllm-workspace/vllm/engine/multiprocessing/engine.py", line 113, in from_engine_args
    return cls(
  File "/vllm-workspace/vllm/engine/multiprocessing/engine.py", line 69, in __init__
    self.engine = LLMEngine(*args, **kwargs)
  File "/vllm-workspace/vllm/engine/llm_engine.py", line 331, in __init__
    self._initialize_kv_caches()
  File "/vllm-workspace/vllm/engine/llm_engine.py", line 465, in _initialize_kv_caches
    self.model_executor.determine_num_available_blocks())
  File "/vllm-workspace/vllm/executor/distributed_gpu_executor.py", line 39, in determine_num_available_blocks
    num_blocks = self._run_workers("determine_num_available_blocks", )
  File "/vllm-workspace/vllm/executor/multiproc_gpu_executor.py", line 185, in _run_workers
    driver_worker_output = driver_worker_method(*args, **kwargs)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/vllm-workspace/vllm/worker/worker.py", line 223, in determine_num_available_blocks
    self.model_runner.profile_run()
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/vllm-workspace/vllm/worker/model_runner.py", line 1236, in profile_run
    self.execute_model(model_input, kv_caches, intermediate_tensors)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/vllm-workspace/vllm/worker/model_runner_base.py", line 144, in _wrapper
    raise type(err)(
RuntimeError: Error in model execution (input dumped to /tmp/err_execute_model_input_20240919-094504.pkl): view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.
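For context on the error message itself, here is a minimal sketch (shapes are illustrative, not InternViT's actual dimensions) of why `.view()` fails after a transpose while `.reshape()` does not: the transpose makes the tensor non-contiguous, so its memory cannot be reinterpreted in place.

```python
import torch

# Illustrative shapes only: (batch, heads, seq, head_dim)
x = torch.randn(2, 8, 4, 16)
y = x.transpose(1, 2)      # (batch, seq, heads, head_dim), non-contiguous

print(y.is_contiguous())   # False
z = y.reshape(2, 4, -1)    # works: falls back to a copy when a plain view is impossible
z = y.view(2, 4, -1)       # raises "view size is not compatible with input tensor's size and stride"
```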
erkintelnyx added the bug label on Sep 19, 2024
@DarkLight1337
Member

Hmm, I ran this locally and didn't get such an error. Could you share the dimensions of the images that you input to the model?

@erkintelnyx
Author

This happens at server startup, before any inference is done.

@DarkLight1337
Member

DarkLight1337 commented Sep 19, 2024

Looking at your environment, it seems that you're running this on AMD GPUs. Maybe there is some bug related to that? @youkaichao @WoosukKwon

@DarkLight1337
Member

DarkLight1337 commented Sep 19, 2024

It may be that F.scaled_dot_product_attention followed by transpose(1, 2) results in non-contiguous output for AMD.
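A rough way to test this hypothesis (the shapes below are placeholders, not the actual InternViT dimensions) is to check whether the transposed SDPA output is contiguous on the ROCm build:

```python
import torch
import torch.nn.functional as F

B, H, N, D = 1, 25, 1025, 64   # placeholder shapes
q = torch.randn(B, H, N, D, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

out = F.scaled_dot_product_attention(q, k, v)  # (B, H, N, D)
out_t = out.transpose(1, 2)                    # (B, N, H, D)

print(out_t.is_contiguous())
# If this prints False, out_t.view(B, N, -1) fails with the reported error,
# while out_t.reshape(B, N, -1) (or .contiguous() before .view) works.
```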

@DarkLight1337
Member

DarkLight1337 commented Sep 20, 2024

I just tried running this command (after downloading the HF repo locally) on 8x MI250 (ROCm 6.1) and failed to repro this issue. Can you tell us more about your setup by running `rocm-smi --showtopo`?

My `collect_env.py` output:
PyTorch version: 2.5.0.dev20240708+rocm6.1
Is debug build: False
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: 6.1.40091-a8dbc0c19

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.6
Libc version: glibc-2.35

Python version: 3.10.14 (main, May  6 2024, 19:42:50) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-58-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: AMD Instinct MI250X/MI250 (gfx90a:sramecc+:xnack-)
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: 6.1.40091
MIOpen runtime version: 3.1.0
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   48 bits physical, 48 bits virtual
Byte Order:                      Little Endian
CPU(s):                          96
On-line CPU(s) list:             0-95
Vendor ID:                       AuthenticAMD
Model name:                      AMD EPYC 7643 48-Core Processor
CPU family:                      25
Model:                           1
Thread(s) per core:              1
Core(s) per socket:              48
Socket(s):                       2
Stepping:                        1
Frequency boost:                 enabled
CPU max MHz:                     3640.9170
CPU min MHz:                     1500.0000
BogoMIPS:                        4599.92
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca
Virtualization:                  AMD-V
L1d cache:                       3 MiB (96 instances)
L1i cache:                       3 MiB (96 instances)
L2 cache:                        48 MiB (96 instances)
L3 cache:                        512 MiB (16 instances)
NUMA node(s):                    2
NUMA node0 CPU(s):               0-47
NUMA node1 CPU(s):               48-95
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Not affected
Vulnerability Retbleed:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.2
[pip3] onnx==1.16.1
[pip3] open_clip_torch==2.26.1
[pip3] optree==0.11.0
[pip3] pytorch-lightning==2.0.7
[pip3] pytorch-triton-rocm==3.0.0+21eae954ef
[pip3] pyzmq==25.1.2
[pip3] sentence-transformers==3.0.1
[pip3] taming-transformers==0.0.1
[pip3] torch==2.5.0.dev20240708+rocm6.1
[pip3] torchaudio==2.4.0.dev20240708+rocm6.1
[pip3] torchdiffeq==0.2.4
[pip3] torchmetrics==1.4.0.post0
[pip3] torchsde==0.2.6
[pip3] torchvision==0.20.0.dev20240708+rocm6.1
[pip3] transformers==4.43.3
[pip3] triton==3.0.0
[conda] numpy                     1.26.2                   pypi_0    pypi
[conda] open-clip-torch           2.26.1                   pypi_0    pypi
[conda] pytorch-lightning         2.0.7                    pypi_0    pypi
[conda] pytorch-triton-rocm       3.0.0+21eae954ef          pypi_0    pypi
[conda] pyzmq                     26.0.3                   pypi_0    pypi
[conda] sentence-transformers     3.0.1                    pypi_0    pypi
[conda] taming-transformers       0.0.1                    pypi_0    pypi
[conda] torch                     2.5.0.dev20240708+rocm6.1          pypi_0    pypi
[conda] torchaudio                2.4.0.dev20240708+rocm6.1          pypi_0    pypi
[conda] torchdiffeq               0.2.4                    pypi_0    pypi
[conda] torchmetrics              1.4.0.post0              pypi_0    pypi
[conda] torchsde                  0.2.6                    pypi_0    pypi
[conda] torchvision               0.20.0.dev20240708+rocm6.1          pypi_0    pypi
[conda] transformers              4.43.3                   pypi_0    pypi
[conda] triton                    3.0.0                    pypi_0    pypi
ROCM Version: 6.1.40091-a8dbc0c19
Neuron SDK Version: N/A
vLLM Version: 0.6.1.post2@9cc373f39036af789fb1ffc1e06b23766996d3f4
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
Could not collect
My `rocm-smi` output:
============================ ROCm System Management Interface ============================
================================ Weight between two GPUs =================================
       GPU0         GPU1         GPU2         GPU3         GPU4         GPU5         GPU6         GPU7         
GPU0   0            15           30           30           15           30           15           30           
GPU1   15           0            30           15           30           45           30           15           
GPU2   30           30           0            15           15           30           15           30           
GPU3   30           15           15           0            30           15           30           45           
GPU4   15           30           15           30           0            15           30           30           
GPU5   30           45           30           15           15           0            30           15           
GPU6   15           30           15           30           30           30           0            15           
GPU7   30           15           30           45           30           15           15           0            

================================= Hops between two GPUs ==================================
       GPU0         GPU1         GPU2         GPU3         GPU4         GPU5         GPU6         GPU7         
GPU0   0            1            1            1            1            1            1            1            
GPU1   1            0            1            1            1            1            1            1            
GPU2   1            1            0            1            1            1            1            1            
GPU3   1            1            1            0            1            1            1            1            
GPU4   1            1            1            1            0            1            1            1            
GPU5   1            1            1            1            1            0            1            1            
GPU6   1            1            1            1            1            1            0            1            
GPU7   1            1            1            1            1            1            1            0            

=============================== Link Type between two GPUs ===============================
       GPU0         GPU1         GPU2         GPU3         GPU4         GPU5         GPU6         GPU7         
GPU0   0            XGMI         XGMI         XGMI         XGMI         XGMI         XGMI         XGMI         
GPU1   XGMI         0            XGMI         XGMI         XGMI         XGMI         XGMI         XGMI         
GPU2   XGMI         XGMI         0            XGMI         XGMI         XGMI         XGMI         XGMI         
GPU3   XGMI         XGMI         XGMI         0            XGMI         XGMI         XGMI         XGMI         
GPU4   XGMI         XGMI         XGMI         XGMI         0            XGMI         XGMI         XGMI         
GPU5   XGMI         XGMI         XGMI         XGMI         XGMI         0            XGMI         XGMI         
GPU6   XGMI         XGMI         XGMI         XGMI         XGMI         XGMI         0            XGMI         
GPU7   XGMI         XGMI         XGMI         XGMI         XGMI         XGMI         XGMI         0            

======================================= Numa Nodes =======================================
GPU[0]          : (Topology) Numa Node: 0
GPU[0]          : (Topology) Numa Affinity: 0
GPU[1]          : (Topology) Numa Node: 0
GPU[1]          : (Topology) Numa Affinity: 0
GPU[2]          : (Topology) Numa Node: 0
GPU[2]          : (Topology) Numa Affinity: 0
GPU[3]          : (Topology) Numa Node: 0
GPU[3]          : (Topology) Numa Affinity: 0
GPU[4]          : (Topology) Numa Node: 1
GPU[4]          : (Topology) Numa Affinity: 1
GPU[5]          : (Topology) Numa Node: 1
GPU[5]          : (Topology) Numa Affinity: 1
GPU[6]          : (Topology) Numa Node: 1
GPU[6]          : (Topology) Numa Affinity: 1
GPU[7]          : (Topology) Numa Node: 1
GPU[7]          : (Topology) Numa Affinity: 1
================================== End of ROCm SMI Log ===================================

Edit: I see that you have MI100 GPUs, but the ROCm and Triton versions are similar to mine.

@DarkLight1337
Member

DarkLight1337 commented Sep 20, 2024

I see that you have MI100 GPUs, but the ROCm and Triton versions are similar to mine.

ROCm 6.1 is not officially supported in vLLM for MI100, so that may be why.

@DarkLight1337
Member

cc @alexeykondrat

@erkintelnyx
Author

I just tried running this command (after downloading the HF repo locally) on 8x MI250 (ROCm 6.1) and failed to repro this issue. Can you tell us more about your setup by running `rocm-smi --showtopo`?

============================ ROCm System Management Interface ============================
================================ Weight between two GPUs =================================
       GPU0         GPU1         GPU2         GPU3         GPU4         GPU5         GPU6         GPU7         
GPU0   0            15           15           15           40           40           40           40           
GPU1   15           0            15           15           40           40           40           40           
GPU2   15           15           0            15           40           40           40           40           
GPU3   15           15           15           0            40           40           40           40           
GPU4   40           40           40           40           0            15           15           15           
GPU5   40           40           40           40           15           0            15           15           
GPU6   40           40           40           40           15           15           0            15           
GPU7   40           40           40           40           15           15           15           0            

================================= Hops between two GPUs ==================================
       GPU0         GPU1         GPU2         GPU3         GPU4         GPU5         GPU6         GPU7         
GPU0   0            1            1            1            2            2            2            2            
GPU1   1            0            1            1            2            2            2            2            
GPU2   1            1            0            1            2            2            2            2            
GPU3   1            1            1            0            2            2            2            2            
GPU4   2            2            2            2            0            1            1            1            
GPU5   2            2            2            2            1            0            1            1            
GPU6   2            2            2            2            1            1            0            1            
GPU7   2            2            2            2            1            1            1            0            

=============================== Link Type between two GPUs ===============================
       GPU0         GPU1         GPU2         GPU3         GPU4         GPU5         GPU6         GPU7         
GPU0   0            XGMI         XGMI         XGMI         PCIE         PCIE         PCIE         PCIE         
GPU1   XGMI         0            XGMI         XGMI         PCIE         PCIE         PCIE         PCIE         
GPU2   XGMI         XGMI         0            XGMI         PCIE         PCIE         PCIE         PCIE         
GPU3   XGMI         XGMI         XGMI         0            PCIE         PCIE         PCIE         PCIE         
GPU4   PCIE         PCIE         PCIE         PCIE         0            XGMI         XGMI         XGMI         
GPU5   PCIE         PCIE         PCIE         PCIE         XGMI         0            XGMI         XGMI         
GPU6   PCIE         PCIE         PCIE         PCIE         XGMI         XGMI         0            XGMI         
GPU7   PCIE         PCIE         PCIE         PCIE         XGMI         XGMI         XGMI         0            

======================================= Numa Nodes =======================================
GPU[0]          : (Topology) Numa Node: 0
GPU[0]          : (Topology) Numa Affinity: -1
GPU[1]          : (Topology) Numa Node: 0
GPU[1]          : (Topology) Numa Affinity: -1
GPU[2]          : (Topology) Numa Node: 0
GPU[2]          : (Topology) Numa Affinity: -1
GPU[3]          : (Topology) Numa Node: 0
GPU[3]          : (Topology) Numa Affinity: -1
GPU[4]          : (Topology) Numa Node: 0
GPU[4]          : (Topology) Numa Affinity: -1
GPU[5]          : (Topology) Numa Node: 0
GPU[5]          : (Topology) Numa Affinity: -1
GPU[6]          : (Topology) Numa Node: 0
GPU[6]          : (Topology) Numa Affinity: -1
GPU[7]          : (Topology) Numa Node: 0
GPU[7]          : (Topology) Numa Affinity: -1
================================== End of ROCm SMI Log ===================================

This is in a container built with:

docker build -f vllm/Dockerfile.rocm \
        --build-arg TRY_FA_WHEEL="0" \
        --build-arg PYTORCH_ROCM_ARCH=gfx908 \
        --build-arg FA_GFX_ARCHS=gfx908 \
        --build-arg FA_BRANCH="ae7928c5aed53cf6e75cc792baa9126b2abfcf1a"

The previous commit I was using (fde47d3) was working, though.

@DarkLight1337
Member

DarkLight1337 commented Sep 20, 2024

As a sanity check, make sure that your downloaded version of OpenGVLab/InternVL2-Llama3-76B is up to date. A simple way to check is to run AutoConfig.from_pretrained(...) and check that it's the same as the one listed on the HF repo.
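For example, something along these lines (the local path is a placeholder) should show whether the two configs match:

```python
from transformers import AutoConfig

local = AutoConfig.from_pretrained(
    "/path/to/local/InternVL2-Llama3-76B", trust_remote_code=True)
remote = AutoConfig.from_pretrained(
    "OpenGVLab/InternVL2-Llama3-76B", trust_remote_code=True)

# Compare the serialized configs; any difference suggests a stale local copy.
print(local.to_dict() == remote.to_dict())
```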

@youkaichao
Member

cc @hongxiayang

MengqingCao added a commit to MengqingCao/vllm that referenced this issue Sep 27, 2024

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
