
[Bug]: OpenGVLab/InternVL2-Llama3-76B: view size is not compatible with input tensor's size and stride #8630

Closed
erkintelnyx opened this issue Sep 19, 2024 · 11 comments · Fixed by #11979
Labels: bug (Something isn't working), rocm, stale

Comments

@erkintelnyx

erkintelnyx commented Sep 19, 2024

Your current environment

The output of `python collect_env.py`
PyTorch version: 2.5.0.dev20240726+rocm6.1
Is debug build: False
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: 6.1.40091-a8dbc0c19

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: 17.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-6.1.2 24193 669db884972e769450470020c06a6f132a8a065b)
CMake version: version 3.26.4
Libc version: glibc-2.31

Python version: 3.9.19 (main, May  6 2024, 19:43:03)  [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-117-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: AMD Instinct MI100 (gfx908:sramecc+:xnack-)
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: 6.1.40093
MIOpen runtime version: 3.1.0
Is XNNPACK available: True

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Byte Order:                           Little Endian
Address sizes:                        48 bits physical, 48 bits virtual
CPU(s):                               16
On-line CPU(s) list:                  0-15
Thread(s) per core:                   1
Core(s) per socket:                   16
Socket(s):                            1
NUMA node(s):                         1
Vendor ID:                            AuthenticAMD
CPU family:                           25
Model:                                1
Model name:                           AMD EPYC 7713 64-Core Processor
Stepping:                             1
CPU MHz:                              2000.000
BogoMIPS:                             4000.00
Virtualization:                       AMD-V
Hypervisor vendor:                    KVM
Virtualization type:                  full
L1d cache:                            1 MiB
L1i cache:                            1 MiB
L2 cache:                             8 MiB
L3 cache:                             16 MiB
NUMA node0 CPU(s):                    0-15
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Mitigation; safe RET
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw perfctr_core invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr wbnoinvd arat npt lbrv nrip_save tsc_scale vmcb_clean flushbyasid pausefilter pfthreshold v_vmsave_vmload vgif umip pku ospke vaes vpclmulqdq rdpid fsrm arch_capabilities

Versions of relevant libraries:
[pip3] mypy==1.7.0
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] optree==0.9.1
[pip3] pytorch-triton-rocm==3.0.0+21eae954ef
[pip3] pyzmq==26.2.0
[pip3] torch==2.5.0.dev20240726+rocm6.1
[pip3] torchvision==0.20.0.dev20240726+rocm6.1
[pip3] transformers==4.43.2
[pip3] triton==3.0.0
[conda] No relevant packages
ROCM Version: 6.1.40093-bd86f1708
Neuron SDK Version: N/A
vLLM Version: 0.6.1.post2@a8c1d161a7d87dbc6c7cccfce303dcbe2e4ed6be
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
Could not collect

Model Input Dumps

err_execute_model_input_20240919-094504.pkl.zip

🐛 Describe the bug

When I start the model via:
vllm serve OpenGVLab/InternVL2-Llama3-76B --tensor-parallel-size 8 --max-model-len 8000

I get:

  File "/vllm-workspace/vllm/worker/model_runner_base.py", line 116, in _wrapper
    return func(*args, **kwargs)
  File "/vllm-workspace/vllm/worker/model_runner.py", line 1590, in execute_model
    hidden_or_intermediate_states = model_executable(
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1735, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1746, in _call_impl
    return forward_call(*args, **kwargs)
  File "/vllm-workspace/vllm/model_executor/models/internvl.py", line 488, in forward
    vision_embeddings = self._process_image_input(image_input)
  File "/vllm-workspace/vllm/model_executor/models/internvl.py", line 471, in _process_image_input
    image_embeds = self.extract_feature(image_input["data"])
  File "/vllm-workspace/vllm/model_executor/models/internvl.py", line 395, in extract_feature
    vit_embeds = self.vision_model(pixel_values=pixel_values)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1735, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1746, in _call_impl
    return forward_call(*args, **kwargs)
  File "/vllm-workspace/vllm/model_executor/models/intern_vit.py", line 356, in forward
    encoder_outputs = self.encoder(inputs_embeds=hidden_states)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1735, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1746, in _call_impl
    return forward_call(*args, **kwargs)
  File "/vllm-workspace/vllm/model_executor/models/intern_vit.py", line 298, in forward
    hidden_states = encoder_layer(hidden_states)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1735, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1746, in _call_impl
    return forward_call(*args, **kwargs)
  File "/vllm-workspace/vllm/model_executor/models/intern_vit.py", line 267, in forward
    hidden_states = hidden_states + self.attn(
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1735, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1746, in _call_impl
    return forward_call(*args, **kwargs)
  File "/vllm-workspace/vllm/model_executor/models/intern_vit.py", line 203, in forward
    x = x.transpose(1, 2).view(B, N, -1)
RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/envs/py_3.9/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/conda/envs/py_3.9/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/vllm-workspace/vllm/engine/multiprocessing/engine.py", line 318, in run_mp_engine
    engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
  File "/vllm-workspace/vllm/engine/multiprocessing/engine.py", line 113, in from_engine_args
    return cls(
  File "/vllm-workspace/vllm/engine/multiprocessing/engine.py", line 69, in __init__
    self.engine = LLMEngine(*args, **kwargs)
  File "/vllm-workspace/vllm/engine/llm_engine.py", line 331, in __init__
    self._initialize_kv_caches()
  File "/vllm-workspace/vllm/engine/llm_engine.py", line 465, in _initialize_kv_caches
    self.model_executor.determine_num_available_blocks())
  File "/vllm-workspace/vllm/executor/distributed_gpu_executor.py", line 39, in determine_num_available_blocks
    num_blocks = self._run_workers("determine_num_available_blocks", )
  File "/vllm-workspace/vllm/executor/multiproc_gpu_executor.py", line 185, in _run_workers
    driver_worker_output = driver_worker_method(*args, **kwargs)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/vllm-workspace/vllm/worker/worker.py", line 223, in determine_num_available_blocks
    self.model_runner.profile_run()
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/vllm-workspace/vllm/worker/model_runner.py", line 1236, in profile_run
    self.execute_model(model_input, kv_caches, intermediate_tensors)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/vllm-workspace/vllm/worker/model_runner_base.py", line 144, in _wrapper
    raise type(err)(
RuntimeError: Error in model execution (input dumped to /tmp/err_execute_model_input_20240919-094504.pkl): view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.
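For context on the error message itself, here is a minimal sketch (shapes are illustrative, not InternViT's actual dimensions) of why `.view()` fails after a transpose while `.reshape()` does not: the transpose makes the tensor non-contiguous, so its memory cannot be reinterpreted in place.

```python
import torch

# Illustrative shapes only: (batch, heads, seq, head_dim)
x = torch.randn(2, 8, 4, 16)
y = x.transpose(1, 2)      # (batch, seq, heads, head_dim), non-contiguous

print(y.is_contiguous())   # False
z = y.reshape(2, 4, -1)    # works: falls back to a copy when a plain view is impossible
z = y.view(2, 4, -1)       # raises "view size is not compatible with input tensor's size and stride"
```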
erkintelnyx added the bug label on Sep 19, 2024
@DarkLight1337
Member

Hmm, I ran this locally and didn't get such an error. Could you share the dimensions of the images that you input to the model?

@erkintelnyx
Author

This happens at server startup, before any inference is done.

@DarkLight1337
Member

DarkLight1337 commented Sep 19, 2024

Looking at your environment, it seems that you're running this on AMD GPUs. Maybe there is some bug related to that? @youkaichao @WoosukKwon

@DarkLight1337
Member

DarkLight1337 commented Sep 19, 2024

It may be that F.scaled_dot_product_attention followed by transpose(1, 2) results in non-contiguous output for AMD.
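A rough way to test this hypothesis (the shapes below are placeholders, not the actual InternViT dimensions) is to check whether the transposed SDPA output is contiguous on the ROCm build:

```python
import torch
import torch.nn.functional as F

B, H, N, D = 1, 25, 1025, 64   # placeholder shapes
q = torch.randn(B, H, N, D, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

out = F.scaled_dot_product_attention(q, k, v)  # (B, H, N, D)
out_t = out.transpose(1, 2)                    # (B, N, H, D)

print(out_t.is_contiguous())
# If this prints False, out_t.view(B, N, -1) fails with the reported error,
# while out_t.reshape(B, N, -1) (or .contiguous() before .view) works.
```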

@DarkLight1337
Member

DarkLight1337 commented Sep 20, 2024

I just tried running this command (after downloading the HF repo locally) on 8x MI250 (ROCm 6.1) and failed to repro this issue. Can you tell us more about your setup by running `rocm-smi --showtopo`?

My `collect_env.py` output:
PyTorch version: 2.5.0.dev20240708+rocm6.1
Is debug build: False
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: 6.1.40091-a8dbc0c19

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.6
Libc version: glibc-2.35

Python version: 3.10.14 (main, May  6 2024, 19:42:50) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-58-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: AMD Instinct MI250X/MI250 (gfx90a:sramecc+:xnack-)
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: 6.1.40091
MIOpen runtime version: 3.1.0
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   48 bits physical, 48 bits virtual
Byte Order:                      Little Endian
CPU(s):                          96
On-line CPU(s) list:             0-95
Vendor ID:                       AuthenticAMD
Model name:                      AMD EPYC 7643 48-Core Processor
CPU family:                      25
Model:                           1
Thread(s) per core:              1
Core(s) per socket:              48
Socket(s):                       2
Stepping:                        1
Frequency boost:                 enabled
CPU max MHz:                     3640.9170
CPU min MHz:                     1500.0000
BogoMIPS:                        4599.92
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca
Virtualization:                  AMD-V
L1d cache:                       3 MiB (96 instances)
L1i cache:                       3 MiB (96 instances)
L2 cache:                        48 MiB (96 instances)
L3 cache:                        512 MiB (16 instances)
NUMA node(s):                    2
NUMA node0 CPU(s):               0-47
NUMA node1 CPU(s):               48-95
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Not affected
Vulnerability Retbleed:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.2
[pip3] onnx==1.16.1
[pip3] open_clip_torch==2.26.1
[pip3] optree==0.11.0
[pip3] pytorch-lightning==2.0.7
[pip3] pytorch-triton-rocm==3.0.0+21eae954ef
[pip3] pyzmq==25.1.2
[pip3] sentence-transformers==3.0.1
[pip3] taming-transformers==0.0.1
[pip3] torch==2.5.0.dev20240708+rocm6.1
[pip3] torchaudio==2.4.0.dev20240708+rocm6.1
[pip3] torchdiffeq==0.2.4
[pip3] torchmetrics==1.4.0.post0
[pip3] torchsde==0.2.6
[pip3] torchvision==0.20.0.dev20240708+rocm6.1
[pip3] transformers==4.43.3
[pip3] triton==3.0.0
[conda] numpy                     1.26.2                   pypi_0    pypi
[conda] open-clip-torch           2.26.1                   pypi_0    pypi
[conda] pytorch-lightning         2.0.7                    pypi_0    pypi
[conda] pytorch-triton-rocm       3.0.0+21eae954ef          pypi_0    pypi
[conda] pyzmq                     26.0.3                   pypi_0    pypi
[conda] sentence-transformers     3.0.1                    pypi_0    pypi
[conda] taming-transformers       0.0.1                    pypi_0    pypi
[conda] torch                     2.5.0.dev20240708+rocm6.1          pypi_0    pypi
[conda] torchaudio                2.4.0.dev20240708+rocm6.1          pypi_0    pypi
[conda] torchdiffeq               0.2.4                    pypi_0    pypi
[conda] torchmetrics              1.4.0.post0              pypi_0    pypi
[conda] torchsde                  0.2.6                    pypi_0    pypi
[conda] torchvision               0.20.0.dev20240708+rocm6.1          pypi_0    pypi
[conda] transformers              4.43.3                   pypi_0    pypi
[conda] triton                    3.0.0                    pypi_0    pypi
ROCM Version: 6.1.40091-a8dbc0c19
Neuron SDK Version: N/A
vLLM Version: 0.6.1.post2@9cc373f39036af789fb1ffc1e06b23766996d3f4
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
Could not collect
My `rocm-smi` output:
============================ ROCm System Management Interface ============================
================================ Weight between two GPUs =================================
       GPU0         GPU1         GPU2         GPU3         GPU4         GPU5         GPU6         GPU7         
GPU0   0            15           30           30           15           30           15           30           
GPU1   15           0            30           15           30           45           30           15           
GPU2   30           30           0            15           15           30           15           30           
GPU3   30           15           15           0            30           15           30           45           
GPU4   15           30           15           30           0            15           30           30           
GPU5   30           45           30           15           15           0            30           15           
GPU6   15           30           15           30           30           30           0            15           
GPU7   30           15           30           45           30           15           15           0            

================================= Hops between two GPUs ==================================
       GPU0         GPU1         GPU2         GPU3         GPU4         GPU5         GPU6         GPU7         
GPU0   0            1            1            1            1            1            1            1            
GPU1   1            0            1            1            1            1            1            1            
GPU2   1            1            0            1            1            1            1            1            
GPU3   1            1            1            0            1            1            1            1            
GPU4   1            1            1            1            0            1            1            1            
GPU5   1            1            1            1            1            0            1            1            
GPU6   1            1            1            1            1            1            0            1            
GPU7   1            1            1            1            1            1            1            0            

=============================== Link Type between two GPUs ===============================
       GPU0         GPU1         GPU2         GPU3         GPU4         GPU5         GPU6         GPU7         
GPU0   0            XGMI         XGMI         XGMI         XGMI         XGMI         XGMI         XGMI         
GPU1   XGMI         0            XGMI         XGMI         XGMI         XGMI         XGMI         XGMI         
GPU2   XGMI         XGMI         0            XGMI         XGMI         XGMI         XGMI         XGMI         
GPU3   XGMI         XGMI         XGMI         0            XGMI         XGMI         XGMI         XGMI         
GPU4   XGMI         XGMI         XGMI         XGMI         0            XGMI         XGMI         XGMI         
GPU5   XGMI         XGMI         XGMI         XGMI         XGMI         0            XGMI         XGMI         
GPU6   XGMI         XGMI         XGMI         XGMI         XGMI         XGMI         0            XGMI         
GPU7   XGMI         XGMI         XGMI         XGMI         XGMI         XGMI         XGMI         0            

======================================= Numa Nodes =======================================
GPU[0]          : (Topology) Numa Node: 0
GPU[0]          : (Topology) Numa Affinity: 0
GPU[1]          : (Topology) Numa Node: 0
GPU[1]          : (Topology) Numa Affinity: 0
GPU[2]          : (Topology) Numa Node: 0
GPU[2]          : (Topology) Numa Affinity: 0
GPU[3]          : (Topology) Numa Node: 0
GPU[3]          : (Topology) Numa Affinity: 0
GPU[4]          : (Topology) Numa Node: 1
GPU[4]          : (Topology) Numa Affinity: 1
GPU[5]          : (Topology) Numa Node: 1
GPU[5]          : (Topology) Numa Affinity: 1
GPU[6]          : (Topology) Numa Node: 1
GPU[6]          : (Topology) Numa Affinity: 1
GPU[7]          : (Topology) Numa Node: 1
GPU[7]          : (Topology) Numa Affinity: 1
================================== End of ROCm SMI Log ===================================

Edit: I see that you have MI100 GPUs, but the ROCm and Triton versions are similar to mine.

@DarkLight1337
Member

DarkLight1337 commented Sep 20, 2024

I see that you have MI100 GPUs, but the ROCm and Triton versions are similar to mine.

ROCm 6.1 is not officially supported in vLLM for MI100, so that may be why.

@DarkLight1337
Member

cc @alexeykondrat

@erkintelnyx
Author

I just tried running this command (after downloading the HF repo locally) on 8x MI250 (ROCm 6.1) and failed to repro this issue. Can you tell us more about your setup by running `rocm-smi --showtopo`?

============================ ROCm System Management Interface ============================
================================ Weight between two GPUs =================================
       GPU0         GPU1         GPU2         GPU3         GPU4         GPU5         GPU6         GPU7         
GPU0   0            15           15           15           40           40           40           40           
GPU1   15           0            15           15           40           40           40           40           
GPU2   15           15           0            15           40           40           40           40           
GPU3   15           15           15           0            40           40           40           40           
GPU4   40           40           40           40           0            15           15           15           
GPU5   40           40           40           40           15           0            15           15           
GPU6   40           40           40           40           15           15           0            15           
GPU7   40           40           40           40           15           15           15           0            

================================= Hops between two GPUs ==================================
       GPU0         GPU1         GPU2         GPU3         GPU4         GPU5         GPU6         GPU7         
GPU0   0            1            1            1            2            2            2            2            
GPU1   1            0            1            1            2            2            2            2            
GPU2   1            1            0            1            2            2            2            2            
GPU3   1            1            1            0            2            2            2            2            
GPU4   2            2            2            2            0            1            1            1            
GPU5   2            2            2            2            1            0            1            1            
GPU6   2            2            2            2            1            1            0            1            
GPU7   2            2            2            2            1            1            1            0            

=============================== Link Type between two GPUs ===============================
       GPU0         GPU1         GPU2         GPU3         GPU4         GPU5         GPU6         GPU7         
GPU0   0            XGMI         XGMI         XGMI         PCIE         PCIE         PCIE         PCIE         
GPU1   XGMI         0            XGMI         XGMI         PCIE         PCIE         PCIE         PCIE         
GPU2   XGMI         XGMI         0            XGMI         PCIE         PCIE         PCIE         PCIE         
GPU3   XGMI         XGMI         XGMI         0            PCIE         PCIE         PCIE         PCIE         
GPU4   PCIE         PCIE         PCIE         PCIE         0            XGMI         XGMI         XGMI         
GPU5   PCIE         PCIE         PCIE         PCIE         XGMI         0            XGMI         XGMI         
GPU6   PCIE         PCIE         PCIE         PCIE         XGMI         XGMI         0            XGMI         
GPU7   PCIE         PCIE         PCIE         PCIE         XGMI         XGMI         XGMI         0            

======================================= Numa Nodes =======================================
GPU[0]          : (Topology) Numa Node: 0
GPU[0]          : (Topology) Numa Affinity: -1
GPU[1]          : (Topology) Numa Node: 0
GPU[1]          : (Topology) Numa Affinity: -1
GPU[2]          : (Topology) Numa Node: 0
GPU[2]          : (Topology) Numa Affinity: -1
GPU[3]          : (Topology) Numa Node: 0
GPU[3]          : (Topology) Numa Affinity: -1
GPU[4]          : (Topology) Numa Node: 0
GPU[4]          : (Topology) Numa Affinity: -1
GPU[5]          : (Topology) Numa Node: 0
GPU[5]          : (Topology) Numa Affinity: -1
GPU[6]          : (Topology) Numa Node: 0
GPU[6]          : (Topology) Numa Affinity: -1
GPU[7]          : (Topology) Numa Node: 0
GPU[7]          : (Topology) Numa Affinity: -1
================================== End of ROCm SMI Log ===================================

This is in a container built with:

docker build -f vllm/Dockerfile.rocm \
        --build-arg TRY_FA_WHEEL="0" \
        --build-arg PYTORCH_ROCM_ARCH=gfx908 \
        --build-arg FA_GFX_ARCHS=gfx908 \
        --build-arg FA_BRANCH="ae7928c5aed53cf6e75cc792baa9126b2abfcf1a"

The previous commit I was using (fde47d3) was working, though.

@DarkLight1337
Member

DarkLight1337 commented Sep 20, 2024

As a sanity check, make sure that your downloaded version of OpenGVLab/InternVL2-Llama3-76B is up to date. A simple way to check is to run AutoConfig.from_pretrained(...) and check that it's the same as the one listed on the HF repo.
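For example, something along these lines (the local path is a placeholder) should show whether the two configs match:

```python
from transformers import AutoConfig

local = AutoConfig.from_pretrained(
    "/path/to/local/InternVL2-Llama3-76B", trust_remote_code=True)
remote = AutoConfig.from_pretrained(
    "OpenGVLab/InternVL2-Llama3-76B", trust_remote_code=True)

# Compare the serialized configs; any difference suggests a stale local copy.
print(local.to_dict() == remote.to_dict())
```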

@youkaichao
Member

cc @hongxiayang

MengqingCao added a commit to MengqingCao/vllm that referenced this issue Sep 27, 2024

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
