[BUG] Issue serving Mixtral 8x7B on H100 #443

Open
ghost opened this issue Mar 17, 2024 · 9 comments

@ghost

ghost commented Mar 17, 2024

Running into issues when serving Mixtral 8x7B on 4 x H100 (TP=4) with deepspeed-mii v0.2.3, with all other arguments left at their defaults, in the NVIDIA base image nvidia/cuda:12.3.1-devel-ubuntu22.04.

The traceback showed:

undefined symbol: _Z19cuda_wf6af16_linearRN2at6TensorES1_S1_S1_S1_S1_iiii

There's also a warning: "FP6 quantization kernel is only supported on Ampere architectures", even though I did not specify quantization when launching the server. It seems an unused kernel is being imported even though it is not registered for Hopper devices.

When I downgraded to v0.2.2, I ran into the following error:

Arch unsupported for MoE GEMM
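
For reference, a quick sanity check (PyTorch only, nothing MII-specific) of which architecture the GPU reports, since the kernels involved are gated on compute capability:

import torch

# H100 reports (9, 0) = Hopper, A100 reports (8, 0) = Ampere, V100 reports (7, 0) = Volta.
major, minor = torch.cuda.get_device_capability(0)
print(f"sm_{major}{minor}")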
@sidagarwal2805

I ran into the same problem on V100s with the exact same error output. It was fixed when I switched to A100s.

@ghost
Author

ghost commented Mar 18, 2024

> I ran into the same problem on V100s with the exact same error output. It was fixed when I switched to A100s.

Yeah, I can confirm this works on A100 but not on H100.

@mrwyattii
Contributor

Thanks for reporting this. It seems a bug was introduced in the latest release when we added FP6 quantization support. I will investigate and fix it. Thank you!

@xiaoxiawu-microsoft

@JamesTheZ may know about this.

@JamesTheZ
Contributor

> @JamesTheZ may know about this.

It seems this is because the current implementation only compiles cuda_linear_kernels.cpp on Ampere: https://github.com/microsoft/DeepSpeed/blob/330d36bb39b8dd33b5603ee0024705db38aab534/op_builder/inference_core_ops.py#L75-L81
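
For illustration, a minimal Python sketch of that kind of guard (not the actual op_builder code; the helper name is hypothetical), showing why skipping the source on non-Ampere devices leaves the symbol unresolved:

import torch

def filter_sources_for_arch(sources):
    # Hypothetical helper, illustrative only: drop the FP6 kernel source unless
    # an Ampere (sm_80) GPU is detected, mimicking the guard linked above.
    # If cuda_linear_kernels.cpp is excluded but the extension's bindings still
    # reference cuda_wf6af16_linear, loading the built .so on Hopper (sm_90)
    # fails with exactly the undefined-symbol error reported here.
    major, _minor = torch.cuda.get_device_capability(0)
    if major != 8:
        sources = [s for s in sources if "cuda_linear_kernels" not in s]
    return sources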

@Taishi-N324

Taishi-N324 commented Mar 28, 2024

I'm encountering the same undefined symbol error (_Z19cuda_wf6af16_linearRN2at6TensorES1_S1_S1_S1_S1_iiii) with meta-llama/Llama-2-7b-chat-hf on an H100, and I've hit the same problem with mistralai/Mistral-7B-v0.1. Neither model works in my setup.

I've tried multiple versions of deepspeed-mii (0.2.1, 0.2.2, and 0.2.3) as well as different versions of PyTorch (2.2.1, 2.1.2, and 2.1.0), but none of these combinations work. I even went as far as compiling directly from source, but without success.

Is anyone else experiencing the same issue or has any suggestions on how to resolve it?

import mii

# Fails here with the undefined symbol error while loading the inference kernels.
pipe = mii.pipeline("meta-llama/Llama-2-7b-chat-hf")
NVIDIA H100 80GB
Driver Version: 535.104.12
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
 [WARNING]  using untested triton version (2.1.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/taishi/workplace/mii/venv/lib/python3.10/site-packages/torch']
torch version .................... 2.1.0+cu121
deepspeed install path ........... ['/home/taishi/workplace/mii/venv/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.14.0, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.1
deepspeed wheel compiled w. ...... torch 2.1, cuda 12.1
shared memory (/dev/shm) size .... 999.98 GB
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Feb__7_19:32:13_PST_2023
Cuda compilation tools, release 12.1, V12.1.66
Build cuda_12.1.r12.1/compiler.32415258_0
Distributor ID:	Ubuntu
Description:	Ubuntu 20.04.6 LTS
Release:	20.04
Codename:	focal

@deroholic

Downgrading to these versions works:
deepspeed 0.13.5
deepspeed-mii 0.2.2
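
To double-check that the pinned versions are the ones actually installed in the serving environment, a quick standard-library check (nothing MII-specific assumed):

from importlib.metadata import version

print(version("deepspeed"))      # expect 0.13.5 with this workaround
print(version("deepspeed-mii"))  # expect 0.2.2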

github-merge-queue bot pushed a commit to deepspeedai/DeepSpeed that referenced this issue Apr 15, 2024
Refine the guards of FP6 kernel compilation. Fix the `undefined symbol`
problem of FP6 kernels on non-Ampere architectures.

Related issue: deepspeedai/DeepSpeed-MII#443.

---------

Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Michael Wyatt <[email protected]>
rraminen pushed a commit to ROCm/DeepSpeed that referenced this issue May 9, 2024
…dai#5333)

umchand pushed a commit to umchand/DeepSpeed that referenced this issue May 20, 2024
…dai#5333)

dbyoung18 pushed a commit to dbyoung18/DeepSpeed that referenced this issue Jun 11, 2024
…dai#5333)
@seven-mile

Any update on this issue?

@seven-mile

seven-mile commented Jun 16, 2024

I found that it's an issue inherited from the upstream FasterTransformer code; check these lines.

But FasterTransformer has already been migrated to TensorRT-LLM, which does have an implementation for sm_90.

Do you have a plan to solve it? Or would a PR be welcome?
