-v $MODEL_PATH:$MODEL_PATH \
-e DEVICE=cuda:1 \
-e NCCL_DEBUG=INFO \
docker.io/vectorchai/scalellm:latest --logtostderr --model_path=$MODEL_PATH --model_id=$MODEL_ID --model_type=Yi
I20231129 08:13:34.992501 7 main.cpp:135] Using devices: cuda:1
W20231129 08:13:34.993809 7 args_overrider.cpp:132] Overwriting model_type from llama to Yi
I20231129 08:13:34.993916 7 engine.cpp:91] Initializing model from: /data4/candowu/modelscope/01ai/Yi-34B-Chat-4bits
W20231129 08:13:34.993944 7 model_loader.cpp:162] Failed to find tokenizer.json, use tokenizer.model instead. Please consider using fast tokenizer for better performance.
I20231129 08:13:35.245934 7 engine.cpp:98] Initializing model with dtype: Half
I20231129 08:13:35.245993 7 engine.cpp:107] Initializing model with ModelArgs: [model_type: Yi, dtype: float16, hidden_size: 7168, hidden_act: silu, intermediate_size: 20480, n_layers: 60, n_heads: 56, n_kv_heads: 8, vocab_size: 64000, rms_norm_eps: 1e-05, layer_norm_eps: 0, rotary_dim: 0, rope_theta: 5e+06, rope_scaling: 1, rotary_pct: 1, max_position_embeddings: 4096, bos_token_id: 1, eos_token_id: 2, use_parallel_residual: 0, attn_qkv_clip: 0, attn_qk_ln: 0, attn_alibi: 0, alibi_bias_max: 0, no_bias: 0, residual_post_layernorm: 0], QuantArgs: [quant_method: awq, bits: 4, group_size: 128, desc_act: 0, true_sequential: 0]
terminate called after throwing an instance of 'c10::Error'
what(): The NVIDIA driver on your system is too old (found version 11080). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver.
Exception raised from device_count_impl at ../c10/cuda/CUDAFunctions.cpp:53 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6b (0x7f2c0dc6e38b in /app/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xbf (0x7f2c0dc68f3f in /app/lib/libc10.so)
frame #2: c10::cuda::device_count_ensure_non_zero() + 0x18c (0x7f2c0e0535dc in /app/lib/libc10_cuda.so)
Thank you for reporting this issue. It appears that an upgrade of your NVIDIA driver to version 525.* is necessary. Our image was built with PyTorch 2.* and CUDA 12.1, which requires a minimum driver version of 525.*.
Please note that the CUDA version installed on your host is not a concern in this case, as the Docker image bundles its own CUDA libraries. Upgrading your NVIDIA driver should resolve the issue.
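As a quick sanity check before pulling an image, you can compare the driver version reported by nvidia-smi against the minimum the image's CUDA build needs. A minimal sketch, assuming the minimums discussed in this thread (525.* for the CUDA 12.1 image, 520.61.05 for CUDA 11.8; the exact table entries are NVIDIA's, not part of ScaleLLM):

```python
# Minimal sketch: does the host NVIDIA driver meet the minimum required
# by a given CUDA build? Minimum versions below are assumptions taken
# from this thread / NVIDIA's compatibility tables.
MIN_DRIVER = {
    "12.1": (525, 60, 13),  # cu121 image needs a 525.* or newer driver
    "11.8": (520, 61, 5),   # cu118 image works with 520.61.05
}

def parse_version(s: str) -> tuple:
    """Turn '520.61.05' into (520, 61, 5) for tuple comparison."""
    return tuple(int(p) for p in s.split("."))

def driver_supports(driver: str, cuda: str) -> bool:
    return parse_version(driver) >= MIN_DRIVER[cuda]

print(driver_supports("520.61.05", "11.8"))  # True
print(driver_supports("520.61.05", "12.1"))  # False
```

With the driver from the follow-up comment (520.61.05), this confirms the cu118 image should run while the cu121 image will fail with the error above.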
We are thrilled to share that ScaleLLM has expanded its compatibility to include both CUDA 11.8 and CUDA 12.1. I've just released a new version specifically for this purpose. You can check it out here: New Release for CUDA 11.8 Support.
NVIDIA-SMI 520.61.05 Driver Version: 520.61.05 CUDA Version: 11.8
$ export MODEL_PATH=Yi-34B-Chat-4bits
$ export MODEL_ID=01-ai/Yi-34B-Chat-4bits
$ docker run -it --gpus=all --net=host --shm-size=1g \