[Bug] deepseek v2.5 w4a16 can not run #2362

cxmt-ai-tc · 2024-12-05T11:47:07Z

Checklist

1. I have searched related issues but cannot get the expected help.
2. The bug has not been fixed in the latest version.
3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
5. Please use English, otherwise it will be closed.

Describe the bug

I was using the model from https://huggingface.co/nm-testing/DeepSeek-V2.5-W4A16, and its config.json is:
{
"_name_or_path": "/home/dsikka/.cache/huggingface/hub/models--deepseek-ai--DeepSeek-V2.5/snapshots/24b08cb750e0c2757de112d2e16327cb21ed4833",
"architectures": [
"DeepseekV2ForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"auto_map": {
"AutoConfig": "configuration_deepseek.DeepseekV2Config",
"AutoModel": "modeling_deepseek.DeepseekV2Model",
"AutoModelForCausalLM": "modeling_deepseek.DeepseekV2ForCausalLM"
},
"aux_loss_alpha": 0.001,
"bos_token_id": 100000,
"eos_token_id": 100001,
"ep_size": 1,
"first_k_dense_replace": 1,
"hidden_act": "silu",
"hidden_size": 5120,
"initializer_range": 0.02,
"intermediate_size": 12288,
"kv_lora_rank": 512,
"max_position_embeddings": 163840,
"model_type": "deepseek_v2",
"moe_intermediate_size": 1536,
"moe_layer_freq": 1,
"n_group": 8,
"n_routed_experts": 160,
"n_shared_experts": 2,
"norm_topk_prob": false,
"num_attention_heads": 128,
"num_experts_per_tok": 6,
"num_hidden_layers": 60,
"num_key_value_heads": 128,
"pretraining_tp": 1,
"q_lora_rank": 1536,
"qk_nope_head_dim": 128,
"qk_rope_head_dim": 64,
"rms_norm_eps": 1e-06,
"rope_scaling": {
"beta_fast": 32,
"beta_slow": 1,
"factor": 40,
"mscale": 1.0,
"mscale_all_dim": 1.0,
"original_max_position_embeddings": 4096,
"type": "yarn"
},
"rope_theta": 10000,
"routed_scaling_factor": 16.0,
"scoring_func": "softmax",
"seq_aux": true,
"tie_word_embeddings": false,
"topk_group": 3,
"topk_method": "group_limited_greedy",
"torch_dtype": "bfloat16",
"transformers_version": "4.44.2",
"use_cache": true,
"v_head_dim": 128,
"vocab_size": 102400,
"quantization_config": {
"config_groups": {
"group_0": {
"input_activations": null,
"output_activations": null,
"targets": [
"Linear"
],
"weights": {
"actorder": null,
"block_structure": null,
"dynamic": false,
"group_size": null,
"num_bits": 4,
"observer": "minmax",
"observer_kwargs": {},
"strategy": "channel",
"symmetric": true,
"type": "int"
}
}
},
"format": "pack-quantized",
"global_compression_ratio": 2.265805157986176,
"ignore": [
"lm_head"
],
"kv_cache_scheme": null,
"quant_method": "compressed-tensors",
"quantization_status": "compressed",
"sparsity_config": {
"format": "dense",
"global_sparsity": 0.21918901165186397,
"registry_requires_subclass": false,
"sparsity_structure": "unstructured"
}
}
}
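For anyone who wants to check what a checkpoint declares before launching the server, the quantization settings can be read straight from config.json. The snippet below is a minimal sketch and not part of SGLang: the helper names `summarize_quant_config` and `load_config` are made up for illustration, and the code assumes only the key layout visible in the config above.

```python
import json
from pathlib import Path


def summarize_quant_config(config: dict) -> dict:
    """Collect the quantization fields that decide which kernels are needed."""
    qcfg = config.get("quantization_config") or {}
    summary = {
        "quant_method": qcfg.get("quant_method"),
        "format": qcfg.get("format"),
        "ignore": qcfg.get("ignore", []),
        "groups": {},
    }
    for name, group in (qcfg.get("config_groups") or {}).items():
        weights = group.get("weights") or {}
        summary["groups"][name] = {
            "targets": group.get("targets", []),
            "weight_bits": weights.get("num_bits"),
            "weight_type": weights.get("type"),
            "strategy": weights.get("strategy"),
            "symmetric": weights.get("symmetric"),
            # No activation scheme means activations stay 16-bit (the "A16" in W4A16).
            "activations_quantized": group.get("input_activations") is not None,
        }
    return summary


def load_config(model_dir: str) -> dict:
    """Read config.json from a local model directory (hypothetical helper)."""
    return json.loads((Path(model_dir) / "config.json").read_text())


# Trimmed copy of the quantization_config shown above.
EXAMPLE_CONFIG = {
    "quantization_config": {
        "config_groups": {
            "group_0": {
                "input_activations": None,
                "output_activations": None,
                "targets": ["Linear"],
                "weights": {
                    "num_bits": 4,
                    "strategy": "channel",
                    "symmetric": True,
                    "type": "int",
                },
            }
        },
        "format": "pack-quantized",
        "ignore": ["lm_head"],
        "quant_method": "compressed-tensors",
    }
}

if __name__ == "__main__":
    print(json.dumps(summarize_quant_config(EXAMPLE_CONFIG), indent=2))
```

For a local checkpoint, `summarize_quant_config(load_config("/path/to/model"))` would give the same kind of summary; the path here is a placeholder.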
I ran this model with the following command:
python3 -m sglang.launch_server --model-path /nas_data/userdata/tc/models/deepseek/deepseek-v2.5-w4a16/ --port 30000 --host 0.0.0.0 --tp 4 --trust-remote-code

and got the following error:
root@s0pgpuap12:/workspace/sglang# python3 -m sglang.launch_server --model-path /nas_data/userdata/tc/models/deepseek/deepseek-v2.5-w4a16/ --port 30000 --host 0.0.0.0 --tp 4 --trust-remote-code
[2024-12-05 03:45:04] server_args=ServerArgs(model_path='/nas_data/userdata/tc/models/deepseek/deepseek-v2.5-w4a16/', tokenizer_path='/nas_data/userdata/tc/models/deepseek/deepseek-v2.5-w4a16/', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=True, dtype='auto', kv_cache_dtype='auto', quantization=None, context_length=None, device='cuda', served_model_name='/nas_data/userdata/tc/models/deepseek/deepseek-v2.5-w4a16/', chat_template=None, is_embedding=False, revision=None, host='0.0.0.0', port=30000, mem_fraction_static=0.85, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, cpu_offload_gb=0, tp_size=4, stream_interval=1, random_seed=510750847, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_pth='SGLang_storage', enable_cache_report=False, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, lora_paths=None, max_loras_per_batch=8, attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='outlines', disable_radix_cache=False, disable_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, 
num_continuous_decode_steps=1, delete_ckpt_after_loading=False)
[2024-12-05 03:45:11 TP3] MLA optimization is turned on. Use triton backend.
[2024-12-05 03:45:11 TP3] Init torch distributed begin.
[2024-12-05 03:45:12 TP2] MLA optimization is turned on. Use triton backend.
[2024-12-05 03:45:12 TP2] Init torch distributed begin.
[2024-12-05 03:45:12 TP1] MLA optimization is turned on. Use triton backend.
[2024-12-05 03:45:12 TP1] Init torch distributed begin.
[2024-12-05 03:45:12 TP0] MLA optimization is turned on. Use triton backend.
[2024-12-05 03:45:12 TP0] Init torch distributed begin.
INFO 12-05 03:45:13 utils.py:1008] Found nccl from library libnccl.so.2
INFO 12-05 03:45:13 utils.py:1008] Found nccl from library libnccl.so.2
INFO 12-05 03:45:13 utils.py:1008] Found nccl from library libnccl.so.2
INFO 12-05 03:45:13 utils.py:1008] Found nccl from library libnccl.so.2
WARNING 12-05 03:45:13 custom_all_reduce.py:132] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 12-05 03:45:13 custom_all_reduce.py:132] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 12-05 03:45:13 custom_all_reduce.py:132] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 12-05 03:45:13 custom_all_reduce.py:132] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
[2024-12-05 03:45:13 TP0] Load weight begin. avail mem=43.69 GB
[2024-12-05 03:45:13 TP1] Load weight begin. avail mem=43.69 GB
[2024-12-05 03:45:13 TP2] Load weight begin. avail mem=43.69 GB
[2024-12-05 03:45:13 TP3] Load weight begin. avail mem=43.69 GB
[2024-12-05 03:45:14 TP1] lm_eval is not installed, GPTQ may not be usable
[2024-12-05 03:45:14 TP2] lm_eval is not installed, GPTQ may not be usable
[2024-12-05 03:45:14 TP0] lm_eval is not installed, GPTQ may not be usable
[2024-12-05 03:45:14 TP3] lm_eval is not installed, GPTQ may not be usable
INFO 12-05 03:45:14 compressed_tensors_wNa16.py:83] Using MarlinLinearKernel for CompressedTensorsWNA16
INFO 12-05 03:45:14 compressed_tensors_wNa16.py:83] Using MarlinLinearKernel for CompressedTensorsWNA16
INFO 12-05 03:45:14 compressed_tensors_wNa16.py:83] Using MarlinLinearKernel for CompressedTensorsWNA16
INFO 12-05 03:45:14 compressed_tensors_wNa16.py:83] Using MarlinLinearKernel for CompressedTensorsWNA16
Cache shape torch.Size([163840, 64])
Cache shape torch.Size([163840, 64])
Cache shape torch.Size([163840, 64])
Cache shape torch.Size([163840, 64])
[2024-12-05 03:45:14 TP2] Scheduler hit an exception: Traceback (most recent call last):
File "/workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1489, in run_scheduler_process
scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
File "/workspace/sglang/python/sglang/srt/managers/scheduler.py", line 194, in __init__
self.tp_worker = TpWorkerClass(
File "/workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 61, in __init__
self.worker = TpModelWorker(server_args, gpu_id, tp_rank, dp_rank, nccl_port)
File "/workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 62, in __init__
self.model_runner = ModelRunner(
File "/workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 152, in __init__
self.load_model()
File "/workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 246, in load_model
self.model = get_model(
File "/workspace/sglang/python/sglang/srt/model_loader/__init__.py", line 22, in get_model
return loader.load_model(
File "/workspace/sglang/python/sglang/srt/model_loader/loader.py", line 357, in load_model
model = _initialize_model(
File "/workspace/sglang/python/sglang/srt/model_loader/loader.py", line 138, in _initialize_model
return model_class(
File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 734, in __init__
self.model = DeepseekV2Model(config, quant_config)
File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 695, in __init__
[
File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 696, in <listcomp>
DeepseekV2DecoderLayer(
File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 626, in __init__
self.mlp = DeepseekV2MoE(config=config, quant_config=quant_config)
File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 115, in __init__
self.experts = FusedMoE(
File "/workspace/sglang/python/sglang/srt/layers/fused_moe_triton/layer.py", line 227, in __init__
assert self.quant_method is not None
AssertionError

[2024-12-05 03:45:14 TP1] Scheduler hit an exception: Traceback (most recent call last):
File "/workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1489, in run_scheduler_process
scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
File "/workspace/sglang/python/sglang/srt/managers/scheduler.py", line 194, in __init__
self.tp_worker = TpWorkerClass(
File "/workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 61, in __init__
self.worker = TpModelWorker(server_args, gpu_id, tp_rank, dp_rank, nccl_port)
File "/workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 62, in __init__
self.model_runner = ModelRunner(
File "/workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 152, in __init__
self.load_model()
File "/workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 246, in load_model
self.model = get_model(
File "/workspace/sglang/python/sglang/srt/model_loader/__init__.py", line 22, in get_model
return loader.load_model(
File "/workspace/sglang/python/sglang/srt/model_loader/loader.py", line 357, in load_model
model = _initialize_model(
File "/workspace/sglang/python/sglang/srt/model_loader/loader.py", line 138, in _initialize_model
return model_class(
File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 734, in __init__
self.model = DeepseekV2Model(config, quant_config)
File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 695, in __init__
[
File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 696, in <listcomp>
DeepseekV2DecoderLayer(
File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 626, in __init__
self.mlp = DeepseekV2MoE(config=config, quant_config=quant_config)
File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 115, in __init__
self.experts = FusedMoE(
File "/workspace/sglang/python/sglang/srt/layers/fused_moe_triton/layer.py", line 227, in __init__
assert self.quant_method is not None
AssertionError

[2024-12-05 03:45:14 TP0] Scheduler hit an exception: Traceback (most recent call last):
File "/workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1489, in run_scheduler_process
scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
File "/workspace/sglang/python/sglang/srt/managers/scheduler.py", line 194, in __init__
self.tp_worker = TpWorkerClass(
File "/workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 61, in __init__
self.worker = TpModelWorker(server_args, gpu_id, tp_rank, dp_rank, nccl_port)
File "/workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 62, in __init__
self.model_runner = ModelRunner(
File "/workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 152, in __init__
self.load_model()
File "/workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 246, in load_model
self.model = get_model(
File "/workspace/sglang/python/sglang/srt/model_loader/__init__.py", line 22, in get_model
return loader.load_model(
File "/workspace/sglang/python/sglang/srt/model_loader/loader.py", line 357, in load_model
model = _initialize_model(
File "/workspace/sglang/python/sglang/srt/model_loader/loader.py", line 138, in _initialize_model
return model_class(
File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 734, in __init__
self.model = DeepseekV2Model(config, quant_config)
File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 695, in __init__
[
File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 696, in <listcomp>
DeepseekV2DecoderLayer(
File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 626, in __init__
self.mlp = DeepseekV2MoE(config=config, quant_config=quant_config)
File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 115, in __init__
self.experts = FusedMoE(
File "/workspace/sglang/python/sglang/srt/layers/fused_moe_triton/layer.py", line 227, in __init__
assert self.quant_method is not None
AssertionError

[2024-12-05 03:45:14 TP3] Scheduler hit an exception: Traceback (most recent call last):
File "/workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1489, in run_scheduler_process
scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
File "/workspace/sglang/python/sglang/srt/managers/scheduler.py", line 194, in __init__
self.tp_worker = TpWorkerClass(
File "/workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 61, in __init__
self.worker = TpModelWorker(server_args, gpu_id, tp_rank, dp_rank, nccl_port)
File "/workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 62, in __init__
self.model_runner = ModelRunner(
File "/workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 152, in __init__
self.load_model()
File "/workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 246, in load_model
self.model = get_model(
File "/workspace/sglang/python/sglang/srt/model_loader/__init__.py", line 22, in get_model
return loader.load_model(
File "/workspace/sglang/python/sglang/srt/model_loader/loader.py", line 357, in load_model
model = _initialize_model(
File "/workspace/sglang/python/sglang/srt/model_loader/loader.py", line 138, in _initialize_model
return model_class(
File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 734, in __init__
self.model = DeepseekV2Model(config, quant_config)
File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 695, in __init__
[
File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 696, in <listcomp>
DeepseekV2DecoderLayer(
File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 626, in __init__
self.mlp = DeepseekV2MoE(config=config, quant_config=quant_config)
File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 115, in __init__
self.experts = FusedMoE(
File "/workspace/sglang/python/sglang/srt/layers/fused_moe_triton/layer.py", line 227, in __init__
assert self.quant_method is not None
AssertionError

Reproduction

python3 -m sglang.launch_server --model-path /nas_data/userdata/tc/models/deepseek/deepseek-v2.5-w4a16/ --port 30000 --host 0.0.0.0 --tp 4 --trust-remote-code

Environment

Python: 3.10.15 (main, Sep 7 2024, 18:35:33) [GCC 9.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA A40
GPU 0,1,2,3,4,5,6,7 Compute Capability: 8.6
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.1, V12.1.105
CUDA Driver Version: 550.54.14
PyTorch: 2.4.0+cu121
sglang: 0.4.0
flashinfer: 0.1.6+cu121torch2.4
triton: 3.0.0
transformers: 4.46.2
torchao: 0.6.1
numpy: 1.26.4
aiohttp: 3.11.2
fastapi: 0.115.5
hf_transfer: 0.1.8
huggingface_hub: 0.26.2
interegular: 0.3.3
modelscope: 1.20.1
orjson: 3.10.11
packaging: 24.2
psutil: 6.1.0
pydantic: 2.9.2
multipart: 0.0.17
zmq: 26.2.0
uvicorn: 0.32.0
uvloop: 0.21.0
vllm: 0.6.3.post1
openai: 1.54.4
anthropic: 0.39.0
decord: 0.6.0
NVIDIA Topology:
      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  NIC0  NIC1  NIC2  NIC3  CPU Affinity    NUMA Affinity  GPU NUMA ID
GPU0  X     NV4   PIX   PIX   SYS   SYS   SYS   SYS   PIX   PIX   SYS   SYS   0-15,64-79      0              N/A
GPU1  NV4   X     PIX   PIX   SYS   SYS   SYS   SYS   PIX   PIX   SYS   SYS   0-15,64-79      0              N/A
GPU2  PIX   PIX   X     NV4   SYS   SYS   SYS   SYS   PIX   PIX   SYS   SYS   0-15,64-79      0              N/A
GPU3  PIX   PIX   NV4   X     SYS   SYS   SYS   SYS   PIX   PIX   SYS   SYS   0-15,64-79      0              N/A
GPU4  SYS   SYS   SYS   SYS   X     NV4   PIX   PIX   SYS   SYS   PIX   PIX   32-47,96-111    2              N/A
GPU5  SYS   SYS   SYS   SYS   NV4   X     PIX   PIX   SYS   SYS   PIX   PIX   32-47,96-111    2              N/A
GPU6  SYS   SYS   SYS   SYS   PIX   PIX   X     NV4   SYS   SYS   PIX   PIX   32-47,96-111    2              N/A
GPU7  SYS   SYS   SYS   SYS   PIX   PIX   NV4   X     SYS   SYS   PIX   PIX   32-47,96-111    2              N/A
NIC0  PIX   PIX   PIX   PIX   SYS   SYS   SYS   SYS   X     PIX   SYS   SYS
NIC1  PIX   PIX   PIX   PIX   SYS   SYS   SYS   SYS   PIX   X     SYS   SYS
NIC2  SYS   SYS   SYS   SYS   PIX   PIX   PIX   PIX   SYS   SYS   X     PIX
NIC3  SYS   SYS   SYS   SYS   PIX   PIX   PIX   PIX   SYS   SYS   PIX   X

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

NIC Legend:

NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3

ulimit soft: 1048576

zhyncs · 2024-12-05T20:16:15Z

Could you try the latest main? Ref #2364

cxmt-ai-tc · 2024-12-06T01:24:22Z

Same error.

ispobock · 2024-12-07T08:29:47Z

@cxmt-ai-tc This is a compressed tensors W4A16 quantized DeepSeek model. We plan to support it later but not with high priority, since AWQ W4A16 is already supported. Maybe you can try AWQ first.
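To illustrate the failure mode: the traceback ends at `assert self.quant_method is not None` inside `FusedMoE`, which is what one would expect to see when a quantization config can hand out a method for linear layers but returns nothing for the fused MoE layer. The following is a toy sketch of that pattern only. All names in it (`ToyQuantConfig`, `ToyLinear`, `ToyFusedMoE`) are hypothetical, and it is not SGLang's or vLLM's actual code.

```python
from typing import Optional


class ToyLinear:
    """Stand-in for a dense linear layer."""


class ToyQuantConfig:
    """Maps layer types to a quantization method name, or None if unsupported."""

    def __init__(self, name: str, supported_layers):
        self.name = name
        self.supported_layers = tuple(supported_layers)

    def get_quant_method(self, layer) -> Optional[str]:
        if isinstance(layer, self.supported_layers):
            return f"{self.name}:{type(layer).__name__}"
        # Unsupported layer type: the caller has to deal with None.
        return None


class ToyFusedMoE:
    """Stand-in for a fused mixture-of-experts layer."""

    def __init__(self, quant_config: Optional[ToyQuantConfig] = None):
        if quant_config is None:
            self.quant_method = "unquantized"
        else:
            self.quant_method = quant_config.get_quant_method(self)
        # Same kind of check as the one that fails in the traceback.
        assert self.quant_method is not None


# A config that only knows linear layers versus one that also covers MoE.
linear_only = ToyQuantConfig("w4a16", [ToyLinear])
with_moe = ToyQuantConfig("w4a16", [ToyLinear, ToyFusedMoE])

print(ToyFusedMoE(with_moe).quant_method)

try:
    ToyFusedMoE(linear_only)
except AssertionError:
    print("AssertionError: no quant method available for ToyFusedMoE")
```

In this toy version the bare `assert` gives no hint about the cause, which matches the empty `AssertionError` in the log above.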

cxmt-ai-tc · 2024-12-09T07:44:47Z

I tried an AWQ model (https://huggingface.co/casperhansen/deepseek-coder-v2-instruct-awq) and got the following error on A40:

CUDA_VISIBLE_DEVICES=2,3,6,7 python3 -m sglang.launch_server --model-path /nas_data/userdata/tc/models/deepseek/deepseek-coder-v2-instruct-awq/ --port 50800 --host 0.0.0.0 --tp 4 --trust-remote-code --context-length 512 --max-running-requests 100 --mem-fraction-static 0.8 --disable-cuda-graph
[2024-12-08 23:43:01] server_args=ServerArgs(model_path='/nas_data/userdata/tc/models/deepseek/deepseek-coder-v2-instruct-awq/', tokenizer_path='/nas_data/userdata/tc/models/deepseek/deepseek-coder-v2-instruct-awq/', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=True, dtype='auto', kv_cache_dtype='auto', quantization=None, context_length=512, device='cuda', served_model_name='/nas_data/userdata/tc/models/deepseek/deepseek-coder-v2-instruct-awq/', chat_template=None, is_embedding=False, revision=None, host='0.0.0.0', port=50800, mem_fraction_static=0.8, max_running_requests=100, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, cpu_offload_gb=0, tp_size=4, stream_interval=1, random_seed=232609575, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_pth='SGLang_storage', enable_cache_report=False, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, lora_paths=None, max_loras_per_batch=8, attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='outlines', disable_radix_cache=False, disable_jump_forward=False, disable_cuda_graph=True, disable_cuda_graph_padding=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_ep_moe=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, torchao_config='', enable_nan_detection=False, 
enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False)
INFO 12-08 23:43:01 awq_marlin.py:97] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
INFO 12-08 23:43:08 awq_marlin.py:97] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
INFO 12-08 23:43:08 awq_marlin.py:97] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
INFO 12-08 23:43:08 awq_marlin.py:97] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
INFO 12-08 23:43:08 awq_marlin.py:97] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
INFO 12-08 23:43:08 awq_marlin.py:97] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
[2024-12-08 23:43:08 TP2] MLA optimization is turned on. Use triton backend.
[2024-12-08 23:43:08 TP2] Init torch distributed begin.
INFO 12-08 23:43:08 awq_marlin.py:97] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
[2024-12-08 23:43:08 TP3] MLA optimization is turned on. Use triton backend.
[2024-12-08 23:43:08 TP3] Init torch distributed begin.
INFO 12-08 23:43:08 awq_marlin.py:97] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
[2024-12-08 23:43:08 TP1] MLA optimization is turned on. Use triton backend.
[2024-12-08 23:43:08 TP1] Init torch distributed begin.
INFO 12-08 23:43:08 awq_marlin.py:97] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
[2024-12-08 23:43:08 TP0] MLA optimization is turned on. Use triton backend.
[2024-12-08 23:43:08 TP0] Init torch distributed begin.
INFO 12-08 23:43:09 utils.py:1008] Found nccl from library libnccl.so.2
INFO 12-08 23:43:09 utils.py:1008] Found nccl from library libnccl.so.2
INFO 12-08 23:43:09 utils.py:1008] Found nccl from library libnccl.so.2
INFO 12-08 23:43:09 utils.py:1008] Found nccl from library libnccl.so.2
WARNING 12-08 23:43:10 custom_all_reduce.py:132] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 12-08 23:43:10 custom_all_reduce.py:132] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 12-08 23:43:10 custom_all_reduce.py:132] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 12-08 23:43:10 custom_all_reduce.py:132] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
[2024-12-08 23:43:10 TP1] Load weight begin. avail mem=44.06 GB
[2024-12-08 23:43:10 TP0] Load weight begin. avail mem=44.06 GB
[2024-12-08 23:43:10 TP3] Load weight begin. avail mem=44.06 GB
[2024-12-08 23:43:10 TP2] Load weight begin. avail mem=44.06 GB
[2024-12-08 23:43:11 TP3] lm_eval is not installed, GPTQ may not be usable
[2024-12-08 23:43:11 TP1] lm_eval is not installed, GPTQ may not be usable
[2024-12-08 23:43:11 TP0] lm_eval is not installed, GPTQ may not be usable
[2024-12-08 23:43:11 TP2] lm_eval is not installed, GPTQ may not be usable
Cache shape torch.Size([163840, 64])
Cache shape torch.Size([163840, 64])
Cache shape torch.Size([163840, 64])
Cache shape torch.Size([163840, 64])
Loading safetensors checkpoint shards: 0% Completed | 0/26 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 4% Completed | 1/26 [00:00<00:17, 1.40it/s]
Loading safetensors checkpoint shards: 8% Completed | 2/26 [00:01<00:21, 1.12it/s]
Loading safetensors checkpoint shards: 12% Completed | 3/26 [00:02<00:20, 1.11it/s]
Loading safetensors checkpoint shards: 15% Completed | 4/26 [00:03<00:19, 1.11it/s]
Loading safetensors checkpoint shards: 19% Completed | 5/26 [00:04<00:19, 1.06it/s]
Loading safetensors checkpoint shards: 23% Completed | 6/26 [00:05<00:19, 1.02it/s]
Loading safetensors checkpoint shards: 27% Completed | 7/26 [00:06<00:19, 1.00s/it]
Loading safetensors checkpoint shards: 31% Completed | 8/26 [00:07<00:17, 1.01it/s]
Loading safetensors checkpoint shards: 35% Completed | 9/26 [00:08<00:17, 1.01s/it]
Loading safetensors checkpoint shards: 38% Completed | 10/26 [00:09<00:15, 1.02it/s]
Loading safetensors checkpoint shards: 42% Completed | 11/26 [00:10<00:14, 1.04it/s]
Loading safetensors checkpoint shards: 46% Completed | 12/26 [00:11<00:13, 1.06it/s]
Loading safetensors checkpoint shards: 50% Completed | 13/26 [00:12<00:12, 1.06it/s]
Loading safetensors checkpoint shards: 54% Completed | 14/26 [00:13<00:11, 1.07it/s]
Loading safetensors checkpoint shards: 58% Completed | 15/26 [00:14<00:10, 1.07it/s]
Loading safetensors checkpoint shards: 62% Completed | 16/26 [00:15<00:09, 1.06it/s]
Loading safetensors checkpoint shards: 65% Completed | 17/26 [00:16<00:08, 1.07it/s]
Loading safetensors checkpoint shards: 69% Completed | 18/26 [00:17<00:07, 1.07it/s]
Loading safetensors checkpoint shards: 73% Completed | 19/26 [00:17<00:06, 1.06it/s]
Loading safetensors checkpoint shards: 77% Completed | 20/26 [00:19<00:07, 1.18s/it]
Loading safetensors checkpoint shards: 81% Completed | 21/26 [00:20<00:05, 1.14s/it]
Loading safetensors checkpoint shards: 85% Completed | 22/26 [00:21<00:04, 1.13s/it]
Loading safetensors checkpoint shards: 88% Completed | 23/26 [00:22<00:03, 1.13s/it]
Loading safetensors checkpoint shards: 92% Completed | 24/26 [00:24<00:02, 1.12s/it]
Loading safetensors checkpoint shards: 96% Completed | 25/26 [00:25<00:01, 1.08s/it]
Loading safetensors checkpoint shards: 100% Completed | 26/26 [00:25<00:00, 1.06it/s]
Loading safetensors checkpoint shards: 100% Completed | 26/26 [00:25<00:00, 1.01it/s]

[2024-12-08 23:43:51 TP1] Load weight end. type=DeepseekV2ForCausalLM, dtype=torch.float16, avail mem=13.00 GB
[2024-12-08 23:43:51 TP0] Load weight end. type=DeepseekV2ForCausalLM, dtype=torch.float16, avail mem=13.00 GB
[2024-12-08 23:43:51 TP2] Load weight end. type=DeepseekV2ForCausalLM, dtype=torch.float16, avail mem=13.00 GB
[2024-12-08 23:43:51 TP3] Load weight end. type=DeepseekV2ForCausalLM, dtype=torch.float16, avail mem=13.00 GB
[2024-12-08 23:43:51 TP0] Memory pool end. avail mem=8.85 GB
[2024-12-08 23:43:51 TP1] Memory pool end. avail mem=8.85 GB
[2024-12-08 23:43:51 TP2] Memory pool end. avail mem=8.85 GB
[2024-12-08 23:43:51 TP3] Memory pool end. avail mem=8.85 GB
[2024-12-08 23:43:51 TP2] The following error message 'operation scheduled before its operands' can be ignored.
[2024-12-08 23:43:51 TP1] The following error message 'operation scheduled before its operands' can be ignored.
[2024-12-08 23:43:51 TP0] The following error message 'operation scheduled before its operands' can be ignored.
[2024-12-08 23:43:51 TP3] The following error message 'operation scheduled before its operands' can be ignored.
[2024-12-08 23:43:52 TP0] max_total_num_tokens=64926, max_prefill_tokens=16384, max_running_requests=100, context_len=512
[2024-12-08 23:43:52 TP1] max_total_num_tokens=64926, max_prefill_tokens=16384, max_running_requests=100, context_len=512
[2024-12-08 23:43:52 TP2] max_total_num_tokens=64926, max_prefill_tokens=16384, max_running_requests=100, context_len=512
[2024-12-08 23:43:52 TP3] max_total_num_tokens=64926, max_prefill_tokens=16384, max_running_requests=100, context_len=512
[2024-12-08 23:43:52] INFO: Started server process [25439]
[2024-12-08 23:43:52] INFO: Waiting for application startup.
[2024-12-08 23:43:52] INFO: Application startup complete.
[2024-12-08 23:43:52] INFO: Uvicorn running on http://0.0.0.0:50800 (Press CTRL+C to quit)
[2024-12-08 23:43:53] INFO: 127.0.0.1:40164 - "GET /get_model_info HTTP/1.1" 200 OK
[2024-12-08 23:43:53 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
WARNING 12-08 23:43:57 fused_moe.py:323] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=160,N=10240,device_name=NVIDIA_A40.json
WARNING 12-08 23:43:57 fused_moe.py:323] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=160,N=10240,device_name=NVIDIA_A40.json
[2024-12-08 23:43:57 TP0] TpModelWorkerClient hit an exception: Traceback (most recent call last):
File "/workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 103, in forward_thread_func
self.forward_thread_func_()
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 134, in forward_thread_func_
logits_output, next_token_ids = self.worker.forward_batch_generation(
File "/workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 154, in forward_batch_generation
logits_output = self.model_runner.forward(forward_batch)
File "/workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 686, in forward
return self.forward_extend(forward_batch)
File "/workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 655, in forward_extend
return self.model.forward(
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 823, in forward
hidden_states = self.model(input_ids, positions, forward_batch)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 784, in forward
hidden_states, residual = layer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 739, in forward
hidden_states = self.mlp(hidden_states)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 151, in forward
self.experts(hidden_states=hidden_states, router_logits=router_logits)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/sglang/python/sglang/srt/layers/fused_moe_triton/layer.py", line 555, in forward
final_hidden_states = self.quant_method.apply(
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/awq_marlin.py", line 452, in apply
return fused_marlin_moe(
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/fused_marlin_moe.py", line 219, in fused_marlin_moe
sorted_token_ids, _, _ = moe_align_block_size(topk_ids, block_size_m, E)
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 228, in moe_align_block_size
ops.moe_align_block_size(topk_ids, num_experts, block_size, sorted_ids,
File "/usr/local/lib/python3.10/dist-packages/vllm/_custom_ops.py", line 45, in wrapper
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/_custom_ops.py", line 844, in moe_align_block_size
torch.ops._C.moe_align_block_size(topk_ids, num_experts, block_size,
File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 1061, in __call__
return self._op(*args, **(kwargs or {}))
RuntimeError: CUDA error: invalid argument
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

WARNING 12-08 23:43:57 fused_moe.py:323] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=160,N=10240,device_name=NVIDIA_A40.json
WARNING 12-08 23:43:57 fused_moe.py:323] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=160,N=10240,device_name=NVIDIA_A40.json
Killed
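
For readers unfamiliar with the op at the bottom of the traceback, here is a rough pure-Python sketch of what "aligning by block size" means conceptually: the token slots routed to each expert are grouped together and padded so that every expert's group fills whole blocks. This is an illustration only, not vLLM's CUDA kernel. The function name `align_block_size_reference`, its list-based inputs, and the choice of padding sentinel are assumptions made for the example; the only thing taken from the traceback is that the call takes `topk_ids`, a block size and the expert count `E`, and returns three values.

```python
def align_block_size_reference(topk_ids, num_experts, block_size):
    """Hypothetical reference: group token slots by expert, pad to block_size.

    topk_ids is a list of rows, one row of chosen expert ids per token.
    Returns (sorted_ids, expert_ids, num_slots_after_padding).
    """
    # Flatten so that each (token, k) pair gets one slot index.
    flat = [expert for row in topk_ids for expert in row]
    pad = len(flat)  # assumed sentinel: an index past the last real slot

    sorted_ids = []  # slot indices grouped by expert, with padding
    expert_ids = []  # owning expert of each block of `block_size` slots
    for expert in range(num_experts):
        slots = [i for i, e in enumerate(flat) if e == expert]
        n_blocks = -(-len(slots) // block_size)  # ceiling division
        slots += [pad] * (n_blocks * block_size - len(slots))
        sorted_ids += slots
        expert_ids += [expert] * n_blocks
    return sorted_ids, expert_ids, len(sorted_ids)


# Three tokens, top-2 routing over three experts, blocks of two slots.
ids, experts, total = align_block_size_reference(
    [[0, 2], [2, 1], [2, 0]], num_experts=3, block_size=2
)
print(ids, experts, total)
```

With 160 experts, as in the log above, the same grouping is done over a much larger expert dimension. Whether that size is what triggers the `invalid argument` error on the A40 is not established in this thread.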

ispobock · 2024-12-14T14:40:12Z

@cxmt-ai-tc Could you try the latest sglang and vllm versions? It seems vllm is not installed correctly.
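
One quick way to confirm which versions are actually installed in the active environment is a small script like the one below. It is a minimal sketch that uses only the standard library's `importlib.metadata`. The helper name `installed_versions` and the default package list are assumptions made for this example. It reports what pip has recorded, which does not by itself prove that vllm's compiled extension loads correctly.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("sglang", "vllm", "torch")):
    """Return {package name: version string, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The distribution is not present in this environment.
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```

Running it inside the same conda environment or container that launches the server avoids checking a different Python installation by mistake.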

cxmt-ai-tc · 2024-12-16T10:49:47Z

> @cxmt-ai-tc Could you try the latest sglang and vllm version? It seems vllm is not correctly installed.

I am using conda with Python 3.10 and installed vllm 0.6.4.post1. I get the same error in vLLM server mode.
