[Bug] deepseek v2.5 w4a16 can not run #2362
Comments
Could you try the latest main? Ref #2364.
Same error.
@cxmt-ai-tc This is a compressed-tensors W4A16 quantized DeepSeek model. We plan to support it later, but not with high priority, since AWQ W4A16 is already supported. Maybe you can try AWQ first.
I tried the AWQ model (https://huggingface.co/casperhansen/deepseek-coder-v2-instruct-awq) and got this error on A40:
CUDA_VISIBLE_DEVICES=2,3,6,7 python3 -m sglang.launch_server --model-path /nas_data/userdata/tc/models/deepseek/deepseek-coder-v2-instruct-awq/ --port 50800 --host 0.0.0.0 --tp 4 --trust-remote-code --context-length 512 --max-running-requests 100 --mem-fraction-static 0.8 --disable-cuda-graph
[2024-12-08 23:43:51 TP1] Load weight end. type=DeepseekV2ForCausalLM, dtype=torch.float16, avail mem=13.00 GB
WARNING 12-08 23:43:57 fused_moe.py:323] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=160,N=10240,device_name=NVIDIA_A40.json
@cxmt-ai-tc Could you try the latest sglang and vllm versions? It seems vllm is not correctly installed.
I am using conda with Python 3.10 and installed vllm 0.6.4.post1. I get the same error in vllm server mode.
Checklist
Describe the bug
I was using the model from https://huggingface.co/nm-testing/DeepSeek-V2.5-W4A16, and its config.json looks like this:
{
"_name_or_path": "/home/dsikka/.cache/huggingface/hub/models--deepseek-ai--DeepSeek-V2.5/snapshots/24b08cb750e0c2757de112d2e16327cb21ed4833",
"architectures": [
"DeepseekV2ForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"auto_map": {
"AutoConfig": "configuration_deepseek.DeepseekV2Config",
"AutoModel": "modeling_deepseek.DeepseekV2Model",
"AutoModelForCausalLM": "modeling_deepseek.DeepseekV2ForCausalLM"
},
"aux_loss_alpha": 0.001,
"bos_token_id": 100000,
"eos_token_id": 100001,
"ep_size": 1,
"first_k_dense_replace": 1,
"hidden_act": "silu",
"hidden_size": 5120,
"initializer_range": 0.02,
"intermediate_size": 12288,
"kv_lora_rank": 512,
"max_position_embeddings": 163840,
"model_type": "deepseek_v2",
"moe_intermediate_size": 1536,
"moe_layer_freq": 1,
"n_group": 8,
"n_routed_experts": 160,
"n_shared_experts": 2,
"norm_topk_prob": false,
"num_attention_heads": 128,
"num_experts_per_tok": 6,
"num_hidden_layers": 60,
"num_key_value_heads": 128,
"pretraining_tp": 1,
"q_lora_rank": 1536,
"qk_nope_head_dim": 128,
"qk_rope_head_dim": 64,
"rms_norm_eps": 1e-06,
"rope_scaling": {
"beta_fast": 32,
"beta_slow": 1,
"factor": 40,
"mscale": 1.0,
"mscale_all_dim": 1.0,
"original_max_position_embeddings": 4096,
"type": "yarn"
},
"rope_theta": 10000,
"routed_scaling_factor": 16.0,
"scoring_func": "softmax",
"seq_aux": true,
"tie_word_embeddings": false,
"topk_group": 3,
"topk_method": "group_limited_greedy",
"torch_dtype": "bfloat16",
"transformers_version": "4.44.2",
"use_cache": true,
"v_head_dim": 128,
"vocab_size": 102400,
"quantization_config": {
"config_groups": {
"group_0": {
"input_activations": null,
"output_activations": null,
"targets": [
"Linear"
],
"weights": {
"actorder": null,
"block_structure": null,
"dynamic": false,
"group_size": null,
"num_bits": 4,
"observer": "minmax",
"observer_kwargs": {},
"strategy": "channel",
"symmetric": true,
"type": "int"
}
}
},
"format": "pack-quantized",
"global_compression_ratio": 2.265805157986176,
"ignore": [
"lm_head"
],
"kv_cache_scheme": null,
"quant_method": "compressed-tensors",
"quantization_status": "compressed",
"sparsity_config": {
"format": "dense",
"global_sparsity": 0.21918901165186397,
"registry_requires_subclass": false,
"sparsity_structure": "unstructured"
}
}
}
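As a quick check before launching a server, the quantization scheme can be read directly from config.json. The following is a minimal sketch and not part of sglang or vllm: the helper name summarize_quantization is hypothetical, and the embedded JSON is a trimmed excerpt of the config above, kept inline so the snippet runs on its own.

```python
import json

# Trimmed excerpt of the config.json shown above (only the fields inspected here).
CONFIG_EXCERPT = """
{
  "architectures": ["DeepseekV2ForCausalLM"],
  "quantization_config": {
    "config_groups": {
      "group_0": {
        "targets": ["Linear"],
        "weights": {"num_bits": 4, "strategy": "channel", "symmetric": true, "type": "int"}
      }
    },
    "format": "pack-quantized",
    "ignore": ["lm_head"],
    "quant_method": "compressed-tensors"
  }
}
"""


def summarize_quantization(config: dict) -> dict:
    """Hypothetical helper: collect the quantization fields relevant to this issue."""
    quant = config.get("quantization_config") or {}
    groups = quant.get("config_groups") or {}
    return {
        "quant_method": quant.get("quant_method"),
        "format": quant.get("format"),
        # Layer types the quantization scheme says it applies to.
        "targets": sorted({t for g in groups.values() for t in g.get("targets", [])}),
        # Distinct weight bit widths across all config groups.
        "weight_bits": sorted(
            {g["weights"]["num_bits"] for g in groups.values() if g.get("weights")}
        ),
    }


summary = summarize_quantization(json.loads(CONFIG_EXCERPT))
print(summary)
```

To inspect a real checkpoint, the same helper can be applied to the dictionary loaded from the config.json file in the model directory. A quant_method of "compressed-tensors" here is what the maintainer's comment refers to as not yet supported for this model.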
I ran this model with the following command:
python3 -m sglang.launch_server --model-path /nas_data/userdata/tc/models/deepseek/deepseek-v2.5-w4a16/ --port 30000 --host 0.0.0.0 --tp 4 --trust-remote-code
and got this error:
root@s0pgpuap12:/workspace/sglang# python3 -m sglang.launch_server --model-path /nas_data/userdata/tc/models/deepseek/deepseek-v2.5-w4a16/ --port 30000 --host 0.0.0.0 --tp 4 --trust-remote-code
[2024-12-05 03:45:04] server_args=ServerArgs(model_path='/nas_data/userdata/tc/models/deepseek/deepseek-v2.5-w4a16/', tokenizer_path='/nas_data/userdata/tc/models/deepseek/deepseek-v2.5-w4a16/', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=True, dtype='auto', kv_cache_dtype='auto', quantization=None, context_length=None, device='cuda', served_model_name='/nas_data/userdata/tc/models/deepseek/deepseek-v2.5-w4a16/', chat_template=None, is_embedding=False, revision=None, host='0.0.0.0', port=30000, mem_fraction_static=0.85, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, cpu_offload_gb=0, tp_size=4, stream_interval=1, random_seed=510750847, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_pth='SGLang_storage', enable_cache_report=False, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, lora_paths=None, max_loras_per_batch=8, attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='outlines', disable_radix_cache=False, disable_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, 
num_continuous_decode_steps=1, delete_ckpt_after_loading=False)
[2024-12-05 03:45:11 TP3] MLA optimization is turned on. Use triton backend.
[2024-12-05 03:45:11 TP3] Init torch distributed begin.
[2024-12-05 03:45:12 TP2] MLA optimization is turned on. Use triton backend.
[2024-12-05 03:45:12 TP2] Init torch distributed begin.
[2024-12-05 03:45:12 TP1] MLA optimization is turned on. Use triton backend.
[2024-12-05 03:45:12 TP1] Init torch distributed begin.
[2024-12-05 03:45:12 TP0] MLA optimization is turned on. Use triton backend.
[2024-12-05 03:45:12 TP0] Init torch distributed begin.
INFO 12-05 03:45:13 utils.py:1008] Found nccl from library libnccl.so.2
INFO 12-05 03:45:13 utils.py:1008] Found nccl from library libnccl.so.2
INFO 12-05 03:45:13 utils.py:1008] Found nccl from library libnccl.so.2
INFO 12-05 03:45:13 utils.py:1008] Found nccl from library libnccl.so.2
WARNING 12-05 03:45:13 custom_all_reduce.py:132] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 12-05 03:45:13 custom_all_reduce.py:132] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 12-05 03:45:13 custom_all_reduce.py:132] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 12-05 03:45:13 custom_all_reduce.py:132] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
[2024-12-05 03:45:13 TP0] Load weight begin. avail mem=43.69 GB
[2024-12-05 03:45:13 TP1] Load weight begin. avail mem=43.69 GB
[2024-12-05 03:45:13 TP2] Load weight begin. avail mem=43.69 GB
[2024-12-05 03:45:13 TP3] Load weight begin. avail mem=43.69 GB
[2024-12-05 03:45:14 TP1] lm_eval is not installed, GPTQ may not be usable
[2024-12-05 03:45:14 TP2] lm_eval is not installed, GPTQ may not be usable
[2024-12-05 03:45:14 TP0] lm_eval is not installed, GPTQ may not be usable
[2024-12-05 03:45:14 TP3] lm_eval is not installed, GPTQ may not be usable
INFO 12-05 03:45:14 compressed_tensors_wNa16.py:83] Using MarlinLinearKernel for CompressedTensorsWNA16
INFO 12-05 03:45:14 compressed_tensors_wNa16.py:83] Using MarlinLinearKernel for CompressedTensorsWNA16
INFO 12-05 03:45:14 compressed_tensors_wNa16.py:83] Using MarlinLinearKernel for CompressedTensorsWNA16
INFO 12-05 03:45:14 compressed_tensors_wNa16.py:83] Using MarlinLinearKernel for CompressedTensorsWNA16
Cache shape torch.Size([163840, 64])
Cache shape torch.Size([163840, 64])
Cache shape torch.Size([163840, 64])
Cache shape torch.Size([163840, 64])
[2024-12-05 03:45:14 TP2] Scheduler hit an exception: Traceback (most recent call last):
  File "/workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1489, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
  File "/workspace/sglang/python/sglang/srt/managers/scheduler.py", line 194, in __init__
    self.tp_worker = TpWorkerClass(
  File "/workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 61, in __init__
    self.worker = TpModelWorker(server_args, gpu_id, tp_rank, dp_rank, nccl_port)
  File "/workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 62, in __init__
    self.model_runner = ModelRunner(
  File "/workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 152, in __init__
    self.load_model()
  File "/workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 246, in load_model
    self.model = get_model(
  File "/workspace/sglang/python/sglang/srt/model_loader/__init__.py", line 22, in get_model
    return loader.load_model(
  File "/workspace/sglang/python/sglang/srt/model_loader/loader.py", line 357, in load_model
    model = _initialize_model(
  File "/workspace/sglang/python/sglang/srt/model_loader/loader.py", line 138, in _initialize_model
    return model_class(
  File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 734, in __init__
    self.model = DeepseekV2Model(config, quant_config)
  File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 695, in __init__
    [
  File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 696, in <listcomp>
    DeepseekV2DecoderLayer(
  File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 626, in __init__
    self.mlp = DeepseekV2MoE(config=config, quant_config=quant_config)
  File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 115, in __init__
    self.experts = FusedMoE(
  File "/workspace/sglang/python/sglang/srt/layers/fused_moe_triton/layer.py", line 227, in __init__
    assert self.quant_method is not None
AssertionError
[2024-12-05 03:45:14 TP1], [2024-12-05 03:45:14 TP0], and [2024-12-05 03:45:14 TP3] each report "Scheduler hit an exception" with a traceback identical to the TP2 traceback above, ending in the same AssertionError.
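The assertion at the bottom of the traceback fires when no quantization method is available for the fused MoE layer. The toy sketch below illustrates that dispatch pattern in isolation. All class names in it (ToyLinear, ToyFusedMoE, ToyLinearOnlyQuantConfig) are hypothetical stand-ins and do not reproduce the actual sglang or vllm code. It assumes, in line with the maintainer's comment and the "targets": ["Linear"] entry in the config, that the quantization config only knows how to handle linear layers.

```python
from typing import Optional


class ToyLinear:
    """Hypothetical stand-in for an ordinary linear layer."""


class ToyLinearOnlyQuantConfig:
    """Hypothetical quant config that only supports linear layers."""

    def get_quant_method(self, layer: object) -> Optional[str]:
        if isinstance(layer, ToyLinear):
            return "w4a16-linear-kernel"
        return None  # no method registered for any other layer type


class ToyFusedMoE:
    """Hypothetical stand-in for a fused MoE layer that requires a quant method."""

    def __init__(self, quant_config: ToyLinearOnlyQuantConfig) -> None:
        self.quant_method = quant_config.get_quant_method(self)
        # Same kind of check as the one shown in the traceback.
        assert self.quant_method is not None


config = ToyLinearOnlyQuantConfig()
linear_method = config.get_quant_method(ToyLinear())
print("linear layer method:", linear_method)

try:
    ToyFusedMoE(config)
    moe_built = True
except AssertionError:
    moe_built = False
print("fused MoE constructed:", moe_built)
```

Under these assumptions the linear layers load (the log shows MarlinLinearKernel being selected for them), while model construction stops at the first MoE layer. Whether this matches the real code path in a given sglang or vllm version would need to be confirmed against the source.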
Reproduction
python3 -m sglang.launch_server --model-path /nas_data/userdata/tc/models/deepseek/deepseek-v2.5-w4a16/ --port 30000 --host 0.0.0.0 --tp 4 --trust-remote-code
Environment
Python: 3.10.15 (main, Sep 7 2024, 18:35:33) [GCC 9.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA A40
GPU 0,1,2,3,4,5,6,7 Compute Capability: 8.6
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.1, V12.1.105
CUDA Driver Version: 550.54.14
PyTorch: 2.4.0+cu121
sglang: 0.4.0
flashinfer: 0.1.6+cu121torch2.4
triton: 3.0.0
transformers: 4.46.2
torchao: 0.6.1
numpy: 1.26.4
aiohttp: 3.11.2
fastapi: 0.115.5
hf_transfer: 0.1.8
huggingface_hub: 0.26.2
interegular: 0.3.3
modelscope: 1.20.1
orjson: 3.10.11
packaging: 24.2
psutil: 6.1.0
pydantic: 2.9.2
multipart: 0.0.17
zmq: 26.2.0
uvicorn: 0.32.0
uvloop: 0.21.0
vllm: 0.6.3.post1
openai: 1.54.4
anthropic: 0.39.0
decord: 0.6.0
NVIDIA Topology:
        GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  NIC0  NIC1  NIC2  NIC3  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0    X     NV4   PIX   PIX   SYS   SYS   SYS   SYS   PIX   PIX   SYS   SYS   0-15,64-79    0              N/A
GPU1    NV4   X     PIX   PIX   SYS   SYS   SYS   SYS   PIX   PIX   SYS   SYS   0-15,64-79    0              N/A
GPU2    PIX   PIX   X     NV4   SYS   SYS   SYS   SYS   PIX   PIX   SYS   SYS   0-15,64-79    0              N/A
GPU3    PIX   PIX   NV4   X     SYS   SYS   SYS   SYS   PIX   PIX   SYS   SYS   0-15,64-79    0              N/A
GPU4    SYS   SYS   SYS   SYS   X     NV4   PIX   PIX   SYS   SYS   PIX   PIX   32-47,96-111  2              N/A
GPU5    SYS   SYS   SYS   SYS   NV4   X     PIX   PIX   SYS   SYS   PIX   PIX   32-47,96-111  2              N/A
GPU6    SYS   SYS   SYS   SYS   PIX   PIX   X     NV4   SYS   SYS   PIX   PIX   32-47,96-111  2              N/A
GPU7    SYS   SYS   SYS   SYS   PIX   PIX   NV4   X     SYS   SYS   PIX   PIX   32-47,96-111  2              N/A
NIC0    PIX   PIX   PIX   PIX   SYS   SYS   SYS   SYS   X     PIX   SYS   SYS
NIC1    PIX   PIX   PIX   PIX   SYS   SYS   SYS   SYS   PIX   X     SYS   SYS
NIC2    SYS   SYS   SYS   SYS   PIX   PIX   PIX   PIX   SYS   SYS   X     PIX
NIC3    SYS   SYS   SYS   SYS   PIX   PIX   PIX   PIX   SYS   SYS   PIX   X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
ulimit soft: 1048576