[Bug] deepseek v2.5 w4a16 can not run #2362

Open
5 tasks done
cxmt-ai-tc opened this issue Dec 5, 2024 · 6 comments

@cxmt-ai-tc

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

I was using the model from https://huggingface.co/nm-testing/DeepSeek-V2.5-W4A16, and its config.json looks like this:
{
"_name_or_path": "/home/dsikka/.cache/huggingface/hub/models--deepseek-ai--DeepSeek-V2.5/snapshots/24b08cb750e0c2757de112d2e16327cb21ed4833",
"architectures": [
"DeepseekV2ForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"auto_map": {
"AutoConfig": "configuration_deepseek.DeepseekV2Config",
"AutoModel": "modeling_deepseek.DeepseekV2Model",
"AutoModelForCausalLM": "modeling_deepseek.DeepseekV2ForCausalLM"
},
"aux_loss_alpha": 0.001,
"bos_token_id": 100000,
"eos_token_id": 100001,
"ep_size": 1,
"first_k_dense_replace": 1,
"hidden_act": "silu",
"hidden_size": 5120,
"initializer_range": 0.02,
"intermediate_size": 12288,
"kv_lora_rank": 512,
"max_position_embeddings": 163840,
"model_type": "deepseek_v2",
"moe_intermediate_size": 1536,
"moe_layer_freq": 1,
"n_group": 8,
"n_routed_experts": 160,
"n_shared_experts": 2,
"norm_topk_prob": false,
"num_attention_heads": 128,
"num_experts_per_tok": 6,
"num_hidden_layers": 60,
"num_key_value_heads": 128,
"pretraining_tp": 1,
"q_lora_rank": 1536,
"qk_nope_head_dim": 128,
"qk_rope_head_dim": 64,
"rms_norm_eps": 1e-06,
"rope_scaling": {
"beta_fast": 32,
"beta_slow": 1,
"factor": 40,
"mscale": 1.0,
"mscale_all_dim": 1.0,
"original_max_position_embeddings": 4096,
"type": "yarn"
},
"rope_theta": 10000,
"routed_scaling_factor": 16.0,
"scoring_func": "softmax",
"seq_aux": true,
"tie_word_embeddings": false,
"topk_group": 3,
"topk_method": "group_limited_greedy",
"torch_dtype": "bfloat16",
"transformers_version": "4.44.2",
"use_cache": true,
"v_head_dim": 128,
"vocab_size": 102400,
"quantization_config": {
"config_groups": {
"group_0": {
"input_activations": null,
"output_activations": null,
"targets": [
"Linear"
],
"weights": {
"actorder": null,
"block_structure": null,
"dynamic": false,
"group_size": null,
"num_bits": 4,
"observer": "minmax",
"observer_kwargs": {},
"strategy": "channel",
"symmetric": true,
"type": "int"
}
}
},
"format": "pack-quantized",
"global_compression_ratio": 2.265805157986176,
"ignore": [
"lm_head"
],
"kv_cache_scheme": null,
"quant_method": "compressed-tensors",
"quantization_status": "compressed",
"sparsity_config": {
"format": "dense",
"global_sparsity": 0.21918901165186397,
"registry_requires_subclass": false,
"sparsity_structure": "unstructured"
}
}
}
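For anyone skimming the config above, the W4A16 label can be read off `quantization_config` directly: the weights are 4-bit integers and `input_activations` is null, so activations stay in the checkpoint's 16-bit dtype. A minimal sketch, assuming only the fields shown above; `describe_quant_scheme` and `DTYPE_BITS` are hypothetical names for illustration, not part of sglang or compressed-tensors:

```python
import json

# Trimmed copy of the fields from the config.json above that matter here.
CONFIG_TEXT = """
{
  "torch_dtype": "bfloat16",
  "quantization_config": {
    "config_groups": {
      "group_0": {
        "input_activations": null,
        "targets": ["Linear"],
        "weights": {"num_bits": 4, "strategy": "channel", "symmetric": true, "type": "int"}
      }
    },
    "format": "pack-quantized",
    "ignore": ["lm_head"],
    "quant_method": "compressed-tensors"
  }
}
"""

# Bit widths of common torch dtype names (assumption: only these are needed here).
DTYPE_BITS = {"bfloat16": 16, "float16": 16, "float32": 32}


def describe_quant_scheme(config: dict) -> list[str]:
    """Return one 'W<bits>A<bits>' summary line per config group (hypothetical helper)."""
    quant = config["quantization_config"]
    act_default = DTYPE_BITS[config["torch_dtype"]]
    lines = []
    for name, group in sorted(quant["config_groups"].items()):
        w_bits = group["weights"]["num_bits"]
        acts = group.get("input_activations")
        # Null activations are left unquantized, i.e. kept in the model dtype.
        a_bits = acts["num_bits"] if acts else act_default
        lines.append(
            f"{name}: W{w_bits}A{a_bits} via {quant['quant_method']} "
            f"({group['weights']['strategy']} strategy, targets={group['targets']})"
        )
    return lines


if __name__ == "__main__":
    for line in describe_quant_scheme(json.loads(CONFIG_TEXT)):
        print(line)
```

Running it on the trimmed config prints a single line for `group_0` that names the scheme, the quantization method, and the targeted layer type.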
I ran this model with the following command:
python3 -m sglang.launch_server --model-path /nas_data/userdata/tc/models/deepseek/deepseek-v2.5-w4a16/ --port 30000 --host 0.0.0.0 --tp 4 --trust-remote-code

and got the following error:
root@s0pgpuap12:/workspace/sglang# python3 -m sglang.launch_server --model-path /nas_data/userdata/tc/models/deepseek/deepseek-v2.5-w4a16/ --port 30000 --host 0.0.0.0 --tp 4 --trust-remote-code
[2024-12-05 03:45:04] server_args=ServerArgs(model_path='/nas_data/userdata/tc/models/deepseek/deepseek-v2.5-w4a16/', tokenizer_path='/nas_data/userdata/tc/models/deepseek/deepseek-v2.5-w4a16/', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=True, dtype='auto', kv_cache_dtype='auto', quantization=None, context_length=None, device='cuda', served_model_name='/nas_data/userdata/tc/models/deepseek/deepseek-v2.5-w4a16/', chat_template=None, is_embedding=False, revision=None, host='0.0.0.0', port=30000, mem_fraction_static=0.85, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, cpu_offload_gb=0, tp_size=4, stream_interval=1, random_seed=510750847, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_pth='SGLang_storage', enable_cache_report=False, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, lora_paths=None, max_loras_per_batch=8, attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='outlines', disable_radix_cache=False, disable_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, 
num_continuous_decode_steps=1, delete_ckpt_after_loading=False)
[2024-12-05 03:45:11 TP3] MLA optimization is turned on. Use triton backend.
[2024-12-05 03:45:11 TP3] Init torch distributed begin.
[2024-12-05 03:45:12 TP2] MLA optimization is turned on. Use triton backend.
[2024-12-05 03:45:12 TP2] Init torch distributed begin.
[2024-12-05 03:45:12 TP1] MLA optimization is turned on. Use triton backend.
[2024-12-05 03:45:12 TP1] Init torch distributed begin.
[2024-12-05 03:45:12 TP0] MLA optimization is turned on. Use triton backend.
[2024-12-05 03:45:12 TP0] Init torch distributed begin.
INFO 12-05 03:45:13 utils.py:1008] Found nccl from library libnccl.so.2
INFO 12-05 03:45:13 utils.py:1008] Found nccl from library libnccl.so.2
INFO 12-05 03:45:13 utils.py:1008] Found nccl from library libnccl.so.2
INFO 12-05 03:45:13 utils.py:1008] Found nccl from library libnccl.so.2
WARNING 12-05 03:45:13 custom_all_reduce.py:132] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 12-05 03:45:13 custom_all_reduce.py:132] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 12-05 03:45:13 custom_all_reduce.py:132] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 12-05 03:45:13 custom_all_reduce.py:132] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
[2024-12-05 03:45:13 TP0] Load weight begin. avail mem=43.69 GB
[2024-12-05 03:45:13 TP1] Load weight begin. avail mem=43.69 GB
[2024-12-05 03:45:13 TP2] Load weight begin. avail mem=43.69 GB
[2024-12-05 03:45:13 TP3] Load weight begin. avail mem=43.69 GB
[2024-12-05 03:45:14 TP1] lm_eval is not installed, GPTQ may not be usable
[2024-12-05 03:45:14 TP2] lm_eval is not installed, GPTQ may not be usable
[2024-12-05 03:45:14 TP0] lm_eval is not installed, GPTQ may not be usable
[2024-12-05 03:45:14 TP3] lm_eval is not installed, GPTQ may not be usable
INFO 12-05 03:45:14 compressed_tensors_wNa16.py:83] Using MarlinLinearKernel for CompressedTensorsWNA16
INFO 12-05 03:45:14 compressed_tensors_wNa16.py:83] Using MarlinLinearKernel for CompressedTensorsWNA16
INFO 12-05 03:45:14 compressed_tensors_wNa16.py:83] Using MarlinLinearKernel for CompressedTensorsWNA16
INFO 12-05 03:45:14 compressed_tensors_wNa16.py:83] Using MarlinLinearKernel for CompressedTensorsWNA16
Cache shape torch.Size([163840, 64])
Cache shape torch.Size([163840, 64])
Cache shape torch.Size([163840, 64])
Cache shape torch.Size([163840, 64])
[2024-12-05 03:45:14 TP2] Scheduler hit an exception: Traceback (most recent call last):
File "/workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1489, in run_scheduler_process
scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
File "/workspace/sglang/python/sglang/srt/managers/scheduler.py", line 194, in __init__
self.tp_worker = TpWorkerClass(
File "/workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 61, in __init__
self.worker = TpModelWorker(server_args, gpu_id, tp_rank, dp_rank, nccl_port)
File "/workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 62, in __init__
self.model_runner = ModelRunner(
File "/workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 152, in __init__
self.load_model()
File "/workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 246, in load_model
self.model = get_model(
File "/workspace/sglang/python/sglang/srt/model_loader/__init__.py", line 22, in get_model
return loader.load_model(
File "/workspace/sglang/python/sglang/srt/model_loader/loader.py", line 357, in load_model
model = _initialize_model(
File "/workspace/sglang/python/sglang/srt/model_loader/loader.py", line 138, in _initialize_model
return model_class(
File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 734, in __init__
self.model = DeepseekV2Model(config, quant_config)
File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 695, in __init__
[
File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 696, in <listcomp>
DeepseekV2DecoderLayer(
File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 626, in __init__
self.mlp = DeepseekV2MoE(config=config, quant_config=quant_config)
File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 115, in __init__
self.experts = FusedMoE(
File "/workspace/sglang/python/sglang/srt/layers/fused_moe_triton/layer.py", line 227, in __init__
assert self.quant_method is not None
AssertionError

[2024-12-05 03:45:14 TP1] Scheduler hit an exception: Traceback (most recent call last):
File "/workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1489, in run_scheduler_process
scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
File "/workspace/sglang/python/sglang/srt/managers/scheduler.py", line 194, in __init__
self.tp_worker = TpWorkerClass(
File "/workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 61, in __init__
self.worker = TpModelWorker(server_args, gpu_id, tp_rank, dp_rank, nccl_port)
File "/workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 62, in __init__
self.model_runner = ModelRunner(
File "/workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 152, in __init__
self.load_model()
File "/workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 246, in load_model
self.model = get_model(
File "/workspace/sglang/python/sglang/srt/model_loader/__init__.py", line 22, in get_model
return loader.load_model(
File "/workspace/sglang/python/sglang/srt/model_loader/loader.py", line 357, in load_model
model = _initialize_model(
File "/workspace/sglang/python/sglang/srt/model_loader/loader.py", line 138, in _initialize_model
return model_class(
File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 734, in __init__
self.model = DeepseekV2Model(config, quant_config)
File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 695, in __init__
[
File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 696, in <listcomp>
DeepseekV2DecoderLayer(
File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 626, in __init__
self.mlp = DeepseekV2MoE(config=config, quant_config=quant_config)
File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 115, in __init__
self.experts = FusedMoE(
File "/workspace/sglang/python/sglang/srt/layers/fused_moe_triton/layer.py", line 227, in __init__
assert self.quant_method is not None
AssertionError

[2024-12-05 03:45:14 TP0] Scheduler hit an exception: Traceback (most recent call last):
File "/workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1489, in run_scheduler_process
scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
File "/workspace/sglang/python/sglang/srt/managers/scheduler.py", line 194, in __init__
self.tp_worker = TpWorkerClass(
File "/workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 61, in __init__
self.worker = TpModelWorker(server_args, gpu_id, tp_rank, dp_rank, nccl_port)
File "/workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 62, in __init__
self.model_runner = ModelRunner(
File "/workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 152, in __init__
self.load_model()
File "/workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 246, in load_model
self.model = get_model(
File "/workspace/sglang/python/sglang/srt/model_loader/__init__.py", line 22, in get_model
return loader.load_model(
File "/workspace/sglang/python/sglang/srt/model_loader/loader.py", line 357, in load_model
model = _initialize_model(
File "/workspace/sglang/python/sglang/srt/model_loader/loader.py", line 138, in _initialize_model
return model_class(
File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 734, in __init__
self.model = DeepseekV2Model(config, quant_config)
File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 695, in __init__
[
File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 696, in <listcomp>
DeepseekV2DecoderLayer(
File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 626, in __init__
self.mlp = DeepseekV2MoE(config=config, quant_config=quant_config)
File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 115, in __init__
self.experts = FusedMoE(
File "/workspace/sglang/python/sglang/srt/layers/fused_moe_triton/layer.py", line 227, in __init__
assert self.quant_method is not None
AssertionError

[2024-12-05 03:45:14 TP3] Scheduler hit an exception: Traceback (most recent call last):
File "/workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1489, in run_scheduler_process
scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
File "/workspace/sglang/python/sglang/srt/managers/scheduler.py", line 194, in __init__
self.tp_worker = TpWorkerClass(
File "/workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 61, in __init__
self.worker = TpModelWorker(server_args, gpu_id, tp_rank, dp_rank, nccl_port)
File "/workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 62, in __init__
self.model_runner = ModelRunner(
File "/workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 152, in __init__
self.load_model()
File "/workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 246, in load_model
self.model = get_model(
File "/workspace/sglang/python/sglang/srt/model_loader/__init__.py", line 22, in get_model
return loader.load_model(
File "/workspace/sglang/python/sglang/srt/model_loader/loader.py", line 357, in load_model
model = _initialize_model(
File "/workspace/sglang/python/sglang/srt/model_loader/loader.py", line 138, in _initialize_model
return model_class(
File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 734, in __init__
self.model = DeepseekV2Model(config, quant_config)
File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 695, in __init__
[
File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 696, in <listcomp>
DeepseekV2DecoderLayer(
File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 626, in __init__
self.mlp = DeepseekV2MoE(config=config, quant_config=quant_config)
File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 115, in __init__
self.experts = FusedMoE(
File "/workspace/sglang/python/sglang/srt/layers/fused_moe_triton/layer.py", line 227, in __init__
assert self.quant_method is not None
AssertionError
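
The assertion at the bottom of each traceback fires when the MoE layer asks the quantization config for a method and gets nothing back. The sketch below is a toy model of that pattern, not sglang's or vLLM's actual code: every class name in it (`ToyQuantConfig`, `ToyLinear`, `ToyFusedMoE`) is hypothetical, and the assumption that the config only recognises linear layers is made here purely to reproduce the same kind of `AssertionError`.

```python
from typing import Optional


class ToyLinear:
    """Stand-in for a quantizable linear layer (hypothetical)."""


class ToyQuantConfig:
    """Toy quantization config that only knows how to handle linear layers."""

    def get_quant_method(self, layer: object) -> Optional[str]:
        if isinstance(layer, ToyLinear):
            return "wNa16-linear-kernel"
        # Any layer type the config does not recognise gets no method at all.
        return None


class ToyFusedMoE:
    """Stand-in for a fused MoE layer that requires a quant method (hypothetical)."""

    def __init__(self, quant_config: Optional[ToyQuantConfig] = None) -> None:
        if quant_config is None:
            self.quant_method: Optional[str] = "unquantized"
        else:
            self.quant_method = quant_config.get_quant_method(self)
        # Same shape of check as the failing line in the traceback above.
        assert self.quant_method is not None


def try_build_moe(quant_config: Optional[ToyQuantConfig]) -> str:
    """Build the toy MoE layer and report which path was taken."""
    try:
        return ToyFusedMoE(quant_config).quant_method or ""
    except AssertionError:
        return "AssertionError"


if __name__ == "__main__":
    print(try_build_moe(None))              # unquantized path works
    print(try_build_moe(ToyQuantConfig()))  # quantized path has no MoE method
```

In this toy setup the linear layers get a kernel (consistent with the `Using MarlinLinearKernel` lines in the log), while the MoE layer is the first one with no matching method. Whether that is the real cause here would need to be confirmed against the sglang source.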

Reproduction

python3 -m sglang.launch_server --model-path /nas_data/userdata/tc/models/deepseek/deepseek-v2.5-w4a16/ --port 30000 --host 0.0.0.0 --tp 4 --trust-remote-code

Environment

Python: 3.10.15 (main, Sep 7 2024, 18:35:33) [GCC 9.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA A40
GPU 0,1,2,3,4,5,6,7 Compute Capability: 8.6
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.1, V12.1.105
CUDA Driver Version: 550.54.14
PyTorch: 2.4.0+cu121
sglang: 0.4.0
flashinfer: 0.1.6+cu121torch2.4
triton: 3.0.0
transformers: 4.46.2
torchao: 0.6.1
numpy: 1.26.4
aiohttp: 3.11.2
fastapi: 0.115.5
hf_transfer: 0.1.8
huggingface_hub: 0.26.2
interegular: 0.3.3
modelscope: 1.20.1
orjson: 3.10.11
packaging: 24.2
psutil: 6.1.0
pydantic: 2.9.2
multipart: 0.0.17
zmq: 26.2.0
uvicorn: 0.32.0
uvloop: 0.21.0
vllm: 0.6.3.post1
openai: 1.54.4
anthropic: 0.39.0
decord: 0.6.0
NVIDIA Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV4 PIX PIX SYS SYS SYS SYS PIX PIX SYS SYS 0-15,64-79 0 N/A
GPU1 NV4 X PIX PIX SYS SYS SYS SYS PIX PIX SYS SYS 0-15,64-79 0 N/A
GPU2 PIX PIX X NV4 SYS SYS SYS SYS PIX PIX SYS SYS 0-15,64-79 0 N/A
GPU3 PIX PIX NV4 X SYS SYS SYS SYS PIX PIX SYS SYS 0-15,64-79 0 N/A
GPU4 SYS SYS SYS SYS X NV4 PIX PIX SYS SYS PIX PIX 32-47,96-111 2 N/A
GPU5 SYS SYS SYS SYS NV4 X PIX PIX SYS SYS PIX PIX 32-47,96-111 2 N/A
GPU6 SYS SYS SYS SYS PIX PIX X NV4 SYS SYS PIX PIX 32-47,96-111 2 N/A
GPU7 SYS SYS SYS SYS PIX PIX NV4 X SYS SYS PIX PIX 32-47,96-111 2 N/A
NIC0 PIX PIX PIX PIX SYS SYS SYS SYS X PIX SYS SYS
NIC1 PIX PIX PIX PIX SYS SYS SYS SYS PIX X SYS SYS
NIC2 SYS SYS SYS SYS PIX PIX PIX PIX SYS SYS X PIX
NIC3 SYS SYS SYS SYS PIX PIX PIX PIX SYS SYS PIX X

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

NIC Legend:

NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3

ulimit soft: 1048576

@zhyncs
Member

zhyncs commented Dec 5, 2024

May you try the latest main? ref #2364

@cxmt-ai-tc
Author

May you try the latest main? ref #2364

same error

@ispobock
Collaborator

ispobock commented Dec 7, 2024

@cxmt-ai-tc This is a compressed tensors W4A16 quantized DeepSeek model. We plan to support it later but not with high priority, since AWQ W4A16 is already supported. Maybe you can try AWQ first.
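
The point above amounts to a pre-flight check on `quant_method` before launching the server. A minimal sketch follows; the `SUPPORTED_FOR_DEEPSEEK_MOE` set is an assumption taken only from this comment (AWQ is supported, compressed-tensors is not yet) and is not an authoritative support list from sglang, and `check_quant_method` is a hypothetical helper.

```python
# Assumption from this thread only: AWQ checkpoints of DeepSeek MoE models are
# supported, compressed-tensors ones are not yet. Not an official support matrix.
SUPPORTED_FOR_DEEPSEEK_MOE = {"awq"}


def check_quant_method(config: dict) -> tuple[bool, str]:
    """Return (ok, message) for a parsed config.json dict (hypothetical helper)."""
    quant = config.get("quantization_config")
    if quant is None:
        return True, "unquantized checkpoint"
    method = quant.get("quant_method", "unknown")
    if method in SUPPORTED_FOR_DEEPSEEK_MOE:
        return True, f"{method} is expected to load"
    return False, f"{method} is not expected to load yet; consider an AWQ checkpoint"


if __name__ == "__main__":
    # The checkpoint from this issue declares quant_method "compressed-tensors".
    ok, message = check_quant_method(
        {"quantization_config": {"quant_method": "compressed-tensors"}}
    )
    print(ok, message)
```

Keeping the supported set as an explicit constant makes it easy to update once compressed-tensors MoE support lands.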

@cxmt-ai-tc
Author

@cxmt-ai-tc This is a compressed tensors W4A16 quantized DeepSeek model. We plan to support it later but not with high priority, since AWQ W4A16 is already supported. Maybe you can try AWQ first.

I tried an AWQ model (https://huggingface.co/casperhansen/deepseek-coder-v2-instruct-awq) and got the following error on A40:

CUDA_VISIBLE_DEVICES=2,3,6,7 python3 -m sglang.launch_server --model-path /nas_data/userdata/tc/models/deepseek/deepseek-coder-v2-instruct-awq/ --port 50800 --host 0.0.0.0 --tp 4 --trust-remote-code --context-length 512 --max-running-requests 100 --mem-fraction-static 0.8 --disable-cuda-graph
[2024-12-08 23:43:01] server_args=ServerArgs(model_path='/nas_data/userdata/tc/models/deepseek/deepseek-coder-v2-instruct-awq/', tokenizer_path='/nas_data/userdata/tc/models/deepseek/deepseek-coder-v2-instruct-awq/', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=True, dtype='auto', kv_cache_dtype='auto', quantization=None, context_length=512, device='cuda', served_model_name='/nas_data/userdata/tc/models/deepseek/deepseek-coder-v2-instruct-awq/', chat_template=None, is_embedding=False, revision=None, host='0.0.0.0', port=50800, mem_fraction_static=0.8, max_running_requests=100, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, cpu_offload_gb=0, tp_size=4, stream_interval=1, random_seed=232609575, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_pth='SGLang_storage', enable_cache_report=False, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, lora_paths=None, max_loras_per_batch=8, attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='outlines', disable_radix_cache=False, disable_jump_forward=False, disable_cuda_graph=True, disable_cuda_graph_padding=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_ep_moe=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, torchao_config='', enable_nan_detection=False, 
enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False)
INFO 12-08 23:43:01 awq_marlin.py:97] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
INFO 12-08 23:43:08 awq_marlin.py:97] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
INFO 12-08 23:43:08 awq_marlin.py:97] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
INFO 12-08 23:43:08 awq_marlin.py:97] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
INFO 12-08 23:43:08 awq_marlin.py:97] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
INFO 12-08 23:43:08 awq_marlin.py:97] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
[2024-12-08 23:43:08 TP2] MLA optimization is turned on. Use triton backend.
[2024-12-08 23:43:08 TP2] Init torch distributed begin.
INFO 12-08 23:43:08 awq_marlin.py:97] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
[2024-12-08 23:43:08 TP3] MLA optimization is turned on. Use triton backend.
[2024-12-08 23:43:08 TP3] Init torch distributed begin.
INFO 12-08 23:43:08 awq_marlin.py:97] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
[2024-12-08 23:43:08 TP1] MLA optimization is turned on. Use triton backend.
[2024-12-08 23:43:08 TP1] Init torch distributed begin.
INFO 12-08 23:43:08 awq_marlin.py:97] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
[2024-12-08 23:43:08 TP0] MLA optimization is turned on. Use triton backend.
[2024-12-08 23:43:08 TP0] Init torch distributed begin.
INFO 12-08 23:43:09 utils.py:1008] Found nccl from library libnccl.so.2
INFO 12-08 23:43:09 utils.py:1008] Found nccl from library libnccl.so.2
INFO 12-08 23:43:09 utils.py:1008] Found nccl from library libnccl.so.2
INFO 12-08 23:43:09 utils.py:1008] Found nccl from library libnccl.so.2
WARNING 12-08 23:43:10 custom_all_reduce.py:132] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 12-08 23:43:10 custom_all_reduce.py:132] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 12-08 23:43:10 custom_all_reduce.py:132] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 12-08 23:43:10 custom_all_reduce.py:132] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
[2024-12-08 23:43:10 TP1] Load weight begin. avail mem=44.06 GB
[2024-12-08 23:43:10 TP0] Load weight begin. avail mem=44.06 GB
[2024-12-08 23:43:10 TP3] Load weight begin. avail mem=44.06 GB
[2024-12-08 23:43:10 TP2] Load weight begin. avail mem=44.06 GB
[2024-12-08 23:43:11 TP3] lm_eval is not installed, GPTQ may not be usable
[2024-12-08 23:43:11 TP1] lm_eval is not installed, GPTQ may not be usable
[2024-12-08 23:43:11 TP0] lm_eval is not installed, GPTQ may not be usable
[2024-12-08 23:43:11 TP2] lm_eval is not installed, GPTQ may not be usable
Cache shape torch.Size([163840, 64])
Cache shape torch.Size([163840, 64])
Cache shape torch.Size([163840, 64])
Cache shape torch.Size([163840, 64])
Loading safetensors checkpoint shards: 0% Completed | 0/26 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 4% Completed | 1/26 [00:00<00:17, 1.40it/s]
Loading safetensors checkpoint shards: 8% Completed | 2/26 [00:01<00:21, 1.12it/s]
Loading safetensors checkpoint shards: 12% Completed | 3/26 [00:02<00:20, 1.11it/s]
Loading safetensors checkpoint shards: 15% Completed | 4/26 [00:03<00:19, 1.11it/s]
Loading safetensors checkpoint shards: 19% Completed | 5/26 [00:04<00:19, 1.06it/s]
Loading safetensors checkpoint shards: 23% Completed | 6/26 [00:05<00:19, 1.02it/s]
Loading safetensors checkpoint shards: 27% Completed | 7/26 [00:06<00:19, 1.00s/it]
Loading safetensors checkpoint shards: 31% Completed | 8/26 [00:07<00:17, 1.01it/s]
Loading safetensors checkpoint shards: 35% Completed | 9/26 [00:08<00:17, 1.01s/it]
Loading safetensors checkpoint shards: 38% Completed | 10/26 [00:09<00:15, 1.02it/s]
Loading safetensors checkpoint shards: 42% Completed | 11/26 [00:10<00:14, 1.04it/s]
Loading safetensors checkpoint shards: 46% Completed | 12/26 [00:11<00:13, 1.06it/s]
Loading safetensors checkpoint shards: 50% Completed | 13/26 [00:12<00:12, 1.06it/s]
Loading safetensors checkpoint shards: 54% Completed | 14/26 [00:13<00:11, 1.07it/s]
Loading safetensors checkpoint shards: 58% Completed | 15/26 [00:14<00:10, 1.07it/s]
Loading safetensors checkpoint shards: 62% Completed | 16/26 [00:15<00:09, 1.06it/s]
Loading safetensors checkpoint shards: 65% Completed | 17/26 [00:16<00:08, 1.07it/s]
Loading safetensors checkpoint shards: 69% Completed | 18/26 [00:17<00:07, 1.07it/s]
Loading safetensors checkpoint shards: 73% Completed | 19/26 [00:17<00:06, 1.06it/s]
Loading safetensors checkpoint shards: 77% Completed | 20/26 [00:19<00:07, 1.18s/it]
Loading safetensors checkpoint shards: 81% Completed | 21/26 [00:20<00:05, 1.14s/it]
Loading safetensors checkpoint shards: 85% Completed | 22/26 [00:21<00:04, 1.13s/it]
Loading safetensors checkpoint shards: 88% Completed | 23/26 [00:22<00:03, 1.13s/it]
Loading safetensors checkpoint shards: 92% Completed | 24/26 [00:24<00:02, 1.12s/it]
Loading safetensors checkpoint shards: 96% Completed | 25/26 [00:25<00:01, 1.08s/it]
Loading safetensors checkpoint shards: 100% Completed | 26/26 [00:25<00:00, 1.06it/s]
Loading safetensors checkpoint shards: 100% Completed | 26/26 [00:25<00:00, 1.01it/s]

[2024-12-08 23:43:51 TP1] Load weight end. type=DeepseekV2ForCausalLM, dtype=torch.float16, avail mem=13.00 GB
[2024-12-08 23:43:51 TP0] Load weight end. type=DeepseekV2ForCausalLM, dtype=torch.float16, avail mem=13.00 GB
[2024-12-08 23:43:51 TP2] Load weight end. type=DeepseekV2ForCausalLM, dtype=torch.float16, avail mem=13.00 GB
[2024-12-08 23:43:51 TP3] Load weight end. type=DeepseekV2ForCausalLM, dtype=torch.float16, avail mem=13.00 GB
[2024-12-08 23:43:51 TP0] Memory pool end. avail mem=8.85 GB
[2024-12-08 23:43:51 TP1] Memory pool end. avail mem=8.85 GB
[2024-12-08 23:43:51 TP2] Memory pool end. avail mem=8.85 GB
[2024-12-08 23:43:51 TP3] Memory pool end. avail mem=8.85 GB
[2024-12-08 23:43:51 TP2] The following error message 'operation scheduled before its operands' can be ignored.
[2024-12-08 23:43:51 TP1] The following error message 'operation scheduled before its operands' can be ignored.
[2024-12-08 23:43:51 TP0] The following error message 'operation scheduled before its operands' can be ignored.
[2024-12-08 23:43:51 TP3] The following error message 'operation scheduled before its operands' can be ignored.
[2024-12-08 23:43:52 TP0] max_total_num_tokens=64926, max_prefill_tokens=16384, max_running_requests=100, context_len=512
[2024-12-08 23:43:52 TP1] max_total_num_tokens=64926, max_prefill_tokens=16384, max_running_requests=100, context_len=512
[2024-12-08 23:43:52 TP2] max_total_num_tokens=64926, max_prefill_tokens=16384, max_running_requests=100, context_len=512
[2024-12-08 23:43:52 TP3] max_total_num_tokens=64926, max_prefill_tokens=16384, max_running_requests=100, context_len=512
[2024-12-08 23:43:52] INFO: Started server process [25439]
[2024-12-08 23:43:52] INFO: Waiting for application startup.
[2024-12-08 23:43:52] INFO: Application startup complete.
[2024-12-08 23:43:52] INFO: Uvicorn running on http://0.0.0.0:50800 (Press CTRL+C to quit)
[2024-12-08 23:43:53] INFO: 127.0.0.1:40164 - "GET /get_model_info HTTP/1.1" 200 OK
[2024-12-08 23:43:53 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
WARNING 12-08 23:43:57 fused_moe.py:323] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=160,N=10240,device_name=NVIDIA_A40.json
WARNING 12-08 23:43:57 fused_moe.py:323] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=160,N=10240,device_name=NVIDIA_A40.json
[2024-12-08 23:43:57 TP0] TpModelWorkerClient hit an exception: Traceback (most recent call last):
File "/workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 103, in forward_thread_func
self.forward_thread_func_()
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 134, in forward_thread_func_
logits_output, next_token_ids = self.worker.forward_batch_generation(
File "/workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 154, in forward_batch_generation
logits_output = self.model_runner.forward(forward_batch)
File "/workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 686, in forward
return self.forward_extend(forward_batch)
File "/workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 655, in forward_extend
return self.model.forward(
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 823, in forward
hidden_states = self.model(input_ids, positions, forward_batch)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 784, in forward
hidden_states, residual = layer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 739, in forward
hidden_states = self.mlp(hidden_states)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 151, in forward
self.experts(hidden_states=hidden_states, router_logits=router_logits)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/sglang/python/sglang/srt/layers/fused_moe_triton/layer.py", line 555, in forward
final_hidden_states = self.quant_method.apply(
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/awq_marlin.py", line 452, in apply
return fused_marlin_moe(
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/fused_marlin_moe.py", line 219, in fused_marlin_moe
sorted_token_ids, _, _ = moe_align_block_size(topk_ids, block_size_m, E)
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 228, in moe_align_block_size
ops.moe_align_block_size(topk_ids, num_experts, block_size, sorted_ids,
File "/usr/local/lib/python3.10/dist-packages/vllm/_custom_ops.py", line 45, in wrapper
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/_custom_ops.py", line 844, in moe_align_block_size
torch.ops._C.moe_align_block_size(topk_ids, num_experts, block_size,
File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 1061, in __call__
return self._op(*args, **(kwargs or {}))
RuntimeError: CUDA error: invalid argument
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

WARNING 12-08 23:43:57 fused_moe.py:323] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=160,N=10240,device_name=NVIDIA_A40.json
WARNING 12-08 23:43:57 fused_moe.py:323] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=160,N=10240,device_name=NVIDIA_A40.json
Killed
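
One hypothesis worth checking (not confirmed anywhere in this thread): the traceback ends inside `moe_align_block_size`, and the warning above reports `E=160` experts on an A40. If that kernel requests dynamic shared memory that grows roughly with the square of the expert count, the request could exceed the per-block limit on this GPU, which would surface as `CUDA error: invalid argument`. The sketch below parses the warning line and does the arithmetic. The size formula and the helper names are assumptions made for illustration, not taken from the vLLM source; the 99 KB figure is the per-block opt-in limit for compute capability 8.6.

```python
import re

# Per-block opt-in shared memory limit for compute capability 8.6 (A40): 99 KB.
SM86_MAX_SHARED_BYTES = 99 * 1024


def parse_moe_config_name(line):
    """Extract E, N and the device name from the 'Config file not found' warning."""
    match = re.search(r"E=(\d+),N=(\d+),device_name=([^./\s]+)\.json", line)
    if match is None:
        return None
    return {
        "E": int(match.group(1)),
        "N": int(match.group(2)),
        "device": match.group(3),
    }


def estimated_align_shared_bytes(num_experts, int_size=4):
    """ASSUMED formula: an (E+1) x E table of int32 counters plus E+1 cumulative sums.

    This is a guess at how the kernel sizes its shared memory; verify it
    against the vLLM version that is actually installed.
    """
    return ((num_experts + 1) * num_experts + (num_experts + 1)) * int_size


if __name__ == "__main__":
    warning = (
        "Config file not found at /usr/local/lib/python3.10/dist-packages/vllm/"
        "model_executor/layers/fused_moe/configs/"
        "E=160,N=10240,device_name=NVIDIA_A40.json"
    )
    info = parse_moe_config_name(warning)
    needed = estimated_align_shared_bytes(info["E"])
    print(f"experts={info['E']} estimated_bytes={needed} limit={SM86_MAX_SHARED_BYTES}")
    print("over limit" if needed > SM86_MAX_SHARED_BYTES else "within limit")
```

Under that assumed formula, 160 experts would need 103684 bytes against a 101376-byte limit. If the formula matches the installed kernel, reinstalling vllm alone would not be expected to help on this GPU.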

@ispobock
Collaborator

@cxmt-ai-tc Could you try the latest sglang and vllm version? It seems vllm is not correctly installed.

@cxmt-ai-tc
Author

> @cxmt-ai-tc Could you try the latest sglang and vllm version? It seems vllm is not correctly installed.

I am using a conda Python 3.10 environment with vllm 0.6.4.post1 installed. I get the same error in vLLM server mode.
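
To make the installed versions easy to compare across environments, a small check like the following could be run inside the same conda environment. It is a minimal sketch that uses only the standard library; the helper name `installed_versions` and the default package list are chosen for illustration.

```python
from importlib import metadata


def installed_versions(packages=("sglang", "vllm", "torch")):
    """Return a {package: version-or-None} mapping for the active environment."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in this interpreter's environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Posting this output together with the GPU model helps others tell an installation problem apart from a kernel limitation.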
