DeepSeek-V3 INT4 weight-only inference outputs garbage words with TP 8 on NVIDIA H20 GPU #2683

Open
handoku opened this issue Jan 13, 2025 · 10 comments
Labels
Investigating · Low Precision (Issue about lower bit quantization, including int8, int4, fp8) · triaged (Issue has been triaged by maintainers)

Comments

@handoku

handoku commented Jan 13, 2025

I built and installed TRT-LLM from the deepseek branch. Following the doc, I produced an INT4 weight-only engine.
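For context, the conversion and build were along these lines (a sketch only; the checkpoint output path and flag names follow the generic TensorRT-LLM weight-only recipe and may differ slightly from the deepseek_v3 README):

# convert the BF16 HF checkpoint to an INT4 weight-only TRT-LLM checkpoint, TP=8
python3 convert_checkpoint.py --model_dir /data/workspace/DeepSeek-V3-bf16 \
        --output_dir /data/workspace/trtllm_ckpt_int4_tp8 \
        --dtype bfloat16 --tp_size 8 \
        --use_weight_only --weight_only_precision int4
# build the TP-8 engines used in the run below (batch size 16, max sequence length 4096)
trtllm-build --checkpoint_dir /data/workspace/trtllm_ckpt_int4_tp8 \
        --output_dir /data/workspace/trtllm_engines/tp8-seq4096-bs16 \
        --max_batch_size 16 --max_seq_len 4096 --gemm_plugin bfloat16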

However, the example run outputs garbage words:

root@pod-test:/data/workspace/examples/deepseek_v3# mpirun --allow-run-as-root -n 8 python3 ../run.py --input_text "Today is a nice day." \
        --max_output_len 30 \
        --tokenizer_dir /data/workspace/DeepSeek-V3-bf16  \
        --engine_dir /data/workspace/trtllm_engines/tp8-seq4096-bs16 \
        --top_p 0.95 \
        --temperature 0.3
(All 8 ranks print essentially the same startup messages; one representative copy is shown below.)
[TensorRT-LLM] TensorRT-LLM version: 0.17.0.dev2024121700
[TensorRT-LLM][INFO] Engine version 0.17.0.dev2024121700 found in the config file, assuming engine(s) built by new builder API.
[01/13/2025-15:00:38] [TRT-LLM] [I] Using C++ session
[TensorRT-LLM][INFO] MPI size: 8, MPI local size: 8, rank: 0
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 4096
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (4096) * 61
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 4095 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 46553 MiB
[TensorRT-LLM][INFO] Detecting local TP group for rank 0
[TensorRT-LLM][INFO] TP group is intra-node for rank 0
[TensorRT-LLM][INFO] Inspecting the engine to identify potential runtime issues...
[TensorRT-LLM][INFO] The profiling verbosity of the engine does not allow this analysis to proceed. Re-build the engine with 'detailed' profiling verbosity to get more diagnostics.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 2265.58 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 46523 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 633.07 KB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 2.10 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 95.22 GiB, available: 45.32 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 3131
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 64
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 40.80 GiB for max tokens in paged KV cache (200384).
[TensorRT-LLM][INFO] Enable MPI KV cache transport.
[01/13/2025-15:01:28] [TRT-LLM] [I] Load engine takes: 49.773377895355225 sec
Input [Text 0]: "<|begin▁of▁sentence|>Today is a nice day."
Output [Text 0 Beam 0]: " a
:
	 and 
, in. and​ as
, and, the a,,     a,,, and000 superimposed"
[TensorRT-LLM][INFO] Refreshed the MPI local session
@nv-guomingz
Collaborator

Hi @handoku, thanks for reporting this issue. We'll take a look first.

@Songyanfei

I encountered the same issue when using the 4-bit weight-only version of DeepSeek-V3.

@nv-guomingz added the Low Precision (Issue about lower bit quantization, including int8, int4, fp8) label on Jan 14, 2025
@github-actions bot added the triaged (Issue has been triaged by maintainers) and Investigating labels on Jan 14, 2025
@nv-guomingz
Collaborator

nv-guomingz commented Jan 14, 2025

Hi @handoku, this is a known issue for DeepSeek-V3 int4/int8 quantization. Since DeepSeek-V3 hasn't published int4/int8 metrics yet, we don't recommend quantizing DeepSeek-V3 with a non-FP8 recipe at this moment.

@handoku
Author

handoku commented Jan 15, 2025

Too bad. TRT-LLM currently doesn't support DeepSeek-V3's FP8 inference either, so we can't run inference in anything other than bf16/fp16 precision? This model is too large; we can't afford to run it in fp16. And trtllm_backend multi-node serving is not that convenient...

@Songyanfei

I came across this while following the DeepSeek-V3 documentation. Based on the description there, INT8/INT4 quantized inference appears to be supported, but in practice it is completely unusable. The memory consumption in FP16 is entirely unaffordable, which makes the whole situation quite tricky.

@nv-guomingz
Collaborator

We're going to support FP8 inference soon.

@handoku
Author

handoku commented Jan 15, 2025

I am using SGLang to serve DeepSeek-V3 for now. Although it supports FP8 and the MLA optimization, it still needs two 8-GPU nodes for inference. I was hoping TRT-LLM INT4 could save resources and improve throughput.
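For reference, the two-node SGLang launch I mean is roughly the following (the init address is a placeholder, and the flags follow SGLang's multi-node serving instructions as I understand them, so double-check against their docs):

# node 0 (10.0.0.1:5000 is a placeholder for the head node's address and port)
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --trust-remote-code \
        --tp 16 --nnodes 2 --node-rank 0 --dist-init-addr 10.0.0.1:5000
# node 1: same command with --node-rank 1
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --trust-remote-code \
        --tp 16 --nnodes 2 --node-rank 1 --dist-init-addr 10.0.0.1:5000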

@Songyanfei

Exactly! INT4 is particularly appealing since it allows running on a single node, and A100/A800 GPUs don’t support the FP8 data format. This makes INT4 a great choice, especially for MoE models.

@Harley-ZP

Harley-ZP commented Jan 16, 2025

Hi, how did you build and install the deepseek branch of TRT-LLM?
I found this approach in the docs, but it seems too complicated and I cannot reach the public network.
How did you manage it?

@handoku
Author

handoku commented Jan 16, 2025

@Harley-ZP Find a machine connected to the Internet, try pulling the Docker images from NGC, and then build the TRT-LLM wheel with this command. Otherwise, installing the dependencies can really be tricky...
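Roughly, assuming the standard build flow from the TensorRT-LLM repo (the deepseek branch may differ slightly):

# one-shot: build the release image and wheel via Docker (needs access to the NGC base images)
make -C docker release_build
# or, inside the dev container, build only the wheel and install it
python3 ./scripts/build_wheel.py --clean --trt_root /usr/local/tensorrt
pip install ./build/tensorrt_llm-*.whl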
