DeepSeek-V3 INT4 weight-only inference outputs garbage words with TP 8 on NVIDIA H20 GPU #2683

Open
handoku opened this issue Jan 13, 2025 · 10 comments
Labels
Investigating · Low Precision (Issue about lower bit quantization, including int8, int4, fp8) · triaged (Issue has been triaged by maintainers)

Comments

@handoku

handoku commented Jan 13, 2025

I built and installed TRT-LLM from the deepseek branch. Following the doc, I produced an INT4 weight-only engine.
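For context, the conversion and build were along these lines (a sketch only; the checkpoint output path and flag names follow the generic TensorRT-LLM weight-only recipe and may differ slightly from the deepseek_v3 README):

# convert the BF16 HF checkpoint to an INT4 weight-only TRT-LLM checkpoint, TP=8
python3 convert_checkpoint.py --model_dir /data/workspace/DeepSeek-V3-bf16 \
        --output_dir /data/workspace/trtllm_ckpt_int4_tp8 \
        --dtype bfloat16 --tp_size 8 \
        --use_weight_only --weight_only_precision int4
# build the TP-8 engines used in the run below (batch size 16, max sequence length 4096)
trtllm-build --checkpoint_dir /data/workspace/trtllm_ckpt_int4_tp8 \
        --output_dir /data/workspace/trtllm_engines/tp8-seq4096-bs16 \
        --max_batch_size 16 --max_seq_len 4096 --gemm_plugin bfloat16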

However, the example run outputs garbage words:

root@pod-test:/data/workspace/examples/deepseek_v3# mpirun --allow-run-as-root -n 8 python3 ../run.py --input_text "Today is a nice day." \
        --max_output_len 30 \
        --tokenizer_dir /data/workspace/DeepSeek-V3-bf16  \
        --engine_dir /data/workspace/trtllm_engines/tp8-seq4096-bs16 \
        --top_p 0.95 \
        --temperature 0.3
(All 8 ranks print essentially the same startup messages; one representative copy is shown below.)
[TensorRT-LLM] TensorRT-LLM version: 0.17.0.dev2024121700
[TensorRT-LLM][INFO] Engine version 0.17.0.dev2024121700 found in the config file, assuming engine(s) built by new builder API.
[01/13/2025-15:00:38] [TRT-LLM] [I] Using C++ session
[TensorRT-LLM][INFO] MPI size: 8, MPI local size: 8, rank: 0
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 4096
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (4096) * 61
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 4095 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 46553 MiB
[TensorRT-LLM][INFO] Detecting local TP group for rank 0
[TensorRT-LLM][INFO] TP group is intra-node for rank 0
[TensorRT-LLM][INFO] Inspecting the engine to identify potential runtime issues...
[TensorRT-LLM][INFO] The profiling verbosity of the engine does not allow this analysis to proceed. Re-build the engine with 'detailed' profiling verbosity to get more diagnostics.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 2265.58 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 46523 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 633.07 KB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 2.10 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 95.22 GiB, available: 45.32 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 3131
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 64
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 40.80 GiB for max tokens in paged KV cache (200384).
[TensorRT-LLM][INFO] Enable MPI KV cache transport.
[01/13/2025-15:01:28] [TRT-LLM] [I] Load engine takes: 49.773377895355225 sec
Input [Text 0]: "<|begin▁of▁sentence|>Today is a nice day."
Output [Text 0 Beam 0]: " a
:
	 and 
, in. and​ as
, and, the a,,     a,,, and000 superimposed"
[TensorRT-LLM][INFO] Refreshed the MPI local session
@nv-guomingz
Collaborator

Hi @handoku, thanks for reporting this issue. We'll take a look first.

@Songyanfei

I encountered the same issue when using the 4-bit weight-only version of DeepSeek-V3.

@nv-guomingz added the Low Precision (Issue about lower bit quantization, including int8, int4, fp8) label on Jan 14, 2025
@github-actions bot added the triaged (Issue has been triaged by maintainers) and Investigating labels on Jan 14, 2025
@nv-guomingz
Collaborator

nv-guomingz commented Jan 14, 2025

Hi @handoku, this is a known issue for DeepSeek-V3 int4/int8 quantization. Since DeepSeek-V3 hasn't published int4/int8 metrics yet, we don't recommend quantizing DeepSeek-V3 with a non-FP8 recipe at this moment.

@handoku
Author

handoku commented Jan 15, 2025

Too bad. TRT-LLM currently doesn't support DeepSeek-V3's FP8 inference either, so we can't run inference in anything other than bf16/fp16 precision? This model is too large; we can't afford to run it in fp16. And trtllm_backend multi-node serving is not that convenient...

@Songyanfei

I came across this while following the DeepSeek-V3 documentation. Based on the description there, INT8/INT4 quantized inference appears to be supported, but in practice it is completely unusable. The memory consumption in FP16 is entirely unaffordable, which makes the whole situation quite tricky.

@nv-guomingz
Collaborator

We're going to support FP8 inference soon.

@handoku
Author

handoku commented Jan 15, 2025

I am using SGLang to serve DeepSeek-V3 for now. Although it supports FP8 and the MLA optimization, it still needs two 8-GPU nodes for inference. I was hoping TRT-LLM INT4 could save resources and improve throughput.
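For reference, the two-node SGLang launch I mean is roughly the following (the init address is a placeholder, and the flags follow SGLang's multi-node serving instructions as I understand them, so double-check against their docs):

# node 0 (10.0.0.1:5000 is a placeholder for the head node's address and port)
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --trust-remote-code \
        --tp 16 --nnodes 2 --node-rank 0 --dist-init-addr 10.0.0.1:5000
# node 1: same command with --node-rank 1
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --trust-remote-code \
        --tp 16 --nnodes 2 --node-rank 1 --dist-init-addr 10.0.0.1:5000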

@Songyanfei

Exactly! INT4 is particularly appealing since it allows running on a single node, and A100/A800 GPUs don’t support the FP8 data format. This makes INT4 a great choice, especially for MoE models.

@Harley-ZP

Harley-ZP commented Jan 16, 2025

Hi, how did you build and install the deepseek branch of TRT-LLM?
I found this approach in the docs, but it seems too complicated and I cannot reach the public network.
How did you manage it?

@handoku
Author

handoku commented Jan 16, 2025

@Harley-ZP Find a machine connected to the Internet, try pulling the Docker images from NGC, and then build the TRT-LLM wheel with this command. Otherwise, installing the dependencies can really be tricky...
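Roughly, assuming the standard build flow from the TensorRT-LLM repo (the deepseek branch may differ slightly):

# one-shot: build the release image and wheel via Docker (needs access to the NGC base images)
make -C docker release_build
# or, inside the dev container, build only the wheel and install it
python3 ./scripts/build_wheel.py --clean --trt_root /usr/local/tensorrt
pip install ./build/tensorrt_llm-*.whl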
