Update TensorRT-LLM (NVIDIA#1019)
* Update TensorRT-LLM

---------

Co-authored-by: erenup <[email protected]>
Co-authored-by: Shixiaowei02 <[email protected]>
3 people authored Jan 31, 2024
1 parent da79354 commit e06f537
Showing 756 changed files with 3,085,880 additions and 2,434,985 deletions.
12 changes: 7 additions & 5 deletions README.md
@@ -17,7 +17,7 @@ TensorRT-LLM
<div align="left">

## Latest News
* [2024/01/30] [ New **XQA-kernel** provides **2.4x more Llama-70B throughput** within the same latency budget](./docs/source/blogs/XQA-kernel.md)
* [2023/12/04] [**Falcon-180B** on a **single H200** GPU with INT4 AWQ, and **6.7x faster Llama-70B** over A100](./docs/source/blogs/Falcon180B-H200.md)
* [2023/11/27] [SageMaker LMI now supports TensorRT-LLM - improves throughput by 60%, compared to previous version](https://aws.amazon.com/blogs/machine-learning/boost-inference-performance-for-llms-with-new-amazon-sagemaker-containers/)
* [2023/11/13] [H200 achieves nearly 12,000 tok/sec on Llama2-13B](./docs/source/blogs/H200launch.md)
@@ -106,7 +106,8 @@ After installing the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacent
please run the following commands to install TensorRT-LLM for x86_64 users.

```bash
# Please use the `nvidia-docker` application; using plain `docker` may cause errors.
# Obtain and start the basic docker image environment.
nvidia-docker run --entrypoint /bin/bash -it nvidia/cuda:12.1.0-devel-ubuntu22.04

# Install dependencies; TensorRT-LLM requires Python 3.10
@@ -167,8 +168,8 @@

```bash
python convert_checkpoint.py --model_dir ./bloom/560M/ \
    --output_dir ./bloom/560M/trt_ckpt/fp16/1-gpu/
# You may need to add trtllm-build to PATH: export PATH=/usr/local/bin:$PATH
trtllm-build --checkpoint_dir ./bloom/560M/trt_ckpt/fp16/1-gpu/ \
    --gemm_plugin float16 \
    --gpt_attention_plugin float16 \
    --output_dir ./bloom/560M/trt_engines/fp16/1-gpu/
```
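
Once the engine is built, a quick sanity check is to run a single prompt through it. Below is a minimal sketch, assuming the `examples/run.py` helper that sits one directory above the bloom example; the flag spellings here can vary between releases:

```bash
# Hypothetical smoke test; assumes examples/run.py and these flag names.
python3 ../run.py --input_text "The capital of France is" \
                  --max_output_len 20 \
                  --tokenizer_dir ./bloom/560M/ \
                  --engine_dir ./bloom/560M/trt_engines/fp16/1-gpu/
```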

@@ -264,7 +265,7 @@ The list of supported models is:

* [Baichuan](examples/baichuan)
* [BART](examples/enc_dec)
* [BERT](examples/bert)
* [Blip2](examples/blip2)
* [BLOOM](examples/bloom)
* [ChatGLM](examples/chatglm)
@@ -286,6 +287,7 @@ The list of supported models is:
* [Phi-1.5/Phi-2](examples/phi)
* [Qwen](examples/qwen)
* [Replit Code](examples/mpt)
* [RoBERTa](examples/bert)
* [SantaCoder](examples/gpt)
* [StarCoder](examples/gpt)
* [T5](examples/enc_dec)
42 changes: 36 additions & 6 deletions benchmarks/cpp/README.md
@@ -69,16 +69,44 @@ If you want to get the logits, you could run gptSessionBenchmark with `--print_a

#### Prepare dataset

Run a preprocessing script to generate a dataset as a JSON file that gptManagerBenchmark can consume later. The processed JSON records *input token ids, output token lengths, and time delays*, which gptManagerBenchmark uses to control the request rate.

This tool can be used in two different modes of traffic generation.

1 – Dataset

“Prompt”, “Instruction” (optional) and “Answer” are specified as sentences in a JSON file; a hypothetical sample is sketched below.

The tool tokenizes the text and instructs the model to generate a specified number of output tokens for each request.
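
For illustration, a minimal input file for this mode might be created as follows. The field names mirror the description above, but they are assumptions; check the exact schema against `prepare_dataset.py` itself:

```bash
# Create a toy dataset; the field names follow the description above and are
# assumptions, not the verified schema of prepare_dataset.py.
cat > toy_dataset.json <<'EOF'
[
  {
    "Prompt": "Summarize the following article: The city council met on Tuesday to debate the new budget.",
    "Instruction": "Keep the summary under two sentences.",
    "Answer": "The council debated the new budget on Tuesday."
  }
]
EOF
```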

```
python3 prepare_dataset.py \
    --output preprocessed_dataset.json \
    --request-rate 10 \
    --time-delay-dist exponential_dist \
    --tokenizer <path/to/tokenizer> \
    dataset \
    --dataset <path/to/dataset> \
    --max-input-len 300
```

2 – Normal token length distribution

This mode generates requests whose token lengths follow a normal distribution with a user-specified mean and standard deviation.
For example, setting mean=100 and std dev=10 generates requests in which 95.4% of input lengths fall in the range [80, 120] (the ±2σ interval of a normal distribution). Setting std dev=0 generates all requests with exactly the mean number of tokens.

```
python prepare_dataset.py \
    --output token-norm-dist.json \
    --request-rate 10 \
    --time-delay-dist constant \
    --tokenizer <path/to/tokenizer> \
    token-norm-dist \
    --num-requests 100 \
    --input-mean 100 --input-stdev 10 --output-mean 15 --output-stdev 0
```

For `tokenizer`, you can specify either the path to a tokenizer that has already been downloaded locally, or simply the name of a HuggingFace tokenizer such as `meta-llama/Llama-2-7b`; in the latter case, the tokenizer is downloaded automatically.

#### Prepare TensorRT-LLM engines
Please make sure that the engines are built with the arguments `--use_inflight_batching` and `--remove_input_padding` if you'd like to benchmark inflight batching. For more details, please see the documentation in the TensorRT-LLM examples.
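
As a rough sketch only, the two flags above would be passed to the per-model build script; the script path and the companion flag below are assumptions for illustration, not taken from this page:

```bash
# Hypothetical build producing inflight-batching-compatible engines; the
# script location and --output_dir value are illustrative assumptions.
python3 examples/gpt/build.py \
    --use_inflight_batching \
    --remove_input_padding \
    --output_dir trt_engine/gpt2-ib/fp16/1-gpu/
```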
@@ -100,6 +128,7 @@

Take GPT-350M as an example for single GPU V1 batching

```
./benchmarks/gptManagerBenchmark \
    --engine_dir ../../examples/gpt/trt_engine/gpt2/fp16/1-gpu/ \
    --type V1 \
    --dataset ../../benchmarks/cpp/preprocessed_dataset.json \
    --max_num_samples 500
```

Take GPT-350M as an example for 2-GPU inflight batching
@@ -109,4 +138,5 @@

```
mpirun -n 2 ./benchmarks/gptManagerBenchmark \
    --engine_dir ../../examples/gpt/trt_engine/gpt2-ib/fp16/2-gpu/ \
    --type IFB \
    --dataset ../../benchmarks/cpp/preprocessed_dataset.json \
    --max_num_samples 500
```