Update TensorRT-LLM (NVIDIA#1019)
* Update TensorRT-LLM

---------

Co-authored-by: erenup <[email protected]>
Co-authored-by: Shixiaowei02 <[email protected]>
3 people authored Jan 31, 2024
1 parent da79354 commit e06f537
Showing 756 changed files with 3,085,880 additions and 2,434,985 deletions.
12 changes: 7 additions & 5 deletions README.md
@@ -17,7 +17,7 @@ TensorRT-LLM
<div align="left">

## Latest News
* [2024/01/30] [ New **XQA-kernel** provides **2.4x more Llama-70B throughput** within the same latency budget](./docs/source/blogs/XQA-kernel.md)
* [2023/12/04] [**Falcon-180B** on a **single H200** GPU with INT4 AWQ, and **6.7x faster Llama-70B** over A100](./docs/source/blogs/Falcon180B-H200.md)
* [2023/11/27] [SageMaker LMI now supports TensorRT-LLM - improves throughput by 60%, compared to previous version](https://aws.amazon.com/blogs/machine-learning/boost-inference-performance-for-llms-with-new-amazon-sagemaker-containers/)
* [2023/11/13] [H200 achieves nearly 12,000 tok/sec on Llama2-13B](./docs/source/blogs/H200launch.md)
@@ -106,7 +106,8 @@ After installing the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacent
please run the following commands to install TensorRT-LLM for x86_64 users.

```bash
# Please use the `nvidia-docker` application; using plain `docker` may cause errors.
# Obtain and start the basic docker image environment.
nvidia-docker run --entrypoint /bin/bash -it nvidia/cuda:12.1.0-devel-ubuntu22.04

# Install dependencies; TensorRT-LLM requires Python 3.10
@@ -167,8 +168,8 @@

```bash
python convert_checkpoint.py --model_dir ./bloom/560M/ \
    --output_dir ./bloom/560M/trt_ckpt/fp16/1-gpu/
# You may need to add trtllm-build to PATH: export PATH=/usr/local/bin:$PATH
trtllm-build --checkpoint_dir ./bloom/560M/trt_ckpt/fp16/1-gpu/ \
    --gemm_plugin float16 \
    --gpt_attention_plugin float16 \
    --output_dir ./bloom/560M/trt_engines/fp16/1-gpu/
```
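
Once the engine is built, a quick sanity check is to run a single prompt through it. Below is a minimal sketch, assuming the `examples/run.py` helper that sits one directory above the bloom example; the flag spellings here can vary between releases:

```bash
# Hypothetical smoke test; assumes examples/run.py and these flag names.
python3 ../run.py --input_text "The capital of France is" \
                  --max_output_len 20 \
                  --tokenizer_dir ./bloom/560M/ \
                  --engine_dir ./bloom/560M/trt_engines/fp16/1-gpu/
```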

@@ -264,7 +265,7 @@ The list of supported models is:

* [Baichuan](examples/baichuan)
* [BART](examples/enc_dec)
* [BERT](examples/bert)
* [Blip2](examples/blip2)
* [BLOOM](examples/bloom)
* [ChatGLM](examples/chatglm)
@@ -286,6 +287,7 @@ The list of supported models is:
* [Phi-1.5/Phi-2](examples/phi)
* [Qwen](examples/qwen)
* [Replit Code](examples/mpt)
* [RoBERTa](examples/bert)
* [SantaCoder](examples/gpt)
* [StarCoder](examples/gpt)
* [T5](examples/enc_dec)
42 changes: 36 additions & 6 deletions benchmarks/cpp/README.md
@@ -69,16 +69,44 @@ If you want to get the logits, you could run gptSessionBenchmark with `--print_a

#### Prepare dataset

Run a preprocessing script to generate a dataset as a JSON file that gptManagerBenchmark can consume later. The processed JSON records *input token ids, output token lengths, and time delays*, which gptManagerBenchmark uses to control the request rate.

This tool can be used in two different modes of traffic generation.

1 – Dataset

“Prompt”, “Instruction” (optional) and “Answer” are specified as sentences in a JSON file; a hypothetical sample is sketched below.

The tool tokenizes the text and instructs the model to generate a specified number of output tokens for each request.
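
For illustration, a minimal input file for this mode might be created as follows. The field names mirror the description above, but they are assumptions; check the exact schema against `prepare_dataset.py` itself:

```bash
# Create a toy dataset; the field names follow the description above and are
# assumptions, not the verified schema of prepare_dataset.py.
cat > toy_dataset.json <<'EOF'
[
  {
    "Prompt": "Summarize the following article: The city council met on Tuesday to debate the new budget.",
    "Instruction": "Keep the summary under two sentences.",
    "Answer": "The council debated the new budget on Tuesday."
  }
]
EOF
```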

```
python3 prepare_dataset.py \
    --output preprocessed_dataset.json \
    --request-rate 10 \
    --time-delay-dist exponential_dist \
    --tokenizer <path/to/tokenizer> \
    dataset \
    --dataset <path/to/dataset> \
    --max-input-len 300
```

2 – Normal token length distribution

This mode generates requests whose token lengths follow a normal distribution with a user-specified mean and standard deviation.
For example, setting mean=100 and std dev=10 generates requests in which 95.4% of input lengths fall in the range [80, 120] (the ±2σ interval of a normal distribution). Setting std dev=0 generates all requests with exactly the mean number of tokens.

```
python prepare_dataset.py \
    --output token-norm-dist.json \
    --request-rate 10 \
    --time-delay-dist constant \
    --tokenizer <path/to/tokenizer> \
    token-norm-dist \
    --num-requests 100 \
    --input-mean 100 --input-stdev 10 --output-mean 15 --output-stdev 0
```

For `tokenizer`, you can specify either the path to a tokenizer that has already been downloaded locally, or simply the name of a HuggingFace tokenizer such as `meta-llama/Llama-2-7b`; in the latter case, the tokenizer is downloaded automatically.

#### Prepare TensorRT-LLM engines
Please make sure that the engines are built with the arguments `--use_inflight_batching` and `--remove_input_padding` if you'd like to benchmark inflight batching. For more details, please see the documentation in the TensorRT-LLM examples.
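
As a rough sketch only, the two flags above would be passed to the per-model build script; the script path and the companion flag below are assumptions for illustration, not taken from this page:

```bash
# Hypothetical build producing inflight-batching-compatible engines; the
# script location and --output_dir value are illustrative assumptions.
python3 examples/gpt/build.py \
    --use_inflight_batching \
    --remove_input_padding \
    --output_dir trt_engine/gpt2-ib/fp16/1-gpu/
```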
@@ -100,6 +128,7 @@

Take GPT-350M as an example for single GPU V1 batching

```
./benchmarks/gptManagerBenchmark \
    --engine_dir ../../examples/gpt/trt_engine/gpt2/fp16/1-gpu/ \
    --type V1 \
    --dataset ../../benchmarks/cpp/preprocessed_dataset.json \
    --max_num_samples 500
```

Take GPT-350M as an example for 2-GPU inflight batching
@@ -109,4 +138,5 @@

```
mpirun -n 2 ./benchmarks/gptManagerBenchmark \
    --engine_dir ../../examples/gpt/trt_engine/gpt2-ib/fp16/2-gpu/ \
    --type IFB \
    --dataset ../../benchmarks/cpp/preprocessed_dataset.json \
    --max_num_samples 500
```