
Mixtral 8x7B-v0.1 hangs after serving a few requests #457

Closed
2 of 4 tasks
aaditya-srivathsan opened this issue May 15, 2024 · 6 comments
Labels
bug Something isn't working

Comments

@aaditya-srivathsan

System Info

2x A100 80 GB (160 GB total)

Who can help?

@byshiue @kaiyux

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Build from source by cloning the main branch of tensorrtllm_backend

# Update the submodules
cd tensorrtllm_backend
git lfs install
git submodule update --init --recursive

# Use the Dockerfile to build the backend in a container
# For x86_64
DOCKER_BUILDKIT=1 docker build -t triton_trt_llm -f dockerfile/Dockerfile.trt_llm_backend .

Download weights from HF

pip install -r requirements.txt # install latest version of transformers, needed for Mixtral

git lfs install
git clone https://huggingface.co/mistralai/Mixtral-8x7B-v0.1

Set Directory and generate engines

export HF_LLAMA_MODEL=/path/Mixtral-8x7B-v0.1
export UNIFIED_CKPT_PATH=/path/mixtral-56B/
export ENGINE_PATH=/path/tensorrtllm_backend/all_models/inflight_batcher_llm/tensorrt_llm/1/

python3 ./examples/llama/convert_checkpoint.py --model_dir ${HF_LLAMA_MODEL} \
                             --output_dir ${UNIFIED_CKPT_PATH} \
                             --dtype float16 \
                             --tp_size 2 \
                             --use_weight_only \
                             --weight_only_precision int4

python3 -m tensorrt_llm.commands.build --checkpoint_dir ${UNIFIED_CKPT_PATH} \
                 --output_dir ${ENGINE_PATH} \
                 --gemm_plugin float16 \
                 --max_input_len 32000

Then start the Triton server like so

cp -r all_models/inflight_batcher_llm/ mixtral_ifb

python3 tools/fill_template.py -i mixtral_ifb/preprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},triton_max_batch_size:64,preprocessing_instance_count:1
python3 tools/fill_template.py -i mixtral_ifb/postprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},triton_max_batch_size:64,postprocessing_instance_count:1
python3 tools/fill_template.py -i mixtral_ifb/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
python3 tools/fill_template.py -i mixtral_ifb/ensemble/config.pbtxt triton_max_batch_size:64
python3 tools/fill_template.py -i mixtral_ifb/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:64,decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_PATH},max_tokens_in_paged_kv_cache:20000,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0

python3 scripts/launch_triton_server.py --world_size=2 --model_repo=mixtral_ifb/ --log

Finally, in a separate terminal

sudo docker run --gpus all --rm -it --net host -v /home/azureuser/:/home/azureuser/ nvcr.io/nvidia/tritonserver:24.03-py3-sdk

perf_analyzer -m ensemble --input-data llm_inputs.json --measurement-interval 45000 --service-kind triton --request-rate-range 0.5:1.5:0.5 --request-distribution constant --stability-percentage 1000 -i grpc -u localhost:8001 --shape max_tokens:1 --shape text_input:1 -f output2k_large.csv --verbose-csv --collect-metrics
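
The contents of llm_inputs.json are not included above. As a rough sketch, an input-data file for this ensemble could look something like the following, assuming perf_analyzer's JSON --input-data format and the ensemble's text_input/max_tokens inputs (the prompt and token count are placeholders, not the values actually used):

{
  "data": [
    {
      "text_input": ["What is machine learning?"],
      "max_tokens": [2048]
    }
  ]
}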

Expected behavior

The expected behavior is to get throughput and latency numbers.

Actual behavior

The command just hangs and doesn't return anything.

*** Measurement Settings ***
  Batch size: 1
  Service Kind: Triton
  Using "time_windows" mode for stabilization
  Measurement window: 45000 msec
  Latency limit: 0 msec
  Request Rate limit: 1.5 requests per seconds
  Using uniform distribution on request generation
  Using synchronous calls for inference
  Stabilizing using average latency

Request Rate: 0.5 inference requests per seconds
failed to find the requested model version

additional notes

I wrote a custom script that uses gRPC via tritonclient to send synchronous requests (see the sketch below). Initially each request completes in about 8 seconds, but after about 40 such requests it just hangs.
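
For reference, a minimal sketch of what such a client could look like, assuming the tritonclient gRPC API and the text_input/max_tokens/text_output tensor names and dtypes from the ensemble config used above (the prompt, token count, and loop count are placeholders):

import numpy as np
import tritonclient.grpc as grpcclient

# Triton gRPC endpoint started by launch_triton_server.py above
client = grpcclient.InferenceServerClient(url="localhost:8001")

prompt = "What is machine learning?"  # placeholder prompt

# Shapes include the batch dimension; dtypes assumed from the ensemble config
text_data = np.array([[prompt]], dtype=object)
tokens_data = np.array([[512]], dtype=np.int32)

text_input = grpcclient.InferInput("text_input", [1, 1], "BYTES")
text_input.set_data_from_numpy(text_data)
max_tokens = grpcclient.InferInput("max_tokens", [1, 1], "INT32")
max_tokens.set_data_from_numpy(tokens_data)

outputs = [grpcclient.InferRequestedOutput("text_output")]

# Send synchronous requests back to back; the hang shows up after ~40 of them
for i in range(100):
    result = client.infer("ensemble", [text_input, max_tokens], outputs=outputs)
    print(i, result.as_numpy("text_output"))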

The tritonserver logs with verbose logging enabled look like this:

I0514 21:01:55.276668 160 model_instance_state.cc:813] {"Active Request Count":1,"Iteration Counter":110579,"Max Request Count":1,"Runtime CPU Memory Usage":404,"Runtime GPU Memory Usage":6496902765,"Runtime Pinned Memory Usage":1074138393,"Timestamp":"05-14-2024 21:01:55","Context Requests":0,"Generation Requests":1,"MicroBatch ID":0,"Paused Requests":0,"Scheduled Requests":1,"Total Context Tokens":0,"Free KV cache blocks":137,"Max KV cache blocks":157,"Tokens per KV cache block":128,"Used KV cache blocks":20}
I0514 21:01:55.284479 160 model_instance_state.cc:813] {"Active Request Count":1,"Iteration Counter":110580,"Max Request Count":1,"Runtime CPU Memory Usage":404,"Runtime GPU Memory Usage":6496902765,"Runtime Pinned Memory Usage":1074138393,"Timestamp":"05-14-2024 21:01:55","Context Requests":0,"Generation Requests":1,"MicroBatch ID":0,"Paused Requests":0,"Scheduled Requests":1,"Total Context Tokens":0,"Free KV cache blocks":137,"Max KV cache blocks":157,"Tokens per KV cache block":128,"Used KV cache blocks":20}
I0514 21:01:55.292292 160 model_instance_state.cc:813] {"Active Request Count":1,"Iteration Counter":110581,"Max Request Count":1,"Runtime CPU Memory Usage":404,"Runtime GPU Memory Usage":6496902765,"Runtime Pinned Memory Usage":1074138393,"Timestamp":"05-14-2024 21:01:55","Context Requests":0,"Generation Requests":1,"MicroBatch ID":0,"Paused Requests":0,"Scheduled Requests":1,"Total Context Tokens":0,"Free KV cache blocks":137,"Max KV cache blocks":157,"Tokens per KV cache block":128,"Used KV cache blocks":20}
I0514 21:01:55.299936 160 model_instance_state.cc:813] {"Active Request Count":1,"Iteration Counter":110582,"Max Request Count":1,"Runtime CPU Memory Usage":404,"Runtime GPU Memory Usage":6496902765,"Runtime Pinned Memory Usage":1074138393,"Timestamp":"05-14-2024 21:01:55","Context Requests":0,"Generation Requests":1,"MicroBatch ID":0,"Paused Requests":0,"Scheduled Requests":1,"Total Context Tokens":0,"Free KV cache blocks":137,"Max KV cache blocks":157,"Tokens per KV cache block":128,"Used KV cache blocks":20}
I0514 21:01:55.307794 160 model_instance_state.cc:813] {"Active Request Count":1,"Iteration Counter":110583,"Max Request Count":1,"Runtime CPU Memory Usage":404,"Runtime GPU Memory Usage":6496902765,"Runtime Pinned Memory Usage":1074138393,"Timestamp":"05-14-2024 21:01:55","Context Requests":0,"Generation Requests":1,"MicroBatch ID":0,"Paused Requests":0,"Scheduled Requests":1,"Total Context Tokens":0,"Free KV cache blocks":137,"Max KV cache blocks":157,"Tokens per KV cache block":128,"Used KV cache blocks":20}
I0514 21:01:55.315951 160 model_instance_state.cc:813] {"Active Request Count":1,"Iteration Counter":110584,"Max Request Count":1,"Runtime CPU Memory Usage":404,"Runtime GPU Memory Usage":6496902765,"Runtime Pinned Memory Usage":1074138393,"Timestamp":"05-14-2024 21:01:55","Context Requests":0,"Generation Requests":1,"MicroBatch ID":0,"Paused Requests":0,"Scheduled Requests":1,"Total Context Tokens":0,"Free KV cache blocks":137,"Max KV cache blocks":157,"Tokens per KV cache block":128,"Used KV cache blocks":20}
I0514 21:01:55.323544 160 model_instance_state.cc:813] {"Active Request Count":1,"Iteration Counter":110585,"Max Request Count":1,"Runtime CPU Memory Usage":404,"Runtime GPU Memory Usage":6496902765,"Runtime Pinned Memory Usage":1074138393,"Timestamp":"05-14-2024 21:01:55","Context Requests":0,"Generation Requests":1,"MicroBatch ID":0,"Paused Requests":0,"Scheduled Requests":1,"Total Context Tokens":0,"Free KV cache blocks":137,"Max KV cache blocks":157,"Tokens per KV cache block":128,"Used KV cache blocks":20}
I0514 21:01:55.331394 160 model_instance_state.cc:813] {"Active Request Count":1,"Iteration Counter":110586,"Max Request Count":1,"Runtime CPU Memory Usage":404,"Runtime GPU Memory Usage":6496902765,"Runtime Pinned Memory Usage":1074138393,"Timestamp":"05-14-2024 21:01:55","Context Requests":0,"Generation Requests":1,"MicroBatch ID":0,"Paused Requests":0,"Scheduled Requests":1,"Total Context Tokens":0,"Free KV cache blocks":137,"Max KV cache blocks":157,"Tokens per KV cache block":128,"Used KV cache blocks":20}
I0514 21:01:55.339250 160 model_instance_state.cc:813] {"Active Request Count":1,"Iteration Counter":110587,"Max Request Count":1,"Runtime CPU Memory Usage":404,"Runtime GPU Memory Usage":6496902765,"Runtime Pinned Memory Usage":1074138393,"Timestamp":"05-14-2024 21:01:55","Context Requests":0,"Generation Requests":1,"MicroBatch ID":0,"Paused Requests":0,"Scheduled Requests":1,"Total Context Tokens":0,"Free KV cache blocks":137,"Max KV cache blocks":157,"Tokens per KV cache block":128,"Used KV cache blocks":20}

It never returns a response and just hangs.

Quantizing to int4 doesn't help either.

@aaditya-srivathsan added the bug (Something isn't working) label May 15, 2024
@ganeshku1

@aaditya-srivathsan We are reviewing this ticket and will get back to you with updates.

@aaditya-srivathsan
Author

@ganeshku1 any update on this?

@ganeshku1

ganeshku1 commented May 31, 2024

@aaditya-srivathsan We are working on resolving this issue and will update this thread once it is resolved.

cc: @dyastremsky

@rmccorm4
Contributor

Hi @aaditya-srivathsan, I've seen some similar issues reported that were solved by setting --use_custom_all_reduce disable.
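
For example, appended to the engine build command from the reproduction steps (a sketch assuming the same exported paths and that this version of the build CLI accepts the flag):

python3 -m tensorrt_llm.commands.build --checkpoint_dir ${UNIFIED_CKPT_PATH} \
                 --output_dir ${ENGINE_PATH} \
                 --gemm_plugin float16 \
                 --max_input_len 32000 \
                 --use_custom_all_reduce disable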

Can you try this to see if it helps?

@aaditya-srivathsan
Author

Sure, let me try this and I'll let you know whether it works.

@aaditya-srivathsan
Author

This did help, thank you very much!
