
Mixtral 8x7B-v0.1 hangs after serving a few requests #457

Closed
2 of 4 tasks
aaditya-srivathsan opened this issue May 15, 2024 · 6 comments
Labels
bug Something isn't working

Comments

@aaditya-srivathsan

System Info

2x A100 80 GB (160 GB total)

Who can help?

@byshiue @kaiyux

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Build from source by cloning the main branch of tensorrtllm_backend

# Update the submodules
cd tensorrtllm_backend
git lfs install
git submodule update --init --recursive

# Use the Dockerfile to build the backend in a container
# For x86_64
DOCKER_BUILDKIT=1 docker build -t triton_trt_llm -f dockerfile/Dockerfile.trt_llm_backend .

Download weights from HF

pip install -r requirements.txt # install latest version of transformers, needed for Mixtral

git lfs install
git clone https://huggingface.co/mistralai/Mixtral-8x7B-v0.1

Set Directory and generate engines

export HF_LLAMA_MODEL=/path/Mixtral-8x7B-v0.1
export UNIFIED_CKPT_PATH=/path/mixtral-56B/
export ENGINE_PATH=/path/tensorrtllm_backend/all_models/inflight_batcher_llm/tensorrt_llm/1/

python3 ./examples/llama/convert_checkpoint.py --model_dir ${HF_LLAMA_MODEL} \
                             --output_dir ${UNIFIED_CKPT_PATH} \
                             --dtype float16 \
                             --tp_size 2 \
                             --use_weight_only \
                             --weight_only_precision int4

python3 -m tensorrt_llm.commands.build --checkpoint_dir ${UNIFIED_CKPT_PATH} \
                 --output_dir ${ENGINE_PATH} \
                 --gemm_plugin float16 \
                 --max_input_len 32000

Then start the Triton server like so

cp -r all_models/inflight_batcher_llm/ mixtral_ifb

python3 tools/fill_template.py -i mixtral_ifb/preprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},triton_max_batch_size:64,preprocessing_instance_count:1
python3 tools/fill_template.py -i mixtral_ifb/postprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},triton_max_batch_size:64,postprocessing_instance_count:1
python3 tools/fill_template.py -i mixtral_ifb/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
python3 tools/fill_template.py -i mixtral_ifb/ensemble/config.pbtxt triton_max_batch_size:64
python3 tools/fill_template.py -i mixtral_ifb/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:64,decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_PATH},max_tokens_in_paged_kv_cache:20000,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0

python3 scripts/launch_triton_server.py --world_size=2 --model_repo=mixtral_ifb/ --log

Finally, in a separate terminal

sudo docker run --gpus all --rm -it --net host -v /home/azureuser/:/home/azureuser/ nvcr.io/nvidia/tritonserver:24.03-py3-sdk

perf_analyzer -m ensemble --input-data llm_inputs.json --measurement-interval 45000 --service-kind triton --request-rate-range 0.5:1.5:0.5 --request-distribution constant --stability-percentage 1000 -i grpc -u localhost:8001 --shape max_tokens:1 --shape text_input:1 -f output2k_large.csv --verbose-csv --collect-metrics
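
The contents of llm_inputs.json are not included above. As a rough sketch, an input-data file for this ensemble could look something like the following, assuming perf_analyzer's JSON --input-data format and the ensemble's text_input/max_tokens inputs (the prompt and token count are placeholders, not the values actually used):

{
  "data": [
    {
      "text_input": ["What is machine learning?"],
      "max_tokens": [2048]
    }
  ]
}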

Expected behavior

The expected behavior is to get throughput and latency numbers.

Actual behavior

The command just hangs and doesn't return anything.

*** Measurement Settings ***
  Batch size: 1
  Service Kind: Triton
  Using "time_windows" mode for stabilization
  Measurement window: 45000 msec
  Latency limit: 0 msec
  Request Rate limit: 1.5 requests per seconds
  Using uniform distribution on request generation
  Using synchronous calls for inference
  Stabilizing using average latency

Request Rate: 0.5 inference requests per seconds
failed to find the requested model version

additional notes

I wrote a custom script that uses gRPC via tritonclient to send synchronous requests (see the sketch below). Initially each request completes in about 8 seconds, but after about 40 such requests it just hangs.
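
For reference, a minimal sketch of what such a client could look like, assuming the tritonclient gRPC API and the text_input/max_tokens/text_output tensor names and dtypes from the ensemble config used above (the prompt, token count, and loop count are placeholders):

import numpy as np
import tritonclient.grpc as grpcclient

# Triton gRPC endpoint started by launch_triton_server.py above
client = grpcclient.InferenceServerClient(url="localhost:8001")

prompt = "What is machine learning?"  # placeholder prompt

# Shapes include the batch dimension; dtypes assumed from the ensemble config
text_data = np.array([[prompt]], dtype=object)
tokens_data = np.array([[512]], dtype=np.int32)

text_input = grpcclient.InferInput("text_input", [1, 1], "BYTES")
text_input.set_data_from_numpy(text_data)
max_tokens = grpcclient.InferInput("max_tokens", [1, 1], "INT32")
max_tokens.set_data_from_numpy(tokens_data)

outputs = [grpcclient.InferRequestedOutput("text_output")]

# Send synchronous requests back to back; the hang shows up after ~40 of them
for i in range(100):
    result = client.infer("ensemble", [text_input, max_tokens], outputs=outputs)
    print(i, result.as_numpy("text_output"))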

The tritonserver logs with verbose logging enabled look like this:

I0514 21:01:55.276668 160 model_instance_state.cc:813] {"Active Request Count":1,"Iteration Counter":110579,"Max Request Count":1,"Runtime CPU Memory Usage":404,"Runtime GPU Memory Usage":6496902765,"Runtime Pinned Memory Usage":1074138393,"Timestamp":"05-14-2024 21:01:55","Context Requests":0,"Generation Requests":1,"MicroBatch ID":0,"Paused Requests":0,"Scheduled Requests":1,"Total Context Tokens":0,"Free KV cache blocks":137,"Max KV cache blocks":157,"Tokens per KV cache block":128,"Used KV cache blocks":20}
I0514 21:01:55.284479 160 model_instance_state.cc:813] {"Active Request Count":1,"Iteration Counter":110580,"Max Request Count":1,"Runtime CPU Memory Usage":404,"Runtime GPU Memory Usage":6496902765,"Runtime Pinned Memory Usage":1074138393,"Timestamp":"05-14-2024 21:01:55","Context Requests":0,"Generation Requests":1,"MicroBatch ID":0,"Paused Requests":0,"Scheduled Requests":1,"Total Context Tokens":0,"Free KV cache blocks":137,"Max KV cache blocks":157,"Tokens per KV cache block":128,"Used KV cache blocks":20}
I0514 21:01:55.292292 160 model_instance_state.cc:813] {"Active Request Count":1,"Iteration Counter":110581,"Max Request Count":1,"Runtime CPU Memory Usage":404,"Runtime GPU Memory Usage":6496902765,"Runtime Pinned Memory Usage":1074138393,"Timestamp":"05-14-2024 21:01:55","Context Requests":0,"Generation Requests":1,"MicroBatch ID":0,"Paused Requests":0,"Scheduled Requests":1,"Total Context Tokens":0,"Free KV cache blocks":137,"Max KV cache blocks":157,"Tokens per KV cache block":128,"Used KV cache blocks":20}
I0514 21:01:55.299936 160 model_instance_state.cc:813] {"Active Request Count":1,"Iteration Counter":110582,"Max Request Count":1,"Runtime CPU Memory Usage":404,"Runtime GPU Memory Usage":6496902765,"Runtime Pinned Memory Usage":1074138393,"Timestamp":"05-14-2024 21:01:55","Context Requests":0,"Generation Requests":1,"MicroBatch ID":0,"Paused Requests":0,"Scheduled Requests":1,"Total Context Tokens":0,"Free KV cache blocks":137,"Max KV cache blocks":157,"Tokens per KV cache block":128,"Used KV cache blocks":20}
I0514 21:01:55.307794 160 model_instance_state.cc:813] {"Active Request Count":1,"Iteration Counter":110583,"Max Request Count":1,"Runtime CPU Memory Usage":404,"Runtime GPU Memory Usage":6496902765,"Runtime Pinned Memory Usage":1074138393,"Timestamp":"05-14-2024 21:01:55","Context Requests":0,"Generation Requests":1,"MicroBatch ID":0,"Paused Requests":0,"Scheduled Requests":1,"Total Context Tokens":0,"Free KV cache blocks":137,"Max KV cache blocks":157,"Tokens per KV cache block":128,"Used KV cache blocks":20}
I0514 21:01:55.315951 160 model_instance_state.cc:813] {"Active Request Count":1,"Iteration Counter":110584,"Max Request Count":1,"Runtime CPU Memory Usage":404,"Runtime GPU Memory Usage":6496902765,"Runtime Pinned Memory Usage":1074138393,"Timestamp":"05-14-2024 21:01:55","Context Requests":0,"Generation Requests":1,"MicroBatch ID":0,"Paused Requests":0,"Scheduled Requests":1,"Total Context Tokens":0,"Free KV cache blocks":137,"Max KV cache blocks":157,"Tokens per KV cache block":128,"Used KV cache blocks":20}
I0514 21:01:55.323544 160 model_instance_state.cc:813] {"Active Request Count":1,"Iteration Counter":110585,"Max Request Count":1,"Runtime CPU Memory Usage":404,"Runtime GPU Memory Usage":6496902765,"Runtime Pinned Memory Usage":1074138393,"Timestamp":"05-14-2024 21:01:55","Context Requests":0,"Generation Requests":1,"MicroBatch ID":0,"Paused Requests":0,"Scheduled Requests":1,"Total Context Tokens":0,"Free KV cache blocks":137,"Max KV cache blocks":157,"Tokens per KV cache block":128,"Used KV cache blocks":20}
I0514 21:01:55.331394 160 model_instance_state.cc:813] {"Active Request Count":1,"Iteration Counter":110586,"Max Request Count":1,"Runtime CPU Memory Usage":404,"Runtime GPU Memory Usage":6496902765,"Runtime Pinned Memory Usage":1074138393,"Timestamp":"05-14-2024 21:01:55","Context Requests":0,"Generation Requests":1,"MicroBatch ID":0,"Paused Requests":0,"Scheduled Requests":1,"Total Context Tokens":0,"Free KV cache blocks":137,"Max KV cache blocks":157,"Tokens per KV cache block":128,"Used KV cache blocks":20}
I0514 21:01:55.339250 160 model_instance_state.cc:813] {"Active Request Count":1,"Iteration Counter":110587,"Max Request Count":1,"Runtime CPU Memory Usage":404,"Runtime GPU Memory Usage":6496902765,"Runtime Pinned Memory Usage":1074138393,"Timestamp":"05-14-2024 21:01:55","Context Requests":0,"Generation Requests":1,"MicroBatch ID":0,"Paused Requests":0,"Scheduled Requests":1,"Total Context Tokens":0,"Free KV cache blocks":137,"Max KV cache blocks":157,"Tokens per KV cache block":128,"Used KV cache blocks":20}

It never returns a response and just hangs.

Quantizing to int4 doesn't help either.

@aaditya-srivathsan added the bug (Something isn't working) label May 15, 2024
@ganeshku1

@aaditya-srivathsan We are reviewing this ticket and will get back to you with updates.

@aaditya-srivathsan
Author

@ganeshku1 any update on this?

@ganeshku1

ganeshku1 commented May 31, 2024

@aaditya-srivathsan We are working on resolving this issue and will update this thread once it is resolved.

cc: @dyastremsky

@rmccorm4
Contributor

Hi @aaditya-srivathsan, I've seen some similar issues reported that were solved by setting --use_custom_all_reduce disable.
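
For example, appended to the engine build command from the reproduction steps (a sketch assuming the same exported paths and that this version of the build CLI accepts the flag):

python3 -m tensorrt_llm.commands.build --checkpoint_dir ${UNIFIED_CKPT_PATH} \
                 --output_dir ${ENGINE_PATH} \
                 --gemm_plugin float16 \
                 --max_input_len 32000 \
                 --use_custom_all_reduce disable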

Can you try this to see if it helps?

@aaditya-srivathsan
Author

Sure, let me try this and I'll let you know whether it works.

@aaditya-srivathsan
Author

This did help, thank you very much!
