Support bfloat16 LoRA Adapters #403
Comments
You can replace `converted_weights = torch.concatenate(converted_weights, dim=0).unsqueeze(0).to(dtype=str_dtype_to_torch(dtype)).cpu().numpy()` with `converted_weights = torch_to_numpy(torch.concatenate(converted_weights, dim=0).unsqueeze(0).to(dtype=str_dtype_to_torch(dtype)).cpu())` and import `torch_to_numpy`, as shown in the sketch below.
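A minimal sketch of that replacement, assuming `torch_to_numpy` and `str_dtype_to_torch` come from `tensorrt_llm._utils` (the import location is an assumption, not stated in this thread):

```python
import torch
# Assumption: both helpers live in tensorrt_llm._utils.
from tensorrt_llm._utils import str_dtype_to_torch, torch_to_numpy

dtype = "bfloat16"
converted_weights = [torch.randn(8, 16), torch.randn(8, 16)]  # stand-in data

# Before (fails for bfloat16, because plain .numpy() needs a NumPy dtype):
# converted_weights = torch.concatenate(
#     converted_weights,
#     dim=0).unsqueeze(0).to(dtype=str_dtype_to_torch(dtype)).cpu().numpy()

# After (torch_to_numpy can represent bfloat16 tensors):
converted_weights = torch_to_numpy(
    torch.concatenate(
        converted_weights,
        dim=0).unsqueeze(0).to(dtype=str_dtype_to_torch(dtype)).cpu())
```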
How can I pass bfloat16 adapter weights to the backend? How could I build the output tensors in "preprocessing", for example, to be of bfloat16 datatype? What updates do I need to make in config.pbtxt?
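For the config.pbtxt side, a hedged sketch of how the adapter-weight input might be declared; the tensor name `lora_weights`, the dims, and the `optional` flag are assumptions modeled on the usual tensorrtllm_backend LoRA inputs, not confirmed in this thread:

```
# Hypothetical fragment of the tensorrt_llm model's config.pbtxt.
input [
  {
    name: "lora_weights"
    data_type: TYPE_BF16
    dims: [ -1, -1 ]
    optional: true
  }
]
```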
The issue of converting the LoRA adapter is fixed in the latest main branch. For running bfloat16 on the backend, you should only change the …
I would like to manage loading the LoRA weights on the first call to that adapter in my preprocessing model.py. I am not sure how to package the weights as bfloat16 in order to send them to the backend. It seems like the Python backend cannot handle this datatype.
If you hope to send bf16 data between two models and think Triton does not support it, please ask in the Triton repo. They are more familiar with this part.
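One possible way to hand bfloat16 tensors out of a Python backend model without going through NumPy is DLPack; a minimal sketch, where the output name `lora_weights` and the dummy shape are assumptions for illustration, not taken from this thread:

```python
import torch
from torch.utils.dlpack import to_dlpack
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    """Sketch of a model.py that returns a bfloat16 tensor via DLPack."""

    def execute(self, requests):
        responses = []
        for request in requests:
            # Assume the adapter weights were loaded elsewhere as bfloat16.
            weights = torch.zeros((1, 8, 16), dtype=torch.bfloat16)
            # DLPack avoids the NumPy round-trip, which cannot express bfloat16.
            out = pb_utils.Tensor.from_dlpack("lora_weights", to_dlpack(weights))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses
```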
I have a Mistral 7B model with fine-tuned LoRA weights in bfloat16.
I ran into issues when attempting to use my adapters, which were compiled for bfloat16.
Running the conversion command to produce the `.npy` format, which allows me to follow the example, results in an error, likely a limitation of NumPy not natively supporting bfloat16.
I went ahead and converted to float32 instead, just to continue testing (and hoping that precision was maintained).
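A minimal sketch of that float32 fallback (the shape and file name are made up for illustration); since bfloat16 is a truncated float32, the upcast itself is lossless:

```python
import numpy as np
import torch

w = torch.randn(8, 16, dtype=torch.bfloat16)  # stand-in adapter weight

# Plain NumPy has no bfloat16 dtype, so w.numpy() raises a TypeError.
# Upcasting to float32 keeps the values bit-exact and allows saving as .npy.
w_np = w.to(torch.float32).cpu().numpy()
np.save("adapter_weight.npy", w_np)
```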
....
I had some hope that in Triton I would still be able to use bfloat16, because I see it listed as a supported datatype, BF16.
When I load the models and configs and send them to the backend (which was compiled for bfloat16) and call it through triton-inference-server, I get the following error:
[TensorRT-LLM][ERROR] Assertion failed: Expected lora weights to be the same data type as base model (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/loraUtils.cpp:66)
Is there any way for me to pass in the adapter weights as bfloat16?
I noticed that TYPE_BF16 is listed here, but it does not seem like `pb_utils.triton_string_to_numpy("TYPE_BF16")` can handle it (since NumPy does not have bfloat16).
Expected behavior
Documented example of using bfloat16 models with LoRA adapters.
Actual behavior
Examples cover FP16 only.