The model to consider.

https://huggingface.co/tencent/Tencent-Hunyuan-Large

Tencent released a 389B MoE with only 52B activated parameters that beats Llama 3.1 405B. There are three checkpoints in the model card: Pretrain, Instruct, and Instruct-FP8 (AutoFP8 format).

Some notable features of the model:

High-Quality Synthetic Data: By enhancing training with synthetic data, Hunyuan-Large learns richer representations, handles long-context inputs, and generalizes better to unseen data.
KV Cache Compression: Uses Grouped-Query Attention (GQA) and Cross-Layer Attention (CLA) to significantly reduce the memory usage and computational overhead of the KV cache, improving inference throughput.
Expert-Specific Learning Rate Scaling: Sets different learning rates for different experts so that each sub-model learns effectively from the data and contributes to overall performance.
Long-Context Processing Capability: The pre-trained model supports text sequences up to 256K tokens, and the Instruct model supports up to 128K, significantly enhancing the ability to handle long-context tasks.
Extensive Benchmarking: Extensive experiments across various languages and tasks validate the practical effectiveness and safety of Hunyuan-Large.

I think the inclusion of Cross-Layer Attention (CLA), described in https://arxiv.org/abs/2405.12981 and by Character.AI, is the most interesting element.

The closest model vllm already supports.

Since there is a shared expert at each MoE MLP, I think DeepSeekV2 is the closest comparison (a rough sketch of that block structure follows below).

What's your difficulty of supporting the model you want?

Medium to high difficulty. I believe the difficulty lies in supporting CLA; most other features should already be implementable.

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
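For the DeepSeek-V2 comparison above, here is a minimal PyTorch sketch of an MoE MLP that pairs one shared expert with routed experts. It is an illustration of the structure, not Hunyuan's or vLLM's actual implementation; the class names, hidden sizes, expert count, and top-1 routing are assumptions made for the example.

```python
# Minimal sketch (not the actual Hunyuan/vLLM code) of an MoE MLP that combines
# a shared expert, applied to every token, with top-k routed experts.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MLP(nn.Module):
    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x):
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))


class SharedExpertMoE(nn.Module):
    """Hypothetical shared-expert MoE block (sizes and expert count are placeholders)."""

    def __init__(self, hidden_size=4096, intermediate_size=14336,
                 num_experts=16, top_k=1):
        super().__init__()
        self.shared_expert = MLP(hidden_size, intermediate_size)
        self.experts = nn.ModuleList(
            MLP(hidden_size, intermediate_size) for _ in range(num_experts))
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x):                       # x: [num_tokens, hidden_size]
        out = self.shared_expert(x)             # shared path, applied to all tokens
        weights = F.softmax(self.router(x), dim=-1)
        topk_w, topk_idx = weights.topk(self.top_k, dim=-1)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, k] == e      # tokens routed to expert e
                if mask.any():
                    out[mask] += topk_w[mask, k, None] * expert(x[mask])
        return out
```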
Hunyuan's Hugging Face code implements cross-layer attention as encoder-decoder-style cross attention, with the KV of the previous layer acting as the encoder states and the Q of the current layer as the decoder query. I think we can implement it easily without my new prototype. hf reference code
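For reference, a minimal sketch of that pattern follows. It uses placeholder names (e.g. `cross_layer`, `prev_kv`) and plain multi-head attention rather than Hunyuan's GQA, so it is not the HF or vLLM API; the point is only that a CLA layer skips its own K/V projections and attends with its own Q against the K/V returned by the preceding KV-producing layer. For vLLM, this presumably means a CLA layer would read the KV-cache blocks owned by the layer below it instead of allocating its own, which is where the cache savings come from.

```python
# Sketch of the CLA pattern (placeholder names, not the HF/vLLM API):
# a cross-layer attention layer reuses the K/V produced by the previous layer
# instead of projecting its own, so only every other layer writes to the KV cache.
import torch
import torch.nn as nn


class CLAAttention(nn.Module):
    def __init__(self, hidden_size=4096, num_heads=32, cross_layer=False):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.cross_layer = cross_layer          # True -> reuse the previous layer's KV
        self.q_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        if not cross_layer:                     # only KV-producing layers own K/V projections
            self.k_proj = nn.Linear(hidden_size, hidden_size, bias=False)
            self.v_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        self.o_proj = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, x, prev_kv=None):
        bsz, seqlen, _ = x.shape
        q = self.q_proj(x).view(bsz, seqlen, self.num_heads, self.head_dim).transpose(1, 2)
        if self.cross_layer:
            k, v = prev_kv                      # encoder-decoder style: KV come from the layer below
        else:
            k = self.k_proj(x).view(bsz, seqlen, self.num_heads, self.head_dim).transpose(1, 2)
            v = self.v_proj(x).view(bsz, seqlen, self.num_heads, self.head_dim).transpose(1, 2)
        out = nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(bsz, seqlen, -1)
        # return KV so the next (cross-layer) block can reuse it
        return self.o_proj(out), (k, v)


# Usage: layer i produces KV, layer i+1 reuses it.
# attn0 = CLAAttention(cross_layer=False); attn1 = CLAAttention(cross_layer=True)
# y0, kv = attn0(x0); y1, _ = attn1(x1, prev_kv=kv)
```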