
[New Model]: Support Tencent-Hunyuan-Large #10043

Open
1 task done
mgoin opened this issue Nov 5, 2024 · 2 comments
Labels
new model Requests to new models

Comments

mgoin (Member) commented Nov 5, 2024

The model to consider.

https://huggingface.co/tencent/Tencent-Hunyuan-Large

Tencent released a 389B-parameter MoE with only 52B activated parameters, which they report beats Llama 3.1 405B.
There are three checkpoints on the model card: Pretrain, Instruct, and Instruct-FP8 (AutoFP8 format).

Some notable features of the model:

  • High-Quality Synthetic Data: By enhancing training with synthetic data, Hunyuan-Large can learn richer representations, handle long-context inputs, and generalize better to unseen data.

  • KV Cache Compression: Utilizes Grouped Query Attention (GQA) and Cross-Layer Attention (CLA) strategies to significantly reduce memory usage and computational overhead of KV caches, improving inference throughput.

  • Expert-Specific Learning Rate Scaling: Sets different learning rates for different experts to ensure each sub-model effectively learns from the data and contributes to overall performance.

  • Long-Context Processing Capability: The pre-trained model supports text sequences up to 256K, and the Instruct model supports up to 128K, significantly enhancing the ability to handle long-context tasks.

  • Extensive Benchmarking: Conducts extensive experiments across various languages and tasks to validate the practical effectiveness and safety of Hunyuan-Large.
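The KV-cache-compression bullet above combines two orthogonal savings: GQA shrinks the number of KV heads per layer, and CLA lets groups of layers share one set of K/V tensors. A back-of-envelope sketch (all head counts and sizes below are hypothetical, not Hunyuan-Large's real config):

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   dtype_bytes=2, cla_share_factor=1):
    """Rough KV-cache footprint for one sequence.

    With CLA, only one layer in each group of `cla_share_factor`
    layers stores its own K/V; the rest reuse it.
    """
    layers_with_kv = num_layers // cla_share_factor
    # 2 accounts for storing both K and V.
    return 2 * layers_with_kv * num_kv_heads * head_dim * seq_len * dtype_bytes

# Hypothetical numbers for illustration only:
baseline     = kv_cache_bytes(64, 64, 128, 4096)                      # full MHA
with_gqa     = kv_cache_bytes(64, 8, 128, 4096)                       # GQA: 8 KV heads
with_gqa_cla = kv_cache_bytes(64, 8, 128, 4096, cla_share_factor=2)   # + 2-way CLA

print(with_gqa / baseline)        # 0.125 — GQA alone keeps 8/64 of the cache
print(with_gqa_cla / baseline)    # 0.0625 — 2-way CLA halves that again
```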

I think the inclusion of Cross-Layer Attention (CLA), described in https://arxiv.org/abs/2405.12981 and by Character.AI, is the most interesting element.

The closest model vllm already supports.

Since there is a shared expert at each MoE MLP, I think DeepSeekV2 is the closest comparison.
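The DeepSeekV2 comparison refers to the "shared expert" pattern: every token goes through an always-on shared expert in addition to its top-k routed experts. A toy sketch of that routing, using plain Python scalars and lambdas as stand-ins for real tensors and MLPs (all names and numbers here are illustrative, not from either model's code):

```python
def moe_with_shared_expert(x, routed_experts, shared_expert, router, top_k=2):
    # Every token always passes through the shared expert...
    out = shared_expert(x)
    # ...plus a normalized weighted sum of its top-k routed experts.
    scores = router(x)
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    total = sum(scores[i] for i in top)
    for i in top:
        out += (scores[i] / total) * routed_experts[i](x)
    return out

# Hypothetical toy experts: each just scales its input.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
shared = lambda x: 0.5 * x
router = lambda x: [0.1, 0.2, 0.4, 0.3]  # fixed scores for illustration
print(moe_with_shared_expert(1.0, experts, shared, router))
```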

What's your difficulty of supporting the model you want?

Medium to high difficulty. I believe the difficulty lies in supporting CLA; most other features should already be implementable.

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
mgoin added the new model label Nov 5, 2024
simon-mo (Collaborator) commented Nov 5, 2024

@heheda12345 has an initial prototype to support CLA. We can put down a timeline of around EOY/early next year.

heheda12345 (Collaborator) commented
Hunyuan's Hugging Face code implements cross-layer attention as encoder-decoder cross-attention, with the KV of the previous layer as the encoder side and the Q of the current layer as the decoder side. I think we can implement it easily without my new prototype.
hf reference code
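The scheme described above can be sketched in a few lines: layers in a sharing group reuse the K/V produced by the group's first layer, while every layer still computes its own queries. This is a minimal toy illustration with scalar stand-ins for projections and attention, not the HF reference code; the 2-way sharing factor and all class/method names are assumptions:

```python
class ToyLayer:
    """Scalar stand-in for a transformer layer's projections."""
    def __init__(self, scale):
        self.scale = scale

    def project_kv(self, x):
        return x * self.scale   # stand-in for the K/V projection

    def project_q(self, x):
        return x + 1.0          # stand-in for the Q projection

    def attend(self, q, kv):
        return q + kv           # stand-in for softmax(QK^T)V

def run_layers(x, layers, cla_share_factor=2):
    kv = None
    for idx, layer in enumerate(layers):
        if idx % cla_share_factor == 0:
            kv = layer.project_kv(x)   # this layer owns fresh K/V
        q = layer.project_q(x)         # every layer has its own Q
        x = layer.attend(q, kv)        # cross-attend against the shared K/V
    return x

print(run_layers(1.0, [ToyLayer(2.0), ToyLayer(3.0)]))
```

The point of the sketch is that the second layer never calls `project_kv`, mirroring how CLA layers skip KV projection and cache writes entirely.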
