The model to consider.

https://huggingface.co/tencent/Tencent-Hunyuan-Large

Tencent released a 389B MoE with only 52B activated parameters that beats Llama 3.1 405B. There are three checkpoints in the model card: Pretrain, Instruct, and Instruct-FP8 (AutoFP8 format).

Some notable features of the model:

High-Quality Synthetic Data: By enhancing training with synthetic data, Hunyuan-Large learns richer representations, handles long-context inputs, and generalizes better to unseen data.
KV Cache Compression: Uses Grouped-Query Attention (GQA) and Cross-Layer Attention (CLA) to significantly reduce the memory usage and computational overhead of the KV cache, improving inference throughput.
Expert-Specific Learning Rate Scaling: Sets different learning rates for different experts so that each sub-model learns effectively from the data and contributes to overall performance.
Long-Context Processing Capability: The pre-trained model supports text sequences up to 256K tokens, and the Instruct model supports up to 128K, significantly enhancing the ability to handle long-context tasks.
Extensive Benchmarking: Extensive experiments across various languages and tasks validate the practical effectiveness and safety of Hunyuan-Large.

I think the inclusion of Cross-Layer Attention (CLA), described in https://arxiv.org/abs/2405.12981 and by Character.AI, is the most interesting element.

The closest model vllm already supports.

Since there is a shared expert at each MoE MLP, I think DeepSeekV2 is the closest comparison (a rough sketch of that block structure follows below).

What's your difficulty of supporting the model you want?

Medium to high difficulty. I believe the difficulty lies in supporting CLA; most other features should already be implementable.

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
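For the DeepSeek-V2 comparison above, here is a minimal PyTorch sketch of an MoE MLP that pairs one shared expert with routed experts. It is an illustration of the structure, not Hunyuan's or vLLM's actual implementation; the class names, hidden sizes, expert count, and top-1 routing are assumptions made for the example.

```python
# Minimal sketch (not the actual Hunyuan/vLLM code) of an MoE MLP that combines
# a shared expert, applied to every token, with top-k routed experts.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MLP(nn.Module):
    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x):
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))


class SharedExpertMoE(nn.Module):
    """Hypothetical shared-expert MoE block (sizes and expert count are placeholders)."""

    def __init__(self, hidden_size=4096, intermediate_size=14336,
                 num_experts=16, top_k=1):
        super().__init__()
        self.shared_expert = MLP(hidden_size, intermediate_size)
        self.experts = nn.ModuleList(
            MLP(hidden_size, intermediate_size) for _ in range(num_experts))
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x):                       # x: [num_tokens, hidden_size]
        out = self.shared_expert(x)             # shared path, applied to all tokens
        weights = F.softmax(self.router(x), dim=-1)
        topk_w, topk_idx = weights.topk(self.top_k, dim=-1)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, k] == e      # tokens routed to expert e
                if mask.any():
                    out[mask] += topk_w[mask, k, None] * expert(x[mask])
        return out
```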
Hunyuan's Hugging Face code implements cross-layer attention as encoder-decoder-style cross attention, with the KV of the previous layer acting as the encoder states and the Q of the current layer as the decoder query. I think we can implement it easily without my new prototype. hf reference code
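For reference, a minimal sketch of that pattern follows. It uses placeholder names (e.g. `cross_layer`, `prev_kv`) and plain multi-head attention rather than Hunyuan's GQA, so it is not the HF or vLLM API; the point is only that a CLA layer skips its own K/V projections and attends with its own Q against the K/V returned by the preceding KV-producing layer. For vLLM, this presumably means a CLA layer would read the KV-cache blocks owned by the layer below it instead of allocating its own, which is where the cache savings come from.

```python
# Sketch of the CLA pattern (placeholder names, not the HF/vLLM API):
# a cross-layer attention layer reuses the K/V produced by the previous layer
# instead of projecting its own, so only every other layer writes to the KV cache.
import torch
import torch.nn as nn


class CLAAttention(nn.Module):
    def __init__(self, hidden_size=4096, num_heads=32, cross_layer=False):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.cross_layer = cross_layer          # True -> reuse the previous layer's KV
        self.q_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        if not cross_layer:                     # only KV-producing layers own K/V projections
            self.k_proj = nn.Linear(hidden_size, hidden_size, bias=False)
            self.v_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        self.o_proj = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, x, prev_kv=None):
        bsz, seqlen, _ = x.shape
        q = self.q_proj(x).view(bsz, seqlen, self.num_heads, self.head_dim).transpose(1, 2)
        if self.cross_layer:
            k, v = prev_kv                      # encoder-decoder style: KV come from the layer below
        else:
            k = self.k_proj(x).view(bsz, seqlen, self.num_heads, self.head_dim).transpose(1, 2)
            v = self.v_proj(x).view(bsz, seqlen, self.num_heads, self.head_dim).transpose(1, 2)
        out = nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(bsz, seqlen, -1)
        # return KV so the next (cross-layer) block can reuse it
        return self.o_proj(out), (k, v)


# Usage: layer i produces KV, layer i+1 reuses it.
# attn0 = CLAAttention(cross_layer=False); attn1 = CLAAttention(cross_layer=True)
# y0, kv = attn0(x0); y1, _ = attn1(x1, prev_kv=kv)
```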