
[Paper BUG] About descriptions of the original MTP, little suggestion #252

chuhac opened this issue Jan 10, 2025 · 2 comments

chuhac commented Jan 10, 2025

Thanks to everyone at DeepSeek who truly values technology for this great project. I'm currently reproducing MTP myself to draw some know-how conclusions, and I have a suggestion about a possible clarification.

The bug in the paper
In Section 2.2, line 6 [1]:

parallelly predicts 𝐷 additional tokens using independent output heads

I fully understand that your main claim is the "parallel" prediction in contrast to your "sequential" prediction. However, after checking Meta's MTP paper [2], Section 2 (Column 2, Page 2), line 7:

n independent output heads implemented in terms of transformer layers $f_{h_i}$, and a shared unembedding matrix $f_u$

They use a shared "unembedding head", i.e., the lm_head or output_layer module, while the parallel final layers are independent. In my implementation, the model's final norm block is also shared. So I suggest that the wording here could be changed to:

Different from Gloeckle et al. (2024), which parallelly predicts 𝐷 additional tokens using independent MTP transformer blocks before a shared output head, we let the MTP transformer blocks sequentially predict additional tokens at each prediction depth, keeping the complete causal chain.

This also fits well with your Equation (23).
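To make the distinction concrete, here is a toy numpy sketch of the two head layouts; all shapes and names here are illustrative stand-ins, not the actual architectures:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab, n_heads = 8, 16, 4  # toy sizes, hypothetical

# Backbone hidden state for one position (stand-in for the trunk output).
h = rng.standard_normal(d_model)

# Meta-style MTP (Gloeckle et al., 2024): n independent head layers f_{h_i},
# but ONE shared unembedding matrix f_u (the lm_head).
head_layers = [rng.standard_normal((d_model, d_model)) for _ in range(n_heads)]
f_u = rng.standard_normal((d_model, vocab))  # shared unembedding

# All D extra tokens are predicted in parallel from the same trunk state.
logits_parallel = [h @ W @ f_u for W in head_layers]

# DeepSeek-V3-style MTP: the MTP blocks run sequentially, each depth
# conditioning on the previous depth's representation (causal chain kept).
z = h
logits_sequential = []
for W in head_layers:
    z = z @ W                      # depth-k MTP block updates the representation
    logits_sequential.append(z @ f_u)

print(len(logits_parallel), len(logits_sequential))  # 4 4
```

Note that in both layouts the unembedding `f_u` is shared; what differs is whether each depth sees the previous depth's output.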

[1] Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., ... & Piao, Y. (2024). DeepSeek-V3 Technical Report. arXiv preprint arXiv:2412.19437.
[2] Gloeckle, F., Idrissi, B. Y., Rozière, B., Lopez-Paz, D., & Synnaeve, G. (2024). Better & faster large language models via multi-token prediction. arXiv preprint arXiv:2404.19737.

Best,


16x3b commented Jan 12, 2025

You seem to grasp the concept of MTP well. What is the novelty and hubbub of MTP all about? I'm not sure I understand the premise. Are you able to explain the concept in simple terms for the uninitiated?


chuhac commented Jan 12, 2025

MTP seeks to help the model predict more than one token in a single forward pass, which I call the "ability to plan ahead"; this may be especially useful in math or code domains.

About training: MTP makes the model learn to generate several tokens and plan the next k tokens in advance, which gives it additional capabilities on top of next-(one)-token prediction. To achieve this, DeepSeek-V3 and Meta's MTP insert extra learnable parameters that use the hidden states of the backbone decoder to predict the k tokens that follow. However, these additional modules place demands on pipeline parallelism and require additional optimization.
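The training objective described above can be sketched as an average of per-depth cross-entropies, where the depth-k head is supervised with the token k+1 positions ahead. This is a hypothetical plain-numpy helper for illustration, not DeepSeek's actual code:

```python
import numpy as np

def mtp_loss(logits_per_depth, targets_per_depth):
    """Average cross-entropy over D prediction depths.

    logits_per_depth: list of (vocab,) arrays, one per MTP depth
    targets_per_depth: target token ids; the depth-k target is token t+k+1
    (Hypothetical helper for illustration only.)
    """
    losses = []
    for logits, tgt in zip(logits_per_depth, targets_per_depth):
        m = logits.max()
        logp = logits - (m + np.log(np.exp(logits - m).sum()))  # log-softmax
        losses.append(-logp[tgt])
    return float(np.mean(losses))
```

For uniform logits over a vocabulary of size V, each depth contributes log(V), so the loss reduces to ordinary next-token cross-entropy when D = 1.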

About inference: Though DeepSeek didn't release the weights of the MTP head, the multi-head prediction can produce several tokens in a single forward pass, which can be used for speculative decoding to accelerate inference throughput. If anyone wants to give it a try, Meta has released their MTP weights here: https://huggingface.co/facebook/multi-token-prediction
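The speculative-decoding use can be sketched as a draft-and-verify loop: the MTP heads draft several tokens in one pass, and the main model accepts the longest matching prefix. The acceptance rule below is simplified greedy matching (a toy sketch, not any particular paper's sampling scheme):

```python
def verify_draft(draft_tokens, verify_next):
    """Accept the longest prefix of MTP-drafted tokens that the main model
    would also produce greedily, plus the main model's correction.

    draft_tokens: tokens proposed by the MTP heads in one forward pass
    verify_next:  callable returning the main model's greedy next token
                  given the accepted prefix (hypothetical stand-in)
    """
    accepted = []
    for t in draft_tokens:
        expected = verify_next(accepted)
        if expected != t:
            accepted.append(expected)  # main model's token replaces the reject
            return accepted
        accepted.append(t)
    return accepted

# Toy usage: the "main model" deterministically emits 5, 7, 2, 9.
main = lambda prefix: [5, 7, 2, 9][len(prefix)]
print(verify_draft([5, 7, 9], main))  # [5, 7, 2]
```

When every drafted token matches, all D tokens are emitted from one verification pass, which is where the throughput gain comes from.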
