The DeepSeek paper mentions 4-way TP for the MLA attention layer at inference time. However, from the code, it seems that each card has its own Linear module, e.g.:
`DeepSeek-V3/inference/model.py`, line 427 (commit `ee4c4ea`)
Is the KV cache duplicated across all TP ranks in this case, or does each card hold different KVs?
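For concreteness, here is a minimal sketch of the two layouts I'm asking about. The names, shapes, and constants below (`tp_world_size`, `n_heads`, `kv_lora_rank`, the `kv_cache_*` tensors) are my own illustration, not taken from `inference/model.py`:

```python
# Hypothetical sketch of two possible KV-cache layouts under 4-way TP for MLA.
# All names and shapes are illustrative, not the actual implementation.
import torch

tp_world_size = 4        # 4-way TP as described in the paper
n_heads = 128            # total attention heads
n_local_heads = n_heads // tp_world_size
kv_lora_rank = 512       # dim of MLA's compressed per-token KV latent
seq_len = 1024

# Layout A: latent KV cache replicated on every TP rank.
# Each rank stores the full [seq_len, kv_lora_rank] latent and up-projects it
# with its own shard of the KV projection, covering only its n_local_heads heads.
kv_cache_a = torch.zeros(seq_len, kv_lora_rank)  # identical on all ranks

# Layout B: latent KV cache sharded across ranks.
# Each rank stores a [seq_len, kv_lora_rank // tp_world_size] slice, saving
# memory but requiring communication to reconstruct full keys/values.
kv_cache_b = torch.zeros(seq_len, kv_lora_rank // tp_world_size)

print(f"per-rank cache floats: A={kv_cache_a.numel()}, B={kv_cache_b.numel()}")
```

In other words: does the implementation follow layout A (replicated latent, sharded heads) or layout B (sharded latent)?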