
Question about MLA with TP #283

Open
ipiszy opened this issue Jan 14, 2025 · 0 comments

ipiszy commented Jan 14, 2025

The DeepSeek paper mentions 4-way TP for the MLA attention layer at inference time. However, from the code, it seems that each card keeps its own full copy of the KV projection (e.g. `self.wkv_a = Linear(self.dim, self.kv_lora_rank + self.qk_rope_head_dim)`), which produces the KV latent that is later expanded to the per-head K/V, and neither the KV latent nor this linear module appears to be sharded across TP ranks.

Is the KV duplicated across all TP ranks in this case, or does each card hold different KVs?
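
To make the question concrete, here is a minimal sketch of the layout I am trying to confirm. This is not the repo's actual code: `ReplicatedLinear`, `ColumnParallelLinear`, the dimensions, and `tp_size` are illustrative placeholders. The point is that if `wkv_a` stays replicated and only the latent-to-heads up-projection (`wkv_b`) is column-parallel, then every rank would compute (and cache) the same KV latent while holding K/V for only its own subset of heads.

```python
import torch
import torch.nn as nn

# Illustrative placeholders, not the actual DeepSeek classes.
class ReplicatedLinear(nn.Linear):
    """Full weight kept on every TP rank; output is identical across ranks."""


class ColumnParallelLinear(nn.Module):
    """Output dimension split across TP ranks; each rank holds out_features // tp_size columns."""
    def __init__(self, in_features, out_features, tp_size):
        super().__init__()
        assert out_features % tp_size == 0
        self.linear = nn.Linear(in_features, out_features // tp_size)

    def forward(self, x):
        # Returns only this rank's shard of the output (no gather, for illustration).
        return self.linear(x)


# Hypothetical dimensions, roughly in the spirit of the MLA config.
dim, kv_lora_rank, qk_rope_head_dim = 1024, 128, 32
n_heads, qk_nope_head_dim, v_head_dim = 16, 64, 64
tp_size = 4

# Hidden states -> KV latent: a plain (replicated) Linear, so every rank
# would compute, and cache, the same latent.
wkv_a = ReplicatedLinear(dim, kv_lora_rank + qk_rope_head_dim)

# KV latent -> per-head K_nope / V: column-parallel, so each rank only
# materializes K/V for n_heads // tp_size heads.
wkv_b = ColumnParallelLinear(kv_lora_rank, n_heads * (qk_nope_head_dim + v_head_dim), tp_size)

x = torch.randn(2, 8, dim)                    # (batch, seq, dim)
latent = wkv_a(x)                             # identical on every rank -> duplicated KV cache?
kv_local = wkv_b(latent[..., :kv_lora_rank])  # this rank's 4-of-16 heads when tp_size = 4
print(latent.shape, kv_local.shape)           # (2, 8, 160) (2, 8, 512)
```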
