wangye805 changed the title from "[TE] Study how TP is implemented in Transformer Engine" to "[TE] Investigate parallelism implementation in Transformer Engine" on Apr 26, 2024.
2024/04/26
1). Tensor parallelism (TP) investigation:
1.1). The TP implementation follows the Megatron-LM paper: https://arxiv.org/pdf/1909.08053.pdf.
1.2). How TP and data parallelism (DP) work together is described in https://huggingface.co/transformers/v4.9.2/parallelism.html.
1.3). TP has been supported in TE since the beginning of the NVTE repo (v0.1): NVIDIA/TransformerEngine@996ea16
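The Megatron-LM partitioning in 1.1) can be sketched numerically. This is a minimal NumPy simulation (not TE code; all shapes and names are illustrative): fc1's weight is split by columns, fc2's by rows, so each rank computes an independent GEMM-GeLU-GEMM chain and a final all-reduce (here a plain sum) recovers the full output.

```python
import numpy as np

rng = np.random.default_rng(0)
tp = 2                                   # tensor-parallel world size
x = rng.normal(size=(4, 8))              # [tokens, hidden]
A = rng.normal(size=(8, 16))             # fc1 weight, split column-wise
B = rng.normal(size=(16, 8))             # fc2 weight, split row-wise

def gelu(t):
    # tanh approximation of GeLU (elementwise, so it commutes with column sharding)
    return 0.5 * t * (1 + np.tanh(np.sqrt(2 / np.pi) * (t + 0.044715 * t**3)))

# Reference: single-device forward pass.
ref = gelu(x @ A) @ B

# Per-rank work: rank r holds a column shard of A and the matching row shard of B.
shard = A.shape[1] // tp
partials = [gelu(x @ A[:, r*shard:(r+1)*shard]) @ B[r*shard:(r+1)*shard, :]
            for r in range(tp)]

# The all-reduce (sum over ranks) recovers the full output.
out = sum(partials)
assert np.allclose(out, ref)
```

The key point is that GeLU is elementwise, so the column-parallel fc1 needs no communication before the activation; only one all-reduce per MLP block is required.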
2). Sequence parallelism (SP) investigation:
2.1). The SP implementation follows the paper https://arxiv.org/pdf/2205.05198.pdf.
2.2). SP is usually applied together with TP and shares the same TP group size.
2.3). SP has also been supported in TE since NVTE v0.1.
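The idea behind SP in 2.1) can be sketched as follows. A hedged NumPy simulation (not TE code; shapes are illustrative): layernorm is computed per token, so the activation can be sharded along the sequence dimension across the TP group, with an all-gather rebuilding the full sequence only right before the tensor-parallel GEMM.

```python
import numpy as np

rng = np.random.default_rng(0)
tp = 2
seq, hidden = 8, 4
x = rng.normal(size=(seq, hidden))       # [sequence, hidden]

def layernorm(t, eps=1e-5):
    mu = t.mean(axis=-1, keepdims=True)
    var = t.var(axis=-1, keepdims=True)
    return (t - mu) / np.sqrt(var + eps)

# Reference: layernorm over the full sequence on one device.
ref = layernorm(x)

# Sequence parallel: each rank normalizes only its own sequence shard...
chunk = seq // tp
shards = [layernorm(x[r*chunk:(r+1)*chunk]) for r in range(tp)]

# ...and an all-gather along the sequence dim (here a concat) rebuilds the
# full activation before the tensor-parallel GEMM.
out = np.concatenate(shards, axis=0)
assert np.allclose(out, ref)
```

Because normalization touches only the hidden dimension, sharding the sequence dimension is exact, and the layernorm/dropout activations are stored sharded, cutting their memory by the TP group size.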
3). Userbuffers were introduced in NVTE to improve SP runtime: they overlap the communication (all-gather and reduce-scatter) with the GEMMs in layernorm-linear, linear, fc1, and fc2 to achieve higher GPU utilization.
3.1). Dependencies: MPI, NCCL, GDRCopy.
3.2). Contains device-specific code involving NVLink registers and asm.
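The overlap in 3) rests on chunking: the all-gather is done in pieces, and the GEMM for one chunk runs while the next chunk is still in flight. A conceptual NumPy sketch (sequential here, not the actual userbuffers implementation; names are illustrative) showing that the chunked GEMM reproduces the monolithic result:

```python
import numpy as np

rng = np.random.default_rng(0)
n_chunks = 4
x = rng.normal(size=(16, 8))     # full (all-gathered) activation, [seq, hidden]
W = rng.normal(size=(8, 8))

# Reference: blocking all-gather followed by one monolithic GEMM.
ref = x @ W

# Pipelined version: as each sequence chunk "arrives" from the ring
# all-gather, its partial GEMM is issued. On the GPU, chunk i+1's
# communication runs concurrently with chunk i's GEMM, hiding comm latency.
rows = x.shape[0] // n_chunks
out = np.empty_like(ref)
for i, chunk in enumerate(np.split(x, n_chunks, axis=0)):
    out[i*rows:(i+1)*rows] = chunk @ W

assert np.allclose(out, ref)
```

The userbuffers machinery (MPI/NCCL bootstrap, GDRCopy, NVLink register pokes) exists to make the per-chunk handoff between the comm and GEMM kernels cheap enough that this pipelining actually pays off.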
4). Atomic GEMM further optimizes the communication/GEMM overlap in SP: https://docs.nvidia.com/cuda/cublas/#atomics-synchronization.
4.1). It requires cuBLASLt or hipBLASLt to support this feature by adding a counter plus producer and consumer modes to the API.
4.2). In NVTE we need to create the producer, consumer, and counter implementations, which will probably require asm code.
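The producer/consumer counter pattern in 4.1)-4.2) can be sketched on the host with threads. This is a hypothetical Python analogue, not the cuBLASLt/hipBLASLt mechanism itself (there the counter lives in device memory and the kernels poll it): the comm "producer" bumps an atomic counter as each tile lands in the shared buffer, and the GEMM "consumer" waits on that counter before touching the tile.

```python
import threading
import numpy as np

rng = np.random.default_rng(0)
n_tiles, tile = 4, 4
x = rng.normal(size=(n_tiles * tile, 8))
W = rng.normal(size=(8, 8))

counter = 0                          # tiles communicated so far
cv = threading.Condition()
buf = np.zeros_like(x)               # stand-in for the shared userbuffer
out = np.empty((n_tiles * tile, 8))

def producer():
    # stands in for the communication kernel filling the buffer tile by tile
    global counter
    for i in range(n_tiles):
        buf[i*tile:(i+1)*tile] = x[i*tile:(i+1)*tile]
        with cv:
            counter += 1             # counter bump after tile i has landed
            cv.notify_all()

def consumer():
    # stands in for the atomic GEMM consuming tiles as they become ready
    for i in range(n_tiles):
        with cv:
            cv.wait_for(lambda: counter > i)
        out[i*tile:(i+1)*tile] = buf[i*tile:(i+1)*tile] @ W

threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
for t in threads: t.start()
for t in threads: t.join()
assert np.allclose(out, x @ W)
```

On the GPU this synchronization has no host round-trip, which is why implementing the counter poll/bump on the NVTE side will likely need inline asm (or PTX-level atomics) rather than ordinary CUDA C++.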