wangye805 changed the title from "[TE] Study how TP is implemented in Transformer Engine" to "[TE] Investigate parallelism implementation in Transformer Engine" on Apr 26, 2024.
2024/04/26
1). Tensor parallelism (TP) investigation:
1.1). The TP implementation follows the Megatron-LM paper: https://arxiv.org/pdf/1909.08053.pdf.
1.2). How TP and data parallelism (DP) work together is described in https://huggingface.co/transformers/v4.9.2/parallelism.html.
1.3). TP has been supported in TE since the beginning of the NVTE repo (v0.1): NVIDIA/TransformerEngine@996ea16
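The Megatron-LM partitioning in 1.1) can be sketched numerically. This is a minimal NumPy simulation (not TE code; all shapes and names are illustrative): fc1's weight is split by columns, fc2's by rows, so each rank computes an independent GEMM-GeLU-GEMM chain and a final all-reduce (here a plain sum) recovers the full output.

```python
import numpy as np

rng = np.random.default_rng(0)
tp = 2                                   # tensor-parallel world size
x = rng.normal(size=(4, 8))              # [tokens, hidden]
A = rng.normal(size=(8, 16))             # fc1 weight, split column-wise
B = rng.normal(size=(16, 8))             # fc2 weight, split row-wise

def gelu(t):
    # tanh approximation of GeLU (elementwise, so it commutes with column sharding)
    return 0.5 * t * (1 + np.tanh(np.sqrt(2 / np.pi) * (t + 0.044715 * t**3)))

# Reference: single-device forward pass.
ref = gelu(x @ A) @ B

# Per-rank work: rank r holds a column shard of A and the matching row shard of B.
shard = A.shape[1] // tp
partials = [gelu(x @ A[:, r*shard:(r+1)*shard]) @ B[r*shard:(r+1)*shard, :]
            for r in range(tp)]

# The all-reduce (sum over ranks) recovers the full output.
out = sum(partials)
assert np.allclose(out, ref)
```

The key point is that GeLU is elementwise, so the column-parallel fc1 needs no communication before the activation; only one all-reduce per MLP block is required.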
2). Sequence parallelism (SP) investigation:
2.1). The SP implementation follows the paper https://arxiv.org/pdf/2205.05198.pdf.
2.2). SP is usually applied together with TP and shares the same TP group size.
2.3). SP has also been supported in TE since NVTE v0.1.
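The idea behind SP in 2.1) can be sketched as follows. A hedged NumPy simulation (not TE code; shapes are illustrative): layernorm is computed per token, so the activation can be sharded along the sequence dimension across the TP group, with an all-gather rebuilding the full sequence only right before the tensor-parallel GEMM.

```python
import numpy as np

rng = np.random.default_rng(0)
tp = 2
seq, hidden = 8, 4
x = rng.normal(size=(seq, hidden))       # [sequence, hidden]

def layernorm(t, eps=1e-5):
    mu = t.mean(axis=-1, keepdims=True)
    var = t.var(axis=-1, keepdims=True)
    return (t - mu) / np.sqrt(var + eps)

# Reference: layernorm over the full sequence on one device.
ref = layernorm(x)

# Sequence parallel: each rank normalizes only its own sequence shard...
chunk = seq // tp
shards = [layernorm(x[r*chunk:(r+1)*chunk]) for r in range(tp)]

# ...and an all-gather along the sequence dim (here a concat) rebuilds the
# full activation before the tensor-parallel GEMM.
out = np.concatenate(shards, axis=0)
assert np.allclose(out, ref)
```

Because normalization touches only the hidden dimension, sharding the sequence dimension is exact, and the layernorm/dropout activations are stored sharded, cutting their memory by the TP group size.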
3). Userbuffers were introduced in NVTE to improve SP runtime: they overlap the communication (all-gather and reduce-scatter) with the GEMMs in layernorm-linear, linear, fc1, and fc2 to achieve higher GPU utilization.
3.1). Dependencies: MPI, NCCL, GDRCopy.
3.2). Contains device-specific code involving NVLink registers and asm.
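The overlap in 3) rests on chunking: the all-gather is done in pieces, and the GEMM for one chunk runs while the next chunk is still in flight. A conceptual NumPy sketch (sequential here, not the actual userbuffers implementation; names are illustrative) showing that the chunked GEMM reproduces the monolithic result:

```python
import numpy as np

rng = np.random.default_rng(0)
n_chunks = 4
x = rng.normal(size=(16, 8))     # full (all-gathered) activation, [seq, hidden]
W = rng.normal(size=(8, 8))

# Reference: blocking all-gather followed by one monolithic GEMM.
ref = x @ W

# Pipelined version: as each sequence chunk "arrives" from the ring
# all-gather, its partial GEMM is issued. On the GPU, chunk i+1's
# communication runs concurrently with chunk i's GEMM, hiding comm latency.
rows = x.shape[0] // n_chunks
out = np.empty_like(ref)
for i, chunk in enumerate(np.split(x, n_chunks, axis=0)):
    out[i*rows:(i+1)*rows] = chunk @ W

assert np.allclose(out, ref)
```

The userbuffers machinery (MPI/NCCL bootstrap, GDRCopy, NVLink register pokes) exists to make the per-chunk handoff between the comm and GEMM kernels cheap enough that this pipelining actually pays off.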
4). Atomic GEMM further optimizes the communication/GEMM overlap in SP: https://docs.nvidia.com/cuda/cublas/#atomics-synchronization.
4.1). It requires cuBLASLt or hipBLASLt to support this feature by adding a counter plus producer and consumer modes to the API.
4.2). In NVTE we need to create the producer, consumer, and counter implementations, which will probably require asm code.
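The producer/consumer counter pattern in 4.1)-4.2) can be sketched on the host with threads. This is a hypothetical Python analogue, not the cuBLASLt/hipBLASLt mechanism itself (there the counter lives in device memory and the kernels poll it): the comm "producer" bumps an atomic counter as each tile lands in the shared buffer, and the GEMM "consumer" waits on that counter before touching the tile.

```python
import threading
import numpy as np

rng = np.random.default_rng(0)
n_tiles, tile = 4, 4
x = rng.normal(size=(n_tiles * tile, 8))
W = rng.normal(size=(8, 8))

counter = 0                          # tiles communicated so far
cv = threading.Condition()
buf = np.zeros_like(x)               # stand-in for the shared userbuffer
out = np.empty((n_tiles * tile, 8))

def producer():
    # stands in for the communication kernel filling the buffer tile by tile
    global counter
    for i in range(n_tiles):
        buf[i*tile:(i+1)*tile] = x[i*tile:(i+1)*tile]
        with cv:
            counter += 1             # counter bump after tile i has landed
            cv.notify_all()

def consumer():
    # stands in for the atomic GEMM consuming tiles as they become ready
    for i in range(n_tiles):
        with cv:
            cv.wait_for(lambda: counter > i)
        out[i*tile:(i+1)*tile] = buf[i*tile:(i+1)*tile] @ W

threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
for t in threads: t.start()
for t in threads: t.join()
assert np.allclose(out, x @ W)
```

On the GPU this synchronization has no host round-trip, which is why implementing the counter poll/bump on the NVTE side will likely need inline asm (or PTX-level atomics) rather than ordinary CUDA C++.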