[TE] Investigate parallelism implementation in Transformer Engine #34

Open
4 of 5 tasks
wangye805 opened this issue Apr 23, 2024 · 1 comment
wangye805 commented Apr 23, 2024

  • study how tensor parallelism works
  • study how sequence parallelism works
  • study why user buffers were introduced in NVTE and how they work
  • study how atomic GEMM works
  • investigate whether pipeline parallelism is applied in NVTE
@wangye805 wangye805 self-assigned this Apr 23, 2024
@wangye805 wangye805 changed the title [TE] Study how TP is implemented in Transformer Engine [TE] Investigate parallelism implementation in Transformer Engine Apr 26, 2024
@wangye805 (Contributor, Author)

2024/04/26
1). Tensor parallelism (TP) investigation:
1.1). The TP implementation follows the Megatron-LM paper: https://arxiv.org/pdf/1909.08053.pdf
1.2). How TP and data parallelism (DP) work together is described in https://huggingface.co/transformers/v4.9.2/parallelism.html
1.3). TP has been supported in TE since the beginning of the NVTE repo (v0.1): NVIDIA/TransformerEngine@996ea16
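The Megatron-style TP split above (column-parallel FC1 followed by row-parallel FC2, with one all-reduce at the end) can be sketched single-process with numpy standing in for the collectives; the shapes and group size below are illustrative assumptions, not TE defaults:

```python
import numpy as np

rng = np.random.default_rng(0)
tp_size = 4                           # simulated tensor-parallel group size
x = rng.standard_normal((8, 16))      # [tokens, hidden]
w1 = rng.standard_normal((16, 32))    # FC1 weight
w2 = rng.standard_normal((32, 16))    # FC2 weight

# Column-parallel FC1: each rank holds a column shard of w1 and produces
# its shard of the activation with no communication.
w1_shards = np.split(w1, tp_size, axis=1)
act_shards = [np.maximum(x @ w, 0.0) for w in w1_shards]  # local GEMM + ReLU

# Row-parallel FC2: each rank holds the matching row shard of w2; the
# partial outputs are summed by an all-reduce (here: a plain sum).
w2_shards = np.split(w2, tp_size, axis=0)
partials = [a @ w for a, w in zip(act_shards, w2_shards)]
y_tp = sum(partials)                  # stands in for the TP all-reduce

# Reference: the unsharded computation gives the same result.
y_ref = np.maximum(x @ w1, 0.0) @ w2
assert np.allclose(y_tp, y_ref)
```

The column-then-row pairing is what keeps the whole MLP at a single all-reduce per forward pass.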
2). Sequence parallelism (SP) investigation:
2.1). The SP implementation follows the paper: https://arxiv.org/pdf/2205.05198.pdf
2.2). SP is usually applied together with TP and shares the same TP group size
2.3). SP has also been supported in TE since the beginning of the NVTE repo (v0.1)
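A minimal sketch of how SP composes with TP, again with numpy standing in for the collectives: activations outside the TP region are sharded along the sequence dimension, an all-gather restores the full sequence before the column-parallel GEMM, and the row-parallel GEMM ends in a reduce-scatter instead of an all-reduce. Shapes and group size are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
tp_size = 4                           # SP shares the TP group size
seq, hidden = 8, 16
x = rng.standard_normal((seq, hidden))
w1 = rng.standard_normal((hidden, 32))
w2 = rng.standard_normal((32, hidden))

# SP entry: each rank holds a sequence shard of the activations.
x_shards = np.split(x, tp_size, axis=0)

# All-gather along the sequence dim before the column-parallel GEMM.
x_full = np.concatenate(x_shards, axis=0)
w1_shards = np.split(w1, tp_size, axis=1)
act_shards = [np.maximum(x_full @ w, 0.0) for w in w1_shards]

# Row-parallel GEMM then reduce-scatter: partial outputs are summed and
# each rank keeps only its own sequence shard.
w2_shards = np.split(w2, tp_size, axis=0)
y_full = sum(a @ w for a, w in zip(act_shards, w2_shards))  # reduce...
y_shards = np.split(y_full, tp_size, axis=0)                # ...scatter

y_ref = np.maximum(x @ w1, 0.0) @ w2
assert np.allclose(np.concatenate(y_shards, axis=0), y_ref)
```

The payoff is that LayerNorm and dropout between the GEMMs operate on sequence shards, cutting their activation memory by the TP group size.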
3). Userbuffers were introduced in NVTE to improve SP runtime: they overlap communication (all-gather and reduce-scatter) with GEMM in LayerNormLinear, Linear, FC1, and FC2 to achieve higher GPU utilization
3.1). Dependencies: MPI, NCCL, GDRCopy
3.2). Contains device-specific code tied to NVLink registers and inline asm
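The overlap idea can be sketched serially: split the all-gather into chunks, and run the GEMM for each chunk as soon as it "arrives", while the next chunk is conceptually still in flight. This is a conceptual model only (the real userbuffers path uses registered device buffers and custom kernels); chunk count and shapes are assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
n_chunks = 4                          # one sequence shard per rank
x_shards = [rng.standard_normal((2, 16)) for _ in range(n_chunks)]
w = rng.standard_normal((16, 32))

# Pipelined all-gather + GEMM: each loop step models one chunk being
# delivered by the communication kernel, immediately followed by the
# GEMM on that chunk while the next transfer would proceed in parallel.
out_chunks = []
for step in range(n_chunks):
    chunk = x_shards[step]            # chunk delivered by the all-gather
    out_chunks.append(chunk @ w)      # GEMM overlapping the next transfer

y_overlap = np.concatenate(out_chunks, axis=0)
y_ref = np.concatenate(x_shards, axis=0) @ w     # unpipelined reference
assert np.allclose(y_overlap, y_ref)
```

Chunking changes only the schedule, not the result, which is why the overlap is a pure latency optimization.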
4). Atomic GEMM is used to further optimize the communication/GEMM overlap in SP: https://docs.nvidia.com/cuda/cublas/#atomics-synchronization
4.1). It requires cuBLASLt or hipBLASLt to support this feature by adding a counter and producer/consumer modes to its API
4.2). In NVTE, producer, consumer, and counter implementations need to be created, probably requiring asm code
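The producer/counter/consumer handshake in 4.1–4.2 can be sketched with two host threads: the producer (standing in for the communication kernel) bumps a counter after each chunk lands in the shared buffer, and the consumer (standing in for the atomic GEMM) waits on the counter per chunk instead of a full-tensor barrier. All names here are hypothetical; the real mechanism uses device-side atomics, not Python threads:

```python
import threading
import numpy as np

rng = np.random.default_rng(3)
n_chunks = 4
chunks = [rng.standard_normal((2, 16)) for _ in range(n_chunks)]
w = rng.standard_normal((16, 32))

counter = 0                      # the "atomic" readiness counter
cond = threading.Condition()
buf = [None] * n_chunks          # stands in for the shared user buffer
out = [None] * n_chunks

def producer():                  # models the communication kernel
    global counter
    for i, c in enumerate(chunks):
        with cond:
            buf[i] = c           # chunk i is now visible in the buffer
            counter += 1         # counter bump signals chunk readiness
            cond.notify_all()

def consumer():                  # models the atomic GEMM
    for i in range(n_chunks):
        with cond:
            cond.wait_for(lambda: counter > i)  # wait for chunk i only
        out[i] = buf[i] @ w      # GEMM tile starts before later chunks land

t_p = threading.Thread(target=producer)
t_c = threading.Thread(target=consumer)
t_c.start(); t_p.start(); t_p.join(); t_c.join()

assert np.allclose(np.concatenate(out), np.concatenate(chunks) @ w)
```

The per-chunk wait is the whole point: the GEMM never blocks on data it does not yet need.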
