Microsoft Collective Communication Library (MSCCL) is a platform for executing custom collective communication algorithms on the heterogeneous accelerators supported by Microsoft Azure. MSCCL currently supports NVIDIA and AMD GPUs. The research prototype of this project is microsoft/msccl.
MSCCL's vision is to provide a unified, efficient, and scalable framework for executing collective communication algorithms on heterogeneous accelerators. To achieve this, MSCCL has multiple components:
- MSCCL toolkit: Interconnects among accelerators differ in latency and bandwidth, so a generic collective communication algorithm does not necessarily perform well for every topology and buffer size. To provide this flexibility, the MSCCL toolkit lets a user write a hyper-optimized collective communication algorithm for a given topology and buffer size. The toolkit contains a high-level DSL (MSCCLang) and a compiler that generates an IR for the MSCCL executor to run on the backend. Example shows how the MSCCL toolkit works with the runtime. Please refer to MSCCL toolkit for more information.
- MSCCL scheduler: An example design and implementation of how to select optimal MSCCL algorithms for MSCCL executors.
- MSCCL executor: A set of libraries responsible for running custom-written collective communication algorithms on heterogeneous accelerators. Each kind of accelerator has a corresponding executor library that is specifically optimized for it. The executor libraries share the same interface for running MSCCL algorithm IR from the MSCCL toolkit and for talking to the MSCCL scheduler. For NVIDIA GPUs, the executor is msccl-executor-nccl, which is built on top of NCCL. For AMD GPUs, it is RCCL, which already integrates all MSCCL executor features.
- MSCCL test toolkit (msccl-tests-nccl): These tests check both the performance and the correctness of MSCCL operations.
For reference, FP16 All-Reduce and All-Gather algorithms were tested and compared on an ND H100 v5 VM using msccl-tests-nccl.
FP16 All-Reduce Message Size | NCCL (us) | MSCCL (us) | MSCCL Speedup | All-Gather Message Size | NCCL (us) | MSCCL (us) | MSCCL Speedup |
---|---|---|---|---|---|---|---|
1KB | 13.12 | 7.50 | 1.80x | 1KB | 9.54 | 5.65 | 1.69x |
2KB | 14.39 | 7.48 | 1.92x | 2KB | 9.8 | 5.7 | 1.72x |
4KB | 15.28 | 7.49 | 2.04x | 4KB | 9.78 | 5.43 | 1.80x |
8KB | 15.69 | 7.67 | 2.04x | 8KB | 9.78 | 5.47 | 1.81x |
16KB | 16.64 | 8.03 | 2.07x | 16KB | 10.29 | 5.53 | 1.86x |
32KB | 19.3 | 9.08 | 2.13x | 32KB | 12.49 | 5.75 | 2.17x |
64KB | 20 | 10.36 | 1.93x | 64KB | 12.87 | 5.95 | 2.16x |
128KB | 20.42 | 11.06 | 1.85x | 128KB | 13.16 | 6.38 | 2.06x |
256KB | 20.5 | 12.86 | 1.60x | 256KB | 13.23 | 7.26 | 1.82x |
512KB | 29.89 | 19.14 | 1.56x | 512KB | 13.39 | 8.71 | 1.54x |
1MB | 31.94 | 22.31 | 1.43x | 1MB | 18.33 | 12.3 | 1.49x |
2MB | 37.95 | 33.43 | 1.14x | 2MB | 23.18 | 17.75 | 1.31x |
4MB | 49.28 | 43.97 | 1.12x | 4MB | 33.66 | 23.37 | 1.44x |
8MB | 77.01 | 68.16 | 1.13x | 8MB | 44.7 | 38.54 | 1.16x |
16MB | 116 | 115.7 | 1.00x | 16MB | 67.19 | 67.16 | 1.00x |
32MB | 187.2 | 186.5 | 1.00x | 32MB | 104.7 | 98.4 | 1.06x |
64MB | 317.4 | 315.7 | 1.01x | 64MB | 192.4 | 181.9 | 1.06x |
128MB | 572.5 | 570.4 | 1.00x | 128MB | 368.3 | 348.4 | 1.06x |
256MB | 1079 | 1075.6 | 1.00x | 256MB | 699.5 | 680.7 | 1.03x |
512MB | 2071.1 | 2067.9 | 1.00x | 512MB | 1358.6 | 1339.3 | 1.01x |
1GB | 4028.7 | 4026.8 | 1.00x | 1GB | 2663.8 | 2633 | 1.01x |
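The MSCCL Speedup column reads as the NCCL latency divided by the MSCCL latency for the same message size. As a minimal sketch of that arithmetic (the `speedup` helper is a hypothetical name, not part of MSCCL), the 4KB All-Reduce row can be reproduced like this:

```shell
# Hypothetical helper: speedup = NCCL latency / MSCCL latency (both in us).
speedup() {
    awk -v nccl="$1" -v msccl="$2" 'BEGIN { printf "%.2fx\n", nccl / msccl }'
}

# 4KB All-Reduce row from the table: 15.28 us (NCCL) vs 7.49 us (MSCCL).
speedup 15.28 7.49   # prints 2.04x
```

The same helper can be applied to latencies measured on your own hardware to see at which message sizes a custom algorithm stops paying off.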
To use MSCCL, follow these steps, which use two different MSCCL algorithms for AllReduce on Azure NDv4 (8xA100 GPUs):
$ git clone https://github.com/Azure/msccl.git --recurse-submodules
$ cd msccl/executor/msccl-executor-nccl
$ make -j src.build
$ cd ../
$ cd ../
$ cd tests/msccl-tests-nccl/
$ make MPI=1 MPI_HOME=/path/to/mpi CUDA_HOME=/path/to/cuda NCCL_HOME=$HOME/msccl/executor/msccl-executor-nccl/build/ -j
$ cd ../
$ cd ../
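Before moving on, it can help to confirm that both builds produced the outputs that the later `mpirun` command relies on. The following is a sketch only: `check_msccl_build` is a hypothetical helper, and the two paths it checks are the ones referenced elsewhere in this README (the executor's `build/lib` directory and the `all_reduce_perf` test binary).

```shell
# Hypothetical helper: report expected build outputs that are missing
# under an MSCCL checkout. Returns non-zero if anything is missing.
check_msccl_build() {
    root="${1:-msccl}"
    missing=0
    for path in \
        "executor/msccl-executor-nccl/build/lib" \
        "tests/msccl-tests-nccl/build/all_reduce_perf"
    do
        if [ ! -e "$root/$path" ]; then
            echo "missing: $root/$path"
            missing=1
        fi
    done
    return "$missing"
}
```

After the steps above you are back in the checkout root, so `check_msccl_build . && echo "build looks complete"` is one way to run it.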
- For NDv4, optimized algorithms are already available, and the MSCCL scheduler can apply them directly to the executor. The steps below install the scheduler:
$ sudo apt-get install libcurl4-openssl-dev nlohmann-json3-dev
$ cd scheduler/msccl-scheduler
For NCCL:
$ CXX=/path/to/nvcc BIN_HOME=/path/to/nccl/binary SRC_HOME=/path/to/nccl/source make
For RCCL:
$ CXX=/path/to/nvcc BIN_HOME=/path/to/nccl/binary SRC_HOME=/path/to/nccl/source make PLATFORM=RCCL
$ make install
- To customize MSCCL algorithms for your system, install the MSCCL toolkit and compile a few custom algorithms:
$ git clone https://github.com/Azure/msccl-tools.git
$ cd msccl-tools/
$ pip install .
$ cd ../
$ python msccl-tools/examples/mscclang/allreduce_a100_allpairs.py --protocol=LL 8 2 > test.xml
$ cd ../
The compiler's generated code is an XML file (test.xml) that is fed to the MSCCL runtime. To evaluate its performance, copy test.xml to msccl/executor/msccl-executor-nccl/build/lib/msccl-algorithms/ and execute the following command line on an Azure NDv4 node or any 8xA100 system:
$ mpirun -np 8 -x LD_LIBRARY_PATH=msccl/executor/msccl-executor-nccl/build/lib/:$LD_LIBRARY_PATH -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,ENV tests/msccl-tests-nccl/build/all_reduce_perf -b 128 -e 32MB -f 2 -g 1 -c 1 -n 100 -w 100 -G 100 -z 0
If everything is installed correctly, the log should contain a line like the following:
[0] NCCL INFO Connected 1 MSCCL algorithms
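The copy step that precedes the benchmark run could be scripted roughly as follows. This is a sketch under assumptions: `install_msccl_algo` is a hypothetical helper name, and it refuses to create the algorithm directory itself, on the assumption that a missing directory means the executor has not been built yet.

```shell
# Hypothetical helper: copy a compiled algorithm XML into the executor's
# algorithm directory (path taken from this README).
install_msccl_algo() {
    xml="$1"
    root="${2:-msccl}"
    dest="$root/executor/msccl-executor-nccl/build/lib/msccl-algorithms"
    if [ ! -f "$xml" ]; then
        echo "no such algorithm file: $xml" >&2
        return 1
    fi
    if [ ! -d "$dest" ]; then
        echo "algorithm directory not found (is the executor built?): $dest" >&2
        return 1
    fi
    cp "$xml" "$dest/"
}
```

For example, `install_msccl_algo test.xml msccl` run from the directory that contains both `test.xml` and the `msccl` checkout.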
You may evaluate the performance of test.xml by comparing in-place (the new algorithm) against out-of-place (the default ring algorithm); the new algorithm should be up to 2-3x faster on 8xA100 NVLink-interconnected GPUs. The MSCCL toolkit has a rich set of algorithms for different Azure SKUs and collective operations, with significant speedups over vanilla NCCL.
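The in-place versus out-of-place comparison can be read straight from the benchmark output. The sketch below assumes the 13-column result rows that nccl-tests prints for `all_reduce_perf` (size, count, type, redop, root, then time/algbw/busbw/#wrong for out-of-place and again for in-place); check the header in your own log before relying on the column positions. `compare_in_place` is a hypothetical helper name.

```shell
# Hypothetical helper: for each result row, print the message size in bytes
# and the ratio out-of-place time / in-place time.
# Assumes column 6 = out-of-place time (us), column 10 = in-place time (us).
compare_in_place() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 13 && $10 > 0 {
        printf "%s %.2fx\n", $1, $6 / $10
    }' "$@"
}
```

It reads a saved log file given as an argument, or standard input when the benchmark output is piped into it. Header lines and NCCL debug lines are skipped because their first field is not a plain number.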
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit CLA.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.