
RDMA support #124

Closed
wants to merge 6 commits into from

Conversation

@crazyboycjr (Contributor)

Description

This pull request enables RDMA over Converged Ethernet (RoCE) support for ps-lite.

In order to send data with RDMA, the data must be placed in a registered memory buffer (a simplified illustration of the registration pattern follows the list below).

  • src/rdma_van.h implements the Van interface using rdma_cm and ibverbs.
  • include/ps/internal/allocator.h implements the memory allocator that manages this registered region.
  • bfc_allocator.h and bfc_allocator.cc implement the memory allocation algorithm, which is encapsulated by allocator.h.
  • Class SRMem in include/ps/srmem.h provides array-like access to this region and can be constructed from an SArray or a C-style array.
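
For readers new to RDMA, here is a minimal sketch (not the PR's actual code; the `RegisteredBuffer`, `AllocRegistered`, and `FreeRegistered` names are hypothetical) of the registration step that motivates a dedicated allocator: every buffer the NIC transfers must first be registered with `ibv_reg_mr` and later released with `ibv_dereg_mr`.

```cpp
#include <infiniband/verbs.h>

#include <cstdlib>

// Hypothetical helper types; the PR's real allocator lives in
// include/ps/internal/allocator.h.
struct RegisteredBuffer {
  void *addr = nullptr;
  struct ibv_mr *mr = nullptr;  // memory-region handle returned by the NIC
};

// Allocate `size` bytes and register them with protection domain `pd`,
// so the NIC is allowed to DMA into and out of the buffer.
bool AllocRegistered(struct ibv_pd *pd, size_t size, RegisteredBuffer *out) {
  out->addr = std::malloc(size);
  if (out->addr == nullptr) return false;
  out->mr = ibv_reg_mr(pd, out->addr, size,
                       IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
  if (out->mr == nullptr) {
    std::free(out->addr);
    return false;
  }
  return true;
}

void FreeRegistered(RegisteredBuffer *buf) {
  if (buf->mr != nullptr) ibv_dereg_mr(buf->mr);
  std::free(buf->addr);
}
```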

Compile & Test

The patch depends on libibverbs and librdmacm and assumes there is a NIC with RDMA support and all drivers are working.

make -j $(nproc) USE_RDMA=1

then run the binaries under tests/ with the environment variable DMLC_ENABLE_RDMA=1 set.

The test programs should run correctly on multiple machines.

Acknowledgement

Thanks to @snowzjx for supplying the BFC algorithm; this allocation algorithm could later be replaced by a better or more appropriate one. Thanks to @byronyi and @snowzjx for their many valuable suggestions.

Comments

SArray currently maintains a shared pointer, implements zero-copy operation by pointing it at user-space data, and allocates and deallocates memory with new and delete. Under RDMA and GPU Direct, the devices (NIC, GPU) need to register their own memory regions, so SArray's use of new and delete cannot satisfy the devices' memory management requirements.

Would it be a good idea to pass an extra allocator parameter to SArray so that allocation and deallocation can be managed by the device-specific allocator?
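
One way to picture this (a rough sketch only; `RdmaAllocator` and `MakeRegistered` are hypothetical names, not part of this PR) is that the shared pointer could carry a custom deleter that hands memory back to the device-aware allocator, so lifetime management stays where it is while the allocation policy moves out of SArray:

```cpp
#include <cstdlib>
#include <memory>

// Hypothetical device-aware allocator interface; a real one would register
// and deregister memory with the NIC/GPU instead of calling malloc/free.
class RdmaAllocator {
 public:
  void *Alloc(size_t size) { return std::malloc(size); }  // placeholder body
  void Free(void *ptr) { std::free(ptr); }                // placeholder body
};

// Back a shared pointer with allocator-managed memory. Because the custom
// deleter routes back through the allocator, reference-counted lifetime
// management keeps working while new/delete are no longer involved.
template <typename V>
std::shared_ptr<V> MakeRegistered(RdmaAllocator *alloc, size_t n) {
  V *data = static_cast<V *>(alloc->Alloc(n * sizeof(V)));
  return std::shared_ptr<V>(data, [alloc](V *p) { alloc->Free(p); });
}
```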

@byronyi commented Jan 27, 2018

Ping @mli, we would appreciate it if you could suggest a reviewer for this PR.

@mli (Member) commented Jan 27, 2018

Thanks for the contribution! I'm curious about the performance of RDMA. Are there any benchmark results for it?

@crazyboycjr (Contributor, Author) commented Jan 29, 2018

My benchmark code is in test_kv_app_benchmark.cc. Before running it, either DMLC_NODE_HOST or DMLC_INTERFACE (e.g. ens6f0 on my machine) should be set to the value corresponding to the RDMA interface.

DMLC_ENABLE_RDMA=1 ./local.sh 1 1 ./test_kv_app_benchmark

Environment: Debian GNU/Linux 9, kernel version 4.9.25
Hardware: Intel Xeon E5-2603, Mellanox ConnectX-4 (40Gb)

Here are some benchmarks; the values below are in milliseconds.

local 1v1 ZMQ (ms)

| num  | 10000   | 100000  | 250000  | 500000  | 1000000 | 10000000 |
|------|---------|---------|---------|---------|---------|----------|
| Push | 7.40908 | 30.4501 | 69.0517 | 160.054 | 394.602 | 6164.78  |
| Pull | 9.03454 | 38.3155 | 90.0546 | 249.64  | 638.13  | 7579.4   |

local 1v1 RDMA (ms)

| num  | 10000   | 100000  | 250000  | 500000  | 1000000 | 10000000 |
|------|---------|---------|---------|---------|---------|----------|
| Push | 5.78536 | 27.8317 | 60.1017 | 126.319 | 311.827 | 3756.74  |
| Pull | 7.0873  | 36.7264 | 77.6618 | 151.969 | 370.524 | 4824.83  |

dist 1v1 ZMQ (ms)

| num  | 10000   | 100000  | 250000  | 500000  | 1000000 | 10000000 |
|------|---------|---------|---------|---------|---------|----------|
| Push | 15.2365 | 82.0496 | 133.96  | 311.553 | 670.521 | 9209.26  |
| Pull | 23.8208 | 139.206 | 237.501 | 558.699 | 1175.36 | 16677.8  |

dist 1v1 RDMA (ms)

| num  | 10000   | 100000  | 250000  | 500000  | 1000000 | 10000000 |
|------|---------|---------|---------|---------|---------|----------|
| Push | 10.0092 | 73.8385 | 172.956 | 340.11  | 726.466 | 10253.3  |
| Pull | 10.2565 | 71.9637 | 160.857 | 304.912 | 694.12  | 9128.75  |

In brief, ZMQ performs a little better than RDMA on the Push operation, but RDMA performs much better than ZMQ on the Pull operation.

I haven't tested it with more than 3 hosts. I think that with N servers and M workers, the results should be consistent with the 1v1 case on a lossless RDMA network.

@crazyboycjr (Contributor, Author)
ping @mli

@byronyi commented Feb 17, 2018

Hi @mli, would you mind checking again? We spent a lot of time on this PR and would love it if you could find a reviewer for it.

@mli (Member) commented Feb 19, 2018 via email

@mli (Member) commented Feb 19, 2018

One of our team members and I will work on this PR together.

@byronyi commented Feb 19, 2018

Fantastic. Happy lunar new year!

@mli (Member) left a comment

Thanks again for the contribution. I did a brief review and made some general comments. Since I'm not familiar with RDMA, I may not be able to provide detailed comments.

For curiosity,

  1. Is there any cloud platform such as AWS that supports RoCE so we can run a test?
  2. Have you tried running some MXNet workloads and noticed a performance difference?
  3. What are the constraints for RDMA? I saw an array size constraint such as 1 GB.

@@ -19,6 +19,13 @@ endif

INCPATH = -I./src -I./include -I$(DEPS_PATH)/include
CFLAGS = -std=c++11 -msse2 -fPIC -O3 -ggdb -Wall -finline-functions $(INCPATH) $(ADD_CFLAGS)
LIBS = -pthread
It may break the macOS build.

@@ -47,6 +47,14 @@ git clone https://github.com/dmlc/ps-lite
cd ps-lite && make -j4
```

### Build with RDMA support

You can add `USE_RDMA=1` to enable RDMA support.
Add more instructions about how to install the dependent libraries

@@ -0,0 +1,96 @@
/**
How about creating a folder rdma and moving all related files into it?

*/
#ifndef PS_INTERNAL_ALLOCATOR_H_
#define PS_INTERNAL_ALLOCATOR_H_
#ifdef MXNET_USE_RDMA
Change to DMLC_USE_RDMA
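
For example, the guard would become (illustrative only):

```cpp
#ifndef PS_INTERNAL_ALLOCATOR_H_
#define PS_INTERNAL_ALLOCATOR_H_
#ifdef DMLC_USE_RDMA  // project-wide DMLC_ prefix instead of MXNET_USE_RDMA
// ... RDMA-specific allocator declarations ...
#endif  // DMLC_USE_RDMA
#endif  // PS_INTERNAL_ALLOCATOR_H_
```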

std::unordered_map<void *, std::pair<struct ibv_mr *, size_t>> addr_mr_;

private:
int64_t addr2offset(void *addr) {
Please follow the Google C++ style for function names
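
For instance, under the Google C++ style the function name would use mixed case (a naming illustration only; the body is whatever the PR implements):

```cpp
#include <cstdint>

// Google style: function names are CamelCase.
int64_t Addr2Offset(void *addr);  // was: addr2offset(void *addr)
```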

/**
* \brief pack meta into protobuf
*/
void PackMetaPB(const Meta& meta, PBMeta *pb);
Kind of duplicated with PackMeta?

@crazyboycjr (Contributor, Author)

Thanks for your response and comments, @mli! I will address the comments above and answer your questions as early as possible over the next few days.

@limingfan

Hi, I have one question about it: compared with the raw ps-lite working on IPoIB, can this implementation get over a 20% speedup? According to some previous discussions, it seems the ZMQ library is pretty efficient for data transmission!

@rahul003 (Contributor)

@crazyboycjr @byronyi Did you get a chance to address Mu's comments above?

@crazyboycjr (Contributor, Author)

Hi @limingfan, we haven't compared it with ps-lite on IPoIB, but we did some tests on a 100GbE TCP network. The acceleration ratio depends on factors such as the model, the batch size, the PS and worker network topology, and the overlap of back propagation and parameter synchronization.

We have run tests on some CNN models. The results show that the AlexNet and VGG models can easily achieve more than 1.5x acceleration. However, models like ResNet-50 gain less because their computation and communication already overlap well.

By the way, hi @rahul003: we found that on models like Inception-BN the RDMA implementation slows down training, so I'm trying to figure out the reason. I want to resolve Mu's comments after I address the slowdown problem.

Sorry about my late response:pray::pray::pray:.

@eric-haibin-lin (Member)

@crazyboycjr @byronyi any update?

@changlan (Contributor) commented Jun 26, 2019

Hi all,

Along with our effort on BytePS (https://github.com/bytedance/byteps), we just open-sourced our internal RDMA implementation for PS-Lite at Bytedance (https://github.com/bytedance/ps-lite). It is based on @crazyboycjr's implementation (in fact, @crazyboycjr also worked on it during his internship here). @ymjiang and I have been tinkering with it for several months. It now consistently outperforms ZMQ TCP across every model we benchmarked (e.g. AlexNet, ResNet, VGG, Inception).

@eric-haibin-lin and @mli: I'd be happy to open a PR to send the patch upstream if there is enough interest. Let me know how you'd like to proceed!

@ymjiang (Member) commented Jun 27, 2019

I would like to complement @changlan's statement. Let me share some end-to-end results on distributed training.

We use Tesla V100 GPUs and set the batch size to 32. Each machine (no NVLink) has 8 GPUs, and the machines are interconnected by 100 Gbps networking (supporting both TCP and RoCEv2). "TCP" below refers to the vanilla ZeroMQ implementation of ps-lite.

Note: The values are images per second.

Resnet50:

| #GPU | TCP  | RoCEv2 |
|------|------|--------|
| 8    | 1008 | 2019   |
| 16   | 2279 | 4037   |
| 32   | 5048 | 7798   |
| 64   | 6954 | 14780  |

VGG16:

| #GPU | TCP  | RDMA |
|------|------|------|
| 8    | 163  | 303  |
| 16   | 361  | 692  |
| 32   | 694  | 1393 |
| 64   | 1370 | 2777 |

@byronyi commented Jun 27, 2019

Since AWS does provision 100Gbps RDMA via EFA, I imagine this would be more relevant :)

@changlan mentioned this pull request Jul 16, 2019
@byronyi commented Aug 29, 2019

Users should refer to #151 now that it is merged. @crazyboycjr @eric-haibin-lin I am leaning toward closing this one. Any comments?

@crazyboycjr (Contributor, Author)

Sure, I'm OK with it. @byronyi

@eric-haibin-lin (Member)

Merged in #151
