RDMA support #124
Ping @mli, I'd appreciate it if you could suggest a reviewer for this PR. |
Thanks for the contribution! I'm curious about the performance of RDMA. Are there any benchmark results for it? |
My benchmark code is in
Environment: Debian GNU/Linux 9, kernel version 4.9.25. Here are some benchmarks; the values are in milliseconds. [The result tables for local 1v1 ZMQ, local 1v1 RDMA, dist 1v1 ZMQ, and dist 1v1 RDMA are not preserved in this extract.]
In brief, ZMQ performs a little better than RDMA on Push operations, but RDMA performs much better than ZMQ on Pull operations. I haven't tested it with more than 3 hosts. I think that in the case of N servers and M workers, the results should conform to the 1v1 case on a lossless RDMA network. |
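For context, the 1v1 push/pull latency measurement described above could be sketched roughly as below. This is not the author's benchmark code; it only assumes the usual ps-lite `KVWorker` Push/Pull/Wait API, and the exact constructor and `Start()`/`Finalize()` signatures vary between ps-lite versions, so treat it as an illustration.

```cpp
// Hypothetical 1v1 push/pull latency micro-benchmark (worker side only).
// Scheduler/server setup and ps::Start()/ps::Finalize() are omitted.
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

#include "ps/ps.h"

void BenchPushPull(size_t len, int repeat) {
  ps::KVWorker<float> kv(0);  // app id 0 (2018-era constructor; may differ today)
  std::vector<ps::Key> keys = {1};
  std::vector<float> vals(len, 1.0f);
  std::vector<int> lens = {static_cast<int>(len)};

  typedef std::chrono::high_resolution_clock Clock;
  Clock::time_point t0 = Clock::now();
  for (int i = 0; i < repeat; ++i) kv.Wait(kv.Push(keys, vals, lens));
  Clock::time_point t1 = Clock::now();
  for (int i = 0; i < repeat; ++i) kv.Wait(kv.Pull(keys, &vals, &lens));
  Clock::time_point t2 = Clock::now();

  double push_ms = std::chrono::duration<double, std::milli>(t1 - t0).count() / repeat;
  double pull_ms = std::chrono::duration<double, std::milli>(t2 - t1).count() / repeat;
  std::printf("len=%zu  push %.3f ms/op  pull %.3f ms/op\n", len, push_ms, pull_ms);
}
```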
ping @mli |
Hi @mli, would you mind checking again? We spent a lot of time on this PR and would love it if you could find a reviewer for it. |
Thanks, the results look great. I put the code review request into our internal system. |
One of our team members and I will work on this PR together. |
Fantastic. Happy lunar new year! |
Thanks again for the contribution. I did a brief review and made some general comments. Since I'm not familiar with RDMA, I may not be able to provide detailed comments.
Out of curiosity:
- Is there any cloud platform, such as AWS, that supports RoCE so we can run a test?
- Have you tried running some MXNet workloads and noticed a performance difference?
- What are the constraints for RDMA? I saw an array size constraint of 1 GB.
@@ -19,6 +19,13 @@ endif

INCPATH = -I./src -I./include -I$(DEPS_PATH)/include
CFLAGS = -std=c++11 -msse2 -fPIC -O3 -ggdb -Wall -finline-functions $(INCPATH) $(ADD_CFLAGS)
LIBS = -pthread
It may break the macOS build.
@@ -47,6 +47,14 @@ git clone https://github.com/dmlc/ps-lite
cd ps-lite && make -j4
```

### Build with RDMA support

You can add `USE_RDMA=1` to enable RDMA support.
Add more instructions about how to install the dependent libraries
@@ -0,0 +1,96 @@
/** |
How about creating a folder `rdma` and moving all related files into it?
*/
#ifndef PS_INTERNAL_ALLOCATOR_H_
#define PS_INTERNAL_ALLOCATOR_H_
#ifdef MXNET_USE_RDMA |
Change to `DMLC_USE_RDMA`.
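For clarity, the guard after the suggested rename might look like this sketch (surrounding declarations elided):

```cpp
// allocator.h guard after renaming MXNET_USE_RDMA to DMLC_USE_RDMA (sketch).
#ifndef PS_INTERNAL_ALLOCATOR_H_
#define PS_INTERNAL_ALLOCATOR_H_
#ifdef DMLC_USE_RDMA

// ... RDMA-aware allocator declarations ...

#endif  // DMLC_USE_RDMA
#endif  // PS_INTERNAL_ALLOCATOR_H_
```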
std::unordered_map<void *, std::pair<struct ibv_mr *, size_t>> addr_mr_;

 private:
int64_t addr2offset(void *addr) { |
Please follow the Google C++ style for function names
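For reference, the Google C++ style guide uses UpperCamelCase for regular function names, so the rename might look like the following (the new name is just an example):

```cpp
// Illustrative rename following Google C++ style for function names.
int64_t AddrToOffset(void *addr);  // instead of: int64_t addr2offset(void *addr)
```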
/**
 * \brief pack meta into protobuf
 */
void PackMetaPB(const Meta& meta, PBMeta *pb); |
Kind of duplicated with `PackMeta`?
Thanks for your response and comments @mli!! I will address the comments above and answer your questions in the next few days, as soon as possible. |
Hi, I have one question about it: compared with the raw ps-lite working over IPoIB, can this implementation get more than a 20% speedup? According to some previous discussions, it seems the ZMQ library is pretty efficient for data transmission! |
@crazyboycjr @byronyi Did you get a chance to address Mu's comments above? |
Hi @limingfan, we haven't compared it with ps-lite on IPoIB, but we did run some tests on a 100GbE TCP network. The acceleration ratio depends on the model, the batch size, the PS/worker network topology, and how well back-propagation overlaps with parameter synchronization. We have run tests on some CNN models. The results show that the AlexNet and VGG models easily achieve more than 1.5x acceleration. However, models like ResNet-50 see less speedup because their computation and communication already overlap well. By the way, hi @rahul003: we found that on models like Inception-BN the RDMA implementation slows down training, so I'm trying to figure out the reason. I want to resolve Mu's comments after I address the slowdown problem. Sorry about my late response :pray::pray::pray:. |
@crazyboycjr @byronyi any update? |
Hi all, along with our effort on BytePS (https://github.com/bytedance/byteps), we just open-sourced our internal RDMA implementation for PS-Lite at Bytedance (https://github.com/bytedance/ps-lite). It is based on @crazyboycjr's implementation (actually @crazyboycjr also worked on it during his internship here). @ymjiang and I have been tinkering with it for several months. It now consistently outperforms ZMQ TCP across every model we benchmarked (e.g. AlexNet, ResNet, VGG, Inception). @eric-haibin-lin and @mli: I'd be happy to open a PR to send the patch upstream if there is enough interest. Let me know how you'd like to proceed! |
I would like to complement @changlan's statement. Let me share some end-to-end results on distributed training. We use Tesla V100 GPUs and set the batch size to 32. Each machine (no NVLink) has 8 GPUs, and the machines are inter-connected by 100 Gbps networking (supporting both TCP and RoCEv2). When using TCP, we are referring to the vanilla ZeroMQ implementation of ps-lite. [The accompanying ResNet-50 and VGG-16 result tables are not preserved in this extract.] |
Since AWS does provision 100Gbps RDMA via EFA, I imagine this would be more relevant :) |
Users should refer to #151, as it has been merged. @crazyboycjr @eric-haibin-lin I am leaning towards closing this one. Any comments? |
Sure, I'm OK with it. @byronyi |
Merged in #151 |
Description
This pull request enables RDMA over Converged Ethernet support for ps-lite.
In order to send data with RDMA, the data must be placed in a registered memory buffer (a rough registration sketch follows the component list below).
- `src/rdma_van.h` implements the `Van` interface with rdma_cm and ibverbs.
- `include/ps/internal/allocator.h` implements the memory allocator that manages this region.
- `bfc_allocator.h` and `bfc_allocator.cc` implement the memory allocation algorithm, which is encapsulated by `allocator.h`.
- `SRMem` in `include/ps/srmem.h` provides array-like access to this region and can be constructed from `SArray` and C-style arrays.
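As a rough illustration only (not the PR's BFC allocator), registration bookkeeping with ibverbs could look like the sketch below. It mirrors the `addr_mr_` map shown in the review above, assumes a valid `ibv_pd *` protection domain obtained elsewhere, and omits all error handling.

```cpp
// Minimal registered-memory allocator sketch: every buffer handed to RDMA
// is registered with ibv_reg_mr() and tracked so it can be deregistered later.
#include <infiniband/verbs.h>

#include <cstddef>
#include <cstdlib>
#include <unordered_map>
#include <utility>

class SimpleRdmaAllocator {
 public:
  explicit SimpleRdmaAllocator(struct ibv_pd *pd) : pd_(pd) {}

  void *Alloc(size_t size) {
    void *addr = std::malloc(size);
    int access = IBV_ACCESS_LOCAL_WRITE |
                 IBV_ACCESS_REMOTE_READ |
                 IBV_ACCESS_REMOTE_WRITE;
    struct ibv_mr *mr = ibv_reg_mr(pd_, addr, size, access);  // register buffer
    addr_mr_[addr] = std::make_pair(mr, size);
    return addr;
  }

  void Free(void *addr) {
    auto it = addr_mr_.find(addr);
    if (it == addr_mr_.end()) return;
    ibv_dereg_mr(it->second.first);  // deregister before releasing the buffer
    std::free(addr);
    addr_mr_.erase(it);
  }

 private:
  struct ibv_pd *pd_;
  std::unordered_map<void *, std::pair<struct ibv_mr *, size_t>> addr_mr_;
};
```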
Compile & Test
The patch depends on `libibverbs` and `librdmacm`, and assumes there is a NIC with RDMA support and that all drivers are working. Build with `USE_RDMA=1` and run the binaries under `tests/` with the env var `DMLC_ENABLE_RDMA=1`. The test programs should run well on multiple machines.
Acknowledgement
Thanks to @snowzjx for supplying the BFC algorithm; this allocation algorithm could later be replaced by a better or more appropriate one. Thanks to @byronyi and @snowzjx for giving many valuable suggestions. |
Comments
`SArray` currently maintains a shared pointer and implements zero-copy by moving the pointer to user-space data, allocating and deallocating the memory with `new` and `delete`. Under RDMA and GPU Direct, the devices (NIC, GPU) need to register their own memory regions, so `SArray`'s use of `new` and `delete` cannot satisfy the devices' memory management requirements. Would it be a good idea to pass an extra allocator parameter to `SArray`, so that allocation and deallocation can be managed by the device-specific allocator?
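To make the question concrete, one possible shape of that extra-allocator idea is sketched below. `Allocator` and `MiniSArray` are hypothetical names for illustration only, not the actual ps-lite API, and the real `SArray` zero-copy behaviour is not reproduced here.

```cpp
// Toy sketch: a pluggable allocator interface so that an RDMA- or GPU-aware
// allocator (rather than plain new/delete) owns the underlying buffer.
#include <cstddef>
#include <memory>

struct Allocator {  // hypothetical interface
  virtual ~Allocator() {}
  virtual void *Alloc(size_t size) = 0;
  virtual void Free(void *ptr) = 0;
};

template <typename V>
class MiniSArray {  // toy stand-in for ps::SArray<V>
 public:
  MiniSArray(size_t n, Allocator *alloc) : size_(n) {
    V *p = static_cast<V *>(alloc->Alloc(n * sizeof(V)));
    // The custom deleter routes deallocation back to the device-aware allocator.
    ptr_ = std::shared_ptr<V>(p, [alloc](V *q) { alloc->Free(q); });
  }
  V *data() const { return ptr_.get(); }
  size_t size() const { return size_; }

 private:
  std::shared_ptr<V> ptr_;
  size_t size_;
};
```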