
RDMA support #124

Closed
wants to merge 6 commits into from

Conversation

@crazyboycjr (Contributor)

Description

This pull request enables RDMA over Converged Ethernet (RoCE) support for ps-lite.

In order to send data with RDMA, the data must be placed in a registered memory buffer (a simplified illustration of the registration pattern follows the list below).

  • src/rdma_van.h implements the Van interface using rdma_cm and ibverbs.
  • include/ps/internal/allocator.h implements the memory allocator that manages this registered region.
  • bfc_allocator.h and bfc_allocator.cc implement the memory allocation algorithm, which is encapsulated by allocator.h.
  • Class SRMem in include/ps/srmem.h provides array-like access to this region and can be constructed from an SArray or a C-style array.
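
For readers new to RDMA, here is a minimal sketch (not the PR's actual code; the `RegisteredBuffer`, `AllocRegistered`, and `FreeRegistered` names are hypothetical) of the registration step that motivates a dedicated allocator: every buffer the NIC transfers must first be registered with `ibv_reg_mr` and later released with `ibv_dereg_mr`.

```cpp
#include <infiniband/verbs.h>

#include <cstdlib>

// Hypothetical helper types; the PR's real allocator lives in
// include/ps/internal/allocator.h.
struct RegisteredBuffer {
  void *addr = nullptr;
  struct ibv_mr *mr = nullptr;  // memory-region handle returned by the NIC
};

// Allocate `size` bytes and register them with protection domain `pd`,
// so the NIC is allowed to DMA into and out of the buffer.
bool AllocRegistered(struct ibv_pd *pd, size_t size, RegisteredBuffer *out) {
  out->addr = std::malloc(size);
  if (out->addr == nullptr) return false;
  out->mr = ibv_reg_mr(pd, out->addr, size,
                       IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
  if (out->mr == nullptr) {
    std::free(out->addr);
    return false;
  }
  return true;
}

void FreeRegistered(RegisteredBuffer *buf) {
  if (buf->mr != nullptr) ibv_dereg_mr(buf->mr);
  std::free(buf->addr);
}
```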

Compile & Test

The patch depends on libibverbs and librdmacm and assumes there is a NIC with RDMA support and all drivers are working.

make -j $(nproc) USE_RDMA=1

then run the binaries under tests/ with the environment variable DMLC_ENABLE_RDMA=1 set.

The test programs should run correctly on multiple machines.

Acknowledgement

Thanks to @snowzjx for supplying the BFC algorithm; this allocation algorithm could later be replaced by a better or more appropriate one. Thanks to @byronyi and @snowzjx for their many valuable suggestions.

Comments

SArray currently maintains a shared pointer, implements zero-copy operation by pointing it at user-space data, and allocates and deallocates memory with new and delete. Under RDMA and GPU Direct, the devices (NIC, GPU) need to register their own memory regions, so SArray's use of new and delete cannot satisfy the devices' memory management requirements.

Would it be a good idea to pass an extra allocator parameter to SArray so that allocation and deallocation can be managed by the device-specific allocator?
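
One way to picture this (a rough sketch only; `RdmaAllocator` and `MakeRegistered` are hypothetical names, not part of this PR) is that the shared pointer could carry a custom deleter that hands memory back to the device-aware allocator, so lifetime management stays where it is while the allocation policy moves out of SArray:

```cpp
#include <cstdlib>
#include <memory>

// Hypothetical device-aware allocator interface; a real one would register
// and deregister memory with the NIC/GPU instead of calling malloc/free.
class RdmaAllocator {
 public:
  void *Alloc(size_t size) { return std::malloc(size); }  // placeholder body
  void Free(void *ptr) { std::free(ptr); }                // placeholder body
};

// Back a shared pointer with allocator-managed memory. Because the custom
// deleter routes back through the allocator, reference-counted lifetime
// management keeps working while new/delete are no longer involved.
template <typename V>
std::shared_ptr<V> MakeRegistered(RdmaAllocator *alloc, size_t n) {
  V *data = static_cast<V *>(alloc->Alloc(n * sizeof(V)));
  return std::shared_ptr<V>(data, [alloc](V *p) { alloc->Free(p); });
}
```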

@byronyi commented Jan 27, 2018

Ping @mli, we would appreciate it if you could suggest a reviewer for this PR.

@mli (Member) commented Jan 27, 2018

Thanks for the contribution! I'm curious about the performance of RDMA. Are there any benchmark results for it?

@crazyboycjr (Contributor, Author) commented Jan 29, 2018

My benchmark code is in test_kv_app_benchmark.cc. Before running it, either DMLC_NODE_HOST or DMLC_INTERFACE (e.g. ens6f0 on my machine) should be set to the value corresponding to the RDMA interface.

DMLC_ENABLE_RDMA=1 ./local.sh 1 1 ./test_kv_app_benchmark

Environment: Debian GNU/Linux 9, kernel version 4.9.25
Hardware: Intel Xeon E5-2603, Mellanox ConnectX-4 (40Gb)

Here are some benchmarks; the values below are in milliseconds.

local 1v1 ZMQ (ms)

| num  | 10000   | 100000  | 250000  | 500000  | 1000000 | 10000000 |
|------|---------|---------|---------|---------|---------|----------|
| Push | 7.40908 | 30.4501 | 69.0517 | 160.054 | 394.602 | 6164.78  |
| Pull | 9.03454 | 38.3155 | 90.0546 | 249.64  | 638.13  | 7579.4   |

local 1v1 RDMA (ms)

| num  | 10000   | 100000  | 250000  | 500000  | 1000000 | 10000000 |
|------|---------|---------|---------|---------|---------|----------|
| Push | 5.78536 | 27.8317 | 60.1017 | 126.319 | 311.827 | 3756.74  |
| Pull | 7.0873  | 36.7264 | 77.6618 | 151.969 | 370.524 | 4824.83  |

dist 1v1 ZMQ (ms)

| num  | 10000   | 100000  | 250000  | 500000  | 1000000 | 10000000 |
|------|---------|---------|---------|---------|---------|----------|
| Push | 15.2365 | 82.0496 | 133.96  | 311.553 | 670.521 | 9209.26  |
| Pull | 23.8208 | 139.206 | 237.501 | 558.699 | 1175.36 | 16677.8  |

dist 1v1 RDMA (ms)

| num  | 10000   | 100000  | 250000  | 500000  | 1000000 | 10000000 |
|------|---------|---------|---------|---------|---------|----------|
| Push | 10.0092 | 73.8385 | 172.956 | 340.11  | 726.466 | 10253.3  |
| Pull | 10.2565 | 71.9637 | 160.857 | 304.912 | 694.12  | 9128.75  |

In brief, ZMQ performs a little better than RDMA on the Push operation, but RDMA performs much better than ZMQ on the Pull operation.

I haven't tested it with more than 3 hosts. I think that with N servers and M workers, the results should be consistent with the 1v1 case on a lossless RDMA network.

@crazyboycjr (Contributor, Author)
ping @mli

@byronyi commented Feb 17, 2018

Hi @mli, would you mind checking again? We spent a lot of time on this PR and would love it if you could find a reviewer for it.

@mli (Member) commented Feb 19, 2018 via email

@mli (Member) commented Feb 19, 2018

One of our team members and I will work on this PR together.

@byronyi commented Feb 19, 2018

Fantastic. Happy lunar new year!

@mli (Member) left a comment

Thanks again for the contribution. I did a brief review and made some general comments. Since I'm not familiar with RDMA, I may not be able to provide detailed comments.

For curiosity,

  1. Is there any cloud platform such as AWS that supports RoCE so we can run a test?
  2. Have you tried running some MXNet workloads and noticed a performance difference?
  3. What are the constraints for RDMA? I saw an array size constraint such as 1 GB.

@@ -19,6 +19,13 @@ endif

INCPATH = -I./src -I./include -I$(DEPS_PATH)/include
CFLAGS = -std=c++11 -msse2 -fPIC -O3 -ggdb -Wall -finline-functions $(INCPATH) $(ADD_CFLAGS)
LIBS = -pthread
It may break the macOS build.

@@ -47,6 +47,14 @@ git clone https://github.com/dmlc/ps-lite
cd ps-lite && make -j4
```

### Build with RDMA support

You can add `USE_RDMA=1` to enable RDMA support.
Add more instructions about how to install the dependent libraries

@@ -0,0 +1,96 @@
/**
How about creating a folder rdma and moving all related files into it?

*/
#ifndef PS_INTERNAL_ALLOCATOR_H_
#define PS_INTERNAL_ALLOCATOR_H_
#ifdef MXNET_USE_RDMA
Change to DMLC_USE_RDMA
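
For example, the guard would become (illustrative only):

```cpp
#ifndef PS_INTERNAL_ALLOCATOR_H_
#define PS_INTERNAL_ALLOCATOR_H_
#ifdef DMLC_USE_RDMA  // project-wide DMLC_ prefix instead of MXNET_USE_RDMA
// ... RDMA-specific allocator declarations ...
#endif  // DMLC_USE_RDMA
#endif  // PS_INTERNAL_ALLOCATOR_H_
```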

std::unordered_map<void *, std::pair<struct ibv_mr *, size_t>> addr_mr_;

private:
int64_t addr2offset(void *addr) {
Please follow the Google C++ style for function names
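
For instance, under the Google C++ style the function name would use mixed case (a naming illustration only; the body is whatever the PR implements):

```cpp
#include <cstdint>

// Google style: function names are CamelCase.
int64_t Addr2Offset(void *addr);  // was: addr2offset(void *addr)
```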

/**
* \brief pack meta into protobuf
*/
void PackMetaPB(const Meta& meta, PBMeta *pb);
Kind of duplicated with PackMeta?

@crazyboycjr (Contributor, Author)

Thanks for your response and comments, @mli! I will address the comments above and answer your questions as early as possible over the next few days.

@limingfan

Hi, I have one question about it: compared with the raw ps-lite working on IPoIB, can this implementation get over a 20% speedup? According to some previous discussions, it seems the ZMQ library is pretty efficient for data transmission!

@rahul003 (Contributor)

@crazyboycjr @byronyi Did you get a chance to address Mu's comments above?

@crazyboycjr (Contributor, Author)

Hi @limingfan, we haven't compared it with ps-lite on IPoIB, but we did some tests on a 100GbE TCP network. The acceleration ratio depends on factors such as the model, the batch size, the PS and worker network topology, and the overlap of back propagation and parameter synchronization.

We have run tests on some CNN models. The results show that the AlexNet and VGG models can easily achieve more than 1.5x acceleration. However, models like ResNet-50 gain less because their computation and communication already overlap well.

By the way, hi @rahul003: we found that on models like Inception-BN the RDMA implementation slows down training, so I'm trying to figure out the reason. I want to resolve Mu's comments after I address the slowdown problem.

Sorry about my late response:pray::pray::pray:.

@eric-haibin-lin (Member)

@crazyboycjr @byronyi any update?

@changlan (Contributor) commented Jun 26, 2019

Hi all,

Along with our effort on BytePS (https://github.com/bytedance/byteps), we just open-sourced our internal RDMA implementation for PS-Lite at Bytedance (https://github.com/bytedance/ps-lite). It is based on @crazyboycjr's implementation (in fact, @crazyboycjr also worked on it during his internship here). @ymjiang and I have been tinkering with it for several months. It now consistently outperforms ZMQ TCP across every model we benchmarked (e.g. AlexNet, ResNet, VGG, Inception).

@eric-haibin-lin and @mli: I'd be happy to open a PR to send the patch upstream if there is enough interest. Let me know how you'd like to proceed!

@ymjiang (Member) commented Jun 27, 2019

I would like to complement @changlan's statement. Let me share some end-to-end results on distributed training.

We use Tesla V100 GPUs and set the batch size to 32. Each machine (no NVLink) has 8 GPUs, and the machines are interconnected by 100 Gbps networking (supporting both TCP and RoCEv2). "TCP" below refers to the vanilla ZeroMQ implementation of ps-lite.

Note: The values are images per second.

Resnet50:

| #GPU | TCP  | RoCEv2 |
|------|------|--------|
| 8    | 1008 | 2019   |
| 16   | 2279 | 4037   |
| 32   | 5048 | 7798   |
| 64   | 6954 | 14780  |

VGG16:

| #GPU | TCP  | RDMA |
|------|------|------|
| 8    | 163  | 303  |
| 16   | 361  | 692  |
| 32   | 694  | 1393 |
| 64   | 1370 | 2777 |

@byronyi commented Jun 27, 2019

Since AWS does provision 100Gbps RDMA via EFA, I imagine this would be more relevant :)

@changlan mentioned this pull request Jul 16, 2019
@byronyi commented Aug 29, 2019

Users should refer to #151 now that it is merged. @crazyboycjr @eric-haibin-lin I am leaning toward closing this one. Any comments?

@crazyboycjr (Contributor, Author)

Sure, I'm OK with it. @byronyi

@eric-haibin-lin (Member)

Merged in #151
