Broadcasting ops are slow #8219
Comments
100x is a little surprising. The code was written like this to be consistent between CPU and GPU. If the performance difference is really that big we can consider writing a separate implementation for CPU. Would you like to propose a fix?
Turns out part of the difference is because the broadcasting version is single-threaded. If I set the number of threads to 1, the difference is a bit smaller (~20x):
Looking at the assembly in perf this doesn't seem all that surprising: (array + scalar) compiles to nice linear SIMD instructions, while the broadcasting version is a mess involving lots of [...]. Sadly, I don't think I'll have the time to look into this in detail in the next couple of days or probably weeks.
Assigning Chris to keep track of this. We'll probably be able to get to this in a couple of weeks.
In case I get a chance to look at it, I'd like to know what to profile. What's your model and data set? What are your tensor dimensions? I assume float32 type...
Oh, heh, there's the Python code... missed that on my phone's tiny screen.
By the way, often things are faster with OMP_NUM_THREADS=1 when the tensors aren't huge and the calculations are relatively simple. I assume there's an OMP loop outside of the kernel you showed.
@cjolivier01 I didn't see any OpenMP stuff in the broadcast implementation; I don't think it is multithreaded atm. Only the plain elemwise ops seem to be.
@aseyboldt Which code version?
By the way, I ran this script several times:

```python
import mxnet as mx
import time

a = mx.sym.var('a')
b = mx.sym.var('b')
a_ = mx.nd.ones((2**20,))
b1 = mx.nd.ones((2**20,))
b2 = mx.nd.ones((1,))

func1 = (a + b).bind(mx.cpu(), {'a': a_, 'b': b1})
func2 = mx.sym.broadcast_add(a, b).bind(mx.cpu(), {'a': a_, 'b': b2})
func3 = (a + 1.).bind(mx.cpu(), {'a': a_})

for x in range(4):
    print("PASS: {}...", x)

    # array + array
    start = time.time()
    for i in range(1000):
        func1.forward()[0].wait_to_read()
    print("func1: {}".format(time.time() - start))

    # broadcast_add(array, (1,)-shaped array)
    start = time.time()
    for i in range(1000):
        func2.forward()[0].wait_to_read()
    print("func2: {}".format(time.time() - start))

    # array + scalar
    start = time.time()
    for i in range(1000):
        func3.forward()[0].wait_to_read()
    print("func3: {}".format(time.time() - start))

    print(" ")
```
I get this output:

```
('PASS: {}...', 1)
('PASS: {}...', 2)
('PASS: {}...', 3)
```
broadcast_add has more overhead than a simple elemwise_add, so I would expect it to be a bit slower. I believe the current implementation of "+" uses elemwise_add if the a and b shapes are the same (the a + b portion), and broadcast_add() if they differ.
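A tiny sketch of the dispatch rule described in this comment, purely for illustration (this is not the actual MXNet source, and the helper name is made up):

```python
def add_op_for(a_shape, b_shape):
    """Hypothetical illustration of the '+' dispatch rule described above."""
    if a_shape == b_shape:
        return "elemwise_add"    # identical shapes: plain elementwise kernel
    return "broadcast_add"       # differing shapes: general broadcasting kernel

print(add_op_for((2**20,), (2**20,)))  # -> elemwise_add
print(add_op_for((2**20,), (1,)))      # -> broadcast_add
```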
It is not at all surprising that the scalar add (the third case) is the fastest: fewer memory reads, far less overhead in general.
Note that @piiswrong refactored broadcast a few weeks ago...
@cjolivier01 Thanks for looking into this. 🙂 I haven't updated to mxnet 1.0 yet, so it is possible that this is fixed now (I only have slow internet at the moment so I can't update). Looking at the code I don't think so, however. The broadcasting [...]

```python
import os
os.environ['OMP_NUM_THREADS'] = '1'

import numpy as np
import mxnet as mx
import time

a = mx.sym.var('a')
b = mx.sym.var('b')
a_ = mx.nd.ones((2**17, 10, 10))
b_ = mx.nd.ones((1,))
c_ = a_.copy()

x = a_.asnumpy()
y = b_.asnumpy()
z = c_.asnumpy()

func1 = (a + b).bind(mx.cpu(), {'a': a_, 'b': c_})
func2 = mx.sym.broadcast_add(a, b).bind(mx.cpu(), {'a': a_, 'b': b_})

for _ in range(2):
    # elemwise
    start = time.time()
    for i in range(100):
        func1.forward()[0].wait_to_read()
    print("func1: {}".format(time.time() - start))

    # broadcast_add(array, (1,)-shaped array)
    start = time.time()
    for i in range(100):
        func2.forward()[0].wait_to_read()
    print("func2: {}".format(time.time() - start))

    # numpy elemwise
    start = time.time()
    out = np.zeros_like(x)
    for i in range(100):
        np.add(x, z, out=out)
    print("numpy1: {}".format(time.time() - start))

    # numpy broadcast
    start = time.time()
    for i in range(100):
        np.add(x, y, out=out)
    print("numpy2: {}".format(time.time() - start))

    print()
```

which gives me (different machine than the last benchmark):
For numpy the broadcasting op is faster than the normal one; for mxnet it is 10x slower. In the non-broadcasting case both numpy and mxnet are bound by memory bandwidth, and this is still more or less the case for numpy in the broadcasting case, but not for mxnet. This seems to happen in general for the broadcasting ops in mxnet, not only when a scalar is added. (Although numpy can't use up all the memory bandwidth in some cases either, it never slows down nearly as much as mxnet.) My guess as to why func2 is so much slower than func1 is the index juggling in the broadcast kernel.
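As a rough back-of-the-envelope check of the memory-bandwidth point above (a sketch; the ~20 GB/s figure is an assumed bandwidth for illustration, not a measurement from the machine used in the benchmark):

```python
import numpy as np

elements = 2**17 * 10 * 10                 # shape (2**17, 10, 10)
bytes_per_elem = np.dtype('float32').itemsize
# Per elementwise add: read a, read c, write out
traffic_per_call = 3 * elements * bytes_per_elem
total = 100 * traffic_per_call             # 100 iterations in the benchmark

assumed_bandwidth = 20e9                   # bytes/s, assumption for illustration
print("traffic per call: %.1f MB" % (traffic_per_call / 1e6))              # ~157.3 MB
print("total traffic:    %.1f GB" % (total / 1e9))                         # ~15.7 GB
print("bandwidth-bound lower bound: %.2f s" % (total / assumed_bandwidth)) # ~0.79 s
```

If the measured func1/numpy times are close to such a bound while func2 is an order of magnitude above it, that supports the claim that the broadcasting kernel is limited by index arithmetic rather than by memory traffic.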
I've run the code from @aseyboldt again using a 2.xlarge and the current master, and the result is like this:
Taking a look now. I tried to run the script provided by @cjolivier01 on the latest master (97da5e3). I saw a 4x slowdown for the broadcast ops over a range of 1000 runs, not 100x. I am running on a p2.8xlarge.
Thanks @anirudh2290 for looking into this. This will be very beneficial. |
Tracking here: https://issues.apache.org/jira/browse/MXNET-323
I have been looking at this issue. The MXNet forward pass for broadcast_add is much faster than tensorflow and pytorch. To give some numbers (experiments conducted on my setup of a p2.8xlarge, testing CPU performance) for broadcasting a tensor of shape (1,) to a tensor of shape (2**17, 10, 10):

Only forward pass:

When we include both forward and backward pass:

So we decided to look at the MXNet backward pass and try out some optimizations. We tried using LaunchEx so that each thread gets a bigger chunk of the workload; this by itself doesn't help. The bottleneck for the backward pass is the for loop that runs for each thread: https://github.com/apache/incubator-mxnet/blob/master/src/operator/tensor/broadcast_reduce-inl.h#L164

There are a lot of repeated computations of coords and dot inside that for loop. We tried caching this computation (a rough sketch of the idea is given below). This improves the speed by more than 1.4x, to around 12 seconds for mxnet. The drawback is the extra memory this needs. You can see the rough implementation here: https://github.com/anirudh2290/mxnet/blob/cached_broadcast/src/operator/tensor/broadcast_reduce-inl.h#L233

We suspect (not yet confirmed) that Tensorflow uses the Eigen library to do the reduce. This is something that we should run experiments for and consider as a replacement as we deprecate mshadow.

Next steps:
Here is the script. Please let me know if there is any caveat that I have missed:
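Below is a minimal, self-contained Python/NumPy sketch of the caching idea described above, not the actual C++ kernel (the function names and the stride handling are simplified for illustration): the reduction-domain offsets into the big array are computed once and reused for every output element, instead of re-deriving coordinates and the dot product inside the innermost loop.

```python
import numpy as np

def reduce_sum_naive(big, small_shape):
    """For every output element, re-derive the reduction coordinates and
    offsets (mirrors the structure of the loop in broadcast_reduce-inl.h)."""
    bshape = big.shape
    # dimensions that get reduced are the ones where the output has size 1
    rshape = tuple(b if s == 1 else 1 for b, s in zip(bshape, small_shape))
    bstride = np.array([int(np.prod(bshape[i + 1:])) for i in range(len(bshape))])
    flat_big = big.ravel()
    small = np.zeros(small_shape, dtype=big.dtype)
    flat_small = small.ravel()
    M = int(np.prod(rshape))
    for idx in range(flat_small.size):
        coord = np.unravel_index(idx, small_shape)
        j = int(np.dot(coord, bstride))           # base offset into `big`
        acc = 0.0
        for k in range(M):
            rcoord = np.unravel_index(k, rshape)  # recomputed for every idx
            acc += flat_big[j + int(np.dot(rcoord, bstride))]
        flat_small[idx] = acc
    return small

def reduce_sum_cached(big, small_shape):
    """Same reduction, but the reduction-domain offsets are computed once and
    reused for every output element, at the cost of an extra array of size M."""
    bshape = big.shape
    rshape = tuple(b if s == 1 else 1 for b, s in zip(bshape, small_shape))
    bstride = np.array([int(np.prod(bshape[i + 1:])) for i in range(len(bshape))])
    M = int(np.prod(rshape))
    offsets = np.array([int(np.dot(np.unravel_index(k, rshape), bstride))
                        for k in range(M)])       # cached once
    flat_big = big.ravel()
    small = np.zeros(small_shape, dtype=big.dtype)
    flat_small = small.ravel()
    for idx in range(flat_small.size):
        coord = np.unravel_index(idx, small_shape)
        j = int(np.dot(coord, bstride))
        flat_small[idx] = flat_big[j + offsets].sum()
    return small

big = np.random.rand(4, 3, 5).astype('float32')
expected = big.sum(axis=(0, 2), keepdims=True)
assert np.allclose(reduce_sum_naive(big, (1, 3, 1)), expected)
assert np.allclose(reduce_sum_cached(big, (1, 3, 1)), expected)
```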
@pengzhao-intel is this something you guys can help with?
OK, will take a look at this issue.
@rongzha1 will help with this issue.
Curious to know if this is still being worked on? @pengzhao-intel @anirudh2290
#11252: this PR addressed it. For 2:
This may not be a trivial effort and requires a detailed performance analysis of mshadow versus Eigen for the common use cases. Also, this affects only ops not supported by MKLDNN, so the benefits may not be worth the effort.
I can't see a good reason why `broadcast_add(array, value)` should be much slower than `array + scalar`, but the speed difference is almost 100x.

A bit of digging led me to this function:
https://github.com/apache/incubator-mxnet/blob/bbf1c0b9d49b432fc46b911656f7f3fdd2521f98/src/operator/tensor/broadcast_reduce-inl.h#L142-L152
This is called for each element of the result array, and does a lot of index juggling. Broadcasting can be implemented much faster by using strides of 0 in the dimensions we want to broadcast over. Additionally, it is possible to optimize some inner loops further in many cases. The best reference I know of would be the NumPy iterator API: https://docs.scipy.org/doc/numpy-dev/reference/c-api.iterator.html
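To make the zero-stride idea concrete with NumPy itself (a sketch of the concept only, not a proposal for MXNet's exact implementation):

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

a = np.ones((2**17, 10, 10), dtype='float32')
b = np.ones((1,), dtype='float32')

# View b as if it had a's shape by giving every broadcast dimension a stride
# of 0: each element of the view aliases b[0], so no data is copied.
b_view = as_strided(b, shape=a.shape, strides=(0, 0, 0))

out = a + b_view   # now an ordinary elementwise loop over two same-shaped operands
assert np.allclose(out, a + b[0])
```

With zero strides, the inner loop of the kernel needs no per-element unravel/dot index arithmetic; it just advances two pointers, one of them by 0.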
The same problem seems to exist in the reduction ops in the same file.
For the benchmark I used git commit 573a010 on a Linux machine with an Intel CPU.