-
Notifications
You must be signed in to change notification settings - Fork 6.8k
[MXNET-323] Improve performance of broadcast ops backward pass #11252
Conversation
@piiswrong can you help take a look ? |
@@ -602,6 +602,11 @@ void Reduce(Stream<gpu> *s, const TBlob& small, const OpReqType req, | |||
ReduceImpl<Reducer, ndim, DType, OP>(stream, small, req, big, workspace, config); | |||
} | |||
|
|||
template <typename Reducer, int ndim, typename DType, typename OP> | |||
void ReduceWithExtraMem(Stream<cpu>* s, const TBlob& small, const OpReqType req, | |||
const Tensor<cpu, 1, char>& workspace, const TBlob& big) {}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Empty implementation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ReduceWithExtraMem
is only used for the cpu implementation and prevents the build from failing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
binary_broadcast_op.h
already included broadcast_reduce-inl.h
. Why is it necessary to add this function here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
broadcast_reduce-inl.h includes the code in broadcast_reduce-inl.cuh or some part of code (including ReduceWithExtraMem) in broadcast_reduce-inl.h based on if CUDACC is defined. https://github.com/apache/incubator-mxnet/blob/master/src/operator/tensor/broadcast_reduce-inl.h#L171. This causes build to fail when omitting ReduceWithExtraMem in broadcast_reduce-inl.cuh.
template<typename xpu, typename LOP, typename ROP> | ||
inline typename std::enable_if<std::is_same<xpu, gpu>::value, void>::type | ||
BinaryBroadcastBackwardUseNone(const nnvm::NodeAttrs& attrs, | ||
const OpContext& ctx, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: indentation
const Tensor<cpu, 1, char>& workspace, const TBlob& big) { | ||
if (req == kNullOp) return; | ||
Shape<ndim> rshape, rstride; | ||
diff(small.shape_.get<ndim>(), big.shape_.get<ndim>(), &rshape, &rstride); | ||
int N = small.shape_.Size(), M = rshape.Size(); | ||
seq_reduce_compute<Reducer, ndim, DType, OP>( | ||
N, M, req == kAddTo, big.dptr<DType>(), small.dptr<DType>(), | ||
big.shape_.get<ndim>(), small.shape_.get<ndim>(), rshape, rstride); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: indentation should be 2 spaces?
template<typename xpu, typename LOP, typename ROP> | ||
inline typename std::enable_if<std::is_same<xpu, gpu>::value, void>::type | ||
BinaryBroadcastBackwardUseNone(const nnvm::NodeAttrs& attrs, | ||
const OpContext& ctx, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: alignment of lines
@@ -544,20 +545,25 @@ void BinaryBroadcastBackwardUseNone(const nnvm::NodeAttrs& attrs, | |||
const TBlob out = inputs[0].reshape(new_oshape); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since this implementation is only for cpu, is it better to replace xpu with cpu inside?
@@ -602,6 +602,11 @@ void Reduce(Stream<gpu> *s, const TBlob& small, const OpReqType req, | |||
ReduceImpl<Reducer, ndim, DType, OP>(stream, small, req, big, workspace, config); | |||
} | |||
|
|||
template <typename Reducer, int ndim, typename DType, typename OP> | |||
void ReduceWithExtraMem(Stream<cpu>* s, const TBlob& small, const OpReqType req, | |||
const Tensor<cpu, 1, char>& workspace, const TBlob& big) {}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
binary_broadcast_op.h
already included broadcast_reduce-inl.h
. Why is it necessary to add this function here?
@eric-haibin-lin I have addressed your comments. |
…e#11252) * Fix cached broadcast * Fix * Use seq_reduce_compute logic for stable sum * Fix lint * Add declarations * Add elemwise binary broadcast op cuh file * Add license for elemwise_binary_broadcast_op-inl.cuh * Fix broadcast * Fix indentation * Use cpu and gpu instead of xpu
Description
This PR tries to improve the performance of broadcast ops backward pass, by caching the intermediate computations and using LaunchEx. The speedup for both forward and backward pass combined for the broadcast_add is around 1.4X. The experiments have been run on p2.8xlarge.
The below numbers are for broadcasting a tensor of shape (1,) to a tensor of shape destination shape as given below.
Checklist
Essentials
Please feel free to remove inapplicable items for your PR.