【Hackathon No.36】Optimize the computation performance of the lerp_grad op on GPU #45946
Conversation
Your PR has been submitted successfully. Thank you for contributing to this open-source project!
✅ This PR's description meets the template requirements!
@Ligoml There are currently two failing pipeline checks. CE-Framework reports a bug in test_quantile.py, which does not seem to test the operator I modified.
I commented out one of the newly added unit tests. Its results are actually correct — the printed grad values all match — but check_grad in OpTest reports a diff greater than 1e-7; it looks like a false failure from the assertLessEqual check.
Could anyone advise what is causing the current CE failure?
I suggest not adding code in the .cc file; using impl.h should be enough, and that way the coverage check should not complain.
OK, I'll give it a try.
Force-pushed from 303ecf0 to 89e45e1
@ZzSean In CI-build, what could cause the message "refusing to merge unrelated histories"? I've kept only two commits now. Update: after re-running CI-Clone the error above no longer appears, but the build now reports "docker md5 check failed" 😂
Force-pushed from 81b4f44 to 170b04d
It has nothing to do with the number of commits; you need to merge the latest develop branch.
OK, thanks. It should be fine now.
// x: 1*2*3, y: 2*2*3, no reduce needed.

int reduce_flag = XYNeedReduce(x, y, out);
if (reduce_flag == 0) {
A suggestion: maybe you don't need to distinguish so many cases here.
For dy there are 2 operations: 1. a broadcast multiply of weight with out_grad, call the result tmp; 2. reduce tmp down to y's shape.
For dx, likewise: 1. take tmp from step 1 above as input and apply a scale operation, a*tmp+b with a=-1, b=1; 2. then reduce the result down to x's shape.
Both the broadcast and the reduce above already have packaged kernels that can be called directly; this would make the overall code structure clearer and let many of the conditional branches be dropped.
The current computation logic is actually quite similar to this. The cases are distinguished because, when dx or dy does not need a reduce (some shape-mismatch cases also need no reduce), that step can be skipped for speed, controlled via reduce_flag. It only looks like many cases because the logical checks are written rather verbosely; in the end there are really just two: reduce needed or not.
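For reference (added context, not part of the original thread): both the broadcast-then-reduce decomposition suggested above and the reduce_flag-based implementation compute the same quantities. Under the standard definition of lerp used by Paddle, out = x + w * (y - x), the gradients are:

```latex
\mathrm{out} = x + w \odot (y - x)
\;\Longrightarrow\;
\frac{\partial L}{\partial x} = \operatorname{reduce}_{\mathrm{shape}(x)}\!\big[(1 - w)\odot \mathrm{dout}\big],
\qquad
\frac{\partial L}{\partial y} = \operatorname{reduce}_{\mathrm{shape}(y)}\!\big[\,w \odot \mathrm{dout}\,\big]
```

Here the reduce (a sum over broadcast axes) is only needed when x or y was broadcast up to out's shape, which is exactly the step reduce_flag skips.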
auto* dy = y_grad;
DDim dx_dims;
DDim dy_dims;
Eigen::DSizes<int, D * 2> dx_reshape_dims;
For the reduce here, could you use Paddle's ReduceKernel? Performance should be better than Eigen, and you would not need to wrap your own GetRduceResult function. You can refer to operators like ElementwiseGrad.
OK, I'll test using Paddle's ReduceKernel for reference. I wrapped my own before because I couldn't find the name of the corresponding kernel.
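A hedged sketch of the call pattern being suggested, modeled on how the GPU ElementwiseGrad helpers reduce a broadcast-sized gradient back to an input's shape. The header paths, the GetReduceDim helper, and the ReduceGradToShape wrapper name are assumptions written from memory, not code from this PR:

```cpp
#include <vector>

#include "paddle/phi/core/dense_tensor.h"
#include "paddle/phi/kernels/funcs/common_shape.h"     // phi::funcs::GetReduceDim (assumed path)
#include "paddle/phi/kernels/funcs/reduce_function.h"  // phi::funcs::ReduceKernel (assumed path)

// Sum `src` (shaped like out_grad) over its broadcast axes so that it matches
// the shape of `dst` -- a sketch under the assumptions above.
template <typename T, typename Context>
void ReduceGradToShape(const Context& dev_ctx,
                       const phi::DenseTensor& src,
                       phi::DenseTensor* dst) {
  std::vector<int> reduce_dims =
      phi::funcs::GetReduceDim(dst->dims(), src.dims(), /*axis=*/-1);
  phi::funcs::ReduceKernel<T, T, phi::kps::AddFunctor, phi::kps::IdentityFunctor<T>>(
      dev_ctx, src, dst, phi::kps::IdentityFunctor<T>(), reduce_dims);
}
```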
int out_rank = out_dims.size();
int smaller_rank = std::min(x_rank, y_rank);
if (std::max(x_rank, y_rank) < out_rank) {
  return 1;
1 -> true
int y_idx = y_rank - i;
int out_idx = out_rank - i;
if (x_dims[x_idx] != y_dims[y_idx]) {
  return 1;
Same as above.
namespace phi {

template <typename T>
__global__ void GetLerpGrad(const T* weight,
Please do rename the kernel; LerpGradKernelImpl would be better.
}

template <typename T>
__global__ void GetLerpGradRankZero(const T* weight,
LerpGradScalarKernelImpl
    const int x_size,
    const int y_size) {
  CUDA_KERNEL_LOOP_TYPE(idx, out_size, int64_t) {
    T temp_dx = weight[0] * dout[idx];
When weight is a scalar, it can be loaded into a register here, outside the loop, to reduce repeated reads.
OK, I'll change it.
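A minimal sketch of the suggested change in plain CUDA (the explicit grid-stride loop stands in for the CUDA_KERNEL_LOOP_TYPE macro used in the PR, and the nullptr checks anticipate the later review request to handle absent x_grad/y_grad; the kernel name is hypothetical):

```cuda
// Scalar-weight backward pass: cache the single weight value in a register
// once per thread, instead of re-reading weight[0] from global memory on
// every loop iteration.
template <typename T>
__global__ void LerpGradScalarSketch(const T* weight, const T* dout,
                                     T* dx, T* dy, const int64_t out_size) {
  const T w = weight[0];  // one global-memory read per thread
  for (int64_t idx = blockIdx.x * blockDim.x + threadIdx.x; idx < out_size;
       idx += static_cast<int64_t>(blockDim.x) * gridDim.x) {
    const T temp = w * dout[idx];
    if (dy != nullptr) dy[idx] = temp;              // dL/dy = w * dout
    if (dx != nullptr) dx[idx] = dout[idx] - temp;  // dL/dx = (1 - w) * dout
  }
}
```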
@@ -1,4 +1,4 @@
-// Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
This file is not newly added by this PR; do not change the copyright year.
done
namespace phi {

template <typename T>
__global__ void LerpGradKernelImpl(const T* weight,
It looks like this computation could be implemented by calling ElementwiseKernel; for the multi-output usage you can refer to:
phi::funcs::ElementwiseKernel<T, decltype(functor), 2>(
    dev_ctx, ins, &outs, functor);
When optimizing another operator earlier, I found in my tests that ElementwiseKernel was not as fast as a hand-written kernel; I'll benchmark both here and see which is faster.
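For context, a hedged sketch of what the multi-output ElementwiseKernel path could look like for this computation. The functor shape, in particular the phi::Array<T, 2> return type, and the header paths are assumptions based on how other Paddle GPU kernels use this API, not code from this PR; LerpGradFunctorSketch is a hypothetical name:

```cpp
#include "paddle/phi/core/hostdevice.h"                 // HOSTDEVICE (assumed path)
#include "paddle/phi/core/utils/array.h"                // phi::Array (assumed path)
#include "paddle/phi/kernels/funcs/elementwise_base.h"  // phi::funcs::ElementwiseKernel (assumed path)

// One fused pass over (w, dout) producing both gradients at once.
template <typename T>
struct LerpGradFunctorSketch {
  HOSTDEVICE phi::Array<T, 2> operator()(const T w, const T dout) const {
    phi::Array<T, 2> result;
    result[0] = dout - w * dout;  // dx = (1 - w) * dout
    result[1] = w * dout;         // dy = w * dout
    return result;
  }
};

// Usage, following the snippet quoted above (inputs already broadcast to the
// same shape, outputs allocated to match):
//   std::vector<const phi::DenseTensor*> ins = {&b_weight, &out_grad};
//   std::vector<phi::DenseTensor*> outs = {x_grad, y_grad};
//   auto functor = LerpGradFunctorSketch<T>();
//   phi::funcs::ElementwiseKernel<T, decltype(functor), 2>(dev_ctx, ins, &outs, functor);
```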
OK, fixed; I'll keep an eye on CI.
}

template <typename T>
__global__ void LerpGradScalarKernelImpl(const T* weight,
Same as above.
DenseTensor b_out = phi::EmptyLike<T>(ctx, out_grad);
std::vector<DenseTensor*> out_tensors = {&b_weight, &b_out};

phi::BroadcastTensorsKernel<T, Context>(ctx, in_tensors, out_tensors);
You could also call BroadcastKernel here; performance should be a bit better.
OK, I'll test it.
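A hedged sketch of the alternative being suggested: fuse the broadcast of weight with the multiply against out_grad, rather than first materializing a broadcast copy through BroadcastTensorsKernel. The template parameters and axis argument follow the BroadcastKernel signature of that era as best recalled and should be verified against the Paddle source; tmp is a hypothetical intermediate tensor pre-allocated with out_grad's shape:

```cpp
#include <vector>

#include "paddle/phi/core/dense_tensor.h"
#include "paddle/phi/kernels/funcs/broadcast_function.h"   // phi::funcs::BroadcastKernel (assumed path)
#include "paddle/phi/kernels/funcs/elementwise_functor.h"  // phi::funcs::MultiplyFunctor (assumed path)

// tmp = weight * out_grad, broadcasting weight up to out_grad's shape in one call.
template <typename T, typename Context>
void BroadcastMulSketch(const Context& dev_ctx,
                        const phi::DenseTensor& weight,
                        const phi::DenseTensor& out_grad,
                        phi::DenseTensor* tmp) {
  std::vector<const phi::DenseTensor*> ins = {&weight, &out_grad};
  std::vector<phi::DenseTensor*> outs = {tmp};
  phi::funcs::BroadcastKernel<phi::ElementwiseType::kBinary, T, T>(
      dev_ctx, ins, &outs, /*axis=*/-1, phi::funcs::MultiplyFunctor<T>());
}
```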
Keeping the current implementation in this PR is also OK; please prioritize fixing the CI issues.
Fixed, and the CI tests have basically all passed. One more check, Kunlun-kp-build, reports a failure; when I try to restart the pipeline it says I don't have permission. Could you help restart it?
This CI is not required; it failed due to a network issue. I've asked about it, and it can be ignored.
LGTM for PR-CI-Kunlun-KP-Build
    const int y_grad_size,
    T* x_grad_data,
    T* y_grad_data) {
  if (weight.dims().size() == 1) {
Isn't the condition inside this if wrong? Shouldn't it be weight.numel() == 1? I checked, and the unit tests have no case where w is 1-D with a shape other than [1].
Yes, the bug has been fixed. In the unit tests I also changed TestLerpBroadWToXY to use a 1-D weight whose shape is not [1], so that both broadcasting and the case mentioned here are covered.
if (weight.dims().size() == 1) {
  // condition when weight is a scalar
  const T* weight_data = weight.data<T>();
  // const T weight_scalar = weight_data[0];
Unused comments can be deleted.
done
// const T weight_scalar = weight_data[0];
const T* out_grad_data = out_grad.data<T>();
const int out_size = out_grad.numel();
const int weight_size = weight.numel();
Recommend using the int64_t type here.
done
    const T* dout,
    T* dx,
    T* dy,
    const int out_size,
Does calling this kernel require x, y, and weight to all have the same size? If so, wouldn't passing a single numel parameter be enough?
The sizes of x and y may differ here, e.g. x: [2,3,4], y: [2,2,3,4]. When only the leading axis differs, there is no need to broadcast them to the same shape, but the kernel does need to distinguish how many elements to process for x and for y.
// broadcast weight with out_grad's dimensions
const std::vector<const DenseTensor*> in_tensors = {&weight, &out_grad};
DenseTensor b_weight = phi::EmptyLike<T>(ctx, out_grad);
DenseTensor b_out = phi::EmptyLike<T>(ctx, out_grad);
As I understand it, out should never need broadcasting?
Right, out does not need broadcasting. This step is only there so that BroadcastTensorsKernel can be used to broadcast weight; since out_grad itself is const and cannot be passed in as a mutable argument, this extra step was added.
The two lerp configurations in the original OP Benchmark were not very representative, so PaddlePaddle/benchmark#1530 changed the second (double-type) configuration to one that is used in a real model and performs poorly. That PR also lists the performance of several other configurations from the model, for reference. OP Benchmark CI performance is as follows: If you are still interested in further performance optimization, you could consider the following aspects:
The performance optimization has been reviewed and accepted, see #45946 (comment). Please also add the logic for x_grad and y_grad possibly being nullptr.
// check if x_grad and y_grad need to be reduced
// if x has a different dimension with y or weight in the middle axis, then
// they need to be broadcast and then reduced.
bool reduce_flag = XYNeedReduce(x, y, out);
Paddle/paddle/phi/kernels/impl/lerp_grad_kernel_impl.h
Lines 45 to 56 in 9a849a3
if (dx) {
  dx_dims = phi::funcs::ExtendDims2Rank(dx->dims(), D);
  phi::funcs::GetBroadcastDims<D>(dx_dims, dout_dims, &dx_bcast_dims);
}
if (dy) {
  dy_dims = phi::funcs::ExtendDims2Rank(dy->dims(), D);
  phi::funcs::GetBroadcastDims<D>(dy_dims, dout_dims, &dy_bcast_dims);
}
phi::funcs::GetBroadcastDims<D>(w_dims, dout_dims, &w_bcast_dims);
auto eigen_w = phi::EigenTensor<T, D>::From(w, w_dims);
auto eigen_dout = phi::EigenTensor<T, D>::From(dout);
Note that the implementation on develop checks whether dx and dy are null. For any Paddle operator with multiple gradient outputs, every gradient variable is allowed to be nullptr, so the corresponding logic still needs to be added.
OK, I'll add it and push again.
if (!reduce_flag) {
  int x_grad_size = 0, y_grad_size = 0;
  T* x_grad_data;
  T* y_grad_data;
It would be best to initialize these pointers to nullptr.
DenseTensor b_xgrad = phi::EmptyLike<T, Context>(ctx, out_grad);
DenseTensor b_ygrad = phi::EmptyLike<T, Context>(ctx, out_grad);
T* x_grad_data;
T* y_grad_data;
It would be best to initialize these pointers to nullptr.
Changed.
All CI tests have passed.
LGTM. Good work~
Reply: 1. I added one more configuration to the benchmark: x: [16, 1, 1, 1], y: [16, 3, 244, 244], w: [3, 1, 244], a case where all three need broadcasting. The forward pass shows no improvement since its implementation is unchanged; the backward pass improves by roughly 100x. I'll add detailed results later (updated). 2. The replacement of the two kernels will be done in the next round of optimization.
@Rayman96 This DtoD copy is not produced by the benchmark. See PaddlePaddle/benchmark#1530: I ran nvprof on develop and there is no such DtoD copy.
It is indeed strange. Could it be that PaddlePaddle/benchmark#1530 shows none because everything there uses the Eigen implementation? In my local tests, the DtoD copy seems to appear whenever a hand-written kernel is involved.
PR types
Performance optimization
PR changes
OPs
Describe
Experimental development environment:
Hardware: Tesla P4
Software: CUDA 11.2, cuDNN 8
Overview of the code logic:
On the CPU side, the original logic is copied directly and runs correctly.
On the GPU side, the broadcasting of X, Y, and W has to be considered:
1. The actual computation only involves W and Dout. If w is a single value (the common case), weight does not need to be broadcast and can be multiplied with dout directly.
2. If weight is a tensor, it must first be broadcast to dout's shape and then multiplied element-wise.
3. When X and Y have different shapes, if the axis that needs broadcasting is a leading axis (e.g. x: 1x2x3, y: 2x2x3), x does not have to be expanded to y's shape.
4. If the axis that needs expanding is in the middle (e.g. x: 2x1x3, y: 2x2x3), x must first be expanded for the computation and then reduced back to its own shape afterwards.
Case 4 takes about as long as the original implementation, and the cost grows steadily from case 1 to case 4, so case 1 should see the best speedup and case 4 the worst.