
[Hackathon No.36] Optimize the computation performance of the lerp_grad op on GPU #45946

Merged: 16 commits merged into PaddlePaddle:develop on Oct 10, 2022

Conversation

@Rayman96 (Contributor) commented Sep 10, 2022

PR types

Performance optimization

PR changes

OPs

Describe

Development environment:
Hardware: Tesla P4
Software: CUDA 11.2, cuDNN 8

| Case No. | input_shape | data_type | Paddle_modify Perf (us) | Perf_over_paddle_origin (%) | Perf_over_pytorch (%) |
| --- | --- | --- | --- | --- | --- |
| 1 | [16, 102400] | float32 | 134.98 | 65.56% | 41.04% |
| 2 | [16, 204800] | float64 | 543.17 | 37.96% | 42.55% |

Overview of the code logic:
For the CPU path, the original logic is copied as-is and works correctly.
For the GPU path, broadcasting among X, Y, and W must be handled:
1. The actual computation only involves W and Dout. If w is a single value (the common case), weight does not need to be broadcast and can be multiplied with dout directly.
2. If weight is a tensor, it is first broadcast to dout's shape and then multiplied elementwise.
3. When X and Y differ in shape and the dimension to broadcast is a leading one (e.g. x: 1x2x3, y: 2x2x3), x does not need to be expanded to y's shape.
4. When the dimension to expand sits in the middle (e.g. x: 2x1x3, y: 2x2x3), x must be expanded first, the gradient computed at the expanded shape, and the result then reduced back to x's own shape.

Case 4 costs about as much as the original implementation, and the amount of computation grows from case 1 to case 4, so case 1 should benefit the most and case 4 the least (a sketch of the scalar-weight fast path follows below).
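For illustration, a minimal sketch of the case-1 fast path (scalar weight, with x, y, and out sharing one shape), written in the style of the PR's CUDA kernels; the kernel name is illustrative, the launch configuration is omitted, and it relies on Paddle's CUDA_KERNEL_LOOP_TYPE macro as the PR's kernels do.

template <typename T>
__global__ void LerpGradScalarSketch(const T* weight,
                                     const T* dout,
                                     T* dx,
                                     T* dy,
                                     const int64_t out_size) {
  // Forward is out = x + w * (y - x), so dy = w * dout and dx = (1 - w) * dout.
  CUDA_KERNEL_LOOP_TYPE(idx, out_size, int64_t) {
    T temp = weight[0] * dout[idx];
    dy[idx] = temp;
    dx[idx] = dout[idx] - temp;
  }
}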

@paddle-bot commented Sep 10, 2022

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@paddle-bot added the contributor (External developers) and status: proposed labels on Sep 10, 2022
@paddle-bot commented Sep 10, 2022

✅ This PR's description meets the template requirements!
Please wait for other CI results.

@Rayman96 marked this pull request as ready for review on September 14, 2022 12:25
@Rayman96 changed the title from "For lerp" to "[Hackathon No.36] Optimize the computation performance of the lerp_grad op on GPU" on Sep 14, 2022
@Rayman96 (Contributor, Author)

@Ligoml There are currently two failing CI pipelines. CE-Framework reports a bug in test_quantile.py, which does not seem to test the operator I modified.
CI-Coverage reports insufficient coverage. Is the coverage measured against the unit test file paddle/python/paddle/fluid/tests/unittests/test_lerp_op.py? The existing unit test indeed does not cover everything; should I update that test file?

@Rayman96 (Contributor, Author) commented Sep 15, 2022

Commented out one newly added unit test.

The result of that test is actually correct: the printed grad values all match, but check_grad in OpTest reports a diff larger than 1e-7; the failure seems to come from the assertLessEqual check.

@Rayman96

This comment was marked as resolved.

@Rayman96 (Contributor, Author)

Could anyone advise what is causing the current CE failure?
Also, the CI-Coverage failure points at the CPU-side code, but the CPU path is a direct copy of the existing implementation, so it should be fine.

@ZzSean (Contributor) commented Sep 28, 2022

I suggest not adding code in the .cc file; keeping it in the impl.h should be enough, and that way the coverage check should not complain.

@Rayman96 (Contributor, Author)

> I suggest not adding code in the .cc file; keeping it in the impl.h should be enough, and that way the coverage check should not complain.

OK, I will give that a try.

@Rayman96 force-pushed the for_lerp branch 2 times, most recently from 303ecf0 to 89e45e1, on September 28, 2022 08:53
@Rayman96 (Contributor, Author) commented Sep 28, 2022

@ZzSean CI-build reports "refusing to merge unrelated histories"; what could be the reason? Only two commits are kept now.

Update: after rerunning CI-Clone the error above is gone, but the build now reports "docker md5 check failed" 😂

@CLAassistant commented Sep 29, 2022

CLA assistant check
All committers have signed the CLA.

@Rayman96 force-pushed the for_lerp branch 2 times, most recently from 81b4f44 to 170b04d, on September 29, 2022 01:58
@luotao1 (Contributor) commented Sep 29, 2022

> CI-build reports "refusing to merge unrelated histories"; what could be the reason? Only two commits are kept now.

This has nothing to do with the number of commits; you need to merge the latest develop branch.

@Rayman96 (Contributor, Author)

> CI-build reports "refusing to merge unrelated histories"; what could be the reason? Only two commits are kept now.
> This has nothing to do with the number of commits; you need to merge the latest develop branch.

OK, thanks. That should have fixed it.

Resolved (outdated) review threads on: paddle/phi/kernels/gpu/lerp_grad_kernel.cu, python/paddle/fluid/tests/unittests/test_lerp_op.py
// x: 1*2*3, y: 2*2*3, no reduce needed.

int reduce_flag = XYNeedReduce(x, y, out);
if (reduce_flag == 0) {

Contributor:

A suggestion: maybe you do not need to distinguish this many cases.
For dy there are two operations: 1. a broadcast multiply of weight with out_grad, call the result tmp; 2. reduce tmp to y's shape.
For dx, likewise: 1. take tmp from the first step as input and apply a scale, a*tmp+b with a=-1, b=1; 2. then reduce the result to x's shape.
Both the broadcast and the reduce have ready-made kernels that can be called directly. This would make the overall code structure much clearer and remove many of the conditionals.

Contributor (Author):

> A suggestion: maybe you do not need to distinguish this many cases. For dy there are two operations: 1. a broadcast multiply of weight with out_grad, call the result tmp; 2. reduce tmp to y's shape. For dx, likewise: 1. take tmp from the first step as input and apply a scale, a*tmp+b with a=-1, b=1; 2. then reduce the result to x's shape. Both the broadcast and the reduce have ready-made kernels that can be called directly, which would make the code structure clearer and remove many conditionals.

The current computation logic is actually quite similar to this. The cases are distinguished because, when dx or dy does not need a reduce (which also happens for some shape mismatches), that step can be skipped to gain speed; this is controlled by reduce_flag. The case analysis looks complicated, but in the end there are really only two situations: reduce needed or not (see the CPU sketch below).
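For illustration only (not Paddle code): a self-contained CPU sketch of the dataflow discussed above, assuming weight has already been broadcast to out_grad's shape, y has out_grad's shape, and x is broadcast along the leading axis; dx is written as dout - tmp, which equals (1 - w) * dout.

#include <cstddef>
#include <vector>

// tmp = w * dout (broadcast multiply); dy keeps out_grad's shape; dx is the
// sum of (dout - tmp) reduced back to x's size. With a leading-axis broadcast,
// dout index i maps to x index i % x_size.
void LerpGradReference(const std::vector<float>& w,     // already broadcast to dout's size
                       const std::vector<float>& dout,
                       std::size_t x_size,
                       std::vector<float>* dx,
                       std::vector<float>* dy) {
  const std::size_t out_size = dout.size();
  dy->assign(out_size, 0.f);
  dx->assign(x_size, 0.f);
  for (std::size_t i = 0; i < out_size; ++i) {
    const float tmp = w[i] * dout[i];    // step 1: broadcast multiply
    (*dy)[i] = tmp;
    (*dx)[i % x_size] += dout[i] - tmp;  // step 2: reduce back to x's shape
  }
}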

auto* dy = y_grad;
DDim dx_dims;
DDim dy_dims;
Eigen::DSizes<int, D * 2> dx_reshape_dims;

Contributor:

Could this reduce use Paddle's ReduceKernel instead? Its performance should be better than Eigen, and the GetRduceResult wrapper would not be needed. The ElementwiseGrad family of operators can serve as a reference.

Contributor (Author):

> Could this reduce use Paddle's ReduceKernel instead? Its performance should be better than Eigen, and the GetRduceResult wrapper would not be needed. The ElementwiseGrad family of operators can serve as a reference.

OK, I will test Paddle's ReduceKernel for comparison. I wrote my own wrapper earlier because I could not find the name of the corresponding kernel.

int out_rank = out_dims.size();
int smaller_rank = std::min(x_rank, y_rank);
if (std::max(x_rank, y_rank) < out_rank) {
return 1;

Contributor:

1 -> true (return a bool rather than an int)

int y_idx = y_rank - i;
int out_idx = out_rank - i;
if (x_dims[x_idx] != y_dims[y_idx]) {
return 1;

Contributor:

Same as above.

namespace phi {

template <typename T>
__global__ void GetLerpGrad(const T* weight,

Contributor:

Please rename this kernel; LerpGradKernelImpl would be better.

}

template <typename T>
__global__ void GetLerpGradRankZero(const T* weight,

Contributor:

LerpGradScalarKernelImpl

const int x_size,
const int y_size) {
CUDA_KERNEL_LOOP_TYPE(idx, out_size, int64_t) {
T temp_dx = weight[0] * dout[idx];

Contributor:

When weight is a scalar, it can be loaded into a register outside the loop to avoid re-reading it on every iteration.

Contributor (Author):

OK, I will change it.
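A minimal sketch of the suggested change (illustrative only; the sketch name is not the PR's kernel, and the x_size/y_size parameters of the real kernel are omitted):

template <typename T>
__global__ void LerpGradScalarRegisterSketch(const T* weight,
                                             const T* dout,
                                             T* dx,
                                             T* dy,
                                             const int64_t out_size) {
  // Read the scalar weight from global memory once and keep it in a register,
  // instead of re-reading weight[0] on every loop iteration.
  const T w = weight[0];
  CUDA_KERNEL_LOOP_TYPE(idx, out_size, int64_t) {
    const T w_dout = w * dout[idx];
    dy[idx] = w_dout;                  // dy = w * dout
    dx[idx] = dout[idx] - w_dout;      // dx = (1 - w) * dout
  }
}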

@@ -1,4 +1,4 @@
// Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.

Contributor:

This file is not newly added by this PR; do not change the copyright year.

Contributor (Author):

> This file is not newly added by this PR; do not change the copyright year.

done

namespace phi {

template <typename T>
__global__ void LerpGradKernelImpl(const T* weight,

Contributor:

This computation could apparently be implemented with ElementwiseKernel; for the multi-output usage see:

phi::funcs::ElementwiseKernel<T, decltype(functor), 2>(
dev_ctx, ins, &outs, functor);

Contributor (Author):

> This computation could apparently be implemented with ElementwiseKernel; for the multi-output usage see:
> phi::funcs::ElementwiseKernel<T, decltype(functor), 2>(dev_ctx, ins, &outs, functor);

When optimizing another operator earlier, I found ElementwiseKernel was not as fast as a hand-written kernel; I will benchmark both here and keep whichever is faster.


Contributor:

Keeping the current implementation is OK for this PR; please fix the CI issue first:
[screenshot]

Contributor (Author):

OK, fixed; let's watch the CI.

}

template <typename T>
__global__ void LerpGradScalarKernelImpl(const T* weight,

Contributor:

Same as above.

DenseTensor b_out = phi::EmptyLike<T>(ctx, out_grad);
std::vector<DenseTensor*> out_tensors = {&b_weight, &b_out};

phi::BroadcastTensorsKernel<T, Context>(ctx, in_tensors, out_tensors);

@Xreki (Contributor), Oct 8, 2022:

BroadcastKernel could also be called here; it should perform somewhat better.

Contributor (Author):

> BroadcastKernel could also be called here; it should perform somewhat better.

OK, I will test it.


Contributor:

Keeping the current implementation is OK for this PR; please fix the CI issue first.

Contributor (Author):

> Keeping the current implementation is OK for this PR; please fix the CI issue first.

Fixed; the CI checks have basically all passed. One more, Kunlun-kp-build, reports a failure, and I do not have permission to restart the pipeline. Could someone help restart it?


Contributor:

This CI is not required; it failed due to a network issue. I have checked, and it can be ignored.

@AnnaTrainingG (Contributor)

LGTM for PR-CI-Kunlun-KP-Build

const int y_grad_size,
T* x_grad_data,
T* y_grad_data) {
if (weight.dims().size() == 1) {

Contributor:

Isn't the condition in this if wrong? It should probably be weight.numel() == 1. I checked, and the unit tests have no case where w is 1-D with a shape other than [1].

Contributor (Author):

> Isn't the condition in this if wrong? It should probably be weight.numel() == 1. I checked, and the unit tests have no case where w is 1-D with a shape other than [1].

Yes, the bug is fixed. In the unit tests, TestLerpBroadWToXY was also changed to a 1-D weight with a shape other than [1], so it now covers both broadcasting and the case mentioned here (a minimal sketch of the corrected check follows).
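For clarity, a minimal sketch of the corrected condition (illustrative; the surrounding dispatch is omitted):

// A 1-D weight such as shape [3] has rank 1 but is not a scalar,
// so the scalar fast path is detected by element count, not by rank.
const bool weight_is_scalar = (weight.numel() == 1);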

if (weight.dims().size() == 1) {
// condition when weight is a scalar
const T* weight_data = weight.data<T>();
// const T weight_scalar = weight_data[0];

Contributor:

Useless comments can be removed.

Contributor (Author):

> Useless comments can be removed.

done

// const T weight_scalar = weight_data[0];
const T* out_grad_data = out_grad.data<T>();
const int out_size = out_grad.numel();
const int weight_size = weight.numel();

Contributor:

int64_t is recommended for these.

Contributor (Author):

> int64_t is recommended for these.

done

const T* dout,
T* dx,
T* dy,
const int out_size,

Contributor:

Does calling this kernel require x, y, and weight to all have the same size? If so, wouldn't passing a single numel be enough?

Contributor (Author):

The sizes of x and y here may differ, e.g. x: [2,3,4], y: [2,2,3,4]. When only axis 0 differs there is no need to broadcast them to the same shape, but the kernel has to distinguish how many elements to process for x and for y.


// broadcast weight with out_grad's dimensions
const std::vector<const DenseTensor*> in_tensors = {&weight, &out_grad};
DenseTensor b_weight = phi::EmptyLike<T>(ctx, out_grad);
DenseTensor b_out = phi::EmptyLike<T>(ctx, out_grad);

Contributor:

As I understand it, out should never need to be broadcast?

Contributor (Author):

> As I understand it, out should never need to be broadcast?

Right, out never needs broadcasting. This step exists because BroadcastTensorsKernel is used to broadcast weight, and out_grad itself is const and cannot be passed as an output argument, hence the extra step.

@Xreki (Contributor) commented Oct 9, 2022

The two existing lerp configurations in OP Benchmark are not very representative, so PaddlePaddle/benchmark#1530 replaces the second (double) configuration with one that is used in a real model and performs poorly. That PR also lists the performance of several other configurations from the model, for reference.

OP Benchmark CI results:

  • Config 1: total forward+backward GPU time speedup 1.77x; backward GPU time speedup 2.26x.
    [screenshots]

  • Config 1: nvprof results after this PR's optimization:
    [screenshot]

  • Config 2: total forward+backward GPU time speedup 686x; backward GPU time speedup 1597x.
    [screenshot]

  • Config 2: nvprof results after this PR's optimization:
    [screenshot]

The optimization results are summarized in the table below and meet the Hackathon acceptance criteria:
[screenshot]


If you are interested in further optimization, consider the following:

  1. There is currently no performance number for a configuration where w.shape is not [1]; the BroadcastTensorsKernel + LerpGradKernelImpl sequence at L123 - L146 could be replaced with a single BroadcastKernel call.
  2. nvprof shows a [CUDA memcpy DtoD], presumably from the backward pass; it is worth tracking down its source to see whether it can be avoided.

@Xreki (Contributor) left a comment:

The performance optimization has been verified and accepted, see #45946 (comment). Please also add the logic for x_grad and y_grad possibly being nullptr.

// check if x_grad and y_grad need to be reduced
// if x has a different dimension with y or weight in the middle axis, then
// they need to be broadcast and then reduced.
bool reduce_flag = XYNeedReduce(x, y, out);

Contributor:

if (dx) {
dx_dims = phi::funcs::ExtendDims2Rank(dx->dims(), D);
phi::funcs::GetBroadcastDims<D>(dx_dims, dout_dims, &dx_bcast_dims);
}
if (dy) {
dy_dims = phi::funcs::ExtendDims2Rank(dy->dims(), D);
phi::funcs::GetBroadcastDims<D>(dy_dims, dout_dims, &dy_bcast_dims);
}
phi::funcs::GetBroadcastDims<D>(w_dims, dout_dims, &w_bcast_dims);
auto eigen_w = phi::EigenTensor<T, D>::From(w, w_dims);
auto eigen_dout = phi::EigenTensor<T, D>::From(dout);

Note that the develop implementation checks whether dx and dy are null. For any Paddle operator with multiple gradient outputs, every gradient variable may be nullptr, so the corresponding logic still needs to be added.

Contributor (Author):

OK, I will add it and push again (a sketch of the guard is given below).
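A minimal sketch of the kind of guard being requested, assuming the phi convention of allocating outputs with ctx.template Alloc<T>(...) (illustrative only):

// Both gradient outputs may be nullptr; allocate and compute only the
// gradients that were actually requested.
T* x_grad_data = nullptr;
T* y_grad_data = nullptr;
if (x_grad) {
  x_grad_data = ctx.template Alloc<T>(x_grad);
}
if (y_grad) {
  y_grad_data = ctx.template Alloc<T>(y_grad);
}
// The kernels then skip writing dx or dy when the corresponding pointer is null.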

if (!reduce_flag) {
int x_grad_size = 0, y_grad_size = 0;
T* x_grad_data;
T* y_grad_data;

Contributor:

It would be best to initialize these pointers to nullptr.

DenseTensor b_xgrad = phi::EmptyLike<T, Context>(ctx, out_grad);
DenseTensor b_ygrad = phi::EmptyLike<T, Context>(ctx, out_grad);
T* x_grad_data;
T* y_grad_data;

Contributor:

It would be best to initialize these pointers to nullptr.

Contributor (Author):

Changed.

Contributor (Author):

All CI checks have passed.

@Xreki (Contributor) left a comment:

LGTM. Good work~

@Rayman96 (Contributor, Author) commented Oct 10, 2022

> If you are interested in further optimization, consider the following:
> 1. There is currently no performance number for a configuration where w.shape is not [1]; the BroadcastTensorsKernel + LerpGradKernelImpl sequence at L123 - L146 could be replaced with a single BroadcastKernel call.
> 2. nvprof shows a [CUDA memcpy DtoD], presumably from the backward pass; it is worth tracking down its source to see whether it can be avoided.

Reply:
1. I added one more configuration to the benchmark: x: [16, 1, 1, 1], y: [16, 3, 244, 244], w: [3, 1, 244], where all three need broadcasting. The forward pass uses the same implementation, so there is no gain there; the backward pass is roughly 100x faster. Detailed results to follow (updated).
[screenshot]

Replacing the two kernels with one will be done in the next round of optimization.
2. The [CUDA memcpy DtoD] seems, from the execution graph, to be introduced by the benchmark run itself. Testing other operators with the benchmark, I also see a memcpy DtoD inserted between the forward and backward passes; I have not yet looked into the exact cause.
[screenshot]

@Xreki merged commit ef61df3 into PaddlePaddle:develop on Oct 10, 2022
@Xreki (Contributor) commented Oct 10, 2022

@Rayman96 This DtoD copy is not produced by the benchmark. See PaddlePaddle/benchmark#1530: I ran nvprof on develop and there is no such DtoD copy there.

@Rayman96 (Contributor, Author)

> PaddlePaddle/benchmark#1530

That is indeed odd. Could it be that there is no DtoD there because PaddlePaddle/benchmark#1530 uses the all-Eigen implementation? In my local tests, a DtoD seems to appear whenever a hand-written kernel is involved.
[screenshots]
