[Hackathon No.33] Optimize the GPU computing performance of the erfinv op for Paddle #199
| Case No. | device | input_shape | input_type | Paddle Perf(ms) |
|---|---|---|---|---|
| 1 | RTX 2070s | [-1L, 204800L] | float32 | 0.1438 |
| 2 | RTX 2070s | [10L, 20L, 30L, 40L, 5L, 6L] | float64 8| 8.6485 |
float64 8
There seems to be a problem with this data.
Typo, corrected.
PyTorch's implementation of the Erfinv operator runs on the GPU. Overall forward performance is as follows (based on PyTorch v1.12):
| Case No. | device | input_shape | input_type | Paddle Perf(ms) |
Paddle Perf(ms)
Shouldn't this be changed to Pytorch Perf(ms)?
Typo, corrected.
## 2.1 Key modules and performance improvement points
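The computation uses Paddle's internal Elementwise Kernel. The operator is optimized through vectorized reads, vectorized writes, and the thread configuration method in gpu_launch_config.h, with an expected speedup of 1.2x.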
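Even with the estimated 1.2x speedup, the resulting numbers would still fall short of torch's performance. It may be worth checking whether the two implementations differ at the underlying C++ level.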
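@JamesLim-sy I tried torch's C++ implementation approach and also an implementation based on the ndtri function; neither gave a noticeable performance improvement. In the end I used the CUDA built-in function, which gave a speedup of more than 2x, and an improvement of more than 1x over torch as well.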
LGTM
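For readers unfamiliar with the two ideas discussed in this thread (vectorized reads and writes, and the CUDA built-in inverse error function), here is a minimal standalone sketch. It is not the code from this PR and does not use Paddle's Elementwise Kernel or gpu_launch_config.h. The names `ErfinvVec4Kernel` and `ErfinvOnGpu`, the block size of 256, and the choice of `float4` are illustrative assumptions. It covers float32 only through `erfinvf`; a float64 variant would presumably call the double-precision `erfinv` instead.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Abort with a message if a CUDA runtime call fails (hypothetical helper).
#define CUDA_CHECK(call)                                              \
  do {                                                                \
    cudaError_t err_ = (call);                                        \
    if (err_ != cudaSuccess) {                                        \
      std::fprintf(stderr, "CUDA error: %s\n",                        \
                   cudaGetErrorString(err_));                         \
      std::exit(1);                                                   \
    }                                                                 \
  } while (0)

// Each thread processes groups of four contiguous floats with one 128-bit
// load and one 128-bit store, then helps with the scalar tail.
__global__ void ErfinvVec4Kernel(const float* __restrict__ in,
                                 float* __restrict__ out, size_t n) {
  size_t tid = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
  size_t stride = static_cast<size_t>(gridDim.x) * blockDim.x;
  size_t n_vec = n / 4;  // number of complete float4 groups

  // Safe here because cudaMalloc returns suitably aligned pointers.
  const float4* in4 = reinterpret_cast<const float4*>(in);
  float4* out4 = reinterpret_cast<float4*>(out);

  for (size_t i = tid; i < n_vec; i += stride) {
    float4 v = in4[i];    // vectorized read
    v.x = erfinvf(v.x);   // CUDA built-in inverse error function
    v.y = erfinvf(v.y);
    v.z = erfinvf(v.z);
    v.w = erfinvf(v.w);
    out4[i] = v;          // vectorized write
  }

  // Remaining 0 to 3 elements that do not fill a float4.
  for (size_t i = n_vec * 4 + tid; i < n; i += stride) {
    out[i] = erfinvf(in[i]);
  }
}

// Host wrapper: copies the input to the device, runs the kernel, copies back.
std::vector<float> ErfinvOnGpu(const std::vector<float>& host_in) {
  size_t n = host_in.size();
  std::vector<float> host_out(n);
  if (n == 0) return host_out;

  float* d_in = nullptr;
  float* d_out = nullptr;
  CUDA_CHECK(cudaMalloc(&d_in, n * sizeof(float)));
  CUDA_CHECK(cudaMalloc(&d_out, n * sizeof(float)));
  CUDA_CHECK(cudaMemcpy(d_in, host_in.data(), n * sizeof(float),
                        cudaMemcpyHostToDevice));

  const int threads = 256;  // assumed block size, not tuned
  // One thread per float4 group, rounded up so there is at least one block.
  const int blocks = static_cast<int>((n / 4 + threads) / threads);
  ErfinvVec4Kernel<<<blocks, threads>>>(d_in, d_out, n);
  CUDA_CHECK(cudaGetLastError());

  CUDA_CHECK(cudaMemcpy(host_out.data(), d_out, n * sizeof(float),
                        cudaMemcpyDeviceToHost));
  CUDA_CHECK(cudaFree(d_in));
  CUDA_CHECK(cudaFree(d_out));
  return host_out;
}
```

Calling `ErfinvOnGpu` on a small vector and checking that `erf` of each result recovers the input is a quick sanity test. The sketch says nothing about the speedups reported above, which depend on the actual Paddle kernel and hardware.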
Optimize the GPU computing performance of the erfinv op for Paddle
Task: PaddlePaddle/Paddle#44072 (comment)