Optimize update_loss_scaling_op #32554
Conversation
Thanks for your contribution!
LGTM
```cpp
auto starts_h_tensor =
    memory::Alloc(platform::CPUPlace(), (xs_size + 1) * sizeof(int64_t));
int64_t* starts_h = reinterpret_cast<int64_t*>(starts_h_tensor->ptr());

auto starts_d_tensor =
    memory::Alloc(dev_ctx, (xs_size + 1) * sizeof(int64_t));
int64_t* starts_d = reinterpret_cast<int64_t*>(starts_d_tensor->ptr());
```
Some suggestions about variable names:
starts_h_tensor --> h_in_starts_mem
starts_h --> h_in_starts
starts_d_tensor --> d_in_starts_mem
starts_d --> d_in_starts
Done
```cpp
size_t xs_size = xs.size();
// alloc each tensor's start index and copy to device
auto starts_h_tensor =
    memory::Alloc(platform::CPUPlace(), (xs_size + 1) * sizeof(int64_t));
```
You can construct the platform::CPUPlace() object once and reuse it afterwards; there is no need to build a temporary object repeatedly.
Done
```cpp
auto outs_addr_h_tensor =
    memory::Alloc(platform::CPUPlace(), xs_size * sizeof(T*));
T** outs_addr_h = reinterpret_cast<T**>(outs_addr_h_tensor->ptr());

auto outs_addr_d_tensor = memory::Alloc(dev_ctx, xs_size * sizeof(T*));
T** outs_addr_d = reinterpret_cast<T**>(outs_addr_d_tensor->ptr());
```
Some suggestions about variable names:
outs_addr_h_tensor --> h_out_addrs_mem
outs_addr_h --> h_out_addrs
outs_addr_d_tensor --> d_out_addrs_mem
outs_addr_d --> d_out_addrs
Done
```cpp
int64_t block = std::min(static_cast<int64_t>(1024), total_num);
int64_t block_num = block * 50;  // each thread deals with 50 elements
int64_t grid = (total_num + block_num - 1) / block_num;
```
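The launch-configuration arithmetic above can be checked on the host. The sketch below is a plain C++ restatement (the `LaunchConfig` struct and `MakeLaunchConfig` helper are illustrative names, not Paddle code) using the identifiers suggested later in this review:

```cpp
#include <algorithm>
#include <cstdint>

// Host-side model of the fused kernel's launch configuration:
// each thread handles 50 elements, so one block of threads_per_block
// threads covers threads_per_block * 50 elements.
struct LaunchConfig {
  int64_t threads_per_block;
  int64_t elements_per_block;
  int64_t blocks_per_grid;
};

LaunchConfig MakeLaunchConfig(int64_t total_num) {
  LaunchConfig cfg;
  cfg.threads_per_block = std::min(static_cast<int64_t>(1024), total_num);
  cfg.elements_per_block = cfg.threads_per_block * 50;  // 50 elements per thread
  // Ceiling division so every element is covered by some block.
  cfg.blocks_per_grid =
      (total_num + cfg.elements_per_block - 1) / cfg.elements_per_block;
  return cfg;
}
```

For example, 100000 total elements yield 1024 threads per block and 2 blocks, instead of the roughly 98 blocks a one-element-per-thread mapping would need.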
block --> threads_per_block
block_num --> elements_per_block
grid --> blocks_per_grid
Done
```cpp
const int tid = threadIdx.x + blockIdx.x * blockDim.x;

// copy starts array from global memory to shared memory
extern __shared__ int64_t starts_s[];
```
starts_s --> s_starts
Done
```cpp
for (int64_t id = tid; id < total_num; id += blockDim.x * gridDim.x) {
  // get the "out" index of "id"
  int next_out_index = out_index;
  while (id < starts_s[next_out_index]) next_out_index++;
```
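The index search in this loop can be modeled on the host. In the sketch below (an illustrative CPU rewrite, not the Paddle kernel; `OwnerIndex` and its parameters are assumed names), `starts` holds the exclusive prefix sums of the tensor sizes, so the owner of flattened element `id` is the unique `j` with `starts[j] <= id < starts[j + 1]`. Because each thread visits ids in increasing order, the index only ever needs to move forward, which is why the backward-scanning `while` in the original kernel was dead code:

```cpp
#include <cstdint>
#include <vector>

// Given exclusive prefix sums of tensor sizes (starts[0] == 0,
// starts.back() == total element count), return the index of the
// output tensor that owns flattened element `id`, resuming the scan
// from `hint` (valid when ids are visited in increasing order).
int OwnerIndex(const std::vector<int64_t>& starts, int64_t id, int hint) {
  int j = hint;
  // Advance while `id` lies past the end of tensor j; empty tensors
  // are skipped naturally because starts[j] == starts[j + 1] for them.
  while (id >= starts[j + 1]) ++j;
  return j;
}
```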
The code in line 57 will never be triggered.
Verified: this line is indeed never reached. It has been removed.
LGTM.
* optimize update_loss_scaling_op by fused for loop to one kernel, test=develop
* remove useless while loop and optimize variable name, test=develop
* optimize variable name from out_addrs_tensor to out_addrs_mem, test=develop
* optimize variable name for readable by change prefix identifier from t_ to local_
PR types
Performance optimization
PR changes
OPs
Describe
Motivation:
Similar to CheckFiniteAndUnscale, the timeline shows that update_loss_scaling_op calls FillIf many times within a single run, up to 300 times, and each call comprises several small kernels, so there is room for optimization.

Code analysis:
As before, the original code contains a for loop: outs is a vector<Tensor*>, and the loop calls FillIf once for every tensor in it, no matter how large each tensor is.

Optimization
Optimization 1:
commit id: ad79dff
Clearly, the most effective change is to fuse the kernels: remove the outer for loop so that a single kernel launch suffices regardless of xs.size(). The basic idea is the same as in PR #31954 and is not repeated here. One additional note: since FillIf simply assigns value to each element of outs, letting each thread handle only one element would spawn far too many threads and leave compute resources underutilized. To improve this, each thread processes 50 elements, which lowers the warp-switching overhead.

Optimization 2:
commit id: 527779a
Removed the useless line while (id < s_starts[index]) index++; from the check_finite_and_unscale and update_loss_scaling_op kernels; it was verified never to be reached in either kernel. Also renamed the variables in both kernels to make them clearer.

Results:
update_loss_scaling_op kernel time
ResNet50 AMP model speed (single V100-SXM2-16GB card)
ResNet50 convergence verification
Model script: [ResNet50_fp16.sh]
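To illustrate the fusion described in Optimization 1, the loop below is a simplified CPU stand-in for the fused CUDA kernel (the `FusedFill` name and the sequential traversal are assumptions for illustration, not Paddle code): a single pass over the flattened index space writes `value` into every output buffer, replacing the per-tensor FillIf launches.

```cpp
#include <cstdint>
#include <vector>

// CPU sketch of the fused fill: instead of launching one kernel per
// output tensor, one pass walks the flattened index space
// [0, total_num) and routes each element to its owning buffer via
// the prefix-sum `starts` array (starts[0] == 0, starts.back() ==
// total element count).
template <typename T>
void FusedFill(const std::vector<int64_t>& starts,
               std::vector<std::vector<T>>& outs, T value) {
  int64_t total_num = starts.back();
  int out_index = 0;
  for (int64_t id = 0; id < total_num; ++id) {
    // Advance to the tensor that owns element `id`; ids increase,
    // so the index only moves forward.
    while (id >= starts[out_index + 1]) ++out_index;
    outs[out_index][id - starts[out_index]] = value;
  }
}
```

In the real kernel each GPU thread covers a grid-stride slice of the same index space, but the ownership lookup is identical.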