[HybridParallel]Add Recompute for PipeLineParallel #34607
Conversation
Thanks for your contribution!
ctx.tensor_shapes.append(arg.shape)
partition = _split_activation(arg.detach()).clone()
# TODO(shenliang03) not use calculate stream to D2H to speed
arg = partition.cpu() if _recompute_offload else partition
Offload in dygraph is Sooooo easy!!! lol
I think fleet.utils.recompute could be handled in the same way.
Yes; for now this supports hybrid_parallel first.
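To make the save path under review concrete, here is a minimal standalone sketch of the offload pattern. The real `_split_activation` (which partitions the activation across model-parallel ranks) is replaced by a trivial stand-in and the `ctx` object is mocked; only the detach → clone → `.cpu()` flow mirrors the diff above, so treat this as an illustration rather than the actual implementation.

```python
import types
import paddle

def _split_activation_stub(tensor):
    # Stand-in for the real _split_activation, which slices the
    # activation across model-parallel ranks before saving it.
    return tensor

def save_activation(ctx, arg, recompute_offload=True):
    # Record the shape so the tensor can be rebuilt in the backward pass.
    ctx.tensor_shapes.append(arg.shape)
    # Detach so the saved copy carries no autograd history, then clone it.
    partition = _split_activation_stub(arg.detach()).clone()
    # Optionally copy the saved activation to host memory (D2H) so the
    # GPU copy can be freed; for now this runs on the compute stream.
    return partition.cpu() if recompute_offload else partition

ctx = types.SimpleNamespace(tensor_shapes=[])
saved = save_activation(ctx, paddle.randn([4, 8]))
print(saved.place, ctx.tensor_shapes)
```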
    tensor_shapes[i])
tensors[i].stop_gradient = state
inputs[idx] = tensors[i].cuda(
    device_id) if _recompute_offload else tensors[i]
Should we sync here?
That is, wait for the H2D copy to finish before running the following computation.
cpu() is a synchronous operation, so we don't need to do this.
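For reference, a hedged sketch of the restore path being discussed: rebuild the saved tensor from its recorded shape, restore its stop_gradient flag, and copy it back to the GPU (H2D) when it was offloaded. Because Tensor.cpu() and Tensor.cuda() are blocking copies in dygraph, no extra stream synchronization is added, which is the point of this thread. The names mirror the diff, but this is not the full implementation and it assumes a GPU device is available.

```python
import paddle

def restore_activation(saved, shape, stop_gradient,
                       recompute_offload=True, device_id=0):
    # Rebuild the tensor with the shape recorded during the forward pass.
    tensor = saved.reshape(shape)
    # Recover whether this input originally required gradients.
    tensor.stop_gradient = stop_gradient
    # H2D copy back to the GPU if the activation was offloaded;
    # Tensor.cuda() blocks until the copy finishes, so no explicit
    # synchronization is needed before the following computation.
    return tensor.cuda(device_id) if recompute_offload else tensor
```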
LGTM
LGTM
LGTM for op_function_generator
PR types
New features
PR changes
Others
Describe
Add Recompute for PipeLineParallel
1. Interface form (see the hedged sketch below)
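The interface details from the original description did not survive the export. As an illustration only, the recompute options are exposed through the hybrid-parallel PipelineLayer constructor; the argument names below (recompute_interval, recompute_offload, recompute_partition), the surrounding fleet setup, and the hcg.topology() call reflect my reading of the change rather than a verbatim copy of the PR, and the script assumes a multi-GPU launch via paddle.distributed.launch.

```python
import paddle.nn as nn
from paddle.distributed import fleet
from paddle.distributed.fleet.meta_parallel import LayerDesc, PipelineLayer

# Hybrid-parallel setup: 2 pipeline stages in this sketch.
strategy = fleet.DistributedStrategy()
strategy.hybrid_configs = {"dp_degree": 1, "mp_degree": 1, "pp_degree": 2}
strategy.pipeline_configs = {"accumulate_steps": 4, "micro_batch_size": 2}
fleet.init(is_collective=True, strategy=strategy)
hcg = fleet.get_hybrid_communicate_group()

class MLPPipe(PipelineLayer):
    def __init__(self, **kwargs):
        descs = [LayerDesc(nn.Linear, 256, 256) for _ in range(8)]
        super().__init__(layers=descs, loss_fn=nn.MSELoss(), **kwargs)

model = MLPPipe(
    topology=hcg.topology(),
    recompute_interval=1,      # assumed: recompute every 1-layer segment
    recompute_offload=True,    # assumed: offload saved activations to CPU
    recompute_partition=True,  # assumed: split saved activations across MP ranks
)
model = fleet.distributed_model(model)
```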
2. Feature support
Compared with Paddle's native recompute, there are the following differences:
3. Performance comparison
GPT-117M model, V100-32G, FP32, MP=4, PP=2, microbatch=2, global_batch_size=128; GPU memory usage on the middle cards:
Question: why does the recompute + offload + MP-partition combination use more memory than expected?
The memory reported by nvidia-smi may already have been freed but is still cached by Paddle.
4. Accuracy comparison
Accuracy verified on GPT-117M under MP2_PP2
DP2_MP2_PP2 + AMP
5. TODO