In the pretraining section of the report, I saw that each layer of the model is placed on a different GPU for training. After splitting by layer, how is each layer's gradient computed? For example, suppose the input layer 1 passes to layer 2 is h1, and backpropagation through layer 2 yields the gradient of the loss with respect to h1 (call it g1). When updating layer 1's parameters, is the chain rule still applied (g1 multiplied by the derivative of h1 with respect to layer 1's parameters)? Or is training done layer by layer (treating each layer as an independent model, differentiating the loss with respect to h1 directly, then using the chain rule to get the remaining gradients)? If it's the former, how is the vanishing-gradient problem handled? (For exploding gradients, it looks like gradient clipping is used with the norm threshold set to 1.0.) Thanks in advance!
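To make the question concrete, here is a minimal single-process sketch of the first interpretation (the ordinary chain rule, just split across a stage boundary). The module names and shapes are made up, and I assume a PyTorch-style autograd; a real pipeline-parallel setup would also add inter-GPU sends/receives and micro-batching:

```python
import torch
import torch.nn as nn

# Two hypothetical pipeline stages; imagine stage1 on GPU 0, stage2 on GPU 1.
stage1 = nn.Linear(16, 16)
stage2 = nn.Linear(16, 1)

x = torch.randn(4, 16)

# Forward: stage 1 produces h1, which is handed to stage 2.
h1 = stage1(x)
h1_boundary = h1.detach().requires_grad_(True)  # the cross-GPU boundary
loss = stage2(h1_boundary).pow(2).mean()

# Backward on stage 2: yields g1 = dLoss/dh1 at the boundary.
loss.backward()
g1 = h1_boundary.grad

# Backward on stage 1: seeding h1's backward pass with g1 applies the
# chain rule, so stage 1's parameter gradients are g1 times dh1/dparams --
# exactly standard backprop, merely computed in two pieces.
h1.backward(g1)

# Gradient clipping with max_norm=1.0, as mentioned above. In a real
# multi-GPU run the global norm would need a cross-device reduction.
torch.nn.utils.clip_grad_norm_(
    list(stage1.parameters()) + list(stage2.parameters()), max_norm=1.0
)
```

Under this reading, the split changes *where* each piece of the backward pass runs, not *what* is computed, which is why I'm asking how vanishing gradients are dealt with.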