grad_norm is nan when fine-tuning with S^2 Attention in half precision #4160

fffffq99 · 2024-06-08T10:00:11Z

Reminder

I have read the README and searched the existing issues.

System Info

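I am trying to fine-tune an LLM with the LongLoRA method. With S^2 Attention enabled and the precision type set to fp16, the terminal always reports grad_norm as nan. This happens even when fine-tuning on the example dataset "identity.json": the loss is shown as 0, yet grad_norm is still nan.
However, if I change nothing else and switch the precision type to fp32, grad_norm looks normal.
My PyTorch version is 2.2.2.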

Reproduction

Train

Expected behavior

No response

Others

This issue has been resolved. The problem is in longlora.py#cat:
attn_output = torch.cat( ( attn_output[:, :, : self.num_heads // 2], attn_output[:, :, self.num_heads // 2 :].roll(groupsz // 2, dims=1), ) )
This torch.cat call does not specify dim, so it defaults to dim=0, whereas logically it should be dim=2. The code that follows does reshape the result, but in 16-bit precision the backward pass still seems unable to find the positions that actually correspond.
In my tests, specifying dim=2 fixes the problem:
attn_output = torch.cat( ( attn_output[:, :, : self.num_heads // 2], attn_output[:, :, self.num_heads // 2 :].roll(groupsz // 2, dims=1), ), dim=2 )
Alternatively, assign in place, as the original LongLoRA code does:
attn_output[:, :, self.num_heads // 2 :] = attn_output[:, :, self.num_heads // 2 :].roll(groupsz // 2, dims=1)
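To make the difference concrete, here is a minimal sketch on a small CPU tensor. It assumes attn_output has shape (bsz, q_len, num_heads, head_dim) at this point; the function names and the toy sizes are made up for illustration and are not taken from longlora.py. It only shows that the default dim=0 produces a differently shaped tensor whose later reshape succeeds but scrambles element positions. It does not reproduce the fp16 nan itself.

```python
import torch


def unshift_default_dim(attn_output, num_heads, groupsz):
    # Same call as in the report: no dim given, so torch.cat uses dim=0.
    return torch.cat(
        (
            attn_output[:, :, : num_heads // 2],
            attn_output[:, :, num_heads // 2 :].roll(groupsz // 2, dims=1),
        )
    )


def unshift_dim2(attn_output, num_heads, groupsz):
    # Proposed fix: concatenate along the head dimension.
    return torch.cat(
        (
            attn_output[:, :, : num_heads // 2],
            attn_output[:, :, num_heads // 2 :].roll(groupsz // 2, dims=1),
        ),
        dim=2,
    )


def unshift_assign(attn_output, num_heads, groupsz):
    # Slice-assignment variant; works on a clone so the input is not mutated.
    out = attn_output.clone()
    out[:, :, num_heads // 2 :] = attn_output[:, :, num_heads // 2 :].roll(
        groupsz // 2, dims=1
    )
    return out


# Hypothetical toy sizes, chosen only to keep the tensors small.
bsz, q_len, num_heads, head_dim, groupsz = 2, 8, 4, 3, 4
attn_output = torch.arange(
    bsz * q_len * num_heads * head_dim, dtype=torch.float32
).reshape(bsz, q_len, num_heads, head_dim)

buggy = unshift_default_dim(attn_output, num_heads, groupsz)
fixed = unshift_dim2(attn_output, num_heads, groupsz)
assigned = unshift_assign(attn_output, num_heads, groupsz)

print(tuple(buggy.shape))  # (4, 8, 2, 3): batch doubled, heads halved
print(tuple(fixed.shape))  # (2, 8, 4, 3): same shape as the input

# The element count is unchanged, so the later reshape does not raise,
# but the values end up in different positions.
hidden_size = num_heads * head_dim
buggy_flat = buggy.reshape(bsz, q_len, hidden_size)
fixed_flat = fixed.reshape(bsz, q_len, hidden_size)
same_layout = torch.equal(buggy_flat, fixed_flat)
same_as_assign = torch.equal(fixed, assigned)
print(same_layout, same_as_assign)
```

Because the reshape never fails, the mistake is silent in the forward pass, which may be why it only showed up as an abnormal grad_norm.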


hiyouga · 2024-06-10T16:41:01Z

Fixed, thank you for helping us to identify this critical issue.

Before a793e84:

After a793e84:

hiyouga added the pending (This problem is yet to be addressed) label Jun 8, 2024

hiyouga closed this as completed in a793e84 Jun 10, 2024

hiyouga added the solved (This problem has been already solved) label and removed the pending (This problem is yet to be addressed) label Jun 10, 2024
