Gradient accumulation breaks with transformers >= 4.46.0 #6639

Open · 1 task done
gom168 opened this issue Jan 14, 2025 · 3 comments · May be fixed by #6628

Labels
bug (Something isn't working), pending (This problem is yet to be addressed)

Comments


gom168 commented Jan 14, 2025

Reminder

  • I have read the above rules and searched the existing issues.

System Info

  • llamafactory version: 0.9.1
  • Platform: Linux-5.4.0-42-generic-x86_64-with-glibc2.35
  • Python version: 3.10.16
  • PyTorch version: 2.4.0+cu124 (GPU)
  • Transformers version: 4.46.1
  • Datasets version: 3.1.0
  • Accelerate version: 1.0.1
  • PEFT version: 0.12.0
  • TRL version: 0.8.6
  • GPU type: NVIDIA A800-SXM4-80GB
  • DeepSpeed version: 0.14.4

Reproduction

The dataset has 30 examples in total; SFT was run on a single GPU with:
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --num_train_epochs 4 \
num_items_in_batch:525, rank=0
num_items_in_batch:525, rank=0
num_items_in_batch:525, rank=0
num_items_in_batch:525, rank=0
{'loss': 0.9863, 'grad_norm': 14.851867126840325, 'learning_rate': 4.267766952966369e-06, 'epoch': 0.93}                  
 25%|█████████████████████▌                                                                | 7/28 [00:42<01:46,  5.06s/it]
num_items_in_batch:279, rank=0
num_items_in_batch:279, rank=0
num_items_in_batch:470, rank=0
num_items_in_batch:470, rank=0
{'loss': 1.2364, 'grad_norm': 22.580278741351034, 'learning_rate': 4.058724504646834e-06, 'epoch': 1.07}                  
 29%|████████████████████████▌                                                             | 8/28 [00:46<01:38,  4.92s/it]
num_items_in_batch:470, rank=0
num_items_in_batch:470, rank=0
num_items_in_batch:1392, rank=0
num_items_in_batch:1392, rank=0
{'loss': 0.4617, 'grad_norm': 10.032145902847288, 'learning_rate': 3.830080191288342e-06, 'epoch': 1.2}                   
 32%|███████████████████████████▋                                                          | 9/28 [00:51<01:31,  4.80s/it]
num_items_in_batch:1392, rank=0
num_items_in_batch:1392, rank=0
num_items_in_batch:347, rank=0
num_items_in_batch:347, rank=0
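
Reading the log: with 30 examples, per_device_train_batch_size 1 and gradient_accumulation_steps 4, each epoch contains 30 / 4 = 7.5 accumulation windows, so the trainer runs 7 optimizer steps per epoch (28 in total, matching the progress bar) and every epoch boundary falls inside a window. At exactly those steps the four micro-batches of one window report two different num_items_in_batch values (e.g. 279, 279, 470, 470 above) instead of a single shared total like the 525 printed four times at earlier steps.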

Others

transformers changed the gradient accumulation formula in versions after 4.46: it now additionally computes num_items_in_batch and uses it to normalize the accumulated loss. However, during SFT training, if epochs > 1 and the number of training examples is not exactly divisible by the gradient accumulation steps, the computation of num_items_in_batch is wrong when a window crosses an epoch boundary, which directly affects the final loss. We also tested the latest transformers release and the problem is fixed there, but LLaMA-Factory is not yet compatible with transformers 4.48. Is there a good solution?
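
To make the normalization issue concrete, here is a minimal, self-contained sketch (illustrative code only, not LLaMA-Factory or transformers internals; the window totals 525 and 279/470 are taken from the log above, and the constant per-token loss value is made up):

    import torch

    # Hypothetical per-token losses for one accumulation window of 4
    # micro-batches (525 tokens in total, as in the first logged window).
    # A constant loss of 2.0 makes the correct mean obvious.
    micro = [torch.full((n,), 2.0) for n in (100, 150, 120, 155)]

    # transformers >= 4.46 scheme: each micro-batch contributes
    # sum(token_losses) / num_items_in_batch, where num_items_in_batch is
    # the trainable-token total of the whole window, so the accumulated
    # loss equals the true token-level mean.
    total = sum(t.numel() for t in micro)          # 525
    loss_ok = sum(t.sum() / total for t in micro)  # == 2.0

    # The reported bug: when a window straddles an epoch boundary, the
    # denominator changes mid-window (the log shows 279, 279, 470, 470),
    # so the micro-batches are normalized inconsistently and that step's
    # loss is skewed, consistent with the loss jump at epoch 1.07 above.
    denoms = (279, 279, 470, 470)
    loss_buggy = sum(t.sum() / d for t, d in zip(micro, denoms))

    print(loss_ok.item(), loss_buggy.item())  # 2.0 vs. ~2.96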

gom168 added the bug and pending labels on Jan 14, 2025
hiyouga linked a pull request on Jan 14, 2025 that will close this issue
hiyouga (Owner) commented Jan 14, 2025

This issue may be fixed by #6628. However, we have observed another issue in the latest version of transformers, so we will merge #6628 after the next transformers release.

hiyouga closed this as completed on Jan 14, 2025
hiyouga reopened this on Jan 14, 2025
hiyouga (Owner) commented Jan 14, 2025

For now, you can use transformers 4.45.2 as a workaround.
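
For example, pinning the version with pip:

    pip install "transformers==4.45.2"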

gom168 (Author) commented Jan 14, 2025

OK, thanks for your patience.
