Gradient accumulation breaks with transformers >= 4.46.0 #6639

Open · 1 task done
gom168 opened this issue Jan 14, 2025 · 3 comments · May be fixed by #6628

Labels
bug (Something isn't working), pending (This problem is yet to be addressed)

Comments


gom168 commented Jan 14, 2025

Reminder

  • I have read the above rules and searched the existing issues.

System Info

  • llamafactory version: 0.9.1
  • Platform: Linux-5.4.0-42-generic-x86_64-with-glibc2.35
  • Python version: 3.10.16
  • PyTorch version: 2.4.0+cu124 (GPU)
  • Transformers version: 4.46.1
  • Datasets version: 3.1.0
  • Accelerate version: 1.0.1
  • PEFT version: 0.12.0
  • TRL version: 0.8.6
  • GPU type: NVIDIA A800-SXM4-80GB
  • DeepSpeed version: 0.14.4

Reproduction

The dataset has 30 examples in total; SFT was run on a single GPU with:
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --num_train_epochs 4 \
num_items_in_batch:525, rank=0
num_items_in_batch:525, rank=0
num_items_in_batch:525, rank=0
num_items_in_batch:525, rank=0
{'loss': 0.9863, 'grad_norm': 14.851867126840325, 'learning_rate': 4.267766952966369e-06, 'epoch': 0.93}                  
 25%|█████████████████████▌                                                                | 7/28 [00:42<01:46,  5.06s/it]
num_items_in_batch:279, rank=0
num_items_in_batch:279, rank=0
num_items_in_batch:470, rank=0
num_items_in_batch:470, rank=0
{'loss': 1.2364, 'grad_norm': 22.580278741351034, 'learning_rate': 4.058724504646834e-06, 'epoch': 1.07}                  
 29%|████████████████████████▌                                                             | 8/28 [00:46<01:38,  4.92s/it]
num_items_in_batch:470, rank=0
num_items_in_batch:470, rank=0
num_items_in_batch:1392, rank=0
num_items_in_batch:1392, rank=0
{'loss': 0.4617, 'grad_norm': 10.032145902847288, 'learning_rate': 3.830080191288342e-06, 'epoch': 1.2}                   
 32%|███████████████████████████▋                                                          | 9/28 [00:51<01:31,  4.80s/it]
num_items_in_batch:1392, rank=0
num_items_in_batch:1392, rank=0
num_items_in_batch:347, rank=0
num_items_in_batch:347, rank=0
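
Reading the log: with 30 examples, per_device_train_batch_size 1 and gradient_accumulation_steps 4, each epoch contains 30 / 4 = 7.5 accumulation windows, so the trainer runs 7 optimizer steps per epoch (28 in total, matching the progress bar) and every epoch boundary falls inside a window. At exactly those steps the four micro-batches of one window report two different num_items_in_batch values (e.g. 279, 279, 470, 470 above) instead of a single shared total like the 525 printed four times at earlier steps.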

Others

transformers changed the gradient accumulation formula in versions after 4.46: it now additionally computes num_items_in_batch and uses it to normalize the accumulated loss. However, during SFT training, if epochs > 1 and the number of training examples is not exactly divisible by the gradient accumulation steps, the computation of num_items_in_batch is wrong when a window crosses an epoch boundary, which directly affects the final loss. We also tested the latest transformers release and the problem is fixed there, but LLaMA-Factory is not yet compatible with transformers 4.48. Is there a good solution?
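
To make the normalization issue concrete, here is a minimal, self-contained sketch (illustrative code only, not LLaMA-Factory or transformers internals; the window totals 525 and 279/470 are taken from the log above, and the constant per-token loss value is made up):

    import torch

    # Hypothetical per-token losses for one accumulation window of 4
    # micro-batches (525 tokens in total, as in the first logged window).
    # A constant loss of 2.0 makes the correct mean obvious.
    micro = [torch.full((n,), 2.0) for n in (100, 150, 120, 155)]

    # transformers >= 4.46 scheme: each micro-batch contributes
    # sum(token_losses) / num_items_in_batch, where num_items_in_batch is
    # the trainable-token total of the whole window, so the accumulated
    # loss equals the true token-level mean.
    total = sum(t.numel() for t in micro)          # 525
    loss_ok = sum(t.sum() / total for t in micro)  # == 2.0

    # The reported bug: when a window straddles an epoch boundary, the
    # denominator changes mid-window (the log shows 279, 279, 470, 470),
    # so the micro-batches are normalized inconsistently and that step's
    # loss is skewed, consistent with the loss jump at epoch 1.07 above.
    denoms = (279, 279, 470, 470)
    loss_buggy = sum(t.sum() / d for t, d in zip(micro, denoms))

    print(loss_ok.item(), loss_buggy.item())  # 2.0 vs. ~2.96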

gom168 added the bug and pending labels on Jan 14, 2025
hiyouga linked a pull request on Jan 14, 2025 that will close this issue
hiyouga (Owner) commented Jan 14, 2025

This issue may be fixed by #6628. However, we have observed another issue in the latest version of transformers, so we will merge #6628 after the next transformers release.

hiyouga closed this as completed on Jan 14, 2025
hiyouga reopened this on Jan 14, 2025
hiyouga (Owner) commented Jan 14, 2025

For now, you can use transformers 4.45.2 as a workaround.
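
For example, pinning the version with pip:

    pip install "transformers==4.45.2"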

gom168 (Author) commented Jan 14, 2025

OK, thanks for your patience.
