
ValueError: Attempting to unscale FP16 gradients. #1764

Closed
Lucien20000118 opened this issue Dec 7, 2023 · 19 comments
Labels
solved This problem has been already solved

Comments

@Lucien20000118

I ran this command.

CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage sft \
    --model_name_or_path openlm-research/open_llama_7b \
    --do_train \
    --dataset train \
    --template default \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir checkpoint \
    --overwrite_cache \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 2000 \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --plot_loss \
    --fp16
[INFO|training_args.py:1345] 2023-12-07 06:09:02,164 >> Found safetensors installation, but --save_safetensors=False. Safetensors should be a preferred weights saving format due to security and performance reasons. If your model cannot be saved by safetensors please feel free to open an issue at https://github.com/huggingface/safetensors!
[INFO|training_args.py:1798] 2023-12-07 06:09:02,164 >> PyTorch: setting up devices
[INFO|trainer.py:1760] 2023-12-07 06:09:03,760 >> ***** Running training *****
[INFO|trainer.py:1761] 2023-12-07 06:09:03,761 >>   Num examples = 78,303
[INFO|trainer.py:1762] 2023-12-07 06:09:03,761 >>   Num Epochs = 3
[INFO|trainer.py:1763] 2023-12-07 06:09:03,761 >>   Instantaneous batch size per device = 4
[INFO|trainer.py:1766] 2023-12-07 06:09:03,761 >>   Total train batch size (w. parallel, distributed & accumulation) = 16
[INFO|trainer.py:1767] 2023-12-07 06:09:03,761 >>   Gradient Accumulation steps = 4
[INFO|trainer.py:1768] 2023-12-07 06:09:03,761 >>   Total optimization steps = 14,682
[INFO|trainer.py:1769] 2023-12-07 06:09:03,762 >>   Number of trainable parameters = 4,194,304
  0%|                                                                                                                                                                                               | 0/14682 [00:00<?, ?it/s][WARNING|logging.py:290] 2023-12-07 06:09:03,766 >> You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
Traceback (most recent call last):
  File "/workspace/LLaMA-Factory/src/train_bash.py", line 14, in <module>
    main()
  File "/workspace/LLaMA-Factory/src/train_bash.py", line 5, in main
    run_exp()
  File "/workspace/LLaMA-Factory/src/llmtuner/train/tuner.py", line 26, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/workspace/LLaMA-Factory/src/llmtuner/train/sft/workflow.py", line 68, in run_sft
    train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1591, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1950, in _inner_training_loop
    self.accelerator.clip_grad_norm_(
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2040, in clip_grad_norm_
    self.unscale_gradients()
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2003, in unscale_gradients
    self.scaler.unscale_(opt)
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py", line 307, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py", line 229, in _unscale_grads_
    raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.

It worked fine when I used it yesterday, but after I changed the dataset size today this error appeared. What is going on?
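For context, the traceback ends inside torch's GradScaler, which refuses to unscale gradients that are already stored in fp16, so the error typically appears when the trainable parameters (here, the LoRA adapters) end up in half precision while --fp16 mixed-precision training is active. A minimal sketch that reproduces the same condition outside LLaMA-Factory (the toy Linear model is purely illustrative and assumes a CUDA device):

import torch

# Parameters cast to fp16: any trainable fp16 parameter makes
# GradScaler._unscale_grads_ raise this same ValueError.
model = torch.nn.Linear(8, 8).cuda().half()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(4, 8, device="cuda", dtype=torch.float16)
loss = model(x).sum()

scaler.scale(loss).backward()
scaler.unscale_(optimizer)  # ValueError: Attempting to unscale FP16 gradients.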

@zhuxh529

zhuxh529 commented Dec 7, 2023

I ran into this problem too, also after enlarging the training dataset a bit. I am using chatglm3.

@Liusifei

Liusifei commented Dec 7, 2023

Same here when scaling up the train set.

@GXKIM

GXKIM commented Dec 8, 2023

I ran into this problem too, also after enlarging the training dataset a bit. I am using chatglm3.

+1

@GXKIM

GXKIM commented Dec 8, 2023

I appended my own dataset after alpaca_zh.

@zhuxh529

zhuxh529 commented Dec 8, 2023

I am using a server provided by modelscope and have to reinstall the dependencies every time I log in. Is it the same for everyone else?

@GXKIM

GXKIM commented Dec 8, 2023

I am using a server provided by modelscope and have to reinstall the dependencies every time I log in. Is it the same for everyone else?

I am running locally on my own machine.

@GXKIM

GXKIM commented Dec 8, 2023

I am using a server provided by modelscope and have to reinstall the dependencies every time I log in. Is it the same for everyone else?

Previously I always created my own separate json files, but that was on a much older version.

@Lucien20000118
Author

I created my own json file. On Tuesday a dataset of about 50,000 records trained fine, but on Thursday, after expanding it to over 70,000 records, the problem above appeared.

@Lucien20000118
Author

Lucien20000118 commented Dec 8, 2023

I am using a server provided by modelscope and have to reinstall the dependencies every time I log in. Is it the same for everyone else?

I train on a runpod cloud GPU, so yes, I do have to reinstall the dependencies every time.

@GXKIM

GXKIM commented Dec 8, 2023

It is unlikely to be a dataset problem; it looks more like a dependency version issue.

@Lucien20000118
Author

It is unlikely to be a dataset problem; it looks more like a dependency version issue.

You are right, testing with a smaller dataset still does not work.
So it should be a problem with the requirements, but I installed the versions via pip install -r requirements.txt, so normally there should not be an issue.
Or perhaps an update in the last few days pulled in a newer version of some package.
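One quick way to confirm which versions are actually installed in the environment (a generic pip check, not specific to LLaMA-Factory):

pip list | grep -E "torch|transformers|datasets|accelerate|peft|trl"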

@hiyouga
Owner

hiyouga commented Dec 8, 2023

provide your system info

@hiyouga hiyouga added the pending This problem is yet to be addressed label Dec 8, 2023
@GXKIM

GXKIM commented Dec 8, 2023

provide your system info

linux centos 7

torch 1.13.1
transformers 4.34.1
datasets 2.14.7
accelerate 0.25.0
peft 0.7.0
trl 0.7.4

@Lucien20000118
Author

provide your system info

ubuntu 22.04

pytorch 2.0.1
python 3.10
cuda 11.8.0
accelerate 0.25.0
transformers 4.34.1
datasets 2.14.7
peft 0.7.0
trl 0.7.4

@hiyouga
Owner

hiyouga commented Dec 8, 2023

We recommend using peft==0.6.0
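In a pip-based environment the downgrade can be done with, for example:

pip install peft==0.6.0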

@GXKIM

GXKIM commented Dec 8, 2023

We recommend using peft==0.6.0

Thank you for your reply; the error has been resolved.

@hiyouga hiyouga added the bug Something isn't working label Dec 8, 2023
@hiyouga hiyouga closed this as completed in d42c0b1 Dec 8, 2023
@zhuxh529

zhuxh529 commented Dec 9, 2023

Thanks for the answer. It runs after switching to peft==0.6.0.

@Lucien20000118
Author

Thank you for replying.

hiyouga added a commit that referenced this issue Dec 11, 2023
@hiyouga hiyouga added solved This problem has been already solved and removed bug Something isn't working pending This problem is yet to be addressed labels Dec 11, 2023
@Cauthygaussian

We recommend using peft==0.6.0

Thank you very much, this solved the problem for me.
