
ValueError: Attempting to unscale FP16 gradients. #1764

Closed
Lucien20000118 opened this issue Dec 7, 2023 · 19 comments
Labels
solved This problem has been already solved

Comments

@Lucien20000118

I ran this command.

CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage sft \
    --model_name_or_path openlm-research/open_llama_7b \
    --do_train \
    --dataset train \
    --template default \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir checkpoint \
    --overwrite_cache \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 2000 \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --plot_loss \
    --fp16
[INFO|training_args.py:1345] 2023-12-07 06:09:02,164 >> Found safetensors installation, but --save_safetensors=False. Safetensors should be a preferred weights saving format due to security and performance reasons. If your model cannot be saved by safetensors please feel free to open an issue at https://github.com/huggingface/safetensors!
[INFO|training_args.py:1798] 2023-12-07 06:09:02,164 >> PyTorch: setting up devices
[INFO|trainer.py:1760] 2023-12-07 06:09:03,760 >> ***** Running training *****
[INFO|trainer.py:1761] 2023-12-07 06:09:03,761 >>   Num examples = 78,303
[INFO|trainer.py:1762] 2023-12-07 06:09:03,761 >>   Num Epochs = 3
[INFO|trainer.py:1763] 2023-12-07 06:09:03,761 >>   Instantaneous batch size per device = 4
[INFO|trainer.py:1766] 2023-12-07 06:09:03,761 >>   Total train batch size (w. parallel, distributed & accumulation) = 16
[INFO|trainer.py:1767] 2023-12-07 06:09:03,761 >>   Gradient Accumulation steps = 4
[INFO|trainer.py:1768] 2023-12-07 06:09:03,761 >>   Total optimization steps = 14,682
[INFO|trainer.py:1769] 2023-12-07 06:09:03,762 >>   Number of trainable parameters = 4,194,304
  0%|                                                                                                                                                                                               | 0/14682 [00:00<?, ?it/s][WARNING|logging.py:290] 2023-12-07 06:09:03,766 >> You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
Traceback (most recent call last):
  File "/workspace/LLaMA-Factory/src/train_bash.py", line 14, in <module>
    main()
  File "/workspace/LLaMA-Factory/src/train_bash.py", line 5, in main
    run_exp()
  File "/workspace/LLaMA-Factory/src/llmtuner/train/tuner.py", line 26, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/workspace/LLaMA-Factory/src/llmtuner/train/sft/workflow.py", line 68, in run_sft
    train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1591, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1950, in _inner_training_loop
    self.accelerator.clip_grad_norm_(
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2040, in clip_grad_norm_
    self.unscale_gradients()
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2003, in unscale_gradients
    self.scaler.unscale_(opt)
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py", line 307, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py", line 229, in _unscale_grads_
    raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.

It worked fine when I used it yesterday, but after I changed the dataset size today this error appeared. What is going on?
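For context, the traceback ends inside torch's GradScaler, which refuses to unscale gradients that are already stored in fp16, so the error typically appears when the trainable parameters (here, the LoRA adapters) end up in half precision while --fp16 mixed-precision training is active. A minimal sketch that reproduces the same condition outside LLaMA-Factory (the toy Linear model is purely illustrative and assumes a CUDA device):

import torch

# Parameters cast to fp16: any trainable fp16 parameter makes
# GradScaler._unscale_grads_ raise this same ValueError.
model = torch.nn.Linear(8, 8).cuda().half()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(4, 8, device="cuda", dtype=torch.float16)
loss = model(x).sum()

scaler.scale(loss).backward()
scaler.unscale_(optimizer)  # ValueError: Attempting to unscale FP16 gradients.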

@zhuxh529

zhuxh529 commented Dec 7, 2023

I ran into this problem too, also after enlarging the training dataset a bit. I am using chatglm3.

@Liusifei

Liusifei commented Dec 7, 2023

Same here when scaling up the train set.

@GXKIM

GXKIM commented Dec 8, 2023

I ran into this problem too, also after enlarging the training dataset a bit. I am using chatglm3.

+1

@GXKIM

GXKIM commented Dec 8, 2023

I appended my own dataset after alpaca_zh.

@zhuxh529

zhuxh529 commented Dec 8, 2023

I am using a server provided by modelscope and have to reinstall the dependencies every time I log in. Is it the same for everyone else?

@GXKIM

GXKIM commented Dec 8, 2023

I am using a server provided by modelscope and have to reinstall the dependencies every time I log in. Is it the same for everyone else?

I am running locally on my own machine.

@GXKIM

GXKIM commented Dec 8, 2023

I am using a server provided by modelscope and have to reinstall the dependencies every time I log in. Is it the same for everyone else?

Previously I always created my own separate json files, but that was on a much older version.

@Lucien20000118
Author

I created my own json file. On Tuesday a dataset of about 50,000 records trained fine, but on Thursday, after expanding it to over 70,000 records, the problem above appeared.

@Lucien20000118
Author

Lucien20000118 commented Dec 8, 2023

I am using a server provided by modelscope and have to reinstall the dependencies every time I log in. Is it the same for everyone else?

I train on a runpod cloud GPU, so yes, I do have to reinstall the dependencies every time.

@GXKIM

GXKIM commented Dec 8, 2023

It is unlikely to be a dataset problem; it looks more like a dependency version issue.

@Lucien20000118
Author

It is unlikely to be a dataset problem; it looks more like a dependency version issue.

You are right, testing with a smaller dataset still does not work.
So it should be a problem with the requirements, but I installed the versions via pip install -r requirements.txt, so normally there should not be an issue.
Or perhaps an update in the last few days pulled in a newer version of some package.
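One quick way to confirm which versions are actually installed in the environment (a generic pip check, not specific to LLaMA-Factory):

pip list | grep -E "torch|transformers|datasets|accelerate|peft|trl"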

@hiyouga
Owner

hiyouga commented Dec 8, 2023

provide your system info

@hiyouga hiyouga added the pending This problem is yet to be addressed label Dec 8, 2023
@GXKIM

GXKIM commented Dec 8, 2023

provide your system info

linux centos 7

torch 1.13.1
transformers 4.34.1
datasets 2.14.7
accelerate 0.25.0
peft 0.7.0
trl 0.7.4

@Lucien20000118
Author

provide your system info

ubuntu 22.04

pytorch 2.0.1
python 3.10
cuda 11.8.0
accelerate 0.25.0
transformers 4.34.1
datasets 2.14.7
peft 0.7.0
trl 0.7.4

@hiyouga
Owner

hiyouga commented Dec 8, 2023

We recommend using peft==0.6.0
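In a pip-based environment the downgrade can be done with, for example:

pip install peft==0.6.0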

@GXKIM

GXKIM commented Dec 8, 2023

We recommend using peft==0.6.0

Thank you for your reply; the error has been resolved.

@hiyouga hiyouga added the bug Something isn't working label Dec 8, 2023
@hiyouga hiyouga closed this as completed in d42c0b1 Dec 8, 2023
@zhuxh529

zhuxh529 commented Dec 9, 2023

Thanks for the answer. It runs after switching to peft==0.6.0.

@Lucien20000118
Author

Thank you for replying.

hiyouga added a commit that referenced this issue Dec 11, 2023
@hiyouga hiyouga added solved This problem has been already solved and removed bug Something isn't working pending This problem is yet to be addressed labels Dec 11, 2023
@Cauthygaussian

We recommend using peft==0.6.0

Thank you very much, this solved the problem for me.
