
[Question] Why are trainable parameters forcibly cast to full precision? #4549

Closed
LaniakeaS opened this issue Jun 26, 2024 · 10 comments
Labels
solved This problem has been already solved

Comments

@LaniakeaS

I was attempting full-parameter fine-tuning and ran out of GPU memory. After some debugging I found that LLaMA-Factory forcibly sets the precision to fp32. Since I am using DeepSpeed, I cannot use the pure bf16 option.

I would like to ask why this step is necessary. Could bf16 and fp16 also be supported when DeepSpeed is used?

@github-actions github-actions bot added the pending This problem is yet to be addressed label Jun 26, 2024
@hiyouga
Owner

hiyouga commented Jun 26, 2024

DeepSpeed does not support pure_bf16.

@hiyouga hiyouga added solved This problem has been already solved and removed pending This problem is yet to be addressed labels Jun 26, 2024
@hiyouga hiyouga closed this as completed Jun 26, 2024
@LaniakeaS
Author

You misunderstood me; I am not asking for pure-bf16 support. For example, the line param.data.to(torch.float32) in _setup_full_tuning in adapter.py caused an OOM for me, and after commenting out that line I was able to train. So what I am asking for is a parameter that lets me choose whether to apply this float32 cast.
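
For context, the cast under discussion looks roughly like the sketch below. This is not the actual LLaMA-Factory code, and the gating flag cast_trainable_params_to_fp32 is hypothetical; it only illustrates the kind of switch being requested:

```python
import torch
from torch import nn


def setup_full_tuning(model: nn.Module, cast_trainable_params_to_fp32: bool = True) -> None:
    """Mark all parameters trainable and optionally upcast them to fp32."""
    for param in model.parameters():
        param.requires_grad_(True)
        if cast_trainable_params_to_fp32:
            # This cast roughly doubles parameter memory for a bf16/fp16
            # checkpoint, which is what triggered the OOM reported above.
            param.data = param.data.to(torch.float32)


if __name__ == "__main__":
    model = nn.Linear(4, 4).to(torch.bfloat16)
    setup_full_tuning(model, cast_trainable_params_to_fp32=False)  # keep weights in bf16
    print(next(model.parameters()).dtype)  # torch.bfloat16
```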

@hiyouga
Owner

hiyouga commented Jun 27, 2024

Which DeepSpeed stage are you using?

@LaniakeaS
Author

LaniakeaS commented Jun 27, 2024

zero-3

@hiyouga
Owner

hiyouga commented Jun 27, 2024

In theory, ZeRO-3 should not reach that code path. Are you using the latest code?

@LaniakeaS
Author

Sorry, my mistake: I switched back to ZeRO-3 after hitting the OOM. Under ZeRO-2, the fp32 cast triggered the OOM, and after commenting out the cast-to-fp32 line I was able to train under ZeRO-2.

hiyouga added a commit that referenced this issue Jun 27, 2024
@hiyouga
Owner

hiyouga commented Jun 27, 2024

Try running again with pure_bf16: true and bf16: true.

PrimaLuz pushed a commit to PrimaLuz/LLaMA-Factory that referenced this issue Jul 1, 2024
@hzwwww

hzwwww commented Jul 7, 2024

@LaniakeaS Hi, after commenting out the cast to fp32 I got the following error 😂. Have you run into it?

  File "src/train.py", line 28, in <module>
    main()
  File "src/train.py", line 19, in main
    run_exp()
  File "/LLaMA-Factory-0.8.2/src/llamafactory/train/tuner.py", line 47, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/LLaMA-Factory-0.8.2/src/llamafactory/train/sft/workflow.py", line 88, in run_sft
    train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
  File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 1932, in train
    return inner_training_loop(
  File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 2030, in _inner_training_loop
    self.optimizer, self.lr_scheduler = deepspeed_init(self, num_training_steps=max_steps)
  File "/opt/conda/lib/python3.8/site-packages/transformers/integrations/deepspeed.py", line 393, in deepspeed_init
    hf_deepspeed_config.trainer_config_finalize(args, model, num_training_steps)
  File "/opt/conda/lib/python3.8/site-packages/transformers/integrations/deepspeed.py", line 265, in trainer_config_finalize
    raise ValueError(
ValueError: Please correct the following DeepSpeed config values that mismatch TrainingArguments values:
- ds fp16.enabled=false vs hf fp16|fp16_full_eval+fp16_backend(amp)=False
The easiest method is to set these DeepSpeed config values to 'auto'.

@LaniakeaS
Author

I haven't run into that. It looks like a conflict between the DeepSpeed config and the Hugging Face arguments; check your config file or the arguments you pass.
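
For reference, the ValueError above means the DeepSpeed JSON hard-codes fp16.enabled while the Trainer's fp16/bf16 flags say otherwise, and the message itself suggests setting those entries to "auto" so they inherit from TrainingArguments. A minimal sketch of such a config (the file name and the ZeRO-2 settings are illustrative only):

```python
import json

# DeepSpeed config where the mixed-precision switches follow TrainingArguments
# instead of being hard-coded, which is what the ValueError asks for.
ds_config = {
    "fp16": {"enabled": "auto"},  # inherits --fp16 from TrainingArguments
    "bf16": {"enabled": "auto"},  # inherits --bf16 from TrainingArguments
    "zero_optimization": {"stage": 2},
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

with open("ds_zero2_auto.json", "w") as f:  # file name is illustrative
    json.dump(ds_config, f, indent=2)
```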

@LaniakeaS
Author

Try running again with pure_bf16: true and bf16: true.

It works now, thanks.

xtchen96 pushed a commit to xtchen96/LLaMA-Factory that referenced this issue Jul 17, 2024