
[Question] Why are trainable parameters forcibly cast to full precision? #4549

Closed
LaniakeaS opened this issue Jun 26, 2024 · 10 comments
Labels
solved This problem has been already solved

Comments

@LaniakeaS

I was attempting full-parameter fine-tuning and ran out of GPU memory. After some debugging I found that LLaMA-Factory forcibly sets the precision to fp32. Since I am using DeepSpeed, I cannot use the pure bf16 option.

I would like to ask why this step is necessary. Could bf16 and fp16 also be supported when DeepSpeed is used?

@github-actions github-actions bot added the pending This problem is yet to be addressed label Jun 26, 2024
@hiyouga
Owner

hiyouga commented Jun 26, 2024

DeepSpeed does not support pure_bf16.

@hiyouga hiyouga added solved This problem has been already solved and removed pending This problem is yet to be addressed labels Jun 26, 2024
@hiyouga hiyouga closed this as completed Jun 26, 2024
@LaniakeaS
Author

You misunderstood me; I am not asking for pure-bf16 support. For example, the line param.data.to(torch.float32) in _setup_full_tuning in adapter.py caused an OOM for me, and after commenting out that line I was able to train. So what I am asking for is a parameter that lets me choose whether to apply this float32 cast.
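
For context, the cast under discussion looks roughly like the sketch below. This is not the actual LLaMA-Factory code, and the gating flag cast_trainable_params_to_fp32 is hypothetical; it only illustrates the kind of switch being requested:

```python
import torch
from torch import nn


def setup_full_tuning(model: nn.Module, cast_trainable_params_to_fp32: bool = True) -> None:
    """Mark all parameters trainable and optionally upcast them to fp32."""
    for param in model.parameters():
        param.requires_grad_(True)
        if cast_trainable_params_to_fp32:
            # This cast roughly doubles parameter memory for a bf16/fp16
            # checkpoint, which is what triggered the OOM reported above.
            param.data = param.data.to(torch.float32)


if __name__ == "__main__":
    model = nn.Linear(4, 4).to(torch.bfloat16)
    setup_full_tuning(model, cast_trainable_params_to_fp32=False)  # keep weights in bf16
    print(next(model.parameters()).dtype)  # torch.bfloat16
```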

@hiyouga
Owner

hiyouga commented Jun 27, 2024

Which DeepSpeed stage are you using?

@LaniakeaS
Author

LaniakeaS commented Jun 27, 2024

zero-3

@hiyouga
Owner

hiyouga commented Jun 27, 2024

In theory, ZeRO-3 should not reach that code path. Are you using the latest code?

@LaniakeaS
Author

Sorry, my mistake: I switched back to ZeRO-3 after hitting the OOM. Under ZeRO-2, the fp32 cast triggered the OOM, and after commenting out the cast-to-fp32 line I was able to train under ZeRO-2.

hiyouga added a commit that referenced this issue Jun 27, 2024
@hiyouga
Owner

hiyouga commented Jun 27, 2024

Try running again with pure_bf16: true and bf16: true.

PrimaLuz pushed a commit to PrimaLuz/LLaMA-Factory that referenced this issue Jul 1, 2024
@hzwwww

hzwwww commented Jul 7, 2024

@LaniakeaS Hi, after commenting out the cast to fp32 I got the following error 😂. Have you run into it?

  File "src/train.py", line 28, in <module>
    main()
  File "src/train.py", line 19, in main
    run_exp()
  File "/LLaMA-Factory-0.8.2/src/llamafactory/train/tuner.py", line 47, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/LLaMA-Factory-0.8.2/src/llamafactory/train/sft/workflow.py", line 88, in run_sft
    train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
  File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 1932, in train
    return inner_training_loop(
  File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 2030, in _inner_training_loop
    self.optimizer, self.lr_scheduler = deepspeed_init(self, num_training_steps=max_steps)
  File "/opt/conda/lib/python3.8/site-packages/transformers/integrations/deepspeed.py", line 393, in deepspeed_init
    hf_deepspeed_config.trainer_config_finalize(args, model, num_training_steps)
  File "/opt/conda/lib/python3.8/site-packages/transformers/integrations/deepspeed.py", line 265, in trainer_config_finalize
    raise ValueError(
ValueError: Please correct the following DeepSpeed config values that mismatch TrainingArguments values:
- ds fp16.enabled=false vs hf fp16|fp16_full_eval+fp16_backend(amp)=False
The easiest method is to set these DeepSpeed config values to 'auto'.

@LaniakeaS
Author

I haven't run into that. It looks like a conflict between the DeepSpeed config and the Hugging Face arguments; check your config file or the arguments you pass.
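
For reference, the ValueError above means the DeepSpeed JSON hard-codes fp16.enabled while the Trainer's fp16/bf16 flags say otherwise, and the message itself suggests setting those entries to "auto" so they inherit from TrainingArguments. A minimal sketch of such a config (the file name and the ZeRO-2 settings are illustrative only):

```python
import json

# DeepSpeed config where the mixed-precision switches follow TrainingArguments
# instead of being hard-coded, which is what the ValueError asks for.
ds_config = {
    "fp16": {"enabled": "auto"},  # inherits --fp16 from TrainingArguments
    "bf16": {"enabled": "auto"},  # inherits --bf16 from TrainingArguments
    "zero_optimization": {"stage": 2},
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

with open("ds_zero2_auto.json", "w") as f:  # file name is illustrative
    json.dump(ds_config, f, indent=2)
```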

@LaniakeaS
Author

Try running again with pure_bf16: true and bf16: true.

It works now, thanks.

xtchen96 pushed a commit to xtchen96/LLaMA-Factory that referenced this issue Jul 17, 2024