Error during multi-GPU training with accelerate #2735

xienan0326 · 2024-03-07T07:50:10Z

Reminder

I have read the README and searched the existing issues.

Reproduction

accelerate launch --config_file config.yaml src/train_bash.py \
    --stage dpo \
    --do_train \
    --model_name_or_path out_dir/sft_test \
    --dataset comparison_gpt4_zh \
    --template default \
    --finetuning_type full \
    --output_dir out_dir/dpo_test \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 2 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 1e-5 \
    --num_train_epochs 1.0 \
    --plot_loss \
    --fp16

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Traceback (most recent call last):
  File "/workspace/LLaMA-Factory/src/train_bash.py", line 14, in <module>
    main()
  File "/workspace/LLaMA-Factory/src/train_bash.py", line 5, in main
    run_exp()
  File "/workspace/LLaMA-Factory/src/llmtuner/train/tuner.py", line 38, in run_exp
    run_dpo(model_args, data_args, training_args, finetuning_args, callbacks)
  File "/workspace/LLaMA-Factory/src/llmtuner/train/dpo/workflow.py", line 47, in run_dpo
    trainer = CustomDPOTrainer(
  File "/workspace/LLaMA-Factory/src/llmtuner/train/dpo/trainer.py", line 62, in __init__
    self.ref_model = self.accelerator.prepare_model(self.ref_model, evaluation_mode=True)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1287, in prepare_model
    raise ValueError(
ValueError: You can't train a model that has been loaded with device_map='auto' in any distributed mode. Please rerun your script specifying --num_processes=1 or by launching with python {{myscript.py}}.
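
Editorial note: the error message says that a model loaded with device_map='auto' cannot be prepared by accelerate when several processes are running. The traceback shows the check firing on the DPO reference model. One common way to avoid this class of error is to choose the device map from the launcher's environment, so that each process keeps one full copy of the model on its own GPU. The sketch below is only an illustration of that idea, not the change made in the repository. The helper name `choose_device_map` is hypothetical, and it assumes the launcher exports the `WORLD_SIZE` and `LOCAL_RANK` environment variables.

```python
import os


def choose_device_map(env=None):
    """Pick a device_map value for from_pretrained (hypothetical helper).

    Assumes the launcher sets WORLD_SIZE and LOCAL_RANK for each process.
    """
    env = os.environ if env is None else env
    world_size = int(env.get("WORLD_SIZE", "1"))
    if world_size > 1:
        # Distributed run: put the whole model on this process's GPU
        # instead of sharding it across every visible device.
        local_rank = int(env.get("LOCAL_RANK", "0"))
        return {"": local_rank}
    # Single process: automatic placement across devices is acceptable.
    return "auto"


if __name__ == "__main__":
    print(choose_device_map())
```

The returned value would then be passed as the `device_map` argument when loading the model, for example `AutoModelForCausalLM.from_pretrained(path, device_map=choose_device_map())`.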

Expected behavior

How can I fix this so that training runs normally?

System Info

No response

Others

No response

hiyouga · 2024-03-07T07:51:05Z

What error are you getting?

xienan0326 · 2024-03-07T07:52:28Z

> What error are you getting?

Traceback (most recent call last):
  File "/workspace/LLaMA-Factory/src/train_bash.py", line 14, in <module>
    main()
  File "/workspace/LLaMA-Factory/src/train_bash.py", line 5, in main
    run_exp()
  File "/workspace/LLaMA-Factory/src/llmtuner/train/tuner.py", line 38, in run_exp
    run_dpo(model_args, data_args, training_args, finetuning_args, callbacks)
  File "/workspace/LLaMA-Factory/src/llmtuner/train/dpo/workflow.py", line 47, in run_dpo
    trainer = CustomDPOTrainer(
  File "/workspace/LLaMA-Factory/src/llmtuner/train/dpo/trainer.py", line 62, in __init__
    self.ref_model = self.accelerator.prepare_model(self.ref_model, evaluation_mode=True)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1287, in prepare_model
    raise ValueError(
ValueError: You can't train a model that has been loaded with device_map='auto' in any distributed mode. Please rerun your script specifying --num_processes=1 or by launching with python {{myscript.py}}.

hiyouga · 2024-03-07T08:16:43Z

Please update the code and try again.

hiyouga added the "pending" label (This problem is yet to be addressed) Mar 7, 2024

hiyouga closed this as completed in f74f804 Mar 7, 2024

hiyouga reopened this Mar 7, 2024

xienan0326 closed this as completed Mar 7, 2024

hiyouga added the "solved" label (This problem has been already solved) and removed the "pending" label (This problem is yet to be addressed) Mar 7, 2024

tybalex pushed a commit to sanjay920/LLaMA-Factory that referenced this issue Mar 15, 2024

fix hiyouga#2735

93c90fc
