
accelerate multi-GPU training error #2735

Closed
xienan0326 opened this issue Mar 7, 2024 · 3 comments
Labels: solved (This problem has been already solved)

Comments


xienan0326 commented Mar 7, 2024

Reminder

  • I have read the README and searched the existing issues.

Reproduction

accelerate launch --config_file config.yaml src/train_bash.py \
--stage dpo \
--do_train \
--model_name_or_path out_dir/sft_test \
--dataset comparison_gpt4_zh \
--template default \
--finetuning_type full \
--output_dir out_dir/dpo_test \
--per_device_train_batch_size 8 \
--gradient_accumulation_steps 2 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--save_steps 1000 \
--learning_rate 1e-5 \
--num_train_epochs 1.0 \
--plot_loss \
--fp16

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
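
As a side note (not part of the original report), a minimal sketch that can be launched with the same config to confirm the multi-GPU settings above are actually being picked up; the file name is illustrative:

# check_accelerate_config.py -- hypothetical helper, not part of LLaMA-Factory.
# Launch with: accelerate launch --config_file config.yaml check_accelerate_config.py
from accelerate import Accelerator

accelerator = Accelerator()
if accelerator.is_main_process:
    # With the config above this should report MULTI_GPU, 4 processes and fp16.
    print(f"distributed_type: {accelerator.distributed_type}")
    print(f"num_processes:    {accelerator.num_processes}")
    print(f"mixed_precision:  {accelerator.mixed_precision}")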

Traceback (most recent call last):
  File "/workspace/LLaMA-Factory/src/train_bash.py", line 14, in <module>
    main()
  File "/workspace/LLaMA-Factory/src/train_bash.py", line 5, in main
    run_exp()
  File "/workspace/LLaMA-Factory/src/llmtuner/train/tuner.py", line 38, in run_exp
    run_dpo(model_args, data_args, training_args, finetuning_args, callbacks)
  File "/workspace/LLaMA-Factory/src/llmtuner/train/dpo/workflow.py", line 47, in run_dpo
    trainer = CustomDPOTrainer(
  File "/workspace/LLaMA-Factory/src/llmtuner/train/dpo/trainer.py", line 62, in __init__
    self.ref_model = self.accelerator.prepare_model(self.ref_model, evaluation_mode=True)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1287, in prepare_model
    raise ValueError(
ValueError: You can't train a model that has been loaded with device_map='auto' in any distributed mode. Please rerun your script specifying --num_processes=1 or by launching with python {{myscript.py}}.
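
For context, my reading of this error (not an official fix): accelerate refuses to prepare a model that was loaded with device_map='auto' once more than one process is launched, since 'auto' already spreads the weights across the visible GPUs for single-process use. A rough sketch of the general pattern, with the model path taken from the command above and the WORLD_SIZE check being my assumption about how the launcher exposes the process count:

import os
from transformers import AutoModelForCausalLM

# WORLD_SIZE is set by the distributed launcher when num_processes > 1.
world_size = int(os.environ.get("WORLD_SIZE", "1"))

if world_size > 1:
    # Distributed run: load without device_map; accelerator.prepare_model()
    # later places each replica on its local GPU.
    model = AutoModelForCausalLM.from_pretrained("out_dir/sft_test")
else:
    # Single process: device_map="auto" is allowed (big-model inference style).
    model = AutoModelForCausalLM.from_pretrained("out_dir/sft_test", device_map="auto")

The error message's own workaround (--num_processes=1 or a plain python launch) takes the second branch, trading multi-GPU data parallelism for the single-process sharded load.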

Expected behavior

How can this be fixed so that training runs normally?

System Info

No response

Others

No response

hiyouga (Owner) commented Mar 7, 2024

What error are you getting?

hiyouga added the pending (This problem is yet to be addressed) label Mar 7, 2024
xienan0326 (Author) commented:

> What error are you getting?

Traceback (most recent call last):
  File "/workspace/LLaMA-Factory/src/train_bash.py", line 14, in <module>
    main()
  File "/workspace/LLaMA-Factory/src/train_bash.py", line 5, in main
    run_exp()
  File "/workspace/LLaMA-Factory/src/llmtuner/train/tuner.py", line 38, in run_exp
    run_dpo(model_args, data_args, training_args, finetuning_args, callbacks)
  File "/workspace/LLaMA-Factory/src/llmtuner/train/dpo/workflow.py", line 47, in run_dpo
    trainer = CustomDPOTrainer(
  File "/workspace/LLaMA-Factory/src/llmtuner/train/dpo/trainer.py", line 62, in __init__
    self.ref_model = self.accelerator.prepare_model(self.ref_model, evaluation_mode=True)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1287, in prepare_model
    raise ValueError(
ValueError: You can't train a model that has been loaded with device_map='auto' in any distributed mode. Please rerun your script specifying --num_processes=1 or by launching with python {{myscript.py}}.

hiyouga (Owner) commented Mar 7, 2024

Please update the code and try again.

hiyouga closed this as completed in f74f804 Mar 7, 2024
hiyouga reopened this Mar 7, 2024
hiyouga added the solved (This problem has been already solved) label and removed the pending (This problem is yet to be addressed) label Mar 7, 2024
tybalex pushed a commit to sanjay920/LLaMA-Factory that referenced this issue Mar 15, 2024