accelerate multi-GPU training error #2735
Comments
What error are you getting?
Traceback (most recent call last):
Update the code and try again.
Reminder
Reproduction
accelerate launch --config_file config.yaml src/train_bash.py \
    --stage dpo \
    --do_train \
    --model_name_or_path out_dir/sft_test \
    --dataset comparison_gpt4_zh \
    --template default \
    --finetuning_type full \
    --output_dir out_dir/dpo_test \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 2 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 1e-5 \
    --num_train_epochs 1.0 \
    --plot_loss \
    --fp16
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
Traceback (most recent call last):
  File "/workspace/LLaMA-Factory/src/train_bash.py", line 14, in <module>
    main()
  File "/workspace/LLaMA-Factory/src/train_bash.py", line 5, in main
    run_exp()
  File "/workspace/LLaMA-Factory/src/llmtuner/train/tuner.py", line 38, in run_exp
    run_dpo(model_args, data_args, training_args, finetuning_args, callbacks)
  File "/workspace/LLaMA-Factory/src/llmtuner/train/dpo/workflow.py", line 47, in run_dpo
    trainer = CustomDPOTrainer(
  File "/workspace/LLaMA-Factory/src/llmtuner/train/dpo/trainer.py", line 62, in __init__
    self.ref_model = self.accelerator.prepare_model(self.ref_model, evaluation_mode=True)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1287, in prepare_model
    raise ValueError(
ValueError: You can't train a model that has been loaded with device_map='auto' in any distributed mode. Please rerun your script specifying --num_processes=1 or by launching with python {{myscript.py}}.
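Judging from the message, the error seems to fire when the reference model was loaded with a device map that spreads it over several GPUs while more than one process is running (num_processes: 4 in the config above). The following is only a simplified sketch of that kind of guard, not accelerate's actual implementation; the function name `check_device_map_for_distributed` and the example device map are hypothetical.

```python
def check_device_map_for_distributed(hf_device_map, num_processes):
    """Simplified stand-in for the guard that raised above (hypothetical, not accelerate's code).

    hf_device_map: dict mapping module names to devices, or None if the
                   model was loaded without a device map.
    num_processes: number of launched training processes.
    """
    if hf_device_map is None:
        return  # no device map, nothing to check
    devices = set(hf_device_map.values())
    # A model sharded over several devices cannot also be replicated per process.
    if len(devices) > 1 and num_processes > 1:
        raise ValueError(
            "model is spread over %d devices but %d processes were launched"
            % (len(devices), num_processes)
        )


# Illustrative device map, as device_map='auto' might produce on a multi-GPU host.
sharded_map = {"model.embed_tokens": 0, "model.layers.0": 0, "model.layers.1": 1, "lm_head": 1}

check_device_map_for_distributed(sharded_map, num_processes=1)   # single process: accepted
check_device_map_for_distributed({"": 2}, num_processes=4)       # whole model on one GPU: accepted
try:
    check_device_map_for_distributed(sharded_map, num_processes=4)
except ValueError as err:
    print("rejected:", err)
```

Under this reading, either launching a single process or keeping each process's copy of the model on one device avoids the check.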
Expected behavior
How can I fix this so that training runs normally?
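The thread's own answer is to update the code and retry; it does not say what the fix changes. One workaround commonly used for this class of error is to place the whole model on the GPU that belongs to the current process instead of using device_map='auto'. A minimal sketch, assuming the launcher sets the `LOCAL_RANK` environment variable for each process; the helper name `per_process_device_map` is hypothetical.

```python
import os


def per_process_device_map(environ=None):
    """Return a device map that puts the entire model on this process's GPU.

    Assumes the launcher exports LOCAL_RANK per process; falls back to
    GPU 0 when the variable is absent (e.g. a plain `python` launch).
    """
    env = os.environ if environ is None else environ
    local_rank = int(env.get("LOCAL_RANK", "0"))
    # The empty-string key means "the whole model".
    return {"": local_rank}


# Example with an explicit environment instead of the real one.
print(per_process_device_map({"LOCAL_RANK": "3"}))
```

The returned dict would be passed as the `device_map` argument when loading the model, so each of the four processes holds a full copy on its own GPU and the device map never spans several devices.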
System Info
No response
Others
No response