
TypeError: __init__() got an unexpected keyword argument 'compute_dtype' #5334

Closed
GlennCGL opened this issue Sep 2, 2024 · 3 comments
Labels
solved This problem has been already solved

Comments


GlennCGL commented Sep 2, 2024

Reminder

  • I have read the README and searched the existing issues.

System Info

llamafactory version: 0.8.4.dev0

Reproduction

training config

    --deepspeed ${ds_config_path} \
    --stage dpo \
    --pref_beta 0.1 \
    --pref_loss sigmoid \
    --model_name_or_path ${model_name_or_path}  \
    --do_train \
    --dataset ${dataset} \
    --dataset_dir ${dataset_dir} \
    --preprocessing_num_workers 32 \
    --cutoff_len ${cutoff_len} \
    --template qwen \
    --finetuning_type ${finetuning_type} \
    --output_dir ${output_dir} \
    --overwrite_output_dir \
    --learning_rate 5e-06 \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --max_grad_norm 1.0 \
    --logging_steps 1 \
    --save_steps 1 \
    --save_strategy epoch \
    --warmup_ratio 0.1 \
    --weight_decay 0.01 \
    --bf16 True \
    --save_only_model \
    --plot_loss True \
    --gradient_checkpointing True  2>&1 | tee $log_file

ERROR MESSAGE:

09/03/2024 01:15:45 - INFO - llamafactory.model.model_utils.checkpointing - Gradient checkpointing enabled.
09/03/2024 01:15:45 - INFO - llamafactory.model.model_utils.attention - Using vanilla attention implementation.
09/03/2024 01:15:45 - INFO - llamafactory.model.adapter - ZeRO3 / FSDP detected, remaining trainable params in float32.
09/03/2024 01:15:45 - INFO - llamafactory.model.adapter - Fine-tuning method: Full
09/03/2024 01:15:45 - INFO - llamafactory.model.loader - trainable params: 7,615,616,512 || all params: 7,615,616,512 || trainable%: 100.0000
Traceback (most recent call last):
  File "src/train.py", line 28, in <module>
    main()
  File "src/train.py", line 19, in main
    run_exp()
  File "/ossfs/workspace/workspace/LLaMA-Factory-for-DPO/src/llamafactory/train/tuner.py", line 56, in run_exp
    run_dpo(model_args, data_args, training_args, finetuning_args, callbacks)
  File "/ossfs/workspace/workspace/LLaMA-Factory-for-DPO/src/llamafactory/train/dpo/workflow.py", line 58, in run_dpo
    ref_model = create_ref_model(model_args, finetuning_args)
  File "/ossfs/workspace/workspace/LLaMA-Factory-for-DPO/src/llamafactory/train/trainer_utils.py", line 121, in create_ref_model
    ref_model_args = ModelArguments.copyfrom(model_args)
  File "/ossfs/workspace/workspace/LLaMA-Factory-for-DPO/src/llamafactory/hparams/model_args.py", line 266, in copyfrom
    new_arg = cls(**arg_dict)
TypeError: __init__() got an unexpected keyword argument 'compute_dtype'
[2024-09-03 01:15:49,372] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 418107) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 810, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
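For context on the traceback: ModelArguments.copyfrom rebuilds the dataclass with cls(**arg_dict), and that call fails as soon as arg_dict carries an attribute such as compute_dtype that __init__ does not accept (for example a field declared with init=False or populated in __post_init__). Below is a minimal sketch of the failure and of a filtered copy that avoids it; the class and field layout are illustrative stand-ins, not LLaMA-Factory's actual ModelArguments definition.

    from dataclasses import dataclass, field, fields, asdict

    @dataclass
    class Args:  # illustrative stand-in for ModelArguments
        model_name_or_path: str = "qwen"
        # populated after construction, so __init__ does not accept it
        compute_dtype: object = field(default=None, init=False)

    src = Args()
    src.compute_dtype = "bfloat16"

    # Naive copy: asdict() also returns init=False fields, so this raises
    # TypeError: __init__() got an unexpected keyword argument 'compute_dtype'
    try:
        Args(**asdict(src))
    except TypeError as err:
        print(err)

    # Filtered copy: only pass what __init__ accepts, then restore the rest.
    init_kwargs = {f.name: getattr(src, f.name) for f in fields(Args) if f.init}
    new_args = Args(**init_kwargs)
    for f in fields(Args):
        if not f.init:
            setattr(new_args, f.name, getattr(src, f.name))
    print(new_args.compute_dtype)  # bfloat16

Filtering through fields(cls) this way is only one possible shape of a fix; the actual change landed in commit 59d2b31, referenced further down.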

Expected behavior

No response

Others

No response

github-actions bot added the pending label Sep 2, 2024

GlennCGL commented Sep 2, 2024

bf16 + LoRA works fine.

However:

  1. Full-parameter training still has some bugs.
  2. fp16 + LoRA runs into NaN problems (Exception: Current loss scale already at minimum - cannot decrease scale anymore. Exiting run.); see the config sketch after this list.
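On point 2: with DeepSpeed fp16 training, the dynamic loss scaler halves the scale whenever it sees inf/NaN gradients and aborts with exactly that message once the scale cannot drop any further, which is also why bf16 runs (no loss scaling) do not hit it. The relevant knobs sit under the fp16 block of the JSON file passed via --deepspeed. A hedged sketch of that block, written as Python that emits the config file; the keys are standard DeepSpeed config fields, but the values are illustrative, not a verified fix for this issue.

    import json

    ds_config = {
        "fp16": {
            "enabled": True,
            "loss_scale": 0,            # 0 = dynamic loss scaling
            "initial_scale_power": 16,  # starting scale of 2**16
            "loss_scale_window": 1000,
            "hysteresis": 2,
            "min_loss_scale": 1,        # the run aborts once the scaler cannot go below this
        },
        "zero_optimization": {"stage": 3},
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
    }

    with open("ds_config.json", "w") as f:
        json.dump(ds_config, f, indent=2)

If fp16 still underflows, raising initial_scale_power or staying on bf16 (as in the original run) are the usual workarounds; the NaNs themselves may of course come from the data or the learning rate rather than the scaler.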

@yuanjing-jane

I had the same problem, but rolling back to deepspeed==0.13.5 fixed it for me.

If you then hit a torch.* API import error in deepspeed/elasticity/elastic_agent.py after rolling back the version, manually patch that file as shown in the screenshots below.

[screenshots of the manual edit to deepspeed/elasticity/elastic_agent.py; not reproduced here]

hiyouga added the solved label and removed the pending label Sep 3, 2024
hiyouga closed this as completed in 59d2b31 Sep 3, 2024
hiyouga (Owner) commented Sep 3, 2024

fixed

yuwangnexusera pushed a commit to yuwangnexusera/LLaMA-Factory that referenced this issue Sep 5, 2024