full + reward fine-tuning: adding --save_safetensors False deletes pytorch_model.bin #5305

aistream69 · 2024-08-29T09:26:15Z

Reminder

I have read the README and searched the existing issues.

System Info

In full + reward mode, fine-tuning Qwen1.5-0.5B-Chat fails with the following error unless --save_safetensors False is passed:
RuntimeError: Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again: [{'model.embed_tokens.weight', 'lm_head.weight'}].
Passing --save_safetensors False makes the error go away, but when the model is saved, os.remove(path_to_checkpoint) in the function fix_valuehead_checkpoint in src/llamafactory/train/callbacks.py deletes pytorch_model.bin, so the saved model is unusable. How can this be resolved? Thanks.
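For readers unfamiliar with the RuntimeError above: it lists parameter names that point at the same underlying memory, which happens when the output head is tied to the input embedding. The snippet below is only a rough illustration of that grouping in plain Python. It uses lists as stand-ins for tensors and compares object identity, which is a simplification I chose for the sketch; the helper name `find_shared_entries` is hypothetical and is not part of safetensors or LLaMA-Factory.

```python
from collections import defaultdict


def find_shared_entries(state_dict):
    """Group state-dict keys whose values are the very same object.

    Hypothetical helper for illustration only: real tensor libraries
    compare underlying storage, not Python object identity.
    """
    groups = defaultdict(set)
    for name, value in state_dict.items():
        groups[id(value)].add(name)
    # Only groups with more than one name indicate sharing.
    return [names for names in groups.values() if len(names) > 1]


# Stand-ins for tensors: plain lists. Tied weights mean two names, one object.
embedding = [0.1, 0.2, 0.3]
state_dict = {
    "model.embed_tokens.weight": embedding,
    "lm_head.weight": embedding,  # tied to the embedding
    "model.norm.weight": [1.0, 1.0, 1.0],
}

shared = find_shared_entries(state_dict)
```

Here `shared` holds a single group containing the two tied names, mirroring the set shown in the error message.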

Reproduction

llamafactory-cli train --stage rm --do_train True --model_name_or_path models/Qwen1.5-0.5B-Chat --preprocessing_num_workers 16 --finetuning_type full --quantization_method bitsandbytes --template qwen --flash_attn auto --dataset_dir data --dataset dpo_en_demo --cutoff_len 256 --learning_rate 0.0002 --num_train_epochs 3.0 --max_samples 500 --per_device_train_batch_size 2 --gradient_accumulation_steps 4 --lr_scheduler_type cosine --max_grad_norm 1.0 --logging_steps 5 --save_steps 100 --warmup_steps 0 --optim adamw_torch --packing False --report_to none --output_dir saves/Qwen1.5-0.5B-Chat/full_rm --bf16 True --plot_loss True --ddp_timeout 180000000 --include_num_input_tokens_seen True --save_safetensors False

Expected behavior

No response

Others

No response

hiyouga · 2024-08-29T12:19:33Z

fixed

github-actions bot added the pending label Aug 29, 2024

hiyouga added the bug and solved labels and removed the bug and pending labels Aug 29, 2024

hiyouga closed this as completed in 364b757 Aug 29, 2024

yuwangnexusera pushed a commit to yuwangnexusera/LLaMA-Factory that referenced this issue Sep 5, 2024

fix hiyouga#5305

300f082

yuwangnexusera pushed a commit to yuwangnexusera/LLaMA-Factory that referenced this issue Sep 5, 2024

fix hiyouga#5305

2124eb8

yuwangnexusera pushed a commit to yuwangnexusera/LLaMA-Factory that referenced this issue Sep 5, 2024

fix hiyouga#5305

681437c

yuwangnexusera pushed a commit to yuwangnexusera/LLaMA-Factory that referenced this issue Sep 5, 2024

fix hiyouga#5305

1059f2d
