I have read the README and searched the existing issues.
System Info
In full + reward mode, when fine-tuning Qwen1.5-0.5B-Chat, training fails with the following error unless --save_safetensors is added:
RuntimeError: Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again: [{'model.embed_tokens.weight', 'lm_head.weight'}].
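After adding --save_safetensors the error no longer appears, but when the model is saved, os.remove(path_to_checkpoint) inside the function fix_valuehead_checkpoint in src/llamafactory/train/callbacks.py deletes pytorch_model.bin, so the saved model cannot be used. How can this be resolved? Thanks.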
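For illustration only, here is a minimal sketch of one defensive pattern for this kind of cleanup step: write the replacement files first, and delete the old checkpoint only if nothing has taken over its name. This is not the code of fix_valuehead_checkpoint; the helper name replace_checkpoint, its parameters, and the file name value_head.bin are assumptions made for the example, and the weights are stood in for by raw bytes.

```python
import os
import tempfile


def replace_checkpoint(output_dir, old_name, new_files):
    """Write replacement files, then drop the old checkpoint if it is stale.

    Hypothetical helper: `new_files` maps a file name to the bytes that
    should be stored under that name inside `output_dir`.
    """
    old_path = os.path.join(output_dir, old_name)
    written = []
    for name, payload in new_files.items():
        final_path = os.path.join(output_dir, name)
        tmp_path = final_path + ".tmp"
        # Write to a temporary file first so a crash never leaves a
        # half-written checkpoint under the final name.
        with open(tmp_path, "wb") as f:
            f.write(payload)
        os.replace(tmp_path, final_path)
        written.append(name)
    # Delete the old file only when no replacement reused its name.
    if old_name not in new_files and os.path.exists(old_path):
        os.remove(old_path)
    return written


if __name__ == "__main__":
    demo_dir = tempfile.mkdtemp()
    with open(os.path.join(demo_dir, "pytorch_model.bin"), "wb") as f:
        f.write(b"combined weights")
    replace_checkpoint(
        demo_dir,
        "pytorch_model.bin",
        {"pytorch_model.bin": b"decoder weights", "value_head.bin": b"value head"},
    )
    print(sorted(os.listdir(demo_dir)))
```

With this ordering, the output directory always holds either the original checkpoint or a complete replacement, which makes a missing pytorch_model.bin easier to rule out when debugging.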
hiyouga added the bug and solved labels and removed the bug and pending labels on Aug 29, 2024.
Reproduction
llamafactory-cli train --stage rm --do_train True --model_name_or_path models/Qwen1.5-0.5B-Chat --preprocessing_num_workers 16 --finetuning_type full --quantization_method bitsandbytes --template qwen --flash_attn auto --dataset_dir data --dataset dpo_en_demo --cutoff_len 256 --learning_rate 0.0002 --num_train_epochs 3.0 --max_samples 500 --per_device_train_batch_size 2 --gradient_accumulation_steps 4 --lr_scheduler_type cosine --max_grad_norm 1.0 --logging_steps 5 --save_steps 100 --warmup_steps 0 --optim adamw_torch --packing False --report_to none --output_dir saves/Qwen1.5-0.5B-Chat/full_rm --bf16 True --plot_loss True --ddp_timeout 180000000 --include_num_input_tokens_seen True --save_safetensors False
Expected behavior
No response
Others
No response