How do I fix a model that gives nonsensical, off-topic replies after PPO? #4012
Comments
The PPO learning rate may be too large.
I tried shrinking the learning rate to 1e-6, but the KL still turns negative partway through training, and then the model starts replying with nonsense again. I'm stuck.
Your PPO command is missing some parameters; please refer to the example scripts.
The previous PPO implementation had some problems, which have since been fixed.
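For context on the negative-KL symptom: TRL-style PPO loops typically estimate the KL penalty per sampled token as `logp_policy - logp_ref` (the "k1" estimator). A minimal sketch, assuming TRL-style conventions rather than this repo's exact code; all names are illustrative:

```python
import torch

def per_token_kl(policy_logprobs: torch.Tensor,
                 ref_logprobs: torch.Tensor) -> torch.Tensor:
    """k1 KL estimate over sampled tokens, shape (batch, seq_len).

    The true KL divergence is non-negative, but this per-sample estimator
    goes negative whenever the reference model assigns a sampled token
    higher probability than the current policy does.
    """
    return policy_logprobs - ref_logprobs
```

A briefly negative per-token value is normal estimation noise; a persistently negative mean KL usually means the sampling distribution no longer matches the log-probs being compared (for example, aggressive top_k/top_p/temperature settings at generation time) or that the policy has already drifted badly, which matches the nonsensical-output symptom.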
Reminder
Reproduction
rm:
deepspeed --include localhost:1,2,3,4 --master_port=9001 src/train_bash.py \
    --deepspeed ds_config.json \
    --stage rm \
    --do_train True \
    --model_name_or_path /home/ywj_0/llm_safety/model/llama2-7b-hf \
    --finetuning_type full \
    --template llama2 \
    --dataset_dir /home/ywj_0/llm_safety/dataset/raw/v4_training/train \
    --dataset safety_llama_rlhf \
    --cutoff_len 4096 \
    --learning_rate 1e-06 \
    --num_train_epochs 1.0 \
    --max_samples 100000 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --max_grad_norm 1.0 \
    --logging_steps 5 \
    --save_steps 500 \
    --warmup_steps 50 \
    --output_dir /home/ywj_0/llm_safety/model/safety-rm-0530 \
    --fp16 True \
    --plot_loss True \
    --val_size 0.1 \
    --per_device_eval_batch_size 1 \
    --evaluation_strategy steps \
    --eval_steps 50
ppo:
deepspeed --include localhost:1,2,3,4 --master_port=9010 src/train_bash.py \
    --deepspeed ds_config.json \
    --stage ppo \
    --do_train True \
    --model_name_or_path /home/ywj_0/llm_safety/model/llama2-7b-hf \
    --finetuning_type full \
    --template llama2 \
    --dataset_dir /home/ywj_0/llm_safety/dataset/raw/v4_training/train \
    --dataset safety_llama_ppo \
    --cutoff_len 4096 \
    --learning_rate 5e-06 \
    --num_train_epochs 1 \
    --max_samples 100000 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --max_grad_norm 1.0 \
    --logging_steps 5 \
    --save_steps 100 \
    --warmup_steps 50 \
    --output_dir /home/ywj_0/llm_safety/model/safety-ppo-0530 \
    --bf16 True \
    --reward_model /home/ywj_0/llm_safety/model/safety-rm-0530 \
    --reward_model_type full \
    --plot_loss True \
    --temperature 1.0 \
    --val_size 0.1 \
    --per_device_eval_batch_size 1 \
    --evaluation_strategy steps \
    --eval_steps 50
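For reference, in standard PPO-for-RLHF the reward the policy actually optimizes is the reward-model score shaped by the KL penalty sketched above. A minimal sketch under the usual TRL-style convention (not necessarily this repo's exact implementation; `kl_coef` and all names are illustrative):

```python
import torch

def shape_rewards(rm_score: torch.Tensor,       # (batch,) scalar RM scores
                  per_token_kl: torch.Tensor,   # (batch, seq_len) KL estimates
                  kl_coef: float = 0.1) -> torch.Tensor:
    """Penalize every token by the KL term; add the reward-model score
    for the whole response at the final token."""
    rewards = -kl_coef * per_token_kl
    rewards[:, -1] += rm_score
    return rewards
```

Note that if the KL estimate turns negative, the penalty flips into a bonus for drifting away from the reference model, which can compound the divergence.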
The reward model's loss is also quite small, so I don't understand why this is happening.
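As a sanity note on interpreting that loss: `--stage rm` trains a pairwise preference model, so a small loss only says the model separates chosen from rejected responses; it does not guarantee the scalar reward is well scaled for PPO. A minimal sketch of the standard Bradley-Terry objective (illustrative, not necessarily the repo's exact code):

```python
import torch
import torch.nn.functional as F

def pairwise_rm_loss(chosen_scores: torch.Tensor,
                     rejected_scores: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry loss over (batch,) scalar scores: push the chosen
    response's reward above the rejected one's."""
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()
```

A near-zero loss here is compatible with an unbounded or badly scaled reward, which is one common way PPO collapses despite a seemingly good reward model.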
Expected behavior
No response
System Info
No response
Others
No response