reward/chosen is decreasing #42
Comments
I expected reward/chosen to increase, but since the goal of SimPO is to maximize the difference between reward/chosen and reward/rejected, it is acceptable for reward/chosen to decrease to a certain extent. However, the decrease in reward/chosen seems rather large compared to the margin reward/chosen - reward/rejected.
Hi, yes, this is reasonable. The reward margin should increase, but the reward on chosen responses may decrease slightly while the reward on rejected responses decreases more rapidly. In general, we don't want the reward on chosen responses to decrease too much (as that implies the likelihood of chosen responses is decreasing), and you may use a larger
Best,
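To make the point about the margin concrete, here is a minimal sketch in plain Python. It assumes the reward is `beta` times the length-normalized (average per-token) log-probability of a response and that `gamma` is subtracted from the margin as a target, as in the SimPO paper's objective; how the trainer in this repository interprets `gamma` may differ. The function name `simpo_rewards` and the log-probability values are hypothetical, chosen only for illustration.

```python
import math

def simpo_rewards(chosen_avg_logp, rejected_avg_logp, beta=2.5, gamma=1.4):
    """Return (reward/chosen, reward/rejected, margin, loss) for one pair.

    Inputs are average per-token log-probabilities (assumed already
    length-normalized). beta and gamma default to the values in the
    config from this issue.
    """
    chosen_reward = beta * chosen_avg_logp
    rejected_reward = beta * rejected_avg_logp
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(z)) == log(1 + exp(-z)), with z = margin - gamma
    loss = math.log1p(math.exp(-(margin - gamma)))
    return chosen_reward, rejected_reward, margin, loss

# Hypothetical values: both log-probabilities fall during training,
# but the rejected one falls faster.
before = simpo_rewards(chosen_avg_logp=-1.0, rejected_avg_logp=-1.2)
after = simpo_rewards(chosen_avg_logp=-1.3, rejected_avg_logp=-2.4)

for name, (c, r, m, l) in (("before", before), ("after", after)):
    print(f"{name}: chosen={c:.2f} rejected={r:.2f} margin={m:.2f} loss={l:.3f}")
```

In this toy case the loss goes down and the margin goes up even though reward/chosen drops, which is why a falling reward/chosen is not by itself a sign of a bug. The objective only sees the margin, so a large drop in reward/chosen still deserves attention because it means the chosen responses are becoming less likely.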
Got it! Thanks for the quick reply.
@zhangguoxin1
In addition to the suggestions by Mengzhou, you may try the following as well:
Best,
Hi!
I am fine-tuning LLaMA3 on the hh-rlhf dataset using SimPO and noticed that reward/chosen is decreasing. Is this reasonable?
```yaml
# SimPOTrainer arguments
bf16: true
beta: 2.5
gamma: 1.4
per_device_train_batch_size: 2
per_device_eval_batch_size: 4
do_eval: true
eval_strategy: steps
eval_steps: 500
gradient_accumulation_steps: 8
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: False
learning_rate: 5.0e-5
num_train_epochs: 1
log_level: info
logging_steps: 5
lr_scheduler_type: cosine
max_length: 2048
max_prompt_length: 1800
optim: adamw_torch
output_dir: outputs/llama-3-8b-instruct-simpo-hh
run_name: llama-3-8b-instruct-simpo-hh
force_use_ref_model: True
push_to_hub: false
save_strategy: "steps"
save_steps: 500
remove_unused_columns: False
save_total_limit: 20
seed: 42
warmup_ratio: 0.1
```