reward/chosen is decreasing #42

zhangguoxin1 · 2024-07-15T03:10:56Z

Hi!
I am fine-tuning LLaMA3 on the hh-rlhf dataset using SimPO and noticed that reward/chosen is decreasing. Is this reasonable?
```yaml
# SimPOTrainer arguments
bf16: true
beta: 2.5
gamma: 1.4
per_device_train_batch_size: 2
per_device_eval_batch_size: 4
do_eval: true
eval_strategy: steps
eval_steps: 500
gradient_accumulation_steps: 8
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: False
learning_rate: 5.0e-5
num_train_epochs: 1
log_level: info
logging_steps: 5
lr_scheduler_type: cosine
max_length: 2048
max_prompt_length: 1800
optim: adamw_torch
output_dir: outputs/llama-3-8b-instruct-simpo-hh
run_name: llama-3-8b-instruct-simpo-hh
force_use_ref_model: True
push_to_hub: false
save_strategy: "steps"
save_steps: 500
remove_unused_columns: False
save_total_limit: 20
seed: 42
warmup_ratio: 0.1
```

zhangguoxin1 · 2024-07-15T03:18:34Z

I expected reward/chosen to increase. Since SimPO's objective is to maximize the difference between reward/chosen and reward/rejected, some decrease in reward/chosen is acceptable. However, the decrease in reward/chosen seems large relative to the margin (reward/chosen - reward/rejected).

yumeng5 · 2024-07-15T03:23:50Z

Hi,

Yes, this is reasonable. The reward margin should increase but the reward on chosen responses may slightly decrease (and the reward on rejected decreases more rapidly). In general, we don't want the reward on chosen to decrease too much (as that implies the likelihood of chosen responses is decreasing), and you may use a larger beta or a lower learning rate to mitigate the decrease of reward on chosen responses.
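To make this concrete, here is a minimal sketch of how the margin can grow while the chosen reward falls. It assumes the objective from the SimPO paper, where the reward is beta times the average per-token log-probability and gamma is subtracted from the reward margin; the function names and the log-probability values are hypothetical and only for illustration, not taken from the trainer code.

```python
import math


def simpo_reward(avg_logp: float, beta: float) -> float:
    """Implicit reward: beta times the length-normalized log-probability."""
    return beta * avg_logp


def simpo_loss(avg_logp_chosen: float, avg_logp_rejected: float,
               beta: float, gamma: float) -> float:
    """-log(sigmoid(margin - gamma)), written as log(1 + exp(-(margin - gamma)))."""
    margin = simpo_reward(avg_logp_chosen, beta) - simpo_reward(avg_logp_rejected, beta)
    return math.log1p(math.exp(-(margin - gamma)))


beta, gamma = 2.5, 1.4  # values from the config posted in this thread

# Hypothetical average per-token log-probabilities: (chosen, rejected).
early = (-1.0, -1.2)
late = (-1.3, -2.5)

for name, (lp_c, lp_r) in {"early": early, "late": late}.items():
    r_c, r_r = simpo_reward(lp_c, beta), simpo_reward(lp_r, beta)
    loss = simpo_loss(lp_c, lp_r, beta, gamma)
    print(f"{name}: chosen={r_c:.2f} rejected={r_r:.2f} "
          f"margin={r_c - r_r:.2f} loss={loss:.3f}")
```

In this toy case the loss goes down even though both rewards drop, because the rejected reward drops faster. Since the reward is beta times the average log-probability, dividing the logged drop in reward/chosen by beta gives the drop in average per-token log-likelihood, which is a more direct way to judge whether the decrease is too large.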

Best,
Yu

zhangguoxin1 · 2024-07-15T03:38:46Z

Got it!

Thanks for the quick reply.

zhangguoxin1 · 2024-08-19T08:33:06Z

Hi,
I used SimPO on my task with Qwen2-7B (approximately 40,000 training examples), but the trained model generates repeated sentences and reproduces pre-training data. The parameters are as follows:

pref_beta: 2.5
simpo_gamma: 1.0
learning_rate: 1.0e-6
num_train_epochs: 3.0

I am now trying a larger beta (8.0).

xiamengzhou · 2024-08-19T17:12:29Z

@zhangguoxin1
I think you should be using Qwen2-7B-Instruct rather than Qwen2-7B if you are only running PO (preference optimization). I'd also suggest using online data rather than offline data generated by other models.

yumeng5 · 2024-08-19T17:52:39Z

Hi @zhangguoxin1

In addition to the suggestions by Mengzhou, you may try the following as well:

- Decrease the learning rate (we usually start the learning rate search around 5e-7).
- Reduce the number of training epochs (we generally train the model for only one epoch).
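As a rough sketch of what such a search could look like, the snippet below builds one config per candidate learning rate and fixes training to a single epoch. The grid values and the `make_sweep` helper are hypothetical, chosen only to bracket 5e-7; the parameter names are the ones from your post.

```python
# Parameters as posted above; only learning_rate and num_train_epochs are varied.
base_config = {
    "pref_beta": 2.5,
    "simpo_gamma": 1.0,
    "learning_rate": 1.0e-6,
    "num_train_epochs": 3.0,
}

# Assumed grid around 5e-7; adjust to your compute budget.
candidate_lrs = [3e-7, 5e-7, 8e-7]


def make_sweep(base: dict, lrs: list, epochs: float = 1.0) -> list:
    """Return one config copy per learning rate, each trained for `epochs` epochs."""
    return [{**base, "learning_rate": lr, "num_train_epochs": epochs} for lr in lrs]


sweep = make_sweep(base_config, candidate_lrs)
for cfg in sweep:
    print(cfg)
```

Each resulting dict can be written out as a separate training config, so the runs differ only in learning rate and can be compared directly on reward/chosen and on output quality.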

Best,
Yu

zhangguoxin1 changed the title ~~reward/chosen~~ reward/chosen is decreasing Jul 15, 2024
