
Reloading and evaluating a trained reward model #4743

Closed
1 task done
yata0 opened this issue Jul 9, 2024 · 9 comments
Labels
solved This problem has been already solved

Comments


yata0 commented Jul 9, 2024

Reminder

  • I have read the README and searched the existing issues.

System Info

  • Platform: Linux-5.4.143.bsk.8-amd64-x86_64-with-glibc2.31
  • Python version: 3.10.13
  • PyTorch version: 2.2.2+cu121 (GPU)
  • Transformers version: 4.42.3
  • Datasets version: 2.18.0
  • Accelerate version: 0.32.1
  • PEFT version: 0.11.1
  • TRL version: 0.9.6
  • GPU type: Tesla V100-SXM2-32GB

Reproduction

  1. Export: llamafactory-cli export --model_name_or_path="./save" --stage=rm --export_dir="./see12" --template=default

  2. Test:

from trl import AutoModelForCausalLMWithValueHead
model_path = "./see12"
model = AutoModelForCausalLMWithValueHead.from_pretrained(model_path, trust_remote_code=True)

Error reported:
v_head weight is found. This IS expected if you are not resuming PPO training

#4379 (comment)

Expected behavior

No response

Others

No response

github-actions bot added the pending (This problem is yet to be addressed) label on Jul 9, 2024
hiyouga (Owner) commented Jul 9, 2024

Loading the RM is only supported through llamafactory:
llamafactory-cli api --model_name_or_path xx --template xx --stage rm
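
For reference, a minimal sketch of querying the reward server started by the command above. The /v1/score/evaluation route, the payload fields, and the port are assumptions here (check the LLaMA-Factory API source for the exact schema):

import requests

# Hypothetical endpoint and payload; adjust to the actual API schema.
response = requests.post(
    "http://localhost:8000/v1/score/evaluation",
    json={
        "model": "rm",  # placeholder model name
        "messages": ["Question: ...\nAnswer: ..."],  # texts to be scored
    },
)
print(response.json())  # expected to contain the reward score(s)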

hiyouga added the solved (This problem has been already solved) label and removed the pending label on Jul 9, 2024
hiyouga closed this as completed on Jul 9, 2024
yata0 (Author) commented Jul 9, 2024

llamafactory-cli api --model_name_or_path xx --template xx --stage rm

How can I evaluate the RM?

yata0 (Author) commented Jul 9, 2024

llamafactory-cli api --model_name_or_path xx --template xx --stage rm

How can I evaluate the RM?

@hiyouga

hiyouga (Owner) commented Jul 9, 2024

Change do_train to do_eval in your training script.
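
A minimal sketch of what such an eval config could look like, following the yaml layout used later in this thread; the dataset name and paths are placeholders:

### model
model_name_or_path: path/to/your/reward_model

### method
stage: rm
do_train: false
do_eval: true

### dataset
eval_dataset: your_pairwise_dataset  # a preference dataset with chosen/rejected responses
template: default
cutoff_len: 1024

### eval
per_device_eval_batch_size: 1

### output
output_dir: path/to/eval_output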

xd2333 (Contributor) commented Jul 18, 2024

I figured it out: modify the yaml like this
do_train: false
do_eval: false
do_predict: true
adapter_name_or_path: <path to the trained LoRA>

The reward prediction results will be written to output_dir.
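
As a quick sanity check, a sketch for inspecting those predictions, assuming the run writes a JSON-lines file such as generated_predictions.jsonl with per-pair chosen/rejected scores; the file name and field names are assumptions, so adjust them to whatever actually appears in output_dir:

import json

# Hypothetical output file and field names; adjust to the actual artifacts in output_dir.
path = "output_dir/generated_predictions.jsonl"
correct = total = 0
with open(path, encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        total += 1
        # the reward for the chosen response should exceed the rejected one
        if record["chosen"] > record["rejected"]:
            correct += 1
print(f"pairwise accuracy: {correct / total:.3f} over {total} pairs")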


bruceguo123 commented Jul 31, 2024

@hiyouga @xd2333
Only 100 results were output. Why? The dataset contains 1,000 samples.

Command: llamafactory-cli train /root/autodl-tmp/llm_prj/AdGen/config/reward_infer_model.yaml
Contents of reward_infer_model.yaml:

### model
model_name_or_path: /root/autodl-tmp/llm_prj/AdGen/reward_model/merge

### method
stage: rm
do_train: false
do_eval: false
do_predict: true

### dataset
dataset: ad_dpo
template: qwen
cutoff_len: 1024
max_samples: 10000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: /root/autodl-tmp/llm_prj/AdGen/reward_model/infer
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500

The dataset entry ad_dpo is configured as follows:
"ad_dpo": {
  "file_name": "/root/autodl-tmp/llm_prj/AdGen/data/dpo/ad_dpo.jsonl",
  "ranking": true,
  "columns": {
    "prompt": "instruction",
    "chosen": "chosen",
    "rejected": "rejected"
  }
}
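
Given that column mapping, each line of ad_dpo.jsonl would look roughly like this (illustrative values only):

{"instruction": "Write an ad for product X ...", "chosen": "A strong ad text ...", "rejected": "A weaker ad text ..."}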

Output: (screenshot)

xd2333 (Contributor) commented Aug 1, 2024

(quoting the previous comment in full)

Set eval_dataset: ad_dpo, and remove val_size: 0.1, eval_strategy: steps, and eval_steps: 500. (With val_size: 0.1, prediction runs only on the 10% validation split, i.e. 100 of the 1,000 samples, which is why only 100 results appear.)
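
Applied to the config above, the changed portion would look roughly like this (a sketch; the rest of the yaml stays the same):

### dataset
dataset: ad_dpo
eval_dataset: ad_dpo  # predict on the full dataset instead of the 10% split
template: qwen
cutoff_len: 1024
max_samples: 10000
overwrite_cache: true
preprocessing_num_workers: 16

### eval (val_size, eval_strategy, and eval_steps removed)
per_device_eval_batch_size: 1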

@rover5056

Loading the RM is only supported through llamafactory: llamafactory-cli api --model_name_or_path xx --template xx --stage rm

Could you provide a demo of a standard request script for multimodal inputs? I've tried for quite a while but can't figure out how to assemble the message.
I started the server with:
llamafactory-cli api --stage rm --template qwen2_vl --model_name_or_path models/qwen2_vl_rm_lora_1027_3sets

Alternatively, if using the trl library, how should I load the model and run inference to get the scores?

Thanks a lot!

@hiyouga @xd2333

@world2025

May I ask, does reward model training support a data format like openbookqa, i.e. one prompt with multiple responses, as in InstructGPT?
