
Reloading and evaluating a trained reward model #4743

Closed
1 task done
yata0 opened this issue Jul 9, 2024 · 9 comments
Labels
solved This problem has been already solved

Comments


yata0 commented Jul 9, 2024

Reminder

  • I have read the README and searched the existing issues.

System Info

  • Platform: Linux-5.4.143.bsk.8-amd64-x86_64-with-glibc2.31
  • Python version: 3.10.13
  • PyTorch version: 2.2.2+cu121 (GPU)
  • Transformers version: 4.42.3
  • Datasets version: 2.18.0
  • Accelerate version: 0.32.1
  • PEFT version: 0.11.1
  • TRL version: 0.9.6
  • GPU type: Tesla V100-SXM2-32GB

Reproduction

  1. Export: llamafactory-cli export --model_name_or_path="./save" --stage=rm --export_dir="./see12" --template=default

  2. Test:

from trl import AutoModelForCausalLMWithValueHead
model_path = "./see12"
model = AutoModelForCausalLMWithValueHead.from_pretrained(model_path, trust_remote_code=True)

Error reported:
v_head weight is found. This IS expected if you are not resuming PPO training

#4379 (comment)

Expected behavior

No response

Others

No response

github-actions bot added the pending (This problem is yet to be addressed) label on Jul 9, 2024
hiyouga (Owner) commented Jul 9, 2024

Loading the RM is only supported through llamafactory:
llamafactory-cli api --model_name_or_path xx --template xx --stage rm
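
For reference, a minimal sketch of querying the reward server started by the command above. The /v1/score/evaluation route, the payload fields, and the port are assumptions here (check the LLaMA-Factory API source for the exact schema):

import requests

# Hypothetical endpoint and payload; adjust to the actual API schema.
response = requests.post(
    "http://localhost:8000/v1/score/evaluation",
    json={
        "model": "rm",  # placeholder model name
        "messages": ["Question: ...\nAnswer: ..."],  # texts to be scored
    },
)
print(response.json())  # expected to contain the reward score(s)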

hiyouga added the solved (This problem has been already solved) label and removed the pending label on Jul 9, 2024
hiyouga closed this as completed on Jul 9, 2024
yata0 (Author) commented Jul 9, 2024

llamafactory-cli api --model_name_or_path xx --template xx --stage rm

How can I evaluate the RM?

yata0 (Author) commented Jul 9, 2024

llamafactory-cli api --model_name_or_path xx --template xx --stage rm

How can I evaluate the RM?

@hiyouga

hiyouga (Owner) commented Jul 9, 2024

Change do_train to do_eval in your training script.
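
A minimal sketch of what such an eval config could look like, following the yaml layout used later in this thread; the dataset name and paths are placeholders:

### model
model_name_or_path: path/to/your/reward_model

### method
stage: rm
do_train: false
do_eval: true

### dataset
eval_dataset: your_pairwise_dataset  # a preference dataset with chosen/rejected responses
template: default
cutoff_len: 1024

### eval
per_device_eval_batch_size: 1

### output
output_dir: path/to/eval_output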

xd2333 (Contributor) commented Jul 18, 2024

I figured it out: modify the yaml like this
do_train: false
do_eval: false
do_predict: true
adapter_name_or_path: <path to the trained LoRA>

The reward prediction results will be written to output_dir.
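
As a quick sanity check, a sketch for inspecting those predictions, assuming the run writes a JSON-lines file such as generated_predictions.jsonl with per-pair chosen/rejected scores; the file name and field names are assumptions, so adjust them to whatever actually appears in output_dir:

import json

# Hypothetical output file and field names; adjust to the actual artifacts in output_dir.
path = "output_dir/generated_predictions.jsonl"
correct = total = 0
with open(path, encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        total += 1
        # the reward for the chosen response should exceed the rejected one
        if record["chosen"] > record["rejected"]:
            correct += 1
print(f"pairwise accuracy: {correct / total:.3f} over {total} pairs")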


bruceguo123 commented Jul 31, 2024

@hiyouga @xd2333
Only 100 results were output. Why? The dataset contains 1,000 samples.

Command: llamafactory-cli train /root/autodl-tmp/llm_prj/AdGen/config/reward_infer_model.yaml
Contents of reward_infer_model.yaml:

### model
model_name_or_path: /root/autodl-tmp/llm_prj/AdGen/reward_model/merge

### method
stage: rm
do_train: false
do_eval: false
do_predict: true

### dataset
dataset: ad_dpo
template: qwen
cutoff_len: 1024
max_samples: 10000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: /root/autodl-tmp/llm_prj/AdGen/reward_model/infer
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500

The dataset entry ad_dpo is configured as follows:
"ad_dpo": {
  "file_name": "/root/autodl-tmp/llm_prj/AdGen/data/dpo/ad_dpo.jsonl",
  "ranking": true,
  "columns": {
    "prompt": "instruction",
    "chosen": "chosen",
    "rejected": "rejected"
  }
}
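
Given that column mapping, each line of ad_dpo.jsonl would look roughly like this (illustrative values only):

{"instruction": "Write an ad for product X ...", "chosen": "A strong ad text ...", "rejected": "A weaker ad text ..."}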

Output: (screenshot)

xd2333 (Contributor) commented Aug 1, 2024

(quoting the previous comment in full)

Set eval_dataset: ad_dpo, and remove val_size: 0.1, eval_strategy: steps, and eval_steps: 500. (With val_size: 0.1, prediction runs only on the 10% validation split, i.e. 100 of the 1,000 samples, which is why only 100 results appear.)
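
Applied to the config above, the changed portion would look roughly like this (a sketch; the rest of the yaml stays the same):

### dataset
dataset: ad_dpo
eval_dataset: ad_dpo  # predict on the full dataset instead of the 10% split
template: qwen
cutoff_len: 1024
max_samples: 10000
overwrite_cache: true
preprocessing_num_workers: 16

### eval (val_size, eval_strategy, and eval_steps removed)
per_device_eval_batch_size: 1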

@rover5056

Loading the RM is only supported through llamafactory: llamafactory-cli api --model_name_or_path xx --template xx --stage rm

Could you provide a demo of a standard request script for multimodal inputs? I've tried for quite a while but can't figure out how to assemble the message.
I started the server with:
llamafactory-cli api --stage rm --template qwen2_vl --model_name_or_path models/qwen2_vl_rm_lora_1027_3sets

Alternatively, if using the trl library, how should I load the model and run inference to get the scores?

Thanks a lot!

@hiyouga @xd2333

@world2025

May I ask, does reward model training support a data format like openbookqa, i.e. one prompt with multiple responses, as in InstructGPT?
