History messages do not seem to be added to the training data correctly #4683

ylsdamxssjxxdd · 2024-07-04T16:21:08Z

Reminder

I have read the README and searched the existing issues.

System Info

root@8d9356cd571a:/nerv/nerv-workspace# llamafactory-cli env
[2024-07-04 16:14:35,683] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible

llamafactory version: 0.8.3.dev0
Platform: Linux-5.10.0-8-generic-x86_64-with-glibc2.35
Python version: 3.10.12
PyTorch version: 2.3.1+cu121 (GPU)
Transformers version: 4.42.3
Datasets version: 2.20.0
Accelerate version: 0.32.1
PEFT version: 0.11.1
TRL version: 0.9.4
GPU type: NVIDIA GeForce RTX 3090 Ti
DeepSpeed version: 0.14.4
vLLM version: 0.5.0.post1

Reproduction

In LLaMA-Factory-main/src/llamafactory/data/loader.py, I changed print_function(next(iter(dataset))) to

for item in dataset:
    print_function(item)

This should print every training example instead of only the first one.

My dataset is exactly the following, with not a single character changed:

[
  {
    "instruction": "人类指令（必填）",
    "input": "人类输入（选填）",
    "output": "模型回答（必填）",
    "system": "系统提示词（选填）",
    "history": [
      ["第一轮指令（选填）", "第一轮回答（选填）"],
      ["第二轮指令（选填）", "第二轮回答（选填）"]
    ]
  }
]
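For reference, here is a minimal sketch of how a record in this format could be flattened into ordered conversation turns. It is not LLaMA-Factory's actual code: the helper name `expand_turns` is hypothetical, and joining `instruction` and `input` with a newline is an assumption made for illustration.

```python
from typing import Dict, List, Tuple


def expand_turns(record: Dict) -> List[Tuple[str, str]]:
    """Flatten one alpaca-style record into ordered (prompt, response) pairs.

    Hypothetical helper for illustration; not part of LLaMA-Factory.
    """
    # History turns come first, in the order they are listed.
    turns = [(query, answer) for query, answer in record.get("history", [])]

    # The final turn is built from instruction (+ optional input) and output.
    prompt = record["instruction"]
    if record.get("input"):
        prompt = prompt + "\n" + record["input"]  # assumption: newline join
    turns.append((prompt, record["output"]))
    return turns


record = {
    "instruction": "人类指令（必填）",
    "input": "人类输入（选填）",
    "output": "模型回答（必填）",
    "system": "系统提示词（选填）",
    "history": [
        ["第一轮指令（选填）", "第一轮回答（选填）"],
        ["第二轮指令（选填）", "第二轮回答（选填）"],
    ],
}

turns = expand_turns(record)
for index, (prompt, response) in enumerate(turns, start=1):
    print(f"turn {index}: {prompt!r} -> {response!r}")
```

Under these assumptions the record yields three turns, so a correctly built training example would contain three assistant responses rather than one.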

When training qwen2, the webui outputs:

input_ids:
[151644, 8948, 198, 72448, 45139, 99689, 9909, 30767, 68756, 7552, 151645, 198, 151644, 872, 198, 99363, 99620, 109504, 9909, 30767, 68756, 7552, 151645, 198, 151644, 77091, 198, 99363, 99620, 102104, 9909, 30767, 68756, 7552, 151645]
inputs:
<|im_start|>system
系统提示词（选填）<|im_end|>
<|im_start|>user
第一轮指令（选填）<|im_end|>
<|im_start|>assistant
第一轮回答（选填）<|im_end|>
label_ids:
[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 99363, 99620, 102104, 9909, 30767, 68756, 7552, 151645]
labels:
第一轮回答（选填）<|im_end|>

Expected behavior

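In principle, the labels printed by the webui should contain
the first-round answer (第一轮回答)
the second-round answer (第二轮回答)
the model answer (模型回答)
However, only the first-round answer appears, which suggests that the history messages are not being added to the training data correctly.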
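The expectation can be illustrated with a small sketch of multi-turn label masking. This is a toy model of the idea, not the project's implementation: `toy_encode`, `build_example`, and `supervised_spans` are hypothetical names, the tokenizer maps one character to one id, and chat-template tokens such as <|im_start|> are left out for brevity. The mask value -100 is the one shown in label_ids above.

```python
from typing import List, Tuple

IGNORE_INDEX = -100  # mask value seen in label_ids; positions with it carry no loss


def toy_encode(text: str) -> List[int]:
    # Stand-in tokenizer (one id per character); a real run uses the model tokenizer.
    return [ord(ch) for ch in text]


def build_example(turns: List[Tuple[str, str]]) -> Tuple[List[int], List[int]]:
    """Concatenate all turns, masking prompt tokens and keeping response tokens."""
    input_ids: List[int] = []
    labels: List[int] = []
    for prompt, response in turns:
        prompt_ids = toy_encode(prompt)
        response_ids = toy_encode(response)
        input_ids += prompt_ids + response_ids
        labels += [IGNORE_INDEX] * len(prompt_ids) + response_ids
    return input_ids, labels


def supervised_spans(labels: List[int]) -> List[str]:
    """Decode each contiguous run of unmasked labels back to text."""
    spans: List[str] = []
    current: List[int] = []
    for token in labels:
        if token == IGNORE_INDEX:
            if current:
                spans.append("".join(chr(t) for t in current))
                current = []
        else:
            current.append(token)
    if current:
        spans.append("".join(chr(t) for t in current))
    return spans


turns = [
    ("第一轮指令", "第一轮回答"),
    ("第二轮指令", "第二轮回答"),
    ("人类指令\n人类输入", "模型回答"),
]
input_ids, labels = build_example(turns)
print(supervised_spans(labels))
```

In this sketch every turn contributes one supervised span, so three answers show up in the labels. The output reported above contains a single span, which matches the symptom described in this issue.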

Others

No response


hiyouga · 2024-07-04T16:59:49Z

Sorry about that, it has been fixed.

maksimstw · 2024-07-21T21:33:42Z

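I don't quite follow. Does this mean that, before this bug was fixed, models were trained only on the first turn and the data from later turns was not used at all? How long did this bug exist? Was it only fixed two weeks ago?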

hiyouga · 2024-07-22T01:12:27Z

@maksimstw The bug existed for 7 days before it was fixed.

github-actions bot added the pending This problem is yet to be addressed label Jul 4, 2024

hiyouga closed this as completed in e43809b Jul 4, 2024

hiyouga added solved This problem has been already solved and removed pending This problem is yet to be addressed labels Jul 4, 2024

xtchen96 pushed a commit to xtchen96/LLaMA-Factory that referenced this issue Jul 17, 2024

fix hiyouga#4683

7eb8593
