💡 [REQUEST] - Support a fine-tuning recipe for audio #745

Lingeng56 · 2025-01-17T09:44:39Z

Start Date

No response

Implementation PR

Will a fine-tuning recipe for MiniCPM-O audio-to-text be released later? For now, all I can do is follow the processing flow in model_server and try to preprocess the audio the same way inference does.

Summary

Basic Example

Drawbacks

Unresolved questions

Gaiejj · 2025-01-18T14:30:11Z

😄 Hey! I think our framework, align-anything, has implemented this functionality. We have fine-tuned it on our open-source align-anything/text-audio-to-text dataset and provided a directly runnable script. Everyone is welcome to use it!

Cuiunbo · 2025-01-18T15:44:25Z

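Hello, glad to hear you're interested in fine-tuning. The audio-to-text fine-tuning recipe is almost the same as the image-to-text one, so the changes required are small. We will provide example code next week.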

BUAADreamer · 2025-01-19T07:45:14Z

LLaMA-Factory now supports audio-text-to-text fine-tuning and inference; you can also try it 🤗

hiyouga/LLaMA-Factory#6701

Lingeng56 · 2025-01-20T03:26:34Z

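> Hello, glad to hear you're interested in fine-tuning. The audio-to-text fine-tuning recipe is almost the same as the image-to-text one, so the changes required are small. We will provide example code next week.

It looks like the audio encoder in the model architecture is separate from Qwen. If my data includes the text corresponding to each input audio clip, could I just run text-to-text SFT directly?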

Cuiunbo · 2025-01-21T03:20:39Z

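> > Hello, glad to hear you're interested in fine-tuning. The audio-to-text fine-tuning recipe is almost the same as the image-to-text one, so the changes required are small. We will provide example code next week.
>
> It looks like the audio encoder in the model architecture is separate from Qwen. If my data includes the text corresponding to each input audio clip, could I just run text-to-text SFT directly?

Hello, that approach may fail to align the model for audio input. You can try fine-tuning with https://github.com/hiyouga/LLaMA-Factory/pull/6701, which already supports audio-to-text.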
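For readers taking the LLaMA-Factory route, here is a minimal sketch of what one audio-to-text training record could look like. It assumes the sharegpt-style multimodal layout that LLaMA-Factory uses for its demo data: a `messages` list, an `audios` list, and one `<audio>` placeholder per clip. This thread does not spell out the format, so the field names should be checked against the data README of the installed version. The dataset name `my_audio_sft`, the file paths, and the helper functions are hypothetical.

```python
import json
from pathlib import Path


def make_record(audio_path: str, question: str, answer: str) -> dict:
    """Build one sharegpt-style sample that references a single audio clip."""
    return {
        "messages": [
            # Assumption: "<audio>" marks where the clip is inserted in the prompt.
            {"role": "user", "content": f"<audio>{question}"},
            {"role": "assistant", "content": answer},
        ],
        "audios": [audio_path],
    }


def placeholders_match(record: dict) -> bool:
    """Check that the number of <audio> placeholders equals the number of clips."""
    count = sum(m["content"].count("<audio>") for m in record["messages"])
    return count == len(record["audios"])


# Hypothetical sample data; replace with real clips and transcripts.
records = [make_record("clips/0001.wav", "Transcribe this clip.", "Hello, world.")]

# Hypothetical dataset_info.json entry mapping the columns above.
dataset_info = {
    "my_audio_sft": {
        "file_name": "my_audio_sft.json",
        "formatting": "sharegpt",
        "columns": {"messages": "messages", "audios": "audios"},
        "tags": {
            "role_tag": "role",
            "content_tag": "content",
            "user_tag": "user",
            "assistant_tag": "assistant",
        },
    }
}


def write_dataset(out_dir: str) -> None:
    """Write the samples and the dataset_info entry into out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "my_audio_sft.json").write_text(
        json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    # In a real checkout, merge this entry into the existing dataset_info.json
    # instead of overwriting it.
    (out / "dataset_info.json").write_text(
        json.dumps(dataset_info, indent=2), encoding="utf-8"
    )
```

Calling `write_dataset("my_data")` writes both files into a scratch directory. Running `placeholders_match` over every record before training is a cheap way to catch samples whose placeholders and clip lists disagree.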

uangshiyon · 2025-01-22T06:39:51Z

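> > > Hello, glad to hear you're interested in fine-tuning. The audio-to-text fine-tuning recipe is almost the same as the image-to-text one, so the changes required are small. We will provide example code next week.
> >
> > It looks like the audio encoder in the model architecture is separate from Qwen. If my data includes the text corresponding to each input audio clip, could I just run text-to-text SFT directly?
>
> Hello, that approach may fail to align the model for audio input. You can try fine-tuning with https://github.com/hiyouga/LLaMA-Factory/pull/6701, which already supports audio-to-text.

Hello, a question: with multiple audio clips, for example one used as the voice-cloning reference and one whose voice should be converted, roughly what should the fine-tuning data JSON look like?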

Lingeng56 · 2025-01-24T08:50:40Z

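> > > Hello, glad to hear you're interested in fine-tuning. The audio-to-text fine-tuning recipe is almost the same as the image-to-text one, so the changes required are small. We will provide example code next week.
> >
> > It looks like the audio encoder in the model architecture is separate from Qwen. If my data includes the text corresponding to each input audio clip, could I just run text-to-text SFT directly?
>
> Hello, that approach may fail to align the model for audio input. You can try fine-tuning with https://github.com/hiyouga/LLaMA-Factory/pull/6701, which already supports audio-to-text.

Hello, I'm fine-tuning with the branch where you added support, and I have a question: can I use my own custom system prompt as the system prompt for fine-tuning? For example, I want the fine-tuned model to translate whatever I input into English, and I'd like to change the system prompt so it differs from the one used for normal QA. Is that possible?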

Lingeng56 · 2025-01-24T09:39:33Z

I tried running the LLaMA-Factory code and it doesn't run: it fails with an error because the processor passed in is None. Also, your PR is still pending on LLaMA-Factory's main branch.

Lingeng56 · 2025-01-24T09:45:15Z

> 😄 Hey! I think our framework, align-anything, has implemented this functionality. We have fine-tuned it on our open-source align-anything/text-audio-to-text dataset and provided a directly runnable script. Everyone is welcome to use it!

Do you support LoRA SFT? So far I don't see a LoRA option in the source of sft.py.

Gaiejj · 2025-01-24T09:52:59Z

> > 😄 Hey! I think our framework, align-anything, has implemented this functionality. We have fine-tuned it on our open-source align-anything/text-audio-to-text dataset and provided a directly runnable script. Everyone is welcome to use it!
>
> Do you support LoRA SFT? So far I don't see a LoRA option in the source of sft.py.

We haven't tested it yet; support is coming soon. In the meantime, you can try full-parameter fine-tuning.

BUAADreamer · 2025-01-24T10:30:18Z

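> Hello, I'm fine-tuning with the branch where you added support, and I have a question: can I use my own custom system prompt as the system prompt for fine-tuning? For example, I want the fine-tuned model to translate whatever I input into English, and I'd like to change the system prompt so it differs from the one used for normal QA. Is that possible?

Of course it's supported. You can add your template directly here, at https://github.com/BUAADreamer/LLaMA-Factory/blob/13d252fa7856ecb14ba6907e5adb10070e5cdde4/src/llamafactory/data/template.py#L958, by adding the following lines: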

_register_template(
    name="minicpm_o_audio",
    format_user=StringFormatter(slots=["<|im_start|>user\n{{content}}<|im_end|>\n<|im_start|>assistant\n"]),
    format_assistant=StringFormatter(slots=["{{content}}<|im_end|>\n"]),
    format_system=StringFormatter(slots=["<|im_start|>system\n{{content}}<|im_end|>\n"]),
    stop_words=["<|im_end|>"],
    default_system=(
        # English: "Whatever the input is, translate it directly into English."
        "不管输入什么都需要直接翻译为英文"
    ),
    mm_plugin=get_mm_plugin(name="minicpm_v", image_token="<image>", video_token="<video>"),
)

Then set template: minicpm_o_audio in your YAML file.
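To show where that `template` line sits, here is a minimal sketch that renders a LLaMA-Factory-style training YAML from Python. Only `template: minicpm_o_audio` comes from this thread. The model ID `openbmb/MiniCPM-o-2_6`, the dataset name `my_audio_sft`, the output path, and every hyperparameter value are assumptions for illustration. The keys should be checked against the example configs shipped with the installed LLaMA-Factory version.

```python
# All values except "template" are illustrative assumptions, not tested settings.
config = {
    "model_name_or_path": "openbmb/MiniCPM-o-2_6",  # assumed model ID
    "trust_remote_code": True,
    "stage": "sft",
    "do_train": True,
    "finetuning_type": "full",
    "dataset": "my_audio_sft",  # hypothetical dataset name
    "template": "minicpm_o_audio",  # the custom template registered above
    "cutoff_len": 2048,
    "output_dir": "saves/minicpm_o_audio_sft",  # hypothetical output path
    "per_device_train_batch_size": 1,
    # Kept as a string so it is emitted as "1.0e-5"; some YAML loaders read
    # a bare "1e-05" as a string rather than a float.
    "learning_rate": "1.0e-5",
    "num_train_epochs": 3.0,
}


def to_yaml(cfg: dict) -> str:
    """Render a flat dict of scalars as simple 'key: value' YAML lines."""
    lines = []
    for key, value in cfg.items():
        if isinstance(value, bool):
            value = "true" if value else "false"  # YAML booleans are lowercase
        lines.append(f"{key}: {value}")
    return "\n".join(lines) + "\n"


yaml_text = to_yaml(config)
```

Saving `yaml_text` to a file such as `minicpm_o_audio_sft.yaml` (a hypothetical name) and passing that file to `llamafactory-cli train` would be the usual way to launch the run.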

Lingeng56 · 2025-01-24T10:30:27Z

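> > > 😄 Hey! I think our framework, align-anything, has implemented this functionality. We have fine-tuned it on our open-source align-anything/text-audio-to-text dataset and provided a directly runnable script. Everyone is welcome to use it!
> >
> > Do you support LoRA SFT? So far I don't see a LoRA option in the source of sft.py.
>
> We haven't tested it yet; support is coming soon. In the meantime, you can try full-parameter fine-tuning.

Is there a tutorial that shows how to use my own dataset? Right now I can only dig through the code to work out how to train on my own data.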

Lingeng56 · 2025-01-24T10:31:08Z

> > Hello, I'm fine-tuning with the branch where you added support, and I have a question: can I use my own custom system prompt as the system prompt for fine-tuning? For example, I want the fine-tuned model to translate whatever I input into English, and I'd like to change the system prompt so it differs from the one used for normal QA. Is that possible?
>
> Of course it's supported. You can define your own template right here, in the following format:
>
> _register_template(
>     name="minicpm_o_audio",
>     format_user=StringFormatter(slots=["<|im_start|>user\n{{content}}<|im_end|>\n<|im_start|>assistant\n"]),
>     format_assistant=StringFormatter(slots=["{{content}}<|im_end|>\n"]),
>     format_system=StringFormatter(slots=["<|im_start|>system\n{{content}}<|im_end|>\n"]),
>     stop_words=["<|im_end|>"],
>     default_system=(
>         "不管输入什么都需要直接翻译为英文"
>     ),
>     mm_plugin=get_mm_plugin(name="minicpm_v", image_token="<image>", video_token="<video>"),
> )

Hello, I ran your modified LLaMA-Factory branch and it raises an error saying the processor is NoneType.

BUAADreamer · 2025-01-24T12:48:16Z

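For now, transformers==4.45.0 is recommended; fine-tuning and inference both run reliably with it.

[Important] Install the latest LLaMA-Factory and the required libraries as follows: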

git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics,deepspeed,minicpm_v]"
pip3 install transformers==4.45.0
pip3 install huggingface_hub==0.25.0
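Since the suggestion above depends on specific pinned versions, a quick check that the active environment really has them can save a debugging round. The sketch below uses only the standard library. The helper name `find_mismatches` is hypothetical, and the pins are copied from the commands above.

```python
from importlib import metadata

# Version pins taken from the install commands above.
PINS = {"transformers": "4.45.0", "huggingface_hub": "0.25.0"}


def find_mismatches(pins: dict, get_version=metadata.version) -> dict:
    """Return {package: installed_version_or_None} for each pin that is not met."""
    problems = {}
    for package, wanted in pins.items():
        try:
            installed = get_version(package)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            problems[package] = installed
    return problems
```

Calling `find_mismatches(PINS)` inside the training environment returns an empty dict when both pins are satisfied. Any entry it does return names a package to reinstall at the pinned version.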

Gaiejj · 2025-01-24T19:48:13Z

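> > > > 😄 Hey! I think our framework, align-anything, has implemented this functionality. We have fine-tuned it on our open-source align-anything/text-audio-to-text dataset and provided a directly runnable script. Everyone is welcome to use it!
> > >
> > > Do you support LoRA SFT? So far I don't see a LoRA option in the source of sft.py.
> >
> > We haven't tested it yet; support is coming soon. In the meantime, you can try full-parameter fine-tuning.
>
> Is there a tutorial that shows how to use my own dataset? Right now I can only dig through the code to work out how to train on my own data.

Actually, the README and the documentation homepage already include examples. Could you check whether they meet your needs?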

Lingeng56 added the question label Jan 17, 2025

YuzaChongyi assigned Cuiunbo Jan 18, 2025

Gaiejj mentioned this issue Jan 18, 2025: feat: support minicpm-o audio fine-tuning PKU-Alignment/align-anything#117 (Merged)
