
examples/train_lora/llama3_lora_sft_ds3.yaml raises an error #5252

Closed
JerryZeyu opened this issue Aug 23, 2024 · 9 comments
Labels
solved This problem has been already solved

Comments

@JerryZeyu

Reminder

  • I have read the README and searched the existing issues.

System Info

When using ds_z3_config.json, training fails with the following error:

pydantic_core._pydantic_core.ValidationError: 1 validation error for DeepSpeedZeroConfig
[rank3]: stage3_prefetch_bucket_size
[rank3]: Input should be a valid integer, got a number with a fractional part [type=int_from_float, input_value=15099494.4, input_type=float]

Is this a DeepSpeed version issue?
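The failure mode itself is easy to illustrate with a minimal pydantic model (a sketch only; DeepSpeed's actual DeepSpeedZeroConfig has many more fields): pydantic v2 rejects a float with a fractional part for an int-typed field with exactly this int_from_float error.

```python
from pydantic import BaseModel, ValidationError

# Minimal stand-in for an int-typed config field (illustrative only;
# not DeepSpeed's real DeepSpeedZeroConfig class).
class ZeroConfigSketch(BaseModel):
    stage3_prefetch_bucket_size: int

try:
    # 15099494.4 is the offending value from the traceback above.
    ZeroConfigSketch(stage3_prefetch_bucket_size=15099494.4)
except ValidationError as err:
    print(err.errors()[0]["type"])  # int_from_float
```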

Reproduction

torch == 2.4.0
deepspeed == 0.15.0
llamafactory-cli train examples/train_lora/llama3_lora_sft_ds3.yaml

Expected behavior

No response

Others

No response

@github-actions github-actions bot added the pending This problem is yet to be addressed label Aug 23, 2024
@JerryZeyu
Author

However, running llamafactory-cli train examples/train_lora/llama3_lora_sft_ds0.yaml does not raise the error.

@zhangguoxin1

I'm running into the same problem.

@junqi-lu

Same error here; rolling DeepSpeed back to version 0.14.0 worked for me.

@sunzhufeng12345

Rolling DeepSpeed back to 0.14.0 then reports a version mismatch with PyTorch and fails to run; my PyTorch version is determined by my CUDA version.

@junqi-lu

> Rolling DeepSpeed back to 0.14.0 then reports a version mismatch with PyTorch and fails to run; my PyTorch version is determined by my CUDA version.

Have you tried pinning all dependencies to the versions recommended by the repository?

@sunzhufeng12345

> Rolling DeepSpeed back to 0.14.0 then reports a version mismatch with PyTorch and fails to run; my PyTorch version is determined by my CUDA version.

> Have you tried pinning all dependencies to the versions recommended by the repository?

My versions are the same as the OP's and I hit the same problem. I'm not training with llamafactory; I ran into this while reproducing LongWrite.
torch == 2.4.0
deepspeed == 0.15.0

@gannim

gannim commented Aug 27, 2024

I encountered a similar issue, and it was resolved by using DeepSpeed version 0.14.4. I suspect that the problem arises in later versions of DeepSpeed due to type checking with Pydantic. Specifically, when the stage3_prefetch_bucket_size option is set to auto, Accelerate calculates it based on the model's hidden size. However, I suspect that it might not be properly converted to an integer during this process, leading to the error.
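gannim's explanation checks out numerically. Assuming the auto heuristic is 0.9 * hidden_size² (an assumption based on the HF/Accelerate DeepSpeed integration) and Llama 3 8B's hidden_size of 4096, the computed bucket size is exactly the float from the traceback, and a plain int() cast would make it valid:

```python
# Sanity-checking the explanation above. Assumes the "auto" heuristic is
# 0.9 * hidden_size**2 (as in the HF/Accelerate DeepSpeed integration)
# and Llama 3 8B's hidden_size of 4096.
hidden_size = 4096
auto_value = 0.9 * hidden_size * hidden_size
print(auto_value)       # 15099494.4 -- matches input_value in the traceback
print(int(auto_value))  # 15099494   -- an integer cast satisfies the pydantic check
```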

zjysteven added a commit to zjysteven/lmms-finetune that referenced this issue Aug 27, 2024
@chenhuiyu
Contributor

> I encountered a similar issue, and it was resolved by using DeepSpeed version 0.14.4. I suspect that the problem arises in later versions of DeepSpeed due to type checking with Pydantic. Specifically, when the stage3_prefetch_bucket_size option is set to auto, Accelerate calculates it based on the model's hidden size. However, I suspect that it might not be properly converted to an integer during this process, leading to the error.

Thanks! This solution solved my issue!

@HughesZhang2021
Copy link

deepspeed==0.14.4 solved it for me.

@hiyouga hiyouga added solved This problem has been already solved and removed pending This problem is yet to be addressed labels Sep 3, 2024
@hiyouga hiyouga closed this as completed Sep 3, 2024
hiyouga added a commit that referenced this issue Sep 3, 2024
yuwangnexusera pushed a commit to yuwangnexusera/LLaMA-Factory that referenced this issue Sep 5, 2024
linyueqian pushed a commit to zjysteven/lmms-finetune that referenced this issue Sep 13, 2024
danielwusg pushed a commit to sunfanyunn/lmms-finetune that referenced this issue Nov 18, 2024