
examples/train_lora/llama3_lora_sft_ds3.yaml raises an error #5252

Closed
JerryZeyu opened this issue Aug 23, 2024 · 9 comments
Labels
solved This problem has been already solved

Comments

@JerryZeyu

Reminder

  • I have read the README and searched the existing issues.

System Info

When using ds_z3_config.json, training fails with the following error:

pydantic_core._pydantic_core.ValidationError: 1 validation error for DeepSpeedZeroConfig
[rank3]: stage3_prefetch_bucket_size
[rank3]: Input should be a valid integer, got a number with a fractional part [type=int_from_float, input_value=15099494.4, input_type=float]

Is this a DeepSpeed version issue?
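The failure mode itself is easy to illustrate with a minimal pydantic model (a sketch only; DeepSpeed's actual DeepSpeedZeroConfig has many more fields): pydantic v2 rejects a float with a fractional part for an int-typed field with exactly this int_from_float error.

```python
from pydantic import BaseModel, ValidationError

# Minimal stand-in for an int-typed config field (illustrative only;
# not DeepSpeed's real DeepSpeedZeroConfig class).
class ZeroConfigSketch(BaseModel):
    stage3_prefetch_bucket_size: int

try:
    # 15099494.4 is the offending value from the traceback above.
    ZeroConfigSketch(stage3_prefetch_bucket_size=15099494.4)
except ValidationError as err:
    print(err.errors()[0]["type"])  # int_from_float
```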

Reproduction

torch == 2.4.0
deepspeed == 0.15.0
llamafactory-cli train examples/train_lora/llama3_lora_sft_ds3.yaml

Expected behavior

No response

Others

No response

@github-actions github-actions bot added the pending This problem is yet to be addressed label Aug 23, 2024
@JerryZeyu
Author

However, running llamafactory-cli train examples/train_lora/llama3_lora_sft_ds0.yaml does not raise the error.

@zhangguoxin1

I'm running into the same problem.

@junqi-lu

Same error here; rolling DeepSpeed back to version 0.14.0 worked for me.

@sunzhufeng12345

Rolling DeepSpeed back to 0.14.0 then reports a version mismatch with PyTorch and fails to run; my PyTorch version is determined by my CUDA version.

@junqi-lu

> Rolling DeepSpeed back to 0.14.0 then reports a version mismatch with PyTorch and fails to run; my PyTorch version is determined by my CUDA version.

Have you tried pinning all dependencies to the versions recommended by the repository?

@sunzhufeng12345

> Rolling DeepSpeed back to 0.14.0 then reports a version mismatch with PyTorch and fails to run; my PyTorch version is determined by my CUDA version.

> Have you tried pinning all dependencies to the versions recommended by the repository?

My versions are the same as the OP's and I hit the same problem. I'm not training with llamafactory; I ran into this while reproducing LongWrite.
torch == 2.4.0
deepspeed == 0.15.0

@gannim

gannim commented Aug 27, 2024

I encountered a similar issue, and it was resolved by using DeepSpeed version 0.14.4. I suspect that the problem arises in later versions of DeepSpeed due to type checking with Pydantic. Specifically, when the stage3_prefetch_bucket_size option is set to auto, Accelerate calculates it based on the model's hidden size. However, I suspect that it might not be properly converted to an integer during this process, leading to the error.
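gannim's explanation checks out numerically. Assuming the auto heuristic is 0.9 * hidden_size² (an assumption based on the HF/Accelerate DeepSpeed integration) and Llama 3 8B's hidden_size of 4096, the computed bucket size is exactly the float from the traceback, and a plain int() cast would make it valid:

```python
# Sanity-checking the explanation above. Assumes the "auto" heuristic is
# 0.9 * hidden_size**2 (as in the HF/Accelerate DeepSpeed integration)
# and Llama 3 8B's hidden_size of 4096.
hidden_size = 4096
auto_value = 0.9 * hidden_size * hidden_size
print(auto_value)       # 15099494.4 -- matches input_value in the traceback
print(int(auto_value))  # 15099494   -- an integer cast satisfies the pydantic check
```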

zjysteven added a commit to zjysteven/lmms-finetune that referenced this issue Aug 27, 2024
@chenhuiyu
Contributor

> I encountered a similar issue, and it was resolved by using DeepSpeed version 0.14.4. I suspect that the problem arises in later versions of DeepSpeed due to type checking with Pydantic. Specifically, when the stage3_prefetch_bucket_size option is set to auto, Accelerate calculates it based on the model's hidden size. However, I suspect that it might not be properly converted to an integer during this process, leading to the error.

Thanks! This solution solved my issue!

@HughesZhang2021
Copy link

deepspeed==0.14.4 solved it for me.

@hiyouga hiyouga added solved This problem has been already solved and removed pending This problem is yet to be addressed labels Sep 3, 2024
@hiyouga hiyouga closed this as completed Sep 3, 2024
hiyouga added a commit that referenced this issue Sep 3, 2024
yuwangnexusera pushed a commit to yuwangnexusera/LLaMA-Factory that referenced this issue Sep 5, 2024
linyueqian pushed a commit to zjysteven/lmms-finetune that referenced this issue Sep 13, 2024
danielwusg pushed a commit to sunfanyunn/lmms-finetune that referenced this issue Nov 18, 2024