
After updating the code, re-running finetune.sh fails: TypeError: init_process_group() got multiple values for keyword argument 'backend' #112

Open
alisyzhu opened this issue Apr 26, 2023 · 7 comments


@alisyzhu

After pulling the latest git code yesterday, running finetune.sh again makes torchrun error out.
[Initial environment]
A100 * 1
accelerate 0.18.0
bitsandbytes 0.37.2
transformers 4.29.0.dev0
[Environment change v1]
Ran pip install transformers==4.28.1
Result: still fails
[Environment change v2]
Ran pip install git+https://github.com/huggingface/transformers@ff20f9cf3615a8638023bc82925573cb9d0f3560
Result: still fails, with the error below:
[Three screenshots of the traceback, ending in: TypeError: init_process_group() got multiple values for keyword argument 'backend']

@Facico
Owner

Facico commented Apr 26, 2023

finetune.py hasn't been changed in a month; this is an old problem. On a single GPU, skip torchrun and run the script directly with python.
(When you hit an error, try searching for it in our repo first.)
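For context, this class of TypeError arises when `backend` reaches `torch.distributed.init_process_group` both positionally and as a keyword (for example, a launcher-driven code path supplies one and user code supplies the other). A minimal stand-alone sketch of the mechanism, using a hypothetical stub rather than the real torch API:

```python
# Hypothetical stub mimicking the signature shape of
# torch.distributed.init_process_group(backend, ...).
def init_process_group(backend=None, **kwargs):
    return backend

args = ("nccl",)              # positional value, e.g. from a wrapper
kwargs = {"backend": "gloo"}  # keyword value, e.g. from user code

try:
    init_process_group(*args, **kwargs)
except TypeError as e:
    # Python itself raises "got multiple values for argument 'backend'"
    print("TypeError:", e)
```

Running under plain python on a single GPU sidesteps the distributed code path entirely, so the duplicated argument never occurs.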

@dizhenx

dizhenx commented Apr 27, 2023

finetune.py hasn't been changed in a month; this is an old problem. On a single GPU, skip torchrun and run the script directly with python. (When you hit an error, try searching for it in our repo first.)

I'm now getting the same error running bash finetune_others_continue.sh; it occurs at line 237 of finetune.py.

@Facico
Owner

Facico commented May 4, 2023

@dizhenx This has nothing to do with which script you use. Use torchrun for multi-GPU (our scripts default to multi-GPU); on a single GPU don't use it, just run with python directly.
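That advice can be summarized in a small helper (the function name and structure here are my own illustration, not from the repo): pick the launcher based on GPU count, since torchrun's distributed setup is what triggers the conflict on a single card.

```python
# Hypothetical helper (not from the repo): choose a launch command
# based on how many GPUs the run will use.
def launch_command(script: str, n_gpus: int) -> str:
    if n_gpus > 1:
        # Multi-GPU: the repo's scripts default to torchrun.
        return f"torchrun --nproc_per_node={n_gpus} {script}"
    # Single GPU: plain python avoids torch.distributed entirely.
    return f"python {script}"

print(launch_command("finetune.py", 1))  # → python finetune.py
print(launch_command("finetune.py", 4))  # → torchrun --nproc_per_node=4 finetune.py
```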

@wangrui6
Copy link

wangrui6 commented May 19, 2023

@Facico Do you have any experience with good training parameter combinations for an A100?
```python
# optimized for RTX 4090. for larger GPUs, increase some of these?
MICRO_BATCH_SIZE = 4  # this could actually be 5 but i like powers of 2
BATCH_SIZE = 128
MAX_STEPS = None
GRADIENT_ACCUMULATION_STEPS = BATCH_SIZE // MICRO_BATCH_SIZE
EPOCHS = 3  # we don't always need 3 tbh
LEARNING_RATE = 3e-4  # the Karpathy constant
CUTOFF_LEN = 256  # 256 accounts for about 96% of the data
LORA_R = 8
LORA_ALPHA = 16
LORA_DROPOUT = 0.05
VAL_SET_SIZE = args.test_size  # 2000
```
Especially the first few.
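As a sanity check on how those first few values interact (my own arithmetic, not from the thread): the effective batch size is MICRO_BATCH_SIZE × GRADIENT_ACCUMULATION_STEPS, so the accumulation steps are derived from the target batch size.

```python
MICRO_BATCH_SIZE = 4
BATCH_SIZE = 128  # target effective batch size
GRADIENT_ACCUMULATION_STEPS = BATCH_SIZE // MICRO_BATCH_SIZE

print(GRADIENT_ACCUMULATION_STEPS)  # → 32

# On a larger GPU such as an A100 you could raise MICRO_BATCH_SIZE
# (and let the accumulation steps shrink) while keeping the
# effective batch size at 128.
assert MICRO_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS == BATCH_SIZE
```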

@wangrui6

@Facico Also, how fast is finetuning? Roughly how long does finetuning vicuna13b on a 4090 take with about 100K samples? Are there any reference numbers?

@benjamin555

vicuna13b

Does this framework support finetuning vicuna13b?

@Facico
Owner

Facico commented Jun 29, 2023

@wangrui6 Generally you only need to adjust CUTOFF_LEN to fit your hardware. I don't remember exactly; it was something like a few hundred thousand samples taking about 200h.
@benjamin555 Any model with a llama base is supported.
