Running the command fails with "flash_attn not installed"; after installing it, an ImportError is raised. Same problem with docker compose and docker #4592
Comments
INSTALL_FLASHATTN=true
With INSTALL_FLASHATTN=true, the latest version gets installed and raises the error. Following Dao-AILab/flash-attention#966 (comment), installing torch==2.3.0 and flash-attn==2.5.8 resolved the `undefined symbol: _ZN3c104cuda14ExchangeDeviceEa` error.
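A minimal sketch of that pinned install; the `--no-build-isolation` flag is taken from a later comment in this thread, not from the original suggestion:

```bash
# Pin torch first so flash-attn is built against the matching ABI
pip install torch==2.3.0
# Then build flash-attn against the torch that is already installed
pip install flash-attn==2.5.8 --no-build-isolation
```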
This is likely an issue in the GLM model code; you can try updating the file: https://huggingface.co/THUDM/glm-4-9b-chat/blob/main/modeling_chatglm.py#L30-L36
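One hedged way to apply that update, assuming the checkpoint lives in the modelscope cache path quoted in the reproduction command below (adjust to your local layout):

```bash
# Overwrite the cached modeling file with the updated copy from the Hub
wget https://huggingface.co/THUDM/glm-4-9b-chat/resolve/main/modeling_chatglm.py \
  -O /home/xx/.cache/modelscope/hub/ZhipuAI/glm-4-9b-chat/modeling_chatglm.py
```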
After many attempts, I found that inside Docker it only runs with torch==2.1.2 and `pip install flash-attn --no-build-isolation`; after that, torchtext and torchvision both have to be downgraded to 0.16.2. The torch==2.3.0 / flash-attn==2.5.8 combination mentioned above didn't work for me either. I don't know how the first attempt succeeded; could it be related to the CUDA version inside Docker? I later tried docker compose and couldn't get it to run no matter what. Can the flash-attn dependency be made optional? In an environment built with `pip install -e .`, installing flash-attn just hangs, so Docker is my only option.
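To check whether the container's CUDA toolkit matches the torch build (a plausible cause of the mismatch suspected above), something like:

```bash
# torch version and the CUDA version it was compiled against
python -c "import torch; print(torch.__version__, torch.version.cuda)"
# CUDA toolkit available inside the container (if nvcc is installed)
nvcc --version
```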
This has been fixed in e3141f5.
`llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml` works now, but running with command-line arguments (`llamafactory-cli train --stage sft --do_train True`, i.e. what the web UI does) still reports that flash_attn is not installed. Trying to install flash_attn inside Docker fails with an error. The docker compose setup pulled on the 26th runs on another machine with dual 4090s; the machine that fails has a single 4090. exit code: 1
Traceback (most recent call last):
note: This error originates from a subprocess, and is likely not a problem with pip.
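When the from-source build dies in a pip subprocess like this, one common workaround is installing a prebuilt wheel from the flash-attention GitHub releases page instead of compiling. The filename below is illustrative only; the real one has to be copied from the release page to match your torch, CUDA, and Python versions:

```bash
# Illustrative only: pick the wheel matching your exact torch/CUDA/Python ABI
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.5.8/flash_attn-2.5.8+cu122torch2.3cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
```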
Reminder
System Info
OS: WSL2
CUDA 12.3
Latest LLaMA-Factory, docker compose
Reproduction
Command line:
llamafactory-cli train \
    --stage sft \
    --do_train True \
    --model_name_or_path /home/xx/.cache/modelscope/hub/ZhipuAI/glm-4-9b-chat/ \
    --preprocessing_num_workers 16 \
    --finetuning_type lora \
    --template glm4 \
    --dataset_dir data \
    --dataset test \
    --cutoff_len 1024 \
    --learning_rate 5e-05 \
    --num_train_epochs 3.0 \
    --max_samples 100000 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --max_grad_norm 1.0 \
    --logging_steps 5 \
    --save_steps 50 \
    --warmup_steps 0 \
    --optim adamw_torch \
    --packing False \
    --report_to none \
    --output_dir saves/GLM-4-9B-Chat/lora/train_2024-06-27-13-02-26 \
    --fp16 True \
    --plot_loss True \
    --ddp_timeout 180000000 \
    --include_num_input_tokens_seen True \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0 \
    --lora_target all
Tried with and without --flash_attn auto on the command line, as well as with `llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml`.
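As an aside, since the thread asks whether flash-attn can be avoided entirely: LLaMA-Factory exposes a `--flash_attn` option, and passing `disabled` (assuming that choice exists in your version) should skip flash attention at train time. Note this does not stop transformers from requiring the flash_attn package when the model's remote code imports it unconditionally, which is why the modeling-file update suggested above still matters. A sketch:

```bash
# Hypothetical opt-out: assumes this LLaMA-Factory version accepts the
# "disabled" value for --flash_attn; remaining flags as in the repro above
llamafactory-cli train --stage sft --do_train True --flash_attn disabled ...
```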
.yaml:
### model
model_name_or_path: modles/ZhipuAI/glm-4-9b-chat/

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all

### dataset
dataset: test
template: glm4
cutoff_len: 1024
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/GLM-4-9B-Chat/lora/train_2024-06-27-13-02-26
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 5.0e-5
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
fp16: true
ddp_timeout: 180000000

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500
All of them fail with the following error:
##################################
Traceback (most recent call last):
File "/usr/local/bin/llamafactory-cli", line 8, in
sys.exit(main())
File "/app/src/llamafactory/cli.py", line 111, in main
run_exp()
File "/app/src/llamafactory/train/tuner.py", line 50, in run_exp
run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
File "/app/src/llamafactory/train/sft/workflow.py", line 49, in run_sft
model = load_model(tokenizer, model_args, finetuning_args, training_args.do_train)
File "/app/src/llamafactory/model/loader.py", line 152, in load_model
model = AutoModelForCausalLM.from_pretrained(**init_kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 550, in from_pretrained
model_class = get_class_from_dynamic_module(
File "/usr/local/lib/python3.10/dist-packages/transformers/dynamic_module_utils.py", line 501, in get_class_from_dynamic_module
final_module = get_cached_module_file(
File "/usr/local/lib/python3.10/dist-packages/transformers/dynamic_module_utils.py", line 326, in get_cached_module_file
modules_needed = check_imports(resolved_module_file)
File "/usr/local/lib/python3.10/dist-packages/transformers/dynamic_module_utils.py", line 181, in check_imports
raise ImportError(
ImportError: This modeling file requires the following packages that were not found in your environment: flash_attn. Run `pip install flash_attn`
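A quick way to verify whether the package actually imports in the container before rerunning training (the `__version__` attribute is assumed to be present, as in recent flash-attn releases):

```bash
# Sanity check: does flash_attn import at all in this environment?
python -c "import flash_attn; print(flash_attn.__version__)"
```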
#####################################################################
After installing flash_attn, the error becomes:
Traceback (most recent call last):
File "/usr/local/bin/llamafactory-cli", line 5, in
from llamafactory.cli import main
File "/app/src/llamafactory/init.py", line 17, in
from .cli import VERSION
File "/app/src/llamafactory/cli.py", line 21, in
from . import launcher
File "/app/src/llamafactory/launcher.py", line 15, in
from llamafactory.train.tuner import run_exp
File "/app/src/llamafactory/train/tuner.py", line 27, in
from ..model import load_model, load_tokenizer
File "/app/src/llamafactory/model/init.py", line 15, in
from .loader import load_config, load_model, load_tokenizer
File "/app/src/llamafactory/model/loader.py", line 28, in
from .patcher import patch_config, patch_model, patch_tokenizer, patch_valuehead_model
File "/app/src/llamafactory/model/patcher.py", line 30, in
from .model_utils.longlora import configure_longlora
File "/app/src/llamafactory/model/model_utils/longlora.py", line 25, in
from transformers.models.llama.modeling_llama import (
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 54, in
from flash_attn import flash_attn_func, flash_attn_varlen_func
File "/usr/local/lib/python3.10/dist-packages/flash_attn/init.py", line 3, in
from flash_attn.flash_attn_interface import (
File "/usr/local/lib/python3.10/dist-packages/flash_attn/flash_attn_interface.py", line 10, in
import flash_attn_2_cuda as flash_attn_cuda
ImportError: /usr/local/lib/python3.10/dist-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda14ExchangeDeviceEa。
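That undefined symbol demangles to a torch-internal C++ function (c10::cuda::ExchangeDevice), which typically means the flash-attn binary was compiled against a different torch than the one installed, consistent with the version-pinning comments earlier in this thread. A hedged recovery sketch:

```bash
# Remove the mismatched binary, then rebuild against the torch already installed
pip uninstall -y flash-attn
pip install flash-attn --no-build-isolation
```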
Expected behavior
No response
Others
No response