Running the command errors with flash_attn not installed; after installing it, ImportError. Same problem with docker compose and docker #4592

Closed
1 task done
goodmaney opened this issue Jun 27, 2024 · 6 comments
Labels
solved This problem has been already solved

Comments

@goodmaney

goodmaney commented Jun 27, 2024

Reminder

  • I have read the README and searched the existing issues.

System Info

OS: WSL2
CUDA 12.3
Latest LLaMA-Factory, docker compose

Reproduction

Command line:
llamafactory-cli train \
--stage sft \
--do_train True \
--model_name_or_path /home/xx/.cache/modelscope/hub/ZhipuAI/glm-4-9b-chat/ \
--preprocessing_num_workers 16 \
--finetuning_type lora \
--template glm4 \
--dataset_dir data \
--dataset test \
--cutoff_len 1024 \
--learning_rate 5e-05 \
--num_train_epochs 3.0 \
--max_samples 100000 \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 8 \
--lr_scheduler_type cosine \
--max_grad_norm 1.0 \
--logging_steps 5 \
--save_steps 50 \
--warmup_steps 0 \
--optim adamw_torch \
--packing False \
--report_to none \
--output_dir saves/GLM-4-9B-Chat/lora/train_2024-06-27-13-02-26 \
--fp16 True \
--plot_loss True \
--ddp_timeout 180000000 \
--include_num_input_tokens_seen True \
--lora_rank 8 \
--lora_alpha 16 \
--lora_dropout 0 \
--lora_target all
Tried the command line both with and without --flash_attn auto,
and also ran llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml
with the following .yaml:

### model

model_name_or_path: modles/ZhipuAI/glm-4-9b-chat/

### method

stage: sft
do_train: true
finetuning_type: lora
lora_target: all

### dataset

dataset: test
template: glm4
cutoff_len: 1024
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output

output_dir: saves/GLM-4-9B-Chat/lora/train_2024-06-27-13-02-26
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train

per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 5.0e-5
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
fp16: true
ddp_timeout: 180000000

### eval

val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500

Both fail with the following error:
##################################
Traceback (most recent call last):
File "/usr/local/bin/llamafactory-cli", line 8, in
sys.exit(main())
File "/app/src/llamafactory/cli.py", line 111, in main
run_exp()
File "/app/src/llamafactory/train/tuner.py", line 50, in run_exp
run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
File "/app/src/llamafactory/train/sft/workflow.py", line 49, in run_sft
model = load_model(tokenizer, model_args, finetuning_args, training_args.do_train)
File "/app/src/llamafactory/model/loader.py", line 152, in load_model
model = AutoModelForCausalLM.from_pretrained(**init_kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 550, in from_pretrained
model_class = get_class_from_dynamic_module(
File "/usr/local/lib/python3.10/dist-packages/transformers/dynamic_module_utils.py", line 501, in get_class_from_dynamic_module
final_module = get_cached_module_file(
File "/usr/local/lib/python3.10/dist-packages/transformers/dynamic_module_utils.py", line 326, in get_cached_module_file
modules_needed = check_imports(resolved_module_file)
File "/usr/local/lib/python3.10/dist-packages/transformers/dynamic_module_utils.py", line 181, in check_imports
raise ImportError(
ImportError: This modeling file requires the following packages that were not found in your environment: flash_attn. Run pip install flash_attn
#####################################################################
After installing flash_attn, the error becomes:

Traceback (most recent call last):
File "/usr/local/bin/llamafactory-cli", line 5, in <module>
from llamafactory.cli import main
File "/app/src/llamafactory/__init__.py", line 17, in <module>
from .cli import VERSION
File "/app/src/llamafactory/cli.py", line 21, in <module>
from . import launcher
File "/app/src/llamafactory/launcher.py", line 15, in <module>
from llamafactory.train.tuner import run_exp
File "/app/src/llamafactory/train/tuner.py", line 27, in <module>
from ..model import load_model, load_tokenizer
File "/app/src/llamafactory/model/__init__.py", line 15, in <module>
from .loader import load_config, load_model, load_tokenizer
File "/app/src/llamafactory/model/loader.py", line 28, in <module>
from .patcher import patch_config, patch_model, patch_tokenizer, patch_valuehead_model
File "/app/src/llamafactory/model/patcher.py", line 30, in <module>
from .model_utils.longlora import configure_longlora
File "/app/src/llamafactory/model/model_utils/longlora.py", line 25, in <module>
from transformers.models.llama.modeling_llama import (
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 54, in <module>
from flash_attn import flash_attn_func, flash_attn_varlen_func
File "/usr/local/lib/python3.10/dist-packages/flash_attn/__init__.py", line 3, in <module>
from flash_attn.flash_attn_interface import (
File "/usr/local/lib/python3.10/dist-packages/flash_attn/flash_attn_interface.py", line 10, in <module>
import flash_attn_2_cuda as flash_attn_cuda
ImportError: /usr/local/lib/python3.10/dist-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda14ExchangeDeviceEa

Expected behavior

No response

Others

No response

@github-actions github-actions bot added the pending This problem is yet to be addressed label Jun 27, 2024
@goodmaney goodmaney changed the title from "Running the command under docker compose errors with flash_attn not installed; after installing it, ImportError" to "Running the command errors with flash_attn not installed; after installing it, ImportError. Same problem with docker compose and docker" Jun 27, 2024
@hiyouga
Owner

hiyouga commented Jun 27, 2024

INSTALL_FLASHATTN=true

INSTALL_FLASHATTN: false
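
For context, INSTALL_FLASHATTN appears to be the build argument of LLaMA-Factory's CUDA Docker image: leaving it at false skips flash-attn, while setting it to true builds flash-attn into the image. A minimal sketch of enabling it (file locations and defaults may differ between versions):

# Rebuild the image with the flash-attn build argument enabled.
docker build --build-arg INSTALL_FLASHATTN=true -t llamafactory:latest .

# With docker compose, change INSTALL_FLASHATTN to "true" under build.args
# in docker-compose.yml, then rebuild and restart:
docker compose up -d --build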

@hiyouga hiyouga added solved This problem has been already solved and removed pending This problem is yet to be addressed labels Jun 27, 2024
@hiyouga hiyouga closed this as completed Jun 27, 2024
@goodmaney
Author

goodmaney commented Jun 27, 2024

INSTALL_FLASHATTN=true

With INSTALL_FLASHATTN=true, the latest flash-attn version gets installed and it still fails. Following Dao-AILab/flash-attention#966 (comment), installing torch==2.3.0 and flash-attn==2.5.8 resolved the undefined symbol: _ZN3c104cuda14ExchangeDeviceEa error.
Is flash-attn mandatory for a 4090?
Training with the command line above runs into the error from #4441 (comment); which parameter do I change to use SDPA attention?
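
The pinned install described above, as commands (versions taken from the linked flash-attention issue; whether they match depends on the CUDA toolkit in the image):

# Pin torch and flash-attn to ABI-compatible releases, per Dao-AILab/flash-attention#966.
pip install torch==2.3.0
pip install flash-attn==2.5.8 --no-build-isolation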

@hiyouga
Owner

hiyouga commented Jun 27, 2024

It should be a problem with the GLM model code. You can try updating this file: https://huggingface.co/THUDM/glm-4-9b-chat/blob/main/modeling_chatglm.py#L30-L36
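
One way to pick up the updated modeling file is to overwrite the locally cached copy with the version from the Hugging Face repo, for example (a sketch; the target path is the local model directory used earlier in this issue):

# Fetch the patched modeling_chatglm.py and replace the local copy.
wget https://huggingface.co/THUDM/glm-4-9b-chat/resolve/main/modeling_chatglm.py \
  -O /home/xx/.cache/modelscope/hub/ZhipuAI/glm-4-9b-chat/modeling_chatglm.py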

@goodmaney
Author

goodmaney commented Jun 27, 2024

INSTALL_FLASHATTN=true

INSTALL_FLASHATTN: false

After many attempts, I found that inside docker it only runs with torch==2.1.2 plus pip install flash-attn --no-build-isolation, and after that torchtext and torchvision both have to be switched to 0.16.2. The torch==2.3.0 / flash-attn==2.5.8 combination mentioned above does not work either. I don't know how it succeeded the first time; maybe it depends on the CUDA version inside docker? I later tried docker compose and could not get it to run no matter what.
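
The combination reported working here, written out as commands (versions are those from this comment; building flash-attn from source can take a long time):

# Versions reported to work inside the docker image in this case.
pip install torch==2.1.2 torchvision==0.16.2 torchtext==0.16.2
pip install flash-attn --no-build-isolation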

Is there any way to avoid calling flash-attn at all? In the environment I built with pip install -e ., installing flash-attn just hangs, so I can only use docker.

@hiyouga
Owner

hiyouga commented Jun 27, 2024

Fixed in e3141f5

@goodmaney
Author

goodmaney commented Jun 28, 2024

Fixed in e3141f5

llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml works now. But running the command line with arguments, i.e. llamafactory-cli train --stage sft --do_train True (which is what the webui does), still reports that flash_attn is not installed. Trying to install flash_attn inside docker fails with an error. The docker compose setup downloaded on the 26th runs on another machine with dual 4090s; the machine that errors has a single 4090.

exit code: 1
╰─> [165 lines of output]
fatal: not a git repository (or any of the parent directories): .git

  torch.__version__  = 2.3.0a0+ebedce2


  /usr/local/lib/python3.10/dist-packages/setuptools/__init__.py:80: _DeprecatedInstaller: setuptools.installer and fetch_build_eggs are deprecated.
  !!

          ********************************************************************************
          Requirements should be satisfied by a PEP 517 installer.
          If you are using pip, you can try `pip install --use-pep517`.
          ********************************************************************************

  !!
    dist.fetch_build_eggs(dist.setup_requires)

Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 2095, in _run_ninja_build
subprocess.run(
File "/usr/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v', '-j', '4']' returned non-zero exit status 1.

  The above exception was the direct cause of the following exception:

  Traceback (most recent call last):
    File "<string>", line 2, in <module>
    File "<pip-setuptools-caller>", line 34, in <module>
    File "/tmp/pip-install-x_jhgpxb/flash-attn_2f2e7ee88bc743f1bc99623ecc04d0cc/setup.py", line 311, in <module>
      setup(
    File "/usr/local/lib/python3.10/dist-packages/setuptools/__init__.py", line 103, in setup
      return distutils.core.setup(**attrs)
    File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/core.py", line 185, in setup
      return run_commands(dist)
    File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/core.py", line 201, in run_commands
      dist.run_commands()
    File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/dist.py", line 969, in run_commands
      self.run_command(cmd)
    File "/usr/local/lib/python3.10/dist-packages/setuptools/dist.py", line 989, in run_command
      super().run_command(command)
    File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/dist.py", line 988, in run_command
      cmd_obj.run()
    File "/tmp/pip-install-x_jhgpxb/flash-attn_2f2e7ee88bc743f1bc99623ecc04d0cc/setup.py", line 266, in run
      return super().run()
    File "/usr/local/lib/python3.10/dist-packages/wheel/bdist_wheel.py", line 368, in run
      self.run_command("build")
    File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/cmd.py", line 318, in run_command
      self.distribution.run_command(command)
    File "/usr/local/lib/python3.10/dist-packages/setuptools/dist.py", line 989, in run_command
      super().run_command(command)
    File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/dist.py", line 988, in run_command
      cmd_obj.run()
    File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/command/build.py", line 131, in run
      self.run_command(cmd_name)
    File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/cmd.py", line 318, in run_command
      self.distribution.run_command(command)
    File "/usr/local/lib/python3.10/dist-packages/setuptools/dist.py", line 989, in run_command
      super().run_command(command)
    File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/dist.py", line 988, in run_command
      cmd_obj.run()
    File "/usr/local/lib/python3.10/dist-packages/setuptools/command/build_ext.py", line 88, in run
      _build_ext.run(self)
    File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/command/build_ext.py", line 345, in run
      self.build_extensions()
    File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 870, in build_extensions
      build_ext.build_extensions(self)
    File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/command/build_ext.py", line 467, in build_extensions
      self._build_extensions_serial()
    File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/command/build_ext.py", line 493, in _build_extensions_serial
      self.build_extension(ext)
    File "/usr/local/lib/python3.10/dist-packages/setuptools/command/build_ext.py", line 249, in build_extension
      _build_ext.build_extension(self, ext)
    File "/usr/local/lib/python3.10/dist-packages/Cython/Distutils/build_ext.py", line 135, in build_extension
      super(build_ext, self).build_extension(ext)
    File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/command/build_ext.py", line 548, in build_extension
      objects = self.compiler.compile(
    File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 683, in unix_wrap_ninja_compile
      _write_ninja_file_and_compile_objects(
    File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 1773, in _write_ninja_file_and_compile_objects
      _run_ninja_build(
    File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 2111, in _run_ninja_build
      raise RuntimeError(message) from e
  RuntimeError: Error compiling objects for extension
  [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for flash-attn
Running setup.py clean for flash-attn
Failed to build flash-attn
ERROR: ERROR: Failed to build installable wheels for some pyproject.toml based projects (flash-attn)

hiyouga added a commit that referenced this issue Jun 30, 2024
PrimaLuz pushed a commit to PrimaLuz/LLaMA-Factory that referenced this issue Jul 1, 2024
xtchen96 pushed a commit to xtchen96/LLaMA-Factory that referenced this issue Jul 17, 2024