torch and deepspeed versions are currently incompatible #4271

Closed
1 task done
canberkgurel opened this issue Jun 13, 2024 · 5 comments
Labels
solved This problem has been already solved

Comments

@canberkgurel

Reminder

  • I have read the README and searched the existing issues.

System Info

- `llamafactory` version: 0.8.2.dev0
- Platform: Linux-5.15.0-105-generic-x86_64-with-glibc2.35
- Python version: 3.10.12
- PyTorch version: 2.4.0a0+07cecf4168.nv24.05 (GPU)
- Transformers version: 4.41.2
- Datasets version: 2.20.0
- Accelerate version: 0.31.0
- PEFT version: 0.11.1
- TRL version: 0.9.4
- GPU type: NVIDIA RTX A6000
- DeepSpeed version: 0.14.0

Reproduction

The error ImportError: cannot import name 'log' is raised inside the deepspeed library, in elastic_agent.py.
The statement from torch.distributed.elastic.agent.server.api import log, _get_socket_with_port fails because that module does not expose the name log.
The installed torch build appears to have removed or relocated log, so the deepspeed version in use (0.14.0) is likely incompatible with it.
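A quick way to confirm the diagnosis without importing deepspeed is to check whether the torch module still exposes the two names. The following is a minimal sketch, not part of either library: the helper name can_import_name is hypothetical, and it only approximates what a from-import does (it does not look for submodules).

```python
import importlib


def can_import_name(module_name: str, name: str) -> bool:
    """Return True if `module_name` imports cleanly and exposes `name`.

    Hypothetical helper for illustration; it approximates
    `from module_name import name` with a plain attribute check.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The module itself (or one of its parents) is missing.
        return False
    return hasattr(module, name)


if __name__ == "__main__":
    # Module and names taken from the failing import in elastic_agent.py.
    target = "torch.distributed.elastic.agent.server.api"
    for name in ("log", "_get_socket_with_port"):
        status = "available" if can_import_name(target, name) else "missing"
        print(f"{target}.{name}: {status}")
```

If log is reported as missing while _get_socket_with_port is available, the failure matches the traceback below.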

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/trl/import_utils.py", line 180, in _get_module
    return importlib.import_module("." + module_name, self.__name__)
  File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/local/lib/python3.10/dist-packages/trl/trainer/dpo_trainer.py", line 46, in <module>
    from .utils import (
  File "/usr/local/lib/python3.10/dist-packages/trl/trainer/utils.py", line 51, in <module>
    import deepspeed
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/__init__.py", line 26, in <module>
    from . import module_inject
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/module_inject/__init__.py", line 6, in <module>
    from .replace_module import replace_transformer_layer, revert_transformer_layer, ReplaceWithTensorSlicing, GroupQuantizer, generic_injection
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/module_inject/replace_module.py", line 607, in <module>
    from ..pipe import PipelineModule
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/pipe/__init__.py", line 6, in <module>
    from ..runtime.pipe import PipelineModule, LayerSpec, TiedLayerSpec
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/pipe/__init__.py", line 6, in <module>
    from .module import PipelineModule, LayerSpec, TiedLayerSpec
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/pipe/module.py", line 19, in <module>
    from ..activation_checkpointing import checkpointing
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 26, in <module>
    from deepspeed.runtime.config import DeepSpeedConfig
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/config.py", line 42, in <module>
    from ..elasticity import (
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/elasticity/__init__.py", line 10, in <module>
    from .elastic_agent import DSElasticAgent
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/elasticity/elastic_agent.py", line 9, in <module>
    from torch.distributed.elastic.agent.server.api import log, _get_socket_with_port
ImportError: cannot import name 'log' from 'torch.distributed.elastic.agent.server.api' (/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/bin/llamafactory-cli", line 5, in <module>
    from llamafactory.cli import main
  File "/llama3_finetuning/LLaMA-Factory/src/llamafactory/__init__.py", line 3, in <module>
    from .cli import VERSION
  File "/llama3_finetuning/LLaMA-Factory/src/llamafactory/cli.py", line 7, in <module>
    from . import launcher
  File "/llama3_finetuning/LLaMA-Factory/src/llamafactory/launcher.py", line 1, in <module>
    from llamafactory.train.tuner import run_exp
  File "/llama3_finetuning/LLaMA-Factory/src/llamafactory/train/tuner.py", line 11, in <module>
    from .dpo import run_dpo
  File "/llama3_finetuning/LLaMA-Factory/src/llamafactory/train/dpo/__init__.py", line 1, in <module>
    from .workflow import run_dpo
  File "/llama3_finetuning/LLaMA-Factory/src/llamafactory/train/dpo/workflow.py", line 11, in <module>
    from .trainer import CustomDPOTrainer
  File "/llama3_finetuning/LLaMA-Factory/src/llamafactory/train/dpo/trainer.py", line 10, in <module>
    from trl import DPOTrainer
  File "<frozen importlib._bootstrap>", line 1075, in _handle_fromlist
  File "/usr/local/lib/python3.10/dist-packages/trl/import_utils.py", line 171, in __getattr__
    value = getattr(module, name)
  File "/usr/local/lib/python3.10/dist-packages/trl/import_utils.py", line 170, in __getattr__
    module = self._get_module(self._class_to_module[name])
  File "/usr/local/lib/python3.10/dist-packages/trl/import_utils.py", line 182, in _get_module
    raise RuntimeError(
RuntimeError: Failed to import trl.trainer.dpo_trainer because of the following error (look up to see its traceback):
cannot import name 'log' from 'torch.distributed.elastic.agent.server.api' (/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py)

Expected behavior

pip install deepspeed installs version 0.14.0, which is not compatible with this torch build.
One possible solution is to install a newer version of deepspeed: both pip install deepspeed==0.14.1 and pip install deepspeed==0.14.3 work.
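The check can also be scripted before launching training. This is a rough sketch under one assumption that the issue does not confirm: that every deepspeed release from 0.14.1 onward works, when only specific newer versions were actually reported as working here. The names MIN_REPORTED_WORKING, parse_release and deepspeed_looks_compatible are hypothetical.

```python
import re
from importlib import metadata

# Assumption: 0.14.0 fails and 0.14.1 is the first release reported to work.
MIN_REPORTED_WORKING = (0, 14, 1)


def parse_release(version: str) -> tuple:
    """Extract (major, minor, patch) from a version string such as '0.14.3'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def deepspeed_looks_compatible(version: str) -> bool:
    """Return True if `version` is at or above the first reported working release."""
    return parse_release(version) >= MIN_REPORTED_WORKING


if __name__ == "__main__":
    try:
        installed = metadata.version("deepspeed")
    except metadata.PackageNotFoundError:
        print("deepspeed is not installed")
    else:
        if deepspeed_looks_compatible(installed):
            print(f"deepspeed {installed} is at or above the reported working version")
        else:
            print(f"deepspeed {installed} is older than 0.14.1; consider upgrading")
```

The comparison uses only the numeric release part, so local suffixes such as +cu118 are ignored.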

Others

No response

@github-actions github-actions bot added the pending This problem is yet to be addressed label Jun 13, 2024
@hiyouga hiyouga added solved This problem has been already solved and removed pending This problem is yet to be addressed labels Jun 14, 2024
@hiyouga
Owner

hiyouga commented Jun 14, 2024

fixed

@lonngxiang

on: 2.4.0a0+07cecf41

same error
llamafactory 0.8.4.dev0
torch 2.4.0
torchaudio 2.1.2+cu118
torchmetrics 1.4.0.post0
torchvision 0.19.0
tornado 6.4
tqdm 4.66.4
traitlets 5.14.3
transformers 4.43.3
triton 3.0.0
trl 0.9.6
typer 0.12.3
types-python-dateutil 2.9.0.20240316
typing_extensions 4.9.0
typing-inspect 0.9.0
tyro 0.8.6

@dogeeelin

Same error; it seems this is not fixed yet.

@soultrans

+1

@dogeeelin

Same error; it seems this is not fixed yet.
Running pip install deepspeed==0.14.4 solved the problem for me.
