Device 0 is not recognized #24

Closed
giobin opened this issue Nov 17, 2023 · 6 comments · Fixed by #25
Labels: bug (Something isn't working)

giobin commented Nov 17, 2023

Hello!
First of all, very nice work!

I have an issue running the PPO_finetuning example: it seems that it doesn't recognize the GPU device.

I'm running on this setup:
[Screenshot of the machine setup, 2023-11-17 18:28]

My command is the following:
python -m lamorel_launcher.launch --config-path /data/disk1/share/gbonetta/progetti/lamorel/examples/PPO_finetuning/ --config-name local_gpu_config rl_script_args.path=/data/disk1/share/gbonetta/progetti/lamorel/examples/PPO_finetuning/main.py rl_script_args.output_dir=/data/disk1/share/gbonetta/progetti/lamorel/gio_experiments lamorel_args.accelerate_args.machine_rank=0 lamorel_args.llm_args.model_path=t5-small
and this is the error:

/data/disk1/share/gbonetta/progetti/lamorel/lamorel/src/lamorel_launcher/launch.py:15: UserWarning: 
The version_base parameter is not specified.
Please specify a compatability version level, or None.
Will assume defaults for version 1.1
  @hydra.main(config_path='', config_name='')
/home/gbonetta/miniconda3/envs/lamorel_env/lib/python3.9/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
[2023-11-17 18:22:01,325][root][INFO] - Using nproc_per_node=2.
[2023-11-17 18:22:01,325][torch.distributed.elastic.rendezvous.static_tcp_rendezvous][INFO] - Creating TCPStore as the c10d::Store implementation
Detected kernel version 4.15.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
/data/disk1/share/gbonetta/progetti/lamorel/examples/PPO_finetuning/main.py:150: UserWarning: 
The version_base parameter is not specified.
Please specify a compatability version level, or None.
Will assume defaults for version 1.1
  @hydra.main(config_path='config', config_name='config')
/data/disk1/share/gbonetta/progetti/lamorel/examples/PPO_finetuning/main.py:150: UserWarning: 
The version_base parameter is not specified.
Please specify a compatability version level, or None.
Will assume defaults for version 1.1
  @hydra.main(config_path='config', config_name='config')
/home/gbonetta/miniconda3/envs/lamorel_env/lib/python3.9/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
/home/gbonetta/miniconda3/envs/lamorel_env/lib/python3.9/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
[2023-11-17 18:22:03,085][accelerate.utils.other][WARNING] - Detected kernel version 4.15.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
[2023-11-17 18:22:03,085][lamorel_logger][INFO] - Init rl group for process 0
[2023-11-17 18:22:03,087][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:2 to store for rank: 0
[2023-11-17 18:22:03,361][lamorel_logger][INFO] - Init rl group for process 1
[2023-11-17 18:22:03,361][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:2 to store for rank: 1
[2023-11-17 18:22:03,361][torch.distributed.distributed_c10d][INFO] - Rank 1: Completed store-based barrier for key:store_based_barrier_key:2 with 2 nodes.
[2023-11-17 18:22:03,361][lamorel_logger][INFO] - Init llm group for process 1
[2023-11-17 18:22:03,362][torch.distributed.distributed_c10d][INFO] - Rank 0: Completed store-based barrier for key:store_based_barrier_key:2 with 2 nodes.
[2023-11-17 18:22:03,362][lamorel_logger][INFO] - Init llm group for process 0
[2023-11-17 18:22:03,362][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:3 to store for rank: 0
[2023-11-17 18:22:03,363][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:3 to store for rank: 1
[2023-11-17 18:22:03,363][torch.distributed.distributed_c10d][INFO] - Rank 1: Completed store-based barrier for key:store_based_barrier_key:3 with 2 nodes.
[2023-11-17 18:22:03,363][lamorel_logger][INFO] - Init rl-llm group for process 1
[2023-11-17 18:22:03,373][torch.distributed.distributed_c10d][INFO] - Rank 0: Completed store-based barrier for key:store_based_barrier_key:3 with 2 nodes.
[2023-11-17 18:22:03,373][lamorel_logger][INFO] - Init rl-llm group for process 0
[2023-11-17 18:22:03,384][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:4 to store for rank: 1
[2023-11-17 18:22:03,384][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:4 to store for rank: 0
[2023-11-17 18:22:03,384][torch.distributed.distributed_c10d][INFO] - Rank 1: Completed store-based barrier for key:store_based_barrier_key:4 with 2 nodes.
[2023-11-17 18:22:03,384][torch.distributed.distributed_c10d][INFO] - Rank 0: Completed store-based barrier for key:store_based_barrier_key:4 with 2 nodes.
[2023-11-17 18:22:03,385][lamorel_logger][INFO] - 2 gpus available for current LLM but using only model_parallelism_size = 1
[2023-11-17 18:22:03,385][lamorel_logger][INFO] - Devices on process 1 (index 0): [0]
Parallelising HF LLM on 1 devices
Loading model t5-small
/home/gbonetta/miniconda3/envs/lamorel_env/lib/python3.9/site-packages/gym/utils/passive_env_checker.py:165: UserWarning: WARN: The obs returned by the `reset()` method is not within the observation space.
  logger.warn(f"{pre} is not within the observation space.")
/home/gbonetta/miniconda3/envs/lamorel_env/lib/python3.9/site-packages/gym/utils/passive_env_checker.py:133: UserWarning: WARN: The obs returned by the `reset()` method should be an int or np.int64, actual type: <class 'str'>
  logger.warn(f"{pre} should be an int or np.int64, actual type: {type(obs)}")
Error executing job with overrides: ['rl_script_args.path=/data/disk1/share/gbonetta/progetti/lamorel/examples/PPO_finetuning/main.py', 'rl_script_args.output_dir=/data/disk1/share/gbonetta/progetti/lamorel/gio_experiments', 'lamorel_args.accelerate_args.machine_rank=0', 'lamorel_args.llm_args.model_path=t5-small']
Traceback (most recent call last):
  File "/data/disk1/share/gbonetta/progetti/lamorel/examples/PPO_finetuning/main.py", line 164, in main
    lm_server = Caller(config_args.lamorel_args,
  File "/data/disk1/share/gbonetta/progetti/lamorel/lamorel/src/lamorel/caller.py", line 53, in __init__
    Server(
  File "/data/disk1/share/gbonetta/progetti/lamorel/lamorel/src/lamorel/server/server.py", line 40, in __init__
    self._model = HF_LLM(config.llm_args, devices, use_cpu)
  File "/data/disk1/share/gbonetta/progetti/lamorel/lamorel/src/lamorel/server/llms/hf_llm.py", line 38, in __init__
    device_map = infer_auto_device_map(
  File "/home/gbonetta/miniconda3/envs/lamorel_env/lib/python3.9/site-packages/accelerate/utils/modeling.py", line 923, in infer_auto_device_map
    max_memory = get_max_memory(max_memory)
  File "/home/gbonetta/miniconda3/envs/lamorel_env/lib/python3.9/site-packages/accelerate/utils/modeling.py", line 674, in get_max_memory
    raise ValueError(
ValueError: Device 0 is not recognized, available devices are integers(for GPU/XPU), 'mps', 'cpu' and 'disk''

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Error executing job with overrides: ['rl_script_args.path=/data/disk1/share/gbonetta/progetti/lamorel/examples/PPO_finetuning/main.py', 'rl_script_args.output_dir=/data/disk1/share/gbonetta/progetti/lamorel/gio_experiments', 'lamorel_args.accelerate_args.machine_rank=0', 'lamorel_args.llm_args.model_path=t5-small']
Traceback (most recent call last):
  File "/data/disk1/share/gbonetta/progetti/lamorel/examples/PPO_finetuning/main.py", line 199, in main
    output = lm_server.custom_module_fns(['score', 'value'],
  File "/data/disk1/share/gbonetta/progetti/lamorel/lamorel/src/lamorel/caller.py", line 95, in custom_module_fns
    return self.__call_model(InstructionsEnum.FORWARD, True, module_function_keys=module_function_keys,
  File "/data/disk1/share/gbonetta/progetti/lamorel/lamorel/src/lamorel/caller.py", line 99, in __call_model
    dist.gather_object(
  File "/home/gbonetta/miniconda3/envs/lamorel_env/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1758, in gather_object
    all_gather(object_size_list, local_size, group=group)
  File "/home/gbonetta/miniconda3/envs/lamorel_env/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2075, in all_gather
    work.wait()
RuntimeError: [/opt/conda/conda-bld/pytorch_1659484809662/work/third_party/gloo/gloo/transport/tcp/pair.cc:598] Connection closed by peer [127.0.0.1]:15580

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
[2023-11-17 18:22:06,348][torch.distributed.elastic.multiprocessing.api][ERROR] - failed (exitcode: 1) local_rank: 0 (pid: 71611) of binary: /home/gbonetta/miniconda3/envs/lamorel_env/bin/python
Error executing job with overrides: ['rl_script_args.path=/data/disk1/share/gbonetta/progetti/lamorel/examples/PPO_finetuning/main.py', 'rl_script_args.output_dir=/data/disk1/share/gbonetta/progetti/lamorel/gio_experiments', 'lamorel_args.accelerate_args.machine_rank=0', 'lamorel_args.llm_args.model_path=t5-small']
Traceback (most recent call last):
  File "/data/disk1/share/gbonetta/progetti/lamorel/lamorel/src/lamorel_launcher/launch.py", line 46, in main
    launch_command(accelerate_args)
  File "/home/gbonetta/miniconda3/envs/lamorel_env/lib/python3.9/site-packages/accelerate/commands/launch.py", line 985, in launch_command
    multi_gpu_launcher(args)
  File "/home/gbonetta/miniconda3/envs/lamorel_env/lib/python3.9/site-packages/accelerate/commands/launch.py", line 654, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/gbonetta/miniconda3/envs/lamorel_env/lib/python3.9/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/gbonetta/miniconda3/envs/lamorel_env/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/gbonetta/miniconda3/envs/lamorel_env/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/data/disk1/share/gbonetta/progetti/lamorel/examples/PPO_finetuning/main.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-11-17_18:22:06
  host      : hltnlp-gpu-a
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 71612)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-11-17_18:22:06
  host      : hltnlp-gpu-a
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 71611)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
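For context, the failing call is accelerate's infer_auto_device_map rejecting a max_memory key: as the error message says, recent accelerate versions only accept integer GPU indices (plus 'cpu', 'disk' and 'mps') as keys. A minimal sketch of the failure mode follows; this is not lamorel's actual hf_llm.py code, and the exact key type lamorel builds is an assumption on my part.

# Illustrative sketch only, not lamorel's code; assumes a machine with at least one visible GPU.
from accelerate import infer_auto_device_map
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Presumably reproduces the ValueError above on accelerate 0.24.x (non-integer GPU key):
# infer_auto_device_map(model, max_memory={"0": "2GiB", "cpu": "8GiB"})

# Accepted form (integer GPU key):
device_map = infer_auto_device_map(model, max_memory={0: "2GiB", "cpu": "8GiB"})
print(device_map)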

My conda env contains the following packages:

conda list
# packages in environment at /home/gbonetta/miniconda3/envs/lamorel_env:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main  
_openmp_mutex             5.1                       1_gnu  
absl-py                   2.0.0                    pypi_0    pypi
accelerate                0.24.1                   pypi_0    pypi
aiohttp                   3.8.6                    pypi_0    pypi
aiosignal                 1.3.1                    pypi_0    pypi
annotated-types           0.6.0                    pypi_0    pypi
antlr4-python3-runtime    4.9.3                    pypi_0    pypi
anyio                     3.7.1                    pypi_0    pypi
appdirs                   1.4.4                    pypi_0    pypi
asttokens                 2.4.1                    pypi_0    pypi
async-timeout             4.0.3                    pypi_0    pypi
attrs                     23.1.0                   pypi_0    pypi
babyai                    0.1.0                     dev_0    <develop>
babyai-text               0.1.0                     dev_0    <develop>
blas                      1.0                         mkl  
blosc                     1.11.1                   pypi_0    pypi
brotli-python             1.0.9            py39h6a678d5_7  
bzip2                     1.0.8                h7b6447c_0  
ca-certificates           2023.08.22           h06a4308_0  
cachetools                5.3.2                    pypi_0    pypi
certifi                   2023.7.22        py39h06a4308_0  
cffi                      1.15.1           py39h5eee18b_3  
charset-normalizer        2.0.4              pyhd3eb1b0_0  
click                     8.1.7                    pypi_0    pypi
cloudpickle               3.0.0                    pypi_0    pypi
colorama                  0.4.6                    pypi_0    pypi
comm                      0.2.0                    pypi_0    pypi
contourpy                 1.2.0                    pypi_0    pypi
cryptography              41.0.3           py39hdda0065_0  
cudatoolkit               11.3.1               h2bc3f7f_2  
cycler                    0.12.1                   pypi_0    pypi
datasets                  2.15.0                   pypi_0    pypi
debugpy                   1.8.0                    pypi_0    pypi
decorator                 5.1.1                    pypi_0    pypi
dill                      0.3.7                    pypi_0    pypi
distro                    1.8.0                    pypi_0    pypi
docker-pycreds            0.4.0                    pypi_0    pypi
exceptiongroup            1.1.3                    pypi_0    pypi
executing                 2.0.1                    pypi_0    pypi
ffmpeg                    4.3                  hf484d3e_0    pytorch
filelock                  3.13.1                   pypi_0    pypi
fonttools                 4.44.3                   pypi_0    pypi
freetype                  2.12.1               h4a9f257_0  
frozenlist                1.4.0                    pypi_0    pypi
fsspec                    2023.10.0                pypi_0    pypi
giflib                    5.2.1                h5eee18b_3  
gitdb                     4.0.11                   pypi_0    pypi
gitpython                 3.1.40                   pypi_0    pypi
gmp                       6.2.1                h295c915_3  
gnutls                    3.6.15               he1e5248_0  
google-auth               2.23.4                   pypi_0    pypi
google-auth-oauthlib      0.4.6                    pypi_0    pypi
grpcio                    1.59.2                   pypi_0    pypi
gym                       0.26.1                   pypi_0    pypi
gym-minigrid              1.0.1                     dev_0    <develop>
gym-notices               0.0.8                    pypi_0    pypi
h11                       0.14.0                   pypi_0    pypi
httpcore                  1.0.2                    pypi_0    pypi
httpx                     0.25.1                   pypi_0    pypi
huggingface-hub           0.19.3                   pypi_0    pypi
hydra-core                1.3.2                    pypi_0    pypi
idna                      3.4              py39h06a4308_0  
imageio                   2.32.0                   pypi_0    pypi
importlib-metadata        6.8.0                    pypi_0    pypi
importlib-resources       6.1.1                    pypi_0    pypi
intel-openmp              2023.1.0         hdb19cb5_46306  
ipykernel                 6.26.0                   pypi_0    pypi
ipython                   8.17.2                   pypi_0    pypi
jedi                      0.19.1                   pypi_0    pypi
jpeg                      9e                   h5eee18b_1  
jupyter-client            8.6.0                    pypi_0    pypi
jupyter-core              5.5.0                    pypi_0    pypi
kiwisolver                1.4.5                    pypi_0    pypi
lame                      3.100                h7b6447c_0  
lamorel                   0.1                       dev_0    <develop>
lcms2                     2.12                 h3be6417_0  
ld_impl_linux-64          2.38                 h1181459_1  
lerc                      3.0                  h295c915_0  
libdeflate                1.17                 h5eee18b_1  
libffi                    3.4.4                h6a678d5_0  
libgcc-ng                 11.2.0               h1234567_1  
libgomp                   11.2.0               h1234567_1  
libiconv                  1.16                 h7f8727e_2  
libidn2                   2.3.4                h5eee18b_0  
libpng                    1.6.39               h5eee18b_0  
libstdcxx-ng              11.2.0               h1234567_1  
libtasn1                  4.19.0               h5eee18b_0  
libtiff                   4.5.1                h6a678d5_0  
libunistring              0.9.10               h27cfd23_0  
libwebp                   1.3.2                h11a3e52_0  
libwebp-base              1.3.2                h5eee18b_0  
lz4-c                     1.9.4                h6a678d5_0  
markdown                  3.5.1                    pypi_0    pypi
markupsafe                2.1.3                    pypi_0    pypi
matplotlib                3.8.1                    pypi_0    pypi
matplotlib-inline         0.1.6                    pypi_0    pypi
mkl                       2023.1.0         h213fc3f_46344  
mkl-service               2.4.0            py39h5eee18b_1  
mkl_fft                   1.3.8            py39h5eee18b_0  
mkl_random                1.2.4            py39hdb19cb5_0  
multidict                 6.0.4                    pypi_0    pypi
multiprocess              0.70.15                  pypi_0    pypi
ncurses                   6.4                  h6a678d5_0  
nest-asyncio              1.5.8                    pypi_0    pypi
nettle                    3.7.3                hbbd107a_1  
numpy                     1.26.0           py39h5f9d8c6_0  
numpy-base                1.26.0           py39hb5e798b_0  
oauthlib                  3.2.2                    pypi_0    pypi
omegaconf                 2.3.0                    pypi_0    pypi
openai                    1.3.0                    pypi_0    pypi
openh264                  2.1.1                h4ff587b_0  
openjpeg                  2.4.0                h3ad879b_0  
openssl                   3.0.12               h7f8727e_0  
packaging                 23.2                     pypi_0    pypi
pandas                    2.1.3                    pypi_0    pypi
parso                     0.8.3                    pypi_0    pypi
pexpect                   4.8.0                    pypi_0    pypi
pillow                    10.0.1           py39ha6cbd5a_0  
pip                       23.3             py39h06a4308_0  
platformdirs              4.0.0                    pypi_0    pypi
prompt-toolkit            3.0.41                   pypi_0    pypi
protobuf                  3.20.3                   pypi_0    pypi
psutil                    5.9.6                    pypi_0    pypi
ptyprocess                0.7.0                    pypi_0    pypi
pure-eval                 0.2.2                    pypi_0    pypi
pyarrow                   14.0.1                   pypi_0    pypi
pyarrow-hotfix            0.5                      pypi_0    pypi
pyasn1                    0.5.0                    pypi_0    pypi
pyasn1-modules            0.3.0                    pypi_0    pypi
pycparser                 2.21               pyhd3eb1b0_0  
pydantic                  2.5.1                    pypi_0    pypi
pydantic-core             2.14.3                   pypi_0    pypi
pygments                  2.16.1                   pypi_0    pypi
pyopenssl                 23.2.0           py39h06a4308_0  
pyparsing                 3.1.1                    pypi_0    pypi
pysocks                   1.7.1            py39h06a4308_0  
python                    3.9.18               h955ad1f_0  
python-dateutil           2.8.2                    pypi_0    pypi
pytorch                   1.12.1          py3.9_cuda11.3_cudnn8.3.2_0    pytorch
pytorch-mutex             1.0                        cuda    pytorch
pytz                      2023.3.post1             pypi_0    pypi
pyyaml                    6.0.1                    pypi_0    pypi
pyzmq                     25.1.1                   pypi_0    pypi
readline                  8.2                  h5eee18b_0  
regex                     2023.10.3                pypi_0    pypi
requests                  2.31.0           py39h06a4308_0  
requests-oauthlib         1.3.1                    pypi_0    pypi
rsa                       4.9                      pypi_0    pypi
safetensors               0.4.0                    pypi_0    pypi
scipy                     1.11.3                   pypi_0    pypi
sentencepiece             0.1.99                   pypi_0    pypi
sentry-sdk                1.35.0                   pypi_0    pypi
setproctitle              1.3.3                    pypi_0    pypi
setuptools                68.0.0           py39h06a4308_0  
six                       1.16.0                   pypi_0    pypi
smmap                     5.0.1                    pypi_0    pypi
sniffio                   1.3.0                    pypi_0    pypi
sqlite                    3.41.2               h5eee18b_0  
stack-data                0.6.3                    pypi_0    pypi
tbb                       2021.8.0             hdb19cb5_0  
tensorboard               2.7.0                    pypi_0    pypi
tensorboard-data-server   0.6.1                    pypi_0    pypi
tensorboard-plugin-wit    1.8.0                    pypi_0    pypi
tensorboardx              1.8                      pypi_0    pypi
termcolor                 2.3.0                    pypi_0    pypi
tk                        8.6.12               h1ccaba5_0  
tokenizers                0.15.0                   pypi_0    pypi
torchaudio                0.12.1               py39_cu113    pytorch
torchvision               0.13.1               py39_cu113    pytorch
tornado                   6.3.3                    pypi_0    pypi
tqdm                      4.64.0                   pypi_0    pypi
traitlets                 5.13.0                   pypi_0    pypi
transformers              4.35.2                   pypi_0    pypi
typing-extensions         4.8.0                    pypi_0    pypi
typing_extensions         4.7.1            py39h06a4308_0  
tzdata                    2023.3                   pypi_0    pypi
urllib3                   1.26.18          py39h06a4308_0  
wandb                     0.16.0                   pypi_0    pypi
wcwidth                   0.2.10                   pypi_0    pypi
werkzeug                  3.0.1                    pypi_0    pypi
wheel                     0.41.2           py39h06a4308_0  
xxhash                    3.4.1                    pypi_0    pypi
xz                        5.4.2                h5eee18b_0  
yarl                      1.9.2                    pypi_0    pypi
zipp                      3.17.0                   pypi_0    pypi
zlib                      1.2.13               h5eee18b_0  
zstd                      1.5.5                hc292b87_0  

Meanwhile, pip shows the following:

pip list
Package                 Version      Editable project location
----------------------- ------------ ------------------------------------------------------------------------------------------
absl-py                 2.0.0
accelerate              0.24.1
aiohttp                 3.8.6
aiosignal               1.3.1
annotated-types         0.6.0
antlr4-python3-runtime  4.9.3
anyio                   3.7.1
appdirs                 1.4.4
asttokens               2.4.1
async-timeout           4.0.3
attrs                   23.1.0
babyai                  0.1.0        /data/disk1/share/gbonetta/progetti/Grounding_LLMs_with_online_RL/babyai-text/babyai
babyai-text             0.1.0        /data/disk1/share/gbonetta/progetti/Grounding_LLMs_with_online_RL/babyai-text
blosc                   1.11.1
Brotli                  1.0.9
cachetools              5.3.2
certifi                 2023.7.22
cffi                    1.15.1
charset-normalizer      2.0.4
click                   8.1.7
cloudpickle             3.0.0
colorama                0.4.6
comm                    0.2.0
contourpy               1.2.0
cryptography            41.0.3
cycler                  0.12.1
datasets                2.15.0
debugpy                 1.8.0
decorator               5.1.1
dill                    0.3.7
distro                  1.8.0
docker-pycreds          0.4.0
exceptiongroup          1.1.3
executing               2.0.1
filelock                3.13.1
fonttools               4.44.3
frozenlist              1.4.0
fsspec                  2023.10.0
gitdb                   4.0.11
GitPython               3.1.40
google-auth             2.23.4
google-auth-oauthlib    0.4.6
grpcio                  1.59.2
gym                     0.26.1
gym-minigrid            1.0.1        /data/disk1/share/gbonetta/progetti/Grounding_LLMs_with_online_RL/babyai-text/gym-minigrid
gym-notices             0.0.8
h11                     0.14.0
httpcore                1.0.2
httpx                   0.25.1
huggingface-hub         0.19.3
hydra-core              1.3.2
idna                    3.4
imageio                 2.32.0
importlib-metadata      6.8.0
importlib-resources     6.1.1
ipykernel               6.26.0
ipython                 8.17.2
jedi                    0.19.1
jupyter_client          8.6.0
jupyter_core            5.5.0
kiwisolver              1.4.5
lamorel                 0.1          /data/disk1/share/gbonetta/progetti/lamorel/lamorel/src
Markdown                3.5.1
MarkupSafe              2.1.3
matplotlib              3.8.1
matplotlib-inline       0.1.6
mkl-fft                 1.3.8
mkl-random              1.2.4
mkl-service             2.4.0
multidict               6.0.4
multiprocess            0.70.15
nest-asyncio            1.5.8
numpy                   1.26.0
oauthlib                3.2.2
omegaconf               2.3.0
openai                  1.3.0
packaging               23.2
pandas                  2.1.3
parso                   0.8.3
pexpect                 4.8.0
Pillow                  10.0.1
pip                     23.3
platformdirs            4.0.0
prompt-toolkit          3.0.41
protobuf                3.20.3
psutil                  5.9.6
ptyprocess              0.7.0
pure-eval               0.2.2
pyarrow                 14.0.1
pyarrow-hotfix          0.5
pyasn1                  0.5.0
pyasn1-modules          0.3.0
pycparser               2.21
pydantic                2.5.1
pydantic_core           2.14.3
Pygments                2.16.1
pyOpenSSL               23.2.0
pyparsing               3.1.1
PySocks                 1.7.1
python-dateutil         2.8.2
pytz                    2023.3.post1
PyYAML                  6.0.1
pyzmq                   25.1.1
regex                   2023.10.3
requests                2.31.0
requests-oauthlib       1.3.1
rsa                     4.9
safetensors             0.4.0
scipy                   1.11.3
sentencepiece           0.1.99
sentry-sdk              1.35.0
setproctitle            1.3.3
setuptools              68.0.0
six                     1.16.0
smmap                   5.0.1
sniffio                 1.3.0
stack-data              0.6.3
tensorboard             2.7.0
tensorboard-data-server 0.6.1
tensorboard-plugin-wit  1.8.0
tensorboardX            1.8
termcolor               2.3.0
tokenizers              0.15.0
torch                   1.12.1
torchaudio              0.12.1
torchvision             0.13.1
tornado                 6.3.3
tqdm                    4.64.0
traitlets               5.13.0
transformers            4.35.2
typing_extensions       4.7.1
tzdata                  2023.3
urllib3                 1.26.18
wandb                   0.16.0
wcwidth                 0.2.10
Werkzeug                3.0.1
wheel                   0.41.2
xxhash                  3.4.1
yarl                    1.9.2
zipp                    3.17.0

and I am using Python 3.9.18.

The configuration I am using in local_gpu_config.yaml is:
lamorel_args:
  log_level: info
  allow_subgraph_use_whith_gradient: false
  distributed_setup_args:
    n_rl_processes: 1
    n_llm_processes: 1
  accelerate_args:
    config_file: ../configs/accelerate/default_config.yaml
    machine_rank: 0
    main_process_ip: 127.0.0.1
    num_machines: 1
  llm_args:
    model_type: seq2seq
    model_path: t5-small
    pretrained: true
    minibatch_size: 192
    pre_encode_inputs: true
    parallelism:
      use_gpu: true
      model_parallelism_size: 1
      synchronize_gpus_after_scoring: false
      empty_cuda_cache_after_scoring: false
rl_script_args:
  path: ???
  name_environment: 'BabyAI-GoToRedBall-v0'
  epochs: 2
  steps_per_epoch: 128
  minibatch_size: 64
  gradient_batch_size: 16
  ppo_epochs: 4
  lam: 0.99
  gamma: 0.99
  target_kl: 0.01
  max_ep_len: 1000
  lr: 1e-4
  entropy_coef: 0.01
  value_loss_coef: 0.5
  clip_eps: 0.2
  max_grad_norm: 0.5
  save_freq: 100
  output_dir: ???
In any case, changing the machine_rank seems to make no difference.

Do you have any suggestions on what might be happening?
Thank you!

ClementRomac (Collaborator) commented Nov 19, 2023

Hi,

It seems your pytorch version is pretty old. Could you try upgrading it?
I will update the dependencies in setup.py.
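For anyone else hitting this with a similar environment: upgrading from the conda-installed torch 1.12.1 to a recent release can be done inside the env with something like pip install --upgrade torch --index-url https://download.pytorch.org/whl/cu118, choosing the index URL that matches your local CUDA setup (the cu118 URL here is only an example, not a recommendation from the maintainers).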

ewanlee commented Nov 20, 2023

> Hi,
>
> It seems your pytorch version is pretty old. Could you try upgrading it? I will update the dependencies in setup.py.

Hello! Thank you very much for open-sourcing this project, it has been extremely helpful for me!

I encountered the same issue: ValueError: Device 0 is not recognized, available devices are integers(for GPU/XPU), 'mps', 'cpu' and 'disk'. My PyTorch version is 2.1.1, and the CUDA version is 11.8.

In addition, when I import accelerate directly in IPython and run accelerate.utils.get_max_memory(), I get normal results back.

[Screenshot of the IPython session showing the get_max_memory() output]

Is it possible that there is a strange conflict with the accelerate package during execution?
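For reference, the interactive check described above reduces to something like the sketch below (assuming a CUDA-enabled accelerate install; with no argument, get_max_memory() builds the default per-device memory map itself, which is why it succeeds here even though the lamorel run fails).

# Sketch of the IPython check; the actual values depend on the local GPUs.
from accelerate.utils import get_max_memory

print(get_max_memory())
# Expected shape of the result: integer GPU keys plus "cpu", e.g.
# {0: <free bytes on GPU 0>, 1: <free bytes on GPU 1>, 'cpu': <CPU memory budget>}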

ClementRomac self-assigned this Nov 21, 2023
ClementRomac added the bug label Nov 21, 2023
ClementRomac linked a pull request (#25) Nov 21, 2023 that will close this issue
ClementRomac (Collaborator) commented

Hi,

I managed to reproduce it locally and fixed it in this PR.
Please let me know if the PR also works for you before I merge it to the main branch.
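One way to try the PR before it is merged (assuming the repository is already cloned and installed in editable mode, as the paths above suggest) is to fetch the PR branch directly, e.g. git fetch origin pull/25/head:pr-25 && git checkout pr-25 from inside the lamorel checkout, and then re-run the launch command.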

giobin (Author) commented Nov 21, 2023

Hi,

I tried it out and it works now!
Thanks

ClementRomac (Collaborator) commented

Awesome, merging the PR and closing the issue!

ewanlee commented Nov 22, 2023

> Hi,
>
> I managed to reproduce it locally and fixed it in this PR. Please let me know if the PR also works for you before I merge it to the main branch.

This also works for me! Thank you very much :)
