Device 0 is not recognized #24
Comments
Hi, it seems your PyTorch version is pretty old. Could you try upgrading it?
Hi, I managed to reproduce it locally and fixed it in this PR.
Hi, I tried it out and it works now!
Awesome, merging the PR and closing the issue!
This also works for me! Thank you very much :)
Hello!
First of all, very nice work!
I have an issue running the PPO_finetuning example: it doesn't seem to recognize the GPU device.
I'm running on this setup:
My command is the following:
```shell
python -m lamorel_launcher.launch --config-path /data/disk1/share/gbonetta/progetti/lamorel/examples/PPO_finetuning/ --config-name local_gpu_config rl_script_args.path=/data/disk1/share/gbonetta/progetti/lamorel/examples/PPO_finetuning/main.py rl_script_args.output_dir=/data/disk1/share/gbonetta/progetti/lamorel/gio_experiments lamorel_args.accelerate_args.machine_rank=0 lamorel_args.llm_args.model_path=t5-small
```
and this is the error:
my conda env contains the following packages:
While my pip shows the following:
and I am using python 3.9.18.
The configuration I am using in local_gpu_config.yaml:
```yaml
lamorel_args:
  log_level: info
  allow_subgraph_use_whith_gradient: false
  distributed_setup_args:
    n_rl_processes: 1
    n_llm_processes: 1
  accelerate_args:
    config_file: ../configs/accelerate/default_config.yaml
    machine_rank: 0
    main_process_ip: 127.0.0.1
    num_machines: 1
  llm_args:
    model_type: seq2seq
    model_path: t5-small
    pretrained: true
    minibatch_size: 192
    pre_encode_inputs: true
    parallelism:
      use_gpu: true
      model_parallelism_size: 1
      synchronize_gpus_after_scoring: false
      empty_cuda_cache_after_scoring: false
rl_script_args:
  path: ???
  name_environment: 'BabyAI-GoToRedBall-v0'
  epochs: 2
  steps_per_epoch: 128
  minibatch_size: 64
  gradient_batch_size: 16
  ppo_epochs: 4
  lam: 0.99
  gamma: 0.99
  target_kl: 0.01
  max_ep_len: 1000
  lr: 1e-4
  entropy_coef: 0.01
  value_loss_coef: 0.5
  clip_eps: 0.2
  max_grad_norm: 0.5
  save_freq: 100
  output_dir: ???
```
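As an aside for readers unfamiliar with this launch style: the command-line arguments above (e.g. `lamorel_args.llm_args.model_path=t5-small`) override nested keys in the config using dotted paths. A minimal sketch of how such a dotted lookup works over a plain nested dict; the `get_by_path` helper and the `config` excerpt are hypothetical, for illustration only, and are not part of lamorel:

```python
def get_by_path(cfg: dict, dotted: str):
    """Walk a nested dict following a dotted key path like 'a.b.c'."""
    node = cfg
    for key in dotted.split("."):
        node = node[key]  # raises KeyError if the path does not exist
    return node

# A small excerpt of the config above, written as a plain dict.
config = {
    "lamorel_args": {
        "llm_args": {
            "model_path": "t5-small",
            "parallelism": {"use_gpu": True},
        },
        "accelerate_args": {"machine_rank": 0},
    },
}

print(get_by_path(config, "lamorel_args.llm_args.model_path"))           # t5-small
print(get_by_path(config, "lamorel_args.llm_args.parallelism.use_gpu"))  # True
```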
Changing machine_rank doesn't seem to make any difference, though.
Do you have any suggestions on what might be happening?
Thank you!
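For anyone hitting a similar "Device 0 is not recognized" error: PyTorch device strings take the form `cpu` or `cuda:<index>`. A minimal, torch-free sketch of splitting such a string into a device type and index; the `parse_device` helper is hypothetical, for illustration only:

```python
def parse_device(spec: str):
    """Split a device string like 'cuda:0' or 'cpu' into (type, index)."""
    if ":" in spec:
        dev_type, idx = spec.split(":", 1)
        return dev_type, int(idx)
    return spec, None  # no explicit index, e.g. plain 'cpu'

print(parse_device("cuda:0"))  # ('cuda', 0)
print(parse_device("cpu"))     # ('cpu', None)
```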