How to run this on 8 GPUs #56

Open

Shiki-X opened this issue Feb 10, 2025 · 7 comments

Shiki-X commented Feb 10, 2025

The scripts are below, but the program gets stuck when I run run.sh.
How can I run this code on 8 GPUs if I want to use a 7B or larger model?

(1) run.sh
export N_GPUS=8
export BASE_MODEL="/Qwen-3B-Instruct"
export DATA_DIR="/data"
export ROLLOUT_TP_SIZE=8
export EXPERIMENT_NAME=countdown-qwen2.5-3b-instruct
export VLLM_ATTENTION_BACKEND=XFORMERS
bash ./scripts/train_tiny_zero_h100_ppo.sh

(2) train_tiny_zero_h100_ppo.sh
python3 -m verl.trainer.main_ppo \
data.train_files=$DATA_DIR/train.parquet \
data.val_files=$DATA_DIR/test.parquet \
data.train_batch_size=32 \
data.val_batch_size=32 \
data.max_prompt_length=512 \
data.max_response_length=2048 \
actor_rollout_ref.model.path=$BASE_MODEL \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.actor.ppo_mini_batch_size=64 \
actor_rollout_ref.actor.ppo_micro_batch_size=4 \
actor_rollout_ref.rollout.log_prob_micro_batch_size=4 \
actor_rollout_ref.rollout.tensor_model_parallel_size=$ROLLOUT_TP_SIZE \
actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
actor_rollout_ref.ref.log_prob_micro_batch_size=2 \
critic.optim.lr=1e-5 \
critic.model.path=$BASE_MODEL \
critic.ppo_micro_batch_size=4 \
algorithm.kl_ctrl.kl_coef=0.001 \
trainer.logger=['wandb'] \
+trainer.val_before_train=False \
trainer.default_hdfs_dir=null \
trainer.n_gpus_per_node=$N_GPUS \
trainer.nnodes=1 \
trainer.save_freq=10 \
trainer.test_freq=10 \
trainer.project_name=TinyZero \
trainer.experiment_name=$EXPERIMENT_NAME \
trainer.total_epochs=15 2>&1 | tee verl_demo.log
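
For what it's worth, a couple of quick pre-flight checks can rule out obvious mismatches before launching. This is a minimal sketch that only uses the variables exported in run.sh above; it is not part of the original scripts.

# Pre-flight sanity checks (sketch).
# The rollout tensor-parallel size generally has to divide the per-node GPU count evenly.
if (( N_GPUS % ROLLOUT_TP_SIZE != 0 )); then
    echo "ROLLOUT_TP_SIZE=$ROLLOUT_TP_SIZE does not evenly divide N_GPUS=$N_GPUS" >&2
    exit 1
fi
# Confirm all GPUs are visible and not already occupied by another process.
nvidia-smi --query-gpu=index,memory.used,utilization.gpu --format=csv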

@zijianh4

Similar problem: when I tried to use 8 L40 GPUs to train the Qwen2.5-7B model with GRPO, it gets stuck right after wandb is initialized, and the GPU utilization curves just drop to values close to 0, as in the screenshot below.

[Screenshot: GPU utilization curves dropping to near zero]

My configuration is like this:

#!/bin/bash

export N_GPUS=8
export BASE_MODEL=Qwen/Qwen2.5-7B
export ROLLOUT_TP_SIZE=2
export EXPERIMENT_NAME=all_math-qwen2.5-7b_grpo
export VLLM_ATTENTION_BACKEND=XFORMERS

bash ./scripts/train_tiny_zero_all_math_grpo.sh

and

gsm8k_train_path=/data/TinyZero_gsm8k/train.parquet
gsm8k_test_path=/data/TinyZero_gsm8k/test.parquet
math_train_path=/data/TinyZero_math/train.parquet
math_test_path=/data/TinyZero_math/test.parquet
AIME_train_path=/data/TinyZero_AIME/train.parquet
AIME_test_path=/data/TinyZero_AIME/test.parquet

train_files="['$gsm8k_train_path', '$math_train_path', '$AIME_train_path']"
test_files="['$gsm8k_test_path', '$math_test_path', '$AIME_test_path']"

python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
data.train_files="$train_files" \
data.val_files="$test_files" \
data.train_batch_size=128 \
data.val_batch_size=640 \
data.max_prompt_length=2048 \
data.max_response_length=1024 \
actor_rollout_ref.model.path=$BASE_MODEL \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.ppo_mini_batch_size=64 \
actor_rollout_ref.actor.ppo_micro_batch_size=4 \
actor_rollout_ref.actor.use_kl_loss=True \
actor_rollout_ref.actor.kl_loss_coef=0.001 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.fsdp_config.param_offload=False \
actor_rollout_ref.actor.fsdp_config.grad_offload=False \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
actor_rollout_ref.rollout.log_prob_micro_batch_size=4 \
actor_rollout_ref.rollout.tensor_model_parallel_size=$ROLLOUT_TP_SIZE \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.2 \
actor_rollout_ref.rollout.n=5 \
actor_rollout_ref.ref.log_prob_micro_batch_size=4 \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
algorithm.kl_ctrl.kl_coef=0.001 \
trainer.critic_warmup=0 \
trainer.logger=['wandb'] \
+trainer.val_before_train=False \
trainer.default_hdfs_dir=null \
trainer.n_gpus_per_node=$N_GPUS \
trainer.nnodes=1 \
trainer.save_freq=10 \
trainer.test_freq=10 \
trainer.project_name=all_math \
trainer.experiment_name=$EXPERIMENT_NAME \
trainer.total_epochs=15 2>&1 | tee verl_demo.log

The same problem also happens when training on countdown with GRPO.
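
One cheap way to narrow down where the hang happens is to re-run with verbose distributed logging. Below is a sketch; NCCL_DEBUG and NCCL_P2P_DISABLE are generic NCCL environment variables rather than TinyZero/verl settings, the P2P toggle is only a diagnostic to try when not all 8 GPUs share a fast peer-to-peer path, and the log file name is arbitrary.

export NCCL_DEBUG=INFO        # print NCCL init/collective logs to see which rank stalls
export NCCL_P2P_DISABLE=1     # diagnostic only: check whether peer-to-peer transfers are involved
export VLLM_ATTENTION_BACKEND=XFORMERS
bash ./scripts/train_tiny_zero_all_math_grpo.sh 2>&1 | tee hang_debug.log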

AstonyJ commented Feb 12, 2025

I encountered the same issue. Have you solved it?

@yuleiqin

> I encountered the same issue. Have you solved it?

Same problem; 4 GPUs work, 8 GPUs get stuck.

AstonyJ commented Feb 12, 2025

2 GPUs work; 4 and 8 GPUs get stuck.

@cpchenpi

I didn't look into the details too much, but modifying these two parameters works in my environment (8 × 80 GB GPUs):

actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
actor_rollout_ref.ref.log_prob_micro_batch_size=8 \
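
If it helps, one way to apply these on top of the script from the first post is to patch the two corresponding lines in place. This is just a sketch; the old values 0.4 and 2 are the ones appearing in train_tiny_zero_h100_ppo.sh above.

sed -i \
    -e 's/gpu_memory_utilization=0.4/gpu_memory_utilization=0.6/' \
    -e 's/ref.log_prob_micro_batch_size=2/ref.log_prob_micro_batch_size=8/' \
    ./scripts/train_tiny_zero_h100_ppo.sh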

@yenanjing

Same problem; 2 and 4 GPUs work, 8 GPUs get stuck.

@yenanjing

> I didn't look into the details too much, but modifying these two parameters works in my environment (8 × 80 GB GPUs):
>
> actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
> actor_rollout_ref.ref.log_prob_micro_batch_size=8 \

It works on 8 GPUs, thanks!
