How to run this on 8 GPUs #56

Open

Shiki-X opened this issue Feb 10, 2025 · 7 comments

Shiki-X commented Feb 10, 2025

The scripts are below, but the program gets stuck when I run run.sh.
How can I run this code on 8 GPUs if I want to use a 7B or larger model?

(1) run.sh
export N_GPUS=8
export BASE_MODEL="/Qwen-3B-Instruct"
export DATA_DIR="/data"
export ROLLOUT_TP_SIZE=8
export EXPERIMENT_NAME=countdown-qwen2.5-3b-instruct
export VLLM_ATTENTION_BACKEND=XFORMERS
bash ./scripts/train_tiny_zero_h100_ppo.sh

(2) train_tiny_zero_h100_ppo.sh
python3 -m verl.trainer.main_ppo \
data.train_files=$DATA_DIR/train.parquet \
data.val_files=$DATA_DIR/test.parquet \
data.train_batch_size=32 \
data.val_batch_size=32 \
data.max_prompt_length=512 \
data.max_response_length=2048 \
actor_rollout_ref.model.path=$BASE_MODEL \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.actor.ppo_mini_batch_size=64 \
actor_rollout_ref.actor.ppo_micro_batch_size=4 \
actor_rollout_ref.rollout.log_prob_micro_batch_size=4 \
actor_rollout_ref.rollout.tensor_model_parallel_size=$ROLLOUT_TP_SIZE \
actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
actor_rollout_ref.ref.log_prob_micro_batch_size=2 \
critic.optim.lr=1e-5 \
critic.model.path=$BASE_MODEL \
critic.ppo_micro_batch_size=4 \
algorithm.kl_ctrl.kl_coef=0.001 \
trainer.logger=['wandb'] \
+trainer.val_before_train=False \
trainer.default_hdfs_dir=null \
trainer.n_gpus_per_node=$N_GPUS \
trainer.nnodes=1 \
trainer.save_freq=10 \
trainer.test_freq=10 \
trainer.project_name=TinyZero \
trainer.experiment_name=$EXPERIMENT_NAME \
trainer.total_epochs=15 2>&1 | tee verl_demo.log
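
For what it's worth, a couple of quick pre-flight checks can rule out obvious mismatches before launching. This is a minimal sketch that only uses the variables exported in run.sh above; it is not part of the original scripts.

# Pre-flight sanity checks (sketch).
# The rollout tensor-parallel size generally has to divide the per-node GPU count evenly.
if (( N_GPUS % ROLLOUT_TP_SIZE != 0 )); then
    echo "ROLLOUT_TP_SIZE=$ROLLOUT_TP_SIZE does not evenly divide N_GPUS=$N_GPUS" >&2
    exit 1
fi
# Confirm all GPUs are visible and not already occupied by another process.
nvidia-smi --query-gpu=index,memory.used,utilization.gpu --format=csv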

@zijianh4

Similar problem: when I tried to use 8 L40 GPUs to train the Qwen2.5-7B model with GRPO, it gets stuck right after wandb is initialized, and the GPU utilization curves just drop to values close to 0, as in the screenshot below.

[Screenshot: GPU utilization curves dropping to near zero]

My configuration is like this:

#!/bin/bash

export N_GPUS=8
export BASE_MODEL=Qwen/Qwen2.5-7B
export ROLLOUT_TP_SIZE=2
export EXPERIMENT_NAME=all_math-qwen2.5-7b_grpo
export VLLM_ATTENTION_BACKEND=XFORMERS

bash ./scripts/train_tiny_zero_all_math_grpo.sh

and

gsm8k_train_path=/data/TinyZero_gsm8k/train.parquet
gsm8k_test_path=/data/TinyZero_gsm8k/test.parquet
math_train_path=/data/TinyZero_math/train.parquet
math_test_path=/data/TinyZero_math/test.parquet
AIME_train_path=/data/TinyZero_AIME/train.parquet
AIME_test_path=/data/TinyZero_AIME/test.parquet

train_files="['$gsm8k_train_path', '$math_train_path', '$AIME_train_path']"
test_files="['$gsm8k_test_path', '$math_test_path', '$AIME_test_path']"

python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
data.train_files="$train_files" \
data.val_files="$test_files" \
data.train_batch_size=128 \
data.val_batch_size=640 \
data.max_prompt_length=2048 \
data.max_response_length=1024 \
actor_rollout_ref.model.path=$BASE_MODEL \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.ppo_mini_batch_size=64 \
actor_rollout_ref.actor.ppo_micro_batch_size=4 \
actor_rollout_ref.actor.use_kl_loss=True \
actor_rollout_ref.actor.kl_loss_coef=0.001 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.fsdp_config.param_offload=False \
actor_rollout_ref.actor.fsdp_config.grad_offload=False \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
actor_rollout_ref.rollout.log_prob_micro_batch_size=4 \
actor_rollout_ref.rollout.tensor_model_parallel_size=$ROLLOUT_TP_SIZE \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.2 \
actor_rollout_ref.rollout.n=5 \
actor_rollout_ref.ref.log_prob_micro_batch_size=4 \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
algorithm.kl_ctrl.kl_coef=0.001 \
trainer.critic_warmup=0 \
trainer.logger=['wandb'] \
+trainer.val_before_train=False \
trainer.default_hdfs_dir=null \
trainer.n_gpus_per_node=$N_GPUS \
trainer.nnodes=1 \
trainer.save_freq=10 \
trainer.test_freq=10 \
trainer.project_name=all_math \
trainer.experiment_name=$EXPERIMENT_NAME \
trainer.total_epochs=15 2>&1 | tee verl_demo.log

The same problem also happens when training on countdown with GRPO.
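
One cheap way to narrow down where the hang happens is to re-run with verbose distributed logging. Below is a sketch; NCCL_DEBUG and NCCL_P2P_DISABLE are generic NCCL environment variables rather than TinyZero/verl settings, the P2P toggle is only a diagnostic to try when not all 8 GPUs share a fast peer-to-peer path, and the log file name is arbitrary.

export NCCL_DEBUG=INFO        # print NCCL init/collective logs to see which rank stalls
export NCCL_P2P_DISABLE=1     # diagnostic only: check whether peer-to-peer transfers are involved
export VLLM_ATTENTION_BACKEND=XFORMERS
bash ./scripts/train_tiny_zero_all_math_grpo.sh 2>&1 | tee hang_debug.log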

AstonyJ commented Feb 12, 2025

I encountered the same issue. Have you solved it?

@yuleiqin

> I encountered the same issue. Have you solved it?

Same problem; 4 GPUs work, 8 GPUs get stuck.

AstonyJ commented Feb 12, 2025

2 GPUs work; 4 and 8 GPUs get stuck.

@cpchenpi

I didn't look into the details too much, but modifying these two parameters works in my environment (8 × 80 GB GPUs):

actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
actor_rollout_ref.ref.log_prob_micro_batch_size=8 \
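
If it helps, one way to apply these on top of the script from the first post is to patch the two corresponding lines in place. This is just a sketch; the old values 0.4 and 2 are the ones appearing in train_tiny_zero_h100_ppo.sh above.

sed -i \
    -e 's/gpu_memory_utilization=0.4/gpu_memory_utilization=0.6/' \
    -e 's/ref.log_prob_micro_batch_size=2/ref.log_prob_micro_batch_size=8/' \
    ./scripts/train_tiny_zero_h100_ppo.sh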

@yenanjing

Same problem; 2 and 4 GPUs work, 8 GPUs get stuck.

@yenanjing

> I didn't look into the details too much, but modifying these two parameters works in my environment (8 × 80 GB GPUs):
>
> actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
> actor_rollout_ref.ref.log_prob_micro_batch_size=8 \

It works on 8 GPUs, thanks!
