
1 GPU is not working, 2 GPUs out of memory #5

Open
deter3 opened this issue Jan 25, 2025 · 29 comments

deter3 commented Jan 25, 2025

How do I deal with the error below? Setup: 1x A100 PCIe 80 GB. I followed the instructions and got the error below. 2x A100 80 GB runs but goes out of memory. I guess the code defaults to multiple GPUs. The only working setup I found is 2x A100 80 GB for Qwen/Qwen2.5-1.5B. In my training, Qwen/Qwen2.5-1.5B does not give very good results, while Qwen/Qwen2.5-3B on 2x H200 trains very well.

Two GPUs:
export N_GPUS=2
export BASE_MODEL=Qwen/Qwen2.5-3B
export DATA_DIR=Jiayi-Pan/Countdown-Tasks-3to4
export ROLLOUT_TP_SIZE=2
export EXPERIMENT_NAME=countdown-qwen2.5-3b
export VLLM_ATTENTION_BACKEND=XFORMERS

One GPU:
export N_GPUS=1
export BASE_MODEL=Qwen/Qwen2.5-1.5B
export DATA_DIR=Jiayi-Pan/Countdown-Tasks-3to4
export EXPERIMENT_NAME=countdown-qwen2.5-1.5b
export VLLM_ATTENTION_BACKEND=XFORMERS

Actor use_remove_padding=False
Error executing job with overrides: ['data.train_files=Jiayi-Pan/Countdown-Tasks-3to4/train.parquet', 'data.val_files=Jiayi-Pan/Countdown-Tasks-3to4/test.parquet', 'data.train_batch_size=256', 'data.val_batch_size=1312', 'data.max_prompt_length=256', 'data.max_response_length=1024', 'actor_rollout_ref.model.path=Qwen/Qwen2.5-1.5B', 'actor_rollout_ref.actor.optim.lr=1e-6', 'actor_rollout_ref.actor.ppo_mini_batch_size=128', 'actor_rollout_ref.actor.ppo_micro_batch_size=8', 'actor_rollout_ref.rollout.log_prob_micro_batch_size=8', 'actor_rollout_ref.rollout.tensor_model_parallel_size=', 'actor_rollout_ref.rollout.gpu_memory_utilization=0.4', 'actor_rollout_ref.ref.log_prob_micro_batch_size=4', 'critic.optim.lr=1e-5', 'critic.model.path=Qwen/Qwen2.5-1.5B', 'critic.ppo_micro_batch_size=8', 'algorithm.kl_ctrl.kl_coef=0.001', 'trainer.logger=[wandb]', '+trainer.val_before_train=False', 'trainer.default_hdfs_dir=null', 'trainer.n_gpus_per_node=1', 'trainer.nnodes=1', 'trainer.save_freq=100', 'trainer.test_freq=100', 'trainer.project_name=TinyZero', 'trainer.experiment_name=countdown-qwen2.5-1.5b', 'trainer.total_epochs=15']
Traceback (most recent call last):
  File "/workspace/TinyZero/verl/trainer/main_ppo.py", line 103, in main
    ray.get(main_task.remote(config))
  File "/root/miniconda3/envs/zero/lib/python3.9/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/root/miniconda3/envs/zero/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/root/miniconda3/envs/zero/lib/python3.9/site-packages/ray/_private/worker.py", line 2772, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/root/miniconda3/envs/zero/lib/python3.9/site-packages/ray/_private/worker.py", line 919, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TypeError): ray::main_task() (pid=2641, ip=172.19.0.2)
  File "/workspace/TinyZero/verl/trainer/main_ppo.py", line 188, in main_task
    trainer.init_workers()
  File "/workspace/TinyZero/verl/trainer/ppo/ray_trainer.py", line 514, in init_workers
    self.actor_rollout_wg.init_model()
  File "/workspace/TinyZero/verl/single_controller/ray/base.py", line 42, in func
    output = ray.get(output)
ray.exceptions.RayTaskError(TypeError): ray::WorkerDict.actor_rollout_init_model() (pid=2892, ip=172.19.0.2, actor_id=9b3727d88709f75f8ee9f78401000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0x7a94a9f27d30>)
  File "/workspace/TinyZero/verl/single_controller/ray/base.py", line 399, in func
    return getattr(self.worker_dict[key], name)(*args, **kwargs)
  File "/workspace/TinyZero/verl/single_controller/base/decorator.py", line 404, in inner
    return func(*args, **kwargs)
  File "/workspace/TinyZero/verl/workers/fsdp_workers.py", line 332, in init_model
    self.rollout, self.rollout_sharding_manager = self._build_rollout()
  File "/workspace/TinyZero/verl/workers/fsdp_workers.py", line 254, in _build_rollout
    dp = self.world_size // infer_tp
TypeError: unsupported operand type(s) for //: 'int' and 'str'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
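
The TypeError at the bottom of this trace likely comes from the one-GPU setup above not exporting ROLLOUT_TP_SIZE: the override actor_rollout_ref.rollout.tensor_model_parallel_size= then arrives as an empty string, and dp = self.world_size // infer_tp divides an int by a str. A minimal sketch of the missing export for a single GPU (this addresses the type error only; the out-of-memory question is separate):

export ROLLOUT_TP_SIZE=1  # rollout tensor-parallel size; must be set so the Hydra override is an integer rather than ''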
deter3 changed the title from "unsupported operand type(s) for //: 'int' and 'str'" to "1 gpu is not working , 2 gpus out of memory" on Jan 25, 2025
@rucnyz

rucnyz commented Jan 25, 2025

Seems like their experiments are done on H200 (with 141G memory)
https://wandb.ai/jiayipan/TinyZero/runs/31q05grn/overview

@deter3

deter3 commented Jan 25, 2025

Seems like their experiments are done on H200 (with 141G memory) https://wandb.ai/jiayipan/TinyZero/runs/31q05grn/overview

I checked, and that 8x NVIDIA H200 run has been running for 2 more hours, not 1 or 2 GPUs.

@JerryWu-code

Same problem here. I used two A100s and hit out-of-memory after training on only a few examples. Does anyone have recommended parameters for a smaller batch size with equivalent performance?

@JackCloudman

JackCloudman commented Jan 25, 2025

I'm testing with 2xH200 and apparently it's working
Edit: 3B version
Image

@deter3

deter3 commented Jan 25, 2025

I'm testing with 2xH200 and apparently it's working Edit: 3B version Image

The 3B version on 2x H200 is working.

@RobertMcCarthy97

https://wandb.ai/jiayipan/TinyZero/runs/m19na0qi/overview

This run should work on 2 A100s?

@Benjoyo

Benjoyo commented Jan 25, 2025

Seems like their experiments are done on H200 (with 141G memory) https://wandb.ai/jiayipan/TinyZero/runs/31q05grn/overview

I checked, and that 8x NVIDIA H200 run has been running for 2 more hours, not 1 or 2 GPUs.

That is 7B though, quite a difference from 3B or 1.5B.

@ZihanWang314

Try adding actor_rollout_ref.model.enable_gradient_checkpointing=True to the config?
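
A minimal sketch of that change against scripts/train_tiny_zero.sh, with the flag inserted between the existing model.path and optim.lr overrides (placement is illustrative; any position among the actor_rollout_ref overrides should work):

actor_rollout_ref.model.path=$BASE_MODEL \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.optim.lr=1e-6 \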

@jiangchengchengark

If I understand correctly, actor_rollout_ref.rollout.gpu_memory_utilization controls vLLM's share of GPU memory. I think reducing both the total batch size and vLLM's memory share would free up more room for training.

@jiangchengchengark

I am trying this now.

@jiangchengchengark

jiangchengchengark commented Jan 26, 2025

Now I am scaling down the batch size and it is running well. I suspect vLLM's share of GPU memory may not be enough for such a large training batch. On an H200's 141 GB at 0.4, vLLM takes up approximately 56 GB. If we train on an 80 GB card without adjusting gpu_memory_utilization or batch_size, vLLM only gets 32 GB of memory; won't that run short?
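
For reference, the arithmetic behind that estimate (gpu_memory_utilization is the fraction of each GPU that vLLM pre-allocates for its weights and KV cache):

0.4 × 141 GB (H200) ≈ 56 GB reserved by vLLM, leaving roughly 85 GB per GPU for the FSDP actor/critic, optimizer states, and activations
0.4 × 80 GB (A100) = 32 GB reserved by vLLM, leaving roughly 48 GB per GPU for everything else

So with the stock settings an 80 GB card has far less headroom on both sides, which is why lowering the batch size and/or gpu_memory_utilization helps.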

@lokmantsui

Commit e3df048 worked on 4x A100 40GB GPUs with this git diff:

diff --git a/scripts/export.txt b/scripts/export.txt
new file mode 100644
index 0000000..d1502c8
--- /dev/null
+++ b/scripts/export.txt
@@ -0,0 +1,6 @@
+export N_GPUS=4
+export BASE_MODEL=Qwen/Qwen2.5-3B
+export DATA_DIR=~/data/countdown
+export ROLLOUT_TP_SIZE=4
+export EXPERIMENT_NAME=countdown-qwen2.5-3b
+export VLLM_ATTENTION_BACKEND=XFORMERS
\ No newline at end of file
diff --git a/scripts/train_tiny_zero.sh b/scripts/train_tiny_zero.sh
index 3b2e01c..4de91d6 100644
--- a/scripts/train_tiny_zero.sh
+++ b/scripts/train_tiny_zero.sh
@@ -8,14 +8,14 @@ data.max_response_length=1024 \
 actor_rollout_ref.model.path=$BASE_MODEL \
 actor_rollout_ref.actor.optim.lr=1e-6 \
 actor_rollout_ref.actor.ppo_mini_batch_size=128 \
-actor_rollout_ref.actor.ppo_micro_batch_size=8 \
+actor_rollout_ref.actor.ppo_micro_batch_size=4 \
 actor_rollout_ref.rollout.log_prob_micro_batch_size=8 \
 actor_rollout_ref.rollout.tensor_model_parallel_size=$ROLLOUT_TP_SIZE \
-actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
+actor_rollout_ref.rollout.gpu_memory_utilization=0.2 \
 actor_rollout_ref.ref.log_prob_micro_batch_size=4 \
 critic.optim.lr=1e-5 \
 critic.model.path=$BASE_MODEL \
-critic.ppo_micro_batch_size=8 \
+critic.ppo_micro_batch_size=4 \
 algorithm.kl_ctrl.kl_coef=0.001 \
 trainer.logger=['wandb'] \
 +trainer.val_before_train=False \

Image

@JackCloudman

@lokmantsui interesting, these were my charts for the countdown task:

Image

@JerryWu-code

I successfully reproduced these results several days ago on two A100s. You can check out my scripts in this fork for reference if you run into memory-related problems on A100s; the results are shown in this report ~

Image Image

@chenlinzhe

With trainer.nnodes=2, it stops at:

(main_task pid=28594) from vllm.version import version as VLLM_VERSION

With trainer.nnodes=1:

Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 5.30it/s]
(WorkerDict pid=3900) Some weights of Qwen2ForTokenClassification were not initialized from the model checkpoint at /root/autodl-tmp/TinyZero-main/model/Qwen2.5-3B-Instruct and are newly initialized: ['score.bias']
(WorkerDict pid=3900) You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
(WorkerDict pid=3637) Some weights of Qwen2ForTokenClassification were not initialized from the model checkpoint at /root/autodl-tmp/TinyZero-main/model/Qwen2.5-3B-Instruct and are newly initialized: ['score.bias', 'score.weight']

Who can give me some ideas on how to solve this?

@chenlinzhe

The GPU is running at 100% utilization, indicating it's fully engaged in computation. However, it only uses 1.7GB of memory, suggesting the task is compute-intensive but not memory-bound.

Image

@zacksiri

zacksiri commented Feb 1, 2025

I'm running my training on 2x A4500 with 20GB VRAM each and it seems to be working.

In case anyone is interested here is my config:

python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
data.train_files=$DATA_DIR/train.parquet \
data.val_files=$DATA_DIR/test.parquet \
data.train_batch_size=64 \
data.val_batch_size=128 \
data.max_prompt_length=256 \
data.max_response_length=1024 \
actor_rollout_ref.model.path=$BASE_MODEL \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.actor.ppo_mini_batch_size=32 \
actor_rollout_ref.actor.ppo_micro_batch_size=2 \
actor_rollout_ref.rollout.log_prob_micro_batch_size=2 \
actor_rollout_ref.rollout.tensor_model_parallel_size=$ROLLOUT_TP_SIZE \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.2 \
actor_rollout_ref.actor.fsdp_config.param_offload=False \
actor_rollout_ref.actor.fsdp_config.grad_offload=False \
actor_rollout_ref.ref.log_prob_micro_batch_size=2 \
actor_rollout_ref.ref.fsdp_config.param_offload=False \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
critic.optim.lr=1e-5 \
critic.model.path=$BASE_MODEL \
critic.ppo_micro_batch_size=2 \
critic.model.enable_gradient_checkpointing=True \
critic.model.fsdp_config.param_offload=False \
critic.model.fsdp_config.grad_offload=False \
algorithm.kl_ctrl.kl_coef=0.001 \
trainer.critic_warmup=0 \
trainer.logger=['wandb'] \
+trainer.val_before_train=False \
trainer.default_hdfs_dir=null \
trainer.n_gpus_per_node=$N_GPUS \
trainer.nnodes=1 \
trainer.save_freq=10 \
trainer.test_freq=10 \
trainer.project_name=TinyZero \
trainer.experiment_name=$EXPERIMENT_NAME \
trainer.total_epochs=15 2>&1 | tee verl_demo.log

Here is my wandb: https://wandb.ai/opsmaru/TinyZero/runs/6o812djk

If the training works I'll create a PR with my config.
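
For completeness, a sketch of the environment variables the command above assumes; the values below are my guesses for the 2x A4500 setup and are not stated in the post, apart from DATA_DIR pointing at the prepared countdown parquet files:

export N_GPUS=2
export ROLLOUT_TP_SIZE=2              # assumed: split the rollout across both 20 GB cards
export BASE_MODEL=Qwen/Qwen2.5-1.5B   # assumed from the later comment about 1.5B GRPO
export DATA_DIR=~/data/countdown      # wherever the countdown parquet files were written
export EXPERIMENT_NAME=countdown-qwen2.5-1.5b-grpo
export VLLM_ATTENTION_BACKEND=XFORMERS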

@chenlinzhe

When I chose to use only one A800, it started working. But I still want to give the 3B model a try.

Image

@samhodge-aiml

samhodge-aiml commented Feb 2, 2025

Re @zacksiri's 2x A4500 config above:

For what it is worth, this also runs on two Ampere-generation RTX 3090 GPUs, each with 24 GB of VRAM.

I will let it spin for a few more hours, but the 1.5B parameter model is unlikely to learn anything

https://wandb.ai/samh_aiml/TinyZero/runs/t9v9ucm2

Sam

@zacksiri

zacksiri commented Feb 2, 2025

I think GRPO doesn't give good results for 1.5B (with the config I tried).

It seems PPO is the better algorithm for 1.5B.

I will try more configurations to see if I can get 1.5B running with PPO, or even 3B.

@Manto

Manto commented Feb 2, 2025

Re @zacksiri's 2x A4500 config above:

Looking at your response length / mean, it didn't look like the training was successful. I was seeing a similar issue on smaller GPUs and smaller batch sizes: the model ends up repeating the answer inside the think token after a while and stays that way for the rest of the training.

@chenlinzhe

Who can solve this problem?

@zacksiri

zacksiri commented Feb 7, 2025

Re @Manto's point above about response length:

Yeap, batch size does have an impact on the training. I think it might be hard on smaller GPUs. I've tried many configurations after my experiment, and none seems to have resulted in a working model.

@lonelydancer

How do I assign a specific GPU?
export CUDA_VISIBLE_DEVICES=4 does not seem to work.
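
One thing to check, assuming no Ray cluster was started beforehand: set the variable in the same shell before launching, so the Ray workers started by the trainer inherit it (if a Ray cluster is already running via ray start, it has to be restarted with the variable set). A minimal sketch:

export CUDA_VISIBLE_DEVICES=4
export N_GPUS=1
export ROLLOUT_TP_SIZE=1
bash ./scripts/train_tiny_zero.sh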

@Leiay

Leiay commented Feb 7, 2025

Hi @zacksiri, were you able to launch 1.5B PPO runs on 2 GPUs with 24 GB VRAM? I am trying different hparams but could not launch one successfully.

@Kruisheer

I have managed to get it to run with a GTX 1060 and an RTX 3090 after a few failed attempts. I am unsure if it will learn to reason, but it is training with the following parameters. It's been running almost two days and I have not even completed one epoch. I will be trying different things to get it to work. Here is the command I used.

cat ./scripts/train_tiny_zero.sh
python3 -m verl.trainer.main_ppo \
data.train_files="${DATA_DIR}/train.parquet" \
data.val_files="${DATA_DIR}/test.parquet" \
data.train_batch_size=32 \
data.val_batch_size=64 \
data.max_prompt_length=150 \
data.max_response_length=564 \
actor_rollout_ref.model.path="$BASE_MODEL" \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.actor.ppo_mini_batch_size=16 \
actor_rollout_ref.actor.ppo_micro_batch_size=1 \
actor_rollout_ref.rollout.log_prob_micro_batch_size=1 \
actor_rollout_ref.rollout.tensor_model_parallel_size="$ROLLOUT_TP_SIZE" \
actor_rollout_ref.rollout.gpu_memory_utilization=0.2 \
actor_rollout_ref.rollout.dtype=half \
actor_rollout_ref.ref.log_prob_micro_batch_size=1 \
critic.optim.lr=1e-5 \
critic.model.path="$BASE_MODEL" \
critic.ppo_micro_batch_size=1 \
algorithm.kl_ctrl.kl_coef=0.001 \
trainer.logger=['wandb'] \
+trainer.val_before_train=False \
trainer.default_hdfs_dir=null \
trainer.n_gpus_per_node=1 \
trainer.nnodes=1 \
trainer.save_freq=10 \
trainer.test_freq=10 \
trainer.project_name="TinyZero" \
trainer.experiment_name="$EXPERIMENT_NAME" \
trainer.total_epochs=15

@ArlanCooper

How do I prepare the dataset (DATA_DIR)? Where can I find the dataset?

@samhodge-aiml

How do I prepare the dataset (DATA_DIR)? Where can I find the dataset?

The answer is in the README:

https://github.com/Jiayi-Pan/TinyZero?tab=readme-ov-file#countdown-task

You prepare the data with the script linked there; the --local_dir you pass becomes your DATA_DIR for the training task.
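
A minimal sketch of the preparation step, assuming the preprocessing script path shown in the README (check the README for the exact invocation):

python ./examples/data_preprocess/countdown.py --local_dir ~/data/countdown
export DATA_DIR=~/data/countdown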

@yuleiqin

Re the original 1-GPU TypeError / 2-GPU OOM report above, and the follow-up:

I switched to a different machine and it worked without problems. But I got an OOM error on Qwen 3B even with 8x H100.

Same here; OOM even with 8x H20 on a 3B model.

UPDATE: I solved this problem with reference to vllm-project/vllm#4392:

pip3 install nvidia-cublas-cu12==12.3.4.1
