
1 GPU is not working, 2 GPUs out of memory #5

Open
deter3 opened this issue Jan 25, 2025 · 29 comments

deter3 commented Jan 25, 2025

How do I deal with the error below? Setup: 1x A100 PCIe 80 GB. I followed the instructions and got the error below. 2x A100 80 GB runs but goes out of memory. I guess the code defaults to multiple GPUs. The only working setup I found is 2x A100 80 GB for Qwen/Qwen2.5-1.5B. In my training, Qwen/Qwen2.5-1.5B does not give very good results, while Qwen/Qwen2.5-3B on 2x H200 trains very well.

Two GPUs:
export N_GPUS=2
export BASE_MODEL=Qwen/Qwen2.5-3B
export DATA_DIR=Jiayi-Pan/Countdown-Tasks-3to4
export ROLLOUT_TP_SIZE=2
export EXPERIMENT_NAME=countdown-qwen2.5-3b
export VLLM_ATTENTION_BACKEND=XFORMERS

One GPU:
export N_GPUS=1
export BASE_MODEL=Qwen/Qwen2.5-1.5B
export DATA_DIR=Jiayi-Pan/Countdown-Tasks-3to4
export EXPERIMENT_NAME=countdown-qwen2.5-1.5b
export VLLM_ATTENTION_BACKEND=XFORMERS

Actor use_remove_padding=False
Error executing job with overrides: ['data.train_files=Jiayi-Pan/Countdown-Tasks-3to4/train.parquet', 'data.val_files=Jiayi-Pan/Countdown-Tasks-3to4/test.parquet', 'data.train_batch_size=256', 'data.val_batch_size=1312', 'data.max_prompt_length=256', 'data.max_response_length=1024', 'actor_rollout_ref.model.path=Qwen/Qwen2.5-1.5B', 'actor_rollout_ref.actor.optim.lr=1e-6', 'actor_rollout_ref.actor.ppo_mini_batch_size=128', 'actor_rollout_ref.actor.ppo_micro_batch_size=8', 'actor_rollout_ref.rollout.log_prob_micro_batch_size=8', 'actor_rollout_ref.rollout.tensor_model_parallel_size=', 'actor_rollout_ref.rollout.gpu_memory_utilization=0.4', 'actor_rollout_ref.ref.log_prob_micro_batch_size=4', 'critic.optim.lr=1e-5', 'critic.model.path=Qwen/Qwen2.5-1.5B', 'critic.ppo_micro_batch_size=8', 'algorithm.kl_ctrl.kl_coef=0.001', 'trainer.logger=[wandb]', '+trainer.val_before_train=False', 'trainer.default_hdfs_dir=null', 'trainer.n_gpus_per_node=1', 'trainer.nnodes=1', 'trainer.save_freq=100', 'trainer.test_freq=100', 'trainer.project_name=TinyZero', 'trainer.experiment_name=countdown-qwen2.5-1.5b', 'trainer.total_epochs=15']
Traceback (most recent call last):
  File "/workspace/TinyZero/verl/trainer/main_ppo.py", line 103, in main
    ray.get(main_task.remote(config))
  File "/root/miniconda3/envs/zero/lib/python3.9/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/root/miniconda3/envs/zero/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/root/miniconda3/envs/zero/lib/python3.9/site-packages/ray/_private/worker.py", line 2772, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/root/miniconda3/envs/zero/lib/python3.9/site-packages/ray/_private/worker.py", line 919, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TypeError): ray::main_task() (pid=2641, ip=172.19.0.2)
  File "/workspace/TinyZero/verl/trainer/main_ppo.py", line 188, in main_task
    trainer.init_workers()
  File "/workspace/TinyZero/verl/trainer/ppo/ray_trainer.py", line 514, in init_workers
    self.actor_rollout_wg.init_model()
  File "/workspace/TinyZero/verl/single_controller/ray/base.py", line 42, in func
    output = ray.get(output)
ray.exceptions.RayTaskError(TypeError): ray::WorkerDict.actor_rollout_init_model() (pid=2892, ip=172.19.0.2, actor_id=9b3727d88709f75f8ee9f78401000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0x7a94a9f27d30>)
  File "/workspace/TinyZero/verl/single_controller/ray/base.py", line 399, in func
    return getattr(self.worker_dict[key], name)(*args, **kwargs)
  File "/workspace/TinyZero/verl/single_controller/base/decorator.py", line 404, in inner
    return func(*args, **kwargs)
  File "/workspace/TinyZero/verl/workers/fsdp_workers.py", line 332, in init_model
    self.rollout, self.rollout_sharding_manager = self._build_rollout()
  File "/workspace/TinyZero/verl/workers/fsdp_workers.py", line 254, in _build_rollout
    dp = self.world_size // infer_tp
TypeError: unsupported operand type(s) for //: 'int' and 'str'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
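
The TypeError at the bottom of this trace likely comes from the one-GPU setup above not exporting ROLLOUT_TP_SIZE: the override actor_rollout_ref.rollout.tensor_model_parallel_size= then arrives as an empty string, and dp = self.world_size // infer_tp divides an int by a str. A minimal sketch of the missing export for a single GPU (this addresses the type error only; the out-of-memory question is separate):

export ROLLOUT_TP_SIZE=1  # rollout tensor-parallel size; must be set so the Hydra override is an integer rather than ''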
deter3 changed the title from "unsupported operand type(s) for //: 'int' and 'str'" to "1 gpu is not working , 2 gpus out of memory" on Jan 25, 2025
@rucnyz

rucnyz commented Jan 25, 2025

Seems like their experiments are done on H200 (with 141G memory)
https://wandb.ai/jiayipan/TinyZero/runs/31q05grn/overview

@deter3

deter3 commented Jan 25, 2025

Seems like their experiments are done on H200 (with 141G memory) https://wandb.ai/jiayipan/TinyZero/runs/31q05grn/overview

I checked, and that 8x NVIDIA H200 run has been running for 2 more hours, not 1 or 2 GPUs.

@JerryWu-code

Same problem here. I used two A100s and hit out-of-memory after training on only a few examples. Does anyone have recommended parameters for a smaller batch size with equivalent performance?

@JackCloudman

JackCloudman commented Jan 25, 2025

I'm testing with 2xH200 and apparently it's working
Edit: 3B version
Image

@deter3

deter3 commented Jan 25, 2025

I'm testing with 2xH200 and apparently it's working Edit: 3B version Image

The 3B version on 2x H200 is working.

@RobertMcCarthy97

https://wandb.ai/jiayipan/TinyZero/runs/m19na0qi/overview

This run should work on 2 A100s?

@Benjoyo

Benjoyo commented Jan 25, 2025

Seems like their experiments are done on H200 (with 141G memory) https://wandb.ai/jiayipan/TinyZero/runs/31q05grn/overview

I checked, and that 8x NVIDIA H200 run has been running for 2 more hours, not 1 or 2 GPUs.

That is 7B though, quite a difference from 3B or 1.5B.

@ZihanWang314

Try adding actor_rollout_ref.model.enable_gradient_checkpointing=True to the config?
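
A minimal sketch of that change against scripts/train_tiny_zero.sh, with the flag inserted between the existing model.path and optim.lr overrides (placement is illustrative; any position among the actor_rollout_ref overrides should work):

actor_rollout_ref.model.path=$BASE_MODEL \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.optim.lr=1e-6 \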

@jiangchengchengark

If I understand correctly, actor_rollout_ref.rollout.gpu_memory_utilization controls vLLM's share of GPU memory. I think reducing both the total batch size and vLLM's memory share would free up more room for training.

@jiangchengchengark

I am trying this now.

@jiangchengchengark

jiangchengchengark commented Jan 26, 2025

Now I am scaling down the batch size and it is running well. I suspect vLLM's share of GPU memory may not be enough for such a large training batch. On an H200's 141 GB at 0.4, vLLM takes up approximately 56 GB. If we train on an 80 GB card without adjusting gpu_memory_utilization or batch_size, vLLM only gets 32 GB of memory; won't that run short?
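
For reference, the arithmetic behind that estimate (gpu_memory_utilization is the fraction of each GPU that vLLM pre-allocates for its weights and KV cache):

0.4 × 141 GB (H200) ≈ 56 GB reserved by vLLM, leaving roughly 85 GB per GPU for the FSDP actor/critic, optimizer states, and activations
0.4 × 80 GB (A100) = 32 GB reserved by vLLM, leaving roughly 48 GB per GPU for everything else

So with the stock settings an 80 GB card has far less headroom on both sides, which is why lowering the batch size and/or gpu_memory_utilization helps.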

@lokmantsui

Commit e3df048 worked on 4x A100 40GB GPUs with this git diff:

diff --git a/scripts/export.txt b/scripts/export.txt
new file mode 100644
index 0000000..d1502c8
--- /dev/null
+++ b/scripts/export.txt
@@ -0,0 +1,6 @@
+export N_GPUS=4
+export BASE_MODEL=Qwen/Qwen2.5-3B
+export DATA_DIR=~/data/countdown
+export ROLLOUT_TP_SIZE=4
+export EXPERIMENT_NAME=countdown-qwen2.5-3b
+export VLLM_ATTENTION_BACKEND=XFORMERS
\ No newline at end of file
diff --git a/scripts/train_tiny_zero.sh b/scripts/train_tiny_zero.sh
index 3b2e01c..4de91d6 100644
--- a/scripts/train_tiny_zero.sh
+++ b/scripts/train_tiny_zero.sh
@@ -8,14 +8,14 @@ data.max_response_length=1024 \
 actor_rollout_ref.model.path=$BASE_MODEL \
 actor_rollout_ref.actor.optim.lr=1e-6 \
 actor_rollout_ref.actor.ppo_mini_batch_size=128 \
-actor_rollout_ref.actor.ppo_micro_batch_size=8 \
+actor_rollout_ref.actor.ppo_micro_batch_size=4 \
 actor_rollout_ref.rollout.log_prob_micro_batch_size=8 \
 actor_rollout_ref.rollout.tensor_model_parallel_size=$ROLLOUT_TP_SIZE \
-actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
+actor_rollout_ref.rollout.gpu_memory_utilization=0.2 \
 actor_rollout_ref.ref.log_prob_micro_batch_size=4 \
 critic.optim.lr=1e-5 \
 critic.model.path=$BASE_MODEL \
-critic.ppo_micro_batch_size=8 \
+critic.ppo_micro_batch_size=4 \
 algorithm.kl_ctrl.kl_coef=0.001 \
 trainer.logger=['wandb'] \
 +trainer.val_before_train=False \

Image

@JackCloudman

@lokmantsui interesting, these were my charts for the countdown task:

Image

@JerryWu-code

I successfully reproduced these results several days ago on two A100s. You can check out my scripts in this fork for reference if you run into memory-related problems on A100s; the results are shown in this report ~

Image Image

@chenlinzhe

With trainer.nnodes=2, it stops at:

(main_task pid=28594) from vllm.version import version as VLLM_VERSION

With trainer.nnodes=1:

Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 5.30it/s]
(WorkerDict pid=3900) Some weights of Qwen2ForTokenClassification were not initialized from the model checkpoint at /root/autodl-tmp/TinyZero-main/model/Qwen2.5-3B-Instruct and are newly initialized: ['score.bias']
(WorkerDict pid=3900) You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
(WorkerDict pid=3637) Some weights of Qwen2ForTokenClassification were not initialized from the model checkpoint at /root/autodl-tmp/TinyZero-main/model/Qwen2.5-3B-Instruct and are newly initialized: ['score.bias', 'score.weight']

Who can give me some ideas on how to solve this?

@chenlinzhe

The GPU is running at 100% utilization, indicating it's fully engaged in computation. However, it only uses 1.7GB of memory, suggesting the task is compute-intensive but not memory-bound.

Image

@zacksiri

zacksiri commented Feb 1, 2025

I'm running my training on 2x A4500 with 20GB VRAM each and it seems to be working.

In case anyone is interested here is my config:

python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
data.train_files=$DATA_DIR/train.parquet \
data.val_files=$DATA_DIR/test.parquet \
data.train_batch_size=64 \
data.val_batch_size=128 \
data.max_prompt_length=256 \
data.max_response_length=1024 \
actor_rollout_ref.model.path=$BASE_MODEL \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.actor.ppo_mini_batch_size=32 \
actor_rollout_ref.actor.ppo_micro_batch_size=2 \
actor_rollout_ref.rollout.log_prob_micro_batch_size=2 \
actor_rollout_ref.rollout.tensor_model_parallel_size=$ROLLOUT_TP_SIZE \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.2 \
actor_rollout_ref.actor.fsdp_config.param_offload=False \
actor_rollout_ref.actor.fsdp_config.grad_offload=False \
actor_rollout_ref.ref.log_prob_micro_batch_size=2 \
actor_rollout_ref.ref.fsdp_config.param_offload=False \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
critic.optim.lr=1e-5 \
critic.model.path=$BASE_MODEL \
critic.ppo_micro_batch_size=2 \
critic.model.enable_gradient_checkpointing=True \
critic.model.fsdp_config.param_offload=False \
critic.model.fsdp_config.grad_offload=False \
algorithm.kl_ctrl.kl_coef=0.001 \
trainer.critic_warmup=0 \
trainer.logger=['wandb'] \
+trainer.val_before_train=False \
trainer.default_hdfs_dir=null \
trainer.n_gpus_per_node=$N_GPUS \
trainer.nnodes=1 \
trainer.save_freq=10 \
trainer.test_freq=10 \
trainer.project_name=TinyZero \
trainer.experiment_name=$EXPERIMENT_NAME \
trainer.total_epochs=15 2>&1 | tee verl_demo.log

Here is my wandb: https://wandb.ai/opsmaru/TinyZero/runs/6o812djk

If the training works I'll create a PR with my config.
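
For completeness, a sketch of the environment variables the command above assumes; the values below are my guesses for the 2x A4500 setup and are not stated in the post, apart from DATA_DIR pointing at the prepared countdown parquet files:

export N_GPUS=2
export ROLLOUT_TP_SIZE=2              # assumed: split the rollout across both 20 GB cards
export BASE_MODEL=Qwen/Qwen2.5-1.5B   # assumed from the later comment about 1.5B GRPO
export DATA_DIR=~/data/countdown      # wherever the countdown parquet files were written
export EXPERIMENT_NAME=countdown-qwen2.5-1.5b-grpo
export VLLM_ATTENTION_BACKEND=XFORMERS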

@chenlinzhe

When I chose to use only one A800, it started working. But I still want to give the 3B model a try.

Image

@samhodge-aiml

samhodge-aiml commented Feb 2, 2025

Re @zacksiri's 2x A4500 config above:

For what it is worth, this also runs on two Ampere-generation RTX 3090 GPUs, each with 24 GB of VRAM.

I will let it spin for a few more hours, but the 1.5B parameter model is unlikely to learn anything

https://wandb.ai/samh_aiml/TinyZero/runs/t9v9ucm2

Sam

@zacksiri

zacksiri commented Feb 2, 2025

I think GRPO doesn't give good results for 1.5B (with the config I tried).

It seems PPO is the better algorithm for 1.5B.

I will try more configurations to see if I can get 1.5B running with PPO, or even 3B.

@Manto

Manto commented Feb 2, 2025

Re @zacksiri's 2x A4500 config above:

Looking at your response length / mean, it didn't look like the training was successful. I was seeing a similar issue on smaller GPUs and smaller batch sizes: the model ends up repeating the answer inside the think token after a while and stays that way for the rest of the training.

@chenlinzhe

Who can solve this problem?

@zacksiri

zacksiri commented Feb 7, 2025

Re @Manto's point above about response length:

Yeap, batch size does have an impact on the training. I think it might be hard on smaller GPUs. I've tried many configurations after my experiment, and none seems to have resulted in a working model.

@lonelydancer

How do I assign a specific GPU?
export CUDA_VISIBLE_DEVICES=4 does not seem to work.
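
One thing to check, assuming no Ray cluster was started beforehand: set the variable in the same shell before launching, so the Ray workers started by the trainer inherit it (if a Ray cluster is already running via ray start, it has to be restarted with the variable set). A minimal sketch:

export CUDA_VISIBLE_DEVICES=4
export N_GPUS=1
export ROLLOUT_TP_SIZE=1
bash ./scripts/train_tiny_zero.sh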

@Leiay

Leiay commented Feb 7, 2025

Hi @zacksiri, were you able to launch 1.5B PPO runs on 2 GPUs with 24 GB VRAM? I am trying different hparams but could not launch one successfully.

@Kruisheer

I have managed to get it to run with a GTX 1060 and an RTX 3090 after a few failed attempts. I am unsure if it will learn to reason, but it is training with the following parameters. It's been running almost two days and I have not even completed one epoch. I will be trying different things to get it to work. Here is the command I used.

cat ./scripts/train_tiny_zero.sh
python3 -m verl.trainer.main_ppo \
data.train_files="${DATA_DIR}/train.parquet" \
data.val_files="${DATA_DIR}/test.parquet" \
data.train_batch_size=32 \
data.val_batch_size=64 \
data.max_prompt_length=150 \
data.max_response_length=564 \
actor_rollout_ref.model.path="$BASE_MODEL" \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.actor.ppo_mini_batch_size=16 \
actor_rollout_ref.actor.ppo_micro_batch_size=1 \
actor_rollout_ref.rollout.log_prob_micro_batch_size=1 \
actor_rollout_ref.rollout.tensor_model_parallel_size="$ROLLOUT_TP_SIZE" \
actor_rollout_ref.rollout.gpu_memory_utilization=0.2 \
actor_rollout_ref.rollout.dtype=half \
actor_rollout_ref.ref.log_prob_micro_batch_size=1 \
critic.optim.lr=1e-5 \
critic.model.path="$BASE_MODEL" \
critic.ppo_micro_batch_size=1 \
algorithm.kl_ctrl.kl_coef=0.001 \
trainer.logger=['wandb'] \
+trainer.val_before_train=False \
trainer.default_hdfs_dir=null \
trainer.n_gpus_per_node=1 \
trainer.nnodes=1 \
trainer.save_freq=10 \
trainer.test_freq=10 \
trainer.project_name="TinyZero" \
trainer.experiment_name="$EXPERIMENT_NAME" \
trainer.total_epochs=15

@ArlanCooper

How do I prepare the dataset (DATA_DIR)? Where can I find the dataset?

@samhodge-aiml

How do I prepare the dataset (DATA_DIR)? Where can I find the dataset?

The answer is in the README:

https://github.com/Jiayi-Pan/TinyZero?tab=readme-ov-file#countdown-task

You prepare the data with the script linked there; the --local_dir you pass becomes your DATA_DIR for the training task.
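
A minimal sketch of the preparation step, assuming the preprocessing script path shown in the README (check the README for the exact invocation):

python ./examples/data_preprocess/countdown.py --local_dir ~/data/countdown
export DATA_DIR=~/data/countdown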

@yuleiqin

Re the original 1-GPU TypeError / 2-GPU OOM report above, and the follow-up:

I switched to a different machine and it worked without problems. But I got an OOM error on Qwen 3B even with 8x H100.

Same here; OOM even with 8x H20 on a 3B model.

UPDATE: I solved this problem with reference to vllm-project/vllm#4392:

pip3 install nvidia-cublas-cu12==12.3.4.1
