1 GPU is not working, 2 GPUs out of memory #5
Comments
Seems like their experiments are done on H200s (with 141G of memory each).
I checked, and that run is 8× NVIDIA H200 running for over 2 hours, not 1 or 2 GPUs.
Same problem here. I used two A100s, encountered out-of-memory errors, and only trained on a few examples. Any recommended parameters for a small batch size with equivalent performance?
https://wandb.ai/jiayipan/TinyZero/runs/m19na0qi/overview This run should work on 2 A100s?
That is 7B though, quite a difference from 3B or 1.5B.
Try adding actor_rollout_ref.model.enable_gradient_checkpointing=True to the config?
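For reference, a minimal sketch of the overrides in question (the actor-side flag is the one suggested above; the critic-side flag is an assumption based on the full A4500 config posted later in this thread):

# Sketch only: overrides to append to the existing python3 -m verl.trainer.main_ppo command.
GRAD_CKPT_FLAGS="actor_rollout_ref.model.enable_gradient_checkpointing=True critic.model.enable_gradient_checkpointing=True"
echo "Append to your launch command: $GRAD_CKPT_FLAGS"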
If I understand correctly, actor_rollout_ref.rollout.gpu_memory_utilization controls vLLM's GPU memory. I feel you could reduce the total batch size and vLLM's GPU memory usage together to free up more room for training.
I am trying.
Now I am scaling down the batch and it is running well. I speculate that vLLM's share of GPU memory may not be sufficient for such a large training batch. If we take 0.4 of an H200's ~140G, vLLM gets approximately 56G. If we train on an 80G card and do not adjust gpu_memory_utilization or batch_size, vLLM will only have 32G of GPU memory available. Will there be a shortage of memory?
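To make that arithmetic concrete, here is a small back-of-the-envelope helper (a sketch under the assumption that vLLM pre-allocates roughly total_VRAM × gpu_memory_utilization per GPU and the remainder is what FSDP training can use; real overheads will differ):

# Hypothetical helper: estimate vLLM's reserved VRAM vs. what is left for training.
estimate_vllm_split() {
  local total_gb=$1 util=$2
  awk -v t="$total_gb" -v u="$util" 'BEGIN {
    printf "vLLM reserves ~%.0f GB, ~%.0f GB left for actor/critic\n", t*u, t*(1-u)
  }'
}
estimate_vllm_split 141 0.4   # H200: ~56 GB for vLLM, ~85 GB left
estimate_vllm_split 80  0.4   # 80 GB card: ~32 GB for vLLM, ~48 GB left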
e3df048 worked on 4× A100 40GB GPUs with a git diff.
@lokmantsui interesting, these were my charts with the countdown task.
I successfully reproduced these results several days ago on two A100s. You can check out my scripts in this fork for reference if you run into memory-related problems on A100s; the results are shown in this report.
Setting trainer.nnodes=2: it stops. Setting trainer.nnodes=1: it stops at Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00, 5.30it/s]. Who can give me some ideas on how to solve this?
The GPU is running at 100% utilization, indicating it's fully engaged in computation. However, it only uses 1.7GB of memory, suggesting the task is compute-intensive but not memory-bound.
I'm running my training on 2× A4500 with 20GB VRAM each and it seems to be working. In case anyone is interested, here is my config:
python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
data.train_files=$DATA_DIR/train.parquet \
data.val_files=$DATA_DIR/test.parquet \
data.train_batch_size=64 \
data.val_batch_size=128 \
data.max_prompt_length=256 \
data.max_response_length=1024 \
actor_rollout_ref.model.path=$BASE_MODEL \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.actor.ppo_mini_batch_size=32 \
actor_rollout_ref.actor.ppo_micro_batch_size=2 \
actor_rollout_ref.rollout.log_prob_micro_batch_size=2 \
actor_rollout_ref.rollout.tensor_model_parallel_size=$ROLLOUT_TP_SIZE \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.2 \
actor_rollout_ref.actor.fsdp_config.param_offload=False \
actor_rollout_ref.actor.fsdp_config.grad_offload=False \
actor_rollout_ref.ref.log_prob_micro_batch_size=2 \
actor_rollout_ref.ref.fsdp_config.param_offload=False \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
critic.optim.lr=1e-5 \
critic.model.path=$BASE_MODEL \
critic.ppo_micro_batch_size=2 \
critic.model.enable_gradient_checkpointing=True \
critic.model.fsdp_config.param_offload=False \
critic.model.fsdp_config.grad_offload=False \
algorithm.kl_ctrl.kl_coef=0.001 \
trainer.critic_warmup=0 \
trainer.logger=['wandb'] \
+trainer.val_before_train=False \
trainer.default_hdfs_dir=null \
trainer.n_gpus_per_node=$N_GPUS \
trainer.nnodes=1 \
trainer.save_freq=10 \
trainer.test_freq=10 \
trainer.project_name=TinyZero \
trainer.experiment_name=$EXPERIMENT_NAME \
trainer.total_epochs=15 2>&1 | tee verl_demo.log
Here is my wandb: https://wandb.ai/opsmaru/TinyZero/runs/6o812djk If the training works I'll create a PR with my config.
When I chose to use only one A800, it started working. But I still want to give the 3B model a try.
For what it is worth, this also runs on two Ampere-generation RTX 3090 GPUs, each with 24GB of VRAM. I will let it spin for a few more hours, but the 1.5B parameter model is unlikely to learn anything. https://wandb.ai/samh_aiml/TinyZero/runs/t9v9ucm2 Sam
I think GRPO doesn't give good results for 1.5B (with the config I tried). Seems like PPO is the better algorithm for 1.5B. I will try more configurations to see if I can get 1.5B, or even 3B, running with PPO.
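In case it saves someone a config hunt, the GRPO-to-PPO switch in the command above comes down to the advantage estimator (a sketch; I'm assuming verl's PPO path uses GAE together with the critic.* settings that are already in the posted config):

# Assumption: algorithm.adv_estimator=gae selects verl's PPO/GAE path,
# reusing the critic settings already present in the GRPO command.
echo "GRPO run: algorithm.adv_estimator=grpo"
echo "PPO run:  algorithm.adv_estimator=gae"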
Looking at your response length / mean, it didn't look like the training was successful. I was seeing a similar issue on smaller GPUs and smaller batch sizes: the model ends up repeating the answer in the think token after a while and remains that way for the rest of the training.
Yep, batch size does have an impact on the training. I think it might be hard on smaller GPUs. I've tried many configurations after my experiment; none seems to have resulted in a working model.
How do I assign GPUs?
Hi @zacksiri, were you able to launch the 1.5B PPO runs on 2 GPUs with 24GB VRAM? I am trying different hparams but could not launch one successfully.
I have managed to get it to run with a GTX 1060 and an RTX 3090 after a few failed attempts. I am unsure if it will learn to reason, but it is training with the following parameters. It's been running for almost two days and I have not even completed one epoch. I will be trying different things to get it to work. Here is the command I used: cat ./scripts/train_tiny_zero.sh
How do I prepare the dataset (DATA_DIR)? Where can I find the dataset?
The answer is in the README: https://github.com/Jiayi-Pan/TinyZero?tab=readme-ov-file#countdown-task You prepare the data following that link, pointing it at the DATA_DIR you specify.
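If I remember the README correctly, the Countdown preparation boils down to something like the sketch below; treat the script path and flags as assumptions and check the linked section for the exact command:

# Hedged sketch of the Countdown data preparation (verify against the README).
export DATA_DIR=./data/countdown                                # any local directory
python ./examples/data_preprocess/countdown.py --local_dir $DATA_DIR
# Afterwards, point data.train_files / data.val_files at $DATA_DIR/train.parquet and $DATA_DIR/test.parquet.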
UPDATE: I solved this problem with reference to vllm-project/vllm#4392
How do I deal with the error below on 1× A100 PCIe 80GB? I followed the instructions and got the error below. 2× A100 80GB works but runs out of memory; I guess the code defaults to multiple GPUs. The only workable setup for me is 2× A100 80GB with Qwen/Qwen2.5-1.5B. In my training, Qwen/Qwen2.5-1.5B did not give very good results, while Qwen/Qwen2.5-3B on 2× H200 gave very good training results.
Two GPUs:
export N_GPUS=2
export BASE_MODEL=Qwen/Qwen2.5-3B
export DATA_DIR=Jiayi-Pan/Countdown-Tasks-3to4
export ROLLOUT_TP_SIZE=2
export EXPERIMENT_NAME=countdown-qwen2.5-3b
export VLLM_ATTENTION_BACKEND=XFORMERS
One GPU:
export N_GPUS=1
export BASE_MODEL=Qwen/Qwen2.5-1.5B
export DATA_DIR=Jiayi-Pan/Countdown-Tasks-3to4
export EXPERIMENT_NAME=countdown-qwen2.5-1.5b
export VLLM_ATTENTION_BACKEND=XFORMERS
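For completeness, a hedged sketch of how these exports feed the launch script mentioned earlier in the thread (the script name comes from the cat ./scripts/train_tiny_zero.sh comment above; CUDA_VISIBLE_DEVICES and the tensor-parallel size of 1 are my assumptions for pinning a single-GPU run):

# Single-GPU launch sketch; the script reads $N_GPUS, $BASE_MODEL, $DATA_DIR,
# $ROLLOUT_TP_SIZE and $EXPERIMENT_NAME from the environment.
export N_GPUS=1
export ROLLOUT_TP_SIZE=1          # assumption: no tensor parallelism on one GPU
export CUDA_VISIBLE_DEVICES=0     # assumption: pin the run to a single device
bash ./scripts/train_tiny_zero.sh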