OOM during training #2907
Unanswered
yangjianxin1 asked this question in Community | Q&A
I am training a GPT2 model with the ZeRO + Gemini setup. Training starts normally and there is still spare GPU memory, but after a number of steps it fails with an OOM error. Is there any way to mitigate this, for example by running garbage collection automatically?

CUDA out of memory. Tried to allocate 506.00 MiB (GPU 0; 31.75 GiB total capacity; 28.92 GiB already allocated; 10.00 MiB free; 30.43 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
temp = optim_chunk.cpu_shard.to(get_current_device())
RuntimeError: CUDA out of memory. Tried to allocate 506.00 MiB (GPU 1; 31.75 GiB total capacity; 28.92 GiB already allocated; 14.00 MiB free; 30.43 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
0%| | 35/125002 [02:28<147:40:56, 4.25s/it]
0%| | 35/125002 [02:29<147:49:53, 4.26s/it]

Replies: 1 comment

This problem may be caused by memory fragmentation on CUDA. It is normal, since PyTorch uses a simple caching allocator. We may alleviate this problem in the future.
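The error message itself points at one mitigation: when reserved memory is much larger than allocated memory, the caching allocator is fragmenting its cache, and `max_split_size_mb` can be passed through the `PYTORCH_CUDA_ALLOC_CONF` environment variable. A minimal sketch of how that might look; the value 128 is only an example and needs tuning for the workload:

```python
# Sketch: configure the PyTorch caching allocator before any CUDA allocation.
# max_split_size_mb stops the allocator from splitting blocks larger than the
# given size, which can reduce fragmentation-related OOM.
import os

# Must take effect before the first CUDA allocation, so set it at the very top
# of the training script (or export it in the shell before launching).
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch


def report_memory(device: int = 0) -> None:
    """Print allocated vs. reserved memory; reserved >> allocated hints at fragmentation."""
    allocated = torch.cuda.memory_allocated(device) / 2**30
    reserved = torch.cuda.memory_reserved(device) / 2**30
    print(f"GPU {device}: allocated {allocated:.2f} GiB, reserved {reserved:.2f} GiB")
```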
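The "automatic garbage collection" idea from the question can also be approximated by hand. Below is a plain-PyTorch sketch, not ColossalAI API: `model`, `optimizer`, `train_loader` and `criterion` are placeholders, and with ZeRO + Gemini the backward/step calls would go through the Gemini engine instead; only the placement of the periodic cleanup is the point here.

```python
# Sketch: periodically release Python garbage and return cached CUDA blocks.
# Note: empty_cache() only returns cached, currently-unused blocks to CUDA;
# it does not compact memory that is still allocated.
import gc
import torch

EMPTY_CACHE_EVERY = 100  # steps; calling this too often hurts throughput


def train(model, optimizer, train_loader, criterion):
    for step, (inputs, labels) in enumerate(train_loader):
        inputs, labels = inputs.cuda(), labels.cuda()
        loss = criterion(model(inputs), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if step % EMPTY_CACHE_EVERY == 0:
            gc.collect()              # drop dead Python references first
            torch.cuda.empty_cache()  # hand unused cached blocks back to CUDA
```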