Good day, I would just like to ask if you have any idea why I am running into CUDA out-of-memory errors during training. The error occurs at the end of the first epoch (epoch 0). For reference, I am just trying to reproduce the results in REPRODUCE_RESULTS.md with the smaller dataset (annotation-small.json).
My configuration is:
OS: Windows 10 (Anaconda Prompt)
GPU: GeForce GTX 1070Ti (single)
torch version: 1.0.1
The error stack is as follows:
2019-03-22 14-23-05 steps >>> epoch 0 average batch time: 0:00:00.7
2019-03-22 14-23-06 steps >>> epoch 0 batch 411 sum: 1.74406
2019-03-22 14-23-07 steps >>> epoch 0 batch 412 sum: 2.26457
2019-03-22 14-23-07 steps >>> epoch 0 batch 413 sum: 1.95351
2019-03-22 14-23-08 steps >>> epoch 0 batch 414 sum: 2.39538
2019-03-22 14-23-09 steps >>> epoch 0 batch 415 sum: 1.83759
2019-03-22 14-23-10 steps >>> epoch 0 batch 416 sum: 1.92264
2019-03-22 14-23-10 steps >>> epoch 0 batch 417 sum: 1.71246
2019-03-22 14-23-11 steps >>> epoch 0 batch 418 sum: 2.32141
2019-03-22 14-23-11 steps >>> epoch 0 sum: 2.18943
neptune: Executing in Offline Mode.
B:\ML Models\src\utils.py:30: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
config = yaml.load(f)
[the three lines above are repeated eight times in the log]
B:\ML Models\src\callbacks.py:168: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
X = Variable(X, volatile=True).cuda()
Traceback (most recent call last):
File "main.py", line 93, in <module>
main()
File "C:\Users\AIC-WS1\Anaconda3\envs\neptune.ml\lib\site-packages\click\core.py", line 722, in __call__
return self.main(*args, **kwargs)
File "C:\Users\AIC-WS1\Anaconda3\envs\neptune.ml\lib\site-packages\click\core.py", line 697, in main
rv = self.invoke(ctx)
File "C:\Users\AIC-WS1\Anaconda3\envs\neptune.ml\lib\site-packages\click\core.py", line 1066, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "C:\Users\AIC-WS1\Anaconda3\envs\neptune.ml\lib\site-packages\click\core.py", line 895, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "C:\Users\AIC-WS1\Anaconda3\envs\neptune.ml\lib\site-packages\click\core.py", line 535, in invoke
return callback(*args, **kwargs)
File "main.py", line 31, in train
pipeline_manager.train(pipeline_name, dev_mode)
File "B:\ML Models\src\pipeline_manager.py", line 32, in train
train(pipeline_name, dev_mode, self.logger, self.params, self.seed)
File "B:\ML Models\src\pipeline_manager.py", line 116, in train
pipeline.fit_transform(data)
File "B:\ML Models\src\steps\base.py", line 106, in fit_transform
step_inputs[input_step.name] = input_step.fit_transform(data)
File "B:\ML Models\src\steps\base.py", line 106, in fit_transform
step_inputs[input_step.name] = input_step.fit_transform(data)
File "B:\ML Models\src\steps\base.py", line 106, in fit_transform
step_inputs[input_step.name] = input_step.fit_transform(data)
[Previous line repeated 4 more times]
File "B:\ML Models\src\steps\base.py", line 112, in fit_transform
return self._cached_fit_transform(step_inputs)
File "B:\ML Models\src\steps\base.py", line 123, in _cached_fit_transform
step_output_data = self.transformer.fit_transform(**step_inputs)
File "B:\ML Models\src\steps\base.py", line 262, in fit_transform
self.fit(*args, **kwargs)
File "B:\ML Models\src\models.py", line 82, in fit
self.callbacks.on_epoch_end()
File "B:\ML Models\src\steps\pytorch\callbacks.py", line 92, in on_epoch_end
callback.on_epoch_end(*args, **kwargs)
File "B:\ML Models\src\steps\pytorch\callbacks.py", line 163, in on_epoch_end
val_loss = self.get_validation_loss()
File "B:\ML Models\src\callbacks.py", line 132, in get_validation_loss
return self._get_validation_loss()
File "B:\ML Models\src\callbacks.py", line 138, in _get_validation_loss
outputs = self._transform()
File "B:\ML Models\src\callbacks.py", line 172, in _transform
outputs_batch = self.model(X)
File "C:\Users\AIC-WS1\Anaconda3\envs\neptune.ml\lib\site-packages\torch\nn\modules\module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "C:\Users\AIC-WS1\Anaconda3\envs\neptune.ml\lib\site-packages\torch\nn\parallel\data_parallel.py", line 141, in forward
return self.module(*inputs[0], **kwargs[0])
File "C:\Users\AIC-WS1\Anaconda3\envs\neptune.ml\lib\site-packages\torch\nn\modules\module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "B:\ML Models\src\unet_models.py", line 387, in forward
conv2 = self.conv2(conv1)
File "C:\Users\AIC-WS1\Anaconda3\envs\neptune.ml\lib\site-packages\torch\nn\modules\module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "C:\Users\AIC-WS1\Anaconda3\envs\neptune.ml\lib\site-packages\torch\nn\modules\container.py", line 92, in forward
input = module(input)
File "C:\Users\AIC-WS1\Anaconda3\envs\neptune.ml\lib\site-packages\torch\nn\modules\module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "C:\Users\AIC-WS1\Anaconda3\envs\neptune.ml\lib\site-packages\torchvision\models\resnet.py", line 88, in forward
out = self.bn3(out)
File "C:\Users\AIC-WS1\Anaconda3\envs\neptune.ml\lib\site-packages\torch\nn\modules\module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "C:\Users\AIC-WS1\Anaconda3\envs\neptune.ml\lib\site-packages\torch\nn\modules\batchnorm.py", line 76, in forward
exponential_average_factor, self.eps)
File "C:\Users\AIC-WS1\Anaconda3\envs\neptune.ml\lib\site-packages\torch\nn\functional.py", line 1623, in batch_norm
training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: CUDA out of memory. Tried to allocate 80.00 MiB (GPU 0; 8.00 GiB total capacity; 6.18 GiB already allocated; 56.00 MiB free; 48.95 MiB cached)
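The UserWarning from callbacks.py:168 above suggests the validation pass still wraps inputs in Variable(..., volatile=True), which no longer disables autograd in torch 1.0, so activations for the whole validation set may be kept alive when get_validation_loss() runs at the end of epoch 0. A minimal sketch of the torch.no_grad() replacement is below; the loop structure, `datagen`, and the output handling are placeholders, not the repo's exact code:

```python
import torch

def validation_forward(model, datagen):
    """Validation forward pass without building the autograd graph (sketch only)."""
    model.eval()
    outputs = []
    with torch.no_grad():              # replaces the removed volatile=True flag
        for X, _ in datagen:
            X = X.cuda()               # no Variable() wrapper needed in torch >= 0.4
            outputs.append(model(X).cpu())
    model.train()
    return outputs
```

If that is indeed the cause, validation should then peak at roughly the same memory as training instead of adding several GB on top of it.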
Lowering the batch size from the default 20 to 10 decreased GPU memory usage from ~6GB to ~4GB, but at the end of epoch 0 usage increased back to ~6GB, and subsequent epochs have continued training at ~6GB.
Is this behavior expected? I read somewhere that you also used GTX 1070 GPUs for training, so I thought I would be able to run training at the default batch size. Also, is it normal for GPU memory usage to increase between epochs 0 and 1? Thank you!
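To check whether memory actually grows between epochs, rather than just being cached by the allocator, torch 1.0 exposes simple counters; a small helper (the name and call sites are mine, not the repo's) could look like this:

```python
import torch

def log_gpu_memory(tag):
    """Print allocated / cached CUDA memory in MiB (available in torch 1.0.x)."""
    mb = 1024 ** 2
    print("{}: allocated={:.0f} MiB, max_allocated={:.0f} MiB, cached={:.0f} MiB".format(
        tag,
        torch.cuda.memory_allocated() / mb,
        torch.cuda.max_memory_allocated() / mb,
        torch.cuda.memory_cached() / mb))

# e.g. call at the end of each epoch, before and after validation:
# log_gpu_memory("epoch {} pre-validation".format(epoch))
```

Note that memory_cached() reports memory held by PyTorch's caching allocator, which nvidia-smi counts as "used" even though it is reusable, so nvidia-smi alone can overstate real growth between epochs.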
I have the same issue. After the first epoch I get:
RuntimeError: CUDA out of memory. Tried to allocate 40.00 MiB (GPU 0; 8.00 GiB total capacity; 6.43 GiB already allocated; 0 bytes free; 6.53 GiB reserved in total by PyTorch)
I am running the mapping challenge dataset.
I have experimented with varying batch sizes and numbers of workers, but the problem occurs regardless of the settings.
Update: significantly reducing the batch size (from 20 to 8) has solved the issue for me.
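If the smaller batch size hurts convergence, one generic workaround (not part of this repo's pipeline) is gradient accumulation: run small micro-batches and only step the optimizer every few of them, so the effective batch size stays at 20 while peak memory matches the micro-batch. A self-contained sketch with a stand-in model and random data:

```python
import torch
import torch.nn as nn

# Gradient-accumulation sketch: micro-batches of 5, optimizer step every 4
# of them, so the effective batch size is 20 while peak memory matches 5.
model = nn.Linear(16, 2).cuda()              # stand-in for the U-Net
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters())
accumulation_steps = 4

optimizer.zero_grad()
for i in range(8):                           # stand-in for iterating the DataLoader
    X = torch.randn(5, 16).cuda()            # micro-batch of 5
    y = torch.randint(0, 2, (5,)).cuda()
    loss = criterion(model(X), y) / accumulation_steps
    loss.backward()                          # gradients accumulate across micro-batches
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```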