Memory requirements #16

Open
jchook opened this issue Dec 10, 2017 · 13 comments

jchook commented Dec 10, 2017

Hello, I am attempting to run this code:

python3 experiment.py --settings_file test

But I am running out of memory (OOM error):

2017-12-09 23:17:18.540786: W tensorflow/core/common_runtime/bfc_allocator.cc:277] ***************************************************************************************************x
2017-12-09 23:17:18.540796: W tensorflow/core/framework/op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[3988,3988]
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1323, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1302, in _run_fn
    status, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[3988,3988]
	 [[Node: mul_790 = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Neg_102, add_467)]]
	 [[Node: truediv_233/_165 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_216_truediv_233", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "experiment.py", line 221, in <module>
    mmd2, that_np = sess.run(mix_rbf_mmd2_and_ratio(eval_test_real, eval_test_sample,biased=False, sigmas=sigma))
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 889, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1120, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1317, in _do_run
    options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1336, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[3988,3988]
	 [[Node: mul_790 = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Neg_102, add_467)]]
	 [[Node: truediv_233/_165 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_216_truediv_233", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Caused by op 'mul_790', defined at:
  File "experiment.py", line 221, in <module>
    mmd2, that_np = sess.run(mix_rbf_mmd2_and_ratio(eval_test_real, eval_test_sample,biased=False, sigmas=sigma))
  File "/home/jchook/dev/RGAN/mmd.py", line 71, in mix_rbf_mmd2_and_ratio
    K_XX, K_XY, K_YY, d = _mix_rbf_kernel(X, Y, sigmas, wts)
  File "/home/jchook/dev/RGAN/mmd.py", line 52, in _mix_rbf_kernel
    K_YY += wt * tf.exp(-gamma * (-2 * YY + c(Y_sqnorms) + r(Y_sqnorms)))
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/math_ops.py", line 894, in binary_op_wrapper
    return func(x, y, name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/math_ops.py", line 1117, in _mul_dispatch
    return gen_math_ops._mul(x, y, name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 2726, in _mul
    "Mul", x=x, y=y, name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[3988,3988]
	 [[Node: mul_790 = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Neg_102, add_467)]]
	 [[Node: truediv_233/_165 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_216_truediv_233", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

What are the minimum GPU memory requirements?

Collaborator

corcra commented Dec 18, 2017

Sorry for the delayed response - to give you a partial answer, we use GTX 1080s for some of the experiments, and sometimes we used the CPU (with 16-32GB of RAM).

In case it's helpful at all, the particular bit of code you're getting stuck on here originally came from this repository: https://github.com/dougalsutherland/opt-mmd

Author

jchook commented Jan 7, 2018

I tried this with a 1080 Ti (11GB of VRAM) and 32GB of RAM and am still getting an "Out of Memory" error. Here is a full output log.

Is there a parameter in the settings file I can change to reduce the memory requirements?

UPDATE

Yay fixed the issue! Here is what I did in case it helps someone else (or me again haha):

  1. Uninstall TensorFlow (previously installed via pip)
  2. Uninstall CUDA and cuDNN
  3. Re-install CUDA 8.0 and cuDNN 7.0.5 (for CUDA 8) using .deb packages. Note: I installed all 3 cuDNN packages (lib, dev, and doc), then ran all the available tests to make sure everything was installed properly.
  4. Compile/install TensorFlow from source

Some notes from my TensorFlow configuration in case it's useful:

  • On my distro I had to enter /usr/bin/python3 for my Python path
  • Told it I was using CUDA 8 and cuDNN 7.0.5
  • Used dpkg -L libcudnn7 to find out where the .deb installed cuDNN (in my case it was /usr/lib/x86_64-linux-gnu) and entered that path into the config
  • Enabled CUDA, but chose the default for most other "enable [y/N]" steps

@jchook jchook closed this as completed Jan 10, 2018
Author

jchook commented Jan 22, 2018

Dammit. Something happened on reboot that caused the problem to re-appear.

I have completely uninstalled and re-installed various versions of CUDA + cuDNN + NVIDIA drivers + TensorFlow in as many permutations as I thought might work... and I get the exact same error every time.

I wrote a custom settings file (based on the mnist example) with custom data and am also getting the same exact error right around 50 epochs. Really wish I understood this problem. I have also tried varying many of the settings.

@jchook jchook reopened this Jan 22, 2018
Collaborator

corcra commented Jan 22, 2018

What happens if you turn off all MMD-related calculations? You could do this by setting the "if" statement on this line: https://github.com/ratschlab/RGAN/blob/master/experiment.py#L188 to never be true.

Collaborator

corcra commented Jan 22, 2018

You could also vary the size of the set used in evaluation (which gets fed into the MMD calculation), which is set on this line: https://github.com/ratschlab/RGAN/blob/master/experiment.py#L75. batch_multiplier is how many batches' worth of data we want to include in the evaluation set.

The problem with reducing the evaluation set size is that it reduces the accuracy of the MMD calculation, but depending on your use case that may be an acceptable price to pay for the code actually running on your hardware. (I'm assuming based on your error log that the OOM is happening due to the MMD calculation, which is quadratic in the number of samples.)
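
To make the quadratic scaling concrete, here is a rough back-of-the-envelope sketch in plain Python. It only uses the shape and dtype from the error log above ([3988,3988], DT_FLOAT); the helper name kernel_matrix_bytes is made up for illustration and is not part of the repository. A single matrix of that shape is about 61 MiB, so the OOM presumably comes from many such intermediates being alive at once (three kernel matrices, several bandwidths, plus temporaries), which is an assumption rather than something the log proves.

```python
def kernel_matrix_bytes(n_samples, bytes_per_element=4):
    """Bytes needed for one dense n x n matrix (float32 by default)."""
    return n_samples * n_samples * bytes_per_element


# Shape taken from the error log: [3988, 3988] with DT_FLOAT (4 bytes each).
one_matrix = kernel_matrix_bytes(3988)
print(f"one 3988 x 3988 float32 matrix: {one_matrix / 2**20:.1f} MiB")

# Halving the number of samples shrinks every such matrix to a quarter.
for n in (3988, 1994, 997):
    print(f"n = {n:5d}: {kernel_matrix_bytes(n) / 2**20:7.1f} MiB per matrix")
```

The useful takeaway is the ratio rather than the absolute number: halving the evaluation set size cuts the memory for each kernel matrix by a factor of four.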

Author

jchook commented Jan 24, 2018

You are a saint! Removing the MMD calculations allowed the script to finish. Thank you.

Reducing eval_size is also working.

> The problem with reducing the evaluation set size is that it reduces the accuracy of the MMD calculation...

Does this affect training performance or only post-training evaluation?

Collaborator

corcra commented Jan 24, 2018

The MMD score is only used for evaluation, so it shouldn't affect training.

The main way it might affect you is that we use the MMD score (on the validation set) to decide when to save model parameters (https://github.com/ratschlab/RGAN/blob/master/experiment.py#L227), so without it you will default to the normal frequency, which is every 50 epochs (https://github.com/ratschlab/RGAN/blob/master/experiment.py#L273).

dmortem commented Sep 28, 2018

Hi,
@corcra
On the line https://github.com/ratschlab/RGAN/blob/master/experiment.py#L75 , what does '5000' mean? Is it the size of the validation set? If my own dataset has fewer than 5000 examples, should I change this constant?
Thanks!

Collaborator

corcra commented Sep 28, 2018

Hi @dmortem : yes, 5000 is the (approximate) size of the validation set we use to compute the MMD during training (technically, we use up to 5000 examples, because we use multiples of the batch size). So if your validation set is smaller than this, or if you just want to have cheaper (but noisier) evaluations, you can change this number.
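
As a small illustration of the "up to 5000 examples, in multiples of the batch size" point, the arithmetic can be sketched like this. The function name eval_set_size and the parameter max_examples are hypothetical; only batch_multiplier and the constant 5000 come from this thread, and the actual line in experiment.py may be written differently.

```python
def eval_set_size(batch_size, max_examples=5000):
    """Largest multiple of batch_size that does not exceed max_examples."""
    batch_multiplier = max_examples // batch_size  # whole batches that fit
    return batch_multiplier * batch_size


print(eval_set_size(28))                      # 4984
print(eval_set_size(128))                     # 4992
print(eval_set_size(28, max_examples=1000))   # 980
```

Lowering the cap (here max_examples) gives a cheaper but noisier MMD estimate, as described above.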

dmortem commented Sep 28, 2018

Thank you for your explanation! @corcra
I notice that when I train the model on my own dataset, 'mmd' and 'that' first become inf after several epochs, and then become nan. I have replaced the constant '5000' with the size of my own training set. Have you ever run into this problem?

Collaborator

corcra commented Sep 29, 2018

@dmortem It sounds like you're getting numerical issues/overflow in either the MMD calculation or the t-hat calculation. I guess it might be coming from different things, but as a first sanity check you could try checking the values of the computed kernel for strange things (e.g. look at the output of this function: https://github.com/ratschlab/RGAN/blob/master/mmd.py#L21).
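
One way to do that sanity check outside of TensorFlow is sketched below in plain Python, meant for a small subsample of rows. This is not the code from mmd.py: the names mix_rbf_kernel and kernel_report are made up, all mixture weights are assumed to be 1, and gamma is assumed to be 1 / (2 * sigma^2). With finite inputs every entry should lie between 0 and the number of sigmas, so anything outside that range, or any nan, points at the inputs rather than the kernel.

```python
import math


def mix_rbf_kernel(X, Y, sigmas):
    """Sum over sigmas of exp(-||x - y||^2 / (2 * sigma^2)) for each pair (x, y)."""
    K = []
    for x in X:
        row = []
        for y in Y:
            # Multiplication (not **) so that overflow yields inf instead of raising.
            sq_dist = sum((a - b) * (a - b) for a, b in zip(x, y))
            row.append(sum(math.exp(-sq_dist / (2.0 * s * s)) for s in sigmas))
        K.append(row)
    return K


def kernel_report(K):
    """Count non-finite entries and report the range of the finite ones."""
    flat = [v for row in K for v in row]
    finite = [v for v in flat if math.isfinite(v)]
    return {
        "n_entries": len(flat),
        "n_non_finite": len(flat) - len(finite),
        "min": min(finite) if finite else None,
        "max": max(finite) if finite else None,
    }


# Toy data: two clean samples, and one sample containing a nan.
clean = [[0.0, 0.0], [1.0, 1.0]]
broken = [[float("nan"), 0.0]]
print(kernel_report(mix_rbf_kernel(clean, clean, sigmas=[1.0, 2.0])))
print(kernel_report(mix_rbf_kernel(clean, broken, sigmas=[1.0, 2.0])))
```

If the kernel values look fine, the overflow is more likely to come from a later step, such as the division in the t-hat calculation.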

Another thing: 5000 (or whatever other constant you set it to) is referring to the validation set in our code. I guess you could use your training data at that point as well, but then you're checking how similar your generated data is to the training data, which may be overly optimistic.

dmortem commented Sep 30, 2018

Thank you @corcra , I will check the values you mentioned.

For the constant '5000', I think it should relate to the size of the training set (e.g. MNIST_train.csv), which is further divided into another 'training set', a validation set and a test set with the ratio [0.6, 0.2, 0.2]. According to the line https://github.com/ratschlab/RGAN/blob/master/experiment.py#L77, I think 5000 should be at most the size of the original training set? (In the MNIST case, 60000 or less?)
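
For reference, the split arithmetic can be sketched as follows, assuming a plain [0.6, 0.2, 0.2] partition as described in this comment. The helper split_sizes is hypothetical and the repository's own split code may round differently. Since the constant refers to the validation set (see the earlier reply), the relevant comparison is against the validation portion, not the full file.

```python
def split_sizes(n_total, proportions=(0.6, 0.2, 0.2)):
    """Return (train, validation, test) sizes for the given proportions."""
    n_train = round(n_total * proportions[0])
    n_vali = round(n_total * proportions[1])
    n_test = n_total - n_train - n_vali  # remainder, so the sizes sum to n_total
    return n_train, n_vali, n_test


for n_total in (60000, 3000):
    n_train, n_vali, n_test = split_sizes(n_total)
    usable = min(5000, n_vali)  # cannot evaluate on more than the validation set holds
    print(n_total, (n_train, n_vali, n_test), "eval examples available:", usable)
```

With 60000 rows the validation portion (12000) comfortably exceeds 5000; with 3000 rows it is only 600, so the constant would need to be lowered.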

diogofm commented Apr 12, 2019

Hi guys,
I'm trying to reproduce the paper's experiments as well.
So, I'm running it with 64GB of RAM. It was supposed to run fine without the MMD calculation work-around.
The MNIST data set isn't that big, and I still can't run it. I'm afraid of trying to run the eICU experiment and hitting the same problem.
Can you suggest anything that I can try?
Which other variables in this script influence memory usage? batch_size maybe?
I don't feel comfortable changing eval_size. If you could post a working script, I'd appreciate it.

Thanks in advance.
