@tilmankamp, @lissyx
TensorFlow installed from (our builds, or upstream TensorFlow):
This is my host (ps) script on the parameter server:
And the script below is for the worker machine:
The host machine (ps) is running completely fine, but on the worker machine I encountered this error at first:
2018-05-24 03:05:38.015799: F tensorflow/core/common_runtime/gpu/gpu_util.cc:343] CPU->GPU Memcpy failed
Aborted (core dumped)
I thought the previous process was still consuming GPU memory, so I re-ran the worker script. Then I got another error:
E OOM when allocating tensor with shape[26480,375] and type float on /job:worker/replica:0/task:0/device:GPU:1 by allocator GPU_1_bfc
E [[Node: tower_1/Minimum = Minimum[T=DT_FLOAT, _device="/job:worker/replica:0/task:0/device:GPU:1"](tower_1/Relu, tower_1/Minimum/y)]]
E Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
E
E [[Node: tower_1/gradients/tower_1/MatMul_3_grad/tuple/control_dependency_1_G2079 = _Recv[client_terminated=false, recv_device="/job:worker/replica:0/task:0/device:CPU:0", send_device="/job:worker/replica:0/task:0/device:GPU:1", send_device_incarnation=-6861182159178562240, tensor_name="edge_3384_tower_1/gradients/tower_1/MatMul_3_grad/tuple/control_dependency_1", tensor_type=DT_FLOAT, _device="/job:worker/replica:0/task:0/device:CPU:0"]()]]
E Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
E
E
E Caused by op 'tower_1/Minimum', defined at:
E File "DeepSpeech.py", line 1838, in <module>
E tf.app.run()
E File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 126, in run
E _sys.exit(main(argv))
E File "DeepSpeech.py", line 1820, in main
E train(server)
E File "DeepSpeech.py", line 1501, in train
E results_tuple, gradients, mean_edit_distance, loss = get_tower_results(model_feeder, optimizer)
E File "DeepSpeech.py", line 640, in get_tower_results
E calculate_mean_edit_distance_and_loss(model_feeder, i, no_dropout if optimizer is None else dropout_rates)
E File "DeepSpeech.py", line 521, in calculate_mean_edit_distance_and_loss
E logits = BiRNN(batch_x, tf.to_int64(batch_seq_len), dropout)
E File "DeepSpeech.py", line 417, in BiRNN
E layer_1 = tf.minimum(tf.nn.relu(tf.add(tf.matmul(batch_x, h1), b1)), FLAGS.relu_clip)
E File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/ops/gen_math_ops.py", line 4565, in minimum
E "Minimum", x=x, y=y, name=name)
E File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
E op_def=op_def)
E File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3290, in create_op
E op_def=op_def)
E File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1654, in __init__
E self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
E
E ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[26480,375] and type float on /job:worker/replica:0/task:0/device:GPU:1 by allocator GPU_1_bfc
E [[Node: tower_1/Minimum = Minimum[T=DT_FLOAT, _device="/job:worker/replica:0/task:0/device:GPU:1"](tower_1/Relu, tower_1/Minimum/y)]]
E Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
E
E [[Node: tower_1/gradients/tower_1/MatMul_3_grad/tuple/control_dependency_1_G2079 = _Recv[client_terminated=false, recv_device="/job:worker/replica:0/task:0/device:CPU:0", send_device="/job:worker/replica:0/task:0/device:GPU:1", send_device_incarnation=-6861182159178562240, tensor_name="edge_3384_tower_1/gradients/tower_1/MatMul_3_grad/tuple/control_dependency_1", tensor_type=DT_FLOAT, _device="/job:worker/replica:0/task:0/device:CPU:0"]()]]
E Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
E
E
Traceback (most recent call last):
File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1327, in _do_call
return fn(*args)
File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1312, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1420, in _call_tf_sessionrun
status, run_metadata)
File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[26480,375] and type float on /job:worker/replica:0/task:0/device:GPU:1 by allocator GPU_1_bfc
[[Node: tower_1/Minimum = Minimum[T=DT_FLOAT, _device="/job:worker/replica:0/task:0/device:GPU:1"](tower_1/Relu, tower_1/Minimum/y)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[Node: tower_1/gradients/tower_1/MatMul_3_grad/tuple/control_dependency_1_G2079 = _Recv[client_terminated=false, recv_device="/job:worker/replica:0/task:0/device:CPU:0", send_device="/job:worker/replica:0/task:0/device:GPU:1", send_device_incarnation=-6861182159178562240, tensor_name="edge_3384_tower_1/gradients/tower_1/MatMul_3_grad/tuple/control_dependency_1", tensor_type=DT_FLOAT, _device="/job:worker/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
:
: [Note: the full log is too long, so I have trimmed part of it here.]
:
E InvalidArgumentError (see above for traceback): You must feed a value for placeholder tensor 'Placeholder_5' with dtype int32
E [[Node: Placeholder_5 = Placeholder[dtype=DT_INT32, shape=[], _device="/job:worker/replica:0/task:0/device:CPU:0"]()]]
E [[Node: b3/read_S591_G3013 = _Recv[client_terminated=false, recv_device="/job:worker/replica:0/task:0/device:GPU:1", send_device="/job:worker/replica:0/task:0/device:GPU:3", send_device_incarnation=-8674475802652740309, tensor_name="edge_2390_b3/read_S591", tensor_type=DT_FLOAT, _device="/job:worker/replica:0/task:0/device:GPU:1"]()]]
E
E The checkpoint in /data/zh_data/checkpoint/distributedCkp/ does not match the shapes of the model. Did you change alphabet.txt or the --n_hidden parameter between train runs using the same checkpoint dir? Try moving or removing the contents of /data/zh_data/checkpoint/distributedCkp/.
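For reference, the "Hint" lines above refer to TensorFlow's RunOptions. The snippet below is not DeepSpeech.py's own code, just a minimal TF 1.x sketch (the graph and feed values are placeholders) showing where report_tensor_allocations_upon_oom is switched on, together with the allow_growth option, which can help when a stale process is suspected of holding GPU memory:

import tensorflow as tf

# Ask TF to list live tensor allocations if an OOM occurs during this run
# (this is what the "Hint" lines in the log refer to).
run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)

# Let the GPU allocator grow on demand instead of grabbing all free memory up front.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

# Tiny stand-in graph; in DeepSpeech.py the fetch would be the training op / loss.
x = tf.placeholder(tf.float32, shape=[None, 375])
y = tf.minimum(tf.nn.relu(x), 20.0)  # same minimum(relu(...), clip) shape as tower_1/Minimum; 20.0 is a placeholder clip

with tf.Session(config=config) as session:
    session.run(y, feed_dict={x: [[1.0] * 375]}, options=run_options)

Separately, the last line of the log says the checkpoint in /data/zh_data/checkpoint/distributedCkp/ no longer matches the model's shapes, so moving or clearing that directory (as the message itself suggests) is worth doing before re-running.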