@tilmankamp, @lissyx
TensorFlow installed from (our builds, or upstream TensorFlow):
This is my host (ps) script on the parameter server:
And the script below is for the worker machine:
The host machine (ps) is running completely fine, but on the worker machine I encountered this error at first:
2018-05-24 03:05:38.015799: F tensorflow/core/common_runtime/gpu/gpu_util.cc:343] CPU->GPU Memcpy failed
Aborted (core dumped)
I thought the previous process was still consuming GPU memory, so I re-ran the worker script. Then I got another error:
E OOM when allocating tensor with shape[26480,375] and type float on /job:worker/replica:0/task:0/device:GPU:1 by allocator GPU_1_bfc
E [[Node: tower_1/Minimum = Minimum[T=DT_FLOAT, _device="/job:worker/replica:0/task:0/device:GPU:1"](tower_1/Relu, tower_1/Minimum/y)]]
E Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
E
E [[Node: tower_1/gradients/tower_1/MatMul_3_grad/tuple/control_dependency_1_G2079 = _Recv[client_terminated=false, recv_device="/job:worker/replica:0/task:0/device:CPU:0", send_device="/job:worker/replica:0/task:0/device:GPU:1", send_device_incarnation=-6861182159178562240, tensor_name="edge_3384_tower_1/gradients/tower_1/MatMul_3_grad/tuple/control_dependency_1", tensor_type=DT_FLOAT, _device="/job:worker/replica:0/task:0/device:CPU:0"]()]]
E Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
E
E
E Caused by op 'tower_1/Minimum', defined at:
E File "DeepSpeech.py", line 1838, in <module>
E tf.app.run()
E File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 126, in run
E _sys.exit(main(argv))
E File "DeepSpeech.py", line 1820, in main
E train(server)
E File "DeepSpeech.py", line 1501, in train
E results_tuple, gradients, mean_edit_distance, loss = get_tower_results(model_feeder, optimizer)
E File "DeepSpeech.py", line 640, in get_tower_results
E calculate_mean_edit_distance_and_loss(model_feeder, i, no_dropout if optimizer is None else dropout_rates)
E File "DeepSpeech.py", line 521, in calculate_mean_edit_distance_and_loss
E logits = BiRNN(batch_x, tf.to_int64(batch_seq_len), dropout)
E File "DeepSpeech.py", line 417, in BiRNN
E layer_1 = tf.minimum(tf.nn.relu(tf.add(tf.matmul(batch_x, h1), b1)), FLAGS.relu_clip)
E File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/ops/gen_math_ops.py", line 4565, in minimum
E "Minimum", x=x, y=y, name=name)
E File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
E op_def=op_def)
E File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3290, in create_op
E op_def=op_def)
E File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1654, in __init__
E self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
E
E ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[26480,375] and type float on /job:worker/replica:0/task:0/device:GPU:1 by allocator GPU_1_bfc
E [[Node: tower_1/Minimum = Minimum[T=DT_FLOAT, _device="/job:worker/replica:0/task:0/device:GPU:1"](tower_1/Relu, tower_1/Minimum/y)]]
E Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
E
E [[Node: tower_1/gradients/tower_1/MatMul_3_grad/tuple/control_dependency_1_G2079 = _Recv[client_terminated=false, recv_device="/job:worker/replica:0/task:0/device:CPU:0", send_device="/job:worker/replica:0/task:0/device:GPU:1", send_device_incarnation=-6861182159178562240, tensor_name="edge_3384_tower_1/gradients/tower_1/MatMul_3_grad/tuple/control_dependency_1", tensor_type=DT_FLOAT, _device="/job:worker/replica:0/task:0/device:CPU:0"]()]]
E Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
E
E
Traceback (most recent call last):
File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1327, in _do_call
return fn(*args)
File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1312, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1420, in _call_tf_sessionrun
status, run_metadata)
File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[26480,375] and type float on /job:worker/replica:0/task:0/device:GPU:1 by allocator GPU_1_bfc
[[Node: tower_1/Minimum = Minimum[T=DT_FLOAT, _device="/job:worker/replica:0/task:0/device:GPU:1"](tower_1/Relu, tower_1/Minimum/y)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[Node: tower_1/gradients/tower_1/MatMul_3_grad/tuple/control_dependency_1_G2079 = _Recv[client_terminated=false, recv_device="/job:worker/replica:0/task:0/device:CPU:0", send_device="/job:worker/replica:0/task:0/device:GPU:1", send_device_incarnation=-6861182159178562240, tensor_name="edge_3384_tower_1/gradients/tower_1/MatMul_3_grad/tuple/control_dependency_1", tensor_type=DT_FLOAT, _device="/job:worker/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
:
: [Note: the full log is too long, so I have trimmed part of it here.]
:
E InvalidArgumentError (see above for traceback): You must feed a value for placeholder tensor 'Placeholder_5' with dtype int32
E [[Node: Placeholder_5 = Placeholder[dtype=DT_INT32, shape=[], _device="/job:worker/replica:0/task:0/device:CPU:0"]()]]
E [[Node: b3/read_S591_G3013 = _Recv[client_terminated=false, recv_device="/job:worker/replica:0/task:0/device:GPU:1", send_device="/job:worker/replica:0/task:0/device:GPU:3", send_device_incarnation=-8674475802652740309, tensor_name="edge_2390_b3/read_S591", tensor_type=DT_FLOAT, _device="/job:worker/replica:0/task:0/device:GPU:1"]()]]
E
E The checkpoint in /data/zh_data/checkpoint/distributedCkp/ does not match the shapes of the model. Did you change alphabet.txt or the --n_hidden parameter between train runs using the same checkpoint dir? Try moving or removing the contents of /data/zh_data/checkpoint/distributedCkp/.
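For reference, the "Hint" lines above refer to TensorFlow's RunOptions. The snippet below is not DeepSpeech.py's own code, just a minimal TF 1.x sketch (the graph and feed values are placeholders) showing where report_tensor_allocations_upon_oom is switched on, together with the allow_growth option, which can help when a stale process is suspected of holding GPU memory:

import tensorflow as tf

# Ask TF to list live tensor allocations if an OOM occurs during this run
# (this is what the "Hint" lines in the log refer to).
run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)

# Let the GPU allocator grow on demand instead of grabbing all free memory up front.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

# Tiny stand-in graph; in DeepSpeech.py the fetch would be the training op / loss.
x = tf.placeholder(tf.float32, shape=[None, 375])
y = tf.minimum(tf.nn.relu(x), 20.0)  # same minimum(relu(...), clip) shape as tower_1/Minimum; 20.0 is a placeholder clip

with tf.Session(config=config) as session:
    session.run(y, feed_dict={x: [[1.0] * 375]}, options=run_options)

Separately, the last line of the log says the checkpoint in /data/zh_data/checkpoint/distributedCkp/ no longer matches the model's shapes, so moving or clearing that directory (as the message itself suggests) is worth doing before re-running.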