
Training with distributed tensorflow #1386

Closed
jageshmaharjan opened this issue May 24, 2018 · 3 comments

Comments

@jageshmaharjan

jageshmaharjan commented May 24, 2018

@tilmankamp, @lissyx

  • Have I written custom code (as opposed to running examples on an unmodified clone of the repository): No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
  • TensorFlow installed from (our builds, or upstream TensorFlow):
  • TensorFlow version (use command below): 1.6
  • Python version: 3.6
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version: CUDA 9.0 / cuDNN 7.0
  • GPU model and memory: Tesla M60 × 4
  • Exact command to reproduce: I am trying to run training with distributed TensorFlow; the commands are below

This is my script on the parameter server (ps) host:

python -u DeepSpeech.py \
  --train_files /data/zh_data/data_thchs30/train.csv \
  --dev_files /data/zh_data/data_thchs30/dev.csv \
  --test_files /data/zh_data/data_thchs30/test.csv \
  --train_batch_size 80 \
  --dev_batch_size 80 \
  --test_batch_size 40 \
  --n_hidden 375 \
  --epoch 200 \
  --validation_step 1 \
  --early_stop True \
  --earlystop_nsteps 6 \
  --estop_mean_thresh 0.1 \
  --estop_std_thresh 0.1 \
  --dropout_rate 0.22 \
  --learning_rate 0.00095 \
  --report_count 100 \
  --use_seq_length False \
  --export_dir /data/zh_data/exportDir/distributedTf/ \
  --checkpoint_dir /data/zh_data/checkpoint/distributedCkp/ \
  --decoder_library_path /data/jugs/asr/DeepSpeech/native_client/libctc_decoder_with_kenlm.so \
  --alphabet_config_path /data/zh_data/alphabet.txt \
  --lm_binary_path /data/zh_data/zh_lm.binary \
  --lm_trie_path /data/zh_data/trie \
  --ps_hosts localhost:2233 \
  --worker_hosts localhost:2222 \
  --task_index 0 \
  --job_name ps

And the script below is for the worker machine:

python -u DeepSpeech.py \
  --train_files /data/zh_data/data_thchs30/train.csv \
  --dev_files /data/zh_data/data_thchs30/dev.csv \
  --test_files /data/zh_data/data_thchs30/test.csv \
  --train_batch_size 80 \
  --dev_batch_size 80 \
  --test_batch_size 40 \
  --n_hidden 375 \
  --epoch 200 \
  --validation_step 1 \
  --early_stop True \
  --earlystop_nsteps 6 \
  --estop_mean_thresh 0.1 \
  --estop_std_thresh 0.1 \
  --dropout_rate 0.22 \
  --learning_rate 0.00095 \
  --report_count 100 \
  --use_seq_length False \
  --export_dir /data/zh_data/exportDir/distributedTf/ \
  --checkpoint_dir /data/zh_data/checkpoint/distributedCkp/ \
  --decoder_library_path /data/jugs/asr/DeepSpeech/native_client/libctc_decoder_with_kenlm.so \
  --alphabet_config_path /data/zh_data/alphabet.txt \
  --lm_binary_path /data/zh_data/zh_lm.binary \
  --lm_trie_path /data/zh_data/trie \
  --ps_hosts localhost:2233 \
  --worker_hosts localhost:2222 \
  --task_index 0 \
  --job_name worker \
  --coord_host localhost \
  --coord_port 2501
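
(For context, the --ps_hosts / --worker_hosts / --job_name / --task_index flags above correspond to the standard TensorFlow 1.x cluster setup. The snippet below is only an illustrative sketch of that generic pattern, not the actual wiring inside DeepSpeech.py:)

import tensorflow as tf

# Build the cluster from the same host lists passed on the command line.
cluster = tf.train.ClusterSpec({
    "ps": ["localhost:2233"],
    "worker": ["localhost:2222"],
})

# Parameter server: job_name="ps", task_index=0; it only serves variables.
# Worker:           job_name="worker", task_index=0; it runs the training graph.
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# A ps process would typically just block here:
# if job_name == "ps": server.join()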

The parameter server (ps) host runs completely fine, but on the worker machine I first encounter this error:

2018-05-24 03:05:38.015799: F tensorflow/core/common_runtime/gpu/gpu_util.cc:343] CPU->GPU Memcpy failed
Aborted (core dumped)

I thought the previous process was still consuming GPU memory, so I re-ran the worker script. Then I got another error:

E OOM when allocating tensor with shape[26480,375] and type float on /job:worker/replica:0/task:0/device:GPU:1 by allocator GPU_1_bfc
E 	 [[Node: tower_1/Minimum = Minimum[T=DT_FLOAT, _device="/job:worker/replica:0/task:0/device:GPU:1"](tower_1/Relu, tower_1/Minimum/y)]]
E Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
E 
E 	 [[Node: tower_1/gradients/tower_1/MatMul_3_grad/tuple/control_dependency_1_G2079 = _Recv[client_terminated=false, recv_device="/job:worker/replica:0/task:0/device:CPU:0", send_device="/job:worker/replica:0/task:0/device:GPU:1", send_device_incarnation=-6861182159178562240, tensor_name="edge_3384_tower_1/gradients/tower_1/MatMul_3_grad/tuple/control_dependency_1", tensor_type=DT_FLOAT, _device="/job:worker/replica:0/task:0/device:CPU:0"]()]]
E Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
E 
E 
E Caused by op 'tower_1/Minimum', defined at:
E   File "DeepSpeech.py", line 1838, in <module>
E     tf.app.run()
E   File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 126, in run
E     _sys.exit(main(argv))
E   File "DeepSpeech.py", line 1820, in main
E     train(server)
E   File "DeepSpeech.py", line 1501, in train
E     results_tuple, gradients, mean_edit_distance, loss = get_tower_results(model_feeder, optimizer)
E   File "DeepSpeech.py", line 640, in get_tower_results
E     calculate_mean_edit_distance_and_loss(model_feeder, i, no_dropout if optimizer is None else dropout_rates)
E   File "DeepSpeech.py", line 521, in calculate_mean_edit_distance_and_loss
E     logits = BiRNN(batch_x, tf.to_int64(batch_seq_len), dropout)
E   File "DeepSpeech.py", line 417, in BiRNN
E     layer_1 = tf.minimum(tf.nn.relu(tf.add(tf.matmul(batch_x, h1), b1)), FLAGS.relu_clip)
E   File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/ops/gen_math_ops.py", line 4565, in minimum
E     "Minimum", x=x, y=y, name=name)
E   File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
E     op_def=op_def)
E   File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3290, in create_op
E     op_def=op_def)
E   File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1654, in __init__
E     self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access
E 
E ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[26480,375] and type float on /job:worker/replica:0/task:0/device:GPU:1 by allocator GPU_1_bfc
E 	 [[Node: tower_1/Minimum = Minimum[T=DT_FLOAT, _device="/job:worker/replica:0/task:0/device:GPU:1"](tower_1/Relu, tower_1/Minimum/y)]]
E Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
E 
E 	 [[Node: tower_1/gradients/tower_1/MatMul_3_grad/tuple/control_dependency_1_G2079 = _Recv[client_terminated=false, recv_device="/job:worker/replica:0/task:0/device:CPU:0", send_device="/job:worker/replica:0/task:0/device:GPU:1", send_device_incarnation=-6861182159178562240, tensor_name="edge_3384_tower_1/gradients/tower_1/MatMul_3_grad/tuple/control_dependency_1", tensor_type=DT_FLOAT, _device="/job:worker/replica:0/task:0/device:CPU:0"]()]]
E Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
E 
E 
Traceback (most recent call last):
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1327, in _do_call
    return fn(*args)
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1312, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1420, in _call_tf_sessionrun
    status, run_metadata)
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[26480,375] and type float on /job:worker/replica:0/task:0/device:GPU:1 by allocator GPU_1_bfc
	 [[Node: tower_1/Minimum = Minimum[T=DT_FLOAT, _device="/job:worker/replica:0/task:0/device:GPU:1"](tower_1/Relu, tower_1/Minimum/y)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

	 [[Node: tower_1/gradients/tower_1/MatMul_3_grad/tuple/control_dependency_1_G2079 = _Recv[client_terminated=false, recv_device="/job:worker/replica:0/task:0/device:CPU:0", send_device="/job:worker/replica:0/task:0/device:GPU:1", send_device_incarnation=-6861182159178562240, tensor_name="edge_3384_tower_1/gradients/tower_1/MatMul_3_grad/tuple/control_dependency_1", tensor_type=DT_FLOAT, _device="/job:worker/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[Note: the log is too long, so I have trimmed it. Further down, the worker also reports:]
E InvalidArgumentError (see above for traceback): You must feed a value for placeholder tensor 'Placeholder_5' with dtype int32
E 	 [[Node: Placeholder_5 = Placeholder[dtype=DT_INT32, shape=[], _device="/job:worker/replica:0/task:0/device:CPU:0"]()]]
E 	 [[Node: b3/read_S591_G3013 = _Recv[client_terminated=false, recv_device="/job:worker/replica:0/task:0/device:GPU:1", send_device="/job:worker/replica:0/task:0/device:GPU:3", send_device_incarnation=-8674475802652740309, tensor_name="edge_2390_b3/read_S591", tensor_type=DT_FLOAT, _device="/job:worker/replica:0/task:0/device:GPU:1"]()]]
E 
E The checkpoint in /data/zh_data/checkpoint/distributedCkp/ does not match the shapes of the model. Did you change alphabet.txt or the --n_hidden parameter between train runs using the same checkpoint dir? Try moving or removing the contents of /data/zh_data/checkpoint/distributedCkp/.
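
(As an aside, the "report_tensor_allocations_upon_oom" hint in the log above refers to TensorFlow's RunOptions. A minimal standalone sketch of enabling it, using a toy graph that mirrors the failing shape rather than DeepSpeech's actual graph:)

import tensorflow as tf

# Toy graph mirroring the shape and op of the failing node (tower_1/Minimum).
x = tf.random_normal([26480, 375])
y = tf.minimum(tf.nn.relu(x), 20.0)

# Ask TensorFlow to list live allocations if an OOM occurs during this run call.
run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)

with tf.Session() as sess:
    sess.run(y, options=run_options)
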
@kdavis-mozilla
Contributor

It looks like you are simply running out of GPU memory; your batch sizes are relatively large. Try smaller batch sizes.

@jageshmaharjan
Author

jageshmaharjan commented Jun 10, 2018

It worked just after I tried again. I should have closed this issue.
Thanks

@lock

lock bot commented Jan 3, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Jan 3, 2019