Replies: 9 comments
>>> Siddhesh_Patil
[February 3, 2021, 12:25pm]
We have started training our model using the command below:
python3 DeepSpeech.py --train_files .../clips/train.csv
We are using an 8 GB GPU, and its utilization stays at 100%. We are training on approximately 12,000 audio files; the average length of the 16 kHz recordings is 60 seconds.
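For reference, a fuller invocation would normally also pass dev/test CSVs and an explicit checkpoint directory. This is only a sketch: the dev/test CSV names, epoch count, and batch size below are placeholder assumptions, not our actual values (the checkpoint directory is the default one that appears in the log below).
# dev/test CSV paths, epoch count, and batch size are illustrative placeholders
python3 DeepSpeech.py \
  --train_files .../clips/train.csv \
  --dev_files .../clips/dev.csv \
  --test_files .../clips/test.csv \
  --checkpoint_dir /home/ubuntu/.local/share/deepspeech/checkpoints \
  --epochs 30 \
  --train_batch_size 1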
Here are a few of the initial logged lines:
there must be at least one NUMA node, so returning NUMA node zero
2021-02-01 09:45:06.567988: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 0 with properties:
name: Tesla M60 major: 5 minor: 2 memoryClockRate(GHz): 1.1775
pciBusID: 0000:00:1e.0
2021-02-01 09:45:06.568049: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-02-01 09:45:06.568082: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-02-01 09:45:06.568103: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-02-01 09:45:06.568132: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-02-01 09:45:06.568152: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-02-01 09:45:06.568184: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-02-01 09:45:06.568210: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-02-01 09:45:06.568353: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-01 09:45:06.569026: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-01 09:45:06.569596: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1767] Adding visible gpu devices: 0
2021-02-01 09:45:06.569642: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1180] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-02-01 09:45:06.569664: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1186] 0
2021-02-01 09:45:06.569674: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 0: N
2021-02-01 09:45:06.569798: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-01 09:45:06.570423: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-01 09:45:06.571006: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7171 MB memory) -> physical GPU (device: 0, name: Tesla M60, pci bus id: 0000:00:1e.0, compute capability: 5.2)
WARNING:tensorflow:From /home/ubuntu/DeepSpeech/DeepSpeech/training/deepspeech_training/util/checkpoints.py:71: Variable.load (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Prefer Variable.assign which has equivalent behavior in 2.X.
W0201 09:45:06.574804 139693934528320 deprecation.py:323] From /home/ubuntu/DeepSpeech/DeepSpeech/training/deepspeech_training/util/checkpoints.py:71: Variable.load (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Prefer Variable.assign which has equivalent behavior in 2.X.
D Session opened.
I Loading best validating checkpoint from /home/ubuntu/.local/share/deepspeech/checkpoints/best_dev-85345
I Loading variable from checkpoint: beta1_power
I Loading variable from checkpoint: beta2_power
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel/Adam
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel/Adam_1
I Loading variable from checkpoint: global_step
I Loading variable from checkpoint: layer_1/bias
I Loading variable from checkpoint: layer_1/bias/Adam
I Loading variable from checkpoint: layer_1/bias/Adam_1
I Loading variable from checkpoint: layer_1/weights
I Loading variable from checkpoint: layer_1/weights/Adam
I Loading variable from checkpoint: layer_1/weights/Adam_1
I Loading variable from checkpoint: layer_2/bias
I Loading variable from checkpoint: layer_2/bias/Adam
I Loading variable from checkpoint: layer_2/bias/Adam_1
I Loading variable from checkpoint: layer_2/weights
I Loading variable from checkpoint: layer_2/weights/Adam
I Loading variable from checkpoint: layer_2/weights/Adam_1
I Loading variable from checkpoint: layer_3/bias
I Loading variable from checkpoint: layer_3/bias/Adam
I Loading variable from checkpoint: layer_3/bias/Adam_1
I Loading variable from checkpoint: layer_3/weights
I Loading variable from checkpoint: layer_3/weights/Adam
I Loading variable from checkpoint: layer_3/weights/Adam_1
I Loading variable from checkpoint: layer_5/bias
I Loading variable from checkpoint: layer_5/bias/Adam
I Loading variable from checkpoint: layer_5/bias/Adam_1
I Loading variable from checkpoint: layer_5/weights
I Loading variable from checkpoint: layer_5/weights/Adam
I Loading variable from checkpoint: layer_5/weights/Adam_1
I Loading variable from checkpoint: layer_6/bias
I Loading variable from checkpoint: layer_6/bias/Adam
I Loading variable from checkpoint: layer_6/bias/Adam_1
I Loading variable from checkpoint: layer_6/weights
I Loading variable from checkpoint: layer_6/weights/Adam
I Loading variable from checkpoint: layer_6/weights/Adam_1
I Loading variable from checkpoint: learning_rate
I STARTING Optimization
Epoch 0 | Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
2021-02-01 09:45:09.087454: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-02-01 09:45:09.684289: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
Epoch 0 | Training | Elapsed Time: 0:00:01 | Steps: 1 | Loss: 16.107815
Epoch 0 | Training | Elapsed Time: 0:00:01 | Steps: 2 | Loss: 39.940804
Epoch 0 | Training | Elapsed Time: 0:00:02 | Steps: 3 | Loss: 41.197852
Epoch 0 | Training | Elapsed Time: 0:00:02 | Steps: 4 | Loss: 41.779977
Epoch 0 | Training | Elapsed Time: 0:00:02 | Steps: 5 | Loss: 47.854725
Epoch 0 | Training | Elapsed Time: 0:00:02 | Steps: 6 | Loss: 50.081092
Epoch 0 | Training | Elapsed Time: 0:00:02 | Steps: 7 | Loss: 54.767900
Epoch 0 | Training | Elapsed Time: 0:00:03 | Steps: 8 | Loss: 50.239637
Epoch 0 | Training | Elapsed Time: 0:00:03 | Steps: 9 | Loss: 52.046041
As I am new to DeepSpeech training, I have a few questions:
1. We faced some interruptions in training (system reboots, etc.). Each time we restarted with the above command, the logs showed the earlier checkpoint being picked up, but training always started again from Epoch 0. Is the earlier progress actually saved and resumed from where it stopped, or does each run start from scratch? (The resume command I intend to try is sketched after this list.)
2. Why is training taking so long on the GPU? We are using 12,200 files to train the model, and the average duration of the 16 kHz audio files is 60 seconds. Logged lines:
Epoch 0 | Training | Elapsed Time: 7:07:08 | Steps: 12200 | Loss: 873.814420
Epoch 0 | Training | Elapsed Time: 7:07:08 | Steps: 12200 | Loss: 873.814420
Is this elapsed time reasonable, or is it taking longer than expected? (A rough back-of-the-envelope estimate follows this list.)
3. What is the ideal number of epochs needed to train the model?
4. What is the ideal duration of each audio file for training, and what would be an ideal dataset size? We are using 12,200 files to train.
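For reference, here is my rough arithmetic for question 2 and the resume command I intend to try for question 1. The batch size of 1 is an assumption (we did not set --train_batch_size explicitly), the epoch count below is just a placeholder, and the checkpoint directory is the one shown in the log above.
# Rough per-epoch estimate, assuming the default --train_batch_size of 1 (one file per step):
#   12200 files x 60 s = 732000 s, i.e. about 203 hours of audio per epoch
#   12200 steps in 7:07:08 = 25628 s, i.e. roughly 2.1 s per step
# Resume from the existing checkpoints rather than starting fresh:
python3 DeepSpeech.py \
  --train_files .../clips/train.csv \
  --checkpoint_dir /home/ubuntu/.local/share/deepspeech/checkpoints \
  --epochs 10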
[This is an archived DeepSpeech discussion thread from discourse.mozilla.org/t/deepspeech-training-questions]