jiping_s | December 3, 2017, 3:12pm
In traditional speech recognizers, the language model specifies which word sequences are possible. DeepSpeech seems to generate its final output based on statistics at the letter level (not the word level).
I have a language model containing a few hundred words, in ARPA format:

    \data\
    ngram 1=655
    ngram 2=3133
    ngram 3=4482

    \1-grams:
    0 <s> -0.8111794
    ...
With this model, the word sequence 'would you like to try our strudel for twenty five cents' is possible. However, the final output is not what I would expect if the language model were used in the traditional way.
Here is the detailed process:
(1) Building the language model:

    ./lmplz --text corpus.txt --arpa corpus.arpa --o 3
    ./build_binary -T -s corpus.arpa lm.binary
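As a sanity check, the freshly built lm.binary can be queried to confirm that it assigns the test sentence a finite probability and knows all of its words. This is a minimal sketch, assuming the KenLM Python bindings (the kenlm module) are installed:

    import kenlm  # Python bindings for KenLM, the toolkit that built lm.binary

    model = kenlm.Model('lm.binary')
    sentence = 'would you like to try our strudel for twenty five cents'
    print(model.score(sentence))  # total log10 probability of the sentence

    # per-word scores; oov=True would flag a word missing from the 655-word vocabulary
    for word, (logprob, ngram_len, oov) in zip(sentence.split(),
                                               model.full_scores(sentence)):
        print(word, round(logprob, 3), '(OOV)' if oov else '')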
(2) Building the trie:

    ./generate_trie models/alphabet.txt lm.binary corpus.txt trie
(in the trie-building step, alphabet.txt is the original file from the DeepSpeech release, lm.binary and corpus.txt are my own files from step (1), and trie is the newly generated file)
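Conceptually, the trie built in step (2) records which letter prefixes can still be completed into a vocabulary word, so the decoder can penalize character hypotheses that lead nowhere. A toy Python sketch of the idea (an illustration only, not the actual generate_trie file format):

    class TrieNode:
        def __init__(self):
            self.children = {}   # next letter -> child node
            self.is_word = False

    root = TrieNode()
    for word in ['try', 'our', 'strudel', 'twenty', 'five', 'cents']:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def is_live_prefix(prefix):
        """Can this letter prefix still be extended into a vocabulary word?"""
        node = root
        for ch in prefix:
            if ch not in node.children:
                return False
            node = node.children[ch]
        return True

    print(is_live_prefix('stru'))   # True: 'strudel' lies below this prefix
    print(is_live_prefix('struo'))  # False: no vocabulary word starts this way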
(3) Run deepspeech (the wave file says 'would you like to try our strudel for twenty five cents?'):

(3.1) First, use my language model with DeepSpeech's original acoustic model (the .pb file):

    deepspeech models/output_graph.pb test13.wav models/alphabet.txt ./lm.binary ./trie
output:

    Loading model from file models/output_graph.pb
    Loaded model in 0.204s.
    Loading language model from files ./lm.binary ./trie
    Loaded language model in 0.004s.
    Running inference.
    would you like to trialastruodle for twenty five cents
    Inference took 5.162s for 4.057s audio file.
(3.2) Then, use everything from the DeepSpeech release:

    deepspeech models/output_graph.pb test13.wav models/alphabet.txt models/lm.binary models/trie
output:

    Loading model from file models/output_graph.pb
    Loaded model in 0.223s.
    Loading language model from files models/lm.binary models/trie
    Loaded language model in 1.092s.
    Running inference.
    would i like to trialastruodlefortwentyfvecents
    Inference took 5.141s for 4.057s audio file.
    (deepspeech-venv) jeremy@levono:~/DeepSpeech$
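For reference, the same runs can be scripted through the Python package that ships with the 0.1-era release. The constants below (26 MFCC features, context window of 9, beam width 500, and the three weights passed to enableDecoderWithLM) are assumptions recalled from that release's example client, not values verified here:

    from deepspeech.model import Model
    import scipy.io.wavfile as wav

    ds = Model('models/output_graph.pb', 26, 9, 'models/alphabet.txt', 500)
    # LM weight, word count weight, valid word count weight (assumed defaults)
    ds.enableDecoderWithLM('models/alphabet.txt', './lm.binary', './trie',
                           1.75, 1.00, 1.00)

    fs, audio = wav.read('test13.wav')
    print(ds.stt(audio, fs))  # prints the decoded transcript

Note the explicit LM weight: the language model is mixed into the beam search as a score, not used as a hard constraint on what the output may contain.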
Now compare the output of the two runs:

    would you like to trialastruodle for twenty five cents
    would i like to trialastruodlefortwentyfvecents
DeepSpeech seems to use the language model in a way that differs from the traditional one: a letter sequence such as 'trialastruodle' has only a rough similarity to the word sequence 'try our strudel', which the language model does contain. It seems that after the neural network generates letter sequences, the language model is applied as a second processing layer, which is why the two runs above give different results with different language models. My question is: why do these strange letter sequences still appear?
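The two runs suggest an answer in outline: the acoustic model emits characters, and the decoder only reweights hypotheses with the LM score rather than restricting them to vocabulary words. A minimal, hypothetical sketch of that kind of weighted scoring (the alpha/beta names follow the usual CTC-decoding convention, and all numbers are made up for illustration):

    def lm_log10(word, vocab={'try': -1.2, 'our': -1.5, 'strudel': -2.0}):
        # stand-in for a KenLM query; out-of-vocabulary words get a harsh penalty
        return vocab.get(word, -10.0)

    def total_score(text, acoustic_log10, alpha=1.75, beta=1.0):
        """Acoustic log-prob + alpha * LM log-prob + beta * word-count bonus."""
        words = text.split()
        return acoustic_log10 + alpha * sum(lm_log10(w) for w in words) + beta * len(words)

    # If the acoustic model is confident enough in the garbled characters,
    # the weighted LM penalty cannot overturn them:
    print(total_score('try our strudel', acoustic_log10=-16.0))  # -16 + 1.75*(-4.7) + 3 = -21.225
    print(total_score('trialastruodle', acoustic_log10=-3.0))    # -3 + 1.75*(-10.0) + 1 = -19.5

Under such a mix, a letter string the network strongly prefers can beat an in-vocabulary word sequence even though the LM treats it as a single, very unlikely out-of-vocabulary 'word', which matches the 'trialastruodle' outputs above.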
[This is an archived TTS discussion thread from discourse.mozilla.org/t/how-language-model-is-used-in-deepspeech]