Language model incorrectly drops spaces for out-of-vocabulary words #1156
Thank you for filling out the issue template :) This misbehavior seems to happen only when the acoustic model is not doing a very good job. I agree the decoder should not degrade to that level. I haven't had the chance to debug this issue other than tweaking the decoder hyperparameters to try to alleviate the problem. I'll take a closer look.
Yes. In some weird cases when the acoustic model is not performing well, the decoder falls into this state of gluing words together. I'm hoping it can be fixed by tweaking the beam search implementation.
What assumptions does the acoustic model make (i.e., what are the distribution and characteristics of the audio training data)? The audio I provided sounds pretty clear IMHO, but perhaps the audio training data doesn't have enough diversity to help the deep net generalize (i.e., the deep net is essentially overfitting to the training data and isn't generalizing well).
@reuben Any further progress on this by any chance?
Facing the same issue. Any progress or way out to improve the performance?
How can I train my model without the language model?
@nyro22 Training never involves the language model. Computing WERs, however, does.
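For reference, WER (word error rate) is the standard word-level edit-distance ratio between the decoded hypothesis and the reference transcript:

WER = (S + D + I) / N

where S, D, and I count substituted, deleted, and inserted words, and N is the number of words in the reference. The language model affects WER only indirectly, through the hypothesis the decoder produces.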
I am facing the same issue on a rather similar configuration to the one described above. Was there any progress on this? Thanks!
facing the same problem..... ta1jin3ping2iiiao1bu4de5li4liang4zai4iiiong3dao4shang4xia4fan1teng2iiiong3dong4she2xing2zhuang4ru2hai3tun2iii4zhix2iii3iii4tou2de5iiiu1shix4ling3xian1
Hi @bolt163, …
Having the same issues. If I modify the beam search algorithm myself, what would be the steps to recompile using the updated beam search?
@reuben I am also facing the same issue. Any suggestions?
Same here. Also, seems to be happening when using out-of-vocabulary terms.
Probably the bug is somewhere in this function: DeepSpeech/native_client/beam_search.h, lines 56 to 92, at commit e34c52f.
It seems the problem is that sequences with out-of-vocabulary words receive a higher score without spaces than with spaces.
Does the beam search use length normalization? EDIT: I just realized that `word_count_weight_` is performing this role.
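For context, decoders of this family typically combine scores along these lines; this is a simplified sketch with hypothetical names, not the actual `beam_search.h` code:

```cpp
// Sketch of how a beam hypothesis is scored when a space is emitted,
// i.e. when a word is completed. All names are illustrative.
float ScoreHypothesis(float acoustic_log_prob,    // from the CTC network
                      float lm_log_prob,          // KenLM score of the words so far
                      int word_count,
                      float lm_weight,            // "alpha"
                      float word_count_weight) {  // "beta"
  // Each completed word earns a flat bonus of word_count_weight. This
  // only roughly offsets the fact that every additional LM query makes
  // lm_log_prob more negative, which is why it is a crude stand-in for
  // true length normalization.
  return acoustic_log_prob
         + lm_weight * lm_log_prob
         + word_count_weight * word_count;
}
```

Because the per-word bonus is flat, a hypothesis that glues several OOV words into one token pays the large OOV penalty once instead of several times, while giving up only a couple of word bonuses, so it can outscore the correctly spaced hypothesis.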
@GeorgeFedoseev I've been trying to debug the part of the code you pointed to, and I noticed some weird behavior. I printed the score for, respectively: (1) a common word of my corpus; (2) a rare word of the corpus; (3) an invalid/out-of-vocabulary word; and (4) the value of the variable `oov_score_`. I am printing the scores at some different states:
Notice that the `oov_score_` is not the same as the score of the invalid word, and in some cases it is even higher than that of a valid word. I tried adding the following lines to the code:

```cpp
Model::State out;
oov_score_ = model_.FullScore(from_state.model_state, model_.GetVocabulary().NotFound(), out).prob;
```

Now it appears that the score of the invalid word and the variable are similar. When testing on my examples it is not enough to solve this problem, but it certainly reduced the gluing together of words. PS: the words are from my pt-br language corpus.
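To probe these scores outside the decoder, a standalone KenLM query looks roughly like this; a minimal sketch, assuming a KenLM build is available, with the model path and the test word as placeholders:

```cpp
#include <iostream>
#include "lm/model.hh"  // KenLM

int main() {
  lm::ngram::Model model("lm.binary");  // placeholder path
  const lm::ngram::Vocabulary &vocab = model.GetVocabulary();

  lm::ngram::State state(model.BeginSentenceState()), out;

  // log10 probability of a known word in context.
  float known = model.FullScore(state, vocab.Index("casa"), out).prob;

  // An unseen word maps to <unk>, which is the index NotFound() returns,
  // so these two queries should agree.
  float oov = model.FullScore(state, vocab.NotFound(), out).prob;

  std::cout << "known: " << known << " oov: " << oov << std::endl;
  return 0;
}
```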
@bernardohenz If you print (3) with …
@GeorgeFedoseev yes, it is true. But why wouldn't …
@bernardohenz as I understand the code: when the construction of a word is not finished yet (…). So …
But the problem of assigning such a minimum unigram score (or `oov_score_`) remains. One idea that occurred to me is to penalize longer words, to avoid cases where the algorithm tries to concatenate more than 3 words together without a space.
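For illustration, such a penalty could be applied at word boundaries; a hypothetical sketch, not code that was ever merged, with an assumed cutoff and per-character cost:

```cpp
// Hypothetical word-length penalty: charge an increasing cost for
// characters beyond a typical word length, so gluing several words
// into one long token becomes expensive.
float WordLengthPenalty(size_t word_length,
                        size_t typical_length = 10,     // assumed cutoff
                        float per_char_penalty = 0.5f) {
  if (word_length <= typical_length) return 0.0f;
  return -per_char_penalty * static_cast<float>(word_length - typical_length);
}
```

Both constants would need tuning per language; a language with long compound words would need a laxer cutoff.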
Replacing … with … helped. Did I just raise another error?
In fact I created another variable (…). And I do not know if it is a good idea to set …
@bernardohenz I think that in that part (…). Try to increase …
I've implemented length normalization (word_count_weight was only a gross approximation of it) as well as switched to a fixed OOV score (which had been on the TODO list for a long time) as part of the streaming changes, which will be merged soon for our next release. When we have binaries available for testing I'll comment here so anyone interested can test whether it improves the decoder behavior on the cases described here. Thanks a lot for the investigation and suggestions, @bernardohenz, @GeorgeFedoseev and @titardrew!
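A minimal sketch of the idea behind length normalization, as commonly used in beam search (hypothetical code, not the change that was merged):

```cpp
#include <cmath>

// Compare hypotheses by their log probability divided by a power of
// their length, so longer hypotheses are not penalized merely for
// accumulating more negative log-probability terms. alpha in [0, 1]
// controls the strength; 0 disables normalization.
float NormalizedScore(float total_log_prob, int length, float alpha = 0.65f) {
  return total_log_prob / std::pow(static_cast<float>(length), alpha);
}
```

Combined with a fixed OOV score, this makes hypothesis scores comparable regardless of how many words they contain, which removes much of the incentive to glue words together.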
+1😉
They will be available with our next release, v0.2, when it is ready :)
Hi @reuben, any update on these binaries? I too would like to test their impact on decoder behavior.
We're currently training a model for the v0.2 release. Send me an email at {my github username} at mozilla.com and I'll give you access to a preliminary trained model so you can test the code changes. If you have your own model and just want the binaries, they're available here: https://tools.taskcluster.net/groups/ClsXrFSbTJ6uUkEAPqFG8A. The Python and Node packages are also available; just specify version 0.2.0-alpha.9.
@reuben Dropped a mail to you!
I need the new decoder library (.so binary) for Linux x64; how can I download it from the URL given by @reuben? I am a bit lost on that webpage. When I click on DeepSpeech Linux AMD64 CPU, then on artifacts, and then download public/native_client.tar.xz, I don't see any changes in my decoded output when using this .so library compared to the current one. There are still only one or two words followed by a very, very long one without spaces, even though my model frequently outputs whitespace and the beam and greedy decoding output looks fine.
Just tested the 0.2.0 release (deepspeech and models); I still get long words that are outside the English vocabulary. This example is a phone call recording (one channel out of two). STT works well for the first sentence (a pre-recorded welcome message); then comes a part of a real conversation, where STT doesn't work properly. The command and outputs are:

```
(deepspeech-venv) jonathan@ubuntu:~$ deepspeech --model ~/deepspeech-0.2.0-models/models/output_graph.pb --audio ~/audio/C2AICXLGB3D2SMK4WPZF26KEZTRUA6OYR1.wav --alphabet ~/deepspeech-0.2.0-models/models/alphabet.txt --lm ~/deepspeech-0.2.0-models/models/lm.binary --trie ~/deepspeech-0.2.0-models/models/trie
```

The audio can be found at …
@zhao-xin I'm facing the exact same problem. I am working with call recordings. Were you able to fix this?
@sunil3590 I feel this is not an engineering issue. The acoustic model is not trained on phone call conversations, and the same goes for the language model, am I right? We plan to collect our own data to tune DeepSpeech models so they can be used in the real world.
Are there any updates on this? I still have this issue, and I am pretty sure it's not the model's fault, since with plain decoding (greedy or beam search without the LM) I never get these very long words. This is a big problem for me, since those long words obviously mess up the evaluation, but a language model would be necessary to get acceptable performance.
@reuben is currently working on moving to ctcdecode, which among other things should fix this issue.
Could anyone who's seeing this issue test the new decoder on master? There are native client builds here: https://tools.taskcluster.net/groups/FyclewklSUqN6FXHavrhKQ. The acoustic model is the same as v0.2, and the trie is in `data/lm` on master. Let me know how it goes.
Sorry, those instructions are incorrect. The acoustic model is the same as v0.2 but you need to re-export it with the master code. Alternatively you can grab it from here: https://github.com/reuben/DeepSpeech/releases/tag/v0.2.0-prod-ctcdecode
@reuben's new work is working well for me on long, clean recordings. I'm using:
The inference for a 45s podcast snippet seems pretty decent:

> why early on in the night i mean i think there are a couple of states that are going to be really keep kentucky and virginia kentucky closes its poles a half in the eastern times on half in the central time on so that means that half of the states at six o'clock to visit seven o'clock and so have a lot of results and in watching one particular congressional district raciness district between antibarbarus i disengaged morabaraba a republican and this is a race that really should not be on the map this is a race that should be republican territory and if this race is a searching for much of the night in the democratizing well there that's a pretty good sign that the wave will be building

The inference for two recordings I made myself is almost totally wrong, but does not have incorrectly dropped spaces. I'm guessing the poor results are due to recording quality?

> he gravitationless theocratic circuitously manipulate intermediately creation of images and a frame buffer intended for alcohol

> a gravitational latrocinia idly manipulate an alternator exploration of images in a frame of her intolerable
@spencer-brown On the recordings you made yourself, did you record directly to 16 kHz, 16-bit, mono audio? (The recordings sound like they were made at a lower sample rate and/or bit depth.) Also, I'd tend to agree that the drop in recording quality is likely largely to blame for the poor results on the recordings you made yourself. We're currently training models that will be more robust to background noise.
Ah, no, I did not - thanks! In follow-up tests using those settings I'm seeing about 50% accuracy with the Bose headphones and nearly 0% with the MacBook Air mic. The recordings still sound crackly relative to the training recordings. Re: background-noise-robust models - exciting!
For anyone else still having trouble with this, I was able to make it work in the end by installing PyTorch along with the ctcdecode library and then using that on top of my existing code. It worked right out of the gate with a KenLM language model!
@f90 you shouldn't need PyTorch (or the ctcdecode library) to use the new native client; the decoder is built in.
I'm also experiencing the same issue, with words gluing together. I'm trying to run the new version as described by @spencer-brown above, but I'm running into some problems. DeepSpeech v0.3 is working on my system, but using the new version throws an error. I'm using:
I downloaded the files, ran …
This is the output:
Would love to get the new version to work - any thoughts?
Your output shows that it's not an official build. Please use official builds before reporting issues, and please give more context on your system.
The binary files and trie in https://github.com/mozilla/DeepSpeech/tree/master/data/lm alleviate this long-word problem. However, my results are not as good as @spencer-brown's for the same text. I applied deepspeech with the binary files and trie mentioned above (all the rest is just a straight application of the instructions in "Using the model" of https://github.com/mozilla/DeepSpeech). Using ffmpeg to change the sampling rate to 16000, I get the following transcription for the 45-second podcast mentioned above (https://drive.google.com/file/d/1rmje0llC-PXJgTiAiuQcsPRSjaaWfsv_/view?usp=sharing):
When using a band filter:
If anyone knows tricks to further improve results I would be really interested :)
@reuben Hi, I am using the v0.3.0 deepspeech-gpu package, which I installed via pip3 (Python) as stated in the README on the start page. I passed an audio file of about 1 min (a call-center recording) to the command-line tool with the documented arguments and the pre-built model, but I get letters strung together, similar to the people above. What do I need to update to get the better results people above are getting? I am new to this, so I don't understand everything. I also split the recording into 4-second chunks with webrtcvad, but the detection errors are still there. Which files do I need to update, and from where? And is there any need to separate long audio into small chunks, or will detection work fine either way with the new model/binaries (if they need to be updated)?

And @hugorichard, where is the trie2 model you are referring to? The link is broken, I think. Can you specify which are the latest output graph, lm, and trie to use with the latest alpha and stable releases, and where to find these details? I also tried the latest alpha versions and still get a lot of spelling errors. I tried to transcribe Jonathan Ive (the Apple hardware designer); it's British English, but there are still a lot of incomplete words and spelling errors (it spells "evolution" as "evil lution"). I don't know if I am using the correct model files (output graph, trie, lm). Please tell. Thanks.
I am also facing the same issue as @raghavk92. Can anyone please help? The links that I am using are:
Do let me know if you need any other information.
This is fixed now.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
Mozilla DeepSpeech will sometimes create long runs of text with no spaces:
This happens even with short audio clips (4 seconds) from a native speaker of American English, recorded using a high-quality microphone on Mac OS X laptops. I've isolated the problem to the interaction with the language model rather than the acoustic model or the length of the audio clips, as the problem goes away when the language model is turned off.
The problem might be related to encountering out-of-vocabulary terms.
I’ve put together test files with results that show the issue is related to the language model rather than to the length of the audio or the acoustic model.
I’ve provided 10 chunked WAV files at 16 kHz, 16-bit depth, each 4 seconds long, that are a subset of a fuller 15-minute audio file (I have not provided that full 15-minute file, as a few shorter chunks are sufficient to reproduce the problem):
https://www.dropbox.com/sh/3qy65r6wo8ldtvi/AAAAVinsD_kcCi8Bs6l3zOWFa?dl=0
The audio segments deliberately include occasional out-of-vocabulary terms, mostly technical, such as “OKR”, “EdgeStore”, “CAPE”, etc.
Also in that folder are several text files. The output with the standard language model in use, showing the glued-together words, is in `chunks_with_language_model.txt`. The output with the language model turned off is in `chunks_without_language_model.txt`. I’ve included both these files in the shared Dropbox folder link above.

Here’s what the correct transcript should be, manually done: `chunks_correct_manual_transcription.txt`.

This shows the language model is the source of this problem; I’ve seen anecdotal reports from the official message base and blog posts that this is a widespread problem. Perhaps when the language model hits an unknown n-gram, it ends up combining the words rather than retaining the spaces between them.
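A toy calculation illustrates how that could happen; all numbers below are invented for illustration, not measured from DeepSpeech:

```cpp
#include <iostream>

int main() {
  // Hypothetical log10 scores. Both "edge" and "store" are OOV, so the
  // spaced hypothesis pays the <unk> penalty twice; the glued
  // "edgestore" pays it only once.
  const float oov_log_prob = -6.0f;  // assumed <unk> unigram score
  const float word_bonus = 1.5f;     // assumed per-word insertion bonus

  float spaced = 2 * oov_log_prob + 2 * word_bonus;  // "edge store" -> -9.0
  float glued = 1 * oov_log_prob + 1 * word_bonus;   // "edgestore" -> -4.5

  // The glued hypothesis wins despite being wrong.
  std::cout << "spaced: " << spaced << " glued: " << glued << std::endl;
  return 0;
}
```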
Discussion around this bug started on the standard DeepSpeech discussion forum:
https://discourse.mozilla.org/t/text-produced-has-long-strings-of-words-with-no-spaces/24089/13
https://discourse.mozilla.org/t/longer-audio-files-with-deep-speech/22784/3
The standard `client.py` was slightly modified to segment the longer 15-minute audio clip into 4-second blocks.

OS: Mac OS X 10.12.6 (16G1036)
Both Mozilla DeepSpeech and TensorFlow were installed into a virtualenv via the following requirements.txt file:

- Did not compile from source.
- Same
- Used CPU-only version
- Used CPU-only version
I haven't provided my full modified `client.py` that segments longer audio, but the problem reproduces with the standard `deepspeech` command run with a language model against any of the known 4-second audio clips included in the Dropbox folder shared above. This is clearly a bug and not a feature :)