Language model incorrectly drops spaces for out-of-vocabulary words #1156
Thank you for filling out the issue template :) This misbehavior seems to happen only when the acoustic model is not doing a very good job. I agree the decoder should not degrade to that level. I haven't had the chance to debug this issue other than tweaking the decoder hyperparameters to try to alleviate the problem. I'll take a closer look.
Yes. In some weird cases when the acoustic model is not performing well, the decoder falls into this state of gluing words together. I'm hoping it can be fixed by tweaking the beam search implementation.
What assumptions does the acoustic model make (i.e., what are the distribution and characteristics of the audio training data)? The audio I provided sounds pretty clear IMHO, but perhaps the audio training data doesn't have enough diversity to help the deep net generalize (i.e., the deep net is essentially overfitting to the training data and isn't generalizing well).
@reuben Any further progress on this by any chance?
Facing the same issue. Any progress or way out to improve the performance?
How can I train my model without the language model?
@nyro22 Training never involves the language model. Computing WERs, however, does.
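For reference, WER (word error rate) is the standard word-level edit-distance ratio between the decoded hypothesis and the reference transcript:

WER = (S + D + I) / N

where S, D, and I count substituted, deleted, and inserted words, and N is the number of words in the reference. The language model affects WER only indirectly, through the hypothesis the decoder produces.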
I am facing the same issue on a rather similar configuration to the one described above. Was there any progress on this? Thanks!
facing the same problem..... ta1jin3ping2iiiao1bu4de5li4liang4zai4iiiong3dao4shang4xia4fan1teng2iiiong3dong4she2xing2zhuang4ru2hai3tun2iii4zhix2iii3iii4tou2de5iiiu1shix4ling3xian1
Hi @bolt163, …
Having the same issues. If I modify the beam search algorithm myself, what would be the steps to recompile using the updated beam search?
@reuben I am also facing the same issue. Any suggestions?
Same here. Also, seems to be happening when using out-of-vocabulary terms.
Probably the bug is somewhere in this function: DeepSpeech/native_client/beam_search.h, lines 56 to 92, at commit e34c52f.
It seems the problem is that sequences with out-of-vocabulary words receive a higher score without spaces than with spaces.
Does the beam search use length normalization? EDIT: I just realized that `word_count_weight_` is performing this role.
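For context, decoders of this family typically combine scores along these lines; this is a simplified sketch with hypothetical names, not the actual `beam_search.h` code:

```cpp
// Sketch of how a beam hypothesis is scored when a space is emitted,
// i.e. when a word is completed. All names are illustrative.
float ScoreHypothesis(float acoustic_log_prob,    // from the CTC network
                      float lm_log_prob,          // KenLM score of the words so far
                      int word_count,
                      float lm_weight,            // "alpha"
                      float word_count_weight) {  // "beta"
  // Each completed word earns a flat bonus of word_count_weight. This
  // only roughly offsets the fact that every additional LM query makes
  // lm_log_prob more negative, which is why it is a crude stand-in for
  // true length normalization.
  return acoustic_log_prob
         + lm_weight * lm_log_prob
         + word_count_weight * word_count;
}
```

Because the per-word bonus is flat, a hypothesis that glues several OOV words into one token pays the large OOV penalty once instead of several times, while giving up only a couple of word bonuses, so it can outscore the correctly spaced hypothesis.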
@GeorgeFedoseev I've been trying to debug the part of the code you pointed to, and I noticed some weird behavior. I printed the score for, respectively: (1) a common word of my corpus; (2) a rare word of the corpus; (3) an invalid/out-of-vocabulary word; and (4) the value of the variable `oov_score_`. I am printing the scores at some different states:
Notice that the `oov_score_` is not the same as the score of the invalid word, and in some cases it is even higher than that of a valid word. I tried adding the following lines to the code:

```cpp
Model::State out;
oov_score_ = model_.FullScore(from_state.model_state, model_.GetVocabulary().NotFound(), out).prob;
```

Now it appears that the score of the invalid word and the variable are similar. When testing on my examples it is not enough to solve this problem, but it certainly reduced the gluing together of words. PS: the words are from my pt-br language corpus.
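To probe these scores outside the decoder, a standalone KenLM query looks roughly like this; a minimal sketch, assuming a KenLM build is available, with the model path and the test word as placeholders:

```cpp
#include <iostream>
#include "lm/model.hh"  // KenLM

int main() {
  lm::ngram::Model model("lm.binary");  // placeholder path
  const lm::ngram::Vocabulary &vocab = model.GetVocabulary();

  lm::ngram::State state(model.BeginSentenceState()), out;

  // log10 probability of a known word in context.
  float known = model.FullScore(state, vocab.Index("casa"), out).prob;

  // An unseen word maps to <unk>, which is the index NotFound() returns,
  // so these two queries should agree.
  float oov = model.FullScore(state, vocab.NotFound(), out).prob;

  std::cout << "known: " << known << " oov: " << oov << std::endl;
  return 0;
}
```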
@bernardohenz If you print (3) with …
@GeorgeFedoseev yes, it is true. But why wouldn't …
@bernardohenz as I understand the code: when the construction of a word is not finished yet (…). So …
But the problem of assigning such a minimum unigram score (or `oov_score_`) remains. One idea that occurred to me is to penalize longer words, to avoid cases where the algorithm tries to concatenate more than 3 words together without a space.
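For illustration, such a penalty could be applied at word boundaries; a hypothetical sketch, not code that was ever merged, with an assumed cutoff and per-character cost:

```cpp
// Hypothetical word-length penalty: charge an increasing cost for
// characters beyond a typical word length, so gluing several words
// into one long token becomes expensive.
float WordLengthPenalty(size_t word_length,
                        size_t typical_length = 10,     // assumed cutoff
                        float per_char_penalty = 0.5f) {
  if (word_length <= typical_length) return 0.0f;
  return -per_char_penalty * static_cast<float>(word_length - typical_length);
}
```

Both constants would need tuning per language; a language with long compound words would need a laxer cutoff.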
Replacing … with … helped. Did I just raise another error?
In fact I created another variable (…). And I do not know if it is a good idea to set …
@bernardohenz I think that in that part (…). Try to increase …
I've implemented length normalization (word_count_weight was only a gross approximation of it) as well as switched to a fixed OOV score (which had been on the TODO list for a long time) as part of the streaming changes, which will be merged soon for our next release. When we have binaries available for testing I'll comment here so anyone interested can test whether it improves the decoder behavior on the cases described here. Thanks a lot for the investigation and suggestions, @bernardohenz, @GeorgeFedoseev and @titardrew!
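A minimal sketch of the idea behind length normalization, as commonly used in beam search (hypothetical code, not the change that was merged):

```cpp
#include <cmath>

// Compare hypotheses by their log probability divided by a power of
// their length, so longer hypotheses are not penalized merely for
// accumulating more negative log-probability terms. alpha in [0, 1]
// controls the strength; 0 disables normalization.
float NormalizedScore(float total_log_prob, int length, float alpha = 0.65f) {
  return total_log_prob / std::pow(static_cast<float>(length), alpha);
}
```

Combined with a fixed OOV score, this makes hypothesis scores comparable regardless of how many words they contain, which removes much of the incentive to glue words together.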
+1😉
They will be available with our next release, v0.2, when it is ready :)
Hi @reuben, any update on these binaries? I too would like to test their impact on decoder behavior.
We're currently training a model for the v0.2 release. Send me an email at {my github username} at mozilla.com and I'll give you access to a preliminary trained model so you can test the code changes. If you have your own model and just want the binaries, they're available here: https://tools.taskcluster.net/groups/ClsXrFSbTJ6uUkEAPqFG8A. The Python and Node packages are also available; just specify version 0.2.0-alpha.9.
@reuben Dropped a mail to you!
I need the new decoder library (.so binary) for Linux x64; how can I download it from the URL given by @reuben? I am a bit lost on that webpage. When I click on DeepSpeech Linux AMD64 CPU, then on artifacts, and then download public/native_client.tar.xz, I don't see any changes in my decoded output when using this .so library compared to the current one. There are still only one or two words followed by a very, very long one without spaces, even though my model frequently outputs whitespace and the beam and greedy decoding output looks fine.
Just tested the 0.2.0 release (deepspeech and models); I still get long words that are outside the English vocabulary. This example is a phone call recording (one channel out of two). STT works well for the first sentence (a pre-recorded welcome message); then comes a part of a real conversation, where STT doesn't work properly. The command and outputs are:

```
(deepspeech-venv) jonathan@ubuntu:~$ deepspeech --model ~/deepspeech-0.2.0-models/models/output_graph.pb --audio ~/audio/C2AICXLGB3D2SMK4WPZF26KEZTRUA6OYR1.wav --alphabet ~/deepspeech-0.2.0-models/models/alphabet.txt --lm ~/deepspeech-0.2.0-models/models/lm.binary --trie ~/deepspeech-0.2.0-models/models/trie
```

The audio can be found at …
@zhao-xin I'm facing the exact same problem. I am working with call recordings. Were you able to fix this?
@sunil3590 I feel this is not an engineering issue. The acoustic model is not trained on phone call conversations, and the same goes for the language model, am I right? We plan to collect our own data to tune DeepSpeech models so they can be used in the real world.
Are there any updates on this? I still have this issue, and I am pretty sure it's not the model's fault, since with plain decoding (greedy or beam search without the LM) I never get these very long words. This is a big problem for me, since those long words obviously mess up the evaluation, but a language model would be necessary to get acceptable performance.
@reuben is currently working on moving to ctcdecode, which among other things should fix this issue.
Could anyone who's seeing this issue test the new decoder on master? There are native client builds here: https://tools.taskcluster.net/groups/FyclewklSUqN6FXHavrhKQ. The acoustic model is the same as v0.2, and the trie is in `data/lm` on master. Let me know how it goes.
Sorry, those instructions are incorrect. The acoustic model is the same as v0.2 but you need to re-export it with the master code. Alternatively you can grab it from here: https://github.com/reuben/DeepSpeech/releases/tag/v0.2.0-prod-ctcdecode
@reuben's new work is working well for me on long, clean recordings. I'm using:
The inference for a 45s podcast snippet seems pretty decent:

> why early on in the night i mean i think there are a couple of states that are going to be really keep kentucky and virginia kentucky closes its poles a half in the eastern times on half in the central time on so that means that half of the states at six o'clock to visit seven o'clock and so have a lot of results and in watching one particular congressional district raciness district between antibarbarus i disengaged morabaraba a republican and this is a race that really should not be on the map this is a race that should be republican territory and if this race is a searching for much of the night in the democratizing well there that's a pretty good sign that the wave will be building

The inference for two recordings I made myself is almost totally wrong, but does not have incorrectly dropped spaces. I'm guessing the poor results are due to recording quality?

> he gravitationless theocratic circuitously manipulate intermediately creation of images and a frame buffer intended for alcohol

> a gravitational latrocinia idly manipulate an alternator exploration of images in a frame of her intolerable
@spencer-brown On the recordings you made yourself, did you record directly to 16 kHz, 16-bit, mono audio? (The recordings sound like they were made at a lower sample rate and/or bit depth.) Also, I'd tend to agree that the drop in recording quality is likely largely to blame for the poor results on the recordings you made yourself. We're currently training models that will be more robust to background noise.
Ah, no, I did not - thanks! In follow-up tests using those settings I'm seeing about 50% accuracy with the Bose headphones and nearly 0% with the MacBook Air mic. The recordings still sound crackly relative to the training recordings. Re: background-noise-robust models - exciting!
For anyone else still having trouble with this, I was able to make it work in the end by installing PyTorch along with the ctcdecode library and then using that on top of my existing code. It worked right out of the gate with a KenLM language model!
@f90 you shouldn't need PyTorch (or the ctcdecode library) to use the new native client; the decoder is built in.
I'm also experiencing the same issue, with words gluing together. I'm trying to run the new version as described by @spencer-brown above, but I'm running into some problems. DeepSpeech v0.3 is working on my system, but using the new version throws an error. I'm using:
I downloaded the files, ran …
This is the output:
Would love to get the new version to work - any thoughts?
Your output shows that it's not an official build. Please use official builds before reporting issues, and please give more context on your system.
The binary files and trie in https://github.com/mozilla/DeepSpeech/tree/master/data/lm alleviate this long-word problem. However, my results are not as good as @spencer-brown's for the same text. I applied deepspeech with the binary files and trie mentioned above (all the rest is just a straight application of the instructions in "Using the model" of https://github.com/mozilla/DeepSpeech). Using ffmpeg to change the sampling rate to 16000, I get the following transcription for the 45-second podcast mentioned above (https://drive.google.com/file/d/1rmje0llC-PXJgTiAiuQcsPRSjaaWfsv_/view?usp=sharing):
When using a band filter:
If anyone knows tricks to further improve results I would be really interested :)
@reuben Hi, I am using the v0.3.0 deepspeech-gpu package, which I installed via pip3 (Python) as stated in the README on the start page. I passed an audio file of about 1 min (a call-center recording) to the command-line tool with the documented arguments and the pre-built model, but I get letters strung together, similar to the people above. What do I need to update to get the better results people above are getting? I am new to this, so I don't understand everything. I also split the recording into 4-second chunks with webrtcvad, but the detection errors are still there. Which files do I need to update, and from where? And is there any need to separate long audio into small chunks, or will detection work fine either way with the new model/binaries (if they need to be updated)?

And @hugorichard, where is the trie2 model you are referring to? The link is broken, I think. Can you specify which are the latest output graph, lm, and trie to use with the latest alpha and stable releases, and where to find these details? I also tried the latest alpha versions and still get a lot of spelling errors. I tried to transcribe Jonathan Ive (the Apple hardware designer); it's British English, but there are still a lot of incomplete words and spelling errors (it spells "evolution" as "evil lution"). I don't know if I am using the correct model files (output graph, trie, lm). Please tell. Thanks.
I am also facing the same issue as @raghavk92. Can anyone please help? The links that I am using are:
Do let me know if you need any other information.
This is fixed now.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
Mozilla DeepSpeech will sometimes create long runs of text with no spaces:
This happens even with short audio clips (4 seconds) from a native speaker of American English, recorded using a high-quality microphone on Mac OS X laptops. I've isolated the problem to the interaction with the language model rather than the acoustic model or the length of the audio clips, as the problem goes away when the language model is turned off.
The problem might be related to encountering out-of-vocabulary terms.
I’ve put together test files with results that show the issue is related to the language model rather than to the length of the audio or the acoustic model.
I’ve provided 10 chunked WAV files at 16 kHz, 16-bit depth, each 4 seconds long, that are a subset of a fuller 15-minute audio file (I have not provided that full 15-minute file, as a few shorter chunks are sufficient to reproduce the problem):
https://www.dropbox.com/sh/3qy65r6wo8ldtvi/AAAAVinsD_kcCi8Bs6l3zOWFa?dl=0
The audio segments deliberately include occasional out-of-vocabulary terms, mostly technical, such as “OKR”, “EdgeStore”, “CAPE”, etc.
Also in that folder are several text files. The output with the standard language model in use, showing the glued-together words, is in `chunks_with_language_model.txt`. The output with the language model turned off is in `chunks_without_language_model.txt`. I’ve included both these files in the shared Dropbox folder link above.

Here’s what the correct transcript should be, manually done: `chunks_correct_manual_transcription.txt`.

This shows the language model is the source of this problem; I’ve seen anecdotal reports from the official message base and blog posts that this is a widespread problem. Perhaps when the language model hits an unknown n-gram, it ends up combining the words rather than retaining the spaces between them.
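A toy calculation illustrates how that could happen; all numbers below are invented for illustration, not measured from DeepSpeech:

```cpp
#include <iostream>

int main() {
  // Hypothetical log10 scores. Both "edge" and "store" are OOV, so the
  // spaced hypothesis pays the <unk> penalty twice; the glued
  // "edgestore" pays it only once.
  const float oov_log_prob = -6.0f;  // assumed <unk> unigram score
  const float word_bonus = 1.5f;     // assumed per-word insertion bonus

  float spaced = 2 * oov_log_prob + 2 * word_bonus;  // "edge store" -> -9.0
  float glued = 1 * oov_log_prob + 1 * word_bonus;   // "edgestore" -> -4.5

  // The glued hypothesis wins despite being wrong.
  std::cout << "spaced: " << spaced << " glued: " << glued << std::endl;
  return 0;
}
```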
Discussion around this bug started on the standard DeepSpeech discussion forum:
https://discourse.mozilla.org/t/text-produced-has-long-strings-of-words-with-no-spaces/24089/13
https://discourse.mozilla.org/t/longer-audio-files-with-deep-speech/22784/3
The standard `client.py` was slightly modified to segment the longer 15-minute audio clip into 4-second blocks.

OS: Mac OS X 10.12.6 (16G1036)
Both Mozilla DeepSpeech and TensorFlow were installed into a virtualenv via the following requirements.txt file:

- Did not compile from source.
- Same
- Used CPU-only version
- Used CPU-only version
I haven't provided my full modified `client.py` that segments longer audio, but the problem reproduces with the standard `deepspeech` command run with a language model against any of the known 4-second audio clips included in the Dropbox folder shared above. This is clearly a bug and not a feature :)