Language model incorrectly drops spaces for out-of-vocabulary words #1156

Closed
BradNeuberg opened this issue Jan 5, 2018 · 54 comments

@BradNeuberg

BradNeuberg commented Jan 5, 2018

Mozilla DeepSpeech will sometimes create long runs of text with no spaces:

omiokaarforfthelastquarterwastoget

This happens even with short audio clips (4 seconds) of a native speaker of American English, recorded with a high-quality microphone on a Mac OS X laptop. I've isolated the problem to the interaction with the language model rather than the acoustic model or the length of the audio clips, as the problem goes away when the language model is turned off.

The problem might be related to encountering out-of-vocabulary terms.

I’ve put together test files whose results show the issue is related to the language model rather than to the length of the audio or to the acoustic model.

I’ve provided 10 chunked WAV files (16 kHz, 16-bit), each 4 seconds long, that are a subset of a fuller 15-minute audio file (I have not provided the full 15-minute file, as a few short chunks are sufficient to reproduce the problem):

https://www.dropbox.com/sh/3qy65r6wo8ldtvi/AAAAVinsD_kcCi8Bs6l3zOWFa?dl=0

The audio segments deliberately include occasional out-of-vocabulary terms, mostly technical, such as “OKR”, “EdgeStore”, “CAPE”, etc.

Also in that folder are several text files showing the output with the standard language model in use, where the garbled words are run together (chunks_with_language_model.txt):

Running inference for chunk 1
so were trying again a maybeialstart this time

Running inference for chunk 2
omiokaarforfthelastquarterwastoget

Running inference for chunk 3
to car to state deloedmarchinstrumnalha

Running inference for chunk 4
a tonproductcaseregaugesomd produce sidnelfromthat

Running inference for chunk 5
i am a to do that you know 

Running inference for chunk 6
we finish the kepehandlerrwend finished backfileprocessing 

Running inference for chunk 7
and is he teckdatthatwewould need to do to split the cape 

Running inference for chunk 8
out from sir handler and i are on new 

Running inference for chunk 9
he is not monolithic am andthanducotingswrat 

Running inference for chunk 10
relizationutenpling paws on that until it its a product signal

Then, I’ve provided similar output with the language model turned off (chunks_without_language_model.txt):

Running inference for chunk 1
so we're tryng again ah maybe alstart this time

Running inference for chunk 2
omiokaar forf the last quarter was to get

Running inference for chunk 3
oto car to state deloed march in strumn alha

Running inference for chunk 4
um ton product  caser egauges somd produc sidnel from that

Running inference for chunk 5
am ah to do that ou nowith

Running inference for chunk 6
we finishd the kepe handlerr wend finished backfile processinga

Running inference for chunk 7
on es eteckdat that we would need to do to split the kae ha

Running inference for chunk 8
rout frome sir hanler and ik ar on newh

Running inference for chunk 9
ch las not monoliic am andthan ducotings wrat 

Running inference for chunk 10
relization u en pling a pas on that until it its a product signal

I’ve included both these files in the shared Dropbox folder link above.

Here’s what the correct transcript should be, transcribed manually (chunks_correct_manual_transcription.txt):

So, we're trying again, maybe I'll start this time.

So my OKR for the last quarter was to get AutoOCR to a state that we could
launch an external alpha, and product could sort of gauge some product signal
from that. To do that we finished the CAPE handler, we finished backfill 
processing, we have some tech debt that we would need to do to split the CAPE 
handler out from the search handler and make our own new handler so its not
monolithic, and do some things around CAPE utilization. We are kind of putting
a pause on that until we get some product signal.

This shows the language model is the source of the problem; I’ve seen anecdotal reports on the official discussion forum and in blog posts that this is a widespread problem. Perhaps when the language model hits an unknown n-gram, it ends up combining the words rather than retaining the spaces between them.

Discussion around this bug started on the standard DeepSpeech discussion forum:
https://discourse.mozilla.org/t/text-produced-has-long-strings-of-words-with-no-spaces/24089/13
https://discourse.mozilla.org/t/longer-audio-files-with-deep-speech/22784/3

  • Have I written custom code (as opposed to running examples on an unmodified clone of the repository):

The standard client.py was slightly modified to segment the longer 15-minute audio clip into 4-second blocks.

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):

Mac OS X 10.12.6 (16G1036)

  • TensorFlow installed from (our builds, or upstream TensorFlow):

Both Mozilla DeepSpeech and TensorFlow were installed into a virtualenv set up via the following requirements.txt file:

tensorflow==1.4.0
deepspeech==0.1.0
numpy==1.13.3
scipy==0.19.1
webrtcvad==2.0.10
  • TensorFlow version (use command below):
('v1.4.0-rc1-11-g130a514', '1.4.0')
  • Python version:
Python 2.7.13
  • Bazel version (if compiling from source):

Did not compile from source.

  • GCC/Compiler version (if compiling from source):

Same

  • CUDA/cuDNN version:

Used CPU only version

  • GPU model and memory:

Used CPU only version

  • Exact command to reproduce:

I haven't provided my full modified client.py that segments longer audio (a minimal sketch of the chunking approach follows the commands below), but to run with a language model, using the standard deepspeech command against one of the 4-second audio clips included in the Dropbox folder shared above, you can run the following:

# Set $DEEPSPEECH to where full Deep Speech checkout is; note that my own git checkout
# for the `deepspeech` runner is at git sha fef25e9ea6b0b6d96dceb610f96a40f2757e05e4
deepspeech $DEEPSPEECH/models/output_graph.pb chunk_2_length_4.0_s.wav $DEEPSPEECH/models/alphabet.txt $DEEPSPEECH/models/lm.binary $DEEPSPEECH/models/trie

# Similar command to run without language model -- spaces retained for unknown words:
deepspeech $DEEPSPEECH/models/output_graph.pb chunk_2_length_4.0_s.wav $DEEPSPEECH/models/alphabet.txt 
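
For completeness, here is a minimal sketch of the kind of segmentation my modified client.py performs before inference. This is not the actual modified script; the file names are hypothetical and it assumes a 16 kHz, 16-bit mono WAV input:

import wave

CHUNK_SECONDS = 4

src = wave.open("full_recording.wav", "rb")   # hypothetical input file
rate = src.getframerate()
frames_per_chunk = rate * CHUNK_SECONDS
index = 0
while True:
    frames = src.readframes(frames_per_chunk)
    if not frames:
        break
    # Write each 4-second block as its own WAV file, preserving the source format
    dst = wave.open("chunk_%d.wav" % index, "wb")
    dst.setnchannels(src.getnchannels())
    dst.setsampwidth(src.getsampwidth())
    dst.setframerate(rate)
    dst.writeframes(frames)
    dst.close()
    index += 1
src.close()

Each chunk can then be passed to the deepspeech command shown above.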

This is clearly a bug and not a feature :)

@BradNeuberg
Author

@kdavis-mozilla @lissyx

@reuben reuben self-assigned this Jan 6, 2018
@reuben
Contributor

reuben commented Jan 6, 2018

Thank you for filling out the issue template :)

This misbehavior seems to happen only when the acoustic model is not doing a very good job. I agree the decoder should not degrade to that level. I haven't had the chance to debug this issue other than tweaking the decoder hyperparameters to try to alleviate the problem. I'll take a closer look.

@reuben
Contributor

reuben commented Jan 6, 2018

Yes. In some weird cases when the acoustic model is not performing well the decoder falls into this weird state of gluing together words. I'm hoping it can be fixed by tweaking the beam search implementation.

@BradNeuberg
Author

What assumptions does the acoustic model make (i.e., what are the distribution and characteristics of the audio training data)? The audio I provided sounds pretty clear IMHO, but perhaps the audio training data doesn't have enough diversity to help the deep net generalize (i.e., the network is essentially overfitting to the training data and isn't generalizing well).

@jessetrana

jessetrana commented Feb 8, 2018

@reuben Any further progress on this by any chance?

@learnerAI

Facing the same issue. Any progress, or a way to improve the performance?

@rjzevallos

How can I train my model without the language model?

@kdavis-mozilla
Contributor

@nyro22 Training never involves the language model. Computing WERs, however, does.

@dsouza95

dsouza95 commented Apr 2, 2018

I am facing the same issue on a rather similar configuration to the one described above. Was there any progress on this? Thanks!

@bolt163

bolt163 commented May 3, 2018

facing the same problem.....
/data/home/DeepSpeech# /data/home/DeepSpeech/deepspeech phoneme_output_graph.pb phoneme.txt A2_1.wav
TensorFlow: v1.6.0-11-g7554dd8
DeepSpeech: v0.1.1-48-g31c01db
Warning: reading entire model file into memory. Transform model file into an mmapped graph to reduce heap usage.
2018-05-03 10:27:27.750965: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-05-03 10:27:28.111299: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] Found device 0 with properties:
name: Tesla M40 24GB major: 5 minor: 2 memoryClockRate(GHz): 1.112
pciBusID: 0000:02:00.0
totalMemory: 23.90GiB freeMemory: 22.71GiB
2018-05-03 10:27:28.111338: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1312] Adding visible gpu devices: 0
2018-05-03 10:27:28.318726: I tensorflow/core/common_runtime/gpu/gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 22058 MB memory) -> physical GPU (device: 0, name: Tesla M40 24GB, pci bus id: 0000:02:00.0, compute capability: 5.2)

ta1jin3ping2iiiao1bu4de5li4liang4zai4iiiong3dao4shang4xia4fan1teng2iiiong3dong4she2xing2zhuang4ru2hai3tun2iii4zhix2iii3iii4tou2de5iiiu1shix4ling3xian1

@elpimous

elpimous commented May 5, 2018

Hi @bolt163,
It seems that your problem lies somewhere else...
Could you provide the intended transcript of this wav?

@nabrowning

Having the same issue. If I modify the beam search algorithm myself, what would be the steps to recompile with the updated beam search?

@kdavis-mozilla
Contributor

The build requirements are here[1] and the build instructions are here[2].

@mihiraniljoshi

@reuben I am also facing the same issue. Any suggestions?

@xiapoy

xiapoy commented Jun 3, 2018

Same here. Also, seems to be happening when using out-of-vocabulary terms.

@GeorgeFedoseev
Contributor

GeorgeFedoseev commented Jun 13, 2018

Probably the bug is somewhere in this function:

void ExpandState(const KenLMBeamState& from_state, int from_label,
                 KenLMBeamState* to_state, int to_label) const {
  CopyState(from_state, to_state);

  if (!alphabet_.IsSpace(to_label)) {
    to_state->incomplete_word += alphabet_.StringFromLabel(to_label);
    TrieNode *trie_node = from_state.incomplete_word_trie_node;

    // If we have no valid prefix we assume a very low log probability
    float min_unigram_score = oov_score_;
    // If the prefix does exist
    if (trie_node != nullptr) {
      trie_node = trie_node->GetChildAt(to_label);
      to_state->incomplete_word_trie_node = trie_node;
      if (trie_node != nullptr) {
        min_unigram_score = trie_node->GetMinUnigramScore();
      }
    }

    // TODO try two options
    // 1) unigram score added up to language model score
    // 2) language model score of (preceding_words + unigram_word)
    to_state->score = min_unigram_score + to_state->language_model_score;
    to_state->delta_score = to_state->score - from_state.score;
  } else {
    float lm_score_delta = ScoreIncompleteWord(from_state.model_state,
                                               to_state->incomplete_word,
                                               to_state->model_state);
    // Give fixed word bonus
    if (!IsOOV(to_state->incomplete_word)) {
      to_state->language_model_score += valid_word_count_weight_;
    }
    to_state->language_model_score += word_count_weight_;
    UpdateWithLMScore(to_state, lm_score_delta);
    ResetIncompleteWord(to_state);
  }
}

It seems the problem is that sequences with out-of-vocabulary words receive a higher score without spaces than with spaces.
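
To make that concrete, here is a toy comparison with made-up log probabilities (not the decoder's real numbers, and ignoring per-character prefix scores and word bonuses): each space forces the hypothesis to pay the language model's harsh unknown-word score again, so the glued hypothesis can come out ahead.

# Toy illustration with invented scores -- not DeepSpeech's actual values.
OOV_WORD_SCORE = -9.0   # stand-in for KenLM's score of an unknown word

# Hypothesis A: "edge store" -- two OOV words, each scored when its space is emitted.
score_with_space = 2 * OOV_WORD_SCORE   # -18.0

# Hypothesis B: "edgestore" -- one glued OOV word, scored only once.
score_glued = 1 * OOV_WORD_SCORE        # -9.0

# The glued hypothesis scores higher, matching the run-together transcripts above.
print(score_glued > score_with_space)   # True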

@bernardohenz
Contributor

bernardohenz commented Jun 13, 2018

Does the beam search use length normalization?
According to Andrew Ng, it improves beam search by reducing the penalty for outputting sentences with a higher number of words.
Andrew talks about it in this video.

EDIT: I just realized that word_count_weight_ is serving this purpose.
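
For reference, length normalization typically divides a hypothesis's total log probability by the number of emitted words raised to some power alpha, so hypotheses with more (shorter) words are not unfairly penalized. A minimal sketch with invented numbers (alpha around 0.7 is a common choice, not necessarily what DeepSpeech uses):

# Minimal sketch of length normalization for beam-search scoring (invented scores).
def normalized_score(total_log_prob, num_words, alpha=0.7):
    # Dividing by num_words ** alpha reduces the penalty for emitting more words.
    return total_log_prob / (num_words ** alpha)

glued  = normalized_score(-9.0, 1)    # e.g. "edgestore"  -> -9.00
spaced = normalized_score(-12.0, 2)   # e.g. "edge store" -> about -7.39
# After normalization the two-word hypothesis can outscore the glued one.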

@bernardohenz
Contributor

@GeorgeFedoseev I've been trying to debug the part of the code you pointed to, and I noticed some weird behavior. Below, I've printed the score for, respectively: (1) a common word of my corpus; (2) a rare word of the corpus; (3) an invalid/out-of-vocabulary word; (4) the value of the variable 'oov_score_'. I am printing the scores at some different states:

'não':    -2.9632
'informação':    -4.43594
'fdfdg':    -5.15036
oov_score:    -4.70759

------------------------------

'não':    -2.97739
'informação':    -4.45013
'fdfdg':    -5.16455
oov_score:    -4.70759

------------------------------

'não':    -2.88466
'informação':    -5.84782
'fdfdg':    -6.56224
oov_score:    -4.70759

Notice that the oov_score is not the same as the score of the invalid word, and in some cases it is even higher than that of a valid word. I tried adding the following lines to the code:

Model::State out;
oov_score_ = model_.FullScore(from_state.model_state, model_.GetVocabulary().NotFound(), out).prob;

and now the score of the invalid word and the variable appear to be similar. When testing on my examples, this is not enough to solve the problem, but it certainly reduced the gluing together of words.

PS: words are from my pt-br language corpus

@GeorgeFedoseev
Contributor

@bernardohenz
I think you printed scores for (1), (2) and (3) that depend on state, but oov_score_ does not depend on state (in master), and you cannot compare them.

If you print (3) with model.NullContextState() wouldn't it be the same as oov_score_?

@bernardohenz
Contributor

@GeorgeFedoseev yes, that's true. But why wouldn't oov_score depend on the state? I think it makes sense to compute the oov_score for each state. What do you think?

@GeorgeFedoseev
Contributor

GeorgeFedoseev commented Jun 28, 2018

@bernardohenz As I understand the code: when the construction of a word is not finished yet (the if (!alphabet_.IsSpace(to_label)) branch), to tell the beam search that it's going in the right direction, we add the minimum unigram score of the words that this prefix can lead to. This minimum unigram score is precomputed without state (with model.NullContextState()) and saved in the trie file. To get this score dynamically depending on state, you would need to find, for each prefix, all possible words it can lead to and select the minimum score (which is probably very slow).

So oov_score_ doesn't depend on state because we are comparing OOV branches of the beam search with in-vocabulary branches, which are scored using scores from the trie file (and those scores don't depend on state).
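
For illustration, here is a rough sketch of the kind of offline precomputation being described: each trie node keeps the minimum context-free unigram score over all vocabulary words sharing its prefix, so the decoder can look it up cheaply per character. This is a hypothetical structure, not the actual trie-building code:

# Rough sketch (hypothetical, not the real generate_trie implementation).
class TrieNode:
    def __init__(self):
        self.children = {}
        self.min_unigram_score = float("inf")  # min score of any word with this prefix

def build_prefix_trie(unigram_scores):
    """unigram_scores: dict mapping word -> context-free (null-state) log probability."""
    root = TrieNode()
    for word, score in unigram_scores.items():
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
            # Every node on the word's path remembers the lowest unigram score
            # among the vocabulary words that pass through it.
            node.min_unigram_score = min(node.min_unigram_score, score)
    return root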

@bernardohenz
Contributor

But the problem with assigning such a minimum unigram score (or oov_score) is that, during beam search, the algorithm prefers to keep concatenating characters rather than choosing a space and finishing a low-probability word (such as my example (2)).

One idea that occurred to me is to penalize longer words, to avoid cases where the algorithm concatenates more than three words together without a space.

@titardrew

Replacing

oov_score_ = model_.FullScore(model_.NullContextState(), model_.GetVocabulary().NotFound(), out).prob;

with

oov_score_ = -1000.00;

helped. Did I just introduce another problem?

@bernardohenz
Contributor

In fact, I created another variable (oov_score_2) to compute this value (oov_score_ can't be modified inside the function).

And I'm not sure it is a good idea to set oov_score_ = -1000.00;, since this value is used while the word is being composed (char by char). The point of 'correcting' oov_score_ is to keep the algorithm from simply deciding to glue all the characters together (without a space character).

@GeorgeFedoseev
Contributor

GeorgeFedoseev commented Jun 28, 2018

@bernardohenz I think that in that branch (if (!alphabet_.IsSpace(to_label))) the requirement should simply be that an OOV word gets a lower score than any vocabulary word.

Try increasing word_count_weight_ from the default of 1 to something like 3.5. This resulted in less concatenation for me and decreased my WER by 3-4%.

@reuben
Contributor

reuben commented Jul 13, 2018

I've implemented length normalization (word_count_weight was only a gross approximation of it) and switched to a fixed OOV score (which had been on the TODO list for a long time) as part of the streaming changes, which will be merged soon for our next release. When we have binaries available for testing, I'll comment here so anyone interested can check whether it improves the decoder behavior on the cases described here. Thanks a lot for the investigation and suggestions, @bernardohenz, @GeorgeFedoseev and @titardrew!

@elpimous

+1😉

@reuben
Contributor

reuben commented Aug 17, 2018

They will be available with our next release, v0.2, when it is ready :)

@mozilla mozilla deleted a comment from ZipingL Aug 17, 2018
@desaur

desaur commented Sep 7, 2018

Hi @reuben Any update on these binaries? I too would like to test their impact on decoder behavior.

@reuben
Contributor

reuben commented Sep 7, 2018

We're currently training a model for the v0.2 release. Send me an email at {my github username} at mozilla.com and I'll give you access to a preliminary trained model so you can test the code changes.

If you have your own model and just want the binaries, they're available here: https://tools.taskcluster.net/groups/ClsXrFSbTJ6uUkEAPqFG8A

The Python and Node packages are also available, just specify version 0.2.0-alpha.9

@b-ak
Contributor

b-ak commented Sep 10, 2018

@reuben Dropped you a mail!

@f90

f90 commented Sep 10, 2018

I need the new decoder library (the .so binary) for Linux x64; how can I download it from the URL given by @reuben? I am a bit lost on that webpage.

When I click on DeepSpeech Linux AMD64 CPU, then on artifacts, and download public/native_client.tar.xz, I don't see any change in my decoded output when using this .so library compared to the current one. There are still only one or two words followed by a very, very long one without spaces, despite my ensuring that the model frequently outputs whitespace, and the beam and greedy decoding output looks fine.

@zhao-xin

zhao-xin commented Sep 20, 2018

Just tested the 0.2.0 release (deepspeech and models); I still get long glued words that are not in the English vocabulary.

This example is a phone-call recording (one channel out of two). STT works well for the first sentence (a pre-recorded welcome message), but then, for the part that is real conversation, it doesn't work properly.

The command and output are:

(deepspeech-venv) jonathan@ubuntu:~$ deepspeech --model ~/deepspeech-0.2.0-models/models/output_graph.pb --audio ~/audio/C2AICXLGB3D2SMK4WPZF26KEZTRUA6OYR1.wav --alphabet ~/deepspeech-0.2.0-models/models/alphabet.txt --lm ~/deepspeech-0.2.0-models/models/lm.binary --trie ~/deepspeech-0.2.0-models/models/trie
Loading model from file /home/jonathan/deepspeech-0.2.0-models/models/output_graph.pb
TensorFlow: v1.6.0-18-g5021473
DeepSpeech: v0.2.0-0-g009f9b6
Warning: reading entire model file into memory. Transform model file into an mmapped graph to reduce heap usage.
2018-09-20 11:02:49.456955: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Loaded model in 0.134s.
Loading language model from files /home/jonathan/deepspeech-0.2.0-models/models/lm.binary /home/jonathan/deepspeech-0.2.0-models/models/trie
Loaded language model in 3.85s.
Running inference.
thank you for calling national storage your call may be recorded for coaching and quality the poses place let us not an if ye prefer we didn't record your colt to day in wall constrashionalshordistigwisjemaigay am so it just so he put in your code held everything in a disbosmygriparsesnwygorighticame so she's not like um that's all good if you won't care if i can just reserve something from my end over the foreign am i can reserve at the same on mine price you will looking out as well um which sent a and a unit is he looking out without which location for it an put it by sereerkapcoolofmijustrynorfrommians or a man we after the ground floor on the upper of a at
Inference took 33.947s for 58.674s audio file.

The audio can be found from
https://s3.us-east-2.amazonaws.com/fonedynamicsuseast2/C2AICXLGB3D2SMK4WPZF26KEZTRUA6OYR1.wav

@sunil3590

@zhao-xin I'm facing the exact same problem. I am working with call recording. Were you able to fix this?

@zhao-xin

@sunil3590 I feel this is not an engineering issue. The acoustic model is not trained on phone-call conversations, and the same goes for the language model, am I right?

We plan to collect our own data and tune the DeepSpeech models so they can be used in the real world.

@f90

f90 commented Oct 23, 2018

Are there any updates on this? I still have this issue, and I am pretty sure it's not the model's fault, since with normal decoding (greedy or beam search) I never get these very long words.

This is a big problem for me, since those long words obviously mess up the evaluation, but a language model is necessary to get acceptable performance.

@lissyx
Collaborator

lissyx commented Oct 23, 2018

@reuben is currently working on moving to ctcdecode, which, among other things, should fix this issue.

@reuben
Contributor

reuben commented Oct 30, 2018

Could anyone who's seeing this issue test the new decoder on master?

There's native client builds here: https://tools.taskcluster.net/groups/FyclewklSUqN6FXHavrhKQ

The acoustic model is the same as v0.2, and the trie is in data/lm/trie.ctcdecode after you update to latest master. Testing with some problematic examples I had shows much better results, but the links in this thread are all broken so I couldn't test with your files.

Let me know how it goes.

@reuben
Contributor

reuben commented Oct 30, 2018

Sorry, those instructions are incorrect. The acoustic model is the same as v0.2 but you need to re-export it with the master code. Alternatively you can grab it from here: https://github.com/reuben/DeepSpeech/releases/tag/v0.2.0-prod-ctcdecode

@spencer-brown

@reuben 's new work is working well for me on long, clean recordings.


The inference for a 45s podcast snippet seems pretty decent:

why early on in the night i mean i think there are a couple of states that are going to be really keep kentucky and virginia kentucky closes its poles a half in the eastern times on half in the central time on so that means that half of the states at six o'clock to visit seven o'clock and so have a lot of results and in watching one particular congressional district raciness district between antibarbarus i disengaged morabaraba a republican and this is a race that really should not be on the map this is a race that should be republican territory and if this race is a searching for much of the night in the democratizing well there that's a pretty good sign that the wave will be building

The inference for two recordings I made myself is almost totally wrong, but does not have incorrectly dropped spaces. I'm guessing the poor results are due to recording quality?

he gravitationless theocratic circuitously manipulate intermediately creation of images and a frame buffer intended for alcohol
a gravitational latrocinia idly manipulate an alternator exploration of images in a frame of her intolerable

@kdavis-mozilla
Contributor

@spencer-brown On the recordings you made yourself, did you record directly to 16 kHz, 16-bit, mono audio? (The recordings sound like they were made at a lower sample rate and/or bit depth.)

Also, I'd tend to agree that the drop in the recording quality is likely largely to blame for the poor results on the recordings you made yourself. We're currently training models that will be more robust to background noise.

@spencer-brown

Ah, no, I did not - thanks! In follow-up tests using those settings I'm seeing about 50% accuracy with the Bose headphones and nearly 0% with the MacBook Air mic. The recordings still sound crackly relative to the training recordings.

Re: background-noise-robust models - exciting!

@f90

f90 commented Nov 1, 2018

For anyone else still having trouble with this, I was able to make it work in the end by installing PyTorch along with the ctcdecode library and then using that on top of my existing code; it worked right out of the gate with a KenLM language model!

@reuben
Contributor

reuben commented Nov 1, 2018

@f90 you shouldn't need PyTorch (or the ctcdecode library) to use the new native client, the decoder is built-in.

@derekpankaew

derekpankaew commented Nov 3, 2018

I'm also experiencing the same issue, with words being glued together. I'm trying to run the new version as described by @spencer-brown above, but I'm running into some problems.

DeepSpeech v0.3 is working on my system, but the new version is throwing an error.

I'm using:

  • The new Node module here
  • The new ctcdecode trie here
  • The new output_graph.pbmm model here
  • lm.binary and alphabet from the previous v0.3 release

I downloaded the files, ran npm install, and then ran the command:

node client.js --audio="./--audios_for_testing/90secondtest.wav" --model="./output_graph.pbmm" --trie="./trie.ctcdecode" --lm="./deepspeech_models/lm.binary" --alphabet="./deepspeech_models/alphabet.txt"

This is the output:

Loading model from file ./output_graph.pbmm
TensorFlow: v1.11.0-11-gbee825492f
DeepSpeech: unknown
2018-11-03 17:35:42.654139: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
dyld: lazy symbol binding failed: Symbol not found: __ZN2v87Isolate19CheckMemoryPressureEv
  Referenced from: /Users/derekpankaew/Dropbox/Javascript Programming/speech_recognition/lib/binding/v1.0.0/darwin-x64/node-v57/deepspeech.node
  Expected in: flat namespace

dyld: Symbol not found: __ZN2v87Isolate19CheckMemoryPressureEv
  Referenced from: /Users/derekpankaew/Dropbox/Javascript Programming/speech_recognition/lib/binding/v1.0.0/darwin-x64/node-v57/deepspeech.node
  Expected in: flat namespace

Abort trap: 6

Would love to get the new version to work - any thoughts?

@lissyx
Collaborator

lissyx commented Nov 13, 2018

Would love to get the new version to work - any thoughts?

Your output shows that it's not an official build. Please use official builds before reporting issues, and please give more context on your system.

@hugorichard

hugorichard commented Nov 18, 2018

The binary files and trie in https://github.com/mozilla/DeepSpeech/tree/master/data/lm alleviate this long-word problem. However, my results are not as good as @spencer-brown's for the same text.

I applied deepspeech with the binary files and trie mentioned above (all the rest is just a straight application of the instructions in "Using the model" of https://github.com/mozilla/DeepSpeech).

Using ffmpeg to change the sampling rate to 16000
ffmpeg -i midterm-update-clipped.wav -acodec pcm_s16le -ac 1 -ar 16000 midterm-update-clipped2.wav

I get the following transcription for the 45-second podcast mentioned above (https://drive.google.com/file/d/1rmje0llC-PXJgTiAiuQcsPRSjaaWfsv_/view?usp=sharing):

Loading model from file models/output_graph.pbmm
TensorFlow: v1.11.0-9-g97d851f
DeepSpeech: v0.3.0-0-gef6b5bd
Loaded model in 0.013s.
Loading language model from files models/lm2.binary models/trie2
Loaded language model in 0.000145s.
Running inference.
why early on in the night i mean i think there are a couple states that are going to be really keep can tucky and virginia contucky closes its poles a half in te the eastern times own half an the sentral times on so that means that half of the states at six o'clock afh o vis had seven o'clock ah and joll have a lot of results and an waschings one particular congressional district race o six congressinal district between a andi bar and maganme graph i ad bis emmigrass the democrat bars in combent a republican and this is a race that really should not be on the map this is a race that should be republican territory and if this race is a u seem magrapha leading for much of the night and de democratis doing well there that's a pretty good sign that the wave will be building
Inference took 41.082s for 48.489s audio file.

When using a band-pass filter:
ffmpeg -i midterm-update-clipped.wav -acodec pcm_s16le -ac 1 -ar 16000 -af lowpass=3000,highpass=200 midterm-update-clipped3.wav
I get a slightly better transcription:

TensorFlow: v1.11.0-9-g97d851f
DeepSpeech: v0.3.0-0-gef6b5bd
Loaded model in 0.0128s.
Loading language model from files models/lm2.binary models/trie2
Loaded language model in 0.000105s.
Running inference.
why early on in the night i mean i think there are a couple states that are going to be really keep can tucky and virginia contucky closes its poles a half in the the eastern times own half in the central times on so that means that half of the states it six o'clock atfe vits ad seven o'clock ah and toll have a lot of results and an waschings one particular congressional district race o six congressial district between a andi bar and maganmc graph i ed es emmograss the democrat bars in combent a republican and this is a race that really should not be on the map this is a race that should be republican territory and if this race is a you seem mograph leading for much of the night and te democratis doing well there thats a pretty good sign that the wave will be building
Inference took 43.352s for 48.489s audio file.

If anyone knows tricks to further improve results I would be really interested :)

@raghavk92

raghavk92 commented Nov 27, 2018

@reuben Hi, I am using v0.3.0 of deepspeech-gpu, which I installed from pip3 (Python) as stated in the readme on the start page. I passed an audio file of about 1 minute (a call-center recording) to the command-line tool with the pre-built model, but I get letters strung together, similar to the people above. What do I need to update to get better results like the ones people are getting above? I am new to this, so I don't understand everything. I also made 4-second chunks of the recording with webrtcvad, but the detection errors are still there. Which files do I need to update, and from where? Also, is there any need to separate long audio into small chunks, or will detection work fine in both cases with the new model/binaries (if they need to be updated)? And @hugorichard, where is the trie2 model you are referring to? The link is broken, I think. Can you specify which output graph, lm, and trie are the latest to use with the latest alpha and stable releases, and where to find these details?

I also tried the latest alpha versions and still get a lot of spelling errors. I tried to transcribe Jonathan Ive (the Apple hardware designer). It's British English, but there are still a lot of incomplete words and spelling errors (it spells "evolution" as "evil lution"). I don't know if I am using the correct models (output graph, trie, lm). Please advise.

Thanks

@gr8nishan

gr8nishan commented Nov 29, 2018

@lissyx
Collaborator

lissyx commented Mar 30, 2019

This has now been fixed.

@lissyx lissyx closed this as completed Mar 30, 2019
@lock

lock bot commented Apr 29, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Apr 29, 2019