Allow use of several decoders (language models) with a single model in the API #1678

Open
reuben opened this issue Oct 24, 2018 · 15 comments

@reuben
Contributor

reuben commented Oct 24, 2018

One might have two language models, for example: one with just commands, and one for general English text. The first is used to detect voice commands, and the second to understand a parameter to a command, say a text message the user wants to send.

Currently you can only have one decoder instance live at any given point, by calling enableDecoderWithKenLM. It'd be nice to be able to have several decoder instances and specify which one you want to use when you call decode.

@lissyx
Collaborator

lissyx commented Nov 4, 2018

Do you think we could also do this in such a way that each language model has its own alphabet? It could help with transfer learning, using the English model together with a language-specific language model.

@reuben
Contributor Author

reuben commented Nov 4, 2018

The acoustic model output has the alphabet embedded in it (number of predicted classes) so you can't use a different alphabet just on the decoder end. For doing transfer learning like that you could try freezing all layers but the final one and then retraining it with the new alphabet.

@elpimous

Hi.
I had an idea about this (but with no real-time constraint!):

In a loop, send a wav to, say, 3 different language models,
keep the confidence value for each,
wait for the loop to finish, and pick the language with the best confidence.

My use case is a big batch of multi-language waves to transcribe (US English/French/Chinese...), without any real-time need!

So: 3 models in memory, 3 transcriptions -> confidences -> best confidence -> return the correct inference.

C.U
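
A minimal sketch of the idea above, assuming the DeepSpeech Python bindings, already-loaded Model instances (one per language, each with its own LM enabled), and that sttWithMetadata exposes a per-transcript confidence; the exact metadata attribute names vary between releases and are an assumption here:

```python
def transcribe_best_language(models, audio):
    """Run the same audio buffer through every model and keep the most
    confident transcript. `models` maps a language code to an already-loaded
    deepspeech Model with its language-specific LM enabled."""
    best = None
    for lang, model in models.items():
        # Attribute names (transcripts, confidence, tokens, text) follow the
        # 0.7-era metadata layout and may differ in other releases.
        meta = model.sttWithMetadata(audio)
        candidate = meta.transcripts[0]
        text = "".join(token.text for token in candidate.tokens)
        if best is None or candidate.confidence > best[1]:
            best = (lang, candidate.confidence, text)
    return best  # (language, confidence, transcript)
```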

@pvanickova

Here are my thoughts on the multiple language model features:

  1. could I specify which LMs are used at inference time for decoding? (e.g. use general English and the department1 LM for one call, and general English and the department2 LM for another)

  2. could I specify the priority or relative weights of the LMs somehow? (e.g. if you find a hit above a minimal threshold from the department1 LM, it beats the general English hypothesis; a score-combination sketch follows after this list)

  3. this feature doesn't resolve the case where a custom phrase is embedded in general English if used in isolated decodings as suggested above (e.g. general English + medical terminology LMs: dear mr smith i suspect you have "compound multiple fracture" and we have to fix it); all models would need to be used in parallel during the beam search for that case
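
On point 2, one simple way to express relative weights would be to give each LM its own weight in the beam-search score; the current single-LM decoder already combines the acoustic score with alpha * LM score + beta * word count, so this just generalizes alpha to one weight per LM. A rough sketch with illustrative names only, not the actual decoder internals:

```python
def combined_score(acoustic_logprob, lm_logprobs, lm_weights, word_count, beta):
    """Score one beam-search candidate against several LMs at once.

    acoustic_logprob -- log-probability from the acoustic model
    lm_logprobs      -- one log-probability per LM for the candidate text
    lm_weights       -- per-LM weights (e.g. boost department1 over general English)
    beta             -- word-insertion bonus, as in the existing single-LM decoder
    """
    assert len(lm_logprobs) == len(lm_weights)
    lm_term = sum(w * lp for w, lp in zip(lm_weights, lm_logprobs))
    return acoustic_logprob + lm_term + beta * word_count
```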

@bernardohenz
Contributor

bernardohenz commented Feb 6, 2019

About using several language models: I am wondering if it is possible to use two language models in the decoding. The idea is to use a word-based LM (like the current one) together with a char-based LM.

Edit: This is discussed in some recent papers, like the following one:

  • Hori et al. Multilevel language modeling and decoding for open vocabulary end-to-end speech recognition
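
Very roughly, the multilevel scheme in that paper scores hypotheses with the character-level LM while inside a word and, at each word boundary, swaps the accumulated character-LM score of the finished word for the word-LM score, falling back to the character-LM estimate for out-of-vocabulary words. An illustrative sketch of that rescoring step (not the paper's exact algorithm, and not existing DeepSpeech code):

```python
import math

def word_boundary_rescore(hyp_logprob, char_lm_word_logprob, word_lm_logprob,
                          in_vocabulary, oov_penalty=math.log(1e-3)):
    """Applied when a space ends a word during beam search (illustrative only).

    hyp_logprob          -- hypothesis score so far, including the character-LM
                            contribution of the word just finished
    char_lm_word_logprob -- that character-LM contribution
    word_lm_logprob      -- word-LM score of the word, if it is in vocabulary
    """
    if in_vocabulary:
        # Replace the character-level estimate with the word-level one.
        return hyp_logprob - char_lm_word_logprob + word_lm_logprob
    # Out of vocabulary: keep the character-LM estimate, optionally penalized.
    return hyp_logprob + oov_penalty
```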

@dabinat
Collaborator

dabinat commented May 27, 2019

@reuben I'd be interested in taking a look at this if you can give me more information. It would be great if you could query multiple models and take the result with the highest confidence value.

Also, just to be clear, this only covers the acoustic model, not the KenLM model?

@kdavis-mozilla
Contributor

@dabinat The original idea, as I understood it, was to start the engine with several language models and one acoustic model, and then be able to select at runtime which language model was active.

For example, one might have an application with several distinct "command and control" aspects: one dealing with navigation (north, south, east, west), another dealing with questions (yes, no), another dealing with... Each aspect would correspond to a language model, and the "command and control" aspect that was active could be changed at runtime.

@reuben
Contributor Author

reuben commented May 29, 2019

@dabinat this bug is for using different language models, not acoustic models. The idea is that you can instantiate multiple language model instances and then, for every new stream, specify which LM to use for decoding. This would let you implement something like the following, for a voice-assistant-like UI:

  1. By default, use a command LM
  2. A command is recognized; it requires further interaction, like saying what address should be displayed on a map
  3. A new stream is created with an address dictation LM
  4. An address is recognized
  5. Rinse, repeat

To be honest, I'm not sure how useful this feature would be compared to implementing some LM fusion technique, which would let applications use a single general language model with contextual biasing. With that, an application wouldn't be forced to do things in two steps and start a new stream with a new LM, which might not fit the intended user flow.

@reuben
Contributor Author

reuben commented May 29, 2019

In any case, the rough idea I had for how to implement/expose this is by separating language model instances into their own handle, so DS_EnableDecoderWithLM would turn into DS_CreateDecoderWithLM, which would return an opaque DecoderState* that identifies a LM with its hyperparameters. A new DS_FreeDecoder function would have to be added.

This DecoderState pointer would then be passed into DS_SetupStream. The decoder-specific members of ModelState (scorer, beam_width, etc) would be moved into the DecoderState structure, and StreamingState would have a reference to it.
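
To make the proposed split concrete, here is roughly how the earlier voice-assistant flow might look from a caller's side, transliterating the proposed C names into hypothetical Python-style calls; none of these functions exist in any released binding, they only mirror the proposal above:

```python
# Hypothetical names mirroring the proposed DS_CreateDecoderWithLM /
# DS_SetupStream / DS_FreeDecoder split; values are placeholders.
command_decoder = create_decoder_with_lm(model, lm_path="commands.binary",
                                         trie_path="commands.trie",
                                         lm_alpha=0.75, lm_beta=1.85,
                                         beam_width=500)
address_decoder = create_decoder_with_lm(model, lm_path="addresses.binary",
                                         trie_path="addresses.trie",
                                         lm_alpha=0.75, lm_beta=1.85,
                                         beam_width=500)

# Each new stream picks the decoder (and therefore the LM and its
# hyperparameters) it should use.
stream = setup_stream(model, decoder=command_decoder)
# ... feed audio, finish the stream, act on the recognized command ...

stream = setup_stream(model, decoder=address_decoder)
# ... feed audio, finish the stream, act on the recognized address ...

free_decoder(command_decoder)
free_decoder(address_decoder)
```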

@dabinat
Collaborator

dabinat commented May 30, 2019

> To be honest, I'm not sure how useful this feature would be compared to implementing some LM fusion technique, which would let applications use a single general language model with contextual biasing. With that, an application wouldn't be forced to do things in two steps and start a new stream with a new LM, which might not fit the intended user flow.

Yes, that's a good point. I thought perhaps one could process a stream multiple times with different LMs and choose words with the highest confidence, but on second thoughts I'm not sure how reliable that would be.

My primary interest is in the ability for my application to infer which words may be likely to occur in the transcript and supply them to DeepSpeech in advance, which then assigns those words a slightly higher priority/weight. I thought this issue might get one step closer to that goal. A fusion model certainly would. I did some digging in the code a few weeks ago with this goal in mind but I wasn't able to figure out how to do it and KenLM seems to be quite poorly documented, so if you have any suggestions I'd definitely appreciate them.

Having said that, the steps you outlined to accomplish this issue seem pretty straightforward and I'm still happy to do it if you think it would be useful.

@kdavis-mozilla
Contributor

I'm putting this in the 1.0.0 project, as that will contain some language model work, which may include this.

@reuben
Contributor Author

reuben commented Jan 23, 2020

I missed the fact that this was in 1.0; it involves more changes than the ones currently in #2681. It shouldn't be too difficult, but it will complicate the API a bit, as every call that creates a stream will need to take a scorer parameter.

@kdavis-mozilla
Contributor

> I missed the fact that this was in 1.0...

I don't know if this has to be in 1.0. We can, if it's too much, delay it for 2.0.

@kdavis-mozilla
Contributor

We're going to delay until 2.0

@gkucsko

gkucsko commented Jun 27, 2021

Would there be interest in integrating pyctcdecode? https://github.com/kensho-technologies/pyctcdecode
It supports multiple language models as well as a few other things out of the box.
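
For reference, a short sketch of what that could look like with pyctcdecode, decoding the same acoustic output with two different KenLM models; the labels and paths below are placeholders, and the exact options should be checked against the pyctcdecode documentation:

```python
import numpy as np
from pyctcdecode import build_ctcdecoder

# `labels` must match the acoustic model's output alphabet (including however
# it encodes the CTC blank), and `logits` is that model's per-frame output
# with shape (time, vocab); both are placeholders here.
labels = list(" abcdefghijklmnopqrstuvwxyz'")
logits = np.load("logits.npy")

commands_decoder = build_ctcdecoder(labels, kenlm_model_path="commands.arpa")
general_decoder = build_ctcdecoder(labels, kenlm_model_path="english.arpa")

# One acoustic model output, two language models:
print(commands_decoder.decode(logits))
print(general_decoder.decode(logits))
```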
