Allow use of several decoders (language models) with a single model in the API #1678

Open
reuben opened this issue Oct 24, 2018 · 15 comments

@reuben
Contributor

reuben commented Oct 24, 2018

One might have two language models, for example: one with just commands, and one for general English text. The first is used to detect voice commands, and the second to understand a parameter to a command, say a text message the user wants to send.

Currently you can only have one decoder instance live at any given point, by calling enableDecoderWithKenLM. It'd be nice to be able to have several decoder instances and specify which one you want to use when you call decode.

@lissyx
Collaborator

lissyx commented Nov 4, 2018

Do you think we could also do this in such a way that each language model has its own alphabet? It could help with transfer learning, using the English model together with a language-specific language model.

@reuben
Contributor Author

reuben commented Nov 4, 2018

The acoustic model output has the alphabet embedded in it (number of predicted classes) so you can't use a different alphabet just on the decoder end. For doing transfer learning like that you could try freezing all layers but the final one and then retraining it with the new alphabet.

@elpimous

Hi.
I had an idea about this (but with no real-time constraint!):

In a loop, send a wav to, say, 3 different language models,
keep the confidence value for each,
wait for the loop to finish, and pick the language with the best confidence.

My use case is a big batch of multi-language waves to transcribe (US English/French/Chinese...), without any real-time need!

So: 3 models in memory, 3 transcriptions -> confidences -> best confidence -> return the correct inference.

C.U
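
A minimal sketch of the idea above, assuming the DeepSpeech Python bindings, already-loaded Model instances (one per language, each with its own LM enabled), and that sttWithMetadata exposes a per-transcript confidence; the exact metadata attribute names vary between releases and are an assumption here:

```python
def transcribe_best_language(models, audio):
    """Run the same audio buffer through every model and keep the most
    confident transcript. `models` maps a language code to an already-loaded
    deepspeech Model with its language-specific LM enabled."""
    best = None
    for lang, model in models.items():
        # Attribute names (transcripts, confidence, tokens, text) follow the
        # 0.7-era metadata layout and may differ in other releases.
        meta = model.sttWithMetadata(audio)
        candidate = meta.transcripts[0]
        text = "".join(token.text for token in candidate.tokens)
        if best is None or candidate.confidence > best[1]:
            best = (lang, candidate.confidence, text)
    return best  # (language, confidence, transcript)
```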

@pvanickova

Here are my thoughts on the multiple language model features:

  1. could I specify which LMs are used at inference time for decoding? (e.g. use general English and the department1 LM for one call, and general English and the department2 LM for another)

  2. could I specify the priority or relative weights of the LMs somehow? (e.g. if you find a hit above a minimal threshold from the department1 LM, it beats the general English hypothesis; a score-combination sketch follows after this list)

  3. this feature doesn't resolve the case where a custom phrase is embedded in general English if used in isolated decodings as suggested above (e.g. general English + medical terminology LMs: dear mr smith i suspect you have "compound multiple fracture" and we have to fix it); all models would need to be used in parallel during the beam search for that case
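
On point 2, one simple way to express relative weights would be to give each LM its own weight in the beam-search score; the current single-LM decoder already combines the acoustic score with alpha * LM score + beta * word count, so this just generalizes alpha to one weight per LM. A rough sketch with illustrative names only, not the actual decoder internals:

```python
def combined_score(acoustic_logprob, lm_logprobs, lm_weights, word_count, beta):
    """Score one beam-search candidate against several LMs at once.

    acoustic_logprob -- log-probability from the acoustic model
    lm_logprobs      -- one log-probability per LM for the candidate text
    lm_weights       -- per-LM weights (e.g. boost department1 over general English)
    beta             -- word-insertion bonus, as in the existing single-LM decoder
    """
    assert len(lm_logprobs) == len(lm_weights)
    lm_term = sum(w * lp for w, lp in zip(lm_weights, lm_logprobs))
    return acoustic_logprob + lm_term + beta * word_count
```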

@bernardohenz
Contributor

bernardohenz commented Feb 6, 2019

About using several language models: I am wondering if it is possible to use two language models in the decoding. The idea is to use a word-based LM (like the current one) together with a char-based LM.

Edit: This is discussed in some recent papers, like the following one:

  • Hori et al. Multilevel language modeling and decoding for open vocabulary end-to-end speech recognition
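
Very roughly, the multilevel scheme in that paper scores hypotheses with the character-level LM while inside a word and, at each word boundary, swaps the accumulated character-LM score of the finished word for the word-LM score, falling back to the character-LM estimate for out-of-vocabulary words. An illustrative sketch of that rescoring step (not the paper's exact algorithm, and not existing DeepSpeech code):

```python
import math

def word_boundary_rescore(hyp_logprob, char_lm_word_logprob, word_lm_logprob,
                          in_vocabulary, oov_penalty=math.log(1e-3)):
    """Applied when a space ends a word during beam search (illustrative only).

    hyp_logprob          -- hypothesis score so far, including the character-LM
                            contribution of the word just finished
    char_lm_word_logprob -- that character-LM contribution
    word_lm_logprob      -- word-LM score of the word, if it is in vocabulary
    """
    if in_vocabulary:
        # Replace the character-level estimate with the word-level one.
        return hyp_logprob - char_lm_word_logprob + word_lm_logprob
    # Out of vocabulary: keep the character-LM estimate, optionally penalized.
    return hyp_logprob + oov_penalty
```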

@dabinat
Collaborator

dabinat commented May 27, 2019

@reuben I'd be interested in taking a look at this if you can give me more information. It would be great if you could query multiple models and take the result with the highest confidence value.

Also, just to be clear, this only covers the acoustic model, not the KenLM model?

@kdavis-mozilla
Contributor

@dabinat The original idea, as I understood it, was to start the engine with several language models and one acoustic model, and then be able to select at runtime which language model was active.

For example, one might have an application with several distinct "command and control" aspects: one dealing with navigation (north, south, east, west), another dealing with questions (yes, no), another dealing with... Each aspect would correspond to a language model, and the "command and control" aspect that was active could be changed at runtime.

@reuben
Contributor Author

reuben commented May 29, 2019

@dabinat this bug is for using different language models, not acoustic models. The idea is that you can instantiate multiple language model instances and then, for every new stream, specify which LM to use for decoding. This would let you implement something like the following, for a voice-assistant-like UI:

  1. By default, use a command LM
  2. A command is recognized; it requires further interaction, like saying what address should be displayed on a map
  3. A new stream is created with an address dictation LM
  4. An address is recognized
  5. Rinse, repeat

To be honest, I'm not sure how useful this feature would be compared to implementing some LM fusion technique, which would let applications use a single general language model with contextual biasing. With that, an application wouldn't be forced to do things in two steps and start a new stream with a new LM, which might not fit the intended user flow.

@reuben
Contributor Author

reuben commented May 29, 2019

In any case, the rough idea I had for how to implement/expose this is by separating language model instances into their own handle, so DS_EnableDecoderWithLM would turn into DS_CreateDecoderWithLM, which would return an opaque DecoderState* that identifies a LM with its hyperparameters. A new DS_FreeDecoder function would have to be added.

This DecoderState pointer would then be passed into DS_SetupStream. The decoder-specific members of ModelState (scorer, beam_width, etc) would be moved into the DecoderState structure, and StreamingState would have a reference to it.
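
To make the proposed split concrete, here is roughly how the earlier voice-assistant flow might look from a caller's side, transliterating the proposed C names into hypothetical Python-style calls; none of these functions exist in any released binding, they only mirror the proposal above:

```python
# Hypothetical names mirroring the proposed DS_CreateDecoderWithLM /
# DS_SetupStream / DS_FreeDecoder split; values are placeholders.
command_decoder = create_decoder_with_lm(model, lm_path="commands.binary",
                                         trie_path="commands.trie",
                                         lm_alpha=0.75, lm_beta=1.85,
                                         beam_width=500)
address_decoder = create_decoder_with_lm(model, lm_path="addresses.binary",
                                         trie_path="addresses.trie",
                                         lm_alpha=0.75, lm_beta=1.85,
                                         beam_width=500)

# Each new stream picks the decoder (and therefore the LM and its
# hyperparameters) it should use.
stream = setup_stream(model, decoder=command_decoder)
# ... feed audio, finish the stream, act on the recognized command ...

stream = setup_stream(model, decoder=address_decoder)
# ... feed audio, finish the stream, act on the recognized address ...

free_decoder(command_decoder)
free_decoder(address_decoder)
```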

@dabinat
Collaborator

dabinat commented May 30, 2019

> To be honest, I'm not sure how useful this feature would be compared to implementing some LM fusion technique, which would let applications use a single general language model with contextual biasing. With that, an application wouldn't be forced to do things in two steps and start a new stream with a new LM, which might not fit the intended user flow.

Yes, that's a good point. I thought perhaps one could process a stream multiple times with different LMs and choose words with the highest confidence, but on second thoughts I'm not sure how reliable that would be.

My primary interest is in the ability for my application to infer which words may be likely to occur in the transcript and supply them to DeepSpeech in advance, which then assigns those words a slightly higher priority/weight. I thought this issue might get one step closer to that goal. A fusion model certainly would. I did some digging in the code a few weeks ago with this goal in mind but I wasn't able to figure out how to do it and KenLM seems to be quite poorly documented, so if you have any suggestions I'd definitely appreciate them.

Having said that, the steps you outlined to accomplish this issue seem pretty straightforward and I'm still happy to do it if you think it would be useful.

@kdavis-mozilla
Contributor

I'm putting this in the 1.0.0 project, as that will contain some language model work, which may include this.

@reuben
Contributor Author

reuben commented Jan 23, 2020

I missed the fact that this was in 1.0; it involves more changes than the ones currently in #2681. It shouldn't be too difficult, but it will complicate the API a bit, as every call that creates a stream will need to take a scorer parameter.

@kdavis-mozilla
Contributor

> I missed the fact that this was in 1.0...

I don't know if this has to be in 1.0. We can, if it's too much, delay it for 2.0.

@kdavis-mozilla
Contributor

We're going to delay until 2.0

@gkucsko

gkucsko commented Jun 27, 2021

Would there be interest in integrating pyctcdecode? https://github.com/kensho-technologies/pyctcdecode
It supports multiple language models as well as a few other things out of the box.
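
For reference, a short sketch of what that could look like with pyctcdecode, decoding the same acoustic output with two different KenLM models; the labels and paths below are placeholders, and the exact options should be checked against the pyctcdecode documentation:

```python
import numpy as np
from pyctcdecode import build_ctcdecoder

# `labels` must match the acoustic model's output alphabet (including however
# it encodes the CTC blank), and `logits` is that model's per-frame output
# with shape (time, vocab); both are placeholders here.
labels = list(" abcdefghijklmnopqrstuvwxyz'")
logits = np.load("logits.npy")

commands_decoder = build_ctcdecoder(labels, kenlm_model_path="commands.arpa")
general_decoder = build_ctcdecoder(labels, kenlm_model_path="english.arpa")

# One acoustic model output, two language models:
print(commands_decoder.decode(logits))
print(general_decoder.decode(logits))
```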
