Allow use of several decoders (language models) with a single model in the API #1678
Comments
Do you think we could also do that such that each language model has its own alphabet? This could help with transfer learning, using the English model together with a language-dedicated language model.
The acoustic model output has the alphabet embedded in it (the number of predicted classes), so you can't use a different alphabet just on the decoder end. For transfer learning like that, you could try freezing all layers but the final one and then retraining it with the new alphabet.
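A minimal sketch of that freeze-all-but-the-final-layer approach, in TensorFlow 1.x style; the layer scope name, loss tensor, and learning rate are placeholders, not DeepSpeech's actual training code:

```python
# Sketch only: fine-tune just the output layer for a new alphabet.
# "final_layer" is a placeholder scope name, not DeepSpeech's real variable scope.
import tensorflow as tf

def build_finetune_op(loss, final_layer_scope="final_layer", lr=1e-4):
    # Only variables under the final layer's scope go to the optimizer;
    # every other layer stays frozen because it is excluded from var_list.
    final_vars = tf.compat.v1.get_collection(
        tf.compat.v1.GraphKeys.TRAINABLE_VARIABLES, scope=final_layer_scope)
    optimizer = tf.compat.v1.train.AdamOptimizer(learning_rate=lr)
    return optimizer.minimize(loss, var_list=final_vars)
```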
Hi. In a loop, send a wav to three different language models. My idea is to transcribe a large batch of multi-language audio (US English/French/Chinese...), with no real-time requirement. So: three models in memory, three transcriptions -> confidences -> best confidence -> return the correct inference. C.U
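A rough sketch of that loop, assuming the deepspeech Python package's 0.7.x-era API (Model, enableExternalScorer, sttWithMetadata); the model and scorer file names are placeholders:

```python
# Load one full model (acoustic model + scorer) per language, transcribe the
# same audio with each, and keep the highest-confidence result.
from deepspeech import Model

MODELS = {
    "en": ("english.pbmm", "english.scorer"),
    "fr": ("french.pbmm", "french.scorer"),
    "zh": ("chinese.pbmm", "chinese.scorer"),
}

def load_models():
    loaded = {}
    for lang, (graph, scorer) in MODELS.items():
        m = Model(graph)
        m.enableExternalScorer(scorer)
        loaded[lang] = m
    return loaded

def best_transcript(models, audio):  # audio: 16-bit PCM as a numpy int16 array
    results = {}
    for lang, m in models.items():
        best = m.sttWithMetadata(audio).transcripts[0]
        text = "".join(token.text for token in best.tokens)
        results[lang] = (best.confidence, text)
    lang = max(results, key=lambda k: results[k][0])
    return lang, results[lang][1]
```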
Here are my thoughts on the multiple-language-models feature:
About using several language models: I am wondering if it is possible to use two language models in the decoding. The idea is to use a word-based LM (like the current one) together with a char-based LM. Edit: this is discussed in some recent papers, like the following one:
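For concreteness, a toy shallow-fusion-style scoring rule combining both LMs during beam search could look like the sketch below; the function and weights are made up for illustration and come from neither those papers nor DeepSpeech's decoder:

```python
# Toy combined score for one beam-search hypothesis: acoustic score plus
# weighted word-level and character-level LM scores plus a word-insertion
# bonus, all in log space. The default weights are arbitrary examples.
def combined_score(acoustic_logprob, word_lm_logprob, char_lm_logprob,
                   num_words, alpha_word=0.75, alpha_char=0.50, beta=1.85):
    return (acoustic_logprob
            + alpha_word * word_lm_logprob
            + alpha_char * char_lm_logprob
            + beta * num_words)
```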
@reuben I'd be interested in taking a look at this if you can give me more information. It would be great if you could query multiple models and take the result with the highest confidence value. Also, just to be clear, this only covers the acoustic model, not the KenLM model?
@dabinat The original idea as I understood it was to start the engine with several language models and one acoustic model, then be able to select at runtime which language model was active. For example, one might have an application with several distinct "command and control" aspects: one dealing with navigation (north, south, east, west), another dealing with questions (yes, no), another dealing with..., and each "command and control" aspect would correspond to a language model, and the aspect that was active could be changed at runtime.
@dabinat this bug is for using different language models, not acoustic models. The idea is that you can instantiate multiple language model instances and then, for every new stream, specify which LM to use for decoding. This would let you implement something like the two-LM flow from the issue description (a command LM first, then a general LM) in a voice-assistant-like UI.
To be honest, I'm not sure how useful this feature would be compared to implementing some LM fusion technique, which would let applications use a single general language model with contextual biasing. With that, an application wouldn't be forced to do things in two steps and start a new stream with a new LM, which might not fit the intended user flow.
In any case, the rough idea I had for how to implement/expose this is to separate language model instances into their own handle.
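To make that concrete, here is a hypothetical sketch of what scorer handles could look like from Python; Scorer and createStream(scorer=...) are invented names illustrating the proposal, not the real API, and the file names are placeholders:

```python
# Hypothetical API sketch: language models (scorers) as independent handles
# that are selected per stream. Scorer and createStream(scorer=...) do not
# exist in the current bindings; they only illustrate the proposal.
model = Model("output_graph.pbmm")

nav_scorer    = Scorer("navigation.scorer")       # north / south / east / west
yes_no_scorer = Scorer("yes_no.scorer")           # yes / no
text_scorer   = Scorer("general_english.scorer")  # free-form dictation

# Decode a command with the constrained LM...
stream = model.createStream(scorer=nav_scorer)
stream.feedAudioContent(command_audio)
command = stream.finishStream()

# ...then start a new stream with the general LM for free-form input.
stream = model.createStream(scorer=text_scorer)
stream.feedAudioContent(message_audio)
message = stream.finishStream()
```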
Yes, that's a good point. I thought perhaps one could process a stream multiple times with different LMs and choose words with the highest confidence, but on second thoughts I'm not sure how reliable that would be. My primary interest is in the ability for my application to infer which words may be likely to occur in the transcript and supply them to DeepSpeech in advance, which then assigns those words a slightly higher priority/weight. I thought this issue might get one step closer to that goal. A fusion model certainly would. I did some digging in the code a few weeks ago with this goal in mind but I wasn't able to figure out how to do it and KenLM seems to be quite poorly documented, so if you have any suggestions I'd definitely appreciate them. Having said that, the steps you outlined to accomplish this issue seem pretty straightforward and I'm still happy to do it if you think it would be useful.
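On the "supply likely words in advance" idea, a toy illustration of contextual biasing is sketched below; the function, weights, and example are invented for illustration and are not DeepSpeech's decoder:

```python
# Toy contextual biasing: beam-search hypotheses containing user-supplied
# "hot" words receive a small log-score bonus. Purely illustrative.
def biased_score(base_log_score, hypothesis_text, hot_words, bonus=2.0):
    words = set(hypothesis_text.split())
    return base_log_score + bonus * sum(1 for w in hot_words if w in words)

# Example: bias decoding toward a name the application expects to hear.
print(biased_score(-12.3, "call aunt frieda", hot_words={"frieda"}))  # -10.3
```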
I'm putting this in the 1.0.0 project, as that will contain some language model work, which may include this. |
I missed the fact that this was in 1.0; it involves more changes than the ones currently in #2681. It shouldn't be too difficult, but it will complicate the API a bit, as every call that creates a stream will need to take a scorer parameter.
I don't know if this has to be in 1.0. We can, if it's too much, delay it for 2.0. |
We're going to delay until 2.0 |
Would there be interest in integrating pyctcdecode? https://github.com/kensho-technologies/pyctcdecode |
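For reference, decoding a matrix of acoustic-model logits with pyctcdecode looks roughly like this (following its README); the label list and file paths are placeholders and must match the acoustic model's outputs:

```python
# Rough pyctcdecode usage per its README; paths and labels are placeholders.
# The label order must correspond to the acoustic model's output columns.
import numpy as np
from pyctcdecode import build_ctcdecoder

labels = list(" abcdefghijklmnopqrstuvwxyz'")

decoder = build_ctcdecoder(
    labels,
    kenlm_model_path="lm.arpa",  # optional KenLM language model
    alpha=0.5,                   # LM weight
    beta=1.0,                    # word-insertion bonus
)

logits = np.load("logits.npy")   # shape: (time_steps, n_labels)
print(decoder.decode(logits))
```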
One might have two language models, for example: one with just commands, and one for general English text. The first is used to detect voice commands, and the second to understand a parameter to a command, like, say, a text message the user wants to send.
Currently you can only have one decoder instance live at any given point, by calling enableDecoderWithKenLM. It'd be nice to be able to have several decoder instances and specify which one you want to use when you call decode.