Conceptual question about scorer #2217

fquirin · 2022-05-18T07:45:18Z

fquirin
May 18, 2022

Hello everybody,

I have a conceptual question about the LM aka scorer in Coqui.

A bit of background first. I've been working on open-source voice assistants for a long time now (e.g. SEPIA) and although the ASR systems have come a long way (since Sphinx4 ^^) it is still necessary to build small domain, custom LMs to get acceptable recognition quality.
Pre-trained models usually come with larger LMs and since it's often not practicable or possible to reuse the original training data I have to replace the original LM completely instead of simply "augmenting" it with my own data.

So here is my question. When I noticed that the Coqui models were performing quite OK even without a scorer I was wondering if it's possible to use a custom scorer, trained on a few dozens of own sentences and apply it in a way so that Coqui is preferring those sentences without loosing the original large vocabulary completely? Basically just shifting weights a bit?
I have a feeling that the alpha and beta arguments for the scorer might be able to control this but I can't find any explanation of what they actually do.

[EDIT] The 'hot words' feature seems to be very similar, but boosts only single words not sentences

Thanks in advance for any info or help 🙂
Florian

wasertech · 2022-06-02T15:40:09Z

wasertech
Jun 2, 2022
Collaborator

You are correct. Building your own little scorer for your task with a pre-trained model is the fastest way the improve your results.
There is practically no reason to fine-tune the acoustic models (they are quite on spot).
Try to aim for a scorer under 800 Mb.
To create your source_lm.txt file, try to use around 40 Go of normalized text and it should be good to optimize.
Use the lm_optimize python script to find the best values for alpha and beta so you can export with them, this will make $LM_N_TRIALS test to hyper-optimize the value of alpha and beta within the specified max range $LM_ALPHA_MAX and $LM_BETA_MAX.
As for word boost it can help here and there if you say a word a lot and it doesn't pick it just right but the word must be present in the language model to begin with.

# Make a dummy scorer to find apha and beta
generate_lm.py \
		--input_txt /mnt/extracted/sources_lm.txt \
		--output_dir /mnt/lm/ \
		--top_k ${LM_TOP_K} \
		--kenlm_bins ${HOMEDIR}/kenlm/build/bin/ \
		--arpa_order 4 \
		--max_arpa_memory "85%" \
		--arpa_prune "0|0|1" \
		--binary_a_bits 255 \
		--binary_q_bits 8 \
		--binary_type trie
generate_scorer_package \
		--checkpoint /mnt/models/ \
		--lm /mnt/lm/lm.binary \
		--vocab /mnt/lm/vocab-${LM_TOP_K}.txt \
		--package /mnt/lm/kenlm.scorer \
		--default_alpha 0.0 \
		--default_beta 0.0
# Find best values
lm_optimizer.py \
		--show_progressbar true \
		--train_cudnn true \
		--alphabet_config_path /mnt/models/alphabet.txt \
		--scorer_path /mnt/lm/kenlm.scorer \
		--feature_cache /mnt/sources/feature_cache \
		--test_files ${all_test_csv} \
		--test_batch_size ${TEST_BATCH_SIZE} \
		--n_hidden ${N_HIDDEN} \
		--lm_alpha_max ${LM_ALPHA_MAX} \
		--lm_beta_max ${LM_BETA_MAX} \
		--n_trials ${LM_N_TRIALS} \
		--checkpoint_dir /mnt/checkpoints/
# Repackage the scorer with the correct values
rm  /mnt/lm/kenlm.scorer
generate_scorer_package \
		--checkpoint /mnt/models/ \
		--lm /mnt/lm/lm.binary \
		--vocab /mnt/lm/vocab-${LM_TOP_K}.txt \
		--package /mnt/lm/kenlm.scorer \
		--default_alpha ${LM_ALPHA} \
		--default_beta ${LM_BETA}
# Testing with the best values
python -m coqui_stt_training.evaluate \
		--show_progressbar true \
		--train_cudnn true \
		${AMP_FLAG} \
		--alphabet_config_path /mnt/models/alphabet.txt \
		--scorer_path /mnt/lm/kenlm.scorer \
		--test_files ${all_test_csv} \
		--test_batch_size ${TEST_BATCH_SIZE} \
		--n_hidden ${N_HIDDEN} \
		--lm_alpha ${LM_ALPHA} \
		--lm_beta ${LM_BETA} \
		--checkpoint_dir /mnt/checkpoints/ \
		--test_output_file /mnt/models/test_output.json
# Exporting the best models
/tflite-venv/bin/python -m coqui_stt_training.export \
		--alphabet_config_path /mnt/models/alphabet.txt \
		--scorer_path /mnt/lm/kenlm.scorer \
		--feature_cache /mnt/sources/feature_cache \
		--n_hidden ${N_HIDDEN} \
		--beam_width ${BEAM_WIDTH} \
		--lm_alpha ${LM_ALPHA} \
		--lm_beta ${LM_BETA} \
		--load_evaluate "best" \
		${LOAD_CHECKPOINT_FROM} \
		--export_dir /mnt/models/ \
		--export_tflite true \
		${ALL_METADATA_FLAGS} \
		${METADATA_MODEL_NAME_FLAG}

12 replies

wasertech Sep 21, 2022
Collaborator

can you give some pointers for this?

Not really as I've never done it. 😅

I would imagine it would go something like this:

Find a nice paper describing a language modeling technique you like.
Implement it with C++ preferably and make a python wrapper to link it to STT just like with KenLM.

I'm not saying it would be easy, but from my understanding, STT is made in such a way, it could be done with minimal modifications to the code base.

There is not much documentation on this (that I can find at least), so a good understanding of KenLM and it's implementation inside STT would greatly help.

fquirin Sep 22, 2022
Author

Maybe the raspberry pi isn't the best platform for your needs.

It is the other way around. I'm trying to find the best ASR system for the Raspberry Pi not the best hardware for Coqui ;-)

STT supports custom language models so technically nothing is stopping you to implement another LM with those desired traits.

As much as I'd like to spend countless hours on this I will leave this to the ASR experts with years of experience ^^.
Maybe there'll come a day where the Hot-Word boost feature will actually become a "Hot-Sentence" boost 🙂

HarikalarKutusu Sep 22, 2022

Maybe there'll come a day where the Hot-Word boost feature will actually become a "Hot-Sentence" boost

If you are working with small LMs with special command like structures, there is a trick: Glue the words... E.g.

"hey home, open the door" => "heyhomeopenthedoor", hot-word boost it and use it as an intent...

fquirin Sep 22, 2022
Author

That is actually working? Shouldn't that be out of vocabulary according to the docs:

"It is worth noting that boosting non-existent words in scorer (mostly proper nouns) or a word that share no phonetic prefix with other word in the input audio don’t change the final transcription."

HarikalarKutusu Sep 22, 2022

If you are creating your own scorer, you will have it in the language model. I applied the idea in my voice chess for the coordinates. E.g. instead of adding "alpha two" I added "alphatwo" and that made the recognition much better. I used a generator for that:

https://github.com/HarikalarKutusu/3d-voice-chess/tree/main/language-model-creation/chess-sentence-generators

There are no hot-word boosting in that implementation though, there were too many to list.

I.e. you can also use: "heyhome" "openthedoor" like sub sentences, the first for wakeup, the second for intent.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Conceptual question about scorer #2217

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 1 comment 12 replies

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Conceptual question about scorer #2217

fquirin May 18, 2022

Replies: 1 comment · 12 replies

wasertech Jun 2, 2022 Collaborator

wasertech Sep 21, 2022 Collaborator

fquirin Sep 22, 2022 Author

HarikalarKutusu Sep 22, 2022

fquirin Sep 22, 2022 Author

HarikalarKutusu Sep 22, 2022

fquirin
May 18, 2022

Replies: 1 comment 12 replies

wasertech
Jun 2, 2022
Collaborator

wasertech Sep 21, 2022
Collaborator

fquirin Sep 22, 2022
Author

fquirin Sep 22, 2022
Author