Synthesis: VITS voices have various pronunciation errors that can be fixed using lexicons #15

rotemdan · 2023-07-29T07:07:42Z

In the VITS and eSpeak engines, the text is converted to phonemes using the phoneme events produced by the eSpeak speech synthesizer during synthesis. eSpeak does a reasonable job in some languages (especially English), but has many errors and inaccuracies in others.

Fortunately, we can improve these inaccurate pronunciations, and thus improve the quality of the VITS voices, by applying corrections using lexicon files. The lexicons are applied as part of a preprocessing step, in JavaScript, to specify the exact pronunciations of some words, before the tokens are sent to eSpeak.

An example of a lexicon file is the English heteronyms file at data/lexicons/heteronyms.en.json. It specifies pronunciations of various English words, like "read", "present", "content" and "use", that are written the same but pronounced differently depending on context.

The heteronym lexicon demonstrates more advanced capabilities of the lexicon system, but lexicon files can, of course, be used in a simpler way, to correct pronunciations when there is only a single alternative.

The overall structure of a basic correction entry looks like this:

{
	"en":
	{
		"hello": [{
			"pronunciation": {
				"espeak": {
					"en-us": "h ə l ˈoʊ",
					"en-gb-x-rp": "h ə l ˈəʊ"
				}
			}
		}]
	}
}
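To make the nesting easier to follow, here is a minimal TypeScript sketch of how such an entry could be typed and looked up. The type names and the lookupPronunciation helper are hypothetical and only mirror the JSON shown above; they are not Echogarden's actual definitions. Lowercasing the word before lookup is also an assumption.

```typescript
// Hypothetical types mirroring the JSON structure above (not Echogarden's own types).
type DialectPronunciations = { [dialect: string]: string };

interface LexiconEntry {
	pronunciation: {
		espeak: DialectPronunciations;
	};
}

// language code -> word -> list of entries
type Lexicon = { [language: string]: { [word: string]: LexiconEntry[] } };

const lexicon: Lexicon = {
	en: {
		hello: [{
			pronunciation: {
				espeak: {
					"en-us": "h ə l ˈoʊ",
					"en-gb-x-rp": "h ə l ˈəʊ",
				},
			},
		}],
	},
};

// Returns the eSpeak phoneme string from the first entry for the word,
// or undefined when the lexicon has no correction for it.
// Assumption: lookups are case-insensitive, so the word is lowercased first.
function lookupPronunciation(
	lex: Lexicon,
	language: string,
	dialect: string,
	word: string
): string | undefined {
	const entries = lex[language]?.[word.toLowerCase()];
	const firstEntry = entries?.[0];
	return firstEntry?.pronunciation.espeak[dialect];
}
```

A word with no entry simply yields undefined, which in this sketch would mean "leave the word to eSpeak's default pronunciation".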

You can specify a custom lexicon JSON file for synthesis (as well as alignment), using the customLexiconPaths option, which accepts an array of file paths:

echogarden speak-file myText.txt --customLexiconPaths=['myLexicon.json']
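If you keep corrections as a simple word-to-phonemes list, a small script can produce the JSON for a file like myLexicon.json. The following is only a sketch: buildLexicon is a hypothetical helper, not part of Echogarden, and it assumes a single language and a single eSpeak dialect per file.

```typescript
// Hypothetical helper (not part of Echogarden): builds a lexicon object for one
// language and one eSpeak dialect from a plain word -> phonemes map.
function buildLexicon(
	language: string,
	dialect: string,
	corrections: Record<string, string>
) {
	const words: Record<string, { pronunciation: { espeak: Record<string, string> } }[]> = {};

	for (const [word, phonemes] of Object.entries(corrections)) {
		// Each word maps to an array holding a single basic correction entry.
		words[word] = [{ pronunciation: { espeak: { [dialect]: phonemes } } }];
	}

	return { [language]: words };
}

// The phoneme string is taken from the example entry above.
const customLexicon = buildLexicon("en", "en-us", { hello: "h ə l ˈoʊ" });

// Tab-indented JSON text, ready to be saved as a lexicon file
// (for instance with fs.writeFileSync in Node.js).
const customLexiconJson = JSON.stringify(customLexicon, null, "\t");
```

The resulting text can be saved under any file name and passed through customLexiconPaths as in the command above.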

The only engines that currently make use of custom lexicons are vits and espeak for synthesis, and dtw and dtw-ra for alignment.

We can also collect these pronunciation corrections, add them to the main repository, and load them by default, to improve pronunciations across many different languages.


rotemdan added the bug and synthesis labels on Jul 29, 2023
