
Synthesis: VITS voices have various pronunciation errors that can be fixed using lexicons #15

Open
rotemdan opened this issue Jul 29, 2023 · 0 comments
Labels: bug (Something isn't working), synthesis (Issue related to speech synthesis)

rotemdan commented Jul 29, 2023

In the VITS and eSpeak engines, text is converted to phonemes using the phoneme events that the eSpeak speech synthesizer produces during synthesis. eSpeak does a reasonable job in some languages (especially English), but has many errors and inaccuracies in others.

Fortunately, we can correct these inaccurate pronunciations, and thus improve the quality of the VITS voices, using lexicon files. The lexicons are applied in a JavaScript preprocessing step that specifies the exact pronunciations of some words before the tokens are sent to eSpeak.
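To make the preprocessing idea concrete, here is a minimal sketch in TypeScript. It is not Echogarden's actual implementation: the `PreprocessedToken` type, the `preprocess` function, the whitespace tokenization, and the flat word-to-phonemes map are all assumptions made for illustration. It only shows the general shape of the step: words with a known correction carry their phonemes forward, and everything else is left as plain text for eSpeak to handle.

```typescript
// Hypothetical token type: plain text for eSpeak, or a word with a fixed pronunciation.
type PreprocessedToken =
	| { kind: 'text'; text: string }
	| { kind: 'phonemes'; text: string; phonemes: string }

// Splits text on whitespace and attaches a correction to each word that has one.
// `corrections` is an assumed flat map from lowercase word to a phoneme string.
function preprocess(text: string, corrections: Record<string, string>): PreprocessedToken[] {
	const words = text.split(/\s+/).filter(word => word.length > 0)

	return words.map((word): PreprocessedToken => {
		// Normalize for lookup: lowercase and strip surrounding punctuation.
		const key = word.toLowerCase().replace(/^[.,!?;:"()]+|[.,!?;:"()]+$/g, '')

		if (Object.prototype.hasOwnProperty.call(corrections, key)) {
			return { kind: 'phonemes', text: word, phonemes: corrections[key] }
		}

		return { kind: 'text', text: word }
	})
}
```

A real implementation would need proper word segmentation and per-language handling; the whitespace split here is only a placeholder.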

An example of a lexicon file is the English heteronyms file in data/lexicons/heteronyms.en.json. It specifies pronunciations of various English words, like "read", "present", "content" and "use", that are spelled the same but pronounced differently depending on context.

The heteronym lexicon demonstrates the more advanced capabilities of the lexicon system, but lexicon files can, of course, also be used in a simpler way: to correct the pronunciation of a word that has only a single alternative.

The overall structure of a basic correction entry looks like this:

{
	"en":
	{
		"hello": [{
			"pronunciation": {
				"espeak": {
					"en-us": "h ə l ˈoʊ",
					"en-gb-x-rp": "h ə l ˈəʊ"
				}
			}
		}]
	}
}
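
As a reading aid, the structure above can be described with TypeScript types and a small lookup helper. This is a sketch only: the type names, the `lookupPronunciation` function, the lowercase normalization, and the choice to use the first entry are assumptions, not Echogarden's actual code. It also covers only the fields shown in the example, so heteronym entries may carry fields that are not modeled here.

```typescript
// Hypothetical types mirroring the JSON structure shown in the example entry.
type EspeakPronunciations = Record<string, string> // dialect -> space-separated phonemes

interface LexiconEntry {
	pronunciation?: { espeak?: EspeakPronunciations }
}

type Lexicon = Record<string, Record<string, LexiconEntry[]>> // language -> word -> entries

const exampleLexicon: Lexicon = {
	en: {
		hello: [{
			pronunciation: {
				espeak: {
					'en-us': 'h ə l ˈoʊ',
					'en-gb-x-rp': 'h ə l ˈəʊ',
				},
			},
		}],
	},
}

// Returns the phonemes of the first entry for a word in the given dialect,
// or undefined when the lexicon has no correction for it.
function lookupPronunciation(lexicon: Lexicon, language: string, dialect: string, word: string): string[] | undefined {
	const entries = lexicon[language]?.[word.toLowerCase()]
	const phonemeString = entries?.[0]?.pronunciation?.espeak?.[dialect]

	return phonemeString?.split(' ')
}
```

For example, `lookupPronunciation(exampleLexicon, 'en', 'en-gb-x-rp', 'hello')` yields the four phonemes of the British pronunciation, while a word that is absent from the lexicon yields `undefined`, so the caller can fall back to eSpeak's own output.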

You can specify one or more custom lexicon JSON files for synthesis (as well as alignment) using the customLexiconPaths option, which accepts an array of file paths:

echogarden speak-file myText.txt --customLexiconPaths=['myLexicon.json']
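
Since the option takes several paths, the lexicons have to be combined somehow. The issue does not say how Echogarden does this, so the following TypeScript sketch is purely illustrative: `mergeLexicons` is a hypothetical helper, and the rule that a later lexicon overrides an earlier one for the same word is an assumption. It operates on lexicon objects that have already been parsed from JSON.

```typescript
// Hypothetical types covering only the fields shown in the basic entry example.
type LexiconEntry = { pronunciation?: { espeak?: Record<string, string> } }
type Lexicon = Record<string, Record<string, LexiconEntry[]>> // language -> word -> entries

// Merges lexicons per language. Assumption: when the same word appears in
// several lexicons, the entry from the later lexicon replaces the earlier one.
function mergeLexicons(lexicons: Lexicon[]): Lexicon {
	const merged: Lexicon = {}

	for (const lexicon of lexicons) {
		for (const language of Object.keys(lexicon)) {
			merged[language] = { ...(merged[language] ?? {}), ...lexicon[language] }
		}
	}

	return merged
}
```

Under this assumed rule, placing a custom lexicon last in the array lets it override a default entry for a word without affecting the other words.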

The only engines that currently make use of lexicons are vits and espeak for synthesis, and dtw and dtw-ra for alignment.

We can also collect these pronunciation corrections, add them to the main repository, and load them by default, to improve pronunciations across many different languages.

rotemdan added the bug and synthesis labels on Jul 29, 2023