In the VITS and eSpeak engines, the text is converted to phonemes using the phoneme events produced by the eSpeak speech synthesizer during synthesis. eSpeak does a reasonable job in some languages (especially English), but has many errors and inaccuracies in others.
Fortunately, we can improve these inaccurate pronunciations, and thus improve the quality of the VITS voices, by applying corrections using lexicon files. The lexicons are applied as part of a preprocessing step, in JavaScript, to specify the exact pronunciations of some words, before the tokens are sent to eSpeak.
An example of a pronunciation file is the English heteronyms file in `data/lexicons/heteronyms.en.json`. It specifies pronunciations of various English words, like "read", "present", "content" and "use", that are written the same but pronounced differently based on context.
The heteronym lexicon demonstrates more advanced capabilities of the lexicon system, but lexicon files can, of course, also be used in a simpler way, to correct a pronunciation when there is only a single alternative.
The overall structure of a basic correction entry would look like:
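As a rough sketch of that shape, written as TypeScript types plus one entry: the nesting (language, then word, then a list of entries holding a per-engine, per-dialect phoneme string) and every field name below are assumptions made for illustration, not a confirmed schema, and the phoneme string is only an illustrative value. The `lookupPronunciation` helper is likewise hypothetical.

```typescript
// Hypothetical shape of a basic lexicon correction entry.
// All field names are assumptions for illustration, not a confirmed schema.
interface LexiconEntry {
  pronunciation: {
    // engine name -> (dialect -> space-separated phonemes)
    [engine: string]: { [dialect: string]: string };
  };
}

// language code -> lowercase word -> candidate entries
type Lexicon = Record<string, Record<string, LexiconEntry[]>>;

const lexicon: Lexicon = {
  en: {
    tomato: [
      {
        pronunciation: {
          // Illustrative phoneme string only.
          espeak: { "en-us": "t ə m ˈeɪ t oʊ" },
        },
      },
    ],
  },
};

// Return the single-alternative correction for a word, or undefined
// when the lexicon has no entry for it (hypothetical helper).
function lookupPronunciation(
  lex: Lexicon,
  language: string,
  word: string,
  dialect: string
): string | undefined {
  const entries = lex[language]?.[word.toLowerCase()];
  return entries?.[0]?.pronunciation["espeak"]?.[dialect];
}

// Serialized, this is what a JSON lexicon file with this shape would contain.
console.log(JSON.stringify(lexicon, null, 2));
```

With only one alternative per word, the lookup simply takes the first entry. A heteronym entry would need extra context fields to choose between several alternatives.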
You can specify custom lexicon JSON files for synthesis (as well as alignment) using the `customLexiconPaths` option, which accepts an array of file paths.
Currently, the only engines that make use of custom lexicons are `vits` and `espeak` for synthesis, and `dtw` and `dtw-ra` for alignment.
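A minimal sketch of how such options might be assembled and checked against the engine list above. Only `customLexiconPaths` and the four engine names come from the text; the `engine` field, the `LexiconOptions` type, the `usesCustomLexicons` helper and the file path are assumptions made for illustration.

```typescript
// Engines the text names as lexicon-aware.
const lexiconSynthesisEngines = ["vits", "espeak"] as const;
const lexiconAlignmentEngines = ["dtw", "dtw-ra"] as const;

// Hypothetical options shape: only `customLexiconPaths` is named in the text;
// the `engine` field is an assumption.
interface LexiconOptions {
  engine: string;
  customLexiconPaths: string[];
}

// True when the chosen engine would apply the given custom lexicons
// for the given task (hypothetical helper).
function usesCustomLexicons(
  task: "synthesis" | "alignment",
  options: LexiconOptions
): boolean {
  const supported: readonly string[] =
    task === "synthesis" ? lexiconSynthesisEngines : lexiconAlignmentEngines;
  return (
    supported.indexOf(options.engine) !== -1 &&
    options.customLexiconPaths.length > 0
  );
}

const synthesisOptions: LexiconOptions = {
  engine: "vits",
  customLexiconPaths: ["./my-lexicon.en.json"], // hypothetical path
};

console.log(usesCustomLexicons("synthesis", synthesisOptions));
```

A check like this can catch the case where a lexicon is supplied to an engine that would silently ignore it.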
We can also collect these pronunciation corrections, add them to the main repository, and load them by default, to improve pronunciations across many different languages.