Training ForwardTacotron on a dataset comprised of multiple male voices as a single speaker dataset? #59
Hi,
I was wondering if it is possible to train on a dataset that has, let's say, 2-3 male voices, each with about 10 hours of data. Will the end result be a good, neutral male voice?

Comments
Hi, the short answer is that the voice is going to be rubbish, as the model will average them. I will probably implement a multispeaker version soon. The idea is to condition each voice on a speaker embedding, e.g. from https://github.com/resemble-ai/Resemblyzer, and to provide a reference embedding at inference. I previously had some success with this repo doing just that, but that branch is already outdated (it was made before pitch and energy conditioning were implemented).
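For reference, a minimal sketch of extracting such a speaker embedding with Resemblyzer; the conditioning step itself (described in the final comment) is an assumption about how it would be wired in, not code from this repo:

```python
from pathlib import Path
from resemblyzer import VoiceEncoder, preprocess_wav

# Load and preprocess a reference utterance for the target speaker.
wav = preprocess_wav(Path("reference_speaker.wav"))

# The pretrained encoder maps an utterance to a 256-dim speaker embedding.
encoder = VoiceEncoder()
speaker_embedding = encoder.embed_utterance(wav)  # numpy array, shape (256,)

# Conditioning idea (an assumption, not implemented here): broadcast this
# embedding over time, concatenate it to the text-encoder outputs before the
# duration/pitch/energy predictors, and pass a reference embedding at
# inference time to select the voice.
```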
I've read in a master's thesis on TTS for Finnish that the author got good results with a "warm start" method: he first trained a base model on about 20 hours of multiple voices and then trained a single voice on top of that model. Would this idea work with ForwardTacotron?
I still think this makes much more sense if you have the voice conditioning. Do the authors share their model architecture? I suspect they are using some speaker embedding.
The author used Nvidia's implementation of Tacotron; they didn't change anything in the code. "Using a warm-starting training schema yielded better results. First, a general …"
Ah, very interesting. It could well be tried with this repo then. If there is enough data for each speaker, it could work. Just try it out and throw everything in. Carefully watch the Tacotron training to see if the attention score jumps above 0.5 between 3k and 10k steps. If it's successful, you can wait until the alignments are extracted (after 40k Tacotron training steps), then train your multispeaker ForwardTacotron until 50k steps or so, and then start messing with the data (replace it with the single-speaker data).
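As a concrete starting point, a minimal warm-start sketch in PyTorch, assuming the checkpoint stores its weights under a "model" key (the key name and paths are assumptions, not the repo's actual layout):

```python
import torch
from torch import nn

def warm_start(model: nn.Module, ckpt_path: str) -> nn.Module:
    """Initialize a model from a multispeaker checkpoint before fine-tuning
    on single-speaker data. The 'model' key is an assumed checkpoint layout."""
    ckpt = torch.load(ckpt_path, map_location="cpu")
    state = ckpt.get("model", ckpt)  # fall back to a bare state dict
    # strict=False tolerates layers that differ between the two runs.
    missing, unexpected = model.load_state_dict(state, strict=False)
    if missing or unexpected:
        print(f"warm start: {len(missing)} missing / {len(unexpected)} unexpected keys")
    return model
```

After loading, you would simply resume training on the single-speaker dataset; whether to also restore the optimizer state or re-initialize it is a fine-tuning choice.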
I will report my findings. Thank you for your help.
Good luck, lmk how it goes!
I haven't tried it, but I found that speaker selection isn't random; it is usually driven by similarity to the training-data sentences. Unfortunately, in my case it often overrides the speaker embedding: pick a sentence from one speaker's training data and the embedding vector of another speaker, and you usually still get output in the first speaker's voice, even if you slightly modify the sentence. For very long sentences it does sometimes switch mid-sentence.
I have another question regarding the fine-tuning of an existing model. How would I go about this? |
The Tacotron is only used to extract phoneme durations from the dataset. Once you have processed all voices at once, you can simply use the latest forward model for fine-tuning. You will probably need to manually filter the data according to the speaker.
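For the filtering step, a small sketch assuming an LJSpeech-style pipe-separated metadata file with a trailing speaker column (the file layout and the speaker column are assumptions):

```python
from pathlib import Path

def filter_metadata(meta_path: str, speaker_id: str, out_path: str) -> None:
    """Keep only one speaker's rows from a metadata file whose lines look
    like: file_id|text|speaker (the speaker column is an assumption)."""
    lines = Path(meta_path).read_text(encoding="utf-8").splitlines()
    kept = [line for line in lines if line.split("|")[-1].strip() == speaker_id]
    Path(out_path).write_text("\n".join(kept), encoding="utf-8")

filter_metadata("data/metadata.csv", "speaker_01", "data/metadata_speaker_01.csv")
```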