Poor alignment when conditioned on reference audio #20
Hi, I guess there is a mismatch between your reference and training audio. For my demo page, I trained the model with Blizzard2011 data. Before that, I randomly selected 500 sentences as a test set, which is used to provide the reference audio. So in your experiment, the reference audio is from the Blizzard2011 database, but your model was trained with LJSpeech data. I'm sorry, but I'm now an intern at Tencent and I don't have the pre-trained model anymore.
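For reference, the held-out split described above amounts to something like the sketch below (file names and metadata layout are assumptions; Blizzard2011 preprocessing scripts vary):

```python
# Minimal sketch of holding out 500 sentences as a reference/test set.
# "blizzard2011_metadata.csv" is a hypothetical one-line-per-utterance file.
import random

with open("blizzard2011_metadata.csv", encoding="utf-8") as f:
    lines = f.read().splitlines()

random.seed(1234)           # fix the seed so the split is reproducible
random.shuffle(lines)

test_refs, train = lines[:500], lines[500:]   # 500 held-out reference sentences

with open("train_metadata.csv", "w", encoding="utf-8") as f:
    f.write("\n".join(train))
with open("test_reference_metadata.csv", "w", encoding="utf-8") as f:
    f.write("\n".join(test_refs))
```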
Hi, thanks for your quick response. I was under the impression that I could use any reference audio (as a style) and use the model to generate a new voice in that reference's style. Does it matter which reference I use? Does it have to be from the same distribution as the training data? My assumption was that the model learns the training-data distribution automatically and generates new audio/wav files in the style given by the reference audio. Please correct me if I am wrong. Thanks again for your help.
@mohsinjuni Hi, I'm in the same situation as you. Have you figured out whether the reference audio can be any audio, or must it come from the same dataset? Thanks.
Hi, I have a similar question. Has anyone found a solution for this?
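For context on why the reference distribution matters: as I understand the GST paper, the style embedding is an attention-weighted mix over a fixed token bank learned during training, so an out-of-distribution reference still produces *some* embedding, but a token mix the decoder may never have been conditioned on. A minimal single-head sketch (shapes, sizes, and random weights are illustrative assumptions, not the repo's actual code, and the paper uses multi-head attention):

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens, token_dim, ref_dim = 10, 256, 128   # illustrative sizes

# The token bank and query projection are learned during training;
# random values here just stand in for trained weights.
style_tokens = rng.normal(size=(num_tokens, token_dim))
W_q = rng.normal(size=(ref_dim, token_dim))

def style_embedding(ref_embedding):
    """Single-head attention over the learned style-token bank."""
    query = ref_embedding @ W_q                         # (token_dim,)
    scores = style_tokens @ query / np.sqrt(token_dim)  # (num_tokens,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                            # softmax over tokens
    return weights @ style_tokens                       # weighted mix of tokens

# An unusual reference yields unusual attention weights over tokens that
# were fit to the training data -- a plausible source of broken alignment.
ref = rng.normal(size=ref_dim)      # stands in for the reference-encoder output
print(style_embedding(ref).shape)   # (256,)
```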
First of all, thanks very much for taking the time to implement this. I have listened to the Audio Samples here and the results are amazing. However, I am unable to replicate this behavior. Could you please help?
I have trained gst-tacotron for 200K steps on LJSpeech-1.1 with the default hyperparameters (voice samples attached: SampleAudios.zip). The encoder-decoder attention aligns well during training. However, during inference, when conditioned on an unseen reference audio (I used the 2nd target reference audio from here), the alignment does not hold.
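One thing worth checking when conditioning on an unseen reference: the reference features must be extracted with the same mel parameters used at training time, or the reference encoder sees mismatched inputs. A sketch of the assumed preprocessing (the parameter values are assumptions to compare against the repo's hparams, not the repo's actual pipeline):

```python
# Hypothetical reference-audio preprocessing; match n_fft/hop_length/n_mels
# to the training hparams, otherwise this is another possible failure mode.
import librosa
import numpy as np

wav, sr = librosa.load("reference.wav", sr=22050)        # LJSpeech sample rate
mel = librosa.feature.melspectrogram(
    y=wav, sr=sr, n_fft=1024, hop_length=256, n_mels=80  # assumed hparams
)
log_mel = np.log(np.maximum(mel, 1e-5)).T                # (frames, 80)
print(log_mel.shape)
```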
The following is from training step 200000:
However, when I evaluated with the 203K checkpoint, conditioned on the reference audio discussed above, I get the following:
Without conditioning (i.e., with a random style):
Even the style transfer in the voice does not make much difference.
Please find the attached zipped file for the voice samples.
My Questions:
1. Is there anything I can change to get better-quality audio and alignment?
2. Could you please share the pre-trained model you used to generate the Audio Samples here?

Thanks for your help in advance.