poor alignment when conditioned on reference audios #20

Open
mohsinjuni opened this issue Sep 26, 2018 · 4 comments

@mohsinjuni

First of all, thanks very much for taking the time to implement this. I have listened to the Audio Samples here and the results are amazing. However, I am unable to replicate this behavior. Could you please help?

I have trained gst-tacotron for 200K steps on LJSpeech-1.1 with the default hyperparameters (samples attached in SampleAudios.zip). The encoder and decoder align well during training. However, during inference, when conditioned on an unseen reference audio (I used the 2nd target reference audio from here), the alignment does not hold.

The following alignment is from training step 200000:

step-200000-align

However, when I evaluate the 203K checkpoint conditioned on the reference audio discussed above, I get the following:

eval-203000_ref-sample2-align

Without conditioning (i.e., with random style token weights):

eval-203000_ref-randomweight-align

Even the style transfer in the voice does not make much of a difference.

Please see the attached SampleAudios.zip for the voice samples.

My Questions:

  • Is there anything I can change to get better quality audio and alignment? Thanks in advance for your help.

  • Could you please share the pre-trained model you used to generate the Audio Samples here?

@syang1993
Owner

Hi, I guess there is a mismatch between your reference and training audio. For my demo page, I trained the model with the Blizzard 2011 data. Before training, I randomly selected 500 sentences as a test set, which is what provides the reference audio. So in your experiment the reference audio is from the Blizzard 2011 database, but your model was trained with LJSpeech data.

I'm sorry, I'm an intern at Tencent now, so I don't have the pre-trained model anymore.
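Independent of the corpus mismatch, one thing worth double-checking when conditioning on an external clip is that the reference is converted to features with the same audio settings used at training time (sample rate, frame shift, number of mel bands). A minimal librosa sketch, assuming typical LJSpeech-style values; the actual numbers should be taken from the repo's hyperparameter file:

```python
# Hedged sketch: load a reference clip and compute mel features with the
# same settings the model was trained on.  The constants below are typical
# LJSpeech-style values, not read from this repo; adjust them to match the
# training hyperparameters.
import librosa
import numpy as np

SAMPLE_RATE = 22050   # must match the training sample rate
N_FFT       = 2048
HOP_LENGTH  = 276     # ~12.5 ms frame shift at 22050 Hz
N_MELS      = 80

def reference_mel(path):
    # librosa resamples to SAMPLE_RATE on load, so a 16 kHz or 44.1 kHz
    # reference clip is brought to the training rate first.
    wav, _ = librosa.load(path, sr=SAMPLE_RATE)
    mel = librosa.feature.melspectrogram(
        y=wav, sr=SAMPLE_RATE, n_fft=N_FFT,
        hop_length=HOP_LENGTH, n_mels=N_MELS)
    # Log compression, roughly what Tacotron-style front ends feed the
    # reference encoder (the repo's own audio code may normalize differently).
    return librosa.power_to_db(mel, ref=np.max).T   # (frames, n_mels)

mel = reference_mel("reference_sample2.wav")  # hypothetical file name
print(mel.shape)
```

If the reference encoder is fed features computed with a different sample rate or frame shift than at training time, the style embedding can be off even for in-corpus clips.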

@mohsinjuni
Author

Hi, thanks for your quick response. I was under the impression that I could use any reference audio (as a style) and use the model to generate new speech in that reference audio's style. Does it matter which reference I use? Does it have to be from the same distribution as the training data? My assumption was that the model learns the training-data distribution automatically and generates new audio/wav files with the style given in the training data. Please correct me if I am wrong. Thanks again for your help.
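For context on why the corpus matters: in GST-Tacotron the reference clip does not set the style directly. A reference-encoder output is used as a query that attends over a small bank of style tokens learned on the training corpus, and the style embedding is the resulting weighted sum of those tokens. A minimal numpy sketch of that idea (illustrative names and shapes only, not the repo's actual code):

```python
# Illustrative-only sketch of GST-style conditioning: attention over a
# learned token bank, queried by the reference-encoder output.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

num_tokens, token_dim, ref_dim = 10, 256, 128

# Stand-ins for learned parameters (trained jointly with the TTS model).
style_tokens = np.random.randn(num_tokens, token_dim)   # the "global style tokens"
query_proj   = np.random.randn(ref_dim, token_dim)      # reference encoding -> query

def style_embedding(ref_encoding):
    """Weighted sum of style tokens, with weights given by attention."""
    query  = ref_encoding @ query_proj                   # (token_dim,)
    scores = style_tokens @ query / np.sqrt(token_dim)   # (num_tokens,)
    return softmax(scores) @ style_tokens                # (token_dim,)

# Conditioned on a reference clip: its encoding picks the token weights.
emb_ref = style_embedding(np.random.randn(ref_dim))

# Unconditioned "random weights" evaluation: sample the weights directly.
emb_rand = softmax(np.random.randn(num_tokens)) @ style_tokens
```

So any reference clip yields some combination of tokens learned from the training data; an out-of-corpus clip still produces a valid embedding, but it may land in a region of style space the decoder never saw during training, which would be consistent with the alignment breaking down.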

@liangshuang1993

@mohsinjuni Hi, I'm in the same situation as you. Have you figured out whether the reference audio can be any audio, or whether it must be from the same dataset? Thanks.

@shrinidhin

Hi, I have a similar question. Has anyone found a solution to this?
