poor alignment when conditioned on reference audios #20

Open
mohsinjuni opened this issue Sep 26, 2018 · 4 comments

@mohsinjuni

First of all, thanks very much for taking the time to implement this. I have listened to the Audio Samples here and the results are amazing. However, I am unable to replicate this behavior. Could you please help?

I have trained gst-tacotron for 200K steps on LJSpeech-1.1 with the default hyperparameters (samples attached in SampleAudios.zip). The encoder and decoder align well during training. However, during inference, when conditioned on an unseen reference audio (I used the 2nd target reference audio from here), the alignment does not hold.

The following alignment is from training step 200000:

step-200000-align

However, when I evaluate the 203K checkpoint conditioned on the reference audio discussed above, I get the following:

eval-203000_ref-sample2-align

Without conditioning (i.e., with random style token weights):

eval-203000_ref-randomweight-align

Even the style transfer in the voice does not make much of a difference.

Please see the attached SampleAudios.zip for the voice samples.

My Questions:

  • Is there anything I can change to get better quality audio and alignment? Thanks in advance for your help.

  • Could you please share the pre-trained model you used to generate the Audio Samples here?

@syang1993
Owner

Hi, I guess there is a mismatch between your reference and training audio. For my demo page, I trained the model with the Blizzard 2011 data. Before training, I randomly selected 500 sentences as a test set, which is what provides the reference audio. So in your experiment the reference audio is from the Blizzard 2011 database, but your model was trained with LJSpeech data.

I'm sorry, I'm an intern at Tencent now, so I don't have the pre-trained model anymore.
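Independent of the corpus mismatch, one thing worth double-checking when conditioning on an external clip is that the reference is converted to features with the same audio settings used at training time (sample rate, frame shift, number of mel bands). A minimal librosa sketch, assuming typical LJSpeech-style values; the actual numbers should be taken from the repo's hyperparameter file:

```python
# Hedged sketch: load a reference clip and compute mel features with the
# same settings the model was trained on.  The constants below are typical
# LJSpeech-style values, not read from this repo; adjust them to match the
# training hyperparameters.
import librosa
import numpy as np

SAMPLE_RATE = 22050   # must match the training sample rate
N_FFT       = 2048
HOP_LENGTH  = 276     # ~12.5 ms frame shift at 22050 Hz
N_MELS      = 80

def reference_mel(path):
    # librosa resamples to SAMPLE_RATE on load, so a 16 kHz or 44.1 kHz
    # reference clip is brought to the training rate first.
    wav, _ = librosa.load(path, sr=SAMPLE_RATE)
    mel = librosa.feature.melspectrogram(
        y=wav, sr=SAMPLE_RATE, n_fft=N_FFT,
        hop_length=HOP_LENGTH, n_mels=N_MELS)
    # Log compression, roughly what Tacotron-style front ends feed the
    # reference encoder (the repo's own audio code may normalize differently).
    return librosa.power_to_db(mel, ref=np.max).T   # (frames, n_mels)

mel = reference_mel("reference_sample2.wav")  # hypothetical file name
print(mel.shape)
```

If the reference encoder is fed features computed with a different sample rate or frame shift than at training time, the style embedding can be off even for in-corpus clips.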

@mohsinjuni
Author

Hi, thanks for your quick response. I was under the impression that I could use any reference audio (as a style) and use the model to generate new speech in that reference audio's style. Does it matter which reference I use? Does it have to be from the same distribution as the training data? My assumption was that the model learns the training-data distribution automatically and generates new audio/wav files with the style given in the training data. Please correct me if I am wrong. Thanks again for your help.
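For context on why the corpus matters: in GST-Tacotron the reference clip does not set the style directly. A reference-encoder output is used as a query that attends over a small bank of style tokens learned on the training corpus, and the style embedding is the resulting weighted sum of those tokens. A minimal numpy sketch of that idea (illustrative names and shapes only, not the repo's actual code):

```python
# Illustrative-only sketch of GST-style conditioning: attention over a
# learned token bank, queried by the reference-encoder output.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

num_tokens, token_dim, ref_dim = 10, 256, 128

# Stand-ins for learned parameters (trained jointly with the TTS model).
style_tokens = np.random.randn(num_tokens, token_dim)   # the "global style tokens"
query_proj   = np.random.randn(ref_dim, token_dim)      # reference encoding -> query

def style_embedding(ref_encoding):
    """Weighted sum of style tokens, with weights given by attention."""
    query  = ref_encoding @ query_proj                   # (token_dim,)
    scores = style_tokens @ query / np.sqrt(token_dim)   # (num_tokens,)
    return softmax(scores) @ style_tokens                # (token_dim,)

# Conditioned on a reference clip: its encoding picks the token weights.
emb_ref = style_embedding(np.random.randn(ref_dim))

# Unconditioned "random weights" evaluation: sample the weights directly.
emb_rand = softmax(np.random.randn(num_tokens)) @ style_tokens
```

So any reference clip yields some combination of tokens learned from the training data; an out-of-corpus clip still produces a valid embedding, but it may land in a region of style space the decoder never saw during training, which would be consistent with the alignment breaking down.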

@liangshuang1993

@mohsinjuni Hi, I'm in the same situation as you. Have you figured out whether the reference audio can be any audio, or whether it must be from the same dataset? Thanks.

@shrinidhin

Hi, I have a similar question. Has anyone found a solution to this?
