Model Release: Tacotron2 with Discrete Graves Attention - LJSpeech #346

Closed
erogol opened this issue Feb 6, 2020 · 36 comments
Labels
model-release explanation for new model releases

Comments

@erogol
Contributor

erogol commented Feb 6, 2020

Model Link: https://drive.google.com/drive/folders/12Ct0ztVWHpL7SrEbUammGMmDopOKL9X_?usp=sharing

This model is trained with Discrete Graves attention and a BatchNorm prenet. It produces good examples with robust attention alignment without any inference-time tricks. You can even hear breathing effects in the pauses with this model.

You can also use this TTS model with the PWGAN or WaveRNN vocoders. PWGAN provides real-time voice synthesis; WaveRNN is slower but provides better quality.

https://github.com/erogol/ParallelWaveGAN
https://github.com/erogol/WaveRNN

(Ignore the small jiggle on the figures caused by TB)
image

image

@erogol erogol added the model-release explanation for new model releases label Feb 6, 2020
@m-toman
Contributor

m-toman commented Feb 6, 2020

Cool, overall do you prefer the forward attention model over this one?

@el-tocino

He mentioned "It is the best model so far trained." on the forward model post.

@m-toman
Contributor

m-toman commented Feb 7, 2020

Yes, perhaps I should be more specific. I assume this might be more because of the specific training regimen (switching to batch norm, training longer...) and handholding and not necessarily because of the attention mechanism itself.

For example, in practice I got many more good WaveRNN models with 10-bit mu-law output, although overall I think MoL leads to better results.

But I assume that cannot really be answered before lots of experiments with different datasets etc.

Also, the "more natural-sounding" seemed to be a comparison to the forward attention model.

@erogol
Contributor Author

erogol commented Feb 7, 2020

My two cents are

Graves attention is easier to train on different datasets and sounds more natural, with better prosody.
Forward attention leads to a more robust attention alignment and is easier to integrate with a PWGAN trained on ground-truth spectrograms.

Disclaimer: I was about to release the graves model but then I removed the whole model by mistake.
Now retraining it :)

@vcjob

vcjob commented Feb 10, 2020

@erogol, why do you prefer PWGAN over MelGAN? MelGAN is faster, while the quality seems fine. Btw, https://github.com/kan-bayashi/ParallelWaveGAN now provides MelGAN as well. Any plans to try it and adapt it for TTS?
Also, the official paper states that PWGAN's MOS is considerably higher than WaveNet's MOS. Is it hard to get similar results, or did the authors (https://arxiv.org/pdf/1910.11480.pdf) prettify it a little bit?

@m-toman
Contributor

m-toman commented Feb 10, 2020

@vcjob interesting, I find that even the official PWGAN samples of just vocoded recordings already exhibit some artefacts. r9y9's taco-wavenet (MoL) samples definitely sound better.
WaveRNN also gave me better results than MelGAN, although with LJ they are pretty similar. On other speakers WaveRNN is definitely better, and the results are better than those on the official MelGAN demo page.

I think the difference in the PWGAN paper is just because they used the espnet Gaussian Wavenet. I tried all their models and they are definitely not as good as r9y9s Wavenet.
No wonder considering how much effort went into that over the years...and of course, it's ultra-slow.

Also interesting how more or less nobody uses the original wavernn formulation. Even the amazon papers use a simple GRU followed by FCs predicting quantized output via softmax.

Well, in the end they're all annoying for some different reason ;)

EDIT: just realized the main author of PWGAN is r9y9. Even stranger he didn't use his own Wavenet implementation for comparison

@erogol
Contributor Author

erogol commented Feb 10, 2020

@vcjob PWGAN is easier to adapt to TTS and the model is smaller. I am now also training a MelGAN-type generator, as the official repo suggests. But it'd be nice to try the original MelGAN with TTS if you are interested.

A paper is a paper :).

@hadaev8

hadaev8 commented Feb 10, 2020

I'm trying your implementation of Graves attention with my fork of Nvidia tacotron2.
But sooner or later I get a gradient explosion. Could you advise how to deal with it?

@erogol
Contributor Author

erogol commented Feb 10, 2020

@hadaev8 is it the latest implementation?

@hadaev8

hadaev8 commented Feb 10, 2020

@erogol
Latest from master.

@erogol
Contributor Author

erogol commented Feb 10, 2020

@hadaev8 try the one in dev branch

@hadaev8

hadaev8 commented Feb 20, 2020

@erogol
This one works fine, but why is the max attention value 0.5?

@erogol
Contributor Author

erogol commented Feb 20, 2020

Because you are normalizing it. Actually, I guess this reduces the quality at inference time. If you have a solution for this, I'd like to know.

@hadaev8

hadaev8 commented Feb 20, 2020

@erogol
Could you point to the exact line with the normalisation? I'm a bit lost in the math.

@erogol
Contributor Author

erogol commented Feb 25, 2020

@hadaev8 it is not an explicit normalization.

Since values are bounded in [0, 1] even without discretization, with discretization they are also bounded in the same range. And because we do subtraction between time steps, the effective range comes close to zero. In our case it is [0, ~0.4]. So we could find a trick to expand this range.
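
The bounded range described above can be illustrated with a small sketch. It is not the Mozilla TTS code: it assumes a single logistic (sigmoid CDF) component and takes the weight of position j as the CDF difference across the unit interval around j. The function names and the values of `mu` and `sigma` are made up for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def discretized_weights(mu, sigma, num_positions):
    """Weight of position j = CDF(j + 0.5) - CDF(j - 0.5) for one logistic component."""
    # cdf[j] holds the CDF value at the left edge of position j, i.e. at j - 0.5.
    cdf = [sigmoid((j - 0.5 - mu) / sigma) for j in range(num_positions + 1)]
    return [cdf[j + 1] - cdf[j] for j in range(num_positions)]


# Hypothetical values: the component is centred on position 10 of 21.
for sigma in (0.5, 1.0, 2.0):
    weights = discretized_weights(mu=10.0, sigma=sigma, num_positions=21)
    print(f"sigma={sigma}: peak={max(weights):.3f}, total={sum(weights):.3f}")
```

Under these assumptions the peak weight equals tanh(1 / (4 * sigma)), so it only approaches 1 when sigma becomes very small. A wider component spreads the same total mass over more positions and lowers the peak.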

@erogol
Contributor Author

erogol commented Feb 25, 2020

I finally released the model, with a couple of changes. This model uses the Batch Norm prenet from the beginning.

@erogol
Contributor Author

erogol commented Feb 28, 2020

One interesting problem with Graves attention is that, after the model converges, only one of the attention heads is actively used and it suppresses the other heads. This indicates that using only one head would also work fine, with a faster run time.

Alternatively, dropout might be used to randomize the behavior of the heads during training, on the assumption that the other heads would then be learned as well.
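
A minimal sketch of the dropout idea, assuming dropout is applied to the mixture logits before the softmax (as the TensorFlow snippet later in this thread does with `g_t`). The logits below are hypothetical and chosen so that head 0 dominates; this is an illustration, not a tested training recipe.

```python
import math
import random


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dropout(values, rate, rng):
    """Inverted dropout: zero each value with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [v / keep if rng.random() >= rate else 0.0 for v in values]


rng = random.Random(0)
logits = [6.0, 0.5, 0.2, -0.1, 0.0]  # hypothetical: head 0 dominates

print("no dropout  :", [round(w, 3) for w in softmax(logits)])
print("with dropout:", [round(w, 3) for w in softmax(dropout(logits, 0.5, rng))])
```

Whenever the dominant logit is zeroed, the softmax gives the remaining heads a much larger share for that step, which is the mechanism by which dropout could force the other heads to carry part of the alignment.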

@erogol erogol closed this as completed Mar 11, 2020
@Shikherneo2

Awesome work!
I was curious about one thing though. In your implementation of Graves GMM attention, is there a typo at line 179? https://github.com/mozilla/TTS/blob/dev/layers/common_layers.py#L179

Following https://arxiv.org/pdf/1906.01083.pdf, shouldn't it be
phi_t = g_t.unsqueeze(-1) * torch.sigmoid((mu_t.unsqueeze(-1) - j) / sig_t.unsqueeze(-1))

@erogol
Contributor Author

erogol commented Apr 22, 2020

> Awesome work!
> I was curious about one thing though. In your implementation of Graves GMM attention, is there a typo at line 179? https://github.com/mozilla/TTS/blob/dev/layers/common_layers.py#L179
>
> Following https://arxiv.org/pdf/1906.01083.pdf, shouldn't it be
> phi_t = g_t.unsqueeze(-1) * torch.sigmoid((mu_t.unsqueeze(-1) - j) / sig_t.unsqueeze(-1))

That is actually right. Yet it worked? Thanks for the catch. I'll fix it and try again.

@Shikherneo2

Shikherneo2 commented Apr 22, 2020

@erogol An unexpected but welcome surprise!
I've been trying to port your implementation to TensorFlow for my code, and for some reason the attention values very quickly die to values close to 0. Any suggestions as to where I should look for the issue?

```python
import numpy as np
import tensorflow as tf  # TF 1.x API (tf.layers, tf.name_scope with a default name)
from tensorflow.python.ops import variable_scope

# Methods of the attention mechanism class. `num_units`, `dtype`, `self.seq_len`,
# `self._alignments_size`, `self._batch_size` and `_maybe_mask_score` are defined
# elsewhere in the class/module and are not shown here.

def __init__(self, memory_sequence_length=None, training=True, name="GravesAttention"):
    self.training = training
    with tf.name_scope(name, 'GmmAttentionMechanismInit'):
        self._mask_value = 1e-8
        self.maybe_mask_score = lambda x: _maybe_mask_score(x, memory_sequence_length, self._mask_value)
    # Number of gaussians in the mixture
    self.K = 5
    self.eps = 1e-5

    # Bias init per split below: zeros for g (mixture logits), 10 for b (-> sigma), ones for k (-> step)
    bias_init = tf.constant_initializer(np.hstack([np.zeros(self.K), np.full(self.K, 10), np.ones(self.K)]))
    layer1 = tf.layers.Dense(units=num_units, activation="relu", name="graves_attention_denselayer1", trainable=True, dtype=dtype)
    layer2 = tf.layers.Dense(units=3 * self.K, bias_initializer=bias_init, name="graves_attention_denselayer2", trainable=True, dtype=dtype)
    self.dense_layer = lambda x: layer2(layer1(x))

    # Position grid, offset by 0.5 so that differences cover unit intervals
    self.J = tf.cast(tf.range(self.seq_len + 2), dtype=tf.float32) + 0.5

def __call__(self, query, state):
    seq_length = self._alignments_size
    mu_prev = state
    with variable_scope.variable_scope(None, "graves_attention", [query]):
        j = tf.slice(self.J, [0], [seq_length + 1])

        gbk_t = self.dense_layer(query)
        g_t, b_t, k_t = tf.split(gbk_t, num_or_size_splits=3, axis=1)

        # Means only move forward; scales stay strictly positive
        mu_t = mu_prev + tf.math.softplus(k_t)
        sig_t = tf.math.softplus(b_t) + self.eps

        # Dropout on the mixture logits, then normalize to mixture weights
        g_t = tf.layers.dropout(g_t, rate=0.5, training=self.training)
        g_t = tf.nn.softmax(g_t, axis=1) + self.eps

        x = (j - tf.expand_dims(mu_t, -1)) / tf.expand_dims(sig_t, -1)
        phi_t = tf.expand_dims(g_t, -1) * tf.nn.sigmoid(x)

        # Sum over mixture components
        alpha_t = tf.reduce_sum(phi_t, 1)

        # discretize: difference between neighbouring grid points
        a = tf.slice(alpha_t, [0, 1], [self._batch_size, seq_length])
        b = tf.slice(alpha_t, [0, 0], [self._batch_size, seq_length])
        alpha_t = a - b

        alpha_t = self.maybe_mask_score(alpha_t)

    next_state = mu_t
    return alpha_t, next_state
```

@erogol
Contributor Author

erogol commented Apr 24, 2020

not sure, maybe you can try the broken version as in my code.

@erogol
Contributor Author

erogol commented Apr 24, 2020

If I use your version, the attention weights come out negative. It is weird.

@Shikherneo2

Shikherneo2 commented Apr 25, 2020

I think I know what's happening. Your earlier implementation used a distribution that was monotonically decreasing, but your (mu_t - j) was flipped (possibly because you thought you were using exp instead of sigmoid), so it worked out just fine.
So, just change mu_t - j to j - mu_t, and your values should be positive again.
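
The sign argument can be checked numerically. The sketch below is an illustration only: it uses a single component with made-up `mu` and `sigma`, places the grid at j + 0.5 as in the TensorFlow snippet above, and takes differences between neighbouring grid points.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def differenced(score, num_positions):
    """Discretize: evaluate `score` on the grid j + 0.5 and subtract neighbours."""
    edges = [score(j + 0.5) for j in range(num_positions + 1)]
    return [edges[j + 1] - edges[j] for j in range(num_positions)]


mu, sigma = 3.0, 1.0  # hypothetical values for one mixture component

# sigmoid((j - mu) / sigma) increases with j, so the differences are positive.
increasing = differenced(lambda j: sigmoid((j - mu) / sigma), 8)
# sigmoid((mu - j) / sigma) decreases with j, so the differences are negative.
decreasing = differenced(lambda j: sigmoid((mu - j) / sigma), 8)

print([round(w, 3) for w in increasing])
print([round(w, 3) for w in decreasing])
```

Because sigmoid(-x) = 1 - sigmoid(x), the second list is exactly the negation of the first, which matches the negative weights reported above.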

@erogol
Contributor Author

erogol commented Apr 26, 2020

yeah that's a great return. I totally missed that.

@erogol
Contributor Author

erogol commented Apr 28, 2020

@Shikherneo2 I changed the implementation as you said and I had the same problem. After 10K iterations the whole alignment turns to zero.

@Shikherneo2

@erogol That is very weird. I have tried a bunch of small tweaks, and the values always quickly go to zero. In my case they even go to zero with your earlier implementation.

@erogol
Contributor Author

erogol commented Apr 29, 2020

In my case, the network goes to zero sometimes after 10K iterations and sometimes after 60K. I checked the layer statistics throughout training but could not see anything that explains it.

@erogol
Contributor Author

erogol commented Apr 29, 2020

It is interesting. The function I used previously is a reverse sigmoid with a squashed range around 2/3. So mathematically it makes no sense but it worked.
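
A short sketch of what such a "reverse sigmoid" looks like, assuming the earlier function is the `1 / (1 + torch.sigmoid(...))` form quoted later in this thread. The helper names and the values of `mu` and `sigma` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def reverse_sigmoid(x):
    """1 / (1 + sigmoid(x)): decreases from 1 to 0.5 as x grows, equals 2/3 at x = 0."""
    return 1.0 / (1.0 + sigmoid(x))


mu, sigma = 3.0, 1.0  # hypothetical values for one mixture component

# With x = (mu - j) / sigma the value rises with j, so differences stay positive.
edges = [reverse_sigmoid((mu - (j + 0.5)) / sigma) for j in range(9)]
weights = [edges[j + 1] - edges[j] for j in range(8)]

print(round(reverse_sigmoid(0.0), 4))
print([round(w, 3) for w in weights], round(sum(weights), 3))
```

Since the function only spans the interval from 0.5 to 1, the differenced weights of one component can never add up to more than 0.5. That would be consistent with the maximum attention value of about 0.5 reported earlier in the thread, although the thread itself does not confirm this as the cause.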

@candlewill

What's the benefit of discretizing the attention weights? Why not use the original version directly?

@erogol
Contributor Author

erogol commented May 28, 2020

It mathematically makes more sense to me and it works better.

@WhiteFu

WhiteFu commented Aug 28, 2020

> @erogol That is very weird. I have tried a bunch of small tweaks, and the values always quickly go to zero. In my case they even go to zero with your earlier implementation.

Have you solved this? I ran into the same problem.

@Shikherneo2

@WhiteFu No. I wasn't able to. When I looked at the statistics, I realized that the encoder gradients were going to zero after a few thousand iterations. So I added a highway network (like in Tacotron-1), which stabilized the training. But the weights still all go to zero.

@WhiteFu

WhiteFu commented Sep 2, 2020

@Shikherneo2 this is weird, I will follow up and let you know if there is any progress!

@erogol
Contributor Author

erogol commented Sep 7, 2020

Should I reopen the issue if anyone is working on it?

@Liujingxiu23

Liujingxiu23 commented Sep 22, 2020

@erogol
I am confused about the Graves attention.
The Graves attention code just uses sigmoid instead of exp as in the paper, right?
[image: screenshot]
phi_t = g_t.unsqueeze(-1) * (1 / (1 + torch.sigmoid((mu_t.unsqueeze(-1) - j) / sig_t.unsqueeze(-1))))

This version is a good one, right? I tried Graves attention in my own TTS work (only adding a while loop to process all time steps), but the alignment failed. I am trying to figure out the problem.
I used K=8; is it too large?
The mask in "alpha_t.data.masked_fill_" should be like [False False False ... True True], to mask the padding, right?
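
On the mask question: PyTorch's `Tensor.masked_fill_(mask, value)` writes `value` wherever `mask` is True, so the mask handed to it has to be True at the padded positions. The sketch below emulates that convention in plain Python. `padding_mask` and `masked_fill` are hypothetical helpers, and whether Mozilla TTS stores its mask this way or inverts a True-for-valid mask first is not shown in this thread.

```python
def padding_mask(lengths, max_len):
    """True marks padded time steps, False marks valid ones."""
    return [[t >= n for t in range(max_len)] for n in lengths]


def masked_fill(rows, mask, value):
    """Plain-Python stand-in for masked_fill_: write `value` where the mask is True."""
    return [[value if m else x for x, m in zip(row, mask_row)]
            for row, mask_row in zip(rows, mask)]


# Hypothetical batch: two sequences of lengths 3 and 5, padded to 5 steps.
alpha = [[0.2, 0.5, 0.3, 0.1, 0.1],
         [0.1, 0.2, 0.4, 0.2, 0.1]]
mask = padding_mask([3, 5], max_len=5)

print(mask[0])
print(masked_fill(alpha, mask, 1e-8)[0])
```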

@LeoniusChen

> @erogol
> I am confused about the Graves attention.
> The Graves attention code just uses sigmoid instead of exp as in the paper, right?
> [image: screenshot]
> phi_t = g_t.unsqueeze(-1) * (1 / (1 + torch.sigmoid((mu_t.unsqueeze(-1) - j) / sig_t.unsqueeze(-1))))
>
> This version is a good one, right? I tried Graves attention in my own TTS work (only adding a while loop to process all time steps), but the alignment failed. I am trying to figure out the problem.
> I used K=8; is it too large?
> The mask in "alpha_t.data.masked_fill_" should be like [False False False ... True True], to mask the padding, right?

In Mozilla/TTS, Graves attention is discrete. You can now use the code in this repo to implement DCA or GMM attention.
