Model Release: Tacotron2 with Discrete Graves Attention - LJSpeech #346

erogol · 2020-02-06T16:24:10Z

Model Link: https://drive.google.com/drive/folders/12Ct0ztVWHpL7SrEbUammGMmDopOKL9X_?usp=sharing

This model is trained with Discrete Graves attention and a BatchNorm prenet. It produces good examples with robust attention alignment without any inference-time tricks. You can even hear breathing effects in the pauses with this model.

You can also use this TTS model with the PWGAN or WaveRNN vocoders. PWGAN provides real-time voice synthesis; WaveRNN is slower but provides better quality.

https://github.com/erogol/ParallelWaveGAN
https://github.com/erogol/WaveRNN

(Ignore the small jiggle on the figures caused by TensorBoard.)

m-toman · 2020-02-06T16:37:54Z

Cool, overall do you prefer the forward attention model over this one?

el-tocino · 2020-02-07T04:34:28Z

He mentioned "It is the best model so far trained." on the forward model post.

m-toman · 2020-02-07T05:32:50Z

Yes, perhaps I should be more specific. I assume this might be more because of the specific training regimen (switching to batch norm, training longer...) and handholding and not necessarily because of the attention mechanism itself.

In the same way, in practice I have gotten far more good 10-bit mu-law WaveRNN models, even though overall I think MoL leads to better results.
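For readers unfamiliar with the "10 bit mulaw" output mentioned above, here is a minimal sketch of standard mu-law companding to 1024 classes in plain Python. The function names and the rounding convention are illustrative assumptions; actual vocoder code bases differ in these details.

```python
import math

MU = 2 ** 10 - 1  # 10 bits -> 1024 classes, so mu = 1023


def mulaw_encode(x, mu=MU):
    """Map a sample in [-1, 1] to an integer class in [0, mu]."""
    x = max(-1.0, min(1.0, x))  # clip to the valid range
    # Compress: logarithmic curve that keeps the sign of x
    y = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    # Shift [-1, 1] to [0, mu] and round to the nearest class
    return int(math.floor((y + 1.0) / 2.0 * mu + 0.5))


def mulaw_decode(q, mu=MU):
    """Map an integer class in [0, mu] back to a sample in [-1, 1]."""
    y = 2.0 * q / mu - 1.0
    # Expand: inverse of the logarithmic curve
    return math.copysign(((1.0 + mu) ** abs(y) - 1.0) / mu, y)


for sample in (-1.0, -0.1, 0.0, 0.3, 1.0):
    q = mulaw_encode(sample)
    print(sample, q, round(mulaw_decode(q), 5))
```

A softmax over these 1024 classes is the "quantized output" alternative to a MoL output layer.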

But I assume that cannot really be answered without lots of experiments on different datasets, etc.

Also, the "more natural-sounding" seemed to be a comparison to the forward attention model.

erogol · 2020-02-07T11:14:06Z

My two cents:

- Graves is easier to train on different datasets and sounds more natural, with better prosody.
- Forward attention leads to a more robust attention alignment and is easier to integrate with a PWGAN trained on ground-truth spectrograms.

Disclaimer: I was about to release the graves model but then I removed the whole model by mistake.
Now retraining it :)

vcjob · 2020-02-10T08:32:14Z

@erogol, why do you prefer PWGAN over MelGAN? MelGAN is faster, while the quality seems fine. Btw, https://github.com/kan-bayashi/ParallelWaveGAN now provides MelGAN as well. Any plans to try it and adapt it for TTS?
Also, the official paper states that PWGAN's MOS is quite a bit higher than WaveNet's MOS. Is it hard to get similar results, or did the authors (https://arxiv.org/pdf/1910.11480.pdf) prettify the results a little?

m-toman · 2020-02-10T09:01:36Z

@vcjob Interesting. I find that even the official PWGAN samples of plain vocoded recordings already exhibit some artefacts. r9y9's taco-wavenet (MoL) samples definitely sound better.
Also, WaveRNN gave me better results than MelGAN, although on LJ they are pretty similar. On other speakers WaveRNN was definitely better, and also better than the samples on the official MelGAN demo page.

I think the difference in the PWGAN paper is just because they used the ESPnet Gaussian WaveNet. I tried all their models and they are definitely not as good as r9y9's WaveNet.
No wonder, considering how much effort went into that over the years... and of course, it's ultra-slow.

Also interesting how more or less nobody uses the original WaveRNN formulation. Even the Amazon papers use a simple GRU followed by FCs predicting quantized output via softmax.

Well, in the end they're all annoying for some different reason ;)

EDIT: just realized the main author of PWGAN is r9y9. Even stranger he didn't use his own Wavenet implementation for comparison

erogol · 2020-02-10T11:27:49Z

@vcjob PWGAN is easier to adapt to TTS and the model is smaller. I am now also training a MelGAN-type generator, as the official repo suggests. But it would be nice to try the original MelGAN with TTS, if you are interested.

A paper is a paper :).

hadaev8 · 2020-02-10T11:58:15Z

I'm trying your implementation of graves attention with my fork of Nvidia tacotron2.
But sooner or later I get a gradient explosion. Could you advise how to deal with it?

erogol · 2020-02-10T16:48:21Z

@hadaev8 is it the latest implementation?

hadaev8 · 2020-02-10T20:38:46Z

@erogol
Latest from master.

erogol · 2020-02-10T23:12:14Z

@hadaev8 try the one in dev branch

hadaev8 · 2020-02-20T15:27:15Z

@erogol
This one works fine, but why is the max attention value 0.5?

erogol · 2020-02-20T16:46:07Z

Because you are normalizing it. Actually, I guess this reduces the quality at inference time. If you have a solution for this, I'd like to know.

hadaev8 · 2020-02-20T21:06:07Z

@erogol
Could you point to the exact line with the normalisation? I'm a bit lost in the math.

erogol · 2020-02-25T11:30:55Z

@hadaev8 it is not an explicit normalization.

Since the values are bounded in [0, 1] even without discretization, they stay in the same range with discretization. And because discretization subtracts the values at adjacent steps, the effective range shrinks toward zero. In our case it is [0, ~0.4]. So we would need a trick to expand this range.
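The shrinking range can be reproduced with a small standalone sketch. It assumes a single mixture component whose CDF is a sigmoid evaluated at bin edges 0.5, 1.5, ..., with the attention weight taken as the difference between neighboring edges. The names `mu`, `sig` and `discretized_weights` are illustrative and not taken from the repository.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def discretized_weights(mu, sig, n_positions):
    """Weight of position i = CDF(right edge of bin i) - CDF(left edge of bin i)."""
    edges = [i + 0.5 for i in range(n_positions + 1)]
    cdf = [sigmoid((e - mu) / sig) for e in edges]
    return [cdf[i + 1] - cdf[i] for i in range(n_positions)]


# Peak weight and total mass for a component centered on one bin,
# for several (assumed) widths
for sig in (2.0, 1.0, 0.5, 0.25):
    w = discretized_weights(mu=5.0, sig=sig, n_positions=10)
    print(f"sig={sig:<5} peak={max(w):.3f} total={sum(w):.3f}")
```

When `mu` sits on a bin center the peak equals `tanh(0.25 / sig)`, so it only approaches 1 as `sig` shrinks. The ceiling seen in a trained model therefore depends on the `sig` values it has learned.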

erogol · 2020-02-25T11:35:31Z

I finally released the model, with a couple of changes. This model uses a Batch Norm prenet from the beginning.

erogol · 2020-02-28T12:11:46Z

One interesting problem with Graves attention is that, after the model converges, only one of the attention heads is actively used and it suppresses the other heads. This suggests that a single head would work just as well, with a faster run time.

Alternatively, dropout could be used to randomize the behavior of the heads during training, on the assumption that this would make the model learn to use the other heads.

Shikherneo2 · 2020-04-20T22:56:31Z

Awesome work!
I was curious about one thing though. In your implementation of Graves GMM attention, is there a typo at line 179? https://github.com/mozilla/TTS/blob/dev/layers/common_layers.py#L179

Following https://arxiv.org/pdf/1906.01083.pdf , shouldn't it be
phi_t = g_t.unsqueeze(-1) * torch.sigmoid((mu_t.unsqueeze(-1) - j) / sig_t.unsqueeze(-1))

erogol · 2020-04-22T22:13:38Z

> Awesome work!
> I was curious about one thing though. In your implementation of Graves GMM attention, is there a typo at line 179? https://github.com/mozilla/TTS/blob/dev/layers/common_layers.py#L179
>
> Following https://arxiv.org/pdf/1906.01083.pdf , shouldn't it be
> phi_t = g_t.unsqueeze(-1) * torch.sigmoid((mu_t.unsqueeze(-1) - j) / sig_t.unsqueeze(-1))

It is actually true. And yet it worked? Thanks for the catch. I'll fix it and try again.

Shikherneo2 · 2020-04-22T23:32:28Z

@erogol An unexpected but welcome surprise!
I've been trying to port your implementation to TensorFlow for my code, and for some reason the attention values very quickly die to values close to 0. Any suggestions on where I should look for the issue?

```python
# Imports assumed for this excerpt (TF 1.x API).
import numpy as np
import tensorflow as tf
from tensorflow.python.ops import variable_scope

# Methods of the attention mechanism class. num_units, dtype, self.seq_len,
# self._alignments_size, self._batch_size and _maybe_mask_score are not
# defined in this excerpt; they come from the rest of the class/module.

def __init__(self, memory_sequence_length=None, training=True, name="GravesAttention"):
    self.training = training
    with tf.name_scope(name, 'GmmAttentionMechanismInit'):
        self._mask_value = 1e-8
        self.maybe_mask_score = lambda x: _maybe_mask_score(x, memory_sequence_length, self._mask_value)
    # Number of gaussians in the mixture
    self.K = 5
    self.eps = 1e-5

    # Bias init for the [g, b, k] outputs: zeros for g, 10 for b, ones for k
    bias_init = tf.constant_initializer(np.hstack([np.zeros(self.K), np.full(self.K, 10), np.ones(self.K)]))
    layer1 = tf.layers.Dense(units=num_units, activation="relu", name="graves_attention_denselayer1", trainable=True, dtype=dtype)
    layer2 = tf.layers.Dense(units=3 * self.K, bias_initializer=bias_init, name="graves_attention_denselayer2", trainable=True, dtype=dtype)
    self.dense_layer = lambda x: layer2(layer1(x))

    # Positions 0.5, 1.5, ... at which the mixture is evaluated
    self.J = tf.cast(tf.range(self.seq_len + 2), dtype=tf.float32) + 0.5

def __call__(self, query, state):
    seq_length = self._alignments_size
    mu_prev = state
    with variable_scope.variable_scope(None, "graves_attention", [query]):
        j = tf.slice(self.J, [0], [seq_length + 1])

        gbk_t = self.dense_layer(query)
        g_t, b_t, k_t = tf.split(gbk_t, num_or_size_splits=3, axis=1)

        # Means can only move forward; widths are kept strictly positive
        mu_t = mu_prev + tf.math.softplus(k_t)
        sig_t = tf.math.softplus(b_t) + self.eps

        g_t = tf.layers.dropout(g_t, rate=0.5, training=self.training)
        g_t = tf.nn.softmax(g_t, axis=1) + self.eps

        x = (j - tf.expand_dims(mu_t, -1)) / tf.expand_dims(sig_t, -1)
        phi_t = tf.expand_dims(g_t, -1) * tf.nn.sigmoid(x)

        # Sum over the K mixture components
        alpha_t = tf.reduce_sum(phi_t, 1)

        # discretize: difference between neighboring positions
        a = tf.slice(alpha_t, [0, 1], [self._batch_size, seq_length])
        b = tf.slice(alpha_t, [0, 0], [self._batch_size, seq_length])
        alpha_t = a - b

        alpha_t = self.maybe_mask_score(alpha_t)

    next_state = mu_t
    return alpha_t, next_state
```

erogol · 2020-04-24T16:32:23Z

not sure, maybe you can try the broken version as in my code.

erogol · 2020-04-24T16:58:27Z

If I use your version, the attention weights come out negative. It is weird.

Shikherneo2 · 2020-04-25T17:15:22Z

I think I know what's happening. Your earlier implementation used a distribution that was monotonically decreasing, but your (mu_t - j) was flipped (possibly because you thought you were using exp instead of sigmoid), so it worked out just fine.
So, just change mu_t - j to j - mu_t, and your values should be positive again.
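A quick numerical check of the sign argument above, as a sketch in plain Python with illustrative values for `mu_t` and `sig_t`: a sigmoid of `(mu_t - j)` falls as `j` grows, so differencing neighboring positions gives negative numbers, while `(j - mu_t)` gives the same magnitudes with a positive sign.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def adjacent_differences(values):
    """values[i + 1] - values[i] for every neighboring pair."""
    return [b - a for a, b in zip(values, values[1:])]


mu_t, sig_t = 2.0, 1.0               # illustrative mean and width
j = [i + 0.5 for i in range(6)]      # positions 0.5, 1.5, ..., 5.5

# Argument (mu_t - j): the curve decreases along j
flipped = adjacent_differences([sigmoid((mu_t - e) / sig_t) for e in j])
# Argument (j - mu_t): the curve increases along j, like a CDF
fixed = adjacent_differences([sigmoid((e - mu_t) / sig_t) for e in j])

print("mu_t - j:", [round(d, 4) for d in flipped])
print("j - mu_t:", [round(d, 4) for d in fixed])
```

The two results are exact negatives of each other because `sigmoid(-x) = 1 - sigmoid(x)`.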

erogol · 2020-04-26T01:16:44Z

Yeah, that's great feedback. I totally missed that.

erogol · 2020-04-28T08:39:43Z

@Shikherneo2 I changed the implementation as you said and had the same problem: after 10K iterations all the alignments turn to zero.

Shikherneo2 · 2020-04-28T17:05:37Z

@erogol That is very weird. I have tried a bunch of small tweaks, and the values always quickly go to zero. In my case they even go to zero with your earlier implementation.

erogol · 2020-04-29T08:50:25Z

In my case, the network goes to zero sometimes after 10K and sometimes after 60K iterations. I checked the layer statistics throughout training but could not see anything that explains it.

erogol · 2020-04-29T10:37:17Z

It is interesting. The function I used previously is a reverse sigmoid with a squashed range around 2/3. So mathematically it makes no sense but it worked.
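To make the "squashed range around 2/3" concrete, here is a sketch that assumes the earlier function had the form `1 / (1 + sigmoid(x))`, which is the variant quoted later in this thread. The helper name `reverse_sigmoid` is made up for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def reverse_sigmoid(x):
    """Decreasing in x, bounded in (0.5, 1), equal to 2/3 at x = 0."""
    return 1.0 / (1.0 + sigmoid(x))


for x in (-6.0, -2.0, 0.0, 2.0, 6.0):
    print(f"x={x:+.1f}  value={reverse_sigmoid(x):.4f}")
```

Because this function only spans a band of width 0.5, differences between neighboring positions cannot add up to more than 0.5 per unit of mixture weight. That would be consistent with the ceiling of about 0.5 reported earlier in this thread, though the link is not confirmed here.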

candlewill · 2020-05-26T12:43:10Z

What's the benefit of discretizing the attention weights? Why not use the original version directly?

erogol · 2020-05-28T07:46:51Z

It mathematically makes more sense to me and it works better.

WhiteFu · 2020-08-28T13:45:59Z

> @erogol That is very weird. I have tried a bunch of small tweaks, and the values always quickly go to zero. In my case they even go to zero with your earlier implementation.

Did you manage to solve this? I ran into the same problem as well.

Shikherneo2 · 2020-08-28T14:34:09Z

@WhiteFu No. I wasn't able to. When I looked at the statistics, I realized that the encoder gradients were going to zero after a few thousand iterations. So I added a highway network (like in Tacotron-1), which stabilized the training. But the weights still all go to zero.

WhiteFu · 2020-09-02T06:11:22Z

@Shikherneo2 This is weird. I will follow up and let you know if there is any progress!

erogol · 2020-09-07T09:20:14Z

Should I reopen the issue if anyone is working on it?

Liujingxiu23 · 2020-09-22T01:54:51Z

@erogol
I am confused about the Graves attention.
The Graves attention code just uses sigmoid instead of exp as in the paper, right?

phi_t = g_t.unsqueeze(-1) * (1 / (1 + torch.sigmoid((mu_t.unsqueeze(-1) - j) / sig_t.unsqueeze(-1))))

This version is a good one, right? I tried Graves attention in my own TTS work (only adding a while loop to process all time steps) but alignment failed. I am trying to figure out the problem.
I used K=8. Is that too large?
The mask in "alpha_t.data.masked_fill_" should be like [False False False ... True True], to mask the padding, right?
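On the masking question: PyTorch's `masked_fill_` writes the fill value wherever the mask is `True`, so the mask handed to it needs to be `True` on padding. The sketch below mimics that behavior in plain Python. `padding_mask` and `masked_fill` are illustrative helpers, and whether the calling code stores a padding mask or inverts a validity mask first is not settled in this thread.

```python
def padding_mask(lengths, max_len):
    """True where a position lies beyond the real sequence length (padding)."""
    return [[pos >= length for pos in range(max_len)] for length in lengths]


def masked_fill(rows, mask, value):
    """Return a copy of rows with `value` written wherever mask is True."""
    return [
        [value if is_pad else weight for weight, is_pad in zip(row, mask_row)]
        for row, mask_row in zip(rows, mask)
    ]


lengths = [3, 5]                       # real lengths of two padded sequences
mask = padding_mask(lengths, max_len=5)
alpha = [[0.2, 0.5, 0.2, 0.05, 0.05],  # made-up attention weights
         [0.1, 0.2, 0.4, 0.2, 0.1]]

print(mask[0])
print(masked_fill(alpha, mask, 1e-8)[0])
```

If the available mask marks valid positions instead, it has to be inverted before it is used this way.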

LeoniusChen · 2021-03-30T03:36:35Z

@erogol
> I am confused about the Graves attention.
> The Graves attention code just uses sigmoid instead of exp as in the paper, right?
>
> phi_t = g_t.unsqueeze(-1) * (1 / (1 + torch.sigmoid((mu_t.unsqueeze(-1) - j) / sig_t.unsqueeze(-1))))
>
> This version is a good one, right? I tried Graves attention in my own TTS work (only adding a while loop to process all time steps) but alignment failed. I am trying to figure out the problem.
> I used K=8. Is that too large?
> The mask in "alpha_t.data.masked_fill_" should be like [False False False ... True True], to mask the padding, right?

In Mozilla/TTS, Graves attention is discrete. You can now use the code in this repo to implement DCA or GMM attention.

erogol added the model-release (explanation for new model releases) label on Feb 6, 2020

erogol closed this as completed Mar 11, 2020

p0p4k mentioned this issue Nov 1, 2022

Use COEF value. coqui-ai/TTS#2116

Closed
