
Vanishing Gradient Problem Occurred During Training #38

Open
bright1993ff66 opened this issue Dec 5, 2018 · 4 comments

bright1993ff66 commented Dec 5, 2018

Hi, I am new to the attention mechanism, and I found your code and tutorials very helpful for beginners like me!

Currently, I am trying to use your attention decoder for sentiment analysis on the Sentiment140 dataset. I have constructed the following BiLSTM-with-attention model to separate positive and negative tweets:

from keras.layers import Input, Bidirectional, LSTM
from keras.models import Model
# AttentionDecoder is the custom recurrent layer from this repository

def get_bi_lstm_with_attention_model(timesteps, features):
    input_shape = (timesteps, features)
    inputs = Input(shape=input_shape, dtype='float32')
    enc = Bidirectional(LSTM(100, dropout=0.2, recurrent_dropout=0.2, return_sequences=True),
                        merge_mode='concat', name='bidirectional_1')(inputs)
    y_hat = AttentionDecoder(units=100, output_dim=1, name='attention_decoder_1',
                             activation='sigmoid')(enc)
    bilstm_attention_model = Model(inputs=inputs, outputs=y_hat)
    bilstm_attention_model.compile(loss='binary_crossentropy', optimizer='rmsprop',
                                   metrics=['accuracy'])
    return bilstm_attention_model

However, when I fit this model on my training data (a 1,280,000 × 50 matrix with batch_size=128; I first reshape it to (int(1280000/5), 5, 50), i.e. (256000, 5, 50), following the input_shape = (batch_size, time_steps, input_dim) convention), the accuracy is very low (around 50%). My BiLSTM without attention reaches at least 80% accuracy with the same hyperparameter settings. So my question is: what is wrong with my current BiLSTM-with-attention model? I suspect a vanishing gradient problem. I would really appreciate any guidelines on how to deal with this issue. Thank you very much!
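One thing worth checking (a hypothesis, not a confirmed diagnosis): a decoder with output_dim=1 emits one prediction per timestep, while a single sentiment label per tweet calls for sequence-to-one *pooling*. A minimal NumPy sketch of attention pooling, where every name and shape is illustrative rather than taken from the repo:

```python
# Illustrative attention *pooling* (not the repo's AttentionDecoder):
# collapse a sequence of encoder states into one vector for classification,
# instead of emitting one output per timestep.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
timesteps, hidden = 5, 200                   # 5 steps; BiLSTM concat output = 2*100
enc = rng.normal(size=(timesteps, hidden))   # encoder states h_1..h_T
w = rng.normal(size=hidden)                  # learnable scoring vector (assumed)

scores = enc @ w                 # one relevance score per timestep, shape (5,)
alpha = softmax(scores)          # attention weights over timesteps, sum to 1
context = alpha @ enc            # weighted sum -> a single (hidden,) vector
assert context.shape == (hidden,)
```

The pooled `context` would then feed a Dense(1, activation='sigmoid') head, giving one probability per sequence rather than one per timestep.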


bright1993ff66 commented Dec 5, 2018

Quick Update:

I tried adding both a kernel regularizer and an activity regularizer:

kernel_regularizer=regularizers.l2(0.01),
activity_regularizer=regularizers.l1(0.01),

I also added BatchNormalization after the BiLSTM encoder and changed the activation function to ReLU. The updated model is given below:

from keras.layers import BatchNormalization  # additional import for this version

def get_bi_lstm_with_attention_model(timesteps, features):
    input_shape = (timesteps, features)
    inputs = Input(shape=input_shape, dtype='float32')
    enc = Bidirectional(LSTM(100, dropout=0.2, recurrent_dropout=0.2, return_sequences=True),
                        merge_mode='concat', name='bidirectional_1')(inputs)
    normalized = BatchNormalization()(enc)
    y_hat = AttentionDecoder(units=100, output_dim=1, name='attention_decoder_1',
                             activation='relu')(normalized)
    bilstm_attention_model = Model(inputs=inputs, outputs=y_hat)
    bilstm_attention_model.compile(loss='binary_crossentropy', optimizer='rmsprop',
                                   metrics=['accuracy'])
    return bilstm_attention_model

But the accuracy still fluctuates around 50%. Any help or insights would be appreciated!
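For reference, the vanishing-gradient hypothesis itself is easy to illustrate: the sigmoid's derivative never exceeds 0.25, so a gradient multiplied through many sigmoid nonlinearities shrinks geometrically. A toy NumPy sketch with synthetic numbers, not a diagnosis of this particular model:

```python
# Toy illustration of vanishing gradients through chained sigmoids.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = 0.0                                  # pre-activation at the sigmoid's steepest point
d_sig = sigmoid(x) * (1 - sigmoid(x))    # = 0.25, the maximum possible slope

grad = 1.0
for layer in range(20):                  # 20 chained sigmoid layers / timesteps
    grad *= d_sig                        # chain rule multiplies the local slopes
print(grad)                              # 0.25**20 ~ 9.1e-13: effectively zero
```

Even at the sigmoid's best-case slope, twenty chained layers shrink a unit gradient by twelve orders of magnitude, which is why saturating activations deep in a stack can stall learning.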


zafarali commented Dec 5, 2018 via email


Tlchan99 commented Feb 9, 2019

Hi Zafarali,
I have the same issue as bright1993ff66. I use your attention decoder for Portuguese-to-Chinese and English-to-Chinese machine translation, where the input and output sequences have different maximum lengths (e.g. English input = max. 13 words, Chinese output = max. 10 words). After many prolonged GPU runs there is no BLEU-1/2/3/4 improvement over the plain BiLSTM model at all.
It seems something is wrong with the stepping: starting from the initial state, I am not sure whether the context vector maps the correct encoder timestep to the correct decoder timestep when the two sequences have different maximum lengths.
I have been working on this from scratch for a month but cannot figure it out. Could you explain a bit more about 1) how get_initial_state works and 2) the step function?

I would really appreciate your help with this puzzle. Thanks in advance.


Tlchan99 commented Feb 9, 2019

Hi Zafarali,

To be specific, I translate Portuguese (max. 14 words, zero-padded if shorter) to Chinese (max. 15 words, also padded). After a generic embedding layer, the encoder output of the BiLSTM layer has to go through a RepeatVector layer with sequence length = 15 (the target_timesteps length, not the input length of 14; otherwise I get an error about expecting a tensor of (15, 256) vs. (14, 256); setting return_sequences=True on the LSTM layer gives the same error; 256 is the embedding dimension).

I think there is a mismatch in the context vector: it is computed over 14 timesteps (the input words of a sentence), not the 15 timesteps that are passed to the attention decoder layer as the decoder input (which should be 14 timesteps) for the attention mechanism. So I want to know how you initialise y0 and s0, and how ytm, stm, and ci are updated at each step, to make sure they loop correctly. I believe this is why the attention decoder layer gives little or no performance gain in my case. I would appreciate any light you can shed on this, and thanks.

Tom Chan.
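For what it's worth, in Bahdanau-style attention the context vector is recomputed at every decoder step as a softmax over the encoder's timesteps, so the encoder length (14) and decoder length (15) never need to match and no RepeatVector is required. A minimal NumPy sketch with assumed shapes and a toy state update, not the repo's exact step function:

```python
# Sketch: one context vector per DECODER step, each a softmax-weighted
# sum over the ENCODER's timesteps, so 14 encoder steps can drive 15
# decoder steps with no length matching.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(1)
src_T, tgt_T, dim = 14, 15, 256
enc = rng.normal(size=(src_T, dim))    # encoder outputs h_1..h_14
s = np.zeros(dim)                      # s_0: decoder initial state (assumed zeros)

contexts = []
for t in range(tgt_T):                 # loop over the 15 target timesteps
    scores = enc @ s                   # align s_{t-1} with each of the 14 h_j
    alpha = softmax(scores)            # (14,) attention weights over encoder steps
    c = alpha @ enc                    # context c_t, shape (dim,)
    s = np.tanh(c + s)                 # toy state update (the real cell is a GRU)
    contexts.append(c)

contexts = np.stack(contexts)
assert contexts.shape == (tgt_T, dim)  # 15 contexts built from 14 encoder states
```

The key design point is that the attention weights are normalised over the source length only; the target length simply determines how many times the scoring loop runs.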
