Vanishing Gradient Problem Occurred During Training #38
Quick update: I tried adding regularizers as both a kernel regularizer and an activity regularizer, namely kernel_regularizer=regularizers.l2(0.01) and activity_regularizer=regularizers.l1(0.01). I also added BatchNormalization after the BiLSTM encoder and changed the activation function to ReLU. The newest model is given below:

    from keras.layers import Input, Bidirectional, LSTM, BatchNormalization
    from keras.models import Model
    from models.custom_recurrents import AttentionDecoder  # adjust to wherever AttentionDecoder is defined in your copy of the repo

    def get_bi_lstm_with_attention_model(timesteps, features):
        input_shape = (timesteps, features)
        input = Input(shape=input_shape, dtype='float32')
        # BiLSTM encoder returning the full sequence of hidden states
        enc = Bidirectional(LSTM(100, dropout=0.2, recurrent_dropout=0.2, return_sequences=True, input_shape=input_shape),
                            merge_mode='concat', name='bidirectional_1')(input)
        normalized = BatchNormalization()(enc)
        y_hat = AttentionDecoder(units=100, output_dim=1, name='attention_decoder_1', activation='relu')(normalized)
        bilstm_attention_model = Model(inputs=input, outputs=y_hat)
        bilstm_attention_model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
        return bilstm_attention_model

But the accuracy still fluctuates around 50%. Any help/insights would be appreciated!
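(Added for reference, not part of the original comment: the regularizers quoted above do not actually appear in the model code, so here is a hedged sketch of where those arguments would typically attach on the LSTM layer in Keras.)

    from keras import regularizers
    from keras.layers import LSTM, Bidirectional

    # Sketch only: the same encoder layer with the quoted regularizer arguments attached.
    encoder = Bidirectional(
        LSTM(100,
             dropout=0.2,
             recurrent_dropout=0.2,
             return_sequences=True,
             kernel_regularizer=regularizers.l2(0.01),
             activity_regularizer=regularizers.l1(0.01)),
        merge_mode='concat', name='bidirectional_1')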
Hi!
You can explicitly check for vanishing gradients by inspecting the gradients themselves. Your sequence length does seem long, but not so long that an LSTM would fail, especially since the regular version works.
Do you have masking in your sequences?
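(Added for illustration, not part of the original reply: a rough sketch of one way to inspect per-weight gradient norms, assuming Keras 2.x on the TensorFlow 1.x backend; all names here are illustrative and the exact calls may differ across Keras versions.)

    import numpy as np
    from keras import backend as K
    from keras import losses

    def gradient_norms(model, x_batch, y_batch):
        # Build a symbolic loss against a placeholder target and take its gradients
        y_true = K.placeholder(shape=K.int_shape(model.output))
        loss = K.mean(losses.binary_crossentropy(y_true, model.output))
        grads = K.gradients(loss, model.trainable_weights)
        get_grads = K.function([model.input, y_true], grads)
        values = get_grads([x_batch, y_batch])
        return {w.name: float(np.linalg.norm(v)) for w, v in zip(model.trainable_weights, values)}

If the norms for the encoder layers come out orders of magnitude smaller than those of the final layer, the gradient is indeed vanishing.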
Hi Zafarali, I would appreciate it a lot if you could help solve this complex puzzle. Thanks in advance.
Hi Zafarali, In detail: I translate Portuguese (max. 14 words, zero-padded if shorter than 14) to Chinese (max. 15 words, also padded). After a generic embedding layer, the encoder output of the BiLSTM layer must go through a RepeatVector layer with sequence length 15 (the target timestep length, not the input timestep length of 14; otherwise there is an error saying a tensor of (15, 256) was expected but (14, 256) was given, and return_sequences=True on the LSTM layer gives the same error; 256 is the word dimension). I think there is a mismatch in the context vector, which is calculated from the 14 input timesteps of a sentence, not from the 15 timesteps that are actually passed to the attention decoder layer as decoder input (it should be 14 timesteps) for the attention mechanism. So I want to know how you set the initial states y0 and s0, and how ytm, stm and ci are updated, to make sure they are looped correctly. I believe this is the reason why there is little or no performance increase from the attention decoder layer in my case. I would appreciate it if you could shed some light on this. Thanks, Tom Chan.
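(Added for illustration, not part of the original comment: the y0/s0/ytm/stm/ci bookkeeping asked about above is easiest to see in a generic Bahdanau-style decoder step. This is not the repository's AttentionDecoder implementation; every name and shape below is illustrative, assuming 14 encoder annotations of dimension 256.)

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def decoder_step(y_tm1, s_tm1, H, Wa, Ua, va, cell):
        """One decoder step: y_tm1/s_tm1 play the role of ytm/stm."""
        # Alignment scores of the previous decoder state against each encoder annotation h_j
        scores = np.array([va @ np.tanh(Wa @ s_tm1 + Ua @ h_j) for h_j in H])
        alpha = softmax(scores)              # attention weights over the 14 source steps
        c_t = alpha @ H                      # context vector (ci): weighted sum of annotations
        y_t, s_t = cell(y_tm1, s_tm1, c_t)   # recurrent cell producing the next output and state
        return y_t, s_t, alpha

    # Toy usage: 14 source steps, 256-dim annotations, 64-dim decoder state.
    rng = np.random.default_rng(0)
    H = rng.normal(size=(14, 256))
    Wa, Ua, va = rng.normal(size=(64, 64)), rng.normal(size=(64, 256)), rng.normal(size=64)
    s0 = np.tanh(rng.normal(size=64))        # s0 is commonly initialised from the encoder's final state
    y0 = np.zeros(1)                         # y0 is a start-of-sequence output
    cell = lambda y, s, c: (np.tanh(c[:1]), np.tanh(s + 0.1 * c[:64]))  # stand-in for a GRU/LSTM cell
    y1, s1, alpha1 = decoder_step(y0, s0, H, Wa, Ua, va, cell)

In standard Bahdanau attention the decoder loops over the 15 target steps while every context vector remains a weighted sum over the 14 source annotations; whether a RepeatVector is needed at all depends on how a particular decoder implementation decides its output length.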
Hi, I am new to the attention mechanism, and I found your code and tutorials very helpful for beginners like me!
Currently, I am trying to use your attention decoder for sentiment analysis on the Sentiment140 dataset. I have successfully constructed the following BiLSTM-with-attention model to separate positive and negative tweets:
However, when I fit this model on my training data (a 1280000 x 50 matrix, batch_size=128; I first reshape the data to (int(1280000/5), 5, 50), following the rule input_shape = (batch_size, time_steps, input_dim)), the accuracy is very low (around 50%). My BiLSTM model without attention reaches at least 80% accuracy with the same hyperparameter settings. Hence, my question is: what is wrong with my current BiLSTM-with-attention model? I think it is a vanishing gradient problem. I would really appreciate it if anyone could give me some guidelines on how to deal with this issue. Thank you very much!
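(Added for illustration, not part of the original issue: a minimal sketch of the reshape described above, assuming the training matrix has shape (1280000, 50).)

    import numpy as np

    X = np.zeros((1280000, 50), dtype="float32")        # placeholder for the real training matrix
    timesteps, features = 5, 50
    X_seq = X.reshape(len(X) // timesteps, timesteps, features)
    print(X_seq.shape)                                  # (256000, 5, 50)

    # Note: Keras' input_shape excludes the batch dimension. For sequences shaped
    # (timesteps, features) it is input_shape=(5, 50); the batch size (128) is
    # passed to fit() separately rather than being part of input_shape.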