- Giovanni Rivera; Intro to Machine Learning, Section A
- Oscar Jesus Zambrano; Intro to Machine Learning, Section B
# Ensure that we have the newest version of pip installed
%pip install -q --upgrade pip
# Install necessary libraries
%pip install -q numpy
%pip install -q pandas
%pip install -q matplotlib
%pip install -q seaborn
%pip install -q plotly
# Render matplotlib plots inline in the notebook
%matplotlib inline
%pip install -q scikit-learn
%pip install -q tensorflow
# Helps run plot_model from keras
%pip install pydot
Note: you may need to restart the kernel to use updated packages.
# Import the modules
import numpy as np
import pandas as pd
import tensorflow as tf
from matplotlib import pyplot as plt
import seaborn as sns
import plotly.graph_objs as plotly
from sklearn import datasets
# Set the styling of the plt plots to darkgrid
sns.set_style('darkgrid')
# Suppress warning messages and set numpy print precision to 3 decimal places
import warnings
warnings.filterwarnings('ignore')
np.set_printoptions(precision=3, suppress=True)
In light of recent events around the globe, we have seen a rise in misinformation spreading on social media. To combat this, we use real tweets from real Twitter users to find a way to stop misinformation from spreading and to promote genuinely useful information for people who live in, or have relatives in, affected areas.
We are designing machine learning models with TensorFlow/Keras to identify whether a given tweet is about a natural disaster.
Throughout the notebook, we will refer to tweets about a natural disaster as the positive class, and to tweets that are not about a natural disaster as the negative class.
We use a dataset from the Kaggle competition Tech Exchange 2023 ML Project.
You can find the dataset by navigating to the 'Data' tab at the link above. Under 'Data Explorer', you can download the following files (an optional command-line download is sketched after this list):
- train.csv: Contains the training data; we will use this dataset for our training and validation data.
- test.csv: Contains our testing data; we will use this dataset for making our submissions to Kaggle.
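If you prefer the command line, here is a minimal sketch using the Kaggle CLI. It assumes the kaggle package is installed, an API token is configured in ~/.kaggle/kaggle.json, and that the competition slug and zip file name match the folder name used below; adjust as needed.
# Optional: download and unzip the competition files (slug and zip name are assumed)
!kaggle competitions download -c techexchange-2023-ml-project -p Data/techexchange-2023-ml-project
!unzip -o Data/techexchange-2023-ml-project/techexchange-2023-ml-project.zip -d Data/techexchange-2023-ml-project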
# Please update the Path here to the location of your train.csv and test.csv files
path_to_csv = 'Data/techexchange-2023-ml-project'
# Load the Data Frames from the Training and Testing Data Frame
train_df = pd.read_csv(f'{path_to_csv}/train.csv')
test_df = pd.read_csv(f'{path_to_csv}/test.csv')
display(train_df.head())
| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | NaN | NaN | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | NaN | NaN | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | NaN | NaN | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | NaN | NaN | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | NaN | NaN | Just got sent this photo from Ruby #Alaska as ... | 1 |
# Separate Data Frame for input and outputs
input_names = ['id', 'keyword', 'location', 'text']
input_df = train_df[input_names]
display(input_df.head())
output_names = ['target']
output_df = train_df[output_names]
display(output_df.head())
| | id | keyword | location | text |
|---|---|---|---|---|
| 0 | 1 | NaN | NaN | Our Deeds are the Reason of this #earthquake M... |
| 1 | 4 | NaN | NaN | Forest fire near La Ronge Sask. Canada |
| 2 | 5 | NaN | NaN | All residents asked to 'shelter in place' are ... |
| 3 | 6 | NaN | NaN | 13,000 people receive #wildfires evacuation or... |
| 4 | 7 | NaN | NaN | Just got sent this photo from Ruby #Alaska as ... |

| | target |
|---|---|
| 0 | 1 |
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 1 |
We convert the data frames to NumPy arrays so the data is in a form that our models and TensorFlow's training loop can consume efficiently.
# Convert into numpy data
X_data = input_df.to_numpy()
Y_data = output_df.to_numpy().flatten()
print(X_data[:5])
print(Y_data[:5])
print(X_data.shape)
print(Y_data.shape)
[[1 nan nan
'Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all']
[4 nan nan 'Forest fire near La Ronge Sask. Canada']
[5 nan nan
"All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected"]
[6 nan nan
'13,000 people receive #wildfires evacuation orders in California ']
[7 nan nan
'Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school ']]
[1 1 1 1 1]
(7613, 4)
(7613,)
We use scikit-learn's train_test_split to split our train.csv dataset into a 90:10 training/validation split, which we use further down this notebook.
from sklearn.model_selection import train_test_split
# Split the data into train and test
X_train, X_validation, Y_train, Y_validation = train_test_split(X_data, Y_data, train_size=0.90)
# Going through a few examples in the training split
for index in range(2):
print('Text')
print(X_train[index])
print()
print('Is Natural Disaster?')
print(Y_train[index], 'Yes' if Y_train[index] == 1 else 'No')
print('======================')
print()
Text
[4917 'exploded' nan
'Im Dead!!! My two Loves in 1 photo! My Heart exploded into a Million Pieces!!! ?????????????? @BrandonSkeie @samsmithworld http://t.co/yEtagC2d8A']
Is Natural Disaster?
0 No
======================
Text
[8280 'rioting' 'heart of darkness, unholy ?'
"@Georgous__ what alternatives? Legal alternatives? Protesting? Rioting may not be the most peaceful thing but it's a demonstration of how"]
Is Natural Disaster?
0 No
======================
# Returns a positive result, regardless of the input
def baseline_model(text_inputs):
return 1
# Vectorized version of the method to apply to numpy arrays properly
baseline_model_np = np.vectorize(baseline_model, signature='(n) -> ()')
# Testing the baseline on the training split
baseline_predictions_train = baseline_model_np(X_train)
for i in range(5):
print('Input:')
print(X_train[i])
print()
print('Output')
print(Y_train[i])
print()
print('Prediction')
print(baseline_predictions_train[i])
print('==================')
print()
Input:
[4917 'exploded' nan
'Im Dead!!! My two Loves in 1 photo! My Heart exploded into a Million Pieces!!! ?????????????? @BrandonSkeie @samsmithworld http://t.co/yEtagC2d8A']
Output
0
Prediction
1
==================
Input:
[8280 'rioting' 'heart of darkness, unholy ?'
"@Georgous__ what alternatives? Legal alternatives? Protesting? Rioting may not be the most peaceful thing but it's a demonstration of how"]
Output
0
Prediction
1
==================
Input:
[4606 'emergency%20services' 'London, UK'
'I am not an American but I have family who have served in the military work in the emergency services and work in... http://t.co/Pl2VzLrKVK']
Output
1
Prediction
1
==================
Input:
[7668 'panic' 'Topeka, KS'
"The good thing is that the #Royals won't face a newbie in the playoffs. No real reason to panic."]
Output
0
Prediction
1
==================
Input:
[7397 'obliterated' 'Valparaiso '
'RIZZO IS ON ???????? THAT BALL WAS OBLITERATED']
Output
0
Prediction
1
==================
Since our baseline always predicts the positive class, its loss gives us a reference point: any model that has learned something should beat this naive approach.
# Calculates Log Loss
def calculate_loss(labels, predictions):
epsilon = 0.000001 # Prevents taking the natural log of non-positive values
ce_values = -labels * np.log(predictions + epsilon) - (1 - labels) * np.log(1 - predictions + epsilon)
loss = ce_values.mean()
return loss
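For reference, this is the binary cross-entropy (log loss) that the helper above computes, with a small epsilon added for numerical stability:

$$\text{loss} = -\frac{1}{N}\sum_{i=1}^{N}\Big[\,y_i\,\log(p_i + \epsilon) + (1 - y_i)\,\log(1 - p_i + \epsilon)\Big]$$

where $y_i$ is the label, $p_i$ the predicted probability, and $N$ the number of examples.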
# Calculate the loss on the training portion of our train data
training_loss = calculate_loss(Y_train, baseline_predictions_train)
print('Training Loss:', training_loss)
# Calculate the loss on the validation portion of our train data
baseline_predictions_validation = baseline_model_np(X_validation)
testing_loss = calculate_loss(Y_validation, baseline_predictions_validation)
print('Validation Loss:', testing_loss)
Training Loss: 7.87469942823704
Validation Loss: 7.923067964344554
training_accuracy = (baseline_predictions_train == Y_train).sum() / len(Y_train)
print("Training Accuracy:", training_accuracy)
validation_accuracy = (baseline_predictions_validation == Y_validation).sum() / len(Y_validation)
print('Validation Accuracy:', validation_accuracy)
Training Accuracy: 0.4300102174864983
Validation Accuracy: 0.42650918635170604
We can then upload a submission.csv file to Kaggle to find the F1 score of our baseline model.
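For reference, the F1 score that Kaggle reports is the harmonic mean of precision and recall:

$$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$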
# Create the submission CSV file for our Kaggle submission
def save_to_submissions_csv(text_inputs, prediction_labels, name='submission.csv'):
print(f'Generating "{name}" file...')
# Extract the ids of the text inputs and flatten to a 1D ndarray
test_ids = text_inputs[:,0].flatten()
# Write the submission file and save to 'submission.csv'
np.savetxt(
name,
np.rec.fromarrays([test_ids, prediction_labels]),
fmt=['%s', '%d'],
delimiter=',',
header='id,target',
comments=''
)
# Show success!
print(f'Successfully created "{name}"')
# Reformat the test dataframe to an input dataframe
input_names = ['id', 'keyword', 'location', 'text']
test_input_df = test_df[input_names]
# Reformat the input dataframe into a numpy array for running through our model
test_input_np = test_input_df.to_numpy()
# Predict by using the baseline model on the test input and save to a .csv
baseline_predictions_test = baseline_model_np(test_input_np)
save_to_submissions_csv(test_input_np, baseline_predictions_test, 'baseline_submission.csv')
Generating "baseline_submission.csv" file...
Successfully created "baseline_submission.csv"
# Look at the first few predictions to ensure things went smoothly
pd.read_csv('baseline_submission.csv').head()
| | id | target |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 2 | 1 |
| 2 | 3 | 1 |
| 3 | 9 | 1 |
| 4 | 11 | 1 |
Tweets can contain all sorts of words, but we can always find common words to use for our vocabulary.
The motivation for this graph is to see what kinds of words are commonly used within our tweets dataset, so we can decide what we should and should not include in our vocabulary.
from collections import Counter
# Plots the frequency of the top commonly used words in the provided tweets in
# the positive and negative class
def plot_frequency(tweets_np, labels_np, max_word_num=20):
# Creates a counter that keeps track of the frequency of words (similar to defaultdict)
pos_counter = Counter()
neg_counter = Counter()
tweets_np = np.copy(tweets_np) # Deep Copy of input
# Total Counter Variables
pos_counter_all = 0
neg_counter_all = 0
# Go through the tweets dataset
total_words = set()
for entry_index in range(tweets_np.shape[0]):
# Flatten all of the features into a single string
words = ' '.join([str(feature).lower() for feature in tweets_np[entry_index]])
# Count the frequency of each word
for word in words.split():
# Group all links as 1 token
if word.startswith('http'):
word = '<LINK>'
# Group all articles as 1 token
elif word in ['the', 'a', 'an']:
word = '<ARTICLE>'
if labels_np[entry_index]:
pos_counter[word] += 1 # Positive entry
pos_counter_all += 1
else:
neg_counter[word] += 1 # Negative entry
neg_counter_all += 1
# For Debug purposes, saving all of the words we encounter
total_words.add(word)
# Extract at most max_word_num words that are the most common words
# for both classes (and removes overlap)
top_pos_words = [word for word, _ in pos_counter.most_common(max_word_num // 2)]
top_neg_words = [word for word, _ in neg_counter.most_common(max_word_num // 2)]
top_words = set(top_pos_words + top_neg_words)
# Create a Data Frame for the collected data
result = {
'word': [word for word in top_words],
'pos count': [pos_counter[word] for word in top_words],
'neg count': [neg_counter[word] for word in top_words],
'% chance is pos': [(pos_counter[word] / (pos_counter[word] + neg_counter[word])) * 100 for word in top_words]
}
word_count_df = pd.DataFrame(data=result, columns=result.keys())
word_count_df = word_count_df.set_index('word')
display(word_count_df)
# Plot a bar graph that groups pos and neg count for a few of the most used words
pd.concat([word_count_df[['pos count']], word_count_df[['neg count']]], axis=1).plot.bar()
print(f'DEBUG: Total Words Len: {len(total_words)}')
print(f'DEBUG: Total Positive Examples: {pos_counter_all}')
print(f'DEBUG: Total Negative Examples: {neg_counter_all}')
# Display and plot at most 40 words from the X_data set
plot_frequency(X_data, Y_data, max_word_num=40)
| word | pos count | neg count | % chance is pos |
|---|---|---|---|
| with | 188 | 382 | 32.982456 |
| for | 399 | 489 | 44.932432 |
| to | 761 | 1195 | 38.905930 |
| - | 419 | 396 | 51.411043 |
| and | 512 | 927 | 35.580264 |
| <ARTICLE> | 2426 | 3372 | 41.842014 |
| i | 292 | 1061 | 21.581670 |
| my | 134 | 566 | 19.142857 |
| that | 181 | 357 | 33.643123 |
| of | 957 | 935 | 50.581395 |
| you | 125 | 629 | 16.578249 |
| in | 1186 | 854 | 58.137255 |
| nan | 1117 | 1479 | 43.027735 |
| new | 132 | 292 | 31.132075 |
| after | 181 | 75 | 70.703125 |
| <LINK> | 2510 | 2196 | 53.336167 |
| is | 348 | 590 | 37.100213 |
| at | 305 | 237 | 56.273063 |
| it | 164 | 342 | 32.411067 |
| as | 171 | 160 | 51.661631 |
| this | 179 | 287 | 38.412017 |
| from | 247 | 183 | 57.441860 |
| by | 279 | 242 | 53.550864 |
| on | 418 | 444 | 48.491879 |
| be | 113 | 287 | 28.250000 |
DEBUG: Total Words Len: 33580
DEBUG: Total Positive Examples: 61864
DEBUG: Total Negative Examples: 80205
After reviewing this information, there appear to be many words that are partially helpful, but we do not see any of the keywords we expected, such as "fire" or other natural-disaster terms.
As a result, we will need a large vocabulary so that it includes both these helpful words and the keywords we expected.
We were thinking that the number of tokens in a tweet might help determine whether the tweet is about a disaster or not.
from collections import defaultdict
def length_plot(tweets_np , labels_np, interval_list):
# initializes maps to count based on intervals of words
pos_intervals = defaultdict(int)
neg_intervals = defaultdict(int)
# Convert the interval list into ranges for use below
interval_map = {}
for interval in interval_list:
# Parses interval strings into useable ranges
if '-' in interval:
start, end = map(lambda x: int(x), interval.split('-'))
else:
start, end = interval.split('+')[0], 285  # e.g. '36+' becomes range(36, 286); 285 is a generous upper bound on tweet word count
interval_map[interval] = range(int(start), int(end) + 1)
# Track the counts of positive and negative inputs for each range from above
for entry_index in range(tweets_np.shape[0]):
# Flatten all of the features into a single string
words = ' '.join([str(feature) for feature in tweets_np[entry_index]]).split()
for interval in interval_list:
# Ignore words counts outside of our intervals
if len(words) not in interval_map[interval]:
continue
# Updates the counts of positive and negative entries in the dictionary
# based on the interval they are in
if labels_np[entry_index]:
pos_intervals[interval] += 1
else:
neg_intervals[interval] += 1
# Create a Data Frame for the collected data
result= {
'interval' : [ interval for interval in interval_list],
'pos intervals' : [pos_intervals[interval] for interval in interval_list],
'neg intervals' : [neg_intervals[interval] for interval in interval_list],
'% chance is pos': [(pos_intervals[interval] / (pos_intervals[interval] + neg_intervals[interval])) * 100 for interval in interval_list],
}
word_count_df = pd.DataFrame(data = result , columns=result.keys())
word_count_df = word_count_df.set_index('interval')
# Plot and Display the collected data
pd.concat([word_count_df[['pos intervals']], word_count_df[['neg intervals']]], axis=1).plot.bar()
display(word_count_df)
# Debugging code
print(neg_intervals)
print(pos_intervals)
# Plots the count of positive and negative classes of X_data based on the given intervals
intervals = ['0-5', '6-10' , '11-15' , '16-20' , '21-25' , '26-30' , '31-35', '36+']
length_plot(X_data, Y_data, intervals)
| interval | pos intervals | neg intervals | % chance is pos |
|---|---|---|---|
| 0-5 | 2 | 28 | 6.666667 |
| 6-10 | 164 | 463 | 26.156300 |
| 11-15 | 740 | 1017 | 42.117245 |
| 16-20 | 1041 | 1165 | 47.189483 |
| 21-25 | 973 | 1010 | 49.067070 |
| 26-30 | 317 | 552 | 36.478711 |
| 31-35 | 34 | 106 | 24.285714 |
| 36+ | 0 | 1 | 0.000000 |
defaultdict(<class 'int'>, {'6-10': 463, '0-5': 28, '16-20': 1165, '11-15': 1017, '21-25': 1010, '26-30': 552, '31-35': 106, '36+': 1})
defaultdict(<class 'int'>, {'16-20': 1041, '6-10': 164, '21-25': 973, '11-15': 740, '26-30': 317, '31-35': 34, '0-5': 2, '36+': 0})
It looks like there is a fair distribution of positive and negative tweets across the word-count buckets.
After viewing this graph, we decided it may be best not to include the number of words in our model inputs; bucketing the tweets based on these values may not be worth the extra model complexity.
We came up with three models using different algorithms learned in class, each varying in complexity. We ran several variations of each model, but we kept only the best version of each, along with an explanation of how we reached that point.
Since tweets can contain lots of punctuation, we specify a standardization step to sanitize our inputs. In addition, all links are replaced with a <LINK> token as a way of reducing the number of unique features.
NOTE: Though standardization helps ensure that our model works on unusual input, some information may be lost when applying these normalizations.
# Standardizes the input
def normalize_punctuation_and_links(tweets_text, show_debug=False):
# Make all letters lowercase
result_tensor = tf.strings.lower(tweets_text)
# Replace links with <LINK> token
link_regex = r'(https?:\/\/)([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w\.-]*)'
result_tensor = tf.strings.regex_replace(result_tensor, link_regex, '<LINK>')
# Remove punctuation (but don't remove <> from our Link tokens)
punctuation_regex = r'[^\w\s<>]'
result_tensor = tf.strings.regex_replace(result_tensor, punctuation_regex, ' ')
# Remove extra spaces
multi_space_regex = r'\s{2,}'
result_tensor = tf.strings.regex_replace(result_tensor, multi_space_regex, ' ')
if show_debug:
print('DEBUG: ', end='')
tf.print(result_tensor)
return result_tensor
normalize_punctuation_and_links('I.am.cool http://www.example.com, https://github.com/example', show_debug=True)
DEBUG: i am cool <LINK> <LINK>
<tf.Tensor: shape=(), dtype=string, numpy=b'i am cool <LINK> <LINK>'>
Our first model is a fairly simple FFNN. We use the concept of a "bag of embeddings", where word embeddings help the model identify which words are related to each other.
A limitation of this approach is that, as the name suggests, the words are not in any particular order (just in a messy "bag"), but there is still a lot of information the model can learn without word order.
def build_ffnn_model(tweets_np, max_vocab, max_tokens, embedding_dim):
# Remove randomness
tf.keras.backend.clear_session()
tf.keras.utils.set_random_seed(0)
# Build the normalization layer and vocabulary ahead of time
norm_layer = tf.keras.layers.TextVectorization(
max_tokens,
standardize=normalize_punctuation_and_links,
split='whitespace',
output_mode='int',
encoding='utf-8',
name='Normalization_Layer',
output_sequence_length=max_tokens,
)
norm_layer.adapt(tweets_np, batch_size=64)
# display(norm_layer.get_vocabulary())
# Build our FFNN Model using embeddings and max pooling
model = tf.keras.models.Sequential()
model.add(tf.keras.Input(shape=(1,), dtype=tf.string))
model.add(norm_layer)
model.add(tf.keras.layers.Embedding(
input_dim=max_vocab,
output_dim=embedding_dim,
input_length=max_tokens,
name='Embeddings_Layer',
))
# max pool layer
model.add(tf.keras.layers.GlobalMaxPooling1D(
name='Max_Pool_Layer'
))
# two dense layers with sigmoid activations
model.add(tf.keras.layers.Dense(
units=128,
activation='sigmoid',
name='Dense_Layer_1',
))
model.add(tf.keras.layers.Dense(
units=64,
activation='sigmoid',
name='Dense_Layer_2',
))
# output layer with a sigmoid activation
model.add(tf.keras.layers.Dense(
units=1,
activation='sigmoid',
name='Output_Layer',
))
# Compile and return the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
return model
A large vocabulary seems to benefit the model directly, as more features can be identified. An input sequence length of 300 with an embedding dimension of 12 was also found to work well, through tedious trial and error. The biggest obstacle was finding the right amount of training: this model was very prone to overfitting if trained past 11 epochs.
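As a possible alternative to hand-tuning the epoch count, the sketch below uses Keras' EarlyStopping callback; this is shown only as an option and was not used for the results that follow.
# Hypothetical alternative to hand-picking the number of epochs: stop training
# once the validation loss stops improving and keep the best weights seen so far.
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',
    patience=2,
    restore_best_weights=True,
)
# This would then be passed to model.fit(..., callbacks=[early_stopping]).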
ffnn_model = build_ffnn_model(X_train[:,3], 35000, 300, 12)
display(ffnn_model.summary())
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
Normalization_Layer (TextVe (None, 300) 0
ctorization)
Embeddings_Layer (Embedding (None, 300, 12) 420000
)
Max_Pool_Layer (GlobalMaxPo (None, 12) 0
oling1D)
Dense_Layer_1 (Dense) (None, 128) 1664
Dense_Layer_2 (Dense) (None, 64) 8256
Output_Layer (Dense) (None, 1) 65
=================================================================
Total params: 429,985
Trainable params: 429,985
Non-trainable params: 0
_________________________________________________________________
None
The second model is a CNN. This model is also fairly standard: it applies convolutional filters along the features to detect patterns that are useful for our predictions. We still use embeddings, since they are useful for identifying whether words are related.
Giovanni put this together and got it running, originally with two Conv1D layers, whose outputs were later concatenated together.
def build_cnn_model(tweets_np, max_vocab, max_tokens, embedding_dim):
# Remove randomness
tf.keras.backend.clear_session()
tf.keras.utils.set_random_seed(0)
# Build the normalization layer ahead of time
norm_layer = tf.keras.layers.TextVectorization(
max_tokens,
standardize=normalize_punctuation_and_links,
split='whitespace',
output_mode='int',
encoding='utf-8',
name='Normalization_Layer',
output_sequence_length=max_tokens,
)
norm_layer.adapt(tweets_np, batch_size=64)
# Build our CNN Model using Keras' Functional API
input_layer = tf.keras.Input(shape=(1,), dtype=tf.string, name='Input Layer')
norm_layer = norm_layer(input_layer)
embedding_layer = tf.keras.layers.Embedding(
input_dim=max_vocab,
output_dim=embedding_dim,
input_length=max_tokens,
name='Embeddings_Layer',
)(norm_layer)
# three Conv1D layers; the kernel sizes increase from layer to layer
conv_1 = tf.keras.layers.Conv1D(
filters=16,
kernel_size=3,
padding='same',
activation='relu',
name='Conv_Layer_1_1',
)(embedding_layer)
conv_2 = tf.keras.layers.Conv1D(
filters=16,
kernel_size=4,
padding='same',
activation='relu',
name='Conv_Layer_1_2',
)(embedding_layer)
conv_3 = tf.keras.layers.Conv1D(
filters=16,
kernel_size=5,
padding='same',
activation='relu',
name='Conv_Layer_1_3',
)(embedding_layer)
# combine the conv outputs into a single tensor (Dropout regularizes by randomly zeroing activations during training, while MaxPool1D shrinks the sequence dimension)
concat_layer = tf.keras.layers.Concatenate(name='Concatenate_Layer')([conv_1, conv_2, conv_3])
dropout_layer = tf.keras.layers.Dropout(rate=0.05)(concat_layer)
max_pool_layer = tf.keras.layers.MaxPool1D(pool_size=max_tokens, name='Max_Pool_Layer')(dropout_layer)
flatten_layer = tf.keras.layers.Flatten(name='Flatten_Layer_1')(max_pool_layer)
# hidden layer of size 32, sigmoid
hidden_layer = tf.keras.layers.Dense(
units=32,
activation='sigmoid',
name='Hidden_Layer_1',
)(flatten_layer)
# output layer
output_layer = tf.keras.layers.Dense(
units=1,
activation='sigmoid',
name='Output_Layer',
)(hidden_layer)
# adds input and output layer to model
model = tf.keras.Model(
inputs=input_layer,
outputs=output_layer,
name='CNN_Model',
)
# Compile and return the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
return model
As in the previous model, a larger vocabulary and a larger embedding layer (input length/dimension) gave a big boost, perhaps because we use the same preprocessing function across the board. We first started with just two convolutional layers. After doing some research online and seeing that common practice is to start with a smaller kernel size and increase it in each successive convolutional layer, we had better success. Our reasoning behind this choice was that the context window widens as the input is processed through the model. Our final change was adding one more convolutional layer, bringing the total to three.
cnn_model = build_cnn_model(X_train[:,3], 20000, 300, 9)
display(cnn_model.summary())
Model: "CNN_Model"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
Input Layer (InputLayer) [(None, 1)] 0 []
Normalization_Layer (TextVecto (None, 300) 0 ['Input Layer[0][0]']
rization)
Embeddings_Layer (Embedding) (None, 300, 9) 180000 ['Normalization_Layer[0][0]']
Conv_Layer_1_1 (Conv1D) (None, 300, 16) 448 ['Embeddings_Layer[0][0]']
Conv_Layer_1_2 (Conv1D) (None, 300, 16) 592 ['Embeddings_Layer[0][0]']
Conv_Layer_1_3 (Conv1D) (None, 300, 16) 736 ['Embeddings_Layer[0][0]']
Concatenate_Layer (Concatenate (None, 300, 48) 0 ['Conv_Layer_1_1[0][0]',
) 'Conv_Layer_1_2[0][0]',
'Conv_Layer_1_3[0][0]']
dropout (Dropout) (None, 300, 48) 0 ['Concatenate_Layer[0][0]']
Max_Pool_Layer (MaxPooling1D) (None, 1, 48) 0 ['dropout[0][0]']
Flatten_Layer_1 (Flatten) (None, 48) 0 ['Max_Pool_Layer[0][0]']
Hidden_Layer_1 (Dense) (None, 32) 1568 ['Flatten_Layer_1[0][0]']
Output_Layer (Dense) (None, 1) 33 ['Hidden_Layer_1[0][0]']
==================================================================================================
Total params: 183,377
Trainable params: 183,377
Non-trainable params: 0
__________________________________________________________________________________________________
None
The last model we tried was an RNN. Recurrent neural networks are still new to us, so this was the hardest model to build.
In contrast to the other models, this RNN processes the input sequence with Bidirectional LSTM layers (bi meaning two directions, forwards and backwards), so the model can use context from both earlier and later tokens in the tweet when updating its weights during backpropagation.
Oscar was able to get this model working, and we expect it to help us predict the correct label.
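As a quick sanity check on the layer sizes in the summary below, the sketch shows how wrapping an LSTM in a Bidirectional layer concatenates the forward and backward outputs, doubling the feature dimension (a standalone illustration, not part of our pipeline):
# A Bidirectional wrapper runs the LSTM forwards and backwards and concatenates
# the two outputs, so LSTM(64) yields 128 features per timestep.
demo_layer = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True))
demo_output = demo_layer(tf.zeros((1, 300, 9)))  # (batch, timesteps, embedding_dim)
print(demo_output.shape)  # expected: (1, 300, 128)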
def build_rnn_model(tweets_np, max_vocab, max_tokens, embedding_dim):
tf.keras.backend.clear_session()
tf.keras.utils.set_random_seed(0)
# create normalized vocab
norm = tf.keras.layers.TextVectorization(
max_tokens,
standardize=normalize_punctuation_and_links,
split='whitespace',
output_mode='int',
encoding='utf-8',
name='Normalization_Layer',
output_sequence_length=max_tokens,
)
norm.adapt(tweets_np , batch_size=64)
# Uses Sequential api with guidance from the tutorial https://www.tensorflow.org/text/tutorials/text_classification_rnn
model = tf.keras.Sequential()
model.add(tf.keras.Input(shape=(1,), dtype=tf.string, name='Input Layer'))
model.add(norm)
model.add(tf.keras.layers.Embedding(
input_dim=max_vocab,
output_dim=embedding_dim,
input_length=max_tokens,
name='Embeddings_Layer',
))
# bidirectional layers
model.add(tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences= True), name='Bidirectional_Layer_1'))
model.add(tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True ), name='Bidirectional_Layer_2'))
model.add(tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64), name='Bidirectional_Layer_3'))
# hidden sigmoid layers
model.add(tf.keras.layers.Dense(128, activation='sigmoid', name='hidden_layer_1'))
model.add(tf.keras.layers.Dense(64, activation='sigmoid', name='hidden_layer_2'))
# output layers
model.add(tf.keras.layers.Dense(
units=1,
activation='sigmoid',
name='Output_Layer',
))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
return model
rnn_model = build_rnn_model(X_train[:,3], 30000, 300, 9)
display(rnn_model.summary())
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
Normalization_Layer (TextVe (None, 300) 0
ctorization)
Embeddings_Layer (Embedding (None, 300, 9) 270000
)
Bidirectional_Layer_1 (Bidi (None, 300, 128) 37888
rectional)
Bidirectional_Layer_2 (Bidi (None, 300, 128) 98816
rectional)
Bidirectional_Layer_3 (Bidi (None, 128) 98816
rectional)
hidden_layer_1 (Dense) (None, 128) 16512
hidden_layer_2 (Dense) (None, 64) 8256
Output_Layer (Dense) (None, 1) 65
=================================================================
Total params: 530,353
Trainable params: 530,353
Non-trainable params: 0
_________________________________________________________________
None
def plot_history(history, epochs):
history = pd.DataFrame(history)
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Loss vs Epoch')
plt.plot(list(range(1, epochs + 1)), history['loss'], label="Train")
plt.plot(list(range(1, epochs + 1)), history['val_loss'], label="Validation")
plt.legend(loc='best')
plt.show()
print('Loss:', history['loss'].iloc[-1])
print('Val Loss:', history['val_loss'].iloc[-1])
plt.xlabel('Epochs')
plt.ylabel('Accuracy (in %)')
plt.title('Accuracy vs Epoch')
plt.plot(list(range(1, epochs + 1)), history['accuracy'] * 100, label="Train")
plt.plot(list(range(1, epochs + 1)), history['val_accuracy'] * 100, label="Validation")
plt.legend(loc='best')
plt.show()
print('Accuracy:', history['accuracy'].iloc[-1])
print('Val Accuracy:', history['val_accuracy'].iloc[-1])
def train_and_analyze_model(model, tweet_features, labels, num_of_epochs=16, num_per_batch=16, validation=0.1):
# Remove randomness
tf.keras.backend.clear_session()
tf.keras.utils.set_random_seed(0)
history = model.fit(
tweet_features[:,3],
labels,
epochs=num_of_epochs,
batch_size=num_per_batch,
validation_split=validation,
verbose=1,
)
plot_history(history.history, num_of_epochs)
def plot_predictions(predictions, actual):
from matplotlib.ticker import PercentFormatter
plt.title('Prediction Distribution')
plt.xlabel('Positive Confidence Level (in %)')
plt.ylabel('# of predictions')
plt.gca().xaxis.set_major_formatter(PercentFormatter())
plt.hist([predictions[actual == 0] * 100, predictions[actual == 1] * 100], label=['negative', 'positive'])
plt.legend(loc='best')
plt.show()
def plot_difference(predictions, actual):
plt.title('Prediction Difference Distribution')
plt.xlabel('Absolute Prediction Error (in %)')
plt.ylabel('# of predictions')
diff = abs(predictions - actual)
plt.hist([diff[actual == 0] * 100, diff[actual == 1] * 100], label=['negative', 'positive'])
plt.show()
def print_confusion_matrix(predictions, actual, threshold):
from sklearn.metrics import confusion_matrix
predictions[predictions >= threshold] = 1
predictions[predictions < threshold] = 0
tn, fp, fn, tp = confusion_matrix(actual, predictions).ravel()
print('True Positives:', tp)
print('True Negatives:', tn)
print('False Positives:', fp)
print('False Negatives:', fn)
print()
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print('Accuracy:', accuracy)
print('Precision:', precision)
print('Recall:', recall)
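Since Kaggle scores submissions with F1, a small helper (a sketch we did not use elsewhere in this notebook) can compute it from the same kind of thresholded counts as print_confusion_matrix above:
# Hypothetical helper: F1 score from thresholded predictions, mirroring the
# precision/recall computation in print_confusion_matrix.
def calculate_f1(predictions, actual, threshold=0.5):
    predicted_labels = (predictions >= threshold).astype(int)
    tp = ((predicted_labels == 1) & (actual == 1)).sum()
    fp = ((predicted_labels == 1) & (actual == 0)).sum()
    fn = ((predicted_labels == 0) & (actual == 1)).sum()
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0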
train_and_analyze_model(
ffnn_model,
X_train,
Y_train,
num_of_epochs=10,
num_per_batch=64,
validation=0.1,
)
Epoch 1/10
97/97 [==============================] - 2s 12ms/step - loss: 0.6862 - accuracy: 0.5689 - val_loss: 0.6785 - val_accuracy: 0.5802
Epoch 2/10
97/97 [==============================] - 1s 7ms/step - loss: 0.6715 - accuracy: 0.5732 - val_loss: 0.6377 - val_accuracy: 0.5875
Epoch 3/10
97/97 [==============================] - 1s 7ms/step - loss: 0.5872 - accuracy: 0.7155 - val_loss: 0.5403 - val_accuracy: 0.7391
Epoch 4/10
97/97 [==============================] - 1s 7ms/step - loss: 0.5152 - accuracy: 0.7543 - val_loss: 0.5100 - val_accuracy: 0.7522
Epoch 5/10
97/97 [==============================] - 1s 7ms/step - loss: 0.4819 - accuracy: 0.7739 - val_loss: 0.4972 - val_accuracy: 0.7711
Epoch 6/10
97/97 [==============================] - 1s 8ms/step - loss: 0.4651 - accuracy: 0.7849 - val_loss: 0.4946 - val_accuracy: 0.7682
Epoch 7/10
97/97 [==============================] - 1s 7ms/step - loss: 0.4538 - accuracy: 0.7893 - val_loss: 0.4952 - val_accuracy: 0.7726
Epoch 8/10
97/97 [==============================] - 1s 8ms/step - loss: 0.4459 - accuracy: 0.7942 - val_loss: 0.4987 - val_accuracy: 0.7595
Epoch 9/10
97/97 [==============================] - 1s 7ms/step - loss: 0.4386 - accuracy: 0.7958 - val_loss: 0.4976 - val_accuracy: 0.7726
Epoch 10/10
97/97 [==============================] - 1s 7ms/step - loss: 0.4332 - accuracy: 0.8016 - val_loss: 0.4995 - val_accuracy: 0.7580
Loss: 0.4331941604614258
Val Loss: 0.4994606077671051
Accuracy: 0.8016220331192017
Val Accuracy: 0.7580174803733826
Above, we see that the model converges fairly well. One small issue we noticed is that it does not generalize to unseen data especially well, but the results are acceptable for a simple, relatively uncomplicated model.
Notes
After multiple runs with different parameters and layer sizes, we noticed that this model trained and converged faster than the rest. Adding more hidden layers with more neurons did not have a significant impact on its accuracy; in fact, it may have hurt it. As of this writing, it is our best performing model, but it does not have much room to grow. One thing that might help is a more sophisticated preprocessing step, one that can also handle the typos that often occur in online discourse.
ffnn_train_predictions = ffnn_model.predict(X_train[:, 3]).flatten()
ffnn_train_loss = calculate_loss(Y_train, ffnn_train_predictions)
print(f'Train Loss: {ffnn_train_loss}')
ffnn_validation_predictions = ffnn_model.predict(X_validation[:, 3]).flatten()
ffnn_validation_loss = calculate_loss(Y_validation, ffnn_validation_predictions)
print(f'Val Loss: {ffnn_validation_loss}')
215/215 [==============================] - 1s 2ms/step
Train Loss: 0.43176578291667506
24/24 [==============================] - 0s 2ms/step
Val Loss: 0.522468330900813
plot_predictions(ffnn_train_predictions, Y_train)
plot_difference(ffnn_train_predictions, Y_train)
print_confusion_matrix(ffnn_train_predictions, Y_train, 0.7)
True Positives: 1587
True Negatives: 3756
False Positives: 149
False Negatives: 1359
Accuracy: 0.7798861480075902
Precision: 0.9141705069124424
Recall: 0.5386965376782077
The performance of this model was middle of the pack after our modifications. The plots above show the distribution of predictions from two different angles. Looking at the raw distribution of predictions, the model might fare best with a threshold between 0.7 and 0.8. In addition, looking at the difference distribution, we can see the model makes many errors in the form of false negatives. This can be seen in both graphs: the prediction distribution shows a fair number of positives near the middle of the range, all of which are missed with a high threshold, and the difference plot shows a larger mass of positive examples toward the right, signaling false negatives.
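To make that threshold choice less of a guess, a quick sweep over a few candidate thresholds on the un-thresholded validation predictions is sketched below (it reuses ffnn_validation_predictions and Y_validation from the cells above):
from sklearn.metrics import f1_score

# Report the validation F1 score at a few candidate thresholds to sanity-check
# the 0.7 threshold chosen for the submission below.
for threshold in [0.5, 0.6, 0.7, 0.8]:
    thresholded = (ffnn_validation_predictions >= threshold).astype(int)
    print(f'Threshold {threshold:.1f} -> F1: {f1_score(Y_validation, thresholded):.4f}')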
ffnn_predictions = ffnn_model.predict(test_input_np[:, 3]).flatten()
# Apply threshold
THRESHOLD = 0.7
ffnn_predictions[ffnn_predictions >= THRESHOLD] = 1
ffnn_predictions[ffnn_predictions < THRESHOLD] = 0
save_to_submissions_csv(test_input_np, ffnn_predictions, 'ffnn_submission.csv')
102/102 [==============================] - 0s 2ms/step
Generating "ffnn_submission.csv" file...
Successfully created "ffnn_submission.csv"
pd.read_csv('ffnn_submission.csv').head()
| | id | target |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 2 | 1 |
| 2 | 3 | 0 |
| 3 | 9 | 0 |
| 4 | 11 | 1 |
train_and_analyze_model(
cnn_model,
X_train,
Y_train,
num_of_epochs=10,
num_per_batch=32,
validation=0.1,
)
Epoch 1/10
193/193 [==============================] - 4s 14ms/step - loss: 0.7237 - accuracy: 0.5306 - val_loss: 0.6533 - val_accuracy: 0.5875
Epoch 2/10
193/193 [==============================] - 2s 13ms/step - loss: 0.6209 - accuracy: 0.6701 - val_loss: 0.5798 - val_accuracy: 0.7012
Epoch 3/10
193/193 [==============================] - 2s 12ms/step - loss: 0.5476 - accuracy: 0.7294 - val_loss: 0.5098 - val_accuracy: 0.7595
Epoch 4/10
193/193 [==============================] - 2s 12ms/step - loss: 0.4848 - accuracy: 0.7721 - val_loss: 0.4892 - val_accuracy: 0.7770
Epoch 5/10
193/193 [==============================] - 2s 13ms/step - loss: 0.4684 - accuracy: 0.7826 - val_loss: 0.4871 - val_accuracy: 0.7755
Epoch 6/10
193/193 [==============================] - 3s 13ms/step - loss: 0.4613 - accuracy: 0.7857 - val_loss: 0.4884 - val_accuracy: 0.7653
Epoch 7/10
193/193 [==============================] - 3s 13ms/step - loss: 0.4546 - accuracy: 0.7896 - val_loss: 0.4866 - val_accuracy: 0.7770
Epoch 8/10
193/193 [==============================] - 3s 13ms/step - loss: 0.4495 - accuracy: 0.7909 - val_loss: 0.4870 - val_accuracy: 0.7711
Epoch 9/10
193/193 [==============================] - 3s 13ms/step - loss: 0.4443 - accuracy: 0.7990 - val_loss: 0.4929 - val_accuracy: 0.7638
Epoch 10/10
193/193 [==============================] - 3s 14ms/step - loss: 0.4402 - accuracy: 0.7982 - val_loss: 0.4905 - val_accuracy: 0.7682
Loss: 0.44017520546913147
Val Loss: 0.4905254542827606
Accuracy: 0.7982157468795776
Val Accuracy: 0.7682215571403503
Using the loss and accuracy curves above, we can see that this model might be slightly overfitting the data: it does not generalize to new data as well and struggles a little during validation.
cnn_train_predictions = cnn_model.predict(X_train[:, 3]).flatten()
cnn_train_loss = calculate_loss(Y_train, cnn_train_predictions)
print(f'Train Loss: {cnn_train_loss}')
cnn_validation_predictions = cnn_model.predict(X_validation[:, 3]).flatten()
cnn_validation_loss = calculate_loss(Y_validation, cnn_validation_predictions)
print(f'Val Loss: {cnn_validation_loss}')
215/215 [==============================] - 1s 4ms/step
Train Loss: 0.43194361795525693
24/24 [==============================] - 0s 5ms/step
Val Loss: 0.5099057625701744
plot_predictions(cnn_train_predictions, Y_train)
plot_difference(cnn_train_predictions, Y_train)
print_confusion_matrix(cnn_train_predictions, Y_train, 0.7)
True Positives: 1613
True Negatives: 3771
False Positives: 134
False Negatives: 1333
Accuracy: 0.7858706758137498
Precision: 0.9232970807097882
Recall: 0.5475220638153429
This model has a similar problem to the first one: the same pattern appears, where a decent chunk of false negatives are being mislabeled. One reason for this may be the combination of our text-processing technique and the use of embeddings. In contrast, negative examples are being predicted with a reasonable amount of accuracy.
cnn_predictions = cnn_model.predict(test_input_np[:, 3]).flatten()
# Apply threshold
THRESHOLD = 0.7
cnn_predictions[cnn_predictions >= THRESHOLD] = 1
cnn_predictions[cnn_predictions < THRESHOLD] = 0
save_to_submissions_csv(test_input_np, cnn_predictions, 'cnn_submission.csv')
102/102 [==============================] - 0s 4ms/step
Generating "cnn_submission.csv" file...
Successfully created "cnn_submission.csv"
pd.read_csv('cnn_submission.csv').head()
| | id | target |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 2 | 1 |
| 2 | 3 | 0 |
| 3 | 9 | 0 |
| 4 | 11 | 1 |
NOTE: This RNN model has the longest training time. If you are in a time crunch and are only interested in viewing our results, please view the already saved output for the next few cells.
train_and_analyze_model(
rnn_model,
X_train,
Y_train,
num_of_epochs=10,
num_per_batch=32,
validation=0.1,
)
Epoch 1/10
193/193 [==============================] - 119s 547ms/step - loss: 0.6519 - accuracy: 0.6076 - val_loss: 0.5534 - val_accuracy: 0.7362
Epoch 2/10
193/193 [==============================] - 99s 513ms/step - loss: 0.5288 - accuracy: 0.7457 - val_loss: 0.5017 - val_accuracy: 0.7566
Epoch 3/10
193/193 [==============================] - 99s 515ms/step - loss: 0.4947 - accuracy: 0.7674 - val_loss: 0.4997 - val_accuracy: 0.7566
Epoch 4/10
193/193 [==============================] - 97s 500ms/step - loss: 0.4834 - accuracy: 0.7734 - val_loss: 0.4887 - val_accuracy: 0.7697
Epoch 5/10
193/193 [==============================] - 94s 485ms/step - loss: 0.4741 - accuracy: 0.7810 - val_loss: 0.4818 - val_accuracy: 0.7799
Epoch 6/10
193/193 [==============================] - 98s 508ms/step - loss: 0.4710 - accuracy: 0.7815 - val_loss: 0.4940 - val_accuracy: 0.7609
Epoch 7/10
193/193 [==============================] - 97s 501ms/step - loss: 0.4586 - accuracy: 0.7899 - val_loss: 0.4816 - val_accuracy: 0.7799
Epoch 8/10
193/193 [==============================] - 95s 494ms/step - loss: 0.4538 - accuracy: 0.7953 - val_loss: 0.5022 - val_accuracy: 0.7653
Epoch 9/10
193/193 [==============================] - 97s 501ms/step - loss: 0.4498 - accuracy: 0.7908 - val_loss: 0.4894 - val_accuracy: 0.7566
Epoch 10/10
193/193 [==============================] - 97s 503ms/step - loss: 0.4435 - accuracy: 0.7968 - val_loss: 0.4822 - val_accuracy: 0.7741
Loss: 0.44350114464759827
Val Loss: 0.48224204778671265
Accuracy: 0.796755850315094
Val Accuracy: 0.7740525007247925
This model is the most chaotic of the three. As we can see, it is not performing well in its current state: it does not converge well, although it does seem to generalize reasonably. It is possible that, given enough training time, it would perform slightly better.
After several rounds of trial and error on this model, we expanded our bidirectional layers to three. This, however, was not the best decision. The thought process was that adding more layers could help the model learn the word context of the tweets, rather than making predictions on single words as the FFNN does. Another unforeseen effect was the amount of time it takes to fully train the model; some variations took up to 30 minutes. We also learned that more complexity does not mean better results: three bidirectional layers performed worse than a model with two.
Out of all our experiments, this one is the worst in terms of both accuracy and training time. As we trained different variations, we noted that adding complexity does not necessarily add any benefit and may instead hinder the learning process. Despite these results, we still think an RNN could serve as a useful basis.
As of this writing, the model is still a work in progress but has shown promise. We will continue to work on it to try to get the highest Kaggle score possible.
rnn_train_predictions = rnn_model.predict(X_train[:, 3]).flatten()
rnn_train_loss = calculate_loss(Y_train, rnn_train_predictions)
print(f'Train Loss: {rnn_train_loss}')
rnn_validation_predictions = rnn_model.predict(X_validation[:, 3]).flatten()
rnn_validation_loss = calculate_loss(Y_validation, rnn_validation_predictions)
print(f'Val Loss: {rnn_validation_loss}')
215/215 [==============================] - 1s 4ms/step
Train Loss: 0.43194361795525693
24/24 [==============================] - 0s 4ms/step
Val Loss: 0.5099057625701744
plot_predictions(rnn_train_predictions, Y_train)
plot_difference(rnn_train_predictions, Y_train)
print_confusion_matrix(rnn_train_predictions, Y_train, 0.7)
True Positives: 1613
True Negatives: 3771
False Positives: 134
False Negatives: 1333
Accuracy: 0.7858706758137498
Precision: 0.9232970807097882
Recall: 0.5475220638153429
Here the same problem emerges: a large number of false negatives are being predicted. Interestingly, however, the model does a very good job of predicting negative examples.
rnn_predictions = rnn_model.predict(test_input_np[:, 3]).flatten()
# Apply threshold
THRESHOLD = 0.7
rnn_predictions[rnn_predictions >= THRESHOLD] = 1
rnn_predictions[rnn_predictions < THRESHOLD] = 0
save_to_submissions_csv(test_input_np, rnn_predictions, 'rnn_submission.csv')
102/102 [==============================] - 19s 157ms/step
Generating "rnn_submission.csv" file...
Successfully created "rnn_submission.csv"
pd.read_csv('rnn_submission.csv').head()
| | id | target |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 2 | 0 |
| 2 | 3 | 1 |
| 3 | 9 | 0 |
| 4 | 11 | 1 |
Overall, our models perform much better than the baseline, showing that they have learned something from the training process rather than predicting the same class all the time. Below is a chart that shows the performance of all of our models.
Note: Some of these values change between runs due to inner randomness in some layers (despite setting a global seed), but the values stay very close from one training run to the next.
| Model | Training Loss | Training Accuracy | Validation Loss | Validation Accuracy | F1 Score |
|---|---|---|---|---|---|
| Baseline | 7.900914820187636 | 0.4308860020434973 | 7.687370260599765 | 0.41863517060367456 | 0.42966 |
| FFNN | 0.43176037073135376 | 0.8029196858406067 | 0.5280683636665344 | 0.7434402108192444 | 0.75176 |
| CNN | 0.4304109811782837 | 0.8103811740875244 | 0.5113658905029297 | 0.7594752311706543 | 0.74287 |
| RNN | 0.44599589705467224 | 0.7956204414367676 | 0.5282750725746155 | 0.7448979616165161 | 0.72816 |
- Most General: RNN
- Highest F1: FFNN
- Smallest Validation Loss: CNN
After sifting through the graphs for all the models, the main hurdle we struggled with the most is the number of false negatives the models continue to produce. The root of the problem may be the custom preprocessing function we wrote for this project: it is shared by all models, so it may be the biggest factor hindering the learning process. One possibility for a better preprocessing function is to break words apart into n-grams, so that links can be broken down into the individual websites that appear frequently (e.g., common news sites).
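As one possible direction, the sketch below swaps the word-level vectorizer for character n-grams via TextVectorization's built-in options; this is an untested idea for a future preprocessing variant, not something we trained with, and the parameter values are placeholders.
# Hypothetical preprocessing variant: character n-grams (up to trigrams) instead
# of whole words, which tends to be more tolerant of typos in online text.
char_ngram_layer = tf.keras.layers.TextVectorization(
    max_tokens=20000,
    standardize=normalize_punctuation_and_links,
    split='character',
    ngrams=3,
    output_mode='int',
    output_sequence_length=300,
)
char_ngram_layer.adapt(X_train[:, 3], batch_size=64)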