Twitter-Sentiment-Analysis--BigData

Introduction:

Sentiment analysis is used to analyse text data in order to understand the underlying sentiment (positive or negative).
It uses natural language processing (NLP) and machine learning to determine the emotional intent behind a communication.
#Twittershutdown and #Riptwitter were trending when Elon Musk took charge of Twitter and hundreds of Twitter employees sent in their resignations.
This project performs sentiment analysis on tweets collected for the hashtags #Twittershutdown, #Riptwitter, #Elon Musk and so on.

Problem Statement:

Collect tweets using the Twitter API from an AWS EC2 instance, stream the tweets using Kinesis Firehose, and store the data in an AWS S3 bucket.
Create a binary classification model to classify the sentiment of each tweet as positive or negative, with label = Sentiment (0 = negative, 1 = positive).
Create a QuickSight dashboard for the collected data and for the predictions from the classification model.
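
The collection step (Twitter API to Kinesis Firehose to S3) can be sketched as below. This is a minimal sketch under stated assumptions, not the code in Twitter.py: the stream name, the tweet fields, and the `send_tweets` helper are hypothetical. It writes each tweet as one JSON line and sends the records in batches of at most 500, the per-call record limit of Firehose's PutRecordBatch. The `client` argument is expected to be a boto3 Firehose client (`boto3.client("firehose")`); a stub stands in for it here so the snippet runs without AWS credentials.

```python
import json

MAX_BATCH = 500  # PutRecordBatch accepts at most 500 records per call


def to_record(tweet):
    """Serialize one tweet dict as a newline-terminated JSON record."""
    return {"Data": (json.dumps(tweet) + "\n").encode("utf-8")}


def send_tweets(client, stream_name, tweets):
    """Send tweets to a Firehose delivery stream in batches; return the batch count."""
    records = [to_record(t) for t in tweets]
    batches = 0
    for start in range(0, len(records), MAX_BATCH):
        client.put_record_batch(
            DeliveryStreamName=stream_name,
            Records=records[start:start + MAX_BATCH],
        )
        batches += 1
    return batches


class StubFirehose:
    """Stand-in for boto3.client("firehose") that only records the calls."""

    def __init__(self):
        self.calls = []

    def put_record_batch(self, DeliveryStreamName, Records):
        self.calls.append((DeliveryStreamName, len(Records)))
        return {"FailedPutCount": 0}


stub = StubFirehose()
# Hypothetical tweets; real records would come from the Twitter API.
sample = [{"id": i, "text": "#Riptwitter example %d" % i} for i in range(1200)]
n_batches = send_tweets(stub, "hypothetical-tweet-stream", sample)
print(n_batches, stub.calls)
```

With a real client, the response's `FailedPutCount` should be checked and any failed records retried, since PutRecordBatch can partially succeed.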

Tools used:

AWS, Twitter API, Amazon Kinesis Firehose, PySpark, Amazon QuickSight, Databricks

Data

Data Collection:

399,333 tweets were collected using the Twitter API and stored in AWS S3.
Using the Databricks environment, connected to the S3 bucket and mounted the data by creating a Spark session.

Data preprocessing:

Created a PySpark DataFrame object for the Twitter data.
Checked for null values and dropped rows with null values.
Converted the create_at column to a datetime column.
Used regular expressions to clean the tweet and location columns.
TextBlob, a Python library for text analysis, was used to assign a sentiment to each tweet.
Created a Sentiment column with value 0 if a tweet has negative sentiment and 1 if it has positive sentiment.
After cleaning, 135,083 tweets remained, of which 45,760 had positive sentiment and 89,323 had negative sentiment.
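
The cleaning and labeling steps above might look roughly like the following sketch. The exact regular expressions used in the notebook are not given in this README, so the patterns here are assumptions. The labeling rule is also an assumption: a polarity score in [-1, 1], such as the one TextBlob exposes as `sentiment.polarity`, is mapped to 1 when it is above 0 and to 0 otherwise, which counts neutral tweets as negative. The score is passed in as a plain number so the snippet runs without TextBlob installed.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
RETWEET_RE = re.compile(r"\bRT\b")
NON_ALPHA_RE = re.compile(r"[^A-Za-z\s]")
SPACE_RE = re.compile(r"\s+")


def clean_tweet(text):
    """Strip URLs, @mentions, the RT marker, and non-letter characters."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = RETWEET_RE.sub(" ", text)
    text = NON_ALPHA_RE.sub(" ", text)  # drops '#', digits, punctuation, emoji
    return SPACE_RE.sub(" ", text).strip()


def label_from_polarity(polarity):
    """Map a polarity score in [-1, 1] to the binary Sentiment label.

    Assumption: scores above 0 are positive (1); everything else,
    including a neutral 0.0, is labeled negative (0).
    """
    return 1 if polarity > 0 else 0


raw = "RT @user: Twitter is done!! #RipTwitter https://t.co/abc123"
cleaned = clean_tweet(raw)
print(cleaned)
print(label_from_polarity(0.4), label_from_polarity(0.0), label_from_polarity(-0.2))
```

In PySpark, functions like these would typically be applied to the tweet column through `regexp_replace` or a UDF rather than called row by row in plain Python.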

Model:

Feature Engineering:

Tokenizer converts the tweet column to lowercase and splits it on whitespace, outputCol="tokens".
StopWordsRemover removes stop words from tokens, outputCol="filtered".
CountVectorizer converts the filtered tweets into a matrix of token counts, outputCol="cv".
IDF (inverse document frequency) rescales the counts so that words appearing in most tweets are down-weighted, and terms appearing in too few tweets can be zeroed out through its minDocFreq parameter, outputCol="1gram_idf".
NGram (n=2) is a feature transformer that converts the input array of strings into an array of n-grams, outputCol="2gram".
HashingTF maps a sequence of terms to their term frequencies using the hashing trick, numFeatures=20000, outputCol="2gram_tf".
IDF is applied again to the bigram term frequencies to down-weight common terms and zero out rare ones, outputCol="2gram_idf".
VectorAssembler merges the "1gram_idf" and "2gram_tf" columns into a single vector column, outputCol="rawFeatures".
ChiSqSelector treats the entries of rawFeatures as categorical features and uses a chi-squared test of independence against the label to select among them, reducing the number of features to 16000, outputCol="features".
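
To make the transformers above concrete, here is a small pure-Python illustration of what the tokenizing, stop-word, bigram, hashing, and IDF stages compute. It is a sketch for intuition, not the Spark pipeline from the notebook: the stop-word list, the sample tweets, and all function names are made up, `zlib.crc32` stands in for the MurmurHash3 function that Spark's HashingTF uses, and the bucket count is 16 instead of 20000 so the output stays readable. The IDF formula, log((m + 1) / (df + 1)) for m documents, is the smoothed form documented for Spark's IDF.

```python
import math
import zlib
from collections import Counter

STOPWORDS = {"the", "is", "a", "to", "and", "of"}  # tiny illustrative list
NUM_FEATURES = 16  # the README uses 20000; kept small here for readability


def tokenize(text):
    """Lowercase and split on whitespace, roughly what Tokenizer does."""
    return text.lower().split()


def remove_stopwords(tokens):
    """Drop stop words, as StopWordsRemover does with its own word list."""
    return [t for t in tokens if t not in STOPWORDS]


def bigrams(tokens):
    """Join each pair of adjacent tokens with a space, like NGram(n=2)."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def hashing_tf(terms, num_features=NUM_FEATURES):
    """Map terms to bucket counts with the hashing trick (crc32 as a stand-in hash)."""
    counts = Counter()
    for term in terms:
        counts[zlib.crc32(term.encode("utf-8")) % num_features] += 1
    return dict(counts)


def idf_weights(docs):
    """Smoothed IDF per feature: log((m + 1) / (df + 1))."""
    m = len(docs)
    # Each doc is a dict, so iterating it yields every feature once per document.
    df = Counter(key for doc in docs for key in doc)
    return {key: math.log((m + 1) / (n + 1)) for key, n in df.items()}


tweets = ["Twitter is shutting down", "RIP Twitter", "The end of Twitter"]
filtered = [remove_stopwords(tokenize(t)) for t in tweets]
unigram_tf = [dict(Counter(tokens)) for tokens in filtered]  # CountVectorizer analogue
bigram_tf = [hashing_tf(bigrams(tokens)) for tokens in filtered]

idf = idf_weights(unigram_tf)
unigram_tfidf = [{t: c * idf[t] for t, c in doc.items()} for doc in unigram_tf]

print(filtered)
print(bigrams(filtered[0]))
print(unigram_tfidf[0])
```

In this toy corpus "twitter" appears in every tweet, so its IDF weight is log(4 / 4) = 0, while "shutting" appears once and gets log(4 / 2). That is the down-weighting of ubiquitous terms that IDF provides; the hashed bigram counts can collide when two bigrams fall into the same bucket, which is the trade-off HashingTF makes for a fixed-size vector.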

Model Development and Evaluation:

Data was split into 90% train and 10% test data.
The Sentiment column is the label: 0 = negative sentiment, 1 = positive sentiment.
We tried RandomForestClassifier and LogisticRegression models to classify each tweet in the test data as positive or negative.
With RandomForestClassifier we achieved 66% accuracy and a 72.87% ROC-AUC score.
Classification report for RandomForestClassifier as follows:
LogisticRegression gave us an accuracy of 90.425% and a ROC-AUC score of 92.83%.
Classification report for LogisticRegression as follows:

The LogisticRegression model gave us better accuracy, so its predictions were saved back to the AWS S3 bucket.
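
For readers unfamiliar with the two metrics reported above, the sketch below computes accuracy and ROC-AUC by hand on a handful of made-up labels and scores. It is only an illustration: this README does not say how the metrics were computed, and in PySpark they would normally come from `MulticlassClassificationEvaluator` and `BinaryClassificationEvaluator`. ROC-AUC is computed here as the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one.

```python
def accuracy(labels, predictions):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def roc_auc(labels, scores):
    """ROC-AUC as the chance that a random positive outranks a random negative.

    Ties count as half a win. Quadratic in the data size, which is fine
    for a small illustration but not for a real test set.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted probabilities of the positive class.
labels = [0, 0, 0, 1, 1, 0]
scores = [0.1, 0.4, 0.35, 0.8, 0.3, 0.2]
predictions = [1 if s >= 0.5 else 0 for s in scores]

acc = accuracy(labels, predictions)
auc = roc_auc(labels, scores)
print(acc, auc)
```

Note that 66% is also the share of negative tweets in the cleaned data, so an accuracy of 66% is what a model that mostly predicts the majority class would reach. ROC-AUC depends on how the scores rank the tweets rather than on a single threshold, which makes it a useful second check here.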

QuickSight Dashboard

Tweets post data preprocessing:

66% of the tweets had negative sentiment.
Top 10 locations by number of tweets: location does not seem to contribute to tweet sentiment, as the locations show almost the same percentages of negative and positive tweets.

Predictions:

Of the 8.9K negative tweets, the model correctly predicted 8.19K as tweets with negative sentiment.
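
The figures above can be cross-checked against the numbers reported earlier. The short calculation below assumes that the 8.9K negative tweets are the negative tweets in the 10% test split (the dashboard does not say so explicitly) and that the dashboard values are rounded. Under that assumption it compares the expected number of negative test tweets with the reported one and derives the recall on the negative class.

```python
total_negative = 89_323   # negative tweets after cleaning
test_fraction = 0.10      # 90% train / 10% test split

# A random split will not give exactly 10%, so this is only an expectation.
expected_test_negative = total_negative * test_fraction

reported_test_negative = 8_900   # "8.9K", rounded on the dashboard
correct_negative = 8_190         # "8.19K", rounded on the dashboard

# Recall on the negative class: correctly predicted negatives / all negatives.
negative_recall = correct_negative / reported_test_negative

print(round(expected_test_negative))
print(round(negative_recall, 3))
```

The expected count of about 8.9K matches the dashboard, and the recall on negative tweets comes out at roughly 92%.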
