- Project Overview
- Project Architecture
- Data Ingestion and Cleaning
- SFTP Transfer
- Data Modeling and Design
- DWH Load
- Data Visualization
This project aims to provide a deep analysis of user feedback on ChatGPT, using data sourced from Twitter. The objective is to ingest, clean, model, and visualize user opinions and discussions to uncover trends and insights regarding ChatGPT. By analyzing this social media data, we can better understand user sentiment, highlighting key areas such as popular features, common frustrations, and suggestions for improvement. The project features an end-to-end data pipeline, storing the data in a structured Hive data warehouse and ultimately creating insightful visualizations.
The project kicks off by using Python to connect to the Twitter API and extract tweets containing the term "ChatGPT." These tweets are filtered to focus on user feedback, ranging from positive experiences to areas of concern. After collection, the raw data undergoes cleaning to handle missing values and eliminate duplicates.
The processed data is then organized into a star schema, composed of fact and dimension tables, for optimized querying and analysis. CSV files with the cleaned data are securely transferred from an Ubuntu virtual machine to a Cloudera virtual machine using the SFTP protocol. On the Cloudera machine, the data is ingested into a Hive data warehouse, where it is structured and stored in tables.
With the data securely housed in Hive, Power BI is connected to access the Hive data warehouse on Cloudera. Within Power BI, an interactive dashboard is built, allowing for exploration of user sentiment, trending topics, and key insights around ChatGPT feedback. This visualization provides stakeholders with a clear view of user perceptions, helping to guide future improvements and feature development.
The Twitter API (Application Programming Interface) allows developers to access Twitter's features and data programmatically. Using the Twitter API, we were able to collect valuable, real-time information from user tweets. Access to this API is granted through secret keys provided by a Twitter Developer account. One of the first challenges encountered was dealing with the raw format of the tweets. The screenshot below shows a snippet of a tweet obtained from the API, illustrating the raw nature of the data.
We can visually identify some interesting data points, such as the tweet's timestamp, source link, and language. To interact with this API, we use the Tweepy library in Python. The collected tweet data required extensive cleaning to prepare it for analysis. This process involved removing irrelevant information, handling missing values, and addressing inconsistencies in the data. Key tasks included (a sketch of these steps follows the list):
- Removing Duplicates: Identifying and removing duplicate tweets to ensure a unique dataset.
- Handling Missing Values: Filling or removing missing data entries to maintain dataset integrity.
- Standardizing Data: Formatting timestamps, URLs, and other fields to ensure consistency across the dataset.
- Preparing tweets for sentiment analysis: Stripping emojis, hashtags, mentions, and other noise from the tweet text.
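The block below is a minimal sketch of how collection and cleaning might look in Python, assuming Tweepy's v2 `Client` with a bearer token and pandas. The query, column names, and cleaning rules are illustrative rather than the project's exact code, and field availability (e.g. `source`) depends on the API version and access tier.

```python
import re
import pandas as pd
import tweepy

# Assumption: a Twitter API v2 bearer token from the developer account.
client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")

# Pull recent English tweets mentioning ChatGPT (retweets excluded).
response = client.search_recent_tweets(
    query="ChatGPT -is:retweet lang:en",
    tweet_fields=["created_at", "lang", "source"],
    max_results=100,
)

# Flatten the response into a DataFrame (column names are illustrative).
df = pd.DataFrame(
    [
        {
            "tweet_id": t.id,
            "text": t.text,
            "created_at": t.created_at,
            "source": t.source,
            "lang": t.lang,
        }
        for t in (response.data or [])
    ]
)

# Removing duplicates: keep one row per tweet ID.
df = df.drop_duplicates(subset="tweet_id")

# Handling missing values: drop rows with no text, fill optional fields.
df = df.dropna(subset=["text"])
df["source"] = df["source"].fillna("unknown")

# Standardizing data: normalize timestamps to UTC.
df["created_at"] = pd.to_datetime(df["created_at"], utc=True)

# Preparing tweets for sentiment analysis: strip mentions, links,
# hashtag symbols, and non-ASCII characters such as emojis.
def clean_text(text: str) -> str:
    text = re.sub(r"@\w+", "", text)                 # mentions
    text = re.sub(r"http\S+", "", text)              # links
    text = re.sub(r"#", "", text)                    # hashtag symbol only
    text = text.encode("ascii", "ignore").decode()   # emojis / non-ASCII
    return text.strip()

df["clean_text"] = df["text"].apply(clean_text)
```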
We performed sentiment analysis on the tweets using a custom Python class named `TweetAnalyzer`. This class leverages the `transformers` library and the `cardiffnlp/twitter-roberta-base-sentiment` model to classify tweets as Negative, Neutral, or Positive.
The `TweetAnalyzer` class is designed to process multiple tweets at once. It first pre-processes the tweets by removing usernames and links to ensure cleaner input for the model. It then encodes the pre-processed tweets and passes them through the RoBERTa model, applying softmax to compute the sentiment scores for each tweet. This allowed us to quickly analyze the overall sentiment of a large dataset of tweets, providing valuable insights into user opinions.
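The sketch below illustrates what such a class might look like, following the description above (username/link masking, batch encoding, softmax over the three classes). The method names and return format are illustrative assumptions, not the project's exact implementation.

```python
import re
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

class TweetAnalyzer:
    """Minimal sketch of a batch sentiment classifier for tweets."""

    LABELS = ["Negative", "Neutral", "Positive"]
    MODEL_NAME = "cardiffnlp/twitter-roberta-base-sentiment"

    def __init__(self):
        self.tokenizer = AutoTokenizer.from_pretrained(self.MODEL_NAME)
        self.model = AutoModelForSequenceClassification.from_pretrained(self.MODEL_NAME)

    @staticmethod
    def _preprocess(tweet: str) -> str:
        # Mask usernames and links so the model sees cleaner input.
        tweet = re.sub(r"@\w+", "@user", tweet)
        tweet = re.sub(r"http\S+", "http", tweet)
        return tweet.strip()

    def analyze(self, tweets):
        # Pre-process and encode the whole batch at once.
        batch = [self._preprocess(t) for t in tweets]
        encoded = self.tokenizer(batch, padding=True, truncation=True, return_tensors="pt")

        # Forward pass through RoBERTa, then softmax over the three classes.
        with torch.no_grad():
            logits = self.model(**encoded).logits
        scores = torch.softmax(logits, dim=-1)

        # Return the label and confidence of the top class for each tweet.
        results = []
        for row in scores:
            idx = int(torch.argmax(row))
            results.append({"label": self.LABELS[idx], "score": float(row[idx])})
        return results

# Example usage:
# analyzer = TweetAnalyzer()
# print(analyzer.analyze(["I love ChatGPT!", "@user the api is down again http..."]))
```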
To transfer the CSV files from the Ubuntu VM to the Cloudera VM, we use a shell script that relies on SFTP (Secure File Transfer Protocol) to securely copy the files to the remote Cloudera VM.
The shell script, named `export_script.sh`, automates the file transfer process. The script can be found here: export_script.sh
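As a rough illustration of the approach (the actual script may differ), a batch-mode `sftp` upload of the cleaned CSVs could look like the sketch below; the user, host, and directory paths are placeholders.

```bash
#!/bin/bash
# Hypothetical sketch of an SFTP export: pushes the cleaned CSV files
# from the Ubuntu VM to the Cloudera VM. All values below are placeholders.
REMOTE_USER="cloudera"
REMOTE_HOST="cloudera-vm"                   # IP or hostname of the Cloudera VM
LOCAL_DIR="$HOME/chatgpt_tweets/csv"
REMOTE_DIR="/home/cloudera/chatgpt_tweets"

# Non-interactive SFTP session fed from a here-document.
sftp "${REMOTE_USER}@${REMOTE_HOST}" <<EOF
put ${LOCAL_DIR}/*.csv ${REMOTE_DIR}/
bye
EOF
```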
For the design of our data warehouse, we opted for a star schema architecture, following Kimball's methodology. This approach facilitates the organization and visualization of the data we stored for sentiment analysis of ChatGPT-related tweets.
To analyze and visualize the tweets, we designed a fact table and multiple dimension tables. After several iterations, we finalized the following architecture:
- Fact Table: `tweets_fact`
  Stores the essential details of each tweet, such as the tweet's ID, content, and references to the associated dimensions.
- Dimension Tables:
  - `user_dim`: Contains data related to the tweet's author, such as user ID, name, and profile attributes.
  - `device_dim`: Stores information about the operating system and device from which the tweet was posted.
  - `hashtag_dim`: Captures the hashtags associated with each tweet.
  - `sentiment_dim`: Holds the sentiment analysis results for each tweet, including scores and labels (Negative, Neutral, Positive).
  - `location_dim`: Provides geographical data related to the tweet's source, including country and city.
  - `time_dim`: Stores the timestamp of when the tweet was posted, broken down into components such as date, hour, and minute.
This star schema enables efficient querying and aggregation of tweet data, which we leveraged for sentiment analysis and visualization in Power BI.
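For instance, an aggregation over the fact and dimension tables might look like the hedged HiveQL sketch below; the join keys and column names follow the schema description above but are illustrative, as the exact names may differ.

```sql
-- Illustrative star-join: daily sentiment counts per device.
-- Column names are assumptions based on the schema described above.
SELECT
    t.tweet_date,
    d.device_name,
    s.sentiment_label,
    COUNT(*) AS tweet_count
FROM tweets_fact f
JOIN time_dim      t ON f.time_id      = t.time_id
JOIN device_dim    d ON f.device_id    = d.device_id
JOIN sentiment_dim s ON f.sentiment_id = s.sentiment_id
GROUP BY t.tweet_date, d.device_name, s.sentiment_label;
```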
For the data warehouse creation, the HiveQL script named `twitter_dwh_creation_script.hql` contains the commands required to create and load the DWH. This image shows the tables created in the `twitter_dwh` data warehouse:
Once the data warehouse is created, the CSV files transferred via SFTP by `export_script.sh` are loaded into their corresponding tables.
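The excerpt below is a hedged sketch of what the creation-and-load statements for one table might look like in `twitter_dwh_creation_script.hql`; the column list, delimiters, and file path are placeholders, not the script's actual contents.

```sql
-- Hypothetical excerpt: create the fact table and load its CSV.
CREATE DATABASE IF NOT EXISTS twitter_dwh;
USE twitter_dwh;

CREATE TABLE IF NOT EXISTS tweets_fact (
    tweet_id     BIGINT,
    tweet_text   STRING,
    user_id      BIGINT,
    device_id    INT,
    hashtag_id   INT,
    sentiment_id INT,
    location_id  INT,
    time_id      INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Load the CSV transferred by export_script.sh (path is a placeholder).
LOAD DATA LOCAL INPATH '/home/cloudera/chatgpt_tweets/tweets_fact.csv'
OVERWRITE INTO TABLE tweets_fact;
```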
Here is an example of the `tweets_fact` table after data load:
To visualize the data in the Hive data warehouse, we first had to establish the connection between Power BI and Hive on the Cloudera VM.
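While configuring the Power BI connector, it can help to confirm that HiveServer2 on the Cloudera VM is reachable and that the tables are queryable. The sketch below uses PyHive for that sanity check; this library is an assumption for illustration (not necessarily part of the project), and the host, port, and credentials are placeholders.

```python
from pyhive import hive  # assumption: PyHive installed just for a connectivity check

# Placeholder connection details for HiveServer2 on the Cloudera VM.
conn = hive.Connection(
    host="cloudera-vm",   # Cloudera VM IP or hostname
    port=10000,           # default HiveServer2 port
    username="cloudera",
    database="twitter_dwh",
)

cursor = conn.cursor()
cursor.execute("SELECT COUNT(*) FROM tweets_fact")
print("Rows in tweets_fact:", cursor.fetchone()[0])
```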
Here is the final dashboard for our data: