
ChatGPT Users Tweets Analysis Pipeline



banner

Unlocking insights from ChatGPT Users' Tweets

πŸ“ Table of Contents

  1. Project Overview
  2. Project Architecture
  3. Data Ingestion and Cleaning
  4. SFTP Transfer
  5. Data Modeling and DWH Design
  6. DWH Load
  7. Data Visualization

πŸ”¬ Project Overview

Goal 🎯:

This project aims to provide a deep analysis of user feedback on ChatGPT, using data sourced from Twitter. The objective is to ingest, clean, model, and visualize user opinions and discussions to uncover trends and insights regarding ChatGPT. By analyzing this social media data, we can better understand user sentiment, highlighting key areas such as popular features, common frustrations, and suggestions for improvement. The project features an end-to-end data pipeline that stores the data in a structured Hive data warehouse and ultimately produces insightful visualizations.

Steps πŸ› οΈ:

The project kicks off by using Python to connect to the Twitter API and extract tweets containing the term "ChatGPT." These tweets are filtered to focus on user feedback, ranging from positive experiences to areas of concern. After collection, the raw data undergoes cleaning to handle missing values and eliminate duplicates.
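As an illustration of this collection step, here is a minimal sketch using the Tweepy library (the library used for ingestion, described in more detail below), assuming Twitter API v2 access with a bearer token; the repository's actual collection script may differ:

```python
import tweepy

# Assumption: a Twitter API v2 bearer token from a developer account
# (the repository's actual authentication setup may differ).
BEARER_TOKEN = "YOUR_BEARER_TOKEN"

client = tweepy.Client(bearer_token=BEARER_TOKEN)

# Search recent tweets mentioning ChatGPT, excluding retweets.
response = client.search_recent_tweets(
    query="ChatGPT -is:retweet lang:en",
    max_results=100,
    tweet_fields=["created_at", "lang", "source", "public_metrics"],
)

for tweet in response.data or []:
    print(tweet.created_at, tweet.text[:80])
```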

The processed data is then organized into a star schema, composed of fact and dimension tables, for optimized querying and analysis. CSV files with the cleaned data are securely transferred from an Ubuntu virtual machine to a Cloudera virtual machine using the SFTP protocol. On the Cloudera machine, the data is ingested into a Hive data warehouse, where it is structured and stored in tables.

With the data securely housed in Hive, Power BI connects to the Hive data warehouse on the Cloudera VM. Within Power BI, an interactive dashboard is built, allowing for exploration of user sentiment, trending topics, and key insights around ChatGPT feedback. This visualization provides stakeholders with a clear view of user perceptions, helping to guide future improvements and feature development.

πŸ“ Project Architecture

architecture

🧹 Data Ingestion and Cleaning

The Twitter API (Application Programming Interface) allows developers to access Twitter's features and data programmatically. Using it, we were able to collect real-time, valuable information from user tweets. Access to this API is granted through secret keys provided by a Twitter Developer account. One of the first challenges encountered was dealing with the raw format of the tweets: the screenshot below shows a snippet of a tweet obtained from the API, illustrating the raw nature of the data.

raw

We can visually identify some interesting data points, such as the tweet's timestamp, source link, language, etc. To interact with this API, we use the Tweepy library in Python. The collected tweet data required extensive cleaning to prepare it for analysis. This process involved removing irrelevant information, handling missing values, and addressing inconsistencies in the data. Key tasks included:
  1. Removing Duplicates: Identifying and removing duplicate tweets to ensure a unique dataset.
  2. Handling Missing Values: Filling or removing missing data entries to maintain dataset integrity.
  3. Standardizing Data: Formatting timestamps, URLs, and other fields to ensure consistency across the dataset.
  4. Preparing tweets for sentiment analysis: Cleaning tweets' text from emojis, hashtags, mentions, etc.
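The cleaning tasks above could be sketched in pandas as follows; column names such as tweet_id, text, and created_at are assumptions for illustration and may not match the project's actual files:

```python
import re
import pandas as pd

def clean_tweet_text(text: str) -> str:
    """Strip mentions, hashtags, links, and emojis so the text is ready for sentiment analysis."""
    text = re.sub(r"@\w+", "", text)                   # mentions
    text = re.sub(r"#\w+", "", text)                   # hashtags
    text = re.sub(r"http\S+|www\.\S+", "", text)       # links
    text = text.encode("ascii", "ignore").decode()     # crude emoji / non-ASCII removal
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical file and column names, for illustration only.
df = pd.read_csv("raw_tweets.csv")
df = df.drop_duplicates(subset="tweet_id")                            # 1. remove duplicates
df = df.dropna(subset=["text", "created_at"])                         # 2. handle missing values
df["created_at"] = pd.to_datetime(df["created_at"], errors="coerce")  # 3. standardize timestamps
df["clean_text"] = df["text"].astype(str).map(clean_tweet_text)       # 4. prepare for sentiment analysis
df.to_csv("clean_tweets.csv", index=False)
```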

We performed sentiment analysis on the tweets using a custom Python class named TweetAnalyzer. This class leverages the transformers library and utilizes the cardiffnlp/twitter-roberta-base-sentiment model to classify tweets as Negative, Neutral, or Positive.

The TweetAnalyzer class is designed to process multiple tweets at once. It first pre-processes the tweets by removing usernames and links to ensure cleaner input for the model. Then, it encodes the pre-processed tweets and passes them through the RoBERTa model, using softmax to calculate the sentiment scores for each tweet. This allowed us to quickly analyze the overall sentiment of a large dataset of tweets, providing valuable insights into user opinions.
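Below is an illustrative sketch of such a class built on the transformers library and the cardiffnlp/twitter-roberta-base-sentiment model; it follows the approach described above (username/link removal, encoding, softmax over the logits), but the repository's actual TweetAnalyzer implementation may differ in its details:

```python
import numpy as np
from scipy.special import softmax
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "cardiffnlp/twitter-roberta-base-sentiment"
LABELS = ["Negative", "Neutral", "Positive"]  # label order used by this model

class TweetAnalyzer:
    """Illustrative sketch of the class described above, not the repository's exact code."""

    def __init__(self):
        self.tokenizer = AutoTokenizer.from_pretrained(MODEL)
        self.model = AutoModelForSequenceClassification.from_pretrained(MODEL)

    @staticmethod
    def preprocess(text: str) -> str:
        # Replace usernames and links with placeholders, as recommended for this model.
        tokens = ["@user" if t.startswith("@") else "http" if t.startswith("http") else t
                  for t in text.split()]
        return " ".join(tokens)

    def analyze(self, tweets):
        results = []
        for tweet in tweets:
            encoded = self.tokenizer(self.preprocess(tweet), return_tensors="pt", truncation=True)
            scores = softmax(self.model(**encoded).logits[0].detach().numpy())
            results.append((LABELS[int(np.argmax(scores))], float(scores.max())))
        return results

# Usage example:
# analyzer = TweetAnalyzer()
# print(analyzer.analyze(["I love ChatGPT!", "ChatGPT is down again..."]))
```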

🌐 SFTP Transfer

To transfer our CSV files from the Ubuntu VM to the Cloudera VM, we use a shell script based on SFTP (the SSH File Transfer Protocol), which securely copies the files to the remote Cloudera VM.

The shell script, named export_script.sh, automates the file transfer process. The script can be found here: export_script.sh
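The actual transfer is handled by the shell script export_script.sh; purely as an illustration of the same SFTP logic in Python, a sketch using the paramiko library might look like this (hostnames, credentials, paths, and file names are placeholders, not the project's values):

```python
import paramiko

# Illustration only: the repository performs this transfer with export_script.sh.
# All values below are placeholders.
CLOUDERA_HOST = "cloudera-vm.local"
USERNAME = "cloudera"
PASSWORD = "cloudera"
LOCAL_FILES = ["tweets_fact.csv", "user_dim.csv"]   # hypothetical file names
REMOTE_DIR = "/home/cloudera/twitter_data/"

# Open an SSH transport and an SFTP session on top of it.
transport = paramiko.Transport((CLOUDERA_HOST, 22))
transport.connect(username=USERNAME, password=PASSWORD)
sftp = paramiko.SFTPClient.from_transport(transport)

for local_path in LOCAL_FILES:
    sftp.put(local_path, REMOTE_DIR + local_path)  # copy each CSV to the Cloudera VM

sftp.close()
transport.close()
```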

πŸ—οΈ Data Modeling and DWH Design

For the design of our Data Warehouse, we opted for a ⭐ star schema architecture, following Kimball's methodology. This approach facilitates the organization and visualization of the data stored for sentiment analysis of ChatGPT-related tweets.

To analyze and visualize the tweets, we designed a fact table and multiple dimension tables. After several iterations, we finalized the following architecture:

  • πŸ—ƒοΈ Fact Table: tweets_fact
    Stores the essential details of each tweet, such as the tweet's ID, content, and references to the associated dimensions.
  • πŸ“Š Dimension Tables:
    • πŸ‘€ user_dim: Contains data related to the tweet's author, such as user ID, name, and profile attributes.
    • πŸ“± device_dim: Stores information about the operating system and device from which the tweet was posted.
    • 🏷️ hashtag_dim: Captures the hashtags associated with each tweet.
    • 🧠 sentiment_dim: Holds the sentiment analysis results for each tweet, including scores and labels (Negative, Neutral, Positive).
    • 🌍 location_dim: Provides geographical data related to the tweet's source, including country and city.
    • ⏰ time_dim: Stores the timestamp of when the tweet was posted, broken down into components such as date, hour, and minute.

dwh-model

This ⭐ star schema enables efficient querying and aggregation of tweet data, which we leveraged for sentiment analysis and visualization in Power BI.
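As an illustration of the kind of aggregation this schema makes straightforward, the sketch below queries the warehouse from Python with PyHive and counts tweets per sentiment label and day; the host, database, and column/key names are assumptions, not the project's actual definitions:

```python
from pyhive import hive

# Host, port, database, and column/key names below are assumptions for illustration.
conn = hive.Connection(host="quickstart.cloudera", port=10000,
                       username="cloudera", database="twitter_dwh")
cursor = conn.cursor()

# Tweet counts per sentiment label and per day -- the kind of aggregation
# a star schema is designed to make cheap.
cursor.execute("""
    SELECT t.tweet_date, s.sentiment_label, COUNT(*) AS nb_tweets
    FROM tweets_fact f
    JOIN sentiment_dim s ON f.sentiment_id = s.sentiment_id
    JOIN time_dim t ON f.time_id = t.time_id
    GROUP BY t.tweet_date, s.sentiment_label
""")

for tweet_date, label, nb_tweets in cursor.fetchall():
    print(tweet_date, label, nb_tweets)
```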

πŸ“¦ DWH Load

For the data warehouse creation, the HiveQL script twitter_dwh_creation_script.hql contains the commands required to create and load the DWH. The image below shows the tables created in the twitter_dwh data warehouse:


tables

Once the data warehouse is created, the data from the CSV files transferred via SFTP by export_script.sh is loaded into the corresponding tables.


csv_files
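For illustration, the statements run by twitter_dwh_creation_script.hql could resemble the following sketch, here issued from Python through PyHive; the actual DDL, column definitions, and file paths in the repository's script may differ:

```python
from pyhive import hive

# Illustration only: the real DDL lives in twitter_dwh_creation_script.hql; the column
# definitions and file paths below are placeholders.
conn = hive.Connection(host="quickstart.cloudera", port=10000, username="cloudera")
cursor = conn.cursor()

cursor.execute("CREATE DATABASE IF NOT EXISTS twitter_dwh")

# One dimension table as an example; the script creates tweets_fact and all dimensions similarly.
cursor.execute("""
    CREATE TABLE IF NOT EXISTS twitter_dwh.sentiment_dim (
        sentiment_id    INT,
        sentiment_label STRING,
        sentiment_score DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
""")

# Load the CSV transferred by export_script.sh into its table.
cursor.execute("""
    LOAD DATA LOCAL INPATH '/home/cloudera/twitter_data/sentiment_dim.csv'
    INTO TABLE twitter_dwh.sentiment_dim
""")
```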

Here is an example of the tweets_fact table after data load:

tw_fact_data

πŸ“Š Data Visualization

To visualize the data in the Hive data warehouse, I first had to establish the connection between Power BI and Hive on the Cloudera VM.


hive_conn

Here is the final dashboard for our data:

dashboard