Google Flights ETL Project

Overview

This repository contains the source code and documentation for the Google Flights ETL project. The pipeline crawls flight data daily with Selenium, stores the raw data in MySQL, processes it with Apache Spark, writes it to HDFS as a data lake, and warehouses it in Hive. The entire environment is containerized and deployed with Docker.

System Architecture

Project Structure

Components

Selenium: Web scraping tool used for extracting flight data from Google Flights.
MySQL: Relational database used for storing raw flight data.
Apache Spark: Distributed data processing engine for data transformation.
HDFS: Distributed file system used as a data lake for storing processed data.
Hive: Data warehousing tool for querying and analyzing structured data.

Prerequisites

Docker installed on your machine.
Python and necessary libraries for Selenium web scraping.

Getting Started

Clone the repository:

git clone https://github.com/your-username/google_flight_etl.git
cd google_flight_etl

Build and run the Docker containers:

cd docker-configuration
docker-compose up -d

Execute the data crawling process:
```
python flight_selenium.py
```
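
The README does not show how `flight_selenium.py` hands scraped rows to MySQL. As a minimal sketch, assuming the `mysql-connector-python` package and a hypothetical `flights` table (the table, column, and field names below are illustrative and do not come from the repository), scraped values could be modelled and inserted like this:

```python
from dataclasses import dataclass, astuple
from datetime import date


@dataclass(frozen=True)
class FlightRecord:
    """One scraped flight offer. Field names are hypothetical."""
    origin: str
    destination: str
    departure_date: date
    airline: str
    price: float
    crawled_on: date


# Hypothetical table and columns; adjust to the real MySQL schema.
INSERT_SQL = (
    "INSERT INTO flights "
    "(origin, destination, departure_date, airline, price, crawled_on) "
    "VALUES (%s, %s, %s, %s, %s, %s)"
)


def to_rows(records):
    """Convert records into parameter tuples suitable for cursor.executemany()."""
    return [astuple(record) for record in records]


def save_records(records, **connect_kwargs):
    """Insert records into MySQL using a parameterized statement."""
    # Imported lazily so the helpers above work without the driver installed.
    import mysql.connector

    conn = mysql.connector.connect(**connect_kwargs)
    try:
        cursor = conn.cursor()
        cursor.executemany(INSERT_SQL, to_rows(records))
        conn.commit()
    finally:
        conn.close()
```

A parameterized `INSERT` keeps scraped strings from being interpreted as SQL, and stamping each row with a crawl date makes it possible to select a single day's data in the later ingestion step.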

Execute the Hadoop ingestion process. Remember to set the `--executionDate` parameter to the current date:

docker exec -it namenode bash
spark-submit --master spark://spark-master:7077 pyspark-jobs/ingestion.py --tblName "flights" --executionDate "YYYY-MM-DD"
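
For orientation, here is a rough sketch of what an ingestion job with these two flags might look like. Only `--tblName` and `--executionDate` come from the command above. The JDBC URL, credentials, HDFS base path, and partition layout are assumptions, and the MySQL JDBC driver JAR would need to be available to Spark.

```python
import argparse
from datetime import datetime


def parse_args(argv=None):
    """Parse the two flags used by the spark-submit command."""
    parser = argparse.ArgumentParser(description="Ingest one MySQL table into HDFS")
    parser.add_argument("--tblName", required=True)
    parser.add_argument("--executionDate", required=True, help="YYYY-MM-DD")
    return parser.parse_args(argv)


def partition_path(base, table, execution_date):
    """Build a date-partitioned output path (the layout is an assumption)."""
    d = datetime.strptime(execution_date, "%Y-%m-%d")  # raises ValueError on bad input
    return f"{base}/{table}/year={d.year}/month={d.month:02d}/day={d.day:02d}"


def main(argv=None):
    # Imported here so the helpers above can be used without PySpark installed.
    from pyspark.sql import SparkSession

    args = parse_args(argv)
    spark = SparkSession.builder.appName("ingestion-sketch").getOrCreate()

    # Hypothetical connection details; replace with the real MySQL settings.
    df = (
        spark.read.format("jdbc")
        .option("url", "jdbc:mysql://mysql:3306/flights_db")
        .option("dbtable", args.tblName)
        .option("user", "root")
        .option("password", "example")
        .option("driver", "com.mysql.cj.jdbc.Driver")
        .load()
    )

    # Hypothetical HDFS location; overwrite makes a rerun for the same date idempotent.
    out = partition_path("hdfs://namenode:9000/datalake", args.tblName, args.executionDate)
    df.write.mode("overwrite").parquet(out)
    spark.stop()
```

To use this as a script, call `main()` at the bottom of the file. Today's date in the expected format can be produced with `date.today().isoformat()` from the standard `datetime` module.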

Execute the Hive transformation process:

docker exec -it namenode bash
spark-submit --master spark://spark-master:7077 pyspark-jobs/transformation.py --executionDate "YYYY-MM-DD"
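
The transformation script itself is not reproduced in this README. As an illustrative sketch only, a job of this kind could read one day's files from the data lake, aggregate them with Spark SQL, and write the result to a Hive table. The view name, column names, aggregation, and target table below are all hypothetical.

```python
from datetime import date


def daily_summary_sql(view, execution_date):
    """Return a Spark SQL query that summarizes prices per route for one day."""
    # Parsing first rejects malformed input before it reaches the SQL text.
    day = date.fromisoformat(execution_date).isoformat()
    return (
        "SELECT origin, destination, airline, "
        "MIN(price) AS min_price, AVG(price) AS avg_price, "
        f"DATE '{day}' AS execution_date "
        f"FROM {view} "
        "GROUP BY origin, destination, airline"
    )


def run(input_path, execution_date, target_table="reports.daily_route_prices"):
    # Imported here so daily_summary_sql() can be used without PySpark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("transformation-sketch")
        .enableHiveSupport()  # lets saveAsTable() register the table in the Hive metastore
        .getOrCreate()
    )
    spark.read.parquet(input_path).createOrReplaceTempView("flights_raw")
    summary = spark.sql(daily_summary_sql("flights_raw", execution_date))

    # Append adds one day's rows per run; rerunning the same date would duplicate them.
    summary.write.mode("append").saveAsTable(target_table)
    spark.stop()
```

Keeping the query construction in a plain function makes it easy to check without a cluster, while `run()` holds the parts that need Spark and Hive.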

Deploy Superset, connect it to Hive, and design a dashboard in your own style:

export SUPERSET_VERSION=<latest_version>

docker pull apache/superset:$SUPERSET_VERSION

docker run -d -p 8088:8088 \
         -e "SUPERSET_SECRET_KEY=$(openssl rand -base64 42)" \
         -e "TALISMAN_ENABLED=False" \
         --name superset apache/superset:$SUPERSET_VERSION
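
When adding Hive as a database in Superset, the connection is given as a SQLAlchemy URI. The small helper below shows the usual shape of that URI for the PyHive driver. The host name `hive-server` is a placeholder for whatever the HiveServer2 container is called in your setup, and 10000 is HiveServer2's default port.

```python
def hive_sqlalchemy_uri(host, port=10000, database="default", user="hive"):
    """Build a Hive SQLAlchemy URI of the form hive://user@host:port/database."""
    return f"hive://{user}@{host}:{port}/{database}"


# "hive-server" is a hypothetical host name; substitute your own.
print(hive_sqlalchemy_uri("hive-server"))
```

Because the Superset container above is started with `docker run` rather than through the compose file, it may need to be attached to the same Docker network as the Hive container before that host name resolves.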

Superset Dashboard Example

Acknowledgments

This project is inspired by the Data Lake & Warehousing demo by Mr. Canh Tran (Data Guy Story). The architecture design and the Spark (Scala) ingestion and transformation scripts were outlined in a video; I adapted the scripts to PySpark for this implementation.

Contact

For questions or support, please contact dong.lenam.2002@gmail.com.
