This repository contains the source code and documentation for the Google Flight ETL project. The pipeline crawls flight data daily with Selenium, stores the raw data in MySQL, processes it with Apache Spark, lands the processed data in HDFS as a data lake, and warehouses it with Hive. The entire environment is containerized and deployed with Docker.
- Selenium: Web scraping tool used for extracting flight data from Google Flights.
- MySQL: Relational database used for storing raw flight data.
- Apache Spark: Distributed data processing engine for data transformation.
- HDFS: Distributed file system used as a data lake for storing processed data.
- Hive: Data warehousing tool for querying and analyzing structured data.
- Docker installed on your machine.
- Python and the libraries required for Selenium web scraping.
- Clone the repository:

  ```bash
  git clone https://github.com/your-username/google_flight_etl.git
  cd google_flight_etl
  ```
- Build and run the Docker containers:

  ```bash
  cd docker-configuration
  docker-compose up -d
  ```
- Execute the data crawling process:

  ```bash
  python flight_selenium.py
  ```
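  The crawler itself is implemented in flight_selenium.py and is not reproduced here. As a rough, hedged sketch of what a crawl-and-store step like this can look like, the snippet below scrapes a results page with Selenium and inserts rows into MySQL; the Google Flights URL, the CSS selector, the MySQL credentials, and the flights table schema are illustrative assumptions, not the project's actual values.

  ```python
  # Minimal sketch of a crawl-and-store step (not the actual flight_selenium.py).
  # URL, selector, credentials, and table schema below are assumptions.
  from datetime import date

  import mysql.connector
  from selenium import webdriver
  from selenium.webdriver.common.by import By

  driver = webdriver.Chrome()
  driver.get("https://www.google.com/travel/flights")  # assumed entry page
  driver.implicitly_wait(10)

  # Collect the visible result rows as raw text (selector is an assumption).
  rows = [
      (date.today().isoformat(), item.text)
      for item in driver.find_elements(By.CSS_SELECTOR, "li")
  ]
  driver.quit()

  # Store the raw rows in MySQL (connection details and schema are assumptions).
  conn = mysql.connector.connect(
      host="localhost", user="root", password="root", database="flights_db"
  )
  cur = conn.cursor()
  cur.executemany(
      "INSERT INTO flights (crawl_date, raw_text) VALUES (%s, %s)", rows
  )
  conn.commit()
  conn.close()
  ```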
- Execute the Hadoop ingestion process (remember to change the execution date parameter to your current date):

  ```bash
  docker exec -it namenode bash
  spark-submit --master spark://spark-master:7077 pyspark-jobs/ingestion.py --tblName "flights" --executionDate "YYYY-MM-DD"
  ```
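  ingestion.py lives under pyspark-jobs/ and is not shown in this README. As a hedged sketch of what such an ingestion job typically does, the snippet below copies the MySQL table into a date-partitioned Parquet area of HDFS; the JDBC URL, credentials, HDFS port, and path layout are assumptions rather than the project's actual configuration.

  ```python
  # Illustrative sketch of an ingestion job in the spirit of pyspark-jobs/ingestion.py.
  # JDBC URL, credentials, and HDFS layout are assumptions.
  import argparse

  from pyspark.sql import SparkSession

  parser = argparse.ArgumentParser()
  parser.add_argument("--tblName", required=True)
  parser.add_argument("--executionDate", required=True)  # e.g. "2024-01-31"
  args = parser.parse_args()

  spark = SparkSession.builder.appName(f"ingest_{args.tblName}").getOrCreate()

  # Pull the raw table from MySQL over JDBC (the MySQL driver jar must be available).
  df = (
      spark.read.format("jdbc")
      .option("url", "jdbc:mysql://mysql:3306/flights_db")  # assumed host/database
      .option("dbtable", args.tblName)
      .option("user", "root")                               # assumed credentials
      .option("password", "root")
      .load()
  )

  # Land the snapshot in the HDFS data lake, partitioned by execution date.
  output_path = f"hdfs://namenode:9000/datalake/{args.tblName}/{args.executionDate}"
  df.write.mode("overwrite").parquet(output_path)

  spark.stop()
  ```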
- Execute the Hive transformation process:

  ```bash
  docker exec -it namenode bash
  spark-submit --master spark://spark-master:7077 pyspark-jobs/transformation.py --executionDate "YYYY-MM-DD"
  ```
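  As with ingestion, transformation.py is kept in pyspark-jobs/. The hedged sketch below illustrates the general shape of this stage, assuming it reads the ingested Parquet snapshot from HDFS, cleans it up, and appends it to a Hive table; the paths, column names, and the warehouse.flights table are assumptions.

  ```python
  # Illustrative sketch of a transformation job in the spirit of pyspark-jobs/transformation.py.
  # Paths, column names, and the Hive database/table are assumptions.
  import argparse

  from pyspark.sql import SparkSession, functions as F

  parser = argparse.ArgumentParser()
  parser.add_argument("--executionDate", required=True)
  args = parser.parse_args()

  spark = (
      SparkSession.builder.appName("transform_flights")
      .enableHiveSupport()  # required to write managed Hive tables
      .getOrCreate()
  )

  # Read the raw snapshot that the ingestion job landed in the data lake.
  raw = spark.read.parquet(f"hdfs://namenode:9000/datalake/flights/{args.executionDate}")

  # Example cleanup: de-duplicate and tag each row with its execution date.
  curated = raw.dropDuplicates().withColumn("execution_date", F.lit(args.executionDate))

  # Append the curated partition to the Hive warehouse table.
  curated.write.mode("append").partitionBy("execution_date").saveAsTable("warehouse.flights")

  spark.stop()
  ```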
- Deploy Superset, connect Hive to the Superset dashboard, and design your dashboards in your own style:

  ```bash
  export SUPERSET_VERSION=<latest_version>
  docker pull apache/superset:$SUPERSET_VERSION
  docker run -d -p 8088:8088 \
    -e "SUPERSET_SECRET_KEY=$(openssl rand -base64 42)" \
    -e "TALISMAN_ENABLED=False" \
    --name superset apache/superset:$SUPERSET_VERSION
  ```
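  Once Superset is up at http://localhost:8088, Hive can typically be added as a database connection with a SQLAlchemy URI of the form `hive://<user>@<hiveserver2-host>:10000/default`; this requires the PyHive driver inside the Superset image, and the actual host name depends on your docker-compose configuration, so treat both as assumptions to adapt to your setup.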
This project is inspired by the Data Lake & Warehousing demo by Mr. Canh Tran (Data Guy Story). The architecture design and the Spark (Scala) ingestion and transformation scripts were outlined in a video available here. I adapted the scripts to PySpark for this implementation.
For questions or support, please contact [[email protected]].