Churn Data Analytics Data Pipeline

Project Overview

This project involves a robust ETL pipeline that starts with data stored in Amazon RDS, primarily serving OLTP purposes. The data is then transferred to Amazon S3, which acts as an intermediate storage layer. To facilitate efficient querying, AWS Glue Crawler is employed to scan the data in S3 and automatically generate metadata in the Glue Data Catalog. This metadata can be leveraged for querying the data directly from S3 using Amazon Athena or Amazon Redshift.

In this architecture, Amazon S3 acts as a staging area, allowing for the creation of an external schema in Amazon Redshift. This setup enables seamless access to the data in S3 through Redshift, where it is organized into landing tables. From these landing tables, dimension and fact tables are subsequently generated, forming the backbone of the data warehouse.

Apache Airflow plays a critical role in orchestrating the entire workflow. It automates the process of extracting data from RDS, loading it into S3, triggering the Glue Crawler to update the metadata, and executing the necessary SQL commands on Redshift.

The pipeline is designed to handle incremental loads, capturing only the data that has changed since the last ETL sync. This incremental approach ensures that the data warehouse is updated efficiently and in a timely manner, minimizing unnecessary data processing and enhancing overall performance.

By integrating these technologies—Amazon RDS, S3, Glue, Redshift, and Airflow—the project delivers a scalable and efficient data processing solution that supports advanced analytics and reporting.

Dataset Link: (https://www.kaggle.com/datasets/yeanzc/telco-customer-churn-ibm-dataset?resource=download) Article: Churn Data Analytics Data Pipeline

Prerequisites

Apache Airflow
Python
AWS Cloud account

ETL Architecture

Airflow Workflow

Technologies Used

AWS Glue
AWS S3
Apache Airflow
Redshift Serverless
AWS RDS

Steps to Deploy the Project

Clone the repository on AWS EC2

git clone https://github.com/Sanjay-dev-ds/churn-data-analytics-data-pipeline.git

Set up all infrastructure
Run and invoke Airflow UI
```
airflow standalone
```
Test the project!

Author

Sanjay Jayakumar

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
dags		dags
images		images
raw_data		raw_data
.gitignore		.gitignore
README.md		README.md
requirements.txt		requirements.txt
resources_cft.yaml		resources_cft.yaml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Churn Data Analytics Data Pipeline

Project Overview

Prerequisites

ETL Architecture

Airflow Workflow

Technologies Used

Steps to Deploy the Project

Author

About

Languages

Sanjay-dev-ds/churn-data-analytics-data-pipeline

Folders and files

Latest commit

History

Repository files navigation

Churn Data Analytics Data Pipeline

Project Overview

Prerequisites

ETL Architecture

Airflow Workflow

Technologies Used

Steps to Deploy the Project

Author

About

Topics

Resources

Stars

Watchers

Forks

Languages