
Weather-Data-ETL-Pipeline

[Architecture diagram: see Project Snapshots/Weather ETL Pipeline Architecture.png]

Table of Contents

  • Overview
  • Architecture
  • ETL Process
  • Technologies Used
  • Repository Structure
  • DAG Details
  • Glue Script Details
  • CI/CD Pipeline
  • How to Run
  • Future Enhancements

Overview

This repository contains a project that demonstrates an ETL (Extract, Transform, Load) pipeline using AWS services. The pipeline retrieves weather forecast data from the OpenWeather API, stores it in an S3 bucket, transforms it using AWS Glue, and loads the processed data into an Amazon Redshift table for further analysis.

The entire orchestration of the pipeline is managed using Amazon Managed Workflows for Apache Airflow (MWAA). This project showcases how AWS services can be integrated to build a scalable, reliable, and automated ETL pipeline.

Architecture

The architecture of the ETL pipeline is depicted in the diagram above. It involves the following AWS components:

  1. Amazon MWAA (Managed Workflows for Apache Airflow): Used to orchestrate the entire ETL process by triggering and managing the data pipeline tasks.
  2. Amazon S3: Serves as the data lake, storing the raw weather data fetched from the OpenWeather API.
  3. AWS Glue: Used for data transformation. The Glue job reads the data from S3, performs necessary transformations, and writes the processed data to Amazon Redshift.
  4. Amazon Redshift: A data warehouse that stores the transformed data for analysis and reporting.
  5. AWS CodeBuild: Manages the CI/CD pipeline for deploying DAGs and scripts.

ETL Process

  1. Extraction: An Airflow DAG (openweather_api_dag) calls the OpenWeather API to fetch weather forecast data for a specified location (Toronto, Canada). The extracted JSON is normalized into a tabular format with Pandas and serialized to CSV (see the sketch after this list).

  2. Loading to S3: The resulting CSV is uploaded to an S3 bucket (weather-data-landing-zone-zn) using the S3CreateObjectOperator.

  3. Triggering the Glue Job: Once the data is available in S3, the TriggerDagRunOperator kicks off a second Airflow DAG (transform_redshift_dag), which starts an AWS Glue job that reads the raw data from S3, transforms it, and loads it into an Amazon Redshift table (public.weather_data).

  4. Transformation and Loading to Redshift: The AWS Glue job applies schema mapping and other transformations to the weather data, then loads the result into Amazon Redshift, where it is available for querying and analysis.
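
The extraction step can be pictured with a short sketch. This is a minimal illustration, not the repository's actual weather_api.py code: the endpoint, query parameters, API-key handling, and column naming are assumptions.

import requests
import pandas as pd

def fetch_and_normalize_forecast(api_key: str) -> str:
    """Fetch the Toronto forecast from the OpenWeather API and return it as CSV text."""
    # Endpoint and parameters are illustrative; the real DAG defines its own.
    response = requests.get(
        "https://api.openweathermap.org/data/2.5/forecast",
        params={"q": "Toronto,CA", "appid": api_key, "units": "metric"},
        timeout=30,
    )
    response.raise_for_status()

    # The forecast payload nests readings under "list"; json_normalize flattens
    # nested fields such as main.temp into flat columns (main_temp, wind_speed, ...).
    df = pd.json_normalize(response.json()["list"], sep="_")

    # Return CSV text so it can be handed to S3CreateObjectOperator as the object body.
    return df.to_csv(index=False)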

Technologies Used

  • Amazon MWAA (Managed Workflows for Apache Airflow)
  • Amazon S3
  • AWS Glue
  • Amazon Redshift
  • AWS CodeBuild
  • Python (Airflow DAGs, Glue ETL Script)
  • Pandas (Data Manipulation)
  • OpenWeather API (Data Source)

Repository Structure

Weather-Data-ETL-Pipeline/
│
├── Project Snapshots/
│   ├── airflow_dags.png
│   ├── redshift_result.png
│   └── Weather ETL Pipeline Architecture.png
│
├── dags/
│   ├── transform_redshift_load.py
│   └── weather_api.py
│
├── scripts/
│   └── weather_data_ingestion.py
│
├── buildspec.yml
│
└── requirements.txt

Description

  1. Project Snapshots: Contains snapshots of Airflow DAGs, Redshift results, and the architecture diagram.
  2. dags: Contains the Airflow DAGs for fetching data from the OpenWeather API and triggering Glue jobs.
  3. scripts: Contains the Glue ETL script (weather_data_ingestion.py) used for data transformation.
  4. buildspec.yml: Configuration file for CI/CD with AWS CodeBuild. The build copies the DAGs and requirements.txt to their respective deployment locations.
  5. requirements.txt: Lists the Python dependencies (e.g., Pandas, Requests) required for the Airflow DAGs.

DAG Details

transform_redshift_load.py

This DAG handles the transformation of the data and its load into Amazon Redshift. It uses the GlueJobOperator to start the Glue job that processes the data; a minimal sketch of the wiring follows the component list below.

Key components:

  • GlueJobOperator: Runs the Glue job glue_transform_task.
  • ExternalTaskSensor: Ensures the DAG waits for the completion of the previous DAG (openweather_api_dag).
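
The sketch below shows how these two operators might be wired together. The DAG IDs and operator names come from this README; the start date, schedule, region, and Glue job name are assumptions, so the real transform_redshift_load.py may differ.

from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator
from airflow.sensors.external_task import ExternalTaskSensor

with DAG(
    dag_id="transform_redshift_dag",
    start_date=datetime(2024, 1, 1),  # illustrative start date
    schedule_interval=None,           # run only when triggered by the upstream DAG
    catchup=False,
) as dag:
    # Wait for the extraction DAG to finish before transforming its output.
    # By default the sensor matches on logical date; execution_delta or
    # execution_date_fn may be needed if the two DAGs run on different schedules.
    wait_for_extraction = ExternalTaskSensor(
        task_id="wait_for_openweather_api_dag",
        external_dag_id="openweather_api_dag",
        external_task_id=None,  # wait for the whole upstream DAG run
    )

    # Start the Glue job that reads from S3 and writes to public.weather_data.
    glue_transform_task = GlueJobOperator(
        task_id="glue_transform_task",
        job_name="glue_transform_task",  # assumed Glue job name; adjust to the real job
        region_name="us-east-1",         # assumed region
    )

    wait_for_extraction >> glue_transform_task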

weather_api.py

This DAG is responsible for extracting weather data from the OpenWeather API and storing it in S3; a sketch of the full DAG follows the component list below.

Key components:

  • PythonOperator: Fetches data from the OpenWeather API and processes it using Pandas.
  • S3CreateObjectOperator: Stores the processed data in an S3 bucket.
  • TriggerDagRunOperator: Triggers the transform_redshift_dag once data is loaded into S3.
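
A minimal sketch of this DAG is shown below. The DAG ID, bucket name, and operator choices come from this README; the schedule, start date, and S3 key layout are assumptions, and the extraction callable is a stub standing in for the API call sketched under ETL Process.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator
from airflow.providers.amazon.aws.operators.s3 import S3CreateObjectOperator


def extract_weather_data() -> str:
    """Stub for the OpenWeather API call; returns CSV text that is pushed to XCom."""
    return "dt_txt,main_temp\n2024-01-01 00:00:00,1.5\n"


with DAG(
    dag_id="openweather_api_dag",
    start_date=datetime(2024, 1, 1),  # illustrative
    schedule_interval="@daily",       # assumed schedule
    catchup=False,
) as dag:
    extract = PythonOperator(
        task_id="extract_weather_data",
        python_callable=extract_weather_data,
    )

    # The CSV text returned by the callable is pulled from XCom and written to
    # the landing bucket named in this README; the key layout is assumed.
    upload_to_s3 = S3CreateObjectOperator(
        task_id="upload_to_s3",
        s3_bucket="weather-data-landing-zone-zn",
        s3_key="raw/weather_{{ ds }}.csv",
        data="{{ ti.xcom_pull(task_ids='extract_weather_data') }}",
        replace=True,
    )

    trigger_transform = TriggerDagRunOperator(
        task_id="trigger_transform_redshift_dag",
        trigger_dag_id="transform_redshift_dag",
    )

    extract >> upload_to_s3 >> trigger_transform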

Glue Script Details

The Glue script (weather_data_ingestion.py) reads the raw weather data from S3, transforms it by applying schema mappings, and loads the processed data into Amazon Redshift. It is built on the AWS Glue libraries (GlueContext, DynamicFrames, and the built-in transforms); a minimal sketch follows the component list below.

Key components:

  • DynamicFrame Creation: Reads the data from S3 as a DynamicFrame.
  • Schema Transformation: Applies a schema to the data.
  • Data Load to Redshift: Loads the transformed data into a Redshift table (public.weather_data).
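
The general shape of such a script is sketched below. This is a generic Glue skeleton under stated assumptions, not the repository's weather_data_ingestion.py: the S3 prefix, column mappings, catalog connection name, database, and temporary directory are placeholders.

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])

sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw CSV landed by the extraction DAG as a DynamicFrame.
# The S3 prefix is an assumption; the real script defines the actual path.
raw = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://weather-data-landing-zone-zn/raw/"]},
    format="csv",
    format_options={"withHeader": True},
)

# Map the flattened OpenWeather columns onto the Redshift schema.
# Source and target column names and types here are illustrative.
mapped = ApplyMapping.apply(
    frame=raw,
    mappings=[
        ("dt_txt", "string", "forecast_time", "timestamp"),
        ("main_temp", "string", "temperature", "double"),
        ("main_humidity", "string", "humidity", "int"),
        ("wind_speed", "string", "wind_speed", "double"),
    ],
)

# Load the transformed data into public.weather_data. The connection name,
# database, and temporary directory come from the Glue job configuration.
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=mapped,
    catalog_connection="redshift-connection",
    connection_options={"dbtable": "public.weather_data", "database": "dev"},
    redshift_tmp_dir="s3://weather-data-landing-zone-zn/temp/",
)

job.commit()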

CI/CD Pipeline

The CI/CD pipeline is managed by AWS CodeBuild. The buildspec.yml file contains the build and deploy steps, which automate the placement of DAGs and scripts into their respective Airflow and Glue locations.

How to Run

  1. Set Up AWS Environment:

    • Create the S3 bucket, Glue job, and Redshift cluster referenced by the DAGs and Glue script (e.g., the weather-data-landing-zone-zn bucket and the public.weather_data table).
    • Ensure that the necessary IAM roles and permissions are in place.
  2. Deploy Airflow DAGs:

    • Use AWS CodeBuild and buildspec.yml for automatic deployment.
  3. Install Requirements:

    • Install dependencies listed in requirements.txt in the MWAA environment.
  4. Run the DAGs:

    • Trigger the openweather_api_dag manually or set it to run on a schedule.
    • The DAG will automatically trigger the transform_redshift_dag after loading the data to S3.
  5. Monitor the Pipeline:

    • Use Airflow UI, AWS Glue Console, and Redshift Console to monitor each stage of the ETL pipeline.
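
For a quick end-to-end check, the loaded table can also be queried directly from Python. The snippet below is a hedged example: the redshift_connector package, cluster endpoint, database name, and credentials are assumptions about your environment, not part of this repository.

import redshift_connector  # Amazon's Python driver for Redshift (pip install redshift_connector)

# All connection details are placeholders; use your own cluster endpoint and credentials.
conn = redshift_connector.connect(
    host="<redshift-cluster-endpoint>",
    database="dev",
    user="<user>",
    password="<password>",
)

cursor = conn.cursor()

# Spot-check that the Glue job populated the target table.
cursor.execute("SELECT COUNT(*) FROM public.weather_data;")
print("rows loaded:", cursor.fetchone()[0])

cursor.execute("SELECT * FROM public.weather_data LIMIT 5;")
for row in cursor.fetchall():
    print(row)

conn.close()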

Future Enhancements

  • Error Handling and Logging: Improve error handling and logging mechanisms in Airflow and Glue jobs.
  • Data Validation: Implement data validation steps to ensure data quality before loading into Redshift.
  • Automated Alerts: Add monitoring and alerting for data pipeline failures or anomalies.
  • Expand Data Sources: Integrate additional data sources to enrich the analysis.