- Overview
- Architecture
- ETL Process
- Technologies Used
- Repository Structure
- DAG Details
- Glue Script Details
- CI/CD Pipeline
- How to Run
- Future Enhancements
This repository contains a project that demonstrates an ETL (Extract, Transform, Load) pipeline using AWS services. The pipeline retrieves weather forecast data from the OpenWeather API, stores it in an S3 bucket, transforms it using AWS Glue, and loads the processed data into an Amazon Redshift table for further analysis.
The entire orchestration of the pipeline is managed using Amazon Managed Workflows for Apache Airflow (MWAA). This project showcases how AWS services can be integrated to build a scalable, reliable, and automated ETL pipeline.
The architecture of the ETL pipeline is depicted in the diagram in `Project Snapshots/Weather ETL Pipeline Architecture.png`. It involves the following AWS components:
- Amazon MWAA (Managed Workflows for Apache Airflow): Used to orchestrate the entire ETL process by triggering and managing the data pipeline tasks.
- Amazon S3: Serves as the data lake, storing the raw weather data fetched from the OpenWeather API.
- AWS Glue: Used for data transformation. The Glue job reads the data from S3, performs necessary transformations, and writes the processed data to Amazon Redshift.
- Amazon Redshift: A data warehouse that stores the transformed data for analysis and reporting.
- AWS CodeBuild: Manages the CI/CD pipeline for deploying DAGs and scripts.
1. Extraction: An Airflow DAG (`openweather_api_dag`) calls the OpenWeather API to fetch weather forecast data for a specified location (Toronto, Canada). The extracted data is normalized into a tabular format using Pandas and then stored as a CSV file in an S3 bucket (a sketch of this step follows the list).
2. Loading to S3: The transformed data is uploaded to an S3 bucket (`weather-data-landing-zone-zn`) using the `S3CreateObjectOperator`.
3. Triggering the Glue Job: Once the data is available in S3, another Airflow DAG (`transform_redshift_dag`) is triggered using the `TriggerDagRunOperator`. This DAG starts an AWS Glue job that reads the raw data from S3, performs the transformation, and loads the data into an Amazon Redshift table (`public.weather_data`).
4. Transformation and Loading to Redshift: The AWS Glue job performs schema mapping and transformation on the weather data, making it ready for analysis. The transformed data is then loaded into Amazon Redshift for further querying and analysis.
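As a rough illustration of the extraction step, the sketch below fetches a forecast and normalizes it with Pandas, assuming the standard OpenWeather 5-day/3-hour forecast endpoint. The API key placeholder and the helper name `fetch_forecast_as_csv` are illustrative and not taken from the repository.

```python
import requests
import pandas as pd

# Illustrative values; the actual DAG keeps its own configuration.
API_KEY = "<openweather-api-key>"
FORECAST_URL = "https://api.openweathermap.org/data/2.5/forecast"

def fetch_forecast_as_csv(city: str = "Toronto,CA") -> str:
    """Fetch the forecast for a city and return it as a CSV string."""
    response = requests.get(
        FORECAST_URL,
        params={"q": city, "appid": API_KEY, "units": "metric"},
    )
    response.raise_for_status()
    payload = response.json()

    # The forecast entries live under the "list" key; json_normalize flattens
    # nested fields such as main.temp and wind.speed into flat columns.
    df = pd.json_normalize(payload["list"])
    return df.to_csv(index=False)
```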
- AWS MWAA (Managed Workflows for Apache Airflow)
- Amazon S3
- AWS Glue
- Amazon Redshift
- AWS CodeBuild
- Python (Airflow DAGs, Glue ETL Script)
- Pandas (Data Manipulation)
- OpenWeather API (Data Source)
```
Weather-Data-ETL-Pipeline/
│
├── Project Snapshots/
│   ├── airflow_dags.png
│   ├── redshift_result.png
│   └── Weather ETL Pipeline Architecture.png
│
├── dags/
│   ├── transform_redshift_load.py
│   └── weather_api.py
│
├── scripts/
│   └── weather_data_ingestion.py
│
├── buildspec.yml
│
└── requirements.txt
```
- Project Snapshots: Contains snapshots of Airflow DAGs, Redshift results, and the architecture diagram.
- dags: Contains the Airflow DAGs for fetching data from the OpenWeather API and triggering Glue jobs.
- scripts: Contains the Glue ETL script (`weather_data_ingestion.py`) used for data transformation.
- buildspec.yml: Configuration file for CI/CD using AWS CodeBuild. It places the DAGs and requirements.txt in their respective locations.
- requirements.txt: Lists the Python dependencies (e.g., Pandas, Requests) required for the Airflow DAGs.
The `transform_redshift_dag` (defined in `dags/transform_redshift_load.py`) handles the transformation of the data and its loading into Amazon Redshift. It uses the `GlueJobOperator` to start a Glue job that processes the data (a sketch follows the component list).
Key components:
- `GlueJobOperator`: Runs the Glue job (`glue_transform_task`).
- `ExternalTaskSensor`: Ensures the DAG waits for the completion of the previous DAG (`openweather_api_dag`).
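A minimal sketch of how these operators might be wired together; the schedule, Glue job name, script location, IAM role, and region below are illustrative assumptions rather than values from the repository.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator
from airflow.sensors.external_task import ExternalTaskSensor

with DAG(
    dag_id="transform_redshift_dag",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,  # triggered by openweather_api_dag
    catchup=False,
) as dag:
    # Wait for the upstream extraction DAG to finish before transforming.
    wait_for_extract = ExternalTaskSensor(
        task_id="wait_for_openweather_api_dag",
        external_dag_id="openweather_api_dag",
        timeout=600,
        mode="reschedule",
    )

    # Start the Glue job that reads the raw CSV from S3 and writes to Redshift.
    glue_transform_task = GlueJobOperator(
        task_id="glue_transform_task",
        job_name="weather_data_ingestion",  # assumed Glue job name
        script_location="s3://<scripts-bucket>/scripts/weather_data_ingestion.py",  # placeholder
        iam_role_name="glue-redshift-role",  # assumed IAM role
        region_name="us-east-1",             # assumed region
    )

    wait_for_extract >> glue_transform_task
```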
The `openweather_api_dag` (defined in `dags/weather_api.py`) is responsible for extracting weather data from the OpenWeather API and storing it in S3 (a sketch follows the component list).
Key components:
- `PythonOperator`: Fetches data from the OpenWeather API and processes it using Pandas.
- `S3CreateObjectOperator`: Stores the processed data in an S3 bucket.
- `TriggerDagRunOperator`: Triggers the `transform_redshift_dag` once the data is loaded into S3.
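A minimal sketch of how these operators might be chained; the schedule, S3 key pattern, and API key placeholder are assumptions for illustration, not values from the repository.

```python
from datetime import datetime

import pandas as pd
import requests
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator
from airflow.providers.amazon.aws.operators.s3 import S3CreateObjectOperator

def extract_weather(**context):
    """Fetch the Toronto forecast and return it as a CSV string (pushed to XCom)."""
    response = requests.get(
        "https://api.openweathermap.org/data/2.5/forecast",
        params={"q": "Toronto,CA", "appid": "<openweather-api-key>", "units": "metric"},
    )
    response.raise_for_status()
    return pd.json_normalize(response.json()["list"]).to_csv(index=False)

with DAG(
    dag_id="openweather_api_dag",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",  # assumed schedule
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_weather", python_callable=extract_weather)

    # Upload the CSV (pulled from XCom) to the landing-zone bucket.
    upload_to_s3 = S3CreateObjectOperator(
        task_id="upload_to_s3",
        s3_bucket="weather-data-landing-zone-zn",
        s3_key="raw/weather_{{ ds }}.csv",  # assumed key pattern
        data="{{ ti.xcom_pull(task_ids='extract_weather') }}",
        replace=True,
    )

    # Kick off the transform/load DAG once the file is in S3.
    trigger_transform = TriggerDagRunOperator(
        task_id="trigger_transform_redshift_dag",
        trigger_dag_id="transform_redshift_dag",
    )

    extract >> upload_to_s3 >> trigger_transform
```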
The Glue script (`weather_data_ingestion.py`) reads the raw weather data from S3, transforms it by applying schema mappings, and loads the processed data into Amazon Redshift. The script uses the AWS Glue library for ETL processes such as data extraction, transformation, and loading (a sketch follows the component list).
Key components:
- DynamicFrame Creation: Reads the data from S3 as a DynamicFrame.
- Schema Transformation: Applies a schema to the data.
- Data Load to Redshift: Loads the transformed data into a Redshift table (`public.weather_data`).
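A condensed sketch of what such a Glue script typically looks like, assuming a CSV landing prefix, an existing Glue connection to Redshift, and illustrative column mappings; none of these names or values are taken from the repository.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# DynamicFrame creation: read the raw CSV from the landing-zone bucket.
raw = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://weather-data-landing-zone-zn/raw/"]},  # assumed prefix
    format="csv",
    format_options={"withHeader": True},
)

# Schema transformation: illustrative mappings only; backticks escape the
# literal dots that json_normalize leaves in column names.
mapped = ApplyMapping.apply(
    frame=raw,
    mappings=[
        ("dt_txt", "string", "forecast_time", "timestamp"),
        ("`main.temp`", "string", "temperature", "double"),
        ("`main.humidity`", "string", "humidity", "int"),
        ("`wind.speed`", "string", "wind_speed", "double"),
    ],
)

# Data load to Redshift via a Glue connection.
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=mapped,
    catalog_connection="redshift-connection",  # assumed Glue connection name
    connection_options={"dbtable": "public.weather_data", "database": "dev"},
    redshift_tmp_dir="s3://weather-data-landing-zone-zn/temp/",  # assumed temp dir
)

job.commit()
```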
The CI/CD pipeline is managed by AWS CodeBuild. The `buildspec.yml` file contains the build and deploy steps, which automate the placement of the DAGs and scripts into their respective Airflow and Glue locations.
1. Set Up AWS Environment:
   - Create S3 buckets, Glue jobs, and Redshift clusters as specified.
   - Ensure that the necessary IAM roles and permissions are in place.
2. Deploy Airflow DAGs:
   - Use AWS CodeBuild and `buildspec.yml` for automatic deployment.
3. Install Requirements:
   - Install the dependencies listed in `requirements.txt` in the MWAA environment.
4. Run the DAGs:
   - Trigger the `openweather_api_dag` manually or set it to run on a schedule.
   - The DAG will automatically trigger the `transform_redshift_dag` after loading the data to S3.
5. Monitor the Pipeline:
   - Use the Airflow UI, AWS Glue console, and Redshift console to monitor each stage of the ETL pipeline.
- Error Handling and Logging: Improve error handling and logging mechanisms in Airflow and Glue jobs.
- Data Validation: Implement data validation steps to ensure data quality before loading into Redshift.
- Automated Alerts: Add monitoring and alerting for data pipeline failures or anomalies.
- Expand Data Sources: Integrate additional data sources to enrich the analysis.