This project processes and analyzes large datasets for a television production company. The data, exceeding 20 million records, originates from three source types:
- User contract information and interaction data (TXT files)
- User watch history logs (JSON files)
- User search history logs (Parquet files)
Data is retrieved from several storage systems, including MySQL, Azure SQL, and the local file system, then transformed and organized into structured insight tables in a PostgreSQL database.
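As a rough illustration, reading the three source formats with Spark might look like the sketch below; the file paths, header option, and delimiter are placeholders, not the project's actual layout:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tv_etl").getOrCreate()

# Contract and interaction data: delimited text files (delimiter is an assumption)
contracts = (spark.read
             .option("header", True)
             .option("delimiter", "\t")
             .csv("data/contracts/*.txt"))

# Watch history logs: JSON
watch_logs = spark.read.json("data/log_content/*.json")

# Search history logs: Parquet
search_logs = spark.read.parquet("data/log_search/*.parquet")
```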
Technology stack:
- Azure SQL
- MySQL
- Python
- Apache Spark
- PostgreSQL
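Spark talks to the relational sources and the PostgreSQL sink over JDBC. A minimal sketch, assuming the standard MySQL and PostgreSQL JDBC drivers and placeholder hosts, tables, and credentials:

```python
from pyspark.sql import SparkSession

# MySQL and PostgreSQL JDBC drivers must be on the Spark classpath
spark = (SparkSession.builder
         .appName("jdbc_template")
         .config("spark.jars.packages",
                 "mysql:mysql-connector-java:8.0.33,"
                 "org.postgresql:postgresql:42.6.0")
         .getOrCreate())

# Read a source table from MySQL (table name is a placeholder)
users = (spark.read.format("jdbc")
         .option("url", "jdbc:mysql://<host>:3306/<database>")
         .option("dbtable", "user_contracts")
         .option("user", "<user>")
         .option("password", "<password>")
         .load())

# Write an insight table to PostgreSQL
(users.write.format("jdbc")
 .option("url", "jdbc:postgresql://<host>:5432/<database>")
 .option("dbtable", "insight.user_contracts")
 .option("user", "<user>")
 .option("password", "<password>")
 .mode("overwrite")
 .save())
```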
Key scripts and notebooks:
- etl_log_content.py: Ingests, transforms, and loads watch history data (a minimal sketch of this flow follows the list)
- etl_log_search.py: Ingests, transforms, and loads search history data
- user_analysis.ipynb: Analyzes user behavior from contract information and interaction data
- mysql_azuresql_connector_template.ipynb: PySpark templates for connecting to the data sources and destinations
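The ETL scripts follow an extract-transform-load pattern. Below is a minimal sketch of what a flow like etl_log_content.py could look like; the column names (user_id, category, duration) and the target table are illustrative assumptions, not the project's actual schema:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("etl_log_content").getOrCreate()

# Extract: raw watch history events (schema assumed for illustration)
logs = spark.read.json("data/log_content/*.json")

# Transform: total watch time per user and content category
insight = (logs
           .groupBy("user_id", "category")
           .agg(F.sum("duration").alias("total_duration"))
           .withColumn("load_date", F.current_date()))

# Load: append the insight table into PostgreSQL
# (PostgreSQL JDBC driver must be on the classpath, as in the connector sketch above)
(insight.write.format("jdbc")
 .option("url", "jdbc:postgresql://<host>:5432/<database>")
 .option("dbtable", "insight.watch_time_by_category")
 .option("user", "<user>")
 .option("password", "<password>")
 .mode("append")
 .save())
```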