This ETL pipeline was built as part of a series of interviews for a senior data engineer role, in which I had to build a data system that processes real-time messages from a Kafka topic.
I show an example of how such a streaming workflow can be set up in a scalable, robust, and easy-to-integrate manner. Structured Streaming is a module of Apache Spark that can ingest and process a huge volume of incoming data in near real time. It integrates seamlessly with real-time data ingestion platforms (Kafka, Kinesis) and provides connectors for SQL and NoSQL databases (MySQL, MongoDB), data lakes (S3, GCS, etc.), and data warehouses (Redshift, BigQuery).
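As a small illustration of that connector model, the sketch below reads the events stream from Kafka and lands it as Parquet on a data lake path. The broker address, bucket path, and checkpoint location are assumptions made up for this example, not part of this repo:

```python
from pyspark.sql import SparkSession

# Requires the spark-sql-kafka-0-10 package on the classpath.
spark = SparkSession.builder.appName("events-to-lake").getOrCreate()

# Streaming source: the events topic on an assumed broker address.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "events")
    .load()
)

# Streaming sink: a Parquet data lake path (the s3a:// locations are illustrative).
query = (
    events.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
    .writeStream
    .format("parquet")
    .option("path", "s3a://my-bucket/events/")
    .option("checkpointLocation", "s3a://my-bucket/checkpoints/events/")
    .start()
)

query.awaitTermination()
```

Swapping `.format("parquet")` for another connector (or using `foreachBatch` for JDBC sinks) is all it takes to target a different store.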
This project requires Docker and Docker Compose installed on your local machine (the Docker images were specified for the Mac M1 chip); an illustrative sketch of such a compose file follows the steps below:
- In the root directory, run `docker-compose up`
- Enter the listener container with `docker-compose exec listener-service bash`
- Run the Python script `python3 kafka_consumer.py` to start seeing the messages as they are published to the events topic in the Kafka container
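The actual compose file lives in the repo root; as a rough idea of what it might contain, here is a minimal sketch. The image tags, ports, and environment values are assumptions; only the `listener-service` name and the Kafka dependency come from the steps above:

```yaml
# Hypothetical docker-compose.yml sketch; image tags, ports, and env values
# are illustrative and not taken from this repo.
version: "3.8"
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.3.0   # arm64 variants exist for the M1 chip
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
  kafka:
    image: confluentinc/cp-kafka:7.3.0
    depends_on:
      - zookeeper
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
  listener-service:
    build: ./listener   # hypothetical build context for the consumer script
    depends_on:
      - kafka
```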
The messages published to the events topic in Kafka mimic the real-time messages that are constantly generated by our activity on websites and apps. These messages are buffered in the Kafka cluster, waiting for a consumer that subscribes to the topic. The consumer will get every single message published to the topic it subscribed to for as long as the subscription remains active. If another consumer subscribes to the same topic (under a different consumer group), each consumer gets its own copy of each message, therefore enabling one-to-one, one-to-many, many-to-one, and many-to-many patterns.
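A minimal sketch of such a subscriber, assuming the kafka-python client and a broker reachable at `kafka:9092` (both assumptions; the repo's `kafka_consumer.py` may differ):

```python
from kafka import KafkaConsumer

# Subscribe to the events topic; the broker address and group id are assumptions.
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="kafka:9092",
    group_id="listener-group",
    auto_offset_reset="earliest",  # start from the oldest buffered message
    value_deserializer=lambda raw: raw.decode("utf-8"),
)

# Blocks and yields every message published to the topic for as long as
# the subscription stays active.
for message in consumer:
    print(f"partition={message.partition} offset={message.offset} value={message.value}")
```

Starting a second instance under a different `group_id` would receive its own copy of every message, which is what enables the fan-out patterns above.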
If we decide to process the incoming messages as they are received by the Kafka cluster, there are several stream processors available, such as Flink, Dataflow, and Apache Spark. I chose Spark because, besides being the one I use the most, it applies the same programming model to both batch and streaming pipelines, which simplifies ETL pipeline creation and deployment. Spark Structured Streaming can deliver near real-time processing latencies (1 s micro-batches), but if true real-time processing is needed, it may not be your preferred choice. It also has a very simple API in Python and Scala, with plenty of documentation out there that lets us do virtually any task on streaming data.
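To illustrate the unified model, the sketch below defines one transformation and applies it to the live Kafka stream with a 1-second micro-batch trigger; the broker address and the upper-casing transform are assumptions made for this example:

```python
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.functions import col, upper

# Requires the spark-sql-kafka-0-10 package on the classpath.
spark = SparkSession.builder.appName("events-etl").getOrCreate()

def transform(events: DataFrame) -> DataFrame:
    # The same function works on batch and streaming DataFrames alike;
    # upper-casing the payload is just a stand-in transformation.
    return events.select(upper(col("value").cast("string")).alias("event"))

# Streaming source: the events topic on an assumed broker address.
stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "events")
    .load()
)

# Micro-batches fire every second, which is where the near real-time
# (rather than true real-time) latency comes from.
query = (
    transform(stream)
    .writeStream
    .format("console")
    .trigger(processingTime="1 second")
    .start()
)

query.awaitTermination()
```

Pointing `transform` at a DataFrame from `spark.read` instead would run the exact same logic as a batch job.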