This repository serves as an introduction to Spark Streaming combined with Kedro. The infrastructure needed to enable streaming has all been defined in `docker-compose.yml`.
There are two pipelines defined, emulating a real-life scenario:

- `kedro run --pipeline train` - This pipeline takes care of data engineering, training an ML model, publishing it to MLflow, and running validation.
- `kedro run --pipeline inference` - This pipeline is responsible for inference. It does data engineering on streaming data, loads the ML model from MLflow (sketched below), and prints the output predictions as a stream to the console. A Kafka topic can also be configured as a sink.
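As a minimal sketch of the model-loading step (not the pipeline's actual code), a node could fetch the registered model from the MLflow server like this; the model name `fraud-detector` and version are hypothetical placeholders:

```python
import mlflow

# Point the client at the MLflow server started by docker-compose
# (see the Services table below).
mlflow.set_tracking_uri("http://localhost:5000")

# "fraud-detector" and version 1 are placeholders; substitute whatever
# the train pipeline actually registers in the Model Registry.
model = mlflow.pyfunc.load_model("models:/fraud-detector/1")
```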
We use MinIO as an abstraction over blob storage; buckets defined here can be aliased to real S3 buckets for production scenarios. For the MLflow metadata store, we use a MySQL server. Kafka handles the streaming architecture.
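Because MinIO speaks the S3 API, local code can reach it with any S3 client simply by overriding the endpoint. A minimal sketch using `boto3`; the access keys below are placeholders, so check `docker-compose.yml` for the actual values:

```python
import boto3

# MinIO exposes an S3-compatible API on localhost:9000 (see the
# Services table below); only the endpoint differs from real S3.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",
    aws_access_key_id="minioadmin",      # placeholder; see docker-compose.yml
    aws_secret_access_key="minioadmin",  # placeholder; see docker-compose.yml
)

print([bucket["Name"] for bucket in s3.list_buckets()["Buckets"]])
```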
To follow along, you will need:

- A Docker installation, and a basic understanding of Docker and microservices
- A high-level understanding of how Kedro pipelines work
To get started:

1. Clone this repository, and change the current directory to the root of the repository.

   ```bash
   git clone https://github.com/deepyaman/kedro-streaming.git
   cd kedro-streaming
   ```
2. Build the `mlflow_server` Docker image.

   ```bash
   docker build -t mlflow_server:latest -f Dockerfile.mlflow .
   ```
3. Start the multi-container application in detached mode.

   ```bash
   docker-compose up -d
   ```

   Services:

   | Service    | Host      | Port  |
   | ---------- | --------- | ----- |
   | Kafka      | localhost | 19092 |
   | MLflow     | localhost | 5000  |
   | MinIO (S3) | localhost | 9000  |
4. Create a new conda environment, activate it, and install all Python dependencies.

   ```bash
   conda create -n kedro-streaming python=3.7 && conda activate kedro-streaming && pip install -e src/
   ```
5. Run the Kedro pipeline to train the model, and publish the model to the MLflow Model Registry.

   ```bash
   kedro run --pipeline train
   ```
6. Run the Kedro pipeline for spinning up the inference engine, which will make predictions on data as it comes in. The `SparkStreamingDataSet` is defined to read from the `hello-fraudster` Kafka topic. The pipeline is currently configured to print to the console, but various other sinks can be defined (see the streaming sketch after these steps).

   ```bash
   kedro run --pipeline inference
   ```
7. Now that you have the inference engine up, produce some messages for the inference pipeline to consume in real time to test it. `notebooks/message_sender.ipynb` contains a sample of how messages can be produced and sent to the defined `hello-fraudster` Kafka topic; a standalone producer sketch also follows these steps. Upon executing the notebook, you should see the predictions in the console where you ran the inference pipeline.
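For context on step 6, this is roughly what a Spark Structured Streaming read from Kafka looks like; it is a minimal sketch under stated assumptions, not the repository's actual `SparkStreamingDataSet` implementation, and the Kafka connector version shown must match your Spark version:

```python
from pyspark.sql import SparkSession

# The Kafka integration ships as a separate package; the artifact version
# here assumes Spark 3.0 with Scala 2.12 -- adjust to your installation.
spark = (
    SparkSession.builder.appName("inference-sketch")
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0")
    .getOrCreate()
)

# Subscribe to the hello-fraudster topic on the broker from the Services table.
stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:19092")
    .option("subscribe", "hello-fraudster")
    .load()
)

# Print incoming records to the console, mirroring the pipeline's default sink.
query = (
    stream.selectExpr("CAST(value AS STRING)")
    .writeStream.format("console")
    .start()
)
query.awaitTermination()
```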
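And for step 7, if you prefer a script to the notebook, messages can be produced with `kafka-python`; the payload fields below are invented placeholders rather than the schema the model actually expects, so refer to the notebook for a real payload:

```python
import json

from kafka import KafkaProducer  # pip install kafka-python

# Connect to the broker exposed by docker-compose (see the Services table).
producer = KafkaProducer(
    bootstrap_servers="localhost:19092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Placeholder payload; use the schema from notebooks/message_sender.ipynb
# for real runs against the trained model.
producer.send("hello-fraudster", {"amount": 42.0, "merchant": "example"})
producer.flush()
```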