Demo project to mess around with Kafka, Spark, and Cassandra.
Fake Chipotle order data is generated by a Go script under /src. The script serializes each order to JSON and pushes it to a predefined Kafka topic, orders. Spark then subscribes to that topic, casts the raw Kafka values (which travel as []byte) back to strings, parses the JSON, and explodes the nested columns. Finally, order items are aggregated over 5-minute windows and each micro-batch result is pushed to a table in Cassandra (a sketch of the Spark side follows the diagram below).
This workflow lightly resembles a real-world, real-time(ish) reporting scenario. Ideally some BI tool would sit on top of Cassandra to query the data. The configuration and replication of the services is also overly simplified, since the whole project runs on a single DigitalOcean droplet.
+---------+ +---------+ +---------+ +-------------+
| | --> | | | | | |
| ORDER | --> | KAFKA | --> | SPARK | --> | CASSANDRA |
| | --> | | | | | |
+---------+ +---------+ +---------+ +-------------+
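The actual job lives in this repo, but here is a minimal sketch of what the Spark side could look like, assuming the order schema from the example payload further down and the top_orders_every_5_mins table from the Cassandra step. The object name, connection hosts, watermark, and output mode are assumptions for illustration, and the sketch expects spark-sql-kafka and the spark-cassandra-connector on the classpath.

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object OrdersStream {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("chipotle-orders")
      .config("spark.cassandra.connection.host", "localhost") // assumes Cassandra on the same droplet
      .getOrCreate()

    // Schema matching the example order JSON produced by the Go script
    val orderSchema = new StructType()
      .add("time", StringType)
      .add("customer", new StructType()
        .add("name", StringType)
        .add("age", IntegerType)
        .add("city", StringType)
        .add("state", StringType))
      .add("order", ArrayType(new StructType()
        .add("name", StringType)
        .add("price", DoubleType)))
      .add("creditCard", StringType)

    // Kafka hands the value over as bytes, so cast back to string before parsing the JSON,
    // then explode the nested order array into one row per item
    val items = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "orders")
      .load()
      .select(from_json(col("value").cast("string"), orderSchema).as("o"))
      .select(col("o.time").cast("timestamp").as("time"), explode(col("o.order")).as("item"))

    // Count each menu item per 5-minute window
    val topOrders = items
      .withWatermark("time", "5 minutes")
      .groupBy(window(col("time"), "5 minutes"), col("item.name").as("name"))
      .count()
      .select(
        col("window.start").as("start"),
        col("window.end").as("end"),
        col("name"),
        col("count").cast("int").as("count"))

    // Push every micro-batch into the Cassandra table created below; inserts on the same
    // (start, name) key act as upserts, so updated counts simply overwrite older ones
    val writeBatch: (DataFrame, Long) => Unit = (batch, batchId) => {
      batch.write
        .format("org.apache.spark.sql.cassandra")
        .options(Map("keyspace" -> "chipotle", "table" -> "top_orders_every_5_mins"))
        .mode("append")
        .save()
    }

    topOrders.writeStream
      .outputMode("update")
      .foreachBatch(writeBatch)
      .start()
      .awaitTermination()
  }
}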
- Create Kafka Topic
kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic orders
- Set Up Cassandra
create keyspace chipotle with replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
create table chipotle.top_orders_every_5_mins (
start timestamp,
end timestamp,
name text,
count int,
primary key (start, name)
);
- Submit Spark Job + Streaming
make spark
- Create Fake Order Data
cd src && go run *.go

Each generated order looks something like this:
{
"time": "2022-09-14 22:03:15",
"customer": {
"name": "Yoshiko McGlynn",
"age": 43,
"city": "Chula Vista",
"state": "Texas"
},
"order": [
{
"name": "Patron Margarita",
"price": 7.15
},
{
"name": "Salad (Chicken)",
"price": 6.5
}
],
"creditCard": "Mastercard"
}
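There is no BI layer in this setup, so the quickest way to sanity-check the aggregates is cqlsh, or reading the table back through the same Cassandra connector. A minimal sketch, assuming the connector is on the classpath (the object name and connection host are made up for illustration):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.desc

object TopOrdersCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("top-orders-check")
      .config("spark.cassandra.connection.host", "localhost") // assumes local Cassandra
      .getOrCreate()

    // Read the aggregated table back out of Cassandra and show the biggest sellers first
    spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "chipotle", "table" -> "top_orders_every_5_mins"))
      .load()
      .orderBy(desc("count"))
      .show(20, truncate = false)
  }
}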
- Spark -> v 3.2.0
- Scala -> v 2.12.15
- SBT -> v 1.7.1
- Cassandra -> v 4.0.5
- Kafka -> v 3.2.1
- Go -> v 1.18.1