Spark 2.4 distributed, fault-tolerant, and scalable custom Spark Structured Streaming from any SQL table to Kafka. The project is based on Apache Spark for the micro-batch streaming and on HBase to store and retrieve the offsets.
- build.sbt - declares the following dependencies (a hedged build.sbt sketch follows this list):
- hbase-client, the SQL Server JDBC driver (change it to any SQL connector), org.json, spark-sql, spark-core, spark-sql-kafka-0-10
- src/main/resources/application.conf - SQL Server JDBC configuration, HBase connection configuration, etc. (an example follows this list)
- src/main/zeev/hbase/util - contains the logic for the HBase configuration/connection and for storing and retrieving the offsets that keep the streaming fault tolerant (a sketch of the offset store follows this list)
- src/main/zeev/sql/util - contains the logic for the SQL configuration and connections, including a DAO to query the SQL tables
- src/main/zeev/producers - a basic Spark 2.4 Structured Streaming job that calls readStream on the custom SQL table source and writeStream to Kafka (a producer sketch follows this list)
- src/main/zeev/spark/streaming - contains all the objects and classes (MicroBatchReader, InputPartition [DataReaderFactory in Spark 2.3], InputPartitionReader [DataReader in Spark 2.3], custom Offsets, etc.) that implement the custom Spark 2.4 Structured Streaming source; a condensed skeleton of these pieces also follows this list. In more detail:
- src/main/zeev/spark/streaming/offsets - a custom implementation of the v2.reader.streaming.Offset trait, essentially an abstract representation of progress through the MicroBatchReader or ContinuousReader. During execution the offsets are logged and used as restart checkpoints; they are also stored in HBase.
- src/main/zeev/spark/streaming/sources - contains the custom micro-batch reader classes and companion objects; the classes extend DataSourceV2, MicroBatchReadSupport, and DataSourceRegister. This is the starting point and the class the Kafka producer refers to in readStream.
- src/main/zeev/spark/streaming/batchreaders - the MicroBatchReader implementation; it contains all the logic for reading data from the SQL tables, storing the offsets, restarting from checkpoints, and creating the RDDs/InputPartitions (DataReaderFactories in Spark 2.3) for Spark
- src/main/zeev/spark/streaming/partitionreaders - contains the custom InputPartitionReaders (called DataReader in Spark 2.3); each one is responsible for actually reading the data of one RDD partition
- src/main/zeev/spark/streaming/inputpartitions - contains the InputPartitions (called DataReaderFactory in Spark 2.3), which are responsible for creating the actual partition readers.
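
A minimal build.sbt sketch, assuming Spark 2.4 on Scala 2.11; the version numbers and the choice of the Microsoft JDBC driver are assumptions, so adjust them to your environment:

```scala
// build.sbt - minimal sketch; versions and the JDBC driver coordinates are assumptions.
name := "sql-to-kafka-streaming"
version := "0.1"
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  "org.apache.spark"        %% "spark-core"           % "2.4.0",
  "org.apache.spark"        %% "spark-sql"            % "2.4.0",
  "org.apache.spark"        %% "spark-sql-kafka-0-10" % "2.4.0",
  "org.apache.hbase"        %  "hbase-client"         % "1.4.9",
  "com.microsoft.sqlserver" %  "mssql-jdbc"           % "7.2.1.jre8", // swap for any JDBC driver
  "org.json"                %  "json"                 % "20180813"
)
```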
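
A hedged example of application.conf; every key name below is an assumption and should be matched to whatever keys the project's config loader actually reads:

```hocon
# application.conf - illustrative keys only; align them with the project's config loader.
sql {
  jdbc-url = "jdbc:sqlserver://localhost:1433;databaseName=mydb"
  user     = "spark_user"
  password = "change-me"
  table    = "dbo.events"
}

hbase {
  zookeeper-quorum = "zk1,zk2,zk3"
  zookeeper-port   = 2181
  offsets-table    = "stream_offsets"
}
```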
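
A minimal sketch of the HBase offset store/retrieve logic described for src/main/zeev/hbase/util, written against the plain hbase-client API; the table name, column family, and qualifier are assumptions:

```scala
// Hedged sketch of an HBase-backed offset store; names are illustrative, not the project's actual ones.
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
import org.apache.hadoop.hbase.util.Bytes

object HBaseOffsetStore {
  private val conf = HBaseConfiguration.create()
  conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3")        // in the real project these come from application.conf
  conf.set("hbase.zookeeper.property.clientPort", "2181")

  private val connection = ConnectionFactory.createConnection(conf)
  private val table = connection.getTable(TableName.valueOf("stream_offsets"))

  // Persist the last committed offset for a given stream/table id.
  def saveOffset(streamId: String, offset: Long): Unit = {
    val put = new Put(Bytes.toBytes(streamId))
    put.addColumn(Bytes.toBytes("o"), Bytes.toBytes("last"), Bytes.toBytes(offset))
    table.put(put)
  }

  // Read the last committed offset, falling back to 0 on a cold start.
  def readOffset(streamId: String): Long = {
    val result = table.get(new Get(Bytes.toBytes(streamId)))
    val cell = result.getValue(Bytes.toBytes("o"), Bytes.toBytes("last"))
    if (cell == null) 0L else Bytes.toLong(cell)
  }
}
```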
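
A condensed, self-contained skeleton of the Spark 2.4 DataSourceV2 pieces described above (source, MicroBatchReader, custom Offset, InputPartition, InputPartitionReader). The class names are illustrative rather than the project's actual ones, and the SQL/HBase calls are stubbed where the real code would do I/O:

```scala
import java.util.Optional
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.sources.DataSourceRegister
import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, MicroBatchReadSupport}
import org.apache.spark.sql.sources.v2.reader.{InputPartition, InputPartitionReader}
import org.apache.spark.sql.sources.v2.reader.streaming.{MicroBatchReader, Offset}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}
import org.apache.spark.unsafe.types.UTF8String

// Custom Offset: here progress is simply the last primary-key value read from the SQL table.
case class SqlOffset(lastId: Long) extends Offset {
  override def json(): String = s"""{"lastId":$lastId}"""
}

// Entry point registered with Spark; referenced by its short name in readStream.format(...).
class SqlTableSource extends DataSourceV2 with MicroBatchReadSupport with DataSourceRegister {
  override def shortName(): String = "sql-table"
  override def createMicroBatchReader(schema: Optional[StructType],
                                      checkpointLocation: String,
                                      options: DataSourceOptions): MicroBatchReader =
    new SqlTableMicroBatchReader(options)
}

// MicroBatchReader: tracks the offset range of each micro-batch and plans the input partitions.
class SqlTableMicroBatchReader(options: DataSourceOptions) extends MicroBatchReader {
  private var start: SqlOffset = SqlOffset(0L)   // in the real project the start offset is read from HBase
  private var end: SqlOffset   = SqlOffset(0L)

  override def setOffsetRange(startOff: Optional[Offset], endOff: Optional[Offset]): Unit = {
    start = if (startOff.isPresent) startOff.get.asInstanceOf[SqlOffset] else start
    end   = if (endOff.isPresent) endOff.get.asInstanceOf[SqlOffset]
            else SqlOffset(start.lastId + 1000)  // stub: query MAX(id) from the SQL table instead
  }
  override def getStartOffset: Offset = start
  override def getEndOffset: Offset   = end
  override def deserializeOffset(json: String): Offset =
    SqlOffset("""\d+""".r.findFirstIn(json).map(_.toLong).getOrElse(0L))
  override def commit(offset: Offset): Unit = ()  // persist the committed offset to HBase here
  override def stop(): Unit = ()
  override def readSchema(): StructType =
    StructType(Seq(StructField("key", LongType), StructField("value", StringType)))

  override def planInputPartitions(): java.util.List[InputPartition[InternalRow]] = {
    val partitions = new java.util.ArrayList[InputPartition[InternalRow]]()
    partitions.add(new SqlInputPartition(start.lastId, end.lastId))
    partitions
  }
}

// InputPartition (DataReaderFactory in 2.3): serialized to executors, creates the partition reader.
class SqlInputPartition(from: Long, to: Long) extends InputPartition[InternalRow] {
  override def createPartitionReader(): InputPartitionReader[InternalRow] =
    new SqlInputPartitionReader(from, to)
}

// InputPartitionReader (DataReader in 2.3): reads the rows of one partition; the range is a stand-in
// for iterating a JDBC ResultSet.
class SqlInputPartitionReader(from: Long, to: Long) extends InputPartitionReader[InternalRow] {
  private var current = from
  override def next(): Boolean = { current += 1; current <= to }
  override def get(): InternalRow =
    InternalRow(current, UTF8String.fromString(s"""{"id":$current}"""))
  override def close(): Unit = ()  // real code would close the JDBC connection here
}
```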
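
A hedged sketch of the producer side: readStream against the custom source and writeStream to Kafka. The format short name, option keys, brokers, topic, and checkpoint path are assumptions; align them with the source actually registered under the sources package:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object SqlToKafkaProducer {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sql-to-kafka")
      .getOrCreate()

    // readStream against the custom DataSourceV2 source (registered short name assumed).
    val df = spark.readStream
      .format("sql-table")
      .option("table", "dbo.events")  // hypothetical option consumed by the MicroBatchReader
      .load()

    // writeStream to Kafka; the Kafka sink expects "key"/"value" columns.
    val query = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
      .option("topic", "events")
      .option("checkpointLocation", "hdfs:///checkpoints/sql-to-kafka")
      .trigger(Trigger.ProcessingTime("30 seconds"))
      .start()

    query.awaitTermination()
  }
}
```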
- change src/main/resources/application.conf based on your HBase and SQL settings
- in src/main/zeev/kafka/producers - modify the Kafka brokers and any input you want to pass to the MicroBatchReader source
- set your checkpoint location (the checkpointLocation option on writeStream)
- run/deploy it any way you want: standalone, cluster mode, etc. (a spark-submit example is shown after this list)
- install and import the sbt libraries
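
A typical submit command, assuming the project is packaged with sbt assembly; the main class and jar name are placeholders for whatever your build actually produces:

```sh
spark-submit \
  --class producers.SqlToKafkaProducer \
  --master yarn \
  --deploy-mode cluster \
  target/scala-2.11/sql-to-kafka-streaming-assembly-0.1.jar
```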
Apache Spark, Scala, HBase, SQL, HDFS
Zeev Feldbeine
This project is licensed under the MIT License - see the [LICENSE.md](LICENSE.md) file for details