Skip to content

Kafka to HDFS Generic and Scalable Ingestor

Notifications You must be signed in to change notification settings

XRayA4T/hyperdrive

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Copyright 2019 ABSA Group Limited

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

Hyperdrive - Kafka to HDFS Ingestion tool

Hyperdrive is a flexible ingestor platform that allows configurable ingestion from Kafka into HDFS.

It is composed of Spark ingestion jobs and watchers for notification topics containing the parameters of the ingestion.

The notification topic serves as the "API" for the platform.

Motivation

In environments composed of multiple and diverse data sources, getting access to all of them as well as creating a data lake is quite a challenging task.

One approach that has been gathering momentum is to use the so called Lambda Architecture (http://lambda-architecture.net/). However, the ingestion from Kafka into HDFS is not yet standardized in the industry. Parameters such as periodicity might be source-dependent which makes it harder to create a silver bullet solution.

On the other hand, if the task of deciding when and what to ingest is put on data producers' shoulders, it becomes possible to create a general solution for ingestion. This is what Hyperdrive is intended to be.

Features

Hyperdrive defines a notification topic that serves as a trigger for ingestions.

The structure of the messages in the notification topic takes into account the payload topic to be offloaded into HDFS, the starting offsets, the number of messages to be processed, the destination directory, etc.

Hyperdrive supports Schema Registry connection and Confluent Avro messages (through ABRiS), which makes it capable of seamlessly handling whatever schemas the payload messages are using.

There is also a load balancer that controls the number of active Spark ingestor jobs at a given time.

About

Kafka to HDFS Generic and Scalable Ingestor

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Scala 100.0%