Enables Python developers to leverage Debezium's CDC capabilities with custom event handlers and seamless integration.

memiiso/pydbzengine

pydbzengine

A Python module for running the Debezium Engine from Python and consuming database CDC (change data capture) events in Python code.

Java integration is provided by Pyjnius, a Python library for accessing Java classes.

Installation

Install from PyPI:

pip install pydbzengine
# Or install the latest version from GitHub:
pip install https://github.com/memiiso/pydbzengine/archive/master.zip --upgrade --user

How to Use

First, install the package with its dev extras: pip install pydbzengine[dev]

from typing import List
from pydbzengine import ChangeEvent, BasePythonChangeHandler
from pydbzengine import Properties, DebeziumJsonEngine


class PrintChangeHandler(BasePythonChangeHandler):
    """
    A custom change event handler class.

    This class processes batches of Debezium change events received from the engine.
    The `handleJsonBatch` method is where you implement your logic for consuming
    and processing these events.  Currently, it prints basic information about
    each event to the console.
    """

    def handleJsonBatch(self, records: List[ChangeEvent]):
        """
        Handles a batch of Debezium change events.

        This method is called by the Debezium engine with a list of ChangeEvent objects.
        Change this method to implement your desired processing logic.  For example,
        you might parse the event data, transform it, and load it into a database or
        other destination.

        Args:
            records: A list of ChangeEvent objects representing the changes captured by Debezium.
        """
        print(f"Received {len(records)} records")
        for record in records:
            print(f"destination: {record.destination()}")
            print(f"key: {record.key()}")
            print(f"value: {record.value()}")
        print("--------------------------------------")


if __name__ == '__main__':
    props = Properties()
    props.setProperty("name", "engine")
    props.setProperty("snapshot.mode", "initial_only")
    # Add further Debezium connector configuration properties here.  For example:
    # props.setProperty("connector.class", "io.debezium.connector.mysql.MySqlConnector")
    # props.setProperty("database.hostname", "your_database_host")
    # props.setProperty("database.port", "3306")

    # Create a DebeziumJsonEngine instance, passing the configuration properties and the custom change event handler.
    engine = DebeziumJsonEngine(properties=props, handler=PrintChangeHandler())

    # Start the Debezium engine to begin consuming and processing change events.
    engine.run()
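
The handler above prints record.value() as-is. As a minimal sketch, assuming the value is a JSON string in Debezium's standard change-event envelope (fields `op`, `before`, `after`, optionally nested under `payload` when schemas are enabled), a helper could reduce each event to its operation and row image. `summarize_change` is a hypothetical name used for illustration and is not part of pydbzengine.

```python
import json
from typing import Any, Dict, Optional

# Debezium operation codes: c=create, u=update, d=delete, r=snapshot read.
OP_NAMES = {"c": "insert", "u": "update", "d": "delete", "r": "snapshot"}


def summarize_change(value_json: Optional[str]) -> Optional[Dict[str, Any]]:
    """Hypothetical helper: reduce a Debezium JSON envelope to operation and row."""
    if not value_json:
        # Tombstone events carry no value.
        return None
    event = json.loads(value_json)
    if not isinstance(event, dict):
        return None
    # With schemas enabled, the envelope is nested under "payload".
    payload = event.get("payload", event)
    op = payload.get("op")
    # Deletes only carry the "before" image; other operations carry "after".
    row = payload.get("before") if op == "d" else payload.get("after")
    return {"operation": OP_NAMES.get(op, op), "row": row}


# Assumed sample event, shaped like an insert captured with schemas enabled.
sample = json.dumps(
    {"payload": {"op": "c", "before": None, "after": {"id": 1, "name": "a"}}}
)
print(summarize_change(sample))
```

Inside handleJsonBatch, such a helper would be called as summarize_change(record.value()) for each record, provided the engine is configured to emit this envelope format.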

How to consume events with dlt

For the full code, including the imports omitted here, see dlt_consuming.py.

def main():
    """
    Demonstrates capturing change data from PostgreSQL using Debezium and loading
    it into DuckDB using dlt.

    This example starts a PostgreSQL container, configures Debezium to capture changes,
    processes the change events with a custom handler using dlt, and finally queries
    the DuckDB database to display the loaded data.
    """
    # Start the PostgreSQL container that will serve as the replication source.
    sourcedb = DbPostgresql()
    sourcedb.start()

    # Get Debezium engine configuration properties, including connection details
    # for the PostgreSQL database. This function debezium_engine_props returns all the properties
    props = debezium_engine_props(sourcedb=sourcedb)

    # Create a dlt pipeline to load the change events into DuckDB.
    dlt_pipeline = dlt.pipeline(
        pipeline_name="dbz_cdc_events_example",
        destination="duckdb",
        dataset_name="dbz_data"
    )

    # Instantiate change event handler (DltChangeHandler) that uses the dlt pipeline
    # to process and load the Debezium events. This handler has
    # the logic for transforming and loading the events.
    handler = DltChangeHandler(dlt_pipeline=dlt_pipeline)

    # Create a DebeziumJsonEngine instance, providing the configuration properties
    # and the custom event handler.
    engine = DebeziumJsonEngine(properties=props, handler=handler)

    # Run the Debezium engine asynchronously with a timeout. This allows the example
    # to run for a limited time and then terminate automatically.
    Utils.run_engine_async(engine=engine, timeout_sec=60)
    # engine.run() # This would be used for synchronous execution (without timeout)

    # ================ PRINT THE CONSUMED DATA FROM DUCKDB ===========================
    # Connect to the DuckDB database.
    con = duckdb.connect(DUCKDB_FILE.as_posix())

    # Retrieve a list of all tables in the DuckDB database.
    result = con.sql("SHOW ALL TABLES").fetchall()

    # Iterate through the tables and display the data from tables within the 'dbz_data' schema.
    for r in result:
        database, schema, table = r[:3]  # Extract database, schema, and table names.
        if schema == "dbz_data":  # Only show data from the schema where Debezium loaded the data.
            print(f"Data in table {table}:")
            con.sql(f"select * from {database}.{schema}.{table}").show()  # Display table data


if __name__ == "__main__":
    """
    Main entry point for the script.
    Before running, ensure you have installed the necessary dependencies:
    `pip install pydbzengine[dev]`
    """
    main()
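
A handler that loads events into tables usually has to split each batch by destination first. The following is a minimal sketch of that grouping step only. It is not the actual DltChangeHandler implementation, and `FakeEvent` is a hypothetical stand-in that exposes the same destination() and value() accessors used in the handler example above, so the snippet runs without a Debezium engine or a database.

```python
import json
from collections import defaultdict
from typing import Any, Dict, Iterable, List


class FakeEvent:
    """Hypothetical stand-in for ChangeEvent, for illustration only."""

    def __init__(self, destination: str, value: str):
        self._destination = destination
        self._value = value

    def destination(self) -> str:
        return self._destination

    def value(self) -> str:
        return self._value


def group_by_destination(records: Iterable[Any]) -> Dict[str, List[dict]]:
    """Group parsed event values by destination, one list per target table."""
    grouped: Dict[str, List[dict]] = defaultdict(list)
    for record in records:
        grouped[record.destination()].append(json.loads(record.value()))
    return dict(grouped)


# Assumed sample batch; the destination names are made up for the example.
batch = [
    FakeEvent("testc.inventory.customers", json.dumps({"op": "c", "after": {"id": 1}})),
    FakeEvent("testc.inventory.orders", json.dumps({"op": "c", "after": {"id": 10}})),
    FakeEvent("testc.inventory.customers", json.dumps({"op": "u", "after": {"id": 1}})),
]

for destination, rows in group_by_destination(batch).items():
    print(f"{destination}: {len(rows)} event(s)")
```

Each group could then be passed to a loader, for example one dlt pipeline run per destination table, although how DltChangeHandler actually maps destinations to tables is defined in the library itself.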
