Building Open Data Lakes on AWS with Debezium and Apache Hudi

Build an open-source data lake on AWS using a combination of Debezium, Apache Kafka, Apache Hudi, Apache Spark, and Apache Hive

2 min read · Oct 26, 2021

Introduction

In the following recorded demonstration, we will build a simple open data lake on AWS using a combination of open-source software (OSS): Red Hat’s Debezium, Apache Kafka, and Kafka Connect for change data capture (CDC), and Apache Hive, Apache Spark, Apache Hudi, and Hudi’s DeltaStreamer for managing the data lake. We will use fully managed AWS services to host the open data lake components, including Amazon RDS, Amazon MSK, Amazon EKS, and Amazon EMR.
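To make the CDC idea concrete, here is a minimal sketch in plain Python of how a stream of Debezium-style change events can be folded into the latest state of a table. This is conceptually what an upsert-capable table format does with such a stream, but it is not Hudi's implementation. The `op`, `before`, and `after` fields follow Debezium's change event envelope; the `artist_id` and `name` columns and the sample rows are hypothetical.

```python
from typing import Any, Dict

Row = Dict[str, Any]


def apply_change_event(table: Dict[Any, Row], event: Dict[str, Any], key_field: str) -> None:
    """Apply one Debezium-style change event to an in-memory table.

    `table` maps a record key to the latest version of that row.
    """
    op = event["op"]
    if op in ("c", "r", "u"):
        # create, snapshot read, or update: upsert the new row image
        row = event["after"]
        table[row[key_field]] = row
    elif op == "d":
        # delete: the key is taken from the old row image
        row = event["before"]
        table.pop(row[key_field], None)
    else:
        raise ValueError(f"unsupported op: {op!r}")


# Hypothetical events; real Debezium events also carry source metadata and timestamps.
events = [
    {"op": "c", "before": None, "after": {"artist_id": 1, "name": "Artist A"}},
    {"op": "u", "before": {"artist_id": 1, "name": "Artist A"},
     "after": {"artist_id": 1, "name": "Artist A (updated)"}},
    {"op": "c", "before": None, "after": {"artist_id": 2, "name": "Artist B"}},
    {"op": "d", "before": {"artist_id": 2, "name": "Artist B"}, "after": None},
]

artists: Dict[Any, Row] = {}
for event in events:
    apply_change_event(artists, event, key_field="artist_id")

print(artists)
```

After replaying the four events, only the updated version of the first row remains: the insert and update collapse into one record, and the second row is removed by its delete event.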

Demonstration

Source Code

All source code for this post and the previous posts in this series is open-sourced and located on GitHub.

GitHub - garystafford/kafka-connect-msk-demo: For a series of posts on Amazon MSK, Amazon EKS, and Amazon EMR

github.com

The following files are used in the demonstration:

MoMA data: Uncompress files and import pipe-delimited data to PostgreSQL;
base.properties: Base Hudi DeltaStreamer properties;
deltastreamer_artists_file_based_schema.properties: Demo-specific Hudi DeltaStreamer properties for MoMA Artists;
deltastreamer_artworks_file_based_schema.properties: Demo-specific Hudi DeltaStreamer properties for MoMA Artworks;
source_connector_moma_postgres_kafka.json: Kafka Connect Source Connector (PostgreSQL to Kafka);
sink_connector_moma_kafka_s3.json: Kafka Connect Sink Connector (Kafka to Amazon S3);
moma_debezium_hudi_demo.ipynb: Jupyter PySpark Notebook;
demonstration_notes.md: Commands used in the demonstration.
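For readers unfamiliar with Kafka Connect, the sketch below shows the general shape of a Debezium PostgreSQL source connector definition, built as a Python dictionary and serialized to JSON. It is an illustration only and does not reproduce the contents of source_connector_moma_postgres_kafka.json in the repository. The connector name, hostname, credentials, and table names are hypothetical placeholders, and the `database.server.name` property assumes a Debezium 1.x release.

```python
import json
from typing import Any, Dict, List


def build_source_connector(
    name: str,
    hostname: str,
    dbname: str,
    user: str,
    password: str,
    tables: List[str],
) -> Dict[str, Any]:
    """Build a Kafka Connect payload for a Debezium PostgreSQL source connector."""
    return {
        "name": name,
        "config": {
            "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
            "tasks.max": "1",
            # pgoutput is PostgreSQL's built-in logical decoding plugin
            "plugin.name": "pgoutput",
            "database.hostname": hostname,
            "database.port": "5432",
            "database.user": user,
            "database.password": password,
            "database.dbname": dbname,
            # Debezium 1.x: logical server name, used as the Kafka topic prefix
            "database.server.name": name,
            # Comma-separated list of schema-qualified tables to capture
            "table.include.list": ",".join(tables),
        },
    }


# All values below are hypothetical placeholders.
payload = build_source_connector(
    name="moma",
    hostname="your-rds-endpoint.example.com",
    dbname="moma",
    user="your_user",
    password="your_password",
    tables=["public.artists", "public.artworks"],
)

print(json.dumps(payload, indent=2))
```

A payload of this shape is typically registered by sending it in a POST request to the Kafka Connect REST API's `/connectors` endpoint. In practice, credentials should come from a secrets store rather than being written into the file.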

Additional Reading

A full-length technical blog post on this topic is available: The Art of Building Open Data Lakes with Apache Hudi, Kafka, Hive, and Debezium.

The Art of Building Open Data Lakes with Apache Hudi, Kafka, Hive, and Debezium

Build near real-time, open-source data lakes on AWS using a combination of Apache Kafka, Hudi, Spark, Hive, and…

garystafford.medium.com

This blog represents my own viewpoints and not those of my employer, Amazon Web Services (AWS). All product names, logos, and brands are the property of their respective owners.