Working with Apache Avro files in Amazon S3

Working with binary-encoded Apache Avro files and Avro schemas using Apache Avro Tools

7 min read · Oct 11, 2021

Introduction

Amazon S3 Select, a feature of Amazon Simple Storage Service (Amazon S3), has been generally available since April 2018. According to the documentation, “with Amazon S3 Select, you can use simple structured query language (SQL) statements to filter the contents of an Amazon S3 object and retrieve just the subset of data that you need. By using Amazon S3 Select to filter this data, you can reduce the amount of data that Amazon S3 transfers, which reduces the cost and latency to retrieve this data.”

Currently, Amazon S3 Select file support is limited to Parquet, CSV (without quoted delimiters), and JSON objects (in LINES mode only). In addition, CSV and JSON objects must be uncompressed. In this post, we will learn how to preview another popular file format often stored in Amazon S3: Apache Avro™.

Apache Avro

Apache Avro™ describes itself as a data serialization system. Avro is a compact, fast, binary (non-human-readable) data format, similar to Apache Parquet, Apache Thrift, MongoDB’s BSON, and Google’s Protocol Buffers (protobuf). However, Avro is a row-based storage format, in contrast to columnar storage formats like Apache Parquet and Apache ORC. Avro relies on schemas. When Avro data is read, the schema used when writing it is always present. According to the documentation, schemas permit each datum to be written with no per-value overheads, making serialization fast and small. Schemas also facilitate use with dynamic scripting languages since data, together with its schema, is fully self-describing.
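As a rough illustration of the point that the writer's schema travels with the data, here is a minimal shell sketch. The Avro specification defines an object container file as starting with the four bytes `O`, `b`, `j`, 0x01, and stores the schema in the file header under the metadata key `avro.schema`. The function names and the 4096-byte search window below are my own assumptions for illustration, not part of Avro or Avro Tools, and passing these checks only suggests a file is an Avro container. It does not validate the file.

```shell
#!/usr/bin/env bash
# Sketch: quick sanity checks on a possible Avro object container file.
# Function names are hypothetical; they are not part of any Avro tooling.

# Print the first four bytes of a file as lowercase hex, with no spaces.
avro_magic_hex() {
  head -c 4 "$1" | od -An -tx1 | tr -d ' \n'
}

# Succeed (exit status 0) only if the file starts with 'O' 'b' 'j' 0x01,
# the magic bytes the Avro specification defines for container files.
is_avro_container() {
  [ "$(avro_magic_hex "$1")" = "4f626a01" ]
}

# Succeed if the header metadata key "avro.schema" appears near the start
# of the file. The 4096-byte window is an arbitrary assumption.
has_embedded_schema() {
  head -c 4096 "$1" | LC_ALL=C grep -a -q 'avro.schema'
}
```

After sourcing these functions, a check such as `is_avro_container some_file.avro && has_embedded_schema some_file.avro` (where `some_file.avro` stands in for any local file) gives a quick indication that the file carries its own schema, without needing any Avro libraries.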

Avro File Samples

I have provided a few sample Apache Avro files on GitHub. The files contain a list of artists whose work is part of the Museum of Modern Art (MoMA) collection, taken from a dataset available on Kaggle. I have also provided an alternative JSON-format Avro schema for use later in the post. Use the following commands to download the files:

mkdir avro_samples && cd "$_"

export BASE_URL="https://github.com/garystafford/kafka-connect-msk-demo/raw/main/avro_samples"

wget $BASE_URL/moma.public.artists%2B0%2B0000000000.avro
wget $BASE_URL/moma.public.artists%2B0%2B0000000300.avro
wget…

Written by Gary A. Stafford
