Working with Apache Avro files in Amazon S3

Working with binary-encoded Apache Avro files and Avro schemas using Apache Avro Tools

Gary A. Stafford
7 min read · Oct 11, 2021

Introduction

Amazon S3 Select, a feature of Amazon Simple Storage Service (Amazon S3), has been generally available since April 2018. According to the documentation, “with Amazon S3 Select, you can use simple structured query language (SQL) statements to filter the contents of an Amazon S3 object and retrieve just the subset of data that you need. By using Amazon S3 Select to filter this data, you can reduce the amount of data that Amazon S3 transfers, which reduces the cost and latency to retrieve this data.”
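As a rough illustration of the kind of SQL filtering described above, here is a minimal Python sketch. It assumes a boto3 S3 client created elsewhere (for example with `boto3.client("s3")`) and a newline-delimited JSON object. The helper names `build_select_request` and `select_rows` are hypothetical and exist only for this example. The keyword arguments mirror boto3's `select_object_content` call.

```python
import json


def build_select_request(bucket, key, sql):
    """Build keyword arguments for an S3 Select call on a JSON Lines object."""
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": sql,
        # Assumption: the object is uncompressed, newline-delimited JSON.
        "InputSerialization": {"JSON": {"Type": "LINES"}, "CompressionType": "NONE"},
        "OutputSerialization": {"JSON": {}},
    }


def select_rows(client, request):
    """Run the query and return the matching records as a list of dicts.

    `client` is expected to behave like a boto3 S3 client.
    """
    response = client.select_object_content(**request)
    chunks = []
    # The response payload is a stream of events; only "Records" events carry data.
    for event in response["Payload"]:
        if "Records" in event:
            chunks.append(event["Records"]["Payload"])
    # A record may be split across events, so join the chunks before parsing.
    text = b"".join(chunks).decode("utf-8")
    return [json.loads(line) for line in text.splitlines() if line.strip()]
```

With a real client, a call might look like `select_rows(client, build_select_request("my-bucket", "data/users.json", "SELECT s.id FROM S3Object s"))`, where the bucket, key, and column names are placeholders.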

Currently, Amazon S3 Select supports only CSV, JSON, and Apache Parquet objects. Its scan range requests are more restrictive still: they support Parquet, CSV (without quoted delimiters), and JSON objects (in LINES mode only), and the CSV and JSON objects must be uncompressed. In this post, we will learn how to preview another popular file format often stored in Amazon S3: Apache Avro™.

Apache Avro

Apache Avro™ describes itself as a data serialization system. Avro is a compact, fast, binary data format (not human-readable), comparable to Apache Parquet, Apache Thrift, MongoDB’s BSON, and Google’s Protocol Buffers (protobuf). Unlike columnar storage formats such as Apache Parquet and Apache ORC, however, Avro is a row-based storage format. Avro relies on schemas: when Avro data is read, the schema used to write it is always present. According to…
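To make the point about the writer's schema concrete, the sketch below reads only the header of an Avro object container file and returns the embedded schema. It is a minimal, standard-library-only illustration based on the container layout in the Avro specification (magic bytes, a metadata map, then a 16-byte sync marker). It is not how Apache Avro Tools is implemented, and the function names `read_long`, `read_bytes`, and `read_header` are hypothetical.

```python
import json

MAGIC = b"Obj\x01"  # first four bytes of every Avro object container file


def read_long(stream):
    """Decode one Avro long: a zigzag-encoded, variable-length integer."""
    shift = 0
    result = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("unexpected end of stream while reading a long")
        value = byte[0]
        result |= (value & 0x7F) << shift
        if not value & 0x80:  # high bit clear marks the last byte
            break
        shift += 7
    return (result >> 1) ^ -(result & 1)  # undo zigzag encoding


def read_bytes(stream):
    """Read a length-prefixed byte string (Avro 'bytes' and 'string' share this layout)."""
    length = read_long(stream)
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("unexpected end of stream while reading bytes")
    return data


def read_header(stream):
    """Return (schema, codec) from the header of an Avro container file."""
    if stream.read(4) != MAGIC:
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:  # a zero-length block ends the metadata map
            break
        if count < 0:  # negative count: a block byte size follows
            count = -count
            read_long(stream)
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    stream.read(16)  # sync marker; data blocks follow and are not read here
    schema = json.loads(meta["avro.schema"].decode("utf-8"))
    codec = meta.get("avro.codec", b"null").decode("utf-8")
    return schema, codec
```

Opening a file in binary mode and passing it to `read_header` (for example, a hypothetical local copy named `users.avro`) returns the schema as a Python dict along with the compression codec name, without decoding any records.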
