Avro

Read Apache Avro binary serialized format files

Category: file-formats

Description

## Overview

The Avro data source reads Apache Avro binary serialized format files. Avro is a self-describing format that embeds the schema (field names, types, and nullability) in the file header. DeltaForge reads the embedded schema automatically, so no type inference or sampling is needed.

Avro files are registered as external tables using the `CREATE EXTERNAL TABLE` statement with `USING AVRO`. DeltaForge supports Avro logical types including `date` and `timestamp-millis`, nullable union types, and files with mixed compression codecs (`null`, `deflate`). Multiple Avro files are read in parallel across reader threads.

## Usage

Create an external table pointing to one or more Avro files. The `LOCATION` can be a directory (reads all Avro files) or a single file path.

```sql
-- Create a zone and schema
CREATE ZONE IF NOT EXISTS analytics TYPE EXTERNAL;
CREATE SCHEMA IF NOT EXISTS analytics.ecommerce;

-- Register all Avro files in a directory
CREATE EXTERNAL TABLE IF NOT EXISTS analytics.ecommerce.all_orders
USING AVRO
LOCATION '/data/orders'
OPTIONS (
    file_metadata = '{"columns":["df_file_name","df_row_number"]}'
);
-- CREATE EXTERNAL TABLE auto-runs schema discovery from the embedded Avro header,
-- no separate DETECT SCHEMA needed

-- Grant access
GRANT ADMIN ON TABLE analytics.ecommerce.all_orders TO USER analyst;

-- Query the data
SELECT order_id, customer_name, order_date, total_amount
FROM analytics.ecommerce.all_orders;
```

Use `file_filter` to select a subset of files and `max_rows` for sampling:

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS analytics.ecommerce.recent_orders
USING AVRO
LOCATION '/data/orders'
OPTIONS (
    file_filter = '2024_*.avro',
    max_rows = '100'
);
```

## Schema Detection

Avro is a self-describing format. The schema is embedded in each file's header, containing field names, types, default values, and documentation. DeltaForge reads this schema directly.
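
One way to check what was picked up from the header is to look at the catalog once the table is registered. The query below is a sketch rather than a confirmed DeltaForge query: it assumes `information_schema.columns` exposes the SQL-standard columns `table_name`, `column_name`, `data_type`, `is_nullable`, and `ordinal_position`, and it reuses the `all_orders` table from the Usage section.

```sql
-- List the columns discovered from the embedded Avro schema.
-- The information_schema column names follow the SQL standard and are
-- assumed here, not confirmed for DeltaForge.
SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_name = 'all_orders'
ORDER BY ordinal_position;
```
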

`CREATE EXTERNAL TABLE` auto-runs schema discovery as part of registration, so column metadata appears in the catalog immediately and `information_schema.columns` is queryable right away. Re-run `DETECT SCHEMA FOR TABLE zone.schema.table` only when the underlying files change shape (new fields, schema evolution).

## Schema Evolution

When reading multiple Avro files with different schemas, DeltaForge unifies all field definitions into a single combined schema. Fields that exist in some files but not others appear as NULL for rows from files that lack those fields. Avro's native support for nullable unions (e.g., `["null", "string"]`) is handled transparently.

## Key Options

- **file_metadata**: Inject system columns (`df_file_name`, `df_row_number`) to track which file each row originated from.
- **file_filter**: Select a subset of Avro files from a directory using a glob pattern.
- **max_rows**: Limit the total number of rows read, useful for sampling large datasets.
- **parallelism**: Control the number of parallel reader threads for multi-file reads.
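
The options above might be combined in a single registration, as in the following sketch. The table name `orders_sample` is hypothetical, the `parallelism` value format is an assumption that mirrors how `max_rows` is written in the Usage section, and whether all four options can be set together is not confirmed by this page.

```sql
-- Hypothetical table reusing the /data/orders location from the Usage section.
CREATE EXTERNAL TABLE IF NOT EXISTS analytics.ecommerce.orders_sample
USING AVRO
LOCATION '/data/orders'
OPTIONS (
    file_metadata = '{"columns":["df_file_name","df_row_number"]}',
    file_filter = '2024_*.avro',
    max_rows = '100',
    parallelism = '4'  -- assumed value format, written like max_rows
);

-- Trace each sampled row back to the file and row position it came from.
-- Access grants are omitted here; see the GRANT statement in the Usage section.
SELECT df_file_name, df_row_number, order_id
FROM analytics.ecommerce.orders_sample
ORDER BY df_file_name, df_row_number;
```

Selecting `df_file_name` alongside the data columns is also a convenient way to see which files contribute NULLs for a field after schema unification.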
