Apache Iceberg

Read Apache Iceberg tables with time-travel support via snapshot selection

Category: file-formats

Description

## Overview

The Iceberg data source reads Apache Iceberg tables (v1 and v2 table format). Iceberg is a table format that manages collections of data files (typically Parquet) with rich metadata that supports snapshots, schema evolution, and partition pruning. The LOCATION points to the Iceberg table root directory, which contains `data/` and `metadata/` subdirectories.

Iceberg tables are registered as external tables using the `CREATE EXTERNAL TABLE` statement with `USING ICEBERG`. DeltaForge reads the Iceberg metadata to discover which data files belong to the current snapshot, then reads those files in parallel. It supports both copy-on-write and merge-on-read modes, including position deletes and equality deletes. The `snapshot_id` option enables time-travel queries against historical snapshots.

## Usage

Create an external table pointing to an Iceberg table root directory. No OPTIONS are required for basic usage, since Iceberg is fully self-describing.

```sql
-- Create a zone and schema
CREATE ZONE IF NOT EXISTS analytics TYPE EXTERNAL;
CREATE SCHEMA IF NOT EXISTS analytics.iceberg_demos;

-- Register an Iceberg table
CREATE EXTERNAL TABLE IF NOT EXISTS analytics.iceberg_demos.shipments
USING ICEBERG
LOCATION '/data/iceberg/shipments';
-- CREATE EXTERNAL TABLE auto-runs schema discovery from the Iceberg metadata JSON;
-- no separate DETECT SCHEMA is needed

-- Grant access
GRANT ADMIN ON TABLE analytics.iceberg_demos.shipments TO USER analyst;

-- Query the current snapshot
SELECT shipment_id, origin, destination, status
FROM analytics.iceberg_demos.shipments;
```

Time-travel to a specific snapshot:

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS analytics.iceberg_demos.shipments_historical
USING ICEBERG
LOCATION '/data/iceberg/shipments'
OPTIONS (
  snapshot_id = '3051729675574597004'
);
```

## Schema Detection

Iceberg is a self-describing table format. The schema (column names, types, nullability, and field IDs) is stored in the Iceberg metadata JSON files. DeltaForge reads this metadata directly, so no type inference or row sampling is needed.

`CREATE EXTERNAL TABLE` auto-runs schema discovery as part of registration, so column metadata appears in the catalog immediately and `information_schema.columns` is queryable right away. Re-run `DETECT SCHEMA FOR TABLE zone.schema.table` only when the Iceberg snapshot's schema changes (column addition, rename, or type promotion).

## Schema Evolution

Iceberg has native support for schema evolution. The table metadata tracks schema changes across snapshots, including column additions, removals, renames, and type promotions. DeltaForge reads the schema from the target snapshot's metadata. When querying the current snapshot, the latest schema is used. Columns added after earlier data files were written appear as NULL for rows from those older files.

## Key Options

- **snapshot_id**: Read a specific historical snapshot by ID for time-travel queries. When omitted, the latest snapshot is read.
- **row_group_filter**: Enable Parquet statistics-based filtering on the underlying data files (default true).
- **projection**: Read only specific columns to reduce I/O.
- **max_rows**: Limit the total number of rows read for sampling purposes.
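
To make the `snapshot_id` and schema evolution behavior more concrete, the following Python sketch mimics the idea on a hand-written, heavily trimmed stand-in for an Iceberg table metadata file. It is not DeltaForge's implementation. The key names (`current-snapshot-id`, `snapshots`, `schemas`, `schema-id`, field `id`) follow the Iceberg table spec, while the sample values and the helper functions `select_snapshot`, `schema_for`, and `read_row` are hypothetical and exist only for illustration.

```python
from typing import Any, Optional

# Trimmed stand-in for an Iceberg metadata JSON file (values are invented).
TABLE_METADATA: dict[str, Any] = {
    "format-version": 2,
    "current-snapshot-id": 200,
    "current-schema-id": 1,
    "schemas": [
        {"schema-id": 0, "fields": [
            {"id": 1, "name": "shipment_id", "required": True, "type": "long"},
            {"id": 2, "name": "status", "required": False, "type": "string"},
        ]},
        {"schema-id": 1, "fields": [
            {"id": 1, "name": "shipment_id", "required": True, "type": "long"},
            {"id": 2, "name": "status", "required": False, "type": "string"},
            # Column added later; older data files do not contain field ID 3.
            {"id": 3, "name": "destination", "required": False, "type": "string"},
        ]},
    ],
    "snapshots": [
        {"snapshot-id": 100, "schema-id": 0},
        {"snapshot-id": 200, "schema-id": 1},
    ],
}


def select_snapshot(metadata: dict[str, Any],
                    snapshot_id: Optional[str] = None) -> dict[str, Any]:
    """Return the requested snapshot, or the current one when no ID is given."""
    if snapshot_id is None:
        wanted = metadata["current-snapshot-id"]
    else:
        wanted = int(snapshot_id)  # the option value is passed as a string
    for snapshot in metadata["snapshots"]:
        if snapshot["snapshot-id"] == wanted:
            return snapshot
    raise KeyError(f"snapshot {wanted} not found in table metadata")


def schema_for(metadata: dict[str, Any],
               snapshot: dict[str, Any]) -> dict[str, Any]:
    """Look up the schema that the given snapshot refers to."""
    schema_id = snapshot.get("schema-id", metadata["current-schema-id"])
    for schema in metadata["schemas"]:
        if schema["schema-id"] == schema_id:
            return schema
    raise KeyError(f"schema {schema_id} not found in table metadata")


def read_row(schema: dict[str, Any],
             stored_by_field_id: dict[int, Any]) -> dict[str, Any]:
    """Project a stored row onto a schema by field ID; absent fields become None."""
    return {f["name"]: stored_by_field_id.get(f["id"]) for f in schema["fields"]}


if __name__ == "__main__":
    current = select_snapshot(TABLE_METADATA)
    # A row from an older data file that predates the 'destination' column.
    print(read_row(schema_for(TABLE_METADATA, current), {1: 7, 2: "DELIVERED"}))
```

Matching columns by field ID rather than by name is what lets renames and additions work without rewriting old data files: a field ID missing from an older file is simply read as NULL.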
