Parquet

Read Apache Parquet columnar format files

Category: file-formats

Description

## Overview

The Parquet data source reads Apache Parquet columnar format files. Parquet is a self-describing binary format that stores column types, names, and statistics directly in the file metadata. DeltaForge reads the embedded schema automatically, so no type inference or sampling is needed.

Parquet files are registered as external tables using the `CREATE EXTERNAL TABLE` statement with `USING PARQUET`. DeltaForge reads multiple Parquet files in parallel, distributing work across reader threads. The columnar layout enables efficient predicate pushdown and column pruning. Row group statistics allow the reader to skip entire row groups that do not match query filters, significantly reducing I/O for selective queries.

## Usage

Create an external table pointing to a directory of Parquet files or a single file. The `LOCATION` can be a directory path (reads all Parquet files in it) or a specific file path.

```sql
-- Create a zone and schema
CREATE ZONE IF NOT EXISTS analytics TYPE EXTERNAL;
CREATE SCHEMA IF NOT EXISTS analytics.parquet_flights;

-- Register all Parquet files in a directory
CREATE EXTERNAL TABLE IF NOT EXISTS analytics.parquet_flights.all_flights
USING PARQUET
LOCATION '/data/flights'
OPTIONS (
    file_metadata = '{"columns":["df_file_name","df_row_number"]}'
);
-- CREATE EXTERNAL TABLE auto-runs schema discovery from the embedded
-- Parquet footers, no separate DETECT SCHEMA needed

-- Grant access
GRANT ADMIN ON TABLE analytics.parquet_flights.all_flights TO USER analyst;

-- Query with file provenance
SELECT df_file_name, carrier, flight_number, departure_time
FROM analytics.parquet_flights.all_flights
WHERE carrier = 'AA';
```

Use `file_filter` to select a subset of files from a directory:

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS analytics.parquet_flights.domestic_only
USING PARQUET
LOCATION '/data/flights'
OPTIONS (
    file_filter = 'domestic_*.parquet'
);
```

## Schema Detection

Parquet is a self-describing format. Column names, types, and nullability are embedded in the file metadata. DeltaForge reads this metadata directly, so no row sampling is required.

`CREATE EXTERNAL TABLE` auto-runs schema discovery as part of registration, so column metadata appears in the catalog immediately and `information_schema.columns` is queryable right away. Re-run `DETECT SCHEMA FOR TABLE zone.schema.table` only when the underlying files change shape (new columns, schema evolution).

## Schema Evolution

When reading multiple Parquet files with different schemas, DeltaForge unifies all column schemas into a single combined schema. Columns that were added in later files appear as NULL for rows from earlier files that lack those columns. This enables incremental schema changes over time without rewriting historical data files.

## Key Options

- **file_metadata**: Inject system columns (`df_file_name`, `df_row_number`) to track which file each row originated from.
- **file_filter**: Select a subset of Parquet files from a directory using a glob pattern.
- **row_group_filter**: Enable statistics-based row group skipping for predicate pushdown (default true).
- **projection**: Read only specific columns to reduce I/O on wide tables.
- **max_rows**: Limit the total number of rows read for sampling purposes.
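
## Example: Checking the Detected Schema

The schema detection and schema evolution behavior described above can be checked with a few queries. This is a minimal sketch rather than a definitive recipe. It assumes that `information_schema.columns` exposes the SQL-standard columns (`table_schema`, `table_name`, `column_name`, `data_type`, `is_nullable`, `ordinal_position`), that the table was registered with `file_metadata` so that `df_file_name` is available, and that newer files add a column called `aircraft_type`, which is a hypothetical name used only for illustration.

```sql
-- Confirm the columns discovered at registration time.
-- Assumption: information_schema.columns uses SQL-standard column names.
SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_schema = 'parquet_flights'
  AND table_name = 'all_flights'
ORDER BY ordinal_position;

-- After newer files introduce a column, refresh the catalog entry.
DETECT SCHEMA FOR TABLE analytics.parquet_flights.all_flights;

-- aircraft_type is a hypothetical column present only in newer files.
-- Rows from older files should surface it as NULL, so grouping by
-- df_file_name shows which files lack the column.
SELECT df_file_name, COUNT(*) AS rows_without_aircraft_type
FROM analytics.parquet_flights.all_flights
WHERE aircraft_type IS NULL
GROUP BY df_file_name;
```

The last query also counts rows where a newer file stores a genuine NULL in `aircraft_type`, so treat the result as a hint about which files predate the column rather than as proof.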
