ORC

Read Apache ORC optimized columnar format files

Category: file-formats

Description

## Overview

The ORC data source reads Apache ORC (Optimized Row Columnar) format files. ORC is a self-describing columnar binary format that stores column types, names, and statistics in the file footer. DeltaForge reads the embedded schema automatically, so no type inference or sampling is needed.

ORC files are registered as external tables using the `CREATE EXTERNAL TABLE` statement with `USING ORC`. DeltaForge reads multiple ORC files in parallel, distributing work across reader threads. The `LOCATION` can point to a single file or a directory containing multiple ORC files.

## Usage

Create an external table pointing to one or more ORC files.

```sql
-- Create a zone and schema
CREATE ZONE IF NOT EXISTS analytics TYPE EXTERNAL;
CREATE SCHEMA IF NOT EXISTS analytics.orc_bank;

-- Register all ORC files in a directory
CREATE EXTERNAL TABLE IF NOT EXISTS analytics.orc_bank.all_transactions
USING ORC
LOCATION '/data/transactions'
OPTIONS (
    file_metadata = '{"columns":["df_file_name","df_row_number"]}'
);
-- CREATE EXTERNAL TABLE auto-runs schema discovery from the ORC footers, no separate DETECT SCHEMA needed

-- Grant access
GRANT ADMIN ON TABLE analytics.orc_bank.all_transactions TO USER analyst;

-- Query the data
SELECT transaction_id, account_number, amount, transaction_date
FROM analytics.orc_bank.all_transactions;
```

Point to a single ORC file:

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS analytics.orc_bank.january_transactions
USING ORC
LOCATION '/data/transactions/january_2024.orc';
```

## Schema Detection

ORC is a self-describing format. Column names, types, and statistics are stored in the file footer. DeltaForge reads this metadata directly, so no row sampling is required.

`CREATE EXTERNAL TABLE` auto-runs schema discovery as part of registration, so column metadata appears in the catalog immediately and `information_schema.columns` is queryable right away. Re-run `DETECT SCHEMA FOR TABLE zone.schema.table` only when the underlying files change shape (new columns, schema evolution).

## Schema Evolution

When reading multiple ORC files with different schemas, DeltaForge unifies all column definitions into a single combined schema. Columns that exist in some files but not others appear as NULL for rows from files that lack those columns. This supports incremental schema changes over time.

## Key Options

- **file_metadata**: Inject system columns (`df_file_name`, `df_row_number`) to track which file each row originated from.
- **file_filter**: Select a subset of ORC files from a directory using a glob pattern.
- **max_rows**: Limit the total number of rows read for sampling purposes.
- **parallelism**: Control the number of parallel reader threads for multi-file reads.
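
The `file_metadata` option and the schema evolution behavior described above can be combined to see which files predate a newly added column. The query below is a minimal sketch, not a documented recipe: it assumes the `analytics.orc_bank.all_transactions` table was registered with `df_file_name` enabled, and `merchant_category` is a hypothetical column assumed to exist only in newer files.

```sql
-- Sketch: per-file check of how many rows carry a value for a column.
-- merchant_category is a hypothetical column used for illustration only.
SELECT
    df_file_name,
    COUNT(*)                 AS total_rows,
    COUNT(merchant_category) AS rows_with_value  -- COUNT(col) skips NULLs
FROM analytics.orc_bank.all_transactions
GROUP BY df_file_name
ORDER BY df_file_name;
```

Files written before the column was introduced would be expected to report `rows_with_value = 0`, because rows from files that lack the column are read as NULL. A zero can also mean the column exists in that file but holds only NULL values, so treat the result as a hint rather than proof of the file's schema.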
