ORC Server Logs: Schema Evolution & Glob Filtering

Read 5 HTTP access-log ORC files with two schema versions and prove schema evolution, LOCATION glob filtering, and file_metadata traceability.

Category: orc

Syntax

-- ================================================================
-- SETUP
-- ================================================================
CREATE ZONE IF NOT EXISTS external TYPE EXTERNAL
    COMMENT 'External tables, demo datasets and file-backed data';

CREATE SCHEMA IF NOT EXISTS external.orc_demos
    COMMENT 'ORC-backed external tables';

-- Union all 5 files; the 2 v1 columns appear as NULL for web-* rows.
CREATE EXTERNAL TABLE IF NOT EXISTS external.orc_demos.all_requests
USING ORC
LOCATION '/data/server-logs'
OPTIONS (
    file_metadata = '{"columns":["df_file_name","df_row_number"]}'
);

-- LOCATION glob, scope to a single producer's files.
CREATE EXTERNAL TABLE IF NOT EXISTS external.orc_demos.api01_only
USING ORC
LOCATION '/data/server-logs/api-01*.orc'
OPTIONS (
    file_metadata = '{"columns":["df_file_name","df_row_number"]}'
);

-- ================================================================
-- QUERIES
-- ================================================================
-- Full scan, 2,500 rows across 5 files.
SELECT COUNT(*) FROM external.orc_demos.all_requests;

-- Per-file row counts via file_metadata.
SELECT df_file_name, COUNT(*) AS row_count
FROM external.orc_demos.all_requests
GROUP BY df_file_name
ORDER BY df_file_name;

-- Schema evolution: v1 columns are NULL for web-*, populated for api-*.
SELECT server_name,
       COUNT(*) AS total_rows,
       COUNT(request_body_bytes) AS body_bytes_non_null,
       COUNT(cache_hit)          AS cache_hit_non_null
FROM external.orc_demos.all_requests
GROUP BY server_name
ORDER BY server_name;

-- Glob-scoped table: 500 rows, all v2 columns populated.
SELECT COUNT(*)                         AS rows,
       COUNT(request_body_bytes)        AS body_bytes,
       COUNT(cache_hit)                 AS cache_hit
FROM external.orc_demos.api01_only;

-- Merged column count: 13 data + 2 metadata = 15.
SELECT COUNT(*) AS merged_columns
FROM information_schema.columns
WHERE table_schema = 'orc_demos' AND table_name = 'all_requests';

-- Status-code analytics across the merged table.
SELECT status_code,
       COUNT(*)                                   AS request_count,
       ROUND(AVG(CAST(response_time_ms AS DOUBLE)), 0) AS avg_response_ms
FROM external.orc_demos.all_requests
GROUP BY status_code
ORDER BY request_count DESC;

-- ================================================================
-- CLEANUP
-- ================================================================
DROP EXTERNAL TABLE IF EXISTS external.orc_demos.all_requests WITH FILES;
DROP EXTERNAL TABLE IF EXISTS external.orc_demos.api01_only  WITH FILES;
DROP SCHEMA IF EXISTS external.orc_demos;
DROP ZONE IF EXISTS external;

Description

## When to Use Use this pattern when you have ORC files written over time where the schema has been extended (new columns appended) and you need to query them together in a single external table. Common in long-lived Hive/Spark pipelines where access logs or event streams gain fields (cache_hit, request_body_bytes) as the producer evolves. Also shows how to carve out a single producer using a LOCATION glob rather than filtering at query time. ## What You Will Learn 1. How ORC's self-describing footers let DeltaForge union two physical schemas (v1 = 11 fields, v2 = 13 fields) into one logical table. 2. Why v1 rows return NULL for the two v2-only columns (request_body_bytes, cache_hit), missing-column NULL filling, not reader error. 3. Using a LOCATION glob (`/path/api-01*.orc`) to scope an external table to one producer without scanning then filtering. 4. Enabling `file_metadata` so every row exposes `df_file_name` and `df_row_number` for lineage and debugging. 5. Using information_schema.columns to verify the merged column count (13 data + 2 metadata = 15). ## Prerequisites - A zone of TYPE EXTERNAL (the demo creates `external`). - Five ORC files on an accessible path, web-0{1,2,3}_access.orc (schema v1, 500 rows each) and api-0{1,2}_access.orc (schema v2, 500 rows each). Total 2,500 rows. - No prior DETECT SCHEMA needed: CREATE EXTERNAL TABLE auto-detects from ORC footers.

Pitfalls

Open in interactive docs →   DeltaForge home →