Parquet Flight Delays: Airline Delay Analysis

Read three quarterly Parquet files with schema evolution (NULL-filled columns), plus file_filter and file_metadata for per-file drill-down.

Category: parquet

Syntax

-- ============================================================================
-- SETUP
-- ============================================================================
CREATE ZONE IF NOT EXISTS {{zone_name}} TYPE EXTERNAL
    COMMENT 'External tables, demo datasets and file-backed data';

CREATE SCHEMA IF NOT EXISTS {{zone_name}}.parquet_flights
    COMMENT 'Parquet-backed flight delay tables with schema evolution';

-- Table 1: union of all 3 quarterly files, NULL-filling added columns
CREATE EXTERNAL TABLE IF NOT EXISTS {{zone_name}}.parquet_flights.all_flights
USING PARQUET
LOCATION '{{data_path}}'
OPTIONS (
    file_metadata = '{"columns":["df_file_name","df_row_number"]}'
);

-- Table 2: Q1 only via file_filter glob
CREATE EXTERNAL TABLE IF NOT EXISTS {{zone_name}}.parquet_flights.q1_flights
USING PARQUET
LOCATION '{{data_path}}'
OPTIONS (
    file_filter   = '*q1*',
    file_metadata = '{"columns":["df_file_name"]}'
);

-- ============================================================================
-- QUERIES
-- ============================================================================
-- Full scan (120 rows across 3 files)
ASSERT ROW_COUNT = 120
SELECT * FROM {{zone_name}}.parquet_flights.all_flights;

-- Q1-only filter (40 rows)
ASSERT ROW_COUNT = 40
SELECT * FROM {{zone_name}}.parquet_flights.q1_flights;

-- Schema-evolution proof: NULL counts per source file
ASSERT ROW_COUNT = 3
SELECT df_file_name,
       COUNT(*) AS total_rows,
       COUNT(*) FILTER (WHERE delay_reason  IS NULL) AS null_delay_reason,
       COUNT(*) FILTER (WHERE carrier_code IS NULL) AS null_carrier_code
FROM {{zone_name}}.parquet_flights.all_flights
GROUP BY df_file_name
ORDER BY df_file_name;

-- Airline performance aggregation
ASSERT ROW_COUNT = 5
ASSERT VALUE avg_delay = 52.0 WHERE airline = 'Southwest Airlines'
SELECT airline,
       ROUND(AVG(delay_minutes), 1) AS avg_delay,
       COUNT(*) AS total_flights,
       COUNT(*) FILTER (WHERE status = 'On Time')   AS on_time,
       COUNT(*) FILTER (WHERE status = 'Delayed')   AS delayed,
       COUNT(*) FILTER (WHERE status = 'Cancelled') AS cancelled
FROM {{zone_name}}.parquet_flights.all_flights
GROUP BY airline
ORDER BY avg_delay DESC;

-- ============================================================================
-- CLEANUP
-- ============================================================================
DROP EXTERNAL TABLE IF EXISTS {{zone_name}}.parquet_flights.all_flights WITH FILES;
DROP EXTERNAL TABLE IF EXISTS {{zone_name}}.parquet_flights.q1_flights WITH FILES;
DROP SCHEMA         IF EXISTS {{zone_name}}.parquet_flights;
DROP ZONE           IF EXISTS {{zone_name}};

Description

## When to Use Use this demo when landing Parquet datasets that have evolved over time, columns added in later quarters, monthly dumps with new fields, or partner feeds whose schema drifts release to release. It shows how DeltaForge unifies heterogeneous Parquet files into one external table while preserving per-file traceability through system metadata columns. ## What You Will Learn 1. How CREATE EXTERNAL TABLE USING PARQUET auto-detects types from embedded Parquet metadata without a schema definition 2. How DeltaForge NULL-fills columns that are present in some files but missing from others (schema evolution across quarters) 3. How to use the `file_filter` OPTION with a glob pattern to restrict a table to a single file 4. How to attach `df_file_name` and `df_row_number` system columns via the `file_metadata` OPTION for provenance 5. How to write cross-cutting ASSERT VALUE checks that verify row counts, NULL counts, and per-file breakdowns ## Prerequisites - DeltaForge GUI running with a workspace open - The three demo files (`flights_2025_q1.parquet`, `flights_2025_q2.parquet`, `flights_2025_q3.parquet`) accessible at `{{data_path}}` - Permission to CREATE ZONE and CREATE SCHEMA in the target workspace

Pitfalls

See Also

Open in interactive docs →   DeltaForge home →