Sales Schema Evolution

Single external table reads five quarterly CSVs (2024-Q1 to 2025-Q1) whose columns evolve over time; missing columns surface as NULL and df_file_name traces every row to its source.

Category: csv

Syntax

-- DEMO: Sales Schema Evolution
-- Difficulty: beginner | Time: ~3 min
-- Requires: 5 CSV files (sales_2024_q1 .. sales_2025_q1)

-- ========================================================================
-- SETUP
-- ========================================================================

CREATE ZONE IF NOT EXISTS demo
    TYPE EXTERNAL
    COMMENT 'External tables -- demo datasets and file-backed data';

CREATE SCHEMA IF NOT EXISTS demo.csv_demos
    COMMENT 'CSV-backed external tables';

-- Single external table over 5 quarterly files via wildcard.
-- file_metadata injects df_file_name + df_row_number per row.
CREATE EXTERNAL TABLE IF NOT EXISTS demo.csv_demos.sales
USING CSV
LOCATION '~/delta-data/csv/sales-quickstart/sales*.csv'
OPTIONS (
    header = 'true',
    file_metadata = '{"columns":["df_file_name","df_file_modified","df_dataset","df_row_number"]}'
);

-- ========================================================================
-- QUERIES
-- ========================================================================

-- 1. Unified view across all 5 files (15 rows).
-- Columns not present in a given file come back as NULL.
SELECT *
FROM demo.csv_demos.sales
ORDER BY id;

-- 2. Revenue by product -- CSV strings must be CAST for arithmetic.
SELECT
    product_name,
    SUM(CAST(quantity AS INT)) AS total_quantity,
    ROUND(SUM(CAST(quantity AS INT) * CAST(unit_price AS DOUBLE)), 2) AS total_revenue
FROM demo.csv_demos.sales
GROUP BY product_name
ORDER BY total_revenue DESC;

-- 3. Sales rep performance. Q1-2024 rows have sales_rep IS NULL
-- because that column did not exist in the Q1 file.
SELECT
    sales_rep,
    COUNT(*) AS sale_count,
    ROUND(SUM(CAST(quantity AS INT) * CAST(unit_price AS DOUBLE)), 2) AS total_revenue
FROM demo.csv_demos.sales
GROUP BY sales_rep
ORDER BY total_revenue DESC;

-- 4. Quarterly revenue trend across the evolution timeline.
SELECT
    CAST(EXTRACT(YEAR FROM sale_date) AS INT) || '-Q' || CAST(EXTRACT(QUARTER FROM sale_date) AS INT) AS period,
    COUNT(*) AS sale_count,
    ROUND(SUM(CAST(quantity AS INT) * CAST(unit_price AS DOUBLE)), 2) AS total_revenue
FROM demo.csv_demos.sales
GROUP BY period
ORDER BY period;

-- 5. Region -> Territory handoff. region was retired in Q4-2024
-- and replaced by territory, so early rows have region and NULL
-- territory, later rows have territory and NULL region.
SELECT id, sale_date, region, territory
FROM demo.csv_demos.sales
ORDER BY id;

-- 6. File-level lineage via injected metadata columns.
SELECT id, product_name, df_file_name, df_row_number
FROM demo.csv_demos.sales
ORDER BY id;

-- VERIFY: grand totals + schema-evolution invariants in one row.
SELECT
    COUNT(*) AS row_count,
    ROUND(SUM(CAST(quantity AS INT) * CAST(unit_price AS DOUBLE)), 2) AS grand_total_revenue,
    MAX(CASE WHEN id = '1'  THEN sales_rep END) AS q1_sales_rep,   -- expect NULL
    MAX(CASE WHEN id = '1'  THEN territory END) AS q1_territory,   -- expect NULL
    MAX(CASE WHEN id = '10' THEN region    END) AS q4_region,      -- expect NULL
    MAX(CASE WHEN id = '13' THEN channel   END) AS q1_2025_channel -- expect NOT NULL
FROM demo.csv_demos.sales;

-- ========================================================================
-- CLEANUP
-- ========================================================================

DROP EXTERNAL TABLE IF EXISTS demo.csv_demos.sales WITH FILES;
DROP SCHEMA IF EXISTS demo.csv_demos;
DROP ZONE IF EXISTS demo;

Description

## When to Use Use this demo as your first introduction to CSV external tables in DeltaForge. It is the simplest real-world pattern: one `CREATE EXTERNAL TABLE ... USING CSV` over a glob of files whose schemas are not identical. Quarterly sales exports add `sales_rep`, `discount_pct`, `territory`, `channel` over time and retire `region` and `discount_pct`. DeltaForge unifies all five schemas into one logical table; columns that are missing from a particular file are returned as SQL NULL. The `file_metadata` option injects system columns (`df_file_name`, `df_row_number`) so each output row can be traced back to the exact file it came from, invaluable for audit and debugging of multi-file ingestion. ## What You Will Learn 1. How to define ONE external table over MANY CSV files using a wildcard `LOCATION` (`sales*.csv`). 2. How DeltaForge unifies divergent column sets across files and fills gaps with NULL. 3. How to enable per-row lineage columns via `OPTIONS (file_metadata = '{"columns":[...]}')`. 4. How to write queries that span an evolving schema safely (using `IS NULL` / `IS NOT NULL` to segment eras). 5. Why CSV values must be `CAST` before arithmetic, all CSV columns are strings at read time. ## Prerequisites Requires the five files in `demos/csv/sales-quickstart/data/` (`sales_2024_q1.csv` ... `sales_2025_q1.csv`, 3 rows each). No other demo data or tables need to exist first. Adjust the `data_path` variable to point at the directory where DeltaForge downloaded the files.

Parameters

Name	Type	Description
`data_path`		Directory containing the five quarterly sales CSV files.
`zone_name`		Zone where the csv_demos schema is created.

Pitfalls

Wildcards like 'sales*.csv' are greedy -- an unrelated file that matches the glob will be ingested silently and its columns merged into the unified schema. List the directory before running.
Columns added in later files (sales_rep, territory, channel) return NULL for rows from earlier files. Any aggregate over those columns must tolerate NULLs or filter with IS NOT NULL.
The file_metadata option takes a JSON string; double-quotes inside must be escaped. Invalid JSON is silently ignored and no metadata columns appear.
CSV values are parsed as strings. Forgetting CAST(quantity AS INT) or CAST(unit_price AS DOUBLE) produces string concatenation or comparison errors rather than arithmetic.
Files must share the same delimiter and header convention. A single file without a header row will push its first data row into the unified schema as a spurious 'header-looking' row.

Sales Schema Evolution

Syntax

Description

Parameters

Pitfalls

See Also