Single external table reads five quarterly CSVs (2024-Q1 to 2025-Q1) whose columns evolve over time; missing columns surface as NULL and df_file_name traces every row to its source.
-- DEMO: Sales Schema Evolution
-- Difficulty: beginner | Time: ~3 min
-- Requires: 5 CSV files (sales_2024_q1 .. sales_2025_q1)
-- ========================================================================
-- SETUP
-- ========================================================================
CREATE ZONE IF NOT EXISTS demo
TYPE EXTERNAL
COMMENT 'External tables -- demo datasets and file-backed data';
CREATE SCHEMA IF NOT EXISTS demo.csv_demos
COMMENT 'CSV-backed external tables';
-- Single external table over 5 quarterly files via wildcard.
-- file_metadata injects df_file_name + df_row_number per row.
CREATE EXTERNAL TABLE IF NOT EXISTS demo.csv_demos.sales
USING CSV
LOCATION '~/delta-data/csv/sales-quickstart/sales*.csv'
OPTIONS (
header = 'true',
file_metadata = '{"columns":["df_file_name","df_file_modified","df_dataset","df_row_number"]}'
);
-- ========================================================================
-- QUERIES
-- ========================================================================
-- 1. Unified view across all 5 files (15 rows).
-- Columns not present in a given file come back as NULL.
SELECT *
FROM demo.csv_demos.sales
ORDER BY id;
-- 2. Revenue by product -- CSV strings must be CAST for arithmetic.
SELECT
product_name,
SUM(CAST(quantity AS INT)) AS total_quantity,
ROUND(SUM(CAST(quantity AS INT) * CAST(unit_price AS DOUBLE)), 2) AS total_revenue
FROM demo.csv_demos.sales
GROUP BY product_name
ORDER BY total_revenue DESC;
-- 3. Sales rep performance. Q1-2024 rows have sales_rep IS NULL
-- because that column did not exist in the Q1 file.
SELECT
sales_rep,
COUNT(*) AS sale_count,
ROUND(SUM(CAST(quantity AS INT) * CAST(unit_price AS DOUBLE)), 2) AS total_revenue
FROM demo.csv_demos.sales
GROUP BY sales_rep
ORDER BY total_revenue DESC;
-- 4. Quarterly revenue trend across the evolution timeline.
SELECT
CAST(EXTRACT(YEAR FROM sale_date) AS INT) || '-Q' || CAST(EXTRACT(QUARTER FROM sale_date) AS INT) AS period,
COUNT(*) AS sale_count,
ROUND(SUM(CAST(quantity AS INT) * CAST(unit_price AS DOUBLE)), 2) AS total_revenue
FROM demo.csv_demos.sales
GROUP BY period
ORDER BY period;
-- 5. Region -> Territory handoff. region was retired in Q4-2024
-- and replaced by territory, so early rows have region and NULL
-- territory, later rows have territory and NULL region.
SELECT id, sale_date, region, territory
FROM demo.csv_demos.sales
ORDER BY id;
-- 6. File-level lineage via injected metadata columns.
SELECT id, product_name, df_file_name, df_row_number
FROM demo.csv_demos.sales
ORDER BY id;
-- VERIFY: grand totals + schema-evolution invariants in one row.
SELECT
COUNT(*) AS row_count,
ROUND(SUM(CAST(quantity AS INT) * CAST(unit_price AS DOUBLE)), 2) AS grand_total_revenue,
MAX(CASE WHEN id = '1' THEN sales_rep END) AS q1_sales_rep, -- expect NULL
MAX(CASE WHEN id = '1' THEN territory END) AS q1_territory, -- expect NULL
MAX(CASE WHEN id = '10' THEN region END) AS q4_region, -- expect NULL
MAX(CASE WHEN id = '13' THEN channel END) AS q1_2025_channel -- expect NOT NULL
FROM demo.csv_demos.sales;
-- ========================================================================
-- CLEANUP
-- ========================================================================
DROP EXTERNAL TABLE IF EXISTS demo.csv_demos.sales WITH FILES;
DROP SCHEMA IF EXISTS demo.csv_demos;
DROP ZONE IF EXISTS demo;
## When to Use Use this demo as your first introduction to CSV external tables in DeltaForge. It is the simplest real-world pattern: one `CREATE EXTERNAL TABLE ... USING CSV` over a glob of files whose schemas are not identical. Quarterly sales exports add `sales_rep`, `discount_pct`, `territory`, `channel` over time and retire `region` and `discount_pct`. DeltaForge unifies all five schemas into one logical table; columns that are missing from a particular file are returned as SQL NULL. The `file_metadata` option injects system columns (`df_file_name`, `df_row_number`) so each output row can be traced back to the exact file it came from, invaluable for audit and debugging of multi-file ingestion. ## What You Will Learn 1. How to define ONE external table over MANY CSV files using a wildcard `LOCATION` (`sales*.csv`). 2. How DeltaForge unifies divergent column sets across files and fills gaps with NULL. 3. How to enable per-row lineage columns via `OPTIONS (file_metadata = '{"columns":[...]}')`. 4. How to write queries that span an evolving schema safely (using `IS NULL` / `IS NOT NULL` to segment eras). 5. Why CSV values must be `CAST` before arithmetic, all CSV columns are strings at read time. ## Prerequisites Requires the five files in `demos/csv/sales-quickstart/data/` (`sales_2024_q1.csv` ... `sales_2025_q1.csv`, 3 rows each). No other demo data or tables need to exist first. Adjust the `data_path` variable to point at the directory where DeltaForge downloaded the files.
| Name | Type | Description |
|---|---|---|
data_path | Directory containing the five quarterly sales CSV files. | |
zone_name | Zone where the csv_demos schema is created. |