Three-branch veterinary clinic with 75 visits across branch-north/south/east subdirectories; demonstrates recursive scanning, file_filter wildcards to isolate one branch, and per-file max_rows sampling.
-- DEMO: Veterinary Clinic Patient Records
-- Difficulty: intermediate | Time: ~5 min
-- Requires: data/branch-{north,south,east}/visits.csv
-- ========================================================================
-- SETUP
-- ========================================================================
CREATE ZONE IF NOT EXISTS demo
TYPE EXTERNAL
COMMENT 'External tables -- demo datasets and file-backed data';
CREATE SCHEMA IF NOT EXISTS demo.csv_vet
COMMENT 'Veterinary clinic CSV-backed external tables';
-- 1. Recursive scan of all 3 branches -> 75 rows.
CREATE EXTERNAL TABLE IF NOT EXISTS demo.csv_vet.all_visits
USING CSV
LOCATION '~/delta-data/csv/vet/'
OPTIONS (
header = 'true',
recursive = 'true'
);
-- 2. Recursive scan + file_filter -> only the north branch (25 rows).
CREATE EXTERNAL TABLE IF NOT EXISTS demo.csv_vet.north_only
USING CSV
LOCATION '~/delta-data/csv/vet/'
OPTIONS (
header = 'true',
recursive = 'true',
file_filter = '*north*'
);
-- 3. Recursive scan + per-file cap -> 10 rows per file, 30 rows total.
CREATE EXTERNAL TABLE IF NOT EXISTS demo.csv_vet.sampled_visits
USING CSV
LOCATION '~/delta-data/csv/vet/'
OPTIONS (
header = 'true',
recursive = 'true',
max_rows = '10'
);
-- ========================================================================
-- QUERIES
-- ========================================================================
-- 1. Full recursive scan returns 75 rows; visit_id prefixes N-/S-/E-
-- prove that all three branch files were read.
SELECT
SUBSTRING(visit_id, 1, 1) AS branch,
COUNT(*) AS visits
FROM demo.csv_vet.all_visits
GROUP BY branch
ORDER BY branch;
-- 2. file_filter isolates the north branch only (25 rows).
SELECT visit_id, pet_name, species, owner_name
FROM demo.csv_vet.north_only
ORDER BY visit_id;
-- 3. max_rows='10' caps per file -> 3 files * 10 = 30 rows.
SELECT COUNT(*) AS sampled_rows
FROM demo.csv_vet.sampled_visits;
-- 4. Species breakdown across all branches.
SELECT species, COUNT(*) AS visit_count
FROM demo.csv_vet.all_visits
GROUP BY species
ORDER BY species;
-- 5. Treatment cost analysis by species -- CSV values must be CAST.
SELECT species,
ROUND(SUM(CAST(treatment_cost AS DOUBLE)), 2) AS species_sum,
ROUND(AVG(CAST(treatment_cost AS DOUBLE)), 2) AS species_avg
FROM demo.csv_vet.all_visits
GROUP BY species
ORDER BY species;
-- 6. NULL / empty handling for sparsely-populated fields.
SELECT
COUNT(*) FILTER (WHERE breed IS NULL OR breed = '') AS null_breed_count,
COUNT(*) FILTER (WHERE diagnosis IS NULL OR diagnosis = '') AS null_diag_count,
COUNT(*) FILTER (WHERE breed IS NOT NULL AND breed <> '') AS non_null_breed
FROM demo.csv_vet.all_visits;
-- VERIFY: grand totals across all 3 branches.
SELECT
COUNT(*) AS total_rows,
COUNT(DISTINCT species) AS distinct_species,
ROUND(SUM(CAST(treatment_cost AS DOUBLE)), 2) AS sum_treatment_cost,
ROUND(AVG(CAST(weight_kg AS DOUBLE)), 2) AS avg_weight_kg
FROM demo.csv_vet.all_visits;
-- ========================================================================
-- CLEANUP
-- ========================================================================
DROP EXTERNAL TABLE IF EXISTS demo.csv_vet.all_visits WITH FILES;
DROP EXTERNAL TABLE IF EXISTS demo.csv_vet.north_only WITH FILES;
DROP EXTERNAL TABLE IF EXISTS demo.csv_vet.sampled_visits WITH FILES;
DROP SCHEMA IF EXISTS demo.csv_vet;
DROP ZONE IF EXISTS demo;
## When to Use Use this demo when your CSV data is partitioned across subdirectories -- one per branch, region, tenant, or day -- and you want a single table that reads them all. It shows three complementary techniques on the same dataset: - **Recursive scanning** via `OPTIONS (recursive = 'true')` walks subdirectories and reads every matching file. - **File filtering** via `OPTIONS (file_filter = '*north*')` picks a subset by path pattern, so you can materialise one table per shard without moving files. - **Row sampling** via `OPTIONS (max_rows = '10')` caps rows per file -- useful for dev environments where a full scan is expensive. On top of that, the demo's queries exercise species aggregation, treatment-cost statistics, and NULL handling for sparsely-populated fields like `breed` and `diagnosis`. ## What You Will Learn 1. How `recursive = 'true'` expands a directory-level LOCATION into every matching file underneath. 2. How `file_filter` applies a glob to the recursive expansion so a single LOCATION can serve multiple tables with different slices. 3. How `max_rows` is enforced per file, not per table -- so a 3-file, max_rows=10 table returns 30 rows. 4. How to aggregate mixed-quality data with NULL / empty-string handling (`IS NULL OR = ''`). 5. That recursive scanning unifies heterogeneous files -- if one branch's schema drifts, the missing columns come back NULL. ## Prerequisites Requires the three branch subdirectories under `demos/csv/veterinary-clinic/data/` (`branch-north/visits.csv`, `branch-south/visits.csv`, `branch-east/visits.csv`, 25 rows each). All three share the same column layout; `data_path` must point at the parent `data/` directory, not at any one `visits.csv`.
| Name | Type | Description |
|---|---|---|
data_path | Parent directory containing the branch-north / branch-south / branch-east subdirectories. | |
zone_name | Zone where the csv_vet schema is created. |