Veterinary Clinic Patient Records

Three-branch veterinary clinic with 75 visits across branch-north/south/east subdirectories; demonstrates recursive scanning, file_filter wildcards to isolate one branch, and per-file max_rows sampling.

Category: csv

Syntax

-- DEMO: Veterinary Clinic Patient Records
-- Difficulty: intermediate | Time: ~5 min
-- Requires: data/branch-{north,south,east}/visits.csv

-- ========================================================================
-- SETUP
-- ========================================================================

CREATE ZONE IF NOT EXISTS demo
    TYPE EXTERNAL
    COMMENT 'External tables -- demo datasets and file-backed data';

CREATE SCHEMA IF NOT EXISTS demo.csv_vet
    COMMENT 'Veterinary clinic CSV-backed external tables';

-- 1. Recursive scan of all 3 branches -> 75 rows.
CREATE EXTERNAL TABLE IF NOT EXISTS demo.csv_vet.all_visits
USING CSV
LOCATION '~/delta-data/csv/vet/'
OPTIONS (
    header = 'true',
    recursive = 'true'
);

-- 2. Recursive scan + file_filter -> only the north branch (25 rows).
CREATE EXTERNAL TABLE IF NOT EXISTS demo.csv_vet.north_only
USING CSV
LOCATION '~/delta-data/csv/vet/'
OPTIONS (
    header = 'true',
    recursive = 'true',
    file_filter = '*north*'
);

-- 3. Recursive scan + per-file cap -> 10 rows per file, 30 rows total.
CREATE EXTERNAL TABLE IF NOT EXISTS demo.csv_vet.sampled_visits
USING CSV
LOCATION '~/delta-data/csv/vet/'
OPTIONS (
    header = 'true',
    recursive = 'true',
    max_rows = '10'
);

-- ========================================================================
-- QUERIES
-- ========================================================================

-- 1. Full recursive scan returns 75 rows; visit_id prefixes N-/S-/E-
-- prove that all three branch files were read.
SELECT
    SUBSTRING(visit_id, 1, 1) AS branch,
    COUNT(*) AS visits
FROM demo.csv_vet.all_visits
GROUP BY branch
ORDER BY branch;

-- 2. file_filter isolates the north branch only (25 rows).
SELECT visit_id, pet_name, species, owner_name
FROM demo.csv_vet.north_only
ORDER BY visit_id;

-- 3. max_rows='10' caps per file -> 3 files * 10 = 30 rows.
SELECT COUNT(*) AS sampled_rows
FROM demo.csv_vet.sampled_visits;

-- 4. Species breakdown across all branches.
SELECT species, COUNT(*) AS visit_count
FROM demo.csv_vet.all_visits
GROUP BY species
ORDER BY species;

-- 5. Treatment cost analysis by species -- CSV values must be CAST.
SELECT species,
       ROUND(SUM(CAST(treatment_cost AS DOUBLE)), 2) AS species_sum,
       ROUND(AVG(CAST(treatment_cost AS DOUBLE)), 2) AS species_avg
FROM demo.csv_vet.all_visits
GROUP BY species
ORDER BY species;

-- 6. NULL / empty handling for sparsely-populated fields.
SELECT
    COUNT(*) FILTER (WHERE breed     IS NULL OR breed     = '') AS null_breed_count,
    COUNT(*) FILTER (WHERE diagnosis IS NULL OR diagnosis = '') AS null_diag_count,
    COUNT(*) FILTER (WHERE breed     IS NOT NULL AND breed <> '') AS non_null_breed
FROM demo.csv_vet.all_visits;

-- VERIFY: grand totals across all 3 branches.
SELECT
    COUNT(*) AS total_rows,
    COUNT(DISTINCT species) AS distinct_species,
    ROUND(SUM(CAST(treatment_cost AS DOUBLE)), 2) AS sum_treatment_cost,
    ROUND(AVG(CAST(weight_kg       AS DOUBLE)), 2) AS avg_weight_kg
FROM demo.csv_vet.all_visits;

-- ========================================================================
-- CLEANUP
-- ========================================================================

DROP EXTERNAL TABLE IF EXISTS demo.csv_vet.all_visits      WITH FILES;
DROP EXTERNAL TABLE IF EXISTS demo.csv_vet.north_only       WITH FILES;
DROP EXTERNAL TABLE IF EXISTS demo.csv_vet.sampled_visits   WITH FILES;
DROP SCHEMA IF EXISTS demo.csv_vet;
DROP ZONE IF EXISTS demo;

Description

## When to Use Use this demo when your CSV data is partitioned across subdirectories -- one per branch, region, tenant, or day -- and you want a single table that reads them all. It shows three complementary techniques on the same dataset: - **Recursive scanning** via `OPTIONS (recursive = 'true')` walks subdirectories and reads every matching file. - **File filtering** via `OPTIONS (file_filter = '*north*')` picks a subset by path pattern, so you can materialise one table per shard without moving files. - **Row sampling** via `OPTIONS (max_rows = '10')` caps rows per file -- useful for dev environments where a full scan is expensive. On top of that, the demo's queries exercise species aggregation, treatment-cost statistics, and NULL handling for sparsely-populated fields like `breed` and `diagnosis`. ## What You Will Learn 1. How `recursive = 'true'` expands a directory-level LOCATION into every matching file underneath. 2. How `file_filter` applies a glob to the recursive expansion so a single LOCATION can serve multiple tables with different slices. 3. How `max_rows` is enforced per file, not per table -- so a 3-file, max_rows=10 table returns 30 rows. 4. How to aggregate mixed-quality data with NULL / empty-string handling (`IS NULL OR = ''`). 5. That recursive scanning unifies heterogeneous files -- if one branch's schema drifts, the missing columns come back NULL. ## Prerequisites Requires the three branch subdirectories under `demos/csv/veterinary-clinic/data/` (`branch-north/visits.csv`, `branch-south/visits.csv`, `branch-east/visits.csv`, 25 rows each). All three share the same column layout; `data_path` must point at the parent `data/` directory, not at any one `visits.csv`.

Parameters

Name	Type	Description
`data_path`		Parent directory containing the branch-north / branch-south / branch-east subdirectories.
`zone_name`		Zone where the csv_vet schema is created.

Pitfalls

Without recursive='true', a directory-style LOCATION only reads files immediately in the directory -- subdirectories are skipped silently and you will miss rows rather than see an error.
file_filter is a glob, not a regex. '*north*' matches any path segment containing 'north'; a file named 'notnorth.csv' in an unrelated directory would also match.
max_rows is enforced PER FILE, not per table. Recursive scans multiply the cap: max_rows='10' across 3 files gives 30 rows, not 10.
Empty CSV fields default to empty string '', not NULL. Use both (IS NULL OR col = '') to count missing values, as this demo does for breed and diagnosis.
All numeric columns read from CSV are strings. Forgetting CAST(treatment_cost AS DOUBLE) or CAST(weight_kg AS DOUBLE) produces string-ordering results or errors.
If one branch file has a different column order or extra columns, DeltaForge unifies the schema silently and pads missing columns with NULL -- great for resilience, but it can hide data-quality regressions. Periodically DESCRIBE the table to check the column list.

Veterinary Clinic Patient Records

Syntax

Description

Parameters

Pitfalls

See Also