Read JSON and newline-delimited JSON files with optional flattening
## Overview The JSON data source reads JSON and newline-delimited JSON (NDJSON / JSONL) files. It supports both single JSON arrays and NDJSON files where each line is an independent JSON object. The key capability is the `json_flatten_config` option, which controls how nested and hierarchical JSON structures are flattened into tabular columns. JSON files are registered as external tables using the `CREATE EXTERNAL TABLE` statement with `USING JSON`. DeltaForge reads multiple JSON files in parallel. For deeply nested JSON, the flatten configuration provides fine-grained control over which fields to extract, how to handle arrays, and how to name the resulting columns. ## Path Notation DeltaForge uses a JSONPath-style notation for `root_path`, `include_paths`, `explode_paths`, `exclude_paths`, and `json_paths`: | Notation | Meaning | |----------|---------| | `$` | The current row root (one JSON object per row) | | `$.field` | A top-level field of the row | | `$.parent.child` | A nested field, dotted through object keys | | `$.items` | An array; repeat handling decides how it becomes a column | | `$.items.name` | A field inside every element of the `items` array | | `$['Field With Spaces']` or `$.Field With Spaces` | A key that contains spaces or punctuation | Array element indexing (for example `$.items[0].name`) is not supported. Select the array path and use `explode_paths` or `default_array_handling` to control how the array is turned into rows or values. ## Path Extraction Examples Each example below pairs a real JSON document with the flatten configuration that extracts it. All examples are drawn from demos in `delta-forge-demos/demos/json/`. ### Example 1: Flat object with scalars (CRM customers) Source document `customers.json` (one element shown from a JSON array): ```json [ { "id": 1, "email": "isidro_von@hotmail.com", "first": "Torrey", "last": "Veum", "company": "Hilll, Mayert and Wolf", "created_at": "2014-12-25T04:06:27.981Z", "country": "Switzerland" } ] ``` Each array element becomes one row. Paths are simply `$.<field>`: | JSONPath | Column | Value | |----------|--------|-------| | `$.id` | `id` | `1` | | `$.email` | `email` | `isidro_von@hotmail.com` | | `$.first` | `first` | `Torrey` | | `$.last` | `last` | `Veum` | | `$.company` | `company` | `Hilll, Mayert and Wolf` | | `$.created_at` | `created_at` | `2014-12-25T04:06:27.981Z` | | `$.country` | `country` | `Switzerland` | Flatten config: ```json { "root_path": "$", "include_paths": [ "$.id", "$.email", "$.first", "$.last", "$.company", "$.created_at", "$.country" ], "max_depth": 1, "infer_types": true } ``` ### Example 2: Nested objects (real-estate listings) Source document `listings.json`: ```json [ { "id": "LST-001", "title": "Modern Downtown Loft", "type": "condo", "bedrooms": 2, "sqft": 1200, "location": { "address": { "street": "450 Market Street", "unit": "12B", "city": "San Francisco", "state": "CA", "zip": "94105" }, "geo": { "lat": 37.7749, "lng": -122.4194 }, "neighborhood": "SoMa", "walk_score": 95 }, "pricing": { "list_price": 895000, "price_per_sqft": 746, "hoa_monthly": 650, "mortgage_estimate": { "monthly_payment": 5230, "rate_pct": 6.75 } } } ] ``` Each nested object key becomes another segment in the JSONPath. Default column names join the segments with the configured `separator`: | JSONPath | Default column | |----------|----------------| | `$.id` | `id` | | `$.title` | `title` | | `$.bedrooms` | `bedrooms` | | `$.location.address.street` | `location_address_street` | | `$.location.address.city` | `location_address_city` | | `$.location.geo.lat` | `location_geo_lat` | | `$.location.geo.lng` | `location_geo_lng` | | `$.location.neighborhood` | `location_neighborhood` | | `$.pricing.list_price` | `pricing_list_price` | | `$.pricing.mortgage_estimate.monthly_payment` | `pricing_mortgage_estimate_monthly_payment` | ### Example 3: Array explosion with parent duplication (music catalog) Source document `catalog.json`: ```json [ { "id": 8, "name": "Warner 25 Anos", "sku": "ALBUM-8", "price": 1053, "vendor": { "id": 6, "name": "Antonio Carlos Jobim" }, "details": [ { "track_id": 76, "name": "Canta, Canta Mais", "milliseconds": 271856, "unit_price": 0.99 }, { "track_id": 75, "name": "O Boto", "milliseconds": 366837, "unit_price": 0.99 } ] } ] ``` Add the array path to `explode_paths` to emit one row per element. Album-level fields duplicate on every track row; `vendor` is a nested object that flattens into `vendor_id` and `vendor_name`. Flatten config: ```json { "root_path": "$", "explode_paths": ["$.details"], "include_paths": [ "$.id", "$.name", "$.sku", "$.price", "$.vendor.id", "$.vendor.name", "$.details.track_id", "$.details.name", "$.details.milliseconds", "$.details.unit_price" ], "column_mappings": { "$.id": "album_id", "$.name": "album_name", "$.vendor.id": "vendor_id", "$.vendor.name": "vendor_name", "$.details.track_id": "track_id", "$.details.name": "track_name", "$.details.milliseconds": "track_milliseconds", "$.details.unit_price": "track_unit_price" } } ``` The sample album produces two rows: | album_id | album_name | vendor_id | vendor_name | track_id | track_name | track_milliseconds | |----------|------------|-----------|-------------|----------|------------|--------------------| | 8 | Warner 25 Anos | 6 | Antonio Carlos Jobim | 76 | Canta, Canta Mais | 271856 | | 8 | Warner 25 Anos | 6 | Antonio Carlos Jobim | 75 | O Boto | 366837 | ### Example 4: Arrays without explosion (count, join, first, to_json) Same source as Example 3. To produce one row per album and represent `details` as a scalar summary instead of exploding it, drop `explode_paths` and pick a handler: ```json { "root_path": "$", "include_paths": ["$.id", "$.name", "$.details"], "default_array_handling": "count" } ``` Output: | id | name | details | |----|------|---------| | 8 | Warner 25 Anos | 18 | Valid values for `default_array_handling`: - `count` emits the array length as an integer - `to_json` serializes the array as a JSON string - `first` keeps only the first element - `explode` behaves like listing the path in `explode_paths` ### Example 5: Capture a subtree as a JSON blob (json_paths) Sometimes a nested object is too irregular (or too large) to flatten. `json_paths` captures a subtree as a single JSON string column. Source: ```json { "id": "LST-001", "title": "Modern Downtown Loft", "location": { "address": { "city": "San Francisco" }, "geo": { "lat": 37.77, "lng": -122.41 } }, "pricing": { "list_price": 895000, "tax_history": [{"year": 2023, "amount": 10740}] } } ``` Config that keeps `location` and `pricing` intact: ```json { "root_path": "$", "include_paths": ["$.id", "$.title", "$.location", "$.pricing"], "json_paths": ["$.location", "$.pricing"] } ``` Result (location and pricing are each one JSON string column): | id | title | location | pricing | |----|-------|----------|---------| | LST-001 | Modern Downtown Loft | `{"address":{"city":"San Francisco"},"geo":{"lat":37.77,"lng":-122.41}}` | `{"list_price":895000,"tax_history":[{"year":2023,"amount":10740}]}` | Use this pattern when downstream code needs the full nested structure (map rendering, schema-on-read, forwarding to an API). ### Example 6: Deep paths with spaces and punctuation (CIA World Factbook) Real-world JSON often contains object keys with spaces, parentheses, and other characters that are invalid in SQL identifiers. The JSONPath syntax accepts them directly; the column-name derivation step normalizes them. Source excerpt (`ke.json`): ```json { "Geography": { "Area": { "total ": { "text": "580,367 sq km" } }, "Climate": { "text": "varies from tropical along coast to arid in interior" } }, "People and Society": { "Population": { "total": { "text": "58,246,378" } }, "Languages": { "Languages": { "text": "English, Kiswahili, indigenous languages" } } } } ``` Flatten config: ```json { "root_path": "$", "include_paths": [ "$.Geography.Area.total .text", "$.Geography.Climate.text", "$.People and Society.Population.total.text", "$.People and Society.Languages.Languages.text" ], "column_mappings": { "$.Geography.Area.total .text": "area_total", "$.Geography.Climate.text": "climate", "$.People and Society.Population.total.text": "population_total", "$.People and Society.Languages.Languages.text": "languages" }, "max_depth": 5 } ``` Because the source keys contain spaces and punctuation, always use `column_mappings` to assign SQL-safe column names rather than relying on the auto-derived defaults. ### Example 7: Exclude a branch (exclude_paths) `exclude_paths` drops an entire subtree. It takes precedence over `include_paths` when the paths overlap. Use it to strip verbose or sensitive sections: ```json { "root_path": "$", "include_paths": ["$.Government", "$.Economy"], "exclude_paths": ["$.Introduction.Background"] } ``` ### Example 8: Rename columns (column_mappings) Default column names are derived from the JSONPath. `column_mappings` rewrites them one-for-one: ```json { "column_mappings": { "$.vendor.id": "vendor_id", "$.vendor.name": "vendor_name", "$.details.track_id": "track_id" } } ``` Paths that are not listed keep the auto-generated name. ## Column Name Generation When `column_mappings` is not supplied, DeltaForge derives a column name from the JSONPath: 1. Strip the leading `$.` 2. Replace each `.` with the configured `separator` (default `_`) 3. Lowercase the result 4. Replace characters that are invalid in a SQL identifier (including spaces, parentheses, slashes, and hyphens) using the same rule as `column_cleanup_pattern` | JSONPath | Generated column | |----------|------------------| | `$.id` | `id` | | `$.vendor.name` | `vendor_name` | | `$.location.address.city` | `location_address_city` | | `$.pricing.mortgage_estimate.monthly_payment` | `pricing_mortgage_estimate_monthly_payment` | | `$.People and Society.Population.total.text` | `people_and_society_population_total_text` | Whenever the source keys contain characters that are not valid in SQL identifiers, prefer an explicit `column_mappings` entry so the resulting names are predictable. ## NDJSON vs JSON Array - A JSON array file (`[{...}, {...}]`) is one document containing many rows. DeltaForge streams elements from the top-level array. - NDJSON / JSONL (one JSON object per line) is read line-by-line. Each line is an independent row. DeltaForge auto-detects the format. Force a specific mode with `newline_delimited = true` (NDJSON) or `newline_delimited = false` (JSON array). ## Usage ```sql CREATE ZONE IF NOT EXISTS analytics TYPE EXTERNAL; CREATE SCHEMA IF NOT EXISTS analytics.json_demos; CREATE EXTERNAL TABLE IF NOT EXISTS analytics.json_demos.album_tracks USING JSON LOCATION '/data/music' OPTIONS ( json_flatten_config = '{ "root_path": "$", "explode_paths": ["$.details"], "include_paths": [ "$.id", "$.name", "$.price", "$.vendor.id", "$.vendor.name", "$.details.track_id", "$.details.name", "$.details.unit_price" ], "column_mappings": { "$.vendor.id": "vendor_id", "$.vendor.name": "vendor_name", "$.details.track_id": "track_id", "$.details.name": "track_name", "$.details.unit_price": "track_unit_price" }, "separator": "_", "max_depth": 10, "infer_types": true }', file_metadata = '{"columns":["df_file_name","df_row_number"]}' ); -- CREATE EXTERNAL TABLE auto-runs schema discovery, no separate DETECT SCHEMA needed GRANT ADMIN ON TABLE analytics.json_demos.album_tracks TO USER analyst; SELECT vendor_name, track_name, track_unit_price FROM analytics.json_demos.album_tracks WHERE vendor_name = 'Antonio Carlos Jobim'; ``` ## Schema Detection JSON is not a self-describing tabular format. DeltaForge infers the schema by sampling the first N rows (controlled by `infer_schema_rows`, default 1000) and analyzing the flattened field structure. The `json_flatten_config` determines which fields are extracted and how they map to columns. CREATE EXTERNAL TABLE auto-runs schema discovery as part of registration, so column metadata appears in the catalog immediately and `information_schema.columns` is queryable right away. Re-run `DETECT SCHEMA FOR TABLE zone.schema.table` only when the underlying JSON structure changes (new fields, evolved schema, modified `json_flatten_config`). ## Schema Evolution When reading multiple JSON files with different field structures, DeltaForge unifies all discovered fields into a single combined schema. Fields that exist in some files but not others appear as NULL for rows from files that lack those fields. This allows the JSON structure to evolve over time without breaking existing queries. ## Key Options - `json_flatten_config`: Primary configuration. Sub-options: - `root_path`: JSONPath to the root of each row (typically `$`). - `include_paths`: Selective field extraction. - `exclude_paths`: Paths to drop from output (takes precedence over include). - `explode_paths`: Array paths that emit one row per element. - `json_paths`: Paths whose subtree is captured as a JSON string column. - `column_mappings`: Rename generated columns. - `default_array_handling`: `count`, `to_json`, `first`, `explode` (applies to arrays not listed in `explode_paths`). - `separator`: Delimiter for nested field names (default `_`). - `max_depth`: Maximum nesting depth to traverse. - `infer_types`: Parse numeric, boolean, and date values from JSON string fields. - `file_metadata`: Inject system columns (`df_file_name`, `df_row_number`) for provenance tracking. - `newline_delimited`: Force NDJSON mode, JSON array mode, or auto-detection (default).