Read and flatten XML documents using XPath row selection
## Overview The XML data source reads and flattens XML documents into tabular data using XPath-based row selection. The key capability is the `xml_flatten_config` option, which specifies an XPath expression (`row_xpath`) to identify repeating elements that become rows, along with path-based extraction rules for elements and attributes. XML files are registered as external tables using the `CREATE EXTERNAL TABLE` statement with `USING XML`. DeltaForge reads multiple XML files in parallel. The flatten configuration provides precise control over which elements and attributes to extract, how to handle namespaces, and how to map deeply nested paths to flat column names. ## Path Notation DeltaForge uses XPath-style absolute paths in `row_xpath`, `include_paths`, `explode_paths`, `exclude_paths`, and `xml_paths`. Path segments separate with `/`. Attributes use the `@` prefix. The `//` operator matches an element at any depth. | Notation | Meaning | |----------|---------| | `/catalog/book` | Absolute path from the document root | | `//book` | Any `book` element regardless of depth (typical for `row_xpath`) | | `/catalog/book/@id` | The `id` attribute on each `book` element | | `/catalog/book/author` | Text content of the `author` child | | `/orders/order/items/item` | The repeating `item` element (target of explode or count) | ## Path Extraction Examples The examples below walk through the translation from a source XML document to the DeltaForge flatten configuration and the resulting tabular columns. Each example is drawn from a demo in `delta-forge-demos/demos/xml/`. ### Example 1: Flat elements and attributes (books catalog) Source document `books.xml`: ```xml <catalog> <book id="bk101"> <author>Gambardella, Matthew</author> <title>XML Developer's Guide</title> <genre>Computer</genre> <price>44.95</price> <publish_date>2000-10-01</publish_date> </book> <book id="bk102"> <author>Ralls, Kim</author> <title>Midnight Rain</title> <genre>Fantasy</genre> <price>5.95</price> <publish_date>2000-12-16</publish_date> </book> </catalog> ``` Each `book` element becomes one row. The following paths cover every column: | XPath | Generated column | Example value | |-------|------------------|---------------| | `//book` | (row_xpath, not a column) | selects each `book` | | `/catalog/book/@id` | `catalog_book_attr_id` | `bk101` | | `/catalog/book/author` | `catalog_book_author` | `Gambardella, Matthew` | | `/catalog/book/title` | `catalog_book_title` | `XML Developer's Guide` | | `/catalog/book/genre` | `catalog_book_genre` | `Computer` | | `/catalog/book/price` | `catalog_book_price` | `44.95` | | `/catalog/book/publish_date` | `catalog_book_publish_date` | `2000-10-01` | Flatten config: ```json { "row_xpath": "//book", "include_paths": [ "/catalog/book/@id", "/catalog/book/author", "/catalog/book/title", "/catalog/book/genre", "/catalog/book/price", "/catalog/book/publish_date" ], "include_attributes": true } ``` ### Example 2: Deep nested paths and attributes (product specifications) Source document `products.xml`: ```xml <catalog> <product id="PRD-001" status="active"> <name>Industrial Sensor Module</name> <price currency="USD">249.99</price> <specifications> <weight unit="kg">0.85</weight> <dimensions> <length unit="mm">120</length> <width unit="mm">80</width> <height unit="mm">35</height> </dimensions> <voltage>3.3V DC</voltage> </specifications> </product> </catalog> ``` Every nesting level appears as another segment in the path. Attributes on any element are addressed with `@name`: | XPath | Generated column | |-------|------------------| | `/catalog/product/@id` | `catalog_product_attr_id` | | `/catalog/product/@status` | `catalog_product_attr_status` | | `/catalog/product/name` | `catalog_product_name` | | `/catalog/product/price` | `catalog_product_price` | | `/catalog/product/price/@currency` | `catalog_product_price_attr_currency` | | `/catalog/product/specifications/weight` | `catalog_product_specifications_weight` | | `/catalog/product/specifications/weight/@unit` | `catalog_product_specifications_weight_attr_unit` | | `/catalog/product/specifications/dimensions/length` | `catalog_product_specifications_dimensions_length` | | `/catalog/product/specifications/dimensions/length/@unit` | `catalog_product_specifications_dimensions_length_attr_unit` | | `/catalog/product/specifications/voltage` | `catalog_product_specifications_voltage` | ### Example 3: Array explosion (order line items) Source document `orders.xml`: ```xml <orders batch_id="B2025-001"> <order id="ORD-1001" status="shipped"> <customer> <name>Alice Johnson</name> <tier>gold</tier> </customer> <order_date>2025-06-14</order_date> <items> <item sku="WDG-100"> <product>Premium Widget</product> <quantity>2</quantity> <unit_price>29.99</unit_price> </item> <item sku="GDG-200"> <product>Gadget Pro</product> <quantity>1</quantity> <unit_price>79.99</unit_price> </item> </items> </order> </orders> ``` To produce one row per line item, add the repeating element path to `explode_paths`. Order-level fields duplicate on every item row. Flatten config: ```json { "row_xpath": "//order", "explode_paths": ["/orders/order/items/item"], "include_paths": [ "/orders/order/@id", "/orders/order/customer/name", "/orders/order/customer/tier", "/orders/order/items/item/@sku", "/orders/order/items/item/product", "/orders/order/items/item/quantity", "/orders/order/items/item/unit_price" ] } ``` Result (the sample order produces two rows): | orders_order_attr_id | orders_order_customer_name | orders_order_items_item_attr_sku | orders_order_items_item_product | orders_order_items_item_quantity | |----------------------|----------------------------|-----------------------------------|----------------------------------|----------------------------------| | ORD-1001 | Alice Johnson | WDG-100 | Premium Widget | 2 | | ORD-1001 | Alice Johnson | GDG-200 | Gadget Pro | 1 | ### Example 4: Repeats without explosion (count, join, first) Same source as Example 3. To produce one row per order without exploding items, reference the `item` path itself and set `default_repeat_handling`: ```json { "row_xpath": "//order", "include_paths": [ "/orders/order/@id", "/orders/order/customer/name", "/orders/order/items/item" ], "default_repeat_handling": "count" } ``` Output: | orders_order_attr_id | orders_order_customer_name | orders_order_items_item | |----------------------|----------------------------|--------------------------| | ORD-1001 | Alice Johnson | 2 | Valid values for `default_repeat_handling`: - `count` emits the repeat count as an integer - `join_comma` concatenates repeat values as a comma-separated string - `first` keeps only the first occurrence - `to_json` serializes the entire set as a JSON array ### Example 5: Capture a subtree as a blob (xml_paths) When a subtree is irregular across documents, keep it intact as a single column instead of flattening. `xml_paths` lists the paths to capture, and `nested_output_format` selects the serialization. Source XML: ```xml <product id="PRD-001"> <name>Industrial Sensor Module</name> <specifications> <weight unit="kg">0.85</weight> <dimensions> <length unit="mm">120</length> </dimensions> <certifications>CE, FCC, UL</certifications> </specifications> </product> ``` Config that serializes `specifications` as a JSON string: ```json { "row_xpath": "//product", "include_paths": [ "/catalog/product/@id", "/catalog/product/name", "/catalog/product/specifications" ], "xml_paths": ["/catalog/product/specifications"], "nested_output_format": "json" } ``` Result (one column holds the whole subtree): | catalog_product_attr_id | catalog_product_name | catalog_product_specifications | |-------------------------|----------------------|--------------------------------| | PRD-001 | Industrial Sensor Module | `{"weight":{"@unit":"kg","#text":"0.85"},"dimensions":{"length":{"@unit":"mm","#text":"120"}},"certifications":"CE, FCC, UL"}` | Set `"nested_output_format": "xml"` to keep the original XML fragment verbatim instead: | catalog_product_specifications | |--------------------------------| | `<specifications><weight unit="kg">0.85</weight>...</specifications>` | ### Example 6: Exclude a subtree (exclude_paths) Source XML contains an `internal_audit` block that must never leak into analytics: ```xml <order id="ORD-1001"> <customer>...</customer> <items>...</items> <internal_audit> <cost_center>CC-4400</cost_center> <margin_pct>0.42</margin_pct> </internal_audit> </order> ``` Config: ```json { "row_xpath": "//order", "exclude_paths": ["/orders/order/internal_audit"] } ``` No column derived from any path beneath `/orders/order/internal_audit` appears in the output, even if it matches an `include_paths` entry. ### Example 7: Renaming columns (column_mappings) Generated column names reflect the full path. `column_mappings` overrides them one-for-one: ```json { "column_mappings": { "/orders/order/@id": "order_id", "/orders/order/customer/name": "customer_name", "/orders/order/items/item/@sku": "sku", "/orders/order/items/item/variant/color": "item_color", "/orders/order/items/item/variant/size": "item_size" } } ``` Paths not listed in `column_mappings` keep the default generated name. ## Column Name Generation When `column_mappings` is not supplied, DeltaForge derives a column name from the path: 1. Drop the leading `/` 2. Replace each `/` with the configured `separator` (default `_`) 3. Replace the attribute marker `@` with `attr_` so `@id` becomes `attr_id` 4. Strip namespace prefixes when `strip_namespace_prefixes` is true 5. Lowercase the result and sanitise any character that is not valid in a SQL identifier | XPath | Generated column | |-------|------------------| | `/catalog/book/author` | `catalog_book_author` | | `/catalog/book/@id` | `catalog_book_attr_id` | | `/orders/order/items/item/variant/color` | `orders_order_items_item_variant_color` | | `/ns:product/ns:name` (with `strip_namespace_prefixes: true`) | `product_name` | ## Usage Create an external table pointing to XML files. Use `xml_flatten_config` to define row extraction and column mapping. ```sql CREATE ZONE IF NOT EXISTS analytics TYPE EXTERNAL; CREATE SCHEMA IF NOT EXISTS analytics.xml_demos; CREATE EXTERNAL TABLE IF NOT EXISTS analytics.xml_demos.books_evolved USING XML LOCATION '/data/books' OPTIONS ( xml_flatten_config = '{ "row_xpath": "//book", "include_paths": [ "/catalog/book/@id", "/catalog/book/author", "/catalog/book/title", "/catalog/book/genre", "/catalog/book/price" ], "column_mappings": { "/catalog/book/@id": "book_id", "/catalog/book/author": "author", "/catalog/book/title": "title" }, "include_attributes": true, "separator": "_", "max_depth": 10, "strip_namespace_prefixes": true }', file_metadata = '{"columns":["df_file_name","df_file_modified","df_dataset","df_row_number"]}' ); -- CREATE EXTERNAL TABLE auto-runs schema discovery, no separate DETECT SCHEMA needed GRANT ADMIN ON TABLE analytics.xml_demos.books_evolved TO USER analyst; SELECT book_id, author, title, catalog_book_price FROM analytics.xml_demos.books_evolved; ``` ## Schema Detection XML is not a self-describing tabular format. DeltaForge infers the schema by sampling rows (controlled by `infer_schema_rows`, default 1000) and analyzing the structure of elements matching the `row_xpath`. The `xml_flatten_config` determines which elements and attributes become columns. CREATE EXTERNAL TABLE auto-runs schema discovery as part of registration, so column metadata appears in the catalog immediately and `information_schema.columns` is queryable right away. Re-run `DETECT SCHEMA FOR TABLE zone.schema.table` only when the underlying XML structure changes (new elements, evolved schema, modified `xml_flatten_config`). ## Schema Evolution When reading multiple XML files with different element structures, DeltaForge unifies all discovered elements and attributes into a single combined schema. Elements that exist in some files but not others appear as NULL for rows from files that lack those elements. This supports evolving XML schemas over time. ## Key Options - `xml_flatten_config`: The primary configuration for XML flattening. Sub-options: - `row_xpath`: XPath expression identifying the repeating element that becomes a row (for example `//book`). - `include_paths`: List of XPath paths to extract as columns. - `exclude_paths`: Paths to drop from the output. - `explode_paths`: Repeating elements that emit one row each (cross-joined with the parent). - `xml_paths`: Paths whose subtree is captured as a single string column. - `nested_output_format`: `json` or `xml` serialization for captured subtrees. - `column_mappings`: Dictionary mapping XPath paths to custom column names. - `default_repeat_handling`: Behavior for repeating elements without an explicit explode path (`count`, `join_comma`, `first`, `to_json`). - `include_attributes`: Extract XML attributes as columns (default false). - `strip_namespace_prefixes`: Remove namespace prefixes from element names. - `separator`: Delimiter for nested element names (default `_`). - `max_depth`: Maximum nesting depth to traverse. - `file_metadata`: Inject system columns (`df_file_name`, `df_row_number`, `df_file_modified`, `df_dataset`) for provenance tracking.