Advanced JSON Extraction: On-Demand Segment Access via df_transaction_json

Extracts deeply-nested X12 segments ad hoc using SQL JSON functions, json_typeof, json_array_length, json_extract_path_text, jsonb_pretty, and the #>> path operator, against the df_transaction_json column, without pre-configuring materialized_paths.

Category: edi

Syntax

-- Envelope-only table; full transaction body lives in df_transaction_json
CREATE EXTERNAL TABLE IF NOT EXISTS external.edi_demos.json_extraction_messages
USING EDI
LOCATION '/path/to/edi/*.edi'
OPTIONS (
    edi_config = '{"ediFormat": "x12"}',
    file_metadata = '{"columns":["df_file_name","df_row_number"]}'
);

-- Query 1, structural probe: segment count per transaction
SELECT
    df_file_name,
    st_1 AS txn_type,
    json_typeof(df_transaction_json::json) AS root_type,
    json_array_length(df_transaction_json::json -> 'segments') AS segment_count
FROM external.edi_demos.json_extraction_messages
ORDER BY segment_count DESC;

-- Query 2, read the first segment's name and third element via path operators
SELECT
    df_file_name,
    df_transaction_json::json #>> '{segments,0,name}'         AS first_seg_name,
    df_transaction_json::json #>> '{segments,0,elements,2}'   AS first_seg_elem3,
    json_extract_path_text(df_transaction_json::json, 'segments', '1', 'name') AS second_seg_name
FROM external.edi_demos.json_extraction_messages
LIMIT 5;

-- Query 3, pretty-print one transaction for debugging
SELECT
    df_file_name,
    jsonb_pretty(df_transaction_json::jsonb) AS pretty
FROM external.edi_demos.json_extraction_messages
WHERE df_file_name = 'x12_850_purchase_order_a.edi'
LIMIT 1;

-- Query 4, count PO1 line items without materializing
WITH parsed AS (
    SELECT
        df_file_name,
        st_1,
        df_transaction_json::json AS j
    FROM external.edi_demos.json_extraction_messages
    WHERE st_1 = '850'
)
SELECT
    df_file_name,
    (
        SELECT COUNT(*)
        FROM json_array_elements(j -> 'segments') AS seg
        WHERE seg ->> 'name' = 'PO1'
    ) AS po1_line_count
FROM parsed
ORDER BY po1_line_count DESC;

Description

## When to Use Use this pattern when materialized_paths is the wrong tool: when you receive hundreds of heterogeneous X12 messages daily and cannot pre-declare every segment of interest, when a one-off investigation needs access to an obscure segment (SAC, TXI, HL, PID, MSG) that is not worth materializing permanently, or when segment occurrence counts vary so wildly that neither `indexed` nor a fixed `max_repeating_segments` fits. `df_transaction_json` is produced on every EDI external table by default and contains the complete parsed transaction as a JSON object, every segment, every element, every component, preserving occurrence order. Materialized_paths should remain your default for hot-path columns. Use JSON extraction when the query is exploratory, the segment is rare, or the shape varies across files. ## What You Will Learn 1. How df_transaction_json structures an X12 transaction as `{segments: [{name, elements: [...]}, ...]}` (same shape for EDIFACT/TRADACOMS, only element content differs) 2. How to use `json_typeof`, `json_array_length`, and the `#>>` path operator to probe the structure without pre-knowledge 3. How to cast STRING to JSON (`df_transaction_json::json`), required; raw JSON functions on a STRING return NULL 4. How to extract a named segment using a WHERE clause inside a LATERAL subquery over the segments array 5. How to count occurrences of a specific segment name (e.g. PO1 line items) without materializing them 6. When JSON extraction is slower than materialized_paths, every query reparses the JSON string, so high-fanout queries should graduate to materialized columns ## Prerequisites - Familiarity with the X12 or EDIFACT structure of the files you are querying - Familiarity with PostgreSQL-style JSON operators (`->`, `->>`, `#>`, `#>>`) and functions (`json_typeof`, `json_array_length`, `json_extract_path_text`) - A working external table that leaves `materialized_paths` unset (or minimally set) so df_transaction_json carries the full payload

Pitfalls

df_transaction_json is a STRING column, applying `->`, `->>`, or json_* functions without casting to `::json` / `::jsonb` silently returns NULL.
Path indices in `#>>` and `json_extract_path_text` are STRINGS, not integers: use `{segments,0,elements,2}` or `json_extract_path_text(j, 'segments', '0', 'elements', '2')`, writing the integer `0` fails.
Querying df_transaction_json reparses the full JSON for every row every time, it is convenient but slow. Once a JSON path is queried repeatedly in production, promote it to materialized_paths so the parser materializes it once at ingest.
Segment names inside df_transaction_json are UPPERCASE (`PO1`, `N1`, `SAC`, `TXI`), materialized_paths are lowercase (`po1_1`, `n1_2`). Do not mix conventions in a query.
Composite elements appear as nested arrays inside `elements`, e.g. an SV1_1 composite of `HC:99213` shows as `["HC", "99213"]` at that element position. Use `#>> '{segments,N,elements,M,0}'` to read the qualifier and `...,M,1` for the code.
An envelope carrying multiple ST..SE transactions yields multiple rows, each with its own df_transaction_json; the ISA/GS envelope segments are NOT included inside df_transaction_json, they are surfaced as ISA_* / GS_* columns only.
Some optional segments (e.g. DTM at the loop level) may be absent entirely, guard with `json_array_length(j -> 'segments' -> ...) > 0` before indexing, or the path operator returns NULL and downstream CAST fails.

Advanced JSON Extraction: On-Demand Segment Access via df_transaction_json

Syntax

Description

Pitfalls

See Also