Parse HL7 v2 healthcare messages (ADT, ORU, ORM, etc.) with segment-level access
## Overview HL7 v2 (Health Level Seven Version 2) is the most widely deployed healthcare messaging standard in the world, used by hospitals, laboratories, pharmacies, imaging centers, and payer systems for real-time clinical data exchange. The standard defines message types for patient administration (ADT), orders (ORM/OML), results (ORU), scheduling (SIU), master file updates (MFN), and other clinical workflows. HL7 v2 messages are pipe-delimited text with a hierarchical structure of segments, fields, components, and sub-components. DeltaForge reads HL7 v2 files using the `HL7` format handler. The parser detects HL7 v2 files by inspecting the first three characters for the MSH segment identifier. The field separator is auto-detected from the fourth character of the MSH segment (conventionally '|'), and the encoding characters (component '^', repetition '~', escape '\\', sub-component '&') are read from MSH-2. ## Usage HL7 v2 messages support two reading modes: a compact view with default MSH headers plus a full JSON column, and a materialized view that extracts specific segment fields as first-class columns. ### Compact view (default) ```sql CREATE EXTERNAL TABLE IF NOT EXISTS zone.hl7_demos.adt_messages USING HL7 LOCATION '{{data_path}}/*.hl7' OPTIONS ( file_metadata = '{"columns":["df_file_name","df_row_number"]}' ); ``` ### Materialized view (extract specific segments as columns) ```sql CREATE EXTERNAL TABLE IF NOT EXISTS zone.hl7_demos.adt_materialized USING HL7 LOCATION '{{data_path}}/*.hl7' OPTIONS ( hl7_config = '{ "materialized_paths": [ "pid_3", "pid_5", "pid_7", "pid_8", "pid_11", "pv1_2", "pv1_3", "pv1_7", "evn_1", "evn_2" ] }', file_metadata = '{"columns":["df_file_name","df_row_number"]}' ); ``` The materialized view extracts the specified segment.field positions as first-class columns alongside the default output. This provides direct SQL access to commonly queried fields such as patient identifiers, names, dates, and visit information without needing JSON functions. Query either table with standard SQL: ```sql SELECT MSH_9, pid_3, pid_5, df_message_json FROM zone.hl7_demos.adt_materialized WHERE MSH_9 LIKE 'ADT^A01%'; ``` ## Output Schema Each HL7 v2 message produces one row. The default output columns are: | Column | Description | |---|---| | MSH_1 through MSH_21 | Message header fields including field separator, encoding characters, sending/receiving application and facility, message timestamp, message type, control ID, processing ID, and version | | df_message_json | Full message content as a JSON string | | df_message_id | Unique deterministic hash identifying the message | When `hl7_config` includes `materialized_paths`, the specified segment fields (e.g., pid_3, pv1_2, evn_1) appear as additional columns alongside the default output. When `file_metadata` is configured, additional columns such as `df_file_name` and `df_row_number` are appended. ## Key Options - **hl7_config**: JSON string with `materialized_paths` array to extract specific segment.field positions as first-class columns. Each path follows the format `segment_field` in lowercase (e.g., `pid_3`, `pv1_2`, `evn_1`). - **file_metadata**: JSON string specifying which system columns to inject, e.g., `{"columns":["df_file_name","df_row_number"]}`. - **flatten_components**: Expands component-level fields into separate columns (e.g., PID_5_1 for family name, PID_5_2 for given name). ## Supported Versions and Message Types DeltaForge supports HL7 v2.3 through v2.6. Common message types include: - **ADT**: Patient administration (A01 Admit, A03 Discharge, A08 Update Patient Information) - **ORM**: General order messages (lab orders, radiology orders) - **ORU**: Observation results (lab results, diagnostic reports) - **SIU**: Scheduling information - **MFN**: Master file notifications HL7 v2 messages routinely contain protected health information (PHI) including patient names, dates of birth, medical record numbers, and clinical results. DeltaForge preserves all message control IDs, sending/receiving facility identifiers, and message timestamps to support audit trail requirements and message deduplication.