Protocol Buffers

Read Protocol Buffers binary format using an external .proto schema

Category: file-formats

Description

## Overview The Protocol Buffers data source reads binary Protobuf-encoded files using an external `.proto` schema. Unlike self-describing formats, Protobuf requires both a schema file (`schema_path`) and a fully-qualified message name (`message_name`) to decode the binary payload. The key capability is the `proto_flatten_config` option, which controls how nested messages and repeated fields map to tabular columns. Protobuf files are registered as external tables using the `CREATE EXTERNAL TABLE` statement with `USING PROTOBUF`. DeltaForge reads multiple Protobuf files in parallel. The flatten configuration provides control over row extraction from repeated fields, nested message flattening, enum decoding, timestamp formatting, and column renaming. ## Path Notation Field paths in `row_path`, `include_paths`, `explode_paths`, and `column_mappings` are dot-separated field names matching the `.proto` definition. Path segments use the proto field name exactly as declared. | Notation | Meaning | |----------|---------| | `people` | A repeated field on the top-level message (typical `row_path`) | | `people.name` | A scalar field on each element of `people` | | `people.phones` | A nested repeated field | | `people.phones.number` | A scalar inside every element of `people.phones` | | `shipments.packages.dimensions.length_cm` | A scalar reached through three nesting levels | The top-level message is the one named in `message_name`. Its repeated field (listed in `row_path`) generates rows. Unlike XML or JSON, Protobuf paths are always absolute in terms of field names and never include leading characters like `/` or `$`. ## Path Extraction Examples Each example pairs a real `.proto` schema with a flatten configuration drawn from demos in `delta-forge-demos/demos/protobuf/`. ### Example 1: Repeated message with enum and timestamp (address book) Source schema `addressbook.proto`: ```proto syntax = "proto3"; package tutorial; import "google/protobuf/timestamp.proto"; message Person { string name = 1; int32 id = 2; string email = 3; enum PhoneType { MOBILE = 0; HOME = 1; WORK = 2; } message PhoneNumber { string number = 1; PhoneType type = 2; } repeated PhoneNumber phones = 4; google.protobuf.Timestamp last_updated = 5; } message AddressBook { repeated Person people = 1; } ``` Sample decoded record (one of many in the `.pb` file): ```text AddressBook { people: [ Person { name: "Alice Chen" id: 1001 email: "alice@example.com" phones: [ PhoneNumber { number: "+1-555-0100", type: MOBILE } PhoneNumber { number: "+1-555-0101", type: WORK } ] last_updated: 2024-03-15T09:20:00Z } ] } ``` Config (one row per Person, phones joined into one string column): ```json { "row_path": "people", "include_paths": [ "people.name", "people.id", "people.email", "people.phones.number", "people.phones.type", "people.last_updated" ], "default_repeat_handling": "join_comma", "column_mappings": { "people.name": "contact_name", "people.id": "contact_id", "people.email": "email", "people.phones.number": "phone_numbers", "people.phones.type": "phone_types", "people.last_updated": "last_updated" }, "decode_enums": true, "timestamp_format": "iso8601" } ``` Field-path to column mapping: | Field path | Column | Decoded value | |------------|--------|---------------| | `people.name` | `contact_name` | `Alice Chen` | | `people.id` | `contact_id` | `1001` | | `people.email` | `email` | `alice@example.com` | | `people.phones.number` | `phone_numbers` | `+1-555-0100, +1-555-0101` | | `people.phones.type` | `phone_types` | `MOBILE, WORK` | | `people.last_updated` | `last_updated` | `2024-03-15T09:20:00Z` | Because `decode_enums` is true, `PhoneType` integers become the string names declared in the enum. Because `timestamp_format` is `iso8601`, `google.protobuf.Timestamp` serializes as an ISO 8601 string. ### Example 2: Explode a nested repeated field Same source schema. To produce one row per phone number (instead of joining), add the phones path to `explode_paths`: ```json { "row_path": "people", "explode_paths": ["people.phones"], "include_paths": [ "people.name", "people.id", "people.email", "people.phones.number", "people.phones.type" ], "column_mappings": { "people.phones.number": "phone_number", "people.phones.type": "phone_type" } } ``` Person-level fields duplicate on every phone row: | people_name | people_id | people_email | phone_number | phone_type | |-------------|-----------|--------------|--------------|------------| | Alice Chen | 1001 | alice@example.com | +1-555-0100 | MOBILE | | Alice Chen | 1001 | alice@example.com | +1-555-0101 | WORK | Persons with an empty `phones` list produce zero rows under `explode_paths`. Use `default_repeat_handling: "join_comma"` instead to keep them. ### Example 3: Three-level nesting (freight shipping) Source schema `freight.proto`: ```proto syntax = "proto3"; package freight; enum ShipmentStatus { UNSPECIFIED = 0; CREATED = 1; PICKED_UP = 2; IN_TRANSIT = 3; DELIVERED = 4; } enum PackageClass { PACKAGE_UNSPECIFIED = 0; STANDARD = 1; FRAGILE = 2; HAZMAT = 3; } message Dimensions { float length_cm = 1; float width_cm = 2; float height_cm = 3; } message Package { string package_id = 1; string description = 2; float weight_kg = 3; Dimensions dimensions = 4; PackageClass package_class = 5; bool requires_signature = 6; int64 declared_value_cents = 7; } message Shipment { string shipment_id = 1; string origin = 2; string destination = 3; ShipmentStatus status = 4; bool is_express = 5; int64 total_cost_cents = 6; repeated Package packages = 7; } message ShippingManifest { repeated Shipment shipments = 1; string carrier_name = 2; } ``` Config that explodes to one row per package and reaches through three nesting levels (`Shipment` -> `Package` -> `Dimensions`): ```json { "row_path": "shipments", "explode_paths": ["shipments.packages"], "include_paths": [ "shipments.shipment_id", "shipments.status", "shipments.is_express", "shipments.packages.package_id", "shipments.packages.description", "shipments.packages.weight_kg", "shipments.packages.dimensions.length_cm", "shipments.packages.dimensions.width_cm", "shipments.packages.dimensions.height_cm", "shipments.packages.package_class", "shipments.packages.requires_signature", "shipments.packages.declared_value_cents" ], "column_mappings": { "shipments.shipment_id": "shipment_id", "shipments.status": "shipment_status", "shipments.is_express": "is_express", "shipments.packages.package_id": "package_id", "shipments.packages.description": "description", "shipments.packages.weight_kg": "weight_kg", "shipments.packages.dimensions.length_cm": "length_cm", "shipments.packages.dimensions.width_cm": "width_cm", "shipments.packages.dimensions.height_cm": "height_cm", "shipments.packages.package_class": "package_class", "shipments.packages.requires_signature": "requires_signature", "shipments.packages.declared_value_cents": "declared_value_cents" }, "decode_enums": true } ``` Field-path to column summary: | Field path | Proto type | Column | Sample decoded value | |------------|-----------|--------|----------------------| | `shipments.shipment_id` | string | `shipment_id` | `SHIP-2024-001` | | `shipments.status` | enum ShipmentStatus | `shipment_status` | `IN_TRANSIT` | | `shipments.is_express` | bool | `is_express` | `true` | | `shipments.packages.package_id` | string | `package_id` | `PKG-001` | | `shipments.packages.weight_kg` | float | `weight_kg` | `2.35` | | `shipments.packages.dimensions.length_cm` | float | `length_cm` | `40.0` | | `shipments.packages.package_class` | enum PackageClass | `package_class` | `FRAGILE` | | `shipments.packages.requires_signature` | bool | `requires_signature` | `true` | | `shipments.packages.declared_value_cents` | int64 | `declared_value_cents` | `125000` | ### Example 4: Multiple repeated fields alongside each other A single Shipment contains both `packages` and `tracking` as repeated fields. Each set can drive its own table by exploding one at a time: ```json { "row_path": "shipments", "explode_paths": ["shipments.tracking"], "include_paths": [ "shipments.shipment_id", "shipments.origin", "shipments.destination", "shipments.tracking.event_time", "shipments.tracking.location", "shipments.tracking.description" ] } ``` Exploding `shipments.tracking` here leaves `shipments.packages` implicit. Unused repeated fields produce no columns. Each repeated path listed in `explode_paths` multiplies the row count; listing more than one simultaneously performs a cross join across the sets. ### Example 5: Timestamp formatting `google.protobuf.Timestamp` fields render according to `timestamp_format`: | Value | Output | |-------|--------| | `iso8601` | `2024-03-15T09:20:00Z` (string) | | `epoch_seconds` | `1710494400` (int64) | | `epoch_millis` | `1710494400000` (int64) | Pick the one that matches how downstream queries filter or join on the timestamp. ### Example 6: Sensor readings, capture the full nested message Source schema `sensor.proto`: ```proto syntax = "proto3"; package iot; import "google/protobuf/timestamp.proto"; message SensorReading { double value = 1; google.protobuf.Timestamp recorded_at = 2; string unit = 3; } message Sensor { string sensor_id = 1; string sensor_type = 2; string location = 3; repeated SensorReading readings = 4; } message SensorNetwork { repeated Sensor sensors = 1; string facility_name = 2; } ``` One row per sensor reading (explode to the innermost repeated field): ```json { "row_path": "sensors", "explode_paths": ["sensors.readings"], "include_paths": [ "sensors.sensor_id", "sensors.sensor_type", "sensors.location", "sensors.readings.value", "sensors.readings.unit", "sensors.readings.recorded_at" ], "column_mappings": { "sensors.sensor_id": "sensor_id", "sensors.sensor_type": "sensor_type", "sensors.location": "location", "sensors.readings.value": "reading_value", "sensors.readings.unit": "reading_unit", "sensors.readings.recorded_at":"recorded_at" }, "timestamp_format": "iso8601" } ``` Sensor-level fields (`sensor_id`, `sensor_type`, `location`) duplicate onto each reading row. A sensor with 1,000 readings produces 1,000 rows. ## Column Name Generation When `column_mappings` is not supplied, DeltaForge derives a column name from the field path: 1. Replace each `.` with the configured `separator` (default `_`) 2. Lowercase the result 3. Sanitise any character that is not valid in a SQL identifier | Field path | Generated column | |------------|------------------| | `people.name` | `people_name` | | `people.phones.number` | `people_phones_number` | | `shipments.packages.dimensions.length_cm` | `shipments_packages_dimensions_length_cm` | For Protobuf messages with many nesting levels, the default names become long and repetitive. An explicit `column_mappings` entry for each extracted field keeps the resulting schema readable. ## Enum Decoding When `decode_enums` is true (default), enum fields are emitted as the string name declared in the `.proto`. With `decode_enums: false`, enum fields are emitted as their integer tag value. Zero-valued enum entries (the proto3 default, typically named `UNSPECIFIED`) render the same way as any explicitly set zero value. There is no way to distinguish "unset" from `0` in proto3 semantics. ## Usage ```sql CREATE ZONE IF NOT EXISTS analytics TYPE EXTERNAL; CREATE SCHEMA IF NOT EXISTS analytics.protobuf_demos; CREATE EXTERNAL TABLE IF NOT EXISTS analytics.protobuf_demos.contacts USING PROTOBUF LOCATION '/data/contacts' OPTIONS ( schema_path = '/schemas/addressbook.proto', message_name = 'tutorial.AddressBook', proto_flatten_config = '{ "row_path": "people", "include_paths": [ "people.name", "people.id", "people.email", "people.phones.number", "people.phones.type", "people.last_updated" ], "default_repeat_handling": "join_comma", "column_mappings": { "people.name": "contact_name", "people.id": "contact_id", "people.email": "email", "people.phones.number": "phone_numbers", "people.phones.type": "phone_types", "people.last_updated": "last_updated" }, "decode_enums": true, "timestamp_format": "iso8601", "separator": "_", "max_depth": 10 }', file_metadata = '{"columns":["df_file_name","df_row_number"]}' ); -- CREATE EXTERNAL TABLE auto-runs schema discovery from the supplied `.proto`, no separate DETECT SCHEMA needed GRANT ADMIN ON TABLE analytics.protobuf_demos.contacts TO USER analyst; SELECT contact_name, email, phone_numbers FROM analytics.protobuf_demos.contacts; ``` ## Schema Detection Protobuf schema is defined externally in the `.proto` file rather than embedded in the data files. DeltaForge reads the `.proto` definition to determine field names, types, and nesting structure. The `proto_flatten_config` then controls how nested messages and repeated fields map to tabular columns. CREATE EXTERNAL TABLE auto-runs schema discovery as part of registration, so column metadata appears in the catalog immediately and `information_schema.columns` is queryable right away. Re-run `DETECT SCHEMA FOR TABLE zone.schema.table` only when the `.proto` definition or `proto_flatten_config` changes. ## Schema Evolution When reading multiple Protobuf files that may have been encoded against different versions of the `.proto`, DeltaForge uses the supplied `.proto` as the canonical schema. Fields present in the proto but absent in older payloads appear as their proto3 default values (empty string, `0`, `false`, empty repeated list). Unknown fields in the payload but not in the proto are silently ignored. ## Key Options - `schema_path`: Path to the `.proto` schema file (required). - `message_name`: Fully-qualified Protobuf message name to decode, for example `tutorial.AddressBook` (required). - `proto_flatten_config`: Primary configuration. Sub-options: - `row_path`: Field path to the repeated message that generates rows. - `include_paths`: Selective field extraction within each row. - `explode_paths`: Specific repeated fields to explode into separate rows. - `column_mappings`: Rename fields to custom column names. - `default_repeat_handling`: Handler for repeated fields not listed in `explode_paths` (`join_comma`, `first`, `count`). - `decode_enums`: Emit enum string names when true (default), integer tags when false. - `timestamp_format`: `iso8601`, `epoch_seconds`, or `epoch_millis`. - `separator`: Delimiter for generated column names (default `_`). - `max_depth`: Maximum nesting depth to traverse. - `file_metadata`: Inject system columns (`df_file_name`, `df_row_number`, `df_file_modified`, `df_dataset`) for provenance tracking.

Protocol Buffers

Description

See Also