DISCOVER

Auto-detects the file format at a zone-relative path from the file's actual bytes, synthesizes the matching OPTIONS, and registers the object as an external table. PRINT mode returns the generated DDL without running it.

Category: data-loadingDeltaForge extension

Syntax

DISCOVER <zone>.<schema>.<object> PATH '<relative path>'
  [WITH (KEY = value, ...)] [PRINT | EXECUTE]

Description

## Overview DISCOVER turns a file or directory under a zone's storage root into a queryable external table in a single statement. It reads the actual bytes at `<zone storage root>/<relative path>`, identifies the format (magic numbers for binary formats, text-shape detection and CSV dialect sniffing for text formats), synthesizes the OPTIONS the format needs (a json_flatten_config for JSON, an xml_flatten_config for XML, the sniffed dialect for CSV), and registers `zone.schema.object` as an external table. Compared to CREATE EXTERNAL TABLE, which requires you to name the format and supply its OPTIONS yourself, DISCOVER derives both from the data. Compared to DETECT SCHEMA, which infers columns for a table that is already registered, DISCOVER starts from a raw path and produces the registration itself. ## Modes - **EXECUTE** (the default when no mode keyword is given): performs the registration and returns a summary row with the columns object, location, format, confidence, action, and evidence, so the result of detection is visible immediately. - **PRINT**: returns the exact generated CREATE EXTERNAL TABLE statement (or REGISTER DELTA TABLE when the path is a Delta layout) WITHOUT running it. Use this to review or hand-edit the DDL before committing to it. ## Zone-relative paths and credentials PATH must be relative to the zone's storage root: absolute paths and '..' traversal are rejected, the same contract REGISTER DELTA TABLE applies to LOCATION. The zone supplies the storage root and the credentials, so the identical statement works whether the zone is backed by a local directory or by cloud object storage (ADLS, S3, GCS). ## Format detection pipeline Detection reads real bytes rather than trusting the file extension: binary formats (Parquet, ORC, Avro, Excel) are identified by their magic numbers; text formats are classified by shape (JSON vs NDJSON vs XML vs HL7 vs EDI vs FHIR); CSV goes through dialect sniffing to pick up the delimiter, quoting, and header convention. The synthesized OPTIONS reflect what was actually found. When detection is ambiguous or you want to force a specific handler, pass WITH (FORMAT = '<name>') with one of: csv, json, ndjson, xml, parquet, avro, orc, excel, hl7, edi, fhir. Two table layouts are recognized and routed to their native registration paths instead of CREATE EXTERNAL TABLE: - A directory containing `_delta_log/` registers via REGISTER DELTA TABLE semantics. - An Iceberg `metadata/` tree registers as an external table USING ICEBERG. Protobuf files are rejected with guidance: protobuf cannot be decoded without its schema, so the table must be created explicitly with CREATE EXTERNAL TABLE and an explicit .proto. ## WITH options - `FORMAT = '<name>'`: override auto-detection with one of the supported format names. - `FILE_METADATA = true`: the registered table carries the df_file_name and df_row_number provenance columns. - `RECURSIVE = true`: descend into subdirectories under the path. - Any other key = value pair passes through unchanged into the generated external table's OPTIONS, so format-specific tuning is available without abandoning DISCOVER for hand-written DDL. ## Safety gates - **Low-confidence verdicts.** EXECUTE refuses to register when content detection lands at low confidence (extension-only or inconclusive-content classification) and no FORMAT override is given. Run the statement with PRINT to review the verdict and the generated DDL, or pin the format with WITH (FORMAT = '<name>'). PRINT always proceeds regardless of confidence; medium and high confidence proceed in both modes. - **One format per path, schema union within it.** When PATH is a directory, up to 3 candidate data files are sampled (a head sample only for the extras) and must agree on the format (json and ndjson count as one, both register USING JSON). A file that classifies as a different format fails the command with each file's verdict listed: point PATH at a subdirectory that holds one format, or pin one with WITH (FORMAT = '<name>') to force it, in which case the disagreement is recorded as an evidence warning instead. Unreadable, empty, or unclassifiable siblings are warnings only. Different versions or shapes of the SAME format are expected: for JSON, NDJSON, and XML the discovered paths of every same-format sampled file are unioned into the synthesized flatten config, matching the engine's per-file schema evolution (schema union, NULL-filled missing columns), so a shape that only appears in a second file still lands in the registered config. - **Duplicate names.** EXECUTE errors when zone.schema.object is already registered in the catalog. Drop the existing object first (DROP EXTERNAL TABLE for an external table, DROP DELTA TABLE for a Delta table) or discover under a different name. - **Gzip.** No DeltaForge file connector decompresses gzip on read, so a gzip-compressed sample is rejected with guidance to decompress first; gzip files sitting next to readable data files are noted in the evidence column as files the registered table will not read. - **Empty inputs.** A zero-byte file and a directory with no data files (hidden, '_'-prefixed, and .crc entries are never sampled; subdirectories are probed up to MAX_DEPTH) both fail with explicit errors instead of registering an empty table. ## Compatibility DISCOVER is a DeltaForge extension. It composes the format-detection machinery with the external-table registration path so a raw landing file becomes a governed catalog object in one statement.

Parameters

NameTypeDescription
table_nameFully qualified three-part name 'zone.schema.object' under which the discovered table is registered. The zone supplies both the storage root the path is resolved against and the credentials used to read the files.
pathPath RELATIVE to the zone's storage root pointing at the file or directory to discover. Absolute paths and '..' traversal are rejected; the same contract as REGISTER DELTA TABLE's LOCATION. Leading './' segments, trailing slashes, and Windows backslash separators are accepted and normalized. Because the zone resolves the root, the same statement works for local zones and cloud zones (ADLS, S3, GCS) alike.
optionsDiscovery options as KEY = value pairs. FORMAT = '<name>' overrides auto-detection (csv, json, ndjson, xml, parquet, avro, orc, excel, hl7, edi, fhir). FILE_METADATA = true adds the df_file_name and df_row_number provenance columns. RECURSIVE = true descends into subdirectories. Any other key passes through unchanged to the generated external table's OPTIONS.
modePRINT returns the exact generated CREATE EXTERNAL TABLE (or REGISTER DELTA TABLE for a Delta layout) statement without running it. EXECUTE, also the default when the keyword is omitted, performs the registration and returns a summary row. EXECUTE additionally refuses low-confidence detections unless FORMAT is pinned; PRINT always proceeds so the verdict can be reviewed.

Examples

-- Minimal: detect the format and register the table in one statement
DISCOVER bronze.landing.events PATH 'incoming/events';
-- Override detection when you already know the format
DISCOVER bronze.landing.clicks PATH 'incoming/clicks' WITH (FORMAT = 'ndjson');
-- PRINT mode: inspect the generated DDL without registering anything
DISCOVER bronze.landing.orders PATH 'incoming/orders.xml' PRINT;
-- Cloud zone: the path is still relative, credentials come from the zone
DISCOVER lake.raw.sensor_readings PATH 'iot/2026/06/readings.parquet';
-- Add file provenance columns to the registered table
DISCOVER bronze.landing.shipments PATH 'edi/shipments' WITH (FILE_METADATA = true);

Pitfalls

See Also

Open in interactive docs →   DeltaForge home →