CREATE EXTERNAL TABLE

Creates an external table that references data files at a storage location, enabling SQL queries without copying data into a managed Delta table.

Category: data-loadingDeltaForge extension

Syntax

CREATE [OR REPLACE] EXTERNAL TABLE [IF NOT EXISTS] <table>
  LOCATION '<path>' [USING <format>]
  [(<col1> <type1>, ...)] [OPTIONS (<key>=<val>, ...)]

Description

## Overview CREATE EXTERNAL TABLE registers a read-only table backed by external data files at a specified storage location. The data remains at its original location and is not copied or converted. Queries against an external table read directly from the source files, with predicate pushdown and projection pushdown applied where the format supports it. External tables are the primary mechanism for querying data in non-Delta formats (CSV, JSON, Parquet, Avro, ORC, XML, Excel, HL7, EDI, FHIR, Protobuf) without first ingesting it into a managed Delta table. ## Schema Inference vs. Explicit Schema When no column definitions are provided, the engine reads a sample of files from the source location and infers the schema. The inferred schema includes column names and data types derived from the data. For CSV files, column names come from the header row (if header = true) or are auto-generated (column_0, column_1, ...). For JSON, column names are derived from object keys with configurable flattening rules. For binary columnar formats (Parquet, Avro, ORC), the embedded schema metadata is used directly. Specifying an explicit schema overrides inference entirely. This is recommended when: - Source files have no header row - Type inference produces incorrect types (e.g., numeric strings inferred as integers) - Only a subset of columns is needed ## External vs. Managed Tables An external table references data at a location you control. DeltaForge does not manage the lifecycle of the underlying files. DROP EXTERNAL TABLE (without WITH FILES) only removes the metadata registration; the source files remain untouched. A managed Delta table (created with CREATE DELTA TABLE) stores data in Delta format under the engine's control, with full transactional guarantees, schema evolution, time travel, and merge support. | Aspect | External Table | Managed Delta Table | |---------------------|--------------------------|----------------------------| | Data location | User-controlled | Engine-managed | | Format | Any supported format | Delta Lake only | | Transactions | No | Yes (ACID) | | Time travel | No | Yes | | Schema evolution | No | Yes | | Write support | Read-only | Full DML | ## Supported Formats | Format | USING Value | Schema Source | Predicate Pushdown | |-----------|-----------------|------------------------|--------------------| | CSV | CSV | Header row or explicit | Row-level filter | | JSON | JSON, NDJSON | Key inference + flatten| Row-level filter | | Parquet | PARQUET | Embedded metadata | Yes (row groups) | | Avro | AVRO | Embedded schema | Row-level filter | | ORC | ORC | Embedded metadata | Yes (stripes) | | XML | XML | Tag inference | Row-level filter | | Excel | EXCEL, XLSX, etc| Header row or explicit | Row-level filter | | HL7 | HL7, HL7V2 | Segment/field mapping | Row-level filter | | EDI | EDI, X12, etc | Segment mapping | Row-level filter | | FHIR | FHIR, FHIR4, etc| Resource schema | Row-level filter | | Protobuf | PROTOBUF, PB | Descriptor inference | Row-level filter | | Iceberg | ICEBERG | Iceberg metadata | Yes (manifests) | ## Cloud Storage For cloud paths (s3://, abfss://, gs://), the engine resolves object store credentials from the session's permission token. The object store is configured once during table creation and reused for all subsequent queries. If credentials are not available at creation time, the table is registered in catalog-only mode and becomes queryable once credentials are provided. ## Catalog Persistence When a catalog router is configured, external table metadata (name, location, format, options) is persisted to the control plane catalog. This allows the table to be rediscovered and re-registered automatically when a compute node starts or when the catalog is synced. ## Result Columns | Column | Type | Description | |--------------|--------|-------------| | column_name | STRING | Inferred or declared column name | | column_type | STRING | Arrow data type of the column | The result set lists all columns in the newly created external table.

Parameters

Name	Type	Description
`table`		Specifies the table name. Accepts unqualified names (e.g., events), two-part names (schema.events), or fully qualified three-part names (zone.schema.events). For three-part names, the zone and schema must already exist.
`location`		Specifies the storage path or URI pointing to the external data files. Supports local paths (file:///data/files/ or /data/files/), Amazon S3 (s3://bucket/prefix/), Azure Blob Storage and ADLS Gen2 (abfss://container@account.dfs.core.windows.net/path/), and Google Cloud Storage (gs://bucket/prefix/). Tilde expansion is supported for local paths (~/data/). For directories, all matching files are included.
`format`		Specifies the file format of the data at the source location, provided via the USING or FORMAT keyword. Accepted values: CSV, JSON, NDJSON, PARQUET, AVRO, ORC, XML, EXCEL (also XLS, XLSX, XLSM, XLSB, ODS), HL7, EDI (also X12, EDIFACT, TRADACOMS), FHIR, PROTOBUF (also PROTO, PB), ICEBERG. When omitted, the format is inferred from the file extension. If the extension is ambiguous, CSV is assumed.
`schema`		Specifies an explicit schema as a parenthesized, comma-separated list of column name and data type pairs. Example: (id BIGINT, name STRING, created_at TIMESTAMP). When omitted, the engine reads sample data from the source location and infers the schema automatically. Explicit schemas are useful when source files lack headers or when type inference produces undesirable types.
`options`		Specifies format-specific and provider options as a parenthesized list of key = value pairs. Common options include: json_flatten_config (JSON flattening rules), file_metadata (system column injection), recursive (scan subdirectories), file_filter (glob pattern for file selection), batch_size (rows per read batch), max_rows (limit total rows), preserve_column_names (prevent name normalization). For CSV: header, delimiter, quote, escape, comment, null_value, skip_rows, trim_whitespace, encoding, infer_schema_rows. Values must be single-quoted strings.

Examples

-- Create an external table over CSV files on S3
CREATE EXTERNAL TABLE bronze.raw.orders
  USING CSV
  LOCATION 's3://data-lake/orders/'
  OPTIONS (header = 'true', delimiter = '|');

-- Create an external table with explicit schema
CREATE EXTERNAL TABLE staging.transactions (
    txn_id BIGINT,
    amount DECIMAL(10,2),
    currency STRING,
    txn_date TIMESTAMP
  )
  USING PARQUET
  LOCATION '/data/parquet/transactions/';

-- Create a JSON external table with flattening configuration
CREATE EXTERNAL TABLE IF NOT EXISTS bronze.raw.customers
  USING JSON
  LOCATION '/data/json/customers/'
  OPTIONS (
    json_flatten_config = '{
      "root_path": "$",
      "include_paths": ["$.id", "$.email", "$.name"],
      "max_depth": 2,
      "separator": "_",
      "infer_types": true
    }',
    file_metadata = '{"columns":["df_file_name","df_row_number"]}'
  );

-- Create an external table over ORC files with system columns
CREATE EXTERNAL TABLE IF NOT EXISTS warehouse.orc_data.inventory
  USING ORC
  LOCATION 'abfss://raw@storageacct.dfs.core.windows.net/inventory/'
  OPTIONS (
    file_metadata = '{"columns":["df_file_name","df_row_number"]}'
  );

-- Replace an existing external table with a new location
CREATE OR REPLACE EXTERNAL TABLE analytics.web.clickstream
  USING JSON
  LOCATION 'gs://analytics-bucket/clickstream/2024/';

-- Create an external table over XML files
CREATE EXTERNAL TABLE IF NOT EXISTS bronze.raw.product_catalog
  USING XML
  LOCATION '/data/xml/products/'
  OPTIONS (
    recursive = 'true',
    file_filter = '*.xml'
  );

Pitfalls

When no USING clause is provided and file extensions are absent or ambiguous, the format defaults to CSV. This can cause silent schema misinterpretation when pointing at JSON or binary files.
Schema inference requires reading sample data from the source location. For large directories or slow cloud storage, this adds latency to the CREATE statement. Providing an explicit schema avoids this cost.
External tables are read-only. INSERT, UPDATE, DELETE, and MERGE operations are not supported. To write data, first load it into a managed Delta table with COPY INTO or CREATE TABLE AS SELECT.
OR REPLACE drops and re-registers the table in the DataFusion session. Any cached query plans referencing the old table are invalidated. Active queries may fail if the table is replaced mid-execution.
OPTIONS values must be single-quoted strings, even for boolean and numeric values. For example: header = 'true', not header = true.
CREATE EXTERNAL TABLE resolves the LOCATION at creation time: it lists the files there and, when no explicit column list is given, reads a sample to infer the schema. The path must therefore exist and be readable when the statement runs; an absent local or accessible cloud path fails with 'Path does not exist'. When the data is produced by an upstream step such as INVOKE API ENDPOINT, create the external table AFTER that step has landed files, not before. An explicit column list skips inference but not the file listing, so it does not lift the existence requirement. The one deferral is a cloud path whose credentials are not yet available, which registers in catalog-only mode (see Cloud Storage).