SHOW STATS

Runs a SQL statement under instrumentation and returns a per-metric breakdown of what the planner intended (PLAN view) or what the runtime actually consumed (ACTUAL view). Inline form: every invocation captures its own SQL, no per-session ring. Zero overhead when not invoked.

Category: systemDeltaForge extension

Syntax

SHOW STATS (PLAN | ACTUAL) <sql_statement>

Description

## Overview SHOW STATS is the engine's observability surface for query execution. Each invocation runs an inner SQL statement under instrumentation, captures the metrics for that statement only, and returns them as a long-form result set. The two views (PLAN and ACTUAL) are deliberately separated so a benchmark consumer cannot accidentally cite optimizer estimates as if they were measured numbers. The inline form means SHOW STATS is self-contained: no session history ring, no "give me the last query's stats" affordance. Every call carries its own SQL and its own measurements. A benchmark report typically runs the bare query (for wall-clock measurement), then runs the same SQL under `SHOW STATS ACTUAL` to capture the breakdown, and reports the two numbers side-by-side. ## Result Schema (long-form) Both views return the same four-column shape: | Column | Type | Description | |---|---|---| | `category` | utf8 (not null) | Metric group: `query`, `time`, `files`, `row_groups`, `columns`, `pages`, `bytes`, `rows`, `cache`, `memory`. | | `metric` | utf8 (not null) | Metric name. | | `value` | utf8 (nullable) | Decimal-string for numeric metrics, raw string for identity metrics (`statement_kind`, `sql_preview`), NULL when the metric is wired but not populated on this codepath. | | `unit` | utf8 (not null) | One of `str`, `epoch_ms`, `ms`, `ns`, `count`, `bytes`, `rows`. | ### NULL vs zero (load-bearing for benchmarks) - `value = NULL` means the metric is wired but not populated for this query. Examples: `control_plane_validation_ms` is NULL when running in-process (no control plane in path); `bytes_from_page_cache` is NULL when the scan goes through the DataFusion + parquet-rs path that does not consult the morsel-native page cache. - `value = "0"` means the metric is wired and the runtime observed zero. Example: `files_pruned_partition = "0"` on a non-partitioned table. A benchmark publication that cites a NULL metric is making a claim the engine has not actually measured. Cite "0" freely; treat NULL as "not measured here" and disclose accordingly. ## Category ordering The metric rows are emitted in a fixed sequence so the breakdown reads top-down from "what did I run" to "how much time / IO / memory did it cost": 1. `query` (statement identity) 2. `time` (both views; phases in chronological order, totals last: `total_engine_time_ms` then `total_time_ms`) 3. `files` (PLAN: pruned counts + planned; ACTUAL: touched) 4. `row_groups` (PLAN: pruned counts + planned; ACTUAL: touched) 5. `columns` (PLAN: total/projected/chunks_planned; ACTUAL: chunks_fetched) 6. `pages` (PLAN only) 7. `bytes` (PLAN: planned compressed + uncompressed; ACTUAL: read_disk, decoded, written, cache contributions) 8. `rows` (PLAN: rows in planned row groups; ACTUAL: consumed, returned, affected, filtered) 9. `cache` (ACTUAL only; DIM cache lifetime counters + per-query page/metadata cache deltas) 10. `memory` (ACTUAL only; peak per-batch Arrow memory) The `time` category appears in BOTH views with the same metric set. SHOW STATS PLAN runs the inner SQL to populate the planner-side counters (otherwise `total_row_groups`, `row_groups_pruned_*`, and the byte budgets would be empty), so the timings are valid measurements there too. Time metrics are listed first in both views because they are the primary cross-engine comparison surface (DuckDB, Spark, etc.). ## Metric catalog: PLAN view (21 metrics) All counts and bytes come from walking the physical plan's `DeltaScanExec` nodes and aggregating their `ScanRuntimeStats` after `df.create_physical_plan().await` returns. The walk happens before the inner statement streams, so PLAN numbers reflect the optimizer's pre-fetch decision. ### query | Metric | Source | Description | |---|---|---| | `statement_kind` | First keyword of the inner SQL, lowercased (`select`, `insert`, ...). | Statement classification, used as a hint for downstream tooling. | | `sql_preview` | First ~80 chars of the inner SQL after stripping leading whitespace / comments. | Confirms the metrics record applies to the SQL the operator thinks they ran. | | `submitted_at_unix_ms` | `SystemTime::now()` at the moment `execute_show_stats` started. | Wall-clock submission timestamp; NULL if the timestamp could not be sampled. | ### files | Metric | Source | Description | |---|---|---| | `total_files` | `DeltaScanExec::get_scan_stats().total_files`. | Files in the table version the planner saw, before any pruning. | | `files_pruned_partition` | `get_scan_stats().files_skipped_by_partition`. | Files eliminated by partition-value pruning (file-level, before any metadata fetch). | | `files_pruned_manifest` | Drained from `PruningSink::Manifest` entries. | Files eliminated by manifest min/max statistics (Iceberg / Delta manifest level, before parquet footers). Currently NULL (no source pushes these yet). | | `files_pruned_stats` | `get_scan_stats().files_skipped_by_stats`. | Files eliminated by per-file min/max statistics. | | `files_pruned_bloom` | Reserved. | File-level bloom filtering is not yet implemented; the column stays NULL even when row-group-level bloom prunes things (those go through `row_groups_pruned_bloom`). | | `files_planned` | `get_scan_stats().files_scanned`. | Files the pruning ladder said would be scanned (= `total_files - sum(files_pruned_*)`). | ### row_groups | Metric | Source | Description | |---|---|---| | `total_row_groups` | `metadata.num_row_groups()` summed per file, accumulated at `share_idx == 0`. | Row groups across the planned-to-scan files, before row-group-level pruning. Set unconditionally so a full scan (no predicate) still reports a non-zero number. | | `row_groups_pruned_stats` | `RowGroupPruner` `pruning_result.pruned_row_groups.len()` summed per file, accumulated at `share_idx == 0`. | Row groups eliminated by min/max statistics within surviving files. | | `row_groups_pruned_bloom` | Drained from `PruningSink::Bloom` entries. | Row groups eliminated by bloom filters within surviving files. Wired on the parquet-io ladder; NULL on scan paths that do not consult bloom. | | `row_groups_pruned_nulls` | Drained from `PruningSink::Nulls` entries. | Row groups eliminated by null-count specials (IS NULL on no-null column, IS NOT NULL on all-null column). | | `row_groups_planned` | `selected_row_groups_pre_share.len()` per file at `share_idx == 0`, summed. | Row groups the ladder said would be scanned. Uses the PRE-share survivor list so multi-share scans report per-file totals, not per-share slices. | ### columns | Metric | Source | Description | |---|---|---| | `total_columns_schema` | `get_scan_stats().total_columns`. | Total leaf columns in the file schema. | | `columns_projected` | `get_scan_stats().columns_projected`. | Root-field columns the optimizer's projection picked. | | `column_chunks_planned` | Per-file projection-aware loop: sum over surviving row groups of `ProjectionMask::leaf_included` column chunks. | Column chunks the projection covers across planned row groups. Drives the planned-bytes accounting. | ### pages | Metric | Source | Description | |---|---|---| | `pages_pruned_page_index` | Drained from `PruningSink::PageIndex` entries. | Pages within surviving row groups eliminated by the page-index ladder. Wired only when `page_pruning` is active and a predicate exists. | ### bytes | Metric | Source | Description | |---|---|---| | `bytes_planned_compressed` | Per-file loop: sum of `chunk.compressed_size()` over `ProjectionMask::leaf_included` chunks in planned row groups. | Byte budget the optimizer thought the scan would consume. Falls back to the `IoCounters` `bytes_read` when the Delta scan path did not populate this counter (e.g. non-Delta connectors). | | `bytes_planned_uncompressed` | Per-file loop: sum of `chunk.uncompressed_size()` over the same projected leaves. | Same as above but uncompressed; useful for memory-budget estimation. | ### rows | Metric | Source | Description | |---|---|---| | `rows_in_planned_row_groups` | Per-file loop: sum of `rg.num_rows()` over planned row groups. | Maximum row count the scan could have emitted if everything matched. Pre-LIMIT, pre-predicate, pre-DV. | ## Metric catalog: ACTUAL view (41 metrics) ACTUAL numbers come from runtime instrumentation: per-batch wall-clock around `stream.next().await`, a `PruningSink` accumulator that tracks which row groups actually produced batches, a `BytesReadCounter` decorator on the range reader, and snapshot-pair deltas around the per-query SHOW STATS invocation. ### query Same three identity metrics as PLAN (`statement_kind`, `sql_preview`, `submitted_at_unix_ms`). ### time Phases listed in chronological request order. The first two are typically zero in standalone / in-process deployments; the remaining ten are populated for every query that produces row batches. There are two summary totals at the end: `total_engine_time_ms` (the engine-only, cross-engine-comparable number) and `total_time_ms` (the full per-request wall-clock including auth). **All time metrics are emitted in milliseconds (`ms`) with sub-millisecond decimal precision** (microsecond resolution: `0.015700` is 15.7 microseconds expressed in ms). Single-unit emission so a consumer reading the time category never has to switch scales while scanning the breakdown. Internal storage is mixed (`u64` ms for legacy phases, `u64` ns for newer phases); the render path normalises both to f64 ms and strips trailing zeros, so `5.000000` shows as `5` and `85.452000` shows as `85.452`. | Metric | Source | Description | |---|---|---| | `auth_time_ms` | `execution_trace`'s `auth_register` phase duration. | JWT verification, user lookup, token exchange. `0` in standalone / in-process deployments. | | `control_plane_validation_ms` | Reserved for control-plane-supplied timing. | RBAC, catalog-existence checks, query authorization. NULL in standalone / in-process deployments; populated by the control plane when one is in the path. | | `planning_time_ms` | Wall-clock around `sql_with_stack_protection(&self.session, &sql)`. | DataFusion SQL parse + logical-plan construction. | | `compile_time_ms` | Wall-clock around `df.create_physical_plan()`, divided by 1_000_000 from the `u64` ns storage. | Logical-to-physical compilation + every DataFusion optimizer rule + delta-forge extension passes. First call on a fresh session can be inflated by lazy dispatch-table initialisation; run the same SQL twice and compare to separate first-call cost from steady-state. | | `execution_time_ms` | Wall-clock from `execute_stream()` to the first batch landing. | Time-to-first-batch latency for streaming SELECTs. | | `streaming_time_ms` | Wall-clock from the first batch to the last batch on the stream. | Result delivery time. Useful for distinguishing "engine work" from "result delivery" when comparing against external benchmarks. | | `io_wait_max_ms` | Max of `ScanRuntimeStats::read_time_max_nanos` across `DeltaScanExec` nodes, in milliseconds. | Slowest per-share scan-IO wait, the closest single number to the critical-path IO cost. One share = one parallel row-group reader task. Conflates pure IO wait with parquet decode time per share until a `CountingRangeReader`-equivalent latency tracker splits them; document as "scan IO + decode time, slowest share". | | `io_wait_avg_ms` | `read_time_sum_nanos / read_time_share_count` across `DeltaScanExec` nodes, in milliseconds. | Typical per-share scan-IO wait. Read alongside `io_wait_max_ms` for a sense of skew. Same conflation caveat. | | `io_wait_min_ms` | Min of `ScanRuntimeStats::read_time_min_nanos` across `DeltaScanExec` nodes that ran at least one share, in milliseconds. | Fastest per-share scan-IO wait. With max and avg, exposes per-share skew (max - min near zero means readers are balanced; large gap means one or more stragglers). | | `scan_share_count` | Sum of `ScanRuntimeStats::read_time_share_count` across `DeltaScanExec` nodes. | Effective IO parallelism for the query. The divisor used to derive `io_wait_avg_ms`. The summed-across-shares total IO wait time is intentionally NOT surfaced because it would routinely exceed wall-clock and has no useful interpretation without this count. | | `decode_time_ms` | Wall-clock around each `stream.next().await` summed, divided by 1_000_000. | Per-batch decode + downstream operator + IO wait. Compare to `io_wait_max_ms` for context: when decode is comparable to or larger than the critical-path IO wait, decode is the bottleneck. Very small batches (e.g. one tiny result batch from a LIMIT or COUNT) can produce sub-microsecond decode values; expect tens of microseconds for small results, low ms for large ones. | | `filter_time_ms` | `ScanRuntimeStats::filter_time_nanos` summed, divided by 1_000_000. | Aggregate time evaluating predicates (row filters + DV application). | | `total_engine_time_ms` | `LastStatementStats::total_engine_time_ms` (`planning_time_ms + execution_time_ms + streaming_time_ms`). | Engine-only wall-clock for the statement. Excludes `auth_time_ms` and `control_plane_validation_ms` so the value is directly comparable to other engines' query-time numbers (DuckDB, Spark, etc.) where there is no auth / control-plane phase. Cite this in cross-engine benchmarks. | | `total_time_ms` | Wall-clock around the whole `execute_show_stats` invocation. | Full per-request wall-clock: auth + planning + execution + streaming + the per-query stats drain. Should be slightly higher than the sum of the phases above; the gap is plan-walk + sink drain + cache-snapshot overhead inside SHOW STATS itself. Cite this for end-to-end user-perceived latency. | ### files | Metric | Source | Description | |---|---|---| | `files_touched` | `PruningSink::finalize_actual_reads()` -- one accumulator entry per file scanned. | Files that produced at least one batch downstream. Less than or equal to `plan_files_planned`; the gap is files the scan never opened (LIMIT satisfied earlier). | ### row_groups | Metric | Source | Description | |---|---|---| | `row_groups_touched` | Sum of `surviving_rgs.len()` across accumulator entries (per-file, not per-share). | Row groups that produced at least one batch downstream. Per-rg precise on the Delta scan path (one stream per row group); inferred from cumulative rows_consumed on parquet/iceberg connector paths. | ### columns | Metric | Source | Description | |---|---|---| | `column_chunks_fetched` | Per accumulator entry: surviving row groups times `ProjectionMask::roots` expanded leaf count. | Column chunks actually fetched off disk or cache. Sum of projected leaves across touched row groups. | ### bytes | Metric | Source | Description | |---|---|---| | `bytes_read_disk` | `BytesReadCounter::load()` after stream drain. | Bytes read off disk via the range reader. Cache hits do NOT contribute here. Coverage caveat: the fast path of `open_parquet_stream` is fully decorated; some bloom pre-pass and parquet-rs fallback paths may not increment this counter. | | `bytes_decoded_uncompressed` | Sum of `RecordBatch::get_array_memory_size()` across drain batches. | Arrow uncompressed memory consumed downstream. Arrow allocator overhead can drift a few percent from parquet's `total_byte_size`. Cite as "Arrow uncompressed memory consumed", not as "parquet uncompressed bytes". | | `bytes_written` | `IoCounters::bytes_written()`. | Bytes written to storage during the statement. `0` for read-only queries. | | `bytes_from_page_cache` | Snapshot-pair delta of `PageCacheStats::bytes_served`. | Bytes served from the morsel-native page cache during this query. NULL when the scan does not consult that cache (the default DataFusion + parquet-rs path). | | `bytes_from_metadata_cache` | Snapshot-pair delta of `MetadataCacheStats::bytes_served`. | Bytes served from the parquet metadata cache (footer + page-index sidecars). Footer reads can be substantial on wide schemas; this number tells consumers whether they are amortising footer cost across queries. | ### rows | Metric | Source | Description | |---|---|---| | `rows_consumed` | Sum of `RecordBatch::num_rows()` across drain batches at the operator above the scan. | Rows the executor produced for the operator above the scan, post-DV but pre-LIMIT. | | `rows_returned` | Same drain summed, but at the result-set boundary. | Rows actually delivered to the consumer. For a `LIMIT n` query, this is exactly `n`. | | `rows_affected` | `EagerMetadata::rows_affected` for DML. | Rows changed by INSERT / UPDATE / DELETE / MERGE. NULL for SELECT. | | `rows_filtered_dv` | `ScanRuntimeStats::rows_filtered_dv` summed. | Rows filtered out by deletion vectors during the scan. | | `rows_filtered_predicate` | Reserved for parquet-rs `RowFilter` callback wiring. | Rows filtered out by predicate evaluation post-decode (late materialization or post-decode row-filter). Does NOT count rows skipped by row-group / page-index pruning (those rows were never decoded). Currently NULL pending the callback. | ### cache The `dim_cache_*` counters are cumulative across the engine process. The page-cache and metadata-cache counters are snapshot-pair deltas around the inner statement, so they are per-query accurate when no other queries run concurrently against the same executor. | Metric | Source | Description | |---|---|---| | `dim_cache_warm_hits` | `delta_forge_table::indexes::dim_cache_lifetime_counters().warm_hits()`. | Cumulative DIM-cache warm hits (AST-matched dim-routable queries that resolved from the LRU without disk IO). | | `dim_cache_cold_loads` | Same source. | Cumulative DIM-cache cold loads (per-column parquet read from disk on miss). | | `dim_cache_rows_served` | Same source. | Cumulative rows served from the DIM cache. | | `dim_cache_bytes_served` | Same source. | Cumulative Arrow-memory bytes served from the DIM cache. | | `dim_cache_load_time_ms` | Same source. | Cumulative wall-clock time spent on DIM cold-load disk reads. | | `page_cache_hits` | `PageCache::global().stats()` delta. | Per-query page-cache hits. NULL when the scan path does not consult the morsel-native page cache. | | `page_cache_misses` | Same source. | Per-query page-cache misses. | | `page_cache_bytes_served` | `PageCacheStats::bytes_served` delta. | Bytes served from the page cache (same source as `bytes_from_page_cache`; surfaced under both names for consumers that group by cache layer). | | `metadata_cache_hits` | `default_metadata_cache().stats()` delta. | Per-query metadata-cache hits. | | `metadata_cache_misses` | Same source. | Per-query metadata-cache misses. | ### memory | Metric | Source | Description | |---|---|---| | `peak_memory_bytes` | Max of `RecordBatch::get_array_memory_size()` across drain batches. | Peak per-batch Arrow memory observed during the drain inside `execute_show_stats`. Approximate, not RSS-precise. Cite as "Arrow per-batch peak"; do not equate to OS-level memory. | ## Wired vs PLUMBED vs PENDING (three-state model) For published benchmark consumption, each metric is in one of three states: - **WIRED**: Source populates the metric, source value matches ground truth, integration test verifies it. Safe to cite. - **PLUMBED**: Source populates the metric, but no integration test yet verifies it against ground truth. Not safe to cite in a published benchmark even though the column has a value. - **PENDING**: Metric has a stable column name and source design but is not populated yet. Always NULL until the source lands. Disclose as "not measured". The per-metric state is tracked in `docs/show-stats-methodology.md`. As of the long-form pivot, the WIRED set covers identity (3) + plan files/row_groups/columns/bytes/rows (12) + actual times (11) + actual files/row_groups/chunks/bytes/rows (12) + dim cache (5) + memory (1). PLUMBED columns are the parquet-rs / parquet-io cache deltas, awaiting integration coverage. PENDING columns are `files_pruned_manifest`, `files_pruned_bloom`, `rows_filtered_predicate`. ## Zero-overhead default A bare `SELECT` outside SHOW STATS pays zero stats cost: no plan walks, no cache snapshots, no sink allocations. The instrumentation only runs while SHOW STATS is active. Specifically: - `PruningSink` and `BytesReadCounter` are installed on the `TaskContext`'s `SessionConfig` for the duration of the inner statement only. Scan execs that lift them with `task_ctx.session_config().get_extension::<...>()` see `None` outside SHOW STATS and skip the per-row-group register / per-batch record hooks. - `stats_capture_depth` is an `AtomicU64` on the executor. `execute_show_stats` increments it on entry, RAII-decrements on exit. Per-statement drain sites (`collect_to_execution_result`, dim fast path, GUI streaming/eager paths) gate plan-walk and cache snapshotting on `depth > 0`. Off-mode overhead is one `is_some()` check. - Cache snapshot deltas are computed only inside `execute_show_stats`; the global counters increment regardless of whether SHOW STATS is active, so no extra work is done on the hot path. The published-benchmark methodology requires running the query bare (for measurement) and running it under `SHOW STATS ACTUAL` (for the breakdown); the two numbers are reported separately. Running the same query inside SHOW STATS introduces measurable overhead from the instrumentation and is not the published wall-clock. ## Methodology - Long-form `(category, metric, value, unit)` shape is locked. Downstream tooling parses by metric name; column ordering and category names will not change. - `Option<u64>` discipline on every metric. NULL means "not wired or not populated for this codepath"; `0` means "wired and observed zero". Bare `u64` would leak silent zeros. - Single drain site: `LastStatementStats::build()` is the only place that constructs the record; `with_plan_slice` and `with_actual_slice` are the only ways to populate the slices. Anything else is a bug. - Bare `SHOW STATS` (no view selector) is rejected at the parser; the view is required. There is no path that accidentally surfaces "the plan" as if it were "what happened". - The full methodology, including the invariants benchmark consumers can assert (PLAN/ACTUAL parity on full scans, projection-pushdown lower bounds, LIMIT divergence), lives in `docs/show-stats-methodology.md`.

Parameters

NameTypeDescription
viewSelects which slice to return. PLAN reports the optimizer / pruning-ladder pre-fetch estimate (surviving files, surviving row groups, projected column chunks, planned bytes). ACTUAL reports the runtime measurements (phase timings, touched files / row groups / chunks, bytes read off disk vs from each cache, rows consumed and returned, peak memory). Bare SHOW STATS without a view is rejected at the parser; the explicit selector prevents publishing PLAN numbers as if they were ACTUAL, or vice versa.
statementSQL statement to run under instrumentation. The statement is executed exactly once and its result rows are discarded; the SHOW STATS return value is the metrics record, not the inner data. The inner statement plans against the live session catalog, so it can reference any registered table or view.

Examples

-- PLAN: what will the optimizer read?
SHOW STATS PLAN SELECT * FROM warehouse.sales.orders WHERE order_date >= DATE '2026-01-01';
-- ACTUAL: what did the runtime actually consume?
SHOW STATS ACTUAL SELECT * FROM warehouse.sales.orders WHERE order_date >= DATE '2026-01-01';
-- Confirm column-projection pushdown is honoured at read time
SHOW STATS ACTUAL SELECT order_id, customer_id FROM warehouse.sales.orders;
-- LIMIT cancellation: PLAN reports the full footprint, ACTUAL stops early
SHOW STATS ACTUAL SELECT * FROM warehouse.sales.orders LIMIT 1000;
-- COUNT(*) metadata fast path: column_chunks_fetched should be 0
SHOW STATS ACTUAL SELECT COUNT(*) FROM warehouse.sales.orders;
-- Aggregate query: scope the metrics record to the inner statement
SHOW STATS ACTUAL SELECT region, SUM(total_amount) FROM warehouse.sales.orders GROUP BY region;

Pitfalls

See Also

Open in interactive docs →   DeltaForge home →