SHOW STATS

Runs a SQL statement under instrumentation and returns a per-metric breakdown of what the planner intended (PLAN view) or what the runtime actually consumed (ACTUAL view). Inline form: every invocation captures its own SQL, no per-session ring. Zero overhead when not invoked.

Category: systemDeltaForge extension

Syntax

SHOW STATS (PLAN | ACTUAL) <sql_statement>

Description

## Overview SHOW STATS is the engine's observability surface for query execution. Each invocation runs an inner SQL statement under instrumentation, captures the metrics for that statement only, and returns them as a long-form result set. The two views (PLAN and ACTUAL) are deliberately separated so a benchmark consumer cannot accidentally cite optimizer estimates as if they were measured numbers. The inline form means SHOW STATS is self-contained: no session history ring, no "give me the last query's stats" affordance. Every call carries its own SQL and its own measurements. A benchmark report typically runs the bare query (for wall-clock measurement), then runs the same SQL under `SHOW STATS ACTUAL` to capture the breakdown, and reports the two numbers side-by-side. ## Result Schema (long-form) Both views return the same four-column shape: | Column | Type | Description | |---|---|---| | `category` | utf8 (not null) | Metric group: `query`, `time`, `files`, `row_groups`, `columns`, `pages`, `bytes`, `rows`, `cache`, `memory`, `spill`. | | `metric` | utf8 (not null) | Metric name. | | `value` | utf8 (nullable) | Decimal-string for numeric metrics, raw string for identity metrics (`statement_kind`, `sql_preview`), NULL when the metric is wired but not populated on this codepath. | | `unit` | utf8 (not null) | One of `str`, `epoch_ms`, `ms`, `ns`, `count`, `bytes`, `rows`. | ### NULL vs zero (load-bearing for benchmarks) - `value = NULL` means the metric is wired but not populated for this query. Examples: `control_plane_validation_ms` is NULL when running in-process (no control plane in path); `bytes_from_page_cache` is NULL when the scan goes through the DataFusion + parquet-rs path that does not consult the morsel-native page cache. - `value = "0"` means the metric is wired and the runtime observed zero. Example: `files_pruned_partition = "0"` on a non-partitioned table. A benchmark publication that cites a NULL metric is making a claim the engine has not actually measured. Cite "0" freely; treat NULL as "not measured here" and disclose accordingly. ## Cypher execution metrics (ACTUAL view) When the inner statement is Cypher (`SHOW STATS ACTUAL` prefixed directly to a `USE <graph>` + `CALL algo.* ...` block as ONE statement, with no `;` between the wrapper and the Cypher), the ACTUAL view adds Cypher-specific categories alongside the SQL ones. These are the instrument for "where did my graph query spend time", and in particular for telling a cold first query apart from a warm one. ### cypher.algorithm | Metric | Description | |---|---| | `name` | Algorithm invoked, e.g. `algo.pagerank`. | | `used_gpu` | `true` when the algorithm ran on the GPU (`ON GPU`). | | `node_count` / `edge_count` | Graph size the algorithm operated on. | | `rows_out` | Rows produced (after YIELD / RETURN / LIMIT). | | `algorithm_ms` | End-to-end algorithm wall time (runs synchronously, so trustworthy). | ### cypher.gpu | Metric | Description | |---|---| | `used_gpu` | Whether any GPU op ran. | | `gpu_wall_ms` | Total GPU wall time (upload + compute + readback). | | `transfer_to_gpu_ms` | Host to VRAM upload of the CSR buffers. | | `compute_ms` | Kernel compute time. | | `transfer_from_gpu_ms` | VRAM to host readback of results. | ### cypher.csr How the CSR topology was resolved and, on a cold load, where the time went. This category answers the most common graph-performance question: why is the first query after a restart slow. | Metric | Unit | Description | |---|---|---| | `source` | str | `memory-cache` (warm Tier-1 hit, ~0 ms), `disk` (`.dcsr` read), or the from-tables rebuild label. | | `load_ms` | ms | Total CSR topology resolution wall time. | | `disk_read_ms` | ms | `.dcsr` slurp + zstd decompress; populated only on the cold disk path, NULL on a warm hit or a from-tables rebuild. | | `weights_load_ms` | ms | Wall time rebuilding the per-edge weight array from the source parquet. The `.dcsr` deliberately does NOT persist weights (storing them would bloat the topology file for every algorithm), so a cold load of a WEIGHTED graph re-reads them. This is typically the dominant cold-start cost. NULL when no weight reload ran: a warm hit, a rebuild, or an unweighted query whose AST does not consume edge weights. | | `weights_group_ms` | ms | Portion of `weights_load_ms` spent grouping CSR slots by source `(file, row_group)` from `edge_loc` before the concurrent read. On large graphs this is the single-threaded O(edges) bucket-build and can be the largest sub-cost once the read itself is parallelised. | | `weights_scan_ms` | ms | Portion of `weights_load_ms` spent on the concurrent weight-column read. | | `weights_scatter_ms` | ms | Portion of `weights_load_ms` spent placing weights into CSR edge slots. | | `weights_path` | str | `provenance` (the fast `edge_loc`-directed parallel reload that reads only the weight column and places each weight by its persisted source-row pointer) or `scan-scatter` (the legacy `(src,dst)` scan + match fallback, used when a graph lacks complete edge provenance). | ### cypher.expand / cypher.property Populated for MATCH-expansion and property-materialisation shapes; both are NULL for a pure algorithm `CALL` (no expansion, no property gather). The `cypher.property` block splits `property_ms` so a slow gather can be attributed without a tracing build. `tier` names the path that served the columns (`in_memory`, `row_fetch`, or `scatter`), and `in_memory_ms` / `row_fetch_ms` / `scatter_ms` split the wall time across those tiers (at most one disk-backed tier is non-zero per gather). When the full-scan `scatter` fallback runs (a graph with no `.dptr` provenance, or no row fetcher / non-flattened property mode), its inner cost is broken down further: `scatter_plan_ms` (DataFusion planning of the column-pruned SELECT), `scatter_collect_ms` (the datalake read, which dominates a slow cloud gather), `scatter_build_ms` (the source-row to CSR-slot scatter-index build), and `scatter_encode_ms` (concat + per-column scatter + dictionary encode). `scatter_rows_scanned` is the rows pulled from the source table; comparing it against `rows` exposes the O(N) full-scan amplification a cached or pointer-backed read avoids (for example scanning an entire node table to return a few hundred matched rows). ## Category ordering The metric rows are emitted in a fixed sequence so the breakdown reads top-down from "what did I run" to "how much time / IO / memory did it cost": 1. `query` (statement identity) 2. `time` (both views; phases in chronological order, totals last: `total_engine_time_ms` then `total_time_ms`) 3. `files` (PLAN: pruned counts + planned; ACTUAL: touched) 4. `row_groups` (PLAN: pruned counts + planned; ACTUAL: touched) 5. `columns` (PLAN: total/projected/chunks_planned; ACTUAL: chunks_fetched) 6. `pages` (PLAN only) 7. `bytes` (PLAN: planned compressed + uncompressed; ACTUAL: read_disk, decoded, written, cache contributions) 8. `rows` (PLAN: rows in planned row groups; ACTUAL: consumed, returned, affected, filtered) 9. `cache` (ACTUAL only; DIM cache lifetime counters + per-query page/metadata cache deltas) 10. `memory` (ACTUAL only; peak per-batch Arrow memory) 11. `spill` (ACTUAL only; bytes / files / rows DataFusion spilled to disk) The `time` category appears in BOTH views with the same metric set. SHOW STATS PLAN runs the inner SQL to populate the planner-side counters (otherwise `total_row_groups`, `row_groups_pruned_*`, and the byte budgets would be empty), so the timings are valid measurements there too. Time metrics are listed first in both views because they are the primary cross-engine comparison surface (DuckDB, Spark, etc.). ## Metric catalog: PLAN view (21 metrics) All counts and bytes come from walking the physical plan's `DeltaScanExec` nodes and aggregating their `ScanRuntimeStats` after `df.create_physical_plan().await` returns. The walk happens before the inner statement streams, so PLAN numbers reflect the optimizer's pre-fetch decision. ### query | Metric | Source | Description | |---|---|---| | `statement_kind` | First keyword of the inner SQL, lowercased (`select`, `insert`, ...). | Statement classification, used as a hint for downstream tooling. | | `sql_preview` | First ~80 chars of the inner SQL after stripping leading whitespace / comments. | Confirms the metrics record applies to the SQL the operator thinks they ran. | | `submitted_at_unix_ms` | `SystemTime::now()` at the moment `execute_show_stats` started. | Wall-clock submission timestamp; NULL if the timestamp could not be sampled. | ### files | Metric | Source | Description | |---|---|---| | `total_files` | `DeltaScanExec::get_scan_stats().total_files`. | Files in the table version the planner saw, before any pruning. | | `files_pruned_partition` | `get_scan_stats().files_skipped_by_partition`. | Files eliminated by partition-value pruning (file-level, before any metadata fetch). | | `files_pruned_manifest` | Drained from `PruningSink::Manifest` entries. | Files eliminated by manifest min/max statistics (Iceberg / Delta manifest level, before parquet footers). Currently NULL (no source pushes these yet). | | `files_pruned_stats` | `get_scan_stats().files_skipped_by_stats`. | Files eliminated by per-file min/max statistics. | | `files_pruned_bloom` | Reserved. | File-level bloom filtering is not yet implemented; the column stays NULL even when row-group-level bloom prunes things (those go through `row_groups_pruned_bloom`). | | `files_planned` | `get_scan_stats().files_scanned`. | Files the pruning ladder said would be scanned (= `total_files - sum(files_pruned_*)`). | ### row_groups | Metric | Source | Description | |---|---|---| | `total_row_groups` | `metadata.num_row_groups()` summed per file, accumulated at `share_idx == 0`. | Row groups across the planned-to-scan files, before row-group-level pruning. Set unconditionally so a full scan (no predicate) still reports a non-zero number. | | `row_groups_pruned_stats` | `RowGroupPruner` `pruning_result.pruned_row_groups.len()` summed per file, accumulated at `share_idx == 0`. | Row groups eliminated by min/max statistics within surviving files. | | `row_groups_pruned_bloom` | Drained from `PruningSink::Bloom` entries. | Row groups eliminated by bloom filters within surviving files. Wired on the parquet-io ladder; NULL on scan paths that do not consult bloom. | | `row_groups_pruned_nulls` | Drained from `PruningSink::Nulls` entries. | Row groups eliminated by null-count specials (IS NULL on no-null column, IS NOT NULL on all-null column). | | `row_groups_planned` | `selected_row_groups_pre_share.len()` per file at `share_idx == 0`, summed. | Row groups the ladder said would be scanned. Uses the PRE-share survivor list so multi-share scans report per-file totals, not per-share slices. | ### columns | Metric | Source | Description | |---|---|---| | `total_columns_schema` | `get_scan_stats().total_columns`. | Total leaf columns in the file schema. | | `columns_projected` | `get_scan_stats().columns_projected`. | Root-field columns the optimizer's projection picked. | | `column_chunks_planned` | Per-file projection-aware loop: sum over surviving row groups of `ProjectionMask::leaf_included` column chunks. | Column chunks the projection covers across planned row groups. Drives the planned-bytes accounting. | ### pages | Metric | Source | Description | |---|---|---| | `pages_pruned_page_index` | Drained from `PruningSink::PageIndex` entries. | Pages within surviving row groups eliminated by the page-index ladder. Wired only when `page_pruning` is active and a predicate exists. | ### bytes | Metric | Source | Description | |---|---|---| | `bytes_planned_compressed` | Per-file loop: sum of `chunk.compressed_size()` over `ProjectionMask::leaf_included` chunks in planned row groups. | Byte budget the optimizer thought the scan would consume. Falls back to the `IoCounters` `bytes_read` when the Delta scan path did not populate this counter (e.g. non-Delta connectors). | | `bytes_planned_uncompressed` | Per-file loop: sum of `chunk.uncompressed_size()` over the same projected leaves. | Same as above but uncompressed; useful for memory-budget estimation. | ### rows | Metric | Source | Description | |---|---|---| | `rows_in_planned_row_groups` | Per-file loop: sum of `rg.num_rows()` over planned row groups. | Maximum row count the scan could have emitted if everything matched. Pre-LIMIT, pre-predicate, pre-DV. | ## Metric catalog: ACTUAL view (44 metrics) ACTUAL numbers come from runtime instrumentation: per-batch wall-clock around `stream.next().await`, a `PruningSink` accumulator that tracks which row groups actually produced batches, a `BytesReadCounter` decorator on the range reader, and snapshot-pair deltas around the per-query SHOW STATS invocation. ### query Same three identity metrics as PLAN (`statement_kind`, `sql_preview`, `submitted_at_unix_ms`). ### time Phases listed in chronological request order. The first two are typically zero in standalone / in-process deployments; the remaining ten are populated for every query that produces row batches. There are two summary totals at the end: `total_engine_time_ms` (the engine-only, cross-engine-comparable number) and `total_time_ms` (the full per-request wall-clock including auth). **All time metrics are emitted in milliseconds (`ms`) with sub-millisecond decimal precision** (microsecond resolution: `0.015700` is 15.7 microseconds expressed in ms). Single-unit emission so a consumer reading the time category never has to switch scales while scanning the breakdown. Internal storage is mixed (`u64` ms for legacy phases, `u64` ns for newer phases); the render path normalises both to f64 ms and strips trailing zeros, so `5.000000` shows as `5` and `85.452000` shows as `85.452`. | Metric | Source | Description | |---|---|---| | `auth_time_ms` | `execution_trace`'s `auth_register` phase duration. | JWT verification, user lookup, token exchange. `0` in standalone / in-process deployments. | | `control_plane_validation_ms` | Reserved for control-plane-supplied timing. | RBAC, catalog-existence checks, query authorization. NULL in standalone / in-process deployments; populated by the control plane when one is in the path. | | `planning_time_ms` | Wall-clock around `sql_with_stack_protection(&self.session, &sql)`. | DataFusion SQL parse + logical-plan construction. | | `compile_time_ms` | Wall-clock around `df.create_physical_plan()`, divided by 1_000_000 from the `u64` ns storage. | Logical-to-physical compilation + every DataFusion optimizer rule + delta-forge extension passes. First call on a fresh session can be inflated by lazy dispatch-table initialisation; run the same SQL twice and compare to separate first-call cost from steady-state. | | `execution_time_ms` | Wall-clock from `execute_stream()` to the first batch landing. | Time-to-first-batch latency for streaming SELECTs. | | `streaming_time_ms` | Wall-clock from the first batch to the last batch on the stream. | Result delivery time. Useful for distinguishing "engine work" from "result delivery" when comparing against external benchmarks. | | `io_wait_max_ms` | Max of `ScanRuntimeStats::read_time_max_nanos` across `DeltaScanExec` nodes, in milliseconds. | Slowest per-share scan-IO wait, the closest single number to the critical-path IO cost. One share = one parallel row-group reader task. Conflates pure IO wait with parquet decode time per share until a `CountingRangeReader`-equivalent latency tracker splits them; document as "scan IO + decode time, slowest share". | | `io_wait_avg_ms` | `read_time_sum_nanos / read_time_share_count` across `DeltaScanExec` nodes, in milliseconds. | Typical per-share scan-IO wait. Read alongside `io_wait_max_ms` for a sense of skew. Same conflation caveat. | | `io_wait_min_ms` | Min of `ScanRuntimeStats::read_time_min_nanos` across `DeltaScanExec` nodes that ran at least one share, in milliseconds. | Fastest per-share scan-IO wait. With max and avg, exposes per-share skew (max - min near zero means readers are balanced; large gap means one or more stragglers). | | `scan_share_count` | Sum of `ScanRuntimeStats::read_time_share_count` across `DeltaScanExec` nodes. | Effective IO parallelism for the query. The divisor used to derive `io_wait_avg_ms`. The summed-across-shares total IO wait time is intentionally NOT surfaced because it would routinely exceed wall-clock and has no useful interpretation without this count. | | `decode_time_ms` | Wall-clock around each `stream.next().await` summed, divided by 1_000_000. | Per-batch decode + downstream operator + IO wait. Compare to `io_wait_max_ms` for context: when decode is comparable to or larger than the critical-path IO wait, decode is the bottleneck. Very small batches (e.g. one tiny result batch from a LIMIT or COUNT) can produce sub-microsecond decode values; expect tens of microseconds for small results, low ms for large ones. | | `filter_time_ms` | `ScanRuntimeStats::filter_time_nanos` summed, divided by 1_000_000. | Aggregate time evaluating predicates (row filters + DV application). | | `total_engine_time_ms` | `LastStatementStats::total_engine_time_ms` (`planning_time_ms + execution_time_ms + streaming_time_ms`). | Engine-only wall-clock for the statement. Excludes `auth_time_ms` and `control_plane_validation_ms` so the value is directly comparable to other engines' query-time numbers (DuckDB, Spark, etc.) where there is no auth / control-plane phase. Cite this in cross-engine benchmarks. | | `total_time_ms` | Wall-clock around the whole `execute_show_stats` invocation. | Full per-request wall-clock: auth + planning + execution + streaming + the per-query stats drain. Should be slightly higher than the sum of the phases above; the gap is plan-walk + sink drain + cache-snapshot overhead inside SHOW STATS itself. Cite this for end-to-end user-perceived latency. | ### files | Metric | Source | Description | |---|---|---| | `files_touched` | `PruningSink::finalize_actual_reads()` -- one accumulator entry per file scanned. | Files that produced at least one batch downstream. Less than or equal to `plan_files_planned`; the gap is files the scan never opened (LIMIT satisfied earlier). | ### row_groups | Metric | Source | Description | |---|---|---| | `row_groups_touched` | Sum of `surviving_rgs.len()` across accumulator entries (per-file, not per-share). | Row groups that produced at least one batch downstream. Per-rg precise on the Delta scan path (one stream per row group); inferred from cumulative rows_consumed on parquet/iceberg connector paths. | ### columns | Metric | Source | Description | |---|---|---| | `column_chunks_fetched` | Per accumulator entry: surviving row groups times `ProjectionMask::roots` expanded leaf count. | Column chunks actually fetched off disk or cache. Sum of projected leaves across touched row groups. | ### bytes | Metric | Source | Description | |---|---|---| | `bytes_read_disk` | `BytesReadCounter::load()` after stream drain. | Bytes read off disk via the range reader. Cache hits do NOT contribute here. Coverage caveat: the fast path of `open_parquet_stream` is fully decorated; some bloom pre-pass and parquet-rs fallback paths may not increment this counter. | | `bytes_decoded_uncompressed` | Sum of `RecordBatch::get_array_memory_size()` across drain batches. | Arrow uncompressed memory consumed downstream. Arrow allocator overhead can drift a few percent from parquet's `total_byte_size`. Cite as "Arrow uncompressed memory consumed", not as "parquet uncompressed bytes". | | `bytes_written` | `IoCounters::bytes_written()`. | Bytes written to storage during the statement. `0` for read-only queries. | | `bytes_from_page_cache` | Snapshot-pair delta of `PageCacheStats::bytes_served`. | Bytes served from the morsel-native page cache during this query. NULL when the scan does not consult that cache (the default DataFusion + parquet-rs path). | | `bytes_from_metadata_cache` | Snapshot-pair delta of `MetadataCacheStats::bytes_served`. | Bytes served from the parquet metadata cache (footer + page-index sidecars). Footer reads can be substantial on wide schemas; this number tells consumers whether they are amortising footer cost across queries. | ### rows | Metric | Source | Description | |---|---|---| | `rows_consumed` | Sum of `RecordBatch::num_rows()` across drain batches at the operator above the scan. | Rows the executor produced for the operator above the scan, post-DV but pre-LIMIT. | | `rows_returned` | Same drain summed, but at the result-set boundary. | Rows actually delivered to the consumer. For a `LIMIT n` query, this is exactly `n`. | | `rows_affected` | `EagerMetadata::rows_affected` for DML. | Rows changed by INSERT / UPDATE / DELETE / MERGE. NULL for SELECT. | | `rows_filtered_dv` | `ScanRuntimeStats::rows_filtered_dv` summed. | Rows filtered out by deletion vectors during the scan. | | `rows_filtered_predicate` | Reserved for parquet-rs `RowFilter` callback wiring. | Rows filtered out by predicate evaluation post-decode (late materialization or post-decode row-filter). Does NOT count rows skipped by row-group / page-index pruning (those rows were never decoded). Currently NULL pending the callback. | ### cache The `dim_cache_*` counters are cumulative across the engine process. The page-cache and metadata-cache counters are snapshot-pair deltas around the inner statement, so they are per-query accurate when no other queries run concurrently against the same executor. | Metric | Source | Description | |---|---|---| | `dim_cache_warm_hits` | `delta_forge_table::indexes::dim_cache_lifetime_counters().warm_hits()`. | Cumulative DIM-cache warm hits (AST-matched dim-routable queries that resolved from the LRU without disk IO). | | `dim_cache_cold_loads` | Same source. | Cumulative DIM-cache cold loads (per-column parquet read from disk on miss). | | `dim_cache_rows_served` | Same source. | Cumulative rows served from the DIM cache. | | `dim_cache_bytes_served` | Same source. | Cumulative Arrow-memory bytes served from the DIM cache. | | `dim_cache_load_time_ms` | Same source. | Cumulative wall-clock time spent on DIM cold-load disk reads. | | `page_cache_hits` | `PageCache::global().stats()` delta. | Per-query page-cache hits. NULL when the scan path does not consult the morsel-native page cache. | | `page_cache_misses` | Same source. | Per-query page-cache misses. | | `page_cache_bytes_served` | `PageCacheStats::bytes_served` delta. | Bytes served from the page cache (same source as `bytes_from_page_cache`; surfaced under both names for consumers that group by cache layer). | | `metadata_cache_hits` | `default_metadata_cache().stats()` delta. | Per-query metadata-cache hits. | | `metadata_cache_misses` | Same source. | Per-query metadata-cache misses. | ### memory | Metric | Source | Description | |---|---|---| | `peak_memory_bytes` | Max of `RecordBatch::get_array_memory_size()` across drain batches. | Peak per-batch Arrow memory observed during the drain inside `execute_show_stats`. Approximate, not RSS-precise. Cite as "Arrow per-batch peak"; do not equate to OS-level memory. | ### spill DataFusion's external sort, grouped hash aggregate, and sort-merge join spill intermediate state to disk when the shared DeltaForge memory budget is exhausted. The engine wires an OS-managed `DiskManager` into the runtime (`session_factory::build_shared_runtime_env`) and the `DeltaForgeMemoryPool` returns `ResourcesExhausted` under pressure, which is the signal those operators use to spill and retry rather than fail. These three metrics are summed across every spill-capable operator in the executed plan, walked after the inner statement drains. The NULL-vs-zero split is load-bearing here. A spill-capable operator registers all three spill metrics (initialised to zero) the moment it builds its stream, so a sort that ran but stayed within budget reports `0`. An operator that cannot spill at all (a scan, a projection, a coalesce) never registers the metric, so a query with no sort / aggregate / join reports NULL. Reading NULL as "did not spill" is therefore wrong: it means "the plan had no operator that could spill". Read `0` as "a spillable operator ran and stayed in memory", and any non-zero value as "the query spilled". | Metric | Unit | Source | Description | |---|---|---|---| | `spilled_bytes` | bytes | `MetricsSet::spilled_bytes()` summed across plan nodes. | Total bytes written to disk spill files. The headline "bytes spilled to disk" number. NULL when no spill-capable operator ran; `0` when one ran without spilling; non-zero when the query spilled. A non-zero value means the working set exceeded the memory budget (`DELTAFORGE_BUFFERPOOL_*`); raise the budget or add a more selective predicate / projection to keep the operator in memory. | | `spill_count` | count | `MetricsSet::spill_count()` summed across plan nodes. | Number of spill FILES written (one per spill episode), not spilled batches. A high count relative to `spilled_bytes` indicates many small spill flushes. | | `spilled_rows` | rows | `MetricsSet::spilled_rows()` summed across plan nodes. | Rows written to disk spill files. Lets you reason about average spilled-row width (`spilled_bytes / spilled_rows`) and how much of the input was forced through disk. | ## Wired vs PLUMBED vs PENDING (three-state model) For published benchmark consumption, each metric is in one of three states: - **WIRED**: Source populates the metric, source value matches ground truth, integration test verifies it. Safe to cite. - **PLUMBED**: Source populates the metric, but no integration test yet verifies it against ground truth. Not safe to cite in a published benchmark even though the column has a value. - **PENDING**: Metric has a stable column name and source design but is not populated yet. Always NULL until the source lands. Disclose as "not measured". The per-metric state is tracked in `docs/show-stats-methodology.md`. As of the long-form pivot, the WIRED set covers identity (3) + plan files/row_groups/columns/bytes/rows (12) + actual times (11) + actual files/row_groups/chunks/bytes/rows (12) + dim cache (5) + memory (1) + spill (3). The spill triple is sourced from DataFusion's own spill baseline metrics; the presence and the NULL-vs-zero semantics are verified by `show_stats_edge_cases::spill_metrics_null_vs_zero`, and the spilled volume under memory pressure is exercised by DataFusion's own spill test suite. PLUMBED columns are the parquet-rs / parquet-io cache deltas, awaiting integration coverage. PENDING columns are `files_pruned_manifest`, `files_pruned_bloom`, `rows_filtered_predicate`. ## Zero-overhead default A bare `SELECT` outside SHOW STATS pays zero stats cost: no plan walks, no cache snapshots, no sink allocations. The instrumentation only runs while SHOW STATS is active. Specifically: - `PruningSink` and `BytesReadCounter` are installed on the `TaskContext`'s `SessionConfig` for the duration of the inner statement only. Scan execs that lift them with `task_ctx.session_config().get_extension::<...>()` see `None` outside SHOW STATS and skip the per-row-group register / per-batch record hooks. - `stats_capture_depth` is an `AtomicU64` on the executor. `execute_show_stats` increments it on entry, RAII-decrements on exit. Per-statement drain sites (`collect_to_execution_result`, dim fast path, GUI streaming/eager paths) gate plan-walk and cache snapshotting on `depth > 0`. Off-mode overhead is one `is_some()` check. - Cache snapshot deltas are computed only inside `execute_show_stats`; the global counters increment regardless of whether SHOW STATS is active, so no extra work is done on the hot path. The published-benchmark methodology requires running the query bare (for measurement) and running it under `SHOW STATS ACTUAL` (for the breakdown); the two numbers are reported separately. Running the same query inside SHOW STATS introduces measurable overhead from the instrumentation and is not the published wall-clock. ## Methodology - Long-form `(category, metric, value, unit)` shape is locked. Downstream tooling parses by metric name; column ordering and category names will not change. - `Option<u64>` discipline on every metric. NULL means "not wired or not populated for this codepath"; `0` means "wired and observed zero". Bare `u64` would leak silent zeros. - Single drain site: `LastStatementStats::build()` is the only place that constructs the record; `with_plan_slice` and `with_actual_slice` are the only ways to populate the slices. Anything else is a bug. - Bare `SHOW STATS` (no view selector) is rejected at the parser; the view is required. There is no path that accidentally surfaces "the plan" as if it were "what happened". - The full methodology, including the invariants benchmark consumers can assert (PLAN/ACTUAL parity on full scans, projection-pushdown lower bounds, LIMIT divergence), lives in `docs/show-stats-methodology.md`.

Parameters

Name	Type	Description
`view`		Selects which slice to return. PLAN reports the optimizer / pruning-ladder pre-fetch estimate (surviving files, surviving row groups, projected column chunks, planned bytes). ACTUAL reports the runtime measurements (phase timings, touched files / row groups / chunks, bytes read off disk vs from each cache, rows consumed and returned, peak memory). Bare SHOW STATS without a view is rejected at the parser; the explicit selector prevents publishing PLAN numbers as if they were ACTUAL, or vice versa.
`statement`		SQL statement to run under instrumentation. The statement is executed exactly once and its result rows are discarded; the SHOW STATS return value is the metrics record, not the inner data. The inner statement plans against the live session catalog, so it can reference any registered table or view.

Examples

-- PLAN: what will the optimizer read?
SHOW STATS PLAN SELECT * FROM warehouse.sales.orders WHERE order_date >= DATE '2026-01-01';

-- ACTUAL: what did the runtime actually consume?
SHOW STATS ACTUAL SELECT * FROM warehouse.sales.orders WHERE order_date >= DATE '2026-01-01';

-- Confirm column-projection pushdown is honoured at read time
SHOW STATS ACTUAL SELECT order_id, customer_id FROM warehouse.sales.orders;

-- LIMIT cancellation: PLAN reports the full footprint, ACTUAL stops early
SHOW STATS ACTUAL SELECT * FROM warehouse.sales.orders LIMIT 1000;

-- COUNT(*) metadata fast path: column_chunks_fetched should be 0
SHOW STATS ACTUAL SELECT COUNT(*) FROM warehouse.sales.orders;

-- Aggregate query: scope the metrics record to the inner statement
SHOW STATS ACTUAL SELECT region, SUM(total_amount) FROM warehouse.sales.orders GROUP BY region;

Pitfalls

Bare `SHOW STATS` without `PLAN` or `ACTUAL` is rejected. The view selector is required so a benchmark consumer cannot accidentally publish PLAN numbers as ACTUAL, or vice versa.
PLAN values are the optimizer's pre-fetch estimate, not measurements. Running `SHOW STATS PLAN <sql>` actually plans and executes the inner SQL (so the metrics record is complete), but the PLAN slice reflects pre-fetch decisions. Use ACTUAL when you want runtime numbers.
ACTUAL `bytes_read_disk` can be slightly higher than PLAN `bytes_planned_compressed` because the actual disk read includes parquet footer + page-index overhead beyond the column-chunk compressed sum. This is expected; the gap is the per-file metadata cost.
`bytes_from_page_cache` and `bytes_from_metadata_cache` are NULL on the default Delta-scan codepath (DataFusion + parquet-rs). Those caches only fire on the morsel-native fast path (per-table `use_owned_stack=true`). Reading the columns as "the caches are cold" is wrong: the query did not consult those caches at all. The DIM cache, by contrast, is served direct and fires on every dim-routable query regardless of scan path.
`rows_filtered_predicate` is currently NULL. The parquet-rs `RowFilter` callback wiring that would populate it is not yet in place. Do not cite this column in benchmarks until the methodology doc flips it from PENDING to WIRED.
Concurrent `SHOW STATS` calls against the same executor pollute each other's cache deltas: the global page-cache and metadata-cache counters increment on every event regardless of which call's session-config is active. For benchmark publication, run one SHOW STATS at a time.
`peak_memory_bytes` is the peak `RecordBatch::get_array_memory_size()` across batches inside `execute_show_stats`. It is approximate, not RSS-precise, and is bounded above by the largest single batch the operator pipeline produced. Cite as "Arrow per-batch peak".
PLAN's `row_groups_planned + row_groups_pruned_stats + row_groups_pruned_bloom + row_groups_pruned_nulls` should equal `total_row_groups`. If they do not, one of the pruning paths is reporting double-counts or the PRE-share survivor list capture has drifted; raise an issue rather than re-deriving the planned count locally.
First query on a weighted graph is slow: read cypher.csr.weights_load_ms After a restart the first `CALL algo.*` against a weighted graph re-reads the per-edge weights from the source parquet, because the `.dcsr` cache stores topology but not weights. Under `SHOW STATS ACTUAL` this shows up as `cypher.csr.source = disk` with a large `cypher.csr.weights_load_ms` (and a small `cypher.gpu.compute_ms`). It is not GPU or decompress cost. Run the algorithm once to warm the cache; subsequent runs report `source = memory-cache` and `load_ms` near zero.
`spill.spilled_bytes = NULL` does NOT mean "the query did not spill". It means the plan had no operator that CAN spill (a plain scan or projection never spills). Only sort, grouped hash aggregate, and sort-merge join register the spill metrics. A spillable operator that stayed within the memory budget reports `0`, not NULL. So read NULL as "not applicable", `0` as "ran and stayed in memory", and any non-zero value as "spilled". The spill metrics are ACTUAL-only; the PLAN view never emits them because spilling is a runtime decision, not a planner estimate.
`spill.spill_count` counts spill FILES, not spilled rows or batches. A single sort that spills its working set in several flushes increments `spill_count` once per flush file while `spilled_rows` accumulates every row. Do not read `spill_count` as a row or batch count; use `spilled_rows` for volume.