ANALYZE TABLE

Computes histogram statistics and optional bloom filters for table columns to improve query optimization.

Category: statisticsDeltaForge extension

Syntax

ANALYZE TABLE <table> COMPUTE STATISTICS
  [FOR COLUMNS (<columns>) | FOR ALL COLUMNS]
  [METHOD <method>] [BINS <n>] [SAMPLE RATE <rate>]
  [TOPK <n>] [RDC THRESHOLD <t>]
  [BLOOM FILTER [ONLY] COLUMNS (<columns>)]

Description

## Overview Computes column-level histogram statistics and optional Parquet-native bloom filter indexes for a Delta table. These statistics are used by the query optimizer to improve selectivity estimation, join ordering, and file pruning. Statistics are stored as domain metadata in the Delta transaction log, making them available to all subsequent queries without additional configuration. ## Behavior - The command scans table data (or a sample of it) to build histogram representations for the specified columns. - When no columns are specified, all numeric columns (integer, float, decimal) are analyzed. String and date columns are included only with FOR ALL COLUMNS or explicit column lists. - The TABLE keyword after ANALYZE is optional. Both ANALYZE TABLE orders and ANALYZE orders are accepted. - The COMPUTE STATISTICS keywords are consumed but optional in practice; the parser accepts their absence. - Histogram statistics are written as a HistogramStatisticsMetadata domain metadata action in the Delta log. This makes them part of the table's versioned history. - When BLOOM FILTER COLUMNS are specified, the command performs two operations: (1) updates table properties (delta.bloomFilter.columns, delta.bloomFilter.fpp, delta.bloomFilter.numItems) and (2) rewrites existing Parquet files to embed Split Block Bloom Filters (SBBF). Future writes automatically include bloom filters for configured columns. - BLOOM FILTER ONLY COLUMNS causes the command to create bloom filters without computing histogram statistics. - The SAMPLE RATE option uses random row sampling. Results are approximate when the rate is below 1.0, but computation time scales linearly with the sample rate. - The FLAT method uses the Randomized Dependence Coefficient to identify columns that are strongly correlated with query predicates before building histograms, reducing unnecessary computation on low-value columns. ## Variants ### Bloom Filter Only When BLOOM FILTER ONLY COLUMNS is specified, only bloom filter indexes are created. No histogram statistics are computed. This is useful when only point-lookup optimization is needed. ## Access Control | Privilege | Object | Notes | |-----------|--------|-------| | Write access | Target table | Required because the command modifies table metadata and may rewrite files. | ## Compatibility ANALYZE TABLE COMPUTE STATISTICS is a DeltaForge extension. The bloom filter implementation uses the Parquet-native Split Block Bloom Filter (SBBF) format, which is compatible with other Delta Lake readers that support Parquet SBBF.

Parameters

Name	Type	Description
`table`		Specifies the table name or path to analyze. Supports fully qualified names (zone.schema.table).
`columns`		Specifies the columns to compute histogram statistics for. When omitted, all numeric columns are analyzed by default. Use FOR ALL COLUMNS to explicitly analyze every column regardless of type.
`method`		Specifies the histogram construction method. Valid values: equi_height (equal-height bins), streaming (single-pass approximate), maxdiff (maximum difference boundaries), compressed (frequent values plus equi-height residual), hybrid (compressed with frequency tracking), v_optimal (minimizes variance within bins), flat (uses RDC correlation to select columns). Default: equi_height.
`num_bins`		Specifies the number of histogram bins. More bins provide finer granularity for selectivity estimation but increase storage overhead. Accepted via BINS or NUM_BINS.
`sample_rate`		Specifies the fraction of rows to sample for statistics computation. Valid range: 0.0 to 1.0. A value of 1.0 scans all rows. Lower values reduce computation time at the cost of accuracy. Accepted via SAMPLE RATE or SAMPLE.
`top_k`		Specifies the number of most-frequent values to track. Applies to the compressed and hybrid histogram methods. Frequent values are stored separately from the histogram bins for precise selectivity estimation.
`rdc_threshold`		Specifies the Randomized Dependence Coefficient threshold for the FLAT method. Columns with inter-column correlation above this threshold are selected for analysis. Valid range: 0.0 to 1.0.
`bloom_filter_columns`		Specifies columns for which to create Parquet-native bloom filter indexes. Bloom filters enable efficient file-level pruning for equality predicates. When specified, table properties are updated and existing files are rewritten with embedded bloom filters.

Examples

-- Analyze specific columns with default equi-height histograms
ANALYZE TABLE orders COMPUTE STATISTICS FOR COLUMNS (customer_id, product_id);

-- Analyze all columns
ANALYZE TABLE orders COMPUTE STATISTICS FOR ALL COLUMNS;

-- Use hybrid method with 100 bins and top-10 frequent values
ANALYZE TABLE orders COMPUTE STATISTICS METHOD hybrid BINS 100 TOPK 10;

-- Sampled analysis for a very large table
ANALYZE TABLE events COMPUTE STATISTICS
  FOR COLUMNS (event_type, user_id, region)
  SAMPLE RATE 0.1;

-- Create bloom filters for point-lookup columns
ANALYZE TABLE orders COMPUTE STATISTICS BLOOM FILTER COLUMNS (order_id, txn_id);

-- Combined: histograms on numeric columns plus bloom filters on ID columns
ANALYZE TABLE sales COMPUTE STATISTICS
  FOR COLUMNS (amount, quantity, discount)
  METHOD equi_height BINS 200
  BLOOM FILTER COLUMNS (sale_id, customer_id);

Pitfalls

Specifying BLOOM FILTER COLUMNS triggers a full table rewrite to embed bloom filters in existing Parquet files. On large tables, this can be a long-running operation.
Statistics computed with a low SAMPLE RATE may produce inaccurate histograms for skewed data distributions. Consider using a rate of at least 0.1 for reliable selectivity estimates.
Running ANALYZE TABLE does not automatically invalidate cached query plans. Subsequent queries will pick up the new statistics, but currently executing queries continue using the previous statistics.