Aggregate Functions

Functions that compute a single result from a set of rows.

Category: select-grammar

Syntax

<func>([DISTINCT] <expr>) [FILTER (WHERE <condition>)]

Description

## Overview Aggregate functions compute a single result value from a set of input rows. They collapse multiple rows into one output row (or one per group when combined with GROUP BY). Common aggregates include COUNT, SUM, AVG, MIN, and MAX. DeltaForge extends the standard set with approximate aggregates (APPROX_COUNT_DISTINCT, APPROX_PERCENTILE), statistical functions (VARIANCE, STDDEV, CORR, COVAR), bitwise aggregates (BIT_AND, BIT_OR, BIT_XOR), and logical aggregates (BOOL_AND, BOOL_OR, EVERY). ## Execution Order Aggregation occurs after FROM, JOIN, and WHERE filtering, as part of the GROUP BY evaluation phase. If no GROUP BY is present, the entire filtered result set is treated as a single group. HAVING filters run after aggregation. Aggregate results are then available to the SELECT list, ORDER BY, and window functions. ## Behavior - **DISTINCT modifier**: When DISTINCT is specified, the aggregate operates only on unique values. `COUNT(DISTINCT col)` counts unique non-NULL values. `SUM(DISTINCT col)` sums unique values. DISTINCT can be applied to most aggregate functions. - **FILTER clause**: The FILTER (WHERE ...) clause restricts which rows are included in the aggregate computation without affecting other aggregates in the same query. This is more readable and often more efficient than CASE WHEN expressions for conditional aggregation. - **NULL handling**: All standard aggregate functions ignore NULL input values except COUNT(*), which counts rows regardless of NULL. `SUM(col)` over an all-NULL group returns NULL (not 0). `COUNT(col)` over an all-NULL group returns 0. `COUNT(*)` counts all rows including those with NULLs. - **Aggregate pushdown**: For simple aggregates without GROUP BY or with partition-aligned grouping, the optimizer can compute COUNT(*), MIN(col), and MAX(col) directly from Parquet file metadata statistics. This avoids reading row data entirely and can resolve queries in milliseconds regardless of table size. - **Approximate aggregates**: APPROX_COUNT_DISTINCT uses HyperLogLog to estimate cardinality with typical error below 2%. APPROX_PERCENTILE uses t-digest for approximate quantile computation. These are significantly faster than exact equivalents on large datasets. - **Statistical aggregates**: VARIANCE (VAR_SAMP), VAR_POP, STDDEV (STDDEV_SAMP), STDDEV_POP compute variance and standard deviation. CORR and COVAR_SAMP/COVAR_POP compute correlation and covariance between two columns. - **String aggregation**: STRING_AGG concatenates string values with a delimiter, with optional ORDER BY to control concatenation order. - **Nested aggregates**: SQL does not permit directly nesting aggregate functions (e.g., AVG(SUM(x)) is invalid). Use a subquery or CTE to compute the inner aggregate first, then apply the outer aggregate to the intermediate result. - **Window context**: Any aggregate function can also be used as a window function by adding an OVER clause. In that context, it computes over the window frame rather than the entire group. ## Compatibility COUNT, SUM, AVG, MIN, and MAX conform to the SQL:1992 standard. The FILTER clause conforms to SQL:2003. DISTINCT inside aggregates conforms to SQL:1992. Approximate aggregates (APPROX_COUNT_DISTINCT, APPROX_PERCENTILE), statistical aggregates, and STRING_AGG are widely supported extensions.

Examples

-- Basic aggregates: COUNT, SUM, AVG, MIN, MAX
SELECT
  COUNT(*) AS total_orders,
  SUM(amount) AS total_revenue,
  AVG(amount) AS avg_order_value,
  MIN(amount) AS smallest_order,
  MAX(amount) AS largest_order
FROM orders;

-- COUNT DISTINCT: unique values
SELECT COUNT(DISTINCT customer_id) AS unique_customers
FROM orders;

-- FILTER clause: conditional aggregation
SELECT
  COUNT(*) FILTER (WHERE status = 'shipped') AS shipped_count,
  COUNT(*) FILTER (WHERE status = 'pending') AS pending_count,
  COUNT(*) FILTER (WHERE status = 'cancelled') AS cancelled_count
FROM orders;

-- Grouped aggregation
SELECT region, product_category,
  SUM(amount) AS revenue,
  COUNT(*) AS order_count
FROM orders
GROUP BY region, product_category
ORDER BY revenue DESC;

-- BOOL_AND / BOOL_OR: logical aggregates
SELECT department,
  BOOL_AND(is_active) AS all_active,
  BOOL_OR(is_manager) AS has_manager
FROM employees
GROUP BY department;

-- STRING_AGG: concatenate values
SELECT department,
  STRING_AGG(name, ', ' ORDER BY name) AS member_list
FROM employees
GROUP BY department;

Aggregate Functions

Syntax

Description

Examples

See Also