SELECT

Retrieves data from tables with filtering, grouping, and ordering.

Category: select-grammar

Syntax

SELECT [DISTINCT | ALL] [TOP <n> [PERCENT]]
  <columns>
FROM <source>
[WHERE <condition>]
[GROUP BY <columns>]
[HAVING <condition>]
[ORDER BY <columns>]
[LIMIT <n>] [OFFSET <n>]

Description

## Overview SELECT is the foundational query statement in SQL. It retrieves zero or more rows from one or more sources (tables, views, subqueries, or table-valued functions) and applies an ordered pipeline of filtering, grouping, aggregation, and sorting operations to produce a result set. In DeltaForge, SELECT queries execute against Delta Lake tables backed by Parquet data files, with the query engine (built on Apache DataFusion) applying extensive optimizations before any data is read from storage. ## Execution Order The logical evaluation order of a SELECT statement follows the SQL standard, not the textual order of the clauses: 1. **FROM** and **JOIN** - identify source tables and combine rows 2. **WHERE** - filter individual rows before grouping 3. **GROUP BY** - partition rows into groups 4. **HAVING** - filter groups after aggregation 5. **SELECT** - evaluate expressions and column references 6. **DISTINCT** - remove duplicate result rows 7. **ORDER BY** - sort the final result set 8. **LIMIT / OFFSET** - restrict the number of returned rows Column aliases defined in the SELECT list are not visible to WHERE or HAVING because those clauses are evaluated earlier in the logical pipeline. ## Behavior - **DISTINCT vs. ALL**: DISTINCT eliminates duplicate rows from the result set. ALL (the default) retains all rows including duplicates. DISTINCT compares entire rows; two rows are duplicates only if every column value is equal, treating two NULLs as equal for this comparison. - **TOP**: TOP N limits the result to the first N rows (or N percent of rows when PERCENT is specified). TOP is rewritten internally to LIMIT for execution. - **Star expansion**: `*` expands to all columns of all sources in the FROM clause. `alias.*` expands to all columns from a specific table alias. Star expansion occurs during query planning, so the resolved column list reflects the table schema at planning time. - **Predicate pushdown**: WHERE predicates on partition columns enable partition pruning, skipping entire file groups without reading data. Predicates on non-partition columns leverage min/max file statistics and bloom filters to skip Parquet files that cannot contain matching rows. - **Data skipping**: The optimizer evaluates WHERE predicates against per-file statistics at four levels: partition values, min/max column statistics, bloom filter membership, and finally row-level filtering. This layered approach is always conservative and never skips files that may contain matching data. - **NULL handling**: Comparisons involving NULL follow standard three-valued logic. `WHERE col = NULL` matches no rows; use `IS NULL` instead. NULLs propagate through arithmetic and most function calls. - **FETCH FIRST syntax**: DeltaForge also accepts the SQL-standard `FETCH FIRST N ROWS ONLY` syntax, which is automatically rewritten to `LIMIT N` before execution. ## Compatibility SELECT conforms to the SQL:2016 standard with the following DeltaForge extensions: TOP with PERCENT (T-SQL style), FETCH FIRST rewriting, and integration with Delta Lake partition pruning and file-level data skipping. All standard SELECT features including DISTINCT, column aliasing, and expression evaluation are fully supported.

Examples

-- Basic column selection with a filter
SELECT name, age
FROM customers
WHERE active = true
ORDER BY name;
-- Eliminate duplicate rows
SELECT DISTINCT region
FROM orders;
-- Retrieve the top 10 rows from a large table
SELECT TOP 10 *
FROM large_table;
-- Column aliasing and expression evaluation
SELECT
  product_name AS product,
  unit_price * quantity AS line_total,
  CASE WHEN unit_price > 100 THEN 'premium' ELSE 'standard' END AS tier
FROM order_items;
-- Combine filtering, grouping, and ordering
SELECT region, COUNT(*) AS order_count, SUM(amount) AS total_revenue
FROM orders
WHERE order_date >= '2025-01-01'
GROUP BY region
HAVING SUM(amount) > 10000
ORDER BY total_revenue DESC;
-- TOP with PERCENT
SELECT TOP 5 PERCENT
  employee_id, salary
FROM employees
ORDER BY salary DESC;

See Also

Open in interactive docs →   DeltaForge home →