Combines rows from two or more tables based on a related column.
<table1> [INNER | LEFT | RIGHT | FULL | CROSS | NATURAL] JOIN <table2>
ON <condition>
-- or: USING (<columns>)
## Overview JOIN clauses combine rows from two or more tables (or subqueries) based on a related column or an arbitrary predicate. Joins are the primary mechanism for navigating relationships between normalized tables. In DeltaForge, joins execute within the DataFusion query engine using hash join, sort-merge join, or nested-loop join strategies, selected automatically by the cost-based optimizer. ## Execution Order JOIN is evaluated as part of the FROM clause, which is the first logical step in query processing. When multiple tables are joined, the optimizer determines the physical join order using cost-based join reordering. The algorithm enumerates join orderings with dynamic programming (based on Selinger's classic approach) and selects the plan with the lowest estimated cost, weighting cardinality at 70% and data size at 30% by default. ## Behavior - **INNER JOIN**: Returns only rows where the join condition evaluates to true in both tables. This is the default join type when JOIN is written without a qualifier. - **LEFT [OUTER] JOIN**: Returns all rows from the left table. For rows with no match in the right table, the right-side columns are filled with NULL. - **RIGHT [OUTER] JOIN**: Returns all rows from the right table. Non-matching left-side columns are NULL. - **FULL [OUTER] JOIN**: Returns all rows from both tables. Non-matching sides are filled with NULL. - **CROSS JOIN**: Produces the Cartesian product of both tables. No ON clause is permitted. Use with caution on large tables. - **NATURAL JOIN**: Implicitly joins on all columns that share the same name in both tables. The shared columns appear once in the output rather than twice. - **USING clause**: An alternative to ON for equi-joins. `USING (col)` is equivalent to `ON left.col = right.col`, and the shared column appears once in the result. - **NULL handling**: In equi-join conditions, NULL = NULL evaluates to UNKNOWN (not TRUE), so rows with NULL join keys do not match. Use IS NOT DISTINCT FROM for NULL-safe equality. - **Join reordering**: The optimizer reorders inner joins to minimize intermediate result sizes. It uses table statistics (row counts, column cardinality) collected from Parquet file metadata. Left, right, and full outer joins have constrained reordering since changing order changes semantics. - **Predicate pushdown across joins**: WHERE predicates that reference only one side of a join are pushed below the join into a scan-level filter, enabling partition pruning and data skipping on that table independently. - **Skew join optimization**: When one side of a join has heavily skewed key distribution, the optimizer can split the skewed keys into separate partitions to avoid data hotspots during hash join execution. - **Runtime filters**: For joins where the build side (typically the smaller table) is much smaller than the probe side, the optimizer injects a runtime bloom filter derived from the build side. This filter prunes probe-side files before they are read, significantly reducing I/O. ## Compatibility All join types conform to the SQL:2016 standard. DeltaForge supports INNER, LEFT, RIGHT, FULL OUTER, CROSS, and NATURAL joins, along with the USING clause. Semi-joins and anti-joins are available implicitly through EXISTS/NOT EXISTS and IN/NOT IN subqueries, which the optimizer rewrites into semi-join and anti-join physical operators when beneficial.
-- Inner join: only matching rows from both tables
SELECT o.id, c.name, o.amount
FROM orders o
INNER JOIN customers c ON o.customer_id = c.id;
-- Left join: all rows from the left table, NULLs where no match
SELECT e.name, d.department_name
FROM employees e
LEFT JOIN departments d ON e.dept_id = d.id;
-- Right join: all rows from the right table
SELECT o.id, p.product_name
FROM order_items o
RIGHT JOIN products p ON o.product_id = p.id;
-- Full outer join: all rows from both tables
SELECT a.account_id, b.balance
FROM accounts_2024 a
FULL JOIN accounts_2025 b ON a.account_id = b.account_id;
-- Cross join: Cartesian product of two tables
SELECT d.date_key, p.product_id
FROM dim_date d
CROSS JOIN dim_product p;
-- Natural join: implicit equi-join on columns with matching names
SELECT *
FROM orders
NATURAL JOIN customers;