Find connected components (groups of mutually reachable nodes)
SELECT * FROM graph_components(table_name)
Connected components partitions an undirected graph into maximal groups of mutually reachable nodes. Two nodes belong to the same component if and only if a path exists between them, regardless of direction. This algorithm is a fundamental preprocessing step used to identify disconnected subgraphs, validate data integrity in network datasets, and scope downstream analyses (such as PageRank or community detection) to individual components. The function loads edge data from a registered Delta table into a Compressed Sparse Row (CSR) representation. Edge direction is treated as bidirectional for the purpose of reachability. The algorithm performs a series of BFS or union-find traversals, assigning each node a component identifier. Column names for source and target are auto-detected from standard conventions (src/source/src_id, dst/target/dst_id). The time complexity is O(V + E), where V is the vertex count and E is the edge count. This is a linear-time algorithm and scales efficiently to graphs with millions of edges. It is one of the cheapest graph operations and is commonly used as a first step before running more expensive algorithms. GPU acceleration is available for connected components. When a GPU accelerator is registered, the engine attempts a parallel label-propagation approach on the GPU and falls back to CPU if GPU resources are unavailable. The graph cache (256 MB default budget, LRU eviction, 10-minute idle timeout) retains the CSR topology between invocations, so repeated calls on the same table skip the graph construction phase.
| Name | Type | Description |
|---|---|---|
table_name | Specify the name of the registered Delta table containing edge data. The table must include source and target columns (auto-detected as src/source/src_id and dst/target/dst_id). Edge weights are ignored for component detection. |
CREATE TABLE network_edges AS
SELECT * FROM VALUES
(1, 2, 1.0),
(2, 3, 1.0),
(3, 1, 1.0),
(10, 11, 1.0),
(11, 12, 1.0)
AS t(src, dst, weight);
SELECT * FROM graph_components('network_edges');
SELECT COUNT(DISTINCT component_id) AS num_components
FROM graph_components('network_edges');
SELECT component_id, COUNT(*) AS size
FROM graph_components('network_edges')
GROUP BY component_id
ORDER BY size DESC;
SELECT node_id, component_id
FROM graph_components('network_edges')
WHERE component_id = (
SELECT component_id
FROM graph_components('network_edges')
GROUP BY component_id
ORDER BY COUNT(*) DESC
LIMIT 1
);
SELECT p.*
FROM graph_pagerank('network_edges') p
JOIN graph_components('network_edges') c
ON p.node_id = c.node_id
WHERE c.component_id = (
SELECT component_id
FROM graph_components('network_edges')
GROUP BY component_id
ORDER BY COUNT(*) DESC
LIMIT 1
);