Connected Components

Find connected components (groups of mutually reachable nodes)

Category: community-detection

Syntax

SELECT * FROM graph_components(table_name)

Description

Connected components partitions an undirected graph into maximal groups of mutually reachable nodes. Two nodes belong to the same component if and only if a path exists between them, regardless of direction. This algorithm is a fundamental preprocessing step used to identify disconnected subgraphs, validate data integrity in network datasets, and scope downstream analyses (such as PageRank or community detection) to individual components. The function loads edge data from a registered Delta table into a Compressed Sparse Row (CSR) representation. Edge direction is treated as bidirectional for the purpose of reachability. The algorithm performs a series of BFS or union-find traversals, assigning each node a component identifier. Column names for source and target are auto-detected from standard conventions (src/source/src_id, dst/target/dst_id). The time complexity is O(V + E), where V is the vertex count and E is the edge count. This is a linear-time algorithm and scales efficiently to graphs with millions of edges. It is one of the cheapest graph operations and is commonly used as a first step before running more expensive algorithms. GPU acceleration is available for connected components. When a GPU accelerator is registered, the engine attempts a parallel label-propagation approach on the GPU and falls back to CPU if GPU resources are unavailable. The graph cache (256 MB default budget, LRU eviction, 10-minute idle timeout) retains the CSR topology between invocations, so repeated calls on the same table skip the graph construction phase.

Parameters

NameTypeDescription
table_nameSpecify the name of the registered Delta table containing edge data. The table must include source and target columns (auto-detected as src/source/src_id and dst/target/dst_id). Edge weights are ignored for component detection.

Examples

CREATE TABLE network_edges AS
SELECT * FROM VALUES
  (1, 2, 1.0),
  (2, 3, 1.0),
  (3, 1, 1.0),
  (10, 11, 1.0),
  (11, 12, 1.0)
AS t(src, dst, weight);
SELECT * FROM graph_components('network_edges');
SELECT COUNT(DISTINCT component_id) AS num_components
FROM graph_components('network_edges');
SELECT component_id, COUNT(*) AS size
FROM graph_components('network_edges')
GROUP BY component_id
ORDER BY size DESC;
SELECT node_id, component_id
FROM graph_components('network_edges')
WHERE component_id = (
  SELECT component_id
  FROM graph_components('network_edges')
  GROUP BY component_id
  ORDER BY COUNT(*) DESC
  LIMIT 1
);
SELECT p.*
FROM graph_pagerank('network_edges') p
JOIN graph_components('network_edges') c
  ON p.node_id = c.node_id
WHERE c.component_id = (
  SELECT component_id
  FROM graph_components('network_edges')
  GROUP BY component_id
  ORDER BY COUNT(*) DESC
  LIMIT 1
);

Pitfalls

See Also

Open in interactive docs →   DeltaForge home →