Compute pairwise similarity between nodes based on shared neighbors
SELECT * FROM graph_similarity(table_name [, node_id] [, metric])
Node similarity computes pairwise structural similarity between nodes based on the overlap of their neighbor sets. Unlike attribute-based similarity, this function uses only the graph topology to determine how alike two nodes are. It supports three metrics: Jaccard (intersection over union of neighbor sets, default), cosine (angle between neighborhood indicator vectors), and overlap coefficient (intersection over the smaller set size). This function is foundational for link prediction, duplicate detection, and collaborative filtering. The function loads edge data from a registered Delta table as an undirected graph in Compressed Sparse Row (CSR) format. When invoked without a node_id argument it returns similarity scores for all qualifying node pairs in the graph; supplying node_id restricts output to pairs that include the given node. The result set has a fixed three-column shape (node_a, node_b, similarity), one row per pair. Column names for source and target are auto-detected from standard conventions (src/source/src_id, dst/target/dst_id). The time complexity for a single pair is O(d1 + d2) where d1 and d2 are the degrees of the two nodes (neighbor-set intersection via sorted merge). All-pairs computation costs O(V^2 * d_avg), which can be prohibitive for large graphs. Restricting to a single node reduces cost to O(V * d_avg). For large-scale similarity searches, prefer graph_knn which returns the top-k neighbors per node instead of materialising all pairs. GPU acceleration is not currently available for node similarity. All computation runs on the CPU. The graph cache (256 MB default budget, LRU eviction, 10-minute idle timeout) retains the CSR topology, so repeated similarity queries on the same table reuse the cached graph.
| Name | Type | Description |
|---|---|---|
table_name | Specify the name of the registered Delta table containing edge data. The table must include source and target columns (auto-detected as src/source/src_id and dst/target/dst_id). The graph is loaded as undirected for similarity computation. Edge weights are ignored. | |
node_id | Restricts the computation to pairs involving this node. When omitted, the function returns similarity scores for all qualifying node pairs in the graph. Must be a BIGINT node ID present in the edge table. | |
metric | Specifies the similarity metric used for scoring. Valid values: 'jaccard' (default, intersection over union of neighbor sets), 'cosine' (angle between neighborhood indicator vectors), and 'overlap' (intersection divided by the minimum neighbor-set size, also known as Szymkiewicz-Simpson). Metric names are case-insensitive; unknown values fall back to the default. |
CREATE TABLE collaborations AS
SELECT * FROM VALUES
(1, 2, 1.0), (1, 3, 1.0), (1, 4, 1.0),
(2, 3, 1.0), (2, 5, 1.0),
(3, 4, 1.0), (3, 5, 1.0),
(4, 6, 1.0)
AS t(src, dst, weight);
SELECT * FROM graph_similarity('collaborations');
SELECT node_a, node_b, similarity
FROM graph_similarity('collaborations', 1)
ORDER BY similarity DESC;
SELECT node_a, node_b, similarity
FROM graph_similarity('collaborations', NULL, 'cosine')
WHERE similarity > 0.5
ORDER BY similarity DESC;
WITH ranked AS (
SELECT
node_a,
node_b,
similarity,
ROW_NUMBER() OVER (PARTITION BY node_a ORDER BY similarity DESC) AS rn
FROM graph_similarity('collaborations', NULL, 'overlap')
)
SELECT node_a, node_b AS best_match, similarity
FROM ranked
WHERE rn = 1;