Node Similarity

Compute pairwise similarity between nodes based on shared neighbors

Category: similarity

Syntax

SELECT * FROM graph_similarity(table_name [, node_id] [, metric])

Description

Node similarity computes pairwise structural similarity between nodes based on the overlap of their neighbor sets. Unlike attribute-based similarity, this function uses only the graph topology to determine how alike two nodes are. It supports three metrics: Jaccard (intersection over union of neighbor sets, default), cosine (angle between neighborhood indicator vectors), and overlap coefficient (intersection over the smaller set size). This function is foundational for link prediction, duplicate detection, and collaborative filtering.

The function loads edge data from a registered Delta table as an undirected graph in Compressed Sparse Row (CSR) format. When invoked without a node_id argument it returns similarity scores for all qualifying node pairs in the graph; supplying node_id restricts output to pairs that include the given node. The result set has a fixed three-column shape (node_a, node_b, similarity), one row per pair. Column names for source and target are auto-detected from standard conventions (src/source/src_id, dst/target/dst_id).

The time complexity for a single pair is O(d1 + d2) where d1 and d2 are the degrees of the two nodes (neighbor-set intersection via sorted merge). All-pairs computation costs O(V^2 * d_avg), which can be prohibitive for large graphs. Restricting to a single node reduces cost to O(V * d_avg). For large-scale similarity searches, prefer graph_knn which returns the top-k neighbors per node instead of materialising all pairs.

GPU acceleration is not wired through the graph_similarity table function, so an ON GPU hint on a SQL query that uses it is silently ignored and this metric report runs on the CPU. Note that the table function and the Cypher CALL are two different things: graph_similarity(table, node1, node2) reports structural metrics (jaccard, common_neighbors, adamic_adar), whereas the separate Cypher CALL surface (USE ... ON GPU ... CALL algo.similarity) does expose a GPU dispatch and returns a pairwise score (node1Id, node2Id, score, with aliases source_id, target_id, similarity).

The graph cache (256 MB default budget, LRU eviction, 10-minute idle timeout) retains the CSR topology, so repeated similarity queries on the same table reuse the cached graph.
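
The three metrics and the sorted-merge intersection described above can be sketched in a few lines of Python. This is an illustrative reference under stated assumptions, not the engine's implementation: the helper names count_common and pair_similarity are hypothetical, and returning 0.0 for a node with no neighbors is an assumption of the sketch.

```python
from math import sqrt


def count_common(a, b):
    """Count shared items of two sorted, duplicate-free lists in O(len(a) + len(b))."""
    i = j = common = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            common += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return common


def pair_similarity(neighbors_a, neighbors_b, metric="jaccard"):
    """Score one node pair from its two neighbor lists (hypothetical helper)."""
    # Sorting here is for convenience; a CSR layout would typically
    # already store each neighbor list in sorted order.
    a = sorted(set(neighbors_a))
    b = sorted(set(neighbors_b))
    if not a or not b:
        return 0.0  # assumption: a node without neighbors shares nothing
    common = count_common(a, b)
    if metric == "jaccard":
        return common / (len(a) + len(b) - common)  # |A ∩ B| / |A ∪ B|
    if metric == "cosine":
        return common / sqrt(len(a) * len(b))       # binary indicator vectors
    if metric == "overlap":
        return common / min(len(a), len(b))         # Szymkiewicz-Simpson
    raise ValueError(f"unknown metric: {metric!r}")


# Neighbor lists of nodes 1 and 5 in the 'collaborations' example graph.
n1, n5 = [2, 3, 4], [2, 3]
scores = {m: pair_similarity(n1, n5, m) for m in ("jaccard", "cosine", "overlap")}
print(scores)
```

Nodes 1 and 5 share neighbors 2 and 3, so Jaccard is 2/3, cosine is 2/sqrt(6) (about 0.816), and overlap is 1.0. Unlike the SQL function, this sketch raises on an unknown metric name instead of falling back to Jaccard.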

Parameters

Name	Type	Description
`table_name`		Name of the registered Delta table containing edge data. The table must include source and target columns (auto-detected as src/source/src_id and dst/target/dst_id). The graph is loaded as undirected for similarity computation. Edge weights are ignored.
`node_id`		Restricts the computation to pairs involving this node. When omitted, the function returns similarity scores for all qualifying node pairs in the graph. Must be a BIGINT node ID present in the edge table.
`metric`		Specifies the similarity metric used for scoring. Valid values: 'jaccard' (default, intersection over union of neighbor sets), 'cosine' (angle between neighborhood indicator vectors), and 'overlap' (intersection divided by the minimum neighbor-set size, also known as Szymkiewicz-Simpson). Metric names are case-insensitive; unknown values fall back to the default.
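
Because unknown metric values fall back to the default rather than failing, it can help to validate the string on the client before building the query. The following is a minimal sketch of such a check. The helper normalize_metric and the VALID_METRICS tuple are hypothetical names and are not part of the function's API.

```python
# Metric names taken from the parameter description above.
VALID_METRICS = ("jaccard", "cosine", "overlap")


def normalize_metric(name=None):
    """Return a lower-cased metric name, or raise if it is not recognised.

    Hypothetical client-side helper: it mirrors the documented
    case-insensitivity but rejects typos instead of falling back.
    """
    if name is None:
        return "jaccard"  # documented default
    key = name.lower()
    if key not in VALID_METRICS:
        raise ValueError(
            f"unknown metric {name!r}; expected one of {', '.join(VALID_METRICS)}"
        )
    return key


print(normalize_metric("Cosine"))
```

Calling normalize_metric("jackard") raises ValueError, which surfaces the typo before the query runs.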

Examples

-- Sample edge table. The weight column is ignored by graph_similarity.
CREATE TABLE collaborations AS
SELECT * FROM VALUES
  (1, 2, 1.0), (1, 3, 1.0), (1, 4, 1.0),
  (2, 3, 1.0), (2, 5, 1.0),
  (3, 4, 1.0), (3, 5, 1.0),
  (4, 6, 1.0)
AS t(src, dst, weight);

-- All qualifying pairs, scored with the default metric (Jaccard).
SELECT * FROM graph_similarity('collaborations');

-- Only pairs that include node 1, most similar first.
SELECT node_a, node_b, similarity
FROM graph_similarity('collaborations', 1)
ORDER BY similarity DESC;

-- All pairs (NULL in the node_id position), scored with cosine similarity.
SELECT node_a, node_b, similarity
FROM graph_similarity('collaborations', NULL, 'cosine')
WHERE similarity > 0.5
ORDER BY similarity DESC;

-- Best match per node_a under the overlap coefficient.
WITH ranked AS (
  SELECT
    node_a,
    node_b,
    similarity,
    ROW_NUMBER() OVER (PARTITION BY node_a ORDER BY similarity DESC) AS rn
  FROM graph_similarity('collaborations', NULL, 'overlap')
)
SELECT node_a, node_b AS best_match, similarity
FROM ranked
WHERE rn = 1;
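
To cross-check results on the sample data, the Jaccard scores implied by the definitions in the Description can be recomputed by hand. The sketch below is not captured engine output. It assumes each unordered pair appears once with the smaller ID first, and that pairs with no shared neighbor are dropped; the function's actual row order and column orientation may differ.

```python
from itertools import combinations

# Edge list copied from the CREATE TABLE example (weights dropped).
edges = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 5), (3, 4), (3, 5), (4, 6)]

# Build undirected neighbor sets.
neighbors = {}
for src, dst in edges:
    neighbors.setdefault(src, set()).add(dst)
    neighbors.setdefault(dst, set()).add(src)

# Score every unordered pair with Jaccard; skip pairs with no shared neighbor.
rows = []
for a, b in combinations(sorted(neighbors), 2):
    common = neighbors[a] & neighbors[b]
    if common:
        union = neighbors[a] | neighbors[b]
        rows.append((a, b, len(common) / len(union)))

# Highest similarity first, mirroring ORDER BY similarity DESC.
rows.sort(key=lambda r: r[2], reverse=True)
for row in rows:
    print(row)
```

Of the 15 possible pairs among the six nodes, 12 share at least one neighbor. Under these assumptions the top pair is (1, 5) with a score of 2/3, while (2, 6), (4, 6), and (5, 6) have no overlap and are absent.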

Pitfalls

All-pairs similarity is quadratic in the number of nodes. A graph with 100,000 nodes produces up to 5 billion pairs and will exhaust memory on all but the largest machines. Restrict to a single node via the node_id argument, pre-filter the source edge table, or use graph_knn for top-k retrieval instead.
Pairs with zero shared neighbors are omitted from the result rather than returned with similarity 0.0. A missing pair therefore means no structural overlap. If you LEFT JOIN the result against a cross-product of nodes, handle the resulting NULLs explicitly, for example with COALESCE(similarity, 0.0).
An unrecognised metric silently falls back to Jaccard rather than raising an error. Misspelling 'jackard' or 'cosin' may produce results that appear correct but differ from the intended metric. Validate the metric string.
Similarity is derived from the undirected neighbor set. Directed edges are merged into a single undirected neighbor list, which discards role asymmetries (e.g. follower vs followee). If direction matters, encode it in the source data before loading, because the function always treats the graph as undirected.
The node_id argument cannot filter to a specific pair. Only pairs that include the given node are returned. Filter the result set with an additional WHERE node_b = <other_id> predicate when scoring exactly one pair.
GPU acceleration is not wired through the graph_similarity table function, so an ON GPU hint on a SQL query that uses it is silently ignored and the all-pairs metric report stays CPU-bound even when other pipeline stages use GPU. The table function (a CPU-only metric report) is a different thing from the Cypher CALL surface (USE ... ON GPU ... CALL algo.similarity), which does expose a GPU dispatch and returns a pairwise score with columns node1Id, node2Id, score (and aliases source_id, target_id, similarity).
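
The quadratic growth in the first pitfall is easy to quantify. The sketch below counts unordered pairs and gives a rough upper bound on result size. The 24 bytes per row (two 8-byte IDs plus an 8-byte score) is an assumption for illustration, not a documented storage format, and the real row count is lower because pairs with no shared neighbor are omitted.

```python
def max_pairs(num_nodes):
    """Number of unordered node pairs: n * (n - 1) / 2."""
    return num_nodes * (num_nodes - 1) // 2


# Assumption for illustration only: two 8-byte node IDs plus an 8-byte score.
BYTES_PER_ROW = 24

pairs = max_pairs(100_000)
approx_gb = pairs * BYTES_PER_ROW / 1e9

print(pairs)      # 4999950000, i.e. just under 5 billion
print(approx_gb)  # roughly 120 (decimal gigabytes)
```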