K-Nearest Neighbors

Find the k most similar nodes for each node based on graph structure

Category: similarity

Syntax

SELECT * FROM graph_knn(table_name [, k] [, metric])

Description

K-nearest neighbors (KNN) identifies the k most structurally similar nodes for each node in the graph. Similarity is measured by the overlap of neighbor sets rather than by attribute values. Two nodes that share many common neighbors are considered structurally similar, making this algorithm valuable for recommendation systems, link prediction, and entity resolution in graph-structured data. The function loads edge data from a registered Delta table as an undirected graph in Compressed Sparse Row (CSR) format. For each node, the algorithm computes a similarity score against all other nodes using the selected metric (Jaccard by default), then retains the top k results. Jaccard similarity is defined as the size of the intersection of two neighbor sets divided by the size of their union. Column names for source and target are auto-detected from standard conventions (src/source/src_id, dst/target/dst_id). The time complexity depends on the similarity metric. For Jaccard similarity, computing all pairs requires O(V * d_max^2) in the worst case, where d_max is the maximum degree. In practice, sparse graphs with low average degree are much cheaper. Setting a smaller k does not reduce the comparison cost but limits the output size. For very large or dense graphs, consider restricting analysis to a subgraph. GPU acceleration is not wired through the graph_knn table function, so an ON GPU hint on a SQL query that uses it is silently ignored and it runs on the CPU. The Cypher CALL surface (USE ... ON GPU ... CALL algo.knn) does expose a GPU dispatch, returning the same source_id, neighbor_id, similarity, and rank columns as the CPU path. The graph cache (256 MB default budget, LRU eviction, 10-minute idle timeout) retains the CSR topology, so repeated KNN queries with different k values on the same table reuse the cached graph.

Parameters

Name	Type	Description
`table_name`		Specify the name of the registered Delta table containing edge data. The table must include source and target columns (auto-detected as src/source/src_id and dst/target/dst_id). The graph is loaded as undirected for similarity computation. Edge weights are ignored; similarity is based on neighborhood overlap.
`k`		Set the number of nearest neighbors to return per source node. Default is 5. Valid range is 1 or greater. Higher values return more neighbors but increase the output cardinality; the underlying pairwise similarity cost is unchanged.
`metric`		Specifies the similarity metric. Valid values include 'jaccard' (default) and 'cosine'. Jaccard measures the size of the neighbor-set intersection divided by the union; cosine measures the angle between neighborhood indicator vectors. Metric names are case-insensitive; unknown values fall back to the default.

Examples

CREATE TABLE co_purchases AS
SELECT * FROM VALUES
  (1, 2, 1.0), (1, 3, 1.0), (1, 4, 1.0),
  (2, 3, 1.0), (2, 5, 1.0),
  (3, 4, 1.0), (3, 5, 1.0),
  (4, 6, 1.0),
  (5, 6, 1.0)
AS t(src, dst, weight);

SELECT * FROM graph_knn('co_purchases');

SELECT source_id, neighbor_id, similarity, rank
FROM graph_knn('co_purchases', 3)
ORDER BY source_id, rank;

SELECT neighbor_id, similarity, rank
FROM graph_knn('co_purchases', 10)
WHERE source_id = 1
ORDER BY rank;

SELECT source_id, neighbor_id, similarity, rank
FROM graph_knn('co_purchases', 5, 'cosine')
ORDER BY source_id, rank;

SELECT
  k.neighbor_id,
  p.product_name AS similar_product,
  k.similarity,
  k.rank
FROM graph_knn('co_purchases', 5) k
JOIN products p ON k.neighbor_id = p.id
WHERE k.source_id = 1
ORDER BY k.rank;

Pitfalls

Similarity is structural, not attribute-based. Two nodes can have identical attributes yet zero graph_knn similarity if they share no neighbors, and two unrelated nodes can score high simply because they both connect to the same high-degree hub. Validate that structural similarity matches the recommendation semantics you need.
k limits the output rows per source but not the compute cost. The algorithm still scores every candidate pair before truncating, so reducing k does not speed up large-graph runs. Subset the edge table up-front if compute budget is tight.
An unrecognised metric silently falls back to the default (Jaccard) rather than raising an error. Misspellings such as 'jacard' or 'cosin' may produce unexpected rankings. Validate the metric string before deployment.
Ties in similarity are broken by neighbor_id ordering, not by input ordering. Multiple neighbors with identical scores produce stable but arbitrary ranks that can shift when the underlying node ID encoding changes.
GPU acceleration is not wired through the graph_knn table function, so an ON GPU hint on a SQL query that uses it is silently ignored and it runs on the CPU. The Cypher CALL surface (USE ... ON GPU ... CALL algo.knn) does expose a GPU dispatch, returning the same source_id, neighbor_id, similarity, and rank columns as the CPU path.