K-Nearest Neighbors

Find the k most similar nodes for each node based on graph structure

Category: similarity

Syntax

SELECT * FROM graph_knn(table_name [, k] [, metric])

Description

K-nearest neighbors (KNN) identifies the k most structurally similar nodes for each node in the graph. Similarity is measured by the overlap of neighbor sets rather than by attribute values. Two nodes that share many common neighbors are considered structurally similar, making this algorithm valuable for recommendation systems, link prediction, and entity resolution in graph-structured data. The function loads edge data from a registered Delta table as an undirected graph in Compressed Sparse Row (CSR) format. For each node, the algorithm computes a similarity score against all other nodes using the selected metric (Jaccard by default), then retains the top k results. Jaccard similarity is defined as the size of the intersection of two neighbor sets divided by the size of their union. Column names for source and target are auto-detected from standard conventions (src/source/src_id, dst/target/dst_id). The time complexity depends on the similarity metric. For Jaccard similarity, computing all pairs requires O(V * d_max^2) in the worst case, where d_max is the maximum degree. In practice, sparse graphs with low average degree are much cheaper. Setting a smaller k does not reduce the comparison cost but limits the output size. For very large or dense graphs, consider restricting analysis to a subgraph. GPU acceleration is not currently available for graph KNN. All computation runs on the CPU. The graph cache (256 MB default budget, LRU eviction, 10-minute idle timeout) retains the CSR topology, so repeated KNN queries with different k values on the same table reuse the cached graph.

Parameters

NameTypeDescription
table_nameSpecify the name of the registered Delta table containing edge data. The table must include source and target columns (auto-detected as src/source/src_id and dst/target/dst_id). The graph is loaded as undirected for similarity computation. Edge weights are ignored; similarity is based on neighborhood overlap.
kSet the number of nearest neighbors to return per source node. Default is 5. Valid range is 1 or greater. Higher values return more neighbors but increase the output cardinality; the underlying pairwise similarity cost is unchanged.
metricSpecifies the similarity metric. Valid values include 'jaccard' (default) and 'cosine'. Jaccard measures the size of the neighbor-set intersection divided by the union; cosine measures the angle between neighborhood indicator vectors. Metric names are case-insensitive; unknown values fall back to the default.

Examples

CREATE TABLE co_purchases AS
SELECT * FROM VALUES
  (1, 2, 1.0), (1, 3, 1.0), (1, 4, 1.0),
  (2, 3, 1.0), (2, 5, 1.0),
  (3, 4, 1.0), (3, 5, 1.0),
  (4, 6, 1.0),
  (5, 6, 1.0)
AS t(src, dst, weight);
SELECT * FROM graph_knn('co_purchases');
SELECT source_id, neighbor_id, similarity, rank
FROM graph_knn('co_purchases', 3)
ORDER BY source_id, rank;
SELECT neighbor_id, similarity, rank
FROM graph_knn('co_purchases', 10)
WHERE source_id = 1
ORDER BY rank;
SELECT source_id, neighbor_id, similarity, rank
FROM graph_knn('co_purchases', 5, 'cosine')
ORDER BY source_id, rank;
SELECT
  k.neighbor_id,
  p.product_name AS similar_product,
  k.similarity,
  k.rank
FROM graph_knn('co_purchases', 5) k
JOIN products p ON k.neighbor_id = p.id
WHERE k.source_id = 1
ORDER BY k.rank;

Pitfalls

See Also

Open in interactive docs →   DeltaForge home →