Compute the minimum spanning tree of a weighted graph
SELECT * FROM graph_mst(table_name)
The minimum spanning tree (MST) algorithm selects a subset of edges from a weighted, undirected graph that connects all nodes with the minimum possible total edge weight, without forming any cycles. The resulting tree contains exactly V-1 edges for a connected graph of V nodes. MST is widely used in network design (minimizing cable or pipeline costs), clustering (via single-linkage hierarchical clustering), and approximation algorithms for NP-hard problems such as the traveling salesman problem. The function loads edge data from a registered Delta table as an undirected graph in Compressed Sparse Row (CSR) format. The algorithm uses a Kruskal or Prim-based approach to greedily select minimum-weight edges that do not create cycles. Column names for source, target, and weight are auto-detected from standard conventions. If no weight column is found, all edges are assigned a default weight of 1.0. For disconnected graphs, the algorithm produces a minimum spanning forest (one tree per connected component). The time complexity is O(E log E) for Kruskal's algorithm (dominated by edge sorting) or O(E + V log V) for Prim's algorithm with a Fibonacci heap. Both approaches are efficient for sparse graphs. For dense graphs, the edge-sorting cost of Kruskal's approach dominates. GPU acceleration is not currently available for MST computation. All processing runs on the CPU. The graph cache (256 MB default budget, LRU eviction, 10-minute idle timeout) retains the CSR topology for reuse, so repeated MST computations on the same table (e.g., after parameter changes in related analyses) avoid redundant graph loading.
| Name | Type | Description |
|---|---|---|
table_name | Specify the name of the registered Delta table containing edge data. The table must include source and target columns (auto-detected as src/source/src_id and dst/target/dst_id). A weight column must be present; if absent, all edges are treated as having weight 1.0. The graph is loaded as undirected for MST computation. |
CREATE TABLE cable_costs AS
SELECT * FROM VALUES
(1, 2, 4.0),
(1, 3, 8.0),
(2, 3, 2.0),
(2, 4, 5.0),
(3, 4, 3.0),
(3, 5, 7.0),
(4, 5, 1.0)
AS t(src, dst, weight);
SELECT * FROM graph_mst('cable_costs');
SELECT SUM(weight) AS total_mst_cost
FROM graph_mst('cable_costs');
SELECT COUNT(*) AS mst_edge_count
FROM graph_mst('cable_costs');
SELECT src, dst, weight
FROM graph_mst('cable_costs')
ORDER BY weight DESC
LIMIT 1;
SELECT
m.src, s.name AS src_name,
m.dst, d.name AS dst_name,
m.weight AS cost
FROM graph_mst('cable_costs') m
JOIN locations s ON m.src = s.id
JOIN locations d ON m.dst = d.id
ORDER BY m.weight;