Label Propagation

Detect communities by iterative neighbor-label voting until labels stabilize

Category: community-detection

Syntax

SELECT * FROM graph_labelpropagation(table_name [, max_iterations])

Description

Label Propagation (Raghavan, Albert, and Kumara 2007) assigns each node the community label that occurs most frequently among its neighbours, iterating until labels stabilize. Each node starts in its own community. On every round every node tallies a weighted vote across its neighbours (and, on directed graphs, its in-neighbours as well), then adopts the label with the highest total vote; ties break to the smallest label id for determinism, matching the Neo4j GDS gds.labelPropagation tie-break rule. The implementation uses a synchronous Gauss-Jacobi update: every node in a round reads the previous round's labels, so the round result does not depend on traversal order. Communities are renumbered contiguously starting from 0 before being returned. The function loads edge data from a registered Delta table as a graph in Compressed Sparse Row format. Column names are auto-detected from standard conventions (src/source/src_id, dst/target/dst_id). When an edge weight column is present, weights scale each vote contribution; otherwise every neighbour contributes 1.0. Time complexity is O(max_iterations times E) for E edges, dominated by a single sparse pass per round; the CPU path parallelizes the per-vertex vote tally with Rayon over a per-thread AHashMap accumulator. GPU acceleration is available. The dispatcher routes to GPU when the active accelerator accepts the graph and the node count meets or exceeds 5,000 (delta-forge-graph-gpu thresholds::LABEL_PROPAGATION); smaller graphs run on the CPU path because the GPU transfer overhead dominates. The GPU kernel selects a per-vertex strategy by degree: a private fixed-capacity dedup table for degree at or below 32 (LOW_DEG path), and a sampled aggregation that draws up to 1,024 deterministic per-vertex samples per round for higher-degree vertices (SAMPLE_CAP path). The sampling cap exists specifically to prevent Windows TDR timeouts on hub-heavy graphs where an O(deg^2) dedup scan over a multi-thousand-degree vertex would otherwise stall a single workgroup past the 2-second driver budget; the dispatch is further sliced into 4,096-vertex chunks so each submission is bounded in wall time regardless of how concentrated the hub workload is. The GPU path also adds a parity-gated oscillation guard (only vertices whose id parity matches the iteration parity are permitted to flip labels in a given round), mirroring cuGraph's Louvain up_down flag, which prevents the well-known two-cycle that synchronous LPA produces on bipartite substructure.

Parameters

Name	Type	Description
`table_name`		Specifies the name of the registered Delta table containing edge data. The table must include source and target columns (auto-detected as src/source/src_id and dst/target/dst_id). An optional weight column scales each neighbor's vote when present; on directed graphs both outgoing and incoming neighbours contribute to the vote, matching Neo4j GDS gds.labelPropagation default orientation behavior.
`max_iterations`		Sets the maximum number of synchronous voting rounds. Default is 10, matching the Neo4j GDS gds.labelPropagation default. The loop exits early once an entire round produces no label changes. Real graphs typically converge in 3 to 5 rounds, so increasing this value rarely changes the final labelling.

Examples

CREATE TABLE coauthors AS
SELECT * FROM VALUES
  (1, 2, 1.0), (1, 3, 1.0), (2, 3, 1.0),
  (4, 5, 1.0), (4, 6, 1.0), (5, 6, 1.0),
  (3, 4, 0.2)
AS t(src, dst, weight);

SELECT * FROM graph_labelpropagation('coauthors');

SELECT * FROM graph_labelpropagation('coauthors', 3);

SELECT COUNT(DISTINCT community_id) AS num_communities
FROM graph_labelpropagation('coauthors');

SELECT community_id, COUNT(*) AS members
FROM graph_labelpropagation('coauthors')
GROUP BY community_id
ORDER BY members DESC, community_id;

SELECT l.community_id, COUNT(*) AS researchers, MAX(p.affiliation) AS sample_affiliation
FROM graph_labelpropagation('coauthors') l
JOIN people p ON l.node_id = p.id
GROUP BY l.community_id
ORDER BY researchers DESC;

Pitfalls

Label Propagation is not modularity-aware. It finds connected dense regions but offers no quality guarantee on the partition; for high-quality partitions on graphs where structure is subtle, prefer graph_louvain or graph_leiden.
Synchronous LPA can oscillate on bipartite-like substructure: two nodes flip labels every round forever. The CPU path simply caps the iteration count and returns whatever labelling held at the cap; the GPU path additionally enforces a parity gate that converts two-cycles into four-cycles and eventually settles, but the two engines can land on different non-trivial labellings for the same oscillating input.
Community ids are not stable across runs. Renumbering is contiguous from 0 but assignment depends on the order in which final labels are first observed, so a re-run on the same edge table can swap community 0 and community 1 even when topology is identical. Compare by membership, not by id.
Edge direction matters. On directed graphs both outgoing and incoming neighbours contribute to the vote (matching Neo4j's default orientation), so reversing the edge table flips no scores but loading the same data as undirected (e.g. by symmetrizing upstream) doubles every contribution and produces a different labelling.
On hub-heavy graphs (degree above 1024) the GPU path approximates the vote tally with a deterministic 1,024-sample subset of each vertex's neighbours per round, while the CPU path always walks every edge. Scores are not bit-identical between engines for such graphs; convergence behaviour is preserved (empirically equivalent at S=1024 per Slota et al.) but the exact community boundaries on hub vertices can differ.
Below 5,000 nodes the GPU dispatcher returns BelowThreshold and execution falls back to the CPU path even when an accelerator is registered. Relying on GPU latency on small fixtures will reproducibly miss the GPU path.