Random Walk Corpus Generation

Generate a corpus of fixed-length random walks starting from every node in the graph

Category: embeddings

Syntax

SELECT * FROM graph_randomwalk(table_name [, walks_per_node [, walk_length [, seed]]])

Description

Random walk corpus generation produces the canonical input for embedding pipelines in the node2vec, DeepWalk, and FastRP families. Starting from every vertex, the algorithm launches `walks_per_node` independent fixed-length walks. At each step it picks a uniformly random forward neighbour of the current node; a walk terminates when it reaches a vertex with no forward neighbours, in which case the output array is shorter than `walk_length`. The function matches the Neo4j GDS `gds.randomWalk.stream` defaults of `walksPerNode = 1` and `walkLength = 10`. The function loads edge data from a registered Delta table into a Compressed Sparse Row (CSR) representation and walks it directly. Column names for source and target are auto-detected from standard conventions (src/source/src_id, dst/target/dst_id). The unweighted (uniform-neighbour) variant is the only sampler exposed through the SQL surface: the underlying CPU and GPU primitives both support weighted sampling proportional to edge weight, but the table function hard-codes the `weighted` flag to false so that present edge weights are loaded into the CSR yet ignored when picking the next vertex. Determinism is preserved by deriving each walk's PRNG state from `(seed, source_idx, walk_idx)` via SplitMix64 mixing constants. The same `(table_name, walks_per_node, walk_length, seed)` tuple produces byte-identical walks on every run and on every hardware adapter, because the GPU path uses the same SplitMix64 scheme as the CPU path rather than a device-specific PRNG. Reordering the source edge table will reshuffle node indices in the CSR and therefore change which neighbour the PRNG selects: lock the source snapshot if the goal is reproducibility across schema changes. The time complexity is O(walks_per_node · walk_length · node_count) and the memory footprint is O(walks_per_node · walk_length · node_count) for the materialised JSON-array column. CPU execution is single-threaded over walks (the inner loop is cheap and the cache pressure favours one walk in flight at a time); GPU execution launches one thread per walk with each thread advancing through all `walk_length` steps locally, since WGSL has no efficient cooperative primitive for cross-walk coordination. Sink-truncated walks are marked with an INVALID sentinel on the GPU and trimmed on the host when assembling the JSON string. The graph cache (256 MB default budget, LRU eviction, 10-minute idle timeout) retains the CSR so repeated invocations skip the graph build.

Parameters

NameTypeDescription
table_nameSpecify the name of the registered Delta table containing edge data. The table must include source and target columns (auto-detected as src/source/src_id and dst/target/dst_id). Edge weights are loaded if present and are forwarded to the CSR; the current SQL surface always runs the unweighted (uniform-neighbour) variant, so any present weights are ignored at sampling time. Edge direction is honoured: walks step only along forward neighbours.
walks_per_nodeSet the number of walks generated from each starting node. The total number of output rows is approximately `walks_per_node * node_count`, minus any walks that terminate early at a sink. Matches the Neo4j GDS `gds.randomWalk.stream` default of 1. Common values for embedding pipelines like node2vec sit in the 10 to 20 range.
walk_lengthSet the total number of nodes in each walk, including the source. A walk of length `L` consists of the source plus up to `L-1` transitions, so an `L = 10` walk emits a 10-element JSON array (or fewer elements if the walk hits a sink). Matches the Neo4j GDS `gds.randomWalk.stream` default of 10. Larger values give downstream embedding models a wider context window at proportional cost.
seedSet the seed for the SplitMix64 PRNG that drives neighbour sampling. Each walk derives its own PRNG state from `(seed, source_idx, walk_idx)`, so the output is deterministic across runs, across hardware adapters (the GPU path uses the same SplitMix64 scheme as the CPU path), and across dispatch reorderings. Change the seed to draw an independent corpus from the same graph.

Examples

CREATE TABLE rw_demo AS
SELECT * FROM VALUES
  (1, 2, 1.0), (2, 3, 1.0), (3, 4, 1.0),
  (4, 1, 1.0), (2, 5, 1.0), (5, 1, 1.0),
  (3, 5, 1.0)
AS t(src, dst, weight);
SELECT * FROM graph_randomwalk('rw_demo', 5, 10) ORDER BY source_node_id;
SELECT * FROM graph_randomwalk('rw_demo');
SELECT * FROM graph_randomwalk('rw_demo', 5, 10, 12345);
WITH walks AS (
  SELECT source_node_id, walk,
         LENGTH(walk) - LENGTH(REPLACE(walk, ',', '')) + 1 AS step_count
  FROM graph_randomwalk('rw_demo', 10, 8)
)
SELECT COUNT(*) AS truncated_walks
FROM walks
WHERE step_count < 8;
SELECT walk
FROM graph_randomwalk('rw_demo', 20, 5, 7)
ORDER BY source_node_id;

Pitfalls

See Also

Open in interactive docs →   DeltaForge home →