Random Walk Corpus Generation

Generate a corpus of fixed-length random walks starting from every node in the graph.

Category: embeddings

Syntax

SELECT * FROM graph_randomwalk(table_name [, walks_per_node [, walk_length [, seed]]])

Description

Random walk corpus generation produces the canonical input for embedding pipelines in the node2vec, DeepWalk, and FastRP families. Starting from every vertex, the algorithm launches `walks_per_node` independent walks of at most `walk_length` nodes. At each step it picks a forward neighbour of the current node uniformly at random. A walk terminates early when it reaches a vertex with no forward neighbours, in which case the output array is shorter than `walk_length`. The function defaults match the Neo4j GDS `gds.randomWalk.stream` defaults of `walksPerNode = 1` and `walkLength = 10`.
The function loads edge data from a registered Delta table into a Compressed Sparse Row (CSR) representation and walks it directly. Source and target column names are auto-detected from standard conventions (src/source/src_id, dst/target/dst_id). The unweighted (uniform-neighbour) variant is the only sampler exposed through the SQL surface. The underlying CPU and GPU primitives both support sampling proportional to edge weight, but the table function hard-codes the `weighted` flag to false, so any edge weights present are loaded into the CSR yet ignored when picking the next vertex.
Determinism is preserved by deriving each walk's PRNG state from `(seed, source_idx, walk_idx)` via SplitMix64 mixing constants. The same `(table_name, walks_per_node, walk_length, seed)` tuple produces byte-identical walks on every run and on every hardware adapter, because the GPU path uses the same SplitMix64 scheme as the CPU path rather than a device-specific PRNG. Reordering the source edge table reshuffles node indices in the CSR and therefore changes which neighbour the PRNG selects, so lock the source snapshot if the goal is reproducibility across reloads of the table.
Time complexity is O(walks_per_node · walk_length · node_count). The memory footprint of the materialised JSON-array column is also O(walks_per_node · walk_length · node_count).
CPU execution is single-threaded over walks (the inner loop is cheap and cache pressure favours one walk in flight at a time). GPU execution launches one thread per walk, and each thread advances through all `walk_length` steps locally, since WGSL has no efficient cooperative primitive for cross-walk coordination. Sink-truncated walks are marked with an `INVALID_VERTEX` sentinel on the GPU and trimmed on the host when assembling the JSON string. The graph cache (256 MB default budget, LRU eviction, 10-minute idle timeout) retains the CSR so repeated invocations skip the graph build.
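The walk procedure described above can be sketched in Python as follows. This is an illustrative model only, not the engine's implementation: the SplitMix64 step uses the standard published constants, but the way `(seed, source_idx, walk_idx)` is folded into one state (`walk_seed`), the first-seen node indexing in `build_csr`, and the modulo-based neighbour pick are all assumptions made for the example.

```python
MASK64 = (1 << 64) - 1

def splitmix64(state):
    """Advance a SplitMix64 state; return (new_state, 64-bit output)."""
    state = (state + 0x9E3779B97F4A7C15) & MASK64
    z = state
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
    return state, z ^ (z >> 31)

def build_csr(edges):
    """Build (node_ids, offsets, targets) from (src, dst) pairs.

    Assumption: node indices follow first appearance in the edge list.
    """
    index = {}
    for src, dst in edges:
        for node in (src, dst):
            index.setdefault(node, len(index))
    node_ids = list(index)
    rows = [[] for _ in node_ids]
    for src, dst in edges:
        rows[index[src]].append(index[dst])
    offsets, targets = [0], []
    for row in rows:
        targets.extend(row)
        offsets.append(len(targets))
    return node_ids, offsets, targets

def walk_seed(seed, source_idx, walk_idx):
    """Hypothetical mixing of (seed, source_idx, walk_idx) into one state."""
    state = seed & MASK64
    for part in (source_idx, walk_idx):
        _, state = splitmix64(state ^ part)
    return state

def random_walks(edges, walks_per_node=1, walk_length=10, seed=0):
    """Return a list of (source_node_id, walk) pairs."""
    node_ids, offsets, targets = build_csr(edges)
    corpus = []
    for s in range(len(node_ids)):
        for w in range(walks_per_node):
            state = walk_seed(seed, s, w)
            cur, walk = s, [node_ids[s]]
            while len(walk) < walk_length:
                lo, hi = offsets[cur], offsets[cur + 1]
                if lo == hi:  # sink: no forward neighbours, stop early
                    break
                state, r = splitmix64(state)
                cur = targets[lo + r % (hi - lo)]  # uniform pick, modulo bias ignored
                walk.append(node_ids[cur])
            corpus.append((node_ids[s], walk))
    return corpus

# Edge list mirroring the rw_demo table used in the Examples section.
demo = [(1, 2), (2, 3), (3, 4), (4, 1), (2, 5), (5, 1), (3, 5)]
for source, walk in random_walks(demo, walks_per_node=1, walk_length=5, seed=42):
    print(source, walk)
```

Because each walk's state depends only on `(seed, source_idx, walk_idx)`, the walks can be generated in any order, or in parallel, without changing the result.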

Parameters

Name	Type	Description
`table_name`		Specify the name of the registered Delta table containing edge data. The table must include source and target columns (auto-detected as src/source/src_id and dst/target/dst_id). Edge weights are loaded if present and are forwarded to the CSR; the current SQL surface always runs the unweighted (uniform-neighbour) variant, so any present weights are ignored at sampling time. Edge direction is honoured: walks step only along forward neighbours.
`walks_per_node`		Set the number of walks generated from each starting node. The total number of output rows is approximately `walks_per_node * node_count`, minus any walks that terminate early at a sink. Matches the Neo4j GDS `gds.randomWalk.stream` default of 1. Common values for embedding pipelines like node2vec sit in the 10 to 20 range.
`walk_length`		Set the total number of nodes in each walk, including the source. A walk of length `L` consists of the source plus up to `L-1` transitions, so an `L = 10` walk emits a 10-element JSON array (or fewer elements if the walk hits a sink). Matches the Neo4j GDS `gds.randomWalk.stream` default of 10. Larger values give downstream embedding models a wider context window at proportional cost.
`seed`		Set the seed for the SplitMix64 PRNG that drives neighbour sampling. Each walk derives its own PRNG state from `(seed, source_idx, walk_idx)`, so the output is deterministic across runs, across hardware adapters (the GPU path uses the same SplitMix64 scheme as the CPU path), and across dispatch reorderings. Change the seed to draw an independent corpus from the same graph.
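To size a corpus before running the function, the parameter descriptions above reduce to simple arithmetic. The helper below is a hypothetical convenience, not part of the SQL surface; it returns upper bounds, since walks that hit a sink emit fewer elements.

```python
def corpus_bounds(node_count, walks_per_node=1, walk_length=10):
    """Return upper bounds (rows, elements) for one graph_randomwalk call."""
    rows = walks_per_node * node_count   # one row per walk
    return rows, rows * walk_length      # each walk holds at most walk_length nodes

rows, max_elements = corpus_bounds(node_count=5, walks_per_node=5, walk_length=10)
print(rows, max_elements)  # 25 250
```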

Examples

-- Build a small directed demo graph with five nodes and seven edges.
CREATE TABLE rw_demo AS
SELECT * FROM VALUES
  (1, 2, 1.0), (2, 3, 1.0), (3, 4, 1.0),
  (4, 1, 1.0), (2, 5, 1.0), (5, 1, 1.0),
  (3, 5, 1.0)
AS t(src, dst, weight);

-- Five walks per node, up to 10 nodes each, seed left at its default.
SELECT * FROM graph_randomwalk('rw_demo', 5, 10) ORDER BY source_node_id;

-- All defaults: one walk per node, up to 10 nodes per walk.
SELECT * FROM graph_randomwalk('rw_demo');

-- Explicit seed for a reproducible corpus.
SELECT * FROM graph_randomwalk('rw_demo', 5, 10, 12345);

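-- Count walks that stopped early at a sink (fewer than 8 nodes).
-- step_count is the number of commas in the walk string plus one.
WITH walks AS (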
  SELECT source_node_id, walk,
         LENGTH(walk) - LENGTH(REPLACE(walk, ',', '')) + 1 AS step_count
  FROM graph_randomwalk('rw_demo', 10, 8)
)
SELECT COUNT(*) AS truncated_walks
FROM walks
WHERE step_count < 8;

-- Return only the walk column: 20 walks per node, up to 5 nodes each, seed 7.
SELECT walk
FROM graph_randomwalk('rw_demo', 20, 5, 7)
ORDER BY source_node_id;

Pitfalls

Reproducibility requires that `seed`, `walks_per_node`, `walk_length`, and the source graph all stay stable. If the source edge table is reloaded with a different snapshot, or if the CSR build order shifts because the order of insert statements changed, neighbour indices renumber and the walks diverge even when the seed is unchanged. Pin the source table snapshot when comparing walk corpora across runs.
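The renumbering effect can be illustrated with a short Python sketch. It assumes node indices are assigned in order of first appearance in the edge list; that policy is an assumption for illustration, and the actual CSR builder may differ.

```python
def first_seen_index(edges):
    """Assign node indices in order of first appearance (an assumed policy)."""
    index = {}
    for src, dst in edges:
        for node in (src, dst):
            index.setdefault(node, len(index))
    return index

edges = [(1, 2), (2, 3), (3, 4), (4, 1)]
reordered = [edges[2], edges[0], edges[3], edges[1]]  # same edges, new insert order

a = first_seen_index(edges)
b = first_seen_index(reordered)
print(a)  # {1: 0, 2: 1, 3: 2, 4: 3}
print(b)  # {3: 0, 4: 1, 1: 2, 2: 3}
```

The two tables describe the same graph, yet every node lands on a different index, so a PRNG state derived from `source_idx` drives a different walk.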
The SQL surface always runs the unweighted (uniform-neighbour) variant; the `weighted` flag is hard-coded to false in the extract function. Edge weights are loaded into the CSR but ignored at sampling time. If a weighted random walk is required for proportional sampling, call the Rust API directly. When weighted sampling is used at the API level, all edge weights must be positive: the weighted sampler treats `w <= 0.0` as a zero-mass contributor and falls back to uniform sampling if the total mass on a row is non-positive, so a row whose weights are all zero or negative silently falls back to uniform sampling instead of raising an error.
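The fallback rule can be modelled in a few lines of Python. This is a sketch of the behaviour as described, not the Rust sampler itself; the function name `pick_neighbour` and the convention of passing a uniform draw `u` in [0, 1) are assumptions for the example.

```python
def pick_neighbour(neighbours, weights, u):
    """Pick one neighbour from a non-empty row, given a uniform draw u in [0, 1).

    Weights <= 0 contribute no mass. If the row has no positive mass, the
    pick falls back to uniform selection instead of raising an error.
    """
    masses = [w if w > 0.0 else 0.0 for w in weights]
    total = sum(masses)
    if total <= 0.0:
        return neighbours[int(u * len(neighbours))]  # uniform fallback
    threshold = u * total
    running = 0.0
    chosen = None
    for node, mass in zip(neighbours, masses):
        if mass > 0.0:
            chosen = node        # remember the last node that carries mass
            running += mass
            if threshold < running:
                break
    return chosen

print(pick_neighbour([10, 20, 30], [1.0, 0.0, 3.0], 0.5))   # weighted pick
print(pick_neighbour([10, 20, 30], [0.0, -1.0, 0.0], 0.5))  # uniform fallback
```

Validating weights before calling a weighted sampler avoids relying on this silent fallback.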
Walks that reach a node with no forward neighbours terminate early. The emitted JSON array is therefore not always `walk_length` elements long. Downstream consumers that assume a fixed shape (for example, feeding the corpus into a tensor with a static second dimension) must either pad or drop short walks. The `INVALID_VERTEX` sentinel used inside the GPU shader is not exposed in the output; the host trims at the first invalid slot before serialising.
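A consumer that needs a fixed shape might handle short walks along these lines. The helpers and the `pad_id = -1` padding value are hypothetical; choose a padding id that cannot collide with a real node id in your graph.

```python
def pad_walks(walks, walk_length, pad_id=-1):
    """Right-pad every walk to walk_length with pad_id (a hypothetical filler)."""
    return [w + [pad_id] * (walk_length - len(w)) for w in walks]

def drop_short_walks(walks, walk_length):
    """Keep only walks that reached the full walk_length."""
    return [w for w in walks if len(w) == walk_length]

walks = [[1, 2, 3, 4], [5, 6], [7]]
print(pad_walks(walks, 4))         # [[1, 2, 3, 4], [5, 6, -1, -1], [7, -1, -1, -1]]
print(drop_short_walks(walks, 4))  # [[1, 2, 3, 4]]
```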
Walks step only along forward neighbours; the algorithm respects edge direction. On an undirected source table (each edge stored as a pair of forward edges, or loaded with `directed = false`), the random walker can revisit previously seen nodes freely. On a directed source table, walks can become trapped in dead-end branches earlier than expected. Use `graph_components` or `graph_scc` to characterise reachability before drawing conclusions from short walks.
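As a quick client-side check before generating walks, the sinks of a directed edge list can be found directly. The helper below is a hypothetical sketch that operates on `(src, dst)` pairs pulled from the edge table; it does not replace the reachability analysis mentioned above.

```python
def sink_nodes(edges):
    """Return nodes that appear as a target but never as a source."""
    sources = {src for src, _ in edges}
    targets = {dst for _, dst in edges}
    return sorted(targets - sources)

# Edge list mirroring rw_demo: every node has an outgoing edge, so no sinks.
demo = [(1, 2), (2, 3), (3, 4), (4, 1), (2, 5), (5, 1), (3, 5)]
print(sink_nodes(demo))             # []
print(sink_nodes(demo + [(5, 6)]))  # [6]
```

Any walk that reaches a listed node ends there, so a graph with many sinks will produce many truncated walks.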
GPU acceleration is available, but the GPU path's PRNG byte sequence is intentionally not the same as cuGraph's Philox-driven output for the same nominal seed. SplitMix64 was chosen because WGSL has no efficient Philox implementation. Statistical properties (uniform neighbour selection, expected coverage) match cuGraph's, but the walks are not byte-for-byte equal to cuGraph's. The CPU and GPU paths in DeltaForge use the same SplitMix64 mixing constants and therefore produce identical walks, so the only divergence is against external cuGraph corpora.
The result column `walk` is a single JSON-style string per row, not a per-step `(walk_id, step_idx, node_id)` triple. Querying the k-th node of each walk requires a SQL JSON function or a string split. Materialise the column into a structured array only when the downstream consumer needs random access by step index, since the JSON encoding is intentionally compact for bulk corpus transfer.
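On the client side, the `walk` string can be unpacked with a standard JSON parser. The sketch below assumes each value is a valid JSON array of integer node ids, such as `[1,2,5,1]`; if the actual encoding differs, the parsing step needs adjusting. Both helper names are hypothetical.

```python
import json

def kth_node(walk, k):
    """Return the k-th node (0-based) of a walk string, or None if the walk is shorter."""
    nodes = json.loads(walk)
    return nodes[k] if k < len(nodes) else None

def explode(walks):
    """Yield (walk_id, step_idx, node_id) triples from a sequence of walk strings."""
    for walk_id, walk in enumerate(walks):
        for step_idx, node_id in enumerate(json.loads(walk)):
            yield walk_id, step_idx, node_id

rows = ["[1,2,5,1]", "[4,1]"]
print(kth_node(rows[0], 2))  # 5
print(kth_node(rows[1], 2))  # None
print(list(explode(rows)))
```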