ORC Warehouse Inventory: Window Functions on Mixed Numerics

Apply ROW_NUMBER, RANK, LAG/LEAD, NTILE, and running totals to 100 rows of warehouse inventory stored across 2 ORC files with int32, float64, and boolean columns.

Category: orc

Syntax

-- ================================================================
-- SETUP
-- ================================================================
CREATE ZONE IF NOT EXISTS external TYPE EXTERNAL
    COMMENT 'External tables, demo datasets and file-backed data';

CREATE SCHEMA IF NOT EXISTS external.orc_inventory
    COMMENT 'ORC-backed warehouse inventory tables';

CREATE EXTERNAL TABLE IF NOT EXISTS external.orc_inventory.stock
USING ORC
LOCATION '/data/warehouse'
OPTIONS (
    file_metadata = '{"columns":["df_file_name","df_row_number"]}'
);

-- ================================================================
-- QUERIES
-- ================================================================
-- ROW_NUMBER, top quantity per category (ties broken by row order).
SELECT sku_id, warehouse, category, product_name, quantity_on_hand,
       ROW_NUMBER() OVER (PARTITION BY category ORDER BY quantity_on_hand DESC) AS rn
FROM external.orc_inventory.stock;

-- RANK, identical cost values share a rank (1, 2, 2, 4).
SELECT sku_id, warehouse, category, product_name, unit_cost,
       RANK() OVER (PARTITION BY warehouse ORDER BY unit_cost DESC) AS cost_rank
FROM external.orc_inventory.stock;

-- LAG / LEAD, neighbour comparison inside a partition.
SELECT product_name, quantity_on_hand,
       LAG(quantity_on_hand)  OVER (ORDER BY product_name) AS prev_qty,
       LEAD(quantity_on_hand) OVER (ORDER BY product_name) AS next_qty
FROM external.orc_inventory.stock
WHERE warehouse = 'WH-NORTH' AND category = 'Electronics'
ORDER BY product_name;

-- NTILE, split into 4 equal cost buckets.
SELECT cost_quartile,
       COUNT(*)               AS item_count,
       ROUND(MIN(unit_cost), 2) AS min_cost,
       ROUND(MAX(unit_cost), 2) AS max_cost
FROM (
    SELECT unit_cost,
           NTILE(4) OVER (ORDER BY unit_cost) AS cost_quartile
    FROM external.orc_inventory.stock
) q
GROUP BY cost_quartile
ORDER BY cost_quartile;

-- Reorder alerts, window rank filtered by business rule.
SELECT sku_id, warehouse, category, product_name,
       quantity_on_hand, reorder_point,
       reorder_point - quantity_on_hand AS deficit,
       ROW_NUMBER() OVER (
           PARTITION BY category
           ORDER BY reorder_point - quantity_on_hand DESC
       ) AS urgency_rank
FROM external.orc_inventory.stock
WHERE quantity_on_hand < reorder_point
ORDER BY category, urgency_rank;

-- ================================================================
-- CLEANUP
-- ================================================================
DROP EXTERNAL TABLE IF EXISTS external.orc_inventory.stock WITH FILES;
DROP SCHEMA IF EXISTS external.orc_inventory;
DROP ZONE IF EXISTS external;

Description

## When to Use Use this pattern when your ORC data is a flat fact table of SKUs, assets, or measurements and you need in-partition ranking, neighbour comparisons, or quartile bucketing. Window functions stay inside the Arrow-native execution engine, so they stream efficiently over ORC stripes without a materialisation step. ## What You Will Learn 1. ROW_NUMBER() OVER (PARTITION BY category ORDER BY quantity DESC) for tie-breaking top-N-per-group without correlated subqueries. 2. RANK() for tied orderings, shows the 1, 2, 2, 4 semantics versus ROW_NUMBER's 1, 2, 3, 4. 3. LAG and LEAD for neighbour comparisons within a sorted partition, returns NULL at the partition edges. 4. NTILE(4) to assign cost quartiles evenly across the full table and verify the boundary min/max per bucket. 5. SUM(col) OVER (PARTITION BY ... ORDER BY ...) as a running total, the classic DeltaForge / DataFusion pattern for cumulative aggregation. 6. Combining a window function with a WHERE filter to surface the top `urgency_rank` per category for reorder alerts. ## Prerequisites - A zone of TYPE EXTERNAL. - Two ORC files (wh_north.orc, wh_south.orc) sitting under one data_path, the external table reads the directory and unions them. - No partitioning or DETECT SCHEMA steps: CREATE EXTERNAL TABLE USING ORC auto-detects the merged schema from both file footers.

Pitfalls

Open in interactive docs →   DeltaForge home →