ORC Energy Meters: Advanced Aggregation at Scale

Aggregate 1,500 smart-meter readings across 3 monthly ORC files using HAVING, COUNT DISTINCT, FILTER clauses, and CTE + window ranking.

Category: orc

Syntax

-- ================================================================
-- SETUP
-- ================================================================
CREATE ZONE IF NOT EXISTS external TYPE EXTERNAL
    COMMENT 'External tables, demo datasets and file-backed data';

CREATE SCHEMA IF NOT EXISTS external.orc_energy
    COMMENT 'ORC-backed energy meter tables';

CREATE EXTERNAL TABLE IF NOT EXISTS external.orc_energy.readings
USING ORC
LOCATION '/data/energy'
OPTIONS (
    file_metadata = '{"columns":["df_file_name","df_row_number"]}'
);

-- ================================================================
-- QUERIES
-- ================================================================
-- Full scan, 1,500 readings across 3 monthly files.
SELECT COUNT(*) FROM external.orc_energy.readings;

-- Rate-plan distribution with multiple aggregates in one pass.
SELECT rate_plan,
       COUNT(*)                      AS reading_count,
       ROUND(SUM(kwh_consumed), 3)   AS total_kwh,
       ROUND(AVG(kwh_consumed), 3)   AS avg_kwh
FROM external.orc_energy.readings
GROUP BY rate_plan
ORDER BY reading_count DESC;

-- HAVING, meters whose total consumption exceeds 2,000 kWh.
SELECT meter_id,
       ROUND(SUM(kwh_consumed), 3) AS total_kwh,
       COUNT(*)                    AS reading_count
FROM external.orc_energy.readings
GROUP BY meter_id
HAVING SUM(kwh_consumed) > 2000
ORDER BY total_kwh DESC;

-- COUNT DISTINCT, cardinality of meters and plans.
SELECT COUNT(DISTINCT meter_id)  AS distinct_meters,
       COUNT(DISTINCT rate_plan) AS distinct_plans
FROM external.orc_energy.readings;

-- FILTER clause, peak vs off-peak counts per rate plan in one scan.
SELECT rate_plan,
       COUNT(*) FILTER (WHERE is_peak_hour = true)  AS peak_count,
       COUNT(*) FILTER (WHERE is_peak_hour = false) AS offpeak_count
FROM external.orc_energy.readings
GROUP BY rate_plan
ORDER BY rate_plan;

-- Monthly rollup via df_file_name (populated by file_metadata OPTIONS).
SELECT df_file_name,
       COUNT(*)                    AS reading_count,
       ROUND(SUM(kwh_consumed), 3) AS monthly_kwh
FROM external.orc_energy.readings
GROUP BY df_file_name
ORDER BY df_file_name;

-- CTE + ROW_NUMBER, top 5 meters with rate-plan context.
WITH meter_totals AS (
    SELECT meter_id,
           rate_plan,
           ROUND(SUM(kwh_consumed), 3) AS total_kwh,
           COUNT(*)                    AS readings,
           ROW_NUMBER() OVER (ORDER BY SUM(kwh_consumed) DESC) AS usage_rank
    FROM external.orc_energy.readings
    GROUP BY meter_id, rate_plan
)
SELECT meter_id, rate_plan, total_kwh, readings, usage_rank
FROM meter_totals
WHERE usage_rank <= 5
ORDER BY usage_rank;

-- ================================================================
-- CLEANUP
-- ================================================================
DROP EXTERNAL TABLE IF EXISTS external.orc_energy.readings WITH FILES;
DROP SCHEMA IF EXISTS external.orc_energy;
DROP ZONE IF EXISTS external;

Description

## When to Use Use this pattern when your ORC dataset is a time-partitioned stream of numeric events (meter readings, telemetry, billing records) and you need analytical answers that go beyond plain GROUP BY, segmented counts via FILTER, thresholded groups via HAVING, and top-N ranking via CTE + ROW_NUMBER. It also demonstrates how DeltaForge handles mixed numeric types (float64, boolean, nullable columns) when multiple ORC files are read in one table. ## What You Will Learn 1. Multi-file scan over 3 monthly ORC files (readings_2026-0{1,2,3}.orc) with `file_metadata` exposing the source file per row for per-month rollups. 2. HAVING on aggregated sums (`SUM(kwh_consumed) > 2000`) to keep only high-usage meters, runs after GROUP BY, unlike WHERE. 3. FILTER clause for segmented counts in a single pass (peak vs off-peak), preferable to CASE-based sums when you want COUNT semantics. 4. COUNT DISTINCT for cardinality checks (distinct meters = 50, distinct plans = 3). 5. CTE + ROW_NUMBER() OVER (ORDER BY SUM(...) DESC) to rank grouped results: DeltaForge rejects correlated scalar subqueries, so this is the sanctioned top-N pattern. ## Prerequisites - A zone of TYPE EXTERNAL. - Three ORC files (one per month) sitting under the demo data path. The CREATE EXTERNAL TABLE reads the directory, so any file added later is picked up on the next scan. - Sufficient memory for full aggregation, 1,500 rows is trivial, but the patterns scale to millions with the same SQL.

Pitfalls

Open in interactive docs →   DeltaForge home →