Iceberg V3: Deletion Vectors (Puffin)

Read a native Iceberg v3 table whose 36 faulty-scanner rows are retracted via a Puffin-encoded deletion vector: DeltaForge applies the Roaring Bitmap and exposes the 504 surviving shipments.

Category: iceberg

Syntax

-- ============================================================================
-- SETUP
-- ============================================================================
-- Registers an external table backed by a native Apache Iceberg v3 table that
-- uses Puffin-encoded deletion vectors for row-level deletes.
--
-- DeltaForge reads the Iceberg metadata chain directly:
--   v3.metadata.json -> manifest list -> manifests -> Parquet data file
--   + Puffin deletion vector (Roaring Bitmap of row positions)
-- The DV is applied at scan time so callers see only the surviving rows.

-- STEP 1: Zone & Schema
CREATE ZONE IF NOT EXISTS {{zone_name}} TYPE EXTERNAL
    COMMENT 'External tables - demo datasets and file-backed data';

CREATE SCHEMA IF NOT EXISTS {{zone_name}}.iceberg_demos
    COMMENT 'Apache Iceberg native table demos';

-- STEP 2: Register the Iceberg v3 table (format-version = 3).
-- DeltaForge parses v3.metadata.json to discover schema, data files, and
-- the Puffin deletion vector referenced from the manifest.
CREATE EXTERNAL TABLE IF NOT EXISTS {{zone_name}}.iceberg_demos.shipment_manifests
USING ICEBERG
LOCATION '{{data_path}}/shipment_manifests';

GRANT ADMIN ON TABLE {{zone_name}}.iceberg_demos.shipment_manifests TO USER {{current_user}};


-- ============================================================================
-- QUERIES
-- ============================================================================

-- Query 1: Post-delete row count.
-- The Parquet file physically holds 540 rows; the DV retracts 36, leaving 504.
ASSERT ROW_COUNT = 504
SELECT * FROM {{zone_name}}.iceberg_demos.shipment_manifests;

-- Query 2: The faulty scanner must be fully gone.
ASSERT ROW_COUNT = 0
SELECT *
FROM {{zone_name}}.iceberg_demos.shipment_manifests
WHERE scanner_id = 'SCAN-ERR';

-- Query 3: Per-region shipment counts (post-delete).
ASSERT ROW_COUNT = 3
ASSERT VALUE shipment_count = 144 WHERE region = 'Americas'
ASSERT VALUE shipment_count = 180 WHERE region = 'EMEA'
ASSERT VALUE shipment_count = 180 WHERE region = 'APAC'
SELECT region, COUNT(*) AS shipment_count
FROM {{zone_name}}.iceberg_demos.shipment_manifests
GROUP BY region
ORDER BY region;

-- Query 4: Distinct scanner count (16 originally, 15 after SCAN-ERR removal).
ASSERT ROW_COUNT = 1
ASSERT VALUE scanner_count = 15
SELECT COUNT(DISTINCT scanner_id) AS scanner_count
FROM {{zone_name}}.iceberg_demos.shipment_manifests;

-- Query 5: Per-category distribution (6 categories).
ASSERT ROW_COUNT = 6
ASSERT VALUE shipment_count = 86  WHERE product_category = 'Automotive-Parts'
ASSERT VALUE shipment_count = 107 WHERE product_category = 'Electronics'
ASSERT VALUE shipment_count = 73  WHERE product_category = 'Heavy-Machinery'
ASSERT VALUE shipment_count = 82  WHERE product_category = 'Perishable-Foods'
ASSERT VALUE shipment_count = 77  WHERE product_category = 'Pharmaceuticals'
ASSERT VALUE shipment_count = 79  WHERE product_category = 'Textiles'
SELECT product_category, COUNT(*) AS shipment_count
FROM {{zone_name}}.iceberg_demos.shipment_manifests
GROUP BY product_category
ORDER BY product_category;

-- Query 6: Hazardous shipments by region (boolean aggregation).
ASSERT ROW_COUNT = 3
ASSERT VALUE hazardous_count = 17 WHERE region = 'Americas'
ASSERT VALUE hazardous_count = 21 WHERE region = 'EMEA'
ASSERT VALUE hazardous_count = 21 WHERE region = 'APAC'
SELECT region, SUM(CASE WHEN is_hazardous THEN 1 ELSE 0 END) AS hazardous_count
FROM {{zone_name}}.iceberg_demos.shipment_manifests
GROUP BY region
ORDER BY region;

-- Query 7: Average weight by region (float aggregation post-DV).
ASSERT ROW_COUNT = 3
ASSERT VALUE avg_weight = 1276.76 WHERE region = 'Americas'
ASSERT VALUE avg_weight = 1344.52 WHERE region = 'EMEA'
ASSERT VALUE avg_weight = 1216.02 WHERE region = 'APAC'
SELECT region, ROUND(AVG(weight_kg), 2) AS avg_weight
FROM {{zone_name}}.iceberg_demos.shipment_manifests
GROUP BY region
ORDER BY region;

-- Query 8: Weight range (overall).
ASSERT ROW_COUNT = 1
ASSERT VALUE min_weight = 1.29
ASSERT VALUE max_weight = 2499.03
ASSERT VALUE avg_weight = 1279.27
SELECT
    ROUND(MIN(weight_kg), 2) AS min_weight,
    ROUND(MAX(weight_kg), 2) AS max_weight,
    ROUND(AVG(weight_kg), 2) AS avg_weight
FROM {{zone_name}}.iceberg_demos.shipment_manifests;

-- Query 9: Top carriers by volume (12 carriers).
ASSERT ROW_COUNT = 12
ASSERT VALUE shipment_count = 58 WHERE carrier = 'Maersk'
ASSERT VALUE shipment_count = 48 WHERE carrier = 'Kuehne-Nagel'
ASSERT VALUE shipment_count = 46 WHERE carrier = 'FedEx'
ASSERT VALUE shipment_count = 45 WHERE carrier = 'UPS'
SELECT carrier, COUNT(*) AS shipment_count
FROM {{zone_name}}.iceberg_demos.shipment_manifests
GROUP BY carrier
ORDER BY shipment_count DESC, carrier;

-- Query 10: Low-value shipments (< $500).
ASSERT ROW_COUNT = 5
SELECT shipment_id, region, carrier, product_category, declared_value
FROM {{zone_name}}.iceberg_demos.shipment_manifests
WHERE declared_value < 500
ORDER BY declared_value ASC;

-- Query 11: Distinct entity counts over the post-delete view.
ASSERT ROW_COUNT = 1
ASSERT VALUE distinct_shipments = 504
ASSERT VALUE distinct_countries = 18
ASSERT VALUE distinct_carriers = 12
ASSERT VALUE distinct_categories = 6
SELECT
    COUNT(DISTINCT shipment_id)         AS distinct_shipments,
    COUNT(DISTINCT destination_country) AS distinct_countries,
    COUNT(DISTINCT carrier)             AS distinct_carriers,
    COUNT(DISTINCT product_category)    AS distinct_categories
FROM {{zone_name}}.iceberg_demos.shipment_manifests;

-- VERIFY: end-to-end DV correctness check.
ASSERT ROW_COUNT = 1
ASSERT VALUE total_rows = 504
ASSERT VALUE region_count = 3
ASSERT VALUE scanner_count = 15
ASSERT VALUE faulty_rows = 0
ASSERT VALUE total_hazardous = 59
ASSERT VALUE category_count = 6
ASSERT VALUE country_count = 18
SELECT
    COUNT(*) AS total_rows,
    COUNT(DISTINCT region)   AS region_count,
    COUNT(DISTINCT scanner_id) AS scanner_count,
    SUM(CASE WHEN scanner_id = 'SCAN-ERR' THEN 1 ELSE 0 END) AS faulty_rows,
    SUM(CASE WHEN is_hazardous THEN 1 ELSE 0 END) AS total_hazardous,
    COUNT(DISTINCT product_category)     AS category_count,
    COUNT(DISTINCT destination_country)  AS country_count
FROM {{zone_name}}.iceberg_demos.shipment_manifests;


-- ============================================================================
-- CLEANUP
-- ============================================================================

DROP EXTERNAL TABLE IF EXISTS {{zone_name}}.iceberg_demos.shipment_manifests WITH FILES;
DROP SCHEMA IF EXISTS {{zone_name}}.iceberg_demos;
DROP ZONE IF EXISTS {{zone_name}};

Description

## When to Use Run this demo when you need to verify DeltaForge can read **Iceberg format-version 3** tables that use deletion vectors for row-level deletes. V3 replaces v2's position-delete *files* with **Puffin**-encoded deletion vectors: each DV is a Roaring Bitmap of row positions inside a specific data file, stored in a sidecar `.puffin` blob and referenced from the manifest. This is the mechanism modern Spark 4.0 + Iceberg 1.10.x tables use after a targeted `DELETE`/`MERGE`, and it is the hardest Iceberg read path to get right. ## What You Will Learn 1. How DeltaForge discovers a Puffin DV file from v3 metadata and manifests. 2. How the Roaring Bitmap in a Puffin blob masks row positions in a Parquet data file at scan time. 3. Why the user-visible row count (504) differs from the Parquet file's physical row count (540). 4. That the faulty `scanner_id = 'SCAN-ERR'` rows are gone from every query result, including `COUNT(DISTINCT scanner_id)`, which drops from 16 to 15. 5. V3 metadata-chain traversal: `v3.metadata.json` → manifest list (Avro) → data manifest (Avro) → Parquet data + Puffin DV. 6. How snapshot history is preserved across multiple `metadata.json` files (`v1`, `v2`, `v3`). 7. Post-delete aggregations: per-region counts, hazardous shipment sums, weight statistics, carrier ranking. 8. Why this path is strictly read-only, writing deletion vectors is a UniForm/Delta writer concern, not an external-table one. ## Prerequisites Provide a pre-generated Iceberg v3 table at `{{data_path}}/shipment_manifests` that contains a single Parquet data file (540 physical rows) plus a Puffin deletion-vector file retracting 36 rows. The metadata/ directory must include `v3.metadata.json` (the latest) along with `v1`/`v2` ancestors and their Avro manifest lists. Spark 4.0 with Iceberg 1.10.1 is the reference writer. No Delta conversion and no Iceberg catalog service required: DeltaForge reads directly from the filesystem / object store.

Pitfalls

See Also

Open in interactive docs →   DeltaForge home →