Excel Sales Analytics: Superstore Orders

Unify four annual Superstore XLSX files (9,994 orders) into a single external table, then isolate one year with file_filter and run regional + product analytics.

Category: excel

Syntax

-- Zone + schema
CREATE ZONE IF NOT EXISTS {{zone_name}} TYPE EXTERNAL;
CREATE SCHEMA IF NOT EXISTS {{zone_name}}.excel_demos;

-- Unified table over all 4 XLSX files
CREATE EXTERNAL TABLE IF NOT EXISTS {{zone_name}}.excel_demos.all_orders
USING EXCEL LOCATION '{{data_path}}'
OPTIONS (
    sheet_name = 'Orders',
    has_header = 'true',
    infer_schema_rows = '1000',
    file_metadata = '{"columns":["df_file_name","df_row_number"]}'
);

-- Single-year projection using file_filter
CREATE EXTERNAL TABLE IF NOT EXISTS {{zone_name}}.excel_demos.orders_2017
USING EXCEL LOCATION '{{data_path}}'
OPTIONS (
    sheet_name = 'Orders',
    has_header = 'true',
    file_filter = 'sales-data-2017*',
    file_metadata = '{"columns":["df_file_name","df_row_number"]}'
);

-- 9,994 unified orders across 4 files
ASSERT ROW_COUNT = 9994
SELECT * FROM {{zone_name}}.excel_demos.all_orders;

-- Rows per source file (proxy for year)
ASSERT ROW_COUNT = 4
ASSERT VALUE row_count = 1993 WHERE df_file_name LIKE '%2014%'
ASSERT VALUE row_count = 3312 WHERE df_file_name LIKE '%2017%'
SELECT df_file_name, COUNT(*) AS row_count
FROM {{zone_name}}.excel_demos.all_orders
GROUP BY df_file_name
ORDER BY df_file_name;

-- Regional analytics with floating-point bands
ASSERT ROW_COUNT = 4
ASSERT VALUE orders = 3203 WHERE region = 'West'
ASSERT WARNING VALUE total_sales BETWEEN 725456.82 AND 725458.82 WHERE region = 'West'
SELECT region,
       COUNT(*) AS orders,
       ROUND(SUM(CAST(sales AS DOUBLE)), 2) AS total_sales,
       ROUND(SUM(CAST(profit AS DOUBLE)), 2) AS total_profit,
       ROUND(AVG(CAST(discount AS DOUBLE)), 3) AS avg_discount
FROM {{zone_name}}.excel_demos.all_orders
GROUP BY region
ORDER BY total_sales DESC;

-- Top sub-categories by profit
ASSERT ROW_COUNT = 10
SELECT category, sub_category,
       COUNT(*) AS orders,
       ROUND(SUM(CAST(sales AS DOUBLE)), 2) AS total_sales,
       ROUND(SUM(CAST(profit AS DOUBLE)), 2) AS total_profit
FROM {{zone_name}}.excel_demos.all_orders
GROUP BY category, sub_category
ORDER BY total_profit DESC
LIMIT 10;

-- Cleanup
DROP EXTERNAL TABLE IF EXISTS {{zone_name}}.excel_demos.all_orders WITH FILES;
DROP EXTERNAL TABLE IF EXISTS {{zone_name}}.excel_demos.orders_2017 WITH FILES;

Description

## When to Use Use this demo when you have a directory of year-partitioned Excel files and want to treat them as one table for analytics while still being able to carve out a single year on demand. It exercises the production-grade combination of multi-file reading, `sheet_name` selection, `file_filter` glob isolation, `file_metadata` for lineage, and `infer_schema_rows` for type inference across a real-world sales dataset. ## What You Will Learn 1. How to point one LOCATION at multiple XLSX files and have `USING EXCEL` union them transparently. 2. How `sheet_name = 'Orders'` keeps the reader on the right tab when every workbook in the directory uses the same sheet name. 3. How `file_filter = 'sales-data-2017*'` creates a year-scoped external table without moving the underlying files. 4. How `file_metadata` exposes `df_file_name` so you can GROUP BY source year without parsing filenames in the app layer. 5. How `infer_schema_rows = '1000'` balances type-inference accuracy against schema-discovery latency for large workbooks. 6. How to validate type inference (dates, numbers, strings) via `information_schema.columns`. ## Prerequisites - A directory containing the four `sales-data-YYYY.xlsx` files (2014–2017). - A target zone (created as EXTERNAL if missing). - Familiarity with aggregate queries and ASSERT / ASSERT WARNING semantics.

Pitfalls

See Also

Open in interactive docs →   DeltaForge home →