VACUUM

Removes old, unreferenced data files from a Delta table to reclaim storage.

Category: maintenance (DeltaForge extension)

Syntax

VACUUM [TABLE] <table> [RETAIN <n> HOURS] [CDC_RETAIN <n> HOURS] [DRY RUN]

Description

## Overview

VACUUM removes data files from a Delta table's storage directory that are no longer referenced by the current table version and have exceeded the retention period. Over time, operations such as DELETE, UPDATE, MERGE, and OPTIMIZE create new data files and logically remove old ones. These old files remain on disk to support time travel queries until VACUUM physically deletes them.

## How It Works

VACUUM performs three phases:

1. **Log scan**: Reads the Delta transaction log to build a set of all file paths referenced by the current table snapshot.
2. **Storage scan**: Lists all files in the table's storage directory and identifies files not in the referenced set whose modification time is older than the retention threshold.
3. **Deletion**: Removes the identified files from storage and reports metrics (files deleted, bytes freed).

The storage scan ensures that orphaned files (files not tracked in the transaction log, such as those from failed writes) are also cleaned up.

## CDF File Retention

Change Data Feed (CDF) files in the `_change_data/` directory follow a separate retention policy. By default, CDF files use the log retention duration (`delta.logRetentionDuration`, default 30 days) rather than the data file retention period, because CDF readers and streaming ETL pipelines often need access to change history spanning a longer window. Use CDC_RETAIN to override this default.

## Vacuum Protocol Check

When the `vacuumProtocolCheck` feature is enabled on a table (Writer Version 7, Reader Version 3), VACUUM must verify both the reader and writer protocol before deleting files. This prevents older vacuum implementations from accidentally removing files managed by newer writer features, reducing the risk of data corruption.

## Retention Safety

The default retention period of 168 hours (7 days) provides a safety margin for concurrent readers using time travel. Setting the retention below this threshold may cause active queries or time travel operations to fail with file-not-found errors. If concurrent readers are not a concern (for example, in a batch-only workload), shorter retention periods can be used to reclaim storage more aggressively.

## Result Set

Returns a result set with two rows:

| metric | value |
|--------|-------|
| files_deleted | Number of files removed |
| bytes_freed | Total bytes reclaimed |

## Access Control

| Privilege | Object | Notes |
|-----------|--------|-------|
| Ownership or write | Table | Required to delete files from the table's storage. |

## Compatibility

VACUUM is a DeltaForge extension implementing the Delta Lake VACUUM protocol. The CDC_RETAIN clause is a DeltaForge enhancement for independent CDF retention control.
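
The three phases described under How It Works, together with the separate CDF retention window, can be illustrated with a small in-memory model. This is only a sketch under stated assumptions, not DeltaForge's implementation: `StoredFile`, `plan_vacuum`, and `summarize` are hypothetical names, the listing is assumed to contain only data and CDF files (no transaction log entries), and the 720-hour CDF default simply mirrors the 30-day log retention mentioned above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class StoredFile:
    """One entry from a storage listing (hypothetical shape)."""
    path: str
    size_bytes: int
    modified: datetime


def plan_vacuum(referenced, listing, now, retain_hours=168, cdc_retain_hours=720):
    """Select the files a vacuum pass would delete.

    referenced: set of paths produced by the log scan (phase 1).
    listing:    StoredFile entries produced by the storage scan (phase 2),
                assumed to exclude the transaction log itself.
    """
    data_cutoff = now - timedelta(hours=retain_hours)
    cdc_cutoff = now - timedelta(hours=cdc_retain_hours)
    doomed = []
    for f in listing:
        if f.path in referenced:
            continue  # still part of the current snapshot
        # CDF files get their own, usually longer, retention window.
        is_cdf = f.path.startswith("_change_data/")
        cutoff = cdc_cutoff if is_cdf else data_cutoff
        if f.modified < cutoff:
            doomed.append(f)
    return doomed


def summarize(doomed):
    """Build the two metrics reported after deletion (phase 3)."""
    return {
        "files_deleted": len(doomed),
        "bytes_freed": sum(f.size_bytes for f in doomed),
    }


now = datetime(2024, 1, 31, tzinfo=timezone.utc)
listing = [
    StoredFile("part-0001.parquet", 100, now - timedelta(days=10)),  # referenced
    StoredFile("part-0002.parquet", 250, now - timedelta(days=10)),  # unreferenced, old
    StoredFile("part-0003.parquet", 400, now - timedelta(days=2)),   # unreferenced, recent
    StoredFile("_change_data/cdc-0001.parquet", 50, now - timedelta(days=10)),
    StoredFile("_change_data/cdc-0002.parquet", 70, now - timedelta(days=45)),
]
referenced = {"part-0001.parquet"}

doomed = plan_vacuum(referenced, listing, now)
print([f.path for f in doomed])
print(summarize(doomed))
```

In this model the 10-day-old CDF file survives even though an unreferenced data file of the same age is selected, which is the effect of the separate CDF window. The list returned by `plan_vacuum` corresponds loosely to what a DRY RUN would preview; nothing is deleted in the sketch.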

Parameters

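| Name | Type | Description |
|------|------|-------------|
| table | | Specifies the name or path of the Delta table to vacuum. The table must be registered in the session (via CREATE DELTA TABLE or OPEN DELTA TABLE). Fully qualified names (zone.schema.table) are supported. |
| retention_hours | | Retention period in hours for data files, set with the RETAIN clause. Defaults to 168 hours (7 days). |
| cdc_retention_hours | | Retention period in hours for CDC files, set with the CDC_RETAIN clause. Defaults to the log retention duration (`delta.logRetentionDuration`, 30 days). |
| dry_run | | Set with DRY RUN. If true, only lists the files that would be deleted, without removing them. |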
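
To show how these parameters map onto the clauses in the Syntax section, here is a minimal sketch of a client-side helper that assembles a VACUUM statement. `build_vacuum` and `below_default_retention` are hypothetical names, not part of any DeltaForge API. The integer check is an assumption made for the sketch, and the table name is interpolated as given, so it should come from trusted input only.

```python
DEFAULT_RETAIN_HOURS = 168  # default data-file retention stated on this page


def build_vacuum(table, retain_hours=None, cdc_retain_hours=None, dry_run=False):
    """Assemble a VACUUM statement string from the documented parameters."""
    for name, value in (("retain_hours", retain_hours),
                        ("cdc_retain_hours", cdc_retain_hours)):
        if value is None:
            continue
        # Assumption: retention values are non-negative whole hours.
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            raise ValueError(f"{name} must be a non-negative integer, got {value!r}")

    parts = ["VACUUM", table]
    if retain_hours is not None:
        parts.append(f"RETAIN {retain_hours} HOURS")
    if cdc_retain_hours is not None:
        parts.append(f"CDC_RETAIN {cdc_retain_hours} HOURS")
    if dry_run:
        parts.append("DRY RUN")
    return " ".join(parts) + ";"


def below_default_retention(retain_hours):
    """True when an explicit retention is shorter than the 168-hour default."""
    return retain_hours is not None and retain_hours < DEFAULT_RETAIN_HOURS


print(build_vacuum("customers"))
print(build_vacuum("orders", retain_hours=24, cdc_retain_hours=720))
print(build_vacuum("orders", retain_hours=48, dry_run=True))
```

A caller could use `below_default_retention` to require an explicit confirmation, or a DRY RUN first, before issuing a statement that shortens retention below the 7-day safety margin.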

Examples

-- Remove unreferenced files older than the default 7-day retention
VACUUM customers;
-- Preview which files would be deleted
VACUUM customers DRY RUN;
-- Use a custom retention period of 72 hours
VACUUM customers RETAIN 72 HOURS;
-- Retain data files for 24 hours and CDF files for 720 hours (30 days)
VACUUM orders RETAIN 24 HOURS CDC_RETAIN 720 HOURS;
-- Vacuum with the optional TABLE keyword
VACUUM TABLE sales.transactions RETAIN 168 HOURS;
-- Preview deletions for a 48-hour retention period without removing any files
VACUUM orders RETAIN 48 HOURS DRY RUN;

Pitfalls

See Also
