Pseudonymisation Lifecycle: Insurance Claims

Insurance-claims walkthrough of the full pseudonymisation rule lifecycle: CREATE five rules with PRIORITY, SHOW them, SELECT through them, DROP a single rule, and verify that surviving rules still protect PII after the catalog sync.

Category: pseudonymisation

Syntax

-- ============================================================================
-- SETUP: Insurance claims + 5 pseudonymisation rules with PRIORITY
-- ============================================================================
CREATE ZONE IF NOT EXISTS {{zone_name}}
    TYPE EXTERNAL
    COMMENT 'External tables, demo datasets and file-backed data';

CREATE SCHEMA IF NOT EXISTS {{zone_name}}.pseudonymisation_demos
    COMMENT 'Insurance claims with protection rules';

CREATE DELTA TABLE IF NOT EXISTS {{zone_name}}.pseudonymisation_demos.insurance_claims (
    claim_id         VARCHAR,
    policy_holder_id VARCHAR,
    claimant_name    VARCHAR,
    date_of_birth    VARCHAR,
    ssn              VARCHAR,
    claim_type       VARCHAR,
    description      VARCHAR,
    amount           DOUBLE,
    status           VARCHAR,
    filed_date       VARCHAR
) LOCATION '{{data_path}}/insurance_claims';

GRANT ADMIN ON TABLE {{zone_name}}.pseudonymisation_demos.insurance_claims TO USER {{current_user}};

-- Five rules, different PRIORITY levels
CREATE PSEUDONYMISATION RULE ON {{zone_name}}.pseudonymisation_demos.insurance_claims (ssn)
    TRANSFORM redact  PRIORITY 20
    PARAMS (mask = '***-**-****');

CREATE PSEUDONYMISATION RULE ON {{zone_name}}.pseudonymisation_demos.insurance_claims (claimant_name)
    TRANSFORM keyed_hash  SCOPE person  PRIORITY 10
    PARAMS (salt = 'insurance_name_salt');

CREATE PSEUDONYMISATION RULE ON {{zone_name}}.pseudonymisation_demos.insurance_claims (date_of_birth)
    TRANSFORM generalize  SCOPE relationship
    PARAMS (range = 10);

CREATE PSEUDONYMISATION RULE ON {{zone_name}}.pseudonymisation_demos.insurance_claims (policy_holder_id)
    TRANSFORM tokenize  SCOPE person  PRIORITY 5;

CREATE PSEUDONYMISATION RULE ON {{zone_name}}.pseudonymisation_demos.insurance_claims (description)
    TRANSFORM mask
    PARAMS (show = 10);

-- ============================================================================
-- QUERIES: Lifecycle: list, read, aggregate, drop one, re-verify
-- ============================================================================
-- 1. Enumerate the five rules
ASSERT ROW_COUNT = 5
SHOW PSEUDONYMISATION RULES FOR {{zone_name}}.pseudonymisation_demos.insurance_claims;

-- 2. Protected SELECT across all 5 columns
ASSERT ROW_COUNT = 5
ASSERT VALUE claim_type   = 'Auto'        WHERE claim_id = 'CLM-2024-001'
ASSERT VALUE ssn_redacted = '***-**-****' WHERE claim_id = 'CLM-2024-001'
SELECT claim_id,
       policy_holder_id AS holder_token,
       claimant_name    AS name_hash,
       date_of_birth    AS dob_generalized,
       ssn              AS ssn_redacted,
       claim_type,
       description      AS desc_masked,
       amount, status
FROM {{zone_name}}.pseudonymisation_demos.insurance_claims;

-- 3. Aggregations untouched by pseudonymisation
ASSERT ROW_COUNT = 3
ASSERT VALUE claim_count  = 2       WHERE claim_type = 'Auto'
ASSERT VALUE total_amount = 7700.00 WHERE claim_type = 'Auto'
SELECT claim_type, COUNT(*) AS claim_count, SUM(amount) AS total_amount
FROM {{zone_name}}.pseudonymisation_demos.insurance_claims
GROUP BY claim_type;

-- 4. Selective DROP, remove the description mask only
DROP PSEUDONYMISATION RULE ON {{zone_name}}.pseudonymisation_demos.insurance_claims (description);

-- 5. Re-verify, catalog sync is asynchronous, so <= 5
ASSERT ROW_COUNT <= 5
SHOW PSEUDONYMISATION RULES FOR {{zone_name}}.pseudonymisation_demos.insurance_claims;

-- SSN still redacted after the DROP
ASSERT VALUE ssn = '***-**-****' WHERE claim_id = 'CLM-2024-001'
SELECT claim_id, ssn, claim_type, amount, status
FROM {{zone_name}}.pseudonymisation_demos.insurance_claims;

-- ============================================================================
-- CLEANUP
-- ============================================================================
DROP PSEUDONYMISATION RULE ON {{zone_name}}.pseudonymisation_demos.insurance_claims;
DROP TABLE   IF EXISTS {{zone_name}}.pseudonymisation_demos.insurance_claims;
DROP SCHEMA  IF EXISTS {{zone_name}}.pseudonymisation_demos;
DROP ZONE    IF EXISTS {{zone_name}};

Description

## When to Use Use this demo when your compliance or data-protection team needs to see the day-two operational story for pseudonymisation, not just creating rules, but reviewing, selectively dropping, and auditing coverage over time. The scenario is an insurance claims pipeline: routine analytics needs five PII columns protected (SSN, claimant name, date of birth, policy-holder id, free-text description), but a policy review later determines that the `description` column no longer needs masking. This demo exercises exactly that motion: DROP one rule, leave the other four intact, confirm the result with SHOW. It is also the canonical demo for understanding rule PRIORITY: five rules on the same table with priorities 0, 5, 10, 20, the highest priority wins when more than one rule matches a column. ## What You Will Learn 1. How to register five heterogeneous rules on one table (`redact`, `keyed_hash`, `generalize`, `tokenize`, `mask`) covering all the common PII shapes a claims team sees [ref: CREATE_PSEUDONYM_RULE] 2. How PRIORITY resolves overlapping column patterns: higher priority wins, and ties break by creation order 3. How SHOW PSEUDONYMISATION RULES FOR `<table>` becomes the audit surface, call it before, after DROP, and again post-cleanup to prove rule count [ref: SHOW_PSEUDONYM_RULES] 4. How DROP PSEUDONYMISATION RULE ON `<table>` `(column)` removes exactly one rule, leaves the others untouched, and propagates through the catalog sync so subsequent SELECT sees the un-masked column immediately 5. Why the post-DROP ASSERT uses `ROW_COUNT <= 5` rather than `= 4`: catalog sync is asynchronous, so the harness tolerates either the pre- or post-sync state [ref: DROP_PSEUDONYM_RULE] ## Prerequisites - Workspace with CREATE / DROP authority on zones, schemas, Delta tables, and pseudonymisation rules - `{{data_path}}/insurance_claims` reachable and writable - KeyStore configured for the two salts used (`insurance_name_salt`), test environments may keep these in session memory - Understanding that `tokenize` emits opaque `TOK_...` strings that are only re-identifiable by the tokenisation service, unlike `keyed_hash` which a holder of the salt can re-derive

Pitfalls

DROP PSEUDONYMISATION RULE ON `<table>` with no column pattern drops every rule on that table, correct for cleanup, disastrous for lifecycle management. In production always pass the column pattern explicitly: `DROP ... (description)`, not `DROP ... ;`.
The demo asserts `ROW_COUNT <= 5` after DROP on purpose: catalog sync is asynchronous, so a fast-running harness can see the pre-sync (still 5) or post-sync (now 4) snapshot. If your own tests require a hard `= 4`, issue `SYNC CATALOG` first or poll until the count stabilises, do not tighten the assertion without that guard.
PRIORITY only matters when more than one rule's column pattern matches the same column. In this demo every pattern is an exact column name, so the priorities (20, 10, 5, 0, 0) never actually arbitrate anything, removing them would be functionally equivalent. Priorities become load-bearing as soon as you introduce wildcard rules like `*_name`.
`tokenize` emits a `TOK_` + 16-char SHA256 substring, which is re-identifiable only if you retain the original-to-token mapping externally. If the tokenisation service (or the salt it used) is lost, policy_holder_id becomes permanently opaque, this is a feature for minimisation, a footgun for operational recovery. Back up your token map with the same rigour as an encryption key.
`keyed_hash` with a plain-text `salt = 'insurance_name_salt'` is deterministic and linkable, do not treat that salt as a non-secret string. Anyone with the salt can re-hash a short list of candidate names offline and re-identify every row. Rotate salts on a policy schedule, and budget for the cascade of downstream joins that rotation invalidates.
`generalize` with `range = 10` on `date_of_birth` produces decade buckets. Combined with `zip_code` and `gender` (absent from this table but typically adjacent), this is well within the Sweeney re-identification threshold, generalize is a layer, not a guarantee. Treat k-anonymity claims as quantitative, not categorical.
`mask` with `show = 10` on `description` leaves the first 10 characters readable, which is often the claim type and location ("Collision ", "Water dama", "Storm dama"). For free-text columns this frequently leaks enough context to re-identify the claim from public news sources. Prefer `redact` for free text unless you have measured the leakage.
Rules drop does not propagate backwards in time, prior query results already materialised (e.g. exported to CSV, cached in a BI tool) retain whatever transformation was active at export time. Lifecycle audits must include the downstream export inventory, not just SHOW PSEUDONYMISATION RULES.
GDPR retention alignment: dropping a rule un-masks PII immediately for new queries, but does not re-hydrate already-erased rows if you had previously run APPLY PSEUDONYMISATION. CREATE/DROP operate on the rule catalog; APPLY operates on the data. Keep the two lifecycles explicitly separated in your runbook.

Pseudonymisation Lifecycle: Insurance Claims

Syntax

Description

Pitfalls

See Also