Pharma de-identification workflow: register query-time rules on a clinical-trial table, verify transforms with SHOW and SELECT, and aggregate efficacy by treatment arm while PII stays protected, the on-ramp to APPLY PSEUDONYMISATION for GDPR Article 17 erasure.
-- ============================================================================
-- SETUP: Trial participants table + 3 query-time rules
-- ============================================================================
CREATE ZONE IF NOT EXISTS {{zone_name}}
TYPE EXTERNAL
COMMENT 'External tables, demo datasets and file-backed data';
CREATE SCHEMA IF NOT EXISTS {{zone_name}}.pseudonymisation_demos
COMMENT 'Clinical trial data with protection rules';
CREATE DELTA TABLE IF NOT EXISTS {{zone_name}}.pseudonymisation_demos.trial_participants (
subject_id VARCHAR,
trial_id VARCHAR,
trial_phase VARCHAR,
participant_name VARCHAR,
date_of_birth VARCHAR,
email VARCHAR,
ssn VARCHAR,
status VARCHAR,
treatment_arm VARCHAR,
efficacy_score DOUBLE,
outcome VARCHAR
) LOCATION '{{data_path}}/trial_participants';
GRANT ADMIN ON TABLE {{zone_name}}.pseudonymisation_demos.trial_participants TO USER {{current_user}};
-- Rule 1: hard-redact SSN
CREATE PSEUDONYMISATION RULE ON {{zone_name}}.pseudonymisation_demos.trial_participants (ssn)
TRANSFORM redact
PARAMS (mask = '***-**-****');
-- Rule 2: deterministic pseudonym for participant_name, joinable across tables
CREATE PSEUDONYMISATION RULE ON {{zone_name}}.pseudonymisation_demos.trial_participants (participant_name)
TRANSFORM keyed_hash
SCOPE person
PARAMS (salt = 'trial_name_salt_2024');
-- Rule 3: mask email, keep first 3 chars for support triage
CREATE PSEUDONYMISATION RULE ON {{zone_name}}.pseudonymisation_demos.trial_participants (email)
TRANSFORM mask
PARAMS (show = 3);
-- ============================================================================
-- QUERIES: Audit, read, aggregate, attrition
-- ============================================================================
-- Confirm exactly the 3 rules we registered
ASSERT ROW_COUNT = 3
SHOW PSEUDONYMISATION RULES FOR {{zone_name}}.pseudonymisation_demos.trial_participants;
-- Protected SELECT, clinical columns unchanged
ASSERT ROW_COUNT = 6
ASSERT VALUE ssn_redacted = '***-**-****' WHERE subject_id = 'SUBJ-001'
ASSERT VALUE treatment_arm = 'Drug A' WHERE subject_id = 'SUBJ-001'
SELECT subject_id,
participant_name AS name_hash,
date_of_birth,
email AS email_masked,
ssn AS ssn_redacted,
status, trial_id, treatment_arm, efficacy_score, outcome
FROM {{zone_name}}.pseudonymisation_demos.trial_participants
ORDER BY subject_id;
-- Efficacy by arm, the researcher's actual question
ASSERT ROW_COUNT = 3
ASSERT VALUE participant_count = 2 WHERE treatment_arm = 'Drug A'
SELECT treatment_arm,
COUNT(*) AS participant_count,
ROUND(AVG(efficacy_score), 2) AS avg_efficacy
FROM {{zone_name}}.pseudonymisation_demos.trial_participants
GROUP BY treatment_arm
ORDER BY avg_efficacy DESC;
-- ============================================================================
-- CLEANUP
-- ============================================================================
DROP PSEUDONYMISATION RULE ON {{zone_name}}.pseudonymisation_demos.trial_participants;
DROP TABLE IF EXISTS {{zone_name}}.pseudonymisation_demos.trial_participants;
DROP SCHEMA IF EXISTS {{zone_name}}.pseudonymisation_demos;
DROP ZONE IF EXISTS {{zone_name}};
## When to Use Use this demo when preparing a clinical-trial dataset for external researchers, a CRO handoff, or a regulatory submission where participants' identities must never leave the institution but the trial's statistical content must. It walks through a realistic Phase II/III layout (two trials, three treatment arms including placebo, attrition statuses) and shows how query-time pseudonymisation lets analysts run efficacy aggregations without ever reading a real name, SSN, or email. It is also the right demo to understand the boundary between query-time rules (CREATE PSEUDONYMISATION RULE, reversible by DROP) and permanent transformation (APPLY PSEUDONYMISATION, which generates an UPDATE and writes new Delta files). Withdrawn participants are the canonical use case for APPLY: once a subject exercises GDPR Article 17 right to erasure, a rule alone is insufficient because the raw bytes remain in the Delta log. ## What You Will Learn 1. How to protect three PII columns (SSN, participant name, email) with three different transforms (`redact`, `keyed_hash` SCOPE person, `mask`) in one setup script [ref: CREATE_PSEUDONYM_RULE] 2. How SCOPE `PERSON` keyed_hash gives researchers a stable pseudonym for a participant across visits and tables, enabling longitudinal analysis without exposing identity 3. How treatment-arm aggregations (`AVG(efficacy_score) GROUP BY treatment_arm`) remain byte-exact under pseudonymisation because the transform wraps the TableProvider, not the stored columns 4. How to audit active protection with SHOW PSEUDONYMISATION RULES FOR `<table>` before handing off to an external party [ref: SHOW_PSEUDONYM_RULES] 5. The conceptual jump to APPLY PSEUDONYMISATION: the same transform types, but executed as an UPDATE against the Delta table, producing new data files and triggering the need for a VACUUM to close the time-travel window [ref: APPLY_PSEUDONYM] ## Prerequisites - DeltaForge workspace with permission to CREATE ZONE, CREATE SCHEMA, CREATE DELTA TABLE, and GRANT ADMIN - `{{data_path}}/trial_participants` resolvable and writable - KeyStore configured for the `keyed_hash` salt (`trial_name_salt_2024`), in production this resolves through the VaultKeyStore; in a local sandbox the salt is held in session memory - Familiarity with GDPR Article 17 (right to erasure) and ICH GCP E6(R2) data-integrity expectations so you can distinguish query-time protection from actual erasure