Pseudonymisation Apply: Clinical Trial De-identification

Pharma de-identification workflow: register query-time rules on a clinical-trial table, verify transforms with SHOW and SELECT, and aggregate efficacy by treatment arm while PII stays protected, the on-ramp to APPLY PSEUDONYMISATION for GDPR Article 17 erasure.

Category: pseudonymisation

Syntax

-- ============================================================================
-- SETUP: Trial participants table + 3 query-time rules
-- ============================================================================
CREATE ZONE IF NOT EXISTS {{zone_name}}
    TYPE EXTERNAL
    COMMENT 'External tables, demo datasets and file-backed data';

CREATE SCHEMA IF NOT EXISTS {{zone_name}}.pseudonymisation_demos
    COMMENT 'Clinical trial data with protection rules';

CREATE DELTA TABLE IF NOT EXISTS {{zone_name}}.pseudonymisation_demos.trial_participants (
    subject_id       VARCHAR,
    trial_id         VARCHAR,
    trial_phase      VARCHAR,
    participant_name VARCHAR,
    date_of_birth    VARCHAR,
    email            VARCHAR,
    ssn              VARCHAR,
    status           VARCHAR,
    treatment_arm    VARCHAR,
    efficacy_score   DOUBLE,
    outcome          VARCHAR
) LOCATION '{{data_path}}/trial_participants';

GRANT ADMIN ON TABLE {{zone_name}}.pseudonymisation_demos.trial_participants TO USER {{current_user}};

-- Rule 1: hard-redact SSN
CREATE PSEUDONYMISATION RULE ON {{zone_name}}.pseudonymisation_demos.trial_participants (ssn)
    TRANSFORM redact
    PARAMS (mask = '***-**-****');

-- Rule 2: deterministic pseudonym for participant_name, joinable across tables
CREATE PSEUDONYMISATION RULE ON {{zone_name}}.pseudonymisation_demos.trial_participants (participant_name)
    TRANSFORM keyed_hash
    SCOPE person
    PARAMS (salt = 'trial_name_salt_2024');

-- Rule 3: mask email, keep first 3 chars for support triage
CREATE PSEUDONYMISATION RULE ON {{zone_name}}.pseudonymisation_demos.trial_participants (email)
    TRANSFORM mask
    PARAMS (show = 3);

-- ============================================================================
-- QUERIES: Audit, read, aggregate, attrition
-- ============================================================================
-- Confirm exactly the 3 rules we registered
ASSERT ROW_COUNT = 3
SHOW PSEUDONYMISATION RULES FOR {{zone_name}}.pseudonymisation_demos.trial_participants;

-- Protected SELECT, clinical columns unchanged
ASSERT ROW_COUNT = 6
ASSERT VALUE ssn_redacted = '***-**-****' WHERE subject_id = 'SUBJ-001'
ASSERT VALUE treatment_arm = 'Drug A'     WHERE subject_id = 'SUBJ-001'
SELECT subject_id,
       participant_name AS name_hash,
       date_of_birth,
       email            AS email_masked,
       ssn              AS ssn_redacted,
       status, trial_id, treatment_arm, efficacy_score, outcome
FROM {{zone_name}}.pseudonymisation_demos.trial_participants
ORDER BY subject_id;

-- Efficacy by arm, the researcher's actual question
ASSERT ROW_COUNT = 3
ASSERT VALUE participant_count = 2 WHERE treatment_arm = 'Drug A'
SELECT treatment_arm,
       COUNT(*)                       AS participant_count,
       ROUND(AVG(efficacy_score), 2)  AS avg_efficacy
FROM {{zone_name}}.pseudonymisation_demos.trial_participants
GROUP BY treatment_arm
ORDER BY avg_efficacy DESC;

-- ============================================================================
-- CLEANUP
-- ============================================================================
DROP PSEUDONYMISATION RULE ON {{zone_name}}.pseudonymisation_demos.trial_participants;
DROP TABLE   IF EXISTS {{zone_name}}.pseudonymisation_demos.trial_participants;
DROP SCHEMA  IF EXISTS {{zone_name}}.pseudonymisation_demos;
DROP ZONE    IF EXISTS {{zone_name}};

Description

## When to Use Use this demo when preparing a clinical-trial dataset for external researchers, a CRO handoff, or a regulatory submission where participants' identities must never leave the institution but the trial's statistical content must. It walks through a realistic Phase II/III layout (two trials, three treatment arms including placebo, attrition statuses) and shows how query-time pseudonymisation lets analysts run efficacy aggregations without ever reading a real name, SSN, or email. It is also the right demo to understand the boundary between query-time rules (CREATE PSEUDONYMISATION RULE, reversible by DROP) and permanent transformation (APPLY PSEUDONYMISATION, which generates an UPDATE and writes new Delta files). Withdrawn participants are the canonical use case for APPLY: once a subject exercises GDPR Article 17 right to erasure, a rule alone is insufficient because the raw bytes remain in the Delta log. ## What You Will Learn 1. How to protect three PII columns (SSN, participant name, email) with three different transforms (`redact`, `keyed_hash` SCOPE person, `mask`) in one setup script [ref: CREATE_PSEUDONYM_RULE] 2. How SCOPE `PERSON` keyed_hash gives researchers a stable pseudonym for a participant across visits and tables, enabling longitudinal analysis without exposing identity 3. How treatment-arm aggregations (`AVG(efficacy_score) GROUP BY treatment_arm`) remain byte-exact under pseudonymisation because the transform wraps the TableProvider, not the stored columns 4. How to audit active protection with SHOW PSEUDONYMISATION RULES FOR `<table>` before handing off to an external party [ref: SHOW_PSEUDONYM_RULES] 5. The conceptual jump to APPLY PSEUDONYMISATION: the same transform types, but executed as an UPDATE against the Delta table, producing new data files and triggering the need for a VACUUM to close the time-travel window [ref: APPLY_PSEUDONYM] ## Prerequisites - DeltaForge workspace with permission to CREATE ZONE, CREATE SCHEMA, CREATE DELTA TABLE, and GRANT ADMIN - `{{data_path}}/trial_participants` resolvable and writable - KeyStore configured for the `keyed_hash` salt (`trial_name_salt_2024`), in production this resolves through the VaultKeyStore; in a local sandbox the salt is held in session memory - Familiarity with GDPR Article 17 (right to erasure) and ICH GCP E6(R2) data-integrity expectations so you can distinguish query-time protection from actual erasure

Pitfalls

See Also

Open in interactive docs →   DeltaForge home →