Creates a runtime data transformation rule for GDPR compliance.
CREATE PSEUDONYMISATION RULE ON <table> (<column_pattern>)
TRANSFORM <type>
[SCOPE TRANSACTION|RELATIONSHIP|PERSON]
[PRIORITY <n>]
[PARAMS (<key>=<val>, ...)]
[EXEMPT ROLES (<role>, ...) [USERS (<user>, ...)]
| EXEMPT USERS (<user>, ...) [ROLES (<role>, ...)]]
## Overview CREATE PSEUDONYMISATION RULE registers a data transformation rule that is applied transparently at query time. When a user issues a SELECT against the target table, matching columns are transformed according to the specified function before results are returned. The underlying data on disk remains unchanged. This non-destructive approach supports GDPR Article 4(5) pseudonymisation requirements while preserving the ability to use Delta log time travel to access historical snapshots. Rules are persisted to the catalog via the control plane API when a CatalogRouter is configured (the standard compute node deployment). In standalone or testing contexts, rules are held in session-scoped memory and do not survive restarts. ## Behavior - After a rule is created, the executor triggers a catalog sync so that all registered TableProviders are re-wrapped with a PseudonymTableProvider. Subsequent SELECT queries against the target table will return transformed values for matching columns. - Transforms are applied at query execution time by wrapping the underlying TableProvider. Numeric aggregations (COUNT, SUM, AVG) operate on the original stored values and produce correct results because the transform intercepts display output, not the stored bytes. - The column pattern is matched against each column name in the table schema. Wildcard patterns use glob semantics ('*' matches any sequence, '?' matches one character). Regex patterns must be enclosed in forward slashes. - The scope parameter controls a ValueCache that ensures consistent pseudonymisation. PERSON scope produces the same pseudonym for an individual across all tables and queries. RELATIONSHIP scope produces the same pseudonym for related records within a logical group. TRANSACTION scope produces consistent pseudonyms only within a single query execution. - Cryptographic key material for keyed_hash and encrypt transforms is resolved through the KeyStore interface. In production deployments, the VaultKeyStore fetches keys from the universal Key Vault (backed by the OS credential store) with DashMap caching for lock-free subsequent access. - Multiple rules can target the same table. When a column matches more than one rule, the rule with the highest priority value is selected. - The optional EXEMPT clause carves principals out of the rule. A rule with no EXEMPT clause applies to every principal (the historical default). When EXEMPT ROLES (...) and/or EXEMPT USERS (...) are present, a principal reads raw values for the matched column iff their user identifier is listed in EXEMPT USERS or any of their resolved roles is listed in EXEMPT ROLES. The cross-match runs on the compute node against the principal's roles and user_id carried in the permission token; exempt principals get the rule filtered out before the table provider is wrapped, so there is no per-row overhead on the unmask path. - Pipeline runners are principals too. A pipeline executes as its configured service principal, and that principal's role memberships drive its exempt status the same way an interactive user's do. The default-secure behaviour is that pipelines see and persist transformed values; grant the pipeline's service principal a role listed in EXEMPT ROLES if a particular pipeline must read raw values. ## Compatibility This command aligns with GDPR Article 4(5) pseudonymisation and supports HIPAA Safe Harbor de-identification workflows. The seven built-in transform types cover the most common data protection patterns for healthcare (HL7, FHIR), financial (EDI, KYC), and human resources data.
| Name | Type | Description |
|---|---|---|
table_name | Specify the fully qualified table name in zone.schema.table format. The table must already exist in the catalog. | |
column_pattern | Specify a column name, wildcard pattern (e.g., '*_name', 'patient_*'), or regex pattern (e.g., '/ssn|tax_id/'). The pattern type is auto-detected from the pattern content: patterns containing '*' or '?' are treated as wildcard; patterns wrapped in '/' are treated as regex; all others are exact matches. | |
pattern_type | Override the auto-detected pattern matching strategy. Valid values: EXACT (match column name literally), WILDCARD (glob-style with '*' and '?' characters), REGEX (full regular expression). Defaults to EXACT when the pattern contains no special characters. | |
transform_type | Specify the pseudonymisation transform to apply at query time. Valid values: keyed_hash (deterministic SHA256 with salt for linkable pseudonyms), hash (one-way SHA256 fingerprint without salt), encrypt (reversible encryption requiring a key for decryption), redact (full replacement with a fixed mask string), generalize (reduce numeric or date precision to a configurable range), tokenize (generate an opaque 'TOK_' prefixed token from a SHA256 substring), mask (partial visibility showing the first N characters with the remainder replaced by '*'). | |
scope | Control the consistency boundary for pseudonymised values. Valid values: TRANSACTION (same pseudonym only within a single query execution), RELATIONSHIP (same pseudonym for related records such as a patient across multiple visits), PERSON (same pseudonym for an individual across the entire dataset). Defaults to RELATIONSHIP. | |
priority | Set the rule evaluation priority. Higher values are evaluated first when multiple rules match the same column. Defaults to 0. | |
params | Provide transform-specific parameters as key-value pairs enclosed in parentheses. Common parameters: 'key' (encryption/hash key identifier for keyed_hash and encrypt), 'replacement' (mask string for redact, defaults to '***REDACTED***'), 'mask' (replacement string for redact), 'show' (number of leading characters to preserve for mask, defaults to 4), 'range' (bucket size for generalize, defaults to 10), 'algorithm' (encryption algorithm for encrypt). | |
exempt_roles | Roles whose members read raw values for the matched column despite this rule. Listed inside EXEMPT ROLES (...). The principal is exempt iff at least one of their resolved roles appears in this list. Empty (or clause omitted) means the rule applies to every principal. | |
exempt_users | User identifiers (typically email or username) listed inside EXEMPT USERS (...) whose principals read raw values despite this rule. Identifiers may be bare identifiers or quoted strings (use quoted strings for emails containing '@'). Empty (or clause omitted) means the rule applies to every user. |
CREATE PSEUDONYMISATION RULE ON medical.patients (patient_name)
TRANSFORM keyed_hash
SCOPE PERSON;
CREATE PSEUDONYMISATION RULE ON hr.employees (*_ssn)
TRANSFORM redact
PARAMS (replacement = '***-**-****');
CREATE PSEUDONYMISATION RULE ON finance.transactions (account_number)
TRANSFORM mask
SCOPE TRANSACTION
PARAMS (show = 4);
CREATE PSEUDONYMISATION RULE ON clinical.trial_participants (date_of_birth)
TRANSFORM generalize
SCOPE PERSON
PARAMS (range = 10);
CREATE PSEUDONYMISATION RULE ON insurance.claims (*_name)
TRANSFORM keyed_hash
SCOPE RELATIONSHIP
PRIORITY 10;
CREATE PSEUDONYMISATION RULE ON edi.hipaa_claims (bpr_8)
TRANSFORM redact
PARAMS (replacement = '**********');