pii-classification-checklist.md

PII Classification and Masking Checklist

Direct identifiers

These are always PII regardless of context:

  • Names -- full names, first names, last names
  • Email addresses -- personal or work
  • Phone numbers -- mobile or landline
  • Government IDs -- social security numbers, national ID numbers, passport numbers
  • Financial identifiers -- bank account numbers, credit card numbers

If a column contains any of these, it requires masking in the pipeline. No judgment needed -- these are PII by definition.

Indirect identifiers

These are not obviously PII on their own, but they can identify someone in a small population or in combination with other columns:

  • Job role + location + date -- in a small team, a "Senior Blade Technician" at "Farm DK-01" on a specific date may identify exactly one person
  • Rare category values -- a job title held by only one person in the organization
  • Geographic precision -- exact coordinates combined with timestamps can trace an individual's movements
  • Age or birth year + location -- in small populations, narrows to very few individuals
  • Transaction patterns -- unique purchasing or maintenance patterns that fingerprint an individual

The test: in a small population, could this value (alone or combined with other available columns) narrow identification to one person?
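
One way to apply this test is to count how many rows share each combination of candidate columns and flag combinations that occur exactly once. The sketch below assumes rows are available as Python dictionaries; the column names and sample records are hypothetical, chosen to mirror the example above.

```python
from collections import Counter

def singleton_groups(rows, columns):
    """Return the value combinations of `columns` that occur exactly once in `rows`."""
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    return [combo for combo, n in counts.items() if n == 1]

# Hypothetical maintenance records with names already removed.
records = [
    {"job_role": "Senior Blade Technician", "farm": "DK-01", "date": "2024-03-01"},
    {"job_role": "Technician", "farm": "DK-01", "date": "2024-03-01"},
    {"job_role": "Technician", "farm": "DK-01", "date": "2024-03-01"},
]

risky = singleton_groups(records, ["job_role", "farm", "date"])
print(risky)  # [('Senior Blade Technician', 'DK-01', '2024-03-01')]
```

Any combination returned here points to exactly one row, so those columns together should be treated as identifying even though no name is present.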

Classification process

For each column in your pipeline:

  1. Is it a direct identifier? Names, emails, phone numbers, government IDs, financial identifiers. If yes, mask it.
  2. Could it identify someone when combined with other columns in the same dataset? A technician's job role combined with the farm they work at and the date of a maintenance event may identify them in a team of three.
  3. In a small population, could this value narrow to one person? If only one technician specializes in gearbox maintenance at a specific farm, their maintenance records are identifiable even without their name.

Document your classification decision for each column. The decision is a judgment -- record the reasoning so it can be reviewed.
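
A decision log can be as simple as one structured record per column. The sketch below is one possible shape, not a required format; the `ColumnDecision` class, its field names, and the example columns are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class ColumnDecision:
    column: str
    classification: str  # "direct", "indirect", or "not_pii"
    masking: str         # "hash", "redact", "generalize", or "none"
    reasoning: str       # why this call was made, so a reviewer can challenge it

# Hypothetical decisions for a maintenance dataset.
decisions = [
    ColumnDecision("technician_name", "direct", "hash",
                   "Name; still needed as a join key across tables."),
    ColumnDecision("job_role", "indirect", "generalize",
                   "Some roles are held by one person at a farm."),
    ColumnDecision("turbine_model", "not_pii", "none",
                   "Describes equipment, not a person."),
]

# Serialize so the log can be stored and reviewed alongside the pipeline code.
decision_log = json.dumps([asdict(d) for d in decisions], indent=2)
print(decision_log)
```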

Masking approaches

SHA-256 hashing

Replaces the value with a fixed-length hash. Deterministic -- the same input always produces the same hash, which means hashed values can still be used as join keys.

Use when: You need to join on the masked field (e.g., linking maintenance records to other tables by technician). The hash preserves referential integrity without exposing the original value.

Risk: Without salt, SHA-256 on low-cardinality fields is effectively reversible. If a field has only 50 possible values (like technician names in a 250-person company), an attacker can hash all 50 and match. Always salt low-cardinality fields.
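
The join-key property can be illustrated with Python's standard `hashlib`. This is a minimal sketch: the `mask` helper, the table contents, and the salt value are hypothetical, and the salt is hard-coded only to keep the example self-contained.

```python
import hashlib

def mask(value: str, salt: str) -> str:
    """SHA-256 of the value with the salt appended, as a hex string."""
    return hashlib.sha256((value + salt).encode("utf-8")).hexdigest()

# Hypothetical salt; in a real pipeline it would come from a secret store.
SALT = "example-salt-not-for-production"

technicians = [
    {"name": "Anna Jensen", "farm": "DK-01"},
    {"name": "Lars Holm", "farm": "DK-02"},
]
maintenance = [
    {"technician": "Anna Jensen", "task": "gearbox"},
    {"technician": "Anna Jensen", "task": "blade"},
    {"technician": "Lars Holm", "task": "blade"},
]

# Mask both tables with the same function and salt.
farm_by_key = {mask(t["name"], SALT): t["farm"] for t in technicians}
masked_maintenance = [
    {"technician_key": mask(m["technician"], SALT), "task": m["task"]}
    for m in maintenance
]

# The join still works because equal inputs produce equal hashes.
joined = [(m["task"], farm_by_key[m["technician_key"]]) for m in masked_maintenance]
print(joined)  # [('gearbox', 'DK-01'), ('blade', 'DK-01'), ('blade', 'DK-02')]
```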

Redaction

Replaces the value with a constant (e.g., "REDACTED" or "***"). Irreversible. Breaks joins.

Use when: The field is not needed for analysis or joining. The value only needs to exist in the schema for structural completeness.

Generalization

Reduces precision. Exact age becomes age band (20-29). Exact location becomes region. Exact timestamp becomes date.

Use when: The analytical value is in the category, not the exact value. Preserves some utility while reducing identification risk.
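
Redaction and generalization are both small column transforms. The helpers below are a sketch under assumptions: the function names, the ten-year band width, and the farm-to-region mapping are hypothetical and would be replaced by whatever the pipeline actually needs.

```python
from datetime import datetime

def redact(_value) -> str:
    """Replace any value with a constant; the original cannot be recovered."""
    return "REDACTED"

def age_band(age: int, width: int = 10) -> str:
    """Map an exact age to a band such as '20-29'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def to_date(timestamp: str) -> str:
    """Drop the time of day from an ISO 8601 timestamp."""
    return datetime.fromisoformat(timestamp).date().isoformat()

# Hypothetical mapping from an exact site to a coarser region.
REGION_BY_FARM = {"DK-01": "Denmark", "DK-02": "Denmark"}

def to_region(farm: str) -> str:
    return REGION_BY_FARM.get(farm, "UNKNOWN")

print(redact("Anna Jensen"))           # REDACTED
print(age_band(27))                    # 20-29
print(to_date("2024-03-01T14:35:10"))  # 2024-03-01
print(to_region("DK-01"))              # Denmark
```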

Verification surfaces

After implementing masking, verify that PII does not leak through any of these surfaces:

  • Mart models -- query the final tables. Are masked fields showing hashed values, not originals?
  • Staging tables -- can an analyst query the staging layer directly and see unmasked PII?
  • dbt docs output -- run dbt docs generate. Does the documentation contain sample values from the source/staging layer that expose PII?
  • Debug/query logs -- run a query with logging enabled. Do SQL logs contain unmasked values in WHERE clauses or sample data?
  • CI/CD logs -- do test outputs or build logs print PII values?
  • Cached query results -- are any query caches storing unmasked values?

Each surface is an independent leakage vector. Masking the mart layer alone leaves five other surfaces potentially exposed.
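
Several of these checks reduce to the same question: does a piece of text (a log file, a docs artifact, a query result dump) contain a value that should have been masked? The sketch below scans text for known original values and for email-like strings. The `find_leaks` helper, the regular expression, and the sample log are hypothetical; a real check would be pointed at the actual files each surface produces.

```python
import re

# Deliberately simple email pattern; good enough for a leak scan, not for validation.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def find_leaks(text, known_values):
    """Return known PII values and email-like strings that appear in `text`."""
    lowered = text.lower()
    hits = [v for v in known_values if v.lower() in lowered]
    hits += EMAIL_RE.findall(text)
    return sorted(set(hits))

# Hypothetical debug log captured after masking was deployed.
log_output = """
SELECT * FROM stg_maintenance WHERE technician_name = 'Anna Jensen';
row: technician_key=9f2c, contact=anna.jensen@example.com
"""

leaks = find_leaks(log_output, ["Anna Jensen", "Lars Holm"])
print(leaks)  # ['Anna Jensen', 'anna.jensen@example.com']
```

An empty result does not prove a surface is clean; it only shows that none of the values or patterns searched for were present.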

Salt and low-cardinality warning

SHA-256 hashing without salt on a field with few possible values is not secure. Example: if there are 10 technician names in the dataset, an attacker can:

  1. List all 10 names
  2. Hash each one with SHA-256
  3. Match the hashes to the values in the dataset

This takes seconds. The hashing provides no protection.
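
The three steps above can be sketched in a few lines of Python. The candidate names and the masked column are hypothetical; the point is only that a lookup table of unsalted hashes recovers every original value.

```python
import hashlib

def sha256_hex(value: str) -> str:
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

# Step 1: the attacker lists the plausible names (hypothetical examples).
candidate_names = ["Anna Jensen", "Lars Holm", "Mette Berg"]

# The "masked" column as it appears in the dataset: unsalted SHA-256 hashes.
masked_column = [sha256_hex("Lars Holm"), sha256_hex("Anna Jensen")]

# Step 2: hash each candidate to build a hash-to-name lookup table.
lookup = {sha256_hex(name): name for name in candidate_names}

# Step 3: match the dataset's hashes against the lookup table.
recovered = [lookup.get(h) for h in masked_column]
print(recovered)  # ['Lars Holm', 'Anna Jensen']
```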

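Always salt low-cardinality fields. A salt is a random, secret string combined with each value before hashing (for example, appended to it). Use the same salt for every value in a column that must remain joinable, because different salts produce different hashes for the same input. An attacker who knows the hash algorithm but not the salt cannot build the lookup table of candidate hashes. Store the salt securely, separate from the data.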
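
A minimal sketch of salting, assuming one salt shared across the column so that joins still work. The `salted_hash` helper and the names are hypothetical, and the salt is generated inline only for illustration; in practice it would be created once and loaded from a secret store. A keyed hash such as HMAC-SHA-256 (Python's `hmac` module) is a common alternative to appending the salt.

```python
import hashlib
import secrets

def salted_hash(value: str, salt: str) -> str:
    """SHA-256 of the value with the salt appended."""
    return hashlib.sha256((value + salt).encode("utf-8")).hexdigest()

# Generated here for the example; normally created once and kept secret.
salt = secrets.token_hex(16)

masked = salted_hash("Lars Holm", salt)

# The attacker knows the algorithm and the candidate names, but not the salt.
candidate_names = ["Anna Jensen", "Lars Holm", "Mette Berg"]
attacker_lookup = {
    hashlib.sha256(name.encode("utf-8")).hexdigest(): name
    for name in candidate_names
}
print(attacker_lookup.get(masked))  # None: the unsalted lookup table no longer matches

# With the same salt the hash stays deterministic, so it still works as a join key.
print(salted_hash("Lars Holm", salt) == masked)  # True
```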

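High cardinality alone does not make unsalted SHA-256 safe. What matters is whether an attacker can enumerate the possible inputs. A field with thousands of distinct values drawn from a guessable space (names, phone numbers, email addresses at a known domain) can still be matched by hashing every candidate, and SHA-256 is fast to compute. Unsalted SHA-256 provides reasonable protection only when the inputs are effectively unguessable, such as long random identifiers. When in doubt, salt.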