Learn by Directing AI

Step 1: Read the PII classification checklist

Open materials/pii-classification-checklist.md. This framework helps you classify which fields in the pipeline contain personal data and decide how to handle them.

PII classification is not a lookup table. Whether a field is personally identifiable depends on context, combination, and population size. A name is always PII. A job role is not PII alone -- but "Senior Blade Technician" at "Farm DK-01" on a specific date, in a team of three, identifies exactly one person.

Read through the checklist. Pay attention to the distinction between direct identifiers (always PII) and indirect identifiers (PII in combination). The verification surfaces section is critical -- it lists every place PII can leak beyond the output tables.

Step 2: Classify PII with AI

Direct Claude to review all columns across the pipeline for PII. Feed it the data dictionary and the PII classification checklist.

Check what Claude identifies. AI catches the obvious: technician names in the maintenance logs are direct identifiers. But indirect identifiers require judgment. Consider: in a small team at a specific farm, how many people hold a specific job role? If only one technician specializes in gearbox maintenance at Farm DK-01, their maintenance records are identifiable even without their name -- the combination of farm, component type, and date narrows to one person.

AI commonly misses this kind of combination risk. You need to evaluate the population sizes yourself.

Step 3: Build the SHA-256 masking macro

Build a dbt Jinja macro that hashes technician names using SHA-256. The hashing is deterministic -- the same name always produces the same hash -- which means you can still join on the hashed value. If Lars Pedersen appears in both the maintenance logs and another table, both tables produce the same hash, and joins still work.

Direct Claude to build the macro. The macro should accept a column name and return the SHA-256 hash of that column's value.

Consider the salt question. SHA-256 without salt on a low-cardinality field is effectively reversible. There are only 8 technician names in this dataset. An attacker can hash all 8 and match. Salt makes the hash irreversible even with a small population -- but salt must be stored securely and consistently. For this project, implement salted hashing to understand the pattern, even though the risk is contained.

Step 4: Apply masking in the staging layer

Where you apply masking matters. If you mask only in the mart models, the staging and intermediate layers still contain raw names. Any analyst who queries the staging table directly sees unmasked PII.

Apply the masking macro in the staging model for maintenance logs. Every downstream layer -- intermediate, mart, dimension -- should only see the hashed value. The raw name exists only in the source data.

This is a governance design decision. Retrofitting masking after the pipeline is built means PII has already propagated through intermediate layers, documentation, and logs. Applying masking at the staging layer ensures that every downstream surface is clean by construction, not by verification.

Step 5: Check dbt docs for PII leakage

Run dbt docs generate and examine the output. The generated documentation includes column descriptions, model lineage, and sometimes sample values pulled from the source or staging layer.

dbt docs generate

Search the generated documentation for any technician name. If the staging source definition exposes raw data, technician names may appear in the docs even though the mart models are properly masked.

This is a surface you might not have expected. The masking macro protects every SQL query against the mart models. But dbt docs generate reads metadata from the source definitions and the information schema, which may include unmasked values from the raw layer.

Step 6: Check debug logs

Run a query with logging enabled and examine the SQL logs. Do the logs contain unmasked technician names in WHERE clauses or sample data? If your masking is in the staging model but you run a debug query against the raw source, the log captures the raw values.

Step 7: Verify staging table access

Can an analyst querying the staging table directly still see unmasked technician names? If your masking is in the staging model itself (not just a view on top of it), the answer should be no. If the staging model is a view that transforms the raw source, the underlying source table is still queryable.

Check your pipeline architecture. Does the masking happen early enough that no queryable surface exposes raw PII?

✓ Check

Check: Run dbt docs generate and search the output for any technician name. If you find one, where did it come from -- the staging source definition or the raw data layer?