Step 1: What DVC does
Before writing any feature code, understand the tool that makes features reproducible.
DVC -- Data Version Control -- tracks datasets and pipeline outputs alongside git-tracked code. Git handles your Python scripts. DVC handles the large files those scripts produce: cleaned datasets, feature matrices, model artifacts. When you make a commit, git records the code and DVC records a hash of the data. Together they answer a question that matters for every ML project: which code and which data produced this result?
Without DVC, running the same feature pipeline on two different machines can produce different features because the raw data has drifted. With DVC, you can check out any historical commit and reproduce the exact features that existed at that point.
Step 2: Initialize DVC
Ask Claude to initialize DVC in your project and configure it to track the raw datasets.
Initialize DVC in this project. Add the raw CSV files (transactions.csv, products.csv, customers.csv) to DVC tracking. Make sure the .dvc files are git-tracked.
DVC creates a .dvc/ directory for its configuration and generates .dvc files for each tracked dataset. These .dvc files contain the hash of the data -- they go into git. The actual data files go into .gitignore so they are not committed to git directly.
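To make this concrete, here is a sketch of what one of those pointer files looks like. The hash and size values below are purely illustrative, not taken from your dataset:

```yaml
# transactions.csv.dvc -- small YAML pointer, committed to git
# (md5 and size shown here are illustrative placeholders)
outs:
- md5: 3f4c9d0e8a7b6c5d4e3f2a1b0c9d8e7f
  size: 48213907
  path: transactions.csv
```

Git versions this tiny file; DVC uses the hash inside it to fetch or restore the actual transactions.csv from its cache.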
This is the foundation of feature versioning. Every change to the raw data or the features computed from it will be captured through this system.
Step 3: Build the tabular feature pipeline
Open materials/dvc-config-template.yaml. It defines three pipeline stages: prepare, features, and train. Each stage lists its command, dependencies, and outputs.
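Those three stages map onto a dvc.yaml roughly like the sketch below. Script names and paths are assumptions for illustration; adjust them to match your project layout:

```yaml
# dvc.yaml -- pipeline sketch; file paths and script names are assumed
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
      - src/prepare.py
      - data/raw/transactions.csv
    outs:
      - data/prepared/transactions.parquet
  features:
    cmd: python src/features.py
    deps:
      - src/features.py
      - data/prepared/transactions.parquet
    outs:
      - data/features/customer_features.parquet
  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/features/customer_features.parquet
    outs:
      - models/model.pkl
```

Each stage reruns only when one of its declared deps changes, which is what makes dvc repro incremental.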
Now ask Claude to build the tabular feature pipeline. The features you need: customer purchase history (frequency, recency, average spend), product popularity (purchase counts, wishlist counts, category rank), and category affinity (which categories each customer gravitates toward).
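The customer purchase-history features can be sketched with a pandas groupby. This is a minimal illustration, not the pipeline Claude will produce — the column names (customer_id, amount, date) and the as-of date are assumptions:

```python
import pandas as pd

# Illustrative transactions; real column names in transactions.csv may differ.
tx = pd.DataFrame({
    "customer_id": ["c1", "c1", "c2"],
    "amount": [20.0, 40.0, 10.0],
    "date": pd.to_datetime(["2024-01-05", "2024-03-01", "2024-02-10"]),
})

as_of = pd.Timestamp("2024-04-01")  # reference date for recency (assumed)
feats = tx.groupby("customer_id").agg(
    frequency=("amount", "size"),   # number of purchases
    avg_spend=("amount", "mean"),   # average order value
    last_purchase=("date", "max"),  # most recent purchase date
)
feats["recency_days"] = (as_of - feats["last_purchase"]).dt.days
feats = feats.drop(columns="last_purchase")
```

Product popularity and category affinity follow the same pattern: a groupby on the appropriate key, aggregated into one row per entity.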
AI commonly generates feature engineering code that computes everything in a single script without thinking about the data integrity issues in your specific dataset. Check what it produces. The product ID recycling problem from the profiling step matters here: if the pipeline treats product_id as a unique identifier, it conflates different seasonal variants.
Direct Claude to create a composite key -- product_id combined with season -- that distinguishes the Spring 2023 version of a dress from the Fall 2024 version. This is not optional. Without it, every feature that depends on product identity is wrong.
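A minimal sketch of the composite key, assuming product_id and season columns — the separator and column names are illustrative:

```python
import pandas as pd

# Two rows share a recycled product_id but are different seasonal variants.
products = pd.DataFrame({
    "product_id": ["P100", "P100"],
    "season": ["spring-2023", "fall-2024"],
    "purchase_count": [150, 12],
})

# Composite key: product_id alone is ambiguous; product_id + season is not.
products["product_key"] = products["product_id"] + "::" + products["season"]

assert products["product_key"].is_unique  # each variant gets its own identity
```

Every downstream join and aggregation that touches product identity should use product_key, never the bare product_id.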
Step 4: Version the feature pipeline with DVC
With the tabular features built, version them. Ask Claude to configure DVC to track the feature computation outputs -- not just the raw data, but the intermediate feature matrices.
Configure DVC to track the feature outputs. I want the pipeline to version both the raw data and the computed features, so I can reproduce any historical feature set by checking out the right commit.
The DVC pipeline file (dvc.yaml) should now declare the feature matrices under the features stage's outs. When you change the feature code and rerun dvc repro, DVC records the new output hashes. When you check out an older commit, dvc checkout restores the features that existed at that point.
Step 5: Separate feature computation from model training
This is an architectural decision with production consequences. Right now, features are computed and immediately consumed by training. That means every training run recomputes features from raw data.
Ask Claude to restructure the pipeline so that feature computation and model training are separate stages with an explicit boundary: features are computed once and written to a feature store (a dedicated directory of versioned feature files), and training reads from that store.
AI will often generate monolithic scripts that interleave feature computation and model training in one file. Direct Claude to separate them. The features stage should produce output files. The train stage should read those files as inputs. No feature logic in the training script.
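The boundary can be sketched as two functions that communicate only through a feature file. This is an illustrative skeleton, not the real pipeline — the function names, CSV format, and the stand-in "training" step are all assumptions:

```python
import csv
import os
import tempfile

def compute_features(transactions, features_path):
    """Features stage: aggregate raw rows and write a feature file. No model code here."""
    totals = {}
    for row in transactions:
        cid = row["customer_id"]
        totals.setdefault(cid, {"frequency": 0, "total_spend": 0.0})
        totals[cid]["frequency"] += 1
        totals[cid]["total_spend"] += row["amount"]
    with open(features_path, "w", newline="") as f:
        w = csv.DictWriter(f, fieldnames=["customer_id", "frequency", "total_spend"])
        w.writeheader()
        for cid, feats in sorted(totals.items()):
            w.writerow({"customer_id": cid, **feats})

def train(features_path):
    """Train stage: reads only the feature file -- it never touches raw data."""
    with open(features_path) as f:
        rows = list(csv.DictReader(f))
    # Stand-in for real training: report how many feature rows were consumed.
    return {"n_rows": len(rows)}

# Usage sketch with a temporary directory standing in for the feature store
store = tempfile.mkdtemp()
path = os.path.join(store, "customer_features.csv")
compute_features(
    [{"customer_id": "c1", "amount": 20.0}, {"customer_id": "c1", "amount": 40.0}],
    path,
)
result = train(path)
```

The file on disk is the contract: train never imports from the features module, so either side can change or be tested without the other.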
This separation means multiple models can share the same features. It means you can test feature computation independently. And it means feature versioning is clean -- you know exactly what features fed any model.
Step 6: Verify the versioned features
Run dvc repro to execute the full pipeline. Then verify two things.
First: DVC reproducibility. Check out a previous commit, run dvc checkout, and confirm the features match exactly. This proves the versioning works -- any historical experiment is reproducible.
Second: the composite key. Check the feature matrix for the ~50 recycled product IDs. Each seasonal variant should produce distinct features. If two rows have the same product_id but different seasons and identical feature values, the composite key is not working.
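The composite-key check above can be automated with a small scan. A hedged sketch, assuming each feature row carries product_id and season columns (the sample rows are illustrative):

```python
from collections import defaultdict

# Illustrative feature rows; in the real pipeline these come from the features stage.
rows = [
    {"product_id": "P100", "season": "spring-2023", "purchase_count": 150},
    {"product_id": "P100", "season": "fall-2024", "purchase_count": 12},
]

# Group by product_id; if two seasonal variants carry identical feature values,
# the composite key is probably not being applied upstream.
by_product = defaultdict(list)
for r in rows:
    by_product[r["product_id"]].append(r)

suspect = []
for pid, variants in by_product.items():
    if len(variants) > 1:
        vectors = {
            tuple(sorted((k, v) for k, v in r.items() if k not in ("product_id", "season")))
            for r in variants
        }
        if len(vectors) == 1:  # different seasons, identical features -> conflated
            suspect.append(pid)
```

An empty suspect list means every recycled product_id resolves to distinct per-season features.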
Check: Running dvc repro regenerates the exact feature set, and the composite key correctly distinguishes seasonal product variants.