Learn by Directing AI
Unit 2

Building the Pipeline

Step 1: Why Pipelines exist

In P4 you caught data leakage by hand -- verifying that preprocessing happened after splitting, checking that test set statistics did not contaminate training. That worked, but it depended on you remembering to check every time.

A scikit-learn Pipeline makes the wrong execution order impossible. When preprocessing steps are inside a Pipeline, their fit() methods are called only on the training data of each fold during cross-validation. Transformations are applied consistently during training and prediction. The Pipeline does not rely on vigilance -- it enforces correct ordering by design.
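A minimal sketch of the idea: with the scaler inside the Pipeline, cross_val_score re-fits it on each training fold, so the held-out fold never influences the scaling statistics. The data here is a synthetic stand-in, not the course dataset.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: one informative feature plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

pipe = Pipeline([
    ("scale", StandardScaler()),   # fit on training folds only
    ("clf", LogisticRegression()),
])

# cross_val_score clones the whole Pipeline per fold, so scaling
# statistics are computed from that fold's training portion alone.
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.round(3))
```

If the scaler were fit on X before the split, every fold's test data would have already shaped the scaling parameters; putting it inside the Pipeline removes that possibility.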

This is the shift from "catch leakage" to "make leakage impossible." The same principle that P4 taught through verification, P6 teaches through engineering.

Step 2: Open the Pipeline template

Open materials/pipeline-template.py. This is a skeleton showing the structure: column group definitions at the top, a ColumnTransformer in the middle routing each column type to its transformer, and a full Pipeline at the bottom chaining preprocessing, feature selection, and the estimator.

The template has placeholder column lists -- you fill them in from your profiling. It imports StandardScaler for numeric columns, OneHotEncoder for categorical columns, and TfidfVectorizer for text columns. The structure is complete. The choices are yours.

Step 3: Design the Pipeline architecture

Before directing Claude to build anything, decide which columns get which treatment. From your profiling:

  • Numeric columns need scaling. Experience years, hospital minimum experience, and similar continuous values go through StandardScaler.
  • Categorical columns need encoding. Region, specialization, shift preference, department, location type -- discrete values that OneHotEncoder handles.
  • Text columns need TF-IDF. Nurse bios and hospital requirement notes are free text. TfidfVectorizer converts each into a vector of TF-IDF-weighted term frequencies.

The ColumnTransformer routes each column type to its correct transformer. Use column names, not indices. AI commonly generates ColumnTransformers with hard-coded column indices -- [0, 3, 7] instead of ["nurse_experience_years", "hospital_min_experience"]. Indices break the moment column order changes. Names are robust.
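A sketch of that routing, using hypothetical column names; substitute the groups from your own profiling. Note that TfidfVectorizer takes a single column name as a string, not a list, because it expects a 1-D sequence of documents.

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative column groups; fill in from your profiling.
numeric_cols = ["nurse_experience_years", "hospital_min_experience"]
categorical_cols = ["region", "specialization", "shift_preference"]
text_col = "nurse_bio"  # a bare string: TfidfVectorizer wants 1-D input

preprocess = ColumnTransformer([
    # Names, not indices -- robust to column reordering.
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ("txt", TfidfVectorizer(), text_col),
])
```

handle_unknown="ignore" keeps prediction from failing when a category appears at inference time that was absent from training.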

Feed Claude your column type plan before asking it to generate the Pipeline code. Providing the column grouping as context -- which columns are numeric, which are categorical, which are text -- produces better code than asking Claude to figure it out from the dataset.

Step 4: Build the Pipeline with feature selection inside cross-validation

Direct Claude to construct the full Pipeline: ColumnTransformer, feature selection, and a baseline estimator (logistic regression or random forest).

One critical detail: feature selection must happen inside the Pipeline, not before it. If you run SelectKBest(...).fit_transform(X, y) on the full dataset before creating the Pipeline, you have introduced a subtle form of leakage. Features selected on the full dataset -- including the test set -- will appear more predictive than they really are. Inside the Pipeline, SelectKBest runs on each training fold independently during cross-validation.

AI commonly performs feature selection on the full dataset as the natural code order. The code looks correct -- it runs without errors and produces reasonable-looking scores. But the evaluation is compromised. Specify that feature selection goes inside the Pipeline, after the ColumnTransformer.
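A sketch of the correct ordering, with SelectKBest as a named Pipeline step so it is re-fit on each training fold. The data and step names are illustrative, and the preprocessed features are simulated with a plain array rather than a ColumnTransformer.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Stand-in for preprocessed features: 40 columns, 2 of them informative.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 40))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=10)),  # fit per training fold
    ("clf", LogisticRegression()),
])

# Because selection is a Pipeline step, each fold picks its own top-k
# features from its training portion; the test fold plays no part.
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.round(3))
```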

After building the Pipeline, have a second model review the construction. Ask a fresh Claude session to examine the Pipeline code and check: are all preprocessing steps inside the Pipeline? Is feature selection inside the Pipeline? Are column names used instead of indices? Cross-checking catches structural errors that the original session may have normalized.

Step 5: Train and evaluate the baseline

Split the data, fit the Pipeline on training data, and evaluate on the test set. Use cross-validation to check for consistency across folds.
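A minimal hold-out evaluation sketch. The Pipeline and data are synthetic stand-ins; in your project the Pipeline is the full ColumnTransformer-plus-selection construction.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with a clear signal in the first feature.
rng = np.random.default_rng(7)
X = rng.normal(size=(300, 8))
y = (X[:, 0] > 0).astype(int)

pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)
pipe.fit(X_train, y_train)   # all preprocessing is fit here, on train only
print(f"test accuracy: {pipe.score(X_test, y_test):.3f}")
```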

Run cross-validation and look at the fold scores. They should be consistent -- all within a narrow range. If one fold scores dramatically higher than the others, that is a leakage signal. Consistent scores across folds mean the Pipeline is working as intended: no information from the test fold is leaking into training.

Print the Pipeline. The output shows the nested structure: ColumnTransformer with its sub-transformers, feature selection, and the estimator. Reading a well-constructed Pipeline tells another practitioner exactly what happens to the data, in what order, and with what parameters. The Pipeline's structure is the documentation.

Step 6: Track in MLflow

Log the experiment in MLflow: Pipeline configuration, cross-validation scores, feature selection parameters, and the baseline model's test performance. This becomes the reference point for comparison when transfer learning enters in the next unit.
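A sketch of the logging call. The run name, parameter keys, and metric values here are all illustrative placeholders, and the import is guarded so the sketch degrades gracefully where MLflow is not installed; by default MLflow writes to a local ./mlruns directory.

```python
# Illustrative values -- replace with your Pipeline's actual configuration
# and the scores from your cross-validation run.
params = {"selector_k": 10, "estimator": "LogisticRegression", "cv_folds": 5}
metrics = {"cv_mean_accuracy": 0.83, "cv_std_accuracy": 0.02}

try:
    import mlflow

    with mlflow.start_run(run_name="baseline-pipeline"):
        mlflow.log_params(params)    # Pipeline configuration
        mlflow.log_metrics(metrics)  # cross-validation results
except ImportError:
    print("mlflow not installed; skipping tracking")
```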

✓ Check

Check: The Pipeline's cross-validation scores are consistent across folds (no single fold dramatically higher, which would indicate leakage). Feature selection runs inside the Pipeline (check that SelectKBest is inside the Pipeline, not applied before).