Step 1: Understand the pipeline template
Open materials/pipeline-template.md.
The template describes a modular pipeline structure: four modules (data_loader, feature_engineer, trainer, evaluator), each with a single responsibility, orchestrated by a single entry point (train.py) that reads configuration from a YAML file.
In a notebook, data flows through shared memory. You can run cells out of order, skip cells, or rely on state from a previous session without realizing it. When you convert to scripts, every dependency becomes explicit. Data flows through function calls with defined inputs and outputs. If a module needs something, it asks for it through its arguments.
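The contrast can be sketched with a toy loading step (the function and column names here are illustrative, not from the template):

```python
# In a notebook, a cell might mutate a shared dataframe living in kernel
# memory. As a module function, every dependency is an explicit argument
# and the result is an explicit return value -- nothing hides in session state.
def add_price_per_unit(rows: list[dict]) -> list[dict]:
    """Derive a feature; every input arrives through the argument."""
    return [{**r, "price_per_unit": r["price"] / r["quantity"]} for r in rows]

raw = [{"price": 10.0, "quantity": 2}, {"price": 9.0, "quantity": 3}]
features = add_price_per_unit(raw)
print([r["price_per_unit"] for r in features])  # [5.0, 3.0]
```

Because the function never touches hidden state, it cannot silently depend on a cell you ran yesterday.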
Step 2: Understand configuration externalization
The template shows a config.yaml structure with sections for data paths, feature settings, model hyperparameters, and output paths. Everything that might change between runs lives in the config, not in code.
This separation matters. A config file is an experiment specification. Changing the experiment means editing config.yaml, not hunting through code for hardcoded values. A colleague can see exactly what parameters produced a given model by reading one file.
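As a rough sketch, a config in this spirit might look like the following (the section and key names are illustrative; use the ones from the template):

```yaml
# Everything that varies between runs lives here, not in code.
data:
  input_path: data/raw/train.csv
features:
  normalize: true
model:
  learning_rate: 0.001
  epochs: 20
  batch_size: 32
  hidden_sizes: [64, 32]
output:
  model_dir: outputs/models
```

Reading this one file tells a colleague exactly which data, features, and hyperparameters produced a given model.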
Step 3: Create the pipeline modules
Direct Claude to create the four pipeline modules based on the template stubs:
Create the four pipeline modules from the template in materials/pipeline-template.md: data_loader.py, feature_engineer.py, trainer.py, evaluator.py. Each module should have clear input/output interfaces matching the template signatures. The feature engineering should match the logic in materials/model-artifacts/feature_pipeline.py. The trainer should use the same PyTorch architecture as the model in materials/model-artifacts/serve.py.
The template provides structure. You provide judgment about what the implementation should contain -- the feature engineering logic needs to match what the existing model expects, and the training loop needs to produce a model compatible with serve.py.
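To make the data flow concrete, here is a hypothetical end-to-end sketch of the four interfaces using toy in-memory stand-ins. The real modules live in separate files, follow the template's actual signatures, and read their settings from config:

```python
# Toy stand-ins for the four modules; each communicates only through
# function arguments and return values.

def load_data(rows):                     # data_loader: would read from disk
    return list(rows)

def engineer_features(rows, normalize):  # feature_engineer
    xs = [r["x"] for r in rows]
    if normalize:
        lo, hi = min(xs), max(xs)
        xs = [(x - lo) / (hi - lo) for x in xs]
    return xs

def train(features):                     # trainer: toy "model" = the mean
    return sum(features) / len(features)

def evaluate(model, features):           # evaluator: toy metric = MSE
    return sum((x - model) ** 2 for x in features) / len(features)

rows = [{"x": 1.0}, {"x": 3.0}, {"x": 5.0}]
features = engineer_features(load_data(rows), normalize=True)
model = train(features)
mse = evaluate(model, features)
print(round(mse, 4))  # 0.1667
```

Note that no module imports another; the orchestrator passes each output to the next input, which is what makes the interfaces testable in isolation.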
Step 4: Create config.yaml and train.py
Direct Claude to create the configuration file and entry point:
Create config.yaml with sections for data paths, feature settings, model hyperparameters (learning_rate, epochs, batch_size, hidden_sizes), and output paths. Then create train.py as the entry point that reads config.yaml via argparse --config flag and orchestrates the four modules in sequence.
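A minimal sketch of the entry point, assuming PyYAML for parsing. The orchestration calls are shown as comments since the module names are your choice; in the real script argparse reads sys.argv, while here an explicit list is passed so the sketch runs standalone:

```python
import argparse
import yaml  # PyYAML

parser = argparse.ArgumentParser()
parser.add_argument("--config", required=True)
args = parser.parse_args(["--config", "config.yaml"])  # simulates the CLI

# Stands in for: with open(args.config) as f: config = yaml.safe_load(f)
config = yaml.safe_load("""
data: {input_path: data/train.csv}
model: {learning_rate: 0.001, epochs: 20, batch_size: 32, hidden_sizes: [64, 32]}
output: {model_dir: outputs/models}
""")

# data = load_data(config["data"])                # data_loader
# features = engineer_features(data, ...)         # feature_engineer
# model = train_model(features, config["model"])  # trainer
# metrics = evaluate(model, ...)                  # evaluator
print(config["model"]["hidden_sizes"])  # [64, 32]
```

The entry point should only read config and route data between modules; any actual processing belongs in the modules themselves.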
Step 5: Review AI's output for monolithic tendencies
Before running anything, read through the generated code.
AI commonly generates scripts that mix responsibilities. Check for three patterns: configuration values hardcoded in module files instead of passed through config; modules that import each other's internals instead of communicating through function interfaces; and a train.py that does processing work instead of just orchestrating. If you find hardcoded values, move them into config.yaml. If you find tangled imports, simplify the interfaces.
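The hardcoded-value smell and its fix look roughly like this (function and key names are illustrative):

```python
# Before: the hyperparameter is buried in the module, so changing the
# experiment means editing code.
def train_hardcoded(features):
    learning_rate = 0.001  # hardcoded -- invisible in config.yaml
    return {"lr": learning_rate}

# After: the value arrives through the interface, sourced from config.yaml,
# so the config file fully specifies the experiment.
def train(features, model_config):
    return {"lr": model_config["learning_rate"]}

config = {"model": {"learning_rate": 0.01}}
print(train([], config["model"]))  # {'lr': 0.01}
```

After this refactor, reading config.yaml is enough to know what the run will do.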
Step 6: Run and verify the pipeline
Run the pipeline end-to-end:
python train.py --config config.yaml
The output should show each module executing in sequence: data loading, feature engineering, training (with loss values), and evaluation (with metrics). A trained model file should appear in the configured output directory.
If the pipeline fails, the error points to the specific module and interface where it broke. That specificity is exactly the point -- in a notebook, the same failure could surface far from its cause, or not at all.
Check: python train.py --config config.yaml runs without errors and produces model output files.