Pipeline Template: Modular Training Pipeline
Why modularize
A notebook hides how data flows between steps. Variables live in shared memory, so you can run cells out of order, skip cells, or rely on state from a previous session without realizing it. When you convert to a scripted pipeline, every dependency becomes explicit. Data flows through function calls with defined inputs and outputs. If a module needs something, it asks for it. If something breaks, the error points to the module and the interface, not to "run all cells again."
Module structure
The pipeline separates into four modules, each with a single responsibility:
config.yaml
     |
     v
train.py (entry point)
     |
     +---> data_loader.py       (load and validate raw data)
     |          |
     |          v
     +---> feature_engineer.py  (transform raw data into model features)
     |          |
     |          v
     +---> trainer.py           (train the model)
     |          |
     |          v
     +---> evaluator.py         (evaluate model performance)
Each module reads its inputs from function arguments and returns its outputs. No module reaches into another module's internals. No module reads config directly -- train.py reads config and passes relevant sections to each module.
Entry point pattern
train.py is the orchestrator. It does three things:
- Reads config.yaml, using argparse for the --config flag
- Calls each module in sequence, passing outputs from one as inputs to the next
- Saves final artifacts (trained model, evaluation metrics)
The goal: python train.py --config config.yaml runs the entire pipeline. One command. No ambiguity.
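A minimal sketch of this orchestrator, assuming PyYAML for config parsing and the module interfaces shown in the stubs below. Artifact saving is reduced to the metrics file; persisting the model (e.g. with torch.save) is elided.

```python
"""train.py -- pipeline orchestrator (sketch)."""
import argparse
import json


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Run the training pipeline.")
    parser.add_argument("--config", required=True, help="Path to config.yaml")
    return parser


def main() -> None:
    args = build_parser().parse_args()

    # Imports live here so loading this sketch does not require the modules
    # or PyYAML to be installed.
    import yaml
    from data_loader import load_data
    from evaluator import evaluate_model
    from feature_engineer import engineer_features
    from trainer import train_model

    with open(args.config) as f:
        config = yaml.safe_load(f)

    # Each module receives only the config section it needs.
    sensor_df, harvest_df = load_data(
        config["data"]["data_path"], config["data"]["harvest_path"]
    )
    X, y = engineer_features(sensor_df, harvest_df, config["features"])
    model = train_model(X, y, config["model"])
    # A real pipeline would hold out a test split (config["data"]["train_split"])
    # before evaluating; evaluation on training data is shown only for brevity.
    metrics = evaluate_model(model, X, y)

    with open(config["output"]["metrics_path"], "w") as f:
        json.dump(metrics, f, indent=2)
```

Because the module imports happen inside main(), the argparse wiring can be exercised without the rest of the pipeline present.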
Configuration externalization
All values that might change between runs belong in config.yaml, not in code:
data:
  data_path: "data/sensor_readings.csv"
  harvest_path: "data/harvest_records.csv"
  train_split: 0.8

features:
  normalize: true
  derived_features:
    - temperature_rainfall_ratio
    - moisture_humidity_index

model:
  hidden_sizes: [32, 16]
  dropout_rate: 0.1
  learning_rate: 0.001
  epochs: 100
  batch_size: 32

output:
  model_path: "output/model.pt"
  metrics_path: "output/metrics.json"
Changing the experiment means changing the config, not editing code. A colleague can see exactly what parameters produced a given model by reading the config file.
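Reading the file is one yaml.safe_load call (assuming PyYAML), after which each section is a plain dict that train.py can hand to the matching module. A small illustration, with the model section inlined as a string so it runs standalone:

```python
import yaml

# In train.py this string would instead come from open(args.config).read().
raw = """
model:
  hidden_sizes: [32, 16]
  dropout_rate: 0.1
  learning_rate: 0.001
  epochs: 100
  batch_size: 32
"""

config = yaml.safe_load(raw)

# Nested keys become nested dicts; pass config["model"] to the trainer.
model_config = config["model"]
print(model_config["learning_rate"])  # 0.001
```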
Module stubs
data_loader.py
"""Data loading and validation module."""
def load_data(data_path: str, harvest_path: str) -> tuple:
"""Load sensor readings and harvest records.
Args:
data_path: Path to sensor readings CSV.
harvest_path: Path to harvest records CSV.
Returns:
Tuple of (sensor_df, harvest_df) as pandas DataFrames.
"""
pass
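One way the stub could be filled in, kept deliberately small: read both CSVs and fail fast on empty files. The validation shown is illustrative; the real required columns depend on your data.

```python
import pandas as pd


def load_data(data_path: str, harvest_path: str) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Load sensor readings and harvest records, rejecting empty files."""
    sensor_df = pd.read_csv(data_path)
    harvest_df = pd.read_csv(harvest_path)

    # Fail here, at the source, rather than deep inside feature engineering.
    for name, df in (("sensor", sensor_df), ("harvest", harvest_df)):
        if df.empty:
            raise ValueError(f"{name} file has no rows")

    return sensor_df, harvest_df
```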
feature_engineer.py
"""Feature engineering module."""
import torch
def engineer_features(sensor_df, harvest_df, config: dict) -> tuple:
"""Transform raw data into model-ready features.
Args:
sensor_df: Raw sensor readings DataFrame.
harvest_df: Harvest records DataFrame.
config: Features section of config.yaml.
Returns:
Tuple of (X, y) as torch Tensors.
"""
pass
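A possible shape for this stub's body. The join key ("date") and target column ("yield_kg") are placeholders for whatever your schema actually uses, and handling of config["derived_features"] is omitted for brevity; only the normalize flag is honored.

```python
import pandas as pd
import torch


def engineer_features(
    sensor_df: pd.DataFrame, harvest_df: pd.DataFrame, config: dict
) -> tuple[torch.Tensor, torch.Tensor]:
    """Merge the two frames and build (X, y) tensors. Sketch only."""
    # Assumed join key and target column -- replace with your schema's names.
    merged = sensor_df.merge(harvest_df, on="date")

    feature_cols = [c for c in sensor_df.columns if c != "date"]
    X = merged[feature_cols].to_numpy(dtype="float32")

    if config.get("normalize", False):
        # Z-score each column; epsilon guards against zero variance.
        mean, std = X.mean(axis=0), X.std(axis=0)
        X = (X - mean) / (std + 1e-8)

    y = merged["yield_kg"].to_numpy(dtype="float32").reshape(-1, 1)
    return torch.from_numpy(X), torch.from_numpy(y)
```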
trainer.py
"""Model training module."""
import torch
import torch.nn as nn
def train_model(X: torch.Tensor, y: torch.Tensor, config: dict) -> nn.Module:
"""Train a feedforward neural network.
Args:
X: Feature tensor.
y: Target tensor.
config: Model section of config.yaml.
Returns:
Trained PyTorch model.
"""
pass
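One way to realize this stub: build an MLP from config["hidden_sizes"] and train it with Adam on MSE loss. Full-batch steps are shown for brevity; a real implementation would mini-batch with config["batch_size"] via a DataLoader.

```python
import torch
import torch.nn as nn


def train_model(X: torch.Tensor, y: torch.Tensor, config: dict) -> nn.Module:
    """Build and train a feedforward regressor. Sketch only."""
    # Stack Linear -> ReLU -> Dropout blocks per entry in hidden_sizes.
    layers: list[nn.Module] = []
    in_size = X.shape[1]
    for hidden in config["hidden_sizes"]:
        layers += [
            nn.Linear(in_size, hidden),
            nn.ReLU(),
            nn.Dropout(config["dropout_rate"]),
        ]
        in_size = hidden
    layers.append(nn.Linear(in_size, y.shape[1]))
    model = nn.Sequential(*layers)

    optimizer = torch.optim.Adam(model.parameters(), lr=config["learning_rate"])
    loss_fn = nn.MSELoss()

    model.train()
    for _ in range(config["epochs"]):
        optimizer.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        optimizer.step()
    return model
```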
evaluator.py
"""Model evaluation module."""
import torch.nn as nn
import torch
def evaluate_model(model: nn.Module, X: torch.Tensor, y: torch.Tensor) -> dict:
"""Evaluate model performance.
Args:
model: Trained PyTorch model.
X: Feature tensor (test set).
y: Target tensor (test set).
Returns:
Dictionary of evaluation metrics (mae, rmse, r2).
"""
pass
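The three metrics named in the docstring can be computed directly from the prediction errors; a sketch:

```python
import torch
import torch.nn as nn


def evaluate_model(model: nn.Module, X: torch.Tensor, y: torch.Tensor) -> dict:
    """Compute mae, rmse, and r2 on the given set. Sketch only."""
    model.eval()  # disable dropout for deterministic predictions
    with torch.no_grad():
        preds = model(X)

    errors = preds - y
    mae = errors.abs().mean().item()
    rmse = errors.pow(2).mean().sqrt().item()

    # R^2 = 1 - SS_res / SS_tot
    ss_res = errors.pow(2).sum()
    ss_tot = (y - y.mean()).pow(2).sum()
    r2 = (1.0 - ss_res / ss_tot).item()

    return {"mae": mae, "rmse": rmse, "r2": r2}
```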