Pipeline Template: Modular Training Pipeline
Why modularize
A notebook hides how data flows between steps. Variables live in shared memory, so you can run cells out of order, skip cells, or rely on state from a previous session without realizing it. When you convert to a scripted pipeline, every dependency becomes explicit. Data flows through function calls with defined inputs and outputs. If a module needs something, it asks for it. If something breaks, the error points to the module and the interface, not to "run all cells again."
Module structure
The pipeline separates into four modules, each with a single responsibility:
config.yaml
     |
     v
train.py (entry point)
     |
     +---> data_loader.py       (load and validate raw data)
     |          |
     |          v
     +---> feature_engineer.py  (transform raw data into model features)
     |          |
     |          v
     +---> trainer.py           (train the model)
     |          |
     |          v
     +---> evaluator.py         (evaluate model performance)
Each module reads its inputs from function arguments and returns its outputs. No module reaches into another module's internals. No module reads config directly -- train.py reads config and passes relevant sections to each module.
Entry point pattern
train.py is the orchestrator. It does three things:
- Reads config.yaml, using argparse for the --config flag
- Calls each module in sequence, passing outputs from one as inputs to the next
- Saves final artifacts (trained model, evaluation metrics)
The goal: python train.py --config config.yaml runs the entire pipeline. One command. No ambiguity.
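A minimal sketch of this orchestrator, assuming PyYAML for config parsing and the module interfaces shown in the stubs below. Artifact saving is reduced to the metrics file; persisting the model (e.g. with torch.save) is elided.

```python
"""train.py -- pipeline orchestrator (sketch)."""
import argparse
import json


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Run the training pipeline.")
    parser.add_argument("--config", required=True, help="Path to config.yaml")
    return parser


def main() -> None:
    args = build_parser().parse_args()

    # Imports live here so loading this sketch does not require the modules
    # or PyYAML to be installed.
    import yaml
    from data_loader import load_data
    from evaluator import evaluate_model
    from feature_engineer import engineer_features
    from trainer import train_model

    with open(args.config) as f:
        config = yaml.safe_load(f)

    # Each module receives only the config section it needs.
    sensor_df, harvest_df = load_data(
        config["data"]["data_path"], config["data"]["harvest_path"]
    )
    X, y = engineer_features(sensor_df, harvest_df, config["features"])
    model = train_model(X, y, config["model"])
    # A real pipeline would hold out a test split (config["data"]["train_split"])
    # before evaluating; evaluation on training data is shown only for brevity.
    metrics = evaluate_model(model, X, y)

    with open(config["output"]["metrics_path"], "w") as f:
        json.dump(metrics, f, indent=2)
```

Because the module imports happen inside main(), the argparse wiring can be exercised without the rest of the pipeline present.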
Configuration externalization
All values that might change between runs belong in config.yaml, not in code:
data:
  data_path: "data/sensor_readings.csv"
  harvest_path: "data/harvest_records.csv"
  train_split: 0.8

features:
  normalize: true
  derived_features:
    - temperature_rainfall_ratio
    - moisture_humidity_index

model:
  hidden_sizes: [32, 16]
  dropout_rate: 0.1
  learning_rate: 0.001
  epochs: 100
  batch_size: 32

output:
  model_path: "output/model.pt"
  metrics_path: "output/metrics.json"
Changing the experiment means changing the config, not editing code. A colleague can see exactly what parameters produced a given model by reading the config file.
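Reading the file is one yaml.safe_load call (assuming PyYAML), after which each section is a plain dict that train.py can hand to the matching module. A small illustration, with the model section inlined as a string so it runs standalone:

```python
import yaml

# In train.py this string would instead come from open(args.config).read().
raw = """
model:
  hidden_sizes: [32, 16]
  dropout_rate: 0.1
  learning_rate: 0.001
  epochs: 100
  batch_size: 32
"""

config = yaml.safe_load(raw)

# Nested keys become nested dicts; pass config["model"] to the trainer.
model_config = config["model"]
print(model_config["learning_rate"])  # 0.001
```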
Module stubs
data_loader.py
"""Data loading and validation module."""
def load_data(data_path: str, harvest_path: str) -> tuple:
"""Load sensor readings and harvest records.
Args:
data_path: Path to sensor readings CSV.
harvest_path: Path to harvest records CSV.
Returns:
Tuple of (sensor_df, harvest_df) as pandas DataFrames.
"""
pass
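One way the stub could be filled in, kept deliberately small: read both CSVs and fail fast on empty files. The validation shown is illustrative; the real required columns depend on your data.

```python
import pandas as pd


def load_data(data_path: str, harvest_path: str) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Load sensor readings and harvest records, rejecting empty files."""
    sensor_df = pd.read_csv(data_path)
    harvest_df = pd.read_csv(harvest_path)

    # Fail here, at the source, rather than deep inside feature engineering.
    for name, df in (("sensor", sensor_df), ("harvest", harvest_df)):
        if df.empty:
            raise ValueError(f"{name} file has no rows")

    return sensor_df, harvest_df
```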
feature_engineer.py
"""Feature engineering module."""
import torch
def engineer_features(sensor_df, harvest_df, config: dict) -> tuple:
"""Transform raw data into model-ready features.
Args:
sensor_df: Raw sensor readings DataFrame.
harvest_df: Harvest records DataFrame.
config: Features section of config.yaml.
Returns:
Tuple of (X, y) as torch Tensors.
"""
pass
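A possible shape for this stub's body. The join key ("date") and target column ("yield_kg") are placeholders for whatever your schema actually uses, and handling of config["derived_features"] is omitted for brevity; only the normalize flag is honored.

```python
import pandas as pd
import torch


def engineer_features(
    sensor_df: pd.DataFrame, harvest_df: pd.DataFrame, config: dict
) -> tuple[torch.Tensor, torch.Tensor]:
    """Merge the two frames and build (X, y) tensors. Sketch only."""
    # Assumed join key and target column -- replace with your schema's names.
    merged = sensor_df.merge(harvest_df, on="date")

    feature_cols = [c for c in sensor_df.columns if c != "date"]
    X = merged[feature_cols].to_numpy(dtype="float32")

    if config.get("normalize", False):
        # Z-score each column; epsilon guards against zero variance.
        mean, std = X.mean(axis=0), X.std(axis=0)
        X = (X - mean) / (std + 1e-8)

    y = merged["yield_kg"].to_numpy(dtype="float32").reshape(-1, 1)
    return torch.from_numpy(X), torch.from_numpy(y)
```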
trainer.py
"""Model training module."""
import torch
import torch.nn as nn
def train_model(X: torch.Tensor, y: torch.Tensor, config: dict) -> nn.Module:
"""Train a feedforward neural network.
Args:
X: Feature tensor.
y: Target tensor.
config: Model section of config.yaml.
Returns:
Trained PyTorch model.
"""
pass
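One way to realize this stub: build an MLP from config["hidden_sizes"] and train it with Adam on MSE loss. Full-batch steps are shown for brevity; a real implementation would mini-batch with config["batch_size"] via a DataLoader.

```python
import torch
import torch.nn as nn


def train_model(X: torch.Tensor, y: torch.Tensor, config: dict) -> nn.Module:
    """Build and train a feedforward regressor. Sketch only."""
    # Stack Linear -> ReLU -> Dropout blocks per entry in hidden_sizes.
    layers: list[nn.Module] = []
    in_size = X.shape[1]
    for hidden in config["hidden_sizes"]:
        layers += [
            nn.Linear(in_size, hidden),
            nn.ReLU(),
            nn.Dropout(config["dropout_rate"]),
        ]
        in_size = hidden
    layers.append(nn.Linear(in_size, y.shape[1]))
    model = nn.Sequential(*layers)

    optimizer = torch.optim.Adam(model.parameters(), lr=config["learning_rate"])
    loss_fn = nn.MSELoss()

    model.train()
    for _ in range(config["epochs"]):
        optimizer.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        optimizer.step()
    return model
```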
evaluator.py
"""Model evaluation module."""
import torch.nn as nn
import torch
def evaluate_model(model: nn.Module, X: torch.Tensor, y: torch.Tensor) -> dict:
"""Evaluate model performance.
Args:
model: Trained PyTorch model.
X: Feature tensor (test set).
y: Target tensor (test set).
Returns:
Dictionary of evaluation metrics (mae, rmse, r2).
"""
pass
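The three metrics named in the docstring can be computed directly from the prediction errors; a sketch:

```python
import torch
import torch.nn as nn


def evaluate_model(model: nn.Module, X: torch.Tensor, y: torch.Tensor) -> dict:
    """Compute mae, rmse, and r2 on the given set. Sketch only."""
    model.eval()  # disable dropout for deterministic predictions
    with torch.no_grad():
        preds = model(X)

    errors = preds - y
    mae = errors.abs().mean().item()
    rmse = errors.pow(2).mean().sqrt().item()

    # R^2 = 1 - SS_res / SS_tot
    ss_res = errors.pow(2).sum()
    ss_tot = (y - y.mean()).pow(2).sum()
    r2 = (1.0 - ss_res / ss_tot).item()

    return {"mae": mae, "rmse": rmse, "r2": r2}
```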