ML P4: Coffee Yield Prediction

Client

Valentina Reyes, Owner and Export Director at Finca Esperanza Exports (Neiva, Colombia). Specialty coffee exporter sourcing from twelve farms in Huila, exporting to roasters in Europe, Japan, and the US.

What you're building

A model that predicts coffee yield per farm from sensor data. It must split the data temporally (train on earlier harvests, test on the most recent), prevent data leakage, and deliver per-farm predictions that Valentina can use in contract negotiations.
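As a minimal sketch of the temporal split described above, the snippet below holds out the most recent harvest period and trains on everything earlier. The column names (`farm_id`, `harvest_period`, `mean_soil_moisture`, `yield_kg`) and all values are made-up placeholders, not the real schema; adjust them once the data has been profiled.

```python
import pandas as pd

# Toy harvest-level table: one row per farm per harvest period.
# Column names and values are assumptions for illustration only.
features = pd.DataFrame({
    "farm_id": ["farm_01", "farm_02", "farm_03"] * 4,
    "harvest_period": [1] * 3 + [2] * 3 + [3] * 3 + [4] * 3,
    "mean_soil_moisture": [0.31, 0.28, 0.35, 0.30, 0.27, 0.33,
                           0.29, 0.26, 0.34, 0.32, 0.29, 0.36],
    "yield_kg": [1200, 950, 1400, 1150, 980, 1380,
                 1100, 940, 1350, 1220, 990, 1420],
})


def temporal_split(df, period_col="harvest_period"):
    """Hold out the most recent period as test; train on all earlier periods."""
    latest = df[period_col].max()
    train = df[df[period_col] < latest].copy()
    test = df[df[period_col] == latest].copy()
    return train, test


train_df, test_df = temporal_split(features)
```

A random shuffle split would mix rows from the latest harvest into training, which is exactly what this approach is meant to avoid.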

Tech stack

Python 3.10+
pandas (data manipulation)
scikit-learn (preprocessing, Pipelines, baseline models)
PyTorch (neural network training)
MLflow (experiment tracking)
Jupyter notebooks (exploration and training)

File structure

materials/
  CLAUDE.md            -- this file
  valentina-email.md   -- Valentina's initial email
  tickets.md           -- work breakdown
  eval-template.md     -- evaluation design template (student fills)
  sensor-data.csv      -- daily sensor readings, 12 farms, 2 years
  harvest-records.csv  -- yield per farm per harvest period
  scripts/             -- generation scripts (not student-facing)
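One possible way to link the two CSVs is to aggregate daily readings into one row per farm per harvest, using only readings taken before the harvest date. This is a sketch under assumptions: the toy frames stand in for sensor-data.csv and harvest-records.csv, and every column name (`date`, `soil_moisture`, `period_start`, `harvest_date`, `yield_kg`) is hypothetical until the real headers have been checked.

```python
import pandas as pd

# Toy stand-ins for the two CSV files; names and values are assumptions.
sensor = pd.DataFrame({
    "farm_id": ["A"] * 6,
    "date": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-03",
                            "2023-07-01", "2023-07-02", "2023-07-03"]),
    "soil_moisture": [0.30, 0.32, 0.34, 0.20, 0.22, 0.24],
})
harvests = pd.DataFrame({
    "farm_id": ["A", "A"],
    "harvest_period": [1, 2],
    "period_start": pd.to_datetime(["2023-01-01", "2023-07-01"]),
    "harvest_date": pd.to_datetime(["2023-01-03", "2023-07-03"]),
    "yield_kg": [1000.0, 900.0],
})

# Pair each reading with each harvest of the same farm, then keep only
# readings inside the growing window and strictly before the harvest date.
merged = sensor.merge(harvests, on="farm_id")
in_window = (merged["date"] >= merged["period_start"]) & (
    merged["date"] < merged["harvest_date"]
)

# Collapse the daily rows to one feature row per farm per harvest,
# then attach the target from the harvest records.
harvest_features = (
    merged[in_window]
    .groupby(["farm_id", "harvest_period"], as_index=False)
    .agg(mean_soil_moisture=("soil_moisture", "mean"),
         n_readings=("soil_moisture", "count"))
    .merge(harvests[["farm_id", "harvest_period", "yield_kg"]],
           on=["farm_id", "harvest_period"])
)
```

The strict `<` on the harvest date is a deliberate choice here: a reading taken on or after the harvest cannot be known at prediction time, so including it would leak future information into the features.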

Ticket list

T-01: Profile sensor dataset structure and quality
T-02: Profile harvest records and link to sensor data
T-03: Design feature aggregation from daily readings to harvest-level features
T-04: Build feature engineering pipeline with temporal awareness
T-05: Implement temporal train/test split
T-06: Verify no preprocessing leakage (preprocessing fitted after the split or inside a Pipeline)
T-07: Identify and classify outliers with domain reasoning
T-08: Handle outliers based on domain judgment (keep variety effects, flag sensor anomalies)
T-09: Document all feature construction decisions with rationale
T-10: Set up PyTorch model architecture for yield regression
T-11: Implement training loop with loss curve monitoring
T-12: Configure early stopping with appropriate patience
T-13: Run baseline comparison (scikit-learn vs PyTorch) with MLflow tracking
T-14: Generate per-farm yield predictions from best model
T-15: Deliver predictions to Valentina in business terms
T-16: Write README and close project
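For the early-stopping ticket (T-12), the core logic can be sketched independently of PyTorch. The `EarlyStopping` class below is a hypothetical helper, not part of any library, and the validation losses are illustrative numbers rather than real results. In a real training loop, `step` would be called once per epoch with the validation loss.

```python
class EarlyStopping:
    """Signal a stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch = -1
        self.epochs_without_improvement = 0

    def step(self, epoch, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            # Improvement: remember it and reset the counter.
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience


# Illustrative validation losses only (not real results).
val_losses = [1.00, 0.80, 0.70, 0.72, 0.71, 0.73, 0.74]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.step(epoch, loss):
        stopped_at = epoch
        break
```

With only a few harvests per farm, the validation loss would presumably need to come from a slice of the training harvests rather than from the held-out harvest, so that the test set stays untouched until the final evaluation.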

Verification targets

Temporal split in use (train on harvests 1-3, test on harvest 4)
No preprocessing fitted before the split (fit it on training data only, or keep it inside a scikit-learn Pipeline)
Loss curves plotted with training and validation loss
Early stopping configured
Both scikit-learn baseline and PyTorch model logged in MLflow
Per-farm predictions generated from the leakage-free pipeline
Feature documentation complete with domain reasoning
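To illustrate the "preprocessing inside a Pipeline" target, here is a minimal sketch of a scikit-learn baseline whose scaler is fitted only on training rows. `Ridge` is an arbitrary choice of baseline for the example, not a requirement of the project, and the arrays are synthetic stand-ins with made-up shapes and values.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data; shapes mimic 12 farms x 3 training harvests
# and 12 farms x 1 test harvest, with 3 made-up features.
X_train = rng.normal(loc=5.0, scale=2.0, size=(36, 3))
y_train = X_train @ np.array([100.0, -50.0, 25.0]) + rng.normal(scale=10.0, size=36)
X_test = rng.normal(loc=5.0, scale=2.0, size=(12, 3))

# The scaler lives inside the Pipeline, so fit() learns its mean and
# variance from X_train only; X_test is merely transformed at predict time.
baseline = Pipeline([
    ("scale", StandardScaler()),
    ("model", Ridge(alpha=1.0)),
])
baseline.fit(X_train, y_train)
predictions = baseline.predict(X_test)
```

A quick check for leakage is to compare `baseline.named_steps["scale"].mean_` against the training-set column means: they should match, which confirms the test rows played no part in fitting the scaler.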

Commit convention

Commit after completing each ticket group. Use descriptive messages: "T-01/T-02: Profile sensor and harvest data", "T-03-T-06: Feature pipeline with temporal split", etc.