ML P4: Coffee Yield Prediction
Client
Valentina Reyes, Owner and Export Director at Finca Esperanza Exports (Neiva, Colombia). Specialty coffee exporter sourcing from twelve farms in Huila, exporting to roasters in Europe, Japan, and the US.
What you're building
A model that predicts coffee yield per farm and harvest period from daily sensor data. The model must use a temporal split (train on earlier harvests, test on the most recent), prevent data leakage, and deliver per-farm predictions that Valentina can use in contract negotiations.
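To make the temporal split and leakage requirements concrete, here is a minimal sketch. It assumes a harvest-level table with one row per farm per harvest. The column names (`farm_id`, `harvest`, `mean_temp_c`, `total_rain_mm`, `yield_kg`) and all values are invented for illustration and are not taken from the project files. `Ridge` is only a stand-in baseline, not a prescribed model.

```python
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical harvest-level table: one row per farm per harvest.
# Column names and values are made up for illustration only.
features = pd.DataFrame({
    "farm_id": ["F01", "F02", "F03"] * 4,
    "harvest": [1] * 3 + [2] * 3 + [3] * 3 + [4] * 3,
    "mean_temp_c": [21.0, 19.5, 20.2, 21.4, 19.8, 20.0,
                    20.7, 19.2, 20.5, 21.1, 19.6, 20.3],
    "total_rain_mm": [610, 720, 680, 590, 700, 650,
                      640, 750, 690, 600, 710, 670],
    "yield_kg": [1500, 1820, 1690, 1460, 1790, 1640,
                 1550, 1880, 1710, 1490, 1800, 1670],
})

feature_cols = ["mean_temp_c", "total_rain_mm"]
last_harvest = features["harvest"].max()

# Temporal split: every earlier harvest trains, the most recent one tests.
train = features[features["harvest"] < last_harvest]
test = features[features["harvest"] == last_harvest]

# The scaler lives inside the Pipeline, so its mean and variance are
# estimated from the training rows only when fit() is called.
model = Pipeline([
    ("scale", StandardScaler()),
    ("ridge", Ridge(alpha=1.0)),
])
model.fit(train[feature_cols], train["yield_kg"])

# One prediction per farm for the held-out harvest.
predictions = pd.DataFrame({
    "farm_id": test["farm_id"].to_numpy(),
    "predicted_yield_kg": model.predict(test[feature_cols]),
})
print(predictions)
```

Splitting on the harvest index rather than at random keeps the test set strictly later than everything the model has seen, which is the situation Valentina faces when negotiating a future contract.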
Tech stack
- Python 3.10+
- pandas (data manipulation)
- scikit-learn (preprocessing, Pipelines, baseline models)
- PyTorch (neural network training)
- MLflow (experiment tracking)
- Jupyter notebooks (exploration and training)
File structure
materials/
CLAUDE.md -- this file
valentina-email.md -- Valentina's initial email
tickets.md -- work breakdown
eval-template.md -- evaluation design template (to be filled in by the student)
sensor-data.csv -- daily sensor readings, 12 farms, 2 years
harvest-records.csv -- yield per farm per harvest period
scripts/ -- generation scripts (not student-facing)
Ticket list
- T-01: Profile sensor dataset structure and quality
- T-02: Profile harvest records and link to sensor data
- T-03: Design feature aggregation from daily readings to harvest-level features
- T-04: Build feature engineering pipeline with temporal awareness
- T-05: Implement temporal train/test split
- T-06: Verify there is no preprocessing leakage (preprocessing is fitted on the training set only, either after the split or inside a scikit-learn Pipeline)
- T-07: Identify and classify outliers with domain reasoning
- T-08: Handle outliers based on domain judgment (keep variety effects, flag sensor anomalies)
- T-09: Document all feature construction decisions with rationale
- T-10: Set up PyTorch model architecture for yield regression
- T-11: Implement training loop with loss curve monitoring
- T-12: Configure early stopping with appropriate patience
- T-13: Run baseline comparison (scikit-learn vs PyTorch) with MLflow tracking
- T-14: Generate per-farm yield predictions from best model
- T-15: Deliver predictions to Valentina in business terms
- T-16: Write README and close project
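For T-03 and T-04, one possible shape for the aggregation step is sketched below. The schemas are assumptions: the real columns in sensor-data.csv and harvest-records.csv may differ, and `window_start`, `harvest_date`, `temp_c`, and `rain_mm` are hypothetical names. The point is the temporal awareness: each harvest row only sees readings from its own window, never later ones.

```python
import pandas as pd

# Hypothetical schemas and made-up values; check the real CSV headers first.
sensor = pd.DataFrame({
    "farm_id": ["F01"] * 4 + ["F02"] * 4,
    "date": pd.to_datetime(
        ["2023-01-01", "2023-01-02", "2023-07-01", "2023-07-02"] * 2
    ),
    "temp_c": [20.0, 22.0, 24.0, 26.0, 18.0, 20.0, 21.0, 23.0],
    "rain_mm": [5.0, 0.0, 12.0, 3.0, 8.0, 2.0, 0.0, 6.0],
})
harvests = pd.DataFrame({
    "farm_id": ["F01", "F01", "F02", "F02"],
    "harvest": [1, 2, 1, 2],
    "window_start": pd.to_datetime(["2023-01-01", "2023-07-01"] * 2),
    "harvest_date": pd.to_datetime(["2023-06-30", "2023-12-31"] * 2),
    "yield_kg": [1500, 1460, 1820, 1790],
})

# Pair each harvest with all readings from the same farm, then keep only
# readings inside [window_start, harvest_date) so no future data leaks in.
paired = harvests.merge(sensor, on="farm_id", how="left")
in_window = paired[
    (paired["date"] >= paired["window_start"])
    & (paired["date"] < paired["harvest_date"])
]

# Collapse daily readings into one feature row per farm per harvest.
harvest_features = (
    in_window.groupby(["farm_id", "harvest"], as_index=False)
    .agg(
        mean_temp_c=("temp_c", "mean"),
        total_rain_mm=("rain_mm", "sum"),
        n_readings=("date", "count"),
    )
    .merge(harvests[["farm_id", "harvest", "yield_kg"]],
           on=["farm_id", "harvest"])
)
print(harvest_features)
```

Keeping `n_readings` alongside the aggregates gives a cheap way to spot harvest windows with missing sensor days, which feeds into the outlier tickets (T-07, T-08).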
Verification targets
- Temporal split in use (train on harvests 1-3, test on harvest 4)
- No preprocessing fitted before the split (fit on the training set only, or keep preprocessing inside a scikit-learn Pipeline)
- Loss curves plotted with training and validation loss
- Early stopping configured with an explicit patience value
- Both scikit-learn baseline and PyTorch model logged in MLflow
- Per-farm predictions generated from the leakage-free pipeline (temporal split, preprocessing fitted on training data only)
- Feature documentation complete with domain reasoning
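The loss-curve and early-stopping targets can be illustrated with a small framework-agnostic helper. This is a sketch, not the required implementation: the `EarlyStopping` class, its `patience` and `min_delta` parameters, and the simulated loss values are all hypothetical. In a real PyTorch loop, `val_loss` would come from evaluating the validation set each epoch, and the model's `state_dict()` would typically be saved whenever a new best epoch is recorded.

```python
class EarlyStopping:
    """Stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch = -1
        self.epochs_without_improvement = 0

    def step(self, epoch, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            # New best: remember it and reset the patience counter.
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience


# Made-up validation losses standing in for a real training run.
simulated_val_losses = [0.90, 0.70, 0.62, 0.60, 0.61, 0.63, 0.64, 0.66]

stopper = EarlyStopping(patience=3)
history = []  # keep every value so the loss curve can be plotted later
for epoch, val_loss in enumerate(simulated_val_losses):
    history.append(val_loss)
    if stopper.step(epoch, val_loss):
        break

print(f"stopped after epoch {epoch}, best epoch {stopper.best_epoch}")
```

With only a few harvests per farm, the validation loss is likely to be noisy, so the patience value is worth justifying in the project notes rather than leaving at a default.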
Commit convention
Commit after completing each ticket group. Use descriptive messages: "T-01/T-02: Profile sensor and harvest data", "T-03-T-06: Feature pipeline with temporal split", etc.