Step 1: Plan the Feature Approach
The raw sensor data consists of daily readings; the prediction target is total yield per farm per harvest. You need to bridge that gap: aggregate the daily sensor data into features that capture growing-season conditions.
Open materials/eval-template.md. This is an evaluation design template. Before building anything, fill it in. What are you predicting? Yield per farm per harvest, in kilograms. What metrics make sense for regression? MAE tells you the average prediction error in the same units as yield. RMSE penalizes large misses more heavily. R-squared tells you how much of the yield variation your features explain.
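Those three metrics are cheap to compute directly, which makes them easy to sanity-check later. A minimal sketch with NumPy (the yield numbers here are hypothetical, just to show the units):

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE, and R-squared for yield predictions (kg)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    err = y_pred - y_true
    mae = np.mean(np.abs(err))                 # average miss, in kg
    rmse = np.sqrt(np.mean(err ** 2))          # penalizes large misses harder
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot                   # share of yield variation explained
    return mae, rmse, r2

# Hypothetical yields (kg) for four farms:
mae, rmse, r2 = regression_metrics([1200, 950, 1800, 1400],
                                   [1150, 1000, 1700, 1450])
```

Note that MAE and RMSE stay in kilograms, so Valentina can read them directly; R-squared is unitless.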
What's the baseline? Valentina's current method: last year's numbers. If the model can't beat "repeat the previous harvest," it's not useful.
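Part of the baseline's value is that it costs one line to compute. A sketch with hypothetical per-farm numbers:

```python
import numpy as np

# Hypothetical yields per farm (kg): previous harvest vs. the one being predicted.
last_year = np.array([1200, 950, 1800, 1400])
this_year = np.array([1250, 900, 1750, 1500])

# "Repeat the previous harvest" baseline: predict last year's number as-is.
baseline_mae = np.mean(np.abs(last_year - this_year))
```

Any model you ship must beat `baseline_mae` on the same test set, or Valentina is better off with her current method.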
Step 2: Design the Features
Direct Claude to propose features by aggregating sensor readings over the growing season. For each harvest, the relevant weather occurs in the months before picking: flowering runs October through November, cherry development December through March. Features like mean temperature during flowering, total rainfall during cherry development, and soil moisture trends during the growing season capture what drives yield.
Consider derived features too. The interaction between temperature and rainfall may matter -- a warm dry period versus a warm wet period produces different conditions for the coffee. Polynomial features can capture non-linear relationships between altitude and yield.
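A sketch of how those seasonal windows might translate into pandas aggregations. The column names, readings, and the specific interaction term are hypothetical; the real table has one row per farm per day over two years:

```python
import pandas as pd

# Hypothetical daily readings: one row per farm per day.
daily = pd.DataFrame({
    "farm_id": ["A"] * 4 + ["B"] * 4,
    "date": pd.to_datetime(["2023-10-15", "2023-11-15",
                            "2024-01-15", "2024-02-15"] * 2),
    "temp_c": [21.0, 22.5, 24.0, 23.5, 19.0, 20.0, 21.5, 21.0],
    "rain_mm": [5.0, 8.0, 12.0, 10.0, 3.0, 6.0, 9.0, 7.0],
})

# Seasonal windows from the agronomy above.
flowering = daily["date"].dt.month.isin([10, 11])         # Oct-Nov
development = daily["date"].dt.month.isin([12, 1, 2, 3])  # Dec-Mar

features = pd.DataFrame({
    "flowering_mean_temp": daily[flowering].groupby("farm_id")["temp_c"].mean(),
    "development_total_rain": daily[development].groupby("farm_id")["rain_mm"].sum(),
})

# Derived feature: temperature x rainfall interaction during development.
dev = daily[development].groupby("farm_id")
features["dev_temp_x_rain"] = dev["temp_c"].mean() * dev["rain_mm"].mean()
```

Each column here is one hypothesis about what drives yield; that framing makes the review in the next paragraph concrete.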
Review what Claude proposes. For each feature, there's an implicit hypothesis: "this measurement during this period affects yield." Some hypotheses make sense for coffee agriculture. Others might be statistical noise. You're the one who decides which features carry domain reasoning and which are just numbers.
Step 3: Build the Feature Pipeline
Direct Claude to build the feature engineering pipeline. It should aggregate daily sensor readings into harvest-level features, handle the sensor gaps (impute missing values or flag them), merge features with the harvest records, and create the derived features you chose.
Review the pipeline code. Check that the aggregation windows align with the growing seasons. Check that the sensor gap handling makes sense -- does it impute values for the missing weeks, or does it flag those farms as having incomplete data?
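The gap handling can do both at once: impute the missing values and record how much was missing. A minimal sketch on one hypothetical soil-moisture series (the completeness threshold is an assumption to tune):

```python
import pandas as pd

# Hypothetical weekly soil-moisture readings with a two-week sensor gap.
readings = pd.Series([0.32, 0.30, None, None, 0.28],
                     index=pd.date_range("2024-01-01", periods=5, freq="W"))

gap_weeks = readings.isna().sum()                        # size of the gap
imputed = readings.interpolate(limit_direction="both")   # simple linear fill
incomplete = gap_weeks > 2   # hypothetical threshold for flagging a farm
```

Keeping `gap_weeks` alongside the imputed values lets you later check whether heavily imputed farms are the ones the model misses worst.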
Step 4: Split and Preprocess
Direct Claude to split the data into training and test sets and apply preprocessing (scaling the features).
Look carefully at what Claude produces. AI commonly gets the ordering of preprocessing and splitting wrong in ways that are invisible in the code but devastating for the results. It also tends to default to random splitting regardless of the data's structure.
Check: does the code preprocess first and then split? Does it split randomly, mixing harvests from different years? Either of these contaminates the evaluation. If the model has already seen the pattern from 2024 when it's tested on 2024 predictions, the test metrics are meaningless.
If you notice suspicious results -- metrics that seem too good for a small agricultural dataset with two years of data -- that's a signal. Dig into the pipeline code and find where the contamination happens.
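The scaling version of this mistake is easy to see in miniature. Below, standardizing rainfall before the split lets the test year's unusual value shape the statistics applied to the training rows (numbers are hypothetical):

```python
import numpy as np

rain = np.array([100., 110., 105., 300.])  # 4 harvests; the last is the test year

# WRONG: mean and std computed on all data, including the test harvest.
# The training rows now encode the test year's outlier.
leaky = (rain - rain.mean()) / rain.std()

# RIGHT: statistics from training harvests only, then applied to the test row.
train, test = rain[:3], rain[3:]
mu, sigma = train.mean(), train.std()
honest_test = (test - mu) / sigma
```

The honest test value lands far outside the training range, exactly as it would in real forecasting; the leaky version quietly hides that.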
Step 5: Talk to Dr. Sarah Chen
Dr. Sarah Chen is available as a senior colleague when you hit the temporal data issue. She's a data scientist who specializes in evaluation design and data quality.
If you've noticed the leakage -- either through code review or through suspiciously good metrics -- she confirms what you found. "If the model has seen 2024 rainfall when predicting the 2024 harvest, it's not predicting -- it's cheating. Train on the earlier harvests, test on the later ones. That's the only honest split for temporal data."
She'll push on the split design: train on the first three harvests (2022-H2, 2023-H1, 2023-H2), test on the fourth (2024-H1). This simulates real forecasting -- using past data to predict the next season.
Step 6: Fix the Pipeline
Direct Claude to implement the temporal split: train on harvests 1 through 3, test on harvest 4. Move all preprocessing to happen after the split, or use a scikit-learn Pipeline that enforces the correct order automatically. A Pipeline chains preprocessing and model training together so that fitting only sees the training data.
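A minimal sketch of the corrected setup: split by harvest label first, then let a scikit-learn Pipeline fit the scaler on training rows only. The table contents and the choice of Ridge as the model are hypothetical stand-ins for the real feature table:

```python
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical harvest-level table: one row per farm per harvest.
df = pd.DataFrame({
    "harvest": ["2022-H2", "2023-H1", "2023-H2", "2024-H1"] * 2,
    "flowering_mean_temp": [21.5, 22.0, 21.0, 23.0, 19.5, 20.0, 19.0, 21.0],
    "development_total_rain": [220, 340, 260, 180, 310, 400, 330, 250],
    "yield_kg": [1200, 1350, 1250, 1100, 950, 1050, 980, 900],
})

# Temporal split: train on the first three harvests, test on the fourth.
train = df[df["harvest"] != "2024-H1"]
test = df[df["harvest"] == "2024-H1"]
X_cols = ["flowering_mean_temp", "development_total_rain"]

# The Pipeline fits the scaler inside fit(), on training rows only;
# predict() reuses those training statistics on the test rows.
model = make_pipeline(StandardScaler(), Ridge())
model.fit(train[X_cols], train["yield_kg"])
preds = model.predict(test[X_cols])
```

Because the scaler lives inside the Pipeline, there is no way to accidentally fit it on the full dataset, which is the property you want the code review to verify.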
Run the corrected pipeline. The metrics will be lower than before. That drop is not a failure -- it's the difference between an honest evaluation and a contaminated one. The leaky pipeline's R-squared might have been 0.90. The honest pipeline might show 0.55 to 0.65. The gap between those numbers measures how much the leakage inflated the results.
The honest metrics are what Valentina can rely on. A model that honestly predicts 60% of yield variation is more useful than one that appears to predict 90% but falls apart when the next harvest actually arrives.
Check: The pipeline uses temporal splitting (not random). Preprocessing happens after the split (or inside a Pipeline). The corrected metrics are lower than the leaky metrics but represent honest performance.