Learn by Directing AI
Unit 1

Meet Roberto and understand the dyeing operation

Step 1: Read Roberto's message

Roberto Hernandez runs production at Textiles del Pacifico in San Salvador. He reached out through the industry association. Here is his WhatsApp message:

WhatsApp -- Roberto Hernandez

Hey, Roberto here from Textiles del Pacifico in San Salvador. Got your contact from the industry association.

Quick version: I run production for a textile dyeing operation, 3 lines, about 30 batches per line per day. Our re-dye rate is at 8% and my biggest client says it needs to be under 5% by the next contract review.

I have tons of data -- every batch is logged with machine settings, temperatures, chemical concentrations, humidity, operator, color match scores. But I can't make sense of it across all three lines. Line 1 is older equipment so comparing directly doesn't work.

I need someone to help me analyze this data and figure out what actually drives quality. I have about 14 months of batch records I can export.

Can I send you a sample file? It's CSV from our production system, about 90 batches per day across all lines.

Let me know

Three production lines. Roughly 90 batches per day total. A re-dye rate that needs to drop from 8% to 5% before the next contract review. Roberto has the data but can't analyze it systematically -- and he already knows that comparing across lines is complicated because Line 1 runs older equipment.

Before you look at any data, message Roberto. The quality of your analysis depends on the quality of the questions you ask now. Ask him about the three lines, the differences between them, and anything about how the data is recorded that might affect comparisons.

Notice something about this data: it includes operator IDs. Every batch is tied to a specific worker. Roberto mentions his operators by name and talks about the factory floor with pride. As you work with this data, keep in mind who it is about. Six operators produce these batches. Their individual performance will become visible in any analysis you build.

Step 2: Message Roberto

Open the Roberto Hernandez chat and ask about his production lines and data. Good questions focus on differences between the lines and how data is recorded -- not on technical pipeline details Roberto would not understand.

If you ask about data consistency across lines, Roberto will mention something important: Line 1 records temperature in Fahrenheit. The machines are older American equipment from the 1990s, and nobody ever changed the settings. Lines 2 and 3 record in Celsius. Roberto converts in his head and does not think of it as a data problem.

If you do not ask, he will not mention it. That information only surfaces through the right questions.

Roberto will also mention his workers. He knows them by name. Miguel on Line 1 has been there since the beginning. The operators and their performance are part of the story -- and part of the data.

Step 3: Review the batch data sample

Roberto sent a sample file. Open materials/batch-data-sample.csv and look at what you are working with. The file has 30 rows -- 10 per production line -- with these columns:

  • batch_id -- unique identifier for each batch (format: L1-20250115-001)
  • line_number -- which production line ran the batch (1, 2, or 3)
  • fabric_type -- polyester or cotton_blend
  • dye_formula -- the dye formula code used
  • temperature -- dyeing temperature
  • humidity -- plant humidity percentage
  • chemical_concentration -- chemical concentration in g/L
  • color_match_score -- how close the dye color matched the target
  • pass_fail -- whether the batch passed quality inspection
  • operator_id -- which operator ran the batch (OP-001 through OP-006)
  • timestamp -- when the batch was processed
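
One useful cross-check: the batch_id format shown above encodes the line number and date, so it can be validated against the line_number column. A quick sketch (the parse_batch_id helper is my own, not part of Roberto's system):

```python
def parse_batch_id(batch_id):
    """Split an ID like 'L1-20250115-001' into line, date, and sequence."""
    line_part, date_part, seq_part = batch_id.split("-")
    return {
        "line_number": int(line_part.lstrip("L")),
        "date": date_part,
        "sequence": int(seq_part),
    }

parsed = parse_batch_id("L1-20250115-001")
print(parsed["line_number"])  # 1 -- should match the line_number column
```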

Look at the temperature column. Scan the values for Line 1 rows versus Line 2 and Line 3 rows. Line 1 temperatures fall in the 160-200 range; Lines 2 and 3 fall in the 68-95 range. The standard conversion is C = (F - 32) x 5/9, so 160-200F corresponds to roughly 71-93C -- squarely inside the range Lines 2 and 3 report. The two sets of values are consistent with Fahrenheit on Line 1 and Celsius on Lines 2 and 3.
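
That hypothesis is easy to verify in code (a quick sketch using the standard Fahrenheit-to-Celsius formula):

```python
def f_to_c(temp_f):
    """Standard Fahrenheit-to-Celsius conversion."""
    return (temp_f - 32) * 5 / 9

# Line 1's observed range, converted from Fahrenheit:
print(round(f_to_c(160), 1), round(f_to_c(200), 1))  # 71.1 93.3 -- inside 68-95
```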

Load the sample into DuckDB and run a quick query to see the ranges by line:

SELECT line_number,
       MIN(temperature) AS min_temp,
       MAX(temperature) AS max_temp,
       ROUND(AVG(temperature), 1) AS avg_temp
FROM read_csv_auto('materials/batch-data-sample.csv')
GROUP BY line_number
ORDER BY line_number;

The result should show Line 1 averaging somewhere around 170-185, while Lines 2 and 3 average around 75-85. The discrepancy is obvious once you group by line.

Step 4: Understand why this matters

Consider what happens if you average temperature across all three lines without converting units first. Line 1's Fahrenheit values pull the average up. Lines 2 and 3's Celsius values pull it down. The result is a number between the two scales that represents neither Fahrenheit nor Celsius -- it is meaningless.
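
To make that concrete, here is the effect with a few illustrative temperatures (these numbers are made up for the demonstration, not taken from Roberto's file):

```python
def f_to_c(temp_f):
    """Standard Fahrenheit-to-Celsius conversion."""
    return (temp_f - 32) * 5 / 9

line1_f = [165.0, 180.0, 200.0]   # Line 1, logged in Fahrenheit
lines23_c = [72.0, 80.0, 91.0]    # Lines 2 and 3, logged in Celsius

# Naive average mixes units: the result is in neither scale
naive_avg = sum(line1_f + lines23_c) / 6

# Convert Line 1 to Celsius first, then average
correct_avg = sum([f_to_c(t) for t in line1_f] + lines23_c) / 6

print(round(naive_avg, 1), round(correct_avg, 1))  # 131.3 82.1
```

The naive number sits between the two scales and describes no real operating condition; the converted average is a genuine Celsius temperature.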

Any analysis that correlates temperature with color match quality across all three lines will produce misleading results if this conversion is not handled. A "high temperature" on Line 1 (say 200F, which is 93C) looks like an outlier when plotted alongside Line 2 and Line 3 values in the 70-90 range. But it is actually a normal operating temperature for that equipment.

This is a different kind of data problem than you have seen before. The data is not broken. No rows are missing. No values are null or malformed. The temperature column contains valid numbers on every row. The problem is that those numbers mean different things depending on which line produced them -- and nothing in the data tells you that. Only Roberto's domain knowledge reveals it, and only if you ask the right question.

This is the first constraint to track. There may be others. When you start building the pipeline, the staging layer will need to handle this conversion before any downstream analysis.
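
When you reach that stage, the conversion itself is simple; the staging rule just has to know which line produced each batch. A minimal sketch of the logic, assuming the column names from the sample file (the function name is mine):

```python
def normalize_temperature(row):
    """Return the batch temperature in Celsius.
    Line 1 machines log Fahrenheit; Lines 2 and 3 log Celsius."""
    if row["line_number"] == 1:
        return (row["temperature"] - 32) * 5 / 9
    return row["temperature"]

# A normal Line 1 reading normalizes into the same range as Lines 2 and 3
print(round(normalize_temperature({"line_number": 1, "temperature": 200.0}), 1))  # 93.3
print(normalize_temperature({"line_number": 2, "temperature": 80.0}))  # 80.0
```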

✓ Check

Can you explain why averaging temperature across all three lines without unit conversion would produce misleading results?