Step 1: Load the production logs
The production logs are Parquet -- a columnar format you have not worked with before. Parquet files carry schema metadata (column names and types) embedded in the file itself. That metadata can be stale if the schema changed after the file was generated.
Direct AI to load the production logs into DuckDB:
Load materials/production-logs.parquet into DuckDB. Show me the column names, inferred types, and row count. Then compare the inferred column types against what the data dictionary says they should be.
You should see roughly 50,000 rows across three production lines (LINE-A, LINE-B, LINE-C), covering 18 months of shift-level data. Check the schema carefully: Parquet embeds column types, but those types reflect whatever tool wrote the file, not necessarily what the data dictionary specifies. A column of IDs that happen to be numeric might be stored as integer when it should be string. Compare every type against the data dictionary.
Validate the Parquet metadata against the actual data. The metadata says the file has certain columns with certain types. Do the actual columns match? This is a format-specific check -- CSV files do not carry metadata, so this concern does not exist there.
Step 2: Load and flatten the sales data
The sales data is JSON -- nested API export with customer objects and line item arrays. This is the first time you are working with nested JSON data.
Direct AI to load the sales data:
Load materials/sales-data.json into a pandas DataFrame. Show me the resulting columns, types, and row count.
Watch what AI does with the nested structure. Each order has a customer object (with customer_id, customer_type, channel) and a line_items array (with product_type, quantity, unit_price, total). AI will flatten this into a table -- but how?
AI commonly flattens nested JSON inconsistently: first-level keys become columns, while deeper nesting gets stringified into a single column or collapsed to only the first item. The data you lose in flattening is data you cannot analyze.
Compare AI's flattened structure against the data dictionary. Identify what was lost or misrepresented. If the line_items array was stringified instead of properly expanded, that means you cannot analyze product-level detail without fixing the flattening approach.
```mermaid
graph LR
subgraph "Nested JSON"
A[order] --> B[order_id]
A --> C[order_date]
A --> D[customer]
D --> E[customer_id]
D --> F[customer_type]
D --> G[channel]
A --> H[line_items]
H --> I["item 1: product, qty, price"]
H --> J["item 2: product, qty, price"]
end
subgraph "Flattened Table"
K[order_id]
L[order_date]
M[customer_id]
N[customer_type]
O[channel]
P["line_items → ???"]
end
A -.->|"AI flattens"| K
D -.-> M
H -.->|"lost or stringified"| P
```
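The difference between the two flattening outcomes can be seen directly with `pandas.json_normalize`. This sketch uses a single hypothetical order shaped like the sales export described above (the field values are made up):

```python
import pandas as pd

# Hypothetical order shaped like the sales export -- values are illustrative.
orders = [
    {
        "order_id": "ORD-1001",
        "order_date": "2025-06-01",
        "customer": {"customer_id": "C-17", "customer_type": "distributor",
                     "channel": "online"},
        "line_items": [
            {"product_type": "widget", "quantity": 4, "unit_price": 2.5, "total": 10.0},
            {"product_type": "gadget", "quantity": 1, "unit_price": 8.0, "total": 8.0},
        ],
    }
]

# Naive flattening: customer.* becomes columns, but line_items stays a raw list
# in a single cell -- product-level detail is unanalyzable in this shape.
naive = pd.json_normalize(orders)
print(type(naive["line_items"].iloc[0]))

# Proper flattening: one row per line item, with order and customer fields
# repeated on each row via the meta argument.
items = pd.json_normalize(orders, record_path="line_items",
                          meta=["order_id", "order_date",
                                ["customer", "customer_id"],
                                ["customer", "customer_type"],
                                ["customer", "channel"]])
print(items)
```

If AI hands you the naive shape, asking it to re-flatten with an explicit record path (or an explode-and-expand step) recovers the product-level rows.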
Step 3: Profile the procurement and quality data
The procurement records and quality results are both CSV -- familiar territory. But format-specific issues still apply.
Direct AI to load and profile the procurement records:
Load materials/procurement-records.csv into DuckDB. Show me the column names, types, row count, and check for encoding issues, delimiter problems, or quoted fields with commas inside.
About 500 rows of monthly supplier data covering 24 months. Check for the standard CSV concerns: encoding, delimiters, quoted fields. These are the same validation steps you have done before -- the difference is that you are now doing them alongside JSON and Parquet validation, and each format has different failure modes.
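The delimiter and quoting checks can be done defensively with the standard library before trusting a full load. This sketch uses a hypothetical three-line sample standing in for materials/procurement-records.csv (semicolon-delimited with quoted commas, a common export quirk; the columns are made up):

```python
import csv
import io

# Hypothetical sample standing in for the procurement CSV -- note the
# non-comma delimiter and the quoted field containing a comma.
raw = ('supplier_id;month;"amount, usd"\n'
       'SUP-01;2024-01;"12,500"\n'
       'SUP-02;2024-01;"9,300"\n')

# Sniff the dialect instead of assuming comma-delimited.
dialect = csv.Sniffer().sniff(raw.splitlines()[0])
print("detected delimiter:", repr(dialect.delimiter))

rows = list(csv.reader(io.StringIO(raw), dialect))
header, data = rows[0], rows[1:]
print(header)  # the quoted field keeps its embedded comma as one column

# Every row should parse to the same width; ragged rows signal a
# delimiter or quoting problem.
assert all(len(r) == len(header) for r in data)
```

The same sniff-then-verify pattern catches encoding surprises too if you open the real file with an explicit `encoding=` and let a `UnicodeDecodeError` surface early rather than load silently mangled text.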
Then load the quality results:
Load materials/quality-results.csv into DuckDB. Show me the column names, types, row count, and check for any transcription artifacts from the PDF-to-CSV conversion.
About 3,000 rows of batch-level quality test data. These were derived from PDF lab reports -- check for transcription artifacts. Missing values, inconsistent formatting, or truncated fields are common when data moves from PDF to CSV.
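Transcription artifacts are easy to scan for systematically. The sketch below runs four typical checks against a hypothetical few rows standing in for materials/quality-results.csv (the column names, vocabulary, and artifact examples are assumptions):

```python
import pandas as pd

# Hypothetical rows -- the artifacts shown (stray whitespace, blanks,
# truncated values, non-numeric sentinels) are typical of PDF-to-CSV output.
df = pd.DataFrame({
    "batch_id": ["B-001", "B-002 ", " B-003", "B-004"],
    "test_result": ["PASS", "FAIL", "", "PAS"],      # blank + truncated value
    "measured_ppm": ["12.4", "9.8", "11.1", "n/a"],  # non-numeric sentinel
})

# 1. Stray whitespace from column-aligned PDF text.
padded = int((df["batch_id"] != df["batch_id"].str.strip()).sum())

# 2. Blank or missing values.
blanks = int((df["test_result"].str.strip() == "").sum())

# 3. Values outside the expected vocabulary (catches truncation like "PAS").
bad_results = ~df["test_result"].str.strip().isin(["PASS", "FAIL", ""])

# 4. Fields that should be numeric but fail conversion.
non_numeric = int(pd.to_numeric(df["measured_ppm"], errors="coerce").isna().sum())

print(f"padded ids: {padded}, blanks: {blanks}, "
      f"bad results: {int(bad_results.sum())}, non-numeric ppm: {non_numeric}")
```

Running checks like these on all 3,000 rows tells you whether the artifacts are rare enough to fix by hand or common enough to need a cleaning pass.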
Step 4: Assess data freshness
Now step back and look across all four sources. The data dictionary notes when each source was last updated. Not all of them are current.
Direct AI to show the most recent record date for each source:
For each of the four data sources (production logs, sales data, procurement records, quality results), show me the most recent date in the data.
The sales data is the stale one -- last updated about three weeks before the production data. That means any analysis combining sales and production data has a gap. Revenue figures from three weeks ago may not reflect current order patterns. The question is not whether the data is clean -- it is whether the data is current enough for the analysis Siobhan needs.
Data freshness is a quality dimension AI does not check: it analyzes whatever it receives without noting that the most recent record is weeks old. You need to assess freshness yourself and decide: can the stale data still support the analysis, or does the gap undermine the conclusions?
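A freshness check reduces to comparing each source's latest record date against the most current source. This sketch uses assumed dates shaped like the Step 4 query output (the specific dates and the 14-day threshold are illustrative judgment calls, not values from the materials):

```python
from datetime import date

# Hypothetical most-recent-record dates per source (assumed values).
latest = {
    "production_logs": date(2025, 6, 28),
    "sales_data": date(2025, 6, 7),      # the stale source, ~3 weeks behind
    "procurement_records": date(2025, 6, 30),
    "quality_results": date(2025, 6, 25),
}

reference = max(latest.values())
for source, d in latest.items():
    lag = (reference - d).days
    # The 14-day threshold is a judgment call -- tune it to the analysis.
    flag = "STALE" if lag > 14 else "current"
    print(f"{source:20s} latest={d} lag={lag:3d}d {flag}")
```

The threshold is the part AI cannot decide for you: whether a three-week lag matters depends on what Siobhan's analysis claims about current order patterns.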
Step 5: Test scale-appropriate tools
The production logs have roughly 50,000 rows. That is large enough to notice a difference between tools.
Direct AI to time a simple aggregation in pandas versus DuckDB or Polars:
Time a simple aggregation on the production logs -- average units_produced grouped by production_line_id -- using pandas first, then using DuckDB SQL. Show me the execution time for each.
At 50,000 rows, pandas will work but may be noticeably slower than DuckDB or Polars. AI defaults to pandas because it is the most common tool in its training data, even when the dataset makes pandas impractical. The response to a slow pandas operation is not "get more memory" -- it is "use a tool designed for the scale."
If DuckDB or Polars is faster, use it for production log queries going forward. If the difference is negligible at this scale, note that and keep the option in mind for when datasets grow. The point is not that pandas is bad -- it is that choosing the right tool for the data size is a professional decision, not an afterthought.
Step 6: Document format-specific findings
Before moving on, update your project context. For each of the four sources, document:
- Format-specific issues found (JSON flattening decisions, Parquet metadata validation, CSV parsing notes)
- Data freshness status (current or stale, with the most recent record date)
- Scale considerations (which tool to use for which source)
This documentation goes into your working notes or CLAUDE.md. In future sessions, including format-specific rules in the context -- "always expand the line_items array before aggregating sales data" or "production logs: use DuckDB, not pandas" -- changes what AI produces on the first prompt.
✓ Check: Four sources profiled; JSON flattening issues documented; freshness assessed; scale tool selected