Step 1: Understand API extraction
Loading a CSV file is a conversation between your code and your own file system. API extraction is different. You are making requests to someone else's server. That server can be slow, rate-limit you, return data in pages, or go down entirely. Your extraction script is a guest, not a host.
The Open-Meteo API returns daily weather observations. You tested it in Unit 2 and saw the response structure. Now you are building the full extraction: six months of daily data for Kostanay, loaded into DuckDB.
Step 2: Direct the extraction
Direct Claude Code to write a Python script that extracts daily weather data from Open-Meteo. Specify:
- Location: Kostanay, Kazakhstan (latitude 53.21, longitude 63.63)
- Date range: 2025-01-01 to 2025-06-30
- Fields: temperature min, max, and average; relative humidity; precipitation
- Output: Rows loaded into a DuckDB table
Be specific about the date range and fields. AI will use sensible defaults if you leave things unspecified, but "sensible" and "what Assel needs" are not always the same thing.
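A minimal sketch of the request such a script would build. The endpoint and daily-variable names below follow Open-Meteo's archive API documentation, but treat them as assumptions and verify them against the current API reference before relying on them:

```python
# Build the Open-Meteo archive request for Kostanay's daily weather.
# Variable names (temperature_2m_min, etc.) are assumptions from the
# Open-Meteo docs -- confirm them against the live API reference.
import urllib.parse

BASE_URL = "https://archive-api.open-meteo.com/v1/archive"

def build_request_url(lat, lon, start_date, end_date):
    params = {
        "latitude": lat,
        "longitude": lon,
        "start_date": start_date,
        "end_date": end_date,
        "daily": ",".join([
            "temperature_2m_min",
            "temperature_2m_max",
            "temperature_2m_mean",
            "relative_humidity_2m_mean",
            "precipitation_sum",
        ]),
        "timezone": "auto",
    }
    return f"{BASE_URL}?{urllib.parse.urlencode(params)}"

url = build_request_url(53.21, 63.63, "2025-01-01", "2025-06-30")
```

Spelling out every parameter like this is exactly the specificity the prompt to Claude Code should carry: coordinates, inclusive date range, and the exact fields, so nothing is left to defaults.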
Step 3: Check the logs
Run the extraction script. Before you check the final table, look at the logs.
Good logging tells you what happened during extraction: how many records were fetched per request, whether any errors occurred, how long each request took. If the script just prints "Done" at the end, that is not logging -- it is decoration. Direct Claude Code to add structured logging if it is missing: records per page, error counts, timing.
Logging is how the pipeline reports on itself. Without it, your only diagnostic tool is comparing the final count against the expected number. With it, you know which part of the extraction had problems before you even look at the data.
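The shape of that per-page logging can be sketched as follows. The `fetch_page` callable and the field names in the log lines are illustrative, not part of any real script:

```python
# Structured logging sketch: every page reports its own record count and
# timing, so a short page or a slow request is visible in the logs before
# you query the table. fetch_page is a placeholder for the real API call.
import logging
import time

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("extract")

def fetch_all(fetch_page):
    """fetch_page(page) returns a list of records, or [] when exhausted."""
    total, page = 0, 1
    while True:
        start = time.monotonic()
        records = fetch_page(page)
        elapsed = time.monotonic() - start
        if not records:
            break
        total += len(records)
        log.info("page=%d records=%d elapsed=%.2fs", page, len(records), elapsed)
        page += 1
    log.info("extraction complete: pages=%d total_records=%d", page - 1, total)
    return total
```

The key=value format is a deliberate choice: it makes the log lines easy to grep and to parse later, which plain sentences are not.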
Step 4: Verify the count
Count the rows in the weather table. Compare against the API's reported total for the date range.
This is the same verification question you asked with Carlos's honey data: did everything arrive? But the technique is different. With CSV files, you knew how many rows the source had because you could count the file. With an API, the source reports its own total and you compare against that.
If the count is off, the most common cause is pagination. AI commonly implements pagination logic that works for most pages but stops early on the last page. Check whether the extraction captured every day in the range. If days are missing from the end, the pagination boundary is the place to look.
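For a daily-granularity source, the expected total is easy to compute yourself: the number of days in the inclusive range. A small sketch (the `weather` table name is an assumption):

```python
# Compute how many daily rows the range should produce, inclusive of both
# endpoints, and compare it with the table's row count.
from datetime import date

def expected_days(start: date, end: date) -> int:
    return (end - start).days + 1  # +1 because both endpoints are included

n_expected = expected_days(date(2025, 1, 1), date(2025, 6, 30))
# Compare in DuckDB:  SELECT count(*) FROM weather;  -- should equal n_expected
```

The `+ 1` is itself a boundary condition worth noticing: forgetting it makes your *check* off by one, and then a correct extraction looks broken.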
Step 5: Fix pagination if needed
If the row count does not match the expected total, direct Claude Code to review the pagination logic and fix it. Re-run the extraction and verify the count matches.
The specific error depends on what Claude generated. The pattern to watch for: pagination that works correctly for all pages except the boundary conditions. The extraction completes without errors, reports success, and delivers fewer records than the source contains. This is a silent failure -- the most dangerous kind, because nothing tells you it happened except the count check.
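One hypothetical version of this boundary bug, reduced to its arithmetic: computing the page count with floor division, which silently drops the final partial page. Both functions below are illustrations, not code from any generated script:

```python
# A common last-page bug in paginated extraction, shown as page math.

def total_pages_buggy(total_records, page_size):
    # Floor division: 181 records at 50/page yields 3 pages, and the
    # final 31 records are never requested. No error is raised.
    return total_records // page_size

def total_pages_fixed(total_records, page_size):
    # Ceiling division: the partial last page is included.
    return (total_records + page_size - 1) // page_size
```

Note how the buggy version matches the failure described above: the script runs cleanly, logs success, and only the count check reveals the missing tail.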
Step 6: Test idempotency
Run the extraction a second time. Count the rows again.
If the count doubled, the extraction used INSERT to load rows without checking whether they already existed. Running the same extraction twice should produce the same output -- same row count, same data. This is idempotency. If your pipeline cannot be safely re-run, it is not a production pipeline.
Direct Claude Code to fix the loading pattern. Use a replacement strategy: delete-then-insert keyed on date, or a MERGE pattern. After the fix, run the extraction a third time and confirm the row count is identical to the first correct run.
Check your understanding:
- The extracted count matches the API's reported total.
- A re-run produces an identical row count.
- The logs show per-page record counts.