Step 1: Set up the GitHub Actions workflow
Quality checks that run manually are quality checks that get forgotten. The next step is making them automatic -- a CI/CD pipeline that runs dbt tests and Soda Core checks on every pull request.
Open materials/github-actions-template.yml. This is a starting workflow template for GitHub Actions. It defines a workflow that triggers on pull requests and runs the quality checks as pipeline stages.
Direct Claude to configure the workflow for your project. The workflow needs to:
- Install Python dependencies (dbt-core, dbt-duckdb, soda-core)
- Run dbt build to materialize models
- Run dbt test to check row-level properties
- Run soda scan to check batch-level patterns
The workflow file goes in .github/workflows/ in your repository. When you push a branch and open a PR, GitHub runs the workflow automatically. If any check fails, the PR is marked as failing and the merge button is blocked.
This transforms quality from something you do manually ("remember to run the tests") into a pipeline stage that runs whether you remember or not.
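Under those requirements, a configured workflow might look like the sketch below. Every step name, path, and version here is an assumption -- adapt them to your repository layout and to the template in materials/github-actions-template.yml:

```yaml
# Sketch of .github/workflows/quality-checks.yml -- paths and names are
# assumptions, not the template's actual contents.
name: quality-checks
on: pull_request

jobs:
  quality:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install dbt-core dbt-duckdb soda-core soda-core-duckdb
      - run: dbt build                       # materialize models (also runs tests)
      - run: dbt test                        # row-level checks
      - run: soda scan -d duckdb_local -c soda/configuration.yml soda/checks.yml
```

Note that dbt build already runs each model's tests after materialising it; the separate dbt test step is kept here only to mirror the pipeline stages listed above, at the cost of running the tests twice.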
Step 2: Test the CI pipeline
Create a branch, make a deliberate change, and open a PR to test the pipeline.
git checkout -b test-ci-pipeline
Make a small change to one of your dbt models -- something that should pass all tests. Push the branch and open a PR:
git push -u origin test-ci-pipeline
Watch the GitHub Actions run. The workflow should execute the dbt and Soda Core steps. If the paths in your workflow file are wrong, the steps will fail -- check the error output in the Actions tab and fix the paths.
Once you have confirmed the pipeline works on a passing change, test the failure case. Remove a NOT NULL test from your schema, or introduce a change that should trigger a dbt test failure. Push the change. The CI pipeline should catch it and mark the PR as failing.
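For reference, the NOT NULL test you would remove lives in the model's dbt schema file and looks something like this (model and column names are assumptions):

```yaml
models:
  - name: batch_exports      # hypothetical model name
    columns:
      - name: batch_id       # hypothetical column
        tests:
          - not_null         # delete this line to force a CI failure
```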
A quality gate that only passes is not a quality gate. Verify that it catches failures before you trust it.
Step 3: Configure Dagster freshness policies
Roberto's data arrives daily -- one batch export from the production system at the end of each shift, around 10pm. He checks the dashboard each morning. If the data from last night's batches is not there by 8am, he needs to know something went wrong.
A freshness policy is a contract between the pipeline and its consumers. It says: "this data will be no more than N hours old." When the policy is violated -- the data is staler than the threshold -- Dagster raises a visible alert.
Direct Claude to configure freshness policies for the pipeline assets. AI commonly sets thresholds with round numbers -- "6 hours" or "12 hours" -- without reference to the actual business need. Roberto's need is specific: data from the 10pm export should be available by 8am. That is a 10-hour window. A 6-hour threshold is too tight (the export might not finish until 11pm). A 24-hour threshold is too loose (Roberto would not know about a failure until the next morning).
The right threshold comes from understanding who consumes the data and when they need it. Set the freshness policy based on Roberto's actual workflow, not on a round number.
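That reasoning can be made concrete with a little arithmetic. The times below encode Roberto's workflow as described; the dates are arbitrary placeholders:

```python
from datetime import datetime

# Roberto's contract: the ~10pm export must be on the dashboard by 8am.
export_finished = datetime(2024, 1, 1, 22, 0)   # end-of-shift export (assumed 10pm)
morning_check = datetime(2024, 1, 2, 8, 0)      # Roberto checks the dashboard

# The freshness threshold is the gap between those two events.
max_lag_minutes = int((morning_check - export_finished).total_seconds() // 60)
print(max_lag_minutes)  # -> 600
```

In Dagster this 10-hour window would be attached to the asset as, for example, FreshnessPolicy(maximum_lag_minutes=600). The exact class name has shifted between Dagster releases, so treat that spelling as an assumption and verify it against the API docs for your installed version.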
Step 4: Choose a materialisation strategy
Roberto's data arrives predictably -- once per day, end of shift. Two options for materialising the pipeline assets:
Schedule-based: Run at 11pm every night, after the last batch at 10pm. Simple to configure. The downside: if the export is late (say midnight), the scheduled run processes yesterday's data and the new data sits until the next night's run.
Sensor-based: Run when new data lands. More responsive -- the pipeline runs as soon as the export arrives, regardless of when. The downside: more complex to configure. A sensor watches for file changes, which means defining what "new data" looks like and handling edge cases (partial writes, duplicate exports).
For Roberto's operation with predictable daily exports, schedule-based materialisation is the simpler choice. But consider whether the simplicity is worth the risk of processing stale data when exports are late. If Roberto's shift patterns change or a second shift starts exporting data at a different time, the schedule breaks.
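The hard part of the sensor option is deciding what "new data" means. A minimal sketch of that check in plain Python -- the directory layout, the .csv extension, and the seen-set bookkeeping are all assumptions, and a production sensor would also need to guard against partial writes:

```python
import os
import tempfile

def new_exports(export_dir, seen):
    """Return export files not yet processed -- the check a sensor would
    run on each tick. 'seen' stands in for the sensor's cursor state."""
    found = [
        entry.name
        for entry in os.scandir(export_dir)
        if entry.is_file() and entry.name.endswith(".csv") and entry.name not in seen
    ]
    return sorted(found)

# Demo: two exports have landed, one was already processed.
demo = tempfile.mkdtemp()
for name in ("shift_2024-01-01.csv", "shift_2024-01-02.csv"):
    open(os.path.join(demo, name), "w").close()
print(new_exports(demo, seen={"shift_2024-01-01.csv"}))  # -> ['shift_2024-01-02.csv']
```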
Direct Claude to configure the materialisation schedule in Dagster. The asset definitions should include both the freshness policy and the schedule.
Step 5: Explore cost awareness
Inspect DuckDB's execution statistics. How many rows does the most expensive transformation process? What is the total data volume the pipeline handles? DuckDB does not keep a persistent query-history table, but it can profile any individual statement -- prefix it with EXPLAIN ANALYZE, or enable profiling for the session:
PRAGMA enable_profiling = 'query_tree';
Then run the pipeline's heaviest transformation queries and read the per-operator timings and row counts that DuckDB prints.
With DuckDB running locally, these queries cost nothing. But the same pipeline running on BigQuery, Snowflake, or Redshift would cost real money -- charged per byte scanned or per compute second. A window function that scans the full batch history to find the "latest" row, when the pipeline only needs the last 30 days, would process 14 months of data unnecessarily.
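That window-function scan can be made concrete. Both queries below return the latest row per batch; the table and column names (batch_exports, batch_id, exported_at) are hypothetical, but the cost difference is the point -- on a bytes-scanned billing model the second query reads 30 days of history instead of all 14 months:

```sql
-- Unbounded: ranks every row ever exported
SELECT * FROM (
  SELECT *, row_number() OVER (PARTITION BY batch_id ORDER BY exported_at DESC) AS rn
  FROM batch_exports
) WHERE rn = 1;

-- Bounded: same "latest row" result for recent batches, far fewer bytes scanned
SELECT * FROM (
  SELECT *, row_number() OVER (PARTITION BY batch_id ORDER BY exported_at DESC) AS rn
  FROM batch_exports
  WHERE exported_at >= current_date - INTERVAL 30 DAY
) WHERE rn = 1;
```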
Cost awareness is a design concern, not a finance concern. Note in your CLAUDE.md that if this pipeline migrates to a cloud warehouse, the window function scans and full-table aggregations should be reviewed for cost optimization. This is context for future work, not an action item for now.
✓ Check: Push a commit that breaks a dbt test (e.g., remove a NOT NULL constraint). Does the CI pipeline catch it and block the PR?