Soda Core Configuration Guide

Soda Core is a data quality monitoring tool. It runs checks against your data and reports whether the results fall within the thresholds you define. It complements dbt tests: dbt tests verify row-level properties, while Soda Core checks verify batch-level patterns.

Installation

pip install soda-core-duckdb

Configuration

Create a soda_config.yml file in your project root:

data_source my_duckdb:
  type: duckdb
  # Path to your DuckDB database file
  path: dev.duckdb

Writing checks

Create a soda_checks.yml file. Each check targets a specific table and defines what to verify.

Row count anomaly detection

Check that the table is not empty and that its row count stays above fixed warning and failure thresholds:

checks for stg_batch_data:
  # Verify row count is within expected range
  - row_count > 0
  - row_count:
      warn: when < 600
      fail: when < 400
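
To make the warn and fail levels above concrete, here is a minimal Python sketch of how such an alert configuration reads. It is an illustration of the semantics only, not Soda Core's implementation; the function name `evaluate_row_count` is hypothetical, and testing the fail condition first is an assumption about precedence when both conditions are met.

```python
def evaluate_row_count(row_count: int, warn_below: int = 600, fail_below: int = 400) -> str:
    """Return 'fail', 'warn', or 'pass' for a lower-bound alert configuration."""
    # Assumption: the more severe status wins when both conditions are met,
    # so the fail condition is tested before the warn condition.
    if row_count < fail_below:
        return "fail"
    if row_count < warn_below:
        return "warn"
    return "pass"


# Hypothetical row counts, chosen to land in each zone.
for count in (350, 500, 900):
    print(count, evaluate_row_count(count))
```

Both comparisons are strict, so a count of exactly 400 lands in the warn zone and a count of exactly 600 passes.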

Column value ranges

Verify that column values fall within expected ranges:

checks for stg_batch_data:
  # Temperature should be in Celsius after staging conversion
  - min(temperature) >= 50
  - max(temperature) <= 120

  # Humidity should be 0-100
  - max(humidity) <= 100
  - min(humidity) >= 0

Trend checks (batch-level patterns)

Check that aggregate values stay within expected bounds. These checks catch batch-level anomalies that row-level tests miss:

checks for fct_daily_quality:
  # Average color match score should be within expected range
  # These bounds come from domain knowledge, not data profiling
  - avg(normalized_score):
      # A YAML mapping cannot repeat the warn key, so both warn bounds go in one condition
      warn: when not between 70 and 98
      fail: when < 60

Null rate checks

Monitor for unexpected nulls:

checks for stg_batch_data:
  # Humidity can have sensor gaps but should be < 5% null
  - missing_percent(humidity) < 5

  # These columns should never be null
  - missing_count(batch_id) = 0
  - missing_count(line_number) = 0
  - missing_count(operator_id) = 0
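
The two metrics used above differ only in scale: one is an absolute count, the other a share of all rows. The Python sketch below shows the arithmetic on a hypothetical list of humidity readings. It assumes a missing value is a NULL (here `None`); the function names mirror the metric names for readability and are not part of any Soda API.

```python
from typing import Optional, Sequence


def missing_count(values: Sequence[Optional[float]]) -> int:
    """Count entries that are missing (None stands in for SQL NULL)."""
    return sum(1 for value in values if value is None)


def missing_percent(values: Sequence[Optional[float]]) -> float:
    """Share of missing entries as a percentage of all entries."""
    if not values:
        # Assumption for this sketch: an empty column has no missing values.
        return 0.0
    return 100.0 * missing_count(values) / len(values)


# Hypothetical sample: 40 readings, one of which is a sensor gap.
humidity = [41.0, None, 39.5, 40.2] + [40.0] * 36

print(missing_count(humidity))        # number of gaps
print(missing_percent(humidity) < 5)  # the condition from the humidity check
```

A percentage threshold tolerates a few gaps in a large batch, while `missing_count(...) = 0` tolerates none, which is why the two styles are applied to different columns.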

Running checks

Run all checks:

soda scan -d my_duckdb -c soda_config.yml soda_checks.yml
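
When the scan runs from a scheduler or CI job, it can help to wrap the command in a small script. The Python sketch below assumes the `soda` executable is on the PATH and that the exit codes follow the convention 0 = all checks passed, 1 = warnings, 2 = failures, 3 = runtime error; confirm this against the documentation for your installed version. The helper names `build_scan_command`, `interpret_exit_code`, and `run_scan` are hypothetical.

```python
import subprocess
from typing import List

# Assumed exit-code convention for `soda scan`; verify for your version.
EXIT_STATUS = {0: "pass", 1: "warn", 2: "fail", 3: "error"}


def build_scan_command(data_source: str, config_path: str, checks_path: str) -> List[str]:
    """Assemble the same command line shown in this guide."""
    return ["soda", "scan", "-d", data_source, "-c", config_path, checks_path]


def interpret_exit_code(code: int) -> str:
    """Map a process exit code to a status label."""
    return EXIT_STATUS.get(code, "unknown")


def run_scan(data_source: str, config_path: str, checks_path: str) -> str:
    """Run the scan and return a status label instead of raising on failure."""
    command = build_scan_command(data_source, config_path, checks_path)
    completed = subprocess.run(command, check=False)
    return interpret_exit_code(completed.returncode)


# Example call (requires soda-core-duckdb to be installed):
# status = run_scan("my_duckdb", "soda_config.yml", "soda_checks.yml")
```

Returning a label rather than raising lets the caller decide whether a warning should block the pipeline.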

Interpreting results

Soda Core reports three statuses:

  • PASS: The measured value satisfies the check, and no warn or fail condition is met
  • WARN: The measured value meets the warn condition (only if a warn level is defined)
  • FAIL: The measured value violates the check's threshold or meets the fail condition

The output shows the actual measured value alongside your threshold, so you can see not just whether the check passed but how close the value is to the boundary. dbt tests, by contrast, report a status and a count of failing rows rather than a measured aggregate.

Key difference from dbt tests

dbt tests answer: "Is this specific row correct?" (unique, not_null, accepted_values, custom SQL assertions)

Soda Core checks answer: "Is this batch of data normal?" (row counts within historical range, averages within expected bounds, null rates below thresholds)

A pipeline where all dbt tests pass can still have an anomalous batch that only Soda Core would catch. For example, if an upstream source silently stops sending records for one production line (say, one of three), each row that does arrive passes all dbt tests, but the batch has roughly 33% fewer rows than normal.
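
The scenario above can be simulated in a few lines of Python. This is a sketch under stated assumptions, not a reproduction of either tool: the data is hypothetical (three production lines, 250 rows each), `row_level_tests_pass` stands in for dbt's not_null and unique tests, and `row_count_status` reuses the warn and fail thresholds from the row count check in this guide.

```python
from typing import Dict, List

REQUIRED_COLUMNS = ("batch_id", "line_number", "operator_id")


def make_rows(lines: List[int], rows_per_line: int) -> List[Dict[str, object]]:
    """Build hypothetical staging rows for the given production lines."""
    return [
        {"batch_id": f"L{line}-{i}", "line_number": line, "operator_id": f"op-{line}"}
        for line in lines
        for i in range(rows_per_line)
    ]


def row_level_tests_pass(rows: List[Dict[str, object]]) -> bool:
    """Stand-in for not_null and unique tests: only rows that exist are examined."""
    not_null = all(row[col] is not None for row in rows for col in REQUIRED_COLUMNS)
    unique_ids = len({row["batch_id"] for row in rows}) == len(rows)
    return not_null and unique_ids


def row_count_status(row_count: int, warn_below: int = 600, fail_below: int = 400) -> str:
    """Stand-in for the batch-level row count check."""
    if row_count < fail_below:
        return "fail"
    if row_count < warn_below:
        return "warn"
    return "pass"


normal = make_rows([1, 2, 3], rows_per_line=250)  # 750 rows
degraded = make_rows([1, 2], rows_per_line=250)   # line 3 went silent: 500 rows

for name, rows in (("normal", normal), ("degraded", degraded)):
    print(name, len(rows), row_level_tests_pass(rows), row_count_status(len(rows)))
```

The row-level function has nothing to say about rows that never arrived, so only the batch-level count registers the missing line.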