Soda Core Configuration Guide
Soda Core is a data quality monitoring tool. It runs checks against your data and reports whether the results fall within the thresholds you define. It complements dbt tests: dbt tests verify row-level properties, while Soda Core checks verify batch-level patterns.
Installation
pip install soda-core-duckdb
Configuration
Create a soda_config.yml file in your project root:
data_source my_duckdb:
  type: duckdb
  # Path to your DuckDB database file
  path: dev.duckdb
Writing checks
Create a soda_checks.yml file. Each check targets a specific table and defines what to verify.
Row count anomaly detection
Check that the table is not empty and that the row count stays above fixed warning and failure thresholds:
checks for stg_batch_data:
  # Verify row count is within expected range
  - row_count > 0
  - row_count:
      warn: when < 600
      fail: when < 400
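The thresholds above are fixed numbers, so on their own they do not compare today's batch with yesterday's. A day-over-day comparison could be sketched in plain Python, outside Soda Core, roughly as follows. The function name `row_count_status`, the example counts, and the 20% and 40% tolerances are all hypothetical and chosen only for illustration.

```python
def row_count_status(today: int, previous: int,
                     warn_drop_pct: float = 20.0,
                     fail_drop_pct: float = 40.0) -> str:
    """Classify today's row count by its percentage drop from the previous day.

    The tolerances are illustrative assumptions, not Soda Core defaults.
    """
    if previous <= 0:
        # No usable baseline: only an empty batch can be flagged.
        return "fail" if today == 0 else "pass"
    drop_pct = 100 * (previous - today) / previous
    if drop_pct >= fail_drop_pct:
        return "fail"
    if drop_pct >= warn_drop_pct:
        return "warn"
    return "pass"


print(row_count_status(today=950, previous=1000))  # 5% drop
print(row_count_status(today=700, previous=1000))  # 30% drop
print(row_count_status(today=500, previous=1000))  # 50% drop
```

A comparison like this needs the previous day's count stored somewhere, which is why the fixed thresholds in the check file are the simpler starting point.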
Column value ranges
Verify that column values fall within expected ranges:
checks for stg_batch_data:
  # Temperature should be in Celsius after staging conversion
  - min(temperature) >= 50
  - max(temperature) <= 120
  # Humidity should be 0-100
  - max(humidity) <= 100
  - min(humidity) >= 0
Trend checks (batch-level patterns)
Check that aggregate values stay within their expected range. These checks catch the batch-level anomalies that row-level tests miss:
checks for fct_daily_quality:
  # Average color match score should be within expected range
  # These bounds come from domain knowledge, not data profiling
  - avg(normalized_score):
      warn: when not between 70 and 98
      fail: when < 60
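The alert levels above divide the average score into zones. The sketch below mirrors that zoning in plain Python to make the boundaries explicit. It is not Soda Core's implementation. The function name `score_status` is hypothetical, and the sketch assumes that a fail condition takes precedence over a warn condition when both match.

```python
def score_status(avg_score: float) -> str:
    """Map an average normalized score to a status using the bounds above."""
    # Assumption: fail is evaluated first, so a value below 60 is a failure
    # even though it is also below the warn bound of 70.
    if avg_score < 60:
        return "fail"
    # Warn on either side of the expected 70-98 band.
    if avg_score < 70 or avg_score > 98:
        return "warn"
    return "pass"


for score in (55, 65, 85, 99):
    print(score, score_status(score))
```

The upper bound has only a warn level, so an unusually high average is flagged for review but never fails the scan.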
Null rate checks
Monitor for unexpected nulls:
checks for stg_batch_data:
  # Humidity can have sensor gaps but should be < 5% null
  - missing_percent(humidity) < 5
  # These columns should never be null
  - missing_count(batch_id) = 0
  - missing_count(line_number) = 0
  - missing_count(operator_id) = 0
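To make the 5% figure concrete, here is a small Python sketch of how a null rate can be computed for one column. It only illustrates the arithmetic. The helper `missing_percent` below is a hypothetical stand-in, not the Soda Core metric itself, and the humidity readings are made up, with SQL NULL represented as `None`.

```python
from typing import Optional, Sequence


def missing_percent(values: Sequence[Optional[float]]) -> float:
    """Return the percentage of None entries; an empty input counts as 0%."""
    if not values:
        return 0.0
    missing = sum(1 for value in values if value is None)
    return 100 * missing / len(values)


# Hypothetical sample: 40 readings, one of them lost to a sensor gap.
humidity = [41.0, None, 39.5, 40.2] + [40.0] * 36

rate = missing_percent(humidity)
print(rate, "ok" if rate < 5 else "too many nulls")
```

With one gap in 40 readings the rate is 2.5%, which stays under the 5% limit. A third gap would bring it to 7.5% and breach it.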
Running checks
Run all checks:
soda scan -d my_duckdb -c soda_config.yml soda_checks.yml
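If the scan is launched from a scheduler or CI job, the same command can be wrapped in a small Python helper. The sketch below only assembles and runs the command shown above. The function names `build_scan_command` and `run_scan` are hypothetical, and it assumes the soda executable is on the PATH and that a non-zero exit code signals a problem. Check the Soda documentation for the exact meaning of each exit code.

```python
import subprocess
from typing import List


def build_scan_command(data_source: str, config_path: str,
                       checks_path: str) -> List[str]:
    """Build the argument list for the `soda scan` command used in this guide."""
    return ["soda", "scan", "-d", data_source, "-c", config_path, checks_path]


def run_scan(data_source: str, config_path: str, checks_path: str) -> int:
    """Run the scan and return its exit code (requires the soda CLI)."""
    command = build_scan_command(data_source, config_path, checks_path)
    completed = subprocess.run(command)
    return completed.returncode


# Print the command without running it, so this sketch has no side effects.
print(" ".join(build_scan_command("my_duckdb", "soda_config.yml", "soda_checks.yml")))
```

A pipeline step could call `run_scan(...)` and stop downstream tasks when the returned code is non-zero.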
Interpreting results
Soda Core reports three statuses:
- PASS: The measured value satisfies the check
- WARN: The measured value meets the warn condition (only when a warn level is defined)
- FAIL: The measured value violates the check or meets the fail condition
The output shows the measured value alongside your threshold, so you can see not only whether the check passed but also how close the value is to the boundary. dbt tests, by contrast, report only a status and a count of failing rows, not an aggregate metric.
Key difference from dbt tests
dbt tests answer: "Is this specific row correct?" (unique, not_null, accepted_values, custom SQL assertions)
Soda Core checks answer: "Is this batch of data normal?" (row counts within historical range, averages within expected bounds, null rates below thresholds)
A pipeline where all dbt tests pass can still contain an anomalous batch that only Soda Core would catch. For example, if an upstream source silently stops sending records for one of three equally sized production lines, each row that does arrive passes all dbt tests, but the batch has about 33% fewer rows than normal.
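The scenario above can be reproduced with a few lines of Python. This is a toy illustration, not dbt or Soda Core code. It assumes three production lines that each contribute 100 rows on a normal day, and the helpers `make_rows` and `row_is_valid` are hypothetical.

```python
from typing import Dict, List


def make_rows(lines: List[int], rows_per_line: int = 100) -> List[Dict]:
    """Generate well-formed rows for the given production lines."""
    return [
        {"batch_id": f"B{line}-{i}", "line_number": line, "operator_id": "op1"}
        for line in lines
        for i in range(rows_per_line)
    ]


def row_is_valid(row: Dict) -> bool:
    """Row-level rule, similar in spirit to a set of not_null tests."""
    return all(row[col] is not None
               for col in ("batch_id", "line_number", "operator_id"))


normal = make_rows([1, 2, 3])
today = make_rows([1, 2])  # line 3 silently stopped sending records

# Every row that arrived is valid, yet the batch is a third smaller.
all_rows_valid = all(row_is_valid(row) for row in today)
drop_pct = 100 * (len(normal) - len(today)) / len(normal)
print(all_rows_valid, round(drop_pct, 1))
```

The row-level rule has nothing to object to, because the missing rows are simply absent. Only a batch-level metric such as the row count reveals the gap.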