pipeline-spec-template.md

Pipeline Specification

Project overview

[Describe the project: who is the client, what do they need, what does the pipeline produce]

Sources

| Source name | Format | Location | Refresh strategy | Watermark column | Notes |
|---|---|---|---|---|---|
| [Source 1] | [CSV/JSON/API] | [File path or endpoint] | [Full / Incremental] | [Column name or N/A] | [Any special considerations] |
| [Source 2] | | | | | |

Target schema

Staging layer

[Define the staging models. What does each one do? Source-conform only -- no business logic.]

| Model name | Source | Key fields | Materialization |
|---|---|---|---|
| [stg_...] | [Source name] | [List key fields] | [table / incremental] |

Intermediate layer

[Define intermediate models that combine or reshape staging data.]

| Model name | Sources | Purpose | Materialization |
|---|---|---|---|
| [int_...] | [Which staging models] | [What this model does] | [table / incremental] |

Mart layer

[Define the mart models that serve business users.]

| Model name | Purpose | Grain | Key metrics / columns |
|---|---|---|---|
| [fct_...] | [What business question it answers] | [One row per...] | [Key columns] |
| [dim_...] | [What entity it tracks] | [One row per...] | [Key columns] |

Extraction pattern

Full vs incremental decision

[For each source, decide: full refresh or incremental? Document your reasoning.]

| Source | Strategy | Rationale |
|---|---|---|
| [Source 1] | [Full / Incremental] | [Why this strategy for this source] |
| [Source 2] | | |

Watermark design

[For incremental sources: which column is the watermark? How trustworthy is it? What are the edge cases?]
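
To make the watermark questions concrete, here is a minimal sketch of watermark-based extraction in plain Python. The function name `extract_incremental`, the `updated_at` column, and the `lookback` parameter are illustrative assumptions, not part of this template or of any particular tool. It shows two edge cases worth documenting: rows whose watermark is NULL, and late-arriving rows that only an overlap window would catch.

```python
from datetime import datetime, timedelta


def extract_incremental(rows, last_watermark, watermark_column="updated_at",
                        lookback=timedelta(0)):
    """Return (changed_rows, new_watermark) for one incremental run.

    rows: iterable of dicts (hypothetical source records).
    lookback: optional overlap window that re-reads recent rows to catch
    late arrivals; it relies on the MERGE step being idempotent.
    """
    cutoff = last_watermark - lookback
    # Rows with a NULL watermark never match an incremental filter;
    # they are skipped here, which is exactly the risk to document.
    changed = [
        r for r in rows
        if r[watermark_column] is not None and r[watermark_column] > cutoff
    ]
    # Never move the watermark backwards, even if the overlap window
    # only returned rows older than the previous watermark.
    newest = max((r[watermark_column] for r in changed), default=last_watermark)
    return changed, max(newest, last_watermark)
```

With `lookback` left at zero, a row committed late with a timestamp older than the stored watermark is missed permanently; a non-zero `lookback` trades some re-processing for that safety.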

MERGE key design

[Define the natural key for MERGE (upsert) operations. What uniquely identifies a record?]

| Model | MERGE key columns | Rationale |
|---|---|---|
| [stg_...] | [Column list] | [Why these columns uniquely identify a record] |
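
As a rough illustration of what the MERGE key controls, the sketch below performs an in-memory upsert keyed on a list of columns. It is not how a warehouse implements MERGE; `merge_rows` and the column names in the usage example are hypothetical.

```python
def merge_rows(target, incoming, key_columns):
    """Upsert incoming rows into target, matching on key_columns.

    Rows whose key already exists replace the existing row (update);
    rows with a new key are appended (insert).
    """
    # If the key is not truly unique in target, rows silently collapse
    # here, which is why the rationale column matters.
    merged = {tuple(row[c] for c in key_columns): row for row in target}
    for row in incoming:
        key = tuple(row[c] for c in key_columns)
        if any(part is None for part in key):
            # A NULL in the key cannot be matched reliably; fail loudly.
            raise ValueError(f"NULL in MERGE key {key_columns}: {row}")
        merged[key] = row
    return list(merged.values())


# Usage with a hypothetical composite key (order_id, line_no):
existing = [{"order_id": 1, "line_no": 1, "qty": 2}]
batch = [
    {"order_id": 1, "line_no": 1, "qty": 5},  # same key: update
    {"order_id": 1, "line_no": 2, "qty": 1},  # new key: insert
]
result = merge_rows(existing, batch, ["order_id", "line_no"])
```

Running the same batch twice leaves `result` unchanged, which is the property that makes re-runs and overlap windows safe.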

Quality checks

dbt tests

[List the dbt tests: unique, not_null, accepted_values, custom business logic tests]
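
To clarify what the generic tests named above assert, here is a plain-Python sketch of the same conditions. It is not dbt code and the function bodies are assumptions about intent, not dbt's implementation. Each function returns the offending rows or values, on the same principle that a dbt test passes when it finds no failures.

```python
from collections import Counter


def not_null(rows, column):
    """Rows where the column is missing a value."""
    return [r for r in rows if r[column] is None]


def unique(rows, column):
    """Values that appear more than once (NULLs are left to not_null)."""
    counts = Counter(r[column] for r in rows if r[column] is not None)
    return [value for value, n in counts.items() if n > 1]


def accepted_values(rows, column, values):
    """Rows whose non-NULL value falls outside the allowed set."""
    return [r for r in rows
            if r[column] is not None and r[column] not in values]


# Hypothetical sample data for illustration only.
sample = [
    {"id": 1, "status": "open"},
    {"id": 1, "status": "closed"},
    {"id": None, "status": "bogus"},
]
failures = {
    "not_null(id)": not_null(sample, "id"),
    "unique(id)": unique(sample, "id"),
    "accepted_values(status)": accepted_values(sample, "status", {"open", "closed"}),
}
```

A custom business logic test follows the same shape: write the condition that should never be true and list the rows that satisfy it.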

Soda Core checks

[List the Soda Core trend checks: row count ranges, statistical bounds, freshness]
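
The kinds of trend checks listed above can be sketched in plain Python to help choose thresholds. This is not Soda Core syntax or its API; the function names, the mean ± k·stdev rule, and the default `k=3.0` are assumptions for illustration.

```python
from datetime import datetime, timedelta
from statistics import mean, stdev


def row_count_in_bounds(history, current, k=3.0):
    """True if current lies within mean ± k * stdev of past row counts."""
    if len(history) < 2:
        return True  # not enough history to form a bound; do not alert
    mu, sigma = mean(history), stdev(history)
    return abs(current - mu) <= k * sigma


def is_fresh(latest_timestamp, now, max_age):
    """True if the newest record is no older than max_age."""
    return now - latest_timestamp <= max_age
```

For example, with a history of `[1000, 1020, 980, 1010, 990]` a run that loads 1030 rows stays inside the 3-sigma band, while a run that loads 400 rows does not.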

Monitoring

Watermark progression

[How will you monitor that the watermark advances on each run? What's the alert threshold?]
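
One possible shape for this check, sketched in plain Python: compare the watermark before and after a run, and compare its lag against a threshold. The function name `check_watermark_progress` and the `max_lag` parameter are hypothetical; the right threshold depends on how often the source is expected to change.

```python
from datetime import datetime, timedelta


def check_watermark_progress(previous, current, now, max_lag):
    """Return a list of alert messages; an empty list means healthy."""
    alerts = []
    if current < previous:
        alerts.append("watermark moved backwards")
    elif current == previous:
        # Can be legitimate when the source had no changes, so the
        # lag check below decides whether it is actually a problem.
        alerts.append("watermark did not advance")
    if now - current > max_lag:
        alerts.append(f"watermark lag {now - current} exceeds {max_lag}")
    return alerts
```

Treating "did not advance" as a warning and the lag breach as the paging alert is one reasonable split, since a quiet source and a broken extraction look identical for a single run.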

Schedule

[How often does the pipeline run? What triggers it? Schedule-based or sensor-based?]