Join us at New York University for the AI Pitch Competition · April 2, 2026 · Apply Now ✨

dbt + Airflow + Snowflake: The Trident of the Modern Data Platform and How to Use It Well

Three tools dominate the modern data stack. Understanding what each one does — and where they overlap and conflict — is the difference between a maintainable platform and an accidental mess.

9 min read · January 29, 2025 · Data Engineers, Data Platform Leads, Analytics Engineers

Why These Three Tools Win Together

The combination of dbt, Apache Airflow, and Snowflake (or BigQuery/Redshift as substitutes) has become the de facto standard for modern analytical data platforms because each tool is genuinely excellent at its specific job, and the three jobs are complementary rather than overlapping.

Snowflake is the storage and query engine: a cloud-native warehouse that scales compute independently from storage, supports semi-structured data natively, and delivers fast SQL performance at scale. Airflow is the orchestration layer: a Python-based scheduler that defines, executes, and monitors complex directed acyclic graphs of tasks. dbt is the transformation layer: a SQL-based tool that turns raw warehouse data into clean, tested, documented analytical models.

What Airflow Does (and Doesn't Do)

Airflow's job is orchestration — defining what runs when, in what order, and handling failures, retries, and alerts. A well-designed Airflow DAG for a data pipeline defines: extract tasks that pull data from source systems via API or database replication, load tasks that write raw data to the warehouse staging layer, and a dbt run task that triggers the transformation models.
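A minimal sketch of that extract → load → transform pattern, assuming the Airflow 2.x operator API; DAG name, task names, and the dbt project path are illustrative:

```python
# Illustrative Airflow DAG: orchestration only — no transformation logic here.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def extract_orders(**context):
    # Pull raw data from the source system (API call, CDC batch, etc.).
    ...


def load_orders(**context):
    # Write the raw extract to the warehouse staging schema, untransformed.
    ...


with DAG(
    dag_id="orders_pipeline",          # hypothetical pipeline name
    start_date=datetime(2025, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_orders", python_callable=extract_orders)
    load = PythonOperator(task_id="load_orders", python_callable=load_orders)
    # Airflow only *triggers* dbt; all transformation logic lives in the dbt project.
    dbt_run = BashOperator(task_id="dbt_run", bash_command="dbt run --project-dir /opt/dbt")

    extract >> load >> dbt_run
```

The DAG stays readable precisely because each task is thin: the extract and load callables move bytes, and the transformation is a single shell-out to dbt.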

What Airflow does not do well is transformation. A common anti-pattern is embedding heavy data processing logic in Airflow operators — PySpark jobs triggered from Airflow, or Pandas DataFrames manipulated in Python tasks. This pattern is fragile: it's hard to test, difficult to debug, and puts transformation logic in a place where data engineers expect to find orchestration logic. The right place for transformation logic is dbt, not Airflow.
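For contrast, here is where that same join-and-aggregate logic belongs — a hypothetical dbt model (names like `stg_orders` and `int_orders_enriched` are illustrative), which dbt can test, document, and rebuild independently of the scheduler:

```sql
-- models/intermediate/int_orders_enriched.sql (hypothetical)
-- The enrichment logic that might otherwise hide inside a Python task.
select
    o.order_id,
    o.customer_id,
    o.ordered_at::date as order_date,
    sum(i.quantity * i.unit_price) as order_total
from {{ ref('stg_orders') }} as o
join {{ ref('stg_order_items') }} as i
    on o.order_id = i.order_id
group by 1, 2, 3
```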

Designing a dbt Project That Stays Maintainable

dbt projects accumulate complexity quickly as the number of models grows. A project with two hundred models and no structure is harder to maintain than a project with fifty well-organised models. The standard architecture for maintainable dbt projects follows three layers.

The staging layer contains one model per source table — thin transformations that rename columns, cast data types, and apply basic data cleanliness rules. The intermediate layer contains business logic: joins, aggregations, and complex calculations that are used by multiple downstream models. The mart layer contains the final analytical models that analysts and BI tools query — one model per business concept, fully documented, with dbt tests on primary keys and referential integrity.
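The three layers map directly onto the project's directory structure; a sketch with illustrative model names:

```
models/
├── staging/          # one thin model per source table: rename, cast, clean
│   ├── stg_orders.sql
│   └── stg_customers.sql
├── intermediate/     # shared business logic: joins, aggregations
│   └── int_orders_enriched.sql
└── marts/            # final analytical models: documented, tested
    ├── fct_orders.sql
    └── dim_customers.sql
```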

Snowflake Cost Management: The Part Nobody Mentions

Snowflake's elastic compute model is powerful, but it can produce surprise billing if not managed carefully. The most common cost driver is virtual warehouse sizing: teams default to large warehouses for all workloads because they run queries faster, without accounting for the fact that they're billed per second for the compute they consume.

The right approach is to match warehouse size to the workload: dbt transformation runs on medium warehouses, interactive analyst queries on small warehouses with auto-suspend set to sixty seconds, and large ad-hoc analytical queries on a dedicated large warehouse that suspends automatically when idle. Implementing this isolation — separate virtual warehouses per workload type — typically reduces Snowflake costs by 30–50% without meaningfully impacting query performance.
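That isolation is a few DDL statements in Snowflake; warehouse names and sizes below are illustrative, and the right sizes depend on your actual workloads:

```sql
-- One warehouse per workload type, each sized for its job.
create warehouse transform_wh
  with warehouse_size = 'MEDIUM'
  auto_suspend = 60        -- seconds idle before compute stops billing
  auto_resume = true;

create warehouse analyst_wh
  with warehouse_size = 'SMALL'
  auto_suspend = 60
  auto_resume = true;

create warehouse adhoc_wh
  with warehouse_size = 'LARGE'
  auto_suspend = 60
  auto_resume = true;
```

A side benefit of per-workload warehouses is cost attribution: each workload's spend shows up separately in Snowflake's usage views.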

Testing: The Part Most Teams Skip

dbt includes a built-in testing framework that most teams underuse. The minimum viable test coverage for a production dbt project includes uniqueness and not-null tests on every primary key (preventing duplicate rows from silently corrupting downstream metrics), referential integrity tests on foreign keys (ensuring joins don't silently drop rows), and accepted-value tests on categorical columns (catching upstream data quality issues before they reach the mart layer).
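That minimum coverage is declared in a model's YAML file; a sketch using dbt's built-in generic tests, with hypothetical model and column names:

```yaml
# models/marts/schema.yml (illustrative)
version: 2
models:
  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - unique        # no duplicate rows corrupting downstream metrics
          - not_null
      - name: customer_id
        tests:
          - relationships:          # referential integrity: joins won't drop rows
              to: ref('dim_customers')
              field: customer_id
      - name: status
        tests:
          - accepted_values:        # catch unexpected categories upstream
              values: ['placed', 'shipped', 'completed', 'returned']
```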

Beyond dbt's built-in tests, tools like Great Expectations or dbt's custom test framework allow distribution tests — checks that a column's values fall within an expected range, or that a daily row count doesn't drop below a threshold. These catch data pipeline failures that structural tests miss: a source system that started sending zeros instead of nulls, or a timezone-handling bug that doubled events on daylight-saving transition days.
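A row-count floor can be expressed as a dbt singular test, which fails whenever the query returns rows; the model name and threshold below are assumed for illustration:

```sql
-- tests/assert_daily_order_volume.sql (hypothetical singular test)
-- Fails if yesterday's order count drops below an expected floor,
-- catching silent pipeline failures that structural tests miss.
select
    order_date,
    count(*) as row_count
from {{ ref('fct_orders') }}
where order_date = dateadd('day', -1, current_date)
group by order_date
having count(*) < 1000   -- threshold is an assumed business baseline
```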