The Original Problem ETL Was Solving
Extract-Transform-Load emerged in an era of expensive storage and limited compute. Data warehouses — physical appliances running Oracle or Teradata — charged by the row and by the query. Storing raw, unprocessed data was wasteful; warehouses were designed to hold only the pre-shaped data that analysts would actually query. The transformation step happened before load, in a separate compute environment, precisely because you couldn't afford to load and transform in the warehouse.
This architectural constraint produced ETL pipelines: complex, fragile, expensive-to-maintain systems that did heavy transformation in Java or PL/SQL before touching the warehouse. When a business requirement changed — a new dimension needed, a calculation revised — the ETL code had to be updated, redeployed, and the historical data often had to be reprocessed.
Cloud Warehouses Changed the Economics
The arrival of Snowflake, BigQuery, and Redshift changed the underlying economics. Snowflake and BigQuery separate storage from compute (Redshift later followed with its RA3 managed-storage nodes), with storage priced at near-commodity rates and compute scaling elastically. Loading raw data became cheap. Running SQL transformations inside the warehouse became fast. The architectural constraint that necessitated ETL disappeared.
With ELT — Extract, Load, Transform — raw data is loaded into the warehouse first, in its original form. Transformations happen inside the warehouse using SQL, executed on demand. The implications are significant: raw data is preserved and can be re-transformed when requirements change; transformations are written in SQL rather than procedural code, making them accessible to a broader set of practitioners; and the transformation layer can be version-controlled, tested, and documented like application code.
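The pattern can be sketched in two steps of warehouse SQL. The table and column names here are illustrative, not from any particular system; the load command is shown as a Snowflake-style COPY for concreteness:

```sql
-- Load step: raw data lands in the warehouse untouched.
-- e.g. (Snowflake-style bulk load; illustrative stage name)
-- COPY INTO raw.orders FROM @landing_stage;

-- Transform step: plain SQL, executed inside the warehouse on demand.
CREATE OR REPLACE VIEW analytics.daily_revenue AS
SELECT
    CAST(order_ts AS DATE) AS order_date,
    SUM(amount)            AS revenue
FROM raw.orders
WHERE status = 'completed'
GROUP BY CAST(order_ts AS DATE);
```

Because raw.orders is preserved in its original form, revising the revenue definition (say, excluding refunds) means re-running a view over data you already have, not redeploying and reprocessing an external ETL job.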
dbt as the Transformation Layer
dbt (data build tool) formalised the ELT pattern and gave it software engineering discipline. A dbt model is a SQL SELECT statement that defines a transformation — it selects from raw or intermediate tables and produces a new table or view. dbt handles the execution order, manages dependencies between models, and runs tests against the output.
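A minimal dbt model, to make this concrete. The model and staging names are hypothetical; the `ref()` macro and `config()` block are dbt's real mechanisms — dbt reads the `ref()` calls to build the dependency graph and determine execution order:

```sql
-- models/marts/daily_revenue.sql (hypothetical model file)
{{ config(materialized='table') }}

SELECT
    order_date,
    SUM(amount) AS revenue
FROM {{ ref('stg_orders') }}  -- an assumed upstream staging model
WHERE status = 'completed'
GROUP BY order_date
```

Running `dbt run` compiles this to plain SQL (the `ref()` resolved to the staging table's actual name, the SELECT wrapped in a CREATE TABLE AS) and executes it in the warehouse after its upstream models.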
What dbt added to SQL transformations was the engineering discipline analytics teams had been missing: version control through Git, automated testing of data outputs, documentation generated from the project's own model descriptions, and lineage graphs that show exactly which raw tables a given metric depends on. For the first time, data transformations could be reviewed in pull requests, rolled back like application code, and maintained by a team rather than a single person with tribal knowledge.
Where ELT Has Limits
The ELT pattern is not universally superior. It works well for analytical workloads where the destination is a data warehouse and query latency is acceptable. It is less suited for operational pipelines where data needs to flow in near-real-time into transactional systems, where intermediate state needs to be materialised for performance reasons, or where the transformation involves complex logic that SQL expresses poorly.
For streaming use cases — event-driven architectures where you need to aggregate, join, and route events as they arrive — tools like Apache Kafka and Flink are better fits than warehouse-centric ELT. The modern data platform often combines both paradigms: ELT for the analytical layer (dbt on Snowflake), stream processing for the operational layer (Kafka with Flink or Spark Streaming).
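The contrast shows up even when the streaming layer is expressed in SQL. A sketch using Flink SQL's windowing functions — the topic, fields, and connector properties are illustrative assumptions, not a definitive configuration:

```sql
-- Flink SQL sketch: aggregate events as they arrive, not after they
-- land in a warehouse. Table, topic, and field names are hypothetical.
CREATE TABLE page_views (
    user_id STRING,
    url     STRING,
    ts      TIMESTAMP(3),
    WATERMARK FOR ts AS ts - INTERVAL '5' SECOND  -- tolerate late events
) WITH (
    'connector' = 'kafka',
    'topic'     = 'page-views',
    'properties.bootstrap.servers' = 'localhost:9092',
    'format'    = 'json'
);

-- One-minute tumbling windows, emitted continuously as the stream flows.
SELECT
    window_start,
    url,
    COUNT(*) AS views
FROM TABLE(TUMBLE(TABLE page_views, DESCRIPTOR(ts), INTERVAL '1' MINUTE))
GROUP BY window_start, window_end, url;
```

The query runs indefinitely and maintains per-window state inside the stream processor — the opposite of a dbt model, which is a batch SELECT re-executed on a schedule.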
Choosing the Right Architecture for Your Scale
For teams under fifty engineers and moderate data volumes (sub-terabyte daily), a pure ELT stack — a cloud warehouse with dbt transformations and Airbyte or Fivetran for ingestion — is almost always the right starting point. It is simpler to operate, faster to build on, and easier to hire for than a complex streaming architecture.
The signal that you need to extend beyond pure ELT is when business requirements demand sub-hour data freshness, when you need to act on events in real time (fraud detection, personalisation, alerting), or when your transformation logic requires stateful computation that SQL cannot express. At that point, introducing a streaming layer is justified — and the transition is easier if the rest of your data platform is already well-structured.