Modern ETL without a Data Warehouse
A dbt and DuckDB approach
June 13, 2025

Introduction
Data transformation is a critical step in any data pipeline. For many use cases [1], a full-scale cloud data warehouse offers rich tooling, scalability, and easy access to analytical tools. However, this power often comes with increased complexity, cost, and maintenance overhead. At the other end of the spectrum, a few simple scripts working on a data dump offer an easy starting point but lack the scalability and robust structure needed for managing complex transformations as projects evolve.

This article describes a middle ground between these two approaches, aimed at scenarios where datasets are manageable in size and somewhat independent (for example, used by different consumers), yet there is still a strong need for well-defined, maintainable transformations without the cost and complexity of a full-scale data warehouse.
We’ll demonstrate how to build such a pipeline and deploy it with minimal infrastructure by leveraging the combined strengths of dbt-core and DuckDB. This approach offers reduced cloud dependencies for the core transformation engine, lower costs, and a simplified setup and management process. As a practical example, we’ll use the World Bank’s World Development Indicators (WDI), showing code snippets and illustrating how robust results can be derived from a relatively simple local data stack.
Goals
Our specific goals for this project were:
- Integrated Data Model: Our primary data source was the WDI CSV download, but we also needed to enrich this by integrating data from the World Bank’s REST API. A key requirement was to unify these sources into a consistent data model with standardized naming conventions.
- Data Reshaping for Analysis: The raw WDI data often presents historical figures in a wide, column-oriented format. To facilitate easier querying and time-series analysis, we aimed to reshape this into a long, row-oriented format.
For example:
// Wide Format (Column-Oriented)
// Each row represents one country & indicator, with years as columns.
Country Name | Indicator Name | 1960 | 1961 | ...
-------------|----------------|------|------|----
Aruba | GDP per capita | 100 | 102 | ...
Afghanistan | Population | 5000 | 5050 | ...
// Long Format (Row-Oriented)
// Each row represents one country, indicator, and year observation.
Country Name | Indicator Name | Year | Value
-------------|----------------|------|-------
Aruba | GDP per capita | 1960 | 100
Aruba | GDP per capita | 1961 | 102
Afghanistan | Population | 1960 | 5000
Afghanistan | Population | 1961 | 5050
... | ... | ... | ...
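A minimal sketch of how this reshaping could be expressed in DuckDB SQL, assuming a hypothetical wide table named wdi_raw (the project's actual dbt model may differ):

-- Hypothetical: wdi_raw is the wide table, with one column per year.
-- COLUMNS(* EXCLUDE (...)) turns every remaining column into a (year, value) row.
unpivot wdi_raw
on columns(* exclude ("Country Name", "Indicator Name"))
into name year value;

The result has one row per country, indicator, and year, matching the long format shown above.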
- Clear Data Layering: We wanted to achieve a clear architectural separation between raw data, an intermediate staging layer, and the final marts layer. To explain:
Raw Data Layer: This is the initial landing zone for data as it’s ingested from various sources (e.g., CSV files, API extracts). The data here is typically in its original, unaltered format, reflecting the structure and content provided by the source systems.
Staging Layer: This layer takes the raw, ingested data and applies initial transformations, such as standardizing column names (e.g., converting a Country Name column to country_name), casting data types, and basic cleaning. This ensures a consistent foundation for subsequent transformations (a sketch of such a model follows below).
Marts Layer: This layer builds upon the staging layer to create polished, consumer-ready datasets, often aggregated or reshaped for specific analytical purposes.
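To make the staging step concrete, a staging model might look like the following sketch (the source and column names here are illustrative, not the project's exact definitions):

-- Illustrative dbt staging model: rename raw columns and cast types.
-- The source 'wdi_country_raw' and its columns are hypothetical.
select
    "Country Code" as country_code,
    "Country Name" as country_name,
    cast("Year" as integer) as year,
    cast("Value" as double) as value
from {{ source('wdi', 'wdi_country_raw') }}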
In my experience, this separation is important for maintainability, allowing us to evolve ingestion or intermediate transformation logic without immediately impacting downstream users. The diagram below illustrates this flow with each box linked to the respective dbt layer definitions for this project:
- Pre-calculated Aggregations: To optimize query performance and reduce computational load for common analytical questions, we wanted an easy way to define frequently used aggregations for pre-calculation.
- Comprehensive Documentation and Lineage: We wanted a way to provide documentation for the transformations, a clean way to define the data interface, and transparent lineage explaining how each data point was derived from its original source.
- Flexible and Cost-Effective Output: To ensure flexibility for any downstream consumers, our strategy was to primarily deliver final datasets as Parquet files in a cloud object store (Cloudflare R2 in our case, chosen for its cost-effectiveness and lack of egress fees). We also aimed to demonstrate the capability to easily load this transformed data into a relational database (e.g., Cloudflare’s D1 serverless SQLite).
- Low-Complexity: We also wanted a solution that was easy to maintain and simple to debug and reason about.
- Low Cost: Finally, we wanted a solution with relatively low overall cost given the scale of the project.
Why Standard Solutions Can Fall Short
Let’s compare the two standard solutions: simple scripting versus a full cloud data warehouse and supporting tools.
Goal | Simple Scripting & File-based | Cloud Data Warehouse
---|---|---
Integrated Data Model | 🔴 Difficult to manage consistency as complexity grows. Ad-hoc. | 🟢 Strong. Designed for structured, relational data models. |
Data Reshaping for Analysis | 🟡 Possible, but complex reshaping can become unwieldy and error-prone. | 🟢 Excellent. SQL provides powerful reshaping capabilities. |
Clear Data Layering | 🔴 Lacks inherent structure for layering; becomes a tangled web of scripts. | 🟢 Good. Supports logical separation (e.g., staging, marts). |
Pre-calculated Aggregations | 🟡 Can be done, but managing dependencies and updates is manual and complex. | 🟢 Strong. Materialized views or summary tables are common. |
Comprehensive Documentation & Lineage | 🔴 Almost entirely manual and often neglected. Difficult to trace data flow. | 🟢 Good. Many tools offer automated documentation and lineage. |
Flexible & Cost-Effective Output | 🟢 Flexible in output formats (files). | 🟡 Very flexible (SQL access, connectors), but storage/egress can be costly. |
Low-Complexity | 🟡 Low initial complexity. Becomes very complex to maintain as project grows. | 🔴 High initial setup complexity. Managed services reduce some operational burden. |
Low-Cost Infrastructure | 🟢 Very low initial infrastructure cost (often none). | 🔴 Can be high due to storage, compute, and egress fees, even for smaller datasets. Managed services add to platform costs. |
As the table illustrates, neither option perfectly aligns with our project’s needs:
The Cloud Data Warehouse Approach: When Comprehensive Becomes Complex
For projects like ours, opting for a full Cloud Data Warehouse (CDW) solution (e.g., Snowflake, Google BigQuery, Amazon Redshift) presents a significant hurdle: the inherent complexity and overhead of its comprehensive ecosystem. While CDWs excel at structured transformations, lineage, and scaling for very large datasets—often integrating with a rich array of tools for ingestion, orchestration, and BI—this power comes at a cost. Setting up, configuring, and managing these interconnected services, even with provider-managed infrastructure, introduces a layer of operational complexity and potential expense that can be disproportionate for smaller to medium-scale data transformation tasks like our WDI project. Furthermore, committing to a specific CDW often involves a degree of vendor lock-in, which can have long-term cost and flexibility implications. This was an overarching problem we wanted to see if we could avoid.
The Simple Scripting Approach: Trading Robustness for Initial Simplicity
Conversely, while a Simple Scripting & File-based approach (e.g., using Python or Shell scripts) offers low initial cost and setup effort, it typically sacrifices long-term robustness, maintainability, and scalability. The free-form nature of scripts makes it hard to reason about the transformations occurring, especially as the number of data sources, transformation steps, and interdependencies grows—as was the case with our project’s multiple goals. This lack of inherent structure leads to difficulties in testing, debugging, and evolving the pipeline, particularly in a team environment.
Seeking a Better Way
Given these challenges, the central question becomes:
How can we implement a data transformation workflow that is:
- Well-structured and easy to evolve?
- Low-cost and simple to deploy and manage?
- Capable of efficiently handling typical analytical transformations?
The Solution: dbt-core Meets DuckDB
The solution we use lies in the powerful combination of dbt-core and DuckDB.
Introducing dbt (Data Build Tool)
dbt is an open-source command-line tool that empowers data analysts and engineers to transform data within their data store more effectively. It brings software engineering best practices to the analytics workflow.
Its core capabilities include:
SQL-centric: Define transformations primarily using SQL (with Python model support in newer versions), a language familiar to most data practitioners. For example, one aggregation transformation from our project looks like this:
select
    year,
    count(value) as data_points_count
from {{ ref('fct_wdi_history') }}
where value is not null
group by year
Dependency Management: dbt automatically understands and manages the dependencies between transformation steps (called “models”).
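For instance, a downstream model only needs to reference an upstream model with ref() for dbt to infer the execution order. A sketch with illustrative model and column names:

-- Hypothetical marts model: because of the ref(), dbt builds stg_wdi_history first.
select
    country_code,
    indicator_code,
    year,
    value
from {{ ref('stg_wdi_history') }}
where value is not null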
Automated Testing: Implement data quality tests to ensure the integrity of your transformations and output data. For example, we can test for nulls by providing yaml configuration to dbt:
- name: indicator_code
  description: "Indicator code."
  tests:
    - not_null
Documentation & Lineage: Automatically generate comprehensive documentation and a visual lineage graph, showing how data flows through your pipeline. See here for the generated documentation for the project.
Version Control: Treats your transformation logic as code, enabling version control.
For this project, we used dbt-core, the open-source Python package that you can run anywhere. There’s also dbt Cloud, a managed cloud-based solution offering additional features like a scheduler and UI, but dbt-core provides all the essential transformation capabilities we needed.
Introducing DuckDB
DuckDB is an in-process analytical data management system (OLAP RDBMS). Think of it as SQLite for analytics.
Its key advantages for our use case are:
Simplicity: There’s no separate server process to manage; it runs within your Python application or CLI session. Installation is as easy as:
pip install duckdb
Speed: DuckDB is highly optimized for analytical queries and transformations, employing techniques like columnar storage and vectorized query execution.
SQL-centric: It speaks SQL fluently, making it a perfect partner for dbt’s SQL-first approach.
Versatile Data Ingestion: DuckDB can directly query data from various file formats, including Parquet, CSV, and JSON, often without needing a separate loading step. For example with Parquet:
con = duckdb.connect(DUCKDB_DATABASE)
con.execute("CREATE SCHEMA IF NOT EXISTS public")
query = f"""
CREATE TABLE IF NOT EXISTS public.{table_name} AS
SELECT * FROM read_parquet('{parquet_path}');
"""
con.execute(query)
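CSV and JSON files can be queried just as directly. For example, a quick look at the raw WDI download could use something like the following (the file path is illustrative):

-- Query the raw CSV in place, without a prior load step.
select "Country Name", "Indicator Name", "2020"
from read_csv_auto('WDICSV.csv')
limit 10;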
The Combined Architecture
When dbt-core and DuckDB are used together, they form a lightweight yet powerful data transformation engine.
Here’s a conceptual flow of how they interact in this project:
Explanation of the flow:
- Data Ingestion: Raw data is acquired from sources (CSVs, APIs, etc.) and stored locally to be available to DuckDB. In our project we also replicate this data to Cloudflare R2 object store so we have a full record. Parquet is a good choice for the storage format due to its efficiency.
- DuckDB as the Engine: DuckDB is configured as the “data warehouse” in dbt’s configuration with the database running locally:
wdi:
outputs:
prod:
type: duckdb
path: ../wdi.duckdb
threads: 4
- dbt Orchestration: dbt-core executes the SQL models we’ve defined. These models perform transformations (cleaning, joining, aggregating) on the data within DuckDB. This is represented by the transition from DuckDB acting as a reader of raw data to DuckDB holding transformed data, orchestrated by dbt. dbt manages the order of execution based on dependencies.
- Output & Consumption: The transformed data resides within DuckDB. From there, it can be queried directly for analysis, or, as in our project, materialized (written out) to Parquet stored in Cloudflare R2 as well as the relational database Cloudflare D1 (see the sketch below).
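As a rough sketch of that export step (the output path is illustrative), DuckDB's COPY command can write a transformed table straight to Parquet, which can then be synced to object storage:

-- Write a transformed dbt model out as a Parquet file.
copy (select * from fct_wdi_history)
to 'output/fct_wdi_history.parquet' (format parquet);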
This architecture allows us to run the entire dbt transformation pipeline locally as part of our CircleCI CI/CD environment, using DuckDB as an ephemeral engine, without needing a dedicated data warehouse service.
The Results
The accompanying GitHub project provides a concrete implementation of this approach using World Development Indicators data. Below are some live visualizations of the transformed data:
Appendix: Additional Considerations
Beyond the core setup, I learned a few additional things:
Development Environment: DuckDB vs. Postgres for Concurrency
A practical challenge encountered during development was DuckDB’s default behavior with file-based databases: typically, only one process can write to a DuckDB database at a time. In a development environment, I had multiple tools needing concurrent access – dbt running transformations, a VS Code SQL plugin for querying, a database client for inspection, and the DuckDB CLI itself.
To address this, a local PostgreSQL database was used for development. dbt’s profiles.yml file makes it straightforward to define multiple environments. This allowed for easy switching: using Postgres for interactive development where concurrency was beneficial, and DuckDB for “production-like” runs.
Leveraging rclone for Cloud Storage Agnosticity
To manage the movement of data to and from cloud storage (like Cloudflare R2), rclone (“rsync for cloud storage”) proved to be an invaluable tool.
The benefits of using rclone include:
- Simplified Scripting: It provides an easy command-line interface for interacting with numerous cloud storage providers (Cloudflare R2, AWS S3, Google Cloud Storage, etc.), simplifying data synchronization scripts.
- Vendor Independence: Using rclone offers a degree of vendor independence. For our project we chose Cloudflare R2 due to its cost-effectiveness, particularly its absence of egress fees, making it attractive for scenarios where data might be pulled for analysis by various tools or services. However, if you decide to switch cloud storage providers in the future, the changes required in your data synchronization scripts are minimal, primarily involving rclone configuration updates.
[1] For example, my Machine Learning for Flight Delays project leverages Google Cloud’s data and machine learning services.