dbt Best Practices That Actually Scale: Lessons from 500+ Models

When I joined my current company two years ago, our dbt project had 47 models, no naming conventions, and a single schema.yml file with exactly three tests. Today we run 530+ models across four business domains, deploy to production six times a day, and our CI pipeline catches schema-breaking changes before they hit Snowflake. Getting here was not a straight line. This post covers every dbt best practice that actually survived contact with reality at scale.

I have read the official dbt best practices guide, watched the Coalesce talks, and browsed dozens of "how we structure dbt" blog posts. Most of them are great starting points but stop right where things get interesting: the part where your project grows past 200 models and everything you thought was clean starts creaking. These are the patterns and lessons that held up for us.

Context: We run dbt Core 1.8 on Snowflake, orchestrated by Airflow, with a monorepo that serves four analytics teams. Some of these practices are warehouse-agnostic, but I will call out Snowflake-specific bits where relevant.

Project Structure That Does Not Collapse at Scale

The single most impactful decision you will make in a dbt project is your directory structure. Get it wrong early and you will spend months untangling spaghetti references. Get it right and onboarding a new analyst takes a day instead of a week.

Here is the structure we settled on after three refactors:

models/
├── staging/
│   ├── stripe/
│   │   ├── _stripe__sources.yml
│   │   ├── _stripe__models.yml
│   │   ├── stg_stripe__payments.sql
│   │   ├── stg_stripe__customers.sql
│   │   └── stg_stripe__invoices.sql
│   ├── salesforce/
│   │   ├── _salesforce__sources.yml
│   │   ├── _salesforce__models.yml
│   │   ├── stg_salesforce__accounts.sql
│   │   ├── stg_salesforce__opportunities.sql
│   │   └── stg_salesforce__contacts.sql
│   └── internal_app/
│       ├── _internal_app__sources.yml
│       └── stg_internal_app__events.sql
├── intermediate/
│   ├── finance/
│   │   ├── _finance__models.yml
│   │   ├── int_payments__pivoted_by_method.sql
│   │   └── int_revenue__monthly_aggregated.sql
│   └── sales/
│       ├── _sales__models.yml
│       └── int_opportunities__enriched.sql
├── marts/
│   ├── finance/
│   │   ├── _finance__models.yml
│   │   ├── fct_revenue.sql
│   │   ├── fct_payments.sql
│   │   └── dim_customers.sql
│   └── sales/
│       ├── _sales__models.yml
│       ├── fct_deals.sql
│       └── dim_accounts.sql
└── utilities/
    ├── util_date_spine.sql
    └── util_currency_rates.sql

A few things to notice. Staging models are grouped by source system, not by business domain. This is critical. A Stripe payment is a Stripe payment regardless of which team consumes it. Intermediate and mart models are grouped by business domain because that is how humans think about the data once it has been cleaned.

The underscore-prefixed YAML files (_stripe__sources.yml) sort to the top of each directory, making them easy to find. We use the double-underscore convention consistently: system__entity for staging, domain__transformation for intermediate, and fact/dimension prefixes for marts.

The dbt_project.yml Config That Saves You

Folder structure alone is not enough. You need your dbt_project.yml to enforce materialization and schema policies by directory:

# dbt_project.yml (relevant section)
models:
  my_project:
    staging:
      +materialized: view
      +schema: staging
      +tags: ["staging"]
    intermediate:
      +materialized: ephemeral
      +schema: intermediate
      +tags: ["intermediate"]
    marts:
      +materialized: table
      +schema: analytics
      +tags: ["marts"]
      finance:
        +schema: finance
      sales:
        +schema: sales

Staging models are views because they are thin wrappers over source tables. Intermediate models default to ephemeral because they exist to simplify mart logic and rarely need to be queried directly. Marts are tables because that is what BI tools hit. We override intermediate to table on a case-by-case basis when debugging requires it or when a model is referenced by 5+ downstream nodes.

Naming Conventions That Survive 500 Models

At 50 models, inconsistent naming is annoying. At 500 models, it is genuinely dangerous. Someone writes a query against payments when they meant fct_payments and suddenly the CFO's dashboard shows unfiltered test transactions.

Here is our naming contract:

| Layer | Prefix | Pattern | Example |
| --- | --- | --- | --- |
| Sources | (none) | source('system', 'table') | source('stripe', 'payments') |
| Staging | stg_ | stg_{source}__{entity} | stg_stripe__payments |
| Intermediate | int_ | int_{entity}__{verb} | int_payments__pivoted_by_method |
| Facts | fct_ | fct_{event/action} | fct_revenue |
| Dimensions | dim_ | dim_{entity} | dim_customers |
| Utilities | util_ | util_{purpose} | util_date_spine |

The double underscore in staging names is not just cosmetic. It creates a clear visual parse: everything before __ is the source system, everything after is the entity. When you are scanning a lineage graph with hundreds of nodes, that two-character separator saves real cognitive effort.
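The convention also makes model names trivially machine-parseable. As a hypothetical illustration (not code from our project), splitting a staging name back into its parts is one `partition` call:

```python
def parse_staging_name(model_name: str) -> tuple[str, str]:
    """Split a staging model name into (source_system, entity),
    relying on the stg_{source}__{entity} convention described above."""
    if not model_name.startswith("stg_"):
        raise ValueError(f"not a staging model: {model_name}")
    source, sep, entity = model_name[len("stg_"):].partition("__")
    if not sep or not entity:
        raise ValueError(f"missing __ separator: {model_name}")
    return source, entity

# parse_staging_name("stg_stripe__payments") -> ("stripe", "payments")
```

Tooling such as lineage annotators or CI checks can lean on this structure instead of maintaining a lookup table.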

Column naming deserves the same rigor. Every primary key is {entity}_id. Every timestamp ends in _at. Every boolean starts with is_ or has_. Every date (without time) ends in _date. We enforce this with a custom macro that runs in CI and flags violations.
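The macro itself is dbt-specific, but the checks are easy to sketch. Here is a minimal Python version of the column-naming lint described above; the rule set mirrors our conventions, while the function and structure are illustrative, not our actual implementation:

```python
# Sketch of the column-naming CI check; the real check is a dbt macro.
NAMING_RULES = [
    # (data-type prefix, name predicate, rule description)
    ("timestamp", lambda n: n.endswith("_at"), "timestamps must end in _at"),
    ("date", lambda n: n.endswith("_date"), "dates must end in _date"),
    ("boolean", lambda n: n.startswith(("is_", "has_")),
     "booleans must start with is_ or has_"),
]

def naming_violations(columns: dict[str, str]) -> list[str]:
    """columns maps column name -> warehouse data type.
    Returns one message per violated rule."""
    violations = []
    for name, dtype in columns.items():
        for type_prefix, name_ok, rule in NAMING_RULES:
            if dtype.lower().startswith(type_prefix) and not name_ok(name):
                violations.append(f"{name}: {rule}")
    return violations
```

In CI, this kind of check runs over every model's column metadata and fails the build on a non-empty result.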

Testing Strategy That Catches Real Problems

This is where most dbt projects fall short, and I include our old setup in that statement. Having unique and not_null on your primary keys is table stakes. A testing strategy that actually protects you at scale needs three layers.

Layer 1: Schema Tests (the Foundation)

# _finance__models.yml
version: 2

models:
  - name: fct_revenue
    description: "Monthly recognized revenue by customer and product line"
    columns:
      - name: revenue_id
        description: "Surrogate key: customer_id + product_id + revenue_month"
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('dim_customers')
              field: customer_id
      - name: revenue_amount_usd
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 10000000
      - name: revenue_month
        tests:
          - not_null
          - dbt_utils.not_constant

Notice the accepted_range test on revenue. We added that after a currency conversion bug caused a $47 transaction to show up as $47,000,000 in the finance mart. The not_constant test on the date column catches a subtle failure mode where an incremental model silently stops processing new data and every row gets the same timestamp.

Layer 2: Custom Data Tests

Schema tests check individual columns. Data tests check business logic across columns and tables:

-- tests/assert_revenue_matches_payments.sql
-- Revenue recognized should never exceed total payments received

with revenue as (
    select
        customer_id,
        sum(revenue_amount_usd) as total_revenue
    from {{ ref('fct_revenue') }}
    group by customer_id
),

payments as (
    select
        customer_id,
        sum(payment_amount_usd) as total_payments
    from {{ ref('fct_payments') }}
    where payment_status = 'succeeded'
    group by customer_id
)

select
    r.customer_id,
    r.total_revenue,
    p.total_payments,
    r.total_revenue - p.total_payments as discrepancy
from revenue r
join payments p on r.customer_id = p.customer_id
where r.total_revenue > p.total_payments * 1.01  -- 1% tolerance for timing

This test has saved us twice. Once from a duplicate payment ingestion issue and once from a timezone mismatch that double-counted midnight transactions.

Layer 3: Freshness and Volume Tests

# _stripe__sources.yml
version: 2

sources:
  - name: stripe
    database: raw
    schema: stripe
    loaded_at_field: _fivetran_synced
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: payments
        loaded_at_field: created
        freshness:
          warn_after: {count: 6, period: hour}
          error_after: {count: 12, period: hour}

Source freshness tests are your early warning system. When Fivetran silently breaks a connector at 2 AM, you want to know before the morning standup. We run dbt source freshness every hour via Airflow and send alerts to Slack.

For volume monitoring, we use a simple macro that compares today's row count against a rolling 7-day average and alerts on deviations greater than 30%. Boring, effective, catches things that schema tests never will.
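The macro's core logic is a plain deviation check. A Python sketch of the same idea — the 30% threshold and 7-day window come from the description above, while names and shape are illustrative (the real implementation is a dbt macro):

```python
def volume_alert(todays_count: int, prior_counts: list[int],
                 threshold: float = 0.30) -> bool:
    """True when today's row count deviates from the rolling average
    of prior_counts (e.g. the last 7 days) by more than threshold."""
    if not prior_counts:
        return False  # no history yet; nothing to compare against
    avg = sum(prior_counts) / len(prior_counts)
    if avg == 0:
        return todays_count > 0  # table was empty; any rows are news
    return abs(todays_count - avg) / avg > threshold

# volume_alert(50_000, [98_000] * 7) -> True  (~49% drop)
# volume_alert(100_000, [98_000] * 7) -> False (~2% deviation)
```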

Incremental Models That Do Not Bite You

Incremental models are where dbt goes from "nice SQL organizer" to "actual data engineering tool." They are also where most production incidents originate. We got burned several times before settling on the pattern we now trust:

-- fct_events.sql
{{
  config(
    materialized='incremental',
    unique_key='event_id',
    incremental_strategy='merge',
    cluster_by=['event_date', 'event_type'],
    on_schema_change='append_new_columns'
  )
}}

with source_events as (
    select
        event_id,
        user_id,
        event_type,
        event_properties,
        occurred_at,
        date(occurred_at) as event_date,
        _loaded_at
    from {{ ref('stg_internal_app__events') }}

    {% if is_incremental() %}
    where _loaded_at > (
        select dateadd(hour, -3, max(_loaded_at))
        from {{ this }}
    )
    {% endif %}
),

final as (
    select
        event_id,
        user_id,
        event_type,
        event_properties,
        occurred_at,
        event_date,
        _loaded_at,
        current_timestamp() as _dbt_updated_at
    from source_events
)

select * from final

There are three important details in that model. First, the 3-hour lookback window (dateadd(hour, -3, max(_loaded_at))) instead of a strict greater-than comparison. Late-arriving data is a fact of life. If your incremental logic assumes perfect ordering, you will eventually lose rows. The overlap costs a tiny bit of extra merge processing but guarantees correctness.

Second, we use _loaded_at (the timestamp when the record arrived in our warehouse) rather than a business timestamp like occurred_at. Business timestamps can be backdated, replayed, or corrected. The load timestamp monotonically increases and is the only reliable cursor for incremental processing.

Third, on_schema_change='append_new_columns' prevents your model from silently dropping new fields when the source schema evolves. The default behavior of ignore has caused us more data quality issues than any other single dbt default.

When to Use Full Refresh

Not everything should be incremental. We use full refresh for any model under 10 million rows and under 2 minutes of build time. The cognitive overhead of maintaining incremental logic is only worth it when full refreshes become expensive. Our rule of thumb: if dbt run --full-refresh -s model_name takes less than 120 seconds, keep it as a table.
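That rule of thumb reduces to a one-line predicate. A sketch using the thresholds above (tune both for your own warehouse):

```python
def should_be_incremental(row_count: int, full_refresh_seconds: float,
                          row_limit: int = 10_000_000,
                          time_limit_s: float = 120.0) -> bool:
    """Only take on incremental complexity when a full rebuild is
    genuinely expensive; thresholds mirror the rule of thumb above."""
    return row_count >= row_limit or full_refresh_seconds >= time_limit_s
```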

CI/CD That Actually Gates Deploys

Running dbt build in CI is straightforward. Making it actually prevent bad merges requires a few non-obvious configurations.

Here is our GitHub Actions workflow, trimmed to the important parts:

# .github/workflows/dbt-ci.yml
name: dbt CI

on:
  pull_request:
    paths:
      - 'models/**'
      - 'macros/**'
      - 'tests/**'
      - 'dbt_project.yml'
      - 'packages.yml'

jobs:
  dbt-ci:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install dbt
        run: pip install dbt-snowflake==1.8.*

      - name: Run modified models
        run: |
          dbt build \
            --select state:modified+ \
            --defer \
            --state ./prod-manifest/ \
            --target ci \
            --fail-fast
        env:
          DBT_CI_SCHEMA: "ci_pr_${{ github.event.pull_request.number }}"

      - name: Check source freshness
        run: dbt source freshness --target ci

      - name: Cleanup CI schema
        if: always()
        run: |
          dbt run-operation drop_schema \
            --args '{schema_name: "ci_pr_${{ github.event.pull_request.number }}"}'

The key feature is state:modified+ combined with --defer. This builds only the models changed in the PR plus their downstream dependents, reading upstream unmodified models from the production manifest. A PR that touches one staging model might build 15 downstream models instead of all 530. This brought our CI time from 45 minutes to under 8 minutes on average.

Each PR gets its own Snowflake schema (ci_pr_142), which gets cleaned up after the run. This means multiple PRs can run CI concurrently without interference.

We also run a dbt docs generate step and post a comment on the PR with a link to the generated docs site. Reviewers can click through the lineage graph for any changed model without cloning the branch. That small touch increased our review quality noticeably.

Performance Optimization Patterns

At 530 models, a naive dbt build takes 90 minutes on our warehouse. Here is how we got the production run down to 22 minutes.

Cluster Keys and Sort Orders

On Snowflake, clustering keys on large fact tables made the biggest single difference. But you have to cluster on the columns that your downstream queries actually filter on, not the ones that seem logical:

{{
  config(
    materialized='incremental',
    unique_key='event_id',
    cluster_by=['event_date', 'user_id']
  )
}}
{# NOT cluster_by=['event_id'] -- unique keys make terrible cluster keys #}

Model-Level Warehouse Sizing

Not all models need the same compute. We tag heavy models and route them to a larger warehouse:

# dbt_project.yml
models:
  my_project:
    marts:
      +snowflake_warehouse: TRANSFORM_WH_MEDIUM
      finance:
        fct_revenue:
          +snowflake_warehouse: TRANSFORM_WH_XLARGE

That single revenue model was consuming 40% of our total CI credits because it was running on a medium warehouse and taking 18 minutes. Moving it to XL cut it to 3 minutes and actually reduced our costs because the warehouse suspended faster.

Macro for Conditional Materializations

In development, we want fast iteration. In production, we want performance. This macro lets us toggle materialization by target:

-- macros/get_incremental_or_table.sql
{% macro get_incremental_or_table() %}
  {% if target.name == 'dev' %}
    {{ return('table') }}
  {% else %}
    {{ return('incremental') }}
  {% endif %}
{% endmacro %}

-- Usage in a model:
{{ config(materialized=get_incremental_or_table()) }}

In dev, every model is a plain table rebuilt in full on each run. In prod and CI, the incremental logic kicks in. This eliminates the most common class of "works in dev, breaks in prod" bugs where developers forget to test the incremental path.

Anti-Patterns: What Not to Do

I have made every mistake on this list at least once. Hopefully you can skip a few of them.

1. Source References in Mart Models

-- BAD: mart model directly referencing source
select * from {{ source('stripe', 'payments') }}
where status = 'succeeded'

-- GOOD: mart model references staging layer
select * from {{ ref('stg_stripe__payments') }}
where payment_status = 'succeeded'

When a mart references a source directly, you lose the staging layer's renaming, type casting, and filtering. You also create a hidden coupling: if the source schema changes, every model that touches it breaks instead of just the staging model.

2. Business Logic in Staging Models

-- BAD: staging model with business logic
select
    id as payment_id,
    amount / 100.0 as payment_amount_usd,  -- currency conversion in staging!
    case
        when status in ('paid', 'settled') then 'succeeded'
        when status = 'refunded' then 'refunded'
        else 'pending'
    end as payment_status
from {{ source('stripe', 'payments') }}
where created > '2023-01-01'  -- arbitrary filter in staging!

-- GOOD: staging model does only rename + cast + dedupe
select
    id as payment_id,
    amount as payment_amount_cents,
    status as payment_status_raw,
    currency as payment_currency,
    created as created_at
from {{ source('stripe', 'payments') }}

Staging models should be boring. Rename columns to your conventions, cast types, maybe deduplicate, and that is it. The moment you put business logic in staging, you make an implicit decision that affects every downstream consumer, and six months from now nobody will remember why payments before 2023 are missing.

3. The 500-Line Model

If a single model file exceeds ~150 lines of SQL, it almost certainly contains logic that should be split into an intermediate model. Long models are hard to test, hard to debug, and impossible to reuse. We have a linting rule that warns on any model over 200 lines. The fix is always the same: extract CTEs into intermediate models.
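Our actual rule lives in the linter config, but the check is simple enough to sketch as a hypothetical standalone script:

```python
from pathlib import Path

def oversized_models(models_dir: str, max_lines: int = 200) -> dict[str, int]:
    """Return {path: non_blank_line_count} for every model over the
    limit; 200 is the warning threshold mentioned above."""
    oversized = {}
    for sql_file in Path(models_dir).rglob("*.sql"):
        count = sum(1 for line in sql_file.read_text().splitlines()
                    if line.strip())  # count non-blank lines only
        if count > max_lines:
            oversized[str(sql_file)] = count
    return oversized
```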

4. Disabling Tests to Fix CI

When a test fails in CI, the temptation is to add severity: warn or just delete the test. We did this three times early on and each time the underlying data quality issue snowballed for weeks before someone noticed. Now our rule is simple: if a test fails, either fix the data, fix the test, or file a ticket and add a comment explaining why the test is temporarily set to warn with an expected resolution date.

5. Incremental Models Without a Full Refresh Strategy

-- BAD: no way to recover from drift
{% if is_incremental() %}
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}

-- GOOD: explicit late-arriving data window + scheduled full refresh
{% if is_incremental() %}
where updated_at > (
    select dateadd(day, -3, max(updated_at)) from {{ this }}
)
{% endif %}

Every incremental model drifts from reality over time. Late-arriving records, retroactive corrections, deleted source rows. You need two things: a lookback window to handle normal late arrivals, and a scheduled full refresh (we do weekly on Sundays) to reconcile everything else. If you cannot afford a weekly full refresh, your model is too expensive and you should optimize the underlying query first.

6. Not Using dbt Packages

I see teams writing custom date spine macros, surrogate key generators, and pivot helpers from scratch. The dbt-utils, dbt-expectations, and dbt-date packages exist and are battle-tested across thousands of projects. Use them. Your custom generate_surrogate_key macro probably has an edge case bug that dbt_utils.generate_surrogate_key fixed two years ago.

Documentation as a First-Class Citizen

At 500 models, your dbt docs site is not a nice-to-have. It is how new team members understand what data exists and where it comes from. We enforce documentation in two ways.

First, every model must have a description in its YAML file. Our CI pipeline fails if any model lacks a description. This sounds draconian but in practice it takes 30 seconds per model and saves hours of Slack messages asking "what does int_payments__pivoted_by_method actually do?"
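The check itself is a few lines once the YAML is parsed. A Python sketch of the description gate (the real check reads dbt's artifacts; the function name and input shape are illustrative):

```python
def undocumented_models(parsed_schema_yml: dict) -> list[str]:
    """Given one parsed *_models.yml file (as a dict), return the
    names of models whose description is missing or blank."""
    missing = []
    for model in parsed_schema_yml.get("models", []):
        if not model.get("description", "").strip():
            missing.append(model["name"])
    return missing

# CI fails when any file yields a non-empty list.
```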

Second, we use doc blocks for anything that needs more than one sentence:

<!-- models/marts/finance/_finance__docs.md -->
{% docs fct_revenue %}

Monthly recognized revenue by customer and product line.

**Grain:** One row per customer per product per calendar month.

**Important notes:**
- Revenue is recognized on invoice finalization date, not payment date
- Amounts are converted to USD using the daily spot rate from `util_currency_rates`
- Refunds are negative rows, not deletions
- Only includes customers with `is_test = false`

**Common joins:**
- `dim_customers` on `customer_id` for company name, segment, CSM
- `dim_products` on `product_id` for product line, pricing tier

{% enddocs %}

That grain statement alone prevents half the "why are my numbers wrong" conversations. If someone joins fct_revenue to dim_customers without understanding the grain, they will accidentally fan out revenue by product. The doc block makes the grain impossible to miss.

Macros and DRY Principles (With Restraint)

Macros are powerful. They are also the fastest way to make your dbt project unreadable. Our guideline: only extract a macro when the same logic appears in three or more models and is genuinely identical (not just similar).

Here is one macro that earned its keep:

-- macros/cents_to_dollars.sql
{% macro cents_to_dollars(column_name, precision=2) %}
    round({{ column_name }}::numeric / 100.0, {{ precision }})
{% endmacro %}

-- Usage in any model:
select
    payment_id,
    {{ cents_to_dollars('payment_amount_cents') }} as payment_amount_usd
from {{ ref('stg_stripe__payments') }}

And here is a macro that we regretted writing:

-- BAD: over-abstracted generic dimension macro
{% macro build_dimension(source_model, id_col, attribute_cols, scd_type=2) %}
  -- 80 lines of Jinja generating SCD logic
{% endmacro %}

-- Nobody could debug this when it broke. We replaced it with
-- explicit SQL in each dimension model.

The abstraction looked elegant in a PR review. In production at 3 AM when a dimension was generating duplicate rows, the five levels of Jinja indirection made root cause analysis nearly impossible. We spent less total time maintaining explicit (slightly repetitive) SQL than we spent debugging the macro.

What I Would Do Differently Starting Over

If I were starting a new dbt project today with everything I know from scaling to 500+ models, here is what I would do from day one:

  1. Enforce naming conventions before the 10th model. Renaming models after they have downstream dependents is painful. Set the conventions in a CONTRIBUTING.md and lint them in CI from the start.
  2. Put descriptions on every model and column immediately. Documentation debt compounds faster than any other kind of technical debt in analytics.
  3. Use the staging / intermediate / mart structure from the first model. The "we will refactor later" moment never comes voluntarily. It comes when something breaks in production and you are forced to.
  4. Set up CI with state comparison before you have 20 models. The state:modified+ selector makes CI fast from the beginning and teaches the team to think in terms of impact analysis.
  5. Schedule weekly full refreshes for all incremental models from the start. It costs nearly nothing when your tables are small and establishes the pattern before it becomes operationally critical.
  6. Write one data test per mart model. Not just schema tests. At least one test that encodes a business rule you know must hold. These become your regression safety net.
  7. Keep macros simple or do not write them. Three lines of repeated SQL across five models is better than a clever macro that nobody except the author can maintain.

Scaling a dbt project is not fundamentally about tooling or clever Jinja tricks. It is about discipline, conventions, and making the right defaults so easy to follow that doing the wrong thing requires more effort than doing the right thing. Every practice in this post exists because we learned the hard way what happens without it. I hope you can skip a few of those lessons.
