❄️ Data Flakes


The promise of the “open lakehouse” has always been seductive: store data in open formats, query it from anywhere, and escape vendor lock-in. The reality, until recently, was a mess of manual metadata management, inconsistent tooling, and “it works in theory” architectures.

That’s changed. With dbt 1.9+ and Snowflake’s expanded Iceberg support, building production-grade open lakehouses is not just possible—it’s practical.

The Evolution of Iceberg on Snowflake

Apache Iceberg tables became generally available on Snowflake in June 2024. Since then, the ecosystem has matured rapidly:

  • July 2025: Write support for externally managed Iceberg tables and catalog-linked databases reached preview
  • dbt 1.9: Native support for Iceberg table materialisation
  • dbt 1.11: Support for Glue catalog integration via catalog-linked databases

This means you can now use dbt to create and transform Iceberg tables that register metadata in external catalogs—not just Snowflake’s internal catalog.

Why This Matters

The Old Way: Snowflake as a Silo

```text
[Source Data] → [Fivetran] → [Snowflake Native Tables] → [dbt] → [Snowflake Native Tables]

                              [Export to S3 for Spark]
```

Data lives in Snowflake’s proprietary format. To use it elsewhere (Spark ML jobs, Databricks notebooks), you need to export it.

The New Way: Snowflake as a Lakehouse Citizen

```text
[Source Data] → [Ingestion] → [S3 Iceberg Tables] ← [Snowflake + dbt]

                              [Spark / Databricks / Trino]
```

The same Iceberg table is readable and writable by multiple engines. No export pipelines. No data copies.

Configuring dbt for Iceberg

Basic Iceberg Table Materialisation

In your dbt_project.yml:

```yaml
models:
  my_project:
    marts:
      +materialized: table
      +table_format: iceberg
      +external_volume: my_s3_volume
      +base_location: warehouse/marts/
```

Or at the model level in schema.yml:

```yaml
models:
  - name: dim_customer
    config:
      materialized: table
      table_format: iceberg
      external_volume: my_s3_volume
      base_location: warehouse/dims/customers/
```

Incremental Iceberg Models

Iceberg tables support incremental materialisation with dbt's merge strategy.
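
A minimal sketch of what such a model might look like, reusing the external volume and base-location conventions from the earlier examples (the table and column names are illustrative):

```sql
-- models/marts/fct_orders.sql (hypothetical model)
{{
    config(
        materialized='incremental',
        incremental_strategy='merge',
        unique_key='order_id',
        table_format='iceberg',
        external_volume='my_s3_volume',
        base_location='warehouse/marts/orders/'
    )
}}

SELECT
    order_id,
    customer_id,
    order_date,
    order_total
FROM {{ ref('stg_orders') }}

{% if is_incremental() %}
  -- On incremental runs, only pick up rows newer than what the target already holds
  WHERE order_date > (SELECT MAX(order_date) FROM {{ this }})
{% endif %}
```

On the first run dbt builds the full table; on subsequent runs new rows are merged into the target on order_id.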

Key Point: The merge strategy leverages Iceberg’s ACID transactions, ensuring data consistency even when multiple processes access the table.

External Catalog Integration

The real power emerges when you connect dbt to external Iceberg catalogs.

Option 1: Snowflake Open Catalog (Polaris)

Snowflake Open Catalog is Snowflake's managed service for Apache Polaris, an open-source Iceberg REST catalog. To use it from Snowflake, create a catalog integration that points at its REST endpoint.
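
The exact statement depends on your Open Catalog account and authentication setup; the following is only a sketch of one common shape, with placeholder URIs, names, and credentials (verify the parameter names against the current Snowflake documentation):

```sql
-- Hypothetical identifiers; check current docs for the exact syntax
CREATE CATALOG INTEGRATION open_catalog_int
  CATALOG_SOURCE = POLARIS
  TABLE_FORMAT = ICEBERG
  CATALOG_NAMESPACE = 'analytics'
  REST_CONFIG = (
    CATALOG_URI = 'https://<org>-<account>.snowflakecomputing.com/polaris/api/catalog'
    WAREHOUSE = 'prod_open_catalog'
  )
  REST_AUTHENTICATION = (
    TYPE = OAUTH
    OAUTH_CLIENT_ID = '<client_id>'
    OAUTH_CLIENT_SECRET = '<client_secret>'
    OAUTH_ALLOWED_SCOPES = ('PRINCIPAL_ROLE:ALL')
  )
  ENABLED = TRUE;
```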

Option 2: AWS Glue Catalog

For AWS-centric architectures, link to Glue:

```sql
-- Create catalog-linked database
CREATE DATABASE glue_catalog_db
    FROM CATALOG INTEGRATION glue_catalog;
```
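
Creating the linked database assumes a Glue catalog integration named glue_catalog already exists. A sketch of what creating one might look like, with placeholder account, role, and region values (the IAM role needs access to both the Glue Data Catalog and the underlying S3 data):

```sql
-- Hypothetical identifiers; adjust to your AWS account and region
CREATE CATALOG INTEGRATION glue_catalog
  CATALOG_SOURCE = GLUE
  TABLE_FORMAT = ICEBERG
  CATALOG_NAMESPACE = 'analytics'
  GLUE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-glue-access'
  GLUE_CATALOG_ID = '123456789012'
  GLUE_REGION = 'eu-west-2'
  ENABLED = TRUE;
```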

In dbt, reference tables from the linked database:

```sql
-- models/staging/stg_external_events.sql
SELECT *
FROM {{ source('glue_catalog_db', 'raw_events') }}
WHERE event_date >= CURRENT_DATE - 7
```
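
For that source() call to resolve, dbt also needs a source definition pointing at the linked database; a minimal sketch, where the schema (the Glue namespace) is an assumption:

```yaml
# models/staging/_sources.yml (hypothetical)
version: 2

sources:
  - name: glue_catalog_db
    database: glue_catalog_db
    schema: events            # the Glue namespace exposed through the linked database
    tables:
      - name: raw_events
```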

Note: As of dbt 1.11, incremental materialisation is not yet supported for Glue catalog tables. Use full table refreshes or manage incremental logic externally.

Practical Architecture: Multi-Engine Analytics

Here’s a production pattern I’ve seen work well:

The Setup

```text
┌─────────────────────────────────────────────────────────────────┐
│                        S3 Data Lake                             │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐             │
│  │ raw/        │  │ staging/    │  │ marts/      │             │
│  │ (Iceberg)   │  │ (Iceberg)   │  │ (Iceberg)   │             │
│  └─────────────┘  └─────────────┘  └─────────────┘             │
└─────────────────────────────────────────────────────────────────┘
         ↑                 ↑                 ↑
         │                 │                 │
    ┌────┴────┐      ┌─────┴─────┐     ┌────┴────┐
    │ Spark   │      │ Snowflake │     │ Trino   │
    │ (Ingest)│      │ (dbt)     │     │ (Ad-hoc)│
    └─────────┘      └───────────┘     └─────────┘
```

Workflow

  1. Spark ingests raw data into Iceberg tables (high-volume, cost-effective)
  2. Snowflake + dbt transforms raw → staging → marts (SQL-native, governed)
  3. Trino/Databricks runs ad-hoc ML workloads against the same marts (no data movement)

The dbt Side

```yaml
# dbt_project.yml
models:
  lakehouse_project:
    staging:
      +materialized: view  # Views for staging (reference raw Iceberg)
    marts:
      +materialized: table
      +table_format: iceberg
      +external_volume: prod_lake_volume
```

Performance Considerations

Compaction

Iceberg tables accumulate small files over time. Snowflake automatically compacts Snowflake-managed Iceberg tables, but for externally managed tables, you need to handle this yourself (via Spark or a scheduled job).
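
One common approach for externally managed tables is to schedule Iceberg's table-maintenance procedures from Spark; a sketch using the rewrite_data_files procedure (the catalog and table names are placeholders):

```sql
-- Run from a Spark session with the Iceberg SQL extensions enabled
CALL glue_catalog.system.rewrite_data_files(
  table   => 'marts.fct_orders',
  options => map('target-file-size-bytes', '134217728')  -- aim for roughly 128 MB files
);
```

Pairing this with the expire_snapshots procedure keeps metadata and storage growth in check.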

Clustering

For large Iceberg tables, define clustering keys:

```sql
{{
    config(
        materialized='table',
        table_format='iceberg',
        cluster_by=['order_date', 'region']
    )
}}
```

Query Pruning

Iceberg’s manifest files contain min/max statistics for each data file. Snowflake uses these for partition pruning. Ensure your queries filter on columns with good selectivity.
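
As a rough illustration against a hypothetical fct_orders table: a plain range predicate on a well-clustered column can be checked against per-file min/max statistics, while wrapping that column in an expression often prevents those statistics from being used.

```sql
-- Can skip data files whose order_date range falls outside the predicate
SELECT COUNT(*) FROM fct_orders WHERE order_date >= '2025-01-01';

-- Often scans far more files: the expression hides the raw column from the min/max stats
SELECT COUNT(*) FROM fct_orders WHERE TO_VARCHAR(order_date, 'YYYY-MM') = '2025-01';
```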

Common Pitfalls

Pitfall 1: Forgetting External Volume Permissions

Problem: dbt run fails with “Access denied to external volume.”

Solution: Grant the Snowflake role access to the external volume:

```sql
GRANT USAGE ON EXTERNAL VOLUME my_s3_volume TO ROLE transformer_role;
GRANT READ, WRITE ON EXTERNAL VOLUME my_s3_volume TO ROLE transformer_role;
```

Pitfall 2: Mixing Managed and Unmanaged Tables

Problem: You create a table with CATALOG = 'SNOWFLAKE' but try to write to it from Spark.

Solution: Decide upfront who “owns” the table metadata (the sketch after this list shows both creation forms):

  • Snowflake-managed: Only Snowflake writes; others read via Iceberg REST API
  • Externally-managed: Snowflake reads; writes via Spark/external engine
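
A sketch of the two creation forms (the names are placeholders, and the externally managed variant assumes the glue_catalog integration shown earlier):

```sql
-- Snowflake-managed: Snowflake owns the metadata and is the only writer
CREATE ICEBERG TABLE managed_orders (
    order_id   INT,
    order_date DATE
)
  CATALOG = 'SNOWFLAKE'
  EXTERNAL_VOLUME = 'my_s3_volume'
  BASE_LOCATION = 'warehouse/managed/orders/';

-- Externally managed: the external catalog owns the metadata; Snowflake reads it
CREATE ICEBERG TABLE external_orders
  EXTERNAL_VOLUME = 'my_s3_volume'
  CATALOG = 'glue_catalog'
  CATALOG_TABLE_NAME = 'orders';
```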

Pitfall 3: Schema Evolution Conflicts

Problem: dbt adds a column, but Spark doesn’t see it.

Solution: Ensure all engines refresh their metadata cache. In Snowflake:

```sql
ALTER ICEBERG TABLE my_table REFRESH;
```

When to Use Iceberg Tables

Good fit for:

  • Multi-engine architectures (Snowflake + Spark + Databricks)
  • Cost-sensitive storage (S3 is cheaper than Snowflake native)
  • Regulatory requirements for data portability
  • ML/AI workloads that need direct file access

Not recommended for:

  • Small tables (< 100GB) with Snowflake-only access
  • Real-time streaming (use native tables with Snowpipe Streaming)
  • Workloads requiring Snowflake-specific features (Time Travel > 7 days, etc.)

Conclusion

The combination of dbt 1.9+ and Snowflake’s Iceberg support makes the open lakehouse a practical reality. You no longer have to choose between SQL-native development (dbt) and open data formats (Iceberg)—you can have both.

Key Takeaways:

  • dbt supports Iceberg tables as first-class citizens via table_format: iceberg
  • External catalog integration enables true multi-engine architectures
  • Snowflake’s Iceberg performance is now within 5-10% of native tables
  • Plan your managed vs. unmanaged strategy upfront to avoid conflicts

Features in this area change regularly, with new capabilities added over time; review the dbt and Snowflake documentation for the latest details on what is included with each edition.
