dbt and Iceberg Tables: Building the Open Lakehouse on Snowflake
How dbt 1.9+ transforms Iceberg table development on Snowflake. From external catalogs to bi-directional writes, the open lakehouse is now production-ready.
The promise of the “open lakehouse” has always been seductive: store data in open formats, query it from anywhere, and escape vendor lock-in. The reality, until recently, was a mess of manual metadata management, inconsistent tooling, and “it works in theory” architectures.
That’s changed. With dbt 1.9+ and Snowflake’s expanded Iceberg support, building production-grade open lakehouses is not just possible—it’s practical.
The Evolution of Iceberg on Snowflake#
Apache Iceberg tables became generally available on Snowflake in June 2024. Since then, the ecosystem has matured rapidly:
- July 2025: Write support for externally managed Iceberg tables and catalog-linked databases reached preview
- dbt 1.9: Native support for Iceberg table materialisation
- dbt 1.11: Support for Glue catalog integration via catalog-linked databases
This means you can now use dbt to create and transform Iceberg tables that register metadata in external catalogs—not just Snowflake’s internal catalog.
Why This Matters#
The Old Way: Snowflake as a Silo#
[Source Data] → [Fivetran] → [Snowflake Native Tables] → [dbt] → [Snowflake Native Tables]
                                                                             ↓
                                                                 [Export to S3 for Spark]

Data lives in Snowflake’s proprietary format. To use it elsewhere (Spark ML jobs, Databricks notebooks), you need to export it.
The New Way: Snowflake as a Lakehouse Citizen#
[Source Data] → [Ingestion] → [S3 Iceberg Tables] ← [Snowflake + dbt]
                                       ↑
                         [Spark / Databricks / Trino]

The same Iceberg table is readable and writable by multiple engines. No export pipelines. No data copies.
Configuring dbt for Iceberg#
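The examples below assume an external volume named my_s3_volume already exists in Snowflake. If you are starting from scratch, the setup looks roughly like this (a minimal sketch; the bucket, IAM role ARN, and external ID are placeholders, so check the Snowflake documentation for the parameters your cloud setup needs):

-- Illustrative only: S3 external volume that Iceberg tables will write into
CREATE EXTERNAL VOLUME my_s3_volume
  STORAGE_LOCATIONS = (
    (
      NAME = 'my-s3-lake'
      STORAGE_PROVIDER = 'S3'
      STORAGE_BASE_URL = 's3://my-data-lake-bucket/iceberg/'
      STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-iceberg-role'
      STORAGE_AWS_EXTERNAL_ID = 'my_external_id'
    )
  );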
Basic Iceberg Table Materialisation#
In your dbt_project.yml:
models:
  my_project:
    marts:
      +materialized: table
      +table_format: iceberg
      +external_volume: my_s3_volume
      +base_location: warehouse/marts/

Or at the model level in schema.yml:
models:
  - name: dim_customer
    config:
      materialized: table
      table_format: iceberg
      external_volume: my_s3_volume
      base_location: warehouse/dims/customers/

Incremental Iceberg Models#
Iceberg tables support incremental materialisation with merge operations:
-- models/marts/fct_orders.sql
{{
  config(
    materialized='incremental',
    table_format='iceberg',
    external_volume='my_s3_volume',
    base_location='warehouse/facts/orders/',
    unique_key='order_id',
    incremental_strategy='merge'
  )
}}

SELECT
  order_id,
  customer_id,
  order_date,
  order_total,
  updated_at
FROM {{ ref('stg_orders') }}
{% if is_incremental() %}
WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}

Key Point: The merge strategy leverages Iceberg’s ACID transactions, ensuring data consistency even when multiple processes access the table.
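For intuition, on incremental runs dbt compiles the model into a merge against the existing table. The generated statement is roughly along these lines (an illustrative sketch, not the exact SQL dbt emits; the analytics schema is a placeholder):

-- Roughly what the merge incremental strategy does under the hood
MERGE INTO analytics.fct_orders AS target
USING (
  SELECT order_id, customer_id, order_date, order_total, updated_at
  FROM analytics.stg_orders
  WHERE updated_at > (SELECT MAX(updated_at) FROM analytics.fct_orders)
) AS source
  ON target.order_id = source.order_id  -- unique_key
WHEN MATCHED THEN UPDATE SET
  customer_id = source.customer_id,
  order_date  = source.order_date,
  order_total = source.order_total,
  updated_at  = source.updated_at
WHEN NOT MATCHED THEN INSERT (order_id, customer_id, order_date, order_total, updated_at)
  VALUES (source.order_id, source.customer_id, source.order_date, source.order_total, source.updated_at);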
External Catalog Integration#
The real power emerges when you connect dbt to external Iceberg catalogs.
Option 1: Snowflake Open Catalog (Polaris)#
Snowflake Open Catalog (formerly Polaris Catalog) is Snowflake’s managed service for Apache Polaris, an open-source implementation of the Iceberg REST catalog. Configure it in Snowflake:
-- Create catalog integration
CREATE OR REPLACE CATALOG INTEGRATION polaris_catalog
  CATALOG_SOURCE = POLARIS
  TABLE_FORMAT = ICEBERG
  CATALOG_NAMESPACE = 'my_namespace'
  REST_CONFIG = (
    CATALOG_URI = 'https://my-polaris-instance.example.com/api/catalog'
    WAREHOUSE = 'my_warehouse'
  )
  REST_AUTHENTICATION = (
    TYPE = OAUTH
    OAUTH_CLIENT_ID = 'my_client_id'
    OAUTH_CLIENT_SECRET = 'my_secret'
    OAUTH_ALLOWED_SCOPES = ('PRINCIPAL_ROLE:ALL')
  )
  ENABLED = TRUE;

Option 2: AWS Glue Catalog#
For AWS-centric architectures, link to Glue:
-- Create catalog-linked database
CREATE DATABASE glue_catalog_db
  FROM CATALOG INTEGRATION glue_catalog;

In dbt, reference tables from the linked database:
-- models/staging/stg_external_events.sql
SELECT *
FROM {{ source('glue_catalog_db', 'raw_events') }}
WHERE event_date >= CURRENT_DATE - 7

Note: As of dbt 1.11, incremental materialisation is not yet supported for Glue catalog tables. Use full table refreshes or manage incremental logic externally.
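For completeness, the glue_catalog integration referenced above is created separately. A sketch of what that might look like (the account ID, role ARN, region, and namespace are placeholders; verify the parameter names against the current Snowflake documentation):

-- Illustrative sketch: catalog integration pointing at an AWS Glue Data Catalog
CREATE CATALOG INTEGRATION glue_catalog
  CATALOG_SOURCE = GLUE
  TABLE_FORMAT = ICEBERG
  CATALOG_NAMESPACE = 'analytics'  -- Glue database to expose
  GLUE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-glue-role'
  GLUE_CATALOG_ID = '123456789012'  -- AWS account that owns the Glue catalog
  GLUE_REGION = 'eu-west-1'
  ENABLED = TRUE;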
Practical Architecture: Multi-Engine Analytics#
Here’s a production pattern I’ve seen work well:
The Setup#
┌─────────────────────────────────────────────────────────────────┐
│                          S3 Data Lake                           │
│   ┌─────────────┐    ┌─────────────┐    ┌─────────────┐         │
│   │    raw/     │    │  staging/   │    │   marts/    │         │
│   │  (Iceberg)  │    │  (Iceberg)  │    │  (Iceberg)  │         │
│   └─────────────┘    └─────────────┘    └─────────────┘         │
└─────────────────────────────────────────────────────────────────┘
           ↑                  ↑                  ↑
           │                  │                  │
      ┌────┴────┐       ┌─────┴─────┐       ┌────┴────┐
      │  Spark  │       │ Snowflake │       │  Trino  │
      │ (Ingest)│       │   (dbt)   │       │ (Ad-hoc)│
      └─────────┘       └───────────┘       └─────────┘

Workflow#
- Spark ingests raw data into Iceberg tables (high-volume, cost-effective; a Spark SQL sketch follows this list)
- Snowflake + dbt transforms raw → staging → marts (SQL-native, governed)
- Trino/Databricks runs ad-hoc queries and ML workloads against the same marts (no data movement)
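Step 1 happens outside dbt entirely. As a sketch, the Spark side might register raw events as an Iceberg table in the shared catalog like this (lake_catalog, the schema, and the columns are assumptions for illustration):

-- Spark SQL (illustrative): land raw events as an Iceberg table in the shared catalog
CREATE TABLE IF NOT EXISTS lake_catalog.raw.events (
  event_id   STRING,
  event_type STRING,
  event_ts   TIMESTAMP,
  payload    STRING
) USING iceberg
PARTITIONED BY (days(event_ts));

Snowflake and dbt then read and transform the same table through the catalog integration; no copy is made.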
The dbt Side#
# dbt_project.yml
models:
  lakehouse_project:
    staging:
      +materialized: view  # Views for staging (reference raw Iceberg)
    marts:
      +materialized: table
      +table_format: iceberg
      +external_volume: prod_lake_volume

Performance Considerations#
Compaction#
Iceberg tables accumulate small files over time. Snowflake automatically compacts Snowflake-managed Iceberg tables, but for externally managed tables, you need to handle this yourself (via Spark or a scheduled job).
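A common approach is a scheduled Spark job that calls Iceberg’s maintenance procedures; a minimal sketch, assuming the table is reachable through a Spark catalog named lake_catalog:

-- Spark SQL (illustrative): compact small files and expire old snapshots
CALL lake_catalog.system.rewrite_data_files(table => 'marts.fct_orders');
CALL lake_catalog.system.expire_snapshots(table => 'marts.fct_orders', retain_last => 10);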
Clustering#
For large Iceberg tables, define clustering keys:
{{
  config(
    materialized='table',
    table_format='iceberg',
    cluster_by=['order_date', 'region']
  )
}}

Query Pruning#
Iceberg’s manifest files contain min/max statistics for each data file. Snowflake uses these for partition pruning. Ensure your queries filter on columns with good selectivity.
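As a quick illustration, a predicate on the raw column can be answered from file-level statistics, while wrapping the column in a function defeats pruning (table and column names reuse the fct_orders example above):

-- Prunes well: min/max stats on order_date let whole data files be skipped
SELECT SUM(order_total)
FROM fct_orders
WHERE order_date >= '2025-01-01';

-- Prunes poorly: the function hides order_date from file-level statistics
SELECT SUM(order_total)
FROM fct_orders
WHERE TO_CHAR(order_date, 'YYYY-MM') = '2025-01';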
Common Pitfalls#
Pitfall 1: Forgetting External Volume Permissions#
Problem: dbt run fails with “Access denied to external volume.”
Solution: Grant the Snowflake role access to the external volume:
GRANT USAGE ON EXTERNAL VOLUME my_s3_volume TO ROLE transformer_role;
GRANT READ, WRITE ON EXTERNAL VOLUME my_s3_volume TO ROLE transformer_role;

Pitfall 2: Mixing Managed and Unmanaged Tables#
Problem: You create a table with CATALOG = 'SNOWFLAKE' but try to write to it from Spark.
Solution: Decide upfront who “owns” the table metadata; the DDL sketch after this list contrasts the two modes:
- Snowflake-managed: Only Snowflake writes; others read via Iceberg REST API
- Externally-managed: Snowflake reads; writes via Spark/external engine
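A minimal DDL contrast, assuming the external volume and catalog integration from earlier (the table names and BASE_LOCATION are placeholders):

-- Snowflake-managed: Snowflake owns the metadata and is the only writer
CREATE ICEBERG TABLE managed_orders (
  order_id NUMBER,
  order_total NUMBER(10,2)
)
  CATALOG = 'SNOWFLAKE'
  EXTERNAL_VOLUME = 'my_s3_volume'
  BASE_LOCATION = 'warehouse/managed_orders/';

-- Externally managed: metadata lives in the external catalog; Snowflake reads it
CREATE ICEBERG TABLE external_orders
  EXTERNAL_VOLUME = 'my_s3_volume'
  CATALOG = 'glue_catalog'
  CATALOG_TABLE_NAME = 'orders';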
Pitfall 3: Schema Evolution Conflicts#
Problem: dbt adds a column, but Spark doesn’t see it.
Solution: Ensure all engines refresh their metadata cache. In Snowflake:
ALTER ICEBERG TABLE my_table REFRESH;

When to Use Iceberg Tables#
Good fit for:
- Multi-engine architectures (Snowflake + Spark + Databricks)
- Cost-sensitive storage (S3 is cheaper than Snowflake native)
- Regulatory requirements for data portability
- ML/AI workloads that need direct file access
Not recommended for:
- Small tables (< 100GB) with Snowflake-only access
- Real-time streaming (use native tables with Snowpipe Streaming)
- Workloads requiring Snowflake-specific features (Time Travel > 7 days, etc.)
Conclusion#
The combination of dbt 1.9+ and Snowflake’s Iceberg support makes the open lakehouse a practical reality. You no longer have to choose between SQL-native development (dbt) and open data formats (Iceberg)—you can have both.
Key Takeaways:
- dbt supports Iceberg tables as first-class citizens via table_format: iceberg
- External catalog integration enables true multi-engine architectures
- Snowflake’s Iceberg performance is now within 5-10% of native tables
- Plan your managed vs. unmanaged strategy upfront to avoid conflicts
Features change from time to time, with new capabilities added regularly; it is recommended that you review the documentation for the latest on what specific features are included with any of the Editions.