❄️ Data Flakes


The promise of the “open lakehouse” has always been seductive: store data in open formats, query it from anywhere, and escape vendor lock-in. The reality, until recently, was a mess of manual metadata management, inconsistent tooling, and “it works in theory” architectures.

That’s changed. With dbt 1.9+ and Snowflake’s expanded Iceberg support, building production-grade open lakehouses is not just possible—it’s practical.

The Evolution of Iceberg on Snowflake

Apache Iceberg tables became generally available on Snowflake in June 2024. Since then, the ecosystem has matured rapidly:

  • July 2025: Write support for externally managed Iceberg tables and catalog-linked databases reached preview
  • dbt 1.9: Native support for Iceberg table materialisation
  • dbt 1.11: Support for Glue catalog integration via catalog-linked databases

This means you can now use dbt to create and transform Iceberg tables that register metadata in external catalogs—not just Snowflake’s internal catalog.

Why This Matters

The Old Way: Snowflake as a Silo

```text
[Source Data] → [Fivetran] → [Snowflake Native Tables] → [dbt] → [Snowflake Native Tables]

                              [Export to S3 for Spark]
```

Data lives in Snowflake’s proprietary format. To use it elsewhere (Spark ML jobs, Databricks notebooks), you need to export it.

The New Way: Snowflake as a Lakehouse Citizen

```text
[Source Data] → [Ingestion] → [S3 Iceberg Tables] ← [Snowflake + dbt]

                              [Spark / Databricks / Trino]
```

The same Iceberg table is readable and writable by multiple engines. No export pipelines. No data copies.

Configuring dbt for Iceberg

Basic Iceberg Table Materialisation

In your dbt_project.yml:

```yaml
models:
  my_project:
    marts:
      +materialized: table
      +table_format: iceberg
      +external_volume: my_s3_volume
      +base_location: warehouse/marts/
```

Or at the model level in schema.yml:

```yaml
models:
  - name: dim_customer
    config:
      materialized: table
      table_format: iceberg
      external_volume: my_s3_volume
      base_location: warehouse/dims/customers/
```

Incremental Iceberg Models

Iceberg tables support incremental materialisation with dbt's merge strategy.
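
A minimal sketch of what such a model might look like, reusing the external volume and base-location conventions from the earlier examples (the table and column names are illustrative):

```sql
-- models/marts/fct_orders.sql (hypothetical model)
{{
    config(
        materialized='incremental',
        incremental_strategy='merge',
        unique_key='order_id',
        table_format='iceberg',
        external_volume='my_s3_volume',
        base_location='warehouse/marts/orders/'
    )
}}

SELECT
    order_id,
    customer_id,
    order_date,
    order_total
FROM {{ ref('stg_orders') }}

{% if is_incremental() %}
  -- On incremental runs, only pick up rows newer than what the target already holds
  WHERE order_date > (SELECT MAX(order_date) FROM {{ this }})
{% endif %}
```

On the first run dbt builds the full table; on subsequent runs new rows are merged into the target on order_id.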

Key Point: The merge strategy leverages Iceberg’s ACID transactions, ensuring data consistency even when multiple processes access the table.

External Catalog Integration

The real power emerges when you connect dbt to external Iceberg catalogs.

Option 1: Snowflake Open Catalog (Polaris)

Snowflake Open Catalog is Snowflake's managed service for Apache Polaris, an open-source Iceberg REST catalog. To use it from Snowflake, create a catalog integration that points at its REST endpoint.
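
The exact statement depends on your Open Catalog account and authentication setup; the following is only a sketch of one common shape, with placeholder URIs, names, and credentials (verify the parameter names against the current Snowflake documentation):

```sql
-- Hypothetical identifiers; check current docs for the exact syntax
CREATE CATALOG INTEGRATION open_catalog_int
  CATALOG_SOURCE = POLARIS
  TABLE_FORMAT = ICEBERG
  CATALOG_NAMESPACE = 'analytics'
  REST_CONFIG = (
    CATALOG_URI = 'https://<org>-<account>.snowflakecomputing.com/polaris/api/catalog'
    WAREHOUSE = 'prod_open_catalog'
  )
  REST_AUTHENTICATION = (
    TYPE = OAUTH
    OAUTH_CLIENT_ID = '<client_id>'
    OAUTH_CLIENT_SECRET = '<client_secret>'
    OAUTH_ALLOWED_SCOPES = ('PRINCIPAL_ROLE:ALL')
  )
  ENABLED = TRUE;
```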

Option 2: AWS Glue Catalog

For AWS-centric architectures, link to Glue:

```sql
-- Create catalog-linked database
CREATE DATABASE glue_catalog_db
    FROM CATALOG INTEGRATION glue_catalog;
```
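
Creating the linked database assumes a Glue catalog integration named glue_catalog already exists. A sketch of what creating one might look like, with placeholder account, role, and region values (the IAM role needs access to both the Glue Data Catalog and the underlying S3 data):

```sql
-- Hypothetical identifiers; adjust to your AWS account and region
CREATE CATALOG INTEGRATION glue_catalog
  CATALOG_SOURCE = GLUE
  TABLE_FORMAT = ICEBERG
  CATALOG_NAMESPACE = 'analytics'
  GLUE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-glue-access'
  GLUE_CATALOG_ID = '123456789012'
  GLUE_REGION = 'eu-west-2'
  ENABLED = TRUE;
```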

In dbt, reference tables from the linked database:

```sql
-- models/staging/stg_external_events.sql
SELECT *
FROM {{ source('glue_catalog_db', 'raw_events') }}
WHERE event_date >= CURRENT_DATE - 7
```
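
For that source() call to resolve, dbt also needs a source definition pointing at the linked database; a minimal sketch, where the schema (the Glue namespace) is an assumption:

```yaml
# models/staging/_sources.yml (hypothetical)
version: 2

sources:
  - name: glue_catalog_db
    database: glue_catalog_db
    schema: events            # the Glue namespace exposed through the linked database
    tables:
      - name: raw_events
```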

Note: As of dbt 1.11, incremental materialisation is not yet supported for Glue catalog tables. Use full table refreshes or manage incremental logic externally.

Practical Architecture: Multi-Engine Analytics

Here’s a production pattern I’ve seen work well:

The Setup

```text
┌─────────────────────────────────────────────────────────────────┐
│                        S3 Data Lake                             │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐             │
│  │ raw/        │  │ staging/    │  │ marts/      │             │
│  │ (Iceberg)   │  │ (Iceberg)   │  │ (Iceberg)   │             │
│  └─────────────┘  └─────────────┘  └─────────────┘             │
└─────────────────────────────────────────────────────────────────┘
         ↑                 ↑                 ↑
         │                 │                 │
    ┌────┴────┐      ┌─────┴─────┐     ┌────┴────┐
    │ Spark   │      │ Snowflake │     │ Trino   │
    │ (Ingest)│      │ (dbt)     │     │ (Ad-hoc)│
    └─────────┘      └───────────┘     └─────────┘
```

Workflow

  1. Spark ingests raw data into Iceberg tables (high-volume, cost-effective)
  2. Snowflake + dbt transforms raw → staging → marts (SQL-native, governed)
  3. Trino/Databricks runs ad-hoc ML workloads against the same marts (no data movement)

The dbt Side

```yaml
# dbt_project.yml
models:
  lakehouse_project:
    staging:
      +materialized: view  # Views for staging (reference raw Iceberg)
    marts:
      +materialized: table
      +table_format: iceberg
      +external_volume: prod_lake_volume
```

Performance Considerations

Compaction

Iceberg tables accumulate small files over time. Snowflake automatically compacts Snowflake-managed Iceberg tables, but for externally managed tables, you need to handle this yourself (via Spark or a scheduled job).
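
One common approach for externally managed tables is to schedule Iceberg's table-maintenance procedures from Spark; a sketch using the rewrite_data_files procedure (the catalog and table names are placeholders):

```sql
-- Run from a Spark session with the Iceberg SQL extensions enabled
CALL glue_catalog.system.rewrite_data_files(
  table   => 'marts.fct_orders',
  options => map('target-file-size-bytes', '134217728')  -- aim for roughly 128 MB files
);
```

Pairing this with the expire_snapshots procedure keeps metadata and storage growth in check.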

Clustering

For large Iceberg tables, define clustering keys:

```sql
{{
    config(
        materialized='table',
        table_format='iceberg',
        cluster_by=['order_date', 'region']
    )
}}
```

Query Pruning

Iceberg’s manifest files contain min/max statistics for each data file. Snowflake uses these for partition pruning. Ensure your queries filter on columns with good selectivity.
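
As a rough illustration against a hypothetical fct_orders table: a plain range predicate on a well-clustered column can be checked against per-file min/max statistics, while wrapping that column in an expression often prevents those statistics from being used.

```sql
-- Can skip data files whose order_date range falls outside the predicate
SELECT COUNT(*) FROM fct_orders WHERE order_date >= '2025-01-01';

-- Often scans far more files: the expression hides the raw column from the min/max stats
SELECT COUNT(*) FROM fct_orders WHERE TO_VARCHAR(order_date, 'YYYY-MM') = '2025-01';
```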

Common Pitfalls

Pitfall 1: Forgetting External Volume Permissions

Problem: dbt run fails with “Access denied to external volume.”

Solution: Grant the Snowflake role access to the external volume:

```sql
GRANT USAGE ON EXTERNAL VOLUME my_s3_volume TO ROLE transformer_role;
GRANT READ, WRITE ON EXTERNAL VOLUME my_s3_volume TO ROLE transformer_role;
```

Pitfall 2: Mixing Managed and Unmanaged Tables

Problem: You create a table with CATALOG = 'SNOWFLAKE' but try to write to it from Spark.

Solution: Decide upfront who “owns” the table metadata (the sketch after this list shows both creation forms):

  • Snowflake-managed: Only Snowflake writes; others read via Iceberg REST API
  • Externally-managed: Snowflake reads; writes via Spark/external engine
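
A sketch of the two creation forms (the names are placeholders, and the externally managed variant assumes the glue_catalog integration shown earlier):

```sql
-- Snowflake-managed: Snowflake owns the metadata and is the only writer
CREATE ICEBERG TABLE managed_orders (
    order_id   INT,
    order_date DATE
)
  CATALOG = 'SNOWFLAKE'
  EXTERNAL_VOLUME = 'my_s3_volume'
  BASE_LOCATION = 'warehouse/managed/orders/';

-- Externally managed: the external catalog owns the metadata; Snowflake reads it
CREATE ICEBERG TABLE external_orders
  EXTERNAL_VOLUME = 'my_s3_volume'
  CATALOG = 'glue_catalog'
  CATALOG_TABLE_NAME = 'orders';
```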

Pitfall 3: Schema Evolution Conflicts

Problem: dbt adds a column, but Spark doesn’t see it.

Solution: Ensure all engines refresh their metadata cache. In Snowflake:

```sql
ALTER ICEBERG TABLE my_table REFRESH;
```

When to Use Iceberg Tables

Good fit for:

  • Multi-engine architectures (Snowflake + Spark + Databricks)
  • Cost-sensitive storage (S3 is cheaper than Snowflake native)
  • Regulatory requirements for data portability
  • ML/AI workloads that need direct file access

Not recommended for:

  • Small tables (< 100GB) with Snowflake-only access
  • Real-time streaming (use native tables with Snowpipe Streaming)
  • Workloads requiring Snowflake-specific features (Time Travel > 7 days, etc.)

Conclusion

The combination of dbt 1.9+ and Snowflake’s Iceberg support makes the open lakehouse a practical reality. You no longer have to choose between SQL-native development (dbt) and open data formats (Iceberg)—you can have both.

Key Takeaways:

  • dbt supports Iceberg tables as first-class citizens via table_format: iceberg
  • External catalog integration enables true multi-engine architectures
  • Snowflake’s Iceberg performance is now within 5-10% of native tables
  • Plan your managed vs. unmanaged strategy upfront to avoid conflicts

Features in this area change regularly, with new capabilities added over time; review the dbt and Snowflake documentation for the latest details on what is included with each edition.
