❄️
Data Flakes


The “lakehouse” has been a compelling architectural vision for years: a single storage layer that serves both transactional and analytical workloads, removing the need to copy data between an OLTP database and a data warehouse. In practice, achieving this has required significant engineering effort, careful trade-offs, and often a degree of compromise on either transactional integrity or analytical performance.

Snowflake’s acquisition of Crunchy Data, combined with the pg_lake PostgreSQL extension, changes that calculus. In this article, we will discuss what pg_lake is, how it bridges PostgreSQL’s OLTP capabilities with Snowflake’s analytical engine, and what this architecture looks like in practice.

Features change over time, with new capabilities added regularly; review the documentation for the latest details on which features are included with each edition.

The Problem: OLTP and OLAP Do Not Mix Well#

Traditional architectures handle transactional and analytical workloads by separating them completely. A PostgreSQL or Oracle database manages day-to-day transactions; a nightly or near-real-time ETL process copies that data into a warehouse for analysis.

This separation introduces several problems:

  • Latency: Analytical queries run against data that is hours (or days) old
  • Data duplication: The same data exists in multiple locations with associated storage costs
  • Pipeline complexity: ETL jobs, schedulers, and error handling add operational overhead
  • Schema coupling: Changes to the source schema can break downstream pipelines

The lakehouse vision solves this by writing directly to open-format storage (Apache Iceberg) that both the OLTP engine and the analytical engine can query natively.

What is pg_lake?#

pg_lake is a PostgreSQL extension developed by Snowflake that enables read and write access to Apache Iceberg tables directly from a PostgreSQL instance. Combined with Snowflake Postgres — the enterprise-grade PostgreSQL offering built on the Crunchy Data acquisition — it creates a unified storage architecture where:

  • PostgreSQL handles fast, row-oriented transactional writes
  • Iceberg tables in S3 serve as the shared storage layer
  • Snowflake reads those Iceberg tables for analytical queries without any data movement

```mermaid
graph TD
    subgraph OLTP[Transactional Layer]
        PG[Snowflake Postgres<br/>OLTP engine] --> HEAP[PostgreSQL heap tables<br/>fast row writes]
        PG --> PGLAKE[pg_lake extension<br/>Iceberg write capability]
    end
    subgraph Storage[Shared Storage Layer]
        S3[(AWS S3<br/>Iceberg parquet files)]
    end
    subgraph OLAP[Analytical Layer]
        SF[Snowflake Warehouse<br/>OLAP engine] --> CAT[Catalog integration<br/>and external volume]
    end
    HEAP -->|Bulk sync via CTE| PGLAKE
    PGLAKE --> S3
    CAT --> S3
```

How the Architecture Works#

The PostgreSQL Instance#

Snowflake Postgres operates as a standalone OLTP engine. You provision an instance via SQL, specifying compute family and storage allocation. Network policies control client access, and the instance is typically available within five to ten minutes of provisioning.

The PostgreSQL instance manages two types of tables:

  • Heap tables: Standard PostgreSQL row-store tables, optimised for fast individual inserts and updates — ideal for transactional workloads such as point-of-sale systems, order management, or event ingestion
  • Iceberg tables (via pg_lake): Column-oriented tables written to S3 in Apache Iceberg format, queryable from Snowflake
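
As a rough sketch, assuming pg_lake exposes Iceberg as a table access method (the `USING iceberg` clause) and using illustrative table and column names, the two table types might be declared side by side like this:

```sql
-- Illustrative DDL; names and pg_lake options are assumptions
CREATE TABLE transactions (
    id          bigint GENERATED ALWAYS AS IDENTITY,
    store_id    int,
    amount      numeric(10,2),
    created_at  timestamptz DEFAULT now(),
    synced      boolean DEFAULT false
);  -- standard heap table: absorbs fast row-oriented OLTP writes

CREATE TABLE iceberg_transactions (
    id          bigint,
    store_id    int,
    amount      numeric(10,2),
    created_at  timestamptz
) USING iceberg;  -- written to S3 in Apache Iceberg format via pg_lake
```

Check the pg_lake documentation for the exact DDL options it supports, such as specifying the S3 location for the Iceberg table.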

The Synchronisation Pattern#

Rather than writing every individual transaction directly to Iceberg (which would be inefficient for high-throughput OLTP), the recommended pattern uses a bulk synchronisation approach:
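
A minimal sketch of this pattern, assuming the hypothetical `transactions` heap table carries a `synced` boolean flag, might look like:

```sql
-- Illustrative bulk sync; table and column names are assumptions.
-- The UPDATE ... RETURNING feeds the INSERT, so flagging and copying
-- happen in a single atomic statement.
WITH to_sync AS (
    UPDATE transactions
    SET synced = true
    WHERE synced = false
    RETURNING id, store_id, amount, created_at
)
INSERT INTO iceberg_transactions (id, store_id, amount, created_at)
SELECT id, store_id, amount, created_at
FROM to_sync;
```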

Explanation: This CTE pattern atomically moves unsynced rows from the PostgreSQL heap table into the Iceberg table and marks them as synced. Synchronisation runs every 500 to 1,000 rows, keeping latency low while avoiding the overhead of per-row Iceberg writes. Bulk inserts of thousands of rows complete in under two seconds.

Storage Integration#

The Iceberg tables written by pg_lake sit in AWS S3. Snowflake connects to this S3 bucket via a storage integration and external volume:
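
The Snowflake side of this setup might look roughly as follows; the integration and volume names, IAM role ARN, and bucket path are all placeholders, and the exact parameters should be checked against the Snowflake documentation:

```sql
-- Placeholder names, ARN, and bucket
CREATE STORAGE INTEGRATION pg_lake_s3_int
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/pg-lake-read'
  STORAGE_ALLOWED_LOCATIONS = ('s3://my-lake-bucket/iceberg/');

CREATE EXTERNAL VOLUME pg_lake_vol
  STORAGE_LOCATIONS = ((
      NAME = 'pg-lake-s3'
      STORAGE_PROVIDER = 'S3'
      STORAGE_BASE_URL = 's3://my-lake-bucket/iceberg/'
      STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/pg-lake-read'
  ));
```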

Explanation: The storage integration creates the trust relationship between Snowflake and the IAM role that has S3 read access. The external volume registers the physical storage location that Iceberg metadata files reference.

Once registered, Snowflake queries the Iceberg tables directly:

```sql
-- Query Iceberg tables written by PostgreSQL via pg_lake
SELECT
    store_id,
    SUM(amount) AS total_revenue,
    COUNT(*) AS transaction_count
FROM iceberg_transactions
WHERE created_at >= CURRENT_DATE - 7
GROUP BY store_id
ORDER BY total_revenue DESC;
```

Why This Matters#

True Zero-Copy Architecture#

Snowflake reads the Iceberg parquet files in S3 directly — the data is not copied into Snowflake’s internal storage. This eliminates storage duplication and removes the ETL pipeline between the transactional and analytical layers entirely.

ACID Compliance Throughout#

pg_lake preserves ACID properties when writing Iceberg tables. Snowflake’s reads are consistent with Iceberg’s snapshot isolation — you query a consistent version of the data without locking the PostgreSQL instance.

Vendor-Neutral Storage#

Because the data is stored as Apache Iceberg on S3, it is not locked into either Snowflake or PostgreSQL. Other engines — Spark, Trino, Databricks — can read the same Iceberg tables without any additional pipeline.

Use Cases#

This architecture suits situations where transactional freshness and analytical depth are both requirements:

  • Retail: Point-of-sale transactions written to PostgreSQL, available in Snowflake for dashboard queries within seconds
  • Financial services: Payment events captured transactionally, joined with historical data for real-time risk scoring
  • IoT and sensor data: High-frequency event ingestion into PostgreSQL heap tables, synced to Iceberg for time-series analysis
  • SaaS platforms: Multi-tenant operational data in PostgreSQL, with shared analytical views in Snowflake

Best Practices#

From experience, I would recommend the following:

  1. Tune synchronisation frequency: The bulk sync pattern is efficient, but synchronisation interval should match your analytical latency requirements. Every 500 rows suits near-real-time dashboards; larger batches suit hourly reporting.

  2. Keep heap tables thin: Use the PostgreSQL heap table as a staging area for unsynced records, not as a long-term store. Move data to Iceberg promptly to keep heap table sizes manageable.

  3. Set IAM session duration appropriately: The storage integration requires a maximum IAM session duration of 12 hours. Failing to configure this correctly prevents pg_lake from accessing S3.

  4. Monitor S3 file counts: Frequent small syncs produce many small Iceberg files. Schedule periodic compaction jobs (via Snowflake or Spark) to maintain query performance.
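
If Spark is available, one way to run such a compaction job is Iceberg's built-in maintenance procedure; the catalog and table names below are illustrative:

```sql
-- Spark SQL, assuming a Spark catalog named `lake` is configured
-- to point at the same Iceberg tables
CALL lake.system.rewrite_data_files(table => 'public.iceberg_transactions');
```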

Common Pitfalls#

Pitfall: Writing Individual Rows Directly to Iceberg#

Problem: Inserting individual transactional rows directly into Iceberg tables rather than using the bulk sync pattern results in excessive small files and degraded analytical performance.

Solution: Use the CTE-based bulk sync pattern. Let PostgreSQL heap tables absorb the transactional write load and batch-sync to Iceberg on a schedule.

Pitfall: Querying Iceberg Tables Before Metadata Refresh#

Problem: Snowflake queries return stale results because Iceberg metadata has not been refreshed after new files are written by pg_lake.

Solution: Call ALTER ICEBERG TABLE <table_name> REFRESH after each sync cycle, or configure automatic metadata refresh via a Snowflake Task.
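
A sketch of the Task approach, with an assumed table name and an arbitrary one-minute schedule, might look like:

```sql
-- Illustrative serverless task; name and schedule are assumptions
CREATE TASK refresh_iceberg_transactions
  SCHEDULE = '1 MINUTE'
AS
  ALTER ICEBERG TABLE iceberg_transactions REFRESH;

ALTER TASK refresh_iceberg_transactions RESUME;  -- tasks start suspended
```

In practice, the schedule should match the sync cadence on the PostgreSQL side so that refreshes are neither wasted nor lagging.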

Conclusion#

The combination of Snowflake Postgres and pg_lake makes a genuinely open lakehouse architecture practical for production workloads. By separating the write path (PostgreSQL OLTP) from the read path (Snowflake OLAP) while sharing a common Iceberg storage layer on S3, organisations can eliminate the ETL pipeline between transactional and analytical systems — and do so without vendor lock-in.

Key Takeaways:

  • pg_lake is a PostgreSQL extension that enables native Iceberg table writes from Snowflake Postgres
  • The bulk sync pattern efficiently moves transactional data from heap tables to Iceberg without per-row overhead
  • Snowflake reads Iceberg tables in S3 directly — no data duplication, no ETL pipeline
  • The architecture supports ACID compliance throughout the stack
  • Apache Iceberg storage means the data remains portable to other engines

