Building a True Lakehouse with Snowflake Postgres and pg_lake
Snowflake Postgres and the pg_lake extension let you run OLTP workloads and analytical queries on the same Iceberg data layer — without copying data.
The “lakehouse” has been a compelling architectural vision for years: a single storage layer that serves both transactional and analytical workloads, removing the need to copy data between an OLTP database and a data warehouse. In practice, achieving this has required significant engineering effort, careful trade-offs, and often a degree of compromise on either transactional integrity or analytical performance.
Snowflake’s acquisition of Crunchy Data, combined with the pg_lake PostgreSQL extension, changes that calculus. In this article, we will discuss what pg_lake is, how it bridges PostgreSQL’s OLTP capabilities with Snowflake’s analytical engine, and what this architecture looks like in practice.
Features change from time to time, with new capabilities added regularly; review the documentation for the latest details on which features are included in each edition.
The Problem: OLTP and OLAP Do Not Mix Well
Traditional architectures handle transactional and analytical workloads by separating them completely. A PostgreSQL or Oracle database manages day-to-day transactions; a nightly or near-real-time ETL process copies that data into a warehouse for analysis.
This separation introduces several problems:
- Latency: Analytical queries run against data that is hours (or days) old
- Data duplication: The same data exists in multiple locations with associated storage costs
- Pipeline complexity: ETL jobs, schedulers, and error handling add operational overhead
- Schema coupling: Changes to the source schema can break downstream pipelines
The lakehouse vision solves this by writing directly to open-format storage (Apache Iceberg) that both the OLTP engine and the analytical engine can query natively.
What is pg_lake?
pg_lake is a PostgreSQL extension developed by Snowflake that enables read and write access to Apache Iceberg tables directly from a PostgreSQL instance. Combined with Snowflake Postgres — the enterprise-grade PostgreSQL offering built on the Crunchy Data acquisition — it creates a unified storage architecture where:
- PostgreSQL handles fast, row-oriented transactional writes
- Iceberg tables in S3 serve as the shared storage layer
- Snowflake reads those Iceberg tables for analytical queries without any data movement
How the Architecture Works
The PostgreSQL Instance
Snowflake Postgres operates as a standalone OLTP engine. You provision an instance via SQL, specifying compute family and storage allocation. Network policies control client access, and the instance is typically available within five to ten minutes of provisioning.
The PostgreSQL instance manages two types of tables:
- Heap tables: Standard PostgreSQL row-store tables, optimised for fast individual inserts and updates — ideal for transactional workloads such as point-of-sale systems, order management, or event ingestion
- Iceberg tables (via pg_lake): Column-oriented tables written to S3 in Apache Iceberg format, queryable from Snowflake
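To make the two table types concrete, here is a minimal sketch of how the pair of tables might be defined. The column definitions mirror the sync example later in this article; the `USING iceberg` clause assumes pg_lake exposes Iceberg through PostgreSQL's table access method interface, so verify the exact DDL syntax and storage-location options against the pg_lake documentation:

```sql
-- Heap table: absorbs high-frequency transactional writes
CREATE TABLE heap_transactions (
    transaction_id  bigint PRIMARY KEY,
    customer_id     bigint NOT NULL,
    store_id        int NOT NULL,
    amount          numeric(12,2) NOT NULL,
    created_at      timestamptz NOT NULL DEFAULT now(),
    synced          boolean NOT NULL DEFAULT FALSE
);

-- Iceberg table: columnar copy written to S3, readable by Snowflake.
-- The USING iceberg clause is illustrative; check the pg_lake docs
-- for the exact syntax and how the S3 location is configured.
CREATE TABLE iceberg_transactions (
    transaction_id  bigint,
    customer_id     bigint,
    store_id        int,
    amount          numeric(12,2),
    created_at      timestamptz
) USING iceberg;
```

Note the `synced` flag on the heap table: it is what the bulk synchronisation pattern below uses to track which rows have already been copied to Iceberg.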
The Synchronisation Pattern
Rather than writing every individual transaction directly to Iceberg (which would be inefficient for high-throughput OLTP), the recommended pattern uses a bulk synchronisation approach:
```sql
-- Sync unprocessed transactions from heap table to Iceberg in bulk
WITH synced AS (
    INSERT INTO iceberg_transactions
    SELECT
        transaction_id,
        customer_id,
        store_id,
        amount,
        created_at
    FROM heap_transactions
    WHERE synced = FALSE
    RETURNING transaction_id
)
UPDATE heap_transactions
SET synced = TRUE
WHERE transaction_id IN (SELECT transaction_id FROM synced);
```

Explanation: This CTE pattern atomically moves unsynced rows from the PostgreSQL heap table into the Iceberg table and marks them as synced. Running the synchronisation every 500 to 1,000 rows keeps latency low while avoiding the overhead of per-row Iceberg writes; bulk inserts of thousands of rows complete in under two seconds.
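One way to run this sync on a schedule from inside the database is the pg_cron extension. Whether pg_cron is available on Snowflake Postgres is an assumption to verify; the partial index is a standard PostgreSQL technique to keep the sync scan cheap as the heap table grows:

```sql
-- Partial index so each sync only scans the unsynced rows
CREATE INDEX IF NOT EXISTS heap_transactions_unsynced_idx
    ON heap_transactions (transaction_id)
    WHERE synced = FALSE;

-- Run the bulk sync once a minute (assumes pg_cron is installed)
SELECT cron.schedule('sync_to_iceberg', '* * * * *', $$
    WITH synced AS (
        INSERT INTO iceberg_transactions
        SELECT transaction_id, customer_id, store_id, amount, created_at
        FROM heap_transactions
        WHERE synced = FALSE
        RETURNING transaction_id
    )
    UPDATE heap_transactions
    SET synced = TRUE
    WHERE transaction_id IN (SELECT transaction_id FROM synced);
$$);
```

If pg_cron is not available, an external scheduler issuing the same statement over a regular connection achieves the same effect.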
Storage Integration
The Iceberg tables written by pg_lake sit in AWS S3. Snowflake connects to this S3 bucket via a storage integration and external volume:
```sql
-- Create storage integration linking Snowflake to the S3 bucket
CREATE OR REPLACE STORAGE INTEGRATION pglake_integration
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789:role/pglake-role'
  STORAGE_ALLOWED_LOCATIONS = ('s3://my-lakehouse-bucket/');

-- Create an external volume pointing to the Iceberg data
CREATE OR REPLACE EXTERNAL VOLUME pglake_volume
  STORAGE_LOCATIONS = (
    (
      NAME = 'pglake-s3'
      STORAGE_PROVIDER = 'S3'
      STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789:role/pglake-role'
      STORAGE_BASE_URL = 's3://my-lakehouse-bucket/'
    )
  );
```

Explanation: The storage integration creates the trust relationship between Snowflake and the IAM role that has S3 read access. The external volume registers the physical storage location that Iceberg metadata files reference.
Once registered, Snowflake queries the Iceberg tables directly:
```sql
-- Query Iceberg tables written by PostgreSQL via pg_lake
SELECT
    store_id,
    SUM(amount) AS total_revenue,
    COUNT(*) AS transaction_count
FROM iceberg_transactions
WHERE created_at >= CURRENT_DATE - 7
GROUP BY store_id
ORDER BY total_revenue DESC;
```

Why This Matters
True Zero-Copy Architecture
Snowflake reads the Iceberg parquet files in S3 directly — the data is not copied into Snowflake’s internal storage. This eliminates storage duplication and removes the ETL pipeline between the transactional and analytical layers entirely.
ACID Compliance Throughout
pg_lake preserves ACID properties when writing Iceberg tables. Snowflake’s reads are consistent with Iceberg’s snapshot isolation — you query a consistent version of the data without locking the PostgreSQL instance.
Vendor-Neutral Storage
Because the data is stored as Apache Iceberg on S3, it is not locked into either Snowflake or PostgreSQL. Other engines — Spark, Trino, Databricks — can read the same Iceberg tables without any additional pipeline.
Use Cases
This architecture suits situations where transactional freshness and analytical depth are both requirements:
- Retail: Point-of-sale transactions written to PostgreSQL, available in Snowflake for dashboard queries within seconds
- Financial services: Payment events captured transactionally, joined with historical data for real-time risk scoring
- IoT and sensor data: High-frequency event ingestion into PostgreSQL heap tables, synced to Iceberg for time-series analysis
- SaaS platforms: Multi-tenant operational data in PostgreSQL, with shared analytical views in Snowflake
Best Practices
From experience, I recommend the following:
- Tune synchronisation frequency: The bulk sync pattern is efficient, but the synchronisation interval should match your analytical latency requirements. Syncing every 500 rows suits near-real-time dashboards; larger batches suit hourly reporting.
- Keep heap tables thin: Use the PostgreSQL heap table as a staging area for unsynced records, not as a long-term store. Move data to Iceberg promptly to keep heap table sizes manageable.
- Set the IAM session duration appropriately: The storage integration requires a maximum IAM session duration of 12 hours. Failing to configure this correctly prevents pg_lake from accessing S3.
- Monitor S3 file counts: Frequent small syncs produce many small Iceberg files. Schedule periodic compaction jobs (via Snowflake or Spark) to maintain query performance.
Common Pitfalls
Pitfall: Writing Individual Rows Directly to Iceberg
Problem: Inserting individual transactional rows directly into Iceberg tables rather than using the bulk sync pattern results in excessive small files and degraded analytical performance.
Solution: Use the CTE-based bulk sync pattern. Let PostgreSQL heap tables absorb the transactional write load and batch-sync to Iceberg on a schedule.
Pitfall: Querying Iceberg Tables Before Metadata Refresh
Problem: Snowflake queries return stale results because Iceberg metadata has not been refreshed after new files are written by pg_lake.
Solution: Call ALTER ICEBERG TABLE <table_name> REFRESH after each sync cycle, or configure automatic metadata refresh via a Snowflake Task.
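The Task-based option can be sketched as follows. The task and table names are illustrative; omitting the WAREHOUSE parameter makes this a serverless task, and the schedule should be tuned to your sync cadence:

```sql
-- Illustrative task: refresh Iceberg metadata every minute so
-- Snowflake queries see files newly written by pg_lake
CREATE OR REPLACE TASK refresh_iceberg_transactions
  SCHEDULE = '1 MINUTE'
AS
  ALTER ICEBERG TABLE iceberg_transactions REFRESH;

-- Tasks are created suspended; start this one explicitly
ALTER TASK refresh_iceberg_transactions RESUME;
```

Pairing the task schedule with the sync interval (for example, refreshing slightly more often than you sync) keeps the window for stale reads small without wasting refresh cycles.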
Conclusion
The combination of Snowflake Postgres and pg_lake makes a genuinely open lakehouse architecture practical for production workloads. By separating the write path (PostgreSQL OLTP) from the read path (Snowflake OLAP) while sharing a common Iceberg storage layer on S3, organisations can eliminate the ETL pipeline between transactional and analytical systems — and do so without vendor lock-in.
Key Takeaways:
- pg_lake is a PostgreSQL extension that enables native Iceberg table writes from Snowflake Postgres
- The bulk sync pattern efficiently moves transactional data from heap tables to Iceberg without per-row overhead
- Snowflake reads Iceberg tables in S3 directly — no data duplication, no ETL pipeline
- The architecture supports ACID compliance throughout the stack
- Apache Iceberg storage means the data remains portable to other engines