❄️
Data Flakes


The “lakehouse” has been a compelling architectural vision for years: a single storage layer that serves both transactional and analytical workloads, removing the need to copy data between an OLTP database and a data warehouse. In practice, achieving this has required significant engineering effort, careful trade-offs, and often a degree of compromise on either transactional integrity or analytical performance.

Snowflake’s acquisition of Crunchy Data, combined with the pg_lake PostgreSQL extension, changes that calculus. In this article, we will discuss what pg_lake is, how it bridges PostgreSQL’s OLTP capabilities with Snowflake’s analytical engine, and what this architecture looks like in practice.

Features change over time, with new capabilities added regularly; review the documentation for the latest details on which features are included with each edition.

The Problem: OLTP and OLAP Do Not Mix Well#

Traditional architectures handle transactional and analytical workloads by separating them completely. A PostgreSQL or Oracle database manages day-to-day transactions; a nightly or near-real-time ETL process copies that data into a warehouse for analysis.

This separation introduces several problems:

  • Latency: Analytical queries run against data that is hours (or days) old
  • Data duplication: The same data exists in multiple locations with associated storage costs
  • Pipeline complexity: ETL jobs, schedulers, and error handling add operational overhead
  • Schema coupling: Changes to the source schema can break downstream pipelines

The lakehouse vision solves this by writing directly to open-format storage (Apache Iceberg) that both the OLTP engine and the analytical engine can query natively.

What is pg_lake?#

pg_lake is a PostgreSQL extension developed by Snowflake that enables read and write access to Apache Iceberg tables directly from a PostgreSQL instance. Combined with Snowflake Postgres — the enterprise-grade PostgreSQL offering built on the Crunchy Data acquisition — it creates a unified storage architecture where:

  • PostgreSQL handles fast, row-oriented transactional writes
  • Iceberg tables in S3 serve as the shared storage layer
  • Snowflake reads those Iceberg tables for analytical queries without any data movement

```mermaid
graph TD
    subgraph OLTP[Transactional Layer]
        PG[Snowflake Postgres<br/>OLTP engine] --> HEAP[PostgreSQL heap tables<br/>fast row writes]
        PG --> PGLAKE[pg_lake extension<br/>Iceberg write capability]
    end
    subgraph Storage[Shared Storage Layer]
        S3[(AWS S3<br/>Iceberg parquet files)]
    end
    subgraph OLAP[Analytical Layer]
        SF[Snowflake Warehouse<br/>OLAP engine] --> CAT[Catalog integration<br/>and external volume]
    end
    HEAP -->|Bulk sync via CTE| PGLAKE
    PGLAKE --> S3
    CAT --> S3
```

How the Architecture Works#

The PostgreSQL Instance#

Snowflake Postgres operates as a standalone OLTP engine. You provision an instance via SQL, specifying compute family and storage allocation. Network policies control client access, and the instance is typically available within five to ten minutes of provisioning.

The PostgreSQL instance manages two types of tables:

  • Heap tables: Standard PostgreSQL row-store tables, optimised for fast individual inserts and updates — ideal for transactional workloads such as point-of-sale systems, order management, or event ingestion
  • Iceberg tables (via pg_lake): Column-oriented tables written to S3 in Apache Iceberg format, queryable from Snowflake
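
As a rough sketch, assuming pg_lake exposes Iceberg as a table access method (the `USING iceberg` clause) and using illustrative table and column names, the two table types might be declared side by side like this:

```sql
-- Illustrative DDL; names and pg_lake options are assumptions
CREATE TABLE transactions (
    id          bigint GENERATED ALWAYS AS IDENTITY,
    store_id    int,
    amount      numeric(10,2),
    created_at  timestamptz DEFAULT now(),
    synced      boolean DEFAULT false
);  -- standard heap table: absorbs fast row-oriented OLTP writes

CREATE TABLE iceberg_transactions (
    id          bigint,
    store_id    int,
    amount      numeric(10,2),
    created_at  timestamptz
) USING iceberg;  -- written to S3 in Apache Iceberg format via pg_lake
```

Check the pg_lake documentation for the exact DDL options it supports, such as specifying the S3 location for the Iceberg table.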

The Synchronisation Pattern#

Rather than writing every individual transaction directly to Iceberg (which would be inefficient for high-throughput OLTP), the recommended pattern uses a bulk synchronisation approach:
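
A minimal sketch of this pattern, assuming the hypothetical `transactions` heap table carries a `synced` boolean flag, might look like:

```sql
-- Illustrative bulk sync; table and column names are assumptions.
-- The UPDATE ... RETURNING feeds the INSERT, so flagging and copying
-- happen in a single atomic statement.
WITH to_sync AS (
    UPDATE transactions
    SET synced = true
    WHERE synced = false
    RETURNING id, store_id, amount, created_at
)
INSERT INTO iceberg_transactions (id, store_id, amount, created_at)
SELECT id, store_id, amount, created_at
FROM to_sync;
```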

Explanation: This CTE pattern atomically moves unsynced rows from the PostgreSQL heap table into the Iceberg table and marks them as synced. Synchronisation runs every 500 to 1,000 rows, keeping latency low while avoiding the overhead of per-row Iceberg writes. Bulk inserts of thousands of rows complete in under two seconds.

Storage Integration#

The Iceberg tables written by pg_lake sit in AWS S3. Snowflake connects to this S3 bucket via a storage integration and external volume:
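
The Snowflake side of this setup might look roughly as follows; the integration and volume names, IAM role ARN, and bucket path are all placeholders, and the exact parameters should be checked against the Snowflake documentation:

```sql
-- Placeholder names, ARN, and bucket
CREATE STORAGE INTEGRATION pg_lake_s3_int
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/pg-lake-read'
  STORAGE_ALLOWED_LOCATIONS = ('s3://my-lake-bucket/iceberg/');

CREATE EXTERNAL VOLUME pg_lake_vol
  STORAGE_LOCATIONS = ((
      NAME = 'pg-lake-s3'
      STORAGE_PROVIDER = 'S3'
      STORAGE_BASE_URL = 's3://my-lake-bucket/iceberg/'
      STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/pg-lake-read'
  ));
```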

Explanation: The storage integration creates the trust relationship between Snowflake and the IAM role that has S3 read access. The external volume registers the physical storage location that Iceberg metadata files reference.

Once registered, Snowflake queries the Iceberg tables directly:

```sql
-- Query Iceberg tables written by PostgreSQL via pg_lake
SELECT
    store_id,
    SUM(amount) AS total_revenue,
    COUNT(*) AS transaction_count
FROM iceberg_transactions
WHERE created_at >= CURRENT_DATE - 7
GROUP BY store_id
ORDER BY total_revenue DESC;
```

Why This Matters#

True Zero-Copy Architecture#

Snowflake reads the Iceberg parquet files in S3 directly — the data is not copied into Snowflake’s internal storage. This eliminates storage duplication and removes the ETL pipeline between the transactional and analytical layers entirely.

ACID Compliance Throughout#

pg_lake preserves ACID properties when writing Iceberg tables. Snowflake’s reads are consistent with Iceberg’s snapshot isolation — you query a consistent version of the data without locking the PostgreSQL instance.

Vendor-Neutral Storage#

Because the data is stored as Apache Iceberg on S3, it is not locked into either Snowflake or PostgreSQL. Other engines — Spark, Trino, Databricks — can read the same Iceberg tables without any additional pipeline.

Use Cases#

This architecture suits situations where transactional freshness and analytical depth are both requirements:

  • Retail: Point-of-sale transactions written to PostgreSQL, available in Snowflake for dashboard queries within seconds
  • Financial services: Payment events captured transactionally, joined with historical data for real-time risk scoring
  • IoT and sensor data: High-frequency event ingestion into PostgreSQL heap tables, synced to Iceberg for time-series analysis
  • SaaS platforms: Multi-tenant operational data in PostgreSQL, with shared analytical views in Snowflake

Best Practices#

From experience, I would recommend the following:

  1. Tune synchronisation frequency: The bulk sync pattern is efficient, but synchronisation interval should match your analytical latency requirements. Every 500 rows suits near-real-time dashboards; larger batches suit hourly reporting.

  2. Keep heap tables thin: Use the PostgreSQL heap table as a staging area for unsynced records, not as a long-term store. Move data to Iceberg promptly to keep heap table sizes manageable.

  3. Set IAM session duration appropriately: The storage integration requires a maximum IAM session duration of 12 hours. Failing to configure this correctly prevents pg_lake from accessing S3.

  4. Monitor S3 file counts: Frequent small syncs produce many small Iceberg files. Schedule periodic compaction jobs (via Snowflake or Spark) to maintain query performance.
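
If Spark is available, one way to run such a compaction job is Iceberg's built-in maintenance procedure; the catalog and table names below are illustrative:

```sql
-- Spark SQL, assuming a Spark catalog named `lake` is configured
-- to point at the same Iceberg tables
CALL lake.system.rewrite_data_files(table => 'public.iceberg_transactions');
```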

Common Pitfalls#

Pitfall: Writing Individual Rows Directly to Iceberg#

Problem: Inserting individual transactional rows directly into Iceberg tables rather than using the bulk sync pattern results in excessive small files and degraded analytical performance.

Solution: Use the CTE-based bulk sync pattern. Let PostgreSQL heap tables absorb the transactional write load and batch-sync to Iceberg on a schedule.

Pitfall: Querying Iceberg Tables Before Metadata Refresh#

Problem: Snowflake queries return stale results because Iceberg metadata has not been refreshed after new files are written by pg_lake.

Solution: Call ALTER ICEBERG TABLE <table_name> REFRESH after each sync cycle, or configure automatic metadata refresh via a Snowflake Task.
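
A sketch of the Task approach, with an assumed table name and an arbitrary one-minute schedule, might look like:

```sql
-- Illustrative serverless task; name and schedule are assumptions
CREATE TASK refresh_iceberg_transactions
  SCHEDULE = '1 MINUTE'
AS
  ALTER ICEBERG TABLE iceberg_transactions REFRESH;

ALTER TASK refresh_iceberg_transactions RESUME;  -- tasks start suspended
```

In practice, the schedule should match the sync cadence on the PostgreSQL side so that refreshes are neither wasted nor lagging.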

Conclusion#

The combination of Snowflake Postgres and pg_lake makes a genuinely open lakehouse architecture practical for production workloads. By separating the write path (PostgreSQL OLTP) from the read path (Snowflake OLAP) while sharing a common Iceberg storage layer on S3, organisations can eliminate the ETL pipeline between transactional and analytical systems — and do so without vendor lock-in.

Key Takeaways:

  • pg_lake is a PostgreSQL extension that enables native Iceberg table writes from Snowflake Postgres
  • The bulk sync pattern efficiently moves transactional data from heap tables to Iceberg without per-row overhead
  • Snowflake reads Iceberg tables in S3 directly — no data duplication, no ETL pipeline
  • The architecture supports ACID compliance throughout the stack
  • Apache Iceberg storage means the data remains portable to other engines

