Last year, Snowflake surprised everyone by announcing Polaris Catalog, an open-source technical catalog for Apache Iceberg.
Wait, why would a closed-source SaaS company release open-source infrastructure? Because they want to be the center of gravity for metadata, even if they aren’t storing the data.
## The Problem: Catalog Chaos
You have data in S3.
- Spark uses the Hive Metastore.
- Trino uses the Glue Catalog.
- Snowflake uses its internal catalog.
They all disagree on the schema. It’s a mess.
## The Solution: REST Protocol
Polaris implements the open Apache Iceberg REST catalog API. It sits in the middle, speaking a protocol that every major engine already understands.
- Spark asks Polaris: “I want to write to Table A.”
- Snowflake asks Polaris: “I want to read Table A.”
Polaris ensures both engines see the same atomic snapshot: a write commits a new table version through the catalog, and every reader sees either the old version or the new one, never a half-written state.
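Concretely, "implements the REST API" means any client can drive the catalog with plain HTTP. Here's a minimal sketch using Python's requests; the base URL, bearer token, namespace, and table name are all placeholders, and Polaris may additionally require a catalog-specific prefix in these paths, which clients discover from the `/v1/config` response:

```python
import requests

BASE = "https://my-polaris-url/api/catalog"    # placeholder endpoint
HEADERS = {"Authorization": "Bearer <token>"}  # placeholder credential

# Step 1 of the protocol: fetch the catalog's config. The response can
# include a path prefix that clients must use on later requests.
config = requests.get(f"{BASE}/v1/config", headers=HEADERS).json()

# Loading a table returns the location of its current metadata file --
# the single pointer that every engine agrees on.
table = requests.get(
    f"{BASE}/v1/namespaces/db/tables/table_a", headers=HEADERS
).json()
print(table["metadata-location"])
```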
## Setting it up
Polaris can be hosted by Snowflake (managed) or run in your own Kubernetes cluster (self-hosted).
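On the engine side, connecting is just catalog configuration. Here's a minimal PySpark sketch; the endpoint, warehouse name, table, and client credentials are hypothetical, and the Iceberg runtime version must match your Spark build:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("polaris-demo")
    # Pull in the Iceberg runtime and SQL extensions
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # Register a catalog named "polaris" that speaks the Iceberg REST protocol
    .config("spark.sql.catalog.polaris", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.polaris.type", "rest")
    .config("spark.sql.catalog.polaris.uri", "https://my-polaris-url/api/catalog")
    .config("spark.sql.catalog.polaris.warehouse", "my_warehouse")
    # OAuth2 client credentials, as issued per principal by the catalog
    .config("spark.sql.catalog.polaris.credential", "<client_id>:<client_secret>")
    .getOrCreate()
)

# Reads and writes now go through the shared catalog
spark.sql("SELECT * FROM polaris.db.table_a").show()
```

The Snowflake side of the same connection looks like this: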
```sql
-- Connecting Snowflake to a Polaris Catalog
CREATE CATALOG INTEGRATION my_polaris
  CATALOG_SOURCE = ICEBERG_REST
  TABLE_FORMAT = ICEBERG
  REST_CONFIG = (CATALOG_URI = 'https://my-polaris-url/api/catalog')
  REST_AUTHENTICATION = (
    TYPE = OAUTH
    OAUTH_CLIENT_ID = '<client_id>'
    OAUTH_CLIENT_SECRET = '<client_secret>'
    OAUTH_ALLOWED_SCOPES = ('PRINCIPAL_ROLE:ALL')
  )
  ENABLED = TRUE;
```

## Governance at the Catalog Layer
The killer feature of Polaris is that it brings role-based access control (RBAC) to the open lake. You define who can read or write a table once, in Polaris, and those rules apply whether the user is coming from Spark, Flink, or Dremio; a conceptual sketch of the model follows.
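To show the shape of that idea in code (this is not the Polaris API, just a toy model): privileges hang off catalog roles, roles attach to principals, and the check lives in the catalog, so every engine gets the same answer. The privilege names imitate Polaris's style; the types are invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogRole:
    name: str
    # (privilege, table) pairs, e.g. ("TABLE_READ_DATA", "db.table_a")
    privileges: set[tuple[str, str]] = field(default_factory=set)

@dataclass
class Principal:
    name: str
    roles: list[CatalogRole] = field(default_factory=list)

def is_allowed(principal: Principal, privilege: str, table: str) -> bool:
    # Engines never see this logic; they just get an allow/deny from the
    # catalog, regardless of which engine is asking.
    return any((privilege, table) in role.privileges for role in principal.roles)

readers = CatalogRole("readers", {("TABLE_READ_DATA", "db.table_a")})
analyst = Principal("analyst", [readers])

assert is_allowed(analyst, "TABLE_READ_DATA", "db.table_a")
assert not is_allowed(analyst, "TABLE_WRITE_DATA", "db.table_a")
```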
## Conclusion
Polaris is the “Switzerland” of the Data Wars. It lets you pick best-of-breed engines for different tasks while keeping a single source of truth for table structure.