❄️
Data Flakes

Data privacy regulations (GDPR, CCPA, etc.) aren’t going away. The biggest challenge for data teams is simply knowing where the sensitive data lives. You can’t protect what you can’t find.

Snowflake’s Classification feature (native to Horizon) automates this discovery process.

How it Works

Snowflake analyzes the contents (data profiling) and metadata (column names) of your tables to determine if a column contains PII (Personally Identifiable Information).

It assigns:

  1. Semantic Category: What is it? (e.g., EMAIL, PHONE_NUMBER, GENDER).
  2. Privacy Category: How sensitive is it? (e.g., IDENTIFIER, QUASI_IDENTIFIER, SENSITIVE).
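To make the two signals concrete, here is a toy Python sketch of how column-name hints and value profiling might be combined. It is not Snowflake's actual algorithm: the `RULES` table, the regexes, the `classify_column` helper, and the 0.8 match threshold are all illustrative assumptions.

```python
import re

# Hypothetical rules: (semantic category, privacy category, column-name regex, value regex).
RULES = [
    ("EMAIL", "IDENTIFIER", r"e[-_]?mail", r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    ("PHONE_NUMBER", "IDENTIFIER", r"phone|mobile", r"\+?[0-9][0-9 ()-]{6,}"),
    ("GENDER", "QUASI_IDENTIFIER", r"gender|sex", r"male|female|m|f"),
]


def classify_column(name, values, threshold=0.8):
    """Return (semantic, privacy) for the first rule that matches, else None.

    A rule matches when the column name hints at the category (metadata)
    or when at least `threshold` of the non-null sampled values fit the
    value pattern (data profiling).
    """
    sample = [v for v in values if v is not None]
    for semantic, privacy, name_re, value_re in RULES:
        name_hit = re.search(name_re, name, re.IGNORECASE) is not None
        hits = sum(
            1 for v in sample if re.fullmatch(value_re, str(v), re.IGNORECASE)
        )
        value_hit = bool(sample) and hits / len(sample) >= threshold
        if name_hit or value_hit:
            return semantic, privacy
    return None


# The column name says nothing, but the sampled values look like emails.
print(classify_column("contact", ["a@b.com", "c@d.org", None]))
```

A real classifier weighs far more evidence than this, but the shape is the same: a name signal, a content signal, and a pair of categories as the result.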

The EXTRACT_SEMANTIC_CATEGORIES Function

You can run this manually to see what Snowflake finds:

SELECT EXTRACT_SEMANTIC_CATEGORIES('customer_data');

This returns a JSON object that proposes a semantic and privacy category for each column. Nothing is applied at this point, so you can review the proposals first.
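Reviewing the proposals usually means filtering them. The sketch below parses a result in Python and keeps only the columns whose proposed category clears a confidence bar. The JSON shape (a per-column object with `semantic_category`, `privacy_category`, and `extra_info.probability`) is an assumption about the output format and should be checked against what your account actually returns; the sample payload, the `confident_proposals` helper, and the `min_probability` parameter are made up for illustration.

```python
import json

# Made-up sample payload; the field names are an assumed output shape.
SAMPLE_RESULT = """
{
  "EMAIL_ADDR": {
    "extra_info": {"alternates": [], "probability": "0.95"},
    "privacy_category": "IDENTIFIER",
    "semantic_category": "EMAIL"
  },
  "NOTES": {
    "extra_info": {"alternates": [], "probability": "0.40"},
    "privacy_category": "QUASI_IDENTIFIER",
    "semantic_category": "GENDER"
  },
  "ORDER_TOTAL": {}
}
"""


def confident_proposals(result_json, min_probability=0.8):
    """Map column name -> (semantic, privacy) for proposals at or above the bar."""
    proposals = {}
    for column, info in json.loads(result_json).items():
        semantic = info.get("semantic_category")
        if semantic is None:
            continue  # the classifier proposed nothing for this column
        probability = float(info.get("extra_info", {}).get("probability", 0))
        if probability >= min_probability:
            proposals[column] = (semantic, info.get("privacy_category"))
    return proposals


print(confident_proposals(SAMPLE_RESULT))
# {'EMAIL_ADDR': ('EMAIL', 'IDENTIFIER')}
```

Low-probability proposals such as the `NOTES` entry are the ones worth a manual look before any tag is applied.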

Automating with Stored Procedures

To make this useful at scale, we wrap these calls in a stored procedure that runs on a schedule (weekly, for example) or is called by whatever pipeline creates new tables.

-- 1. Call the classifier (null = default options, nothing is tagged yet)
CALL SYSTEM$CLASSIFY('my_db.public.customer_table', null);

-- 2. Review results (returned by the call as a JSON resultset)
-- ...

-- 3. Apply the tags by re-running the classifier with auto_tag enabled
CALL SYSTEM$CLASSIFY('my_db.public.customer_table', {'auto_tag': true});

Creating Custom Classifiers

As of 2025, Snowflake supports custom classifiers. If you have an internal ID format (e.g., “EMP-12345”) that the built-in categories miss, you can define regex patterns to catch it.

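-- Create a custom classifier instance
CREATE OR REPLACE SNOWFLAKE.DATA_PRIVACY.CUSTOM_CLASSIFIER employee_id_classifier();

-- Register the pattern: semantic category, privacy category, value regex,
-- column-name regex, description, match threshold
CALL employee_id_classifier!ADD_REGEX(
    'EMPLOYEE_ID', 'IDENTIFIER', 'EMP-[0-9]{5}', '.*EMP_ID.*',
    'Internal employee ID', 0.8);

-- Add this instance to the classification process
CALL SYSTEM$CLASSIFY('my_db.public.customer_table',
    {'custom_classifiers': ['employee_id_classifier'], 'auto_tag': true});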
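Before registering a pattern, it can help to test it against sample values offline. The following Python sketch checks what fraction of a sample fully matches the `EMP-[0-9]{5}` pattern from above. The sample values and the `match_ratio` helper are hypothetical, and Python's `re` dialect may differ in edge cases from the regex engine Snowflake uses.

```python
import re

EMP_ID_PATTERN = re.compile(r"EMP-[0-9]{5}")


def match_ratio(values, pattern=EMP_ID_PATTERN):
    """Fraction of non-null values that match the pattern in full."""
    sample = [v for v in values if v is not None]
    if not sample:
        return 0.0
    hits = sum(1 for v in sample if pattern.fullmatch(v))
    return hits / len(sample)


# Hypothetical sample: three well-formed IDs, one with too few digits, one null.
values = ["EMP-12345", "EMP-00001", "EMP-99999", "EMP-123", None]
print(match_ratio(values))  # 0.75
```

A ratio well below 1.0 on a column you know holds employee IDs suggests the pattern is too strict, or the data is dirtier than expected.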

Tag-Based Masking Policies

Once tags are applied, the magic happens. Your generic masking policy kicks in:

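-- This policy is applied to the TAG, not the table
CREATE MASKING POLICY pii_mask AS (val string) RETURNS string ->
  CASE
    -- Classification tags live in SNOWFLAKE.CORE, so use the fully qualified name
    WHEN system$get_tag_on_current_column('snowflake.core.privacy_category') = 'IDENTIFIER'
         AND current_role() != 'COMPLIANCE_OFFICER'
    THEN '***'
    ELSE val
  END;

-- Attach the policy to the tag so every tagged string column inherits it
ALTER TAG snowflake.core.privacy_category SET MASKING POLICY pii_mask;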

Now, every time the classifier finds a new email column and tags it, the data is instantly masked for unauthorized users. No manual ticket required.
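The decision the policy makes is small enough to model directly. This Python sketch mirrors the CASE expression above so the outcome can be checked for each tag and role combination. It is only a model of the logic, not how Snowflake evaluates policies, and the `ANALYST` role and sample email are hypothetical.

```python
def pii_mask(val, privacy_category, current_role):
    """Mirror of the CASE expression: mask identifiers for everyone except compliance."""
    if privacy_category == "IDENTIFIER" and current_role != "COMPLIANCE_OFFICER":
        return "***"
    return val


# Walk every role/tag combination; None stands for an untagged column.
for role in ("ANALYST", "COMPLIANCE_OFFICER"):
    for tag in ("IDENTIFIER", "QUASI_IDENTIFIER", None):
        print(role, tag, pii_mask("jane@example.com", tag, role))
```

Only the `ANALYST` and `IDENTIFIER` combination comes back as `***`. Columns tagged `QUASI_IDENTIFIER` or `SENSITIVE` pass through unmasked under this policy, which is worth knowing before relying on it for broader coverage.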

Conclusion

Automated classification shifts privacy from a “bottleneck” to a “background process.” It improves coverage and reduces risk, letting engineers focus on building pipelines rather than auditing column names.
