Automated PII Classification
Stop manually tagging columns. Use Snowflake
Data privacy regulations (GDPR, CCPA, etc.) aren’t going away. The biggest challenge for data teams is simply knowing where the sensitive data lives. You can’t protect what you can’t find.
Snowflake’s Classification feature (native to Horizon) automates this discovery process.
How it Works
Snowflake analyzes the contents (data profiling) and metadata (column names) of your tables to determine if a column contains PII (Personally Identifiable Information).
It assigns:
- Semantic Category: What is it? (e.g., EMAIL, PHONE_NUMBER, GENDER).
- Privacy Category: How sensitive is it? (e.g., IDENTIFIER, QUASI_IDENTIFIER, SENSITIVE).
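Once a table has been classified and tagged, both categories show up as system tags on the column. A minimal sketch for listing them, assuming the table name `my_db.public.customer_table` (used later in this post) and that the tags live under `SNOWFLAKE.CORE`:

```sql
-- List the classification tags attached to each column of one table.
-- Assumption: classification tags are system tags in SNOWFLAKE.CORE
-- (SEMANTIC_CATEGORY and PRIVACY_CATEGORY).
SELECT column_name, tag_name, tag_value
FROM TABLE(
    my_db.INFORMATION_SCHEMA.TAG_REFERENCES_ALL_COLUMNS(
        'my_db.public.customer_table', 'table'
    )
)
WHERE tag_database = 'SNOWFLAKE'
  AND tag_schema = 'CORE'
ORDER BY column_name, tag_name;
```

A column holding email addresses would be expected to appear twice here, once per tag.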
The EXTRACT_SEMANTIC_CATEGORIES Function
You can run this manually to see what Snowflake finds:
SELECT EXTRACT_SEMANTIC_CATEGORIES('customer_data');
This returns a JSON object proposing tags for columns. It doesn’t apply them yet; it lets you review.
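Raw JSON is awkward to review, so it can help to flatten it into one row per column. The sketch below assumes the result is keyed by column name with a nested `recommendation` object; those field names may differ between Snowflake versions, so check them against the raw output first.

```sql
-- One row per column with the proposed categories.
-- Assumption: each key is a column name and the proposal sits under
-- "recommendation"; adjust the paths if your output is shaped differently.
SELECT
    f.key::VARCHAR                                        AS column_name,
    f.value:"recommendation":"semantic_category"::VARCHAR AS semantic_category,
    f.value:"recommendation":"privacy_category"::VARCHAR  AS privacy_category,
    f.value:"recommendation":"confidence"::VARCHAR        AS confidence
FROM TABLE(
    FLATTEN(EXTRACT_SEMANTIC_CATEGORIES('customer_data')::VARIANT)
) AS f
ORDER BY column_name;
```

Columns with no proposal come back with NULL categories, which makes them easy to filter out during review.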
Automating with Stored Procedures
To make this useful at scale, we wrap the classification calls in a stored procedure that runs weekly (or is triggered whenever new tables are created).
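For the weekly schedule, one option is a Snowflake task running on a cron expression. This is a minimal sketch with several assumptions: the task name `weekly_pii_classification` and the warehouse `classify_wh` are hypothetical, it schedules the classifier call directly instead of going through a stored procedure wrapper, and it uses the `auto_tag` option so tags are applied without a manual review step.

```sql
-- Hypothetical task: classify one table every Sunday at 03:00 UTC.
CREATE OR REPLACE TASK weekly_pii_classification
    WAREHOUSE = classify_wh                  -- assumed warehouse name
    SCHEDULE  = 'USING CRON 0 3 * * 0 UTC'
AS
    CALL SYSTEM$CLASSIFY('my_db.public.customer_table', {'auto_tag': true});

-- Tasks are created suspended; resume the task to start the schedule.
ALTER TASK weekly_pii_classification RESUME;
```

Broken down, the steps such a job performs look like this: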
-- 1. Call the classifier (null options: classify only, apply no tags)
CALL SYSTEM$CLASSIFY('my_db.public.customer_table', null);
-- 2. Review results (the call above returns them as a JSON object)
-- ...
-- 3. Apply the tags: re-run with auto_tag enabled, which sets both the
--    semantic category and the privacy category tags
CALL SYSTEM$CLASSIFY('my_db.public.customer_table', {'auto_tag': true});
Creating Custom Classifiers
As of 2025, Snowflake supports custom classifiers. If you have a specific internal ID format (e.g., “EMP-12345”) that the standard model misses, you can define regex patterns to catch it.
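-- Create the classifier instance, then register a regex on it
CREATE OR REPLACE SNOWFLAKE.DATA_PRIVACY.CUSTOM_CLASSIFIER employee_id_classifier();
CALL employee_id_classifier!ADD_REGEX(
  'EMPLOYEE_ID',          -- semantic category to report
  'IDENTIFIER',           -- privacy category
  'EMP-[0-9]{5}',         -- regex matched against column values
  '.*EMP_ID.*',           -- optional regex matched against the column name
  'Internal employee ID'  -- description
);
-- Add this instance to the classification process
CALL SYSTEM$CLASSIFY('my_db.public.customer_table',
  {'custom_classifiers': ['employee_id_classifier'], 'auto_tag': true});
Tag-Based Masking Policies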
Once tags are applied, the magic happens. Your generic masking policy kicks in:
-- This policy is applied to the TAG, not the table
CREATE MASKING POLICY pii_mask AS (val string) RETURNS string ->
CASE
WHEN system$get_tag_on_current_column('snowflake.core.privacy_category') = 'IDENTIFIER'
AND current_role() != 'COMPLIANCE_OFFICER'
THEN '***'
ELSE val
END;
Now, every time the classifier finds a new email column and tags it, the data is instantly masked for unauthorized users. No manual ticket required.
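The policy only takes effect once it is attached to the tag itself. A sketch of that step, assuming the policy was created in `my_db.public` and that your role is allowed to set a masking policy on the `SNOWFLAKE.CORE.PRIVACY_CATEGORY` system tag (worth confirming for your account and edition):

```sql
-- Attach the policy to the tag, so any string column carrying the tag
-- is masked by pii_mask. Assumption: the policy lives in my_db.public.
ALTER TAG SNOWFLAKE.CORE.PRIVACY_CATEGORY
    SET MASKING POLICY my_db.public.pii_mask;

-- Check where the policy is currently in effect.
SELECT *
FROM TABLE(
    my_db.INFORMATION_SCHEMA.POLICY_REFERENCES(
        policy_name => 'my_db.public.pii_mask'
    )
);
```

Because `pii_mask` takes and returns a string, it only covers string columns; tagged columns of other data types would need a policy with a matching signature.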
Conclusion
Automated classification shifts privacy from a “bottleneck” to a “background process.” It improves coverage and reduces risk, letting engineers focus on building pipelines rather than auditing column names.