❄️
Data Flakes


Raw data is rarely in a form that machine learning algorithms can use effectively. A customer’s transaction history is a sequence of events; an ML model expects numerical features with consistent scale. A product category is a string; a classifier expects a binary indicator for each possible value. The process of transforming raw data into the structured, scaled, and encoded inputs that algorithms require is called feature engineering — and it is one of the most time-consuming parts of any ML project.

The traditional approach moves data out of the warehouse and into a Python environment where libraries like scikit-learn handle these transformations. This introduces data movement, latency, and a disconnect between where the data lives and where it is processed.

In this article, we will discuss how feature engineering can be performed directly within Snowflake, what tools and functions are available, and how this approach benefits data and engineering teams.

Snowflake's features change from time to time, with new capabilities added regularly; it is recommended that you review the documentation for the latest details on which features are included with each edition.

What is Feature Engineering?#

Feature engineering is the process of transforming raw input data into representations that enable machine learning algorithms to learn patterns effectively. It bridges the gap between real-world data — which is messy, heterogeneous, and differently scaled — and the clean numerical inputs that most ML algorithms expect.

Common transformations include:

  • Scaling: Normalising values so different features have comparable ranges (preventing a salary column from dominating a tenure column simply because it has larger numbers)
  • Binning: Grouping continuous values into discrete categories (converting age into age bands)
  • Encoding: Converting categorical strings into numerical representations (turning “Gold”, “Silver”, “Bronze” into 1, 2, 3 or binary one-hot columns)
  • Aggregation: Deriving summary features from historical records (average spend per customer over 90 days)
  • Imputation: Handling missing values in a consistent, reproducible way

The quality of features frequently matters more than the choice of algorithm. A well-engineered feature set with a simple model often outperforms a sophisticated model applied to raw, unprepared data.

The Feature Engineering Pipeline#

A typical feature engineering pipeline sits between raw data and model training:

graph LR
    A[(Raw data<br/>warehouse tables)] --> B[Feature engineering<br/>scale, encode, aggregate]
    B --> C[(Feature store<br/>ML-ready tables)]
    C --> D[Model training]
    C --> E[Real-time inference<br/>via SQL function]
    D --> F[Deployed model]

Keeping this pipeline inside Snowflake means the feature store is a Snowflake table — queryable, versioned, and subject to the same governance as all other data.

Feature Engineering in SQL#

Snowflake’s SQL engine provides most of the building blocks for common feature transformations natively.

Scaling: MinMax Normalisation#

MinMax scaling maps values to a 0–1 range, preserving relative differences while removing absolute scale:

-- MinMax scale the monthly_spend column to a 0-1 range
WITH spend_range AS (
    SELECT
        MIN(monthly_spend) AS min_spend,
        MAX(monthly_spend) AS max_spend
    FROM customer_features
)
SELECT
    customer_id,
    (monthly_spend - min_spend) / NULLIF(max_spend - min_spend, 0) AS monthly_spend_scaled
FROM customer_features, spend_range;

Explanation: The CTE calculates the min and max once across the dataset. The main query applies the MinMax formula to each row. NULLIF prevents division by zero when all values are identical.
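The same formula can be sketched in plain Python to check its behaviour, including the NULLIF-style guard against a zero range (an illustrative snippet, not part of the Snowflake pipeline):

```python
# Python mirror of the SQL MinMax formula above.
def minmax_scale(values):
    lo, hi = min(values), max(values)
    rng = hi - lo
    if rng == 0:
        # Mirrors NULLIF(max - min, 0): the division yields NULL in SQL
        return [None] * len(values)
    return [(v - lo) / rng for v in values]

print(minmax_scale([10, 20, 40]))  # [0.0, 0.333..., 1.0]
print(minmax_scale([5, 5]))        # [None, None] -- all values identical
```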

Scaling: Standard Normalisation (Z-Score)#

Standard scaling centres values around zero with unit variance — appropriate when your algorithm assumes normally distributed features:

-- Standardise tenure_months by mean and standard deviation
WITH stats AS (
    SELECT
        AVG(tenure_months)    AS mean_tenure,
        STDDEV(tenure_months) AS stddev_tenure
    FROM customer_features
)
SELECT
    customer_id,
    (tenure_months - mean_tenure) / NULLIF(stddev_tenure, 0) AS tenure_standardised
FROM customer_features, stats;

Binning: Creating Age Bands#

Binning converts continuous values into discrete categories, useful when the relationship between a feature and the target is non-linear:

-- Bin customer age into bands using WIDTH_BUCKET
SELECT
    customer_id,
    age,
    WIDTH_BUCKET(age, 18, 80, 6) AS age_band   -- 6 equal-width bands from 18 to 80
FROM customers;

Explanation: WIDTH_BUCKET(value, min, max, num_buckets) assigns each value to one of num_buckets equal-width intervals between min and max. This is Snowflake’s native equivalent of scikit-learn’s KBinsDiscretizer.
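The bucketing arithmetic can be sketched in Python to make the edge cases explicit: WIDTH_BUCKET returns 0 for values below the minimum and num_buckets + 1 for values at or above the maximum (an illustrative sketch of the semantics, not Snowflake's implementation):

```python
import math

def width_bucket(value, lo, hi, num_buckets):
    """Equal-width binning in the style of WIDTH_BUCKET:
    bucket 0 below the range, num_buckets + 1 at or above it."""
    if value < lo:
        return 0
    if value >= hi:
        return num_buckets + 1
    return math.floor((value - lo) / (hi - lo) * num_buckets) + 1

# Six equal-width age bands from 18 to 80, as in the query above
print([width_bucket(a, 18, 80, 6) for a in (17, 18, 35, 79, 80)])  # [0, 1, 2, 6, 7]
```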

Encoding: One-Hot Encoding#

One-hot encoding converts a categorical column into binary indicator columns — one per category:

-- One-hot encode the membership_tier column
SELECT
    customer_id,
    IFF(membership_tier = 'Gold', 1, 0)     AS tier_gold,
    IFF(membership_tier = 'Silver', 1, 0)   AS tier_silver,
    IFF(membership_tier = 'Bronze', 1, 0)   AS tier_bronze,
    IFF(membership_tier IS NULL, 1, 0)      AS tier_unknown
FROM customers;

Explanation: For a column with known, stable categories, static IFF expressions are the simplest approach. For dynamic categories, a pivot or array-based approach is more robust.
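To see why static IFF columns break down as categories change, here is a hedged Python sketch of the dynamic alternative: derive the category list from the data itself, then emit one indicator per category (the function and column names are illustrative):

```python
def one_hot(rows, column):
    # Derive the category list from the data itself,
    # so a new tier does not require editing the query
    categories = sorted({r[column] for r in rows if r[column] is not None})
    encoded = []
    for r in rows:
        features = {f"{column}_{c}": int(r[column] == c) for c in categories}
        features[f"{column}_unknown"] = int(r[column] is None)
        encoded.append({"customer_id": r["customer_id"], **features})
    return encoded

rows = [
    {"customer_id": 1, "membership_tier": "Gold"},
    {"customer_id": 2, "membership_tier": None},
]
print(one_hot(rows, "membership_tier"))
```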

Aggregation: Historical Features#

Aggregate features capture behaviour patterns over time — often the most predictive features in customer or product ML models:

-- Build a feature table with 30, 60, and 90-day spend aggregations per customer
SELECT
    customer_id,
    SUM(CASE WHEN order_date >= CURRENT_DATE - 30 THEN order_total ELSE 0 END) AS spend_30d,
    SUM(CASE WHEN order_date >= CURRENT_DATE - 60 THEN order_total ELSE 0 END) AS spend_60d,
    SUM(CASE WHEN order_date >= CURRENT_DATE - 90 THEN order_total ELSE 0 END) AS spend_90d,
    COUNT(DISTINCT CASE WHEN order_date >= CURRENT_DATE - 30 THEN order_id END) AS orders_30d,
    MAX(order_date)                                                               AS last_order_date,
    DATEDIFF('day', MAX(order_date), CURRENT_DATE)                               AS days_since_last_order
FROM orders
GROUP BY customer_id;

Explanation: Conditional aggregation with CASE WHEN is more efficient than multiple separate queries. DATEDIFF calculates recency — a key feature for churn and next-purchase models.

Feature Engineering with Snowpark ML#

For more complex transformations or when you want a scikit-learn compatible API with reproducible pipelines, Snowpark ML provides Python-native preprocessing classes that execute inside Snowflake:
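A minimal sketch of such a pipeline, assuming a Snowpark session and a customer_features table; the column names, step names, and database objects are illustrative, not prescriptive:

```python
from snowflake.ml.modeling.pipeline import Pipeline
from snowflake.ml.modeling.preprocessing import MinMaxScaler, OneHotEncoder

# train_df is a Snowpark DataFrame over the training rows,
# e.g. session.table("CUSTOMER_FEATURES")
pipe = Pipeline(steps=[
    ("scale", MinMaxScaler(
        input_cols=["MONTHLY_SPEND"],
        output_cols=["MONTHLY_SPEND_SCALED"],
    )),
    ("encode", OneHotEncoder(
        input_cols=["MEMBERSHIP_TIER"],
        output_cols=["TIER"],
    )),
])

pipe.fit(train_df)                    # statistics computed in the warehouse
features_df = pipe.transform(train_df)
```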

Explanation: Despite the familiar scikit-learn syntax, all computation runs inside Snowflake’s warehouse. The pipeline learns statistics (min, max, mean, categories) from the training data during fit and applies them consistently during transform. This ensures that the same transformations applied during training are applied identically at inference time.

The Snowflake Feature Store#

For teams running multiple ML models, managing features consistently across models is a significant challenge. Snowflake’s Feature Store provides a governed catalogue of feature definitions that can be shared across teams and reused across models:
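A minimal sketch of registering a feature view, assuming an existing Snowpark session and a Snowpark DataFrame of spend aggregations; the database, warehouse, entity, and feature names here are illustrative:

```python
from snowflake.ml.feature_store import (
    CreationMode, Entity, FeatureStore, FeatureView,
)

fs = FeatureStore(
    session=session,
    database="ML_DB",
    name="FEATURES",
    default_warehouse="ML_WH",
    creation_mode=CreationMode.CREATE_IF_NOT_EXIST,
)

# Entities define the join keys features are keyed on
customer = Entity(name="CUSTOMER", join_keys=["CUSTOMER_ID"])
fs.register_entity(customer)

# spend_df is a Snowpark DataFrame with the 30/60/90-day aggregations
fv = FeatureView(
    name="CUSTOMER_SPEND",
    entities=[customer],
    feature_df=spend_df,
    refresh_freq="1 day",
    desc="30/60/90-day spend aggregations per customer",
)
fs.register_feature_view(feature_view=fv, version="1")
```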

Explanation: Registered feature views are materialised as Snowflake tables on the defined refresh schedule. Any ML pipeline in the organisation can retrieve these features by joining on the entity key, ensuring consistent feature definitions across models.

Mapping scikit-learn Transformers to Snowflake#

Teams migrating from scikit-learn will find most common transformers have direct Snowflake equivalents:

scikit-learn     | Snowflake Equivalent
MinMaxScaler     | SQL CTE formula or snowflake.ml.preprocessing.MinMaxScaler
StandardScaler   | SQL AVG/STDDEV formula or StandardScaler
KBinsDiscretizer | WIDTH_BUCKET()
LabelEncoder     | DENSE_RANK() OVER (ORDER BY category)
OneHotEncoder    | IFF(col = 'value', 1, 0) or OneHotEncoder
SimpleImputer    | COALESCE() or SimpleImputer
RobustScaler     | Percentile-based SQL formula or RobustScaler

Most of these transformations do not need to be built from scratch: Snowflake’s SQL functions provide the building blocks, and Snowpark ML provides the sklearn-compatible API on top.

Business Applications#

Feature engineering in Snowflake unlocks ML use cases that were previously impractical due to data movement constraints:

  • Churn prediction: Build recency, frequency, and monetary features directly from transaction tables — updated daily without extract pipelines
  • Credit scoring: Aggregate account behaviour features from millions of rows without moving data outside the financial data platform
  • Product recommendation: Compute user-item interaction features at scale using Snowflake’s distributed compute
  • Fraud detection: Derive real-time behavioural features from event streams via Dynamic Tables, serving inference via SQL function

Best Practices#

From experience, I recommend the following:

  1. Fit preprocessing on training data only: When using MinMaxScaler or StandardScaler, the statistics (min, max, mean, stddev) must be calculated on the training set, not the full dataset. Leaking test set statistics into training inflates model performance metrics.

  2. Store fitted parameters: Save the min, max, mean, and stddev values used during training. At inference time, apply these stored parameters rather than recalculating — otherwise your inference transformations will differ from training transformations as new data arrives.

  3. Version your feature definitions: Feature definitions change over time. Use the Feature Store’s versioning to track what definition was active when each model version was trained.

  4. Document feature semantics: A column called spend_scaled is meaningless without documentation of what it was scaled from and how. The Feature Store’s desc parameter is the minimum; a fuller data dictionary is better.
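Points 1 and 2 above can be illustrated with a small standard-library sketch (the numbers are made up): fit the scaling statistics on training rows only, store them, and reapply the stored parameters at inference time rather than recalculating.

```python
# Fit scaling statistics on TRAINING data only, then persist them.
train = [100, 200, 300]
fitted = {"min": min(train), "max": max(train)}  # store these with the model

def apply_scaler(value, params):
    # Reuse the stored parameters at inference time
    rng = params["max"] - params["min"]
    return (value - params["min"]) / rng if rng else None

# Inference months later: new data may fall outside the training range,
# but we still apply the stored parameters, not recomputed ones.
print(apply_scaler(250, fitted))  # 0.75
print(apply_scaler(400, fitted))  # 1.5 -- beyond the training range, by design
```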

Conclusion#

Feature engineering is one of the highest-value activities in any ML project, and Snowflake provides the tools to do it entirely within the warehouse. SQL window functions, WIDTH_BUCKET, conditional aggregation, and Snowpark ML’s scikit-learn compatible preprocessing classes cover the full range of common transformations.

The result is an ML pipeline that stays within your governed data environment: no extraction, no divergence between training and inference transformations, and feature definitions that any team in the organisation can discover and reuse.

Key Takeaways:

  • Feature engineering transforms raw data into the numerical, scaled, and encoded inputs that ML algorithms require
  • Snowflake SQL provides native equivalents for most scikit-learn preprocessing functions
  • Snowpark ML offers a scikit-learn compatible API for complex or chained transformations that executes inside Snowflake
  • The Snowflake Feature Store provides versioned, reusable feature definitions shared across models and teams
  • Fitting preprocessing on training data only — and storing those parameters — is essential for consistent inference



Disclaimer

The information provided on this website is for general informational purposes only. While we strive to keep the information up to date and correct, there may be instances where information is outdated or links are no longer valid. We make no representations or warranties of any kind, express or implied, about the completeness, accuracy, reliability, suitability, or availability with respect to the website or the information, products, services, or related graphics contained on the website for any purpose. Any reliance you place on such information is therefore strictly at your own risk.