Feature Engineering in Snowflake: Preparing Data for Machine Learning
Feature engineering transforms raw data into ML-ready inputs. Snowflake provides native SQL functions and Snowpark tools to do this without leaving the warehouse.
Raw data is rarely in a form that machine learning algorithms can use effectively. A customer’s transaction history is a sequence of events; an ML model expects numerical features with consistent scale. A product category is a string; a classifier expects a binary indicator for each possible value. The process of transforming raw data into the structured, scaled, and encoded inputs that algorithms require is called feature engineering — and it is one of the most time-consuming parts of any ML project.
The traditional approach moves data out of the warehouse and into a Python environment where libraries like scikit-learn handle these transformations. This introduces data movement, latency, and a disconnect between where the data lives and where it is processed.
In this article, we will discuss how feature engineering can be performed directly within Snowflake, what tools and functions are available, and how this approach benefits data and engineering teams.
Snowflake's feature set evolves, with new capabilities added regularly; review the documentation ↗ for the latest on which features are included with each Edition.
What is Feature Engineering?#
Feature engineering is the process of transforming raw input data into representations that enable machine learning algorithms to learn patterns effectively. It bridges the gap between real-world data — which is messy, heterogeneous, and differently scaled — and the clean numerical inputs that most ML algorithms expect.
Common transformations include:
- Scaling: Normalising values so different features have comparable ranges (preventing a salary column from dominating a tenure column simply because it has larger numbers)
- Binning: Grouping continuous values into discrete categories (converting age into age bands)
- Encoding: Converting categorical strings into numerical representations (turning “Gold”, “Silver”, “Bronze” into 1, 2, 3 or binary one-hot columns)
- Aggregation: Deriving summary features from historical records (average spend per customer over 90 days)
- Imputation: Handling missing values in a consistent, reproducible way
The quality of features frequently matters more than the choice of algorithm. A well-engineered feature set with a simple model often outperforms a sophisticated model applied to raw, unprepared data.
The Feature Engineering Pipeline#
A typical feature engineering pipeline sits between raw data and model training: raw tables are cleaned and transformed into features, and the results are materialised in a feature table that training jobs consume.
Keeping this pipeline inside Snowflake means the feature store is a Snowflake table — queryable, versioned, and subject to the same governance as all other data.
Feature Engineering in SQL#
Snowflake’s SQL engine provides most of the building blocks for common feature transformations natively.
Scaling: MinMax Normalisation#
MinMax scaling maps values to a 0–1 range, preserving relative differences while removing absolute scale:
```sql
-- MinMax scale the monthly_spend column to a 0-1 range
WITH spend_range AS (
    SELECT
        MIN(monthly_spend) AS min_spend,
        MAX(monthly_spend) AS max_spend
    FROM customer_features
)
SELECT
    customer_id,
    (monthly_spend - min_spend) / NULLIF(max_spend - min_spend, 0) AS monthly_spend_scaled
FROM customer_features, spend_range;
```

Explanation: The CTE calculates the min and max once across the dataset. The main query applies the MinMax formula to each row. `NULLIF` prevents division by zero when all values are identical.
Scaling: Standard Normalisation (Z-Score)#
Standard scaling centres values around zero with unit variance — appropriate when your algorithm assumes normally distributed features:
```sql
-- Standardise tenure_months by mean and standard deviation
WITH stats AS (
    SELECT
        AVG(tenure_months) AS mean_tenure,
        STDDEV(tenure_months) AS stddev_tenure
    FROM customer_features
)
SELECT
    customer_id,
    (tenure_months - mean_tenure) / NULLIF(stddev_tenure, 0) AS tenure_standardised
FROM customer_features, stats;
```

Binning: Creating Age Bands#
Binning converts continuous values into discrete categories, useful when the relationship between a feature and the target is non-linear:
```sql
-- Bin customer age into bands using WIDTH_BUCKET
SELECT
    customer_id,
    age,
    WIDTH_BUCKET(age, 18, 80, 6) AS age_band  -- 6 equal-width bands from 18 to 80
FROM customers;
```

Explanation: `WIDTH_BUCKET(value, min, max, num_buckets)` assigns each value to one of `num_buckets` equal-width intervals between `min` and `max`. This is Snowflake's native equivalent of scikit-learn's `KBinsDiscretizer`.
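The bucket arithmetic is easy to sanity-check locally. Here is a minimal pure-Python sketch of the same equal-width binning, assuming the standard SQL convention that values below the range land in bucket 0 and values at or above the top land in bucket `num_buckets + 1`:

```python
def width_bucket(value, min_val, max_val, num_buckets):
    """Mirror SQL WIDTH_BUCKET: 1-indexed equal-width bins over [min_val, max_val)."""
    if value < min_val:
        return 0                    # below the range
    if value >= max_val:
        return num_buckets + 1      # at or above the range
    width = (max_val - min_val) / num_buckets
    return int((value - min_val) / width) + 1

# Six equal-width age bands from 18 to 80 (each band spans roughly 10.3 years)
ages = [17, 18, 25, 42, 79, 80]
bands = [width_bucket(a, 18, 80, 6) for a in ages]
```

Note the out-of-range buckets (0 and 7 here): in a real pipeline you would decide explicitly whether to clamp, drop, or keep them as their own categories.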
Encoding: One-Hot Encoding#
One-hot encoding converts a categorical column into binary indicator columns — one per category:
```sql
-- One-hot encode the membership_tier column
SELECT
    customer_id,
    IFF(membership_tier = 'Gold', 1, 0) AS tier_gold,
    IFF(membership_tier = 'Silver', 1, 0) AS tier_silver,
    IFF(membership_tier = 'Bronze', 1, 0) AS tier_bronze,
    IFF(membership_tier IS NULL, 1, 0) AS tier_unknown
FROM customers;
```

Explanation: For a column with known, stable categories, static `IFF` expressions are the simplest approach. For dynamic categories, a pivot or array-based approach is more robust.
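When the category list is not stable, one pragmatic pattern is to query the distinct values first and generate the `IFF` expressions programmatically. A sketch, with a hypothetical `one_hot_sql` helper:

```python
def one_hot_sql(table, id_col, cat_col, categories):
    """Generate a one-hot encoding query from a list of known categories."""
    cols = [
        f"IFF({cat_col} = '{c}', 1, 0) AS {cat_col}_{c.lower()}"
        for c in categories
    ]
    cols.append(f"IFF({cat_col} IS NULL, 1, 0) AS {cat_col}_unknown")
    body = ",\n    ".join(cols)
    return f"SELECT\n    {id_col},\n    {body}\nFROM {table};"

# Build the query from whatever categories exist today
sql = one_hot_sql('customers', 'customer_id', 'membership_tier',
                  ['Gold', 'Silver', 'Bronze'])
print(sql)
```

Remember to freeze the category list used at training time; if new categories appear later, regenerating the query would change the feature schema under a deployed model.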
Aggregation: Historical Features#
Aggregate features capture behaviour patterns over time — often the most predictive features in customer or product ML models:
```sql
-- Build a feature table with 30, 60, and 90-day spend aggregations per customer
SELECT
    customer_id,
    SUM(CASE WHEN order_date >= CURRENT_DATE - 30 THEN order_total ELSE 0 END) AS spend_30d,
    SUM(CASE WHEN order_date >= CURRENT_DATE - 60 THEN order_total ELSE 0 END) AS spend_60d,
    SUM(CASE WHEN order_date >= CURRENT_DATE - 90 THEN order_total ELSE 0 END) AS spend_90d,
    COUNT(DISTINCT CASE WHEN order_date >= CURRENT_DATE - 30 THEN order_id END) AS orders_30d,
    MAX(order_date) AS last_order_date,
    DATEDIFF('day', MAX(order_date), CURRENT_DATE) AS days_since_last_order
FROM orders
GROUP BY customer_id;
```

Explanation: Conditional aggregation with `CASE WHEN` is more efficient than multiple separate queries. `DATEDIFF` calculates recency — a key feature for churn and next-purchase models.
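The windowed-sum logic is worth verifying against a small hand-checked example. A pure-Python sketch with hypothetical order records for one customer (the date is a stand-in for `CURRENT_DATE`):

```python
from datetime import date, timedelta

today = date(2024, 6, 1)  # stand-in for CURRENT_DATE
orders = [  # (order_id, order_date, order_total)
    (1, today - timedelta(days=5), 40.0),
    (2, today - timedelta(days=45), 60.0),
    (3, today - timedelta(days=85), 25.0),
]

def spend_within(days):
    """Sum order totals inside a trailing window, like the CASE WHEN sums."""
    return sum(total for _, d, total in orders
               if d >= today - timedelta(days=days))

spend_30d = spend_within(30)   # only the 5-day-old order
spend_60d = spend_within(60)   # 5-day and 45-day orders
spend_90d = spend_within(90)   # all three orders
days_since_last_order = (today - max(d for _, d, _ in orders)).days
```

Note that the windows are cumulative (`spend_90d` includes `spend_30d`); some teams prefer disjoint buckets (0-30, 31-60, 61-90) instead, which is a one-line change to the predicate.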
Feature Engineering with Snowpark ML#
For more complex transformations or when you want a scikit-learn compatible API with reproducible pipelines, Snowpark ML provides Python-native preprocessing classes that execute inside Snowflake:
```python
from snowflake.ml.modeling.preprocessing import (
    MinMaxScaler,
    StandardScaler,
    OrdinalEncoder,
    OneHotEncoder
)
from snowflake.ml.modeling.impute import SimpleImputer
from snowflake.ml.modeling.pipeline import Pipeline

# Define a preprocessing pipeline
preprocessing_pipeline = Pipeline(steps=[
    # Impute missing values before scaling
    ('imputer', SimpleImputer(
        input_cols=['monthly_spend', 'tenure_months'],
        output_cols=['monthly_spend', 'tenure_months'],
        strategy='median'
    )),
    # Scale numeric features
    ('scaler', MinMaxScaler(
        input_cols=['monthly_spend', 'tenure_months'],
        output_cols=['spend_scaled', 'tenure_scaled']
    )),
    # Encode categorical features
    ('encoder', OneHotEncoder(
        input_cols=['membership_tier', 'region'],
        output_cols=['tier_encoded', 'region_encoded'],
        drop_input_cols=True
    ))
])

# Fit and transform — executes in Snowflake, not locally
feature_df = preprocessing_pipeline.fit_transform(raw_customer_df)
```

Explanation: Despite the familiar scikit-learn syntax, all computation runs inside Snowflake's warehouse. The pipeline learns statistics (min, max, mean, categories) from the training data during `fit` and applies them consistently during `transform`. This ensures that the same transformations applied during training are applied identically at inference time.
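The fit/transform split is the crucial idea, and it is independent of any library. Stripped to plain Python, it amounts to learning statistics once and reusing them:

```python
# "Fit": learn scaling statistics from the training split only
train = [10.0, 20.0, 50.0]
fitted = {'min': min(train), 'max': max(train)}  # persisted after fit

def transform(x, params):
    """Apply MinMax scaling using previously fitted statistics."""
    span = params['max'] - params['min']
    return (x - params['min']) / span if span else 0.0

# "Transform" at inference: new values reuse the stored training statistics
half = transform(30.0, fitted)      # midpoint of the training range
extreme = transform(60.0, fitted)   # unseen extremes can exceed 1.0
```

This is exactly what the pipeline object encapsulates for you; the point of using it is that you never accidentally recompute `min` and `max` on inference data.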
The Snowflake Feature Store#
For teams running multiple ML models, managing features consistently across models is a significant challenge. Snowflake’s Feature Store provides a governed catalogue of feature definitions that can be shared across teams and reused across models:
```python
from snowflake.ml.feature_store import FeatureStore, Entity, FeatureView

# Define the entity that features describe
customer_entity = Entity(
    name='CUSTOMER',
    join_keys=['customer_id'],
    desc='Individual customer accounts'
)

# Define a feature view — a named, versioned set of features
customer_spend_features = FeatureView(
    name='customer_spend_features',
    entities=[customer_entity],
    feature_df=spend_feature_query,  # the SQL or DataFrame defining the features
    refresh_freq='1 day',
    desc='30/60/90-day spend and recency features'
)

# Register with the Feature Store
fs = FeatureStore(session, database='ML', schema='FEATURE_STORE')
fs.register_entity(customer_entity)
fs.register_feature_view(customer_spend_features, version='V1')
```

Explanation: Registered feature views are materialised as Snowflake tables on the defined refresh schedule. Any ML pipeline in the organisation can retrieve these features by joining on the entity key, ensuring consistent feature definitions across models.
Mapping scikit-learn Transformers to Snowflake#
Teams migrating from scikit-learn will find most common transformers have direct Snowflake equivalents:
| scikit-learn | Snowflake Equivalent |
|---|---|
| `MinMaxScaler` | SQL CTE formula or `snowflake.ml.modeling.preprocessing.MinMaxScaler` |
| `StandardScaler` | SQL `AVG`/`STDDEV` formula or `StandardScaler` |
| `KBinsDiscretizer` | `WIDTH_BUCKET()` |
| `LabelEncoder` | `DENSE_RANK() OVER (ORDER BY category)` |
| `OneHotEncoder` | `IFF(col = 'value', 1, 0)` or `OneHotEncoder` |
| `SimpleImputer` | `COALESCE()` or `SimpleImputer` |
| `RobustScaler` | Percentile-based SQL formula or `RobustScaler` |
Most transformations do not need to be built from scratch: Snowflake's SQL functions provide the building blocks, and Snowpark ML provides the sklearn-compatible API on top.
Business Applications#
Feature engineering in Snowflake unlocks ML use cases that were previously impractical due to data movement constraints:
- Churn prediction: Build recency, frequency, and monetary features directly from transaction tables — updated daily without extract pipelines
- Credit scoring: Aggregate account behaviour features from millions of rows without moving data outside the financial data platform
- Product recommendation: Compute user-item interaction features at scale using Snowflake’s distributed compute
- Fraud detection: Derive real-time behavioural features from event streams via Dynamic Tables, serving inference via SQL function
Best Practices#
From experience, I would recommend the following:
- Fit preprocessing on training data only: When using MinMaxScaler or StandardScaler, the statistics (min, max, mean, stddev) must be calculated on the training set, not the full dataset. Leaking test set statistics into training inflates model performance metrics.
- Store fitted parameters: Save the min, max, mean, and stddev values used during training. At inference time, apply these stored parameters rather than recalculating — otherwise your inference transformations will differ from training transformations as new data arrives.
- Version your feature definitions: Feature definitions change over time. Use the Feature Store's versioning to track what definition was active when each model version was trained.
- Document feature semantics: A column called `spend_scaled` is meaningless without documentation of what it was scaled from and how. The Feature Store's `desc` parameter is the minimum; a fuller data dictionary is better.
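One lightweight way to store fitted parameters is to persist them as JSON alongside the model. This is a sketch with made-up statistics; in practice you might keep them in a Snowflake table keyed by model version instead:

```python
import json

# After fitting on the training set, persist the learned statistics
fitted_params = {'monthly_spend': {'min': 12.5, 'max': 980.0}}
blob = json.dumps(fitted_params)  # could equally be a row in a Snowflake table

# At inference time, reload and apply the SAME statistics
params = json.loads(blob)['monthly_spend']

def scale(x):
    """MinMax scale using the stored training-time min and max."""
    return (x - params['min']) / (params['max'] - params['min'])

scaled = scale(496.25)  # the midpoint of the stored training range
```

Keeping the parameters next to the model version also gives you an audit trail: when a prediction is questioned, you can reconstruct exactly how each feature was produced.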
Conclusion#
Feature engineering is one of the highest-value activities in any ML project, and Snowflake provides the tools to do it
entirely within the warehouse. SQL window functions, WIDTH_BUCKET, conditional aggregation, and Snowpark ML’s
scikit-learn compatible preprocessing classes cover the full range of common transformations.
The result is an ML pipeline that stays within your governed data environment: no extraction, no divergence between training and inference transformations, and feature definitions that any team in the organisation can discover and reuse.
Key Takeaways:
- Feature engineering transforms raw data into the numerical, scaled, and encoded inputs that ML algorithms require
- Snowflake SQL provides native equivalents for most scikit-learn preprocessing functions
- Snowpark ML offers a scikit-learn compatible API for complex or chained transformations that executes inside Snowflake
- The Snowflake Feature Store provides versioned, reusable feature definitions shared across models and teams
- Fitting preprocessing on training data only — and storing those parameters — is essential for consistent inference