❄️
Data Flakes


Raw data is rarely in a form that machine learning algorithms can use effectively. A customer’s transaction history is a sequence of events; an ML model expects numerical features with consistent scale. A product category is a string; a classifier expects a binary indicator for each possible value. The process of transforming raw data into the structured, scaled, and encoded inputs that algorithms require is called feature engineering — and it is one of the most time-consuming parts of any ML project.

The traditional approach moves data out of the warehouse and into a Python environment where libraries like scikit-learn handle these transformations. This introduces data movement, latency, and a disconnect between where the data lives and where it is processed.

In this article, we will discuss how feature engineering can be performed directly within Snowflake, what tools and functions are available, and how this approach benefits data and engineering teams.

Snowflake's features change from time to time, with new capabilities added regularly; it is recommended that you review the documentation for the latest details on which features are included with each edition.

What is Feature Engineering?#

Feature engineering is the process of transforming raw input data into representations that enable machine learning algorithms to learn patterns effectively. It bridges the gap between real-world data — which is messy, heterogeneous, and differently scaled — and the clean numerical inputs that most ML algorithms expect.

Common transformations include:

  • Scaling: Normalising values so different features have comparable ranges (preventing a salary column from dominating a tenure column simply because it has larger numbers)
  • Binning: Grouping continuous values into discrete categories (converting age into age bands)
  • Encoding: Converting categorical strings into numerical representations (turning “Gold”, “Silver”, “Bronze” into 1, 2, 3 or binary one-hot columns)
  • Aggregation: Deriving summary features from historical records (average spend per customer over 90 days)
  • Imputation: Handling missing values in a consistent, reproducible way

The quality of features frequently matters more than the choice of algorithm. A well-engineered feature set with a simple model often outperforms a sophisticated model applied to raw, unprepared data.

The Feature Engineering Pipeline#

A typical feature engineering pipeline sits between raw data and model training:

graph LR
    A[(Raw data<br/>warehouse tables)] --> B[Feature engineering<br/>scale, encode, aggregate]
    B --> C[(Feature store<br/>ML-ready tables)]
    C --> D[Model training]
    C --> E[Real-time inference<br/>via SQL function]
    D --> F[Deployed model]

Keeping this pipeline inside Snowflake means the feature store is a Snowflake table — queryable, versioned, and subject to the same governance as all other data.

Feature Engineering in SQL#

Snowflake’s SQL engine provides most of the building blocks for common feature transformations natively.

Scaling: MinMax Normalisation#

MinMax scaling maps values to a 0–1 range, preserving relative differences while removing absolute scale:

-- MinMax scale the monthly_spend column to a 0-1 range
WITH spend_range AS (
    SELECT
        MIN(monthly_spend) AS min_spend,
        MAX(monthly_spend) AS max_spend
    FROM customer_features
)
SELECT
    customer_id,
    (monthly_spend - min_spend) / NULLIF(max_spend - min_spend, 0) AS monthly_spend_scaled
FROM customer_features, spend_range;

Explanation: The CTE calculates the min and max once across the dataset. The main query applies the MinMax formula to each row. NULLIF prevents division by zero when all values are identical.
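The same formula can be sketched in plain Python to check its behaviour, including the NULLIF-style guard against a zero range (an illustrative snippet, not part of the Snowflake pipeline):

```python
# Python mirror of the SQL MinMax formula above.
def minmax_scale(values):
    lo, hi = min(values), max(values)
    rng = hi - lo
    if rng == 0:
        # Mirrors NULLIF(max - min, 0): the division yields NULL in SQL
        return [None] * len(values)
    return [(v - lo) / rng for v in values]

print(minmax_scale([10, 20, 40]))  # [0.0, 0.333..., 1.0]
print(minmax_scale([5, 5]))        # [None, None] -- all values identical
```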

Scaling: Standard Normalisation (Z-Score)#

Standard scaling centres values around zero with unit variance — appropriate when your algorithm assumes normally distributed features:

-- Standardise tenure_months by mean and standard deviation
WITH stats AS (
    SELECT
        AVG(tenure_months)    AS mean_tenure,
        STDDEV(tenure_months) AS stddev_tenure
    FROM customer_features
)
SELECT
    customer_id,
    (tenure_months - mean_tenure) / NULLIF(stddev_tenure, 0) AS tenure_standardised
FROM customer_features, stats;

Binning: Creating Age Bands#

Binning converts continuous values into discrete categories, useful when the relationship between a feature and the target is non-linear:

-- Bin customer age into bands using WIDTH_BUCKET
SELECT
    customer_id,
    age,
    WIDTH_BUCKET(age, 18, 80, 6) AS age_band   -- 6 equal-width bands from 18 to 80
FROM customers;

Explanation: WIDTH_BUCKET(value, min, max, num_buckets) assigns each value to one of num_buckets equal-width intervals between min and max. This is Snowflake’s native equivalent of scikit-learn’s KBinsDiscretizer.
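The bucketing arithmetic can be sketched in Python to make the edge cases explicit: WIDTH_BUCKET returns 0 for values below the minimum and num_buckets + 1 for values at or above the maximum (an illustrative sketch of the semantics, not Snowflake's implementation):

```python
import math

def width_bucket(value, lo, hi, num_buckets):
    """Equal-width binning in the style of WIDTH_BUCKET:
    bucket 0 below the range, num_buckets + 1 at or above it."""
    if value < lo:
        return 0
    if value >= hi:
        return num_buckets + 1
    return math.floor((value - lo) / (hi - lo) * num_buckets) + 1

# Six equal-width age bands from 18 to 80, as in the query above
print([width_bucket(a, 18, 80, 6) for a in (17, 18, 35, 79, 80)])  # [0, 1, 2, 6, 7]
```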

Encoding: One-Hot Encoding#

One-hot encoding converts a categorical column into binary indicator columns — one per category:

-- One-hot encode the membership_tier column
SELECT
    customer_id,
    IFF(membership_tier = 'Gold', 1, 0)     AS tier_gold,
    IFF(membership_tier = 'Silver', 1, 0)   AS tier_silver,
    IFF(membership_tier = 'Bronze', 1, 0)   AS tier_bronze,
    IFF(membership_tier IS NULL, 1, 0)      AS tier_unknown
FROM customers;

Explanation: For a column with known, stable categories, static IFF expressions are the simplest approach. For dynamic categories, a pivot or array-based approach is more robust.
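To see why static IFF columns break down as categories change, here is a hedged Python sketch of the dynamic alternative: derive the category list from the data itself, then emit one indicator per category (the function and column names are illustrative):

```python
def one_hot(rows, column):
    # Derive the category list from the data itself,
    # so a new tier does not require editing the query
    categories = sorted({r[column] for r in rows if r[column] is not None})
    encoded = []
    for r in rows:
        features = {f"{column}_{c}": int(r[column] == c) for c in categories}
        features[f"{column}_unknown"] = int(r[column] is None)
        encoded.append({"customer_id": r["customer_id"], **features})
    return encoded

rows = [
    {"customer_id": 1, "membership_tier": "Gold"},
    {"customer_id": 2, "membership_tier": None},
]
print(one_hot(rows, "membership_tier"))
```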

Aggregation: Historical Features#

Aggregate features capture behaviour patterns over time — often the most predictive features in customer or product ML models:

-- Build a feature table with 30, 60, and 90-day spend aggregations per customer
SELECT
    customer_id,
    SUM(CASE WHEN order_date >= CURRENT_DATE - 30 THEN order_total ELSE 0 END) AS spend_30d,
    SUM(CASE WHEN order_date >= CURRENT_DATE - 60 THEN order_total ELSE 0 END) AS spend_60d,
    SUM(CASE WHEN order_date >= CURRENT_DATE - 90 THEN order_total ELSE 0 END) AS spend_90d,
    COUNT(DISTINCT CASE WHEN order_date >= CURRENT_DATE - 30 THEN order_id END) AS orders_30d,
    MAX(order_date)                                                               AS last_order_date,
    DATEDIFF('day', MAX(order_date), CURRENT_DATE)                               AS days_since_last_order
FROM orders
GROUP BY customer_id;

Explanation: Conditional aggregation with CASE WHEN is more efficient than multiple separate queries. DATEDIFF calculates recency — a key feature for churn and next-purchase models.

Feature Engineering with Snowpark ML#

For more complex transformations or when you want a scikit-learn compatible API with reproducible pipelines, Snowpark ML provides Python-native preprocessing classes that execute inside Snowflake:
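A minimal sketch of such a pipeline, assuming a Snowpark session and a customer_features table; the column names, step names, and database objects are illustrative, not prescriptive:

```python
from snowflake.ml.modeling.pipeline import Pipeline
from snowflake.ml.modeling.preprocessing import MinMaxScaler, OneHotEncoder

# train_df is a Snowpark DataFrame over the training rows,
# e.g. session.table("CUSTOMER_FEATURES")
pipe = Pipeline(steps=[
    ("scale", MinMaxScaler(
        input_cols=["MONTHLY_SPEND"],
        output_cols=["MONTHLY_SPEND_SCALED"],
    )),
    ("encode", OneHotEncoder(
        input_cols=["MEMBERSHIP_TIER"],
        output_cols=["TIER"],
    )),
])

pipe.fit(train_df)                    # statistics computed in the warehouse
features_df = pipe.transform(train_df)
```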

Explanation: Despite the familiar scikit-learn syntax, all computation runs inside Snowflake’s warehouse. The pipeline learns statistics (min, max, mean, categories) from the training data during fit and applies them consistently during transform. This ensures that the same transformations applied during training are applied identically at inference time.

The Snowflake Feature Store#

For teams running multiple ML models, managing features consistently across models is a significant challenge. Snowflake’s Feature Store provides a governed catalogue of feature definitions that can be shared across teams and reused across models:
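A minimal sketch of registering a feature view, assuming an existing Snowpark session and a Snowpark DataFrame of spend aggregations; the database, warehouse, entity, and feature names here are illustrative:

```python
from snowflake.ml.feature_store import (
    CreationMode, Entity, FeatureStore, FeatureView,
)

fs = FeatureStore(
    session=session,
    database="ML_DB",
    name="FEATURES",
    default_warehouse="ML_WH",
    creation_mode=CreationMode.CREATE_IF_NOT_EXIST,
)

# Entities define the join keys features are keyed on
customer = Entity(name="CUSTOMER", join_keys=["CUSTOMER_ID"])
fs.register_entity(customer)

# spend_df is a Snowpark DataFrame with the 30/60/90-day aggregations
fv = FeatureView(
    name="CUSTOMER_SPEND",
    entities=[customer],
    feature_df=spend_df,
    refresh_freq="1 day",
    desc="30/60/90-day spend aggregations per customer",
)
fs.register_feature_view(feature_view=fv, version="1")
```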

Explanation: Registered feature views are materialised as Snowflake tables on the defined refresh schedule. Any ML pipeline in the organisation can retrieve these features by joining on the entity key, ensuring consistent feature definitions across models.

Mapping scikit-learn Transformers to Snowflake#

Teams migrating from scikit-learn will find most common transformers have direct Snowflake equivalents:

scikit-learn     | Snowflake Equivalent
MinMaxScaler     | SQL CTE formula or snowflake.ml.preprocessing.MinMaxScaler
StandardScaler   | SQL AVG/STDDEV formula or StandardScaler
KBinsDiscretizer | WIDTH_BUCKET()
LabelEncoder     | DENSE_RANK() OVER (ORDER BY category)
OneHotEncoder    | IFF(col = 'value', 1, 0) or OneHotEncoder
SimpleImputer    | COALESCE() or SimpleImputer
RobustScaler     | Percentile-based SQL formula or RobustScaler

Most of these transformations do not need to be built from scratch: Snowflake’s SQL functions provide the building blocks, and Snowpark ML provides the sklearn-compatible API on top.

Business Applications#

Feature engineering in Snowflake unlocks ML use cases that were previously impractical due to data movement constraints:

  • Churn prediction: Build recency, frequency, and monetary features directly from transaction tables — updated daily without extract pipelines
  • Credit scoring: Aggregate account behaviour features from millions of rows without moving data outside the financial data platform
  • Product recommendation: Compute user-item interaction features at scale using Snowflake’s distributed compute
  • Fraud detection: Derive real-time behavioural features from event streams via Dynamic Tables, serving inference via SQL function

Best Practices#

From experience, I recommend the following:

  1. Fit preprocessing on training data only: When using MinMaxScaler or StandardScaler, the statistics (min, max, mean, stddev) must be calculated on the training set, not the full dataset. Leaking test set statistics into training inflates model performance metrics.

  2. Store fitted parameters: Save the min, max, mean, and stddev values used during training. At inference time, apply these stored parameters rather than recalculating — otherwise your inference transformations will differ from training transformations as new data arrives.

  3. Version your feature definitions: Feature definitions change over time. Use the Feature Store’s versioning to track what definition was active when each model version was trained.

  4. Document feature semantics: A column called spend_scaled is meaningless without documentation of what it was scaled from and how. The Feature Store’s desc parameter is the minimum; a fuller data dictionary is better.
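Points 1 and 2 above can be illustrated with a small standard-library sketch (the numbers are made up): fit the scaling statistics on training rows only, store them, and reapply the stored parameters at inference time rather than recalculating.

```python
# Fit scaling statistics on TRAINING data only, then persist them.
train = [100, 200, 300]
fitted = {"min": min(train), "max": max(train)}  # store these with the model

def apply_scaler(value, params):
    # Reuse the stored parameters at inference time
    rng = params["max"] - params["min"]
    return (value - params["min"]) / rng if rng else None

# Inference months later: new data may fall outside the training range,
# but we still apply the stored parameters, not recomputed ones.
print(apply_scaler(250, fitted))  # 0.75
print(apply_scaler(400, fitted))  # 1.5 -- beyond the training range, by design
```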

Conclusion#

Feature engineering is one of the highest-value activities in any ML project, and Snowflake provides the tools to do it entirely within the warehouse. SQL window functions, WIDTH_BUCKET, conditional aggregation, and Snowpark ML’s scikit-learn compatible preprocessing classes cover the full range of common transformations.

The result is an ML pipeline that stays within your governed data environment: no extraction, no divergence between training and inference transformations, and feature definitions that any team in the organisation can discover and reuse.

Key Takeaways:

  • Feature engineering transforms raw data into the numerical, scaled, and encoded inputs that ML algorithms require
  • Snowflake SQL provides native equivalents for most scikit-learn preprocessing functions
  • Snowpark ML offers a scikit-learn compatible API for complex or chained transformations that executes inside Snowflake
  • The Snowflake Feature Store provides versioned, reusable feature definitions shared across models and teams
  • Fitting preprocessing on training data only — and storing those parameters — is essential for consistent inference



Disclaimer

The information provided on this website is for general informational purposes only. While we strive to keep the information up to date and correct, there may be instances where information is outdated or links are no longer valid. We make no representations or warranties of any kind, express or implied, about the completeness, accuracy, reliability, suitability, or availability with respect to the website or the information, products, services, or related graphics contained on the website for any purpose. Any reliance you place on such information is therefore strictly at your own risk.