❄️ Data Flakes

Machine learning workloads and data platforms have historically lived in separate worlds. You extract data from your warehouse, move it to a Python environment or cloud ML service, train a model, and then find a way to get predictions back into the warehouse for reporting. Each handoff adds latency, infrastructure to manage, and security surface area.

Snowflake’s platform has evolved to change this. In this article, we will look at three approaches Snowflake now offers for ML workloads: Cortex ML functions for turnkey predictions, Snowpark ML for custom model development, and in-database model training for teams that want to keep their entire ML pipeline within Snowflake.

Features change over time, with new capabilities added regularly; it is recommended that you review the documentation for the latest details on which features are included with each Edition.

The Case for In-Database ML#

The traditional ML workflow involves a fundamental tension: your data lives in a warehouse where it is secure, governed, and well-integrated with your business context, but your ML tools want the data in a Python environment where they can operate freely.

The case for keeping ML inside Snowflake is straightforward:

  • Data security: The data never leaves the governed environment
  • No data movement: Training on terabytes of data without extract costs or latency
  • Operational simplicity: One fewer platform to manage, monitor, and secure
  • Consistent governance: RBAC, masking, and auditing apply to ML pipelines as they do to SQL queries

Snowflake offers three distinct approaches to ML, suited to different use cases and team capabilities.

Three Approaches to ML in Snowflake#

```mermaid
graph TD
    A{What is your use case?} -->|Forecast, anomaly, classification| B[Cortex ML Functions<br/>No ML expertise required]
    A -->|Custom model with Python| C[Snowpark ML<br/>scikit-learn compatible API]
    A -->|Full SQL-based pipeline| D[In-Database Model Training<br/>SQL and stored procedures]
    B --> E[Fastest time to value]
    C --> F[Most flexibility]
    D --> G[Maximum data control]
```

Approach 1: Cortex ML Functions#

For common ML tasks — time-series forecasting, anomaly detection, and classification — Cortex ML provides turnkey functions that require no ML expertise:

```sql
-- Create a forecast model on sales data
CREATE SNOWFLAKE.ML.FORECAST sales_forecast (
    INPUT_DATA        => TABLE(v_weekly_sales),
    SERIES_COLNAME    => 'region',
    TIMESTAMP_COLNAME => 'week_start',
    TARGET_COLNAME    => 'revenue'
);

-- Generate predictions for the next 8 weeks
CALL sales_forecast!FORECAST(
    FORECASTING_PERIODS => 8
);
```

Explanation: The FORECAST function handles feature engineering, model selection, hyperparameter tuning, and prediction generation automatically. You provide clean time-series data and specify the target column; Cortex ML handles the rest. Results are returned as a table with predicted values and confidence intervals.

Cortex ML functions are the right choice when your use case matches a supported task type and you want a production result quickly. The trade-off is that you cannot inspect or customise the underlying model.

Approach 2: Snowpark ML#

Snowpark ML provides a scikit-learn compatible API for custom model development that executes within Snowflake’s compute:
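
The following is a minimal sketch of what such a pipeline looks like. It assumes a live Snowpark `session`, a `CUSTOMER_FEATURES` table, and the column names shown (all illustrative), and requires the `snowflake-ml-python` package:

```python
# Hedged sketch of a Snowpark ML training pipeline.
# Assumes an existing Snowpark `session` and a CUSTOMER_FEATURES table
# with the (illustrative) columns below.
from snowflake.ml.modeling.pipeline import Pipeline
from snowflake.ml.modeling.preprocessing import StandardScaler
from snowflake.ml.modeling.ensemble import RandomForestClassifier

training_df = session.table("CUSTOMER_FEATURES")

pipeline = Pipeline(steps=[
    ("scale", StandardScaler(
        input_cols=["TENURE_MONTHS", "MONTHLY_SPEND"],
        output_cols=["TENURE_SCALED", "SPEND_SCALED"],
    )),
    ("model", RandomForestClassifier(
        input_cols=["TENURE_SCALED", "SPEND_SCALED"],
        label_cols=["CHURNED"],
        output_cols=["PREDICTED_CHURN"],
    )),
])

# fit() runs the training inside Snowflake on the session's warehouse
pipeline.fit(training_df)
```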

Explanation: Despite the familiar scikit-learn interface, the computation executes inside Snowflake. The training data is never moved to a local Python process. Snowpark ML translates the pipeline operations into optimised Snowflake queries and Snowpark jobs, running them on your configured warehouse.

Snowpark ML also integrates with the Snowflake Model Registry for versioned model storage and deployment:

```python
from snowflake.ml.registry import Registry

# Register the trained pipeline
registry = Registry(session=session)
mv = registry.log_model(
    pipeline,
    model_name='customer_churn_model',
    version_name='v1_0',
    sample_input_data=training_df.limit(100),
    comment='Random Forest churn model trained on Q1 2026 data'
)
```

Explanation: The Model Registry stores model artefacts, metadata, and sample inputs within Snowflake. Registered models can be deployed as SQL functions, making predictions available to any team with appropriate access — no Python environment required at inference time.
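
As a hedged illustration of SQL-side inference (the model, version, and column names are assumptions, and the exact syntax depends on your Snowflake version), a registered model can be invoked along these lines:

```sql
-- Hypothetical: call the registered churn model directly from SQL
WITH churn AS MODEL customer_churn_model VERSION v1_0
SELECT
    customer_id,
    churn!PREDICT(tenure_months, monthly_spend) AS prediction
FROM customer_features;
```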

Approach 3: In-Database Model Training with Stored Procedures#

For teams that want complete control over the training algorithm and prefer to express everything in SQL and stored procedures (JavaScript or Python), Snowflake’s compute engine can support custom ML implementations directly.

The key insight is to offload heavy calculations to Snowflake’s query engine and use the stored procedure as an orchestration layer:

```sql
-- Example: calculate node split candidates using Snowflake's analytical functions.
-- Each candidate split must get its own partition, so the split value is
-- included in the PARTITION BY alongside the left/right side.
SELECT
    feature_column,
    candidate_split_value,
    STDDEV(target_col) OVER (
        PARTITION BY candidate_split_value,
                     CASE WHEN feature_column <= candidate_split_value
                          THEN 'left' ELSE 'right' END
    ) AS child_stddev,
    COUNT(*) OVER (
        PARTITION BY candidate_split_value,
                     CASE WHEN feature_column <= candidate_split_value
                          THEN 'left' ELSE 'right' END
    ) AS child_count
FROM training_data
CROSS JOIN split_candidates
ORDER BY child_stddev ASC;
```

Explanation: Rather than implementing split evaluation in procedural code row-by-row, this pattern pushes the calculation into Snowflake’s distributed query engine using window functions. Snowflake’s parallelism evaluates all candidate splits simultaneously — far more efficient than iterating in application code.

Models trained this way are stored as JSON in a Snowflake table, converted to scoring logic (such as a CASE statement tree), and deployed as a SQL function for inference.
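
To make that conversion step concrete, here is a hedged sketch in Python (the JSON node schema is an assumption for illustration) that turns such a JSON tree into a CASE expression ready to embed in a SQL function:

```python
import json

def tree_to_case(node: dict) -> str:
    """Recursively convert a JSON decision-tree node into a SQL CASE expression.

    Assumed (illustrative) node schema: leaves look like {"value": 0.82};
    internal nodes look like {"feature": ..., "split": ..., "left": ..., "right": ...}.
    """
    if "value" in node:
        return str(node["value"])
    return (
        f"CASE WHEN {node['feature']} <= {node['split']} "
        f"THEN {tree_to_case(node['left'])} "
        f"ELSE {tree_to_case(node['right'])} END"
    )

# A trivial one-split model stored as JSON
model_json = '{"feature": "tenure_months", "split": 6, "left": {"value": 0.82}, "right": {"value": 0.12}}'
sql_expr = tree_to_case(json.loads(model_json))
# sql_expr: "CASE WHEN tenure_months <= 6 THEN 0.82 ELSE 0.12 END"
```

For deeper trees the recursion simply nests further CASE expressions, which Snowflake evaluates efficiently at inference time.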

Storing and Versioning Models#

Regardless of the training approach, trained models should be stored and versioned:
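
A minimal sketch of such a model store follows; the table name and all columns other than `parameters` and `model_object` are assumptions:

```sql
-- Hypothetical model store (names beyond parameters / model_object are illustrative)
CREATE TABLE IF NOT EXISTS ml_models (
    model_name    VARCHAR,
    version       VARCHAR,
    trained_at    TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP(),
    parameters    VARIANT,   -- hyperparameters, for reproducibility
    model_object  VARIANT    -- serialised model, e.g. the JSON tree
);
```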

Explanation: Storing model artefacts as VARIANT allows you to capture any model structure without schema changes. The parameters column records hyperparameters for reproducibility; model_object stores the serialised model for inference.

Choosing the Right Approach#

| Consideration | Cortex ML Functions | Snowpark ML | In-Database Custom |
| --- | --- | --- | --- |
| ML expertise required | None | Moderate | High |
| Time to first result | Hours | Days | Weeks |
| Customisation | None | High | Full |
| Model transparency | Low | High | Full |
| Supported algorithms | Forecast, anomaly, classify | scikit-learn compatible | Any |
| Inference deployment | Automatic | Model Registry | SQL function |

Best Practices#

From experience, I would recommend considering the following:

  1. Start with Cortex ML Functions for common tasks: If your use case fits a supported function type, use it. The time saved on implementation and maintenance is substantial compared to building a custom model.

  2. Use the Model Registry for all Snowpark ML models: Even for prototypes. Versioning and sample input logging from the beginning prevents the “which model is in production?” problem later.

  3. Separate training and inference warehouses: Model training is compute-intensive and can disrupt concurrent analytical workloads. Use a dedicated warehouse for training runs.

  4. Test predictions against a holdout set before deployment: Apply the trained model to a time-based holdout set, not just a random sample. Time-series leakage is a common error in in-database ML pipelines.
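
For point 3, a dedicated training warehouse can be set up along these lines (the name and sizing are illustrative assumptions; size to your data volume):

```sql
-- Hypothetical dedicated warehouse for training runs
CREATE WAREHOUSE IF NOT EXISTS ml_training_wh
    WAREHOUSE_SIZE      = 'LARGE'
    AUTO_SUSPEND        = 60    -- suspend quickly once a training run finishes
    AUTO_RESUME         = TRUE
    INITIALLY_SUSPENDED = TRUE;
```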

Common Pitfalls#

Pitfall: Training on the Full Dataset Without Sampling#

Problem: Training a model on a 10TB table using a single warehouse leads to very long training times and high credit consumption.

Solution: Use Snowflake’s SAMPLE function to create a representative training set. For most ML algorithms, training on a well-sampled 10% of data produces similar accuracy to training on 100% at a fraction of the cost.

```sql
-- Create a 10% random row sample for training
-- (SAMPLE is row-based; for a stratified sample, sample within each stratum)
CREATE OR REPLACE TABLE training_sample AS
SELECT * FROM customer_features
SAMPLE (10);
```

Pitfall: Ignoring Feature Drift After Deployment#

Problem: A model trained on Q1 data is still in production in Q4 with no monitoring. Its accuracy has degraded silently as the underlying data distribution has changed.

Solution: Implement monitoring queries that compare live prediction distributions against training baseline distributions. Set up Snowflake Alerts to notify when prediction distributions shift beyond a defined threshold.
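
As a hedged sketch of such an alert (the table, column, threshold, and notification integration names are all assumptions):

```sql
-- Hypothetical drift alert: fire when the weekly prediction mean moves
-- more than 0.1 away from the training baseline
CREATE OR REPLACE ALERT prediction_drift_alert
    WAREHOUSE = monitoring_wh
    SCHEDULE  = 'USING CRON 0 6 * * * UTC'
    IF (EXISTS (
        SELECT 1
        FROM (
            SELECT AVG(prediction) AS live_mean
            FROM churn_predictions
            WHERE predicted_at >= DATEADD(day, -7, CURRENT_TIMESTAMP())
        ) live
        CROSS JOIN prediction_baseline base
        WHERE ABS(live.live_mean - base.baseline_mean) > 0.1
    ))
    THEN CALL SYSTEM$SEND_EMAIL(
        'drift_notification_integration',
        'ml-team@example.com',
        'Prediction drift detected',
        'Weekly prediction mean has shifted beyond the defined threshold.'
    );
```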

Conclusion#

Snowflake offers a genuinely capable platform for ML workloads, from the turnkey simplicity of Cortex ML Functions to the full flexibility of custom Snowpark ML pipelines. The choice between approaches depends on your use case complexity, your team’s ML expertise, and how much control you need over the training process.

The common thread is that all three approaches keep your data within Snowflake’s governance boundary — no extraction, no external ML platform to manage, and no data movement cost.

Key Takeaways:

  • Cortex ML Functions provide turnkey forecasting, anomaly detection, and classification without ML expertise
  • Snowpark ML offers a scikit-learn compatible API that executes inside Snowflake’s compute
  • In-database model training using stored procedures and SQL window functions suits teams that need full algorithm control
  • The Model Registry provides versioned model storage and SQL-based deployment for any Snowpark ML model
  • Use dedicated warehouses for training to avoid impacting analytical workloads

