❄️ Data Flakes

Machine learning workloads and data platforms have historically lived in separate worlds. You extract data from your warehouse, move it to a Python environment or cloud ML service, train a model, and then find a way to get predictions back into the warehouse for reporting. Each handoff adds latency, infrastructure to manage, and security surface area.

Snowflake’s platform has evolved to change this. In this article, we will look at three approaches Snowflake now offers for ML workloads: Cortex ML functions for turnkey predictions, Snowpark ML for custom model development, and in-database model training for teams that want to keep their entire ML pipeline within Snowflake.

Features change over time, with new capabilities added regularly; it is recommended that you review the documentation for the latest details on which features are included with each Edition.

The Case for In-Database ML#

The traditional ML workflow involves a fundamental tension: your data lives in a warehouse where it is secure, governed, and well-integrated with your business context, but your ML tools want the data in a Python environment where they can operate freely.

The case for keeping ML inside Snowflake is straightforward:

  • Data security: The data never leaves the governed environment
  • No data movement: Training on terabytes of data without extract costs or latency
  • Operational simplicity: One fewer platform to manage, monitor, and secure
  • Consistent governance: RBAC, masking, and auditing apply to ML pipelines as they do to SQL queries

Snowflake offers three distinct approaches to ML, suited to different use cases and team capabilities.

Three Approaches to ML in Snowflake#

```mermaid
graph TD
    A{What is your use case?} -->|Forecast, anomaly, classification| B[Cortex ML Functions<br/>No ML expertise required]
    A -->|Custom model with Python| C[Snowpark ML<br/>scikit-learn compatible API]
    A -->|Full SQL-based pipeline| D[In-Database Model Training<br/>SQL and stored procedures]
    B --> E[Fastest time to value]
    C --> F[Most flexibility]
    D --> G[Maximum data control]
```

Approach 1: Cortex ML Functions#

For common ML tasks — time-series forecasting, anomaly detection, and classification — Cortex ML provides turnkey functions that require no ML expertise:

```sql
-- Create a forecast model on sales data
CREATE SNOWFLAKE.ML.FORECAST sales_forecast (
    INPUT_DATA        => TABLE(v_weekly_sales),
    SERIES_COLNAME    => 'region',
    TIMESTAMP_COLNAME => 'week_start',
    TARGET_COLNAME    => 'revenue'
);

-- Generate predictions for the next 8 weeks
CALL sales_forecast!FORECAST(
    FORECASTING_PERIODS => 8
);
```

Explanation: The FORECAST function handles feature engineering, model selection, hyperparameter tuning, and prediction generation automatically. You provide clean time-series data and specify the target column; Cortex ML handles the rest. Results are returned as a table with predicted values and confidence intervals.

Cortex ML functions are the right choice when your use case matches a supported task type and you want a production result quickly. The trade-off is that you cannot inspect or customise the underlying model.

Approach 2: Snowpark ML#

Snowpark ML provides a scikit-learn compatible API for custom model development that executes within Snowflake’s compute:
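
The following is a minimal sketch of what such a pipeline looks like. It assumes a live Snowpark `session`, a `CUSTOMER_FEATURES` table, and the column names shown (all illustrative), and requires the `snowflake-ml-python` package:

```python
# Hedged sketch of a Snowpark ML training pipeline.
# Assumes an existing Snowpark `session` and a CUSTOMER_FEATURES table
# with the (illustrative) columns below.
from snowflake.ml.modeling.pipeline import Pipeline
from snowflake.ml.modeling.preprocessing import StandardScaler
from snowflake.ml.modeling.ensemble import RandomForestClassifier

training_df = session.table("CUSTOMER_FEATURES")

pipeline = Pipeline(steps=[
    ("scale", StandardScaler(
        input_cols=["TENURE_MONTHS", "MONTHLY_SPEND"],
        output_cols=["TENURE_SCALED", "SPEND_SCALED"],
    )),
    ("model", RandomForestClassifier(
        input_cols=["TENURE_SCALED", "SPEND_SCALED"],
        label_cols=["CHURNED"],
        output_cols=["PREDICTED_CHURN"],
    )),
])

# fit() runs the training inside Snowflake on the session's warehouse
pipeline.fit(training_df)
```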

Explanation: Despite the familiar scikit-learn interface, the computation executes inside Snowflake. The training data is never moved to a local Python process. Snowpark ML translates the pipeline operations into optimised Snowflake queries and Snowpark jobs, running them on your configured warehouse.

Snowpark ML also integrates with the Snowflake Model Registry for versioned model storage and deployment:

```python
from snowflake.ml.registry import Registry

# Register the trained pipeline
registry = Registry(session=session)
mv = registry.log_model(
    pipeline,
    model_name='customer_churn_model',
    version_name='v1_0',
    sample_input_data=training_df.limit(100),
    comment='Random Forest churn model trained on Q1 2026 data'
)
```

Explanation: The Model Registry stores model artefacts, metadata, and sample inputs within Snowflake. Registered models can be deployed as SQL functions, making predictions available to any team with appropriate access — no Python environment required at inference time.
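
As a hedged illustration of SQL-side inference (the model, version, and column names are assumptions, and the exact syntax depends on your Snowflake version), a registered model can be invoked along these lines:

```sql
-- Hypothetical: call the registered churn model directly from SQL
WITH churn AS MODEL customer_churn_model VERSION v1_0
SELECT
    customer_id,
    churn!PREDICT(tenure_months, monthly_spend) AS prediction
FROM customer_features;
```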

Approach 3: In-Database Model Training with Stored Procedures#

For teams that want complete control over the training algorithm and prefer to express everything in SQL and stored procedures (JavaScript or Python), Snowflake’s compute engine can support custom ML implementations directly.

The key insight is to offload heavy calculations to Snowflake’s query engine and use the stored procedure as an orchestration layer:

```sql
-- Example: calculate node split candidates using Snowflake's analytical functions.
-- Each candidate split must get its own partition, so the split value is
-- included in the PARTITION BY alongside the left/right side.
SELECT
    feature_column,
    candidate_split_value,
    STDDEV(target_col) OVER (
        PARTITION BY candidate_split_value,
                     CASE WHEN feature_column <= candidate_split_value
                          THEN 'left' ELSE 'right' END
    ) AS child_stddev,
    COUNT(*) OVER (
        PARTITION BY candidate_split_value,
                     CASE WHEN feature_column <= candidate_split_value
                          THEN 'left' ELSE 'right' END
    ) AS child_count
FROM training_data
CROSS JOIN split_candidates
ORDER BY child_stddev ASC;
```

Explanation: Rather than implementing split evaluation in procedural code row-by-row, this pattern pushes the calculation into Snowflake’s distributed query engine using window functions. Snowflake’s parallelism evaluates all candidate splits simultaneously — far more efficient than iterating in application code.

Models trained this way are stored as JSON in a Snowflake table, converted to scoring logic (such as a CASE statement tree), and deployed as a SQL function for inference.
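
To make that conversion step concrete, here is a hedged sketch in Python (the JSON node schema is an assumption for illustration) that turns such a JSON tree into a CASE expression ready to embed in a SQL function:

```python
import json

def tree_to_case(node: dict) -> str:
    """Recursively convert a JSON decision-tree node into a SQL CASE expression.

    Assumed (illustrative) node schema: leaves look like {"value": 0.82};
    internal nodes look like {"feature": ..., "split": ..., "left": ..., "right": ...}.
    """
    if "value" in node:
        return str(node["value"])
    return (
        f"CASE WHEN {node['feature']} <= {node['split']} "
        f"THEN {tree_to_case(node['left'])} "
        f"ELSE {tree_to_case(node['right'])} END"
    )

# A trivial one-split model stored as JSON
model_json = '{"feature": "tenure_months", "split": 6, "left": {"value": 0.82}, "right": {"value": 0.12}}'
sql_expr = tree_to_case(json.loads(model_json))
# sql_expr: "CASE WHEN tenure_months <= 6 THEN 0.82 ELSE 0.12 END"
```

For deeper trees the recursion simply nests further CASE expressions, which Snowflake evaluates efficiently at inference time.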

Storing and Versioning Models#

Regardless of the training approach, trained models should be stored and versioned:
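
A minimal sketch of such a model store follows; the table name and all columns other than `parameters` and `model_object` are assumptions:

```sql
-- Hypothetical model store (names beyond parameters / model_object are illustrative)
CREATE TABLE IF NOT EXISTS ml_models (
    model_name    VARCHAR,
    version       VARCHAR,
    trained_at    TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP(),
    parameters    VARIANT,   -- hyperparameters, for reproducibility
    model_object  VARIANT    -- serialised model, e.g. the JSON tree
);
```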

Explanation: Storing model artefacts as VARIANT allows you to capture any model structure without schema changes. The parameters column records hyperparameters for reproducibility; model_object stores the serialised model for inference.

Choosing the Right Approach#

| Consideration | Cortex ML Functions | Snowpark ML | In-Database Custom |
| --- | --- | --- | --- |
| ML expertise required | None | Moderate | High |
| Time to first result | Hours | Days | Weeks |
| Customisation | None | High | Full |
| Model transparency | Low | High | Full |
| Supported algorithms | Forecast, anomaly, classify | scikit-learn compatible | Any |
| Inference deployment | Automatic | Model Registry | SQL function |

Best Practices#

From experience, I would recommend considering the following:

  1. Start with Cortex ML Functions for common tasks: If your use case fits a supported function type, use it. The time saved on implementation and maintenance is substantial compared to building a custom model.

  2. Use the Model Registry for all Snowpark ML models: Even for prototypes. Versioning and sample input logging from the beginning prevents the “which model is in production?” problem later.

  3. Separate training and inference warehouses: Model training is compute-intensive and can disrupt concurrent analytical workloads. Use a dedicated warehouse for training runs.

  4. Test predictions against a holdout set before deployment: Apply the trained model to a time-based holdout set, not just a random sample. Time-series leakage is a common error in in-database ML pipelines.
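
For point 3, a dedicated training warehouse can be set up along these lines (the name and sizing are illustrative assumptions; size to your data volume):

```sql
-- Hypothetical dedicated warehouse for training runs
CREATE WAREHOUSE IF NOT EXISTS ml_training_wh
    WAREHOUSE_SIZE      = 'LARGE'
    AUTO_SUSPEND        = 60    -- suspend quickly once a training run finishes
    AUTO_RESUME         = TRUE
    INITIALLY_SUSPENDED = TRUE;
```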

Common Pitfalls#

Pitfall: Training on the Full Dataset Without Sampling#

Problem: Training a model on a 10TB table using a single warehouse leads to very long training times and high credit consumption.

Solution: Use Snowflake’s SAMPLE function to create a representative training set. For most ML algorithms, training on a well-sampled 10% of data produces similar accuracy to training on 100% at a fraction of the cost.

```sql
-- Create a 10% random row sample for training
-- (SAMPLE is row-based; for a stratified sample, sample within each stratum)
CREATE OR REPLACE TABLE training_sample AS
SELECT * FROM customer_features
SAMPLE (10);
```

Pitfall: Ignoring Feature Drift After Deployment#

Problem: A model trained on Q1 data is still in production in Q4 with no monitoring. Its accuracy has degraded silently as the underlying data distribution has changed.

Solution: Implement monitoring queries that compare live prediction distributions against training baseline distributions. Set up Snowflake Alerts to notify when prediction distributions shift beyond a defined threshold.
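
As a hedged sketch of such an alert (the table, column, threshold, and notification integration names are all assumptions):

```sql
-- Hypothetical drift alert: fire when the weekly prediction mean moves
-- more than 0.1 away from the training baseline
CREATE OR REPLACE ALERT prediction_drift_alert
    WAREHOUSE = monitoring_wh
    SCHEDULE  = 'USING CRON 0 6 * * * UTC'
    IF (EXISTS (
        SELECT 1
        FROM (
            SELECT AVG(prediction) AS live_mean
            FROM churn_predictions
            WHERE predicted_at >= DATEADD(day, -7, CURRENT_TIMESTAMP())
        ) live
        CROSS JOIN prediction_baseline base
        WHERE ABS(live.live_mean - base.baseline_mean) > 0.1
    ))
    THEN CALL SYSTEM$SEND_EMAIL(
        'drift_notification_integration',
        'ml-team@example.com',
        'Prediction drift detected',
        'Weekly prediction mean has shifted beyond the defined threshold.'
    );
```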

Conclusion#

Snowflake offers a genuinely capable platform for ML workloads, from the turnkey simplicity of Cortex ML Functions to the full flexibility of custom Snowpark ML pipelines. The choice between approaches depends on your use case complexity, your team’s ML expertise, and how much control you need over the training process.

The common thread is that all three approaches keep your data within Snowflake’s governance boundary — no extraction, no external ML platform to manage, and no data movement cost.

Key Takeaways:

  • Cortex ML Functions provide turnkey forecasting, anomaly detection, and classification without ML expertise
  • Snowpark ML offers a scikit-learn compatible API that executes inside Snowflake’s compute
  • In-database model training using stored procedures and SQL window functions suits teams that need full algorithm control
  • The Model Registry provides versioned model storage and SQL-based deployment for any Snowpark ML model
  • Use dedicated warehouses for training to avoid impacting analytical workloads

