Hosting LLMs in SPCS
A comprehensive guide to deploying a custom Hugging Face model on GPUs within Snowpark Container Services.
Last week we talked about SPCS use cases. Today, we implement the most exciting one: Hosting your own Large Language Model.
Why would you do this when Cortex exists?
- Custom Weights: You trained a Llama 3 model on your specific domain data using 10,000 internal documents.
- Version Control: You need to guarantee that the model's behavior never changes, even when the vendor updates the base model.
- Specialized Models: You need a niche model (e.g., Biology-BERT) not available in Cortex.
1. The Container#
We need a Python script that loads the model and serves it over HTTP. Libraries like vLLM or Hugging Face TGI (Text Generation Inference) are excellent for this.
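Before wiring this into a container, it helps to see what the serving endpoint looks like from a client's point of view. Here is a minimal client sketch against vLLM's demo api_server `/generate` route; the localhost URL and helper names (`build_payload`, `extract_text`, `generate`) are illustrative assumptions, handy for testing the image locally before deploying.

```python
import json
import urllib.request

# Assumed local endpoint: the vLLM api_server listens on port 8000.
VLLM_URL = "http://localhost:8000/generate"

def build_payload(prompt: str, max_tokens: int = 128) -> dict:
    # Request body accepted by vLLM's demo api_server /generate route.
    return {"prompt": prompt, "max_tokens": max_tokens, "temperature": 0.7}

def extract_text(response_body: dict) -> str:
    # The server replies with {"text": ["<prompt + completion>", ...]}.
    return response_body["text"][0]

def generate(prompt: str) -> str:
    # POST the prompt and return the first completion.
    req = urllib.request.Request(
        VLLM_URL,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return extract_text(json.load(resp))
```

Running `generate("Hello")` against a locally started container is a quick smoke test before you pay for GPU minutes in Snowflake.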
Dockerfile:

```dockerfile
# NVIDIA base image for CUDA support
FROM nvcr.io/nvidia/pytorch:23.10-py3

RUN pip install vllm
COPY model_loader.py .

# Launch vLLM's API server with the Mistral 7B weights
CMD ["python", "-m", "vllm.entrypoints.api_server", "--model", "mistralai/Mistral-7B-v0.1"]
```

2. The Compute Pool#
LLMs need GPUs. Standard warehouses won’t cut it.
```sql
CREATE COMPUTE POOL gpu_pool
  MIN_NODES = 1
  MAX_NODES = 1
  INSTANCE_FAMILY = GPU_NV_S; -- Small GPU instance type
```

Warning: Compute pools bill you as long as they are running, even if idle. Be diligent about suspending them!
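To keep the bill under control, a pool can be suspended and resumed on demand. A sketch of the relevant commands (the AUTO_SUSPEND_SECS value is just an example):

```sql
-- Suspend when idle to stop billing; resume before serving traffic again.
ALTER COMPUTE POOL gpu_pool SUSPEND;
ALTER COMPUTE POOL gpu_pool RESUME;

-- Optional: auto-suspend after an hour with no active services.
ALTER COMPUTE POOL gpu_pool SET AUTO_SUSPEND_SECS = 3600;

-- Check state (IDLE, ACTIVE, SUSPENDED, ...) and node counts.
SHOW COMPUTE POOLS LIKE 'GPU_POOL';
```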
3. The Service Specification#
The YAML spec defines how Snowflake runs the container.
```yaml
spec:
  containers:
  - name: llm-inference
    image: /db/schema/repo/my-llm:v1
    resources:
      requests:
        nvidia.com/gpu: 1
      limits:
        nvidia.com/gpu: 1
  endpoints:
  - name: api
    port: 8000
    public: false  # Only accessible internally
```

4. The Bridge#
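Before the function can be wired up, the service itself must be created from the spec. A minimal sketch, assuming the YAML above was uploaded as llm_spec.yaml to a hypothetical stage named @specs:

```sql
-- Deploy the spec from a stage (assumes llm_spec.yaml was PUT to @specs).
CREATE SERVICE my_llm_service
  IN COMPUTE POOL gpu_pool
  FROM @specs
  SPECIFICATION_FILE = 'llm_spec.yaml';

-- Poll until the container reports READY before creating the function.
CALL SYSTEM$GET_SERVICE_STATUS('my_llm_service');
```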
To call this from SQL, we create a Service Function.
```sql
CREATE FUNCTION generate_text(prompt TEXT)
  RETURNS TEXT
  SERVICE = my_llm_service
  ENDPOINT = api
  AS '/generate';
```

Conclusion#
Now you have a private, custom LLM running entirely inside your Snowflake account boundary. You can call SELECT generate_text('Hello') from any worksheet. This is ultimate power and flexibility, balanced with the responsibility of managing your own infrastructure: checking logs, managing memory, and keeping images up to date.