Hosting LLMs in SPCS
A comprehensive guide to deploying a custom Hugging Face model on GPUs within Snowpark Container Services.
Last week we talked about SPCS use cases. Today, we implement the most exciting one: Hosting your own Large Language Model.
Why would you do this when Cortex exists?
- Custom Weights: You trained a Llama 3 model on your specific domain data using 10,000 internal documents.
- Version Control: You need to guarantee that the model behavior never changes, not even when the vendor updates the base model.
- Specialized Models: You need a niche model (e.g., Biology-BERT) not available in Cortex.
1. The Container
We need a Python script that loads the model and serves it. Libraries like vLLM or Hugging Face TGI (Text Generation
Inference) are excellent for this.
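As a rough idea of what such a script could look like, here is a minimal sketch of a launcher that only assembles the command line for vLLM's demo API server and hands off to it. The MODEL_NAME environment variable, the --serve switch, and the function name are our own illustrative choices, not something vLLM or Snowflake defines. The --model, --host, and --port flags are the ones the vLLM server accepts.

```python
import os
import subprocess
import sys

# Hypothetical default; matches the model used elsewhere in this post.
DEFAULT_MODEL = "mistralai/Mistral-7B-v0.1"


def build_server_args(model: str, host: str = "0.0.0.0", port: int = 8000) -> list[str]:
    """Build the command that starts vLLM's demo API server for `model`."""
    return [
        sys.executable, "-m", "vllm.entrypoints.api_server",
        "--model", model,
        "--host", host,        # listen on all interfaces inside the container
        "--port", str(port),   # must match the port exposed by the service
    ]


if __name__ == "__main__":
    # MODEL_NAME is an assumed, optional override for the model to serve.
    cmd = build_server_args(os.environ.get("MODEL_NAME", DEFAULT_MODEL))
    if "--serve" in sys.argv[1:]:
        # Actually start the server (requires vllm and a GPU).
        subprocess.run(cmd, check=True)
    else:
        # Dry run: show the command without starting anything.
        print(" ".join(cmd))
```

Running the script without --serve just prints the command, which is a cheap way to check the configuration before building the image.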
Dockerfile:
FROM nvcr.io/nvidia/pytorch:23.10-py3
# Using NVIDIA base image for CUDA support
RUN pip install vllm
COPY model_loader.py .
CMD ["python", "-m", "vllm.entrypoints.api_server", "--model", "mistralai/Mistral-7B-v0.1"]
2. The Compute Pool
LLMs need GPUs. Standard warehouses won’t cut it.
CREATE COMPUTE POOL gpu_pool
MIN_NODES = 1
MAX_NODES = 1
INSTANCE_FAMILY = GPU_NV_S; -- Small GPU instance type
Warning: Compute pools bill you as long as they are running, even if idle. Be diligent about suspending them!
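Whether a small GPU instance is enough mostly comes down to GPU memory. Here is a back-of-the-envelope sketch: the weights take roughly parameter count times bytes per parameter, and whatever is left has to hold the KV cache and activations. The figures are assumptions to verify, not guarantees: about 7.24 billion parameters for Mistral 7B (check the model card) and 24 GB of GPU memory for GPU_NV_S (check Snowflake's instance family table). The 20% headroom is an arbitrary illustrative choice, and the function names are ours.

```python
GIB = 1024 ** 3


def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for the weights alone (2 bytes per parameter = fp16/bf16)."""
    return num_params * bytes_per_param / GIB


def fits_on_gpu(num_params: float, gpu_mem_gib: float,
                bytes_per_param: int = 2, headroom: float = 0.2) -> bool:
    """True if the weights leave at least `headroom` of GPU memory free for KV cache etc."""
    budget = gpu_mem_gib * (1.0 - headroom)
    return weight_memory_gib(num_params, bytes_per_param) <= budget


if __name__ == "__main__":
    mistral_7b = 7.24e9   # assumed parameter count
    small_gpu = 24.0      # assumed GPU memory for GPU_NV_S, treated as GiB
    print(f"weights: {weight_memory_gib(mistral_7b):.1f} GiB")
    print("7B fits: ", fits_on_gpu(mistral_7b, small_gpu))
    print("13B fits:", fits_on_gpu(13e9, small_gpu))
```

Under these assumptions a 7B model in half precision fits with room to spare, while a 13B model does not and would need a larger instance family or quantization.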
3. The Service Specification
The YAML spec defines how Snowflake runs the container.
spec:
  containers:
  - name: llm-inference
    image: /db/schema/repo/my-llm:v1
    resources:
      requests:
        nvidia.com/gpu: 1
      limits:
        nvidia.com/gpu: 1
  endpoints:
  - name: api
    port: 8000
    public: false  # Only accessible internally
4. The Bridge
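One detail to plan for: a service function does not send the bare prompt. Snowflake posts rows in batches as JSON of the form {"data": [[row_index, arg], ...]} and expects the same shape back, while vLLM's demo server expects {"prompt": ...} plus sampling parameters on /generate. So in practice a thin translation layer sits between the two. The sketch below shows only that translation, with the actual HTTP call to vLLM passed in as a `generate` callable. All function names here are hypothetical, and max_tokens=128 is an arbitrary default.

```python
from typing import Callable


def rows_to_prompts(body: dict) -> list[tuple[int, str]]:
    """Unpack a Snowflake batch: each row is [row_index, prompt]."""
    return [(row[0], row[1]) for row in body["data"]]


def to_vllm_request(prompt: str, max_tokens: int = 128) -> dict:
    """Build the JSON body for the vLLM demo server's /generate route."""
    return {"prompt": prompt, "max_tokens": max_tokens}


def handle_batch(body: dict, generate: Callable[[dict], str]) -> dict:
    """Run every row through `generate` and return Snowflake's response shape.

    `generate` stands in for the HTTP call to vLLM; it takes the request
    body and returns the generated text for that prompt.
    """
    results = []
    for row_index, prompt in rows_to_prompts(body):
        text = generate(to_vllm_request(prompt))
        results.append([row_index, text])  # keep the row index so Snowflake can match rows
    return {"data": results}


if __name__ == "__main__":
    # Stand-in for the model so the sketch runs without a GPU.
    def fake_generate(request: dict) -> str:
        return request["prompt"].upper()

    batch = {"data": [[0, "hello"], [1, "snowflake"]]}
    print(handle_batch(batch, fake_generate))
```

In a real container this logic would live behind a small HTTP route, and that route's path is what the function's AS clause would then name.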
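To call this from SQL, we create a Service Function. It points at a running service (my_llm_service here, which must first be created from the spec above with CREATE SERVICE on gpu_pool) and at one of that service's endpoints.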
CREATE FUNCTION generate_text(prompt text)
RETURNS text
SERVICE = my_llm_service
ENDPOINT = api
AS '/generate';
Conclusion
Now you have a private, custom LLM running inside your Snowflake account. You can call SELECT generate_text('Hello') from any worksheet.
This gives you full control and flexibility, balanced with the responsibility of managing your own infrastructure (checking
logs, managing memory, updating drivers).