Hosting LLMs in SPCS
A comprehensive guide to deploying a custom Hugging Face model on GPUs within Snowpark Container Services.
Last week we talked about SPCS use cases. Today, we implement the most exciting one: Hosting your own Large Language Model.
Why would you do this when Cortex exists?
- Custom Weights: You trained a Llama 3 model on your specific domain data using 10,000 internal documents.
- Version Control: You need to guarantee that the model's behavior never changes, even when the vendor updates the base model.
- Specialized Models: You need a niche model (e.g., Biology-BERT) not available in Cortex.
1. The Container#
We need a Python script that loads the model and serves it over HTTP. Libraries like vLLM or Hugging Face TGI (Text Generation Inference) are excellent for this.
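Before wiring this into a container, it helps to see what the serving endpoint looks like from a client's point of view. Here is a minimal client sketch against vLLM's demo api_server `/generate` route; the localhost URL and helper names (`build_payload`, `extract_text`, `generate`) are illustrative assumptions, handy for testing the image locally before deploying.

```python
import json
import urllib.request

# Assumed local endpoint: the vLLM api_server listens on port 8000.
VLLM_URL = "http://localhost:8000/generate"

def build_payload(prompt: str, max_tokens: int = 128) -> dict:
    # Request body accepted by vLLM's demo api_server /generate route.
    return {"prompt": prompt, "max_tokens": max_tokens, "temperature": 0.7}

def extract_text(response_body: dict) -> str:
    # The server replies with {"text": ["<prompt + completion>", ...]}.
    return response_body["text"][0]

def generate(prompt: str) -> str:
    # POST the prompt and return the first completion.
    req = urllib.request.Request(
        VLLM_URL,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return extract_text(json.load(resp))
```

Running `generate("Hello")` against a locally started container is a quick smoke test before you pay for GPU minutes in Snowflake.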
Dockerfile:

```dockerfile
# NVIDIA base image for CUDA support
FROM nvcr.io/nvidia/pytorch:23.10-py3

RUN pip install vllm
COPY model_loader.py .

# Launch vLLM's API server with the Mistral 7B weights
CMD ["python", "-m", "vllm.entrypoints.api_server", "--model", "mistralai/Mistral-7B-v0.1"]
```

2. The Compute Pool#
LLMs need GPUs. Standard warehouses won’t cut it.
```sql
CREATE COMPUTE POOL gpu_pool
  MIN_NODES = 1
  MAX_NODES = 1
  INSTANCE_FAMILY = GPU_NV_S; -- Small GPU instance type
```

Warning: Compute pools bill you as long as they are running, even if idle. Be diligent about suspending them!
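To keep the bill under control, a pool can be suspended and resumed on demand. A sketch of the relevant commands (the AUTO_SUSPEND_SECS value is just an example):

```sql
-- Suspend when idle to stop billing; resume before serving traffic again.
ALTER COMPUTE POOL gpu_pool SUSPEND;
ALTER COMPUTE POOL gpu_pool RESUME;

-- Optional: auto-suspend after an hour with no active services.
ALTER COMPUTE POOL gpu_pool SET AUTO_SUSPEND_SECS = 3600;

-- Check state (IDLE, ACTIVE, SUSPENDED, ...) and node counts.
SHOW COMPUTE POOLS LIKE 'GPU_POOL';
```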
3. The Service Specification#
The YAML spec defines how Snowflake runs the container.
```yaml
spec:
  containers:
  - name: llm-inference
    image: /db/schema/repo/my-llm:v1
    resources:
      requests:
        nvidia.com/gpu: 1
      limits:
        nvidia.com/gpu: 1
  endpoints:
  - name: api
    port: 8000
    public: false  # Only accessible internally
```

4. The Bridge#
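Before the function can be wired up, the service itself must be created from the spec. A minimal sketch, assuming the YAML above was uploaded as llm_spec.yaml to a hypothetical stage named @specs:

```sql
-- Deploy the spec from a stage (assumes llm_spec.yaml was PUT to @specs).
CREATE SERVICE my_llm_service
  IN COMPUTE POOL gpu_pool
  FROM @specs
  SPECIFICATION_FILE = 'llm_spec.yaml';

-- Poll until the container reports READY before creating the function.
CALL SYSTEM$GET_SERVICE_STATUS('my_llm_service');
```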
To call this from SQL, we create a Service Function.
```sql
CREATE FUNCTION generate_text(prompt TEXT)
  RETURNS TEXT
  SERVICE = my_llm_service
  ENDPOINT = api
  AS '/generate';
```

Conclusion#
Now you have a private, custom LLM running entirely inside your Snowflake account boundary. You can call SELECT generate_text('Hello') from any worksheet. This is ultimate power and flexibility, balanced with the responsibility of managing your own infrastructure: checking logs, managing memory, and keeping images up to date.