vLLM
Configure vLLM, the high-performance LLM serving engine, to serve self-hosted models through kgateway.
Overview
vLLM is a fast and memory-efficient inference engine for large language models. It’s designed for high-throughput serving and is commonly deployed in Kubernetes clusters for production workloads.
This guide shows two deployment patterns:
- External vLLM: Connect to a vLLM instance running outside your Kubernetes cluster
- In-cluster vLLM: Deploy vLLM within your Kubernetes cluster
Before you begin
Install and set up an agentgateway proxy.

Option 1: Connect to external vLLM instance
Use this option if vLLM is already deployed on dedicated GPU infrastructure outside your Kubernetes cluster.
Set up external vLLM
1. Install and run vLLM on a machine with GPU access. Follow the vLLM installation guide.

2. Start the vLLM OpenAI-compatible server:

   ```sh
   vllm serve meta-llama/Llama-3.1-8B-Instruct \
     --host 0.0.0.0 \
     --port 8000 \
     --dtype auto
   ```

3. Verify that vLLM is accessible:

   ```sh
   curl http://<vllm-server-ip>:8000/v1/models
   ```
Configure Kubernetes to connect to external vLLM
1. Get the IP address of your vLLM server.

2. Create a headless Service and Endpoints pointing to the external vLLM instance:

   ```sh
   kubectl apply -f- <<EOF
   apiVersion: v1
   kind: Service
   metadata:
     name: vllm-external
     namespace: agentgateway-system
   spec:
     type: ClusterIP
     clusterIP: None # Headless service
     ports:
     - port: 8000
       targetPort: 8000
       protocol: TCP
   ---
   apiVersion: v1
   kind: Endpoints
   metadata:
     name: vllm-external
     namespace: agentgateway-system
   subsets:
   - addresses:
     - ip: 10.0.1.50 # Replace with your vLLM server IP
     ports:
     - port: 8000
       protocol: TCP
   EOF
   ```

3. Create an AgentgatewayBackend resource:

   ```sh
   kubectl apply -f- <<EOF
   apiVersion: agentgateway.dev/v1alpha1
   kind: AgentgatewayBackend
   metadata:
     name: vllm
     namespace: agentgateway-system
   spec:
     ai:
       provider:
         openai:
           model: meta-llama/Llama-3.1-8B-Instruct
           host: vllm-external.agentgateway-system.svc.cluster.local
           port: 8000
   EOF
   ```

   Review the following table to understand this configuration.

   | Setting | Description |
   | --- | --- |
   | `ai.provider.openai` | Use the OpenAI-compatible provider for vLLM. |
   | `openai.model` | The model served by vLLM (must match the model vLLM is serving). |
   | `openai.host` | Kubernetes Service DNS name for the external vLLM instance. |
   | `openai.port` | vLLM API port (default: `8000`). |

4. Create an HTTPRoute:

   ```sh
   kubectl apply -f- <<EOF
   apiVersion: gateway.networking.k8s.io/v1
   kind: HTTPRoute
   metadata:
     name: vllm
     namespace: agentgateway-system
   spec:
     parentRefs:
     - name: agentgateway-proxy
       namespace: agentgateway-system
     rules:
     - matches:
       - path:
           type: PathPrefix
           value: /vllm
       backendRefs:
       - name: vllm
         namespace: agentgateway-system
         group: agentgateway.dev
         kind: AgentgatewayBackend
   EOF
   ```

5. Test the setup. If your gateway is exposed externally, send the request to its address:

   ```sh
   curl "$INGRESS_GW_ADDRESS/vllm" -H content-type:application/json -d '{
     "model": "meta-llama/Llama-3.1-8B-Instruct",
     "messages": [
       {
         "role": "user",
         "content": "Explain the benefits of vLLM."
       }
     ]
   }' | jq
   ```

   If you port-forward the gateway to your local machine instead:

   ```sh
   curl "localhost:8080/vllm" -H content-type:application/json -d '{
     "model": "meta-llama/Llama-3.1-8B-Instruct",
     "messages": [
       {
         "role": "user",
         "content": "Explain the benefits of vLLM."
       }
     ]
   }' | jq
   ```
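The response from either command follows the OpenAI chat-completions schema, so any OpenAI-compatible client or a few lines of JSON parsing can consume it. A minimal offline sketch, using an illustrative sample payload rather than captured vLLM output:

```python
import json

# Illustrative chat-completions response in the OpenAI schema;
# a real response comes from the curl command above.
sample = """
{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant",
                  "content": "vLLM batches requests and pages the KV cache."},
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 12, "completion_tokens": 10, "total_tokens": 22}
}
"""

resp = json.loads(sample)
# The assistant's reply lives at choices[0].message.content.
reply = resp["choices"][0]["message"]["content"]
print(reply)
```

The same path (`.choices[0].message.content`) works with `jq` on the command line.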
Option 2: Deploy vLLM in Kubernetes cluster
Use this option to deploy vLLM directly in your Kubernetes cluster alongside agentgateway.
Before you begin
- Kubernetes cluster with GPU nodes (NVIDIA GPUs with CUDA support).
- NVIDIA GPU Operator or device plugin installed.
- Sufficient GPU memory for your chosen model.
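To gauge "sufficient GPU memory", a useful first-order estimate is parameter count times bytes per parameter, before KV cache and activation overhead. A back-of-the-envelope sketch (the dtype sizes and the 8B parameter count are rough assumptions, not vLLM's exact allocator math):

```python
# Approximate bytes per weight by dtype (assumed values).
BYTES_PER_PARAM = {
    "fp16": 2.0,   # also bf16
    "fp8": 1.0,
    "int4": 0.5,   # e.g. AWQ/GPTQ 4-bit quantization
}

def weight_memory_gb(num_params: float, dtype: str) -> float:
    """GPU memory needed for the model weights alone, in GB (no KV cache)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

# Llama-3.1-8B in bf16: roughly 16 GB of weights before any KV cache.
print(weight_memory_gb(8e9, "fp16"))  # -> 16.0
```

The KV cache and activations come on top of this, which is why an 8B model in fp16 does not fit comfortably on a 16 GB GPU without quantization.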
Deploy vLLM in the cluster
1. Create a vLLM Deployment with GPU resources:

   ```sh
   kubectl apply -f- <<EOF
   apiVersion: apps/v1
   kind: Deployment
   metadata:
     name: vllm
     namespace: agentgateway-system
   spec:
     replicas: 1
     selector:
       matchLabels:
         app: vllm
     template:
       metadata:
         labels:
           app: vllm
       spec:
         containers:
         - name: vllm
           image: vllm/vllm-openai:latest
           args:
           - "--model"
           - "meta-llama/Llama-3.1-8B-Instruct"
           - "--host"
           - "0.0.0.0"
           - "--port"
           - "8000"
           - "--dtype"
           - "auto"
           ports:
           - containerPort: 8000
             name: http
           resources:
             requests:
               nvidia.com/gpu: 1 # Request 1 GPU
             limits:
               nvidia.com/gpu: 1 # Limit to 1 GPU
           env:
           - name: HUGGING_FACE_HUB_TOKEN
             valueFrom:
               secretKeyRef:
                 name: hf-token # Create this secret if accessing gated models
                 key: token
                 optional: true
   EOF
   ```

   ℹ️ Model access: For gated models (like Llama), create a Hugging Face token secret:

   ```sh
   kubectl create secret generic hf-token \
     -n agentgateway-system \
     --from-literal=token=<your-hf-token>
   ```

2. Create a Service for the vLLM deployment:

   ```sh
   kubectl apply -f- <<EOF
   apiVersion: v1
   kind: Service
   metadata:
     name: vllm
     namespace: agentgateway-system
   spec:
     selector:
       app: vllm
     ports:
     - port: 8000
       targetPort: 8000
       protocol: TCP
   EOF
   ```

3. Wait for vLLM to be ready:

   ```sh
   kubectl wait --for=condition=ready pod \
     -l app=vllm \
     -n agentgateway-system \
     --timeout=300s
   ```

   ℹ️ Initial startup: vLLM needs to download the model weights on first launch, which can take several minutes depending on model size and network speed. Monitor the logs:

   ```sh
   kubectl logs -f deployment/vllm -n agentgateway-system
   ```

4. Create an AgentgatewayBackend resource:

   ```sh
   kubectl apply -f- <<EOF
   apiVersion: agentgateway.dev/v1alpha1
   kind: AgentgatewayBackend
   metadata:
     name: vllm
     namespace: agentgateway-system
   spec:
     ai:
       provider:
         openai:
           model: meta-llama/Llama-3.1-8B-Instruct
           host: vllm.agentgateway-system.svc.cluster.local
           port: 8000
   EOF
   ```

5. Create an HTTPRoute (same as Option 1, step 4).
vLLM configuration options
Quantization
Reduce memory usage with quantization:
```yaml
args:
- "--model"
- "meta-llama/Llama-3.1-8B-Instruct"
- "--quantization"
- "awq" # or "gptq", "squeezellm"
```

Tensor parallelism
Distribute model across multiple GPUs:
```yaml
args:
- "--model"
- "meta-llama/Llama-3.1-70B-Instruct"
- "--tensor-parallel-size"
- "4" # Use 4 GPUs
resources:
  requests:
    nvidia.com/gpu: 4
  limits:
    nvidia.com/gpu: 4
```

Engine arguments
Tune performance parameters:
```yaml
args:
- "--model"
- "meta-llama/Llama-3.1-8B-Instruct"
- "--max-model-len"
- "4096" # Maximum sequence length
- "--gpu-memory-utilization"
- "0.9" # Use 90% of GPU memory
- "--max-num-seqs"
- "256" # Maximum number of sequences to process in parallel
```

Scaling and high availability
Horizontal scaling
Scale vLLM for higher throughput:
```sh
kubectl scale deployment vllm \
  -n agentgateway-system \
  --replicas=3
```

Kubernetes load balances requests across the vLLM replicas through the Service.
Resource limits
Set appropriate resource requests and limits:
```yaml
resources:
  requests:
    memory: "16Gi"
    nvidia.com/gpu: 1
  limits:
    memory: "32Gi"
    nvidia.com/gpu: 1
```

Node affinity
Pin vLLM to GPU nodes:
```yaml
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: nvidia.com/gpu.present
                operator: In
                values:
                - "true"
```

Monitoring
vLLM exposes Prometheus metrics at /metrics:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: vllm-metrics
  namespace: agentgateway-system
  labels:
    app: vllm
spec:
  selector:
    app: vllm
  ports:
  - port: 8000
    targetPort: 8000
    name: metrics
```

Key metrics to monitor:

- `vllm_request_duration_seconds`: Request latency.
- `vllm_num_requests_running`: Active requests.
- `vllm_gpu_cache_usage_perc`: GPU KV cache utilization.
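Because the endpoint uses the Prometheus text exposition format, you can spot-check it without a full monitoring stack. A minimal parsing sketch; the sample scrape text below is illustrative, and exact metric names and labels vary across vLLM versions:

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse the value lines of a Prometheus text-format scrape into a dict."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, _, value = line.rpartition(" ")
        metrics[name] = float(value)
    return metrics

# Illustrative scrape output; fetch the real thing from /metrics on port 8000.
sample = """
# HELP vllm_num_requests_running Number of requests currently running.
# TYPE vllm_num_requests_running gauge
vllm_num_requests_running{model="meta-llama/Llama-3.1-8B-Instruct"} 3.0
vllm_gpu_cache_usage_perc{model="meta-llama/Llama-3.1-8B-Instruct"} 0.42
"""

for name, value in parse_metrics(sample).items():
    print(name, value)
```

In production, point a Prometheus scrape job at the `vllm-metrics` Service instead of parsing by hand.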
Troubleshooting
Pod stuck in Pending state
What’s happening:
The vLLM pod doesn't start and shows `Pending` status.
Why it’s happening:
No GPU nodes available or insufficient GPU memory.
How to fix it:
1. Check GPU availability:

   ```sh
   kubectl describe nodes | grep -A 5 "nvidia.com/gpu"
   ```

2. Check pod events:

   ```sh
   kubectl describe pod -l app=vllm -n agentgateway-system
   ```
Out of memory errors
What’s happening:
vLLM crashes with CUDA out-of-memory errors.
Why it’s happening:
The model requires more GPU memory than is available.
How to fix it:
- Use a smaller model or quantized variant.
- Reduce `--max-model-len`.
- Lower `--gpu-memory-utilization` (try `0.8` or `0.7`).
- Enable tensor parallelism across more GPUs.
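These knobs trade KV-cache capacity against memory headroom, and a quick estimate shows why: each token in a running sequence stores a key and a value per layer per KV head. A back-of-the-envelope sketch (the layer and head counts below are the published Llama-3.1-8B configuration, used here as assumptions; substitute your model's):

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """KV-cache bytes per token: key + value tensors, per layer, per KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

# Llama-3.1-8B (assumed): 32 layers, 8 KV heads (GQA), head dim 128, fp16.
per_token = kv_bytes_per_token(32, 8, 128)
per_full_seq_gb = per_token * 4096 / 1e9  # one sequence at --max-model-len 4096
print(per_token, round(per_full_seq_gb, 2))  # -> 131072 0.54
```

At these settings, 256 concurrent full-length sequences would need roughly 137 GB of KV cache, so lowering `--max-model-len` or `--max-num-seqs` shrinks the required reservation linearly.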
Slow inference
What’s happening:
High latency on requests.
Why it’s happening:
The model is too large for available GPU memory (swapping to CPU), `--max-num-seqs` is too low for the concurrent request load, or preprocessing is CPU-bound.
How to fix it:
- Increase GPU memory or use a smaller model.
- Tune `--max-num-seqs` and `--max-model-len`.
- Use faster CPUs or increase CPU requests.
Connection refused from agentgateway
What’s happening:
agentgateway cannot reach vLLM service.
Why it’s happening:
The vLLM Service may not exist or may have no endpoints, or network policies may be blocking traffic.
How to fix it:
1. Verify that the vLLM Service exists and has endpoints:

   ```sh
   kubectl get svc vllm -n agentgateway-system
   kubectl get endpoints vllm -n agentgateway-system
   ```

2. Test connectivity from another pod:

   ```sh
   kubectl run -it --rm debug --image=curlimages/curl --restart=Never \
     -- curl http://vllm.agentgateway-system.svc.cluster.local:8000/v1/models
   ```
Next steps
- Want to use endpoints other than chat completions, such as embeddings or models? Check out the multiple endpoints guide.
- Explore other guides for LLM consumption, such as function calling, model failover, and prompt guards.