vLLM

Configure vLLM, the high-performance LLM serving engine, to serve self-hosted models through agentgateway.

Overview

vLLM is a fast and memory-efficient inference engine for large language models. It’s designed for high-throughput serving and is commonly deployed in Kubernetes clusters for production workloads.

This guide shows two deployment patterns:

  • External vLLM: Connect to a vLLM instance running outside your Kubernetes cluster
  • In-cluster vLLM: Deploy vLLM within your Kubernetes cluster

Before you begin

Install and set up an agentgateway proxy.

Option 1: Connect to external vLLM instance

Use this option if vLLM is already deployed on dedicated GPU infrastructure outside your Kubernetes cluster.

Set up external vLLM

  1. Install and run vLLM on a machine with GPU access. Follow the vLLM installation guide.

  2. Start the vLLM OpenAI-compatible server:

    vllm serve meta-llama/Llama-3.1-8B-Instruct \
      --host 0.0.0.0 \
      --port 8000 \
      --dtype auto
  3. Verify vLLM is accessible:

    curl http://<vllm-server-ip>:8000/v1/models

Configure Kubernetes to connect to external vLLM

  1. Get the IP address of your vLLM server.

  2. Create a headless Service and Endpoints pointing to the external vLLM instance:

    kubectl apply -f- <<EOF
    apiVersion: v1
    kind: Service
    metadata:
      name: vllm-external
      namespace: agentgateway-system
    spec:
      type: ClusterIP
      clusterIP: None  # Headless service
      ports:
      - port: 8000
        targetPort: 8000
        protocol: TCP
    ---
    apiVersion: v1
    kind: Endpoints
    metadata:
      name: vllm-external
      namespace: agentgateway-system
    subsets:
    - addresses:
      - ip: 10.0.1.50  # Replace with your vLLM server IP
      ports:
      - port: 8000
        protocol: TCP
    EOF
  3. Create an AgentgatewayBackend resource:

    kubectl apply -f- <<EOF
    apiVersion: agentgateway.dev/v1alpha1
    kind: AgentgatewayBackend
    metadata:
      name: vllm
      namespace: agentgateway-system
    spec:
      ai:
        provider:
          openai:
            model: meta-llama/Llama-3.1-8B-Instruct
          host: vllm-external.agentgateway-system.svc.cluster.local
          port: 8000
    EOF

    Review the following table to understand this configuration.

    Setting              Description
    ai.provider.openai   Use the OpenAI-compatible provider for vLLM.
    openai.model         The model to request; must match the model that vLLM serves.
    openai.host          Kubernetes Service DNS name for the external vLLM instance.
    openai.port          vLLM API port (default: 8000).
  4. Create an HTTPRoute:

    kubectl apply -f- <<EOF
    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: vllm
      namespace: agentgateway-system
    spec:
      parentRefs:
        - name: agentgateway-proxy
          namespace: agentgateway-system
      rules:
      - matches:
        - path:
            type: PathPrefix
            value: /vllm
        backendRefs:
        - name: vllm
          namespace: agentgateway-system
          group: agentgateway.dev
          kind: AgentgatewayBackend
    EOF
  5. Test the setup:

    # If your Gateway is exposed through a load balancer address:
    curl "$INGRESS_GW_ADDRESS/vllm" -H "content-type: application/json" -d '{
       "model": "meta-llama/Llama-3.1-8B-Instruct",
       "messages": [
         {
           "role": "user",
           "content": "Explain the benefits of vLLM."
         }
       ]
     }' | jq

    # Or, if you port-forward the agentgateway proxy to localhost:8080:
    curl "localhost:8080/vllm" -H "content-type: application/json" -d '{
       "model": "meta-llama/Llama-3.1-8B-Instruct",
       "messages": [
         {
           "role": "user",
           "content": "Explain the benefits of vLLM."
         }
       ]
     }' | jq
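
A successful response follows the OpenAI chat completion schema. An abbreviated, illustrative example (values will differ on your system):

```json
{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "vLLM improves serving throughput with continuous batching and PagedAttention..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": { "prompt_tokens": 16, "completion_tokens": 120, "total_tokens": 136 }
}
```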

Option 2: Deploy vLLM in Kubernetes cluster

Use this option to deploy vLLM directly in your Kubernetes cluster alongside agentgateway.

Before you begin

  • Kubernetes cluster with GPU nodes (NVIDIA GPUs with CUDA support).
  • NVIDIA GPU Operator or device plugin installed.
  • Sufficient GPU memory for your chosen model.
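
The GPU memory requirement can be sanity-checked with a back-of-the-envelope estimate. This is a rough sketch only; real usage also depends on context length and KV-cache settings:

```shell
# Rough fp16/bf16 serving memory estimate (illustrative numbers).
params_billions=8        # e.g. Llama-3.1-8B
bytes_per_param=2        # fp16/bf16 weights
weights_gb=$(( params_billions * bytes_per_param ))
# vLLM also preallocates KV-cache space on top of the weights,
# so plan for extra headroom (~30% is a reasonable starting point).
total_gb=$(( weights_gb * 130 / 100 ))
echo "weights ~${weights_gb} GB, plan for ~${total_gb} GB of GPU memory"
```

For an 8B model this suggests roughly 16 GB for weights and about 20 GB total, which is why such models are typically served on 24 GB-class GPUs.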

Deploy vLLM in the cluster

  1. Create a vLLM Deployment with GPU resources:

    kubectl apply -f- <<EOF
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: vllm
      namespace: agentgateway-system
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: vllm
      template:
        metadata:
          labels:
            app: vllm
        spec:
          containers:
          - name: vllm
            image: vllm/vllm-openai:latest  # pin a specific version tag for production
            args:
              - "--model"
              - "meta-llama/Llama-3.1-8B-Instruct"
              - "--host"
              - "0.0.0.0"
              - "--port"
              - "8000"
              - "--dtype"
              - "auto"
            ports:
            - containerPort: 8000
              name: http
            resources:
              requests:
                nvidia.com/gpu: 1  # Request 1 GPU
              limits:
                nvidia.com/gpu: 1  # Limit to 1 GPU
            env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token  # Create this secret if accessing gated models
                  key: token
                  optional: true
    EOF

    Model access: For gated models (like Llama), create a Hugging Face token secret:

    kubectl create secret generic hf-token \
      -n agentgateway-system \
      --from-literal=token=<your-hf-token>
  2. Create a Service for the vLLM deployment:

    kubectl apply -f- <<EOF
    apiVersion: v1
    kind: Service
    metadata:
      name: vllm
      namespace: agentgateway-system
    spec:
      selector:
        app: vllm
      ports:
      - port: 8000
        targetPort: 8000
        protocol: TCP
    EOF
  3. Wait for vLLM to be ready:

    kubectl wait --for=condition=ready pod \
      -l app=vllm \
      -n agentgateway-system \
      --timeout=300s

    Initial startup: vLLM needs to download the model weights on first launch, which can take several minutes depending on model size and network speed. Monitor the logs:

    kubectl logs -f deployment/vllm -n agentgateway-system
  4. Create an AgentgatewayBackend resource:

    kubectl apply -f- <<EOF
    apiVersion: agentgateway.dev/v1alpha1
    kind: AgentgatewayBackend
    metadata:
      name: vllm
      namespace: agentgateway-system
    spec:
      ai:
        provider:
          openai:
            model: meta-llama/Llama-3.1-8B-Instruct
          host: vllm.agentgateway-system.svc.cluster.local
          port: 8000
    EOF
  5. Create an HTTPRoute (same as Option 1 step 4 above).
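
Because startup includes a model download, the Deployment from step 1 benefits from health probes. vLLM's OpenAI-compatible server exposes a /health endpoint; a sketch to add under the vllm container (tune the delays to your model size):

```yaml
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 60    # model load can take several minutes
  periodSeconds: 10
  failureThreshold: 30       # keep retrying while weights download
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 120
  periodSeconds: 30
```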

vLLM configuration options

Quantization

Reduce memory usage with quantization:

args:
  - "--model"
  - "meta-llama/Llama-3.1-8B-Instruct"
  - "--quantization"
  - "awq"  # or "gptq", "squeezellm"

Tensor parallelism

Distribute model across multiple GPUs:

args:
  - "--model"
  - "meta-llama/Llama-3.1-70B-Instruct"
  - "--tensor-parallel-size"
  - "4"  # Use 4 GPUs
resources:
  requests:
    nvidia.com/gpu: 4
  limits:
    nvidia.com/gpu: 4
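
The GPU count follows from the model size. A quick sanity check, counting weights only (fp16):

```shell
# Why a 70B model needs multiple GPUs: fp16 weights alone are ~140 GB,
# more than any single common GPU holds.
params_billions=70
bytes_per_param=2
weights_gb=$(( params_billions * bytes_per_param ))
gpus=4
per_gpu_gb=$(( weights_gb / gpus ))
echo "~${weights_gb} GB total, ~${per_gpu_gb} GB of weights per GPU (KV cache extra)"
```

At four-way tensor parallelism, each GPU holds about 35 GB of weights plus KV cache, so 80 GB-class GPUs are a comfortable fit.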

Engine arguments

Tune performance parameters:

args:
  - "--model"
  - "meta-llama/Llama-3.1-8B-Instruct"
  - "--max-model-len"
  - "4096"  # Maximum sequence length
  - "--gpu-memory-utilization"
  - "0.9"  # Use 90% of GPU memory
  - "--max-num-seqs"
  - "256"  # Maximum number of sequences to process in parallel

Scaling and high availability

Horizontal scaling

Scale vLLM for higher throughput:

kubectl scale deployment vllm \
  -n agentgateway-system \
  --replicas=3

The Service load balances requests across the vLLM replicas. Note that kube-proxy balances per connection, so long-lived client connections can stay pinned to a single replica.
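
If you prefer autoscaling over manual scaling, an HPA sketch is below. It assumes you have a custom-metrics adapter (for example, the Prometheus adapter) exposing a vLLM concurrency metric, so treat the metric name as an assumption and match it to what your adapter exports:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm
  namespace: agentgateway-system
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm
  minReplicas: 1
  maxReplicas: 4
  metrics:
  - type: Pods
    pods:
      metric:
        name: vllm_num_requests_running   # assumed adapter-exposed metric name
      target:
        type: AverageValue
        averageValue: "32"
```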

Resource limits

Set appropriate resource requests and limits:

resources:
  requests:
    memory: "16Gi"
    nvidia.com/gpu: 1
  limits:
    memory: "32Gi"
    nvidia.com/gpu: 1

Node affinity

Pin vLLM to GPU nodes:

spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: nvidia.com/gpu.present
                operator: In
                values:
                - "true"
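
GPU node pools are often tainted so that only GPU workloads schedule onto them. If yours are, add a matching toleration to the pod spec (the nvidia.com/gpu key is a common convention; check your nodes' actual taints):

```yaml
tolerations:
- key: nvidia.com/gpu
  operator: Exists
  effect: NoSchedule
```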

Monitoring

vLLM exposes Prometheus metrics at /metrics:

apiVersion: v1
kind: Service
metadata:
  name: vllm-metrics
  namespace: agentgateway-system
  labels:
    app: vllm
spec:
  selector:
    app: vllm
  ports:
  - port: 8000
    targetPort: 8000
    name: metrics

Key metrics to monitor:

  • vllm:e2e_request_latency_seconds - End-to-end request latency.
  • vllm:num_requests_running - Requests currently running on the engine.
  • vllm:gpu_cache_usage_perc - GPU KV-cache usage.
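
If you run the Prometheus Operator, a ServiceMonitor can scrape the metrics Service above. This sketch assumes the Operator and its monitoring.coreos.com CRDs are installed in your cluster:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vllm
  namespace: agentgateway-system
spec:
  selector:
    matchLabels:
      app: vllm
  endpoints:
  - port: metrics
    path: /metrics
    interval: 30s
```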

Troubleshooting

Pod stuck in Pending state

What’s happening:

vLLM pod doesn’t start, shows Pending status.

Why it’s happening:

No GPU nodes available or insufficient GPU memory.

How to fix it:

  1. Check GPU availability:

    kubectl describe nodes | grep -A 5 "nvidia.com/gpu"
  2. Check pod events:

    kubectl describe pod -l app=vllm -n agentgateway-system

Out of memory errors

What’s happening:

vLLM crashes with CUDA out-of-memory errors.

Why it’s happening:

The model requires more GPU memory than is available.

How to fix it:

  1. Use a smaller model or quantized variant.
  2. Reduce --max-model-len.
  3. Lower --gpu-memory-utilization (try 0.8 or 0.7).
  4. Enable tensor parallelism across more GPUs.
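
Fixes 2 and 3 translate into engine arguments like the following (illustrative values; tune them for your model and workload):

```yaml
args:
  - "--model"
  - "meta-llama/Llama-3.1-8B-Instruct"
  - "--max-model-len"
  - "2048"                      # fix 2: shorter maximum context
  - "--gpu-memory-utilization"
  - "0.8"                       # fix 3: leave headroom on the GPU
```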

Slow inference

What’s happening:

High latency on requests.

Why it’s happening:

Model too large for available GPU memory (swapping to CPU), insufficient --max-num-seqs for concurrent requests, or CPU bottleneck in preprocessing.

How to fix it:

  1. Increase GPU memory or use smaller model.
  2. Tune --max-num-seqs and --max-model-len.
  3. Use faster CPUs or increase CPU requests.

Connection refused from agentgateway

What’s happening:

agentgateway cannot reach vLLM service.

Why it’s happening:

The vLLM Service may not exist, have no endpoints, or network policies are blocking traffic.

How to fix it:

  1. Verify vLLM Service exists and has endpoints:

    kubectl get svc vllm -n agentgateway-system
    kubectl get endpoints vllm -n agentgateway-system
  2. Test connectivity from another pod:

    kubectl run -it --rm debug --image=curlimages/curl --restart=Never \
      -- curl http://vllm.agentgateway-system.svc.cluster.local:8000/v1/models
