vLLM
Configure vLLM, the high-performance LLM serving engine, to serve self-hosted models through kgateway.
Overview
vLLM is a fast and memory-efficient inference engine for large language models. It’s designed for high-throughput serving and is commonly deployed in Kubernetes clusters for production workloads.
This guide shows two deployment patterns:
- External vLLM: Connect to a vLLM instance running outside your Kubernetes cluster
- In-cluster vLLM: Deploy vLLM within your Kubernetes cluster
Before you begin
Install and set up an agentgateway proxy.

Option 1: Connect to external vLLM instance
Use this option if vLLM is already deployed on dedicated GPU infrastructure outside your Kubernetes cluster.
Set up external vLLM
1. Install and run vLLM on a machine with GPU access. Follow the vLLM installation guide.

2. Start the vLLM OpenAI-compatible server:

   ```sh
   vllm serve meta-llama/Llama-3.1-8B-Instruct \
     --host 0.0.0.0 \
     --port 8000 \
     --dtype auto
   ```

3. Verify that vLLM is accessible:

   ```sh
   curl http://<vllm-server-ip>:8000/v1/models
   ```
Configure Kubernetes to connect to external vLLM
1. Get the IP address of your vLLM server.

2. Create a headless Service and Endpoints pointing to the external vLLM instance:

   ```sh
   kubectl apply -f- <<EOF
   apiVersion: v1
   kind: Service
   metadata:
     name: vllm-external
     namespace: agentgateway-system
   spec:
     type: ClusterIP
     clusterIP: None # Headless service
     ports:
     - port: 8000
       targetPort: 8000
       protocol: TCP
   ---
   apiVersion: v1
   kind: Endpoints
   metadata:
     name: vllm-external
     namespace: agentgateway-system
   subsets:
   - addresses:
     - ip: 10.0.1.50 # Replace with your vLLM server IP
     ports:
     - port: 8000
       protocol: TCP
   EOF
   ```

3. Create an AgentgatewayBackend resource:

   ```sh
   kubectl apply -f- <<EOF
   apiVersion: agentgateway.dev/v1alpha1
   kind: AgentgatewayBackend
   metadata:
     name: vllm
     namespace: agentgateway-system
   spec:
     ai:
       provider:
         openai:
           model: meta-llama/Llama-3.1-8B-Instruct
           host: vllm-external.agentgateway-system.svc.cluster.local
           port: 8000
   EOF
   ```

   Review the following table to understand this configuration.

   | Setting | Description |
   | --- | --- |
   | `ai.provider.openai` | Use the OpenAI-compatible provider for vLLM. |
   | `openai.model` | The model served by vLLM (must match the model vLLM is serving). |
   | `openai.host` | Kubernetes Service DNS name for the external vLLM instance. |
   | `openai.port` | vLLM API port (default: `8000`). |

4. Create an HTTPRoute:

   ```sh
   kubectl apply -f- <<EOF
   apiVersion: gateway.networking.k8s.io/v1
   kind: HTTPRoute
   metadata:
     name: vllm
     namespace: agentgateway-system
   spec:
     parentRefs:
     - name: agentgateway-proxy
       namespace: agentgateway-system
     rules:
     - matches:
       - path:
           type: PathPrefix
           value: /vllm
       backendRefs:
       - name: vllm
         namespace: agentgateway-system
         group: agentgateway.dev
         kind: AgentgatewayBackend
   EOF
   ```

5. Test the setup. If your gateway is exposed externally, send the request to its address:

   ```sh
   curl "$INGRESS_GW_ADDRESS/vllm" -H content-type:application/json -d '{
     "model": "meta-llama/Llama-3.1-8B-Instruct",
     "messages": [
       {
         "role": "user",
         "content": "Explain the benefits of vLLM."
       }
     ]
   }' | jq
   ```

   If you port-forward the gateway to your local machine instead:

   ```sh
   curl "localhost:8080/vllm" -H content-type:application/json -d '{
     "model": "meta-llama/Llama-3.1-8B-Instruct",
     "messages": [
       {
         "role": "user",
         "content": "Explain the benefits of vLLM."
       }
     ]
   }' | jq
   ```
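The response from either command follows the OpenAI chat-completions schema, so any OpenAI-compatible client or a few lines of JSON parsing can consume it. A minimal offline sketch, using an illustrative sample payload rather than captured vLLM output:

```python
import json

# Illustrative chat-completions response in the OpenAI schema;
# a real response comes from the curl command above.
sample = """
{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant",
                  "content": "vLLM batches requests and pages the KV cache."},
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 12, "completion_tokens": 10, "total_tokens": 22}
}
"""

resp = json.loads(sample)
# The assistant's reply lives at choices[0].message.content.
reply = resp["choices"][0]["message"]["content"]
print(reply)
```

The same path (`.choices[0].message.content`) works with `jq` on the command line.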
Option 2: Deploy vLLM in Kubernetes cluster
Use this option to deploy vLLM directly in your Kubernetes cluster alongside agentgateway.
Before you begin
- Kubernetes cluster with GPU nodes (NVIDIA GPUs with CUDA support).
- NVIDIA GPU Operator or device plugin installed.
- Sufficient GPU memory for your chosen model.
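To gauge "sufficient GPU memory", a useful first-order estimate is parameter count times bytes per parameter, before KV cache and activation overhead. A back-of-the-envelope sketch (the dtype sizes and the 8B parameter count are rough assumptions, not vLLM's exact allocator math):

```python
# Approximate bytes per weight by dtype (assumed values).
BYTES_PER_PARAM = {
    "fp16": 2.0,   # also bf16
    "fp8": 1.0,
    "int4": 0.5,   # e.g. AWQ/GPTQ 4-bit quantization
}

def weight_memory_gb(num_params: float, dtype: str) -> float:
    """GPU memory needed for the model weights alone, in GB (no KV cache)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

# Llama-3.1-8B in bf16: roughly 16 GB of weights before any KV cache.
print(weight_memory_gb(8e9, "fp16"))  # -> 16.0
```

The KV cache and activations come on top of this, which is why an 8B model in fp16 does not fit comfortably on a 16 GB GPU without quantization.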
Deploy vLLM in the cluster
1. Create a vLLM Deployment with GPU resources:

   ```sh
   kubectl apply -f- <<EOF
   apiVersion: apps/v1
   kind: Deployment
   metadata:
     name: vllm
     namespace: agentgateway-system
   spec:
     replicas: 1
     selector:
       matchLabels:
         app: vllm
     template:
       metadata:
         labels:
           app: vllm
       spec:
         containers:
         - name: vllm
           image: vllm/vllm-openai:latest
           args:
           - "--model"
           - "meta-llama/Llama-3.1-8B-Instruct"
           - "--host"
           - "0.0.0.0"
           - "--port"
           - "8000"
           - "--dtype"
           - "auto"
           ports:
           - containerPort: 8000
             name: http
           resources:
             requests:
               nvidia.com/gpu: 1 # Request 1 GPU
             limits:
               nvidia.com/gpu: 1 # Limit to 1 GPU
           env:
           - name: HUGGING_FACE_HUB_TOKEN
             valueFrom:
               secretKeyRef:
                 name: hf-token # Create this secret if accessing gated models
                 key: token
                 optional: true
   EOF
   ```

   ℹ️ Model access: For gated models (like Llama), create a Hugging Face token secret:

   ```sh
   kubectl create secret generic hf-token \
     -n agentgateway-system \
     --from-literal=token=<your-hf-token>
   ```

2. Create a Service for the vLLM deployment:

   ```sh
   kubectl apply -f- <<EOF
   apiVersion: v1
   kind: Service
   metadata:
     name: vllm
     namespace: agentgateway-system
   spec:
     selector:
       app: vllm
     ports:
     - port: 8000
       targetPort: 8000
       protocol: TCP
   EOF
   ```

3. Wait for vLLM to be ready:

   ```sh
   kubectl wait --for=condition=ready pod \
     -l app=vllm \
     -n agentgateway-system \
     --timeout=300s
   ```

   ℹ️ Initial startup: vLLM needs to download the model weights on first launch, which can take several minutes depending on model size and network speed. Monitor the logs:

   ```sh
   kubectl logs -f deployment/vllm -n agentgateway-system
   ```

4. Create an AgentgatewayBackend resource:

   ```sh
   kubectl apply -f- <<EOF
   apiVersion: agentgateway.dev/v1alpha1
   kind: AgentgatewayBackend
   metadata:
     name: vllm
     namespace: agentgateway-system
   spec:
     ai:
       provider:
         openai:
           model: meta-llama/Llama-3.1-8B-Instruct
           host: vllm.agentgateway-system.svc.cluster.local
           port: 8000
   EOF
   ```

5. Create an HTTPRoute (same as Option 1, step 4).
vLLM configuration options
Quantization
Reduce memory usage with quantization:
```yaml
args:
- "--model"
- "meta-llama/Llama-3.1-8B-Instruct"
- "--quantization"
- "awq" # or "gptq", "squeezellm"
```

Tensor parallelism
Distribute model across multiple GPUs:
```yaml
args:
- "--model"
- "meta-llama/Llama-3.1-70B-Instruct"
- "--tensor-parallel-size"
- "4" # Use 4 GPUs
resources:
  requests:
    nvidia.com/gpu: 4
  limits:
    nvidia.com/gpu: 4
```

Engine arguments
Tune performance parameters:
```yaml
args:
- "--model"
- "meta-llama/Llama-3.1-8B-Instruct"
- "--max-model-len"
- "4096" # Maximum sequence length
- "--gpu-memory-utilization"
- "0.9" # Use 90% of GPU memory
- "--max-num-seqs"
- "256" # Maximum number of sequences to process in parallel
```

Scaling and high availability
Horizontal scaling
Scale vLLM for higher throughput:
```sh
kubectl scale deployment vllm \
  -n agentgateway-system \
  --replicas=3
```

Kubernetes load balances requests across the vLLM replicas through the Service.
Resource limits
Set appropriate resource requests and limits:
```yaml
resources:
  requests:
    memory: "16Gi"
    nvidia.com/gpu: 1
  limits:
    memory: "32Gi"
    nvidia.com/gpu: 1
```

Node affinity
Pin vLLM to GPU nodes:
```yaml
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: nvidia.com/gpu.present
                operator: In
                values:
                - "true"
```

Monitoring
vLLM exposes Prometheus metrics at /metrics:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: vllm-metrics
  namespace: agentgateway-system
  labels:
    app: vllm
spec:
  selector:
    app: vllm
  ports:
  - port: 8000
    targetPort: 8000
    name: metrics
```

Key metrics to monitor:

- `vllm_request_duration_seconds`: Request latency.
- `vllm_num_requests_running`: Active requests.
- `vllm_gpu_cache_usage_perc`: GPU KV cache utilization.
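Because the endpoint uses the Prometheus text exposition format, you can spot-check it without a full monitoring stack. A minimal parsing sketch; the sample scrape text below is illustrative, and exact metric names and labels vary across vLLM versions:

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse the value lines of a Prometheus text-format scrape into a dict."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, _, value = line.rpartition(" ")
        metrics[name] = float(value)
    return metrics

# Illustrative scrape output; fetch the real thing from /metrics on port 8000.
sample = """
# HELP vllm_num_requests_running Number of requests currently running.
# TYPE vllm_num_requests_running gauge
vllm_num_requests_running{model="meta-llama/Llama-3.1-8B-Instruct"} 3.0
vllm_gpu_cache_usage_perc{model="meta-llama/Llama-3.1-8B-Instruct"} 0.42
"""

for name, value in parse_metrics(sample).items():
    print(name, value)
```

In production, point a Prometheus scrape job at the `vllm-metrics` Service instead of parsing by hand.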
Troubleshooting
Pod stuck in Pending state
What’s happening:
The vLLM pod doesn't start and shows `Pending` status.
Why it’s happening:
No GPU nodes available or insufficient GPU memory.
How to fix it:
1. Check GPU availability:

   ```sh
   kubectl describe nodes | grep -A 5 "nvidia.com/gpu"
   ```

2. Check pod events:

   ```sh
   kubectl describe pod -l app=vllm -n agentgateway-system
   ```
Out of memory errors
What’s happening:
vLLM crashes with CUDA out-of-memory errors.
Why it’s happening:
The model requires more GPU memory than is available.
How to fix it:
- Use a smaller model or quantized variant.
- Reduce `--max-model-len`.
- Lower `--gpu-memory-utilization` (try `0.8` or `0.7`).
- Enable tensor parallelism across more GPUs.
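These knobs trade KV-cache capacity against memory headroom, and a quick estimate shows why: each token in a running sequence stores a key and a value per layer per KV head. A back-of-the-envelope sketch (the layer and head counts below are the published Llama-3.1-8B configuration, used here as assumptions; substitute your model's):

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """KV-cache bytes per token: key + value tensors, per layer, per KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

# Llama-3.1-8B (assumed): 32 layers, 8 KV heads (GQA), head dim 128, fp16.
per_token = kv_bytes_per_token(32, 8, 128)
per_full_seq_gb = per_token * 4096 / 1e9  # one sequence at --max-model-len 4096
print(per_token, round(per_full_seq_gb, 2))  # -> 131072 0.54
```

At these settings, 256 concurrent full-length sequences would need roughly 137 GB of KV cache, so lowering `--max-model-len` or `--max-num-seqs` shrinks the required reservation linearly.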
Slow inference
What’s happening:
High latency on requests.
Why it’s happening:
The model is too large for available GPU memory (swapping to CPU), `--max-num-seqs` is too low for the concurrent request load, or preprocessing is CPU-bound.
How to fix it:
- Increase GPU memory or use a smaller model.
- Tune `--max-num-seqs` and `--max-model-len`.
- Use faster CPUs or increase CPU requests.
Connection refused from agentgateway
What’s happening:
agentgateway cannot reach vLLM service.
Why it’s happening:
The vLLM Service may not exist or may have no endpoints, or network policies may be blocking traffic.
How to fix it:
1. Verify that the vLLM Service exists and has endpoints:

   ```sh
   kubectl get svc vllm -n agentgateway-system
   kubectl get endpoints vllm -n agentgateway-system
   ```

2. Test connectivity from another pod:

   ```sh
   kubectl run -it --rm debug --image=curlimages/curl --restart=Never \
     -- curl http://vllm.agentgateway-system.svc.cluster.local:8000/v1/models
   ```
Next steps
- Want to use endpoints other than chat completions, such as embeddings or models? Check out the multiple endpoints guide.
- Explore other guides for LLM consumption, such as function calling, model failover, and prompt guards.