Ollama
Configure Ollama to serve local models through kgateway.
Overview
Ollama allows you to run open-source LLMs locally on your development machine or a dedicated server. You can configure agentgateway running in Kubernetes to route requests to an external Ollama instance.
This guide shows how to connect agentgateway to an Ollama instance running outside your Kubernetes cluster, such as on a developer’s laptop or a separate server.
Before you begin
Install and set up an agentgateway proxy.

Additional requirements:
- Ollama installed and running on an accessible machine.
- Network connectivity between your Kubernetes cluster and the Ollama instance.
- Ollama configured to accept external connections.
Set up Ollama
1. Install Ollama on your local machine or server by following the Ollama installation guide.

2. Pull a model to use with agentgateway.

   ```sh
   ollama pull llama3.2
   ```

3. Verify that Ollama is running and accessible.

   ```sh
   curl http://localhost:11434/v1/models
   ```

4. Configure Ollama to accept connections from your Kubernetes cluster. By default, Ollama listens only on `localhost`. Set the `OLLAMA_HOST` environment variable to allow external connections.

   ```sh
   # On macOS/Linux, add to ~/.zshrc or ~/.bashrc
   export OLLAMA_HOST=0.0.0.0:11434
   # Restart Ollama
   ```

   ⚠️ **Security consideration**: Binding Ollama to `0.0.0.0` makes it accessible from any network interface. In production, use firewall rules or network policies to restrict access to your Kubernetes cluster nodes only.
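The checks above can be wrapped in a small preflight sketch. This is an optional helper, not part of the official setup; it only parses the `OLLAMA_HOST` value described above and prints the URL to probe, leaving the actual request commented out so it is safe to run anywhere.

```shell
#!/bin/sh
# Preflight sketch: derive the host/port Ollama should listen on from
# OLLAMA_HOST (format host:port, defaulting to 0.0.0.0:11434 as above).
OLLAMA_HOST="${OLLAMA_HOST:-0.0.0.0:11434}"
host="${OLLAMA_HOST%:*}"   # strip the :port suffix
port="${OLLAMA_HOST##*:}"  # keep only the port
echo "probe: http://${host}:${port}/v1/models"
# curl -sf "http://localhost:${port}/v1/models"  # uncomment to test locally
```

Because the parsing uses plain POSIX parameter expansion, the same two lines work in any `/bin/sh`.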
Configure Kubernetes to connect to external Ollama
Since Ollama runs outside the Kubernetes cluster, you need to create a headless Service with manual Endpoints pointing to your Ollama instance.
1. Get the IP address of the machine running Ollama. This IP address must be reachable from your Kubernetes cluster nodes.

   ```sh
   # On the Ollama machine, get its IP
   # macOS:
   ipconfig getifaddr en0
   # Linux:
   hostname -I | awk '{print $1}'
   # Example output: 192.168.1.100
   ```
2. Create a Service and Endpoints that point to your external Ollama instance.

   ```sh
   kubectl apply -f- <<EOF
   apiVersion: v1
   kind: Service
   metadata:
     name: ollama-external
     namespace: agentgateway-system
   spec:
     type: ClusterIP
     clusterIP: None  # Headless service
     ports:
     - port: 11434
       targetPort: 11434
       protocol: TCP
   ---
   apiVersion: v1
   kind: Endpoints
   metadata:
     name: ollama-external
     namespace: agentgateway-system
   subsets:
   - addresses:
     - ip: 192.168.1.100  # Replace with your Ollama machine's IP
     ports:
     - port: 11434
       protocol: TCP
   EOF
   ```

   ℹ️ Replace `192.168.1.100` with the actual IP address of your Ollama machine. This IP must be routable from your Kubernetes cluster nodes.
3. Create an AgentgatewayBackend resource that configures Ollama as an OpenAI-compatible provider.

   ```sh
   kubectl apply -f- <<EOF
   apiVersion: agentgateway.dev/v1alpha1
   kind: AgentgatewayBackend
   metadata:
     name: ollama
     namespace: agentgateway-system
   spec:
     ai:
       provider:
         openai:
           model: llama3.2
           host: ollama-external.agentgateway-system.svc.cluster.local
           port: 11434
   EOF
   ```

   Review the following table to understand this configuration. For more information, see the API reference.

   | Setting | Description |
   | --- | --- |
   | `ai.provider.openai` | Use the OpenAI-compatible provider type for Ollama. |
   | `openai.model` | The model pulled in Ollama (e.g., `llama3.2`, `mistral`, `codellama`). |
   | `openai.host` | The Kubernetes Service DNS name for the external Ollama instance. |
   | `openai.port` | The port Ollama listens on (default: 11434). |

   No `policies.auth` is required because Ollama does not require authentication by default, and no `policies.tls` is needed because Ollama serves plain HTTP (not HTTPS).
4. Create an HTTPRoute resource that routes incoming traffic to the Ollama AgentgatewayBackend.

   ```sh
   kubectl apply -f- <<EOF
   apiVersion: gateway.networking.k8s.io/v1
   kind: HTTPRoute
   metadata:
     name: ollama
     namespace: agentgateway-system
   spec:
     parentRefs:
     - name: agentgateway-proxy
       namespace: agentgateway-system
     rules:
     - matches:
       - path:
           type: PathPrefix
           value: /ollama
       backendRefs:
       - name: ollama
         namespace: agentgateway-system
         group: agentgateway.dev
         kind: AgentgatewayBackend
   EOF
   ```
5. Send a request to verify the setup.

   If your gateway is exposed through a LoadBalancer address:

   ```sh
   curl "$INGRESS_GW_ADDRESS/ollama" -H content-type:application/json -d '{
     "model": "llama3.2",
     "messages": [
       {
         "role": "user",
         "content": "Explain the benefits of running models locally."
       }
     ]
   }' | jq
   ```

   If you use a local port-forward instead:

   ```sh
   curl "localhost:8080/ollama" -H content-type:application/json -d '{
     "model": "llama3.2",
     "messages": [
       {
         "role": "user",
         "content": "Explain the benefits of running models locally."
       }
     ]
   }' | jq
   ```

   Example output:

   ```json
   {
     "id": "chatcmpl-123",
     "object": "chat.completion",
     "created": 1727967462,
     "model": "llama3.2",
     "choices": [
       {
         "index": 0,
         "message": {
           "role": "assistant",
           "content": "Running models locally provides several key benefits: complete data privacy since information never leaves your infrastructure, no API costs or rate limits, consistent low latency without network dependencies, and the ability to work offline. This makes it ideal for sensitive data, development environments, and applications requiring guaranteed response times."
         },
         "finish_reason": "stop"
       }
     ],
     "usage": {
       "prompt_tokens": 15,
       "completion_tokens": 58,
       "total_tokens": 73
     }
   }
   ```
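For repeated testing, the request body can be factored into a small helper. This is a sketch rather than part of the setup; the `chat_body` function name and `GATEWAY_URL` variable are hypothetical, and the `/ollama` path is the route configured above.

```shell
# Hypothetical helper: build a chat-completions request body for the
# gateway route, with the prompt passed as the first argument.
chat_body() {
  printf '{"model":"llama3.2","messages":[{"role":"user","content":"%s"}]}' "$1"
}

# Usage sketch (assumes GATEWAY_URL, e.g. http://localhost:8080):
#   chat_body "Explain local inference." | \
#     curl -s "$GATEWAY_URL/ollama" -H content-type:application/json -d @- | \
#     jq -r '.choices[0].message.content'
chat_body "hi"
```

Note that the prompt is interpolated verbatim, so quotes or backslashes in it would need escaping before this is safe for arbitrary input.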
Connecting from a different network
If your Ollama instance and Kubernetes cluster are on different networks (e.g., Ollama on your laptop and cluster in the cloud), you need to expose Ollama through a tunnel or VPN.
Option 1: Tailscale (recommended)
Tailscale creates a secure mesh network between your laptop and Kubernetes cluster.
- Install Tailscale on both your Ollama machine and Kubernetes cluster nodes.
- Use the Tailscale IP address in the Kubernetes Endpoints resource.
- Configure your AgentgatewayBackend to point to the Tailscale service name.
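As an illustration of step 2, the Endpoints resource from the setup above would point at the machine's Tailscale address instead of its LAN IP. The IP below is a placeholder in Tailscale's `100.64.0.0/10` range; use the output of `tailscale ip -4` on the Ollama machine.

```yaml
apiVersion: v1
kind: Endpoints
metadata:
  name: ollama-external
  namespace: agentgateway-system
subsets:
- addresses:
  - ip: 100.101.102.103  # placeholder; use `tailscale ip -4` output
  ports:
  - port: 11434
    protocol: TCP
```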
Option 2: ngrok or similar tunneling service
Use ngrok to expose your local Ollama instance:
```sh
ngrok http 11434
```

Then configure the AgentgatewayBackend to use the ngrok URL:

```yaml
spec:
  ai:
    provider:
      openai:
        model: llama3.2
        host: abc123.ngrok.io
        port: 443
  policies:
    tls:
      sni: abc123.ngrok.io
```

Model management
Switching models
To use a different model, pull it with Ollama and update the AgentgatewayBackend:
```sh
# Pull a new model
ollama pull mistral

# Update the backend
kubectl patch AgentgatewayBackend ollama \
  -n agentgateway-system \
  --type merge \
  -p '{"spec":{"ai":{"provider":{"openai":{"model":"mistral"}}}}}'
```

Multiple models
To serve multiple Ollama models simultaneously, create separate AgentgatewayBackend resources and HTTPRoutes:
```sh
kubectl apply -f- <<EOF
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayBackend
metadata:
  name: ollama-llama
  namespace: agentgateway-system
spec:
  ai:
    provider:
      openai:
        model: llama3.2
        host: ollama-external.agentgateway-system.svc.cluster.local
        port: 11434
---
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayBackend
metadata:
  name: ollama-mistral
  namespace: agentgateway-system
spec:
  ai:
    provider:
      openai:
        model: mistral
        host: ollama-external.agentgateway-system.svc.cluster.local
        port: 11434
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: ollama-llama
  namespace: agentgateway-system
spec:
  parentRefs:
  - name: agentgateway-proxy
    namespace: agentgateway-system
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /ollama/llama
    backendRefs:
    - name: ollama-llama
      namespace: agentgateway-system
      group: agentgateway.dev
      kind: AgentgatewayBackend
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: ollama-mistral
  namespace: agentgateway-system
spec:
  parentRefs:
  - name: agentgateway-proxy
    namespace: agentgateway-system
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /ollama/mistral
    backendRefs:
    - name: ollama-mistral
      namespace: agentgateway-system
      group: agentgateway.dev
      kind: AgentgatewayBackend
EOF
```

Troubleshooting
Connection refused errors
What’s happening:
Requests fail with connection refused errors.
Why it’s happening:
The Kubernetes cluster cannot reach the Ollama instance, possibly due to network configuration, firewall rules, or incorrect Endpoints.
How to fix it:
1. Verify that Ollama is running and bound to the correct interface:

   ```sh
   curl http://<ollama-ip>:11434/v1/models
   ```

2. Check that the Kubernetes Endpoints are correctly configured:

   ```sh
   kubectl get endpoints ollama-external -n agentgateway-system
   ```

3. Verify network connectivity from a pod in your cluster:

   ```sh
   kubectl run -it --rm debug --image=curlimages/curl --restart=Never \
     -- curl http://<ollama-ip>:11434/v1/models
   ```

4. Check that firewall rules on the Ollama machine allow traffic on port 11434 from Kubernetes cluster IPs.
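For the firewall step, here is a rule-building sketch. It assumes a Linux Ollama host using `ufw`, and `NODE_CIDR` is a placeholder for your cluster nodes' subnet; the script only prints the command so you can review it before running it with the correct values.

```shell
# Hypothetical ufw rule: allow only the cluster's node subnet to reach
# Ollama's port. NODE_CIDR below is a placeholder for your environment.
NODE_CIDR="10.0.0.0/24"
rule="sudo ufw allow from ${NODE_CIDR} to any port 11434 proto tcp"
echo "$rule"   # review the rule, then run it on the Ollama machine
```

Other firewalls (nftables, cloud security groups) need the equivalent rule: TCP 11434 allowed only from the node subnet.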
Model not found
What’s happening:
Error message indicating the model is not available.
Why it’s happening:
The requested model has not been pulled in Ollama.
How to fix it:
1. Verify that the model is pulled in Ollama:

   ```sh
   ollama list
   ```

2. If it is not listed, pull it:

   ```sh
   ollama pull llama3.2
   ```
Slow response times
What’s happening:
Requests take longer than expected to complete.
Why it’s happening:
Network latency, insufficient resources on the Ollama machine, or the model size exceeds available memory.
How to fix it:
- Use a model variant with smaller memory requirements (e.g., `llama3.1:8b` instead of `llama3.1:70b`).
- Increase resources on the Ollama machine.
- Consider running Ollama on a machine closer to the cluster (same datacenter/VPC).
Next steps
- Want to use endpoints other than chat completions, such as embeddings or model listing? Check out the multiple endpoints guide.
- Explore other guides for LLM consumption, such as function calling, model failover, and prompt guards.