Ollama
Configure Ollama to serve local models through kgateway.
Overview
Ollama allows you to run open-source LLMs locally on your development machine or a dedicated server. You can configure agentgateway running in Kubernetes to route requests to an external Ollama instance.
This guide shows how to connect agentgateway to an Ollama instance running outside your Kubernetes cluster, such as on a developer’s laptop or a separate server.
Before you begin
Install and set up an agentgateway proxy.

Additional requirements:
- Ollama installed and running on an accessible machine.
- Network connectivity between your Kubernetes cluster and the Ollama instance.
- Ollama configured to accept external connections.
Set up Ollama
1. Install Ollama on your local machine or server by following the Ollama installation guide.

2. Pull a model to use with agentgateway.

   ```sh
   ollama pull llama3.2
   ```

3. Verify that Ollama is running and accessible.

   ```sh
   curl http://localhost:11434/v1/models
   ```

4. Configure Ollama to accept connections from your Kubernetes cluster. By default, Ollama listens only on `localhost`. Set the `OLLAMA_HOST` environment variable to allow external connections.

   ```sh
   # On macOS/Linux, add to ~/.zshrc or ~/.bashrc
   export OLLAMA_HOST=0.0.0.0:11434
   # Restart Ollama
   ```

   ⚠️ **Security consideration**: Binding Ollama to `0.0.0.0` makes it accessible from any network interface. In production, use firewall rules or network policies to restrict access to your Kubernetes cluster nodes only.
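The checks above can be wrapped in a small preflight sketch. This is an optional helper, not part of the official setup; it only parses the `OLLAMA_HOST` value described above and prints the URL to probe, leaving the actual request commented out so it is safe to run anywhere.

```shell
#!/bin/sh
# Preflight sketch: derive the host/port Ollama should listen on from
# OLLAMA_HOST (format host:port, defaulting to 0.0.0.0:11434 as above).
OLLAMA_HOST="${OLLAMA_HOST:-0.0.0.0:11434}"
host="${OLLAMA_HOST%:*}"   # strip the :port suffix
port="${OLLAMA_HOST##*:}"  # keep only the port
echo "probe: http://${host}:${port}/v1/models"
# curl -sf "http://localhost:${port}/v1/models"  # uncomment to test locally
```

Because the parsing uses plain POSIX parameter expansion, the same two lines work in any `/bin/sh`.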
Configure Kubernetes to connect to external Ollama
Since Ollama runs outside the Kubernetes cluster, you need to create a headless Service with manual Endpoints pointing to your Ollama instance.
1. Get the IP address of the machine running Ollama. This IP address must be reachable from your Kubernetes cluster nodes.

   ```sh
   # On the Ollama machine, get its IP
   # macOS:
   ipconfig getifaddr en0
   # Linux:
   hostname -I | awk '{print $1}'
   # Example output: 192.168.1.100
   ```
2. Create a Service and Endpoints that point to your external Ollama instance.

   ```sh
   kubectl apply -f- <<EOF
   apiVersion: v1
   kind: Service
   metadata:
     name: ollama-external
     namespace: agentgateway-system
   spec:
     type: ClusterIP
     clusterIP: None  # Headless service
     ports:
     - port: 11434
       targetPort: 11434
       protocol: TCP
   ---
   apiVersion: v1
   kind: Endpoints
   metadata:
     name: ollama-external
     namespace: agentgateway-system
   subsets:
   - addresses:
     - ip: 192.168.1.100  # Replace with your Ollama machine's IP
     ports:
     - port: 11434
       protocol: TCP
   EOF
   ```

   ℹ️ Replace `192.168.1.100` with the actual IP address of your Ollama machine. This IP must be routable from your Kubernetes cluster nodes.
3. Create an AgentgatewayBackend resource that configures Ollama as an OpenAI-compatible provider.

   ```sh
   kubectl apply -f- <<EOF
   apiVersion: agentgateway.dev/v1alpha1
   kind: AgentgatewayBackend
   metadata:
     name: ollama
     namespace: agentgateway-system
   spec:
     ai:
       provider:
         openai:
           model: llama3.2
           host: ollama-external.agentgateway-system.svc.cluster.local
           port: 11434
   EOF
   ```

   Review the following table to understand this configuration. For more information, see the API reference.

   | Setting | Description |
   | --- | --- |
   | `ai.provider.openai` | Use the OpenAI-compatible provider type for Ollama. |
   | `openai.model` | The model pulled in Ollama (e.g., `llama3.2`, `mistral`, `codellama`). |
   | `openai.host` | The Kubernetes Service DNS name for the external Ollama instance. |
   | `openai.port` | The port Ollama listens on (default: 11434). |

   No `policies.auth` is required because Ollama does not require authentication by default, and no `policies.tls` is needed because Ollama serves plain HTTP (not HTTPS).
4. Create an HTTPRoute resource that routes incoming traffic to the Ollama AgentgatewayBackend.

   ```sh
   kubectl apply -f- <<EOF
   apiVersion: gateway.networking.k8s.io/v1
   kind: HTTPRoute
   metadata:
     name: ollama
     namespace: agentgateway-system
   spec:
     parentRefs:
     - name: agentgateway-proxy
       namespace: agentgateway-system
     rules:
     - matches:
       - path:
           type: PathPrefix
           value: /ollama
       backendRefs:
       - name: ollama
         namespace: agentgateway-system
         group: agentgateway.dev
         kind: AgentgatewayBackend
   EOF
   ```
5. Send a request to verify the setup.

   If your gateway is exposed through a LoadBalancer address:

   ```sh
   curl "$INGRESS_GW_ADDRESS/ollama" -H content-type:application/json -d '{
     "model": "llama3.2",
     "messages": [
       {
         "role": "user",
         "content": "Explain the benefits of running models locally."
       }
     ]
   }' | jq
   ```

   If you use a local port-forward instead:

   ```sh
   curl "localhost:8080/ollama" -H content-type:application/json -d '{
     "model": "llama3.2",
     "messages": [
       {
         "role": "user",
         "content": "Explain the benefits of running models locally."
       }
     ]
   }' | jq
   ```

   Example output:

   ```json
   {
     "id": "chatcmpl-123",
     "object": "chat.completion",
     "created": 1727967462,
     "model": "llama3.2",
     "choices": [
       {
         "index": 0,
         "message": {
           "role": "assistant",
           "content": "Running models locally provides several key benefits: complete data privacy since information never leaves your infrastructure, no API costs or rate limits, consistent low latency without network dependencies, and the ability to work offline. This makes it ideal for sensitive data, development environments, and applications requiring guaranteed response times."
         },
         "finish_reason": "stop"
       }
     ],
     "usage": {
       "prompt_tokens": 15,
       "completion_tokens": 58,
       "total_tokens": 73
     }
   }
   ```
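For repeated testing, the request body can be factored into a small helper. This is a sketch rather than part of the setup; the `chat_body` function name and `GATEWAY_URL` variable are hypothetical, and the `/ollama` path is the route configured above.

```shell
# Hypothetical helper: build a chat-completions request body for the
# gateway route, with the prompt passed as the first argument.
chat_body() {
  printf '{"model":"llama3.2","messages":[{"role":"user","content":"%s"}]}' "$1"
}

# Usage sketch (assumes GATEWAY_URL, e.g. http://localhost:8080):
#   chat_body "Explain local inference." | \
#     curl -s "$GATEWAY_URL/ollama" -H content-type:application/json -d @- | \
#     jq -r '.choices[0].message.content'
chat_body "hi"
```

Note that the prompt is interpolated verbatim, so quotes or backslashes in it would need escaping before this is safe for arbitrary input.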
Connecting from a different network
If your Ollama instance and Kubernetes cluster are on different networks (e.g., Ollama on your laptop and cluster in the cloud), you need to expose Ollama through a tunnel or VPN.
Option 1: Tailscale (recommended)
Tailscale creates a secure mesh network between your laptop and Kubernetes cluster.
- Install Tailscale on both your Ollama machine and Kubernetes cluster nodes.
- Use the Tailscale IP address in the Kubernetes Endpoints resource.
- Configure your AgentgatewayBackend to point to the Tailscale service name.
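As an illustration of step 2, the Endpoints resource from the setup above would point at the machine's Tailscale address instead of its LAN IP. The IP below is a placeholder in Tailscale's `100.64.0.0/10` range; use the output of `tailscale ip -4` on the Ollama machine.

```yaml
apiVersion: v1
kind: Endpoints
metadata:
  name: ollama-external
  namespace: agentgateway-system
subsets:
- addresses:
  - ip: 100.101.102.103  # placeholder; use `tailscale ip -4` output
  ports:
  - port: 11434
    protocol: TCP
```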
Option 2: ngrok or similar tunneling service
Use ngrok to expose your local Ollama instance:
```sh
ngrok http 11434
```

Then configure the AgentgatewayBackend to use the ngrok URL:

```yaml
spec:
  ai:
    provider:
      openai:
        model: llama3.2
        host: abc123.ngrok.io
        port: 443
  policies:
    tls:
      sni: abc123.ngrok.io
```

Model management
Switching models
To use a different model, pull it with Ollama and update the AgentgatewayBackend:
```sh
# Pull a new model
ollama pull mistral

# Update the backend
kubectl patch AgentgatewayBackend ollama \
  -n agentgateway-system \
  --type merge \
  -p '{"spec":{"ai":{"provider":{"openai":{"model":"mistral"}}}}}'
```

Multiple models
To serve multiple Ollama models simultaneously, create separate AgentgatewayBackend resources and HTTPRoutes:
```sh
kubectl apply -f- <<EOF
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayBackend
metadata:
  name: ollama-llama
  namespace: agentgateway-system
spec:
  ai:
    provider:
      openai:
        model: llama3.2
        host: ollama-external.agentgateway-system.svc.cluster.local
        port: 11434
---
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayBackend
metadata:
  name: ollama-mistral
  namespace: agentgateway-system
spec:
  ai:
    provider:
      openai:
        model: mistral
        host: ollama-external.agentgateway-system.svc.cluster.local
        port: 11434
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: ollama-llama
  namespace: agentgateway-system
spec:
  parentRefs:
  - name: agentgateway-proxy
    namespace: agentgateway-system
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /ollama/llama
    backendRefs:
    - name: ollama-llama
      namespace: agentgateway-system
      group: agentgateway.dev
      kind: AgentgatewayBackend
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: ollama-mistral
  namespace: agentgateway-system
spec:
  parentRefs:
  - name: agentgateway-proxy
    namespace: agentgateway-system
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /ollama/mistral
    backendRefs:
    - name: ollama-mistral
      namespace: agentgateway-system
      group: agentgateway.dev
      kind: AgentgatewayBackend
EOF
```

Troubleshooting
Connection refused errors
What’s happening:
Requests fail with connection refused errors.
Why it’s happening:
The Kubernetes cluster cannot reach the Ollama instance, possibly due to network configuration, firewall rules, or incorrect Endpoints.
How to fix it:
1. Verify that Ollama is running and bound to the correct interface:

   ```sh
   curl http://<ollama-ip>:11434/v1/models
   ```

2. Check that the Kubernetes Endpoints are correctly configured:

   ```sh
   kubectl get endpoints ollama-external -n agentgateway-system
   ```

3. Verify network connectivity from a pod in your cluster:

   ```sh
   kubectl run -it --rm debug --image=curlimages/curl --restart=Never \
     -- curl http://<ollama-ip>:11434/v1/models
   ```

4. Check that firewall rules on the Ollama machine allow traffic on port 11434 from Kubernetes cluster IPs.
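For the firewall step, here is a rule-building sketch. It assumes a Linux Ollama host using `ufw`, and `NODE_CIDR` is a placeholder for your cluster nodes' subnet; the script only prints the command so you can review it before running it with the correct values.

```shell
# Hypothetical ufw rule: allow only the cluster's node subnet to reach
# Ollama's port. NODE_CIDR below is a placeholder for your environment.
NODE_CIDR="10.0.0.0/24"
rule="sudo ufw allow from ${NODE_CIDR} to any port 11434 proto tcp"
echo "$rule"   # review the rule, then run it on the Ollama machine
```

Other firewalls (nftables, cloud security groups) need the equivalent rule: TCP 11434 allowed only from the node subnet.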
Model not found
What’s happening:
Error message indicating the model is not available.
Why it’s happening:
The requested model has not been pulled in Ollama.
How to fix it:
1. Verify that the model is pulled in Ollama:

   ```sh
   ollama list
   ```

2. If it is not listed, pull it:

   ```sh
   ollama pull llama3.2
   ```
Slow response times
What’s happening:
Requests take longer than expected to complete.
Why it’s happening:
Network latency, insufficient resources on the Ollama machine, or the model size exceeds available memory.
How to fix it:
- Use a model variant with smaller memory requirements (e.g., `llama3.1:8b` instead of `llama3.1:70b`).
- Increase resources on the Ollama machine.
- Consider running Ollama on a machine closer to the cluster (same datacenter/VPC).
Next steps
- Want to use endpoints other than chat completions, such as embeddings or model listing? Check out the multiple endpoints guide.
- Explore other guides for LLM consumption, such as function calling, model failover, and prompt guards.