---
title: Ollama
weight: 15
description: Configure Agentgateway to route LLM traffic to Ollama for local model inference
---
Ollama enables you to run large language models locally on your machine. Agentgateway can route requests to your local Ollama instance, providing a unified interface for both local and cloud-based LLMs.
## Use cases
- Local development: Test and develop without cloud API costs
- Privacy: Keep sensitive data on your machine
- Offline usage: Run models without internet connectivity
- Cost optimization: Avoid per-token charges during development
- Custom models: Use fine-tuned or specialized local models
## Before you begin
- Install Ollama: Download and install from [ollama.ai](https://ollama.ai)
- Pull a model: Download at least one model:
  ```shell
  ollama pull llama3.2
  ```
- Verify that Ollama is running: Check that Ollama is serving on port 11434:
  ```shell
  curl http://localhost:11434/api/version
  ```
## Basic configuration
Configure Agentgateway to route to your local Ollama instance:
```yaml
# yaml-language-server: $schema=https://agentgateway.dev/schema/config
binds:
- port: 3000
  listeners:
  - routes:
    - policies:
        urlRewrite:
          authority:
            full: localhost:11434
      backends:
      - ai:
          name: ollama
          hostOverride: localhost:11434
          provider:
            openAI:
              model: llama3.2 # Default model
```

Because Ollama serves plain HTTP on localhost, a `backendTLS` policy is not required.

## Test the configuration
```shell
curl 'http://localhost:3000/v1/chat/completions' \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "llama3.2",
    "messages": [
      {
        "role": "user",
        "content": "Hello! Tell me about Ollama in one sentence."
      }
    ]
  }'
```

## Model configuration
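Because the gateway exposes an OpenAI-compatible API, any HTTP client can issue the same request. A minimal sketch using only the Python standard library; the URL and model name assume the basic configuration above (gateway on port 3000, `llama3.2` pulled):

```python
import json
from urllib import request

# Chat completion request against the gateway; URL and model name assume
# the basic configuration above is running.
url = "http://localhost:3000/v1/chat/completions"
payload = {
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Hello! Tell me about Ollama in one sentence."}],
}
req = request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Sending the request requires the gateway and Ollama to be running:
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```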
### List available models
See which models you have pulled locally:
```shell
ollama list
```

Example output:
```
NAME                ID              SIZE    MODIFIED
llama3.2:latest     a80c4f17acd5    2.0 GB  2 weeks ago
mistral:latest      f974a74358d6    4.1 GB  3 weeks ago
codellama:latest    8fdf8f752f6e    3.8 GB  1 month ago
```

### Pull additional models
Download models from the Ollama library:
```shell
# Pull a specific model
ollama pull mistral

# Pull a specific tag/size
ollama pull llama3.2:70b
```

### Specify model in requests
You can override the default model in each request:
```shell
curl 'http://localhost:3000/v1/chat/completions' \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "mistral",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```

If you omit the `model` field in the backend configuration, clients must specify the model in each request.

## Advanced configuration
### Multiple Ollama instances
Route to different Ollama instances (e.g., different machines or ports):
```yaml
binds:
- port: 3000
  listeners:
  - routes:
    # Route to the local Ollama instance
    - policies:
        urlRewrite:
          authority:
            full: localhost:11434
      backends:
      - ai:
          name: ollama-local
          hostOverride: localhost:11434
          provider:
            openAI:
              model: llama3.2
    # Route to a remote Ollama instance
    - policies:
        urlRewrite:
          authority:
            full: 192.168.1.100:11434
      backends:
      - ai:
          name: ollama-remote
          hostOverride: 192.168.1.100:11434
          provider:
            openAI:
              model: mistral
```

### Custom port
If you’re running Ollama on a non-default port:
```shell
# Start Ollama on a custom port
OLLAMA_HOST=0.0.0.0:8080 ollama serve
```

Update your configuration to match:
```yaml
binds:
- port: 3000
  listeners:
  - routes:
    - policies:
        urlRewrite:
          authority:
            full: localhost:8080
      backends:
      - ai:
          name: ollama
          hostOverride: localhost:8080
          provider:
            openAI:
              model: llama3.2
```

### Model parameters
Control generation parameters:
```shell
curl 'http://localhost:3000/v1/chat/completions' \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Write a poem"}],
    "temperature": 0.7,
    "max_tokens": 500,
    "top_p": 0.9
  }'
```

## Embeddings with Ollama
Ollama supports embedding models for semantic search and RAG applications.
### Pull an embedding model
```shell
ollama pull nomic-embed-text
```

### Configure for embeddings
```yaml
binds:
- port: 3000
  listeners:
  - routes:
    - policies:
        urlRewrite:
          authority:
            full: localhost:11434
      backends:
      - ai:
          name: ollama-embeddings
          hostOverride: localhost:11434
          provider:
            openAI:
              model: nomic-embed-text
```

### Generate embeddings
```shell
curl 'http://localhost:3000/v1/embeddings' \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "nomic-embed-text",
    "input": "The quick brown fox jumps over the lazy dog"
  }'
```

### Popular embedding models
| Model | Size | Use Case |
|---|---|---|
| `nomic-embed-text` | 274 MB | General purpose, high quality |
| `mxbai-embed-large` | 669 MB | Long context, high accuracy |
| `all-minilm` | 45 MB | Fast, lightweight |
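The embeddings returned above are plain vectors of floats; comparing two of them for semantic similarity is typically done with cosine similarity. A dependency-free sketch (the toy vectors stand in for real embedding output, which has hundreds of dimensions):

```python
import math

# Cosine similarity between two embedding vectors, e.g. the "embedding"
# arrays returned by the /v1/embeddings endpoint.
def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors for illustration only.
print(cosine_similarity([1.0, 2.0], [1.0, 2.0]))  # same direction, ~1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal, ~0.0
```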
## Production considerations

### Performance tuning
GPU acceleration: Ollama automatically uses a GPU if one is available. Check with:

```shell
ollama ps # Shows GPU utilization
```

Model loading: The first request may be slow while the model loads into memory. Keep frequently used models loaded:

```shell
ollama run llama3.2 # Keeps the model in memory
```

Concurrent requests: Ollama handles multiple concurrent requests. Monitor with:

```shell
ollama ps # Shows active models and memory usage
```

### Resource requirements
| Model Size | RAM Required | GPU VRAM (if using GPU) |
|---|---|---|
| 3B params | 8 GB | 4 GB |
| 7B params | 16 GB | 8 GB |
| 13B params | 32 GB | 16 GB |
| 70B params | 64+ GB | 40+ GB |
## When NOT to use Ollama
Ollama is great for development but may not be ideal for:
- Production at scale: Cloud providers offer better reliability and scaling
- Low-resource environments: Large models require significant RAM/GPU
- Latest models: Cloud providers often have newer/larger models
- High availability: Ollama runs on a single machine without built-in redundancy
## Troubleshooting

### Connection refused

Problem: `curl: (7) Failed to connect to localhost port 11434: Connection refused`
Solutions:
- Check that Ollama is running:
  ```shell
  ps aux | grep ollama
  ```
- Start Ollama if it is not running:
  ```shell
  ollama serve
  ```
- Check the port binding:
  ```shell
  lsof -i :11434
  ```
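The same check can be scripted. A small Python sketch (standard library only) that tests whether anything is listening on the Ollama port:

```python
import socket

# Returns True if a TCP connection to host:port succeeds, i.e. something
# (hopefully Ollama) is listening there.
def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(port_open("localhost", 11434))  # True once `ollama serve` is running
```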
### Model not found

Problem: `{"error":"model 'llama3.2' not found"}`
Solutions:
- List pulled models:
  ```shell
  ollama list
  ```
- Pull the model:
  ```shell
  ollama pull llama3.2
  ```
- Use the exact model name from the `ollama list` output
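You can also check the installed models over HTTP instead of the CLI: Ollama's native API lists them at `GET http://localhost:11434/api/tags`. A sketch of parsing that response into plain model names (the sample payload below is illustrative, not a real response):

```python
import json

# Extract model names from an Ollama /api/tags response body.
def model_names(tags_json: str) -> list[str]:
    return [m["name"] for m in json.loads(tags_json).get("models", [])]

# Illustrative sample of the response shape; fetch the real thing with:
#   curl http://localhost:11434/api/tags
sample = '{"models": [{"name": "llama3.2:latest"}, {"name": "mistral:latest"}]}'
print(model_names(sample))  # ['llama3.2:latest', 'mistral:latest']
```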
### Slow performance
Problem: Responses are very slow
Solutions:
- Use smaller models: Try `llama3.2:3b` instead of `llama3.2:70b`
- Check GPU usage:
  ```shell
  ollama ps # Should show GPU if available
  ```
- Limit response length:
  ```shell
  # Use max_tokens to limit response length
  curl ... --data '{"model": "llama3.2", "max_tokens": 100, ...}'
  ```
- Unload unused models:
  ```shell
  ollama stop <model-name>
  ```
### Out of memory
Problem: Ollama crashes or fails to load model
Solutions:
- Use a smaller model variant:
  ```shell
  ollama pull llama3.2:3b # Instead of the 70b variant
  ```
- Check available memory:
  ```shell
  # macOS
  vm_stat

  # Linux
  free -h
  ```
- Close other applications to free memory
- Use model quantization (Ollama automatically applies Q4 quantization)
## Next steps
- Configure multiple LLM providers for fallback
- Set up API keys to control access
- Enable observability to track token usage