---
title: Ollama
weight: 15
description: Configure Agentgateway to route LLM traffic to Ollama for local model inference
---
Ollama enables you to run large language models locally on your machine. Agentgateway can route requests to your local Ollama instance, providing a unified interface for both local and cloud-based LLMs.
## Use cases
- Local development: Test and develop without cloud API costs
- Privacy: Keep sensitive data on your machine
- Offline usage: Run models without internet connectivity
- Cost optimization: Avoid per-token charges during development
- Custom models: Use fine-tuned or specialized local models
## Before you begin
- Install Ollama: Download and install from [ollama.ai](https://ollama.ai)
- Pull a model: Download at least one model:
  ```shell
  ollama pull llama3.2
  ```
- Verify that Ollama is running: Check that Ollama is serving on port 11434:
  ```shell
  curl http://localhost:11434/api/version
  ```
## Basic configuration
Configure Agentgateway to route to your local Ollama instance:
```yaml
# yaml-language-server: $schema=https://agentgateway.dev/schema/config
binds:
- port: 3000
  listeners:
  - routes:
    - policies:
        urlRewrite:
          authority:
            full: localhost:11434
      backends:
      - ai:
          name: ollama
          hostOverride: localhost:11434
          provider:
            openAI:
              model: llama3.2 # Default model
```

Because Ollama serves plain HTTP on localhost, a `backendTLS` policy is not required.

## Test the configuration
```shell
curl 'http://localhost:3000/v1/chat/completions' \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "llama3.2",
    "messages": [
      {
        "role": "user",
        "content": "Hello! Tell me about Ollama in one sentence."
      }
    ]
  }'
```

## Model configuration
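Because the gateway exposes an OpenAI-compatible API, any HTTP client can issue the same request. A minimal sketch using only the Python standard library; the URL and model name assume the basic configuration above (gateway on port 3000, `llama3.2` pulled):

```python
import json
from urllib import request

# Chat completion request against the gateway; URL and model name assume
# the basic configuration above is running.
url = "http://localhost:3000/v1/chat/completions"
payload = {
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Hello! Tell me about Ollama in one sentence."}],
}
req = request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Sending the request requires the gateway and Ollama to be running:
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```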
### List available models
See which models you have pulled locally:
```shell
ollama list
```

Example output:
```
NAME                ID              SIZE    MODIFIED
llama3.2:latest     a80c4f17acd5    2.0 GB  2 weeks ago
mistral:latest      f974a74358d6    4.1 GB  3 weeks ago
codellama:latest    8fdf8f752f6e    3.8 GB  1 month ago
```

### Pull additional models
Download models from the Ollama library:
```shell
# Pull a specific model
ollama pull mistral

# Pull a specific tag/size
ollama pull llama3.2:70b
```

### Specify model in requests
You can override the default model in each request:
```shell
curl 'http://localhost:3000/v1/chat/completions' \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "mistral",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```

If you omit the `model` field in the backend configuration, clients must specify the model in each request.

## Advanced configuration
### Multiple Ollama instances
Route to different Ollama instances (e.g., different machines or ports):
```yaml
binds:
- port: 3000
  listeners:
  - routes:
    # Route to the local Ollama instance
    - policies:
        urlRewrite:
          authority:
            full: localhost:11434
      backends:
      - ai:
          name: ollama-local
          hostOverride: localhost:11434
          provider:
            openAI:
              model: llama3.2
    # Route to a remote Ollama instance
    - policies:
        urlRewrite:
          authority:
            full: 192.168.1.100:11434
      backends:
      - ai:
          name: ollama-remote
          hostOverride: 192.168.1.100:11434
          provider:
            openAI:
              model: mistral
```

### Custom port
If you’re running Ollama on a non-default port:
```shell
# Start Ollama on a custom port
OLLAMA_HOST=0.0.0.0:8080 ollama serve
```

Update your configuration to match:
```yaml
binds:
- port: 3000
  listeners:
  - routes:
    - policies:
        urlRewrite:
          authority:
            full: localhost:8080
      backends:
      - ai:
          name: ollama
          hostOverride: localhost:8080
          provider:
            openAI:
              model: llama3.2
```

### Model parameters
Control generation parameters:
```shell
curl 'http://localhost:3000/v1/chat/completions' \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Write a poem"}],
    "temperature": 0.7,
    "max_tokens": 500,
    "top_p": 0.9
  }'
```

## Embeddings with Ollama
Ollama supports embedding models for semantic search and RAG applications.
### Pull an embedding model
```shell
ollama pull nomic-embed-text
```

### Configure for embeddings
```yaml
binds:
- port: 3000
  listeners:
  - routes:
    - policies:
        urlRewrite:
          authority:
            full: localhost:11434
      backends:
      - ai:
          name: ollama-embeddings
          hostOverride: localhost:11434
          provider:
            openAI:
              model: nomic-embed-text
```

### Generate embeddings
```shell
curl 'http://localhost:3000/v1/embeddings' \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "nomic-embed-text",
    "input": "The quick brown fox jumps over the lazy dog"
  }'
```

### Popular embedding models
| Model | Size | Use Case |
|---|---|---|
| `nomic-embed-text` | 274 MB | General purpose, high quality |
| `mxbai-embed-large` | 669 MB | Long context, high accuracy |
| `all-minilm` | 45 MB | Fast, lightweight |
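The embeddings returned above are plain vectors of floats; comparing two of them for semantic similarity is typically done with cosine similarity. A dependency-free sketch (the toy vectors stand in for real embedding output, which has hundreds of dimensions):

```python
import math

# Cosine similarity between two embedding vectors, e.g. the "embedding"
# arrays returned by the /v1/embeddings endpoint.
def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors for illustration only.
print(cosine_similarity([1.0, 2.0], [1.0, 2.0]))  # same direction, ~1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal, ~0.0
```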
## Production considerations

### Performance tuning
GPU acceleration: Ollama automatically uses a GPU if one is available. Check with:

```shell
ollama ps # Shows GPU utilization
```

Model loading: The first request may be slow while the model loads into memory. Keep frequently used models loaded:

```shell
ollama run llama3.2 # Keeps the model in memory
```

Concurrent requests: Ollama handles multiple concurrent requests. Monitor with:

```shell
ollama ps # Shows active models and memory usage
```

### Resource requirements
| Model Size | RAM Required | GPU VRAM (if using GPU) |
|---|---|---|
| 3B params | 8 GB | 4 GB |
| 7B params | 16 GB | 8 GB |
| 13B params | 32 GB | 16 GB |
| 70B params | 64+ GB | 40+ GB |
## When NOT to use Ollama
Ollama is great for development but may not be ideal for:
- Production at scale: Cloud providers offer better reliability and scaling
- Low-resource environments: Large models require significant RAM/GPU
- Latest models: Cloud providers often have newer/larger models
- High availability: Ollama runs on a single machine without built-in redundancy
## Troubleshooting

### Connection refused

Problem: `curl: (7) Failed to connect to localhost port 11434: Connection refused`
Solutions:
- Check that Ollama is running:
  ```shell
  ps aux | grep ollama
  ```
- Start Ollama if it is not running:
  ```shell
  ollama serve
  ```
- Check the port binding:
  ```shell
  lsof -i :11434
  ```
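The same check can be scripted. A small Python sketch (standard library only) that tests whether anything is listening on the Ollama port:

```python
import socket

# Returns True if a TCP connection to host:port succeeds, i.e. something
# (hopefully Ollama) is listening there.
def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(port_open("localhost", 11434))  # True once `ollama serve` is running
```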
### Model not found

Problem: `{"error":"model 'llama3.2' not found"}`
Solutions:
- List pulled models:
  ```shell
  ollama list
  ```
- Pull the model:
  ```shell
  ollama pull llama3.2
  ```
- Use the exact model name from the `ollama list` output
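You can also check the installed models over HTTP instead of the CLI: Ollama's native API lists them at `GET http://localhost:11434/api/tags`. A sketch of parsing that response into plain model names (the sample payload below is illustrative, not a real response):

```python
import json

# Extract model names from an Ollama /api/tags response body.
def model_names(tags_json: str) -> list[str]:
    return [m["name"] for m in json.loads(tags_json).get("models", [])]

# Illustrative sample of the response shape; fetch the real thing with:
#   curl http://localhost:11434/api/tags
sample = '{"models": [{"name": "llama3.2:latest"}, {"name": "mistral:latest"}]}'
print(model_names(sample))  # ['llama3.2:latest', 'mistral:latest']
```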
### Slow performance
Problem: Responses are very slow
Solutions:
- Use smaller models: Try `llama3.2:3b` instead of `llama3.2:70b`
- Check GPU usage:
  ```shell
  ollama ps # Should show GPU if available
  ```
- Limit response length:
  ```shell
  # Use max_tokens to limit response length
  curl ... --data '{"model": "llama3.2", "max_tokens": 100, ...}'
  ```
- Unload unused models:
  ```shell
  ollama stop <model-name>
  ```
### Out of memory
Problem: Ollama crashes or fails to load model
Solutions:
- Use a smaller model variant:
  ```shell
  ollama pull llama3.2:3b # Instead of the 70b variant
  ```
- Check available memory:
  ```shell
  # macOS
  vm_stat

  # Linux
  free -h
  ```
- Close other applications to free memory
- Use model quantization (Ollama automatically applies Q4 quantization)
## Next steps
- Configure multiple LLM providers for fallback
- Set up API keys to control access
- Enable observability to track token usage