Agent Zero Mastery Series
    Part 3 of 6

    LLM Provider Configuration

    Configure cloud LLMs (OpenAI, Claude, Gemini, Groq) and local models with Ollama. Optimize for cost, privacy, or capability with hybrid setups.

    15 minutes
    Intermediate

    Agent Zero is model-agnostic: it works with virtually any LLM provider, from cloud APIs to locally hosted models. This flexibility lets you optimize for cost, privacy, speed, or capability depending on your needs.

    Understanding Agent Zero's Model Architecture

    Agent Zero uses three types of models for different purposes:

    Chat Model

    The primary reasoning engine. Handles complex tasks, code generation, and multi-step problem solving. This should be your most capable model.

    Utility Model

    Handles lightweight tasks like memory summarization, context compression, and quick lookups. Can be smaller and cheaper than the chat model.

    Embedding Model

    Converts text into vector representations for memory search and knowledge retrieval. Runs frequently but uses minimal resources.

    This separation lets you allocate expensive, capable models where they matter most while using efficient models for routine operations.

    OpenAI Configuration

    OpenAI offers the most straightforward setup and remains a solid default choice. Edit your .env file:

    Terminal
    cd ~/agent-zero
    nano .env

    Configure OpenAI:

    .env
    # OpenAI Configuration
    API_KEY_OPENAI=sk-your-api-key-here
    
    # Model Selection
    CHAT_MODEL_OPENAI=gpt-4o
    UTILITY_MODEL_OPENAI=gpt-4o-mini
    EMBEDDING_MODEL_OPENAI=text-embedding-3-small
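
    If you want to confirm the key works before restarting Agent Zero, a quick check against OpenAI's standard models endpoint is enough; this is just a sanity test run from the shell, not something Agent Zero itself does:

    Terminal
    # List available models; a JSON list means the key is valid, a 401 means it is not
    curl -s https://api.openai.com/v1/models \
      -H "Authorization: Bearer sk-your-api-key-here" | head -n 20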

    Model Recommendations

    Use Case       | Model         | Notes
    Best reasoning | gpt-4o        | Strongest for complex code and multi-step tasks
    Balanced       | gpt-4o-mini   | Good capability at lower cost
    Budget         | gpt-3.5-turbo | Faster, cheaper, less capable

    Anthropic Claude

    Claude excels at nuanced reasoning and longer context windows. Configure it alongside or instead of OpenAI:

    .env
    # Anthropic Configuration
    API_KEY_ANTHROPIC=sk-ant-your-api-key-here
    
    # Model Selection
    CHAT_MODEL_ANTHROPIC=claude-sonnet-4-20250514
    UTILITY_MODEL_ANTHROPIC=claude-3-5-haiku-20241022
    EMBEDDING_MODEL_ANTHROPIC=voyage-2  # Note: Anthropic uses Voyage for embeddings

    Get your API key from console.anthropic.com.
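
    To verify the key before wiring it into Agent Zero, a minimal call to Anthropic's Messages API works; the endpoint and headers below are the documented defaults, and you can swap in whichever model you configured:

    Terminal
    # Minimal request to confirm the Anthropic key is valid
    curl -s https://api.anthropic.com/v1/messages \
      -H "x-api-key: sk-ant-your-api-key-here" \
      -H "anthropic-version: 2023-06-01" \
      -H "content-type: application/json" \
      -d '{"model": "claude-3-5-haiku-20241022", "max_tokens": 32, "messages": [{"role": "user", "content": "Say hello"}]}'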

    Model Recommendations

    Use Case         | Model                     | Notes
    Best reasoning   | claude-sonnet-4-20250514  | Excellent for code and analysis
    Extended context | claude-sonnet-4-20250514  | 200K token context window
    Budget           | claude-3-5-haiku-20241022 | Fast and cost-effective

    Groq (Fast Inference)

    Groq provides extremely fast inference using custom hardware. Response times are often 10x faster than other providers, making it excellent for interactive use:

    .env
    # Groq Configuration
    API_KEY_GROQ=gsk_your-api-key-here
    
    # Model Selection
    CHAT_MODEL_GROQ=llama-3.3-70b-versatile
    UTILITY_MODEL_GROQ=llama-3.1-8b-instant
    EMBEDDING_MODEL_GROQ=nomic-embed-text  # Groq does not host embedding models; serve this via Ollama or another provider

    Get your API key from console.groq.com. Groq's free tier is generous for testing, and paid usage remains very affordable.
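
    Groq exposes an OpenAI-compatible API, so the same key-check pattern applies; listing models is a cheap way to confirm the key:

    Terminal
    curl -s https://api.groq.com/openai/v1/models \
      -H "Authorization: Bearer gsk_your-api-key-here"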

    Model Recommendations

    Use Case       | Model                   | Notes
    Best reasoning | llama-3.3-70b-versatile | Strong open model, very fast
    Balanced       | mixtral-8x7b-32768      | Good MoE model with 32K context
    Speed          | llama-3.1-8b-instant    | Blazing fast for simple tasks

    Google Gemini

    Gemini offers competitive models with generous free tiers:

    .env
    # Google Configuration
    API_KEY_GOOGLE=your-api-key-here
    
    # Model Selection
    CHAT_MODEL_GOOGLE=gemini-1.5-pro
    UTILITY_MODEL_GOOGLE=gemini-1.5-flash
    EMBEDDING_MODEL_GOOGLE=text-embedding-004

    Get your API key from aistudio.google.com.
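
    As with the other providers, you can confirm the key before restarting Agent Zero; listing models through the Generative Language API is the simplest check:

    Terminal
    curl -s "https://generativelanguage.googleapis.com/v1beta/models?key=your-api-key-here"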

    Model Recommendations

    Use Case       | Model               | Notes
    Best reasoning | gemini-1.5-pro      | 1M token context, strong multimodal
    Balanced       | gemini-1.5-flash    | Fast with good capability
    Budget         | gemini-1.5-flash-8b | Smallest and cheapest

    Setting Default Models

    After configuring providers, set which models Agent Zero uses by default:

    .env
    # Default Model Configuration
    CHAT_MODEL_DEFAULT=gpt-4o
    UTILITY_MODEL_DEFAULT=gpt-4o-mini
    EMBEDDING_MODEL_DEFAULT=text-embedding-3-small

    You can mix providers. For example, use OpenAI for chat, Groq for utility tasks, and a local embedding model:

    .env
    CHAT_MODEL_DEFAULT=gpt-4o
    UTILITY_MODEL_DEFAULT=llama-3.1-8b-instant
    EMBEDDING_MODEL_DEFAULT=nomic-embed-text

    Local LLMs with Ollama

    Running models locally eliminates API costs and keeps all data on your server. Ollama makes local LLM deployment straightforward.

    Hardware Considerations

    Local inference requires significant RAM. The model must fit entirely in memory:

    Model Parameters | Minimum RAM | Recommended RAM | Example Models
    7-8B             | 8 GB        | 12 GB           | Llama 3.1 8B, Mistral 7B, Qwen2.5 7B
    13-14B           | 16 GB       | 20 GB           | Qwen2.5 14B
    30-34B           | 32 GB       | 40 GB           | Qwen2.5 32B, CodeLlama 34B
    70B              | 64 GB       | 80 GB           | Llama 3.1 70B

    CPU inference is slower than GPU but entirely usable for async workflows. Expect 5-15 tokens/second on a modern CPU with 8B models, compared to 50+ tokens/second with a GPU. For most RamNode deployments, 8B parameter models hit the sweet spot.
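
    Before pulling a model, check what the server actually has available so you can size against the table above:

    Terminal
    # Available memory and CPU cores
    free -h
    nproc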

    Install Ollama

    Terminal
    curl -fsSL https://ollama.com/install.sh | sh

    Verify the installation:

    Terminal
    ollama --version

    Ollama runs as a systemd service automatically:

    Terminal
    sudo systemctl status ollama

    Pull Models

    Download models you want to use. Start with a capable general-purpose model:

    Terminal
    # Recommended starting model - excellent balance of capability and size
    ollama pull qwen2.5:7b
    
    # Alternative options
    ollama pull llama3.1:8b
    ollama pull mistral:7b
    ollama pull codellama:7b  # Specialized for code

    For the utility model, a smaller variant works well:

    Terminal
    ollama pull qwen2.5:3b
    # or
    ollama pull llama3.2:3b

    Pull an embedding model:

    Terminal
    ollama pull nomic-embed-text
    # or
    ollama pull mxbai-embed-large

    List installed models:

    Terminal
    ollama list

    Test Local Inference

    Verify models work before configuring Agent Zero:

    Terminal
    ollama run qwen2.5:7b "Write a Python function that calculates factorial"

    You should see the model generate a response and return to the shell. (If you start an interactive session by omitting the prompt, exit with /bye or Ctrl+D.) Check inference speed:

    Terminal
    time ollama run qwen2.5:7b "What is 2+2?" --verbose

    The --verbose flag shows tokens per second.

    Configure Agent Zero for Ollama

    Edit your .env file to use local models:

    Terminal
    nano ~/agent-zero/.env

    Add Ollama configuration:

    .env
    # Ollama Configuration
    API_URL_OLLAMA=http://localhost:11434
    
    # Model Selection (use exact names from 'ollama list')
    CHAT_MODEL_OLLAMA=qwen2.5:7b
    UTILITY_MODEL_OLLAMA=qwen2.5:3b
    EMBEDDING_MODEL_OLLAMA=nomic-embed-text
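
    From the same machine, you can confirm the API URL is reachable and the configured models are installed; the /api/tags endpoint lists everything Ollama has pulled:

    Terminal
    curl -s http://localhost:11434/api/tags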

    Set Ollama models as defaults:

    .env
    # Default to local models
    CHAT_MODEL_DEFAULT=qwen2.5:7b
    UTILITY_MODEL_DEFAULT=qwen2.5:3b
    EMBEDDING_MODEL_DEFAULT=nomic-embed-text

    Ollama Performance Tuning

    Increase Context Length

    By default, Ollama uses a 2048-token context window. For Agent Zero's complex workflows, increase it:

    Terminal
    # Write a Modelfile that raises the context window, then build a custom model from it
    cat > Modelfile <<'EOF'
    FROM qwen2.5:7b
    PARAMETER num_ctx 32768
    EOF
    ollama create qwen2.5-32k -f Modelfile
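
    To confirm the larger context took effect, print the new model's Modelfile back out with ollama show:

    Terminal
    ollama show qwen2.5-32k --modelfile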

    Update your .env to use the custom model:

    .env
    CHAT_MODEL_OLLAMA=qwen2.5-32k

    Configure Memory Usage

    Ollama automatically manages GPU/CPU memory, but you can tune behavior:

    Terminal
    # Edit Ollama service configuration
    sudo systemctl edit ollama

    Add environment variables:

    systemd override
    [Service]
    Environment="OLLAMA_NUM_PARALLEL=2"
    Environment="OLLAMA_MAX_LOADED_MODELS=2"

    Restart Ollama:

    Terminal
    sudo systemctl restart ollama
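
    You can confirm the override was picked up by inspecting the unit's effective environment:

    Terminal
    systemctl show ollama --property=Environment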

    Keep Models Loaded

    By default, Ollama unloads models after 5 minutes of inactivity. For responsive agents, keep models loaded. OLLAMA_KEEP_ALIVE is read by the Ollama server, so set it in the same systemd override as above rather than in Agent Zero's .env:

    systemd override
    [Service]
    Environment="OLLAMA_KEEP_ALIVE=24h"

    Or load models persistently:

    Terminal
    curl http://localhost:11434/api/generate -d '{"model": "qwen2.5:7b", "keep_alive": -1}'
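
    Either way, ollama ps shows which models are currently resident in memory and when they are due to be unloaded:

    Terminal
    ollama ps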

    Recommended Local Model Combinations

    Memory-Constrained (8GB RAM)

    CHAT_MODEL_DEFAULT=qwen2.5:7b
    UTILITY_MODEL_DEFAULT=qwen2.5:3b
    EMBEDDING_MODEL_DEFAULT=nomic-embed-text

    Balanced (16GB RAM)

    CHAT_MODEL_DEFAULT=qwen2.5:14b
    UTILITY_MODEL_DEFAULT=qwen2.5:7b
    EMBEDDING_MODEL_DEFAULT=mxbai-embed-large

    Code-Focused (16GB RAM)

    CHAT_MODEL_DEFAULT=deepseek-coder:6.7b
    UTILITY_MODEL_DEFAULT=qwen2.5:3b
    EMBEDDING_MODEL_DEFAULT=nomic-embed-text

    Hybrid Configurations

    The most practical setup often combines cloud and local models—using local inference for routine tasks and cloud APIs for complex reasoning.

    Strategy 1: Local Utility, Cloud Chat

    Use local models for frequent, simple operations while reserving cloud APIs for heavy lifting:

    .env
    # Cloud for complex reasoning
    CHAT_MODEL_DEFAULT=gpt-4o
    
    # Local for utility tasks (no API cost)
    UTILITY_MODEL_DEFAULT=qwen2.5:3b
    
    # Local embeddings (runs constantly, saves significant cost)
    EMBEDDING_MODEL_DEFAULT=nomic-embed-text

    This dramatically reduces API costs since embedding and utility calls happen far more frequently than chat completions.

    Strategy 2: Fast Local, Powerful Cloud Fallback

    Use fast local models for initial attempts, escalating to cloud for difficult tasks:

    .env
    # Start with local
    CHAT_MODEL_DEFAULT=qwen2.5:7b
    
    # Configure cloud as available alternative
    CHAT_MODEL_OPENAI=gpt-4o

    You can then instruct Agent Zero in custom prompts to escalate to more powerful models when local inference struggles.

    Strategy 3: Provider Redundancy

    Configure multiple providers for reliability:

    .env
    # Primary
    CHAT_MODEL_DEFAULT=gpt-4o
    API_KEY_OPENAI=sk-...
    
    # Backup
    CHAT_MODEL_ANTHROPIC=claude-sonnet-4-20250514
    API_KEY_ANTHROPIC=sk-ant-...
    
    # Local fallback
    CHAT_MODEL_OLLAMA=qwen2.5:7b

    If one provider has an outage or rate limits you, alternatives are ready.

    API Key Security

    API keys grant access to paid services. Protect them:

    Restrict File Permissions

    Terminal
    chmod 600 ~/agent-zero/.env

    This ensures only your user can read the file.
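
    Verify the permissions afterwards; the listing should show -rw------- with your user as owner:

    Terminal
    ls -l ~/agent-zero/.env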

    Use Environment Variables

    For production, consider loading keys from environment variables rather than files:

    ~/.bashrc
    # In ~/.bashrc or service file
    export API_KEY_OPENAI="sk-..."

    Then reference in .env:

    .env
    API_KEY_OPENAI=${API_KEY_OPENAI}

    Set Usage Limits

    Most providers let you set spending caps:

    • OpenAI: Settings → Limits → Set monthly budget
    • Anthropic: Settings → Limits → Usage limits
    • Google: Cloud Console → Budgets & Alerts

    Set alerts at 50% and 80% of your budget to catch runaway usage.

    Rotate Keys Periodically

    Generate new API keys monthly and revoke old ones. This limits exposure if a key is compromised.

    Testing Your Configuration

    After configuring providers, restart Agent Zero and verify each model works:

    Terminal
    # Restart if running as service
    sudo systemctl restart agent-zero
    
    # Or restart manually
    cd ~/agent-zero
    source venv/bin/activate
    python run_ui.py

    Test Prompts

    Test chat model:

    "Write a Python script that fetches weather data from an API, parses the JSON response, and formats it nicely for terminal output."

    Test utility model (happens automatically with memory operations):

    "Remember that my favorite programming language is Python."

    Test embeddings (happens automatically with knowledge queries):

    "Search your knowledge for information about Python."

    Check logs for any model loading errors:

    Terminal
    journalctl -u agent-zero -f
    # And in another terminal:
    journalctl -u ollama -f
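
    If the embedding model runs locally, you can also exercise it directly; the /api/embeddings route below is Ollama's standard embedding endpoint, so swap in whichever model you configured:

    Terminal
    curl -s http://localhost:11434/api/embeddings \
      -d '{"model": "nomic-embed-text", "prompt": "test sentence"}'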

    Provider Comparison Summary

    Provider  | Speed     | Cost     | Privacy | Best For
    OpenAI    | Fast      | Medium   | Cloud   | General use, broad capability
    Anthropic | Fast      | Medium   | Cloud   | Complex reasoning, long context
    Groq      | Very Fast | Low      | Cloud   | Interactive use, speed-critical
    Google    | Fast      | Low/Free | Cloud   | Budget-conscious, multimodal
    Ollama    | Slower    | Free     | Local   | Privacy, no ongoing costs

    What's Next

    Your Agent Zero instance can now leverage multiple LLM providers, from powerful cloud APIs to fully private local models. In Part 4: Memory Systems & Knowledge Management, we'll explore:

    • How Agent Zero's memory architecture works
    • Configuring persistent storage for agent learning
    • Building custom knowledge bases from your documents
    • Setting up SearXNG for private web search
    • Optimizing memory for long-running agents

    The memory system is what transforms Agent Zero from a stateless chatbot into a genuinely useful assistant that improves over time.