vLLM / Local models

Autobot supports vLLM and any OpenAI-compatible local inference server as an LLM provider. This lets you run models on your own hardware with full privacy and no API costs.

Setup

1. Start a local server

Start a vLLM server (or any OpenAI-compatible endpoint):

# vLLM
vllm serve meta-llama/Llama-3.3-70B-Instruct --port 8000

# Ollama (OpenAI-compatible mode)
ollama serve  # Runs on port 11434

# llama.cpp server
llama-server -m model.gguf --port 8080
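
Before wiring the server into Autobot, confirm it answers OpenAI-style requests. vLLM and llama.cpp expose the API under /v1 (Ollama does too in its OpenAI-compatible mode); adjust the port and path for your server:

curl http://localhost:8000/v1/models
# Expect JSON with a "data" array listing the served model IDs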

2. Configure the provider

In config.yml:

agents:
  defaults:
    model: "vllm/meta-llama/Llama-3.3-70B-Instruct"

providers:
  vllm:
    api_base: "http://localhost:8000"
    api_key: "token"

The api_key field is required by the config schema but most local servers ignore it. Use any non-empty value (e.g. "token" or "none").

3. Verify

autobot doctor
# Should show: ✓ LLM provider configured (vllm)

Model naming

For local providers, the model name after the vllm/ prefix should match what the server expects:

# vLLM — uses the model name from the serve command
model: "vllm/meta-llama/Llama-3.3-70B-Instruct"

# Ollama — uses the model tag
model: "vllm/llama3.3:70b"

# llama.cpp — usually any string works (model is already loaded)
model: "vllm/local"

Endpoint configuration

The api_base should point to your server's base URL. Autobot appends /chat/completions automatically:

# vLLM (default port 8000)
api_base: "http://localhost:8000"
# -> POST http://localhost:8000/chat/completions

# Ollama (default port 11434)
api_base: "http://localhost:11434/v1"
# -> POST http://localhost:11434/v1/chat/completions

# Custom server with full path
api_base: "http://localhost:8080/v1/chat/completions"
# -> POST http://localhost:8080/v1/chat/completions (used as-is)

If the api_base already ends with /chat/completions, it is used as-is.
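
In other words: use api_base verbatim when it already ends in /chat/completions, otherwise append that suffix. A minimal shell sketch of the same rule (illustrative only, not Autobot's actual code):

resolve_endpoint() {
  local base="${1%/}"                       # drop a trailing slash, if any
  case "$base" in
    */chat/completions) echo "$base" ;;     # already a full endpoint: use as-is
    *) echo "$base/chat/completions" ;;     # otherwise append the suffix
  esac
}

resolve_endpoint "http://localhost:8000"      # http://localhost:8000/chat/completions
resolve_endpoint "http://localhost:11434/v1"  # http://localhost:11434/v1/chat/completions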

Configuration reference

Field          Required  Default  Description
api_key        Yes*      (none)   API key or token; most local servers ignore it, so any non-empty value works
api_base       Yes       (none)   Server URL (e.g. http://localhost:8000)
extra_headers  No        (none)   Additional HTTP headers sent with every request

*Required by config schema, but the value typically does not matter for local servers.
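
To illustrate extra_headers, here is a hypothetical setup for a server behind a reverse proxy that checks a custom header (the header name and value are placeholders):

providers:
  vllm:
    api_base: "http://localhost:8000"
    api_key: "token"
    extra_headers:
      X-Proxy-Auth: "my-proxy-token"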

How it works

vLLM and other local servers implement the OpenAI-compatible Chat Completions API:

  • An Authorization: Bearer header is sent with every request (most local servers ignore it)
  • Messages use the standard format with role and content fields
  • Function calling works if the served model supports it

Autobot detects local providers by the vllm/ prefix in the model name or by an explicit provider_name in the config, and uses the OpenAI-compatible request format for all local servers.
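
Because the wire format is the standard Chat Completions API, you can reproduce the shape of these requests with curl. The example below is a hand-written request against vLLM (which serves the API under /v1), not Autobot's literal payload:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer token" \
  -d '{
        "model": "meta-llama/Llama-3.3-70B-Instruct",
        "messages": [{"role": "user", "content": "Say hello"}]
      }'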

Voice transcription

Local servers do not provide a transcription API. If you need voice message support, configure an additional Groq or OpenAI provider for Whisper-based transcription.
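
A sketch of such a setup, assuming a groq provider block follows the same schema as the vllm block above (the exact fields Autobot uses for transcription are an assumption here):

providers:
  vllm:
    api_base: "http://localhost:8000"
    api_key: "token"
  groq:
    api_key: "your-groq-api-key"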

Known limitations

  • No streaming — Responses are returned in full after the model finishes generating.
  • Tool support varies — Function calling depends on the model and server. Not all local models support tools.
  • Tool choice is always auto — There is no configuration to force a specific tool or disable tool use per-request.
  • No automatic model detection — You must specify the exact model name the server expects.

Troubleshooting

Enable debug logging to see request/response details:

LOG_LEVEL=DEBUG autobot agent -m "Hello"

Look for:

  • POST http://localhost:8000/chat/completions model=... — confirms provider is active
  • Response 200 (N bytes) — confirms server response
  • LLM request failed: ... — connection or request errors

Common issues

"No LLM provider configured" — Check that api_key is set (any non-empty value) and api_base points to your running server.

"LLM request failed: Connection refused" — Server is not running or the port is wrong. Verify the server is up with curl http://localhost:8000/v1/models.

"API error: model not found" — The model name in config doesn't match what the server is serving. Check available models with curl http://localhost:8000/v1/models.

Slow responses — Local inference speed depends on your hardware (GPU/CPU). Smaller or quantized models run faster.
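
To check whether slowness comes from the server rather than Autobot, time a bounded completion directly against the server (shown for vLLM; max_tokens caps the generation so the test finishes quickly):

time curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.3-70B-Instruct",
        "messages": [{"role": "user", "content": "Hi"}],
        "max_tokens": 64
      }' > /dev/null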