Media support¶
Autobot handles media in three directions:
- Vision (inbound) — photos sent by users are downloaded, base64-encoded, and forwarded to the LLM as multimodal content blocks
- Image generation (outbound) — the LLM can create images via the `generate_image` tool and send them back to users
- Voice transcription (inbound) — voice messages are transcribed to text via the Whisper API before reaching the LLM
Vision¶
How it works¶
- Channel receives a photo and downloads the file bytes via the platform API
- MediaAttachment stores the base64-encoded data in a transient `data` field (excluded from JSON serialization to avoid bloating session files)
- Context builder detects attachments with `data` and builds an array of content blocks (text + image) in OpenAI's `image_url` format
- Provider sends the content blocks directly for OpenAI-compatible APIs, or converts them to Anthropic's `image/source/base64` format for the native Anthropic path
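The inbound flow above can be sketched in Python (an illustrative sketch only; `build_content_blocks` and its signature are hypothetical, not part of autobot):

```python
import base64


def build_content_blocks(text: str, image_bytes: bytes, mime: str = "image/jpeg") -> list:
    """Combine user text and an inbound photo into OpenAI-style content blocks."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return [
        {"type": "text", "text": text},
        # The image travels inline as a data: URL, so no hosting is needed.
        {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}},
    ]
```

The same block array is what the context builder hands to the provider layer; only the Anthropic-native path needs further conversion.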
Supported channels¶
| Channel | Status | Notes |
|---|---|---|
| Telegram | Supported | Auto-downloads photos via Bot API |
| Slack | Planned | Needs url_private download with auth |
| | Planned | Needs bridge-side changes to forward images |
| Zulip | Not supported | Media handling not yet implemented |
Supported providers¶
All providers work with vision — the internal format uses OpenAI-compatible image_url content blocks:
- OpenAI-compatible (OpenAI, DeepSeek, Groq, Gemini, OpenRouter, vLLM, etc.) — content blocks are serialized directly, no conversion needed
- Anthropic native — `image_url` blocks are automatically converted to Anthropic's `image/source/base64` format
Note: The LLM model itself must support vision. Non-vision models will ignore or fail to interpret image content.
Configuration¶
No additional configuration is needed. Vision works automatically when:
- The channel is enabled and configured
- The LLM model supports multimodal/vision input
Optional proxy support for Telegram file downloads:
channels:
telegram:
enabled: true
token: "BOT_TOKEN"
proxy: "http://proxy.example.com:8080" # Optional
Limits¶
- Max image size: 20 MB (configurable via the `MAX_IMAGE_SIZE` constant)
- Telegram Bot API limit: 20 MB for file downloads
- Images are not persisted in session history — only the current turn's images are sent to the LLM to avoid token cost bloat
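A minimal sketch of the size check implied by these limits (the constant name mirrors the docs; the function itself is hypothetical):

```python
MAX_IMAGE_SIZE = 20 * 1024 * 1024  # 20 MB, matching the limit described above


def within_size_limit(file_bytes: bytes) -> bool:
    """Reject oversized images before base64-encoding them for the LLM."""
    return len(file_bytes) <= MAX_IMAGE_SIZE
```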
Architecture details¶
MediaAttachment.data¶
The `data` field on `MediaAttachment` uses `@[JSON::Field(ignore: true)]` to keep base64 image data out of JSON serialization. This means:
- Session files (JSONL) stay small — no multi-MB base64 strings
- Past images are not re-sent on subsequent turns
- The field is only populated for the current inbound message
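The effect of the ignored field can be sketched in Python (autobot itself uses Crystal's `JSON::Field` annotation; this class is a hypothetical stand-in):

```python
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MediaAttachment:
    mime_type: str
    data: Optional[str] = field(default=None)  # base64 payload, current turn only

    def to_json(self) -> str:
        # Mirror @[JSON::Field(ignore: true)]: the heavy `data` field is
        # dropped, so session JSONL files never contain base64 blobs.
        return json.dumps({"mime_type": self.mime_type})
```

Because `data` never reaches disk, replayed sessions cannot re-send old images, which is exactly the token-cost behavior described above.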
Content block format¶
The context builder produces OpenAI-format content blocks:
[
{"type": "text", "text": "Analyze this image"},
{"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
]
For Anthropic native, this is converted to:
[
{"type": "text", "text": "Analyze this image"},
{"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": "..."}}
]
Image generation¶
The `generate_image` tool allows the LLM to create images from text prompts and send them directly to users.
How it works¶
- The user asks the LLM to create an image
- The LLM calls `generate_image(prompt)` with a description
- The tool calls the provider's image generation API
- Base64 image data is wrapped in an `OutboundMessage` with a `MediaAttachment`
- The channel sends the photo to the user (e.g. Telegram `sendPhoto`)
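The final wrapping step can be sketched as follows (the `OutboundMessage` and `MediaAttachment` classes here are hypothetical Python stand-ins for autobot's types):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MediaAttachment:
    mime_type: str
    data: Optional[str] = None  # base64-encoded image bytes


@dataclass
class OutboundMessage:
    text: str
    attachment: Optional[MediaAttachment] = None


def wrap_generated_image(b64_png: str, caption: str) -> OutboundMessage:
    """Wrap base64 image data so the channel can send it (e.g. via sendPhoto)."""
    return OutboundMessage(text=caption, attachment=MediaAttachment("image/png", b64_png))
```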
Supported providers¶
| Provider | Default model | API |
|---|---|---|
| OpenAI | gpt-image-1 | /v1/images/generations |
| Gemini | gemini-2.5-flash-image | /v1beta/models/{model}:generateContent |
Note: Anthropic does not support image generation. When no explicit override is set, autobot automatically picks the first available image-capable provider (it tries OpenAI, then Gemini). Use `tools.image.provider` to force a specific one.
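The fallback order described in the note can be sketched as a small selection function (names are illustrative, not autobot's internals):

```python
from typing import Optional


def pick_image_provider(configured: dict, override: Optional[str] = None) -> str:
    """Choose the image provider: explicit override wins, then OpenAI, then Gemini."""
    if override:
        return override
    for name in ("openai", "gemini"):  # the fallback order from the note above
        if name in configured:
            return name
    raise RuntimeError("no image-capable provider configured")
```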
Configuration¶
Image generation is auto-enabled when an OpenAI or Gemini provider is configured. No extra settings needed.
To override the provider or model:
tools:
image:
enabled: true
provider: openai # optional, auto-detected from configured providers
model: gpt-image-1 # optional, auto-detected from provider
size: 1024x1024 # optional, default: 1024x1024
Supported channels (outbound)¶
| Channel | Status | Notes |
|---|---|---|
| Telegram | Supported | Sends photos via sendPhoto multipart API |
| Slack | Text fallback | Logs warning, sends caption as text |
| Zulip | Text fallback | Logs warning, sends caption as text |
Verification¶
Run `autobot doctor` to check image generation status. The doctor output also reports when no image-capable provider is configured.
Voice transcription¶
Voice and audio messages received via Telegram are automatically transcribed to text using the Whisper API before being sent to the LLM.
How it works¶
- Channel receives a voice/audio message and downloads the file bytes
- Transcriber sends the audio to the Whisper API (OpenAI or Groq) and receives text
- The transcribed text replaces the `[voice message]` placeholder as `[voice transcription]: {text}`
- The LLM receives the transcription as regular text content
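The placeholder substitution in the last two steps can be sketched as (function name is illustrative):

```python
from typing import Optional


def apply_transcription(message_text: str, transcript: Optional[str] = None) -> str:
    """Replace the [voice message] placeholder with the Whisper transcript, if any."""
    if transcript is None:
        return message_text  # no Whisper provider configured: placeholder stays
    return message_text.replace("[voice message]", f"[voice transcription]: {transcript}")
```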
Configuration¶
No extra configuration needed. Voice transcription is auto-enabled when a Whisper-capable provider (Groq or OpenAI) is configured, for example:
providers:
openai:
api_key: "${OPENAI_API_KEY}" # Voice transcription auto-enabled via OpenAI Whisper
Groq is preferred when both are configured (it is faster and has a free tier). If neither is configured, voice messages fall back to `[voice message]` text with no errors.
Supported providers¶
| Provider | Model | API endpoint |
|---|---|---|
| Groq | whisper-large-v3-turbo | api.groq.com/openai/v1/audio/transcriptions |
| OpenAI | whisper-1 | api.openai.com/v1/audio/transcriptions |
Verification¶
Run `autobot doctor` to check voice transcription status. The doctor output also reports when no Whisper-capable provider is configured.