Multimodal capabilities
Image, video, audio, 3D, UI grounding, OCR, and time-series forecasting — all reachable through the same Bridge.
Syntax isn't limited to text. The catalog includes models for a wide range of modalities, and the Bridge exposes them as tools the main agent can invoke alongside text generation.
Supported modalities
| Modality | Examples of what's supported |
|---|---|
| Text generation | Chat, code, reasoning, structured outputs. |
| Embedding | Sentence and code embeddings for semantic search. |
| Reranking | Listwise reranking for retrieval pipelines. |
| Image understanding | Vision-language models that look at images and answer questions. |
| OCR | Optical character recognition. |
| Image processing | Style transfer, restoration, adjustment. |
| Image generation | Text-to-image, image-to-image diffusion. |
| Video processing | Temporal segmentation, video Q&A. |
| Video generation | Text-to-video, image-to-video. |
| Segmentation | Image and video segmentation. |
| TTS (text-to-speech) | High-quality speech synthesis. |
| Audio generation | Music and sound-effect generation, video-to-audio (V2A) Foley. |
| Audio transcription | Speech-to-text. |
| Speech-to-speech | Voice transformation, style transfer. |
| Mesh recovery | 3D mesh from images or video. |
| UI grounding | Locate UI elements in screenshots. |
| Time-series forecasting | Foundation-model forecasting (Chronos, TimesFM, MOMENT, Granite-TTM, etc.). |
How multimodal capabilities surface to your harness
Each multimodal capability is exposed as a tool the main agent can
invoke. When a multimodal model is deployed, the Bridge registers its
capability (for example `generate_image`, `transcribe_audio`,
`segment_image`, or `text_to_speech`) so the main agent can pick
the right tool when the user's request needs it.
The capability set is dynamic: it is recomputed every time you
deploy or undeploy a model. If you have an image generator deployed
today and remove it tomorrow, the agent stops seeing `generate_image`
as an available tool.
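The deploy and undeploy behaviour described above can be pictured with a small sketch. It is illustrative only: `CapabilityRegistry`, its methods, and the model names are hypothetical stand-ins, not the Bridge's actual API. Only the tool names come from the text above.

```python
from dataclasses import dataclass, field


@dataclass
class CapabilityRegistry:
    """Hypothetical stand-in for the Bridge's tool registry (not the real API)."""

    # Maps a deployed model name to the capability (tool name) it provides.
    deployed: dict[str, str] = field(default_factory=dict)

    def deploy(self, model: str, capability: str) -> None:
        self.deployed[model] = capability

    def undeploy(self, model: str) -> None:
        # Removing a model that is not deployed is treated as a no-op.
        self.deployed.pop(model, None)

    def available_tools(self) -> set[str]:
        # Derived from the current deployments on every call, so the
        # result always reflects what is deployed right now.
        return set(self.deployed.values())


registry = CapabilityRegistry()
registry.deploy("example-image-model", "generate_image")      # hypothetical model name
registry.deploy("example-speech-model", "transcribe_audio")   # hypothetical model name
tools_before = registry.available_tools()

registry.undeploy("example-image-model")
tools_after = registry.available_tools()  # generate_image is no longer offered
```

In this sketch a tool stays available as long as at least one deployed model provides it. That is an assumption made for the example, not documented behaviour.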
Engine selection for multimodal
Each modality is served by the engine class best suited to it:
- LLMs and vision-language models run on GPU-serving engines.
- Image and video generation run on diffusion-friendly engines.
- Specialized non-LLM models (OCR, segmentation, TTS, audio generation, mesh recovery, UI grounding, time-series forecasting) run on a serving framework optimized for those workloads.
- On Apple Silicon, the Apple-native stack handles eligible models.
The autotuner picks all of this for you — see Multi-engine inference.
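As a rough mental model of the mapping listed above, the sketch below routes a modality to an engine class. Every name in it (`EngineClass`, `ENGINE_BY_MODALITY`, `pick_engine_class`, the modality keys) is hypothetical, and the rule that an eligible model on Apple Silicon always goes to the Apple-native stack is an assumption. The real autotuner may weigh other factors.

```python
from enum import Enum


class EngineClass(Enum):
    """Hypothetical labels for the engine classes named in the list above."""

    GPU_SERVING = "gpu-serving"
    DIFFUSION = "diffusion"
    SPECIALIZED = "specialized-serving"
    APPLE_NATIVE = "apple-native"


# Groupings follow the list above; the string keys are made up for this sketch.
ENGINE_BY_MODALITY = {
    "text_generation": EngineClass.GPU_SERVING,
    "image_understanding": EngineClass.GPU_SERVING,
    "image_generation": EngineClass.DIFFUSION,
    "video_generation": EngineClass.DIFFUSION,
    "ocr": EngineClass.SPECIALIZED,
    "segmentation": EngineClass.SPECIALIZED,
    "tts": EngineClass.SPECIALIZED,
    "audio_generation": EngineClass.SPECIALIZED,
    "mesh_recovery": EngineClass.SPECIALIZED,
    "ui_grounding": EngineClass.SPECIALIZED,
    "time_series_forecasting": EngineClass.SPECIALIZED,
}


def pick_engine_class(
    modality: str,
    *,
    on_apple_silicon: bool = False,
    apple_eligible: bool = False,
) -> EngineClass:
    # Assumption: an eligible model on Apple Silicon uses the Apple-native stack.
    if on_apple_silicon and apple_eligible:
        return EngineClass.APPLE_NATIVE
    try:
        return ENGINE_BY_MODALITY[modality]
    except KeyError:
        raise ValueError(f"no engine class known for modality {modality!r}") from None
```

For example, `pick_engine_class("ocr")` returns `EngineClass.SPECIALIZED`, and a modality missing from the table raises `ValueError` rather than falling back silently.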
Where to go next
- Models → Modalities — modality-by-modality capability summary.
- Models → Purposes — the full list of Model Purpose categories.
- Concepts → Party Builder — adding multimodal specialists to a party.