Multimodal capabilities

Image, video, audio, 3D, UI grounding, OCR, and time-series forecasting — all reachable through the same Bridge.

Syntax isn't limited to text. The catalog includes models for a wide range of modalities, and the Bridge exposes them as tools the main agent can invoke alongside text generation.

Supported modalities

Modality	Examples of what's supported
Text generation	Chat, code, reasoning, structured outputs.
Embedding	Sentence and code embeddings for semantic search.
Reranking	Listwise reranking for retrieval pipelines.
Image understanding	Vision-language models that look at images and answer questions.
OCR	Optical character recognition: extracting text from images.
Image processing	Style transfer, restoration, adjustment.
Image generation	Text-to-image, image-to-image diffusion.
Video processing	Temporal segmentation, video Q&A.
Video generation	Text-to-video, image-to-video.
Segmentation	Image and video segmentation.
TTS (text-to-speech)	High-quality speech synthesis.
Audio generation	Music and sound-effect generation, video-to-audio (V2A) Foley.
Audio transcription	Speech-to-text.
Speech-to-speech	Voice transformation, style transfer.
Mesh recovery	3D mesh from images or video.
UI grounding	Locate UI elements in screenshots.
Time-series forecasting	Foundation-model forecasting (Chronos, TimesFM, MOMENT, Granite-TTM, etc.).

How multimodal capabilities surface to your harness

Each multimodal capability is exposed as a tool the main agent can invoke. When a multimodal model is deployed, the Bridge registers its capability — generate_image, transcribe_audio, segment_image, text_to_speech, etc. — so the main agent can pick the right tool when the user's request needs it.

The capability set is dynamic: it's recomputed every time you deploy or undeploy a model. If you have an image generator deployed today and remove it tomorrow, the agent stops seeing generate_image as an available tool.
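
To make the deploy/undeploy behavior concrete, here is a minimal Python sketch of a capability set derived from current deployments. It is an illustration only, not the Bridge's actual implementation: `CapabilityRegistry`, `PURPOSE_TO_TOOL`, the purpose keys, and the model names are all hypothetical. Only the tool names (generate_image, transcribe_audio, segment_image, text_to_speech) come from this page.

```python
from dataclasses import dataclass, field

# Hypothetical mapping from a model's purpose to the tool name the agent sees.
# The tool names appear on this page; the purpose keys are illustrative.
PURPOSE_TO_TOOL = {
    "image_generation": "generate_image",
    "audio_transcription": "transcribe_audio",
    "segmentation": "segment_image",
    "tts": "text_to_speech",
}


@dataclass
class CapabilityRegistry:
    """Toy stand-in for the Bridge's view of deployed models (an assumption)."""

    deployed: dict[str, str] = field(default_factory=dict)  # model name -> purpose

    def deploy(self, model: str, purpose: str) -> None:
        self.deployed[model] = purpose

    def undeploy(self, model: str) -> None:
        self.deployed.pop(model, None)

    def available_tools(self) -> set[str]:
        # Derived from current deployments on every call, so a tool
        # disappears once the last model providing it is removed.
        return {
            PURPOSE_TO_TOOL[purpose]
            for purpose in self.deployed.values()
            if purpose in PURPOSE_TO_TOOL
        }


registry = CapabilityRegistry()
registry.deploy("my-diffusion-model", "image_generation")
registry.deploy("my-asr-model", "audio_transcription")
tools_before = registry.available_tools()

registry.undeploy("my-diffusion-model")
tools_after = registry.available_tools()
```

In this sketch, `tools_before` contains both generate_image and transcribe_audio, while `tools_after` contains only transcribe_audio. Deriving the tool set from deployments, rather than storing it separately, is one simple way to keep the two from drifting apart.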

Engine selection for multimodal

Each modality is served by the engine class best suited to it:

LLMs and vision-language models run on GPU-serving engines.
Image and video generation run on diffusion-friendly engines.
Specialized non-LLM models (OCR, segmentation, TTS, audio generation, mesh recovery, UI grounding, time-series forecasting) run on a serving framework optimized for those workloads.
On Apple Silicon, the Apple-native stack handles eligible models.

The autotuner picks all of this for you — see Multi-engine inference.
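
The routing rules listed above can be pictured as a simple lookup. The Python sketch below is a mental model only and does not reflect how the autotuner actually decides: the function `pick_engine_class`, the engine-class labels, the modality keys, and the two Apple Silicon flags are hypothetical names that paraphrase this page.

```python
# Engine-class labels paraphrase this page; they are not real engine identifiers.
ENGINE_CLASS_BY_MODALITY = {
    "text_generation": "gpu-serving",
    "image_understanding": "gpu-serving",
    "image_generation": "diffusion",
    "video_generation": "diffusion",
    "ocr": "specialized-serving",
    "segmentation": "specialized-serving",
    "tts": "specialized-serving",
    "audio_generation": "specialized-serving",
    "mesh_recovery": "specialized-serving",
    "ui_grounding": "specialized-serving",
    "time_series_forecasting": "specialized-serving",
}


def pick_engine_class(
    modality: str,
    apple_silicon: bool = False,
    apple_eligible: bool = False,
) -> str:
    """Return an illustrative engine class for a modality.

    Assumption: on Apple Silicon, an eligible model takes the Apple-native
    stack; otherwise the modality's default engine class applies.
    """
    if apple_silicon and apple_eligible:
        return "apple-native"
    try:
        return ENGINE_CLASS_BY_MODALITY[modality]
    except KeyError:
        # Modalities this page does not assign to an engine class are left unmapped.
        raise ValueError(f"no engine class listed for modality: {modality!r}") from None


ocr_engine = pick_engine_class("ocr")
mac_engine = pick_engine_class("image_generation", apple_silicon=True, apple_eligible=True)
```

Modalities that this page does not assign to an engine class (embedding and reranking, for example) are deliberately left out of the table rather than guessed.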

Where to go next

Models → Modalities — modality-by-modality capability summary.
Models → Purposes — the full list of Model Purpose categories.
Concepts → Party Builder — adding multimodal specialists to a party.
