Syntax

Multimodal capabilities

Image, video, audio, 3D, UI grounding, OCR, and time-series forecasting — all reachable through the same Bridge.

Syntax isn't limited to text. The catalog includes models for a wide range of modalities, and the Bridge exposes them as tools the main agent can invoke alongside text generation.

Supported modalities

Modality                  Examples of what's supported
Text generation           Chat, code, reasoning, structured outputs.
Embedding                 Sentence and code embeddings for semantic search.
Reranking                 Listwise reranking for retrieval pipelines.
Image understanding       Vision-language models that look at images and answer questions.
OCR                       Optical character recognition.
Image processing          Style transfer, restoration, adjustment.
Image generation          Text-to-image, image-to-image diffusion.
Video processing          Temporal segmentation, video Q&A.
Video generation          Text-to-video, image-to-video.
Segmentation              Image and video segmentation.
TTS (text-to-speech)      High-quality speech synthesis.
Audio generation          Music and effect generation, V2A Foley.
Audio transcription       Speech-to-text.
Speech-to-speech          Voice transformation, style transfer.
Mesh recovery             3D mesh from images or video.
UI grounding              Locate UI elements in screenshots.
Time-series forecasting   Foundation-model forecasting (Chronos, TimesFM, MOMENT, Granite-TTM, etc.).

How multimodal capabilities surface to your harness

Each multimodal capability is exposed as a tool the main agent can invoke. When a multimodal model is deployed, the Bridge registers its capability — generate_image, transcribe_audio, segment_image, text_to_speech, etc. — so the main agent can pick the right tool when the user's request needs it.

The capability set is dynamic: it's recomputed every time you deploy or undeploy a model. If you have an image generator deployed today and remove it tomorrow, the agent stops seeing generate_image as an available tool.
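
As a rough illustration of this behavior, here is a minimal Python sketch of a tool list that is recomputed from the current deployments. The CapabilityRegistry class, its methods, and the model names are hypothetical, invented for this example; they are not the Bridge's actual API. Only the tool names generate_image and transcribe_audio come from the text above.

```python
from dataclasses import dataclass, field


@dataclass
class CapabilityRegistry:
    """Hypothetical stand-in for the Bridge's tool registry (not the real API)."""

    # Maps a deployed model name to the capability (tool name) it provides.
    deployed: dict[str, str] = field(default_factory=dict)

    def deploy(self, model: str, capability: str) -> None:
        """Record that a model providing `capability` is now deployed."""
        self.deployed[model] = capability

    def undeploy(self, model: str) -> None:
        """Remove a model; unknown model names are ignored."""
        self.deployed.pop(model, None)

    def available_tools(self) -> list[str]:
        # Recomputed from the current deployments on every call, so a tool
        # disappears once no deployed model provides it.
        return sorted(set(self.deployed.values()))


registry = CapabilityRegistry()
registry.deploy("image-model-a", "generate_image")      # hypothetical model name
registry.deploy("speech-model-a", "transcribe_audio")   # hypothetical model name
tools_before = registry.available_tools()

registry.undeploy("image-model-a")
tools_after = registry.available_tools()

print(tools_before)
print(tools_after)
```

In this sketch the tool list is derived from the deployments rather than stored separately, so it cannot drift out of sync with what is actually deployed.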

Engine selection for multimodal

Each modality is served by the engine class best suited to it:

  • LLMs and vision-language models run on GPU-serving engines.
  • Image and video generation run on diffusion-friendly engines.
  • Specialized non-LLM models (OCR, segmentation, TTS, audio generation, mesh recovery, UI grounding, time-series forecasting) run on a serving framework optimized for those workloads.
  • On Apple Silicon, the Apple-native stack handles eligible models.

The autotuner picks all of this for you — see Multi-engine inference.
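
To make the mapping in the list above concrete, the following Python sketch encodes it as a lookup. This is an illustration under stated assumptions, not the autotuner's implementation: the EngineClass enum, the pick_engine_class function, its parameters, and the lowercase modality keys are all hypothetical. The sketch also assumes the Apple-native stack takes precedence whenever a model is eligible, which the text does not state explicitly.

```python
from enum import Enum


class EngineClass(Enum):
    """Hypothetical labels for the engine classes described in the list."""

    GPU_SERVING = "gpu-serving"
    DIFFUSION = "diffusion"
    SPECIALIZED = "specialized-serving"
    APPLE_NATIVE = "apple-native"


# Only modalities that the list names explicitly are mapped here.
_ENGINE_BY_MODALITY = {
    "text generation": EngineClass.GPU_SERVING,
    "image understanding": EngineClass.GPU_SERVING,
    "image generation": EngineClass.DIFFUSION,
    "video generation": EngineClass.DIFFUSION,
    "ocr": EngineClass.SPECIALIZED,
    "segmentation": EngineClass.SPECIALIZED,
    "tts": EngineClass.SPECIALIZED,
    "audio generation": EngineClass.SPECIALIZED,
    "mesh recovery": EngineClass.SPECIALIZED,
    "ui grounding": EngineClass.SPECIALIZED,
    "time-series forecasting": EngineClass.SPECIALIZED,
}


def pick_engine_class(
    modality: str,
    apple_silicon: bool = False,
    apple_eligible: bool = False,
) -> EngineClass:
    """Return the engine class for a modality (illustrative only)."""
    # Assumption: on Apple Silicon, an eligible model goes to the native stack.
    if apple_silicon and apple_eligible:
        return EngineClass.APPLE_NATIVE
    key = modality.strip().lower()
    if key not in _ENGINE_BY_MODALITY:
        raise ValueError(f"no engine class listed for modality: {modality!r}")
    return _ENGINE_BY_MODALITY[key]


print(pick_engine_class("OCR"))
print(pick_engine_class("image generation"))
print(pick_engine_class("text generation", apple_silicon=True, apple_eligible=True))
```

Modalities the list does not mention, such as embedding or reranking, are deliberately left unmapped and raise a ValueError here rather than guessing at an engine class.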

Where to go next