Modalities
Text, image, video, and audio — what each modality means in Syntax and how multimodal models surface to your harness.
Where purpose is what a model is for, modality is what kind of data the model accepts and emits. Models can be unimodal (text only) or multimodal (text + image, or text + audio, etc.).
The four common modalities
| Modality | Meaning |
|---|---|
| Text | Tokens — chat, code, structured outputs. |
| Image | Still images, in or out. |
| Video | Sequences of frames, in or out. |
| Audio | Audio waveforms — speech and non-speech. |
A vision-language model is "text + image" in. A diffusion image generator is "text" in, "image" out. A speech-to-speech model is "audio" in, "audio" out. A multimodal LLM might accept all four.
How multimodal LLMs work through Syntax
When you deploy a multimodal LLM (text + image, text + audio, etc.):
- The model is registered with its declared modalities.
- The Bridge accepts content blocks (image URLs, base64-encoded images, audio chunks) in the appropriate API surface and routes them to the model.
- Streaming, tool calls, and reasoning continue to work alongside multimodal input.
If your harness sends a multimodal request to a unimodal model, the Bridge returns a clear error rather than silently dropping the non-text content.
How non-LLM multimodal models work through Syntax
Models with non-LLM modalities — image generators, OCR, segmenters, TTS, audio generators, mesh recovery, UI grounding, time-series forecasting — surface as tools on the main agent rather than chat-completion targets.
Concretely: when an image generator is deployed, the main agent sees
a generate_image tool. When the user's request needs an image, the
agent calls the tool, the tool runs the model, and the result is
folded back into the conversation. The same pattern applies to every
non-LLM modality.
Capability scoring in the Party Builder
The Party Builder uses modality coverage as part of its capability scoring. When you compose a party, you can see at a glance:
- Which input modalities your party can handle.
- Which output modalities your party can produce.
- Where there are gaps — for example, "no image generation in this party" or "no audio transcription".
Picking a specialist that closes a gap is a single click.
Where to go next
- Models → Purposes — the purpose taxonomy.
- Inference → Multimodal capabilities — what each modality looks like at runtime.
- Concepts → Party Builder