Syntax

Modalities

Text, image, video, and audio — what each modality means in Syntax and how multimodal models surface to your harness.

Where purpose is what a model is for, modality is what kind of data the model accepts and emits. Models can be unimodal (text only) or multimodal (text + image, or text + audio, etc.).

The four common modalities

ModalityMeaning
TextTokens — chat, code, structured outputs.
ImageStill images, in or out.
VideoSequences of frames, in or out.
AudioAudio waveforms — speech and non-speech.

A vision-language model is "text + image" in. A diffusion image generator is "text" in, "image" out. A speech-to-speech model is "audio" in, "audio" out. A multimodal LLM might accept all four.

How multimodal LLMs work through Syntax

When you deploy a multimodal LLM (text + image, text + audio, etc.):

  • The model is registered with its declared modalities.
  • The Bridge accepts content blocks (image URLs, base64-encoded images, audio chunks) in the appropriate API surface and routes them to the model.
  • Streaming, tool calls, and reasoning continue to work alongside multimodal input.

If your harness sends a multimodal request to a unimodal model, the Bridge returns a clear error rather than silently dropping the non-text content.

How non-LLM multimodal models work through Syntax

Models with non-LLM modalities — image generators, OCR, segmenters, TTS, audio generators, mesh recovery, UI grounding, time-series forecasting — surface as tools on the main agent rather than chat-completion targets.

Concretely: when an image generator is deployed, the main agent sees a generate_image tool. When the user's request needs an image, the agent calls the tool, the tool runs the model, and the result is folded back into the conversation. The same pattern applies to every non-LLM modality.

Capability scoring in the Party Builder

The Party Builder uses modality coverage as part of its capability scoring. When you compose a party, you can see at a glance:

  • Which input modalities your party can handle.
  • Which output modalities your party can produce.
  • Where there are gaps — for example, "no image generation in this party" or "no audio transcription".

Picking a specialist that closes a gap is a single click.

Where to go next