Inference overview
How Syntax serves models — local, remote self-hosted, managed remote on dUX, and hosted providers.
Every model Syntax exposes ends up running somewhere. The four inference targets are:
| Target | Where it runs | Best for |
|---|---|---|
| Local | Your machine — GPU, Apple Silicon, or CPU. | Solo workflows, privacy-first, no network dependency. |
| Remote self-hosted | A box you've provisioned (your server, your GPU, your SSH). | Power users with their own hardware. |
| Managed remote (dUX) | dUX-managed cloud GPU. | Teams that want managed infrastructure. |
| Hosted provider | OpenAI, Anthropic, Google, etc. | Frontier models, predictable cost, no infra. |
All four are reachable through the same Bridge endpoint. Your harness doesn't know — or care — which one is serving any given request.
How Syntax decides what to run
Two layers make the decision:
- Routing. When a request arrives at the Bridge, the active model policy picks which deployment serves it. If a model is deployed in multiple places (e.g., locally and on managed remote), routing picks based on your preferences.
- Engine selection. For local and remote self-hosted serving, Syntax's autotuner picks the most efficient serving engine for the chosen model and your hardware; see Differentiators → Multi-engine inference.
You can override either layer. Aliases let you pin a name to a specific deployment; per-deployment configuration lets you override engine choices when you need to.
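The routing and alias behavior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `Deployment`, `route`, the target strings, and the idea of expressing preferences as an ordered list are all hypothetical, and none of them is Syntax's actual API or policy format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    """Hypothetical record: one model served on one inference target."""
    model: str
    target: str  # e.g. "local", "remote-self-hosted", "managed-remote", "hosted-provider"


def route(requested, deployments, preference, aliases=None):
    """Pick the deployment that serves `requested`.

    An alias, if present, pins the name to one specific deployment and
    bypasses preference-based routing. Otherwise the candidate whose target
    appears earliest in `preference` wins; unlisted targets rank last.
    """
    aliases = aliases or {}
    if requested in aliases:
        return aliases[requested]

    candidates = [d for d in deployments if d.model == requested]
    if not candidates:
        raise LookupError(f"no deployment found for {requested!r}")

    rank = {target: i for i, target in enumerate(preference)}
    return min(candidates, key=lambda d: rank.get(d.target, len(rank)))


# The same model deployed in two places, with local preferred.
deployments = [
    Deployment("example-model", "managed-remote"),
    Deployment("example-model", "local"),
]
preference = ["local", "managed-remote"]
print(route("example-model", deployments, preference).target)  # local

# An alias pins a name to one deployment regardless of preference.
aliases = {"pinned": deployments[0]}
print(route("pinned", deployments, preference, aliases).target)  # managed-remote
```

In this sketch the alias lookup runs first, which is one simple way to model an override that takes precedence over the policy.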
Multi-model deployments
When you deploy a multi-model party — a Main Agent, a Default Sub-Agent, and up to six Specialists — the inference plane plans holistically:
- All models in the party share a single target: local, remote self-hosted, or managed remote. Targets cannot be mixed within a party.
- The autotuner places each model on the available hardware in role order so the Main Agent gets the best resources.
- VRAM pressure is relieved by tier when needed: specialists first, the sub-agent second, the Main Agent only as a last resort. Eligible smaller models can fall back to CPU automatically.
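The tiered relief order above can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the `Member` record, the role strings, the `plan_relief` function, and the simplification that relieving a model frees its entire VRAM footprint. The real autotuner's accounting and relief actions (such as CPU fallback for eligible models) are not modeled.

```python
from dataclasses import dataclass

# Lower number = relieved earlier: specialists, then the sub-agent,
# and the Main Agent only as a last resort.
RELIEF_TIER = {"specialist": 0, "sub_agent": 1, "main_agent": 2}


@dataclass(frozen=True)
class Member:
    """Hypothetical party member with an estimated VRAM footprint."""
    name: str
    role: str       # one of the RELIEF_TIER keys
    vram_gb: float


def plan_relief(party, budget_gb):
    """Return the names of models to relieve, in order, until the rest fit.

    Assumes relieving a model frees all of its VRAM. sorted() is stable,
    so members within the same tier keep their input order.
    """
    needed = sum(m.vram_gb for m in party)
    relieved = []
    for member in sorted(party, key=lambda m: RELIEF_TIER[m.role]):
        if needed <= budget_gb:
            break
        relieved.append(member.name)
        needed -= member.vram_gb
    return relieved


party = [
    Member("main", "main_agent", 40),
    Member("sub", "sub_agent", 16),
    Member("spec-a", "specialist", 8),
    Member("spec-b", "specialist", 8),
]
print(plan_relief(party, budget_gb=60))  # ['spec-a', 'spec-b']
```

With a 60 GB budget, the 72 GB party fits once both specialists are relieved, so the sub-agent and the Main Agent are left untouched.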
Targets in depth
- Local inference — GPU / Apple Silicon / CPU on your own machine.
- Remote self-hosted — your own SSH-reachable hardware.
- Managed remote — dUX-backed cloud GPU.
- Hardware support — what runs on what.
- Multimodal capabilities — image, video, audio, 3D, time-series forecasting.