Syntax

Multi-engine inference

Hardware-aware engine selection across a large compatibility matrix — Syntax owns the optimization work so you don't.

Choosing how to run a model is a real engineering problem. The "right" serving stack for a given workload depends on the model architecture, the hardware family and SKU, the attention backend, the quantization format, the tool-call and reasoning parsers each engine ships, how each engine handles KV cache offload, and the way the model needs to be sharded across one or more hosts. Syntax owns this entire decision so the surface you build against stays a single, stable endpoint.

The matrix Syntax is solving for you

When you deploy a model, the autotuner is searching across — at minimum — the cross product of:

  • Model architecture and modality. Dense and Mixture-of-Experts LLMs, vision-language models, diffusion image and video generators, audio models, embedding models, rerankers, OCR, segmentation, time-series forecasters, UI-grounding models, 3D mesh-recovery models. Each has different serving constraints.
  • Hardware. Dozens of GPU SKUs across NVIDIA, AMD ROCm, and Apple Silicon; CPU-only fallback; single-host versus multi-host topologies; and the corresponding cloud instance types when running on managed remote.
  • Serving engines. Multiple engines per model family — vLLM, SGLang, TensorRT-LLM, llama.cpp, MLX, diffusion-native servers, and others — each with its own performance profile and its own feature support per model.
  • Engine-internal configuration. Attention backends (FlashAttention, PagedAttention, architecture-specific custom kernels), KV cache layout and hierarchical offload to host RAM, speculative decoding, prefix caching, quantization (W4A16, W8A8, FP8, GPTQ, AWQ, GGUF), tensor and pipeline parallelism, batch-scheduling policies.

That's not a configuration; it's a search space. Picking the wrong cell costs you tokens-per-second, time-to-first-token, output correctness, or money — sometimes all four.

"Supported" isn't the same as "best supported"

A given model is frequently supported by more than one engine, but the quality of that support is rarely identical. Some of the distinctions the autotuner makes:

  • A model runs on two engines, but only one ships the official tool-call parser. Tool calls degrade on the other. Syntax routes to the engine with first-class parser support.
  • A model exposes a reasoning channel on both engines, but only one surfaces it cleanly through the OpenAI- and Anthropic-compatible Bridge. Syntax picks the engine that preserves the reasoning round-trip.
  • A long-context workload fits in VRAM on one engine but requires hierarchical KV cache offload to host RAM on the other. If the deployment is latency-sensitive, the in-VRAM engine wins; if it's throughput- and context-heavy, the offload-capable engine wins.
  • A quantized variant of a model is fast and produces faithful outputs on one engine but is numerically unstable on another at the same precision. Syntax avoids the unstable combination.

This is the kind of nuance that's otherwise buried in engine release notes, GitHub issues, and benchmarks you'd have to run yourself.

What you actually decide

The user-facing input is two values, not the matrix above:

  • A deployment tier. Either Performance — low time-to-first-token and high tokens-per-user-per-second, willing to pay for the right hardware and serving topology — or Cost-optimized — aggressively minimize spend while meeting your acceptable floors for TTFT and per-user throughput.
  • A target. Local, self-managed remote, managed remote on dUX, or a hosted-provider passthrough.

Everything underneath — engine selection, attention backend, quantization, parallelism, KV offload, and instance-type selection on managed remote — is the autotuner's job.

Party-level planning

A multi-model party (Main Agent, Default Sub-Agent, up to six Specialists) is a packing and isolation problem on top of single-model optimization. The autotuner plans across the whole party:

  • What packs together. Models with complementary memory profiles and compatible engines that can share a host without contention get co-tenanted to reduce cost.
  • What stays separate. Models that would harm each other's latency under load — for example, a latency-sensitive Main Agent next to a throughput-heavy diffusion specialist — get split across instances.
  • Role-aware degradation. Under VRAM pressure, specialists yield first, the sub-agent second, the Main Agent only as a last resort. Eligible smaller models can fall back to a CPU engine automatically.
  • Tier propagation. Performance versus Cost-optimized applies to the party as a whole and shapes both the packing decisions and the instance-type recommendations on managed remote.

Scales from zero to whatever sustained traffic demands

Every plan the autotuner produces is autoscalable end-to-end. Under no traffic, a deployment can sit at zero replicas; under sustained load it scales out across replicas of the same plan, fronted by the Bridge so the harness sees a single endpoint either way; when load falls off, replicas wind down. You don't pick a horizontal-pod- autoscaler policy, you don't model cold-start curves, and you don't maintain a separate scaling configuration per model — the plan already encodes how to scale itself.

What stays the same

From the harness's point of view, none of this is visible. You get the same OpenAI- or Anthropic-compatible API surface. The model appears in the harness's model list. Streaming, tool calls, and reasoning content flow through unchanged. Swapping engines, scaling out, or re-packing a party doesn't require any change in the harness.

Where to start