Syntax

Local inference

Running models on your own machine — GPU, Apple Silicon, or CPU.

Local inference runs models on the machine Syntax is installed on. It works on everything from a CPU-only laptop to a multi-GPU workstation.

What's supported

HardwareEngine classNotes
NVIDIA GPU (Linux)GPU-serving engine tuned for the architecture.Best supported — most open-weight LLMs and multimodal models work.
NVIDIA GPU (Windows)GPU-serving engine.Same coverage as Linux, modern driver required.
Apple SiliconNative Apple Metal stack.Excellent for M-series Macs; no container or driver overhead.
AMD ROCmGPU-serving engine for compatible cards.Supported for current cards; check the catalog for per-model status.
CPU onlyLightweight CPU serving engine.Smaller models only. Eligible larger models can also fall back here when GPU VRAM is exhausted by co-tenants.

Picking what to run locally

The desktop app's Catalog page shows recommended models for your detected hardware tier. Cards expose:

  • Download Locally — pull weights to your machine.
  • A clear indicator if the model won't fit on your hardware so you can pick a smaller variant.

Once a model is downloaded, it's available for deployment from the Deployments page.

Deploying a single model locally

  1. Open Deployments → New Deployment.
  2. Pick a category (Chat, General, Coding, Media, Vision, Custom) or pick Custom to compose your own.
  3. Choose Local as the target.
  4. Pick a deployment Mode (Latency or Throughput).
  5. Submit.

Syntax's autotuner picks the right engine and parameters for your hardware automatically. The deployment shows up on the Active Deployments page once it's serving.

Deploying a party locally

Multi-model parties deploy through the same flow. The Party Builder generates a plan that fits the whole party on your local hardware, relieving VRAM pressure by role tier when needed (see Inference → Overview).

When local isn't enough

  • VRAM-bound by a model larger than your GPU can hold → consider a smaller variant, a quantized version, or routing to a hosted provider for that model.
  • Throughput-bound by sustained heavy load → consider remote self-hosted or managed remote.
  • Cold-start sensitive when you need a model rarely → routing to a hosted provider is often the right answer.

Where to go next