Syntax Docs

Local inference

Local inference runs models on the machine Syntax is installed on. It works on everything from a CPU-only laptop to a multi-GPU workstation.

What's supported

Hardware	Engine class	Notes
NVIDIA GPU (Linux)	GPU-serving engine tuned for the architecture.	Best supported — most open-weight LLMs and multimodal models work.
NVIDIA GPU (Windows)	GPU-serving engine.	Same coverage as Linux; a modern driver is required.
Apple Silicon	Native Apple Metal stack.	Excellent for M-series Macs; no container or driver overhead.
AMD ROCm	GPU-serving engine for compatible cards.	Supported for current cards; check the catalog for per-model status.
CPU only	Lightweight CPU serving engine.	Smaller models only. Eligible larger models can also fall back here when GPU VRAM is exhausted by co-tenants.
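
To make the table concrete, here is a minimal sketch of how a host could be mapped to one of the rows above. It is not Syntax's detection logic: the function names, the dictionary keys, and the idea of probing for the `nvidia-smi` and `rocminfo` command-line tools are all assumptions made for illustration. Only the engine-class labels come from the table.

```python
import platform
import shutil

# Engine-class labels copied from the table; the keys are hypothetical.
ENGINE_CLASSES = {
    "nvidia": "GPU-serving engine",
    "apple": "Native Apple Metal stack",
    "rocm": "GPU-serving engine for compatible cards",
    "cpu": "Lightweight CPU serving engine",
}


def classify_hardware(system: str, machine: str,
                      has_nvidia: bool, has_rocm: bool) -> str:
    """Map basic host facts to one of the hardware rows (assumed rules)."""
    if system == "Darwin" and machine == "arm64":
        return "apple"
    if has_nvidia and system in ("Linux", "Windows"):
        return "nvidia"
    if has_rocm:
        return "rocm"
    # Nothing GPU-like was found, so fall through to the CPU-only row.
    return "cpu"


def detect_local_hardware() -> str:
    """Probe the current machine using only the standard library."""
    return classify_hardware(
        system=platform.system(),
        machine=platform.machine(),
        # Assumption: a vendor tool on PATH implies a usable GPU stack.
        has_nvidia=shutil.which("nvidia-smi") is not None,
        has_rocm=shutil.which("rocminfo") is not None,
    )


if __name__ == "__main__":
    kind = detect_local_hardware()
    print(f"{kind}: {ENGINE_CLASSES[kind]}")
```

Keeping the classification separate from the probing makes the rules easy to check without access to each kind of machine.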

Picking what to run locally

The desktop app's Catalog page shows recommended models for your detected hardware tier. Cards expose:

Download Locally — pull weights to your machine.
A clear indicator if the model won't fit on your hardware so you can pick a smaller variant.
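
As a rough guide to what "won't fit" means, the sketch below estimates weight memory from parameter count and precision. This is a back-of-the-envelope calculation, not the check the Catalog performs: the function names are hypothetical, and the 1.2 overhead factor (headroom for the KV cache and activations) is an assumed placeholder rather than a measured value.

```python
def estimate_weight_gib(params_billion: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def fits(params_billion: float, bits_per_weight: int, vram_gib: float,
         overhead_factor: float = 1.2) -> bool:
    """Return True if weights plus assumed headroom fit in the given VRAM."""
    needed = estimate_weight_gib(params_billion, bits_per_weight) * overhead_factor
    return needed <= vram_gib


if __name__ == "__main__":
    # A 7B-parameter model at 16-bit versus 4-bit precision on a 12 GiB card.
    for bits in (16, 4):
        size = estimate_weight_gib(7, bits)
        print(f"{bits}-bit: {size:.2f} GiB weights, fits 12 GiB: {fits(7, bits, 12)}")
```

The same arithmetic shows why a quantized variant is often the first thing to try: cutting 16-bit weights to 4-bit reduces weight memory to a quarter.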

Once a model is downloaded, it's available for deployment from the Deployments page.

Deploying a single model locally

Open Deployments → New Deployment.
Pick a category (Chat, General, Coding, Media, Vision), or pick Custom to compose your own.
Choose Local as the target.
Pick a deployment Mode (Latency or Throughput).
Submit.
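
The choices made in the steps above can be summarized as a small record. This page describes a UI flow, not an API, so the sketch below is purely illustrative: `DeploymentChoices` and its fields are hypothetical names, and only the allowed values (categories, the Local target, and the two modes) come from the steps.

```python
from dataclasses import dataclass

# Allowed values taken from the steps above; the structure is an assumption.
CATEGORIES = ("Chat", "General", "Coding", "Media", "Vision", "Custom")
MODES = ("Latency", "Throughput")


@dataclass(frozen=True)
class DeploymentChoices:
    """Hypothetical record of what the New Deployment form collects."""
    category: str
    target: str
    mode: str

    def __post_init__(self) -> None:
        # Reject values the form would not offer.
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")
        if self.target != "Local":
            raise ValueError("this sketch only covers the Local target")
        if self.mode not in MODES:
            raise ValueError(f"unknown mode: {self.mode!r}")


if __name__ == "__main__":
    choices = DeploymentChoices(category="Coding", target="Local", mode="Latency")
    print(choices)
```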

Syntax's autotuner picks the right engine and parameters for your hardware automatically. The deployment shows up on the Active Deployments page once it's serving.

Deploying a party locally

Multi-model parties deploy through the same flow. The Party Builder generates a plan that fits the whole party on your local hardware, relieving VRAM pressure by role tier when needed (see Inference → Overview).
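
One way to picture tier-based placement is a greedy pass that gives GPU memory to the most important roles first and moves whatever does not fit to the CPU. This is a toy sketch under stated assumptions, not the Party Builder's algorithm: the `Member` type, the numeric tiers, the per-member VRAM figures, and the choice of CPU fallback as the only relief mechanism are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Member:
    name: str
    tier: int        # hypothetical: 0 is the most important role
    vram_gib: float  # assumed estimate of GPU memory the member needs


def plan_party(members: list[Member], gpu_vram_gib: float) -> dict[str, str]:
    """Greedily place members on the GPU by tier; the rest fall back to CPU."""
    placement: dict[str, str] = {}
    free = gpu_vram_gib
    # Higher-priority tiers claim VRAM first; ties go to the smaller model.
    for m in sorted(members, key=lambda m: (m.tier, m.vram_gib)):
        if m.vram_gib <= free:
            placement[m.name] = "gpu"
            free -= m.vram_gib
        else:
            placement[m.name] = "cpu"
    return placement


if __name__ == "__main__":
    party = [
        Member("planner", tier=0, vram_gib=10),
        Member("coder", tier=1, vram_gib=8),
        Member("summarizer", tier=2, vram_gib=4),
    ]
    print(plan_party(party, gpu_vram_gib=16))
```

A real planner would also have other options available, such as choosing a smaller or quantized variant for a lower-tier role; those are left out to keep the sketch short.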

When local isn't enough

VRAM-bound (the model is larger than your GPU can hold) → consider a smaller variant, a quantized version, or routing to a hosted provider for that model.
Throughput-bound (the load is heavy and sustained) → consider remote self-hosted or managed remote.
Cold-start sensitive (you need a model only rarely) → routing to a hosted provider is often the right answer.
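
The three cases above amount to a lookup from bottleneck to suggested next step. The sketch below writes that down as a small table. The `Bottleneck` enum and `suggest` function are hypothetical names, and the suggestion strings restate the list above without adding thresholds or measurements.

```python
from enum import Enum


class Bottleneck(Enum):
    VRAM = "vram"
    THROUGHPUT = "throughput"
    COLD_START = "cold_start"


# Suggestions restated from the list above; nothing here is measured.
SUGGESTIONS = {
    Bottleneck.VRAM: [
        "smaller variant",
        "quantized version",
        "hosted provider for that model",
    ],
    Bottleneck.THROUGHPUT: ["remote self-hosted", "managed remote"],
    Bottleneck.COLD_START: ["hosted provider"],
}


def suggest(bottleneck: Bottleneck) -> list[str]:
    """Return the options this page lists for the given bottleneck."""
    return list(SUGGESTIONS[bottleneck])


if __name__ == "__main__":
    for b in Bottleneck:
        print(f"{b.value}: {', '.join(suggest(b))}")
```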

Where to go next

Hardware support — full hardware matrix.
Multi-engine inference — why Syntax picks the engine it does.
Concepts → Party Builder — deploying multiple models locally as one party.
