Local inference
Running models on your own machine — GPU, Apple Silicon, or CPU.
Local inference runs models on the machine Syntax is installed on. It works on everything from a CPU-only laptop to a multi-GPU workstation.
What's supported
| Hardware | Engine class | Notes |
|---|---|---|
| NVIDIA GPU (Linux) | GPU-serving engine tuned for the architecture. | Best supported — most open-weight LLMs and multimodal models work. |
| NVIDIA GPU (Windows) | GPU-serving engine. | Same coverage as Linux; a modern driver is required. |
| Apple Silicon | Native Apple Metal stack. | Excellent for M-series Macs; no container or driver overhead. |
| AMD ROCm | GPU-serving engine for compatible cards. | Supported for current cards; check the catalog for per-model status. |
| CPU only | Lightweight CPU serving engine. | Smaller models only. Eligible larger models can also fall back here when other models sharing the GPU have exhausted its VRAM. |
Picking what to run locally
The desktop app's Catalog page shows recommended models for your detected hardware tier. Cards expose:
- Download Locally — pull weights to your machine.
- A clear indicator if the model won't fit on your hardware so you can pick a smaller variant.
Once a model is downloaded, it's available for deployment from the Deployments page.
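How the fit indicator is computed is not documented here. As a rough sketch only, a common rule of thumb estimates weight memory as parameter count times bytes per parameter, then adds headroom for the KV cache and activations. The function names, the precision table, and the 20% overhead figure below are assumptions for illustration, not Syntax's implementation.

```python
# Rule-of-thumb VRAM estimate. All names and constants here are
# illustrative assumptions, not part of Syntax.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def estimate_weight_gib(params_billion: float, precision: str) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * BYTES_PER_PARAM[precision]
    return total_bytes / 2**30


def fits(params_billion: float, precision: str, vram_gib: float,
         overhead_fraction: float = 0.2) -> bool:
    """True if weights plus an assumed overhead margin fit in vram_gib."""
    needed = estimate_weight_gib(params_billion, precision) * (1 + overhead_fraction)
    return needed <= vram_gib


if __name__ == "__main__":
    for precision in ("fp16", "int8", "int4"):
        size = estimate_weight_gib(7, precision)
        print(f"7B @ {precision}: ~{size:.1f} GiB weights, "
              f"fits in 12 GiB: {fits(7, precision, 12)}")
```

Under these assumptions a 7B model at fp16 needs roughly 13 GiB for weights alone, so a quantized variant is often the practical choice on a smaller GPU.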
Deploying a single model locally
- Open Deployments → New Deployment.
- Pick a category (Chat, General, Coding, Media, Vision), or pick Custom to compose your own.
- Choose Local as the target.
- Pick a deployment Mode (Latency or Throughput).
- Submit.
Syntax's autotuner picks the right engine and parameters for your hardware automatically. The deployment shows up on the Active Deployments page once it's serving.
Deploying a party locally
Multi-model parties deploy through the same flow. The Party Builder generates a plan that fits the whole party on your local hardware, relieving VRAM pressure by role tier when needed (see Inference → Overview).
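The Party Builder's planning algorithm is not described on this page. To illustrate the general idea of relieving VRAM pressure by role tier, here is a minimal greedy sketch: higher-priority roles claim GPU memory first, and lower-priority members that are marked CPU-eligible fall back to CPU when the GPU is full. The `Member` type, the integer `tier` field, and `plan_party` are hypothetical names, and the real planner may work quite differently.

```python
from dataclasses import dataclass


@dataclass
class Member:
    name: str
    vram_gib: float             # estimated VRAM the model needs
    tier: int                   # hypothetical: 0 = most important role
    cpu_eligible: bool = False  # hypothetical: may fall back to CPU


def plan_party(members: list[Member], gpu_vram_gib: float) -> dict[str, str]:
    """Greedy placement: most important tiers get GPU memory first."""
    plan: dict[str, str] = {}
    free = gpu_vram_gib
    for m in sorted(members, key=lambda m: m.tier):
        if m.vram_gib <= free:
            plan[m.name] = "gpu"
            free -= m.vram_gib
        elif m.cpu_eligible:
            plan[m.name] = "cpu"       # relieve VRAM pressure
        else:
            plan[m.name] = "unplaced"  # needs a smaller variant or another target
    return plan


if __name__ == "__main__":
    party = [
        Member("chat", vram_gib=14, tier=0),
        Member("embedder", vram_gib=2, tier=2, cpu_eligible=True),
        Member("coder", vram_gib=9, tier=1, cpu_eligible=True),
    ]
    print(plan_party(party, gpu_vram_gib=24))
```

With 24 GiB available, this sketch keeps the two highest-priority members on the GPU and moves the lowest-tier, CPU-eligible member to CPU.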
When local isn't enough
- VRAM-bound by a model larger than your GPU can hold → consider a smaller variant, a quantized version, or routing to a hosted provider for that model.
- Throughput-bound by sustained heavy load → consider remote self-hosted or managed remote.
- Cold-start sensitive for a model you need only rarely → routing to a hosted provider is often the right answer.
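The three cases above amount to a small lookup from bottleneck to options. The sketch below restates them as data, which can be handy in a runbook or a team checklist. The `Bottleneck` enum and `suggest` function are hypothetical names, not part of any Syntax API.

```python
from enum import Enum


class Bottleneck(Enum):
    VRAM = "vram"              # model larger than the GPU can hold
    THROUGHPUT = "throughput"  # sustained heavy load
    COLD_START = "cold_start"  # model needed only rarely


# Options taken from the list above, in the order given there.
SUGGESTIONS = {
    Bottleneck.VRAM: ["smaller variant", "quantized version", "hosted provider"],
    Bottleneck.THROUGHPUT: ["remote self-hosted", "managed remote"],
    Bottleneck.COLD_START: ["hosted provider"],
}


def suggest(bottleneck: Bottleneck) -> list[str]:
    """Return the alternatives to consider for a given bottleneck."""
    return SUGGESTIONS[bottleneck]


if __name__ == "__main__":
    for b in Bottleneck:
        print(f"{b.value}: {', '.join(suggest(b))}")
```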
Where to go next
- Hardware support — full hardware matrix.
- Multi-engine inference — why Syntax picks the engine it does.
- Concepts → Party Builder — deploying multiple models locally as one party.