Serving
Run mach-serve on the production fast path or generic OpenAI backend, confirm startup signals, and use HTTP endpoints.
mach-serve is the primary entry point for OpenAI-compatible HTTP serving on Apple Silicon.
Production fast path
mach-serve /path/to/<your-engine-checkpoint> --streaming --port 8080Pick exactly one residency mode:
| Flag | Residency |
|---|---|
--streaming | Transient prefill + resident decode (experts streamed from SSD sidecar) |
--no-streaming | Full resident stacked (compatible checkpoint that fits in RAM) |
Production stack (implicit):
- DFlash v7 +
opencode-sampled-v1adaptive blocks - Native expert sidecar I/O + direct pread
- Gather dispatch + vLLM-parity tool handling
- Prefix cache (L2 on disk)
Compatibility aliases (not primary docs): --backend production, --backend dflash, scripts/serve_production.py.
Generic OpenAI backend
Architecture-agnostic serving without DFlash production optimizations:
mach-serve mlx-community/Qwen3.6-35B-A3B-MLX-4bit --backend openai --port 8000Use for quick mlx-lm compatibility checks, not production benches on engine-format sidecar artifacts.
Startup signals
Confirm the fast path in stdout logs:
startup mode profile=production production_streaming_mode=streaming ... transient_prefill=1 ... expert_residency=streaming ...
production optimizations direct_pread=1 ... native_prefill_fused=1 ... async_prefetch=0 ... resident_metal=0
startup direct-pread expected=1 sidecar_present=1 sidecar_valid=1 ... native_extension=ready ... fallback_policy=fail-fast
startup decode_path=specdec-v7 adaptive_block_policy=opencode-sampled-v1| Signal | Expected (production streaming) |
|---|---|
production_streaming_mode | streaming |
transient_prefill | 1 |
expert_residency | streaming |
direct_pread | 1 |
native_extension | ready |
decode_path | specdec-v7 |
adaptive_block_policy | opencode-sampled-v1 |
Red flags: sidecar_valid=0, native_extension=missing, fallback_policy=diagnostic-python. See Troubleshooting.
OpenAI client configuration
Base URL: http://127.0.0.1:PORT/v1
API key: any non-empty string (server does not validate keys locally).
Example with curl smoke script:
./examples/curl_smoke.sh http://127.0.0.1:8080OpenCode example:
export XDG_CONFIG_HOME=$(pwd)/.opencode-config
opencode run "Write a binary search in Python"HTTP endpoints
| Method | Path | Purpose |
|---|---|---|
GET | /v1/models | Model list (readiness probe for Maniac Desktop) |
GET | /v1/capabilities | Loaded model's architecture + active serving path (both backends) |
POST | /v1/chat/completions | Chat completions — streaming SSE, tools, JSON schema |
GET | /v1/stats | Decode stats, adaptive block policy, DFlash counters |
GET | /v1/cache/stats | Prefix cache + streaming_summary expert bank rollup |
GET | /v1/cache/warm_prefix | Warm-prefix status |
POST | /v1/cache/warm_prefix | Prefill and cache stable system/tool prefix |
Capabilities
GET /v1/capabilities reports, for the currently loaded model, which architecture it is and what serving path is active. It is served by both the production fast path (--streaming / --no-streaming) and the generic --backend openai path.
curl -s http://127.0.0.1:8080/v1/capabilities | jq .{
"model": "your-engine-checkpoint",
"backend": "production",
"supported": true,
"arch": {
"model_type": "qwen3_5_moe",
"swiglu_variant": "plain_swiglu",
"stacked_supported": true,
"has_expert_bias": false,
"text_config_wrapped": true,
"model_module": "mach.models.qwen3_5_moe",
"default_bits": 4,
"default_group_size": 64,
"kv_cache_kind": "hybrid_gdn",
"specdec_kind": "specdec_v7",
"continuous_batching": "ok"
},
"serving": {
"expert_residency": "streaming",
"serving_mode": "streaming",
"decode_path": "specdec-v7",
"effective_decode_path": "specdec_v7",
"effective_decode_path_reason": "specdec_v7 draft is usable",
"continuous_batching": false,
"turboquant_kv": false,
"decode_policy": { "mode": "greedy-specdec", "engine_mode": "native_v7" },
"speculation": {
"supported_presets": ["none", "summarization", "classification"],
"additive": true,
"exactness_preserving": true
},
"rho_gate": null
},
"grammar": { "xgrammar": true, "lm_format_enforcer": false }
}| Field | Meaning |
|---|---|
model | Loaded model id |
backend | production or openai |
supported | Whether the loaded model_type is in the registry (arch is null when not) |
arch | Static architecture facts — the same arch_summary that mach-archs --json prints |
serving | Live config of this process — expert residency, serving mode, decode path, the arch-gated effective_decode_path (+ reason), continuous batching, the decode policy, and the speculation capability block |
grammar | Which constrained-decoding backends are importable (xgrammar, lm_format_enforcer) |
decode_path vs effective_decode_path
decode_path is the configured path label; effective_decode_path is what the arch gate (_resolve_effective_decode_path) actually resolved at load, one of specdec_v7, target_only, v4_mtp, or v4_eagle3, with a one-line effective_decode_path_reason. The gate reuses the same specdec predicate as mach check, so a stray --draft-dir on a non-Qwen architecture (or a missing/unusable draft) cleanly falls back to target_only instead of mis-constructing the DFlash stack. See Target-only mode.
speculation
The speculation block advertises the draft-free request presets the loaded engine accepts: supported_presets (["none", "summarization", "classification"] on the DFlash and target-only paths; ["none"] on engines without a speculation seam, e.g. DeepSeek-V4 MTP/EAGLE), plus additive: true and exactness_preserving: true.
The generic openai backend returns backend: "openai" with a backend-specific serving block (serving_mode: "openai-mlx-stream-generate", a decode_path of spec-dec or target-only, and a disk_kv flag).
This endpoint reports only what the running server is actually doing. The full path × architecture support matrix (which paths each arch could run) is intentionally out of scope here — that is the job of the static, pre-load mach check preflight, which predicts a checkpoint's runnable paths, required artifacts, and memory fit without loading weights.
Chat completions
Supports:
stream: true— SSE token deltas- Tool calling — OpenAI/vLLM-compatible parsing
- JSON schema response format
- Optional
--continuous-batching— see Continuous batching
Headers of interest when batching is enabled: X-Decode-Mode, X-Active-Loops.
Runtime observability
curl -s http://127.0.0.1:8080/v1/stats | jq .
curl -s http://127.0.0.1:8080/v1/cache/stats | jq .Confirm adaptive_block_policy.name=opencode-sampled-v1, alpha_window=4, alpha_threshold=0.40, low_block=4, default_block=16.
Memory tuning examples
Default streaming (sidecar experts streamed, bounded decode bank):
mach-serve /path/to/<your-engine-checkpoint> --streaming --port 8080Larger resident decode bank when you have spare RAM:
mach-serve /path/to/<your-engine-checkpoint> --streaming --expert-cache-gb 8 --port 8080Resident stacked (compatible checkpoint that fits in RAM):
mach-serve /path/to/stacked-mlx-checkpoint --no-streaming --port 8080To size these knobs for a specific machine, run mach check <checkpoint> --expert-cache-gb N, which reports a fits / tight / wont_fit verdict without loading weights.
Related pages
- Installation — native extensions
- Expert residency —
--streamingvs--no-streaming - Speculative decoding — DFlash policies
- Caching — prefix warm and TurboQuant
- Maniac integration — desktop spawn and readiness probe