Serving

Run mach-serve on the production fast path or generic OpenAI backend, confirm startup signals, and use HTTP endpoints.

mach-serve is the primary entry point for OpenAI-compatible HTTP serving on Apple Silicon.

Production fast path

mach-serve /path/to/<your-engine-checkpoint> --streaming --port 8080

Pick exactly one residency mode:

Flag	Residency
`--streaming`	Transient prefill + resident decode (experts streamed from SSD sidecar)
`--no-streaming`	Full resident stacked (compatible checkpoint that fits in RAM)

Production stack (implicit):

DFlash v7 + opencode-sampled-v1 adaptive blocks
Native expert sidecar I/O + direct pread
Gather dispatch + vLLM-parity tool handling
Prefix cache (L2 on disk)

Compatibility aliases (not primary docs): --backend production, --backend dflash, scripts/serve_production.py.

Generic OpenAI backend

Architecture-agnostic serving without DFlash production optimizations:

mach-serve mlx-community/Qwen3.6-35B-A3B-MLX-4bit --backend openai --port 8000

Use for quick mlx-lm compatibility checks, not production benches on engine-format sidecar artifacts.

Startup signals

Confirm the fast path in stdout logs:

startup mode profile=production production_streaming_mode=streaming ... transient_prefill=1 ... expert_residency=streaming ...
production optimizations direct_pread=1 ... native_prefill_fused=1 ... async_prefetch=0 ... resident_metal=0
startup direct-pread expected=1 sidecar_present=1 sidecar_valid=1 ... native_extension=ready ... fallback_policy=fail-fast
startup decode_path=specdec-v7 adaptive_block_policy=opencode-sampled-v1

Signal	Expected (production streaming)
`production_streaming_mode`	`streaming`
`transient_prefill`	`1`
`expert_residency`	`streaming`
`direct_pread`	`1`
`native_extension`	`ready`
`decode_path`	`specdec-v7`
`adaptive_block_policy`	`opencode-sampled-v1`

Red flags: sidecar_valid=0, native_extension=missing, fallback_policy=diagnostic-python. See Troubleshooting.

OpenAI client configuration

Base URL: http://127.0.0.1:PORT/v1

API key: any non-empty string (server does not validate keys locally).

Example with curl smoke script:

./examples/curl_smoke.sh http://127.0.0.1:8080

OpenCode example:

export XDG_CONFIG_HOME=$(pwd)/.opencode-config
opencode run "Write a binary search in Python"

HTTP endpoints

Method	Path	Purpose
`GET`	`/v1/models`	Model list (readiness probe for Maniac Desktop)
`GET`	`/v1/capabilities`	Loaded model's architecture + active serving path (both backends)
`POST`	`/v1/chat/completions`	Chat completions — streaming SSE, tools, JSON schema
`GET`	`/v1/stats`	Decode stats, adaptive block policy, DFlash counters
`GET`	`/v1/cache/stats`	Prefix cache + `streaming_summary` expert bank rollup
`GET`	`/v1/cache/warm_prefix`	Warm-prefix status
`POST`	`/v1/cache/warm_prefix`	Prefill and cache stable system/tool prefix

GET /v1/capabilities reports, for the currently loaded model, which architecture it is and what serving path is active. It is served by both the production fast path (--streaming / --no-streaming) and the generic --backend openai path.

curl -s http://127.0.0.1:8080/v1/capabilities | jq .

{
  "model": "your-engine-checkpoint",
  "backend": "production",
  "supported": true,
  "arch": {
    "model_type": "qwen3_5_moe",
    "swiglu_variant": "plain_swiglu",
    "stacked_supported": true,
    "has_expert_bias": false,
    "text_config_wrapped": true,
    "model_module": "mach.models.qwen3_5_moe",
    "default_bits": 4,
    "default_group_size": 64,
    "kv_cache_kind": "hybrid_gdn",
    "specdec_kind": "specdec_v7",
    "continuous_batching": "ok"
  },
  "serving": {
    "expert_residency": "streaming",
    "serving_mode": "streaming",
    "decode_path": "specdec-v7",
    "effective_decode_path": "specdec_v7",
    "effective_decode_path_reason": "specdec_v7 draft is usable",
    "continuous_batching": false,
    "turboquant_kv": false,
    "decode_policy": { "mode": "greedy-specdec", "engine_mode": "native_v7" },
    "speculation": {
      "supported_presets": ["none", "summarization", "classification"],
      "additive": true,
      "exactness_preserving": true
    },
    "rho_gate": null
  },
  "grammar": { "xgrammar": true, "lm_format_enforcer": false }
}

Field	Meaning
`model`	Loaded model id
`backend`	`production` or `openai`
`supported`	Whether the loaded `model_type` is in the registry (`arch` is `null` when not)
`arch`	Static architecture facts — the same `arch_summary` that `mach-archs --json` prints
`serving`	Live config of this process — expert residency, serving mode, decode path, the arch-gated `effective_decode_path` (+ reason), continuous batching, the decode policy, and the `speculation` capability block
`grammar`	Which constrained-decoding backends are importable (`xgrammar`, `lm_format_enforcer`)

`decode_path` vs `effective_decode_path`

decode_path is the configured path label; effective_decode_path is what the arch gate (_resolve_effective_decode_path) actually resolved at load, one of specdec_v7, target_only, v4_mtp, or v4_eagle3, with a one-line effective_decode_path_reason. The gate reuses the same specdec predicate as mach check, so a stray --draft-dir on a non-Qwen architecture (or a missing/unusable draft) cleanly falls back to target_only instead of mis-constructing the DFlash stack. See Target-only mode.

`speculation`

The speculation block advertises the draft-free request presets the loaded engine accepts: supported_presets (["none", "summarization", "classification"] on the DFlash and target-only paths; ["none"] on engines without a speculation seam, e.g. DeepSeek-V4 MTP/EAGLE), plus additive: true and exactness_preserving: true.

The generic openai backend returns backend: "openai" with a backend-specific serving block (serving_mode: "openai-mlx-stream-generate", a decode_path of spec-dec or target-only, and a disk_kv flag).

This endpoint reports only what the running server is actually doing. The full path × architecture support matrix (which paths each arch could run) is intentionally out of scope here — that is the job of the static, pre-load mach check preflight, which predicts a checkpoint's runnable paths, required artifacts, and memory fit without loading weights.

Chat completions

Supports:

stream: true — SSE token deltas
Tool calling — OpenAI/vLLM-compatible parsing
JSON schema response format
Optional --continuous-batching — see Continuous batching

Headers of interest when batching is enabled: X-Decode-Mode, X-Active-Loops.

Runtime observability

curl -s http://127.0.0.1:8080/v1/stats | jq .
curl -s http://127.0.0.1:8080/v1/cache/stats | jq .

Confirm adaptive_block_policy.name=opencode-sampled-v1, alpha_window=4, alpha_threshold=0.40, low_block=4, default_block=16.

Memory tuning examples

Default streaming (sidecar experts streamed, bounded decode bank):

mach-serve /path/to/<your-engine-checkpoint> --streaming --port 8080

Larger resident decode bank when you have spare RAM:

mach-serve /path/to/<your-engine-checkpoint> --streaming --expert-cache-gb 8 --port 8080

Resident stacked (compatible checkpoint that fits in RAM):

mach-serve /path/to/stacked-mlx-checkpoint --no-streaming --port 8080

To size these knobs for a specific machine, run mach check <checkpoint> --expert-cache-gb N, which reports a fits / tight / wont_fit verdict without loading weights.

Installation — native extensions
Expert residency — --streaming vs --no-streaming
Speculative decoding — DFlash policies
Caching — prefix warm and TurboQuant
Maniac integration — desktop spawn and readiness probe