Library API
Programmatic load(), load_moe(), REGISTRY detection, and BlockSpecDecEngine for embedding speculative decoding.
Use the Python package directly when building benchmarks, custom servers, or embedding decode inside another process. HTTP serving remains the common path via Serving.
Public exports
mach.__init__ lazy-exports the stable surface. Typical imports:
from mach import load, load_moe, detect_arch, REGISTRYInstall with [dflash] when using BlockSpecDecEngine. See Installation.
load()
Generic checkpoint loader (mlx-lm compatible path):
from mach import load
model, tokenizer = load("mlx-community/Qwen3.6-35B-A3B-MLX-4bit")Use for non-MoE or quick mlx-lm workflows. MoE production features require load_moe.
load_moe()
Primary MoE entry point — runs architecture detection, expert block swap, and store/cache wiring:
from mach import load_moe
model, tokenizer, store, cache = load_moe("/path/to/checkpoint")Optional explicit architecture:
model, tokenizer, store, cache = load_moe(
"/path/to/checkpoint",
arch="qwen3_5_moe",
)Returns:
| Object | Role |
|---|---|
model | MLX model with GatherSwitchGLU / DiskSwitchGLU swapped in |
tokenizer | Hugging Face-compatible tokenizer |
store | Expert store (BankCache, stacked resident, etc.) |
cache | KV / prefix cache adapter |
Residency mode is selected by server flags or env (--streaming, expert_residency=...). See Expert residency and Architecture.
load_moe prints a guard when streaming is pointed at a stacked layout — it will use lazy per-expert slices, not resident stacked banks.
load_v4_flash()
DeepSeek-V4-specific loader with hybrid routing and MTP/EAGLE draft hooks:
from mach import load_v4_flash
model, tokenizer, store, cache = load_v4_flash("/path/to/deepseek_v4_engine")See Speculative decoding for rho-gate env knobs.
REGISTRY and detect_arch()
from mach import detect_arch, REGISTRY
arch = detect_arch("/path/to/checkpoint/config.json")
adapter = REGISTRY[arch]REGISTRY maps architecture keys to MoeArchAdapter implementations. detect_arch() reads config.json and returns the matching key (qwen3_5_moe, qwen3_moe, gpt_oss, gemma4, deepseek_v4, …).
For a JSON-serializable summary of the supported set, use arch_registry.registry_summary() (every architecture) or arch_summary(adapter) (one architecture). These back the mach-archs CLI and the GET /v1/capabilities endpoint — one source of truth — so tools never hard-code the architecture list:
from mach.io.arch_registry import registry_summary
for arch in registry_summary():
print(arch["model_type"], arch["kv_cache_kind"], arch["specdec_kind"])Supported families are listed in Overview.
BlockSpecDecEngine
Embed DFlash v7 speculative decoding without mach-serve:
from mach import (
BlockSpecDecEngine,
BlockSpecDecEngineConfig,
BlockSpecDecEngineMode,
)
config = BlockSpecDecEngineConfig(
mode=BlockSpecDecEngineMode.NATIVE_V7,
# adaptive block policy knobs, draft paths, etc.
)
engine = BlockSpecDecEngine.from_checkpoint(
target_checkpoint="/path/to/target",
draft_checkpoint="/path/to/dflash_draft",
config=config,
)Requires pip install -e ".[dflash]" (dflash-mlx).
BlockSpecDecEngineMode.NATIVE_V7 matches production decode_path=specdec-v7. Configure adaptive policies to mirror opencode-sampled-v1 or ablation modes — see Speculative decoding.
Lower-level modules
Advanced integration (not re-exported as stable API):
| Module area | Contents |
|---|---|
io/load.py | Load pipeline, _swap_switch_mlp |
io/gather_switch_glu.py | Gather dispatch |
io/expert_bank.py | Bank slots, eviction |
engine/disk_kv_cache.py | L2 prefix cache |
server/production_app.py | FastAPI app used by mach-serve |
Prefer load_moe + BlockSpecDecEngine + mach-serve unless you are extending the engine itself.
Related pages
- Architecture — load pipeline internals
- CLI — artifact paths for
from_checkpoint - Maniac integration — desktop does not embed Python; it spawns
mach-serve