Library API

Programmatic load(), load_moe(), REGISTRY detection, and BlockSpecDecEngine for embedding speculative decoding.

Use the Python package directly when building benchmarks, custom servers, or embedding decode inside another process. HTTP serving remains the common path via Serving.

Public exports

mach.__init__ lazy-exports the stable surface. Typical imports:

from mach import load, load_moe, detect_arch, REGISTRY

Install with [dflash] when using BlockSpecDecEngine. See Installation.

load()

Generic checkpoint loader (mlx-lm compatible path):

from mach import load

model, tokenizer = load("mlx-community/Qwen3.6-35B-A3B-MLX-4bit")

Use for non-MoE or quick mlx-lm workflows. MoE production features require load_moe.

load_moe()

Primary MoE entry point — runs architecture detection, expert block swap, and store/cache wiring:

from mach import load_moe

model, tokenizer, store, cache = load_moe("/path/to/checkpoint")

Optional explicit architecture:

model, tokenizer, store, cache = load_moe(
    "/path/to/checkpoint",
    arch="qwen3_5_moe",
)

Returns:

Object	Role
`model`	MLX model with `GatherSwitchGLU` / `DiskSwitchGLU` swapped in
`tokenizer`	Hugging Face-compatible tokenizer
`store`	Expert store (`BankCache`, stacked resident, etc.)
`cache`	KV / prefix cache adapter

Residency mode is selected by server flags or env (--streaming, expert_residency=...). See Expert residency and Architecture.

load_moe prints a guard when streaming is pointed at a stacked layout — it will use lazy per-expert slices, not resident stacked banks.

load_v4_flash()

DeepSeek-V4-specific loader with hybrid routing and MTP/EAGLE draft hooks:

from mach import load_v4_flash

model, tokenizer, store, cache = load_v4_flash("/path/to/deepseek_v4_engine")

See Speculative decoding for rho-gate env knobs.

REGISTRY and detect_arch()

from mach import detect_arch, REGISTRY

arch = detect_arch("/path/to/checkpoint/config.json")
adapter = REGISTRY[arch]

REGISTRY maps architecture keys to MoeArchAdapter implementations. detect_arch() reads config.json and returns the matching key (qwen3_5_moe, qwen3_moe, gpt_oss, gemma4, deepseek_v4, …).

For a JSON-serializable summary of the supported set, use arch_registry.registry_summary() (every architecture) or arch_summary(adapter) (one architecture). These back the mach-archs CLI and the GET /v1/capabilities endpoint — one source of truth — so tools never hard-code the architecture list:

from mach.io.arch_registry import registry_summary

for arch in registry_summary():
    print(arch["model_type"], arch["kv_cache_kind"], arch["specdec_kind"])

Supported families are listed in Overview.

BlockSpecDecEngine

Embed DFlash v7 speculative decoding without mach-serve:

from mach import (
    BlockSpecDecEngine,
    BlockSpecDecEngineConfig,
    BlockSpecDecEngineMode,
)

config = BlockSpecDecEngineConfig(
    mode=BlockSpecDecEngineMode.NATIVE_V7,
    # adaptive block policy knobs, draft paths, etc.
)

engine = BlockSpecDecEngine.from_checkpoint(
    target_checkpoint="/path/to/target",
    draft_checkpoint="/path/to/dflash_draft",
    config=config,
)

Requires pip install -e ".[dflash]" (dflash-mlx).

BlockSpecDecEngineMode.NATIVE_V7 matches production decode_path=specdec-v7. Configure adaptive policies to mirror opencode-sampled-v1 or ablation modes — see Speculative decoding.

Lower-level modules

Advanced integration (not re-exported as stable API):

Module area	Contents
`io/load.py`	Load pipeline, `_swap_switch_mlp`
`io/gather_switch_glu.py`	Gather dispatch
`io/expert_bank.py`	Bank slots, eviction
`engine/disk_kv_cache.py`	L2 prefix cache
`server/production_app.py`	FastAPI app used by `mach-serve`

Prefer load_moe + BlockSpecDecEngine + mach-serve unless you are extending the engine itself.

Architecture — load pipeline internals
CLI — artifact paths for from_checkpoint
Maniac integration — desktop does not embed Python; it spawns mach-serve

Library API

On this page