Maniac Docs

Library API

Programmatic load(), load_moe(), REGISTRY detection, and BlockSpecDecEngine for embedding speculative decoding.

Use the Python package directly when building benchmarks, custom servers, or embedding decode inside another process. HTTP serving remains the common path via Serving.

Public exports

mach.__init__ lazy-exports the stable surface. Typical imports:

from mach import load, load_moe, detect_arch, REGISTRY

Install with [dflash] when using BlockSpecDecEngine. See Installation.

load()

Generic checkpoint loader (mlx-lm compatible path):

from mach import load

model, tokenizer = load("mlx-community/Qwen3.6-35B-A3B-MLX-4bit")

Use for non-MoE or quick mlx-lm workflows. MoE production features require load_moe.

load_moe()

Primary MoE entry point — runs architecture detection, expert block swap, and store/cache wiring:

from mach import load_moe

model, tokenizer, store, cache = load_moe("/path/to/checkpoint")

Optional explicit architecture:

model, tokenizer, store, cache = load_moe(
    "/path/to/checkpoint",
    arch="qwen3_5_moe",
)

Returns:

ObjectRole
modelMLX model with GatherSwitchGLU / DiskSwitchGLU swapped in
tokenizerHugging Face-compatible tokenizer
storeExpert store (BankCache, stacked resident, etc.)
cacheKV / prefix cache adapter

Residency mode is selected by server flags or env (--streaming, expert_residency=...). See Expert residency and Architecture.

load_moe prints a guard when streaming is pointed at a stacked layout — it will use lazy per-expert slices, not resident stacked banks.

load_v4_flash()

DeepSeek-V4-specific loader with hybrid routing and MTP/EAGLE draft hooks:

from mach import load_v4_flash

model, tokenizer, store, cache = load_v4_flash("/path/to/deepseek_v4_engine")

See Speculative decoding for rho-gate env knobs.

REGISTRY and detect_arch()

from mach import detect_arch, REGISTRY

arch = detect_arch("/path/to/checkpoint/config.json")
adapter = REGISTRY[arch]

REGISTRY maps architecture keys to MoeArchAdapter implementations. detect_arch() reads config.json and returns the matching key (qwen3_5_moe, qwen3_moe, gpt_oss, gemma4, deepseek_v4, …).

For a JSON-serializable summary of the supported set, use arch_registry.registry_summary() (every architecture) or arch_summary(adapter) (one architecture). These back the mach-archs CLI and the GET /v1/capabilities endpoint — one source of truth — so tools never hard-code the architecture list:

from mach.io.arch_registry import registry_summary

for arch in registry_summary():
    print(arch["model_type"], arch["kv_cache_kind"], arch["specdec_kind"])

Supported families are listed in Overview.

BlockSpecDecEngine

Embed DFlash v7 speculative decoding without mach-serve:

from mach import (
    BlockSpecDecEngine,
    BlockSpecDecEngineConfig,
    BlockSpecDecEngineMode,
)

config = BlockSpecDecEngineConfig(
    mode=BlockSpecDecEngineMode.NATIVE_V7,
    # adaptive block policy knobs, draft paths, etc.
)

engine = BlockSpecDecEngine.from_checkpoint(
    target_checkpoint="/path/to/target",
    draft_checkpoint="/path/to/dflash_draft",
    config=config,
)

Requires pip install -e ".[dflash]" (dflash-mlx).

BlockSpecDecEngineMode.NATIVE_V7 matches production decode_path=specdec-v7. Configure adaptive policies to mirror opencode-sampled-v1 or ablation modes — see Speculative decoding.

Lower-level modules

Advanced integration (not re-exported as stable API):

Module areaContents
io/load.pyLoad pipeline, _swap_switch_mlp
io/gather_switch_glu.pyGather dispatch
io/expert_bank.pyBank slots, eviction
engine/disk_kv_cache.pyL2 prefix cache
server/production_app.pyFastAPI app used by mach-serve

Prefer load_moe + BlockSpecDecEngine + mach-serve unless you are extending the engine itself.

On this page