Installation
Install mach with production extras, build native extensions, and run the test suite.
Install mach in editable mode with the extras that match your use case. Production serving on Apple Silicon expects dev, dflash, and native together.
pip install extras
From the engine root (vendored at vendor/local_moe_engine in Maniac Desktop, or a standalone checkout):
pip install -e ".[dev,dflash,native]"

| Extra | Purpose |
|---|---|
| dev | Test dependencies (pytest, etc.) |
| dflash | Pulls dflash-mlx for DFlash v7 speculative decoding |
| native | Native expert I/O build dependencies and extension packaging |
| conversion | Dependencies for the mach convert 2-bit IQ2 pipeline (see Conversion) |
Without dflash, mach-serve falls back to target-only autoregressive decode when no usable draft checkpoint is available. Without native, production preflight fails fast unless you explicitly enable the diagnostic Python fallback (not recommended for serving).
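To see which of these fallbacks might apply before starting the server, one option is to probe for a marker module per extra. The following is a minimal sketch, not part of mach: only lme_mlx_pread_ext is named in this guide, while pytest and dflash_mlx are assumed import names inferred from the package names, and the helper installed_extras is hypothetical.

```python
import importlib.util

# Assumed import names: only lme_mlx_pread_ext appears in this guide;
# "pytest" and "dflash_mlx" are guesses derived from the package names.
EXTRA_MODULES = {
    "dev": "pytest",
    "dflash": "dflash_mlx",
    "native": "lme_mlx_pread_ext",
}


def installed_extras(modules=EXTRA_MODULES):
    """Map each extra to True if its marker module can be located."""
    return {
        extra: importlib.util.find_spec(name) is not None
        for extra, name in modules.items()
    }


if __name__ == "__main__":
    for extra, ok in installed_extras().items():
        print(f"{extra:8s} {'ok' if ok else 'missing'}")
```

A missing marker only suggests that the matching extra (or the native build step) has not been completed. The preflight run by mach-serve remains the authoritative check.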
Console scripts installed by the package:
| Command | Role |
|---|---|
| mach-serve | OpenAI-compatible HTTP server (primary entry point) |
| mach-generate | One-shot CLI generation |
| mach convert | Master → 2-bit IQ2 GGUF checkpoint pipeline |
| mach-prune | Slice/convert checkpoints to per-expert engine layout |
| mach-reap | REAP saliency pruning |
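A quick way to confirm that the console scripts landed on PATH in the active environment is a lookup with shutil.which. This is a sketch under the assumption that the hyphenated names are standalone executables. The helper missing_scripts is hypothetical, and mach convert is left out because this guide writes it with a space and does not say how it is dispatched.

```python
import shutil

# Hyphenated command names from the console scripts table. "mach convert" is
# written with a space in this guide, so it is skipped rather than guessed at.
CONSOLE_SCRIPTS = ["mach-serve", "mach-generate", "mach-prune", "mach-reap"]


def missing_scripts(names=CONSOLE_SCRIPTS):
    """Return the commands that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_scripts()
    if missing:
        print("not on PATH:", ", ".join(missing))
    else:
        print("all console scripts found")
```

If a command is reported missing, the usual cause is that the editable install ran in a different virtual environment than the one currently active.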
Native extensions
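Production streaming depends on two native components: the direct-pread extension and the sidecar loader. A third, the IQ2 codec, is needed only for conversion.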
Direct-pread extension (lme_mlx_pread_ext)
Built by scripts/build_mlx_pread_ext.py. Preads packed sidecar bytes straight into persistent GPU bank slots, avoiding staged source tensors on the hot path.
python scripts/build_mlx_pread_ext.py
python -c "import lme_mlx_pread_ext"
If the import fails after install, reinstall the extras and rebuild. Startup should log native_extension=ready and direct_pread=1. A missing extension triggers fail-fast preflight in production unless LME_ALLOW_DIAGNOSTIC_PYTHON_FALLBACK=1 is set (diagnostic only).
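The fail-fast rule can be pictured as a small gate around the import. The sketch below illustrates the behavior described here and is not mach's actual preflight code: the function name preflight_direct_pread and its return values are hypothetical, and only the module name, build script, and environment variable come from this guide.

```python
import importlib
import os


def preflight_direct_pread(module_name="lme_mlx_pread_ext", env=None):
    """Return "native" or "python-fallback"; raise if neither is allowed."""
    env = os.environ if env is None else env
    try:
        importlib.import_module(module_name)
        return "native"
    except ImportError as exc:
        # Diagnostic escape hatch named in this guide; not meant for serving.
        if env.get("LME_ALLOW_DIAGNOSTIC_PYTHON_FALLBACK") == "1":
            return "python-fallback"
        raise RuntimeError(
            f"{module_name} is not importable; run "
            "scripts/build_mlx_pread_ext.py or set "
            "LME_ALLOW_DIAGNOSTIC_PYTHON_FALLBACK=1 for diagnostics"
        ) from exc
```

Raising instead of silently degrading matches the intent stated above: a production server should stop at startup and not run on the slow diagnostic path unnoticed.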
Sidecar loader (liblme_expert_io)
C implementation in csrc/expert_io.c. Reads packed expert sidecar records (layer_XX.bin) instead of safetensors tensor slices. Enabled in production when sidecars are present (LME_NATIVE_EXPERT_IO=1 / --native-expert-io).
Together, these extensions are why production mach-serve can sustain low-RAM serving with engine-format sidecar artifacts. See Expert residency for how pread feeds the expert bank.
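Since the sidecar loader is enabled when sidecars are present, it can help to list which layer files a checkpoint directory actually contains. This is a rough sketch with two loudly stated assumptions: that XX in layer_XX.bin is a decimal layer index, and that the files sit directly in the directory passed in. The helper find_sidecars is hypothetical and not part of mach.

```python
import re
from pathlib import Path

# Assumption: "XX" in layer_XX.bin is a decimal layer index (e.g. layer_07.bin).
SIDECAR_RE = re.compile(r"layer_(\d+)\.bin")


def find_sidecars(checkpoint_dir):
    """Return {layer_index: path} for sidecar files directly in checkpoint_dir."""
    found = {}
    for path in Path(checkpoint_dir).iterdir():
        match = SIDECAR_RE.fullmatch(path.name)
        if match and path.is_file():
            found[int(match.group(1))] = path
    # Sort by layer index so gaps are easy to spot when printed.
    return dict(sorted(found.items()))
```

An empty result suggests the checkpoint is not yet in the engine-format sidecar layout, in which case setting LME_NATIVE_EXPERT_IO=1 or passing --native-expert-io would have nothing to read.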
IQ2 codec (libiqk)
Building 2-bit checkpoints with mach convert needs the native libiqk codec (IQ2_K / IQ2_KS / IQ2_KL). Build it once:
python scripts/build_libiqk.py
A missing libiqk makes mach convert exit with status 2. It is required only for conversion, not for serving.
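When conversion is driven from a script, the exit status can be turned into a reminder about the codec. The wrapper below is a minimal sketch. The name run_with_hint is hypothetical, and no mach convert arguments are shown because this guide does not list them. Python's argparse also exits with status 2 on usage errors, so the hint is a suggestion and not a diagnosis.

```python
import subprocess
import sys


def run_with_hint(cmd):
    """Run cmd; on exit status 2, print a reminder about libiqk to stderr."""
    result = subprocess.run(cmd)
    if result.returncode == 2:
        # Status 2 is what this guide associates with a missing libiqk.
        print(
            "exit status 2: libiqk may be missing; "
            "try `python scripts/build_libiqk.py`",
            file=sys.stderr,
        )
    return result.returncode
```

The function returns the child's exit status unchanged, so a calling script can still decide whether to stop or retry after rebuilding the codec.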
Verify the install
python -c "import mach; print(mach.__version__)"
mach-serve --help
For DFlash flags:
mach-serve --help | rg "serving-adaptive-block-policy|streaming"
Tests
pip install -e ".[dev]"
pytest tests/
Slow integration tests (model loads, serving smoke):
pytest tests/ -m slow
Focused probes without full model loads:
python -m pytest tests/test_serve_production_tool_parsing.py -q
python -m py_compile src/mach/server/production_app.py src/mach/cli/serve.py
Maniac Desktop managed venv
Maniac Desktop installs vendor/local_moe_engine[dev,dflash,native] into its managed Python venv automatically when Local MoE is enabled, builds lme_mlx_pread_ext if missing, and spawns mach-serve from that venv. See Maniac integration.
Next steps
- Serving: start mach-serve on the production fast path
- CLI: obtain or build an engine-format checkpoint and sidecar
- Troubleshooting: if direct_pread.active=0 or prefill is very slow