Installation

Install mach with production extras, build native extensions, and run the test suite.

Install mach in editable mode with the extras that match your use case. Production serving on Apple Silicon expects dev, dflash, and native together.

pip install extras

From the engine root (vendored at vendor/local_moe_engine in Maniac Desktop, or a standalone checkout):

pip install -e ".[dev,dflash,native]"

Extra	Purpose
`dev`	Test dependencies (`pytest`, etc.)
`dflash`	Pulls `dflash-mlx` for DFlash v7 speculative decoding
`native`	Native expert I/O build dependencies and extension packaging
`conversion`	Dependencies for the `mach convert` 2-bit IQ2 pipeline (see Conversion)

Without dflash, mach-serve falls back to target-only autoregressive decode when no usable draft checkpoint is available. Without native, production preflight fails fast unless you explicitly enable the diagnostic Python fallback (not recommended for serving).
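To see which optional pieces are importable before starting the server, a quick check along these lines can help. This is a minimal sketch and not part of mach: `lme_mlx_pread_ext` is the module name used on this page, while `dflash_mlx` is an assumed import name for the `dflash-mlx` distribution and may differ in your install.

```python
import importlib.util


def importable(module_names):
    """Return {name: True/False} depending on whether each module can be found."""
    result = {}
    for name in module_names:
        try:
            result[name] = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # find_spec can raise for broken parent packages or odd module state.
            result[name] = False
    return result


if __name__ == "__main__":
    # "dflash_mlx" is an assumed import name; adjust it to match your install.
    for name, ok in importable(["mach", "dflash_mlx", "lme_mlx_pread_ext"]).items():
        print(f"{name}: {'found' if ok else 'missing'}")
```

The check only looks for the modules on the import path; it does not load them, so it will not catch a native extension that is present but fails to link.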

Console scripts installed by the package:

Command	Role
`mach-serve`	OpenAI-compatible HTTP server (primary entry point)
`mach-generate`	One-shot CLI generation
`mach convert`	Master → 2-bit IQ2 GGUF checkpoint pipeline
`mach-prune`	Slice/convert checkpoints to per-expert engine layout
`mach-reap`	REAP saliency pruning

See CLI and Serving.
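If a command is not found after installation, the environment that ran pip is usually not the one on your PATH. The following sketch lists which of the expected console scripts are missing. The hyphenated names come from the table above; treating `mach` as the executable behind `mach convert` is an assumption.

```python
import shutil


def missing_commands(commands):
    """Return the entries of `commands` that are not found on PATH."""
    return [cmd for cmd in commands if shutil.which(cmd) is None]


if __name__ == "__main__":
    # "mach" is assumed to be the executable that provides `mach convert`.
    expected = ["mach-serve", "mach-generate", "mach-prune", "mach-reap", "mach"]
    missing = missing_commands(expected)
    print("all console scripts found" if not missing else f"missing: {missing}")
```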

Native extensions

Production streaming depends on two native components.

Direct-pread extension (`lme_mlx_pread_ext`)

Built by scripts/build_mlx_pread_ext.py. It preads packed sidecar bytes straight into persistent GPU bank slots, so the hot path avoids staging source tensors.

python scripts/build_mlx_pread_ext.py
python -c "import lme_mlx_pread_ext"

If the import fails after installation, reinstall the extras and rebuild the extension. On startup, the server should log native_extension=ready and direct_pread=1. In production, a missing extension fails preflight immediately unless LME_ALLOW_DIAGNOSTIC_PYTHON_FALLBACK=1 is set (diagnostic only).
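The fail-fast rule can be summarized as a small decision function. This is an illustrative sketch of the behavior described here, not mach's actual preflight code; the function name and return values are hypothetical.

```python
import os


def choose_expert_io_path(extension_available, env=None):
    """Illustrative preflight: return "native" or "python-fallback", or raise."""
    env = os.environ if env is None else env
    if extension_available:
        return "native"
    if env.get("LME_ALLOW_DIAGNOSTIC_PYTHON_FALLBACK") == "1":
        # Diagnostic only; not recommended for serving.
        return "python-fallback"
    raise RuntimeError(
        "lme_mlx_pread_ext is missing; rebuild it with "
        "scripts/build_mlx_pread_ext.py"
    )
```

Raising instead of silently falling back keeps a misconfigured production host from serving on the slow path unnoticed.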

Sidecar loader (`liblme_expert_io`)

C implementation in csrc/expert_io.c. Reads packed expert sidecar records (layer_XX.bin) instead of safetensors tensor slices. Enabled in production when sidecars are present (LME_NATIVE_EXPERT_IO=1 / --native-expert-io).

Together, these extensions are why production mach-serve can sustain low-RAM serving with engine-format sidecar artifacts. See Expert residency for how pread feeds the expert bank.
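For readers unfamiliar with pread, the sketch below shows the underlying idea in plain Python: read a byte range at an explicit offset without moving a shared file position. The record layout is hypothetical (two fixed-size 4-byte records); the real sidecar format is not described on this page, and the native extensions do this work in C rather than through `os.pread`, which is available on Unix-like systems only.

```python
import os
import tempfile


def read_record(path, offset, length):
    """Read `length` bytes at `offset` without moving a shared file position."""
    fd = os.open(path, os.O_RDONLY)
    try:
        data = os.pread(fd, length, offset)
    finally:
        os.close(fd)
    if len(data) != length:
        raise EOFError(f"expected {length} bytes at offset {offset}, got {len(data)}")
    return data


if __name__ == "__main__":
    # Hypothetical layout: two fixed-size 4-byte records back to back.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "layer_00.bin")
        with open(path, "wb") as f:
            f.write(b"AAAABBBB")
        print(read_record(path, 4, 4))  # second record
```

Because each call carries its own offset, several readers can fetch different records from the same file without coordinating seeks.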

IQ2 codec (`libiqk`)

Building 2-bit checkpoints with mach convert needs the native libiqk codec (IQ2_K / IQ2_KS / IQ2_KL). Build it once:

python scripts/build_libiqk.py

If libiqk is missing, mach convert exits with status 2. The codec is required only for conversion, not for serving.

Verify the install

python -c "import mach; print(mach.__version__)"
mach-serve --help

To list the DFlash-related flags (the filter uses ripgrep; grep -E accepts the same pattern):

mach-serve --help | rg "serving-adaptive-block-policy|streaming"

Tests

pip install -e ".[dev]"
pytest tests/

Slow integration tests (model loads, serving smoke):

pytest tests/ -m slow

Focused probes without full model loads:

python -m pytest tests/test_serve_production_tool_parsing.py -q
python -m py_compile src/mach/server/production_app.py src/mach/cli/serve.py

Maniac Desktop managed venv

Maniac Desktop installs vendor/local_moe_engine[dev,dflash,native] into its managed Python venv automatically when Local MoE is enabled, builds lme_mlx_pread_ext if missing, and spawns mach-serve from that venv. See Maniac integration.

Next steps

Serving — start mach-serve on the production fast path
CLI — obtain or build an engine-format checkpoint and sidecar
Troubleshooting — if direct_pread.active=0 or prefill is very slow
