CLI
mach-archs, mach-models, mach check, mach-generate, mach convert, mach-prune, mach-reap, and the artifact workflow for engine-format checkpoints and expert sidecars.
mach ships console scripts for discovering supported architectures and models, preflighting a checkpoint before loading it, one-shot generation, 2-bit conversion (mach convert — documented in Conversion), checkpoint layout conversion, pruning, and the production serving entry point (mach-serve — documented in Serving).
Start with the read-only discovery commands — mach-archs (what the engine can serve), mach-models (what is installed or downloadable), and mach check (whether a specific checkpoint will run, on which paths, before paying a load) — then move on to generation and artifact building. All three are import-light (no MLX) so they run on CPU-only / Modal hosts.
mach-archs
List the MoE architectures the engine can serve. The output is generated directly from the in-code architecture registry (io/arch_registry.REGISTRY), so it is the authoritative answer to "what can this engine load?" — a checkpoint whose config.json model_type is not listed here cannot be loaded.
mach-archs # human-readable table
mach-archs --json # machine-readable JSON (one source of truth for tools)
The engine currently supports five architectures: deepseek_v4, qwen3_5_moe (Qwen3.6-35B-A3B), qwen3_moe, gpt_oss, and gemma4. The table reports static facts per architecture:
| Column | Meaning |
|---|---|
| model_type | config.json model_type key |
| swiglu | Activation variant (plain_swiglu, silu_limited, plain_geglu) |
| stacked | Whether the upstream axis-0 stacked layout loads directly (no ⇒ must be sliced first; deepseek_v4 only) |
| expert_bias | Per-expert Linear bias present (only gpt_oss) |
| wrapped | Architecture nested under text_config (Qwen3.6, Gemma 4) |
| bits / group | Default routed-expert quantization fallbacks |
| model_module | Module imported for the Model / ModelArgs classes |
--json emits the full machine-readable summary — the same arch_summary consumed by GET /v1/capabilities — which additionally includes the arch's capability metadata: kv_cache_kind (standard / hybrid_gdn / compressed / rotating), specdec_kind (specdec_v7 / mtp_eagle3 / null), and the derived continuous_batching capability (computed from kv_cache_kind).
This table is the same set rendered in Overview → Supported architectures; mach-archs --json is the authoritative source. To see which serving paths each architecture (or a concrete checkpoint) can actually run, use mach check --all-archs.
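As a rough illustration of consuming the JSON form, the sketch below filters an architecture summary for entries that report continuous batching. The list-of-objects shape and the sample values are assumptions made for this example, not the engine's documented schema; inspect real `mach-archs --json` output before relying on any key.

```python
import json

# Hypothetical sample of `mach-archs --json` output. The list-of-objects shape
# and these values are assumptions for illustration, not real engine output.
SAMPLE = """
[
  {"model_type": "qwen3_moe", "kv_cache_kind": "standard",
   "continuous_batching": true},
  {"model_type": "deepseek_v4", "kv_cache_kind": "compressed",
   "continuous_batching": false}
]
"""


def batching_archs(summary_json: str) -> list:
    """Return the model_type values whose summary reports continuous batching."""
    archs = json.loads(summary_json)
    return sorted(a["model_type"] for a in archs if a.get("continuous_batching"))


print(batching_archs(SAMPLE))
```

In a real script the string would come from the captured stdout of `mach-archs --json` rather than an inline sample.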
mach-models
List models known to the engine: locally-cached checkpoints plus (when configured) the remote downloadable catalog, merged into one view. Each row reports its architecture, whether the engine supports it (its model_type is in the registry), and whether it is installed locally.
mach-models # merged local cache + remote catalog (if configured)
mach-models --local-only # only scan local checkpoints
mach-models --json # machine-readable
mach-models --dir PATH # extra directory to scan (repeatable)
mach-models --cache-dir PATH # override the engine cache root (default: ~/.cache/mach)
Columns: name, source (local, remote, or local+remote), model_type, supported, installed, size.
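A short sketch of how a tool might use these columns to find models that are supported but not yet installed. It assumes `mach-models --json` emits a list of objects keyed by the column names above; that shape and the model names are hypothetical.

```python
import json

# Hypothetical `mach-models --json` rows. The keys mirror the documented
# columns, but the exact JSON shape and these model names are assumptions.
ROWS = json.loads("""
[
  {"name": "example/model-a", "source": "local+remote",
   "model_type": "qwen3_moe", "supported": true, "installed": true},
  {"name": "example/model-b", "source": "remote",
   "model_type": "gpt_oss", "supported": true, "installed": false},
  {"name": "example/model-c", "source": "local",
   "model_type": "other_arch", "supported": false, "installed": true}
]
""")


def downloadable(rows):
    """Names of models the engine supports that are not installed locally."""
    return [r["name"] for r in rows if r["supported"] and not r["installed"]]


print(downloadable(ROWS))
```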
Remote discovery reads the existing GET /v1/local-moe/catalog route. Enable it with:
| Env var | Purpose |
|---|---|
| LME_CATALOG_URL | The composio-api base URL or a full /v1/local-moe/catalog URL |
| LME_CATALOG_TOKEN | A Supabase JWT used to authenticate the catalog request |
| LME_MODELS_DIR | Extra scan directories (os.pathsep-separated), added alongside --dir |
When LME_CATALOG_URL / LME_CATALOG_TOKEN are unset (or --local-only is passed), the listing is local-only and prints a hint pointing at those env vars.
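The environment handling described above can be sketched roughly as follows. This is not the engine's code: the function name is hypothetical, and two behaviours are assumptions, namely that both catalog variables must be set for remote discovery and that a bare base URL has the catalog route appended.

```python
import os

CATALOG_PATH = "/v1/local-moe/catalog"


def catalog_settings(env, local_only=False, extra_dirs=()):
    """Resolve discovery settings from an environment mapping (illustrative)."""
    url = env.get("LME_CATALOG_URL", "").strip()
    token = env.get("LME_CATALOG_TOKEN", "").strip()
    # Assumption: remote discovery needs both variables; --local-only disables it.
    remote = bool(url and token) and not local_only
    if remote and not url.rstrip("/").endswith(CATALOG_PATH):
        # Assumption: a bare base URL gets the catalog route appended.
        url = url.rstrip("/") + CATALOG_PATH
    # LME_MODELS_DIR is os.pathsep-separated and is added alongside --dir.
    env_dirs = [d for d in env.get("LME_MODELS_DIR", "").split(os.pathsep) if d]
    return {
        "remote": remote,
        "catalog_url": url if remote else None,
        "scan_dirs": list(extra_dirs) + env_dirs,
    }


example_env = {
    "LME_CATALOG_URL": "https://api.example.test/",
    "LME_CATALOG_TOKEN": "placeholder-token",
    "LME_MODELS_DIR": os.pathsep.join(["/models/a", "/models/b"]),
}
print(catalog_settings(example_env, extra_dirs=["/models/extra"]))
```

Passing the environment as a plain mapping keeps the logic easy to test; a caller would normally pass `os.environ`.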
mach check
A static, pre-load model-support preflight. It answers — before paying a load — whether a checkpoint's architecture is supported, which serving paths it can run (the path × architecture support matrix), which artifacts it needs (expert sidecar, draft dir), and whether it will fit in memory — all from config.json + sidecar metadata alone, never loading weights.
mach check is the static counterpart to GET /v1/capabilities: the endpoint reports what the currently loaded process is doing; mach check predicts what could run. It stays drift-free by reusing the same arch registry, layout / routed-quant inspection, and serving gates the loader and server use (it never re-derives them).
mach check ./checkpoints/qwen3-coder # local checkpoint dir
mach check mlx-community/Qwen3-Coder-30B-A3B-MLX-4bit # HF repo id (config-only fetch)
mach check MODEL --json # machine-readable report
mach check --all-archs # full path × arch matrix (no checkpoint)
mach check MODEL --draft-dir ./draft --context-length 16384 --expert-cache-gb 4
mach check MODEL --local-only # never touch the network
Config-only resolution (never downloads weights)
A local directory containing a config.json is inspected in place. A Hugging Face repo id fetches only config.json and a best-effort expert_sidecar/layout.json (metadata only — weights are never downloaded), cached under the engine cache root. --local-only disables network resolution entirely and requires a local checkpoint directory.
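The resolution rule can be pictured with a small sketch. It only classifies the MODEL argument and performs no network access; `resolve_source` is a hypothetical name, and the engine's real resolver may apply further checks.

```python
import tempfile
from pathlib import Path


def resolve_source(model: str, local_only: bool = False) -> str:
    """Classify a MODEL argument as 'local' or 'hub' (illustrative only)."""
    path = Path(model)
    if path.is_dir() and (path / "config.json").is_file():
        return "local"  # inspected in place
    if local_only:
        raise ValueError(
            f"--local-only requires a local checkpoint directory: {model}"
        )
    # Otherwise treated as a Hugging Face repo id; only config.json and
    # sidecar metadata would be fetched, never weights.
    return "hub"


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "config.json").write_text("{}")
    local_kind = resolve_source(tmp, local_only=True)

# Assumes no directory named example-org/example-model exists under the cwd.
hub_kind = resolve_source("example-org/example-model")
print(local_kind, hub_kind)
```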
Flags
| Flag | Purpose |
|---|---|
| --json | Emit the full report as JSON instead of the human table. |
| --all-archs | Print the path × architecture matrix for every registered arch (no checkpoint required; omit the model argument). |
| --draft-dir DIR | Speculative-decode draft directory to probe for the specdec path. |
| --context-length N | Context length for the KV-byte memory estimate (default: min(config max_position_embeddings, 8192)). |
| --expert-cache-gb GB | Streaming expert-cache budget used for the memory verdict. |
| --local-only | Never touch the network; require a local checkpoint directory. |
| --revision REV | Hugging Face revision (ignored for local paths). |
| --cache-dir PATH | Override the engine cache root used for config-only fetches. |
Exit status is 0 when the checkpoint is supported and 1 when it is not; usage errors also exit non-zero.
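Because the verdict is carried in the exit status, a script can gate on it without parsing the report. The sketch below is one way to do that, assuming usage errors use a status other than 1 (argparse, for instance, exits with 2); the helper names are hypothetical.

```python
import subprocess


def interpret_exit(code: int) -> str:
    """Map a `mach check` exit status to a coarse outcome."""
    if code == 0:
        return "supported"
    if code == 1:
        return "unsupported"
    # Assumption: any other non-zero status is a usage error.
    return "usage_error"


def preflight(model: str, run=subprocess.run) -> str:
    """Run `mach check MODEL --json` and classify the result by exit status.

    `run` is injectable so the logic can be exercised without mach installed.
    """
    proc = run(["mach", "check", model, "--json"], capture_output=True, text=True)
    return interpret_exit(proc.returncode)


print(interpret_exit(0), interpret_exit(1), interpret_exit(2))
```

A CI job could call `preflight("./checkpoints/qwen3-coder")` and skip the load step unless the result is `supported`.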
Report
For a concrete checkpoint the human report prints the resolved model_type, layout, quant mode, the artifact probe (sidecar, draft_dir), an optional memory block, the per-path status table, an overall verdict, and a suggested mach-serve command. An illustrative run:
model: ./checkpoints/qwen3-coder
model_type: qwen3_moe
supported: yes
layout: stacked
quant: mxfp4 (4-bit)
artifacts: sidecar=no, draft_dir=none
memory: verdict=fits, experts=14.2GB, kv@8192=1.0GB, working_set=18.0GB
path status reason
------------------- -------------- ---------------------------------------------------
openai_http ok registered architecture
expert_streaming ok expert-sidecar streaming
resident_stacked ok stacked-layout checkpoint usable for resident serving
bf16_streaming ok stacked layout supports bf16 streaming
gguf_experts_2bit n/a no gguf expert sidecar / format
specdec needs_artifact specdec_v7 draft requires a usable --draft-dir
continuous_batching ok standard KV cache is batch-compatible
turboquant_kv ok turboquant KV supported (mutually exclusive ...)
prefix_kv_reuse ok standard cache supports prefix trim reuse
native_tool_grammar ok native Qwen tool/grammar path (xgrammar)
verdict: supported
serve with: mach-serve --checkpoint ./checkpoints/qwen3-coder --streaming --target-only
The memory block degrades gracefully: on CPU-only / Modal hosts (no MLX / Metal), the byte math is skipped and the report notes that the estimate is unavailable; every other section still works. Its verdict is one of fits / tight / wont_fit / unknown.
Path × architecture matrix
mach check --all-archs prints the executable path × architecture matrix for every registered arch — a quick way to keep the serving paths straight without a checkpoint:
path deepseek_v4 gemma4 gpt_oss qwen3_5_moe qwen3_moe
------------------- -------------- ------- ------- -------------- --------------
openai_http ok ok ok ok ok
expert_streaming ok ok ok ok ok
resident_stacked blocked unknown unknown unknown unknown
bf16_streaming unknown unknown unknown unknown unknown
gguf_experts_2bit unknown unknown unknown unknown unknown
specdec needs_artifact n/a n/a needs_artifact needs_artifact
continuous_batching blocked unknown unknown ok ok
turboquant_kv ok ok ok ok ok
prefix_kv_reuse unknown unknown unknown ok ok
native_tool_grammar n/a n/a n/a ok ok
Each cell is one of ok / blocked / needs_artifact / unknown / n/a. Checkpoint-dependent rows (resident_stacked, bf16_streaming, gguf_experts_2bit) read unknown in the arch-only matrix and resolve to a concrete status when a checkpoint is inspected. The path names reuse the /v1/capabilities serving vocabulary where they overlap (continuous_batching, turboquant_kv) so the static prediction and the runtime report stay directly comparable.
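For quick scripting, the human-readable matrix can be parsed by whitespace, since no status token contains a space. The sketch below does this for the matrix shown above and lists the paths that read ok for one architecture. Treat it as a stopgap: the `--json` report is the better input for tools, but its schema is not described on this page, so it is not assumed here.

```python
# The matrix text as printed by `mach check --all-archs` in the example above.
MATRIX = """\
path deepseek_v4 gemma4 gpt_oss qwen3_5_moe qwen3_moe
------------------- -------------- ------- ------- -------------- --------------
openai_http ok ok ok ok ok
expert_streaming ok ok ok ok ok
resident_stacked blocked unknown unknown unknown unknown
bf16_streaming unknown unknown unknown unknown unknown
gguf_experts_2bit unknown unknown unknown unknown unknown
specdec needs_artifact n/a n/a needs_artifact needs_artifact
continuous_batching blocked unknown unknown ok ok
turboquant_kv ok ok ok ok ok
prefix_kv_reuse unknown unknown unknown ok ok
native_tool_grammar n/a n/a n/a ok ok
"""


def parse_matrix(text):
    """Parse the table into {path: {arch: status}}."""
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    archs = lines[0][1:]  # header row: "path" followed by arch names
    matrix = {}
    for row in lines[1:]:
        if set(row[0]) == {"-"}:  # skip the dashed separator line
            continue
        matrix[row[0]] = dict(zip(archs, row[1:]))
    return matrix


def ok_paths(matrix, arch):
    """Paths whose cell for `arch` is exactly 'ok'."""
    return [path for path, cells in matrix.items() if cells[arch] == "ok"]


matrix = parse_matrix(MATRIX)
print(ok_paths(matrix, "gpt_oss"))
```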
mach-generate
One-shot text generation without standing up HTTP:
mach-generate mlx-community/Qwen3.6-35B-A3B-MLX-4bit \
--prompt "def reverse(s):" \
--max-tokens 64 \
--temperature 0.0
Useful for smoke-testing a checkpoint or mlx-lm compatibility before mach-serve.
mach convert
One command to turn a higher-precision MoE master into a servable 2-bit IQ2 GGUF checkpoint — the replacement for the seven-step manual recipe in recipes/2bit-moe/scripts/*:
mach convert mlx-community/Qwen3.6-35B-A3B-bf16 --out ./qwen36-a3b-iq2 --gates full
mach-serve ./qwen36-a3b-iq2 --streaming --port 8080
It runs the phases resolve master → slice → calibrate → quantize → pack → gates in sequence, writing each phase's output under --out along with a conversion_report.json manifest. The full flag reference, the per-arch ConversionConfig seam, validated-vs-untuned architectures, validation gates, exit codes, and the native libiqk dependency are documented in Conversion.
mach-prune
Convert a source checkpoint into the per-expert engine layout (sliced experts) used by streaming residency:
mach-prune \
--source /path/to/source/checkpoint \
--output ./qwen3_6_a3b_k128 \
--plan path/to/keep_list.jsonl \
--candidate-id qwen3_6_a3b_uniform_avgk128
The k<n> in an artifact name is the number of routed experts retained per layer, not a model name or a quality tier. k256 keeps 256 experts; k192 / k128 are pruned subsets. For a model whose architecture has 256 routed experts (e.g. Qwen3.6-35B-A3B, DeepSeek-V4-Flash), k256 is the full, unpruned set; a model with a different expert count uses a correspondingly different k<n>. Prune plans select which experts to retain when building smaller variants; the reference production target keeps the full expert set.
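The naming convention is simple enough to check mechanically. The helper below is a hypothetical sketch that reads the trailing k<n> from an artifact name and compares it against a routed-expert count supplied by the caller; it assumes the k<n> token always ends the name, as it does in the examples on this page.

```python
import re
from typing import Optional

# Matches a trailing k<n>, e.g. "..._k256" or "..._avgk128".
_K_SUFFIX = re.compile(r"k(\d+)$")


def retained_experts(artifact_name: str) -> Optional[int]:
    """Return n from a trailing k<n>, or None when the name has no such suffix."""
    match = _K_SUFFIX.search(artifact_name)
    return int(match.group(1)) if match else None


def is_unpruned(artifact_name: str, total_routed_experts: int) -> bool:
    """True when the artifact keeps every routed expert of the architecture."""
    return retained_experts(artifact_name) == total_routed_experts


print(retained_experts("qwen3_6_35b_a3b_engine_k256"))
print(is_unpruned("qwen3_6_a3b_k128", total_routed_experts=256))
```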
mach-reap
REAP saliency pruning — removes low-saliency experts based on a saliency analysis plan. Distinct from layout slicing (mach-prune); used in experimentation pipelines.
Artifact workflow
Production serving expects an engine-format full-expert checkpoint plus expert_sidecar/ for direct-pread. The examples below use qwen3_6_35b_a3b_engine_k256 (the 256-expert Qwen reference checkpoint) as a concrete example; substitute your own checkpoint name.
Expected layout
qwen3_6_35b_a3b_engine_k256/
├── config.json
├── *.safetensors
├── tokenizer files
└── expert_sidecar/
    ├── layout.json
    └── layer_XX.bin (every declared layer)
Reference path in upstream repos: experiments/pipeline_v1/results/qwen3_6_35b_a3b_engine_k256.
This is an engine-layout conversion of upstream pre-quantized int4 weights — not REAP pruning, not custom quantization.
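Before serving, it can be worth confirming that a directory matches the expected layout. The sketch below checks only the file names listed above; it does not read layout.json (its schema is not described here), so the declared layer count has to be passed in by the caller, and tokenizer files are not checked because their names vary. The function name is hypothetical.

```python
import tempfile
from pathlib import Path


def missing_artifacts(checkpoint, expected_layers=None):
    """Return a list of layout problems; an empty list means the layout looks complete."""
    root = Path(checkpoint)
    sidecar = root / "expert_sidecar"
    problems = []
    if not (root / "config.json").is_file():
        problems.append("config.json")
    if not any(root.glob("*.safetensors")):
        problems.append("*.safetensors")
    if not (sidecar / "layout.json").is_file():
        problems.append("expert_sidecar/layout.json")
    layer_files = sorted(sidecar.glob("layer_*.bin")) if sidecar.is_dir() else []
    if not layer_files:
        problems.append("expert_sidecar/layer_XX.bin")
    elif expected_layers is not None and len(layer_files) != expected_layers:
        problems.append(
            f"expected {expected_layers} layer files, found {len(layer_files)}"
        )
    return problems


# Demo on a throwaway directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "config.json").write_text("{}")
    (root / "model.safetensors").write_bytes(b"")
    before = missing_artifacts(root)  # sidecar not exported yet
    (root / "expert_sidecar").mkdir()
    (root / "expert_sidecar" / "layout.json").write_text("{}")
    (root / "expert_sidecar" / "layer_00.bin").write_bytes(b"")
    after = missing_artifacts(root, expected_layers=1)

print(before)
print(after)
```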
Obtain a prebuilt checkpoint (Modal volume)
modal volume get local-moe-engine-checkpoints qwen3_6_35b_a3b_engine_k256 ./experiments/pipeline_v1/results/
Build the checkpoint yourself
Slice from a candidate plan:
modal run scripts/slice_qwen3_5_moe_modal.py::main \
--plan-rel-path experiments/pipeline_v1/local_plans/qwen3_6_35b_a3b_candidate_plans_uniform_k256.jsonl \
--candidate-id qwen3_6_35b_a3b_uniform_avgk256 \
--output-subdir qwen3_6_35b_a3b_engine_k256
Export expert sidecar (required for production direct-pread):
python scripts/export_expert_sidecar.py \
--checkpoint experiments/pipeline_v1/results/qwen3_6_35b_a3b_engine_k256 \
--output experiments/pipeline_v1/results/qwen3_6_35b_a3b_engine_k256/expert_sidecar \
--num-experts 256 \
--bits 4
Draft model (DFlash)
Either local:
experiments/pipeline_v1/results/eagle3_training/dflash_draft
Or Hugging Face: z-lab/Qwen3.6-35B-A3B-DFlash.
Pass to mach-serve with --draft-dir when not using default discovery paths. See Speculative decoding.
Serve after artifacts are ready
mach-serve experiments/pipeline_v1/results/qwen3_6_35b_a3b_engine_k256 --streaming --port 8080
Compatibility launchers
Kept for scripts and CI, not primary documentation:
| Entry | Notes |
|---|---|
| mach-serve ... --backend production | Alias for production profile |
| mach-serve ... --backend dflash | Legacy alias |
| scripts/serve_production.py | Thin launcher |
| scripts/serve_dflash.py | Legacy alias |
| experiments/pipeline_v1/opencode_serve.sh production | Repo launcher wrapper |
Related pages
- Installation: install extras before building artifacts
- Conversion: mach convert 2-bit IQ2 GGUF pipeline
- Expert residency: sidecar format
- Serving: production mach-serve flags
- Maniac integration: desktop catalog and env overrides for checkpoint paths