CLI

mach-archs, mach-models, mach check, mach-generate, mach convert, mach-prune, mach-reap, and the artifact workflow for engine-format checkpoints and expert sidecars.

mach ships console scripts for discovering supported architectures and models, preflighting a checkpoint before loading it, one-shot generation, 2-bit conversion (mach convert — documented in Conversion), checkpoint layout conversion, pruning, and the production serving entry point (mach-serve — documented in Serving).

Start with the read-only discovery commands — mach-archs (what the engine can serve), mach-models (what is installed or downloadable), and mach check (whether a specific checkpoint will run, on which paths, before paying a load) — then move on to generation and artifact building. All three are import-light (no MLX) so they run on CPU-only / Modal hosts.

mach-archs

List the MoE architectures the engine can serve. The output is generated directly from the in-code architecture registry (io/arch_registry.REGISTRY), so it is the authoritative answer to "what can this engine load?" — a checkpoint whose config.json model_type is not listed here cannot be loaded.

mach-archs            # human-readable table
mach-archs --json     # machine-readable JSON (one source of truth for tools)

The engine currently supports five architectures: deepseek_v4, qwen3_5_moe (Qwen3.6-35B-A3B), qwen3_moe, gpt_oss, and gemma4. The table reports static facts per architecture:

Column	Meaning
`model_type`	`config.json` `model_type` key
`swiglu`	Activation variant (`plain_swiglu`, `silu_limited`, `plain_geglu`)
`stacked`	Whether the upstream axis-0 stacked layout loads directly (`no` ⇒ must be sliced first; `deepseek_v4` only)
`expert_bias`	Per-expert Linear bias present (only `gpt_oss`)
`wrapped`	Architecture nested under `text_config` (Qwen3.6, Gemma 4)
`bits` / `group`	Default routed-expert quantization fallbacks
`model_module`	Module imported for the `Model` / `ModelArgs` classes

--json emits the full machine-readable summary — the same arch_summary consumed by GET /v1/capabilities — which additionally includes the arch's capability metadata: kv_cache_kind (standard / hybrid_gdn / compressed / rotating), specdec_kind (specdec_v7 / mtp_eagle3 / null), and the derived continuous_batching capability (computed from kv_cache_kind).

This table is the same set rendered in Overview → Supported architectures; mach-archs --json is the authoritative source. To see which serving paths each architecture (or a concrete checkpoint) can actually run, use mach check --all-archs.
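
As a rough illustration of consuming the JSON form from a tool, the sketch below filters architectures by the derived continuous_batching capability. The payload shape is an assumption (a list of objects carrying model_type plus the capability fields named above), and the second entry is a made-up architecture; inspect a real mach-archs --json dump before relying on this.

```python
import json

# Hypothetical excerpt of `mach-archs --json`. Only the field names come from
# the text above; the list-of-objects shape is assumed, and "example_arch" is
# an invented entry used purely for contrast.
SAMPLE = """
[
  {"model_type": "qwen3_moe", "kv_cache_kind": "standard",
   "specdec_kind": "specdec_v7", "continuous_batching": true},
  {"model_type": "example_arch", "kv_cache_kind": "rotating",
   "specdec_kind": null, "continuous_batching": false}
]
"""


def batchable_archs(payload: str) -> list[str]:
    """Return the model_type of every arch that reports continuous_batching."""
    archs = json.loads(payload)
    return sorted(a["model_type"] for a in archs if a.get("continuous_batching"))


print(batchable_archs(SAMPLE))
```

In practice the payload would come from running mach-archs --json in a subprocess rather than from an inline string.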

mach-models

List models known to the engine: locally-cached checkpoints plus (when configured) the remote downloadable catalog, merged into one view. Each row reports its architecture, whether the engine supports it (its model_type is in the registry), and whether it is installed locally.

mach-models                   # merged local cache + remote catalog (if configured)
mach-models --local-only      # only scan local checkpoints
mach-models --json            # machine-readable
mach-models --dir PATH        # extra directory to scan (repeatable)
mach-models --cache-dir PATH  # override the engine cache root (default: ~/.cache/mach)

Columns: name, source (local, remote, or local+remote), model_type, supported, installed, size.
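
The source and installed columns follow from a simple set merge. The sketch below shows one way that labeling could work; keying rows by model name is an assumption (the real merge key may be a repo id or a path), and merge_sources is a hypothetical helper, not an engine function.

```python
def merge_sources(local: set[str], remote: set[str]) -> dict[str, dict]:
    """Label each known model as local, remote, or local+remote.

    Hypothetical helper: models are keyed by name here, which is an
    assumption about how the merged view identifies a row.
    """
    rows = {}
    for name in sorted(local | remote):
        if name in local and name in remote:
            source = "local+remote"
        elif name in local:
            source = "local"
        else:
            source = "remote"
        # A model counts as installed when it was found in the local scan.
        rows[name] = {"source": source, "installed": name in local}
    return rows


rows = merge_sources({"model-a", "model-b"}, {"model-b", "model-c"})
for name, row in rows.items():
    print(name, row["source"], row["installed"])
```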

Remote discovery reads the existing GET /v1/local-moe/catalog route. Enable it with:

Env var	Purpose
`LME_CATALOG_URL`	The composio-api base URL or a full `/v1/local-moe/catalog` URL
`LME_CATALOG_TOKEN`	A Supabase JWT used to authenticate the catalog request
`LME_MODELS_DIR`	Extra scan directories (`os.pathsep`-separated), added alongside `--dir`

When LME_CATALOG_URL / LME_CATALOG_TOKEN are unset (or --local-only is passed), the listing is local-only and prints a hint pointing at those env vars.
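
A minimal sketch of how these settings might combine, under two assumptions that the text does not confirm: remote discovery needs both LME_CATALOG_URL and LME_CATALOG_TOKEN, and scan directories are deduplicated with --dir entries first. Both function names are hypothetical.

```python
import os


def resolve_scan_dirs(cli_dirs, env=None):
    """Combine --dir values with LME_MODELS_DIR (os.pathsep-separated).

    Ordering and de-duplication are illustrative choices, not documented
    engine behavior.
    """
    env = os.environ if env is None else env
    extra = [d for d in env.get("LME_MODELS_DIR", "").split(os.pathsep) if d]
    seen, out = set(), []
    for d in list(cli_dirs) + extra:
        if d not in seen:
            seen.add(d)
            out.append(d)
    return out


def remote_enabled(local_only, env=None):
    """Assumed rule: remote listing needs both catalog vars and no --local-only."""
    env = os.environ if env is None else env
    has_url = bool(env.get("LME_CATALOG_URL"))
    has_token = bool(env.get("LME_CATALOG_TOKEN"))
    return (not local_only) and has_url and has_token
```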

mach check

A static, pre-load model-support preflight. Before you pay for a load, it answers four questions: whether the checkpoint's architecture is supported, which serving paths it can run (the path × architecture support matrix), which artifacts it needs (expert sidecar, draft dir), and whether it will fit in memory. All of this comes from config.json and sidecar metadata alone; weights are never loaded.

mach check is the static counterpart to GET /v1/capabilities: the endpoint reports what the currently loaded process is doing; mach check predicts what could run. It stays drift-free by reusing the same arch registry, layout / routed-quant inspection, and serving gates the loader and server use (it never re-derives them).

mach check ./checkpoints/qwen3-coder                  # local checkpoint dir
mach check mlx-community/Qwen3-Coder-30B-A3B-MLX-4bit  # HF repo id (config-only fetch)
mach check MODEL --json                                # machine-readable report
mach check --all-archs                                 # full path × arch matrix (no checkpoint)
mach check MODEL --draft-dir ./draft --context-length 16384 --expert-cache-gb 4
mach check MODEL --local-only                          # never touch the network

Config-only resolution (never downloads weights)

A local directory containing a config.json is inspected in place. A Hugging Face repo id fetches only config.json and a best-effort expert_sidecar/layout.json (metadata only — weights are never downloaded), cached under the engine cache root. --local-only disables network resolution entirely and requires a local checkpoint directory.
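
The resolution rule above can be sketched as a small decision function. This is an illustration of the described behavior, not the engine's code; resolution_mode and the "local" / "hub" labels are hypothetical names.

```python
from pathlib import Path


def resolution_mode(model: str, local_only: bool = False) -> str:
    """Decide how a MODEL argument would be resolved.

    Returns "local" when MODEL is a directory containing config.json,
    otherwise "hub" (treated as a Hugging Face repo id), unless
    local_only forbids network resolution.
    """
    if (Path(model) / "config.json").is_file():
        return "local"
    if local_only:
        raise FileNotFoundError(
            f"{model!r} is not a local checkpoint directory and --local-only was given"
        )
    return "hub"
```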

Flags

Flag	Purpose
`--json`	Emit the full report as JSON instead of the human table.
`--all-archs`	Print the path × architecture matrix for every registered arch (no checkpoint required; omit the model argument).
`--draft-dir DIR`	Speculative-decode draft directory to probe for the `specdec` path.
`--context-length N`	Context length for the KV-byte memory estimate (default: `min(config max_position_embeddings, 8192)`).
`--expert-cache-gb GB`	Streaming expert-cache budget used for the memory verdict.
`--local-only`	Never touch the network; require a local checkpoint directory.
`--revision REV`	Hugging Face revision (ignored for local paths).
`--cache-dir PATH`	Override the engine cache root used for config-only fetches.
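
The --context-length default is a simple cap. The sketch below spells it out; looking under text_config for wrapped architectures and falling back to the cap when the key is missing are both assumptions, and default_context_length is a hypothetical name.

```python
def default_context_length(config: dict, cap: int = 8192) -> int:
    """min(config max_position_embeddings, 8192), per the flag table above."""
    max_pos = config.get("max_position_embeddings")
    if max_pos is None:
        # Assumption: wrapped architectures keep the value under text_config.
        max_pos = config.get("text_config", {}).get("max_position_embeddings")
    if max_pos is None:
        # Assumption: with no declared maximum, use the cap itself.
        return cap
    return min(int(max_pos), cap)
```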

Exit status is 0 when the checkpoint is supported and 1 when it is not; usage errors also exit non-zero.
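
For scripts and CI, the exit status is enough to gate a deployment. The wrapper below is a sketch that assumes usage errors use a code other than 1, which the text implies but does not state; check_supported is a hypothetical helper, and the run parameter exists only so the function can be exercised without the CLI installed.

```python
import subprocess


def check_supported(model: str, local_only: bool = True, run=subprocess.run) -> bool:
    """Run `mach check MODEL --json` and map its exit status to a boolean."""
    argv = ["mach", "check", model, "--json"]
    if local_only:
        argv.append("--local-only")  # never touch the network
    proc = run(argv, capture_output=True, text=True)
    if proc.returncode == 0:
        return True   # checkpoint is supported
    if proc.returncode == 1:
        return False  # checkpoint is not supported
    # Assumption: any other status is a usage or environment error.
    raise RuntimeError(
        f"mach check failed (exit {proc.returncode}): {proc.stderr.strip()}"
    )
```

Raising on unexpected codes keeps a typo in the command line from being mistaken for an unsupported checkpoint.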

Report

For a concrete checkpoint the human report prints the resolved model_type, layout, quant mode, the artifact probe (sidecar, draft_dir), an optional memory block, the per-path status table, an overall verdict, and a suggested mach-serve command. An illustrative run:

model:      ./checkpoints/qwen3-coder
model_type: qwen3_moe
supported:  yes
layout:     stacked
quant:      mxfp4 (4-bit)
artifacts:  sidecar=no, draft_dir=none
memory:     verdict=fits, experts=14.2GB, kv@8192=1.0GB, working_set=18.0GB

path                 status          reason
-------------------  --------------  ---------------------------------------------------
openai_http          ok              registered architecture
expert_streaming     ok              expert-sidecar streaming
resident_stacked     ok              stacked-layout checkpoint usable for resident serving
bf16_streaming       ok              stacked layout supports bf16 streaming
gguf_experts_2bit    n/a             no gguf expert sidecar / format
specdec              needs_artifact  specdec_v7 draft requires a usable --draft-dir
continuous_batching  ok              standard KV cache is batch-compatible
turboquant_kv        ok              turboquant KV supported (mutually exclusive ...)
prefix_kv_reuse      ok              standard cache supports prefix trim reuse
native_tool_grammar  ok              native Qwen tool/grammar path (xgrammar)

verdict:    supported
serve with: mach-serve --checkpoint ./checkpoints/qwen3-coder --streaming --target-only

The memory block degrades gracefully: on CPU-only / Modal hosts (no MLX / Metal), the byte math is skipped and the report notes that the estimate is unavailable — every other section still works. Its verdict is one of fits / tight / wont_fit / unknown.
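
One way to picture the four verdicts is the sketch below. The thresholds are invented for illustration: the text does not say how the engine separates fits from tight, so tight_fraction and the comparison against a single budget are assumptions, as is the memory_verdict name.

```python
from typing import Optional


def memory_verdict(
    working_set_gb: Optional[float],
    budget_gb: Optional[float],
    tight_fraction: float = 0.9,  # hypothetical threshold, not the engine's
) -> str:
    """Map a working-set estimate to fits / tight / wont_fit / unknown."""
    if working_set_gb is None or budget_gb is None:
        # No MLX / Metal on this host: the byte math was skipped.
        return "unknown"
    if working_set_gb > budget_gb:
        return "wont_fit"
    if working_set_gb > tight_fraction * budget_gb:
        return "tight"
    return "fits"
```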

Path × architecture matrix

mach check --all-archs prints the executable path × architecture matrix for every registered arch — a quick way to keep the serving paths straight without a checkpoint:

path                 deepseek_v4     gemma4   gpt_oss  qwen3_5_moe     qwen3_moe
-------------------  --------------  -------  -------  --------------  --------------
openai_http          ok              ok       ok       ok              ok
expert_streaming     ok              ok       ok       ok              ok
resident_stacked     blocked         unknown  unknown  unknown         unknown
bf16_streaming       unknown         unknown  unknown  unknown         unknown
gguf_experts_2bit    unknown         unknown  unknown  unknown         unknown
specdec              needs_artifact  n/a      n/a      needs_artifact  needs_artifact
continuous_batching  blocked         unknown  unknown  ok              ok
turboquant_kv        ok              ok       ok       ok              ok
prefix_kv_reuse      unknown         unknown  unknown  ok              ok
native_tool_grammar  n/a             n/a      n/a      ok              ok

Each cell is one of ok / blocked / needs_artifact / unknown / n/a. Checkpoint-dependent rows (resident_stacked, bf16_streaming, gguf_experts_2bit) generally read unknown in the arch-only matrix and resolve to a concrete status when a checkpoint is inspected; the exception above is resident_stacked for deepseek_v4, which is already blocked at the architecture level. The path names reuse the /v1/capabilities serving vocabulary where they overlap (continuous_batching, turboquant_kv) so the static prediction and the runtime report stay directly comparable.
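
When only the human table is at hand, it is regular enough to parse by whitespace, since no cell contains a space. The sketch below does that for an excerpt of the matrix above; parse_matrix and archs_with are hypothetical helpers, and the --json report is likely the better input for real tooling.

```python
# Excerpt of the `mach check --all-archs` table shown above.
MATRIX = """\
path                 deepseek_v4     gemma4   gpt_oss  qwen3_5_moe     qwen3_moe
-------------------  --------------  -------  -------  --------------  --------------
openai_http          ok              ok       ok       ok              ok
resident_stacked     blocked         unknown  unknown  unknown         unknown
specdec              needs_artifact  n/a      n/a      needs_artifact  needs_artifact
continuous_batching  blocked         unknown  unknown  ok              ok
"""


def parse_matrix(text: str) -> dict[str, dict[str, str]]:
    """Parse the table into {path: {arch: status}}."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    archs = lines[0].split()[1:]  # header row minus the "path" label
    matrix = {}
    for ln in lines[1:]:
        cells = ln.split()
        if set(cells[0]) == {"-"}:  # skip the dashed separator row
            continue
        matrix[cells[0]] = dict(zip(archs, cells[1:]))
    return matrix


def archs_with(matrix, path: str, status: str = "ok") -> list[str]:
    """List the architectures whose cell for `path` equals `status`."""
    return [arch for arch, s in matrix[path].items() if s == status]


matrix = parse_matrix(MATRIX)
print(archs_with(matrix, "continuous_batching"))
```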

mach-generate

One-shot text generation without standing up HTTP:

mach-generate mlx-community/Qwen3.6-35B-A3B-MLX-4bit \
  --prompt "def reverse(s):" \
  --max-tokens 64 \
  --temperature 0.0

Useful for smoke-testing a checkpoint, or its mlx-lm compatibility, before starting mach-serve.

mach convert

One command to turn a higher-precision MoE master into a servable 2-bit IQ2 GGUF checkpoint — the replacement for the seven-step manual recipe in recipes/2bit-moe/scripts/*:

mach convert mlx-community/Qwen3.6-35B-A3B-bf16 --out ./qwen36-a3b-iq2 --gates full
mach-serve ./qwen36-a3b-iq2 --streaming --port 8080

It chains six phases (resolve master → slice → calibrate → quantize → pack → gates), writing each phase's output under --out along with a conversion_report.json manifest. The full flag reference, the per-arch ConversionConfig seam, validated-vs-untuned architectures, validation gates, exit codes, and the native libiqk dependency are documented in Conversion.

mach-prune

Convert a source checkpoint into the per-expert engine layout (sliced experts) used by streaming residency:

mach-prune \
  --source /path/to/source/checkpoint \
  --output ./qwen3_6_a3b_k128 \
  --plan path/to/keep_list.jsonl \
  --candidate-id qwen3_6_a3b_uniform_avgk128

k<n> in an artifact name is the number of routed experts retained per layer, not a model name or a quality tier. k256 keeps 256 experts; k192 / k128 are pruned subsets. For a model whose architecture has 256 routed experts (e.g. Qwen3.6-35B-A3B, DeepSeek-V4-Flash), k256 is the full, unpruned set; a model with a different expert count uses a correspondingly different k<n>. Prune plans select which experts to retain when building smaller variants; the reference production target keeps the full expert set.
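
The naming convention can be read mechanically. The sketch below extracts the retained-expert count from an artifact name and compares it with the architecture's routed-expert count; the exact pattern (an underscore-delimited k followed by digits) is inferred from the example names on this page, and both helpers are hypothetical.

```python
import re

# Matches a "k<digits>" token delimited by underscores or the string edges,
# e.g. "..._engine_k256". Inferred from the example names, not a documented rule.
_K_TOKEN = re.compile(r"(?:^|_)k(\d+)(?:$|_)", re.IGNORECASE)


def retained_experts(artifact_name: str):
    """Return the retained-expert count encoded in the name, or None."""
    match = _K_TOKEN.search(artifact_name)
    return int(match.group(1)) if match else None


def is_full_expert_set(artifact_name: str, routed_experts: int) -> bool:
    """True when the name's k<n> equals the architecture's routed-expert count."""
    k = retained_experts(artifact_name)
    return k is not None and k == routed_experts


print(retained_experts("qwen3_6_35b_a3b_engine_k256"))
print(is_full_expert_set("qwen3_6_a3b_k128", routed_experts=256))
```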

mach-reap

REAP saliency pruning — removes low-saliency experts based on a saliency analysis plan. Distinct from layout slicing (mach-prune); used in experimentation pipelines.

Artifact workflow

Production serving expects an engine-format full-expert checkpoint plus expert_sidecar/ for direct-pread. The examples below use qwen3_6_35b_a3b_engine_k256 (the 256-expert Qwen reference checkpoint) as a concrete example; substitute your own checkpoint name.

Expected layout

qwen3_6_35b_a3b_engine_k256/
├── config.json
├── *.safetensors
├── tokenizer files
└── expert_sidecar/
    ├── layout.json
    └── layer_XX.bin  (every declared layer)
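
Before serving, it can help to verify that a directory matches this layout. The checker below is a sketch under stated assumptions: layer files are taken to be zero-based and two-digit (layer_00.bin, layer_01.bin, ...), the layer count is passed in by the caller rather than read from layout.json (whose schema is not described here), and tokenizer files are not checked because their names are not listed. missing_artifacts is a hypothetical helper.

```python
from pathlib import Path


def missing_artifacts(checkpoint, num_layers=None) -> list[str]:
    """Return the expected layout entries that are absent from `checkpoint`."""
    root = Path(checkpoint)
    missing = []
    if not (root / "config.json").is_file():
        missing.append("config.json")
    if not (root.is_dir() and any(root.glob("*.safetensors"))):
        missing.append("*.safetensors")

    sidecar = root / "expert_sidecar"
    if not (sidecar / "layout.json").is_file():
        missing.append("expert_sidecar/layout.json")

    if num_layers is None:
        # Without a layer count, only require that some layer file exists.
        if not (sidecar.is_dir() and any(sidecar.glob("layer_*.bin"))):
            missing.append("expert_sidecar/layer_XX.bin")
    else:
        # Assumed naming: zero-based, zero-padded to two digits.
        for i in range(num_layers):
            name = f"layer_{i:02d}.bin"
            if not (sidecar / name).is_file():
                missing.append(f"expert_sidecar/{name}")
    return missing
```

An empty list means every checked entry is present; it does not validate the contents of any file.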

Reference path in upstream repos: experiments/pipeline_v1/results/qwen3_6_35b_a3b_engine_k256.

This is an engine-layout conversion of upstream pre-quantized int4 weights — not REAP pruning, not custom quantization.

To download the prebuilt checkpoint from the Modal volume:

modal volume get local-moe-engine-checkpoints qwen3_6_35b_a3b_engine_k256 ./experiments/pipeline_v1/results/

Build the checkpoint yourself

Slice from a candidate plan:

modal run scripts/slice_qwen3_5_moe_modal.py::main \
  --plan-rel-path experiments/pipeline_v1/local_plans/qwen3_6_35b_a3b_candidate_plans_uniform_k256.jsonl \
  --candidate-id qwen3_6_35b_a3b_uniform_avgk256 \
  --output-subdir qwen3_6_35b_a3b_engine_k256

Export expert sidecar (required for production direct-pread):

python scripts/export_expert_sidecar.py \
  --checkpoint experiments/pipeline_v1/results/qwen3_6_35b_a3b_engine_k256 \
  --output experiments/pipeline_v1/results/qwen3_6_35b_a3b_engine_k256/expert_sidecar \
  --num-experts 256 \
  --bits 4

Draft model (DFlash)

Either local:

experiments/pipeline_v1/results/eagle3_training/dflash_draft

Or Hugging Face: z-lab/Qwen3.6-35B-A3B-DFlash.

Pass the draft location to mach-serve with --draft-dir when it is not on a default discovery path. See Speculative decoding.

Serve after artifacts are ready

mach-serve experiments/pipeline_v1/results/qwen3_6_35b_a3b_engine_k256 --streaming --port 8080

Compatibility launchers

Kept for scripts and CI, not primary documentation:

Entry	Notes
`mach-serve ... --backend production`	Alias for production profile
`mach-serve ... --backend dflash`	Legacy alias
`scripts/serve_production.py`	Thin launcher
`scripts/serve_dflash.py`	Legacy alias
`experiments/pipeline_v1/opencode_serve.sh production`	Repo launcher wrapper

Installation — install extras before building artifacts
Conversion — mach convert 2-bit IQ2 GGUF pipeline
Expert residency — sidecar format
Serving — production mach-serve flags
Maniac integration — desktop catalog and env overrides for checkpoint paths
