Maniac Docs

Conversion

mach convert — one command to turn a higher-precision MoE master into a servable 2-bit IQ2 GGUF checkpoint, with the per-arch ConversionConfig seam and validation gates.

mach convert turns a higher-precision MoE master (bf16 / fp8 / mxfp4) into a servable 2-bit IQ2 GGUF checkpoint in one command. It replaces the seven-step manual recipe that previously lived in recipes/2bit-moe/scripts/* — those scripts are now thin wrappers over the in-package mach.conversion.* modules that this command drives.

The output is a drop-in checkpoint with a GGUF v2 expert sidecar (lme-gguf-expert-sidecar-v2) and a config.json that declares quantization.expert_format="gguf". The engine self-selects the fused GGUF decode path from that config.json, so the result is served exactly like any other checkpoint:

mach convert mlx-community/Qwen3.6-35B-A3B-bf16 --out ./qwen36-a3b-iq2 --gates full
mach-serve ./qwen36-a3b-iq2 --streaming --port 8080

The pipeline

mach convert threads six resumable phases, writing each phase's artifact under --out plus a conversion_report.json manifest:

resolve master (bf16/fp8/mxfp4)

   ├─ slice        per-expert split (keep-all — every routed expert retained)
   ├─ calibrate    importance matrix (imatrix); + per-layer Hessian for --gptq
   ├─ quantize     experts → 2-bit IQ2 GGUF blocks in the sidecar layout
   ├─ pack         servable checkpoint (config.json + non-experts + expert_sidecar/)
   └─ gates        validation gates certify the result
PhaseWhat it does
sliceSplits the master into per-expert tensors (keep-all; not a quality-reducing prune).
calibrateCollects the per-projection importance matrix (imatrix); with --gptq, also captures the per-layer input Hessian for block-GPTQ error feedback.
quantizeQuantizes every expert to 2-bit IQ2 GGUF blocks using the per-arch codec schedule (+ imatrix, + optional GPTQ).
packAssembles the servable checkpoint — config.json (with quantization.expert_format="gguf"), non-expert tensors, and the expert_sidecar/ (lme-gguf-expert-sidecar-v2).
gatesRuns the validation gates to certify quality (see Gates).

Each phase is resumable: --resume skips a phase whose artifact already exists, and --phase runs a single phase in isolation.

Usage

mach convert <hf_repo_or_path> --out DIR
  [--arch X]                         # auto-detected from config.json via detect_arch
  [--variant V] [--layers all]      # variant defaults to the arch's default_variant
  [--imatrix|--no-imatrix] [--gptq]
  [--gates {off,fast,full}]          # default fast; "full" certifies vs a bf16 teacher
  [--phase {slice,calibrate,quantize,pack,gates}]
  [--master-dtype {bf16,mxfp4,fp8}]  # default per-arch ConversionConfig
  [--backend {local,modal}]          # local default
  [--teacher-dir DIR] [--resume]
FlagPurpose
<hf_repo_or_path>Higher-precision master — a Hugging Face repo id or a local checkpoint directory.
--out DIROutput directory for phase artifacts, the packed checkpoint, and conversion_report.json.
--archArchitecture key; auto-detected from the master's config.json (detect_arch) when omitted.
--variantCodec-schedule variant from the arch's ConversionConfig (defaults to that config's default_variant; DeepSeek-V4 ships the validated variant a).
--layersLayer subset to convert (default all).
--imatrix / --no-imatrixEnable/disable importance-matrix calibration.
--gptqAdd block-GPTQ error feedback (captures a per-layer Hessian in the calibrate phase).
--gates {off,fast,full}Gate level (default fast); full certifies against a bf16 teacher.
--phase {slice,calibrate,quantize,pack,gates}Run a single phase in isolation.
--master-dtype {bf16,mxfp4,fp8}Read dtype of the master (default from the arch's ConversionConfig).
--backend {local,modal}Execution backend; local is the default. modal is not yet implemented as an in-process path (see Backends).
--teacher-dir DIRTeacher checkpoint for the full gates.
--resumeSkip phases whose artifacts already exist under --out.

Exit codes

CodeMeaning
0Conversion succeeded (gates passed, or gates off).
1A required validation gate failed.
2Usage or runtime error — including a missing native libiqk codec (see Native dependency).

The ConversionConfig seam

The constants the old recipe hardcoded for DeepSeek-V4 are now policy on a per-model_type ConversionConfig (mach.conversion.config), while shapes are derived from config.json at runtime so one config serves any SKU of an architecture:

Was hardcodedNew home
HID, INTER, NE, NL dimsresolve_shape(config_json, adapter) (reuses the arch registry + slicer expert-count key path)
depth-graded down codec scheduleConversionConfig.schedules[variant]
master is mxfp4ConversionConfig.master_dtype
num_hash_layers, top-kConversionConfig.num_hash_layers / moe_top_k
corpus paths, teacher/refppl protocolConversionConfig.calibration / gate

CONVERSION_REGISTRY is kept in lockstep with the arch registry (io.arch_registry.REGISTRY) — the module raises at import if they drift — so a new architecture cannot get a serving adapter without a conversion policy, or vice versa.

Validated vs. untuned architectures

ArchitectureConversion status
deepseek_v4Validated — ships the certified codec schedule (variant a).
qwen3_5_moe (Qwen3.6-35B-A3B)Validated
qwen3_moeRegistered, untuned placeholder schedule
gpt_ossRegistered, untuned placeholder schedule
gemma4Registered, untuned placeholder schedule

Architectures whose depth schedule is still a conservative placeholder carry schedule_tuned=False. The full gates surface a non-fatal schedule_untuned warning for them, so a placeholder can't masquerade as certified. Re-derive the per-layer bit schedule from the model's own sensitivity before treating an untuned arch as production-ready.

Validation gates

LevelWhat runs
offNo gates — fastest, no certification.
fast (default)Cheap structural checks (codec round-trip, shape/geometry validation).
fullAdds output-space certification against a bf16 teacher (KL / top-1 vs the teacher; pass --teacher-dir), plus the schedule_untuned warning for untuned archs.

A failed required gate exits 1. See recipes/2bit-moe/README.md for the full gate rationale (block recon, Metal parity, end-to-end decode parity, per-domain KL/top-1, on-box fit).

Backends

--backend local (the default) runs the full pipeline in-process on Apple Silicon. --backend modal is recognized but not yet implemented as an in-process path; large-scale Modal-based quantization still goes through the recipe's standalone Modal workers (recipes/2bit-moe/scripts/modal_quant.py), because the per-shard quant runs MLX-free on Linux.

Native dependency

The IQ2 codec (IQ2_K / IQ2_KS / IQ2_KL quantize + dequant) is provided by the native libiqk library. Build it once before converting:

python vendor/local_moe_engine/scripts/build_libiqk.py

A missing libiqk makes mach convert exit 2. The codec also ships as the conversion optional-dependency extra.

Relationship to the recipe

The reproducible 2-bit MoE technique writeup — the measured quality/speed results, the codec-and-bit-allocation rationale, and the per-model calibration guidance — lives in recipes/2bit-moe/README.md. That recipe is now driven by mach convert; its scripts/* remain only as thin back-compat wrappers over mach.conversion.* for running a single phase by hand.

  • CLI — discovery, generation, and artifact commands
  • Serving — serve the converted checkpoint with mach-serve --streaming
  • Expert residency — the expert sidecar format the pack phase emits
  • Installation — extras and native builds

On this page