Maniac Docs

Expert Residency

Streaming, stacked, and bf16_streaming modes — sidecar format, direct-pread, bank sizing, and transient prefill with resident decode.

Expert residency is the core memory story for MoE on Apple Silicon. Routed expert weights usually far exceed the memory the GPU can use, so the engine keeps a bounded bank of hot experts and streams misses from disk, or holds every expert resident when the checkpoint and machine allow.

Three residency modes

Mode | Selector | Experts live | When to use
streaming | --streaming | Disk (sliced safetensors + packed sidecar); bounded GPU bank | Memory-constrained machines; experts streamed from SSD into a bounded bank
stacked | --no-streaming | All compatible experts resident in GPU (ResidentStackedExpertCache) | Machines with enough RAM to hold every expert resident, using a stacked MLX export
bf16_streaming | expert_residency=bf16_streaming / LME_BF16_STACKED_STREAMING=1 | Full-precision stacked HF/SwiftLM slices streamed into BF16 banks | Bakeoffs / full-precision comparison, not the production sidecar path

Use mach check to predict which mode fits a given checkpoint on your machine before loading weights.

mach-serve requires an explicit --streaming or --no-streaming choice. --streaming enables transient prefill + resident decode for streaming sidecar serving without manually setting LME_TRANSIENT_PREFILL=1.

Streaming (--streaming)

  • Experts on disk; BankCache / LayerExpertBank holds a working subset per layer.
  • Misses are read with pread from SSD via native sidecar I/O or safetensors slices.
  • Transient prefill: prefill uses a shared transient scratch arena so large prompt windows do not pin the full expert set.
  • Resident decode: decode keeps a smaller resident bank; optional decode-arena reclaim and hot-expert pinning via LME_DECODE_RESIDENCY.

This is the default streaming path for full-expert sliced checkpoints; run mach check to confirm it fits your machine.
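
The bounded-bank idea can be illustrated with a toy model. This is a minimal sketch, not the engine's BankCache or LayerExpertBank: the class name, the `load_expert` callback, and the LRU policy are assumptions chosen for brevity (the production eviction policy described on this page is lookahead, not LRU).

```python
from collections import OrderedDict
from typing import Callable


class ToyExpertBank:
    """Bounded per-layer bank: resident experts are hits, the rest load on demand."""

    def __init__(self, capacity: int, load_expert: Callable[[int], bytes]):
        self.capacity = capacity
        self.load_expert = load_expert   # hypothetical loader, e.g. a disk read
        self.slots = OrderedDict()       # expert_id -> weights, oldest first
        self.hits = self.misses = self.evictions = 0

    def get(self, expert_id: int) -> bytes:
        if expert_id in self.slots:
            self.hits += 1
            self.slots.move_to_end(expert_id)   # mark as most recently used
            return self.slots[expert_id]
        self.misses += 1
        if len(self.slots) >= self.capacity:
            self.slots.popitem(last=False)      # evict the least recently used
            self.evictions += 1
        weights = self.load_expert(expert_id)
        self.slots[expert_id] = weights
        return weights


bank = ToyExpertBank(capacity=2, load_expert=lambda eid: f"expert-{eid}".encode())
for eid in [3, 7, 3, 9, 7]:
    bank.get(eid)
print(bank.hits, bank.misses, bank.evictions)   # prints: 1 4 2
```

With only two slots, the second request for expert 7 misses because expert 9 displaced it. That is the kind of churn a larger bank or a smarter eviction policy is meant to reduce.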

Stacked (--no-streaming)

  • Requires axis-0 stacked MLX checkpoint compatible with ResidentStackedExpertCache.
  • No admission/eviction — all routed experts resident.
  • Fails fast if the checkpoint layout cannot support resident stacked banks.

Use this mode only when you intend to load every compatible quantized expert into memory.

BF16 streaming

  • Opt-in full-precision streaming for plain HF/SwiftLM stacked checkpoints.
  • Native safetensor pread into bounded BF16 banks (fallback: staged MLX load).
  • Counters: bf16_pread_*, staged_load_*.
  • Not the production quant sidecar path.

Transient prefill + resident decode

For quant sidecar streaming:

mach-serve /path/to/<your-engine-checkpoint> --streaming --port 8080

Startup should report:

  • production_streaming_mode=streaming
  • expert_residency=streaming
  • transient_prefill=1
  • native_transient_prefill=1
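
A startup check for these signals could be scripted along the following lines. This is a sketch under one loud assumption: that the signals appear in the startup output as whitespace-separated `key=value` tokens. The exact log format is not specified here, so adjust the parsing to what your build prints.

```python
# Expected values, taken from the list above.
EXPECTED = {
    "production_streaming_mode": "streaming",
    "expert_residency": "streaming",
    "transient_prefill": "1",
    "native_transient_prefill": "1",
}


def parse_signals(log_text: str) -> dict:
    """Collect key=value tokens from log text (assumed whitespace-separated)."""
    signals = {}
    for token in log_text.split():
        key, sep, value = token.partition("=")
        if sep and key:
            signals[key] = value
    return signals


def mismatches(log_text: str) -> dict:
    """Return {key: (expected, actual)} for each signal that is missing or differs."""
    seen = parse_signals(log_text)
    return {k: (v, seen.get(k)) for k, v in EXPECTED.items() if seen.get(k) != v}


# Made-up sample output in which one signal is off.
sample = """
production_streaming_mode=streaming expert_residency=streaming
transient_prefill=1 native_transient_prefill=0
"""
print(mismatches(sample))   # {'native_transient_prefill': ('1', '0')}
```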

Prefill experts are loaded transiently for the current prompt window; decode retains a resident expert bank sized by --expert-cache-gb and related knobs.
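
To get a feel for what a given --expert-cache-gb buys, a back-of-envelope estimate helps. The sketch below is not the engine's sizing logic. It assumes each expert is a gated MLP with three projections, that quantization adds one fp16 scale and one fp16 bias per group of 64 weights, and that the budget is split evenly across layers. The dimensions in the example are hypothetical and are not taken from any particular checkpoint.

```python
def expert_bytes(hidden: int, intermediate: int, bits: int, group_size: int = 64) -> int:
    """Approximate bytes for one gated-MLP expert (gate, up, down projections)."""
    params = 3 * hidden * intermediate
    packed = params * bits // 8
    # Assumed overhead: one fp16 scale and one fp16 bias per quantization group.
    overhead = (params // group_size) * 2 * 2
    return packed + overhead


def slots_per_layer(cache_gb: float, num_layers: int, per_expert: int) -> int:
    """Experts that fit per layer if the budget is split evenly across layers."""
    budget = int(cache_gb * 1024**3)
    return budget // (num_layers * per_expert)


# Hypothetical model: hidden=2048, intermediate=768, 4-bit weights, 40 MoE layers.
per_expert = expert_bytes(hidden=2048, intermediate=768, bits=4)
print(per_expert)                                           # 2654208 bytes
print(slots_per_layer(8.0, num_layers=40, per_expert=per_expert))   # 80
```

Under these assumptions an 8 GB bank holds about 80 experts per layer. Compare that figure with the routed expert count per layer to judge how much eviction pressure to expect.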

Expert sidecar format

Production direct-pread expects:

checkpoint/
├── config.json
├── *.safetensors
└── expert_sidecar/
    ├── layout.json
    ├── layer_00.bin
    ├── layer_01.bin
    └── ...

  • layout.json: metadata: format (lme-expert-sidecar-v1, GGUF v1/v2), layer count, expert count, record layout.
  • layer_XX.bin: packed per-layer expert records for native I/O.

Export with:

python scripts/export_expert_sidecar.py \
  --checkpoint /path/to/<your-engine-checkpoint> \
  --output /path/to/<your-engine-checkpoint>/expert_sidecar \
  --num-experts <routed-experts-per-layer> \
  --bits 4

--num-experts must match the model's routed expert count per layer (e.g. 256 for Qwen3.6-35B-A3B).

Missing or invalid sidecar → production fail-fast (sidecar_valid=0) unless diagnostic fallback is explicitly enabled.
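
Before starting the server, a quick structural check of the sidecar directory can catch the most common cause of a fail-fast exit. This is a rough sketch rather than the engine's validator: the `num_layers` key in layout.json is a hypothetical name (inspect your own layout.json for the real field), and the check only tests file presence, not record contents.

```python
import json
from pathlib import Path


def missing_sidecar_files(checkpoint: Path, layer_key: str = "num_layers") -> list:
    """List absent sidecar files; an empty list means the layout looks complete."""
    sidecar = checkpoint / "expert_sidecar"
    layout_path = sidecar / "layout.json"
    if not layout_path.is_file():
        return [str(layout_path)]
    layout = json.loads(layout_path.read_text())
    num_layers = int(layout[layer_key])   # hypothetical key name, see note above
    # File names follow the two-digit pattern shown in the directory tree.
    expected = [sidecar / f"layer_{i:02d}.bin" for i in range(num_layers)]
    return [str(p) for p in expected if not p.is_file()]


# Usage: an empty result means every expected file is present.
# print(missing_sidecar_files(Path("/path/to/<your-engine-checkpoint>")))
```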

Direct-pread fast path

lme_mlx_pread_ext preads sidecar bytes directly into persistent bank slots.

Signal | Meaning
direct_pread=1 | Fast path active
direct_pread_bytes / direct_pread_syscalls | Runtime counters
native_extension=ready | Extension import succeeded
fallback_policy=fail-fast | Production default
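
The idea behind the fast path is that a positioned read (pread) fetches exactly one expert record by offset, with no seek and no intermediate staging of the whole file. The sketch below shows that pattern with Python's standard `os.pread`. It is not lme_mlx_pread_ext: the fixed 16-byte record size, the byte-filled fake layer file, and the bytearray bank are all hypothetical, and unlike the native extension this version copies through a temporary bytes object.

```python
import os
import tempfile

RECORD_BYTES = 16   # hypothetical fixed record size


def pread_into_slot(fd: int, bank: bytearray, slot: int, expert_id: int) -> int:
    """Read one expert record at its file offset and place it in a bank slot."""
    data = os.pread(fd, RECORD_BYTES, expert_id * RECORD_BYTES)
    if len(data) != RECORD_BYTES:
        raise IOError(f"short read for expert {expert_id}")
    start = slot * RECORD_BYTES
    bank[start:start + RECORD_BYTES] = data   # overwrite the slot in place
    return len(data)


with tempfile.NamedTemporaryFile() as f:
    # Fake layer file: 4 records, each filled with its own expert id.
    f.write(b"".join(bytes([eid]) * RECORD_BYTES for eid in range(4)))
    f.flush()
    bank = bytearray(2 * RECORD_BYTES)        # persistent bank with 2 slots
    fd = os.open(f.name, os.O_RDONLY)
    try:
        pread_into_slot(fd, bank, slot=0, expert_id=3)
        pread_into_slot(fd, bank, slot=1, expert_id=1)
    finally:
        os.close(fd)

print(bank[0], bank[RECORD_BYTES])   # prints: 3 1
```

Each call costs one syscall and moves one record, which is roughly what counters such as direct_pread_syscalls and direct_pread_bytes would be expected to track.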

Related production optimizations (on by default with sidecar):

  • LME_NATIVE_PREFILL_FUSED=1 — plan admission + read misses in one native pass
  • LME_NATIVE_BANK_HANDLE=1 — reuse native sidecar/bank handles across commits
  • LME_NATIVE_ROUTE_MAP=1 — update expert-id → slot map in native commit path
  • LME_DIRECT_PREAD_EVAL_MODE=minimal — eval only changed slots after bank mutation

Fail-fast vs diagnostic Python fallback

Production fails fast when the native extension or the sidecar is missing or incomplete. For intentional debugging only:

LME_ALLOW_DIAGNOSTIC_PYTHON_FALLBACK=1 mach-serve ... --streaming

This path is slow (~6 tok/s prefill) and not a serving mode. See Troubleshooting.

Gather dispatch

LME_USE_GATHER_DISPATCH=1 (default on) routes production traffic through GatherSwitchGLU instead of legacy DiskSwitchGLU per-slot loops. Works across streaming, stacked, and bf16_streaming when the bank backend supports gather.

Memory and bank sizing

Knob | Default | Purpose
--wired-gb | 9 | Metal wired memory limit
--expert-cache-gb | (profile-dependent) | Resident expert bank size for decode
--bank-capacity-per-layer |  | Cap slots per layer
LME_BANK_EVICTION_POLICY | lookahead (production) | Retain experts visible in future dispatch windows

Eviction policy lookahead: under tight slot budgets, the bank keeps experts that upcoming dispatch windows will need instead of evicting purely by recency (LRU).
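
One way to picture a lookahead policy is as victim selection that consults upcoming routes before falling back to recency. The function below is an illustrative sketch, not the engine's implementation: `pick_victim`, the `resident` ordering (least recently used first), and the flat `upcoming` list of expert ids are assumptions made for the example.

```python
def pick_victim(resident: list, upcoming: list) -> int:
    """Choose which resident expert to evict.

    Prefers an expert absent from `upcoming`; otherwise the one whose next use
    is furthest away. `resident` is ordered least recently used first, so ties
    fall back to LRU.
    """
    def next_use(expert_id: int) -> int:
        try:
            return upcoming.index(expert_id)
        except ValueError:
            return len(upcoming)   # not needed in the visible window

    # max() returns the first maximal element, i.e. the oldest among ties.
    return max(resident, key=next_use)


# Expert 5 is the oldest, but it is needed again soon; expert 2 is not.
print(pick_victim(resident=[5, 2, 8], upcoming=[8, 5, 8, 1]))   # prints: 2
```

Pure LRU would have evicted expert 5 here and then paid a miss for it one step later.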

Hit-only fast path — when all requested experts are already resident, skips miss planning, scatter, and cleanup.

Observability

GET /v1/cache/stats includes streaming_summary:

  • hits / misses / evictions
  • direct_pread_*, native_miss_*, bf16_pread_*
  • hit_only_fastpath_*, inline_prev_prefetch_*

High evictions → raise --wired-gb and/or --expert-cache-gb, or confirm --streaming so prefill stays transient. See Troubleshooting.
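
The counters can be turned into a couple of ratios that make "high evictions" easier to judge. This sketch assumes that `streaming_summary` is a top-level object in the JSON response and that hits, misses, and evictions are flat numeric fields inside it. The field names come from the list above, but the exact nesting is an assumption, and `summarize` is a hypothetical helper.

```python
import json
from urllib.request import urlopen


def fetch_streaming_summary(base_url: str) -> dict:
    """GET /v1/cache/stats and return its streaming_summary object."""
    with urlopen(f"{base_url}/v1/cache/stats") as resp:
        return json.load(resp)["streaming_summary"]


def summarize(summary: dict) -> dict:
    """Derive hit rate and evictions per miss from the raw counters."""
    hits = summary.get("hits", 0)
    misses = summary.get("misses", 0)
    evictions = summary.get("evictions", 0)
    lookups = hits + misses
    return {
        "hit_rate": hits / lookups if lookups else 0.0,
        "evictions_per_miss": evictions / misses if misses else 0.0,
    }


# Against a running server (port taken from the mach-serve example above):
# print(summarize(fetch_streaming_summary("http://localhost:8080")))
print(summarize({"hits": 90, "misses": 10, "evictions": 5}))
```

An evictions-per-miss ratio close to 1 suggests the bank is full and every miss displaces another expert, which is the situation the knobs above are meant to relieve.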

Optional experiments (off in production)

Flag | Notes
LME_ASYNC_PREFETCH=1 | Heuristic staged prefetch; opt-in
LME_INLINE_PREV_PREFETCH=1 | Previous-route prefetch hints
LME_GATHER_DISPATCH_GROUPING=expert | Sort prefill routes by expert
LME_FUSE_GATE_UP=1 | Fuse gate/up gather (streaming concat cache via LME_FUSE_GATE_UP_STREAMING=1)
