Troubleshooting

Diagnose production slow paths — direct pread, cache churn, slow prefill, stats timeouts, and diagnostic fallback.

Production mach-serve is designed to fail fast when the native fast path is unavailable. Use this page when startup logs or /v1/cache/stats indicate you are on a diagnostic or degraded path.

See Serving for expected startup signals and Installation for building native extensions.

direct_pread.active=0 or direct_pread.mode=none

Symptom: In the /v1/cache/stats response, streaming_summary shows direct_pread.active=0 or direct_pread.mode=none.

Cause:

lme_mlx_pread_ext not built or not importable
expert_sidecar/ missing or invalid
Direct pread explicitly disabled

Fix:

pip install -e ".[dev,dflash,native]"
python scripts/build_mlx_pread_ext.py
python -c "import lme_mlx_pread_ext"
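The import check above can also be run from a preflight script so that a failure is reported instead of raised. This is a minimal sketch using only the standard library; the helper name `extension_status` is illustrative and not part of the project.

```python
import importlib


def extension_status(module_name: str = "lme_mlx_pread_ext") -> tuple[bool, str]:
    """Try to import a module and return (ok, detail) instead of raising."""
    try:
        importlib.import_module(module_name)
    except ImportError as exc:
        # Covers both "not installed" and "built but failed to load".
        return False, f"{module_name}: {exc}"
    return True, f"{module_name}: importable"
```

Calling `extension_status()` in the same virtual environment that runs mach-serve shows whether the extension is visible to that interpreter, which is the case that matters at startup.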

Restart mach-serve and confirm startup logs:

startup direct-pread ... native_extension=ready ... fallback_policy=fail-fast
production optimizations direct_pread=1 ...

If the startup log reports sidecar_valid=0, re-export the sidecar:

python scripts/export_expert_sidecar.py \
  --checkpoint /path/to/checkpoint \
  --output /path/to/checkpoint/expert_sidecar \
  --num-experts 256 \
  --bits 4
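Before or after exporting, a quick presence check can rule out the simplest failure, a missing or empty directory. The sketch below assumes the sidecar lives at `expert_sidecar/` inside the checkpoint directory, as in the command above. The helper name `sidecar_present` is hypothetical, and the check says nothing about whether the contents are valid; the server's sidecar_valid value remains the authority on that.

```python
from pathlib import Path


def sidecar_present(checkpoint_dir: str) -> bool:
    """Return True if <checkpoint_dir>/expert_sidecar exists and has at least one entry."""
    sidecar = Path(checkpoint_dir) / "expert_sidecar"
    # Presence only: this does not inspect or validate the sidecar files.
    return sidecar.is_dir() and any(sidecar.iterdir())
```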

See Expert residency and CLI.

High evictions / cache churn

Symptom: streaming_summary shows high evictions, low hit rate, or rising misses under steady decode.

Cause: Resident decode bank too small for the workload's expert working set.

Fix:

Increase --wired-gb (Metal wired memory limit in GB, default 9)
Increase --expert-cache-gb (resident bank size)
On constrained machines, use --streaming so prefill stays transient while decode remains resident

Example:

mach-serve /path/to/<your-engine-checkpoint> --streaming --wired-gb 10 --expert-cache-gb 8 --port 8080

Review LME_BANK_EVICTION_POLICY=lookahead behavior in Expert residency.
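To judge whether a change to --wired-gb or --expert-cache-gb helped, it is useful to compare two stats snapshots taken during steady decode rather than reading cumulative totals. The sketch below assumes each snapshot exposes integer counters under the keys `hits`, `misses`, and `evictions`. Those key names are hypothetical placeholders; substitute the fields your streaming_summary actually reports.

```python
def churn_between(before: dict, after: dict) -> dict:
    """Summarize cache churn over the interval between two counter snapshots."""
    # Hypothetical counter names; adapt to the real streaming_summary fields.
    hits = after["hits"] - before["hits"]
    misses = after["misses"] - before["misses"]
    evictions = after["evictions"] - before["evictions"]
    lookups = hits + misses
    return {
        # None when there was no cache traffic in the interval.
        "hit_rate": hits / lookups if lookups else None,
        "misses": misses,
        "evictions": evictions,
    }
```

A falling interval hit rate together with a growing eviction count is the churn pattern described in the symptom above.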

Very slow prefill (~6 tok/s)

Symptom: Prefill throughput is orders of magnitude below the expected rate; logs show diagnostic fallback.

Cause:

Python diagnostic fallback path (fallback_policy=diagnostic-python)
Native pread disabled or missing
Non-production backend (--backend openai)

Fix:

Confirm effective_backend=production profile=production
Confirm startup decode_path=specdec-v7
Confirm native_extension=ready and direct_pread=1
Avoid the generic openai backend (--backend openai) for production benchmarks

If you intentionally enabled diagnostic fallback for debugging, expect slow prefill; this path is not a serving mode.

/v1/cache/stats timeout or stale_generation_in_flight

Symptom: The cache stats request hangs, times out, or returns stale_generation_in_flight.

Cause: The single-flight _GENERATION_LOCK is held while a generation is in progress, so stats sampling contends with the active decode loop.

Fix:

Retry after the current generation completes
Sample stats between requests for reliable snapshots
For concurrent clients, consider opt-in Continuous batching (which gives up DFlash at batch sizes B ≥ 2)
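The retry advice can be wrapped in a small helper. This is a sketch under stated assumptions, not the server's client API: `fetch` is any zero-argument callable you supply that returns the decoded JSON body of /v1/cache/stats and raises `TimeoutError` when the request times out, and the stale marker is assumed to appear somewhere in the response body. The function and parameter names are illustrative.

```python
import json
import time

STALE_MARKER = "stale_generation_in_flight"


def fetch_stats_with_retry(fetch, attempts=5, delay_s=2.0, sleep=time.sleep):
    """Call fetch() until it returns a non-stale snapshot or attempts run out.

    Returns the snapshot dict, or None if every attempt timed out or was stale.
    """
    for attempt in range(attempts):
        try:
            snapshot = fetch()
        except TimeoutError:
            snapshot = None
        # Assumption: a stale response carries the marker somewhere in its body.
        if snapshot is not None and STALE_MARKER not in json.dumps(snapshot):
            return snapshot
        if attempt < attempts - 1:
            sleep(delay_s)  # Give the in-flight generation time to finish.
    return None
```

A fixed delay keeps the sketch short. With long generations, a delay closer to the typical request duration avoids polling the lock repeatedly.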

Diagnostic fallback override (explicitly slow)

Python fallback is diagnostic only, not production.

LME_ALLOW_DIAGNOSTIC_PYTHON_FALLBACK=1 mach-serve /path/to/checkpoint --streaming --port 8080

Use only when debugging startup blockers. Maniac Desktop may keep --streaming with this policy when expert_sidecar/ is missing from a catalog path. In that case, prefer fixing the checkpoint or setting MANIAC_LOCAL_MOE_CHECKPOINT_PATH.

Startup red flags checklist

Log value	Meaning	Action
`sidecar_valid=0`	Sidecar missing or corrupt	Export or fetch the engine-format sidecar artifact
`native_extension=missing`	`lme_mlx_pread_ext` import failed	Rebuild extension
`fallback_policy=diagnostic-python`	On Python slow path	Fix native + sidecar
`decode_path` ≠ `specdec-v7`	Spec-dec disabled or wrong backend	Install `[dflash]`, pass `--draft-dir`, avoid `--target-only`
`effective_backend=openai`	Generic mlx-lm path	Use production `mach-serve` without `--backend openai`
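The checklist above can be applied to captured startup logs mechanically. The sketch below encodes one predicate per row and assumes log lines carry whitespace-separated key=value tokens, as in the startup lines quoted earlier on this page. The helper name `find_red_flags` is illustrative, and the real log format may need a different tokenizer.

```python
import re

# One predicate per row of the checklist; each returns True for a red flag.
RED_FLAGS = {
    "sidecar_valid": lambda v: v == "0",
    "native_extension": lambda v: v == "missing",
    "fallback_policy": lambda v: v == "diagnostic-python",
    "decode_path": lambda v: v != "specdec-v7",
    "effective_backend": lambda v: v == "openai",
}

# Assumption: values contain no whitespace and follow the key after "=".
_KV = re.compile(r"(\w+)=(\S+)")


def find_red_flags(log_text: str) -> list[str]:
    """Return 'key=value' for every token in log_text that matches a red flag."""
    flagged = []
    for key, value in _KV.findall(log_text):
        is_bad = RED_FLAGS.get(key)
        if is_bad is not None and is_bad(value):
            flagged.append(f"{key}={value}")
    return flagged
```

An empty result only means none of the listed values appeared; it does not confirm that the expected healthy lines were logged.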

Maniac Desktop-specific

Issue	Check
Engine not starting	`MANIAC_LOCAL_MOE_ENABLED=1`, venv install logs
Wrong checkpoint	`MANIAC_LOCAL_MOE_CHECKPOINT_PATH`, catalog sidecar presence
Stale vendored code	`pnpm run vendor:local-moe:check`
Readiness timeout	First expert load can take minutes — 600s startup budget

See Maniac integration.

Observability commands

curl -s http://127.0.0.1:8080/v1/stats | jq '.adaptive_block_policy, .specdec_drafted, .specdec_accepted'
curl -s http://127.0.0.1:8080/v1/cache/stats | jq '.streaming_summary, .prefix_cache'
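The two spec-dec counters selected by the first command are easier to read as a ratio. The sketch below assumes `specdec_drafted` and `specdec_accepted` are cumulative numeric counters in the /v1/stats payload; the helper name is illustrative.

```python
def specdec_acceptance_rate(stats: dict):
    """Return accepted/drafted from a /v1/stats payload, or None if nothing was drafted."""
    # Assumption: both fields are cumulative counts; missing or null counts as 0.
    drafted = stats.get("specdec_drafted") or 0
    accepted = stats.get("specdec_accepted") or 0
    if drafted <= 0:
        return None
    return accepted / drafted
```

Because the counters are assumed to be cumulative, subtracting an earlier snapshot from a later one before dividing gives the rate for a specific interval.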

Focused residency probe (upstream script):

python scripts/run_expert_residency_probe.py --url http://127.0.0.1:8080
