Troubleshooting
Diagnose production slow paths — direct pread, cache churn, slow prefill, stats timeouts, and diagnostic fallback.
Production mach-serve is designed to fail fast when the native fast path is unavailable. Use this page when startup logs or /v1/cache/stats indicate you are on a diagnostic or degraded path.
See Serving for expected startup signals and Installation for building native extensions.
direct_pread.active=0 or direct_pread.mode=none
Symptom: /v1/cache/stats → streaming_summary shows direct_pread.active=0 or mode=none.
Cause:
- lme_mlx_pread_ext not built or not importable
- expert_sidecar/ missing or invalid
- Direct pread explicitly disabled
Fix:
pip install -e ".[dev,dflash,native]"
python scripts/build_mlx_pread_ext.py
python -c "import lme_mlx_pread_ext"
Restart mach-serve and confirm startup logs:
startup direct-pread ... native_extension=ready ... fallback_policy=fail-fast
production optimizations direct_pread=1 ...
Re-export sidecar if sidecar_valid=0:
python scripts/export_expert_sidecar.py \
--checkpoint /path/to/checkpoint \
--output /path/to/checkpoint/expert_sidecar \
--num-experts 256 \
--bits 4
See Expert residency and CLI.
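The two preconditions above (the extension can be located, and expert_sidecar/ exists under the checkpoint) can be checked from Python before restarting. This is a minimal sketch, assuming the sidecar lives at <checkpoint>/expert_sidecar as in the export command. The helper name check_native_prereqs is hypothetical, and it does not validate the sidecar contents.

```python
import importlib.util
from pathlib import Path


def check_native_prereqs(checkpoint_dir, module_name="lme_mlx_pread_ext"):
    """Return booleans for the two direct-pread preconditions.

    Hypothetical helper: it only checks that the module can be located and
    that an expert_sidecar/ directory exists; it does not validate contents.
    """
    sidecar = Path(checkpoint_dir) / "expert_sidecar"
    return {
        # find_spec returns None when a top-level module cannot be located.
        "extension_found": importlib.util.find_spec(module_name) is not None,
        "sidecar_dir_present": sidecar.is_dir(),
    }
```

Calling check_native_prereqs("/path/to/checkpoint") returns a dict with both flags. Because find_spec locates the module without importing it, an extension that is present but fails to load would still pass here; the python -c import above remains the stricter check.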
High evictions / cache churn
Symptom: streaming_summary shows high evictions, low hit rate, or rising misses under steady decode.
Cause: Resident decode bank too small for the workload's expert working set.
Fix:
- Increase --wired-gb (Metal wired limit, default 9)
- Increase --expert-cache-gb (resident bank size)
- On constrained machines, use --streaming so prefill stays transient while decode remains resident
Example:
mach-serve /path/to/<your-engine-checkpoint> --streaming --wired-gb 10 --expert-cache-gb 8 --port 8080
Review LME_BANK_EVICTION_POLICY=lookahead behavior in Expert residency.
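To judge churn numerically, hit rate and eviction counts can be derived from two stats snapshots taken a few seconds apart under steady decode. This is a sketch, assuming streaming_summary exposes cumulative counters. The key names hits, misses, and evictions are assumptions, so adjust them to whatever your /v1/cache/stats payload actually reports.

```python
def churn_between(before, after):
    """Compare two counter snapshots and return hit rate and eviction delta.

    The key names "hits", "misses", and "evictions" are assumed, not
    confirmed field names of streaming_summary.
    """
    hits = after["hits"] - before["hits"]
    misses = after["misses"] - before["misses"]
    evictions = after["evictions"] - before["evictions"]
    lookups = hits + misses
    return {
        # None means there was no cache traffic in the sampling window.
        "hit_rate": hits / lookups if lookups else None,
        "misses": misses,
        "evictions": evictions,
    }
```

Comparing deltas rather than raw totals keeps warm-up misses from startup out of the picture, which matters when deciding whether a larger --expert-cache-gb actually helped.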
Very slow prefill (~6 tok/s)
Symptom: Prefill throughput orders of magnitude below expected; logs show diagnostic fallback.
Cause:
- Python diagnostic fallback path (fallback_policy=diagnostic-python)
- Native pread disabled or missing
- Non-production backend (--backend openai)
Fix:
- Confirm effective_backend=production profile=production
- Confirm startup decode_path=specdec-v7
- Confirm native_extension=ready and direct_pread=1
- Avoid generic openai backend for production benchmarks
If you intentionally enabled diagnostic fallback for debugging, expect slow prefill — this is not a serving mode.
/v1/cache/stats timeout or stale_generation_in_flight
Symptom: Cache stats request hangs, returns timeout, or stale_generation_in_flight.
Cause: Single-flight _GENERATION_LOCK is held during an in-progress generation. Stats sampling contends with the active decode loop.
Fix:
- Retry after the current generation completes
- Sample stats between requests for reliable snapshots
- For concurrent clients, consider opt-in Continuous batching (trades DFlash at B≥2)
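Retrying can be automated with a short backoff loop. The sketch below assumes a busy server shows up either as a request timeout or as a response that contains the stale_generation_in_flight marker. Where that marker appears in the payload is an assumption, and fetch_stats_with_retry is a hypothetical helper, not part of mach-serve.

```python
import json
import time
import urllib.request


def http_get_json(url, timeout=5.0):
    """Fetch a URL and decode the body as JSON (stdlib only)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def fetch_stats_with_retry(url, fetch=http_get_json, attempts=5, delay=1.0,
                           sleep=time.sleep):
    """Poll cache stats until a snapshot comes back without the stale marker.

    Assumption: a busy server either times out (raising OSError) or returns
    a payload that mentions "stale_generation_in_flight" somewhere.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            payload = fetch(url)
            if "stale_generation_in_flight" not in json.dumps(payload):
                return payload
            last_error = RuntimeError("stale_generation_in_flight")
        except OSError as exc:  # URLError and socket timeouts subclass OSError
            last_error = exc
        if attempt < attempts - 1:
            sleep(delay * (2 ** attempt))  # exponential backoff between tries
    raise last_error
```

For example, fetch_stats_with_retry("http://127.0.0.1:8080/v1/cache/stats") waits 1s, 2s, 4s, and so on between attempts. The fetch and sleep parameters are injectable so the loop can be exercised without a running server.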
Diagnostic fallback override (explicitly slow)
Python fallback is diagnostic only, not production.
LME_ALLOW_DIAGNOSTIC_PYTHON_FALLBACK=1 mach-serve /path/to/checkpoint --streaming --port 8080
Use only when debugging startup blockers. Maniac Desktop may keep --streaming with this policy when expert_sidecar/ is missing from a catalog path. Prefer fixing the checkpoint or setting MANIAC_LOCAL_MOE_CHECKPOINT_PATH.
Startup red flags checklist
| Log value | Meaning | Action |
|---|---|---|
| sidecar_valid=0 | Sidecar missing or corrupt | Export or fetch the engine-format sidecar artifact |
| native_extension=missing | lme_mlx_pread_ext import failed | Rebuild extension |
| fallback_policy=diagnostic-python | On Python slow path | Fix native + sidecar |
| decode_path ≠ specdec-v7 | Spec-dec disabled or wrong backend | Install [dflash], pass --draft-dir, avoid --target-only |
| effective_backend=openai | Generic mlx-lm path | Use production mach-serve without --backend openai |
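The checklist can be applied mechanically to a captured startup log line. This is a sketch that assumes log fields are whitespace-separated key=value tokens, as in the startup lines quoted on this page. find_red_flags is a hypothetical helper, not part of mach-serve.

```python
# Red-flag values and actions copied from the checklist table.
RED_FLAGS = {
    ("sidecar_valid", "0"): "Export or fetch the engine-format sidecar artifact",
    ("native_extension", "missing"): "Rebuild extension",
    ("fallback_policy", "diagnostic-python"): "Fix native + sidecar",
    ("effective_backend", "openai"): "Use production mach-serve without --backend openai",
}
EXPECTED_DECODE_PATH = "specdec-v7"


def parse_fields(line):
    """Collect key=value tokens from one log line, ignoring other words."""
    fields = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if sep and key:
            fields[key] = value
    return fields


def find_red_flags(line):
    """Return (field, value, action) tuples for every red flag in the line."""
    fields = parse_fields(line)
    found = [(k, v, RED_FLAGS[(k, v)])
             for k, v in fields.items() if (k, v) in RED_FLAGS]
    decode_path = fields.get("decode_path")
    if decode_path is not None and decode_path != EXPECTED_DECODE_PATH:
        found.append(("decode_path", decode_path,
                      "Install [dflash], pass --draft-dir, avoid --target-only"))
    return found
```

An empty result only means none of the listed values appeared in that line. A field that is absent from the line is not flagged, so run it over every startup line rather than a single one.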
Maniac Desktop-specific
| Issue | Check |
|---|---|
| Engine not starting | MANIAC_LOCAL_MOE_ENABLED=1, venv install logs |
| Wrong checkpoint | MANIAC_LOCAL_MOE_CHECKPOINT_PATH, catalog sidecar presence |
| Stale vendored code | pnpm run vendor:local-moe:check |
| Readiness timeout | First expert load can take minutes — 600s startup budget |
See Maniac integration.
Observability commands
curl -s http://127.0.0.1:8080/v1/stats | jq '.adaptive_block_policy, .specdec_drafted, .specdec_accepted'
curl -s http://127.0.0.1:8080/v1/cache/stats | jq '.streaming_summary, .prefix_cache'
Focused residency probe (upstream script):
python scripts/run_expert_residency_probe.py --url http://127.0.0.1:8080
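The spec-dec counters returned by /v1/stats are easier to read as a ratio. This is a sketch, assuming specdec_drafted and specdec_accepted are cumulative counters. The field names come from the jq filter above, but their exact semantics are an assumption.

```python
def specdec_acceptance_rate(stats):
    """Return accepted/drafted from a /v1/stats payload, or None if undefined."""
    drafted = stats.get("specdec_drafted") or 0
    accepted = stats.get("specdec_accepted") or 0
    if drafted <= 0:
        return None  # no drafts yet, or spec-dec is not active
    return accepted / drafted
```

A result of None alongside decode_path other than specdec-v7 in the startup logs would be consistent with spec-dec being disabled.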