Caching
Prefix cache L1/L2 for TTFT, warm_prefix endpoints, GDN snapshots, and optional TurboQuant KV compression.
Caching in mach targets time-to-first-token (TTFT) for repeated system and tool prefixes, plus optional KV memory reduction via TurboQuant. Expert bank caching is separate — see Expert residency.
Prefix cache overview
Production DFlash serving enables prefix caching by default (L2 on disk). Disable with --no-prefix-cache.
flowchart LR
prompt["Rendered prompt"] --> hash["SHA1 key"]
hash --> l1["L1 in-memory snapshots"]
hash --> l2["L2 DiskKVCache on disk"]
l2 --> restore["Restore KV on cache hit"]
restore --> ttft["Skip full prefill → lower TTFT"]
L1: in-memory
Hot prefix snapshots kept in process memory for immediate reuse within a session.
L2: on-disk DiskKVCache
- Keys: SHA1 of rendered prompt text
- Eviction: value-scored LRU (eviction_policy="value")
- Score: (effective_hits + 1) * tokens / file_size, with hit decay and an optional anchor multiplier
- Triggers: cold miss save, continued generation, displacement, shutdown (--disk-kv-dir)
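The key and score rules above can be sketched in a few lines. This is a minimal illustration, not the DiskKVCache implementation: the Entry fields, the exponential form of the hit decay, the anchor multiplier value, and the evict_to_budget helper are all assumptions made for the example. Only the SHA1 key and the (effective_hits + 1) * tokens / file_size formula come from the description above.

```python
import hashlib
from dataclasses import dataclass


def prefix_key(rendered_prompt: str) -> str:
    """SHA1 hex digest of the rendered prompt text, used as the cache key."""
    return hashlib.sha1(rendered_prompt.encode("utf-8")).hexdigest()


@dataclass
class Entry:
    key: str
    tokens: int             # prefix length in tokens
    file_size: int          # bytes on disk
    hits: int = 0
    idle_periods: int = 0   # hypothetical: decay periods since the last hit
    anchored: bool = False  # hypothetical: marks a stable system/tool prefix


def value_score(e: Entry, decay: float = 0.5, anchor_multiplier: float = 4.0) -> float:
    """(effective_hits + 1) * tokens / file_size.

    The exponential decay and the multiplier value are assumptions.
    """
    effective_hits = e.hits * (decay ** e.idle_periods)
    score = (effective_hits + 1) * e.tokens / e.file_size
    return score * anchor_multiplier if e.anchored else score


def evict_to_budget(entries: list[Entry], budget_bytes: int) -> list[Entry]:
    """Drop the lowest-scoring entries until the total size fits the budget."""
    kept = sorted(entries, key=value_score, reverse=True)
    while kept and sum(e.file_size for e in kept) > budget_bytes:
        kept.pop()  # the lowest score sits at the end of the list
    return kept
```

Under this scoring, a long prefix that is hit often and stored compactly outlives a short one-off prompt of the same file size, which is the behaviour a stable system or tool prefix needs.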
| Flag | Purpose |
|---|---|
| --prefix-cache / --no-prefix-cache | Enable or disable |
| --disk-kv-dir | L2 directory (production default sets location) |
| --disk-kv-budget-gb | Cap disk usage |
| --prefix-cache-max-entries | Entry count limit |
| --prefix-cache-max-gb | Memory budget for L1 |
| --prefix-cache-block-size | Block granularity |
Env: LME_DISK_KV_EVICTION_POLICY=value for value-scored eviction.
GDN hybrid snapshots
Qwen3.5/3.6 hybrid GDN + attention models use PrefixKVSnapshotStore for recurrent GDN state that is not trimmable like standard KV. Prefix restore must rehydrate both attention KV and GDN caches.
DFlash prefix snapshot management integrates with dflash-mlx; cold snapshot save is deferred until after first-token emission.
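To illustrate why a hybrid snapshot has to be restored as a unit, here is a toy store. It is not the real PrefixKVSnapshotStore: the HybridSnapshot and ToySnapshotStore names, their fields, and the shared_prefix_len argument are hypothetical. The sketch assumes that attention KV could in principle be sliced to a shorter prefix, whereas the recurrent GDN state is only valid at the exact token count where it was captured.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class HybridSnapshot:
    prefix_len: int     # token count at which the snapshot was taken
    attention_kv: Any   # K/V tensors for the full-attention layers
    gdn_state: Any      # recurrent GDN state, valid only at prefix_len


class ToySnapshotStore:
    """Hypothetical stand-in that maps a prefix key to one snapshot."""

    def __init__(self) -> None:
        self._snaps: dict[str, HybridSnapshot] = {}

    def save(self, key: str, snap: HybridSnapshot) -> None:
        self._snaps[key] = snap

    def restore(self, key: str, shared_prefix_len: int) -> Optional[HybridSnapshot]:
        snap = self._snaps.get(key)
        if snap is None:
            return None
        # The recurrent state cannot be rewound to an earlier token, so a
        # snapshot is usable only if the request shares the whole prefix.
        if shared_prefix_len < snap.prefix_len:
            return None
        # Both caches are returned together; restoring one alone would
        # leave the model state inconsistent.
        return snap
```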
Warming prefixes
Agent workloads repeat stable system and tool prefixes. Warm them before user traffic:
POST /v1/cache/warm_prefix
Prefill and cache a stable system/tool prefix without generating a full completion.
curl -s -X POST http://127.0.0.1:8080/v1/cache/warm_prefix \
-H "Content-Type: application/json" \
-d '{"messages":[{"role":"system","content":"You are a coding assistant."}]}'
GET /v1/cache/warm_prefix
Inspect warm-prefix status and cache entries.
Warming pairs with production opencode-sampled-v1 spec-dec for tool-calling sessions. See Serving.
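The same warm-up call can be issued from a startup script. The sketch below mirrors the curl example using only the Python standard library. The helper names are illustrative, and the assumption that the endpoint replies with a JSON body is not confirmed by this page.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8080"  # same host and port as the curl example


def build_warm_request(system_prompt: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST /v1/cache/warm_prefix request without sending it."""
    body = json.dumps({"messages": [{"role": "system", "content": system_prompt}]})
    return urllib.request.Request(
        f"{base_url}/v1/cache/warm_prefix",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def warm_prefix(system_prompt: str, timeout: float = 60.0) -> dict:
    """Send the warm request; assumes the server answers with JSON."""
    req = build_warm_request(system_prompt)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```

Calling warm_prefix("You are a coding assistant.") once per distinct system or tool prefix, before user traffic arrives, moves the cold prefill cost out of the first user request.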
Observability
GET /v1/cache/stats reports:
- prefix_cache_* counters
- DFlash-specific prefix stats
- streaming_summary for expert bank (see Expert residency)
- stale_generation_in_flight when the single-flight lock is held
GET /v1/stats also exposes adaptive_block_policy alongside cache-related fields.
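A small poller can pull out just the prefix cache counters. This sketch assumes the stats endpoint returns a flat JSON object whose prefix counters are top-level keys starting with prefix_cache_. The response shape and the sample field names in the usage note are assumptions, not documented fields.

```python
import json
import urllib.request


def prefix_counters(stats: dict) -> dict:
    """Keep only top-level fields whose names start with prefix_cache_."""
    return {k: v for k, v in stats.items() if k.startswith("prefix_cache_")}


def fetch_cache_stats(base_url: str = "http://127.0.0.1:8080", timeout: float = 5.0) -> dict:
    """GET /v1/cache/stats with a short timeout so a busy server fails fast."""
    with urllib.request.urlopen(f"{base_url}/v1/cache/stats", timeout=timeout) as resp:
        return json.load(resp)
```

For example, prefix_counters({"prefix_cache_hits": 7, "streaming_summary": {}}) keeps only the prefix_cache_hits entry. The explicit timeout matters because stats requests can stall while a generation is in flight (see Troubleshooting).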
TurboQuant KV compression
Optional approximate compression for full-attention KV only. GDN recurrent caches stay fp32.
mach-serve /path/to/checkpoint --streaming --turboquant-kv --turboquant-bits 4 ...
| Flag | Options / notes |
|---|---|
--turboquant-kv | Master enable |
--turboquant-bits | Quantization width |
--turboquant-group-size | Group size for quant kernels |
| Modes | v2_lean, v2_rotated, v3_* variants |
Expected benefit: up to ~5.5× KV memory reduction for long contexts.
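To put that factor in context, the uncompressed KV footprint of the full-attention layers follows the standard formula: 2 (keys and values) × layers × KV heads × head dimension × tokens × bytes per element. The model shape below is hypothetical and chosen only for illustration, and the 5.5 divisor is the best-case figure quoted above rather than a measured result.

```python
def kv_cache_bytes(attn_layers: int, kv_heads: int, head_dim: int,
                   tokens: int, bytes_per_elem: int = 2) -> int:
    """Uncompressed KV size: keys plus values for every full-attention layer."""
    return 2 * attn_layers * kv_heads * head_dim * tokens * bytes_per_elem


# Hypothetical model shape; GDN recurrent caches are excluded because
# they are not compressed.
baseline = kv_cache_bytes(attn_layers=16, kv_heads=4, head_dim=128, tokens=100_000)
best_case = baseline / 5.5  # the "up to ~5.5x" reduction, taken at face value

print(f"fp16 KV at 100k tokens: {baseline / 2**30:.2f} GiB")
print(f"best-case TurboQuant:   {best_case / 2**30:.2f} GiB")
```

For this made-up shape the baseline is about 3.05 GiB and the best case about 0.55 GiB. The saving scales linearly with context length, which is why the benefit shows up mainly on long contexts.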
Constraints
| Rule | Reason |
|---|---|
| No prefix-cache (de)serialization while TurboQuant active | Exact snapshot format incompatible with compressed KV |
| Cannot coexist with continuous batching | Shared KV mutation paths conflict |
Choose TurboQuant or continuous batching, not both. See Continuous batching.
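A launcher script can check these rules before starting the server. The sketch below is a hypothetical pre-flight check, not part of mach-serve: the ServeConfig class and its field names are invented for the example, and only the two rules themselves come from the table above.

```python
from dataclasses import dataclass


@dataclass
class ServeConfig:
    turboquant_kv: bool = False
    continuous_batching: bool = False  # hypothetical field name
    prefix_cache: bool = True


def check_conflicts(cfg: ServeConfig) -> list[str]:
    """Return one message per constraint that the configuration touches."""
    messages = []
    if cfg.turboquant_kv and cfg.continuous_batching:
        messages.append("TurboQuant KV cannot coexist with continuous batching")
    if cfg.turboquant_kv and cfg.prefix_cache:
        messages.append(
            "prefix-cache snapshots are not (de)serialized while TurboQuant is active"
        )
    return messages
```

An empty list means the combination is free of both constraints. The production defaults (prefix cache on, TurboQuant off) pass cleanly.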
Production defaults
| Capability | Production default |
|---|---|
| Prefix cache | Enabled (L2 on disk) |
| TurboQuant KV | Off |
| Disk KV eviction | Value-scored when DiskKVCache uses eviction_policy="value" |
Related pages
- Serving — cache HTTP endpoints
- Expert residency — expert bank vs KV cache
- Troubleshooting — stats timeout during generation