Caching

L1/L2 prefix caching to reduce TTFT, the warm_prefix endpoints, GDN snapshots, and optional TurboQuant KV compression.

Caching in mach reduces time-to-first-token (TTFT) for repeated system and tool prefixes, and can optionally reduce KV memory through TurboQuant. Expert bank caching is a separate mechanism; see Expert residency.

Prefix cache overview

Production DFlash serving enables prefix caching by default (L2 on disk). Disable with --no-prefix-cache.

flowchart LR
  prompt["Rendered prompt"] --> hash["SHA1 key"]
  hash --> l1["L1 in-memory snapshots"]
  hash --> l2["L2 DiskKVCache on disk"]
  l2 --> restore["Restore KV on cache hit"]
  restore --> ttft["Skip full prefill → lower TTFT"]
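
The lookup flow in the diagram can be sketched in a few lines of Python. This is a toy model, not the mach implementation: `TwoTierPrefixCache` and its methods are hypothetical names, plain dictionaries stand in for the in-memory store and `DiskKVCache`, and promoting an L2 hit into L1 is an assumption. Only the SHA1-of-rendered-prompt key comes from this page.

```python
from __future__ import annotations

import hashlib
from typing import Optional


def prefix_key(rendered_prompt: str) -> str:
    """SHA1 hex digest of the rendered prompt text, used as the cache key."""
    return hashlib.sha1(rendered_prompt.encode("utf-8")).hexdigest()


class TwoTierPrefixCache:
    """Hypothetical stand-in: dicts play the roles of L1 (memory) and L2 (disk)."""

    def __init__(self) -> None:
        self.l1: dict[str, bytes] = {}
        self.l2: dict[str, bytes] = {}

    def save(self, rendered_prompt: str, snapshot: bytes) -> None:
        key = prefix_key(rendered_prompt)
        self.l1[key] = snapshot
        self.l2[key] = snapshot

    def lookup(self, rendered_prompt: str) -> tuple[Optional[bytes], str]:
        """Return (snapshot, tier), where tier is 'l1', 'l2', or 'miss'."""
        key = prefix_key(rendered_prompt)
        if key in self.l1:
            return self.l1[key], "l1"
        if key in self.l2:
            # Assumption: an L2 hit is promoted into L1 for later reuse.
            self.l1[key] = self.l2[key]
            return self.l2[key], "l2"
        # A miss means the caller runs a full prefill.
        return None, "miss"
```

Because the key is a hash of the exact rendered text, any change to the prefix, including whitespace or template changes, yields a different key and therefore a miss.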

L1 — in-memory

Hot prefix snapshots kept in process memory for immediate reuse within a session.

L2 — on-disk `DiskKVCache`

Keys: SHA1 of rendered prompt text
Eviction: value-scored LRU (eviction_policy="value")
Score: (effective_hits + 1) * tokens / file_size with hit decay and optional anchor multiplier
Triggers: cold miss save, continued generation, displacement, shutdown (--disk-kv-dir)
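
To make the score concrete, here is a minimal sketch of the formula above. The base expression `(effective_hits + 1) * tokens / file_size` is taken from this page. The exponential half-life used for hit decay, the way the anchor multiplier is applied, and all function and field names are assumptions for illustration; the actual decay rule in `DiskKVCache` may differ.

```python
def decayed_hits(hits: float, age_seconds: float, half_life_seconds: float = 3600.0) -> float:
    """Assumed decay model: hits lose half their weight every half-life."""
    return hits * 0.5 ** (age_seconds / half_life_seconds)


def value_score(effective_hits: float, tokens: int, file_size: int,
                anchor_multiplier: float = 1.0) -> float:
    """(effective_hits + 1) * tokens / file_size, scaled by an optional anchor multiplier."""
    if file_size <= 0:
        raise ValueError("file_size must be positive")
    return (effective_hits + 1) * tokens / file_size * anchor_multiplier


def pick_eviction_victim(entries: list[dict]) -> dict:
    """Evict the entry with the lowest score (hypothetical entry layout)."""
    return min(
        entries,
        key=lambda e: value_score(e["effective_hits"], e["tokens"], e["file_size"]),
    )
```

Under this formula an entry that has never been hit still scores above zero because of the `+ 1`, and entries that cover many tokens per byte on disk are favored over bulky ones.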

Flag	Purpose
`--prefix-cache` / `--no-prefix-cache`	Enable or disable
`--disk-kv-dir`	L2 directory (production default sets location)
`--disk-kv-budget-gb`	Cap disk usage
`--prefix-cache-max-entries`	Entry count limit
`--prefix-cache-max-gb`	Memory budget for L1
`--prefix-cache-block-size`	Block granularity

Environment: set LME_DISK_KV_EVICTION_POLICY=value to select value-scored eviction.

GDN hybrid snapshots

Hybrid GDN + attention models (Qwen3.5/3.6) use PrefixKVSnapshotStore for recurrent GDN state, which cannot be trimmed to a shorter prefix the way standard KV can. A prefix restore must therefore rehydrate both the attention KV and the GDN caches.

DFlash prefix snapshot management integrates with dflash-mlx. The cold snapshot save is deferred until after the first token is emitted, so writing the snapshot does not delay that token.

Warming prefixes

Agent workloads repeat stable system and tool prefixes. Warm them before user traffic:

`POST /v1/cache/warm_prefix`

Prefill and cache a stable system/tool prefix without generating a full completion.

curl -s -X POST http://127.0.0.1:8080/v1/cache/warm_prefix \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"system","content":"You are a coding assistant."}]}'
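
The same request can be issued from a deployment script. The sketch below mirrors the curl call using only the Python standard library; `build_warm_prefix_request` and `warm_prefix` are hypothetical helper names, and the response is treated as opaque text because its schema is not described here.

```python
import json
import urllib.request


def build_warm_prefix_request(base_url: str, system_prompt: str) -> urllib.request.Request:
    """Build the POST /v1/cache/warm_prefix request without sending it."""
    body = json.dumps(
        {"messages": [{"role": "system", "content": system_prompt}]}
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/cache/warm_prefix",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def warm_prefix(base_url: str, system_prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the raw response body."""
    req = build_warm_prefix_request(base_url, system_prompt)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Calling `warm_prefix("http://127.0.0.1:8080", "You are a coding assistant.")` once at startup, with the exact system and tool prefix that real sessions use, is the intended pattern; a prefix that differs from the one sent in production requests would not be reused.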

`GET /v1/cache/warm_prefix`

Inspect warm-prefix status and cache entries.

Warming pairs well with the production opencode-sampled-v1 speculative decoding (spec-dec) setup for tool-calling sessions. See Serving.

Observability

GET /v1/cache/stats reports:

prefix_cache_* counters
DFlash-specific prefix stats
streaming_summary for expert bank (see Expert residency)
stale_generation_in_flight when the single-flight lock is held

GET /v1/stats also exposes adaptive_block_policy alongside cache-related fields.
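
When scripting against these endpoints, it can help to pull out just the prefix-cache counters from the decoded JSON. The helpers below are a sketch that assumes the counters are top-level keys sharing the `prefix_cache_` prefix; the specific counter names in the sample payload are made up for illustration, and the real response may nest fields differently.

```python
def prefix_cache_counters(stats: dict) -> dict:
    """Keep only top-level keys that start with 'prefix_cache_'."""
    return {k: v for k, v in stats.items() if k.startswith("prefix_cache_")}


def generation_in_flight(stats: dict) -> bool:
    """True when the payload reports stale_generation_in_flight."""
    return bool(stats.get("stale_generation_in_flight", False))


# Hypothetical payload: the counter names are illustrative only.
sample = {
    "prefix_cache_hits": 3,
    "prefix_cache_misses": 1,
    "stale_generation_in_flight": True,
}
counters = prefix_cache_counters(sample)
```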

TurboQuant KV compression

TurboQuant is an optional, approximate (lossy) compression that applies to full-attention KV only. GDN recurrent caches stay in fp32.

mach-serve /path/to/checkpoint --streaming --turboquant-kv --turboquant-bits 4 ...

Flag	Options / notes
`--turboquant-kv`	Master enable
`--turboquant-bits`	Quantization width
`--turboquant-group-size`	Group size for quant kernels
Modes	`v2_lean`, `v2_rotated`, `v3_*` variants

Expected benefit: up to ~5.5× KV memory reduction for long contexts.
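
A rough back-of-the-envelope calculation shows what the best case means in bytes. The sketch uses the standard KV sizing formula (keys plus values, per full-attention layer) and assumes a 16-bit uncompressed baseline; the model dimensions are hypothetical and not those of any particular checkpoint. Only full-attention layers are counted, since GDN recurrent state is not compressed.

```python
def kv_cache_bytes(attention_layers: int, kv_heads: int, head_dim: int,
                   tokens: int, bytes_per_element: int = 2) -> int:
    """Uncompressed KV size: 2 (K and V) * layers * heads * head_dim * tokens * bytes."""
    return 2 * attention_layers * kv_heads * head_dim * tokens * bytes_per_element


# Hypothetical dimensions for illustration only.
baseline = kv_cache_bytes(attention_layers=16, kv_heads=4, head_dim=128, tokens=131072)
best_case = baseline / 5.5  # applies the "up to ~5.5x" figure from this page

print(f"baseline:  {baseline / 1024**3:.2f} GiB")
print(f"best case: {best_case / 1024**3:.2f} GiB")
```

For these example dimensions the baseline is 4 GiB, so the best case lands at roughly 0.73 GiB. Actual savings depend on the chosen bit width, group size, and mode.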

Constraints

Rule	Reason
No prefix-cache (de)serialization while TurboQuant active	Exact snapshot format incompatible with compressed KV
Cannot coexist with continuous batching	Shared KV mutation paths conflict

Choose TurboQuant or continuous batching, not both. See Continuous batching.
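
A launcher script could enforce these rules before starting the server. The following is a minimal sketch: `CacheOptions` and `check_cache_options` are hypothetical names, and the boolean fields are illustrative rather than a mapping of real CLI flags (the continuous-batching flag in particular is not named on this page).

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class CacheOptions:
    """Hypothetical option bundle for illustration."""
    turboquant_kv: bool = False
    continuous_batching: bool = False
    prefix_cache: bool = True


def check_cache_options(opts: CacheOptions) -> list[str]:
    """Raise on the hard conflict; return informational notes otherwise."""
    if opts.turboquant_kv and opts.continuous_batching:
        raise ValueError("TurboQuant KV cannot coexist with continuous batching")
    notes: list[str] = []
    if opts.turboquant_kv and opts.prefix_cache:
        notes.append(
            "prefix-cache snapshots are not serialized or deserialized "
            "while TurboQuant is active"
        )
    return notes
```

Treating the batching conflict as an error and the snapshot restriction as a note reflects how the two rules are worded in the table: the first forbids a combination, the second describes behavior that is skipped.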

Production defaults

Capability	Production default
Prefix cache	Enabled (L2 on disk)
TurboQuant KV	Off
Disk KV eviction	Value-scored when `DiskKVCache` uses `eviction_policy="value"`

Serving — cache HTTP endpoints
Expert residency — expert bank vs KV cache
Troubleshooting — stats timeout during generation
