Continuous Batching
Single-flight default serialization, opt-in ContinuousBatchScheduler, hybrid spec-dec policy, SSE fan-out, and tunables.
By default, mach-serve processes one generation at a time. Continuous batching is an opt-in scheduler that coalesces concurrent /v1/chat/completions requests into shared decode steps.
Single-flight default
All generations serialize behind a process-wide _GENERATION_LOCK because the engine shares mutable state:
- Expert bank LRU and eviction
- KV cache
- Sampling context
- Grammar / JSON-schema enforcer
Concurrent callers queue; only one decode loop runs at a time. This is the safe production default for DFlash v7 + streaming residency.
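The serialization described above can be sketched as follows. This is a minimal illustration, not the server's code: only the name `_GENERATION_LOCK` comes from this page. The use of `threading.Lock` and the `generate` and `run_concurrently` helpers are assumptions made for the example.

```python
import threading
import time

# Name taken from this page; the real lock type is an assumption.
_GENERATION_LOCK = threading.Lock()


def generate(request_id, steps, log):
    """Hypothetical decode loop that holds the lock for the whole generation."""
    with _GENERATION_LOCK:
        for step in range(steps):
            log.append((request_id, step))
            time.sleep(0.001)  # stand-in for one decode step


def run_concurrently(request_ids, steps=3):
    """Start one thread per request and return the order in which steps ran."""
    log = []
    threads = [
        threading.Thread(target=generate, args=(rid, steps, log))
        for rid in request_ids
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return log
```

Because each caller holds the lock for its entire loop, the steps of two requests never interleave: whichever caller acquires the lock first finishes before the other starts.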
Implications:
- /v1/cache/stats may return stale_generation_in_flight during a long request
- Stats sampled between requests give reliable snapshots
See Troubleshooting.
Opt-in continuous batching
Enable:
mach-serve /path/to/checkpoint --streaming --continuous-batching --port 8080
Or:
LME_CONTINUOUS_BATCHING=1 mach-serve ...
ContinuousBatchScheduler batches concurrent chat completions into one decode step, up to LME_CONTINUOUS_BATCHING_MAX_B (default 4).
Hybrid speculative decoding policy
| Active requests | Decode path |
|---|---|
| 1 (lone request) | Full DFlash v7 speculative decoding |
| ≥ 2 | Target-only BatchedDecoder with per-slot sampling and grammar (specdec/batch_sampling.py) |
Spec-dec and multi-request batching run on separate code paths: once two or more requests are active, the server gives up the per-request speedup from accepted draft tokens in exchange for higher aggregate throughput.
Response header: X-Decode-Mode: batched when the batched decoder is active.
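The two-row policy table above amounts to a simple selection rule, sketched here. The `DecodeMode` enum, `select_decode_mode`, and `response_headers` are hypothetical names. Omitting the header on the speculative path is also an assumption, since this page only documents the `batched` value.

```python
from enum import Enum


class DecodeMode(Enum):
    SPECULATIVE = "speculative"  # hypothetical label for the lone-request path
    BATCHED = "batched"          # value documented for X-Decode-Mode


def select_decode_mode(active_requests: int) -> DecodeMode:
    """Pick the decode path from the number of active requests."""
    if active_requests < 1:
        raise ValueError("at least one active request is required")
    if active_requests == 1:
        return DecodeMode.SPECULATIVE
    return DecodeMode.BATCHED


def response_headers(mode: DecodeMode) -> dict:
    """Headers for a response; only the batched value is documented here."""
    if mode is DecodeMode.BATCHED:
        return {"X-Decode-Mode": "batched"}
    return {}  # assumption: no header is set on the speculative path
```

The sketch covers only the default policy and does not model architectures that keep spec-dec engaged while batching.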
Architecture guard
Continuous batching needs a KV cache that to_batch_cache can clone per row. The server resolves this from the arch's kv_cache_kind, using the same continuous_batching_capability predicate that backs mach check:
- standard / hybrid_gdn: batch-compatible
- compressed (DeepSeek-V4 CompressedKVCache): blocked
- rotating (sliding-window gpt-oss / Gemma 4): unknown (only batchable at keep == 0)
When --continuous-batching is requested for a blocked or unknown arch, the server disables continuous batching and serves the serial path instead of crashing in to_batch_cache.
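The guard can be pictured as a lookup from `kv_cache_kind` to a capability, followed by a fallback decision. This is a sketch under assumptions: the name `continuous_batching_capability` appears on this page, but its signature, the `Capability` enum, the `resolve_batching` helper, and the choice to treat unlisted kinds as unknown are all hypothetical.

```python
from enum import Enum


class Capability(Enum):
    COMPATIBLE = "compatible"
    BLOCKED = "blocked"
    UNKNOWN = "unknown"


# Mapping taken from the prose above; the dict itself is illustrative.
_CAPABILITY_BY_KIND = {
    "standard": Capability.COMPATIBLE,
    "hybrid_gdn": Capability.COMPATIBLE,
    "compressed": Capability.BLOCKED,
    "rotating": Capability.UNKNOWN,
}


def continuous_batching_capability(kv_cache_kind: str) -> Capability:
    """Assumed signature: unlisted kinds are conservatively reported as unknown."""
    return _CAPABILITY_BY_KIND.get(kv_cache_kind, Capability.UNKNOWN)


def resolve_batching(requested: bool, kv_cache_kind: str) -> bool:
    """Return True only when batching was requested and the arch allows it.

    Blocked and unknown archs fall back to the serial path instead of failing.
    """
    if not requested:
        return False
    return continuous_batching_capability(kv_cache_kind) is Capability.COMPATIBLE
```

Falling back rather than raising matches the behavior described above: a request for batching on an unsupported arch degrades to serial serving.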
Keeping learned spec-dec engaged at B≥2 is narrower still — only qwen3_5_moe (Qwen3.6: DFlash specdec_v7 draft + hybrid_gdn cache) qualifies. mach check --all-archs surfaces this as the batched_specdec row, and /v1/capabilities reports it under speculation.batched_specdec. All other archs keep batched target-only decode.
SSE fan-out
Each streaming client receives events on its own asyncio.Queue. The scheduler multiplexes token deltas from the shared batched step onto per-client queues.
Compatible with OpenAI-style SSE on POST /v1/chat/completions (stream: true).
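A minimal sketch of the fan-out pattern follows, assuming one `asyncio.Queue` per client as described above. The `FanOut` class, its method names, the `None` end-of-stream sentinel, and the trimmed-down chunk payload are hypothetical. Only the `data: ...` framing and the `data: [DONE]` terminator follow the OpenAI-style SSE convention.

```python
import asyncio
import json


class FanOut:
    """Hypothetical multiplexer: one queue per streaming client."""

    def __init__(self):
        self._queues = {}

    def register(self, client_id):
        queue = asyncio.Queue()
        self._queues[client_id] = queue
        return queue

    def publish_step(self, deltas):
        """Route each slot's token delta from one batched step to its client."""
        for client_id, text in deltas.items():
            self._queues[client_id].put_nowait(text)

    def finish(self, client_id):
        """Signal end of stream with a None sentinel and drop the queue."""
        self._queues.pop(client_id).put_nowait(None)


async def sse_events(queue):
    """Turn queued deltas into SSE frames until the sentinel arrives."""
    while True:
        delta = await queue.get()
        if delta is None:
            yield "data: [DONE]\n\n"
            return
        chunk = {"choices": [{"delta": {"content": delta}}]}
        yield f"data: {json.dumps(chunk)}\n\n"


async def demo():
    fan = FanOut()
    queue_a, queue_b = fan.register("a"), fan.register("b")
    fan.publish_step({"a": "Hel", "b": "Hi"})  # both slots produced a token
    fan.publish_step({"a": "lo"})              # only slot "a" is still active
    fan.finish("a")
    fan.finish("b")
    events_a = [event async for event in sse_events(queue_a)]
    events_b = [event async for event in sse_events(queue_b)]
    return events_a, events_b
```

Calling `asyncio.run(demo())` returns the frames each client would receive. Keeping a queue per client means a slow reader delays only its own stream, not the shared decode step.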
Admission and scheduling
| Env / flag | Default | Purpose |
|---|---|---|
| LME_CONTINUOUS_BATCHING | off | Master enable |
| LME_CONTINUOUS_BATCHING_MAX_B | 4 | Max batch size |
| LME_CONTINUOUS_BATCHING_WINDOW_MS | scheduler default | Coalescing window for admitting requests into a step |
| LME_CONTINUOUS_BATCHING_KV_BUDGET_GB | 10 | KV-budget admission: reject or defer when exceeded |
Response header X-Active-Loops reports how many generation loops are active in the scheduler.
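To make the admission knobs concrete, here is a rough sketch of a budget check. It is illustrative only: `AdmissionPolicy`, `admit`, and `estimate_kv_gb` are hypothetical names, and the KV estimate uses the usual keys-plus-values formula for a standard attention cache with GB taken as 2**30 bytes. How the scheduler actually sizes requests, and whether it rejects or defers, is not specified here.

```python
from dataclasses import dataclass

BYTES_PER_GB = 2 ** 30  # assumption: budget measured in binary gigabytes


def estimate_kv_gb(tokens, layers, kv_heads, head_dim, bytes_per_elem=2):
    """Standard-attention estimate: keys and values for every layer and token."""
    total_bytes = 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem
    return total_bytes / BYTES_PER_GB


@dataclass
class AdmissionPolicy:
    max_batch: int = 4          # mirrors LME_CONTINUOUS_BATCHING_MAX_B
    kv_budget_gb: float = 10.0  # mirrors LME_CONTINUOUS_BATCHING_KV_BUDGET_GB

    def admit(self, active_kv_gb, candidate_kv_gb):
        """Admit only if a batch slot is free and the KV budget still holds."""
        if len(active_kv_gb) >= self.max_batch:
            return False
        return sum(active_kv_gb) + candidate_kv_gb <= self.kv_budget_gb
```

For example, with the defaults, two active requests holding 4 GB each leave room for a 1.5 GB candidate but not a 2.5 GB one. A request that fails the check would be rejected or deferred.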
Mutual exclusions
| Feature | Compatible with continuous batching? |
|---|---|
| DFlash v7 (B=1 only) | Partial — full spec-dec only for lone request |
| TurboQuant KV | No — cannot coexist |
| Prefix cache serialize/deserialize under TurboQuant | N/A — TurboQuant off when batching |
Enable either continuous batching or TurboQuant KV, not both. See Caching.
When to enable
| Scenario | Recommendation |
|---|---|
| Single agent / OpenCode session | Default single-flight + DFlash |
| Multiple concurrent API clients, throughput over spec-dec | --continuous-batching |
| Memory-tight machines | Watch LME_CONTINUOUS_BATCHING_KV_BUDGET_GB |
Tunables summary
| Knob | Default | Notes |
|---|---|---|
| --continuous-batching | off | CLI enable |
| LME_CONTINUOUS_BATCHING | 0 | Env enable |
| LME_CONTINUOUS_BATCHING_MAX_B | 4 | Upper bound on batch width |
| LME_CONTINUOUS_BATCHING_WINDOW_MS | scheduler default | Admission coalescing |
| LME_CONTINUOUS_BATCHING_KV_BUDGET_GB | 10 | KV admission cap |
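The knobs in the table can be read into a single config object, as sketched below. The environment variable names and defaults come from this page. The `BatchingConfig` dataclass, the `load_config` helper, treating only the value `1` as enabled, and representing the scheduler's default window as `None` are assumptions for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class BatchingConfig:
    enabled: bool
    max_b: int
    window_ms: Optional[float]  # None means "use the scheduler default"
    kv_budget_gb: float


def load_config(env: Mapping[str, str] = os.environ,
                cli_flag: bool = False) -> BatchingConfig:
    """Resolve tunables; either the CLI flag or the env var turns batching on."""
    # Assumption: only the literal "1" enables batching through the environment.
    enabled = cli_flag or env.get("LME_CONTINUOUS_BATCHING", "0") == "1"
    max_b = int(env.get("LME_CONTINUOUS_BATCHING_MAX_B", "4"))
    if max_b < 1:
        raise ValueError("LME_CONTINUOUS_BATCHING_MAX_B must be at least 1")
    raw_window = env.get("LME_CONTINUOUS_BATCHING_WINDOW_MS")
    window_ms = float(raw_window) if raw_window else None
    kv_budget_gb = float(env.get("LME_CONTINUOUS_BATCHING_KV_BUDGET_GB", "10"))
    return BatchingConfig(enabled, max_b, window_ms, kv_budget_gb)
```

Passing a plain dict as `env` makes the resolution easy to exercise without touching the process environment, for example `load_config({"LME_CONTINUOUS_BATCHING": "1"})`.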
Related pages
- Speculative decoding — DFlash when B=1
- Serving — HTTP endpoints and headers
- Caching — TurboQuant conflict