Continuous Batching

This page covers the single-flight default, the opt-in ContinuousBatchScheduler, the hybrid spec-dec policy, SSE fan-out, and tunables.

By default, mach-serve processes one generation at a time. Continuous batching is an opt-in scheduler that coalesces concurrent /v1/chat/completions requests into shared decode steps.

Single-flight default

All generations serialize behind a process-wide _GENERATION_LOCK because the engine shares mutable state:

Expert bank LRU and eviction
KV cache
Sampling context
Grammar / JSON-schema enforcer

Concurrent callers queue; only one decode loop runs at a time. This is the safe production default for DFlash v7 + streaming residency.
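The serialization described above can be pictured with a minimal sketch. It assumes a plain `threading.Lock` and a hypothetical `generate` function; the real `_GENERATION_LOCK` may be a different lock type, and the sleep is only a placeholder for the decode loop.

```python
import threading
import time

# Stand-in for the process-wide lock named in the text (the lock type is an assumption).
_GENERATION_LOCK = threading.Lock()


def generate(prompt: str, log: list) -> str:
    """Hypothetical generation entry point: one decode loop at a time."""
    with _GENERATION_LOCK:
        log.append(("start", prompt))
        time.sleep(0.01)  # placeholder for the decode loop touching shared engine state
        log.append(("end", prompt))
    return prompt.upper()


def run_concurrent(prompts: list) -> list:
    """Start one thread per prompt and return the ordered start/end log."""
    log: list = []
    threads = [threading.Thread(target=generate, args=(p, log)) for p in prompts]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

Because each caller holds the lock for its whole generation, every "start" entry in the log is immediately followed by the matching "end" entry, regardless of how many callers arrive at once.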

Implications:

/v1/cache/stats may return stale_generation_in_flight during a long request
Stats sampled between requests give reliable snapshots

See Troubleshooting.

Opt-in continuous batching

Enable:

mach-serve /path/to/checkpoint --streaming --continuous-batching --port 8080

Or:

LME_CONTINUOUS_BATCHING=1 mach-serve ...

ContinuousBatchScheduler coalesces up to LME_CONTINUOUS_BATCHING_MAX_B (default 4) concurrent chat completions into one decode step.

Hybrid speculative decoding policy

Active requests	Decode path
1 (lone request)	Full DFlash v7 speculative decoding
≥ 2	Target-only `BatchedDecoder` with per-slot sampling and grammar (`specdec/batch_sampling.py`)

Spec-dec and multi-request batching run on separate code paths: the batched path decodes with the target model only, giving up draft-token speedups in exchange for aggregate throughput.

When the batched decoder is active, responses carry the header X-Decode-Mode: batched.
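The policy in the table above reduces to a small decision function. The sketch below is illustrative only: `select_decode_path` and `response_headers` are hypothetical names, the path labels are placeholders, and since the text does not say what the header looks like on the spec-dec path, the sketch simply omits it there.

```python
def select_decode_path(active_requests: int) -> str:
    """Pick a decode path from the number of active requests (labels are placeholders)."""
    if active_requests < 1:
        raise ValueError("at least one active request is required")
    # A lone request keeps full speculative decoding; two or more share
    # a target-only batched decode step.
    return "specdec" if active_requests == 1 else "batched"


def response_headers(active_requests: int) -> dict:
    """Return the extra response headers implied by the chosen path."""
    headers = {}
    if select_decode_path(active_requests) == "batched":
        headers["X-Decode-Mode"] = "batched"
    return headers
```

A client that cares about latency can inspect `X-Decode-Mode` to tell whether a given response was produced on the batched path.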

Continuous batching needs a KV cache that to_batch_cache can clone per row. The server resolves this from the arch's kv_cache_kind, using the same continuous_batching_capability predicate that backs mach check:

standard / hybrid_gdn: batch-compatible
compressed (DeepSeek-V4 CompressedKVCache): blocked
rotating (sliding-window gpt-oss / Gemma 4): unknown (only batchable at keep == 0)

When --continuous-batching is requested for a blocked or unknown arch, the server disables continuous batching and serves the serial path instead of crashing in to_batch_cache.
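The resolution rule above can be sketched as a lookup plus a fallback. The name `continuous_batching_capability` comes from the text, but its signature and return type here are assumptions, `resolve_batching` is a hypothetical helper, and treating unrecognized cache kinds as unknown is a choice made for this sketch.

```python
from enum import Enum


class Capability(Enum):
    COMPATIBLE = "compatible"
    BLOCKED = "blocked"
    UNKNOWN = "unknown"


# Mapping taken from the text; the dictionary itself is illustrative.
_CAPABILITY_BY_KIND = {
    "standard": Capability.COMPATIBLE,
    "hybrid_gdn": Capability.COMPATIBLE,
    "compressed": Capability.BLOCKED,
    "rotating": Capability.UNKNOWN,
}


def continuous_batching_capability(kv_cache_kind: str) -> Capability:
    """Assumed signature: map a cache kind to its batching capability."""
    # Unrecognized kinds fall back to UNKNOWN (an assumption of this sketch).
    return _CAPABILITY_BY_KIND.get(kv_cache_kind, Capability.UNKNOWN)


def resolve_batching(requested: bool, kv_cache_kind: str) -> bool:
    """Return True only when batching was requested and the cache kind allows it."""
    if not requested:
        return False
    # Blocked and unknown archs both fall back to the serial path.
    return continuous_batching_capability(kv_cache_kind) is Capability.COMPATIBLE
```

Folding blocked and unknown into the same serial fallback keeps the failure mode quiet: the flag is ignored rather than surfacing as a crash at cache-clone time.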

Keeping learned spec-dec engaged at B≥2 is narrower still — only qwen3_5_moe (Qwen3.6: DFlash specdec_v7 draft + hybrid_gdn cache) qualifies. mach check --all-archs surfaces this as the batched_specdec row, and /v1/capabilities reports it under speculation.batched_specdec. All other archs keep batched target-only decode.

SSE fan-out

Each streaming client receives events on its own asyncio.Queue. The scheduler multiplexes token deltas from the shared batched step onto per-client queues.

Compatible with OpenAI-style SSE on POST /v1/chat/completions (stream: true).
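A minimal sketch of the fan-out pattern follows. `FanOut`, `sse_event`, and `stream` are hypothetical names, not the scheduler's actual classes, and the JSON payload is trimmed to the delta content only (real OpenAI-style chunks carry more fields). Using `None` as an end-of-stream sentinel is an assumption of this sketch.

```python
import asyncio
import json


class FanOut:
    """Route per-request token deltas from one shared step to per-client queues."""

    def __init__(self) -> None:
        self._queues: dict = {}

    def register(self, request_id: str) -> asyncio.Queue:
        queue: asyncio.Queue = asyncio.Queue()
        self._queues[request_id] = queue
        return queue

    def publish_step(self, deltas: dict) -> None:
        # One batched decode step yields at most one delta per active request.
        for request_id, delta in deltas.items():
            self._queues[request_id].put_nowait(delta)

    def finish(self, request_id: str) -> None:
        # None marks end-of-stream for that client (sentinel chosen for this sketch).
        self._queues.pop(request_id).put_nowait(None)


def sse_event(delta: str) -> str:
    payload = {"choices": [{"delta": {"content": delta}}]}
    return f"data: {json.dumps(payload)}\n\n"


async def stream(queue: asyncio.Queue):
    """Yield SSE frames for one client until the sentinel arrives."""
    while True:
        delta = await queue.get()
        if delta is None:
            yield "data: [DONE]\n\n"
            return
        yield sse_event(delta)


async def demo():
    fan = FanOut()
    queue_a, queue_b = fan.register("a"), fan.register("b")
    fan.publish_step({"a": "Hel", "b": "Hi"})
    fan.publish_step({"a": "lo"})  # request "b" produced no token this step
    fan.finish("a")
    fan.finish("b")
    out_a = [event async for event in stream(queue_a)]
    out_b = [event async for event in stream(queue_b)]
    return out_a, out_b
```

Running `asyncio.run(demo())` returns two independent event lists, one per client, each ending with the `data: [DONE]` frame. A slow client only backs up its own queue, not the shared decode step.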

Admission and scheduling

Env / flag	Default	Purpose
`LME_CONTINUOUS_BATCHING`	off	Master enable
`LME_CONTINUOUS_BATCHING_MAX_B`	4	Max batch size
`LME_CONTINUOUS_BATCHING_WINDOW_MS`	scheduler default	Coalescing window for admitting requests into a step
`LME_CONTINUOUS_BATCHING_KV_BUDGET_GB`	10	KV-budget admission — reject or defer when exceeded

Response header X-Active-Loops reports how many generation loops are active in the scheduler.
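To make the two admission limits concrete, here is a minimal sketch of a check that combines the batch-width cap with the KV budget. The `Admission` class, its method names, and the idea of a per-request KV estimate in GB are all assumptions; the text does not specify how the scheduler estimates KV usage or whether it rejects or defers, so the sketch just returns `False` and leaves that choice to the caller.

```python
from dataclasses import dataclass, field


@dataclass
class Admission:
    """Hypothetical admission gate: batch-width cap plus KV-budget cap."""

    max_batch: int = 4          # mirrors LME_CONTINUOUS_BATCHING_MAX_B
    kv_budget_gb: float = 10.0  # mirrors LME_CONTINUOUS_BATCHING_KV_BUDGET_GB
    active: dict = field(default_factory=dict)  # request_id -> estimated KV GB

    def try_admit(self, request_id: str, kv_estimate_gb: float) -> bool:
        if len(self.active) >= self.max_batch:
            return False  # batch is already at full width
        if sum(self.active.values()) + kv_estimate_gb > self.kv_budget_gb:
            return False  # admitting this request would exceed the KV budget
        self.active[request_id] = kv_estimate_gb
        return True

    def release(self, request_id: str) -> None:
        # Free the slot and its KV reservation when a request finishes.
        self.active.pop(request_id, None)
```

With `max_batch=2` and a 10 GB budget, a 6 GB request is admitted, a following 5 GB request is refused on budget, and a 4 GB request fits exactly.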

Mutual exclusions

Feature	Compatible with continuous batching?
DFlash v7 (B=1 only)	Partial — full spec-dec only for lone request
TurboQuant KV	No — cannot coexist
Prefix cache serialize/deserialize under TurboQuant	N/A — TurboQuant off when batching

Enable either TurboQuant KV or continuous batching, not both. See Caching.

When to enable

Scenario	Recommendation
Single agent / OpenCode session	Default single-flight + DFlash
Multiple concurrent API clients, throughput over spec-dec	`--continuous-batching`
Memory-tight machines	Watch `LME_CONTINUOUS_BATCHING_KV_BUDGET_GB`

Tunables summary

Knob	Default	Notes
`--continuous-batching`	off	CLI enable
`LME_CONTINUOUS_BATCHING`	0	Env enable
`LME_CONTINUOUS_BATCHING_MAX_B`	4	Upper bound on batch width
`LME_CONTINUOUS_BATCHING_WINDOW_MS`	scheduler default	Admission coalescing
`LME_CONTINUOUS_BATCHING_KV_BUDGET_GB`	10	KV admission cap
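For tooling that needs to mirror these knobs (for example a launcher script), the environment variables can be read into a small config object. This is a sketch, not the server's own parsing: `BatchingConfig` and `load_config` are hypothetical names, only the value `1` is treated as enabling because that is the only form the text shows, and an unset window is represented as `None` to stand for the scheduler default.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class BatchingConfig:
    enabled: bool
    max_b: int
    window_ms: Optional[float]  # None means "use the scheduler default"
    kv_budget_gb: float


def load_config(env: Mapping[str, str] = os.environ) -> BatchingConfig:
    """Read the continuous-batching knobs, falling back to the documented defaults."""
    window = env.get("LME_CONTINUOUS_BATCHING_WINDOW_MS")
    return BatchingConfig(
        # Only "1" enables batching here; other truthy spellings are not assumed.
        enabled=env.get("LME_CONTINUOUS_BATCHING", "0") == "1",
        max_b=int(env.get("LME_CONTINUOUS_BATCHING_MAX_B", "4")),
        window_ms=float(window) if window is not None else None,
        kv_budget_gb=float(env.get("LME_CONTINUOUS_BATCHING_KV_BUDGET_GB", "10")),
    )
```

Passing a plain dictionary instead of `os.environ` makes the function easy to exercise without touching the real environment.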

Speculative decoding — DFlash when B=1
Serving — HTTP endpoints and headers
Caching — TurboQuant conflict