Maniac Docs

Speculative Decoding

DFlash v7 draft-verify-accept cycle, adaptive block policies, acceptance sampling, draft-free request presets, target-only fallback, and DeepSeek-V4 rho-gate.

The default served decode path is DFlash v7 speculative decoding: a small draft model proposes token blocks; the target model verifies them in parallel; accepted prefixes commit in one step.

At startup the server logs decode_path=specdec-v7.

Draft → verify → accept → commit

sequenceDiagram
  participant Draft as DFlash draft (8 layers)
  participant Target as Target MoE model
  participant Policy as Adaptive block policy
  Draft->>Policy: Propose block (up to 16 tokens)
  Policy->>Target: Verify draft tokens
  Target->>Target: Longest-prefix accept (greedy or sampled)
  Target->>Target: Commit accepted run; roll forward

  1. Draft: the block-diffusion head (8-layer draft) proposes up to 16 tokens per forward pass.
  2. Verify: the target runs on the drafted continuation in parallel.
  3. Accept: keep the longest prefix of the draft that the target agrees with (argmax match when greedy, sampled acceptance otherwise).
  4. Commit: append the accepted tokens; repeat until a stop condition or the max-token limit.
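The four steps above can be sketched in a few lines for the greedy case. This is a minimal illustration, not the DFlash engine: `draft_block` and `target_next` are hypothetical stand-ins for the draft and target models, the target is called once per position rather than in one parallel forward, and committing the target's own token at the first mismatch is an assumption made here so that every cycle makes progress.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical stand-ins (not part of the mach API):
#   DraftBlock: (prefix, n) -> up to n proposed tokens
#   TargetNext: prefix -> the target's greedy next token
DraftBlock = Callable[[Sequence[int], int], List[int]]
TargetNext = Callable[[Sequence[int]], int]


def speculative_step(prefix: List[int], draft_block: DraftBlock,
                     target_next: TargetNext, block_size: int = 16) -> List[int]:
    """Run one draft -> verify -> accept -> commit cycle; return committed tokens."""
    proposed = draft_block(prefix, block_size)           # 1. Draft
    committed: List[int] = []
    for tok in proposed:                                 # 2. Verify
        expected = target_next(prefix + committed)
        if tok != expected:                              # 3. Accept stops here
            committed.append(expected)                   # assumption: keep target's token
            return committed
        committed.append(tok)
    return committed                                     # 4. Commit the whole block


def generate(prompt: List[int], draft_block: DraftBlock, target_next: TargetNext,
             max_new_tokens: int, stop_token: Optional[int] = None,
             block_size: int = 16) -> List[int]:
    """Repeat cycles until the stop token or the max-token limit is reached."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        for tok in speculative_step(out, draft_block, target_next, block_size):
            out.append(tok)
            if tok == stop_token or len(out) - len(prompt) >= max_new_tokens:
                return out
    return out
```

Because every committed token is either a draft token the target agreed with or the target's own token, the output matches what plain greedy decoding on the target would produce; the draft only changes how many target steps are needed.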

Runtime counters on GET /v1/stats: specdec_drafted, specdec_accepted, recent_alpha, observed_cycles, low_cycles.
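As a rough sketch of how these counters might be consumed, the helper below derives an overall acceptance rate from specdec_drafted and specdec_accepted. It assumes the endpoint returns a JSON object with the counters as top-level numeric fields and that the server listens on localhost:8080; neither detail is confirmed on this page, so adjust the lookups to the actual response shape.

```python
import json
from urllib.request import urlopen


def acceptance_rate(stats: dict) -> float:
    """Fraction of drafted tokens that were accepted (0.0 if nothing was drafted)."""
    drafted = stats.get("specdec_drafted", 0)    # assumed top-level field
    accepted = stats.get("specdec_accepted", 0)  # assumed top-level field
    return accepted / drafted if drafted else 0.0


def fetch_stats(base_url: str = "http://localhost:8080") -> dict:
    """GET /v1/stats and decode the JSON body (base URL is an assumption)."""
    with urlopen(f"{base_url}/v1/stats", timeout=5) as resp:
        return json.load(resp)
```

For example, `acceptance_rate(fetch_stats())` against a running server gives a single number to track alongside recent_alpha.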

DFlash v7 engine

Mode: BlockSpecDecEngineMode.NATIVE_V7.

The draft checkpoint is a block-diffusion head trained for the target family.

Load programmatically:

from mach import BlockSpecDecEngine, BlockSpecDecEngineConfig, BlockSpecDecEngineMode

engine = BlockSpecDecEngine.from_checkpoint(
    target_checkpoint="/path/to/target",
    draft_checkpoint="/path/to/dflash_draft",
    config=BlockSpecDecEngineConfig(mode=BlockSpecDecEngineMode.NATIVE_V7),
)

Requires the [dflash] extra (dflash-mlx). See Installation.

v7 optimizations (enabled on the DFlash server): lazy commit and trim hidden, which together reduce per-cycle target feature-projection overhead.

Adaptive block policies

--serving-adaptive-block-policy controls how many draft tokens to propose per cycle. Tool-calling workloads default to opencode-sampled-v1.

| Policy | Behavior |
| --- | --- |
| opencode-sampled-v1 | Production default: window=4, alpha_threshold=0.40, low_block=4, default_block=16 |
| balanced-v1 | Deterministic ablations / greedy comparisons |
| off | Static block size 16 |

Per-knob overrides:

  • --adaptive-block-alpha-window
  • --adaptive-block-alpha-threshold
  • --adaptive-block-low-block
  • --adaptive-block-default-block
  • --block-size (static override when policy is off)
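One plausible reading of these knobs is a sliding-window rule: track the acceptance rate (alpha) over the last `window` cycles and drop to `low_block` when the windowed mean falls below `alpha_threshold`. The class below sketches that reading with the opencode-sampled-v1 values as defaults. The decision rule, the class name, and the method names are assumptions for illustration; the real policy may weight or smooth alpha differently.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque


@dataclass
class AdaptiveBlockPolicySketch:
    """Hypothetical alpha-windowed block-size policy (not the shipped one)."""
    window: int = 4                # cf. --adaptive-block-alpha-window
    alpha_threshold: float = 0.40  # cf. --adaptive-block-alpha-threshold
    low_block: int = 4             # cf. --adaptive-block-low-block
    default_block: int = 16        # cf. --adaptive-block-default-block

    def __post_init__(self) -> None:
        # Only the most recent `window` per-cycle acceptance rates are kept.
        self._alphas: Deque[float] = deque(maxlen=self.window)

    def observe(self, drafted: int, accepted: int) -> None:
        """Record one cycle's acceptance rate; cycles with no draft are skipped."""
        if drafted > 0:
            self._alphas.append(accepted / drafted)

    @property
    def recent_alpha(self) -> float:
        """Mean acceptance rate over the window (1.0 before any observation)."""
        return sum(self._alphas) / len(self._alphas) if self._alphas else 1.0

    def next_block_size(self) -> int:
        """Shrink the block only once a full window of low acceptance is seen."""
        if len(self._alphas) < self.window:
            return self.default_block
        if self.recent_alpha < self.alpha_threshold:
            return self.low_block
        return self.default_block
```

Waiting for a full window before shrinking is a design choice in this sketch: it avoids reacting to a single bad cycle at the start of a request.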

Confirm the active policy via GET /v1/stats → adaptive_block_policy (name, knobs, observed_cycles, recent_alpha).

Acceptance

| Temperature | Strategy |
| --- | --- |
| 0 (greedy) | Longest-prefix match between draft and target argmax |
| > 0 | Sampled acceptance (Leviathan rejection sampling) |

Fallbacks:

  • Penalty fallback → target-only autoregressive step
  • Zero-acceptance fallback → target-only AR for that cycle
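For the sampled case, the standard rejection rule from the speculative sampling literature accepts a drafted token x with probability min(1, p(x)/q(x)), where p is the target distribution and q the draft distribution, and on rejection resamples from the normalized residual max(0, p - q). The sketch below shows that rule on plain dictionaries. It is a textbook illustration rather than the engine's code: the function names are hypothetical, and the extra token that is usually sampled from the target when the whole block is accepted is omitted.

```python
import random
from typing import Dict, List, Sequence

Dist = Dict[int, float]  # token id -> probability


def sample(dist: Dist, rng: random.Random) -> int:
    """Draw one token id from a probability table."""
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


def residual(p: Dist, q: Dist) -> Dist:
    """Normalized max(0, p - q): the distribution to resample from on rejection."""
    raw = {t: max(0.0, p.get(t, 0.0) - q.get(t, 0.0)) for t in p}
    total = sum(raw.values())
    return {t: v / total for t, v in raw.items() if v > 0.0}


def accept_block(draft_tokens: Sequence[int], draft_dists: Sequence[Dist],
                 target_dists: Sequence[Dist], rng: random.Random) -> List[int]:
    """Accept drafted tokens left to right; stop at the first rejection."""
    committed: List[int] = []
    for tok, q, p in zip(draft_tokens, draft_dists, target_dists):
        # q[tok] > 0 because tok was sampled from q.
        accept_prob = min(1.0, p.get(tok, 0.0) / q[tok])
        if rng.random() < accept_prob:
            committed.append(tok)
            continue
        # Rejection implies p(tok) < q(tok), so the residual has positive mass.
        committed.append(sample(residual(p, q), rng))
        break
    return committed
```

This rule leaves the output distribution equal to sampling from the target alone, which is why it pairs with temperature > 0 where an argmax comparison would not apply.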

Target-only mode

Disable DFlash entirely:

mach-serve /path/to/checkpoint --streaming --target-only --port 8080

Auto-fallback also occurs when no usable --draft-dir is provided (no [dflash] install or missing draft weights).

The fallback is decided once by the arch gate (_resolve_effective_decode_path): a stray --draft-dir on a non-Qwen architecture, or a missing/unusable draft, resolves cleanly to target-only instead of mis-building the DFlash stack. The resolved path is surfaced as effective_decode_path (+ reason) in GET /v1/capabilities.
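The gating described above can be pictured as a single pure function that returns a path plus a reason. The sketch below is a hypothetical reconstruction and not the body of _resolve_effective_decode_path: the function name, the `DFLASH_ARCHS` set, the "target-only" path string, and the reason texts are all assumptions chosen for illustration.

```python
from typing import NamedTuple

# Assumption: only the Qwen DFlash architecture named on this page qualifies.
DFLASH_ARCHS = frozenset({"qwen3_5_moe"})


class DecodePath(NamedTuple):
    effective_decode_path: str
    reason: str


def resolve_decode_path_sketch(arch: str, draft_usable: bool,
                               target_only_flag: bool = False) -> DecodePath:
    """Decide once, up front, which decode stack to build."""
    if target_only_flag:
        return DecodePath("target-only", "--target-only requested")
    if arch not in DFLASH_ARCHS:
        # A stray --draft-dir is ignored instead of mis-building the DFlash stack.
        return DecodePath("target-only", f"architecture {arch!r} has no DFlash path")
    if not draft_usable:
        return DecodePath("target-only", "draft missing or unusable")
    return DecodePath("specdec-v7", "usable DFlash draft for a supported architecture")
```

Returning the reason alongside the path mirrors how the resolved value is reported to clients, so a fallback is visible rather than silent.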

Draft-free request presets

Independently of the trained DFlash draft, callers can request a draft-free speculation preset per request via an additive speculation field on the chat-completions body. These presets need no draft head — they propose tokens from the running context and verify them against the target in a single block.

| Preset | Strategy | Best for |
| --- | --- | --- |
| none (default) | Engine defaults: CopySpec for the Qwen DFlash path, plain autoregressive for target-only. | Anything; identical to the no-speculation path. |
| summarization | Prompt-lookup decoding (PLD): propose the continuation of the most recent n-gram match of the running suffix. | Outputs that echo the input (summaries, edits, refactors). |
| classification | Fixed-token: when the target's just-emitted token starts a candidate label, propose the rest of that label so it resolves in about one verified block. | Short, fixed-vocabulary answers (labels, routing). |

All three presets are exactness-preserving: greedy decode with any preset is token-identical to greedy without it, because every committed token is the target's own argmax — the draft only ever short-circuits forwards the target would have produced anyway. When the speculation field is absent (or preset: "none"), decoding is byte-for-byte identical to today's path.
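To make the fixed-token (classification) strategy concrete, here is a minimal sketch of the proposal step on token-id sequences. The function name is hypothetical, and the handling of ambiguity (proposing only the tokens that all matching labels share) is an assumption; the engine may resolve ties differently. Whatever is proposed still goes through target verification, so a wrong proposal costs a cycle but never changes the output.

```python
from typing import List, Sequence


def propose_label_rest(last_token: int,
                       candidates: Sequence[Sequence[int]]) -> List[int]:
    """Propose the remainder of the label(s) that start with last_token.

    With one matching label the whole remainder is proposed; with several,
    only their common prefix is proposed (assumption for this sketch).
    """
    rests = [list(c[1:]) for c in candidates if c and c[0] == last_token]
    if not rests:
        return []
    common: List[int] = []
    for column in zip(*rests):        # walk the remainders position by position
        if len(set(column)) != 1:     # labels diverge here; stop proposing
            break
        common.append(column[0])
    return common
```

For instance, with candidate sequences `[[10, 11, 12], [20, 21]]`, a just-emitted token 10 yields the proposal `[11, 12]`, which the target can confirm in a single verified block.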

Request shape (injected into the POST body):

{
  "speculation": {
    "preset": "summarization",
    "min_ngram": 3,
    "max_ngram": 5,
    "max_draft": 4
  }
}

{
  "speculation": {
    "preset": "classification",
    "candidates": ["positive", "negative", "neutral"]
  }
}

candidates label strings are tokenized server-side into candidate token-id sequences (the engine never needs a tokenizer of its own). min_ngram / max_ngram (defaults 3 / 5) tune the PLD lookup window for summarization; max_draft caps the per-cycle draft length (the engine's block size caps it otherwise).
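The role of min_ngram, max_ngram, and max_draft is easiest to see in a small prompt-lookup sketch. The function below searches the running context for the most recent earlier occurrence of its own suffix and proposes the tokens that followed it. It is an illustrative implementation of the general PLD idea: the function name is hypothetical, and trying longer n-grams before shorter ones is an assumption, not a documented engine behavior.

```python
from typing import List, Sequence


def pld_propose(context: Sequence[int], min_ngram: int = 3,
                max_ngram: int = 5, max_draft: int = 4) -> List[int]:
    """Propose up to max_draft tokens that followed an earlier copy of the suffix."""
    n_ctx = len(context)
    # Try longer n-grams first: a longer match is stronger evidence.
    for n in range(min(max_ngram, n_ctx - 1), min_ngram - 1, -1):
        suffix = list(context[n_ctx - n:])
        # Scan start positions from most recent to oldest, skipping the
        # suffix itself so at least one token follows every match.
        for start in range(n_ctx - n - 1, -1, -1):
            if list(context[start:start + n]) == suffix:
                return list(context[start + n:start + n + max_draft])
    return []  # no match: nothing to propose this cycle


# The suffix [1, 2, 3] also appears at the start, so its continuation is proposed.
print(pld_propose([1, 2, 3, 4, 5, 9, 1, 2, 3]))  # [4, 5, 9, 1]
```

An empty proposal simply means the cycle proceeds without speculation, which is why the preset helps most when the output echoes the input.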

The Maniac harness can auto-select summarization for memory/summarizer-style sub-agent roles (where the output echoes the input); classification stays an explicit API capability because it requires a caller-supplied candidate label set.

How presets ride the decode paths

A preset engages on one of two paths depending on architecture:

  • Qwen DFlash path (qwen3_5_moe) — the preset layers onto the existing DFlash/CopySpec loop.
  • Target-only overlay (gpt_oss, gemma4, qwen3_moe) — a shared draft-free overlay (specdec/draft_free.py) runs the propose → verify → longest-prefix-accept → trim cycle.

The target-only overlay requires every KV-cache layer to support O(1) trim() — i.e. all-attention models. Hybrid linear-attention caches (Qwen3-Next / Qwen3.6 GDN) are not trimmable, so the dispatcher gates on a trimmability probe and falls back to plain decoding for those architectures (the preset becomes a no-op rather than an error).
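A trimmability gate of this kind might look like the sketch below. It assumes each cache layer exposes an `is_trimmable()` method and a `trim()` method; those attribute names, the function names, and the returned path labels are assumptions for illustration and are not taken from specdec/draft_free.py.

```python
from typing import Iterable, Optional


def all_layers_trimmable(cache_layers: Iterable[object]) -> bool:
    """True only if every layer reports it can be trimmed and exposes trim()."""
    for layer in cache_layers:
        probe = getattr(layer, "is_trimmable", None)  # assumed method name
        if not callable(probe) or not probe():
            return False
        if not callable(getattr(layer, "trim", None)):
            return False
    return True


def choose_overlay(preset: Optional[str], cache_layers: Iterable[object]) -> str:
    """Pick the draft-free overlay only when a preset is set and trimming is safe."""
    if preset in (None, "none"):
        return "plain"
    if not all_layers_trimmable(cache_layers):
        return "plain"  # the preset becomes a no-op instead of raising
    return "draft-free-overlay"
```

Probing every layer matters because a single non-trimmable layer in a hybrid model is enough to make rollback after a rejected draft impossible.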

DeepSeek-V4: rho-gate instead of DFlash

deepseek_v4 uses MTP / EAGLE-3.1 drafts with a rho-gate acceptance path, not DFlash v7.

| Env / knob | Role |
| --- | --- |
| LME_V4_DRAFT | Draft source selection |
| LME_RHO_GATE* | Rho-gate thresholds and behavior |

See Architecture for the load_v4_flash entry point.

Research and experimental paths

Additional draft strategies exist behind experimental gating (LME_ALLOW_EXPERIMENTAL):

  • EAGLE-3 tree drafts
  • Zerocost drafts

These are not the production mach-serve default. Use for ablations and development only.

The prompt-lookup (PLD) and CopySpec strategies that previously lived here as experimental paths are now the supported, per-request draft-free presets above (summarization / classification).

Interaction with continuous batching

Default single-flight serving always uses DFlash when available. Opt-in continuous batching demotes to target-only BatchedDecoder when batch size ≥ 2. See Continuous batching.

  • Serving: startup decode_path=specdec-v7 contract
  • Library: BlockSpecDecEngine API
  • CLI: draft artifact paths
