Two nights ago I shipped llamacppqwen36mtpone v1.0.5 on the orales-one-market catalog: havenoammo’s UD-Q3_K_XL (14.9 GB) at FULL 262K context with MTP speculative decoding n=5. My validation bench showed 77 t/s steady at 262K with 75-80% draft acceptance — a clean win over the prior v1.0.3’s 65 t/s @ 128K.
Then a user reported the reality: the pod crashed repeatedly with exit code 139, a CUDA OOM during MTP drafting:

> /src/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:97: CUDA error: out of memory, current device: 0, in function ggml_cuda_graph_evaluate_and_capture ... common_speculative_state_mtp::draft

In the user's own words: with MTP deactivated the LLM runs rock-stable, and even with context > 200K there appears to be enough VRAM left to re-activate MTP (tight, but OK). Yet the moment MTP becomes active, CUDA crashes again.
This is the classic gap between the boot-time static memory estimate and the runtime worst case: the CUDA graph captures that the engine actually exercises during real prompts dwarf what the boot estimate accounts for.
TL;DR
- v1.0.5 (havenoammo UD-Q3_K_XL @ 262K + MTP n=5): boots fine, 77 t/s on validation, runtime CUDA OOM in MTP draft on real prompts
- v1.0.7 (havenoammo UD-Q2_K_XL @ 262K + MTP n=5, what ships now): 72.14 t/s AVG over 10 clean runs, ZERO OOM, full 262K context preserved
- Trade-off: only -6% t/s for stability and full context. Strictly better than v1.0.5.
- Side experiment: tested sudo-0x2a/Qwen3.6-27B-NVFP4-GPTQ on stock vLLM to see if NVFP4 could let us drop the Genesis patches. Answer: no. NVFP4 alone gives 31.79 t/s (-64% vs Genesis's 88 t/s), and adding MTP n=3 OOMs on 24GB.
Root cause: MTP draft compute buffer scales with cudagraph capture shapes
When llama.cpp’s MTP speculative decoding runs, it captures CUDA graphs for the draft-head computation. The capture shapes depend on the actual batch + sequence patterns the engine encounters during inference. The boot-time estimate for --ctx-size 262144 --spec-draft-n-max 5 --cache-type-{k,v} q4_0 leaves enough headroom for the static graph captures the engine performs during init, but not for the broader set of shapes that real prompts trigger.
In numbers, on a 24GB sm_120 mobile (5090M, RTX 5090 Laptop):
- Q3_K_XL target: 14.92 GB
- q4_0 KV cache at 262K: ~4.3 GB
- Other llama.cpp init + compute buffers: ~3-4 GB
- MTP draft cudagraph captures: needs ~2-3 GB more than boot estimated
- Result: VRAM exhausted at runtime, exit 139 in ggml_cuda_graph_evaluate_and_capture
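A back-of-the-envelope check of that budget (all figures are the rough estimates from the bullets above, not allocator readings; the midpoints for the ranged items are my assumption):

```python
# Rough 24 GB VRAM budget for v1.0.5 at 262K + MTP n=5.
VRAM_GB = 24.0

budget = {
    "Q3_K_XL target weights":       14.92,
    "q4_0 KV cache @ 262K":          4.3,
    "init + compute buffers":        3.5,  # midpoint of the ~3-4 GB estimate
    "MTP draft cudagraph captures":  2.5,  # midpoint of the ~2-3 GB runtime need
}

# What the boot-time estimate sees vs the runtime worst case.
static = sum(v for k, v in budget.items() if k != "MTP draft cudagraph captures")
runtime = sum(budget.values())

print(f"boot-time static estimate: {static:.2f} GB  (fits: {static <= VRAM_GB})")
print(f"runtime worst case:        {runtime:.2f} GB  (fits: {runtime <= VRAM_GB})")
```

The static side fits with room to spare, which is exactly why the server boots clean and then dies later.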
The same problem is documented in the llama.cpp MTP PR #22673 thread: long context + MTP eats more VRAM than the boot-time KV calculation suggests. There’s no --mtp-draft-buf-size flag to cap the runtime allocation; the only knob today is the target quant size.
Fix: drop one Unsloth Dynamic quant tier
havenoammo’s Qwen3.6-27B-MTP-UD-GGUF ships a full Unsloth Dynamic ladder including UD-Q2_K_XL at 12.3 GB (vs Q3_K_XL 14.9 GB). The extra 2.6 GB is exactly what the MTP draft cudagraph captures need at runtime.
Unsloth Dynamic preserves critical layers in higher precision and (per havenoammo’s recipe) keeps the MTP head at Q8_0 even in Q2 variants — so the benchmark quality drop is around 5-8% vs Q3_K_XL rather than the larger drop you’d get going Q3_K_M → Q2_K with standard K-quants.
Direct validation bench
I tested both havenoammo’s UD-Q2_K_XL (the one v1.0.7 ships) and the newly-released unsloth/Qwen3.6-27B-GGUF-MTP UD-Q2_K_XL (same quant tier, official Unsloth) on my Olares One.
Stack:
- Hardware: Olares One — RTX 5090 Laptop, 24GB GDDR7, 896 GB/s, sm_120 Blackwell mobile
- Image: aamsellem/llamacpp-mtp:0.1.0 (custom build from am17an’s MTP branch; will become droppable once #22673 merges upstream)
- Args: --ctx-size 262144 --cache-type-k q4_0 --cache-type-v q4_0 --batch-size 512 --ubatch-size 512 --parallel 1 --flash-attn on --spec-type mtp --spec-draft-n-max 5 --chat-template-kwargs '{"enable_thinking": false}'
- Prompt: Space Invaders HTML, 2000 tokens, temp 0.6, top_p 0.95
- Methodology: 2 warmups + 10 measured runs
havenoammo UD-Q2_K_XL @ 262K + MTP n=5 (what v1.0.7 ships):
runs: 68.60, 71.25, 75.24, 68.46, 76.25, 71.30, 73.12, 68.51, 73.92, 74.70 t/s
AVG = 72.14 t/s · MIN = 68.46 · MAX = 76.25
ZERO CUDA OOM · NO degradation cycle · 10 clean runs
unsloth UD-Q2_K_XL @ 262K + MTP n=5 (the official Unsloth that landed today):
runs: 66.01, 66.37, 64.46, 62.54, 65.48, 64.47, 62.88, 64.05, 61.58, 63.77 t/s
AVG = 64.16 t/s · MIN = 61.58 · MAX = 66.37
ZERO CUDA OOM · NO degradation cycle · 10 clean runs
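For the record, the summary lines follow directly from the raw runs; a quick recomputation in plain Python, nothing engine-specific:

```python
from statistics import mean

havenoammo = [68.60, 71.25, 75.24, 68.46, 76.25, 71.30, 73.12, 68.51, 73.92, 74.70]
unsloth    = [66.01, 66.37, 64.46, 62.54, 65.48, 64.47, 62.88, 64.05, 61.58, 63.77]

def summarize(name, runs):
    avg = mean(runs)
    print(f"{name}: AVG = {avg:.2f} t/s  MIN = {min(runs):.2f}  MAX = {max(runs):.2f}")
    return avg

h = summarize("havenoammo UD-Q2_K_XL", havenoammo)
u = summarize("unsloth    UD-Q2_K_XL", unsloth)
print(f"havenoammo margin at the same quant tier: {100 * (h / u - 1):+.0f}%")
```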
havenoammo wins by +12% at the same quant tier, most plausibly down to the different MTP integration metadata the two GGUFs carry. v1.0.7 keeps havenoammo for that reason; I’ll revisit if unsloth’s MTP integration improves.
Comparison table
| Stack | t/s | Stability | Context |
|---|---|---|---|
| v1.0.5 — havenoammo Q3_K_XL @ 262K + MTP n=5 | 77 (validation bench) | ❌ runtime OOM at MTP draft | 262K |
| v1.0.6 — havenoammo Q3_K_XL @ 128K + MTP n=4 (transient, never shipped) | ~65 (proven safe) | ✅ stable | 128K |
| v1.0.7 — havenoammo Q2_K_XL @ 262K + MTP n=5 (ships now) | 72.14 | ✅ rock-stable | 262K |
| unsloth UD-Q2_K_XL @ 262K + MTP n=5 | 64.16 | ✅ rock-stable | 262K |
v1.0.7 is strictly better than v1.0.5 for any user who values stability — same speedup tier, full context, zero crashes.
Side experiment: NVFP4 drop-Genesis attempt
While I had the GPU free, I tested sudo-0x2a/Qwen3.6-27B-NVFP4-GPTQ (NVFP4 + SmoothQuant + GPTQ, MTP / lm_head / embeddings preserved at BF16) on stock vllm/vllm-openai:tokenspeed-preview-x86_64-ubuntu2404 — no Genesis patches.
Production vllmqwen36turbo27bone runs at 88 t/s using aamsellem/vllm-qwen36-blackwell:0.20.0-genesis-v3 (a 28-patch Genesis-modified vLLM). The whole point of the experiment: can NVFP4 quantization let us drop the custom Genesis build?
Phase 1 — NVFP4 alone (no spec decoding): boots cleanly on stock vLLM. 31.79 t/s STABLE over 10 runs (0.2% spread, ZERO outliers, ultra-deterministic). max-model-len had to drop from 88000 to 28000 because of KV cache budget (3 GB needed, only 1.2 GB free after model load).
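The max-model-len drop falls out of linear KV scaling (the 3 GB / 88000-token and 1.2 GB figures are from the run above; the linearity itself is the standard KV-cache property, not something I measured separately):

```python
# KV cache grows linearly with context length. Phase 1 reported ~3 GB of KV
# needed at max-model-len 88000, with only 1.2 GB free after model load.
kv_at_88k_gb = 3.0
free_gb = 1.2

def kv_fits(max_model_len: int) -> bool:
    # Linear scaling of the KV budget from the 88000-token reference point.
    return kv_at_88k_gb * max_model_len / 88000 <= free_gb

print(kv_fits(88000), kv_fits(28000))  # the original length fails, 28000 fits
```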
Phase 2 — NVFP4 + MTP n=3 (spec decoding): CUDA OOM at MTP head load: needs 2.37 GB free, has 1.0 GB. The Genesis patches free that exact ~2-3 GB somewhere — likely buffer reuse / cudagraph pruning / activation reordering. Without Genesis, the MTP head doesn’t fit alongside the NVFP4 model + KV + non-PyTorch overhead on 24GB.
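In numbers (the 2.37 GB and 1.0 GB figures are from the OOM message; the -64% is the headline Genesis comparison):

```python
# Phase 2 on stock vLLM: the BF16 MTP head does not fit.
mtp_head_need_gb = 2.37   # reported requirement at MTP head load
free_gb = 1.0             # free VRAM after NVFP4 model + KV + overhead

shortfall = mtp_head_need_gb - free_gb
print(f"shortfall: {shortfall:.2f} GB")  # on the order of what Genesis reclaims

regression = 100 * (31.79 / 88.0 - 1)    # NVFP4-only vs Genesis production
print(f"drop-Genesis regression: {regression:.0f}%")
```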
Verdict: keep Genesis in production. Dropping Genesis costs the entire MTP speedup: 88 t/s with Genesis vs 31.79 t/s without is a -64% regression, and simplifying the build chain isn’t worth that.
Side observation worth blogging on its own: the NVFP4-only bench (no spec decoding) showed 0.2% spread across 10 runs. Contrast with my Gemma 4 DFlash result from earlier where the same image showed a reproducible 5-fast/4-slow cycle. This narrows my hypothesis on the DFlash cycle: it’s specifically a spec-decode adaptive throttling effect (matching the “low-yield MTP fallback” pattern documented in llama.cpp PR #22931, closed), NOT a general vLLM engine bug. When the drafter’s acceptance rate temporarily collapses, the verify path’s 4.7× amplification produces the slow phase, and recovery happens once the drafter state realigns.
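The throttling hypothesis is easy to illustrate with a toy speculative-decoding model (my assumptions, not DFlash measurements: i.i.d. per-token acceptance p, draft length n, each draft step costing fraction c of a target step):

```python
def expected_speedup(p: float, n: int = 5, c: float = 0.1) -> float:
    """Toy model: expected spec-decode speedup vs plain autoregressive decoding."""
    # Expected tokens emitted per verify step: the accepted prefix plus the
    # bonus token from verification -> sum_{k=0..n} p^k = (1 - p^(n+1)) / (1 - p).
    tokens = n + 1 if p >= 1.0 else (1 - p ** (n + 1)) / (1 - p)
    cost = n * c + 1.0  # n cheap draft steps + 1 full verify step
    return tokens / cost

for p in (0.8, 0.5, 0.2):
    print(f"acceptance {p:.0%}: {expected_speedup(p):.2f}x")
```

At low acceptance the ratio drops below 1.0x, i.e. slower than not speculating at all, which is exactly the shape of a fast/slow alternation driven by drafter state rather than by the engine.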
Reproducible
Helm chart, exact image tag, all flags, bench harness — all in orales-one-market. Pin: v1.0.7.
Watch list
The custom aamsellem/llamacpp-mtp:0.1.0 image will become unnecessary once am17an’s MTP PR #22673 merges upstream — which is imminent after the prereq #22838 (parallel drafting) merged into master at 16:09 UTC May 11. The rebase is in flight. When the merge lands, v1.0.8 will swap to ghcr.io/ggml-org/llama.cpp:server-cuda13-bNNNN mainline with identical flags.
If anyone running Qwen3.6-27B + MTP on a 24GB card sees a similar reproducible crash, drop to Q2_K_XL — that’s the lever. And ping me if you have a third-party Unsloth Dynamic variant that beats havenoammo’s +12% margin at Q2 tier.