Aurélien AMSELLEM

Qwen3.6-27B + MTP CUDA OOM at 262K context on 24GB — fixed by dropping one UD quant tier

A user hit a reproducible runtime CUDA OOM in the MTP draft path on my Qwen3.6-27B v1.0.5 chart at 262K context: the server boots fine, then the draft buffers scale beyond the static estimate and the pod dies with exit 139 inside common_speculative_state_mtp::draft. Fixed by dropping havenoammo’s UD-Q3_K_XL (14.9 GB) to UD-Q2_K_XL (12.3 GB). A direct bench validates v1.0.7 at 72.14 t/s stable at the full 262K with no OOM. Plus a side experiment: can we drop the Genesis patches by switching to NVFP4? Answer: no.

Two nights ago I shipped llamacppqwen36mtpone v1.0.5 on the orales-one-market catalog: havenoammo’s UD-Q3_K_XL (14.9 GB) at FULL 262K context with MTP speculative decoding n=5. My validation bench showed 77 t/s steady at 262K with 75-80% draft acceptance — a clean win over the prior v1.0.3’s 65 t/s @ 128K.

Then a user reported the reality:

the pod crashed frequently with exit code 139 due to a CUDA OOM during MTP drafting:

/src/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:97: CUDA error
  CUDA error: out of memory
  current device: 0, in function ggml_cuda_graph_evaluate_and_capture
...
common_speculative_state_mtp::draft

I deactivated the MTP and then the llm is running rock-stable. Even with context > 200K there is enough space left to re-activate MTP (it’s tight but OK). But if I do and MTP becomes active, CUDA crashes again.

This is the classic gap between boot-time static memory estimate and runtime worst-case. The CUDA graph capture sizes that the engine actually exercises during real prompts dwarf the boot estimate.


Root cause: MTP draft compute buffer scales with cudagraph capture shapes

When llama.cpp’s MTP speculative decoding runs, it captures CUDA graphs for the draft-head computation, and the capture shapes depend on the actual batch and sequence patterns the engine encounters during inference. The memory estimate made at boot for --ctx-size 262144 --spec-draft-n-max 5 --cache-type-{k,v} q4_0 leaves enough headroom for the static graph captures the engine does during init, but not for the broader set of shapes that real prompts trigger.

In numbers, on a 24GB sm_120 mobile GPU (5090M, RTX 5090 Laptop): the UD-Q3_K_XL weights alone occupy 14.9 GB, so after the 262K q4_0 KV cache and the init-time graph captures the server boots with only a sliver of headroom, and the first long real prompt that forces an extra draft-head capture tips it over the 24 GB.
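
A back-of-the-envelope budget makes the shape of the problem visible. This is a rough sketch, not llama.cpp’s actual allocator: only the weight sizes come from the GGUF cards, and the KV-cache, init-buffer and draft-capture figures are illustrative assumptions.

```python
# Rough 24 GB VRAM budget sketch for Qwen3.6-27B + MTP at 262K context.
# Only the weight sizes (14.9 / 12.3 GB) come from the GGUF cards; the
# KV-cache, boot-time buffer and draft-capture figures are illustrative
# assumptions, not numbers reported by llama.cpp.

TOTAL_VRAM_GB = 24.0

def headroom(weights_gb, kv_cache_gb, boot_buffers_gb, draft_captures_gb):
    """VRAM left after weights, KV cache, init-time buffers and the extra
    MTP draft graph captures exercised at runtime."""
    used = weights_gb + kv_cache_gb + boot_buffers_gb + draft_captures_gb
    return TOTAL_VRAM_GB - used

# Illustrative: ~6 GB for a 262K q4_0 KV cache, ~1.5 GB of init-time
# buffers, ~2.5 GB of additional draft captures under real prompts.
for name, weights_gb in [("UD-Q3_K_XL", 14.9), ("UD-Q2_K_XL", 12.3)]:
    at_boot = headroom(weights_gb, 6.0, 1.5, 0.0)      # what the boot estimate sees
    at_runtime = headroom(weights_gb, 6.0, 1.5, 2.5)   # what real prompts exercise
    print(f"{name}: boot headroom {at_boot:+.1f} GB, runtime headroom {at_runtime:+.1f} GB")
```

With these assumed sizes, Q3_K_XL shows positive headroom at boot and goes negative once the runtime captures land, while Q2_K_XL stays positive in both cases, which is the behaviour the user report describes.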

Same problem documented in the llama.cpp MTP PR #22673 thread: long context + MTP eats more VRAM than the boot-time KV calculation suggests. There’s no --mtp-draft-buf-size flag to limit the runtime allocation — the only knob today is the target quant size.

Fix: drop one Unsloth Dynamic quant tier

havenoammo’s Qwen3.6-27B-MTP-UD-GGUF ships a full Unsloth Dynamic ladder including UD-Q2_K_XL at 12.3 GB (vs Q3_K_XL 14.9 GB). The extra 2.6 GB is exactly what the MTP draft cudagraph captures need at runtime.

Unsloth Dynamic preserves critical layers in higher precision and (per havenoammo’s recipe) keeps the MTP head at Q8_0 even in Q2 variants — so the benchmark quality drop is around 5-8% vs Q3_K_XL rather than the larger drop you’d get going Q3_K_M → Q2_K with standard K-quants.

Direct validation bench

I tested both havenoammo’s UD-Q2_K_XL (the one v1.0.7 ships) and the newly-released unsloth/Qwen3.6-27B-GGUF-MTP UD-Q2_K_XL (same quant tier, official Unsloth) on my Olares One.

Stack:

- image: aamsellem/llamacpp-mtp:0.1.0 (the custom MTP build, see the watch list below)
- flags: --ctx-size 262144 --spec-draft-n-max 5 --cache-type-k q4_0 --cache-type-v q4_0
- hardware: Olares One, 24GB sm_120 GPU (RTX 5090 Laptop)

havenoammo UD-Q2_K_XL @ 262K + MTP n=5 (what v1.0.7 ships):

runs: 68.60, 71.25, 75.24, 68.46, 76.25, 71.30, 73.12, 68.51, 73.92, 74.70 t/s
AVG = 72.14 t/s   MIN = 68.46   MAX = 76.25
ZERO CUDA OOM   NO degradation cycle   10 clean runs

unsloth UD-Q2_K_XL @ 262K + MTP n=5 (the official Unsloth that landed today):

runs: 66.01, 66.37, 64.46, 62.54, 65.48, 64.47, 62.88, 64.05, 61.58, 63.77 t/s
AVG = 64.16 t/s   MIN = 61.58   MAX = 66.37
ZERO CUDA OOM   NO degradation cycle   10 clean runs

havenoammo wins by +12% at the same quant tier; the two GGUFs differ in their MTP integration metadata, which is the most likely explanation for the gap. v1.0.7 keeps havenoammo for that reason; I’ll revisit if unsloth’s MTP integration improves.
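
For reference, the per-run numbers above come from a harness along these lines. This is a hypothetical sketch rather than the exact script in the chart repo; it assumes the server exposes llama.cpp’s OpenAI-compatible /v1/completions endpoint and reports completion_tokens in usage.

```python
# Hypothetical bench harness sketch: ten timed generation runs against the
# llama.cpp server, printed in the same shape as the run lists above.
# The endpoint path, payload fields and prompt are assumptions, not the
# exact harness that lives in the chart repo.
import time
import requests

URL = "http://localhost:8080/v1/completions"  # assumed server address
PROMPT = "Explain speculative decoding in three paragraphs."  # placeholder

def bench_once(max_tokens: int = 512) -> float:
    t0 = time.time()
    r = requests.post(URL, json={
        "prompt": PROMPT,
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }, timeout=600)
    r.raise_for_status()
    elapsed = time.time() - t0
    generated = r.json()["usage"]["completion_tokens"]
    return generated / elapsed

runs = [bench_once() for _ in range(10)]
print("runs:", ", ".join(f"{x:.2f}" for x in runs), "t/s")
print(f"AVG = {sum(runs)/len(runs):.2f}   MIN = {min(runs):.2f}   MAX = {max(runs):.2f}")
```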

Comparison table

| Stack | t/s | Stability | Context |
|---|---|---|---|
| v1.0.5: havenoammo Q3_K_XL @ 262K + MTP n=5 | 77 (validation bench) | ❌ runtime OOM at MTP draft | 262K |
| v1.0.6: havenoammo Q3_K_XL @ 128K + MTP n=4 (transient, never shipped) | ~65 (proven safe) | ✅ stable | 128K |
| v1.0.7: havenoammo Q2_K_XL @ 262K + MTP n=5 (ships now) | 72.14 | ✅ rock-stable | 262K |
| unsloth UD-Q2_K_XL @ 262K + MTP n=5 | 64.16 | ✅ rock-stable | 262K |

v1.0.7 is strictly better than v1.0.5 for any user who values stability — same speedup tier, full context, zero crashes.

Side experiment: NVFP4 drop-Genesis attempt

While I had the GPU free, I tested sudo-0x2a/Qwen3.6-27B-NVFP4-GPTQ (NVFP4 + SmoothQuant + GPTQ, MTP / lm_head / embeddings preserved at BF16) on stock vllm/vllm-openai:tokenspeed-preview-x86_64-ubuntu2404 — no Genesis patches.

Production vllmqwen36turbo27bone runs at 88 t/s using aamsellem/vllm-qwen36-blackwell:0.20.0-genesis-v3 (a 28-patch Genesis-modified vLLM). The whole point of the experiment: can NVFP4 quantization let us drop the custom Genesis build?

Phase 1 — NVFP4 alone (no spec decoding): boots cleanly on stock vLLM. 31.79 t/s STABLE over 10 runs (0.2% spread, ZERO outliers, ultra-deterministic). max-model-len had to drop from 88000 to 28000 because of KV cache budget (3 GB needed, only 1.2 GB free after model load).
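
The max-model-len cut follows from simple arithmetic on those two figures, reading the 3 GB as the KV requirement at 88000. A quick sketch of that arithmetic, not vLLM output:

```python
# Back-of-the-envelope check on the max-model-len cut: ~3 GB of KV cache
# was needed for max-model-len=88000 but only ~1.2 GB was free after the
# NVFP4 weights loaded. Both inputs are the figures quoted above; the
# derived numbers are approximations, not vLLM output.
KV_GB_AT_88000 = 3.0
FREE_GB_AFTER_LOAD = 1.2

gb_per_token = KV_GB_AT_88000 / 88_000
max_len_that_fits = int(FREE_GB_AFTER_LOAD / gb_per_token)

print(f"per-token KV cost ≈ {gb_per_token * 1024**2:.1f} KB")
print(f"largest context that fits ≈ {max_len_that_fits} tokens")
# ≈ 35000 tokens in theory; 28000 leaves margin for activations and
# allocator fragmentation.
```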

Phase 2 — NVFP4 + MTP n=3 (spec decoding): CUDA OOM at MTP head load: needs 2.37 GB free, has 1.0 GB. The Genesis patches free that exact ~2-3 GB somewhere — likely buffer reuse / cudagraph pruning / activation reordering. Without Genesis, the MTP head doesn’t fit alongside the NVFP4 model + KV + non-PyTorch overhead on 24GB.

Verdict: keep Genesis in production. Drop-Genesis costs the entire MTP speedup. 88 t/s with Genesis vs 31.79 t/s without is a -64% regression; the simplification of the build chain isn’t worth that.

Side observation worth blogging on its own: the NVFP4-only bench (no spec decoding) showed 0.2% spread across 10 runs. Contrast with my Gemma 4 DFlash result from earlier where the same image showed a reproducible 5-fast/4-slow cycle. This narrows my hypothesis on the DFlash cycle: it’s specifically a spec-decode adaptive throttling effect (matching the “low-yield MTP fallback” pattern documented in llama.cpp PR #22931, closed), NOT a general vLLM engine bug. When the drafter’s acceptance rate temporarily collapses, the verify path’s 4.7× amplification produces the slow phase, and recovery happens once the drafter state realigns.
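
To make that mechanism concrete, here is a toy throughput model of draft-then-verify decoding: standard speculative-decoding accounting with made-up per-step costs, not a measurement of DFlash or MTP internals.

```python
# Toy speculative-decoding throughput model: how a collapse in per-token
# draft acceptance turns into a distinct slow phase. Per-step costs and
# acceptance values are made up for illustration.

def tokens_per_second(alpha: float, k: int = 5,
                      t_target_ms: float = 30.0, t_draft_ms: float = 3.0) -> float:
    """Expected throughput of draft-then-verify decoding.

    alpha       : per-token probability the target accepts a drafted token
    k           : drafted tokens per verify step
    t_target_ms : cost of one target-model verify pass
    t_draft_ms  : cost of drafting one token
    """
    if alpha >= 1.0:
        expected_tokens = k + 1.0
    else:
        # accepted prefix plus the one token the verify pass always yields
        expected_tokens = (1.0 - alpha ** (k + 1)) / (1.0 - alpha)
    step_s = (t_target_ms + k * t_draft_ms) / 1000.0
    return expected_tokens / step_s

for alpha in (0.8, 0.5, 0.2):
    print(f"acceptance {alpha:.1f} -> {tokens_per_second(alpha):5.1f} t/s")
```

With these made-up costs, dropping from 0.8 to 0.2 acceptance roughly triples per-token latency, the kind of bimodal fast/slow signature the DFlash cycle showed.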

Reproducible

Helm chart, exact image tag, all flags, bench harness — all in orales-one-market. Pin: v1.0.7.

Watch list

The custom aamsellem/llamacpp-mtp:0.1.0 image will become unnecessary once am17an’s MTP PR #22673 merges upstream — which is imminent after the prereq #22838 (parallel drafting) merged into master at 16:09 UTC May 11. The rebase is in flight. When the merge lands, v1.0.8 will swap to ghcr.io/ggml-org/llama.cpp:server-cuda13-bNNNN mainline with identical flags.

If anyone running Qwen3.6-27B + MTP on a 24GB card sees a similar reproducible crash, drop to Q2_K_XL — that’s the lever. And ping me if you have a third-party Unsloth Dynamic variant that beats havenoammo’s +12% margin at Q2 tier.
