Aurélien AMSELLEM

MTP merged in llama.cpp master — and the n_max default change everyone missed (86.7% accept on Qwen3.6 27B Blackwell mobile)

MTP support was merged into llama.cpp master on May 16th. Within four days, three follow-up PRs quietly changed how MTP behaves, including the spec-draft-n-max default flipping from 16 to 3. On Olares One (RTX 5090M sm_120), that change plus NVIDIA's backend-sampling rewrite (#23287) pushed Qwen3.6 27B MTP from 64% to 86.7% draft acceptance: +22 points. Nobody is talking about this.

If you’ve been running Qwen3.6 27B with MTP on a custom am17an branch image since early May, you can stop. PR #22673 landed in ggml-org/llama.cpp master on May 16th, so the long custom-branch detour is over.

But “MTP merged” isn’t the actual news. The news is what came after the merge: three follow-up PRs landed in the next four days that quietly change how MTP behaves at the operating point most people are running it at. And one of them is a default value flip that I think most users will have missed.

Here’s what I found on Olares One (RTX 5090M Laptop, 24 GB GDDR7, sm_120 Blackwell consumer mobile).

The three follow-ups that matter

After the May 16 merge, three PRs landed against tools/server/ and common/speculative/ that touch the MTP draft path:

#23269 — “MTP clean-up” (ggerganov, 2026-05-19) — the headline change here, buried in the body, is:

Change default value of --spec-draft-n-max from 16 to 3

Plus several less-visible fixes: --spec-draft-p-min was broken and is now functional again (default 0.0), graph reuse logic was wrong for graphs with batch.token && batch.embd (which is exactly the MTP drafting graph), and all speculative implementations now see the accepted tokens — relevant when chaining multiple spec impls. Worth reading the PR body.

#23287 — “Move to backend sampling for MTP draft path” (gaugarg-nv, 2026-05-20) — NVIDIA-affiliated patch. Moves the draft-path sampling off the host and onto the CUDA backend. Fewer host roundtrips on the critical drafting loop, less synchronization overhead per draft step.

#22522 — “Programmatic Dependent Launch (PDL) for newer NVIDIA GPUs (Hopper+)” (aendk, 2026-05-20) — PDL is a CUDA 12+ feature that lets the GPU schedule the next kernel before the previous one finishes, reducing kernel-launch latency. Tagged “Hopper+” but Blackwell consumer mobile (sm_120) qualifies.

None of these is MTP-the-feature; they’re MTP-the-feature-but-actually-tuned. If you stopped paying attention after the May 16 merge, you’d assume the master build is “MTP works” and move on. But the operating point shifted under your feet.
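Because the default moved, a launcher that omits --spec-draft-n-max will behave differently depending on which build it runs against. A minimal sketch of pinning the value explicitly, assuming a Python wrapper around the server command line: the helper name build_mtp_args is hypothetical, and only the flag strings are taken from this post.

```python
# Hypothetical launcher helper. Only the flag names come from the post;
# the function name and its arguments are illustrative.
def build_mtp_args(n_max, p_min=None):
    """Return MTP speculative-decoding flags with the draft depth pinned."""
    if n_max < 1:
        raise ValueError("n_max must be at least 1")
    args = ["--spec-type", "mtp", "--spec-draft-n-max", str(n_max)]
    # Only pass p-min when the caller asks for it; otherwise the build default applies.
    if p_min is not None:
        args += ["--spec-draft-p-min", str(p_min)]
    return args


old_recipe = build_mtp_args(5)   # the early-May recipe value
new_default = build_mtp_args(3)  # the post-#23269 default, made explicit

print(" ".join(old_recipe))
print(" ".join(new_default))
```

Pinning the value costs one flag and makes a before/after comparison reproducible across image rebuilds.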

Before: my v1.0.8 prod baseline

Until last night, my llamacppqwen36mtpone v1.0.8 was running a custom image built from am17an’s pre-merge branch (aamsellem/llamacpp-mtp:0.1.0) — the only way to get MTP on Blackwell before the merge. Config was straight from the Reddit MTP recipe that came out early May:

--spec-type mtp --spec-draft-n-max 5
--ctx-size 262144 --cache-type-k q4_0 --cache-type-v q4_0
--batch-size 512 --ubatch-size 512 --parallel 1 --flash-attn on

Validated on unsloth/Qwen3.6-27B-MTP-GGUF UD-Q3_K_XL (14.5 GB):

72.75 t/s AVG, ~64% draft acceptance.

Stable, zero OOM at full 262K context, no degradation cycle. Good enough that I stopped fiddling with it.

After: master HEAD with all three follow-ups

I built aamsellem/llama-cpp-mtp:master-ad27757 from ggml-org/llama.cpp master HEAD ad27757 (which is the merge commit of #23287, so all three follow-up PRs are included). Same model, same KV cache config, same context window. Two flag changes:

Ten back-to-back Space Invaders runs (2000 tokens each):

Run   t/s     draft_n   accepted   accept %
1     74.90   1668      1442       86.5%
2     72.60   1713      1428       83.4%
3     75.01   1653      1447       87.5%
4     74.32   1663      1444       86.8%
5     74.31   1660      1445       87.0%
6     73.73   1672      1441       86.2%
7     73.87   1668      1442       86.5%
8     75.14   1639      1452       88.6%
9     74.35   1659      1446       87.2%
10    74.62   1650      1448       87.8%

74.28 t/s AVG, 86.7% accept AVG. Range 72.60 to 75.14 (spread 2.54 t/s, sample standard deviation ≈ 0.75 t/s).

Pure speed: +2.1% vs v1.0.8.

Acceptance: +22 percentage points (64% → 86.7%).
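The summary figures can be recomputed from the ten runs in the table. The sketch below uses only the Python standard library and the numbers listed above; the 72.75 t/s baseline is the v1.0.8 figure quoted earlier. Note that 2.54 is the max-minus-min spread of the runs, while the sample standard deviation comes out considerably smaller.

```python
import statistics

# (t/s, draft_n, accepted) for runs 1..10, copied from the results table
runs = [
    (74.90, 1668, 1442),
    (72.60, 1713, 1428),
    (75.01, 1653, 1447),
    (74.32, 1663, 1444),
    (74.31, 1660, 1445),
    (73.73, 1672, 1441),
    (73.87, 1668, 1442),
    (75.14, 1639, 1452),
    (74.35, 1659, 1446),
    (74.62, 1650, 1448),
]

tps = [r[0] for r in runs]
mean_tps = statistics.mean(tps)
spread = max(tps) - min(tps)            # range of the ten runs
sample_sd = statistics.stdev(tps)       # sample standard deviation (n - 1)

# Pooled acceptance: all accepted draft tokens over all drafted tokens
pooled_accept = sum(r[2] for r in runs) / sum(r[1] for r in runs)

baseline_tps = 72.75                    # v1.0.8 average from the post
speedup = mean_tps / baseline_tps - 1

print(f"mean {mean_tps:.2f} t/s, spread {spread:.2f}, sd {sample_sd:.2f}")
print(f"pooled acceptance {pooled_accept:.1%}, speedup {speedup:.1%}")
```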

Why the acceptance jump is the real story

If you only look at the throughput delta, +2% is forgettable. The interesting number is the acceptance jump.

At n_max=5 with the old sampling, we were drafting 5 tokens per step and accepting ~3.2 of them (64%). The model proposed tokens that the verify pass rejected often enough that ~36% of the draft compute was wasted.

At n_max=3 with backend sampling, we draft 3 tokens per step and accept ~2.6 of them. Fewer drafts, but each one is much more likely to be right. The verify-reject rate dropped from 36% to 13%. That’s a different operating regime — closer to what MTP-the-paper actually promises.

The two changes compound. n_max=3 reduces the search depth so each draft proposal is in higher-probability territory. Backend sampling removes a host-CPU bottleneck that was probably introducing latency-induced drift in the draft distribution (the host-sampled tokens were less coherent with the GPU’s expected next-token distribution because of the roundtrip). Together: smaller search, tighter coupling, much higher acceptance.

The throughput stayed roughly flat because drafting fewer tokens per step (lose a bit) while accepting a larger share of them (gain a bit) ends up in nearly the same place. The win you get isn’t t/s; it’s lower variance and less wasted compute. Look at the range: 72.60 to 75.14 t/s, a spread of 2.54 over 10 runs. The previous setup sat around 5-6 on the same prompt. A tighter range is real value if you’re running an agentic stack where consistency matters more than peak.
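The per-step arithmetic behind those numbers is short enough to write down. This is a sketch under two assumptions that the post does not state explicitly: every draft step proposes exactly n_max tokens, and every verify pass contributes one token from the main model on top of the accepted drafts. The function name per_step is illustrative. Run 1 of the table (draft_n 1668, accepted 1442) serves as a cross-check against the 2000-token generation length.

```python
def per_step(n_max, accept_rate):
    """Expected draft tokens accepted, wasted, and total tokens emitted per step."""
    accepted = n_max * accept_rate
    wasted = n_max - accepted
    # Assumption: the verify pass always yields one extra main-model token.
    emitted = accepted + 1
    return accepted, wasted, emitted


old = per_step(5, 0.64)    # early-May recipe
new = per_step(3, 0.867)   # master with the follow-up PRs

print("old: accepted %.2f, wasted %.2f, emitted %.2f" % old)
print("new: accepted %.2f, wasted %.2f, emitted %.2f" % new)

# Cross-check with run 1: if each step drafts 3 tokens, 1668 drafted tokens
# means 556 steps, and steps + accepted should land near the 2000-token budget.
steps = 1668 / 3
total_tokens = steps + 1442
print("run 1: %.0f steps, about %.0f tokens" % (steps, total_tokens))
```

Under these assumptions the old setup emits more tokens per step but throws away about 1.8 drafted tokens each time, against about 0.4 for the new one, which is consistent with throughput staying roughly flat while wasted draft work drops.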

The CUDA driver propagation bug, and Anbeeld’s quiet fix

One detail nobody is going to mention but that cost me a build cycle: at b9219, ggml-org/llama.cpp master fails its final link when you build with BUILD_SHARED_LIBS=ON + GGML_CUDA=ON on Ubuntu 24.04 / CUDA 13.0. The error:

/usr/bin/ld: libggml-cuda.so.0.12.0: undefined reference to `cuMemCreate'
                                     undefined reference to `cuMemAddressReserve'
                                     [...11 more cuMem*/cuDevice* symbols]

The shared library itself links (it’s allowed to have unresolved symbols), but the downstream executable link refuses to resolve them transitively.

Root cause: ggml/src/ggml-cuda/CMakeLists.txt links ggml-cuda against CUDA::cuda_driver with PRIVATE scope. For executables that link against ggml-cuda through ggml and llama, the dependency doesn’t propagate.

Anbeeld already fixed this on his BeeLlama fork (Anbeeld/beellama.cpp#18) by adding a link-only INTERFACE dependency next to the existing PRIVATE one:

target_link_libraries(ggml-cuda PRIVATE CUDA::cuda_driver)
if (NOT GGML_BACKEND_DL)
    target_link_libraries(ggml-cuda INTERFACE $<LINK_ONLY:CUDA::cuda_driver>)
endif()

I filed the same bug upstream as ggml-org/llama.cpp#23357 with Anbeeld’s fix as the proposed solution. Pending review.

In the meantime my image carries a workaround:

-DCMAKE_EXE_LINKER_FLAGS="-L/usr/local/cuda/lib64/stubs -lcuda"
-DCMAKE_SHARED_LINKER_FLAGS="-L/usr/local/cuda/lib64/stubs -lcuda"

Functionally identical, just less elegant.

What I shipped

llamacppqwen36mtpone v1.0.9 on my Olares One market source:

Drops the custom am17an branch image that’s been carrying us since early May. If you’ve been waiting for MTP to land cleanly in ggml-org/llama.cpp master before switching off your own custom build: now’s the moment.

What to test next

A few things I haven’t measured yet that you might want to:

If you run any of these on Blackwell consumer (5090 desktop, 5090M mobile, or 5080), drop me a line — I’d love to compare numbers.


Hardware: Olares One — RTX 5090M Laptop (24 GB GDDR7, sm_120 Blackwell consumer mobile), Intel Core Ultra 9 275HX 24-core, 96 GB DDR5. Software: my own Olares market source deployed at orales-one-market.aamsellem.workers.dev — single-click install of optimized AI apps for the Olares One device. Bench prompt: Space Invaders HTML game completion, 2000 tokens, temp=0.6 top_k=20 min_p=0. Ten runs back-to-back, single user, no warmup beyond model load.
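For anyone who wants to reproduce the measurement, here is a rough sketch of a single bench run against llama-server's /completion endpoint, using the sampling settings listed above. Several things here are assumptions rather than details from this post: the server address, the function names, and the field names read from the response's timings object (predicted_per_second, draft_n, draft_n_accepted), which may differ between builds.

```python
import json
import urllib.request

# Hypothetical address; adjust to wherever llama-server is listening.
SERVER_URL = "http://127.0.0.1:8080/completion"


def run_once(prompt, url=SERVER_URL):
    """Send one 2000-token completion request and return the timings object."""
    payload = {
        "prompt": prompt,
        "n_predict": 2000,
        "temperature": 0.6,
        "top_k": 20,
        "min_p": 0.0,
    }
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["timings"]


def summarize(timings):
    """Reduce a timings dict to the columns used in the results table."""
    # Assumed field names; missing draft counters are treated as zero.
    draft_n = timings.get("draft_n", 0)
    accepted = timings.get("draft_n_accepted", 0)
    rate = accepted / draft_n if draft_n else None
    return {
        "tps": timings["predicted_per_second"],
        "draft_n": draft_n,
        "accepted": accepted,
        "accept_rate": rate,
    }
```

Calling summarize(run_once(prompt)) ten times in a row with the same prompt would give one row per run in the shape of the results table.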
