MTP merged in llama.cpp master — and the n_max default change everyone missed (86.7% accept on Qwen3.6 27B Blackwell mobile)

If you’ve been running Qwen3.6 27B with MTP on a custom am17an branch image since early May, you can stop. PR #22673 landed in ggml-org/llama.cpp master on May 16th, so the long custom-branch detour is over.

But “MTP merged” isn’t the actual news. The news is what came after the merge: three follow-up PRs landed in the next four days that quietly change how MTP behaves at the operating point most people are running it at. And one of them is a default value flip that I think most users will have missed.

Here’s what I found on Olares One (RTX 5090M Laptop, 24 GB GDDR7, sm_120 Blackwell consumer mobile).

The three follow-ups that matter

After the May 16 merge, three PRs landed against tools/server/ and common/speculative/ that touch the MTP draft path:

#23269 — “MTP clean-up” (ggerganov, 2026-05-19) — the headline change here, buried in the body, is:

Change default value of --spec-draft-n-max from 16 to 3

Plus several less-visible fixes: --spec-draft-p-min was broken and is now functional again (default 0.0), graph reuse logic was wrong for graphs with batch.token && batch.embd (which is exactly the MTP drafting graph), and all speculative implementations now see the accepted tokens — relevant when chaining multiple spec impls. Worth reading the PR body.

#23287 — “Move to backend sampling for MTP draft path” (gaugarg-nv, 2026-05-20) — NVIDIA-affiliated patch. Moves the draft-path sampling off the host and onto the CUDA backend. Fewer host roundtrips on the critical drafting loop, less synchronization overhead per draft step.

#22522 — “Programmatic Dependent Launch (PDL) for newer NVIDIA GPUs (Hopper+)” (aendk, 2026-05-20) — PDL is a CUDA 12+ feature that lets the GPU schedule the next kernel before the previous one finishes, reducing kernel-launch latency. Tagged “Hopper+” but Blackwell consumer mobile (sm_120) qualifies.

None of these are MTP-the-feature; they’re MTP-the-feature-but-actually-tuned. If you stopped paying attention after the May 16 merge, you’d assume the master build is “MTP works” and move on. But the operating point shifted under your feet.

Before: my v1.0.8 prod baseline

Until last night, my llamacppqwen36mtpone v1.0.8 was running a custom image built from am17an’s pre-merge branch (aamsellem/llamacpp-mtp:0.1.0) — the only way to get MTP on Blackwell before the merge. Config was straight from the Reddit MTP recipe that came out early May:

--spec-type mtp --spec-draft-n-max 5
--ctx-size 262144 --cache-type-k q4_0 --cache-type-v q4_0
--batch-size 512 --ubatch-size 512 --parallel 1 --flash-attn on

Validated on unsloth/Qwen3.6-27B-MTP-GGUF UD-Q3_K_XL (14.5 GB):

72.75 t/s AVG, ~64% draft acceptance.

Stable, zero OOM at full 262K context, no degradation cycle. Good enough that I stopped fiddling with it.

After: master HEAD with all three follow-ups

I built aamsellem/llama-cpp-mtp:master-ad27757 from ggml-org/llama.cpp master HEAD ad27757 (which is the merge commit of #23287, so all three follow-up PRs are included). Same model, same KV cache config, same context window. Two flag changes:

--spec-type mtp → --spec-type draft-mtp (flag was renamed upstream on 2026-05-13)
--spec-draft-n-max 5 → 3 (matches the new upstream default from #23269)
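The two flag changes amount to a small rewrite of the argument list. A minimal sketch, assuming the server is launched from a Python wrapper; `migrate_mtp_args` is a hypothetical helper, and it only touches the flags and values quoted in this post:

```python
# Hypothetical helper: rewrites a pre-merge MTP argument list into the
# post-merge form described above. Only flags quoted in this post are used.
OLD_ARGS = [
    "--spec-type", "mtp", "--spec-draft-n-max", "5",
    "--ctx-size", "262144", "--cache-type-k", "q4_0", "--cache-type-v", "q4_0",
    "--batch-size", "512", "--ubatch-size", "512", "--parallel", "1",
    "--flash-attn", "on",
]


def migrate_mtp_args(args, n_max=3):
    """Return a copy with the renamed spec type and the new draft depth."""
    out = list(args)
    for i in range(len(out) - 1):
        flag, value = out[i], out[i + 1]
        if flag == "--spec-type" and value == "mtp":
            out[i + 1] = "draft-mtp"      # value renamed upstream
        elif flag == "--spec-draft-n-max":
            out[i + 1] = str(n_max)       # follow the new upstream default
    return out


NEW_ARGS = migrate_mtp_args(OLD_ARGS)
print(" ".join(NEW_ARGS))
```

Everything else (context size, KV cache types, batch sizes) passes through unchanged, which is what keeps the before/after comparison fair.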

Ten back-to-back Space Invaders runs (2000 tokens each):

Run	t/s	draft_n	accepted	accept %
1	74.90	1668	1442	86.5%
2	72.60	1713	1428	83.4%
3	75.01	1653	1447	87.5%
4	74.32	1663	1444	86.8%
5	74.31	1660	1445	87.0%
6	73.73	1672	1441	86.2%
7	73.87	1668	1442	86.5%
8	75.14	1639	1452	88.6%
9	74.35	1659	1446	87.2%
10	74.62	1650	1448	87.8%

74.28 t/s AVG, 86.7% accept AVG. Range 72.60 – 75.14, a spread of 2.54 t/s (sample standard deviation ≈ 0.75).

Pure speed: +2.1% vs v1.0.8.

Acceptance: roughly +23 percentage points (~64% → 86.7%).
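The summary figures can be recomputed directly from the table. A short sketch using only the numbers above and the standard library; it keeps the max-minus-min spread separate from the sample standard deviation, since the two are easy to conflate:

```python
import statistics

# (t/s, draft_n, accepted) for the ten runs in the table above.
RUNS = [
    (74.90, 1668, 1442), (72.60, 1713, 1428), (75.01, 1653, 1447),
    (74.32, 1663, 1444), (74.31, 1660, 1445), (73.73, 1672, 1441),
    (73.87, 1668, 1442), (75.14, 1639, 1452), (74.35, 1659, 1446),
    (74.62, 1650, 1448),
]
BASELINE_TPS = 72.75  # v1.0.8 average quoted earlier

tps = [run[0] for run in RUNS]
accept = [100.0 * run[2] / run[1] for run in RUNS]

mean_tps = statistics.mean(tps)
spread = max(tps) - min(tps)        # max minus min, not a standard deviation
stdev_tps = statistics.stdev(tps)   # sample standard deviation (n - 1)
mean_accept = statistics.mean(accept)
speedup_pct = 100.0 * (mean_tps / BASELINE_TPS - 1.0)

print(f"mean t/s        {mean_tps:.3f}")
print(f"spread t/s      {spread:.2f}")
print(f"stdev t/s       {stdev_tps:.2f}")
print(f"mean accept %   {mean_accept:.1f}")
print(f"vs baseline %   {speedup_pct:+.1f}")
```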

Why the acceptance jump is the real story

If you only look at the throughput delta, +2% is forgettable. The interesting number is the acceptance jump.

At n_max=5 with the old sampling, we were drafting 5 tokens per step and accepting ~3.2 of them. The verify pass rejected the proposed tokens often enough that ~36% of the draft compute was wasted.

At n_max=3 with backend sampling, we draft 3 tokens per step and accept ~2.6 of them. Fewer drafts, but each one is much more likely to be right. The verify-reject rate dropped from 36% to 13%. That’s a different operating regime — closer to what MTP-the-paper actually promises.
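The per-step numbers are just draft depth times acceptance rate. A rough sketch of that arithmetic, assuming every step drafts exactly n_max tokens and that the overall accepted/drafted ratio applies uniformly (a simplification, not how the server accounts for it internally); `per_step` is a hypothetical name:

```python
def per_step(n_max, accept_rate):
    """Expected accepted and rejected draft tokens for one draft step.

    Assumes a full n_max-token draft every step and a uniform acceptance
    rate, which is the back-of-envelope model used in the text.
    """
    accepted = n_max * accept_rate
    rejected = n_max - accepted
    return accepted, rejected


for label, n_max, rate in [("n_max=5, old", 5, 0.64), ("n_max=3, new", 3, 0.867)]:
    accepted, rejected = per_step(n_max, rate)
    print(f"{label}: {accepted:.1f} accepted, {rejected:.1f} rejected per step, "
          f"{100 * (1 - rate):.0f}% of drafts wasted")
```

Under this model the old setup throws away about 1.8 drafted tokens per step and the new one about 0.4, which is where the drop in wasted draft compute comes from.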

The two changes compound. n_max=3 reduces the search depth so each draft proposal is in higher-probability territory. Backend sampling removes a host-CPU bottleneck that was probably introducing latency-induced drift in the draft distribution (the host-sampled tokens were less coherent with the GPU’s expected next-token distribution because of the roundtrip). Together: smaller search, tighter coupling, much higher acceptance.

The throughput stayed roughly flat because drafting fewer tokens (lose a bit) while accepting more of them (gain a bit) ends up in the same place. The win you get isn’t t/s; it’s lower run-to-run variation and less wasted compute. Look at the range: 2.54 t/s between the slowest and fastest of the 10 runs. The previous setup varied by around 5-6 t/s on the same prompt. Tighter is real value if you’re running an agentic stack where consistency matters more than peak.

The CUDA driver propagation bug, and Anbeeld’s quiet fix

One detail nobody is going to mention but that cost me a build cycle: at b9219, ggml-org/llama.cpp master fails its final link when you build with BUILD_SHARED_LIBS=ON + GGML_CUDA=ON on Ubuntu 24.04 / CUDA 13.0. The error:

/usr/bin/ld: libggml-cuda.so.0.12.0: undefined reference to `cuMemCreate'
                                     undefined reference to `cuMemAddressReserve'
                                     [...11 more cuMem*/cuDevice* symbols]

The shared library itself links (it’s allowed to have unresolved symbols), but the downstream executable link refuses to resolve them transitively.

Root cause: ggml/src/ggml-cuda/CMakeLists.txt links ggml-cuda against CUDA::cuda_driver with PRIVATE scope. For executables that link against ggml-cuda through ggml and llama, the dependency doesn’t propagate.

Anbeeld already fixed this on his BeeLlama fork (Anbeeld/beellama.cpp#18) with a small addition, a guarded INTERFACE link next to the existing PRIVATE one:

target_link_libraries(ggml-cuda PRIVATE CUDA::cuda_driver)
if (NOT GGML_BACKEND_DL)
    target_link_libraries(ggml-cuda INTERFACE $<LINK_ONLY:CUDA::cuda_driver>)
endif()

I filed the same bug upstream as ggml-org/llama.cpp#23357 with Anbeeld’s fix as the proposed solution. Pending review.

In the meantime my image carries a workaround:

-DCMAKE_EXE_LINKER_FLAGS="-L/usr/local/cuda/lib64/stubs -lcuda"
-DCMAKE_SHARED_LINKER_FLAGS="-L/usr/local/cuda/lib64/stubs -lcuda"

Functionally identical, just less elegant.

What I shipped

llamacppqwen36mtpone v1.0.9 on my Olares One market source:

Image: aamsellem/llama-cpp-mtp:master-ad27757 (1635 MB, amd64+CUDA13+sm_120, Docker Hub digest sha256:7451496e224c9)
Source HF repo: unsloth/Qwen3.6-27B-MTP-GGUF (the canonical name since the May 16 Unsloth re-release, which renamed GGUF-MTP to MTP-GGUF)
Same Q3_K_XL UD model bits as v1.0.8
--spec-type draft-mtp --spec-draft-n-max 3 (matches new upstream defaults)
q4_0 KV @ 262K context

Drops the custom am17an branch image that’s been carrying us since early May. If you’ve been waiting for MTP to land cleanly in ggml-org/llama.cpp master before switching off your own custom build: now’s the moment.

What to test next

A few things I haven’t measured yet that you might want to:

--spec-draft-p-min is functional again. Default 0.0 means it doesn’t filter. I’d be curious what 0.5 or 0.7 does to the n_max=3 / 86% accept point.
--spec-draft-n-max 2 would be the next probe. If 5→3 lost no throughput, can we go lower without losing acceptance? At n_max=2 you’d basically be running pure speculative-1 with MTP head.
PDL (#22522): it’s enabled by default on sm_90+ now, but I haven’t measured what fraction of the +2% throughput delta vs v1.0.8 comes from PDL vs the n_max change vs the backend sampling. Disabling PDL via env var would isolate it.
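The first two probes form a small grid. A sketch of how the flag combinations could be enumerated for a sweep; the value lists are just the ones floated above, `sweep_configs` is a hypothetical helper, and actually launching the server and collecting timings is left out:

```python
from itertools import product

BASE_ARGS = ["--spec-type", "draft-mtp"]
N_MAX_VALUES = [2, 3]            # 3 is the measured point, 2 the next probe
P_MIN_VALUES = [0.0, 0.5, 0.7]   # 0.0 is the default (no filtering)


def sweep_configs():
    """Return one argument list per (n_max, p_min) combination."""
    configs = []
    for n_max, p_min in product(N_MAX_VALUES, P_MIN_VALUES):
        configs.append(BASE_ARGS + [
            "--spec-draft-n-max", str(n_max),
            "--spec-draft-p-min", str(p_min),
        ])
    return configs


for args in sweep_configs():
    print(" ".join(args))
```

Six configurations at ten runs each stays manageable, and keeping n_max=3 / p_min=0.0 in the grid gives a control run against the numbers in this post.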

If you run any of these on Blackwell consumer (5090 desktop, 5090M mobile, or 5080), drop me a line — I’d love to compare numbers.

Hardware: Olares One — RTX 5090M Laptop (24 GB GDDR7, sm_120 Blackwell consumer mobile), Intel Core Ultra 9 275HX 24-core, 96 GB DDR5. Software: my own Olares market source deployed at orales-one-market.aamsellem.workers.dev — single-click install of optimized AI apps for the Olares One device. Bench prompt: Space Invaders HTML game completion, 2000 tokens, temp=0.6 top_k=20 min_p=0. Ten runs back-to-back, single user, no warmup beyond model load.
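For anyone reproducing the bench, the sampling settings above map onto a request body roughly like the following. This is a sketch under assumptions: it presumes llama-server's /completion endpoint and a `timings` object in the response carrying `predicted_per_second`, `draft_n` and `draft_n_accepted`; verify those field names against your build. `bench_payload` and `summarize` are hypothetical names, and the HTTP call itself is omitted.

```python
def bench_payload(prompt):
    """Request body mirroring the bench settings (field names assumed)."""
    return {
        "prompt": prompt,
        "n_predict": 2000,
        "temperature": 0.6,
        "top_k": 20,
        "min_p": 0.0,
    }


def summarize(timings):
    """Turn a timings dict into (tokens per second, acceptance percent)."""
    drafted = timings.get("draft_n", 0)
    accepted = timings.get("draft_n_accepted", 0)
    accept_pct = 100.0 * accepted / drafted if drafted else None
    return timings.get("predicted_per_second"), accept_pct


# Run 1 from the results table, re-expressed in the assumed timings shape.
example = {"predicted_per_second": 74.90, "draft_n": 1668, "draft_n_accepted": 1442}
tps, accept_pct = summarize(example)
print(tps, round(accept_pct, 1))
```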