Hi there.
Last night on r/LocalLLaMA, u/ilintar announced that am17an’s MTP PR went into beta. 334 upvotes in a few hours. The kicker: “expect most performance gaps between llama.cpp and vLLM… to be erased.”
Spoiler: on my machine, baseline llama.cpp = 35 t/s. With PR #22673 + the right MTP-enabled GGUF: 78 t/s. +123%. Without touching Genesis, Lucebox, or HAMi. Here’s how.
Context: MTP was in vLLM, not in llama.cpp
Multi-Token Prediction is a speculative decoding technique where the model learns to predict several tokens ahead in a single forward pass. At serve time, those predictions are used as a free “draft”: the model verifies in parallel, accepts the good ones, drops the bad ones. With decent acceptance, that’s roughly 2× faster.
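Back-of-envelope, with the simplifying assumption that each draft token is accepted independently with probability p: with 4 draft tokens, the expected output per verification pass is

    1 + p + p² + p³ + p⁴ ≈ 3.05 tokens for p ≈ 0.75

Each pass costs a bit more than a single decode step (the target still runs over the drafted positions), which is how ~3 tokens per pass nets out around 2× end to end rather than 3×.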
Qwen trained Qwen3.6 with a built-in MTP head. vLLM has been able to use it for a while (--speculative-config '{"method":"mtp","num_speculative_tokens":3}'). llama.cpp could not — until Aman Gupta’s PR #22673, opened May 4, 2026.
The PR adds:
- An `mtp` class in the `--spec-type` menu
- The pipeline that shares hidden states between the target model and the MTP head
- Support for Qwen3.6 27B and 35B-A3B
Aman tests on a DGX Spark. The Reddit community was quick to try it on other hardware:
- alexandrupetraru: Strix Halo, Qwen3.6-35B-A3B q8: 40 → 70 tg/s (+75%)
- GloballyUniquePlaceholder: 3060 Laptop 6GB + 64GB RAM → real speedup
- superjamie: 3× RTX 3060, tested OK with `RDson/Qwen3.6-27B-MTP-Q4_K_M-GGUF`
No consumer Blackwell bench yet. My Olares One with its RTX 5090M is the candidate.
The build: am17an/mtp-clean + sm_120
PR #22673 lives on the mtp-clean branch of the am17an/llama.cpp fork. Five commits:
1a4fe4e llama: allow partial seq_rm for GDN models for spec decoding
589490f add enum for part sequence removal
c5e0227 rename rollback to rs_seq
10829db llama + spec: MTP support
f8c6b03 add qwen35moe_mtp
Rather than cherry-picking into buun-llama-cpp (my usual DFlash fork), I build straight from this branch in minimal mode — just llama-server for Qwen3.6 27B + native sm_120:
FROM nvidia/cuda:13.1.0-devel-ubuntu22.04 AS build
RUN git clone --depth 1 --branch mtp-clean \
https://github.com/am17an/llama.cpp /src/llama.cpp
WORKDIR /src/llama.cpp
RUN cmake -B build \
-DGGML_CUDA=ON -DGGML_CUDA_NO_VMM=ON \
-DCMAKE_CUDA_ARCHITECTURES=120 \
-DLLAMA_CURL=ON \
-DCMAKE_EXE_LINKER_FLAGS="-Wl,-rpath-link,/usr/local/cuda/lib64/stubs" \
-DCMAKE_BUILD_TYPE=Release && \
cmake --build build -j$(nproc) --target llama-server
A ~2-hour cross-compile on an OrbStack Mac. Final image: aamsellem/llamacpp-mtp:0.1.0, 2.62 GB.
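To run it, roughly (a sketch, assuming the final image stage keeps llama-server as its entrypoint and the MTP GGUF from the next section sits in ~/models on the host; the full flag list is in the llama-server config section below):

```bash
docker run --rm --gpus all \
  -p 8000:8000 \
  -v ~/models:/models \
  aamsellem/llamacpp-mtp:0.1.0 \
  --model /models/Qwen3.6-27B-MTP-Q4_K_M.gguf \
  --host 0.0.0.0 --port 8000 \
  --spec-type mtp --spec-draft-n-max 4
```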
No Genesis patches, no DFlash custom kernels, no Lucebox. Just llama.cpp + a single PR.
The GGUF: RDson/Qwen3.6-27B-MTP-Q4_K_M-GGUF
Important detail: to use MTP via llama.cpp, the GGUF has to include the MTP head. The standard Qwen3.6-27B Q4_K_M from unsloth doesn’t include it (the MTP head was stripped during standard quantization).
Lucky for us, RDson quantized it with ik_llama:
- Single file: `Qwen3.6-27B-MTP-Q4_K_M.gguf` — 16.49 GB
- Fits on 24GB with comfortable margin
- Compatible with llama.cpp PR #22673
am17an also published a Q8_0 on his repo, but at 28 GB it won't fit in 24 GB of VRAM.
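Grabbing the Q4_K_M looks roughly like this (a sketch, assuming the file sits at the repo root under the filename above; point --local-dir at whatever you mount as /models):

```bash
# ~16.5 GB download from Hugging Face
huggingface-cli download RDson/Qwen3.6-27B-MTP-Q4_K_M-GGUF \
  Qwen3.6-27B-MTP-Q4_K_M.gguf \
  --local-dir ~/models
```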
llama-server config
llama-server \
--model /models/Qwen3.6-27B-MTP-Q4_K_M.gguf \
--host 0.0.0.0 --port 8000 \
--n-gpu-layers 99 \
--ctx-size 32000 \
--threads 16 \
--batch-size 256 --ubatch-size 64 \
--parallel 1 \
--flash-attn on --jinja \
--spec-type mtp \
--spec-draft-n-max 4 \
--chat-template-kwargs '{"enable_thinking": false}'
Notes:
- `--spec-type mtp`: the new option from PR #22673
- `--spec-draft-n-max 4`: USBhost on Reddit found 4 to be the sweet spot — kept their number
- No `--model-draft`: MTP is in the model, not a separate drafter
- `enable_thinking: false`: recommended everywhere, the draft head wasn't trained on `<think>` tags
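Quick sanity check once the server is up. llama-server exposes an OpenAI-compatible endpoint, so a plain curl is enough (jq is only there for readability):

```bash
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Say hello in one short sentence."}],"max_tokens":64,"temperature":0.6,"top_p":0.95}' \
  | jq -r '.choices[0].message.content'
```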
The bench
Three Space Invaders prompts, max_tokens=800, temp=0.6, top_p=0.95:
- Run 1: 800 tok in 10.68 s = 74.91 t/s
- Run 2: 800 tok in 9.93 s = 80.60 t/s
- Run 3: 800 tok in 10.15 s = 78.78 t/s

Average: 78.1 t/s (range 74.9-80.6).
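If you want to reproduce the measurement, here is roughly how I'd time it from the shell. It's wall-clock over the whole request, so it includes prompt processing and will read slightly below the pure generation rate (assumes jq and bc are installed and that the response carries usage.completion_tokens):

```bash
# Rough wall-clock tokens/s for one 800-token completion
START=$(date +%s.%N)
TOKENS=$(curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Write a Space Invaders clone in JavaScript."}],"max_tokens":800,"temperature":0.6,"top_p":0.95}' \
  | jq '.usage.completion_tokens')
END=$(date +%s.%N)
echo "scale=1; $TOKENS / ($END - $START)" | bc
```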
Same-machine comparison (Olares One, RTX 5090M 24GB sm_120)
| Stack | t/s avg | Stack complexity |
|---|---|---|
| llama.cpp standard (no spec) | 33-36 | pure upstream |
| llama.cpp + MTP (PR #22673) | 78.1 | pure upstream + 1 PR |
| buun-llama-cpp DFlash + Q8_0 GGUF drafter | 80 | llama.cpp fork |
| vLLM Turbo (Genesis 28 patches + TurboQuant K8V4 + MTP n=3) | 88.0 | vLLM + 28 patches + custom image |
| Lucebox v1.6.0 (PR #94 + q4_0 KV + DDTree 22) | 88.7 | custom engine + libvgpu hot-swap + 4 workarounds |
+123% for MTP llama.cpp vs. baseline, more than alexandrupetraru's +75% on Strix Halo. The 5090M probably has more headroom: its baseline decode is bandwidth-bound, which leaves plenty of sm_120 compute free to verify the draft tokens in parallel.
Why 78 < 88? Because MTP is more modest than custom DFlash
MTP gives ~2× on the baseline (acceptance ~75% × 4 draft tokens). Well-tuned DFlash (Lucebox, dedicated drafter, custom kernels) gives ~2.5-3×. Above MTP llama.cpp, we have:
- buun DFlash 80 t/s: Q8_0 GGUF quantized drafter + sm_120-tuned spec-type dflash
- vLLM Turbo 88 t/s: Genesis 28 patches that unlock TurboQuant K8V4 + MTP n=3
- Lucebox 88.7 t/s: custom test_dflash engine + DDTree budget tuning
All of those need a fork or patches. MTP llama.cpp = the only version that will be merged upstream as soon as the PR review wraps up.
The actual message
Once PR #22673 lands in ggml-org/llama.cpp master, anyone who pulls ghcr.io/ggml-org/llama.cpp:server-cudaXY-bNNNN and downloads a Qwen3.6-MTP-enabled GGUF gets ~78 t/s on consumer mobile Blackwell 24GB, with no fork to maintain.
It’s not the absolute record (Lucebox 88.7), but it is:
- Reproducible: one `docker pull` + one `huggingface-cli download`
- Maintainable: upstream llama.cpp won't break under you, unlike forks
- Compatible with LM Studio / Open WebUI / Ollama tools that consume official llama.cpp
That’s what actually changes the game for end users. Forks stay the right answer for record benchmarks; upstream MTP will be the right answer for mass distribution.
To follow the merge
- PR https://github.com/ggml-org/llama.cpp/pull/22673
- Status as of May 5: OPEN, BLOCKED (in review). 18 comments, lots of positive community tests.
- Once merged, will be in the next tagged llama.cpp build (probably b9027+)
- am17an will publish the official image, or wait for `ghcr.io/ggml-org/llama.cpp:server-cuda13-bNNNN` once it ships
Credits
- am17an (Aman Gupta) for PR #22673 and the Q8_0 GGUF
- u/ilintar for pushing the announcement on r/LocalLLaMA
- RDson for the MTP-enabled Q4_K_M quant that fits 24GB
- alexandrupetraru, USBhost, superjamie, GloballyUniquePlaceholder for the community benchmarks that validated the PR before I tested
That’s it! If you run on a 5090M / 4080M / 3090 24GB and reproduce these 78 t/s (or beat them with a `--spec-draft-n-max` sweep), send me your numbers. See you next time!
Disclosure — All the benchmarks in this post run on my own Olares One. If the content was useful and you’re considering one, ordering through this referral link gets you $400 off ($3,599 instead of $3,999) and pays me $200. I’m mentioning this out of transparency — and yes, incidentally, it helps keep the blog alive (hosting, domain, and the time I spend writing here). Link valid until late June 2026.