Qwen3.6-27B MTP via llama.cpp PR #22673 on consumer Blackwell — 78 t/s with no fork, no patch

Hi there.

Last night on r/LocalLLaMA, u/ilintar announced that am17an’s MTP PR went into beta. 334 upvotes in a few hours. The kicker: “expect most performance gaps between llama.cpp and vLLM… to be erased.”

Spoiler: on my machine, baseline llama.cpp = 35 t/s. With PR #22673 + the right MTP-enabled GGUF: 78 t/s. +123%. Without touching Genesis, Lucebox, or HAMi. Here’s how.

Context: MTP was in vLLM, not in llama.cpp

Multi-Token Prediction is a speculative decoding technique where the model learns to predict several tokens ahead in a single forward pass. At serve time, those predictions are used as a free “draft”: the model verifies in parallel, accepts the good ones, drops the bad ones. With decent acceptance, that’s roughly 2× faster.

Qwen trained Qwen3.6 with a built-in MTP head. vLLM has been able to use it for a while (--speculative-config '{"method":"mtp","num_speculative_tokens":3}'). llama.cpp could not — until Aman Gupta’s PR #22673, opened May 4, 2026.

The PR adds:

An mtp class in the --spec-type menu
The pipeline that shares hidden states between target and MTP head
Support for Qwen3.6 27B and 35B-A3B

Aman tests on a DGX Spark. The Reddit community tested it fast:

alexandrupetraru: Strix Halo Qwen3.6-35B-A3B q8 → 40 → 70 tg/s (+75%)
GloballyUniquePlaceholder: 3060 Laptop 6GB + 64GB RAM → real speedup
superjamie: 3× RTX 3060 tested OK with RDson/Qwen3.6-27B-MTP-Q4_K_M-GGUF

No consumer Blackwell bench yet. My Olares One with its RTX 5090M is the candidate.

The build: am17an/mtp-clean + sm_120

PR #22673 lives on the mtp-clean branch of the am17an/llama.cpp fork. Five commits:

1a4fe4e  llama: allow partial seq_rm for GDN models for spec decoding
589490f  add enum for part sequence removal
c5e0227  rename rollback to rs_seq
10829db  llama + spec: MTP support
f8c6b03  add qwen35moe_mtp

Rather than cherry-picking into buun-llama-cpp (my usual DFlash fork), I build straight from this branch in minimal mode — just llama-server for Qwen3.6 27B + native sm_120:

FROM nvidia/cuda:13.1.0-devel-ubuntu22.04 AS build
RUN git clone --depth 1 --branch mtp-clean \
    https://github.com/am17an/llama.cpp /src/llama.cpp
WORKDIR /src/llama.cpp
RUN cmake -B build \
    -DGGML_CUDA=ON -DGGML_CUDA_NO_VMM=ON \
    -DCMAKE_CUDA_ARCHITECTURES=120 \
    -DLLAMA_CURL=ON \
    -DCMAKE_EXE_LINKER_FLAGS="-Wl,-rpath-link,/usr/local/cuda/lib64/stubs" \
    -DCMAKE_BUILD_TYPE=Release && \
    cmake --build build -j$(nproc) --target llama-server

2h cross-compile on OrbStack Mac. Image aamsellem/llamacpp-mtp:0.1.0, 2.62 GB.

No Genesis patches, no DFlash custom kernels, no Lucebox. Just llama.cpp + a single PR.

The GGUF: RDson/Qwen3.6-27B-MTP-Q4_K_M-GGUF

Important detail: to use MTP via llama.cpp, the GGUF has to include the MTP head. The standard Qwen3.6-27B Q4_K_M from unsloth doesn’t include it (the MTP head was stripped during standard quantization).

Lucky for us, RDson quantized it with ik_llama:

Single file: Qwen3.6-27B-MTP-Q4_K_M.gguf — 16.49 GB
Fits on 24GB with comfortable margin
Compatible with llama.cpp PR #22673

am17an also published a Q8_0 on his repo, but at 28 GB = not for 24GB.

llama-server config

llama-server \
  --model /models/Qwen3.6-27B-MTP-Q4_K_M.gguf \
  --host 0.0.0.0 --port 8000 \
  --n-gpu-layers 99 \
  --ctx-size 32000 \
  --threads 16 \
  --batch-size 256 --ubatch-size 64 \
  --parallel 1 \
  --flash-attn on --jinja \
  --spec-type mtp \
  --spec-draft-n-max 4 \
  --chat-template-kwargs '{"enable_thinking": false}'

Notes:

--spec-type mtp: the new option from PR #22673
--spec-draft-n-max 4: USBhost on Reddit found 4 to be the sweet spot — kept their number
No --model-draft: MTP is in the model, not a separate drafter
enable_thinking: false: recommended everywhere, the draft head wasn’t trained on <think> tags

The bench

Three Space Invaders prompts, max_tokens=800, temp=0.6, top_p=0.95:

Run 1: 800 tok in 10.68s = 74.91 t/s
Run 2: 800 tok in  9.93s = 80.60 t/s
Run 3: 800 tok in 10.15s = 78.78 t/s

AVG 78.1 t/s [74.9-80.6].

Same-machine comparison (Olares One, RTX 5090M 24GB sm_120)

Stack	t/s avg	Stack complexity
llama.cpp standard (no spec)	33-36	pure upstream
llama.cpp + MTP (PR #22673)	78.1	pure upstream + 1 PR
buun-llama-cpp DFlash + Q8_0 GGUF drafter	80	llama.cpp fork
vLLM Turbo (Genesis 28 patches + TurboQuant K8V4 + MTP n=3)	88.0	vLLM + 28 patches + custom image
Lucebox v1.6.0 (PR #94 + q4_0 KV + DDTree 22)	88.7	custom engine + libvgpu hot-swap + 4 workarounds

+123% MTP llama.cpp vs baseline. More than alexandrupetraru’s +75% on Strix Halo — the 5090M probably has more headroom because the baseline is lower (bandwidth-bound more than sm_122).

Why 78 < 88? Because MTP is more modest than custom DFlash

MTP gives ~2× on the baseline (acceptance ~75% × 4 draft tokens). Well-tuned DFlash (Lucebox, dedicated drafter, custom kernels) gives ~2.5-3×. Above MTP llama.cpp, we have:

buun DFlash 80 t/s: Q8_0 GGUF quantized drafter + sm_120-tuned spec-type dflash
vLLM Turbo 88 t/s: Genesis 28 patches that unlock TurboQuant K8V4 + MTP n=3
Lucebox 88.7 t/s: custom test_dflash engine + DDTree budget tuning

All of those need a fork or patches. MTP llama.cpp = the only version that will be merged upstream as soon as the PR review wraps up.

The actual message

Once PR #22673 lands in ggml-org/llama.cpp master, anyone who pulls ghcr.io/ggml-org/llama.cpp:server-cudaXY-bNNNN and downloads a Qwen3.6-MTP-enabled GGUF gets ~78 t/s on consumer mobile Blackwell 24GB, with no fork to maintain.

It’s not the absolute record (Lucebox 88.7), but it is:

Reproducible: one docker pull + one huggingface-cli download
Maintainable: upstream llama.cpp won’t break under you, unlike forks
Compatible with LM Studio / Open WebUI / Ollama tools that consume official llama.cpp

That’s what actually changes the game for end users. Forks stay the right answer for record benchmarks; upstream MTP will be the right answer for mass distribution.

To follow the merge

PR https://github.com/ggml-org/llama.cpp/pull/22673
Status as of May 5: OPEN, BLOCKED (in review). 18 comments, lots of positive community tests.
Once merged, will be in the next tagged llama.cpp build (probably b9027+)
am17an will publish the official image, or wait for ghcr.io/ggml-org/llama.cpp:server-cuda13-bNNNN once it ships

Credits

am17an (Aman Gupta) for PR #22673 and the Q8_0 GGUF
u/ilintar for pushing the announcement on r/LocalLLaMA
RDson for the MTP-enabled Q4_K_M quant that fits 24GB
alexandrupetraru, USBhost, superjamie, GloballyUniquePlaceholder for the community benchmarks that validated the PR before I tested

That’s it! If you run on a 5090M / 4080M / 3090 24GB and reproduce these 78 t/s (or beat it with a --spec-draft-n-max sweep), send me your numbers. See you next time!

Disclosure — All the benchmarks in this post run on my own Olares One. If the content was useful and you’re considering one, ordering through this referral link gets you $400 off ($3,599 instead of $3,999) and pays me $200. I’m mentioning this out of transparency — and yes, incidentally, it helps keep the blog alive (hosting, domain, and the time I spend writing here). Link valid until late June 2026.