Aurélien AMSELLEM

Qwen3.6-27B MTP via llama.cpp PR #22673 on consumer Blackwell — 78 t/s with no fork, no patch

MTP arrives in llama.cpp as an upstream PR (PR #22673 by am17an, opened May 4). Bench on an Olares One RTX 5090M (sm_120): 78 t/s with an MTP-enabled GGUF, +123% vs baseline. No Lucebox, no Genesis, no permanent custom fork.

Hi there.

Last night on r/LocalLLaMA, u/ilintar announced that am17an’s MTP PR went into beta. 334 upvotes in a few hours. The kicker: “expect most performance gaps between llama.cpp and vLLM… to be erased.”

Spoiler: on my machine, baseline llama.cpp = 35 t/s. With PR #22673 + the right MTP-enabled GGUF: 78 t/s. +123%. Without touching Genesis, Lucebox, or HAMi. Here’s how.

Context: MTP was in vLLM, not in llama.cpp

Multi-Token Prediction is a speculative decoding technique where the model learns to predict several tokens ahead in a single forward pass. At serve time, those predictions are used as a free “draft”: the model verifies in parallel, accepts the good ones, drops the bad ones. With decent acceptance, that’s roughly 2× faster.
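A back-of-the-envelope way to see where that 2× comes from (assuming each drafted token is accepted independently, which is optimistic): with acceptance probability p and n drafted tokens per step, one verification pass yields on average 1 + p + p^2 + ... + p^n = (1 - p^(n+1)) / (1 - p) tokens instead of 1. Plugging in the ~75% acceptance and n = 4 used later in this post gives about 3 tokens per pass; verification overhead and bursty acceptance pull the end-to-end gain back toward the ~2× actually observed.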

Qwen trained Qwen3.6 with a built-in MTP head. vLLM has been able to use it for a while (--speculative-config '{"method":"mtp","num_speculative_tokens":3}'). llama.cpp could not — until Aman Gupta’s PR #22673, opened May 4, 2026.
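For reference, the vLLM invocation looks roughly like this (a sketch: the model ID is a placeholder, the --speculative-config value is the one quoted above):

vllm serve <qwen3.6-27b-model-id> \
  --speculative-config '{"method":"mtp","num_speculative_tokens":3}'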

The PR adds:

Aman's own tests run on a DGX Spark. The Reddit community was quick to try it elsewhere:

No consumer Blackwell bench yet. My Olares One with its RTX 5090M is the candidate.

The build: am17an/mtp-clean + sm_120

PR #22673 lives on the mtp-clean branch of the am17an/llama.cpp fork. Five commits:

1a4fe4e  llama: allow partial seq_rm for GDN models for spec decoding
589490f  add enum for part sequence removal
c5e0227  rename rollback to rs_seq
10829db  llama + spec: MTP support
f8c6b03  add qwen35moe_mtp

Rather than cherry-picking into buun-llama-cpp (my usual DFlash fork), I build straight from this branch in minimal mode — just llama-server for Qwen3.6 27B + native sm_120:

FROM nvidia/cuda:13.1.0-devel-ubuntu22.04 AS build
# The devel image ships the CUDA toolchain but not git/cmake/libcurl headers
RUN apt-get update && apt-get install -y --no-install-recommends \
    git cmake build-essential libcurl4-openssl-dev && \
    rm -rf /var/lib/apt/lists/*
RUN git clone --depth 1 --branch mtp-clean \
    https://github.com/am17an/llama.cpp /src/llama.cpp
WORKDIR /src/llama.cpp
# Native sm_120 only (consumer Blackwell), llama-server target only
RUN cmake -B build \
    -DGGML_CUDA=ON -DGGML_CUDA_NO_VMM=ON \
    -DCMAKE_CUDA_ARCHITECTURES=120 \
    -DLLAMA_CURL=ON \
    -DCMAKE_EXE_LINKER_FLAGS="-Wl,-rpath-link,/usr/local/cuda/lib64/stubs" \
    -DCMAKE_BUILD_TYPE=Release && \
    cmake --build build -j$(nproc) --target llama-server

A 2-hour cross-compile on an OrbStack Mac. Resulting image: aamsellem/llamacpp-mtp:0.1.0, 2.62 GB.
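If you want to rebuild it, the only subtlety on an Apple Silicon Mac is forcing the amd64 platform, which is presumably why OrbStack runs the whole build under Rosetta emulation and it takes a couple of hours. A sketch (the Dockerfile name and tag are mine, adjust to yours):

docker build --platform linux/amd64 \
  -t aamsellem/llamacpp-mtp:0.1.0 \
  -f Dockerfile.mtp .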

No Genesis patches, no DFlash custom kernels, no Lucebox. Just llama.cpp + a single PR.

The GGUF: RDson/Qwen3.6-27B-MTP-Q4_K_M-GGUF

Important detail: to use MTP via llama.cpp, the GGUF has to include the MTP head. The standard Qwen3.6-27B Q4_K_M from unsloth doesn’t include it (the MTP head was stripped during standard quantization).

Lucky for us, RDson quantized it with ik_llama:

am17an also published a Q8_0 on his repo, but at 28 GB it doesn't fit in 24 GB of VRAM.
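If you're not sure whether a given quant kept the head, dump the tensor list and grep. A sketch: gguf-dump comes from the gguf Python package, and the tensor-name pattern for the MTP head is my guess, so check the PR for the real names.

pip install gguf
gguf-dump --no-tensors /models/Qwen3.6-27B-MTP-Q4_K_M.gguf    # metadata only
gguf-dump /models/Qwen3.6-27B-MTP-Q4_K_M.gguf | grep -iE 'mtp|nextn'    # guessed pattern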

llama-server config

llama-server \
  --model /models/Qwen3.6-27B-MTP-Q4_K_M.gguf \
  --host 0.0.0.0 --port 8000 \
  --n-gpu-layers 99 \
  --ctx-size 32000 \
  --threads 16 \
  --batch-size 256 --ubatch-size 64 \
  --parallel 1 \
  --flash-attn on --jinja \
  --spec-type mtp \
  --spec-draft-n-max 4 \
  --chat-template-kwargs '{"enable_thinking": false}'
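Once it's up, a quick sanity check, since llama-server exposes a /health endpoint and an OpenAI-compatible API:

curl -s http://localhost:8000/health
curl -s http://localhost:8000/v1/models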

Notes:

The bench

Three Space Invaders prompts, max_tokens=800, temp=0.6, top_p=0.95:

Run 1: 800 tok in 10.68s = 74.91 t/s
Run 2: 800 tok in  9.93s = 80.60 t/s
Run 3: 800 tok in 10.15s = 78.78 t/s

Average: 78.1 t/s (range 74.9-80.6).
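To reproduce this, a single run can be timed with nothing fancier than curl, jq and bc. A sketch, not my exact harness (the prompt is illustrative, and wall-clock time includes prompt processing, so it slightly understates pure decode t/s):

PROMPT="Write a Space Invaders clone in a single HTML file."
START=$(date +%s.%N)
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{\"messages\":[{\"role\":\"user\",\"content\":\"$PROMPT\"}],
       \"max_tokens\":800,\"temperature\":0.6,\"top_p\":0.95}" > /tmp/run.json
END=$(date +%s.%N)
TOKENS=$(jq '.usage.completion_tokens' /tmp/run.json)
echo "$TOKENS tok in $(echo "$END - $START" | bc)s =" \
     "$(echo "scale=2; $TOKENS / ($END - $START)" | bc) t/s"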

Same-machine comparison (Olares One, RTX 5090M 24GB sm_120)

Stack                                                         t/s avg   Stack complexity
llama.cpp standard (no spec)                                  33-36     pure upstream
llama.cpp + MTP (PR #22673)                                   78.1      pure upstream + 1 PR
buun-llama-cpp DFlash + Q8_0 GGUF drafter                     80        llama.cpp fork
vLLM Turbo (Genesis 28 patches + TurboQuant K8V4 + MTP n=3)   88.0      vLLM + 28 patches + custom image
Lucebox v1.6.0 (PR #94 + q4_0 KV + DDTree 22)                 88.7      custom engine + libvgpu hot-swap + 4 workarounds

That's +123% for llama.cpp MTP over the baseline, more than alexandrupetraru's +75% on Strix Halo. The 5090M probably has more headroom: its baseline is bandwidth-bound rather than limited by sm_120 compute, so the extra work of verifying draft tokens is nearly free.

Why 78 < 88? Because MTP is more modest than custom DFlash

MTP gives ~2× on the baseline (acceptance ~75% × 4 draft tokens). Well-tuned DFlash (Lucebox, dedicated drafter, custom kernels) gives ~2.5-3×. Above MTP llama.cpp, we have:

All of those need a fork or patches. MTP llama.cpp = the only version that will be merged upstream as soon as the PR review wraps up.

The actual message

Once PR #22673 lands in ggml-org/llama.cpp master, anyone who pulls ghcr.io/ggml-org/llama.cpp:server-cudaXY-bNNNN and downloads a Qwen3.6-MTP-enabled GGUF gets ~78 t/s on consumer mobile Blackwell 24GB, with no fork to maintain.
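Concretely, once merged, the no-fork path would look something like this (a sketch: the image tag placeholders come from the sentence above, and I'm assuming the upstream server image keeps llama-server as its entrypoint):

docker pull ghcr.io/ggml-org/llama.cpp:server-cudaXY-bNNNN
docker run --rm --gpus all -p 8000:8000 -v /path/to/models:/models \
  ghcr.io/ggml-org/llama.cpp:server-cudaXY-bNNNN \
  --model /models/Qwen3.6-27B-MTP-Q4_K_M.gguf \
  --host 0.0.0.0 --port 8000 --n-gpu-layers 99 --flash-attn on --jinja \
  --spec-type mtp --spec-draft-n-max 4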

It’s not the absolute record (Lucebox 88.7), but it is:

That’s what actually changes the game for end users. Forks stay the right answer for record benchmarks; upstream MTP will be the right answer for mass distribution.

To follow the merge

Credits

That’s it! If you run on a 5090M / 4080M / 3090 24GB and reproduce these 78 t/s (or beat it with a --spec-draft-n-max sweep), send me your numbers. See you next time!


Disclosure — All the benchmarks in this post run on my own Olares One. If the content was useful and you’re considering one, ordering through this referral link gets you $400 off ($3,599 instead of $3,999) and pays me $200. I’m mentioning this out of transparency — and yes, incidentally, it helps keep the blog alive (hosting, domain, and the time I spend writing here). Link valid until late June 2026.
