Skip to content
/ airelien.dev
Go back
Aurélien AMSELLEM

Qwen3.6-27B on upstream llama.cpp: +123% free with MTP, zero fork to maintain

MTP finally lands in llama.cpp upstream (PR #22673 by am17an, May 4). Bench on Olares One RTX 5090M sm_120: 78 t/s with an MTP-enabled GGUF, +123% vs baseline. No Lucebox, no Genesis, no permanent custom fork.

Hi there.

Last night on r/LocalLLaMA, u/ilintar announced that am17an’s MTP PR went into beta. 334 upvotes in a few hours. The kicker: “expect most performance gaps between llama.cpp and vLLM… to be erased.”

Spoiler: on my machine, baseline llama.cpp = 35 t/s. With PR #22673 + the right MTP-enabled GGUF: 78 t/s. +123%. Without touching Genesis, Lucebox, or HAMi. Here’s how.

Context: MTP was in vLLM, not in llama.cpp

Multi-Token Prediction is a speculative decoding technique where the model learns to predict several tokens ahead in a single forward pass. At serve time, those predictions are used as a free “draft”: the model verifies in parallel, accepts the good ones, drops the bad ones. With decent acceptance, that’s roughly 2× faster.

Qwen trained Qwen3.6 with a built-in MTP head. vLLM has been able to use it for a while (--speculative-config '{"method":"mtp","num_speculative_tokens":3}'). llama.cpp could not — until Aman Gupta’s PR #22673, opened May 4, 2026.

The PR adds:

Aman tests on a DGX Spark. The Reddit community tested it fast:

No consumer Blackwell bench yet. My Olares One with its RTX 5090M is the candidate.

The build: am17an/mtp-clean + sm_120

PR #22673 lives on the mtp-clean branch of the am17an/llama.cpp fork. Five commits:

1a4fe4e  llama: allow partial seq_rm for GDN models for spec decoding
589490f  add enum for part sequence removal
c5e0227  rename rollback to rs_seq
10829db  llama + spec: MTP support
f8c6b03  add qwen35moe_mtp

Rather than cherry-picking into buun-llama-cpp (my usual DFlash fork), I build straight from this branch in minimal mode — just llama-server for Qwen3.6 27B + native sm_120:

FROM nvidia/cuda:13.1.0-devel-ubuntu22.04 AS build
RUN git clone --depth 1 --branch mtp-clean \
    https://github.com/am17an/llama.cpp /src/llama.cpp
WORKDIR /src/llama.cpp
RUN cmake -B build \
    -DGGML_CUDA=ON -DGGML_CUDA_NO_VMM=ON \
    -DCMAKE_CUDA_ARCHITECTURES=120 \
    -DLLAMA_CURL=ON \
    -DCMAKE_EXE_LINKER_FLAGS="-Wl,-rpath-link,/usr/local/cuda/lib64/stubs" \
    -DCMAKE_BUILD_TYPE=Release && \
    cmake --build build -j$(nproc) --target llama-server

2h cross-compile on OrbStack Mac. Image aamsellem/llamacpp-mtp:0.1.0, 2.62 GB.

No Genesis patches, no DFlash custom kernels, no Lucebox. Just llama.cpp + a single PR.

The GGUF: RDson/Qwen3.6-27B-MTP-Q4_K_M-GGUF

Important detail: to use MTP via llama.cpp, the GGUF has to include the MTP head. The standard Qwen3.6-27B Q4_K_M from unsloth doesn’t include it (the MTP head was stripped during standard quantization).

Lucky for us, RDson quantized it with ik_llama:

am17an also published a Q8_0 on his repo, but at 28 GB = not for 24GB.

llama-server config

llama-server \
  --model /models/Qwen3.6-27B-MTP-Q4_K_M.gguf \
  --host 0.0.0.0 --port 8000 \
  --n-gpu-layers 99 \
  --ctx-size 32000 \
  --threads 16 \
  --batch-size 256 --ubatch-size 64 \
  --parallel 1 \
  --flash-attn on --jinja \
  --spec-type mtp \
  --spec-draft-n-max 4 \
  --chat-template-kwargs '{"enable_thinking": false}'

Notes:

The bench

Three Space Invaders prompts, max_tokens=800, temp=0.6, top_p=0.95:

Run 1: 800 tok in 10.68s = 74.91 t/s
Run 2: 800 tok in  9.93s = 80.60 t/s
Run 3: 800 tok in 10.15s = 78.78 t/s

AVG 78.1 t/s [74.9-80.6].

Same-machine comparison (Olares One, RTX 5090M 24GB sm_120)

Stackt/s avgStack complexity
llama.cpp standard (no spec)33-36pure upstream
llama.cpp + MTP (PR #22673)78.1pure upstream + 1 PR
buun-llama-cpp DFlash + Q8_0 GGUF drafter80llama.cpp fork
vLLM Turbo (Genesis 28 patches + TurboQuant K8V4 + MTP n=3)88.0vLLM + 28 patches + custom image
Lucebox v1.6.0 (PR #94 + q4_0 KV + DDTree 22)88.7custom engine + libvgpu hot-swap + 4 workarounds

+123% MTP llama.cpp vs baseline. More than alexandrupetraru’s +75% on Strix Halo — the 5090M probably has more headroom because the baseline is lower (bandwidth-bound more than sm_122).

Why 78 < 88? Because MTP is more modest than custom DFlash

MTP gives ~2× on the baseline (acceptance ~75% × 4 draft tokens). Well-tuned DFlash (Lucebox, dedicated drafter, custom kernels) gives ~2.5-3×. Above MTP llama.cpp, we have:

All of those need a fork or patches. MTP llama.cpp = the only version that will be merged upstream as soon as the PR review wraps up.

The actual message

Once PR #22673 lands in ggml-org/llama.cpp master, anyone who pulls ghcr.io/ggml-org/llama.cpp:server-cudaXY-bNNNN and downloads a Qwen3.6-MTP-enabled GGUF gets ~78 t/s on consumer mobile Blackwell 24GB, with no fork to maintain.

It’s not the absolute record (Lucebox 88.7), but it is:

That’s what actually changes the game for end users. Forks stay the right answer for record benchmarks; upstream MTP will be the right answer for mass distribution.

May 6 update: pushing to 128K context — 65 t/s

I bumped the llamacppqwen36mtpone app to v1.0.3 with two changes:

  1. Switched GGUF to froggeric/Qwen3.6-27B-MTP-GGUF (instead of RDson) — that’s the de-facto community-blessed quant after the May 6 post that hit 678 upvotes (ex-arman68 relaying froggeric)
  2. Pushed context to 128K (from 32K) with --cache-type-k q4_0 --cache-type-v q4_0 and --spec-draft-n-max 5

Caveat: froggeric’s official table claims “24 GB | Q4_K_M | q4_0 | 262K | 22.8 GB” — but it doesn’t account for the MTP draft compute buffer overhead. At 262K my pod crashes with graph_reserve: failed to allocate compute buffers / failed to create MTP context, and the server falls back to non-MTP mode → 37 t/s baseline.

At 128K it holds:

prompt eval time =  94.28 ms /   22 tokens (   4.29 ms per token, 233.36 tokens per second)
       eval time =  44.86 s   / 2921 tokens (  15.36 ms per token,  65.11 tokens per second)
draft acceptance rate = 64.15% (2226/3470)

65 t/s @ 128K with 64% acceptance rate. -17% vs 32K but 4× the context — sweet spot for coding agents reading large codebases.

Real VRAM math on 24GB with Q4_K_M + MTP:

At 262K the KV eats 4.3 GB, nothing left for the MTP draft buffer. To push further, switch to IQ4_XS (-1.5 GB on the model) or IQ3_M (-3 GB).

To follow the merge

Credits

That’s it! If you run on a 5090M / 4080M / 3090 24GB and reproduce these 78 t/s (or beat it with a --spec-draft-n-max sweep), send me your numbers. See you next time!


Disclosure — All the benchmarks in this post run on my own Olares One. If the content was useful and you’re considering one, ordering through this referral link gets you $400 off ($3,599 instead of $3,999) and pays me $200. I’m mentioning this out of transparency — and yes, incidentally, it helps keep the blog alive (hosting, domain, and the time I spend writing here). Link valid until late June 2026.

Share this post on:

Comments