Skip to content
/ airelien.dev
Go back
Aurélien AMSELLEM

249 t/s on Qwen3.6 35B-A3B MTP — the bigger model that runs faster than everything smaller

I posted yesterday about Nemotron-Labs Elastic 30B-A3B NVFP4 hitting 166 t/s on Olares One — then 182 once vLLM #40082 landed. New record. Headline of the post: 'fastest LLM on Olares One'. Less than 12 hours later, that record is now sitting in second place. Qwen3.6 35B-A3B MTP runs at 249 t/s on the same hardware. Bigger model, +37% faster. Here's what's going on.

I posted yesterday that Nemotron-Labs Elastic 30B-A3B NVFP4 was the fastest LLM I could run on Olares One. 166 t/s. Then this morning vLLM PR #40082 (FlashInfer SM120 b12x MoE + FP4 GEMM) landed in nightly and that number jumped to 182 t/s. I wrote about that too. Felt good. Filed it.

12 hours later that’s not the fastest LLM I can run on Olares One anymore.

The fastest LLM I can run on Olares One is unsloth/Qwen3.6-35B-A3B-MTP-GGUF at 249.30 t/s AVG, 86.6% draft acceptance, 10 runs, σ ≈ 3.0.

It is a bigger model. Same hardware. +37% throughput.

Here’s how that math works.

The bench

unsloth/Qwen3.6-35B-A3B-MTP-GGUF UD-Q3_K_XL (17.2 GB on disk) loaded into my aamsellem/llama-cpp-mtp:master-ad27757 image (the same one I shipped yesterday for the Qwen3.6 27B dense MTP path — includes #23269 cleanup, #23287 NVIDIA backend sampling, #22522 PDL).

Config:

--spec-type draft-mtp --spec-draft-n-max 3
--ctx-size 32768 --cache-type-k q4_0 --cache-type-v q4_0
--batch-size 512 --ubatch-size 512 --parallel 1 --flash-attn on
--chat-template-kwargs '{"enable_thinking": false}'

10 runs of Space Invaders HTML completion (2000 tokens each), single user, no warmup beyond model load:

Runt/sdraft_nacceptedaccept %
1253.731634145489.0%
2248.631672144186.2%
3251.651653144887.6%
4251.751653144887.6%
5250.431657144687.3%
6245.621692143484.8%
7248.141672144086.1%
8249.721661144587.0%
9249.781663144486.8%
10243.581707143083.8%

249.30 t/s AVG. Range 10.15 across all 10 runs (no warmup penalty — MTP draft graphs warm up much faster than vLLM’s full CUDA graph capture).

The new Olares One leaderboard

Stackt/s AVGModel class
Qwen3.6 35B-A3B MTP Q3_K_XL (llama.cpp master)249.3035B MoE-A3B + MTP
Nemotron-Labs Elastic 30B-A3B NVFP4 (vLLM + FlashInfer #40082)182.1430B MoE-A3B + NVFP4
Gemma 4 26B-A4B (vLLM tokenspeed-preview)135.9726B MoE-A4B + AWQ-INT4
BeeLlama Qwen3.6 27B + DFlash + turbo3 KV107.5427B dense + DFlash
llama.cpp master ad27757 — Qwen3.6 27B dense MTP74.2827B dense + MTP

What’s striking is the bottom row. Same image, same MTP code path, same model series, same hardware. The 27B dense variant runs at 74. The 35B-A3B variant runs at 249. The bigger model is 3.4× faster than the smaller one.

That’s the whole story right there. Below is the why.

Three multipliers that compound

1. MoE-A3B routing means active params, not total params, drive per-token cost

Qwen3.6 35B-A3B has 35 billion parameters in storage. At inference, the router picks ~8 experts out of 128 per token, and only those activate. Total active params per token: roughly 3 billion. Per-token compute is closer to a dense 3B model than to a dense 35B model.

The 27B dense variant has to push all 27 billion params through the matmul for every single token. The 35B-A3B variant has to push ~3 billion. That’s the dominant term. The router has a small overhead (~5%) but it’s swamped by the savings.

Quality-wise, you’re not getting “the quality of a 3B model” — you’re getting “the quality of a 35B model where each token routes to the best 3B-equivalent expert mixture for that token”. The reason this works at all is that different tokens (math, code, prose, structured output, etc.) end up routing to different experts that were trained to specialize. MoE buys you breadth on the parameter axis without paying breadth on the compute axis.

2. MTP at 86.6% acceptance gives ~3.6× effective tokens per decode step

MTP (Multi-Token Prediction) trains the model with auxiliary heads that predict not just the next token but the next 2, 3, or more. At inference, the model proposes a draft of N tokens per decode step. The target then verifies them all in one forward pass and accepts as many as the prediction agrees with.

The math at 86.6% acceptance with n_max=3:

The acceptance rate is the key knob. At 50% accept you’d get 2.5 tokens/step (2.5× speedup), at 70% you’d get 3.1, at 86.6% you get 3.6. Higher is dramatically better — diminishing returns aren’t there yet at 86%.

The 27B dense variant on the same image hits 64% acceptance (with the old n_max=5 default) and gets to 74 t/s. The 35B-A3B variant hits 86% (with the new n_max=3 default from #23269) and gets to 249 t/s. The acceptance jump from 64% → 86% multiplies with the per-token MoE savings — that’s why the smaller model loses badly.

3. NVIDIA backend sampling (#23287) keeps the draft loop tight

PR #23287 merged a day before this bench. It moves the draft-path token sampling off the host CPU and onto the CUDA backend. Each draft step used to involve a small but non-zero host roundtrip to sample the next draft token; now it stays on the GPU.

Per draft step the savings are small — maybe a few hundred microseconds. But at 249 t/s the model is producing a token every ~4 ms, and within that there are 3 draft steps. Saving even 100 μs per draft step compounds into a measurable percentage of throughput.

The reason this PR matters more for MoE-A3B than for dense models: MoE has higher draft step frequency for the same wall-clock time (because each step is faster). So the host roundtrip overhead becomes a larger relative cost. #23287 specifically targets the case where you’re doing many draft steps in quick succession — exactly this case.

Constraints / what’s missing

Context window: 32K only 262K full native works (updated post-publish — I was wrong about the 32K limit).

After publishing the initial 249 t/s number at 32K context, I ran a context sweep just to double-check the headroom. Turns out 262K (the model’s full native context) fits and runs at essentially the same throughput:

Contextt/s AVGDelta vs 32K
32K249.30baseline
64K252.64+1.3% (noise)
128K250.39+0.4% (noise)
262K (full native)245.71-1.4%

The memory math: model 17.2 GB + q4_0 KV @ 262K (32 layers × 262K × 2) ≈ 3.2 GB + MTP draft compute buffer ≈ 1 GB + general compute ≈ 0.5 GB ≈ 21.9 GB on 23.42 GB usable. Comfortable margin.

The reason this works on a model that big: MoE-A3B is mostly cost-per-token, not cost-per-context. The Mamba layers carry constant per-token cost regardless of context position, and the attention layers are flash-attn so the read cost stays manageable at 262K. The KV cache itself is the dominant context-scaling cost, and at q4_0 quantization (4-bit per value) it’s small enough that going from 32K to 262K only adds ~2.5 GB.

Shipped as llamacppqwen36a3bone v1.0.2 with CTX_SIZE=262144 as default. This is now both the fastest AND the longest-context Qwen3.6 path on Olares One — +130% throughput vs the BeeLlama dense path at the same 262K context window.

Q4 not feasible at 24 GB. The Q4_K_XL variant (22 GB on disk) loads but then OOMs on KV cache allocation. Q3_K_XL is the largest quant that fits with MTP enabled. If you have a 32 GB consumer card (RTX 5090 desktop, RTX 4090 with ridiculous VRAM), try Q4_K_XL — should be a quality win at maybe -5% throughput.

Thinking mode off. MTP draft heads were trained on non-thinking outputs. Re-enabling thinking mode (enable_thinking: true) confuses the draft path and tanks acceptance to ~40%. If your use case is reasoning-heavy, the dense 27B variant might actually win on quality even if it’s slower.

No vision, no tool calling. Pure single-stream text. For tool calling use Gemma 4 vision. For vision use BeeLlama Qwen3.6 vision. For raw single-shot speed on text: this app.

Sustained-load not validated. 10 runs is ~3.5 min of sustained load. The Gemma 4 + DFlash path on vLLM has a known “5-fast then 4-slow” degradation cycle over longer load. I haven’t validated this yet on 35B-A3B MTP. If you’re using it for agentic workflows that run for 15+ minutes, drop me a line — I’d love to know if the throughput holds.

What I shipped

llamacppqwen36a3bone v1.0.0 → v1.0.1 on my Olares market source. The old v1.0.0 was a non-MTP Q4_K_XL config that OOMs anyway on 24 GB. v1.0.1 is the rebuild:

Worker hash a5f581097eaf5fc14e4506eaeecf6fc8 on orales-one-market.aamsellem.workers.dev.

Where this puts us

On my Olares One leaderboard tonight, the gap from #1 (this) to #5 (the dense 27B with MTP) is 3.4×. That’s wild. Twelve months ago the same hardware was struggling to hit 30 t/s on a 7B model. Tonight I’m clearing 249 on a 35B model.

What changed isn’t the silicon. It’s:

None of these are individually new ideas. Their compounding is what’s new — and that compounding is happening on a 24 GB consumer laptop GPU, not in a data center.

Twelve months ago you’d have run this exact stack at 50 t/s on a $30k H100 and called it cutting-edge. Tonight I’m running it on $4k of mobile workstation silicon at 5× that speed.

I don’t know how long that gap stays this large. Probably not very long. But for tonight, on this hardware, on this stack — it’s real, it’s reproducible, and the chart is one click to install.


Hardware: Olares One — RTX 5090M Laptop (24 GB GDDR7, sm_120 Blackwell consumer mobile), Intel Core Ultra 9 275HX 24-core, 96 GB DDR5. Image: aamsellem/llama-cpp-mtp:master-ad27757 built from ggml-org/llama.cpp master HEAD ad27757. Bench prompt: Space Invaders HTML game completion, 2000 tokens, temp=0.6 top_k=20 min_p=0. Ten runs single-stream, single-user.

Share this post on:

Comments