Two days ago I shipped the Qwen 3.6 35B-A3B MTP champion at 249 t/s. Text only. I wrote the post. Filed it. Moved on.
Today the same hardware runs Gemma 4 26B at 250.80 t/s with vision and tool calling.
Same speed. Plus images. Plus Hermes Agent.
The unlock dropped in vLLM v0.21 on May 15 — quietly merged PR #41745 “official Gemma 4 MTP + Gemma4Proposer + centroids masking”. Today I scaled the Qwen pod to zero, ran the bench, and the gap between my text champion and my vision app evaporated.
The bench
cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit (~17 GB, native multimodal Gemma4ForConditionalGeneration) on vllm/vllm-openai:v0.21.0-cu129. Speculative config:
--speculative-config '{
"model": "google/gemma-4-26B-A4B-it-assistant",
"method": "gemma4_mtp",
"num_speculative_tokens": 3
}'
google/gemma-4-26B-A4B-it-assistant is Google’s official 840 MB MTP draft head for the 26B target. It replaces the z-lab DFlash drafter I was using before — the one with the famously reproducible “5 fast runs, 4 slow runs, recovers, repeats” degradation cycle.
10 runs of Space Invaders HTML completion, 2500 tokens each, single user:
| Run | t/s |
|---|---|
| 1 | 250.35 |
| 2 | 250.87 |
| 3 | 252.97 |
| 4 | 252.14 |
| 5 | 249.51 |
| 6 | 247.84 |
| 7 | 252.48 |
| 8 | 249.61 |
| 9 | 248.61 |
| 10 | 251.10 |
AVG 250.54 t/s. Range 5.13 across all 10 runs. σ ≈ 1.7.
Acceptance rate from /metrics: 25920 accepted / 29997 draft tokens = 86.4%. Almost identical to the Qwen 3.6 35B-A3B figure (86.6%). Two different model families, two different drafters, both landing on the same ~86% rate. Coincidence? Probably means both teams converged on the same target accept rate during training.
Vision smoke test, 64×64 solid red PNG, “What color is this image? One word.”:
Red
Tool calling still works (--enable-auto-tool-choice --tool-call-parser gemma4). I haven’t re-validated the full Hermes Agent flow today, but no args changed from the v1.0.2 release that did pass that smoke.
What changed vs the previous version
| Field | v1.0.0 (April) | v1.0.5 (today) |
|---|---|---|
| vLLM image | tokenspeed-preview | v0.21.0-cu129 |
| Spec decoding | none | Gemma4 MTP, n=3 |
| Drafter | n/a | google/gemma-4-26B-A4B-it-assistant |
| Throughput | 135.97 t/s | 250.54 t/s |
| Context | 128K | 128K |
| Vision | ✓ | ✓ |
| Tool calling | ✓ | ✓ |
| Cycle bug | n/a | gone |
+85% throughput on the same hardware, same model file, same prompts. The only thing that changed is the vLLM image and the speculative config.
The 5-fast/4-slow ghost
If you’ve been following along you’ll remember vllmgemma4dflashone — the text-only Gemma 4 sibling I shipped earlier at 224 t/s with z-lab’s DFlash drafter. It hit a higher peak (224 vs the no-spec 135) but had a reproducible pattern: 5 fast runs (~220 t/s), 4 slow runs (~60 t/s), recover, repeat. I sweep-tested n_spec from 6 to 20 trying to find the regime that didn’t trigger it. Couldn’t.
Today’s 10 runs on the official MTP path: zero slow runs. Tightest spread I’ve measured on this hardware. Whatever vLLM’s internal state machine was doing wrong with DFlash, the official MTP path bypasses it.
I’m not going to claim I understand why. The DFlash drafter and the MTP drafter are architecturally different (DFlash is a separate small model trained to mimic the target’s draft path, MTP is auxiliary heads bolted onto the target itself). They route through different parts of vLLM’s spec decoding engine. PR #41745 brought in Gemma4Proposer and centroids masking — and somewhere in that codepath the cycle disappears.
Push the context, the model says no
Free GPU, 24 GB. Qwen 35B-A3B scaled to zero for the bench. Why not try the max?
Tried 262144 (256K). vLLM cleanly reported:
ValueError: To serve at least one request with the model's max seq len (262144),
3.38 GiB KV cache is needed, which is larger than the available KV cache memory (2.82 GiB).
Based on the available memory, the estimated maximum model length is 203584.
OK so 203584 (200K) is the hard ceiling. Tried 196608 (192K) just below. Boots, runs at 249.65 t/s steady. Same speed as 64K.
But 192K is beyond Gemma 4’s native 128K training range. vLLM scales RoPE automatically when you exceed the model’s declared context, which technically works — the model produces tokens — but attention drifts on positions past the native limit, and needle-in-haystack recall on that 64K bonus tail is unreliable.
Backed off to 131072 (128K native). Bench:
| Run | t/s |
|---|---|
| 1 | 212.41 (cold) |
| 2 | 254.39 |
| 3 | 251.89 |
| 4 | 252.42 |
| 5 | 250.74 |
| 6 | 247.27 |
| 7 | 249.58 |
| 8 | 251.14 |
| 9 | 253.27 |
| 10 | 246.53 |
Excluding the cold first run: 250.80 t/s AVG, range 7.86. Identical decode speed to the 64K bench. Confirming that on a sliding-window attention architecture like Gemma 4, the KV cache footprint scales with full context but the per-token decode cost doesn’t.
So 128K it is. Native. Quality preserved. No throughput penalty.
The leaderboard, again
| Stack | t/s AVG | Multimodal? | Tool calling? |
|---|---|---|---|
| Gemma 4 26B-A4B Vision + MTP (vLLM v0.21) | 250.54 | ✓ | ✓ |
| Qwen 3.6 35B-A3B MTP (llama.cpp master) | 249.30 | ✗ | ✓ |
| Gemma 4 26B-A4B DFlash (vLLM, cycle bug) | 224.00 | ✗ | ✓ |
| Nemotron-Labs Elastic 30B-A3B NVFP4 (vLLM #40082) | 182.14 | ✗ | ✓ |
| Gemma 4 26B-A4B no-spec (vLLM, v1.0.0 baseline) | 135.97 | ✓ | ✓ |
| BeeLlama Qwen 3.6 27B DFlash | 107.54 | ✗ | ✓ |
The Gemma path now ties the Qwen 35B-A3B champion on raw speed, and it has vision. For Hermes Agent use cases that mix tool calls with screenshot analysis, this app becomes the obvious endpoint.
The Qwen still wins for pure text agents that need 262K context. Gemma 4’s sliding window architecture doesn’t let you push as far without quality degradation. Pick by use case:
- Vision + tool calling + ≤128K → Gemma 4 (this app)
- Text + tool calling + 262K → Qwen 3.6 35B-A3B MTP
Why it works
Three things compound, mostly the same as the Qwen story:
MoE-A4B routing. Gemma 4 26B-A4B has 26 billion total parameters, 3.8 billion active per token. Per-token compute is closer to a dense 4B model. The 26B is just a parameter library the router pulls from.
MTP at 86% acceptance. n_spec=3 means we draft 3 tokens per decode step. At 86.4% accept, expected tokens-per-step is roughly 1 + 0.864×3 ≈ 3.6. Roughly 3.6× more tokens-per-decode-step than no-spec.
Official drafter quality. The google/gemma-4-26B-A4B-it-assistant head was trained jointly with the target during Google’s pretraining. That’s why it hits 86% on the first try — it’s not a separate model trying to imitate the target, it’s literally the next-token heads from the target’s own training. Same trick Qwen 3.6 uses with its -MTP variant. When the drafter and target are co-trained, acceptance shoots up.
The catch: this requires the model team to ship an MTP head with the model. Google did. Alibaba did. Most others haven’t yet. When they do, watch for the same 200+ t/s pattern.
What I shipped
vllmgemma426ba4bvisionone v1.0.4 → v1.0.5 on my Olares market source.
- Image:
vllm/vllm-openai:v0.21.0-cu129 - Target:
cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit - Drafter:
google/gemma-4-26B-A4B-it-assistant - Context: 131072 (128K native)
--attention-backend triton_attn,--kv-cache-dtype fp8,--gpu-memory-utilization 0.94--enable-prefix-caching,--limit-mm-per-prompt '{"image":1}'--enable-auto-tool-choice --tool-call-parser gemma4preserved--speculative-config '{"model":"google/gemma-4-26B-A4B-it-assistant","method":"gemma4_mtp","num_speculative_tokens":3}'
Pull from https://orales-one-market.aamsellem.workers.dev if you have Olares One and want to try it.
Coda
I’m cataloging these wins as they come because I think the pattern matters: hardware is fixed, software keeps eating the throughput problem. Same RTX 5090M mobile chip. Same model file on disk. April baseline 135 t/s. May baseline 250 t/s. +85% from upstream changes I didn’t write.
The thing is, the merges that bought us this — vLLM #41745 (Gemma4 MTP), llama.cpp #22673 (MTP support), llama.cpp #23287 (CUDA backend sampling), the fact that Google and Alibaba both shipped co-trained MTP drafters — none of these were specifically about consumer Blackwell mobile. They just happened to land on this hardware too. The community of upstream contributors did the work. I just have to keep redeploying.
Next watch: when Gemma 5 lands with co-trained MTP heads from day one, and when vLLM gets even tighter on hybrid attention KV management, this number goes up again. Probably 280+ before summer.