Aurélien AMSELLEM

Gemma 4 E4B MTP on the RTX 5090M: 178 t/s, 24h after the vLLM upstream merge

On May 6 at 14:39 UTC, lucianommartins merges PR #41745 into vLLM main: native support for Gemma 4 Multi-Token Prediction drafters. On May 7 at 06:13 UTC the nightly Docker drops. At 06:35 UTC, my Olares One hits 178.6 t/s with 77.3% acceptance — first public Gemma 4 MTP bench on consumer mobile Blackwell.

Hi there.

On May 5, 2026, Google publishes the official Multi-Token Prediction drafters for the four Gemma 4 sizes (E2B, E4B, 26B-A4B, 31B). Mini hype explosion on r/LocalLLaMA — 891 upvotes in 12h. But at the time, nobody could use them: the Gemma4AssistantForCausalLM arch with its centroid LM heads (top_k=32, num_centroids=2048) wasn’t recognized by either llama.cpp or vLLM.

On May 6 at 14:39 UTC, lucianommartins merges PR #41745 into vLLM main: “[Spec Decode] Add Gemma4 MTP speculative decoding support”. 9 files, +1,121 / -72 lines. New Gemma4MTP model, new Gemma4Proposer, and — juicy detail — the centroids masking optimization that drops the lm_head compute from 262K to 4K candidates via the learned selection.

On May 7 at 06:13 UTC, the nightly Docker vllm/vllm-openai:nightly-1acd67a795ebccdf9b9db7697ae9082058301657 lands on Docker Hub.

At 06:35 UTC, ~22 min later, I send the first prompt to my Olares One. 178.6 t/s average across three runs, 77.3% acceptance rate. First public Gemma 4 MTP bench on consumer mobile Blackwell (RTX 5090M sm_120, 24GB GDDR7).

Here’s how I got there.

The drafter, in 2 sentences

Gemma 4 ships with a small “assistant” model trained jointly with the target. At inference time it predicts several tokens ahead from the target’s last hidden state; the verifier accepts the sequence in parallel if it matches. For E2B/E4B there’s an elegant detail: a centroids head that reduces the 262,144-token softmax to a 2,048-centroid mask — faster AND (per Google) lossless on quality.

Gemma4AssistantForCausalLM drafter architecture (read from config.json yesterday):

- 4 hidden layers, hidden_size 256, intermediate_size 2048 (tiny)
- 4 attention heads, 1 KV head (KV-shared across the 4 layers)
- centroid_intermediate_top_k=32, num_centroids=2048
- vocab_size 262144 (matches Gemma 4)
- 78M params for E2B-it-assistant (158 MB safetensors)

It’s a dedicated MTP drafter, not a repurposed small Gemma 4. Hence the dedicated upstream PR.
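
To make the centroid trick concrete, here is a rough PyTorch sketch of what a centroid-masked LM head could look like. This is my reading of the config, not the PR’s code: the class and buffer names are made up, and only the shapes (hidden_size 256, vocab 262,144, 2,048 centroids, top_k 32) come from config.json. With 262,144 / 2,048 = 128 tokens per centroid, keeping the top 32 centroids leaves 32 × 128 = 4,096 candidate tokens, which is the 262K → 4K reduction mentioned above.

import torch
import torch.nn as nn

class CentroidMaskedLMHead(nn.Module):
    """Illustrative sketch only; names are hypothetical, shapes from config.json."""

    def __init__(self, hidden_size=256, vocab_size=262_144,
                 num_centroids=2_048, top_k=32):
        super().__init__()
        self.top_k = top_k
        tokens_per_centroid = vocab_size // num_centroids          # 128
        # Scores every centroid from the drafter's last hidden state.
        self.centroid_head = nn.Linear(hidden_size, num_centroids, bias=False)
        # Full-vocab projection; only a ~4K slice of it is evaluated per step.
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)
        # Learned token -> centroid assignment (modeled here as a fixed split).
        self.register_buffer(
            "centroid_to_tokens",
            torch.arange(vocab_size).view(num_centroids, tokens_per_centroid),
        )

    def forward(self, hidden):                                   # [batch, hidden]
        centroid_scores = self.centroid_head(hidden)             # [batch, 2048]
        top = centroid_scores.topk(self.top_k, dim=-1).indices   # [batch, 32]
        # Expand the 32 selected centroids into their 4,096 candidate token ids.
        candidates = self.centroid_to_tokens[top].flatten(1)     # [batch, 4096]
        # Compute logits only for those candidates instead of all 262K tokens.
        weights = self.lm_head.weight[candidates]                # [batch, 4096, hidden]
        logits = torch.einsum("bh,bch->bc", hidden, weights)
        return candidates, logits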

Why this is only landing now

Before May 6 14:39 UTC, the gemma4_assistant arch was unknown to vLLM: booting with the drafter crashed straight into NotImplementedError. PR #41745 adds the missing pieces (the Gemma4MTP model, the Gemma4Proposer, and the drafter-config conversion described below).

Config-side, it boils down to:

--speculative-config '{"method":"mtp","model":"google/gemma-4-E4B-it-assistant","num_speculative_tokens":3}'

The code automatically detects model_type == "gemma4_assistant" and converts it to gemma4_mtp at load.
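
For completeness, the same config should map 1:1 onto the offline Python API. A minimal sketch, untested on my side (I only ran the vllm serve path), assuming the nightly’s LLM constructor accepts the same speculative_config dict as the CLI flag:

from vllm import LLM, SamplingParams

# Same JSON the --speculative-config flag takes, passed as a dict.
llm = LLM(
    model="google/gemma-4-E4B-it",
    max_model_len=32_000,
    gpu_memory_utilization=0.85,
    speculative_config={
        "method": "mtp",
        "model": "google/gemma-4-E4B-it-assistant",
        "num_speculative_tokens": 3,
    },
)

params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=800)
out = llm.generate(["Write a Space Invaders clone in a single HTML file."], params)
print(out[0].outputs[0].text[:200])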

The nightlies race

Timing oddity: the standard vllm/vllm-openai:nightly from May 6 06:08 UTC predates the merge. When I tried to test in the late afternoon, I first reached for tokenspeed-preview-ubuntu2404 (pushed at 14:47 UTC, 8 minutes after the merge) — boot crash: NotImplementedError: Unsupported speculative method: 'mtp'. The build cutoff must have been pre-merge.

On May 7 at 06:13 UTC, vLLM publishes a new nightly tagged nightly-1acd67a795ebccdf9b9db7697ae9082058301657 — commit 1acd67a from 2026-05-07 04:57 UTC (several hours post-#41745). This time it boots.
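
To avoid burning another 20 minutes on a pre-merge image, it’s worth checking whether the new arch is actually registered before wiring up the whole pod. A hedged sketch: I’m assuming the PR registers an architecture whose name contains "Gemma4" (grep the PR diff for the exact string) and that ModelRegistry.get_supported_archs() is available in the nightly you pulled.

# Run with the Python inside the candidate image, e.g.
#   docker run -i --rm --entrypoint python3 <nightly-image> - < check_mtp.py
from vllm import ModelRegistry

archs = ModelRegistry.get_supported_archs()
print(sorted(a for a in archs if "Gemma4" in a))
# A post-merge nightly should list the new MTP/assistant entry here;
# a pre-merge build will only show the base Gemma 4 architectures.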

The Kubernetes manifest

On Olares One the pod is dead simple — no Genesis, no patches, no fork:

containers:
  - name: vllm-server
    image: docker.io/vllm/vllm-openai:nightly-1acd67a795ebccdf9b9db7697ae9082058301657
    command: ["sh", "-c"]
    args:
      - |
        exec vllm serve google/gemma-4-E4B-it \
          --served-model-name gemma-4-e4b-mtp \
          --host 0.0.0.0 --port 8000 \
          --max-model-len 32000 \
          --gpu-memory-utilization 0.85 \
          --max-num-seqs 1 --dtype auto \
          --trust-remote-code --download-dir /models \
          --enable-prefix-caching \
          --speculative-config '{"method":"mtp","model":"google/gemma-4-E4B-it-assistant","num_speculative_tokens":3}'
    env:
      - name: CUDA_DEVICE_MEMORY_LIMIT_0
        value: "24000m"   # workaround for the HAMi 0m parsing bug

The only Olares One-specific bit is the CUDA_DEVICE_MEMORY_LIMIT_0=24000m override — without it, HAMi parses “0m” as 0 bytes and any CUDA alloc crashes at EagleProposer.__init__ → torch.zeros. Off-Olares, drop that line.

The bench

Three standard Space Invaders prompts (HTML+CSS+JS, max_tokens=800, temp=0.6, top_p=0.95):

Run 1 (cold start): 800 tok in  6.17 s = 129.73 t/s
Run 2:              800 tok in  4.17 s = 191.73 t/s
Run 3:              800 tok in  3.73 s = 214.38 t/s

AVG = 178.6 t/s, range 129-214.
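
For reference, the timing harness is nothing fancier than the OpenAI-compatible endpoint plus a stopwatch. A minimal sketch; the exact prompt text, the localhost base URL, and reading the token count from usage are my assumptions on top of the parameters listed above:

import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
PROMPT = "Write a Space Invaders clone as a single HTML file (HTML+CSS+JS)."

def bench(run: int) -> float:
    t0 = time.perf_counter()
    resp = client.chat.completions.create(
        model="gemma-4-e4b-mtp",   # --served-model-name from the manifest
        messages=[{"role": "user", "content": PROMPT}],
        max_tokens=800, temperature=0.6, top_p=0.95,
    )
    dt = time.perf_counter() - t0
    tokens = resp.usage.completion_tokens
    print(f"Run {run}: {tokens} tok in {dt:5.2f} s = {tokens / dt:6.2f} t/s")
    return tokens / dt

runs = [bench(i) for i in (1, 2, 3)]
print(f"AVG = {sum(runs) / len(runs):.1f} t/s")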

MTP metrics (extracted from vLLM logs):

Mean acceptance length      : 3.32 (1 bonus token + accepted drafts, 3 drafted per step)
Per-position acceptance rate : 0.868 / 0.765 / 0.687
Avg Draft acceptance rate   : 77.3 %
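
The three numbers are consistent with each other: the mean acceptance length is the guaranteed bonus token plus the sum of the per-position rates, and the average draft acceptance rate is that sum divided by the 3 draft positions.

per_position = [0.868, 0.765, 0.687]        # acceptance rate at draft positions 1..3

accepted_per_step = sum(per_position)       # 2.32 draft tokens accepted per step
mean_accept_len = 1 + accepted_per_step     # 3.32 (the +1 is the verifier's bonus token)
avg_accept_rate = accepted_per_step / len(per_position)

print(round(mean_accept_len, 2))            # 3.32
print(f"{avg_accept_rate:.1%}")             # 77.3%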

77% acceptance is very high. For comparison, Qwen3.6 + MTP on llama.cpp sits around 64% on the same hardware. The difference probably comes from the centroids optimization plus the fact that the E4B drafter is small AND specifically trained to match the target, so its drafts are more accurate than a generic drafter's.

Comparison on Olares One (RTX 5090M 24GB sm_120)

Stack                                        | Model                         | t/s    | Notes
llama.cpp baseline                           | Qwen3.6-27B Q4_K_M            | 33-37  | pure upstream
llama.cpp + MTP (PR #22673)                  | Qwen3.6-27B-MTP-Q4_K_M @ 32K  | 78.1   | RDson recipe
llama.cpp + MTP (PR #22673)                  | froggeric MTP @ 128K          | 65.1   | trade context for perf
vLLM no-Genesis + #39931 + --enforce-eager   | Qwen3.6-27B int4 AutoRound    | 72.55  | CG bug workaround
vLLM Turbo (Genesis)                         | Qwen3.6-27B int4 AutoRound    | 88.02  | 8 patches
Lucebox DFlash v1.4.4                        | Qwen3.6-27B Q4_K_M            | 88.5   | custom engine
vLLM nightly + PR #41745                     | Gemma 4 E4B + MTP             | 178.6  | pure upstream, 1 PR

Note: Gemma 4 E4B is a ~5B effective model, much smaller than Qwen3.6-27B. Raw t/s comparison isn’t apples-to-apples. But for low-latency agentic/voice use cases on Gemma 4, we’re at 178 t/s on a 100% upstream stack.

May 8 update: the llama.cpp counterpart hits 206 t/s on E2B and 140 t/s on 26B-A4B

On the llama.cpp side, AtomicChat also shipped Gemma 4 MTP support (200+ upvotes on r/LocalLLaMA on May 7). They maintain a fork, AtomicBot-ai/atomic-llama-cpp-turboquant, that adds the gemma4_assistant arch, the --mtp-head runtime flag, and, as an unexpected freebie, the TurboQuant KV cache (-ctk turbo3 -ctv turbo3).

I built their fork for sm_120 (image aamsellem/llamacpp-atomic-mtp:0.1.0, 2.72 GB) and benched the two target variants we have on HF:

Gemma 4 E2B + MTP (atomic llama.cpp)
  eval time = 4.84 ms per token = 206.56 t/s
  draft acceptance rate = 60.93 %
  3,198 tokens generated in 15.48 s
Gemma 4 26B-A4B + MTP (atomic llama.cpp)
  eval time = 7.14 ms per token = 140.03 t/s
  draft acceptance rate = 78.15 %
  3,238 tokens generated in 23.12 s

The 26B-A4B hits 140 t/s with 78% acceptance — first run, no warm-up. That beats AtomicChat’s M5Max reference bench (138 t/s). And since it’s a 26B MoE, you get roughly dense-26B quality at the latency of a ~6B dense model.

So on the Olares One side we now have two validated Gemma 4 MTP paths:

Path                                          | Model            | t/s    | Stack
vLLM nightly + PR #41745                      | Gemma 4 E4B      | 178.6  | pure upstream, 1 merged PR
llama.cpp + atomic-llama-cpp-turboquant fork  | Gemma 4 E2B      | 206.6  | tier-3 fork + 2 GGUFs (unsloth target + AtomicChat drafter)
llama.cpp + atomic-llama-cpp-turboquant fork  | Gemma 4 26B-A4B  | 140.0  | tier-3 fork + 2 GGUFs

The vLLM side is cleaner maintenance-wise (a single upstream PR that will eventually ship in a stable release); the llama.cpp side delivers higher t/s on the smaller models AND supports the 26B-A4B MoE. Pick depending on your use case (latency vs quality).

Why this changes the game

A year ago we’d have spent two weeks hacking a fork or writing monkey-patches. Here, 24 hours after the upstream merge, the nightly Docker is out and the bench runs. That’s the direction the ecosystem is heading: techniques that used to live in forks (DFlash via Lucebox, MTP via Genesis for vLLM, MTP via am17an for llama.cpp) end up in mainline.

For Gemma 4 specifically, the vLLM team did great work — the PR adds infrastructure that’ll be reusable for other dedicated drafter archs (e.g. an eventual Mimo v2.5 dedicated drafter, or future Llama 4 ones).

To reproduce

On the vLLM side, the manifest and speculative-config above are all you need. If you test on other consumer Blackwell hardware (5070, 5080, desktop 5090), send your numbers in the comments and we’ll build up a comparison base.

Credits

That’s it! If you run Gemma 4 on an Olares One and you reproduce these 178 t/s (or beat them with a num_speculative_tokens sweep), send me your numbers. See you next time!


Disclosure — All the benchmarks in this post run on my own Olares One. If the content was useful and you’re considering one, ordering through this referral link gets you $400 off ($3,599 instead of $3,999) and pays me $200. I’m mentioning this out of transparency — and yes, incidentally, it helps keep the blog alive (hosting, domain, and the time I spend writing here). Link valid until late June 2026.
