Yesterday’s post: Gemma 4 12B jumps from 102 to 170 t/s thanks to PR #23398 (Gemma 4 MTP support for 12B/26B/31B) merged into llama.cpp upstream.
Today: PR #24282 (the E2B/E4B counterpart) merged at 20:48 UTC last night by max-krasnyansky (the original Gemma 4 MTP author at Qualcomm).
Gemma 4 audio E4B on Olares One:
- v1.0.1 baseline (pre-MTP): 47 t/s
- v1.1.1 (PR #24282 + MTP Q8_0 drafter): 288.4 t/s
- +513% speedup. 6.1×.
Same model file. Same drafter pattern. Same hardware. Just an image rebuild from post-merge source + a community Q8_0 drafter aligned with the official arch convention.
The family lap in two PRs
Gemma 4 dropped late May in 5 sizes: E2B / E4B (the small actives, MoE), 12B (dense), 26B-A4B (MoE), 31B (dense). When Google shipped their official MTP recipe (assistant drafters), llama.cpp accepted it as two distinct PRs because E2B/E4B use a different arch from the rest of the family.
PR #23398 (am17an, merged 2026-06-07): MTP support for 12B + 26B-A4B + 31B. The dense + MoE-A4B variants share a homogeneous KV cache format.
PR #24282 (max-krasnyansky, merged 2026-06-08 20:48 UTC): MTP support for E2B + E4B. The E* variants have a KV layer-sharing optimization (layers 0-2 share their KV with layer 40, layer 3 with 41) that required separate code.
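To make the sharing pattern concrete, here is a minimal sketch that takes the description above at face value. It is an illustration only, not llama.cpp code: the names `KV_SHARE`, `kv_owner` and `distinct_kv_buffers` are hypothetical, and the layer count of 42 is an assumption (the text only tells us that layers up to 41 exist).

```python
# Illustration only: mirrors the pattern described above
# (layers 0-2 share their KV with layer 40, layer 3 with layer 41).
KV_SHARE = {0: 40, 1: 40, 2: 40, 3: 41}


def kv_owner(layer: int) -> int:
    """Return the layer whose K/V buffers `layer` uses (itself if unshared)."""
    return KV_SHARE.get(layer, layer)


def distinct_kv_buffers(n_layers: int) -> int:
    """Count how many separate K/V buffers a model with n_layers would hold."""
    return len({kv_owner(i) for i in range(n_layers)})


# ASSUMPTION: 42 layers, chosen only because layer 41 is the highest one mentioned.
print(distinct_kv_buffers(42))
```

A cache that assumes one K/V buffer per layer breaks on this kind of mapping, which is one plausible reason the E* variants needed their own code path.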
Both PRs together = the entire Gemma 4 family has official upstream MTP support. It’s rare to capture a release feature at this depth, in under a week, across 5 sizes.
Today’s drafter
For the 12B yesterday I was using Janvitos/gemma-4-12B-it-qat-assistant-MTP-Q8_0-GGUF — GGUF conversion of Google’s official QAT assistant checkpoint.
For the E4B today: NicklausCairns/gemma-4-E4B-it-qat-assistant-MTP-Q8_0 — same pattern, 94 MB Q8_0, GGUF arch gemma4-assistant (hyphen, not underscore) matching the upstream convention.
Once the gemma4-assistant (hyphen) convention was set post-merge, the community immediately published GGUF conversions for every size. This is the first time I’ve seen a model release, its drafter, the community GGUFs, and a shipped chart all land in under a week.
The flash attention trap
First attempt on Olares One: pod crashes at boot with:
/src/ggml/src/ggml-cuda/fattn.cu:110: fatal error
The ggml_cuda_flash_attn_ext kernel triggers an assertion on an unhandled case. Likely cause: Gemma 4 E4B uses KV layer sharing (visible in the logs: layer 3: sharing with layer 41. k = 0x71a414000000, v = 0x71a418000000; layers 0-2: sharing with layer 40) which interacts badly with the FA + MTP draft + audio mmproj pipeline.
Probably a coverage bug: the combination of 4 features (E4B KV-layer-share + FA + spec-decode + multimodal mmproj) is not exercised by any upstream test. Filing a minimal repro on llama.cpp should get it fixed in a follow-up.
Immediate workaround: --flash-attn off. Clean boot, decode @ 288 t/s. On a 4B model with KV layer sharing and few attention heads, the FA speedup should be marginal, so we lose almost nothing.
The bench
Olares One (RTX 5090 Mobile sm_120 Blackwell, 24 GB), 3 runs of the Space Invaders HTML prompt, single user, vision off (audio only for this chart), MTP n=2 active:
Run 1: 289.32 t/s | 2000 tokens
Run 2: 295.22 t/s | 2000 tokens
Run 3: 280.51 t/s | 1849 tokens
AVG: 288.4 t/s. MTP draft acceptance 72-79% (AVG 76%). VRAM ~7 GB total (Q4_K_M target + audio mmproj BF16 + Q8_0 drafter + KV).
vs v1.0.1 baseline (47 t/s, no MTP) = +513% / 6.1× speedup.
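For anyone who wants to re-derive the headline numbers, the arithmetic is short. The values are the three runs and the baseline quoted above; the variable names are mine.

```python
# Throughput of the three bench runs (t/s) and the v1.0.1 baseline, as quoted above.
runs = [289.32, 295.22, 280.51]
baseline = 47.0

mean_tps = sum(runs) / len(runs)   # arithmetic mean of the three runs
speedup = mean_tps / baseline      # multiplicative speedup vs the baseline
gain_pct = (speedup - 1.0) * 100   # same figure expressed as a percentage gain

print(f"mean={mean_tps:.2f} t/s  speedup={speedup:.2f}x  gain=+{gain_pct:.1f}%")
```

The mean lands at about 288.35 t/s, which is where the 6.1× and roughly +513% figures come from.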
At a 76% accept rate, the Q8_0 drafter sits below what we see on 12B (91%, Janvitos) and 35B-A3B (86%, colefuoco), but it stays in the same league. With E4B's 4B active params and the QAT-matched drafter, draft acceptance is high enough that MTP n=2 pays off.
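A rough way to read an acceptance rate is to convert it into expected tokens per target forward pass. The sketch below uses the standard simplified model from the speculative decoding literature, and it rests on two assumptions: each drafted token is accepted independently with the same probability, and the reported acceptance percentage can stand in for that probability. llama.cpp may compute its statistic differently, and `expected_tokens_per_step` is a hypothetical helper name.

```python
def expected_tokens_per_step(accept_rate: float, n_draft: int) -> float:
    """Expected tokens emitted per target-model forward pass.

    Simplified model: every drafted token is accepted independently with
    probability accept_rate, and verification stops at the first rejection.
    The target model always contributes one token of its own (the k = 0 term),
    so the result is 1 + a + a^2 + ... + a^n_draft.
    """
    return sum(accept_rate ** k for k in range(n_draft + 1))


# Acceptance rates quoted above. n=2 is stated for the E4B bench only;
# applying it to the other two models is an assumption made for comparison.
for label, rate in [("E4B", 0.76), ("12B", 0.91), ("35B-A3B", 0.86)]:
    print(f"{label}: {expected_tokens_per_step(rate, 2):.2f} tokens/step")
```

Under these assumptions, 76% acceptance at n=2 gives about 2.34 tokens per step, against 2.74 at 91%. The per-step yield tops out at n+1 = 3 tokens, so this model describes only the MTP share of the gain, not the whole v1.0.1 to v1.1.1 delta.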
New leaderboard on Olares One
| Stack | t/s | Context | Vision | Audio | Tool calling | VRAM |
|---|---|---|---|---|---|---|
| Gemma 4 E4B audio + MTP (v1.1.1, today) | 288.4 | 32K | — | ✓ | ✓ | ~7 GB |
| Qwen 3.6 35B-A3B abliterated (text champion) | ~250 | 65K | ✓ | — | ✓ | ~24 GB |
| Gemma 4 12B QAT + upstream MTP (v1.0.5, yesterday) | 169.8 | 65K | ✓ | — | ✓ | 8.6 GB |
| Gemma 4 12B Q8_0 (v1.0.2, 3 days ago) | 87.5 | 32K | ✓ | — | ✓ | ~14 GB |
Gemma 4 audio E4B takes first place in raw throughput, ahead even of the Qwen 35B-A3B text champion. The reason: E4B is 4B active params on a model with 8B total, vs 3B active on Qwen 35B-A3B which weighs 17 GB. The active-params-to-total-VRAM ratio is better on E4B → more tokens per second extractable from the same GPU.
Different use cases, though:
- Audio E4B: voice agent input layer, light ASR, speech summarization
- Qwen 35B-A3B: general high-quality agent, reasoning, code, vision
- Gemma 4 12B: chat with vision, multilingual, mid-tier reasoning
Not competitors: complementary. On the same Olares One, a user mixing all 3 has a very different stack from a user with only the 35B-A3B.
CUDA 13.3 trap reminder
Same as yesterday’s post: I had to build my custom image on the nvidia/cuda:13.1.1-devel-ubuntu24.04 base. ggml-org bumped their official Dockerfile to CUDA 13.3 on 2026-06-07, and the Olares One driver 590.44.01 caps at 13.1.x. Any official image built after the bump fails at startup:
nvidia-container-cli: requirement error: unsatisfied condition: cuda>=13.3
My custom image builds directly from main HEAD and bypasses the official Dockerfile: CUDA 13.1.1 plus the latest source code post-PR #24282 is what works on Olares.
Final tag: aamsellem/llamacpp-gemma4mtp:main-postmerge-cuda131-r3 (r3 = third revision after the curl-CLI binary fix that was missing in r1/r2).
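The startup failure above comes down to a version comparison: the image declares a minimum CUDA version and the driver advertises the highest one it supports. Here is a small sketch of that check, handy for vetting an image tag before pulling it. It is my own illustration, not the logic nvidia-container-cli actually runs; `parse_version` and `requirement_met` are hypothetical helpers, and the only requirement shape it handles is the `cuda>=X.Y` form seen in the error.

```python
import re


def parse_version(text: str) -> tuple:
    """Turn '13.1.1' into (13, 1, 1) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def requirement_met(requirement: str, driver_cuda: str) -> bool:
    """Check a 'cuda>=X.Y' style requirement against the driver's max CUDA version."""
    match = re.fullmatch(r"cuda>=(\d+(?:\.\d+)*)", requirement.strip())
    if match is None:
        raise ValueError(f"unsupported requirement format: {requirement!r}")
    return parse_version(driver_cuda) >= parse_version(match.group(1))


# The driver on the Olares One caps at CUDA 13.1.x (from the post).
print(requirement_met("cuda>=13.3", "13.1"))  # image built after the bump
print(requirement_met("cuda>=13.1", "13.1"))  # image built on the 13.1.1 base
```

Comparing integer tuples rather than raw strings avoids ordering mistakes once minor versions reach two digits.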
Coda
Three posts in three days. Three times the same story:
- Monday: Gemma 4 QAT release → 87 → 102 t/s on 12B (+17%, weight quantization upgrade)
- Yesterday: PR #23398 merge → 102 → 170 t/s on 12B (+67%, upstream MTP compute path optimization)
- Today: PR #24282 merge → 47 → 288 t/s on E4B audio (+513%, MTP feature unlock on the family’s second half)
Each jump is between two commits, not between two hardware generations. The hardware stays fixed — RTX 5090 Mobile since we unboxed the Olares One. The software keeps eating the problem, faster than I can write.
On Olares One: pull https://orales-one-market.aamsellem.workers.dev, upgrade Gemma 4 Audio One to v1.1.1 from the market UI. ~5 GB image pull + 4 GB models (E4B target + audio mmproj + Q8_0 drafter), ~40s boot post-download.
Upstream-side, next to watch: KVarN Issue #10 (head_dim=256 support) which would unblock the Marc agent-powerhouse path (vLLM + Qwen 3.6 27B + KVarN + MTP @ 131K context). philippebich shipped the MLA adaptation yesterday — Qwen/Gemma head_dim=256 next in the queue.
Three posts in three days. Let’s see how many this week ends with.