# gemma-4

All posts tagged "gemma-4".

Gemma 4 audio E4B hits 288 t/s — the second upstream merge closes the family

09.06.2026

Yesterday I shipped Gemma 4 12B at 170 t/s via the upstream PR #23398 merge. Today PR #24282 (the E2B/E4B counterpart) merged. Custom rebuild, chart swap, bench: Gemma 4 audio E4B jumps from 47 t/s to 288 t/s. That is a 6.1x speedup on the same hardware for 5 minutes of config. One flash-attention trap along the way: the combination of Gemma 4 E4B, the audio mmproj, and the MTP draft crashes the CUDA flash attention kernel, and falling back to no-FA unlocks everything.
Read →
Gemma 4 12B hits 170 t/s — upstream merge buys +67% speed for free

08.06.2026

Two days ago I shipped Gemma 4 12B QAT at 102 t/s on Olares One. Today I ship 170 t/s. Same hardware. Same model file. Same drafter. Same context. The delta: am17an's PR #23398 (Gemma 4 MTP support) merged into llama.cpp upstream at 12:50 UTC. My custom image — a snapshot of the WIP branch at commit dd97604 — was missing 10+ polish commits that ggerganov forced in review. +67% speed on the exact same setup, just by rebasing. Bonus: critical insight on Olares One's nvidia driver capping CUDA at 13.1, blocking the whole upstream Docker ecosystem.
Read →
Gemma 4 12B QAT lands — +17% speed, −39% VRAM, 65K context on 24 GB consumer Blackwell

05.06.2026

Google released the QAT (Quantization-Aware Training) variants of Gemma 4 today at 1pm UTC. Three hours later, Olares One is running on them. On the 12B: 102.78 t/s vs 87.5 baseline = +17.4% speed. 8.6 GB VRAM vs ~14 GB = −39%. Context 32K → 65K with margin to spare. Tool calling intact, vision intact (modulo an mmproj gotcha I explain below).
Read →
Gemma 4 26B Vision at 250 t/s — vLLM v0.21 closed the gap with my text-only champion

23.05.2026

Two days ago I shipped Qwen 3.6 35B-A3B MTP at 249 t/s on Olares One. Text-only, but the new champion. Today the same hardware runs Gemma 4 26B at 250 t/s with vision and tool calling. The unlock: vLLM v0.21 quietly merged the official Google Gemma 4 MTP drafter. No more 5-fast/4-slow cycle bug from DFlash. No more 135 t/s no-spec fallback. Just full speed, plus images.
Read →
Gemma 4 26B-A4B vision via vLLM — 135 t/s at 128K for an office workhorse on 24 GB

15.05.2026

An Olares One peer user shared a Discord patch to restore vision on the gemma426ba4bone chart. 24 hours later, I shipped a vLLM variant hitting 135 t/s at 128K context — and the same user validated it in production. The story of a community-driven engineering loop, four llama.cpp configs benched in parallel, and the moment turbo3 KV stopped being the answer.
Read →
A week of benches on the Olares One: Gemma 4 MTP, Lucebox regression, vLLM no-Genesis hitting the workspace lock

08.05.2026

From May 5 to May 8, 2026, I benched everything that fit on a 24 GB RTX 5090M. Three findings: Gemma 4 MTP via vLLM lands at 178 t/s 24 hours after merge; Lucebox v1.9.0 mysteriously regresses from 88 to 69 t/s; and vLLM no-Genesis validates PR #39931 but stalls on P65/P22/P38. Plus housekeeping: 8 Qwen3.6 27B apps → 2.
Read →
