
# qat

All posts tagged "qat".

Gemma 4 12B hits 170 t/s — upstream merge buys +67% speed for free

08.06.2026

Two days ago I shipped Gemma 4 12B QAT at 102 t/s on Olares One. Today I ship 170 t/s. Same hardware. Same model file. Same drafter. Same context. The delta: am17an's PR #23398 (Gemma 4 MTP support) merged into upstream llama.cpp at 12:50 UTC. My custom image, a snapshot of the WIP branch at commit dd97604, was missing 10+ polish commits that ggerganov pushed for in review. That is +67% speed on the exact same setup, just by rebasing. Bonus: a critical finding about Olares One's NVIDIA driver capping CUDA at 13.1, which blocks the whole upstream Docker ecosystem.
Read →
Gemma 4 12B QAT lands — +17% speed, −39% VRAM, 65K context on 24 GB consumer Blackwell

05.06.2026

Google released the QAT (Quantization-Aware Training) variants of Gemma 4 today at 1pm UTC. Three hours later, Olares One is running them. On the 12B: 102.78 t/s vs the 87.5 t/s baseline = +17.4% speed. 8.6 GB VRAM vs ~14 GB = −39%. Context 32K → 65K with margin to spare. Tool calling intact, vision intact (modulo an mmproj gotcha I explain below).
Read →