Gemma 4 12B QAT lands — +17% speed, −39% VRAM, 65K context on 24 GB consumer Blackwell

Google released the QAT (Quantization-Aware Training) variants of the entire Gemma 4 family today at 1pm UTC — E2B, E4B, 12B, 26B-A4B, 31B. All sizes under the google/ org, Apache 2.0, direct GGUF + mmproj.

Three hours later, Olares One is running on them in production for the 12B.

102.78 t/s vs 87.5 t/s baseline = +17.4% speed. 8.6 GB VRAM vs ~14 GB = −39%. Context 32K → 65K with margin to spare. Tool calling intact, vision intact (modulo an mmproj gotcha I explain below).

What QAT is, in one explanation

Quantization-Aware Training means training the model KNOWING it’ll be quantized. During the final fine-tuning, the forward pass simulates Q4_0 rounding — so gradients flowing back nudge weights to be robust to that precision.

Result: a Q4_0 trained with QAT preserves Q8_0 PTQ-grade quality (Post-Training Quantization, what unsloth or bartowski normally do after the fact). In short: same quality, half the file size.

The technique isn’t new — DeepSeek did it on their BF16→FP8 recipe. But this is the first time Google ships it upstream and public for Gemma. And the first time there’s an official HuggingFace repo with the artifacts ready to drop in.

Why it matters on consumer Blackwell 24 GB

Olares One is an RTX 5090 Mobile, 24 GB GDDR7, sm_120 Blackwell. The frontier between “fits comfortably” and “need to make trade-offs” is very narrow when you want everything at once:

12B Q8_0 model: 12.7 GB
mmproj BF16 (vision): 167 MB
KV cache @ 32K q8_0/q8_0: ~600 MB
MTP drafter Q8_0 (colefuoco assistant): 465 MB
Compute buffer ubatch 2048: ~1.5 GB

Total: ~15 GB. On 24 GB physical that leaves 9 GB of headroom, but HAMi vGPU on Olares One caps to 20 GB per pod by default. It fits, but the context stayed stuck at 32K.

With QAT Q4:

12B QAT UD-Q4_K_XL model: 7.5 GB (−40%)
The rest unchanged: 2.7 GB extra

Total: 8.6 GB observed in production. On the same 20 GB HAMi cap, that frees 11 GB to push the context to 65K with 3 GB of margin left.

What I shipped

Chart llamacppgemma412bone v1.0.2 → v1.0.3 on my market source https://orales-one-market.aamsellem.workers.dev.

Stack:

Image      : aamsellem/llama-cpp-gemma4mtp:am17an-dd97604
             (custom build from am17an PR #23398, Gemma 4 MTP support + gemma4uv projector)
Model      : unsloth/gemma-4-12B-it-qat-GGUF
File       : gemma-4-12B-it-qat-UD-Q4_K_XL.gguf  (6.4 GB, imatrix + upcast sensitive layers)
Mmproj     : google/gemma-4-12B-it-qat-q4_0-gguf
             mmproj-gemma-4-12b-it-qat-q4_0.gguf  (167 MB, vision encoder)
Drafter    : colefuoco00/gemma-4-12B-it-assistant-GGUF Q8_0  (465 MB)
Context    : 65536
KV cache   : q8_0 / q8_0
Spec       : --spec-type draft-mtp --spec-draft-n-max 2

Why unsloth UD rather than Google’s plain Q4_0? Unsloth runs imatrix on top of the QAT base (calibration set + selective upcast of the most sensitive layers to 8-bit). Head-to-head bench further down in this article.

The unsloth mmproj gotcha (one commit lost)

First attempt by just swapping TARGET_MODEL/TARGET_FILE and keeping MMPROJ_FILE=mmproj-BF16.gguf same as v1.0.2:

mtmd_init_from_file: error: mismatch between text model
(n_embd = 3840) and mmproj (n_embd = 2048)
hint: you may be using wrong mmproj

The mmproj-BF16.gguf unsloth uploaded in their 12B QAT repo is actually an E4B-sized mmproj (n_embd=2048). The 12B Gemma 4 has n_embd=3840. So the vision encoder’s embedding dimension doesn’t match the text backbone. Unsloth upload bug — they probably scripted the conversion and pulled the wrong mmproj source.

Fix: pull the mmproj from Google’s official repo instead:

https://huggingface.co/google/gemma-4-12B-it-qat-q4_0-gguf/resolve/main/mmproj-gemma-4-12b-it-qat-q4_0.gguf

167 MB. Compatible with the unsloth UD-Q4_K_XL weights since both derive from the same Google QAT BF16 base. n_embd=3840 confirmed via logs. Vision loaded, single endpoint, image_url API ready.

Chart-side workaround: add a MMPROJ_MODEL variable separate from TARGET_MODEL in the ConfigMap, and change the mmproj download path to use ${MMPROJ_MODEL} instead of ${TARGET_MODEL}. Three lines.

The bench

Olares One, RTX 5090M sm_120 Blackwell mobile, 24 GB physical (HAMi cap 20 GB).

3 runs Space Invaders HTML, single user, vision-loaded, MTP active:

Run 1: 107.18 t/s | 2000 tokens
Run 2: 100.89 t/s | 2000 tokens
Run 3: 100.27 t/s | 1886 tokens

AVG: 102.78 t/s. MTP draft acceptance 72.5–75.0% (vs v1.0.2 84–88%, slightly lower but still excellent). GPU usage: 8.6 GB.

vs v1.0.2 baseline (unsloth Q8_0): +17.4% speed, −39% VRAM, context bumped 32K → 65K.

The accept rate dropping ~12 points is expected — the colefuoco drafter was calibrated against the Q8_0 base, and the logits distribution of a QAT-Q4 diverges slightly (Q4 rounding isn’t identical to Q8 floats). But the speed win largely compensates the accept loss.

Head-to-head: Google Q4_0 plain vs Unsloth UD-Q4_K_XL

Since unsloth published 2h after Google, I wanted to verify that picking their UD variant wasn’t naive. Cross-bench on the same chart, just TARGET swapped.

Variant	AVG t/s	MTP accept	GPU	File size
Google Q4_0 plain	103.27	73–74%	8.84 GB	6.65 GB
Unsloth UD-Q4_K_XL	102.78	72–75%	8.6 GB	6.4 GB

Quasi-tie on every axis. Speed Google +0.5% (within noise σ ≈ 3), MTP identical, but Google +0.24 GB VRAM and +250 MB file. No reason to switch: unsloth UD has the imatrix bonus (calibration set + sensitive-layer upcast) without any measurable speed penalty.

Decision: stays on unsloth UD-Q4_K_XL for production.

12B leaderboard on Olares One

Stack	t/s	Context	Vision	Tool calling	VRAM
Gemma 4 12B QAT UD-Q4_K_XL (v1.0.3, today)	102.78	65K	✓	✓	8.6 GB
Gemma 4 12B Q8_0 (v1.0.2, baseline yesterday)	87.5	32K	✓	✓	~14 GB
Gemma 4 12B Q8_0 no-MTP (v1.0.0)	47	32K	✓	✓	~13 GB

The step to QAT is more than +17% — it’s the first time a 12B with vision + MTP + 65K ctx fits comfortably on 24 GB physical with free headroom. The free headroom is what opens the door to parallel runs: you can now keep the 12B in the background while another app uses ~10 GB of the same GPU.

What’s still to unlock

Official Google MTP drafter. Google published the QAT MTP drafters for every size (google/gemma-4-12B-it-qat-q4_0-unquantized-assistant), but it’s safetensors-only — the GGUF conversion depends on llama.cpp PR #23398 (Gemma 4 MTP support, am17an, still WIP at time of writing). When it merges, we should switch from the community colefuoco drafter to the official Google one, likely with an accept-rate bump.

Upstream build b8740+. The image we run is my custom build from the am17an branch. When the PR merges into master and ggml-org cuts a new Docker tag with the gemma4uv projector, we can drop the custom image and go back to the official upstream.

Generalize to other Gemma 4 QAT sizes. The E4B QAT exists for our llamacppgemma4audione app (Gemma 4 E4B + audio). Worth testing whether swapping E4B Q8_0 → E4B QAT Q4 replicates the +17% speed, −40% VRAM pattern. On smaller models it’s less critical but it’s still margin gained.

Coda

When DeepSeek released V2 in native FP8 two years ago, the community said “interesting but Q8_0 PTQ after the fact is near-lossless so QAT doesn’t matter much”. That was partly true at the time. Today on consumer Blackwell mobile 24 GB where every GB counts, the Q4↔Q8 gap becomes critical again.

Google now ships official QAT for the entire Gemma 4 family on release day. Unsloth re-quantizes in UD within 2 h. Olares One runs it under 6 h. The release → bench → ship pipeline cut to half a day.

The scoreboard of models that fit on 24 GB consumer is changing. The 12B used to be a “single-tenant” app on Olares One (impossible to leave running while another LLM was running). With QAT, it can co-exist with the 35B-A3B champion. That’s a qualitative change, not just quantitative.

On Olares One: pull https://orales-one-market.aamsellem.workers.dev, upgrade Gemma 4 12B One to v1.0.3 via the market UI.