This one started on Discord. A peer user of Olares One (same RTX 5090M 24 GB sm_120 hardware as me) sent the kind of message that’s worth its weight in gold:
“I was looking for a vision capable model with decent context size and downloaded gemma426ba4bone but realized that one doesn’t have vision. So I tried a few things and came out with a pretty good result I wanted to share.”
He’d hand-patched my gemma426ba4bone v1.0.9 chart (text-only MTP variant) to restore vision: added mmproj-F16, removed MTP (incompatible with multimodal in llama.cpp), removed --mlock (his host couldn’t lock 19 GB), and re-enabled GGML_CUDA_GRAPH_OPT=1. 125 t/s avg with vision + 128K context.
Twenty-four hours later, I shipped vllmgemma426ba4bvisionone v1.0.1 at 135 t/s, 128K context, vision validated end-to-end, range 0.54 — and the same user validated it in production: “This one is really a perfect alrounder which many people will appreciate.”
Here’s how four llama.cpp configs and one pivot to vLLM got us there, and why turbo3 KV stopped being the answer on Gemma 4.
The use case framing
Mid-thread the peer dropped his actual goal, which is the most concrete framing for a local LLM workload I’ve seen in months:
“A tech stack that allows decent office work. A workhorse for the everyday use in a company. 2-3 concurrent user → vllm for PagedAttention. 128k or higher context window to allow decent multiple user and long(er) pdf processing. Vision to process scans and images. Performance > 100 t/s to make it acceptable to work with.”
Four constraints: PagedAttention (so vLLM, not llama.cpp), 128K context (long PDFs + multi-user reservation), vision (scan/image processing), >100 t/s. Let me try each in turn.
Step 1 — Ship the patch as-is
First instinct: ship his patched config exactly as he sent it, so the community can install it in one click via Olares Studio. Created llamacppgemma426ba4bvisionone v1.0.3 (atomic fork llama.cpp + mmproj F16 + q4_0 KV + no spec + --parallel 2 + GGML_CUDA_GRAPH_OPT=1). Benched 10 runs Space Invaders HTML, 2000 tokens each.
But I’d also been wanting to test 3 things he hadn’t bothered with:
- Is ngram-cache spec decoding worth the overhead on multimodal? (MTP is forbidden, but ngram works with vision in llama.cpp)
- Is turbo3 KV as good on Gemma 4 as it is on Qwen3.6?
- Does q8_0 KV beat both?
So four configs:
| Ver | KV | Spec | —parallel | AVG t/s | Range |
|---|---|---|---|---|---|
| v1.0.0 | turbo3 | ngram-cache | 2 | 115.93 | 0.23 |
| v1.0.1 | turbo3 | — | 2 | 116.20 | 0.15 |
| v1.0.2 | q8_0 | — | 2 | 123.97 | 0.22 |
| v1.0.3 | q4_0 | — | 1 | 121.81 | 0.26 |
Findings:
- ngram-cache contributes nothing on HTML generation. v1.0.0 → v1.0.1: drop ngram-cache, gain marginal +0.27 t/s. The drafter never fires on variable-output workloads — pure overhead. This was my first hypothesis to validate. Confirmed.
- turbo3 KV has a real overhead on Gemma 4 — about 6.7%. v1.0.1 → v1.0.2: swap turbo3 → q8_0, gain +6.7%. This was unexpected. On Qwen3.6 27B with BeeLlama, turbo3 KV is essential (saves 50% memory, enables 262K context). On Gemma 4, the Hadamard rotation dequant cost outweighs the bandwidth savings. The model class matters more than the KV size for this particular hardware.
- q4_0 KV is slightly slower than q8_0 but frees 3.5 GB headroom, which I needed for image compute buffer at 128K context. v1.0.3 keeps q4_0 for that reason. Trade-off: -1.7% throughput for safety margin.
That last point is what shipped as v1.0.3, the peer’s setup improved by ~3 t/s and validated end-to-end with image input (a 64×64 PNG gradient → “Red” correctly identified, 0.73 s wall time including image processing).
Step 2 — The vLLM pivot
The peer’s actual constraint was PagedAttention for 2-3 concurrent users, which llama.cpp can’t really do (the --parallel N flag exists but it’s slot-based reservation, not vLLM-grade multi-request scheduling). And his concern was that the Atomic fork I was using is maintained off-master, so future llama.cpp updates wouldn’t reach the path easily.
vLLM has Gemma 4 multimodal support natively via the Gemma4ForConditionalGeneration architecture in cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit (compressed-tensors WNA16-Marlin, ~17 GB on GPU, vision encoder bundled, no separate mmproj). Worth testing.
The config diffs:
Image: vllm/vllm-openai:tokenspeed-preview-x86_64-ubuntu2404
Model: cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit (native multimodal)
Attention: triton_attn (Gemma 4 head_dim heterogeneous forces this — flash_attn errors)
KV cache: fp8 (vLLM native)
Spec: none (DFlash drafter has a separate 5-fast/4-slow cycle bug, more below)
max_num_seqs: 4
max_num_batched_tokens: 8192
gpu_memory_utilization: 0.85 (initially)
max_model_len: 16384 (initially)
Bench, 10 runs, same Space Invaders HTML prompt:
AVG 135.97 t/s
MIN 135.67, MAX 136.23
Range 0.56
+9.7% over the best llama.cpp config (v1.0.2), with exceptional stability. Vision validated identically — image gradient → “Red”. Shipped as vllmgemma426ba4bvisionone v1.0.0. Deleted the llama.cpp variant (one app per use case, the peer was clear about wanting a single optimized Gemma vision app).
Step 3 — Pushing context from 16K to 128K
v1.0.0 shipped with 16K context, which was the conservative starting value matched from vllmgemma4dflashone. The peer’s office workhorse use case needs 128K (long PDF processing across 2-3 concurrent slots).
Math on 24 GB:
Target loaded: 16.6 GB
vLLM compute reserve: ~3-5 GB
Available for KV cache: ~3-4 GB (at gpu_mem 0.85)
fp8 KV per token: ~80-100 KB for Gemma 4 hybrid (mixed SSM + attn layers)
KV cache for 16K: ~1.4 GB
KV cache for 128K: ~10 GB ← doesn't fit at 0.85
The naive 0.85 utilization left 3.7 GB free at 16K, not enough for 128K. Bumped gpu_memory_utilization to 0.92 (which I’d validated as safe on vllmgemma4dflashone v1.0.4 already). That freed enough margin to fit 128K + the headroom for cudagraph profiling and multi-seq slot reservation.
Patched and benched, 10 runs:
AVG 135.85 t/s @ 128K context
MIN 135.52, MAX 136.06
Range 0.54
Identical to the 16K bench (135.97 t/s, range 0.56) within measurement noise. 8× the context for zero throughput regression. fp8 KV in vLLM is genuinely efficient on Gemma 4 hybrid — bandwidth pressure stays manageable even at 128K.
Shipped as vllmgemma426ba4bvisionone v1.0.1 within an hour. Total time from “peer mentioned 128K” to “v1.0.1 in production” — about 2 hours including the bench.
Why turbo3 stopped being the answer
This is the finding I want to highlight. Going into this work I had a strong prior: turbo3 KV (3-bit Walsh-Hadamard rotated cache) is always the right choice on Blackwell consumer mobile because (a) it saves 50% memory vs q4_0 and (b) BeeLlama hit 262K context on Qwen3.6 27B specifically because turbo3 was so compact.
That prior was wrong on Gemma 4. The bench progression makes it brutal:
v1.0.0 turbo3 KV + ngram-cache → 115.93 t/s
v1.0.1 turbo3 KV + no spec → 116.20 t/s
v1.0.2 q8_0 KV + no spec → 123.97 t/s ← +6.7% just by switching turbo3 → q8_0
What’s happening: on Gemma 4 dense / MoE-3.8B-active, the Hadamard rotation dequant adds ~120 micro-ops per token per layer. On Qwen3.6 hybrid arch (Gated DeltaNet + SSM), there are fewer attention layers (the SSM layers don’t use traditional KV), so the rotation overhead is amortized across less work. The same primitive that wins on Qwen3.6 loses on Gemma 4.
And then in vLLM the equivalent question is FP8 KV vs no KV quant. FP8 wins easily on Gemma 4: 1.7× capacity, ~zero quality loss per NVIDIA’s documentation, AND the dequant is fused into the matmul on H100/Blackwell tensor cores so there’s no measurable overhead. That’s why the 128K push works at no throughput cost — fp8 KV is the right primitive for this model class.
Per-model KV quant tradeoff is real. I’d been treating turbo3 as a universal default. That’s wrong. The right rule is: turbo3 for Qwen3.6 (saves the day at 262K), q8_0 for Gemma 4 + AtomicBot fork, fp8 for vLLM-served models. There’s no single “best KV format” — it depends on the dequant kernel path used by the inference engine.
The DFlash drafter retest
One open question I needed to chase: does adding DFlash spec decoding push throughput beyond 135 t/s? vLLM has a z-lab DFlash drafter for Gemma 4 specifically (z-lab/gemma-4-26B-A4B-it-DFlash). My earlier vllmgemma4dflashone text-only app hit 224 t/s using it.
But I’d previously documented a “5-fast/4-slow cycle” on that DFlash text-only deployment: bimodal distribution where 60% of requests hit 215-235 t/s and 40% hit 60-100 t/s, with no recovery strategy that worked (enforce-eager, prefix-caching off, max-num-seqs=1 — all tried, none fixed).
Today vLLM merged PR #42692 [Bugfix] DFlash FP8 KV-Cache. Hopeful, I switched to v0.21.0-x86_64-cu129-ubuntu2404 (the latest stable release with the fix), added the DFlash drafter, and benched. 10 runs at 32K context (couldn’t fit 128K + DFlash drafter on 24 GB — drafter takes ~3 GB extra footprint):
runs: 218, 229, 62, 223, 215, 103, 62, 224, 62, 221
AVG: 161.94 (misleading)
MIN: 61.66, MAX: 228.61
Range: 167
Same bimodal cycle. PR #42692 fixed a different DFlash + FP8 bug (correctness/precision likely), not the adaptive spec throttling that causes our distribution. The cycle is an upstream vLLM scheduler issue — when DFlash detects “draft acceptance dropping” it falls back to no-spec target-only mode for a few requests, then re-tests. On a workload where DFlash always wins (Gemma 4 + cyankiwi AWQ), this oscillation is pure UX damage.
UX math: with DFlash, 40% of user requests get 60 t/s (worse than no-spec baseline 136). Without DFlash, every request gets 135. The consistency wins. The DFlash variant stays off until upstream vLLM addresses the throttling cycle specifically.
What shipped
vllmgemma426ba4bvisionone v1.0.1 on the orales-one-market catalog. Final config:
Image: vllm/vllm-openai:tokenspeed-preview-x86_64-ubuntu2404
Model: cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit
KV: fp8 (vLLM native)
Attn: triton_attn (Gemma 4 head_dim hetero forces this)
Spec: disabled (DFlash cycle bug, no MTP for multimodal)
ctx: 131072 (128K)
gpu_mem: 0.92
max_num_seqs: 4
prefix-caching: enabled
multimodal: native (vision + audio config bundled, audio path not yet wired)
Peer validated in production within hours of v1.0.1 landing: “I’ve already downloaded and briefly tested vllmgemma426ba4bvisionone — works like a charm 🙂 Will do more in-depth tests later… This one is really a perfect alrounder which many people will appreciate, I am sure.”
That’s the full feedback loop. Discord patch → app shipped → bench iteration → peer validates → community can install. Total elapsed time about 30 hours.
What’s next for v1.0.2
Two paths worth watching:
-
NVFP4 Gemma 4 weights. NVIDIA released Kimi K2.6 NVFP4 a few days ago, validating NVFP4 as production-grade for external models. If a Gemma-4-26B-A4B-NVFP4 lands (either NVIDIA themselves or community port), we’d save ~5 GB on model footprint, which could push max_model_len from 128K to 250K+ at the same gpu_memory_utilization. That’s the “very long PDF batch” path the peer mentioned wanting eventually.
-
vLLM upstream fix for the adaptive spec throttling cycle. When that lands, we revisit DFlash and target 200+ t/s steady-state with vision. Probably worth a v1.0.2 just for that.
For now the office workhorse path is shipped, peer-validated, and serving. Vision + 128K + >100 t/s + PagedAttention on a 24 GB consumer Blackwell mobile is achievable today.
Reproducibility
- Chart: vllmgemma426ba4bvisionone v1.0.1
- Model: cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit
- Add my market source
https://orales-one-market.aamsellem.workers.devin Olares Studio → Market → Settings → Add source. App appears within 5 minutes.
If you run this on a different sm_120 card, 32 GB cards have room to push max_model_len to 256K+ trivially. Let me know your numbers.