After running Gemma 4 26B-A4B with z-lab’s DFlash drafter at 214 t/s steady on the Olares One (RTX 5090 Laptop, 24GB, sm_120 Blackwell mobile), Tech-Practice published a sweep showing 228 → 600 t/s on the 32GB desktop 5090 with n_spec=15. Naturally I wondered: does n=15 also win on mobile?
It does not. n_spec=8 wins on the 5090M, with peak ~235 t/s and stable ~224 t/s — about +5% over my previous default of n=13. I also found something weirder: a 100% reproducible degradation cycle every 9-10 requests that I couldn’t fix from configuration alone.
TL;DR
- Optimal n_spec=8 on 5090M mobile: AVG 223.9 t/s, MAX 235.2 t/s (5 clean runs, Space Invaders 2000-token completion)
- n=15 (Tech-Practice rec) regresses to 201 t/s on mobile — the desktop recommendation doesn’t transfer
- 5-10× faster than community baseline: the dasroot.net “Gemma 4 Speed Hacks” post (May 9) reports 22-40 t/s for Gemma 4 26B-31B DFlash on RTX 5090 desktop with vLLM 0.7.1. We’re at 224 t/s on mobile with the right image + n_spec — that’s not just a marginal win, it’s an order-of-magnitude gap.
- Degradation cycle: every 9-10 requests, 4 runs drop from ~220 t/s to ~60 t/s (no spec decoding active), then recovers
- Cycle is not caused by prefix-caching, max-num-seqs, cudagraph, or chunked prefill — looks like an adaptive low-yield spec fallback (see hypothesis below)
- Effective long-session AVG ≈ 161 t/s (including degraded phases) vs 224 t/s peak
Shipped as vllmgemma4dflashone v1.0.4 in orales-one-market.
Hardware
- GPU: RTX 5090 Laptop, 24GB GDDR7, 896 GB/s, sm_120 Blackwell consumer mobile
- CPU: Intel Core Ultra 9 275HX, 24 cores
- RAM: 96GB DDR5-5600
- vs Tech-Practice’s reference: RTX 5090 desktop, 32GB GDDR7, 1.79 TB/s (2× mobile bandwidth)
Stack
- vLLM: vllm/vllm-openai:tokenspeed-preview-x86_64-ubuntu2404 (0.20.2rc1.dev67+g58c8a5eaa)
- Target: cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit (compressed-tensors WNA16-Marlin, ~17.5 GB)
- Drafter: z-lab/gemma-4-26B-A4B-it-DFlash (~860 MB safetensors)
- KV cache: fp8
- Attention backend: triton_attn (Gemma 4 multimodal tokens forbid flash_attn even on the drafter; partial multimodal-token full attention isn't supported)
- Flags: --max-num-seqs 4 --max-num-batched-tokens 8192 --gpu-memory-utilization 0.92 --enable-prefix-caching
Methodology
My standard Space Invaders HTML prompt, 2000-token completion, temp=0.6, top_p=0.95. 2 warmups + 3-10 measured runs depending on the sweep point. The bench harness runs inside the pod via kubectl exec to bypass the Envoy sidecar.
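For context, the harness is shaped roughly like the sketch below. The endpoint URL, model name, and prompt text are stand-ins rather than the exact script in the chart, and the measurement is end-to-end request time, which for a 2000-token completion is dominated by decode:

```python
# Minimal throughput probe against an OpenAI-compatible vLLM endpoint.
# URL, model name, and prompt are placeholders, not the shipped harness.
import time
import requests

URL = "http://localhost:8000/v1/chat/completions"   # hit from inside the pod via kubectl exec
PAYLOAD = {
    "model": "cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit",
    "messages": [{"role": "user",
                  "content": "Write a complete Space Invaders clone in a single HTML file."}],
    "max_tokens": 2000,
    "temperature": 0.6,
    "top_p": 0.95,
}

def one_run() -> float:
    """Tokens per second for one request (completion tokens / wall time)."""
    t0 = time.perf_counter()
    data = requests.post(URL, json=PAYLOAD, timeout=600).json()
    return data["usage"]["completion_tokens"] / (time.perf_counter() - t0)

if __name__ == "__main__":
    for _ in range(2):                      # warmups, discarded
        one_run()
    rates = [one_run() for _ in range(5)]   # measured runs
    print(f"AVG {sum(rates)/len(rates):.1f} t/s  MAX {max(rates):.1f} t/s")
```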
The sweep
| n_spec | Stable AVG (t/s) | MAX (t/s) | n_runs |
|---|---|---|---|
| 6 | 220.61 | 225.95 | 5 |
| 8 | 223.90 | 235.22 | 5 |
| 10 | ~215 | 222.07 | 3 |
| 11 | 212.25 | 217.66 | 5 |
| 12 | ~202 | 206.80 | 3 |
| 13 (was v1.0.0) | ~207 | 213.27 | 5 |
| 15 (Tech-Practice rec) | 201.39 | 211.27 | 3 |
| 17 | ~194 | 195.76 | 3 |
| 20 | 187.33 | 196.12 | 3 |
n=6: ████████████████████████████████ 220.61
n=8: █████████████████████████████████ 223.90 ← peak
n=10: ████████████████████████████████ ~215
n=11: ██████████████████████████████ 212.25
n=12: ██████████████████████████ ~202
n=13: ████████████████████████████ ~207
n=15: █████████████████████████ 201.39
n=17: ████████████████████ ~194
n=20: ██████████████ 187.33
The peak is unambiguously at n=8. Going deeper (n=15+) tanks throughput by ~15%; going shallower (n=6) gives up ~1.5%.
Why n=8 on mobile but n=15 on desktop
Two constraints push mobile toward shallower drafts:
- Memory bandwidth: 896 GB/s vs 1.79 TB/s. Each speculative token in the draft requires extra KV reads to verify in the target forward pass. Deeper drafts saturate the memory bus.
- VRAM budget: 24GB vs 32GB. Compute buffers for the speculative pipeline scale with n_spec. At n=15, we already saw OOM in cudagraph_capture when bumping max-num-batched-tokens to 32768.
The desktop has ~33% more VRAM and ~100% more bandwidth, both of which favor deeper drafts. On mobile, n_spec=8 is the Pareto-optimal point — deep enough that draft amortizes the verify cost, shallow enough that compute buffers fit comfortably.
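To put rough numbers on the diminishing returns, here is a back-of-envelope sketch. It assumes a constant per-token acceptance rate, which DFlash does not actually have (acceptance is position-dependent), so read the output as illustrative only; the 0.6 and 0.28 rates are the typical/observed figures discussed further down:

```python
def expected_accepted(n: int, a: float) -> float:
    """Expected tokens emitted per target verify pass for a length-n draft,
    assuming each draft token is accepted independently with probability a.
    Standard geometric-series result for chain speculative decoding, including
    the one bonus token the target always contributes."""
    return (1 - a ** (n + 1)) / (1 - a)

for n in (6, 8, 15):
    for a in (0.6, 0.28):
        print(f"n_spec={n:2d}  accept={a:.2f}  ->  {expected_accepted(n, a):.2f} tokens/verify")
```

Under this toy model, going from n=8 to n=15 adds only a few hundredths of a token per verify pass at either acceptance rate, while every extra draft position still costs drafter compute and extra KV traffic in the verify step. With half the memory bandwidth, that overhead bites sooner on mobile, which is consistent with the peak landing at a shallower draft.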
The weird part: a reproducible degradation cycle
When I extended the n=8 bench to 10 runs (instead of 3) to validate, I saw this:
run1: 221 t/s ← fast (DFlash active)
run2: 214 t/s
run3: 218 t/s
run4: 222 t/s
run5: 230 t/s
run6: 103 t/s ← transition
run7: 59 t/s ← DEGRADED (no spec decoding)
run8: 62 t/s
run9: 62 t/s
run10: 212 t/s ← recovered
Exactly 5 fast, 1 transition, 3 degraded, 1 recovery. Reproducible across multiple pod boots. The 60 t/s rate matches what I’d expect from Gemma 4 26B-A4B at vanilla decode WITHOUT speculative decoding — so DFlash is being temporarily disabled by something in vLLM’s pipeline.
I tried four workarounds:
- --enable-prefix-caching OFF → same cycle
- --max-num-seqs 1 → same cycle
- --enforce-eager (no cudagraph) → cycle delayed from run 6 to run 9, but eager caps throughput at ~130 t/s, worse overall
- --no-enable-chunked-prefill → boot fails (max_num_batched_tokens 8192 < max_model_len 16384 invariant violated without chunked prefill)
I also tried swapping the vLLM image to cu130-nightly-x86_64 (vLLM 0.19.2) and vllm/vllm-openai:gemma4 (vLLM 0.18.2) — both are older than the tokenspeed-preview image (vLLM 0.20.2) I’m using, so neither supports our DFlash + Gemma 4 + triton_attn combination at all (one errors with “non-causal attention not supported”, the other doesn’t have DFlash registered for gemma4_text).
So I can’t test the recent Gemma 4 DFlash hardening PRs (#41703, #42102, #40898) without building a custom vLLM image from main — out of scope for tonight.
Hypothesis: low-yield spec fallback
The cycle is too deterministic to be GPU thermal (the device sits at ~70 °C throughout). Initially I suspected cudagraph re-capture or KV defragmentation, but a closed llama.cpp PR I found while triaging — #22931 “adaptive low-yield MTP fallback” by leon7609 — gives a much better fit.
That PR documents the same pathological pattern on DeepSeek V4 in llama.cpp:
- When the speculative drafter’s acceptance rate drops below a threshold, the target model’s verify pass amplifies its work by 4.7× because every rejected draft token forces extra work on the target side.
- Sustained low acceptance produces a -79.9% throughput regression compared to the spec-accepted regime.
- That ratio matches my observation exactly: peak 220 t/s → degraded 60 t/s ≈ -73%.
My best current model:
- z-lab/gemma-4-26B-A4B-it-DFlash has noisy acceptance — confirmed earlier in my first Gemma 4 DFlash bench, where drafter pos-0 acceptance was only ~28% (much lower than the 60% typical for DFlash on its tuned targets).
- After a streak of ~5 lucky-accept requests, the drafter’s KV state diverges enough from the target that acceptance temporarily collapses for the next ~4 requests.
- While acceptance is collapsed, the 4.7× verify amplification produces the ~60 t/s rate we see.
- After enough rejected drafts, internal drafter state resets and acceptance recovers.
If this hypothesis is correct, the fix is not in vLLM — it’s in the drafter alignment. Either z-lab ships a better-aligned DFlash drafter for Gemma 4 (one finetuned more aggressively on the target’s output distribution), or we accept the cycle as a property of “off-the-shelf DFlash on Gemma 4” until a Gemma 4-specific drafter trained jointly with the target lands.
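For what it's worth, a toy version of the mechanism I'm describing might look like the sketch below. This is not the code from llama.cpp PR #22931 and not anything vLLM does verbatim; the window, threshold, and cooldown values are invented purely to show how a rolling-acceptance fallback produces a fast/slow/recover cycle:

```python
from collections import deque

class SpecFallback:
    """Toy rolling-acceptance fallback: stop drafting for a few requests once
    recent acceptance has been persistently low, then try again."""

    def __init__(self, threshold: float = 0.3, window: int = 5, cooldown: int = 4):
        self.history = deque(maxlen=window)   # per-request acceptance rates
        self.threshold = threshold
        self.cooldown = cooldown
        self.disabled_for = 0                 # requests left to serve without drafting

    def record(self, acceptance_rate: float) -> None:
        self.history.append(acceptance_rate)

    def use_speculation(self) -> bool:
        if self.disabled_for > 0:
            self.disabled_for -= 1            # still in the fallback phase
            return False
        full = len(self.history) == self.history.maxlen
        if full and sum(self.history) / len(self.history) < self.threshold:
            self.disabled_for = self.cooldown  # yield too low: switch drafting off
            self.history.clear()
            return False
        return True
```

A mechanism of this shape, whether it lives in the serving stack or emerges from drafter state, would produce exactly the pattern observed: a streak of fast drafted requests, a block of slow undrafted ones, then a reset and recovery.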
I’d love a second reproducer from anyone running the same stack — if you can confirm the 5-fast/4-slow pattern with z-lab/gemma-4-26B-A4B-it-DFlash, ping me and I’ll open a clean upstream issue with two independent reproductions.
Real-world impact: my long-session AVG over 10 requests is ~161 t/s instead of the 224 t/s peak. Mid-session, you’ll occasionally see a request take 30 seconds instead of 9 — disruptive for any agentic workflow that streams tokens.
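The quick arithmetic, assuming the long-session figure is the simple mean of the ten per-run rates listed above (which are rounded, hence ~160 rather than exactly 161):

```python
# Long-session average and per-request latency check for a 2000-token completion.
rates = [221, 214, 218, 222, 230, 103, 59, 62, 62, 212]
print(sum(rates) / len(rates))   # ~160 t/s long-session average
print(2000 / 220, 2000 / 60)     # ~9 s for a fast request, ~33 s degraded
```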
What ships in vllmgemma4dflashone v1.0.4
Olares Market chart, single config diff vs v1.0.0:
SPEC_CONFIG: '{"method":"dflash","model":"z-lab/gemma-4-26B-A4B-it-DFlash","num_speculative_tokens":8}'
vLLM was already auto-selecting --attention-backend triton_attn; I just made it explicit. Everything else (KV fp8, max-num-seqs 4, prefix-caching on) stayed the same.
Bonus: Gemma 4 Audio One
While the bench cycles were running, I also shipped a new app: llamacppgemma4audione v1.0.0. It uses llama.cpp PR #21421 (USM Conformer encoder for Gemma 4 audio input, merged April 12 — already in b9101) plus unsloth/gemma-4-E4B-it-GGUF Q4_K_M + the BF16 mmproj (F16 and Q8_0 mmproj produce repetitive output — only BF16 works due to numerical sensitivity in ClippableLinear layers).
~6 GB VRAM, audio max 30 seconds per input, OpenAI-compatible chat completions with input_audio content type. ASR, audio understanding, voice agent input layer. Light enough to run alongside another GPU app.
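A minimal client call might look like the following, assuming the app exposes the usual llama.cpp OpenAI-compatible endpoint and accepts base64 WAV via the input_audio content part; the port, path, and prompt are placeholders:

```python
# Sketch of an input_audio chat completion against the Gemma 4 Audio One endpoint.
# Endpoint URL and file path are assumptions; the content-part shape follows the
# OpenAI input_audio convention the server is described as accepting.
import base64
import requests

with open("clip.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe this clip, then summarize it in one sentence."},
                {"type": "input_audio", "input_audio": {"data": audio_b64, "format": "wav"}},
            ],
        }],
        "max_tokens": 256,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```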
Reproducible
Full Helm chart, exact image tag, all flags, bench harness — all in orales-one-market. Per the homepage promise: every number here comes with the exact stack to reproduce it.
If you’re running Gemma 4 + DFlash on consumer Blackwell (5090M, 5070 Ti, 5080) and see the same cycle, ping me — I’ll file the vLLM upstream issue once I have a second independent reproducer.