After running Gemma 4 26B-A4B with z-lab’s DFlash drafter at 214 t/s steady on the Olares One (RTX 5090 Laptop, 24GB, sm_120 Blackwell mobile), Tech-Practice published a sweep showing 228 → 600 t/s on the 32GB desktop 5090 with n_spec=15. Naturally I wondered: does n=15 also win on mobile?
It does not. n_spec=8 wins on the 5090M, with peak ~235 t/s and stable ~224 t/s — about +5% over my previous default of n=13. I also found something weirder: a 100% reproducible degradation cycle every 9-10 requests that I couldn’t fix from configuration alone.
TL;DR
- Optimal n_spec=8 on 5090M mobile: AVG 223.9 t/s, MAX 235.2 t/s (5 clean runs, Space Invaders 2000-token completion)
- n=15 (Tech-Practice rec) regresses to 201 t/s on mobile — the desktop recommendation doesn’t transfer
- 5-10× faster than community baseline: the dasroot.net “Gemma 4 Speed Hacks” post (May 9) reports 22-40 t/s for Gemma 4 26B-31B DFlash on RTX 5090 desktop with vLLM 0.7.1. We’re at 224 t/s on mobile with the right image + n_spec — that’s not just a marginal win, it’s an order-of-magnitude gap.
- Degradation cycle: every 9-10 requests, 4 runs drop from ~220 t/s to ~60 t/s (no spec decoding active), then recovers
- Cycle is not caused by prefix-caching, max-num-seqs, cudagraph, or chunked prefill — looks like an adaptive low-yield spec fallback (see hypothesis below)
- Effective long-session AVG ≈ 161 t/s (including degraded phases) vs 224 t/s peak
Shipped as vllmgemma4dflashone v1.0.4 in orales-one-market.
Hardware
- GPU: RTX 5090 Laptop, 24GB GDDR7, 896 GB/s, sm_120 Blackwell consumer mobile
- CPU: Intel Core Ultra 9 275HX, 24 cores
- RAM: 96GB DDR5-5600
- vs Tech-Practice’s reference: RTX 5090 desktop, 32GB GDDR7, 1.79 TB/s (2× mobile bandwidth)
Stack
- vLLM: vllm/vllm-openai:tokenspeed-preview-x86_64-ubuntu2404 (0.20.2rc1.dev67+g58c8a5eaa)
- Target: cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit (compressed-tensors WNA16-Marlin, ~17.5 GB)
- Drafter: z-lab/gemma-4-26B-A4B-it-DFlash (~860 MB safetensors)
- KV cache: fp8
- Attention backend: triton_attn (Gemma 4 multimodal tokens forbid flash_attn even on the drafter; partial multimodal-token full attention isn't supported)
- Flags: --max-num-seqs 4 --max-num-batched-tokens 8192 --gpu-memory-utilization 0.92 --enable-prefix-caching
Methodology
My standard Space Invaders HTML prompt, 2000-token completion, temp=0.6, top_p=0.95. 2 warmups + 3-10 measured runs depending on the sweep point. The bench harness runs inside the pod via kubectl exec to bypass the Envoy sidecar.
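For context, the harness is shaped roughly like the sketch below. The endpoint URL, model name, and prompt text are stand-ins rather than the exact script in the chart, and the measurement is end-to-end request time, which for a 2000-token completion is dominated by decode:

```python
# Minimal throughput probe against an OpenAI-compatible vLLM endpoint.
# URL, model name, and prompt are placeholders, not the shipped harness.
import time
import requests

URL = "http://localhost:8000/v1/chat/completions"   # hit from inside the pod via kubectl exec
PAYLOAD = {
    "model": "cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit",
    "messages": [{"role": "user",
                  "content": "Write a complete Space Invaders clone in a single HTML file."}],
    "max_tokens": 2000,
    "temperature": 0.6,
    "top_p": 0.95,
}

def one_run() -> float:
    """Tokens per second for one request (completion tokens / wall time)."""
    t0 = time.perf_counter()
    data = requests.post(URL, json=PAYLOAD, timeout=600).json()
    return data["usage"]["completion_tokens"] / (time.perf_counter() - t0)

if __name__ == "__main__":
    for _ in range(2):                      # warmups, discarded
        one_run()
    rates = [one_run() for _ in range(5)]   # measured runs
    print(f"AVG {sum(rates)/len(rates):.1f} t/s  MAX {max(rates):.1f} t/s")
```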
The sweep
| n_spec | Stable AVG (t/s) | MAX (t/s) | n_runs |
|---|---|---|---|
| 6 | 220.61 | 225.95 | 5 |
| 8 | 223.90 | 235.22 | 5 |
| 10 | ~215 | 222.07 | 3 |
| 11 | 212.25 | 217.66 | 5 |
| 12 | ~202 | 206.80 | 3 |
| 13 (was v1.0.0) | ~207 | 213.27 | 5 |
| 15 (Tech-Practice rec) | 201.39 | 211.27 | 3 |
| 17 | ~194 | 195.76 | 3 |
| 20 | 187.33 | 196.12 | 3 |
n=6: ████████████████████████████████ 220.61
n=8: █████████████████████████████████ 223.90 ← peak
n=10: ████████████████████████████████ ~215
n=11: ██████████████████████████████ 212.25
n=12: ██████████████████████████ ~202
n=13: ████████████████████████████ ~207
n=15: █████████████████████████ 201.39
n=17: ████████████████████ ~194
n=20: ██████████████ 187.33
The peak is unambiguously at n=8. Going deeper (n=15+) tanks throughput by ~15%; going shallower (n=6) gives up ~1.5%.
Why n=8 on mobile but n=15 on desktop
Two constraints push mobile toward shallower drafts:
- Memory bandwidth: 896 GB/s vs 1.79 TB/s. Each speculative token in the draft requires extra KV reads to verify in the target forward pass. Deeper drafts saturate the memory bus.
- VRAM budget: 24GB vs 32GB. Compute buffers for the speculative pipeline scale with n_spec. At n=15, we already saw OOM in cudagraph_capture when bumping max-num-batched-tokens to 32768.
The desktop has ~33% more VRAM and ~100% more bandwidth, both of which favor deeper drafts. On mobile, n_spec=8 is the Pareto-optimal point — deep enough that draft amortizes the verify cost, shallow enough that compute buffers fit comfortably.
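To put rough numbers on the diminishing returns, here is a back-of-envelope sketch. It assumes a constant per-token acceptance rate, which DFlash does not actually have (acceptance is position-dependent), so read the output as illustrative only; the 0.6 and 0.28 rates are the typical/observed figures discussed further down:

```python
def expected_accepted(n: int, a: float) -> float:
    """Expected tokens emitted per target verify pass for a length-n draft,
    assuming each draft token is accepted independently with probability a.
    Standard geometric-series result for chain speculative decoding, including
    the one bonus token the target always contributes."""
    return (1 - a ** (n + 1)) / (1 - a)

for n in (6, 8, 15):
    for a in (0.6, 0.28):
        print(f"n_spec={n:2d}  accept={a:.2f}  ->  {expected_accepted(n, a):.2f} tokens/verify")
```

Under this toy model, going from n=8 to n=15 adds only a few hundredths of a token per verify pass at either acceptance rate, while every extra draft position still costs drafter compute and extra KV traffic in the verify step. With half the memory bandwidth, that overhead bites sooner on mobile, which is consistent with the peak landing at a shallower draft.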
The weird part: a reproducible degradation cycle
When I extended the n=8 bench to 10 runs (instead of 3) to validate, I saw this:
run1: 221 t/s ← fast (DFlash active)
run2: 214 t/s
run3: 218 t/s
run4: 222 t/s
run5: 230 t/s
run6: 103 t/s ← transition
run7: 59 t/s ← DEGRADED (no spec decoding)
run8: 62 t/s
run9: 62 t/s
run10: 212 t/s ← recovered
Exactly 5 fast, 1 transition, 3 degraded, 1 recovery. Reproducible across multiple pod boots. The 60 t/s rate matches what I’d expect from Gemma 4 26B-A4B at vanilla decode WITHOUT speculative decoding — so DFlash is being temporarily disabled by something in vLLM’s pipeline.
I tried four workarounds:
- --enable-prefix-caching OFF → same cycle
- --max-num-seqs 1 → same cycle
- --enforce-eager (no cudagraph) → cycle delayed from run 6 to run 9, but eager caps throughput at ~130 t/s, worse overall
- --no-enable-chunked-prefill → boot fails (max_num_batched_tokens 8192 < max_model_len 16384 invariant violated without chunked prefill)
I also tried swapping the vLLM image to cu130-nightly-x86_64 (vLLM 0.19.2) and vllm/vllm-openai:gemma4 (vLLM 0.18.2) — both are older than the tokenspeed-preview image (vLLM 0.20.2) I’m using, so neither supports our DFlash + Gemma 4 + triton_attn combination at all (one errors with “non-causal attention not supported”, the other doesn’t have DFlash registered for gemma4_text).
So I can’t test the recent Gemma 4 DFlash hardening PRs (#41703, #42102, #40898) without building a custom vLLM image from main — out of scope for tonight.
Hypothesis: low-yield spec fallback
The cycle is too deterministic to be GPU thermal (the device sits at ~70 °C throughout). Initially I suspected cudagraph re-capture or KV defragmentation, but a closed llama.cpp PR I found while triaging — #22931 “adaptive low-yield MTP fallback” by leon7609 — gives a much better fit.
That PR documents the same pathological pattern on DeepSeek V4 in llama.cpp:
- When the speculative drafter’s acceptance rate drops below a threshold, the target model’s verify pass amplifies its work by 4.7× because every rejected draft token forces extra work on the target side.
- Sustained low acceptance produces a -79.9% throughput regression compared to the spec-accepted regime.
- That ratio matches my observation exactly: peak 220 t/s → degraded 60 t/s ≈ -73%.
My best current model:
- z-lab/gemma-4-26B-A4B-it-DFlash has noisy acceptance — confirmed earlier in my first Gemma 4 DFlash bench, where drafter pos-0 acceptance was only ~28% (much lower than the 60% typical for DFlash on its tuned targets).
- After a streak of ~5 lucky-accept requests, the drafter’s KV state diverges enough from the target that acceptance temporarily collapses for the next ~4 requests.
- While acceptance is collapsed, the 4.7× verify amplification produces the ~60 t/s rate we see.
- After enough rejected drafts, internal drafter state resets and acceptance recovers.
If this hypothesis is correct, the fix is not in vLLM — it’s in the drafter alignment. Either z-lab ships a better-aligned DFlash drafter for Gemma 4 (one finetuned more aggressively on the target’s output distribution), or we accept the cycle as a property of “off-the-shelf DFlash on Gemma 4” until a Gemma 4-specific drafter trained jointly with the target lands.
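For what it's worth, a toy version of the mechanism I'm describing might look like the sketch below. This is not the code from llama.cpp PR #22931 and not anything vLLM does verbatim; the window, threshold, and cooldown values are invented purely to show how a rolling-acceptance fallback produces a fast/slow/recover cycle:

```python
from collections import deque

class SpecFallback:
    """Toy rolling-acceptance fallback: stop drafting for a few requests once
    recent acceptance has been persistently low, then try again."""

    def __init__(self, threshold: float = 0.3, window: int = 5, cooldown: int = 4):
        self.history = deque(maxlen=window)   # per-request acceptance rates
        self.threshold = threshold
        self.cooldown = cooldown
        self.disabled_for = 0                 # requests left to serve without drafting

    def record(self, acceptance_rate: float) -> None:
        self.history.append(acceptance_rate)

    def use_speculation(self) -> bool:
        if self.disabled_for > 0:
            self.disabled_for -= 1            # still in the fallback phase
            return False
        full = len(self.history) == self.history.maxlen
        if full and sum(self.history) / len(self.history) < self.threshold:
            self.disabled_for = self.cooldown  # yield too low: switch drafting off
            self.history.clear()
            return False
        return True
```

A mechanism of this shape, whether it lives in the serving stack or emerges from drafter state, would produce exactly the pattern observed: a streak of fast drafted requests, a block of slow undrafted ones, then a reset and recovery.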
I’d love a second reproducer from anyone running the same stack — if you can confirm the 5-fast/4-slow pattern with z-lab/gemma-4-26B-A4B-it-DFlash, ping me and I’ll open a clean upstream issue with two independent reproductions.
Real-world impact: my long-session AVG over 10 requests is ~161 t/s instead of the 224 t/s peak. Mid-session, you’ll occasionally see a request take 30 seconds instead of 9 — disruptive for any agentic workflow that streams tokens.
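The quick arithmetic, assuming the long-session figure is the simple mean of the ten per-run rates listed above (which are rounded, hence ~160 rather than exactly 161):

```python
# Long-session average and per-request latency check for a 2000-token completion.
rates = [221, 214, 218, 222, 230, 103, 59, 62, 62, 212]
print(sum(rates) / len(rates))   # ~160 t/s long-session average
print(2000 / 220, 2000 / 60)     # ~9 s for a fast request, ~33 s degraded
```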
What ships in vllmgemma4dflashone v1.0.4
Olares Market chart, single config diff vs v1.0.0:
SPEC_CONFIG: '{"method":"dflash","model":"z-lab/gemma-4-26B-A4B-it-DFlash","num_speculative_tokens":8}'
vLLM was already auto-selecting --attention-backend triton_attn; I just made it explicit. Everything else (KV fp8, max-num-seqs 4, prefix-caching on) stayed the same.
Bonus: Gemma 4 Audio One
While the bench cycles were running, I also shipped a new app: llamacppgemma4audione v1.0.0. It uses llama.cpp PR #21421 (USM Conformer encoder for Gemma 4 audio input, merged April 12 — already in b9101) plus unsloth/gemma-4-E4B-it-GGUF Q4_K_M + the BF16 mmproj (F16 and Q8_0 mmproj produce repetitive output — only BF16 works due to numerical sensitivity in ClippableLinear layers).
~6 GB VRAM, audio max 30 seconds per input, OpenAI-compatible chat completions with input_audio content type. ASR, audio understanding, voice agent input layer. Light enough to run alongside another GPU app.
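A minimal client call might look like the following, assuming the app exposes the usual llama.cpp OpenAI-compatible endpoint and accepts base64 WAV via the input_audio content part; the port, path, and prompt are placeholders:

```python
# Sketch of an input_audio chat completion against the Gemma 4 Audio One endpoint.
# Endpoint URL and file path are assumptions; the content-part shape follows the
# OpenAI input_audio convention the server is described as accepting.
import base64
import requests

with open("clip.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe this clip, then summarize it in one sentence."},
                {"type": "input_audio", "input_audio": {"data": audio_b64, "format": "wav"}},
            ],
        }],
        "max_tokens": 256,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```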
Reproducible
Full Helm chart, exact image tag, all flags, bench harness — all in orales-one-market. Per the homepage promise: every number here comes with the exact stack to reproduce it.
If you’re running Gemma 4 + DFlash on consumer Blackwell (5090M, 5070 Ti, 5080) and see the same cycle, ping me — I’ll file the vLLM upstream issue once I have a second independent reproducer.