Aurélien AMSELLEM

Gemma 4 26B-A4B + DFlash on 24GB Blackwell mobile — n_spec=8 optimal, +5% over default, plus a weird degradation cycle

Full num_speculative_tokens sweep for Gemma 4 26B-A4B + z-lab DFlash drafter on RTX 5090M Laptop (24GB sm_120). Optimal is n_spec=8 (not n=15 like desktop). I also found a 100% reproducible vLLM degradation cycle that I couldn't fix from config alone.

After running Gemma 4 26B-A4B with z-lab’s DFlash drafter at 214 t/s steady on the Olares One (RTX 5090 Laptop, 24GB, sm_120 Blackwell mobile), Tech-Practice published a sweep showing 228 → 600 t/s on the 32GB desktop 5090 with n_spec=15. Naturally I wondered: does n=15 also win on mobile?

It does not. n_spec=8 wins on the 5090M, with peak ~235 t/s and stable ~224 t/s — about +5% over my previous default of n=13. I also found something weirder: a 100% reproducible degradation cycle every 9-10 requests that I couldn’t fix from configuration alone.

TL;DR

  - n_spec=8 is optimal on the 24GB 5090M: ~224 t/s stable, ~235 t/s peak, about +5% over the previous n=13 default.
  - n=15 (the desktop recommendation) is ~10% slower than n=8 here.
  - There is a 100% reproducible degradation cycle every 9-10 requests (5 fast, 1 transition, 3 degraded at ~60 t/s, 1 recovery) that I couldn't fix from config alone.
  - Shipped as vllmgemma4dflashone v1.0.4 in olares-one-market.

Hardware

Olares One: RTX 5090 Laptop (mobile Blackwell, sm_120), 24 GB VRAM, ~896 GB/s memory bandwidth.

Stack

vLLM 0.20.2 (tokenspeed-preview image), target Gemma 4 26B-A4B, drafter z-lab/gemma-4-26B-A4B-it-DFlash, triton_attn attention backend, KV cache fp8, max-num-seqs 4, prefix caching on. Deployed as the vllmgemma4dflashone chart from olares-one-market.

Methodology

My standard Space Invaders HTML prompt, 2000 generated tokens, temp=0.6, top_p=0.95. Each sweep point gets 2 warmup runs plus 3-10 measured runs. The bench harness runs inside the pod via kubectl exec to bypass the Envoy sidecar.
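For reference, the per-run measurement boils down to the loop below. This is a minimal sketch, not the exact harness in olares-one-market: it assumes the vLLM OpenAI-compatible endpoint is reachable on localhost:8000 from inside the pod, and the served-model name and prompt text are placeholders.

```python
# Minimal sketch of the per-run measurement, not the exact harness in
# olares-one-market. Assumes the vLLM OpenAI-compatible endpoint is reachable
# on localhost:8000 from inside the pod (kubectl exec, no Envoy in the path).
import time
import requests

URL = "http://localhost:8000/v1/chat/completions"
PROMPT = "Write a complete Space Invaders game as a single HTML file."  # stand-in

def one_run(max_tokens: int = 2000) -> float:
    t0 = time.time()
    resp = requests.post(URL, json={
        "model": "gemma-4-26b-a4b-it",   # placeholder served-model name
        "messages": [{"role": "user", "content": PROMPT}],
        "max_tokens": max_tokens,
        "temperature": 0.6,
        "top_p": 0.95,
    }, timeout=600)
    elapsed = time.time() - t0
    completion_tokens = resp.json()["usage"]["completion_tokens"]
    return completion_tokens / elapsed   # lumps prefill in; negligible for a short prompt

for _ in range(2):                       # warmups, discarded
    one_run()
rates = [one_run() for _ in range(5)]    # 3-10 measured runs per sweep point
print(f"stable avg {sum(rates) / len(rates):.2f} t/s, max {max(rates):.2f} t/s")
```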

The sweep

| n_spec | Stable AVG (t/s) | MAX (t/s) | n_runs |
|---|---|---|---|
| 6 | 220.61 | 225.95 | 5 |
| 8 | 223.90 | 235.22 | 5 |
| 10 | ~215 | 222.07 | 3 |
| 11 | 212.25 | 217.66 | 5 |
| 12 | ~202 | 206.80 | 3 |
| 13 (was v1.0.0) | ~207 | 213.27 | 5 |
| 15 (Tech-Practice rec) | 201.39 | 211.27 | 3 |
| 17 | ~194 | 195.76 | 3 |
| 20 | 187.33 | 196.12 | 3 |
n=6:   ████████████████████████████████  220.61
n=8:   █████████████████████████████████ 223.90  ← peak
n=10:  ████████████████████████████████  ~215
n=11:  ██████████████████████████████    212.25
n=12:  ██████████████████████████        ~202
n=13:  ████████████████████████████      ~207
n=15:  █████████████████████████         201.39
n=17:  ████████████████████              ~194
n=20:  ██████████████                    187.33

The peak is unambiguously at n=8. Going deeper costs roughly 10% at n=15 and 16% at n=20; going shallower (n=6) gives up only ~1.5%.

Why n=8 on mobile but n=15 on desktop

Two constraints push mobile toward shallower drafts:

  1. Memory bandwidth: 896 GB/s vs 1.79 TB/s. Each speculative token in the draft requires extra KV reads to verify in the target forward pass. Deeper drafts saturate the memory bus.
  2. VRAM budget: 24GB vs 32GB. Compute buffers for the speculative pipeline scale with n_spec. At n=15, we already saw OOM in cudagraph_capture when bumping max-num-batched-tokens to 32768.

The desktop has ~33% more VRAM and ~100% more bandwidth, both of which favor deeper drafts. On mobile, n_spec=8 is the Pareto-optimal point — deep enough that draft amortizes the verify cost, shallow enough that compute buffers fit comfortably.
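A back-of-envelope way to see the diminishing returns, under the (oversimplified) assumption of a flat per-token acceptance probability alpha: the expected number of tokens the target commits per verify pass is (1 - alpha^(n+1)) / (1 - alpha), which is essentially flat beyond n≈8, while every extra draft token still adds KV traffic to the verify pass. The alpha=0.6 below is just the "typical DFlash" figure for illustration, not a fit to the table.

```python
# Expected tokens committed per target verify pass, assuming a flat per-token
# acceptance probability alpha (a big simplification; real acceptance is
# position-dependent and noisy for this drafter).
def expected_tokens_per_verify(n_spec: int, alpha: float) -> float:
    # Accepted draft prefix plus the target's own correction/bonus token:
    # E = sum_{k=0..n_spec} alpha^k = (1 - alpha^(n_spec + 1)) / (1 - alpha)
    return (1 - alpha ** (n_spec + 1)) / (1 - alpha)

for n in (6, 8, 10, 13, 15, 20):
    print(f"n={n:2d}: {expected_tokens_per_verify(n, alpha=0.6):.3f} tokens per pass")

# The yield climbs from ~2.43 (n=6) to ~2.47 (n=8) and then barely moves,
# while the per-pass verify cost keeps growing with n -- on a 896 GB/s bus
# that extra KV traffic is what erases the gain.
```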

The weird part: a reproducible degradation cycle

When I extended the n=8 bench to 10 runs (instead of 3) to validate, I saw this:

run1:  221 t/s ← fast (DFlash active)
run2:  214 t/s
run3:  218 t/s
run4:  222 t/s
run5:  230 t/s
run6:  103 t/s ← transition
run7:   59 t/s ← DEGRADED (no spec decoding)
run8:   62 t/s
run9:   62 t/s
run10: 212 t/s ← recovered

Exactly 5 fast, 1 transition, 3 degraded, 1 recovery. Reproducible across multiple pod boots. The 60 t/s rate matches what I’d expect from Gemma 4 26B-A4B at vanilla decode WITHOUT speculative decoding — so DFlash is being temporarily disabled by something in vLLM’s pipeline.
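If you want to catch the degraded phase in bench logs without eyeballing a trace, a trivial filter does it: flag any run that falls well below the median of the fast runs seen so far. The 0.75 threshold is arbitrary and this is not part of the shipped harness, just a convenience.

```python
# Flag degraded runs against a running median of the healthy ones.
import statistics

rates = [221, 214, 218, 222, 230, 103, 59, 62, 62, 212]  # the 10-run trace above

fast = []
for i, rate in enumerate(rates, start=1):
    baseline = statistics.median(fast) if fast else rate
    degraded = rate < 0.75 * baseline        # arbitrary threshold
    print(f"run{i}: {rate:3d} t/s{'  <- DEGRADED' if degraded else ''}")
    if not degraded:
        fast.append(rate)
```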

I tried four config-level workarounds; none of them broke the cycle.

I also tried swapping the vLLM image to cu130-nightly-x86_64 (vLLM 0.19.2) and vllm/vllm-openai:gemma4 (vLLM 0.18.2) — both are older than the tokenspeed-preview image (vLLM 0.20.2) I’m using, so neither supports our DFlash + Gemma 4 + triton_attn combination at all (one errors with “non-causal attention not supported”, the other doesn’t have DFlash registered for gemma4_text).

So I can’t test the recent Gemma 4 DFlash hardening PRs (#41703, #42102, #40898) without building a custom vLLM image from main — out of scope for tonight.

Hypothesis: low-yield spec fallback

The cycle is too deterministic to be GPU thermal (the device sits at ~70 °C throughout). Initially I suspected cudagraph re-capture or KV defragmentation, but a closed llama.cpp PR I found while triaging — #22931 “adaptive low-yield MTP fallback” by leon7609 — gives a much better fit.

That PR documents the same pathological pattern on DeepSeek V4 in llama.cpp: periods where draft yield collapses and throughput falls back to roughly vanilla-decode speed.

My best current model:

  1. z-lab/gemma-4-26B-A4B-it-DFlash has noisy acceptance — confirmed earlier in my first Gemma 4 DFlash bench, where drafter pos-0 acceptance was only ~28% (much lower than the 60% typical for DFlash on its tuned targets).
  2. After a streak of ~5 lucky-accept requests, the drafter’s KV state diverges enough from the target that acceptance temporarily collapses for the next ~4 requests.
  3. While acceptance is collapsed, the 4.7× verify amplification produces the ~60 t/s rate we see.
  4. After enough rejected drafts, internal drafter state resets and acceptance recovers.

If this hypothesis is correct, the fix is not in vLLM — it’s in the drafter alignment. Either z-lab ships a better-aligned DFlash drafter for Gemma 4 (one finetuned more aggressively on the target’s output distribution), or we accept the cycle as a property of “off-the-shelf DFlash on Gemma 4” until a Gemma 4-specific drafter trained jointly with the target lands.
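For concreteness, here is roughly what an llama.cpp-style mitigation (the "adaptive low-yield fallback" idea from #22931) would look like if something similar ever landed engine-side: track the accepted/drafted ratio over a sliding window and drop to plain decode for a cooldown when yield collapses, instead of paying verify cost on drafts that are almost all rejected. Every name and threshold below is made up, and vLLM exposes no hook for this today; it is an illustration of the mechanism, not a patch.

```python
# Illustrative only: adaptive low-yield fallback for a speculative decoder.
class LowYieldFallback:
    def __init__(self, window: int = 32, min_yield: float = 0.3, cooldown: int = 64):
        self.window = window          # verify steps to average over
        self.min_yield = min_yield    # accepted/drafted ratio below which we bail
        self.cooldown = cooldown      # verify steps to spend in plain decode
        self.recent = []
        self.disabled_for = 0

    def record(self, accepted: int, drafted: int) -> None:
        self.recent.append(accepted / max(drafted, 1))
        self.recent = self.recent[-self.window:]

    def n_spec(self, default: int = 8) -> int:
        if self.disabled_for > 0:                 # cooling down: plain decode
            self.disabled_for -= 1
            return 0
        if len(self.recent) == self.window:
            if sum(self.recent) / self.window < self.min_yield:
                self.disabled_for = self.cooldown
                self.recent.clear()
                return 0
        return default
```

Note that on this box such a fallback would only cap the downside at roughly vanilla-decode speed (~60 t/s); it would not remove the dips, since the degraded phase already sits at that rate.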

I’d love a second reproducer from anyone running the same stack — if you can confirm the 5-fast/4-slow pattern with z-lab/gemma-4-26B-A4B-it-DFlash, ping me and I’ll open a clean upstream issue with two independent reproductions.

Real-world impact: my long-session AVG over 10 requests is ~161 t/s instead of the 224 t/s peak. Mid-session, you’ll occasionally see a request take 30 seconds instead of 9 — disruptive for any agentic workflow that streams tokens.
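Quick sanity check on those numbers, assuming every request generates ~2000 tokens like the bench prompt: the ~161 t/s figure is the plain mean of the per-request rates; weighted by wall-clock time the session throughput is lower still, because the ~60 t/s requests dominate elapsed time.

```python
# Session-level impact of the cycle, using the 10-run trace above and
# assuming ~2000 generated tokens per request.
rates = [221, 214, 218, 222, 230, 103, 59, 62, 62, 212]   # t/s per request
tokens = 2000

simple_avg = sum(rates) / len(rates)                 # mean of per-run rates (~160 t/s)
wall_clock = sum(tokens / r for r in rates)          # total seconds for the session
session_rate = tokens * len(rates) / wall_clock      # token-weighted throughput

print(f"mean of per-run rates: {simple_avg:.0f} t/s")
print(f"whole-session rate:    {session_rate:.0f} t/s")
print(f"slowest reply:         {tokens / min(rates):.0f} s  (vs {tokens / max(rates):.0f} s at peak)")
```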

What ships in vllmgemma4dflashone v1.0.4

Olares Market chart, single config diff vs v1.0.0:

SPEC_CONFIG: '{"method":"dflash","model":"z-lab/gemma-4-26B-A4B-it-DFlash","num_speculative_tokens":8}'

The triton_attn attention backend was already being auto-selected by vLLM; --attention-backend triton_attn just makes it explicit. Everything else (KV fp8, max-num-seqs 4, prefix caching on) stayed the same.
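If you would rather poke at the same configuration from Python instead of through the chart, the offline-API equivalent looks roughly like this. The target model id is a placeholder and the "dflash" method only exists in the tokenspeed-preview vLLM build, so this will not run on a stock release:

```python
# Rough offline-API equivalent of the v1.0.4 serving config (sketch only).
from vllm import LLM, SamplingParams

llm = LLM(
    model="<gemma-4-26B-A4B-it target id>",      # placeholder
    speculative_config={
        "method": "dflash",
        "model": "z-lab/gemma-4-26B-A4B-it-DFlash",
        "num_speculative_tokens": 8,             # the v1.0.4 change: 13 -> 8
    },
    kv_cache_dtype="fp8",
    max_num_seqs=4,
    enable_prefix_caching=True,
    # triton_attn is auto-selected on sm_120; the chart just makes it explicit.
)

params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=2000)
out = llm.generate(["Write Space Invaders as a single HTML file."], params)
print(out[0].outputs[0].text[:200])
```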

Bonus: Gemma 4 Audio One

While the bench cycles were running, I also shipped a new app: llamacppgemma4audione v1.0.0. It uses llama.cpp PR #21421 (USM Conformer encoder for Gemma 4 audio input, merged April 12 — already in b9101) plus unsloth/gemma-4-E4B-it-GGUF Q4_K_M + the BF16 mmproj (F16 and Q8_0 mmproj produce repetitive output — only BF16 works due to numerical sensitivity in ClippableLinear layers).

~6 GB VRAM, audio capped at 30 seconds per input, OpenAI-compatible chat completions with the input_audio content type. It covers ASR, audio understanding, and the input layer for a voice agent, and is light enough to run alongside another GPU app.
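A minimal request against it looks like the sketch below. The port and served-model name are placeholders for whatever the Olares app exposes; the payload shape follows the standard OpenAI input_audio convention, and the app caps audio at 30 seconds per input.

```python
# Send a short WAV clip to the OpenAI-compatible endpoint (sketch; port and
# model name are placeholders).
import base64
import requests

with open("question.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

resp = requests.post("http://localhost:8080/v1/chat/completions", json={
    "model": "gemma-4-E4B-it",                   # placeholder served-model name
    "messages": [{
        "role": "user",
        "content": [
            {"type": "input_audio",
             "input_audio": {"data": audio_b64, "format": "wav"}},
            {"type": "text",
             "text": "Transcribe this clip, then summarize it in one sentence."},
        ],
    }],
})
print(resp.json()["choices"][0]["message"]["content"])
```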

Reproducible

Full Helm chart, exact image tag, all flags, bench harness — all in olares-one-market. Per the homepage promise: every number here comes with the exact stack to reproduce it.

If you’re running Gemma 4 + DFlash on consumer Blackwell (5090M, 5070 Ti, 5080) and see the same cycle, ping me — I’ll file the vLLM upstream issue once I have a second independent reproducer.
