Aurélien AMSELLEM

A week of benches on the Olares One: Gemma 4 MTP, Lucebox regression, vLLM no-Genesis hitting the workspace lock

From May 5 to May 8, 2026, I benched everything that fit on a 24GB RTX 5090M. Three findings: Gemma 4 MTP via vLLM lands at 178 t/s 24h after merge, Lucebox v1.9.0 mysteriously regresses from 88 to 69 t/s, vLLM no-Genesis validates PR #39931 but stalls on P65/P22/P38. Plus housekeeping: 8 Qwen3.6 27B apps → 2.

Hi there.

Quick recap for newcomers: on this blog, I benchmark LLMs locally on the Olares One (a mini-PC with an RTX 5090M 24GB — my personal lab). Three names show up a lot: Genesis is Sandermage’s custom vLLM stack that gets me to 88 t/s, Lucebox is a heavily-tuned llama.cpp fork by sandropuppo, and MTP is the speculative decoding head built into Qwen3.6 and Gemma 4. If you want the full backstory, the post on the hardware choice explains the context.

Anyway — a weird week on the Olares One. Upstream, everything moves at once: Gemma 4 ships its MTP drafters (May 5), vLLM merges native support (May 6), llama.cpp follows (May 7, via a third-party fork), and meanwhile my Lucebox v1.9.0 regresses 22% for no obvious reason. Quick recap of the three benches that ate my afternoons, plus the cleanup of the 8 Qwen3.6 27B apps on the Olares One Market.

1) Gemma 4 E4B + MTP via vLLM = 178 t/s on consumer mobile Blackwell

This one deserves its own post (link at the end), but in two lines: PR #41745 merged May 6 at 14:39 UTC adds native support for Gemma 4 Multi-Token Prediction drafters (E2B/E4B/26B-A4B/31B). On May 7 at 06:13 UTC the nightly Docker drops. At 06:35 UTC, my Olares One spits:

Run 1 (cold): 800 tok in 6.17 s = 129.73 t/s
Run 2:        800 tok in 4.17 s = 191.73 t/s
Run 3:        800 tok in 3.73 s = 214.38 t/s

AVG = 178.6 t/s, 77.3% acceptance. No Genesis, no fork, no patch — just vllm/vllm-openai:nightly-1acd67a795... + the right --speculative-config. Full details in the short post from this afternoon.
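For reference, the launch looks roughly like this — a minimal sketch, not the exact command: the image tag is the truncated nightly above, the Gemma 4 E4B model id is a placeholder, and the --speculative-config JSON keys (method, draft length) are my assumption about what the post-#41745 format expects, so check the PR before copying.

# Sketch only: placeholder model id and draft length; use the full nightly tag above.
docker run --rm --gpus all -p 8000:8000 \
  vllm/vllm-openai:nightly-<tag-above> \
  --model <gemma-4-e4b-checkpoint> \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 3}'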

The takeaway here: 24 hours between upstream merge and validated bench on consumer mobile Blackwell. A year ago that loop would have taken 2-3 weeks.

2) Lucebox v1.9.0 — the mystery 88 → 69 t/s regression

This one’s trickier. On May 4 I benched Lucebox v1.4.4 (custom DFlash path, target Qwen3.6-27B Q4_K_M, z-lab BF16 drafter) at 88.5 t/s on the same Olares One. When I rebench on May 7 after the v1.9.0 rebuild (which includes PR #94, matched 3.6 SWA draft, and PR #99, consumer Blackwell fix), I consistently get 69 t/s — exact same hardware, same chart, same drafters cached on disk.

I worked through the hypotheses one by one:

Recap table:

Config                                        t/s
v1.9.0 (PR #94+#99, TQ3_0, fa_window=2048)    68.85
v1.9.0 + fa_window=0                          68.83
v1.6.0 (PR #94 only, TQ3_0)                   69.05
v1.9.0 + Q4_0 KV                              68.96
May 4 v1.4.4 reference                        88.5

Every image/config hypothesis is ruled out. What’s left is the device environment: a kernel update, the HAMi runtime, libvgpu state, or the bench methodology (on May 4 I may have benched through an Open WebUI / public auth URL path over HTTP/2, whereas now I go through kubectl exec against localhost). The kubectl exec route adds some k8s-stack overhead — microseconds, not 22%, in my opinion.
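To make the methodology hypothesis testable, these are the two measurement paths I want to compare side by side — a sketch with placeholder hostnames, ports, and model names, not my actual chart values:

# Path A — the May 4 suspicion: public auth URL, HTTP/2, full ingress stack.
time curl -s https://lucebox.example.com/v1/completions \
  -H "Authorization: Bearer $TOKEN" -H 'Content-Type: application/json' \
  -d '{"model": "qwen3.6-27b", "prompt": "<bench prompt>", "max_tokens": 800}' \
  > /dev/null

# Path B — what I do now: kubectl exec straight into the pod, localhost, no ingress.
time kubectl exec deploy/lucebox -- \
  curl -s http://localhost:8080/v1/completions \
    -H 'Content-Type: application/json' \
    -d '{"model": "qwen3.6-27b", "prompt": "<bench prompt>", "max_tokens": 800}' \
  > /dev/null

If the two wall-clock times diverge by more than noise on the same build, the methodology hypothesis goes back on the table; otherwise it stays ruled out along with the image/config ones.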

Provisional conclusion: an environmental regression we can’t isolate without device-level access. Documented in long-term memory, and Lucebox is pulled from the app store (see §4).

3) vLLM no-Genesis with PR #39931 — the half-fix that breaks on the workspace lock

On the vLLM side, JartX lands PR #39931 in the official codebase on May 5 at 00:14 UTC: “[Feature] TurboQuant: support hybrid models and uniform quantization”. Big deal: it fixes the crash TurboQuant threw the moment it hit a Mamba layer on Qwen3.5/3.6/Qwen3-Next. Three of the 28 Genesis patches become unnecessary in one go (P60, ngram for GDN; P65, CUDA graph mode downgrade in spec-decode; P66, invalid-size filter at capture).

I tested on the vllm/vllm-openai:gemma4-0505-cu130 image (main HEAD post-merge) with the exact vllmqwen36turbo27bone config, but zero Genesis vars.

First crash, paradoxical: EagleProposer.__init__ → torch.zeros(...) → CUDA driver error: invalid argument. That’s the HAMi 0m bug I already knew from the Lucebox side (Driver API). I thought vLLM was immune (Runtime API). Wrong: the MTP drafter allocation goes through a path HAMi-core intercepts. Workaround: CUDA_DEVICE_MEMORY_LIMIT_0=24000m → it boots.

Second crash, expected: CUDA graph capture on query_start_loc.tolist() in turboquant_attn.py:583. That’s exactly what P65 monkey-patches. Workaround: --enforce-eager (skip ALL CUDA graphs).
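Stacked together, the no-Genesis launch ends up looking roughly like this — a sketch: the model id is a placeholder, the draft length mirrors the MTP n=3 I use with Genesis, and every other flag follows my vllmqwen36turbo27bone config (omitted here):

# Workaround 1: HAMi 0m fix for the drafter allocation.
# Workaround 2: --enforce-eager, the global CUDA-graph kill switch.
docker run --rm --gpus all -p 8000:8000 \
  -e CUDA_DEVICE_MEMORY_LIMIT_0=24000m \
  vllm/vllm-openai:gemma4-0505-cu130 \
  --model <qwen3.6-27b-turboquant-checkpoint> \
  --enforce-eager \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 3}'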

Bench:

Run 1: 800 tok in 11.12 s = 71.97 t/s
Run 2: 800 tok in 10.84 s = 73.81 t/s
Run 3: 800 tok in 11.13 s = 71.87 t/s

AVG = 72.55 t/s, i.e. −17.5% vs the Genesis baseline (88 t/s). That’s the cost of --enforce-eager. Genesis P65 is smarter: it downgrades _cudagraph_support only when speculative_config is active, so 1-token decode batches keep their CUDA graph.
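To make the difference concrete, here is a toy Python sketch of the idea — mine, not the actual Genesis code; the real P65 touches vLLM internals I’m only paraphrasing:

# Toy sketch: --enforce-eager is the blunt instrument (no CUDA graphs anywhere).
# P65 instead downgrades the attention backend's graph support only when a
# speculative config is present, so plain 1-token decode batches keep theirs.
def effective_cudagraph_support(backend_support: str, speculative_config) -> str:
    if speculative_config is None:
        return backend_support  # e.g. "ALWAYS": nothing changes without spec-decode
    # Spec-decode capture is what trips over query_start_loc.tolist(), so
    # restrict graphs to uniform decode batches instead of disabling them all.
    return "UNIFORM_BATCH" if backend_support == "ALWAYS" else backend_support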

I also tried Patch 65 alone (3 lines inline-patched on main HEAD). Boot OK, clean CUDA graph capture, explicit log: “setting cudagraph_mode=PIECEWISE”. But on the first inference: crash.

File ".../vllm/v1/attention/backends/turboquant_attn.py", line 862, in _decode_attention
    current_workspace_manager().get_simultaneous(...)
AssertionError: Workspace is locked but allocation requires 0.76 MB,
current size is 0.00 MB. Workspace growth is not allowed after locking.

That’s the profiler-invisible torch.empty bug in the “continuation prefill” phase that Sandermage described in the #40807 thread — his patches P22 (shared dequant buffer + 4-D K/V workspace prealloc) and P38 (memory state machine) are the actual fix. P65 alone isn’t enough. To genuinely drop Genesis you need at minimum P22 + P38 + P65.
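Here is a toy model of that failure, to show why a prealloc-style fix (what I understand P22/P38 to do) sidesteps it — my own sketch, not the TurboQuant code; buffer names and shapes are made up:

import torch

class Workspace:
    """Toy version of the TurboQuant workspace: no growth once locked."""
    def __init__(self):
        self.buffers, self.locked = {}, False

    def get(self, name, shape, dtype=torch.float16):
        if name not in self.buffers:
            assert not self.locked, \
                "Workspace is locked but allocation requires new memory."
            self.buffers[name] = torch.empty(shape, dtype=dtype)
        return self.buffers[name]

ws = Workspace()
# What the prealloc fix effectively does: size the continuation-prefill
# buffers up front, before the profiler locks the workspace...
ws.get("kv_dequant", (4, 2048, 8, 128))
ws.locked = True
# ...so the decode-time request hits the cached buffer instead of the assert.
ws.get("kv_dequant", (4, 2048, 8, 128))
# Skip the prealloc and this second call reproduces the
# "Workspace growth is not allowed after locking" crash above.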

I posted a comment on #40807 with these two datapoints (fourth confirmed hardware after Ampere 3090, A5000, and Blackwell 5080). Still no reply as of May 8, 11:00.

Conclusion: we can’t drop Genesis today without porting ~5 non-trivial patches upstream. PR #39931 is necessary but not sufficient. The full fix needs Sandermage or a motivated core-team member.

4) Cleanup: 8 Qwen3.6 27B apps on the Olares One Market → 2

After all this stack iteration, my Olares One Market had 8 Qwen3.6 27B variants:

Too many. I kept two 27B apps, and to help users choose, the titles now say directly what each one is for:

App title                    Stack                                                          Context
Qwen36 27B Fast              vLLM Turbo + Genesis 28 patches + TurboQuant K8V4 + MTP n=3    128K
📜 Qwen36 27B Long Context   llama.cpp + PR #22673 (custom image) + froggeric MTP-GGUF      128K (262K native)

Removed: lucedflashqwen36one, dflashqwen36one, vllmqwen36dense27bone, llamacppqwen36dense27bone. The 35B-A3B apps (llamacppqwen36a3bone + qwen36a3bvisionone) stay — different model.

Along the way I also caught 2 bugs on the “Fast” version during re-test:

All of that to finally get v2.5.6 booting at a stable 128K context.

5) Bonus: Atomic Chat fork for Gemma 4 MTP via llama.cpp

For folks who want Gemma 4 MTP on the llama.cpp side (not vLLM), AtomicChat published the MTP-compatible GGUFs + a fork AtomicBot-ai/atomic-llama-cpp-turboquant that adds:

I built their fork for sm_120 → image aamsellem/llamacpp-atomic-mtp:0.1.0 (2.72 GB). AtomicChat’s bench on M5Max: 97 → 138 t/s on Gemma 4 26B (+42%).
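If you’d rather rebuild it than pull my image, the recipe is roughly this — assuming the fork lives on GitHub under that name and keeps upstream llama.cpp’s CMake options (check its README):

git clone https://github.com/AtomicBot-ai/atomic-llama-cpp-turboquant
cd atomic-llama-cpp-turboquant
# sm_120 = consumer mobile Blackwell, i.e. the 5090M in the Olares One
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=120
cmake --build build --config Release -j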

Validated on Olares One on May 8 afternoon:

Gemma 4 E2B + MTP     : 206.56 t/s, 60.9 % acceptance (3,198 tok in 15.48 s)
Gemma 4 26B-A4B + MTP : 140.03 t/s, 78.1 % acceptance (3,238 tok in 23.12 s)

The 26B-A4B delivers 140 t/s with 78% acceptance on the first run (no warm-up). That beats AtomicChat’s M5Max reference bench (138 t/s) — the 5090M sm_120 mobile has ~75% of a 4090’s bandwidth, but the Gemma 4 MoE (3.8B activated out of 26B) takes good advantage of it. The E2B at 206 t/s is essentially the “single-stream max” ceiling for a 5B model on this hardware.

Charts shipped: gemma4e2bone v1.0.2 (E2B, 128K ctx) and gemma426ba4bone v1.0.9 (26B-A4B, 64K ctx, no vision because mmproj isn’t wired with MTP in this fork).

What works, what breaks — cheat sheet

Stack                                                      Status on Olares One 5090M
Atomic Chat fork + Gemma 4 E2B MTP via llama.cpp           ✅ 206 t/s, 61% acceptance — single-stream ceiling
vLLM nightly + #41745 + Gemma 4 E4B MTP                    ✅ 178 t/s, 77% acceptance, 100% upstream stack
Atomic Chat fork + Gemma 4 26B-A4B MTP via llama.cpp       ✅ 140 t/s, 78% acceptance — beats the M5Max ref (138 t/s)
vLLM Turbo (Genesis) + Qwen3.6-27B + MTP n=3 + 128K        ✅ 88 t/s steady, validated after HAMi fix + drop xxhash + revert P5B
llama.cpp + PR #22673 + Qwen3.6-27B-MTP + 128K             ✅ 65 t/s, near-upstream stack (1 PR + 1 GGUF)
Lucebox DFlash v1.9.0 + Qwen3.6-27B                        ⚠️ 69 t/s reproducible vs 88 t/s May 4 ref (regression unisolated)
vLLM main HEAD + #39931 + --enforce-eager (no Genesis)     ⚠️ 72.55 t/s — not viable for dropping Genesis
vLLM main HEAD + Patch 65 alone                            ❌ Workspace lock crash — also needs P22 + P38

The only honest conclusion

The local-inference ecosystem in 2026 has a release cadence nobody can follow daily. In 7 days I saw: an official Gemma 4 MTP drafter + 2 PRs to support it (vLLM + llama.cpp), a Genesis-compat PR in vLLM main, two reproducible regressions, and a TurboQuant fork ported to llama.cpp. My job on the Olares One Market is to triage: keep what’s validated, remove what’s broken, write down what helps other Olares One owners avoid my crashes.

As for Gemma 4 MTP via vLLM, it’s too fresh to generalize beyond a 5090M. If you run a 5070, a 5080, or a desktop 4090/3090, your numbers matter to me. And if you’ve somehow solved the Lucebox 88 → 69 t/s regression, I want to know.

See you next time!


Disclosure — All the benchmarks in this post run on my own Olares One. If the content was useful and you’re considering one, ordering through this referral link gets you $400 off ($3,599 instead of $3,999) and pays me $200. I’m mentioning this out of transparency — and yes, incidentally, it helps keep the blog alive (hosting, domain, and the time I spend writing here). Link valid until late June 2026.
