Aurélien AMSELLEM

Lucebox on Olares One — Episode 9: the PR that promised +57% and delivered +0.2%

Last night Lucebox crossed 88.5 t/s on Olares One and became the new champion. This morning PR #94 reports +57% on RTX 4090. If it scales, we hit 120 t/s. Spoiler: 88.7 t/s. Full DDTree sweep, three hypotheses, the honest lesson on upstream benches that don't reproduce.

Hi there.

Last night Lucebox crossed 88.5 t/s on my Olares One and became the new champion on this hardware, 0.5 t/s ahead of my vLLM Turbo. Not a blowout but an honest win. I closed the laptop happy. Then this morning, sipping coffee, I see this on the Lucebox-hub repo:

PR #94 — Support Qwen3.6-27B-DFlash draft (SWA layers) — 106 t/s on RTX 4090

106 tokens per second. On Ada Lovelace (RTX 4090). With a DDTree budget sweep showing +57% vs the mismatched 3.5 drafter (67.4 → 106.1 t/s).

If that scales linearly to my consumer Blackwell RTX 5090M Laptop (newer architecture, GDDR7 memory), we should hit 120-130 t/s on Olares. The saga continues, and I’m expecting a nice gain tonight.

Spoiler right now: 88.7 t/s. +0.2 vs yesterday. In the noise floor. Here’s the post-mortem.

Quick recap for newcomers

Lucebox is sandropuppo’s llama.cpp fork running DFlash speculative decoding with custom CUDA kernels. The “mismatched 3.5 drafter” we’ve been using since the start is the small model that proposes tokens ahead, except it was trained on Qwen3.5, not the Qwen3.6 we’re actually running. It works, but with middling acceptance. PR #94 finally publishes a drafter trained specifically for Qwen3.6 and, as a bonus, adds a Sliding Window Attention (SWA) mechanism that should boost things further.

On paper, that’s exactly the upgrade I was missing.

What the PR promises

Three changes:

  1. Matched 3.6 draft instead of mismatched 3.5 — four SWA layers + one full-attention layer instead of five full-attention layers. Layer types read from config.json next to the safetensors.

  2. SWA mask plumbed through qwen3_dflash_graph.cpp via ggml_flash_attn_ext.

  3. Build flag DFLASH27B_ENABLE_BSA=ON turns on Block-Sparse-Attention (PFlash path, but also useful for the drafter).
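To make the first change concrete, here is a minimal Python sketch of reading per-layer attention types from a draft's config.json. The keys layer_types and sliding_window are the ones visible in the draft config; the string values "sliding_attention" and "full_attention" and the helper name summarize_draft_layers are assumptions for illustration, and this is not the PR's actual C++ loader.

```python
import json

def summarize_draft_layers(config_text: str) -> str:
    """Count sliding-window vs full-attention layers in a draft config."""
    cfg = json.loads(config_text)
    layer_types = cfg.get("layer_types", [])
    window = cfg.get("sliding_window")
    # Assumption: SWA layers are tagged with a string starting with "sliding".
    n_swa = sum(1 for t in layer_types if t.startswith("sliding"))
    return f"SWA layers: {n_swa}/{len(layer_types)} (window={window})"

# Hypothetical config mirroring the 4 SWA + 1 full-attention layout.
example = json.dumps({
    "layer_types": ["sliding_attention"] * 4 + ["full_attention"],
    "sliding_window": 2048,
})
print(summarize_draft_layers(example))
```

Running the same kind of check against the real config.json is a quick way to confirm the draft on disk is the matched one before blaming the kernels.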

The PR’s RTX 4090 sweep:

budget=6:   93.9 t/s
budget=10: 106.1 t/s ← sweet spot
budget=14:  96.5
budget=18: 102.7
budget=22: 103.5

Sweet spot at 10. Logical: lower budget = less verify parallelism = fewer chunks = good on cards with limited bandwidth.
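A small Python sketch of how the headline number falls out of that sweep: pick the best budget, then compare it with the 67.4 t/s quoted for the mismatched drafter. The figures are the PR's; the helper names best_budget and relative_gain_pct are made up for this example.

```python
# Throughput figures reported in the PR's RTX 4090 sweep (t/s per DDTree budget).
pr_sweep = {6: 93.9, 10: 106.1, 14: 96.5, 18: 102.7, 22: 103.5}
mismatched_baseline = 67.4  # t/s with the 3.5 drafter, as quoted in the PR

def best_budget(sweep):
    """Return the budget with the highest throughput."""
    return max(sweep, key=sweep.get)

def relative_gain_pct(new, old):
    """Relative improvement of new over old, in percent."""
    return (new - old) / old * 100.0

budget = best_budget(pr_sweep)
gain = relative_gain_pct(pr_sweep[budget], mismatched_baseline)
print(f"best budget={budget}, {pr_sweep[budget]} t/s, +{gain:.0f}% vs baseline")
```

Note that the PR's curve is not a clean dome: budget 14 dips below both 18 and 22, so a single sweep per budget may carry some noise of its own.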

The morning: the build

I rebuild a v1.6.0 image off this branch:

RUN git clone --depth 1 --branch feat/dflash-qwen36-swa-draft \
    --recurse-submodules \
    https://github.com/Quitetall/lucebox-hub /src/lucebox

RUN cmake -B build -S . \
    -DCMAKE_BUILD_TYPE=Release \
    -DGGML_CUDA=ON -DGGML_CUDA_NO_VMM=ON \
    -DCMAKE_CUDA_ARCHITECTURES="120" \
    -DDFLASH27B_ENABLE_BSA=ON \
    -DCMAKE_EXE_LINKER_FLAGS="-Wl,-rpath-link,/usr/local/cuda/lib64/stubs" \
    -DCMAKE_SHARED_LINKER_FLAGS="-Wl,-rpath-link,/usr/local/cuda/lib64/stubs" \
    && cmake --build build --target test_dflash test_flashprefill_kernels -j$(nproc)

100 minutes of cross-compile on OrbStack. Image v1.6.0 = 5.39 GB (vs 4.17 GB for v1.4.0 — +1.2 GB = the BSA kernels).

First bench, default config

Default chart config (inherited from v1.4.4): --cache-type-k tq3_0 --cache-type-v tq3_0 --budget 22.

Run 1: 87.45 t/s
Run 2: 89.22 t/s
Run 3: 88.09 t/s

88.25 t/s avg. Effectively identical to yesterday (88.5). Within noise.
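For what "within noise" means here, a rough Python check: compare the gap to yesterday's figure against the run-to-run spread. Using the min-max spread of three runs as the noise estimate is an assumption of this sketch, not a rigorous statistical test.

```python
from statistics import mean

runs = [87.45, 89.22, 88.09]  # t/s, the three runs above
baseline = 88.5               # yesterday's average on the same hardware

avg = mean(runs)
spread = max(runs) - min(runs)   # crude noise estimate from three samples
delta = avg - baseline
within_noise = abs(delta) <= spread

print(f"avg={avg:.2f} spread={spread:.2f} delta={delta:+.2f} within_noise={within_noise}")
```

With a spread of almost 2 t/s between runs, a quarter of a token per second in either direction tells us nothing.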

First reaction: is PR #94 even taking effect? I check that it’s actually active. grep "SWA layers" source/safetensors_draft.cpp → the function is there. cat /models/draft/config.json → layer_types: [sliding × 4, full × 1], sliding_window: 2048, exactly what the PR expects. So the code path should be taken.

But the diagnostic log [draft] SWA layers: 4/5 (window=2048) doesn’t appear in test_dflash output. Either the code path runs silently, or the log goes to stderr we don’t capture in kubectl logs.

Verdict: probably active, but with no visible gain here.

Noon: the PR’s official config

PR #94 documents a specific setup for the 106 t/s:

DFLASH27B_KV_K=q4_0 DFLASH27B_KV_V=q4_0 DFLASH27B_FA_WINDOW=0 \
  python scripts/run.py --budget 10 --max-ctx 4096 ...

Three differences vs my config:

I patch via kubectl patch deployment (faster than re-bumping the chart) and run a full sweep: 6, 10, 14, 18, 22, 26, 30. Seven runs × (5 min boot + 30 s bench) ≈ 40 min. I let it run while I do other things.
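A sketch of what such a sweep loop could look like, in Python, building one kubectl patch command per budget and estimating the total time. The deployment name lucebox and the args layout are hypothetical, and the commands are only printed, not executed.

```python
import json

BUDGETS = [6, 10, 14, 18, 22, 26, 30]
DEPLOYMENT = "lucebox"  # hypothetical deployment name

def patch_command(budget: int) -> list[str]:
    """Build a kubectl JSON-patch command that sets the container args."""
    patch = [{
        "op": "replace",
        "path": "/spec/template/spec/containers/0/args",
        # Hypothetical layout: this replaces the whole args list, so a real
        # patch would also have to carry every other flag the container needs.
        "value": ["--budget", str(budget)],
    }]
    return ["kubectl", "patch", "deployment", DEPLOYMENT,
            "--type=json", "-p", json.dumps(patch)]

BOOT_MIN, BENCH_MIN = 5.0, 0.5  # per-run boot and bench time, in minutes
total_min = len(BUDGETS) * (BOOT_MIN + BENCH_MIN)

for b in BUDGETS:
    print(" ".join(patch_command(b)))
print(f"estimated sweep time: {total_min:.1f} min")
```

Passing the command as a list to subprocess.run would avoid shell-quoting problems with the JSON payload.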

The afternoon: the full sweep

Budget  Status                       t/s avg
6       crash 503 on first request   -
10      crash 503                    -
14      crash 503                    -
18      OK                           84.8 [83.5-85.9]
22      OK ← peak                    88.7 [87.5-89.4]
26      OK then drop                 76.7 [75.5-77.5]
30      1 run OK then crash 503      68.3 → unstable

Dome curve. Peak at 22.

+0.2 t/s vs my v1.4.4 baseline (88.5). Within noise.
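The same comparison as a short Python sketch: drop the budgets that crashed, find the peak, and express the gap to the 88.5 baseline both in t/s and in percent. Encoding a crashed budget as None is a choice made for this example, and budget 30 is kept even though it only survived one run.

```python
# Average t/s per DDTree budget on the Olares One; None = 503 before any result.
olares_sweep = {6: None, 10: None, 14: None, 18: 84.8, 22: 88.7, 26: 76.7, 30: 68.3}
baseline = 88.5  # v1.4.4 average

# Keep only budgets that produced at least one measurement.
stable = {b: v for b, v in olares_sweep.items() if v is not None}
peak_budget = max(stable, key=stable.get)
delta = stable[peak_budget] - baseline
delta_pct = delta / baseline * 100

print(f"peak at budget {peak_budget}: {stable[peak_budget]} t/s "
      f"({delta:+.1f} t/s, {delta_pct:+.1f}%)")
```

Both the absolute and the relative gain round to 0.2, which is why the title can say +0.2% and the text +0.2 t/s without contradiction.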

Three possible explanations

1. The matched 3.6 SWA draft doesn’t shine on short prompts

My Space Invaders prompt is ~25 tokens. SWA optimizations (window=2048) on the drafter only kick in when the draft sequence exceeds the window. At 25 tokens, the drafter sees everything, like a normal full-attention layer.
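A toy Python illustration of that point: build a causal mask and a sliding-window mask and check whether they differ. The exact window convention (here, each position sees itself plus the previous window-1 positions) is an assumption; the real kernel may define the boundary slightly differently.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j."""
    return [[j <= i for j in range(n)] for i in range(n)]

def sliding_window_mask(n, window):
    """Causal mask restricted to the last `window` positions."""
    return [[(j <= i) and (i - j < window) for j in range(n)] for i in range(n)]

def swa_differs_from_full(n, window):
    """True only if the window actually hides some past positions."""
    return sliding_window_mask(n, window) != causal_mask(n)

print(swa_differs_from_full(25, 2048))  # short prompt, window never reached
print(swa_differs_from_full(8, 4))      # toy case where the sequence exceeds the window
```

As long as the sequence is shorter than the window, the two masks are identical, so an SWA layer does exactly the same work as a full-attention layer.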

The PR #94 RTX 4090 sweep probably used bench_he.py running HumanEval (longer prompts, full code completion). Closer to where SWA actually wins.

To validate, I’d need to retest with an 8K-16K token prompt. But our Space Invaders bench has been our standard since the beginning, and it’s representative of normal chat.

2. The sweet spot differs by GPU architecture

PR #94 on RTX 4090 (Ada sm_89) = sweet spot budget 10, max 106 t/s. My Olares One RTX 5090M (Blackwell sm_120) = sweet spot budget 22, max 88.7 t/s.

Radical difference. And higher budget → wider batch → more bandwidth consumed. The mobile 5090M (sm_120) has roughly 90% of a desktop 4090’s memory bandwidth (896 vs 1008 GB/s) but only about two thirds of the SMs (82 vs 128), so the window where the verify batch fits without crashing is different.

More likely hypothesis: with q4_0 KV cache, memory-per-token is denser. Low budget (6, 10, 14) → too much fragmentation and the daemon OOMs on the first request (503). Budget 22 = sweet spot where there’s enough memory for the verify batch without crashing. Budget 26+ = verify dominates the overhead, drop.

That’s consistent with the fact that my budget 6/10/14 crash 503 while the PR’s RTX 4090 handles them. The desktop 4090 has more free VRAM after model + drafter (probably because no HAMi vGPU on top).

3. PR #94 is in progress, maybe not finalized

PR #94 is still OPEN (May 4). Not merged yet. The fact that my test doesn’t reproduce the 106 t/s could simply mean the branch isn’t at its final state. Sandropuppo might have additional commits locally not yet pushed.

To watch. If the PR merges with a fix that changes things on sm_120 specifically, I rebuild.

Today’s honest verdict

Plateau at ~88 t/s on Qwen3.6-27B / RTX 5090M / 24GB. Three distinct stacks land within 0.7 t/s of each other:

Stack                                               t/s avg
vLLM Turbo (Genesis + TurboQuant K8V4 + MTP n=3)    88.0
Lucebox v1.4.4 (tq3_0 KV + DDTree 22)               88.5
Lucebox v1.6.0 (PR #94 + q4_0 KV + DDTree 22)       88.7
buun-llama-cpp DFlash + Q8_0 drafter                80
llama.cpp standard (no spec)                        33-36

For a breakthrough past 90 t/s, several upstream paths to wait on:

None of these is actionable today. 88 t/s is our practical cap on this hardware in late Q2 2026.

Lesson from this episode

Not every upstream benchmark reproduces on every piece of hardware. PR #94 on RTX 4090 desktop 24GB = +57%. On RTX 5090M Blackwell mobile 24GB = +0.2%. Same VRAM, same model, same quantization: the differences in GPU architecture and memory configuration are enough to invalidate the claim.

And transparency on negative results is as useful as transparency on positive ones. Someone else will read PR #94, see the announced 106 t/s, try to reproduce on their 5090. This post saves them a 2-hour build and points them where to look.

Next steps

That’s it! If anyone on a 5090M or other consumer Blackwell 24GB reproduces these numbers or finds the unlock I’m missing, I want to know.


Disclosure — All the benchmarks in this post run on my own Olares One. If the content was useful and you’re considering one, ordering through this referral link gets you $400 off ($3,599 instead of $3,999) and pays me $200. I’m mentioning this out of transparency — and yes, incidentally, it helps keep the blog alive (hosting, domain, and the time I spend writing here). Link valid until late June 2026.
