Aurélien AMSELLEM

Lucebox on Olares One — Episode 9: the 106 t/s attempt that stalled at 88.7 t/s, and why

Lucebox PR #94 reports 67 → 106 t/s (+57%) on RTX 4090 with the matched 3.6 SWA draft. I port it to the 5090M Blackwell. Spoiler: 88.5 → 88.7 t/s. Full DDTree budget sweep, and the honest reason why.

Episode 8 — I had Lucebox running at 88.5 t/s on Olares One via the libvgpu hot-swap + three workarounds. New champion, 0.5 t/s ahead of vLLM Turbo. Then I see PR #94 land on Lucebox-hub the same morning: “Support Qwen3.6-27B-DFlash draft (SWA layers) — 106 t/s on RTX 4090”. With a DDTree budget sweep showing +57% vs the mismatched 3.5 drafter (67.4 → 106.1 t/s on RTX 4090).

Ada sm_89 → 106 t/s. If it scales linearly to Blackwell sm_120 (which has more bandwidth overall), we should hit 120-130 t/s on Olares.

Spoiler right now: 88.7 t/s. +0.2 vs yesterday. In the noise floor. Here’s the post-mortem.

What PR #94 promises

Three changes in the PR:

  1. Matched 3.6 draft instead of mismatched 3.5 — the z-lab/Qwen3.6-27B-DFlash draft (4 sliding-window-attention layers + 1 full-attention) instead of the 3.5 draft (5 full-attention layers). The layer types are read from the config.json next to the safetensors (quick check below, after the list).

  2. SWA mask plumbed through qwen3_dflash_graph.cpp via ggml_flash_attn_ext.

  3. Build flag DFLASH27B_ENABLE_BSA=ON turns on Block-Sparse-Attention (PFlash path, but also useful for the drafter).
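For reference, a quick way to check which draft actually sits on disk. A minimal sketch, assuming the draft is mounted at /models/draft as in my deployment:

jq '{layer_types, sliding_window}' /models/draft/config.json
# expected for the matched 3.6 draft (per the PR):
#   layer_types: ["sliding", "sliding", "sliding", "sliding", "full"]
#   sliding_window: 2048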

The PR’s RTX 4090 sweep:

budget=6:   93.9 t/s
budget=10: 106.1 t/s ← sweet spot
budget=14:  96.5 t/s
budget=18: 102.7 t/s
budget=22: 103.5 t/s

Sweet spot at 10. Logical: lower budget = narrower verify batches = less bandwidth pressure per step = good on cards with limited bandwidth.

The build

I rebuild a v1.6.0 image off this branch:

RUN git clone --depth 1 --branch feat/dflash-qwen36-swa-draft \
    --recurse-submodules \
    https://github.com/Quitetall/lucebox-hub /src/lucebox

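# sm_120 = Blackwell; BSA kernels on per PR #94. The rpath-link against the
# CUDA driver stubs lets the link succeed in a builder with no GPU driver.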
RUN cmake -B build -S . \
    -DCMAKE_BUILD_TYPE=Release \
    -DGGML_CUDA=ON -DGGML_CUDA_NO_VMM=ON \
    -DCMAKE_CUDA_ARCHITECTURES="120" \
    -DDFLASH27B_ENABLE_BSA=ON \
    -DCMAKE_EXE_LINKER_FLAGS="-Wl,-rpath-link,/usr/local/cuda/lib64/stubs" \
    -DCMAKE_SHARED_LINKER_FLAGS="-Wl,-rpath-link,/usr/local/cuda/lib64/stubs" \
    && cmake --build build --target test_dflash test_flashprefill_kernels -j$(nproc)

Build: 100 min on cross-arch OrbStack. Image v1.6.0 = 5.39 GB (vs 4.17 GB for v1.4.4 = +1.2 GB = the BSA kernels).
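After a 100-minute build, it's worth confirming the flag actually took. One way is to read it back from the CMake cache baked into the image. A sketch; the image tag is whatever you named v1.6.0, and the source path follows the Dockerfile above:

docker run --rm lucebox:v1.6.0 \
  grep DFLASH27B_ENABLE_BSA /src/lucebox/build/CMakeCache.txt
# expected: DFLASH27B_ENABLE_BSA:BOOL=ON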

First bench — default config

Default chart config (inherited from v1.4.4): --cache-type-k tq3_0 --cache-type-v tq3_0 --budget 22 (default).

Run 1: 87.45 t/s
Run 2: 89.22 t/s
Run 3: 88.09 t/s

88.25 t/s avg. Identical to yesterday's 88.5, within noise. Ouch. Is PR #94 even doing anything?

I check that the PR is actually active. grep "SWA layers" source/safetensors_draft.cpp → the function is there, parsing config.json. cat /models/draft/config.json → layer_types: [sliding × 4, full × 1], sliding_window: 2048 — exactly what the PR expects.

But the diagnostic log [draft] SWA layers: 4/5 (window=2048) doesn’t appear in the test_dflash output. Either the code path runs silently, or the log goes to stderr and gets lost (in theory kubectl logs captures both stdout and stderr, but I’m not sure that holds for daemon subprocesses spawned by server.py).
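One way to settle it. A sketch; the pod label is an assumption from my chart, and the binary path follows the build layout above:

# fold stderr into stdout so the "[draft] SWA layers" line can't slip past
POD=$(kubectl get pods -l app=lucebox -o jsonpath='{.items[0].metadata.name}')
kubectl exec "$POD" -- sh -c \
  '/src/lucebox/build/bin/test_dflash 2>&1 | grep "SWA layers"'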

Verdict: probably active but with no visible gain here.

Hypothesis 1: default config is suboptimal

PR #94 documents a specific setup for the 106 t/s:

DFLASH27B_KV_K=q4_0 DFLASH27B_KV_V=q4_0 DFLASH27B_FA_WINDOW=0 \
  python scripts/run.py --budget 10 --max-ctx 4096 ...

Three differences vs my config:

  1. KV cache in q4_0 instead of my tq3_0.
  2. Budget 10 instead of the chart default of 22.
  3. DFLASH27B_FA_WINDOW=0, which my deployment doesn't set at all.

I patch via kubectl patch deployment (faster than bumping the chart).
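Roughly this. A sketch; the deployment and container names come from my chart and may differ in yours:

kubectl patch deployment lucebox --patch '
spec:
  template:
    spec:
      containers:
      - name: lucebox
        env:
        - {name: DFLASH27B_KV_K,      value: q4_0}
        - {name: DFLASH27B_KV_V,      value: q4_0}
        - {name: DFLASH27B_FA_WINDOW, value: "0"}
'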

And I run a full sweep: 6, 10, 14, 18, 22, 26, 30. Seven configs × (5 min boot + 30 s bench) ≈ 40 min. I let it run.
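The driver loop, sketched. bench.sh is a hypothetical wrapper that fires the Space Invaders prompt and prints t/s; the args index is specific to my pod spec:

for BUDGET in 6 10 14 18 22 26 30; do
  # swap the --budget flag in the container args (index 1 in my spec)
  kubectl patch deployment lucebox --type=json -p "[
    {\"op\": \"replace\",
     \"path\": \"/spec/template/spec/containers/0/args/1\",
     \"value\": \"--budget=$BUDGET\"}
  ]"
  kubectl rollout status deployment/lucebox --timeout=10m  # ~5 min boot
  ./bench.sh | tee "budget_${BUDGET}.log"                  # ~30 s bench
done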

The full sweep

Budget   Status                       t/s avg
6        crash (503 on first request)
10       crash (503)
14       crash (503)
18       OK                           84.8 [83.5-85.9]
22       OK ← peak                    88.7 [87.5-89.4]
26       OK, then drop                76.7 [75.5-77.5]
30       1 OK run, then crash (503)   68.3 → unstable

Dome curve. Peak at 22.

+0.2 t/s vs my v1.4.4 baseline (88.5). In the noise floor.

Three possible explanations

1. The matched 3.6 SWA draft doesn’t shine on short prompts

My Space Invaders prompt is ~25 tokens. SWA optimizations (window=2048) on the drafter only kick in when the draft sequence exceeds the window. At 25 tokens, the drafter sees everything, like a normal full-attention layer.

The PR #94 RTX 4090 sweep probably used bench_he.py running HumanEval (longer prompts, full code completion). Closer to where SWA actually wins.

To validate this, I'd need to retest with an 8K-16K token prompt. But the Space Invaders bench has been our standard since the beginning, and it's representative of normal chat usage.

2. The sweet spot differs by GPU architecture

PR #94 on RTX 4090 (Ada sm_89) = sweet spot budget 10, max 106 t/s. My Olares One RTX 5090M (Blackwell sm_120) = sweet spot budget 22, max 88.7 t/s.

Radical difference. Why?

DDTree budget controls verify batch width. The bigger, the more parallelism but the more memory required. Mobile Blackwell (5090M) has fewer tensor-core resources per SM than the desktop Ada 4090 — verify too wide = bandwidth-bound. Higher sweet spot on Blackwell because… no, that's not logical. Higher = bigger batch = more bandwidth needed.

More likely hypothesis: the q4_0 KV cache is heavier per token than tq3_0. Low budget (6, 10, 14) → too much fragmentation and the daemon OOMs on the first request (503). Budget 22 = the sweet spot where there's enough memory for the verify batch without crashing. Budget 26+ = verify dominates the overhead, drop.

This is consistent with the fact that my budget 6/10/14 crash 503 while the PR #94 RTX 4090 handles them. The 24GB desktop 4090 has more free VRAM after model + drafter (probably because no HAMi vGPU on top).

3. PR #94 is in progress, maybe not finalized

PR #94 is still OPEN (May 4). Not merged yet. The fact that my test doesn’t reproduce the 106 t/s could simply mean the branch isn’t at its final state. Sandropuppo might have additional commits locally not yet pushed.

One to watch. If the PR merges with a fix that changes things on sm_120 specifically, I'll rebuild.

Today’s honest verdict

Plateau at ~88 t/s on Qwen3.6-27B / RTX 5090M / 24GB. Three distinct stacks land within 0.7 t/s of each other:

Stack                                                        t/s avg
vLLM Turbo (Sandermage Genesis + TurboQuant K8V4 + MTP n=3)  88.0
Lucebox v1.4.4 (tq3_0 KV + DDTree 22)                        88.5
Lucebox v1.6.0 (PR #94 + q4_0 KV + DDTree 22)                88.7
buun-llama-cpp DFlash + Q8_0 GGUF drafter                    80
llama.cpp standard (no spec)                                 33-36

For a breakthrough past 90 t/s, the remaining paths are upstream: chiefly PR #94 reaching its final state, plus whatever sm_120-specific fixes follow it. None of these is actionable today. 88 t/s is our practical cap on this hardware in late Q2 2026.

Lesson from this episode

Not every upstream benchmark reproduces on every hardware setup. PR #94 on a desktop RTX 4090 24GB = +57%. On a mobile RTX 5090M Blackwell 24GB = +0.2%. Same VRAM, same model, same quantization — the differences in GPU architecture and memory configuration are enough to keep the gain from carrying over.

And transparency on negative results is as useful as transparency on positive ones. Someone else will read PR #94, see the announced 106 t/s, try to reproduce on their 5090. This post saves them a 2-hour build and points them where to look.

Next steps

  1. Watch PR #94: if it merges, or if sm_120-specific commits land, rebuild and re-sweep.
  2. Retest with an 8K-16K token prompt to see whether the matched SWA draft pulls ahead on long contexts.

That’s it! If anyone on a 5090M or other consumer Blackwell 24GB reproduces these numbers or finds the unlock I’m missing, I want to know.


Disclosure — All the benchmarks in this post run on my own Olares One. If the content was useful and you’re considering one, ordering through this referral link gets you $400 off ($3,599 instead of $3,999) and pays me $200. I’m mentioning this out of transparency — and yes, incidentally, it helps keep the blog alive (hosting, domain, and the time I spend writing here). Link valid until late June 2026.
