Episode 8 — I had Lucebox running at 88.5 t/s on Olares One via the libvgpu hot-swap plus three workarounds. New champion, 0.5 t/s ahead of vLLM Turbo. Then, the same morning, I see PR #94 land on Lucebox-hub: “Support Qwen3.6-27B-DFlash draft (SWA layers) — 106 t/s on RTX 4090”, with a DDTree budget sweep showing +57% over the mismatched 3.5 drafter (67.4 → 106.1 t/s on RTX 4090).
Ada sm_89 → 106 t/s. If it scales linearly to Blackwell sm_120 (which has more bandwidth overall), we should hit 120-130 t/s on Olares.
Spoiler right now: 88.7 t/s. +0.2 vs yesterday. In the noise floor. Here’s the post-mortem.
What PR #94 promises
Three changes in the PR:
- Matched 3.6 draft instead of the mismatched 3.5: the `z-lab/Qwen3.6-27B-DFlash` draft (4 sliding-window-attention layers + 1 full-attention layer) replaces the 3.5 draft (5 full-attention layers). Layer types are read from the `config.json` next to the safetensors.
- SWA mask plumbed through `qwen3_dflash_graph.cpp` via `ggml_flash_attn_ext`.
- Build flag `DFLASH27B_ENABLE_BSA=ON` turns on Block-Sparse Attention (PFlash path, but also useful for the drafter).
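The layer-type detection behind the first change can be sketched in a few lines. This is a hypothetical reconstruction: the real parser is C++ (`safetensors_draft.cpp`), and the `"sliding_attention"` / `"full_attention"` value strings are my assumption; only the `layer_types` and `sliding_window` field names are confirmed by the drafter's `config.json`.

```python
import json

# Hedged sketch of what the PR's loader does: read layer_types from the
# drafter's config.json and count sliding-window vs full-attention layers.
# The inline dict stands in for the real file; value strings are assumed.
raw = json.dumps({
    "layer_types": ["sliding_attention"] * 4 + ["full_attention"],
    "sliding_window": 2048,
})  # in practice: raw = open("/models/draft/config.json").read()

config = json.loads(raw)
swa = sum(t == "sliding_attention" for t in config["layer_types"])
total = len(config["layer_types"])
print(f"[draft] SWA layers: {swa}/{total} (window={config['sliding_window']})")
# -> [draft] SWA layers: 4/5 (window=2048)
```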
The PR’s RTX 4090 sweep:
budget=6: 93.9 t/s
budget=10: 106.1 t/s ← sweet spot
budget=14: 96.5 t/s
budget=18: 102.7 t/s
budget=22: 103.5 t/s
Sweet spot at 10. Plausible: a lower budget means less verify parallelism and fewer chunks to stream, which suits cards with limited bandwidth.
The build
I rebuild a v1.6.0 image off this branch:
```dockerfile
RUN git clone --depth 1 --branch feat/dflash-qwen36-swa-draft \
    --recurse-submodules \
    https://github.com/Quitetall/lucebox-hub /src/lucebox

RUN cmake -B build -S . \
    -DCMAKE_BUILD_TYPE=Release \
    -DGGML_CUDA=ON -DGGML_CUDA_NO_VMM=ON \
    -DCMAKE_CUDA_ARCHITECTURES="120" \
    -DDFLASH27B_ENABLE_BSA=ON \
    -DCMAKE_EXE_LINKER_FLAGS="-Wl,-rpath-link,/usr/local/cuda/lib64/stubs" \
    -DCMAKE_SHARED_LINKER_FLAGS="-Wl,-rpath-link,/usr/local/cuda/lib64/stubs" \
    && cmake --build build --target test_dflash test_flashprefill_kernels -j$(nproc)
```
Build: 100 min on cross-arch OrbStack. Image v1.6.0 = 5.39 GB (vs 4.17 GB for v1.4.0 = +1.2 GB = the BSA kernels).
First bench — default config
Default chart config (inherited from v1.4.4): `--cache-type-k tq3_0 --cache-type-v tq3_0`, `--budget 22` (the default).
Run 1: 87.45 t/s
Run 2: 89.22 t/s
Run 3: 88.09 t/s
88.25 t/s avg. Within noise of yesterday’s 88.5. Ouch. Is PR #94 even doing anything?
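A quick check of the average and the run-to-run spread, since every comparison in this post lives inside a few tenths of a t/s:

```python
from statistics import mean, pstdev

# Three runs under the default config; the spread matters because the
# deltas discussed here are themselves a few tenths of a t/s.
runs = [87.45, 89.22, 88.09]
print(f"{mean(runs):.2f} t/s avg, ±{pstdev(runs):.2f} run-to-run")
# -> 88.25 t/s avg, ±0.73 run-to-run
```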
I check that the PR is actually active. `grep "SWA layers" source/safetensors_draft.cpp`: the function is there, parsing `config.json`. `cat /models/draft/config.json`: `layer_types` = [sliding × 4, full × 1], `sliding_window` = 2048, exactly what the PR expects.
But the diagnostic log `[draft] SWA layers: 4/5 (window=2048)` never appears in the `test_dflash` output. Either the code path runs silently, or the log goes to stderr and slips past `kubectl logs` (in theory we capture both stdout and stderr, but I’m not sure that holds for daemon subprocesses spawned by `server.py`).
Verdict: probably active but with no visible gain here.
Hypothesis 1: default config is suboptimal
PR #94 documents a specific setup for the 106 t/s:
```shell
DFLASH27B_KV_K=q4_0 DFLASH27B_KV_V=q4_0 DFLASH27B_FA_WINDOW=0 \
python scripts/run.py --budget 10 --max-ctx 4096 ...
```
Three differences vs my config:
- `q4_0` instead of `tq3_0` for the KV cache (4-bit standard vs 3-bit TurboQuant)
- `FA_WINDOW=0` (full attention) instead of 2048 (sliding window)
- `budget=10` instead of 22
I patch via kubectl patch deployment (faster than bumping the chart).
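For reference, a sketch of the strategic-merge patch body I feed to `kubectl patch deployment`. The env vars and flags come straight from the PR's documented setup; the container name `lucebox` is a hypothetical placeholder for whatever the chart actually names it.

```python
import json

# Strategic-merge patch carrying PR #94's documented env vars and flags.
# The container name "lucebox" is an assumed placeholder.
patch = {
    "spec": {"template": {"spec": {"containers": [{
        "name": "lucebox",
        "env": [
            {"name": "DFLASH27B_KV_K", "value": "q4_0"},
            {"name": "DFLASH27B_KV_V", "value": "q4_0"},
            {"name": "DFLASH27B_FA_WINDOW", "value": "0"},
        ],
        "args": ["--budget", "10", "--max-ctx", "4096"],
    }]}}}
}
print(json.dumps(patch, indent=2))
# usage: kubectl patch deployment <name> -p "$(cat patch.json)"
```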
And I run a full sweep: budgets 6, 10, 14, 18, 22, 26, 30. Seven configs × (5 min boot + 30 s bench) ≈ 40 min. I let it run.
The full sweep
| Budget | Status | t/s avg |
|---|---|---|
| 6 | crash 503 on first request | — |
| 10 | crash 503 | — |
| 14 | crash 503 | — |
| 18 | OK | 84.8 [83.5-85.9] |
| 22 | OK ← peak | 88.7 [87.5-89.4] |
| 26 | OK then drop | 76.7 [75.5-77.5] |
| 30 | 1 OK run then crash 503 | 68.3 → unstable |
Dome curve. Peak at 22.
+0.2 t/s vs my v1.4.4 baseline (88.5). In the noise floor.
Three possible explanations
1. The matched 3.6 SWA draft doesn’t shine on short prompts
My Space Invaders prompt is ~25 tokens. SWA optimizations (window=2048) on the drafter only kick in when the draft sequence exceeds the window. At 25 tokens, the drafter sees everything, like a normal full-attention layer.
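A toy calculation makes the point; it assumes nothing about Lucebox internals, just counts how many KV-cache entries causal attention reads with and without a 2048-token window:

```python
# Toy arithmetic: total KV entries read over a generation,
# full attention vs sliding-window attention (window = 2048).
def attended_kv(total_tokens, window=None):
    # At step t there are t entries in the KV cache; a sliding
    # window caps the number actually read at `window`.
    return sum(t if window is None else min(t, window)
               for t in range(1, total_tokens + 1))

# 25-token prompt: the window never binds, SWA is a no-op.
assert attended_kv(25, 2048) == attended_kv(25, None)

saving = 1 - attended_kv(8192, 2048) / attended_kv(8192, None)
print(f"at 8192 tokens, SWA skips {saving:.0%} of KV reads")
# -> at 8192 tokens, SWA skips 56% of KV reads
```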
The PR #94 RTX 4090 sweep probably used `bench_he.py` running HumanEval (longer prompts, full code completion). Closer to where SWA actually wins.
To validate, would need to retest with an 8K-16K token prompt. But our Space Invaders bench has been our standard since the beginning, and it’s representative of normal chat usage.
2. The sweet spot differs by GPU architecture
PR #94 on RTX 4090 (Ada sm_89) = sweet spot budget 10, max 106 t/s. My Olares One RTX 5090M (Blackwell sm_120) = sweet spot budget 22, max 88.7 t/s.
Radical difference. Why?
DDTree budget controls verify batch width: the bigger, the more parallelism, but the more memory required. Consumer mobile Blackwell (5090M) has fewer tensor-core resources per SM than the desktop Ada 4090, so a verify batch that’s too wide goes bandwidth-bound. A higher sweet spot on Blackwell because… no, that doesn’t follow: higher budget = bigger batch = more bandwidth needed.
More likely hypothesis: with the q4_0 KV cache, memory per token is denser. Low budgets (6, 10, 14) → too much fragmentation, and the daemon OOMs on the first request (503). Budget 22 = the sweet spot where there’s enough memory for the verify batch without crashing. Budget 26+ = verify dominates the overhead, and throughput drops.
This is consistent with budgets 6/10/14 crashing with a 503 here while the PR #94 RTX 4090 handles them: the 24GB desktop 4090 has more free VRAM left after model + drafter (probably because there is no HAMi vGPU layer on top).
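A back-of-envelope supports the density part of this hypothesis. The model dims below are placeholders (I haven't checked Qwen3.6-27B's real KV-head count); q4_0's 4.5 bits/value is llama.cpp's actual block layout (18 bytes per 32 values), while ~3.4 bits/value for tq3_0 is my guess for a 3-bit format plus scale overhead:

```python
# Rough KV-cache footprint per token under each quantization.
# Dims are ASSUMED placeholders, not Qwen3.6-27B's real architecture.
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bits_per_value):
    values = n_layers * n_kv_heads * head_dim * 2  # K and V
    return values * bits_per_value / 8

dims = dict(n_layers=48, n_kv_heads=8, head_dim=128)  # placeholder dims
q4_0 = kv_bytes_per_token(**dims, bits_per_value=4.5)
tq3_0 = kv_bytes_per_token(**dims, bits_per_value=3.4)
print(f"q4_0 ≈ {q4_0/1024:.0f} KiB/token, tq3_0 ≈ {tq3_0/1024:.0f} KiB/token "
      f"(+{q4_0/tq3_0 - 1:.0%})")
# -> q4_0 ≈ 54 KiB/token, tq3_0 ≈ 41 KiB/token (+32%)
```

Whatever the real dims, the ratio holds: switching the cache to q4_0 costs roughly a third more VRAM per cached token, which is exactly the kind of pressure that could push marginal budget configs into a 503.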
3. PR #94 is in progress, maybe not finalized
PR #94 is still OPEN (May 4). Not merged yet. The fact that my test doesn’t reproduce the 106 t/s could simply mean the branch isn’t at its final state. Sandropuppo might have additional commits locally not yet pushed.
To watch. If the PR merges with a fix that changes things on sm_120 specifically, I rebuild.
Today’s honest verdict
Plateau at ~88 t/s on Qwen3.6-27B / RTX 5090M / 24GB. The top three stacks land within 0.7 t/s of each other:
| Stack | t/s avg |
|---|---|
| vLLM Turbo (Sandermage Genesis + TurboQuant K8V4 + MTP n=3) | 88.0 |
| Lucebox v1.4.4 (tq3_0 KV + DDTree 22) | 88.5 |
| Lucebox v1.6.0 (PR #94 + q4_0 KV + DDTree 22) | 88.7 |
| buun-llama-cpp DFlash + Q8_0 GGUF drafter | 80 |
| llama.cpp standard (no spec) | 33-36 |
For a breakthrough past 90 t/s, several upstream paths to wait on:
- llama.cpp PR #22673 (am17an, MTP support): open, in review, Reddit community measures +75% on Strix Halo. On Blackwell estimated ~70 t/s vs 33-36 baseline.
- vLLM PR #41123 (TurboQuant 2-bit hybrid): open, unlocks more aggressive KV compression on Qwen3.6 hybrid.
- Qwen FlashQLA sm_120: their lib reports 2-3× on Qwen3.6 GDN kernels. sm_120 release “soon” per maintainer.
- Quantized DFlash drafter in a distribution other than spiritbuun GGUF — for instance from AutoRound INT4 if a team ports it.
None of these is actionable today. 88 t/s is our practical cap on this hardware in late Q2 2026.
Lesson from this episode
Not every upstream benchmark reproduces on every piece of hardware. PR #94 on a desktop RTX 4090 24GB: +57%. On a mobile RTX 5090M Blackwell 24GB: +0.2%. Same VRAM, same model, same quantization; the differences in GPU architecture and memory configuration are enough to invalidate the claim.
And transparency on negative results is as useful as transparency on positive ones. Someone else will read PR #94, see the announced 106 t/s, try to reproduce on their 5090. This post saves them a 2-hour build and points them where to look.
Next steps
- File a comment on PR #94 with my Olares numbers so sandropuppo sees that sm_120 doesn’t reproduce their 106 t/s
- When PR #22673 (llama.cpp MTP) lands → rebuild buun fork against that commit + bench
- Test PFlash on a long prompt (the only path where the PR #94 BSA build flag could shine)
That’s it! If anyone on a 5090M or other consumer Blackwell 24GB reproduces these numbers or finds the unlock I’m missing, I want to know.
Disclosure — All the benchmarks in this post run on my own Olares One. If the content was useful and you’re considering one, ordering through this referral link gets you $400 off ($3,599 instead of $3,999) and pays me $200. I’m mentioning this out of transparency — and yes, incidentally, it helps keep the blog alive (hosting, domain, and the time I spend writing here). Link valid until late June 2026.