You’re scrolling r/LocalLLaMA on a Sunday evening, looking at Qwen3.6-27B benches, and this catches your eye:
“Lucebox: 134.78 t/s @ 128K context on a single RTX 3090, with DFlash speculative decoding.”
134 t/s. On an RTX 3090. With a 128K-token context. Meanwhile, on my Olares One with the RTX 5090M, the best I get with Genesis + TurboQuant K8V4 is 38 t/s. Memory bandwidth is about 30% higher on the 5090M than on the 3090, so if Lucebox's claim holds and decode stays bandwidth-bound, I should be flying at around 175 t/s on consumer Blackwell.
Of course I want to test it. Welcome to the saga.
What is Lucebox?
Lucebox is a llama.cpp fork ultra-tuned for consumer hardware. Research-grade and MIT-licensed. Three key ingredients:
- DFlash — speculative decoding via block diffusion model. ICLR 2026 paper, code at z-lab. The drafter generates 15 tokens in a single forward pass instead of drafting token by token like EAGLE/MTP.
- DDTree — Diffusion Draft Tree. An extension that turns the drafter’s distributions into a candidate tree. Lifts DFlash’s acceptance rate.
- TQ3_0 KV cache — TurboQuant 3-bit with Hadamard rotation. 3.5 bits per value, roughly 4.6× compression vs fp16, near-neutral quality.
All packaged in a custom Luce-Org/llama.cpp@luce-dflash fork with specialized CUDA kernels for tree-mode operations.
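To ground that last bullet: stock llama-server already lets you pick quantized KV-cache types and attach a draft model for speculative decoding, and my working assumption is that the fork keeps those flags and just adds its own cache type and drafter. Here's a minimal sketch of how I expect to launch it, where the tq3_0 value, the drafter GGUF, and the model filename are my guesses rather than documented options:

```bash
# Stock llama-server flags: -m (model), -c (context), -ngl (GPU layers),
# --cache-type-k/--cache-type-v (quantized KV cache), --model-draft (speculative decoding).
# The "tq3_0" cache type and the DFlash drafter GGUF are assumptions about the fork.
llama-server \
  -m qwen3.6-27b-q4_k_m.gguf \
  --model-draft dflash-drafter.gguf \
  -c 131072 -ngl 99 \
  --cache-type-k tq3_0 --cache-type-v tq3_0
```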
100% open source. Compiles from source. There’s a README. And then, the catch.
The catch
No public Docker. No binary release. No nightly build.
Their documented target is Ampere (RTX 3090, A100) and Ada Lovelace (RTX 4090). Nobody has tested on consumer Blackwell (sm_120). Nor on Kubernetes in general.
And I want to run it on Olares One, which is:
- RTX 5090 Laptop GPU (24 GB GDDR7, sm_120 consumer Blackwell)
- Under Kubernetes
- With HAMi vGPU as the GPU isolation layer
Three unknowns stacked on top of each other:
- Will the Lucebox CUDA kernels compile for sm_120?
- Will the binary run under HAMi vGPU?
- Will perf scale the way the bandwidth suggests?
None of these three has a public answer. The only way to know is to write a Dockerfile and compile.
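Well, almost the only way. One thing I can check before burning hours on a build is whether the CUDA 13.0 toolchain in the builder image even knows about sm_120. This says nothing about Lucebox's kernels, only about nvcc itself:

```bash
# List the GPU code targets nvcc can emit; if sm_120 isn't in there,
# the Lucebox build is dead on arrival regardless of the fork's kernels.
docker run --rm nvidia/cuda:13.0.0-devel-ubuntu22.04 \
  nvcc --list-gpu-code | grep sm_120
```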
The plan
I’m going with a multi-stage Dockerfile, classic shape:
- Stage 1 (builder): nvidia/cuda:13.0.0-devel-ubuntu22.04, clone the Lucebox repo with submodules, cmake config + compile.
- Stage 2 (runtime): nvidia/cuda:13.0.0-runtime-ubuntu22.04, copy the binaries from the builder, entrypoint that downloads models from HuggingFace + starts llama-server.
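Roughly, the skeleton I have in mind looks like this. The repo URL and branch are my reading of the Luce-Org/llama.cpp@luce-dflash name, the cmake options are stock llama.cpp's (I'm assuming the fork keeps GGML_CUDA and CMAKE_CUDA_ARCHITECTURES), and entrypoint.sh is a hypothetical wrapper that pulls the GGUF and starts llama-server:

```dockerfile
# Stage 1 (builder): compile Lucebox from source for consumer Blackwell (sm_120).
FROM nvidia/cuda:13.0.0-devel-ubuntu22.04 AS builder
RUN apt-get update && apt-get install -y --no-install-recommends \
        git cmake build-essential ca-certificates \
    && rm -rf /var/lib/apt/lists/*
# Repo URL and branch inferred from "Luce-Org/llama.cpp@luce-dflash".
RUN git clone --recursive --branch luce-dflash \
        https://github.com/Luce-Org/llama.cpp /src
WORKDIR /src
# Stock llama.cpp cmake options; assuming the fork keeps them.
RUN cmake -B build -DCMAKE_BUILD_TYPE=Release \
        -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=120 \
    && cmake --build build --config Release -j"$(nproc)"

# Stage 2 (runtime): slim image with just the binaries and an entrypoint.
FROM nvidia/cuda:13.0.0-runtime-ubuntu22.04
COPY --from=builder /src/build/bin/ /usr/local/bin/
# entrypoint.sh (hypothetical): download the GGUF from HuggingFace, start llama-server.
COPY entrypoint.sh /entrypoint.sh
ENTRYPOINT ["/entrypoint.sh"]
```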
Target: aamsellem/lucebox-qwen36-blackwell:1.0.0 on Docker Hub. Build amd64, sm_120. And a Helm chart to deploy on Olares in a few commands.
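The deploy side is meant to stay boring. The chart doesn't exist yet, so the path and values below are placeholders for what I expect those "few commands" to look like:

```bash
# Hypothetical chart layout and values; only the image name and tag come from above.
helm install lucebox ./charts/lucebox \
  --namespace lucebox --create-namespace \
  --set image.repository=aamsellem/lucebox-qwen36-blackwell \
  --set image.tag=1.0.0
```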
On paper, 30-40 minutes of compile time and a push. In practice, you’ll see.
Why I’m telling you this
Because nobody has done it on this platform, and the three fixes I'm about to discover episode by episode aren't written down anywhere. By the end, I'll hit a bug inside HAMi vGPU itself that crashes not just Lucebox but also any recent llama.cpp or vLLM running with async scheduling, a bug I'll end up fixing upstream in a PR. But we're not there yet.
For now, I run my first docker buildx build.
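Concretely, that first attempt is a single buildx invocation, pushing straight to Docker Hub:

```bash
# Build for amd64 and push the tagged image in one go.
docker buildx build \
  --platform linux/amd64 \
  -t aamsellem/lucebox-qwen36-blackwell:1.0.0 \
  --push .
```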
In episode 2: 2 hours and 13 minutes of CUDA compile, and an error message that doesn't say anything friendly. Enjoy, and see you next time!
Disclosure — All the benchmarks in this post run on my own Olares One. If the content was useful and you’re considering one, ordering through this referral link gets you $400 off ($3,599 instead of $3,999) and pays me $200. I’m mentioning this out of transparency — and yes, incidentally, it helps keep the blog alive (hosting, domain, and the time I spend writing here). Link valid until late June 2026.