Aurélien AMSELLEM

Lucebox on Olares One — Episode 1: 134 t/s on RTX 3090, what about my rig?

You're scrolling r/LocalLLaMA and you see a post claiming 134 t/s on Qwen3.6-27B with an RTX 3090, thanks to Lucebox. Of course you want to try it on your Olares One. Spoiler: it'll take 12 hours of compile time and 6 Docker builds. Episode 1.

You’re scrolling r/LocalLLaMA on a Sunday evening, looking at Qwen3.6-27B benches, and this catches your eye:

“Lucebox: 134.78 t/s @ 128K context on a single RTX 3090, with DFlash speculative decoding.”

134 t/s. On an RTX 3090. With a 128K-token context. Meanwhile, on my Olares One with its RTX 5090M, the best I get with Genesis + TurboQuant K8V4 is 38 t/s. And the 5090M has roughly 30% more memory bandwidth than the 3090. So if Lucebox's claim holds, I should be flying at 170-200 t/s on consumer Blackwell.
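That projection is just bandwidth arithmetic. As a quick sanity check (the +30% figure is this post's own estimate, and real decode is never perfectly bandwidth-bound):

```python
# Back-of-the-envelope: if decode is memory-bandwidth-bound, tokens/s
# should scale roughly linearly with VRAM bandwidth.
claimed_3090 = 134.78      # t/s claimed by the Lucebox post (RTX 3090, 128K ctx)
bandwidth_ratio = 1.30     # 5090M vs 3090: the +30% estimate from this post
current_5090m = 38.0       # t/s today with Genesis + TurboQuant K8V4

projected = claimed_3090 * bandwidth_ratio
print(f"projected on the 5090M: {projected:.0f} t/s")        # ~175 t/s
print(f"vs today's setup: {projected / current_5090m:.1f}x") # ~4.6x
```

Linear scaling lands at the bottom of the 170-200 t/s range; anything above that would need the 5090M's architecture, not just its bandwidth, to help.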

Of course I want to test it. Welcome to the saga.

What is Lucebox?

Lucebox is a llama.cpp fork ultra-tuned for consumer hardware. Research-grade and MIT-licensed. Three key ingredients:

All packaged in a custom Luce-Org/llama.cpp@luce-dflash fork with specialized CUDA kernels for tree-mode operations.

100% open source. Compiles from source. There’s a README. And then, the catch.

The catch

No public Docker. No binary release. No nightly build.

Their documented target is Ampere (RTX 3090, A100) and Ada Lovelace (RTX 4090). Nobody has tested on consumer Blackwell (sm_120). Nor on Kubernetes in general.

And I want to run it on Olares One, which is:

  - consumer Blackwell: an RTX 5090M, compute capability sm_120;
  - Kubernetes underneath, with the GPU shared through HAMi vGPU.

Three unknowns stacked on top of each other:

  1. Will the Lucebox CUDA kernels compile for sm_120?
  2. Will the binary run under HAMi vGPU?
  3. Will perf scale the way the bandwidth suggests?

None of these three has a public answer. The only way to know is to write a Dockerfile and compile.
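Unknown #1 at least has a precondition you can check before burning hours of compile time: the CUDA toolkit has to know about sm_120 at all. My assumption here (not from Lucebox's docs): consumer Blackwell support landed in CUDA 12.8, so anything older can't emit sm_120 code no matter what the kernels look like. A minimal sketch:

```shell
# Does a given CUDA toolkit version support sm_120 (consumer Blackwell)?
# Assumption: sm_120 support landed in CUDA 12.8 -- older toolkits can't target it.
supports_sm120() {
  ver="$1"                          # e.g. "12.8" as reported by `nvcc --version`
  major="${ver%%.*}"
  minor="${ver#*.}"; minor="${minor%%.*}"
  [ "$major" -gt 12 ] || { [ "$major" -eq 12 ] && [ "$minor" -ge 8 ]; }
}

for v in 12.4 12.8 13.0; do
  supports_sm120 "$v" && echo "$v: can target sm_120" || echo "$v: too old"
done
```

Which is why the base image choice in the Dockerfile matters more than usual here.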

The plan

I'm going with a multi-stage Dockerfile, classic shape: a CUDA devel stage that clones and compiles the fork, then a slim runtime stage that keeps only the binary and the CUDA runtime libraries.

Target: aamsellem/lucebox-qwen36-blackwell:1.0.0 on Docker Hub. Build amd64, sm_120. And a Helm chart to deploy on Olares in a few commands.
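A hypothetical sketch of that shape. Everything here (base image tags, the branch name taken from the fork reference above, llama.cpp-style CMake flags and binary name) is my guess at this point, not Lucebox's documented build:

```dockerfile
# --- Stage 1: build (hypothetical sketch; flags and paths are assumptions) ---
FROM nvidia/cuda:12.8.0-devel-ubuntu22.04 AS build
RUN apt-get update && apt-get install -y --no-install-recommends \
    git cmake build-essential && rm -rf /var/lib/apt/lists/*
RUN git clone --depth 1 --branch luce-dflash \
    https://github.com/Luce-Org/llama.cpp /src
WORKDIR /src
# sm_120 = consumer Blackwell; standard llama.cpp-style CMake CUDA build
RUN cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=120 \
 && cmake --build build -j"$(nproc)" --target llama-server

# --- Stage 2: runtime (only the binary + CUDA runtime libraries) ---
FROM nvidia/cuda:12.8.0-runtime-ubuntu22.04
COPY --from=build /src/build/bin/llama-server /usr/local/bin/llama-server
EXPOSE 8080
ENTRYPOINT ["llama-server"]
```

The devel image is what makes the build stage huge and slow; the runtime image is what keeps the pushed layer small.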

On paper, 30-40 minutes of compile time and a push. In practice, you’ll see.

Why I’m telling you this

Because nobody has done it on this platform. And the three fixes I'm about to discover, episode by episode, aren't written down anywhere. By the end, I'll hit a bug inside HAMi vGPU itself that crashes not just Lucebox but any recent llama.cpp or vLLM running async scheduling. I'll fix that bug upstream in a PR. But we're not there yet.

For now, I run my first docker buildx build.

In episode 2: 2h13 of CUDA compile time, and an error message that is anything but friendly. Enjoy, and see you next time!


Disclosure — All the benchmarks in this post run on my own Olares One. If the content was useful and you’re considering one, ordering through this referral link gets you $400 off ($3,599 instead of $3,999) and pays me $200. I’m mentioning this out of transparency — and yes, incidentally, it helps keep the blog alive (hosting, domain, and the time I spend writing here). Link valid until late June 2026.
