Skip to content
/ airelien.dev
Go back
Aurélien AMSELLEM

Lucebox on Olares One — Episode 8: seven days of waiting, one lib swapped by hand, 88.5 t/s

Seven days after my PR #188 to HAMi-core, still no review. The saga had its cliffhanger — I was waiting on someone else. Then a stupid idea: compile my patched lib and swap it myself. Three new bugs, one night, and at the end Lucebox hits 88.5 t/s. First llama.cpp-based path to pass vLLM Turbo on this hardware.

Hi there.

It’s been seven days.

Seven days since I filed issue #187 and PR #188 with Project-HAMi/HAMi-core — a clean fix for the 6 cuCtxGetDevice hooks that crash silently (full diagnosis in episode 7). The PR is OPEN, no review, no maintainer comment. Seven days of upstream silence. And meanwhile, my Olares One with its RTX 5090M still runs the original HAMi lib that crashes with Illegal device id: -644371744 the moment a GPU pod boots.

The saga had stopped there in theory. I was waiting. And then this morning, sipping my coffee, a stupid-simple idea: why am I waiting for an upstream merge when I can compile my patched lib and swap it directly on the Olares node? The PR will benefit the whole HAMi community the day it merges. But in the meantime, I unblock my own Olares.

Tonight’s spoiler: it worked. And at the end of the chain, Lucebox hits 88.5 t/s on Qwen3.6-27B, 0.5 t/s ahead of my vLLM Turbo (88). It’s the first time a llama.cpp-based path passes vLLM Turbo on this hardware. Here’s the story.

Quick recap for newcomers

Lucebox is a llama.cpp fork tuned by sandropuppo that runs DFlash speculative decoding with custom CUDA kernels. HAMi (Heterogeneous AI computing Virtualization Middleware) is the GPU isolation layer running under the Kubernetes pods on Olares. And the bug I filed in episode 7 is six hooks in HAMi-core that read an uninitialized dev variable → random crash whenever a pod uses the CUDA Driver API path (Lucebox, recent llama.cpp, async vLLM).

The morning

I rebuild HAMi-core from my fix/cumemcreate-uninit-dev branch (the PR #188 commits). Output: libvgpu.so, 676 KB, ELF amd64. The original Olares version is 863 KB (different build, more options). Doesn’t matter, the ABI is compatible.

On the Olares node, the master libvgpu lives at /usr/local/vgpu/libvgpu.so. I back it up, then swap:

ssh olares@olares-one "sudo cp /usr/local/vgpu/libvgpu.so /usr/local/vgpu/libvgpu.so.backup-original
                       sudo cp /tmp/libvgpu.so.patched /usr/local/vgpu/libvgpu.so"

sha256 check: 85b2ebe47bfe.... That’s my patched lib.

A small caveat to keep in mind: the swap is node-wide. Every GPU pod that boots after will LD_PRELOAD my version. Pods already running keep the old one in memory (LD_PRELOAD reads the file at process load). So zero impact on running workloads, but at every HAMi DaemonSet update the master file might get overwritten. Moderate risk, reversible (backup is in place).

The moment Lucebox boots… and three new bugs

I redeploy lucedflashqwen36one v1.3.0 (the standard chart from the repo). The “Illegal device id” bug from episode 5 is gone. HAMi is unblocked.

Except. The pod still crashes. CrashLoop. With a new error message:

Layer 1: gguf_init_from_file_ptr: invalid magic characters: 'p???', expected 'GGUF'

I was using the spiritbuun/Qwen3.6-27B-DFlash-GGUF drafter in the chart (carry-over from my parallel buun-llama-cpp work). But Lucebox has its own custom loader for z-lab BF16 safetensors — it doesn’t read the spiritbuun GGUF. Format mismatch.

Fix: revert to the z-lab BF16 drafter (huggingface-cli download z-lab/Qwen3.6-27B-DFlash --local-dir models/draft/).

Layer 2: Total VRAM: 0 MiB

Pod boots, loads the target on GPU… immediate OOM. HAMi logs:

[HAMI-core Warn]: invalid device memory limit CUDA_DEVICE_MEMORY_LIMIT_0=0m
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 0 MiB):

Olares sets CUDA_DEVICE_MEMORY_LIMIT_0=0m by default (intent: “no limit”). But HAMi-core parses “0m” as “invalid limit” and available memory drops to 0 bytes. That’s another original HAMi bug, separate from my PR #188.

Fix in the entrypoint before launching server.py:

export CUDA_DEVICE_MEMORY_LIMIT_0=24000m

Layer 3: prefix_cache.startup_sync() 10s timeout

Pod now boots correctly, target loads (851 tensors, 14.99 GiB on GPU), drafter loads (3.5 GiB). Then:

File ".../scripts/prefix_cache.py", line 670, in startup_sync
    reply = await self._await_reply("[snap] slots=")
asyncio.exceptions.TimeoutError

server.py spawns the test_dflash daemon then waits for it to print [snap] slots= on stdout. Hardcoded 10-second timeout. But the daemon, on first boot, has to JIT-compile its CUDA kernels for sm_120 (compute capability 12.0 = consumer Blackwell). Takes 30-60 seconds. Timeout fires before the daemon is ready.

Workaround: disable both prefix-cache and prefill-cache (which both call startup_sync):

python3 scripts/server.py --prefix-cache-slots 0 --prefill-cache-slots 0 ...

Trade-off: no KV cache reuse across requests. Fine for solo benching; for production agent use, would need to fix the timeout in source (PR to file later).

The afternoon: OOM round 2

Boot OK! server.py logs:

Luce DFlash OpenAI server on http://0.0.0.0:8000
  target    = /models/Qwen3.6-27B-Q4_K_M.gguf
  draft     = /models/draft/model.safetensors
  budget    = 22
  [daemon] [target] target loaded: 851 tensors on GPU 14.99 GiB
  [daemon] [draft]  loaded
  [daemon] [daemon] ready

First curl on /v1/chat/completions:

500 Internal Server Error
{"detail":"dflash daemon has exited unexpectedly"}

Ouch. The daemon died on the first inference. Logs:

[HAMI-core ERROR allocator.c:56]: Device 0 OOM 24232551680 / 24117248000

24,232 MB requested vs 24,117 MB authorized (my 23000m apparently, miscounted). Bump to 24,000m (= 25,165,824,000 bytes, i.e. 24.03 GiB ~150 MB under the hardware 24,463 MiB). Tight margin but should work.

Patch via kubectl patch deployment at runtime (instead of re-bumping the chart):

kubectl patch deployment lucedflashqwen36one ... '/spec/.../args/0' with sed 23000m 24000m

Pod recreates. 4 min 24 s later, 2/2 Running, 0 restarts.

The moment

Bench Space Invaders × 3 (max_tokens=800, temp=0.6) via python3 urllib.request from inside the container (curl isn’t in the Lucebox image, but Python is):

Run 1: 800 tok in 9.19 s = 87.05 t/s
Run 2: 800 tok in 8.93 s = 89.54 t/s
Run 3: 800 tok in 9.00 s = 88.88 t/s

AVG 88.5 t/s [87-89.5].

I read the numbers twice. It’s the first time I see Lucebox deliver on consumer mobile Blackwell since we started the saga seven days ago. And it sits slightly ahead of my vLLM Turbo (Genesis 28 patches + TurboQuant K8V4 + MTP n=3) which runs at 88 t/s.

Final ranking on Olares One (May 4, 2026)

BackendStackt/s avg
llama.cpp standardUD-Q4_K_XL, no spec33-36
buun-llama-cpp DFlashHEAD + Q8_0 GGUF drafter80
vLLM TurboGenesis + TurboQuant K8V4 + MTP n=388.0
Lucebox DFlash HTTPscripts/server.py + test_dflash + TQ3_0 KV88.5 🏆

Lucebox is the new champion on Olares One. Not a blowout — 0.5 t/s is within noise. But it’s the first time a llama.cpp-based path on consumer mobile Blackwell beats the vLLM custom path. And that’s with the mismatched 3.5 drafter, which has middling acceptance. Lucebox PR #94 from May 4 reports 106 t/s on RTX 4090 (Ada sm_89) with the matched 3.6 draft + SWA support. If that extrapolates linearly to Blackwell sm_120, we should hit 100+ t/s estimated as soon as we rebuild against PR #94 once merged.

(Spoiler for episode 9: it doesn’t extrapolate linearly. But that’s another story.)

Lesson from this episode

The saga had a natural cliffhanger at episode 7: “PR filed, let’s wait”. With a bit of aggression — compile your own libvgpu.so instead of waiting on upstream — you turn “let’s wait” into “it works”. None of the three workarounds is particularly clean:

  1. Node-wide hot-swap: risk that the HAMi DaemonSet rewrites my master file at the next Olares update
  2. Override CUDA_DEVICE_MEMORY_LIMIT_0=24000m: workaround for an original HAMi bug that deserves its own PR
  3. Disable prefix-cache: lose KV cache reuse. To fix in server.py (bump the timeout)

But 88.5 t/s on consumer mobile 24GB via the Lucebox HTTP path — that’s the first public demo. Enough to validate the path is viable. Polish comes after.

Next steps

Episode 9 — the 106 t/s attempt that crashed at 88.7 t/s, and why. See you next time!


Disclosure — All the benchmarks in this post run on my own Olares One. If the content was useful and you’re considering one, ordering through this referral link gets you $400 off ($3,599 instead of $3,999) and pays me $200. I’m mentioning this out of transparency — and yes, incidentally, it helps keep the blog alive (hosting, domain, and the time I spend writing here). Link valid until late June 2026.

Share this post on:

Comments