Hi there.
It’s been seven days.
Seven days since I filed issue #187 and PR #188 with Project-HAMi/HAMi-core — a clean fix for the 6 cuCtxGetDevice hooks that crash silently (full diagnosis in episode 7). The PR is OPEN, no review, no maintainer comment. Seven days of upstream silence. And meanwhile, my Olares One with its RTX 5090M still runs the original HAMi lib that crashes with Illegal device id: -644371744 the moment a GPU pod boots.
The saga had stopped there in theory. I was waiting. And then this morning, sipping my coffee, a stupid-simple idea: why am I waiting for an upstream merge when I can compile my patched lib and swap it directly on the Olares node? The PR will benefit the whole HAMi community the day it merges. But in the meantime, I unblock my own Olares.
Tonight’s spoiler: it worked. And at the end of the chain, Lucebox hits 88.5 t/s on Qwen3.6-27B, 0.5 t/s ahead of my vLLM Turbo (88). It’s the first time a llama.cpp-based path passes vLLM Turbo on this hardware. Here’s the story.
Quick recap for newcomers
Lucebox is a llama.cpp fork tuned by sandropuppo that runs DFlash speculative decoding with custom CUDA kernels. HAMi (Heterogeneous AI computing Virtualization Middleware) is the GPU isolation layer running under the Kubernetes pods on Olares. And the bug I filed in episode 7 is six hooks in HAMi-core that read an uninitialized dev variable → random crash whenever a pod uses the CUDA Driver API path (Lucebox, recent llama.cpp, async vLLM).
The morning
I rebuild HAMi-core from my fix/cumemcreate-uninit-dev branch (the PR #188 commits). Output: libvgpu.so, 676 KB, ELF amd64. The original Olares version is 863 KB (different build, more options). Doesn’t matter, the ABI is compatible.
On the Olares node, the master libvgpu lives at /usr/local/vgpu/libvgpu.so. I back it up, then swap:
ssh olares@olares-one "sudo cp /usr/local/vgpu/libvgpu.so /usr/local/vgpu/libvgpu.so.backup-original
sudo cp /tmp/libvgpu.so.patched /usr/local/vgpu/libvgpu.so"
sha256 check: 85b2ebe47bfe.... That’s my patched lib.
A small caveat to keep in mind: the swap is node-wide. Every GPU pod that boots after will LD_PRELOAD my version. Pods already running keep the old one in memory (LD_PRELOAD reads the file at process load). So zero impact on running workloads, but at every HAMi DaemonSet update the master file might get overwritten. Moderate risk, reversible (backup is in place).
The moment Lucebox boots… and three new bugs
I redeploy lucedflashqwen36one v1.3.0 (the standard chart from the repo). The “Illegal device id” bug from episode 5 is gone. HAMi is unblocked.
Except. The pod still crashes. CrashLoop. With a new error message:
Layer 1: gguf_init_from_file_ptr: invalid magic characters: 'p???', expected 'GGUF'
I was using the spiritbuun/Qwen3.6-27B-DFlash-GGUF drafter in the chart (carry-over from my parallel buun-llama-cpp work). But Lucebox has its own custom loader for z-lab BF16 safetensors — it doesn’t read the spiritbuun GGUF. Format mismatch.
Fix: revert to the z-lab BF16 drafter (huggingface-cli download z-lab/Qwen3.6-27B-DFlash --local-dir models/draft/).
Layer 2: Total VRAM: 0 MiB
Pod boots, loads the target on GPU… immediate OOM. HAMi logs:
[HAMI-core Warn]: invalid device memory limit CUDA_DEVICE_MEMORY_LIMIT_0=0m
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 0 MiB):
Olares sets CUDA_DEVICE_MEMORY_LIMIT_0=0m by default (intent: “no limit”). But HAMi-core parses “0m” as “invalid limit” and available memory drops to 0 bytes. That’s another original HAMi bug, separate from my PR #188.
Fix in the entrypoint before launching server.py:
export CUDA_DEVICE_MEMORY_LIMIT_0=24000m
Layer 3: prefix_cache.startup_sync() 10s timeout
Pod now boots correctly, target loads (851 tensors, 14.99 GiB on GPU), drafter loads (3.5 GiB). Then:
File ".../scripts/prefix_cache.py", line 670, in startup_sync
reply = await self._await_reply("[snap] slots=")
asyncio.exceptions.TimeoutError
server.py spawns the test_dflash daemon then waits for it to print [snap] slots= on stdout. Hardcoded 10-second timeout. But the daemon, on first boot, has to JIT-compile its CUDA kernels for sm_120 (compute capability 12.0 = consumer Blackwell). Takes 30-60 seconds. Timeout fires before the daemon is ready.
Workaround: disable both prefix-cache and prefill-cache (which both call startup_sync):
python3 scripts/server.py --prefix-cache-slots 0 --prefill-cache-slots 0 ...
Trade-off: no KV cache reuse across requests. Fine for solo benching; for production agent use, would need to fix the timeout in source (PR to file later).
The afternoon: OOM round 2
Boot OK! server.py logs:
Luce DFlash OpenAI server on http://0.0.0.0:8000
target = /models/Qwen3.6-27B-Q4_K_M.gguf
draft = /models/draft/model.safetensors
budget = 22
[daemon] [target] target loaded: 851 tensors on GPU 14.99 GiB
[daemon] [draft] loaded
[daemon] [daemon] ready
First curl on /v1/chat/completions:
500 Internal Server Error
{"detail":"dflash daemon has exited unexpectedly"}
Ouch. The daemon died on the first inference. Logs:
[HAMI-core ERROR allocator.c:56]: Device 0 OOM 24232551680 / 24117248000
24,232 MB requested vs 24,117 MB authorized (my 23000m apparently, miscounted). Bump to 24,000m (= 25,165,824,000 bytes, i.e. 24.03 GiB ~150 MB under the hardware 24,463 MiB). Tight margin but should work.
Patch via kubectl patch deployment at runtime (instead of re-bumping the chart):
kubectl patch deployment lucedflashqwen36one ... '/spec/.../args/0' with sed 23000m → 24000m
Pod recreates. 4 min 24 s later, 2/2 Running, 0 restarts.
The moment
Bench Space Invaders × 3 (max_tokens=800, temp=0.6) via python3 urllib.request from inside the container (curl isn’t in the Lucebox image, but Python is):
Run 1: 800 tok in 9.19 s = 87.05 t/s
Run 2: 800 tok in 8.93 s = 89.54 t/s
Run 3: 800 tok in 9.00 s = 88.88 t/s
AVG 88.5 t/s [87-89.5].
I read the numbers twice. It’s the first time I see Lucebox deliver on consumer mobile Blackwell since we started the saga seven days ago. And it sits slightly ahead of my vLLM Turbo (Genesis 28 patches + TurboQuant K8V4 + MTP n=3) which runs at 88 t/s.
Final ranking on Olares One (May 4, 2026)
| Backend | Stack | t/s avg |
|---|---|---|
| llama.cpp standard | UD-Q4_K_XL, no spec | 33-36 |
| buun-llama-cpp DFlash | HEAD + Q8_0 GGUF drafter | 80 |
| vLLM Turbo | Genesis + TurboQuant K8V4 + MTP n=3 | 88.0 |
| Lucebox DFlash HTTP | scripts/server.py + test_dflash + TQ3_0 KV | 88.5 🏆 |
Lucebox is the new champion on Olares One. Not a blowout — 0.5 t/s is within noise. But it’s the first time a llama.cpp-based path on consumer mobile Blackwell beats the vLLM custom path. And that’s with the mismatched 3.5 drafter, which has middling acceptance. Lucebox PR #94 from May 4 reports 106 t/s on RTX 4090 (Ada sm_89) with the matched 3.6 draft + SWA support. If that extrapolates linearly to Blackwell sm_120, we should hit 100+ t/s estimated as soon as we rebuild against PR #94 once merged.
(Spoiler for episode 9: it doesn’t extrapolate linearly. But that’s another story.)
Lesson from this episode
The saga had a natural cliffhanger at episode 7: “PR filed, let’s wait”. With a bit of aggression — compile your own libvgpu.so instead of waiting on upstream — you turn “let’s wait” into “it works”. None of the three workarounds is particularly clean:
- Node-wide hot-swap: risk that the HAMi DaemonSet rewrites my master file at the next Olares update
- Override
CUDA_DEVICE_MEMORY_LIMIT_0=24000m: workaround for an original HAMi bug that deserves its own PR - Disable prefix-cache: lose KV cache reuse. To fix in
server.py(bump the timeout)
But 88.5 t/s on consumer mobile 24GB via the Lucebox HTTP path — that’s the first public demo. Enough to validate the path is viable. Polish comes after.
Next steps
- Rebuild the image once PR #94 (matched 3.6 draft + SWA) is merged → bench → expected ~100 t/s
- File a PR to Lucebox to bump the
startup_synctimeout (the 10s hardcode is too short on sm_120) - File a PR to HAMi-core to fix the “0m = no limit” semantics (separate from my PR #188)
- Test PFlash (10× prefill speedup) in combo with DFlash decode
Episode 9 — the 106 t/s attempt that crashed at 88.7 t/s, and why. See you next time!
Disclosure — All the benchmarks in this post run on my own Olares One. If the content was useful and you’re considering one, ordering through this referral link gets you $400 off ($3,599 instead of $3,999) and pays me $200. I’m mentioning this out of transparency — and yes, incidentally, it helps keep the blog alive (hosting, domain, and the time I spend writing here). Link valid until late June 2026.