Episode 7 — I’d filed issue #187 and PR #188 on Project-HAMi/HAMi-core with the fix for the 6 cuCtxGetDevice hooks. And then… nothing. The PR sat OPEN, no review. The saga was paused, waiting on a HAMi maintainer to take a look.
Seven days later, still no review. In the meantime I'd had a stupid-simple idea: why wait for upstream to merge when I can build my own patched libvgpu.so and hot-swap it directly on the Olares node? The PR will still benefit the whole HAMi community when it lands; until then, it unblocks my own Olares.
Spoiler: it worked. And at the end of the chain, Lucebox runs at 88.5 t/s on Qwen3.6-27B, which beats my vLLM Turbo (88) by 0.5 t/s. Here’s how.
Phase 1: hot-swap libvgpu.so
I rebuild HAMi-core from my fix/cumemcreate-uninit-dev branch (the PR #188 commits). Output: libvgpu.so 676 KB, ELF amd64. The original Olares version is 863 KB (different build, more options). Doesn’t matter, the ABI is compatible.
On the Olares node, the master libvgpu lives at /usr/local/vgpu/libvgpu.so. Backup, swap:
ssh olares@olares-one "sudo cp /usr/local/vgpu/libvgpu.so /usr/local/vgpu/libvgpu.so.backup-original
sudo cp /tmp/libvgpu.so.patched /usr/local/vgpu/libvgpu.so"
sha256 check: 85b2ebe47bfe... (our patched build). Bingo.
Caveat: this is node-wide. Every GPU pod that starts after the swap will LD_PRELOAD the patched version. Pods already running keep the old one in memory (LD_PRELOAD reads the file at process load). So zero impact on running workloads, but at every Olares HAMi DaemonSet update the master file might get overwritten. Moderate risk, reversible (backup is in place).
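A quick way to check which build is live on the node (for instance after a future Olares update that might silently restore the original) is to hash the file and its backup. A minimal sketch in Python 3, using the paths above:

```python
import hashlib
from pathlib import Path

def sha256(path: str) -> str:
    """Hex SHA-256 of a file, read in one go (libvgpu.so is well under 1 MB)."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

# Compare the live library against the backup taken before the swap.
print("live   :", sha256("/usr/local/vgpu/libvgpu.so"))
print("backup :", sha256("/usr/local/vgpu/libvgpu.so.backup-original"))
# Identical hashes would mean the patched build is no longer in place.
```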
Phase 2: Lucebox boots… and three layers of bugs
I redeploy lucedflashqwen36one v1.3.0 (the standard chart from the repo). The “Illegal device id” bug from episode 5 is gone. HAMi is unblocked.
Except. The pod still crashes. CrashLoop. With a new error message:
Layer 1: gguf_init_from_file_ptr: invalid magic characters: 'p???', expected 'GGUF'
I was using the spiritbuun/Qwen3.6-27B-DFlash-GGUF drafter in the chart (carry-over from my parallel buun-llama-cpp work). But Lucebox has its own custom loader for z-lab BF16 safetensors — it doesn’t read spiritbuun’s GGUF. Format mismatch.
Fix: revert to the z-lab BF16 drafter (huggingface-cli download z-lab/Qwen3.6-27B-DFlash --local-dir models/draft/) which gives model.safetensors + config.json.
Layer 2: Total VRAM: 0 MiB then cuMemoryAllocate failed res=2
Pod boots, loads the target on GPU… immediate OOM. HAMi logs:
[HAMI-core Warn]: invalid device memory limit CUDA_DEVICE_MEMORY_LIMIT_0=0m
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 0 MiB):
Olares sets CUDA_DEVICE_MEMORY_LIMIT_0=0m by default (which should mean “no limit”). But HAMi-core parses “0m” as “invalid limit” and sets available memory to 0 bytes. On the Runtime API path (cudaMalloc) this check is bypassed. On the Driver API path that test_dflash uses, it’s strict. Result: 0 MiB visible.
Diagnosis from the HAMi-core source, function get_limit_from_env:
if (scaled_res == 0) {
    if (env_name[12] == 'M') {
        LOG_WARN("invalid device memory limit %s=%s", env_name, env_limit);
    }
    return 0;
}
“0m” → res=0 → scaled_res=0 → “invalid” warning → return 0. The caller treats that 0 as “no memory available”. A pre-existing HAMi bug, not introduced by my PR #188.
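To make the failure mode concrete, here is a Python paraphrase of that logic as I read it (not the actual HAMi-core code): the m suffix is taken as MiB, and a scaled result of 0 is logged as invalid and returned as-is, which the caller then reads as zero bytes of available VRAM.

```python
MiB = 1024 * 1024

def parse_memory_limit(value: str) -> int:
    # Paraphrase of the observed get_limit_from_env behavior, not the real implementation.
    scaled = int(value.rstrip("mM")) * MiB
    if scaled == 0:
        # The bug: "0m" was meant as "no limit", but it gets logged as invalid and
        # returned as 0, which the Driver API path treats as "0 bytes available".
        print(f"invalid device memory limit {value}")
        return 0
    return scaled

print(parse_memory_limit("0m"))      # 0           -> "Total VRAM: 0 MiB"
print(parse_memory_limit("24000m"))  # 25165824000 -> the override used below
```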
Fix: override in the entrypoint before launching server.py:
export CUDA_DEVICE_MEMORY_LIMIT_0=24000m # 24000 MiB, just under the hardware's 24463 MiB
Layer 3: prefix_cache.startup_sync() 10s timeout
Pod now boots correctly, target loads (851 tensors, 14.99 GiB on GPU), drafter loads (3.5 GiB). Then:
File "/opt/lucebox/src/dflash/scripts/prefix_cache.py", line 670, in startup_sync
reply = await self._await_reply("[snap] slots=")
asyncio.exceptions.TimeoutError
server.py spawns the test_dflash daemon then waits for it to print [snap] slots= on stdout. Timeout hardcoded at 10 seconds. But the daemon, on first boot, has to JIT-compile its CUDA kernels for sm_120 (compute capability 12.0 = consumer Blackwell). Takes 30-60 seconds. Timeout fires before the daemon is ready.
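To picture the handshake, here is the pattern as a minimal asyncio sketch, not the actual prefix_cache.py code: spawn the daemon, read its stdout until the [snap] slots= marker shows up, and give up after a hardcoded 10 seconds. With 30-60 seconds of JIT compilation on first boot, the wait_for is guaranteed to fire.

```python
import asyncio

STARTUP_TIMEOUT = 10  # seconds; the hardcoded value that is too short on sm_120 first boot

async def wait_for_marker(proc: asyncio.subprocess.Process, marker: bytes) -> bytes:
    # Read the daemon's stdout line by line until the marker appears.
    while True:
        line = await proc.stdout.readline()
        if not line:
            raise RuntimeError("daemon exited before printing the marker")
        if marker in line:
            return line

async def startup_sync() -> None:
    # Spawn the daemon and give it STARTUP_TIMEOUT seconds to announce its slots.
    proc = await asyncio.create_subprocess_exec(
        "/opt/lucebox/dflash-build/test_dflash",
        stdout=asyncio.subprocess.PIPE,
    )
    # On first boot the CUDA JIT for sm_120 takes 30-60 s, so this raises TimeoutError.
    await asyncio.wait_for(wait_for_marker(proc, b"[snap] slots="), STARTUP_TIMEOUT)

if __name__ == "__main__":
    asyncio.run(startup_sync())
```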
Fix: disable both prefix-cache and prefill-cache (which both call startup_sync):
python3 scripts/server.py \
--prefix-cache-slots 0 \
--prefill-cache-slots 0 \
...
Trade-off: no KV cache reuse across requests. Fine for solo benching; for production agent use, would need to fix the timeout in source (PR to file later).
Phase 3: OOM round 2
Boot OK! server.py logs:
Luce DFlash OpenAI server on http://0.0.0.0:8000
target = /models/Qwen3.6-27B-Q4_K_M.gguf
draft = /models/draft/model.safetensors
bin = /opt/lucebox/dflash-build/test_dflash
budget = 22
[daemon] [target] target loaded: 851 tensors on GPU 14.99 GiB
[daemon] [draft] loaded
[daemon] [daemon] ready
First curl on /v1/chat/completions:
500 Internal Server Error
{"detail":"dflash daemon has exited unexpectedly"}
Ouch. The daemon died on the first inference. Logs:
[HAMI-core ERROR allocator.c:56]: Device 0 OOM 24232551680 / 24117248000
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 2428.00 MiB on device 0: cudaMalloc failed: out of memory
24232 MB requested vs 24117 MB authorized. Doing the math, 24117248000 bytes is exactly 23000 * 1024 * 1024: the effective limit was still 23000m, not the 24000m I thought I'd set. I'd forgotten to bump the value actually passed to the pod.
Bump to 24000m (24000 * 1024 * 1024 = 25165824000 bytes ≈ 23.4 GiB, about 460 MiB under the hardware's 24463 MiB). Tight margin but should work.
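Since I managed to confuse myself with these numbers once already, a quick sanity check of the byte math (this assumes the m suffix means MiB, which is what the 23000m → 24117248000-byte observation implies):

```python
# Numbers from the HAMi log and the hardware above.
MiB = 1024 * 1024
requested = 24232551680        # bytes the daemon tried to allocate (from the OOM log)
limit_old = 23000 * MiB        # 24117248000 bytes -> the "24117 MB authorized" in the log
limit_new = 24000 * MiB        # 25165824000 bytes
hardware  = 24463 * MiB        # total VRAM of the card in bytes

print(requested > limit_old)            # True  -> OOM under the old 23000m limit
print(requested <= limit_new)           # True  -> fits under 24000m
print((hardware - limit_new) // MiB)    # 463   -> MiB of headroom left below hardware
```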
Phase 4: the bench
Patch the deployment at runtime via kubectl patch (instead of re-bumping the chart):
kubectl patch deployment lucedflashqwen36one ... '/spec/.../args/0' with sed 23000m → 24000m
Pod recreates. 4m24s later, 2/2 Running, 0 restarts.
Bench Space Invaders × 3 (max_tokens=800, temp=0.6) via python3 urllib.request from inside the container (curl isn’t in the Lucebox image, but Python is):
Run 1: 800 tok in 9.19s = 87.05 t/s
Run 2: 800 tok in 8.93s = 89.54 t/s
Run 3: 800 tok in 9.00s = 88.88 t/s
AVG 88.5 t/s [87-89.5].
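For reference, the bench is nothing fancier than a timed POST against the OpenAI-compatible endpoint from inside the container. A minimal single-run sketch along those lines (the model name and prompt are placeholders, and reading usage.completion_tokens assumes the standard OpenAI response schema):

```python
import json
import time
import urllib.request

URL = "http://127.0.0.1:8000/v1/chat/completions"  # the address server.py logs at startup

payload = {
    "model": "qwen3.6-27b",  # placeholder; use whatever model id the server expects
    "messages": [{"role": "user", "content": "Write a Space Invaders clone in Python."}],
    "max_tokens": 800,
    "temperature": 0.6,
}

req = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

start = time.perf_counter()
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)
elapsed = time.perf_counter() - start

tokens = body["usage"]["completion_tokens"]
print(f"{tokens} tok in {elapsed:.2f}s = {tokens / elapsed:.2f} t/s")
```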
Final ranking on Olares One (May 4, 2026)
| Backend | Stack | t/s avg |
|---|---|---|
| llama.cpp standard | UD-Q4_K_XL, no spec | 33-36 |
| buun-llama-cpp DFlash | HEAD + Q8_0 GGUF drafter | 80 |
| vLLM Turbo | v0.20.0 + Genesis + TurboQuant K8V4 + MTP n=3 | 88.0 |
| Lucebox DFlash HTTP | scripts/server.py + test_dflash + TQ3_0 KV | 88.5 🏆 |
| vLLM vanilla (other app) | 0.19.1 + AutoRound INT4 + MTP n=3 | 99 peak |
Lucebox is the new champion on Olares One, beating vLLM Turbo by 0.5 t/s. Not a blowout, but it's the first time a llama.cpp-based path on consumer mobile Blackwell beats the vLLM custom path. And that's with the mismatched 3.5 drafter, which has middling acceptance. Lucebox PR #94 from May 4 reports 106 t/s on RTX 4090 (Ada sm_89) with the matched 3.6 draft + SWA support. Extrapolated to Blackwell sm_120, that suggests 100+ t/s once we rebuild against PR #94 after it merges.
Lesson from this episode
The saga had hit a natural cliffhanger: “PR filed, let's wait”. With a bit of aggression (compile your own libvgpu.so instead of waiting on upstream), “wait” turns into “it works”. It took three workarounds, none of them particularly clean:
- Node-wide hot-swap: risk that the HAMi DaemonSet rewrites my master file at the next Olares update
- Override CUDA_DEVICE_MEMORY_LIMIT_0=24000m: workaround for a pre-existing HAMi bug that deserves its own PR (“0m means no limit”)
- Disable prefix-cache: lose KV-cache reuse. To fix in server.py (bump the timeout)
But 88.5 t/s on consumer mobile 24GB via the Lucebox HTTP path — that’s the first public demo. Enough to validate the path is viable. Polish comes after.
Next steps
- Rebuild the image once PR #94 (matched 3.6 draft + SWA) is merged → bench → expect ~100 t/s
- File a PR to Lucebox to bump the startup_sync timeout (the 10s hardcode is too short on sm_120)
- File a PR to HAMi-core to fix the “0m = no limit” semantics on memory_limit (separate from PR #188)
- Test PFlash (10× prefill speedup) in combo with DFlash decode
That’s it! If you run on a 5090M, 4080M or 3090 24GB and you reproduce these numbers (or beat 100+ t/s with PR #94), I want to know how you did it. See you next time!
Disclosure — All the benchmarks in this post run on my own Olares One. If the content was useful and you’re considering one, ordering through this referral link gets you $400 off ($3,599 instead of $3,999) and pays me $200. I’m mentioning this out of transparency — and yes, incidentally, it helps keep the blog alive (hosting, domain, and the time I spend writing here). Link valid until late June 2026.