Episode 4 — after 6h of cumulative compile time, the image aamsellem/lucebox-qwen36-blackwell:1.0.0 is on Docker Hub. It compiles, it links, it pushes. So far, so good.
Except the pod doesn’t boot.
The crash
Pod logs, after the Q4_K_M target model (~17 GB) and the z-lab drafter (~3.5 GB) finish downloading:
==> Starting Lucebox llama-server: /opt/lucebox/.../llama-server
[HAMI-core Msg(...)]: Initializing.....
[HAMI-core Warn(...)]: invalid device memory limit CUDA_DEVICE_MEMORY_LIMIT_0=0m
[HAMI-core Msg(...)]: get_nvml_device_memory_total 0
[HAMI-core Msg(...)]: get_nvml_device_memory_total 1
... (up to 15)
[HAMI-core Msg(...)]: Initialized
[HAMI-core Msg(...)]: SCHEDULER_WEBSOCKET_URL ws://gpu-scheduler.os-gpu:6000
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 24463 MiB):
Device 0: NVIDIA GeForce RTX 5090 Laptop GPU, compute capability 12.0,
VMM: yes, VRAM: 24463 MiB
[HAMI-core ERROR (...)]: Illegal device id: -644371744
Then exit. Pod in CrashLoopBackOff. I restart, and:
[HAMI-core ERROR (...)]: Illegal device id: -39296272
Then:
[HAMI-core ERROR (...)]: Illegal device id: 1816936528
The device id changes every run, and it’s random.
First instinct
When a C/C++ program throws a “random” integer that changes every run, nine times out of ten it’s an uninitialized stack variable. You declare int dev; without initializing, and dev takes the value of whatever was on the stack at that moment — i.e. whatever the previous functions left there. Random.
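To make the pattern concrete, here's the textbook version (a toy illustration, not HAMi's code):

```c
#include <stdio.h>

/* Toy illustration of the uninitialized-variable pattern. Reading `dev`
 * before assignment is undefined behavior; in practice you usually get
 * whatever bytes happen to sit in that stack slot at call time. */
static int get_device_id(int have_gpu) {
    int dev;              /* oops: no initializer */
    if (have_gpu)
        dev = 0;          /* the happy path sets it... */
    return dev;           /* ...every other path returns stack garbage */
}

int main(void) {
    printf("device id: %d\n", get_device_id(0));
    return 0;
}
```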
But where does the error come from? Not ggml. Not Lucebox. Not llama-server. It’s [HAMI-core ERROR ...] — that’s HAMi vGPU, the GPU isolation layer running on Olares Kubernetes.
What HAMi actually is
HAMi (Heterogeneous AI Computing Virtualization Middleware) is a Kubernetes device plugin that virtualizes GPUs. On Olares, it lives in the kube-system namespace:
$ kubectl get ds -n kube-system | grep hami
hami-device-plugin 1/1 ... gpu.bytetrade.io/cuda-supported=true
hami-nvidia-dcgm-exporter 1/1 ...
How does it work? At startup, the HAMi init script copies a custom lib (libvgpu.so) onto the host at /usr/local/vgpu/. When a pod requests nvidia.com/gpu in its resources, HAMi mounts that directory into the container via hostPath, and libvgpu.so is LD_PRELOAD'd into every process in the pod. That lib intercepts CUDA calls to do memory tracking, scheduling, and quota enforcement.
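To give an idea of the mechanism, here's a toy LD_PRELOAD interposer (a sketch of the general technique, not HAMi's actual implementation; it assumes the target binary links against libcuda directly):

```c
/* toy_vgpu.c -- toy LD_PRELOAD interposer, NOT HAMi's code.
 * Build: gcc -shared -fPIC -o libtoyvgpu.so toy_vgpu.c -ldl
 * Use:   LD_PRELOAD=./libtoyvgpu.so ./llama-server ...
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>

/* CUresult is an int-sized enum in the driver API; a local typedef
 * avoids needing cuda.h for this sketch. */
typedef int CUresult;

/* Same symbol name as the real driver function, so the dynamic linker
 * resolves calls to this library first when it is LD_PRELOAD'd. */
CUresult cuInit(unsigned int flags) {
    fprintf(stderr, "[toy-vgpu] intercepted cuInit(flags=%u)\n", flags);

    /* Do bookkeeping here (quotas, memory accounting, ...),
     * then forward to the real cuInit further down the link chain. */
    CUresult (*real_cuInit)(unsigned int) =
        (CUresult (*)(unsigned int))dlsym(RTLD_NEXT, "cuInit");
    if (!real_cuInit)
        return 3;  /* CUDA_ERROR_NOT_INITIALIZED */
    return real_cuInit(flags);
}
```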
So when llama-server calls a CUDA function, it goes through HAMi first. And HAMi crashes with Illegal device id: <random>.
First diag: is it VMM?
The log says VMM: yes. Lucebox uses the CUDA Driver API VMM (cuMemCreate, cuMemAddressReserve, cuMemMap) for its allocations. That’s what we saw in episodes 2-4 — every undefined reference was a VMM function. It’s more efficient than cudaMalloc for managing dynamic pools.
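For context, the VMM path chains Driver API calls roughly like this (a compressed sketch, not ggml's actual pool code; error checks omitted):

```c
/* vmm_sketch.c -- compressed sketch of the CUDA VMM allocation path.
 * Build (needs the CUDA toolkit and a GPU): gcc vmm_sketch.c -lcuda
 */
#include <cuda.h>
#include <stdio.h>

int main(void) {
    CUdevice dev; CUcontext ctx;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    /* 1. Reserve a big virtual address range up front (no memory yet). */
    size_t reserve = 1ull << 30;                 /* 1 GiB of address space */
    CUdeviceptr base;
    cuMemAddressReserve(&base, reserve, 0, 0, 0);

    /* 2. Create a physical allocation of one granule... */
    CUmemAllocationProp prop = {0};
    prop.type          = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id   = dev;
    size_t gran;
    cuMemGetAllocationGranularity(&gran, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM);
    CUmemGenericAllocationHandle handle;
    cuMemCreate(&handle, gran, &prop, 0);

    /* 3. ...map it into the reserved range and enable access. */
    cuMemMap(base, gran, 0, handle, 0);
    CUmemAccessDesc access = {0};
    access.location = prop.location;
    access.flags    = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    cuMemSetAccess(base, gran, &access, 1);

    printf("mapped %zu bytes at 0x%llx\n", gran, (unsigned long long)base);
    /* A pool grows by repeating steps 2-3 into the same reservation:
     * pointers stay stable, unlike free-then-cudaMalloc-bigger schemes. */
    return 0;
}
```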
Hypothesis: maybe HAMi doesn’t intercept the VMM path correctly. If I disable VMM on the ggml side, the bug might disappear.
ggml has a CMake flag for this: -DGGML_CUDA_NO_VMM=ON. It forces cudaMalloc (Runtime API) instead of cuMemCreate (Driver API). HAMi normally knows how to intercept Runtime API.
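The reconfigure looks roughly like this (a sketch assuming a standard llama.cpp CMake build; 120 matches Blackwell's compute capability 12.0, and my other project-specific flags are omitted):

```bash
cmake -B build \
      -DGGML_CUDA=ON \
      -DGGML_CUDA_NO_VMM=ON \
      -DCMAKE_CUDA_ARCHITECTURES=120
cmake --build build -j
```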
So I rebuild. Build #5. Another 2h of compile time, because the flag invalidates the entire CUDA cache.
docker.io/aamsellem/lucebox-qwen36-blackwell:1.1.0
Push, swap the image in the Olares chart (v1.0.0 → v1.1.0), restart the pod. It boots, downloads models, starts llama-server.
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 24463 MiB):
Device 0: NVIDIA GeForce RTX 5090 Laptop GPU, compute capability 12.0,
VMM: yes, VRAM: 24463 MiB
[HAMI-core ERROR (...)]: Illegal device id: -1967326112
Same thing.
Wait, the log says VMM: yes. Why? I disabled VMM in the build…
Oh. It’s the device’s capability that’s printed, not ggml’s actual usage. The device can do VMM, ggml just chose not to use it. But Illegal device id is still there.
Which means: the bug is not in the VMM path. It's somewhere else. And NO_VMM changes nothing.
At this point, I’m 7h into the saga, I’ve burned 8h of CPU on compiles, and I still don’t know why my pod crashes. Time to read the HAMi source code.
Episode 6 — Reading the HAMi-core source. See you next time!
Disclosure — All the benchmarks in this post run on my own Olares One. If the content was useful and you’re considering one, ordering through this referral link gets you $400 off ($3,599 instead of $3,999) and pays me $200. I’m mentioning this out of transparency — and yes, incidentally, it helps keep the blog alive (hosting, domain, and the time I spend writing here). Link valid until late June 2026.