Episode 5 — NO_VMM doesn’t help: the pod still crashes with a random Illegal device id. I’m leaving the “tweak CMake flags” zone and entering “read the source of the component that crashes” territory.
The repo
HAMi is split across two repos:
- Project-HAMi/HAMi — the Kubernetes device plugin (Go)
- Project-HAMi/HAMi-core — the libvgpu.so lib that’s LD_PRELOAD’d into pods (C)
The crash comes from the C lib. I clone HAMi-core:
git clone https://github.com/Project-HAMi/HAMi-core
cd HAMi-core
grep -rn "Illegal device id" src/
Bingo, first hit:
// src/multiprocess/multiprocess_memory_limit.c
int set_current_device_memory_limit(const int dev, size_t newlimit) {
ensure_initialized();
if (dev < 0 || dev >= CUDA_DEVICE_MAX_COUNT) {
LOG_ERROR("Illegal device id: %d", dev);
}
LOG_INFO("dev %d new limit set to %ld", dev, newlimit);
region_info.shared_region->limit[dev] = newlimit; // ← OOB write
return 0;
}
First thing that strikes me: the function logs the error and continues, writing into region_info.shared_region->limit[dev] with an invalid dev. That’s a silent OOB write into shared memory. Another app on the system can be affected.
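The tightening itself is trivial. Here is a sketch of the version I’d want (my rewrite, not the merged upstream code): bail out before touching the shared region, and let the caller deal with the -1.

```c
int set_current_device_memory_limit(const int dev, size_t newlimit) {
    ensure_initialized();
    if (dev < 0 || dev >= CUDA_DEVICE_MAX_COUNT) {
        LOG_ERROR("Illegal device id: %d", dev);
        return -1;  // refuse to index limit[] with garbage (return value is my choice)
    }
    LOG_INFO("dev %d new limit set to %ld", dev, newlimit);
    region_info.shared_region->limit[dev] = newlimit;  // only reachable with a valid index now
    return 0;
}
```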
But that’s not the root cause — the root cause is where the invalid dev comes from. Walk back up.
The guilty pattern
set_current_device_memory_limit is called from add_chunk_only, which is in turn called from several hooks in src/cuda/memory.c and src/allocator/allocator.c. In all of those hooks I see the same pattern:
CUdevice dev; // ← uninitialized
cuCtxGetDevice(&dev); // ← return value ignored
if (oom_check(dev, size)) { ... } // ← dev forwarded
add_chunk_only(*handle, size, dev); // ← stored as device id
There’s the bug. cuCtxGetDevice can return CUDA_ERROR_INVALID_CONTEXT whenever the calling thread has no current CUDA context — which happens all the time in async / multi-threaded / early-init scenarios:
- A thread calling a cuMem* hook before cuCtxSetCurrent (multi-threaded allocator)
- Init paths where allocations happen before the global context is attached
- CUDA Graphs capture, which bypasses the context by design
When cuCtxGetDevice fails, the code ignores the return code and dev keeps its random stack value. That’s our Illegal device id: -644371744.
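You can see this failure mode outside of HAMi entirely. A standalone repro sketch (mine, not code from either repo; build with something like gcc repro.c -lcuda):

```c
// repro.c: standalone sketch, not HAMi-core code.
#include <cuda.h>
#include <stdio.h>

int main(void) {
    cuInit(0);                          // driver is up, but no context is made current on this thread
    CUdevice dev = -644371744;          // the value from the crash log, standing in for stack garbage
    CUresult rc = cuCtxGetDevice(&dev); // expected: CUDA_ERROR_INVALID_CONTEXT
    printf("rc=%d dev=%d\n", (int)rc, (int)dev); // dev keeps its garbage value, which the hooks then trust
    return 0;
}
```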
How many sites?
I grep everything:
$ grep -n "cuCtxGetDevice" src/cuda/*.c src/allocator/*.c
src/allocator/allocator.c:39: cuCtxGetDevice(&d); // ignored
src/allocator/allocator.c:106: cuCtxGetDevice(&dev); // ignored
src/allocator/allocator.c:126: cuCtxGetDevice(&dev); // ignored
src/allocator/allocator.c:172: cuCtxGetDevice(&dev); // ignored
src/allocator/allocator.c:229: cuCtxGetDevice(&dev); // ignored
src/allocator/allocator.c:249: cuCtxGetDevice(&dev); // ignored
src/allocator/allocator.c:276: cuCtxGetDevice(&dev); // ignored
src/cuda/memory.c:592: if (do_oom_check && cuCtxGetDevice(&dev) != CUDA_SUCCESS) {
8 occurrences. Only one (src/cuda/memory.c:592) checks the return code. The other 7 ignore it. It’s not a one-off bug, it’s a systemic pattern.
Let’s lay out what each site does with a corrupted dev:
| File | Function | Consequence |
|---|---|---|
| cuda/memory.c | cuMemCreate | OOB index in shared region via add_chunk_only |
| allocator/allocator.c | oom_check (dev == -1 branch) | Reads get_current_device_memory_limit(d) with an invalid d |
| allocator/allocator.c | add_chunk | OOB in oom_check, then add_gpu_device_memory_usage |
| allocator/allocator.c | remove_chunk | OOB in rm_gpu_device_memory_usage |
| allocator/allocator.c | remove_chunk_async | OOB in rm_gpu_device_memory_usage |
| allocator/allocator.c | add_chunk_async | OOB in oom_check, then cuDeviceGetMemPool, which can crash with a negative device id |
Six critical sites. Only one (cuMemCreate, line 592) already has a guard, and even that one is partial: the check only runs when do_oom_check == true.
Why everything breaks at the same time
Now I see why NO_VMM did nothing. Disabling VMM pushes ggml to use cudaMalloc instead of cuMemCreate. But cudaMalloc ends up calling the allocator.c hooks too (add_chunk, remove_chunk, etc.) — which have the exact same bug. The pattern is everywhere in the code.
So as long as my llama-server thread allocates before it has a context attached — which it does during KV cache init on sm_120, because the init sequence is slightly different — cuCtxGetDevice fails and we fall into the pattern, whichever allocation path was chosen.
To actually fix this, you need to patch all 6 sites.
The plan
- Initialize dev = -1 at the start of each site
- Check the return code of cuCtxGetDevice
- If it fails, bail out gracefully — skip the memory tracking but let the underlying CUDA alloc/free continue normally
- Tighten set_current_device_memory_limit to return early on an invalid dev instead of OOB writing
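Concretely, the first three points can be factored into one small helper. A sketch under my own naming (try_get_valid_device is not an existing HAMi-core function; CUDA_DEVICE_MAX_COUNT is the constant from multiprocess_memory_limit.c):

```c
// Returns 1 and fills *out only when the calling thread has a usable context
// and the device id is in range; returns 0 otherwise so the caller can skip tracking.
static int try_get_valid_device(CUdevice *out) {
    CUdevice dev = -1;                                      // never leave it uninitialized
    if (cuCtxGetDevice(&dev) != CUDA_SUCCESS) return 0;     // actually check the return code
    if (dev < 0 || dev >= CUDA_DEVICE_MAX_COUNT) return 0;  // belt and suspenders
    *out = dev;
    return 1;
}
```

Each of the six sites then follows the same shape: if the helper returns 0, skip oom_check, add_chunk_only and the usage counters, and fall through to the real CUDA call, so the allocation still succeeds even when there is nothing to account it against.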
Roughly a 50-line patch. And it has to be pushed upstream, not kept locally.
Episode 7 — Issue #187 and PR #188 against HAMi-core. See you next time!
Disclosure — All the benchmarks in this post run on my own Olares One. If the content was useful and you’re considering one, ordering through this referral link gets you $400 off ($3,599 instead of $3,999) and pays me $200. I’m mentioning this out of transparency — and yes, incidentally, it helps keep the blog alive (hosting, domain, and the time I spend writing here). Link valid until late June 2026.