Episode 5 — NO_VMM doesn’t help: the pod still crashes with a random Illegal device id. I’m leaving the “tweak CMake flags” zone and entering “read the source of the component that crashes” territory.
The repo
HAMi is split across two repos:
- Project-HAMi/HAMi — the Kubernetes device plugin (Go)
- Project-HAMi/HAMi-core — the libvgpu.so lib that’s LD_PRELOAD’d into pods (C)
The crash comes from the C lib. I clone HAMi-core:
git clone https://github.com/Project-HAMi/HAMi-core
cd HAMi-core
grep -rn "Illegal device id" src/
Bingo, first hit:
// src/multiprocess/multiprocess_memory_limit.c
int set_current_device_memory_limit(const int dev, size_t newlimit) {
ensure_initialized();
if (dev < 0 || dev >= CUDA_DEVICE_MAX_COUNT) {
LOG_ERROR("Illegal device id: %d", dev);
}
LOG_INFO("dev %d new limit set to %ld", dev, newlimit);
region_info.shared_region->limit[dev] = newlimit; // ← OOB write
return 0;
}
First thing that strikes me: the function logs the error and continues, writing into region_info.shared_region->limit[dev] with an invalid dev. That’s a silent OOB write into shared memory. Another app on the system can be affected.
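The tightening itself is trivial. Here is a sketch of the version I’d want (my rewrite, not the merged upstream code): bail out before touching the shared region, and let the caller deal with the -1.

```c
int set_current_device_memory_limit(const int dev, size_t newlimit) {
    ensure_initialized();
    if (dev < 0 || dev >= CUDA_DEVICE_MAX_COUNT) {
        LOG_ERROR("Illegal device id: %d", dev);
        return -1;  // refuse to index limit[] with garbage (return value is my choice)
    }
    LOG_INFO("dev %d new limit set to %ld", dev, newlimit);
    region_info.shared_region->limit[dev] = newlimit;  // only reachable with a valid index now
    return 0;
}
```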
But that’s not the root cause — the root cause is where the invalid dev comes from. Walk back up.
The guilty pattern
set_current_device_memory_limit is called from add_chunk_only, which is in turn called from several hooks in src/cuda/memory.c and src/allocator/allocator.c. In all of those hooks I see the same pattern:
CUdevice dev; // ← uninitialized
cuCtxGetDevice(&dev); // ← return value ignored
if (oom_check(dev, size)) { ... } // ← dev forwarded
add_chunk_only(*handle, size, dev); // ← stored as device id
There’s the bug. cuCtxGetDevice can return CUDA_ERROR_INVALID_CONTEXT whenever the calling thread has no current CUDA context — which happens all the time in async / multi-threaded / early-init scenarios:
- A thread calling a cuMem* hook before cuCtxSetCurrent (multi-threaded allocator)
- Init paths where allocations happen before the global context is attached
- CUDA Graphs capture, which bypasses the context by design
When cuCtxGetDevice fails, the code ignores the return code and dev keeps its random stack value. That’s our Illegal device id: -644371744.
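You can see this failure mode outside of HAMi entirely. A standalone repro sketch (mine, not code from either repo; build with something like gcc repro.c -lcuda):

```c
// repro.c: standalone sketch, not HAMi-core code.
#include <cuda.h>
#include <stdio.h>

int main(void) {
    cuInit(0);                          // driver is up, but no context is made current on this thread
    CUdevice dev = -644371744;          // the value from the crash log, standing in for stack garbage
    CUresult rc = cuCtxGetDevice(&dev); // expected: CUDA_ERROR_INVALID_CONTEXT
    printf("rc=%d dev=%d\n", (int)rc, (int)dev); // dev keeps its garbage value, which the hooks then trust
    return 0;
}
```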
How many sites?
I grep everything:
$ grep -n "cuCtxGetDevice" src/cuda/*.c src/allocator/*.c
src/allocator/allocator.c:39: cuCtxGetDevice(&d); // ignored
src/allocator/allocator.c:106: cuCtxGetDevice(&dev); // ignored
src/allocator/allocator.c:126: cuCtxGetDevice(&dev); // ignored
src/allocator/allocator.c:172: cuCtxGetDevice(&dev); // ignored
src/allocator/allocator.c:229: cuCtxGetDevice(&dev); // ignored
src/allocator/allocator.c:249: cuCtxGetDevice(&dev); // ignored
src/allocator/allocator.c:276: cuCtxGetDevice(&dev); // ignored
src/cuda/memory.c:592: if (do_oom_check && cuCtxGetDevice(&dev) != CUDA_SUCCESS) {
8 occurrences. Only one (src/cuda/memory.c:592) checks the return code. The other 7 ignore it. It’s not a one-off bug, it’s a systemic pattern.
Let’s lay out what each site does with a corrupted dev:
| File | Function | Consequence |
|---|---|---|
| cuda/memory.c | cuMemCreate | OOB index in shared region via add_chunk_only |
| allocator/allocator.c | oom_check (dev == -1 branch) | Reads get_current_device_memory_limit(d) with an invalid d |
| allocator/allocator.c | add_chunk | OOB in oom_check, then add_gpu_device_memory_usage |
| allocator/allocator.c | remove_chunk | OOB in rm_gpu_device_memory_usage |
| allocator/allocator.c | remove_chunk_async | OOB in rm_gpu_device_memory_usage |
| allocator/allocator.c | add_chunk_async | OOB in oom_check, then cuDeviceGetMemPool, which can crash with a negative device id |
Six critical sites. Only one (cuMemCreate, line 592) already has a guard, and even that one is partial: the check only runs when do_oom_check == true.
Why everything breaks at the same time
Now I see why NO_VMM did nothing. Disabling VMM pushes ggml to use cudaMalloc instead of cuMemCreate. But cudaMalloc ends up calling the allocator.c hooks too (add_chunk, remove_chunk, etc.) — which have the exact same bug. The pattern is everywhere in the code.
So as long as my llama-server thread allocates before it has a context attached — which it does during KV cache init on sm_120, because the init sequence is slightly different — cuCtxGetDevice fails and we fall into the pattern, whichever allocation path was chosen.
To actually fix this, you need to patch all 6 sites.
The plan
- Initialize dev = -1 at the start of each site
- Check the return code of cuCtxGetDevice
- If it fails, bail out gracefully — skip the memory tracking but let the underlying CUDA alloc/free continue normally
- Tighten set_current_device_memory_limit to return early on an invalid dev instead of OOB writing
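Concretely, the first three points can be factored into one small helper. A sketch under my own naming (try_get_valid_device is not an existing HAMi-core function; CUDA_DEVICE_MAX_COUNT is the constant from multiprocess_memory_limit.c):

```c
// Returns 1 and fills *out only when the calling thread has a usable context
// and the device id is in range; returns 0 otherwise so the caller can skip tracking.
static int try_get_valid_device(CUdevice *out) {
    CUdevice dev = -1;                                      // never leave it uninitialized
    if (cuCtxGetDevice(&dev) != CUDA_SUCCESS) return 0;     // actually check the return code
    if (dev < 0 || dev >= CUDA_DEVICE_MAX_COUNT) return 0;  // belt and suspenders
    *out = dev;
    return 1;
}
```

Each of the six sites then follows the same shape: if the helper returns 0, skip oom_check, add_chunk_only and the usage counters, and fall through to the real CUDA call, so the allocation still succeeds even when there is nothing to account it against.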
Roughly a 50-line patch. And it has to be pushed upstream, not kept locally.
Episode 7 — Issue #187 and PR #188 against HAMi-core. See you next time!
Disclosure — All the benchmarks in this post run on my own Olares One. If the content was useful and you’re considering one, ordering through this referral link gets you $400 off ($3,599 instead of $3,999) and pays me $200. I’m mentioning this out of transparency — and yes, incidentally, it helps keep the blog alive (hosting, domain, and the time I spend writing here). Link valid until late June 2026.