
Lucebox on Olares One — Episode 6: We read the HAMi-core source and we find 6 bugs

NO_VMM doesn't fix anything. The `Illegal device id` bug comes back every run. Time to read the HAMi-core source. And what we find is not a single bug — it's a systemic pattern across 6 different hooks.

Where Episode 5 left off: NO_VMM doesn’t help, the pod still crashes with a random Illegal device id. I’m leaving the “tweak CMake flags” zone and entering the “read the source of the component that crashes” zone.

The repo

HAMi is split across two repos:

  - Project-HAMi/HAMi — the Kubernetes side (scheduler extender, device plugin), written in Go
  - Project-HAMi/HAMi-core — the CUDA hook library (libvgpu.so) that intercepts driver calls inside containers, written in C

The crash comes from the C lib. I clone HAMi-core:

git clone https://github.com/Project-HAMi/HAMi-core
cd HAMi-core
grep -rn "Illegal device id" src/

Bingo, first hit:

// src/multiprocess/multiprocess_memory_limit.c
int set_current_device_memory_limit(const int dev, size_t newlimit) {
    ensure_initialized();
    if (dev < 0 || dev >= CUDA_DEVICE_MAX_COUNT) {
        LOG_ERROR("Illegal device id: %d", dev);
    }
    LOG_INFO("dev %d new limit set to %ld", dev, newlimit);
    region_info.shared_region->limit[dev] = newlimit;  // ← OOB write
    return 0;
}

First thing that strikes me: the function logs the error and continues, writing into region_info.shared_region->limit[dev] with an invalid dev. That’s a silent OOB write into shared memory: any other process attached to the same shared region can be corrupted.
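Plugging that particular hole is trivial. A sketch of the tightened version, reusing the names from the snippet above (returning -1 on error is my convention, not necessarily what upstream will want):

int set_current_device_memory_limit(const int dev, size_t newlimit) {
    ensure_initialized();
    if (dev < 0 || dev >= CUDA_DEVICE_MAX_COUNT) {
        LOG_ERROR("Illegal device id: %d", dev);
        return -1;  // ← bail out instead of writing out of bounds
    }
    LOG_INFO("dev %d new limit set to %ld", dev, newlimit);
    region_info.shared_region->limit[dev] = newlimit;
    return 0;
}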

But that’s not the root cause — the root cause is where the invalid dev comes from. Walk back up.

The guilty pattern

set_current_device_memory_limit is called from add_chunk_only which is called from several hooks in src/cuda/memory.c and src/allocator/allocator.c. And in all those hooks I see this pattern:

CUdevice dev;                                          // ← uninitialized
cuCtxGetDevice(&dev);                                  // ← return value ignored
if (oom_check(dev, size)) { ... }                      // ← dev forwarded
add_chunk_only(*handle, size, dev);                    // ← stored as device id

There’s the bug. cuCtxGetDevice can return CUDA_ERROR_INVALID_CONTEXT if the calling thread hasn’t pushed its CUDA context — which happens all the time in async / multi-thread / early-init scenarios.

When cuCtxGetDevice fails, the code ignores the return code and dev keeps its random stack value. That’s our Illegal device id: -644371744.
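To see the failure mode in isolation, here is a minimal repro sketch (the file name and the gcc repro.c -lcuda build line are mine):

// repro.c: call cuCtxGetDevice from a thread that never made a context current
#include <cuda.h>
#include <stdio.h>

int main(void) {
    cuInit(0);                           // driver initialized, but no context created
    CUdevice dev;                        // deliberately uninitialized, like the hooks
    CUresult rc = cuCtxGetDevice(&dev);  // no current context on this thread
    // rc comes back as CUDA_ERROR_INVALID_CONTEXT and dev keeps whatever
    // garbage was on the stack: exactly the pair the hooks then forward
    printf("rc=%d dev=%d\n", (int)rc, (int)dev);
    return 0;
}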

How many sites?

I grep everything:

$ grep -n "cuCtxGetDevice" src/cuda/*.c src/allocator/*.c
src/allocator/allocator.c:39:        cuCtxGetDevice(&d);        // ignored
src/allocator/allocator.c:106:    cuCtxGetDevice(&dev);         // ignored
src/allocator/allocator.c:126:    cuCtxGetDevice(&dev);         // ignored
src/allocator/allocator.c:172:            cuCtxGetDevice(&dev); // ignored
src/allocator/allocator.c:229:            cuCtxGetDevice(&dev); // ignored
src/allocator/allocator.c:249:    cuCtxGetDevice(&dev);         // ignored
src/allocator/allocator.c:276:    cuCtxGetDevice(&dev);         // ignored
src/cuda/memory.c:592:    if (do_oom_check && cuCtxGetDevice(&dev) != CUDA_SUCCESS) {

8 occurrences. Only one (src/cuda/memory.c:592) checks the return code. The other 7 ignore it. It’s not a one-off bug, it’s a systemic pattern.

Let’s lay out what each site does with a corrupted dev:

| File | Function | Consequence |
|------|----------|-------------|
| cuda/memory.c | cuMemCreate | OOB index in shared region via add_chunk_only |
| allocator/allocator.c | oom_check (dev == -1 branch) | reads get_current_device_memory_limit(d) with an invalid d |
| allocator/allocator.c | add_chunk | OOB in oom_check then add_gpu_device_memory_usage |
| allocator/allocator.c | remove_chunk | OOB in rm_gpu_device_memory_usage |
| allocator/allocator.c | remove_chunk_async | OOB in rm_gpu_device_memory_usage |
| allocator/allocator.c | add_chunk_async | OOB in oom_check then cuDeviceGetMemPool, which can crash with a negative device id |

Six critical sites. Only one (cuMemCreate, line 592) already has the guard — and even then only partially, since the check runs only when do_oom_check is true.

Why everything breaks at the same time

Now I see why NO_VMM did nothing. Disabling VMM pushes ggml to use cudaMalloc instead of cuMemCreate. But cudaMalloc ends up calling the allocator.c hooks too (add_chunk, remove_chunk, etc.) — which have the exact same bug. The pattern is everywhere in the code.

So whenever my llama-server thread allocates before it has a context attached — which it does during KV cache init on sm_120, because the init sequence there is slightly different — cuCtxGetDevice fails and we fall into the pattern, whatever allocation path was chosen.

To actually fix this, you need to patch all 6 sites.

The plan

  1. Initialize dev = -1 at the start of each site
  2. Check the return code of cuCtxGetDevice
  3. If it fails, bail out gracefully — skip the memory tracking but let the underlying CUDA alloc/free continue normally (see the sketch after this list)
  4. Tighten set_current_device_memory_limit to return early on invalid dev instead of OOB writing
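Put together, a sketch of what each patched site would look like (simplified: the pass-through to the real driver call depends on the hook, and the exact shape of the guard is mine):

CUdevice dev = -1;                         // (1) defensive init, never trust the stack
CUresult res = cuCtxGetDevice(&dev);       // (2) actually check the return code
if (res != CUDA_SUCCESS || dev < 0 || dev >= CUDA_DEVICE_MAX_COUNT) {
    // (3) no usable context: skip the accounting entirely and let the
    // underlying CUDA alloc/free proceed, so the application keeps working
} else {
    if (oom_check(dev, size)) { /* ... */ }
    add_chunk_only(*handle, size, dev);    // only reached with a valid dev
}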

Roughly a 50-line patch. And it has to be pushed upstream, not kept locally.

Episode 7 — Issue #187 and PR #188 against HAMi-core. See you next time!


Disclosure — All the benchmarks in this post run on my own Olares One. If the content was useful and you’re considering one, ordering through this referral link gets you $400 off ($3,599 instead of $3,999) and pays me $200. I’m mentioning this out of transparency — and yes, incidentally, it helps keep the blog alive (hosting, domain, and the time I spend writing here). Link valid until late June 2026.
