Aurélien AMSELLEM

Lucebox on Olares One — Episode 7: Issue #187, PR #188, and 6 hooks fixed in one go

The bug is identified: 6 hooks in HAMi-core ignore the return value of cuCtxGetDevice. The fix is 50 lines. But for the entire HAMi community to benefit, it has to go upstream. Here's how that played out.

Episode 6 — full diagnosis: 6 HAMi-core hooks share the same bug — uninitialized dev after cuCtxGetDevice ignored. The fix is clear: initialize, check the return code, bail out gracefully if it fails.

Now, two options:

  1. Patch the lib locally on my Olares (a single .so to compile, scp onto the host, restart the GPU pods). Quick, only benefits me.

  2. Push the fix upstream to Project-HAMi/HAMi-core. Clean, helps the whole HAMi community.

Of course I’m doing both, starting with the upstream PR. Open-source good karma.

The issue

Standard reflex before a PR: open an issue that describes the bug in detail, with a minimal repro. That gives maintainers the context they need to review the PR.

The repro is short. A C main calling cuMemCreate with prop.location.type = CU_MEM_LOCATION_TYPE_HOST_NUMA — exactly the path ggml takes for its virtual address pool:

#include <stdio.h>
#include <cuda.h>

int main(void) {
    cuInit(0);
    CUdevice device; cuDeviceGet(&device, 0);
    CUcontext ctx;   cuCtxCreate(&ctx, 0, device);

    CUmemAllocationProp prop = {0};
    prop.type             = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type    = CU_MEM_LOCATION_TYPE_HOST_NUMA;
    prop.location.id      = 0;

    size_t granularity;
    cuMemGetAllocationGranularity(&granularity, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM);
    size_t size = ((1<<20) + granularity - 1) / granularity * granularity;

    CUmemGenericAllocationHandle handle;
    CUresult res = cuMemCreate(&handle, size, &prop, 0);
    printf("cuMemCreate returned %d\n", res);
    cuCtxDestroy(ctx);
    return 0;
}

Compile with nvcc repro.c -o repro -lcuda, run it inside a HAMi-managed pod, and [HAMI-core ERROR ...]: Illegal device id: <random> shows up immediately.

Issue: Project-HAMi/HAMi-core#187. Clear text, full repro, root cause identified, fix suggested inline.

The PR

Then I fork the repo, branch fix/cumemcreate-uninit-dev, apply the fix.

First commit — cuMemCreate (the specific hook that broke me):

// before
CUdevice dev;                                          // uninitialized
int do_oom_check = (prop->location.type == CU_MEM_LOCATION_TYPE_DEVICE);
if (do_oom_check && cuCtxGetDevice(&dev) != CUDA_SUCCESS) {
    dev = prop->location.id;
}
// ...
if (res == CUDA_SUCCESS) {
    add_chunk_only(*handle, size, dev);                // possibly uninit
}

// after
CUdevice dev = prop->location.id;                      // initialized up-front
int do_oom_check = (prop->location.type == CU_MEM_LOCATION_TYPE_DEVICE);
if (do_oom_check && cuCtxGetDevice(&dev) != CUDA_SUCCESS) {
    dev = prop->location.id;
}
// ...
if (res == CUDA_SUCCESS && do_oom_check) {             // skip non-DEVICE
    add_chunk_only(*handle, size, dev);
}

Plus tightening set_current_device_memory_limit to return early:

if (dev < 0 || dev >= CUDA_DEVICE_MAX_COUNT) {
    LOG_ERROR("Illegal device id: %d", dev);
    return -1;                                         // was missing
}

Second commit — the 5 sites in allocator.c. Same pattern everywhere:

// before
CUdevice dev;
cuCtxGetDevice(&dev);
if (oom_check(dev, size)) ...

// after
CUdevice dev = -1;
if (cuCtxGetDevice(&dev) != CUDA_SUCCESS) {
    LOG_WARN("add_chunk: cuCtxGetDevice failed, skipping memory tracking");
    return CUDA_SUCCESS;
}
if (oom_check(dev, size)) ...

Strategy: if we don’t know which device the alloc happened on, we skip tracking it rather than track nonsense and corrupt shared memory. The underlying CUDA allocation proceeds normally; HAMi just doesn’t count it against quotas. An acceptable trade-off — a context-less thread allocating is rare in steady state.

Total: +36 / -21 lines. Two files touched.

Push and review

git push -u origin fix/cumemcreate-uninit-dev
gh pr create --repo Project-HAMi/HAMi-core \
  --base main \
  --head aamsellem:fix/cumemcreate-uninit-dev \
  --title "fix(cuda,allocator): initialize dev in 6 hooks reading uninitialised stack on cuCtxGetDevice failure" \
  --body-file pr-body.md

PR live: Project-HAMi/HAMi-core#188.

The PR body covers the same ground as the issue: symptom, repro, root cause, and the fix, commit by commit.

The commits carry Signed-off-by: aamsellem <620182+aamsellem@users.noreply.github.com> because most NVIDIA-adjacent projects ask for DCO.

And then I wait.

What it means for Olares specifically

The thing is, Olares doesn’t pull master directly: it ships beclab/hami:v2.6.14, a Docker Hub image published on March 18, 2026, which predates recent upstream changes. So we’ll need:

  1. PR #188 merged on HAMi-core master
  2. beclab to rebuild their beclab/hami:v2.6.x image from master
  3. Olares to update their Helm chart to point at the new image
  4. The next Olares update for your cluster to pull the new libvgpu.so

So some patience is required before this lands in prod: probably 2-4 weeks.

In the meantime, two workarounds:

What we got out of this saga (so far)

Yes, zero. Because as I’m writing this, the pod still doesn’t boot — we need either the upstream PR merged and landed in Olares, or a manually compiled libvgpu.so on the host.

Episode 8 — Lucebox numbers on consumer Blackwell, finally, incoming as soon as we clear that last step. See you next time!


Disclosure — All the benchmarks in this post run on my own Olares One. If the content was useful and you’re considering one, ordering through this referral link gets you $400 off ($3,599 instead of $3,999) and pays me $200. I’m mentioning this out of transparency — and yes, incidentally, it helps keep the blog alive (hosting, domain, and the time I spend writing here). Link valid until late June 2026.
