Aurélien AMSELLEM

Lucebox on Olares One — Episode 7: Issue #187, PR #188, and 6 hooks fixed in one go

The bug is identified: 6 hooks in HAMi-core ignore the return value of cuCtxGetDevice. The fix is 50 lines. But for the entire HAMi community to benefit, it has to go upstream. Here's how that played out.

Episode 6 — full diagnosis: 6 HAMi-core hooks share the same bug — uninitialized dev after cuCtxGetDevice ignored. The fix is clear: initialize, check the return code, bail out gracefully if it fails.

Now, two options:

  1. Patch the lib locally on my Olares (a single .so to compile, scp onto the host, restart the GPU pods). Quick, only benefits me.

  2. Push the fix upstream to Project-HAMi/HAMi-core. Clean, helps the whole HAMi community.

Of course I’m doing both, starting with the upstream PR. Open-source good karma.

The issue

Standard reflex before a PR: open an issue that describes the bug in detail, with a minimal repro. That gives maintainers the context they need to review the PR.

The repro is short. A C main calling cuMemCreate with prop.location.type = CU_MEM_LOCATION_TYPE_HOST_NUMA — exactly the path ggml takes for its virtual address pool:

#include <stdio.h>
#include <cuda.h>

int main(void) {
    cuInit(0);
    CUdevice device; cuDeviceGet(&device, 0);
    CUcontext ctx;   cuCtxCreate(&ctx, 0, device);

    CUmemAllocationProp prop = {0};
    prop.type             = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type    = CU_MEM_LOCATION_TYPE_HOST_NUMA;
    prop.location.id      = 0;

    size_t granularity;
    cuMemGetAllocationGranularity(&granularity, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM);
    size_t size = ((1<<20) + granularity - 1) / granularity * granularity;

    CUmemGenericAllocationHandle handle;
    CUresult res = cuMemCreate(&handle, size, &prop, 0);
    printf("cuMemCreate returned %d\n", res);
    cuCtxDestroy(ctx);
    return 0;
}

Compile with nvcc repro.c -o repro -lcuda, run it inside a HAMi-managed pod, and [HAMI-core ERROR ...]: Illegal device id: <random> shows up immediately.

Issue: Project-HAMi/HAMi-core#187. Clear text, full repro, root cause identified, fix suggested inline.

The PR

Then I fork the repo, branch fix/cumemcreate-uninit-dev, apply the fix.

First commit — cuMemCreate (the specific hook that broke me):

// before
CUdevice dev;                                          // uninitialized
int do_oom_check = (prop->location.type == CU_MEM_LOCATION_TYPE_DEVICE);
if (do_oom_check && cuCtxGetDevice(&dev) != CUDA_SUCCESS) {
    dev = prop->location.id;
}
// ...
if (res == CUDA_SUCCESS) {
    add_chunk_only(*handle, size, dev);                // possibly uninit
}

// after
CUdevice dev = prop->location.id;                      // initialized up-front
int do_oom_check = (prop->location.type == CU_MEM_LOCATION_TYPE_DEVICE);
if (do_oom_check && cuCtxGetDevice(&dev) != CUDA_SUCCESS) {
    dev = prop->location.id;
}
// ...
if (res == CUDA_SUCCESS && do_oom_check) {             // skip non-DEVICE
    add_chunk_only(*handle, size, dev);
}

Plus tightening set_current_device_memory_limit to return early:

if (dev < 0 || dev >= CUDA_DEVICE_MAX_COUNT) {
    LOG_ERROR("Illegal device id: %d", dev);
    return -1;                                         // was missing
}

Second commit — the 5 sites in allocator.c. Same pattern everywhere:

// before
CUdevice dev;
cuCtxGetDevice(&dev);
if (oom_check(dev, size)) ...

// after
CUdevice dev = -1;
if (cuCtxGetDevice(&dev) != CUDA_SUCCESS) {
    LOG_WARN("add_chunk: cuCtxGetDevice failed, skipping memory tracking");
    return CUDA_SUCCESS;
}
if (oom_check(dev, size)) ...

Strategy: if we don’t know which device the alloc happened on, we skip tracking it rather than track nonsense and corrupt shared memory. The underlying CUDA allocation proceeds normally; HAMi just doesn’t count it against quotas. An acceptable trade-off — a context-less thread allocating is rare in steady state.

Total: +36 / -21 lines. Two files touched.

Push and review

git push -u origin fix/cumemcreate-uninit-dev
gh pr create --repo Project-HAMi/HAMi-core \
  --base main \
  --head aamsellem:fix/cumemcreate-uninit-dev \
  --title "fix(cuda,allocator): initialize dev in 6 hooks reading uninitialised stack on cuCtxGetDevice failure" \
  --body-file pr-body.md

PR live: Project-HAMi/HAMi-core#188.

The PR body covers the same ground as the issue: symptom, repro, root cause, and the fix, commit by commit.

The commits carry Signed-off-by: aamsellem <620182+aamsellem@users.noreply.github.com> because most NVIDIA-adjacent projects ask for DCO.

And then I wait.

What it means for Olares specifically

The thing is, Olares doesn’t pull master directly: it ships beclab/hami:v2.6.14, a Docker Hub image published on March 18, 2026, which predates recent upstream changes. So we’ll need:

  1. PR #188 merged on HAMi-core master
  2. beclab to rebuild their beclab/hami:v2.6.x image from master
  3. Olares to update their Helm chart to point at the new image
  4. The next Olares update for your cluster to pull the new libvgpu.so

So some patience is required before this lands in prod: probably 2-4 weeks.

In the meantime, two workarounds:

What we got out of this saga (so far)

Yes, zero. Because as I’m writing this, the pod still doesn’t boot — we need either the upstream PR merged and landed in Olares, or a manually compiled libvgpu.so on the host.

Episode 8 — Lucebox numbers on consumer Blackwell, finally, incoming as soon as we clear that last step. See you next time!


Disclosure — All the benchmarks in this post run on my own Olares One. If the content was useful and you’re considering one, ordering through this referral link gets you $400 off ($3,599 instead of $3,999) and pays me $200. I’m mentioning this out of transparency — and yes, incidentally, it helps keep the blog alive (hosting, domain, and the time I spend writing here). Link valid until late June 2026.
