If you missed episode 1, quick recap: I’m trying to package Lucebox DFlash as a Docker image for the Olares One (RTX 5090 Mobile, sm_120 consumer Blackwell). Nobody has done it before, and we left off just as the first docker buildx build kicked off.
Spoiler for episode 2: we trip at the link step.
The naive Dockerfile
Three simple steps in the builder stage: install the build dependencies, clone the repo, configure and build.
FROM nvidia/cuda:13.0.0-devel-ubuntu22.04 AS builder
RUN apt-get update && apt-get install -y --no-install-recommends \
git build-essential cmake ninja-build pkg-config \
libcurl4-openssl-dev ca-certificates python3-pip python3-dev
WORKDIR /src
RUN git clone --depth 1 --recurse-submodules \
https://github.com/Luce-Org/lucebox-hub /src/lucebox
WORKDIR /src/lucebox/dflash
RUN cmake -B build -S . \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_CUDA=ON \
-DCMAKE_CUDA_ARCHITECTURES="86;89;120" \
&& cmake --build build --target test_dflash -j $(nproc)
Three CUDA architectures targeted: Ampere (86, RTX 3090), Ada (89, RTX 4090), consumer Blackwell (120, RTX 5090). Lucebox auto-detects sm_120 and rewrites it to 120a at config time — good sign.
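Once the build finally succeeds, a quick way to confirm which architectures actually got baked into the binary is cuobjdump from the toolkit. A minimal sketch, assuming the default layout drops the executable in build/bin/ (that path is my assumption, not something the Dockerfile guarantees):
cuobjdump --list-elf build/bin/test_dflash | grep -oE 'sm_[0-9]+a?' | sort -u
If all went well, 86, 89 and 120 (or 120a) all show up in the list.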
I run the build:
docker buildx build --platform linux/amd64 -t aamsellem/lucebox-qwen36-blackwell:1.0.0 .
And I wait.
2h13 later
Compiling CUDA for 3 architectures, on 16 vCPUs inside an OrbStack VM, is long. Very long. The fattn-mma-f16-instance-*.cu.o objects (FlashAttention kernels using tensor-core MMA, one template instance each) take 30 to 90 seconds apiece. And there are dozens. Plus the mmq-instance-* (quantized matmul). Plus template-instances/fattn-vec-instance-*. Plus…
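Out of curiosity, a rough count of those template instances straight from the clone. This assumes llama.cpp is vendored under deps/ relative to the dflash source tree, which is what the linker paths further down suggest:
ls deps/llama.cpp/ggml/src/ggml-cuda/template-instances/*.cu | wc -l
Multiply that by three target architectures and the two-hour wall clock stops being mysterious.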
At 7208 seconds (2h00), I’m at [90%]. At 8126s, [98%] Linking CXX static library libdflash27b.a. At 8202s, [100%] Linking CXX executable test_dflash. Yes!
Then boom:
/usr/bin/ld: warning: libcuda.so.1, needed by deps/llama.cpp/ggml/src/ggml-cuda/libggml-cuda.so.0.9.11, not found (try using -rpath or -rpath-link)
/usr/bin/ld: libggml-cuda.so.0.9.11: undefined reference to `cuMemCreate'
/usr/bin/ld: libggml-cuda.so.0.9.11: undefined reference to `cuMemAddressReserve'
/usr/bin/ld: libggml-cuda.so.0.9.11: undefined reference to `cuMemUnmap'
/usr/bin/ld: libggml-cuda.so.0.9.11: undefined reference to `cuMemSetAccess'
/usr/bin/ld: libggml-cuda.so.0.9.11: undefined reference to `cuDeviceGet'
/usr/bin/ld: libggml-cuda.so.0.9.11: undefined reference to `cuMemAddressFree'
/usr/bin/ld: libggml-cuda.so.0.9.11: undefined reference to `cuGetErrorString'
/usr/bin/ld: libggml-cuda.so.0.9.11: undefined reference to `cuDeviceGetAttribute'
/usr/bin/ld: libggml-cuda.so.0.9.11: undefined reference to `cuMemMap'
/usr/bin/ld: libggml-cuda.so.0.9.11: undefined reference to `cuMemRelease'
/usr/bin/ld: libggml-cuda.so.0.9.11: undefined reference to `cuMemGetAllocationGranularity'
collect2: error: ld returned 1 exit status
Eleven undefined references, all CUDA Driver API symbols, most of them from the VMM (Virtual Memory Management) family, the cuMem* calls. Compile worked, link failed.
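If you capture the buildx output to a file (build.log here is a hypothetical capture, not something the build produces on its own), pulling the missing symbols out takes one line:
grep 'undefined reference' build.log | grep -oE 'cu[A-Z][A-Za-z]+' | sort -u
Eleven unique names, matching the linker output above.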
Ouch.
Why
On a regular NVIDIA machine, libcuda.so.1 ships with the driver, not the CUDA toolkit: it’s installed by the NVIDIA driver package and typically lives at /usr/lib/x86_64-linux-gnu/libcuda.so.1.
In a Docker container, the driver comes from the host via NVIDIA Container Toolkit, at runtime. At build time, inside an nvidia/cuda:13.0.0-devel image, the driver isn’t there.
But NVIDIA thought of this: they ship a stub at /usr/local/cuda/lib64/stubs/libcuda.so. It’s an empty lib that exports every CUDA Driver API symbol with no logic behind it. At build time, ld resolves symbols against the stub. At runtime, the host’s real libcuda.so.1 is used.
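You can see the stub doing its job with nm (binutils comes along with build-essential). A quick check inside the devel image:
nm -D /usr/local/cuda/lib64/stubs/libcuda.so | grep -E 'cuMemCreate|cuMemAddressReserve'
Both symbols come back defined, even though there’s no real driver behind them.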
Except the stub is named libcuda.so, not libcuda.so.1. And libggml-cuda.so is looking for .1. So ld doesn’t find it, and dumps the undefined references.
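The .1 dependency is easy to confirm with readelf on the freshly built library (path taken from the linker warning above, relative to the build directory):
readelf -d deps/llama.cpp/ggml/src/ggml-cuda/libggml-cuda.so.0.9.11 | grep NEEDED
libcuda.so.1 is in that list, and nothing on the link path answers to that exact name.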
The fix that looked obvious
Two lines in the Dockerfile:
ENV LIBRARY_PATH="/usr/local/cuda/lib64/stubs:${LIBRARY_PATH}"
RUN ln -sf /usr/local/cuda/lib64/stubs/libcuda.so \
/usr/local/cuda/lib64/stubs/libcuda.so.1
Symlink + add the stubs path to LIBRARY_PATH. Logical. I rerun.
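For the record, a cheap sanity check I could have added right after those two lines, before committing to another two-hour build:
RUN ls -l /usr/local/cuda/lib64/stubs/libcuda.so* && echo "LIBRARY_PATH=${LIBRARY_PATH}"
Both names show up and the variable is set; whether the toolchain actually honors them is episode 3’s problem.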
2 hours of compile later…
Continued in episode 3, because LIBRARY_PATH doesn’t do what I thought it did.
See you next time!
Disclosure — All the benchmarks in this post run on my own Olares One. If the content was useful and you’re considering one, ordering through this referral link gets you $400 off ($3,599 instead of $3,999) and pays me $200. I’m mentioning this out of transparency — and yes, incidentally, it helps keep the blog alive (hosting, domain, and the time I spend writing here). Link valid until late June 2026.