Previously, in episode 3: I figured out that LIBRARY_PATH doesn't resolve indirect dependencies, switched to -Wl,-rpath-link, and test_dflash finally linked.
Except test_dflash is a minimal CLI: it takes a prompt, dumps tokens. To get an OpenAI-compatible endpoint (so you can plug in Continue, Roo, your usual client), I need llama-server, which lives in Luce-Org/llama.cpp@luce-dflash (the llama.cpp fork embedded as a submodule of Lucebox).
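For context, llama-server exposes the standard OpenAI routes, so any client that can hit /v1/chat/completions works unchanged. Something like this (host and port assumed; it's roughly what Continue or Roo sends under the hood) is the end goal:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}]}'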
So a second cmake invocation in the Dockerfile:
WORKDIR /src/lucebox/dflash/deps/llama.cpp
RUN cmake -B build -S . \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_CUDA=ON \
-DCMAKE_CUDA_ARCHITECTURES="120" \
-DLLAMA_BUILD_SERVER=ON \
-DLLAMA_CURL=ON \
&& cmake --build build --target llama-server -j $(nproc)
See the bug? Me neither, in the moment. I just rebuilt.
67 minutes later
[100%] Linking CXX executable llama-server
/usr/bin/ld: ../../bin/libggml-cuda.so.0.9.11: undefined reference to `cuMemCreate'
/usr/bin/ld: ../../bin/libggml-cuda.so.0.9.11: undefined reference to `cuMemAddressReserve'
...
Same 11 undefined references as in episode 2. This time, though, the cause was trivial: I had simply forgotten to pass -DCMAKE_EXE_LINKER_FLAGS=... and -DCMAKE_SHARED_LINKER_FLAGS=... to the second cmake invocation. The episode 3 fix lived in cmake invocation #1 (test_dflash), not in cmake invocation #2 (llama-server).
Classic Dockerfile mistake: each RUN is its own layer, and each cmake -B build -S . starts from a fresh CMakeCache.txt that inherits nothing from the other build tree. ENV survives across layers; a -D flag lives and dies with the build directory it was passed to.
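In hindsight there was a way to state the fix once for every invocation: CMake documents that, on a first configure, the LDFLAGS environment variable seeds CMAKE_EXE_LINKER_FLAGS and CMAKE_SHARED_LINKER_FLAGS. I haven't re-run the build this way, but a single ENV near the top of the Dockerfile should cover every cmake configure downstream:

ENV LDFLAGS="-Wl,-rpath-link,/usr/local/cuda/lib64/stubs"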
In the moment, though, I just copy-pasted the fix into the second invocation:
WORKDIR /src/lucebox/dflash/deps/llama.cpp
RUN cmake -B build -S . \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_CUDA=ON \
-DCMAKE_CUDA_ARCHITECTURES="120" \
-DLLAMA_BUILD_SERVER=ON \
-DLLAMA_CURL=ON \
-DCMAKE_EXE_LINKER_FLAGS="-Wl,-rpath-link,/usr/local/cuda/lib64/stubs" \
-DCMAKE_SHARED_LINKER_FLAGS="-Wl,-rpath-link,/usr/local/cuda/lib64/stubs" \
&& cmake --build build --target llama-server -j $(nproc)
And of course adding these flags busts the next layer’s cache — so back to a full 2h compile on the submodule. Cool cool cool.
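A mitigation I didn't have in place, so treat this as a sketch rather than the build I ran: a BuildKit cache mount plus ccache. The cache mount persists across rebuilds even when the layer itself is invalidated, so changing a flag mostly relinks instead of recompiling the world. Same -D flags as before, with two launcher variables added (ccache's nvcc support is the shakiest assumption here):

RUN apt-get update && apt-get install -y --no-install-recommends ccache
RUN --mount=type=cache,target=/root/.ccache \
    cmake -B build -S . \
    -DCMAKE_BUILD_TYPE=Release \
    -DGGML_CUDA=ON \
    -DCMAKE_CUDA_ARCHITECTURES="120" \
    -DLLAMA_BUILD_SERVER=ON \
    -DLLAMA_CURL=ON \
    -DCMAKE_EXE_LINKER_FLAGS="-Wl,-rpath-link,/usr/local/cuda/lib64/stubs" \
    -DCMAKE_SHARED_LINKER_FLAGS="-Wl,-rpath-link,/usr/local/cuda/lib64/stubs" \
    -DCMAKE_CXX_COMPILER_LAUNCHER=ccache \
    -DCMAKE_CUDA_COMPILER_LAUNCHER=ccache \
    && cmake --build build --target llama-server -j $(nproc)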
Build #5
I should've counted earlier. Build #1: 2h13 (broken, episode 2). Build #2: 2h (broken, episode 3, same error). Build #3: 56 min (test_dflash finally linked). Build #4: 67 min (llama-server link broken, as above). Roughly 6 hours of cumulative compile time (2h13 + 2h + 56 min + 67 min = 6h16) just for the link flags.
This time it goes through. The final Dockerfile shape:
FROM nvidia/cuda:13.0.0-devel-ubuntu22.04 AS builder
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -y --no-install-recommends \
git build-essential cmake ninja-build pkg-config \
libcurl4-openssl-dev ca-certificates python3-pip python3-dev \
&& rm -rf /var/lib/apt/lists/*
# Fix #1: stub libcuda.so.1 at link time
ENV LIBRARY_PATH="/usr/local/cuda/lib64/stubs:${LIBRARY_PATH}"
RUN ln -sf /usr/local/cuda/lib64/stubs/libcuda.so \
/usr/local/cuda/lib64/stubs/libcuda.so.1
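# Why the symlink: the stubs dir ships only libcuda.so, but ld chases
# indirect deps by SONAME (libcuda.so.1); at runtime the NVIDIA runtime
# mounts the real driver, which provides libcuda.so.1 itself.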
WORKDIR /src
RUN git clone --depth 1 --recurse-submodules \
https://github.com/Luce-Org/lucebox-hub /src/lucebox
# Fix #2 + #3: rpath-link everywhere, in both cmake invocations
WORKDIR /src/lucebox/dflash
RUN cmake -B build -S . \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_CUDA=ON \
-DCMAKE_CUDA_ARCHITECTURES="120" \
-DCMAKE_EXE_LINKER_FLAGS="-Wl,-rpath-link,/usr/local/cuda/lib64/stubs" \
-DCMAKE_SHARED_LINKER_FLAGS="-Wl,-rpath-link,/usr/local/cuda/lib64/stubs" \
&& cmake --build build --target test_dflash -j $(nproc)
WORKDIR /src/lucebox/dflash/deps/llama.cpp
RUN cmake -B build -S . \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_CUDA=ON \
-DCMAKE_CUDA_ARCHITECTURES="120" \
-DLLAMA_BUILD_SERVER=ON -DLLAMA_CURL=ON \
-DCMAKE_EXE_LINKER_FLAGS="-Wl,-rpath-link,/usr/local/cuda/lib64/stubs" \
-DCMAKE_SHARED_LINKER_FLAGS="-Wl,-rpath-link,/usr/local/cuda/lib64/stubs" \
&& cmake --build build --target llama-server -j $(nproc)
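One check I'd add before shipping (my addition, not part of the build above; the library path is taken from the link error earlier, and the exact filename may carry a version suffix): confirm the stub satisfied the linker without leaking into the runtime search path. -rpath-link is strictly link-time, so the driver should show up as a plain NEEDED entry and no RUNPATH should point into the stubs directory:

RUN objdump -p build/bin/libggml-cuda.so | grep -E 'NEEDED|RUNPATH'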
And the final push:
==> Pushing aamsellem/lucebox-qwen36-blackwell:1.0.0
...
==> Done. Image: docker.io/aamsellem/lucebox-qwen36-blackwell:1.0.0
🎉 Public image on Docker Hub. It compiles. It links. It pushes. All good.
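If you want to poke at it, the shape of a local run is below. Hedged heavily: the entrypoint, model path, and flags are assumptions on my part; the real wiring lives in the Helm chart I cover next.

docker run --rm --gpus all -p 8080:8080 \
  docker.io/aamsellem/lucebox-qwen36-blackwell:1.0.0 \
  llama-server -m /models/model.gguf --host 0.0.0.0 --port 8080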
Now I package a Helm chart for Olares, add the model download (unsloth/Qwen3.6-27B-Q4_K_M.gguf plus the z-lab/Qwen3.6-27B-DFlash drafter), and deploy. The pod boots, downloads the ~20 GB of models, and starts llama-server. ggml detects the GPU:
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 24463 MiB):
Device 0: NVIDIA GeForce RTX 5090 Laptop GPU, compute capability 12.0,
VMM: yes, VRAM: 24463 MiB
And then…
[HAMI-core ERROR (...)]: Illegal device id: -644371744
The pod dies. CrashLoopBackOff. Restart, and -644371744 becomes -39296272, then 1816936528. Random.
I have no idea what’s going on.
Episode 5 — the runtime slams the door on us, see you next time!
Disclosure — All the benchmarks in this post run on my own Olares One. If the content was useful and you’re considering one, ordering through this referral link gets you $400 off ($3,599 instead of $3,999) and pays me $200. I’m mentioning this out of transparency — and yes, incidentally, it helps keep the blog alive (hosting, domain, and the time I spend writing here). Link valid until late June 2026.