Aurélien AMSELLEM

Lucebox on Olares One — Episode 4: The llama-server submodule serves it up to you 1h later

test_dflash compiles, great. But to serve over HTTP I need llama-server, which compiles from the submodule. And the submodule has its own cmake invocation — where I forgot to add -rpath-link. And boom, 1h later, here we go again.

Episode 3 — I figured out that LIBRARY_PATH doesn’t resolve indirect dependencies, switched to -Wl,-rpath-link, and test_dflash finally linked.
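The distinction is worth spelling out: `-L` paths (including those injected via `LIBRARY_PATH`) are only searched for libraries named directly on the link line; when ld then chases the `NEEDED` entries *of* those libraries, it does not consult `-L` at all, only `-rpath-link` paths (and a few defaults). A minimal sketch of the difference — object and library names here are illustrative, not the actual build's:

```shell
# Direct dependency: ld finds libggml-cuda.so via -L (or LIBRARY_PATH).
# Indirect dependency: libggml-cuda.so itself NEEDs libcuda.so.1, and ld
# does NOT search -L paths for that second hop.
g++ test_dflash.o -L./build/bin -lggml-cuda \
    -Wl,-rpath-link,/usr/local/cuda/lib64/stubs \
    -o test_dflash
# -rpath-link only helps ld resolve libcuda.so.1 at LINK time;
# unlike -rpath, it embeds nothing in the binary for runtime lookup.
```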

Except test_dflash is a minimal CLI: it takes a prompt, dumps tokens. To get an OpenAI-compatible endpoint (so you can plug in Continue, Roo, your usual client), I need llama-server, which lives in Luce-Org/llama.cpp@luce-dflash (the llama.cpp fork embedded as a submodule of Lucebox).
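Once llama-server is running, any OpenAI-compatible client can talk to it. A quick smoke test against the chat completions endpoint — assuming the default port 8080; adjust host and port to your deployment:

```shell
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "Say hello"}]
  }'
```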

So a second cmake invocation in the Dockerfile:

WORKDIR /src/lucebox/dflash/deps/llama.cpp
RUN cmake -B build -S . \
    -DCMAKE_BUILD_TYPE=Release \
    -DGGML_CUDA=ON \
    -DCMAKE_CUDA_ARCHITECTURES="120" \
    -DLLAMA_BUILD_SERVER=ON \
    -DLLAMA_CURL=ON \
    && cmake --build build --target llama-server -j $(nproc)

See the bug? Me neither, in the moment. I just rebuilt.

67 minutes later

[100%] Linking CXX executable llama-server
/usr/bin/ld: ../../bin/libggml-cuda.so.0.9.11: undefined reference to `cuMemCreate'
/usr/bin/ld: ../../bin/libggml-cuda.so.0.9.11: undefined reference to `cuMemAddressReserve'
...

Same 11 undefined references as episode 2. Except this time I just forgot to pass -DCMAKE_EXE_LINKER_FLAGS=... and -DCMAKE_SHARED_LINKER_FLAGS=... to the second cmake invocation. The episode 3 fix was in cmake #1 (test_dflash), not in cmake #2 (llama-server).
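Those unresolved `cuMem*` symbols are visible directly in the shared library's dynamic symbol table. A way to confirm which driver-API symbols a `.so` leaves undefined (using the library path from the error above; run from the build directory):

```shell
# "U" entries are undefined: they must be satisfied by another library.
# The cu* driver-API symbols live in libcuda.so.1 (the driver library),
# which is exactly what the CUDA stubs directory stands in for at build time.
nm -D --undefined-only bin/libggml-cuda.so.0.9.11 | grep ' cu'
```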

Classic mistake: each RUN is its own Docker layer, and each cmake -B build -S . is a fresh configure step with its own CMakeCache.txt that inherits nothing from any other build tree. Flags passed to one invocation stay in that invocation's build directory.
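An easy way to see that nothing leaks from one configure to the other is to grep both caches; the flag only exists in the build tree it was explicitly passed to (paths as in the Dockerfile):

```shell
# Linker flags live only in the cache of the build tree that received them.
grep CMAKE_EXE_LINKER_FLAGS /src/lucebox/dflash/build/CMakeCache.txt
grep CMAKE_EXE_LINKER_FLAGS /src/lucebox/dflash/deps/llama.cpp/build/CMakeCache.txt
```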

Need to copy-paste the fix to the second invocation:

WORKDIR /src/lucebox/dflash/deps/llama.cpp
RUN cmake -B build -S . \
    -DCMAKE_BUILD_TYPE=Release \
    -DGGML_CUDA=ON \
    -DCMAKE_CUDA_ARCHITECTURES="120" \
    -DLLAMA_BUILD_SERVER=ON \
    -DLLAMA_CURL=ON \
    -DCMAKE_EXE_LINKER_FLAGS="-Wl,-rpath-link,/usr/local/cuda/lib64/stubs" \
    -DCMAKE_SHARED_LINKER_FLAGS="-Wl,-rpath-link,/usr/local/cuda/lib64/stubs" \
    && cmake --build build --target llama-server -j $(nproc)

And of course adding these flags busts the next layer’s cache — so back to a full 2h compile on the submodule. Cool cool cool.
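One way to soften these full rebuilds — a sketch, not what this post's Dockerfile does — is BuildKit's cache mounts plus ccache, so compiled objects survive an invalidated layer:

```dockerfile
# Requires BuildKit. The ccache store lives in a cache mount that
# outlives busted layers, so only changed files recompile.
RUN apt-get update && apt-get install -y --no-install-recommends ccache
RUN --mount=type=cache,target=/root/.ccache \
    cmake -B build -S . \
      -DCMAKE_BUILD_TYPE=Release \
      -DGGML_CUDA=ON \
      -DCMAKE_C_COMPILER_LAUNCHER=ccache \
      -DCMAKE_CXX_COMPILER_LAUNCHER=ccache \
      -DCMAKE_CUDA_COMPILER_LAUNCHER=ccache \
    && cmake --build build --target llama-server -j $(nproc)
```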

Build #6

I should’ve counted earlier. Build #1: 2h13 (broken episode 2). Build #2: 2h (broken episode 3, same error). Build #3: 56 min (test_dflash linked, but llama-server link broken). Build #4: 67 min (llama-server linked after fix). Roughly 6 hours of cumulative compile time just for the link flags.

This time it goes through. The final Dockerfile shape:

FROM nvidia/cuda:13.0.0-devel-ubuntu22.04 AS builder

ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -y --no-install-recommends \
    git build-essential cmake ninja-build pkg-config \
    libcurl4-openssl-dev ca-certificates python3-pip python3-dev \
    && rm -rf /var/lib/apt/lists/*

# Fix #1: stub libcuda.so.1 at link time
ENV LIBRARY_PATH="/usr/local/cuda/lib64/stubs:${LIBRARY_PATH}"
RUN ln -sf /usr/local/cuda/lib64/stubs/libcuda.so \
           /usr/local/cuda/lib64/stubs/libcuda.so.1

WORKDIR /src
RUN git clone --depth 1 --recurse-submodules \
    https://github.com/Luce-Org/lucebox-hub /src/lucebox

# Fix #2 + #3: rpath-link everywhere, in both cmake invocations
WORKDIR /src/lucebox/dflash
RUN cmake -B build -S . \
    -DCMAKE_BUILD_TYPE=Release \
    -DGGML_CUDA=ON \
    -DCMAKE_CUDA_ARCHITECTURES="120" \
    -DCMAKE_EXE_LINKER_FLAGS="-Wl,-rpath-link,/usr/local/cuda/lib64/stubs" \
    -DCMAKE_SHARED_LINKER_FLAGS="-Wl,-rpath-link,/usr/local/cuda/lib64/stubs" \
    && cmake --build build --target test_dflash -j $(nproc)

WORKDIR /src/lucebox/dflash/deps/llama.cpp
RUN cmake -B build -S . \
    -DCMAKE_BUILD_TYPE=Release \
    -DGGML_CUDA=ON \
    -DCMAKE_CUDA_ARCHITECTURES="120" \
    -DLLAMA_BUILD_SERVER=ON -DLLAMA_CURL=ON \
    -DCMAKE_EXE_LINKER_FLAGS="-Wl,-rpath-link,/usr/local/cuda/lib64/stubs" \
    -DCMAKE_SHARED_LINKER_FLAGS="-Wl,-rpath-link,/usr/local/cuda/lib64/stubs" \
    && cmake --build build --target llama-server -j $(nproc)
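A sanity check worth running after the build (a sketch): because -rpath-link, unlike -rpath, embeds nothing in the binary, the image should resolve libcuda.so.1 from the NVIDIA container runtime at startup, never from the stubs. readelf shows whether any stub path leaked into the dynamic section:

```shell
# No RPATH/RUNPATH entry pointing at .../stubs should appear here;
# at runtime the NVIDIA container runtime injects the real libcuda.so.1.
readelf -d build/bin/llama-server | grep -Ei 'rpath|runpath'
```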

And the final push:

==> Pushing aamsellem/lucebox-qwen36-blackwell:1.0.0
...
==> Done. Image: docker.io/aamsellem/lucebox-qwen36-blackwell:1.0.0

🎉 Public image on Docker Hub. It compiles. It links. It pushes. All good.

Now I package a Helm chart for Olares, add the model download (unsloth/Qwen3.6-27B-Q4_K_M.gguf plus the z-lab/Qwen3.6-27B-DFlash drafter), and deploy. The pod boots, downloads the ~20 GB of models, and starts llama-server. ggml detects the GPU:

ggml_cuda_init: found 1 CUDA devices (Total VRAM: 24463 MiB):
  Device 0: NVIDIA GeForce RTX 5090 Laptop GPU, compute capability 12.0,
            VMM: yes, VRAM: 24463 MiB

And then…

[HAMI-core ERROR (...)]: Illegal device id: -644371744

The pod dies. CrashLoopBackOff. Restart, and -644371744 becomes -39296272, then 1816936528. Random.

I have no idea what’s going on.

Episode 5 — the runtime slams the door on us, see you next time!


Disclosure — All the benchmarks in this post run on my own Olares One. If the content was useful and you’re considering one, ordering through this referral link gets you $400 off ($3,599 instead of $3,999) and pays me $200. I’m mentioning this out of transparency — and yes, incidentally, it helps keep the blog alive (hosting, domain, and the time I spend writing here). Link valid until late June 2026.
