/ airelien.dev
Aurélien AMSELLEM

Gemma 4 12B hits 170 t/s — upstream merge buys +67% speed for free

Two days ago I shipped Gemma 4 12B QAT at 102 t/s on Olares One. Today I ship 170 t/s. Same hardware. Same model file. Same drafter. Same context. The delta: am17an's PR #23398 (Gemma 4 MTP support) merged into llama.cpp upstream at 12:50 UTC. My custom image — a snapshot of the WIP branch at commit dd97604 — was missing 10+ polish commits that ggerganov forced in review. +67% speed on the exact same setup, just by rebasing. Bonus: critical insight on Olares One's nvidia driver capping CUDA at 13.1, blocking the whole upstream Docker ecosystem.

Two days ago I shipped llamacppgemma412bone v1.0.3 on Olares One with Gemma 4 12B QAT: 102.78 t/s.

Yesterday v1.0.4 with Janvitos’s QAT-matched drafter: 101.7 t/s, accept rate +18 pts but speed flat.

Today v1.0.5: 169.76 t/s.

Same 24 GB GPU, same model file, same drafter, same context, same KV cache. The only change: the llama.cpp image.

The context

am17an had opened PR #23398 a few weeks ago to add Gemma 4 MTP support to llama.cpp. While the PR was iterating, I took a snapshot of his branch at commit dd97604 to ship the feature without waiting for upstream merge. That’s the aamsellem/llama-cpp-gemma4mtp:am17an-dd97604 image we’ve been running.

Today at 12:50 UTC, ggerganov merged the PR. The tail of the review cycle was dense: 10+ polish commits in 12 hours.

The dd97604 snapshot missed all of that.

The bench

Three runs of the Space Invaders HTML prompt: single user, vision loaded, MTP n=2 active, same prompt as the v1.0.4 bench.

Run 1: 170.03 t/s | 2000 tokens
Run 2: 169.46 t/s | 2000 tokens
Run 3: 169.78 t/s | 1849 tokens

AVG: 169.76 t/s. vs v1.0.4 baseline 101.7 t/s = +66.9% speed.
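The average and the gain are simple to re-derive from the three runs. A minimal sketch using only the figures quoted in this post; the variable names are mine, and the run-to-run spread is an extra sanity check I added, not part of the original bench:

```python
from statistics import mean

# Per-run throughput in tokens/s, copied from the three runs above.
runs_tps = [170.03, 169.46, 169.78]
# v1.0.4 average, as reported earlier in the post.
baseline_tps = 101.7

avg_tps = mean(runs_tps)
# Relative gain of the new average over the v1.0.4 baseline, in percent.
gain_pct = (avg_tps / baseline_tps - 1) * 100
# Spread between fastest and slowest run, as a percentage of the average.
spread_pct = (max(runs_tps) - min(runs_tps)) / avg_tps * 100

print(f"avg: {avg_tps:.2f} t/s")
print(f"gain vs v1.0.4: +{gain_pct:.1f}%")
print(f"run-to-run spread: {spread_pct:.2f}%")
```

The spread between runs is well under 1%, so the gain is far outside run-to-run variation.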

MTP draft acceptance: 90.4–90.9% (vs 91.4% on v1.0.4, within noise). VRAM usage unchanged at 8.6 GB.

The +67% is pure upstream MTP graph optimization — no precision change, no quant change, no drafter change. The Gemma 4 + MTP compute path got rewritten during review.

The CUDA 13.3 trap

When the PR merged, I thought I could just pull the next official ghcr.io/ggml-org/llama.cpp:server-cuda13-bXXXX post-merge. Except in parallel on the same day, another commit 3f7c79d (PR #24228) bumped the official Dockerfile from CUDA 13.1 → CUDA 13.3. Olares One runs on nvidia driver 590.44.01 which caps at CUDA 13.1.x. Any official ggml-org image post-merge fails at startup:

nvidia-container-cli: requirement error: unsatisfied condition: cuda>=13.3,
please update your driver to a newer version, or use an earlier cuda container

It’s a silent cap that blocks the entire “wait for ggml-org to publish the official image” scenario. Marc (another Olares One user on the same hardware) hits the same wall for his vLLM setup.
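You can check for this before pulling an image. The sketch below mirrors the comparison in the error message: it reads the `CUDA Version: X.Y` field that `nvidia-smi` prints in its header and compares it to what the image requires. The function names and the sample header line are hypothetical (the sample is hand-written for illustration, not captured output), and this is not how nvidia-container-cli implements the check internally:

```python
import re


def max_cuda_from_smi(header: str) -> tuple[int, int]:
    """Extract the 'CUDA Version: X.Y' field from nvidia-smi header text."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", header)
    if match is None:
        raise ValueError("no 'CUDA Version' field found")
    return int(match.group(1)), int(match.group(2))


def image_can_start(driver_max: tuple[int, int], image_needs: tuple[int, int]) -> bool:
    """True when the driver's maximum CUDA version covers the image requirement."""
    return driver_max >= image_needs


# Hand-written sample line for illustration only (assumption, not real output).
sample = "| NVIDIA-SMI 590.44.01   Driver Version: 590.44.01   CUDA Version: 13.1 |"
driver_max = max_cuda_from_smi(sample)

print(image_can_start(driver_max, (13, 3)))  # official image after the CUDA bump
print(image_can_start(driver_max, (13, 1)))  # image built on a CUDA 13.1.1 base
```

With a 13.1 cap, the first check is false and the second is true, which is exactly the gap a custom build has to fill.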

Workaround: build my own image from the ggml-org/llama.cpp main HEAD source, but on nvidia/cuda:13.1.1-devel-ubuntu24.04 base. The llama.cpp source itself is NOT gated on CUDA 13.3 — only the official Dockerfile is. So we capture the merge without paying the bump.

Full Dockerfile:

# Build stage: CUDA 13.1.1 toolchain, which matches the driver cap on Olares One.
FROM nvidia/cuda:13.1.1-devel-ubuntu24.04 AS build
# 120 = compute capability 12.0 (sm_120); change it for other GPUs.
ARG CUDA_DOCKER_ARCH=120
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential cmake git libcurl4-openssl-dev ca-certificates ccache && \
    rm -rf /var/lib/apt/lists/*
WORKDIR /src
# Shallow clone of main HEAD, which now includes the merged MTP PR.
RUN git clone --depth 1 https://github.com/ggml-org/llama.cpp.git .
# Workaround llama.cpp issue #23357 — CUDA::cuda_driver PRIVATE-scope propagation bug:
# explicitly link against the cuda driver stub.
RUN cmake -B build \
    -DGGML_CUDA=ON \
    -DCMAKE_CUDA_ARCHITECTURES=${CUDA_DOCKER_ARCH} \
    -DLLAMA_CURL=ON \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_EXE_LINKER_FLAGS="-L/usr/local/cuda/lib64/stubs -lcuda" \
    -DCMAKE_SHARED_LINKER_FLAGS="-L/usr/local/cuda/lib64/stubs -lcuda"
# Only the server binary is needed in the final image.
RUN cmake --build build --config Release --target llama-server -j$(nproc)

# Runtime stage: smaller image with the binary and its shared libraries only.
FROM nvidia/cuda:13.1.1-runtime-ubuntu24.04
RUN apt-get update && apt-get install -y --no-install-recommends \
    libcurl4 ca-certificates libgomp1 && \
    rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY --from=build /src/build/bin/llama-server /app/llama-server
COPY --from=build /src/build/bin/*.so* /app/
# Let the loader find the shared libraries copied next to the binary.
ENV LD_LIBRARY_PATH=/app
EXPOSE 8080
ENTRYPOINT ["/app/llama-server"]

amd64 build + docker.io push in ~30 min. Final tag: aamsellem/llamacpp-gemma4mtp:main-postmerge-cuda131.

The new 12B leaderboard on Olares One

| Stack | t/s | Context | Vision | Tool calling | VRAM |
|---|---|---|---|---|---|
| Gemma 4 12B QAT + upstream MTP (v1.0.5, today) | 169.76 | 65K | | | 8.6 GB |
| Gemma 4 12B QAT + Janvitos drafter (v1.0.4, yesterday) | 101.7 | 65K | | | 8.6 GB |
| Gemma 4 12B QAT + colefuoco (v1.0.3, 2 days ago) | 102.78 | 65K | | | 8.6 GB |
| Gemma 4 12B Q8_0 baseline (v1.0.2) | 87.5 | 32K | | | ~14 GB |
| Gemma 4 12B no-MTP (v1.0.0) | 47 | 32K | | | ~13 GB |

From the pre-MTP v1.0.0 baseline (47 t/s, 32K): +261% speed and +103% context in 4 versions across 4 days.
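The cumulative numbers come straight from the leaderboard rows. A small sketch, with a helper name of my own choosing, that prints the step-by-step and cumulative speed gains and then redoes the context arithmetic. The power-of-two context sizes in the last lines are an assumption, since the post only gives the rounded 32K and 65K labels:

```python
def gain_pct(new: float, old: float) -> float:
    """Relative change from old to new, in percent."""
    return (new / old - 1) * 100


# (version, tokens/s) taken from the leaderboard, oldest first.
rows = [
    ("v1.0.0", 47.0),
    ("v1.0.2", 87.5),
    ("v1.0.3", 102.78),
    ("v1.0.4", 101.7),
    ("v1.0.5", 169.76),
]

base_tps = rows[0][1]
prev_tps = base_tps
for version, tps in rows:
    step = gain_pct(tps, prev_tps)    # gain over the previous listed version
    total = gain_pct(tps, base_tps)   # gain over the v1.0.0 baseline
    print(f"{version}: {tps:7.2f} t/s  step {step:+6.1f}%  cumulative {total:+6.1f}%")
    prev_tps = tps

# Context gain from the rounded labels in the table (65K vs 32K).
print(f"context, rounded labels: +{gain_pct(65, 32):.0f}%")
# Assumption: if the real sizes are 65536 and 32768 tokens, it is an exact doubling.
print(f"context, power-of-two sizes: +{gain_pct(65536, 32768):.0f}%")
```

The +103% context figure follows from the rounded labels; with power-of-two sizes it would be exactly +100%.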

Why it matters beyond Olares One

The +67% is a pure-upstream gain that every llama.cpp + Gemma 4 + MTP user should capture by pulling the next image. On an RTX 5090 desktop, an RTX 4090, or an RTX 3090, any hardware that ran am17an’s branch should see a similar delta once the upstream Docker tag propagates.

The CUDA 13.3 trap, however, is specific to setups with nvidia driver < 595.x. It probably affects quite a few people: Olares One ships a recent driver, but not the bleeding edge. On older cards or conservative distros, the cap is even lower.

Coda

Two days ago I wrote that “the scoreboard of models that fit on ≤24 GB is changing.” Today the board changes AGAIN — Gemma 4 12B jumps from 102 t/s to 170 t/s on the same hardware without changing a config line. That’s the kind of leap you usually see between hardware generations. Here it’s between two commits.

Hardware stays fixed. Software keeps eating the problem — faster than before.

On Olares One: pull https://orales-one-market.aamsellem.workers.dev, upgrade Gemma 4 12B One to v1.0.5 from the market UI. ~3 GB image pull, 30s cold start.
