Aurélien AMSELLEM

BeeLlama tested on Olares One — 107 t/s at 262K full, +48% over my best path

Last week on r/LocalLLaMA, a post claimed 135 t/s on Qwen3.6 27B Q5 + 200K context on a single RTX 3090, via a fork called BeeLlama.cpp. Ridiculous if true — my best path on Olares One topped out at 88 t/s. I tested it. Spoiler: 107 t/s at the full 262K context, zero OOM, zero degradation. +48% over my fastest path. The story of a qemu build, and of three apps in my catalog made obsolete in one night.

Hi there.

You know those Reddit posts that stop your evening scroll cold? Last week’s was this one on r/LocalLLaMA:

“135 t/s on Qwen3.6 27B Q5 + vision + 200K context on a single RTX 3090.”

135 tokens per second. On a single RTX 3090. With 200K context. And the detail that stings: a fork called BeeLlama.cpp that nobody had heard of two weeks earlier.

My best path on Olares One (RTX 5090M Laptop, 24 GB, sm_120 consumer mobile Blackwell) tops out at 88 t/s with vLLM + Genesis (the 28-monkey-patch stack from Sandermage we’ve been dragging since April) + MTP n=3, or 72.75 t/s at 262K full context with llama.cpp + MTP. So either that Reddit claim was cherry-picked, or I was leaving 50% of throughput on the table without knowing it.

I tested it. Not cherry-picked. 107 t/s at 262K full, +48% over my fastest long-context path. Here’s the story.

Quick recap for newcomers

On this blog I’ve been publishing local LLM inference benches since April on Olares One (a mini-PC with an RTX 5090M 24 GB — my personal lab). Three names show up a lot:

BeeLlama is the last link in a four-fork chain of llama.cpp variants stacked on each other: ggml-org/llama.cpp → TheTom/llama-cpp-turboquant (adds TurboQuant 2/3/4-bit KV cache) → spiritbuun/buun-llama-cpp (adds DFlash) → Anbeeld/beellama.cpp (adds MTP + CopySpec + reasoning-loop protection). None are merged upstream, none publish pre-built Linux images, none target consumer Blackwell sm_120. I was already running the penultimate one (buun-llama-cpp) via my own image — perf capped at 76 t/s at 96K.

When the BeeLlama v0.1.1 release dropped on May 11 at the top of that chain, it claimed MTP support on top of the inherited TurboQuant + DFlash. Reddit was announcing 135 t/s.

The morning: 50 minutes of qemu build

Olares One runs CUDA 13 and the RTX 5090M is sm_120 consumer mobile Blackwell — not sm_121 (DGX Spark) or sm_100 (B100/B200 datacenter). I need a specific image. My Mac is arm64, so Docker buildx with qemu emulation.
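If buildx has never cross-built on this machine, qemu's amd64 handlers need to be registered first. A minimal sketch, assuming Docker Desktop hasn't already done it (the builder name is mine):

docker run --privileged --rm tonistiigi/binfmt --install amd64   # register qemu user-mode emulation for amd64
docker buildx create --name beellama-builder --use               # dedicated builder, docker-container driver
docker buildx inspect --bootstrap                                # should list linux/amd64 among the platforms

Then the build itself: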

cd ~/dev/beellama.cpp                              # cloned -b v0.1.1
docker buildx build \
  --platform linux/amd64 \
  --build-arg CUDA_VERSION=13.0.0 \
  --build-arg CUDA_DOCKER_ARCH=120 \
  -f .devops/cuda.Dockerfile \
  --target server \
  -t aamsellem/beellama-cpp:0.1.1 .

50 minutes of cross-compile; template instantiation for CUDA under qemu is slow. While it ran, I re-read the BeeLlama docs and prepped my Helm chart. Final image: 2.67 GB, pushed to Docker Hub.
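A quick sanity check that the pushed tag really carries an amd64 manifest (nothing BeeLlama-specific, just buildx tooling):

docker buildx imagetools inspect aamsellem/beellama-cpp:0.1.1
# the manifest should come back with linux/amd64 as the platform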

Noon: the first crash

Pod deployed. The target loads… and fails immediately:

done_getting_tensors: wrong number of tensors; expected 866, got 862

Ouch. I had havenoammo/Qwen3.6-27B-MTP-UD-GGUF cached on the device (the MTP-baked GGUF I use for the long-context MTP app, story in the havenoammo post). The MTP head bakes 4 extra tensors into the GGUF that BeeLlama’s loader doesn’t recognize. The recipe wants the non-MTP unsloth variant — BeeLlama uses DFlash spec decoding, not MTP, for Qwen3.6.

Swap target → unsloth/Qwen3.6-27B-UD-Q3_K_XL.gguf (14.5 GB). Drafter kept: spiritbuun/Qwen3.6-27B-DFlash-GGUF Q8_0 (1.85 GB). Pod redeploys. This time it boots.
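For next time: a quick way to see this mismatch coming before a deploy is to read the GGUF headers with the gguf Python package (GGUFReader is its actual API; the file paths below are placeholders for wherever your cache puts them):

pip install gguf

# Count the tensors each file declares; the MTP-baked GGUF should report 4 more than the plain one
python3 -c "from gguf import GGUFReader; print(len(GGUFReader('Qwen3.6-27B-MTP-UD.gguf').tensors))"
python3 -c "from gguf import GGUFReader; print(len(GGUFReader('Qwen3.6-27B-UD-Q3_K_XL.gguf').tensors))"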

The config that works

To reproduce:

TARGET_URL: unsloth/Qwen3.6-27B-UD-Q3_K_XL.gguf   # 14.5 GB
DRAFT_URL:  spiritbuun/Qwen3.6-27B-DFlash-GGUF (q8_0)  # 1.85 GB

# Server args
--spec-type dflash
--spec-dflash-cross-ctx 1024
-ngl 99 --spec-draft-ngl 99
--ctx-size 262144
--cache-type-k turbo3 --cache-type-v turbo3   # 3-bit Walsh-Hadamard rotated KV
--batch-size 2048 --ubatch-size 256
--parallel 1 --kv-unified
--flash-attn on --jinja --no-mmap --mlock
--temp 0.6 --top-k 20 --min-p 0.0
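For anyone running this outside my Helm chart, the same recipe assembled into a single llama-server call would look roughly like this. The paths, host, and port are mine (assumptions), -m/-md are stock llama.cpp's target/draft-model flags which I assume the fork keeps, and every other flag is copied from the block above:

./llama-server \
  -m  /models/Qwen3.6-27B-UD-Q3_K_XL.gguf \
  -md /models/Qwen3.6-27B-DFlash-q8_0.gguf \
  --spec-type dflash --spec-dflash-cross-ctx 1024 \
  -ngl 99 --spec-draft-ngl 99 \
  --ctx-size 262144 \
  --cache-type-k turbo3 --cache-type-v turbo3 \
  --batch-size 2048 --ubatch-size 256 \
  --parallel 1 --kv-unified \
  --flash-attn on --jinja --no-mmap --mlock \
  --temp 0.6 --top-k 20 --min-p 0.0 \
  --host 0.0.0.0 --port 8080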

The moment

First bench at 96K (the docs default). I run ten consecutive Space Invaders runs, max_tokens=800.
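The harness is nothing fancy. A minimal sketch of the loop, assuming the fork keeps stock llama-server's /completion endpoint and its timings block (the Space Invaders prompt wording is mine, and max_tokens maps to n_predict on this endpoint):

# Ten consecutive runs; decode speed read from the server's own timings
for i in $(seq 1 10); do
  curl -s http://localhost:8080/completion \
    -d '{"prompt": "Write a complete Space Invaders game in a single HTML file.", "n_predict": 800}' \
    | jq '.timings.predicted_per_second'
done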

AVG 106.67 t/s [97.84 - 115.36]

Bingo, already in the announced ballpark. I push to 128K. Then 200K. Then 262K — Qwen3.6’s native maximum context.

Context               Runs   AVG t/s   Range             KV cache (turbo3)
96K                   10     106.67    97.84 – 115.36    ~3 GB
128K                  5      116.0     107.12 – 127.32   ~4 GB
200K                  5      108.5     100.51 – 122.82   ~6 GB
262K (full native)    10     107.54    101.70 – 119.38   ~8 GB

Perf is essentially flat across context sizes. The turbo3 KV cache (3-bit Walsh-Hadamard rotated) compresses aggressively enough that even at the full 262K context, the whole stack fits in 24 GB of VRAM with margin. No degradation cycle like the one I documented on Gemma 4 DFlash, and zero runtime CUDA OOMs. The 128K spike is a curiosity — probably the batches aligning better with the cudagraph capture sizes.

The ranking after this night

Stack                                       t/s avg   Context   Maintenance cost
llama.cpp standard (no spec)                33-36     32K       pure upstream
llamacppqwen36dflashone (buun-llama-cpp)    76        96K       custom image
llamacppqwen36mtpone (am17an MTP)           72.75     262K      1 open PR + custom image
vllmqwen36turbo27bone (Genesis)             88        88K       28 patches + 5 GB image
llamacppqwen36beellamaone (BeeLlama)        107.54    262K      custom image, one-time

+48% vs MTP @ 262K, +22% vs vLLM Turbo @ 88K, +41% vs buun DFlash @ 96K. On every dimension, it’s a strict upgrade — speed AND context AND stability.

Why it’s +48% vs MTP at the same context

Three differences between BeeLlama and my llamacppqwen36mtpone v1.0.8 (which ran on the am17an MTP branch):

  1. DFlash vs MTP for Qwen3.6. spiritbuun’s DFlash Q8_0 drafter has noticeably higher acceptance than the MTP head baked into havenoammo’s GGUF. Probably because z-lab tuned it specifically on Qwen3.6’s output distribution, whereas the MTP head is generic.

  2. turbo3 KV vs q4_0 KV. turbo3 is 3-bit with Walsh-Hadamard rotation, ~25% smaller than q4_0 (which is actually 4.5 bpv). Less memory pressure = bigger compute buffer = bigger batch = better throughput. The am17an MTP branch doesn’t support turbo3, so I’m stuck on q4_0.

  3. batch 2048 / ubatch 256 vs 512/512. The BeeLlama recipe uses a top-level batch 4× larger with a smaller ubatch. More prefill per scheduler cycle. The MTP setup is bottlenecked by the number of dual-buffer sequences its speculative path needs and can’t go as wide. (Points 2 and 3 are restated as raw flags just after this list.)
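Side by side, purely as a sketch (the MTP-side values are the ones my v1.0.8 app ran with, per points 2 and 3 above):

# llamacppqwen36mtpone (am17an MTP branch)
--cache-type-k q4_0   --cache-type-v q4_0   --batch-size 512  --ubatch-size 512

# llamacppqwen36beellamaone (BeeLlama)
--cache-type-k turbo3 --cache-type-v turbo3 --batch-size 2048 --ubatch-size 256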

What it costs vs what it gains

Costs:

  - One 50-minute qemu cross-build and a custom image to maintain; nothing upstream ships pre-built sm_120 images.
  - A 14.5 GB re-download: the MTP-baked havenoammo GGUF is out, the plain unsloth Q3_K_XL replaces it.
  - A four-fork chain with nothing merged upstream; if ggml-org moves, someone has to rebase the whole stack.

Gains:

  - 107.54 t/s at the full native 262K context: +48% over my fastest long-context path, +22% over vLLM Turbo.
  - Throughput essentially flat from 96K to 262K, zero runtime CUDA OOM, no degradation cycle.
  - Three catalog apps replaced by one.

The lesson from tonight

That Reddit claim I thought was ridiculous a week ago — it was correct. And the reason I was at 88 t/s instead of 107 wasn’t a hardware question (I have a 5090M, he had a 3090). It was a question of knowing which fork to install among four undocumented stack levels on consumer Blackwell.

BeeLlama just made three apps in my catalog obsolete (llamacppqwen36dflashone, llamacppqwen36mtpone, and largely vllmqwen36turbo27bone) in a single night of building. That’s the third time in six weeks that a niche fork nobody follows has unlocked perf nobody else can reproduce on this hardware (spiritbuun in May, havenoammo on May 8, Anbeeld today).

The reflex I keep building: before patching code, scan tier-3 forks on GitHub. UD, Dynamic, MTP-preserved, Heretic, BeeLlama. There’s an ecosystem of quiet people doing extremely precise things, who don’t make noise on Twitter or Reddit.

To reproduce

App llamacppqwen36beellamaone v1.0.1 on my Olares market source:

Olares Market → Settings → Add source
https://orales-one-market.aamsellem.workers.dev

Everything is in orales-one-market — Helm chart, exact image tag, all flags, bench harness. The app shows up in the market within 5 minutes of adding the source.

If you run another consumer Blackwell card (5070 Ti, 5080, desktop 5090, 5090M), the aamsellem/beellama-cpp:0.1.1 image should work for any sm_120. A desktop 5090 with 32 GB and 1.79 TB/s bandwidth should land around 150-180 t/s if the mobile→desktop scaling for this stack holds. If you test, share your numbers.

Next steps

That’s it! If you reproduce these 107 t/s or find an even sweeter spot on another consumer Blackwell, send me your numbers. See you next time!


Disclosure — All the benchmarks in this post run on my own Olares One. If the content was useful and you’re considering one, ordering through this referral link gets you $400 off ($3,599 instead of $3,999) and pays me $200. I’m mentioning this out of transparency — and yes, incidentally, it helps keep the blog alive (hosting, domain, and the time I spend writing here). Link valid until late June 2026.
