Hi there.
You know those Reddit posts that stop your evening scroll cold? Last week’s was this one on r/LocalLLaMA:
“135 t/s on Qwen3.6 27B Q5 + vision + 200K context on a single RTX 3090.”
135 tokens per second. On an RTX 3090. With 200K context. And the bit that stings most: it came from a fork called BeeLlama.cpp that nobody had heard of two weeks earlier.
My best path on Olares One (RTX 5090M Laptop, 24 GB sm_120 consumer mobile Blackwell) tops out at 88 t/s with vLLM + Genesis (the 28-monkey-patch stack from Sandermage we’ve been dragging since April) + MTP n=3, or 72.75 t/s at 262K full context with llama.cpp + MTP. So either that Reddit claim was cherry-picked, or I was leaving 50% of throughput on the table without knowing.
I tested it. Not cherry-picked. 107 t/s at 262K full, +48% over my fastest long-context path. Here’s the story.
## Quick recap for newcomers
On this blog I’ve been publishing local LLM inference benches since April on Olares One (a mini-PC with an RTX 5090M 24 GB — my personal lab). Three names show up a lot:
- Genesis — the custom vLLM stack from Sandermage, 28 Python patches to maintain, holds me at 88 t/s
- DFlash — the speculative decoding technique via block diffusion drafter (ICLR 2026 paper, code at z-lab)
- MTP — Multi-Token Prediction, the spec decoding head baked natively into Qwen3.6 and Gemma 4
BeeLlama is the last link in a four-fork chain of llama.cpp variants stacked on each other: ggml-org/llama.cpp → TheTom/llama-cpp-turboquant (adds TurboQuant 2/3/4-bit KV cache) → spiritbuun/buun-llama-cpp (adds DFlash) → Anbeeld/beellama.cpp (adds MTP + CopySpec + reasoning-loop protection). None are merged upstream, none publish pre-built Linux images, none target consumer Blackwell sm_120. I was already running the penultimate one (buun-llama-cpp) via my own image — perf capped at 76 t/s at 96K.
When the BeeLlama v0.1.1 release dropped on May 11, it claimed MTP support on top of the inherited TurboQuant + DFlash. Reddit was announcing 135 t/s.
## The morning: 50 minutes of QEMU build
Olares One runs CUDA 13 and the RTX 5090M is sm_120 consumer mobile Blackwell — not sm_121 (DGX Spark) or sm_100 (B100/B200 datacenter). I need a specific image. My Mac is arm64, so Docker buildx with qemu emulation.
```shell
cd ~/dev/beellama.cpp   # cloned -b v0.1.1
docker buildx build \
  --platform linux/amd64 \
  --build-arg CUDA_VERSION=13.0.0 \
  --build-arg CUDA_DOCKER_ARCH=120 \
  -f .devops/cuda.Dockerfile \
  --target server \
  -t aamsellem/beellama-cpp:0.1.1 .
```
50 minutes of cross-compilation. Template instantiation for CUDA under QEMU is slow. Final image: 2.67 GB. Push to Docker Hub. While it ran, I re-read the BeeLlama docs and prepped my Helm chart.
## Noon: the first crash
Pod deployed. Target loads… and immediate fail:
```
done_getting_tensors: wrong number of tensors; expected 866, got 862
```
Ouch. I had havenoammo/Qwen3.6-27B-MTP-UD-GGUF cached on the device (the MTP-baked GGUF I use for the long-context MTP app, story in the havenoammo post). The MTP head bakes 4 extra tensors into the GGUF that BeeLlama’s loader doesn’t recognize. The recipe wants the non-MTP unsloth variant — BeeLlama uses DFlash spec decoding, not MTP, for Qwen3.6.
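If you have several Qwen3.6 GGUFs cached, you can check which one carries the extra tensors without a full deploy cycle: the tensor count sits right in the GGUF file header. A minimal parser, following the GGUF v3 header layout (4-byte magic, uint32 version, uint64 tensor count, uint64 metadata KV count, all little-endian):

```python
import struct

def gguf_tensor_count(header: bytes) -> int:
    """Return the tensor count from the first 24 bytes of a GGUF file.

    GGUF v3 header layout: 4-byte magic b'GGUF', uint32 version,
    uint64 tensor_count, uint64 metadata_kv_count (little-endian).
    """
    magic, version, n_tensors, n_kv = struct.unpack("<4sIQQ", header[:24])
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return n_tensors
```

Run it on the first 24 bytes of each cached file: the MTP-baked variant reports 4 more tensors than the plain unsloth quant, which is exactly the mismatch the loader choked on.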
Swap target → unsloth/Qwen3.6-27B-UD-Q3_K_XL.gguf (14.5 GB). Drafter kept: spiritbuun/Qwen3.6-27B-DFlash-GGUF Q8_0 (1.85 GB). Pod redeploys. This time it boots.
## The config that works
To reproduce:
```
TARGET_URL: unsloth/Qwen3.6-27B-UD-Q3_K_XL.gguf       # 14.5 GB
DRAFT_URL:  spiritbuun/Qwen3.6-27B-DFlash-GGUF (q8_0) # 1.85 GB

# Server args
--spec-type dflash
--spec-dflash-cross-ctx 1024
-ngl 99 --spec-draft-ngl 99
--ctx-size 262144
--cache-type-k turbo3 --cache-type-v turbo3  # 3-bit Walsh-Hadamard rotated KV
--batch-size 2048 --ubatch-size 256
--parallel 1 --kv-unified
--flash-attn on --jinja --no-mmap --mlock
--temp 0.6 --top-k 20 --min-p 0.0
```
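A word on the `turbo3` cache type: it rotates each K/V vector with a Walsh-Hadamard transform before quantizing, which spreads outliers across dimensions so a 3-bit grid loses less precision. I don't know TurboQuant's exact kernel, but the transform at the heart of the idea is the standard fast Walsh-Hadamard transform, sketched here for illustration:

```python
def fwht(vec):
    """In-place fast Walsh-Hadamard transform (unnormalized).

    Rotating a K/V vector this way smears outlier values across
    all dimensions, which makes aggressive low-bit quantization
    of the rotated vector much less lossy.
    """
    n = len(vec)
    assert n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                x, y = vec[j], vec[j + h]
                vec[j], vec[j + h] = x + y, x - y
        h *= 2
    return vec
```

The unnormalized transform applied twice gives back n times the original vector, so the rotation itself is exactly invertible: the only information lost is whatever the 3-bit quantizer throws away after the rotation.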
## The moment
First bench at 96K (the docs' default): ten consecutive Space Invaders runs, max_tokens=800.

```
AVG 106.67 t/s [97.84 - 115.36]
```
Bingo, already in the announced ballpark. I push to 128K. Then 200K. Then 262K — Qwen3.6’s native maximum context.
| Context | Runs | AVG t/s | Range | KV cache (turbo3) |
|---|---|---|---|---|
| 96K | 10 | 106.67 | 97.84 – 115.36 | ~3 GB |
| 128K | 5 | 116.0 | 107.12 – 127.32 | ~4 GB |
| 200K | 5 | 108.5 | 100.51 – 122.82 | ~6 GB |
| 262K (full native) | 10 | 107.54 | 101.70 – 119.38 | ~8 GB |
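For anyone rebuilding the harness: each row above is just consecutive completions against the server's OpenAI-compatible `/v1/chat/completions` endpoint, reduced to an average and a range. A minimal sketch of that loop (the endpoint address, prompt text, and the use of wall-clock timing rather than the server's own timing fields are my assumptions, not the exact harness):

```python
import json
import time
import urllib.request

API = "http://localhost:8080/v1/chat/completions"  # assumed server address
PROMPT = "Write a Space Invaders clone in a single HTML file."

def summarize(rates):
    """Reduce per-run token rates to the AVG [min - max] line used above."""
    avg = sum(rates) / len(rates)
    return f"AVG {avg:.2f} t/s [{min(rates):.2f} - {max(rates):.2f}]"

def bench(runs=10, max_tokens=800):
    """Run consecutive completions and report decode throughput."""
    rates = []
    for _ in range(runs):
        body = json.dumps({
            "messages": [{"role": "user", "content": PROMPT}],
            "max_tokens": max_tokens,
            "temperature": 0.6,
        }).encode()
        req = urllib.request.Request(
            API, data=body, headers={"Content-Type": "application/json"})
        t0 = time.perf_counter()
        with urllib.request.urlopen(req) as resp:
            out = json.load(resp)
        dt = time.perf_counter() - t0
        rates.append(out["usage"]["completion_tokens"] / dt)
    return summarize(rates)
```

Wall-clock timing includes prefill, so on very long contexts the server-reported per-token decode timings will read slightly higher than this.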
Perf is essentially flat across context sizes. The turbo3 KV cache (3-bit Walsh-Hadamard rotated) compresses aggressively enough that even at full 262K context, the whole stack fits in 24 GB of VRAM with margin. Zero degradation cycle like the one I documented on Gemma 4 DFlash, zero runtime CUDA OOM. The 128K spike is a curiosity — probably the batches aligning better with the cudagraph capture sizes.
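The "fits with margin" claim is easy to sanity-check with cache arithmetic. The shape numbers below are my guesses for a Qwen3.6-27B-class GQA model (layer count, KV heads, head dim are not published figures), and I'm assuming roughly 0.5 bits/value of scale overhead on top of turbo3's 3 payload bits; the point is the formula, not the exact gigabytes:

```python
def kv_cache_gib(ctx, n_layers, n_kv_heads, head_dim, bits_per_value):
    """Total KV cache size in GiB: 2 tensors (K and V) per layer,
    one head_dim vector per KV head per cached token."""
    values = 2 * n_layers * n_kv_heads * head_dim * ctx
    return values * bits_per_value / 8 / 2**30

# Hypothetical 27B-ish shape: 40 layers, 8 KV heads, head_dim 128
full_ctx = 262144
turbo3 = kv_cache_gib(full_ctx, 40, 8, 128, 3.5)  # 8.75 GiB
q4_0   = kv_cache_gib(full_ctx, 40, 8, 128, 4.5)  # 11.25 GiB
```

With those guessed dimensions the turbo3 figure lands near the ~8 GB the server actually reports at 262K, while q4_0 would cost about 2.5 GiB more, which is exactly the headroom that separates "fits with margin" from "OOM-adjacent" next to a 14.5 GB target plus drafter.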
## The ranking after this night

| Stack | t/s avg | Context | Maintenance cost |
|---|---|---|---|
| llama.cpp standard (no spec) | 33-36 | 32K | pure upstream |
| llamacppqwen36dflashone (buun-llama-cpp) | 76 | 96K | custom image |
| llamacppqwen36mtpone (am17an MTP) | 72.75 | 262K | 1 open PR + custom image |
| vllmqwen36turbo27bone (Genesis) | 88 | 88K | 28 patches + 5 GB image |
| llamacppqwen36beellamaone (BeeLlama) | 107.54 | 262K | one-time custom image |
+48% vs MTP @ 262K, +22% vs vLLM Turbo @ 88K, +40% vs buun DFlash @ 96K. On every dimension, it’s a strict upgrade — speed AND context AND stability.
## Why it's +48% vs MTP at the same context
Three differences between BeeLlama and my llamacppqwen36mtpone v1.0.8 (which ran on the am17an MTP branch):
1. DFlash vs MTP for Qwen3.6. spiritbuun's DFlash Q8_0 drafter has noticeably higher acceptance than the MTP head baked into havenoammo's GGUF. Probably because z-lab tuned it specifically on Qwen3.6's output distribution, whereas the MTP head is generic.
2. turbo3 KV vs q4_0 KV. turbo3 is 3-bit with Walsh-Hadamard rotation, ~25% smaller than q4_0 (which is actually 4.5 bpv). Less memory pressure means a bigger compute buffer, bigger batches, and better throughput. The am17an MTP branch doesn't support turbo3, so I'm stuck on q4_0.
3. batch 2048 / ubatch 256 vs 512/512. The BeeLlama recipe uses a top-level batch 4× larger with a smaller ubatch. More prefill per scheduler cycle. The MTP setup is bottlenecked by the number of DFlash dual-buffer sequences and can't go as wide.
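Why acceptance rate dominates the other two factors: under the standard speculative-decoding analysis, with the simplifying assumption that each drafted token is accepted independently with probability α, a k-token draft yields an expected (1 − α^(k+1)) / (1 − α) committed tokens per target forward pass. Small gains in α compound quickly:

```python
def expected_tokens_per_step(alpha, k):
    """Expected tokens committed per target-model forward pass with a
    k-token draft and i.i.d. per-token acceptance probability alpha.

    Geometric series 1 + alpha + alpha^2 + ... + alpha^k: generation
    stops at the first rejected draft token, plus one corrected token.
    """
    return (1 - alpha ** (k + 1)) / (1 - alpha)
```

At k=4, moving α from 0.7 to 0.8 takes the expectation from ~2.77 to ~3.36 tokens per pass, a ~21% decode speedup from drafter quality alone, before any KV or batching effects.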
## What it costs vs what it gains
Costs:
- ~50 min one-time custom Docker build (or pull `aamsellem/beellama-cpp:0.1.1` if you trust me)
- Image is unmaintained relative to upstream: BeeLlama hasn't synced master since April 23, so new llama.cpp builds (b9130+) won't land on this fork unless Anbeeld rebases
- Multi-GPU is broken in this fork (issue #7): single-GPU only, which is fine for Olares One
- Fragile DFlash arch detection (issue #4), confirmed by my crash loading the MTP-baked GGUF
Gains:
- 107 t/s at 262K FULL on Qwen3.6 27B on a single consumer mobile Blackwell GPU 24 GB
- No weight of maintaining the 28 Genesis patches
- No 5-fast/4-slow degradation cycle
- No runtime CUDA OOM on long context
## The lesson from tonight
That Reddit claim I thought was ridiculous a week ago — it was correct. And the reason I was at 88 t/s instead of 107 wasn’t a hardware question (I have a 5090M, he had a 3090). It was a question of knowing which fork to install among four undocumented stack levels on consumer Blackwell.
BeeLlama just made three apps in my catalog obsolete (llamacppqwen36dflashone, llamacppqwen36mtpone, and largely vllmqwen36turbo27bone) in a single night of building. That's the third time in six weeks that a niche fork nobody follows has unlocked perf nobody else can reproduce on this hardware (spiritbuun in May, havenoammo on May 8, Anbeeld today).
The reflex I keep building: before patching code, scan tier-3 forks on GitHub. UD, Dynamic, MTP-preserved, Heretic, BeeLlama. There’s an ecosystem of quiet people doing extremely precise things, who don’t make noise on Twitter or Reddit.
## To reproduce
App llamacppqwen36beellamaone v1.0.1 on my Olares market source:
```
Olares Market → Settings → Add source
https://orales-one-market.aamsellem.workers.dev
```
Everything is in orales-one-market — Helm chart, exact image tag, all flags, bench harness. The app appears within 5 minutes.
If you run another consumer Blackwell card (5070 Ti, 5080, desktop 5090, 5090M), the aamsellem/beellama-cpp:0.1.1 image should work for any sm_120. A desktop 5090 with 32 GB and 1.79 TB/s bandwidth should land around 150-180 t/s if the mobile→desktop scaling for this stack holds. If you test, share your numbers.
## Next steps
- Watch for BeeLlama’s master sync (to pick up am17an’s MTP work that’ll eventually merge)
- Test CopySpec (BeeLlama’s unique addition on top of buun) on structured-output workloads like JSON tool-call replay — my current bench is open-ended HTML generation where CopySpec has few matches
- Re-test with q4_0 KV instead of turbo3 to isolate whether the gain comes from turbo3 or from the engine's other optimizations
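Before building that CopySpec bench, I want a cheap proxy for how "copyable" a workload actually is. I don't know the fork's exact matching rule, so the sketch below simply measures the fraction of output tokens covered by n-grams that also appear verbatim in the prompt (the n=8 window and token-level matching are arbitrary choices of mine, not CopySpec's algorithm):

```python
def copyable_fraction(prompt_tokens, output_tokens, n=8):
    """Fraction of output tokens covered by length-n spans that also
    occur verbatim in the prompt: a rough proxy for how often a
    copy-from-context speculator could propose useful drafts."""
    if len(output_tokens) < n:
        return 0.0
    prompt_ngrams = {tuple(prompt_tokens[i:i + n])
                     for i in range(len(prompt_tokens) - n + 1)}
    covered = [False] * len(output_tokens)
    for i in range(len(output_tokens) - n + 1):
        if tuple(output_tokens[i:i + n]) in prompt_ngrams:
            for j in range(i, i + n):
                covered[j] = True
    return sum(covered) / len(output_tokens)
```

JSON tool-call replay should score far higher on this metric than open-ended HTML generation, which would explain why my current bench barely exercises CopySpec.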
That’s it! If you reproduce these 107 t/s or find an even sweeter spot on another consumer Blackwell, send me your numbers. See you next time!
Disclosure — All the benchmarks in this post run on my own Olares One. If the content was useful and you’re considering one, ordering through this referral link gets you $400 off ($3,599 instead of $3,999) and pays me $200. I’m mentioning this out of transparency — and yes, incidentally, it helps keep the blog alive (hosting, domain, and the time I spend writing here). Link valid until late June 2026.