Quick followup to yesterday’s BeeLlama 262K post. I claimed 107 t/s at full 262K context on Qwen3.6 27B was a strict upgrade over my previous best paths. A reader asked the obvious next question: “can we add vision?”
Short answer: yes, and it lands at 106 t/s at 200K context with vision active. Three findings worth flagging:
- BeeLlama supports
--mmprojalongside--spec-type dflashfor Qwen3.6 27B. That same combo crashes on Gemma 4 (fattn.cu:1265: fatal error). Fork is target-specific. - 200K context outperforms 128K by +4.4% on this stack. Counter-intuitive but reproducible — same cudagraph alignment effect that gave the 128K sweet spot on text-only.
- 262K + mmproj = OOM (+1 GB short on 24 GB VRAM). The 200K ceiling is hard.
TL;DR
| Stack | t/s avg | Context | Vision | Range | KV @ context |
|---|---|---|---|---|---|
llamacppqwen36beellamaone v1.0.2 (text-only) | 107.54 | 262K | ❌ | 17.7 | ~8 GB |
llamacppqwen36beellamavisionone v1.0.0 | 106.43 | 200K | ✅ | 23.4 | ~6.2 GB |
| BeeLlama vision @ 128K (test only) | 101.98 | 128K | ✅ | 16.9 | ~4 GB |
-1% throughput, -23% context, +vision, +DFlash drafter still firing alongside mmproj (27% acceptance). Shipped as llamacppqwen36beellamavisionone v1.0.0 in the orales-one-market catalog.
The BeeLlama vs Gemma 4 incompatibility
Earlier this week I tried swapping BeeLlama (aamsellem/beellama-cpp:0.1.2) onto a Gemma 4 26B-A4B + mmproj F16 + q4_0 KV deployment on the same hardware (RTX 5090M, sm_120 consumer Blackwell mobile, 24 GB). It crashed at warmup:
TCQ decode: context-adaptive V alpha enabled
/app/ggml/src/ggml-cuda/fattn.cu:1265: fatal error
libggml-base.so.0(+0x1adb6)[0x7ee7b3241db6]
libggml-cuda.so(+0x23e181)[0x7ee7a788b181]
libggml-cuda.so(_Z24ggml_cuda_flash_attn_ext...)
That makes sense once you trace the fork lineage: BeeLlama is Anbeeld/beellama.cpp ← spiritbuun/buun-llama-cpp ← TheTom/llama-cpp-turboquant ← ggml-org/llama.cpp. Every link in that chain modifies the CUDA flash attention path for Qwen3.6 hybrid arch (Gated DeltaNet + SSM) + TCQ (turbo cache quant) + DFlash. None of those modifications were ever validated against Gemma 4’s sliding window attention. The crash isn’t a bug — it’s an out-of-scope.
The implication: BeeLlama is the right fork for Qwen3.6 DFlash, AtomicBot fork is the right one for Gemma 4 + MTP. Pick by model class, not by feature wishlist.
What I didn’t know going in: does BeeLlama support --mmproj alongside --spec-type dflash for Qwen3.6? It’s not documented either way. Could’ve been “mmproj works only without spec” (the MTP path has this exact incompatibility upstream — multimodal + MTP is forbidden in llama.cpp until further notice). Time to test.
The config
Same target + drafter as the text-only v1.0.2, plus the mmproj from unsloth/Qwen3.6-27B-GGUF:
TARGET: unsloth/Qwen3.6-27B-UD-Q3_K_XL.gguf # 14.5 GB
DRAFT: spiritbuun/Qwen3.6-27B-DFlash-GGUF dflash-draft-3.6-q8_0.gguf # 1.85 GB
MMPROJ: unsloth/Qwen3.6-27B-GGUF mmproj-F16.gguf # 0.93 GB
# Server args
--mmproj /models/$MMPROJ_FILE
--spec-type dflash
--spec-dflash-cross-ctx 1024
-ngl 99 --spec-draft-ngl 99
--ctx-size 200000
--cache-type-k turbo3 --cache-type-v turbo3
--batch-size 2048 --ubatch-size 2048 # mandatory ubatch ≥ image_max_tokens (~1100)
--parallel 1 --kv-unified
--flash-attn on --jinja --no-mmap --mlock
--temp 0.6 --top-k 20 --min-p 0.0
Note the --ubatch-size 2048 — bumping from the text-only default of 256 is mandatory when mmproj is loaded. Gemma 4 has image_max_pixels: 645120 which translates to roughly 1100 image tokens per request. Vision encoder is non-causal, so it can’t split image tokens across ubatches. If ubatch < image_max_tokens, the server asserts on the first image request (llama.cpp PR #21550). Qwen3.6 27B has similar image token counts. Safe default is --ubatch-size 2048 for any chart with --mmproj.
The smoke test
Generated a small 64×64 PNG gradient (red on the right side, dark teal-black on the left) and sent it via OpenAI-compatible image_url content type. Response:
“The image shows a gradient. It transitions from a dark, almost black color on the left to a vibrant red on the right.”
Correct. DFlash drafter telemetry showed draft_n=221, draft_n_accepted=61 → 27.6% acceptance on this vision prompt. The drafter genuinely fires alongside mmproj — BeeLlama supports the combo. Surprising win given how brittle this path is on the Gemma 4 side.
The bench
I started at 128K context expecting the standard “vision adds latency” hit. Got 101.98 t/s avg over 10 runs, range 92.69-109.62. Not bad, but I noticed the histogram peak skewed lower than the text-only 128K result (which had hit 116 t/s avg on the same machine yesterday).
Then I pushed to 200K. Expected another drop. Got the opposite:
| Context | Runs | AVG t/s | MIN | MAX | Range | Notes |
|---|---|---|---|---|---|---|
| 128K | 10 | 101.98 | 92.69 | 109.62 | 16.94 | baseline vision |
| 200K | 10 | 106.43 | 93.29 | 116.67 | 23.38 | +4.4% over 128K |
| 262K + mmproj | n/a | OOM | — | — | — | exceeds 24 GB |
200K is +4.4% faster than 128K, with a wider range. This is the same effect I observed yesterday on text-only — a “sweet spot” where cudagraph capture sizes align with prefill chunks better. On the text-only stack the sweet spot was 128K. With mmproj added (which changes prefill chunk shapes), the sweet spot shifts to 200K.
It’s the kind of finding I wouldn’t believe in a single bench, but the histogram and 10-run consistency are clean enough. The MIN of 93 at 200K is higher than the MIN of 93 at 128K (essentially identical), the MAX is higher (116.67 vs 109.62), and the median is higher. Better at every percentile.
Why 262K + mmproj OOMs
VRAM math at 24 GB budget:
Target Q3_K_XL 14.5 GB
DFlash drafter q8_0 1.85 GB
mmproj F16 0.93 GB
turbo3 KV @ 262K ~8.2 GB
Compute buffer + vLLM ~0.5 GB
────────────────────────────
TOTAL @ 262K + mmproj ~26 GB ← OOM (~1.5 GB short)
TOTAL @ 200K + mmproj ~24 GB ← fits with ~0.3 GB margin
TOTAL @ 262K text-only ~24.3 GB ← fits, that's why v1.0.2 ships at 262K
The 0.93 GB mmproj is what tips the 262K text-only config into OOM. Two ways out: (a) drop context to 200K (this app), (b) drop drafter quant to q4_k_m (loses ~1 GB but DFlash acceptance drops). I picked (a) because 200K is still 6× any other Qwen3.6 27B vision deployment I’ve seen, and the speed scales gracefully.
DFlash acceptance on vision prompts
Quick datapoint on the spec decoding behavior: on the gradient PNG smoke test, the DFlash drafter hit 27.6% pos-0 acceptance. On the standard Space Invaders HTML prompt during the 10-run bench, acceptance averaged around 28% — same range. So mmproj doesn’t appear to degrade the drafter’s ability to predict the target’s output distribution. The vision encoder produces standard text tokens after the image segment, and from that point on it’s regular Qwen3.6 generation — DFlash treats it like any other prompt.
The 27-28% acceptance is on par with what I see on text-only with the same drafter. Reasonable validation that BeeLlama’s modified attention kernel propagates target hidden states to the drafter correctly even when image tokens are in the prompt.
What this means for the catalog
Three Qwen3.6 27B paths now live on the orales-one-market :
| App | Path | Best for |
|---|---|---|
llamacppqwen36beellamaone v1.0.2 | BeeLlama text-only @ 262K | Max throughput + max context, no vision |
llamacppqwen36beellamavisionone v1.0.0 (NEW) | BeeLlama + mmproj @ 200K | Vision + long context single-user |
llamacppqwen36mtpone v1.0.8 | am17an MTP @ 262K | Closer to upstream llama.cpp main, no DFlash dependency |
Users on Olares One pick by use case. The vision app is single-user oriented (BeeLlama is --parallel 1 only). For multi-user vision (PagedAttention) on different model class, the catalog ships vllmgemma426ba4bvisionone which hits 135 t/s @ 128K on Gemma 4 — different model, different stack, different tradeoff.
Reproducibility
Helm chart and exact config in llamacppqwen36beellamavisionone v1.0.0. Pull aamsellem/beellama-cpp:0.1.2 from Docker Hub if you have an sm_120 GPU and don’t want to rebuild from source. Same image as the text-only post, just a different chart wiring.
If you run a different sm_120 card (5090 desktop, 5080, 5070 Ti) and want to push beyond 200K, you have ~8 GB more VRAM headroom on 32 GB cards — likely 256K + mmproj + DFlash works there. Would love to know your numbers if you test.
Open questions
- Why does 200K outperform 128K? My best guess is cudagraph capture sizes aligning with prefill chunks. The same effect on text-only had its sweet spot at 128K. With mmproj loaded (which changes prefill shapes), the sweet spot shifts. I haven’t profiled deep enough to confirm.
- Does CopySpec (BeeLlama’s unique addition over the buun fork) help on vision workloads? It’s designed for repeated suffix matching — could be useful for OCR tasks where the model re-emits document boilerplate. Not benched yet.
- Audio path? Qwen3.6 has audio config too in some variants. mmproj-F16 here is image-only. Worth investigating.