# llama-cpp

All posts tagged "llama-cpp".

BeeLlama Qwen3.6 27B with vision — 106 t/s at 200K on consumer Blackwell mobile

15.05.2026

Follow-up to last night's text-only BeeLlama 262K post: I added the mmproj vision projector to Qwen3.6 27B, expected a perf hit, and got a counter-intuitive surprise. BeeLlama supports vision + DFlash spec decoding together (which crashes on Gemma 4). And 200K context outperforms 128K by 4.4%. First public sm_120 BeeLlama vision bench.
Read →
BeeLlama tested on Olares One — 107 t/s at 262K full, +48% over my best path

14.05.2026

Last week on r/LocalLLaMA, a post claimed 135 t/s on Qwen3.6 27B Q5 + 200K context on a single RTX 3090, via a fork called BeeLlama.cpp. Ridiculous if true: my best path on Olares One topped out at 88. I tested it. Spoiler: 107 t/s at 262K full, zero OOM, zero degradation. +48% over my fastest path. The story of a qemu build and three apps in my catalog made obsolete in one night.
Read →
Qwen3.6-27B + MTP CUDA OOM at 262K context on 24GB — fixed by dropping one UD quant tier

12.05.2026

A user hit a reproducible runtime CUDA OOM in MTP draft on my Qwen3.6-27B v1.0.5 chart at 262K context. It boots fine, but the draft scales beyond the static estimate and the process exits with code 139 in common_speculative_state_mtp draft. Fixed by dropping from havenoammo UD-Q3_K_XL (14.9 GB) to UD-Q2_K_XL (12.3 GB). Direct bench validates v1.0.7 at 72.14 t/s stable, full 262K, no OOM. Plus a side experiment: can we drop Genesis patches by switching to NVFP4? Answer: no.
Read →
The story of the day I broke my Qwen3.6 ceiling — not with code, but with a stranger's name

09.05.2026

I spent a whole day trying to push my Qwen3.6 27B on Olares past 65 t/s. Custom builds, experimental forks, merges that crash. And then, late in the evening, in a desperate HuggingFace search, I ran into a name: havenoammo. Five minutes later, 77 t/s on a 262K context. The story of a day chasing an answer that was waiting one click away.
Read →
Qwen3.6-27B on upstream llama.cpp: +123% free with MTP, zero fork to maintain

05.05.2026

MTP finally lands in llama.cpp upstream (PR #22673 by am17an, May 4). Bench on Olares One RTX 5090M sm_120: 78 t/s with an MTP-enabled GGUF, +123% vs baseline. No Lucebox, no Genesis, no permanent custom fork.
Read →
DFlash unblocked on 24GB consumer Blackwell — 80 t/s, 4 days after the "impossible" post

04.05.2026

Four days ago I wrote that DFlash on 24GB consumer Blackwell didn't fit. On April 28, a dev publishes a quantized drafter. On April 30, I build, I test, I get 0.97 t/s. On May 1, after my issue, the dev fixes it in 24h. Tonight: 80 t/s. The story of a thesis that lasted 72 hours.
Read →
