# turboquant

All posts tagged "turboquant".

BeeLlama Qwen3.6 27B with vision — 106 t/s at 200K on consumer Blackwell mobile

15.05.2026

Follow-up to last night's BeeLlama text-only 262K post: I added the mmproj vision projector to Qwen3.6 27B, expected a perf hit, and got a counter-intuitive surprise instead. BeeLlama supports vision + DFlash speculative decoding together (which crashes on Gemma 4), and 200K context outperforms 128K by 4.4%. First public sm_120 BeeLlama vision bench.
Read →
BeeLlama tested on Olares One — 107 t/s at 262K full, +48% over my best path

14.05.2026

Last week on r/LocalLLaMA, a post claimed 135 t/s on Qwen3.6 27B Q5 with 200K context on a single RTX 3090, via a fork called BeeLlama.cpp. Ridiculous if true: my best path on Olares One topped out at 88. I tested it. Spoiler: 107 t/s at the full 262K context, zero OOM, zero degradation. +48% over my fastest path. The story of a qemu build and three apps in my catalog made obsolete in one night.
Read →
Drop the 28 Genesis patches on vLLM? Vanilla bench: 88 → 72.5 t/s, here's why

06.05.2026

PR #39931 (TurboQuant hybrid) was merged into vLLM main yesterday morning. I tested it on Olares One with zero Genesis patches, using the vanilla image vllm/vllm-openai:gemma4-0505-cu130. Verdict: 72.55 t/s with --enforce-eager, versus the 88 t/s Genesis baseline (-17.5%). Bonus: we ran into two HAMi/CUDA-graph bugs again, plus issue #40807, which is already in the upstream pipeline.
Read →
Genesis on consumer Blackwell — TurboQuant unlocked for Qwen3.6-27B on 24GB

28.04.2026

Sandermage Genesis patches validated on RTX 5090M (sm_120). TurboQuant 4-bit + MTP n=3 on Qwen3.6-27B → 60 t/s, 100K context, 177K KV tokens.
Read →
