Tag · turboquant
# turboquant
All posts tagged "turboquant".
-
BeeLlama Qwen3.6 27B with vision — 106 t/s at 200K on consumer Blackwell mobile
Followup to last night's BeeLlama text-only 262K post — added the mmproj vision projector to Qwen3.6 27B, expected a perf hit, got a counter-intuitive surprise. BeeLlama supports vision + DFlash spec decoding together (which crashes on Gemma 4). And 200K context outperforms 128K by 4.4%. First public sm_120 BeeLlama vision bench.
Lire → -
BeeLlama tested on Olares One — 107 t/s at 262K full, +48% over my best path
Last week on r/LocalLLaMA, a post claims 135 t/s on Qwen3.6 27B Q5 + 200K context on a single RTX 3090, via a fork called BeeLlama.cpp. Ridiculous if true — my best path on Olares One topped out at 88. I tested it. Spoiler: 107 t/s at 262K full, zero OOM, zero degradation. +48% over my fastest path. The story of a qemu build and three apps in my catalog made obsolete in one night.
Lire → -
Drop the 28 Genesis patches on vLLM? Vanilla bench: 88 → 72.5 t/s, here's why
PR #39931 (TurboQuant hybrid) merged into vLLM main yesterday morning. I tested on Olares One with ZERO Genesis patches, vanilla image vllm/vllm-openai:gemma4-0505-cu130. Verdict: 72.55 t/s with --enforce-eager (vs 88 baseline Genesis = -17.5%). Bonus: we ran into two HAMi/CUDA-graph bugs again + issue #40807 already in the upstream pipe.
Lire → -
Genesis on consumer Blackwell — TurboQuant unlocked for Qwen3.6-27B on 24GB
Sandermage Genesis patches validated on RTX 5090M (sm_120). TurboQuant 4-bit + MTP n=3 on Qwen3.6-27B → 60 t/s, 100K context, 177K KV tokens.
Lire →