# dflash

All posts tagged "dflash".

BeeLlama Qwen3.6 27B with vision — 106 t/s at 200K on consumer Blackwell mobile

15.05.2026

Follow-up to last night's BeeLlama text-only 262K post: I added the mmproj vision projector to Qwen3.6 27B, expected a perf hit, and got a counter-intuitive surprise. BeeLlama supports vision and DFlash speculative decoding together (a combination that crashes on Gemma 4), and 200K context outperforms 128K by 4.4%. First public sm_120 BeeLlama vision bench.
Read →

BeeLlama tested on Olares One — 107 t/s at 262K full, +48% over my best path

14.05.2026

Last week on r/LocalLLaMA, a post claimed 135 t/s on Qwen3.6 27B Q5 with 200K context on a single RTX 3090, via a fork called BeeLlama.cpp. Ridiculous if true: my best path on Olares One topped out at 88. I tested it. Spoiler: 107 t/s at 262K full, zero OOM, zero degradation. +48% over my fastest path. The story of a qemu build and three apps in my catalog made obsolete in one night.
Read →

Gemma 4 26B-A4B + DFlash on 24GB Blackwell mobile — n_spec=8 optimal, +5% over default, plus a weird degradation cycle

11.05.2026

Full num_speculative_tokens sweep for Gemma 4 26B-A4B + z-lab DFlash drafter on RTX 5090M Laptop (24GB sm_120). Optimal is n_spec=8 (not n_spec=15 as on desktop). I also found a 100% reproducible vLLM degradation cycle that I couldn't fix from config alone.
Read →

Lucebox on Olares One — Episode 9: the PR that promised +57% and delivered +0.2%

05.05.2026

Last night Lucebox crossed 88.5 t/s on Olares One and became the new champion. This morning PR #94 reports +57% on RTX 4090. If it scales, we hit 120 t/s. Spoiler: 88.7 t/s. Full DDTree sweep, three hypotheses, the honest lesson on upstream benches that don't reproduce.
Read →

Lucebox on Olares One — Episode 8: seven days of waiting, one lib swapped by hand, 88.5 t/s

04.05.2026

Seven days after my PR #188 to HAMi-core, still no review. The saga had its cliffhanger — I was waiting on someone else. Then a stupid idea: compile my patched lib and swap it myself. Three new bugs, one night, and at the end Lucebox hits 88.5 t/s. First llama.cpp-based path to pass vLLM Turbo on this hardware.
Read →

DFlash unblocked on 24GB consumer Blackwell — 80 t/s, 4 days after the "impossible" post

04.05.2026

Four days ago I wrote that DFlash on 24GB consumer Blackwell didn't fit. On April 28, a dev publishes a quantized drafter. On April 30, I build, I test, I get 0.97 t/s. On May 1, after my issue, the dev fixes it in 24h. Tonight: 80 t/s. The story of a thesis that lasted 72 hours.
Read →
