# speculative-decoding

All posts tagged "speculative-decoding".

Gemma 4 audio E4B hits 288 t/s — the second upstream merge closes the family

09.06.2026

Yesterday I shipped Gemma 4 12B at 170 t/s via the upstream PR #23398 merge. Today PR #24282 (the E2B/E4B counterpart) merged. Custom rebuild, chart swap, bench: Gemma 4 audio E4B jumps from 47 t/s to 288 t/s, a 6.1x speedup on the same hardware for 5 minutes of config. One flash-attention trap along the way: the combination of Gemma 4 E4B, the audio mmproj, and the MTP draft crashes the CUDA flash-attention kernel. Falling back to no flash attention unlocks everything.
Read →
Vision unlocked on Qwen3.6 35B-A3B MTP — 243 t/s + 262K context + image input via spiritbuun's --mmproj-gpu-swap

24.05.2026

Three days ago I shipped Qwen3.6 35B-A3B MTP at 249 t/s text-only on Olares One, the new champion. Yesterday I shipped Gemma 4 26B at 250 t/s with vision. Today the Qwen champion gets vision too. Same 24 GB GPU. Same model file. The unlock: on May 22, spiritbuun merged a feature called --mmproj-gpu-swap that hot-swaps MTP and the vision encoder in VRAM on demand. Trade-off: 2.8% less text throughput in exchange for full vision support and 4× more context than my v1.0.5 vision attempt.
Read →
Gemma 4 26B Vision at 250 t/s — vLLM v0.21 closed the gap with my text-only champion

23.05.2026

Two days ago I shipped Qwen3.6 35B-A3B MTP at 249 t/s on Olares One. Text-only, but the new champion. Today the same hardware runs Gemma 4 26B at 250 t/s with vision and tool calling. The unlock: vLLM v0.21 quietly merged the official Google Gemma 4 MTP drafter. No more 5-fast/4-slow cycle bug from DFlash. No more 135 t/s no-spec fallback. Just full speed, plus images.
Read →
249 t/s on Qwen3.6 35B-A3B MTP — the bigger model that runs faster than everything smaller

21.05.2026

Yesterday I posted about Nemotron-Labs Elastic 30B-A3B NVFP4 hitting 166 t/s on Olares One, then 182 once vLLM #40082 landed. New record. Headline of the post: 'fastest LLM on Olares One'. Less than 12 hours later, that record is sitting in second place. Qwen3.6 35B-A3B MTP runs at 249 t/s on the same hardware. Bigger model, 37% faster. Here's what's going on.
Read →
MTP merged in llama.cpp master — and the n_max default change everyone missed (86.7% accept on Qwen3.6 27B Blackwell mobile)

21.05.2026

MTP support was merged into llama.cpp master on May 16th. Five days later, three follow-up PRs quietly changed how MTP behaves — including the spec-draft-n-max default flipping from 16 to 3. On Olares One (RTX 5090M sm_120), that change plus NVIDIA's backend-sampling rewrite (#23287) pushed Qwen3.6 27B MTP from 64% to 86.7% draft acceptance. +22 points. Nobody is talking about this.
Read →
BeeLlama tested on Olares One — 107 t/s at 262K full, +48% over my best path

14.05.2026

Last week on r/LocalLLaMA, a post claimed 135 t/s on Qwen3.6 27B Q5 with 200K context on a single RTX 3090, via a fork called BeeLlama.cpp. Ridiculous if true: my best path on Olares One topped out at 88. I tested it. Spoiler: 107 t/s at the full 262K context, zero OOM, zero degradation. +48% over my fastest path. This is the story of a qemu build and three apps in my catalog made obsolete in one night.
Read →
