# speculative-decoding

All posts tagged "speculative-decoding".

Qwen3.6-27B + MTP CUDA OOM at 262K context on 24GB — fixed by dropping one UD quant tier

12.05.2026

A user hit a reproducible runtime CUDA OOM in the MTP draft on my Qwen3.6-27B v1.0.5 chart at 262K context. It boots fine, but the draft's memory use scales beyond the static estimate and the process exits with code 139 in common_speculative_state_mtp draft. Fixed by dropping havenoammo UD-Q3_K_XL (14.9 GB) to UD-Q2_K_XL (12.3 GB). A direct bench validates v1.0.7 at a stable 72.14 t/s with the full 262K context and no OOM. Plus a side experiment: can we drop the Genesis patches by switching to NVFP4? Answer: no.
Read →
Gemma 4 26B-A4B + DFlash on 24GB Blackwell mobile — n_spec=8 optimal, +5% over default, plus a weird degradation cycle

11.05.2026

Full num_speculative_tokens sweep for Gemma 4 26B-A4B + z-lab DFlash drafter on RTX 5090M Laptop (24GB sm_120). Optimal is n_spec=8 (not n=15 like desktop). I also found a 100% reproducible vLLM degradation cycle that I couldn't fix from config alone.
Read →
A week of benches on the Olares One: Gemma 4 MTP, Lucebox regression, vLLM no-Genesis hitting the workspace lock

08.05.2026

From May 5 to May 8, 2026, I benched everything that fit on a 24GB RTX 5090M. Three findings: Gemma 4 MTP via vLLM lands at 178 t/s 24h after merge, Lucebox v1.9.0 mysteriously regresses from 88 to 69 t/s, vLLM no-Genesis validates PR #39931 but stalls on P65/P22/P38. Plus housekeeping: 8 Qwen3.6 27B apps → 2.
Read →
Gemma 4 E4B MTP on the RTX 5090M: 178 t/s, 24h after the vLLM upstream merge

08.05.2026

On May 6 at 14:39 UTC, lucianommartins merges PR #41745 into vLLM main: native support for Gemma 4 Multi-Token Prediction drafters. On May 7 at 06:13 UTC the nightly Docker drops. At 06:35 UTC, my Olares One hits 178.6 t/s with 77.3% acceptance — first public Gemma 4 MTP bench on consumer mobile Blackwell.
Read →
Drop the 28 Genesis patches on vLLM? Vanilla bench: 88 → 72.5 t/s, here's why

06.05.2026

PR #39931 (TurboQuant hybrid) merged into vLLM main yesterday morning. I tested on Olares One with ZERO Genesis patches, using the vanilla image vllm/vllm-openai:gemma4-0505-cu130. Verdict: 72.55 t/s with --enforce-eager, versus the 88 t/s Genesis baseline (-17.5%). Bonus: we ran into two HAMi/CUDA-graph bugs again, plus issue #40807, already in the upstream pipe.
Read →
Qwen3.6-27B on upstream llama.cpp: +123% free with MTP, zero fork to maintain

05.05.2026

MTP finally lands in llama.cpp upstream (PR #22673 by am17an, May 4). Bench on Olares One RTX 5090M sm_120: 78 t/s with an MTP-enabled GGUF, +123% vs baseline. No Lucebox, no Genesis, no permanent custom fork.
Read →
