# speculative-decoding
All posts tagged "speculative-decoding".
-
DFlash unblocked on 24GB consumer Blackwell — 80 t/s, 4 days after the "impossible" post
Four days ago I wrote that DFlash on 24GB consumer Blackwell didn't fit. On April 28, a dev published a quantized drafter. On April 30, I built it, tested it, and got 0.97 t/s. On May 1, after I filed an issue, the dev fixed it within 24h. Tonight: 80 t/s. The story of a thesis that lasted 72 hours.
Read →
-
Why DFlash on Qwen3.6-27B doesn't fit on a 24GB single GPU
Three paths tested (z-lab BF16, AEON-7 NVFP4, Lucebox custom). All need ≥26 GB. VRAM math, honest negative results, and what to wait for on 24GB.
Read →
-
Genesis on consumer Blackwell — TurboQuant unlocked for Qwen3.6-27B on 24GB
Sandermage Genesis patches validated on RTX 5090M (sm_120). TurboQuant 4-bit + MTP n=3 on Qwen3.6-27B → 60 t/s, 100K context, 177K KV tokens.
Read →
-
Qwen3.6-27B at 85-100 t/s on a 24GB RTX 5090 Laptop GPU
Adapting the 32GB desktop and 24GB Ampere recipes to a 24GB Blackwell consumer mobile (sm_120) GPU. Custom vLLM image, AutoRound INT4, MTP n=3 — sustained 85-100 t/s with 75K context.
Read →