Hi there.
Remember three days ago when I wrote a post titled “Why DFlash on Qwen3.6-27B doesn’t fit on 24GB single GPU”? The reasoning was sound given what was public at the time: tight VRAM math on the stock vLLM path, no published quantized drafter, and a llama.cpp fork untested on consumer Blackwell (I did note that Lucebox was already running DFlash on an RTX 3090 24GB at 78 t/s, but with no public Docker and no tested sm_120 support). Spoiler alert: today I ran DFlash on my Olares One (RTX 5090 Laptop, 24GB, sm_120 Blackwell) at ~80 t/s average on Qwen3.6-27B, via a different route. Here’s how it played out.
The original post: “doesn’t fit”
Quick recap of the announced failure ingredients:
- z-lab BF16 drafter: 6 GiB. No room for it next to the target model on a 24GB card.
- No published quantization from z-lab.
- buun-llama-cpp fork: untested on consumer Blackwell, perf unknown.
Conclusion at the time: “DFlash on 24GB consumer = impossible with the public paths available today (Q1 2026)”.
Three days later, all three of those ingredients changed under my nose.
Change #1: a quantized GGUF drafter showed up
On April 28, spiritbuun publishes spiritbuun/Qwen3.6-27B-DFlash-GGUF on HuggingFace:
- dflash-draft-3.6-q8_0.gguf — 1.75 GB (vs 6 GiB BF16)
- dflash-draft-3.6-q4_k_m.gguf — 1.03 GB
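If you want to grab the Q8_0 drafter yourself, something like this should do it (a sketch: the filenames come from the repo listing above, and huggingface-cli syntax can vary slightly between versions):

```bash
# Download the quantized DFlash drafter from HuggingFace (sketch; adjust --local-dir)
huggingface-cli download spiritbuun/Qwen3.6-27B-DFlash-GGUF \
  dflash-draft-3.6-q8_0.gguf --local-dir ./models
```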
Suddenly, the VRAM math fits:
| Component | Size |
|---|---|
| Target Qwen3.6-27B Q4_K_M | 16.8 GB |
| Drafter Q8_0 | 1.75 GB |
| KV cache @ 32K (FP16) | ~2 GB |
| Buffers + cudagraph | ~1.7 GB |
| Total | ~22.3 GB → fits in 24 |
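Same arithmetic as the table, as a throwaway one-liner (the KV cache and buffer figures are the rough estimates above, not measured to the megabyte):

```bash
# Sum the VRAM budget from the table and show the headroom on a 24 GB card
awk 'BEGIN { t = 16.8 + 1.75 + 2.0 + 1.7; printf "%.2f GB used, %.2f GB headroom\n", t, 24 - t }'
# -> 22.25 GB used, 1.75 GB headroom
```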
Bingo. But the drafter alone does nothing. You need an inference engine that knows DFlash.
Change #2: actually testing buun-llama-cpp
The buun-llama-cpp fork (also by spiritbuun) adds --spec-type dflash support to standard llama.cpp. So I build the Docker image for consumer Blackwell (-DCMAKE_CUDA_ARCHITECTURES=120 -DGGML_CUDA_NO_VMM=ON -Wl,-rpath-link,/usr/local/cuda/lib64/stubs). 2h30 of cross-compiling on OrbStack. Image: aamsellem/buun-llama-cpp-dflash:0.1.0. Deployed on Olares One.
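For reference, the core of that build boils down to something like this (a sketch of the CMake invocation, not the exact Dockerfile; the repo URL is an assumption on my part, and you need a CUDA 13.x toolchain in the build image):

```bash
# Build the fork's llama-server natively for sm_120 (consumer Blackwell) -- sketch only
git clone https://github.com/spiritbuun/buun-llama-cpp.git   # URL assumed, check spiritbuun's profile
cd buun-llama-cpp
cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES=120 \
  -DGGML_CUDA_NO_VMM=ON \
  -DCMAKE_EXE_LINKER_FLAGS="-Wl,-rpath-link,/usr/local/cuda/lib64/stubs"
cmake --build build --config Release -j"$(nproc)" --target llama-server
```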
First bench, April 30:
Run 1: 800 tok in 235s = 3.40 t/s
Run 2: 800 tok in 525s = 1.52 t/s
Run 3: 800 tok in 824s = 0.97 t/s
Ouch. 3.4 t/s degrading to 0.97. That’s 25× slower than vLLM Turbo (88 t/s) on the same card. Disaster. The logs show a DFlash cycle at 1860 ms per batch, 1521 ms of it in the verify step, when the whole cycle should take 15-25 ms. The custom kernels are clearly not tuned for sm_120.
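For what it’s worth, the numbers above come from timing 800-token generations against the server. A minimal version of that measurement looks like this (the prompt here is a placeholder, my actual “Space Invaders” bench prompt is longer, and the timings field assumes a recent llama-server build):

```bash
# Ask llama-server for 800 tokens and read back its own timing report (sketch)
curl -s http://localhost:8080/completion \
  -d '{"prompt": "Write a Space Invaders clone in a single HTML file.", "n_predict": 800}' \
  | jq '.timings | {predicted_n, predicted_ms, predicted_per_second}'
```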
I file issue #35 on buun-llama-cpp with all the numbers and a full repro.
Change #3: spiritbuun fixes the bug in 24 hours
On May 1, spiritbuun replies:
“I think this may be fixed now - can you repull and give it another try?”
In between, 8 commits on master:
- cab1fb597: dflash: add p_min confidence threshold + adaptive draft length ← likely the fix
- 905483277: --no-fused-gdn debug flag
- 115995e41: disable fused GDN kernels on non-CUDA backends
- merge of upstream llama.cpp/master (325 commits, Apr 5 → Apr 30)
I rebuild. Another 2h cross-compile. Tag: aamsellem/buun-llama-cpp-dflash:0.2.0. Re-deploy on Olares One. Re-bench Space Invaders × 3.
The result
Run 1: 800 tok in 10.82s = 73.94 t/s
Run 2: 800 tok in 9.96s = 80.31 t/s
Run 3: 800 tok in 9.41s = 85.06 t/s
Average: ~80 t/s (range 74-85).
Roughly +2300% vs v0.1.0’s first run, and ~80× faster than its worst, in 4 days. The jump from 0.97 t/s to 85 t/s likely rides on a single commit (cab1fb597, the p_min / adaptive draft length one). That’s the open source ecosystem in 2026: it moves on a weekly clock.
Olares One comparison
| Backend | Stack | t/s avg |
|---|---|---|
| llama.cpp standard | UD-Q4_K_XL, no spec decoding | 33-36 |
| vLLM Turbo | v0.20.0 + Genesis + TurboQuant K8V4 + MTP n=3 | 88 |
| buun-llama-cpp DFlash | HEAD aecbbd5d + Q8_0 GGUF drafter | 80 |
| vLLM vanilla (other app) | 0.19.1 + AutoRound INT4 + MTP n=3 | 99 peak |
DFlash lands in the big leagues. Not first in absolute terms (vLLM Turbo stays ahead at 88), and not first on 24GB consumer either: a few weeks ago Lucebox published DFlash numbers on an RTX 3090 24GB (sm_86 Ampere) of 78 t/s on HumanEval, 70 t/s on Math500, and 60 t/s on GSM8K. Our 80 t/s on the Olares One sits in the same range, on the same model class.
What’s new here vs Lucebox 3090:
- sm_120 consumer Blackwell hardware instead of sm_86 Ampere
- RTX 5090 Laptop (mobile) instead of desktop
- Stack: buun-llama-cpp + spiritbuun Q8_0 GGUF drafter instead of Lucebox custom engine + z-lab BF16 drafter
- Public reproduction of the fix, following the issue + bench report I filed
To my knowledge, Lucebox PR #86 (May 4, 2026) reports 218 t/s on RTX 5090 desktop 32GB via their path — absolute Blackwell record — but their HTTP server isn’t yet wired up and their engine doesn’t load the spiritbuun GGUF drafter.
The config that works
Image: aamsellem/buun-llama-cpp-dflash:0.2.0 (CUDA 13.1, sm_120 native, NO_VMM, libcuda stub link)
llama-server args:
--model Qwen3.6-27B-Q4_K_M.gguf
--model-draft dflash-draft-3.6-q8_0.gguf
--spec-type dflash
--n-gpu-layers 99
--n-gpu-layers-draft 99
--ctx-size 32000
--ctx-size-draft 256
--batch-size 256 --ubatch-size 64
--parallel 1 --flash-attn on --jinja
--chat-template-kwargs '{"enable_thinking": false}'
Note: per spiritbuun, disabling thinking buys an extra ~1.8× speedup. The drafter wasn’t trained on outputs containing <think> tags, so acceptance collapses if the model emits them.
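Assembled into a single command, it looks roughly like this (a sketch: the volume paths and port are placeholders, and I’m assuming the image’s entrypoint is llama-server):

```bash
# Full DFlash config from above as one docker run (paths/port are placeholders)
docker run --rm --gpus all -p 8080:8080 -v /path/to/models:/models \
  aamsellem/buun-llama-cpp-dflash:0.2.0 \
  --model /models/Qwen3.6-27B-Q4_K_M.gguf \
  --model-draft /models/dflash-draft-3.6-q8_0.gguf \
  --spec-type dflash \
  --n-gpu-layers 99 --n-gpu-layers-draft 99 \
  --ctx-size 32000 --ctx-size-draft 256 \
  --batch-size 256 --ubatch-size 64 \
  --parallel 1 --flash-attn on --jinja \
  --chat-template-kwargs '{"enable_thinking": false}' \
  --host 0.0.0.0 --port 8080
```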
What we haven’t (yet) tested to push past 100 t/s
- DDTree budget tuning: Lucebox on an RTX 5090 desktop hits 218 t/s at budget 22. The default underuses it. Worth sweeping.
- --no-fused-gdn ON vs OFF: fused GDN kernels may still have issues on sm_120 (a minimal A/B sketch follows this list).
- p_min adaptive draft length: the sweet spot is prompt-dependent, would need a sweep.
- Wider context: 32K is conservative. 80K should fit.
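Here’s what that --no-fused-gdn A/B could look like (sketch only: I haven’t run this yet, the flag comes from commit 905483277 above and I’m assuming it’s a runtime llama-server option in the fork; the DDTree budget and p_min knobs aren’t named in this post, so check the fork’s --help before sweeping those):

```bash
# A/B the fused GDN kernels on sm_120: same server config, with and without --no-fused-gdn
# (--no-fused-gdn is assumed to be a runtime flag of the fork's llama-server)
for extra in "" "--no-fused-gdn"; do
  echo "=== extra flag: '${extra:-none}' ==="
  ./build/bin/llama-server \
    --model Qwen3.6-27B-Q4_K_M.gguf --model-draft dflash-draft-3.6-q8_0.gguf \
    --spec-type dflash --n-gpu-layers 99 --n-gpu-layers-draft 99 \
    --flash-attn on --port 8080 $extra &
  server_pid=$!
  sleep 90   # crude: wait for the model to load; poll /health in a real harness
  curl -s http://localhost:8080/completion \
    -d '{"prompt": "bench prompt goes here", "n_predict": 800}' \
    | jq '.timings.predicted_per_second'
  kill "$server_pid"; wait "$server_pid" 2>/dev/null
done
```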
Bonus: PFlash, the other half of the problem
While I was finalizing this post, sandropuppo (Lucebox author) posted on r/LocalLLaMA about another release: PFlash. It’s the inverse of DFlash:
- DFlash = decode 2-3× faster (what we just tested)
- PFlash = prefill 10× faster at 128K (24.8s TTFT vs 257s vanilla llama.cpp on RTX 3090)
The thing is, on Qwen3.6-27B Q4_K_M, decode is fast (74 t/s with DFlash on a 3090, 80 t/s on our 5090M) but prefill scales O(S²). On a 131K-token prompt, vanilla llama.cpp takes 4 minutes before the first token. Painful. Fast decode is moot if you wait 4 minutes per message.
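Back-of-the-envelope with the numbers above (assuming the 257 s vanilla / 24.8 s PFlash TTFT figures both apply to a ~131K-token prompt on the 3090):

```bash
# Implied prefill throughput at ~131K tokens, vanilla llama.cpp vs PFlash (3090 figures)
awk 'BEGIN {
  tokens = 131072
  printf "vanilla: %4.0f tok/s prefill (257 s to first token)\n", tokens / 257
  printf "PFlash : %4.0f tok/s prefill (24.8 s to first token)\n", tokens / 24.8
}'
```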
PFlash combines Speculative Prefill (arXiv 2502.02789) + FlashPrefill + Block-Sparse-Attention to score important tokens via a tiny Qwen3-0.6B drafter and only prefill the spans that matter. Pure C++/CUDA, no Python, in-process with DFlash.
PFlash was merged into the Lucebox repo today (May 4); it’s not yet in our v1.2.0 image. We’ll add it on the next rebuild. The DFlash + PFlash combo on sm_120 is potentially the top of the top on consumer 24GB: fast decode AND a fast first token even on 128K prompts.
To be tested very soon.
TL;DR
Running DFlash on 24GB consumer mobile Blackwell is no longer impossible: ~80 t/s avg on Qwen3.6-27B with:
- spiritbuun/Qwen3.6-27B-DFlash-GGUF Q8_0 (1.75 GB)
- spiritbuun/buun-llama-cpp HEAD compiled for sm_120
My April 28 post calling the stock vLLM/llama.cpp path “impossible” became obsolete on May 1. The lesson: in 2026 in the consumer LLM ecosystem, “impossible” has a 72-hour half-life.
Credits
- spiritbuun for the llama.cpp fork with DFlash + the GGUF drafter + the express fix after my issue #35
- z-lab/dflash for the Block Diffusion drafter method
- unsloth for the Qwen3.6-27B Q4_K_M GGUF
- Lucebox for the RTX 5090 desktop bench that proved sm_120 could pull this off
That’s it! If you run on a 5090M, 4080M, 3090 or 4090 24GB and you reproduce these numbers (or beat 100+ t/s with a DDTree tweak), send me your results. See you next time!
Disclosure — All the benchmarks in this post run on my own Olares One. If the content was useful and you’re considering one, ordering through this referral link gets you $400 off ($3,599 instead of $3,999) and pays me $200. I’m mentioning this out of transparency — and yes, incidentally, it helps keep the blog alive (hosting, domain, and the time I spend writing here). Link valid until late June 2026.