Aurélien AMSELLEM

DFlash unblocked on 24GB consumer Blackwell — 80 t/s, 3 days after the "impossible" post

Three days ago I wrote that the stock DFlash path didn't fit 24GB consumer. Spoiler: it works now via buun-llama-cpp + a Q8_0 GGUF spiritbuun drafter. 80 t/s avg on Olares One sm_120 mobile Blackwell.

Hi there.

Remember three days ago when I wrote a post titled “Why DFlash on Qwen3.6-27B doesn’t fit on a 24GB single GPU”? My reasoning held up: tight VRAM math on the stock vLLM path, no published quantized drafter, and a llama.cpp fork untested on consumer Blackwell (I did note that Lucebox was already running DFlash on an RTX 3090 24GB at 78 t/s, but with no public Docker image and no tested sm_120 support). Spoiler: that reasoning held for exactly three days. Today, I ran DFlash on my Olares One (RTX 5090 Laptop 24GB, sm_120 Blackwell) at 80 t/s avg on Qwen3.6-27B, via a different route. Here’s how it played out.

The original post: “doesn’t fit”

Quick recap of the failure ingredients from that post:

  1. Tight VRAM math on the stock vLLM path
  2. No published quantized drafter
  3. llama.cpp fork untested on consumer Blackwell (sm_120)

Conclusion at the time: “DFlash on 24GB consumer = impossible with the public paths available today (Q1 2026)”.

Three days later, all three of those ingredients changed under my nose.

Change #1: a quantized GGUF drafter showed up

On April 28, spiritbuun publishes spiritbuun/Qwen3.6-27B-DFlash-GGUF on HuggingFace.

Suddenly, the VRAM math fits:

| Component | Size |
|---|---|
| Target Qwen3.6-27B Q4_K_M | 16.8 GB |
| Drafter Q8_0 | 1.75 GB |
| KV cache @ 32K (FP16) | ~2 GB |
| Buffers + cudagraph | ~1.7 GB |
| **Total** | **~22.3 GB → fits in 24 GB** |
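As a quick sanity check, here is the budget above summed up. This is a rough sketch: the exact buffer and cudagraph sizes depend on batch settings, and the real ceiling is 24 GB minus whatever the compositor is holding.

```python
# Sanity check of the VRAM budget table (component sizes copied from above).
components_gb = {
    "target Qwen3.6-27B Q4_K_M": 16.8,
    "drafter Q8_0": 1.75,
    "KV cache @ 32K (FP16)": 2.0,
    "buffers + cudagraph": 1.7,
}
total = sum(components_gb.values())
headroom = 24.0 - total
print(f"total: {total:.2f} GB, headroom: {headroom:.2f} GB")
```

Less than 2 GB of headroom, so it's tight, but it fits.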

Bingo. But the drafter alone does nothing. You need an inference engine that knows DFlash.

Change #2: actually testing buun-llama-cpp

The buun-llama-cpp fork (also by spiritbuun) adds `--spec-type dflash` support to standard llama.cpp. So I build the Docker image for consumer Blackwell (`-DCMAKE_CUDA_ARCHITECTURES=120 -DGGML_CUDA_NO_VMM=ON -Wl,-rpath-link,/usr/local/cuda/lib64/stubs`). 2h30 of cross-compile on OrbStack. Image: `aamsellem/buun-llama-cpp-dflash:0.1.0`. Deploy on Olares One.

First bench, April 30:

Run 1: 800 tok in 235s = 3.40 t/s
Run 2: 800 tok in 525s = 1.52 t/s
Run 3: 800 tok in 824s = 0.97 t/s

Ouch. 3.4 t/s degrading to 0.97. That’s 25× slower than vLLM Turbo (88 t/s) on the same card. Disaster. Logs show a DFlash cycle at 1860ms per batch with 1521ms of verify, when it should be 15-25ms. The custom kernels are clearly not tuned for sm_120.
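Back-of-envelope on those logs, assuming decode is entirely cycle-bound (an assumption, not a measurement): the 1860 ms cycle tells you roughly how many tokens each cycle banks, and what the same acceptance would yield at a healthy cycle time.

```python
# Rough model: throughput = tokens banked per DFlash cycle / cycle time.
broken_tps = 0.97        # t/s observed in run 3
cycle_s = 1.860          # per-batch DFlash cycle from the logs
tokens_per_cycle = broken_tps * cycle_s
print(f"~{tokens_per_cycle:.1f} tokens banked per cycle")

# Same acceptance, but with a healthy ~20 ms cycle:
healthy_tps = tokens_per_cycle / 0.020
print(f"~{healthy_tps:.0f} t/s if the cycle cost what it should")
```

About 1.8 tokens per cycle, and a healthy cycle would put it near 90 t/s, which is exactly the neighborhood the fixed build lands in below.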

I file an issue #35 on buun-llama-cpp with all the numbers and a full repro.

Change #3: spiritbuun fixes the bug in 24 hours

On May 1, spiritbuun replies:

“I think this may be fixed now - can you repull and give it another try?”

In between, 8 commits landed on master.

I rebuild. Another 2h cross-compile. Tag: aamsellem/buun-llama-cpp-dflash:0.2.0. Re-deploy on Olares One. Re-bench Space Invaders × 3.

The result

Run 1: 800 tok in 10.82s = 73.94 t/s
Run 2: 800 tok in  9.96s = 80.31 t/s
Run 3: 800 tok in  9.41s = 85.06 t/s

Avg ~80 t/s (74-85 range).

+2300% vs v0.1.0. 80× faster in 4 days. The jump from 0.97 t/s to 85 t/s rides on a single upstream commit. That’s the open source ecosystem in 2026 — it moves on a weekly clock.
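For the record, here is where those headline ratios come from, recomputed from the raw runs (give or take rounding: the quoted +2300% and 80× compare against v0.1.0's best and worst runs respectively).

```python
# Headline ratios recomputed from the raw benchmark runs above.
v010_best, v010_worst = 3.40, 0.97          # v0.1.0 runs 1 and 3
v020_runs = [73.94, 80.31, 85.06]           # v0.2.0 runs 1-3
v020_avg = sum(v020_runs) / len(v020_runs)
print(f"v0.2.0 avg: {v020_avg:.1f} t/s")                  # ~79.8
print(f"vs v0.1.0 best:  {v020_avg / v010_best:.1f}x")    # ~23x, i.e. ~+2250%
print(f"vs v0.1.0 worst: {v020_avg / v010_worst:.1f}x")   # ~82x
```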

Olares One comparison

| Backend | Stack | t/s avg |
|---|---|---|
| llama.cpp standard | UD-Q4_K_XL, no spec decoding | 33-36 |
| vLLM Turbo | v0.20.0 + Genesis + TurboQuant K8V4 + MTP n=3 | 88 |
| buun-llama-cpp DFlash | HEAD aecbbd5d + Q8_0 GGUF drafter | 80 |
| vLLM vanilla (other app) | 0.19.1 + AutoRound INT4 + MTP n=3 | 99 peak |

DFlash lands in the champions’ yard. Not first in absolute terms (vLLM Turbo stays ahead at 88), and not first on 24GB consumer either: Lucebox published DFlash numbers on an RTX 3090 24GB (sm_86 Ampere) a few weeks ago: 78 t/s HumanEval, 70 t/s Math500, 60 t/s GSM8K. Our 80 t/s on the Olares One sits in the same range, on the same model class.

What’s new here vs Lucebox 3090:

  * Tested sm_120 consumer Blackwell (their numbers are on sm_86 Ampere)
  * A public Docker image (aamsellem/buun-llama-cpp-dflash:0.2.0)
  * A llama.cpp path that loads the spiritbuun GGUF drafter

To my knowledge, Lucebox PR #86 (May 4, 2026) reports 218 t/s on RTX 5090 desktop 32GB via their path — absolute Blackwell record — but their HTTP server isn’t yet wired up and their engine doesn’t load the spiritbuun GGUF drafter.

The config that works

Image: aamsellem/buun-llama-cpp-dflash:0.2.0 (CUDA 13.1, sm_120 native, NO_VMM, libcuda stub link)

llama-server args:

--model Qwen3.6-27B-Q4_K_M.gguf
--model-draft dflash-draft-3.6-q8_0.gguf
--spec-type dflash
--n-gpu-layers 99
--n-gpu-layers-draft 99
--ctx-size 32000
--ctx-size-draft 256
--batch-size 256 --ubatch-size 64
--parallel 1 --flash-attn on --jinja
--chat-template-kwargs '{"enable_thinking": false}'

Note: per spiritbuun, disable thinking for an extra ~1.8× speedup. The drafter wasn’t trained on outputs containing <think> tags, so acceptance collapses if the model emits them.
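To see why an acceptance collapse is so brutal, here is the standard speculative-decoding expectation: each verify cycle banks one guaranteed token plus each drafted token surviving with probability roughly a^i. The acceptance values and draft length below are illustrative, not measured on this setup.

```python
# Expected tokens banked per verify cycle for k drafted tokens with
# per-token acceptance probability a (illustrative textbook model).
def tokens_per_cycle(a: float, k: int) -> float:
    # 1 guaranteed token + drafted token i survives with probability a^i
    return sum(a**i for i in range(k + 1))

for a in (0.9, 0.7, 0.3):
    print(f"a={a}: {tokens_per_cycle(a, k=8):.2f} tokens/cycle")
```

Drop acceptance from 0.9 to 0.3 and you go from ~6 tokens per cycle to barely more than 1, i.e. you pay the drafter and verify overhead for nothing. That is what <think> tags do to a drafter that never saw them.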

What we haven’t (yet) tested to push past 100+ t/s

Bonus: PFlash, the other half of the problem

While I was finalizing this post, sandropuppo (Lucebox author) posted on r/LocalLLaMA about another release: PFlash. It’s the inverse of DFlash.

The thing is, on Qwen3.6-27B Q4_K_M, decode is fast (74 t/s with DFlash on a 3090, 80 t/s on our 5090M) but prefill scales O(S²). On a 131K-token prompt, vanilla llama.cpp takes 4 minutes before the first token. Painful. Fast decode is moot if you wait 4 minutes per message.
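Scaling that 4-minute figure down quadratically gives a feel for where prefill starts to hurt. Rough model only: it ignores the linear FFN component and any cache reuse, and anchors on the single 131K measurement above.

```python
# O(S^2) prefill, anchored on the measured ~4 min at ~131K tokens.
measured_s, measured_tokens = 240.0, 131_000

def prefill_estimate(tokens: int) -> float:
    # Quadratic-in-sequence-length scaling from the measured point.
    return measured_s * (tokens / measured_tokens) ** 2

for n in (8_000, 32_000, 131_000):
    print(f"{n:>7} tokens: ~{prefill_estimate(n):.0f}s to first token")
```

Under a second at 8K, ~14s at 32K, 4 minutes at 131K: quadratic cost is invisible on short prompts and dominant on long ones.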

PFlash combines Speculative Prefill (arXiv 2502.02789) + FlashPrefill + Block-Sparse-Attention to score important tokens via a tiny Qwen3-0.6B drafter and only prefill the spans that matter. Pure C++/CUDA, no Python, in-process with DFlash.
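The core idea can be sketched in a few lines: score prompt tokens with a cheap drafter, keep only the highest-scoring blocks for full attention. Everything below is a toy illustration with made-up scores and a hypothetical `select_spans` helper; the real PFlash does this in C++/CUDA with block-sparse attention kernels.

```python
# Toy sketch of speculative prefill: keep only the highest-scoring
# blocks of the prompt (scores and threshold are hypothetical).
def select_spans(scores, block=4, keep=0.5):
    """Return indices of the prompt blocks worth prefilling in full."""
    blocks = [scores[i:i + block] for i in range(0, len(scores), block)]
    ranked = sorted(range(len(blocks)),
                    key=lambda i: sum(blocks[i]), reverse=True)
    return sorted(ranked[:max(1, int(len(blocks) * keep))])

# Importance scores a tiny drafter might assign to a 12-token prompt:
scores = [0.1, 0.2, 0.1, 0.1,  0.9, 0.8, 0.7, 0.9,  0.1, 0.1, 0.2, 0.1]
print(select_spans(scores))  # only the high-scoring middle block survives
```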

PFlash was merged into the Lucebox repo today (May 4) — not yet in our v1.2.0 image. Next rebuild we add it. The DFlash + PFlash combo on sm_120 is potentially the top of the top on consumer 24GB: fast decode AND fast first token even on 128K prompts.
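A simple latency model shows why the combo matters: end-to-end time is prefill plus decode, and on long prompts prefill dominates. The "10× prefill cut" below is a hypothetical placeholder, not a PFlash benchmark; the 240s and 80 t/s figures are the ones from this post.

```python
# End-to-end request latency = time-to-first-token (prefill) + decode time.
def request_latency(prefill_s: float, new_tokens: int, decode_tps: float) -> float:
    return prefill_s + new_tokens / decode_tps

# 131K-token prompt, 800 generated tokens, 80 t/s decode:
vanilla = request_latency(prefill_s=240, new_tokens=800, decode_tps=80)
pflash  = request_latency(prefill_s=24,  new_tokens=800, decode_tps=80)  # hypothetical 10x cut
print(f"{vanilla:.0f}s vanilla vs {pflash:.0f}s with a 10x prefill cut")
```

At 80 t/s, the 800 generated tokens only cost 10 seconds; the other 4 minutes is all prefill. Fast decode alone moves almost nothing on a 131K prompt.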

To be tested very soon.

TL;DR

DFlash on 24GB consumer mobile Blackwell is no longer impossible. 80 t/s avg on Qwen3.6-27B with:

  1. spiritbuun/Qwen3.6-27B-DFlash-GGUF Q8_0 (1.75 GB)
  2. spiritbuun/buun-llama-cpp HEAD compiled for sm_120

My April 28 post calling the stock vLLM/llama.cpp path “impossible” became obsolete on May 1. The lesson: in 2026 in the consumer LLM ecosystem, “impossible” has a 72-hour half-life.

Credits

That’s it! If you run a 5090M, 4080M, 3090, or 4090 24GB and you reproduce these numbers (or push past 100 t/s with a DDTree tweak), send me your results. See you next time!


Disclosure — All the benchmarks in this post run on my own Olares One. If the content was useful and you’re considering one, ordering through this referral link gets you $400 off ($3,599 instead of $3,999) and pays me $200. I’m mentioning this out of transparency — and yes, incidentally, it helps keep the blog alive (hosting, domain, and the time I spend writing here). Link valid until late June 2026.
