Skip to content
/ airelien.dev
Go back
Aurélien AMSELLEM

DFlash unblocked on 24GB consumer Blackwell — 80 t/s, 4 days after the "impossible" post

Four days ago I wrote that DFlash on 24GB consumer Blackwell didn't fit. On April 28, a dev publishes a quantized drafter. On April 30, I build, I test, I get 0.97 t/s. On May 1, after my issue, the dev fixes it in 24h. Tonight: 80 t/s. The story of a thesis that lasted 72 hours.

Hi there.

A week ago, I would have sworn it was impossible.

That was even the title of my post: Why DFlash on Qwen3.6-27B doesn’t fit on a 24GB single GPU. On April 28, I’d pulled out the VRAM calculator, added the target (17 GB), the drafter (6 GB), the buffers (3 GB), the KV cache (2 GB). Total 28 GB. I had 24. Dry conclusion: “impossible with the current public paths (Q1 2026)”. I’d even listed the conditions for that to ever change.

Spoiler: those conditions all dropped in four days. Today my Olares One produces 80 t/s on average on Qwen3.6-27B with DFlash active. Here’s how.

Quick recap for newcomers

DFlash is a speculative decoding technique: a small “drafter” model proposes tokens ahead, the big model validates in parallel, you multiply throughput by 2-3× when acceptance is good. The technique was published by z-lab at ICLR 2026, and it’s notoriously fast — Lucebox hits 134 t/s on RTX 3090 with it.

Except to run it, you need the drafter in VRAM on top of the target. And that’s where my 24 GB tap out: the z-lab drafter weighs 6 GB in BF16, that’s too much. Conclusion of the April 28 post: you’d need either a quantized drafter (nobody had done it), or a llama.cpp fork with DFlash support (existed but untested on consumer mobile Blackwell).

I’d ended my post with “if anyone ships a quant drafter, give me a shout.”

April 28, evening

No comment, no tweet. But while digging through HuggingFace late evening, I run into a recent repo: spiritbuun/Qwen3.6-27B-DFlash-GGUF. Publication date: that very morning. Two files:

The Q8_0 drafter is three times smaller than the z-lab BF16. I redo the VRAM math in a corner of a page:

ComponentSize
Target Qwen3.6-27B Q4_K_M16.8 GB
Drafter Q8_01.75 GB
KV cache @ 32K~2 GB
Buffers + cudagraph~1.7 GB
Total~22.3 GB → fits in 24

First ingredient handled. Now I need an inference engine that knows DFlash.

The same spiritbuun maintains a llama.cpp fork (buun-llama-cpp) that adds exactly that: --spec-type dflash with the Q8_0 GGUF drafter. Second ingredient. If I build their fork for my hardware (consumer mobile Blackwell, sm_120 — totally untested by them), I potentially have both halves of the puzzle.

I clone, I write a Dockerfile, I fire up buildx in amd64 emulation on the Mac. And go to bed.

April 29, 2h30 of compile later

Image aamsellem/buun-llama-cpp-dflash:0.1.0. 2.4 GB. Push to Docker Hub. Deploy on Olares.

Pod boots. Target loads. Drafter loads. llama-server announces it’s ready on port 8000. No CUDA crash, no arch mismatch. On paper, all good.

I run my standard bench: three Space Invaders prompts in HTML, max_tokens=800.

Run 1: 800 tok in 235s = 3.40 t/s
Run 2: 800 tok in 525s = 1.52 t/s
Run 3: 800 tok in 824s = 0.97 t/s

Ouch. 3.4 dropping to 0.97. That’s twenty-five times slower than my vLLM Turbo (88 t/s) on exactly the same hardware. The drafter is supposed to accelerate, instead it’s choking. And the per-run degradation smells like a memory leak or a cache progressively corrupting.

I dig into the detailed logs. The DFlash cycle takes 1860 ms per batch, with 1521 ms of verify phase. On Lucebox the same op takes 15-25 ms. Three orders of magnitude. The DFlash custom CUDA kernels are visibly written for Ampere and Ada Lovelace, not for Blackwell consumer SMs.

OK. No config-side fix. It’s a runtime bug. I file a clean issue on spiritbuun’s repo: issue #35, with all three runs, detailed logs, exact command, hardware. And I drop the hot potato.

May 1, late afternoon

GitHub notification. Spiritbuun replies:

“I think this may be fixed now - can you repull and give it another try?”

In between, 8 commits on master. I read the messages:

The first commit catches my eye: “p_min confidence threshold + adaptive draft length”. On consumer Blackwell, the drafter probably has to maintain a minimum confidence threshold before emitting, otherwise the custom kernels enter exotic compute paths that degrade. Plausible hypothesis.

I fire up buildx for a rebuild. Another two hours of compile on OrbStack. Tag: aamsellem/buun-llama-cpp-dflash:0.2.0. Push, redeploy. Pod ready. I rerun the three Space Invaders runs.

Run 1: 800 tok in 10.82s = 73.94 t/s
Run 2: 800 tok in  9.96s = 80.31 t/s
Run 3: 800 tok in  9.41s = 85.06 t/s

Average 80 t/s. No degradation between runs, slightly the opposite — it climbs because the KV cache settles in.

+2300% vs v0.1.0. Eighty times faster in four days. The jump from 0.97 to 85 t/s rides on a single upstream commit (the p_min confidence threshold). That’s the open source ecosystem in 2026 — it moves on a weekly clock.

Olares One comparison

To place that 80 t/s:

StackModelt/s avg
llama.cpp standardUD-Q4_K_XL, no spec decoding33-36
buun-llama-cpp DFlashQ8_0 GGUF drafter80
vLLM Turbo (Genesis)Qwen3.6-27B int4 + MTP n=388
Lucebox DFlash v1.4.4Qwen3.6-27B Q4_K_M88.5

DFlash lands in the champions’ yard. Not first in absolute — vLLM Turbo and Lucebox stay slightly ahead. But it’s the first public demonstration on consumer mobile Blackwell (sm_120 RTX 5090M), and it’s with a quantized drafter that fits in VRAM — exactly the problem I described as unsolvable four days earlier.

To my knowledge, Lucebox PR #86 reports 218 t/s on RTX 5090 desktop 32 GB via their custom path, but their HTTP server isn’t wired up yet and their engine doesn’t load the spiritbuun GGUF drafter. The two paths live in parallel for now.

The config that works

Image: aamsellem/buun-llama-cpp-dflash:0.2.0 (CUDA 13.1, sm_120 native, NO_VMM, libcuda stub link).

llama-server args:

--model Qwen3.6-27B-Q4_K_M.gguf
--model-draft dflash-draft-3.6-q8_0.gguf
--spec-type dflash
--n-gpu-layers 99
--n-gpu-layers-draft 99
--ctx-size 32000
--ctx-size-draft 256
--batch-size 256 --ubatch-size 64
--parallel 1 --flash-attn on --jinja
--chat-template-kwargs '{"enable_thinking": false}'

Important detail: disable thinking. The spiritbuun drafter wasn’t trained on the distribution with <think> tags, so acceptance collapses when the model emits them. Per spiritbuun, that’s worth ~1.8× extra.

The actual lesson

Four days ago I’d titled “impossible”. I was right at moment T: on the public paths of April 28, it didn’t fit. But the open source ecosystem in 2026 has a 72-hour half-life on its “impossibles”. Someone somewhere has the bandwidth to publish a quantized drafter overnight, fix a kernel bug on hardware they don’t even own, drop the commits on master without even announcing.

The lesson I want to keep, and that I should have applied earlier: before writing “impossible”, wait a week. Not out of editorial cowardice — on the contrary, my honest negatives stay useful for mapping the state at the time. But mark the expiration date explicitly. “Impossible on the April 28 public paths”. Not “impossible”. Period.

My April 28 post, updated tonight, now points at this one. The negative and the positive live together in the archive. That’s more honest than rewriting history.

What we haven’t tested yet

A few open paths to push past 80:

To reproduce

Credits

That’s it! If you run on a 5090M, 4080M, 3090 or 4090 24GB and reproduce these 80 t/s — or beat them with a DDTree sweep — send me your numbers. See you next time!


Disclosure — All the benchmarks in this post run on my own Olares One. If the content was useful and you’re considering one, ordering through this referral link gets you $400 off ($3,599 instead of $3,999) and pays me $200. I’m mentioning this out of transparency — and yes, incidentally, it helps keep the blog alive (hosting, domain, and the time I spend writing here). Link valid until late June 2026.

Share this post on:

Comments