Hi there!
Today’s post is a bit different — I’m going to walk you through something that doesn’t work. Specifically: why DFlash (Block Diffusion for Flash Speculative Decoding, ICLR 2026) — which claims a juicy 207 t/s on Qwen3.6-27B on an RTX 3090 24GB — doesn’t fit on the Olares One. That’s well over 3× our best Turbo result (60 t/s on the same 24GB class of hardware, consumer Blackwell in our case). Naturally I tried to reproduce it. Spoiler: it doesn’t fit, and it’s not a tuning problem — it’s just VRAM math. Let’s walk through it together.
The math, first
On the Olares One (RTX 5090M 24GB GDDR7), HAMI vGPU exposes 23.42 GiB usable to the pod after system reservations.
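Quick way to double-check that figure from inside the pod, by the way: a plain nvidia-smi query. HAMI intercepts the NVML calls, so the reported total should be the vGPU cap rather than the physical 24 GB; if your setup reports the full card instead, trust the vLLM profiler log.

nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv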
To run DFlash on Qwen3.6-27B you need to host the target + the drafter + buffers/activations + the KV cache in VRAM. Here’s the budget:
| Component | Size |
|---|---|
| Target Qwen3.6-27B BF16 | 51 GiB |
| Target Qwen3.6-27B FP8 | 27 GiB |
| Target Lorbus/Qwen3.6-27B-int4-AutoRound | 17 GiB |
| Drafter z-lab/Qwen3.6-27B-DFlash (5-layer Qwen3 BF16) | ~6 GiB |
| Activations + cudagraph buffers | 2-3 GiB |
| KV cache (minimum 8K context) | 1-2 GiB |
| Minimum total (with the INT4 target) | ~26-28 GiB |
The piece everyone underestimates is the drafter. “5-layer Qwen3” sounds tiny, but each layer at Qwen3.6-27B width is ~1.2 GiB in BF16 once you count the hidden dim and the QKVO + FFN projections. Five layers × 1.2 GiB ≈ 6 GiB. And the drafter has to stay BF16 because that’s how z-lab trained and published it — no quantized release yet. So we’re cooked before we even start.
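If you want to sanity-check that ~6 GiB with a back-of-the-envelope, here’s a throwaway awk one-off. Every dim in it is my assumption, not a published spec for Qwen3.6-27B (hidden 5120, FFN 25600, 64 query / 8 KV heads, head dim 128, ~152K vocab); whether you fold the embedding table into a per-layer figure or count it separately, you land at roughly the same ~6 GiB total:

awk 'BEGIN {
  h=5120; ffn=25600; qh=64; kvh=8; hd=128; vocab=151936;   # assumed dims, not confirmed
  attn  = h*qh*hd + 2*h*kvh*hd + qh*hd*h;                  # Q, K+V (GQA), O projections
  mlp   = 3*h*ffn;                                         # gate / up / down projections
  layer = (attn + mlp) * 2 / 2^30;                         # BF16 = 2 bytes per param
  embed = vocab * h * 2 / 2^30;                            # embedding table
  printf "per layer %.2f GiB, 5 layers + embeddings %.1f GiB\n", layer, 5*layer + embed }'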
Tests run
I still tried several paths, just to be sure. Here’s the rundown.
Path 1: stock vLLM + z-lab drafter (the “official” path)
vllm serve Qwen/Qwen3.6-27B \
--speculative-config '{"method":"dflash","model":"z-lab/Qwen3.6-27B-DFlash","num_speculative_tokens":15}' \
--attention-backend flash_attn \
--max-num-batched-tokens 32768
With Qwen/Qwen3.6-27B BF16 (51 GiB) as target, immediate OOM at model loading. Of course.
I swapped target → Lorbus/Qwen3.6-27B-int4-AutoRound (17 GiB). Model load passes. Then:
torch.OutOfMemoryError: CUDA out of memory.
Tried to allocate 2.37 GiB. GPU 0 has a total capacity of 23.42 GiB
of which 266.38 MiB is free. Including non-PyTorch memory,
this process has 23.28 GiB memory in use.
The drafter takes 6 GiB we don’t have. Reducing max-model-len to 16K changes nothing — the OOM is on weights (target + drafter), not KV cache. Wrong lever.
I also tried --gpu-memory-utilization 0.85 and --enforce-eager (skips the CUDA graph allocation). Still 22.86 GiB allocated at model load: vLLM’s memory profiler reserves space upfront, so trimming the target context doesn’t free much.
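For completeness, the fullest Path 1 variant I ended up with looked roughly like this (exact flag spellings vary by vLLM version; check vllm serve --help):

vllm serve Lorbus/Qwen3.6-27B-int4-AutoRound \
--speculative-config '{"method":"dflash","model":"z-lab/Qwen3.6-27B-DFlash","num_speculative_tokens":15}' \
--attention-backend flash_attn \
--max-model-len 16384 \
--gpu-memory-utilization 0.85 \
--enforce-eager

Same wall: the weights alone blow the budget. Next!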
Path 2: AEON-7 NVFP4 (Blackwell-targeted)
AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-DFlash ships an NVFP4 variant (26 GB) with a docker-compose for DGX Spark / RTX PRO 6000 Blackwell. It targets Blackwell hardware, so my naive first instinct was: 26 GB is smaller than the 51 GB BF16, maybe it squeezes in?
Their README is honest about it:
“Anything older than A100” / “Not supported” / “51 GB BF16 or 26 GB NVFP4 will not fit.”
The 26 GB NVFP4 already doesn’t fit on consumer 24GB. Their explicit path is DGX Spark (sm_121a) or RTX PRO 6000 Blackwell 96GB. Not for us, sadly. KO.
Path 3: Lucebox custom engine
Luce-Org/lucebox-hub (MIT) ships a GGUF DFlash port that hits 78 t/s on Qwen3.6-27B on an RTX 3090 24GB. Wait — how do they fit in 24GB when vLLM can’t?
Answer: they use a Q4_K_M target (17 GiB) + TQ3_0 KV cache (TurboQuant 3-bit, GGUF) and their fork Luce-Org/llama.cpp@luce-dflash with custom tree-mode operations. The drafter is integrated differently (not a 6 GiB BF16 model loaded in full). So they did some serious magic — but magic that works.
But — and it’s a big but — there’s no public Docker image and no binary release. It’s research-grade: you compile the llama.cpp fork + dflash + megakernel by hand, per target GPU. Packaging that as a Docker image usable on Olares K8s deserves a separate post (and a dedicated session). sm_120 hasn’t been tested by the Luce team; they target Ampere / Ada. To follow!
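For reference, and assuming their fork builds like upstream llama.cpp with CUDA (an assumption on my part: I haven’t tried it, and any extra DFlash/megakernel CMake options are unknown to me), an sm_120 build attempt would look something like this; CMAKE_CUDA_ARCHITECTURES=120 needs a toolkit recent enough to know sm_120 (CUDA 12.8 or later):

git clone --branch luce-dflash https://github.com/Luce-Org/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=120
cmake --build build --config Release -j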
What it would take to unlock consumer 24GB
I see three paths to a future where DFlash actually fits on a single 24GB GPU:
- A quantized drafter — a z-lab/Qwen3.6-27B-DFlash-FP8 (3 GiB) or -INT4 (1.5 GiB) replacing the current BF16. Frees 3-5 GiB and the stock vLLM path becomes viable. As far as I know nobody has published this; if you spot it somewhere, ping me! (See the sketch after this list.)
- Or Lucebox packaged — a reproducible Docker image, sm_120 support confirmed. In progress on my side.
- Or a tighter FP8 target — Qwen/Qwen3.6-27B-FP8 weighs in at 27 GiB, just over budget. An “fp8-mixed” variant down to 22-23 GiB would leave 1-2 GiB for a mini quantized drafter. Haven’t seen one in the wild either.
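If someone wants to take a crack at the first bullet, the obvious starting point is AutoRound, the same toolchain behind the Lorbus target. A hypothetical invocation, with two big caveats: it assumes the drafter loads as a regular causal LM (the DFlash head may well break that), and the flag names are from the auto-round README, so verify them against your installed version:

auto-round \
--model z-lab/Qwen3.6-27B-DFlash \
--bits 4 \
--group_size 128 \
--format auto_round \
--output_dir ./Qwen3.6-27B-DFlash-int4

If that produced a working ~1.5 GiB drafter, the Path 1 budget would land around 17 + 1.5 + 2-3 + 1-2 ≈ 21.5-23.5 GiB: tight against the 23.42 GiB usable, but no longer hopeless.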
TL;DR
DFlash on a consumer 24GB GPU = impossible with the public paths available today (Q1 2026). The z-lab BF16 drafter needs 6 GiB that simply aren’t there once the Lorbus INT4 target (17 GiB) and buffers are in. Future fixes: a quantized drafter (not published), a tighter FP8 target, or Lucebox packaged as Docker (work in progress on my end).
In the meantime, don’t lose sleep over it: Turbo (TurboQuant 4bit_nc + MTP n=3 + Genesis Sandermage) hits 60 t/s avg / 73 peak on the same card — our best DFlash-equivalent on an open stack at 24GB. Everything is in the previous post — go read it if you haven’t already.
Credits
- z-lab/dflash for the method (block diffusion drafter, parallel drafting, 207 t/s claim)
- AEON-7 for the Blackwell vLLM image and the honest README on the hardware floor
- Luce-Org for the GGUF port that fits on a single 24GB GPU. More to come on the Docker packaging.
That’s it! If you run a 5090M, 4080M or 3090 24GB and you manage to make DFlash fit in the budget, I really want to know how you did it. See you next time!
Disclosure — All benchmarks in this post run on my own Olares One. If this content helped you and you’re considering buying one, ordering through this referral link gets you $400 off ($3,599 vs $3,999) and nets me $200. I’m mentioning this out of transparency — and yes, incidentally, it helps keep the blog alive (hosting, domain, and the time I spend writing here). Valid until ~end of June 2026.