Aurélien AMSELLEM

Qwen3.6-27B at 85-100 t/s on a 24GB RTX 5090 Laptop GPU

Adapting the 32GB desktop and 24GB Ampere recipes to a 24GB Blackwell consumer mobile (sm_120) GPU. Custom vLLM image, AutoRound INT4, MTP n=3 — sustained 85-100 t/s with 75K context.

Following u/Kindly-Cantaloupe978’s 80 t/s @ 218K context post and Wasif Basharat’s 85 t/s Medium write-up, I tried to reproduce on my Olares One — a small home-AI box with an RTX 5090 Laptop GPU (24GB, ~896 GB/s, sm_120 Blackwell), not the 32GB desktop card.

After several iterations: ~85-100 t/s sustained, peaks at 99.7 t/s, 75K max context, MTP n=3 with 92-95% acceptance once warm. That’s roughly 3x faster than llama.cpp on the same hardware (33-36 t/s with the best NVFP4 GGUF) and matches/beats the 32GB desktop references.

TL;DR numbers

| Setup | Hardware | t/s |
|---|---|---|
| llama.cpp, UD-Q4_K_XL or NVFP4 GGUF | RTX 5090M 24GB | 33-36 |
| vLLM v0.17, NVFP4 (no MTP) | RTX 5090M 24GB | 39 |
| vLLM v0.19.1, NVFP4 + MTP n=1 | RTX 5090M 24GB | OOM (model OK, MTP head 2.37 GiB doesn't fit) |
| vLLM v0.19.1, Lorbus AutoRound + MTP n=1 | RTX 5090M 24GB | 65 |
| vLLM v0.19.1, Lorbus AutoRound + MTP n=3 | RTX 5090M 24GB | 85-100 |
| Reference: same recipe, 5090 desktop 32GB | RTX 5090 32GB | 78-80 |
| Reference: Wasif's stack, 3090 24GB | RTX 3090 24GB | 85 |

Five gotchas specific to 24GB Blackwell mobile

1. NVFP4 + MTP = OOM on 24GB

I tried sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP first: NVFP4 runs at 2x FP8 throughput on Blackwell tensor cores, and the model name says it includes the MTP head. It loaded fine, but:

torch.OutOfMemoryError: Tried to allocate 2.37 GiB. GPU has 2.25 GiB free.

Same issue Wasif documents: vLLM’s Qwen3_5MTP loader allocates a fresh 2.37 GiB BF16 buffer for mtp.fc, because NVFP4 quantizes everything in the file. On 32GB that fits; on 24GB it doesn’t.

Fix: switch to Lorbus/Qwen3.6-27B-int4-AutoRound, which keeps only mtp.fc dequantized to BF16 in the file (~280 MiB). vLLM finds it on disk and skips the fresh buffer.

Trade-off: AutoRound INT4 runs on Marlin kernels (tuned for Ampere) instead of native NVFP4 tensor cores. But on a bandwidth-bound consumer card, MTP n=3 buys far more speed than NVFP4 acceleration would have.
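For intuition on why 2.37 GiB is fatal here, a minimal sizing helper (the example shape below is illustrative, not read from the checkpoint; the GiB figures come from the traceback above):

```python
def bf16_gib(shape) -> float:
    # BF16 stores 2 bytes per element.
    n = 1
    for d in shape:
        n *= d
    return n * 2 / 2**30

# The failing allocation from the traceback vs. reported free VRAM:
needed_gib = 2.37  # fresh BF16 buffer vLLM wants for mtp.fc
free_gib = 2.25    # what the allocator had left after loading the model
assert needed_gib > free_gib  # -> torch.OutOfMemoryError on 24 GB

# Illustrative shape: a 16384 x 16384 BF16 matrix is exactly half a GiB.
print(round(bf16_gib((16_384, 16_384)), 2))  # 0.5
```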

2. --kv-cache-dtype fp8_e5m2 rejected with NVFP4 checkpoints

ValueError: fp8_e5m2 kv-cache is not supported with fp8 checkpoints.

AutoRound INT4 isn’t in the FP8 family, so fp8_e5m2 works with it. Bonus: it leaves a larger KV pool than fp8_e4m3 does (the Olares One ends up with 23,760 cached tokens with fp8_e5m2 + gpu-memory-utilization 0.97).
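The arithmetic behind a KV pool's token capacity, as a sketch (the layer/head/dim values below are placeholders, not Qwen3.6-27B's real config):

```python
def kv_pool_tokens(pool_gib: float, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 1) -> int:
    # Per token the cache holds one K and one V vector of
    # (n_kv_heads * head_dim) elements for every layer.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return int(pool_gib * 2**30 // per_token)

# An 8-bit KV dtype (fp8, 1 byte/elem) holds roughly twice as many tokens
# as fp16 (2 bytes/elem) in the same pool:
print(kv_pool_tokens(3.24, 32, 8, 128, bytes_per_elem=1))  # placeholder dims
```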

3. PR vllm#36325 (Blackwell TMA fix) is mandatory on sm_12x

Without it, the Triton autotuner OOMs at warmup. is_tma_supported returns True for any compute capability ≥ 9, but Blackwell consumer parts don’t actually do TMA, so the descriptor buffer allocations blow up VRAM. The PR caps the check at capability < 12; it’s a 4-line patch I cherry-picked into a custom image.
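A minimal sketch of the gate the PR introduces (paraphrased from the description above, not the literal vLLM code):

```python
def is_tma_supported(compute_capability_major: int) -> bool:
    # Hopper (sm_90) onward advertises TMA, but Blackwell consumer
    # (sm_120) can't actually use this path: the descriptor buffers the
    # Triton autotuner allocates blow up VRAM at warmup. Cap at < 12.
    return 9 <= compute_capability_major < 12

print(is_tma_supported(9), is_tma_supported(12))  # True False
```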

4. patch_tolist_cudagraph.py is now public

The previously private patch from Wasif’s article is now published in noonghunna/qwen36-27b-single-3090/patches/. It’s 165 lines and fixes a .tolist() CPU sync that breaks CUDA graph capture during warmup’s continuation-chunk simulation when spec decode and chunked prefill combine. It’s required even with fp8 KV cache (not just TurboQuant).

5. MTP n=3 actually fits on 24GB with Lorbus

I expected n=3 to OOM (Wasif’s article warns about it on 24GB with sakamakismile). With Lorbus’s dequantized mtp.fc and --gpu-memory-utilization 0.97, n=3 fits fine. Mean acceptance length peaks at 3.86 (out of 4, counting the verified bonus token) with 98%/96%/92% per-position acceptance, and generation throughput peaks at 99.7 t/s.
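The acceptance numbers are internally consistent, assuming (my reading, not confirmed from vLLM source) the per-position figures are marginal acceptance fractions and the mean length counts the verified bonus token:

```python
def mean_accept_len(per_position_rates) -> float:
    # Mean tokens emitted per verification step = 1 verified bonus token
    # + the sum of per-position draft acceptance fractions.
    return 1.0 + sum(per_position_rates)

# Reproduces the peak figure from the live metrics:
print(round(mean_accept_len([0.98, 0.96, 0.92]), 2))  # 3.86
```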

The recipe

Custom Docker image (FROM vllm/vllm-openai:v0.19.1-cu130, with the patches from gotchas 3 and 4 applied):

vLLM args:

--model Lorbus/Qwen3.6-27B-int4-AutoRound
--quantization auto_round
--dtype float16
--attention-backend flashinfer
--kv-cache-dtype fp8_e5m2
--max-model-len 75000
--gpu-memory-utilization 0.97
--max-num-seqs 1
--max-num-batched-tokens 2048
--language-model-only
--enable-prefix-caching
--enable-chunked-prefill
--enable-auto-tool-choice
--tool-call-parser qwen3_coder
--reasoning-parser qwen3
--speculative-config '{"method":"mtp","num_speculative_tokens":3}'

Env:

VLLM_USE_FLASHINFER_SAMPLER=1
VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
VLLM_MARLIN_USE_ATOMIC_ADD=1
VLLM_FLOAT32_MATMUL_PRECISION=high
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,max_split_size_mb:512
NCCL_CUMEM_ENABLE=0
NCCL_P2P_DISABLE=1
OMP_NUM_THREADS=1
CUDA_DEVICE_MAX_CONNECTIONS=8

Live metrics (steady state)

Avg generation throughput: 85-100 t/s (variance with content)
Peak: 99.7 t/s
Mean acceptance length: 3.20 → 3.86 (out of 4 max: 3 draft tokens + 1 verified)
Per-position acceptance: 98%/93%/88%
Avg draft acceptance rate: 92-95%
Model loading: 16.87 GiB
KV pool: 23,760 tokens (3.24 GiB)
KV cache usage during generation: 21-31%
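As a toy model of why acceptance length dominates throughput here (my simplification, not a vLLM formula): each verification step emits roughly the mean acceptance length in tokens, at the cost of one target-model forward pass plus the draft-head overhead.

```python
def mtp_speedup(mean_accept_len: float, draft_overhead: float) -> float:
    # Tokens emitted per unit of target-model time, relative to plain decode.
    return mean_accept_len / (1.0 + draft_overhead)

# With the steady-state 3.20 acceptance length and a guessed ~20% draft cost,
# the no-MTP 39 t/s baseline projects to roughly the observed range:
print(round(39 * mtp_speedup(3.20, 0.20)))  # 104
```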


Happy to share the custom Dockerfile or the Helm chart if it helps anyone running on consumer Blackwell mobile. Curious if other 5090M / 4080M / 3090 24GB owners can reproduce these numbers.


Disclosure — All benchmarks in this post run on my own Olares One. If this content helped you and you’re considering buying one, ordering through this referral link gets you $400 off ($3,599 vs $3,999) and nets me $200. I’m mentioning this out of transparency — and yes, incidentally, it helps keep the blog alive (hosting, domain, and the time I spend writing here). Valid until ~end of June 2026.
