Aurélien AMSELLEM

Genesis on consumer Blackwell — TurboQuant unlocked for Qwen3.6-27B on 24GB

Sandermage Genesis patches validated on RTX 5090M (sm_120). TurboQuant 4-bit + MTP n=3 on Qwen3.6-27B → 60 t/s, 100K context, 177K KV tokens.

Hi there!

Today we’re talking TurboQuant — specifically how to actually get it running on a hybrid model (Qwen3.6-27B with its Gated DeltaNet + attention layers) on a consumer Blackwell 24GB card. You might tell me “that’s a really narrow niche”. And you’d be right! Most documented setups are on 80GB Ampere data-center cards or DGX Spark. But it’s exactly what’s sitting in my Olares One, and it might be sitting in yours too. So let’s see where this lands.

TL;DR — the numbers

Bench on the Olares One (RTX 5090M, 24GB GDDR7, 896 GB/s, sm_120 Blackwell). Model: Lorbus/Qwen3.6-27B-int4-AutoRound (Qwen3.6-27B, int4 AutoRound). 3 prompts × 800 output tokens each (Space Invaders HTML, Go REST API guide, PostgreSQL B-tree explainer), temperature=0.6, top_p=0.95.

| Stack | Cold (Space Invaders) | AVG (3 runs) | KV pool | Context |
|---|---|---|---|---|
| Dense MTP + fp8_e5m2 (reference v2.2.2) | ~90 t/s | ~90 t/s | 24K tokens | 75K |
| Turbo TQ K8V4 no MTP | 40 t/s | 40 t/s | 149K tokens | 128K |
| Turbo TQ K8V4 + MTP n=3 | 37 t/s | 38 t/s | 120K tokens | 80K |
| Turbo TQ 4bit_nc no MTP | 28 t/s | | 224K tokens | 128K |
| Turbo TQ 4bit_nc + MTP n=3 | 46 t/s | 60 t/s [46-73] | 177K tokens | 100K |

The winning config (last row) grows the KV pool more than sevenfold vs the dense reference (177K vs 24K tokens), at the cost of roughly a third of average throughput (about half when fully cold). Worth it for long-prompt or context-heavy agent workloads. We’ll take it!

The initial blocker: NotImplementedError on hybrid

First attempt — vanilla vLLM 0.20-nightly with PR #38479 (TurboQuant) merged. And boom, slap in the face:

NotImplementedError: TurboQuant KV cache is not supported for hybrid
(attention + Mamba) models. Boundary layer protection requires uniform
attention layers.

Why? Because Qwen3.5/3.6 mix Gated DeltaNet (24 of 32 layers for 3.5, similar in 3.6) with full attention. TurboQuant’s boundary layer protection algorithm assumes uniform layers — so hard refusal upstream. There you go.
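
If you want to see the split yourself, the checkpoint config often spells it out. A quick sketch, assuming the config exposes a per-layer type list the way other hybrid Qwen releases do (the field name may differ on this checkpoint, hence the fallback):

# Count linear-attention (GDN) vs full-attention layers from the HF config.
# Assumption: a `layer_types` list exists, as on other hybrid Qwen models.
from collections import Counter
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("Lorbus/Qwen3.6-27B-int4-AutoRound", trust_remote_code=True)
layer_types = getattr(cfg, "layer_types", None)
if layer_types is not None:
    print(Counter(layer_types))  # expect mostly GDN/linear-attention layers, a few full-attention
else:
    print("No layer_types field here; check config.json on the model card instead")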

This is where Sandermage Genesis comes in. It’s a set of runtime monkey-patches (26 of which apply on this stack, per the log below) that fix exactly this gap. The repo (Sandermage/genesis-vllm-patches, MIT, tag v7.51-stable-2026-04-27) was tested on Ampere (RTX A5000 80GB). Nobody had validated it on consumer Blackwell yet. First unknown: do the patches apply cleanly on sm_120?

Genesis on sm_120: results

Spoiler: it works without breaking a sweat.

[INFO:genesis.apply_all] Genesis platform:
  compute_capability: [12, 0]
  is_blackwell: false   # Sandermage classifies sm_120 as "non-Blackwell" but it works
  has_native_fp8: true
[INFO:genesis.apply_all] Genesis Results: 26 applied, 32 skipped, 0 failed

Zero failures across 26 applied patches. The 32 skips are either opt-ins we don’t enable, or Ampere-specific patches (FP8 Marlin fallback) that auto-skip because we have native FP8 on Blackwell. Easy.

The critical patch for us is P4 — TurboQuant hybrid model support: it bypasses the NotImplementedError, routes GDN layers through the right path, and fixes page-size mismatches between attention and recurrent layers. Exactly what we needed.
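
If you’re curious what a patch like this looks like mechanically, here’s a toy sketch of the runtime monkey-patch pattern. The class and method names below are invented for illustration, not the real vLLM or Genesis internals; Genesis does the same swap-the-function-before-engine-init trick against the actual modules:

# Toy illustration of the monkey-patch pattern. KVCacheSpec is NOT a real vLLM class;
# it just mimics a validator that hard-refuses hybrid models.
class KVCacheSpec:
    def validate(self, model_is_hybrid: bool) -> str:
        if model_is_hybrid:
            raise NotImplementedError("TurboQuant KV cache is not supported for hybrid models")
        return "ok"

_original_validate = KVCacheSpec.validate

def _patched_validate(self, model_is_hybrid: bool) -> str:
    try:
        return _original_validate(self, model_is_hybrid)
    except NotImplementedError:
        # P4-style bypass: accept the hybrid layout; attention layers keep the
        # TurboQuant path while GDN layers get routed down their own path.
        return "ok (hybrid bypass)"

KVCacheSpec.validate = _patched_validate  # swapped in at import time, before engine init

print(KVCacheSpec().validate(model_is_hybrid=True))  # -> ok (hybrid bypass)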

Once Genesis is applied, the vLLM engine accepts --kv-cache-dtype turboquant_k8v4 or turboquant_4bit_nc and boots on Qwen3.6-27B. First objective hit!

Four stack-specific gotchas

1. P65 is not optional — it’s a functional dependency

GENESIS_ENABLE_P65_TURBOQUANT_SPEC_CG_DOWNGRADE=1 is documented as an opt-in to address vLLM issue #40880 (MTP × TurboQuant × cudagraph degenerate output). In practice on Blackwell + MTP + 4bit_nc, the pod doesn’t even boot without P65. Look:

torch._dynamo.exc.TorchRuntimeError: RuntimeError when making fake tensor call
  Explanation: Dynamo failed to run FX node with fake tensors:
    call_function <built-in function mul>(*(
      FakeTensor(..., device='cuda:0', size=(196608, 128)),
      FakeTensor(..., device='cuda:0', size=(48*s72, 128))
    ), **{}): got RuntimeError(
      'The size of tensor a (196608) must match the size of tensor b (48*s72)
       at non-singleton dimension 0'
    )

The bug lives in the cudagraph capture path when MTP draft tensors meet TurboQuant kernels. P65 routes spec-verify batches through eager mode (no cudagraph), bypassing the broken zone. Without P65 → TorchDynamo trips, engine init fails. Don’t panic: enable it, it boots.

Cost of P65: you lose the speedup cudagraph would give to spec decode. That’s why MTP+TQ doesn’t deliver the boost you’d expect over a no-MTP baseline. Look at the table — the gap between Turbo K8V4 no MTP (40 t/s) and with MTP n=3 (37-38 t/s) is negative. MTP costs more than it returns under P65. Ouch.

It’s only with TQ 4bit_nc that MTP turns net-positive (46 t/s cold vs 28 t/s without MTP). Probably because the 4bit_nc internal dispatch is more uniform (MSE quant for both K and V), playing better with eager spec-verify than K8V4 (FP8 keys + 4-bit values, heterogeneous dispatch). More on that in the notes.

2. turboquant_3bit_nc breaks at compile

I tried pushing compression further: 4.9× KV compression for 3bit_nc, vs 3.8× for 4bit_nc and 2.6× for K8V4. Immediate failure, with everything else identical:

torch._dynamo.exc.TorchRuntimeError: RuntimeError when making fake tensor call
  ...same shape mismatch (196608, 128) vs (48*s72, 128)...

Disabling Genesis P5B (the pad-smaller-to-max KV strategy) doesn’t help — the issue is intrinsic to MTP draft tensors × 3-bit kernel reshape. Probably an upstream vLLM or Genesis bug specific to 3-bit blocks. Watch for a P67+ from Sandermage that addresses it.

So for now: 3bit_nc + MTP = no-go on this stack. If you want 3bit_nc you have to disable MTP, and then you drop to around 28 t/s (roughly the Turbo TQ 4bit_nc no-MTP row, and likely a bit worse given the 3-bit kernel overhead). Not worth it. Next!

3. --max-num-batched-tokens must be at least the Mamba block_size

The Mamba cache block_size depends on the KV dtype. Genesis P5 (page-size unification) computes block_size by aligning on the LCM of all attention patterns:

| KV dtype | Mamba block_size |
|---|---|
| fp8_e5m2 (dense ref) | 2080 |
| turboquant_k8v4 | 2080 |
| turboquant_4bit_nc | ~4096 |
| turboquant_3bit_nc | ~4128 |

vLLM enforces block_size <= max_num_batched_tokens. So on 4bit_nc you need at least 4096 (I went 8192 for headroom), on K8V4 4096 is fine. If you start with the Sandermage prod value (4096) on 3bit_nc, you hit an AssertionError at boot. Adjust per dtype. Simple.
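
If you script deployments across several dtypes, a tiny pre-flight check saves a failed boot. This is just my own helper fed with the numbers from the table above, not something Genesis or vLLM ships:

import math

# Approximate Mamba block_size per KV dtype, taken from the table above.
MAMBA_BLOCK_SIZE = {
    "fp8_e5m2": 2080,
    "turboquant_k8v4": 2080,
    "turboquant_4bit_nc": 4096,
    "turboquant_3bit_nc": 4128,
}

def min_batched_tokens(kv_dtype: str, headroom: float = 2.0) -> int:
    # vLLM asserts block_size <= max_num_batched_tokens; pad with headroom
    # and round up to a multiple of 256 for a comfortable value.
    block = MAMBA_BLOCK_SIZE[kv_dtype]
    return math.ceil(block * headroom / 256) * 256

print(min_batched_tokens("turboquant_4bit_nc"))  # 8192, the value used in this post
print(min_batched_tokens("turboquant_3bit_nc"))  # 8448; the 4096 prod default fails the assert here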

4. Prefix caching changes everything

In the multi-prompt bench, run 2 (Go REST API) hit 73 t/s while run 1 (Space Invaders, cold) was at 46 t/s. Plot twist: the prompts have different user content but share the same system prompt and leading chat-template tokens, and --enable-prefix-caching --prefix-caching-hash-algo xxhash lets vLLM reuse the KV blocks of those common tokens. Hence the jump.

Useful to know when reporting numbers: a “warm” t/s in an agent iterating over the same context is ~50-60% faster than a “cold” t/s on a fresh prompt. For the cold reference (worst case), use run 1.
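
To split the two cleanly, fire the exact same request twice and compare; the second call gets to reuse the KV blocks of the identical prompt prefix. A minimal check, same request shape as the full harness below:

import urllib.request, json, time

def run_once(prompt: str) -> float:
    # One non-streaming chat completion; returns generated tokens per second.
    data = json.dumps({
        "model": "qwen3.6-27b",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 800, "temperature": 0.6, "top_p": 0.95,
    }).encode()
    req = urllib.request.Request("http://localhost:8000/v1/chat/completions",
                                 data=data, headers={"Content-Type": "application/json"})
    t0 = time.time()
    r = json.loads(urllib.request.urlopen(req).read())
    return r["usage"]["completion_tokens"] / (time.time() - t0)

p = "Explain how a B-tree index works in PostgreSQL..."
cold, warm = run_once(p), run_once(p)  # second run benefits from prefix caching
print(f"cold={cold:.1f} t/s  warm={warm:.1f} t/s  ratio={warm/cold:.2f}x")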

Full recipe

Alright, into the deep end. Here’s everything you need.

Docker image

docker.io/aamsellem/vllm-qwen36-blackwell:0.20.0-genesis — built on vllm/vllm-openai:cu130-nightly-fe9c3d6c5f66c873d196800384ed6880687b9e52 (post-#38479 merge on April 15, 2026).

Dockerfile:

FROM vllm/vllm-openai:cu130-nightly-fe9c3d6c5f66c873d196800384ed6880687b9e52

# Install the Genesis plugin, copy its patch tree into the installed vllm package,
# then remove git again to keep the layer slim.
RUN apt-get update && apt-get install -y --no-install-recommends git && \
    git clone --depth 1 --branch v7.51-stable-2026-04-27 \
        https://github.com/Sandermage/genesis-vllm-patches.git /tmp/genesis && \
    cd /tmp/genesis && \
    pip install --no-deps --no-cache-dir ./genesis_vllm_plugin && \
    VLLM_DIR="$(python3 -c 'import vllm, os; print(os.path.dirname(vllm.__file__))')" && \
    cp -r vllm/_genesis "$VLLM_DIR/_genesis" && \
    rm -rf /tmp/genesis && apt-get purge -y git && \
    apt-get autoremove -y && rm -rf /var/lib/apt/lists/*

COPY patch_tolist_cudagraph.py /patches/patch_tolist_cudagraph.py

# Entrypoint: apply the Genesis patches and the local cudagraph tolist workaround
# before exec'ing vllm; both are || true so a patch failure still lets the server start.
RUN echo '#!/bin/sh\nset -e\npython3 -m vllm._genesis.patches.apply_all || true\npython3 /patches/patch_tolist_cudagraph.py || true\nexec vllm "$@"' > /entrypoint.sh && \
    chmod +x /entrypoint.sh

ENTRYPOINT ["/entrypoint.sh"]
CMD ["serve"]

vLLM args

--model Lorbus/Qwen3.6-27B-int4-AutoRound
--quantization auto_round
--dtype float16
--kv-cache-dtype turboquant_4bit_nc
--max-model-len 100000
--gpu-memory-utilization 0.97
--max-num-seqs 1
--max-num-batched-tokens 8192
--language-model-only
--enable-prefix-caching
--prefix-caching-hash-algo xxhash
--enable-chunked-prefill
--enable-auto-tool-choice
--tool-call-parser qwen3_coder
--reasoning-parser qwen3
--performance-mode interactivity
--async-scheduling
--no-scheduler-reserve-full-isl
--attention-config.flash_attn_version 2
--speculative-config '{"method":"mtp","num_speculative_tokens":3}'

Env vars

VLLM_USE_FLASHINFER_SAMPLER=1
VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
VLLM_MARLIN_USE_ATOMIC_ADD=1
VLLM_FLOAT32_MATMUL_PRECISION=high
GENESIS_ENABLE_P5B_KV=1
GENESIS_ENABLE_P65_TURBOQUANT_SPEC_CG_DOWNGRADE=1
GENESIS_ENABLE_P66_CUDAGRAPH_SIZE_FILTER=1
GENESIS_ENABLE_P64_QWEN3CODER_MTP_STREAMING=1
GENESIS_ENABLE_P60_GDN_NGRAM_FIX=1
GENESIS_ENABLE_P62_STRUCT_OUT_SPEC_TIMING=1
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,max_split_size_mb:512
NCCL_CUMEM_ENABLE=0
NCCL_P2P_DISABLE=1
OMP_NUM_THREADS=1
CUDA_DEVICE_MAX_CONNECTIONS=8

Bench harness

import urllib.request, json, time
prompts = [
  "Build a complete Space Invaders game in a single HTML file...",
  "Write a comprehensive guide to building a REST API in Go...",
  "Explain how a B-tree index works in PostgreSQL...",
]
results = []
for i, p in enumerate(prompts):
    data = json.dumps({
        "model": "qwen3.6-27b",
        "messages": [{"role": "user", "content": p}],
        "max_tokens": 800,
        "temperature": 0.6, "top_p": 0.95
    }).encode()
    req = urllib.request.Request("http://localhost:8000/v1/chat/completions",
                                  data=data,
                                  headers={"Content-Type": "application/json"})
    t0 = time.time()
    r = json.loads(urllib.request.urlopen(req).read())
    el = time.time() - t0
    toks = r["usage"]["completion_tokens"]
    print(f"RUN{i+1} TOKENS={toks} ELAPSED={el:.2f}s TPS={toks/el:.2f}")
    results.append(toks/el)
print(f"AVG={sum(results)/len(results):.2f} MIN={min(results):.2f} MAX={max(results):.2f}")

On Olares K8s, run from inside the pod to bypass the auth sidecar:

kubectl exec -n vllmqwen36turbo27bone-aurelien deploy/vllmqwen36turbo27bone -c vllm-server -- python3 -c "..."

Live metrics (steady state)

Avg generation throughput: 60 t/s [46-73 range across 3 prompts]
KV cache pool: 177,840 tokens
KV cache usage during generation: 5-15%
Mean acceptance length: variable (P65 forces eager → metrics less meaningful)
Engine init: 100s (cudagraph compilation 42s + load weights 7s + KV alloc + warmup)
Model loading: 16.65 GiB
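
These are steady-state readings. If you want to poll them yourself instead of reading the startup logs, the OpenAI-compatible server also exposes Prometheus metrics on /metrics; metric names move around between vLLM versions, so this just filters for the relevant ones rather than hardcoding them:

import urllib.request

# Dump cache- and throughput-related Prometheus metrics from the running server.
text = urllib.request.urlopen("http://localhost:8000/metrics").read().decode()
for line in text.splitlines():
    if line.startswith("#"):
        continue
    if any(key in line for key in ("cache_usage", "generation_tokens", "prefix_cache")):
        print(line)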

Random notes

Credits

That’s it! On reproducibility: everything is in the aamsellem/olares-one-market repo (Helm chart vllmqwen36turbo27bone v2.2.0). Public Docker image aamsellem/vllm-qwen36-blackwell:0.20.0-genesis. If something doesn’t reproduce, open an issue or drop a comment here and I’ll fix it. See you next time!


Disclosure — All benchmarks in this post run on my own Olares One. If this content helped you and you’re considering buying one, ordering through this referral link gets you $400 off ($3,599 vs $3,999) and nets me $200. I’m mentioning this out of transparency — and yes, incidentally, it helps keep the blog alive (hosting, domain, and the time I spend writing here). Valid until ~end of June 2026.
