Hi there!
Today we’re talking TurboQuant — specifically, how to actually get it running on a hybrid model (Qwen3.6-27B, with its mix of Gated DeltaNet and attention layers) on a consumer Blackwell 24GB card. You might tell me “that’s a really narrow niche”. And you’d be right! Most documented setups run on 80GB Ampere data-center cards or a DGX Spark. But it’s exactly what’s sitting in my Olares One, and it might be sitting in yours too. So let’s see where this lands.
TL;DR — the numbers
Bench on the Olares One (RTX 5090M, 24GB GDDR7, 896 GB/s, sm_120 Blackwell). Model: Lorbus/Qwen3.6-27B-int4-AutoRound. 3 prompts × 800 tokens each (Space Invaders HTML, Go REST API guide, PostgreSQL B-tree explainer); temperature=0.6, top_p=0.95.
| Stack | Cold (Space Invaders) | AVG (3 runs) | KV pool | Context |
|---|---|---|---|---|
| Dense MTP + fp8_e5m2 (reference v2.2.2) | ~90 t/s | ~90 t/s | 24K tokens | 75K |
| Turbo TQ K8V4 no MTP | 40 t/s | 40 t/s | 149K tokens | 128K |
| Turbo TQ K8V4 + MTP n=3 | 37 t/s | 38 t/s | 120K tokens | 80K |
| Turbo TQ 4bit_nc no MTP | 28 t/s | — | 224K tokens | 128K |
| Turbo TQ 4bit_nc + MTP n=3 | 46 t/s | 60 t/s [46-73] | 177K tokens | 100K |
The winning config (last row) multiplies the KV pool by ~7 vs the dense reference (177K vs 24K tokens), at the cost of ~49% of cold throughput (46 vs 90 t/s), or ~33% on average (60 vs 90). Worth it for long-prompt or context-heavy agent workloads. We’ll take that deal!
The initial blocker: NotImplementedError on hybrid
First attempt — vanilla vLLM 0.20-nightly with PR #38479 (TurboQuant) merged. And boom, a slap in the face:
NotImplementedError: TurboQuant KV cache is not supported for hybrid
(attention + Mamba) models. Boundary layer protection requires uniform
attention layers.
Why? Because Qwen3.5/3.6 mix Gated DeltaNet (24 of 32 layers in 3.5, similar in 3.6) with full attention. TurboQuant’s boundary-layer protection algorithm assumes uniform layers, hence the hard refusal upstream. There you go.
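For intuition, the guard upstream amounts to something like this. A minimal sketch: the function name and layer labels are mine, not vLLM's actual internals.

```python
# Illustrative sketch of the upstream guard; names are mine, not vLLM's internals.
def check_turboquant_supported(layer_types):
    """Refuse TurboQuant KV cache when the layer stack is not uniform attention."""
    if set(layer_types) != {"attention"}:
        raise NotImplementedError(
            "TurboQuant KV cache is not supported for hybrid "
            "(attention + Mamba) models. Boundary layer protection "
            "requires uniform attention layers."
        )

# Qwen3.5-style layout: 24 Gated DeltaNet layers mixed with 8 full-attention layers
hybrid_layout = ["gdn"] * 24 + ["attention"] * 8
```

A uniform 32-layer attention stack sails through; the hybrid layout above trips the refusal.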
This is where Sandermage Genesis comes in. It’s a set of 28 runtime monkey-patches that fix exactly this gap. The repo (Sandermage/genesis-vllm-patches, MIT, tag v7.51-stable-2026-04-27) was tested on Ampere (RTX A5000 80GB). Nobody had validated it on consumer Blackwell yet. First unknown: do the patches apply cleanly on sm_120?
Genesis on sm_120: results
Spoiler: it works without breaking a sweat.
[INFO:genesis.apply_all] Genesis platform:
compute_capability: [12, 0]
is_blackwell: false # Sandermage classifies sm_120 as "non-Blackwell" but it works
has_native_fp8: true
[INFO:genesis.apply_all] Genesis Results: 26 applied, 32 skipped, 0 failed
Zero failures across 26 applied patches. The 32 skips are either opt-ins we don’t enable, or Ampere-specific patches (FP8 Marlin fallback) that auto-skip because we have native FP8 on Blackwell. Easy.
The critical patch for us is P4 — TurboQuant hybrid model support: it bypasses the NotImplementedError, routes GDN layers through the right path, and fixes page-size mismatches between attention and recurrent layers. Exactly what we needed.
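To picture what a Genesis patch does mechanically, here is a minimal sketch of the runtime monkey-patch pattern. The module and function names are illustrative stand-ins, not the real P4 code:

```python
import types

# Stand-in for the vLLM module being patched; everything here is illustrative.
victim = types.ModuleType("victim")

def _original_check(model):
    raise NotImplementedError("hybrid models unsupported")

victim.check_hybrid = _original_check

def apply_p4_style_patch(module):
    """Wrap the refusing function so hybrid models fall through to a working path."""
    original = module.check_hybrid
    def patched(model):
        if getattr(model, "is_hybrid", False):
            return "routed-gdn-path"  # bypass the refusal, route GDN layers elsewhere
        return original(model)        # non-hybrid models keep the original behavior
    module.check_hybrid = patched

apply_p4_style_patch(victim)
```

The point of the wrapper style: non-hybrid callers still hit the original code path, so the patch stays a no-op everywhere the upstream behavior was already correct.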
Once Genesis is applied, the vLLM engine accepts --kv-cache-dtype turboquant_k8v4 or turboquant_4bit_nc and boots on Qwen3.6-27B. First objective hit!
Four stack-specific gotchas
1. P65 is not optional — it’s a functional dependency
GENESIS_ENABLE_P65_TURBOQUANT_SPEC_CG_DOWNGRADE=1 is documented as an opt-in to address vLLM issue #40880 (MTP × TurboQuant × cudagraph degenerate output). In practice on Blackwell + MTP + 4bit_nc, the pod doesn’t even boot without P65. Look:
torch._dynamo.exc.TorchRuntimeError: RuntimeError when making fake tensor call
Explanation: Dynamo failed to run FX node with fake tensors:
call_function <built-in function mul>(*(
FakeTensor(..., device='cuda:0', size=(196608, 128)),
FakeTensor(..., device='cuda:0', size=(48*s72, 128))
), **{}): got RuntimeError(
'The size of tensor a (196608) must match the size of tensor b (48*s72)
at non-singleton dimension 0'
)
The bug lives in the cudagraph capture path when MTP draft tensors meet TurboQuant kernels. P65 routes spec-verify batches through eager mode (no cudagraph), bypassing the broken zone. Without P65 → TorchDynamo trips, engine init fails. Don’t panic: enable it, it boots.
Cost of P65: you lose the speedup cudagraph would give to spec decode. That’s why MTP+TQ doesn’t deliver the boost you’d expect over a no-MTP baseline. Look at the table — the gap between Turbo K8V4 no MTP (40 t/s) and with MTP n=3 (37-38 t/s) is negative. MTP costs more than it returns under P65. Ouch.
It’s only with TQ 4bit_nc that MTP turns net-positive (46 t/s cold vs 28 t/s without MTP). Probably because the 4bit_nc internal dispatch is more uniform (MSE quant for both K and V), playing better with eager spec-verify than K8V4 (FP8 keys + 4-bit values, heterogeneous dispatch). More on that in the notes.
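The spec-decode arithmetic makes the break-even point concrete: with draft length n and per-token acceptance probability p, one verify pass emits (1 - p^(n+1)) / (1 - p) tokens on average, and P65's eager-mode verify raises the cost of that pass. The p and cost values below are made up for illustration, not measured on this stack:

```python
def expected_tokens_per_step(n, p):
    """Average tokens emitted per verify pass: 1 + p + p^2 + ... + p^n."""
    return (1 - p ** (n + 1)) / (1 - p)

def net_speedup(n, p, verify_cost_ratio):
    """verify_cost_ratio: cost of one verify pass vs one plain decode step.
    When P65 forces eager mode, this ratio grows and can push speedup below 1."""
    return expected_tokens_per_step(n, p) / verify_cost_ratio
```

With an illustrative p = 0.7 and n = 3, a verify pass emits ~2.5 tokens; whether that beats plain decoding depends entirely on how expensive the eager verify pass is.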
2. turboquant_3bit_nc breaks at compile
I tried pushing compression further (4.9×, vs 3.8× for 4bit_nc and 2.6× for K8V4). Immediate failure with everything else identical:
torch._dynamo.exc.TorchRuntimeError: RuntimeError when making fake tensor call
...same shape mismatch (196608, 128) vs (48*s72, 128)...
Disabling Genesis P5B (the pad-smaller-to-max KV strategy) doesn’t help — the issue is intrinsic to MTP draft tensors × 3-bit kernel reshape. Probably an upstream vLLM or Genesis bug specific to 3-bit blocks. Watch for a P67+ from Sandermage that addresses it.
So for now: 3bit_nc + MTP = no-go on this stack. If you want 3bit_nc you have to disable MTP, and then you drop to roughly the 28 t/s of the Turbo TQ 4bit_nc no-MTP row, but worse, due to 3-bit kernel overhead. Not worth it. Next!
3. --max-num-batched-tokens must exceed Mamba block_size
The Mamba cache block_size varies with the KV dtype. Genesis P5 (page-size unification) computes block_size by aligning on the LCM of all attention page patterns:
| KV dtype | Mamba block_size |
|---|---|
| fp8_e5m2 (dense ref) | 2080 |
| turboquant_k8v4 | 2080 |
| turboquant_4bit_nc | ~4096 |
| turboquant_3bit_nc | ~4128 |
vLLM enforces block_size <= max_num_batched_tokens. So on 4bit_nc you need at least 4096 (I went with 8192 for headroom); on K8V4, 4096 is fine. If you start from the Sandermage prod value (4096) on 3bit_nc, you hit an AssertionError at boot, since 4128 > 4096. Adjust per dtype. Simple.
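The alignment rule itself is just an LCM plus a bound check. A sketch with invented per-layer page sizes; only the LCM idea and the block_size <= max_num_batched_tokens constraint come from the behavior described above:

```python
import math

def unified_block_size(page_sizes):
    """P5-style unification: align the Mamba cache block on the LCM of all
    attention page sizes so every layer's pages tile the same block."""
    return math.lcm(*page_sizes)

def check_scheduler_bound(block_size, max_num_batched_tokens):
    """vLLM asserts block_size <= max_num_batched_tokens at boot."""
    return block_size <= max_num_batched_tokens
```

This is why the same max_num_batched_tokens value boots on one KV dtype and asserts on another: the unified block size moves underneath it.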
4. Prefix caching changes everything
In the multi-prompt bench, run 2 (Go REST API) hit 73 t/s while run 1 (Space Invaders, cold) was at 46 t/s. Plot twist: both prompts have different user content but share the system prompt + Qwen3 tokenizer init, and --enable-prefix-caching --prefix-caching-hash-algo xxhash lets vLLM reuse KV from common tokens. Hence the jump.
Useful to know when reporting numbers: a “warm” t/s in an agent iterating over the same context is ~50-60% faster than a “cold” t/s on a fresh prompt. For the cold reference (worst case), use run 1.
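Mechanically, prefix caching keys each fixed-size token block by a hash chained with its parent block's hash, so two prompts sharing a system prompt share their leading block keys. A simplified sketch, with hashlib.sha256 standing in for xxhash (stdlib only) and an illustrative block size of 16 tokens:

```python
import hashlib

def prefix_block_hashes(token_ids, block=16):
    """Chained per-block hashes: a block's key depends on its whole prefix."""
    hashes, parent = [], b""
    for i in range(0, len(token_ids) - len(token_ids) % block, block):
        chunk = repr(token_ids[i:i + block]).encode()
        parent = hashlib.sha256(parent + chunk).digest()
        hashes.append(parent)
    return hashes

shared_system = list(range(64))           # tokens of a shared system prompt
run1 = shared_system + [101, 102] * 20    # different user content...
run2 = shared_system + [201, 202] * 20    # ...same 64-token prefix
```

The four leading block keys match, so their KV blocks are reused; everything after the prefix diverges and is recomputed. That is the run-1 vs run-2 gap in miniature.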
Full recipe
Alright, into the deep end. Here’s everything you need.
Docker image
docker.io/aamsellem/vllm-qwen36-blackwell:0.20.0-genesis — built on vllm/vllm-openai:cu130-nightly-fe9c3d6c5f66c873d196800384ed6880687b9e52 (post-#38479 merge on April 15, 2026).
Dockerfile:
FROM vllm/vllm-openai:cu130-nightly-fe9c3d6c5f66c873d196800384ed6880687b9e52
RUN apt-get update && apt-get install -y --no-install-recommends git && \
git clone --depth 1 --branch v7.51-stable-2026-04-27 \
https://github.com/Sandermage/genesis-vllm-patches.git /tmp/genesis && \
cd /tmp/genesis && \
pip install --no-deps --no-cache-dir ./genesis_vllm_plugin && \
VLLM_DIR="$(python3 -c 'import vllm, os; print(os.path.dirname(vllm.__file__))')" && \
cp -r vllm/_genesis "$VLLM_DIR/_genesis" && \
rm -rf /tmp/genesis && apt-get purge -y git && \
apt-get autoremove -y && rm -rf /var/lib/apt/lists/*
COPY patch_tolist_cudagraph.py /patches/patch_tolist_cudagraph.py
RUN printf '%s\n' '#!/bin/sh' 'set -e' \
      'python3 -m vllm._genesis.patches.apply_all || true' \
      'python3 /patches/patch_tolist_cudagraph.py || true' \
      'exec vllm "$@"' > /entrypoint.sh && \
    chmod +x /entrypoint.sh
ENTRYPOINT ["/entrypoint.sh"]
CMD ["serve"]
vLLM args
--model Lorbus/Qwen3.6-27B-int4-AutoRound
--quantization auto_round
--dtype float16
--kv-cache-dtype turboquant_4bit_nc
--max-model-len 100000
--gpu-memory-utilization 0.97
--max-num-seqs 1
--max-num-batched-tokens 8192
--language-model-only
--enable-prefix-caching
--prefix-caching-hash-algo xxhash
--enable-chunked-prefill
--enable-auto-tool-choice
--tool-call-parser qwen3_coder
--reasoning-parser qwen3
--performance-mode interactivity
--async-scheduling
--no-scheduler-reserve-full-isl
--attention-config.flash_attn_version 2
--speculative-config '{"method":"mtp","num_speculative_tokens":3}'
Env vars
VLLM_USE_FLASHINFER_SAMPLER=1
VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
VLLM_MARLIN_USE_ATOMIC_ADD=1
VLLM_FLOAT32_MATMUL_PRECISION=high
GENESIS_ENABLE_P5B_KV=1
GENESIS_ENABLE_P65_TURBOQUANT_SPEC_CG_DOWNGRADE=1
GENESIS_ENABLE_P66_CUDAGRAPH_SIZE_FILTER=1
GENESIS_ENABLE_P64_QWEN3CODER_MTP_STREAMING=1
GENESIS_ENABLE_P60_GDN_NGRAM_FIX=1
GENESIS_ENABLE_P62_STRUCT_OUT_SPEC_TIMING=1
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,max_split_size_mb:512
NCCL_CUMEM_ENABLE=0
NCCL_P2P_DISABLE=1
OMP_NUM_THREADS=1
CUDA_DEVICE_MAX_CONNECTIONS=8
Bench harness
import urllib.request, json, time

prompts = [
    "Build a complete Space Invaders game in a single HTML file...",
    "Write a comprehensive guide to building a REST API in Go...",
    "Explain how a B-tree index works in PostgreSQL...",
]
results = []
for i, p in enumerate(prompts):
    data = json.dumps({
        "model": "qwen3.6-27b",
        "messages": [{"role": "user", "content": p}],
        "max_tokens": 800,
        "temperature": 0.6, "top_p": 0.95,
    }).encode()
    req = urllib.request.Request("http://localhost:8000/v1/chat/completions",
                                 data=data,
                                 headers={"Content-Type": "application/json"})
    t0 = time.time()
    r = json.loads(urllib.request.urlopen(req).read())
    el = time.time() - t0
    toks = r["usage"]["completion_tokens"]
    print(f"RUN{i+1} TOKENS={toks} ELAPSED={el:.2f}s TPS={toks/el:.2f}")
    results.append(toks / el)
print(f"AVG={sum(results)/len(results):.2f} MIN={min(results):.2f} MAX={max(results):.2f}")
On Olares K8s, run from inside the pod to bypass the auth sidecar:
kubectl exec -n vllmqwen36turbo27bone-aurelien deploy/vllmqwen36turbo27bone -c vllm-server -- python3 -c "..."
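One caveat on the harness: elapsed time includes prefill, so its TPS slightly understates pure decode speed. With stream=true you can timestamp each SSE chunk and rate only the decode phase. A sketch of the chunk-timing math, assuming OpenAI-style data: lines (this is not what the numbers above used):

```python
import json

def decode_only_tps(events):
    """events: (timestamp, sse_line) pairs from a streamed completion.
    Rates tokens from the first content chunk onward, excluding prefill."""
    stamps = []
    for t, line in events:
        if not line.startswith("data: ") or line == "data: [DONE]":
            continue
        delta = json.loads(line[len("data: "):])["choices"][0]["delta"]
        if delta.get("content"):
            stamps.append(t)
    if len(stamps) < 2:
        return 0.0
    return (len(stamps) - 1) / (stamps[-1] - stamps[0])
```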
Live metrics (steady state)
Avg generation throughput: 60 t/s [46-73 range across 3 prompts]
KV cache pool: 177,840 tokens
KV cache usage during generation: 5-15%
Mean acceptance length: variable (P65 forces eager → metrics less meaningful)
Engine init: 100s (cudagraph compilation 42s + load weights 7s + KV alloc + warmup)
Model loading: 16.65 GiB
Random notes
- Why 4bit_nc beats K8V4 on the MTP combo — theory: 4bit_nc does MSE quant on both K and V (uniform), K8V4 mixes FP8 keys + 4-bit values (heterogeneous). The eager spec-verify dispatch (forced by P65) navigates the uniform path better. Observational, not confirmed by reading Genesis code. If anyone has a kernel-level explanation, ping me!
- Why not K3V4_NC — didn’t test. On paper it sits between K8V4 and 4bit_nc, but Sandermage’s docs barely mention it. If you try, let me know what you get!
- Why not Lucebox — Lucebox (Luce-Org/lucebox-hub, MIT) claims 78 t/s on Qwen3.6-27B with a matched DFlash drafter + their custom ggml fork. No public Docker, must compile. That’s my next post!
- Measurement caveats — t/s measured from inside the pod (no envoy auth gate, no network round-trip). Prompt-dependent variance (run 2 GoREST faster than run 1 SpaceInv probably from a partial prefix cache hit on the system prompt + common tokens). For a no-cache-hit number, take run 1.
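To make the "MSE quant" from the first note concrete: instead of plain absmax scaling, you can pick the 4-bit scale that minimizes mean-squared reconstruction error over a small scale grid. Purely illustrative sketch of that idea; this is my reading, not TurboQuant's actual kernel:

```python
def mse_optimal_scale_4bit(values, grid=64):
    """Search scales up to absmax/qmax and keep the one with the lowest MSE."""
    qmax = 7  # symmetric int4: quantized values live in [-7, 7] here
    absmax = max(abs(v) for v in values) or 1.0
    best_err, best_scale = float("inf"), absmax / qmax
    for step in range(1, grid + 1):
        scale = (absmax * step / grid) / qmax
        err = 0.0
        for v in values:
            q = max(-qmax, min(qmax, round(v / scale)))  # quantize + clip
            err += (v - q * scale) ** 2                  # reconstruction error
        if err < best_err:
            best_err, best_scale = err, scale
    return best_scale
```

Because one MSE search serves both K and V in a 4bit_nc-style scheme, the dispatch stays uniform, which is the property the theory above leans on.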
Credits
- Sandro / Sandermage for the 28 Genesis monkey-patches that made all this possible. Several weeks of reverse-engineering work. Hat tip!
- vibhavagarwal5 for vLLM PR #38479 merged on April 15, 2026 — TurboQuant 2-bit KV cache has been upstream since.
- Lorbus for the AutoRound INT4 quant that dequantizes the mtp.fchead to BF16 in the file — moving MTP memory from 2.37 GiB (fresh buffer) to ~280 MiB (read from disk).
- Wasif Basharat and u/Kindly-Cantaloupe978 for the RTX 3090 24GB and RTX 5090 32GB recipes that gave me the starting point.
That’s it! On reproducibility: everything is in the aamsellem/olares-one-market repo (Helm chart vllmqwen36turbo27bone v2.2.0), public Docker image aamsellem/vllm-qwen36-blackwell:0.20.0-genesis. If something doesn’t reproduce, open an issue or drop a comment here and I’ll fix it. See you next time!
Disclosure — All benchmarks in this post run on my own Olares One. If this content helped you and you’re considering buying one, ordering through this referral link gets you $400 off ($3,599 vs $3,999) and nets me $200. I’m mentioning this out of transparency — and yes, incidentally, it helps keep the blog alive (hosting, domain, and the time I spend writing here). Valid until ~end of June 2026.