Hi there.
Quick recap for newcomers: Genesis is the custom vLLM stack maintained by Sandermage — a developer who shipped a 28-patch Python suite that makes Qwen3.6-27B run with speculative decoding on consumer Blackwell. On my Olares One that’s what gets me to 88 tokens per second, my current best. But it’s heavy: a 5GB custom Docker image, and every new vLLM release means re-patching while chasing upstream changes.
Yesterday morning (May 5, 2026, 00:14 UTC), JartX merged PR #39931 into vLLM main: “[Feature] TurboQuant: support hybrid models and uniform quantization”. This PR natively fixes exactly what half of the Genesis patches do by hand. So: can we drop Genesis?
Spoiler: yes, but not all the way. 72.55 t/s without Genesis vs 88 with. Here’s the full story.
PR #39931 — what it actually fixes
Reading the body of #39931:
- Hybrid models: TurboQuant was throwing NotImplementedError the moment it hit a Mamba layer on Qwen3.5/3.6/Qwen3-Next. Now it applies TurboQuant only to the full_attention layers and lets Mamba/SWA pass through transparently (see the sketch after this list).
- Page-size planner: the hybrid planner used the standard formula, which doesn’t match TurboQuant’s packed K|V layout → assertions on page merges. Fixed.
- Backend selector: excluded layers (sliding-window, Mamba, skipped) were still wrongly forcing the TURBOQUANT backend. Now they fall back to the default backend.
- ROCm flash_attn_varlen_func: wrapper for the out= incompat (not for us, we’re CUDA).
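To make the hybrid fix concrete, here is a toy sketch of what the new selection logic boils down to (illustrative only, not vLLM’s actual code; the Layer class and backend names are placeholders):

```python
from dataclasses import dataclass

DEFAULT_BACKEND = "FLASH_ATTN"  # placeholder for whatever the model would use anyway

@dataclass
class Layer:
    index: int
    attn_type: str  # "full_attention", "mamba", or "sliding_window"

def pick_backend(layer: Layer) -> str:
    # Post-#39931 behaviour: only full-attention layers get the TurboQuant
    # KV-cache backend; Mamba and sliding-window layers are excluded and fall
    # back to the default backend instead of being forced onto TURBOQUANT.
    return "TURBOQUANT" if layer.attn_type == "full_attention" else DEFAULT_BACKEND

# Every 4th layer is full attention, like the [3, 7, 11, ...] pattern in the logs further down.
layers = [Layer(i, "full_attention" if i % 4 == 3 else "mamba") for i in range(8)]
print([(l.index, pick_backend(l)) for l in layers])
```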
On the Genesis side (reminder: that 28-patch suite from the intro), three patches cover exactly that:
- P60 — fixes the ngram path for GDN models (the Mamba layers)
- P65 — drops the CUDA graph mode down a level when speculative decoding is on
- P66 — filters invalid sizes during CUDA graph capture
If #39931 does the job natively, those three become obsolete.
Finding the right image
The one annoying detail: vLLM v0.20.1 shipped on May 4 at 10:36 UTC — that is, ~14h before #39931 merged. So the official release doesn’t have it. And the standard nightly tag was stuck at 2026-05-04 06:08 UTC (commit 01d4d1ad from 04:33 UTC, predates the PR by 20h). The nightly job either crashed or didn’t run that day.
But on Docker Hub there was this odd tag: vllm/vllm-openai:gemma4-0505-cu130 pushed at 18:27 UTC on May 5 — i.e. 18h after #39931. The tag is probably a special build for Gemma 4’s day-zero (it shipped that day). But it’s a full main HEAD build, so it has the PR.
8.67 GB amd64 + cu130 (CUDA 13.0). Exactly what I need for native sm_120.
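If you want to double-check that the image really ships native sm_120 kernels before deploying, a quick probe inside the container is enough (standard torch calls, nothing vLLM-specific):

```python
import torch

# A cu130 image should report CUDA 13.0 and include sm_120 kernels for consumer Blackwell.
print(torch.version.cuda)                    # expect "13.0"
print(torch.cuda.get_arch_list())            # expect "sm_120" in the list
print(torch.cuda.get_device_capability(0))   # (12, 0) on an RTX 5090M
```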
First crash: HAMi 0m
Pod deployed. Logs:
[HAMI-core Warn] invalid device memory limit CUDA_DEVICE_MEMORY_LIMIT_0=0m
...
File ".../vllm/v1/spec_decode/llm_base_proposer.py", line 151, in __init__
self.input_ids = torch.zeros(
RuntimeError: CUDA driver error: invalid argument
Ouch. The HAMi-core bug I already know well: Olares sets CUDA_DEVICE_MEMORY_LIMIT_0=0m (intent: “no limit”), but HAMi parses “0m” as “0 bytes”, so any CUDA allocation crashes. I thought vLLM was immune because it goes through the Runtime API (unlike Lucebox, which uses the Driver API). Wrong: EagleProposer.__init__ (the MTP drafter) also triggers the intercepted HAMi path.
Known workaround: CUDA_DEVICE_MEMORY_LIMIT_0=24000m env var override (same fix as lucedflashqwen36one v1.4.2+).
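For reference, the override can also be baked into the entrypoint before anything touches CUDA; a minimal sketch, assuming the env var is visible to the Python entrypoint and 24 GB is the right cap for your card:

```python
import os
import re

# HAMi-core reads "0m" as a 0-byte limit instead of "no limit", so the first
# CUDA allocation dies. Rewrite it to an explicit cap before vLLM starts.
limit = os.environ.get("CUDA_DEVICE_MEMORY_LIMIT_0", "")
if re.fullmatch(r"0+[kmg]?", limit, flags=re.IGNORECASE):
    os.environ["CUDA_DEVICE_MEMORY_LIMIT_0"] = "24000m"  # 24 GB card on the Olares One
    print("HAMi workaround: CUDA_DEVICE_MEMORY_LIMIT_0 overridden to 24000m")
```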
Second crash: CUDA graph + TurboQuant + spec decoding
Pod restarts. This time we get past boot, the AutoRound model downloads (17.69 GiB), TurboQuant detects the full-attention layers:
INFO config.py:195 TQ hybrid: full-attention layers [3, 7, 11, 15, 19, 23, 27, 31, 35, 39, 43, 47, 51, 55, 59, 63]
INFO Resolved architecture: Qwen3_5MTP
INFO Using TURBOQUANT attention backend out of potential backends: ['TURBOQUANT'].
INFO Mamba cache mode is set to 'align' for Qwen3_5ForConditionalGeneration when prefix caching is enabled
Great, PR #39931 + PR #40454 (Mamba ‘align’ mode) are active natively. Without any Genesis patch.
Then during CUDA graph capture:
File ".../vllm/v1/attention/backends/turboquant_attn.py", line 583, in _prefill_attention
qsl = query_start_loc.tolist()
RuntimeError: Cannot copy between CPU and CUDA tensors during CUDA graph capture
unless the CPU tensor is pinned. Please use tensor.pin_memory() or allocate
the tensor with pin_memory=True.
Ouch redux. That’s exactly what Genesis P65 TURBOQUANT_SPEC_CG_DOWNGRADE monkey-patches. PR #39931 has NOT fixed this point — it targeted the hybrid models aspect, not CUDA graphs + spec decoding.
Minimal workaround: --enforce-eager (skip ALL CUDA graphs). We’ll lose some speedup but the pod will boot.
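Before burning a full pod restart cycle, the flag can also be sanity-checked offline with vLLM’s Python API; a rough sketch (the model id and memory fraction are placeholders, and I’m assuming the image exposes the usual LLM/SamplingParams entry points):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3.6-27B-AutoRound",  # placeholder: whatever AutoRound checkpoint you serve
    enforce_eager=True,                  # offline equivalent of --enforce-eager: no CUDA graphs
    gpu_memory_utilization=0.90,
)
params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=800)
out = llm.generate(["Write a Space Invaders game in HTML+CSS+JS."], params)
print(out[0].outputs[0].text[:200])
```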
The bench
Add the flag, restart, the pod hits READY 1/1 after 3min37s. Three standard Space Invaders prompts (HTML+CSS+JS, 800 tokens max, temp=0.6, top_p=0.95, MTP n=3 active):
Run 1: 800 tok in 11.12s = 71.97 t/s
Run 2: 800 tok in 10.84s = 73.81 t/s
Run 3: 800 tok in 11.13s = 71.87 t/s
AVG = 72.55 t/s [71.87 – 73.81]
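The average and the spread are just those three per-run figures, nothing more:

```python
tps = [71.97, 73.81, 71.87]           # the three runs above
print(round(sum(tps) / len(tps), 2))  # 72.55
print(min(tps), max(tps))             # 71.87 73.81
```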
Comparison on Olares One (RTX 5090M 24GB sm_120)
| Stack | t/s | Custom patches |
|---|---|---|
| Vanilla vLLM main HEAD + #39931 + --enforce-eager | 72.55 | 0 |
| llama.cpp standard (no spec) | 33-36 | pure upstream |
| llama.cpp + MTP PR #22673 | 78.1 | 1 PR open |
| buun-llama-cpp DFlash + Q8_0 drafter | 80 | llama.cpp fork |
| Genesis 28 patches + custom image (baseline) | 88.0 | 28 patches + disable_p8 |
| Lucebox v1.4.4 (DFlash test_dflash + libvgpu hot-swap) | 88.5 | custom engine |
vLLM no-Genesis = -17.5% vs Genesis. Better than llama.cpp standard, but below anything that goes beyond vanilla.
Why -17%: we lose the CUDA graphs
--enforce-eager disables all CUDA graph captures. On 1-token decode batches (the common case for single-user serving), CUDA graphs typically buy +20-30% on Blackwell; and indeed 72.55 × ~1.21 ≈ 88 t/s, which is exactly the Genesis baseline.
Genesis P65 is smarter: it downgrades _cudagraph_support from UNIFORM_BATCH → UNIFORM_SINGLE_TOKEN_DECODE only when speculative_config is active. Result: K+1 spec-verify batches (the ones that crash) fall to eager, but 1-token decode batches keep their CUDA graph.
P65 code (very clean, genuinely upstream-able):
@classmethod
def get_cudagraph_support(cls, vllm_config, kv_cache_spec) -> AttentionCGSupport:
"""Context-aware downgrade for spec-decode only."""
if vllm_config.speculative_config is not None:
return AttentionCGSupport.UNIFORM_SINGLE_TOKEN_DECODE
return cls._cudagraph_support
That’s it. Three lines added to the TurboQuantMetadataBuilder class in vllm/v1/attention/backends/turboquant_attn.py.
The upstream issue already exists: #40807
Digging around I find vLLM issue #40807 opened by noonghunna on April 24, 2026: “TurboQuant KV + spec-decode + chunked-prefill crashes CUDA graph capture at query_start_loc.tolist()”. That’s exactly our crash, exactly the same line.
The issue contains a fascinating thread:
- noonghunna confirms on RTX 3090 (Ampere) with Qwen3.6-27B
- Sandermage (Genesis author) commented on April 24, shared his Patches 23/44/22/38 that fix it, and explicitly offered: “Happy to extract Patch 23 + Patch 44 into an actual upstream PR if the core team finds the approach acceptable”
- xyehya confirms on RTX 5080 (consumer Blackwell) with Qwen3.5-9b NVFP4
- But 12 days after Sandermage’s offer: no upstream PR filed
And with my test, we add a fourth confirmed hardware configuration: RTX 5090M sm_120 mobile (Olares One). The bug reproduces on Ampere, consumer Blackwell desktop, and consumer Blackwell mobile.
What already works natively (the good side)
PR #39931 did still deliver concrete things:
✅ TQ hybrid: full-attention layers [...] — TurboQuant skips Mamba/SWA on its own
✅ Using TURBOQUANT attention backend — backend selector OK
✅ Resolved architecture: Qwen3_5MTP — the MTP arch is native, no need for Genesis P64
✅ Mamba cache align mode (PR #40454) — the hybrid spec decoding cache is sorted
✅ TurboQuant overriding flash_attn_version to 2 — auto-fallback FA3→FA2
Genesis-side, that means P60 (GDN ngram fix) and P64 (Qwen3 Coder MTP streaming) are now obsolete. P65/P66 (CUDA graph + spec) remain useful.
Next step: apply Patch 65 alone (and find out it isn’t enough)
Curious, I wrote an init script that text-patches Sandermage’s P65 onto turboquant_attn.py at pod startup, on the vanilla gemma4-0505-cu130 image.
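The patcher itself is nothing clever; roughly this kind of sketch (the install path and the class marker are assumptions about the image layout, and the inserted body is just Sandermage’s P65 from above):

```python
import pathlib

# Path inside the gemma4-0505-cu130 image (assumption: stock pip layout, Python 3.12).
TARGET = pathlib.Path(
    "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/turboquant_attn.py"
)

# P65 body; AttentionCGSupport is assumed to already be imported by the target module.
P65 = '''
    @classmethod
    def get_cudagraph_support(cls, vllm_config, kv_cache_spec) -> AttentionCGSupport:
        """Context-aware downgrade for spec-decode only (Genesis P65)."""
        if vllm_config.speculative_config is not None:
            return AttentionCGSupport.UNIFORM_SINGLE_TOKEN_DECODE
        return cls._cudagraph_support
'''

src = TARGET.read_text()
if "def get_cudagraph_support" not in src:
    # Insert the method right after the class header line (valid Python:
    # the method simply lands before the class docstring).
    marker = "class TurboQuantMetadataBuilder"
    head, sep, tail = src.partition(marker)
    sig_line, nl, rest = tail.partition("\n")
    TARGET.write_text(head + sep + sig_line + nl + P65 + rest)
    print("P65 applied: TurboQuantMetadataBuilder.get_cudagraph_support added")
```

Boot passes, and P65 is properly detected by vLLM: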
P65 applied: TurboQuantMetadataBuilder.get_cudagraph_support added
WARNING [compilation.py:1390] CUDAGraphMode.FULL_AND_PIECEWISE is not supported with spec-decode for attention backend TurboQuantAttentionBackend (support: AttentionCGSupport.UNIFORM_SINGLE_TOKEN_DECODE); setting cudagraph_mode=PIECEWISE
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 4/4
init engine ... took 100.32 s (compilation: 39.30 s)
Application startup complete.
CUDA graphs captured, no crash, server READY 1/1. Bingo… no. On the first bench request, the engine crashes:
File ".../vllm/v1/attention/backends/turboquant_attn.py", line 862, in _decode_attention
current_workspace_manager().get_simultaneous(...)
AssertionError: Workspace is locked but allocation from 'turboquant_attn.py:862:
_decode_attention' requires 0.76 MB, current size is 0.00 MB. Workspace growth
is not allowed after locking.
Ouch. P65 fixes the CUDA graph capture — but not the workspace lock that blows up on the first inference. That’s exactly what Sandermage described in his #40807 comment: “profiler-invisible torch.empty inside _continuation_prefill — the allocation happens after profile_run finishes, so vLLM’s KV sizing never accounts for it”. His Patches 22 + 38 pre-allocate that buffer so it exists before the lock.
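If the failure mode sounds abstract, here is a toy illustration (nothing to do with vLLM’s real workspace manager) of why an allocation the profiler never sees blows up after the lock:

```python
class Workspace:
    """Toy model: size is frozen ('locked') once profiling ends."""
    def __init__(self):
        self.size_mb = 0.0
        self.locked = False

    def request(self, mb: float):
        if self.locked and mb > self.size_mb:
            raise AssertionError(
                f"Workspace is locked but allocation requires {mb:.2f} MB, "
                f"current size is {self.size_mb:.2f} MB. Growth is not allowed after locking."
            )
        self.size_mb = max(self.size_mb, mb)

ws = Workspace()
ws.request(0.0)    # profile_run never exercises the continuation-prefill buffer
ws.locked = True   # KV sizing is finalized here
ws.request(0.76)   # first real request -> the AssertionError from the log above
```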
So P65 alone is not enough. To leave Genesis behind for good, you need at minimum five patches from the same set:
- P65 — the 3-line CUDA graph downgrade in spec-decode (the one we just tested)
- P22 — shared dequant buffer + 4-D K/V workspace prealloc (heavy lift)
- P38 — the memory state machine for the “continuation prefill” phase
- Maybe P44 — the attention output buffer to stay idempotent under capture
- Maybe P23 — the cu_seqlens prealloc
That’s more than a simple “extract P23+P44 to upstream PR” as Sandermage initially proposed. It’s a serious effort that probably deserves several distinct PRs.
Provisional verdict
We can’t drop Genesis today without losing 17% throughput (--enforce-eager workaround) or without porting ~5 non-trivial patches upstream. The path is known, but vLLM main is still missing the suite of CUDA graph + TurboQuant + spec + workspace fixes.
My immediate plan:
- Comment on #40807 with two new consumer-Blackwell-mobile datapoints absent from the thread:
  - vanilla #39931 + --enforce-eager = 72.55 t/s (vs 88 baseline Genesis)
  - vanilla #39931 + Patch 65 alone = boot OK + CG OK, but workspace lock crash on the 1st inference
- Push Sandermage toward a series of upstream PRs — not just P23+P44, more like P22+P38+P65 minimum
- Test the full pack (P22+P38+P44+P65) on my 5090M when Sandermage extracts them — to validate it climbs back to 88+ t/s
- Once the PRs land: bump the vllmqwen36turbo27bone image to vanilla vllm/vllm-openai:nightly and drop Genesis for good
To follow this
- Issue #40807 — TurboQuant + spec + CG capture
- PR #39931 — the half-fix that merged
- Repo Sandermage/genesis-vllm-patches — where P23/P44 live until upstream
- vLLM v0.20.2, upcoming (probably the first stable release with #39931)
Credits
- JartX for PR #39931, the hybrid TurboQuant fix, and the upstream FP16 rotation work
- Sandermage (Sander Barzov, Odessa) for Genesis and Patches 22/23/38/44 that identified the root cause
- noonghunna for filing #40807 with a detailed repro
- xyehya for the consumer Blackwell confirmation (his 5080)
That’s it! If you run Qwen3.5/3.6 + TurboQuant + spec decoding on any consumer Blackwell hardware (5070/5080/5090/5090M) and reproduce these 72 t/s with --enforce-eager, or ideally 88+ t/s with Sandermage’s Patches applied, send your numbers in the comments. See you next time!
Disclosure — All the benchmarks in this post run on my own Olares One. If the content was useful and you’re considering one, ordering through this referral link gets you $400 off ($3,599 instead of $3,999) and pays me $200. I’m mentioning this out of transparency — and yes, incidentally, it helps keep the blog alive (hosting, domain, and the time I spend writing here). Link valid until late June 2026.