This morning’s review of vLLM PR #40082 “Integrate flashinfer b12x MoE + FP4 GEMM for SM120/121” surfaced a blocker that confirms what some of us have been quietly noticing on consumer Blackwell parts: NVIDIA shipped FlashInfer 0.6.11 without SM120/121 cubins.
User AethoceSora posted around 05:00 UTC after bringing up the PR on an 8-node DGX Spark cluster (the GB10 reference platform):
> Two upstream blockers identified:
>
> - CUTLASS DSL MLIR→PTX bug → malformed `_mma` PTX
> - FlashInfer cubin gap: 0.6.11 ships only Sm100a/f/Sm103a cubins, zero SM120/121 binaries → falls back to wrong-arch kernels, “degenerate repetition” output. Recommendation: land with warning guard.
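For context, a minimal sketch of what such a warning guard could look like on the vLLM side. This is purely illustrative: the function name and warning text are mine, not #40082’s; the only real API used is `torch.cuda.get_device_capability()`.

```python
# Illustrative sketch only -- not the actual #40082 guard.
# Assumes torch is installed and a CUDA device is visible.
import warnings

import torch


def flashinfer_nvfp4_moe_supported() -> bool:
    """Hypothetical guard: refuse the FlashInfer NVFP4 MoE path on archs
    FlashInfer 0.6.11 ships no cubins for (SM120/121)."""
    major, minor = torch.cuda.get_device_capability()
    sm = major * 10 + minor  # (12, 0) -> 120, (12, 1) -> 121
    if sm in (120, 121):
        warnings.warn(
            f"FlashInfer 0.6.11 has no SM{sm} cubins; the NVFP4 MoE path "
            "would fall back to wrong-arch kernels and emit degenerate "
            "repetition. Choose a non-FlashInfer quant path instead."
        )
        return False
    return True
```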
This matters for everyone running:
- RTX 5090 Laptop / mobile (5090M) — sm_120, what’s in my Olares One
- RTX 5070 Ti / 5080 desktop — also sm_120
- DGX Spark / GB10 — sm_121
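Not sure which bucket your card falls in? A quick check (assumes a working torch + CUDA install):

```python
# Print this GPU's SM version. RTX 50-series desktop/mobile parts report
# (12, 0) -> sm_120; GB10 reports (12, 1) -> sm_121.
import torch

major, minor = torch.cuda.get_device_capability()
print(f"sm_{major}{minor}")
```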
Anyone who’s tried to deploy an NVFP4-quantized model (post-Q4) on these parts through stock vLLM’s FlashInfer MoE path has likely seen either kernel-selection errors or, more insidiously, output that “works” but is nonsense — the repetition / gibberish loop pattern.
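If you want a cheap smoke test for that failure mode, something like the following catches the obvious loops. The n-gram window and threshold are arbitrary choices of mine, not a principled detector:

```python
# Crude repetition detector: flags text where one short n-gram dominates.
# Window and threshold values are arbitrary; tune to taste.
from collections import Counter


def looks_degenerate(text: str, ngram: int = 4, threshold: float = 0.3) -> bool:
    words = text.split()
    if len(words) < ngram * 2:
        return False
    grams = [tuple(words[i:i + ngram]) for i in range(len(words) - ngram + 1)]
    _, top_count = Counter(grams).most_common(1)[0]
    return top_count / len(grams) >= threshold


print(looks_degenerate("the cat sat " * 20))  # True: one 4-gram dominates
```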
## What this concretely blocks
For my Olares One stack specifically:
- `sudo-0x2a/Qwen3.6-27B-NVFP4-GPTQ` — I tested this last night on stock `vllm/vllm-openai:tokenspeed-preview-x86_64-ubuntu2404` (vLLM 0.20.2). It boots and produces 31.79 t/s steady-state with no spec decoding. Why no FP4 errors? Because Qwen3.6-27B isn’t an MoE — the dense linear layers use FlashInfer’s `FlashInferCutlassNvFp4LinearKernel`, which apparently does work on SM120 (just slowly). The cubin gap bites only on the MoE GEMM path that #40082 is trying to introduce.
- `nvidia/Gemma-4-26B-A4B-NVFP4` and similar MoE NVFP4 quants — these would hit the cubin gap directly. Until NVIDIA ships SM120 cubins (or until vLLM/FlashInfer add a Triton fallback path that doesn’t break), running these on consumer Blackwell with stock FlashInfer is a dice roll on output quality.
- vLLM TurboQuant + NVFP4 KV cache (PR #42345 “Support excluding SWA layers from NVFP4 KV cache”) — same arch dependency.
## What’s actually working on SM120 today
Tested live on Olares One in the last 24h:
| Path | Status |
|---|---|
| AWQ-4bit dense via vLLM tokenspeed-preview | ✅ 214-235 t/s on Gemma 4 26B-A4B (cyankiwi AWQ) |
| AWQ-4bit MoE + DFlash spec decode | ✅ same image, same numbers |
| Compressed-tensors WNA16 Marlin (Lorbus AutoRound INT4) | ✅ 88 t/s on Qwen3.6-27B (Genesis-patched) |
| NVFP4 dense via FlashInfer | ✅ 31.79 t/s (works but slow, no MoE path) |
| NVFP4 MoE via FlashInfer | ❌ wrong-arch fallback per AethoceSora’s report |
| llama.cpp Q4_K_M / IQ | ✅ 64-77 t/s on Qwen3.6-27B + MTP at 262K |
## Recommendation while we wait
If you’re running consumer Blackwell (5070 Ti / 5080 / 5090 / 5090M):
- Don’t trust NVFP4 MoE quants on stock vLLM with FlashInfer until NVIDIA ships SM120 cubins OR vLLM lands a guard like the one AethoceSora recommends in #40082.
- Compressed-tensors WNA16-Marlin (AWQ-4bit) works great — that’s the path I’m shipping in production for Gemma 4 26B-A4B (cyankiwi) and Qwen3.6-27B (Lorbus AutoRound INT4 via Genesis); see the sketch after this list.
- llama.cpp Q4_K_M / Q3_K_XL / Q2_K_XL with MTP speculative decoding are mature and fast — see my Qwen3.6-27B MTP OOM fix post for the current optimal stack.
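For the AWQ path, a minimal offline smoke test via vLLM’s Python API looks roughly like this. The model id is a placeholder (this post doesn’t pin exact HF repo names), and I leave `quantization` unset since vLLM normally detects it from the checkpoint config:

```python
# Minimal offline smoke test for the AWQ/Marlin path on SM120.
# "your-org/gemma-4-26b-a4b-awq" is a placeholder, not a real repo id.
from vllm import LLM, SamplingParams

llm = LLM(
    model="your-org/gemma-4-26b-a4b-awq",  # placeholder: point at your AWQ quant
    max_model_len=8192,
    gpu_memory_utilization=0.95,
)
out = llm.generate(
    ["Explain what an SM120 cubin is in one sentence."],
    SamplingParams(max_tokens=128, temperature=0.7),
)
print(out[0].outputs[0].text)
```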
## What I’d like NVIDIA to ship
NVIDIA: please add SM120a / SM121a cubins to the next FlashInfer release. The consumer Blackwell parts are the largest installed base of Blackwell GPUs today — locking out FP4 MoE on these while shipping it for B200/B100 is a missed opportunity.
In the meantime, the PR #40082 review thread is the best place to watch the upstream resolution. AethoceSora’s recommendation to “land with warning guard” so users at least get a clear error instead of silent garbage output is the right short-term move.
## Olares One reproduction (FlashInfer NVFP4 dense WORKS on SM120)
If you want to confirm the dense NVFP4 path works on your own SM120 hardware:
```bash
# note: -p 8000:8000 publishes the API port so you can query it from the host
docker run --gpus all --rm \
  -p 8000:8000 \
  -v $(pwd)/models:/models \
  vllm/vllm-openai:tokenspeed-preview-x86_64-ubuntu2404 \
  vllm serve sudo-0x2a/Qwen3.6-27B-NVFP4-GPTQ \
    --served-model-name qwen3.6-27b-nvfp4 \
    --host 0.0.0.0 --port 8000 \
    --max-model-len 28000 \
    --gpu-memory-utilization 0.95 \
    --max-num-seqs 1 --max-num-batched-tokens 4096 \
    --kv-cache-dtype fp8 --trust-remote-code \
    --download-dir /models
```
Expect ~31 t/s on an RTX 5090M with the above. If you see “degenerate repetition” output instead, you’re hitting a wrong-arch kernel — likely a different MoE path or KV-cache config. Try without `--kv-cache-dtype fp8` to isolate.
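Once the container is up, a quick way to eyeball the output for the repetition pattern (the endpoint shape is the standard vLLM OpenAI-compatible API; the prompt is arbitrary):

```python
# Query the OpenAI-compatible endpoint the container exposes and print the
# completion; looping/gibberish text here means a wrong-arch kernel fired.
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "qwen3.6-27b-nvfp4",  # matches --served-model-name above
        "prompt": "Briefly describe the RTX 5090 Laptop GPU.",
        "max_tokens": 128,
        "temperature": 0.7,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["text"])
```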
Full stack details + Helm chart in olares-one-market.