This morning’s review of vLLM PR #40082 “Integrate flashinfer b12x MoE + FP4 GEMM for SM120/121” surfaced a blocker that confirms what some of us have been quietly noticing on consumer Blackwell parts: NVIDIA shipped FlashInfer 0.6.11 without SM120/121 cubins.
User AethoceSora posted around 05:00 UTC after bringing up the PR on an 8-node DGX Spark cluster (the GB10 reference platform):
> Two upstream blockers identified:
>
> - CUTLASS DSL MLIR→PTX bug → malformed `_mma` PTX
> - FlashInfer cubin gap: 0.6.11 ships only Sm100a/f/Sm103a cubins, zero SM120/121 binaries → falls back to wrong-arch kernels, “degenerate repetition” output. Recommendation: land with warning guard.
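For context, a minimal sketch of what such a warning guard could look like on the vLLM side. This is purely illustrative: the function name and warning text are mine, not #40082’s; the only real API used is `torch.cuda.get_device_capability()`.

```python
# Illustrative sketch only -- not the actual #40082 guard.
# Assumes torch is installed and a CUDA device is visible.
import warnings

import torch


def flashinfer_nvfp4_moe_supported() -> bool:
    """Hypothetical guard: refuse the FlashInfer NVFP4 MoE path on archs
    FlashInfer 0.6.11 ships no cubins for (SM120/121)."""
    major, minor = torch.cuda.get_device_capability()
    sm = major * 10 + minor  # (12, 0) -> 120, (12, 1) -> 121
    if sm in (120, 121):
        warnings.warn(
            f"FlashInfer 0.6.11 has no SM{sm} cubins; the NVFP4 MoE path "
            "would fall back to wrong-arch kernels and emit degenerate "
            "repetition. Choose a non-FlashInfer quant path instead."
        )
        return False
    return True
```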
This matters for everyone running:
- RTX 5090 Laptop / mobile (5090M) — sm_120, what’s in my Olares One
- RTX 5070 Ti / 5080 desktop — also sm_120
- DGX Spark / GB10 — sm_121
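Not sure which bucket your card falls in? A quick check (assumes a working torch + CUDA install):

```python
# Print this GPU's SM version. RTX 50-series desktop/mobile parts report
# (12, 0) -> sm_120; GB10 reports (12, 1) -> sm_121.
import torch

major, minor = torch.cuda.get_device_capability()
print(f"sm_{major}{minor}")
```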
Anyone who’s tried to deploy an NVFP4-quantized model (post-Q4) on these parts through stock vLLM’s FlashInfer MoE path has likely seen either kernel-selection errors or, more insidiously, output that “works” but is nonsense — the repetition / gibberish loop pattern.
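If you want a cheap smoke test for that failure mode, something like the following catches the obvious loops. The n-gram window and threshold are arbitrary choices of mine, not a principled detector:

```python
# Crude repetition detector: flags text where one short n-gram dominates.
# Window and threshold values are arbitrary; tune to taste.
from collections import Counter


def looks_degenerate(text: str, ngram: int = 4, threshold: float = 0.3) -> bool:
    words = text.split()
    if len(words) < ngram * 2:
        return False
    grams = [tuple(words[i:i + ngram]) for i in range(len(words) - ngram + 1)]
    _, top_count = Counter(grams).most_common(1)[0]
    return top_count / len(grams) >= threshold


print(looks_degenerate("the cat sat " * 20))  # True: one 4-gram dominates
```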
## What this concretely blocks
For my Olares One stack specifically:
- `sudo-0x2a/Qwen3.6-27B-NVFP4-GPTQ` — I tested this last night on stock `vllm/vllm-openai:tokenspeed-preview-x86_64-ubuntu2404` (vLLM 0.20.2). It boots and produces 31.79 t/s steady-state with no spec decoding. Why no FP4 errors? Because Qwen3.6-27B isn’t an MoE — the dense linear layers use FlashInfer’s `FlashInferCutlassNvFp4LinearKernel`, which apparently does work on SM120 (just slowly). The cubin gap bites only on the MoE GEMM path that #40082 is trying to introduce.
- `nvidia/Gemma-4-26B-A4B-NVFP4` and similar MoE NVFP4 quants — these would hit the cubin gap directly. Until NVIDIA ships SM120 cubins (or until vLLM/FlashInfer add a Triton fallback path that doesn’t break), running these on consumer Blackwell with stock FlashInfer is a dice roll on output quality.
- vLLM TurboQuant + NVFP4 KV cache (PR #42345 “Support excluding SWA layers from NVFP4 KV cache”) — same arch dependency.
## What’s actually working on SM120 today
Tested live on Olares One in the last 24h:
| Path | Status |
|---|---|
| AWQ-4bit dense via vLLM tokenspeed-preview | ✅ 214-235 t/s on Gemma 4 26B-A4B (cyankiwi AWQ) |
| AWQ-4bit MoE + DFlash spec decode | ✅ same image, same numbers |
| Compressed-tensors WNA16 Marlin (Lorbus AutoRound INT4) | ✅ 88 t/s on Qwen3.6-27B (Genesis-patched) |
| NVFP4 dense via FlashInfer | ✅ 31.79 t/s (works but slow, no MoE path) |
| NVFP4 MoE via FlashInfer | ❌ wrong-arch fallback per AethoceSora’s report |
| llama.cpp Q4_K_M / IQ | ✅ 64-77 t/s on Qwen3.6-27B + MTP at 262K |
## Recommendation while we wait
If you’re running consumer Blackwell (5070 Ti / 5080 / 5090 / 5090M):
- Don’t trust NVFP4 MoE quants on stock vLLM with FlashInfer until NVIDIA ships SM120 cubins OR vLLM lands a guard like the one AethoceSora recommends in #40082.
- Compressed-tensors WNA16-Marlin (AWQ-4bit) works great — that’s the path I’m shipping in production for Gemma 4 26B-A4B (cyankiwi) and Qwen3.6-27B (Lorbus AutoRound INT4 via Genesis); see the sketch after this list.
- llama.cpp Q4_K_M / Q3_K_XL / Q2_K_XL with MTP speculative decoding are mature and fast — see my Qwen3.6-27B MTP OOM fix post for the current optimal stack.
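For the AWQ path, a minimal offline smoke test via vLLM’s Python API looks roughly like this. The model id is a placeholder (this post doesn’t pin exact HF repo names), and I leave `quantization` unset since vLLM normally detects it from the checkpoint config:

```python
# Minimal offline smoke test for the AWQ/Marlin path on SM120.
# "your-org/gemma-4-26b-a4b-awq" is a placeholder, not a real repo id.
from vllm import LLM, SamplingParams

llm = LLM(
    model="your-org/gemma-4-26b-a4b-awq",  # placeholder: point at your AWQ quant
    max_model_len=8192,
    gpu_memory_utilization=0.95,
)
out = llm.generate(
    ["Explain what an SM120 cubin is in one sentence."],
    SamplingParams(max_tokens=128, temperature=0.7),
)
print(out[0].outputs[0].text)
```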
## What I’d like NVIDIA to ship
NVIDIA: please add SM120a / SM121a cubins to the next FlashInfer release. The consumer Blackwell parts are the largest installed base of Blackwell GPUs today — locking out FP4 MoE on these while shipping it for B200/B100 is a missed opportunity.
In the meantime, the PR #40082 review thread is the best place to watch the upstream resolution. AethoceSora’s recommendation to “land with warning guard” so users at least get a clear error instead of silent garbage output is the right short-term move.
## Olares One reproduction (FlashInfer NVFP4 dense WORKS on SM120)
If you want to confirm the dense NVFP4 path works on your own SM120 hardware:
```bash
# note: -p 8000:8000 publishes the API port so you can query it from the host
docker run --gpus all --rm \
  -p 8000:8000 \
  -v $(pwd)/models:/models \
  vllm/vllm-openai:tokenspeed-preview-x86_64-ubuntu2404 \
  vllm serve sudo-0x2a/Qwen3.6-27B-NVFP4-GPTQ \
    --served-model-name qwen3.6-27b-nvfp4 \
    --host 0.0.0.0 --port 8000 \
    --max-model-len 28000 \
    --gpu-memory-utilization 0.95 \
    --max-num-seqs 1 --max-num-batched-tokens 4096 \
    --kv-cache-dtype fp8 --trust-remote-code \
    --download-dir /models
```
Expect ~31 t/s on an RTX 5090M with the above. If you see “degenerate repetition” output instead, you’re hitting a wrong-arch kernel — likely a different MoE path or KV-cache config. Try without `--kv-cache-dtype fp8` to isolate.
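Once the container is up, a quick way to eyeball the output for the repetition pattern (the endpoint shape is the standard vLLM OpenAI-compatible API; the prompt is arbitrary):

```python
# Query the OpenAI-compatible endpoint the container exposes and print the
# completion; looping/gibberish text here means a wrong-arch kernel fired.
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "qwen3.6-27b-nvfp4",  # matches --served-model-name above
        "prompt": "Briefly describe the RTX 5090 Laptop GPU.",
        "max_tokens": 128,
        "temperature": 0.7,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["text"])
```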
Full stack details + Helm chart in olares-one-market.