Aurélien AMSELLEM

NVIDIA shipped FlashInfer 0.6.11 with zero SM120/121 cubins — consumer Blackwell FP4 MoE is dead-on-arrival in vLLM until they patch this

An 8-node DGX Spark cluster bringup of vLLM PR #40082 surfaced the gap.

This morning’s review of vLLM PR #40082 “Integrate flashinfer b12x MoE + FP4 GEMM for SM120/121” surfaced a blocker that confirms what some of us have been quietly noticing on consumer Blackwell parts: NVIDIA shipped FlashInfer 0.6.11 without SM120/121 cubins.

User AethoceSora posted around 05:00 UTC after bringing up the PR on an 8-node DGX Spark cluster (the GB10 reference platform):

Two upstream blockers identified:

  1. CUTLASS DSL MLIR→PTX bug producing malformed `_mma` PTX.
  2. FlashInfer cubin gap: 0.6.11 ships only sm100a/sm100f/sm103a cubins and zero SM120/121 binaries, so it silently falls back to wrong-arch kernels and emits “degenerate repetition” output. Recommendation: land with a warning guard.
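A guard along the lines AethoceSora recommends could look like the sketch below. The function names and the supported-capability set are my assumptions for illustration, not the PR's actual code; the set mirrors the report that only sm100/sm103 cubins ship in 0.6.11.

```python
# Hypothetical sketch of the "warning guard" idea from the PR review:
# route away from the FlashInfer NVFP4 MoE path when the device's
# SM version has no matching cubins in the wheel.

# Per the report: sm100a/f and sm103a cubins exist; SM120/121 do not.
SUPPORTED_CUBIN_SMS = {(10, 0), (10, 3)}

def flashinfer_fp4_moe_supported(capability: tuple) -> bool:
    """True only when the installed FlashInfer ships cubins for this arch."""
    return tuple(capability) in SUPPORTED_CUBIN_SMS

def select_moe_backend(capability: tuple) -> str:
    """Pick a backend name; fall back loudly instead of silently."""
    if flashinfer_fp4_moe_supported(capability):
        return "flashinfer_nvfp4_moe"
    # SM120/121 (consumer Blackwell / GB10) falls through here today.
    print(f"warning: no FlashInfer FP4 MoE cubins for SM{capability[0]}{capability[1]}, "
          "using fallback path")
    return "fallback_marlin_wna16"

print(select_moe_backend((12, 0)))  # consumer Blackwell -> fallback
```

The point of the guard is the explicit warning: a clear message at backend-selection time beats garbage tokens at inference time.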

Why this matters

Anyone who’s tried to deploy an NVFP4-quantized model (post-Q4) on these parts through stock vLLM’s FlashInfer MoE path has likely seen either kernel-selection errors or, more insidiously, output that “works” but is nonsense: the repetition/gibberish loop pattern.
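That failure mode is easy to screen for mechanically. Here is a minimal, hypothetical n-gram diversity check of my own (the default window and threshold are my guesses; tune them for your outputs):

```python
def looks_degenerate(text: str, n: int = 4, threshold: float = 0.5) -> bool:
    """Flag text whose n-gram diversity collapses -- the classic
    wrong-arch-kernel repetition-loop signature."""
    words = text.split()
    if len(words) < 2 * n:
        return False  # too short to judge
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    # Healthy text keeps most n-grams unique; a loop reuses a tiny set.
    return len(set(ngrams)) / len(ngrams) < threshold

print(looks_degenerate("the model just keeps saying this " * 20))  # True
```

Running a check like this over a few sampled completions after any kernel/backend change is a cheap smoke test for the silent-garbage case.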

What this concretely blocks

For my Olares One stack specifically, this blocks the FlashInfer NVFP4 MoE serving path until the SM120 cubins ship.

What’s actually working on SM120 today

Tested live on Olares One in the last 24h:

| Path | Status |
| --- | --- |
| AWQ-4bit dense via vLLM tokenspeed-preview | ✅ 214–235 t/s on Gemma 4 26B-A4B (cyankiwi AWQ) |
| AWQ-4bit MoE + DFlash spec decode | ✅ same image, same numbers |
| Compressed-tensors WNA16 Marlin (Lorbus AutoRound INT4) | ✅ 88 t/s on Qwen3.6-27B (Genesis-patched) |
| NVFP4 dense via FlashInfer | ✅ 31.79 t/s (works, but slow; no MoE path) |
| NVFP4 MoE via FlashInfer | ❌ wrong-arch fallback per AethoceSora’s report |
| llama.cpp Q4_K_M / IQ | ✅ 64–77 t/s on Qwen3.6-27B + MTP at 262K |

Recommendation while we wait

If you’re running consumer Blackwell (5070 Ti / 5080 / 5090 / 5090M):

  1. Don’t trust NVFP4 MoE quants on stock vLLM with FlashInfer until NVIDIA ships SM120 cubins OR vLLM lands a guard like the one AethoceSora recommends in #40082.
  2. Compressed-tensors WNA16-Marlin (AWQ-4bit) works great — that’s the path I’m shipping in production for Gemma 4 26B-A4B (cyankiwi) and Qwen3.6-27B (Lorbus AutoRound INT4 via Genesis).
  3. llama.cpp Q4_K_M / Q3_K_XL / Q2_K_XL with MTP speculative decoding are mature and fast — see my Qwen3.6-27B MTP OOM fix post for the current optimal stack.

What I’d like NVIDIA to ship

NVIDIA: please add SM120a / SM121a cubins to the next FlashInfer release. The consumer Blackwell parts are the largest installed base of Blackwell GPUs today — locking out FP4 MoE on these while shipping it for B200/B100 is a missed opportunity.

In the meantime, the PR #40082 review thread is the best place to watch the upstream resolution. The PR author’s recommendation to “land with warning guard” so users at least get a clear error instead of silent garbage output is the right short-term move.

Olares One reproduction (FlashInfer NVFP4 dense WORKS on SM120)

If you want to confirm the dense NVFP4 path works on your own SM120 hardware:

```shell
docker run --gpus all --rm \
  -v "$(pwd)/models:/models" \
  vllm/vllm-openai:tokenspeed-preview-x86_64-ubuntu2404 \
  vllm serve sudo-0x2a/Qwen3.6-27B-NVFP4-GPTQ \
  --served-model-name qwen3.6-27b-nvfp4 \
  --host 0.0.0.0 --port 8000 \
  --max-model-len 28000 \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 1 --max-num-batched-tokens 4096 \
  --kv-cache-dtype fp8 --trust-remote-code \
  --download-dir /models
```

Expect ~31 t/s on an RTX 5090M with the above. If you see “degenerate repetition” output instead, you’re hitting a wrong-arch kernel, likely via a different MoE path or KV-cache config. Try again without --kv-cache-dtype fp8 to isolate.
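To check which side of the cubin gap your card sits on without launching anything, you can query the compute capability. This sketch shells out to `nvidia-smi` (the `compute_cap` query field exists in recent drivers); the classification strings are mine:

```python
import subprocess

def classify_cap(cap: str) -> str:
    """Map a compute-capability string to the FlashInfer 0.6.11 situation."""
    if cap in ("12.0", "12.1"):
        return "consumer Blackwell (SM120/121): FlashInfer NVFP4 MoE blocked"
    if cap.startswith("10."):
        return "datacenter Blackwell: FP4 MoE cubins present in FlashInfer 0.6.11"
    return f"compute capability {cap or 'unknown'}"

def detect_cap() -> str:
    """Return the first GPU's compute capability, or '' if unavailable."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        return out.splitlines()[0] if out else ""
    except (OSError, subprocess.CalledProcessError):
        return ""

print(classify_cap(detect_cap()))
```

On an Olares One / 5090-class part this should report SM120/121; on B200-class parts, the 10.x branch.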

Full stack details + Helm chart in olares-one-market.
