166 t/s on Nemotron-Labs 30B-A3B NVFP4 — the new fastest LLM on Olares One, hidden behind one CUDA-graph flag
NVIDIA released Nemotron-Labs Elastic 30B-A3B with native NVFP4 quantization two weeks ago. On Olares One (RTX 5090M, a consumer mobile sm_120 GPU with 24 GB), vLLM's default config runs out of memory at load. With one CUDA-graph setting changed (PIECEWISE mode with explicit capture_sizes [1,2,4]), the model boots and runs at 165.91 t/s. That is +22% over Gemma 4, +55% over BeeLlama on Qwen3.6 27B, and +124% over my MTP-master build. New champion.
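For readers who want to see roughly what that setting looks like, here is a minimal sketch that only assembles the launch command, without importing vLLM. It assumes a recent vLLM release where `--compilation-config` accepts JSON and the keys are spelled `cudagraph_mode` and `cudagraph_capture_sizes`; check the key names against your installed version. The model identifier is a hypothetical placeholder, and both helper functions are illustrative names, not part of vLLM.

```python
import json
import shlex

# Hypothetical placeholder: substitute the real model identifier or local path.
MODEL_ID = "<nemotron-labs-30b-a3b-nvfp4>"


def build_compilation_config(capture_sizes):
    """Build a compilation-config dict for piecewise CUDA graphs.

    Key names are an assumption based on recent vLLM releases.
    """
    # Deduplicate and sort so the config is stable regardless of input order.
    sizes = sorted(set(int(s) for s in capture_sizes))
    if not sizes or sizes[0] < 1:
        raise ValueError("capture sizes must be positive integers")
    return {
        "cudagraph_mode": "PIECEWISE",
        "cudagraph_capture_sizes": sizes,
    }


def build_serve_command(model_id, capture_sizes):
    """Return the argv list for a `vllm serve` launch with that config."""
    config_json = json.dumps(
        build_compilation_config(capture_sizes), separators=(",", ":")
    )
    return ["vllm", "serve", model_id, "--compilation-config", config_json]


if __name__ == "__main__":
    # Capture graphs only for batch sizes 1, 2 and 4, as described above.
    print(shlex.join(build_serve_command(MODEL_ID, [1, 2, 4])))
```

The likely reason a short capture list helps on a 24 GB card is that every captured batch size reserves memory at startup, so capturing three sizes instead of the default range leaves more room for weights and KV cache. The trade-off is that batches larger than 4 would not be covered by a captured graph.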