# moe
All posts tagged "moe".
249 t/s on Qwen3.6 35B-A3B MTP — the bigger model that runs faster than everything smaller
I posted yesterday about Nemotron-Labs Elastic 30B-A3B NVFP4 hitting 166 t/s on Olares One, then 182 t/s once vLLM #40082 landed. It was a new record, and the post's headline called it the 'fastest LLM on Olares One'. Less than 12 hours later, that record sits in second place: Qwen3.6 35B-A3B MTP runs at 249 t/s on the same hardware. A bigger model, and 37% faster than the 182 t/s result. Here's what's going on.
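As a quick sanity check on the quoted speedup, here is a minimal sketch that recomputes the percentage from the throughput figures in the teaser. The helper name `speedup_percent` is hypothetical and only used for illustration; the three t/s values are the ones quoted above.

```python
def speedup_percent(new_tps: float, old_tps: float) -> float:
    """Relative throughput gain of new_tps over old_tps, in percent."""
    if old_tps <= 0:
        raise ValueError("old_tps must be positive")
    return (new_tps / old_tps - 1.0) * 100.0


# Throughput figures quoted in the teaser (tokens per second).
NEMOTRON_INITIAL_TPS = 166.0   # first Nemotron-Labs result
NEMOTRON_PATCHED_TPS = 182.0   # after vLLM #40082 landed
QWEN_MTP_TPS = 249.0           # Qwen3.6 35B-A3B MTP

if __name__ == "__main__":
    vs_patched = speedup_percent(QWEN_MTP_TPS, NEMOTRON_PATCHED_TPS)
    vs_initial = speedup_percent(QWEN_MTP_TPS, NEMOTRON_INITIAL_TPS)
    print(f"Qwen MTP vs Nemotron (182 t/s): +{vs_patched:.1f}%")
    print(f"Qwen MTP vs Nemotron (166 t/s): +{vs_initial:.1f}%")
```

The +37% figure corresponds to the comparison against the patched 182 t/s result; measured against the initial 166 t/s run, the gap is larger.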
Read →
166 t/s on Nemotron-Labs 30B-A3B NVFP4 — the new fastest LLM on Olares One, hidden behind one CUDA-graph flag
NVIDIA released Nemotron-Labs Elastic 30B-A3B with native NVFP4 quantization two weeks ago. On Olares One (RTX 5090M consumer mobile sm_120, 24 GB), vLLM's default config OOMs at load. With one CUDA-graph flag set right — PIECEWISE mode and explicit capture_sizes [1,2,4] — the model boots and runs at 165.91 t/s. That's +22% over Gemma 4, +55% over BeeLlama on Qwen3.6 27B, +124% over my MTP-master build. New champion.
Read →