Archive
All the posts.
Everything I've tested, tuned, benched or discovered. Latest to oldest.
-
166 t/s on Nemotron-Labs 30B-A3B NVFP4 — the new fastest LLM on Olares One, hidden behind one CUDA-graph flag
NVIDIA released Nemotron-Labs Elastic 30B-A3B with native NVFP4 quantization two weeks ago. On Olares One (RTX 5090M consumer mobile sm_120, 24 GB), vLLM's default config OOMs at load. With one CUDA-graph flag set right — PIECEWISE mode and explicit capture_sizes [1,2,4] — the model boots and runs at 165.91 t/s. That's +22% over Gemma 4, +55% over BeeLlama on Qwen3.6 27B, +124% over my MTP-master build. New champion.
Read → -
MTP merged in llama.cpp master — and the n_max default change everyone missed (86.7% accept on Qwen3.6 27B Blackwell mobile)
MTP support was merged into llama.cpp master on May 16th. Five days later, three follow-up PRs quietly changed how MTP behaves — including the spec-draft-n-max default flipping from 16 to 3. On Olares One (RTX 5090M sm_120), that change plus NVIDIA's backend-sampling rewrite (#23287) pushed Qwen3.6 27B MTP from 64% to 86.7% draft acceptance. +22 points. Nobody is talking about this.
Read → -
Gemma 4 26B-A4B vision via vLLM — 135 t/s at 128K for an office workhorse on 24 GB
An Olares One peer user shared a Discord patch to restore vision on the gemma426ba4bone chart. 24 hours later, I shipped a vLLM variant hitting 135 t/s at 128K context — and the same user validated it in production. The story of a community-driven engineering loop, four llama.cpp configs benched in parallel, and the moment turbo3 KV stopped being the answer.
Read → -
BeeLlama Qwen3.6 27B with vision — 106 t/s at 200K on consumer Blackwell mobile
Follow-up to last night's BeeLlama text-only 262K post: I added the mmproj vision projector to Qwen3.6 27B, expected a perf hit, and got a counter-intuitive surprise. BeeLlama supports vision + DFlash spec decoding together (which crashes on Gemma 4). And 200K context outperforms 128K by 4.4%. First public sm_120 BeeLlama vision bench.
Read → -
BeeLlama tested on Olares One — 107 t/s at 262K full, +48% over my best path
Last week on r/LocalLLaMA, a post claimed 135 t/s on Qwen3.6 27B Q5 with 200K context on a single RTX 3090, via a fork called BeeLlama.cpp. Ridiculous if true: my best path on Olares One topped out at 88 t/s. I tested it. Spoiler: 107 t/s at 262K full, zero OOM, zero degradation. +48% over my fastest path. The story of a qemu build and three apps in my catalog made obsolete in one night.
Read → -
NVIDIA shipped FlashInfer 0.6.11 with zero SM120/121 cubins — consumer Blackwell FP4 MoE is dead-on-arrival in vLLM until they patch this
An 8-node DGX Spark cluster bringup of vLLM PR
Read →