# vllm

All posts tagged "vllm".

Gemma 4 26B Vision at 250 t/s — vLLM v0.21 closed the gap with my text-only champion

23.05.2026

Two days ago I shipped Qwen 3.6 35B-A3B MTP at 249 t/s on Olares One. Text-only, but the new champion. Today the same hardware runs Gemma 4 26B at 250 t/s with vision and tool calling. The unlock: vLLM v0.21 quietly merged the official Google Gemma 4 MTP drafter. No more 5-fast/4-slow cycle bug from DFlash. No more 135 t/s no-spec fallback. Just full speed, plus images.
Read →

166 t/s on Nemotron-Labs 30B-A3B NVFP4 — the new fastest LLM on Olares One, hidden behind one CUDA-graph flag

21.05.2026

NVIDIA released Nemotron-Labs Elastic 30B-A3B with native NVFP4 quantization two weeks ago. On Olares One (RTX 5090M consumer mobile sm_120, 24 GB), vLLM's default config OOMs at load. With one CUDA-graph flag set right — PIECEWISE mode and explicit capture_sizes [1,2,4] — the model boots and runs at 165.91 t/s. That's +22% over Gemma 4, +55% over BeeLlama on Qwen3.6 27B, +124% over my MTP-master build. New champion.
Read →

Gemma 4 26B-A4B vision via vLLM — 135 t/s at 128K for an office workhorse on 24 GB

15.05.2026

An Olares One peer user shared a Discord patch to restore vision on the gemma426ba4bone chart. 24 hours later, I shipped a vLLM variant hitting 135 t/s at 128K context — and the same user validated it in production. The story of a community-driven engineering loop, four llama.cpp configs benched in parallel, and the moment turbo3 KV stopped being the answer.
Read →

NVIDIA shipped FlashInfer 0.6.11 with zero SM120/121 cubins — consumer Blackwell FP4 MoE is dead-on-arrival in vLLM until they patch this

12.05.2026

An 8-node DGX Spark cluster bringup of vLLM PR
Read →

Gemma 4 26B-A4B + DFlash on 24GB Blackwell mobile — n_spec=8 optimal, +5% over default, plus a weird degradation cycle

11.05.2026

Full num_speculative_tokens sweep for Gemma 4 26B-A4B + z-lab DFlash drafter on RTX 5090M Laptop (24GB sm_120). Optimal is n_spec=8 (not n=15 like desktop). I also found a 100% reproducible vLLM degradation cycle that I couldn't fix from config alone.
Read →

A week of benches on the Olares One: Gemma 4 MTP, Lucebox regression, vLLM no-Genesis hitting the workspace lock

08.05.2026

From May 5 to May 8, 2026, I benched everything that fit on a 24GB RTX 5090M. Three findings: Gemma 4 MTP via vLLM lands at 178 t/s 24h after merge, Lucebox v1.9.0 mysteriously regresses from 88 to 69 t/s, vLLM no-Genesis validates PR #39931 but stalls on P65/P22/P38. Plus housekeeping: 8 Qwen3.6 27B apps → 2.
Read →
