# vllm

All posts tagged "vllm".

Gemma 4 E4B MTP on the RTX 5090M: 178 t/s, 24h after the vLLM upstream merge

08.05.2026

On May 6 at 14:39 UTC, lucianommartins merges PR #41745 into vLLM main: native support for Gemma 4 Multi-Token Prediction drafters. On May 7 at 06:13 UTC the nightly Docker drops. At 06:35 UTC, my Olares One hits 178.6 t/s with 77.3% acceptance — first public Gemma 4 MTP bench on consumer mobile Blackwell.
Read →
Drop the 28 Genesis patches on vLLM? Vanilla bench: 88 → 72.5 t/s, here's why

06.05.2026

PR #39931 (TurboQuant hybrid) merged into vLLM main yesterday morning. I tested on the Olares One with zero Genesis patches, using the vanilla image vllm/vllm-openai:gemma4-0505-cu130. Verdict: 72.55 t/s with --enforce-eager, versus 88 t/s for the Genesis baseline (-17.5%). Bonus: we ran into two HAMi/CUDA-graph bugs again, plus issue #40807, which is already in the upstream pipeline.
Read →
My personal Olares Market — 28 apps hand-tuned for the Olares One, one click away

04.05.2026

A custom Olares Market hand-tuned for the RTX 5090M of the Olares One. 28 ready-to-install apps: llama.cpp, vLLM, DFlash, Voxtral ASR/TTS, vision, music. How to add it to your device in 30 seconds.
Read →
Why DFlash on Qwen3.6-27B doesn't fit on a 24GB single GPU

28.04.2026

Three paths tested (z-lab BF16, AEON-7 NVFP4, Lucebox custom). All need ≥26 GB. VRAM math, honest negatives, what to wait for on 24GB.
Read →
Genesis on consumer Blackwell — TurboQuant unlocked for Qwen3.6-27B on 24GB

28.04.2026

Sandermage Genesis patches validated on RTX 5090M (sm_120). TurboQuant 4-bit + MTP n=3 on Qwen3.6-27B → 60 t/s, 100K context, 177K KV tokens.
Read →
Qwen3.6-27B at 85-100 t/s on a 24GB RTX 5090 Laptop GPU

26.04.2026

Adapting the 32GB desktop and 24GB Ampere recipes to a 24GB Blackwell consumer mobile (sm_120) GPU. Custom vLLM image, AutoRound INT4, MTP n=3 — sustained 85-100 t/s with 75K context.
Read →
