# vllm
All posts tagged "vllm".
-
Gemma 4 E4B MTP on the RTX 5090M: 178 t/s, 24h after the vLLM upstream merge
On May 6 at 14:39 UTC, lucianommartins merges PR #41745 into vLLM main: native support for Gemma 4 Multi-Token Prediction drafters. On May 7 at 06:13 UTC the nightly Docker image drops. At 06:35 UTC, my Olares One hits 178.6 t/s with 77.3% acceptance, the first public Gemma 4 MTP bench on consumer mobile Blackwell.
Read → -
Drop the 28 Genesis patches on vLLM? Vanilla bench: 88 → 72.5 t/s, here's why
PR #39931 (TurboQuant hybrid) was merged into vLLM main yesterday morning. I tested it on the Olares One with ZERO Genesis patches, using the vanilla image vllm/vllm-openai:gemma4-0505-cu130. Verdict: 72.55 t/s with --enforce-eager, versus 88 t/s for the Genesis baseline (about -17.5%). Bonus: we ran into two HAMi/CUDA-graph bugs again, plus issue #40807, which is already in the upstream pipeline.
Read → -
My personal Olares Market — 28 apps hand-tuned for the Olares One, one click away
A custom Olares Market hand-tuned for the RTX 5090M of the Olares One. 28 ready-to-install apps: llama.cpp, vLLM, DFlash, Voxtral ASR/TTS, vision, music. How to add it to your device in 30 seconds.
Read → -
Why DFlash on Qwen3.6-27B doesn't fit on a 24GB single GPU
Three paths tested (z-lab BF16, AEON-7 NVFP4, Lucebox custom). All need ≥26 GB of VRAM. The VRAM math, the honest negative results, and what to wait for on 24GB.
Read → -
Genesis on consumer Blackwell — TurboQuant unlocked for Qwen3.6-27B on 24GB
Sandermage Genesis patches validated on RTX 5090M (sm_120). TurboQuant 4-bit + MTP n=3 on Qwen3.6-27B → 60 t/s, 100K context, 177K KV tokens.
Read → -
Qwen3.6-27B at 85-100 t/s on a 24GB RTX 5090 Laptop GPU
Adapting the 32GB desktop and 24GB Ampere recipes to a 24GB Blackwell consumer mobile (sm_120) GPU. Custom vLLM image, AutoRound INT4, MTP n=3 — sustained 85-100 t/s with 75K context.
Read →