# paged-attention

All posts tagged "paged-attention".

Gemma 4 26B-A4B vision via vLLM — 135 t/s at 128K for an office workhorse on 24 GB

15.05.2026

A fellow Olares One user shared a patch on Discord to restore vision on the gemma426ba4bone chart. 24 hours later, I shipped a vLLM variant hitting 135 t/s at 128K context, and the same user validated it in production. This is the story of a community-driven engineering loop, four llama.cpp configs benchmarked in parallel, and the moment turbo3 KV stopped being the answer.
Read →