
# mtp

All posts tagged "mtp".

MTP merged in llama.cpp master — and the n_max default change everyone missed (86.7% accept on Qwen3.6 27B Blackwell mobile)

21.05.2026

MTP support was merged into llama.cpp master on May 16th. Five days later, three follow-up PRs quietly changed how MTP behaves, including the spec-draft-n-max default flipping from 16 to 3. On Olares One (RTX 5090M, sm_120), that change plus NVIDIA's backend-sampling rewrite (#23287) pushed Qwen3.6 27B MTP draft acceptance from 64% to 86.7%, a gain of 22.7 points. Nobody is talking about this.
Read →
Qwen3.6-27B + MTP CUDA OOM at 262K context on 24GB — fixed by dropping one UD quant tier

12.05.2026

A user hit a reproducible runtime CUDA OOM in MTP draft on my Qwen3.6-27B v1.0.5 chart at 262K context. The model boots fine, but the draft scales beyond the static estimate and the process exits with code 139 in common_speculative_state_mtp draft. Fixed by dropping from havenoammo UD-Q3_K_XL (14.9 GB) to UD-Q2_K_XL (12.3 GB). A direct bench validates v1.0.7 at a stable 72.14 t/s, with the full 262K context and no OOM. Plus a side experiment: can we drop the Genesis patches by switching to NVFP4? Answer: no.
Read →
The story of the day I broke my Qwen3.6 ceiling — not with code, but with a stranger's name

09.05.2026

I spent a whole day trying to push my Qwen3.6 27B on Olares past 65 t/s. Custom builds, experimental forks, merges that crash. And then, late in the evening, during a desperate Hugging Face search, I ran into a name: havenoammo. Five minutes later, 77 t/s on a 262K context. The story of a day chasing an answer that was waiting one click away.
Read →
A week of benches on the Olares One: Gemma 4 MTP, Lucebox regression, vLLM no-Genesis hitting the workspace lock

08.05.2026

From May 5 to May 8, 2026, I benched everything that fit on a 24GB RTX 5090M. Three findings: Gemma 4 MTP via vLLM lands at 178 t/s 24h after the merge, Lucebox v1.9.0 mysteriously regresses from 88 to 69 t/s, and vLLM no-Genesis validates PR #39931 but stalls on P65/P22/P38. Plus housekeeping: 8 Qwen3.6 27B apps → 2.
Read →
Gemma 4 E4B MTP on the RTX 5090M: 178 t/s, 24h after the vLLM upstream merge

08.05.2026

On May 6 at 14:39 UTC, lucianommartins merges PR #41745 into vLLM main: native support for Gemma 4 Multi-Token Prediction drafters. On May 7 at 06:13 UTC the nightly Docker drops. At 06:35 UTC, my Olares One hits 178.6 t/s with 77.3% acceptance — first public Gemma 4 MTP bench on consumer mobile Blackwell.
Read →
Drop the 28 Genesis patches on vLLM? Vanilla bench: 88 → 72.5 t/s, here's why

06.05.2026

PR #39931 (TurboQuant hybrid) merged into vLLM main yesterday morning. I tested on Olares One with ZERO Genesis patches, on the vanilla image vllm/vllm-openai:gemma4-0505-cu130. Verdict: 72.55 t/s with --enforce-eager (vs. the 88 t/s Genesis baseline, i.e. -17.5%). Bonus: we ran into two HAMi/CUDA-graph bugs again, plus issue #40807, which is already in the upstream pipe.
Read →
