Hi there.
I wasn’t planning to write tonight.
It was past 11pm, my Cuisinart was brewing tea I wouldn’t have the courage to drink, and I was tidying up notes from a day I wanted to forget quickly. You know those days where you feel like you’ve made a lot of moves and stayed exactly in the same place? That kind of day.
In the morning I’d told myself: today, I break the ceiling. My Qwen3.6 27B Long Context app on Olares was running at 65 tokens per second on a 128K context. Honest, but not great. And on Reddit, for a few weeks now, there had been this benchmark haunting me: 80 t/s on 262K, doable on a 24GB 4090. Same VRAM as my 5090M Laptop. So in theory, doable on my side too. Except in real life, I was at 65 t/s on half the context. Frustrating.
Morning
Coffee, terminal, I attack the most promising thing on paper: NVFP4 models, which exploit Blackwell’s new native tensor cores. That’s what NVIDIA pushes for the current gen. I find a Qwen3.6 variant that weighs 20 GB in NVFP4. Should fit in my 24 GB, plus the MTP drafter, plus the KV cache.
I download, I deploy. The target loads fine. The external drafter (3.5 GB in BF16) loads next. And there it is, OOM. I have 1.9 GiB free, the drafter wants 2.4. Five hundred megabytes short.
Fine. I drop memory utilization to 0.92, I reload. OOM. I switch to --enforce-eager to skip the CUDA graphs and claw back their memory. OOM. I test max-model-len at 8K to shrink the buffers. OOM, every time, at the same point.
The problem is simple arithmetic: the NVFP4 target eats 20.96 GiB of VRAM (PyTorch overhead included), which leaves 1.9 GiB for everything else, and the drafter alone wants 2.4. No point continuing.
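For the record, the whole dead end fits in a few lines of arithmetic. The numbers are the ones from my logs; the “overhead” line is just whatever the subtraction leaves over (CUDA context, display, allocator headroom), not an exact accounting.

```python
# VRAM budget for the NVFP4 attempt, numbers straight from the logs.
total_vram = 24.0      # GiB on the card
target = 20.96         # GiB: NVFP4 target weights + PyTorch overhead
reported_free = 1.9    # GiB the runtime still has available
drafter_need = 2.4     # GiB for the external BF16 MTP drafter

overhead = total_vram - target - reported_free   # ~1.1 GiB lost elsewhere
deficit = drafter_need - reported_free           # ~0.5 GiB short

print(f"overhead: {overhead:.2f} GiB, deficit: {deficit:.2f} GiB")
# No flag combination claws back half a GiB from weights that have to
# stay resident; only a smaller target would.
```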
Noon
Plan B: build my own llama.cpp image. On GitHub I’d spotted a fork that officially combines the two things I’m missing — the new MTP technique being merged into upstream llama.cpp, and TurboQuant quantization that compresses the KV cache. On paper, that’s exactly the Reddit recipe I dream of.
I clone, I write a Dockerfile, I fire up buildx in amd64 emulation on my Mac. And there’s the first lesson of the day: a CUDA build under emulation is slow. Really slow. 45 minutes to compile the kernels. While that runs, I re-read my notes, I code something else, I think about my weekend to-do list.
The binary compiles without errors. I push the image to Docker Hub. I deploy the pod. I run my standard bench — generate a Space Invaders in HTML, 2,000 tokens, three consecutive passes.
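For the curious, that bench is nothing fancier than a loop like this. Not my exact script, just a minimal sketch that assumes an OpenAI-compatible endpoint on port 8000 and whatever model name the server registers.

```python
# Three consecutive passes of the same prompt, tokens/second computed
# from the server's own usage counters.
import time
import requests

URL = "http://localhost:8000/v1/chat/completions"
PROMPT = "Write a complete Space Invaders clone in a single HTML file."

for run in range(1, 4):
    t0 = time.time()
    resp = requests.post(URL, json={
        "model": "qwen3.6-27b",   # placeholder: use the name your server reports
        "messages": [{"role": "user", "content": PROMPT}],
        "max_tokens": 2000,
        "temperature": 0.7,
    }, timeout=600).json()
    elapsed = time.time() - t0
    generated = resp["usage"]["completion_tokens"]
    print(f"run {run}: {generated} tokens in {elapsed:.1f}s "
          f"-> {generated / elapsed:.1f} t/s")
```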
First run: 53 t/s. Slower than my current version. OK, a bit disappointing but not absurd — maybe a cold start. Second run: 55 t/s. OK. Third run.
The third run takes 171 seconds for 2,000 tokens. 11 t/s. I look at the logs: MTP acceptance has dropped to 0%. The drafter completely collapsed. The fork worked on the RTX 4090s where it was developed, but on consumer mobile Blackwell something in the Mamba cache degrades as the session goes on. A silent regression, invisible if you only run a single bench.
I clean up. 45 minutes of build, two hours of troubleshooting, and I’m back to square one. The bonus is the precious knowledge that this won’t work, which closes an avenue but doesn’t open another.
Afternoon
I try other things. Genesis (the custom vLLM stack from Sandermage I keep mentioning here) with another KV cache format, 3-bit lossless instead of 4-bit: a layernorm bug in the FLA code, the bug is upstream and no fix has shipped today. I bump memory utilization to 0.98: I’m short by 20 megabytes. Twenty. Twenty megs out of 24 gigs. Sigh.
I test num_speculative_tokens=6 instead of 3 because a Windows build claims it gives 158 t/s on a desktop RTX 5090. Failure: Qwen3.6 has a single MTP layer, reusing it six times tanks acceptance — it’s literally written in a vLLM warning I hadn’t taken seriously.
By 5pm I’ve exhausted my technical ideas for the day. I decide to do things properly and post two upstream comments — one on the llama.cpp MTP PR, one on the buun-llama-cpp fork issue (another experimental fork we’ve already crossed paths with on this blog) — to flag the consumer Blackwell regression and push for a fix. It’s a consolation prize. At least I’ve contributed something, even if I haven’t gained anything performance-wise.
Evening
I was tidying up my notes to update my agent’s memory (yes, I give my dev tools their own persistent memory; it’s become a habit now that they keep memory files), and I told myself: while I’m at it, I’ll re-read the HuggingFace finds of the week.
And there, in a list of models I’d skimmed three days earlier, a name: havenoammo.
A repo soberly named Qwen3.6-27B-MTP-UD-GGUF. Publication date: two days earlier. I click, figuring I might as well get hurt one last time before closing the laptop.
Three things in the readme made my ears prick up.
First, the size. The Q3_K_XL version weighs 15 GB. My current target weighs 17 GB. That 2 GB difference is exactly the VRAM I was missing to extend my context. That alone deserves attention.
Then the “Unsloth Dynamic” tag. I knew the name; I’d never dug into what it actually means. I click through to the Unsloth page, I read. And then I understand what I’d never really understood: Unsloth Dynamic isn’t “a smaller Q3”. It’s a per-layer precision mix. The layers that matter most for quality — attention, embedding, the head that produces tokens, the MTP head that drives the speedup in speculative decoding — stay at 6 or 8 bits. Only the dense FFN layers, which account for the bulk of the parameters and are the most redundant, drop to 3 bits. The average is “Q3”, but internally it’s anything but uniform.
For my use case that means the MTP head — the one that drives my speedup — stays at high precision, whereas it was 4-bit in my previous target. (I’ll show right after the third point how to check the per-tensor mix in the file itself.)
Third thing: no public bench on consumer Blackwell for this repo. Nobody seems to have tested it in combination with MTP on sm_120. A statistical void. Whether it works or breaks, I’ll be the first to know.
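And because I’d rather not take a readme’s word for it, the gguf package (pip install gguf) can list the quant type of every tensor in the file. The filename below is my guess at the Q3_K_XL file, and the grouping by tensor name is just my own heuristic, not an official taxonomy.

```python
# Count tensors per (group, quant type) to see the per-layer precision mix.
from collections import Counter
from gguf import GGUFReader

reader = GGUFReader("Qwen3.6-27B-MTP-UD-Q3_K_XL.gguf")  # adjust to your local path

groups = Counter()
for tensor in reader.tensors:
    if "ffn" in tensor.name:
        group = "ffn"
    elif "attn" in tensor.name:
        group = "attention"
    else:
        group = "other (embeddings, output head, MTP head, ...)"
    groups[(group, tensor.tensor_type.name)] += 1

for (group, qtype), count in sorted(groups.items()):
    print(f"{group:45s} {qtype:8s} x{count}")
```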
I copy the URL, I tweak one line in my Helm chart. I replace my current GGUF path with the path to havenoammo’s Q3_K_XL. I push the context to 262144. I save. I deploy.
The moment
The pod boots in two minutes. The model loads, llama.cpp reads the MTP head, registers it, configures the draft head. I run my standard bench.
I wait, the terminal spins. The first prompt generates its Space Invaders. 25 seconds for 2,000 tokens.
80 tokens per second.
I read the number three times. It’s higher than anything I’ve seen today. It’s the Reddit bar. I expect a regression on the second run, like earlier with the custom fork — the moment when the drafter collapses and the magic disappears. I rerun.
Run 2: 75 t/s. Run 3: 76 t/s. No collapse. No degradation. The acceptance rate in the logs holds at 75-80% across the three runs.
I check that it really runs on the full 262K context. Yes, fully native. It’s in the logs, it’s in the config. Not a hidden 128K fallback.
I look at the clock. 11:47pm. The day just pivoted in five minutes.
Why it works
I’ll try to explain simply what’s happening, because it matters for understanding that this isn’t a fluke.
When you do speculative decoding with MTP, the drafter proposes tokens ahead. The more accurate its proposals, the faster you generate — because the target only has to validate in parallel instead of generating one by one. If acceptance is 50%, the speedup factor is limited. If it climbs to 80%, you nearly double the speed.
But acceptance depends directly on the quality of the drafter’s predictions. And prediction quality depends on the precision the MTP head was quantized at. In a standard 4-bit quant, the MTP head sits at the same precision as the rest of the model: 4 bits. That’s enough for 64% acceptance, which is what I had.
In Unsloth Dynamic Q3_K_XL, the average target is smaller (3-bit for the FFNs), but the MTP head is at 6 or 8 bits — therefore more precise than in my old Q4 target. The drafter “sees more clearly”. Acceptance climbs to 75-80%.
The counterintuitive effect is that a smaller average quant gives a more accurate drafter, because what matters isn’t the average but the precision of the sub-component that drives the speedup.
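If you want to see that sensitivity in numbers, here is a deliberately crude model. It assumes per-token acceptance is independent, that verifying a short draft costs roughly one target pass, and that the drafter itself is free; none of that is exactly true, but the shape is right.

```python
# Expected tokens emitted per target verification step: the accepted
# draft tokens plus the target's own bonus token (the standard
# speculative-decoding formula under an i.i.d. acceptance assumption).
def expected_tokens_per_step(acceptance: float, draft_len: int) -> float:
    a, k = acceptance, draft_len
    return (1 - a ** (k + 1)) / (1 - a)

for acc in (0.50, 0.64, 0.80):
    e = expected_tokens_per_step(acc, draft_len=3)
    print(f"acceptance {acc:.0%}: ~{e:.2f} tokens per verification step")

# 50% -> ~1.9, 64% -> ~2.3, 80% -> ~3.0. The jump from ~64% to ~80%
# acceptance is in the same ballpark as the 65 -> 80 t/s I was chasing.
```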
The other gift is the 2 GB of VRAM saved on the average target. That 2 GB is the extra KV cache I was missing to go from 128K to 262K. So extended context AND improved speedup at the same time. The two things I’d been trying to get separately for weeks.
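To give a feel for why 2 GB is roughly the gap between 128K and 262K, here is a back-of-the-envelope KV-cache calculator. The dimensions in it are placeholders, not the real Qwen3.6 architecture (the model mixes attention with Mamba layers, so only part of the stack keeps a full KV cache); the useful part is the linear scaling, not the exact figures.

```python
# KV cache size: K and V, for every attention layer, kv head and token.
def kv_gib(ctx_tokens: int, attn_layers: int, kv_heads: int,
           head_dim: int, bytes_per_elem: float) -> float:
    per_token = 2 * attn_layers * kv_heads * head_dim * bytes_per_elem
    return ctx_tokens * per_token / 1024**3

# Placeholder dimensions; a q4_0 cache stores ~4.5 bits per element
# once block scales are counted.
for ctx in (131072, 262144):
    size = kv_gib(ctx, attn_layers=16, kv_heads=8,
                  head_dim=128, bytes_per_elem=4.5 / 8)
    print(f"{ctx:>7} tokens -> {size:.2f} GiB of KV cache")
# With these made-up dimensions the delta between the two contexts is
# about 2.3 GiB, the same order as the VRAM the smaller quant freed up.
```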
The honesty on quality
I won’t sell you a dream: Q3 is more aggressive than Q4. On standard quality benchmarks, perplexity degrades by 1 to 3% versus Q4_K_M. That’s measurable.
Unsloth Dynamic mitigates a lot of it, because the layers that drive quality stay at high precision. For chat or code, the difference is invisible in everyday use. For competition math or very precise reasoning, there’s a Q4_K_XL variant (18 GB) that’s safer — at the cost of falling back to a 128-160K context.
While I’m writing these lines, I’ve kicked off a parallel bench on that Q4_K_XL version. If it lives up to its promise without sacrificing too much speedup, I’ll publish a follow-up on the exact trade-off.
The actual lesson
For eight hours, I tried to attack the problem through code. Build a custom image, debug kernels, merge forks, read upstream PRs. All the technical complexity I know well.
And the solution was to look at the data. A HuggingFace page I’d skimmed three days earlier without stopping. A name I didn’t know — havenoammo, I haven’t even managed to figure out exactly who that is, just a username — who’d done very clean quantization work with an approach that solves my problem exactly.
The reflex I’m going to try to build: before touching code, search HuggingFace with the right terms. UD, Dynamic, MTP-preserved, Heretic. Check whether someone has already solved my problem on the data side rather than the runtime side. Because in this ecosystem there are a lot of quiet people doing extremely precise work, and they don’t make noise on Twitter or Reddit.
This time it was havenoammo. A name I didn’t know twenty-four hours ago. Someone who breaks a ceiling no runtime optimization had managed to break on this hardware. And who reminds me that in 2026, in this open-source ecosystem, the real levers are sometimes elsewhere than where you look by default.
How to install it
If you have my market source on Olares:
Olares Studio → Market → Settings → Add source
https://orales-one-market.aamsellem.workers.dev
The Qwen3.6 27B Long Context app v1.0.5 is in the AI category. First launch downloads the 15 GB GGUF from HuggingFace (HF token required in your Olares settings to pull). OpenAI-compatible endpoint on port 8000 once booted.
For folks on something other than Olares but with 24 GB of VRAM, the essential combo: havenoammo/Qwen3.6-27B-MTP-UD-GGUF + the Q3_K_XL file + any llama.cpp build with MTP support + --ctx-size 262144 --cache-type-k q4_0 --spec-type mtp --spec-draft-n-max 5. That reproduces the essentials.
See you very soon with the follow-up on the Q4_K_XL version.