Vision unlocked on Qwen3.6 35B-A3B MTP — 243 t/s + 262K context + image input via spiritbuun's --mmproj-gpu-swap

Three days ago I shipped Qwen3.6 35B-A3B MTP at 249 t/s on Olares One. Text-only, but the new champion.

Yesterday I shipped Gemma 4 26B at 250 t/s with vision and tool calling.

Today the Qwen champion also gets vision.

Same 24 GB GPU. Same model file. 243 t/s text + 262K context + working image input + Hermes Agent ready, all on one endpoint.

The unlock isn’t another model. It’s a 1-commit feature that spiritbuun merged into their llama.cpp fork on May 22, called --mmproj-gpu-swap. I missed it the first time. Today I went back through the news, saw it, built the image yesterday by coincidence, and tonight everything clicked.

The problem we had

Qwen3.6 35B-A3B is natively multimodal — Alibaba ships an mmproj file for it. So in theory we can just add --mmproj mmproj-BF16.gguf to llama-server and have vision.

In practice on 24 GB consumer Blackwell mobile, it doesn’t fit. I tried in v1.0.4 and v1.0.5 of my Olares app:

Model: 17.2 GB (Q3_K_XL)
mmproj BF16: 1.0 GB
KV cache @ 262K q4_0: 3.2 GB
MTP draft compute buffer @ ubatch 2048: ~4 GB
Total: ~25 GB needed, 23.4 GB usable

OOM at the first decode request. The vision encoder needs --ubatch-size ≥ image_max_tokens (≈2048 for Qwen3.6), and the MTP compute buffer at ubatch 2048 is the killer. I tried reducing context to 64K to make room. Still OOM’d. v1.0.6 was a full revert: drop the mmproj, drop the vision, go back to 250 t/s text-only champion.

That sat as memory qwen36-a3b-mtp-vision-fails in my notes. “Stuck at text-only champion. Use the BeeLlama 27B vision sibling or Gemma vision sibling for image use cases.”

For two days, that was the verdict.

What spiritbuun figured out

spiritbuun/buun-llama-cpp is a downstream fork of ggml-org/llama.cpp that ships speed and feature experiments. They sit between upstream and Anbeeld’s BeeLlama (which sits on top of spiritbuun). On May 22 at 16:46 UTC, spiritbuun merged commit 8e64d7a:

add --mmproj-gpu-swap: temporarily swap MTP↔mmproj VRAM for vision requests

The README documentation for it (commit 316e88e, pushed 46 minutes later) is the cleanest description I’ve seen of a one-paragraph engineering insight:

On VRAM-constrained GPUs, MTP speculative decoding and the vision encoder (mmproj) may not fit in VRAM simultaneously. For example, Qwen3.6-27B Q6_K + MTP uses ~22.6 GiB on a 24 GiB RTX 3090, leaving no room for mmproj’s ~1.1 GiB GPU footprint. --mmproj-gpu-swap solves this by keeping mmproj on CPU at startup, then temporarily swapping MTP out of VRAM when an image arrives, loading mmproj to GPU for fast encoding (~1-2s instead of 30-60s on CPU), and swapping back afterward. MTP has no persistent state, so the swap is lossless.

The trick is that the two things competing for VRAM are never used at the same time. When you’re decoding text, you don’t need the vision encoder. When you’re processing an image, you don’t need the MTP draft heads. So keep one on GPU, one on CPU, and swap based on what the current request needs.

The lossless part matters too. MTP draft heads have no persistent state between requests — they’re just weight matrices that run alongside the target model. So tearing them down from VRAM and rebuilding them later loses nothing.

What I shipped

I built the spiritbuun image yesterday as part of an unrelated experiment (testing whether the spiritbuun-direct path would beat the Anbeeld-layer path on the DFlash variant — it didn’t, but the image was sitting on disk). The image is aamsellem/buun-llama-cpp:316e88e. Built for amd64+CUDA13+sm_120 from spiritbuun HEAD at the commit that adds --mmproj-gpu-swap.

Bench config:

--model Qwen3.6-35B-A3B-UD-Q3_K_XL.gguf
--mmproj mmproj-BF16.gguf
--mmproj-gpu-swap
--spec-type draft-mtp --spec-draft-n-max 3
--cache-type-k q4_0 --cache-type-v q4_0
--parallel 1 --flash-attn on --jinja

No --ctx-size, no --batch-size, no --ubatch-size. With --mmproj-gpu-swap engaged, the server runs auto-fit — it computes how much VRAM is available, accounting for the swap headroom, and picks the maximum context that fits.

It picked 262144 (262K full native). That’s the entire window the model supports. Compared to the 64K I had to settle for in the failed v1.0.4 attempt, that’s a 4× context bump from a single flag.

The bench

10 back-to-back runs of Space Invaders HTML completion, 2500 tokens each, single user:

run1: 243.44 t/s  | run6:  241.62 t/s
run2: 242.52 t/s  | run7:  237.60 t/s
run3: 242.79 t/s  | run8:  247.18 t/s
run4: 251.39 t/s  | run9:  243.64 t/s
run5: 244.69 t/s  | run10: 241.01 t/s

AVG 243.59 t/s. Range 13.79. σ ≈ 4.

vs the v1.0.6 text-only baseline (250.54 t/s) that was my reference for “fastest local LLM on Olares One”: -2.8%.

I’ll take that all day. The 7 t/s I lost is the bookkeeping cost of running with --mmproj-gpu-swap always engaged even when no image is being processed. In exchange I get vision support, the full 262K context window, and a unified endpoint that handles every modality.

Then the vision test. 64×64 solid red PNG, “What single color is this image? One word only.”:

elapsed: 1.13s
response: "Red"

Server logs from that request:

srv  swap_mtp_to_: swapping MTP out, loading mmproj to GPU...
srv  swap_mtp_to_: MTP→mmproj swap done in 213 ms
srv  swap_mmproj_: unloading mmproj from GPU, recreating MTP...
srv  swap_mmproj_: mmproj→MTP swap done in 540 ms

213 ms in + 540 ms out = ~750 ms total swap overhead. Encoding the actual image is fast once mmproj is on GPU. End-to-end 1.13s for the full round trip on a 64×64 image including HTTP, swap-in, encode, generate, swap-out.

The new leaderboard

Stack	t/s	Context	Vision	Tool calling
Qwen3.6 35B-A3B MTP + Vision (v2.0.0, today)	243.59	262K	✓	✓
Gemma 4 26B Vision MTP (v1.0.5, yesterday)	250.54	128K	✓	✓
Qwen3.6 35B-A3B MTP text-only (v1.0.7)	250.54	262K	✗	✓
BeeLlama Qwen 3.6 27B Vision	106.43	200K	✓	✓
Gemma 4 26B no-spec (v1.0.0 baseline)	135.97	128K	✓	✓

For agents that mix text + image + tool calling, the Qwen variant now wins:

Slightly slower than the Gemma vision endpoint (243 vs 250 t/s)
But 2× more context (262K vs 128K)
And Qwen 3.6’s BFCL/MCPMark scores are higher than Gemma 4’s for tool calling reliability

For pure text agents that don’t need vision, the text-only Qwen variant still has the throughput edge (250 vs 243). I’ll keep both shipped — pick by use case.

Why this matters beyond Olares One

Consumer Blackwell mobile is one of the most VRAM-constrained interesting targets out there. 24 GB on a single GPU is enough for a 17 GB MoE-A3B model + KV + MTP, but not quite enough for mmproj on top. The --mmproj-gpu-swap trick is exactly the kind of last-mile engineering that turns “almost fits” into “fits comfortably with headroom”.

It also works on any other 24 GB consumer card. RTX 3090, RTX 4090, RTX 5090 desktop — all the same problem, all the same fix. If you’re running Qwen 3.6 or any MoE-A3B with MTP on a 24 GB card and want vision too, this flag is your unlock.

The whole approach generalizes. Once you accept that MTP draft heads have no persistent state and can be torn down + rebuilt, you can think of other VRAM-resident things that could swap in for occasional requests:

Could swap MTP for a different drafter on certain prompt patterns
Could swap mmproj for an audio encoder when audio input arrives
Could swap one expert mixture for another when domain hint is given

I’d be curious to see how far this idea goes. For now, the vision swap is the immediate win.

What I shipped

llamacppqwen36a3bone v1.0.7 → v2.0.0 on my Olares market source.

Major version bump because vision support is a fundamentally new capability — same app, different category. Studio Market now shows it under both “LLM Chat” and “Vision” filters.

Image: aamsellem/buun-llama-cpp:316e88e
Target: unsloth/Qwen3.6-35B-A3B-MTP-GGUF UD-Q3_K_XL (17.2 GB)
Vision encoder: mmproj-BF16.gguf from same repo (~900 MB)
Args: --mmproj-gpu-swap --spec-type draft-mtp --spec-draft-n-max 3 --cache-type-k q4_0 --cache-type-v q4_0 --parallel 1 --flash-attn on --jinja
Context: auto-fit (resolves to 262K full native)

Pull from https://orales-one-market.aamsellem.workers.dev if you have Olares One.

Coda

I keep saying this and it keeps being true: the hardware is fixed, the software keeps eating the problem. Same 24 GB RTX 5090M. Same Qwen 3.6 35B-A3B model file on disk. Three days ago: 250 t/s text-only champion. Yesterday: Gemma champion adds vision. Today: Qwen champion adds vision. None of these required me to write a single line of CUDA — community contributors did the work upstream and downstream. I just have to keep redeploying.

Next: I’m watching when Anbeeld rebases their BeeLlama fork on top of the new spiritbuun commit. That would stack --mmproj-gpu-swap with Anbeeld’s adaptive DFlash drafter control + Hermes-friendly polish. If that lands, the BeeLlama Vision sibling could potentially beat both endpoints on the speed axis. We’ll see.

The problem we had

What spiritbuun figured out

What I shipped

The bench

The new leaderboard

Why this matters beyond Olares One

What I shipped

Coda

Comments