Field notes · Local · Claude · Agents
AI for developers.
Hi there. This is where I write about what AI actually gives a working developer: my findings, my tests, the way I use it day-to-day. No marketing review, no Twitter hot take, just things I actually try, integrate into real code, and keep or discard with eyes open.
Read the posts
My rig · Local LLMs
All local. Zero cloud.
The box I run on every day: an Olares One (RTX 5090M 24 GB, 96 GB DDR5). Spoiler — I genuinely recommend it. It's what I picked specifically to get a serious GPU at home, and it does the job. So when I publish local-inference numbers, this is the rig behind them: llama.cpp tuned to the bone, vLLM with speculative decoding, Qwen3.6 at 100 t/s. No third-party API, no quota that drops you mid-session, no prompt landing in a training set. You keep the keys, the bill stops at electricity.
The harness · Agents · MCP · Tools
Cloud. And the toolchain.
Claude Code, Cursor, persistent agents, MCP servers, validation hooks, prompts that hold for a month. The real dev loop with AI in the editor: what to keep, what to throw away, how to plug it into a real codebase without everything falling apart at the first serious refactor.
My numbers, my rig
My own numbers.
Everything I write here, I measured myself: tokens per second, latency, VRAM use, prompt time, MTP acceptance rate, cost per API call. No bench thrown on Twitter without the command behind it, no "they say it's fast". If I publish a number, you'll find the exact stack to reproduce it. Promise.
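For readers who want to sanity-check figures like these, here is a minimal sketch of how throughput, draft acceptance rate, and a relative speedup can be derived from raw counters. The `RunStats` class, its field names, and the sample values are hypothetical and only for illustration; they are not the output format of llama.cpp or vLLM, and the posts themselves carry the actual commands.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    """Raw counters from one generation run (hypothetical field names)."""
    generated_tokens: int      # tokens produced during decoding
    decode_seconds: float      # wall-clock time spent decoding
    drafted_tokens: int = 0    # tokens proposed by the draft / MTP head
    accepted_tokens: int = 0   # drafted tokens the main model kept


def tokens_per_second(stats: RunStats) -> float:
    """Decode throughput in tokens per second."""
    if stats.decode_seconds <= 0:
        raise ValueError("decode_seconds must be positive")
    return stats.generated_tokens / stats.decode_seconds


def acceptance_rate(stats: RunStats) -> float:
    """Fraction of drafted tokens accepted (0.0 when nothing was drafted)."""
    if stats.drafted_tokens == 0:
        return 0.0
    return stats.accepted_tokens / stats.drafted_tokens


def percent_change(baseline: float, new: float) -> float:
    """Relative change from baseline to new, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline * 100.0


if __name__ == "__main__":
    # Placeholder counters, not a real measurement.
    run = RunStats(generated_tokens=512, decode_seconds=5.0,
                   drafted_tokens=300, accepted_tokens=260)
    print(f"{tokens_per_second(run):.1f} t/s")
    print(f"{acceptance_rate(run):.1%} draft acceptance")
    # 102 t/s -> 170 t/s is the before/after pair quoted in the Gemma 4 12B teaser.
    print(f"{percent_change(102.0, 170.0):+.0f}% vs baseline")
```

Keeping the raw counters next to each derived number makes it easy to recompute a claim when only the headline figure is quoted.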
Posts, right below
On to the posts.
Scroll on — the latest posts are waiting. If a finding saves you time, even better. If something feels off, tell me and I'll fix it. That's what an open blog is for.
Featured
Pinned.
-
Gemma 4 audio E4B hits 288 t/s — the second upstream merge closes the family
Yesterday I shipped Gemma 4 12B at 170 t/s via the upstream PR #23398 merge. Today PR #24282 (the E2B/E4B counterpart) merged. Custom rebuild, chart swap, bench: Gemma 4 audio E4B jumps from 47 t/s to 288 t/s. 6.1x speedup on the same hardware in 5 minutes of config. With a flash-attention trap on the way — the combo Gemma 4 E4B + audio mmproj + MTP draft crashes the CUDA flash attention kernel, no-FA fallback unlocks everything.
Read →
-
Gemma 4 12B hits 170 t/s — upstream merge buys +67% speed for free
Two days ago I shipped Gemma 4 12B QAT at 102 t/s on Olares One. Today I ship 170 t/s. Same hardware. Same model file. Same drafter. Same context. The delta: am17an's PR #23398 (Gemma 4 MTP support) merged into llama.cpp upstream at 12:50 UTC. My custom image — a snapshot of the WIP branch at commit dd97604 — was missing 10+ polish commits that ggerganov forced in review. +67% speed on the exact same setup, just by rebasing. Bonus: critical insight on Olares One's nvidia driver capping CUDA at 13.1, blocking the whole upstream Docker ecosystem.
Read →
-
Gemma 4 12B QAT lands — +17% speed, −39% VRAM, 65K context on 24 GB consumer Blackwell
Google released the QAT (Quantization-Aware Training) variants of Gemma 4 today at 1pm UTC. Three hours later, Olares One is running on them. On the 12B: 102.78 t/s vs 87.5 baseline = +17.4% speed. 8.6 GB VRAM vs ~14 GB = −39%. Context 32K → 65K with margin to spare. Tool calling intact, vision intact (modulo an mmproj gotcha I explain below).
Read →
-
Vision unlocked on Qwen3.6 35B-A3B MTP — 243 t/s + 262K context + image input via spiritbuun's --mmproj-gpu-swap
Three days ago I shipped Qwen3.6 35B-A3B MTP at 249 t/s text-only on Olares One — the new champion. Yesterday I shipped Gemma 4 26B at 250 t/s with vision. Today the Qwen champion gets vision too. Same 24 GB GPU. Same model file. The unlock: spiritbuun merged a feature called --mmproj-gpu-swap on May 22 that hot-swaps MTP and the vision encoder in VRAM on-demand. Trade-off: -2.8% text throughput, +full vision support, +4× more context vs my v1.0.5 vision attempt.
Read →
-
Gemma 4 26B Vision at 250 t/s — vLLM v0.21 closed the gap with my text-only champion
Two days ago I shipped Qwen 3.6 35B-A3B MTP at 249 t/s on Olares One. Text-only, but the new champion. Today the same hardware runs Gemma 4 26B at 250 t/s with vision and tool calling. The unlock: vLLM v0.21 quietly merged the official Google Gemma 4 MTP drafter. No more 5-fast/4-slow cycle bug from DFlash. No more 135 t/s no-spec fallback. Just full speed, plus images.
Read →
-
249 t/s on Qwen3.6 35B-A3B MTP — the bigger model that runs faster than everything smaller
I posted yesterday about Nemotron-Labs Elastic 30B-A3B NVFP4 hitting 166 t/s on Olares One — then 182 once vLLM #40082 landed. New record. Headline of the post: 'fastest LLM on Olares One'. Less than 12 hours later, that record is now sitting in second place. Qwen3.6 35B-A3B MTP runs at 249 t/s on the same hardware. Bigger model, +37% faster. Here's what's going on.
Read →
-
166 t/s on Nemotron-Labs 30B-A3B NVFP4 — the new fastest LLM on Olares One, hidden behind one CUDA-graph flag
NVIDIA released Nemotron-Labs Elastic 30B-A3B with native NVFP4 quantization two weeks ago. On Olares One (RTX 5090M consumer mobile sm_120, 24 GB), vLLM's default config OOMs at load. With one CUDA-graph flag set right — PIECEWISE mode and explicit capture_sizes [1,2,4] — the model boots and runs at 165.91 t/s. That's +22% over Gemma 4, +55% over BeeLlama on Qwen3.6 27B, +124% over my MTP-master build. New champion.
Read →
-
MTP merged in llama.cpp master — and the n_max default change everyone missed (86.7% accept on Qwen3.6 27B Blackwell mobile)
MTP support was merged into llama.cpp master on May 16th. Five days later, three follow-up PRs quietly changed how MTP behaves — including the spec-draft-n-max default flipping from 16 to 3. On Olares One (RTX 5090M sm_120), that change plus NVIDIA's backend-sampling rewrite (#23287) pushed Qwen3.6 27B MTP from 64% to 86.7% draft acceptance. +22 points. Nobody is talking about this.
Read →
-
Gemma 4 26B-A4B vision via vLLM — 135 t/s at 128K for an office workhorse on 24 GB
A fellow Olares One user shared a patch on Discord to restore vision on the gemma426ba4bone chart. 24 hours later, I shipped a vLLM variant hitting 135 t/s at 128K context, and the same user validated it in production. The story of a community-driven engineering loop, four llama.cpp configs benched in parallel, and the moment turbo3 KV stopped being the answer.
Read →
-
BeeLlama Qwen3.6 27B with vision — 106 t/s at 200K on consumer Blackwell mobile
Followup to last night's BeeLlama text-only 262K post — added the mmproj vision projector to Qwen3.6 27B, expected a perf hit, got a counter-intuitive surprise. BeeLlama supports vision + DFlash spec decoding together (which crashes on Gemma 4). And 200K context outperforms 128K by 4.4%. First public sm_120 BeeLlama vision bench.
Read →
-
BeeLlama tested on Olares One — 107 t/s at 262K full, +48% over my best path
Last week on r/LocalLLaMA, a post claimed 135 t/s on Qwen3.6 27B Q5 + 200K context on a single RTX 3090, via a fork called BeeLlama.cpp. Ridiculous if true: my best path on Olares One topped out at 88. I tested it. Spoiler: 107 t/s at 262K full, zero OOM, zero degradation. +48% over my fastest path. The story of a qemu build and three apps in my catalog made obsolete in one night.
Read →
-
NVIDIA shipped FlashInfer 0.6.11 with zero SM120/121 cubins — consumer Blackwell FP4 MoE is dead-on-arrival in vLLM until they patch this
An 8-node DGX Spark cluster bringup of vLLM PR
Read →
-
Qwen3.6-27B + MTP CUDA OOM at 262K context on 24GB — fixed by dropping one UD quant tier
A user hit a reproducible runtime CUDA OOM in MTP draft on my Qwen3.6-27B v1.0.5 chart at 262K context. Boot fine, draft scales beyond static estimate, exit 139 in common_speculative_state_mtp draft. Fixed by dropping havenoammo UD-Q3_K_XL (14.9 GB) to UD-Q2_K_XL (12.3 GB). Direct bench validates v1.0.7 at 72.14 t/s stable, full 262K, no OOM. Plus a side experiment: can we drop Genesis patches by switching to NVFP4? Answer: no.
Read →
-
Gemma 4 26B-A4B + DFlash on 24GB Blackwell mobile — n_spec=8 optimal, +5% over default, plus a weird degradation cycle
Full num_speculative_tokens sweep for Gemma 4 26B-A4B + z-lab DFlash drafter on RTX 5090M Laptop (24GB sm_120). Optimal is n_spec=8 (not n=15 like desktop). I also found a 100% reproducible vLLM degradation cycle that I couldn't fix from config alone.
Read →
-
The story of the day I broke my Qwen3.6 ceiling — not with code, but with a stranger's name
I spent a whole day trying to push my Qwen3.6 27B on Olares past 65 t/s. Custom builds, experimental forks, merges that crash. And then late evening, in a desperate HuggingFace search, I run into a name: havenoammo. Five minutes later, 77 t/s on a 262K context. The story of a day chasing an answer that was waiting one click away.
Read →
-
A week of benches on the Olares One: Gemma 4 MTP, Lucebox regression, vLLM no-Genesis hitting the workspace lock
From May 5 to May 8, 2026, I benched everything that fit on a 24GB RTX 5090M. Three findings: Gemma 4 MTP via vLLM lands at 178 t/s 24h after merge, Lucebox v1.9.0 mysteriously regresses from 88 to 69 t/s, vLLM no-Genesis validates PR #39931 but stalls on P65/P22/P38. Plus housekeeping: 8 Qwen3.6 27B apps → 2.
Read →
-
Gemma 4 E4B MTP on the RTX 5090M: 178 t/s, 24h after the vLLM upstream merge
On May 6 at 14:39 UTC, lucianommartins merges PR #41745 into vLLM main: native support for Gemma 4 Multi-Token Prediction drafters. On May 7 at 06:13 UTC the nightly Docker drops. At 06:35 UTC, my Olares One hits 178.6 t/s with 77.3% acceptance — first public Gemma 4 MTP bench on consumer mobile Blackwell.
Read →
-
Drop the 28 Genesis patches on vLLM? Vanilla bench: 88 → 72.5 t/s, here's why
PR #39931 (TurboQuant hybrid) merged into vLLM main yesterday morning. I tested on Olares One with ZERO Genesis patches, vanilla image vllm/vllm-openai:gemma4-0505-cu130. Verdict: 72.55 t/s with --enforce-eager (vs 88 baseline Genesis = -17.5%). Bonus: we ran into two HAMi/CUDA-graph bugs again + issue #40807 already in the upstream pipe.
Read →
-
Qwen3.6-27B on upstream llama.cpp: +123% free with MTP, zero fork to maintain
MTP finally lands in llama.cpp upstream (PR #22673 by am17an, May 4). Bench on Olares One RTX 5090M sm_120: 78 t/s with an MTP-enabled GGUF, +123% vs baseline. No Lucebox, no Genesis, no permanent custom fork.
Read →
-
Lucebox on Olares One — Episode 8: seven days of waiting, one lib swapped by hand, 88.5 t/s
Seven days after my PR #188 to HAMi-core, still no review. The saga had its cliffhanger — I was waiting on someone else. Then a stupid idea: compile my patched lib and swap it myself. Three new bugs, one night, and at the end Lucebox hits 88.5 t/s. First llama.cpp-based path to pass vLLM Turbo on this hardware.
Read →
-
My personal Olares Market — 28 apps hand-tuned for the Olares One, one click away
A custom Olares Market hand-tuned for the RTX 5090M of the Olares One. 28 ready-to-install apps: llama.cpp, vLLM, DFlash, Voxtral ASR/TTS, vision, music. How to add it to your device in 30 seconds.
Read →
-
DFlash unblocked on 24GB consumer Blackwell — 80 t/s, 4 days after the "impossible" post
Four days ago I wrote that DFlash on 24GB consumer Blackwell didn't fit. On April 28, a dev publishes a quantized drafter. On April 30, I build, I test, I get 0.97 t/s. On May 1, after my issue, the dev fixes it in 24h. Tonight: 80 t/s. The story of a thesis that lasted 72 hours.
Read →
-
Lucebox on Olares One — Episode 1: 134 t/s on RTX 3090, what about my rig?
You're scrolling r/LocalLLaMA, you see a post claiming 134 t/s on Qwen3.6-27B with an RTX 3090 thanks to Lucebox. Of course you want to try it on your Olares One. Spoiler: it'll take 12 hours of compile time and 6 Docker builds. Episode 1.
Read →
-
Why I picked an Olares One to run my LLMs
The actual decision process. Why not a Mac Studio, not a custom GPU PC, not cloud — and why an Olares One won out for a working dad who also wants to run local LLMs.
Read →
-
Genesis on consumer Blackwell — TurboQuant unlocked for Qwen3.6-27B on 24GB
Sandermage Genesis patches validated on RTX 5090M (sm_120). TurboQuant 4-bit + MTP n=3 on Qwen3.6-27B → 60 t/s, 100K context, 177K KV tokens.
Read →
-
Qwen3.6-27B at 85-100 t/s on a 24GB RTX 5090 Laptop GPU
Adapting the 32GB desktop and 24GB Ampere recipes to a 24GB Blackwell consumer mobile (sm_120) GPU. Custom vLLM image, AutoRound INT4, MTP n=3 — sustained 85-100 t/s with 75K context.
Read →
This week
Latest posts.
-
Lucebox on Olares One — Episode 9: the PR that promised +57% and delivered +0.2%
Last night Lucebox crossed 88.5 t/s on Olares One and became the new champion. This morning PR #94 reports +57% on RTX 4090. If it scales, we hit 120 t/s. Spoiler: 88.7 t/s. Full DDTree sweep, three hypotheses, the honest lesson on upstream benches that don't reproduce.
Read →
-
Lucebox on Olares One — Episode 7: six HAMi hooks fixed upstream in one go
The bug is identified: 6 hooks in HAMi-core ignore the return value of cuCtxGetDevice. The fix is 50 lines. But for the entire HAMi community to benefit, it has to go upstream. Here's how that played out.
Read →
-
Lucebox on Olares One — Episode 6: We read the HAMi-core source and we find 6 bugs
NO_VMM doesn't fix anything. The `Illegal device id` bug comes back every run. Time to read the HAMi-core source. And what we find is not a single bug — it's a systemic pattern across 6 different hooks.
Read →
-
Lucebox on Olares One — Episode 5: The runtime slams the door with a negative device id
Image pushed, pod deployed, models downloaded. Everything is ready. Then HAMi vGPU dumps `Illegal device id: -644371744` on every boot, with a random number that changes each run. Smells like uninitialized stack from a mile away.
Read →