Field notes · Local · Claude · Agents
AI for developers.
Hi there. This is where I write about what AI actually gives a working developer. My findings, my tests, the way I use it day to day. No marketing reviews, no Twitter hot takes — just things I actually try, integrate into the code, and keep or discard with eyes open.
Read the posts
My rig · Local LLMs
All local. Zero cloud.
The box I run on every day: an Olares One (RTX 5090M 24 GB, 96 GB DDR5). Spoiler — I genuinely recommend it. It's what I picked specifically to get a serious GPU at home, and it does the job. So when I publish local-inference numbers, this is the rig behind them: llama.cpp tuned to the bone, vLLM with speculative decoding, Qwen3.6 at 100 t/s. No third-party API, no quota that drops you mid-session, no prompt landing in a training set. You keep the keys, the bill stops at electricity.
The harness · Agents · MCP · Tools
Cloud. And the toolchain.
Claude Code, Cursor, persistent agents, MCP servers, validation hooks, prompts that hold up for a month. The real dev loop with AI in the editor — what to keep, what to throw out, how to plug it into a real codebase without everything falling apart at the first serious refactor.
My numbers, my rig
My own numbers.
Everything I write here, I measured myself: tokens per second, latency, VRAM use, prompt-processing time, MTP acceptance rate, cost per API call. No benchmark thrown onto Twitter without the command behind it, no "they say it's fast". If I publish a number, you'll find the exact stack to reproduce it. Promise.
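Concretely, a throughput number starts life as something like the run below. It's a sketch with a placeholder model file, not a command lifted from a specific post: llama.cpp's llama-bench reports prompt-processing and generation throughput separately, and nvidia-smi covers the VRAM column.

```sh
# Sketch of a typical measurement run (placeholder model path).
# llama-bench reports prompt processing (pp) and generation (tg) in t/s.
llama-bench -m ~/models/qwen3.6-27b-q4_k_m.gguf -ngl 99 -p 512 -n 128

# VRAM use during the run, sampled once per second in another terminal:
nvidia-smi --query-gpu=memory.used --format=csv -l 1
```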
Posts, right below
On to the posts.
Scroll on — the latest posts are waiting. If a finding saves you time, even better. If something feels off, tell me and I'll fix it. That's what an open blog is for.
Featured
Pinned.
- Lucebox on Olares One — Episode 1: 134 t/s on RTX 3090, what about my rig?
  You're scrolling r/LocalLLaMA and you see a post claiming 134 t/s on Qwen3.6-27B with an RTX 3090, thanks to Lucebox. Of course you want to try it on your Olares One. Spoiler: it'll take 12 hours of compile time and 6 Docker builds. Episode 1.
  Read →
- Why I picked an Olares One to run my LLMs
  The actual decision process. Why not a Mac Studio, not a custom GPU PC, not cloud — and why an Olares One won out for a working dad who also wants to run local LLMs.
  Read →
- Genesis on consumer Blackwell — TurboQuant unlocked for Qwen3.6-27B on 24GB
  Sandermage Genesis patches validated on RTX 5090M (sm_120). TurboQuant 4-bit + MTP n=3 on Qwen3.6-27B → 60 t/s, 100K context, 177K KV tokens.
  Read →
- Qwen3.6-27B at 85-100 t/s on a 24GB RTX 5090 Laptop GPU
  Adapting the 32GB desktop and 24GB Ampere recipes to a 24GB Blackwell consumer mobile GPU (sm_120). Custom vLLM image, AutoRound INT4, MTP n=3 — sustained 85-100 t/s with 75K context.
  Read →
This week
Latest posts.
- Lucebox on Olares One — Episode 7: Issue #187, PR #188, and 6 hooks fixed in one go
  The bug is identified: 6 hooks in HAMi-core ignore the return value of cuCtxGetDevice (the pattern is sketched in code right after this list). The fix is 50 lines. But for the entire HAMi community to benefit, it has to go upstream. Here's how that played out.
  Read →
- Lucebox on Olares One — Episode 6: We read the HAMi-core source and find 6 bugs
  NO_VMM doesn't fix anything. The `Illegal device id` bug comes back on every run. Time to read the HAMi-core source. And what we find is not a single bug — it's a systemic pattern across 6 different hooks.
  Read →
- Lucebox on Olares One — Episode 5: The runtime slams the door with a negative device id
  Image pushed, pod deployed, models downloaded. Everything is ready. Then HAMi vGPU dumps `Illegal device id: -644371744` on every boot, with a random number that changes each run. Smells like an uninitialized stack read from a mile away.
  Read →
- Lucebox on Olares One — Episode 4: The llama-server submodule serves it up to you an hour later
  test_dflash compiles, great. But to serve over HTTP I need llama-server, which is built from a submodule. And the submodule has its own cmake invocation — the one where I forgot to add -rpath-link. And boom, an hour later, here we go again.
  Read →
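Episodes 5 through 7 all orbit one bug pattern, so here's a compact illustration of its shape before you dive in. This is my own sketch, not HAMi-core's actual source: a CUDA driver-API hook declares the device id on the stack, discards the CUresult from cuCtxGetDevice, and then happily uses whatever garbage sat in that stack slot when the call fails.

```c
/* Illustrative sketch of the failure pattern, not HAMi-core's real code.
 * Build with: cc sketch.c -lcuda */
#include <cuda.h>
#include <stdio.h>

/* When cuCtxGetDevice fails (e.g. no current context yet), `dev` keeps
 * whatever was on the stack, hence ids like -644371744. */
static void buggy_hook(void) {
    CUdevice dev;                          /* uninitialized stack slot  */
    cuCtxGetDevice(&dev);                  /* CUresult silently dropped */
    printf("device id: %d\n", (int)dev);   /* may print stack garbage   */
}

/* The boring fix: check the result before touching the out-parameter. */
static void fixed_hook(void) {
    CUdevice dev;
    CUresult rc = cuCtxGetDevice(&dev);
    if (rc != CUDA_SUCCESS) {
        fprintf(stderr, "cuCtxGetDevice failed: %d\n", (int)rc);
        return;
    }
    printf("device id: %d\n", (int)dev);
}

int main(void) {
    /* Deliberately no cuInit/context here, so the call fails: buggy_hook
     * may print garbage while fixed_hook reports the error cleanly. */
    buggy_hook();
    fixed_hook();
    return 0;
}
```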