<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>AkitaOnRails.com</title><link>https://akitaonrails.github.io/</link><description>Fabio Akita's blog — tech, career, and assorted geek topics. English edition.</description><generator>Hugo -- gohugo.io</generator><language>en</language><lastBuildDate>Mon, 11 May 2026 23:00:49 GMT</lastBuildDate><atom:link href="https://akitaonrails.github.io/index.xml" rel="self" type="application/rss+xml"/><item><title>LLM Benchmarks: DeepSeek Unlocked! Use DeepClaude</title><link>https://akitaonrails.github.io/en/2026/05/04/llm-benchmarks-deepseek-unlocked-deepclaude/</link><guid isPermaLink="true">https://akitaonrails.github.io/en/2026/05/04/llm-benchmarks-deepseek-unlocked-deepclaude/</guid><pubDate>Mon, 04 May 2026 16:00:00 GMT</pubDate><description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Update (May 11, 2026)&lt;/strong&gt;: at least two more paths beyond DeepClaude are worth flagging.&lt;/p&gt;
&lt;p&gt;The first is &lt;a href="https://github.com/Hmbown/DeepSeek-TUI" target="_blank" rel="noopener"&gt;DeepSeek-TUI&lt;/a&gt; (repo at &lt;a href="https://github.com/Hmbown/DeepSeek-TUI" target="_blank" rel="noopener"&gt;Hmbown/DeepSeek-TUI&lt;/a&gt;). It&amp;rsquo;s a Rust-based coding agent with a Codex-style 13-crate workspace, Plan / Agent / YOLO modes, MCP integration, full 1M-token context, and streaming reasoning blocks. It talks straight to the DeepSeek API with proper &lt;code&gt;reasoning_content&lt;/code&gt; handling by design, so opencode&amp;rsquo;s bug never shows up. On provenance, the author is Hunter Bown and the project is listed in &lt;a href="https://github.com/deepseek-ai/awesome-deepseek-agent/blob/main/docs/deepseek-tui.md" target="_blank" rel="noopener"&gt;awesome-deepseek-agent&lt;/a&gt;, curated by DeepSeek themselves, which counts as an endorsement. Install via npm, cargo, homebrew, binary, or docker.&lt;/p&gt;</description><content:encoded><![CDATA[<blockquote>
  <p><strong>Update (May 11, 2026)</strong>: at least two more paths beyond DeepClaude are worth flagging.</p>
<p>The first is <a href="https://github.com/Hmbown/DeepSeek-TUI" target="_blank" rel="noopener">DeepSeek-TUI</a> (repo at <a href="https://github.com/Hmbown/DeepSeek-TUI" target="_blank" rel="noopener">Hmbown/DeepSeek-TUI</a>). It&rsquo;s a Rust-based coding agent with a Codex-style 13-crate workspace, Plan / Agent / YOLO modes, MCP integration, full 1M-token context, and streaming reasoning blocks. It talks straight to the DeepSeek API with proper <code>reasoning_content</code> handling by design, so opencode&rsquo;s bug never shows up. On provenance, the author is Hunter Bown and the project is listed in <a href="https://github.com/deepseek-ai/awesome-deepseek-agent/blob/main/docs/deepseek-tui.md" target="_blank" rel="noopener">awesome-deepseek-agent</a>, curated by DeepSeek themselves, which counts as an endorsement. Install via npm, cargo, homebrew, binary, or docker.</p>
<p>The second is that opencode now supports DeepSeek thinking mode via <a href="https://github.com/anomalyco/opencode/pull/24146" target="_blank" rel="noopener">PR #24146</a>, merged April 24, 2026. The fix preserves <code>reasoning_content</code> in turn history, but it needs manual configuration in <code>opencode.json</code>: add <code>interleaved: { field: &quot;reasoning_content&quot; }</code> to the DeepSeek V4 model config. Without that the bug stays, and the new names (<code>deepseek-v4-pro</code>, <code>deepseek-v4-flash</code>) aren&rsquo;t pre-configured with <code>interleaved</code> by default. When I ran Round 3 of the benchmark, that fix either hadn&rsquo;t landed yet or my config didn&rsquo;t have the flag set. For a future benchmark, it&rsquo;s worth re-testing via opencode with <code>interleaved</code> properly configured.</p>
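<p>For reference, the shape is roughly this. The <code>interleaved</code> key is the one from the PR; the surrounding provider/model nesting is my best guess at opencode&rsquo;s config layout, so treat it as a sketch and check the PR for the exact placement:</p>
<pre><code>{
  "provider": {
    "deepseek": {
      "models": {
        "deepseek-v4-pro": {
          "options": {
            "interleaved": { "field": "reasoning_content" }
          }
        }
      }
    }
  }
}</code></pre>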
<p>The rest of the post below still holds. DeepClaude remains the most ergonomic option for anyone using Claude Code as their primary harness. If you&rsquo;re already on opencode or prefer a dedicated TUI, alternatives are on the menu now.</p>

</blockquote>
<p>DeepSeek V4 Pro stopped being a lost cause in my coding benchmark. It used to be solo Tier B (69/100) and literally unmeasurable in any multi-agent scenario, because it kept hitting a protocol bug I documented across the last two posts. The good news: I found a path that unblocks the model, and it jumped out of limbo straight into <strong>Tier A at 89/100</strong>, behind only Opus 4.7 and GPT 5.4/5.5, right next to Kimi K2.6. I&rsquo;ll walk through how I got there, what <strong>DeepClaude</strong> is, how you can use it, and where this puts DeepSeek in the updated ranking.</p>
<h2>Recapping the previous posts<span class="hx:absolute hx:-mt-20" id="recapping-the-previous-posts"></span>
    <a href="#recapping-the-previous-posts" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>If you just got here, the context matters. This benchmark experiment kicked off in April, and the DeepSeek case has been unfolding across four articles:</p>
<ol>
<li>
<p><a href="/en/2026/04/05/testing-llms-open-source-and-commercial-can-anyone-beat-claude-opus/">Testing Open Source and Commercial LLMs — Who Can Beat Claude Opus?</a> (April 5). First cut, comparing ~20 models running a Rails 8 + RubyLLM + Hotwire + Tailwind + Minitest task. It defined the scenario and the base task I&rsquo;m still using today.</p>
</li>
<li>
<p><a href="/en/2026/04/18/llm-benchmarks-part-2-multi-model/">LLM Benchmarks Part 2: Worth Combining Multiple Models in the Same Project? Claude + GLM??</a> (April 18). First orchestration attempt: have a strong planner (Opus) call cheaper subagents (Kimi, Qwen, GLM, DeepSeek). Result: zero delegations across seven free-choice variants. Strong models would rather do everything themselves.</p>
</li>
<li>
<p><a href="/en/2026/04/24/llm-benchmarks-parte-3-deepseek-kimi-mimo/">LLM Coding Benchmark (May 2026): DeepSeek v4, Kimi v2.6, Grok 4.3, GPT 5.5</a> (April 24, updated in May). The canonical version with 24 models, standardized 0-100 rubric across 8 dimensions, A/B/C/D tiers. It&rsquo;s the current &ldquo;who&rsquo;s what&rdquo; reference.</p>
</li>
<li>
<p><a href="/en/2026/04/25/llm-benchmarks-vale-a-pena-misturar-2-modelos/">LLM Benchmarks: Is it worth ($) mixing 2 Models? (Planner + Executor)</a> (April 25). Three rounds of multi-model orchestration: free-choice, forced delegation, and manual cross-process orchestration. Short answer: for a cohesive Rails greenfield task, solo Opus 4.7 in opencode (97/100, $4, 18m) beats every combination. This was the post where I documented DeepSeek V4 Pro&rsquo;s protocol incompatibility with any ai-sdk-based harness.</p>
</li>
</ol>
<p>The DeepSeek story up to that point looked like this: solo in opencode delivers 69/100 Tier B (correct RubyLLM code, but with a stock README, no <code>docker-compose.yml</code>, no bundle-audit). In any multi-turn opencode scenario, the API rejects turn 2 with <code>&quot;reasoning_content must be passed back to the API&quot;</code>. The ai-sdk that opencode uses underneath strips <code>reasoning_content</code> while building the next request, and DeepSeek returns 400. Worse: opencode buries the error in the event stream and falls back to the <code>general</code> agent, which is Opus 4.7. The runs &ldquo;complete&rdquo; with files written, but most of the output came from Opus masquerading as DeepSeek. Score 69 reflects mixed authorship, not V4 Pro for real.</p>
<p>The takeaway that landed: DeepSeek V4 Pro was fundamentally incompatible with any ai-sdk harness (opencode included). To use it for real, you&rsquo;d need direct API with thinking-mode handling or a custom harness.</p>
<p>The discovery this week is that a custom harness already exists. It&rsquo;s called <strong>DeepClaude</strong>.</p>
<h2>What DeepClaude is<span class="hx:absolute hx:-mt-20" id="what-deepclaude-is"></span>
    <a href="#what-deepclaude-is" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p><a href="https://github.com/aattaran/deepclaude"target="_blank" rel="noopener">DeepClaude</a> is a shell shim for Claude Code (Anthropic&rsquo;s CLI) that swaps the endpoint it queries. Claude Code, under the hood, talks to <code>api.anthropic.com</code> and expects the Anthropic format (Messages API with <code>system</code>, <code>messages</code>, <code>tools</code>, etc.). DeepClaude sets a few environment variables before invoking Claude Code:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">ANTHROPIC_BASE_URL          <span class="c1"># alternative endpoint</span>
</span></span><span class="line"><span class="cl">ANTHROPIC_AUTH_TOKEN        <span class="c1"># token for the alternative endpoint</span>
</span></span><span class="line"><span class="cl">ANTHROPIC_DEFAULT_OPUS_MODEL    <span class="c1"># model to invoke instead of Opus</span>
</span></span><span class="line"><span class="cl">ANTHROPIC_DEFAULT_SONNET_MODEL  <span class="c1"># same for Sonnet</span>
</span></span><span class="line"><span class="cl">ANTHROPIC_DEFAULT_HAIKU_MODEL   <span class="c1"># same for Haiku</span>
</span></span><span class="line"><span class="cl">CLAUDE_CODE_SUBAGENT_MODEL  <span class="c1"># subagent model</span></span></span></code></pre></div></div>
</div>
<p>Supported backends:</p>
<ul>
<li><strong>DeepSeek directly</strong> via <code>api.deepseek.com/anthropic</code> (needs <code>DEEPSEEK_API_KEY</code>)</li>
<li><strong>OpenRouter</strong> via <code>openrouter.ai/api</code> (needs <code>OPENROUTER_API_KEY</code>)</li>
<li><strong>Fireworks AI</strong> via <code>api.fireworks.ai/inference</code> (needs <code>FIREWORKS_API_KEY</code>)</li>
<li><strong>Anthropic</strong> (no override, regular Claude Code)</li>
</ul>
<p>The clever bit is that Claude Code never realizes the swap. Its full agent loop (file editing, bash execution, subagent spawning, multi-step tool use, Anthropic prompt caching) runs the same way. It&rsquo;s just that every model call goes out via OpenRouter (or DeepSeek direct, or Fireworks) and lands on a model you picked. In our benchmark&rsquo;s case, DeepSeek V4 Pro now receives the traffic that would normally go to Opus 4.7.</p>
<p>Why does this fix the <code>reasoning_content</code> bug? Because the Anthropic-compatible endpoint the backend exposes (OpenRouter, in my runs) handles thinking content correctly in the format Claude Code emits. The ai-sdk opencode uses underneath didn&rsquo;t. Different harness, no bug.</p>
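<p>Concretely, the contract implied by that error message is that each assistant turn must round-trip its thinking block. A minimal sketch of a turn-2 request that satisfies it (field placement is my reading of the error; the exact wire shape is DeepSeek&rsquo;s to define):</p>
<pre><code>{
  "messages": [
    { "role": "user", "content": "build the chat feature" },
    {
      "role": "assistant",
      "content": "done, files written",
      "reasoning_content": "the thinking block; this is what ai-sdk was stripping"
    },
    { "role": "user", "content": "now add tests" }
  ]
}</code></pre>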
<p>Install:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">git clone https://github.com/aattaran/deepclaude ~/Projects/deepclaude
</span></span><span class="line"><span class="cl">ln -sf ~/Projects/deepclaude/deepclaude.sh ~/.local/bin/deepclaude
</span></span><span class="line"><span class="cl">chmod +x ~/Projects/deepclaude/deepclaude.sh
</span></span><span class="line"><span class="cl">deepclaude --status   <span class="c1"># confirms which backend keys it detected</span></span></span></code></pre></div></div>
</div>
<p>Then run:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="nb">export</span> <span class="nv">OPENROUTER_API_KEY</span><span class="o">=</span>sk-...
</span></span><span class="line"><span class="cl">deepclaude --provider openrouter --model deepseek/deepseek-v4-pro
</span></span><span class="line"><span class="cl"><span class="c1"># starts a Claude Code session with DeepSeek V4 Pro answering instead of Opus</span></span></span></code></pre></div></div>
</div>
<p>To exit and go back to regular Claude Code, just <code>exit</code> the session. The environment variables are restored at the end of the script.</p>
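<p>In shell terms the wrapper pattern is simple. A minimal sketch of the save/swap/restore behavior, assuming nothing about <code>deepclaude.sh</code> beyond what&rsquo;s described above (variable names are illustrative):</p>
<pre><code># save whatever the user had
saved_base_url="$ANTHROPIC_BASE_URL"

# swap in the alternative backend
export ANTHROPIC_BASE_URL="https://openrouter.ai/api"
export ANTHROPIC_DEFAULT_OPUS_MODEL="deepseek/deepseek-v4-pro"

# Claude Code runs its normal agent loop against the swapped endpoint
claude "$@"

# restore on the way out
export ANTHROPIC_BASE_URL="$saved_base_url"</code></pre>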
<h2>How I extended the benchmark suite to support DeepClaude<span class="hx:absolute hx:-mt-20" id="how-i-extended-the-benchmark-suite-to-support-deepclaude"></span>
    <a href="#how-i-extended-the-benchmark-suite-to-support-deepclaude" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>The <code>claude_code_runner.py</code> I already use to run Claude Code variants (<code>claude_opus_alone</code>, <code>claude_opus_sonnet</code>, <code>claude_opus_haiku</code>) needed a way to inject env vars per variant, without breaking the existing pipeline. I added an optional <code>env_overrides</code> field to the JSON config of each variant. The shape:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;env_overrides&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;ANTHROPIC_BASE_URL&#34;</span><span class="p">:</span> <span class="s2">&#34;https://openrouter.ai/api&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;ANTHROPIC_AUTH_TOKEN&#34;</span><span class="p">:</span> <span class="s2">&#34;$OPENROUTER_API_KEY&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;ANTHROPIC_DEFAULT_OPUS_MODEL&#34;</span><span class="p">:</span> <span class="s2">&#34;deepseek/deepseek-v4-pro&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;ANTHROPIC_DEFAULT_SONNET_MODEL&#34;</span><span class="p">:</span> <span class="s2">&#34;deepseek/deepseek-v4-pro&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;ANTHROPIC_DEFAULT_HAIKU_MODEL&#34;</span><span class="p">:</span> <span class="s2">&#34;deepseek/deepseek-v4-pro&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;CLAUDE_CODE_SUBAGENT_MODEL&#34;</span><span class="p">:</span> <span class="s2">&#34;deepseek/deepseek-v4-pro&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;UNSET:ANTHROPIC_API_KEY&#34;</span><span class="p">:</span> <span class="s2">&#34;&#34;</span>
</span></span><span class="line"><span class="cl">  <span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span></span></span></code></pre></div></div>
</div>
<p>Conventions:</p>
<ul>
<li>A value starting with <code>$</code> is an indirect lookup against the parent shell env (<code>$OPENROUTER_API_KEY</code> resolves to whatever&rsquo;s in the user&rsquo;s shell). Keeps secrets out of the version-controlled <code>config/*.json</code>.</li>
<li>A key starting with <code>UNSET:</code> removes the variable from the subprocess env. I use this to drop <code>ANTHROPIC_API_KEY</code> when swapping to a non-Anthropic backend, otherwise the SDK might prefer it over <code>ANTHROPIC_AUTH_TOKEN</code>.</li>
<li>Everything else is literal.</li>
</ul>
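<p>In shell terms, the three rules amount to something like this (the real logic lives in <code>claude_code_runner.py</code>; this is an illustrative rendering, not the actual code):</p>
<pre><code>apply_override() {
  local key="$1" value="$2"
  if [[ "$key" == UNSET:* ]]; then
    unset "${key#UNSET:}"        # drop the variable from the subprocess env
  elif [[ "$value" == \$* ]]; then
    local ref="${value#\$}"
    export "$key"="${!ref}"      # indirect lookup against the parent shell
  else
    export "$key"="$value"       # literal value
  fi
}</code></pre>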
<p>The runner logs which overrides got applied (with masked secrets) at the start of each variant, so the benchmark output captures the full setup. That gave me two new variants in the benchmark:</p>
<table>
  <thead>
      <tr>
          <th>Slug</th>
          <th>Setup</th>
          <th>Subagent registered?</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>claude_code_deepseek_v4_pro_or</code></td>
          <td>Claude Code via DeepClaude → DeepSeek V4 Pro for the entire loop</td>
          <td>none</td>
      </tr>
      <tr>
          <td><code>claude_code_deepseek_v4_pro_or_sonnet</code></td>
          <td>Same, but with <code>sonnet-coder</code> registered (Sonnet 4.6 via Anthropic API)</td>
          <td>yes, but zero delegations observed</td>
      </tr>
  </tbody>
</table>
<p>Both ran with <code>--no-progress-minutes 15</code> (the standardized watchdog from Round 2.5).</p>
<h2>Results: DeepSeek V4 Pro in Tier A<span class="hx:absolute hx:-mt-20" id="results-deepseek-v4-pro-in-tier-a"></span>
    <a href="#results-deepseek-v4-pro-in-tier-a" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Both variants ran end-to-end with no <code>reasoning_content</code> errors. Multi-turn works. Registered subagent works (even if not invoked). Tool calls, file editing, bash, all of it works. The numbers:</p>
<table>
  <thead>
      <tr>
          <th>Variant</th>
          <th>Status</th>
          <th style="text-align: right">Time</th>
          <th style="text-align: right">Files</th>
          <th style="text-align: right">Turns</th>
          <th style="text-align: right">Delegations</th>
          <th style="text-align: right">Cost</th>
          <th style="text-align: right">Score</th>
          <th>Tier</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>..._or</code></td>
          <td>completed</td>
          <td style="text-align: right"><strong>21m</strong></td>
          <td style="text-align: right">1544</td>
          <td style="text-align: right">106</td>
          <td style="text-align: right">0 (no subagent)</td>
          <td style="text-align: right"><strong>$3.38</strong></td>
          <td style="text-align: right"><strong>84</strong></td>
          <td>A</td>
      </tr>
      <tr>
          <td><code>..._or_sonnet</code></td>
          <td>completed</td>
          <td style="text-align: right"><strong>18m</strong></td>
          <td style="text-align: right">1348</td>
          <td style="text-align: right">109</td>
          <td style="text-align: right"><strong>0</strong> (Sonnet ignored)</td>
          <td style="text-align: right"><strong>$3.14</strong></td>
          <td style="text-align: right"><strong>89</strong></td>
          <td>A</td>
      </tr>
  </tbody>
</table>
<p>Both 100% billed against <code>deepseek/deepseek-v4-pro</code>. Zero Sonnet tokens despite being registered. The subagent was there, the DeepSeek planner just never invoked it.</p>
<p>And where does this put DeepSeek in the canonical ranking? Here:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: right">Rank</th>
          <th>Model</th>
          <th style="text-align: right">Score</th>
          <th style="text-align: center">Tier</th>
          <th>Time</th>
          <th>Cost</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: right">1</td>
          <td>Claude Opus 4.7 (opencode)</td>
          <td style="text-align: right"><strong>97</strong></td>
          <td style="text-align: center">A</td>
          <td>18m</td>
          <td>~$1.10</td>
      </tr>
      <tr>
          <td style="text-align: right">1</td>
          <td>GPT 5.4 xHigh (Codex)</td>
          <td style="text-align: right"><strong>97</strong></td>
          <td style="text-align: center">A</td>
          <td>22m</td>
          <td>~$16</td>
      </tr>
      <tr>
          <td style="text-align: right">3</td>
          <td>GPT 5.5 xHigh (Codex)</td>
          <td style="text-align: right"><strong>96</strong></td>
          <td style="text-align: center">A</td>
          <td>18m</td>
          <td>~$10</td>
      </tr>
      <tr>
          <td style="text-align: right"><strong>4</strong></td>
          <td><strong>DeepSeek V4 Pro (DeepClaude + sonnet registered)</strong></td>
          <td style="text-align: right"><strong>89</strong></td>
          <td style="text-align: center"><strong>A</strong></td>
          <td><strong>18m</strong></td>
          <td><strong>$3.14</strong></td>
      </tr>
      <tr>
          <td style="text-align: right">5</td>
          <td>Kimi K2.6</td>
          <td style="text-align: right">87</td>
          <td style="text-align: center">A</td>
          <td>20m</td>
          <td>~$0.30</td>
      </tr>
      <tr>
          <td style="text-align: right">6</td>
          <td>DeepSeek V4 Pro (DeepClaude solo)</td>
          <td style="text-align: right">84</td>
          <td style="text-align: center">A</td>
          <td>21m</td>
          <td>$3.38</td>
      </tr>
      <tr>
          <td style="text-align: right">7</td>
          <td>Claude Opus 4.6</td>
          <td style="text-align: right">83</td>
          <td style="text-align: center">A</td>
          <td>16m</td>
          <td>~$1.10</td>
      </tr>
      <tr>
          <td style="text-align: right">8</td>
          <td>Gemini 3.1 Pro</td>
          <td style="text-align: right">82</td>
          <td style="text-align: center">A</td>
          <td>14m</td>
          <td>~$0.40</td>
      </tr>
      <tr>
          <td style="text-align: right">9</td>
          <td>Claude Sonnet 4.6</td>
          <td style="text-align: right">78</td>
          <td style="text-align: center">B</td>
          <td>16m</td>
          <td>~$0.63</td>
      </tr>
      <tr>
          <td style="text-align: right">9</td>
          <td>DeepSeek V4 Flash</td>
          <td style="text-align: right">78</td>
          <td style="text-align: center">B</td>
          <td>3m</td>
          <td>~$0.01</td>
      </tr>
      <tr>
          <td style="text-align: right">11</td>
          <td>Qwen 3.6 Plus</td>
          <td style="text-align: right">71</td>
          <td style="text-align: center">B</td>
          <td>17m</td>
          <td>~$0.15</td>
      </tr>
      <tr>
          <td style="text-align: right">&hellip;</td>
          <td>DeepSeek V4 Pro (opencode solo)</td>
          <td style="text-align: right">69</td>
          <td style="text-align: center">B</td>
          <td>~25m</td>
          <td>~$1</td>
      </tr>
  </tbody>
</table>
<p>In one round V4 Pro delivers Tier A, leaves the &ldquo;unmeasurable&rdquo; limbo behind, lands next to Kimi K2.6, and beats Opus 4.6 / Gemini 3.1 Pro. <strong>A 15-to-20-point lift from a harness change alone.</strong> Same model, same prompt, no orchestration, just a different agent loop.</p>
<h2>What this round proved (and what it didn&rsquo;t)<span class="hx:absolute hx:-mt-20" id="what-this-round-proved-and-what-it-didnt"></span>
    <a href="#what-this-round-proved-and-what-it-didnt" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><h3>The Claude Code regression with Opus is Opus-specific, not a harness property<span class="hx:absolute hx:-mt-20" id="the-claude-code-regression-with-opus-is-opus-specific-not-a-harness-property"></span>
    <a href="#the-claude-code-regression-with-opus-is-opus-specific-not-a-harness-property" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>This is the biggest finding. In Round 1 I&rsquo;d documented that Opus 4.7 inside Claude Code hallucinated <code>chat.complete</code> (Tier 3), while the same Opus 4.7 in opencode delivered Tier A. The hypothesis was that the context Claude Code injects (system prompt, tool schemas, agent registry, with 6-11M cache-read tokens) was nudging Opus toward an OpenAI-SDK mental model instead of the specific RubyLLM one.</p>
<p>DeepSeek V4 Pro inside the same Claude Code harness used the correct <code>chat.ask</code> path (<code>chat_service.rb:17</code> in both variants). No <code>chat.complete</code>, no invented fluent DSL, no batch-form invention. Cross-grep on both project trees:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code>grep -nE &#34;RubyLLM::Client|chat\.complete|chat\.send_message|chat\.user|chat\.assistant&#34;
→ 0 hits (both variants)</code></pre></div>
</div>
<p>The regression is Opus-specific. Opus&rsquo;s training has a particular vulnerability to how Claude Code&rsquo;s system prompt and schema-chatter prime the model. DeepSeek is robust to it. Other reasoning models with correct RubyLLM training would probably make it through Claude Code without regressing the same way.</p>
<h3>A registered subagent, even when never invoked, lifts quality<span class="hx:absolute hx:-mt-20" id="a-registered-subagent-even-when-never-invoked-lifts-quality"></span>
    <a href="#a-registered-subagent-even-when-never-invoked-lifts-quality" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>The more subtle finding. The <code>..._or_sonnet</code> variant has <code>sonnet-coder</code> registered, but DeepSeek never invoked it (zero delegations, 100% of tokens on <code>deepseek/deepseek-v4-pro</code>). And yet it scored +5 over the sister variant with no subagent (89 vs 84). Auditor attribution: with a subagent available &ldquo;in case I need it&rdquo;, the DeepSeek planner produced measurably more delegable decomposition. Smaller seams, cleaner DI, system prompt used via <code>with_instructions</code>, controller test that mocks the real service API shape instead of fudging it.</p>
<blockquote>
  <p>Knowing a subagent might execute the work pushes toward smaller, more contractable units — even when nothing actually delegates. Weak signal, single sample, but visible.</p>

</blockquote>
<p>This lines up with prior findings that the structured CONVERGE phase (planner forced to articulate the interface before executing) was the actual driver of quality lifts in some forced runs, more than the delegation itself. Just having a subagent available makes the planner think more like an architect.</p>
<h3>The multi-turn controller bug persists<span class="hx:absolute hx:-mt-20" id="the-multi-turn-controller-bug-persists"></span>
    <a href="#the-multi-turn-controller-bug-persists" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>A specific defect that survives the harness change: both DeepClaude variants have multi-turn problems at the controller layer.</p>
<ul>
<li>Variant 1 (<code>..._or</code>): outright single-shot. <code>chats_controller.rb:10</code> builds a 1-element messages array on every POST, throwing history away. The <code>ChatService</code> supports history, but the controller never sends it.</li>
<li>Variant 2 (<code>..._or_sonnet</code>): correct multi-turn via <code>session[:messages]</code>, but the cookie-store overflows around 10 turns with <code>CookieOverflow</code>.</li>
</ul>
<p>The solo DeepSeek run in opencode (69/100) had the same single-shot issue. Different harness, different agent loop, same model-level mistake. Multi-turn architecture for stateless Rails is genuinely hard for DeepSeek and a harness change doesn&rsquo;t help with that specific gap.</p>
<h3>DeepSeek as planner also ignores subagents on cohesive tasks<span class="hx:absolute hx:-mt-20" id="deepseek-as-planner-also-ignores-subagents-on-cohesive-tasks"></span>
    <a href="#deepseek-as-planner-also-ignores-subagents-on-cohesive-tasks" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>I&rsquo;d already documented in earlier posts that every frontier LLM (Opus, GPT 5.4, Codex) ignored their registered subagents on the cohesive Rails task. That got documented as planner rationality: smart planners correctly intuit that coordination cost exceeds execution savings on a tightly coupled task.</p>
<p>Round 4 adds DeepSeek V4 Pro to the list: Sonnet 4.6 registered, zero delegations. It generalizes the finding to &ldquo;strong reasoning model facing greenfield Rails with an optional delegate&rdquo;, instead of being a quirk of Opus. DeepSeek&rsquo;s no-delegate behavior matches what every other strong planner has done.</p>
<h2>Updated answer to &ldquo;is orchestrating worth it?&rdquo;<span class="hx:absolute hx:-mt-20" id="updated-answer-to-is-orchestrating-worth-it"></span>
    <a href="#updated-answer-to-is-orchestrating-worth-it" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>The Round 3 verdict was: for one-off greenfield Rails, solo Opus in opencode wins. Round 4 adds three cases:</p>
<p>For users who don&rsquo;t have access to Anthropic Claude Opus (or want to avoid Anthropic-direct cost), <code>claude_code_deepseek_v4_pro_or_sonnet</code> is the closest substitute: 89/100 Tier A, 18m, $3.14, running entirely through OpenRouter using a key you already have. It&rsquo;s the first meaningful &ldquo;Tier A coding without an Anthropic subscription&rdquo; answer in the benchmark.</p>
<p>For users who do have Opus, DeepClaude is still slightly worse on quality (-8) for marginally better cost (-$0.66). Not worth the trade.</p>
<p>For comparing models inside the Claude Code harness directly, DeepClaude makes the comparison possible: DeepSeek V4 Pro at 84-89 vs Opus 4.7 at the regressed Tier-3 level in the same harness. DeepSeek V4 Pro through DeepClaude beats Opus 4.7 inside Claude Code on quality, cost, and multi-turn. Worth remembering this only happens because Opus regresses in this harness; against Opus in opencode (97), DeepSeek is clearly behind.</p>
<h2>The cross-harness picture for DeepSeek V4 Pro<span class="hx:absolute hx:-mt-20" id="the-cross-harness-picture-for-deepseek-v4-pro"></span>
    <a href="#the-cross-harness-picture-for-deepseek-v4-pro" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>It looks like this:</p>
<table>
  <thead>
      <tr>
          <th>Harness</th>
          <th>Outcome</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>opencode solo (single agent)</td>
          <td>69/100 Tier B — works, lacks deliverable polish</td>
      </tr>
      <tr>
          <td>opencode multi-agent forced (executor)</td>
          <td>UNMEASURABLE — <code>reasoning_content</code> interop bug fails at turn 2</td>
      </tr>
      <tr>
          <td>opencode + manual cross-process orchestration (executor)</td>
          <td>UNMEASURABLE — same bug</td>
      </tr>
      <tr>
          <td>Claude Code via DeepClaude OR (solo)</td>
          <td><strong>84/100 Tier A</strong> — works, harness fills the polish gap</td>
      </tr>
      <tr>
          <td>Claude Code via DeepClaude OR (with registered subagent)</td>
          <td><strong>89/100 Tier A</strong> — modest planner-availability bonus</td>
      </tr>
      <tr>
          <td>Claude Code direct (Anthropic API)</td>
          <td>not tested — would need <code>DEEPSEEK_API_KEY</code> for the native endpoint</td>
      </tr>
  </tbody>
</table>
<p>Bottom line, DeepSeek V4 Pro is a Tier-A coder when delivered through a strong autonomous loop (Claude Code) and Tier B when delivered through a thinner harness (opencode). The RubyLLM API correctness is the same in both cases. The score difference is entirely in deliverable completeness (README, compose, CI tooling) which Claude Code&rsquo;s loop fills in.</p>
<h2>What to use this for<span class="hx:absolute hx:-mt-20" id="what-to-use-this-for"></span>
    <a href="#what-to-use-this-for" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Three scenarios where DeepClaude becomes the right tool.</p>
<p>The first is Tier-A coding on a budget, with no Anthropic subscription: the <code>..._or_sonnet</code> variant is the new default. $3.14, 18m, 89/100, Tier A, running entirely through OpenRouter.</p>
<p>The second is validating whether the Claude Code regression hits other models. It hits Opus, doesn&rsquo;t hit DeepSeek. The <code>chat.complete</code> regression is Opus-specific. DeepSeek (and presumably other reasoning models with correct RubyLLM training) cross the harness without regressing.</p>
<p>The third is future experiments. DeepClaude opens the door to running any OpenRouter model through Claude Code&rsquo;s full agent loop. It&rsquo;s worth re-testing Kimi K2.6 and Qwen 3.6 Plus through DeepClaude to compare against the manual-orchestration Round 3 results: same model, different harness, see where the lift lands. There&rsquo;s a real chance Kimi K2.6 (already Tier A at 87) climbs to 90+ and ties the V4 Pro DeepClaude result. That&rsquo;s the next round of the benchmark.</p>
<h2>Caveats from DeepClaude<span class="hx:absolute hx:-mt-20" id="caveats-from-deepclaude"></span>
    <a href="#caveats-from-deepclaude" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Things the project README warns about, worth repeating here.</p>
<p>No image input: the DeepSeek Anthropic endpoint doesn&rsquo;t support vision. For this benchmark it&rsquo;s text-only, so no impact. For other use cases it can be a real limitation.</p>
<p>No MCP server tools: the compatibility layer doesn&rsquo;t support MCP. If you use MCPs in your normal Claude Code workflow, they won&rsquo;t work inside DeepClaude.</p>
<p>Anthropic prompt caching is ignored. The <code>cache_control</code> markers Claude Code emits get dropped on the floor; DeepSeek has its own automatic caching (which shows up in the <code>cache_read</code> of the token usage), but the explicit markers do nothing. Per-turn cost ends up slightly higher than a comparable Anthropic-direct run because of this. Relevant for apples-to-apples cost comparisons.</p>
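<p>For the curious, these are the markers in question. Claude Code annotates prompt blocks Anthropic-style, roughly like this (the <code>cache_control</code> shape is from Anthropic&rsquo;s Messages API; the surrounding payload is trimmed):</p>
<pre><code>{
  "system": [
    {
      "type": "text",
      "text": "You are Claude Code...",
      "cache_control": { "type": "ephemeral" }
    }
  ]
}</code></pre>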
<h2>Wrapping up<span class="hx:absolute hx:-mt-20" id="wrapping-up"></span>
    <a href="#wrapping-up" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>DeepSeek V4 Pro is out of limbo. It was Tier B in opencode solo and unmeasurable in any multi-agent scenario. Through DeepClaude (Claude Code with env vars pointed at OpenRouter) it delivers Tier A at 89/100, beats Opus 4.6 and Gemini 3.1 Pro, lands next to Kimi K2.6, costs $3.14 in 18 minutes. The lift didn&rsquo;t come from orchestration or a better model: it came purely from swapping the harness for an agent loop DeepSeek can drive well. For anyone without Opus access but with an <code>OPENROUTER_API_KEY</code>, this is the first real Tier-A coding option without an Anthropic tether.</p>
<p>The next round is already cooked: run Kimi K2.6 and Qwen 3.6 Plus through DeepClaude to see whether the &ldquo;strong harness lifts a Tier-B/A model to the top&rdquo; effect reproduces. If it does, that opens up a whole family of budget variants that can compete with solo Opus in the benchmark. If it doesn&rsquo;t, it becomes clear that DeepSeek has a specific affinity with the Claude Code loop the others don&rsquo;t share. Both outcomes are interesting.</p>
<h2>References<span class="hx:absolute hx:-mt-20" id="references"></span>
    <a href="#references" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><ul>
<li><a href="https://github.com/aattaran/deepclaude"target="_blank" rel="noopener">DeepClaude on GitHub</a> — the shell shim itself, README with the full list of supported backends</li>
<li><a href="https://github.com/akitaonrails/llm-coding-benchmark/blob/master/docs/success_report.deepclaude.md"target="_blank" rel="noopener">Round 4 success report</a> — the full technical report this post comes from</li>
<li><a href="https://github.com/akitaonrails/llm-coding-benchmark/blob/master/docs/deepclaude-integration.md"target="_blank" rel="noopener">DeepClaude integration in the benchmark</a> — runner patch, <code>env_overrides</code> shape, smoke test</li>
<li><a href="https://github.com/akitaonrails/llm-coding-benchmark"target="_blank" rel="noopener">Benchmark repo</a> — code, configs, results from every round</li>
</ul>
]]></content:encoded><category>ai</category><category>llm</category><category>benchmark</category><category>deepseek</category><category>claudecode</category><category>ruby</category></item><item><title>NW-Omarchy: Bringing Omarchy to X11 with XLibre</title><link>https://akitaonrails.github.io/en/2026/05/01/nw-omarchy-xlibre-inaugural/</link><guid isPermaLink="true">https://akitaonrails.github.io/en/2026/05/01/nw-omarchy-xlibre-inaugural/</guid><pubDate>Fri, 01 May 2026 18:00:00 GMT</pubDate><description>&lt;p&gt;&lt;img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/05/01/nw-omarchy/launcher-drun.png" alt="NW-Omarchy launcher: super&amp;#43;space opens rofi with every .desktop on the system, icons rendered through the Papirus theme" loading="lazy" /&gt;&lt;/p&gt;
&lt;p&gt;This post inaugurates an experimental project I&amp;rsquo;ve been hacking on for the past few weeks: &lt;a href="https://github.com/akitaonrails/NW-Omarchy" target="_blank" rel="noopener"&gt;NW-Omarchy&lt;/a&gt;. The pitch is easy to explain and stubborn enough to be worth defending: take the opinionated, pretty look-and-feel of &lt;a href="https://omarchy.org/" target="_blank" rel="noopener"&gt;Omarchy&lt;/a&gt; (which is Hyprland, meaning Wayland) and bring all of that to the X11 world, as a parallel session you pick at SDDM. Your Hyprland session stays untouched. When you want to try X11, pick &lt;code&gt;nw-bspwm&lt;/code&gt; at login and land in something that looks like the same Omarchy with different parts underneath.&lt;/p&gt;</description><content:encoded><![CDATA[<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/05/01/nw-omarchy/launcher-drun.png" alt="NW-Omarchy launcher: super&#43;space opens rofi with every .desktop on the system, icons rendered through the Papirus theme"  loading="lazy" /></p>
<p>This post inaugurates an experimental project I&rsquo;ve been hacking on for the past few weeks: <a href="https://github.com/akitaonrails/NW-Omarchy" target="_blank" rel="noopener">NW-Omarchy</a>. The pitch is easy to explain and stubborn enough to be worth defending: take the opinionated, pretty look-and-feel of <a href="https://omarchy.org/" target="_blank" rel="noopener">Omarchy</a> (which is Hyprland, meaning Wayland) and bring all of that to the X11 world, as a parallel session you pick at SDDM. Your Hyprland session stays untouched. When you want to try X11, pick <code>nw-bspwm</code> at login and land in something that looks like the same Omarchy with different parts underneath.</p>
<p>The obvious question is &ldquo;why bother&rdquo;. Red Hat already announced in 2025 that upstream <code>xorg-server</code> is gone from RHEL 10, and the corporate Linux narrative shifted to &ldquo;use Wayland and move on&rdquo;. As usual, in open source we don&rsquo;t have to agree. There&rsquo;s XLibre, a fork of xorg-server picking up active maintenance, and there are plenty of people with old hardware, legacy drivers, or legacy software for which Wayland still isn&rsquo;t a viable path. NW-Omarchy is for that scenario: give that gear a graceful second wind and, while you&rsquo;re at it, make it pretty.</p>
<h2>What XLibre is and why it matters<span class="hx:absolute hx:-mt-20" id="what-xlibre-is-and-why-it-matters"></span>
    <a href="#what-xlibre-is-and-why-it-matters" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>XLibre is a fork of xorg-server that started in 2025, after it became clear upstream had no energy left to evolve. X.Org itself was only taking critical security patches, and the historical primary funder of new development (Red Hat) <a href="https://www.redhat.com/en/blog/rhel-10-plans-wayland-and-xorg-server"target="_blank" rel="noopener">announced it was stepping out</a> of new X-server work as part of the RHEL 10 plans. XLibre took the codebase and went back to cleanups, security fixes, modernization and driver-ABI bumps. The first public release was <a href="https://www.phoronix.com/news/XLibre-25.0-Released"target="_blank" rel="noopener">XLibre 25.0</a>, and the project lives at <a href="https://github.com/X11Libre/xserver"target="_blank" rel="noopener">github.com/X11Libre/xserver</a>.</p>
<p>XLibre&rsquo;s technical pitch is being a drop-in replacement for xorg-server. Same X11 protocol, same client API (libxcb/libx11), same dependency graph in pacman. Your xterm, your Firefox, your Steam, your favorite emulator for arcade machines from the 90s, none of them know the X server changed. Distros already shipping XLibre as default or tier-one:</p>
<ul>
<li><strong>Artix Linux</strong> (<a href="https://linuxiac.com/artix-linux-2026-04-released-with-xlibre-as-default-x-serve/" target="_blank" rel="noopener">2026.04+</a>) ships it as the default X server.</li>
<li><strong>Fedora</strong> has an <a href="https://fedoraproject.org/wiki/Changes" target="_blank" rel="noopener">open Change proposal</a> to migrate.</li>
<li><strong>Arch Linux</strong> has an official binary repo at <code>x11libre.net/repo/arch_based/</code>. That&rsquo;s exactly what NW-Omarchy uses.</li>
</ul>
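<p>Wiring that repo into <code>/etc/pacman.conf</code> looks something like the snippet below. The section name and the <code>$arch</code> suffix are my guess from standard pacman conventions; the NW-Omarchy installer writes the exact lines for you:</p>
<pre><code>[x11libre]
Server = https://x11libre.net/repo/arch_based/$arch</code></pre>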
<p>Comparing what&rsquo;s still alive:</p>
<table>
  <thead>
      <tr>
          <th></th>
          <th>xorg-server</th>
          <th>XLibre</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Last non-security release</td>
          <td>21.1 (2022)</td>
          <td>25.0 (2025)</td>
      </tr>
      <tr>
          <td>Active codebase work</td>
          <td>none</td>
          <td>yes</td>
      </tr>
      <tr>
          <td>Driver ABI</td>
          <td>frozen</td>
          <td>bumped (<code>X-ABI-VIDEODRV_VERSION=28.0</code> in 25.0)</td>
      </tr>
      <tr>
          <td>Stated future</td>
          <td>&ldquo;use Wayland&rdquo;</td>
          <td>continued X server evolution</td>
      </tr>
      <tr>
          <td>Coexists with <code>xorg-xwayland</code></td>
          <td>yes</td>
          <td>yes</td>
      </tr>
  </tbody>
</table>
<p>If your reason for staying on X11 is any of the classics (legacy proprietary Nvidia driver, vintage Intel GPU, Optimus laptop with quirky muxing, dependency on <code>xdotool</code>/<code>wmctrl</code>/<code>xprop</code>/<code>xinput</code> for accessibility or remap tools, <code>ssh -X</code> to run a remote app, clipboard with no permission prompt, screen reader that needs to read input from any window), XLibre is the path that keeps that stack under active maintenance.</p>
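<p>None of that requires anything exotic; it&rsquo;s the ordinary plumbing X11 users already know (hostnames and window names here are illustrative):</p>
<pre><code>ssh -X devbox firefox                            # remote app, local window
xdotool search --name "Steam" windowactivate     # script against any window
xclip -selection clipboard &lt; notes.txt           # clipboard, no permission prompt</code></pre>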
<h2>Why this project exists<span class="hx:absolute hx:-mt-20" id="why-this-project-exists"></span>
    <a href="#why-this-project-exists" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>I run Omarchy on my Thinkpad T14 Gen 6 (<a href="/en/2026/04/18/omarchy-on-thinkpad-t14-gen-6/">wrote about it here</a>) and I genuinely like it. DHH did serious curation work: a single coherent theme across everything (terminal, launcher, top bar, lockscreen), consistent shortcuts, visual choices that don&rsquo;t tire your eyes. It&rsquo;s opinionated the right way, and it spares you the fatigue of configuring everything from scratch.</p>
<p>The catch is that Omarchy is Hyprland, Hyprland is Wayland-only, and Wayland still has scenarios where it&rsquo;s the wrong tool for the job. Older machines with legacy Nvidia drivers, environments leaning on software with deep X11 integration, situations where you want <code>ssh -X</code> and don&rsquo;t want to learn <code>waypipe</code>, or just machines you want to keep running without swapping half the stack. For those cases, I wanted Omarchy&rsquo;s aesthetic and workflow on top of X11. It didn&rsquo;t exist. So I built it.</p>
<p>The name NW-Omarchy is short for &ldquo;Not Wayland Omarchy&rdquo;. It&rsquo;s deliberately a sidecar: runs alongside, doesn&rsquo;t replace. You install it on top of an existing Omarchy install, it creates a separate SDDM session called <code>nw-bspwm</code>, and nothing in your Hyprland setup gets touched. Want to go back to Hyprland? Pick Hyprland at SDDM next login. Want to uninstall? There&rsquo;s an uninstall script that reverses exactly what got installed, based on a manifest that tracks every action.</p>
<h2>What changes in the stack (Wayland → X11)<span class="hx:absolute hx:-mt-20" id="what-changes-in-the-stack-wayland--x11"></span>
    <a href="#what-changes-in-the-stack-wayland--x11" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Hyprland is compositor + WM in the same binary. In the X11 world, those two functions come as separate pieces, and each has a well-known substitute. The translation work was mapping every Omarchy component to its X11 equivalent, then making sure the keyboard shortcuts, theming, and behavior stayed as close as possible.</p>
<table>
  <thead>
      <tr>
          <th>Function</th>
          <th>Omarchy/Wayland</th>
          <th>NW-Omarchy/X11</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Window manager</td>
          <td>Hyprland</td>
          <td><a href="https://github.com/baskerville/bspwm"target="_blank" rel="noopener">bspwm</a></td>
      </tr>
      <tr>
          <td>Hotkey daemon</td>
          <td>Hyprland built-in</td>
          <td><a href="https://github.com/baskerville/sxhkd"target="_blank" rel="noopener">sxhkd</a></td>
      </tr>
      <tr>
          <td>Compositor</td>
          <td>Hyprland built-in</td>
          <td><a href="https://github.com/yshui/picom"target="_blank" rel="noopener">picom</a> (upstream v13)</td>
      </tr>
      <tr>
          <td>Top bar</td>
          <td>Waybar</td>
          <td><a href="https://github.com/polybar/polybar"target="_blank" rel="noopener">polybar</a></td>
      </tr>
      <tr>
          <td>App launcher</td>
          <td>Walker</td>
          <td><a href="https://github.com/davatorium/rofi"target="_blank" rel="noopener">rofi</a></td>
      </tr>
      <tr>
          <td>Notifications</td>
          <td>Mako</td>
          <td><a href="https://github.com/dunst-project/dunst"target="_blank" rel="noopener">dunst</a></td>
      </tr>
      <tr>
          <td>Idle/lock</td>
          <td>hypridle / hyprlock</td>
          <td>xidlehook + xss-lock + i3lock-color</td>
      </tr>
      <tr>
          <td>Nightlight</td>
          <td>hyprsunset</td>
          <td>redshift</td>
      </tr>
      <tr>
          <td>Clipboard history</td>
          <td>walker <code>-m clipboard</code></td>
          <td><a href="https://github.com/cdown/clipmenu"target="_blank" rel="noopener">clipmenu</a> (rofi-backed)</td>
      </tr>
      <tr>
          <td>Screenshots</td>
          <td>hyprshot</td>
          <td>maim + slop + xclip</td>
      </tr>
      <tr>
          <td>Color picker</td>
          <td>hyprpicker</td>
          <td>xcolor</td>
      </tr>
  </tbody>
</table>
<p><strong>bspwm</strong> is a minimalist tiling window manager. The WM itself only responds to commands over a socket, and <code>bspwmrc</code> is literally a shell script. Keybindings live in a separate daemon, <strong>sxhkd</strong>, which reads a plain text file. That separation is classic X11: each piece does one thing.</p>
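<p>To make that concrete: a binding in <code>sxhkdrc</code> is the chord on one line and the command indented under it, and bspwm itself is driven by writing to its socket with <code>bspc</code>. Illustrative lines, not copied from the NW-Omarchy configs:</p>
<pre><code># ~/.config/sxhkd/sxhkdrc
super + space
    rofi -show drun

# bspwm takes commands over its socket
bspc node focused -t floating</code></pre>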
<p><strong>picom</strong> is the compositor that gives you blur, shadows, rounded corners, fade-in/out, and per-window opacity rules. NW-Omarchy uses upstream v13 (released February 2026), not the old <code>picom-ftlabs-git</code> fork, <a href="https://github.com/yshui/picom/issues" target="_blank" rel="noopener">abandoned since 2024</a>. The visual effects Omarchy has under Hyprland (rounded corners, fade on open/close, blur on rofi and dunst) all come from picom in NW-Omarchy.</p>
<p><strong>polybar</strong> replaces waybar. Same visual layout: Omarchy logo on the left, date in the middle, audio/wifi/bluetooth/battery indicators on the right. Same nerd-font icons. Left click opens the same TUIs (wiremix, impala, bluetui), right click opens the GUI variants. Scroll on audio adjusts volume, middle click mutes. Practical parity.</p>
<p><strong>rofi</strong> replaces walker. <code>super + space</code> opens the full launcher with every <code>.desktop</code> on the system plus icons. <code>super + shift + space</code> opens a cheat-sheet of the pinned apps, with the chord shown next to each name. Useful for remembering &ldquo;what was the chord for Signal again&rdquo;.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/05/01/nw-omarchy/launcher-cheatsheet.png" alt="Pinned-apps cheat-sheet: super&#43;shift&#43;space opens rofi with each app and its chord, parsed straight from sxhkdrc"  loading="lazy" /></p>
<p>For a user who isn&rsquo;t going to crack open <code>~/.config</code> to customize anything, the difference between the two sessions is homeopathic. The same Omarchy shortcuts (over 70 bindings) are wired up, the same menus pop up, the same logo sits in the bar, the same nerd-font icons.</p>
<h2>The theme system is the same<span class="hx:absolute hx:-mt-20" id="the-theme-system-is-the-same"></span>
    <a href="#the-theme-system-is-the-same" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>This was the most fun part to build, and the part that surprises people most. Omarchy has a centralized theme system: each theme lives at <code>~/.local/share/omarchy/themes/&lt;name&gt;/</code>, ships a <code>colors.toml</code> with a universal palette (<code>accent</code>, <code>background</code>, <code>foreground</code>, <code>color0..15</code>), and <code>omarchy-theme-set &lt;name&gt;</code> re-renders each app&rsquo;s config to match. The active theme lives at <code>~/.config/omarchy/current/theme/</code>, and every app imports its colors from there.</p>
<p>NW-Omarchy plugs into that same pipeline. I added <code>.tpl</code> templates for bspwm, polybar, rofi and dunst that follow the Omarchy convention, and <code>omarchy-theme-set-templates</code> regenerates everything on every theme change. The result is that all 19 themes Omarchy ships work as-is in NW-Omarchy, with zero porting work. Catppuccin, gruvbox, kanagawa, nord, tokyo-night, all of them. The wallpaper changes along with the theme via an inotify daemon (<code>nw-omarchy-bg-watch</code>).</p>
<p>A few examples in action:</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/05/01/nw-omarchy/theme-ristretto.png" alt="Ristretto theme: warm color curves on the wallpaper, polybar and bspwm border picking up the same palette"  loading="lazy" /></p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/05/01/nw-omarchy/theme-lumon.png" alt="Lumon theme: Severance reference, cold corporate palette, “United in Severance”"  loading="lazy" /></p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/05/01/nw-omarchy/theme-everforest.png" alt="Everforest theme: moss-green tones, wallpaper of misty tree-tops"  loading="lazy" /></p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/05/01/nw-omarchy/theme-retro-82.png" alt="Retro-82 theme: synthwave, pink grid on dark background, neon details"  loading="lazy" /></p>
<p>When Omarchy adds a new theme via <code>omarchy-theme-update</code>, NW-Omarchy gets it for free. No extra work, no rebuild, nothing for me to republish.</p>
<h2>System menu and TUIs<span class="hx:absolute hx:-mt-20" id="system-menu-and-tuis"></span>
    <a href="#system-menu-and-tuis" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Omarchy&rsquo;s system menu (the <code>super + alt + space</code> one that opens the hierarchical menu with Hardware, Setup, Update, etc.) has a 1-to-1 port in NW-Omarchy. Same tree, same navigation. Where Omarchy calls <code>walker --dmenu</code>, I call <code>rofi -dmenu</code>. The rest of the code is identical, and most of the helpers the menu invokes (<code>omarchy-pkg-add</code>, <code>omarchy-theme-set</code>, <code>omarchy-update</code>, etc.) have nothing Wayland-specific in them and run straight on X11.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/05/01/nw-omarchy/system-menu.png" alt="nw-omarchy-menu open: the Omarchy system menu tree, arrow-key navigation, same look"  loading="lazy" /></p>
<p>The TUIs too: wiremix for audio, impala for wifi, bluetui for bluetooth, btop for system resources. They all open as a centered floating window, same as Omarchy under Hyprland. bspwm has a rule (<code>bspc rule -a</code>) that recognizes the <code>org.nw-omarchy.&lt;cmd&gt;</code> class and applies <code>state=floating center=on rectangle=900x600</code> automatically.</p>
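<p>That rule, reconstructed from the parameters above (the per-command class string is whatever the launcher wrapper sets, so the exact name is illustrative):</p>
<pre><code>bspc rule -a 'org.nw-omarchy.impala' state=floating center=on rectangle=900x600</code></pre>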
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/05/01/nw-omarchy/wifi-tui.png" alt="Impala TUI running as a centered floating window via super&#43;ctrl&#43;w, palette following the active theme"  loading="lazy" /></p>
<h2>What&rsquo;s there and what isn&rsquo;t<span class="hx:absolute hx:-mt-20" id="whats-there-and-what-isnt"></span>
    <a href="#whats-there-and-what-isnt" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Honesty first: there are Hyprland features that are Wayland-only and don&rsquo;t come back. Per-monitor fractional scaling is gone, X11 only does integer scaling per output, so mixed-DPI setups look off. HDR doesn&rsquo;t exist on X11. Tear-free rendering by design needs Wayland; on X11 you need picom + vsync and a cooperative GPU, and even then you get close to Wayland without quite matching it. Workspace swipe with live preview during the gesture is also gone, because bspwm has no in-progress switch state, so the gesture fires a discrete command only on completion (you go to the next workspace, but you don&rsquo;t see the preview slide). And the Wayland-specific daemons (voxtype, walker preview pane, hyprsunset, hypridle) become <code>redshift</code> and <code>xss-lock</code>, with no preview pane in rofi.</p>
<p>If any of those points is a deal-breaker for you, stay on Omarchy/Hyprland. SDDM picks between the two sessions every login, so they coexist without resentment.</p>
<p>On the other hand, what X11 has going for it still applies. <code>xdotool</code>, <code>wmctrl</code>, <code>xprop</code> and <code>xinput</code> keep working for automation, key remapping, screen readers, AutoKey, kanata, xkeysnail. <code>xclip</code>, <code>xsel</code> and clipmenu do straightforward clipboard plumbing, no permission dialog. <code>ssh -X</code> keeps working: you run Firefox on a remote box and the window pops up locally. Legacy Nvidia, vintage Intel and quirky Optimus drivers are all first-class on X11. The tiling WM ecosystem is mature (bspwm, i3, awesome, openbox, xmonad, dwm) and configs port between them with little pain. And the predictable input model, where any client can read any event, is bad for security but excellent for accessibility and productivity tooling.</p>
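<p>Concretely, these are the kinds of one-liners that just work on X11 (all real tools; the window title and hostname are placeholders):</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">
<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash">xdotool search --name "Firefox" windowactivate   # focus a window by title
xdotool type "hello from a script"               # synthesize keystrokes
xclip -o -selection clipboard                    # read the clipboard, no dialog
ssh -X my-remote-box firefox                     # remote app, local window
</code></pre></div></div>
</div>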
<p>For older hardware, that&rsquo;s the difference between &ldquo;live machine&rdquo; and &ldquo;machine sitting in a corner&rdquo;.</p>
<h2>Install<span class="hx:absolute hx:-mt-20" id="install"></span>
    <a href="#install" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Prerequisite: Arch Linux with Omarchy already installed. NW-Omarchy runs on top of it, doesn&rsquo;t replace it.</p>
<p>One-liner install:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">curl -fsSL https://raw.githubusercontent.com/akitaonrails/NW-Omarchy/master/boot.sh <span class="p">|</span> bash</span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p><code>boot.sh</code> runs preflight (checks Arch, Omarchy, git), clones the repo to <code>~/.local/share/nw-omarchy</code>, and runs the full install pipeline: packages, the <code>nw-bspwm</code> session, configs in <code>~/.config</code>, and the XLibre swap as the last step. It asks for confirmation once before touching anything. Skip the prompt with:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">curl -fsSL https://raw.githubusercontent.com/akitaonrails/NW-Omarchy/master/boot.sh <span class="p">|</span> bash -s -- --yes</span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>When it finishes, reboot. The X server swap (xorg-server → XLibre) only takes effect on the next session. At SDDM you&rsquo;ll see a session picker. We also swapped Omarchy&rsquo;s SDDM theme for one with a selector, because the vanilla Omarchy theme has the session name hard-coded in the QML with no UI to change it. Pick <code>nw-bspwm</code> and you&rsquo;re in.</p>
<p>If you&rsquo;d rather clone and run by hand to inspect first:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">git clone https://github.com/akitaonrails/NW-Omarchy.git ~/.local/share/nw-omarchy
</span></span><span class="line"><span class="cl"><span class="nb">cd</span> ~/.local/share/nw-omarchy
</span></span><span class="line"><span class="cl">./install.sh           <span class="c1"># dry-run, prints what it would do</span>
</span></span><span class="line"><span class="cl">./install.sh --apply   <span class="c1"># actually run it</span></span></span></code></pre></div></div>
</div>
<h2>Upgrade<span class="hx:absolute hx:-mt-20" id="upgrade"></span>
    <a href="#upgrade" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>When a new release ships, the path is:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">nw-omarchy-upgrade --check    <span class="c1"># checks current vs latest, exits</span>
</span></span><span class="line"><span class="cl">nw-omarchy-upgrade            <span class="c1"># yay -Syu, fetch latest tag, migrations, install</span></span></span></code></pre></div></div>
</div>
<p>The script runs <code>yay -Syu</code> (or <code>pacman -Syu</code> if you don&rsquo;t have yay), compares the local tag to the remote, and if there&rsquo;s a new release does <code>git fetch &amp;&amp; git checkout v&lt;latest&gt;</code>, runs the migrations between the current and new versions, and re-applies <code>install.sh --apply</code>. All idempotent. Re-running on an already-current machine is a no-op.</p>
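<p>In shell terms the version check is nothing exotic. A hypothetical sketch of the flow just described (not the script&rsquo;s actual source; the migration step is omitted):</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">
<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash">cd ~/.local/share/nw-omarchy
git fetch --tags
latest=$(git tag --sort=-v:refname | head -n1)
current=$(git describe --tags --abbrev=0)
if [ "$current" != "$latest" ]; then
  git checkout "$latest"      # pin to the new release tag
  ./install.sh --apply        # idempotent re-apply
fi
</code></pre></div></div>
</div>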
<h2>Uninstall<span class="hx:absolute hx:-mt-20" id="uninstall"></span>
    <a href="#uninstall" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>If you tried it and didn&rsquo;t like it, uninstall this way:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">~/.local/share/nw-omarchy/uninstall.sh --apply</span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>The uninstaller reads <code>~/.local/state/nw-omarchy/manifest.tsv</code> and replays it in reverse: removes only the packages it installed (<code>pkg-skip</code> rows are left alone; packages that were already on the system stay on the system), restores original configs from <code>~/.local/state/nw-omarchy/backups/</code>, deletes the <code>nw-bspwm</code> session entry, wipes the state dir. Your Hyprland session keeps working as before. Nothing under <code>~/.config/hypr/</code> or <code>~/.local/share/omarchy/</code> gets touched at any point during install or uninstall.</p>
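<p>To make the replay idea concrete, here&rsquo;s a hypothetical illustration of manifest rows; the column layout is assumed from the description above, not the real file format:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">
<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"># action     target                     disposition on uninstall
# pkg        rofi                       removed (nw-omarchy installed it)
# pkg-skip   git                        left alone (was already present)
# config     ~/.config/bspwm/bspwmrc    restored from backups/
tac ~/.local/state/nw-omarchy/manifest.tsv   # replayed in reverse order
</code></pre></div></div>
</div>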
<p>To revert specifically XLibre back to xorg-server (without uninstalling anything else), pacman handles it on its own:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">sudo pacman -S xorg-server xorg-server-common xf86-input-libinput</span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>The package <code>provides</code>/<code>conflicts</code> graph handles the rest of the swap atomically.</p>
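<p>In PKGBUILD terms the mechanism is standard pacman semantics; the field values below are illustrative, the keywords are real:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">
<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"># The XLibre server package declares itself a drop-in for xorg-server:
provides=('xorg-server' 'xorg-server-common')
conflicts=('xorg-server' 'xorg-server-common')
# Installing either side displaces the other in a single pacman
# transaction; no manual removal step is needed.
</code></pre></div></div>
</div>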
<h2>Status: experimental, feedback welcome<span class="hx:absolute hx:-mt-20" id="status-experimental-feedback-welcome"></span>
    <a href="#status-experimental-feedback-welcome" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>The current version is at 1.0 but the project is still young, in polish phase. The big items are in place: binding parity, menu parity, theme parity, TUI parity, doctor for health-checking the install, manifest for clean uninstall. What&rsquo;s left is polish, edge cases, working through varied hardware, listening to feedback from anyone running it on a different machine than mine (Thinkpad T14 Gen 6).</p>
<p>If you try it and hit something (something that doesn&rsquo;t work, configuration that&rsquo;s missing, behavior that differs from Omarchy in a way that breaks flow), open an issue here:</p>
<p><a href="https://github.com/akitaonrails/NW-Omarchy/issues"target="_blank" rel="noopener">github.com/akitaonrails/NW-Omarchy/issues</a></p>
<p>Repros, stack traces, output of <code>nw-omarchy-doctor</code>, anything helps. PRs are welcome too.</p>
<p>The case for X11 in 2026 is pragmatic: the transition to Wayland takes time, plenty of hardware and plenty of software still live comfortably on this side of the fence, and the open source community can support more than one path at a time. XLibre is the bet on keeping the X server alive a few more years. NW-Omarchy is my bet that you can make this older stack look as pretty as the newer one, without compromising any of what makes X11 still worth the trip.</p>
<p>If you&rsquo;re on a new machine, modern GPU, 4K HiDPI display, wanting HDR, stay on Omarchy/Hyprland. If you&rsquo;re on an older box, or you simply prefer the X11 ecosystem for any of the reasons above, give NW-Omarchy a shot.</p>
]]></content:encoded><category>omarchy</category><category>xlibre</category><category>archlinux</category><category>bspwm</category><category>linux</category><category>x11</category></item><item><title>LLM Benchmarks: Is It Worth ($$) Mixing 2 Models? (Planner + Executor)</title><link>https://akitaonrails.github.io/en/2026/04/25/llm-benchmarks-vale-a-pena-misturar-2-modelos/</link><guid isPermaLink="true">https://akitaonrails.github.io/en/2026/04/25/llm-benchmarks-vale-a-pena-misturar-2-modelos/</guid><pubDate>Sat, 25 Apr 2026 13:00:00 GMT</pubDate><description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; No. Across all three rounds of experiments I ran, mixing &amp;ldquo;strong frontier planner + cheap executor&amp;rdquo; loses to just using Opus 4.7 alone in a mature harness. &lt;strong&gt;Solo Opus 4.7 in opencode delivers Tier A (97/100) in 18 minutes for ~$4 pay-as-you-go.&lt;/strong&gt; No multi-agent combination beats that on quality, and no combination is cheaper at the same time. The exception is Codex GPT 5.4 xHigh + &lt;code&gt;medium&lt;/code&gt; executor, which drops from ~$16/run to ~$1-3/run while losing 3 quality points. Useful if you only have GPT in your provider stack. For everything else, &lt;strong&gt;let the frontier model decide when to delegate on its own&lt;/strong&gt;, especially if you&amp;rsquo;re on a Plus/Pro/Max subscription.&lt;/p&gt;</description><content:encoded><![CDATA[<p><strong>TL;DR:</strong> No. Across all three rounds of experiments I ran, mixing &ldquo;strong frontier planner + cheap executor&rdquo; loses to just using Opus 4.7 alone in a mature harness. <strong>Solo Opus 4.7 in opencode delivers Tier A (97/100) in 18 minutes for ~$4 pay-as-you-go.</strong> No multi-agent combination beats that on quality, and no combination is cheaper at the same time. The exception is Codex GPT 5.4 xHigh + <code>medium</code> executor, which drops from ~$16/run to ~$1-3/run while losing 3 quality points. Useful if you only have GPT in your provider stack. For everything else, <strong>let the frontier model decide when to delegate on its own</strong>, especially if you&rsquo;re on a Plus/Pro/Max subscription.</p>
<hr>
<h2>Before we start: the five wrong premises<span class="hx:absolute hx:-mt-20" id="before-we-start-the-five-wrong-premises"></span>
    <a href="#before-we-start-the-five-wrong-premises" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>There&rsquo;s a narrative that shows up in my feed every week: &ldquo;I&rsquo;m saving money mixing Opus for planning with Kimi/Qwen/GLM/DeepSeek for execution.&rdquo; The premise is that the frontier model is too expensive to use for continuous coding, so you split the task: the expensive one thinks, the cheap one writes. Sounds reasonable. In practice, it&rsquo;s wrong on five fronts.</p>
<p><strong>First: most of the people saying this don&rsquo;t show their results.</strong> Plenty of folks brag about orchestrating dozens of agents in parallel, fancy dashboards, elaborate flows. Ask them to show the app that came out of it, in production, generating value. Almost nobody delivers. <strong>Ask for the result.</strong> If they can&rsquo;t show real production, they&rsquo;re snake-oil sellers. My benchmark, with everything open, is the opposite of that: <a href="https://github.com/akitaonrails/llm-coding-benchmark"target="_blank" rel="noopener">the repo is public</a>, the logs are auditable, the generated code is right there for inspection. That&rsquo;s the minimum bar for honest discussion about LLMs in coding.</p>
<p><strong>Second: the use case changes everything.</strong> If you&rsquo;re doing a legitimate one-shot (Opus plans the architecture once, a cheap model executes many parallel pieces with well-defined scope, like generating 50 UI components following the same pattern), then it can make sense. But most people complaining about token cost are doing continuous coding-agent work, not massive one-shots. For continuous coding the math is completely different.</p>
<p><strong>Third: pay-as-you-go vs subscription completely changes the math.</strong> If you pay per token directly on the API, every extra orchestrator prompt counts. If you pay $20 or $200 a month for Plus/Pro/Max and use Claude Code, Codex CLI or similar inside the subscription quota, the &ldquo;cost per run&rdquo; question disappears. It becomes &ldquo;I&rsquo;m consuming part of my monthly quota.&rdquo; Under subscription, the marginal cost of an extra call is zero until you saturate the cap. Coordinating two models to save tokens you&rsquo;re not being charged for is optimization against a non-existent cost.</p>
<p><strong>Fourth, and most important: you only realize the maturity of solo frontier models when you do full vibe coding.</strong> I&rsquo;ve written about this in several posts (<a href="/en/2026/02/16/vibe-code-do-zero-a-producao-em-6-dias-the-m-akita-chronicles/">six-day immersion from zero to production</a>, <a href="/en/2026/04/15/como-falar-com-o-claude-code-efetivamente/">how to talk to Claude Code effectively</a>, <a href="/en/2026/04/20/clean-code-para-agentes-de-ia/">Clean Code for AI agents</a>). Serious vibe coding is: <strong>turn off the IDE, don&rsquo;t edit code by hand, role-play as product manager + QA + tech lead mentor, let the model write</strong>. When you do that, it becomes immediately obvious that mixing two models creates absurd coordination overhead. It&rsquo;s like outsourcing development to an offshore team but hand-editing every delivery because you nitpick everything. A plan that needs to spell out every line of code becomes micromanagement: if the plan already contains the code, why are you delegating? That&rsquo;s the extreme, but it&rsquo;s where most multi-agent setups land. Balance is what matters, and removing overhead is the main technique. For that, <strong>a good frontier model inside a mature harness (Claude Code, Codex, opencode) is the path</strong>.</p>
<p><strong>Fifth, and the most technical: premature optimization.</strong> Donald Knuth warned us decades ago: &ldquo;premature optimization is the root of all evil.&rdquo; The full rule is less quoted: about 97% of the time, small efficiencies should be forgotten; optimization belongs only in the critical 3%, and only with data to justify it. The 2026 version of that is spending a weekend wiring up an orchestration of five parallel agents to save $30/month in tokens, before you&rsquo;ve validated whether the product you&rsquo;re building is even worth shipping. You&rsquo;re optimizing the cost of a pipeline that hasn&rsquo;t delivered anything yet. $20-200/month subscriptions are trivial compared to any other professional learning cost. People spend more on bad online courses or weekend partying. Once the product ships and token cost becomes a measurable problem, then yes, optimize. Today, focus on shipping.</p>
<p>OK, sermon over. Let&rsquo;s go to the numbers.</p>
<hr>
<h2>Methodology in one line<span class="hx:absolute hx:-mt-20" id="methodology-in-one-line"></span>
    <a href="#methodology-in-one-line" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Same benchmark as the <a href="/en/2026/04/24/llm-benchmarks-parte-3-deepseek-kimi-mimo/">canonical version</a>: build a Rails 8 app with RubyLLM, Tailwind, Hotwire, Minitest tests, Brakeman, Docker, docker-compose. Same 8-dimension rubric with 0-100 scoring, same A/B/C/D tiers. The difference is that here every variant has <strong>two models</strong>: a &ldquo;planner&rdquo; (strong) and an &ldquo;executor&rdquo; (cheaper), in different harnesses (Claude Code, opencode, Codex). Everything in the <a href="https://github.com/akitaonrails/llm-coding-benchmark"target="_blank" rel="noopener">repo</a> with docs, dispatches, traces, audits.</p>
<p>Solo Opus 4.7 in opencode is the comparison baseline: <strong>97/100 Tier A, 18 minutes, ~$4 pay-as-you-go.</strong> Every multi-agent variant has to beat this (in quality, time OR cost) to justify the added complexity.</p>
<hr>
<h2>What this benchmark does NOT prove (and what good delegation actually means)<span class="hx:absolute hx:-mt-20" id="what-this-benchmark-does-not-prove-and-what-good-delegation-actually-means"></span>
    <a href="#what-this-benchmark-does-not-prove-and-what-good-delegation-actually-means" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Before diving into the three rounds, it&rsquo;s worth circling what this benchmark covers and what it doesn&rsquo;t. This matters because I&rsquo;m going to draw strong conclusions about delegation, and those conclusions don&rsquo;t generalize to every kind of work.</p>
<h3>Limit 1: the project is a simple greenfield Rails build<span class="hx:absolute hx:-mt-20" id="limit-1-the-project-is-a-simple-greenfield-rails-build"></span>
    <a href="#limit-1-the-project-is-a-simple-greenfield-rails-build" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>The benchmark builds a Rails 8 chat app with RubyLLM. It&rsquo;s real code, with Tailwind, Hotwire, Docker and tests, but it&rsquo;s <strong>a greenfield project with well-defined scope on a popular stack</strong>. Practically every Tier A model nails it. In complexity terms, it&rsquo;s early-junior territory: there&rsquo;s no legacy code to understand, no technical debt to work around, no distributed system with 50 microservices to orchestrate.</p>
<p>For anyone who wants conclusive data about <strong>their own</strong> use case, the honest answer is: <strong>adapt the benchmark harness to mimic your project&rsquo;s conditions and compare models there</strong>. The <a href="https://github.com/akitaonrails/llm-coding-benchmark"target="_blank" rel="noopener">repo</a> is open source for exactly this reason. Take the prompt, swap in something that reflects your reality (50K-line legacy codebase, integration with 3 external systems, custom company DSL, whatever it is), and run the models the same way. The numbers I report here give direction, but any serious engineer should validate models against the work they actually do, not against an example chat app.</p>
<h3>Limit 2: parallelizable tasks vs dependency-laden tasks<span class="hx:absolute hx:-mt-20" id="limit-2-parallelizable-tasks-vs-dependency-laden-tasks"></span>
    <a href="#limit-2-parallelizable-tasks-vs-dependency-laden-tasks" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>The question &ldquo;is mixing two models worth it?&rdquo; depends critically on <strong>what kind of work you&rsquo;re delegating</strong>. There are two extremes.</p>
<p><strong>Work where frontier models delegate happily, and that will work beautifully:</strong> numerous, simple tasks with little or no coordination needed. &ldquo;Translate these 100 documents to English.&rdquo; &ldquo;Summarize these 50 spreadsheets in bullet points.&rdquo; &ldquo;Convert these 200 images to WebP at max 800px.&rdquo; Each item is independent. No revision is needed between them. The result of one doesn&rsquo;t change the input to the next. It&rsquo;s exactly the kind of work Amazon&rsquo;s Mechanical Turk was designed to distribute. Frontier models already do this on their own without explicit orchestration: ask Opus 4.7 to translate 100 documents and it&rsquo;ll fan out to parallel batches via the <code>Task</code> tool, scaling up or down as needed.</p>
<p><strong>Work where frontier models won&rsquo;t delegate, because it can&rsquo;t be delegated well:</strong> tasks with many dependencies, requiring constant revision and mutual adjustment. Building a Rails app is exactly that. The <code>Chat</code> model depends on the <code>LlmClient</code> decision, which depends on the RubyLLM config, which depends on the <code>OPENROUTER_API_KEY</code> in env, which depends on <code>ENV</code> being loaded before the initializer, which depends on the Gemfile having the right gem. Touching one piece forces revising several others. This isn&rsquo;t Mechanical Turk work. It&rsquo;s cohesive engineering work, and Tier A models recognize that and keep everything inside one reasoning session. <strong>That&rsquo;s why zero of the 7 runs in Round 1 delegated: the task wasn&rsquo;t delegable.</strong></p>
<h3>The async programming analogy<span class="hx:absolute hx:-mt-20" id="the-async-programming-analogy"></span>
    <a href="#the-async-programming-analogy" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Anyone who learned async/await in a modern language went through the same disillusionment. You discover <code>Promise.all</code>, <code>asyncio.gather</code>, or similar, and start eyeing every slow piece of work looking for ways to parallelize. Then you write something like this and get angry:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-javascript" data-lang="javascript"><span class="line"><span class="cl"><span class="c1">// Sequential chain: each step depends on the previous result
</span></span></span><span class="line"><span class="cl"><span class="kr">const</span> <span class="nx">userData</span> <span class="o">=</span> <span class="kr">await</span> <span class="nx">fetchUser</span><span class="p">(</span><span class="nx">userId</span><span class="p">);</span>
</span></span><span class="line"><span class="cl"><span class="kr">const</span> <span class="nx">orders</span> <span class="o">=</span> <span class="kr">await</span> <span class="nx">fetchOrdersFor</span><span class="p">(</span><span class="nx">userData</span><span class="p">);</span>          <span class="c1">// ← needs userData
</span></span></span><span class="line"><span class="cl"><span class="kr">const</span> <span class="nx">recommendations</span> <span class="o">=</span> <span class="kr">await</span> <span class="nx">analyzeOrders</span><span class="p">(</span><span class="nx">orders</span><span class="p">);</span>    <span class="c1">// ← needs orders
</span></span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1">// Total latency = latency(fetchUser) + latency(fetchOrders) + latency(analyze)
</span></span></span></code></pre></div></div>
</div>
<p>Three <code>await</code>s. Three I/O operations. You expected 3× faster from going async. But it&rsquo;s the same as synchronous because <strong>each call depends on the previous one&rsquo;s result</strong>. Trying to wrap this in <code>Promise.all</code> doesn&rsquo;t even compile: you can&rsquo;t pass <code>userData</code> to <code>fetchOrdersFor</code> before you have <code>userData</code>.</p>
<p>Promise.all only helps when the calls are <strong>independent of one another</strong>:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-javascript" data-lang="javascript"><span class="line"><span class="cl"><span class="c1">// Independent calls: none uses another&#39;s result
</span></span></span><span class="line"><span class="cl"><span class="kr">const</span> <span class="p">[</span><span class="nx">user</span><span class="p">,</span> <span class="nx">allProducts</span><span class="p">,</span> <span class="nx">categoryTree</span><span class="p">]</span> <span class="o">=</span> <span class="kr">await</span> <span class="nb">Promise</span><span class="p">.</span><span class="nx">all</span><span class="p">([</span>
</span></span><span class="line"><span class="cl">  <span class="nx">fetchUser</span><span class="p">(</span><span class="nx">userId</span><span class="p">),</span>       <span class="c1">// ← doesn&#39;t need products or categories
</span></span></span><span class="line"><span class="cl">  <span class="nx">fetchAllProducts</span><span class="p">(),</span>      <span class="c1">// ← doesn&#39;t need user or categories
</span></span></span><span class="line"><span class="cl">  <span class="nx">fetchCategoryTree</span><span class="p">()</span>      <span class="c1">// ← doesn&#39;t need user or products
</span></span></span><span class="line"><span class="cl"><span class="p">]);</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1">// Total latency = max(latency of the three), not the sum
</span></span></span></code></pre></div></div>
</div>
<p><strong>The difference is not syntax; it&rsquo;s dependency structure.</strong> In the first case, A → B → C: three operations in series, each waiting on the previous. In the second case, A | B | C: three independent operations that fire at the same time and the total latency is that of the slowest one. Async/Promise.all doesn&rsquo;t create parallelism, it only <strong>exposes</strong> the parallelism that was already there in the structure of the problem.</p>
<p>Multi-agent in coding is the same. If you give two models a cohesive task with internal dependencies (build a Rails chat app), the &ldquo;planner&rdquo; has to read every output from the &ldquo;executor&rdquo; before deciding the next dispatch, and the executor needs the context the planner built. You sequentialized the two. <strong>Worse</strong>: you added coordination overhead (prompt format, response parsing, watchdog, retry logic) on top of a pipeline that was supposed to have low overhead. It&rsquo;s like taking the three-sequential-<code>await</code> code and dropping a Redis queue in the middle: now you have 3× the latency <strong>plus</strong> the queue cost.</p>
<p>Parallelizable task = &ldquo;apply this same refactor to 50 different files.&rdquo; There orchestration makes sense: one model plans the refactor once, describes it in reusable form, and dispatches 50 sub-agents in parallel to execute on each file. Real time savings. <strong>Sequential task with dependencies</strong> = building a cohesive app. There&rsquo;s no parallelization possible, and adding agents only adds overhead.</p>
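<p>And if your work really is the parallelizable kind, the orchestration can be as dumb as a shell loop. A sketch assuming a non-interactive agent CLI (command and flags illustrative):</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">
<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"># One plan, N independent executions; total time ≈ the slowest file.
mkdir -p logs
for f in app/models/*.rb; do
  opencode run --model cheap/executor \
    "Apply the refactor described in PLAN.md to $f" &gt; "logs/$(basename "$f").out" &amp;
done
wait
</code></pre></div></div>
</div>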
<p>This benchmark tests the <strong>second scenario</strong>. If your daily work is the first (large batches of independent tasks), the conclusions here may invert. Adapt the harness, measure, decide.</p>
<hr>
<h2>Segment 1: the initial round, models didn&rsquo;t delegate<span class="hx:absolute hx:-mt-20" id="segment-1-the-initial-round-models-didnt-delegate"></span>
    <a href="#segment-1-the-initial-round-models-didnt-delegate" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>The first round, in mid-April, configured 7 variants where the planner had a registered subagent available and aggressive prompt language pushing delegation (&ldquo;Use PROACTIVELY&rdquo;, &ldquo;ALWAYS delegate to this agent for code implementation&rdquo;). The question: given that the subagent exists and is painted as the preferred path, will the main model delegate?</p>
<table>
  <thead>
      <tr>
          <th>Variant</th>
          <th>Harness</th>
          <th style="text-align: right">Time</th>
          <th style="text-align: right">Cost</th>
          <th style="text-align: right">Delegations</th>
          <th style="text-align: right">Tier</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>claude_opus_alone</code></td>
          <td>Claude Code</td>
          <td style="text-align: right">11m</td>
          <td style="text-align: right">$6.74</td>
          <td style="text-align: right">0</td>
          <td style="text-align: right">3</td>
      </tr>
      <tr>
          <td><code>claude_opus_sonnet</code></td>
          <td>Claude Code</td>
          <td style="text-align: right">10m</td>
          <td style="text-align: right">$5.13</td>
          <td style="text-align: right">0</td>
          <td style="text-align: right">2</td>
      </tr>
      <tr>
          <td><code>claude_opus_haiku</code></td>
          <td>Claude Code</td>
          <td style="text-align: right">15m</td>
          <td style="text-align: right">$7.83</td>
          <td style="text-align: right">0</td>
          <td style="text-align: right">3</td>
      </tr>
      <tr>
          <td><code>opencode_opus_glm</code></td>
          <td>opencode</td>
          <td style="text-align: right">19m</td>
          <td style="text-align: right">~$1.10</td>
          <td style="text-align: right">0</td>
          <td style="text-align: right">1</td>
      </tr>
      <tr>
          <td><code>opencode_opus_qwen</code></td>
          <td>opencode</td>
          <td style="text-align: right">30m</td>
          <td style="text-align: right">~$1.10</td>
          <td style="text-align: right">0</td>
          <td style="text-align: right">1</td>
      </tr>
      <tr>
          <td><code>gpt_5_4_multi_balanced</code></td>
          <td>Codex</td>
          <td style="text-align: right">21m</td>
          <td style="text-align: right">~$11</td>
          <td style="text-align: right">0</td>
          <td style="text-align: right">1</td>
      </tr>
      <tr>
          <td><code>gpt_5_4_multi_faster</code></td>
          <td>Codex</td>
          <td style="text-align: right">20m</td>
          <td style="text-align: right">~$10</td>
          <td style="text-align: right">0</td>
          <td style="text-align: right">2</td>
      </tr>
      <tr>
          <td><strong>baseline</strong> <code>claude_opus_4_7</code></td>
          <td>opencode</td>
          <td style="text-align: right">18m</td>
          <td style="text-align: right">~$1.10</td>
          <td style="text-align: right">n/a</td>
          <td style="text-align: right"><strong>1</strong></td>
      </tr>
  </tbody>
</table>
<p><strong>Zero delegations across all 7 variants.</strong> Claude Code&rsquo;s <code>Task</code> tool, opencode&rsquo;s task dispatcher, and Codex&rsquo;s <code>spawn_agent</code> were all completely ignored. In every case, the main model did 100% of the work.</p>
<p>Two things worth logging from this round.</p>
<p><strong>First:</strong> aggressive prompt language doesn&rsquo;t persuade a model to delegate. &ldquo;Use PROACTIVELY&rdquo; is a weak word against the model&rsquo;s internal &ldquo;this task is cohesive, I&rsquo;ll do it myself&rdquo; instinct. In greenfield Rails, plan and implementation don&rsquo;t have a clean line. There&rsquo;s no atomically-delegable subtask.</p>
<p><strong>Second, and most surprising:</strong> the same Opus 4.7 wrote <strong>worse code in Claude Code than in opencode</strong>, and cost <strong>4 to 7× more</strong> ($6.74 vs $1.10). Two of three Claude Code variants hallucinated a RubyLLM method that doesn&rsquo;t exist (Tier 3 hallucination); opencode with the same model got the API right and landed Tier 1. The harness itself influences the output. Claude Code resends much more context per turn, and that seems to nudge the model toward generic patterns while also inflating the bill.</p>
<p>Round 1 conclusion: <strong>if the model is already delivering Tier A solo in the right harness (opencode), adding a subagent isn&rsquo;t even used.</strong> The extra configuration is pure overhead. And if you force it inside Claude Code, you pay 4-7× more. The &ldquo;savings&rdquo; of multi-agency here is negative.</p>
<hr>
<h2>Segment 2: the forced round, when the planner is forbidden from coding<span class="hx:absolute hx:-mt-20" id="segment-2-the-forced-round-when-the-planner-is-forbidden-from-coding"></span>
    <a href="#segment-2-the-forced-round-when-the-planner-is-forbidden-from-coding" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>The community pushed back on Round 1: &ldquo;the subagent description was weak&rdquo;, &ldquo;models don&rsquo;t trust unfamiliar subagents.&rdquo; Fair pushback. Round 2: explicit planner-executor prompt. <strong>The planner is literally forbidden</strong> from using <code>Write</code>/<code>Edit</code>/<code>Bash</code>. Every code change has to flow through the subagent via <code>Task</code> or <code>spawn_agent</code>. No fallback.</p>
<p>The full prompt is at <a href="https://github.com/akitaonrails/llm-coding-benchmark/blob/master/prompts/benchmark_prompt_forced_delegation.txt"target="_blank" rel="noopener"><code>prompts/benchmark_prompt_forced_delegation.txt</code></a>. Mandatory workflow: <code>plan → delegate → converge → validate</code>. Seven variants:</p>
<table>
  <thead>
      <tr>
          <th>Slug</th>
          <th>Planner</th>
          <th>Subagent</th>
          <th>Harness</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>claude_opus_sonnet_forced</code></td>
          <td>Opus 4.7</td>
          <td>Sonnet 4.6</td>
          <td>Claude Code</td>
      </tr>
      <tr>
          <td><code>claude_opus_haiku_forced</code></td>
          <td>Opus 4.7</td>
          <td>Haiku 4.5</td>
          <td>Claude Code</td>
      </tr>
      <tr>
          <td><code>opencode_opus_kimi_forced</code></td>
          <td>Opus 4.7</td>
          <td>Kimi K2.6</td>
          <td>opencode</td>
      </tr>
      <tr>
          <td><code>opencode_opus_glm_forced</code></td>
          <td>Opus 4.7</td>
          <td>GLM 5.1 (Z.ai)</td>
          <td>opencode</td>
      </tr>
      <tr>
          <td><code>opencode_opus_qwen_forced</code></td>
          <td>Opus 4.7</td>
          <td>Qwen 3.6 Plus</td>
          <td>opencode</td>
      </tr>
      <tr>
          <td><code>gpt_5_4_multi_balanced_forced</code></td>
          <td>GPT 5.4 xHigh</td>
          <td>GPT 5.4 medium</td>
          <td>Codex</td>
      </tr>
      <tr>
          <td><code>gpt_5_4_multi_faster_forced</code></td>
          <td>GPT 5.4 xHigh</td>
          <td>GPT 5.4 low</td>
          <td>Codex</td>
      </tr>
  </tbody>
</table>
<p>Results:</p>
<table>
  <thead>
      <tr>
          <th>Variant</th>
          <th style="text-align: right">Score</th>
          <th style="text-align: right">Time</th>
          <th style="text-align: right">Cost</th>
          <th style="text-align: center">Tier</th>
          <th style="text-align: right">vs solo Opus (97)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>claude_opus_sonnet_forced</code></td>
          <td style="text-align: right">92</td>
          <td style="text-align: right">25m</td>
          <td style="text-align: right">$5.77</td>
          <td style="text-align: center">A</td>
          <td style="text-align: right">-5</td>
      </tr>
      <tr>
          <td><code>claude_opus_haiku_forced</code></td>
          <td style="text-align: right">90</td>
          <td style="text-align: right">19m</td>
          <td style="text-align: right">$3.49</td>
          <td style="text-align: center">A</td>
          <td style="text-align: right">-7</td>
      </tr>
      <tr>
          <td><code>opencode_opus_kimi_forced</code></td>
          <td style="text-align: right">95</td>
          <td style="text-align: right">25m</td>
          <td style="text-align: right">~$2-3</td>
          <td style="text-align: center">A</td>
          <td style="text-align: right">-2</td>
      </tr>
      <tr>
          <td><code>opencode_opus_glm_forced</code></td>
          <td style="text-align: right">93</td>
          <td style="text-align: right">13m</td>
          <td style="text-align: right">~$0.50</td>
          <td style="text-align: center">A</td>
          <td style="text-align: right">-4</td>
      </tr>
      <tr>
          <td><code>opencode_opus_qwen_forced</code></td>
          <td style="text-align: right">92</td>
          <td style="text-align: right">(variable)</td>
          <td style="text-align: right">~$0.50</td>
          <td style="text-align: center">A</td>
          <td style="text-align: right">-5</td>
      </tr>
      <tr>
          <td><code>gpt_5_4_multi_balanced_forced</code></td>
          <td style="text-align: right">94</td>
          <td style="text-align: right">30m</td>
          <td style="text-align: right">~$1-3</td>
          <td style="text-align: center">A</td>
          <td style="text-align: right">-3</td>
      </tr>
      <tr>
          <td><code>gpt_5_4_multi_faster_forced</code></td>
          <td style="text-align: right">94</td>
          <td style="text-align: right">53m</td>
          <td style="text-align: right">~$3-6</td>
          <td style="text-align: center">A</td>
          <td style="text-align: right">-3</td>
      </tr>
  </tbody>
</table>
<p>When forced, <strong>delegation actually happens</strong> (5 to 15 dispatches per run). All Tier A on code quality (90-95). But the important point is: <strong>none beats solo Opus 4.7 on quality.</strong> Nearly all cost extra time (from +1 minute up to +35 minutes for <code>multi_faster</code>; only GLM came in under the baseline). And on raw cost (pay-as-you-go), only two come in below solo: GLM (Z.ai subscription, basically free) and the GPT 5.4 multi (much cheaper than GPT 5.4 solo, but still in the same ballpark as Opus solo).</p>
<p>This round had two embarrassing harness findings. One was a watchdog firing too early, killing the benchmark before the cross-provider subagent finished spinning up (Z.ai and llama-swap took longer than expected to come online). I bumped the timeout from 6 to 15 minutes and GLM/Qwen came back at Tier A. The other: several models appeared to be &ldquo;executing silently&rdquo; (empty result envelopes). Only in Round 3 did I figure out that most of those were hidden errors — the <code>qwen3.6-plus:free</code> endpoint had been deprecated and was returning 404, and DeepSeek was 400-erroring with a protocol bug (details in its dedicated section below). The &ldquo;1900 files&rdquo; of <code>opencode_opus_deepseek_forced</code> were <strong>entirely</strong> written by Opus&rsquo;s <code>general</code> fallback, not DeepSeek.</p>
<p>OK, so accounting for the harness failures, the story becomes:</p>
<ul>
<li><strong>Sonnet/Haiku coders in Claude Code:</strong> 90-92 vs solo Opus 97. Takes +1 to +7 minutes. Costs about the same or more.</li>
<li><strong>Kimi K2.6 in opencode:</strong> 95 vs 97. Costs $2-3 vs $4, so 25-50% savings. +7 minutes.</li>
<li><strong>GLM 5.1 in opencode:</strong> 93 vs 97. Costs ~$0.50 with Z.ai subscription (basically free) vs $4 for solo Opus. -5 minutes.</li>
<li><strong>GPT 5.4 + medium:</strong> 94 vs 97. Costs ~$1-3 vs $16 for GPT 5.4 solo (80-85% cheaper). +12 minutes.</li>
</ul>
<p>The last is the only configuration where forced delegation <strong>actually pays off in absolute cost</strong>: if you have to be on GPT 5.4, forced multi-agent is real savings. In every other case, you pay in quality and/or time to use a cheaper subagent. On Anthropic Pro monthly, solo Opus is $0 marginal. There&rsquo;s zero reason to use Sonnet/Haiku as a sub. Forced multi-agent there is theater.</p>
<p>And one last thing from this round: the &ldquo;exception lesson&rdquo;, when forced multi-agent <strong>repairs</strong> Claude Code. Solo Opus in Claude Code came out Tier 3 (hallucinated <code>chat.complete</code>) at $6.74. Forcing Sonnet/Haiku as executor inside Claude Code repaired it to Tier A at $3.49-5.77. But the cleaner fix is to <strong>switch harness</strong>, not orchestrate. Solo Opus in opencode delivers Tier A at $4 with no fuss. The &ldquo;orchestration repairs Claude Code&rdquo; line is a workaround for a bug elsewhere.</p>
<hr>
<h2>Segment 3: manual cross-process orchestration, Opus driving opencode<span class="hx:absolute hx:-mt-20" id="segment-3-manual-cross-process-orchestration-opus-driving-opencode"></span>
    <a href="#segment-3-manual-cross-process-orchestration-opus-driving-opencode" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>The third round was the most elaborate. Premise: take the subagent out of the planner&rsquo;s process (where the <code>task</code> envelope bug lived) and use opencode in single-agent mode invoked via subprocess for each subtask. The setup was a Claude Code session with Opus 4.7 in the orchestrator role. For each subtask, Opus wrote the prompt to a file, invoked opencode via Bash with the cheap model as sole primary, read the output, and decided the next dispatch.</p>
<p>No fallback. No <code>general</code> agent available for Opus to escape to. <strong>Either the named executor writes, or nothing gets written.</strong></p>
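<p>One dispatch in that loop looks roughly like this (a sketch; the <code>opencode run</code> flags and the model slug are assumptions, not the benchmark&rsquo;s literal commands):</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">
<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"># Planner (Opus in Claude Code) writes the subtask prompt to a file...
cat &gt; /tmp/dispatch_03.md &lt;&lt;'EOF'
Implement step 3 of PLAN.md only: the ChatsController and its routes.
EOF
# ...then shells out to opencode with the cheap model as sole primary:
opencode run --model openrouter/kimi-k2.6 \
  "$(cat /tmp/dispatch_03.md)" &gt; /tmp/dispatch_03.out
# The planner reads dispatch_03.out, verifies the filesystem, and plans
# the next dispatch. No fallback agent exists on this path.
</code></pre></div></div>
</div>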
<p>Three variants attempted:</p>
<h3>Opus + Qwen 3.6 Plus (Variant 1): 94/100 Tier A<span class="hx:absolute hx:-mt-20" id="opus--qwen-36-plus-variant-1-94100-tier-a"></span>
    <a href="#opus--qwen-36-plus-variant-1-94100-tier-a" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Setup: 8 dispatches + 1 fix-up = 9 dispatches total. ~$0.74 executor cost. ~12 minutes cumulative wall time <strong>for the executor only</strong>. Plus the planner overhead (Opus in Claude Code orchestrating) which isn&rsquo;t directly measured but is estimated at $11 (more on this below).</p>
<p>Qwen 3.6 Plus behavior as executor: it truncated the final summary in 3 of 9 dispatches, sometimes emitted zero text while making many tool calls (dispatch 8: 14 tool calls, 0 text turns), and made smart adaptations on its own (manually created <code>app/javascript/</code> after Rails 8.1 didn&rsquo;t generate it, swapped <code>root_url</code> for <code>get &quot;/&quot;</code>, created a tailwind.css placeholder). But it needed 2 fix-up dispatches for issues caused by Rails 8.1 generator behavior changes.</p>
<p>Auditor&rsquo;s read: <em>&ldquo;Qwen wrote the lines, Opus decided the boundaries, and the boundaries are most of what lifts this from B to A.&rdquo;</em></p>
<p>Result: <strong>94/100 Tier A. +23 points over Qwen 3.6 Plus solo (71/100, Tier B).</strong> But the entire lift comes from Opus&rsquo;s detailed plan, not from autonomous Qwen capability.</p>
<h3>Opus + Kimi K2.6 (Variant 2): 97/100 Tier A, TIES SOLO OPUS<span class="hx:absolute hx:-mt-20" id="opus--kimi-k26-variant-2-97100-tier-a-ties-solo-opus"></span>
    <a href="#opus--kimi-k26-variant-2-97100-tier-a-ties-solo-opus" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Setup: 5 dispatches, <strong>zero fix-ups</strong>. ~$0.37 executor cost. ~10 minutes cumulative executor wall time. Full end-to-end validation (local boot, docker compose up + curl + clean teardown).</p>
<p>Kimi K2.6 behavior: coherent text response per dispatch with no truncation, strong autonomous adaptation (caught Stimulus + Tailwind install gaps without explicit prompting, caught the layout-wrapping side effect of <code>tailwindcss:install</code>), zero fix-up dispatches.</p>
<p>Result: <strong>97/100 Tier A. TIES Opus 4.7 solo.</strong> Auditor&rsquo;s read: <em>&ldquo;Kimi wrote every line, but Opus&rsquo;s planning prompts shaped what to ask for (better test fixtures, error-path coverage, persistence design), pushing Kimi from 87 → ~97.&rdquo;</em></p>
<h3>Opus + DeepSeek V4 Pro (Variant 3): FAILED<span class="hx:absolute hx:-mt-20" id="opus--deepseek-v4-pro-variant-3-failed"></span>
    <a href="#opus--deepseek-v4-pro-variant-3-failed" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Structural harness incompatibility bug. DeepSeek V4 Pro returns <code>reasoning_content</code> on every response, and opencode strips that field when constructing the next request. The DeepSeek API then rejects turn 2 with <code>&quot;reasoning_content must be passed back to the API&quot;</code>. Tested three <code>reasoning</code> configurations (<code>true</code>, <code>false</code>, absent). All failed identically at turn 2 of dispatch 1. There&rsquo;s no opencode flag that fixes it. Workarounds (custom OpenRouter provider params, single-bash-per-dispatch protocol, switching to V4 Flash) were not pursued.</p>
<p>Retroactive implication: the supposed &ldquo;completions&rdquo; of DeepSeek in rounds 2 and 2.5 were <strong>entirely</strong> written by Opus through the <code>general</code> fallback. DeepSeek V4 Pro contributed zero lines of code in any configuration of this benchmark to date.</p>
<h3>The hidden planner cost<span class="hx:absolute hx:-mt-20" id="the-hidden-planner-cost"></span>
    <a href="#the-hidden-planner-cost" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Here&rsquo;s where things get ugly. The two manual variants that actually ran (Qwen and Kimi) had <strong>~14 successful dispatches combined</strong>. Each dispatch consumed ~3-5 turns of the Opus orchestrator (read previous dispatch output, plan next, write prompt file, monitor execution, verify filesystem). Each turn costs <del>$0.15-0.25 in Anthropic tokens. Total: **</del>$11 of hidden planner cost**, not logged in the executor JSON.</p>
<p>Real total cost of the two manual variants: ~$1.11 executor + ~$11 planner = <strong>~$12 combined</strong>. Compared to solo Opus opencode at $4. <strong>Manual orchestration costs 3× more than solo.</strong></p>
<h3>What this round actually proves<span class="hx:absolute hx:-mt-20" id="what-this-round-actually-proves"></span>
    <a href="#what-this-round-actually-proves" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><table>
  <thead>
      <tr>
          <th>Executor</th>
          <th style="text-align: right">Solo score</th>
          <th style="text-align: right">Lift under Opus orchestration</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>GLM 5.1</td>
          <td style="text-align: right">46 (Tier C)</td>
          <td style="text-align: right">+47 → 93 (Tier A) [via in-process forced 2.5]</td>
      </tr>
      <tr>
          <td>Qwen 3.6 Plus</td>
          <td style="text-align: right">71 (Tier B)</td>
          <td style="text-align: right">+23 → 94 (Tier A) [via manual orchestration]</td>
      </tr>
      <tr>
          <td>Kimi K2.6</td>
          <td style="text-align: right">87 (Tier A)</td>
          <td style="text-align: right">+10 → 97 (Tier A, ties Opus) [via manual orchestration]</td>
      </tr>
  </tbody>
</table>
<p><strong>The lift scales inversely with the executor&rsquo;s solo capability.</strong> The further from Tier A solo, the more orchestration adds. Kimi solo is already Tier A, so the +10 is polish (better test fixtures, error coverage, persistence). GLM solo is Tier C because of one specific hallucination (invented fluent DSL), and prescriptive prompts that name the real API remove the hallucination entirely.</p>
<p><strong>But the cost of that lift is dominated by the planner.</strong> If you only have access to GLM (not Opus), this finding doesn&rsquo;t help you. If you have Opus AND GLM, just use Opus solo for less money and time.</p>
<p>The realistic use case for this pattern is <strong>fan-out work, where the planner runs once and its plan is amortized across many similar subtasks</strong> (like &ldquo;apply this same refactor to 50 different files&rdquo;). A greenfield Rails benchmark doesn&rsquo;t capture that.</p>
<hr>
<h2>Final comparison: the 3 rounds vs solo<span class="hx:absolute hx:-mt-20" id="final-comparison-the-3-rounds-vs-solo"></span>
    <a href="#final-comparison-the-3-rounds-vs-solo" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>All costs below are <strong>total (Opus planner + executor)</strong> in pay-as-you-go. End-to-end wall times:</p>
<table>
  <thead>
      <tr>
          <th>Variant</th>
          <th style="text-align: right">Score</th>
          <th style="text-align: right">Wall time</th>
          <th style="text-align: right">Total cost</th>
          <th style="text-align: right">Δ quality vs solo (97)</th>
          <th style="text-align: right">Δ cost vs solo ($4)</th>
          <th style="text-align: right">Δ time vs solo (18m)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Opus 4.7 solo (opencode)</strong></td>
          <td style="text-align: right"><strong>97</strong></td>
          <td style="text-align: right"><strong>18m</strong></td>
          <td style="text-align: right"><strong>$4.04</strong></td>
          <td style="text-align: right">baseline</td>
          <td style="text-align: right">baseline</td>
          <td style="text-align: right">baseline</td>
      </tr>
      <tr>
          <td>Opus + Kimi (manual)</td>
          <td style="text-align: right">97</td>
          <td style="text-align: right">30-40m</td>
          <td style="text-align: right">~$5-7 (planner ~$5 + executor $0.37)</td>
          <td style="text-align: right">=0</td>
          <td style="text-align: right">+$1 to +$3</td>
          <td style="text-align: right">+12-22m</td>
      </tr>
      <tr>
          <td>Opus + Sonnet 4.6 (CC, forced)</td>
          <td style="text-align: right">92</td>
          <td style="text-align: right">25m</td>
          <td style="text-align: right">$5.77 (Claude Code log)</td>
          <td style="text-align: right">-5</td>
          <td style="text-align: right">+$1.73</td>
          <td style="text-align: right">+7m</td>
      </tr>
      <tr>
          <td>Opus + Haiku 4.5 (CC, forced)</td>
          <td style="text-align: right">90</td>
          <td style="text-align: right">19m</td>
          <td style="text-align: right">$3.49 (Claude Code log)</td>
          <td style="text-align: right">-7</td>
          <td style="text-align: right">-$0.55</td>
          <td style="text-align: right">+1m</td>
      </tr>
      <tr>
          <td>Opus + Kimi (in-process forced)</td>
          <td style="text-align: right">95</td>
          <td style="text-align: right">25m</td>
          <td style="text-align: right">~$3-4 (planner ~$2-3 + executor ~$0.50)</td>
          <td style="text-align: right">-2</td>
          <td style="text-align: right">-$1 to 0</td>
          <td style="text-align: right">+7m</td>
      </tr>
      <tr>
          <td>Opus + GLM 5.1 (forced, watchdog fix)</td>
          <td style="text-align: right">93</td>
          <td style="text-align: right">13m</td>
          <td style="text-align: right">~$0.50 + Z.ai sub</td>
          <td style="text-align: right">-4</td>
          <td style="text-align: right">-$3.50 + sub</td>
          <td style="text-align: right">-5m</td>
      </tr>
      <tr>
          <td>Opus + Qwen 3.6 (manual)</td>
          <td style="text-align: right">94</td>
          <td style="text-align: right">~40m</td>
          <td style="text-align: right">~$6-7 (planner ~$5 + executor $0.74)</td>
          <td style="text-align: right">-3</td>
          <td style="text-align: right">+$2 to +$3</td>
          <td style="text-align: right">+22m</td>
      </tr>
      <tr>
          <td>GPT 5.4 xHigh + medium (Codex forced)</td>
          <td style="text-align: right">94</td>
          <td style="text-align: right">30m</td>
          <td style="text-align: right">~$1-3 (Codex log, both GPT)</td>
          <td style="text-align: right">-3</td>
          <td style="text-align: right">-$3 to -$1</td>
          <td style="text-align: right">+12m</td>
      </tr>
      <tr>
          <td>GPT 5.4 xHigh + low (Codex forced)</td>
          <td style="text-align: right">94</td>
          <td style="text-align: right">53m</td>
          <td style="text-align: right">~$3-6 (Codex log, both GPT)</td>
          <td style="text-align: right">-3</td>
          <td style="text-align: right">-$1 to +$2</td>
          <td style="text-align: right">+35m</td>
      </tr>
  </tbody>
</table>
<p>(Tier D / 0-file failures omitted.)</p>
<p>The orchestrator-Opus planner cost on the &ldquo;Opus + Kimi (manual)&rdquo; and &ldquo;Opus + Qwen (manual)&rdquo; rows is a <strong>hidden</strong> cost: it shows up in the Claude Code session driving the experiment, not in the executor log. The ~14 successful dispatches from the two manual variants combined consumed ~$11 of Opus, split between them, landing at ~$5-6 each.</p>
<p><strong>Solo Opus 4.7 in opencode is the strongest overall:</strong> it ties or beats every other variant on quality, ties or beats each of them on at least cost or time, and ties or beats most on all three dimensions simultaneously.</p>
<p>The real exception is <strong>Codex GPT 5.4 xHigh + medium executor</strong>: 94/100 vs 97/100 (loses 3 points), but costs $1-3 instead of $16 for GPT 5.4 solo (roughly 80-95% cheaper). Useful if you only have OpenAI credentials and need cheap Tier A.</p>
<p>Every other variant loses on at least one important dimension. Most lose on two (quality AND time, or quality AND cost).</p>
<hr>
<h2>The subscription question<span class="hx:absolute hx:-mt-20" id="the-subscription-question"></span>
    <a href="#the-subscription-question" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>The numbers above are <strong>pay-as-you-go</strong>. Let&rsquo;s add subscription to the picture.</p>
<p>Anthropic Pro at $200/month includes Opus 4.7 with a quota that, in practice, covers many full benchmark runs per day without touching the cap. OpenAI Plus at $20/month or Pro at $200/month includes Codex CLI with similar economics (Pro gets <a href="https://help.openai.com/en/articles/20001106-codex-rate-card"target="_blank" rel="noopener">20× the Plus quota</a>). Under subscription, <strong>a benchmark run adds zero marginal cost</strong> until you saturate the quota.</p>
<p>What does that change?</p>
<ul>
<li>Solo Opus 4.7: $4 pay-as-you-go becomes $0 marginal on Anthropic Pro. <strong>Still the best.</strong></li>
<li>Multi-agent with Sonnet/Haiku via Claude Code: $3.49-5.77 becomes $0 marginal on Anthropic Pro. Still loses 5-7 quality points. <strong>Not worth it.</strong></li>
<li>Multi-agent with Kimi/GLM/Qwen via opencode (billed separately through OpenRouter): adds $0.50-3 in OpenRouter cost on top of the subscription. Loses 2-7 quality points. Equal or longer time. <strong>Not worth it.</strong></li>
<li>Codex GPT 5.4 + medium: $1-3 becomes $0 marginal on ChatGPT Pro. But solo GPT 5.4 also becomes $0 marginal. <strong>Multi isn&rsquo;t worth it.</strong></li>
<li>Manual cross-process (Opus driving opencode with Kimi): $0 marginal (planner on Anthropic Pro) + $0.37 (executor on paid OpenRouter) = $0.37. Costs 30-40 minutes vs 18 for solo. <strong>Ties on quality. Not worth it.</strong></li>
</ul>
<p>Under subscription, <strong>no multi-agent configuration beats solo frontier.</strong> The marginal cost of an extra call is zero for someone already paying the monthly subscription, so the extra coordination buys you nothing when the &ldquo;savings&rdquo; are zero.</p>
<hr>
<h2>The least-bad case: Opus + Kimi K2.6 under heavy use<span class="hx:absolute hx:-mt-20" id="the-least-bad-case-opus--kimi-k26-under-heavy-use"></span>
    <a href="#the-least-bad-case-opus--kimi-k26-under-heavy-use" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Since I&rsquo;m being skeptical on every conclusion, it&rsquo;s worth noting separately the one non-obvious configuration where Opus + Kimi K2.6 might actually make sense in practice: <strong>continuous heavy use that already saturated the Anthropic Pro monthly quota.</strong></p>
<p>In forced in-process mode (Round 2.5 with adjusted watchdog), Kimi K2.6 delivers 95/100 vs Opus solo&rsquo;s 97. The 2-point difference is usually polish in test fixtures and error handling, things you can patch later. In exchange, you pay ~$2-3 of OpenRouter on Kimi instead of burning Pro quota on Opus. For power users hitting the $200/month Pro cap regularly (multiple times a week, heavy parallel projects), redirecting some work to Kimi can free up Pro quota for tasks that really need Opus.</p>
<p>It&rsquo;s a narrow exception. For most people who haven&rsquo;t saturated Pro, just using solo Opus is still better. And for those without Pro yet, orchestrating Kimi + Opus saves no more than simply subscribing to Pro would. But for the &ldquo;I&rsquo;m using a coding agent professionally and heavily and I&rsquo;m blowing the monthly cap&rdquo; scenario, the Opus planner + Kimi executor configuration is the best option I measured in this benchmark.</p>
<p>Worth mentioning also is the <strong>multi-tenant case where the planner is amortized</strong>: if you&rsquo;re running a pipeline that applies the same change pattern to many similar projects (refactor 50 repos to use a new convention, generate 30 microservices with the same skeleton), Opus plans once, you save the plan, and Kimi executes against each project. The planner cost dilutes across volume and the pairing makes real economic sense.</p>
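<p>To make the amortization concrete, here&rsquo;s a hypothetical sketch in Ruby. The plan file, repo layout, and prompt wording are all illustrative, and I&rsquo;m assuming the prompt can be passed as <code>opencode run</code>&rsquo;s positional argument; only the harness flags come from this benchmark:</p>
<pre><code class="language-ruby"># Hypothetical amortized-planner loop: the Opus plan is produced once,
# saved to disk, and only the cheap executor runs per repository.
plan = File.read("refactor_plan.md") # written once by the Opus planner

Dir.glob("workspaces/*").each do |repo|
  prompt = "Apply this plan to the current project:\n#{plan}"
  # Executor (e.g. Kimi K2.6) runs per repo; the planner cost dilutes
  # across every run in the batch.
  system("opencode", "run", "--agent", "build", "--format", "json",
         prompt, chdir: repo)
end
</code></pre>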
<hr>
<h2>Why DeepSeek was the hardest model to test<span class="hx:absolute hx:-mt-20" id="why-deepseek-was-the-hardest-model-to-test"></span>
    <a href="#why-deepseek-was-the-hardest-model-to-test" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Worth logging separately, because it&rsquo;s the most frustrating case in the whole benchmark. DeepSeek V4 Pro <strong>never contributed a single line of code</strong> in any of the three multi-agent rounds, despite showing up in several experiments as &ldquo;completed&rdquo;.</p>
<p>The root cause is a protocol incompatibility between the DeepSeek API and the ai-sdk that opencode uses underneath. DeepSeek V4 Pro runs in thinking mode by default. Every response it returns includes a <code>reasoning_content</code> field with the model&rsquo;s internal chain of thought. <strong>The DeepSeek API then requires that <code>reasoning_content</code> to be echoed back in the message history of the next request.</strong> Without it, the server responds with 400 and this specific message:</p>
<blockquote>
  <p><code>The &quot;reasoning_content&quot; in the thinking mode must be passed back to the API.</code></p>

</blockquote>
<p>Opencode&rsquo;s ai-sdk, when constructing the next request, <strong>strips <code>reasoning_content</code></strong> from the message history. Every multi-turn call to DeepSeek V4 Pro via opencode fails on turn 2.</p>
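<p>For contrast, here&rsquo;s what correct handling looks like at the wire level, as a minimal Ruby sketch. The endpoint and field names follow DeepSeek&rsquo;s OpenAI-compatible API; treat the exact payload shape as an assumption. The point is the last step, which is exactly what ai-sdk omits:</p>
<pre><code class="language-ruby">require "net/http"
require "json"
require "uri"

# Seed history: the first turn is plain.
history = [{ role: "user", content: "Refactor this service" }]

uri = URI("https://api.deepseek.com/chat/completions")
res = Net::HTTP.post(uri,
  { model: "deepseek-v4-pro", messages: history }.to_json,
  "Authorization" =&gt; "Bearer #{ENV["DEEPSEEK_API_KEY"]}",
  "Content-Type" =&gt; "application/json")

msg = JSON.parse(res.body).dig("choices", 0, "message")

# The step ai-sdk skips: keep reasoning_content on the assistant turn you
# append to history, so the *next* request passes it back to the API.
history &lt;&lt; {
  role: "assistant",
  content: msg["content"],
  reasoning_content: msg["reasoning_content"]
}
</code></pre>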
<p>What makes this invisible on the surface is that opencode doesn&rsquo;t bubble that 400 to the user. It buries the error in the event stream and keeps going. When you look at the run result, you see files being written and tasks &ldquo;completing.&rdquo; But if you inspect the trace, you find that most (or all) came from the <code>general</code> fallback agent, which is Opus 4.7 itself. The DeepSeek that was supposed to be doing the work wrote nothing.</p>
<p>I tested three <code>reasoning</code> configurations in opencode (<code>true</code>, <code>false</code>, absent). All failed identically at turn 2 of dispatch 1. No config flag fixes it. Workarounds that were out of scope:</p>
<ul>
<li>Custom OpenRouter provider params via header</li>
<li>Single-bash-per-dispatch protocol (which would dodge multi-turn)</li>
<li>Switch to DeepSeek V4 Flash, which doesn&rsquo;t use thinking mode</li>
</ul>
<p>This forces a retroactive correction of earlier conclusions. In <a href="/en/2026/04/24/llm-benchmarks-parte-3-deepseek-kimi-mimo/">an earlier post</a> I described DeepSeek V4 Pro as having &ldquo;Tier 1 code, but DNF in the harness.&rdquo; That description was incomplete. <strong>DeepSeek V4 Pro is fundamentally incompatible with any ai-sdk-based harness</strong> (which includes opencode, and probably several other tools that use the same SDK underneath). The model can write good code solo if you invoke the API directly with proper thinking-mode handling. But in any real coding-agent pipeline that uses ai-sdk, it doesn&rsquo;t work in multi-turn.</p>
<p>Practical result: DeepSeek V4 Pro is unmeasurable in this benchmark. The only configurations where it appeared as &ldquo;successful&rdquo; had Opus writing in its place. For future benchmarks, I&rsquo;ll either swap to V4 Flash (which avoids thinking mode) or build a custom harness that echoes <code>reasoning_content</code> correctly.</p>
<p>The broader lesson: <strong>the maturity of tooling around a model matters as much as the model&rsquo;s quality.</strong> DeepSeek V4 Pro might have excellent solo code, but if you can&rsquo;t use it without writing your own harness, it loses to Kimi K2.6 which works out of the box.</p>
<hr>
<h2>Conclusion<span class="hx:absolute hx:-mt-20" id="conclusion"></span>
    <a href="#conclusion" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Multi-agent on continuous coding agent work is <strong>premature optimization in disguise</strong>. You spend a weekend wiring up an orchestration of five models, five harnesses, five different token counters, only to land at a result that&rsquo;s equal or worse than solo Opus in opencode. And worse: while you&rsquo;re orchestrating, you&rsquo;re not shipping product.</p>
<p>The data:</p>
<ul>
<li>Solo frontier model (Opus 4.7 in opencode) delivers <strong>Tier A 97/100 in 18 minutes for $4 pay-as-you-go</strong>.</li>
<li>Trying multi-agent without forcing: the model doesn&rsquo;t delegate, you pay harness overhead for nothing.</li>
<li>Forcing multi-agent: slightly lower quality (90-95 vs 97), more time, and equal or higher cost. The one real exception is Codex GPT 5.4 + medium for cutting GPT cost.</li>
<li>Manual cross-process orchestration: ties on quality at best (Opus + Kimi → 97), but roughly doubles wall time and raises total cost well above solo once you account for the hidden planner.</li>
<li>Under monthly subscription: the marginal cost of an extra call is zero for solo frontier. Multi-agent has no economic advantage.</li>
</ul>
<p>Practical rule: <strong>pick a good frontier model (Opus 4.7, GPT 5.5, GPT 5.4 xHigh), use it in a mature harness (opencode respects the model the most, Codex is the official one for GPT, Claude Code if you accept the context pollution), and optimize the prompt instead of the orchestration.</strong> Current models already decide on their own when to split a task into parallel subtasks (Claude Code has the <code>Task</code> tool, Opus uses it when it needs to). No need to force it.</p>
<p>And to close the subscription argument: <strong>$20-200 per month is trivial.</strong> It&rsquo;s less than what most people spend on bad online courses, streaming subscriptions, or weekend partying. In serious professional use, that pays back 5-10× on the first real project. If &ldquo;$200/month&rdquo; feels like a lot, the problem isn&rsquo;t the LLM, it&rsquo;s the ROI of what you&rsquo;re building. Which is exactly where you should be focusing before orchestrating agents to cut $30 in tokens.</p>
<p>Focus on shipping. Token optimization comes later.</p>
<hr>
<h2>Sources<span class="hx:absolute hx:-mt-20" id="sources"></span>
    <a href="#sources" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><ul>
<li><a href="https://github.com/akitaonrails/llm-coding-benchmark"target="_blank" rel="noopener">Benchmark repo</a> — code, prompts, configs, all the logs</li>
<li><a href="https://github.com/akitaonrails/llm-coding-benchmark/blob/master/docs/success_report.multi_model.md"target="_blank" rel="noopener">Round 1 multi-model report</a> — 7 free-choice variants, zero delegations</li>
<li><a href="https://github.com/akitaonrails/llm-coding-benchmark/blob/master/docs/success_report.multi_model_forced.md"target="_blank" rel="noopener">Round 2 forced-delegation audit</a> — 7 forced variants, 0-100 rubric</li>
<li><a href="https://github.com/akitaonrails/llm-coding-benchmark/blob/master/docs/success_report.manual_orchestration.md"target="_blank" rel="noopener">Round 3 manual orchestration</a> — Opus driving opencode in subprocess, Kimi/Qwen/DeepSeek</li>
<li><a href="https://github.com/akitaonrails/llm-coding-benchmark/blob/master/docs/orchestration_traces.md"target="_blank" rel="noopener">Orchestration traces</a> — per-variant forensic walkthroughs</li>
<li><a href="https://github.com/akitaonrails/llm-coding-benchmark/blob/master/prompts/benchmark_prompt_forced_delegation.txt"target="_blank" rel="noopener">Forced-delegation prompt template</a></li>
<li><a href="/en/2026/04/24/llm-benchmarks-parte-3-deepseek-kimi-mimo/">Canonical benchmark (April 2026)</a> — solo runs of 23 models, 8-dimension rubric</li>
<li><a href="/en/2026/02/16/vibe-code-do-zero-a-producao-em-6-dias-the-m-akita-chronicles/">Vibe code: from zero to production in 6 days</a> — full immersion in practice</li>
<li><a href="/en/2026/04/15/como-falar-com-o-claude-code-efetivamente/">How to talk to Claude Code effectively</a> — role-play as PM/QA</li>
<li><a href="/en/2026/04/20/clean-code-para-agentes-de-ia/">Clean Code for AI agents</a> — instruction guide for the model</li>
</ul>
]]></content:encoded><category>llm</category><category>benchmark</category><category>claude</category><category>ai</category><category>vibecoding</category></item><item><title>LLM Coding Benchmark (May 2026): DeepSeek v4, Kimi v2.6, Grok 4.3, GPT 5.5</title><link>https://akitaonrails.github.io/en/2026/04/24/llm-benchmarks-parte-3-deepseek-kimi-mimo/</link><guid isPermaLink="true">https://akitaonrails.github.io/en/2026/04/24/llm-benchmarks-parte-3-deepseek-kimi-mimo/</guid><pubDate>Fri, 24 Apr 2026 13:00:00 GMT</pubDate><description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Update (May 6, 2026)&lt;/strong&gt;: two post-publication adjustments that reshuffle the ranking. &lt;strong&gt;DeepSeek V4 Pro got unblocked&lt;/strong&gt; — what was listed here as &amp;ldquo;unmeasurable in opencode&amp;rdquo; climbed to &lt;strong&gt;Tier A at 89/100&lt;/strong&gt; running through DeepClaude, a shim that swaps the endpoint Claude Code talks to. Technical details in the &lt;a href="https://akitaonrails.github.io/en/2026/05/04/llm-benchmarks-deepseek-unlocked-deepclaude/"&gt;dedicated May 4 post&lt;/a&gt;. And &lt;strong&gt;Grok 4.3 entered the benchmark&lt;/strong&gt; at 72/100 Tier B, a big jump over Grok 4.20&amp;rsquo;s 25/100 (which still sits at the bottom of the list). The ranking, comparisons, and DeepSeek/Grok dedicated sections have all been updated below.&lt;/p&gt;</description><content:encoded><![CDATA[<blockquote>
  <p><strong>Update (May 6, 2026)</strong>: two post-publication adjustments that reshuffle the ranking. <strong>DeepSeek V4 Pro got unblocked</strong> — what was listed here as &ldquo;unmeasurable in opencode&rdquo; climbed to <strong>Tier A at 89/100</strong> running through DeepClaude, a shim that swaps the endpoint Claude Code talks to. Technical details in the <a href="/en/2026/05/04/llm-benchmarks-deepseek-unlocked-deepclaude/">dedicated May 4 post</a>. And <strong>Grok 4.3 entered the benchmark</strong> at 72/100 Tier B, a big jump over Grok 4.20&rsquo;s 25/100 (which still sits at the bottom of the list). The ranking, comparisons, and DeepSeek/Grok dedicated sections have all been updated below.</p>

</blockquote>
<p><strong>TL;DR:</strong> This is the canonical version of my LLM coding benchmark. It supersedes the <a href="https://github.com/akitaonrails/akitaonrails.github.io"target="_blank" rel="noopener">earlier April posts</a> which are now deprecated. I re-audited 24 models against the <code>ruby_llm</code> gem source instead of memory, and several models I had flagged for &ldquo;inventing API&rdquo; actually write correct code. Kimi K2.6 moved to Tier A. Gemini 3.1 Pro too. DeepSeek V4 Pro reaches Tier A only via DeepClaude (89/100); in opencode it stays unmeasurable. Grok 4.3 debuts at Tier B (72/100). GLM 5.1 dropped to Tier C. MiMo V2.5 Pro fell from &ldquo;first non-Anthropic Tier 1&rdquo; to Tier B. Opus 4.7 and GPT 5.4 xHigh tie at the top (97/100), GPT 5.5 lands third at 96/100 (40% cheaper than 5.4 at the same quality). Opus 4.6 is still my daily pick on <strong>behavior</strong>, not code. Long details below.</p>
<p><strong>Important disclaimer</strong>: all the rankings, tiers and conclusions here only hold <strong>within this specific benchmark methodology</strong>, which is building a Rails + RubyLLM + Hotwire + Docker app from a fixed prompt. Models that fall to Tier B or C here can shine on other kinds of task (isolated function completion, short snippet generation, mathematical reasoning). Nobody should read this as a universal capability judgment.</p>
<hr>
<h2>What this benchmark tests<span class="hx:absolute hx:-mt-20" id="what-this-benchmark-tests"></span>
    <a href="#what-this-benchmark-tests" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Each model gets the same prompt to autonomously build a ChatGPT-style chat app in Rails. The prompt asks for 15 things:</p>
<ol>
<li>Rails with latest Ruby + Rails via mise</li>
<li>No ActiveRecord, Action Mailer, or Active Job</li>
<li>SPA mimicking ChatGPT&rsquo;s interface</li>
<li>Tailwind CSS</li>
<li>Hotwire + Stimulus + Turbo Streams</li>
<li>Rails partials (no single-file CSS/JS dumps)</li>
<li><code>OPENROUTER_API_KEY</code> via env var</li>
<li>No secrets in files</li>
<li>RubyLLM (<code>ruby_llm</code> gem) configured for OpenRouter + Claude Sonnet</li>
<li>Minitest tests for each component</li>
<li>Brakeman, RuboCop, SimpleCov, bundle-audit for CI</li>
<li>Functional Dockerfile (not a placeholder)</li>
<li>docker-compose</li>
<li>README with setup</li>
<li>Everything in workspace root (no nested subdir)</li>
</ol>
<p>Cloud models run two phases (build + boot/Docker validation). Local models run phase 1 only. Harness is <code>opencode run --agent build --format json</code>, except GPT 5.4 which runs via <code>codex exec --json</code> directly.</p>
<h2>Methodology: the 8-dimension rubric<span class="hx:absolute hx:-mt-20" id="methodology-the-8-dimension-rubric"></span>
    <a href="#methodology-the-8-dimension-rubric" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>This part changed this week. The first version of this benchmark weighted RubyLLM API correctness too heavily, so a model that wrote the correct call but delivered an incomplete project (no docker-compose, stock README, bundle-audit missing) came across as more qualified than it should. The new rubric distributes across 8 dimensions:</p>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th style="text-align: right">Weight</th>
          <th>What it measures</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Deliverable completeness</td>
          <td style="text-align: right">25%</td>
          <td>Dockerfile + compose + README + Gemfile + all checklist artifacts</td>
      </tr>
      <tr>
          <td>RubyLLM correctness</td>
          <td style="text-align: right">20%</td>
          <td>Calls verified against gem 1.14.1 source</td>
      </tr>
      <tr>
          <td>Test quality</td>
          <td style="text-align: right">15%</td>
          <td>Tests exercise the LLM path with mocks that match the real signature</td>
      </tr>
      <tr>
          <td>Error handling</td>
          <td style="text-align: right">10%</td>
          <td>Rescue blocks around LLM calls, degraded UI for the user</td>
      </tr>
      <tr>
          <td>Persistence / multi-turn</td>
          <td style="text-align: right">10%</td>
          <td>Session cookie / cache good; singleton / class-var / none bad</td>
      </tr>
      <tr>
          <td>Hotwire / Turbo / Stimulus</td>
          <td style="text-align: right">10%</td>
          <td>Real Turbo Streams, decomposed partials, Stimulus controllers</td>
      </tr>
      <tr>
          <td>Architecture</td>
          <td style="text-align: right">5%</td>
          <td>Service/model separation, no logic dumps in controllers</td>
      </tr>
      <tr>
          <td>Production readiness</td>
          <td style="text-align: right">5%</td>
          <td>Multi-worker safe, no XSS, no committed <code>.env</code>, CSRF intact</td>
      </tr>
  </tbody>
</table>
<p>Score 0-100 → Tier A (80+) / B (60-79) / C (40-59) / D (&lt;40).</p>
<ul>
<li><strong>Tier A</strong>: ship as-is, or with a patch under 30 minutes</li>
<li><strong>Tier B</strong>: 1 to 2 hours to ship; architecture is sound, minor gaps</li>
<li><strong>Tier C</strong>: major rework. Core bugs or missing deliverables</li>
<li><strong>Tier D</strong>: throw away, useful only for architectural inspiration</li>
</ul>
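<p>As a toy illustration of how the weights combine into a tier (the per-dimension scores below are invented; only the weights and cutoffs come from the tables above):</p>
<pre><code class="language-ruby"># Weighted rubric score → tier, using the 8 weights from the table.
WEIGHTS = {
  completeness: 0.25, rubyllm: 0.20, tests: 0.15, errors: 0.10,
  persistence: 0.10, hotwire: 0.10, architecture: 0.05, production: 0.05
}.freeze

def tier(score)
  case score
  when 80..100 then "A"
  when 60...80 then "B"
  when 40...60 then "C"
  else "D"
  end
end

# Invented per-dimension scores for a strong run:
scores = { completeness: 95, rubyllm: 100, tests: 90, errors: 100,
           persistence: 100, hotwire: 95, architecture: 100, production: 100 }

total = WEIGHTS.sum { |dim, weight| weight * scores[dim] }.round
puts "#{total}/100 → Tier #{tier(total)}" # =&gt; 97/100 → Tier A
</code></pre>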
<h2>Final ranking (24 models)<span class="hx:absolute hx:-mt-20" id="final-ranking-24-models"></span>
    <a href="#final-ranking-24-models" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><table>
  <thead>
      <tr>
          <th style="text-align: right">Rank</th>
          <th>Model</th>
          <th style="text-align: right">Score</th>
          <th style="text-align: center">Tier</th>
          <th style="text-align: center">RubyLLM OK</th>
          <th>Time</th>
          <th>Cost</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: right">1</td>
          <td>Claude Opus 4.7</td>
          <td style="text-align: right"><strong>97</strong></td>
          <td style="text-align: center">A</td>
          <td style="text-align: center">✅</td>
          <td>18m</td>
          <td>~$1.10</td>
      </tr>
      <tr>
          <td style="text-align: right">1</td>
          <td>GPT 5.4 xHigh (Codex)</td>
          <td style="text-align: right"><strong>97</strong></td>
          <td style="text-align: center">A</td>
          <td style="text-align: center">✅</td>
          <td>22m</td>
          <td>~$16</td>
      </tr>
      <tr>
          <td style="text-align: right">3</td>
          <td><strong>GPT 5.5 xHigh (Codex)</strong></td>
          <td style="text-align: right"><strong>96</strong></td>
          <td style="text-align: center">A</td>
          <td style="text-align: center">✅</td>
          <td>18m</td>
          <td>~$10</td>
      </tr>
      <tr>
          <td style="text-align: right">4*</td>
          <td>DeepSeek V4 Pro (DeepClaude)</td>
          <td style="text-align: right"><strong>89</strong></td>
          <td style="text-align: center">A</td>
          <td style="text-align: center">✅</td>
          <td>18m</td>
          <td>~$3.14</td>
      </tr>
      <tr>
          <td style="text-align: right">5</td>
          <td>Kimi K2.6</td>
          <td style="text-align: right"><strong>87</strong></td>
          <td style="text-align: center">A</td>
          <td style="text-align: center">✅</td>
          <td>20m</td>
          <td>~$0.30</td>
      </tr>
      <tr>
          <td style="text-align: right">6</td>
          <td>Claude Opus 4.6</td>
          <td style="text-align: right">83</td>
          <td style="text-align: center">A</td>
          <td style="text-align: center">✅</td>
          <td>16m</td>
          <td>~$1.10</td>
      </tr>
      <tr>
          <td style="text-align: right">7</td>
          <td>Gemini 3.1 Pro</td>
          <td style="text-align: right">82</td>
          <td style="text-align: center">A</td>
          <td style="text-align: center">✅</td>
          <td>14m</td>
          <td>~$0.40</td>
      </tr>
      <tr>
          <td style="text-align: right">8</td>
          <td>Claude Sonnet 4.6</td>
          <td style="text-align: right">78</td>
          <td style="text-align: center">B</td>
          <td style="text-align: center">✅</td>
          <td>16m</td>
          <td>~$0.63</td>
      </tr>
      <tr>
          <td style="text-align: right">8</td>
          <td>DeepSeek V4 Flash</td>
          <td style="text-align: right">78</td>
          <td style="text-align: center">B</td>
          <td style="text-align: center">✅</td>
          <td>3m</td>
          <td>~$0.01</td>
      </tr>
      <tr>
          <td style="text-align: right">10</td>
          <td>Grok 4.3</td>
          <td style="text-align: right">72</td>
          <td style="text-align: center">B</td>
          <td style="text-align: center">✅</td>
          <td>15m</td>
          <td>~$1.74</td>
      </tr>
      <tr>
          <td style="text-align: right">11</td>
          <td>Qwen 3.6 Plus</td>
          <td style="text-align: right">71</td>
          <td style="text-align: center">B</td>
          <td style="text-align: center">✅</td>
          <td>17m</td>
          <td>~$0.15</td>
      </tr>
      <tr>
          <td style="text-align: right">12</td>
          <td>Kimi K2.5</td>
          <td style="text-align: right">69</td>
          <td style="text-align: center">B</td>
          <td style="text-align: center">✅</td>
          <td>29m</td>
          <td>~$0.10</td>
      </tr>
      <tr>
          <td style="text-align: right">13</td>
          <td>Xiaomi MiMo V2.5 Pro</td>
          <td style="text-align: right">67</td>
          <td style="text-align: center">B</td>
          <td style="text-align: center">✅</td>
          <td>11m</td>
          <td>~$0.14</td>
      </tr>
      <tr>
          <td style="text-align: right">14</td>
          <td>GLM 5</td>
          <td style="text-align: right">64</td>
          <td style="text-align: center">B</td>
          <td style="text-align: center">✅</td>
          <td>17m</td>
          <td>~$0.11</td>
      </tr>
      <tr>
          <td style="text-align: right">15</td>
          <td>Step 3.5 Flash</td>
          <td style="text-align: right">56</td>
          <td style="text-align: center">C</td>
          <td style="text-align: center">⚠️ bypass</td>
          <td>38m</td>
          <td>~$0.02</td>
      </tr>
      <tr>
          <td style="text-align: right">16</td>
          <td>Qwen 3.5 35B (local)</td>
          <td style="text-align: right">55</td>
          <td style="text-align: center">C</td>
          <td style="text-align: center">✅</td>
          <td>28m</td>
          <td>free</td>
      </tr>
      <tr>
          <td style="text-align: right">17</td>
          <td>GLM 4.7 Flash bf16 (local)</td>
          <td style="text-align: right">52</td>
          <td style="text-align: center">C</td>
          <td style="text-align: center">✅</td>
          <td>failed</td>
          <td>free</td>
      </tr>
      <tr>
          <td style="text-align: right">18</td>
          <td>GLM 5.1 (Z.ai)</td>
          <td style="text-align: right">46</td>
          <td style="text-align: center">C</td>
          <td style="text-align: center">❌</td>
          <td>22m</td>
          <td>subscription</td>
      </tr>
      <tr>
          <td style="text-align: right">19</td>
          <td>DeepSeek V3.2</td>
          <td style="text-align: right">43</td>
          <td style="text-align: center">C</td>
          <td style="text-align: center">❌</td>
          <td>60m</td>
          <td>~$0.07</td>
      </tr>
      <tr>
          <td style="text-align: right">20</td>
          <td>MiniMax M2.7</td>
          <td style="text-align: right">41</td>
          <td style="text-align: center">C</td>
          <td style="text-align: center">❌</td>
          <td>14m</td>
          <td>~$0.30</td>
      </tr>
      <tr>
          <td style="text-align: right">21</td>
          <td>Qwen 3.5 122B (local)</td>
          <td style="text-align: right">37</td>
          <td style="text-align: center">D</td>
          <td style="text-align: center">❌</td>
          <td>43m</td>
          <td>free</td>
      </tr>
      <tr>
          <td style="text-align: right">22</td>
          <td>Qwen 3 Coder Next (local)</td>
          <td style="text-align: right">32</td>
          <td style="text-align: center">D</td>
          <td style="text-align: center">❌</td>
          <td>17m</td>
          <td>free</td>
      </tr>
      <tr>
          <td style="text-align: right">23</td>
          <td>Grok 4.20</td>
          <td style="text-align: right">25</td>
          <td style="text-align: center">D</td>
          <td style="text-align: center">❌</td>
          <td>8m</td>
          <td>~$0.60</td>
      </tr>
      <tr>
          <td style="text-align: right">24</td>
          <td>GPT OSS 20B (local)</td>
          <td style="text-align: right">11</td>
          <td style="text-align: center">D</td>
          <td style="text-align: center">❌</td>
          <td>failed</td>
          <td>free</td>
      </tr>
  </tbody>
</table>
<p>*DeepSeek V4 Pro only reaches that score via DeepClaude (Claude Code with env vars pointed at OpenRouter). In opencode (and any ai-sdk-based harness) the model stays unmeasurable due to the <code>reasoning_content</code> protocol bug. The 89 above is the <code>claude_code_deepseek_v4_pro_or_sonnet</code> variant, with <code>sonnet-coder</code> registered but never invoked; the solo variant (no subagent registered) lands at 84/100, still Tier A. Full coverage in the <a href="/en/2026/05/04/llm-benchmarks-deepseek-unlocked-deepclaude/">May 4 post on the unlock</a>.</p>
<h2>Course correction: what changed in the criteria<span class="hx:absolute hx:-mt-20" id="course-correction-what-changed-in-the-criteria"></span>
    <a href="#course-correction-what-changed-in-the-criteria" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Worth logging what changed between the previous iteration and this one, because it&rsquo;s quite a bit.</p>
<h3>Mistake 1: I was cataloging real APIs as &ldquo;hallucinations&rdquo;<span class="hx:absolute hx:-mt-20" id="mistake-1-i-was-cataloging-real-apis-as-hallucinations"></span>
    <a href="#mistake-1-i-was-cataloging-real-apis-as-hallucinations" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Embarrassing. I went to verify directly against the gem source at <code>~/.local/share/mise/installs/ruby/4.0.2/lib/ruby/gems/4.0.0/gems/ruby_llm-1.14.1/</code> and two things I had been calling invented are actually legitimate public API:</p>
<p><strong><code>chat.add_message(role: :user, content: &quot;x&quot;)</code> is not a kwargs bug</strong>. I asserted this crashed with <code>ArgumentError</code> because the real signature would be a positional hash. Looking at the gem, <code>Chat#add_message(message_or_attributes)</code> accepts either a <code>Message</code> or a hash. Ruby&rsquo;s parser treats <code>add_message(role: :user, content: &quot;x&quot;)</code> as <code>add_message({role: :user, content: &quot;x&quot;})</code>, a single positional hash. <strong>It works.</strong></p>
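<p>A minimal repro of that parsing behavior outside the gem (this is plain Ruby 3.x semantics, nothing RubyLLM-specific):</p>
<pre><code class="language-ruby"># A method with one positional parameter and no keyword parameters
# receives the keyword-looking call as a single Hash in Ruby 3.x.
def add_message(message_or_attributes)
  message_or_attributes
end

p add_message(role: :user, content: "x")
# =&gt; {role: :user, content: "x"} (one positional hash, no ArgumentError)
</code></pre>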
<p><strong><code>chat.complete(&amp;block)</code> is a real public method</strong> in RubyLLM 1.14.1.</p>
<p>Consequence: several models I had tagged as Tier 3 were actually using valid API.</p>
<h3>Mistake 2: RubyLLM correctness alone isn&rsquo;t enough<span class="hx:absolute hx:-mt-20" id="mistake-2-rubyllm-correctness-alone-isnt-enough"></span>
    <a href="#mistake-2-rubyllm-correctness-alone-isnt-enough" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Even after the API correction, weight was missing on deliverable completeness. If a model writes perfect RubyLLM but forgets docker-compose, leaves the README as a stock template, or omits bundle-audit, the project isn&rsquo;t done.</p>
<p>The new rubric distributes weight differently and that shifted the ranking:</p>
<ul>
<li><strong>Kimi K2.6</strong> moved from &ldquo;half-fix Tier 2&rdquo; up to Tier A (84). The only non-Western model with real tests mocking RubyLLM + rescue + multi-worker-safe session cookie.</li>
<li><strong>Kimi K2.5</strong> came back to Tier B (66) from Tier 3. The API I called invented is real. It drops for another reason: 37 tests that never mock RubyLLM.</li>
<li><strong>Gemini 3.1 Pro</strong> jumped to Tier A (82). Was previously misclassified as Tier 3.</li>
<li><strong>GPT 5.4 xHigh</strong> ties for first with Opus 4.7 (97/100). Impeccable architecture + complete deliverables.</li>
<li><strong>DeepSeek V4 Pro</strong> has two readings. In opencode (and any ai-sdk harness) it stays unmeasurable: it hits the thinking-mode protocol incompatibility and the output falls back to Opus. Through DeepClaude (Claude Code with the endpoint swapped to OpenRouter), it delivers Tier A at 89/100 — the model is capable, the prior harness just didn&rsquo;t support the protocol. May update, see the dedicated post.</li>
<li><strong>MiMo V2.5 Pro</strong> dropped from &ldquo;first non-Anthropic Tier 1&rdquo; to Tier B (64). Tests don&rsquo;t exercise LLM + process-local Singleton + no rescue + no system prompt.</li>
<li><strong>GLM 5.1</strong> dropped hard to Tier C (43). Its fluent DSL (<code>c.user()</code>, <code>c.assistant()</code>) really is invented — grep confirms. And every request reconstructs <code>ChatSession.new</code>, discarding history. Two compound bugs.</li>
</ul>
<p>Lesson: file counts and test counts mean nothing if the code underneath is wrong. But correct RubyLLM usage on its own isn&rsquo;t enough either. You need both.</p>
<h2>Tier A: what makes them work<span class="hx:absolute hx:-mt-20" id="tier-a-what-makes-them-work"></span>
    <a href="#tier-a-what-makes-them-work" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>To understand what separates Tier A from Tier B, look at what Opus 4.7, Kimi K2.6 and Gemini 3.1 Pro do that MiMo, Sonnet 4.6 and Grok 4.3 don&rsquo;t.</p>
<p><strong>All Tier A models have these 4 things:</strong></p>
<ol>
<li><strong>Tests that mock RubyLLM with the correct signature.</strong> A <code>FakeChat</code> that implements <code>with_instructions</code>, <code>add_message</code>, <code>ask</code> (see the sketch after this list). Tests that exercise happy path AND error path. WebMock blocking real calls to OpenRouter.</li>
<li><strong>Rescue around <code>chat.ask</code></strong> with typed error (<code>LlmClient::Error</code> or similar) and degraded UI for the user.</li>
<li><strong>Persistence that survives restart</strong> and works multi-worker. Session cookie (Opus, K2.6) or Rails.cache with TTL (Gemini, GPT 5.4).</li>
<li><strong>System prompt via <code>with_instructions</code>.</strong> The model knows its role.</li>
</ol>
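<p>Here&rsquo;s a hedged sketch of that <code>FakeChat</code>. The interface mirrors the RubyLLM calls quoted throughout this post; the internals are illustrative:</p>
<pre><code class="language-ruby"># Test double with the same surface as a RubyLLM chat object.
class FakeChat
  Response = Struct.new(:content) # mimics response.content

  attr_reader :instructions, :history

  def initialize(reply: "stubbed answer")
    @history = []
    @reply = reply
  end

  def with_instructions(text)
    @instructions = text
    self
  end

  def add_message(attrs)
    @history &lt;&lt; attrs
    self
  end

  def ask(content)
    @history &lt;&lt; { role: :user, content: content }
    Response.new(@reply)
  end
end
</code></pre>
<p>Inject it wherever the service builds its chat, assert against <code>history</code> and <code>instructions</code>, and keep WebMock enabled anyway so any accidental real call to OpenRouter fails loudly.</p>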
<p>Tier B usually fails 2-3 of these. MiMo fails all 4. Sonnet 4.6 has ambitious architecture (multi-conversation sidebar) with a subtle control-flow bug that the tests rubber-stamp. Grok 4.3 writes the cleanest controller in the benchmark but its Stimulus pipeline is dead at runtime, so half the UI doesn&rsquo;t react to events. DeepSeek V4 Pro has the cleanest RubyLLM code (persistent <code>@chat</code>) in the snippet it produces; in opencode the run falls back to Opus before covering the checklist, but through DeepClaude the same model covers the checklist and lands in Tier A.</p>
<p>Tier C usually has at least one structural bug: invented fluent DSL, history discarded every request, <code>ruby_llm</code> gem in the <code>:test</code> group with <code>require: false</code>, or bypassing the gem entirely with raw <code>Net::HTTP</code>.</p>
<h2>Claude family: Opus 4.6 vs 4.7, Sonnet 4.6<span class="hx:absolute hx:-mt-20" id="claude-family-opus-46-vs-47-sonnet-46"></span>
    <a href="#claude-family-opus-46-vs-47-sonnet-46" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Opus 4.7 leads at 97/100. Textbook-correct code:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ruby" data-lang="ruby"><span class="line"><span class="cl"><span class="n">chat</span> <span class="o">=</span> <span class="vi">@client</span><span class="o">.</span><span class="n">chat</span><span class="p">(</span><span class="ss">model</span><span class="p">:,</span> <span class="ss">provider</span><span class="p">:)</span>
</span></span><span class="line"><span class="cl"><span class="n">chat</span><span class="o">.</span><span class="n">with_instructions</span><span class="p">(</span><span class="vi">@system_prompt</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">previous_messages</span><span class="o">.</span><span class="n">each</span> <span class="p">{</span> <span class="o">|</span><span class="n">msg</span><span class="o">|</span> <span class="n">chat</span><span class="o">.</span><span class="n">add_message</span><span class="p">({</span><span class="ss">role</span><span class="p">:</span> <span class="n">msg</span><span class="o">.</span><span class="n">role</span><span class="o">.</span><span class="n">to_sym</span><span class="p">,</span> <span class="ss">content</span><span class="p">:</span> <span class="n">msg</span><span class="o">.</span><span class="n">content</span><span class="p">})</span> <span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="n">response</span> <span class="o">=</span> <span class="n">chat</span><span class="o">.</span><span class="n">ask</span><span class="p">(</span><span class="n">user_message</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">response</span><span class="o">.</span><span class="n">content</span></span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p><code>FakeChat</code> with real signature. Tests verify history replay, error wrapping, model/provider override, system prompt. Session cookie with <code>to_a</code>/<code>from_session</code> is multi-worker safe. Rescue of <code>RubyLLM::Error + StandardError</code> → friendly error bubble.</p>
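<p>A sketch of why that session-cookie scheme is multi-worker safe (<code>to_a</code>/<code>from_session</code> are the names above; the surrounding model is my reconstruction):</p>
<pre><code class="language-ruby"># Conversation state round-trips through the session cookie as plain data.
class Conversation
  Message = Struct.new(:role, :content, keyword_init: true)

  def self.from_session(raw)
    new((raw || []).map { |h| Message.new(role: h["role"], content: h["content"]) })
  end

  attr_reader :messages

  def initialize(messages)
    @messages = messages
  end

  def to_a
    @messages.map { |m| { "role" =&gt; m.role, "content" =&gt; m.content } }
  end
end

# In the controller:
#   conversation = Conversation.from_session(session[:chat])
#   ...
#   session[:chat] = conversation.to_a
# Plain data in the signed cookie means any Puma worker can rebuild state.
</code></pre>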
<p>No relevant deductions this round.</p>
<p>Opus 4.6 has correct code too (83/100 Tier A) but less disciplined. Controller without rescue around <code>chat_service.ask</code>: a transient 5xx becomes a stack trace page. Service reaches into <code>Chat#messages</code> via <code>attr_reader</code> directly, Demeter violation. Material difference between 4.6 (low Tier A) and 4.7 (high Tier A).</p>
<p>Sonnet 4.6 is Tier B (78). Richest UI of the whole benchmark (multi-conversation sidebar). But <code>LlmChatService#call</code> only calls <code>ask</code> if the last history message is from the user, otherwise it silently returns <code>&quot;&quot;</code>. The tests rubber-stamp the bug. And the whole conversation fits in a 4KB cookie, which overflows after ~10 turns.</p>
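<p>An illustrative reconstruction of that control-flow bug (not Sonnet&rsquo;s literal code):</p>
<pre><code class="language-ruby"># If the last message isn't from the user, the service quietly answers ""
# instead of raising, and the generated tests assert exactly that.
def call(history)
  return "" unless history.last&amp;.fetch(:role) == "user"

  chat.ask(history.last.fetch(:content)).content
end
</code></pre>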
<h3>The 4.7 behavior downgrade<span class="hx:absolute hx:-mt-20" id="the-47-behavior-downgrade"></span>
    <a href="#the-47-behavior-downgrade" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>The community complaint on <a href="https://dev.to/vibeagentmaking/why-we-switched-back-from-claude-opus-47-to-46-47f9"target="_blank" rel="noopener">Reddit and DEV.to</a> that &ldquo;4.7 got worse for coding&rdquo; isn&rsquo;t about code quality. On objective benchmark it&rsquo;s at the top. What 4.7 made worse is <strong>behavior</strong>:</p>
<ul>
<li>New tokenizer <a href="https://platform.claude.com/docs/en/about-claude/models/migration-guide"target="_blank" rel="noopener">consumes 1× to 1.35× the tokens</a> for the same text</li>
<li>Tries to optimize resources (tokens, tool calls, steps) more aggressively, sometimes cutting corners</li>
<li><a href="https://github.com/anthropics/claude-code/issues/52368"target="_blank" rel="noopener">GitHub issues</a> report Max Plan running out in minutes where it used to last hours</li>
</ul>
<p>I have hundreds of hours with 4.6 and have been testing 4.7 since launch. The code 4.7 produces is Tier A, equal to or better than 4.6. But in daily use, 4.6&rsquo;s more direct behavior is more productive. 4.6 is still my Claude Code default when Claude Code lets me pick (since 4.7 launched, <a href="https://code.claude.com/docs/en/model-config"target="_blank" rel="noopener">Claude Code changed the default to xHigh reasoning</a> and is more restrictive about downgrading to 4.6).</p>
<h2>GPT via Codex: 5.4 and 5.5 at the top<span class="hx:absolute hx:-mt-20" id="gpt-via-codex-54-and-55-at-the-top"></span>
    <a href="#gpt-via-codex-54-and-55-at-the-top" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>GPT 5.4 xHigh via Codex CLI ties with Opus 4.7 at first (97/100). Uses <code>RubyLLM.chat(model:, provider: :openrouter, assume_model_exists: true)</code> + <code>with_instructions</code> + <code>add_message(role:, content:)</code> + <code>chat.ask</code> + <code>response.content</code>. Textbook, with provider pinning and registry-skip.</p>
<p>It&rsquo;s the only model with:</p>
<ul>
<li><strong>Explicit API-key preflight</strong> (<code>ensure_api_key!</code> raises <code>MissingConfigurationError</code>)</li>
<li><strong>Differentiated HTTP status codes</strong>: 503 for config error, 502 for runtime error</li>
<li><strong>Rails cache persistence with TTL + cap</strong> (24 messages × 12h)</li>
<li><strong>Dedicated form object</strong> (<code>PromptSubmission</code>) separate from the domain model (<code>ChatMessage</code>)</li>
</ul>
<p>10 test files including partial render tests. <code>FakeChat</code>/<code>FakeClient</code> match real API.</p>
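<p>A hedged reconstruction of that preflight plus status-code split (<code>ensure_api_key!</code> and <code>MissingConfigurationError</code> are named above; the controller wiring is my assumption):</p>
<pre><code class="language-ruby"># Error classes: the first name comes from the run, the second is illustrative.
MissingConfigurationError = Class.new(StandardError)
LlmRuntimeError = Class.new(StandardError)

class ChatsController &lt; ApplicationController
  before_action :ensure_api_key!

  rescue_from MissingConfigurationError do
    head :service_unavailable # 503: configuration problem, retrying won't help
  end

  rescue_from LlmRuntimeError do
    head :bad_gateway # 502: upstream/runtime failure, retrying might
  end

  private

  def ensure_api_key!
    raise MissingConfigurationError if ENV["OPENROUTER_API_KEY"].to_s.empty?
  end
end
</code></pre>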
<p><strong>Critical weakness</strong>: 7.6M total tokens → ~$16/run. 15× Opus&rsquo;s cost for essentially tied quality. Hard to justify unless you can&rsquo;t iterate on the first try.</p>
<h3>GPT 5.5 xHigh (Codex): 96/100, cheaper and faster than 5.4<span class="hx:absolute hx:-mt-20" id="gpt-55-xhigh-codex-96100-cheaper-and-faster-than-54"></span>
    <a href="#gpt-55-xhigh-codex-96100-cheaper-and-faster-than-54" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>OpenAI <a href="https://openai.com/index/introducing-gpt-5-5/"target="_blank" rel="noopener">released GPT 5.5 yesterday (April 23)</a>, already <a href="https://community.openai.com/t/gpt-5-5-is-here-available-in-codex-and-chatgpt-today/1379630"target="_blank" rel="noopener">available in Codex</a>. I ran the benchmark today. Result: 96/100, Tier A, rank #3.</p>
<p>Good news is 5.5 delivers <strong>the same code quality as 5.4</strong> on this benchmark, with real time and token savings.</p>
<table>
  <thead>
      <tr>
          <th></th>
          <th>GPT 5.4 xHigh</th>
          <th>GPT 5.5 xHigh</th>
          <th>Δ</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Score</td>
          <td>97/100</td>
          <td>96/100</td>
          <td>tie (1-point noise)</td>
      </tr>
      <tr>
          <td>Elapsed</td>
          <td>22m</td>
          <td>18m</td>
          <td><strong>20% faster</strong></td>
      </tr>
      <tr>
          <td>Total tokens</td>
          <td>7.6M</td>
          <td>4.9M</td>
          <td><strong>35% fewer</strong></td>
      </tr>
      <tr>
          <td>Output tokens</td>
          <td>63K</td>
          <td>29K</td>
          <td><strong>54% fewer</strong></td>
      </tr>
      <tr>
          <td>Est. cost</td>
          <td>~$16</td>
          <td>~$10</td>
          <td><strong>40% cheaper</strong></td>
      </tr>
  </tbody>
</table>
<p>The RubyLLM integration is structurally identical to 5.4:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ruby" data-lang="ruby"><span class="line"><span class="cl"><span class="n">chat</span> <span class="o">=</span> <span class="no">RubyLLM</span><span class="o">.</span><span class="n">chat</span><span class="p">(</span><span class="ss">model</span><span class="p">:,</span> <span class="ss">provider</span><span class="p">:</span> <span class="ss">:openrouter</span><span class="p">,</span> <span class="ss">assume_model_exists</span><span class="p">:</span> <span class="kp">true</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">chat</span><span class="o">.</span><span class="n">with_instructions</span><span class="p">(</span><span class="no">SYSTEM_PROMPT</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">history</span><span class="o">.</span><span class="n">each</span> <span class="p">{</span> <span class="o">|</span><span class="n">m</span><span class="o">|</span> <span class="n">chat</span><span class="o">.</span><span class="n">add_message</span><span class="p">(</span><span class="ss">role</span><span class="p">:</span> <span class="n">m</span><span class="o">.</span><span class="n">role</span><span class="o">.</span><span class="n">to_sym</span><span class="p">,</span> <span class="ss">content</span><span class="p">:</span> <span class="n">m</span><span class="o">.</span><span class="n">content</span><span class="p">)</span> <span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="n">response</span> <span class="o">=</span> <span class="n">chat</span><span class="o">.</span><span class="n">ask</span><span class="p">(</span><span class="n">prompt</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">response</span><span class="o">.</span><span class="n">content</span><span class="o">.</span><span class="n">to_s</span><span class="o">.</span><span class="n">strip</span></span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>Ships with the same defensive patterns 5.4 has:</p>
<ul>
<li><strong>Dependency-injected <code>client_factory:</code></strong>: lets tests exercise the full seed-history-then-ask path via <code>FakeClient</code>, no WebMock required (sketch after this list)</li>
<li><strong><code>rescue_from RubyLLM::Error, RubyLLM::ConfigurationError</code></strong>: both real error classes, rescued separately</li>
<li><strong>Session cookie with 20-message cap</strong></li>
<li><strong>Real Turbo Streams</strong> (<code>turbo_stream.replace &quot;chat-thread&quot;</code> + composer)</li>
<li><strong>Stimulus composer controller</strong> with proper lifecycle (disable-on-submit, reset, auto-scroll)</li>
</ul>
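<p>A sketch of that <code>client_factory:</code> seam (the keyword name comes from the run; the service shape and model id are assumptions):</p>
<pre><code class="language-ruby"># Production uses the real RubyLLM module; tests swap in a factory that
# returns a fake client with the same chat interface.
class ChatService
  def initialize(client_factory: -&gt; { RubyLLM })
    @client_factory = client_factory
  end

  def ask(history, prompt)
    chat = @client_factory.call.chat(model: "anthropic/claude-sonnet-4.6",
                                     provider: :openrouter,
                                     assume_model_exists: true)
    history.each { |m| chat.add_message(role: m[:role], content: m[:content]) }
    chat.ask(prompt).content.to_s.strip
  end
end

# In tests: ChatService.new(client_factory: -&gt; { FakeClient.new }) drives the
# full seed-history-then-ask path with zero HTTP, so WebMock isn't needed.
</code></pre>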
<h3>What changes for the user: nothing in quality, everything in cost/speed<span class="hx:absolute hx:-mt-20" id="what-changes-for-the-user-nothing-in-quality-everything-in-costspeed"></span>
    <a href="#what-changes-for-the-user-nothing-in-quality-everything-in-costspeed" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>This is worth logging. On synthetic capability benchmark, 5.5 and 5.4 tie. Same architectural shape delivered, same defensive patterns, same error rescue, same persistence. Side-by-side code review shows no material difference.</p>
<p>Where 5.5 really wins is <strong>generation efficiency</strong>: 35% fewer total tokens, 54% fewer output tokens, 20% less time. For what used to cost $16/run on 5.4, you pay $10. On continuous use the savings compound.</p>
<p><strong>It&rsquo;s not a new capability generation.</strong> It&rsquo;s cost optimization on top of architecture that was already at the top. For folks already on Codex, 5.5 replaces 5.4 with no regression. For folks who aren&rsquo;t, at 15× (and now 10×) the cost of Opus for tied quality, it&rsquo;s still hard to justify for continuous use.</p>
<p><strong>Critical weakness</strong>: none significant on this run. Same shape as 5.4 at lower cost. The DI-injected factory + real error class rescue + session cookie = <strong>best defensive patterns in the whole benchmark</strong>.</p>
<h3>A caveat on price<span class="hx:absolute hx:-mt-20" id="a-caveat-on-price"></span>
    <a href="#a-caveat-on-price" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>The ~$16/run for GPT 5.4 and ~$10/run for GPT 5.5 in the table are direct API token costs (pay-as-you-go). In April 2026, OpenAI <a href="https://chatgpt.com/codex/pricing/"target="_blank" rel="noopener">switched Codex to use the same API token accounting</a> inside the Plus, Pro, Business and Enterprise subscriptions. In practice, most Codex CLI users access it via <a href="https://chatgpt.com/pricing/"target="_blank" rel="noopener">ChatGPT Plus ($20/mo) or Pro ($200/mo)</a>, where Codex usage falls under the subscription quota (Pro gets <a href="https://help.openai.com/en/articles/20001106-codex-rate-card"target="_blank" rel="noopener">20× Plus&rsquo;s limit</a>). For a Pro subscriber already paying $200/month, a benchmark run adds no marginal cost until they saturate their monthly quota.</p>
<p>So &ldquo;GPT 5.4 is the most expensive in the ranking&rdquo; is true under pay-as-you-go and changes under subscription. Anyone already on Pro for everything probably isn&rsquo;t thinking of this in &ldquo;cost per run&rdquo; terms. That said, in terms of tokens spent per delivered quality, 5.5 is simply more efficient than 5.4. Even inside the subscription it burns 35% less of the quota.</p>
<h2>DeepSeek: the overhype pattern<span class="hx:absolute hx:-mt-20" id="deepseek-the-overhype-pattern"></span>
    <a href="#deepseek-the-overhype-pattern" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Every DeepSeek generation ships with heavy marketing (&ldquo;competitive with Claude Opus&rdquo;) and ends with the same pattern: <strong>tool support lags behind</strong>.</p>
<h3>V4 Pro: clean code in snippet, harness incompatible in opencode<span class="hx:absolute hx:-mt-20" id="v4-pro-clean-code-in-snippet-harness-incompatible-in-opencode"></span>
    <a href="#v4-pro-clean-code-in-snippet-harness-incompatible-in-opencode" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>The RubyLLM snippet it produces, in isolation, is Tier A quality:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ruby" data-lang="ruby"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">initialize</span>
</span></span><span class="line"><span class="cl">  <span class="vi">@chat</span> <span class="o">=</span> <span class="no">RubyLLM</span><span class="o">.</span><span class="n">chat</span>
</span></span><span class="line"><span class="cl">  <span class="vi">@messages</span> <span class="o">=</span> <span class="o">[]</span>
</span></span><span class="line"><span class="cl"><span class="k">end</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">ask</span><span class="p">(</span><span class="n">content</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="n">result</span> <span class="o">=</span> <span class="vi">@chat</span><span class="o">.</span><span class="n">ask</span><span class="p">(</span><span class="n">content</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="vi">@messages</span> <span class="o">&lt;&lt;</span> <span class="no">Message</span><span class="o">.</span><span class="n">new</span><span class="p">(</span><span class="ss">role</span><span class="p">:</span> <span class="s2">&#34;user&#34;</span><span class="p">,</span> <span class="ss">content</span><span class="p">:</span> <span class="n">content</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="vi">@messages</span> <span class="o">&lt;&lt;</span> <span class="no">Message</span><span class="o">.</span><span class="n">new</span><span class="p">(</span><span class="ss">role</span><span class="p">:</span> <span class="s2">&#34;assistant&#34;</span><span class="p">,</span> <span class="ss">content</span><span class="p">:</span> <span class="n">result</span><span class="o">.</span><span class="n">content</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="k">end</span></span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>Persistent <code>@chat</code> instance, delegates multi-turn to RubyLLM. Correct pattern.</p>
<p>In opencode, V4 Pro hits a protocol incompatibility. The model uses thinking mode by default and returns <code>reasoning_content</code> on every response. <strong>The DeepSeek API requires the client to echo that <code>reasoning_content</code> back in the next request&rsquo;s message history</strong>, or it answers 400 with <code>&quot;reasoning_content must be passed back to the API&quot;</code>. The ai-sdk that opencode uses underneath strips that field while building the next request. Every multi-turn call to V4 Pro via opencode fails on turn 2.</p>
<p>What makes this treacherous: opencode doesn&rsquo;t surface that 400 to the user. It buries the error in the event stream and falls back to the <code>general</code> fallback agent — which is Opus 4.7. The run &ldquo;completes,&rdquo; files get written, tasks get marked ok. But if you inspect the trace, you discover that <strong>most of the files were written by Opus, not by DeepSeek</strong>. The half-baked deliverables that drag the score down (stock README, no <code>docker-compose.yml</code>, missing bundle-audit) are output from Opus in fallback mode acting without the main agent&rsquo;s context. The 69/100 score on the old record reflects mixed authorship, not V4 Pro for real.</p>
<p>This isn&rsquo;t opencode-specific. <strong>Any ai-sdk-based harness</strong> has the same bug, because the SDK is the one that strips <code>reasoning_content</code>.</p>
<h3>V4 Pro unblocked: DeepClaude delivers Tier A at 89/100<span class="hx:absolute hx:-mt-20" id="v4-pro-unblocked-deepclaude-delivers-tier-a-at-89100"></span>
    <a href="#v4-pro-unblocked-deepclaude-delivers-tier-a-at-89100" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>In May I ran Round 4 of the orchestration benchmark and found a way to actually measure V4 Pro. There&rsquo;s a shell shim called <a href="https://github.com/aattaran/deepclaude"target="_blank" rel="noopener">DeepClaude</a> that swaps the endpoint Claude Code talks to. Instead of hitting <code>api.anthropic.com</code>, Claude Code now calls <code>openrouter.ai/api/anthropic</code>, which is OpenRouter&rsquo;s Anthropic-compatible route. That route handles thinking content correctly in the format Claude Code emits (opencode&rsquo;s ai-sdk doesn&rsquo;t). The result: Claude Code&rsquo;s full autonomous loop runs on top of DeepSeek V4 Pro with no protocol bug.</p>
<p>Numbers from the two variants I ran:</p>
<table>
  <thead>
      <tr>
          <th>Variant</th>
          <th style="text-align: right">Score</th>
          <th style="text-align: right">Time</th>
          <th style="text-align: right">Cost</th>
          <th>Note</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Claude Code via DeepClaude → V4 Pro (solo)</td>
          <td style="text-align: right">84</td>
          <td style="text-align: right">21m</td>
          <td style="text-align: right">$3.38</td>
          <td>No subagent registered</td>
      </tr>
      <tr>
          <td>Claude Code via DeepClaude → V4 Pro + Sonnet registered</td>
          <td style="text-align: right"><strong>89</strong></td>
          <td style="text-align: right">18m</td>
          <td style="text-align: right">$3.14</td>
          <td>Sonnet registered but never invoked</td>
      </tr>
  </tbody>
</table>
<p>The 15-20 point lift over the opencode result is purely a harness-change effect. Same model, same prompt, no new orchestration, just a different agent loop. And the variant with Sonnet registered but never invoked scored +5 over its sister with no subagent. Mere availability of a delegate made the DeepSeek planner decompose better (smaller seams, cleaner DI, system prompt via <code>with_instructions</code>). Full technical details in the <a href="/en/2026/05/04/llm-benchmarks-deepseek-unlocked-deepclaude/">dedicated post</a>.</p>
<p>Where this leaves V4 Pro:</p>
<ul>
<li>In opencode (and any ai-sdk): unmeasurable. What I wrote above still applies.</li>
<li>In Claude Code with DeepClaude: Tier A at 89/100, $3.14 in 18 minutes.</li>
<li>In Claude Code pointed directly at the Anthropic-compatible API with <code>DEEPSEEK_API_KEY</code>: not tested, but that native route also handles thinking content properly.</li>
</ul>
<p>For anyone without an Anthropic subscription who wants Tier A coding, the DeepClaude variant running entirely on <code>OPENROUTER_API_KEY</code> is the first real answer the benchmark has produced.</p>
<p>This is the pattern that shows up every time DeepSeek ships: tool support lags. The community takes weeks to figure out integration, the thinking-mode protocol is more complex at runtime, and support in open-source tooling accumulates gaps the provider doesn&rsquo;t fix. If you need a production pipeline today that integrates DeepSeek, you&rsquo;ll probably have to maintain your own patches — or use DeepClaude.</p>
<h3>V4 Flash: cheapest viable option (78/100 Tier B)<span class="hx:absolute hx:-mt-20" id="v4-flash-cheapest-viable-option-78100-tier-b"></span>
    <a href="#v4-flash-cheapest-viable-option-78100-tier-b" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>$0.01/run, 2m 35s, fixes V3.2&rsquo;s critical bug (invented <code>RubyLLM::Client</code> class). API all correct (now that I know <code>add_message(role:, content:)</code> is valid). Session-replay multi-turn via <code>session[:messages]</code>. WebMock tests the real OpenRouter endpoint and exercises the LLM path. All of this pushes it to 78/100 Tier B.</p>
<p><strong>Gotcha:</strong> the model slug is <code>&quot;claude-sonnet-4&quot;</code>, missing the <code>anthropic/</code> prefix, which 404s at OpenRouter at runtime. A one-line fix, but fatal if you deploy blind.</p>
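<p>The patch, sketched (the call site is the app&rsquo;s own; <code>RubyLLM.chat(model:, provider:)</code> is the real gem signature per this audit):</p>
<pre><code class="language-ruby"># Before: unqualified slug, 404 from OpenRouter at runtime
chat = RubyLLM.chat(model: &quot;claude-sonnet-4&quot;, provider: :openrouter)

# After: the slug carries its provider namespace
chat = RubyLLM.chat(model: &quot;anthropic/claude-sonnet-4&quot;, provider: :openrouter)</code></pre>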
<p>At Tier B, $0.01/run, V4 Flash is the cheapest model that gets close to &ldquo;code that works with one manual patch&rdquo;.</p>
<h3>V4 vs V3.2: real generational upgrade<span class="hx:absolute hx:-mt-20" id="v4-vs-v32-real-generational-upgrade"></span>
    <a href="#v4-vs-v32-real-generational-upgrade" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><table>
  <thead>
      <tr>
          <th></th>
          <th>V4 Flash</th>
          <th>V4 Pro (DeepClaude)</th>
          <th>V4 Pro (opencode)</th>
          <th>V3.2 (previous)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Score</td>
          <td>78</td>
          <td><strong>89</strong></td>
          <td>69 (mixed authorship)</td>
          <td>43</td>
      </tr>
      <tr>
          <td>Tier</td>
          <td>B</td>
          <td><strong>A</strong></td>
          <td>unmeasurable</td>
          <td>C</td>
      </tr>
      <tr>
          <td>Harness</td>
          <td>completes</td>
          <td>completes</td>
          <td>reroute to Opus fallback</td>
          <td>completes</td>
      </tr>
      <tr>
          <td>Time</td>
          <td>2m 35s</td>
          <td>18m</td>
          <td>22m</td>
          <td>60m</td>
      </tr>
      <tr>
          <td>Cost/run</td>
          <td>~$0.01</td>
          <td>~$3.14</td>
          <td>~$0.50 (Opus underneath)</td>
          <td>~$0.07</td>
      </tr>
      <tr>
          <td>RubyLLM API</td>
          <td>correct</td>
          <td>correct</td>
          <td>correct (snippet only)</td>
          <td>invented <code>RubyLLM::Client</code></td>
      </tr>
      <tr>
          <td>Deliverables</td>
          <td>mostly present</td>
          <td>full</td>
          <td>written by Opus fallback</td>
          <td>decent</td>
      </tr>
  </tbody>
</table>
<p>V4 Flash is the cheap option that works. V4 Pro needs DeepClaude (Claude Code with env vars pointed at OpenRouter) to deliver its full potential; in opencode it stays unmeasurable.</p>
<h2>Kimi: K2.5 → K2.6<span class="hx:absolute hx:-mt-20" id="kimi-k25--k26"></span>
    <a href="#kimi-k25--k26" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><h3>K2.6 (87/100, Tier A)<span class="hx:absolute hx:-mt-20" id="k26-87100-tier-a"></span>
    <a href="#k26-87100-tier-a" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Positive surprise. Only non-Western model at Tier A. Textbook code:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ruby" data-lang="ruby"><span class="line"><span class="cl"><span class="n">chat</span> <span class="o">=</span> <span class="no">RubyLLM</span><span class="o">.</span><span class="n">chat</span>
</span></span><span class="line"><span class="cl"><span class="n">chat</span><span class="o">.</span><span class="n">with_instructions</span><span class="p">(</span><span class="no">SYSTEM_INSTRUCTION</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">historical_messages</span><span class="o">.</span><span class="n">each</span> <span class="k">do</span> <span class="o">|</span><span class="n">msg</span><span class="o">|</span>
</span></span><span class="line"><span class="cl">  <span class="n">chat</span><span class="o">.</span><span class="n">add_message</span><span class="p">(</span><span class="ss">role</span><span class="p">:</span> <span class="n">msg</span><span class="o">[</span><span class="ss">:role</span><span class="o">]</span><span class="p">,</span> <span class="ss">content</span><span class="p">:</span> <span class="n">msg</span><span class="o">[</span><span class="ss">:content</span><span class="o">]</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="k">end</span>
</span></span><span class="line"><span class="cl"><span class="n">response</span> <span class="o">=</span> <span class="n">chat</span><span class="o">.</span><span class="n">ask</span><span class="p">(</span><span class="n">user_message</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">response</span><span class="o">.</span><span class="n">content</span></span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>A combination unique among the Chinese models: <code>FakeChat</code> matching the real API, rescue of <code>RubyLLM::Error</code> (with a flash via turbo_stream), session cookie with a <code>MAX_MESSAGES = 50</code> cap. Complete Gemfile. Multi-worker safe.</p>
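<p>What that rescue-to-flash path looks like, sketched as a Rails controller. Controller, partial and helper names are mine, not K2.6&rsquo;s literal code:</p>
<pre><code class="language-ruby">class MessagesController &lt; ApplicationController
  MAX_MESSAGES = 50 # the session-cookie history is trimmed to this in chat_for_session

  def create
    reply = chat_for_session.ask(params[:content]) # chat_for_session: hypothetical helper
    render turbo_stream: turbo_stream.append(&quot;messages&quot;,
      partial: &quot;messages/message&quot;, locals: { content: reply.content })
  rescue RubyLLM::Error =&gt; e
    # API failure becomes a flash rendered over turbo_stream, not a 500
    render turbo_stream: turbo_stream.prepend(&quot;flash&quot;,
      partial: &quot;shared/flash&quot;, locals: { alert: e.message })
  end
end</code></pre>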
<p>The only deduction is the full-history replay on each turn (it burns more tokens than MiMo&rsquo;s persistent pattern).</p>
<p>At $0.30/run, Kimi K2.6 is the <strong>cheapest Tier A in the benchmark</strong>. 3 to 50× cheaper than Opus 4.7 and GPT 5.4 xHigh.</p>
<h3>K2.5 (69/100 Tier B, not Tier 3 as I had claimed)<span class="hx:absolute hx:-mt-20" id="k25-69100-tier-b-not-tier-3-as-i-had-claimed"></span>
    <a href="#k25-69100-tier-b-not-tier-3-as-i-had-claimed" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>In the first version I cataloged K2.5 as Tier 3 for supposedly inventing <code>chat.complete</code> and using kwargs in <code>add_message</code>. Both are real public API. K2.5 comes back to Tier B.</p>
<p>What keeps it below K2.6: its 37 tests (the most in the benchmark) never mock RubyLLM. They only test PORO CRUD and <code>respond_to?</code>. Test coverage without mock fidelity measures nothing. And it uses class-var storage (<code>Chat.storage = @storage ||= {}</code>), worse than a Singleton because there&rsquo;s no mutex.</p>
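<p>What &ldquo;no mutex&rdquo; means in practice, sketched. The first shape mirrors the K2.5 pattern; the guarded variant is the minimal fix and still process-local:</p>
<pre><code class="language-ruby"># The K2.5 shape: a bare class-level Hash shared by every Puma thread.
# Two concurrent appends to the same session can interleave and lose messages.
class Chat
  class &lt;&lt; self
    attr_accessor :storage
  end
end
Chat.storage = {}

# Minimal thread-safe variant: same idea, guarded by a Mutex.
class SafeStore
  def initialize
    @data  = {}
    @mutex = Mutex.new
  end

  def append(session_id, message)
    @mutex.synchronize { (@data[session_id] ||= []) &lt;&lt; message }
  end
end</code></pre>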
<p>K2.6 adds what K2.5 didn&rsquo;t have: <code>FakeChat</code> with the right signature, rescue, session cookie. Real evolution in the dimensions that matter.</p>
<h2>Xiaomi MiMo V2.5 Pro: looked like Tier 1, but the gaps are real<span class="hx:absolute hx:-mt-20" id="xiaomi-mimo-v25-pro-looked-like-tier-1-but-the-gaps-are-real"></span>
    <a href="#xiaomi-mimo-v25-pro-looked-like-tier-1-but-the-gaps-are-real" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>MiMo V2.5 Pro (April 2026) generated the most hype of this round. In the first analysis I promoted it as &ldquo;first non-Anthropic Tier 1&rdquo;. After the new rubric, it drops to Tier B (67/100).</p>
<p>RubyLLM code is still clean:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ruby" data-lang="ruby"><span class="line"><span class="cl"><span class="vi">@llm_chat</span> <span class="o">=</span> <span class="no">RubyLLM</span><span class="o">::</span><span class="no">Chat</span><span class="o">.</span><span class="n">new</span><span class="p">(</span><span class="ss">model</span><span class="p">:</span> <span class="no">MODEL</span><span class="p">,</span> <span class="ss">provider</span><span class="p">:</span> <span class="no">PROVIDER</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">response</span> <span class="o">=</span> <span class="vi">@llm_chat</span><span class="o">.</span><span class="n">ask</span><span class="p">(</span><span class="n">content</span><span class="p">,</span> <span class="o">&amp;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="vi">@messages</span> <span class="o">&lt;&lt;</span> <span class="p">{</span> <span class="ss">role</span><span class="p">:</span> <span class="ss">:assistant</span><span class="p">,</span> <span class="ss">content</span><span class="p">:</span> <span class="n">response</span><span class="o">.</span><span class="n">content</span> <span class="p">}</span></span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p><code>Chat.new(model:, provider:)</code> is a valid public constructor. <code>.ask(content, &amp;)</code> forwards a streaming block. Zero manual <code>add_message</code> calls. Multi-turn delegated to RubyLLM itself. <strong>It&rsquo;s the cleanest pattern any model in this benchmark produced.</strong></p>
<p>But the holistic score is 67 because the other components have gaps:</p>
<ul>
<li><strong>Tests don&rsquo;t exercise the LLM path.</strong> The 21 tests only check constants and blank guards. <code>ChatSessionTest</code> has no happy-path test. If the LLM call worked or silently failed, no test would catch it.</li>
<li><strong>No error handling.</strong> No <code>rescue</code> around <code>@chat.ask</code>. Rate limit becomes 500 with stack trace on screen.</li>
<li><strong><code>ChatStore</code> Singleton is process-local.</strong> Dies on Puma restart, doesn&rsquo;t work with <code>WEB_CONCURRENCY &gt; 1</code>, no TTL.</li>
<li><strong>No <code>with_instructions</code></strong>: the model doesn&rsquo;t know its role.</li>
</ul>
<p>Around those four gaps are smaller ones: Docker without healthcheck, no <code>restart: unless-stopped</code>, no Thruster (Rails 8.1 default), short README.</p>
<p>Genuine advantage: idiomatic use of the library. For a greenfield single-worker app, it&rsquo;s less code with the same functionality. Opus&rsquo;s manual replay pattern is actually more defensive than the library itself recommends.</p>
<p>At $0.14/run in 11 minutes, it&rsquo;s <strong>8× cheaper and faster than Opus</strong>. For throwaway prototypes, it&rsquo;s genuinely useful.</p>
<p><strong>Verdict</strong>: MiMo is ~70% of Opus quality at 12.5% of the price. For a throwaway prototype, worth it. For production, budget ~2 engineer-hours to add rescue + Rails.cache + FakeChat + WebMock + a system prompt. At that point Opus at $1.10 comes out cheaper overall.</p>
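<p>The first chunk of that patching, sketched against MiMo&rsquo;s own snippet above. The system prompt wording and the fallback object are assumptions:</p>
<pre><code class="language-ruby">Fallback = Struct.new(:content) # hypothetical value object the view can render

class ChatSession
  SYSTEM_PROMPT = &quot;You are the assistant behind this chat app.&quot; # assumed wording

  def initialize
    @llm_chat = RubyLLM::Chat.new(model: MODEL, provider: PROVIDER)
    @llm_chat.with_instructions(SYSTEM_PROMPT) # closes the missing-role gap
  end

  def ask(content, &amp;)
    @llm_chat.ask(content, &amp;)
  rescue RubyLLM::Error =&gt; e
    # Closes the no-rescue gap: a rate limit degrades into a message, not a 500
    Rails.logger.warn(&quot;LLM call failed: #{e.message}&quot;)
    Fallback.new(&quot;The assistant is unavailable right now.&quot;)
  end
end</code></pre>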
<h2>Gemini 3.1 Pro: the quiet surprise (82/100 Tier A)<span class="hx:absolute hx:-mt-20" id="gemini-31-pro-the-quiet-surprise-82100-tier-a"></span>
    <a href="#gemini-31-pro-the-quiet-surprise-82100-tier-a" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>I had classified it as Tier 3. Re-audit showed <code>Chat.new</code> and <code>add_message</code> kwargs form are real API.</p>
<p>Strong points: real Turbo Streams (not fetch + innerHTML), Rails.cache-backed persistence with 2h expiry, FakeChat mocks matching the real API, error path tested.</p>
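<p>The Rails.cache-backed persistence pattern, sketched. The cache key naming is mine; <code>fetch</code>/<code>write</code> with <code>expires_in</code> are real Rails cache API:</p>
<pre><code class="language-ruby"># Transcript keyed by session id; survives page reloads and works across
# Puma workers as long as the cache store is shared (Redis, SolidCache).
def history
  Rails.cache.fetch(&quot;chat/#{session.id}&quot;, expires_in: 2.hours) { [] }
end

def append(role:, content:)
  messages = history &lt;&lt; { role:, content: }
  Rails.cache.write(&quot;chat/#{session.id}&quot;, messages, expires_in: 2.hours)
end</code></pre>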
<p>Critical weakness: it uses the model string <code>claude-3.7-sonnet</code> instead of the current Sonnet 4.x. A one-line fix.</p>
<p>At $0.40/run, Gemini 3.1 Pro is the other cost-effective Tier A alongside Kimi K2.6.</p>
<h2>Qwen family: distillation doesn&rsquo;t save you<span class="hx:absolute hx:-mt-20" id="qwen-family-distillation-doesnt-save-you"></span>
    <a href="#qwen-family-distillation-doesnt-save-you" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Tested several Qwens:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Type</th>
          <th style="text-align: center">Tier</th>
          <th>Detail</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Qwen 3.6 Plus</td>
          <td>Cloud, OpenRouter</td>
          <td style="text-align: center">B (71)</td>
          <td>Correct RubyLLM API; tests make real calls (no WebMock); client-side JS history</td>
      </tr>
      <tr>
          <td>Qwen 3.5 35B</td>
          <td>Local NVIDIA</td>
          <td style="text-align: center">C (55)</td>
          <td>Correct entry point; no multi-turn; test wraps real call in <code>rescue =&gt; e; assert true</code></td>
      </tr>
      <tr>
          <td>Qwen 3.5 122B</td>
          <td>Local NVIDIA</td>
          <td style="text-align: center">D (37)</td>
          <td>Doesn&rsquo;t use ruby_llm; uses <code>Openrouter::Client</code> (wrong casing)</td>
      </tr>
      <tr>
          <td>Qwen 3 Coder Next</td>
          <td>Local NVIDIA</td>
          <td style="text-align: center">D (32)</td>
          <td>Invents <code>RubyLLM::Client.new</code>; commits placeholder <code>.env</code></td>
      </tr>
      <tr>
          <td>Qwen 3.5 27B Claude-distilled</td>
          <td>Local NVIDIA</td>
          <td style="text-align: center">still Tier 3 on re-review</td>
          <td>Code looks Claude-shaped, but invents the entire API</td>
      </tr>
  </tbody>
</table>
<h3>The discovery: Claude distillation doesn&rsquo;t transfer library knowledge<span class="hx:absolute hx:-mt-20" id="the-discovery-claude-distillation-doesnt-transfer-library-knowledge"></span>
    <a href="#the-discovery-claude-distillation-doesnt-transfer-library-knowledge" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>I ran <a href="https://huggingface.co/Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled"target="_blank" rel="noopener">Jackrong&rsquo;s Qwen 3.5 27B distilled from Claude 4.6 Opus</a>. The promise was &ldquo;Claude-at-home&rdquo; running locally. The result: the code comes out Claude-styled (frozen_string_literal, Response value objects, layered architecture, doc comments), but functionally it&rsquo;s full-blown hallucination:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ruby" data-lang="ruby"><span class="line"><span class="cl"><span class="no">RubyLLM</span><span class="o">::</span><span class="no">Chat</span><span class="o">.</span><span class="n">new</span><span class="o">.</span><span class="n">with_model</span><span class="p">(</span><span class="vi">@model</span><span class="p">)</span> <span class="k">do</span> <span class="o">|</span><span class="n">chat</span><span class="o">|</span>
</span></span><span class="line"><span class="cl">  <span class="n">conversation_history</span><span class="o">.</span><span class="n">each</span> <span class="k">do</span> <span class="o">|</span><span class="n">msg</span><span class="o">|</span>
</span></span><span class="line"><span class="cl">    <span class="n">chat</span><span class="o">.</span><span class="n">add_message</span><span class="p">(</span><span class="ss">role</span><span class="p">:</span> <span class="ss">:user</span><span class="p">,</span> <span class="ss">content</span><span class="p">:</span> <span class="n">msg</span><span class="o">[</span><span class="ss">:content</span><span class="o">]</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="k">end</span>
</span></span><span class="line"><span class="cl">  <span class="n">response</span> <span class="o">=</span> <span class="n">chat</span><span class="o">.</span><span class="n">ask</span><span class="p">(</span><span class="n">message</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="no">Response</span><span class="o">.</span><span class="n">new</span><span class="p">(</span><span class="ss">content</span><span class="p">:</span> <span class="n">response</span><span class="o">.</span><span class="n">text</span><span class="p">,</span> <span class="ss">usage</span><span class="p">:</span> <span class="n">build_usage</span><span class="p">(</span><span class="n">response</span><span class="p">))</span>
</span></span><span class="line"><span class="cl"><span class="k">end</span></span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>Two issues stand out: the invented block form <code>.with_model(@model) do |chat| ... end</code> (it doesn&rsquo;t exist in the gem), and <code>response.text</code> instead of the real <code>response.content</code>.</p>
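<p>For contrast, the same replay written against the real gem surface, as verified elsewhere in this audit. Note that a compliant replay also preserves each message&rsquo;s original role, which the distilled code didn&rsquo;t:</p>
<pre><code class="language-ruby"># Real RubyLLM surface: RubyLLM.chat, add_message(role:, content:), ask,
# and .content on the response. No block form of with_model, no response.text.
chat = RubyLLM.chat(model: @model)
conversation_history.each do |msg|
  chat.add_message(role: msg[:role], content: msg[:content]) # keep the original role
end
response = chat.ask(message)
response.content</code></pre>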
<p>Distillation transferred the <strong>style</strong> and stopped there. Factual memory about a specific library is binary recall: it&rsquo;s either in the weights or not. Claude&rsquo;s reasoning traces don&rsquo;t contain repetitions like &ldquo;use <code>ask</code>, not <code>complete</code>&rdquo; because that&rsquo;s not reasoning, it&rsquo;s raw memory.</p>
<p>If you need the model to actually use RubyLLM or any less popular library, Claude distillation won&rsquo;t save you. Use actual Claude or a large cloud model.</p>
<h3>Coder vs General: the inverse surprise<span class="hx:absolute hx:-mt-20" id="coder-vs-general-the-inverse-surprise"></span>
    <a href="#coder-vs-general-the-inverse-surprise" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Intuitively, models with &ldquo;Coder&rdquo; in the name should be better for programming. In this benchmark, the opposite happened. Of the three Qwen Coders tested:</p>
<ul>
<li>Qwen 3 Coder 30B returned a hardcoded mock string instead of calling RubyLLM</li>
<li>Qwen 2.5 Coder 32B timed out at 90 minutes with zero files</li>
<li>Qwen 3.5 27B Sushi Coder RL had infra failure</li>
</ul>
<p>Meanwhile the general Qwen 3.5 35B-A3B completed in 5 minutes with a recognizable Rails project. My reading: the &ldquo;Coders&rdquo; were fine-tuned on isolated problems (Codeforces, Leetcode, short snippets), far from the long-running agentic flows with tool calling. &ldquo;Coder = better for coding agent&rdquo; is false for this kind of task.</p>
<h3>Qwen 3.6 Plus: the cloud that almost arrived<span class="hx:absolute hx:-mt-20" id="qwen-36-plus-the-cloud-that-almost-arrived"></span>
    <a href="#qwen-36-plus-the-cloud-that-almost-arrived" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Qwen 3.6 Plus on OpenRouter completed the benchmark in 17 minutes and is the cleanest Qwen I&rsquo;ve measured (71/100 Tier B). Correct RubyLLM API. Well-built Stimulus controller. But tests make real calls to OpenRouter (no WebMock), and history is client-side JS only (lost on refresh). Uses <code>fetch</code> + <code>innerHTML</code> instead of Turbo Streams.</p>
<h2>GLM family: 5, 5.1, local 4.7 Flash<span class="hx:absolute hx:-mt-20" id="glm-family-5-51-local-47-flash"></span>
    <a href="#glm-family-5-51-local-47-flash" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><h3>GLM 5 (64/100 Tier B)<span class="hx:absolute hx:-mt-20" id="glm-5-64100-tier-b"></span>
    <a href="#glm-5-64100-tier-b" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p><code>RubyLLM.chat(model: &quot;anthropic/claude-sonnet-4&quot;)</code> + <code>chat.ask</code> + <code>response.content</code> correct. Mocha stubs match the real API. One happy-path test, no error-path coverage.</p>
<p><strong>Critical weakness</strong>: zero multi-turn state. Every POST creates a fresh <code>RubyLLM.chat</code> without history. The &ldquo;chat app&rdquo; is a stateless echo service. &ldquo;What did I just say?&rdquo; → &ldquo;I don&rsquo;t know.&rdquo;</p>
<h3>GLM 5.1 (46/100 Tier C, dropped)<span class="hx:absolute hx:-mt-20" id="glm-51-46100-tier-c-dropped"></span>
    <a href="#glm-51-46100-tier-c-dropped" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>This is the most painful fall from the re-audit. <code>RubyLLM.chat(model:, provider:)</code> is correct, but history is replayed via <code>c.user(msg)</code> / <code>c.assistant(msg)</code>, fluent DSL that <strong>doesn&rsquo;t exist</strong> in RubyLLM (confirmed via grep of the gem source).</p>
<p>Worse: every HTTP request constructs a new <code>ChatSession.new</code> that discards history. The two bugs mask each other: the invented DSL rarely fires because there&rsquo;s never any history to replay.</p>
<p>Stimulus controller uses <code>fetch</code> + manual <code>innerHTML</code>. SSE-based but not Turbo Streams.</p>
<p>Per Z.ai itself, <a href="https://openrouter.ai/z-ai/glm-5.1/benchmarks"target="_blank" rel="noopener">GLM 5.1 beats GPT 5.4 and Opus 4.6 on SWE-Bench Pro</a>. In this specific benchmark (Rails + RubyLLM + complete deliverables), it drops to Tier C. More evidence that benchmarks are specific.</p>
<h3>GLM 4.7 Flash bf16 local (52/100 Tier C)<span class="hx:absolute hx:-mt-20" id="glm-47-flash-bf16-local-52100-tier-c"></span>
    <a href="#glm-47-flash-bf16-local-52100-tier-c" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>The local model that most dominates the RubyLLM API. Uses the fluent chain <code>.with_model().with_temperature().with_params().with_instructions().complete(&amp;block)</code>, all real API per gem source.</p>
<p><strong>Fatal bug</strong>: <code>gem &quot;ruby_llm&quot;</code> is in <code>group :development, :test</code> with <code>require: false</code>. Doesn&rsquo;t load in production. <code>NameError</code> at boot. Uses class-var <code>Message.all</code> (process-local).</p>
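<p>The repair is a one-line Gemfile move, sketched (the broken shape is GLM 4.7 Flash&rsquo;s; the fix is standard Bundler practice):</p>
<pre><code class="language-ruby"># Broken: the gem only loads in development/test, NameError at production boot
group :development, :test do
  gem &quot;ruby_llm&quot;, require: false
end

# Fixed: top-level, bundled and required in every environment
gem &quot;ruby_llm&quot;</code></pre>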
<h2>Grok: from Tier D to Tier B (4.20 → 4.3)<span class="hx:absolute hx:-mt-20" id="grok-from-tier-d-to-tier-b-420--43"></span>
    <a href="#grok-from-tier-d-to-tier-b-420--43" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Grok 4.3 entered the benchmark in May 2026 and deserves a separate section, because it&rsquo;s the biggest intra-family jump in the whole benchmark. Grok 4.20 scored 25/100 Tier D (still listed in the losers section right below). Grok 4.3 scores <strong>72/100 Tier B</strong>, ranking between Qwen 3.6 Plus (71) and Kimi K2.5 (69). 47 points of jump. Cost $1.74 per run (15m wall time, 1.0M input + 18K output + 2.2M cache_read tokens, 102 step finishes).</p>
<p>What it does well:</p>
<ul>
<li><strong>Correct RubyLLM API on all five calls</strong>: <code>RubyLLM::Chat.new(model:)</code> + <code>add_message(role:, content:)</code> + <code>chat.ask</code> + <code>response.content</code> + <code>RubyLLM::Error</code> rescue. Verified against the gem 1.14.1 source, no hallucinations. A counter-grep comes back clean: <code>grep -nE &quot;RubyLLM::Client|chat\.complete|chat\.send_message|chat\.user|chat\.assistant&quot;</code> returns 0 hits.</li>
<li><strong>Cleanest controller in the benchmark</strong>: 48 lines, no service-object over-engineering, no invented fluent DSL.</li>
<li><strong>Real Turbo Streams</strong>: the server-side reactive parts work.</li>
<li><strong>Material deliverables</strong>: real README, real <code>compose.yaml</code>, working multi-stage Dockerfile, cookie-based session persistence.</li>
</ul>
<p>The killer weakness is awkward to describe: <strong>Stimulus is dead at runtime.</strong> The <code>app/javascript/application.js</code> is a one-liner comment, no <code>import &quot;./controllers&quot;</code>, no <code>Application.start()</code>. The compiled <code>app/assets/builds/application.js</code> weighs 48 bytes (just a sourcemap pointer). The <code>controllers/index.js</code> imports <code>./application</code>, which doesn&rsquo;t exist. Result: every <code>data-controller=&quot;chat&quot;</code> is inert. Enter-to-send, autoresize, autoscroll, clear-input — all silently broken. Phase 2 of the benchmark (boot validation) reported &ldquo;local boot OK&rdquo; without exercising the JS layer, so the gap slipped through.</p>
<p>Other smaller issues:</p>
<ul>
<li>Tests stub <code>RubyLLM.stub :chat</code> but the controller calls <code>RubyLLM::Chat.new</code>. The stub is bypassed, so the test either hits the network for real or fails for a missing key (see the sketch after this list).</li>
<li>Stale model pin: <code>anthropic/claude-3.7-sonnet</code> in code, README says &ldquo;latest Claude Sonnet&rdquo; (current is 4.7).</li>
<li>No <code>with_instructions</code> for system prompt.</li>
</ul>
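<p>The stub mismatch from the first bullet above, sketched with plain Minitest stubbing. <code>FakeChat</code> and the route helper are hypothetical; the point is that the stub has to wrap the call site the controller actually uses:</p>
<pre><code class="language-ruby">require &quot;minitest/mock&quot;

fake = FakeChat.new(reply: &quot;stubbed answer&quot;) # hypothetical double exposing ask/.content

# Grok 4.3 stubbed RubyLLM.chat, but the controller calls RubyLLM::Chat.new,
# so the stub was bypassed. Stub the constructor the controller actually hits:
RubyLLM::Chat.stub :new, fake do
  post chat_messages_path, params: { content: &quot;Hi&quot; } # hypothetical route helper
  assert_response :success
end</code></pre>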
<p>The direct cost comparison is unfavorable: $1.74/run for Grok 4.3 vs $0.30/run for Kimi K2.6 (Tier A, 87/100). That&rsquo;s <strong>nearly 6× more expensive for a worse result</strong>. Grok 4.3 sits in an awkward price/quality slot: good enough for Tier B, but Tier B has options an order of magnitude cheaper (Kimi K2.5 at $0.10) and Tier A has options at a fraction of the price (Kimi K2.6 at $0.30). xAI still has to find the lane where Grok makes economic sense.</p>
<p>The general read: Grok 4.3 shows the family materially improved generation over generation (47 points is a huge jump). But it hasn&rsquo;t reached the level of the models that dominate the top yet. Worth keeping an eye on the next generation.</p>
<h2>The losers (Tier C/D): MiniMax, Grok 4.20, Step, DeepSeek V3.2, GPT OSS<span class="hx:absolute hx:-mt-20" id="the-losers-tier-cd-minimax-grok-420-step-deepseek-v32-gpt-oss"></span>
    <a href="#the-losers-tier-cd-minimax-grok-420-step-deepseek-v32-gpt-oss" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><table>
  <thead>
      <tr>
          <th>Model</th>
          <th style="text-align: right">Score</th>
          <th style="text-align: center">Tier</th>
          <th>Reason</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Step 3.5 Flash</td>
          <td style="text-align: right">56</td>
          <td style="text-align: center">C</td>
          <td>Bypasses <code>ruby_llm</code> entirely with <code>Net::HTTP</code>; not compliant with prompt</td>
      </tr>
      <tr>
          <td>DeepSeek V3.2</td>
          <td style="text-align: right">43</td>
          <td style="text-align: center">C</td>
          <td>Invents <code>RubyLLM::Client.new</code> and <code>client.chat(messages: [...])</code>; tests mock a class that doesn&rsquo;t exist</td>
      </tr>
      <tr>
          <td>MiniMax M2.7</td>
          <td style="text-align: right">41</td>
          <td style="text-align: center">C</td>
          <td>Invents <code>RubyLLM.chat(model:, messages: [...])</code> batch signature; crashes on first call</td>
      </tr>
      <tr>
          <td>Qwen 3.5 122B</td>
          <td style="text-align: right">37</td>
          <td style="text-align: center">D</td>
          <td>Uses <code>Openrouter::Client</code> (wrong casing); calls <code>client.chat</code> on a method the gem doesn&rsquo;t expose</td>
      </tr>
      <tr>
          <td>Qwen 3 Coder Next</td>
          <td style="text-align: right">32</td>
          <td style="text-align: center">D</td>
          <td>Invents <code>RubyLLM::Client.new</code> + <code>client.chat(messages:)</code> + OpenAI-shaped response</td>
      </tr>
      <tr>
          <td>Grok 4.20</td>
          <td style="text-align: right">25</td>
          <td style="text-align: center">D</td>
          <td><code>ruby-openai</code> in <code>:development, :test</code> with <code>require: false</code> → NameError in production; uncompilable Stimulus JS</td>
      </tr>
      <tr>
          <td>GPT OSS 20B</td>
          <td style="text-align: right">11</td>
          <td style="text-align: center">D</td>
          <td>No tests folder; nested <code>app/app/</code>; invents <code>RubyLLM::Client.new</code></td>
      </tr>
  </tbody>
</table>
<h2>Models that didn&rsquo;t complete the benchmark<span class="hx:absolute hx:-mt-20" id="models-that-didnt-complete-the-benchmark"></span>
    <a href="#models-that-didnt-complete-the-benchmark" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>The ranking above lists the 24 models that <strong>completed</strong> the benchmark enough to be audited. But the benchmark covers more than that. In total, 34+ were configured, 28+ ran, and only 17-24 (depending on the criterion) completed enough to become auditable code. The ones that failed deserve to be logged:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Harness</th>
          <th>Problem</th>
          <th>Root cause</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Gemma 4 31B (local, llama.cpp)</td>
          <td>local</td>
          <td>Infinite tool-call loop after ~11 steps</td>
          <td><a href="https://github.com/ggml-org/llama.cpp/issues/21375"target="_blank" rel="noopener">llama.cpp bug #21375</a>, partially fixed in b8665</td>
      </tr>
      <tr>
          <td>Gemma 4 31B (Ollama Cloud)</td>
          <td>cloud</td>
          <td>504 timeout at ~20K tokens</td>
<td>Cloudflare 100s edge timeout; benchmark runs exceed 50K tokens, so it can never finish</td>
      </tr>
      <tr>
          <td>Llama 4 Scout (local)</td>
          <td>local</td>
          <td>Tool calls emitted as plain text</td>
          <td>llama.cpp has no parser for Llama 4&rsquo;s pythonic format (vLLM does)</td>
      </tr>
      <tr>
          <td>Qwen 3 32B (local)</td>
          <td>local</td>
          <td>Too slow (7.3 tok/s)</td>
          <td>Hardware bottleneck (fits in VRAM, bandwidth doesn&rsquo;t deliver)</td>
      </tr>
      <tr>
          <td>Qwen 2.5 Coder 32B (local)</td>
          <td>local</td>
          <td>90-minute timeout with zero files</td>
          <td>Infinite reasoning loop without calling the write tools</td>
      </tr>
      <tr>
          <td>GPT 5.4 Pro (OpenRouter)</td>
          <td>cloud</td>
          <td>Stalled after tool-calls</td>
          <td>OpenRouter tool-calling integration broken for GPT; use Codex CLI instead</td>
      </tr>
  </tbody>
</table>
<p>Also tested but didn&rsquo;t make the final comparison for various reasons: Qwen 3.5 27B Claude-distilled (covered in the Qwen section as an example of distillation not transferring library knowledge), Qwen 3 Coder 30B (returned a hardcoded mock string instead of calling RubyLLM), Qwen 3.5 27B Sushi Coder RL (llama-swap infra failure), Qwen 3.6 35B local (best Qwen local result, but missing <code>.content</code> on the return value, a 1-line fix away from working).</p>
<p>Practical lesson from these failures: <strong>half the challenge of running open source locally in 2026 lives in the toolchain</strong>. llama.cpp bugs, Ollama lifecycle, missing tool-call parsers, Ollama Cloud Cloudflare timeouts. Even if the model is good, if the stack that runs it hangs every 11 tool calls, in practice you can&rsquo;t use it.</p>
<h2>The Chinese situation: the gap in practice<span class="hx:absolute hx:-mt-20" id="the-chinese-situation-the-gap-in-practice"></span>
    <a href="#the-chinese-situation-the-gap-in-practice" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>The benchmark covers essentially every Chinese LLM family with recent releases: <strong>Moonshot</strong> (Kimi), <strong>DeepSeek</strong>, <strong>Xiaomi</strong> (MiMo), <strong>Alibaba</strong> (Qwen), <strong>Z.ai</strong> (GLM), <strong>MiniMax</strong> and <strong>StepFun</strong> (Step). Worth consolidating the whole picture, because the &ldquo;China has caught up&rdquo; narrative is more optimistic than this benchmark sustains.</p>
<h3>Distribution by tier<span class="hx:absolute hx:-mt-20" id="distribution-by-tier"></span>
    <a href="#distribution-by-tier" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><ul>
<li><strong>Tier A (80+)</strong>: <strong>Kimi K2.6</strong> (87) and <strong>DeepSeek V4 Pro</strong> via DeepClaude (89, with the caveat that it requires a custom harness).</li>
<li><strong>Tier B (60-79)</strong>: Kimi K2.5 (69), DeepSeek V4 Flash (78), Xiaomi MiMo V2.5 Pro (67), Qwen 3.6 Plus (71), GLM 5 (64). DeepSeek V4 Pro in opencode scored 69 but with mixed authorship (reroute to Opus fallback), so unmeasurable in that harness.</li>
<li><strong>Tier C (40-59)</strong>: Step 3.5 Flash (56), GLM 4.7 Flash local (52), GLM 5.1 (46), DeepSeek V3.2 (43), MiniMax M2.7 (41).</li>
<li><strong>Tier D (&lt;40)</strong>: Qwen 3.5 122B local (37), Qwen 3 Coder Next local (32).</li>
</ul>
<p>Out of 13 Chinese models tested (counting both local and cloud), <strong>two</strong> reach Tier A: Kimi K2.6 with no caveats and DeepSeek V4 Pro only through DeepClaude. That&rsquo;s the current penetration in this specific benchmark.</p>
<h3>The gap in points<span class="hx:absolute hx:-mt-20" id="the-gap-in-points"></span>
    <a href="#the-gap-in-points" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p><strong>Kimi K2.6 vs Opus 4.7 (Tier A vs Tier A)</strong>: 87 vs 97. 10-point gap. In practice, both deliver correct RubyLLM, real-signature FakeChat, error rescue, multi-worker-safe session cookie, complete Gemfile. What Opus 4.7 has extra are secondary dimensions that accumulate: tests that cover error wrapping, model/provider override and explicit system prompt application; redundant rescue in the controller beyond the service; slightly better concerns separation. Perceptible differences side by side, but not tier-separated.</p>
<p>Cost: K2.6 $0.30/run vs Opus 4.7 $1.10/run. <strong>3.6× cheaper.</strong> In continuous production runs, that difference accumulates.</p>
<p><strong>Chinese Tier B vs Claude Opus 4.6 (83 Tier A)</strong>: 5 to 20-point gap. The Chinese Tier B models (MiMo, DeepSeek V4 Flash, Kimi K2.5, Qwen 3.6 Plus, GLM 5) have correct or near-correct RubyLLM code, but fail on specific components:</p>
<ul>
<li><strong>Test quality</strong> is the universal weakest spot. Chinese Tier B models often write many tests (K2.5 wrote 37, the most in the benchmark) but don&rsquo;t mock RubyLLM. Coverage theater.</li>
<li><strong>Persistence</strong> frequently uses process-local Singleton (MiMo) or class-var (K2.5) instead of session cookie or Rails.cache. Dies on restart, not multi-worker safe.</li>
<li><strong>Error handling</strong> is usually absent. Rate limit becomes 500 with a stack trace visible to the user.</li>
</ul>
<h3>Patterns by family<span class="hx:absolute hx:-mt-20" id="patterns-by-family"></span>
    <a href="#patterns-by-family" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p><strong>Moonshot (Kimi)</strong>: the most disciplined Chinese family. K2.5 at Tier B and K2.6 at Tier A. Real generational evolution in the dimensions that matter (tests, rescue, persistence).</p>
<p><strong>DeepSeek</strong>: pattern of tool support lagging. Each generation has better RubyLLM code than the previous (V3.2 invented everything, V4 Flash writes correct, V4 Pro writes perfect). In ai-sdk harnesses (opencode included), V4 Pro stays incompatible because of the <code>reasoning_content</code> echo protocol. But via DeepClaude (Claude Code with env vars swapping the endpoint), the same model delivers Tier A at 89/100. The marketing matches the product, as long as you use the right harness.</p>
<p><strong>Xiaomi (MiMo)</strong>: writes the most <strong>idiomatic</strong> RubyLLM code in the benchmark (persistent <code>@chat</code>, zero manual <code>add_message</code>). But forgets real tests, rescue, robust persistence. Lands at Tier B with prototype-demo quality, not production.</p>
<p><strong>Alibaba (Qwen)</strong>: high variance. Qwen 3.6 Plus cloud reaches Tier B. Qwen 3.5 35B local at Tier C with patches. Qwen 3.5 122B and the &ldquo;Coders&rdquo; (Qwen 3 Coder Next, Qwen 2.5 Coder 32B) fall to Tier D for inventing API or hanging. &ldquo;Coder&rdquo; in the name isn&rsquo;t a guarantee for coding agent.</p>
<p><strong>Z.ai (GLM)</strong>: GLM 5 at reasonable Tier B. GLM 5.1 <strong>regressed</strong> to Tier C, with invented fluent DSL plus history discard. Z.ai claims superiority in SWE-Bench Pro, but this specific benchmark caught two structural bugs. GLM 4.7 Flash local gets close but the gem in the <code>:test</code> group kills it in production.</p>
<p><strong>MiniMax</strong> and <strong>StepFun (Step)</strong> don&rsquo;t deliver. MiniMax invented a batch signature that crashes on first call. Step bypasses the <code>ruby_llm</code> gem entirely with <code>Net::HTTP</code>, violating the prompt.</p>
<h3>Conclusion on the Chinese gap<span class="hx:absolute hx:-mt-20" id="conclusion-on-the-chinese-gap"></span>
    <a href="#conclusion-on-the-chinese-gap" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>For this specific benchmark (Rails + RubyLLM + complete deliverables):</p>
<ul>
<li><strong>Two Chinese models</strong> deliver Tier A: <strong>Kimi K2.6</strong> (87/100, 10-point gap vs Opus, 3.6× cheaper) and <strong>DeepSeek V4 Pro</strong> (89/100 through DeepClaude — needs the custom harness, but fits within the budget of anyone who already has an <code>OPENROUTER_API_KEY</code>).</li>
<li><strong>Five Chinese models</strong> deliver Tier B usable with 1-2h of patching: K2.5, V4 Flash, MiMo, Qwen 3.6 Plus, GLM 5.</li>
<li><strong>The rest are not usable</strong> in production for this kind of task.</li>
</ul>
<p>The narrative of &ldquo;China has already caught up with the West in coding LLMs&rdquo; needs to be read with caveats. On synthetic reasoning benchmarks, maybe. On the benchmark of delivering a complete Rails app with every part functional? Two models caught up, with the caveat that one of them (V4 Pro) only works in a non-default harness. The others are still a generation behind.</p>
<h2>The reality of running open source locally<span class="hx:absolute hx:-mt-20" id="the-reality-of-running-open-source-locally"></span>
    <a href="#the-reality-of-running-open-source-locally" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Every viable open source in the benchmark landed at Tier C or worse. Worth explaining why.</p>
<h3>VRAM + KV Cache: the math nobody does<span class="hx:absolute hx:-mt-20" id="vram--kv-cache-the-math-nobody-does"></span>
    <a href="#vram--kv-cache-the-math-nobody-does" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Take Qwen3 32B. FP16 takes ~64 GB. Quantized to Q4, it drops to ~19 GB. So it fits on a 32 GB RTX 5090, right? <strong>Wrong.</strong> That&rsquo;s only the model weights. You&rsquo;re missing the KV Cache.</p>
<p>KV Cache is the memory the model uses to &ldquo;remember&rdquo; what it already read. It scales <strong>linearly</strong> with context:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code>KV Memory = 2 × Layers × KV_Heads × Head_Dim × Bytes_per_Element × Context_Tokens</code></pre></div>
</div>
<p>For a Llama 3.1 70B in BF16, that&rsquo;s ~0.31 MB per token. A 128K-token context = <strong>40 GB</strong> of KV Cache alone. In real benchmark runs, models consumed between 39K and 156K tokens. Less than 100K context isn&rsquo;t practical for a coding agent.</p>
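<p>The arithmetic, worked for Llama 3.1 70B (80 layers, 8 KV heads via GQA, head dim 128, BF16 = 2 bytes per element, per the public model card):</p>
<pre><code class="language-ruby">layers, kv_heads, head_dim, bytes = 80, 8, 128, 2

per_token = 2 * layers * kv_heads * head_dim * bytes # the K plane plus the V plane
# =&gt; 327_680 bytes, ~0.31 MB per token

context  = 128 * 1024
total_gb = per_token * context / 1024.0**3
# =&gt; 40.0 GB of KV Cache before a single model weight is loaded</code></pre>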
<p>Google published <a href="https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/"target="_blank" rel="noopener">TurboQuant</a> (ICLR 2026), which compresses KV Cache to 3 bits without accuracy loss, 6× reduction and up to 8× speedup. Not yet implemented in llama.cpp or Ollama.</p>
<h3>Memory bandwidth rules, capacity doesn&rsquo;t<span class="hx:absolute hx:-mt-20" id="memory-bandwidth-rules-capacity-doesnt"></span>
    <a href="#memory-bandwidth-rules-capacity-doesnt" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>&ldquo;But I have 128 GB of RAM!&rdquo; Nice, but what matters is memory bandwidth. Brutal differences:</p>
<table>
  <thead>
      <tr>
          <th>Memory</th>
          <th style="text-align: right">Bandwidth</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DDR4</td>
          <td style="text-align: right">~50 GB/s</td>
      </tr>
      <tr>
          <td>LPDDR5x (AMD Strix)</td>
          <td style="text-align: right">~256 GB/s</td>
      </tr>
      <tr>
          <td>GDDR6 (RTX 3090)</td>
          <td style="text-align: right">~936 GB/s</td>
      </tr>
      <tr>
          <td>GDDR7 (RTX 5090)</td>
          <td style="text-align: right">~1,792 GB/s</td>
      </tr>
      <tr>
          <td>HBM3 (Mac Studio M4 Ultra)</td>
          <td style="text-align: right">~800 GB/s</td>
      </tr>
  </tbody>
</table>
<p>RTX 5090 has 7× more bandwidth than my Minisforum&rsquo;s LPDDR5x. On the AMD, Qwen3 32B runs at ~7 tok/s; on the 5090, it would be much faster if it fit.</p>
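<p>A back-of-envelope check on why bandwidth is the ceiling: memory-bound decode streams every active weight once per generated token, so token throughput is bounded by bandwidth over working-set size. Numbers rounded, overheads ignored:</p>
<pre><code class="language-ruby">weights_gb = 19.0  # Qwen3 32B at Q4, from the VRAM section above
bandwidth  = 256.0 # GB/s, the LPDDR5x figure from the table

ceiling = bandwidth / weights_gb
# =&gt; ~13.5 tok/s theoretical best; the observed ~7 tok/s is the usual
#    fraction of peak once KV Cache reads and compute overhead land</code></pre>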
<p>For Mac Studio M4 Ultra with 512 GB of unified memory (~$10k), it&rsquo;s practical but pricey. For AMD Ryzen AI Max with LPDDR5x 256 GB/s, it&rsquo;s accessible but slow. For DDR4 desktop, it&rsquo;s unviable.</p>
<h3>Ollama vs llama.cpp: each with its own problems<span class="hx:absolute hx:-mt-20" id="ollama-vs-llamacpp-each-with-its-own-problems"></span>
    <a href="#ollama-vs-llamacpp-each-with-its-own-problems" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p><a href="https://ollama.com/"target="_blank" rel="noopener">Ollama</a> failed 6 of 8 local benchmark attempts: mid-session model unloading, context drift, flaky lifecycle, broken bf16. I migrated to <a href="https://github.com/mostlygeek/llama-swap"target="_blank" rel="noopener">llama-swap</a> (Go wrapper around llama-server). It fixed the lifecycle but brought other problems: each model needs specific flags (GLM/Qwen 3.5 need <code>--reasoning-format none</code> for <code>&lt;think&gt;</code> tags), tool-call parsers depend on the model (Llama 4&rsquo;s pythonic format doesn&rsquo;t parse), Gemma 4 requires build b8665+ and still enters repetition loops after ~11 tool calls.</p>
<p>Plug-and-play it isn&rsquo;t.</p>
<h2>The harness also matters<span class="hx:absolute hx:-mt-20" id="the-harness-also-matters"></span>
    <a href="#the-harness-also-matters" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p><strong>The same Opus 4.7 model produced measurably worse code on Claude Code than on opencode.</strong> The difference is harness context pollution. Claude Code uses 6-11M cache-read tokens per run (vs ~210K on opencode) and that nudges the model toward generic patterns like the OpenAI SDK instead of RubyLLM-specific.</p>
<p>Practical translation: even with the same model, the tool you invoke it through changes the output. For stable model testing, I ran everything on opencode.</p>
<h2>Multi-model isn&rsquo;t worth it for greenfield coding<span class="hx:absolute hx:-mt-20" id="multi-model-isnt-worth-it-for-greenfield-coding"></span>
    <a href="#multi-model-isnt-worth-it-for-greenfield-coding" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Question that shows up every week in my feed: is it worth configuring two models in the same project? Opus for planning, GLM for executing, Haiku for boilerplate? I tested 7 combinations across 3 harnesses:</p>
<ul>
<li>Claude Code: Opus 4.7 alone (baseline), Opus 4.7 + Sonnet 4.6 subagent, Opus 4.7 + Haiku 4.5 subagent</li>
<li>opencode: Opus 4.7 + GLM 5.1 subagent, Opus 4.7 + Qwen 3.6 local subagent</li>
<li>Codex: GPT 5.4 xHigh + GPT 5.4 medium, GPT 5.4 xHigh + GPT 5.4 low</li>
</ul>
<p><strong>Zero of the 7 runs voluntarily delegated to the subagent.</strong> The main model did 100% of the work alone in every case, even with the subagent declared with aggressive language like &ldquo;Use PROACTIVELY&rdquo; and &ldquo;ALWAYS delegate to this agent for code implementation&rdquo;. Three reasons:</p>
<h3>Reason 1: Tier A models already plan-then-execute internally<span class="hx:absolute hx:-mt-20" id="reason-1-tier-a-models-already-plan-then-execute-internally"></span>
    <a href="#reason-1-tier-a-models-already-plan-then-execute-internally" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Opus 4.7 and GPT 5.4 xHigh recognize when a task calls for planning + implementation and already do that <strong>within the same thought session</strong>, without external context switch. The &ldquo;let&rsquo;s split this into design + code&rdquo; reasoning happens internally. Externalizing that with two separate models breaks the continuous context and gains nothing.</p>
<h3>Reason 2: coordination costs are high<span class="hx:absolute hx:-mt-20" id="reason-2-coordination-costs-are-high"></span>
    <a href="#reason-2-coordination-costs-are-high" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Subagents receive a reduced prompt, partial context, and have to return structured output the main model consumes. All that back-and-forth adds tokens, latency, and opportunities for misalignment. The main model prefers to absorb the cost of doing it alone rather than pay the cost of coordinating with another model. This is rational: economically, the subagent would only pay off if it were much cheaper AND produced equivalent output. Usually it&rsquo;s cheaper <strong>but worse</strong>.</p>
<h3>Reason 3: there&rsquo;s no clean plan-vs-execute line in greenfield<span class="hx:absolute hx:-mt-20" id="reason-3-theres-no-clean-plan-vs-execute-line-in-greenfield"></span>
    <a href="#reason-3-theres-no-clean-plan-vs-execute-line-in-greenfield" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>On a Rails app from scratch, you plan while implementing, revise decisions when you hit concrete problems, rebalance trade-offs along the way. Splitting that artificially between two agents forces a separation that doesn&rsquo;t reflect how the work actually happens on a cohesive task.</p>
<h3>The &ldquo;Opus plans, Qwen executes&rdquo; case<span class="hx:absolute hx:-mt-20" id="the-opus-plans-qwen-executes-case"></span>
    <a href="#the-opus-plans-qwen-executes-case" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>The worst combination was Opus + Qwen 3.6 local. Opus is Tier A on RubyLLM, Qwen is Tier C. The theory: Opus (expensive) plans, Qwen (free) executes. In practice:</p>
<ol>
<li>Opus didn&rsquo;t delegate. It did everything alone.</li>
<li>If it had delegated, the Qwen code would come out Qwen-quality (invents API, no correct mocks).</li>
<li>Opus ↔ Qwen coordination isn&rsquo;t free.</li>
</ol>
<p>Three false assumptions in the theory. Practical conclusion: <strong>if you&rsquo;re going to pay Opus to plan, let it do everything</strong>. That&rsquo;s what it wants to do naturally. Forcing delegation to an inferior model adds coordination without reducing cost and makes the result worse.</p>
<h3>When multi-model does make sense<span class="hx:absolute hx:-mt-20" id="when-multi-model-does-make-sense"></span>
    <a href="#when-multi-model-does-make-sense" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Not on a cohesive greenfield task. It makes sense on separate pipelines where each model has a well-defined scope: fast classifier filtering input, large model processing only the subset that passed the filter, small model translating output. That&rsquo;s not &ldquo;Opus plans Qwen executes&rdquo; in the same agent session, it&rsquo;s multi-service architecture with different APIs.</p>
<p>For interactive coding agent on a real project, the rule of thumb is: pick one good Tier A model, use it alone, optimize the prompt instead of the orchestration.</p>
<h2>Best commercial, best open source<span class="hx:absolute hx:-mt-20" id="best-commercial-best-open-source"></span>
    <a href="#best-commercial-best-open-source" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><h3>Commercial<span class="hx:absolute hx:-mt-20" id="commercial"></span>
    <a href="#commercial" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p><strong>Tier A premium</strong>: Claude Opus 4.7, GPT 5.4 xHigh (Codex), GPT 5.5 xHigh (Codex), Claude Opus 4.6. Pick by behavior preference; code quality is there on all of them. Among the GPTs, 5.5 costs 40% less than 5.4 at the same output.</p>
<p><strong>Tier A cost-effective</strong>: Kimi K2.6 ($0.30/run), Gemini 3.1 Pro ($0.40/run). 3-4× cheaper than Opus/GPT with comparable quality within this benchmark.</p>
<p><strong>Tolerable Tier B</strong>: Claude Sonnet 4.6 ($0.63), DeepSeek V4 Flash ($0.01), Xiaomi MiMo V2.5 Pro ($0.14). Needs 1-2h of patching to ship to production.</p>
<p><strong>Cheapest useful</strong>: DeepSeek V4 Flash ($0.01/run) with the <code>anthropic/</code> prefix fix on the model slug.</p>
<h3>Open source (local)<span class="hx:absolute hx:-mt-20" id="open-source-local"></span>
    <a href="#open-source-local" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p><strong>No Tier A.</strong> The best local that ran was <strong>Qwen 3.5 35B-A3B</strong> at Tier C (55/100), and it needs 1-2 correction prompts to deliver working code. For those with an RTX 5090 who want to escape vendor lock-in, that&rsquo;s the model to run, but with hobby/lab expectations, not production.</p>
<p><strong>Qwen 3.6 35B local</strong> is the closest to working I&rsquo;ve seen (1-line fix to add <code>.content</code>), but still without multi-turn.</p>
<p><strong>GLM 4.7 Flash local</strong> is the most RubyLLM-literate local, but the gem in the wrong group kills the app in production. Trivial fix, structural impediment.</p>
<p><strong>In 2026, running open source locally for coding agent is viable with caution, on high-end hardware (RTX 5090 or Mac M4 Ultra), accepting 1-3 corrections per run.</strong> It&rsquo;s not a substitute for Claude yet.</p>
<h2>Conclusion<span class="hx:absolute hx:-mt-20" id="conclusion"></span>
    <a href="#conclusion" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Claude Opus 4.6 is still my daily pick for coding agent on real projects. Predictable behavior, defensive code, real mocks, sensible persistence, error handling.</p>
<p>Opus 4.7 leads the objective benchmark (97/100) but has a behavior downgrade (heavier tokenizer, aggressive resource optimization). On benchmark it delivers. In daily use I prefer 4.6.</p>
<p>GPT 5.4 xHigh via Codex ties at the top (97/100), and <strong>GPT 5.5 xHigh landed third at 96/100</strong>: same quality as 5.4, but 40% cheaper and 20% faster. For anyone already on Codex, 5.5 replaces 5.4 with no regression. At 10× Opus&rsquo;s price (used to be 15×) for essentially tied quality, it&rsquo;s still pricey for continuous use.</p>
<p>The cost-effective sweet spot now is <strong>Kimi K2.6 at $0.30/run</strong> or <strong>Gemini 3.1 Pro at $0.40/run</strong>. Both Tier A. 3-4× cheaper than Opus.</p>
<p>Local open source hasn&rsquo;t reached Tier A yet. The best is Qwen 3.5 35B-A3B at Tier C with corrections. For experimentation, worth it. For production, not yet.</p>
<p>The biggest lesson from the re-audit is about process: <strong>when you start classifying models based on &ldquo;hallucinations&rdquo; you think you found, go to the library source and check directly</strong>. I was discarding models for valid API calls. Once I checked against the gem, Kimi, Gemini, DeepSeek V4 Flash and GPT 5.4 all improved. And once the criterion included deliverables (docker-compose, substantive README, bundle-audit), models that looked clean but delivered incomplete projects (Step 3.5 Flash via bypass) fell to the tier that reflected that. And the DeepSeek V4 Pro case showed that harness choice can be as decisive as the model: under opencode the run is mixed authorship (Opus fallback underneath) and unmeasurable; through DeepClaude the same model delivers Tier A.</p>
<p>Benchmark is a tool, not truth. What I test here is a specific Rails app with a specific library. Models may perform differently on other tasks. The methodology is explicit in <a href="https://github.com/akitaonrails/llm-coding-benchmark/blob/master/docs/audit_prompt_template.md"target="_blank" rel="noopener"><code>audit_prompt_template.md</code></a> for anyone who wants to replicate, adapt, or challenge it.</p>
<p>The <a href="https://github.com/akitaonrails/llm-coding-benchmark"target="_blank" rel="noopener">repo</a> is public with all the success_reports and diffs.</p>
<h2>Sources<span class="hx:absolute hx:-mt-20" id="sources"></span>
    <a href="#sources" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><ul>
<li><a href="https://platform.claude.com/docs/en/about-claude/models/whats-new-claude-4-7"target="_blank" rel="noopener">Claude Opus 4.7 — What&rsquo;s new</a>, Anthropic</li>
<li><a href="https://platform.claude.com/docs/en/about-claude/models/migration-guide"target="_blank" rel="noopener">Migration guide Claude 4.6 → 4.7</a>, new tokenizer</li>
<li><a href="https://github.com/anthropics/claude-code/issues/52368"target="_blank" rel="noopener">GitHub issue #52368 — Opus 4.7 instability</a></li>
<li><a href="https://dev.to/vibeagentmaking/why-we-switched-back-from-claude-opus-47-to-46-47f9"target="_blank" rel="noopener">DEV.to — Why We Switched Back from Opus 4.7 to 4.6</a></li>
<li><a href="https://openai.com/index/introducing-gpt-5-5/"target="_blank" rel="noopener">OpenAI — Introducing GPT-5.5</a></li>
<li><a href="https://community.openai.com/t/gpt-5-5-is-here-available-in-codex-and-chatgpt-today/1379630"target="_blank" rel="noopener">OpenAI Community — GPT-5.5 is here in Codex</a></li>
<li><a href="https://simonwillison.net/2026/Apr/23/gpt-5-5/"target="_blank" rel="noopener">Simon Willison — GPT-5.5 via Codex</a></li>
<li><a href="https://openrouter.ai/z-ai/glm-5.1/benchmarks"target="_blank" rel="noopener">GLM 5.1 benchmarks on OpenRouter</a></li>
<li><a href="https://www.buildfastwithai.com/blogs/glm-5-1-open-source-review-2026"target="_blank" rel="noopener">GLM 5.1 review — Build Fast With AI</a></li>
<li><a href="https://huggingface.co/Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled"target="_blank" rel="noopener">Jackrong&rsquo;s Qwen 3.5 27B Claude Opus Distilled</a></li>
<li><a href="https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/"target="_blank" rel="noopener">TurboQuant (Google Research, ICLR 2026)</a></li>
<li><a href="https://x.com/TheAhmadOsman/status/2040103488714068245"target="_blank" rel="noopener">Ahmad Osman — GPU Memory Math for LLMs (2026)</a></li>
<li><a href="https://github.com/mostlygeek/llama-swap"target="_blank" rel="noopener">llama-swap</a>, llama.cpp wrapper</li>
<li><a href="https://github.com/ggml-org/llama.cpp/issues/21375"target="_blank" rel="noopener">llama.cpp issue #21375 — Gemma 4 tool call loops</a></li>
</ul>
]]></content:encoded><category>llm</category><category>benchmark</category><category>claude</category><category>ai</category><category>vibecoding</category><category>open-source</category><category>self-hosting</category></item><item><title>How DriveClub and shadPS4 Almost Defeated AI and Me: How to Learn</title><link>https://akitaonrails.github.io/en/2026/04/23/driveclub-shadps4-e-ia-como-aprender/</link><guid isPermaLink="true">https://akitaonrails.github.io/en/2026/04/23/driveclub-shadps4-e-ia-como-aprender/</guid><pubDate>Thu, 23 Apr 2026 12:00:00 GMT</pubDate><description>&lt;h2&gt;Where we were&lt;span class="hx:absolute hx:-mt-20" id="where-we-were"&gt;&lt;/span&gt;
&lt;a href="#where-we-were" class="subheading-anchor" aria-label="Permalink for this section"&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Last week I &lt;a href="https://akitaonrails.github.io/en/2026/04/19/retrogames-de-corrida-favoritos-no-distrobox/"&gt;posted about my distrobox-gaming setup&lt;/a&gt;, with more than ten retro racing games running on Linux, from Gran Turismo 1 on DuckStation to Forza Motorsport 4 on Xenia. At the end of that piece, I described Driveclub on shadPS4 as the &amp;ldquo;final boss&amp;rdquo;: the game boots, the menu loads, the race starts, the controller responds.&lt;/p&gt;
&lt;p&gt;Except I mentioned two problems in passing and treated them as &amp;ldquo;accepted limitations&amp;rdquo;:&lt;/p&gt;</description><content:encoded><![CDATA[<h2>Where we were<span class="hx:absolute hx:-mt-20" id="where-we-were"></span>
    <a href="#where-we-were" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Last week I <a href="/en/2026/04/19/retrogames-de-corrida-favoritos-no-distrobox/">posted about my distrobox-gaming setup</a>, with more than ten retro racing games running on Linux, from Gran Turismo 1 on DuckStation to Forza Motorsport 4 on Xenia. At the end of that piece, I described Driveclub on shadPS4 as the &ldquo;final boss&rdquo;: the game boots, the menu loads, the race starts, the controller responds.</p>
<p>Except I mentioned two problems in passing and treated them as &ldquo;accepted limitations&rdquo;:</p>
<ol>
<li><strong>During daytime, the image is a bit darker than on a reference PS4.</strong> Not alarming, but noticeable.</li>
<li><strong>At night, the race start in Canada or Munnar went pitch black for 10 to 30 seconds</strong> before the game &ldquo;recovered on its own&rdquo; around the 1:30 mark. During those seconds, only the HUD was visible. Track, car, headlights — all absolute black.</li>
</ol>
<p>In the previous article I pushed the explanation to &ldquo;it&rsquo;s just not fixed in the emulator yet, I&rsquo;ll live with it&rdquo;. But deep down I couldn&rsquo;t bring myself to play the game in that state. Thirty seconds of absolute black isn&rsquo;t &ldquo;a bit darker&rdquo;; it&rsquo;s a serious bug, the kind that makes a game unplayable in practice. I wanted to play DriveClub at its best, and I didn&rsquo;t want to sit around waiting for the project to fix it for me. I decided to take matters into my own hands.</p>
<p>This article is what happened between Sunday night and Thursday morning.</p>
<h2>The blackout, live<span class="hx:absolute hx:-mt-20" id="the-blackout-live"></span>
    <a href="#the-blackout-live" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><video
  controls
  preload="metadata"
  playsinline
  class="html5-video"
  style="width:100%; max-width:960px; display:block; margin:1.25em auto; border-radius:8px; background:#000;"
  >
  <source src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/videos/original-blackout.mp4" type="video/mp4">
  Your browser does not support HTML5 video.
</video>

<p>This video is the behavior I wanted to fix. Race start in India (Munnar), night race. You see the track, headlights turn on, the first few seconds run. Then it darkens progressively. Within about 15 seconds everything is black. Stays that way for another 60-90 seconds. Then the game &ldquo;recalibrates&rdquo; on its own and the track reappears.</p>
<p>If you&rsquo;re a player, you don&rsquo;t care why. You just want to drive.</p>
<h2>Why shadPS4 is hard to debug<span class="hx:absolute hx:-mt-20" id="why-shadps4-is-hard-to-debug"></span>
    <a href="#why-shadps4-is-hard-to-debug" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p><a href="https://shadps4.net/"target="_blank" rel="noopener">shadPS4</a> <strong>still has no stable release</strong>. The project is under active development. The main branch changes several times a week, configs migrate from TOML to JSON without notice, settings hide behind &ldquo;Advanced&rdquo; so users with incompatible hardware don&rsquo;t file issues, open PRs compete over different approaches to the same problem.</p>
<p>Anyone trying to configure DriveClub today will find:</p>
<ul>
<li>September/2025 Reddit guides using <code>readbacks = true</code> (it was a boolean back then) saying &ldquo;for DriveClub always enable readbacks&rdquo;.</li>
<li>November/2025 guides saying &ldquo;disable readbacks, it kills performance&rdquo; (because in 2025-07 readbacks became an enum <code>Disabled | Relaxed | Precise</code> with different performance profiles, and Bloodborne started hanging with it on).</li>
<li>January/2026 YouTube videos using some custom fork nobody documented.</li>
<li>A thread on shadps4.net with three different, mutually exclusive explanations for the blackout.</li>
<li>My own notes from last week in <a href="https://github.com/akitaonrails/distrobox-gaming/blob/master/docs/driveclub-shadps4.md"target="_blank" rel="noopener"><code>distrobox-gaming/docs/driveclub-shadps4.md</code></a> saying <code>readbacks_mode: 0</code>. The exact opposite of what I found this week.</li>
</ul>
<p>So the information exists, but it&rsquo;s scattered, dated, contradictory, and most of it depends on version details that changed between the original post and today. Reproducing a setup from a YouTube video is like trying to solve a Rubik&rsquo;s cube blindfolded.</p>
<h2>Spoiler: the solution is one integer<span class="hx:absolute hx:-mt-20" id="spoiler-the-solution-is-one-integer"></span>
    <a href="#spoiler-the-solution-is-one-integer" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>After 31 phases of investigation, 44 commits on my shadPS4 fork, 15,668 lines of instrumental code added, and three days of nearly uninterrupted debugging, the answer is <strong>one line in the per-game config</strong>:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;GPU&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;readbacks_mode&#34;</span><span class="p">:</span> <span class="mi">2</span>
</span></span><span class="line"><span class="cl">  <span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span></span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p><code>readbacks_mode: 2</code> is the <code>Precise</code> mode. Explaining what it does requires understanding how DriveClub implements auto-exposure, and that&rsquo;s where the journey starts. To know WHY this toggle fixes things, you have to understand the whole chain.</p>
<p>The short version: <strong>DriveClub implements auto-exposure as a GPU→CPU feedback loop</strong>.</p>
<ol>
<li>The scene renders to an HDR target.</li>
<li>A compute shader calculates a luminance histogram into an SSBO.</li>
<li>The CPU <strong>reads</strong> that SSBO on the next frame and derives a target exposure.</li>
<li>The CPU writes that exposure back into a 1936-byte lighting UBO (at slots <code>[38] [48] [50]</code>).</li>
<li>Fragment shaders read that UBO and multiply scene luminance by it.</li>
</ol>
<p>Without <code>readbacks_mode: Precise</code>, step 3 reads stale zeros. The memory page where the GPU wrote the histogram is never synchronized to the CPU side. The CPU exposure integrator concludes &ldquo;scene is dark, open the aperture all the way&rdquo; and ramps its output value monotonically: <code>2.59 → 7.84 → 24 → 90 → 179 → 255</code>. Within 60-90 seconds the scale saturates so high that everything clips to zero. That&rsquo;s the pitch black.</p>
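<p>To make the failure mode concrete, here is a minimal C++ sketch of a generic auto-exposure integrator fed a stale, all-zero histogram. This is my illustration of the mechanism, not DriveClub&rsquo;s decompiled code, and every name in it is invented:</p>
<pre><code class="language-cpp">#include &lt;algorithm&gt;
#include &lt;cstdint&gt;

// Hypothetical integrator: chases a target exposure derived from the
// average luminance of the histogram the GPU wrote last frame.
static float g_exposure = 1.0f;

float UpdateExposure(const uint32_t histogram[256], float dt) {
    uint64_t total = 0, weighted = 0;
    for (int bin = 0; bin &lt; 256; ++bin) {
        total    += histogram[bin];
        weighted += uint64_t(histogram[bin]) * bin;
    }
    // Stale readback: every bin is 0, so avg_lum is 0, so the integrator
    // concludes "scene is pitch black, open the aperture all the way".
    float avg_lum = total ? float(weighted) / float(total) / 255.0f : 0.0f;
    float target  = std::clamp(0.18f / std::max(avg_lum, 1e-4f), 0.1f, 255.0f);
    g_exposure += (target - g_exposure) * dt;  // monotonic ramp, never converges
    return g_exposure;
}
</code></pre>
<p>With zeros coming in forever, <code>target</code> stays pinned at the ceiling, and the output traces exactly the kind of 2.59 → 7.84 → 24 → 90 → 179 → 255 ramp that shows up in the logs.</p>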
<p>With <code>readbacks_mode: Precise</code> on, the emulator marks the pages the GPU just wrote as kernel-protected (via <code>mprotect</code> on Linux). When the CPU tries to read one of those pages for the first time, a page fault fires. The emulator intercepts it, issues a <code>vkCmdCopyBuffer</code> download from the GPU buffer to the host staging area, waits on the scheduler, and the CPU finally sees fresh data. The real histogram enters the integrator, exposure converges normally, and the race start opens bright as it should.</p>
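<p>The fault-driven synchronization is the interesting part. Below is a hedged sketch of the general Linux technique, reduced to one page and one buffer; shadPS4&rsquo;s real implementation lives in its buffer cache and is considerably more involved:</p>
<pre><code class="language-cpp">#include &lt;signal.h&gt;
#include &lt;cstddef&gt;
#include &lt;sys/mman.h&gt;

static void*  g_page;               // page where the GPU result lands
static size_t g_page_size = 4096;

// Stand-in for the real work: a vkCmdCopyBuffer from the GPU buffer to
// host-visible memory, followed by a wait on the scheduler.
static void DownloadFromGpu(void* host_ptr) { (void)host_ptr; }

static void FaultHandler(int, siginfo_t* info, void*) {
    if (info-&gt;si_addr == g_page) {
        // Unprotect first so the download and the retried read succeed.
        mprotect(g_page, g_page_size, PROT_READ | PROT_WRITE);
        DownloadFromGpu(g_page);    // the CPU now sees fresh data
    }
}

// Called right after the GPU writes the buffer: any CPU touch of the
// page faults, and the handler pulls fresh data on demand.
static void ProtectAfterGpuWrite() {
    struct sigaction sa{};
    sa.sa_flags     = SA_SIGINFO;
    sa.sa_sigaction = FaultHandler;
    sigaction(SIGSEGV, &amp;sa, nullptr);

    mprotect(g_page, g_page_size, PROT_NONE);
}
</code></pre>
<p>That per-fault round trip (signal, copy, scheduler stall) is exactly the cost being discussed next.</p>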
<p>Why isn&rsquo;t it the default? Three documented reasons. First, per-page <code>mprotect</code> + per-fault GPU stall + compute dispatch + scheduler wait has a real cost: on NVIDIA&rsquo;s warm path it&rsquo;s tolerable, on AMD it tanks below 12 FPS, with an <a href="https://github.com/shadps4-emu/shadPS4/issues/3322"target="_blank" rel="noopener">open issue</a> about it. Second, <a href="https://github.com/shadps4-emu/shadPS4/issues/3826"target="_blank" rel="noopener">Bloodborne hangs</a> on loading screens with Precise enabled. Third, the maintainer tucked the option behind &ldquo;Advanced&rdquo; in the UI precisely to keep users with sensitive hardware from enabling it by mistake. The mode has existed since <code>v0.15.0</code> (September/2025), but <strong>nobody had publicly connected <code>Precise</code> + DriveClub + auto-exposure feedback loop</strong>. The <a href="https://github.com/shadps4-emu/shadPS4/issues/3346"target="_blank" rel="noopener">compat tracker already admits</a> that DriveClub &ldquo;requires readbacks enabled to function properly&rdquo;, but doesn&rsquo;t specify the mode.</p>
<p>In my setup the performance cost is irrelevant. I have an RTX 5090 on my desktop, plenty of hardware to swallow the mprotect, the page fault, the copy and the stall per frame. The call is trivial: I flip Precise on, pay the price, the game runs. On more modest hardware or on AMD the conversation is different, which is exactly why this setting isn&rsquo;t default and hides behind &ldquo;Advanced&rdquo;.</p>
<p>One thing I want to make clear: this whole explanation above (<code>Precise</code> vs <code>Relaxed</code> vs <code>Disabled</code>, GPU→CPU feedback loop, per-page mprotect syscall) is not knowledge I had on Sunday night. I didn&rsquo;t know what a buffer readback was, didn&rsquo;t know DriveClub implemented auto-exposure as a GPU→CPU feedback loop, didn&rsquo;t know <code>readbacks_mode</code> had become an enum in 2025 with three values. To arrive at the right answer, I had to learn every piece from scratch. I couldn&rsquo;t guess. Had I tried to &ldquo;enable readbacks&rdquo; last week without understanding the chain, I&rsquo;d have picked the wrong value (Relaxed isn&rsquo;t enough for this feedback loop, as I&rsquo;ll explain below) and given up thinking &ldquo;it doesn&rsquo;t work&rdquo;.</p>
<p>Two more things need to be in place for this to work, both covered in the <a href="/en/2026/04/19/retrogames-de-corrida-favoritos-no-distrobox/#driveclub-the-impossible-finally-possible">previous article</a>:</p>
<ul>
<li><strong>v1.28 patch applied</strong> on top of the v1.00 base install (without it, content stays locked with &ldquo;not yet released&rdquo;).</li>
<li><strong>60fps XML patch disabled</strong> (it raises the render rate but the logic tickrate is fixed; with it on, the game runs in slow motion).</li>
</ul>
<h2>The result<span class="hx:absolute hx:-mt-20" id="the-result"></span>
    <a href="#the-result" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><video
  controls
  preload="metadata"
  playsinline
  class="html5-video"
  style="width:100%; max-width:960px; display:block; margin:1.25em auto; border-radius:8px; background:#000;"
  >
  <source src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/videos/canada-fixed.mp4" type="video/mp4">
  Your browser does not support HTML5 video.
</video>

<p>Same race in Canada, night, with <code>readbacks_mode: 2</code> on. Race start opens bright, auto-exposure converges immediately, the night transition unfolds naturally as the game&rsquo;s TOD (time of day) advances. No blackout, no &ldquo;recalibration&rdquo; at 1:30, no 30-second window of absolute black. This is DriveClub running like it does on PS4.</p>
<p>Now the interesting question is not &ldquo;which setting&rdquo;. It&rsquo;s: <strong>how did we get here?</strong></p>
<h2>How does a junior learn today?<span class="hx:absolute hx:-mt-20" id="how-does-a-junior-learn-today"></span>
    <a href="#how-does-a-junior-learn-today" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>There&rsquo;s a question I hear often: &ldquo;with AI doing everything, how does a junior learn?&rdquo;</p>
<p>The premise is wrong. <strong>AI doesn&rsquo;t do everything.</strong> AI does what you ask it to. It doesn&rsquo;t discover on its own. It has no domain initiative. It&rsquo;s not capable of telling you &ldquo;hey, this looks like a broken GPU→CPU readback feedback loop&rdquo; unless you already know there&rsquo;s a thing called a readback and that broken feedback loops are a class of bug.</p>
<p>Worse: if you go to an AI agent today and ask &ldquo;fix this black screen in DriveClub on shadPS4&rdquo;, it will give you a list of possible answers based on old forum posts, with a high probability of steering you wrong. It will suggest the 60fps XML patch (wrong — causes slow motion). It will suggest a tonemap override (wrong — the game already writes correct SDR). It will suggest a vblank_frequency tweak (useless for the real problem). It will suggest manual gamma (treats the symptom, not the cause).</p>
<p>The way to discover is to <strong>dive into the chain</strong>, phase by phase, ruling out wrong hypotheses until the terrain is familiar enough that you recognize the right answer when it appears.</p>
<p>That&rsquo;s where a lot of people misread me. &ldquo;Akita is already a senior, he has 25+ years of experience, of course he knows.&rdquo; <strong>False.</strong> A senior is someone who has been a junior in every topic they master. And they keep becoming a junior every time they pick up a new domain. I have 25 years of web programming, distributed systems, Ruby, Erlang, Go. But <strong>I&rsquo;ve never programmed for PS4</strong>. I&rsquo;d never looked at the shadPS4 codebase. I didn&rsquo;t know PS4&rsquo;s graphics architecture. On Sunday I was an absolute junior in this domain.</p>
<p>But junior doesn&rsquo;t mean totally green. A while back I had explored on my YouTube channel the low-level architecture of older consoles (from the NES&rsquo;s 6502 onward) and how emulators work internally in general. The two videos below helped build the basic vocabulary for how a game console works under the hood, and made sure I wouldn&rsquo;t trip on concepts like fetch-decode-execute, opcode, interrupt, or HLE vs LLE.</p>


<div class="embed-container">
  <iframe
    src="https://www.youtube.com/embed/hYJ3dvHjeOE"
    title="YouTube video player"
    frameborder="0"
    allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share"
    referrerpolicy="strict-origin-when-cross-origin"
    allowfullscreen>
  </iframe>
</div>



<div class="embed-container">
  <iframe
    src="https://www.youtube.com/embed/vUqLLpUJ47s"
    title="YouTube video player"
    frameborder="0"
    allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share"
    referrerpolicy="strict-origin-when-cross-origin"
    allowfullscreen>
  </iframe>
</div>

<p>I had also made an older video on the evolution of CPUs, GPUs, DirectX, and Vulkan, so at least the vocabulary of &ldquo;modern graphics pipeline&rdquo; and &ldquo;shading language&rdquo; wasn&rsquo;t alien.</p>


<div class="embed-container">
  <iframe
    src="https://www.youtube.com/embed/JEp7ozWqIps"
    title="YouTube video player"
    frameborder="0"
    allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share"
    referrerpolicy="strict-origin-when-cross-origin"
    allowfullscreen>
  </iframe>
</div>

<p>Knowing that <code>VkCommandBuffer</code> exists is very different from knowing where in shadPS4 a readback <code>vkCmdCopyBuffer</code> fires. But it gives you a base. That kind of old curiosity paid interest now.</p>
<h2>The rotation: Claude Code and Codex<span class="hx:absolute hx:-mt-20" id="the-rotation-claude-code-and-codex"></span>
    <a href="#the-rotation-claude-code-and-codex" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>AI (in my case <a href="https://claude.com/claude-code"target="_blank" rel="noopener">Claude Code</a>) didn&rsquo;t &ldquo;know&rdquo; more than I did. It has the same dated forum text you&rsquo;d find on Reddit. What it did well was <strong>aggregate</strong>: read shadPS4 source code in parallel with me, index comments, cross-reference between phase docs, compile, run probes, parse 61 MB logs, compare decompiled binary against UBO diffs, shine light on corners of the system faster than I could alone.</p>
<p>But the decision to &ldquo;keep investigating&rdquo; was mine. The perseverance of &ldquo;don&rsquo;t accept a mitigation, want the root cause&rdquo; was mine. And when the AI suggested &ldquo;honestly, we should just accept the current result and document it&rdquo; (and it suggested this <strong>several times</strong> over the four days), I was the one who had to say &ldquo;no, we keep going&rdquo;.</p>
<p>The central point about the rotation: <strong>no single LLM managed to close this problem alone.</strong></p>
<p>Over the four days, Claude Code repeatedly got stuck looping on the same hypotheses with no new direction. When that happened, I&rsquo;d switch to Codex for a few hours to bring in outside ideas, different probes, a re-read of the codebase from a fresh angle. Then back to Claude Code to integrate what Codex had surfaced. And so on, alternating.</p>
<p>The final resolution happened on Claude Code, but the journey switched models several times. Each LLM had its biases and blind spots. Left alone, each one would have given up sooner.</p>
<p>What broke the impasse was me continuing to bring in new ideas to try, forcing the model to reassess, switching models whenever one started to repeat itself.</p>
<p>At two points things went past &ldquo;just repeating hypotheses&rdquo;. Codex literally had an agentic-loop glitch: it started repeating the exact same &ldquo;solution&rdquo; it had just tried, over and over, in sequence, without noticing what it was doing. It was the first time I&rsquo;d ever seen that happen in any AI agent. I had to kill the session and start a fresh one from zero to break the cycle.</p>
<p>That says something about context size and session duration. The complexity and pace of this investigation were enough to bring both LLMs to their knees, each in their own way.</p>
<p>The whole journey below is documented in 33 phase docs on the <a href="https://github.com/akitaonrails/shadPS4/tree/gamma-debug/docs/driveclub-investigation"target="_blank" rel="noopener"><code>gamma-debug</code> branch of my shadPS4 fork</a>. Each section here links to the original doc for anyone who wants the complete detail. Here I summarize the essentials: the hypothesis we had, what we tried, what new concept we had to learn, and why the phase didn&rsquo;t close the case.</p>
<h2>Session zero: baseline and reproducibility<span class="hx:absolute hx:-mt-20" id="session-zero-baseline-and-reproducibility"></span>
    <a href="#session-zero-baseline-and-reproducibility" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Every serious investigation starts before Phase 1. The first thing, for any project, is to make sure the <strong>baseline works as expected</strong> and that the situation you want to research is <strong>reproducible every time</strong>. If you can&rsquo;t reproduce the bug on demand, you&rsquo;re chasing ghosts. Every probe after that will be inconclusive because the bug &ldquo;appeared and vanished&rdquo; without control. And if the baseline is broken in some way you haven&rsquo;t noticed, every experiment measures the wrong problem. Without those two guarantees, you can&rsquo;t make progress.</p>
<p>In my case, on Sunday night I confirmed:</p>
<ul>
<li>The game boots all the way to the race screen without crashing, using the <a href="https://github.com/akitaonrails/distrobox-gaming/blob/master/docs/driveclub-shadps4.md"target="_blank" rel="noopener"><code>distrobox-gaming</code></a> config.</li>
<li><a href="https://github.com/Nenkai/DriveClubFS"target="_blank" rel="noopener">DriveClubFS</a> extracts v1.28 cleanly, 8018 files, ~47 GB.</li>
<li>MSAA depth resolve applied on the fork; night tracks render shapes instead of uniform absolute black (covered in the <a href="/en/2026/04/19/retrogames-de-corrida-favoritos-no-distrobox/">previous article</a>).</li>
<li>8BitDo controller detected.</li>
<li>Night race in India 19:30 <strong>reproduces the blackout on 100% of attempts</strong>, with the same temporal pattern: progressive fade in the first ~15s, pitch black between ~20s and ~90s, &ldquo;recalibration&rdquo; at 1:30.</li>
</ul>
<p>That baseline is the only solid ground that lets me compare each experiment reliably. Without it, a probe that &ldquo;worked&rdquo; might just be a different cache state, not the real fix. With it, every change has a before and an after that are measurable.</p>
<p>Beyond that, zero specific knowledge of:</p>
<ul>
<li>PS4 PKG format (PFS, param.sfo, disc_info, keystone).</li>
<li>shadPS4 codebase (src/core, src/video_core, shader_recompiler, buffer_cache).</li>
<li>PS4 graphics pipeline (GCN ISA, forward+ lighting, MSAA depth resolve).</li>
<li>Vulkan beyond the surface (descriptor sets, UBO binding, push constants, pipeline cache).</li>
<li>SPIR-V disassembly and patching.</li>
<li>PS4 OELF loader, Itanium RTTI, SCE dynamic relocations.</li>
<li>mprotect-based page-fault tracking for CPU-GPU synchronization.</li>
</ul>
<h2>The journey, phase by phase<span class="hx:absolute hx:-mt-20" id="the-journey-phase-by-phase"></span>
    <a href="#the-journey-phase-by-phase" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>I&rsquo;ll group the 30+ phases into buckets that make sense as a narrative. Each phase links to the full doc for anyone who wants technical detail.</p>
<blockquote>
  <p><strong>Note for anyone who doesn&rsquo;t want to read all 30+ phases:</strong> after the phase sequence there are reflection and methodology sections you can read on their own. <a href="#the-shotgun-technique">The shotgun technique</a> explains the debugging methodology I used throughout the investigation. <a href="#what-i-knew-on-sunday-vs-thursday">What I knew on Sunday vs Thursday</a> sums up what I actually learned. <a href="#the-ai-suggested-accepting-the-perseverance-was-mine">The AI suggested accepting. The perseverance was mine</a> discusses AI&rsquo;s real role in this process. <a href="#the-real-cost-of-this-journey">The real cost of this journey</a> catalogs what got produced. If you want to jump straight to the argument, those are the shortcuts.</p>

</blockquote>
<h3>Phase 01: pipeline cache, the warm-up<span class="hx:absolute hx:-mt-20" id="phase-01-pipeline-cache-the-warm-up"></span>
    <a href="#phase-01-pipeline-cache-the-warm-up" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><blockquote>
  <p><a href="https://github.com/akitaonrails/shadPS4/blob/gamma-debug/docs/driveclub-investigation/phase-01-shader-compile-stalls.md"target="_blank" rel="noopener"><code>phase-01-shader-compile-stalls.md</code></a></p>

</blockquote>
<ul>
<li><strong>Hypothesis:</strong> menus run at 2-3 fps on first launch. Must be Vulkan pipeline compilation happening in real time.</li>
<li><strong>Action:</strong> enable <code>pipeline_cache_enabled: true</code> in the global config.</li>
<li><strong>New concept:</strong> Vulkan pipeline caching. When the emulator translates a PS4 GCN shader into SPIR-V and then into a driver-specific Vulkan binary, it can cache that binary. Without the cache, it recompiles everything on every cold launch (~864 shaders + ~590 pipelines). See the sketch after this list.</li>
<li><strong>Outcome:</strong> solved. First session pays the cost; subsequent sessions read from disk and launch immediately. Easy. Also my first lesson about the shadPS4 codebase.</li>
</ul>
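<p>For anyone who hasn&rsquo;t touched this corner of Vulkan: the mechanism is small. A hedged sketch of the standard <code>VkPipelineCache</code> round trip (the file handling and names here are mine, not shadPS4&rsquo;s):</p>
<pre><code class="language-cpp">#include &lt;vulkan/vulkan.h&gt;
#include &lt;fstream&gt;
#include &lt;iterator&gt;
#include &lt;vector&gt;

// Seed a pipeline cache from disk; the driver then skips recompiling
// any pipeline whose compiled binary it recognizes in the blob.
VkPipelineCache LoadPipelineCache(VkDevice device, const char* path) {
    std::ifstream in(path, std::ios::binary);
    std::vector&lt;char&gt; blob((std::istreambuf_iterator&lt;char&gt;(in)), {});

    VkPipelineCacheCreateInfo info{VK_STRUCTURE_TYPE_PIPELINE_CACHE_CREATE_INFO};
    info.initialDataSize = blob.size();
    info.pInitialData    = blob.empty() ? nullptr : blob.data();

    VkPipelineCache cache = VK_NULL_HANDLE;
    vkCreatePipelineCache(device, &amp;info, nullptr, &amp;cache);
    return cache;
}

// After the session, serialize what the driver accumulated back to disk,
// so the next cold launch pays nothing.
void SavePipelineCache(VkDevice device, VkPipelineCache cache, const char* path) {
    size_t size = 0;
    vkGetPipelineCacheData(device, cache, &amp;size, nullptr);
    std::vector&lt;char&gt; blob(size);
    vkGetPipelineCacheData(device, cache, &amp;size, blob.data());
    std::ofstream(path, std::ios::binary).write(blob.data(), size);
}
</code></pre>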
<h3>Phase 02: gamma dim, starting in the wrong direction<span class="hx:absolute hx:-mt-20" id="phase-02-gamma-dim-starting-in-the-wrong-direction"></span>
    <a href="#phase-02-gamma-dim-starting-in-the-wrong-direction" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><blockquote>
  <p><a href="https://github.com/akitaonrails/shadPS4/blob/gamma-debug/docs/driveclub-investigation/phase-02-gamma-dim-image.md"target="_blank" rel="noopener"><code>phase-02-gamma-dim-image.md</code></a></p>

</blockquote>
<ul>
<li><strong>Hypothesis:</strong> DriveClub&rsquo;s image is dim compared to reference. Must be wrong sRGB encoding or an HDR tonemap applied badly by the emulator.</li>
<li><strong>Action:</strong> instrumented the swapchain to log format, added <code>SHADPS4_PP_*</code> env vars to experiment, built a post-processing pipeline with three knobs (exposure, ACES tonemap, gamma curve).</li>
<li><strong>New concept:</strong> sRGB OETF, HDR tonemap (ACES vs Reinhard), Bayer dither as anti-banding.</li>
<li><strong>Outcome:</strong> via the bypass path (<code>SHADPS4_PP_BYPASS=1</code>, which writes raw game output with no post-processing) I discovered the <strong>game already writes correct SDR</strong> into the framebuffer. The emulator&rsquo;s post-processing pipeline was mangling the signal. The final shader became simply sRGB encode + Bayer dither (sketched after this list). Dim persists, but it&rsquo;s not a gamma problem.</li>
</ul>
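<p>The two pieces that survived in the final shader are textbook math. The emulator runs them on the GPU; here is the same math transliterated to plain C++ so it&rsquo;s inspectable (my transliteration, not shadPS4&rsquo;s shader source):</p>
<pre><code class="language-cpp">#include &lt;cmath&gt;
#include &lt;cstdint&gt;

// The standard sRGB OETF: linear light in, display-encoded value out.
float SrgbEncode(float linear) {
    return linear &lt;= 0.0031308f
        ? 12.92f * linear
        : 1.055f * std::pow(linear, 1.0f / 2.4f) - 0.055f;
}

// Classic 4x4 Bayer matrix, normalized to [0,1): a fixed spatial pattern
// of sub-quantization offsets that breaks up banding in 8-bit output.
float BayerDither4x4(int x, int y) {
    static const int kBayer[4][4] = {
        { 0,  8,  2, 10},
        {12,  4, 14,  6},
        { 3, 11,  1,  9},
        {15,  7, 13,  5},
    };
    return kBayer[y &amp; 3][x &amp; 3] / 16.0f;
}

// Per pixel: encode, nudge by the dither, quantize to 8 bits.
uint8_t EncodePixel(float linear, int x, int y) {
    float v = SrgbEncode(linear) * 255.0f + BayerDither4x4(x, y);
    return uint8_t(v &lt; 0.0f ? 0.0f : v &gt; 255.0f ? 255.0f : v);
}
</code></pre>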
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/23/driveclub-learning/wrong-gamma.png" alt="Gamma adjustments going wrong everywhere"  loading="lazy" /></p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/23/driveclub-learning/wrong-gamma-blown.png" alt="Blown out: gamma too high"  loading="lazy" /></p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/23/driveclub-learning/wrong-gamma-sky-dark.png" alt="Dark sky: gamma too low"  loading="lazy" /></p>
<p>Three hours on this phase. The most important lesson: <strong>when &ldquo;the image looks wrong&rdquo;, the first diagnostic should be a bypass path that shows raw game output</strong>. If bypass looks right, the emulator is the one mangling the signal. Stop touching the tonemap.</p>
<h3>Phase 03: v1.28 content access<span class="hx:absolute hx:-mt-20" id="phase-03-v128-content-access"></span>
    <a href="#phase-03-v128-content-access" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><blockquote>
  <p><a href="https://github.com/akitaonrails/shadPS4/blob/gamma-debug/docs/driveclub-investigation/phase-03-v128-content-access.md"target="_blank" rel="noopener"><code>phase-03-v128-content-access.md</code></a></p>

</blockquote>
<ul>
<li><strong>Hypothesis:</strong> v1.28 extracts, but the game says &ldquo;content not released yet, download required&rdquo;.</li>
<li><strong>Action:</strong> merge v1.28 on top of v1.00, restore <code>param.sfo</code>, <code>disc_info.dat</code>, <code>keystone</code>.</li>
<li><strong>New concept:</strong> PS4 cumulative patches, package metadata.</li>
<li><strong>Outcome:</strong> solved. Detailed in the <a href="/en/2026/04/19/retrogames-de-corrida-favoritos-no-distrobox/">previous article</a>.</li>
</ul>
<h3>Phase 04: MSAA depth + slow motion<span class="hx:absolute hx:-mt-20" id="phase-04-msaa-depth--slow-motion"></span>
    <a href="#phase-04-msaa-depth--slow-motion" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><blockquote>
  <p><a href="https://github.com/akitaonrails/shadPS4/blob/gamma-debug/docs/driveclub-investigation/phase-04-slowness.md"target="_blank" rel="noopener"><code>phase-04-slowness.md</code></a></p>

</blockquote>
<ul>
<li><strong>Hypothesis:</strong> night tracks are pitch black and the game &ldquo;feels slow&rdquo;.</li>
<li><strong>Action:</strong> two separate things. Disable the 60fps XML patch (which was decoupling render rate from logic tickrate). Implement <code>ReinterpretMsDepthAsColor</code> on my fork (to let the emulator resolve 4x MSAA depth to 1-sample color, which DriveClub needs for forward+ lighting on night tracks).</li>
<li><strong>New concept:</strong> forward+ lighting (a renderer that writes the scene to an MSAA depth target and then reads it as color for SSAO and volumetrics), MSAA depth resolve.</li>
<li><strong>Outcome:</strong> both fixed. Detailed in the <a href="/en/2026/04/19/retrogames-de-corrida-favoritos-no-distrobox/">previous article</a>.</li>
</ul>
<h3>Phase 05: second pass on brightness<span class="hx:absolute hx:-mt-20" id="phase-05-second-pass-on-brightness"></span>
    <a href="#phase-05-second-pass-on-brightness" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><blockquote>
  <p><a href="https://github.com/akitaonrails/shadPS4/blob/gamma-debug/docs/driveclub-investigation/phase-05-brightness-followup.md"target="_blank" rel="noopener"><code>phase-05-brightness-followup.md</code></a></p>

</blockquote>
<ul>
<li><strong>Hypothesis:</strong> since the game writes correct SDR, maybe a static boost in post-processing helps without touching the UBO feedback loop.</li>
<li><strong>Action:</strong> tried linear boost (1.5x, 2.0x), gamma pre-encode (pow curve), scene-aware auto-exposure with peak+mean sampling.</li>
<li><strong>Outcome:</strong> all rolled back. Linear boost clips highlights. Gamma introduces banding. My auto-exposure layer cancels the user&rsquo;s manual brightness slider. There&rsquo;s no free lunch in post-processing. Dim stays dim without real HDR or shader patches.</li>
</ul>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/23/driveclub-learning/wrong-gamma-sky-dark.png" alt="Sky too dark"  loading="lazy" /></p>
<h3>Narrowing the blackout: the shared gate<span class="hx:absolute hx:-mt-20" id="narrowing-the-blackout-the-shared-gate"></span>
    <a href="#narrowing-the-blackout-the-shared-gate" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><blockquote>
  <p><a href="https://github.com/akitaonrails/shadPS4/blob/gamma-debug/docs/driveclub-investigation/race-start-blackout-040420.md"target="_blank" rel="noopener"><code>race-start-blackout-040420.md</code></a></p>

</blockquote>
<ul>
<li><strong>Hypothesis:</strong> the blackout has a curious signature. Main view goes dim/black, rearview mirror goes absolute black, HUD stays bright. Three layers with different behaviors suggest a shared &ldquo;gate&rdquo;, not a common fade.</li>
<li><strong>Action:</strong> instrumented texture-cache aliasing, probed with live fragment shader substitution on suspected inputs.</li>
<li><strong>New concept:</strong> texture-cache aliasing (multiple images sharing a memory address, which the emulator cache reuses), render-target lifetime.</li>
<li><strong>Outcome:</strong> points to a shared &ldquo;presentation permission&rdquo; between the main scene and the mirror, not a scripted fade.</li>
</ul>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/23/driveclub-learning/screenshot-2026-04-20_21-22-45.png" alt="Blackout baseline"  loading="lazy" /></p>
<h3>Phases 07-09: texture torture, UI surgery, binary patch plan<span class="hx:absolute hx:-mt-20" id="phases-07-09-texture-torture-ui-surgery-binary-patch-plan"></span>
    <a href="#phases-07-09-texture-torture-ui-surgery-binary-patch-plan" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><blockquote>
  <p><a href="https://github.com/akitaonrails/shadPS4/blob/gamma-debug/docs/driveclub-investigation/phase-07-aggressive-torture-probe.md"target="_blank" rel="noopener"><code>phase-07-aggressive-torture-probe.md</code></a> · <a href="https://github.com/akitaonrails/shadPS4/blob/gamma-debug/docs/driveclub-investigation/phase-08-ui-asset-surgery.md"target="_blank" rel="noopener"><code>phase-08-ui-asset-surgery.md</code></a> · <a href="https://github.com/akitaonrails/shadPS4/blob/gamma-debug/docs/driveclub-investigation/phase-09-eboot-binary-patch-plan.md"target="_blank" rel="noopener"><code>phase-09-eboot-binary-patch-plan.md</code></a></p>

</blockquote>
<ul>
<li><strong>Hypothesis:</strong> the race-start dim overlay is an asset. Small texture, UI panel, or a method of the <code>FreeplayGetInCar</code> controller in the eboot.</li>
<li><strong>Action:</strong> null-substituted BC-compressed materials, small texture candidates, four-vector weight masks during critical draws. Disabled UI panels (<code>loading_freeplay.txt</code>, <code>get_in_car_animation.txt</code>, <code>vehicle_select_background</code>). Extracted the OELF, located <code>FreeplayGetInCar</code> ASCII strings, identified the constructor at VA 0xf89b00 and the vtable at 0x15dc3e0.</li>
<li><strong>New concept:</strong> PS4 OELF loader (Sony&rsquo;s variant of ELF), Itanium C++ RTTI (how C++ encodes type info in vtables), SCE dynamic relocations (placeholders that only the dynamic loader resolves at runtime).</li>
<li><strong>Outcome:</strong> nulls work but the fade persists. The overlay renders directly into HDR scene targets, not through the panel system. Binary patching has relocation placeholders that aren&rsquo;t statically resolvable without a Ghidra + PS4 plugin. <strong>Asset-side surface exhausted.</strong> Six hours of debugging, final conclusion: the bug lives in state, not in an asset.</li>
</ul>
<h3>Phase 10: rethink + animlib<span class="hx:absolute hx:-mt-20" id="phase-10-rethink--animlib"></span>
    <a href="#phase-10-rethink--animlib" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><blockquote>
  <p><a href="https://github.com/akitaonrails/shadPS4/blob/gamma-debug/docs/driveclub-investigation/phase-10a-rethink-040422.md"target="_blank" rel="noopener"><code>phase-10a-rethink-040422.md</code></a> · <a href="https://github.com/akitaonrails/shadPS4/blob/gamma-debug/docs/driveclub-investigation/phase-10b-video-diagnostic-animlib.md"target="_blank" rel="noopener"><code>phase-10b-video-diagnostic-animlib.md</code></a></p>

</blockquote>
<p>Tuesday morning. Sat down, rebooted mentally. New model: the blackout is a <strong>state-machine gate</strong>. It&rsquo;s not &ldquo;scene not ready&rdquo;, it&rsquo;s &ldquo;permission transition&rdquo;. The gate closes after the starting grid appears. Next probe: runtime dispatch trace, enough of the blind asset surgery.</p>
<p>In parallel: frame extraction from the recording at 10 Hz, cross-referenced against keyframes of the animation library (<code>animlib</code>) in <code>india_posteffects.lvl</code>. Found a smoking gun. The <code>MasterBrightness</code> track has keyframes 0.003 and 0.0007 (0.63% brightness) where the default would be 1.0. Animlib beats inline defaults every frame.</p>
<ul>
<li><strong>New concept:</strong> animation library, keyframe encoding, binary float-scanning (sketched after this list).</li>
</ul>
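<p>&ldquo;Binary float-scanning&rdquo; sounds fancier than it is: walk an unknown blob at every byte offset, reinterpret four bytes as a little-endian float, and report anything close to a value you saw on screen (like 0.003). A minimal sketch, with invented names:</p>
<pre><code class="language-cpp">#include &lt;cmath&gt;
#include &lt;cstdint&gt;
#include &lt;cstdio&gt;
#include &lt;cstring&gt;
#include &lt;vector&gt;

// Report every offset where the blob holds a float near `needle`.
void ScanForFloat(const std::vector&lt;uint8_t&gt;&amp; blob, float needle, float tol) {
    for (size_t off = 0; off + 4 &lt;= blob.size(); ++off) {
        float v;
        std::memcpy(&amp;v, blob.data() + off, 4);  // alias-safe reinterpret
        if (std::isfinite(v) &amp;&amp; std::fabs(v - needle) &lt;= tol)
            std::printf("offset 0x%zx: %g\n", off, v);
    }
}
</code></pre>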
<video
  controls
  preload="metadata"
  playsinline
  class="html5-video"
  style="width:100%; max-width:960px; display:block; margin:1.25em auto; border-radius:8px; background:#000;"
  >
  <source src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/videos/debugging-blackout-fade.mp4" type="video/mp4">
  Your browser does not support HTML5 video.
</video>

<p>Video above: the process of probing the fade in real time.</p>
<h3>Phase 11: multi-scalar patch<span class="hx:absolute hx:-mt-20" id="phase-11-multi-scalar-patch"></span>
    <a href="#phase-11-multi-scalar-patch" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><blockquote>
  <p><a href="https://github.com/akitaonrails/shadPS4/blob/gamma-debug/docs/driveclub-investigation/phase-11-multi-scalar-patch.md"target="_blank" rel="noopener"><code>phase-11-multi-scalar-patch.md</code></a></p>

</blockquote>
<ul>
<li><strong>Hypothesis:</strong> patching <code>MasterBrightness</code> + <code>AutoTargetLuminance</code> + <code>ManualAutoMix</code> simultaneously will lift the blackout.</li>
<li><strong>Action:</strong> extended <code>PatchAnimlibFloatValues</code> to rewrite three scalars. Instrumented with <code>SHADPS4_DC_DRAWLOG</code> + <code>SHADPS4_DC_UBOLOG</code>.</li>
<li><strong>Outcome:</strong> the patch produces a real <strong>numeric</strong> lift (1.2 → 6.5 on a 0-255 scale, a 5× gain), but the result is still perceptually black. And the crucial finding: the mirror works normally after the patch. It&rsquo;s not gated, just reflecting a dark scene. <strong>At least 285× of additional attenuation lives downstream of the scalars we can patch.</strong></li>
</ul>
<h3>Phase 12: bisect closes the asset side<span class="hx:absolute hx:-mt-20" id="phase-12-bisect-closes-the-asset-side"></span>
    <a href="#phase-12-bisect-closes-the-asset-side" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><blockquote>
  <p><a href="https://github.com/akitaonrails/shadPS4/blob/gamma-debug/docs/driveclub-investigation/phase-12-bisect-closes-asset-side.md"target="_blank" rel="noopener"><code>phase-12-bisect-closes-asset-side.md</code></a></p>

</blockquote>
<ul>
<li><strong>Hypothesis:</strong> one of the 60 accumulated patches broke the mirror independently from the blackout. Bisect.</li>
<li><strong>Action:</strong> binary bisect across 60 overlay files over 5 rounds.</li>
<li><strong>Outcome:</strong> <code>x_live_lobby_pre_race.txt</code> (rerouting loading-spinner transition from <code>ZoomInAndFade</code> to <code>NoFade</code>) was the mirror killer. Rolled back. <strong>Asset side completely closed.</strong> The driver lives in compiled code or post-process state, not in a game asset.</li>
</ul>
<h3>Asset sweeps 01-09 (consolidated)<span class="hx:absolute hx:-mt-20" id="asset-sweeps-01-09-consolidated"></span>
    <a href="#asset-sweeps-01-09-consolidated" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><blockquote>
  <p><a href="https://github.com/akitaonrails/shadPS4/blob/gamma-debug/docs/driveclub-investigation/asset-sweep-09-conclusions.md"target="_blank" rel="noopener"><code>asset-sweep-09-conclusions.md</code></a></p>

</blockquote>
<p>In parallel with the phases above, I swept 12 RPK substitutions: <code>FreeplayGetInCar</code> page swaps, postfx edits in <code>india_landscape_gui.rpk</code>, <code>globaldata</code> RTT-only edits, prerace state-string rewrites. <strong>All clean misses.</strong> No asset family touches the blackout. Lesson: <strong>stop chasing by names; chase by state transitions.</strong></p>
<h3>Phase 13: UBO dump breakthrough<span class="hx:absolute hx:-mt-20" id="phase-13-ubo-dump-breakthrough"></span>
    <a href="#phase-13-ubo-dump-breakthrough" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><blockquote>
  <p><a href="https://github.com/akitaonrails/shadPS4/blob/gamma-debug/docs/driveclub-investigation/phase-13-ubo-dump-breakthrough.md"target="_blank" rel="noopener"><code>phase-13-ubo-dump-breakthrough.md</code></a></p>

</blockquote>
<ul>
<li><strong>Hypothesis:</strong> eye-adaptation feedback loop with stale input is the culprit. A luminance scalar growing without converging.</li>
<li><strong>Action:</strong> dump 128 bytes of each UBO in hex during the race window, across six hand-picked pipelines, and look for monotonically changing values (detector sketched after this list).</li>
<li><strong>New concept:</strong> <strong>UBO (Uniform Buffer Object)</strong>, a shared memory region the CPU writes and shaders read. <strong>Auto-exposure integrator</strong>, a classic eye-adaptation component that takes time to converge toward a target.</li>
<li><strong>Outcome:</strong> found it. <strong>UBO offset 3 climbs monotonically from 2.59 → 7.84 → 24 → 90 → 179 → 255</strong> over 2 seconds. Textbook feedback loop with broken input. First real hit in 12 phases.</li>
</ul>
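<p>The probe itself is simple; what matters is the idea of diffing the same buffer across frames and ranking offsets by behavior. A sketch of the monotonic detector, with invented names:</p>
<pre><code class="language-cpp">#include &lt;array&gt;
#include &lt;cstdio&gt;

// Snapshot the first 128 bytes of a UBO as 32 floats each frame and flag
// any slot that only ever climbs. A value that rises frame after frame
// without ever dipping smells like an integrator with a broken input.
struct MonotonicProbe {
    std::array&lt;float, 32&gt; prev{};
    std::array&lt;int, 32&gt;   rises{};

    void Sample(const float* ubo, int frame) {
        for (int i = 0; i &lt; 32; ++i) {
            if (ubo[i] &gt; prev[i]) {
                if (++rises[i] &gt; 60)   // climbed ~60 consecutive frames
                    std::printf("frame %d: offset %d monotonic, now %g\n",
                                frame, i, ubo[i]);
            } else {
                rises[i] = 0;          // any dip resets the streak
            }
            prev[i] = ubo[i];
        }
    }
};
</code></pre>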
<h3>Phase 14: scene-pipeline UBOs are the read side<span class="hx:absolute hx:-mt-20" id="phase-14-scene-pipeline-ubos-are-the-read-side"></span>
    <a href="#phase-14-scene-pipeline-ubos-are-the-read-side" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><blockquote>
  <p><a href="https://github.com/akitaonrails/shadPS4/blob/gamma-debug/docs/driveclub-investigation/phase-14-scene-pipeline-ubos-wrong.md"target="_blank" rel="noopener"><code>phase-14-scene-pipeline-ubos-wrong.md</code></a></p>

</blockquote>
<ul>
<li><strong>Hypothesis:</strong> clamping the climbing scalar will stop the blackout.</li>
<li><strong>Action:</strong> <code>SHADPS4_DC_LUM_CLAMP=2.0</code> forces offset 3 to a fixed value during the race window.</li>
<li><strong>Outcome:</strong> the clamp fires 1093 times, blackout unchanged. Scene-pipeline UBOs <strong>consume</strong> the dim, they don&rsquo;t produce it. The dim is upstream, in a fullscreen tonemap or post-fx compute.</li>
</ul>
<h3>Phases 15-16: runtime texture dump + mutation<span class="hx:absolute hx:-mt-20" id="phases-15-16-runtime-texture-dump--mutation"></span>
    <a href="#phases-15-16-runtime-texture-dump--mutation" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><blockquote>
  <p><a href="https://github.com/akitaonrails/shadPS4/blob/gamma-debug/docs/driveclub-investigation/phase-15-runtime-texture-dump.md"target="_blank" rel="noopener"><code>phase-15-runtime-texture-dump.md</code></a> · <a href="https://github.com/akitaonrails/shadPS4/blob/gamma-debug/docs/driveclub-investigation/phase-16-runtime-texture-mutation.md"target="_blank" rel="noopener"><code>phase-16-runtime-texture-mutation.md</code></a></p>

</blockquote>
<ul>
<li><strong>Hypothesis:</strong> the blackout is a fullscreen texture composited over the scene.</li>
<li><strong>Action:</strong> <code>SHADPS4_DC_TEX_DUMP=1</code> dumps every bound texture during the race window (1079 textures captured). Convert the tiled format to PNG for visual triage. Then deterministically mutate each texture via hash-based tint + <code>InvalidateMemory</code> to force re-upload (mutation sketched after this list).</li>
<li><strong>New concept:</strong> <strong>GCN tile format</strong> (Sony interleaves bytes a specific way for cache locality), texture de-swizzle, buffer cache invalidation.</li>
<li><strong>Outcome:</strong> 14 rounds of mutation (BC textures, float data, render targets, UI atlases, post-fx buffers, LUTs). Blackout <strong>unchanged across all of them</strong>. Texture content 100% ruled out as the source.</li>
</ul>
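<p>The mutation primitive is worth showing because the hashing is the whole trick: deriving the tint from the texture&rsquo;s GPU address makes the mutation deterministic across runs, so a color on screen names its source. Illustrative code, not shadPS4&rsquo;s:</p>
<pre><code class="language-cpp">#include &lt;cstddef&gt;
#include &lt;cstdint&gt;

// Blend a loud, per-texture tint into RGBA8 texels. The same texture
// always gets the same color, so if THIS texture feeds the blackout,
// the blackout turns that exact color.
void TintRgba8(uint8_t* texels, size_t pixel_count, uint64_t gpu_addr) {
    // Cheap multiplicative hash: stable pseudo-random color per address.
    uint64_t h = gpu_addr * 0x9E3779B97F4A7C15ull;
    uint8_t r = uint8_t(h &gt;&gt; 40), g = uint8_t(h &gt;&gt; 48), b = uint8_t(h &gt;&gt; 56);

    for (size_t i = 0; i &lt; pixel_count; ++i) {
        texels[i * 4 + 0] = uint8_t((texels[i * 4 + 0] + r) / 2);
        texels[i * 4 + 1] = uint8_t((texels[i * 4 + 1] + g) / 2);
        texels[i * 4 + 2] = uint8_t((texels[i * 4 + 2] + b) / 2);
        // Alpha untouched. The emulator must then invalidate its cached
        // GPU copy (InvalidateMemory) so the mutated texels re-upload.
    }
}
</code></pre>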
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/23/driveclub-learning/blue-ish-overlay.png" alt="Blue overlay during mutation"  loading="lazy" /></p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/23/driveclub-learning/green-ish-overlay.png" alt="Green overlay during mutation"  loading="lazy" /></p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/23/driveclub-learning/reflections-glitching-blown.png" alt="Reflections glitched"  loading="lazy" /></p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/23/driveclub-learning/reflections-glitched-blown-white.png" alt="Reflections blown white"  loading="lazy" /></p>
<h3>Phases 17-19: UBO smashing, push-constant smashing, compute dispatch probe<span class="hx:absolute hx:-mt-20" id="phases-17-19-ubo-smashing-push-constant-smashing-compute-dispatch-probe"></span>
    <a href="#phases-17-19-ubo-smashing-push-constant-smashing-compute-dispatch-probe" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><blockquote>
  <p><a href="https://github.com/akitaonrails/shadPS4/blob/gamma-debug/docs/driveclub-investigation/phase-17-ubo-batch-smashing.md"target="_blank" rel="noopener"><code>phase-17-ubo-batch-smashing.md</code></a> · <a href="https://github.com/akitaonrails/shadPS4/blob/gamma-debug/docs/driveclub-investigation/phase-18-push-constant-smashing.md"target="_blank" rel="noopener"><code>phase-18-push-constant-smashing.md</code></a> · <a href="https://github.com/akitaonrails/shadPS4/blob/gamma-debug/docs/driveclub-investigation/phase-19-compute-dispatch-probe.md"target="_blank" rel="noopener"><code>phase-19-compute-dispatch-probe.md</code></a></p>

</blockquote>
<ul>
<li><strong>Hypothesis:</strong> random-fill on UBOs, push constants, and compute input buffers can bracket and isolate the dim field without crashing.</li>
<li><strong>Action:</strong> <code>SHADPS4_DC_UBO_SMASH=1</code>, <code>SHADPS4_DC_PC_SMASH=1</code>, <code>SHADPS4_DC_DISPATCH_SMASH=1</code> with cb-index and size filters (the smash primitive is sketched after this list).</li>
<li><strong>New concept:</strong> <strong>push constants</strong> (128 bytes of fast params the driver pushes straight into the pipeline, no descriptor), <strong>compute dispatches</strong> (fires compute shaders with a grid of threads), <strong>ud_regs</strong> (GCN user data registers — how shadPS4 maps user data to push constants).</li>
<li><strong>Outcome:</strong> safe brackets produce overlay bokeh + blinks, but the dim stays put. Wide brackets that reach the dim also hit matrices/transforms and crash. <strong>Random-fill is the wrong instrument.</strong> Needs diff methodology.</li>
</ul>
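<p>All three smash probes are variations of a single primitive: overwrite a byte range with an absurd value right before the GPU consumes it, then watch the screen. For push constants it is literally one Vulkan call; the offset, size, and stage flag below are placeholders:</p>
<pre><code class="language-cpp">#include &lt;vulkan/vulkan.h&gt;

// Stomp a range of the push constants with a loud pattern just before
// the draw is recorded. If the dim factor lives in that range, the image
// visibly changes; if nothing changes, the range is exonerated.
void SmashPushConstants(VkCommandBuffer cmd, VkPipelineLayout layout,
                        uint32_t offset, uint32_t size) {
    float pattern[32];                       // 128 bytes, the spec minimum
    for (float&amp; f : pattern) f = 1000.0f;    // absurd, visible value
    // Caller guarantees size &lt;= sizeof(pattern).
    vkCmdPushConstants(cmd, layout, VK_SHADER_STAGE_FRAGMENT_BIT,
                       offset, size, pattern);
}
</code></pre>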
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/23/driveclub-learning/all-bege.png" alt="Everything beige after smash"  loading="lazy" /></p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/23/driveclub-learning/all-green.png" alt="Everything green"  loading="lazy" /></p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/23/driveclub-learning/all-cyan.png" alt="Everything cyan"  loading="lazy" /></p>
<h3>Phase 19b-c: dim-vs-lift dispatch diff<span class="hx:absolute hx:-mt-20" id="phase-19b-c-dim-vs-lift-dispatch-diff"></span>
    <a href="#phase-19b-c-dim-vs-lift-dispatch-diff" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><blockquote>
  <p><a href="https://github.com/akitaonrails/shadPS4/blob/gamma-debug/docs/driveclub-investigation/phase-19b-dim-vs-lift-diff.md"target="_blank" rel="noopener"><code>phase-19b-dim-vs-lift-diff.md</code></a> · <a href="https://github.com/akitaonrails/shadPS4/blob/gamma-debug/docs/driveclub-investigation/phase-19c-dispatch-draw-skip-null.md"target="_blank" rel="noopener"><code>phase-19c-dispatch-draw-skip-null.md</code></a></p>

</blockquote>
<ul>
<li><strong>Hypothesis:</strong> compute dispatches that only fire during dim (and stop once it lifts) are part of the chain.</li>
<li><strong>Action:</strong> split the clean log into pre / dim / bright windows (timed against the wall clock as I watched the screen), and count dispatch frequency per pipeline (sketched after this list).</li>
<li><strong>Outcome:</strong> identified <strong>8 compute pipelines</strong> (histogram 256→downsamples, tile-grid reductions, auto-exposure 1x1 scalars) that stop exactly when dim lifts. Shapes strongly suggest the eye-adaptation chain. <strong>First genuine diagnostic hit.</strong> Skip test on those 8 + 101 correlated draws: blackout <strong>unchanged</strong>. They&rsquo;re <strong>downstream effects</strong>, not drivers.</li>
</ul>
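<p>The diff methodology that replaced random-fill generalizes well beyond emulators: partition time into windows, count events per window, rank by asymmetry. A sketch with an invented log shape:</p>
<pre><code class="language-cpp">#include &lt;cstdint&gt;
#include &lt;cstdio&gt;
#include &lt;map&gt;

// Bucket every logged dispatch into a time window, then report pipelines
// that fire while the screen is dim but go silent once it brightens.
enum Window { PRE, DIM, BRIGHT };

Window Classify(double t, double dim_start, double dim_end) {
    return t &lt; dim_start ? PRE : t &lt; dim_end ? DIM : BRIGHT;
}

void CountDispatch(std::map&lt;uint64_t, int&gt; counts[3], uint64_t pipeline,
                   double t, double dim_start, double dim_end) {
    ++counts[Classify(t, dim_start, dim_end)][pipeline];
}

void ReportDimOnly(const std::map&lt;uint64_t, int&gt; counts[3]) {
    for (const auto&amp; [pipeline, n] : counts[DIM])
        if (counts[BRIGHT].count(pipeline) == 0)   // dim-only: suspect
            std::printf("suspect pipeline %#llx: %d dispatches\n",
                        (unsigned long long)pipeline, n);
}
</code></pre>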
<h3>Phase 20-21: tonemap inline, fragment → flip buffer<span class="hx:absolute hx:-mt-20" id="phase-20-21-tonemap-inline-fragment--flip-buffer"></span>
    <a href="#phase-20-21-tonemap-inline-fragment--flip-buffer" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><blockquote>
  <p><a href="https://github.com/akitaonrails/shadPS4/blob/gamma-debug/docs/driveclub-investigation/phase-20-shader-dump-tonemap-inline.md"target="_blank" rel="noopener"><code>phase-20-shader-dump-tonemap-inline.md</code></a> · <a href="https://github.com/akitaonrails/shadPS4/blob/gamma-debug/docs/driveclub-investigation/phase-21-fragment-to-flip-buffer.md"target="_blank" rel="noopener"><code>phase-21-fragment-to-flip-buffer.md</code></a></p>

</blockquote>
<ul>
<li><strong>Hypothesis:</strong> the auto-exposure compute shader is broken, or the tonemap compute has the wrong math.</li>
<li><strong>Action:</strong> recompile all shaders, dump SPIR-V, disassemble the histogram auto-exposure. Patch its output to constant 1.0 and test.</li>
<li><strong>New concept:</strong> <strong>SPIR-V disassembly</strong> (Khronos intermediate format with readable assembly). <strong>Compute shader verification.</strong> Later, <strong>pipeline-to-shader mapping</strong>.</li>
<li><strong>Outcome:</strong> patch <strong>has no effect</strong>. Discovery: <strong>DriveClub has no separate tonemap pass</strong>. Scene geometry writes directly to the flip buffer, with exposure inline in material shaders. Each material shader bakes the dim scalar from a shared UBO. Pipeline cache wipe + <code>SHADPS4_DC_PIPEMAP=1</code> dumps every pipeline→shader mapping. Filtering drawlog for pipelines writing to <code>0x5000900000</code> / <code>0x5000108000</code> (the actual swapchain buffers) yields <strong>14 unique pipelines</strong>. None with Exp2/Log2 tonemap ops. Exposure is applied upstream.</li>
</ul>
<h3>Phase 22: tonemap compute identified<span class="hx:absolute hx:-mt-20" id="phase-22-tonemap-compute-identified"></span>
    <a href="#phase-22-tonemap-compute-identified" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><blockquote>
  <p><a href="https://github.com/akitaonrails/shadPS4/blob/gamma-debug/docs/driveclub-investigation/phase-22-tonemap-compute-identified.md"target="_blank" rel="noopener"><code>phase-22-tonemap-compute-identified.md</code></a></p>

</blockquote>
<ul>
<li><strong>Hypothesis:</strong> among the game&rsquo;s 510 compute shaders, one has a tonemap signature (many Exp2/Log2).</li>
<li><strong>Action:</strong> automated scan (sketched after this list), cross-referenced with the dispatchlog.</li>
<li><strong>Outcome:</strong> <strong><code>cs_0x2c918c06</code></strong> is the strong candidate, with 8 Exp2 + 8 Log2, 5 SSBOs (scene params, TAA, bloom, camera, exposure), 11 images. Skip test on that specific pipeline: <strong>stops the blackout cycle</strong>, removes color grading, introduces temporal ghosting. <strong>That&rsquo;s the final scene→swapchain tonemap path.</strong></li>
</ul>
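<p>The scan is a few lines once you know SPIR-V&rsquo;s word layout: every instruction starts with <code>(word_count &lt;&lt; 16) | opcode</code>, and <code>exp2</code>/<code>log2</code> arrive as <code>OpExtInst</code> calls into the GLSL.std.450 set. A sketch of the signature counter; it assumes the only imported extended set is GLSL.std.450, which holds for typical game shaders:</p>
<pre><code class="language-cpp">#include &lt;cstdint&gt;
#include &lt;cstdio&gt;
#include &lt;vector&gt;

// Count GLSL.std.450 Exp2/Log2 calls in a SPIR-V module. A compute
// shader dense with exp2/log2 pairs is a strong tonemap signature.
// OpExtInst is opcode 12; its 5th word is the extended instruction
// number (Exp2 = 29, Log2 = 30 in GLSL.std.450).
void CountExp2Log2(const std::vector&lt;uint32_t&gt;&amp; spirv) {
    int exp2 = 0, log2 = 0;
    size_t i = 5;                              // skip the 5-word header
    while (i &lt; spirv.size()) {
        uint32_t wc = spirv[i] &gt;&gt; 16, op = spirv[i] &amp; 0xFFFF;
        if (op == 12 &amp;&amp; wc &gt;= 5 &amp;&amp; i + 4 &lt; spirv.size()) {
            uint32_t inst = spirv[i + 4];
            if (inst == 29) ++exp2;
            if (inst == 30) ++log2;
        }
        i += wc ? wc : 1;                      // guard against wc == 0
    }
    std::printf("Exp2: %d  Log2: %d\n", exp2, log2);
}
</code></pre>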
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/23/driveclub-learning/temporal-glitching.png" alt="Temporal ghosting after skip"  loading="lazy" /></p>
<h3>Phase 22+: tonemap patches → playable<span class="hx:absolute hx:-mt-20" id="phase-22-tonemap-patches--playable"></span>
    <a href="#phase-22-tonemap-patches--playable" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><blockquote>
  <p><a href="https://github.com/akitaonrails/shadPS4/blob/gamma-debug/docs/driveclub-investigation/phase-22plus-tonemap-compute-patches.md"target="_blank" rel="noopener"><code>phase-22plus-tonemap-compute-patches.md</code></a></p>

</blockquote>
<ul>
<li><strong>Hypothesis:</strong> two surgical SPIR-V edits (histogram no-op + tonemap exposure clamp/boost) make the game playable.</li>
<li><strong>Action:</strong> histogram patch to never update; tonemap patch to clamp exposure floor at 0.5 and boost blend by ×10. Both via SPIR-V patching + <code>shader/patch/</code> drop-in.</li>
<li><strong>Outcome:</strong> <strong>playable!</strong> Dim shrinks from pitch black to mild. A <strong>real mitigation, not a root-cause fix.</strong> The fade curve still drags exposure, and each track needs per-threshold tuning.</li>
</ul>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/23/driveclub-learning/blown-just-track-lights.png" alt="Track lights blowing out after ×10 boost"  loading="lazy" /></p>
<video
  controls
  preload="metadata"
  playsinline
  class="html5-video"
  style="width:100%; max-width:960px; display:block; margin:1.25em auto; border-radius:8px; background:#000;"
  >
  <source src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/videos/india-gamma-correction-attempt.mp4" type="video/mp4">
  Your browser does not support HTML5 video.
</video>

<p>Here the AI suggested stopping for the first time. <strong>&ldquo;The game is playable, this is a decent mitigation, we should document and accept.&rdquo;</strong> I pushed back: &ldquo;it&rsquo;s playable but it&rsquo;s wrong. Don&rsquo;t accept. We keep going.&rdquo;</p>
<h3>Phase 23: CPU-side exposure intercept<span class="hx:absolute hx:-mt-20" id="phase-23-cpu-side-exposure-intercept"></span>
    <a href="#phase-23-cpu-side-exposure-intercept" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><blockquote>
  <p><a href="https://github.com/akitaonrails/shadPS4/blob/gamma-debug/docs/driveclub-investigation/phase-23-cpu-exposure-intercept.md"target="_blank" rel="noopener"><code>phase-23-cpu-exposure-intercept.md</code></a></p>

</blockquote>
<ul>
<li><strong>Hypothesis:</strong> clamping the exposure scalar at the tonemap compute&rsquo;s <code>BindBuffers</code> point prevents over-dimming.</li>
<li><strong>Action:</strong> <code>SHADPS4_DC_EXPOSURE_PIN=1</code> intercepts <code>BindBuffers</code>, overwrites exposure offsets with a clamp-min threshold, invalidates buffer_cache.</li>
<li><strong>New concept:</strong> <strong>CPU-GPU buffer binding</strong> (the moment when the CPU hands a descriptor to the GPU). <strong>Buffer cache invalidation</strong> (how the emulator keeps consistency between a buffer&rsquo;s CPU-side and GPU-side state).</li>
<li><strong>Outcome:</strong> the scalar oscillates at race start, then climbs monotonically. Per-track tuning is needed. India wants 3.0, Canada blows out at the same threshold. <strong>A single-scalar solution doesn&rsquo;t converge cross-track.</strong></li>
</ul>
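<p>In miniature, the pin is nothing exotic: find the scalar inside the mapped buffer, clamp it, hand the buffer back. A Python sketch of the idea (the real hook is C++ inside the emulator&rsquo;s <code>BindBuffers</code> path; the offset here is hypothetical):</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"># The exposure pin as an idea. The offset is hypothetical; the clamp value
# matches the per-track problem above (India wanted 3.0, Canada blew out).
import struct

EXPOSURE_OFFSET = 0x98  # hypothetical byte offset of the exposure scalar
CLAMP_MIN = 3.0

def pin_exposure(ubo: bytearray) -&gt; None:
    (value,) = struct.unpack_from(&#39;&lt;f&#39;, ubo, EXPOSURE_OFFSET)
    if value &lt; CLAMP_MIN:
        struct.pack_into(&#39;&lt;f&#39;, ubo, EXPOSURE_OFFSET, CLAMP_MIN)
        # after overwriting, the emulator must invalidate the buffer_cache
        # region so the GPU actually sees the pinned value
</code></pre></div></div></div>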
<video
  controls
  preload="metadata"
  playsinline
  class="html5-video"
  style="width:100%; max-width:960px; display:block; margin:1.25em auto; border-radius:8px; background:#000;"
  >
  <source src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/videos/debugging-cli-gamma-correction.mp4" type="video/mp4">
  Your browser does not support HTML5 video.
</video>

<h3>Phase 24: dim upstream of tonemap<span class="hx:absolute hx:-mt-20" id="phase-24-dim-upstream-of-tonemap"></span>
    <a href="#phase-24-dim-upstream-of-tonemap" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><blockquote>
  <p><a href="https://github.com/akitaonrails/shadPS4/blob/gamma-debug/docs/driveclub-investigation/phase-24-dim-upstream-of-tonemap.md"target="_blank" rel="noopener"><code>phase-24-dim-upstream-of-tonemap.md</code></a></p>

</blockquote>
<ul>
<li><strong>Conclusion:</strong> after the whole probe chain (identity tonemap, exposure pin, compute skip), the dim <strong>is in the HDR target before</strong> the tonemap reads it. Three upstream suspects: (1) atmospheric scattering / volumetric fog, (2) reflection-probe / envmap update, (3) pre-tonemap exposure apply.</li>
<li><strong>New concept:</strong> <strong>atmospheric scattering</strong> (simulating how light travels through atmosphere, Rayleigh/Mie-style), <strong>volumetric fog</strong> (fog rendered as a froxel volume).</li>
<li><strong>Outcome:</strong> plan set. Frame-order dump, walk backwards from tonemap through the whole chain.</li>
</ul>
<h3>Phase 25: frame-order trace<span class="hx:absolute hx:-mt-20" id="phase-25-frame-order-trace"></span>
    <a href="#phase-25-frame-order-trace" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><blockquote>
  <p><a href="https://github.com/akitaonrails/shadPS4/blob/gamma-debug/docs/driveclub-investigation/phase-25-frameorder-trace.md"target="_blank" rel="noopener"><code>phase-25-frameorder-trace.md</code></a></p>

</blockquote>
<ul>
<li><strong>Action:</strong> <code>SHADPS4_DC_FRAMEORDER=&lt;N&gt;</code> logs strict per-submit pipeline/shader chain over N submits during race window.</li>
<li><strong>New concept:</strong> <strong>frame-order reconstruction</strong> (listing every GPU submission in exact order), <strong>froxel volumetrics</strong> (frustum volume voxels — a 3D grid aligned to the camera&rsquo;s frustum that stores light info).</li>
<li><strong>Outcome:</strong> three suspects: (A) half-res compute <code>(60,34,1)</code>, probably SSAO; (B) progressive dispatches <code>(256,1,1)→(2560,1,1)→(10240,1,1)</code>, froxel volumetric; (C) fullscreen apply-fog-to-HDR draw. <strong>Suspect B</strong> is the strongest fit (sunset best-fit, per-pixel non-uniform dim).</li>
</ul>
<h3>Phase 26: The UBO fade pin, the &ldquo;aha moment&rdquo;<span class="hx:absolute hx:-mt-20" id="phase-26-the-ubo-fade-pin-the-aha-moment"></span>
    <a href="#phase-26-the-ubo-fade-pin-the-aha-moment" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><blockquote>
  <p><a href="https://github.com/akitaonrails/shadPS4/blob/gamma-debug/docs/driveclub-investigation/phase-26-ubo-fade-pin.md"target="_blank" rel="noopener"><code>phase-26-ubo-fade-pin.md</code></a></p>

</blockquote>
<p>This is where everything changed.</p>
<ul>
<li><strong>Hypothesis:</strong> a direct memory diff (bright vs dim vs recovered) identifies the exact bytes the game writes to control the blackout.</li>
<li><strong>Action:</strong> capture periodic UBO snapshots across visual state transitions. Diff three snapshots byte by byte. Look for fade signatures.</li>
<li><strong>New concept:</strong> <strong>state diff methodology</strong>. Instead of &ldquo;which pass looks suspicious?&rdquo;, ask &ldquo;which bytes actually change with the visual state?&rdquo;</li>
<li><strong>Outcome:</strong> <strong>breakthrough.</strong> Two UBOs control the fade: a 1936-byte one with slots <code>[38/48/50]</code> (light intensity) being multiplied by 0.094 (fade factor), and a 224-byte sun UBO with slot <code>[50]</code> dropping 97%. Pinning both: race start opens bright. TOD animation intact. Content-aware three-state lifecycle (idle → engaged → expired).</li>
</ul>
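<p>The methodology deserves a sketch because it&rsquo;s reusable anywhere. Assuming three raw snapshots of the same UBO on disk (file names are mine; the fade behavior is from the phase doc), the diff is a dozen lines of Python:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"># State-diff in miniature: three snapshots of the same UBO in three visual
# states, compared slot by slot as float32. A &quot;fade slot&quot; collapses in the
# dim state and comes back in the recovered one.
import struct

def slots(path):
    data = open(path, &#39;rb&#39;).read()
    return struct.unpack(f&#39;&lt;{len(data) // 4}f&#39;, data)

bright    = slots(&#39;ubo_bright.bin&#39;)     # race intro, correct image
dim       = slots(&#39;ubo_dim.bin&#39;)        # mid-blackout
recovered = slots(&#39;ubo_recovered.bin&#39;)  # after the 1:30 recovery

for i, (b, d, r) in enumerate(zip(bright, dim, recovered)):
    if b != 0.0 and abs(d) &lt; abs(b) * 0.2 and abs(r) &gt; abs(b) * 0.5:
        print(f&#39;slot [{i}]: bright={b:.4f} dim={d:.4f} recovered={r:.4f}&#39;)
</code></pre></div></div></div>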
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/23/driveclub-learning/touching-sun-light.png" alt="First touch on the sun-light UBO"  loading="lazy" /></p>
<p>Wednesday night. Claude suggests stopping again. <strong>&ldquo;We have the pin working, three-state lifecycle is robust, Canada + Munnar are running. This is upstream quality, we should PR and close it.&rdquo;</strong> I insist: &ldquo;there&rsquo;s still residual dim on other tracks, it&rsquo;s not closed. We keep going.&rdquo;</p>
<h3>Phase 28: pivot to emulator code<span class="hx:absolute hx:-mt-20" id="phase-28-pivot-to-emulator-code"></span>
    <a href="#phase-28-pivot-to-emulator-code" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><blockquote>
  <p><a href="https://github.com/akitaonrails/shadPS4/blob/gamma-debug/docs/driveclub-investigation/phase-28-pivot-to-emulator-code.md"target="_blank" rel="noopener"><code>phase-28-pivot-to-emulator-code.md</code></a></p>

</blockquote>
<ul>
<li><strong>Realization:</strong> the residual dim in Canada dusk/night isn&rsquo;t a wrong UBO value. It&rsquo;s an <strong>emulator translation gap</strong>. The game runs correctly on a real PS4, so shadPS4 has some path that translates or synchronizes incorrectly.</li>
<li><strong>Action:</strong> ranked 5 candidate emulator bugs: HLE shader misidentification, image storage classification, layout transitions, IMAGE_STORE_MIP fallback, OpImageFetch LOD restriction.</li>
<li><strong>New concept:</strong> <strong>HLE (High-Level Emulation)</strong>. Replacing complex PS4 OS functions with host-side implementations. Different from LLE (low-level) which emulates bit by bit. HLE is faster but can diverge from real behavior whenever the host implementation doesn&rsquo;t cover every case.</li>
</ul>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/23/driveclub-learning/canada-dim.png" alt="Canada dusk/night still dim"  loading="lazy" /></p>
<h3>Phase 29: calibrated state + recording harness<span class="hx:absolute hx:-mt-20" id="phase-29-calibrated-state--recording-harness"></span>
    <a href="#phase-29-calibrated-state--recording-harness" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><blockquote>
  <p><a href="https://github.com/akitaonrails/shadPS4/blob/gamma-debug/docs/driveclub-investigation/phase-29-calibrated-state.md"target="_blank" rel="noopener"><code>phase-29-calibrated-state.md</code></a></p>

</blockquote>
<ul>
<li><strong>Action:</strong> <code>SHADPS4_DC_RECORD=1</code>. Recording harness that dumps lighting UBOs periodically + cross-correlates with wall-clock screenshots.</li>
<li><strong>Outcome:</strong> important discovery. <strong>Slots <code>[144..295]</code> of the 1936-byte UBO</strong> (30+ fields) stay <strong>denormal/uninitialized</strong> for the first ~90 seconds of a race. At the 1:30 &ldquo;auto-recalibration&rdquo;, the game writes the real values. <strong>It&rsquo;s not magic recovery, it&rsquo;s the CPU finally writing values that should have been there since frame 1.</strong> On PS4 they&rsquo;re ready before the first frame; on shadPS4 they&rsquo;re delayed by 90s. <strong>Root cause lives in an emulator init gap.</strong></li>
</ul>
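<p>Spotting that kind of slot is mechanical once you know to look for it. A hedged sketch; the slot range comes from the phase doc, the snapshot handling is illustrative:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"># Flag float32 slots holding denormal values: nonzero, but smaller in
# magnitude than the smallest normal float (~1.1754944e-38). Uninitialized
# GPU-visible memory frequently reads back exactly this way.
import struct
import sys

FLT_MIN_NORMAL = 1.1754944e-38

data = open(sys.argv[1], &#39;rb&#39;).read()  # one UBO snapshot
values = struct.unpack(f&#39;&lt;{len(data) // 4}f&#39;, data)

for i in range(144, 296):  # the range that stays bogus for ~90 seconds
    v = values[i]
    if v != 0.0 and abs(v) &lt; FLT_MIN_NORMAL:
        print(f&#39;slot [{i}] is denormal/uninitialized: {v!r}&#39;)
</code></pre></div></div></div>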
<h3>Phase 30: UBO writer audit<span class="hx:absolute hx:-mt-20" id="phase-30-ubo-writer-audit"></span>
    <a href="#phase-30-ubo-writer-audit" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><blockquote>
  <p><a href="https://github.com/akitaonrails/shadPS4/blob/gamma-debug/docs/driveclub-investigation/phase-30-ubo-writer-audit.md"target="_blank" rel="noopener"><code>phase-30-ubo-writer-audit.md</code></a></p>

</blockquote>
<ul>
<li><strong>Action:</strong> exhaustive audit of every path that writes into slot <code>[38]</code> of the 1936-byte UBO. 7 suspects ranked: (1) ObtainBuffer stream-copy race, (2) lazy RegionManager in memory_tracker, (3) histogram compute skip, (4) page-fault delayed invalidation, (5) readback gating, (6) buffer-coherency race, (7) push-constant.</li>
<li><strong>Status:</strong> audit incomplete. Before I could verify each suspect, Phase 31 short-circuited the whole thing.</li>
</ul>
<h3>Phase 31: readbacks_mode = Precise, the turning point<span class="hx:absolute hx:-mt-20" id="phase-31-readbacks_mode--precise-the-turning-point"></span>
    <a href="#phase-31-readbacks_mode--precise-the-turning-point" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><blockquote>
  <p><a href="https://github.com/akitaonrails/shadPS4/blob/gamma-debug/docs/driveclub-investigation/phase-31-readbacks-mode-fix.md"target="_blank" rel="noopener"><code>phase-31-readbacks-mode-fix.md</code></a></p>

</blockquote>
<p>Thursday morning. Tired, thinking about suspect <strong>#5, readback gating</strong>. That one itched at me because it was the only one of the 7 writers that touched the <em>timing</em> of the read rather than the <em>correctness</em> of the write.</p>
<ul>
<li><strong>Hypothesis:</strong> flipping <code>readbacks_mode: 2 (Precise)</code> in per-game config makes the feedback loop work by synchronizing the CPU&rsquo;s read to fresh GPU output.</li>
<li><strong>Action:</strong> one edit in <code>custom_configs/CUSA00003.json</code>. Boot. Canada night.</li>
<li><strong>Outcome:</strong> <strong>race start opens bright. Auto-exposure converges normally. Zero blackout. Zero 1:30 recalibration. Night TOD natural.</strong></li>
</ul>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/23/driveclub-learning/almost-fixed.png" alt="Almost there, the test before the final one"  loading="lazy" /></p>
<p>Tested on Munnar. Same result. <strong>31 phases. 44 commits. 15,668 lines. Answer: one integer in the config.</strong></p>
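<p>Spelled out, the whole fix amounts to this. A hedged sketch; the key name and value are the real ones, the shape of the JSON around them is my assumption:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"># The one-integer fix as a script. Only the key name readbacks_mode and the
# value 2 (Precise) come from the investigation; the JSON structure around
# it is my assumption.
import json
import pathlib

cfg_path = pathlib.Path(&#39;custom_configs/CUSA00003.json&#39;)
cfg = json.loads(cfg_path.read_text())
cfg[&#39;readbacks_mode&#39;] = 2  # 0 = Disabled, 1 = Relaxed, 2 = Precise
cfg_path.write_text(json.dumps(cfg, indent=2))
</code></pre></div></div></div>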
<p>Why did this slip past 30 phases? The Phase 31 doc has the honest analysis:</p>
<ol>
<li><strong>Symptom disguise.</strong> Progressive darkening looks exactly like a broken tonemap, bad shader, or miscompiled exposure scalar. Every dump-driven probe found <strong>wrong values in the UBO</strong>, which is true but describes the downstream read, not the upstream gap.</li>
<li><strong><code>readback_linear_images</code> red herring.</strong> I tested that flag in Phase 29, saw no effect, and concluded &ldquo;the readback surface is audited&rdquo;. I forgot that flag is for <em>image</em> readbacks (linear images via <code>TextureCache::DownloadImageMemory</code>). The histogram SSBO is a <strong>buffer</strong>, a separate path.</li>
<li><strong>Pinning making it worse, not better.</strong> Overwriting slots [38/48/50] looked promising but never converged. The game rewrote on top of the pin every frame. I interpreted that as &ldquo;the pin mechanism lands too late in the pipeline&rdquo;. I spent phases 27-29 moving the pin earlier. The real fix was to <strong>stop clobbering the GPU-side input the integrator reads</strong>.</li>
<li><strong>Never audited the buffer readback path.</strong> The Phase 30 audit listed 7 writer candidates, but all of them were about the correctness of the <em>write</em>. None asked &ldquo;what is the CPU reading, expecting the GPU to have filled?&rdquo;</li>
</ol>
<h2>The shotgun technique<span class="hx:absolute hx:-mt-20" id="the-shotgun-technique"></span>
    <a href="#the-shotgun-technique" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>One thing I want to put on record: <strong>torture probes aren&rsquo;t waste, they&rsquo;re learning infrastructure.</strong> And the methodology behind them has a name: shotgun.</p>


<div class="embed-container">
  <iframe
    src="https://www.youtube.com/embed/HkxVhFg81fs?start=40"
    title="YouTube video player"
    frameborder="0"
    allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share"
    referrerpolicy="strict-origin-when-cross-origin"
    allowfullscreen>
  </iframe>
</div>

<p>If you have a suspicion (say, &ldquo;the dim comes from some texture&rdquo;), the worst thing you can do is start investigating one texture at a time, in depth. DriveClub has 1079 bound textures during the race window. If a detailed triage on one texture takes 20 minutes, you spend three days on a single branch of that tree and still don&rsquo;t know if the dim even comes from a texture. The dim might not be anywhere in that tree at all.</p>
<p>Shotgun fixes that. Instead of investigating one at a time, you fire at all of them simultaneously. Mutate every bound texture with a hash-based tint, force re-upload, run the game. In 30 seconds you know whether any of them affects the dim. If it does, the dim is texture-side and only then is it worth starting to narrow down. If nothing changes, congratulations: you just ruled out the entire &ldquo;texture&rdquo; category in 30 seconds. Switch categories, shotgun UBOs. No hit? Shotgun push constants. No hit? Shotgun compute dispatches. Keep going until something reacts.</p>
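<p>The hash-based tint trick is the heart of it, and it fits in a few lines. A hedged Python sketch; names are illustrative, the real probes are env-var-gated C++ inside the fork:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"># Shotgun in miniature: every bound texture gets a deterministic, distinctive
# tint derived from its own id, all in one run. If the dim reacts, the bug is
# texture-side; if nothing changes, the whole category is ruled out.
import hashlib

def tint_for(texture_id: int) -&gt; tuple[float, float, float]:
    digest = hashlib.sha1(texture_id.to_bytes(8, &#39;little&#39;)).digest()
    return (digest[0] / 255.0, digest[1] / 255.0, digest[2] / 255.0)

def shotgun_textures(bound_textures) -&gt; None:
    for tex in bound_textures:  # all 1079 of them, simultaneously
        tex.debug_tint = tint_for(tex.id)
        tex.force_reupload = True
</code></pre></div></div></div>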
<p><strong>Don&rsquo;t start by narrowing down a tree that might be huge.</strong> You&rsquo;ll spend weeks in irrelevant leaves. Shotgun the trunk first. See which branch reacts. Then go deep inside that branch.</p>
<p>That&rsquo;s what I did, without knowing I was doing it, for the first 30 phases. 32 probe env vars (<code>SHADPS4_DC_UBO_SMASH</code>, <code>SHADPS4_DC_PC_SMASH</code>, <code>SHADPS4_DC_DISPATCH_SMASH</code>, <code>SHADPS4_DC_TEX_NUKE_*</code>, <code>SHADPS4_DC_UBO_NUKE</code>, and more). Random batches, hash-based tints, null substitutions. Most of them never pointed at the real fix. But every shotgun round ruled out an entire category of mental model. &ldquo;Not texture.&rdquo; &ldquo;Not push-constant.&rdquo; &ldquo;Not those 8 compute pipelines.&rdquo; &ldquo;Not the UBO write side anywhere in that region.&rdquo; Every ruling-out narrows the problem.</p>
<p><strong>And while I was doing it, I was learning the system.</strong> Every smash that crashed told me &ldquo;there&rsquo;s a critical descriptor pointer here&rdquo;. Every color tint that appeared told me &ldquo;this UBO feeds that composition pass&rdquo;. Every dispatch that fired differently between dim/bright showed me where the interesting boundary lives.</p>
<p><strong>When you don&rsquo;t know the system, shotgun probes are a map.</strong> You don&rsquo;t use them to find the answer. You use them to learn the terrain.</p>
<h2>What I knew on Sunday vs Thursday<span class="hx:absolute hx:-mt-20" id="what-i-knew-on-sunday-vs-thursday"></span>
    <a href="#what-i-knew-on-sunday-vs-thursday" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p><strong>Sunday:</strong> zero everything. All the areas below were black boxes to me.</p>
<p><strong>Thursday:</strong></p>
<ul>
<li>PS4 package format end-to-end (PKG → PFS → param.sfo → disc_info → keystone → npbind). How v1.28 cumulative patches layer on top of a v1.00 base install.</li>
<li>shadPS4 architecture in outline: <code>src/common</code>, <code>src/core</code> (OS emulation, HLE libraries), <code>src/shader_recompiler</code> (GCN→SPIR-V), <code>src/video_core</code> (Vulkan renderer, buffer cache, texture cache, page manager).</li>
<li>Full graphics pipeline: G-buffer → forward+ lighting (writes MSAA depth) → read depth as color for SSAO / volumetric froxel → post-fx compute → tonemap → composite → swapchain.</li>
<li>UBO vs push constant vs descriptor set. When to use each.</li>
<li>SPIR-V disassembly and re-assembly for surgical patching. How to identify a tonemap compute by its Exp2/Log2 signature.</li>
<li>Forward+ lighting + MSAA depth resolve + how to implement ReinterpretMsDepthAsColor.</li>
<li>shadPS4 buffer cache: <code>MemoryTracker</code>, <code>RegionManager</code>, <code>FaultManager</code>. How page faults turn into GPU→CPU sync.</li>
<li><code>readbacks_mode</code>: Disabled / Relaxed / Precise. What each tradeoff looks like. Why Relaxed isn&rsquo;t enough for feedback loops (it only write-protects, not read-protects).</li>
<li>PS4 OELF → Itanium RTTI → SCE relocation → dynamic linker reasoning, even without actually doing the binary patch (the plan is documented).</li>
</ul>
<p>I didn&rsquo;t become an expert in any of it. But I went from <strong>total outsider</strong> to <strong>reasonably oriented operator</strong>. And that jump, as I&rsquo;ll argue below, was the single most valuable thing of the four days.</p>
<h2>The AI suggested accepting. The perseverance was mine<span class="hx:absolute hx:-mt-20" id="the-ai-suggested-accepting-the-perseverance-was-mine"></span>
    <a href="#the-ai-suggested-accepting-the-perseverance-was-mine" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>I&rsquo;ve said a few times in this article that &ldquo;Claude suggested stopping&rdquo;. I want to be fair and specific: it wasn&rsquo;t an AI failure. It was exactly the correct behavior of an intelligent tool: <strong>when you have something that works reasonably well, stopping and documenting is frequently the right call.</strong> The Phase 22+ tonemap patch made the game playable. The Phase 26 UBO pin made the image pretty. The Phase 29 recording harness covered the critical window. All were legitimate stopping points.</p>
<p>If you read <a href="https://github.com/akitaonrails/shadPS4/blob/gamma-debug/docs/codex-conclusion.md"target="_blank" rel="noopener"><code>codex-conclusion.md</code></a>, the document the AI produced after Phase 26 (not yet knowing about Phase 31), it describes the then-current solution as &ldquo;<strong>a mitigation, not a root-cause fix</strong>&rdquo;. It honestly admits what we knew: the UBO pin was a countermeasure, not an explanation. That&rsquo;s good behavior.</p>
<p>The difference is that <strong>I had a stubborn hunch that something was still wrong</strong>. It worked in the playable sense. But the cost (DriveClub-specific pin + tonemap SPIR-V patch + per-track threshold tuning + doesn&rsquo;t generalize) was too high for a supposedly engineering-quality fix. A senior professional recognizes when the asymmetry between cost and understanding points to something deeper. You don&rsquo;t always get it right. But that hunch deserves to be heard.</p>
<p>Four days without interruption, 8 a.m. to midnight, Sunday, Monday, Tuesday, Wednesday, resolved Thursday morning. How many times did the AI suggest &ldquo;honestly, we should just accept the current result and document&rdquo;? Four or five. Each technically defensible. <strong>Human perseverance is the difference between &ldquo;the game works well enough&rdquo; and &ldquo;the game works, and I know why&rdquo;.</strong></p>
<h2>The real cost of this journey<span class="hx:absolute hx:-mt-20" id="the-real-cost-of-this-journey"></span>
    <a href="#the-real-cost-of-this-journey" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p><a href="https://github.com/akitaonrails/shadPS4/blob/gamma-debug/docs/driveclub-investigation/investigation-effort-accounting.md"target="_blank" rel="noopener"><code>investigation-effort-accounting.md</code></a> catalogs everything before cleanup. Some highlights:</p>
<ul>
<li><strong>3 days</strong> of active work (Apr 21-23, 2026).</li>
<li><strong>33 numbered phases</strong> + sweeps + side threads = 54 docs, 412 KB, <strong>6,867 lines of prose</strong>.</li>
<li><strong>44 commits</strong> on the gamma-debug fork. <strong>15,668 lines added</strong>, 1,204 removed.</li>
<li><code>src/video_core/renderer_vulkan/vk_rasterizer.cpp</code> grew from 1,364 to <strong>4,680 lines</strong> (3.4×) with instrumentation, pin lifecycles, recording harness.</li>
<li><strong>32 probe env vars</strong> defined. <strong>1 effectively used to find the bug</strong> (<code>SHADPS4_DC_LOG_STREAMCOPY</code>, in Phase 30).</li>
<li><strong>61 MB</strong> of shadPS4 log accumulated. <strong>57 MB</strong> of shader dumps (5,803 files). ~<strong>470 MB</strong> of total reclaimable artifacts once closed.</li>
</ul>
<p>Resolution: <strong>1 integer in the config.</strong></p>
<p>The ratio of <strong>6,867 lines of prose : 1 integer</strong> amuses me. But it isn&rsquo;t a joke about inefficiency. It&rsquo;s the real cost of learning a system you didn&rsquo;t know. Every line of prose ruled out a hypothesis. Every commit taught a layer of the stack. The integer was the end; the path was the product.</p>
<p>An honest note: despite the 15,668 lines added on the <code>gamma-debug</code> fork, <strong>almost none of it is useful to contribute upstream to shadPS4</strong>. What&rsquo;s there is instrumentation: torture probes, recording harness, pin lifecycles, calibrate-at-arm, UBO snapshots, dozens of debug env vars. Diagnostic code, not production code. Diagnostics that only make sense for DriveClub under those specific conditions.</p>
<p>The fork is going to stay public, but <strong>not as a contribution branch to the shadPS4 project</strong>. Unfortunately. It stays as a study document: 54 md files covering the full process, 44 commits showing the sequence of hypotheses, rule-outs and failures. It&rsquo;s material for anyone who wants to understand how the debugging process unfolded, or to reuse some of the techniques in similar investigations. No upstream-ready PR is coming out of that tree.</p>
<h2>Closing<span class="hx:absolute hx:-mt-20" id="closing"></span>
    <a href="#closing" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Yes, the final solution was a config toggle that already existed. It&rsquo;s been in shadPS4&rsquo;s code for seven months. But on Sunday night, with zero PS4 knowledge, zero emulator knowledge, zero notion that auto-exposure was a GPU→CPU feedback loop, <strong>I had no technical authority to say &ldquo;flip readbacks Precise, done&rdquo;</strong>.</p>
<p>I had to see the whole chain to understand <em>why</em> it works. Why Relaxed isn&rsquo;t enough (it only write-protects, and DriveClub&rsquo;s CPU only reads the histogram SSBO, never writes to it, so the protection never trips). Why Precise is expensive (per-page mprotect + per-fault GPU stall). Why the maintainer hid it behind &ldquo;Advanced&rdquo; (Bloodborne hangs, AMD tanks to 12 FPS). Why nobody in the community had connected Precise to DriveClub before: the compat tracker admits &ldquo;readbacks enabled&rdquo; but doesn&rsquo;t cite the mode, and nobody had seen the monotonic-drift signature through the GPU→CPU lens.</p>
<p>That&rsquo;s the difference between <strong>knowing the answer</strong> and <strong>knowing the terrain well enough to recognize the answer</strong>. To the junior asking me &ldquo;how do I learn today, in the AI era&rdquo;: that&rsquo;s my answer. Don&rsquo;t ask. Dive in. Use AI to speed up the search, to read code in parallel, to index old forums, to compile, to parse logs. But <strong>insist on understanding the why</strong>. When the AI suggests &ldquo;let&rsquo;s accept and document&rdquo; (and it will, because that&rsquo;s often the right behavior), ask yourself whether you actually understood the chain. If the itch &ldquo;this is wrong but I don&rsquo;t know why&rdquo; is still there, keep going.</p>
<p>You&rsquo;re probably going to spend four days on something that has a one-integer solution. You&rsquo;re probably going to discover five wrong paths before the right one. Probably half your probes will be torture shotguns that never point at the fix.</p>
<p>And you&rsquo;ll probably, when you finish, know a domain you didn&rsquo;t know before.</p>
<ul>
<li><strong>Fork gamma-debug:</strong> <a href="https://github.com/akitaonrails/shadPS4/tree/gamma-debug"target="_blank" rel="noopener">github.com/akitaonrails/shadPS4/tree/gamma-debug</a></li>
<li><strong>54 investigation docs:</strong> <a href="https://github.com/akitaonrails/shadPS4/tree/gamma-debug/docs/driveclub-investigation"target="_blank" rel="noopener">docs/driveclub-investigation/</a></li>
<li><strong>Operational runbook (recipe):</strong> <a href="https://github.com/akitaonrails/distrobox-gaming/blob/master/docs/driveclub-shadps4.md"target="_blank" rel="noopener">distrobox-gaming/docs/driveclub-shadps4.md</a></li>
<li><strong>Phase 31 resolution:</strong> <a href="https://github.com/akitaonrails/shadPS4/blob/gamma-debug/docs/driveclub-investigation/phase-31-readbacks-mode-fix.md"target="_blank" rel="noopener"><code>readbacks_mode: 2</code></a></li>
</ul>
<p>Now I&rsquo;m going to go play DriveClub. Night race in Canada, natural brightness, no blackout. Good Thursday.</p>
]]></content:encoded><category>driveclub</category><category>shadps4</category><category>emulation</category><category>ai-agents</category><category>claude-code</category><category>learning</category><category>ps4</category></item><item><title>Clean Code for AI Agents</title><link>https://akitaonrails.github.io/en/2026/04/20/clean-code-for-ai-agents/</link><guid isPermaLink="true">https://akitaonrails.github.io/en/2026/04/20/clean-code-for-ai-agents/</guid><pubDate>Mon, 20 Apr 2026 12:00:00 GMT</pubDate><description>&lt;p&gt;In 2008, Robert C. Martin (Uncle Bob) published &lt;strong&gt;Clean Code: A Handbook of Agile Software Craftsmanship&lt;/strong&gt;. It&amp;rsquo;s one of the most influential software engineering books of the last couple of decades. For those who don&amp;rsquo;t know, Uncle Bob started programming professionally at 17, founded Object Mentor, signed the Agile Manifesto, served as the first chairman of the Agile Alliance, coined the SOLID acronym. He&amp;rsquo;s written a dozen books on design, architecture, and practice, and influenced entire generations of developers.&lt;/p&gt;</description><content:encoded><![CDATA[<p>In 2008, Robert C. Martin (Uncle Bob) published <strong>Clean Code: A Handbook of Agile Software Craftsmanship</strong>. It&rsquo;s one of the most influential software engineering books of the last couple of decades. For those who don&rsquo;t know, Uncle Bob started programming professionally at 17, founded Object Mentor, signed the Agile Manifesto, served as the first chairman of the Agile Alliance, coined the SOLID acronym. He&rsquo;s written a dozen books on design, architecture, and practice, and influenced entire generations of developers.</p>
<p>I&rsquo;ve been following Uncle Bob for many years. I&rsquo;ve exchanged messages with him a few times over the decades, I have formed opinions about his positions, and I did an hour-long live stream on my channel with him about Clean Code, Agile, Craftsmanship, and where the industry was heading. If you&rsquo;ve never watched it, I recommend it:</p>


<div class="embed-container">
  <iframe
    src="https://www.youtube.com/embed/ycvaECDc31w"
    title="YouTube video player"
    frameborder="0"
    allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share"
    referrerpolicy="strict-origin-when-cross-origin"
    allowfullscreen>
  </iframe>
</div>

<p>Clean Code specifically set a standard: code is written once but read dozens of times. The programmer&rsquo;s job isn&rsquo;t just to make it work. It&rsquo;s to make it work <strong>in a way that another programmer can understand, modify, and not break</strong>. Meaningful names, small functions, one responsibility per class, no duplication, automated tests, clear structure. The target audience was always another human being sitting at an editor trying to figure out what the first one did.</p>
<p>In 2026, that audience changed.</p>
<h2>The audience isn&rsquo;t human anymore<span class="hx:absolute hx:-mt-20" id="the-audience-isnt-human-anymore"></span>
    <a href="#the-audience-isnt-human-anymore" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Last week I wrote <a href="/en/2026/04/11/vs-code-is-the-new-punch-card/">VS Code Is the New Punch Card</a>. The thesis is that typing code manually into a text editor is becoming a niche activity, the same way typing binary directly on an Altair front panel became a relic after compilers got good. The AI agent is the new compiler: I describe intent in natural language, the agent navigates code, edits, runs tests, adjusts, delivers.</p>
<p>If that thesis holds, the next question is obvious: <strong>who are we writing code for now?</strong></p>
<p>Not for the human programmer who&rsquo;ll sit down tomorrow to maintain it. It&rsquo;s for the agent that&rsquo;ll read, edit, and extend it. And an agent is not human. Agents have different technical constraints, different biases, different limitations. Part of Clean Code still applies (in some cases more critically). Another part shifts in weight. And new demands emerge that Uncle Bob couldn&rsquo;t have anticipated in 2008.</p>
<p>This article is about that. What&rsquo;s the version of Clean Code that makes sense when the primary reader is an LLM?</p>
<h2>The real agent constraints<span class="hx:absolute hx:-mt-20" id="the-real-agent-constraints"></span>
    <a href="#the-real-agent-constraints" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Before re-ranking, worth reviewing what agents actually face.</p>
<p><strong>File truncation.</strong> Most agent CLIs limit file reads to small ranges. Claude Code reads 2000 lines per chunk by default. Cursor, Codex, Windsurf, all have similar caps. A giant file simply doesn&rsquo;t fit into the context window in one shot. The agent has to ask for piece by piece, or worse, use grep and reconstruct mentally.</p>
<p><strong>Attention degrades with context.</strong> Claude Opus has a 200k-token window, Sonnet 1M, Gemini 1M. Sounds like a lot. In practice, &ldquo;needle in haystack&rdquo; tests show retrieval quality drops well before the claimed limit. Flash Attention and variants speed up the computation, but they don&rsquo;t replace full native attention. The more you stuff into the window, the worse the detail precision. And the agent&rsquo;s context isn&rsquo;t just your code: it holds CLAUDE.md, the system prompt, chat history, tool output, error logs, test output. All competing for the same window.</p>
<p><strong>Grep is cheaper than read.</strong> The agent knows this. It prefers <code>rg &quot;funcName&quot;</code> over loading a whole file. It&rsquo;s faster, uses fewer tokens, hits the target. Unique, distinctive names make that much more effective. This isn&rsquo;t a shortcut, it&rsquo;s an architectural choice: I wrote about this in detail in <a href="/en/2026/04/06/rag-is-dead-long-context/">Is RAG Dead? Long Context, Grep, and the End of the Mandatory Vector DB</a>, showing that Claude Code itself navigates a repo with <code>Glob</code> and <code>Grep</code>, no vector DB, no embedding, and that&rsquo;s not a deficiency, it&rsquo;s mature design. Lexical search + smart reader consuming raw text beats dense retriever + top-k on practically every real-domain benchmark. You benefit from that when you organize your code: greppable names aren&rsquo;t just &ldquo;nice for humans&rdquo;, they&rsquo;re the agent&rsquo;s primary navigation API.</p>
<p><strong>Tool calls cost tokens.</strong> Every <code>Read</code> or <code>Edit</code> or <code>Bash</code> burns input and output tokens. Short files, small test output, concise logs — all of that keeps the agent productive and the bill low.</p>
<p><strong>Latency matters.</strong> Agent in a loop, each tool call adds seconds. A large file slow to process becomes perceptible session friction.</p>
<p><strong>Grepping by visual pattern is hard.</strong> If you used inconsistent indentation, mixed tabs and spaces, varied brace style between files, the agent spends tokens internalizing the mess. Consistency helps.</p>
<p>From those constraints, Clean Code principles can be re-ranked by relevance for agent-facing work.</p>
<h2>Re-ranking: Clean Code in the age of agents<span class="hx:absolute hx:-mt-20" id="re-ranking-clean-code-in-the-age-of-agents"></span>
    <a href="#re-ranking-clean-code-in-the-age-of-agents" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Most important to least. A caveat: it&rsquo;s not that the ones at the bottom stop mattering. It&rsquo;s that the ones at the top started mattering MUCH more.</p>
<h3>1. Small functions (and small files)<span class="hx:absolute hx:-mt-20" id="1-small-functions-and-small-files"></span>
    <a href="#1-small-functions-and-small-files" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Uncle Bob: &ldquo;functions should do ONE thing, they should do it WELL, and they should do it only&rdquo;. Ideal size 4 to 20 lines, per the book.</p>
<p>For an agent, that recommendation became a technical obligation. A small function fits in a single tool call without truncation. A short file (keep it under 500 lines, ideally 200-300) fits in a single read. If the agent can grab the whole unit of meaning in one call, it reasons about it with full attention. If it has to paginate, it builds a fragmented mental model, and each fragment costs attention.</p>
<p>Before, &ldquo;small function&rdquo; was good for humans because it aided reading. Today, &ldquo;small function&rdquo; is good because it matches the model&rsquo;s unit of processing. If there&rsquo;s one recommendation to take to heart, it&rsquo;s this one.</p>
<h3>2. Single Responsibility Principle (SRP)<span class="hx:absolute hx:-mt-20" id="2-single-responsibility-principle-srp"></span>
    <a href="#2-single-responsibility-principle-srp" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Each module does one thing and has one reason to change. It was already the heart of Clean Code. For an agent, it becomes even more critical because:</p>
<ul>
<li>The agent can isolate the unit to understand without loading the rest of the system</li>
<li>You can run focused tests on it</li>
<li>You can edit without fearing side effects</li>
<li>Grepping by responsibility becomes predictable</li>
</ul>
<p>Code with tangled responsibilities forces the agent to load way more context for any simple change. An 800-line class that does three things is worse for the agent than three 250-line classes, even if the total is the same.</p>
<h3>3. Meaningful and unique names<span class="hx:absolute hx:-mt-20" id="3-meaningful-and-unique-names"></span>
    <a href="#3-meaningful-and-unique-names" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Clean Code already preached: names reveal intention, no disinformation, distinctive, pronounceable, searchable. For the agent, &ldquo;searchable&rdquo; became the most important property of that list.</p>
<p>The agent searches code via grep/ripgrep all the time. A generic name (<code>data</code>, <code>process</code>, <code>handler</code>, <code>Manager</code>, <code>Service</code>) returns fifty matches and forces the agent to read each one. A distinctive name (<code>UserRegistrationValidator</code>, <code>InvoiceLineItemTotal</code>, <code>ClaudeCodeSessionTracker</code>) returns three matches and the agent goes straight to the right one.</p>
<p>Rule of thumb: if you grep the name and a lot of irrelevant stuff comes back, the name is bad for the agent. If only what matters comes back, the name is right.</p>
<h3>4. Comments with context and provenance<span class="hx:absolute hx:-mt-20" id="4-comments-with-context-and-provenance"></span>
    <a href="#4-comments-with-context-and-provenance" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>This is where the inversion is most jarring. For Uncle Bob in 2008, the axiom was: &ldquo;good code explains itself, excessive comments are a code smell, every comment is debt that gets stale&rdquo;. Every experienced programmer who read the book absorbed that rule. Well-named code doesn&rsquo;t need comments. Too many comments = flag of bad code trying to justify itself.</p>
<p>Now flip it. <strong>The agent reads comments. And likes them</strong>. Comments become first-class context. The agent has perfect syntax fluency, knows exactly what <code>x++</code> does, doesn&rsquo;t need obvious captions (that kind is still bad, see item 13). What it DOESN&rsquo;T know is why you chose this approach over the obvious one, what production bug motivated this weird logic, what business constraint forces this specific order, what workaround exists because the upstream lib has known bug #1234, which commit introduced this decision, which Jira issue is the reference. That kind of information is <strong>provenance</strong>: the why of the decision. It only exists in the head of the human who wrote it, in the commit message, or in a well-placed comment. For the agent, the comment is the most accessible source during a tool call.</p>
<p>Docstrings with intent and usage examples also became strong signals. When the agent picks up a function without understanding the context, a header docstring (JSDoc-style with examples, Python <code>&quot;&quot;&quot;</code>, Rust <code>///</code>) drastically shortens the path to a correct change. Uncle Bob was skeptical of JavaDoc in 2008 because they got stale. Today, with the agent able to rewrite the docstring alongside the code, that counter-argument lost weight.</p>
<p>A practical consequence: <strong>don&rsquo;t prune the comments the agent writes</strong>. If you have the reflex &ldquo;verbose comment is noise&rdquo; inherited from the original Clean Code era, that rule flipped. The agent wrote that comment because, in the act of generating the code, it decided that information was worth preserving for future edits. Removing the comment in code review strips context that the agent itself will want to read on the next interaction. Let the agent comment. It knows what it&rsquo;s doing. The only kind of agent-authored comment worth removing is the obvious redundant one (item 13), and modern models rarely produce those if the system prompt is well written.</p>
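<p>To make it concrete, this is the kind of comment that carries provenance. A hypothetical example: every specific in it (versions, date, SHA) is a placeholder, and bug #1234 is borrowed from the paragraph above:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"># Hypothetical example; the point is the comment, not the function.
#
# Retry the upload twice before failing. NOT defensive paranoia: upstream
# lib versions &lt;= 2.3 drop the first chunked PUT on keep-alive reuse
# (their known bug #1234). Remove once we depend on &gt;= 2.4.
# Context: prod incident 2026-03-14, commit a1b2c3d (placeholders).
def upload_with_retry(chunk: bytes) -&gt; None:
    ...
</code></pre></div></div></div>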
<h3>5. Explicit types<span class="hx:absolute hx:-mt-20" id="5-explicit-types"></span>
    <a href="#5-explicit-types" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>This isn&rsquo;t in 2008&rsquo;s Clean Code because the industry hadn&rsquo;t converted yet. But in 2026 it&rsquo;s a fundamental criterion.</p>
<p>Python without type hints, JavaScript instead of TypeScript, Ruby without RBS. Dynamic code without annotations forces the agent to infer types from usage, which costs reasoning and gets it wrong frequently. Typed code gives an immediate answer key: the signature says what goes in, what comes out, which states are valid. The agent saves discovery work and makes fewer mistakes.</p>
<p>If you&rsquo;re still on Python 3 without type hints, the transition will boost agent productivity more than any logic refactor.</p>
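<p>A minimal before/after of what that answer key looks like, with illustrative names:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"># Before: the agent has to infer what &#39;data&#39; holds from every call site.
def total(data):
    return sum(i[&#39;price&#39;] * i[&#39;qty&#39;] for i in data)

# After: the signature itself is the answer key.
from decimal import Decimal
from typing import TypedDict

class LineItem(TypedDict):
    price: Decimal
    qty: int

def invoice_total(items: list[LineItem]) -&gt; Decimal:
    return sum((i[&#39;price&#39;] * i[&#39;qty&#39;] for i in items), Decimal(0))
</code></pre></div></div></div>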
<h3>6. DRY (Don&rsquo;t Repeat Yourself)<span class="hx:absolute hx:-mt-20" id="6-dry-dont-repeat-yourself"></span>
    <a href="#6-dry-dont-repeat-yourself" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Clean Code already said duplication is the root of all evil. For an agent, duplication is worse than for a human for one specific reason: when the agent has to change something that&rsquo;s replicated, it can update one copy and forget the others. The attention window doesn&rsquo;t have natural gravity pulling &ldquo;oh, there are two more copies of this in other files&rdquo;. The agent has to find each one via grep, and if the pattern has subtle variation between copies, the result ends up inconsistent.</p>
<p>Factoring into a reusable function or module isn&rsquo;t aesthetic. It&rsquo;s automated-refactor safety.</p>
<h3>7. Tests the agent can run<span class="hx:absolute hx:-mt-20" id="7-tests-the-agent-can-run"></span>
    <a href="#7-tests-the-agent-can-run" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Uncle Bob dedicates a full chapter to Unit Tests and F.I.R.S.T (Fast, Independent, Repeatable, Self-Validating, Timely). All of it still applies, with an important addendum: <strong>the test has to be executable by the agent without human setup</strong>.</p>
<p>Meaning: the command to run the test is in the README or CLAUDE.md, in the <code>Makefile</code>, in the <code>package.json</code>. Output has a predictable format the agent parses. It doesn&rsquo;t depend on manually seeding the database, on a config file that isn&rsquo;t in the repo, on a secret credential. The agent writes code, runs tests, reads output, adjusts, runs again. That cycle is the foundation. If the tests don&rsquo;t run headless, the agent goes blind.</p>
<p>Here I speak from field experience. I documented this in <a href="/en/2026/02/20/zero-to-post-production-in-1-week-using-ai-on-real-projects-behind-the-m-akita-chronicles/">From Zero to Post-Production in 1 Week Using AI</a>, where I went hard on a real project: 274 commits in 8 days, 4 integrated applications, 1,323 automated tests by the end. What made it work wasn&rsquo;t &ldquo;AI programs on its own&rdquo;. It was <strong>Extreme Programming with the agent as pair instead of a human pair</strong>. Running tests on every commit, tight CI, coverage above 80% (95%+ on business logic), test-line-to-code-line ratio above 1:1 on some modules. Sounds like overkill. It isn&rsquo;t. In 274 commits, CI caught real bugs more than 50 times, bugs that would have gone straight to production if I had blindly trusted the agent. Without tests, the agent hands you plausible code that silently breaks something that worked yesterday. With strong tests, the agent becomes a multiplier: it generates a test, the test validates the code it wrote, the test is the safety net for the next change it makes. Virtuous loop.</p>
<p>XP practices (pair programming, CI, tests first, continuous refactoring, short feedback) didn&rsquo;t become obsolete. They became <strong>exactly the right way to work with an agent</strong>. Whoever programs in cowboy mode without tests today isn&rsquo;t rebellious. They&rsquo;re just slow, because the agent without tests keeps guessing, and guesses need manual review, which kills the speed the agent should bring. Good tests with good coverage became the difference between a productive agent and an agent that keeps flailing. Or, put another way: <strong>TDD became a technical obligation, not a philosophy</strong>.</p>
<p>I covered this theme from another angle in <a href="/en/2026/03/01/software-is-never-done-4-projects-life-after-deploy-one-shot-prompt-myth/">Software Is Never &ldquo;Done&rdquo;</a>, showing that post-deploy life is where tests matter most: in ten days of operation after launch, I ran 56 commits of fixes, hardening, and adjustment in response to real behavior, and each commit came with a regression test. Without the net, each of those 56 commits would be an opportunity to break something that worked yesterday. TDD isn&rsquo;t a phase, it&rsquo;s a habit.</p>
<h3>8. Predictable directory structure<span class="hx:absolute hx:-mt-20" id="8-predictable-directory-structure"></span>
    <a href="#8-predictable-directory-structure" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Clean Code barely discusses this (it was more focused on code inside a file). For an agent, tree organization matters. If <code>src/controllers/users.rb</code> implies <code>src/models/user.rb</code> and <code>src/views/users/</code>, the agent can anticipate paths without listing directories. If the project uses idiosyncratic naming (random files, unpatterned names, everything flat in one folder), the agent loses time with <code>find</code>.</p>
<p>Strong framework conventions (Rails, Django, Next.js, Laravel) help the agent a lot. In a project without conventions, the agent pieces a map together over time, but until then it burns tokens exploring.</p>
<h3>9. Dependency Injection and Testability<span class="hx:absolute hx:-mt-20" id="9-dependency-injection-and-testability"></span>
    <a href="#9-dependency-injection-and-testability" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Code with injected dependencies (not hardcoded) is easier to test in isolation. The agent benefits from this. It can swap the real <code>EmailSender</code> for a <code>FakeEmailSender</code> in a test without touching the logic. Code that instantiates its dependencies internally forces the agent into monkey-patch-fake-server-hacks that are slow, fragile, and pollute the session with infra grime.</p>
<p>DI isn&rsquo;t ceremony. It&rsquo;s isolation scope. And in a real project, DI quickly becomes a load-bearing refactor: on one of my projects (the <a href="/en/2026/03/01/software-is-never-done-4-projects-life-after-deploy-one-shot-prompt-myth/">M.Akita Chronicles</a>), I discovered after launch that I needed to swap the default LLM model to another provider. The environment variable had existed from the start. But the model name was still hardcoded in references across 24 files. A whole commit (<code>Centralize LLM model config</code>) touched all 24 to isolate the config into a single constant. Swapping models after that became a one-line change. That&rsquo;s exactly the kind of refactor that only shows up after software meets reality, and it&rsquo;s where DI and config isolation pay dearly if you didn&rsquo;t do it earlier.</p>
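<p>The <code>EmailSender</code> / <code>FakeEmailSender</code> swap mentioned above, in miniature. A sketch assuming a Protocol-style interface; everything beyond those two names is illustrative:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"># The swap in miniature, assuming a Protocol-style interface.
from typing import Protocol

class EmailSender(Protocol):
    def send(self, to: str, subject: str, body: str) -&gt; None: ...

class FakeEmailSender:
    def __init__(self) -&gt; None:
        self.sent: list[tuple[str, str, str]] = []
    def send(self, to: str, subject: str, body: str) -&gt; None:
        self.sent.append((to, subject, body))

class UserRegistration:
    def __init__(self, mailer: EmailSender) -&gt; None:
        self.mailer = mailer  # injected, never instantiated internally
    def register(self, email: str) -&gt; None:
        self.mailer.send(email, &#39;Welcome&#39;, &#39;Your account is ready.&#39;)

# In a test: no monkey-patching, no fake SMTP server.
fake = FakeEmailSender()
UserRegistration(fake).register(&#39;a@example.com&#39;)
assert fake.sent[0][0] == &#39;a@example.com&#39;
</code></pre></div></div></div>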
<h3>10. Avoid deep nesting<span class="hx:absolute hx:-mt-20" id="10-avoid-deep-nesting"></span>
    <a href="#10-avoid-deep-nesting" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Clean Code talks about single level of abstraction per function. A corollary: avoid <code>if</code> inside <code>for</code> inside <code>if</code> inside <code>try</code>. Every indentation level is more attention the model has to spend tracking state. Four levels of indentation is MUCH more cognitively expensive for the agent than two levels with early return.</p>
<p>Pattern matching, guard clauses, early returns, flattening logic, all of this improves readability for the model the same way it improves it for the human, except measurably so, because the cost is measured in response quality.</p>
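<p>The same rule, shown side by side. Illustrative names, with a trivial stub so the snippet runs on its own:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">def dispatch(item) -&gt; None:  # stub so the example is self-contained
    print(&#39;shipping&#39;, item)

# Four levels deep: the model tracks nested state all the way down.
def ship_nested(order) -&gt; None:
    if order is not None:
        if order.paid:
            for item in order.items:
                if item.in_stock:
                    dispatch(item)

# Guard clause + early return: two levels, same behavior.
def ship(order) -&gt; None:
    if order is None or not order.paid:
        return
    for item in order.items:
        if item.in_stock:
            dispatch(item)
</code></pre></div></div></div>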
<h3>11. Errors with context<span class="hx:absolute hx:-mt-20" id="11-errors-with-context"></span>
    <a href="#11-errors-with-context" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p><code>raise ValueError(&quot;invalid input&quot;)</code> doesn&rsquo;t help the agent when it reads the stack trace. <code>raise ValueError(f&quot;invalid input: received {repr(x)}, expected non-empty string of digits&quot;)</code> does. The agent uses exception messages as debug signal. Vague message = agent runs an extra round to figure out what went wrong.</p>
<p>Uncle Bob talked about this in Error Handling: &ldquo;Provide context with exceptions&rdquo;. It became critical now.</p>
<h3>12. Formatting and style<span class="hx:absolute hx:-mt-20" id="12-formatting-and-style"></span>
    <a href="#12-formatting-and-style" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Don&rsquo;t waste time on this. Use the default or most popular formatter for your language: <code>cargo fmt</code> for Rust, <code>gofmt</code> for Go, <code>prettier</code> for JS/TS, <code>black</code> or <code>ruff</code> for Python, <code>rubocop -A</code> for Ruby. Configure it in pre-commit, configure it in your editor to run on save, and move on. The agent handles any consistent style just fine, and the auto-formatter keeps the diff tidy between commits. Tab vs space, 80 vs 100 columns, brace style, all of that became noise. The formatter decides, you accept.</p>
<h3>13. Comments that describe the obvious<span class="hx:absolute hx:-mt-20" id="13-comments-that-describe-the-obvious"></span>
    <a href="#13-comments-that-describe-the-obvious" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Last on the list. Still bad, got even worse. Comments like <code>// increment i by 1</code> above <code>i++</code> waste the agent&rsquo;s tokens the same way they wasted the human&rsquo;s patience. The model knows how to read code, it doesn&rsquo;t need obvious captions.</p>
<p>If you have the habit of writing obvious comments because some school taught you that way, this is the moment to stop. In 2008 it was bad because it polluted visual space. In 2026 it&rsquo;s bad because it costs real money in tokens.</p>
<h2>What Uncle Bob couldn&rsquo;t foresee<span class="hx:absolute hx:-mt-20" id="what-uncle-bob-couldnt-foresee"></span>
    <a href="#what-uncle-bob-couldnt-foresee" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Beyond re-ranking what was in the book, some new things emerged that are specific to the agent world:</p>
<p><strong>Meta-documentation files for agents.</strong> <code>CLAUDE.md</code>, <code>AGENTS.md</code>, <code>.cursor/rules</code>, and so on. These are files the agent reads before any tool call, describing project conventions, important commands, caveats, things that don&rsquo;t go in a docstring. Writing these files is a new skill: short, direct, imperative, action-oriented. No philosophical prose. Bullet points of what the agent needs to know to not mess up.</p>
<p><strong>README with high-level architecture.</strong> Uncle Bob barely cared about README (the book is about code). For the agent, well-made READMEs drastically shorten the path to understanding the shape of the project. A simple ASCII or Mermaid diagram helps.</p>
<p><strong>Structured logging.</strong> JSON logs with named fields are much more useful for the agent than prose logs. The agent parses JSON trivially, uses fields to filter relevant errors, correlates across services. A loose <code>printf</code> in free text forces heuristic parsing.</p>
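<p>The difference in one snippet, with illustrative field names:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"># One structured line vs a prose printf. Field names are illustrative.
import json
import logging
import sys

logging.basicConfig(stream=sys.stdout, format=&#39;%(message)s&#39;)
log = logging.getLogger(&#39;checkout&#39;)

# Prose: the agent has to regex its way to the order id.
log.error(&#39;payment failed for order 4812 after 3 retries (gateway timeout)&#39;)

# Structured: trivially parseable, filterable, cross-service correlatable.
log.error(json.dumps({
    &#39;event&#39;: &#39;payment_failed&#39;,
    &#39;order_id&#39;: 4812,
    &#39;retries&#39;: 3,
    &#39;reason&#39;: &#39;gateway_timeout&#39;,
}))
</code></pre></div></div></div>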
<p><strong>Accessible observability commands.</strong> <code>pnpm test</code>, <code>make lint</code>, <code>cargo check</code>, <code>python -m mypy</code> — the more the project exposes predictable commands the agent can invoke to validate changes, the better. If running tests requires 10 manual setup steps, the agent won&rsquo;t run tests, and the feedback loop breaks.</p>
<p><strong>Idempotent setup scripts.</strong> The agent has to be able to run <code>bin/setup</code> or <code>scripts/bootstrap.sh</code> on a clean machine and reach a state where it can work. If onboarding depends on instructions in someone&rsquo;s head, the agent is locked out.</p>
<h2>Instructing the agent to write clean code<span class="hx:absolute hx:-mt-20" id="instructing-the-agent-to-write-clean-code"></span>
    <a href="#instructing-the-agent-to-write-clean-code" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>There&rsquo;s an important detail that only becomes clear after about 500 hours of using agents: <strong>no LLM does any of this by default</strong>. You tell it &ldquo;implement feature X&rdquo; and it&rsquo;ll implement it the way the model considers average. No dependency injection. 80-line functions. No tests, or tests that mock the wrong thing. Duplicated logic because it&rsquo;s faster. 2000-line files because &ldquo;everything&rsquo;s in one place&rdquo;. You need to WRITE these rules. The agent reads, the agent follows.</p>
<p>Where to write: <code>CLAUDE.md</code>, <code>AGENTS.md</code>, <code>.cursor/rules</code>, <code>.github/copilot-instructions.md</code>, depending on the CLI. Format: short, imperative, action-oriented. The agent reads these files on every iteration (Claude Code re-reads CLAUDE.md on every query), so each line burns context tokens — density matters.</p>
<p>Below is a proposal for a template you can drop into a <code>CLAUDE.md</code> on a new project, consolidating what I discussed above in a format the agent consumes. <strong>This is not a definitive version</strong>, it&rsquo;s a starting point to test and tune to your language, your team, your flow. If a rule doesn&rsquo;t fit your context, remove it. If you need a new rule, add it. The point is to have a structured skeleton:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-markdown" data-lang="markdown"><span class="line"><span class="cl"><span class="gu">## Code style
</span></span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">-</span> Functions: 4-20 lines. Split if longer.
</span></span><span class="line"><span class="cl"><span class="k">-</span> Files: under 500 lines. Split by responsibility.
</span></span><span class="line"><span class="cl"><span class="k">-</span> One thing per function, one responsibility per module (SRP).
</span></span><span class="line"><span class="cl"><span class="k">-</span> Names: specific and unique. Avoid <span class="sb">`data`</span>, <span class="sb">`handler`</span>, <span class="sb">`Manager`</span>.
</span></span><span class="line"><span class="cl">  Prefer names that return &lt;5 grep hits in the codebase.
</span></span><span class="line"><span class="cl"><span class="k">-</span> Types: explicit. No <span class="sb">`any`</span>, no <span class="sb">`Dict`</span>, no untyped functions.
</span></span><span class="line"><span class="cl"><span class="k">-</span> No code duplication. Extract shared logic into a function/module.
</span></span><span class="line"><span class="cl"><span class="k">-</span> Early returns over nested ifs. Max 2 levels of indentation.
</span></span><span class="line"><span class="cl"><span class="k">-</span> Exception messages must include the offending value and expected shape.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="gu">## Comments
</span></span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">-</span> Keep your own comments. Don&#39;t strip them on refactor — they carry
</span></span><span class="line"><span class="cl">  intent and provenance.
</span></span><span class="line"><span class="cl"><span class="k">-</span> Write WHY, not WHAT. Skip <span class="sb">`// increment counter`</span> above <span class="sb">`i++`</span>.
</span></span><span class="line"><span class="cl"><span class="k">-</span> Docstrings on public functions: intent + one usage example.
</span></span><span class="line"><span class="cl"><span class="k">-</span> Reference issue numbers / commit SHAs when a line exists because
</span></span><span class="line"><span class="cl">  of a specific bug or upstream constraint.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="gu">## Tests
</span></span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">-</span> Tests run with a single command: <span class="sb">`&lt;project-specific&gt;`</span>.
</span></span><span class="line"><span class="cl"><span class="k">-</span> Every new function gets a test. Bug fixes get a regression test.
</span></span><span class="line"><span class="cl"><span class="k">-</span> Mock external I/O (API, DB, filesystem) with named fake classes,
</span></span><span class="line"><span class="cl">  not inline stubs.
</span></span><span class="line"><span class="cl"><span class="k">-</span> Tests must be F.I.R.S.T: fast, independent, repeatable,
</span></span><span class="line"><span class="cl">  self-validating, timely.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="gu">## Dependencies
</span></span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">-</span> Inject dependencies through constructor/parameter, not global/import.
</span></span><span class="line"><span class="cl"><span class="k">-</span> Wrap third-party libs behind a thin interface owned by this project.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="gu">## Structure
</span></span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">-</span> Follow the framework&#39;s convention (Rails, Django, Next.js, etc.).
</span></span><span class="line"><span class="cl"><span class="k">-</span> Prefer small focused modules over god files.
</span></span><span class="line"><span class="cl"><span class="k">-</span> Predictable paths: controller/model/view, src/lib/test, etc.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="gu">## Formatting
</span></span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">-</span> Use the language default formatter (<span class="sb">`cargo fmt`</span>, <span class="sb">`gofmt`</span>, <span class="sb">`prettier`</span>,
</span></span><span class="line"><span class="cl">  <span class="sb">`black`</span>, <span class="sb">`rubocop -A`</span>). Don&#39;t discuss style beyond that.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="gu">## Logging
</span></span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">-</span> Structured JSON when logging for debugging / observability.
</span></span><span class="line"><span class="cl">- Plain text only for user-facing CLI output.</span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>This block fits in under 100 lines and costs about 500 tokens per iteration. Sounds like a lot, but the savings in code quality and absence of rework easily make up for it, especially if you&rsquo;re on a pay-per-token API account. Use it as a base and evolve it with your own experience.</p>
<p>Some items (SRP, small functions, tests) the agent will try to do on its own. Others (DI, strict DRY, explicit types EVERYWHERE, aggressively unique names) it only does when you say so explicitly. And some (like &ldquo;don&rsquo;t strip the comments you wrote yourself on a refactor&rdquo;) are so counterintuitive to its default training that without the instruction it WILL prune them. Hence the importance of having a rules file that gets read every iteration.</p>
<p>An analogous pattern shows up around defensive programming: the agent implements circuit breaker, retry with backoff, aggressive timeout, graceful degradation, all of it nicely — <strong>when you ask</strong>. But on its own it won&rsquo;t propose them. The agent doesn&rsquo;t know what your system&rsquo;s operational failure points are, so it implements the happy path and waits for instruction. If your CLAUDE.md lists the categories of defensive code the project needs (rate limit, retry, breaker, fallback), the agent covers them. If it doesn&rsquo;t list them, the agent doesn&rsquo;t invent them. It&rsquo;s another case where explicit instruction to the agent is what separates robust code from naive code.</p>
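<p>For concreteness, this is the kind of wrapper the agent happily produces once &ldquo;retry&rdquo; is on the list. A minimal Python sketch; the names, limits and exception types here are all illustrative:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code class="language-python">import random
import time

def with_retry(fn, attempts=4, base_delay=0.5):
    # Retry with exponential backoff plus jitter. The agent writes this
    # readily, but only when the rules file asks for it.
    for attempt in range(attempts):
        try:
            return fn()
        except (TimeoutError, ConnectionError):
            if attempt == attempts - 1:
                raise  # out of retries: surface the real error
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))

# usage (fetch_quote is hypothetical):
# with_retry(lambda: fetch_quote("PETR4"))
</code></pre></div></div>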
<p>If you use the same agent across different projects, it&rsquo;s worth having a base template and adding project-specific rules on top. But start with something like the above and iterate from there.</p>
<h2>The short version<span class="hx:absolute hx:-mt-20" id="the-short-version"></span>
    <a href="#the-short-version" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Uncle Bob wrote Clean Code to be read by other humans. In 2026, the primary reader became the agent. The good news is that most of what the book preached still holds. The bad news is that some things that were opinions (&ldquo;a file should have N lines&rdquo;) became technical constraints (&ldquo;a file with X lines makes the agent perform worse&rdquo;). The difference is that now there&rsquo;s a metric: token cost, tool-call latency, output quality. Whoever writes clean code for the agent saves API-bill money, session time, and gets less hallucination in the output.</p>
<p>And there&rsquo;s a cultural bonus worth registering: all these practices (XP, TDD, SOLID, SRP, DI, small code, abundant tests) were falling out of fashion in the 2010s, replaced by &ldquo;move fast and break things&rdquo; and two-month bootcamps. Programmers who invested in fundamentals became a minority, and it became fashionable to trash Uncle Bob on the internet. Turns out those very fundamentals became the technical differentiator when working with agents. Whoever kept the discipline is well served. Whoever dismissed it is now struggling to teach the agent not to make mistakes the XP crowd had mapped out 25 years ago.</p>
<p>Clean code was never fashion. It became infrastructure.</p>
]]></content:encoded><category>clean-code</category><category>AI</category><category>claude-code</category><category>vibecoding</category><category>software-engineering</category></item><item><title>My Favorite Retro Racing Games Running on My Distrobox</title><link>https://akitaonrails.github.io/en/2026/04/19/my-favorite-retro-racing-games-on-distrobox/</link><guid isPermaLink="true">https://akitaonrails.github.io/en/2026/04/19/my-favorite-retro-racing-games-on-distrobox/</guid><pubDate>Sun, 19 Apr 2026 20:00:00 GMT</pubDate><description>&lt;p&gt;I spent the whole Sunday on this, and it was one of the most productive Sundays I&amp;rsquo;ve had in a while. The mission was specific: close the loop on the simcade racing games that are hardest to emulate, the ones I&amp;rsquo;ve wanted to run properly for years, and only now got to a reliable state.&lt;/p&gt;
&lt;p&gt;Two monsters sit at the top of that list. First, &lt;strong&gt;&lt;a href="#driveclub-the-impossible-finally-possible"&gt;Driveclub&lt;/a&gt;&lt;/strong&gt; on shadPS4. It&amp;rsquo;s a PS4 exclusive, never got a port, Evolution Studios was shut down, no remaster. To play it outside the original PS4, emulation is the only option, and PS4 emulation is still the most immature part of the ecosystem. This is &lt;strong&gt;by far the hardest on the list&lt;/strong&gt;.&lt;/p&gt;</description><content:encoded><![CDATA[<p>I spent the whole Sunday on this, and it was one of the most productive Sundays I&rsquo;ve had in a while. The mission was specific: close the loop on the simcade racing games that are hardest to emulate, the ones I&rsquo;ve wanted to run properly for years, and only now got to a reliable state.</p>
<p>Two monsters sit at the top of that list. First, <strong><a href="#driveclub-the-impossible-finally-possible">Driveclub</a></strong> on shadPS4. It&rsquo;s a PS4 exclusive, never got a port, Evolution Studios was shut down, no remaster. To play it outside the original PS4, emulation is the only option, and PS4 emulation is still the most immature part of the ecosystem. This is <strong>by far the hardest on the list</strong>.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/19/distrobox-gaming/shadps4-launcher.png" alt="shadPS4 Qt Launcher showing DRIVECLUB CUSA00003 ready to run"  loading="lazy" /></p>
<p>Second, <strong><a href="#forza-motorsport-4-the-goat">Forza Motorsport 4 with Project Forza Plus</a></strong> on Xenia Canary. FM4 is the GOAT of the Xbox 360 era, Project Forza Plus is the community mod that consolidates patches and content, and getting the two to run together on Xenia with decent graphics, no audio crashes and no shadow bugs took serious hours of trial and error.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/19/distrobox-gaming/xenia-manager.png" alt="Xenia Manager with the Xbox 360 Forza collection installed"  loading="lazy" /></p>
<p>Around those two I organized the rest of what I already knew worked at some level, just needed consolidating into a reproducible setup: <strong><a href="#gran-turismo-4-spec-ii-mod">Gran Turismo 4 with Retexture 3.0 and HD HUD</a></strong>, <strong><a href="#gran-turismo-3-a-spec">Gran Turismo 3 with HD textures and widescreen</a></strong>, <strong><a href="#gran-turismo-2-spec-ii-mod">Gran Turismo 2 with the Spec II mod</a></strong>, the PS3 Gran Turismos (<a href="#gran-turismo-5">GT5</a> and <a href="#gran-turismo-6">GT6</a>), Forza Horizon <a href="#forza-horizon-1-with-xe-mod">1</a> and <a href="#forza-horizon-2">2</a>, the <a href="#project-gotham-racing-3-and-4">PGRs</a>, <a href="#ridge-racer-v">Ridge Racer</a>, <a href="#enthusia-professional-racing">Enthusia</a>, <a href="#colin-mcrae-rally-1-and-2">Colin McRae</a>. I had an idea how to make these work, I just needed a setup that doesn&rsquo;t break the next time I reinstall the system.</p>
<p>Before the details, two things to clarify.</p>
<p>First, on infra: I&rsquo;m not going to repeat here how the setup was built. I wrote that in detail in <a href="/en/2026/04/11/emulation-distrobox-with-claude-code/">An Emulation Distrobox with Claude Code</a>. Arch Linux in a distrobox with <code>--nvidia</code>, 17 Ansible roles, every emulator from the AUR, ES-DE as frontend, automated per-game configs, Python scripts for PS3 update checking and Xbox 360 title updates, Xenia via Wine. Go read that one. The repo is at <a href="https://github.com/akitaonrails/distrobox-gaming"target="_blank" rel="noopener">akitaonrails/distrobox-gaming</a> if you want to reproduce. Here I&rsquo;m going to talk about the games.</p>
<p>Second, on taste: I like simcade racing. Gran Turismo has been my declared addiction since PS1, and I wrote about the rest in the <a href="/en/2026/04/01/my-sim-racing-cockpit-formula-fx1/">Formula FX1 cockpit article</a>. Direct-drive wheel, load-cell pedal, triple monitor when I&rsquo;m feeling masochistic. That&rsquo;s the kind of racing I like.</p>
<h2>The warning for the &ldquo;well, actually&rdquo; crowd<span class="hx:absolute hx:-mt-20" id="the-warning-for-the-well-actually-crowd"></span>
    <a href="#the-warning-for-the-well-actually-crowd" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Yes, I know iRacing and the rest of the current sims and simcades. I have 12 TB of ISOs and ROMs on the NAS, I probably have one you&rsquo;d remember to mention. It&rsquo;s not for lack of options that I&rsquo;m focusing on emulation today.</p>
<p>On the new side, the games I&rsquo;m genuinely anticipating: <strong><a href="https://forza.net/"target="_blank" rel="noopener">Forza Horizon 6</a> in Tokyo</strong>, <strong><a href="https://www.assettocorsa.it/"target="_blank" rel="noopener">Assetto Corsa EVO</a></strong>, and <strong><a href="https://www.assettocorsa.it/"target="_blank" rel="noopener">Assetto Corsa Rally</a></strong>. I&rsquo;ve been playing the demos of both Assetto Corsa titles and enjoying their single-player campaigns a lot. When those ship, they&rsquo;ll become my main sim rotation. This article covers a different side of the collection, pulling classics out of oblivion that still don&rsquo;t have a remaster.</p>
<p>Today, here, I&rsquo;m interested in these specific games I listed above. Period. You do what you want, I do what I want. If you want iRacing, fire up Steam and go in peace. This article is about messing with emulators, preserving old games, and pulling these gems out of oblivion on a Linux setup. If you&rsquo;re here for current-sim reviews, you won&rsquo;t find them.</p>
<p>Now to what matters.</p>
<p><strong>A note on the videos:</strong> most of them have no audio because I forgot to turn on sound capture in OBS before recording and I was too lazy to re-record. Audio works perfectly on every emulator here; this is my OBS mistake, not a setup issue.</p>
<h2>Global settings per emulator<span class="hx:absolute hx:-mt-20" id="global-settings-per-emulator"></span>
    <a href="#global-settings-per-emulator" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Before the games, a quick consolidation of the global configs that apply across each emulator. This is automated in the repo, but if you&rsquo;re setting up by hand, here&rsquo;s the summary.</p>
<h3>DuckStation (PS1)<span class="hx:absolute hx:-mt-20" id="duckstation-ps1"></span>
    <a href="#duckstation-ps1" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><ul>
<li><strong>Renderer:</strong> Vulkan</li>
<li><strong>Internal Resolution Scale:</strong> 8x (renders internally at eight times native resolution, downscaled to the monitor)</li>
<li><strong>Texture Filtering:</strong> JINC2 (preserves pixel art but smooths)</li>
<li><strong>PGXP:</strong> on (fixes the typical PS1 vertex wobble)</li>
<li><strong>Widescreen Hack:</strong> on as fallback, but I prefer per-game widescreen cheats</li>
<li><strong>Aspect Ratio:</strong> 16:9 with widescreen cheat, 4:3 without</li>
</ul>
<p>PS1 runs smoothly on DuckStation these days. It&rsquo;s the most mature emulator on the list. The attention goes to per-game choices (some games need a specific widescreen cheat).</p>
<h3>PCSX2 (PS2)<span class="hx:absolute hx:-mt-20" id="pcsx2-ps2"></span>
    <a href="#pcsx2-ps2" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><ul>
<li><strong>Renderer:</strong> Vulkan</li>
<li><strong>Upscale:</strong> 4x SSAA (1440p native → 4K internal)</li>
<li><strong>Anisotropic Filtering:</strong> 16x</li>
<li><strong>Post Processing:</strong> FXAA on, PCRTC antiblur on</li>
<li><strong>Deinterlacing:</strong> Automatic (some games need override)</li>
<li><strong>Controller:</strong> Xbox-style bindings (I use an 8BitDo)</li>
</ul>
<p>PCSX2 2.7.x is much better than the 2.6.3 still in Arch&rsquo;s official repo. The AUR&rsquo;s <code>pcsx2-latest-bin</code> tracks 2.7.x and that&rsquo;s where texture replacement, modern FXAA and PCRTC antiblur live. If you&rsquo;re on a different distro stuck on 2.6.3, get the official AppImage.</p>
<h3>RPCS3 (PS3)<span class="hx:absolute hx:-mt-20" id="rpcs3-ps3"></span>
    <a href="#rpcs3-ps3" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><ul>
<li><strong>Renderer:</strong> Vulkan</li>
<li><strong>Resolution Scale:</strong> 300% (generally) or native (for games with upscale issues)</li>
<li><strong>Shader Precision:</strong> Ultra</li>
<li><strong>Force High Precision Z:</strong> on</li>
<li><strong>SPU XFloat:</strong> Accurate</li>
<li><strong>Multithreaded RSX:</strong> on</li>
<li><strong>Write Color Buffers (WCB) / Read Color Buffers (RCB):</strong> off by default (GT-series safety)</li>
</ul>
<p>RPCS3 is where the most traps are. Each game has a recommended preset in the <a href="https://rpcs3.net/compatibility"target="_blank" rel="noopener">official compatibility DB</a>. And a heads-up: per-game configs live in <code>~/.config/rpcs3/custom_configs/config_&lt;SERIAL&gt;.yml</code>. The <code>config_</code> prefix is mandatory. It cost me time to figure out. RPCS3 silently accepts the YAML and then ignores it if the name is wrong.</p>
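<p>Since the failure is silent, it&rsquo;s worth sanity-checking the folder. A trivial Python sketch of the idea:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code class="language-python">from pathlib import Path

custom = Path.home() / ".config/rpcs3/custom_configs"

# RPCS3 only picks up files named config_&lt;SERIAL&gt;.yml; anything else
# in this folder is silently ignored, so flag suspicious names.
for f in custom.glob("*.yml"):
    if not f.name.startswith("config_"):
        print(f"ignored by RPCS3 (missing config_ prefix): {f.name}")
</code></pre></div></div>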
<h3>Xenia Canary (Xbox 360)<span class="hx:absolute hx:-mt-20" id="xenia-canary-xbox-360"></span>
    <a href="#xenia-canary-xbox-360" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><ul>
<li><strong>Build:</strong> Canary (<code>master</code> has been stagnant for months, Canary is where development lives)</li>
<li><strong>Renderer:</strong> Vulkan</li>
<li><strong>Render Target Path:</strong> <code>rtv</code> by default, <code>fsi</code> for specific games (PGR4)</li>
<li><strong>User Profile:</strong> created once, persists</li>
<li><strong>Wine Prefix:</strong> dedicated, managed by Xenia Manager</li>
</ul>
<p>Xenia on Linux runs via Wine, managed by <a href="https://github.com/xenia-manager/xenia-manager"target="_blank" rel="noopener">Xenia Manager</a>. There&rsquo;s no decent native build. It works well, but each game needs specific tuning, and Title Updates have to be pulled manually from archive.org (I wrote a script for that, details in the previous article).</p>
<h3>shadPS4 (PS4)<span class="hx:absolute hx:-mt-20" id="shadps4-ps4"></span>
    <a href="#shadps4-ps4" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Gets its own section at the end. TL;DR: works for some simple games, Driveclub is still a moving target. Still a shot in the dark.</p>
<h2>PS1 on DuckStation<span class="hx:absolute hx:-mt-20" id="ps1-on-duckstation"></span>
    <a href="#ps1-on-duckstation" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><h3>Gran Turismo (the first, from 1997)<span class="hx:absolute hx:-mt-20" id="gran-turismo-the-first-from-1997"></span>
    <a href="#gran-turismo-the-first-from-1997" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><video
  controls
  preload="metadata"
  playsinline
  class="html5-video"
  style="width:100%; max-width:960px; display:block; margin:1.25em auto; border-radius:8px; background:#000;"
  >
  <source src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/videos/gran%20turismo%201.mp4" type="video/mp4">
  Your browser does not support HTML5 video.
</video>

<p>This one is pure sentiment. GT1 was one of the first games I bought for the PS1 and I played it until the CD cracked. At the time, it was on another level. Real simulation on a machine with 1 MB of video RAM, decent physics, 140 cars when everybody else had 15, memorable soundtrack.</p>
<p>But let&rsquo;s be honest: you can&rsquo;t go back today. GT1&rsquo;s physics has exaggerated drift: the car slides more than it should in any corner above 80 km/h. It&rsquo;s part of the 1997 charm but gets annoying in 2026. The <strong>handling model was refined starting in GT2</strong>, and everything after only got better. Between GT1 and GT2, GT2 wins on content (650+ cars vs 140) and especially on drivability.</p>
<p>I keep GT1 in ES-DE to open in nostalgic moments, drive 10 minutes at Trial Mountain, and close it. For serious PS1-era gameplay, it&rsquo;s GT2.</p>
<h3>Gran Turismo 2 (Spec II Mod)<span class="hx:absolute hx:-mt-20" id="gran-turismo-2-spec-ii-mod"></span>
    <a href="#gran-turismo-2-spec-ii-mod" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><video
  controls
  preload="metadata"
  playsinline
  class="html5-video"
  style="width:100%; max-width:960px; display:block; margin:1.25em auto; border-radius:8px; background:#000;"
  >
  <source src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/videos/gran%20turismo%202.mp4" type="video/mp4">
  Your browser does not support HTML5 video.
</video>

<p>As I said above: GT1 is a historical milestone and I played a lot of it back then, but to replay today? No. GT2 exists, and GT2 is superior in everything. More cars (650+), more tracks, more modes, and most importantly, <strong>a much better handling model</strong>, tightened up from GT1&rsquo;s exaggerated drift to something closer to what the series would become starting on PS2. Between playing GT1 or GT2, always GT2.</p>
<p>And the GT2 I run today is the <strong>GT2 Spec II Mod</strong> by <a href="https://x.com/projectaspec?lang=en"target="_blank" rel="noopener">Project A-Spec</a>, a community project that merges the two regional variants (Arcade + Simulation) into a single game, brings back events that were cut from the final release, fixes physics bugs, adds native widescreen support, and ships quality-of-life menu updates. It&rsquo;s the definitive way to play GT2 in 2026. The project doesn&rsquo;t maintain its own website anymore, updates come through the author&rsquo;s Twitter.</p>
<p>To run it:</p>
<table>
  <thead>
      <tr>
          <th>Setting</th>
          <th>Value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Disc</td>
          <td>GT2 Spec II Mod (patch applied over Simulation USA ISO)</td>
      </tr>
      <tr>
          <td>Serial</td>
          <td>SCUS-94455</td>
      </tr>
      <tr>
          <td>Widescreen cheat</td>
          <td>On</td>
      </tr>
      <tr>
          <td>8 MB RAM cheat</td>
          <td>On (fixes audio on crowded tracks)</td>
      </tr>
      <tr>
          <td>Texture Filter</td>
          <td>JINC2</td>
      </tr>
      <tr>
          <td>Resolution Scale</td>
          <td>8x</td>
      </tr>
  </tbody>
</table>
<p>The 8 MB RAM cheat matters for crowded tracks (Seattle in populated races for example) where audio starts clipping. It simulates the memory expansion of the Japanese version that never made it West.</p>
<h3>Colin McRae Rally 1 and 2<span class="hx:absolute hx:-mt-20" id="colin-mcrae-rally-1-and-2"></span>
    <a href="#colin-mcrae-rally-1-and-2" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Iconic PS1 games. CMR1 has that rough first-gen rally feel that has its charm. CMR2 is technically superior, better physics, more rallies, better graphics. But honestly: <strong>to play CMR2 today, it&rsquo;s worth looking for the original PC version</strong>. Higher resolution, native 60fps (the PS1 version is locked to 30), direct keyboard/wheel support. Codemasters repacks exist on the internet, a 40-second search will find them. I&rsquo;m not going to link them, you know where to look.</p>
<p>On PS1 via DuckStation they run smoothly, nostalgia guaranteed, but it&rsquo;s not where I spend my time.</p>
<h2>PS2 on PCSX2<span class="hx:absolute hx:-mt-20" id="ps2-on-pcsx2"></span>
    <a href="#ps2-on-pcsx2" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><h3>Gran Turismo 3 A-Spec<span class="hx:absolute hx:-mt-20" id="gran-turismo-3-a-spec"></span>
    <a href="#gran-turismo-3-a-spec" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><video
  controls
  preload="metadata"
  playsinline
  class="html5-video"
  style="width:100%; max-width:960px; display:block; margin:1.25em auto; border-radius:8px; background:#000;"
  >
  <source src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/videos/gran%20turismo%203.mp4" type="video/mp4">
  Your browser does not support HTML5 video.
</video>

<p>GT3 was a generational leap. Out of the blocky PS1 world into shaders, decent environment mapping, serious polygon car models, physics that finally started to feel like actual physics. The soundtrack with Feeder&rsquo;s &ldquo;Just a Day&rdquo; opening the game still gives me goosebumps.</p>
<p>The problem is that GT3 has LESS content than GT2. Polyphony was more cautious, cut modes to focus on polish. The A/B/S license tree is smaller, fewer cars, fewer tracks. But what&rsquo;s there is polished.</p>
<p>In 2026 on PCSX2 2.7.x, GT3 looks great. Community HD textures, widescreen pnach, anti-aliasing via SSAA, 16x AF. Feels like a re-release.</p>
<table>
  <thead>
      <tr>
          <th>Setting</th>
          <th>Value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Disc</td>
          <td>GT3 A-spec (SCUS-97102) or Bundle (PBPX-95503)</td>
      </tr>
      <tr>
          <td>Widescreen pnach</td>
          <td>On</td>
      </tr>
      <tr>
          <td>Retexture pack</td>
          <td>Optional (off by default, enable if you want full HD)</td>
      </tr>
      <tr>
          <td>Upscale</td>
          <td>4x SSAA</td>
      </tr>
      <tr>
          <td>Deinterlacing</td>
          <td>Automatic</td>
      </tr>
  </tbody>
</table>
<p>Caveat on the retexture pack: when enabled, some car-showcase cutscenes in the garage have momentary flicker due to NFS streaming. Doesn&rsquo;t bother in-race. If it bothers you in menus, turn it off.</p>
<p><strong>Reminder:</strong> for the widescreen pnach to take effect, you still need to go into the <strong>in-game</strong> options and switch the aspect ratio to 16:9. The pnach unlocks the option in the menu, it doesn&rsquo;t apply automatically.</p>
<h3>Gran Turismo 4 (Spec II Mod)<span class="hx:absolute hx:-mt-20" id="gran-turismo-4-spec-ii-mod"></span>
    <a href="#gran-turismo-4-spec-ii-mod" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><video
  controls
  preload="metadata"
  playsinline
  class="html5-video"
  style="width:100%; max-width:960px; display:block; margin:1.25em auto; border-radius:8px; background:#000;"
  >
  <source src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/videos/gran%20turismo%204.mp4" type="video/mp4">
  Your browser does not support HTML5 video.
</video>

<p>GT4 is my GOAT in the series. 700+ cars, 50+ tracks, physics still respectable by 2026 standards, and that soundtrack mixing alternative rock, Japanese lounge, and Moby&rsquo;s infamous &ldquo;Sail Away&rdquo; on the main menu. It&rsquo;s the game I poured the most hours into back in 2004, and it&rsquo;s what I play the most today.</p>
<p>Like GT2, the GT4 I run today <strong>is not the original disc</strong>. It&rsquo;s the <a href="https://www.theadmiester.co.uk/specii/"target="_blank" rel="noopener"><strong>Gran Turismo 4: Spec II</strong></a> maintained by TheAdmiester — a massive community mod that&rsquo;s objectively the best way to play GT4 in 2026. It combines the final USA disc, the 2003 Beta prototype, and the original Japanese release into one game, restoring cut cars (including several that never left Japan), cancelled events, dozens of missing soundtrack tracks, adding 16:9 widescreen support, over 30 built-in cheats exposed as menu options (pro tuning, top speed, fuel economy), the option to switch between English and Japanese, and original bug fixes. The project is active, still gets patches, and ships its own installer that applies over a clean USA ISO.</p>
<p>In practice, Spec II is what GT4 should have been if Polyphony had six more months to polish. Playing it today feels like GT4 with all the content that got left out, plus modern quality-of-life. For anyone who enjoyed the original, it&rsquo;s mandatory.</p>
<p>On top of that goes the community&rsquo;s <strong>Retexture 3.0</strong> (HD texture pack) and the <strong>HD HUD</strong> by <a href="https://github.com/Silentwarior112/GT4-HD-HUD-Pack"target="_blank" rel="noopener">Silentwarior112</a>. With those three pieces together (Spec II Mod + Retexture + HD HUD), GT4 runs at 1440p SSAA on PCSX2 2.7.x with 16x AF, FXAA, and a look that&rsquo;s honestly indistinguishable from a new low-poly-style release. It&rsquo;s my current favorite to sit down and play long.</p>
<table>
  <thead>
      <tr>
          <th>Setting</th>
          <th>Value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Disc</td>
          <td>GT4 USA (SCUS-97328) or Spec II (SCUS-97436, CRC <code>4CE521F2</code>)</td>
      </tr>
      <tr>
          <td>HD HUD pack</td>
          <td>Installed via symlink (Silentwarior112)</td>
      </tr>
      <tr>
          <td>Retexture 3.0</td>
          <td>Installed</td>
      </tr>
      <tr>
          <td>Widescreen pnach</td>
          <td>On (renamed to Spec II CRC if using Spec II)</td>
      </tr>
      <tr>
          <td>Silent&rsquo;s trigger/camera patches</td>
          <td>On</td>
      </tr>
      <tr>
          <td>Deinterlace (Spec II)</td>
          <td>Mode 8 Adaptive TFF</td>
      </tr>
      <tr>
          <td>ShadeBoost (Spec II)</td>
          <td>Saturation +10, Brightness +3, Contrast +2</td>
      </tr>
      <tr>
          <td>FXAA + PCRTC antiblur</td>
          <td>Global</td>
      </tr>
      <tr>
          <td>Upscale</td>
          <td>4x SSAA</td>
      </tr>
  </tbody>
</table>
<p>Spec II users need to pay attention to the CRC: the vanilla GT4 HD packs were built for <code>77E61C8A</code> (USA), so on Spec II with CRC <code>4CE521F2</code> I symlink the assets into the CRC-suffixed folder that PCSX2 looks for. The widescreen pnach was also renamed to match the Spec II CRC. Without that, the game runs but the mods don&rsquo;t load.</p>
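<p>The symlink step itself, sketched in Python. The folder names follow the description above, but treat the exact layout as illustrative and adjust the textures root to wherever your PCSX2 keeps replacement assets:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code class="language-python">from pathlib import Path

# The HD packs ship assets keyed to the vanilla USA CRC; Spec II reports
# a different CRC, so point the new folder name at the existing assets.
# Folder layout is illustrative; match your own PCSX2 install.
textures = Path.home() / ".config/PCSX2/textures"
src = textures / "SCUS-97328_77E61C8A"  # what the packs were built for
dst = textures / "SCUS-97436_4CE521F2"  # what PCSX2 looks for with Spec II

if not dst.exists():
    dst.symlink_to(src, target_is_directory=True)
</code></pre></div></div>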
<p><strong>In-game reminders:</strong> same as GT3, the widescreen pnach only unlocks the option, you still need to enter Options in-game and switch to 16:9. And important: <strong>change the video mode to progressive 480p</strong> (also in Options). The default is interlaced 480i, which looks terrible on a modern screen. Spec II Mod has native progressive support, just flip it on.</p>
<p>If I could recommend a single GT in the series for someone starting today, it would be this one. If you like racing and never played GT4 in full, you&rsquo;re missing out.</p>
<h3>Enthusia Professional Racing<span class="hx:absolute hx:-mt-20" id="enthusia-professional-racing"></span>
    <a href="#enthusia-professional-racing" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><video
  controls
  preload="metadata"
  playsinline
  class="html5-video"
  style="width:100%; max-width:960px; display:block; margin:1.25em auto; border-radius:8px; background:#000;"
  >
  <source src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/videos/enthusia.mp4" type="video/mp4">
  Your browser does not support HTML5 video.
</video>

<p>Enthusia is the forgotten Konami game from 2005. Launched in GT4&rsquo;s shadow and died on the shelves. It&rsquo;s a shame, because <strong>Enthusia&rsquo;s physics was more realistic than GT4&rsquo;s</strong>. Car weight, mass transfer, limit behavior, all closer to real-world behavior. The sim-racing community of the 2000s knew this and the game has a cult following today.</p>
<p>Enthusia&rsquo;s problem was the Enthusia Points system: you earned points for driving clean, lost them for crashing, and the game&rsquo;s progression depended on those points. A lot of people found it punitive. I found it perfect. It forced you to drive with your head.</p>
<p>On PCSX2 with retexture and widescreen, the game looks great. Strong recommendation for anyone who wants something different from Gran Turismo.</p>
<table>
  <thead>
      <tr>
          <th>Setting</th>
          <th>Value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Disc</td>
          <td>Enthusia Professional Racing (SLUS-20967)</td>
      </tr>
      <tr>
          <td>Widescreen pnach</td>
          <td>On</td>
      </tr>
      <tr>
          <td>Retexture</td>
          <td>On</td>
      </tr>
      <tr>
          <td>Upscale</td>
          <td>4x SSAA</td>
      </tr>
  </tbody>
</table>
<h3>Ridge Racer V<span class="hx:absolute hx:-mt-20" id="ridge-racer-v"></span>
    <a href="#ridge-racer-v" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><video
  controls
  preload="metadata"
  playsinline
  class="html5-video"
  style="width:100%; max-width:960px; display:block; margin:1.25em auto; border-radius:8px; background:#000;"
  >
  <source src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/videos/ridge%20racer%20v.mp4" type="video/mp4">
  Your browser does not support HTML5 video.
</video>

<p>Ridge Racer is its own thing. Not simcade, pure arcade. Drift everything, cartoony grip, electronic soundtrack, saturated colors. Ridge Racer V was the Japanese PS2 launch title and it carries that &ldquo;show what the console can do&rdquo; vibe, with massive scenery and a sense of speed GT never attempted.</p>
<p>On PCSX2 it runs well, with a catch: car textures sometimes flicker on the hardware renderer (PCSX2 issues <a href="https://github.com/PCSX2/pcsx2/issues/3639"target="_blank" rel="noopener">#3639</a> and <a href="https://github.com/PCSX2/pcsx2/issues/13729"target="_blank" rel="noopener">#13729</a>). Software renderer fixes it but kills performance. I stay on hardware and accept the occasional flicker.</p>
<table>
  <thead>
      <tr>
          <th>Setting</th>
          <th>Value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Disc</td>
          <td>Ridge Racer V (SLUS-20002)</td>
      </tr>
      <tr>
          <td>Widescreen pnach</td>
          <td>On</td>
      </tr>
      <tr>
          <td>No-interlace pnach</td>
          <td>On</td>
      </tr>
      <tr>
          <td>Upscale</td>
          <td>4x SSAA</td>
      </tr>
  </tbody>
</table>
<h3>Colin McRae Rally 3, 4, 5 (PS2)<span class="hx:absolute hx:-mt-20" id="colin-mcrae-rally-3-4-5-ps2"></span>
    <a href="#colin-mcrae-rally-3-4-5-ps2" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>PS2 Colin McRae versions run well on PCSX2 with default configs. CMR3 and CMR04 are good. CMR05 (the last before the rebrand to Dirt) is the best of the three technically. But again: <strong>if you&rsquo;re on PC, look for the native version</strong>. The PS2 ports were relatively inferior to PC of the era, ran at lower resolution, and a modern PC with a repack will hit 144Hz easily.</p>
<p>I keep them on PCSX2 just for collection completeness.</p>
<h2>PS3 on RPCS3<span class="hx:absolute hx:-mt-20" id="ps3-on-rpcs3"></span>
    <a href="#ps3-on-rpcs3" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><h3>Gran Turismo 5<span class="hx:absolute hx:-mt-20" id="gran-turismo-5"></span>
    <a href="#gran-turismo-5" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><video
  controls
  preload="metadata"
  playsinline
  class="html5-video"
  style="width:100%; max-width:960px; display:block; margin:1.25em auto; border-radius:8px; background:#000;"
  >
  <source src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/videos/gran%20turismo%205.mp4" type="video/mp4">
  Your browser does not support HTML5 video.
</video>

<p>GT5 is controversial. Polyphony promised the world and delivered a somewhat fragmented game: gorgeous Premium models next to Standards that were basically HD GT4 cars, a level system some loved and others hated, an online mode that died when the PSN service was cut, and a Top Gear mode that was just weird. Critical reception at the time was mixed, and I remember entire communities fighting over whether the game was a disappointment or a masterpiece.</p>
<p>For me, it&rsquo;s a masterpiece. GT5 has the Course Maker (design your own track from real sections), the real 24h Nurburgring with dynamic weather, the Red Bull X-Challenge with Vettel, and a single-player endurance experience that is ridiculously big. Anyone who went deep into the game knows there&rsquo;s content for a whole year.</p>
<p>On RPCS3 in 2026, GT5 runs <strong>really well</strong>. Stable, without the serious bugs from years past. It&rsquo;s the most stable GT I have emulated today.</p>
<table>
  <thead>
      <tr>
          <th>Setting</th>
          <th>Value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Serial</td>
          <td>BCUS98114 (US) or BCES00569 (EU)</td>
      </tr>
      <tr>
          <td>Resolution Scale</td>
          <td>300%</td>
      </tr>
      <tr>
          <td>Shader Precision</td>
          <td>Ultra</td>
      </tr>
      <tr>
          <td>Force High Precision Z</td>
          <td>On</td>
      </tr>
      <tr>
          <td>SPU XFloat</td>
          <td>Accurate</td>
      </tr>
      <tr>
          <td>Multithreaded RSX</td>
          <td>On</td>
      </tr>
      <tr>
          <td>WCB / RCB</td>
          <td>Off</td>
      </tr>
  </tbody>
</table>
<p>The Shader Precision Ultra + Force High Precision Z combo kills the typical RSX dithering that made GT5 look grainy on RPCS3 in years past. Without those two, asphalt looks permanently noisy. With them, the game looks clean.</p>
<h3>Gran Turismo 6<span class="hx:absolute hx:-mt-20" id="gran-turismo-6"></span>
    <a href="#gran-turismo-6" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><video
  controls
  preload="metadata"
  playsinline
  class="html5-video"
  style="width:100%; max-width:960px; display:block; margin:1.25em auto; border-radius:8px; background:#000;"
  >
  <source src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/videos/gran%20turismo%206.mp4" type="video/mp4">
  Your browser does not support HTML5 video.
</video>

<p>GT6 is one of my favorites in the series. It came out in 2013 for an end-of-life PS3 while the competition was already looking at PS4. Sold poorly. A lot of people didn&rsquo;t play it. That&rsquo;s a shame, because GT6 took the GT5 base, cut the fat (Standards were reduced), added more Premiums, the lunar exploration missions (yes, a drive-on-the-moon mode), dynamic weather on more tracks, and refined physics. It&rsquo;s basically GT5 polished.</p>
<p>On content, GT6 is comparable to GT4. A lot of old-guard players put GT4 above it because of the PS2-era charm. I sit between the two: GT6 on a normal day, GT4 on a nostalgic day.</p>
<p>On RPCS3, GT6 has a serious catch: <strong>patches 1.06 and beyond regress visually</strong>. Black surfaces on cars in the garage menu, flicker in cockpit view, full-screen flashing on certain tracks. The version that works well in 2026 is <strong>pinned at v1.05</strong>. The repo&rsquo;s <a href="https://github.com/akitaonrails/distrobox-gaming/blob/master/scripts/extract_ps3_dlc.py"target="_blank" rel="noopener"><code>extract_ps3_dlc.py</code></a> script has a per-title version ceiling specifically to pin GT6 at 1.05 even when PSN serves 1.22 as &ldquo;most recent&rdquo;.</p>
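<p>The ceiling logic itself is tiny. The gist of it as a Python sketch, with the data inlined (the real script in the repo does more bookkeeping):</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code class="language-python"># Per-title version ceiling: never install an update above it, even
# when PSN advertises a newer one. Gist of the repo script, data inlined.
VERSION_CEILING = {"BCES01893": "01.05"}  # GT6: 1.06+ regresses visually

def allowed(title_id, version):
    ceiling = VERSION_CEILING.get(title_id)
    # Zero-padded version strings compare correctly as plain strings.
    return ceiling is None or version &lt;= ceiling

updates = ["01.02", "01.05", "01.06", "01.22"]
print([v for v in updates if allowed("BCES01893", v)])  # ['01.02', '01.05']
</code></pre></div></div>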
<p>Also, Force CPU Blit is mandatory. Without it, full-screen flicker in the menu. The trade-off is the rear-view mirror stays permanently black. I prefer losing the mirror to having the flicker. Anyone who wants the mirror can enable Write Color Buffers, but then resolution drops to 720p native.</p>
<table>
  <thead>
      <tr>
          <th>Setting</th>
          <th>Value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Serial</td>
          <td>BCES01893 + regional variants</td>
      </tr>
      <tr>
          <td>Version pinned</td>
          <td>v1.05 (patches 1.06+ cause black surfaces)</td>
      </tr>
      <tr>
          <td>Resolution Scale</td>
          <td>300% (menus) or 200% (heavy gameplay)</td>
      </tr>
      <tr>
          <td>Force CPU Blit</td>
          <td>On (mandatory, kills flicker)</td>
      </tr>
      <tr>
          <td>WCB</td>
          <td>Off (trade-off: black rear mirror)</td>
      </tr>
      <tr>
          <td>Shader Precision</td>
          <td>Ultra</td>
      </tr>
  </tbody>
</table>
<p>In 2026, with these settings, GT6 is in my rotation. Not perfect, but playable enough that I can spend an afternoon doing endurance.</p>
<h3>Ridge Racer 7<span class="hx:absolute hx:-mt-20" id="ridge-racer-7"></span>
    <a href="#ridge-racer-7" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Ridge Racer 7 was a PS3 launch title in 2006 and the last mainline in the franchise. Arcade like Ridge Racer V, drift-first, electronic soundtrack, cinematic camera. For a PS3 of that era it was a pretty tech demo.</p>
<p>On RPCS3 it runs, but with caveats: requires Write Color Buffers on to avoid lighting issues, and the recommended preset in the compat DB is different from my GT preset. I use a dedicated per-game config. Not my favorite, but it&rsquo;s there for completeness.</p>
<p>No video for this one, I&rsquo;ll record later.</p>
<h2>Xbox 360 on Xenia Canary<span class="hx:absolute hx:-mt-20" id="xbox-360-on-xenia-canary"></span>
    <a href="#xbox-360-on-xenia-canary" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>This is the part that caused me the most headaches this Sunday. Xbox 360 on Linux is hostile territory. Xenia runs via Wine, each game needs specific tweaks, title updates have to come from archive.org, and Forza Motorsport 4 specifically had serious problems until recently.</p>
<h3>Forza Motorsport 4 (the GOAT)<span class="hx:absolute hx:-mt-20" id="forza-motorsport-4-the-goat"></span>
    <a href="#forza-motorsport-4-the-goat" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><video
  controls
  preload="metadata"
  playsinline
  class="html5-video"
  style="width:100%; max-width:960px; display:block; margin:1.25em auto; border-radius:8px; background:#000;"
  >
  <source src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/videos/forza%20motorsport%204.mp4" type="video/mp4">
  Your browser does not support HTML5 video.
</video>

<p>FM4 is, for a lot of people, the best Forza Motorsport ever released. Polyphony made Gran Turismo, Turn 10 made Forza, and FM4 is the peak of the 360 era. Beautiful car models for their time, well-designed tracks, solid physics without being a hard sim, Autovista mode with Jeremy Clarkson narration, epic soundtrack, and that feeling of the Xbox 360 being pushed to its limit.</p>
<p>FM4 on Xenia <strong>was a nightmare until recently</strong>. Severe shadow artifacts, constant texture pop-in on the car, grainy pixelated track textures, sky brightness completely wrong (it came out white-blown-out instead of the original&rsquo;s deep blue), and worst of all: frequent audio crashes where XMA (the Xbox 360&rsquo;s proprietary codec) would freeze and the audio would turn into a high-pitched noise until the game hung. I spent hours this Sunday messing with configs, testing builds, searching the xenia-canary issue tracker, testing different title updates.</p>
<p>What unlocked it:</p>
<ul>
<li><strong>Latest Xenia Canary build</strong> (<code>master</code> is stagnant, Canary tracks active development)</li>
<li><strong>Render Target Path:</strong> <code>rtv</code> (the default works fine for FM4)</li>
<li><strong>Title Update 1.0.17.0</strong> installed via Xenia Manager (version matters, older ones regress shadows)</li>
<li><strong>Vulkan GPU with NVIDIA ICD forced</strong> (on my hybrid setup with AMD iGPU, I had to hardforce the 5090; see the sketch after this list)</li>
</ul>
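<p>The hardforce is just an environment variable on the process that launches Xenia. In Python it looks like this; the ICD path is the usual NVIDIA location on Arch (check yours under <code>/usr/share/vulkan/icd.d</code>) and the Xenia path is illustrative:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code class="language-python">import os
import subprocess

env = dict(os.environ)
# Force the NVIDIA Vulkan driver on a hybrid AMD iGPU + NVIDIA setup.
env["VK_ICD_FILENAMES"] = "/usr/share/vulkan/icd.d/nvidia_icd.json"

# Xenia runs through Wine; the binary path here is illustrative.
subprocess.run(["wine", os.path.expanduser("~/xenia/xenia_canary.exe")], env=env)
</code></pre></div></div>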
<p>After that, FM4 ran. Not perfectly. Still gets the occasional audio crash (<a href="https://github.com/xenia-canary/xenia-canary/issues/161"target="_blank" rel="noopener">xenia-canary issue #161</a> open). But it&rsquo;s playable, and I spent two hours of the Sunday at Top Gear Test Track just admiring the Koenigsegg CCX. Worth every hour of debugging.</p>
<table>
  <thead>
      <tr>
          <th>Setting</th>
          <th>Value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Build</td>
          <td>Latest Xenia Canary</td>
      </tr>
      <tr>
          <td>Render Target Path</td>
          <td><code>rtv</code> (default)</td>
      </tr>
      <tr>
          <td>Title Update</td>
          <td>1.0.17.0</td>
      </tr>
      <tr>
          <td>Wine Prefix</td>
          <td>dedicated via Xenia Manager</td>
      </tr>
      <tr>
          <td>GPU</td>
          <td>NVIDIA via <code>VK_ICD_FILENAMES</code> hardforce</td>
      </tr>
  </tbody>
</table>
<h3>Forza Motorsport 3<span class="hx:absolute hx:-mt-20" id="forza-motorsport-3"></span>
    <a href="#forza-motorsport-3" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><video
  controls
  preload="metadata"
  playsinline
  class="html5-video"
  style="width:100%; max-width:960px; display:block; margin:1.25em auto; border-radius:8px; background:#000;"
  >
  <source src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/videos/forza%20motorsport%203.mp4" type="video/mp4">
  Your browser does not support HTML5 video.
</video>

<p>FM3 is one of the solid entries in the series, sitting between FM2&rsquo;s rougher charm and FM4&rsquo;s technical peak. Getting it to emulate was surprisingly annoying, because I was using the wrong ISO.</p>
<p>There are two releases of FM3: the <strong>retail</strong> (original disc sold in stores) and the <strong>Ultimate Edition</strong> (all DLCs packaged on a second disc). I was trying to run the Ultimate, which combines both discs into a hybrid ISO that Xenia can&rsquo;t parse correctly. I swapped to the standard retail ISO with the DLCs installed separately through Xenia Manager, and bang, the game ran first try.</p>
<p>Moral: if your FM3 isn&rsquo;t running on Xenia, check that it&rsquo;s retail. Ultimate breaks.</p>
<table>
  <thead>
      <tr>
          <th>Setting</th>
          <th>Value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Version</td>
          <td>Retail (NOT Ultimate Edition)</td>
      </tr>
      <tr>
          <td>Build</td>
          <td>Latest Xenia Canary</td>
      </tr>
      <tr>
          <td>Render Target Path</td>
          <td><code>rtv</code></td>
      </tr>
      <tr>
          <td>DLCs</td>
          <td>Installed via Xenia Manager separately</td>
      </tr>
  </tbody>
</table>
<h3>Forza Motorsport 2<span class="hx:absolute hx:-mt-20" id="forza-motorsport-2"></span>
    <a href="#forza-motorsport-2" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><video
  controls
  preload="metadata"
  playsinline
  class="html5-video"
  style="width:100%; max-width:960px; display:block; margin:1.25em auto; border-radius:8px; background:#000;"
  >
  <source src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/videos/forza%20motorsport%202.mp4" type="video/mp4">
  Your browser does not support HTML5 video.
</video>

<p>FM2 is from 2007, first-gen 360. Less refined systems, more basic modeling, but that original Forza charm some prefer to FM4&rsquo;s sophistication. On Xenia it ran practically out of the box. No special tweak.</p>
<table>
  <thead>
      <tr>
          <th>Setting</th>
          <th>Value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Build</td>
          <td>Xenia Canary</td>
      </tr>
      <tr>
          <td>Render Target Path</td>
          <td>default</td>
      </tr>
  </tbody>
</table>
<h3>Forza Horizon 2<span class="hx:absolute hx:-mt-20" id="forza-horizon-2"></span>
    <a href="#forza-horizon-2" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><video
  controls
  preload="metadata"
  playsinline
  class="html5-video"
  style="width:100%; max-width:960px; display:block; margin:1.25em auto; border-radius:8px; background:#000;"
  >
  <source src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/videos/forza%20horizon%202.mp4" type="video/mp4">
  Your browser does not support HTML5 video.
</video>

<p>FH2 was the most polished Forza Horizon of the 360 generation (there&rsquo;s also the Xbox One version, which is technically better, but the 360 has its charm because of the smaller map and more focused progression). Open world in southern Europe, iconic soundtrack, Horizon festival at its peak.</p>
<p>On Xenia it ran without asking for anything extra. Same as FM2.</p>
<table>
  <thead>
      <tr>
          <th>Setting</th>
          <th>Value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Build</td>
          <td>Xenia Canary</td>
      </tr>
      <tr>
          <td>Render Target Path</td>
          <td>default</td>
      </tr>
  </tbody>
</table>
<h3>Forza Horizon 1 (with XE Mod)<span class="hx:absolute hx:-mt-20" id="forza-horizon-1-with-xe-mod"></span>
    <a href="#forza-horizon-1-with-xe-mod" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><video
  controls
  preload="metadata"
  playsinline
  class="html5-video"
  style="width:100%; max-width:960px; display:block; margin:1.25em auto; border-radius:8px; background:#000;"
  >
  <source src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/videos/forza%20horizon%201%20xe%20mod.mp4" type="video/mp4">
  Your browser does not support HTML5 video.
</video>

<p>Original FH1 runs OK on Xenia, but there&rsquo;s a community maintaining the <strong>Forza Horizon 1 XE Mod</strong>, which improves traffic AI, rebalances progression, adds cars that were cut, fixes known bugs, and brings quality-of-life tweaks to the UI. I tested the XE Mod this Sunday and it&rsquo;s worth it. The game feels more alive and progression feels fairer.</p>
<p>Install details for the XE Mod are on the project site. Basically it&rsquo;s a patch over the original ISO + a few custom title updates.</p>
<p>For comparison, here&rsquo;s FH1 <strong>without</strong> the mod, running vanilla:</p>
<video
  controls
  preload="metadata"
  playsinline
  class="html5-video"
  style="width:100%; max-width:960px; display:block; margin:1.25em auto; border-radius:8px; background:#000;"
  >
  <source src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/videos/forza%20horizon%201.mp4" type="video/mp4">
  Your browser does not support HTML5 video.
</video>

<p>The most visible difference is traffic and the variety of situations that come up, but the XE Mod has a lot of stuff you only notice after a few hours.</p>
<table>
  <thead>
      <tr>
          <th>Setting</th>
          <th>Value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Build</td>
          <td>Xenia Canary</td>
      </tr>
      <tr>
          <td>XE Mod</td>
          <td>Applied over original ISO</td>
      </tr>
      <tr>
          <td>Title Updates</td>
          <td>Version pinned by XE Mod</td>
      </tr>
  </tbody>
</table>
<h3>Project Gotham Racing 3 and 4<span class="hx:absolute hx:-mt-20" id="project-gotham-racing-3-and-4"></span>
    <a href="#project-gotham-racing-3-and-4" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>PGR3 was a 360 launch title in 2005 and still looks gorgeous today. Urban tracks (New York, Tokyo, Nurburgring, Las Vegas), kudos system, cinematic camera. PGR4 added dynamic weather (rain and snow) and motorcycles (some people love them, I prefer to pretend they don&rsquo;t exist).</p>
<p>PGR3 on Xenia Canary runs reasonably well with default configs. PGR4 needs a specific tweak: <strong><code>render_target_path_vulkan = &quot;fsi&quot;</code></strong> in the config. Without it, some tracks break with severe visual artifacts (green bars across the screen in snow races).</p>
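<p>Since I only want <code>fsi</code> for PGR4 and not globally, the tweak belongs in the per-game config. A naive line-based sketch in Python; the file name is illustrative, use whatever per-game TOML your Xenia Manager profile points at:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code class="language-python">from pathlib import Path

def set_toml_option(path, key, value):
    # Naive line-based edit: rewrite the key in place so it stays inside
    # its TOML section. Assumes the key already exists (Xenia writes out
    # every default on first run, so it does).
    lines = Path(path).read_text().splitlines()
    out = []
    for line in lines:
        if line.strip().startswith(key):
            out.append(f'{key} = "{value}"')
        else:
            out.append(line)
    Path(path).write_text("\n".join(out) + "\n")

set_toml_option("PGR4.config.toml", "render_target_path_vulkan", "fsi")
</code></pre></div></div>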
<p>And PGR4 on NVIDIA has a known audio bug: XMA decoding produces intermittent garbage (open xenia-canary issue, no resolution). Not game-breaking, but annoying.</p>
<table>
  <thead>
      <tr>
          <th>Game</th>
          <th>Relevant setting</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>PGR3</td>
          <td>Default</td>
      </tr>
      <tr>
          <td>PGR4</td>
          <td><code>render_target_path_vulkan = &quot;fsi&quot;</code>, accept the XMA audio bug</td>
      </tr>
  </tbody>
</table>
<h3>Ridge Racer 6<span class="hx:absolute hx:-mt-20" id="ridge-racer-6"></span>
    <a href="#ridge-racer-6" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Ridge Racer 6 was the Ridge Racer of the 360, skipping the 7 that stayed on PS3. Pure arcade, like Ridge Racer V of the PS2. On Xenia Canary it runs with default config, no special tweak. Fun for short sessions. No video for this one in the article, but it&rsquo;s in the catalog running.</p>
<h2>PS4 on shadPS4: the state of the emulator<span class="hx:absolute hx:-mt-20" id="ps4-on-shadps4-the-state-of-the-emulator"></span>
    <a href="#ps4-on-shadps4-the-state-of-the-emulator" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p><a href="https://shadps4.net/"target="_blank" rel="noopener">shadPS4</a> <strong>still doesn&rsquo;t have a stable release</strong>. The project is under active development, the main branch changes several times a week, and &ldquo;what works today&rdquo; can regress on a rebase tomorrow. That said, <strong>many games already boot and run well enough</strong>. The community list is on the <a href="https://github.com/shadps4-compatibility/shadps4-game-compatibility/issues"target="_blank" rel="noopener">compatibility tracker</a> and grows every week.</p>
<p>The difference between shadPS4 and the other emulators in this article is the kind of effort. PCSX2, RPCS3, Xenia are mature projects where each game has a reasonably up-to-date wiki page. shadPS4 is the opposite: you read a Reddit guide, try to reproduce, discover the guide uses a specific fork (not the official main), or a two-month-old commit, or a firmware with <code>sys_modules</code> symlinked in a particular way, or XML patches applied in a specific order that the author didn&rsquo;t document. Reproducing a shadPS4 setup from a YouTube video is like trying to solve a Rubik&rsquo;s Cube blindfolded.</p>
<p>But that doesn&rsquo;t mean you shouldn&rsquo;t try. It means you need to treat shadPS4 as a hobby, not a workflow. And if you have the patience, you can pull off impressive results.</p>
<h2>Driveclub: the impossible, finally possible<span class="hx:absolute hx:-mt-20" id="driveclub-the-impossible-finally-possible"></span>
    <a href="#driveclub-the-impossible-finally-possible" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><video
  controls
  preload="metadata"
  playsinline
  class="html5-video"
  style="width:100%; max-width:960px; display:block; margin:1.25em auto; border-radius:8px; background:#000;"
  >
  <source src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/videos/driveclub.mp4" type="video/mp4">
  Your browser does not support HTML5 video.
</video>

<p>Driveclub is my Moby Dick of emulation. It&rsquo;s the only PS4 game I really want to run. Released in 2014 by Evolution Studios, had a catastrophic launch (online servers collapsed on day one), got patched over two years until it became one of the best racing games of the generation, and was discontinued by Sony when Evolution was closed in 2016. No PC port. No remaster. No sequel. Only exists on PS4.</p>
<p>And today, after many hours of debugging, <strong>it&rsquo;s running here</strong>. Main menu loads, intro plays, races start, controller responds, audio works, daytime tracks visible, night tracks with headlights lighting the road. The video above is real gameplay, stable 30 FPS, v1.28 with DLC content accessible.</p>
<p>Getting here required dismantling several traps. Worth documenting, because most online guides don&rsquo;t mention any of them.</p>
<h3>Trap 1: the FMOD loop error that wasn&rsquo;t FMOD<span class="hx:absolute hx:-mt-20" id="trap-1-the-fmod-loop-error-that-wasnt-fmod"></span>
    <a href="#trap-1-the-fmod-loop-error-that-wasnt-fmod" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>When you try to boot Driveclub on shadPS4 straight from the retail PKG, the log hits an infinite loop of:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code>/app0/audio/fmodstudio/masterbank.bank failed, file does not exist</code></pre></div>
</div>
<p>The first reaction is to investigate FMOD. Some people online spent hours messing with sceFont, libtbb, sceNgs2, convinced it was an audio or HLE lib problem. <strong>None of that is the cause</strong>. The <code>masterbank.bank</code> file literally doesn&rsquo;t exist on disk, because it&rsquo;s packed inside the <code>gameNNN.dat</code> files in a custom Evolution Studios archive format that shadPS4&rsquo;s VFS can&rsquo;t read. Same story for fonts, models, shaders. All the game&rsquo;s content lives inside the <code>.dat</code> files.</p>
<p>The only public tool that parses that format is <strong><a href="https://github.com/Nenkai/DriveClubFS"target="_blank" rel="noopener">Nenkai&rsquo;s DriveClubFS</a></strong>. It&rsquo;s a .NET app that reads <code>game.ndx</code> (the index) and decompresses the <code>.dat</code> files into loose files. Without that unpack, shadPS4 can&rsquo;t open anything.</p>
<p>Build detail: the DriveClubFS project targets <code>net9.0</code>, and Arch has <code>net8</code> and <code>net10</code>. You need to retarget the csproj to <code>net10.0</code> before building. The exact command is documented in the repo&rsquo;s <a href="https://github.com/akitaonrails/distrobox-gaming/blob/master/docs/driveclub-shadps4.md"target="_blank" rel="noopener"><code>docs/driveclub-shadps4.md</code></a>.</p>
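<p>For reference, the retarget is a one-line <code>sed</code> plus a standard build. A minimal sketch, assuming the csproj sits at the obvious path inside the clone (adjust to what the repo actually gives you):</p>
<div><pre><code class="language-bash"># swap the target framework so the project builds against the SDK Arch ships
sed -i 's|&lt;TargetFramework&gt;net9.0&lt;/TargetFramework&gt;|&lt;TargetFramework&gt;net10.0&lt;/TargetFramework&gt;|' \
  DriveClubFS/DriveClubFS.csproj

# release build with the .NET 10 SDK
dotnet build -c Release</code></pre></div>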
<h3>Trap 2: v1.28 is a patch, not a base installer<span class="hx:absolute hx:-mt-20" id="trap-2-v128-is-a-patch-not-a-base-installer"></span>
    <a href="#trap-2-v128-is-a-patch-not-a-base-installer" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Like every PS4 update, v1.28 is a patch that layers on top of a base install. The v1.28 PKG ships the updated eboot, new-content <code>.dat</code> files, and patch content (<code>sce_sys/about/right.sprx</code>), but <strong>no <code>param.sfo</code>, no <code>disc_info.dat</code>, no <code>keystone</code></strong>. Those metadata files live in the v1.00 base — on a real PS4 they&rsquo;d already be on disk from the original install.</p>
<p>And there&rsquo;s more: v1.28&rsquo;s <code>game.ndx</code> references both the base <code>.dat</code> files and the new ones. Running DriveClubFS against v1.28 alone makes the index point at files that aren&rsquo;t there.</p>
<p>The cruelest symptom hits later: the game boots, the Qt Launcher shows the tile, you click, the loading screen comes up. Then the game decides no content &ldquo;has been released yet&rdquo; and asks you to download it. The download obviously isn&rsquo;t going to happen, because PSN for PS4 isn&rsquo;t accessible. You get stuck thinking it&rsquo;s a network bug, when actually it&rsquo;s missing metadata.</p>
<p>The correct flow is to treat v1.28 as an overlay on top of the base. First, extract the v1.00 PKG into a dedicated dir (<code>CUSA00003.v100-working-backup/</code> works well, and doubles as a rollback snapshot if some experiment breaks the live install). Then extract v1.28 into staging and symlink the base&rsquo;s low-index <code>.dat</code> files into the staging dir so DriveClubFS sees the full set. Run the unpack. You get 8018 loose files, ~47 GB.</p>
<p>Move that into the live <code>CUSA00003/</code> dir, copy v1.28&rsquo;s updated eboot and new patch content on top, then pull <code>param.sfo</code>, <code>disc_info.dat</code>, and <code>keystone</code> from the v1.00 base. The base&rsquo;s <code>param.sfo</code> already has <code>APP_VER = 01.28</code> recorded (because it was the version that received the patch originally), so there&rsquo;s no version conflict. The v1.28 eboot reads those files and unlocks the content.</p>
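<p>Condensed into shell, the overlay flow looks roughly like this. Directory names, the glob for the base&rsquo;s low-index <code>.dat</code> files, and the DriveClubFS invocation are all illustrative; the exact commands are in the repo doc:</p>
<div><pre><code class="language-bash"># v1.00 base extracted to CUSA00003.v100-working-backup/ (also the rollback snapshot)
# v1.28 patch extracted to CUSA00003.v128-staging/

# make the base's low-index .dat files visible to v1.28's game.ndx
ln -s "$PWD"/CUSA00003.v100-working-backup/game0*.dat CUSA00003.v128-staging/

# unpack everything into the live dir (illustrative DriveClubFS invocation)
dotnet run --project DriveClubFS -- CUSA00003.v128-staging/game.ndx -o CUSA00003/

# layer v1.28's eboot and patch content on top
cp CUSA00003.v128-staging/eboot.bin CUSA00003/

# restore the metadata that only exists in the v1.00 base
cp CUSA00003.v100-working-backup/{param.sfo,disc_info.dat,keystone} CUSA00003/</code></pre></div>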
<h3>Trap 3: the 60fps XML patch produces slow-motion<span class="hx:absolute hx:-mt-20" id="trap-3-the-60fps-xml-patch-produces-slow-motion"></span>
    <a href="#trap-3-the-60fps-xml-patch-produces-slow-motion" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>There&rsquo;s a community <code>Driveclub.xml</code> patch that makes the game render at 60fps. Intuitively, it looks like pure upside. In practice, the patch rewrites the render rate but doesn&rsquo;t touch the game&rsquo;s fixed logic tickrate, which is locked at 30fps. On a real PS4 Pro, the hardware reconciles the mismatch. On shadPS4, it doesn&rsquo;t.</p>
<p>Result: the game renders smoothly at 60fps, but the physics, AI, audio, race timing all run in slow motion. You press the throttle and the car takes a perceptible two seconds to launch. Looks like a random bug until you understand the mismatch.</p>
<p>Disabling the XML patch is the move (just rename the file to something like <code>.disabled</code>). Without the patch, the game runs at native 30 FPS, same speed as a stock PS4 base. Which is what we want. If someone in the future produces a patch that moves both rates (render + logic), we&rsquo;ll flip it back on.</p>
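<p>In shell terms (the patches path is illustrative, use wherever your shadPS4 install keeps its XML patches):</p>
<div><pre><code class="language-bash">mv patches/Driveclub.xml patches/Driveclub.xml.disabled</code></pre></div>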
<h3>Trap 4: pipeline cache has to be on<span class="hx:absolute hx:-mt-20" id="trap-4-pipeline-cache-has-to-be-on"></span>
    <a href="#trap-4-pipeline-cache-has-to-be-on" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>shadPS4&rsquo;s default ships with <code>pipeline_cache_enabled = false</code>. On a large game like Driveclub, which has hundreds of Vulkan shaders and pipelines, that means every cold launch recompiles everything. My first launch counted ~864 shaders and ~590 pipelines being compiled in real time. The menu animated at 2-3 fps. A minute later, with everything compiled, it snapped to normal fluidity. Next launch, same from-scratch compilation cycle.</p>
<p>Just enable <code>&quot;pipeline_cache_enabled&quot;: true</code> in the global <code>config.json</code>. The first session pays the one-time compilation cost, subsequent sessions read from cache and start up instantly.</p>
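<p>With <code>jq</code> that&rsquo;s a one-liner against the global config. The path assumes the default Linux location, which may differ on your setup:</p>
<div><pre><code class="language-bash">CFG="$HOME/.config/shadps4/config.json"  # assumed default path
jq '.pipeline_cache_enabled = true' "$CFG" &gt; "$CFG.tmp" &amp;&amp; mv "$CFG.tmp" "$CFG"</code></pre></div>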
<h3>Trap 5: TOML silently ignored<span class="hx:absolute hx:-mt-20" id="trap-5-toml-silently-ignored"></span>
    <a href="#trap-5-toml-silently-ignored" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>shadPS4 migrated from TOML to JSON for per-game configs in late 2025. If you have an old <code>CUSA00003.toml</code> in the <code>custom_configs/</code> folder, it&rsquo;s silently ignored, no warning in the log. You think your config is active, in reality you&rsquo;re running all defaults.</p>
<p>Delete any TOML and use <code>CUSA00003.json</code>. The per-game configs that work for Driveclub are specific and detailed, all documented in the <a href="https://github.com/akitaonrails/distrobox-gaming/blob/master/docs/driveclub-shadps4.md"target="_blank" rel="noopener"><code>docs/driveclub-shadps4.md</code></a> of the repo.</p>
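<p>A sketch of the migration, seeding the JSON with two values from the table below (the full Driveclub set is in the repo doc):</p>
<div><pre><code class="language-bash"># remove the stale TOML so there's no doubt about what's active
rm -f custom_configs/CUSA00003.toml

# minimal per-game JSON; the real config has more keys
cat &gt; custom_configs/CUSA00003.json &lt;&lt;'EOF'
{
  "vblank_frequency": 60,
  "gpu_id": 0
}
EOF</code></pre></div>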
<h3>Trap 6: controller not detected without hidapi hints<span class="hx:absolute hx:-mt-20" id="trap-6-controller-not-detected-without-hidapi-hints"></span>
    <a href="#trap-6-controller-not-detected-without-hidapi-hints" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>shadPS4&rsquo;s Qt Launcher calls an internal <code>Shadps4-sdl.AppImage</code>. Without specific SDL hidapi env vars, the controller (8BitDo Ultimate 2 in my case) isn&rsquo;t detected, even though it works in every other emulator. The fix is to wrap the AppImage in a shell that exports <code>SDL_JOYSTICK_HIDAPI=1</code> and platform-specific variants (PS4, PS5, Xbox), before executing the original.</p>
<p>Plugging the controller in <strong>before</strong> opening the Qt Launcher also helps. Hot-plug works during the game, not before.</p>
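<p>The wrapper itself is a few lines of shell. Rename the original AppImage and drop this script in its place (the <code>.orig</code> suffix is my convention, not something shadPS4 expects):</p>
<div><pre><code class="language-bash">#!/usr/bin/env bash
# force SDL's hidapi joystick backend, plus the per-platform variants
export SDL_JOYSTICK_HIDAPI=1
export SDL_JOYSTICK_HIDAPI_PS4=1
export SDL_JOYSTICK_HIDAPI_PS5=1
export SDL_JOYSTICK_HIDAPI_XBOX=1
exec "$(dirname "$0")/Shadps4-sdl.AppImage.orig" "$@"</code></pre></div>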
<h3>Trap 7: night tracks are pitch black on upstream<span class="hx:absolute hx:-mt-20" id="trap-7-night-tracks-are-pitch-black-on-upstream"></span>
    <a href="#trap-7-night-tracks-are-pitch-black-on-upstream" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>This one needed a code patch, not just config. On upstream shadPS4, Driveclub night tracks came out pitch black. The HUD rendered on top, but the track, the car, everything hidden in absolute black. Not even the car&rsquo;s own headlights lit anything up.</p>
<p>Investigation pointed at a message the log repeated 1325 times: <code>ResolveDepthOverlap: Unimplemented depth overlap copy</code>. The cause: Driveclub uses a forward+ renderer, which writes scene depth into a 4x MSAA D32Sfloat buffer, then reads that same buffer back as 1x <code>R32G32B32A32Sfloat</code> for effects like SSAO, volumetrics, and dynamic lighting (headlights). Upstream shadPS4 has a path for MSAA → MSAA and for color → color as MSAA depth, but <strong>it didn&rsquo;t have a path for &ldquo;4x MSAA depth becomes 1x color&rdquo;</strong> in that specific format. It falls into the final <code>else</code>, frees the buffer, and the shader reads uninitialized memory — i.e., black.</p>
<p>In my local shadPS4 fork I implemented a <code>ReinterpretMsDepthAsColor</code>, symmetric to the existing <code>ReinterpretColorAsMsDepth</code>. It&rsquo;s a tiny fragment shader (<code>ms_depth_to_color.frag</code>) that does a <code>texelFetch</code> of sample 0&rsquo;s depth and writes it as <code>vec4(depth, 0, 0, 1)</code> to the color attachment. Plus a <code>BlitHelper</code> routine and a new <code>else if</code> branch in <code>TextureCache::ResolveDepthOverlap</code>. With that, night tracks light up properly, opponent headlights appear, and AO and volumetrics in day scenes get more precise too.</p>
<p>The fix lives on a <code>gamma-debug</code> branch of my fork, ready to become an upstream PR. Until it merges, anyone who wants to race at night on Driveclub on shadPS4 needs to build from that branch. Anyone sticking to daytime can use upstream nightly directly.</p>
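<p>Building the branch is the usual CMake dance. A minimal sketch, assuming shadPS4&rsquo;s build dependencies are already installed:</p>
<div><pre><code class="language-bash">git clone --recursive https://github.com/akitaonrails/shadPS4
cd shadPS4
git checkout gamma-debug
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j"$(nproc)"</code></pre></div>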
<h3>What finally worked<span class="hx:absolute hx:-mt-20" id="what-finally-worked"></span>
    <a href="#what-finally-worked" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>shadPS4</td>
          <td>Upstream Pre-release (nightly) for daytime; <code>gamma-debug</code> fork for night (until MSAA depth fix lands upstream)</td>
      </tr>
      <tr>
          <td>Driveclub</td>
<td><strong>v1.00 base + v1.28 patch</strong> (cumulative overlay, ~47 GB, 8018 loose files)</td>
      </tr>
      <tr>
          <td>Extraction</td>
          <td><a href="https://github.com/shadps4-emu/ShadPKG"target="_blank" rel="noopener">ShadPKG</a> → <a href="https://github.com/Nenkai/DriveClubFS"target="_blank" rel="noopener">DriveClubFS</a> (retargeted to net10.0)</td>
      </tr>
      <tr>
          <td>Metadata</td>
          <td>Restore <code>param.sfo</code>, <code>disc_info.dat</code>, <code>keystone</code> from the v1.00 PKG</td>
      </tr>
      <tr>
          <td>Global config</td>
          <td><code>pipeline_cache_enabled: true</code>, <code>readbacks_mode: 0</code></td>
      </tr>
      <tr>
          <td>Per-game config</td>
          <td>JSON at <code>custom_configs/CUSA00003.json</code></td>
      </tr>
      <tr>
          <td><code>vblank_frequency</code></td>
          <td>60</td>
      </tr>
      <tr>
          <td><code>gpu_id</code></td>
          <td>0 (NVIDIA dGPU hardforced)</td>
      </tr>
      <tr>
          <td><code>Driveclub.xml</code> patch</td>
          <td><strong>Disabled</strong> (60fps produces slow-motion, root cause: fixed logic tickrate)</td>
      </tr>
      <tr>
          <td>Controller</td>
          <td>AppImage wrapper with <code>SDL_JOYSTICK_HIDAPI=1</code> + variants</td>
      </tr>
      <tr>
          <td>Qt Launcher</td>
          <td><code>checkForUpdates=false</code> to avoid GitHub rate-limit dialog</td>
      </tr>
  </tbody>
</table>
<p>All of this is encapsulated in <a href="https://github.com/akitaonrails/distrobox-gaming/blob/master/docs/driveclub-shadps4.md"target="_blank" rel="noopener"><code>docs/driveclub-shadps4.md</code></a> in the repo, with the exact step-by-step for extraction, config, and wrapper. The full technical investigation (5 phases, detailed engineering log) lives in the fork&rsquo;s <a href="https://github.com/akitaonrails/shadPS4/blob/gamma-debug/docs/driveclub-v128-investigation.md"target="_blank" rel="noopener"><code>driveclub-v128-investigation.md</code></a>.</p>
<h3>How playable it is today<span class="hx:absolute hx:-mt-20" id="how-playable-it-is-today"></span>
    <a href="#how-playable-it-is-today" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Good. The video above is real gameplay, stable 30 FPS, physics responding, audio in sync, HUD correct, v1.28 with DLC accessible. Day races and night races both work (with the MSAA depth fork for night). What&rsquo;s left as limitation:</p>
<ul>
<li><strong>Daytime brightness sits a bit below what the PS4 delivered.</strong> Not an emulator bug. I spent hours testing tonemap, auto-exposure, gamma curves, ACES, various post-processing shaders, and in the end discovered the game <strong>writes already-correct SDR</strong> to the framebuffer. What was missing was the display-calibration pipeline that PS4 firmware 4.0+ applied outside the game, and that Linux doesn&rsquo;t have. The in-game brightness slider (which on shadPS4 maps to <code>sceVideoOutAdjustColor</code>) works, but only up to the point where the game expected the firmware to pick up the rest. The 5-second dim at the start of a race is also the game&rsquo;s own cinematic fade-in VFX, not an emulator fault. Accepted.</li>
<li><strong>Native 30 FPS.</strong> No viable 60fps patch (due to the fixed logic tickrate explained above). Stock PS4 speed.</li>
<li><strong>MSAA depth fix still on the fork.</strong> PR candidate ready, awaiting upstream merge.</li>
</ul>
<p><strong>&ldquo;Achievement Unlocked&rdquo;.</strong> The game runs, with sound, controller, gameplay, day and night, and I complete a whole race without crashing. That&rsquo;s a personal milestone. I spent years trying to get this game running on Linux and would always hit a point where I just gave up. Now it&rsquo;s part of my ES-DE, shortcut on the desktop, sitting there waiting for me to play.</p>
<p>If you want to replicate, the repo is public. Good luck. You&rsquo;re going to need patience.</p>
<h2>&ldquo;Achievement Unlocked&rdquo;<span class="hx:absolute hx:-mt-20" id="achievement-unlocked"></span>
    <a href="#achievement-unlocked" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>I&rsquo;ve been trying to make these games work, at some level, since each respective emulator came out of alpha. I&rsquo;ve been running GT2 since ePSXe. I&rsquo;ve been messing with GT4 since PCSX2 0.9.6. I tried to get FM4 running on Xenia twice in past years and gave up both times. And I looked at shadPS4&rsquo;s face three times before this Sunday and swallowed hard.</p>
<p>Today it all came together. The two monsters that opened this article (Driveclub on shadPS4 and FM4 with Project Forza Plus on Xenia) are running, along with the rest of the Gran Turismo, Forza, Ridge Racer, Colin McRae, and Enthusia collection. All in a single setup, reproducible, automated. I&rsquo;m now part of the <strong>very small group of people who brute-forced their way into emulating Driveclub on Linux</strong>, and that alone made this Sunday unforgettable.</p>
<p>The frustration of years wasn&rsquo;t lack of information. It was the OPPOSITE. Too much information, poorly indexed, scattered across dead forums, YouTube videos that went stale, Reddit threads with solutions for two-year-old emulator versions that don&rsquo;t apply today, wikis that are out of date on half the pages, and mods with VERY specific requirements of CRC, version, patch, path, setting. Every time I sat down to get one of these games running right, I spent three hours reading conflicting material before writing the first line of config. Worse still when the good material was buried under wrong diagnoses, like the Driveclub FMOD loop everyone investigated as if it was an audio problem when in reality it was Evolution Studios&rsquo; proprietary archive format.</p>
<p>What changed in the last year is that Claude Code can be my aggregator. I tell Claude to hit the Xenia issue tracker, the PCSX2 wiki, the RPCS3 subreddit, shadPS4 PRs, cross-reference all that info with the emulator version I have installed, cross-reference with the logs I&rsquo;m looking at, and give me back the combination that works today. It reads source code when needed, understands why <code>Force CPU Blit</code> changes behavior on GT6, finds which FM4 title update was the one that fixed the shadow bug, discovers that DriveClubFS needs a retarget to .NET 10 before compiling. What used to take three weekends of research now takes an afternoon of assisted execution.</p>
<p>This Sunday is the consolidated result of all that work. <a href="https://github.com/akitaonrails/distrobox-gaming"target="_blank" rel="noopener">distrobox-gaming</a> is that knowledge turned into code, reproducible, automated. Starting on a fresh machine, one <code>ansible-playbook site.yml</code> puts me back in this state in under two hours. Fewer than 10 commands separate &ldquo;vanilla Arch Linux&rdquo; from &ldquo;Driveclub running&rdquo;.</p>
<p>If anyone wants to contribute, test on different hardware, report a bug, or send a config that&rsquo;s missing, the repo is open. The more people test on different setups, the more resilient the project gets. If you improve the Driveclub situation (get the MSAA depth fix merged upstream, a 60fps patch that also fixes the logic tickrate, font dumps that don&rsquo;t need jailbroken hardware), send a PR and I&rsquo;ll gladly take it.</p>
<p>For now, I&rsquo;m going to do a race in Iceland on Driveclub. Good Sunday night.</p>
]]></content:encoded><category>gaming</category><category>emulation</category><category>linux</category><category>distrobox</category><category>racing</category><category>gran-turismo</category><category>forza</category></item><item><title>LLM Benchmarks Part 2: Is It Worth Combining Multiple Models in the Same Project? Claude + GLM??</title><link>https://akitaonrails.github.io/en/2026/04/18/llm-benchmarks-part-2-multi-model/</link><guid isPermaLink="true">https://akitaonrails.github.io/en/2026/04/18/llm-benchmarks-part-2-multi-model/</guid><pubDate>Sat, 18 Apr 2026 14:00:00 GMT</pubDate><description>&lt;blockquote&gt;
&lt;p&gt;⚠️ &lt;strong&gt;Obsolete article (updated 2026-04-24).&lt;/strong&gt; The conclusions and rankings in this post were superseded after I re-audited the benchmark against the &lt;code&gt;ruby_llm&lt;/code&gt; gem source. The core finding (multi-model is not worth it for greenfield coding) still holds and has been folded into the canonical post. &lt;strong&gt;The canonical version lives at &lt;a href="https://akitaonrails.github.io/en/2026/04/24/llm-benchmarks-parte-3-deepseek-kimi-mimo/"&gt;LLM Coding Benchmark (April 2026)&lt;/a&gt;.&lt;/strong&gt; This post stays as a historical record.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;hr&gt;
&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Yes, the title is clickbait. The answer is no, it&amp;rsquo;s not worth it. Keep using Claude Code with Opus 4.6 or 4.7. Details below.&lt;/p&gt;</description><content:encoded><![CDATA[<blockquote>
  <p>⚠️ <strong>Obsolete article (updated 2026-04-24).</strong> The conclusions and rankings in this post were superseded after I re-audited the benchmark against the <code>ruby_llm</code> gem source. The core finding (multi-model is not worth it for greenfield coding) still holds and has been folded into the canonical post. <strong>The canonical version lives at <a href="/en/2026/04/24/llm-benchmarks-parte-3-deepseek-kimi-mimo/">LLM Coding Benchmark (April 2026)</a>.</strong> This post stays as a historical record.</p>

</blockquote>
<hr>
<p><strong>TL;DR:</strong> Yes, the title is clickbait. The answer is no, it&rsquo;s not worth it. Keep using Claude Code with Opus 4.6 or 4.7. Details below.</p>
<hr>
<p>A few weeks ago I wrote a <a href="/en/2026/04/05/testing-llms-open-source-and-commercial-can-anyone-beat-claude-opus/">detailed LLM coding benchmark</a> comparing 33 open source and commercial models on the same test: build a Rails app using RubyLLM. The conclusion was that only 4 models generated code that works on first try (both Claudes, GLM 5, and GLM 5.1), and that for anyone who doesn&rsquo;t want to waste time fixing API hallucinations, Claude Opus via Claude Code remains the most rational choice despite the price.</p>
<p>This article is a follow-up. I&rsquo;ll keep that benchmark updated as new models come out (Opus 4.7, Qwen 3.6, GPT 5.4 via Codex are already in). This one tackles a different question that shows up in my feed every week: <strong>what if I combine two models in the same project? Opus to plan, GLM to execute. Does that work?</strong></p>
<p>Short answer: no, it doesn&rsquo;t. Long answer is the rest of this article.</p>
<h2>First: a word about Opus 4.7<span class="hx:absolute hx:-mt-20" id="first-a-word-about-opus-47"></span>
    <a href="#first-a-word-about-opus-47" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>There are people on Reddit claiming Opus 4.7 is an absurd downgrade from 4.6, that it regressed, that it &ldquo;got worse at coding&rdquo;. I get suspicious whenever I see &ldquo;everything&rsquo;s getting worse&rdquo; narratives. I have hundreds of hours with Opus 4.6, and I&rsquo;ve been testing 4.7 since it came out a few days ago. Quality is equal to or better than 4.6 on non-trivial tasks where I have a reference for how 4.6 used to behave.</p>
<p>When you see someone complaining &ldquo;4.7 is terrible&rdquo;, ask for the exact prompt, the repo, the context. Most of the time they can&rsquo;t reproduce it, or the repo has a badly written CLAUDE.md, or the task is too subjective to measure. &ldquo;Felt like it got worse&rdquo; isn&rsquo;t data. I got caught by that feeling myself in a 4.7 session where the context had been contaminated with a bunch of stale docs. The culprit was my config, not the model.</p>
<p>In the benchmarks I ran this week, Opus 4.7 on opencode delivered clean Tier 1, same level as the Opus 4.6 baseline. The runs via Claude Code told a weirder story, and there the blame is likely the harness, not the model. More on that below.</p>
<h2>What&rsquo;s new in the benchmark<span class="hx:absolute hx:-mt-20" id="whats-new-in-the-benchmark"></span>
    <a href="#whats-new-in-the-benchmark" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>The <a href="https://github.com/akitaonrails/llm-coding-benchmark"target="_blank" rel="noopener">benchmark</a> now supports testing model combinations across the three main harnesses:</p>
<ol>
<li><strong>Claude Code</strong> (<code>claude -p --output-format stream-json</code>) — supports sub-agents declared in <code>.claude/agents/*.md</code> that Opus can delegate to via the <code>Task</code> tool</li>
<li><strong>opencode</strong> — has its own sub-agent system that can run different models</li>
<li><strong>Codex CLI</strong> — got sub-agent support via TOML with <code>-c agents.&lt;name&gt;.config_file=...</code></li>
</ol>
<p>On top of these three, I configured 7 combinations:</p>
<table>
  <thead>
      <tr>
          <th>Runner</th>
          <th>Primary model</th>
          <th>Sub-agent</th>
          <th>Idea</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Claude Code</td>
          <td>Opus 4.7</td>
          <td>—</td>
          <td>Baseline, Opus alone</td>
      </tr>
      <tr>
          <td>Claude Code</td>
          <td>Opus 4.7</td>
          <td>Sonnet 4.6</td>
          <td>Opus plans, Sonnet executes</td>
      </tr>
      <tr>
          <td>Claude Code</td>
          <td>Opus 4.7</td>
          <td>Haiku 4.5</td>
          <td>Opus plans, Haiku (smaller) executes</td>
      </tr>
      <tr>
          <td>opencode</td>
          <td>Opus 4.7</td>
          <td>GLM 5.1</td>
          <td>Opus + GLM (cheap + good)</td>
      </tr>
      <tr>
          <td>opencode</td>
          <td>Opus 4.7</td>
          <td>Qwen 3.6 local</td>
          <td>Opus + free local model</td>
      </tr>
      <tr>
          <td>Codex</td>
          <td>GPT 5.4 xHigh</td>
          <td>GPT 5.4 medium</td>
          <td>High reasoning plans, lower executes</td>
      </tr>
      <tr>
          <td>Codex</td>
          <td>GPT 5.4 xHigh</td>
          <td>GPT 5.4 low</td>
          <td>High reasoning plans, minimum executes</td>
      </tr>
  </tbody>
</table>
<p>Each runs the same prompt: build a Rails app with RubyLLM, Tailwind, Stimulus, Turbo Streams, Minitest tests, Brakeman, RuboCop, Dockerfile, docker-compose. Same prompt as the original benchmark.</p>
<h3>How to enable multi-model in each harness<span class="hx:absolute hx:-mt-20" id="how-to-enable-multi-model-in-each-harness"></span>
    <a href="#how-to-enable-multi-model-in-each-harness" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Before showing what went wrong, it&rsquo;s worth understanding how each harness exposes the sub-agent and who decides to call it. The mechanics are similar across all three, but the details matter.</p>
<h4>Claude Code<span class="hx:absolute hx:-mt-20" id="claude-code"></span>
    <a href="#claude-code" class="subheading-anchor" aria-label="Permalink for this section"></a></h4><p>Claude Code automatically reads files in <code>.claude/agents/*.md</code> from the project directory. Each file is an agent definition:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-markdown" data-lang="markdown"><span class="line"><span class="cl">---
</span></span><span class="line"><span class="cl">name: sonnet-coder
</span></span><span class="line"><span class="cl">description: Claude Sonnet 4.6 for concrete coding execution. Use PROACTIVELY for any code change where the plan is already clear. Opus should plan and delegate; Sonnet should execute. Only skip delegation for cross-file architectural decisions.
</span></span><span class="line"><span class="cl">model: claude-sonnet-4-6
</span></span><span class="line"><span class="cl">---
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">You are a focused coding agent. The parent (Opus) has already decided
</span></span><span class="line"><span class="cl">the approach — your job is to execute cleanly.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Rules:
</span></span><span class="line"><span class="cl"><span class="k">-</span> Follow the provided instructions precisely. Don&#39;t re-plan.
</span></span><span class="line"><span class="cl"><span class="k">-</span> Prefer editing existing files over creating new ones.
</span></span><span class="line"><span class="cl"><span class="k">-</span> Match the existing codebase style.
</span></span><span class="line"><span class="cl"><span class="k">-</span> Keep changes minimal.
</span></span><span class="line"><span class="cl">- Default to no comments.</span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>The YAML frontmatter has three required fields: <code>name</code> (the handle the primary model uses), <code>model</code> (which model runs the agent), and <code>description</code> (what the primary reads to decide whether to delegate). The file body is the system prompt the sub-agent receives when invoked.</p>
<p>To invoke, Opus uses the native <code>Task(subagent_type=&quot;sonnet-coder&quot;, prompt=&quot;...&quot;)</code> tool. Claude Code bills tokens to the sub-agent&rsquo;s model, not the primary&rsquo;s.</p>
<h4>opencode<span class="hx:absolute hx:-mt-20" id="opencode"></span>
    <a href="#opencode" class="subheading-anchor" aria-label="Permalink for this section"></a></h4><p>opencode uses a JSON config file (can be the default <code>opencode.json</code> or a custom one via <code>--config</code>). Agents live under an <code>agents</code> key:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;model&#34;</span><span class="p">:</span> <span class="s2">&#34;openrouter/anthropic/claude-opus-4.7&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;agents&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;coder&#34;</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;model_id&#34;</span><span class="p">:</span> <span class="s2">&#34;zai/glm-5.1&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;provider&#34;</span><span class="p">:</span> <span class="s2">&#34;zai&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;description&#34;</span><span class="p">:</span> <span class="s2">&#34;Use proactively for concrete coding execution...&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">      <span class="nt">&#34;prompt&#34;</span><span class="p">:</span> <span class="s2">&#34;You are a focused coding agent. The parent...&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">  <span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span></span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>Each entry has <code>model_id</code>, <code>provider</code>, <code>description</code>, and <code>prompt</code>. The primary model (set at the top) invokes the agent via a <code>task</code> tool, passing the name (<code>coder</code> in the example) and specific instructions.</p>
<h4>Codex CLI<span class="hx:absolute hx:-mt-20" id="codex-cli"></span>
    <a href="#codex-cli" class="subheading-anchor" aria-label="Permalink for this section"></a></h4><p>Codex uses TOML files per agent, passed via <code>-c</code> flags on the command line:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-toml" data-lang="toml"><span class="line"><span class="cl"><span class="c"># .codex-coder.toml</span>
</span></span><span class="line"><span class="cl"><span class="nx">name</span> <span class="p">=</span> <span class="s2">&#34;coder&#34;</span>
</span></span><span class="line"><span class="cl"><span class="nx">model</span> <span class="p">=</span> <span class="s2">&#34;gpt-5.4&#34;</span>
</span></span><span class="line"><span class="cl"><span class="nx">reasoning_effort</span> <span class="p">=</span> <span class="s2">&#34;medium&#34;</span>
</span></span><span class="line"><span class="cl"><span class="nx">description</span> <span class="p">=</span> <span class="s2">&#34;Use proactively for concrete coding execution...&#34;</span>
</span></span><span class="line"><span class="cl"><span class="nx">prompt</span> <span class="p">=</span> <span class="s2">&#34;You are a focused coding agent. The parent (xhigh)...&#34;</span></span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>And invoked:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">codex <span class="nb">exec</span> <span class="se">\
</span></span></span><span class="line"><span class="cl">  --dangerously-bypass-approvals-and-sandbox <span class="se">\
</span></span></span><span class="line"><span class="cl">  -c <span class="nv">model_reasoning_effort</span><span class="o">=</span>xhigh <span class="se">\
</span></span></span><span class="line"><span class="cl">  -c agents.coder.config_file<span class="o">=</span>.codex-coder.toml <span class="se">\
</span></span></span><span class="line"><span class="cl">  -p <span class="s2">&#34;&lt;main prompt&gt;&#34;</span></span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>The primary model gets access to the <code>spawn_agent</code> tool to invoke the <code>coder</code>. Codex lets you configure a different <code>reasoning_effort</code> for the primary and the sub-agent, which is exactly what the <code>multi_balanced</code> and <code>multi_faster</code> variants test.</p>
<h3>Who decides which model runs each task<span class="hx:absolute hx:-mt-20" id="who-decides-which-model-runs-each-task"></span>
    <a href="#who-decides-which-model-runs-each-task" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>In all three harnesses, the decision to delegate is made by the <strong>primary model at runtime</strong>, not by a programmatic rule. There&rsquo;s no deterministic heuristic like &ldquo;if the file is larger than X lines, call the sub-agent&rdquo;. What exists is the primary model reading the sub-agent&rsquo;s description and judging, step by step, whether the current task fits.</p>
<p>Three consequences:</p>
<ol>
<li>
<p><strong>The sub-agent description is the only control knob</strong>. If you write &ldquo;use PROACTIVELY for X&rdquo; without caveats, the model tends to delegate more. If you add &ldquo;skip for Y&rdquo;, it tends not to delegate on Y.</p>
</li>
<li>
<p><strong>The primary model is conservative by default</strong>. Across all three, current training favors not delegating when the task needs cross-file context or architectural decisions. A greenfield Rails app is exactly that kind of task.</p>
</li>
<li>
<p><strong>You can&rsquo;t force delegation via config</strong>. You can write an aggressive description, but if the model judges the task doesn&rsquo;t fit, it ignores the sub-agent. No <code>--force-subagent</code> flag exists. The call is the model&rsquo;s, not the operator&rsquo;s.</p>
</li>
</ol>
<p>This matters for what comes next.</p>
<h2>The finding that kills the argument<span class="hx:absolute hx:-mt-20" id="the-finding-that-kills-the-argument"></span>
    <a href="#the-finding-that-kills-the-argument" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>I opened the logs of each run expecting to see delegation happen. Tools like <code>Task</code> (Claude Code) or <code>spawn_agent</code> (Codex) should show up in the ndjson every time the primary model calls the sub-agent.</p>
<p>Across 7 runs, <strong>the delegation tool was called zero times</strong>. No Opus called Sonnet. No Opus called Haiku. No Opus called GLM 5.1 or the local Qwen 3.6. No GPT xHigh called GPT medium or low.</p>
<p>All primary models did the entire job alone, ignoring the sub-agent that was registered and visible to them. The sub-agents were read, parsed, listed, and never invoked. It&rsquo;s like hiring an assistant and letting them sit at their desk all day while you do everything yourself.</p>
<p>Why did this happen? Two layers of explanation.</p>
<h3>The technical layer<span class="hx:absolute hx:-mt-20" id="the-technical-layer"></span>
    <a href="#the-technical-layer" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>The primary models read the sub-agent descriptions and decided the task didn&rsquo;t fit. Descriptions typically said &ldquo;use proactively for concrete code execution&rdquo; with a caveat &ldquo;skip for cross-file architectural decisions&rdquo;. Except an entire Rails app is cross-file architectural decision. Controller depends on service, service depends on initializer, view depends on partial, all depend on how tests mock the LLM. There&rsquo;s no isolated piece you can hand off to a dumber executor without losing context.</p>
<p>I could have written the sub-agent description more imperatively, forcing delegation. But that would be cheating to get a result. The point of the test is to see what the model does freely, not what it does under force. And freely, it didn&rsquo;t delegate.</p>
<h3>The management layer<span class="hx:absolute hx:-mt-20" id="the-management-layer"></span>
    <a href="#the-management-layer" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Delegation has coordination cost. That&rsquo;s basic project management knowledge, not news. When you outsource a task, you have to:</p>
<ul>
<li>Write a clear spec for the other executor</li>
<li>Wait for the result</li>
<li>Review</li>
<li>Request adjustments if there&rsquo;s a gap between what you wanted and what came back</li>
<li>Reintegrate into the rest of the work</li>
</ul>
<p>For seniors outsourcing to juniors, that cost is real. Productivity doesn&rsquo;t scale linearly with the number of executors. Doubling the team doesn&rsquo;t double speed. In many cases, outsourcing costs more time than doing it yourself would have.</p>
<p>With LLMs the same thing happens, compounded by a specific trait: Opus&rsquo;s planning is rarely perfect on the first try. Never is. Opus reads the prompt, builds a plan, starts implementing, hits a problem (library that doesn&rsquo;t have that version, method that doesn&rsquo;t exist as it imagined, test that fails for a reason it didn&rsquo;t anticipate), adjusts the plan, tries again. That &ldquo;plan → try → adjust&rdquo; loop is inherent to the work. Not an Opus failure, it&rsquo;s the nature of software development.</p>
<p>Now imagine you insert a smaller model in the middle of that loop. Opus plans, hands off to Qwen to execute, Qwen writes code that likely hallucinates the API (as we saw in the previous benchmark, Qwen invents <code>RubyLLM::Client.new</code> which doesn&rsquo;t exist), Opus gets the code back, finds it&rsquo;s wrong, has to make a sub-plan to fix it, hands back to Qwen, which invents something else, and on it goes. The communication and correction overhead explodes.</p>
<p>That&rsquo;s why the models themselves, without being forced, decided not to delegate. They know the coordination cost outweighs the benefit, especially for a cohesive task like building a Rails app from scratch.</p>
<h2>Notes per run<span class="hx:absolute hx:-mt-20" id="notes-per-run"></span>
    <a href="#notes-per-run" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Even without delegation happening, the 7 runs produced different results. Let me comment on each.</p>
<h3>Claude Code: Opus 4.7 alone<span class="hx:absolute hx:-mt-20" id="claude-code-opus-47-alone"></span>
    <a href="#claude-code-opus-47-alone" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>11 minutes, $6.74, 24 tests, 1742 files. Result <strong>Tier 3</strong> (broken). Opus in this run hallucinated the <code>chat.complete</code> method on RubyLLM, which doesn&rsquo;t exist, and used <code>chat.add_message(role:, content:)</code> with keyword args instead of a positional hash. Same typical hallucination other models have, now from Opus itself. Weird, because the same Opus 4.7 on opencode delivered correct Tier 1 code on the same prompt.</p>
<h3>Claude Code: Opus 4.7 + Sonnet 4.6<span class="hx:absolute hx:-mt-20" id="claude-code-opus-47--sonnet-46"></span>
    <a href="#claude-code-opus-47--sonnet-46" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>10 minutes, $5.13, 18 tests, 1829 files. Result <strong>Tier 2</strong> (first message works, multi-turn breaks). Better than the Opus-alone baseline, but still has the keyword-args bug on <code>add_message</code>. Zero delegations to Sonnet.</p>
<h3>Claude Code: Opus 4.7 + Haiku 4.5<span class="hx:absolute hx:-mt-20" id="claude-code-opus-47--haiku-45"></span>
    <a href="#claude-code-opus-47--haiku-45" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>15 minutes, $7.83, 34 tests, 1984 files. Result <strong>Tier 3</strong>, same hallucination as Opus alone. The highest test count (34!) all pass because the test fakes mock the hallucinated API Opus itself invented. Tests that prove nothing.</p>
<p>The point worth underlining: Opus 4.7 on Claude Code wrote 34 passing tests, and none of them prove the code works. The hallucinated API is tested against a hallucinated implementation. In the real world, the app crashes on the first message. Test count is a vanity metric when the mock is wrong.</p>
<h3>opencode: Opus 4.7 + GLM 5.1<span class="hx:absolute hx:-mt-20" id="opencode-opus-47--glm-51"></span>
    <a href="#opencode-opus-47--glm-51" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>19 minutes, $1.10, Tier 1 (works on first try). Correct RubyLLM API. Both phases (build + Docker validation) completed clean. Zero calls to GLM 5.1, Opus did everything.</p>
<h3>opencode: Opus 4.7 + local Qwen 3.6<span class="hx:absolute hx:-mt-20" id="opencode-opus-47--local-qwen-36"></span>
    <a href="#opencode-opus-47--local-qwen-36" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>30 minutes, $1.10 (only Opus billed, local Qwen is free), Tier 1. Same quality as above. Zero calls to the Qwen 3.6 running on the 5090.</p>
<h3>Codex: xHigh planning, medium executing<span class="hx:absolute hx:-mt-20" id="codex-xhigh-planning-medium-executing"></span>
    <a href="#codex-xhigh-planning-medium-executing" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>21 minutes, ~$11, Tier 1. The most expensive multi-agent run on the Codex side, but curiously the one that generated the best code, and fixed the <code>add_message</code> bug the previous benchmark caught on GPT 5.4 alone. But zero delegations, all the work came from xHigh.</p>
<h3>Codex: xHigh planning, low executing<span class="hx:absolute hx:-mt-20" id="codex-xhigh-planning-low-executing"></span>
    <a href="#codex-xhigh-planning-low-executing" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>20 minutes, ~$10, Tier 2. Reproduced the keyword-args bug. Cheaper than the above but worse code. Zero delegations to low either.</p>
<h2>Same model, different runs, different results<span class="hx:absolute hx:-mt-20" id="same-model-different-runs-different-results"></span>
    <a href="#same-model-different-runs-different-results" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Here&rsquo;s the real story of this benchmark, which becomes clearer when you group by model instead of by combination.</p>
<p>Since the sub-agents never ran, each &ldquo;multi-model&rdquo; combination effectively became another run of the primary model. That gave me something I didn&rsquo;t have in the previous benchmark: <strong>multiple runs of the same model on the same prompt</strong>. Let me compare.</p>
<h3>GPT 5.4 xHigh: three runs<span class="hx:absolute hx:-mt-20" id="gpt-54-xhigh-three-runs"></span>
    <a href="#gpt-54-xhigh-three-runs" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>In <a href="/en/2026/04/05/testing-llms-open-source-and-commercial-can-anyone-beat-claude-opus/">last week&rsquo;s benchmark</a>, GPT 5.4 via Codex with xHigh ran once, scoring Tier 2 (first message works, multi-turn breaks due to <code>chat.add_message(role:, content:)</code> with keyword args instead of positional hash).</p>
<p>This week I ran two more, with different sub-agent configurations (which weren&rsquo;t used, but their mere presence changes the primary&rsquo;s behavior, as we&rsquo;ll see below):</p>
<table>
  <thead>
      <tr>
          <th>Run</th>
          <th>Tier</th>
          <th style="text-align: right">Tokens</th>
          <th style="text-align: right">Cost</th>
          <th>Correct API?</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>xHigh alone (last week)</td>
          <td>2</td>
          <td style="text-align: right">7.6M</td>
          <td style="text-align: right">~$16</td>
          <td>Bug in <code>add_message</code></td>
      </tr>
      <tr>
          <td>xHigh + medium subagent</td>
          <td>1</td>
          <td style="text-align: right">5.44M</td>
          <td style="text-align: right">~$11</td>
          <td><strong>Fixed the bug</strong></td>
      </tr>
      <tr>
          <td>xHigh + low subagent</td>
          <td>2</td>
          <td style="text-align: right">4.28M</td>
          <td style="text-align: right">~$10</td>
          <td>Bug came back</td>
      </tr>
  </tbody>
</table>
<p>Same model, same prompt, three runs. One of them wrote <code>chat.add_message(message)</code> with positional hash (Tier 1, works in multi-turn). The other two wrote it with keyword args (Tier 2, breaks on the second message).</p>
<p>No sub-agent was called in any of the multi variants. The only thing that changed between the three runs was the text of the available sub-agent description (or its absence). Even so, GPT 5.4 &ldquo;got&rdquo; the API right in one run and got it wrong in the other two.</p>
<h3>Claude Opus 4.7: six runs<span class="hx:absolute hx:-mt-20" id="claude-opus-47-six-runs"></span>
    <a href="#claude-opus-47-six-runs" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>With Opus it was even more instructive. Six different runs, same model, same prompt, results spread from Tier 1 to Tier 3:</p>
<table>
  <thead>
      <tr>
          <th>Run</th>
          <th>Harness</th>
          <th>Tier</th>
          <th style="text-align: right">Time</th>
          <th style="text-align: right">Cost</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Opus 4.7 baseline</td>
          <td>opencode</td>
          <td>1</td>
          <td style="text-align: right">18.2m</td>
          <td style="text-align: right">$1.10</td>
      </tr>
      <tr>
          <td>Opus 4.7 + GLM 5.1</td>
          <td>opencode</td>
          <td>1</td>
          <td style="text-align: right">10.3m</td>
          <td style="text-align: right">$1.10</td>
      </tr>
      <tr>
          <td>Opus 4.7 + Qwen 3.6 local</td>
          <td>opencode</td>
          <td>1</td>
          <td style="text-align: right">19.4m</td>
          <td style="text-align: right">$1.10</td>
      </tr>
      <tr>
          <td>Opus 4.7 alone</td>
          <td>Claude Code</td>
          <td>3</td>
          <td style="text-align: right">11.0m</td>
          <td style="text-align: right">$6.74</td>
      </tr>
      <tr>
          <td>Opus 4.7 + Sonnet</td>
          <td>Claude Code</td>
          <td>2</td>
          <td style="text-align: right">10.1m</td>
          <td style="text-align: right">$5.13</td>
      </tr>
      <tr>
          <td>Opus 4.7 + Haiku</td>
          <td>Claude Code</td>
          <td>3</td>
          <td style="text-align: right">14.7m</td>
          <td style="text-align: right">$7.83</td>
      </tr>
  </tbody>
</table>
<p>The three Opus runs on opencode: consistent Tier 1. Correct RubyLLM API, works in multi-turn, clean code.</p>
<p>The three Opus runs on Claude Code: one Tier 2 and two Tier 3. Code that hallucinated <code>chat.complete</code> (method doesn&rsquo;t exist) or got the <code>add_message</code> signature wrong.</p>
<p>Same model. Same prompt. Different harness = different result.</p>
<h3>What that means<span class="hx:absolute hx:-mt-20" id="what-that-means"></span>
    <a href="#what-that-means" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Two possible readings:</p>
<p><strong>The lazy reading:</strong> &ldquo;Opus 4.7 on Claude Code regressed, switch to 4.6&rdquo; or &ldquo;opencode is better than Claude Code&rdquo;. Both would be wrong conclusions from such a small benchmark.</p>
<p><strong>The honest reading:</strong> a single-run benchmark, or even three runs, isn&rsquo;t enough to assert anything about the absolute quality of a model. Variance is high, the context loaded by the harness (CLAUDE.md, tool schemas, agent registries) shifts the &ldquo;mental model&rdquo; the model activates, and the result can swing between tiers.</p>
<p>In the previous benchmark, I got lucky (or unlucky) with runs consistent enough for hierarchies like &ldquo;Claude/GLM work, Kimi/DeepSeek/Qwen hallucinate the API&rdquo; to hold. But even there, run-to-run variance is real. If I run Kimi K2.5 ten times, maybe two or three of those runs would hit the API right. I didn&rsquo;t test this, but it&rsquo;s plausible.</p>
<p>This benchmark reinforces the point: the rankings in the previous article count as signal, not proof. &ldquo;Works on first try 80% of the time&rdquo; is different from &ldquo;always works&rdquo;. For production use, you want a model robust to variance, one that doesn&rsquo;t have a 20% chance of returning hallucinated code. Today, the only models that meet that bar for me are Claude Opus and Claude Sonnet, on any harness. GLM 5.1 is close but I don&rsquo;t have a large sample yet.</p>
<p>Does that mean Opus 4.7 &ldquo;got worse&rdquo;? No. Does it mean Claude Code &ldquo;is worse than opencode&rdquo;? No. It means a single-run benchmark on a greenfield Rails app doesn&rsquo;t capture model variance. Worth knowing before jumping to strong conclusions.</p>
<h2>Execution time: is multi-model slower?<span class="hx:absolute hx:-mt-20" id="execution-time-is-multi-model-slower"></span>
    <a href="#execution-time-is-multi-model-slower" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Adjacent question worth measuring: if the sub-agent never ran, were the multi-model runs slower than the alone baselines? My initial intuition was yes, since without cross-session parallelism the primary model does the work serially. The data tells another story.</p>
<table>
  <thead>
      <tr>
          <th>Run</th>
          <th style="text-align: right">Time</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Claude Code Opus alone</td>
          <td style="text-align: right">11.0m</td>
      </tr>
      <tr>
          <td>Claude Code Opus + Sonnet</td>
          <td style="text-align: right">10.1m</td>
      </tr>
      <tr>
          <td>Claude Code Opus + Haiku</td>
          <td style="text-align: right">14.7m</td>
      </tr>
      <tr>
          <td>opencode Opus baseline</td>
          <td style="text-align: right">18.2m</td>
      </tr>
      <tr>
          <td>opencode Opus + GLM 5.1</td>
          <td style="text-align: right">10.3m</td>
      </tr>
      <tr>
          <td>opencode Opus + Qwen 3.6</td>
          <td style="text-align: right">19.4m</td>
      </tr>
      <tr>
          <td>Codex xHigh baseline</td>
          <td style="text-align: right">21.9m</td>
      </tr>
      <tr>
          <td>Codex xHigh + medium</td>
          <td style="text-align: right">21.2m</td>
      </tr>
      <tr>
          <td>Codex xHigh + low</td>
          <td style="text-align: right">20.2m</td>
      </tr>
  </tbody>
</table>
<p>Some multi-model runs were <strong>faster</strong> than the alone baseline. opencode + GLM 5.1 was almost half the time of opencode alone (10m vs 18m). Claude Code + Sonnet was 1 minute faster than pure Opus. Codex multi-agent variants ended up slightly faster than xHigh alone.</p>
<p>Others were slower: Claude Code + Haiku took 15m (4m more than baseline). opencode + Qwen 3.6 ran 19m (same as baseline, likely penalized by llama-swap overhead even without invoking the model).</p>
<p>No consistent pattern of &ldquo;multi-model is always slower&rdquo; or &ldquo;always faster&rdquo;. What happened, looking at tool calls and test counts, is more interesting: <strong>the primary model changes its own behavior when it sees a sub-agent is available, even without calling it</strong>.</p>
<ul>
<li>Claude Code Opus alone: 24 tests, 11m</li>
<li>Claude Code Opus + Sonnet: 18 tests, 10m (fewer tests, faster)</li>
<li>Claude Code Opus + Haiku: 34 tests, 15m (more tests, slower)</li>
</ul>
<p>The pattern: when the sub-agent exists in the description as &ldquo;executor&rdquo;, the primary model sometimes produces leaner output, as if &ldquo;leaving work for later&rdquo;. When the sub-agent describes more expensive execution (Haiku as &ldquo;high-volume execution&rdquo;), the model seems to assume it can afford to write more tests because &ldquo;the cheap executor will handle it&rdquo;. The executor is never called in either case. But the sub-agent&rsquo;s presence influences the primary&rsquo;s planning.</p>
<p>Subtle effect, like a delegation placebo. The model doesn&rsquo;t delegate, but behaves as if it would. It can be good (more focused output) or bad (lower test coverage than baseline). Not something you control, it&rsquo;s emergent behavior from the model reading the sub-agent description.</p>
<h3>So is it worth configuring Haiku just for the placebo?<span class="hx:absolute hx:-mt-20" id="so-is-it-worth-configuring-haiku-just-for-the-placebo"></span>
    <a href="#so-is-it-worth-configuring-haiku-just-for-the-placebo" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>You might be tempted: &ldquo;if Opus wrote more code and more tests with Haiku configured, then it&rsquo;s worth configuring Haiku as a sub-agent even if it never runs, just for the placebo&rdquo;. The numbers say no.</p>
<p>Comparing Opus alone vs Opus with Haiku configured, both on Claude Code:</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th style="text-align: right">Opus alone</th>
          <th style="text-align: right">Opus + Haiku</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Time</td>
          <td style="text-align: right">11.0m</td>
          <td style="text-align: right">14.7m</td>
      </tr>
      <tr>
          <td>Cost</td>
          <td style="text-align: right">$6.74</td>
          <td style="text-align: right">$7.83</td>
      </tr>
      <tr>
          <td>Tests</td>
          <td style="text-align: right">24</td>
          <td style="text-align: right">34</td>
      </tr>
      <tr>
          <td>Quality tier</td>
          <td style="text-align: right">3 (broken)</td>
          <td style="text-align: right">3 (broken, same hallucination)</td>
      </tr>
  </tbody>
</table>
<p>With Haiku configured, Opus spent 3.7 more minutes, $1.09 more, and wrote 10 more tests. The quality tier stayed the same. The same <code>chat.complete</code> hallucination appeared in both runs. The 10 extra tests mock the same hallucinated API, so they prove nothing the original 24 weren&rsquo;t already proving. More code, not better code.</p>
<p>Delegation placebo can shift quantity but doesn&rsquo;t fix factual errors. And with a sample of 1 run each, even the quantity increase isn&rsquo;t reliable, because Opus-alone run-to-run variance is also high (probably another Opus-alone run would hit 30+ tests by chance).</p>
<p><strong>Practical takeaway:</strong> don&rsquo;t configure a &ldquo;fake&rdquo; sub-agent just to try to manipulate the primary. The cost in tokens/time is certain, the benefit is speculative. Opus alone, no sub-agent, stays the recommended default configuration. Sub-agents are only worth it when you have a real delegation use case that works (and we saw here that greenfield isn&rsquo;t one).</p>
<h3>&ldquo;Multi-model is slower&rdquo; hypothesis doesn&rsquo;t hold<span class="hx:absolute hx:-mt-20" id="multi-model-is-slower-hypothesis-doesnt-hold"></span>
    <a href="#multi-model-is-slower-hypothesis-doesnt-hold" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Back to the original question: no, you can&rsquo;t say multi-model without delegation is consistently slower. Sometimes it is, sometimes it&rsquo;s faster, depends on the model and the sub-agent description. What you can say is that the sub-agent&rsquo;s presence shifts the primary&rsquo;s behavior in unpredictable ways, and that alone is an argument against multi-model configurations with no clear need.</p>
<h2>Two unexpected findings<span class="hx:absolute hx:-mt-20" id="two-unexpected-findings"></span>
    <a href="#two-unexpected-findings" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Outside the main topic (multi-model didn&rsquo;t work), two patterns showed up.</p>
<h3>First: the harness affects code quality, not just cost<span class="hx:absolute hx:-mt-20" id="first-the-harness-affects-code-quality-not-just-cost"></span>
    <a href="#first-the-harness-affects-code-quality-not-just-cost" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>The same Opus 4.7 produced Tier 1 on opencode and Tier 2/Tier 3 on Claude Code, same prompt. That&rsquo;s new. As far as I know, this is the first benchmark evidence that the harness (the CLI wrapping the model) can degrade factual correctness, not just cost.</p>
<p>The hypothesis is that Claude Code carries 6-11 million cache-read tokens per run (CLAUDE.md, tool schemas, agent registries, etc.), against opencode&rsquo;s ~210 thousand. That volume of context seems to pull Opus toward a generic OpenAI SDK &ldquo;mental model&rdquo;, where <code>chat.complete</code> makes sense, instead of the specific mental model for the RubyLLM gem. Speculation on my part, I can&rsquo;t prove it. But the Tier difference between the two harnesses running the same model is concrete.</p>
<p>This <strong>does not mean</strong> opencode is better than Claude Code for daily use. In my day-to-day, Claude Code with Opus beats opencode on almost everything: editor integration, long-session context management, native tool support, multi-step planning quality. The benchmark has a narrow scope (greenfield Rails app, specific prompt, no human iteration) that doesn&rsquo;t reflect real use.</p>
<p>What the data says: variance between harnesses is real and measurable. Worth keeping in mind when you&rsquo;re evaluating a model.</p>
<h3>Second: cost of Claude Code vs opencode<span class="hx:absolute hx:-mt-20" id="second-cost-of-claude-code-vs-opencode"></span>
    <a href="#second-cost-of-claude-code-vs-opencode" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Running the same Opus 4.7 on the same prompt:</p>
<ul>
<li>Claude Code: $5 to $8 per run</li>
<li>opencode: $1.10 per run</li>
</ul>
<p>Claude Code costs 5 to 7 times more per run on the same model. The difference is cache-read: Claude Code loads 6-11M context tokens per run, opencode loads ~210K. There&rsquo;s a legit technical reason (tool schemas, TodoWrite, agent registries, CLAUDE.md, editor integration), but the overhead is real and shows up directly on the bill of anyone paying per token.</p>
<p>Worth a more refined take here, because this changes the calculus depending on how you consume.</p>
<p><strong>If you&rsquo;re on Pro or Max:</strong> use Claude Code. Period. Subscription covers the tokens, you get the full feature set (native tool support, skills, agents, Plan mode, better long-session context). No reason to switch.</p>
<p><strong>If you pay per token directly through the API, the math changes with volume.</strong></p>
<p>For light use (a few hundred dollars a month): opencode with Opus is cheaper, and in this specific benchmark hit Tier 1 while Claude Code landed in Tier 2/3. Works well for automated pipelines, CI, benchmarks, server-side agents.</p>
<p>For heavy use (thousands of dollars a month on API): staying on per-token doesn&rsquo;t make sense. The Max 20x subscription at $200/month covers heavy volume and includes Claude Code. At this benchmark&rsquo;s $5 to $8 per Claude Code run, $200 buys only 25 to 40 API runs a month, a volume any heavy vibe-coder burns through in days, so Max is cheaper than Opus on API by a wide margin. Then you&rsquo;re back in the first bucket, on Claude Code.</p>
<p><strong>Opencode is better, regardless of cost, for:</strong></p>
<ul>
<li>Headless or automated use (CI, benchmarks, server agents)</li>
<li>Multi-provider setups where you want the same harness hitting OpenRouter, Z.ai, and local llama-swap</li>
<li>When you need structured JSON output (<code>--format json</code>), as in the sketch after this list</li>
<li>Neutral model comparisons</li>
</ul>
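<p>To make the headless niche concrete, here is the rough shape of a scripted run. This is a sketch, not a verified command line: I&rsquo;m assuming the non-interactive entry point is <code>opencode run</code> and that <code>--model</code> takes a provider/model pair, so check <code>opencode --help</code> on your version; only <code>--format json</code> comes from the list above.</p>
<pre><code># hedged sketch: the subcommand and --model syntax are assumptions, verify locally
opencode run --model anthropic/claude-opus-4.7 --format json \
  "Add a /health endpoint with a request spec" &gt; result.json</code></pre>
<p>Structured output is what makes this composable: pipe <code>result.json</code> into <code>jq</code> or whatever your CI uses, no terminal scraping.</p>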
<p><strong>Claude Code is better, regardless of cost, for:</strong></p>
<ul>
<li>Interactive coding sessions with a human in the loop</li>
<li>Projects with CLAUDE.md, skills, custom MCP</li>
<li>Work where Opus&rsquo;s Plan mode matters</li>
<li>Long iterative sessions where accumulated context helps</li>
</ul>
<p>Honest reading: this benchmark measures a narrow scenario (greenfield, one-shot, no human iteration). For real daily work, Claude Code with Max remains the recommendation for 99% of people. Opencode&rsquo;s cost win shows up in a specific niche (automated pipelines or per-token API use below the Max break-even). Most people aren&rsquo;t in that niche.</p>
<h2>The &ldquo;Opus plans, Qwen executes&rdquo; myth<span class="hx:absolute hx:-mt-20" id="the-opus-plans-qwen-executes-myth"></span>
    <a href="#the-opus-plans-qwen-executes-myth" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Every so often someone on Twitter talks about setting up a pipeline where Opus makes the detailed technical plan and a smaller model (Qwen, GLM 5, Haiku, Sonnet) executes. &ldquo;Save tokens, same quality, everybody wins&rdquo;.</p>
<p>Doesn&rsquo;t work. Or rather, works for a demo, doesn&rsquo;t work for a real project.</p>
<p>The most serious problem is that <strong>the plan is never perfect on the first pass</strong>. Code is never one-shot. You implement, find a problem, adjust. With one big model, that adjustment happens in real time by the model itself. With two models, every adjustment has to go back to the planner, be reprocessed, a new plan written, a new executor invoked. The loop is slower.</p>
<p>Then there&rsquo;s the question of <strong>factual API knowledge</strong>. If Opus&rsquo;s plan says &ldquo;use RubyLLM to call OpenRouter&rdquo;, Opus knows it&rsquo;s <code>RubyLLM.chat(model:).ask(msg).content</code>. The smaller Qwen reads the plan and implements with the API it thinks exists, which may be <code>RubyLLM::Client.new.complete</code>. The plan doesn&rsquo;t correct this because the plan doesn&rsquo;t carry the gem&rsquo;s factual knowledge. Only the model that knows that API can implement it correctly.</p>
<p>And then there&rsquo;s <strong>coordination cost</strong>, which explodes with iteration. Every round of &ldquo;plan → execute → fail → re-plan → execute again&rdquo; costs more tokens than just letting the big model do everything in one session. You pay in planning tokens AND in wrong-code tokens that need to be rewritten.</p>
<p>In theory, multi-model makes sense. In practice, it&rsquo;s work for a Twitter thread with pretty animations, not the workflow of someone who ships code.</p>
<h2>When multi-model can make sense<span class="hx:absolute hx:-mt-20" id="when-multi-model-can-make-sense"></span>
    <a href="#when-multi-model-can-make-sense" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>I don&rsquo;t want to sound absolute. There are scenarios where multi-model is the right pick.</p>
<p>The main one is <strong>genuinely parallel and decoupled tasks</strong>. API migration across 30 identical files, for example: each file follows the same pattern, no dependencies between them. Opus could supervise 20 sub-agents doing the same transformation on 20 different files. In that case, the coordination cost is amortized by the parallelism.</p>
<p>Another is <strong>tasks with a heavy research phase followed by a direct implementation phase</strong>. Opus does the architectural spike exploring legacy code, then delegates the mechanical implementation to the smaller model.</p>
<p>A real example I went through last week: <a href="/en/2026/04/09/20-years-of-blogging-ai-finally-translated-everything/">I translated 700+ posts and all the video subtitles of the blog to English</a> using Claude Code. I burned through the Max 20x and blew another $1120 in extra usage on top. Translation is exactly the kind of task that would have benefited from multi-model: each post is independent of the others, no cross-file dependency, zero architectural planning, just batch translation. Opus orchestrating + Sonnet executing each file&rsquo;s translation would have cut the cost in half, easily. Didn&rsquo;t occur to me at the time, I ran everything on Opus. The lesson I take: for genuinely parallel tasks, multi-model with Sonnet as executor makes sense, and I missed a clear chance to save money.</p>
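<p>The batch shape I should have used is trivial. A minimal sketch, not what I actually ran: it assumes Claude Code&rsquo;s print mode (<code>claude -p</code>) with <code>--model</code> picking the cheaper executor, and the <code>posts/</code> and <code>en/</code> layout is hypothetical.</p>
<pre><code># each file is independent, so a dumb shell loop is the whole orchestrator
mkdir -p en
for f in posts/*.md; do
  claude -p --model sonnet \
    "Translate this blog post from Portuguese to English. Keep front matter keys and code blocks intact." \
    &lt; "$f" &gt; "en/$(basename "$f")"
done</code></pre>
<p>No planner, no sub-agent protocol: the parallelism lives in the shell, and Opus never needs to see a single file.</p>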
<p>But neither of the cases above is &ldquo;greenfield Rails app&rdquo;. A new app from scratch is the worst scenario for multi-model because every part depends on every other part. The models aren&rsquo;t dumb, they recognize this and refuse to delegate.</p>
<h2>The rule of thumb stays the same<span class="hx:absolute hx:-mt-20" id="the-rule-of-thumb-stays-the-same"></span>
    <a href="#the-rule-of-thumb-stays-the-same" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>For 90% of day-to-day programming work, my recommendation remains:</p>
<ul>
<li>Claude Code + Opus (4.6 or 4.7)</li>
<li>If cost is critical and you&rsquo;re OK plugging into OpenRouter, GLM 5.1 is a comfortable second place</li>
<li>If you have a good GPU (a 5090 or equivalent), local Qwen 3.6 35B is acceptable for simple tasks, with caveats</li>
</ul>
<p>Multi-model? Only for specific cases where the parallelism is genuine. For normal projects, it&rsquo;s unnecessary overhead.</p>
<h2>Benchmarks aren&rsquo;t absolute truth<span class="hx:absolute hx:-mt-20" id="benchmarks-arent-absolute-truth"></span>
    <a href="#benchmarks-arent-absolute-truth" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>This benchmark measures a specific thing: greenfield Rails app, deterministic prompt, no human iteration, in automated runners. A narrow slice of real use.</p>
<p>If you want to know which combination works for YOUR workflow, YOUR types of projects, YOUR quality expectations, don&rsquo;t trust my benchmark. Run your own. The <a href="https://github.com/akitaonrails/llm-coding-benchmark"target="_blank" rel="noopener">code is all on GitHub</a>, the harness is extensible, you swap the prompt and you have your own comparison.</p>
<p>What I hope this work contributes is methodology, not a definitive answer. &ldquo;Claude is better than Qwen&rdquo; depends on what you&rsquo;re doing. &ldquo;Multi-model doesn&rsquo;t work&rdquo; depends on the type of task. Benchmarks narrow the space for speculation with concrete data, they don&rsquo;t close the discussion.</p>
<p>Meanwhile, if someone tells you they combined Claude + GLM and it was magical, ask for the code, the prompt, the repo. Most of the time they measured something quite different, or have a specific task where that combination fits. Don&rsquo;t generalize from a tweet.</p>
]]></content:encoded><category>llm</category><category>benchmark</category><category>claude</category><category>ai</category><category>vibecoding</category></item><item><title>Omarchy on the Thinkpad T14 Gen 6: Mini-Review and Full Setup</title><link>https://akitaonrails.github.io/en/2026/04/18/omarchy-on-thinkpad-t14-gen-6/</link><guid isPermaLink="true">https://akitaonrails.github.io/en/2026/04/18/omarchy-on-thinkpad-t14-gen-6/</guid><pubDate>Sat, 18 Apr 2026 08:30:00 GMT</pubDate><description>&lt;p&gt;&lt;img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/18/thinkpad/closed-lid.jpg" alt="Thinkpad T14 Gen 6 closed, next to its USB-C charger" loading="lazy" /&gt;&lt;/p&gt;
&lt;p&gt;I bought a Lenovo Thinkpad T14 Gen 6 and installed Omarchy on it. It&amp;rsquo;s not my main machine, and it&amp;rsquo;s not supposed to be. It&amp;rsquo;s a companion: a notebook I can open on the 3D printing office desk, SSH into the desktop, fire up Claude Code, access files on the NAS, debug the network through ethernet without hunting for a USB dongle and a long cable. This article covers the hardware choice, the Omarchy setup on top, the customizations specific to a laptop and to this Thinkpad in particular, and the architecture decisions that might not be obvious.&lt;/p&gt;</description><content:encoded><![CDATA[<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/18/thinkpad/closed-lid.jpg" alt="Thinkpad T14 Gen 6 closed, next to its USB-C charger"  loading="lazy" /></p>
<p>I bought a Lenovo Thinkpad T14 Gen 6 and installed Omarchy on it. It&rsquo;s not my main machine, and it&rsquo;s not supposed to be. It&rsquo;s a companion: a notebook I can open on the 3D printing office desk, SSH into the desktop, fire up Claude Code, access files on the NAS, debug the network through ethernet without hunting for a USB dongle and a long cable. This article covers the hardware choice, the Omarchy setup on top, the customizations specific to a laptop and to this Thinkpad in particular, and the architecture decisions that might not be obvious.</p>
<h2>Mini-review of the Thinkpad T14 Gen 6<span class="hx:absolute hx:-mt-20" id="mini-review-of-the-thinkpad-t14-gen-6"></span>
    <a href="#mini-review-of-the-thinkpad-t14-gen-6" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/18/thinkpad/open-lid-turned-off.jpg" alt="Thinkpad T14 Gen 6 open with the screen off, showing the keyboard and the Intel Core Ultra 5 sticker"  loading="lazy" /></p>
<p>Let&rsquo;s get this out of the way: this isn&rsquo;t the notebook of my dreams. If I were picking by looks, I&rsquo;d grab an Asus Zenbook S 14 with an OLED screen. If I were picking by portability, the Thinkpad T14s with the aluminum shell, which is lighter and much better looking. The regular T14 has a 14&quot; 1920x1200 IPS panel, 400 nits, 60Hz, no HDR. It gets the job done, but it&rsquo;s a far cry from the Zenbook&rsquo;s screen. The shell is plastic with a rubberized finish, which makes it scratch-resistant but not premium. Reviews from <a href="https://www.notebookcheck.net/AMD-Ryzen-AI-meets-classic-ThinkPad-Lenovo-ThinkPad-T14-Gen-6-AMD-laptop-review.1222690.0.html"target="_blank" rel="noopener">NotebookCheck</a> and <a href="https://www.xda-developers.com/lenovo-thinkpad-t14-gen-6-review/"target="_blank" rel="noopener">XDA Developers</a> land on the same verdict: good price, good connectivity, fine performance for office work, but the screen is dated.</p>
<p>What makes up for it:</p>
<ul>
<li>Port selection. Full-size HDMI. Gigabit ethernet. USB-A, USB-C Thunderbolt 4, charges over USB-C (not a proprietary charger). For a debug companion, this is exactly what I wanted. The Zenbook S 14, being thinner, cuts ports.</li>
<li>Fingerprint sensor that actually works on Linux (Goodix MOC, works with libfprint on kernel 6.11+). I use this, details later.</li>
<li>Thinkpad keyboard. 1.5mm key travel, trackpoint, classic layout. It&rsquo;s not the best keyboard in the world in 2026, but it&rsquo;s reliable and durable.</li>
<li>Rugged shell. It will fall, it will scratch, it will travel in a backpack. If I put a Zenbook OLED or a Macbook Pro on the 3D printing office desk next to the printer with PLA dust in the air, I&rsquo;d be nervous. The Thinkpad can take a beating.</li>
</ul>
<p>If I wanted a gaming machine, I&rsquo;d get an Asus Zephyrus G14, my favorite gaming notebook. If I wanted a creative work machine, I&rsquo;d get the <a href="https://www.asus.com/laptops/for-home/zenbook/zenbook-duo-ux8406/"target="_blank" rel="noopener">Asus Zenbook Duo (UX8406)</a> with two vertical OLED screens, great for video editing and 3D modeling. I can afford any Macbook Pro or Mac Studio, and I chose not to go that way. I&rsquo;ll explain in the next section.</p>
<h2>My use case<span class="hx:absolute hx:-mt-20" id="my-use-case"></span>
    <a href="#my-use-case" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/18/thinkpad/open-lid-fastfetch.jpg" alt="Thinkpad open on a 3D printing office bench, with fastfetch showing Omarchy in the terminal"  loading="lazy" /></p>
<p>My main PC is a desktop with Ryzen 9 7950X3D, 96 GB of RAM, RTX 5090 with 32 GB. That&rsquo;s where I work, experiment with local models, run containers, edit the blog. For gaming, I have a separate mini-PC with an RTX 4090. Those two cover 99% of what I need to do at home.</p>
<p>The notebook exists to cover the remaining 1%: sitting on the couch, taking it to the 3D printing office, taking it to the kitchen, short trips. It&rsquo;s not meant to replace the desktop. It&rsquo;s meant to give me remote access to the desktop when I&rsquo;m away from it.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/18/thinkpad/ssh-to-remote-desktop-on-left.png" alt="Hyprland with two terminals side by side: left is an SSH session on the main desktop (96 GB, RTX 5090), right is the Thinkpad itself. Same UI, two machines."  loading="lazy" /></p>
<p>In practice it looks like this: open the notebook, split the Hyprland layout, left pane SSHs into the main desktop, right pane is the notebook itself. Same Omarchy on both, same keybindings, same bash. The notebook becomes an extension of the desktop, not a parallel environment I have to re-learn in my head every time I switch.</p>
<h3>SSH from outside the house: Tailscale<span class="hx:absolute hx:-mt-20" id="ssh-from-outside-the-house-tailscale"></span>
    <a href="#ssh-from-outside-the-house-tailscale" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Inside the local network, SSH is trivial, the notebook talks to the desktop through the internal IP. Outside the house is another story. My home IP is dynamic, opening port 22 to the internet is a terrible idea, and even with DDNS and port forwarding you&rsquo;re putting SSH on the public internet for any scanner to find.</p>
<p>The solution I use is <a href="https://tailscale.com/" target="_blank" rel="noopener">Tailscale</a>. For those who don&rsquo;t know: Tailscale is a mesh VPN built on WireGuard that creates a private network between your devices (the &ldquo;tailnet&rdquo;). Each machine runs the agent, authenticates once, and gets a fixed IP on the private network (something like 100.x.y.z). Traffic between your own devices goes peer-to-peer, encrypted by WireGuard; Tailscale&rsquo;s servers only coordinate NAT traversal, falling back to their DERP relays when a direct path can&rsquo;t be established. Result: from my notebook in a café anywhere in the world, I run <code>ssh hal9000</code> and land on my home desktop as if I were on the same network.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/18/thinkpad/tailscale-machines.png" alt="Tailscale admin panel showing my two tailnet machines, hal9000 (desktop) and hal9666 (thinkpad), both with SSH enabled"  loading="lazy" /></p>
<p>More sophisticated options exist: <a href="https://developers.cloudflare.com/cloudflare-one/connections/connect-networks/"target="_blank" rel="noopener">Cloudflare Tunnel</a> with Zero Trust to expose services publicly with SSO auth, self-hosted headscale, raw WireGuard with manual config, Nebula, OpenVPN. Each has its use case. If you need to expose services to third parties, control granular per-identity access, or run the whole infrastructure at home without depending on anyone, those options win. In my case, it&rsquo;s just notebook talking to desktop, for short periods (it&rsquo;s not a full week of work, it&rsquo;s a weekend debug session), so free Tailscale covers it. The free tier accepts up to 100 devices and 3 users, way more than I need.</p>
<p>Setup is as simple as it gets:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># On Arch/Omarchy</span>
</span></span><span class="line"><span class="cl">sudo pacman -S tailscale
</span></span><span class="line"><span class="cl">sudo systemctl <span class="nb">enable</span> --now tailscaled.service
</span></span><span class="line"><span class="cl">sudo tailscale up --ssh</span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>The <code>--ssh</code> flag enables <a href="https://tailscale.com/kb/1193/tailscale-ssh/"target="_blank" rel="noopener">Tailscale SSH</a>, which authenticates via tailnet identity instead of a local SSH key. Once you log in through the browser to tailscale, each registered machine can SSH into the others based on ACL policy defined in the admin panel. Zero key management.</p>
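<p>The ACL side lives in the tailnet policy file. The stock SSH rule Tailscale generates looks roughly like this (shape taken from their defaults, not my exact policy):</p>
<pre><code>{
  "ssh": [
    {
      "action": "accept",
      "src":    ["autogroup:member"],
      "dst":    ["autogroup:self"],
      "users":  ["autogroup:nonroot"]
    }
  ]
}</code></pre>
<p>Read: any tailnet member can SSH into devices they own, as any non-root user. With a single-user tailnet like mine that&rsquo;s already the right scope; tighten <code>src</code>/<code>dst</code> the day you add people.</p>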
<p>I repeat the same on the desktop, log in with the same account, done. Both machines show up in the panel (hal9000 and hal9666 in the screenshot above, both with SSH enabled). From the notebook: <code>ssh hal9000</code>. From the desktop to the notebook: <code>ssh hal9666</code>. No port forwarding, no public IP, no port 22 exposed to the internet. If the notebook is stolen, I remove it from the tailnet with one click.</p>
<p>A practical detail: since the tailnet gives stable names, I added entries to <code>~/.ssh/config</code> to use these short names:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code>Host hal9000
  HostName hal9000
  User akitaonrails</code></pre></div>
</div>
<p>Now <code>ssh hal9000</code> works from anywhere that has Tailscale connected. It&rsquo;s the closest thing to &ldquo;it just works&rdquo; I&rsquo;ve seen for remote SSH.</p>
<h3>Why not a Mac<span class="hx:absolute hx:-mt-20" id="why-not-a-mac"></span>
    <a href="#why-not-a-mac" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>I have no use for macOS. As a dev, I live better in native Linux. Every tool I need has a first-class Linux version, and on macOS I&rsquo;d get a second-class version via Homebrew. I don&rsquo;t do iOS, so I don&rsquo;t need XCode. For mobile I use Flutter or Hotwire Native, which run on any OS. iTerm2 and Ghostty on Mac are fine, but Alacritty, Kitty and Ghostty itself on Linux work just as well. Every good piece of software lands on Linux first and gets ported afterwards. Arch with AUR covers everything in a single <code>yay</code>.</p>
<p>For creative work, I haven&rsquo;t done it professionally in years. DaVinci Resolve Studio on Linux is better than Final Cut Pro. Krita or Affinity Photo replace Photoshop for most cases. Clip Studio Paint on Android is better than Procreate. I simply don&rsquo;t have a workflow that depends on Apple, and the App Store annoys me.</p>
<p>For gaming, Mac is terrible. Apple Silicon has a decent GPU for some things, but the native macOS game library is pathetic compared to Windows or Linux. Game Porting Toolkit exists, CrossOver exists, but for anyone who games seriously, it doesn&rsquo;t cut it. I won&rsquo;t be gaming on the Thinkpad, I have the main desktop and the mini-PC with the RTX 4090 for that. But if I ever feel like running Hollow Knight or some indie on a train, I just install Steam and let Proton handle it. Linux became a real gaming platform in the last few years, with Proton/DXVK running almost the entire Steam catalog (check <a href="https://www.protondb.com/"target="_blank" rel="noopener">ProtonDB</a>). I recently wrote about <a href="/en/2026/04/11/emulation-distrobox-with-claude-code/">running my emulation library in distrobox</a> without polluting the host system. With Mac, those options don&rsquo;t exist.</p>
<p>Another argument that always shows up: &ldquo;but a Mac Mini M4 or a Mac Studio with 128GB of unified memory runs large models locally, it&rsquo;s a ChatGPT replacement.&rdquo; I already tested that thesis and wrote a <a href="/en/2026/04/05/testing-llms-open-source-and-commercial-can-anyone-beat-claude-opus/">detailed benchmark</a>. The conclusion: expensive local hardware to run LLMs is a weekend hobby, not a production tool. The open source models that fit don&rsquo;t deliver the quality Claude Opus delivers. I have a <a href="/en/2026/03/31/migrating-my-home-server-with-claude-code/">home server with AMD Strix Halo and 96 GB of unified RAM</a>, I ran models for dozens of hours, and in my real coding flow they&rsquo;re fine for simple tasks. For complex tasks, Claude Opus. Before spending $4000 on a Mac Studio justifying it with local models, actually test it first. You&rsquo;ll probably end up paying for Opus again.</p>
<h2>Why Omarchy<span class="hx:absolute hx:-mt-20" id="why-omarchy"></span>
    <a href="#why-omarchy" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>I&rsquo;ve been using Omarchy on the desktop for months and documented the path in a <a href="/en/tags/omarchy/">series of articles</a>. <a href="/en/2025/08/29/new-omarchy-2-0-install/">Omarchy 2.0</a> has its own installer with LUKS, Btrfs, Limine, snapper and SDDM already configured. I wrote about <a href="/en/2025/09/12/omarchy-2-0-install-with-the-omarchy-iso/">using the official ISO</a>, about <a href="/en/2025/09/07/omarchy-2-0-zsh-configs/">ZSH customizations</a> with atuin, starship, secrets properly organized, about <a href="/en/2025/09/07/omarchy-2-0-mise-to-organize-dev-environments/">Mise for multiple languages</a>, about <a href="/en/2025/09/07/omarchy-2-0-lazyvim-lazyextras/">LazyVim and LazyExtras</a>, about <a href="/en/2025/09/09/omarchy-2-0-understanding-ssh-and-yubikeys/">SSH and Yubikeys</a>, about <a href="/en/2025/09/09/omarchy-2-0-tuis/">modern TUIs</a>. There&rsquo;s also <a href="/en/2026/01/21/omarchy-3-dual-gpu-setup-with-amd-and-nvidia/">Omarchy 3 with dual GPU AMD + NVIDIA</a> and <a href="/en/2026/01/09/omarchy-3-one-of-the-best-coding-agents-crush/">Crush</a> as a coding agent.</p>
<p>For those who don&rsquo;t know: Omarchy is plain Arch Linux with a cosmetic layer on top of Hyprland/Wayland. Pre-configured with sane defaults. I could build all of it from scratch, I&rsquo;ve done it several times in my life, but why redo work someone already did well? I install Omarchy, I stack the tweaks that are mine on top, and I have customized Arch in a fraction of the time.</p>
<p>A point I raise every time I recommend Omarchy: the documentation is excellent. There&rsquo;s an <a href="https://manuals.omamix.org/2/the-omarchy-manual"target="_blank" rel="noopener">official manual</a> covering everything from installation to theme customization, keybindings, Hyprland, Waybar, the works. If you&rsquo;re coming from Ubuntu or Fedora and are wary of Arch, this manual handles most of the doubts. If you&rsquo;ve never touched Hyprland, open it and read around. It&rsquo;s not a README dumped on GitHub, it&rsquo;s a real manual with chapters and an index.</p>
<p>The story of this article starts in a previous thread. I recently migrated my home server, swapping an old Ubuntu box for a <a href="/en/2026/03/31/migrating-my-home-server-with-claude-code/">Minisforum MS-S1 running openSUSE MicroOS configured with Claude Code</a>. That was the first serious experiment of letting Claude Code drive an entire infra migration, with containers, NFS, services, networking. It worked. And it left an idea in the air: why not use the same approach to configure a new notebook?</p>
<p>That&rsquo;s what I did. I grabbed the Thinkpad, downloaded the latest Omarchy ISO, burned it to a flash drive, installed it over Windows. From zero to working desktop in under an hour. From there on, it&rsquo;s all tweaking, which I documented as I went.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/18/thinkpad/open-lid-claude-terminal.jpg" alt="Claude Code running on the Thinkpad terminal, ready to start configuring the notebook"  loading="lazy" /></p>
<p>I&rsquo;ll detail the two layers of customization: what any notebook needs, and what&rsquo;s specific to this Thinkpad.</p>
<h2>Laptop-specific configs<span class="hx:absolute hx:-mt-20" id="laptop-specific-configs"></span>
    <a href="#laptop-specific-configs" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>A notebook has problems a desktop doesn&rsquo;t: battery, suspend, lid, brightness, trackpad. Those are the parts Omarchy default doesn&rsquo;t cover the way I want.</p>
<h3>Power management with TLP<span class="hx:absolute hx:-mt-20" id="power-management-with-tlp"></span>
    <a href="#power-management-with-tlp" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Omarchy defaults to <code>power-profiles-daemon</code>. I swapped it for TLP, which gives granular control over CPU scaling, battery thresholds and dynamic profiles based on AC vs battery.</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">sudo pacman -S tlp tlp-rdw
</span></span><span class="line"><span class="cl">sudo systemctl mask power-profiles-daemon.service
</span></span><span class="line"><span class="cl">sudo systemctl <span class="nb">enable</span> --now tlp.service</span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>The <code>mask</code> is necessary because upower pulls <code>power-profiles-daemon</code> back in if you just <code>disable</code> it.</p>
<p>Charge thresholds: 60% to start, 85% to stop. The notebook spends most of its time plugged in on the office desk, and keeping the battery at 100% 24/7 wrecks capacity over time. With 60/85, the battery spends most of its time in the healthy lithium range and still has decent usable capacity.</p>
<p>Profiles: <code>balanced</code> with <code>balance_power</code> on battery, <code>performance</code> on AC. TLP&rsquo;s <code>low-power</code> was too aggressive on the Core Ultra 5 235U: window and terminal response got noticeably sluggish. Balanced gives the best consumption/responsiveness ratio for normal use.</p>
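<p>For reference, here&rsquo;s how those knobs look in a TLP drop-in, say <code>/etc/tlp.d/10-thinkpad.conf</code> (values from the paragraphs above; the file name is arbitrary):</p>
<pre><code># battery longevity: start charging below 60%, stop at 85%
START_CHARGE_THRESH_BAT0=60
STOP_CHARGE_THRESH_BAT0=85

# platform profile and energy/perf policy per power source
PLATFORM_PROFILE_ON_AC=performance
PLATFORM_PROFILE_ON_BAT=balanced
CPU_ENERGY_PERF_POLICY_ON_AC=performance
CPU_ENERGY_PERF_POLICY_ON_BAT=balance_power</code></pre>
<p>Apply with <code>sudo tlp start</code> and confirm with <code>sudo tlp-stat -b -p</code>.</p>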
<p>There&rsquo;s a gotcha: <code>tlp auto</code> after <code>tlp ac</code> doesn&rsquo;t always re-apply the battery profile. I wrote a small script bound to <code>Super+Ctrl+P</code> that reads the current state and calls <code>tlp bat</code> or <code>tlp ac</code> explicitly:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="cp">#!/bin/bash
</span></span></span><span class="line"><span class="cl"><span class="nb">set</span> -euo pipefail
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nv">profile</span><span class="o">=</span><span class="k">$(</span>cat /sys/firmware/acpi/platform_profile 2&gt;/dev/null <span class="o">||</span> <span class="nb">echo</span> unknown<span class="k">)</span>
</span></span><span class="line"><span class="cl"><span class="nv">on_ac</span><span class="o">=</span><span class="k">$(</span>cat /sys/class/power_supply/AC*/online 2&gt;/dev/null <span class="p">|</span> head -1 <span class="o">||</span> <span class="nb">echo</span> 0<span class="k">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">if</span> <span class="o">[[</span> <span class="s2">&#34;</span><span class="nv">$profile</span><span class="s2">&#34;</span> <span class="o">==</span> <span class="s2">&#34;performance&#34;</span> <span class="o">]]</span><span class="p">;</span> <span class="k">then</span>
</span></span><span class="line"><span class="cl">  <span class="k">if</span> <span class="o">[[</span> <span class="s2">&#34;</span><span class="nv">$on_ac</span><span class="s2">&#34;</span> <span class="o">==</span> <span class="s2">&#34;1&#34;</span> <span class="o">]]</span><span class="p">;</span> <span class="k">then</span>
</span></span><span class="line"><span class="cl">    sudo /usr/bin/tlp auto &gt;/dev/null
</span></span><span class="line"><span class="cl">    <span class="nv">label</span><span class="o">=</span><span class="s2">&#34;Plugged in — back to AC auto&#34;</span>
</span></span><span class="line"><span class="cl">  <span class="k">else</span>
</span></span><span class="line"><span class="cl">    sudo /usr/bin/tlp bat &gt;/dev/null
</span></span><span class="line"><span class="cl">    <span class="nv">label</span><span class="o">=</span><span class="s2">&#34;On battery — normal profile&#34;</span>
</span></span><span class="line"><span class="cl">  <span class="k">fi</span>
</span></span><span class="line"><span class="cl">  notify-send -t <span class="m">2000</span> <span class="s2">&#34;Performance mode: off&#34;</span> <span class="s2">&#34;</span><span class="nv">$label</span><span class="s2">&#34;</span> 2&gt;/dev/null <span class="o">||</span> <span class="nb">true</span>
</span></span><span class="line"><span class="cl"><span class="k">else</span>
</span></span><span class="line"><span class="cl">  sudo /usr/bin/tlp ac &gt;/dev/null
</span></span><span class="line"><span class="cl">  notify-send -t <span class="m">2000</span> <span class="s2">&#34;Performance mode: on&#34;</span> <span class="s2">&#34;Forcing AC profile&#34;</span> 2&gt;/dev/null <span class="o">||</span> <span class="nb">true</span>
</span></span><span class="line"><span class="cl"><span class="k">fi</span></span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>In waybar, a custom module <code>custom/perf</code> shows the current state (󰓅 PERF, 󰌪 ECO, or empty for balanced) and accepts a click to toggle. The output script is tiny:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="cp">#!/bin/bash
</span></span></span><span class="line"><span class="cl"><span class="nv">p</span><span class="o">=</span><span class="k">$(</span>cat /sys/firmware/acpi/platform_profile 2&gt;/dev/null<span class="k">)</span>
</span></span><span class="line"><span class="cl"><span class="k">case</span> <span class="s2">&#34;</span><span class="nv">$p</span><span class="s2">&#34;</span> in
</span></span><span class="line"><span class="cl">  performance<span class="o">)</span> <span class="nb">printf</span> <span class="s1">&#39;󰓅 PERF&#39;</span> <span class="p">;;</span>
</span></span><span class="line"><span class="cl">  low-power<span class="o">)</span>   <span class="nb">printf</span> <span class="s1">&#39;󰌪 ECO&#39;</span>  <span class="p">;;</span>
</span></span><span class="line"><span class="cl">  balanced<span class="p">|</span><span class="s2">&#34;&#34;</span><span class="o">)</span> <span class="nb">printf</span> <span class="s1">&#39;&#39;</span>       <span class="p">;;</span>
</span></span><span class="line"><span class="cl">  *<span class="o">)</span>           <span class="nb">printf</span> <span class="s1">&#39;%s&#39;</span> <span class="s2">&#34;</span><span class="nv">$p</span><span class="s2">&#34;</span> <span class="p">;;</span>
</span></span><span class="line"><span class="cl"><span class="k">esac</span></span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
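<p>On the waybar side, the module is just an <code>exec</code> poller plus an <code>on-click</code>. A sketch of the JSON; the script paths are whatever you saved the two scripts above as:</p>
<pre><code>"custom/perf": {
  "exec": "~/.config/waybar/scripts/perf-status.sh",
  "on-click": "~/.config/waybar/scripts/perf-toggle.sh",
  "interval": 5,
  "tooltip": false
}</code></pre>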
<h3>Suspend, hibernate, and the lid<span class="hx:absolute hx:-mt-20" id="suspend-hibernate-and-the-lid"></span>
    <a href="#suspend-hibernate-and-the-lid" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>This is what changed compared to the notebooks I used in 2015. The T14 Gen 6 closes the lid, suspends, and wakes up when you open it. No bug, no weird delay, no needing to log in again mid graphical session. Hyprlock kicks in after the suspend, accepts fingerprint or password, and brings me back to the desktop in seconds. This is the behavior Apple has had for years and that on Linux used to be an adventure. In 2026, on modern hardware with kernel 6.11+, it just works.</p>
<p>Mem sleep mode is s2idle (no deep sleep). Hibernation is enabled via a ~30 GB Btrfs swapfile and <code>resume=/dev/mapper/root resume_offset=...</code> on the Limine kernel cmdline. I rarely use it, but it&rsquo;s there.</p>
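<p>If you&rsquo;ve never set up a Btrfs swapfile: recent btrfs-progs (6.1+) automate the fiddly parts (NoCOW, no compression) and print the resume offset for you. A sketch, with paths illustrative:</p>
<pre><code># a dedicated subvolume keeps the swapfile out of snapper snapshots
sudo btrfs subvolume create /swap
sudo btrfs filesystem mkswapfile --size 30g /swap/swapfile
sudo swapon /swap/swapfile

# prints the value to paste into resume_offset= on the kernel cmdline
sudo btrfs inspect-internal map-swapfile -r /swap/swapfile</code></pre>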
<p>Hypridle has aggressive timeouts for a laptop:</p>
<ul>
<li>2.5 min → screensaver</li>
<li>5 min → lock</li>
<li>5.5 min → DPMS off + keyboard backlight off</li>
</ul>
<p>After lock, another 5 min and the screen turns off completely. On unlock, it restores screen and keyboard brightness to the previous level. These timings are much shorter than on the desktop (20-40 min there). The battery difference over a day is measurable.</p>
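<p>In <code>hypridle.conf</code>, those timeouts are three listeners. A sketch; the screensaver command is a placeholder name, swap in whatever helper your Omarchy version ships:</p>
<pre><code>listener {
    timeout = 150   # 2.5 min: screensaver
    on-timeout = omarchy-launch-screensaver   # placeholder command
}
listener {
    timeout = 300   # 5 min: lock
    on-timeout = loginctl lock-session
}
listener {
    timeout = 330   # 5.5 min: screen + keyboard backlight off
    on-timeout = hyprctl dispatch dpms off &amp;&amp; brightnessctl -sd '*::kbd_backlight' set 0
    on-resume = hyprctl dispatch dpms on &amp;&amp; brightnessctl -rd '*::kbd_backlight'
}</code></pre>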
<h3>Brightness and keyboard backlight<span class="hx:absolute hx:-mt-20" id="brightness-and-keyboard-backlight"></span>
    <a href="#brightness-and-keyboard-backlight" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p><code>brightnessctl</code> controls both. Fn+Space toggles the keyboard backlight across three levels. Hypridle saves the current level before turning off and restores it on resume. Command I use in the script:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">brightnessctl -sd <span class="s1">&#39;*::kbd_backlight&#39;</span> <span class="nb">set</span> <span class="m">0</span>    <span class="c1"># save and turn off</span>
</span></span><span class="line"><span class="cl">brightnessctl -rd <span class="s1">&#39;*::kbd_backlight&#39;</span>          <span class="c1"># restore</span></span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<h3>Touchpad<span class="hx:absolute hx:-mt-20" id="touchpad"></span>
    <a href="#touchpad" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>In <code>hypr/input.conf</code>:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code>touchpad {
    natural_scroll = true
    clickfinger_behavior = true
    disable_while_typing = true
    scroll_factor = 0.4
}
gesture = 3, horizontal, workspace</code></pre></div>
</div>
<p><code>clickfinger_behavior</code> makes a two-finger click act as a right-click (more comfortable than hitting the lower-right zone). <code>disable_while_typing</code> is basic palm rejection. Three-finger horizontal swipes switch workspaces, the most useful gesture in Hyprland.</p>
<h2>Thinkpad-specific configs<span class="hx:absolute hx:-mt-20" id="thinkpad-specific-configs"></span>
    <a href="#thinkpad-specific-configs" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>This is what&rsquo;s specific to this model. Some things modern Linux handles on its own, others need explicit configuration.</p>
<h3>Fingerprint sensor<span class="hx:absolute hx:-mt-20" id="fingerprint-sensor"></span>
    <a href="#fingerprint-sensor" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Goodix MOC <code>27c6:6594</code>, works with libfprint 1.94.9+ on kernel 6.11+. Package is <code>fprintd</code>.</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">sudo pacman -S fprintd
</span></span><span class="line"><span class="cl">fprintd-enroll                        <span class="c1"># right index finger by default</span>
</span></span><span class="line"><span class="cl">fprintd-enroll -f left-index-finger   <span class="c1"># other finger</span></span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>To make <code>sudo</code> accept fingerprint, I added <code>auth sufficient pam_fprintd.so</code> above the <code>pam_unix.so</code> line in <code>/etc/pam.d/sudo</code>. With <code>sufficient</code>, if the fingerprint passes, it authenticates directly. If it fails or I hit ESC, it falls back to the password prompt. This is genuinely worth it: dozens of times a day, <code>sudo pacman -Syu</code> or <code>sudo systemctl restart something</code>, and I just touch the sensor.</p>
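<p>Depending on the distro, <code>/etc/pam.d/sudo</code> may call <code>pam_unix.so</code> directly or just include <code>system-auth</code>; either way, the fprintd line only has to come before the password module. On Arch the result looks roughly like this:</p>
<pre><code>#%PAM-1.0
auth      sufficient   pam_fprintd.so
auth      include      system-auth
account   include      system-auth
session   include      system-auth</code></pre>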
<p>On hyprlock, I use hyprlock&rsquo;s native configuration, not PAM:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code>auth {
    fingerprint {
        enabled = true
        ready_message = Scan fingerprint or type password
        present_message = Scanning...
        retry_delay = 250
    }
}</code></pre></div>
</div>
<p>The PAM path gives a double prompt. Configured natively, hyprlock accepts fingerprint or password, whichever succeeds first unlocks.</p>
<p>On SDDM (initial login), I kept password only. The reason: the login password unlocks the GNOME keyring, and the fingerprint can&rsquo;t provide plaintext for that. Once the keyring is unlocked, hyprlock can use fingerprint without issues.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/18/thinkpad/sudo-asks-for-fingerprinting.png" alt="sudo prompt asking for fingerprint read"  loading="lazy" /></p>
<p>In practice, sudo looks like that. Touch the sensor, authenticate, move on.</p>
<h3>Brazilian Thinkpad keyboard<span class="hx:absolute hx:-mt-20" id="brazilian-thinkpad-keyboard"></span>
    <a href="#brazilian-thinkpad-keyboard" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>The Brazilian Thinkpad keyboard has an annoying quirk. The <code>/?</code> key sits where the right Ctrl would be (keycode 97), not in the traditional ABNT2 AB11 position (keycode 89). If you use the standard <code>br(abnt2)</code> layout, that key is inaccessible. It literally prints nothing.</p>
<p>The solution is the <code>br(thinkpad)</code> variant that exists in <code>/usr/share/X11/xkb/symbols/br</code>:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code>xkb_symbols &#34;thinkpad&#34; {
    include &#34;br(abnt2)&#34;
    name[Group1]=&#34;Portuguese (Brazil, IBM/Lenovo ThinkPad)&#34;;
    key &lt;RCTL&gt; { [ slash, question, degree, questiondown ] };
};</code></pre></div>
</div>
<p>In <code>hypr/input.conf</code>:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code>kb_layout = br
kb_variant = thinkpad
kb_model = thinkpad60</code></pre></div>
</div>
<p>And system-wide for TTY/X11:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">sudo localectl set-keymap br-abnt2
</span></span><span class="line"><span class="cl">sudo localectl set-x11-keymap br thinkpad thinkpad60</span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>I wrote a small Python script that reads raw scancodes from <code>/dev/input/event*</code> to diagnose these quirks. Useful when a key decides not to work and you need to find out whether it&rsquo;s hardware, kernel, or xkb.</p>
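<p>If you don&rsquo;t feel like writing Python, <code>evtest</code> (or <code>libinput debug-events</code>) gives the same raw view of what the kernel emits, which is usually enough to tell hardware from xkb:</p>
<pre><code>sudo pacman -S evtest
sudo evtest        # lists /dev/input/event* devices, pick the keyboard
# press the /? key and compare the reported KEY_* code against the xkb mapping</code></pre>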
<h3>Input method: fcitx5<span class="hx:absolute hx:-mt-20" id="input-method-fcitx5"></span>
    <a href="#input-method-fcitx5" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Alongside the layout I install fcitx5. Input method, for those who don&rsquo;t know, is the layer that turns a key sequence into characters. It handles deadkeys (tilde for nasalization, acute accent, circumflex), composing characters that aren&rsquo;t on the keyboard (ç, Ç, uppercase accented letters), emoji support. In Qt or GTK apps, the input method also drives context menus for cedilla and accents.</p>
<p>Packages:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">sudo pacman -S --needed fcitx5 fcitx5-configtool fcitx5-gtk fcitx5-qt</span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>And the environment variables for toolkits to find fcitx5. I created <code>~/.config/environment.d/fcitx.conf</code>:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code>INPUT_METHOD=fcitx
QT_IM_MODULE=fcitx
XMODIFIERS=@im=fcitx
SDL_IM_MODULE=fcitx</code></pre></div>
</div>
<p>systemd&rsquo;s <code>environment.d</code> is loaded before graphical sessions, so Brave, Alacritty, VS Code and any GTK/Qt app pick it up automatically. I enable autostart in Hyprland:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code>exec-once = fcitx5 -d</code></pre></div>
</div>
<p>Typing <code>~a</code> produces <code>ã</code>, <code>'e</code> produces <code>é</code>, <code>ç</code> works as it should in every app. On a Brazilian Thinkpad keyboard, this is the difference between typing Portuguese naturally or hunting for each character.</p>
<h3>SOF audio<span class="hx:absolute hx:-mt-20" id="sof-audio"></span>
    <a href="#sof-audio" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Realtek ALC3306/ALC287 codec via Sound Open Firmware. Without <code>sof-firmware</code>, the kernel module loads but the DSP never boots and PipeWire silently falls back to <code>auto_null</code>. Result: you think the speaker is on mute, but actually PipeWire has no device at all.</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">sudo pacman -S --needed sof-firmware alsa-ucm-conf pipewire pipewire-pulse wireplumber</span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
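<p>To confirm the DSP actually booted instead of leaving you on the <code>auto_null</code> fallback, two quick checks (device names will vary per machine):</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">
<div><pre><code># firmware loaded? look for sof-audio probe lines, not errors
sudo dmesg | grep -i sof
# does PipeWire see a real sink? only auto_null means the DSP is down
wpctl status</code></pre></div>
</div>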
<p>If you need to force SOF instead of legacy HDA:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="nb">echo</span> <span class="s2">&#34;options snd-intel-dspcfg dsp_driver=3&#34;</span> <span class="p">|</span> sudo tee /etc/modprobe.d/alsa.conf</span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>Reload without reboot:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">sudo modprobe -r snd_sof_pci_intel_mtl
</span></span><span class="line"><span class="cl">sudo modprobe snd_sof_pci_intel_mtl
</span></span><span class="line"><span class="cl">systemctl --user restart wireplumber pipewire pipewire-pulse</span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<h3>Firmware updates via fwupd<span class="hx:absolute hx:-mt-20" id="firmware-updates-via-fwupd"></span>
    <a href="#firmware-updates-via-fwupd" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/18/thinkpad/bios-update.jpg" alt="Lenovo firmware update (ME Corp) running on boot"  loading="lazy" /></p>
<p><code>fwupdmgr update</code> works, but with Limine there&rsquo;s a gotcha: fwupd tries to write to <code>/boot/EFI/systemd/</code> or <code>/boot/EFI/arch/</code>, which don&rsquo;t exist. The workaround:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">sudo mkdir -p /boot/EFI/arch
</span></span><span class="line"><span class="cl">sudo fwupdmgr update -y --no-reboot-check
</span></span><span class="line"><span class="cl">fwupdmgr get-history  <span class="c1"># should show &#34;Success&#34;</span></span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<h3>HiDPI / fractional scaling<span class="hx:absolute hx:-mt-20" id="hidpi--fractional-scaling"></span>
    <a href="#hidpi--fractional-scaling" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>14&quot; panel at 1920x1200, Hyprland&rsquo;s free resolution. Omarchy&rsquo;s auto 1.5x felt too chunky. I pinned it at 1.333x:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code>env = GDK_SCALE,1
monitor=,preferred,auto,1.3333,vrr,2</code></pre></div>
</div>
<p>Effective 1440x900. GTK with <code>GDK_SCALE=1</code> renders 1:1 with Hyprland (no double magnification). VRR mode 2 only in fullscreen, because LCD panels tend to flicker on static content with VRR active.</p>
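<p>Worth sanity-checking what Hyprland actually applied, since it may round the scale to something the resolution divides cleanly by:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">
<div><pre><code># shows the active resolution, the applied scale and the VRR state
hyprctl monitors</code></pre></div>
</div>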
<p>Brave and Chromium flags to render well at this scale:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code>--ozone-platform=wayland
--enable-features=WaylandFractionalScaleV1,UseOzonePlatform,VaapiVideoDecoder,VaapiVideoEncoder
--enable-gpu-rasterization</code></pre></div>
</div>
<p>VAAPI makes a real difference on YouTube battery life.</p>
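<p>To verify decode is really on the GPU and not silently falling back to software (assuming <code>libva-utils</code> for the check):</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">
<div><pre><code># should list H.264/HEVC/AV1 decode profiles for the iGPU
sudo pacman -S --needed libva-utils
vainfo
# then confirm brave://gpu reports hardware-accelerated video decode</code></pre></div>
</div>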
<h2>Tuning Omarchy to be mine<span class="hx:absolute hx:-mt-20" id="tuning-omarchy-to-be-mine"></span>
    <a href="#tuning-omarchy-to-be-mine" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Omarchy comes with good defaults. What I adjust stacks on top of them, without touching <code>~/.local/share/omarchy/</code> (which is clobbered by <code>omarchy-update</code>). All customization lives in <code>~/.config/</code>.</p>
<h3>Infra: Btrfs, snapshots, Snapper<span class="hx:absolute hx:-mt-20" id="infra-btrfs-snapshots-snapper"></span>
    <a href="#infra-btrfs-snapshots-snapper" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Omarchy already ships with Btrfs and separate subvolumes:</p>
<table>
  <thead>
      <tr>
          <th>Subvolume</th>
          <th>Mount</th>
          <th>In snapshots?</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>@</code></td>
          <td><code>/</code></td>
          <td>yes</td>
      </tr>
      <tr>
          <td><code>@home</code></td>
          <td><code>/home</code></td>
          <td>no</td>
      </tr>
      <tr>
          <td><code>@log</code></td>
          <td><code>/var/log</code></td>
          <td>no</td>
      </tr>
      <tr>
          <td><code>@pkg</code></td>
          <td><code>/var/cache/pacman/pkg</code></td>
          <td>no</td>
      </tr>
  </tbody>
</table>
<p><code>@home</code> separated means <code>~/.cache</code>, <code>~/.config/BraveSoftware</code>, etc. don&rsquo;t bloat root snapshots. Snapshots are for the system, not the user profile.</p>
<p>Swap: 4 GB zram at priority 100 (hit first), plus a 30 GB swapfile at priority 0 (enables hibernation, sized to match RAM).</p>
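<p>A quick way to confirm the ordering took effect:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">
<div><pre><code># zram0 should show PRIO 100, the swapfile PRIO 0
swapon --show</code></pre></div>
</div>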
<p>The snapshot stack: Snapper takes a snapshot before and after every <code>pacman -Syu</code> via snap-pac. <code>limine-snapper-sync</code> writes those snapshots into the Limine menu, so you can boot into a previous snapshot to roll back. If something breaks after an update, you hold a key on boot, pick the pre-update snapshot, boot read-only to verify, and if it&rsquo;s good, run <code>snapper rollback</code>.</p>
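<p>The same flow in commands, roughly (snapshot numbers are whatever <code>snapper list</code> gives you):</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">
<div><pre><code># inspect the pre/post pairs snap-pac created around pacman -Syu
sudo snapper -c root list
# after booting the pre-update snapshot from the Limine menu and
# confirming it's good, promote it to the new default
sudo snapper rollback
sudo reboot</code></pre></div>
</div>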
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/18/thinkpad/btrfs-snapshots-config.png" alt="Btrfs subvolumes separated so root snapshots don’t balloon with user cache"  loading="lazy" /></p>
<p>Omarchy leaves <code>snapper-cleanup.timer</code> and <code>snapper-boot.timer</code> disabled by default. I enabled both and configured retention to fit on a 1 TB SSD without blowing up:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code>TIMELINE_LIMIT_HOURLY=10
TIMELINE_LIMIT_DAILY=0
TIMELINE_LIMIT_WEEKLY=1
TIMELINE_LIMIT_MONTHLY=1
NUMBER_LIMIT=50</code></pre></div>
</div>
<p>A detail that costs debug time if you forget: Docker and Ollama write gigabytes into <code>/var/lib/docker</code> and <code>/var/lib/ollama</code>. If that lands inside <code>@</code>, snapshots go catastrophic: every snapshot pins another copy of each Docker image or Ollama model, so a single download can effectively triple in disk usage. I created nested subvolumes for both, with <code>chattr +C</code> to disable CoW:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">sudo btrfs subvolume create /var/lib/docker
</span></span><span class="line"><span class="cl">sudo chattr +C /var/lib/docker
</span></span><span class="line"><span class="cl">sudo btrfs subvolume create /var/lib/ollama</span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>This has to be done BEFORE Docker or Ollama writes any data. If they already have data there, you need to migrate.</p>
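<p>The migration is mundane but easy to get wrong; a sketch for the Docker side (same idea for Ollama):</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">
<div><pre><code>sudo systemctl stop docker
sudo mv /var/lib/docker /var/lib/docker.old
sudo btrfs subvolume create /var/lib/docker
sudo chattr +C /var/lib/docker
# plain copy so the data is rewritten NoCoW instead of reflinked
sudo cp -a --reflink=never /var/lib/docker.old/. /var/lib/docker/
sudo systemctl start docker
# only after everything checks out
sudo rm -rf /var/lib/docker.old</code></pre></div>
</div>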
<h3>NFS to the Synology NAS<span class="hx:absolute hx:-mt-20" id="nfs-to-the-synology-nas"></span>
    <a href="#nfs-to-the-synology-nas" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>My Synology exposes three volumes. On the desktop, I mount them the usual way. On the notebook, it has to be more defensive. The notebook moves around, forgets networks, connects to public WiFi. Notebook&rsquo;s fstab:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code>nfs4  _netdev,noauto,nofail,x-systemd.automount,x-systemd.idle-timeout=10min,x-systemd.mount-timeout=15s,noatime,nodiratime,nconnect=4,actimeo=10,soft,timeo=30,retrans=2</code></pre></div>
</div>
<p>Critical differences vs the desktop: <code>soft</code> with short timeouts (doesn&rsquo;t hang forever), <code>x-systemd.idle-timeout=10min</code> (auto-unmount when idle), and no dependency on <code>network-online.target</code> (doesn&rsquo;t slow boot). Practical result: <code>cd /mnt/gigachad</code> at home mounts lazily; away from home it fails fast without locking the shell.</p>
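<p>For reference, a complete fstab line with those options; host, export and mountpoint here are placeholders, not my real ones:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">
<div><pre><code>nas.lan:/volume1/media  /mnt/media  nfs4  _netdev,noauto,nofail,x-systemd.automount,x-systemd.idle-timeout=10min,x-systemd.mount-timeout=15s,noatime,nodiratime,nconnect=4,actimeo=10,soft,timeo=30,retrans=2  0 0</code></pre></div>
</div>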
<p>Another important detail: my user on the notebook has UID 1026, which matches the share permissions on the Synology. Linux defaults to creating users at 1000, the Synology enforces identity via UID on the wire. If the UIDs don&rsquo;t match, you can&rsquo;t read the files, or worse, you write as nobody. I ran the <code>usermod</code>/<code>groupmod</code> from a TTY (with the user logged out) to remap the user to 1026/1026 and did <code>chown -R</code> on <code>/home</code>.</p>
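<p>The remap itself, for the record (<code>youruser</code> is a placeholder; run as root from a TTY with the user fully logged out):</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">
<div><pre><code>groupmod -g 1026 youruser
usermod -u 1026 -g 1026 youruser
# usermod already chowns the home contents; the recursive chown is belt and suspenders
chown -R 1026:1026 /home/youruser</code></pre></div>
</div>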
<h3>Public WiFi hardening<span class="hx:absolute hx:-mt-20" id="public-wifi-hardening"></span>
    <a href="#public-wifi-hardening" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>The notebook will leave the house. In an airport or café, I don&rsquo;t want to announce hostname, don&rsquo;t want a trackable MAC, don&rsquo;t want a service listening on an open port.</p>
<p><code>/etc/NetworkManager/conf.d/00-macrandomize.conf</code>:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code>[device]
wifi.scan-rand-mac-address=yes

[connection]
wifi.cloned-mac-address=stable
ethernet.cloned-mac-address=stable
connection.stable-id=${CONNECTION}/${BOOT}

ipv6.ip6-privacy=2
ipv6.addr-gen-mode=stable-privacy</code></pre></div>
</div>
<p>Random MAC for every scan (passive anti-fingerprinting). A stable cloned MAC per connection (so captive portals don&rsquo;t re-prompt you every time) that still differs between networks, and, because of <code>${BOOT}</code> in the stable-id, rotates on every reboot. IPv6 uses temporary addresses with the interface ID derived from a stable secret, not from the MAC (no EUI-64 leak).</p>
<p><code>/etc/systemd/resolved.conf.d/hardening.conf</code>:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code>[Resolve]
LLMNR=no
MulticastDNS=no</code></pre></div>
</div>
<p>Kills hostname broadcast on LLMNR and mDNS. I also disable avahi:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">sudo systemctl disable --now avahi-daemon.service avahi-daemon.socket</span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>UFW firewall:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">sudo ufw default deny incoming
</span></span><span class="line"><span class="cl">sudo ufw default allow outgoing
</span></span><span class="line"><span class="cl">sudo ufw --force enable</span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>SSH server is already off by default on Omarchy. Nothing listening. Even if the firewall leaks, there&rsquo;s no surface.</p>
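<p>The audit takes one command:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">
<div><pre><code># every listening tcp/udp socket with its owning process;
# should come back essentially empty on this setup
sudo ss -tulpn</code></pre></div>
</div>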
<p>All of it rolls into a single initial setup ritual.</p>
<h3>SSH agent persistence (keyring + keychain)<span class="hx:absolute hx:-mt-20" id="ssh-agent-persistence-keyring--keychain"></span>
    <a href="#ssh-agent-persistence-keyring--keychain" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>If you use SSH seriously, you want to type the passphrase once per boot and have <code>ssh</code>, <code>git push</code>, <code>scp</code> stay silent for the rest of the day. On current Arch there&rsquo;s a catch: <code>gnome-keyring</code> 50 dropped its SSH component, and the replacement (<code>gcr-ssh-agent</code>) is a plain in-memory agent with no passphrase persistence. The &ldquo;remember this key&rdquo; checkbox you saw in old guides simply doesn&rsquo;t exist anymore.</p>
<p>The combination that works has three pieces:</p>
<ol>
<li><code>gcr-ssh-agent.socket</code> managed by systemd serves <code>SSH_AUTH_SOCK</code> at <code>$XDG_RUNTIME_DIR/gcr/ssh</code></li>
<li><code>pam_gnome_keyring</code> on SDDM login unlocks the keyring with the login password (used by GUI apps like Brave, not by SSH)</li>
<li><code>keychain</code> (wrapper) keeps an <code>ssh-agent</code> alive across logouts, overrides <code>SSH_AUTH_SOCK</code> to point at that persistent agent, and caches the PID in <code>~/.keychain/&lt;host&gt;-sh</code></li>
</ol>
<p>First, install the keyring:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">sudo pacman -S --needed gnome-keyring seahorse keychain</span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>Then edit <code>/etc/pam.d/sddm</code> so the login password unlocks the keyring automatically. Add at the top:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code>-auth      optional    pam_gnome_keyring.so</code></pre></div>
</div>
<p>And at the end:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code>-session   optional    pam_gnome_keyring.so    auto_start</code></pre></div>
</div>
<p>The leading hyphen (<code>-auth</code>, <code>-session</code>) tells PAM to tolerate the module being absent, so if it fails to load, login doesn&rsquo;t break.</p>
<p>Pin <code>SSH_AUTH_SOCK</code> for graphical and TTY sessions via <code>~/.config/environment.d/ssh-agent.conf</code>:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code>SSH_AUTH_SOCK=%t/gcr/ssh</code></pre></div>
</div>
<p><code>%t</code> resolves to <code>$XDG_RUNTIME_DIR</code>, something like <code>/run/user/1026</code>. Enable the gcr socket and user linger (so user daemons survive logout):</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">systemctl --user <span class="nb">enable</span> --now gcr-ssh-agent.socket
</span></span><span class="line"><span class="cl">sudo loginctl enable-linger <span class="nv">$USER</span></span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>Finally, in <code>~/.config/bash/init.sh</code>, keychain starts or reuses the agent:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="k">if</span> <span class="nb">command</span> -v keychain <span class="p">&amp;</span>&gt;/dev/null <span class="o">&amp;&amp;</span> <span class="o">[[</span> -r ~/.ssh/id_ed25519 <span class="o">]]</span><span class="p">;</span> <span class="k">then</span>
</span></span><span class="line"><span class="cl">  <span class="nb">eval</span> <span class="s2">&#34;</span><span class="k">$(</span>keychain --eval --quiet ~/.ssh/id_ed25519<span class="k">)</span><span class="s2">&#34;</span>
</span></span><span class="line"><span class="cl"><span class="k">fi</span></span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p><code>--eval</code> emits shell assignments (<code>SSH_AUTH_SOCK</code>, <code>SSH_AGENT_PID</code>) pointing to keychain&rsquo;s agent, overriding the gcr path set by environment.d. <code>--quiet</code> silences the banner once the key is loaded.</p>
<p>Flow in practice: boot → first terminal → keychain prompts for the passphrase once → key stays loaded. Logout → login again → new shells reattach to the same agent (via <code>~/.keychain/&lt;host&gt;-sh</code>) → zero prompts. <code>ssh-add -l</code> confirms the key is there, <code>echo $SSH_AUTH_SOCK</code> confirms it&rsquo;s at keychain&rsquo;s path.</p>
<p>You&rsquo;ll re-enter the passphrase after a reboot, an <code>ssh-add -D</code>, or a <code>keychain --clear</code>. In normal use, once a day.</p>
<h3>Bash instead of ZSH<span class="hx:absolute hx:-mt-20" id="bash-instead-of-zsh"></span>
    <a href="#bash-instead-of-zsh" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>On the desktop I use ZSH. On the notebook I went with Bash to align with Omarchy&rsquo;s default, without having to maintain a parallel stack of modular layers. <code>~/.bashrc</code> is a symlink to <code>~/.config/bash/bashrc</code>, which sources Omarchy&rsquo;s defaults first and then stacks my customizations:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="nb">source</span> ~/.local/share/omarchy/default/bash/rc
</span></span><span class="line"><span class="cl"><span class="nb">source</span> ~/.config/bash/envs.sh
</span></span><span class="line"><span class="cl"><span class="nb">source</span> ~/.config/bash/aliases.sh
</span></span><span class="line"><span class="cl"><span class="nb">source</span> ~/.config/bash/mounts.sh
</span></span><span class="line"><span class="cl"><span class="nb">source</span> ~/.config/bash/init.sh
</span></span><span class="line"><span class="cl"><span class="nb">source</span> ~/.config/bash/secrets    <span class="c1"># gitignored, chmod 600</span></span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p><code>envs.sh</code> has what&rsquo;s mine: OpenRouter base URL, Ollama pointing to the LAN GPU box (192.168.0.14), AWS region, Hugo analytics, zoxide and SSH agent configs. <code>aliases.sh</code> has TLP shortcuts, an alias for <code>shell-gpt</code> via Docker, and functions to harden PATH when running <code>makepkg</code> or <code>yay</code> (prevents binary injection via a malicious user config).</p>
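<p>The PATH hardening is just a wrapper; a minimal sketch of the idea (my actual functions may differ):</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">
<div><pre><code># run build tools with a root-owned PATH so nothing in ~/.local/bin
# or other user-writable directories can shadow them
makepkg() { PATH=/usr/local/bin:/usr/bin command makepkg "$@"; }
yay()     { PATH=/usr/local/bin:/usr/bin command yay "$@"; }</code></pre></div>
</div>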
<p><code>init.sh</code> does the integration work. Atuin with a manual bind for Ctrl-R (so the up arrow keeps bash&rsquo;s default history-search, which I use more). Keychain loading <code>~/.ssh/id_ed25519</code> once per boot and reusing it across shells (no need to re-authenticate SSH every time). Blesh if installed (ZSH-style autosuggestions for Bash). And a pair of functions, hooked into <code>PROMPT_COMMAND</code> and a <code>DEBUG</code> trap, that set the window title to the current pwd and the running command:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">__title_idle<span class="o">()</span> <span class="o">{</span> <span class="nb">printf</span> <span class="s1">&#39;\033]2;%s\007&#39;</span> <span class="s2">&#34;</span><span class="si">${</span><span class="nv">PWD</span><span class="p">/#</span><span class="nv">$HOME</span><span class="p">/~</span><span class="si">}</span><span class="s2">&#34;</span><span class="p">;</span> <span class="o">}</span>
</span></span><span class="line"><span class="cl">__title_busy<span class="o">()</span> <span class="o">{</span>
</span></span><span class="line"><span class="cl">  <span class="nb">local</span> <span class="nv">cmd</span><span class="o">=</span><span class="s2">&#34;</span><span class="si">${</span><span class="nv">BASH_COMMAND</span><span class="si">}</span><span class="s2">&#34;</span>
</span></span><span class="line"><span class="cl">  <span class="o">[[</span> <span class="s2">&#34;</span><span class="nv">$cmd</span><span class="s2">&#34;</span> <span class="o">==</span> <span class="s2">&#34;__title_&#34;</span>* <span class="o">||</span> <span class="s2">&#34;</span><span class="nv">$cmd</span><span class="s2">&#34;</span> <span class="o">==</span> *<span class="s2">&#34;PROMPT_COMMAND&#34;</span>* <span class="o">]]</span> <span class="o">&amp;&amp;</span> <span class="k">return</span>
</span></span><span class="line"><span class="cl">  <span class="nb">printf</span> <span class="s1">&#39;\033]2;%s — %s\007&#39;</span> <span class="s2">&#34;</span><span class="si">${</span><span class="nv">PWD</span><span class="p">/#</span><span class="nv">$HOME</span><span class="p">/~</span><span class="si">}</span><span class="s2">&#34;</span> <span class="s2">&#34;</span><span class="nv">$cmd</span><span class="s2">&#34;</span>
</span></span><span class="line"><span class="cl"><span class="o">}</span>
</span></span><span class="line"><span class="cl"><span class="k">if</span> <span class="o">[[</span> -n <span class="s2">&#34;</span><span class="si">${</span><span class="nv">PROMPT_COMMAND</span><span class="p">-</span><span class="si">}</span><span class="s2">&#34;</span> <span class="o">]]</span><span class="p">;</span> <span class="k">then</span>
</span></span><span class="line"><span class="cl">  <span class="nv">PROMPT_COMMAND</span><span class="o">=</span><span class="s2">&#34;__title_idle; </span><span class="si">${</span><span class="nv">PROMPT_COMMAND</span><span class="si">}</span><span class="s2">&#34;</span>
</span></span><span class="line"><span class="cl"><span class="k">else</span>
</span></span><span class="line"><span class="cl">  <span class="nv">PROMPT_COMMAND</span><span class="o">=</span><span class="s2">&#34;__title_idle&#34;</span>
</span></span><span class="line"><span class="cl"><span class="k">fi</span>
</span></span><span class="line"><span class="cl"><span class="nb">trap</span> <span class="s1">&#39;__title_busy&#39;</span> DEBUG</span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>Idle, the title shows just the pwd. With a command running, <code>trap DEBUG</code> catches the <code>BASH_COMMAND</code> and updates the title. Waybar&rsquo;s <code>hyprland/window</code> picks that up and displays it.</p>
<p>For the modern Rust toolbelt, the list:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">sudo pacman -S --needed <span class="se">\
</span></span></span><span class="line"><span class="cl">  eza bat fd ripgrep sd git-delta dust procs bottom duf tokei hyperfine <span class="se">\
</span></span></span><span class="line"><span class="cl">  zoxide atuin tldr starship</span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p><code>eza</code> replaces <code>ls</code>. <code>bat</code> replaces <code>cat</code> with syntax highlighting. <code>fd</code> replaces <code>find</code>. <code>ripgrep</code> replaces <code>grep</code>. <code>sd</code> replaces <code>sed</code>. <code>delta</code> plugs into git for colored side-by-side diff. <code>dust</code> for visual <code>du</code>. <code>procs</code> for <code>ps</code>. <code>btm</code> (bottom) for <code>top</code>. <code>duf</code> for <code>df</code>. <code>tokei</code> counts lines of code. <code>hyperfine</code> for command benchmarking. <code>zoxide</code> is <code>cd</code> with memory (very useful). <code>atuin</code> is shell history with encrypted sync (I point it to my home server via <code>sync_address = &quot;http://192.168.0.90:8888&quot;</code>). <code>starship</code> is the prompt. <code>tldr</code> is a man page in 10 lines.</p>
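<p>In practice that becomes a handful of aliases and init hooks; a typical mapping (a sketch, not my exact <code>aliases.sh</code>):</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">
<div><pre><code>alias ls='eza --icons'
alias ll='eza -la --icons'
alias cat='bat --paging=never'
alias du='dust'
alias df='duf'
alias ps='procs'
alias top='btm'
eval "$(zoxide init bash)"
eval "$(starship init bash)"</code></pre></div>
</div>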
<p>The <code>atuin key</code> has to be backed up offline. If you lose it, you lose the encrypted history. I saved mine in my self-hosted Bitwarden (Vaultwarden), documented in the <a href="/en/2025/09/10/omarchy-2-0-bitwarden-self-hosted-vaultwarden/">Bitwarden self-hosted article</a>.</p>
<p>Git pipes diff through delta:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code>[core]        pager = delta
[interactive] diffFilter = delta --color-only
[delta]
    navigate = true
    side-by-side = true
    line-numbers = true
    hyperlinks = true
    light = false
[merge]       conflictstyle = zdiff3</code></pre></div>
</div>
<h3>Hyprland and Waybar<span class="hx:absolute hx:-mt-20" id="hyprland-and-waybar"></span>
    <a href="#hyprland-and-waybar" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Visual and UX customizations on top of the default.</p>
<p>In <code>hypr/looknfeel.conf</code>, smaller gaps (2/5 vs the default 5/10), slide animation between workspaces, VFR on (reduces consumption when the screen is static), <code>allow_session_lock_restore</code> (if hyprlock crashes, it goes back to the lock screen instead of dumping you on the desktop).</p>
<p>In <code>hypr/bindings.conf</code>:</p>
<table>
  <thead>
      <tr>
          <th>Binding</th>
          <th>Action</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>Super+B</code></td>
          <td>Brave</td>
      </tr>
      <tr>
          <td><code>Super+L</code></td>
          <td>Lock screen (default was layout toggle)</td>
      </tr>
      <tr>
          <td><code>Super+Ctrl+L</code></td>
          <td>Layout toggle (moved here)</td>
      </tr>
      <tr>
          <td><code>Super+Ctrl+P</code></td>
          <td>Toggle TLP perf mode</td>
      </tr>
  </tbody>
</table>
<p>On Waybar, three custom modules I recommend:</p>
<p><strong>Window title</strong>: the <code>hyprland/window</code> module shows what&rsquo;s focused. In bash, PROMPT_COMMAND updates the title to something like &ldquo;~/Projects/blog — hugo server&rdquo;. So in waybar the directory and running command show up, without me needing to look at the terminal.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/18/thinkpad/hypr---window-title-on-waybar.png" alt="Waybar showing workspaces on the left and the focused window title with the pwd and the running command"  loading="lazy" /></p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/18/thinkpad/hypr---another-window-title-on-waybar.png" alt="Another example of the window title on waybar, this time showing “Claude Code”"  loading="lazy" /></p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-jsonc" data-lang="jsonc"><span class="line"><span class="cl"><span class="s2">&#34;hyprland/window&#34;</span><span class="err">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;format&#34;</span><span class="p">:</span> <span class="s2">&#34;{title}&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;max-length&#34;</span><span class="p">:</span> <span class="mi">50</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;rewrite&#34;</span><span class="p">:</span> <span class="p">{</span> <span class="nt">&#34;(.{47}).+&#34;</span><span class="p">:</span> <span class="s2">&#34;$1…&#34;</span> <span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span></span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p><strong>Perf mode</strong>: <code>custom/perf</code> runs a script that reads TLP state and shows an icon:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-jsonc" data-lang="jsonc"><span class="line"><span class="cl"><span class="s2">&#34;custom/perf&#34;</span><span class="err">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;exec&#34;</span><span class="p">:</span> <span class="s2">&#34;~/.config/scripts/perf-waybar.sh&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;interval&#34;</span><span class="p">:</span> <span class="mi">3</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;format&#34;</span><span class="p">:</span> <span class="s2">&#34;{}&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;on-click&#34;</span><span class="p">:</span> <span class="s2">&#34;~/.config/scripts/perf-toggle.sh&#34;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span></span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>Empty for balanced, 󰓅 PERF for performance, 󰌪 ECO for low-power. Click toggles. Useful to quickly see what mode the machine is in.</p>
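<p>The scripts aren&rsquo;t anything fancy. A minimal sketch of the display side, assuming the toggle script records the current mode in a state file (<code>~/.cache/perf-mode</code> is an invention for this example; the real scripts may differ):</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">
<div><pre><code>#!/usr/bin/env bash
# perf-waybar.sh: print the label waybar displays for the current mode
case "$(cat ~/.cache/perf-mode 2&gt;/dev/null)" in
  perf) echo "󰓅 PERF" ;;
  eco)  echo "󰌪 ECO" ;;
  *)    echo "" ;;   # balanced: show nothing
esac</code></pre></div>
</div>
<p>The toggle side then just rewrites that file and flips TLP accordingly, e.g. <code>sudo tlp ac</code> for performance and <code>sudo tlp bat</code> for eco.</p>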
<p><strong>Claudebar</strong>: I integrated <a href="https://github.com/alfredopiquet/claudebar"target="_blank" rel="noopener">claudebar</a> as a custom module that shows the Claude Code session usage % and extra spend inline in waybar. No need to open another window.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/18/thinkpad/hypr---power-profile-and-claudebar-next-to-traybar.png" alt="Claudebar showing 9%, 1h16m remaining and $1120.74 extra spend, followed by the PERF indicator and the tray"  loading="lazy" /></p>
<p>The clock:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-jsonc" data-lang="jsonc"><span class="line"><span class="cl"><span class="s2">&#34;clock&#34;</span><span class="err">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;format&#34;</span><span class="p">:</span> <span class="s2">&#34;{:L%A %d %B - %H:%M}&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&#34;format-alt&#34;</span><span class="p">:</span> <span class="s2">&#34;{:L%A W%V %Y - %H:%M}&#34;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span></span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>Shows &ldquo;Thursday 17 April - 14:32&rdquo; by default, click toggles to &ldquo;Thursday W16 2026 - 14:32&rdquo; with the ISO week number, useful for planning articles and tasks.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/18/thinkpad/hypr---better-date-time-on-waybar-center.png" alt="Waybar clock in the center with date and time"  loading="lazy" /></p>
<h3>Self-hosted Atuin<span class="hx:absolute hx:-mt-20" id="self-hosted-atuin"></span>
    <a href="#self-hosted-atuin" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Atuin runs on my home server at 192.168.0.90:8888. I made a separate account for the notebook (<code>akitaonrails-thinkpad</code>), I don&rsquo;t mix history with the desktop on purpose. <code>atuin sync</code> runs every 5 minutes and all history is encrypted before going to the server. The server doesn&rsquo;t see the commands, only encrypted bytes.</p>
<h2>Conclusion: picking a notebook for Linux<span class="hx:absolute hx:-mt-20" id="conclusion-picking-a-notebook-for-linux"></span>
    <a href="#conclusion-picking-a-notebook-for-linux" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>One thing I learned on this journey: picking a notebook to run Linux comfortably is not trivial. You have to check the <a href="https://wiki.archlinux.org/title/Laptop"target="_blank" rel="noopener">ArchWiki</a> for your specific model before buying. The <a href="https://wiki.archlinux.org/title/Lenovo_ThinkPad_T14s_%28AMD%29_Gen_6"target="_blank" rel="noopener">T14s Gen 6 AMD page</a>, for example, catalogs every hardware quirk and how to work around it. Without that reference, you discover the problems during install.</p>
<p>Rule I follow: never buy a freshly-released model. Let 6 to 12 months go by after launch. That gives the community time to iron out bugs, drivers to land mainline, the ArchWiki to have a decent page, libfprint to include the fingerprint sensor, and the kernel to cover WiFi and the IR camera. Buying on day one means signing up as a beta tester.</p>
<p>Brands with a decent Linux track record: Lenovo (especially Thinkpad, they ship Ubuntu and Fedora pre-loaded), Dell (the XPS and Latitude lines), Asus (Zenbook and ROG have reasonable support), Framework (built for Linux from the factory). Any recent Mac is cut off from Linux mainline or runs through Asahi with caveats.</p>
<p>The Thinkpad T14 Gen 6 isn&rsquo;t the prettiest notebook I could have bought. But it&rsquo;s rugged, it has the right ports, the fingerprint sensor works, and the plastic shell can take a drop, a scratch, and backpack travel without making me nervous. To serve as a remote debug companion, it&rsquo;s what I needed. If that&rsquo;s your use case too, I recommend it. If you want OLED and metal, go Zenbook or T14s. Every premium comes with a compromise; no Linux notebook in 2026 is perfect.</p>
<p>Everything I described here is versioned in a private repo of mine. If I need to reinstall tomorrow, I run half a dozen steps in order and I&rsquo;m back at the same place. That&rsquo;s the whole point of keeping config in Git: notebook is commodity, config is mine.</p>
]]></content:encoded><category>omarchy</category><category>thinkpad</category><category>archlinux</category><category>hyprland</category><category>linux</category><category>homeserver</category></item><item><title>Why LLMs Aren't Giving You the Result You Expect | Why I Prefer Claude Code Today</title><link>https://akitaonrails.github.io/en/2026/04/15/how-to-talk-to-claude-code-effectively/</link><guid isPermaLink="true">https://akitaonrails.github.io/en/2026/04/15/how-to-talk-to-claude-code-effectively/</guid><pubDate>Wed, 15 Apr 2026 13:00:00 GMT</pubDate><description>&lt;p&gt;&lt;img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/15/talk-to-claude/hero-gpt-vs-claude.png" alt="Editorial illustration contrasting GPT-coding and Claude-coding: one side showing rapid generation and broad repertoire, the other a structured architecture with a concept tree and context-aware debugging" loading="lazy" /&gt;&lt;/p&gt;
&lt;p&gt;Every time I get pulled into an online thread about LLMs I hear the same chorus, in slightly different keys. &amp;ldquo;Claude didn&amp;rsquo;t perform as well as GPT for me.&amp;rdquo; &amp;ldquo;GPT did a way better job than Claude, I&amp;rsquo;m canceling my sub.&amp;rdquo; &amp;ldquo;Honestly, Kimi or MiniMax does the job for me just fine, I&amp;rsquo;m not paying for anything.&amp;rdquo; Anecdote piled on anecdote, each one a variation of &amp;ldquo;works on my machine&amp;rdquo; vs. &amp;ldquo;doesn&amp;rsquo;t work on my machine.&amp;rdquo; It sounds off to me.&lt;/p&gt;</description><content:encoded><![CDATA[<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/15/talk-to-claude/hero-gpt-vs-claude.png" alt="Editorial illustration contrasting GPT-coding and Claude-coding: one side showing rapid generation and broad repertoire, the other a structured architecture with a concept tree and context-aware debugging"  loading="lazy" /></p>
<p>Every time I get pulled into an online thread about LLMs I hear the same chorus, in slightly different keys. &ldquo;Claude didn&rsquo;t perform as well as GPT for me.&rdquo; &ldquo;GPT did a way better job than Claude, I&rsquo;m canceling my sub.&rdquo; &ldquo;Honestly, Kimi or MiniMax does the job for me just fine, I&rsquo;m not paying for anything.&rdquo; Anecdote piled on anecdote, each one a variation of &ldquo;works on my machine&rdquo; vs. &ldquo;doesn&rsquo;t work on my machine.&rdquo; It sounds off to me.</p>
<p>I&rsquo;ve already benchmarked most of the relevant open source and commercial models in my <a href="/en/2026/04/05/testing-llms-open-source-and-commercial-can-anyone-beat-claude-opus/">LLM testing post</a>, so I&rsquo;m not pulling this out of thin air. And beyond the benchmarks, I&rsquo;ve got 500+ hours in Claude Code and Codex on real projects. 16 hours a day, two and a half months straight, something in the neighborhood of 400,000 effective lines of code generated.</p>
<p>And look: with neither of them, Claude or Codex, did I ever see the model wander off, do something I didn&rsquo;t ask for, or flat-out fail to deliver what I wanted. Not once. When the model really couldn&rsquo;t do something, it told me so upfront instead of inventing. So when someone tells me &ldquo;Claude blew it,&rdquo; my first question is always the same: what did you ask it for, exactly?</p>
<h2>The fake problem<span class="hx:absolute hx:-mt-20" id="the-fake-problem"></span>
    <a href="#the-fake-problem" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>The ecosystem&rsquo;s answer to this supposed &ldquo;wandering off&rdquo; problem is to pile on more layers. Spec Driven Development showed up, 15-section prompt templates showed up, whole frameworks to force the LLM to ask more questions before it starts. I respect the effort, but I think they&rsquo;re treating the symptom, not the cause.</p>
<p>I practice what I call <a href="/en/2026/03/05/37-days-of-vibe-coding-immersion-conclusions-on-business-models/">Agile Vibe Coding</a>: applying XP techniques (pair programming, test-driven, short feedback loops, continuous refactor) on top of normal prompting. I don&rsquo;t need a framework. I don&rsquo;t need a three-page template. I need the same things that have always been needed to work on software in a team: know what I want, know what I don&rsquo;t want, know how to validate it when it lands.</p>
<h2>The real problem is one thing: nobody knows how to communicate<span class="hx:absolute hx:-mt-20" id="the-real-problem-is-one-thing-nobody-knows-how-to-communicate"></span>
    <a href="#the-real-problem-is-one-thing-nobody-knows-how-to-communicate" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>I&rsquo;ve got an old post from 2013 called <a href="/en/2013/11/02/off-topic-programadores-sao-pessimos-comunicadores-udp-vs-tcp/">Programmers are terrible communicators (UDP vs TCP)</a>. Go read it if you haven&rsquo;t, because the problem I described there in 2013 is exactly the thing blowing up now that everyone&rsquo;s flying an LLM. Nothing changed. The tech got more powerful, but the people are the same people.</p>
<p>Here&rsquo;s how it works. You&rsquo;ve got a pile of information in your head. Project context, history, stack constraints, personal preference, things that went sideways in the past, decisions that got made in a meeting two months back. And then you walk into a conversation, whether it&rsquo;s with a human coworker or an LLM, and fire off your request assuming everything in your head is also in the head of whoever is on the other side. &ldquo;It&rsquo;s obvious, everybody knows that.&rdquo; So you write &ldquo;do what I&rsquo;m telling you,&rdquo; except what you&rsquo;re telling them is really &ldquo;do what I&rsquo;m thinking.&rdquo; And you don&rsquo;t even notice they&rsquo;re not the same thing.</p>
<p>Developers are terrible communicators. Managers are terrible communicators too, and that&rsquo;s exactly why most of the useful hours in a corporate week get burned in pointless meetings. Nobody gets to the point on time, nobody aligns expectations, the result comes in below the bar, and the default managerial answer to that is &ldquo;more of the same.&rdquo; More meetings, more spreadsheets, more reports. Except if your communication was bad at volume 1, it&rsquo;s going to stay bad at volume 5. The problem is quality. Volume doesn&rsquo;t fix bad quality.</p>
<h2>How I actually talk to Claude or Codex<span class="hx:absolute hx:-mt-20" id="how-i-actually-talk-to-claude-or-codex"></span>
    <a href="#how-i-actually-talk-to-claude-or-codex" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>I treat any LLM exactly how I&rsquo;d treat a human during a pair programming session. No ceremony, no form to fill out, no 10-page spec. But with communication discipline. Let me walk you through a real example, from last week.</p>
<p>I&rsquo;ve got about 12 TB of ROMs piled up on my NAS, under <code>/mnt/terachad/Emulators</code>, split across two trees (<code>ROMS/</code> and <code>ROMS2/</code>) that accumulated across different collections over the last 10-plus years. More than 400,000 files total. Uncompressed romsets, <code>.7z</code>, <code>.rar</code>, giant CDI/GDI bundles, inconsistent file naming, duplicates everywhere. I wanted to consolidate all of it, by platform, into a new <code>ROMS_FINAL/</code> tree, using standardized naming (No-Intro / Redump / TOSEC) so that when I run Screenscraper later the match is automatic. That&rsquo;s the <strong>goal</strong>, stated up front.</p>
<p>But I don&rsquo;t stop there. I tell it what I <strong>DON&rsquo;T</strong> want, too. &ldquo;Never dedup by filename, only by sha1 + size, filenames lie way too much in this world.&rdquo; &ldquo;A Neo Geo romset depends on which emulator is going to consume it, so the MAME zip, the FBNeo bundle, and the Darksoft MVS cart are three mutually incompatible things, keep one canonical copy of each.&rdquo; &ldquo;Same idea for NAOMI: the MAME romset is not the same file as the GDI.&rdquo; &ldquo;Saturn has USA, Japan, and Europe versions, I want to keep a copy of each region, region is not duplicate.&rdquo; If I leave that out, the model has no way to know, because that knowledge is in my head and not in the code. If I don&rsquo;t give it, it&rsquo;ll assume its own most-reasonable default, which might be the opposite of what I need.</p>
<p>Then I get into <strong>method detail</strong>. &ldquo;Create a <code>docs/</code> directory to serve as living knowledge base, and <code>docs/scripts/</code> with the steps broken into numbered files (<code>01_walk_and_hash.py</code>, <code>02_classify.py</code>, etc.). Each step has to be idempotent so I can re-run it if it crashes or if I need to resume halfway through. Progress state lives in a SQLite catalog, not in-memory variables.&rdquo; This isn&rsquo;t micromanaging the model, this is aligning the way I work. I know an operation this size is going to hit problems, and I want to be able to come back without losing the previous hours of hashing and classification.</p>
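<p>To make that concrete, here&rsquo;s a minimal sketch of what one idempotent step plus the dedup rule can look like. The catalog schema and the script shape are my own illustration, not the pipeline&rsquo;s actual code; only the paths and the sha1 + size rule come from above.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"># sketch: hash into a SQLite catalog, skipping work already done on a
# previous (possibly crashed) run; a real run would batch the inserts
DB=docs/catalog.db
esc() { printf '%s' "$1" | sed "s/'/''/g"; }   # escape single quotes for SQL
sqlite3 "$DB" 'CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, sha1 TEXT, size INTEGER);'
find /mnt/terachad/Emulators/ROMS /mnt/terachad/Emulators/ROMS2 -type f -print0 |
while IFS= read -r -d '' f; do
  p=$(esc "$f")
  [ -n "$(sqlite3 "$DB" "SELECT 1 FROM files WHERE path='$p';")" ] &amp;&amp; continue
  sqlite3 "$DB" "INSERT INTO files VALUES ('$p', '$(sha1sum "$f" | cut -d' ' -f1)', $(stat -c %s "$f"));"
done
# dedup is then a catalog query: duplicate means same sha1 AND same size, never filename
sqlite3 "$DB" 'SELECT sha1, size, COUNT(*) AS n FROM files GROUP BY sha1, size HAVING n &gt; 1;'</code></pre></div>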
<p>Then, even after it hands me its plan, I keep thinking about what could go wrong. &ldquo;Zero deletions under <code>ROMS/</code> and <code>ROMS2/</code>. Whatever doesn&rsquo;t get forwarded to <code>ROMS_FINAL/</code> just stays where it was. The only thing the whole pipeline is allowed to delete is temporary files from an extraction that bailed halfway through. Also, the phase that actually executes the moves can only run after I manually approve the planning phase, create a flag file <code>docs/.phase4-approved</code>, and make phase 5 refuse to start without it.&rdquo; This is the equivalent of making a commit before a big refactor, plus a human gate between planning and applying. I&rsquo;m protecting against my own mistakes, its mistakes, or both at once.</p>
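<p>That gate costs half a dozen lines. A minimal sketch, with an illustrative script name; the flag path is the real one from above:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash">#!/usr/bin/env bash
# 05_execute_moves.sh (illustrative name): phase 5 refuses to start
# until a human has reviewed the phase-4 plan and created the flag
set -euo pipefail
if [ ! -f docs/.phase4-approved ]; then
  echo "refusing to run: review the plan, then touch docs/.phase4-approved" &gt;&amp;2
  exit 1
fi
# ...only now do the planned moves into ROMS_FINAL/ actually execute...</code></pre></div>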
<p>When the model gets going, <strong>I don&rsquo;t leave the room</strong>. I keep asking for status. I notice the hash ETA is way longer than the problem warrants, so I interrupt: &ldquo;I think we can parallelize, I&rsquo;ve got spare CPU and 10GbE talking to the NAS, and the Synology over NFS can handle it. Bump up the concurrency, test, check that it&rsquo;s still stable and that the SQLite transaction ordering doesn&rsquo;t break.&rdquo; That&rsquo;s real pair programming, not blind automation.</p>
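<p>The concurrency bump itself doesn&rsquo;t need anything exotic. A hedged sketch with <code>xargs -P</code>; the worker count is a knob to tune against the NAS, not a number from my session:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"># hash in parallel; -n 1 keeps each result a single short write, so
# output lines from concurrent workers don't interleave
find /mnt/terachad/Emulators/ROMS2 -type f -print0 |
  xargs -0 -n 1 -P 8 sha1sum &gt;&gt; docs/hashes.raw
# then fold hashes.raw into the SQLite catalog from ONE process, so the
# transaction-ordering concern above stays contained to a single writer</code></pre></div>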
<h2>The structure behind it<span class="hx:absolute hx:-mt-20" id="the-structure-behind-it"></span>
    <a href="#the-structure-behind-it" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Notice the pattern. I don&rsquo;t say &ldquo;solve this problem.&rdquo; I say &ldquo;solve this problem, <strong>this way and that way</strong>, and <strong>not this way or that way</strong>, and when you&rsquo;re done, <strong>validate X and Y</strong>.&rdquo; In other words, I&rsquo;m communicating four things, not one:</p>
<p>First, <strong>what I want</strong>. The end goal in plain language. Second, <strong>how I want it done</strong>, in broad strokes, leaving room for it to suggest a better solution if it has one, because it usually does. Third, <strong>what I don&rsquo;t want</strong>. This is the part most people skip, and it&rsquo;s the most critical, because it&rsquo;s where all the unspoken assumptions live, the ones that turn into bugs later. Fourth, <strong>how we validate it landed</strong>. What&rsquo;s the expected result, what&rsquo;s the test, what&rsquo;s the &ldquo;done&rdquo; signal.</p>
<p>And that fourth part is a killer. Most of the clients I&rsquo;ve worked with over the past 20 years couldn&rsquo;t tell me what the expected result was. Because it&rsquo;s easy to want something, and hard to say how you&rsquo;d measure whether it showed up. Without a success metric, expectations break by definition, because there was no concrete expectation to begin with. That&rsquo;s one of the main reasons consulting projects go sideways, and it&rsquo;s identical in the world of AI agents.</p>
<p>When I hand the model all four blocks, it almost never fails. And when it actually can&rsquo;t do the thing, because the task is impossible given my constraints or because there&rsquo;s information I forgot to pass along, it doesn&rsquo;t try to guess. It tells me: &ldquo;given your constraints, I can&rsquo;t proceed because of X or Y.&rdquo; Then I adjust, or I loosen a constraint, or I realize I didn&rsquo;t actually know what I wanted. It all works.</p>
<h2>Once the context is solid, ask instead of prescribe<span class="hx:absolute hx:-mt-20" id="once-the-context-is-solid-ask-instead-of-prescribe"></span>
    <a href="#once-the-context-is-solid-ask-instead-of-prescribe" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>There&rsquo;s a shift I make after the first few well-loaded prompts. In the opening turns I pour in a ton: goal, constraints, method, validation, all the context the model needs to understand the terrain. It&rsquo;s a lot of detail, it&rsquo;s my job to front-load it, and I already explained why there&rsquo;s no way around that.</p>
<p>But once that ground is solid, I stop prescribing solutions and start asking for suggestions. Instead of saying &ldquo;implement it this way, with that library, following this pattern,&rdquo; I switch to &ldquo;given everything we&rsquo;ve already covered, what&rsquo;s the best approach here? Research online if you need to, compare the options, and come back with the solution you&rsquo;d pick for this goal.&rdquo;</p>
<p>That&rsquo;s the step most people skip, and it&rsquo;s exactly where the LLM earns useful autonomy. It&rsquo;s got way more vocabulary than I do across a pile of topics. It read the whole internet; I read whatever I had time for. If I prescribe line by line, I&rsquo;m throwing that edge away. If I set the context and then ask, it comes back with options I hadn&rsquo;t considered, with trade-offs laid out, sometimes better than my original idea, and I get to pick.</p>
<p>The trick is simple: <strong>asking well requires having set the context well first</strong>. A dry question with no ground gets back a generic answer, or the first idea that seemed to fit. A question resting on ground you already laid comes back with a real proposal, alternatives compared, references cited. The model is ready for that kind of contribution, it just won&rsquo;t give it to you until you&rsquo;ve spent the time setting the stage.</p>
<p>And look: this method wasn&rsquo;t invented for LLMs. It&rsquo;s the same way I&rsquo;d work with any human developer, senior or not. Give them the context, give them the goal, explain what I care about, then ask &ldquo;what&rsquo;s the best way to do this?&rdquo; instead of dictating the solution. The only difference is the LLM gives it back in seconds when a human would take days. The way you run the conversation is the same.</p>
<h2>It&rsquo;s not a 10-page spec<span class="hx:absolute hx:-mt-20" id="its-not-a-10-page-spec"></span>
    <a href="#its-not-a-10-page-spec" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>And here&rsquo;s the important bit: what I&rsquo;m describing is not a formalism. It&rsquo;s not a long spec, it&rsquo;s not a Confluence doc, it&rsquo;s not a template. It&rsquo;s just how I talk to anyone who has to deliver something to me. I picked it up running projects, managing contractors, integrating teams that weren&rsquo;t mine, and getting knocked around every time my communication was bad. After a while it becomes second nature.</p>
<p>If I don&rsquo;t really care about the final result, say a quick experiment, a weekend toy, something disposable, I cut way back. I know my expectations might break, and that&rsquo;s fine, the cost of a bug here is cheap. But if the result actually matters, I invest the time needed to give the other side (human or AI) the best shot at delivering what I want. Output is proportional to the time you spend on input. If you&rsquo;re not willing to explain what you want properly, don&rsquo;t complain when what comes back is wrong.</p>
<p>To illustrate: this very post you&rsquo;re reading was written by Claude, off a prompt I gave it. Below is the screenshot of what I typed. Notice: no hidden assumptions, clear goal, references to earlier posts baked in, constraints stated explicitly, enough detail that it didn&rsquo;t have to guess.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/15/talk-to-claude/prompt-example.png" alt="Screenshot of the prompt I used to ask Claude Code to write this article: goal, context, references to previous posts, constraints, and expected outcome, all stated directly"  loading="lazy" /></p>
<p>But look, that&rsquo;s just the first prompt. It wasn&rsquo;t the only one. While Claude was writing, I kept tracking, sending corrections, adding points I forgot to put in the first prompt, flagging factual errors I caught in the generated text, and fine-tuning tone. &ldquo;Hey, I also forgot to tell you we need to cover X.&rdquo; &ldquo;No, that reference is out of date, OpenAI&rsquo;s Sora got shut down, fix that.&rdquo; &ldquo;Go read what we already documented over at <code>/mnt/terachad/Emulators/docs/</code> and see if you can sharpen up the ROMs example.&rdquo; This very observation you&rsquo;re reading right now started as a mid-flight prompt:</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/15/talk-to-claude/prompt-followup.png" alt="Screenshot of the follow-up prompt asking Claude to read the real ROM-organization docs and also explain that the blog prompt isn’t a single shot but iterative"  loading="lazy" /></p>
<p>And the conversation kept going all the way to the commit. I had already told it to humanize, translate to English, and push, and I still interrupted at the last second because I spotted an unnecessary English loanword I wanted to fix before it went up. Here&rsquo;s that moment:</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/15/talk-to-claude/prompt-anglicism.png" alt="Screenshot of the moment when I’d already authorized the translation and commit, but interrupted right before to ask for the anglicism “figurar as coisas do nada” to be rewritten in more natural Portuguese"  loading="lazy" /></p>
<p>That&rsquo;s real pair programming. Nobody sits down with a colleague, drops a 15-line task, stands up, and walks away expecting magic. You stay there, you watch the execution, you see the code come together, you suggest adjustments, you catch errors while they&rsquo;re still cheap to fix, you add context you just remembered. The initial prompt is the starting point, not the final contract. Agile Vibe Coding in action: short cycles, fast feedback, continuous correction.</p>
<p>This isn&rsquo;t a formal spec. It&rsquo;s a conversation that keeps going during the work, not before it. And that&rsquo;s how it&rsquo;s always going to be, with me and with any good professional.</p>
<h2>&ldquo;Akita, you write too much detail, shouldn&rsquo;t the AI figure this out on its own?&rdquo;<span class="hx:absolute hx:-mt-20" id="akita-you-write-too-much-detail-shouldnt-the-ai-figure-this-out-on-its-own"></span>
    <a href="#akita-you-write-too-much-detail-shouldnt-the-ai-figure-this-out-on-its-own" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>That&rsquo;s the question I know is coming, so let me answer it upfront. No, the AI is not going to figure it out on its own. There is no future version of Claude, GPT, Gemini, whatever, that&rsquo;s going to guess what&rsquo;s in your head. Context doesn&rsquo;t generate itself by osmosis. If the information isn&rsquo;t in the code, in the docs, or in my prompt, it simply doesn&rsquo;t exist for the model. Full stop.</p>
<p>Neo Geo romsets being different per emulator? That&rsquo;s domain knowledge. Saturn having separate regions? Domain knowledge. My NAS having 10GbE to handle aggressive parallelism? Environment context. Having <code>docs/</code> with a SQLite catalog to survive crashes? An engineering decision I made. The model has no way to &ldquo;discover&rdquo; any of this. All of it is my job to bring to the conversation. And if I&rsquo;m not willing to spend my own time understanding these details, why would anyone, or anything, do it for me for free?</p>
<p>The rule is simple: <strong>the quality of what you get back is directly proportional to the effort you put into asking</strong>. It has always been this way. Anybody who&rsquo;s hired a contractor knows it: vague requests, ill-defined scope, &ldquo;the client knows what they want, they just can&rsquo;t explain it,&rdquo; that&rsquo;s a guaranteed recipe for a project to go off the rails. Always was. The LLM is the same thing, just faster and more patient. Think of it as modern outsourcing, not magic. A magician solves the problem without you saying anything. A contractor solves exactly what you asked for, exactly the way you asked, with the information you provided. If you didn&rsquo;t ask properly, you don&rsquo;t get it properly.</p>
<p>The frustration of people showing up here burned out on Claude or Codex is almost always the same: they asked for little, expected a lot. And when the result came back short, they blamed the tool. Never the question.</p>
<h2>That&rsquo;s why AI won&rsquo;t replace the good ones<span class="hx:absolute hx:-mt-20" id="thats-why-ai-wont-replace-the-good-ones"></span>
    <a href="#thats-why-ai-wont-replace-the-good-ones" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>That&rsquo;s exactly why I say, with full conviction, that AI agents aren&rsquo;t going to replace good professionals. They&rsquo;re going to replace people who can&rsquo;t frame their own question properly, who don&rsquo;t know what they want, who don&rsquo;t know how to validate a result, who need somebody to think for them. And look, those people were always replaceable, it&rsquo;s just that now the replacement is cheaper. The market is doing the math.</p>
<p>The good professional, meanwhile, got more productive. Uses the LLM as a pair programming partner at 2 a.m., no complaints, no union, no ego clash. Ships in a week what used to take a month. And is still the same good professional they were, because the skill that mattered, knowing what to ask for, what not to accept, and how to measure, is still 100% theirs. The tool just executes.</p>
<h2>Stark + Jarvis = Iron Man<span class="hx:absolute hx:-mt-20" id="stark--jarvis--iron-man"></span>
    <a href="#stark--jarvis--iron-man" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Let me drop a metaphor that I think wraps the whole thing up. Think about Tony Stark, in the MCU. He&rsquo;s got Jarvis, probably the most advanced fictional AI pop culture has produced in the last two decades. Voice control, context awareness, planning, parallel execution, even a sharp personality. Technologically, Jarvis would easily be the best AI agent you could imagine today.</p>
<p>And even so, Jarvis on his own doesn&rsquo;t build an Iron Man suit. Doesn&rsquo;t build the Arc reactor. Doesn&rsquo;t build any of the gear that makes Stark be Stark. All that advanced tech, with no Stark to guide it, just sits there. It&rsquo;s Stark&rsquo;s genius, his vision, his stubbornness, his engineer&rsquo;s intuition about where to push next, that makes any of it real. Jarvis executes. Stark thinks.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/15/talk-to-claude/stark-iterating.jpg" alt="Tony Stark iterating on a 3D model of the Mark III at one of his workbenches, surrounded by monitors with design variants: this is literally how real engineering work happens, AI or no AI"  loading="lazy" /></p>
<p>There&rsquo;s another detail in that story almost nobody remembers: not even Stark, with all his genius and with Jarvis riding shotgun, nails the perfect suit on the first try. He spends movie after movie iterating. The Mark I was literally bolted together in a cave out of scrap. The Mark II could fly, but iced over at altitude. The Mark III fixed the ice but weighed too much. And on it goes, each iteration fixing a specific flaw in the last, all the way to the Mark LXXXV in Endgame, his 85th suit. <strong>Eighty-five</strong>. The whole MCU Iron Man arc is basically a ten-year campaign of iterative development on one extremely complicated product.</p>
<p>That&rsquo;s exactly the mental model I think we should be using when we work with AI today. You don&rsquo;t fire off a prompt and wait for the definitive solution to pop out. You, sitting in the Stark chair, use the AI (your own Jarvis) as an accelerator on each lap of the cycle: propose, test, see what broke, fix it, propose again. The AI speeds up every lap, it doesn&rsquo;t erase the laps. If someone swapped iteration for &ldquo;one magic final answer,&rdquo; they never understood that engineering was never like that, human or no human.</p>
<p>Put Stark and Jarvis together and now you&rsquo;ve got Iron Man. Take either one away and the half that actually matters is missing.</p>
<h2>Claude Code vs. Codex: my pick today (April 2026)<span class="hx:absolute hx:-mt-20" id="claude-code-vs-codex-my-pick-today-april-2026"></span>
    <a href="#claude-code-vs-codex-my-pick-today-april-2026" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>A quick aside so this doesn&rsquo;t go unsaid: these days I&rsquo;m alternating between Claude Code and Codex, but I&rsquo;ve been leaning into Claude Code. Let me spell out why, because it&rsquo;s not about the LLM itself.</p>
<p>Claude Opus and GPT-5.4 xHigh, to me, are basically tied as models. On the hard tasks, when one of them can&rsquo;t pull it off, I swap over to the other and the other usually does. Head to head, both are strong. What separates them today is the harness, not the model.</p>
<p>And the Claude Code harness, right now, is flat-out better. Two concrete reasons: planning and parallel execution.</p>
<p><strong>Planning.</strong> Claude Code breaks a long task into subtasks, keeps a visible to-do list I can actually watch on screen, tries to run things in parallel when it can, and doesn&rsquo;t drop items. When it tells me &ldquo;done,&rdquo; I know the full list got executed, because it&rsquo;s right there for me to check. That detail changes the whole game: I trust the &ldquo;done&rdquo; without having to chase every single item with &ldquo;hey, did you actually do that other one?&rdquo;</p>
<p><strong>Parallel execution.</strong> This one is clean-cut. If Claude Code is mid-task and I hit <code>ESC</code> to throw in something else, it typically <strong>keeps the first task running and kicks off the second in parallel</strong>, unless the new request actually requires cancelling the first. Codex, same situation, stops the first to deal with the second, and doesn&rsquo;t always know how to resume the first from where it left off without me prodding it manually. With Claude Code I get to actually flow, opening new fronts while the old ones keep running. With Codex I have to go serial, be patient, and be more deliberate about every request, because interrupting is expensive.</p>
<p>None of which means Codex is bad. It&rsquo;s good. Plenty of times when Claude Code gets jammed on a hairy task I pop over to Codex and it unjams it immediately. But then my way of working shifts: smaller requests, more focused, one at a time, wait for it, move on. Works fine, it&rsquo;s just not the flow I prefer.</p>
<p>Codex is probably going to catch up on the harness side over the next few months, and the conversation will shift. But today, April 15th, 2026, if I have to pick one as my daily driver, it&rsquo;s Claude Code, and the reason is the harness, not because OpenAI&rsquo;s LLM is somehow inferior. Writing this down so that six months from now I can reread it and laugh.</p>
<h2>Which is why my company is called Codeminer 42<span class="hx:absolute hx:-mt-20" id="which-is-why-my-company-is-called-codeminer-42"></span>
    <a href="#which-is-why-my-company-is-called-codeminer-42" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>To close with a reference I&rsquo;ve been carrying around for years: my company is called <a href="https://www.codeminer42.com/"target="_blank" rel="noopener">Codeminer 42</a>. The 42 isn&rsquo;t a random number. It&rsquo;s a direct nod to Douglas Adams, from The Hitchhiker&rsquo;s Guide to the Galaxy.</p>
<p>If you don&rsquo;t know the bit, here&rsquo;s how it goes. An entire civilization builds a planet-sized supercomputer to calculate the Answer to Life, the Universe, and Everything. After millions of years of processing, the machine finally spits out the final answer: <strong>42</strong>. And then there&rsquo;s that awkward silence, because nobody there could remember what the original question had been. 42 is a technically correct answer to a question no one bothered to formulate properly. Which is to say, it means absolutely nothing. Plenty of people seem to think 42 has some deep meaning baked into it. It doesn&rsquo;t. It&rsquo;s the wrong answer to the wrong question.</p>
<p>That&rsquo;s the sharpest lesson about engineering I&rsquo;ve ever read in fiction. Every expectation that breaks, breaks because the question was wrong, not because the answer was badly executed. That&rsquo;s what Codeminer 42 exists for, and it&rsquo;s exactly what I practice day to day, whether the other side is a human client or an LLM. Before I deliver what you asked for, my job is to make you discover that what you think you want is probably wrong, to force you to rethink the assumptions you walked in with, and only once the question is sharpened do I get to hand you the best possible result. Skip that step and anything I deliver is going to be a 42.</p>
<p>So next time you read somebody saying &ldquo;I canceled Claude because it doesn&rsquo;t deliver,&rdquo; or &ldquo;GPT is way better,&rdquo; or the other way around, take a closer look at the question the person was asking. Nine times out of ten, the problem isn&rsquo;t the model. It&rsquo;s the question. And nine times out of ten, the answer that person got back was 42.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/15/talk-to-claude/deep-thought-42.webp" alt="Deep Thought from The Hitchhiker’s Guide to the Galaxy revealing the Answer to Life, the Universe, and Everything to a crowd that waited millions of years: 42"  loading="lazy" /></p>
]]></content:encoded><category>ai</category><category>claude-code</category><category>vibe-coding</category><category>agile</category><category>xp</category><category>communication</category></item><item><title>Seedance 2.0 Is Finally Out: First Impressions</title><link>https://akitaonrails.github.io/en/2026/04/15/seedance-2-0-public-launch-first-impressions/</link><guid isPermaLink="true">https://akitaonrails.github.io/en/2026/04/15/seedance-2-0-public-launch-first-impressions/</guid><pubDate>Wed, 15 Apr 2026 10:00:00 GMT</pubDate><description>&lt;p&gt;ByteDance opened up the public release of &lt;a href="https://seed.bytedance.com/en/blog/official-launch-of-seedance-2-0"target="_blank" rel="noopener"&gt;Seedance 2.0&lt;/a&gt; today, April 15. They had promised the model late last year, but access stayed locked to commercial partners for months, with delay after delay. As of today, anyone can try it.&lt;/p&gt;
&lt;p&gt;The pitch is strong. It&amp;rsquo;s a multimodal video generation model that takes text, up to 9 reference images, 3 reference video clips, and 3 audio tracks all in the same call. On top of that, it produces up to 15 seconds of output with native synced audio (dialogue, sound effects, ambient music), supports video extension, surgical editing of characters and objects, and prompt-driven camera planning. ByteDance is pitching it at commercial advertising, explainer videos, film production, e-commerce, and gaming. Translation: for content creators trying to climb out of the &amp;ldquo;random meme&amp;rdquo; tier and start producing things that look professional.&lt;/p&gt;</description><content:encoded><![CDATA[<p>ByteDance opened up the public release of <a href="https://seed.bytedance.com/en/blog/official-launch-of-seedance-2-0"target="_blank" rel="noopener">Seedance 2.0</a> today, April 15. They had promised the model late last year, but access stayed locked to commercial partners for months, with delay after delay. As of today, anyone can try it.</p>
<p>The pitch is strong. It&rsquo;s a multimodal video generation model that takes text, up to 9 reference images, 3 reference video clips, and 3 audio tracks all in the same call. On top of that, it produces up to 15 seconds of output with native synced audio (dialogue, sound effects, ambient music), supports video extension, surgical editing of characters and objects, and prompt-driven camera planning. ByteDance is pitching it at commercial advertising, explainer videos, film production, e-commerce, and gaming. Translation: for content creators trying to climb out of the &ldquo;random meme&rdquo; tier and start producing things that look professional.</p>
<p>If you want a quick visual tour before going further, here&rsquo;s a great walkthrough from Stefan, of the <a href="https://www.youtube.com/channel/UCRW08KcTVjXEmBzBsVl7XjA"target="_blank" rel="noopener">Stefan 3D AI Lab</a> channel, which I already plugged in the <a href="/en/2026/01/23/ai-3d-can-you-actually-model-3d-with-prompts-now/">post on AI 3D modeling</a>:</p>


<div class="embed-container">
  <iframe
    src="https://www.youtube.com/embed/fv9vA6RCNHU"
    title="YouTube video player"
    frameborder="0"
    allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share"
    referrerpolicy="strict-origin-when-cross-origin"
    allowfullscreen>
  </iframe>
</div>

<p>The main interface is Dreamina&rsquo;s &ldquo;Photo Studio,&rdquo; where you pick the generation mode and stack your references. So far I&rsquo;ve only tested the standard multi-reference mode.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/15/seedance/studio.png" alt="The Seedance 2.0 Photo Studio interface, with a reference upload panel and prompt configuration"  loading="lazy" /></p>
<h2>Test 1: lip sync with podcast audio<span class="hx:absolute hx:-mt-20" id="test-1-lip-sync-with-podcast-audio"></span>
    <a href="#test-1-lip-sync-with-podcast-audio" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>My first test was simple. I grabbed a few seconds of the opening bumper from my podcast <a href="/en/tags/themakitachronicles/">The M.Akita Chronicles</a>, which is generated by AI through an <a href="/en/2026/04/09/how-elevenlabs-was-not-killed-by-qwen3-tts/">ElevenLabs v3 pipeline</a>, passed in my new anime-style avatar as a reference image, and asked Seedance to make the character lip-sync and gesture along with that audio.</p>
<div style="max-width: 100%; margin: 1em 0;">
  <video controls playsinline style="width: 100%; border-radius: 8px;">
    <source src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/15/seedance/test1.mp4" type="video/mp4">
  </video>
  <em>Anime avatar lip-syncing to the M.Akita Chronicles bumper</em>
</div>
<p>The result is decent. All of that came from a single still image. The lip sync follows the speech reasonably well, the expression has some life to it, and the hand gesture lands at the right beat. But this is a YouTube short or Instagram clip at best. Not the kind of animation you&rsquo;d open a serious video with.</p>
<h2>The feature that actually matters: video reference<span class="hx:absolute hx:-mt-20" id="the-feature-that-actually-matters-video-reference"></span>
    <a href="#the-feature-that-actually-matters-video-reference" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>And here&rsquo;s what I think is the only real use of these models for actual work: feeding in <strong>video as a reference</strong>. Text alone is never going to be precise enough. You can describe a scene with whatever vocabulary you want, swap adjectives, specify lens, angle, framing, and every generation will still hand you back something subtly different. It&rsquo;s a slot machine. Great for memes, terrible for production.</p>
<p>Most amateurs see one random clip that comes out nice on the first try and go straight to flooding their X and Instagram feeds with it. That&rsquo;s not control. That&rsquo;s luck. For film, for ads, for show openers, for anything where the next shot has to match the previous one, you have to be able to tell the model: &ldquo;do exactly this motion, with exactly this camera.&rdquo; That&rsquo;s where reference video changes the game, because it replaces hours of 3D modeling, rigging, animation, and rendering that would cost a fortune in a studio.</p>
<p>For my test, I recycled the model I&rsquo;d already generated in the <a href="/en/2026/01/23/ai-3d-can-you-actually-model-3d-with-prompts-now/">post on AI 3D with Hunyuan and Nano Banana</a>, where I tested how far you can take 3D modeling from a prompt and even hit &ldquo;print&rdquo; on the result. But for something more cinematic I wanted a model with a proper animation cycle. So I grabbed the <a href="https://sketchfab.com/3d-models/darth-talon-4338105e09704f359c6afcab7e5fce10"target="_blank" rel="noopener">Darth Talon</a> model from Sketchfab, which has a short, well-made motion sequence, just because it looked cool.</p>
<p>I imported the FBX into Blender, set up a basic camera and lighting, and did a quick Cycles render. Nothing pretty about it, just a sketch to feed in as motion and framing reference.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/15/seedance/blender.png" alt="The Darth Talon model imported into Blender with camera and lighting set up for a rough Cycles render"  loading="lazy" /></p>
<div style="max-width: 100%; margin: 1em 0;">
  <video controls playsinline style="width: 100%; border-radius: 8px;">
    <source src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/15/seedance/blender-render.mp4" type="video/mp4">
  </video>
  <em>The quick Cycles render that's about to become motion reference</em>
</div>
<p>I uploaded that render to Seedance as a video reference along with three t-pose images of the character (front, back, left side) for visual consistency:</p>
<div style="display: flex; gap: 8px; flex-wrap: wrap; margin: 1em 0; justify-content: center;">
  <img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/15/seedance/front.jpg" alt="Front t-pose" style="flex: 1 1 30%; min-width: 140px; max-width: 32%; border-radius: 8px;" />
  <img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/15/seedance/back.jpg" alt="Back t-pose" style="flex: 1 1 30%; min-width: 140px; max-width: 32%; border-radius: 8px;" />
  <img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/15/seedance/left.jpg" alt="Left side t-pose" style="flex: 1 1 30%; min-width: 140px; max-width: 32%; border-radius: 8px;" />
</div>
<p>First attempt, with a generic prompt:</p>
<div style="max-width: 100%; margin: 1em 0;">
  <video controls playsinline style="width: 100%; border-radius: 8px;">
    <source src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/15/seedance/test2.mp4" type="video/mp4">
  </video>
  <em>First generation: the character stayed consistent, but the motion and camera missed the reference</em>
</div>
<p>Not bad. The character keeps its identity, the lighting got a serious upgrade over my raw render, and the clip is coherent. But the motion and camera position drifted noticeably from the reference. The model basically read it as &ldquo;there&rsquo;s a character here that looks like that doing something like that,&rdquo; and improvised the rest.</p>
<p>Then I remembered Stefan&rsquo;s tutorial: the trick is to literally drop something like <code>Follow exact motion and camera from reference video</code> into the prompt. Tried again:</p>
<div style="max-width: 100%; margin: 1em 0;">
  <video controls playsinline style="width: 100%; border-radius: 8px;">
    <source src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/15/seedance/test3.mp4" type="video/mp4">
  </video>
  <em>Second generation: explicit instruction to follow the reference's motion and camera</em>
</div>
<p>A lot better. It&rsquo;s clearly following the reference choreography, the camera angle matches in several beats, and the character looks more polished than anything I could hand-render in a quick Cycles sketch. But it&rsquo;s still not a frame-for-frame match. In some shots the model is still taking liberties with timing, position, and framing. There are other prompting tricks and parameters to play with, but I stopped here to write this up while the launch is still fresh.</p>
<h2>ComfyUI, Runway, and the rest of the ecosystem<span class="hx:absolute hx:-mt-20" id="comfyui-runway-and-the-rest-of-the-ecosystem"></span>
    <a href="#comfyui-runway-and-the-rest-of-the-ecosystem" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Seedance 2.0 also already shipped <a href="https://blog.comfy.org/p/seedance-20-is-now-available-in-comfyui"target="_blank" rel="noopener">as a ComfyUI cloud node</a>, which is great for anyone who wants to drop the model into a bigger pipeline without bouncing between web UIs. If you&rsquo;ve already got an image-gen workflow built in ComfyUI, this is genuinely handy.</p>
<p>And for anyone who wants volume without doing the per-generation credit math, Runway is offering Seedance 2.0 as part of their <a href="https://www.mindstudio.ai/blog/seedance-2-0-runway-unlimited-plan"target="_blank" rel="noopener">Unlimited plan</a>, which runs around $76/month annual or $95/month monthly. &ldquo;Unlimited&rdquo; means no per-generation charge, but you do trade for lower queue priority at peak hours and an output storage cap. If you iterate a lot, the subscription model makes way more sense than buying credits one at a time.</p>
<h2>The price question<span class="hx:absolute hx:-mt-20" id="the-price-question"></span>
    <a href="#the-price-question" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Seedance&rsquo;s own pricing is credit-based. The Standard plan is $24.90/month annual or $49.90/month monthly, and gets you 2,500 credits a month. Every 10 seconds of generated video burns 6 credits.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/15/seedance/pricing.png" alt="Seedance 2.0 pricing table, with plans from Standard up to the top tier and per-second credit costs"  loading="lazy" /></p>
<p>The math is brutal. With 2,500 credits on Standard, you get about 416 ten-second generations a month. Sounds like a lot, until you remember my own test: it took at least two tries to get close to what I wanted, and even then it wasn&rsquo;t a one-to-one match. In practice, plan on 3 to 5 regenerations per usable scene. That drops your real number to something like 80 to 140 short clips per month. Maybe enough for a solo creator doing short-form content. Nowhere near enough for production that needs minutes of polished, continuous video.</p>
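<p>For the skeptical, the napkin math as plain shell arithmetic:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash">echo $(( 2500 / 6 ))                  # 416 ten-second generations a month
echo $(( 416 / 5 )) $(( 416 / 3 ))    # 83 to 138 usable scenes at 3-5 takes each</code></pre></div>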
<p>The higher tiers scale credits roughly linearly, but the same problem holds: at today&rsquo;s price, Seedance 2.0 is still a tool for short clips and well-made memes, or for pre-vis sketches before you send the real thing to an actual studio. It doesn&rsquo;t replace continuous professional production. For that use case, Runway with Unlimited is honestly the cheaper option in practice. <a href="https://seedance2.ai/pricing"target="_blank" rel="noopener">Full pricing table here</a>.</p>
<h2>What nobody wants to talk about: restrictions and deepfakes<span class="hx:absolute hx:-mt-20" id="what-nobody-wants-to-talk-about-restrictions-and-deepfakes"></span>
    <a href="#what-nobody-wants-to-talk-about-restrictions-and-deepfakes" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Two things I don&rsquo;t want to leave out. First: Seedance 2.0 took its sweet time getting to the public partly because ByteDance is being a lot more conservative than the competition when it comes to moderation. The model blocks generation that uses real people as reference (celebrities, politicians, public figures) and filters strong-IP characters from Disney, Marvel/DC, Nintendo, and registered trademarks generally. If you want the details of what passes, what doesn&rsquo;t, and why ByteDance went that way, <a href="https://www.mindstudio.ai/blog/seedance-2-0-content-restrictions-workarounds"target="_blank" rel="noopener">this MindStudio piece breaks it down</a>. It makes sense: ByteDance operates across dozens of jurisdictions, lives under heavy regulatory scrutiny worldwide, and has the elephant in the room of Hollywood pushing the entire industry with copyright suits over training data and generation. The release got delayed and arrived with heavier filters precisely because of that.</p>
<p>Second point, and this is the one that matters more: for you, the average reader, the message is that <strong>deepfakes are no longer hypothetical</strong>. Photos on the internet stopped counting as proof a long time ago, because image models like Nano Banana, Hunyuan, and the rest deliver realism that fools anyone. Now video sits in the same bucket. What we just saw above, where a couple of references and two prompts get you a coherent animated scene in minutes, is just the start. The average person, with no training, isn&rsquo;t going to be able to tell generated video from real video at a glance anymore. And Seedance&rsquo;s filters only block above-board, legitimate use. Anyone with bad intent will always find a path through open source models, workarounds, or less strict platforms. That&rsquo;s the world we&rsquo;re in now, and the sooner we collectively accept it, the better off we are.</p>
<h2>No, this doesn&rsquo;t replace real artists<span class="hx:absolute hx:-mt-20" id="no-this-doesnt-replace-real-artists"></span>
    <a href="#no-this-doesnt-replace-real-artists" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Let me cut off the usual speech before it shows up in the comments. No, Seedance doesn&rsquo;t replace VFX artists, directors, editors, none of that. This is a hammer. The tool is the same for everybody. What comes out the other end depends entirely on who&rsquo;s swinging it.</p>
<p>Anyone who&rsquo;s actually studied framing, pacing, continuity, art direction, editing, color, composition is going to use Seedance to 10x the work they were already capable of delivering. Anyone who hasn&rsquo;t studied any of that is going to crank out the same Instagram meme they were already cranking out, just 10x faster. <strong>The AI reflects who you are.</strong> If you&rsquo;re a real artist, the tool multiplies you. If you only think you are, the tool multiplies the mediocrity. There&rsquo;s no miracle here.</p>
<p>This is exactly why the whole &ldquo;now everyone can be a film director with AI&rdquo; wave is the same song we&rsquo;ve heard with the digital camera, with Photoshop, with After Effects, with Premiere, and with every powerful tool that&rsquo;s shown up in the last 30 years. The technical bar drops. The bar of taste, reference, and study sits exactly where it always has. Good people get more productive. Everyone else still needs to hire real ones to do real work.</p>
<h2>Where this goes next<span class="hx:absolute hx:-mt-20" id="where-this-goes-next"></span>
    <a href="#where-this-goes-next" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Seedance 2.0 is clearly a step in the right direction, but it&rsquo;s not the tool that replaces professional production yet. Text alone stays imprecise, video reference helps but still drifts, and the price per minute of polished output is high relative to what you actually get. The good news is that this story always rhymes: Google has Veo, Runway has its own model and is also reselling Seedance, Kling and Hailuo from China keep pushing, and the open source models are gaining ground. OpenAI had Sora, but they shut it down. As competition tightens, quality goes up and price comes down. A year or two from now we&rsquo;ll be looking back at this post and finding what I just showed pretty quaint. Same as it ever was.</p>
]]></content:encoded><category>ai</category><category>video</category><category>seedance</category><category>bytedance</category><category>vfx</category><category>blender</category><category>deepfake</category><category>themakitachronicles</category></item><item><title>An Emulation Distrobox with Claude Code</title><link>https://akitaonrails.github.io/en/2026/04/11/emulation-distrobox-with-claude-code/</link><guid isPermaLink="true">https://akitaonrails.github.io/en/2026/04/11/emulation-distrobox-with-claude-code/</guid><pubDate>Sat, 11 Apr 2026 18:00:00 GMT</pubDate><description>&lt;p&gt;I&amp;rsquo;ve loved old videogames since before many of you were born. My first contact was back in the Atari era, in the early 80s. Then came the 8-bit micros, the 90s arcades, the 16, 32 and 64-bit consoles, and I kept following all of it. For me, nostalgia is not a cute Instagram folder. It&amp;rsquo;s an archive. ROMs, BIOS files, dumps, patches, DLC, firmware, saves. Over the years I kept everything on my NAS. At this point I have terabytes of it under &lt;code&gt;/mnt/terachad/Emulators&lt;/code&gt;.&lt;/p&gt;</description><content:encoded><![CDATA[<p>I&rsquo;ve loved old videogames since before many of you were born. My first contact was back in the Atari era, in the early 80s. Then came the 8-bit micros, the 90s arcades, the 16, 32 and 64-bit consoles, and I kept following all of it. For me, nostalgia is not a cute Instagram folder. It&rsquo;s an archive. ROMs, BIOS files, dumps, patches, DLC, firmware, saves. Over the years I kept everything on my NAS. At this point I have terabytes of it under <code>/mnt/terachad/Emulators</code>.</p>
<p>On my YouTube channel I even used old games to teach computer science fundamentals. In <a href="https://www.youtube.com/watch?v=hYJ3dvHjeOE&amp;pp=ygUUYWtpdGFuZG8gc3VwZXIgbWFyaW8%3D"target="_blank" rel="noopener">this Akitando episode about Super Mario and old computers</a>, I explain the 6502, memory maps, the PPU, hardware constraints, and why those games were programmed the way they were. If you never watched it, start there:</p>


<div class="embed-container">
  <iframe
    src="https://www.youtube.com/embed/hYJ3dvHjeOE"
    title="YouTube video player"
    frameborder="0"
    allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share"
    referrerpolicy="strict-origin-when-cross-origin"
    allowfullscreen>
  </iframe>
</div>

<p>The problem is always the same: every few years I decide to set up a new emulation machine, download the main emulators again, and there I go repeating the same masochistic ritual. <code>PCSX2</code>, <code>RPCS3</code>, <code>Eden</code>, <code>Azahar</code>, <code>Dolphin</code>, <code>RetroArch</code>, <code>Flycast</code>, <code>shadPS4</code>, and so on. Each one with its quirks. Each one with its own particular way of wasting hours of your life.</p>
<p>In theory, setting all this up should be fun. In practice, it takes days. And that&rsquo;s not an exaggeration. <code>Dolphin</code> still manages to be one of the worst because GameCube and Wii controller setups have no standard. <code>RPCS3</code> requires per-game tuning. <code>Eden</code> needs DLC, updates and cheats in the right place. <code>shadPS4</code> is a festival of trial and error. When I&rsquo;d finally finish configuring everything, I didn&rsquo;t feel like playing anything anymore. I just wanted to close the menus and go do something else.</p>
<p>I&rsquo;ve been known to say that maybe the fun was the tuning itself. I don&rsquo;t think that anymore. After repeating this process too many times, I&rsquo;m done. I want to play the games, not fill out forms hidden inside emulator GUIs.</p>
<h2>The problem was never Linux<span class="hx:absolute hx:-mt-20" id="the-problem-was-never-linux"></span>
    <a href="#the-problem-was-never-linux" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>I&rsquo;ve solved this kind of thing in different ways in the past. Years ago I ran Linux on the host and a virtualized Windows with GPU passthrough to play inside the VM. At the time that made sense. I made a whole video about it:</p>


<div class="embed-container">
  <iframe
    src="https://www.youtube.com/embed/IDnabc3DjYY"
    title="YouTube video player"
    frameborder="0"
    allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share"
    referrerpolicy="strict-origin-when-cross-origin"
    allowfullscreen>
  </iframe>
</div>

<p>But that was before Valve, Proton, modern Wine and the whole open source community pushed the Linux desktop to a much better place. Today you can avoid Windows in most cases. In the other video below I explain the evolution of CPU, GPU, OpenGL, Vulkan, and why we ended up here:</p>


<div class="embed-container">
  <iframe
    src="https://www.youtube.com/embed/JEp7ozWqIps"
    title="YouTube video player"
    frameborder="0"
    allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share"
    referrerpolicy="strict-origin-when-cross-origin"
    allowfullscreen>
  </iframe>
</div>

<p>My gaming mini-PC with the RTX 4090 is still my main Steam machine, especially now that I&rsquo;ve built a <a href="/en/2026/04/01/my-sim-racing-cockpit-formula-fx1/">proper sim racing cockpit</a>. On it I use <a href="https://www.emudeck.com/"target="_blank" rel="noopener">EmuDeck</a>, which was born to automate installing and configuring emulators on the Steam Deck, then grew Windows support, and helps cut the complexity of setting up this kind of environment. But my main desktop is also way too powerful to leave underused. It has an RTX 5090, and today I use that GPU mostly to test LLMs and for benchmarking, as I wrote in <a href="/en/2026/04/05/testing-llms-open-source-and-commercial-can-anyone-beat-claude-opus/">Testing Open Source and Commercial LLMs</a>. Leaving it idle for gaming would be a waste.</p>
<p>On Linux, though, I wanted something else. I didn&rsquo;t want a black box doing everything for me without my knowing exactly what was being changed, especially on my main PC, which is my work machine. I wanted my own setup, handmade in the right sense: not clicking through GUI after GUI, but understanding what&rsquo;s being configured, keeping the files under my control, and being able to rebuild everything the way I like. I know the NixOS folks will jump in here to say &ldquo;just use Nix&rdquo;; I&rsquo;ll explain below why I ruled that out. I also didn&rsquo;t want to turn my work machine into a carnival of emulator packages, GTK themes, wrappers, launchers, weird runtimes and configs buried in <code>~/.config</code>. Nor did I want dual boot or a VM. In 2026, the best answer for this, in my opinion, is <a href="https://distrobox.it/"target="_blank" rel="noopener">Distrobox</a>.</p>
<p>From what the project itself explains, Distrobox is a wrapper on top of <code>podman</code>, <code>docker</code> or <code>lilipod</code> to create containers tightly integrated with the host, with access to <code>HOME</code>, Wayland/X11, audio, USB devices, external storage and GPU. In other words: exactly the kind of pragmatic isolation I wanted. It&rsquo;s not a security boundary, and the project site makes that clear. Don&rsquo;t use it thinking about high-security sandboxing. Use it thinking about separating environments without paying the price of a VM.</p>
<p>And before anyone repeats the usual confusion: a container is not a virtual machine. I explain this calmly in this other episode:</p>


<div class="embed-container">
  <iframe
    src="https://www.youtube.com/embed/85k8se4Zo70"
    title="YouTube video player"
    frameborder="0"
    allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share"
    referrerpolicy="strict-origin-when-cross-origin"
    allowfullscreen>
  </iframe>
</div>

<h2>The setup<span class="hx:absolute hx:-mt-20" id="the-setup"></span>
    <a href="#the-setup" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>The idea is simple: a vanilla Arch Linux inside a Distrobox called <code>gaming</code>, with NAS paths mounted read-only, Steam library accessible where it makes sense, and all emulators installed inside. Since the container has access to the NVIDIA GPU, audio, USB and other peripherals, I can separate work and gaming on the same machine without practical performance penalty.</p>
<p>The initial bootstrap looks roughly like this:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">distrobox create --name gaming --image archlinux:latest --nvidia <span class="se">\
</span></span></span><span class="line"><span class="cl">  --home /mnt/data/distrobox/gaming <span class="se">\
</span></span></span><span class="line"><span class="cl">  --volume /mnt/data/steam:/mnt/data/steam <span class="se">\
</span></span></span><span class="line"><span class="cl">  --volume /mnt/terachad/Emulators:/mnt/terachad/Emulators:ro</span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>Even that first step has a catch. <code>archlinux:latest</code> inside Distrobox comes without <code>[multilib]</code> in <code>pacman.conf</code>, with a misconfigured <code>sudoers</code>, and with the old <code>--nvidia</code> problem: the host driver libs get bind-mounted read-only, so any install that depends on <code>nvidia-utils</code> breaks with file conflicts. Just that alone is the kind of headache that, a few years ago, would make me open half a dozen tabs, a bunch of terminals, and spend a whole night in trial and error.</p>
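<p>For reference, this is roughly the shape of those fixes. This is my own illustration, not the repo&rsquo;s actual tasks (the Ansible roles do the equivalent):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"># Run as root inside the box, e.g.: podman exec -it -u root gaming bash
# 1. Enable [multilib] by uncommenting its two lines, then refresh:
sed -i '/\[multilib\]/,/Include/ s/^#//' /etc/pacman.conf
pacman -Sy
# 2. Give the wheel group working sudo:
echo '%wheel ALL=(ALL:ALL) ALL' > /etc/sudoers.d/wheel
# 3. For nvidia-utils, the usual trick is an empty package that only *provides*
#    nvidia-utils, so pacman is satisfied while the host driver libs that
#    --nvidia bind-mounts do the real work. See the repo for the real fix.</code></pre></div>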
<p>This time I did it differently.</p>
<h2>Claude Code as an infrastructure assistant<span class="hx:absolute hx:-mt-20" id="claude-code-as-an-infrastructure-assistant"></span>
    <a href="#claude-code-as-an-infrastructure-assistant" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>I&rsquo;ve been getting more and more comfortable using Claude Code as my infrastructure assistant on my personal machines. I wouldn&rsquo;t do this blindly on a client server. But on my own machine, in an environment I can destroy and rebuild however many times I want, it makes all the sense in the world. I wrote about this in <a href="/en/2026/03/31/migrating-my-home-server-with-claude-code/">Migrating my Home Server with Claude Code</a>, when I used Claude to install and configure openSUSE MicroOS, Docker, NFS, firewall, hardening, and all the rest without me having to remember annoying shell commands.</p>
<p>So I thought: why not do the same thing here?</p>
<p>That&rsquo;s exactly what I did. I started with a sequence of objective prompts, always asking for two things at once: do the work, and document enough for me to rebuild everything later. The most important history ended up in <a href="https://github.com/akitaonrails/distrobox-gaming/blob/master/docs/distrobox-gaming-prompts.md"target="_blank" rel="noopener"><code>docs/distrobox-gaming-prompts.md</code></a>.</p>
<p>A clarification worth making: those prompts on GitHub aren&rsquo;t my raw messages, the way they came out in real time. After everything worked, I asked Claude itself to rewrite the prompts in a much more organized and detailed way, purely for documentation. The original prompts were much simpler and much less specific. I described the goal and let Claude figure out paths, config files, formats and commands on its own.</p>
<p>The first prompt created the box with <code>--nvidia</code>, a separate <code>--home</code>, and correct mounts. The second solved the three classic problems of Arch inside Distrobox: <code>sudo</code>, <code>multilib</code>, and the <code>nvidia-utils</code> dummy package. The third installed the gaming base, including a detail I would have easily forgotten if I were doing it by hand: <code>pipewire-pulse</code>, needed so several emulators don&rsquo;t end up mute or out of sync.</p>
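<p>That audio detail is a one-liner, but it saves a night of debugging silent emulators (package name straight from the prompt above):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash">sudo pacman -S --needed pipewire-pulse</code></pre></div>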
<p>The point isn&rsquo;t the text itself. The point is the method. I don&rsquo;t need to remember the exact order, the precise options, or where the docs for each detail are. I describe the goal and let the agent do the heavy lifting. I stay in the tech-lead role: watching, reviewing, course-correcting when needed, and telling it to keep going.</p>
<h2>The GitHub project<span class="hx:absolute hx:-mt-20" id="the-github-project"></span>
    <a href="#the-github-project" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>The setup lives in <a href="https://github.com/akitaonrails/distrobox-gaming"target="_blank" rel="noopener">akitaonrails/distrobox-gaming</a>, structured as an Ansible project. 17 roles cover everything from box creation, package bootstrap, config seeding, DLC installation, controller setup, to post-setup verification. The main playbooks are:</p>
<ul>
<li><code>site.yml</code>: full setup from scratch</li>
<li><code>reset-configs.yml</code>: reset configs only (useful when you want to wipe tuning and redeploy)</li>
<li><code>backup.yml</code> / <code>restore.yml</code>: container snapshot and restore via <code>podman commit</code></li>
<li><code>refresh-shadps4.yml</code>: standalone shadPS4 update</li>
<li><code>install-xenia.yml</code>: optional Xenia Manager install (Wine prefix)</li>
</ul>
<p>Full walkthrough is in <a href="https://github.com/akitaonrails/distrobox-gaming/blob/master/docs/rebuild-runbook.md"target="_blank" rel="noopener"><code>docs/rebuild-runbook.md</code></a>. Package decisions (why AUR instead of Flatpak, for example) are in <a href="https://github.com/akitaonrails/distrobox-gaming/blob/master/docs/distrobox-gaming-packages.md"target="_blank" rel="noopener"><code>docs/distrobox-gaming-packages.md</code></a>.</p>
<h3>Why not Nix?<span class="hx:absolute hx:-mt-20" id="why-not-nix"></span>
    <a href="#why-not-nix" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Someone will ask. &ldquo;You&rsquo;re recreating a reproducible environment. Why not use NixOS, or at least <code>nix-shell</code>/flakes?&rdquo;</p>
<p>I considered it and ruled it out. The emulator ecosystem I need lives on the AUR. AUR <code>-bin</code> packages are wrappers around AppImages that ship the official upstream binary without compiling anything. <code>rpcs3-bin</code>, <code>pcsx2-latest-bin</code>, <code>eden-bin</code>, <code>duckstation-qt-bin</code>, <code>shadps4-bin</code> — these binaries track upstream practically on release day. On Nixpkgs, many of these emulators are weeks or months behind release, and some don&rsquo;t exist. Creating a Nix overlay for every emulator I need, maintaining it, and dealing with compatibility patches when upstream changes is work I don&rsquo;t want to take on.</p>
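<p>For the record, pulling those by hand with an AUR helper looks like this (paru shown as an example; in the repo this is driven by Ansible):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"># All package names are the ones cited above; they wrap the official AppImages
paru -S rpcs3-bin pcsx2-latest-bin eden-bin duckstation-qt-bin shadps4-bin</code></pre></div>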
<p>There&rsquo;s also the mental model. I don&rsquo;t want to learn the Nix language, the derivation system and the evaluation model just to set up a personal gaming environment. I already know Ansible, roles are folders with YAML tasks, and debugging is <code>ansible-playbook -vvv</code>. When something goes wrong, I read the error, open the task that failed, and know exactly where to fix it. If I needed the same level of reproducibility at scale, across a cluster, with atomic rollback, then Nix would have an argument. For a distrobox with emulators on my personal machine, Ansible solves it with much less friction.</p>
<p>Flatpak was also tested and ruled out. Flatpak&rsquo;s <code>bwrap</code> (bubblewrap) doesn&rsquo;t nest properly inside Docker&rsquo;s mount namespaces, so Flatpak apps simply don&rsquo;t run inside distrobox. The alternative would be installing Flatpak on the host, but then you&rsquo;re back to polluting the work machine with gaming packages. Plus, Flatpak versions of emulators lag well behind AUR — the AUR&rsquo;s <code>pcsx2-latest-bin</code> tracks the 2.7.x AppImage, while Flathub still ships v2.6.3.</p>
<h2>Platforms covered<span class="hx:absolute hx:-mt-20" id="platforms-covered"></span>
    <a href="#platforms-covered" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>The setup covers 12 platforms today:</p>
<table>
  <thead>
      <tr>
          <th>Platform</th>
          <th>Emulator</th>
          <th>Renderer</th>
          <th>Highlight</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>PS1</td>
          <td>DuckStation</td>
          <td>Vulkan, 8x upscale</td>
          <td>JINC2 filter, PGXP, widescreen hack</td>
      </tr>
      <tr>
          <td>PS2</td>
          <td>PCSX2 2.7.x</td>
          <td>Vulkan, 4x SSAA</td>
          <td>FXAA + PCRTC antiblur, Xbox bindings</td>
      </tr>
      <tr>
          <td>PS3</td>
          <td>RPCS3</td>
          <td>Vulkan</td>
          <td>Curated per-game configs via API</td>
      </tr>
      <tr>
          <td>PS4</td>
          <td>shadPS4</td>
          <td>Vulkan</td>
          <td>Driveclub-focused (experimental)</td>
      </tr>
      <tr>
          <td>PS Vita</td>
          <td>Vita3K</td>
          <td>—</td>
          <td>Default seeding</td>
      </tr>
      <tr>
          <td>GameCube</td>
          <td>Dolphin</td>
          <td>Vulkan</td>
          <td>8BitDo Ultimate 2 profiles</td>
      </tr>
      <tr>
          <td>Wii</td>
          <td>Dolphin</td>
          <td>Vulkan</td>
          <td>Classic Controller + Nunchuk</td>
      </tr>
      <tr>
          <td>Wii U</td>
          <td>Cemu</td>
          <td>Vulkan</td>
          <td>Baseline seed only-if-missing</td>
      </tr>
      <tr>
          <td>Switch</td>
          <td>Eden</td>
          <td>Vulkan</td>
          <td>Atmosphere DLC + cheats</td>
      </tr>
      <tr>
          <td>Dreamcast</td>
          <td>Flycast</td>
          <td>Vulkan, 2880p</td>
          <td>Hi-res wrapper, widescreen</td>
      </tr>
      <tr>
          <td>OG Xbox</td>
          <td>xemu</td>
          <td>—</td>
          <td>BIOS/HDD via symlink</td>
      </tr>
      <tr>
          <td>Xbox 360</td>
          <td>Xenia (Wine)</td>
          <td>Vulkan</td>
          <td>Project Forza Plus, PGR3/4</td>
      </tr>
  </tbody>
</table>
<p>RetroArch sits on top to cover the rest (NES, SNES, Genesis, N64, arcade, etc.), with 21 buildbot cores and 8 asset packs (databases, shaders, cheats, overlays, autoconfig).</p>
<p>Frontend is <a href="https://es-de.org/"target="_blank" rel="noopener">ES-DE</a>. Custom system definitions live in <code>ansible/group_vars/all/esde.yml</code>, and each emulator comes with a wrapper that sets the right GPU environment variables (more on that in a sec).</p>
<h2>GPU hardforce for NVIDIA<span class="hx:absolute hx:-mt-20" id="gpu-hardforce-for-nvidia"></span>
    <a href="#gpu-hardforce-for-nvidia" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>On a hybrid system with NVIDIA dGPU + AMD iGPU (my case on the desktop), Vulkan sometimes picks the iGPU on its own and you end up with an emulator running at 15 fps while the 5090 sits idle. The fix is to force NVIDIA&rsquo;s Vulkan ICD in every launcher.</p>
<p>Each desktop entry rendered by Ansible exports <code>VK_ICD_FILENAMES</code> pointing to NVIDIA&rsquo;s ICD, and the ES-DE wrapper does the same before launching any core. Combined with the <code>--nvidia</code> flag at box creation, which bind-mounts host drivers read-only, the result is predictable: Vulkan always on the NVIDIA dGPU, regardless of system state at runtime.</p>
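<p>Conceptually, each wrapper boils down to something like this minimal sketch (my illustration; the repo templates the real ones via Ansible, and the ICD path below is the usual Arch location — verify yours):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash">#!/usr/bin/env bash
# Force the Vulkan loader to only see NVIDIA's ICD, then run the real command
export VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/nvidia_icd.json
exec "$@"</code></pre></div>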
<p>This mostly matters for hybrid setups. On a desktop with a single GPU, the ICD filtering is placebo, but it doesn&rsquo;t hurt.</p>
<h2>PS2: PCSX2 with per-game tuning<span class="hx:absolute hx:-mt-20" id="ps2-pcsx2-with-per-game-tuning"></span>
    <a href="#ps2-pcsx2-with-per-game-tuning" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>The Gran Turismo line is my declared addiction. Each version needs its own specific tweak. The PCSX2 setup today has sane global configs (Vulkan, 4x SSAA, widescreen 16:9, Xbox-style bindings, FXAA + PCRTC antiblur, 16x anisotropic, Automatic deinterlace) and per-game INIs for the titles I actually play:</p>
<ul>
<li><strong>Gran Turismo 3 A-spec</strong> (SCUS-97102 + bundle PBPX-95503): widescreen pnach, optional retexture pack (disabled by default because NFS streaming causes flicker in car-showcase cutscenes)</li>
<li><strong>Gran Turismo 4 USA</strong> (SCUS-97328): Silentwarior112 HD HUD, full retexture pack, Silent&rsquo;s trigger/cam patches, widescreen pnach</li>
<li><strong>Gran Turismo 4 Spec II</strong> (SCUS-97436, CRC <code>4CE521F2</code>): special case — the vanilla GT4 HD packs are linked via symlink because Spec II has the same structure but a different CRC. Widescreen pnach renamed to match the CRC. Deinterlace mode 8 (Adaptive TFF), ShadeBoost tuning (+10 saturation, +3 brightness, +2 contrast) to fix pause-interlace ghosting and the anemic upscale saturation</li>
<li><strong>Enthusia Professional Racing</strong> (SLUS-20967): HD textures + widescreen</li>
<li><strong>Ridge Racer V</strong> (SLUS-20002): widescreen + no-interlace pnach</li>
</ul>
<p>Silent&rsquo;s and community pnach files are downloaded automatically by the <code>pcsx2_textures</code> role, which also deploys the cheats pack from the NAS and installs the Spec II HD packs via symlink. Widescreen is enabled per-CRC via <code>gamesettings</code> (PCSX2 identifies games by CRC before applying pnach).</p>
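<p>To make the naming concrete, here is a hypothetical layout (illustrative paths, not the repo&rsquo;s actual tree), using the Spec II serial and CRC from the list above:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"># per-game overrides, keyed by serial + CRC
ls ~/.config/PCSX2/gamesettings/SCUS-97436_4CE521F2.ini
# widescreen pnach, renamed to match the Spec II CRC
ls ~/.config/PCSX2/cheats/4CE521F2.pnach</code></pre></div>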
<p>On PCSX2 versioning: the AUR&rsquo;s <code>pcsx2-latest-bin</code> tracks the 2.7.x line, which is where all the modern features live (texture replacement, FXAA, PCRTC antiblur). The default <code>pcsx2</code> in Arch repos is still on 2.6.3. The gap is over 250 commits of fixes and new features.</p>
<h2>PS3: RPCS3 with curated per-game configs<span class="hx:absolute hx:-mt-20" id="ps3-rpcs3-with-curated-per-game-configs"></span>
    <a href="#ps3-rpcs3-with-curated-per-game-configs" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>RPCS3 has an annoying detail: per-game configs live in <code>~/.config/rpcs3/custom_configs/config_&lt;SERIAL&gt;.yml</code>. The file format has a version, and if you hand-write YAML with the wrong version, RPCS3 silently rejects it and falls back to defaults. That cost me time to figure out.</p>
<p>The <code>rpcs3_per_game_configs</code> role handles it like this:</p>
<ol>
<li>Queries the <a href="https://rpcs3.net/compatibility"target="_blank" rel="noopener">RPCS3 community compatibility DB</a> for recommended presets per title</li>
<li>Applies curated overrides on top for titles I know need specific tuning (typically GT5 and GT6)</li>
<li>Saves the YAML with the exact filename convention (<code>config_&lt;SERIAL&gt;.yml</code>, <code>config_</code> prefix mandatory)</li>
</ol>
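<p>A quick way to verify the convention held, using the GT serials listed below:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash">distrobox enter gaming -- ls ~/.config/rpcs3/custom_configs/
# expect names like: config_BCUS98114.yml  config_BCES01893.yml</code></pre></div>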
<p>For the Gran Turismo line:</p>
<ul>
<li><strong>GT5</strong> (BCUS98114, BCES00569): Resolution Scale 300%, Shader Precision Ultra, Force High Precision Z, SPU XFloat Accurate, Multithreaded RSX. The combo kills the typical RSX dithering without breaking the GT-series safety config (WCB/RCB off).</li>
<li><strong>GT6</strong> (BCES01893 + regional variants): special case, <strong>version pinned to v1.05</strong>. Patches 1.06+ regress with black surfaces on cars in the garage menu. The <code>extract_ps3_dlc.py</code> script has a per-title PATCH ceiling specifically to pin GT6 at 1.05 even when PSN serves newer patches. Also, Force CPU Blit is mandatory (without it, full-screen flicker). The trade-off is that the rear-view mirror stays permanently black — turning WCB on would restore the mirror but downgrade to native 720p. I&rsquo;ll take no mirror.</li>
</ul>
<p>More details on those two in <a href="https://github.com/akitaonrails/distrobox-gaming/blob/master/docs/gt5-rpcs3.md"target="_blank" rel="noopener"><code>docs/gt5-rpcs3.md</code></a> and <a href="https://github.com/akitaonrails/distrobox-gaming/blob/master/docs/gt6-rpcs3.md"target="_blank" rel="noopener"><code>docs/gt6-rpcs3.md</code></a>.</p>
<h3>PS3 update checker<span class="hx:absolute hx:-mt-20" id="ps3-update-checker"></span>
    <a href="#ps3-update-checker" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>To avoid opening RPCS3&rsquo;s GUI and clicking <code>Download Game Updates</code> game by game, I wrote a Python script: <code>check_ps3_updates.py</code>. It walks the games installed in <code>dev_hdd0/game</code>, parses <code>PARAM.SFO</code> to get the current version per title, queries the PSN update server (<code>a0.ww.np.dl.playstation.net</code>), and compares against the local patch cache. <code>--list</code> shows the diff; <code>--download</code> fetches what&rsquo;s missing. <code>--max-version</code> respects per-title ceilings (e.g., 1.05 for GT6).</p>
<p>The first scan of my library found 51 of 72 games with updates available, 69 patches missing locally, ~24 GB in total. GT6 alone had 16 missing patches from 1.07 to 1.22, which I skipped for the reasons above.</p>
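<p>Typical calls look like this (the flags are the ones described above; check the script&rsquo;s <code>--help</code> for the exact syntax):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"># Show which installed titles have newer patches on PSN
distrobox enter gaming -- python3 scripts/check_ps3_updates.py --list
# Fetch what's missing; --max-version enforces per-title ceilings like GT6 at 1.05
distrobox enter gaming -- python3 scripts/check_ps3_updates.py --download</code></pre></div>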
<p>Patches arrive as PSN packages type 0x0001 without encryption, so <code>extract_ps3_dlc.py</code> handles the install. Standard call:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">distrobox enter gaming -- python3 scripts/extract_ps3_dlc.py <span class="se">\
</span></span></span><span class="line"><span class="cl">  <span class="s2">&#34;/mnt/terachad/Emulators/EmuDeck/roms_heavy/ps3-DLC&#34;</span> <span class="se">\
</span></span></span><span class="line"><span class="cl">  --dest <span class="s2">&#34;/mnt/data/distrobox/gaming/.config/rpcs3/dev_hdd0/game&#34;</span></span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>Tracking PS3 patch and DLC versions manually, game by game through RPCS3, is impractical. It becomes an infinite to-do list, and you only find out there was a 1.22 patch when the game crashes.</p>
<h2>PS1: DuckStation per-game<span class="hx:absolute hx:-mt-20" id="ps1-duckstation-per-game"></span>
    <a href="#ps1-duckstation-per-game" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>The Gran Turismo trilogy starts on PS1, and quality there only goes up if you enable the right things:</p>
<ul>
<li><strong>Gran Turismo</strong> (SCUS-94949): DuckStation&rsquo;s built-in WidescreenHack (no cheat exists for this game), PGXP enabled</li>
<li><strong>Gran Turismo 2 Simulation</strong> (SCUS-94455): widescreen cheat + 8 MB RAM cheat (fixes audio issues on crowded tracks), JINC2 filter</li>
<li><strong>Gran Turismo 2 Arcade</strong> (SCUS-94488): same treatment as Simulation</li>
</ul>
<p>Global defaults: Vulkan, 8x upscale, JINC2 as the default filter.</p>
<p>GT2 cheats come from a widescreen-fixes repo and are linked automatically into DuckStation&rsquo;s cheats folder.</p>
<h2>Switch: Eden with DLC, update and cheats<span class="hx:absolute hx:-mt-20" id="switch-eden-with-dlc-update-and-cheats"></span>
    <a href="#switch-eden-with-dlc-update-and-cheats" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>On Switch, <code>eden-bin</code> is the emulator. The integration automates three things I used to do by hand:</p>
<ol>
<li>
<p><strong>DLC and update install</strong>: the <code>install_dlcs</code> role points to the dumps directory on the NAS, identifies <code>.nsp</code>/<code>.nsz</code>/<code>.xci</code>/<code>.xcz</code> files, and installs everything into <code>~/.local/share/eden/nand/user/Contents/registered/</code>. If the dump is messy (patches and DLC mixed in a flat folder), the <code>reorganize_switch_nsps.py</code> script sorts everything by Title ID first.</p>
</li>
<li>
<p><strong>Atmosphere cheats</strong>: Atmosphere-format cheats get symlinked into Eden&rsquo;s load path (<code>~/.local/share/eden/load/&lt;TITLE_ID&gt;/cheats/</code>), as sketched right after this list. The <code>switch_cheats</code> role handles this for every title covered by the NAS pack.</p>
</li>
<li>
<p><strong>Update checker</strong>: <code>check_switch_updates.py</code> lists titles with updates available against what&rsquo;s installed locally. Useful so I don&rsquo;t miss important patches for games still getting updates.</p>
</li>
</ol>
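<p>The link from item 2 has this shape (TITLE_ID and the NAS-side path are placeholders, not the repo&rsquo;s actual layout):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash">ln -s /mnt/terachad/Emulators/switch-cheats/TITLE_ID/cheats \
  ~/.local/share/eden/load/TITLE_ID/cheats</code></pre></div>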
<p>A detail that burned me: <code>QT_STYLE_OVERRIDE</code> needs to be unset before launching <code>eden-bin</code>, otherwise it conflicts with Kvantum and breaks the UI. The wrapper does the <code>unset</code> before <code>exec</code>.</p>
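<p>The wrapper idea is tiny. A minimal sketch, assuming the wrapper lives under a different name than the real binary:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash">#!/usr/bin/env bash
# Clear the style override that clashes with Kvantum, then hand off
unset QT_STYLE_OVERRIDE
exec eden-bin "$@"</code></pre></div>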
<h2>Xbox 360: Xenia Manager via Wine<span class="hx:absolute hx:-mt-20" id="xbox-360-xenia-manager-via-wine"></span>
    <a href="#xbox-360-xenia-manager-via-wine" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Xbox 360 is still rough territory on Linux. There&rsquo;s no decent native emulator. The best route today is <a href="https://xenia.jp/"target="_blank" rel="noopener">Xenia</a> running through Wine, managed by <a href="https://github.com/xenia-manager/xenia-manager"target="_blank" rel="noopener">Xenia Manager</a>.</p>
<p>This is opt-in in my setup. The <code>install-xenia.yml</code> playbook creates a dedicated Wine prefix, downloads Xenia Manager, and registers launchers. It&rsquo;s not part of the main <code>site.yml</code> because not everyone wants Wine in the mix.</p>
<h3>Project Forza Plus (Forza Motorsport 2/3/4)<span class="hx:absolute hx:-mt-20" id="project-forza-plus-forza-motorsport-234"></span>
    <a href="#project-forza-plus-forza-motorsport-234" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Old Forza is still the king of previous-gen simcade. Running FM2/3/4 on Xenia requires the Project Forza Plus community mods: compatibility patches, performance mods, specific title-update installs. I documented the process in <a href="https://github.com/akitaonrails/distrobox-gaming/blob/master/docs/project-forza.md"target="_blank" rel="noopener"><code>docs/project-forza.md</code></a>.</p>
<h3>Project Gotham Racing 3 and 4<span class="hx:absolute hx:-mt-20" id="project-gotham-racing-3-and-4"></span>
    <a href="#project-gotham-racing-3-and-4" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>PGR3 runs reasonably well on Xenia Canary. PGR4 needs a specific tweak: <code>render_target_path_vulkan = &quot;fsi&quot;</code> in the config, otherwise some races break with visual artifacts. Setup documented in <a href="https://github.com/akitaonrails/distrobox-gaming/blob/master/docs/xbox360-pgr.md"target="_blank" rel="noopener"><code>docs/xbox360-pgr.md</code></a>. PGR4 audio on NVIDIA still has an XMA decoding issue that produces intermittent garbage. There&rsquo;s an open xenia-canary issue, no resolution so far.</p>
<h3>Batch Title Updates<span class="hx:absolute hx:-mt-20" id="batch-title-updates"></span>
    <a href="#batch-title-updates" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Title Updates for Xbox 360 games were distributed via Xbox Live, which has been shut down. The alternative is archive.org, which has the complete catalog preserved. I wrote <code>scripts/download-xbox360-tus.py</code> to automate:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">distrobox enter gaming -- python3 scripts/download-xbox360-tus.py <span class="se">\
</span></span></span><span class="line"><span class="cl">  --src /mnt/terachad/Emulators/EmuDeck/roms_heavy/xbox360 <span class="se">\
</span></span></span><span class="line"><span class="cl">  --dest /mnt/terachad/Emulators/EmuDeck/roms_heavy/xbox360-updates <span class="se">\
</span></span></span><span class="line"><span class="cl">  --dry-run</span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>The script needs the <code>internetarchive</code> CLI authenticated (<code>ia configure</code> once, stores the token). It scans the games folder, matches against the archive.org manifest (cached locally), prioritizes by region (USA &gt; World &gt; Europe &gt; Japan), and downloads via <code>ia download --checksum</code> with automatic retry. The resulting .zip files go to Xenia Manager via <code>Manage → Install Content</code>, which extracts into the correct directory (<code>000B0000/</code>).</p>
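<p>In practice that means a one-time auth and then the same call as above without <code>--dry-run</code>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"># One-time: store archive.org credentials for the ia CLI
distrobox enter gaming -- ia configure
# Real run, same flags as the dry run above
distrobox enter gaming -- python3 scripts/download-xbox360-tus.py \
  --src /mnt/terachad/Emulators/EmuDeck/roms_heavy/xbox360 \
  --dest /mnt/terachad/Emulators/EmuDeck/roms_heavy/xbox360-updates</code></pre></div>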
<p>Full setup in <a href="https://github.com/akitaonrails/distrobox-gaming/blob/master/docs/xbox360-title-updates.md"target="_blank" rel="noopener"><code>docs/xbox360-title-updates.md</code></a>.</p>
<h2>Steam in the distrobox<span class="hx:absolute hx:-mt-20" id="steam-in-the-distrobox"></span>
    <a href="#steam-in-the-distrobox" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Since Proton became a first-class citizen on Linux, it makes sense to have Steam in the gaming box too. I added the <code>steam</code> package via Ansible (which pulls in multilib and 32-bit deps), mounted <code>/mnt/data/steam</code> read-write so the library persists outside the container, and created a host-side launcher.</p>
<p>In practice, this lets me run Steam games via Proton side-by-side with emulation, without switching environments. The mini-PC with the RTX 4090 is still the main Steam machine in the sim racing cockpit. The desktop is fallback and a test environment.</p>
<h2>Shell and controllers<span class="hx:absolute hx:-mt-20" id="shell-and-controllers"></span>
    <a href="#shell-and-controllers" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><h3>Minimal in-box shell<span class="hx:absolute hx:-mt-20" id="minimal-in-box-shell"></span>
    <a href="#minimal-in-box-shell" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>When I <code>distrobox enter gaming</code>, I want a decent prompt even for quick troubleshooting. I installed <code>zsh</code> + <code>starship</code> with a minimal config, zero exotic plugins. It&rsquo;s not my full dev setup, it&rsquo;s just &ldquo;a prompt that shows path and git branch without embarrassing me&rdquo;.</p>
<h3>Dolphin and the modern-controller hell<span class="hx:absolute hx:-mt-20" id="dolphin-and-the-modern-controller-hell"></span>
    <a href="#dolphin-and-the-modern-controller-hell" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p><code>Dolphin</code> has always been the king of manual friction. If you use a modern controller, like my 8BitDo Ultimate 2, you have to remember how I like the GameCube mapping, how I adapt the Wii Remote, when to use a Nunchuk profile, when to switch to Classic Controller. I don&rsquo;t have the patience to rebuild that by hand every time. Profiles are pre-built in <code>config/emulator-overrides/dolphin/Profiles/</code> and copied by the <code>seed_configs</code> role. Details in <a href="https://github.com/akitaonrails/distrobox-gaming/blob/master/docs/controller-hotkeys.md"target="_blank" rel="noopener"><code>docs/controller-hotkeys.md</code></a>.</p>
<h2>A few stumbles that stuck around<span class="hx:absolute hx:-mt-20" id="a-few-stumbles-that-stuck-around"></span>
    <a href="#a-few-stumbles-that-stuck-around" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Not everything became magic. Emulation always has banana peels.</p>
<p><code>Flycast</code>, for example, has an irritating trap. The <code>emu.cfg</code> file uses <code>rend.*</code> keys inside the <code>[config]</code> section. If you create a separate <code>[rend]</code> section, it looks right, but the emulator just ignores it and later rewrites everything with mediocre defaults. The fix became a dedicated wrapper at <code>$DG_BOX_HOME/bin/flycast-hires</code>:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="nv">$DG_BOX_HOME</span>/bin/flycast-hires <span class="se">\
</span></span></span><span class="line"><span class="cl">  -config config:pvr.rend<span class="o">=</span><span class="m">4</span> <span class="se">\
</span></span></span><span class="line"><span class="cl">  -config config:rend.Resolution<span class="o">=</span><span class="m">2880</span> <span class="se">\
</span></span></span><span class="line"><span class="cl">  -config config:rend.EmulateFramebuffer<span class="o">=</span>no <span class="se">\
</span></span></span><span class="line"><span class="cl">  -config config:rend.WideScreen<span class="o">=</span>yes</span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>Details in <a href="https://github.com/akitaonrails/distrobox-gaming/blob/master/docs/flycast-resolution.md"target="_blank" rel="noopener"><code>docs/flycast-resolution.md</code></a>.</p>
<p><code>shadPS4</code> is still far from a closed case. The current setup is focused on <code>Driveclub</code>, with a mirrored config, XML patch and sys_modules links from firmware 11.00. And that wasn&rsquo;t random: I really like <code>Driveclub</code>. It&rsquo;s practically the only game still stuck on PS4 that I really wanted to get running properly. I&rsquo;ve seen several YouTube videos showing the game working, but so far I haven&rsquo;t been able to get it running satisfactorily on Linux myself. Documented in <a href="https://github.com/akitaonrails/distrobox-gaming/blob/master/docs/driveclub-shadps4.md"target="_blank" rel="noopener"><code>docs/driveclub-shadps4.md</code></a>. The good part is that I no longer depend on muscle memory to remember how to launch it, with which patch, in which folder, with which modules. The bad part is that PS4 emulation is still PS4 emulation. No script makes upstream mature faster. If anyone knows how to nail down this setup on Linux, check the project on GitHub and send a PR.</p>
<p>Ridge Racer V has car-texture flickering documented in PCSX2 issues (#3639, #13729). The Hardware renderer is acceptable for a racing game where you&rsquo;re not standing still admiring the model. Software renderer fixes it but kills performance. I chose to live with the flicker.</p>
<h2>The real gain isn&rsquo;t just automation<span class="hx:absolute hx:-mt-20" id="the-real-gain-isnt-just-automation"></span>
    <a href="#the-real-gain-isnt-just-automation" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>It&rsquo;d be easy to sum this up as &ldquo;look how cool, I used AI to automate shell scripts&rdquo;. That&rsquo;s not it.</p>
<p>The real gain is more mundane and more important: I stopped spending mental energy on manual work. I didn&rsquo;t have to remember rare commands. I didn&rsquo;t have to keep dozens of wiki, issue-tracker, forum, and README tabs open. I didn&rsquo;t have to leave half a dozen terminals scattered tailing logs while trying to remember which hidden GUI option the emulator insists on overwriting on first boot. I started working in conversation.</p>
<p>I tell the agent the goal. It looks up the files, reads the logs, finds the right config format, compares defaults, proposes fixes, writes scripts, validates paths, checks UID/GID, verifies broken symlinks, generates wrappers, exports launchers. I&rsquo;m still responsible for decisions, of course. But I stopped being a typist of rare commands.</p>
<p>That&rsquo;s the point that interests me most about using coding agents on Linux. They dramatically reduce the entry friction. A lot of people bounce off the Linux desktop not because the system is incapable, but because the tuning curve has historically been too irritating. Having an assistant capable of reading docs, cross-referencing configs, proposing automation and executing under supervision changes that game.</p>
<h2>If you want to reproduce<span class="hx:absolute hx:-mt-20" id="if-you-want-to-reproduce"></span>
    <a href="#if-you-want-to-reproduce" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>The repo was published for this:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">git clone https://github.com/akitaonrails/distrobox-gaming.git
</span></span><span class="line"><span class="cl"><span class="nb">cd</span> distrobox-gaming/ansible
</span></span><span class="line"><span class="cl">ansible-galaxy collection install -r collections/requirements.yml
</span></span><span class="line"><span class="cl">cp host_vars/localhost.yml.example host_vars/localhost.yml
</span></span><span class="line"><span class="cl"><span class="nv">$EDITOR</span> host_vars/localhost.yml  <span class="c1"># adjust paths for your machine</span>
</span></span><span class="line"><span class="cl">ansible-playbook site.yml</span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>Xenia is opt-in:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">ansible-playbook install-xenia.yml</span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>Read the <a href="https://github.com/akitaonrails/distrobox-gaming/blob/master/README.md"target="_blank" rel="noopener"><code>README.md</code></a> and <a href="https://github.com/akitaonrails/distrobox-gaming/blob/master/docs/rebuild-runbook.md"target="_blank" rel="noopener"><code>docs/rebuild-runbook.md</code></a> first. I don&rsquo;t distribute ROMs, BIOS, firmware or keys. The repo only detects, links and configures what you already have on your own machine.</p>
<h2>If you prefer the Claude Code route<span class="hx:absolute hx:-mt-20" id="if-you-prefer-the-claude-code-route"></span>
    <a href="#if-you-prefer-the-claude-code-route" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>My recommendation is simple:</p>
<ol>
<li>Start with the goal, not the command. &ldquo;I want an Arch distrobox with NVIDIA GPU, separate home, ROMs read-only and Steam rw&rdquo; beats dumping half a shell command without context.</li>
<li>Always ask it to document what it&rsquo;s doing. If it works, promote the result to a script or role. If you don&rsquo;t document, you just manufactured a throwaway hack.</li>
<li>Work in phases. Box creation. Bootstrap. Emulator config. Launcher export. Verification. That&rsquo;s exactly how I broke down the problem.</li>
<li>Ask for objective checks. <code>command -v</code>, file existence, broken symlinks, UID/GID, ROM paths, audio, GPU. The best automation is the one that fails early.</li>
<li>Don&rsquo;t use the agent as a command parrot. Use it as a technical assistant. You still review the decisions and tell it to adjust course.</li>
</ol>
<p>For me, that was the gain. I went from zero to a much more complete emulation machine without having to customize everything by hand, GUI by GUI, one window at a time. And this time, I finished with energy left to do what I wanted from the start.</p>
<p>Play.</p>
]]></content:encoded><category>gaming</category><category>emulation</category><category>linux</category><category>distrobox</category><category>ansible</category><category>claude-code</category><category>AI</category></item><item><title>VS Code Is the New Punch Card</title><link>https://akitaonrails.github.io/en/2026/04/11/vs-code-is-the-new-punch-card/</link><guid isPermaLink="true">https://akitaonrails.github.io/en/2026/04/11/vs-code-is-the-new-punch-card/</guid><pubDate>Sat, 11 Apr 2026 12:00:00 GMT</pubDate><description>&lt;p&gt;Every time someone asks whether juniors are going to stop learning how to program because LLMs can write code for them, I have the same reaction: you&amp;rsquo;re asking the wrong question.&lt;/p&gt;
&lt;p&gt;You&amp;rsquo;re confusing &lt;strong&gt;code input&lt;/strong&gt; with &lt;strong&gt;software engineering&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Those are not the same thing. They never were.&lt;/p&gt;
&lt;p&gt;There was a time when programming meant knowing how to convert numbers into binary in your head and jam instructions straight into memory addresses, bit by bit, by hand. There was a time when programming meant knowing how to organize a punched-card deck, knowing whether the deck order was right, knowing where one bad hole wrecked execution, and knowing how to debug visually without the modern fantasy of infinite backspace. There was a time when real programming meant knowing 6502, Z80, and Assembly because computers had so few resources that every byte actually mattered, not as a figure of speech.&lt;/p&gt;</description><content:encoded><![CDATA[<p>Every time someone asks whether juniors are going to stop learning how to program because LLMs can write code for them, I have the same reaction: you&rsquo;re asking the wrong question.</p>
<p>You&rsquo;re confusing <strong>code input</strong> with <strong>software engineering</strong>.</p>
<p>Those are not the same thing. They never were.</p>
<p>There was a time when programming meant knowing how to convert numbers into binary in your head and jam instructions straight into memory addresses, bit by bit, by hand. There was a time when programming meant knowing how to organize a punched-card deck, knowing whether the deck order was right, knowing where one bad hole wrecked execution, and knowing how to debug visually without the modern fantasy of infinite backspace. There was a time when real programming meant knowing 6502, Z80, and Assembly because computers had so few resources that every byte actually mattered, not as a figure of speech.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/altair-8800-computer.jpg" alt="Altair 8800, symbol of the era when programming still meant physical panels and manual instruction entry"  loading="lazy" /></p>
<p><em>Front panel and switches: before editors, before IDEs, before comfortable terminals.</em></p>
<p>And there were phases of computing that were even worse than punch cards. The video below shows someone programming an LGP-21, one of the oldest personal computers ever made (the second oldest, after the 1950s Bendix G15). It starts at the 5-minute mark:</p>


<div class="embed-container">
  <iframe
    src="https://www.youtube.com/embed/TJjRCCetyo4?start=300"
    title="YouTube video player"
    frameborder="0"
    allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share"
    referrerpolicy="strict-origin-when-cross-origin"
    allowfullscreen>
  </iframe>
</div>

<p>Imagine what that was: you typed the program in binary directly on a typewriter, looking at the accumulator, instruction by instruction, then physically flipped a switch to execute, and the result got typed back onto paper. The metric wasn&rsquo;t framerate. It was characters per minute. An operation you can do today in the blink of an eye inside any app used to take hours of human work, typing bit by bit, verifying, testing, verifying again.</p>
<p>It&rsquo;s the same thing happening with typing code in a text editor today versus an AI agent running in your place. The manual way still works, the same way the Altair front panel kept working for years after compilers became common. But the tool of choice shifted, and the gap in speed and effort is the same order of magnitude that separated the LGP-21 from a modern text editor. Our generation is watching this transition happen in slow motion, and a lot of people don&rsquo;t want to see it.</p>
<p>Then compilers got better. C showed up. Machines got fatter, consoles moved into the 32/64-bit era, PCs got more decent, and Assembly stopped being the center of everything. It became what it should be: a low-level tool, local optimization, critical routines, bootstrapping, drivers, that sort of thing. Nobody serious looked at that transition and said, &ldquo;well, that&rsquo;s it, programming is over because the compiler writes machine instructions for you now.&rdquo;</p>
<p>The 21st century brought the Web and shoved an entire generation into HTML, CSS, and a pile of markup bureaucracy that adds very little intellectual value while demanding a ton of manual labor just to get it mostly right. I still think the industry dragged that model out way too long. For way too many years, programmers became glorified form operators, CRUD assemblers, <code>div</code> aligners, priests of front-end frameworks that all do the same thing with slightly different syntax.</p>
<p>Then the 2010s bubble made it worse.</p>
<p>I&rsquo;ve already written about that in <a href="/en/2026/02/08/rant-ai-killed-programmers/">RANT: Did AI Kill Programmers?</a> and also in <a href="/en/2026/03/05/37-days-of-vibe-coding-immersion-conclusions-on-business-models/">37 Days of Vibe Coding Immersion</a>. The startup bubble, cheap money, and the hiring frenzy produced a legion of very bad programmers, coming out of two-month bootcamps and sketchy online courses promising Google salaries with no foundation, no education, and no depth. The market spent a decade pretending that was normal. It wasn&rsquo;t. Same story as always: lots of volume, little value, and a whole lot of people confusing inflated employability with actual competence.</p>
<p>And when the layoffs started in 2022, that did not come out of nowhere. I spent years warning that the bubble would pop. It&rsquo;s all there in the playlist <a href="https://www.youtube.com/watch?v=wpPv1dJWjDs&amp;list=PLdsnXVqbHDUehzKjiRruy--gncHz9Injy&amp;pp=sAgC"target="_blank" rel="noopener">EU AVISEI</a>. The message was always the same: once the cheap money dried up, the bar would go back up, and the only people with a real shot would be the ones who had made the effort to actually learn Computer Science. The next economic cycle was always going to be less about hiring volume and more about efficiency. That&rsquo;s exactly what happened.</p>
<h2>What really changed<span class="hx:absolute hx:-mt-20" id="what-really-changed"></span>
    <a href="#what-really-changed" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>LLMs went mainstream at the end of 2022. That&rsquo;s true. But mainstream is not the same thing as useful for serious work.</p>
<p>Throughout 2023 and 2024, I was already using AI to write code. Did it work? Sure, it worked. But it was still full of nonsense: too many hallucinations, too many agent loops, context getting lost too easily, tooling breaking too often, too much cost for too little reliability. It was useful for experienced programmers who knew how to keep the thing on a leash. It still wasn&rsquo;t a mature tool for day-to-day heavy lifting.</p>
<p>For me, the turn came at the end of 2025. That&rsquo;s when the combination of better models, prompt caching, decent tool calling, inference optimizations, context windows that were actually useful in practice, and above all real agent interfaces made the whole thing stop feeling like conference-demo theater and start feeling like an actual work tool.</p>
<p>That was the point where Claude Code, Codex, OpenCode, and the rest stopped being &ldquo;autocomplete on steroids&rdquo; and became a different category of interface.</p>
<p>To me, Claude Code is already the new terminal. The editor has moved into the background.</p>
<p>I talked about that in <a href="/en/2026/03/31/migrating-my-home-server-with-claude-code/">Migrating my Home Server with Claude Code</a>, too. I simply have no patience left to burn attention on Linux shell busywork when the problem is mundane: installing a server, bringing up and organizing Docker services, hardening a firewall, reviewing security rules, tuning kernel parameters, auditing <code>dmesg</code>, digging through systemd logs, that kind of thing. I let Claude do the heavy lifting, and I review direction and risk. Ironically, my Linux machines have never felt more stable, faster, or more robust.</p>
<p>Today, if you spend your day building software, going back to the raw combo of text editor plus terminal and doing everything manually starts to feel like regression. Not because typing became impossible. Of course not. I&rsquo;ve been typing code for decades. The problem is something else: it became a waste of attention.</p>
<p>If I can describe an intent, ask an agent to inspect the codebase, edit twenty files, run tests, compile, fix the fallout, and bring me back a proposed change in minutes, why exactly am I supposed to feel nostalgic about hand-typing boilerplate inside VS Code?</p>
<p>I don&rsquo;t.</p>
<p>And this is where a distinction a lot of people still don&rsquo;t get matters. You should not use a coding agent like some dumb editor extension, the kind of thing where you say &ldquo;generate this file here&rdquo; and then spend the next half hour micromanaging every line in a tiny corner pane. That&rsquo;s using a Ferrari to go buy bread around the block. The big gain does not come from treating Claude Code, Codex, or similar tools like glorified autocomplete inside VS Code. The gain comes when you drop the editor-operator mindset and start treating the agent like an actual pair programmer.</p>
<p>Instead of acting like a professional typist, you move up a level. You act more like a tech lead, product owner, QA, manager of the flow. You define intent, explain context, demand criteria, ask for a plan, ask it to run tests, ask for refactors, ask for alternative approaches, ask it to review its own change. You leave the manual code labor to the agent and use your own brain to judge direction, priority, risk, and quality.</p>
<p>But there is a balance there, and I&rsquo;ve already covered it in other Agile Vibe Coding posts. You do not hand over the wheel and let the LLM <code>git push</code> everything blind. And you also don&rsquo;t swing to the opposite extreme and turn into a comma cop, nitpicking every tiny detail until the agent becomes one more layer of bureaucracy and kills the speed advantage. Both extremes are bad. In one, you outsource responsibility. In the other, you strangle productivity.</p>
<p>The right point on that pendulum is somewhere else: use XP and actual engineering discipline to sustain the speed. Continuous refactoring. Tests. CI. Review. Fast feedback. Small code. Incremental change. That&rsquo;s exactly what I documented in posts like <a href="/en/2026/02/20/zero-to-post-production-in-1-week-using-ai-on-real-projects-behind-the-m-akita-chronicles/">From Zero to Post-Production in 1 Week - How to Use AI on Real Projects</a> and <a href="/en/2026/03/01/software-is-never-done-4-projects-life-after-deploy-one-shot-prompt-myth/">Software Is Never &lsquo;Done&rsquo;</a>. The 10x multiplier does not come from model magic. It comes from the model plus a sane process.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/cursor-homepage-crop.png" alt="Modern coding-agent interface, representing the shift from the traditional editor to an intent-driven, execution-assisted workflow"  loading="lazy" /></p>
<p><em>The interface changed. The need for judgment did not.</em></p>
<h2>VS Code is the new punch card<span class="hx:absolute hx:-mt-20" id="vs-code-is-the-new-punch-card"></span>
    <a href="#vs-code-is-the-new-punch-card" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>That&rsquo;s what the title means.</p>
<p>VS Code did not &ldquo;become bad.&rdquo; That&rsquo;s not the point. Punch cards weren&rsquo;t &ldquo;bad&rdquo; in their own historical context either. They were a massive step up from toggling bits by hand or rewiring a board. The point is that they were the mechanism of their era for feeding instructions into the machine.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/punched-card-program-deck.jpg" alt="Punched-card deck, from an era when the profession demanded more discipline in preparing input than comfort in the interface"  loading="lazy" /></p>
<p><em>VS Code is not the enemy. It&rsquo;s just becoming the input mechanism of the previous era.</em></p>
<p>Today, text editors are becoming that again: an input mechanism that still works, will still be around for a long time, but is no longer the center of the activity.</p>
<p>If you&rsquo;ve never looked at those older eras of computing, I already went through it in <a href="/2020/10/23/akitando-86-o-computador-de-turing-e-von-neumann-por-que-calculadoras-nao-sao-computadores/">Akitando #86 - The Turing and Von Neumann Computer</a>:</p>


<div class="embed-container">
  <iframe
    src="https://www.youtube.com/embed/G4MvFT8TGII"
    title="YouTube video player"
    frameborder="0"
    allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share"
    referrerpolicy="strict-origin-when-cross-origin"
    allowfullscreen>
  </iframe>
</div>

<p>And if you want a refresher on why 6502, Z80, and old machines forced a different kind of discipline, go back to the <a href="/2020/06/04/akitando-80-o-guia-hardcore-de-introducao-a-computacao/">Hardcore Guide to Computing Fundamentals</a> and <a href="/2020/06/18/akitando-81-aprendendo-sobre-computadores-com-super-mario-do-jeito-hardcore/">Learning About Computers with Super Mario (Hardcore++)</a>. That wasn&rsquo;t old-man nostalgia. It was there to show that in every era the interface changes, but the machine still demands precision.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/akitando-80-6502.jpg" alt="Thumbnail from Akitando #80, part of the series built to teach computing fundamentals through the 6502 and the microcomputer era"  loading="lazy" /></p>
<p><em>This is what a good part of Akitando existed for: teaching what still holds when the fashionable tool changes.</em></p>
<p>And it&rsquo;s worth remembering something a lot of people forget: the 146 videos on <a href="https://www.youtube.com/@Akitando"target="_blank" rel="noopener">Akitando</a>, more than 96 hours of material, were built precisely to teach this kind of foundation to Computer Science students, juniors, and people trying to stop being framework button-pushers. I recorded all of that because I could already see the industry pushing too many people toward manual busywork and not enough understanding. Ironically, now that agents are here, that archive is more relevant than ever.</p>
<p>The interface changed again.</p>
<p>First you needed to know how to type the instruction.
Then you needed to know how to organize the deck.
Then you needed to know how to convince the compiler.
Then you needed to know how to stitch together frameworks, HTML, CSS, YAML, CI, containers, cloud, ORMs, queues, observability, and fifty other layers of junk.</p>
<p>Now you need to know how to <strong>orchestrate an agent</strong>.</p>
<p>And again, that does not eliminate foundations. It only changes where the manual labor ends and the intellectual work begins.</p>
<h2>&ldquo;So we don&rsquo;t need Computer Science anymore?&rdquo;<span class="hx:absolute hx:-mt-20" id="so-we-dont-need-computer-science-anymore"></span>
    <a href="#so-we-dont-need-computer-science-anymore" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Quite the opposite.</p>
<p>Now you need it more.</p>
<p>Someone with no foundation looks at an agent issuing a <code>SELECT * FROM table</code>, sees the thing working locally against 300 fake rows, and assumes everything is fine. In production the query pulls a million rows, blows memory, degrades latency, backs up a queue, congests connections, and the person has no clue why &ldquo;it works on my machine&rdquo; but melts in production.</p>
<p>That&rsquo;s the real problem.</p>
<p>The agent does not know your system&rsquo;s context the way an experienced engineer does. It doesn&rsquo;t know which tables are going to grow tenfold next quarter. It doesn&rsquo;t know which endpoint needs to answer in 80 ms and which one can take 2 seconds. It doesn&rsquo;t know which flow needs a transaction, which one needs idempotency, which one needs a pessimistic lock, which one needs async compensation, which one needs auditing, which one can never leak sensitive data.</p>
<p>It may get the syntax right.</p>
<p>But syntax was never the hardest part.</p>
<p>I already said that in <a href="/en/2026/02/24/rant-akita-caved-to-ai/">RANT: Did Akita Bend Over for AI??</a>: what AI is really good at is removing the mundane work. And thank God for that. I did not get into computing to become an IDE operator. I have no romantic attachment to formatting HTML, fighting CSS, assembling the same old CRUD, gluing one more framework onto an old stack, or writing the same pile of infrastructure code for the hundredth time when any decent machine should be able to produce that by now.</p>
<p>So what is left after that mundane layer disappears?</p>
<p>Exactly the part that separates amateurs from real programmers:</p>
<ul>
<li>domain modeling</li>
<li>architecture</li>
<li>trade-offs</li>
<li>performance</li>
<li>scalability</li>
<li>security</li>
<li>observability</li>
<li>maintainability</li>
<li>readability</li>
<li>operational cost</li>
<li>product decisions</li>
</ul>
<p>All of that still exists. All of that is still hard. All of that still depends on judgment.</p>
<h2>The mistake people make when they think programming was typing<span class="hx:absolute hx:-mt-20" id="the-mistake-people-make-when-they-think-programming-was-typing"></span>
    <a href="#the-mistake-people-make-when-they-think-programming-was-typing" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Some people genuinely think that if the machine writes the code, then the need to understand software is gone.</p>
<p>That is the same stupidity as thinking the compiler killed the need to understand computers.</p>
<p>It didn&rsquo;t.</p>
<p>It only killed the need to write everything in Assembly.</p>
<p>And thank God for that too.</p>
<p>Same thing here: coding agents do not kill the need to understand software. They kill the need for you to be a syntax typist.</p>
<p>And good riddance.</p>
<p>There is a nice irony here too: for years the industry sold the fantasy that programming meant &ldquo;learning a framework.&rdquo; Then it sold the fantasy that programming meant &ldquo;learning React.&rdquo; Then it sold the fantasy that programming meant &ldquo;learning the stack of the month.&rdquo; Now it&rsquo;s going to sell the fantasy that programming means &ldquo;learning prompt engineering.&rdquo;</p>
<p>It doesn&rsquo;t.</p>
<p>Prompts are an interface.
Frameworks are an interface.
IDEs are an interface.
Punch cards were an interface.</p>
<p>Programming is still the act of instructing a machine to compute something useful under real constraints.</p>
<p>People who understand that survive every tool shift.
People who don&rsquo;t become operators of whatever the trendy tool is, and when the trend changes, they go down with it.</p>
<h2>What I think is going to happen to juniors<span class="hx:absolute hx:-mt-20" id="what-i-think-is-going-to-happen-to-juniors"></span>
    <a href="#what-i-think-is-going-to-happen-to-juniors" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>So let&rsquo;s answer the original question properly.</p>
<p>Juniors are not going to stop learning.</p>
<p>But they are going to have to learn <strong>something else</strong>.</p>
<p>If the junior of 2015 could spend years hiding ignorance behind low-value manual tasks, tweaking views, adjusting CSS, assembling dumb endpoints, copying Stack Overflow snippets, and making it all look like &ldquo;productivity,&rdquo; that hiding place is going away.</p>
<p>The junior of the agent era is either going to level up faster or get exposed faster. There isn&rsquo;t much middle ground.</p>
<p>If they use agents and actually study fundamentals, they&rsquo;ll be able to test hypotheses faster, read more code, compare more solutions, iterate more, make mistakes earlier and fix them earlier. They&rsquo;ll learn more in less time.</p>
<p>But if they use agents without foundations, all they&rsquo;ll do is outsource their own ignorance. They&rsquo;ll become reviewers who cannot review. They&rsquo;ll accept patches they do not understand. They&rsquo;ll approve decisions they do not know how to measure. They&rsquo;ll confuse &ldquo;passed locally&rdquo; with &ldquo;ready for production.&rdquo;</p>
<p>That kind of professional is dangerous.</p>
<p>Far more dangerous than the old junior who was at least limited by their own speed.</p>
<h2>The post-bubble world<span class="hx:absolute hx:-mt-20" id="the-post-bubble-world"></span>
    <a href="#the-post-bubble-world" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>The good news is that this comes right after the collapse of the dumbest phase of the hiring bubble.</p>
<p>It&rsquo;s past time for the market to stop rewarding labor-intensive stupidity as if it were competence. It&rsquo;s past time to stop treating stack bureaucracy as technical depth. It&rsquo;s past time to stop confusing commit volume with engineering value.</p>
<p>If the new era wipes out a good chunk of that theater, great.</p>
<p>In a post-bubble, post-miracle-bootcamp, post-CSS-as-a-career, post-CRUD-as-a-profession world, foundations go back to being what they should always have been: the main asset.</p>
<p>People who understand operating systems, databases, networking, data structures, compilers, computer architecture, profiling, debugging, concurrency, consistency, security, and cost will use agents as an exoskeleton.</p>
<p>People who understand none of that will use agents as a crutch.</p>
<p>An exoskeleton amplifies strength.
A crutch only tries to hide weakness.</p>
<h2>&ldquo;But this isn&rsquo;t sustainable&rdquo;<span class="hx:absolute hx:-mt-20" id="but-this-isnt-sustainable"></span>
    <a href="#but-this-isnt-sustainable" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>There is always someone with the same excuse: &ldquo;well, I don&rsquo;t think this is sustainable, the data centers can&rsquo;t keep up, the prices are too heavily subsidized, there&rsquo;s no way this can keep going like this.&rdquo;</p>
<p>And to be fair, that person is not entirely wrong.</p>
<p>It just doesn&rsquo;t change a thing about what I have to do tomorrow morning.</p>
<p>This kind of concern might be great for bar talk or an X thread, but it doesn&rsquo;t help me decide anything useful. Me, you, none of us are going to sit down with Anthropic or OpenAI leadership and redesign data-center capex, renegotiate power contracts, decide subsidy margins, or plan the next GPU generation. There is no concrete action for us that comes out of this besides repeating that &ldquo;someday this will be a problem.&rdquo;</p>
<p>It&rsquo;s the same mindset as the person in the 1990s saying, &ldquo;let&rsquo;s not use the internet too much, it&rsquo;s too slow, the limits are too low, the price per kilobyte is absurd, better wait until someone fixes it.&rdquo; Or the person in the early 2000s looking at mobile data and saying, &ldquo;2G is too slow, too limited, better not depend on it.&rdquo; Why exactly would you want to be that person?</p>
<p>Thank God OpenAI, Anthropic, and the rest are fighting like hell and heavily subsidizing this race. I&rsquo;m taking full advantage of it. I&rsquo;ve already burned through my entire Claude Max 20x, already hit the extra-usage ceiling, already burned through my Codex plan, and upgraded to Pro so I can keep using it this weekend. People paying a monthly subscription and barely touching it are, in practice, subsidizing me to use as much as I can. I have no intention of slowing down. Why would you?</p>
<p>If prices change tomorrow, if infrastructure gets tighter, if the game changes, then I reassess tomorrow. That&rsquo;s how technology has always worked. While the window is open, the rational move is not to slow yourself down in advance. The rational move is to learn as much as possible, extract as much as possible, and build an advantage while everybody else is busy explaining why they still haven&rsquo;t started.</p>
<h2>The decision is still human<span class="hx:absolute hx:-mt-20" id="the-decision-is-still-human"></span>
    <a href="#the-decision-is-still-human" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>At the end of the day, nothing that matters actually changed.</p>
<p>Someone still has to look at the result and decide:</p>
<ul>
<li>can this go to production?</li>
<li>can this handle load?</li>
<li>is this readable?</li>
<li>is this secure?</li>
<li>is this maintainable?</li>
<li>does this fit the rest of the system?</li>
<li>does this solve the right problem?</li>
</ul>
<p>If the answer is no, someone still has to know <strong>why</strong> it&rsquo;s no.</p>
<p>More importantly, someone still has to know <strong>how to fix it</strong>.</p>
<p>That&rsquo;s why, in the age of agents, foundational knowledge did not become less important. It became more expensive to be wrong without it.</p>
<p>VS Code is the new punch card.</p>
<p>Not because it became useless.</p>
<p>But because we&rsquo;re finally entering an era where the act of typing code by hand stops being the center of the profession.</p>
<p>And honestly? About time.</p>
<blockquote>
  <p><strong>AI reflects who you are.</strong></p>
<p>If you&rsquo;re good, it accelerates good code.</p>
<p>If you&rsquo;re bad, it accelerates technical debt at industrial speed.</p>
<p>AI is not going to take a bad programmer and turn them into a good one. It never did, it doesn&rsquo;t, and it won&rsquo;t.</p>

</blockquote>
<p>That&rsquo;s why foundations matter more now than they did before.</p>
<p>The agent can write. You&rsquo;re still the one who has to know whether what it wrote is any good.</p>
]]></content:encoded><category>ai</category><category>llm</category><category>opinion</category><category>programming</category></item><item><title>How ElevenLabs Was Not Killed by Qwen3 TTS</title><link>https://akitaonrails.github.io/en/2026/04/09/how-elevenlabs-was-not-killed-by-qwen3-tts/</link><guid isPermaLink="true">https://akitaonrails.github.io/en/2026/04/09/how-elevenlabs-was-not-killed-by-qwen3-tts/</guid><pubDate>Thu, 09 Apr 2026 08:30:00 GMT</pubDate><description>&lt;p&gt;&lt;strong&gt;TL;DR — Listen to this and keep reading:&lt;/strong&gt;&lt;/p&gt;
&lt;audio controls preload="metadata" style="width: 100%; max-width: 640px;"&gt;
&lt;source src="https://makita-news.s3.amazonaws.com/podcasts/episodes/2026-04-06.mp3" type="audio/mpeg"&gt;
Your browser doesn't support the audio element. &lt;a href="https://makita-news.s3.amazonaws.com/podcasts/episodes/2026-04-06.mp3"&gt;Download the mp3 here.&lt;/a&gt;
&lt;/audio&gt;
&lt;p&gt;That&amp;rsquo;s the April 6th episode of the &lt;a href="https://akitaonrails.github.io/en/tags/themakitachronicles/"&gt;The M.Akita Chronicles&lt;/a&gt; podcast, already generated with the new ElevenLabs v3 pipeline. Subscribe to the show on &lt;a href="https://open.spotify.com/show/7MzG2UB7IAkC3GAwEXEIVD"target="_blank" rel="noopener"&gt;Spotify&lt;/a&gt; so you don&amp;rsquo;t miss new episodes like this one.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;When Qwen3 TTS dropped, back around January this year, everyone on Twitter/X and on the AI newsletter circuit was shouting &amp;ldquo;ElevenLabs killer&amp;rdquo;. There&amp;rsquo;s a &lt;a href="https://medium.com/@warpie/qwen3-tts-is-the-first-real-open-source-threat-to-elevenlabs-56ba200ab5ee"target="_blank" rel="noopener"&gt;Medium piece&lt;/a&gt; claiming it&amp;rsquo;s the first real open source threat to ElevenLabs. There&amp;rsquo;s a &lt;a href="https://byteiota.com/qwen3-tts-3-second-voice-cloning-beats-elevenlabs/"target="_blank" rel="noopener"&gt;post on byteiota&lt;/a&gt; saying 3-second voice cloning beats ElevenLabs. There&amp;rsquo;s an &lt;a href="https://www.analyticsvidhya.com/blog/2025/12/qwen3-tts-flash-review/"target="_blank" rel="noopener"&gt;analysis on Analytics Vidhya&lt;/a&gt; calling it the most realistic open source TTS released so far. The enthusiast internet consensus was: we finally have open source that can go toe-to-toe with ElevenLabs, the tables have turned, it&amp;rsquo;s just a matter of time.&lt;/p&gt;</description><content:encoded><![CDATA[<p><strong>TL;DR — Listen to this and keep reading:</strong></p>
<audio controls preload="metadata" style="width: 100%; max-width: 640px;">
  <source src="https://makita-news.s3.amazonaws.com/podcasts/episodes/2026-04-06.mp3" type="audio/mpeg">
  Your browser doesn't support the audio element. <a href="https://makita-news.s3.amazonaws.com/podcasts/episodes/2026-04-06.mp3">Download the mp3 here.</a>
</audio>
<p>That&rsquo;s the April 6th episode of the <a href="/en/tags/themakitachronicles/">The M.Akita Chronicles</a> podcast, already generated with the new ElevenLabs v3 pipeline. Subscribe to the show on <a href="https://open.spotify.com/show/7MzG2UB7IAkC3GAwEXEIVD"target="_blank" rel="noopener">Spotify</a> so you don&rsquo;t miss new episodes like this one.</p>
<hr>
<p>When Qwen3 TTS dropped, back around January this year, everyone on Twitter/X and on the AI newsletter circuit was shouting &ldquo;ElevenLabs killer&rdquo;. There&rsquo;s a <a href="https://medium.com/@warpie/qwen3-tts-is-the-first-real-open-source-threat-to-elevenlabs-56ba200ab5ee"target="_blank" rel="noopener">Medium piece</a> claiming it&rsquo;s the first real open source threat to ElevenLabs. There&rsquo;s a <a href="https://byteiota.com/qwen3-tts-3-second-voice-cloning-beats-elevenlabs/"target="_blank" rel="noopener">post on byteiota</a> saying 3-second voice cloning beats ElevenLabs. There&rsquo;s an <a href="https://www.analyticsvidhya.com/blog/2025/12/qwen3-tts-flash-review/"target="_blank" rel="noopener">analysis on Analytics Vidhya</a> calling it the most realistic open source TTS released so far. The enthusiast internet consensus was: we finally have open source that can go toe-to-toe with ElevenLabs, the tables have turned, it&rsquo;s just a matter of time.</p>
<p>I decided to test it on my own production pipeline, as usual. I built out a whole pipeline on top of Qwen3 TTS 1.7B to generate the weekly podcast for <a href="/en/tags/themakitachronicles/">The M.Akita Chronicles</a>, and I documented the behind-the-scenes on the <a href="/en/2026/02/18/serving-ai-in-the-cloud-my-personal-tts-behind-the-m-akita-chronicles/">serving-AI-in-the-cloud post</a>. Anyone who wants the detail on cold starts, voice cloning, and sampling parameters that change from one mode to another, that&rsquo;s the place. I&rsquo;m not repeating it all here.</p>
<p>This post asks a different question. After almost two months running that setup in production, shipping an episode every Monday, last night I finally pulled the plug on Qwen3 and switched the whole thing to ElevenLabs v3. Let me tell you why.</p>
<h2>What didn&rsquo;t work on Qwen3<span class="hx:absolute hx:-mt-20" id="what-didnt-work-on-qwen3"></span>
    <a href="#what-didnt-work-on-qwen3" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Between February 15 and March 30, I made dozens of commits fine-tuning the podcast flow: prompts, sampling parameters, generation order, leading silence on the voice clone sample, volume normalization, pronunciation of tech acronyms. Fixing Marvin&rsquo;s voice, which was clipping the first syllable because the reference audio started without any lead-in silence. Tuning Akita&rsquo;s voice to sound more confident and assertive. Adding crossfades between section jingles. Sorting out the &ldquo;podcast&rdquo; pronunciation so it didn&rsquo;t come out as &ldquo;pódcast&rdquo;. Making the script generator prefer Portuguese over gratuitous anglicisms so the TTS wouldn&rsquo;t choke. Each of those fixes was a multi-hour session of listening to audio, regenerating, tweaking a parameter.</p>
<p>The result came out acceptable. &ldquo;Acceptable&rdquo; in the sense that I managed to ship every episode without re-recording anything by hand. But if you listen closely, the Qwen3 voice has that unmistakable AI-generating-audio quality. Flat intonation, uniform rhythm. Over long stretches you can feel it&rsquo;s a machine talking. Good enough to ship, but miles away from what you&rsquo;d hear on a professional podcast made by humans.</p>
<p>The worst problem was English pronunciation. My podcast covers tech news, so terms like &ldquo;MCP&rdquo;, &ldquo;RAG&rdquo;, &ldquo;Claude Opus&rdquo;, &ldquo;GPT-5&rdquo;, &ldquo;open source&rdquo; show up in every conversation. Qwen3 took those terms and pronounced them with a Brazilian accent, something like &ldquo;oh-pen-ee-sohrss-eh&rdquo;, that kind of thing. Unlistenable for the audience. The workaround I had to implement was manually mapping in the script-generation LLM prompt which English words to swap for Portuguese equivalents. The prompt today has a whole section split between &ldquo;keep in English&rdquo; (proper nouns, brands, terms already adopted into Brazilian tech lingo) and &ldquo;translate to Portuguese&rdquo; (gratuitous anglicisms), looking roughly like this:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-markdown" data-lang="markdown"><span class="line"><span class="cl">**REPLACE with Portuguese** (common English words that have natural
</span></span><span class="line"><span class="cl">Brazilian Portuguese equivalents):
</span></span><span class="line"><span class="cl"><span class="k">-</span> &#34;update&#34; → &#34;atualização&#34;
</span></span><span class="line"><span class="cl"><span class="k">-</span> &#34;release&#34; → &#34;lançamento&#34;
</span></span><span class="line"><span class="cl"><span class="k">-</span> &#34;feature&#34; → &#34;recurso/funcionalidade&#34;
</span></span><span class="line"><span class="cl"><span class="k">-</span> &#34;deploy&#34; → &#34;implantação&#34; or just &#34;colocar em produção&#34;
</span></span><span class="line"><span class="cl"><span class="k">-</span> &#34;trade-off&#34; → &#34;dilema&#34; or &#34;escolha&#34;
</span></span><span class="line"><span class="cl"><span class="k">-</span> &#34;performance&#34; → &#34;desempenho&#34;
</span></span><span class="line"><span class="cl"><span class="k">-</span> &#34;default&#34; → &#34;padrão&#34;
</span></span><span class="line"><span class="cl"><span class="k">-</span> &#34;insight&#34; → &#34;percepção/sacada&#34;
</span></span><span class="line"><span class="cl"><span class="k">-</span> &#34;skills&#34; → &#34;habilidades&#34;
</span></span><span class="line"><span class="cl"><span class="k">-</span> &#34;approach&#34; → &#34;abordagem&#34;
</span></span><span class="line"><span class="cl">- &#34;highlights&#34; → &#34;destaques&#34;</span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>I call this &ldquo;scrubbing anglicisms so the TTS doesn&rsquo;t choke&rdquo;. The funny part is I didn&rsquo;t want that level of restriction on my script. I wanted the voice to just pronounce &ldquo;update&rdquo; when &ldquo;update&rdquo; was the natural word. Since the model couldn&rsquo;t, I had to mutilate the podcast&rsquo;s vocabulary so the final result was actually listenable. It&rsquo;s a band-aid, the kind you add hoping to rip it off later when the tech grows up.</p>
<h2>The experience with ElevenLabs v3<span class="hx:absolute hx:-mt-20" id="the-experience-with-elevenlabs-v3"></span>
    <a href="#the-experience-with-elevenlabs-v3" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Yesterday afternoon I opened an ElevenLabs account, bought the Pro plan ($99/month), and started experimenting with the <code>eleven_v3</code> model, which shipped in February of this year. Thirty minutes later I had a proof of concept running, and about two hours later the whole podcast system was migrated. The effort gap is an abyss.</p>
<p>The technical details of the migration ended up in an internal project doc, so here&rsquo;s the summary comparison table that actually matters:</p>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Qwen3 TTS 1.7B (old)</th>
          <th>ElevenLabs <code>eleven_v3</code> (current)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Quality (Akita)</td>
          <td>Good, cloned from real audio</td>
          <td>Better, same clone but more natural prosody</td>
      </tr>
      <tr>
          <td>Inline emotion in script</td>
          <td>Not supported</td>
          <td><code>[sighs]</code>, <code>[sarcastically]</code>, <code>[excited]</code>, <code>[laughs]</code>, works in pt-BR</td>
      </tr>
      <tr>
          <td>Cold start</td>
          <td>5 to 15 min spinning up a RunPod GPU before each run</td>
          <td>Zero, HTTPS call with immediate response</td>
      </tr>
      <tr>
          <td>Throughput</td>
          <td>~1× real time (serialized)</td>
          <td>~6× real time with concurrency 4</td>
      </tr>
      <tr>
          <td>Wall clock for a 28-min episode</td>
          <td>~25 to 30 min</td>
          <td><strong>~4 min</strong></td>
      </tr>
      <tr>
          <td>Operational surface</td>
          <td>RunPod, Docker, FastAPI, Qwen weights, GPU billing</td>
          <td>One env var (<code>ELEVENLABS_API_KEY</code>)</td>
      </tr>
      <tr>
          <td>Cost per episode</td>
          <td>~$0.08 of GPU</td>
          <td>~$2.70 in ElevenLabs credits</td>
      </tr>
  </tbody>
</table>
<p>Look at the last line. Qwen3 costs less than ten cents of a dollar per episode. ElevenLabs costs over thirty times as much. And it&rsquo;s still worth it. The other lines on the table solve problems that were eating hours of my week. I no longer need to write code to scale GPUs in the cloud, I don&rsquo;t need to wait for the machine to spin up every time, I don&rsquo;t need to babysit a model or rebuild a Docker image when the weights change. The operation became a single line of configuration.</p>
<p>And here&rsquo;s the interesting part: the inline emotion tags. The v3 model accepts markers like <code>[sighs]</code>, <code>[sarcastically]</code>, <code>[dryly]</code>, <code>[excited]</code>, and changes the delivery accordingly. This works in more than 70 languages, including Brazilian Portuguese. It transformed script generation, because now I can ask the LLM that writes the script to drop emotional tags at the right moments, which gives a liveliness Qwen3 couldn&rsquo;t deliver in a million years. A concrete example of what comes out of the script later:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code>AKITA: Isso é simples. [dismissive] Quem ainda acredita que Bitcoin
vai morrer não tá prestando atenção.
MARVIN: [sighs] Mais uma semana, mais uma leva de devs confiando
cegamente em pacotes npm. Previsível.</code></pre></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>I even have two separate tag palettes, one for Akita (expressive but controlled, uses <code>[excited]</code>, <code>[dismissive]</code>, <code>[emphatic]</code>) and another for Marvin (stoic, just <code>[sighs]</code>, <code>[sarcastically]</code>, <code>[tired]</code>, <code>[dryly]</code>). That&rsquo;s all encoded in the script generation prompts so the LLM knows which character is allowed to use what.</p>
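<p>If you&rsquo;re curious what that looks like in practice, it reduces to something tiny. A minimal sketch, assuming a plain dict feeding the prompt builder; the tag names are the real ones above, but the structure and helper are my own illustration:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code># Assumed shape of the per-character tag palettes; the tag names
# come straight from the post, the dict and helper are my sketch.
TAG_PALETTES = {
    "AKITA": ["[excited]", "[dismissive]", "[emphatic]"],
    "MARVIN": ["[sighs]", "[sarcastically]", "[tired]", "[dryly]"],
}

def palette_prompt(character: str) -&gt; str:
    """Render the allow-list line injected into the script prompt."""
    tags = ", ".join(TAG_PALETTES[character])
    return f"{character} may only use these emotion tags: {tags}."</code></pre></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>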
<h2>About Marvin&rsquo;s voice<span class="hx:absolute hx:-mt-20" id="about-marvins-voice"></span>
    <a href="#about-marvins-voice" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>For listeners who are already used to Marvin&rsquo;s voice, don&rsquo;t worry: I cloned him on ElevenLabs using the same audio sample I had used to train on Qwen3. It&rsquo;s the same voice. It just sounds even better now, because the ElevenLabs model captures nuance and prosody that Qwen3 couldn&rsquo;t deliver.</p>
<h2>Listen to it and tell me<span class="hx:absolute hx:-mt-20" id="listen-to-it-and-tell-me"></span>
    <a href="#listen-to-it-and-tell-me" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>To prove the talk is real, here&rsquo;s Monday&rsquo;s episode, April 6th, already generated with the new pipeline:</p>
<audio controls preload="metadata" style="width: 100%; max-width: 640px;">
  <source src="https://makita-news.s3.amazonaws.com/podcasts/episodes/2026-04-06.mp3" type="audio/mpeg">
  Your browser doesn't support the audio element. <a href="https://makita-news.s3.amazonaws.com/podcasts/episodes/2026-04-06.mp3">Download the mp3 here.</a>
</audio>
<p>If you&rsquo;re already a podcast listener on <a href="https://open.spotify.com/show/7MzG2UB7IAkC3GAwEXEIVD"target="_blank" rel="noopener">Spotify</a> and you&rsquo;ve heard previous episodes made with Qwen3, compare them and tell me in the comments whether you notice the difference or whether it&rsquo;s the same to you. I&rsquo;m genuinely curious how much of this is my trained ear after hours of listening to TTS audio and how much is an obvious difference for a casual listener.</p>
<p>Starting next week, every episode of the podcast will be generated by ElevenLabs v3. The newsletter is already plugged into the new pipeline, the scheduled jobs to pre-warm the RunPod GPU have been disabled, and the legacy Qwen3 code stays in the repo as plan B in case v3 starts acting up chronically. Two file edits and I can switch back. I probably never will.</p>
<h2>The part about dubbing the YouTube videos<span class="hx:absolute hx:-mt-20" id="the-part-about-dubbing-the-youtube-videos"></span>
    <a href="#the-part-about-dubbing-the-youtube-videos" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Now the hook for the second half of this post. On the <a href="/en/2026/04/09/20-years-of-blogging-ai-finally-translated-everything/">anniversary piece I published earlier today</a>, I told the story of how Claude Code translated nearly half my blog to English over a weekend. In the same spirit, I went after dubbing the videos on the <a href="https://www.youtube.com/@Akitando"target="_blank" rel="noopener">Akitando</a> channel.</p>
<p>The channel has 146 episodes, around 96 hours of technical content in Portuguese, and more than 500k subscribers. I already had the subtitles translated (one curated <code>.srt</code> per episode), and YouTube even offers automatic English dubbing. But the result is like Google Translate circa 2015: gets the idea across, nobody wants to listen for very long.</p>
<p>I tested ElevenLabs&rsquo; three voice approaches and only one worked. Speech-to-Speech converts a voice but doesn&rsquo;t translate. The Dubbing API does translate but creates the voice on its own, with no way to force a specific clone. Only Text-to-Speech could solve this: take the English <code>.srt</code> I already had, send each block to the TTS endpoint with my cloned voice, and assemble the audio aligned to the original video.</p>
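<p>Before the pipeline details, here&rsquo;s what &ldquo;send each block to the TTS endpoint&rdquo; boils down to. A minimal sketch against the plain REST endpoint, assuming a placeholder voice id for the clone; the real pipeline layers voice settings and retries on top of this:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code># Minimal sketch: one subtitle block in, one mp3 out, through the
# ElevenLabs REST API. VOICE_ID is a placeholder, not my real clone.
import os
import requests

VOICE_ID = "voice-id-of-the-clone"  # placeholder
API_KEY = os.environ["ELEVENLABS_API_KEY"]

def tts_block(text: str, out_path: str) -&gt; None:
    resp = requests.post(
        f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
        headers={"xi-api-key": API_KEY},
        json={"text": text, "model_id": "eleven_v3"},
        timeout=120,
    )
    resp.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(resp.content)  # audio/mpeg bytes

tts_block("This block becomes one chunk of the dub.", "chunk_000.mp3")</code></pre></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>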
<p>But having a translated <code>.srt</code> is not enough. Raw subtitle translation doesn&rsquo;t survive TTS. Passages with source code, URLs, hex hashes, lists of shell commands — the model switches to spelling mode and the audio comes out at twice the expected duration. Translations that run too long blow past the time window and need to be condensed so the voice doesn&rsquo;t sound like an auctioneer. Truncated SRTs need to be completed. And after each fix, you re-run the pipeline, listen, find the next problem, fix it, re-run. The whole thing was a cycle of interruptions, manual corrections, and re-runs — a far cry from the &ldquo;press a button and get a dub&rdquo; fantasy.</p>
<p>The big challenge was the accent. My cloned voice was trained on Brazilian Portuguese audio, so when it tries to speak English, the thick accent comes through. ElevenLabs has an <code>[American accent]</code> tag that works on v3, but on top of a voice trained in another language it&rsquo;s weak — the Brazilian accent still bleeds through. The fix was to train a second voice of mine, English only. I recorded a few minutes in my best American accent, uploaded it as a separate Instant Voice Clone in my account, named it &ldquo;Akita English&rdquo;, and set the pipeline to use that voice by default. The result comes out more natural, no tag needed, and the voice identity is still mine.</p>
<h3>The honest ceiling on my English voice quality<span class="hx:absolute hx:-mt-20" id="the-honest-ceiling-on-my-english-voice-quality"></span>
    <a href="#the-honest-ceiling-on-my-english-voice-quality" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Let&rsquo;s be frank. I&rsquo;m a Brazilian who speaks English, not a native speaker, and I don&rsquo;t have the time or patience to drill pronunciation for hours. A clone can only ever be as good as the source sample — if the source is a non-native reading a script, the clone inherits every imperfection. No clone is going to make me sound like &ldquo;Fabio speaking perfect English&rdquo; because that sound doesn&rsquo;t exist for the model to learn from.</p>
<p>To get to the version that&rsquo;s running today, I recorded five different training samples throughout the day, each with a different rhythm and script, and ran an A/B battery against the original Portuguese. None sounded native, and that was never the goal. The realistic bar was more modest: sound recognizably like me, read technical content without stumbling, and carry a full one-hour episode without wearing the listener out. The winner ended up being the first iteration, the original &ldquo;Akita English&rdquo;. Far from perfect, but good enough to ship. Swapping it out later is literally one line in the config file.</p>
<p>What surprised me most along the way was realizing that the rhythm of the training sample matters more than the timbre. The clone learns the cadence of whoever recorded it: iteration 5 came out 25% slower than iteration 2 reading the exact same text, purely because I recorded that particular sample more slowly. For dubbing tech videos, the right target is the rhythm of a &ldquo;YouTube tech explainer&rdquo; — brisk, energetic, around 150 words per minute. No audiobook pacing.</p>
<h3>How I align audio to the original video<span class="hx:absolute hx:-mt-20" id="how-i-align-audio-to-the-original-video"></span>
    <a href="#how-i-align-audio-to-the-original-video" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>A non-obvious point: spoken English tends to be longer than equivalent Portuguese. If you just generate the audio for each subtitle and paste it at the original timestamp, the dub drifts later and later. With 30 minutes of video you&rsquo;re already several seconds out of sync.</p>
<p>My approach was to attack the problem in layers. The first version split the SRT into smaller blocks, called &ldquo;chunks&rdquo;. Each chunk maxed out at 700 characters (so v3 wouldn&rsquo;t hallucinate) and could only cut at a sentence boundary, never mid-sentence. A simple algorithm accumulated subtitles into a buffer until it was about to overflow, then walked back looking for the last cue whose text ended with <code>.</code>, <code>!</code>, or <code>?</code>, and flushed only up to there. The cues still in the buffer got rolled into the next chunk. Each chunk also stored the start and end timestamps of the subtitles it covered, so the assembly step knew exactly where to drop that piece of audio later.</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># For every incoming cue, check whether adding it</span>
</span></span><span class="line"><span class="cl"><span class="c1"># would push the buffer past the char cap.</span>
</span></span><span class="line"><span class="cl"><span class="k">for</span> <span class="n">cue</span> <span class="ow">in</span> <span class="n">cues</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="n">projected</span> <span class="o">=</span> <span class="n">buffer_char_len</span><span class="p">(</span><span class="n">buf</span><span class="p">)</span> <span class="o">+</span> <span class="nb">len</span><span class="p">(</span><span class="n">cue</span><span class="o">.</span><span class="n">text</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">projected</span> <span class="o">&gt;</span> <span class="n">max_chars</span> <span class="ow">and</span> <span class="n">buf</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="c1"># Walk back for the last cue in the buffer that</span>
</span></span><span class="line"><span class="cl">        <span class="c1"># ends in &#39;.&#39;, &#39;!&#39;, &#39;?&#39;, or &#39;…&#39;.</span>
</span></span><span class="line"><span class="cl">        <span class="n">sentence_end</span> <span class="o">=</span> <span class="n">last_sentence_end_index</span><span class="p">(</span><span class="n">buf</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">sentence_end</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="n">buf</span> <span class="o">=</span> <span class="n">flush</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="n">sentence_end</span><span class="p">,</span> <span class="n">chunks</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="k">else</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="c1"># Run-on sentence longer than max_chars with no</span>
</span></span><span class="line"><span class="cl">            <span class="c1"># terminator. Rare, but happens. Warn and cut.</span>
</span></span><span class="line"><span class="cl">            <span class="n">log</span><span class="o">.</span><span class="n">warning</span><span class="p">(</span><span class="s2">&#34;run-on sentence &gt; </span><span class="si">%d</span><span class="s2"> chars — &#34;</span>
</span></span><span class="line"><span class="cl">                        <span class="s2">&#34;splitting mid-sentence as a last resort&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                        <span class="n">max_chars</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            <span class="n">buf</span> <span class="o">=</span> <span class="n">flush</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">buf</span><span class="p">),</span> <span class="n">chunks</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">buf</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">cue</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Soft-break: if the buffer is already at least 60% full</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># AND the current cue ended a sentence, flush now. Keeps</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># chunks balanced around 60-100% of max.</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">cue</span><span class="o">.</span><span class="n">ends_sentence</span> <span class="ow">and</span> <span class="n">buffer_char_len</span><span class="p">(</span><span class="n">buf</span><span class="p">)</span> <span class="o">&gt;=</span> <span class="n">max_chars</span> <span class="o">*</span> <span class="mf">0.6</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">buf</span> <span class="o">=</span> <span class="n">flush</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">buf</span><span class="p">),</span> <span class="n">chunks</span><span class="p">)</span></span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>Technically, this was the turning point: instead of letting the dub follow the visual structure of the subtitle file, the chunker started following the real paragraph structure of the script. I won&rsquo;t retell the whole story of how I got there and why it eventually forced a mass rerender, because I come back to that later in the section about the problems that only showed up after the first upload. The technical point here is simple: once the window represented a real spoken paragraph, the audio started sounding natural, and the stretch target could move from 92% to 95%.</p>
<p>Once the chunking is done, there&rsquo;s still the problem that English runs longer than the Portuguese equivalent when spoken. The trick here is predicting whether a chunk is going to blow past its target window <strong>before</strong> generating the audio. If the predicted duration (<code>characters / 16 chars-per-second</code>) already exceeds the target window by a margin, the pipeline passes the <code>speed</code> parameter directly to the ElevenLabs API, asking the model to natively generate faster speech. This preserves prosody way better than compressing afterwards. The ceiling is 1.15× (beyond that the voice starts sounding rushed).</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># before calling the API, predict if it will overrun</span>
</span></span><span class="line"><span class="cl"><span class="n">expected_sec</span> <span class="o">=</span> <span class="n">char_count</span> <span class="o">/</span> <span class="n">EXPECTED_CHARS_PER_SEC</span>  <span class="c1"># 16 chars/s</span>
</span></span><span class="line"><span class="cl"><span class="n">predicted_ratio</span> <span class="o">=</span> <span class="n">expected_sec</span> <span class="o">/</span> <span class="n">target_sec</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">voice_speed</span> <span class="o">=</span> <span class="mf">1.0</span>
</span></span><span class="line"><span class="cl"><span class="k">if</span> <span class="n">predicted_ratio</span> <span class="o">&gt;</span> <span class="n">PREEMPTIVE_SPEED_THRESHOLD</span><span class="p">:</span>   <span class="c1"># 1.05</span>
</span></span><span class="line"><span class="cl">    <span class="n">voice_speed</span> <span class="o">=</span> <span class="nb">min</span><span class="p">(</span><span class="n">predicted_ratio</span> <span class="o">*</span> <span class="mf">0.98</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                      <span class="n">PREEMPTIVE_SPEED_MAX</span><span class="p">)</span>         <span class="c1"># cap at 1.15</span>
</span></span><span class="line"><span class="cl">    <span class="n">voice_settings</span><span class="p">[</span><span class="s2">&#34;speed&#34;</span><span class="p">]</span> <span class="o">=</span> <span class="n">voice_speed</span></span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>Even with preemptive <code>speed</code> turned on, sometimes the generated audio still comes out a tiny bit longer than the target window. That&rsquo;s what the second safety net is for: measure the actual audio with <code>ffprobe</code> after it&rsquo;s generated, compare to the target window, and apply the <code>ffmpeg</code> <code>atempo</code> filter to compress in post if it exceeded the 2% tolerance, capped at 1.20×. Combining native <code>speed</code> (1.15×) with <code>atempo</code> (1.20×) gives an effective compression ceiling of 1.38×, enough to fit naturally even in the worst cases without breaking quality.</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># after generating the chunk</span>
</span></span><span class="line"><span class="cl"><span class="n">actual</span> <span class="o">=</span> <span class="n">ffprobe_duration</span><span class="p">(</span><span class="n">chunk_path</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">ratio</span> <span class="o">=</span> <span class="n">actual</span> <span class="o">/</span> <span class="n">target_sec</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">if</span> <span class="n">ratio</span> <span class="o">&gt;</span> <span class="n">FIT_TOLERANCE</span><span class="p">:</span>       <span class="c1"># 1.02</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># compress the audio without touching pitch</span>
</span></span><span class="line"><span class="cl">    <span class="n">ffmpeg_atempo</span> <span class="o">=</span> <span class="nb">min</span><span class="p">(</span><span class="n">ratio</span><span class="p">,</span> <span class="n">FIT_MAX_ATEMPO</span><span class="p">)</span>   <span class="c1"># 1.20</span>
</span></span><span class="line"><span class="cl">    <span class="n">apply_atempo</span><span class="p">(</span><span class="n">chunk_path</span><span class="p">,</span> <span class="n">ffmpeg_atempo</span><span class="p">)</span></span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>When a generated chunk comes out shorter than the target window, the pipeline slightly stretches the audio (without affecting pitch) to reduce the ugly silence after it ends. The goal isn&rsquo;t to fill the whole window — a natural pause between sentences is fine — just to smooth over the worst offenders that would feel like &ldquo;cut audio&rdquo;.</p>
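<p>That stretch is just the mirror image of the <code>atempo</code> compression above, with a factor below 1.0. A minimal sketch: the 95% fill target echoes the stretch target I mentioned earlier, while the floor on how slow the audio can get is my assumption:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code># Mirror of the compression pass: a too-short chunk is slowed down
# with atempo &lt; 1.0 (longer audio, same pitch). STRETCH_TARGET
# echoes the 95% fill target; ATEMPO_FLOOR is an assumption.
import subprocess

STRETCH_TARGET = 0.95  # aim to fill ~95% of the window, never 100%
ATEMPO_FLOOR = 0.90    # assumed: never slow playback below 0.90x

def smooth_short_chunk(chunk_path: str, actual_sec: float,
                       target_sec: float) -&gt; None:
    fill = actual_sec / target_sec
    if fill &gt;= STRETCH_TARGET:
        return  # close enough; a natural pause is fine
    # factor that lands the chunk on the target fill, capped
    atempo = max(fill / STRETCH_TARGET, ATEMPO_FLOOR)
    subprocess.run(
        ["ffmpeg", "-y", "-i", chunk_path,
         "-filter:a", f"atempo={atempo:.3f}",
         chunk_path + ".fit.mp3"],
        check=True, capture_output=True,
    )</code></pre></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>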
<p>On a big episode, 95 minutes, 194 chunks, the total cumulative drift came out to -0.7%. About 43 seconds of drift across the entire episode, imperceptible while you&rsquo;re watching.</p>
<p>What saved the budget on this architecture is that each chunk lives on disk as its own <code>.mp3</code>. The pipeline keeps a manifest with the normalized text of every chunk, and before calling the ElevenLabs API, it compares the current text against the cache. If the text hasn&rsquo;t changed, it reuses the existing audio without spending a single credit. If I rewrite a problematic cue, only the chunks affected by that cue get regenerated — the rest of the episode stays untouched.</p>
<p>This is what made the iterations viable. I&rsquo;d run the batch, listen to sections, spot a problem (translation too long, a code snippet the TTS couldn&rsquo;t pronounce, a chunk boundary that landed at a bad spot), fix the SRT or tweak the chunker parameters, and re-run. Each re-run consumed a fraction of the credits and time of the original run, because only the changed chunks were regenerated. Without this cache, every iteration would have cost nearly as much as the first pass, and the total batch cost would have been two or three times higher.</p>
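<p>The cache check itself is trivial, which is the point. A minimal sketch, assuming a manifest that maps chunk ids to a hash of the normalized text plus the mp3 path; the real manifest layout isn&rsquo;t shown here:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code># Sketch of the manifest cache: the normalized chunk text is the
# key. Same text means the mp3 is reused and no credit is spent.
# The manifest layout (chunk id to {sha256, mp3}) is my assumption.
import hashlib
import os

def normalize(text: str) -&gt; str:
    return " ".join(text.split()).lower()

def digest(text: str) -&gt; str:
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()

def is_cached(manifest: dict, chunk_id: str, text: str) -&gt; bool:
    entry = manifest.get(chunk_id)
    return (entry is not None
            and entry["sha256"] == digest(text)
            and os.path.exists(entry["mp3"]))

def record(manifest: dict, chunk_id: str, text: str, mp3_path: str) -&gt; None:
    manifest[chunk_id] = {"sha256": digest(text), "mp3": mp3_path}</code></pre></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>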
<p>That chunking change also had a heavy cache and cost impact, but I&rsquo;ll leave that part for the later section where I explain the actual retracing and rerender cycle. Here, the important point is just that it was the right technical move.</p>
<h3>When <code>atempo</code> can&rsquo;t save you: rewriting cues before TTS<span class="hx:absolute hx:-mt-20" id="when-atempo-cant-save-you-rewriting-cues-before-tts"></span>
    <a href="#when-atempo-cant-save-you-rewriting-cues-before-tts" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Not everything is about window alignment. There&rsquo;s a kind of cue that breaks the whole pipeline, where neither preemptive <code>speed</code> nor post-process <code>atempo</code> can save it, and I only ran into it once I started running the batch over the more technical episodes of the channel.</p>
<p>The problem is this. The v3 reads at about 16 characters per second when you hand it normal conversational text. But when you hand it a cue stuffed with a literal URL, a hex hash, a binary string, a long sorted number list, or a shell code block, the model shifts into &ldquo;spelling it out letter by letter&rdquo; mode and drops to about 9 characters per second. A 500-character cue that was supposed to turn into 30 seconds of audio comes out at 55. The sanity check rejects it (because it&rsquo;s past 1.8×), the automatic retry tries the same 500 characters again, the five retries all fail in a row, and the chunk gets stuck.</p>
<p>And this did not happen by accident. I used to write those passages that way because I knew the same scripts would later become blog posts, so it made sense to leave the raw URL, the full command, the complete hash, the whole technical detail there for the reader. In Portuguese that worked fine because it matched my original workflow exactly: record the video, then publish the matching text on the blog. What I had never prepared for was the next step, turning that same material into automated English dubbing. I only realized that conflict when the pipeline started failing for real.</p>
<p>I hit this first on ep052, the Ubuntu beginner&rsquo;s guide, where two cues carried <code>github.com</code> URLs, <code>hkp://keyserver.ubuntu.com</code> URLs, and a 40-character GPG hash. Trying to fix that in post-processing is a waste of time. The 1.20× <code>atempo</code> ceiling will never compress 55 seconds into 30, and even if it did, the result would be an unlistenable chipmunk. The fix has to land upstream, in the text itself.</p>
<p>The solution was to rewrite the cue so it describes what the command does instead of showing the literal command:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-diff" data-lang="diff"><span class="line"><span class="cl"><span class="gd">- Run: apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 \
</span></span></span><span class="line"><span class="cl"><span class="gd">-   --recv-keys 0A6A3E7F79F93EF8AAB9E92BAEBB74C8B5A1E44D
</span></span></span><span class="line"><span class="cl"><span class="gi">+ Run the full command in the video description to import the
</span></span></span><span class="line"><span class="cl"><span class="gi">+ signing key from the Ubuntu keyserver.
</span></span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>The on-screen English caption still shows the full command, so anyone reading the caption sees exactly what they need to type. The dubbed audio describes the intent in spoken language, which is what the TTS can deliver without choking. The instruction to the viewer stays intact, the characters-per-second ratio goes back to the expected 16, and the generated audio fits the original window without needing any stretch or compression in post.</p>
<p>I went hunting for cues like this across the whole batch. I scanned from ep052 through ep146 looking for cues with a high density of &ldquo;hard&rdquo; characters (digits, brackets, operators, long URLs, binary, hex) and ended up rewriting around 30 cues across 13 episodes:</p>
<ul>
<li>ep052 Ubuntu for devs: GitHub URLs and a 40-character GPG hash</li>
<li>ep091 Hello World in C: binary examples</li>
<li>ep095 640kB memory: address wrap, segment-offset</li>
<li>ep106 CodeMiner: spelled-out email address</li>
<li>ep113 Compression: pure binary strings, 75% hard characters</li>
<li>ep115 SQL Server: long sorted number list</li>
<li>ep120 Internet: dotted IP addresses</li>
<li>ep121 Sockets: two JavaScript code blocks</li>
<li>ep122 Proxies: Chrome User-Agent headers</li>
<li>ep123 Secure networking: shell one-liners, Docker flags</li>
<li>ep126 Gentoo: C <code>chroot</code> demo</li>
<li>ep136 Containers: GitHub release URL</li>
<li>ep144 Cryptography: signing key address</li>
</ul>
<p>The pattern was always the same: keep the caption showing the code, URL, or hash verbatim, and rewrite the spoken text so the narrator describes what it does. I could have automated the rewrite by letting an LLM read each cue and propose a substitution on the fly, but it felt safer to review them all by hand. It&rsquo;s exactly the kind of thing where the model decides to &ldquo;improve&rdquo; a command to make it cleaner and ships with broken shell. The manual sweep took about two hours. The automated version would have cost the same time in review, with the bonus anxiety of having shipped a mangled command by accident.</p>
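<p>The hunting half, though, is easy to script. A minimal sketch of that kind of density scan, using <code>pysrt</code> (which the pipeline already uses for parsing); the character class and the 30% cutoff are my assumptions, not the exact values from the real sweep:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code># Sketch of the "hard character" density scan over an SRT file.
# The regex class and the 0.30 threshold are assumptions, not the
# exact values the real sweep used.
import re
import pysrt

HARD = re.compile(r"[0-9/\\:\[\]{}()&lt;&gt;=+*&amp;%$#@_|~^-]")

def hard_density(text: str) -&gt; float:
    stripped = "".join(text.split())
    return len(HARD.findall(stripped)) / len(stripped) if stripped else 0.0

def flag_cues(path: str, threshold: float = 0.30):
    """Yield cues dense enough to push the TTS into spelling mode."""
    for cue in pysrt.open(path):
        d = hard_density(cue.text)
        if d &gt;= threshold:
            yield cue.index, d, cue.text

for idx, d, text in flag_cues("ep113.en.srt"):  # hypothetical filename
    print(f"cue {idx}: {d:.0%} hard chars: {text[:60]}")</code></pre></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>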
<p>There was also one weird structural case on ep146, about Docker Compose. Two cues had glued themselves together because of a missing blank line in the SRT, and <code>pysrt</code> was treating them as one giant cue. The TTS never even got to process it — the chunker choked first. I fixed it by hand, adding the missing blank line. One-character fix, half an hour to track down.</p>
<h3>Reconstructing truncated SRTs<span class="hx:absolute hx:-mt-20" id="reconstructing-truncated-srts"></span>
    <a href="#reconstructing-truncated-srts" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Another trap showed up when I went to run the full batch: three episodes had English captions that were too short relative to the actual video length. Ep056, the Rails episode, was the most absurd case. Nineteen minutes of caption for an 80-minute video. Sixty-two minutes of content with no caption at all. Ep057 (WSL 2) was missing 14 minutes, and ep068 (Git Direito) was missing 2.</p>
<p>This is a leftover from my old translation workflow. At some point I started hand-revising the caption, stopped partway through, and the <code>.en.srt</code> file got saved truncated at the spot where I stopped. You can&rsquo;t dub an 80-minute video with a 19-minute caption. The chunker produces audio up to where it can read and then simply has no idea what to do with the rest of the video.</p>
<p>The fix became a new script. It diffs the truncated <code>.en.srt</code> against the <code>.pt-orig.srt</code> (the raw YouTube auto-caption, which always covers the whole video), picks up the cues that only exist in Portuguese, sends them to Claude Sonnet 4.6 with a strict JSON schema asking for cue-by-cue translation, and pastes the result back at the tail of the English file. The schema is the detail that matters. When I tried asking for the response as plain text in <code>N|text</code> format, Claude was dropping about 20% of the cues along the way. With a strict JSON schema, the drop rate fell to zero across the three repairs.</p>
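<p>The skeleton of that repair call looks roughly like this. Model id and prompt wording are placeholders, and I&rsquo;m approximating the strict schema with a prompted JSON contract plus the cue-count check that catches drops:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code># Sketch of the cue-by-cue repair: translate the missing Portuguese
# cues and refuse any response that drops one. Model id and prompt
# are placeholders; the real script enforces a stricter schema.
import json
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def translate_missing(cues: list) -&gt; list:
    prompt = (
        "Translate each subtitle cue from Portuguese to English. "
        "Return ONLY a JSON list of {\"n\": int, \"text\": str}, "
        "one object per input cue, preserving every \"n\".\n\n"
        + json.dumps(cues, ensure_ascii=False)
    )
    resp = client.messages.create(
        model="claude-sonnet-4-6",  # placeholder id
        max_tokens=8192,
        messages=[{"role": "user", "content": prompt}],
    )
    out = json.loads(resp.content[0].text)
    got = {c["n"] for c in out}
    want = {c["n"] for c in cues}
    if got != want:
        raise ValueError(f"model dropped cues: {sorted(want - got)}")
    return out</code></pre></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>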
<p>The numbers:</p>
<ul>
<li>ep068 Git Direito: +50 cues / 2 minutes reconstructed</li>
<li>ep057 WSL 2: +368 cues / 14 minutes</li>
<li>ep056 Rails: +1445 cues / 62 minutes (plus dropping 5 fake cues the old translator had invented at the tail to plug the gap)</li>
</ul>
<p>After the repairs, the three episodes joined the batch normally and got dubbed like any other. The kind of tool you hope you never need, but when you need it, it&rsquo;s worth writing once and closing the problem for good.</p>
<h3>Automatic emotion tags<span class="hx:absolute hx:-mt-20" id="automatic-emotion-tags"></span>
    <a href="#automatic-emotion-tags" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>While I was testing, I decided to add one more step to the pipeline: an &ldquo;emotion tagger&rdquo; that reads the English SRT before it goes to TTS and inserts tags like <code>[sarcastic]</code>, <code>[thoughtful]</code>, <code>[emphatic]</code>, <code>[deadpan]</code> at spots where a human narrator would naturally shift tone. The idea is to mimic what a professional voice actor would do — hit the emphasis at the right moments — without turning the video into a theater of overblown emotions.</p>
<p>This part is tricky for two reasons. First: letting an LLM loose on your SRT runs a real risk of it &ldquo;improving&rdquo; the text (swapping a word here, rewriting a sentence there) and you ending up with a dub that doesn&rsquo;t match the caption. Second: LLMs love to overdo it. Ask them to tag emotion and they&rsquo;ll put a tag on every other line. To guard against both, I pinned a small allow-list of tags and run a round-trip validation after Claude&rsquo;s response comes back:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">ALLOWED_TAGS</span><span class="p">:</span> <span class="nb">frozenset</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span> <span class="o">=</span> <span class="nb">frozenset</span><span class="p">({</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;[sarcastic]&#34;</span><span class="p">,</span> <span class="s2">&#34;[thoughtful]&#34;</span><span class="p">,</span> <span class="s2">&#34;[emphatic]&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;[deadpan]&#34;</span><span class="p">,</span> <span class="s2">&#34;[serious]&#34;</span><span class="p">,</span> <span class="s2">&#34;[amused]&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;[sighs]&#34;</span><span class="p">,</span> <span class="s2">&#34;[exasperated]&#34;</span><span class="p">,</span> <span class="s2">&#34;[confident]&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;[matter-of-fact]&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"><span class="p">})</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">validate_tagged</span><span class="p">(</span><span class="n">original</span><span class="p">,</span> <span class="n">tagged</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Ensure the tagged SRT preserves the original with no drift.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">original</span><span class="p">)</span> <span class="o">!=</span> <span class="nb">len</span><span class="p">(</span><span class="n">tagged</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="k">raise</span> <span class="n">TagValidationError</span><span class="p">(</span><span class="s2">&#34;cue count mismatch&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="n">o</span><span class="p">,</span> <span class="n">t</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">original</span><span class="p">,</span> <span class="n">tagged</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="c1"># Index and timestamp must match byte for byte.</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">o</span><span class="o">.</span><span class="n">index</span> <span class="o">!=</span> <span class="n">t</span><span class="o">.</span><span class="n">index</span> <span class="ow">or</span> <span class="n">o</span><span class="o">.</span><span class="n">timestamp</span> <span class="o">!=</span> <span class="n">t</span><span class="o">.</span><span class="n">timestamp</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="k">raise</span> <span class="n">TagValidationError</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;cue </span><span class="si">{</span><span class="n">o</span><span class="o">.</span><span class="n">index</span><span class="si">}</span><span class="s2">: header drift&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="c1"># Any tag outside the allow-list invalidates the whole response.</span>
</span></span><span class="line"><span class="cl">        <span class="n">bad</span> <span class="o">=</span> <span class="n">find_disallowed_tags</span><span class="p">(</span><span class="n">t</span><span class="o">.</span><span class="n">text</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">bad</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="k">raise</span> <span class="n">TagValidationError</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;cue </span><span class="si">{</span><span class="n">o</span><span class="o">.</span><span class="n">index</span><span class="si">}</span><span class="s2">: </span><span class="si">{</span><span class="n">bad</span><span class="si">}</span><span class="s2">&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="c1"># The text with the tags stripped must be byte-identical to</span>
</span></span><span class="line"><span class="cl">        <span class="c1"># the original. If the LLM swapped a word, rewrote anything,</span>
</span></span><span class="line"><span class="cl">        <span class="c1"># validation fails and the response is rejected.</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">strip_tags</span><span class="p">(</span><span class="n">o</span><span class="o">.</span><span class="n">text</span><span class="p">)</span> <span class="o">!=</span> <span class="n">strip_tags</span><span class="p">(</span><span class="n">t</span><span class="o">.</span><span class="n">text</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">            <span class="k">raise</span> <span class="n">TagValidationError</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;cue </span><span class="si">{</span><span class="n">o</span><span class="o">.</span><span class="n">index</span><span class="si">}</span><span class="s2">: text drift&#34;</span><span class="p">)</span></span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
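<p>The helpers aren&rsquo;t shown above, but they&rsquo;re small. A minimal sketch, assuming tags always use the <code>[word]</code> bracket form from the allow-list (this is my reconstruction, not the actual pipeline code):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import re

# Anything in the [tag] bracket form used by the allow-list.
TAG_RE = re.compile(r"\[[a-z-]+\]")

class TagValidationError(Exception):
    pass

def strip_tags(text: str) -&gt; str:
    """Drop bracketed tags and collapse the whitespace they leave."""
    return re.sub(r"\s+", " ", TAG_RE.sub("", text)).strip()

def find_disallowed_tags(text: str) -&gt; list[str]:
    """Return every bracketed tag not present in ALLOWED_TAGS."""
    return [t for t in TAG_RE.findall(text) if t not in ALLOWED_TAGS]
</code></pre></div>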
<p>When validation fails, the pipeline throws the response away and re-requests (or, if it keeps failing, ships the SRT untagged). On tag density, the prompt sent to Claude explicitly says: aim for <code>N/10</code> tags in an SRT with N cues, hard floor of 1 tag per 12 cues, hard ceiling of 1 tag per 7 cues, distribution balanced across the four quarters of the episode, and no two tagged cues within 4 cues of each other. That rule set was the sixth iteration — the five before it either blanketed everything in tags or loaded them all into the first quarter.</p>
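<p>The same numeric rules are cheap to enforce as a post-check on the response, rather than trusting the prompt alone. A sketch, reusing <code>TAG_RE</code> and <code>TagValidationError</code> from above (the quarter-balance check is omitted for brevity):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">MIN_CUES_PER_TAG = 7    # hard ceiling: at most 1 tag per 7 cues
MAX_CUES_PER_TAG = 12   # hard floor: at least 1 tag per 12 cues
MIN_TAG_SPACING = 4     # no two tagged cues within 4 cues of each other

def check_density(tagged_cues):
    """Enforce the density and spacing rules after validation passes."""
    n = len(tagged_cues)
    hits = [i for i, c in enumerate(tagged_cues) if TAG_RE.search(c.text)]
    if not (n // MAX_CUES_PER_TAG &lt;= len(hits) &lt;= n // MIN_CUES_PER_TAG):
        raise TagValidationError(f"{len(hits)} tags for {n} cues")
    for a, b in zip(hits, hits[1:]):
        if b - a &lt; MIN_TAG_SPACING:
            raise TagValidationError(f"tags too close: cues {a} and {b}")
</code></pre></div>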
<h3>Don&rsquo;t let v3 hallucinate<span class="hx:absolute hx:-mt-20" id="dont-let-v3-hallucinate"></span>
    <a href="#dont-let-v3-hallucinate" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>A critical detail that only shows up in production: ElevenLabs v3 tends to hallucinate when you send very long blocks of text. I&rsquo;ve seen blocks where the input text was around 1,500 characters and the model generated <strong>nine minutes</strong> of audio when the expected duration was a minute and a half. The model just decides to keep talking on its own, inventing content.</p>
<p>The official ElevenLabs docs recommend keeping each call under 800 characters to avoid this. I went more conservative and cut at 700, always at sentence end (never mid-sentence, because a mid-sentence split sounds horrible when concatenated). On top of that, I bumped <code>stability</code> to 0.9 (Robust mode, more stable but less responsive to emotion tags), turned on <code>apply_text_normalization</code> so the model pronounces numbers and acronyms correctly, and added a sanity check that rejects any generated audio longer than 1.8× the expected duration.</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">DEFAULT_VOICE_SETTINGS</span> <span class="o">=</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;stability&#34;</span><span class="p">:</span> <span class="mf">0.9</span><span class="p">,</span>          <span class="c1"># Robust mode</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;similarity_boost&#34;</span><span class="p">:</span> <span class="mf">0.95</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;style&#34;</span><span class="p">:</span> <span class="mf">0.0</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;use_speaker_boost&#34;</span><span class="p">:</span> <span class="kc">True</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">EXPECTED_CHARS_PER_SEC</span> <span class="o">=</span> <span class="mf">16.0</span>
</span></span><span class="line"><span class="cl"><span class="n">CHUNK_SANITY_MAX_FACTOR</span> <span class="o">=</span> <span class="mf">1.8</span>   <span class="c1"># reject if actual &gt; 1.8× expected</span></span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
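<p>The chunker itself is simple to sketch: split on sentence boundaries, greedily pack sentences up to the 700-character cap, and check every returned chunk against the expected duration. The regex and packing below are my simplification of the idea, reusing the constants above, not the production code:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import re

MAX_CHUNK_CHARS = 700

def chunk_text(text: str) -&gt; list[str]:
    """Greedily pack whole sentences into chunks of at most 700 chars.
    Never splits mid-sentence: an oversized sentence becomes its own
    chunk instead of being cut."""
    sentences = re.split(r"(?&lt;=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + 1 + len(s) &gt; MAX_CHUNK_CHARS:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks

def check_duration(text: str, actual_secs: float) -&gt; None:
    """Reject hallucinated audio longer than 1.8x the duration
    expected from the character count."""
    expected = len(text) / EXPECTED_CHARS_PER_SEC
    if actual_secs &gt; CHUNK_SANITY_MAX_FACTOR * expected:
        raise ValueError(f"got {actual_secs:.1f}s, expected ~{expected:.1f}s")
</code></pre></div>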
<p>With those mitigations in place, a 95-minute episode (194 blocks) ran start to finish with zero hallucinations. Exactly one block failed mid-run on a transient 502 error from the API, which the automatic retry caught on the next attempt.</p>
<h3>Mastering for YouTube<span class="hx:absolute hx:-mt-20" id="mastering-for-youtube"></span>
    <a href="#mastering-for-youtube" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>YouTube doesn&rsquo;t publish an official number, but industry consensus is that its playback loudness normalization targets -14 LUFS. Deliver audio louder than that and YouTube turns it down; deliver it quieter and it leaves it alone. To land on the target exactly, the pipeline runs <code>ffmpeg</code>&rsquo;s <code>loudnorm</code> filter in two passes. The first pass measures the entire audio and prints the stats (<code>input_i</code>, <code>input_tp</code>, <code>input_lra</code>, <code>input_thresh</code>) as a JSON blob on stderr. The second pass reads those stats back in and applies a linear static gain to land precisely at -14 LUFS, -1.5 dBTP:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># Pass 1: measure</span>
</span></span><span class="line"><span class="cl">ffmpeg -i final_en.mp3 <span class="se">\
</span></span></span><span class="line"><span class="cl">  -af <span class="s2">&#34;highpass=f=80,loudnorm=I=-14:TP=-1.5:LRA=9:print_format=json&#34;</span> <span class="se">\
</span></span></span><span class="line"><span class="cl">  -f null -
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Pass 2: apply linear gain with the measured values</span>
</span></span><span class="line"><span class="cl">ffmpeg -i final_en.mp3 <span class="se">\
</span></span></span><span class="line"><span class="cl">  -af <span class="s2">&#34;highpass=f=80,loudnorm=I=-14:TP=-1.5:LRA=9\
</span></span></span><span class="line"><span class="cl"><span class="s2">:measured_I=-18.3:measured_TP=-3.1:measured_LRA=5.4\
</span></span></span><span class="line"><span class="cl"><span class="s2">:measured_thresh=-28.7:offset=1.2:linear=true&#34;</span> <span class="se">\
</span></span></span><span class="line"><span class="cl">  -ar <span class="m">48000</span> -ac <span class="m">2</span> -c:a pcm_s16le final_en_mastered.wav</span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>Two passes instead of one because single-pass <code>loudnorm</code> runs in dynamic mode and ends up &ldquo;pumping&rdquo; the gain during silent stretches. With <code>linear=true</code> on pass 2, the gain stays static on top of the pass-1 measurements, so no pumping. The result lands within ±0.1 LU of the target, which is effectively inaudible. The <code>highpass=f=80</code> filter in front kills HVAC rumble and mains hum below 80 Hz, frequencies the human ear barely registers but which shift the peak measurements around.</p>
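<p>Gluing the two passes together in a script is mostly a matter of fishing that JSON out of stderr. A sketch of the glue, assuming <code>ffmpeg</code> is on the PATH (the regex for locating the blob is my shortcut, not anything the tool guarantees):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import json
import re
import subprocess

def measure_loudness(path: str) -&gt; dict:
    """Pass 1: run loudnorm in measure-only mode and parse the JSON
    stats ffmpeg prints at the tail of stderr."""
    proc = subprocess.run(
        ["ffmpeg", "-i", path,
         "-af", "highpass=f=80,loudnorm=I=-14:TP=-1.5:LRA=9:print_format=json",
         "-f", "null", "-"],
        capture_output=True, text=True,
    )
    blob = re.search(r"\{[^{}]+\}\s*$", proc.stderr)
    return json.loads(blob.group(0))

def pass2_filter(stats: dict) -&gt; str:
    """Pass 2: build the linear-gain loudnorm filter string."""
    return (
        "highpass=f=80,loudnorm=I=-14:TP=-1.5:LRA=9"
        f":measured_I={stats['input_i']}"
        f":measured_TP={stats['input_tp']}"
        f":measured_LRA={stats['input_lra']}"
        f":measured_thresh={stats['input_thresh']}"
        f":offset={stats['target_offset']}"
        ":linear=true"
    )
</code></pre></div>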
<h3>The point where I thought it was done. And it wasn&rsquo;t.<span class="hx:absolute hx:-mt-20" id="the-point-where-i-thought-it-was-done-and-it-wasnt"></span>
    <a href="#the-point-where-i-thought-it-was-done-and-it-wasnt" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>This is where the most annoying, and most educational, part of the project starts.</p>
<p>I really thought I had nailed it on the first big pass. The pipeline closed all 146 episodes, I uploaded everything, and it looked done. The durations matched, the files were all there, and the whole process had finally gotten through the complete batch of nearly 96 hours of video. But it was exactly after that upload that I started noticing problems I simply had not anticipated.</p>
<p>The first scare was the jingle situation. I already knew many Akitando episodes begin with that short instrumental intro before I start talking, but I underestimated how much trouble it would cause. YouTube&rsquo;s <code>.srt</code> is almost useless here, because a jingle is not speech. Sometimes it marks <code>[Music]</code>, sometimes it gives you nothing useful, sometimes the window is just wrong. In the first assembly pass, some episodes lost the jingle entirely, some got it doubled, and in others the source-fill from the original audio already contained the right jingle and my splice came in on top of it, pushing everything else forward. This was the kind of bug you only catch when you sit down and actually listen to the final mastered file, not when you just look at total duration.</p>
<p>The fix was to stop hoping the subtitle would tell me where the jingle was and go straight to the source. I built a detector using the original video audio itself plus reference jingles, with an audit and repair pass on top of the mastered outputs already generated. That eventually became a small subsystem of its own: detector, audit, cached reassembly, exact-window repair. At the end of that detour, the set of changed episodes collapsed into a well-defined reupload bucket and the jingle problem finally came under control. It was one of those classic production moments: the first solution looked good enough until it hit the whole archive.</p>
<p>Then came the second punch, and this one was more expensive: I had forgotten a basic fact about YouTube subtitles. SRT is built to be read on screen, not to serve as a spoken script. So an idea that was a full paragraph in my original text showed up split into several small cues, each with its own micro time window. In the first version of the pipeline I was using those cues as the generation unit for ElevenLabs. The result was TTS audio that was coherent locally, but the final assembly inserted artificial silences in the middle of sentences. Technically aligned, humanly weird.</p>
<p>Fixing that hurt because it meant walking several steps back. Instead of respecting the cue structure of the SRT, I had to go back to my original scripts and rebuild chunking around the paragraphs I had actually read when recording the videos. Almost every Akitando episode has a matching blog post with a <code>## Script</code> section, and that is what saved the project. Because I had been meticulous back when I made those videos — writing the scripts first, reading exactly that on camera, and archiving both the videos and the scripts — I had a source of truth. That made it possible to map the script paragraphs onto the timings in the <code>.pt-BR.srt</code>, regroup the <code>.en.srt</code> on top of that, and regenerate the chunks in the right structure.</p>
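<p>The regrouping step can be sketched as a greedy alignment: walk the cues, accumulate text until it roughly covers the next script paragraph, and emit one TTS chunk per paragraph. Real matching needs fuzzier normalization than this, but the shape is the point:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import re

def normalize(s: str) -&gt; str:
    return re.sub(r"\s+", " ", s).strip().lower()

def group_cues_by_paragraph(cues, paragraphs):
    """Greedy alignment of SRT cues onto script paragraphs: each
    paragraph becomes one TTS chunk, spanning from the start of its
    first cue to the end of its last."""
    chunks, i = [], 0
    for para in paragraphs:
        target_len = len(normalize(para))
        buf, start = "", None
        while i &lt; len(cues) and len(normalize(buf)) &lt; target_len:
            if start is None:
                start = cues[i].start
            buf += " " + cues[i].text
            i += 1
        chunks.append({"start": start, "end": cues[i - 1].end, "text": para})
    return chunks
</code></pre></div>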
<p>The problem is that this destroys the previous cache. Even when the text barely changes, if the chunk boundary changes then the TTS call changes too: different context, different prosody, different beginning, different ending. So this was not a touch-up. It was a full regeneration of everything again. That was the point where the cost started escaping my original estimate.</p>
<p>And just when I thought now it really was over, the third hit landed: external clips. Many videos on the channel include inserted footage from other sources. The Ruby on Rails episode, for example, has 37signals clips, Apple material, commercials, external inserts in the middle of the narrative. YouTube&rsquo;s <code>.srt</code> did not know what to do with that. In some places it simply hallucinated text, in others the timings were completely wrong, and in others the original speech had nothing to do with the subtitle I was using as ground truth. Again: if you only look at the automation, it looks fine; if you actually listen to the final master, it isn&rsquo;t.</p>
<p>The fix, again, was to leave the subtitle behind and go back to the original material. I built an external-clip window detector based on alignment gaps between the transcript, the <code>.pt-BR.srt</code>, speechless cues, and chunks that crossed those windows. Then, instead of trying to dub what made no sense to dub, the process started recovering the original audio from the source video exactly in those stretches and filling the master with the correct source audio there. That&rsquo;s what fixed the 37signals, Apple Education, and Mac vs PC commercial inserts, plus the other segments I had forgotten were even there.</p>
<p>That was the biggest lesson from the second half of this project: automation takes you very far, but the full archive always has more dark corners than you imagine on the first pass.</p>
<h2>The dubbing batch numbers (and why it hurt more than I expected)<span class="hx:absolute hx:-mt-20" id="the-dubbing-batch-numbers-and-why-it-hurt-more-than-i-expected"></span>
    <a href="#the-dubbing-batch-numbers-and-why-it-hurt-more-than-i-expected" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>I need to correct myself again. I didn&rsquo;t just miscalculate the first bill. I also overestimated how confident I should have been after the first full run.</p>
<p>When I upgraded to ElevenLabs Business, I thought I would finally have plenty of headroom. In practice, I didn&rsquo;t. The latest dashboard screenshot shows <strong>10,046,279 credits consumed out of a limit of 11,000,000</strong>, leaving <strong>953,721</strong>:</p>
<p><img src="/2026/04/09/como-a-elevenlabs-nao-foi-morta-pelo-qwen3-tts/elevenlabs-credits-dashboard.png" alt="ElevenLabs dashboard showing 10,046,279 credits consumed out of 11,000,000"  loading="lazy" /></p>
<p>In plain English: I burned through more than 91% of the Business workspace limit. And that helps explain what &ldquo;10 million credits&rdquo; really means in this context. It does not mean &ldquo;I rendered 146 episodes once&rdquo;. It means rendering 146 episodes, discovering a structural problem, going back, fixing it, regenerating chunks, discovering another structural problem, going back again, rebuilding masters, auditing jingles, repairing external clip windows, uploading again. Every extra round costs money. And when the change invalidates the cache, it costs a lot.</p>
<p>The cost also doesn&rsquo;t stop at ElevenLabs. All the subtitle curation, problematic cue rewrites, truncated SRT reconstruction, jingle audits, external clip detector work and the rest ran through Claude Code on the Claude Max 20× plan. Combined with the mass blog translation to English that I described in the <a href="/en/2026/04/09/20-years-of-blogging-ai-finally-translated-everything/">anniversary post</a>, the Claude Max 20× weekly limit hit 100%, with more than BRL 300 in extra usage on top. If you think generative AI becomes free once you&rsquo;ve signed up for the plan, the invoices disagree.</p>
<p>Even so, I still think it&rsquo;s worth it. We&rsquo;re talking about 146 episodes, almost 96 hours of technical content, more than 5 years of channel archive. Doing that manually in a studio, with an actor, direction, pickups and human review on everything, would cost a completely different order of magnitude.</p>
<p>What changed was how I think about the project. At the start I was looking at this as &ldquo;one very large batch&rdquo;. In the end it was something else. It was digging through a whole archive, finding hidden exceptions, fixing old rot, rebuilding masters, rerendering, relistening.</p>
<p>The first full batch took a little under two days and made me think I had solved everything. I hadn&rsquo;t. The second pass, with paragraph-aware chunking, was already expensive. The third, with jingle repairs and mastered-output fixes, cost more again. And the fourth hit came from the external clips I had not modeled correctly in the first automation pass.</p>
<p>But now, finally, I think it&rsquo;s done. Or rather: I hope it&rsquo;s done. Everything has been uploaded again, for the last time. It isn&rsquo;t perfect in the absolute sense of the word. It&rsquo;s as good as I can get it with scripts, audits and automation, without descending into manual-intervention hell episode by episode.</p>
<p>And if all of this was possible, it&rsquo;s because years ago I had the discipline to produce the videos the right way: write first, read the script when recording, and archive the materials afterward. Without the original videos and without the original scripts, I would have had no way to discover where the <code>.srt</code> was lying, where the timing was wrong, where the external clip started, where the original paragraph began and ended. That old organization is what made it possible to repair the project now.</p>
<p>So yes: the dub for all 146 episodes is done. Not perfect. But closed. At least, I sincerely hope this time it really is.</p>
<h2>The first test, watch it<span class="hx:absolute hx:-mt-20" id="the-first-test-watch-it"></span>
    <a href="#the-first-test-watch-it" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>I made a test video to prove the thing works. Two caveats up front: first, the embed below already loads with English captions on by default, so that side is solvable via URL. Second, the audio is annoying: YouTube removed the audio track selector from embedded players back around March of this year. So inside the embed you won&rsquo;t find that option in the gear menu at all, it just plays in Portuguese. To actually hear the English dub, click <strong><a href="https://www.youtube.com/watch?v=QNLd8TZ_JQc"target="_blank" rel="noopener">watch directly on YouTube</a></strong> — the main YouTube watch page still has the audio track switcher in the gear menu.</p>
<div class="embed-container">
  <iframe
    src="https://www.youtube.com/embed/QNLd8TZ_JQc?cc_load_policy=1&amp;cc_lang_pref=en&amp;hl=en"
    title="Akitando dubbed in English (test)"
    frameborder="0"
    allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share"
    referrerpolicy="strict-origin-when-cross-origin"
    allowfullscreen>
  </iframe>
</div>
<p>If you&rsquo;re used to YouTube&rsquo;s automatic dubbing or to the AI dubbing on TikTok, compare them. The difference is glaring. The voice is mine, with a worked-over American accent, and the sync with the original video stays within 1 to 2% cumulative drift, practically imperceptible over 95 minutes of video.</p>
<h2>And the missing piece: translating the thumbnails<span class="hx:absolute hx:-mt-20" id="and-the-missing-piece-translating-the-thumbnails"></span>
    <a href="#and-the-missing-piece-translating-the-thumbnails" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>While I was closing this post, the penny dropped: I&rsquo;d left a loose end. All 146 thumbnails on my channel are in Portuguese, many of them with big all-caps titles like &ldquo;9 DICAS PARA PALESTRANTES&rdquo; or &ldquo;7 RECOMENDAÇÕES DE SHOWS PARA PESSOAS DE TECH&rdquo;. Perfect English audio doesn&rsquo;t help if the image in the search results is an illegible block of text to anyone who doesn&rsquo;t read Portuguese. YouTube lets you upload an alternative thumbnail per language, so I needed a way to generate English versions, keeping the rest of the art identical and swapping only the text.</p>
<p>The tool uses two pieces. <strong><code>yt-dlp</code></strong> is the modern fork of the old <code>youtube-dl</code>, a Python CLI that downloads anything from YouTube (videos, audio, subtitles, thumbnails) without needing an API key. <strong>Nano Banana Pro</strong> (<code>nano-banana-pro-preview</code> on the API) is Google Gemini&rsquo;s latest image-editing model — you feed it an input image with a prompt and it returns an edited image, preserving the rest of the composition when you ask it to touch only a specific part.</p>
<p>The pipeline has two steps. Step 1: <code>yt-dlp</code> grabs the thumbnail as a <code>.jpg</code>, no video download:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">yt-dlp --skip-download <span class="se">\
</span></span></span><span class="line"><span class="cl">       --write-thumbnail <span class="se">\
</span></span></span><span class="line"><span class="cl">       --convert-thumbnails jpg <span class="se">\
</span></span></span><span class="line"><span class="cl">       -o <span class="s2">&#34;thumbnails/originals/&lt;slug&gt;/&lt;video_id&gt;&#34;</span> <span class="se">\
</span></span></span><span class="line"><span class="cl">       <span class="s2">&#34;https://www.youtube.com/watch?v=&lt;video_id&gt;&#34;</span></span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>Step 2: send each image to Gemini Nano Banana Pro with a prompt that&rsquo;s a rigid contract of hard requirements, not just a &ldquo;translate this&rdquo;:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">TASK:
</span></span><span class="line"><span class="cl">Detect every piece of Portuguese text visible on this image and
</span></span><span class="line"><span class="cl">translate it to clear, natural American English. Replace the
</span></span><span class="line"><span class="cl">Portuguese text with the English translation in the same visual
</span></span><span class="line"><span class="cl">position.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">STRICT REQUIREMENTS — follow every one:
</span></span><span class="line"><span class="cl">- Preserve EVERY non-text element identically: face, pose,
</span></span><span class="line"><span class="cl">  expression, background, color palette, lighting, icons, logos,
</span></span><span class="line"><span class="cl">  decorative shapes, borders, layout. Only the text changes.
</span></span><span class="line"><span class="cl">- The English translation must be IDIOMATIC and CONFIDENT — not
</span></span><span class="line"><span class="cl">  a literal word-for-word rewrite. It&#39;s a YouTube thumbnail for a
</span></span><span class="line"><span class="cl">  tech audience, so use punchy phrasing a native English-speaking
</span></span><span class="line"><span class="cl">  tech YouTuber would write.
</span></span><span class="line"><span class="cl">- Match the original text&#39;s font family, weight, size, color,
</span></span><span class="line"><span class="cl">  stroke outline, drop shadow, and any decorative treatment.
</span></span><span class="line"><span class="cl">- If there is no Portuguese text at all on the image, return the
</span></span><span class="line"><span class="cl">  original image unchanged.</span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>The &ldquo;return the original image unchanged&rdquo; clause is what saves the earliest episodes, which don&rsquo;t have overlay text, just my face. Without it, the model would invariably try to &ldquo;improve&rdquo; the composition, swap the lighting, and so on.</p>
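<p>For completeness, the shape of the call, assuming the <code>google-genai</code> Python SDK. The model id is the preview name mentioned above; treat the rest as a sketch of the mechanics rather than the exact script:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">from google import genai
from PIL import Image

client = genai.Client()  # reads GEMINI_API_KEY from the environment

def translate_thumbnail(jpg_path: str, prompt: str, out_path: str) -&gt; None:
    """Send the original thumbnail plus the strict prompt; save the
    edited image that comes back."""
    response = client.models.generate_content(
        model="nano-banana-pro-preview",
        contents=[prompt, Image.open(jpg_path)],
    )
    for part in response.candidates[0].content.parts:
        if part.inline_data:  # the edited image, as raw bytes
            with open(out_path, "wb") as f:
                f.write(part.inline_data.data)
</code></pre></div>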
<p>Here&rsquo;s the result on two examples (episodes 10 and 11):</p>
<div style="display: grid; grid-template-columns: 1fr 1fr; gap: 12px; margin: 16px 0;">
<div><strong>Original (PT)</strong><br><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/20260409_thumb_ep010_pt.jpg" alt="Original Portuguese thumbnail: 7 Recomendações de Shows para pessoas de Tech" style="width: 100%;"></div>
<div><strong>Translated (EN)</strong><br><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/20260409_thumb_ep010_en.jpg" alt="Translated English thumbnail: 7 TV Shows You Must Watch If You're In Tech" style="width: 100%;"></div>
</div>
<div style="display: grid; grid-template-columns: 1fr 1fr; gap: 12px; margin: 16px 0;">
<div><strong>Original (PT)</strong><br><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/20260409_thumb_ep011_pt.jpg" alt="Original Portuguese thumbnail: 9 Dicas para Palestrantes: venda sua caneta" style="width: 100%;"></div>
<div><strong>Translated (EN)</strong><br><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/20260409_thumb_ep011_en.jpg" alt="Translated English thumbnail: 9 Tips for Speakers: sell your pen" style="width: 100%;"></div>
</div>
<p>Notice the detail on episode 10. The model didn&rsquo;t translate &ldquo;7 Recomendações de Shows para Pessoas de Tech&rdquo; literally into &ldquo;7 Show Recommendations for Tech People&rdquo; the way Google Translate would. It rewrote it as &ldquo;7 TV Shows You Must Watch If You&rsquo;re In Tech&rdquo;, the way an actual American tech YouTuber would headline the video. The rest of the image stays pixel-for-pixel identical. On episode 11, &ldquo;9 DICAS PARA PALESTRANTES: venda sua caneta&rdquo; becomes &ldquo;9 TIPS FOR SPEAKERS: sell your pen&rdquo;, with the font, all-caps treatment and position preserved. If you didn&rsquo;t know the original was in Portuguese, you couldn&rsquo;t guess the second image is an automatic AI edit.</p>
<p>Cost on the Gemini side: a few cents per image, a handful of dollars total for the whole batch. Trivial next to the $1,500+ on the audio dub. With the thumbnail sorted, the channel&rsquo;s English conversion is finally buttoned up — ElevenLabs v3 cloning the voice over hand-curated English <code>.srt</code> files, and Nano Banana Pro editing the image so the title matches the audio.</p>
<h2>The conclusion, which is the title of this post<span class="hx:absolute hx:-mt-20" id="the-conclusion-which-is-the-title-of-this-post"></span>
    <a href="#the-conclusion-which-is-the-title-of-this-post" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>When Qwen3 TTS dropped, the hype was calling it an &ldquo;ElevenLabs killer&rdquo;. I spent weeks trying to live with that premise in practice, with real content shipping every week. And what I found is that open source TTS is still miles behind ElevenLabs. The gap is large, and it&rsquo;s a gap that shows up precisely when you step off the 5-second demo tweet and put the model to work on a real 30-minute weekly podcast.</p>
<p>In practice, Qwen3 doesn&rsquo;t even beat ElevenLabs&rsquo; v2 model. v3, which is the current one, is still a step above v2. The prosody comes out better, the inline emotion tags work in Portuguese and English without effort, and the API stays up without you maintaining a server. The cost per character is a bit higher, but for my current volume it sits comfortably within budget and gives me back the hours I was spending on GPU babysitting.</p>
<p>The lesson here is the same one the LLM crowd is learning slowly. A 30-second demo tweet is one thing, production is a completely different thing. Open source has its niche, mainly when you have sensitive data that can&rsquo;t leave the house, or when you have tons of idle GPU and little recurring budget. But for serious TTS use in a commercial product, here in April 2026, ElevenLabs is still untouchable. Qwen3 didn&rsquo;t kill anyone.</p>
<p>And don&rsquo;t forget: subscribe to <a href="https://open.spotify.com/show/7MzG2UB7IAkC3GAwEXEIVD"target="_blank" rel="noopener">The M.Akita Chronicles on Spotify</a> so you don&rsquo;t miss new episodes like this one, made with the new pipeline.</p>
]]></content:encoded><category>ai</category><category>llm</category><category>tts</category><category>elevenlabs</category><category>qwen</category><category>themakitachronicles</category></item><item><title>20 Years of Blogging: Translating Everything to English</title><link>https://akitaonrails.github.io/en/2026/04/09/20-years-of-blogging-ai-finally-translated-everything/</link><guid isPermaLink="true">https://akitaonrails.github.io/en/2026/04/09/20-years-of-blogging-ai-finally-translated-everything/</guid><pubDate>Thu, 09 Apr 2026 08:00:00 GMT</pubDate><description>&lt;p&gt;&lt;img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/20260409_20years_blog_celebration.png" alt="Big bold glowing “20 ANOS” in neon-tech style, with champagne flutes, sparklers, confetti, and silhouettes of a CRT monitor and code window in the background, purple palette with cyan and magenta accents" loading="lazy" /&gt;&lt;/p&gt;
&lt;p&gt;Four days ago, April 5th 2026, my blog turned 20. Yeah, I&amp;rsquo;m late on the anniversary post. There&amp;rsquo;s a reason, and that reason is what this piece is about.&lt;/p&gt;
&lt;p&gt;I started in 2006 on Google&amp;rsquo;s Blogspot, like most people did back then. Later I migrated to a Rails 2.0 CMS that already existed, went through Typo3, and in 2012 I &lt;a href="https://akitaonrails.github.io/2025/09/10/meu-novo-blog-como-eu-fiz/"&gt;built my own engine from scratch on ActiveAdmin&lt;/a&gt;, which I dragged along from Rails 3 all the way to Rails 7. Only recently, in September 2025, I finally abandoned that custom engine and moved to &lt;a href="https://akitaonrails.github.io/2025/09/10/meu-novo-blog-como-eu-fiz/"&gt;Hugo with the Hextra theme&lt;/a&gt;, which is what this post is running on. Twenty years carrying the same posts across Textile, migrating from Less to Sass, trading Liquid for Markdown, and trimming obsolete junk along the way.&lt;/p&gt;</description><content:encoded><![CDATA[<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/20260409_20years_blog_celebration.png" alt="Big bold glowing “20 ANOS” in neon-tech style, with champagne flutes, sparklers, confetti, and silhouettes of a CRT monitor and code window in the background, purple palette with cyan and magenta accents"  loading="lazy" /></p>
<p>Four days ago, April 5th 2026, my blog turned 20. Yeah, I&rsquo;m late on the anniversary post. There&rsquo;s a reason, and that reason is what this piece is about.</p>
<p>I started in 2006 on Google&rsquo;s Blogspot, like most people did back then. Later I migrated to a Rails 2.0 CMS that already existed, went through Typo3, and in 2012 I <a href="/2025/09/10/meu-novo-blog-como-eu-fiz/">built my own engine from scratch on ActiveAdmin</a>, which I dragged along from Rails 3 all the way to Rails 7. Only recently, in September 2025, I finally abandoned that custom engine and moved to <a href="/2025/09/10/meu-novo-blog-como-eu-fiz/">Hugo with the Hextra theme</a>, which is what this post is running on. Twenty years carrying the same posts across Textile, migrating from Less to Sass, trading Liquid for Markdown, and trimming obsolete junk along the way.</p>
<h2>Two decades, five eras<span class="hx:absolute hx:-mt-20" id="two-decades-five-eras"></span>
    <a href="#two-decades-five-eras" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Something few programmers stop to think about is how much the tech world changes in 20 years. I talked about this at length on <a href="/2019/01/30/akitando-37-a-dimensao-do-tempo-para-iniciantes-em-programacao-serie-comecando-aos-40/">Akitando #37 — The Dimension of Time</a>. When I opened this blog in April 2006, I already had 10 years of experience behind me. I&rsquo;d watched the dot-com bubble rise and blow up in 2000. And from that point on I saw a few more big ones: the rise of social networks (Orkut, Facebook, Twitter), the smartphone and mobile app revolution in 2008, the 2008 financial crisis, Bitcoin being born in 2009, the cloud and SaaS era, and now the generative AI era.</p>
<p>I watched those waves crash for real. Each one completely changed how we work and who survives professionally. This is the kind of thing you only see from a distance, after you&rsquo;ve stacked up a few turns of the wheel.</p>
<p>My own career went through some violent pivots. I told the whole story on <a href="/2019/09/12/akitando-61-meus-primeiros-5-anos-1990-1995/">My First 5 Years (1990-1995)</a>, but the short version: I started in multimedia agencies, moved into <a href="/2018/12/26/akitando-34-voce-nao-sabe-nada-de-enterprise-conhecendo-a-sap">enterprise consulting working on SAP</a>, dropped all of it in 2006 to jump into Ruby on Rails and open source, spent a decade running events — mainly Rubyconf Brasil from 2007 to 2018 — then shut the conference down and started the <a href="https://www.youtube.com/@Akitando"target="_blank" rel="noopener">Akitando YouTube channel</a>, which grew past 500k subscribers. In the middle of all that there was the pandemic that rearranged everyone&rsquo;s life. And since the beginning of 2025 I&rsquo;ve turned into a full-time AI researcher and user, running models locally and breaking things with Claude Code until they work.</p>
<p>Checking the <a href="/en/tags/ai/">/en/tags/ai/</a> tag, I wrote 51 AI-related posts between 2025 and 2026 alone — 19 in 2025 and 32 more in 2026, which has barely started. Between 2018 and 2024, the blog was mostly the transcript archive for my Akitando videos. I&rsquo;d record the video, transcribe it, dump it on the blog, and people would read it there. But once I started writing actively again, I noticed there was a loyal audience that never went anywhere, even through the quiet years. That feedback is what made me start <a href="/en/tags/themakitachronicles/">The M.Akita Chronicles</a>, a more personal series about the behind-the-scenes of recent projects.</p>
<h2>The problem I never solved<span class="hx:absolute hx:-mt-20" id="the-problem-i-never-solved"></span>
    <a href="#the-problem-i-never-solved" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>For those 20 years, one thing kept nagging at me. Since everything was in Portuguese, it was inaccessible to anyone who didn&rsquo;t read the language. I&rsquo;d been getting messages for years from Brazilian readers living abroad (Portugal, US, Japan, Germany, Canada) asking for some English version they could share with their gringo coworkers. Stuff like &ldquo;look, I&rsquo;d show this piece to my team, but it&rsquo;s only in Portuguese.&rdquo;</p>
<p>I always said &ldquo;one day I&rsquo;ll translate it.&rdquo; And I never did. Why? Because we&rsquo;re talking hundreds of posts. I spent the last couple of days looking at the numbers: 727 Portuguese <code>index.md</code> files in the repo. Translating by hand, one by one, would take weeks if not months of dedicated work. I was never going to find the stamina for that. And every year that passed, more posts piled onto the queue.</p>
<p>The ironic part is that some blog posts were originally born in English. During 2017 and 2018 I had decided to write English-first, trying to reach an international audience. A whole interview series — the &ldquo;<a href="/en/2008/01/09/chatting-with-hal-fulton/">chatting-with</a>&rdquo; ones with folks like Hal Fulton, Scott Hanselman, Chris Wanstrath (GitHub), Blaine Cook (former Twitter), Adam Jacob (Chef) — were born in English and stayed there. The missing piece was the reverse: take the Portuguese posts and translate them the other way.</p>
<h2>The weekend that paid off 20 years of technical debt<span class="hx:absolute hx:-mt-20" id="the-weekend-that-paid-off-20-years-of-technical-debt"></span>
    <a href="#the-weekend-that-paid-off-20-years-of-technical-debt" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>This past Monday I opened Claude Code inside the blog directory. And asked it to translate everything to English. That was literally it.</p>
<p>It kicked off Monday, April 6th, around 6:30pm. I let it run. Went to sleep. Woke up Tuesday, it kept running. Slept again. Wednesday morning it was done. Counting from the first translation commit to the last, it was something like 39 hours of elapsed time. Not continuous, of course — there were nights, lunches, a coffee break or two. In practice, one long working weekend.</p>
<p>The result is sitting in the git repo, in commits anyone can inspect on the <a href="https://github.com/akitaonrails/akitaonrails.github.io"target="_blank" rel="noopener">public blog repository</a>. Scan the log and you can see the cadence: batch of 2008 QCon posts. Batch of 2009 RailsConf. Batch of 2011 Objective-C. All of 2012. The 2015 Elixir series. All of 2016, 2017, 2018. Batch after batch, organized by year and by series. More than 80 commits tagged <code>i18n:</code> or <code>EN translation:</code> between the night of the 6th and the morning of the 8th. By the time I&rsquo;m writing this post, 354 <code>index.en.md</code> files are live against the 727 original <code>index.md</code>. Almost half the blog translated in one shot.</p>
<p>And look, ninety percent of it was smooth sailing. The translation came out good, natural, faithful to my voice in the original — I honestly didn&rsquo;t expect it to work this well. Claude Code respects the tone of the source if you give it solid voice and style guidance, and it respects technical ideas without &ldquo;correcting&rdquo; anything you wrote. Worst case, you review a few paragraphs. Best case, you read the English version and it sounds like you wrote it directly in the other language.</p>
<p>A quick aside on the Claude side of the bill. I&rsquo;m a Max 20x subscriber on Anthropic, which is the heavy-use tier built for people who hammer Claude Code all day. I&rsquo;d never hit the ceiling of that plan before, not even in my most intense vibe coding sessions. This translation weekend was the first time I genuinely blew past the Max 20x limit and kept going into the &ldquo;extra usage&rdquo; mode (Anthropic charges on top when you go over the monthly plan ceiling).</p>
<p>For a sense of the damage, here&rsquo;s my account dashboard this morning:</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/20260409_claude_usage_max20x.png" alt="Max 20x plan with 91% of the weekly ‘all models’ quota used up and R$280.88 already spent on extra usage"  loading="lazy" /></p>
<p><strong>91% of the weekly &ldquo;all models&rdquo; quota burned through</strong>, already into the extra-usage mode, R$280.88 (about $50 USD) spent on top of the flat monthly subscription, and there&rsquo;s still a good chunk of the cycle left. It tracks: read the full post, generate the translation, run the humanizer on it, repeat hundreds of times in a row. The token volume is massive. And today, while I&rsquo;m writing this piece, Claude Code keeps hitting me with <code>API Error: Request rejected (429) · Rate limited</code> more and more often, which also tracks: probably some combo of my own plan quota topping out and Anthropic applying general backpressure, because I can&rsquo;t be the only one going nuclear with the tool these days. Fine, whatever. That&rsquo;s the price of using the tool hard when it actually mattered.</p>
<p>The other ten percent was a different story.</p>
<h2>When commercial LLMs say &ldquo;no&rdquo;<span class="hx:absolute hx:-mt-20" id="when-commercial-llms-say-no"></span>
    <a href="#when-commercial-llms-say-no" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Half a dozen older posts flat out refused to translate. Both Claude and GPT, through their APIs, would bounce with a 400 error and a content policy message. I tried multiple times, fresh sessions, clean context. Nothing.</p>
<p>The hypothesis is simple: those posts touched topics that automated content moderation flagged. A 2009 post about Steve Jobs&rsquo;s 2005 Stanford commencement speech (which mentions cancer — Jobs was talking about the terminal diagnosis he&rsquo;d received). A couple of old Ayn Rand chapters I translated years ago about rights of man and argument by intimidation. An anti-Nazi post that ironically got blocked, probably just because the word &ldquo;nazi&rdquo; showed up, even though the context was critical. A post about the money speech from Atlas Shrugged. And a democracy-and-ethics essay that systematically broke every attempt.</p>
<p>The ironic part here is that the only way to get those posts translated was using an open source model. I loaded up <a href="https://qwen.ai"target="_blank" rel="noopener">Qwen 3.5 35B</a> in llama-swap, running locally, no corporate policy filters. The model read them, understood the context, and translated everything without drama. It&rsquo;s the same model I <a href="/en/2026/04/05/testing-llms-open-source-and-commercial-can-anyone-beat-claude-opus/">tested extensively in my last LLM benchmark</a>, and which I rate as one of the best open source models available right now.</p>
<p>So yeah — a Chinese model translated posts that Western models refused. I can&rsquo;t help finding this slightly hilarious. Oh, and of course, I&rsquo;m not allowed to speak ill of China (sarcasm). Commercial LLMs will always have the corporate-policy-and-preemptive-censorship problem applied with a fairly thick ruler. It&rsquo;s a real trade-off for anyone using them in production: you get fluency and reasoning power, you lose control the moment the topic brushes up against whatever the company decided is sensitive.</p>
<h2>The reverse case: 2017-2018 translated back into Portuguese<span class="hx:absolute hx:-mt-20" id="the-reverse-case-2017-2018-translated-back-into-portuguese"></span>
    <a href="#the-reverse-case-2017-2018-translated-back-into-portuguese" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>While I was dialing in Claude Code to generate the English content, I figured I might as well solve the symmetric problem. The posts I had originally written in English during 2017 and 2018, plus the whole &ldquo;chatting-with&rdquo; interview series from 2008, were sitting there without a Portuguese version. Claude Code ran the reverse: read the English, wrote the Portuguese. So if you never read the English originals (since most of my audience is Brazilian anyway), now you have access too. Take a look at the <a href="/en/off-topic/">Off-Topic section</a> to see what showed up that&rsquo;s new.</p>
<p>On top of that, Claude Code also updated my <code>generate_index.rb</code> script so it understands the blog&rsquo;s bilingual structure and generates two separate indexes, one in Portuguese and one in English. The PT/EN toggle in each post&rsquo;s footer shows up automatically whenever an <code>index.en.md</code> sibling exists. All nicely plugged into Hugo, using its native multilingual pattern.</p>
<h2>The bigger point<span class="hx:absolute hx:-mt-20" id="the-bigger-point"></span>
    <a href="#the-bigger-point" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Here&rsquo;s the takeaway I wanted to leave in this anniversary post. Translation at scale, hundreds of articles, your content, in your voice, respecting the tone of the original, used to be an expensive and miserable problem. Now it&rsquo;s a weekend problem. This is already live — you&rsquo;re reading the result. The barrier that existed to reaching audiences outside Brazil became cheap enough that I don&rsquo;t have an excuse anymore.</p>
<p>Alright, so let&rsquo;s wrap the anniversary recap. 20 years. 727 Portuguese posts, built across five different eras of technology. 354 new English posts generated over a weekend. Half the blog is now bilingual. And the rest will follow. What&rsquo;s missing is mostly old ActiveAdmin posts, Rails 2 stuff, and obsolete tips that honestly deserve to be deleted or left untranslated on purpose.</p>
<p>If you&rsquo;ve got a gringo friend who&rsquo;s tired of seeing you post Portuguese links and needs an excuse to send them something of mine — send it. The main AI posts, the <a href="/en/tags/themakitachronicles/">Chronicles</a>, benchmarks, rants, and a good chunk of the historical archive are already available in English at the same domain, just flip the toggle in the footer. Thank you to everyone who&rsquo;s followed this blog for these twenty years. There are readers here who&rsquo;ve known me since the Blogspot days. You know who you are.</p>
<p>Here&rsquo;s to another year.</p>
]]></content:encoded><category>blog</category><category>ai</category><category>llm</category><category>themakitachronicles</category><category>off-topic</category></item><item><title>Is RAG Dead? Long Context, Grep, and the End of the Mandatory Vector DB</title><link>https://akitaonrails.github.io/en/2026/04/06/rag-is-dead-long-context/</link><guid isPermaLink="true">https://akitaonrails.github.io/en/2026/04/06/rag-is-dead-long-context/</guid><pubDate>Mon, 06 Apr 2026 11:00:00 GMT</pubDate><description>&lt;p&gt;This is one of those itches I can&amp;rsquo;t scratch. Back in the early LLM days, around 2022/2023, we had 4k of context on GPT 3.5, 8k if you were lucky, 32k was a luxury. To do anything with a real document you had no choice: chop the text into pieces, generate embeddings, throw them in a vector database, do similarity search, grab the top-5 chunks, and pray the right ones came back.&lt;/p&gt;</description><content:encoded><![CDATA[<p>This is one of those itches I can&rsquo;t scratch. Back in the early LLM days, around 2022/2023, we had 4k of context on GPT 3.5, 8k if you were lucky, 32k was a luxury. To do anything with a real document you had no choice: chop the text into pieces, generate embeddings, throw them in a vector database, do similarity search, grab the top-5 chunks, and pray the right ones came back.</p>
<p>Then it became an industry. Pinecone, Weaviate, Qdrant, Chroma, Milvus, pgvector, LangChain, LlamaIndex, Haystack. Tutorials everywhere, &ldquo;build your chatbot with your PDFs,&rdquo; entire consultancies feeding off this. It became the &ldquo;hello world&rdquo; of applied LLMs: document → chunk → embed → vector DB → query.</p>
<p>Today, in April 2026, Claude Opus 4.6 has 1 million tokens of context. Sonnet 4.6 too. Gemini 3.1 Pro too. GPT 5.4 has a smaller window but still in the comfortable range, in the hundreds of thousands. And some models already have experimental 2M token modes. The question that keeps nagging at me: what on earth do I need a vector stack for, to solve a problem that fits inside the model&rsquo;s window?</p>
<p>And there&rsquo;s more: vector databases have real problems nobody wants to talk about. False neighbors. Arbitrary chunking that splits a definition from its usage. Embeddings that age badly. Not to mention that when the result is wrong, you have absolutely no idea why.</p>
<p>The thesis I&rsquo;ve been chewing on is simple: in most cases, a well-aimed <code>grep</code> plus a generous context window beats a full RAG stack. It&rsquo;s cheaper, it&rsquo;s easier to maintain, and when it breaks you can actually debug it. Let&rsquo;s break this down.</p>
<h2>What the Claude Code leak showed<span class="hx:absolute hx:-mt-20" id="what-the-claude-code-leak-showed"></span>
    <a href="#what-the-claude-code-leak-showed" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Before we get into the theory, let me bring up something that happened a few days ago that backs this whole argument up. On March 31, 2026, Anthropic, by accident, published version 2.1.88 of the <code>@anthropic-ai/claude-code</code> package on npm with a nearly 60 MB source map attached, and roughly 512,000 lines of TypeScript from their internal tool leaked into the wild. I already <a href="/en/2026/03/31/claude-code-source-code-leaked-what-we-found-inside/">wrote about the incident last week</a>, with more detail on what showed up in the code.</p>
<p>The part that matters for this discussion is Claude Code&rsquo;s memory system. Instead of dumping everything into a vector DB, the architecture has three layers. There&rsquo;s a <code>MEMORY.md</code> that stays permanently loaded in context, but it doesn&rsquo;t hold any actual data: it&rsquo;s just an index of pointers, around 150 characters per line, kept under 200 lines and about 25 KB. The real facts live in &ldquo;topic files&rdquo; that get pulled on demand when the agent needs them. And the raw transcripts from previous sessions are never reloaded whole, only searched with grep, hunting for specific identifiers. No embedding. No Pinecone. Just write discipline (topic file first, index after) and lexical search. That&rsquo;s it.</p>
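<p>Translated into code, the discipline is almost embarrassingly simple. Here&rsquo;s a minimal Ruby sketch of the idea; this is my reconstruction, not the leaked source, and the paths and file names are made up:</p>
<pre><code class="language-ruby">require "shellwords"

# Sketch of the index-of-pointers pattern (NOT the leaked Claude Code source).
# MEMORY.md stays tiny and always loaded; the actual facts live in topic files.
MEMORY_DIR = File.expand_path("~/.agent/memory")

# Always-loaded index: one short pointer line per topic, e.g.
# "auth: token refresh gotchas -&gt; topics/auth.md"
def memory_index
  File.read(File.join(MEMORY_DIR, "MEMORY.md"))
end

# Topic files are pulled on demand, only when the agent decides it needs them.
def load_topic(name)
  path = File.join(MEMORY_DIR, "topics", "#{name}.md")
  File.exist?(path) ? File.read(path) : nil
end

# Old transcripts are never reloaded whole, only grepped for identifiers.
def search_transcripts(identifier)
  dir = File.join(MEMORY_DIR, "transcripts")
  `rg -i -l #{identifier.shellescape} #{dir.shellescape}`.split("\n")
end
</code></pre>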
<p>Claude Code&rsquo;s main loop also has a tiered system for handling a context that&rsquo;s filling up. As <a href="/en/2026/03/31/claude-code-source-code-leaked-what-we-found-inside/">I detailed in the previous post</a>, there are five different context compaction strategies, with names like <code>microcompact</code> (clears old tool results based on age), <code>context collapse</code> (summarizes long stretches of conversation), and <code>autocompact</code> (which fires when the context gets close to the limit). The CLAUDE.md file, which a lot of people thought was just a convention, is first-class in the architecture: the system re-reads it at every iteration of the query.</p>
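<p>None of these strategies is rocket science on its own. Just to make <code>microcompact</code> concrete, here&rsquo;s a hedged sketch of age-based pruning of tool results; the message shape and the function are my guess at the pattern, not Anthropic&rsquo;s actual code:</p>
<pre><code class="language-ruby"># Sketch of a microcompact-style pass. Assumes a history of hashes like
# { role: "tool", content: "...", tool_result: true }. Hypothetical shapes.
def microcompact(history, keep_recent: 10)
  history.each_with_index.map do |msg, i|
    age = history.size - i
    # Old tool results are the bulkiest, least useful part of the window:
    # stub them out, keep everything else intact.
    if msg[:tool_result] &amp;&amp; age &gt; keep_recent
      msg.merge(content: "[tool result elided by compaction]")
    else
      msg
    end
  end
end
</code></pre>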
<p>What this tells me: the best coding agent on the market right now, built by the company selling the most expensive model out there, <strong>does not use a vector DB</strong>. It uses files on disk, a markdown index, lexical search, and smart compaction strategies for when the context overflows. They could&rsquo;ve slapped embeddings on top, they have the money to run whatever they wanted, and they chose not to. The reason, in my reading, is exactly what this post is arguing: to retrieve text from files you control, with generous context available, a vector DB is dead weight. Better to invest in compacting the window you already have than indexing everything into an external store.</p>
<p>There&rsquo;s a curious security detail that came along with the leak: people noticed the compaction pipeline has a vulnerability they&rsquo;re calling &ldquo;context poisoning.&rdquo; Content that looks like an instruction, coming from a file the model reads (say, a CLAUDE.md from a cloned repo), can end up being preserved by the compaction model as if it were &ldquo;user feedback,&rdquo; and the next model takes that as a real user instruction. It&rsquo;s a new attack vector. But that&rsquo;s a topic for another post.</p>
<h3>The &ldquo;Dream&rdquo; system and memory consolidation<span class="hx:absolute hx:-mt-20" id="the-dream-system-and-memory-consolidation"></span>
    <a href="#the-dream-system-and-memory-consolidation" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>But what really caught my eye for the RAG debate, which I <a href="/en/2026/03/31/claude-code-source-code-leaked-what-we-found-inside/">unpacked in detail last week</a>, is the system called <code>autoDream</code>. It&rsquo;s a forked subagent, with read-only bash access to the project, that runs in the background while you&rsquo;re not using the tool. Its job is literally to dream: to consolidate memory. The name isn&rsquo;t accidental, and the obvious analogy (which I couldn&rsquo;t resist) is the human brain consolidating memory during sleep, turning short-term experience into something more stable.</p>
<p>For a dream to actually run, three gates have to open at once: 24 hours since the last dream, at least 5 sessions since the last dream, and a consolidation lock that prevents concurrent dreams. When it fires, it goes through four phases. Orient (does an <code>ls</code> on the memory directory, reads the index). Gather (looks for new signals in logs, stale memories, transcripts). Consolidate (writes or updates the topic files, converts relative dates into absolute ones, deletes facts that have been contradicted). And Prune, the final cleanup that keeps the index under 200 lines.</p>
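<p>The gating logic is trivial to express, which is part of the charm. A sketch: the 24h / 5 sessions / lock thresholds come from the leak, everything else here is my invention:</p>
<pre><code class="language-ruby">require "shellwords"

# Gates for a dream run. Thresholds are from the leak; names are hypothetical.
def dream_allowed?(state)
  (Time.now - state[:last_dream_at]) &gt; 24 * 3600 &amp;&amp;  # 24h since last dream
    state[:sessions_since_dream] &gt;= 5 &amp;&amp;             # at least 5 sessions
    !File.exist?(state[:lock_path])                   # no concurrent dream
end

# Gather phase: plain lexical search over the JSONL transcripts, no embeddings.
def gather_signals(transcripts_dir, identifiers)
  pattern = identifiers.map { |t| Regexp.escape(t) }.join("|")
  `rg -i -l -e #{pattern.shellescape} #{transcripts_dir.shellescape}`.split("\n")
end
</code></pre>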
<p>The decision to make <code>autoDream</code> a forked subagent is the detail that matters here. It does not run in the same loop as the main agent. Why? Because memory consolidation is a noisy process. The model has to re-read old transcripts, compare them against what&rsquo;s in <code>MEMORY.md</code>, decide what stays and what goes, form hypotheses about things it saw in earlier sessions. If that ran in the main context, it would pollute the &ldquo;train of thought&rdquo; of the agent that&rsquo;s trying to help you with your current task. By forking, you keep the two separate. The main agent stays focused on what you asked for, and <code>autoDream</code> does the housekeeping in parallel, with no write permission on the project.</p>
<p>And the way it figures out what needs to be consolidated is plain old lexical search. The transcripts live as JSONL files on disk, and <code>autoDream</code> uses grep to look for new signals. Just grep, on text logs. Stop and think about that for a second. The memory consolidation of the most advanced agent in the world, built by one of the richest AI companies out there, is a forked subagent running grep on text logs. If a vector DB were the right answer for this kind of problem, Anthropic would&rsquo;ve put a vector DB in there. They didn&rsquo;t.</p>
<p>And there&rsquo;s a detail that, to me, is the buried gold of the entire leak, and it fits this argument like a glove. In <code>autoDream</code>, memory is treated as a hint. The system assumes that what&rsquo;s stored may be stale, wrong, contradicted by something that happened later, and the model has to verify before it trusts it. The vector DB pitch is the opposite of that: index everything, search by similarity, return the top-k, trust the result. Claude Code went the conservative route. Index little, search by word, return a hint, and stay skeptical until you&rsquo;ve laid eyes on the actual fact.</p>
<p>The whole strategy works in two layers. Inside a single session: generous context plus grep plus smart compaction (<code>microcompact</code>, <code>context collapse</code>, <code>autocompact</code>). Between sessions: a subagent that consolidates memory asynchronously, using grep on the transcripts and treating the result as a tip, not as truth. Embeddings and vector DBs don&rsquo;t show up in either layer. The deliberate choice was a smart reader chewing on raw text, not a dumb reader being spoonfed the top-k of an embedding.</p>
<p>The practical lesson for our debate is simple. The most advanced agents on the market are heading toward generous context, lexical search, and smart compaction, not toward classic RAG pipelines. If Anthropic, with all the infrastructure and talent they&rsquo;ve got, picked this path for Claude Code, those of us building internal applications on a fraction of that budget should at least think about going the same way.</p>
<h2>Where the story started turning<span class="hx:absolute hx:-mt-20" id="where-the-story-started-turning"></span>
    <a href="#where-the-story-started-turning" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>When the ceiling was 32k of context, retrieval was the bottleneck of the entire problem. You had to pre-filter aggressively, because anything that made it into the window was sacred space. A vector DB was the only halfway-decent way to do that semantic pre-filtering. The logic was: &ldquo;the reader (LLM) is expensive and dumb, so the retriever has to be smart and selective.&rdquo;</p>
<p>Today the equation has flipped. The reader is now the smartest one at the table, and the window grew big enough to hold an entire document. So the retriever can (and maybe should) go back to being dumb. The dumber, the better. You want high recall and low precision, and you let the model do the fine work. Grep does exactly that. So does BM25. And ripgrep flies through millions of lines without breaking a sweat.</p>
<p>And this isn&rsquo;t just my hunch. The BEIR benchmarks have shown for a while now that BM25 matches or beats a lot of dense retrievers when the domain drifts away from where the embeddings were trained. Anthropic itself published a post on <a href="https://www.anthropic.com/news/contextual-retrieval"target="_blank" rel="noopener">Contextual Retrieval</a> that basically says the same thing: a lexical signal plus an LLM&rsquo;s judgment beats pure embeddings on most knowledge tasks. And take a look at Claude Code, the tool I&rsquo;ve been using every day for 500 hours: it navigates the repo with <code>Glob</code> and <code>Grep</code>. No vector DB, no embedding, no LangChain. It works ridiculously well.</p>
<h2>The real problems with vector databases nobody advertises<span class="hx:absolute hx:-mt-20" id="the-real-problems-with-vector-databases-nobody-advertises"></span>
    <a href="#the-real-problems-with-vector-databases-nobody-advertises" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>The vector DB marketing sells the dream of perfect semantic search. Reality is messier.</p>
<p>False neighbors come first. Cosine similarity rewards topical similarity, not relevance. You ask &ldquo;how do we handle authentication errors&rdquo; and the DB returns every chunk that mentions authentication. The chunk that actually answers the question may be in tenth place, or may not have been retrieved at all because the doc author wrote &ldquo;login&rdquo; instead of &ldquo;auth.&rdquo;</p>
<p>Chunking is the second one, and it&rsquo;s a disguised disaster. A 512-token window with a 64-token overlap sounds reasonable, until you realize your important table got cut in half, the function definition ended up separated from its usage, and the piece of documentation with the exact command got orphaned without the context of its section. The chunk boundary tends to land exactly where the answer was living.</p>
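<p>If that sounds abstract, it takes ten lines to demonstrate, as in the sketch below (the document text is invented for illustration):</p>
<pre><code class="language-ruby"># Demonstrating the chunk-boundary problem with a naive fixed-size splitter.
doc = &lt;&lt;~DOC
  charge! retries up to 3 times with exponential backoff.
  Never call it inside a database transaction.

  Example: invoice.charge!(idempotency_key: order.id)
DOC

# 80-char chunks, no overlap: the warning and the example that depends on it
# routinely land in different chunks, so a top-k retrieval can return one
# without the other.
chunks = doc.scan(/.{1,80}/m)
chunks.each_with_index { |c, i| puts "chunk #{i}: #{c.inspect}" }
</code></pre>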
<p>When it fails, it fails without leaving a trace. When BM25 misses, you know why: the word isn&rsquo;t there. When a vector DB returns garbage, you get a plausible-looking wrong chunk, with no diagnostic signal at all. Good luck debugging that in production at two in the morning.</p>
<p>The index gets stale. Every document update calls for re-embedding. If you have 10,000 docs and 200 of them change per day, that turns into a batch process, monitoring, a queue, retries, embedding API costs, and an unavoidable inconsistency window between what&rsquo;s on disk and what&rsquo;s in the index. Grep has none of that. File changed? The next query already sees it.</p>
<p>And there&rsquo;s the operating cost nobody adds up. Pinecone charges per vector. Weaviate wants a cluster to maintain. pgvector saves you a new server but you still own a schema, an index, and a re-embedding pipeline. Each of those things wants engineer time, monitoring, tests, deploys. All of that to do a search that <code>rg</code> would often crack in 200ms.</p>
<h2>Comparing the complexity<span class="hx:absolute hx:-mt-20" id="comparing-the-complexity"></span>
    <a href="#comparing-the-complexity" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Look at the diagram:</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/06/rag/rag-vs-grep-complexity.png" alt="Complexity: classic RAG vs grep &#43; long context"  loading="lazy" /></p>
<p>On one side, eight steps, four or five services, an external index that needs to be maintained and kept up to date. On the other, four steps, zero new infrastructure. This isn&rsquo;t a caricature: it is literally what you have to set up for each case.</p>
<p>The honest question: does the left column pay off? In 2023, yes, because the right column didn&rsquo;t exist (no LLM had a 200k window). In 2026, in most cases, it doesn&rsquo;t.</p>
<h2>Pros and cons of each side<span class="hx:absolute hx:-mt-20" id="pros-and-cons-of-each-side"></span>
    <a href="#pros-and-cons-of-each-side" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><h3>Classic RAG (vector DB)<span class="hx:absolute hx:-mt-20" id="classic-rag-vector-db"></span>
    <a href="#classic-rag-vector-db" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p><strong>For:</strong></p>
<ul>
<li>Works for huge document bases, on the order of hundreds of GB, where even <code>rg</code> won&rsquo;t cut it without prior indexing</li>
<li>Handles heavy paraphrase and cross-lingual queries (&ldquo;how do I cancel&rdquo; vs. &ldquo;subscription termination process&rdquo;) where the user&rsquo;s vocabulary doesn&rsquo;t match the document&rsquo;s</li>
<li>Works for non-textual modalities (image, audio) where grep has nothing to look at</li>
<li>Saves input tokens if you&rsquo;re tight on budget or absolute latency</li>
</ul>
<p><strong>Against:</strong></p>
<ul>
<li>Complex stack: embedding, vector DB, chunking, reranker, re-indexing pipeline</li>
<li>Opaque failures, hard to debug</li>
<li>Chunking destroys the context of tables, code, long definitions</li>
<li>Operational overhead (index, queue, monitoring, re-embedding cost)</li>
<li>The semantic search the marketing is selling rarely works the way the marketing promises</li>
</ul>
<h3>Grep + long context<span class="hx:absolute hx:-mt-20" id="grep--long-context"></span>
    <a href="#grep--long-context" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p><strong>For:</strong></p>
<ul>
<li>Practically zero new infrastructure: ripgrep, SQLite, or a plain <code>LIKE</code> in Postgres</li>
<li>Always fresh: file changes, the next query sees them</li>
<li>Transparent failures: the word is either there or it isn&rsquo;t</li>
<li>Loads the document in generous chunks, the model does the fine filtering with actual semantics</li>
<li>Cheaper in dev and ops, cheaper to pivot domains</li>
</ul>
<p><strong>Against:</strong></p>
<ul>
<li>Doesn&rsquo;t scale to terabytes of raw text without some kind of indexing</li>
<li>Suffers when the user&rsquo;s vocabulary is very different from the document&rsquo;s</li>
<li>Doesn&rsquo;t work for non-textual modalities</li>
<li>Per-query latency is higher in absolute terms (loading 100k tokens always costs more than loading 5k)</li>
<li>Per-query input cost is higher if you don&rsquo;t have prompt caching</li>
</ul>
<h2>But what about cost?<span class="hx:absolute hx:-mt-20" id="but-what-about-cost"></span>
    <a href="#but-what-about-cost" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>This is the argument I get hit with the most when I defend the &ldquo;load everything into context&rdquo; thesis. &ldquo;It&rsquo;ll get crazy expensive, 200k tokens of input per query is absurd.&rdquo; Let&rsquo;s actually run the numbers.</p>
<p>In <a href="/en/2026/04/05/testing-llms-open-source-and-commercial-can-anyone-beat-claude-opus/">yesterday&rsquo;s LLM benchmark post</a> I mapped out the per-token price of every model. Take Claude Sonnet 4.6: $3 per million input tokens, $15 per million output. Take GLM 5 (which I proved actually works): $0.60 input, $2.20 output. Take GPT 5.4 Pro at the top of the heap: $15 input, $180 output (yeah, that one stings, I know).</p>
<p>Before we turn &ldquo;200k tokens&rdquo; into dollars, let&rsquo;s land that number on something tangible, because a raw token count doesn&rsquo;t mean anything to anyone. A token, on average, is roughly 0.75 of a word in English (Portuguese is similar, maybe a touch heavier because of longer words). So, translating:</p>
<ul>
<li><strong>100k tokens</strong> ≈ 75,000 words ≈ a whole short novel like Hemingway&rsquo;s <em>The Old Man and the Sea</em> with room to spare, or about three long Wikipedia articles glued together.</li>
<li><strong>200k tokens</strong> ≈ 150,000 words ≈ a big novel, like <em>Crime and Punishment</em> in full, or half of the first <em>Game of Thrones</em> book (which clocks in around 298k words, so roughly 400k tokens).</li>
<li><strong>400k tokens</strong> ≈ 300,000 words ≈ <em>A Game of Thrones</em> in full, the entire first book of the series in your window.</li>
<li><strong>1M tokens</strong> ≈ 750,000 words ≈ the entire <em>Lord of the Rings</em> trilogy plus <em>The Hobbit</em>, or the whole Bible (King James is around 783k words, roughly 1M tokens), or about two and a half <em>Game of Thrones</em> books stacked on top of each other.</li>
</ul>
<p>So when I say &ldquo;throw 200k tokens of input at the model,&rdquo; what that actually means in the real world is &ldquo;throw the entire <em>Crime and Punishment</em> in as the context for your question.&rdquo; That&rsquo;s a lot. And that&rsquo;s exactly what makes the argument of this post viable: today&rsquo;s models can read an entire novel in one go and still answer a specific question about it. In 2023, this was science fiction. In 2026, it&rsquo;s the base case.</p>
<p>So picture a query that throws 200k tokens of input at the model (there goes <em>Crime and Punishment</em> again) and produces 2k tokens of output (about three pages of response):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th style="text-align: right">Input ($)</th>
          <th style="text-align: right">Output ($)</th>
          <th style="text-align: right">Total per query</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Claude Sonnet 4.6</td>
          <td style="text-align: right">$0.60</td>
          <td style="text-align: right">$0.03</td>
          <td style="text-align: right"><strong>$0.63</strong></td>
      </tr>
      <tr>
          <td>Claude Opus 4.6</td>
          <td style="text-align: right">$3.00</td>
          <td style="text-align: right">$0.15</td>
          <td style="text-align: right"><strong>$3.15</strong></td>
      </tr>
      <tr>
          <td>GLM 5</td>
          <td style="text-align: right">$0.12</td>
          <td style="text-align: right">$0.0044</td>
          <td style="text-align: right"><strong>$0.12</strong></td>
      </tr>
      <tr>
          <td>Gemini 3.1 Pro</td>
          <td style="text-align: right">$0.40</td>
          <td style="text-align: right">$0.024</td>
          <td style="text-align: right"><strong>$0.42</strong></td>
      </tr>
      <tr>
          <td>GPT 5.4 Pro</td>
          <td style="text-align: right">$3.00</td>
          <td style="text-align: right">$0.36</td>
          <td style="text-align: right"><strong>$3.36</strong></td>
      </tr>
  </tbody>
</table>
<p>Now throw prompt caching into the mix. Claude has a cache that drops cached input to a fraction of the full price (in the ballpark of 10%, depending on the model). Gemini has a similar mechanism. When you fire a sequence of queries against the same 200k-token dump, the cost of subsequent queries plummets to pennies. With Sonnet cached, you can fairly call it about $0.10 per follow-up query without making things up.</p>
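<p>The arithmetic is simple enough to check yourself. A few lines of Ruby reproduce the table and the cached follow-up estimate; the prices per million tokens are the ones listed above, and the 10% cache figure is the ballpark from the previous paragraph, so check your provider&rsquo;s current price sheet before trusting it:</p>
<pre><code class="language-ruby"># Per-query cost: 200k input tokens, 2k output tokens, prices per million.
PRICES = {
  "Claude Sonnet 4.6" =&gt; { input: 3.00, output: 15.0 },
  "GLM 5"             =&gt; { input: 0.60, output: 2.20 },
  "GPT 5.4 Pro"       =&gt; { input: 15.0, output: 180.0 },
}

def query_cost(p, in_tokens: 200_000, out_tokens: 2_000, cached: false)
  input_rate = cached ? p[:input] * 0.10 : p[:input]  # ~10% cache ballpark
  (in_tokens / 1e6) * input_rate + (out_tokens / 1e6) * p[:output]
end

PRICES.each do |name, p|
  puts format("%-18s first: $%.2f  cached follow-up: $%.2f",
              name, query_cost(p), query_cost(p, cached: true))
end
# Sonnet 4.6 comes out at $0.63 for the first query and ~$0.09 per cached
# follow-up, which is the "$0.10" figure in the text.
</code></pre>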
<p>Now compare that to the cost of running a Pinecone, or a Weaviate, or a pgvector. Setting aside the price of the subscription itself (which varies a lot), you need an engineer to wire up the pipeline, maintain it, monitor it, deal with embedding failures, redo the chunking when the domain shifts. Conservatively, you&rsquo;re looking at somewhere between 40 and 80 hours of engineering to make the thing stable. At R$ 200/hour, that&rsquo;s between R$ 8,000 and R$ 16,000. In USD, somewhere between $1,600 and $3,200 just to stand it up.</p>
<p>With $3,200, on Sonnet 4.6 with prompt caching, you can run something on the order of 30,000 queries of 200k tokens each. Thirty thousand queries, depending on the scale of the project, covers several months, or even an entire year, of use for an average internal tool. And you didn&rsquo;t pay an engineer to wire up a pipeline. There&rsquo;s no vector DB server to maintain. And if the document changes, the system already sees it on the next query.</p>
<p>The &ldquo;RAG is cheaper in tokens&rdquo; argument ignores that tokens are the cheapest thing in the entire equation. Engineers cost a lot, servers cost a lot, bugs in production cost a whole lot more. Tokens have become a commodity, and they&rsquo;re getting cheaper with every new model release.</p>
<p>The classic RAG argument was &ldquo;the model is expensive, retrieval is cheap.&rdquo; Today it&rsquo;s the opposite: the model is the cheap part of the stack, smart retrieval is what costs a fortune to build and maintain.</p>
<h2>Where the thesis doesn&rsquo;t hold<span class="hx:absolute hx:-mt-20" id="where-the-thesis-doesnt-hold"></span>
    <a href="#where-the-thesis-doesnt-hold" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>I don&rsquo;t want to come off as a fanboy. There are cases where classic RAG still wins:</p>
<ol>
<li><strong>Massive corpora.</strong> If you have 500 GB of raw text, even <code>rg</code> won&rsquo;t solve it in acceptable time. You need some kind of indexing. It can be indexed BM25 (Tantivy, Elasticsearch), it can be a vector DB. But notice: the first option is still lexical, not vector.</li>
<li><strong>Wildly scattered vocabulary.</strong> Customer support, where the user types &ldquo;my wifi&rsquo;s down&rdquo; and the documentation says &ldquo;loss of connectivity at the physical layer.&rdquo; BM25 won&rsquo;t catch that. Embedding will. Vector DB scores a point here.</li>
<li><strong>Non-textual modalities.</strong> Image-by-image search, audio-by-audio. Embedding is mandatory.</li>
<li><strong>Critical absolute latency.</strong> If you have to answer in 100ms with a 5k input budget, a generous dump won&rsquo;t fit. Pre-filtering is necessary.</li>
<li><strong>Compliance and audit.</strong> If you have to prove that a specific document was consulted to answer a specific query, having indexed and trackable chunks helps. A 200k-token context dump is more opaque from an audit standpoint.</li>
</ol>
<p>For those cases, classic RAG still makes sense. But notice the size of the list. These are specific cases. The general case, things like &ldquo;chat with our internal docs&rdquo; or &ldquo;ask the product manual,&rdquo; almost all of it falls into the &ldquo;grep + long context handles it better&rdquo; bucket.</p>
<h2>Lazy retrieval: the recipe I&rsquo;d defend<span class="hx:absolute hx:-mt-20" id="lazy-retrieval-the-recipe-id-defend"></span>
    <a href="#lazy-retrieval-the-recipe-id-defend" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>If I were building a &ldquo;chat with docs&rdquo; tool today, from scratch, it would look more or less like this:</p>
<ol>
<li><strong>Keep the documents raw.</strong> Markdown, converted PDF, code, whatever. On disk, organized in folders that make sense for the domain.</li>
<li><strong>Fast lexical filter.</strong> <code>ripgrep</code> with regex, or BM25 with Tantivy/SQLite FTS5, or a <code>LIKE</code> in Postgres if you already have one. Returns 100-300 hits.</li>
<li><strong>Load generously.</strong> Grab not just the matching snippet, but the entire file, or a wide window around it. Throw all of it into the context.</li>
<li><strong>Let the LLM do the fine work.</strong> Pass the original question, tell the model to find what matters, drop the rest, and answer with citations.</li>
<li><strong>(Optional) Add embeddings only for the query classes where lexical fails</strong>, after you have real data showing that it fails.</li>
</ol>
<p>This is the opposite of the old advice (&ldquo;start with vectors, fall back to keyword&rdquo;). It&rsquo;s: <strong>start with keyword, and add vector only if you feel the gap</strong>. In most projects, you never will.</p>
<h2>A toy implementation in Ruby<span class="hx:absolute hx:-mt-20" id="a-toy-implementation-in-ruby"></span>
    <a href="#a-toy-implementation-in-ruby" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>To make it concrete. Here&rsquo;s a Ruby script using the <a href="https://github.com/crmne/ruby_llm"target="_blank" rel="noopener"><code>ruby_llm</code></a> gem (the same one from yesterday&rsquo;s benchmark) that does exactly this flow: grep through the files, load the snippets with context, send to Claude, get the answer back. No vector DB, no chunking, no embedding, no LangChain.</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ruby" data-lang="ruby"><span class="line"><span class="cl"><span class="ch">#!/usr/bin/env ruby</span>
</span></span><span class="line"><span class="cl"><span class="nb">require</span> <span class="s2">&#34;ruby_llm&#34;</span>
</span></span><span class="line"><span class="cl"><span class="nb">require</span> <span class="s2">&#34;open3&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="no">DOCS_DIR</span> <span class="o">=</span> <span class="no">ARGV</span><span class="o">[</span><span class="mi">0</span><span class="o">]</span> <span class="o">||</span> <span class="s2">&#34;./docs&#34;</span>
</span></span><span class="line"><span class="cl"><span class="no">QUERY</span>    <span class="o">=</span> <span class="no">ARGV</span><span class="o">[</span><span class="mi">1</span><span class="o">]</span> <span class="ow">or</span> <span class="nb">abort</span> <span class="s2">&#34;uso: ./ask.rb &lt;pasta&gt; &lt;pergunta&gt;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 1. Fast lexical filter with ripgrep.</span>
</span></span><span class="line"><span class="cl"><span class="c1">#    -i case insensitive, -l file names only, --type-add covers md/txt/extracted-pdf.</span>
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">lexical_search</span><span class="p">(</span><span class="n">dir</span><span class="p">,</span> <span class="n">query</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="n">terms</span> <span class="o">=</span> <span class="n">query</span><span class="o">.</span><span class="n">downcase</span><span class="o">.</span><span class="n">scan</span><span class="p">(</span><span class="sr">/\w{4,}/</span><span class="p">)</span><span class="o">.</span><span class="n">uniq</span><span class="o">.</span><span class="n">first</span><span class="p">(</span><span class="mi">8</span><span class="p">)</span>  <span class="c1"># words with 4+ letters</span>
</span></span><span class="line"><span class="cl">  <span class="n">pattern</span> <span class="o">=</span> <span class="n">terms</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="s2">&#34;|&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="n">cmd</span> <span class="o">=</span> <span class="o">[</span><span class="s2">&#34;rg&#34;</span><span class="p">,</span> <span class="s2">&#34;-l&#34;</span><span class="p">,</span> <span class="s2">&#34;-i&#34;</span><span class="p">,</span> <span class="s2">&#34;-e&#34;</span><span class="p">,</span> <span class="n">pattern</span><span class="p">,</span> <span class="n">dir</span><span class="o">]</span>
</span></span><span class="line"><span class="cl">  <span class="n">files</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="no">Open3</span><span class="o">.</span><span class="n">capture2</span><span class="p">(</span><span class="o">*</span><span class="n">cmd</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="n">files</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s2">&#34;</span><span class="se">\n</span><span class="s2">&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">reject</span><span class="p">(</span><span class="o">&amp;</span><span class="ss">:empty?</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="k">end</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 2. Load entire files (up to a reasonable cap).</span>
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">load_context</span><span class="p">(</span><span class="n">files</span><span class="p">,</span> <span class="ss">max_chars</span><span class="p">:</span> <span class="mi">600_000</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="n">total</span> <span class="o">=</span> <span class="mi">0</span>
</span></span><span class="line"><span class="cl">  <span class="n">files</span><span class="o">.</span><span class="n">map</span> <span class="k">do</span> <span class="o">|</span><span class="n">path</span><span class="o">|</span>
</span></span><span class="line"><span class="cl">    <span class="n">body</span> <span class="o">=</span> <span class="no">File</span><span class="o">.</span><span class="n">read</span><span class="p">(</span><span class="n">path</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">next</span> <span class="k">if</span> <span class="n">total</span> <span class="o">+</span> <span class="n">body</span><span class="o">.</span><span class="n">size</span> <span class="o">&gt;</span> <span class="n">max_chars</span>
</span></span><span class="line"><span class="cl">    <span class="n">total</span> <span class="o">+=</span> <span class="n">body</span><span class="o">.</span><span class="n">size</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;## </span><span class="si">#{</span><span class="n">path</span><span class="si">}</span><span class="se">\n\n</span><span class="si">#{</span><span class="n">body</span><span class="si">}</span><span class="se">\n</span><span class="s2">&#34;</span>
</span></span><span class="line"><span class="cl">  <span class="k">end</span><span class="o">.</span><span class="n">compact</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="s2">&#34;</span><span class="se">\n</span><span class="s2">---</span><span class="se">\n</span><span class="s2">&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="k">end</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 3. Send to Claude with the question and the documents.</span>
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">ask</span><span class="p">(</span><span class="n">query</span><span class="p">,</span> <span class="n">context</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="n">chat</span> <span class="o">=</span> <span class="no">RubyLLM</span><span class="o">.</span><span class="n">chat</span><span class="p">(</span><span class="ss">model</span><span class="p">:</span> <span class="s2">&#34;anthropic/claude-sonnet-4-6&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="n">prompt</span> <span class="o">=</span> <span class="o">&lt;&lt;~</span><span class="no">PROMPT</span>
</span></span><span class="line"><span class="cl">    <span class="no">Você</span> <span class="n">tem</span> <span class="n">acesso</span> <span class="n">aos</span> <span class="n">documentos</span> <span class="n">abaixo</span><span class="o">.</span> <span class="no">Responda</span> <span class="n">a</span> <span class="n">pergunta</span> <span class="k">do</span> <span class="n">usuário</span>
</span></span><span class="line"><span class="cl">    <span class="n">usando</span> <span class="n">apenas</span> <span class="n">o</span> <span class="n">que</span> <span class="n">está</span> <span class="n">nos</span> <span class="n">documentos</span><span class="o">.</span> <span class="no">Cite</span> <span class="n">o</span> <span class="n">nome</span> <span class="k">do</span> <span class="n">arquivo</span> <span class="n">nas</span>
</span></span><span class="line"><span class="cl">    <span class="n">referências</span><span class="o">.</span> <span class="no">Se</span> <span class="n">a</span> <span class="n">resposta</span> <span class="n">não</span> <span class="n">estiver</span> <span class="n">nos</span> <span class="n">documentos</span><span class="p">,</span> <span class="n">diga</span> <span class="n">isso</span><span class="o">.</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="o">---</span> <span class="no">DOCUMENTOS</span> <span class="o">---</span>
</span></span><span class="line"><span class="cl">    <span class="c1">#{context}</span>
</span></span><span class="line"><span class="cl">    <span class="o">---</span> <span class="no">FIM</span> <span class="no">DOS</span> <span class="no">DOCUMENTOS</span> <span class="o">---</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="ss">Pergunta</span><span class="p">:</span> <span class="c1">#{query}</span>
</span></span><span class="line"><span class="cl">  <span class="no">PROMPT</span>
</span></span><span class="line"><span class="cl">  <span class="n">chat</span><span class="o">.</span><span class="n">ask</span><span class="p">(</span><span class="n">prompt</span><span class="p">)</span><span class="o">.</span><span class="n">content</span>
</span></span><span class="line"><span class="cl"><span class="k">end</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">files</span> <span class="o">=</span> <span class="n">lexical_search</span><span class="p">(</span><span class="no">DOCS_DIR</span><span class="p">,</span> <span class="no">QUERY</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nb">abort</span> <span class="s2">&#34;nenhum arquivo bateu&#34;</span> <span class="k">if</span> <span class="n">files</span><span class="o">.</span><span class="n">empty?</span>
</span></span><span class="line"><span class="cl"><span class="nb">puts</span> <span class="s2">&#34;Encontrei </span><span class="si">#{</span><span class="n">files</span><span class="o">.</span><span class="n">size</span><span class="si">}</span><span class="s2"> arquivos. Carregando contexto...&#34;</span>
</span></span><span class="line"><span class="cl"><span class="n">context</span> <span class="o">=</span> <span class="n">load_context</span><span class="p">(</span><span class="n">files</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nb">puts</span> <span class="n">ask</span><span class="p">(</span><span class="no">QUERY</span><span class="p">,</span> <span class="n">context</span><span class="p">)</span></span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>About 40 lines. No Pinecone dependency, no vector schema, no re-indexing pipeline. You run it as <code>./ask.rb ./docs &quot;how do I configure the payment webhook&quot;</code> and that&rsquo;s it.</p>
<p>That example is one-shot. You run it, it answers, done. For a real chat, with multiple questions in a row over the same documents, the design changes. Instead of running <code>lexical_search</code> upfront and shoving everything into the context at once, you expose the search as a tool to the model. Then it&rsquo;s the agent that decides when it needs to pull more docs, what term to look for, which file is worth opening in full. That&rsquo;s how Claude Code actually works: <code>Glob</code>, <code>Grep</code> and <code>Read</code> are tools, and the model picks the sequence. <code>ruby_llm</code> supports tool calling, so you can do the same thing in Ruby. It looks something like this:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ruby" data-lang="ruby"><span class="line"><span class="cl"><span class="nb">require</span> <span class="s2">&#34;ruby_llm&#34;</span>
</span></span><span class="line"><span class="cl"><span class="nb">require</span> <span class="s2">&#34;open3&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="no">DOCS_DIR</span> <span class="o">=</span> <span class="s2">&#34;./docs&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">SearchFiles</span> <span class="o">&lt;</span> <span class="no">RubyLLM</span><span class="o">::</span><span class="no">Tool</span>
</span></span><span class="line"><span class="cl">  <span class="n">description</span> <span class="s2">&#34;Procura arquivos cujo conteúdo casa com o padrão dado (regex). Retorna lista de paths.&#34;</span>
</span></span><span class="line"><span class="cl">  <span class="n">param</span> <span class="ss">:pattern</span><span class="p">,</span> <span class="ss">desc</span><span class="p">:</span> <span class="s2">&#34;Padrão regex pra busca lexical (case-insensitive)&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">  <span class="k">def</span> <span class="nf">execute</span><span class="p">(</span><span class="ss">pattern</span><span class="p">:)</span>
</span></span><span class="line"><span class="cl">    <span class="n">out</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="no">Open3</span><span class="o">.</span><span class="n">capture2</span><span class="p">(</span><span class="s2">&#34;rg&#34;</span><span class="p">,</span> <span class="s2">&#34;-l&#34;</span><span class="p">,</span> <span class="s2">&#34;-i&#34;</span><span class="p">,</span> <span class="s2">&#34;-e&#34;</span><span class="p">,</span> <span class="n">pattern</span><span class="p">,</span> <span class="no">DOCS_DIR</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">out</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s2">&#34;</span><span class="se">\n</span><span class="s2">&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">reject</span><span class="p">(</span><span class="o">&amp;</span><span class="ss">:empty?</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="k">end</span>
</span></span><span class="line"><span class="cl"><span class="k">end</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">ReadFile</span> <span class="o">&lt;</span> <span class="no">RubyLLM</span><span class="o">::</span><span class="no">Tool</span>
</span></span><span class="line"><span class="cl">  <span class="n">description</span> <span class="s2">&#34;Lê o conteúdo completo de um arquivo do projeto.&#34;</span>
</span></span><span class="line"><span class="cl">  <span class="n">param</span> <span class="ss">:path</span><span class="p">,</span> <span class="ss">desc</span><span class="p">:</span> <span class="s2">&#34;Caminho relativo do arquivo&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">  <span class="k">def</span> <span class="nf">execute</span><span class="p">(</span><span class="ss">path</span><span class="p">:)</span>
</span></span><span class="line"><span class="cl">    <span class="no">File</span><span class="o">.</span><span class="n">read</span><span class="p">(</span><span class="n">path</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="k">rescue</span> <span class="o">=&gt;</span> <span class="n">e</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;erro: </span><span class="si">#{</span><span class="n">e</span><span class="o">.</span><span class="n">message</span><span class="si">}</span><span class="s2">&#34;</span>
</span></span><span class="line"><span class="cl">  <span class="k">end</span>
</span></span><span class="line"><span class="cl"><span class="k">end</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">chat</span> <span class="o">=</span> <span class="no">RubyLLM</span><span class="o">.</span><span class="n">chat</span><span class="p">(</span><span class="ss">model</span><span class="p">:</span> <span class="s2">&#34;anthropic/claude-sonnet-4-6&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            <span class="o">.</span><span class="n">with_tools</span><span class="p">(</span><span class="no">SearchFiles</span><span class="p">,</span> <span class="no">ReadFile</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            <span class="o">.</span><span class="n">with_instructions</span><span class="p">(</span><span class="o">&lt;&lt;~</span><span class="no">SYS</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">              <span class="no">Você</span> <span class="n">responde</span> <span class="n">perguntas</span> <span class="n">sobre</span> <span class="n">os</span> <span class="n">documentos</span> <span class="n">em</span> <span class="c1">#{DOCS_DIR}.</span>
</span></span><span class="line"><span class="cl">              <span class="no">Use</span> <span class="n">search_files</span> <span class="n">pra</span> <span class="n">encontrar</span> <span class="n">arquivos</span> <span class="n">relevantes</span> <span class="n">e</span> <span class="n">read_file</span>
</span></span><span class="line"><span class="cl">              <span class="n">pra</span> <span class="n">ler</span> <span class="n">o</span> <span class="n">conteúdo</span><span class="o">.</span> <span class="no">Sempre</span> <span class="n">cite</span> <span class="n">o</span> <span class="n">arquivo</span> <span class="n">na</span> <span class="n">resposta</span><span class="o">.</span>
</span></span><span class="line"><span class="cl">            <span class="no">SYS</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="kp">loop</span> <span class="k">do</span>
</span></span><span class="line"><span class="cl">  <span class="nb">print</span> <span class="s2">&#34;&gt; &#34;</span>
</span></span><span class="line"><span class="cl">  <span class="n">msg</span> <span class="o">=</span> <span class="nb">gets</span><span class="o">&amp;.</span><span class="n">chomp</span>
</span></span><span class="line"><span class="cl">  <span class="k">break</span> <span class="k">if</span> <span class="n">msg</span><span class="o">.</span><span class="n">nil?</span> <span class="o">||</span> <span class="n">msg</span><span class="o">.</span><span class="n">empty?</span>
</span></span><span class="line"><span class="cl">  <span class="nb">puts</span> <span class="n">chat</span><span class="o">.</span><span class="n">ask</span><span class="p">(</span><span class="n">msg</span><span class="p">)</span><span class="o">.</span><span class="n">content</span>
</span></span><span class="line"><span class="cl"><span class="k">end</span></span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>The model gets the question, decides whether it needs to search, calls <code>search_files</code>, sees what came back, decides whether it needs to open any file, calls <code>read_file</code>, and only then answers. On the next question it already has the previous context in the session and can ask for more if it needs to. The context only receives what the model asked for, not the whole grep dump from the earlier example.</p>
<p>The same idea works for databases: swap <code>rg</code> for a SQL query with <code>LIKE</code> or <code>tsvector</code> (Postgres full-text), load the relevant rows, throw them in the context. If you have 10k records in an internal database, this handles it. If you have 10 million, you start needing smarter pagination or a more serious pre-filtering layer. But the mental model is the same: <strong>dumb filter + smart reader</strong>.</p>
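<p>For the database variant, the sketch is just as short. Assuming the <code>pg</code> gem and a hypothetical <code>docs</code> table with <code>title</code> and <code>body</code> columns:</p>
<pre><code class="language-ruby">require "pg"

# Same dumb-filter-smart-reader flow, over Postgres full-text search.
# The "docs" table and "internal" database are hypothetical.
def fetch_candidates(conn, query, limit: 50)
  conn.exec_params(&lt;&lt;~SQL, [query, limit])
    SELECT title, body
    FROM docs
    WHERE to_tsvector('english', body) @@ websearch_to_tsquery('english', $1)
    ORDER BY ts_rank(to_tsvector('english', body),
                     websearch_to_tsquery('english', $1)) DESC
    LIMIT $2
  SQL
end

conn = PG.connect(dbname: "internal")
rows = fetch_candidates(conn, "payment webhook configuration")
context = rows.map { |r| "## #{r["title"]}\n\n#{r["body"]}" }.join("\n---\n")
# ...then hand context to the same ask() function from the first script.
</code></pre>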
<h2>The point that matters<span class="hx:absolute hx:-mt-20" id="the-point-that-matters"></span>
    <a href="#the-point-that-matters" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>The most interesting thing in all of this isn&rsquo;t even the Pinecone savings. It&rsquo;s that the nature of the bottleneck has changed. In 2023, the bottleneck was retrieval: the reader was small, slow, expensive, and you needed a clever retriever to fill the window with the bare minimum. In 2026, the bottleneck is reasoning over messy context: the reader is big, relatively fast, and cheap. So it makes more sense to have a dumb retriever with high recall and let the model do the heavy lifting.</p>
<p>Anyone still designing systems with the 2023 mindset is paying a premium to solve a problem whose shape has changed. RAG didn&rsquo;t die, the &ldquo;R&rdquo; got dumber and cheaper, and that&rsquo;s an upgrade. The vector DB vendors aren&rsquo;t going to tell you this, but it&rsquo;s the path the more experienced folks have been quietly walking.</p>
<p>The next wave of LLM applications, in my bet, is going to be dominated by the people who got this inversion. Smaller stacks, simpler infrastructure, generous context, and a whole lot less LangChain.</p>
<h2>What the recent literature says<span class="hx:absolute hx:-mt-20" id="what-the-recent-literature-says"></span>
    <a href="#what-the-recent-literature-says" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Before I close out, I went and checked what the research crowd published on this. Blog hot takes age in three months in this field, so it&rsquo;s better to look at the papers.</p>
<p><a href="https://arxiv.org/abs/2407.16833"target="_blank" rel="noopener"><strong>Retrieval Augmented Generation or Long-Context LLMs?</strong></a>, out of Google DeepMind, published at EMNLP 2024, is probably the most cited piece in the debate. Their conclusion: when the model has enough resources, long context beats RAG on average quality, but RAG is still much cheaper in tokens. They propose Self-Route, an approach where the model itself decides whether it needs retrieval or whether it can just go straight through context. The token savings are big and the quality loss is small.</p>
<p>Then <a href="https://openreview.net/forum?id=CLF25dahgA"target="_blank" rel="noopener"><strong>LaRA</strong></a>, presented at ICML 2025, is more measured. The authors built 2326 test cases across four QA task types and three long-context types, ran them across 11 different LLMs, and the conclusion was: there is no silver bullet. The choice between RAG and long context depends on the model, the context size, the task type, and the retrieval characteristics. RAG wins on dialogue and generic queries, long context wins on Wikipedia-style QA.</p>
<p><a href="https://arxiv.org/abs/2501.01880"target="_blank" rel="noopener"><strong>Long Context vs. RAG for LLMs: An Evaluation and Revisits</strong></a>, from January 2025, is the one that most reinforces this post&rsquo;s thesis. Long context tends to beat RAG on QA benchmarks, especially when the base document is stable. Summarization-based retrieval comes close, and chunk-based retrieval lags behind. In other words: the old way, chunk plus embed plus top-k, is the one that comes out worst.</p>
<p>Worth keeping on the radar too is the original <a href="https://arxiv.org/abs/2307.03172"target="_blank" rel="noopener"><strong>Lost in the Middle</strong></a> (Liu et al., 2023, published in TACL in 2024). That&rsquo;s the paper that showed even models with big windows have performance that depends on the position of the relevant information. Stuff at the beginning or end of the context is found easily; stuff in the middle degrades. For a long time this got used as the argument against long context, but the paper is from 2023, with 2023 models. Today&rsquo;s models, the Claude 4.x and Gemini 3.x line, handle the middle a lot better. It&rsquo;s not a solved problem, but it&rsquo;s much smaller than it was.</p>
<p>On the lexical retrieval side, <a href="https://arxiv.org/abs/2104.08663"target="_blank" rel="noopener"><strong>BEIR</strong></a> is still the canonical reference. The classic result is that BM25, all the way from the 90s, is still ridiculously competitive in out-of-domain scenarios. Dense models only win consistently when you have in-domain data to fine-tune the embeddings. In zero-shot scenarios, which is where most projects live, BM25 is hard to beat without serious work.</p>
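<p>For reference, since BM25 keeps coming up: the whole scoring function fits on one screen. A textbook sketch of the standard Okapi formula, not any particular library&rsquo;s implementation:</p>
<pre><code class="language-ruby"># Textbook Okapi BM25 with the usual defaults (k1=1.2, b=0.75).
def bm25_scores(docs, query_terms, k1: 1.2, b: 0.75)
  tokenized = docs.map { |d| d.downcase.scan(/\w+/) }
  avgdl = tokenized.sum(&amp;:size).to_f / docs.size
  n = docs.size

  tokenized.map do |tokens|
    query_terms.sum do |term|
      df  = tokenized.count { |t| t.include?(term) }    # docs containing term
      idf = Math.log((n - df + 0.5) / (df + 0.5) + 1)   # smoothed IDF
      f   = tokens.count(term)                          # term frequency in doc
      idf * (f * (k1 + 1)) / (f + k1 * (1 - b + b * tokens.size / avgdl))
    end
  end
end

docs = ["the payment webhook retries three times",
        "webhook configuration lives in config/webhooks.yml",
        "an unrelated document about authentication"]
p bm25_scores(docs, %w[webhook configuration])  # third doc scores lowest
</code></pre>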
<p>To wrap up, the <a href="https://www.anthropic.com/news/contextual-retrieval"target="_blank" rel="noopener"><strong>Anthropic post on Contextual Retrieval</strong></a>, from September 2024, is the most practical piece on the list. They show that combining contextual embedding with contextual BM25 drops the top-20 failure rate from 5.7% to 2.9%. Add a reranker and it drops to 1.9%. Important detail: BM25 is the centerpiece of their result, not a sidekick. The right reading is &ldquo;lexical plus vector plus reranker is the combination that works.&rdquo; Anyone who can only pick one picks BM25 and still gets pretty far.</p>
<p>To sum up what we can actually nail down: the literature isn&rsquo;t claiming &ldquo;RAG is dead.&rdquo; It&rsquo;s saying that long context, when you can use it, tends to win on quality. It&rsquo;s saying RAG&rsquo;s cost is still its main argument. It&rsquo;s saying lexical BM25 is much stronger than the vector DB marketing makes it sound. And it&rsquo;s saying that when you really do need heavy retrieval, the robust combination is hybrid (lexical plus vector plus reranker), not pure vector. All of that lines up with what I&rsquo;ve been defending in practice.</p>
<h2>Sources<span class="hx:absolute hx:-mt-20" id="sources"></span>
    <a href="#sources" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><ul>
<li>Li, Z. et al. (2024). <a href="https://arxiv.org/abs/2407.16833"target="_blank" rel="noopener">Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach</a>. EMNLP 2024 Industry Track.</li>
<li>Yuan, K. et al. (2025). <a href="https://openreview.net/forum?id=CLF25dahgA"target="_blank" rel="noopener">LaRA: Benchmarking Retrieval-Augmented Generation and Long-Context LLMs – No Silver Bullet for LC or RAG Routing</a>. ICML 2025.</li>
<li>Yu, T. et al. (2025). <a href="https://arxiv.org/abs/2501.01880"target="_blank" rel="noopener">Long Context vs. RAG for LLMs: An Evaluation and Revisits</a>. arXiv:2501.01880.</li>
<li>Liu, N. F. et al. (2023). <a href="https://arxiv.org/abs/2307.03172"target="_blank" rel="noopener">Lost in the Middle: How Language Models Use Long Contexts</a>. TACL 2024.</li>
<li>Thakur, N. et al. (2021). <a href="https://arxiv.org/abs/2104.08663"target="_blank" rel="noopener">BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models</a>. NeurIPS Datasets and Benchmarks 2021.</li>
<li>Anthropic (2024). <a href="https://www.anthropic.com/news/contextual-retrieval"target="_blank" rel="noopener">Introducing Contextual Retrieval</a>. Blog post.</li>
<li>Akita, F. (2026). <a href="/en/2026/03/31/claude-code-source-code-leaked-what-we-found-inside/">Claude Code&rsquo;s Source Code Leaked. Here&rsquo;s What We Found Inside.</a> — my coverage of the leak, with more detail on the memory architecture, KAIROS and <code>autoDream</code>.</li>
</ul>
]]></content:encoded><category>llm</category><category>rag</category><category>vibecoding</category><category>ai</category></item><item><title>Testing Open Source and Commercial LLMs - Can Anyone Beat Claude Opus?</title><link>https://akitaonrails.github.io/en/2026/04/05/testing-llms-open-source-and-commercial-can-anyone-beat-claude-opus/</link><guid isPermaLink="true">https://akitaonrails.github.io/en/2026/04/05/testing-llms-open-source-and-commercial-can-anyone-beat-claude-opus/</guid><pubDate>Sun, 05 Apr 2026 18:00:00 GMT</pubDate><description>&lt;blockquote&gt;
&lt;p&gt;⚠️ &lt;strong&gt;Obsolete article (updated 2026-04-24).&lt;/strong&gt; The conclusions and rankings in this post were superseded after I re-audited the benchmark against the &lt;code&gt;ruby_llm&lt;/code&gt; gem source and restructured the evaluation criteria. Several &amp;ldquo;hallucinations&amp;rdquo; I had cataloged were actually valid API. Kimi K2.6 and Gemini 3.1 Pro moved up to Tier A. GLM 5.1 dropped to Tier C. MiMo V2.5 Pro fell from &amp;ldquo;first non-Anthropic Tier 1&amp;rdquo; to Tier B. &lt;strong&gt;The canonical version lives at &lt;a href="https://akitaonrails.github.io/en/2026/04/24/llm-benchmarks-parte-3-deepseek-kimi-mimo/"&gt;LLM Coding Benchmark (April 2026)&lt;/a&gt;.&lt;/strong&gt; This post stays as a historical record of what I concluded before the re-audit.&lt;/p&gt;</description><content:encoded><![CDATA[<blockquote>
  <p>⚠️ <strong>Obsolete article (updated 2026-04-24).</strong> The conclusions and rankings in this post were superseded after I re-audited the benchmark against the <code>ruby_llm</code> gem source and restructured the evaluation criteria. Several &ldquo;hallucinations&rdquo; I had cataloged were actually valid API. Kimi K2.6 and Gemini 3.1 Pro moved up to Tier A. GLM 5.1 dropped to Tier C. MiMo V2.5 Pro fell from &ldquo;first non-Anthropic Tier 1&rdquo; to Tier B. <strong>The canonical version lives at <a href="/en/2026/04/24/llm-benchmarks-parte-3-deepseek-kimi-mimo/">LLM Coding Benchmark (April 2026)</a>.</strong> This post stays as a historical record of what I concluded before the re-audit.</p>

</blockquote>
<hr>
<p><strong>Update April 16, 2026:</strong> Added Claude Opus 4.7, Qwen 3.6 and GPT 5.4 via Codex CLI (xHigh reasoning). Opus 4.7 is an incremental improvement over 4.6 (28 tests vs 16, same correct API) and becomes the new baseline. GPT 5.4, previously Tier 1 based on my personal vouch, now has objective data and dropped to Tier 2 — burned 7.6M tokens (~$16/run, 15x more expensive than Opus) and got the <code>add_message</code> calling convention wrong on multi-turn. Qwen 3.6 Plus remains Tier 3 with the same API hallucination as 3.5. The conclusion stands: if you want safety, pick Opus.</p>
<hr>
<p><strong>TL;DR:</strong> If you don&rsquo;t want to read the whole analysis: the only models that produced code that actually works in our benchmark were Claude Opus 4.7, Claude Opus 4.6, GLM 5 and GLM 5.1 (from Z.AI, ~89% cheaper than Opus). Sonnet works in this benchmark too, but in practice it falls short on projects requiring deeper reasoning — details in the conclusion. GPT 5.4, previously Tier 1 based on my personal vouch, now has objective data via Codex CLI: it burned 7.6M tokens ($16/run) and got the <code>add_message</code> calling convention wrong — works on the first message, breaks on multi-turn. Dropped to Tier 2. Everything else — Kimi, DeepSeek, MiniMax, Qwen, Gemini, Grok 4.20 — invented APIs that don&rsquo;t exist or ignored the gem we asked for.</p>
<p>There&rsquo;s a new wrinkle in this update: I redid the local part of the benchmark on an RTX 5090 (instead of the AMD Strix Halo) and added a fresh batch of Qwen models, including a Qwen 3.5 27B distilled directly from Claude 4.6 Opus. That reopened the conversation on running open source models locally. The 5090&rsquo;s memory bandwidth flips the game from &ldquo;unworkable&rdquo; to &ldquo;workable with 1-2 follow-up prompts.&rdquo; The bottleneck for open source models has moved to a lack of factual knowledge about specific libraries, which I unpack in detail in the new section on the Qwen family. The Claude distillation gamble, by the way, gave a pretty frustrating result that I haven&rsquo;t seen documented in these terms before.</p>
<hr>
<p>If you&rsquo;ve been following <a href="/en/tags/vibecoding/">my previous vibe coding pieces</a>, you know I spent the last two months in a 500-hour marathon using Claude Opus as my main coding agent. The results were good, as I reported in the <a href="/en/2026/03/05/37-days-of-vibe-coding-immersion-conclusions-on-business-models/">conclusion about business models</a>. But there was an itch I couldn&rsquo;t scratch: am I locked into one model? Is there a real alternative to Claude Opus for daily use on real projects?</p>
<p>I&rsquo;ve got an RTX 5090 with 32 GB of GDDR7. I know I can run the latest open source models. I bought a <a href="/en/2026/03/31/minisforum-ms-s1-max-amd-ai-max-395-review/">Minisforum MS-S1</a> with an AMD Ryzen AI Max 395 and 128 GB of unified memory, and built a <a href="/en/2026/03/31/migrating-my-home-server-with-claude-code/">home server with Docker</a> to serve local models. The infrastructure was ready. What was missing was actually testing it.</p>
<p>I built an automated benchmark to compare open source and commercial models under identical conditions. 33 models configured in total (25 from the original run plus 8 added in the NVIDIA rerun), 27 executed, 16 completed in some form. The code is on <a href="https://github.com/akitaonrails/llm-coding-benchmark"target="_blank" rel="noopener">GitHub</a>.</p>
<h2>The bottleneck nobody explains: VRAM and KV Cache<span class="hx:absolute hx:-mt-20" id="the-bottleneck-nobody-explains-vram-and-kv-cache"></span>
    <a href="#the-bottleneck-nobody-explains-vram-and-kv-cache" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Before getting to the results, I have to explain why running large models locally is much harder than it looks.</p>
<p>Take Qwen3 32B. The model in FP16 (full precision) takes ~64 GB. Quantized to Q4 (4 bits), it drops to ~19 GB. So it fits in my RTX 5090&rsquo;s 32 GB, right? Wrong. That&rsquo;s just the model weights. There&rsquo;s a part nobody tells you about: the <strong>KV Cache</strong>.</p>
<p>KV Cache is the memory the model uses to &ldquo;remember&rdquo; what it has already read. Every time it processes a token (a word or piece of a word), it computes two vectors — K (key) and V (value) — for every attention layer. Those vectors stick around so the model doesn&rsquo;t have to recompute everything when it generates the next token. Without that, generation would be quadratically slow.</p>
<p>The KV Cache scales linearly with the size of the context. The formula:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code>KV Memory = 2 × Layers × KV_Heads × Head_Dimension × Bytes_per_Element × Context_Tokens</code></pre></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>For a model like Llama 3.1 70B in BF16, that comes out to ~0.31 MB per token. Sounds tiny, until you realize that a 128K context eats <strong>40 GB</strong> of KV Cache alone. The model itself plus KV Cache adds up to way more VRAM than most GPUs have.</p>
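<p>To make that concrete, here&rsquo;s the back-of-envelope math as a script. The Llama 3.1 70B numbers (80 layers, 8 KV heads via GQA, head dimension 128, BF16) are the published config; swap the constants for whatever model you&rsquo;re sizing:</p>
<div><pre><code class="language-ruby"># KV Cache sizing for Llama 3.1 70B in BF16, straight from the formula above.
LAYERS   = 80
KV_HEADS = 8      # GQA: 8 KV heads (not the 64 attention heads)
HEAD_DIM = 128
BYTES    = 2      # BF16 = 2 bytes per element
CONTEXT  = 128 * 1024

per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES  # the 2 is K and V
puts "per token: #{(per_token / 1024.0**2).round(2)} MB"            # =&gt; 0.31 MB
puts "at 128K:   #{(per_token * CONTEXT / 1024.0**3).round(1)} GB"  # =&gt; 40.0 GB</code></pre></div>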
<p>And for actual coding agent use, 128K tokens isn&rsquo;t a luxury, it&rsquo;s the bare minimum. The agent has to read files, keep conversation history, receive command output. In long benchmark sessions, our models consumed between 39K and 156K tokens. Less than 100K of context isn&rsquo;t practical for day-to-day project work.</p>
<p>Google published <a href="https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/"target="_blank" rel="noopener">TurboQuant</a> (ICLR 2026), which compresses the KV Cache to 3 bits without accuracy loss — a 6x memory reduction and up to 8x speedup. It uses random vector rotation (PolarQuant) followed by a 1-bit algorithm on the residuals. Works online during inference, compressing on write and decompressing on read. Not yet implemented in the runtimes we use (llama.cpp, Ollama), but when it lands it&rsquo;ll change the equation a lot.</p>
<p>For anyone wanting to dig deeper into the VRAM math, I recommend <a href="https://x.com/TheAhmadOsman/status/2040103488714068245"target="_blank" rel="noopener">this post from Ahmad Osman</a> linking to the article &ldquo;GPU Memory Math for LLMs (2026 Edition)&rdquo;.</p>
<h2>The hardware problem: not all memory is created equal<span class="hx:absolute hx:-mt-20" id="the-hardware-problem-not-all-memory-is-created-equal"></span>
    <a href="#the-hardware-problem-not-all-memory-is-created-equal" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>&ldquo;But I have 128 GB of RAM!&rdquo; Cool, but that&rsquo;s not what matters. What matters is memory bandwidth, and the difference between types is wild:</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/05/llm-benchmark/en/memory-bandwidth.png" alt="Memory bandwidth by type"  loading="lazy" /></p>
<p>The RTX 5090 has 7x the bandwidth of the LPDDR5x memory in my Minisforum. That means even if a model fits in the AMD&rsquo;s unified RAM, inference will be proportionally slower. On my Minisforum with LPDDR5x at 256 GB/s, Qwen3 32B runs at ~7 tok/s. On the RTX 5090 at 1,792 GB/s, it&rsquo;d be much faster — if it fit entirely in VRAM alongside the KV Cache.</p>
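<p>A crude sanity check on those numbers: decoding is memory-bound, so every generated token streams roughly the whole set of weights through memory once, which caps throughput at bandwidth divided by model size. It&rsquo;s an upper bound that ignores KV Cache traffic and compute overhead, but it predicts the right order of magnitude:</p>
<div><pre><code class="language-ruby"># Rough ceiling for memory-bound decoding: tok/s &lt;= bandwidth / model size.
# Real numbers come in below this (KV Cache reads, scheduling overhead).
def max_tok_per_s(bandwidth_gb_s, model_gb)
  bandwidth_gb_s / model_gb
end

model_gb = 19.0  # Qwen3 32B at Q4, ~19 GB of weights
puts max_tok_per_s(256.0,  model_gb).round(1)  # Strix LPDDR5x  =&gt; 13.5 (measured: ~7)
puts max_tok_per_s(1792.0, model_gb).round(1)  # RTX 5090 GDDR7 =&gt; 94.3</code></pre></div>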
<p>Most folks running local models are still on DDR4. At 50 GB/s, 32B models are basically unusable. And there&rsquo;s another factor people forget: storage. When the RAM can&rsquo;t keep up and the system swaps, the storage speed becomes the bottleneck:</p>
<table>
  <thead>
      <tr>
          <th>Storage</th>
          <th style="text-align: right">Sequential Speed</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SATA SSD</td>
          <td style="text-align: right">~550 MB/s</td>
      </tr>
      <tr>
          <td>NVMe Gen3</td>
          <td style="text-align: right">~3,500 MB/s</td>
      </tr>
      <tr>
          <td>NVMe Gen4</td>
          <td style="text-align: right">~7,000 MB/s</td>
      </tr>
      <tr>
          <td>NVMe Gen5</td>
          <td style="text-align: right">~12,000 MB/s</td>
      </tr>
  </tbody>
</table>
<p>From SATA to NVMe Gen5 you&rsquo;re looking at a 22x difference. If you&rsquo;re doing partial offloading to disk (which is common when the model doesn&rsquo;t fit entirely on the GPU), NVMe Gen4 or Gen5 makes a real difference. SATA is a non-starter.</p>
<p>To sum up: running local models isn&rsquo;t just &ldquo;having enough RAM.&rdquo; You need the right kind of memory, with the right bandwidth, and fast storage as a fallback. For a lot of people, a Mac Studio with high-bandwidth unified memory (up to 800 GB/s on the M4 Ultra with 512 GB) would be the more practical option, but it costs more than US$ 10,000. The AMD Ryzen AI Max is the cheaper alternative with unified memory, but its LPDDR5x caps out at 256 GB/s.</p>
<h2>Ollama vs llama.cpp: why Ollama falls apart on benchmarks<span class="hx:absolute hx:-mt-20" id="ollama-vs-llamacpp-why-ollama-falls-apart-on-benchmarks"></span>
    <a href="#ollama-vs-llamacpp-why-ollama-falls-apart-on-benchmarks" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p><a href="https://ollama.com/"target="_blank" rel="noopener">Ollama</a> is the most popular way to run local models. Install, pull the model, run. For casual use it works. But when I tried to use it for automated benchmarks with long unattended sessions, it broke in 6 different ways across 8 models:</p>
<ol>
<li>Unloads the model mid-session. On long runs, Ollama decides the model isn&rsquo;t being used and unloads it from the GPU. The agent sits there waiting for a response from a model that no longer exists.</li>
<li>Ignores the requested context. You ask for <code>num_ctx=131072</code>, Ollama accepts, then halfway through the run it reverts to the default without warning.</li>
<li>Unstable lifecycle. Asking for <code>keep_alive: 0</code> to unload doesn&rsquo;t always work. The model stays resident and blocks the next one.</li>
<li>Incompatible formats. Native bf16 variants on Ollama failed, while the same model as a Q8 GGUF from HuggingFace worked fine.</li>
</ol>
<p>The fix: migrate to <a href="https://github.com/mostlygeek/llama-swap"target="_blank" rel="noopener">llama-swap</a>, a Go wrapper that manages llama.cpp processes with hot-swap. When a request comes in for a different model than the one currently loaded, it kills the current process and starts the new one. No context negotiation, no flaky lifecycle.</p>
<p>llama-swap fixed the loading of 6 of the 8 models that had failed under Ollama:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Ollama</th>
          <th>llama-swap</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Gemma 4 27B</td>
          <td>HTTP 500</td>
          <td>47.6 tok/s</td>
      </tr>
      <tr>
          <td>GLM 4.7 Flash</td>
          <td>No output</td>
          <td>47.4 tok/s</td>
      </tr>
      <tr>
          <td>Llama 4 Scout</td>
          <td>Unloaded</td>
          <td>17.5 tok/s</td>
      </tr>
      <tr>
          <td>Qwen 3.5 35B</td>
          <td>Output off-spec</td>
          <td>49.7 tok/s</td>
      </tr>
      <tr>
          <td>Qwen 3.5 122B</td>
          <td>Context drift</td>
          <td>23.1 tok/s</td>
      </tr>
      <tr>
          <td>GPT OSS 20B</td>
          <td>Model not found</td>
          <td>78.3 tok/s</td>
      </tr>
  </tbody>
</table>
<p>But llama-swap isn&rsquo;t magic.</p>
<h2>Why &ldquo;just use llama.cpp&rdquo; doesn&rsquo;t fix everything<span class="hx:absolute hx:-mt-20" id="why-just-use-llamacpp-doesnt-fix-everything"></span>
    <a href="#why-just-use-llamacpp-doesnt-fix-everything" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>llama.cpp solves Ollama&rsquo;s lifecycle problems but brings its own:</p>
<p>Each model needs specific flags. GLM and Qwen 3.5 emit <code>&lt;think&gt;</code> tags that break clients if you don&rsquo;t pass <code>--reasoning-format none</code>. Gemma 4 needs build b8665+ for the tool call parser to work.</p>
<p>Not every model supports tool calling. llama.cpp needs a dedicated parser for each model&rsquo;s tool call format. Llama 4 Scout uses a &ldquo;pythonic&rdquo; format (<code>[func(param=&quot;value&quot;)]</code>) that llama.cpp simply doesn&rsquo;t parse and emits as plain text. vLLM has a parser for it, llama.cpp doesn&rsquo;t.</p>
<p>And then there are the repetition loops. Gemma 4, even with the right parser, gets into an infinite loop after ~11 tool calls in long sessions. It&rsquo;s a <a href="https://github.com/ggml-org/llama.cpp/issues/21375"target="_blank" rel="noopener">known bug</a> that PR #21418 didn&rsquo;t fully fix.</p>
<p>Tool calling compatibility per model:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Tool Calling</th>
          <th>Required Flags</th>
          <th>Benchmark Result</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Gemma 4 27B</td>
          <td>Partial (b8665+)</td>
          <td><code>--jinja --reasoning-format none</code></td>
          <td>Infinite loop after ~11 steps</td>
      </tr>
      <tr>
          <td>GLM 4.7 Flash</td>
          <td>Yes</td>
          <td><code>--jinja --reasoning-format none</code></td>
          <td>2029 files, ended mid-tool-call</td>
      </tr>
      <tr>
          <td>Qwen 3.5 (35B, 122B)</td>
          <td>Yes</td>
          <td><code>--jinja --reasoning-format none</code></td>
          <td>Completed successfully</td>
      </tr>
      <tr>
          <td>Qwen 3 Coder Next</td>
          <td>Yes</td>
          <td><code>--jinja</code></td>
          <td>Completed (best local result)</td>
      </tr>
      <tr>
          <td>GPT OSS 20B</td>
          <td>Yes</td>
          <td><code>--jinja</code></td>
          <td>Tool calls ok, but app in wrong directory</td>
      </tr>
      <tr>
          <td>Llama 4 Scout</td>
          <td>No</td>
          <td>—</td>
          <td>No parser in llama.cpp</td>
      </tr>
  </tbody>
</table>
<p>At the end of the day, llama.cpp is better than Ollama for automated runs, but &ldquo;plug and play&rdquo; it ain&rsquo;t. Each model requires specific configuration, and some just don&rsquo;t work for agentic coding yet.</p>
<h2>Reasoning: models that think vs models that wing it<span class="hx:absolute hx:-mt-20" id="reasoning-models-that-think-vs-models-that-wing-it"></span>
    <a href="#reasoning-models-that-think-vs-models-that-wing-it" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>There&rsquo;s one difference between models worth explaining: reasoning. The idea is that the model &ldquo;thinks before answering&rdquo; instead of generating tokens straight from left to right. Models with reasoning go through an internal chain-of-thought step where they evaluate the problem, consider alternatives, plan, and only then emit the response.</p>
<p>In practice this shows up as <code>&lt;think&gt;...&lt;/think&gt;</code> tags in the output, blocks of text the model writes to itself that shouldn&rsquo;t go to the end user. Claude Opus 4.6, GPT 5.4, DeepSeek V3.2 and the Qwen 3.5 line support reasoning natively. The smaller ones (Gemma 4, GPT OSS 20B, older models) don&rsquo;t have that capability.</p>
<p>Why does it matter for coding? When a coding agent gets &ldquo;build a Rails app with 9 components,&rdquo; it has to decompose the task into steps, decide the order, anticipate dependencies, adapt when something fails. Without reasoning, the model generates code sequentially with no planning. It works for simple tasks, falls apart on projects with interdependent parts.</p>
<p>In the benchmark, the difference was clear:</p>
<ul>
<li>GPT OSS 20B (no reasoning, 20B parameters) created the app in the wrong directory. Couldn&rsquo;t keep workspace instructions in mind while generating code.</li>
<li>Qwen 3 32B has reasoning, but at 7 tok/s it was too slow. The &ldquo;thinking&rdquo; tokens drag out the generation time.</li>
<li>Gemma 4 31B, with no reasoning trained for agentic use, fell into repetitive tool calling loops.</li>
<li>GLM 5 (cloud, 745B MoE) with reasoning and 44B active parameters, finished cleanly and used the correct API.</li>
</ul>
<p>There&rsquo;s a trade-off: reasoning consumes extra tokens (the <code>&lt;think&gt;</code> blocks), which take up VRAM in the KV Cache and slow generation down. That&rsquo;s why flags like <code>--reasoning-format none</code> are needed in llama.cpp. Some clients don&rsquo;t know what to do with reasoning tokens and break. Models that emit reasoning when the runtime isn&rsquo;t expecting it can produce garbage in the output.</p>
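<p>If your client can&rsquo;t rely on the server stripping reasoning (the <code>--reasoning-format none</code> route), the fallback is to do it client-side. A minimal sketch, assuming the model delimits its reasoning with literal <code>&lt;think&gt;</code> tags the way GLM and Qwen 3.5 do:</p>
<div><pre><code class="language-ruby"># Strip &lt;think&gt;...&lt;/think&gt; blocks from a raw model response before
# showing it to the user. Assumes the runtime passed the tags through
# verbatim instead of filtering them server-side.
def strip_reasoning(output)
  output.gsub(%r{&lt;think&gt;.*?&lt;/think&gt;}m, "").strip
end

raw = "&lt;think&gt;User wants a greeting. Keep it short.&lt;/think&gt;Hi there!"
puts strip_reasoning(raw)  # =&gt; "Hi there!"</code></pre></div>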
<p>And reasoning isn&rsquo;t something you &ldquo;turn on&rdquo; in any model. It&rsquo;s a capability trained with reinforcement learning on top of the base model, using data from problems that require multi-step thinking. The smaller open source models (20B-35B) typically didn&rsquo;t go through that training, or went through it on a smaller scale. They know how to generate code, but they don&rsquo;t know how to <em>plan</em> code. On tasks that require 50+ coordinated tool calls, that difference is fatal.</p>
<h2>The benchmark: methodology<span class="hx:absolute hx:-mt-20" id="the-benchmark-methodology"></span>
    <a href="#the-benchmark-methodology" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>To compare models fairly, I built an automated harness in Python. Each model gets the exact same prompt: build a complete Ruby on Rails application, a ChatGPT-style chat SPA using the RubyLLM gem, with Hotwire/Stimulus/Turbo Streams, Tailwind CSS, Minitest tests, CI tools (Brakeman, RuboCop, SimpleCov, bundle-audit), Dockerfile, docker-compose and README.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/05/llm-benchmark/crush-screenshot.png" alt="Crush — coding CLI from Charm"  loading="lazy" /></p>
<p>The runner is <a href="https://github.com/sst/opencode"target="_blank" rel="noopener">opencode</a>, one of the most popular open source coding CLIs, competing with Claude Code and Codex. One clarification worth making: the original author left the project to work on <a href="https://github.com/charmbracelet/crush"target="_blank" rel="noopener">Crush</a> together with the <a href="https://charm.sh/"target="_blank" rel="noopener">Charm</a> team (the folks behind Bubble Tea, Lip Gloss and several other Go terminal tools), but the rest of the original team kept evolving opencode normally — the project has not been discontinued. Today the two coexist. If you read <a href="/en/2026/01/09/omarchy-3-one-of-the-best-coding-agents-crush/">my piece on Crush</a>, you already know that branch. Both run everywhere: macOS, Linux, Windows, Android, FreeBSD.</p>
<p>I actually tried to use Crush for the benchmark first. The problem: it advertised a <code>--yolo</code> flag in its help to auto-approve every action (essential for unattended automated runs), but at runtime it rejected the flag. Without auto-approve there&rsquo;s no way to do an unattended benchmark. opencode, on the other hand, had the <code>opencode run --agent build --format json</code> mode that emits JSON events with session IDs and token counts, perfect for automation. So we went with opencode.</p>
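<p>To give a feel for what the harness does with that JSON mode, here&rsquo;s a minimal sketch in Ruby (the real harness is Python, and the <code>tokens</code>/<code>total</code> field names below are illustrative stand-ins, not opencode&rsquo;s exact event schema; inspect what your version actually emits):</p>
<div><pre><code class="language-ruby">require "open3"
require "json"

# Minimal unattended run: stream opencode's JSON events and tally tokens.
# The command line is the one quoted above; the "tokens"/"total" fields
# are illustrative stand-ins for whatever your opencode version emits.
cmd = "opencode run --agent build --format json"
total_tokens = 0

Open3.popen2(cmd) do |_stdin, stdout, wait_thr|
  stdout.each_line do |line|
    event = JSON.parse(line) rescue next   # skip non-JSON noise
    next unless event.is_a?(Hash)
    total_tokens += event.dig("tokens", "total").to_i
  end
  puts "exit #{wait_thr.value.exitstatus}, ~#{total_tokens} tokens seen"
end</code></pre></div>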
<p>I picked opencode (and not Claude Code or Codex) for two reasons:</p>
<ol>
<li>Neutrality. Claude Code is optimized for Anthropic models. Codex is optimized for OpenAI models. opencode is agnostic, same interface for all.</li>
<li>Automation. opencode exposes a machine-readable JSON format. Claude Code and Codex don&rsquo;t have an equivalent interface for external benchmarking.</li>
</ol>
<p>Cloud models ran in two phases: phase 1 (build the app) and phase 2 (validate local boot, docker build, docker compose). Local models only ran phase 1.</p>
<p>Worth mentioning: the entire benchmark cost less than $10 in tokens on OpenRouter. Apart from GPT 5.4 Pro which torched $7.20 to fail, the other 11 cloud models added up to about $2.50 total. Local models cost only electricity. The point is: running your own benchmark is cheap. If you want to know whether a model works for your use case, drop the $2 and test it. The harness code is on GitHub, just swap the prompt for your own project.</p>
<h2>Why GPT 5.4 failed the benchmark (but not in real life)<span class="hx:absolute hx:-mt-20" id="why-gpt-54-failed-the-benchmark-but-not-in-real-life"></span>
    <a href="#why-gpt-54-failed-the-benchmark-but-not-in-real-life" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>GPT 5.4 Pro is the only cloud model that consistently failed our benchmark. Two separate runs, same result: the model generated files but never reached <code>finish_reason: stop</code>. It always ended on <code>finish_reason: tool-calls</code> — wanted to keep calling tools but the loop kept breaking.</p>
<p>For folks who don&rsquo;t know: tool calling is when an LLM needs to perform an action (read a file, run a command, edit code) and emits a &ldquo;tool call&rdquo; in a structured format. The client (opencode, Claude Code, Codex) interprets it, executes it, and returns the result back to the model. Each provider has its own format: Anthropic uses <code>tool_use</code> blocks, OpenAI uses <code>function_calling</code> with proprietary JSON schemas, Google uses <code>FunctionCall</code>.</p>
<p>GPT 5.4 is heavily trained for OpenAI&rsquo;s native function calling format — <code>tool_choice</code>, <code>tools</code> with proprietary JSON schemas. When the benchmark routes through opencode → OpenRouter → GPT 5.4, the tool schemas get translated at every hop. If GPT emits tool calls in a format that OpenRouter or opencode doesn&rsquo;t parse correctly, the agent loop breaks.</p>
<p>The evidence: every other cloud model (Claude Opus, Claude Sonnet, Kimi K2.5, DeepSeek V3.2, MiniMax M2.7, GLM 5, Qwen 3.6 Plus, Step 3.5 Flash) ended on <code>finish_reason: stop</code>. Only GPT ends on <code>finish_reason: tool-calls</code>.</p>
<p>A fair comparison for GPT 5.4 would require running it in its native environment. And now we have that comparison: we built support to automate the Codex CLI (<code>codex exec</code> with <code>--dangerously-bypass-approvals-and-sandbox</code> and reasoning effort <code>xhigh</code>) and ran the same benchmark. GPT 5.4 completed in 22 minutes, generated all 9 artifacts, wrote 22 tests with the most sophisticated architecture in the entire benchmark: dependency injection of the RubyLLM client, PORO models for <code>ChatMessage</code> and <code>PromptSubmission</code>, session-backed <code>ChatSession</code> with TTL and message trimming, bin/ci script.</p>
<p>But the code breaks on the second message. GPT 5.4 uses <code>chat.add_message(role:, content:)</code> with keyword arguments instead of a positional hash <code>chat.add_message({role:, content:})</code> — this causes <code>ArgumentError: wrong number of arguments (given 0, expected 1)</code> on the first multi-turn exchange. The first message works (uses <code>chat.ask</code> directly), multi-turn doesn&rsquo;t.</p>
<p>And the cost: <strong>7.6 million tokens</strong> on xHigh reasoning effort. That&rsquo;s 65x more tokens than Opus 4.7 (118K) for the same benchmark. Estimated cost of ~$16 per run, versus ~$1.10 for Opus. Spent 15x more and still got the calling convention wrong. A massive token budget and maximum reasoning effort still can&rsquo;t guarantee factual correctness on a gem API. API knowledge is binary recall in the weights, not a function of how hard the model &ldquo;thinks.&rdquo;</p>
<p>With objective data in hand, GPT 5.4 moves from Tier 1 to Tier 2. The architecture it generates is better than Opus in terms of design patterns. But the code needs a fix for multi-turn to work, and the per-token cost is prohibitive.</p>
<p>Sonnet and Opus through opencode/OpenRouter were probably also not pushed to their limits. Claude Code offers native tool support that opencode doesn&rsquo;t replicate — meaning the benchmark results represent a floor, not a ceiling, for those models.</p>
<h2>Open source models: reality vs the narrative<span class="hx:absolute hx:-mt-20" id="open-source-models-reality-vs-the-narrative"></span>
    <a href="#open-source-models-reality-vs-the-narrative" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>A lot of people are saying open source models have already caught up with the commercial ones and you can run your own &ldquo;Claude&rdquo; at home. In practice, not really.</p>
<p>The scale isn&rsquo;t comparable. Frontier models like Claude Opus 4.6 and GPT 5.4 are closed-source, but estimates put them in the hundreds of billions to trillions of parameters range, trained with compute and data no open source company can replicate. The best models that fit on reasonable hardware are:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th style="text-align: right">Total Parameters</th>
          <th style="text-align: right">Active Parameters</th>
          <th>Architecture</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Qwen 3.5 35B-A3B</td>
          <td style="text-align: right">35B</td>
          <td style="text-align: right">3B</td>
          <td>MoE (A3B)</td>
      </tr>
      <tr>
          <td>Qwen 3.5 27B</td>
          <td style="text-align: right">27B</td>
          <td style="text-align: right">27B</td>
          <td>Dense</td>
      </tr>
      <tr>
          <td>Qwen 3 32B</td>
          <td style="text-align: right">32B</td>
          <td style="text-align: right">32B</td>
          <td>Dense</td>
      </tr>
      <tr>
          <td>Qwen 3.5 122B</td>
          <td style="text-align: right">122B</td>
          <td style="text-align: right">122B</td>
          <td>Dense</td>
      </tr>
      <tr>
          <td>GPT OSS 20B</td>
          <td style="text-align: right">20B</td>
          <td style="text-align: right">20B</td>
          <td>Dense</td>
      </tr>
      <tr>
          <td>Gemma 4 31B</td>
          <td style="text-align: right">31B</td>
          <td style="text-align: right">31B</td>
          <td>Dense</td>
      </tr>
  </tbody>
</table>
<p>Post-publication correction: Qwen 3.5 35B is actually the <strong>35B-A3B</strong>, an MoE with only 3B active parameters per token (not dense, as I&rsquo;d originally written). That&rsquo;s why it runs relatively fast for its size. And for folks with 24 GB of VRAM, the model recommended by <a href="https://unsloth.ai/docs/models/qwen3.5#qwen3.5-27b"target="_blank" rel="noopener">Unsloth</a> themselves is the <strong>Qwen 3.5 27B</strong> dense — that one I didn&rsquo;t get around to testing in the benchmark, but it&rsquo;s worth a look. For anyone wanting to dig deeper into local models, <a href="https://x.com/sudoingX"target="_blank" rel="noopener">@sudoingX</a> has been doing some serious experimentation in this space. Thanks to <a href="https://x.com/thpmacedo/status/2041105305111502927"target="_blank" rel="noopener">@thpmacedo</a> for the heads-up.</p>
<p>Even the largest open source MoE (Mixture of Experts) models that companies make publicly available activate only a small fraction of parameters per token:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th style="text-align: right">Total Parameters</th>
          <th style="text-align: right">Active Parameters</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Kimi K2.5</td>
          <td style="text-align: right">1T</td>
          <td style="text-align: right">32B</td>
          <td>384 experts, top-8 + shared</td>
      </tr>
      <tr>
          <td>GLM 5</td>
          <td style="text-align: right">745B</td>
          <td style="text-align: right">44B</td>
          <td>256 experts, 8 activated</td>
      </tr>
      <tr>
          <td>DeepSeek V3.2</td>
          <td style="text-align: right">671B</td>
          <td style="text-align: right">37B</td>
          <td>Sparse Attention</td>
      </tr>
      <tr>
          <td>Qwen 3.5 397B</td>
          <td style="text-align: right">397B</td>
          <td style="text-align: right">17B</td>
          <td>MoE, cloud-only</td>
      </tr>
  </tbody>
</table>
<p>These large models aren&rsquo;t self-hostable. Kimi K2.5 with 1T parameters needs GPU clusters with hundreds of GBs of VRAM. GLM 5 with 745B is the same. Even if Alibaba or Z.AI release the weights (and some do), nobody has home hardware to run them.</p>
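<p>The weights math alone rules out self-hosting these, before you even budget for KV Cache. Quick sketch:</p>
<div><pre><code class="language-ruby"># VRAM for the weights alone, ignoring KV Cache entirely.
# 1B params at 8 bits = 1 GB, so: params_in_billions * bits / 8.
def weights_gb(params_b, bits) = params_b * bits / 8.0

puts weights_gb(1_000, 4)  # Kimi K2.5 (1T) at Q4  =&gt; 500.0 GB
puts weights_gb(745, 4)    # GLM 5 at Q4           =&gt; 372.5 GB
puts weights_gb(35, 4)     # Qwen 3.5 35B at Q4    =&gt; 17.5 GB, fits a 5090</code></pre></div>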
<p>What fits on your home GPU are the 20B-35B models — and those have real limitations.</p>
<h3>What each local model did in the benchmark<span class="hx:absolute hx:-mt-20" id="what-each-local-model-did-in-the-benchmark"></span>
    <a href="#what-each-local-model-did-in-the-benchmark" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Results from the original run on the AMD Strix Halo:</p>
<p><strong>Qwen 3 Coder Next (30B)</strong> — Completed in 17 minutes on the Strix, generated 1675 files, Rails app with all the artifacts. But only 3 tests. And more importantly: it invented <code>RubyLLM::Client.new</code>, a class that doesn&rsquo;t exist in the gem. The app doesn&rsquo;t run.</p>
<p><strong>Qwen 3.5 35B</strong> — Completed in 28 minutes on the Strix, 1478 files, 11 tests. Used <code>RubyLLM.chat</code> without a model parameter — works only if the default is configured. No LLM mocking in the tests.</p>
<p><strong>Qwen 3.5 122B</strong> — Completed in 43 minutes on the Strix, 1503 files, 16 tests. But it ignored the RubyLLM gem completely and built a custom HTTP client for OpenRouter. The prompt explicitly asked for ruby_llm.</p>
<p><strong>GLM 4.7 Flash (local, Strix)</strong> — Produced 2029 files with all the artifacts, but the session ended mid-tool-call. The cloud version (GLM 5) works perfectly.</p>
<p><strong>Gemma 4 31B (Strix)</strong> — Infinite tool call loop after ~11 productive steps. Known llama.cpp bug.</p>
<p><strong>GPT OSS 20B (Strix)</strong> — Created the Rails app in the wrong directory (<code>project/app/</code> instead of <code>project/</code>). A 20B model doesn&rsquo;t follow workspace instructions reliably.</p>
<p><strong>Qwen 3 32B (Strix)</strong> — Way too slow (7.3 tok/s). The hardware can&rsquo;t keep up.</p>
<p>And the results from the rerun on the NVIDIA RTX 5090 (all with Q3_K_M or Q4_K_M and contexts between 64k and 128k to fit the 32 GB of VRAM):</p>
<p><strong>Qwen 3.5 35B-A3B (5090)</strong> — 5 minutes at 273 tok/s. Recognizable Rails project, entry point <code>RubyLLM.chat(model:)</code> is right, but it hallucinates <code>chat.add_message(role:, content:)</code> and <code>chat.complete</code> instead of <code>.ask</code>. Fixable in 1-2 follow-ups. The best candidate for &ldquo;OSS local that&rsquo;s actually worth trying.&rdquo;</p>
<p><strong>Qwen 3.5 27B Claude-distilled (5090)</strong> — 12 minutes at 129 tok/s. Impeccable Claude style, total API hallucination (<code>RubyLLM::Chat.new.with_model{}</code>, <code>add_message</code>, <code>response.text</code>). More details in the distillation section below.</p>
<p><strong>Qwen 3 Coder 30B (5090)</strong> — 6 minutes at 145 tok/s. Returned a hardcoded mock string instead of calling the API. Tier 3 unusable.</p>
<p><strong>Qwen 2.5 Coder 32B (5090)</strong> — 90 minutes of timeout, zero files. The model spun without ever calling a write tool.</p>
<p><strong>Qwen 3 32B (5090)</strong> — 4 minutes at 69 tok/s, partial scaffold, errors. The general version is better than the Coder one but still breaks.</p>
<p><strong>Gemma 4 31B (5090)</strong> — 8 minutes at 213 tok/s. Same repetition loop it had on the Strix. The llama.cpp bug isn&rsquo;t a hardware issue.</p>
<p><strong>Qwen 3.5 27B Sushi Coder RL (5090)</strong> — Infrastructure failure (<code>ProviderModelNotFoundError</code>), couldn&rsquo;t be evaluated. Redo on a future run.</p>
<p><strong>GPT OSS 20B (5090)</strong> — Pulled from this run because of a recent llama.cpp main regression in the harmony family tool call parser. The logs show <code>Failed to parse input at pos 755: &lt;|channel|&gt;...</code> in multi-turn sessions. It worked on the Strix with llama.cpp <code>b8643</code>, broken on today&rsquo;s main. Waiting on upstream to fix it.</p>
<h2>Cloud models: what actually works<span class="hx:absolute hx:-mt-20" id="cloud-models-what-actually-works"></span>
    <a href="#cloud-models-what-actually-works" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Of the 12 models that completed the benchmark, all of them generated a recognizable Rails project with all the requested artifacts (Gemfile, routes, views, JS, tests, README, Dockerfile, docker-compose). 9 out of 9 on the completeness checklist.</p>
<p>But here comes the question that matters: does the code run?</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/05/llm-benchmark/en/cost-vs-quality.png" alt="Cost vs time — and does the code work?"  loading="lazy" /></p>
<p>The correct RubyLLM API is simple:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ruby" data-lang="ruby"><span class="line"><span class="cl"><span class="n">chat</span> <span class="o">=</span> <span class="no">RubyLLM</span><span class="o">.</span><span class="n">chat</span><span class="p">(</span><span class="ss">model</span><span class="p">:</span> <span class="s2">&#34;anthropic/claude-sonnet-4&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">response</span> <span class="o">=</span> <span class="n">chat</span><span class="o">.</span><span class="n">ask</span><span class="p">(</span><span class="s2">&#34;Hello&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">response</span><span class="o">.</span><span class="n">content</span>  <span class="c1"># =&gt; &#34;Hi there!&#34;</span></span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>8 of the 12 models invented APIs that don&rsquo;t exist. The most common pattern: hallucinating an interface that doesn&rsquo;t match the actual gem:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>What It Invented</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DeepSeek V3.2</td>
          <td><code>RubyLLM::Client.new</code> — nonexistent class</td>
      </tr>
      <tr>
          <td>Qwen 3 Coder Next</td>
          <td><code>RubyLLM::Client.new</code> — same error</td>
      </tr>
      <tr>
          <td>Qwen 3.5 122B</td>
          <td><code>Openrouter::Client</code> — nonexistent gem</td>
      </tr>
      <tr>
          <td>Kimi K2.5</td>
          <td><code>add_message()</code> and <code>complete()</code> — nonexistent methods</td>
      </tr>
      <tr>
          <td>MiniMax M2.7</td>
          <td><code>RubyLLM.chat(messages: [...])</code> — nonexistent signature</td>
      </tr>
      <tr>
          <td>Qwen 3.6 Plus</td>
          <td><code>chat.add_message()</code> — nonexistent method</td>
      </tr>
      <tr>
          <td>Gemini 3.1 Pro</td>
          <td><code>RubyLLM::Chat.new()</code> and <code>add_message()</code> — internal API, not public</td>
      </tr>
      <tr>
          <td>Grok 4.20</td>
          <td>Ignores the gem completely — uses <code>OpenAI::Client</code> (ruby-openai) hitting the OpenRouter URL directly</td>
      </tr>
  </tbody>
</table>
<p>The models that got it right — both Claudes, GLM 5 and GLM 5.1 — used the simple two-step pattern (<code>chat = RubyLLM.chat(model:)</code> then <code>chat.ask(message)</code>). The ones that got it wrong tried to make RubyLLM look like the OpenAI Python SDK, which is a different thing. Grok 4.20 was the most brazen case: it didn&rsquo;t even try to use the gem, it went straight for <code>OpenAI::Client</code> pointing at the OpenRouter URL, ignoring the explicit prompt.</p>
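<p>Side by side, the hallucination is subtle enough to slip through code review, which is exactly what makes it dangerous. The &ldquo;wrong&rdquo; half below is a composite of what Kimi, Qwen and Gemini produced, not any single model&rsquo;s verbatim output:</p>
<div><pre><code class="language-ruby"># The hallucinated shape: RubyLLM reimagined as the OpenAI Python SDK.
chat = RubyLLM::Chat.new                          # internal API, not the public entry point
chat.add_message(role: "user", content: "Hello")  # no such public method
response = chat.complete                          # no such public method

# The real gem API, as used by Opus, Sonnet and GLM:
chat = RubyLLM.chat(model: "anthropic/claude-sonnet-4")
response = chat.ask("Hello")
response.content</code></pre></div>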
<p>And the tests? Only Opus, Sonnet, GLM 5 and GLM 5.1 did proper mocking of the LLM calls. All the others either hit the real API (which fails without a key) or mocked the invented API (tests pass but prove nothing). Test count is a misleading metric: Kimi K2.5 wrote 37 tests, more than anyone else, but none of them test real functionality because the API it uses doesn&rsquo;t exist.</p>
<h3>Real viability table<span class="hx:absolute hx:-mt-20" id="real-viability-table"></span>
    <a href="#real-viability-table" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><table>
  <thead>
      <tr>
          <th>Model</th>
          <th style="text-align: center">Correct API?</th>
          <th style="text-align: center">Runs?</th>
          <th style="text-align: center">Test Mocking?</th>
          <th>Problem</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Claude Opus 4.7</strong></td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center"><strong>Yes</strong></td>
          <td style="text-align: center">Yes (FakeChat)</td>
          <td>Clean implementation, 28 tests</td>
      </tr>
      <tr>
          <td><strong>Claude Sonnet 4.6</strong></td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center"><strong>Yes</strong></td>
          <td style="text-align: center">Yes (mocha)</td>
          <td>Clean implementation</td>
      </tr>
      <tr>
          <td><strong>Claude Opus 4.6</strong></td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center"><strong>Yes</strong></td>
          <td style="text-align: center">Yes (mocha)</td>
          <td>Clean implementation</td>
      </tr>
      <tr>
          <td><strong>GLM 5</strong></td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center"><strong>Yes</strong></td>
          <td style="text-align: center">Yes (mocha)</td>
          <td>Correct API, works</td>
      </tr>
      <tr>
          <td><strong>GLM 5.1</strong></td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center"><strong>Yes</strong></td>
          <td style="text-align: center">Yes</td>
          <td>Correct API, works</td>
      </tr>
      <tr>
          <td>GPT 5.4 (Codex)</td>
          <td style="text-align: center">Partial</td>
          <td style="text-align: center">1st msg only</td>
          <td style="text-align: center">Yes (FakeChat)</td>
          <td><code>add_message(role:, content:)</code> with keyword args instead of positional hash — breaks multi-turn</td>
      </tr>
      <tr>
          <td>Step 3.5 Flash</td>
          <td style="text-align: center">N/A</td>
          <td style="text-align: center"><strong>Yes</strong>*</td>
          <td style="text-align: center">No</td>
          <td>Bypasses RubyLLM, uses HTTP directly</td>
      </tr>
      <tr>
          <td>Grok 4.20</td>
          <td style="text-align: center">N/A</td>
          <td style="text-align: center"><strong>Yes</strong>*</td>
          <td style="text-align: center">No</td>
          <td>Bypasses RubyLLM, uses <code>OpenAI::Client</code> directly</td>
      </tr>
      <tr>
          <td>Qwen 3.6 Plus</td>
          <td style="text-align: center">Partial</td>
          <td style="text-align: center">Only 1st msg</td>
          <td style="text-align: center">No</td>
          <td><code>add_message()</code> doesn&rsquo;t exist</td>
      </tr>
      <tr>
          <td>Qwen 3.5 35B</td>
          <td style="text-align: center">Partial</td>
          <td style="text-align: center">Maybe</td>
          <td style="text-align: center">No</td>
          <td>No model parameter</td>
      </tr>
      <tr>
          <td>Kimi K2.5</td>
          <td style="text-align: center">No</td>
          <td style="text-align: center"><strong>No</strong></td>
          <td style="text-align: center">No</td>
          <td><code>add_message()</code>/<code>complete()</code> invented</td>
      </tr>
      <tr>
          <td>MiniMax M2.7</td>
          <td style="text-align: center">No</td>
          <td style="text-align: center"><strong>No</strong></td>
          <td style="text-align: center">No</td>
          <td><code>RubyLLM.chat</code> signature wrong</td>
      </tr>
      <tr>
          <td>DeepSeek V3.2</td>
          <td style="text-align: center">No</td>
          <td style="text-align: center"><strong>No</strong></td>
          <td style="text-align: center">No</td>
          <td><code>RubyLLM::Client</code> nonexistent</td>
      </tr>
      <tr>
          <td>Qwen 3 Coder Next</td>
          <td style="text-align: center">No</td>
          <td style="text-align: center"><strong>No</strong></td>
          <td style="text-align: center">No</td>
          <td><code>RubyLLM::Client</code> nonexistent</td>
      </tr>
      <tr>
          <td>Gemini 3.1 Pro</td>
          <td style="text-align: center">No</td>
          <td style="text-align: center"><strong>No</strong></td>
          <td style="text-align: center">Wrong mock</td>
          <td><code>RubyLLM::Chat.new()</code> and <code>add_message()</code> don&rsquo;t exist</td>
      </tr>
      <tr>
          <td>Qwen 3.5 122B</td>
          <td style="text-align: center">No</td>
          <td style="text-align: center"><strong>No</strong></td>
          <td style="text-align: center">No</td>
          <td><code>Openrouter::Client</code> gem doesn&rsquo;t exist</td>
      </tr>
  </tbody>
</table>
<p>*Step 3.5 Flash works by calling the OpenRouter REST API directly with <code>Net::HTTP</code>, completely bypassing the gem the prompt asked for.</p>
<p>Now, this doesn&rsquo;t mean those models are useless. If you take Kimi K2.5 or DeepSeek V3.2 and tell it &ldquo;the RubyLLM::Client class doesn&rsquo;t exist, fix it to use the gem&rsquo;s real API&rdquo;, it&rsquo;ll probably fix it. One or two follow-ups and the project becomes functional. Most of the models that failed here could deliver a working project with a few more rounds of conversation.</p>
<p>But that&rsquo;s where the trade-off lives. With Opus or GPT 5.4, the first output already works. You ask, they deliver, you test, it runs. With the cheaper models, you&rsquo;ll spend time fixing API hallucinations, debugging code that &ldquo;looks right&rdquo; but crashes, steering the model in the right direction. Each of those rounds is 10-30 minutes. Three extra rounds and you&rsquo;ve spent an hour of your time to save $0.90 in tokens.</p>
<p>You save dollars, you spend time. And time is money. For someone learning or exploring without urgency, that trade can make sense. For someone who needs to ship, the frontier models pay for themselves fast.</p>
<h3>Comparing the models that work<span class="hx:absolute hx:-mt-20" id="comparing-the-models-that-work"></span>
    <a href="#comparing-the-models-that-work" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Provider</th>
          <th style="text-align: right">Time</th>
          <th style="text-align: right">Tests</th>
          <th style="text-align: right">Cost/Run</th>
          <th>vs Opus</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Claude Opus 4.7</td>
          <td>OpenRouter</td>
          <td style="text-align: right">18m</td>
          <td style="text-align: right">28</td>
          <td style="text-align: right">~$1.10</td>
          <td>New baseline</td>
      </tr>
      <tr>
          <td>Claude Sonnet 4.6</td>
          <td>OpenRouter</td>
          <td style="text-align: right">16m</td>
          <td style="text-align: right">30</td>
          <td style="text-align: right">~$0.63</td>
          <td>40% cheaper, more tests</td>
      </tr>
      <tr>
          <td>Claude Opus 4.6</td>
          <td>OpenRouter</td>
          <td style="text-align: right">16m</td>
          <td style="text-align: right">16</td>
          <td style="text-align: right">~$1.05</td>
          <td>Previous baseline</td>
      </tr>
      <tr>
          <td>GLM 5</td>
          <td>OpenRouter</td>
          <td style="text-align: right">17m</td>
          <td style="text-align: right">7</td>
          <td style="text-align: right">~$0.11</td>
          <td>89% cheaper</td>
      </tr>
      <tr>
          <td>GLM 5.1</td>
          <td>Z.AI direct</td>
          <td style="text-align: right">22m</td>
          <td style="text-align: right">24</td>
          <td style="text-align: right">~$0.13</td>
          <td>~88% cheaper, more tests than GLM 5</td>
      </tr>
  </tbody>
</table>
<h3>Full ranking by time and tokens<span class="hx:absolute hx:-mt-20" id="full-ranking-by-time-and-tokens"></span>
    <a href="#full-ranking-by-time-and-tokens" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/05/llm-benchmark/en/time-to-complete.png" alt="Time to complete by model"  loading="lazy" /></p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Provider</th>
          <th style="text-align: right">Time</th>
          <th style="text-align: right">Total Tokens</th>
          <th style="text-align: right">Tok/s</th>
          <th style="text-align: right">Cost/Run</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Grok 4.20</td>
          <td>OpenRouter</td>
          <td style="text-align: right">8m</td>
          <td style="text-align: right">63,457</td>
          <td style="text-align: right">412.54</td>
          <td style="text-align: right">~$0.04</td>
      </tr>
      <tr>
          <td>Gemini 3.1 Pro</td>
          <td>OpenRouter</td>
          <td style="text-align: right">14m</td>
          <td style="text-align: right">104,034</td>
          <td style="text-align: right">128.28</td>
          <td style="text-align: right">~$0.50</td>
      </tr>
      <tr>
          <td>MiniMax M2.7</td>
          <td>OpenRouter</td>
          <td style="text-align: right">14m</td>
          <td style="text-align: right">79,743</td>
          <td style="text-align: right">574.52</td>
          <td style="text-align: right">~$0.05</td>
      </tr>
      <tr>
          <td>Claude Opus 4.6</td>
          <td>OpenRouter</td>
          <td style="text-align: right">16m</td>
          <td style="text-align: right">136,806</td>
          <td style="text-align: right">347.18</td>
          <td style="text-align: right">~$1.05</td>
      </tr>
      <tr>
          <td>Claude Sonnet 4.6</td>
          <td>OpenRouter</td>
          <td style="text-align: right">16m</td>
          <td style="text-align: right">127,067</td>
          <td style="text-align: right">532.26</td>
          <td style="text-align: right">~$0.63</td>
      </tr>
      <tr>
          <td>GLM 5</td>
          <td>OpenRouter</td>
          <td style="text-align: right">17m</td>
          <td style="text-align: right">59,378</td>
          <td style="text-align: right">400.01</td>
          <td style="text-align: right">~$0.11</td>
      </tr>
      <tr>
          <td>Qwen 3.6 Plus</td>
          <td>OpenRouter</td>
          <td style="text-align: right">17m</td>
          <td style="text-align: right">88,940</td>
          <td style="text-align: right">182.91</td>
          <td style="text-align: right">Free</td>
      </tr>
      <tr>
          <td>Claude Opus 4.7</td>
          <td>OpenRouter</td>
          <td style="text-align: right">18m</td>
          <td style="text-align: right">118,216</td>
          <td style="text-align: right">328.24</td>
          <td style="text-align: right">~$1.10</td>
      </tr>
      <tr>
          <td>GLM 5.1</td>
          <td>Z.AI direct</td>
          <td style="text-align: right">22m</td>
          <td style="text-align: right">81,666</td>
          <td style="text-align: right">166.62</td>
          <td style="text-align: right">~$0.13</td>
      </tr>
      <tr>
          <td>GPT 5.4 (Codex)</td>
          <td>Codex CLI</td>
          <td style="text-align: right">22m</td>
          <td style="text-align: right">7,643,800</td>
          <td style="text-align: right">5,824.56</td>
          <td style="text-align: right">~$16.00</td>
      </tr>
      <tr>
          <td>Qwen 3 Coder Next</td>
          <td>Local</td>
          <td style="text-align: right">17m</td>
          <td style="text-align: right">39,054</td>
          <td style="text-align: right">37.49</td>
          <td style="text-align: right">Electricity</td>
      </tr>
      <tr>
          <td>Qwen 3.5 35B</td>
          <td>Local</td>
          <td style="text-align: right">28m</td>
          <td style="text-align: right">76,919</td>
          <td style="text-align: right">46.03</td>
          <td style="text-align: right">Electricity</td>
      </tr>
      <tr>
          <td>Kimi K2.5</td>
          <td>OpenRouter</td>
          <td style="text-align: right">29m</td>
          <td style="text-align: right">63,638</td>
          <td style="text-align: right">160.14</td>
          <td style="text-align: right">~$0.07</td>
      </tr>
      <tr>
          <td>Step 3.5 Flash</td>
          <td>OpenRouter</td>
          <td style="text-align: right">38m</td>
          <td style="text-align: right">156,267</td>
          <td style="text-align: right">242.11</td>
          <td style="text-align: right">~$0.02</td>
      </tr>
      <tr>
          <td>Qwen 3.5 122B</td>
          <td>Local</td>
          <td style="text-align: right">43m</td>
          <td style="text-align: right">57,472</td>
          <td style="text-align: right">22.41</td>
          <td style="text-align: right">Electricity</td>
      </tr>
      <tr>
          <td>DeepSeek V3.2</td>
          <td>OpenRouter</td>
          <td style="text-align: right">60m</td>
          <td style="text-align: right">115,278</td>
          <td style="text-align: right">53.37</td>
          <td style="text-align: right">~$0.04</td>
      </tr>
  </tbody>
</table>
<p>DeepSeek V3.2 is the slowest despite being cloud — it has no prompt caching, so it resends the full context on every turn.</p>
<h3>Token efficiency and cache<span class="hx:absolute hx:-mt-20" id="token-efficiency-and-cache"></span>
    <a href="#token-efficiency-and-cache" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Models with prompt caching pay much less in effective tokens:</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/05/llm-benchmark/en/token-efficiency.png" alt="Token efficiency: cache vs new"  loading="lazy" /></p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th style="text-align: right">Total Tokens</th>
          <th style="text-align: right">Cache Read</th>
          <th style="text-align: right">Effective New Tokens</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Claude Sonnet 4.6</td>
          <td style="text-align: right">127,067</td>
          <td style="text-align: right">126,429</td>
          <td style="text-align: right">638</td>
      </tr>
      <tr>
          <td>Claude Opus 4.7</td>
          <td style="text-align: right">118,216</td>
          <td style="text-align: right">116,824</td>
          <td style="text-align: right">1,392</td>
      </tr>
      <tr>
          <td>Claude Opus 4.6</td>
          <td style="text-align: right">136,806</td>
          <td style="text-align: right">135,976</td>
          <td style="text-align: right">830</td>
      </tr>
      <tr>
          <td>GLM 5</td>
          <td style="text-align: right">59,378</td>
          <td style="text-align: right">58,240</td>
          <td style="text-align: right">1,138</td>
      </tr>
      <tr>
          <td>GLM 5.1</td>
          <td style="text-align: right">81,666</td>
          <td style="text-align: right">81,216</td>
          <td style="text-align: right">450</td>
      </tr>
      <tr>
          <td>Grok 4.20</td>
          <td style="text-align: right">63,457</td>
          <td style="text-align: right">62,400</td>
          <td style="text-align: right">1,057</td>
      </tr>
      <tr>
          <td>Gemini 3.1 Pro</td>
          <td style="text-align: right">104,034</td>
          <td style="text-align: right">98,129</td>
          <td style="text-align: right">5,905</td>
      </tr>
      <tr>
          <td>GPT 5.4 (Codex)</td>
          <td style="text-align: right">7,643,800</td>
          <td style="text-align: right">0</td>
          <td style="text-align: right">7,643,800</td>
      </tr>
      <tr>
          <td>DeepSeek V3.2</td>
          <td style="text-align: right">115,278</td>
          <td style="text-align: right">0</td>
          <td style="text-align: right">115,278</td>
      </tr>
      <tr>
          <td>Kimi K2.5</td>
          <td style="text-align: right">63,638</td>
          <td style="text-align: right">0</td>
          <td style="text-align: right">63,638</td>
      </tr>
  </tbody>
</table>
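<p>What those &ldquo;effective new tokens&rdquo; mean in dollars depends on the provider&rsquo;s cache-read discount. A sketch using the Opus 4.7 run above, assuming cache reads bill at 10% of the fresh-input rate (roughly Anthropic&rsquo;s published discount) and a stand-in price per million tokens; check current pricing before trusting the absolute numbers:</p>
<div><pre><code class="language-ruby"># Effective input cost when most of the context is a cache hit.
# ASSUMPTIONS: cache reads at 10% of the fresh-input rate (roughly
# Anthropic's discount) and a stand-in price; check current pricing.
def input_cost(new_tokens, cached_tokens, usd_per_mtok, discount: 0.10)
  (new_tokens + cached_tokens * discount) * usd_per_mtok / 1_000_000.0
end

price = 15.0  # stand-in $/M input tokens
puts input_cost(1_392, 116_824, price).round(2)  # Opus 4.7 run          =&gt; ~$0.20
puts input_cost(118_216, 0, price).round(2)      # same tokens, no cache =&gt; ~$1.77</code></pre></div>
<p>That&rsquo;s the gap DeepSeek pays on every turn: with zero cache reads, all 115K tokens bill (and stream) as fresh input.</p>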
<h2>Speed: the chasm between cloud and local<span class="hx:absolute hx:-mt-20" id="speed-the-chasm-between-cloud-and-local"></span>
    <a href="#speed-the-chasm-between-cloud-and-local" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>There&rsquo;s an aspect that the cost tables hide: inference speed. And the difference is brutal.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/05/llm-benchmark/en/speed-comparison.png" alt="Inference speed by model"  loading="lazy" /></p>
<p>Claude Sonnet generates 532 tok/s. Qwen 3.5 122B running locally on my Minisforum (AMD Strix Halo) generates 22 tok/s. That&rsquo;s a 24x difference. In practice, what Sonnet does in 16 minutes, Qwen 3.5 122B takes 43 minutes. Qwen 3 Coder Next at 37 tok/s is the fastest of the local models on the Strix and even so it&rsquo;s 14x slower than Sonnet.</p>
<p>And it&rsquo;s not just clock time. When you&rsquo;re in an interactive coding loop — ask for a change, wait for output, test, ask for another — the model&rsquo;s speed sets your rhythm. At 37 tok/s, every long response makes you wait 30-60 seconds. At 530 tok/s, it appears almost instantly. Over a day, you feel it.</p>
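<p>The arithmetic behind that feeling, with a 1,500-token response as a stand-in for a &ldquo;long&rdquo; reply:</p>
<div><pre><code class="language-ruby"># Wall-clock wait for one response at different generation speeds.
tokens = 1_500  # stand-in for a long agent reply

{ "Qwen 3 Coder Next (Strix)" =&gt; 37.0, "Claude Sonnet 4.6" =&gt; 532.0 }.each do |name, tok_s|
  puts format("%-27s %5.1f s", name, tokens / tok_s)
end
# Qwen 3 Coder Next (Strix)    40.5 s
# Claude Sonnet 4.6             2.8 s</code></pre></div>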
<p>DeepSeek V3.2 is a curious case: it&rsquo;s cloud but it runs at 53 tok/s, slower than the locally-running Qwen 3.5 35B on the Strix (46 tok/s). The reason is that DeepSeek has no prompt caching — it resends the full context on every turn, strangling throughput. Paying for a cloud model that&rsquo;s slower than running it locally doesn&rsquo;t make any sense.</p>
<p>Local models are free in tokens, but they pay in time. On the AMD Strix, that math was a non-starter for every Qwen I tested: two minutes waiting for a long response, multiplied by 50 turns, eats your whole afternoon. But that changes when the hardware changes, and that&rsquo;s why I redid the local part of the benchmark on a different machine.</p>
<h2>AMD Strix Halo vs NVIDIA RTX 5090: what changes when the memory bandwidth doubles<span class="hx:absolute hx:-mt-20" id="amd-strix-halo-vs-nvidia-rtx-5090-what-changes-when-the-memory-bandwidth-doubles"></span>
    <a href="#amd-strix-halo-vs-nvidia-rtx-5090-what-changes-when-the-memory-bandwidth-doubles" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>To check whether the bottleneck was hardware or model, I took the same Qwen models and reran the benchmark on a workstation with an NVIDIA RTX 5090 (Blackwell, 32 GB GDDR7, 1,792 GB/s bandwidth). The numbers shift in a way that&rsquo;s worth looking at carefully.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th style="text-align: right">AMD Strix (LPDDR5x)</th>
          <th style="text-align: right">NVIDIA 5090 (GDDR7)</th>
          <th style="text-align: right">Speedup</th>
          <th style="text-align: right">Total time on 5090</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Qwen 3 32B (dense)</td>
          <td style="text-align: right">7 tok/s</td>
          <td style="text-align: right">69 tok/s</td>
          <td style="text-align: right">~10x</td>
          <td style="text-align: right">4 min</td>
      </tr>
      <tr>
          <td>Qwen 3 Coder 30B (Coder)</td>
          <td style="text-align: right">37 tok/s</td>
          <td style="text-align: right">145 tok/s</td>
          <td style="text-align: right">~4x</td>
          <td style="text-align: right">6 min</td>
      </tr>
      <tr>
          <td>Qwen 3.5 35B-A3B (MoE)</td>
          <td style="text-align: right">46 tok/s</td>
          <td style="text-align: right"><strong>273 tok/s</strong></td>
          <td style="text-align: right">~6x</td>
          <td style="text-align: right">5 min</td>
      </tr>
      <tr>
          <td>Qwen 3.5 27B Claude (distilled)</td>
          <td style="text-align: right">timeout 90m</td>
          <td style="text-align: right">129 tok/s</td>
          <td style="text-align: right">n/a</td>
          <td style="text-align: right">12 min</td>
      </tr>
      <tr>
          <td>Gemma 4 31B</td>
          <td style="text-align: right">(didn&rsquo;t test on Strix)</td>
          <td style="text-align: right">213 tok/s</td>
          <td style="text-align: right">n/a</td>
          <td style="text-align: right">8 min</td>
      </tr>
      <tr>
          <td>Qwen 2.5 Coder 32B</td>
          <td style="text-align: right">(didn&rsquo;t test on Strix)</td>
          <td style="text-align: right">2.86 tok/s</td>
          <td style="text-align: right">n/a</td>
          <td style="text-align: right">timeout 90m</td>
      </tr>
  </tbody>
</table>
<p>To put those speeds in context, remember that in the cloud Sonnet runs at 532 tok/s, Opus at 347 tok/s, Step 3.5 Flash at 242 tok/s, Gemini 3.1 Pro at 128 tok/s and Kimi K2.5 at 160 tok/s. Qwen 3.5 35B-A3B on the 5090, at 273 tok/s, is in the same neighborhood as Step 3.5 Flash, faster than Gemini, Kimi and GLM 5.1. Qwen 3 Coder 30B at 145 tok/s is in Gemini territory. The classic line &ldquo;local models are ten times slower than cloud&rdquo; stopped being true the moment the 5090 entered the conversation.</p>
<p>The practical consequence is that the &ldquo;time is money&rdquo; argument shifts. On the Strix, &ldquo;waiting an hour for a Qwen 3.5 122B to do what Sonnet does in 16 minutes&rdquo; is straight-up loss. On the 5090, waiting 5 minutes for Qwen 3.5 35B-A3B to do the work, plus 10-15 minutes for you to do 1-2 correction prompts, gives you a total in the 15-20 minute range. Sonnet does it in 16 minutes with zero corrections. The difference shrank enough that, if cost matters a lot, the trade is worth making.</p>
<p>The catch: for this to be worth it, the model has to be close enough to the right answer that 1-2 correction prompts can fix it. When the error is &ldquo;the model decided not to use the gem I asked for and returned a hardcoded mock string,&rdquo; like Qwen 3 Coder 30B did, no easy correction prompt fixes that. That&rsquo;s a redo.</p>
<h3>Before you spend money on hardware thinking it&rsquo;s the answer<span class="hx:absolute hx:-mt-20" id="before-you-spend-money-on-hardware-thinking-its-the-answer"></span>
    <a href="#before-you-spend-money-on-hardware-thinking-its-the-answer" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>I&rsquo;ve got to give a warning here, because it&rsquo;s the most common buying mistake I see right now. Every other week somebody tells me they&rsquo;re going to grab a Ryzen AI Max because it has 128 GB of unified memory and that &ldquo;lets you run huge models.&rdquo; Technically, sure — the model fits. In practice, it&rsquo;s almost unusable. The memory is LPDDR5x at 256 GB/s, seven times slower than the 5090&rsquo;s GDDR7. What fits doesn&rsquo;t run at human speed. My own Strix with Qwen 3.5 122B hit 22 tok/s and the run took 43 minutes. To do anything serious day to day, that&rsquo;s not workable.</p>
<p>The 5090 is clearly superior, and it starts to make sense even for smaller models precisely because of the memory bandwidth. A Mac Studio with high-speed unified memory (up to 800 GB/s on the M4 Ultra) is the other viable option, and costs about as much for comparable memory and bandwidth. But neither of those comes anywhere close to beating the commercial models on quality — and the per-token price of Claude, GPT or GLM, combined with their brutal inference speed, makes the math hard to justify for anyone who isn&rsquo;t an enthusiast or a researcher. Expensive local AI hardware is a weekend hobby, a tool for people who need to run offline for compliance reasons, or a research playground. For day-to-day production work, right now, cloud is still the rational choice. A 128 GB Ryzen AI Max may look tempting on the spec sheet, but if the goal is serious coding agent work, it&rsquo;s money badly spent.</p>
<h2>The Qwen family: Coder vs General, distillation, and why nothing is a silver bullet<span class="hx:absolute hx:-mt-20" id="the-qwen-family-coder-vs-general-distillation-and-why-nothing-is-a-silver-bullet"></span>
    <a href="#the-qwen-family-coder-vs-general-distillation-and-why-nothing-is-a-silver-bullet" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>With so many different Qwens running in this rerun, it&rsquo;s worth doing a more focused analysis. What I learned might surprise people who follow model benchmarks on Twitter.</p>
<h3>Before getting to the results: what quantization is and what distillation is<span class="hx:absolute hx:-mt-20" id="before-getting-to-the-results-what-quantization-is-and-what-distillation-is"></span>
    <a href="#before-getting-to-the-results-what-quantization-is-and-what-distillation-is" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>These two concepts come up constantly in this discussion and they deserve a quick explanation.</p>
<p><strong>Quantization</strong> is the technique of compressing the model&rsquo;s weights so they take up less memory. A model trained in FP16 (16 bits per weight) can be quantized to Q8 (8 bits), Q4 (4 bits), Q3_K_M (a 3-bit k-quant whose &ldquo;M&rdquo; marks the medium variant, keeping the most sensitive tensors at higher precision), and so on. Each step roughly halves the size of the model on disk and in VRAM, at the cost of some loss of precision. Q8 is practically lossless. Q4 already loses something measurable. Q3 loses more. Q2 is the line where the model starts saying real nonsense. The rule of thumb is that for coding and multi-step reasoning, you want to stay at Q4 or higher. Q3_K_M is the minimum that still works for many models, and it&rsquo;s what fits a 27B on the 5090 with 128k context.</p>
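<p>The size math is simple enough to sketch. The effective bits-per-weight for Q3_K_M below is an approximation (k-quants mix precisions, so the average lands a bit above 3):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ruby" data-lang="ruby"># Approximate weight size at different quantization levels:
#   bytes = parameters * bits_per_weight / 8.
# Ignores embeddings, quantization scales and the KV cache.
PARAMS = 27e9 # a 27B model, like the Claude-distilled Qwen

{ "FP16" =&gt; 16.0, "Q8" =&gt; 8.0, "Q4" =&gt; 4.0, "Q3_K_M" =&gt; 3.5 }.each do |quant, bpw|
  puts format("%-6s ~%5.1f GB", quant, PARAMS * bpw / 8 / 1e9)
end
# FP16 ~54 GB, Q8 ~27 GB, Q4 ~13.5 GB, Q3_K_M ~11.8 GB,
# which matches the ~27 GB (Q8) and ~12 GB (Q3_K_M) runs in the text.
</code></pre></div>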
<p>The surprise from my test, and look, this goes against the consensus, is that quantization wasn&rsquo;t the bottleneck here. I ran the Qwen 3.5 27B Claude-distilled in two versions: Q8 on the AMD Strix (~27 GB of weights) and Q3_K_M on the 5090 (~12 GB of weights). Both hallucinated exactly the same fake RubyLLM APIs. Q3_K_M even produced a cleaner Gemfile. The model&rsquo;s limitation was in what those weights know, not in the precision they were compressed to.</p>
<p><strong>Distillation</strong> is the technique of training a smaller model (the &ldquo;student&rdquo;) to imitate the output or behavior of a larger model (the &ldquo;teacher&rdquo;). The classic version is logit distillation — the student learns to approximate the teacher&rsquo;s probability distributions. The modern, more popular version for coding agents is distillation of <strong>reasoning traces</strong>: you take chain-of-thought from the big model on real problems and train the smaller one to reproduce the same reasoning style.</p>
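<p>To make &ldquo;reasoning-trace distillation&rdquo; concrete, a training record looks roughly like the sketch below. Everything here is illustrative, not the schema of any real pipeline:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ruby" data-lang="ruby"># A hypothetical reasoning-trace distillation record. The teacher's
# chain-of-thought plus final answer become the supervised target the
# student is fine-tuned to reproduce given the prompt.
record = {
  prompt: "Write a Rails service that calls an LLM and returns the reply.",
  teacher_reasoning: "The user wants a thin service object. Keep the " \
                     "controller skinny, isolate the API call in one " \
                     "class, return a value object...",
  teacher_answer: "class LlmChatService\n  # ...\nend\n"
}

# Note what the trace carries: structure and style decisions. It almost
# never carries "method X exists on gem Y", which is exactly the factual
# layer the distilled Qwen failed to pick up.
</code></pre></div>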
<p>The hype of the moment is distilling Claude and GPT into open source models. The promise is that you can have &ldquo;Claude-at-home&rdquo; running locally. I wanted to test this, and that&rsquo;s why I added <a href="https://huggingface.co/Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled"target="_blank" rel="noopener">Jackrong&rsquo;s Qwen 3.5 27B distilled from Claude 4.6 Opus</a> to the benchmark. If any open source model was going to use RubyLLM correctly, this was the bet — after all, in the entire benchmark, Claude and GLM 5 are the only ones that get the API right.</p>
<h3>What the Claude-distilled learned (and what it didn&rsquo;t)<span class="hx:absolute hx:-mt-20" id="what-the-claude-distilled-learned-and-what-it-didnt"></span>
    <a href="#what-the-claude-distilled-learned-and-what-it-didnt" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>I ran the same distillation twice: once at Q8 on the AMD Strix (which blew through the 90-minute timeout), and once at Q3_K_M on the 5090 (completed in 12 minutes). Both produced the same elegant frustration.</p>
<p>The code that comes out looks like Claude. It has <code># frozen_string_literal: true</code> at the top of every file. It has a separate <code>Response</code> class as a value object with explicit attribute readers. It has a clear separation between service, controller and model. It has doc comments at the top of every file. It correctly comments out things like <code>active_record</code>, <code>active_job</code> and <code>action_mailer</code> in <code>application.rb</code>. It has defensive <code>case</code> statements trying multiple return formats. Stylistically, it&rsquo;s Claude.</p>
<p>Functionally, it&rsquo;s a complete RubyLLM hallucination. Look at the service generated by the 5090 run:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ruby" data-lang="ruby"><span class="line"><span class="cl"><span class="no">RubyLLM</span><span class="o">::</span><span class="no">Chat</span><span class="o">.</span><span class="n">new</span><span class="o">.</span><span class="n">with_model</span><span class="p">(</span><span class="vi">@model</span><span class="p">)</span> <span class="k">do</span> <span class="o">|</span><span class="n">chat</span><span class="o">|</span>
</span></span><span class="line"><span class="cl">  <span class="n">conversation_history</span><span class="o">.</span><span class="n">each</span> <span class="k">do</span> <span class="o">|</span><span class="n">msg</span><span class="o">|</span>
</span></span><span class="line"><span class="cl">    <span class="n">chat</span><span class="o">.</span><span class="n">add_message</span><span class="p">(</span><span class="ss">role</span><span class="p">:</span> <span class="ss">:user</span><span class="p">,</span> <span class="ss">content</span><span class="p">:</span> <span class="n">msg</span><span class="o">[</span><span class="ss">:content</span><span class="o">]</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="k">end</span>
</span></span><span class="line"><span class="cl">  <span class="n">response</span> <span class="o">=</span> <span class="n">chat</span><span class="o">.</span><span class="n">ask</span><span class="p">(</span><span class="n">message</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="no">Response</span><span class="o">.</span><span class="n">new</span><span class="p">(</span><span class="ss">content</span><span class="p">:</span> <span class="n">response</span><span class="o">.</span><span class="n">text</span><span class="p">,</span> <span class="ss">usage</span><span class="p">:</span> <span class="n">build_usage</span><span class="p">(</span><span class="n">response</span><span class="p">))</span>
</span></span><span class="line"><span class="cl"><span class="k">end</span></span></span></code></pre></div></div>
</div>
<p>Every primitive in this code is invented:</p>
<ul>
<li><code>RubyLLM::Chat.new</code> — the constructor isn&rsquo;t public, the correct entry is <code>RubyLLM.chat(model:)</code></li>
<li><code>.with_model(@model) do |chat| ... end</code> — there&rsquo;s no block API like that</li>
<li><code>chat.add_message(role:, content:)</code> — doesn&rsquo;t exist</li>
<li><code>response.text</code> — the real API exposes <code>response.content</code></li>
<li><code>response.usage.prompt_tokens</code> — the object doesn&rsquo;t have that shape</li>
</ul>
<p>This will blow up with a <code>NoMethodError</code> on the first request. The initializer also tries <code>config.openrouter_api_base=</code> which doesn&rsquo;t exist on <code>RubyLLM.configure</code>, so the app probably won&rsquo;t even boot.</p>
<p>The Q8 version on the AMD Strix does the exact same thing, with one difference: the entry call is <code>RubyLLM.chat(model:, provider: :openrouter)</code> — the entry point is right, but <code>provider:</code> is invented and it&rsquo;s immediately followed by the same fake <code>chat.add_message(role:, content:)</code>. Worse, the Gemfile from the 90-minute run lists <code>gem &quot;ruby-openai&quot;</code> (wrong gem!), <code>gem &quot;minitest&quot;, &quot;~&gt; 6.0&quot;</code> (minitest 6.0 doesn&rsquo;t exist) and <code>gem &quot;tailwindcss&quot;</code> (wrong gem name, it&rsquo;s <code>tailwindcss-rails</code>). The Gemfile doesn&rsquo;t include the gem the service code itself is trying to use.</p>
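<p>For reference, a minimal corrected Gemfile for that app, sketched with the gem names the prompt actually calls for (version pins left out on purpose):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ruby" data-lang="ruby"># Minimal corrected Gemfile sketch for the distilled model's app.
source "https://rubygems.org"

gem "rails"
gem "ruby_llm"          # not "ruby-openai": the prompt asked for RubyLLM
gem "tailwindcss-rails" # the real gem name, not "tailwindcss"

group :test do
  gem "minitest"        # unpinned: the "~&gt; 6.0" it generated doesn't exist
end
</code></pre></div>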
<p>For comparison, look at the actual Claude Opus 4.6 baseline, in the same benchmark, getting it all right:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ruby" data-lang="ruby"><span class="line"><span class="cl"><span class="vi">@chat</span> <span class="o">=</span> <span class="no">RubyLLM</span><span class="o">.</span><span class="n">chat</span><span class="p">(</span><span class="ss">model</span><span class="p">:</span> <span class="n">model_id</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">response</span> <span class="o">=</span> <span class="vi">@chat</span><span class="o">.</span><span class="n">ask</span><span class="p">(</span><span class="n">message</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">response</span><span class="o">.</span><span class="n">content</span></span></span></code></pre></div></div>
</div>
<p>Twelve lines in the entire service. Zero hallucination. Includes streaming via block. The distilled model produced three times the code volume and got the API wrong.</p>
<p>The honest reading is that distillation transferred one layer and stopped. The layer that came along was the style: code organization, comments, class structure, the order of things. The layer that got left behind was factual memory about specific libraries. That makes sense when you think about it: Claude&rsquo;s reasoning traces, even when written carefully, rarely contain repeated references to <code>chat.ask(msg).content</code> in some obscure Ruby gem. The student only learns what the teacher repeats, and Claude never had any reason to keep whispering &ldquo;use ask, not complete&rdquo; throughout its chains of thought. Library API knowledge is binary recall memory, the kind that&rsquo;s either in the weights or it isn&rsquo;t. Decomposing that into reasoning steps is impossible because it isn&rsquo;t reasoning, it&rsquo;s just raw memorization.</p>
<p>To wrap up the practical recommendation: if you need the model to actually use RubyLLM, or any less-popular library for that matter, Claude distillation won&rsquo;t save you. Use real Claude or GLM 5. The &ldquo;Claude-stand-ins&rdquo; in open source will fail the same way the Qwen base would, just with prettier handwriting.</p>
<h3>Coder vs General: the surprise of the &ldquo;for coding&rdquo; models<span class="hx:absolute hx:-mt-20" id="coder-vs-general-the-surprise-of-the-for-coding-models"></span>
    <a href="#coder-vs-general-the-surprise-of-the-for-coding-models" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Almost everyone&rsquo;s instinct is that models with &ldquo;Coder&rdquo; in the name are the best for programming. Makes sense, they were specifically fine-tuned on code. But in the benchmark, it was exactly the opposite.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Type</th>
          <th>Hardware</th>
          <th style="text-align: right">Time</th>
          <th>Result</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Qwen 3.5 35B-A3B</td>
          <td>General (MoE)</td>
          <td>5090</td>
          <td style="text-align: right">5 min</td>
          <td>Runs Rails, hallucinates <code>add_message</code>/<code>complete</code> (1-2 follow-ups fix it)</td>
      </tr>
      <tr>
          <td>Qwen 3 Coder 30B</td>
          <td>Coder</td>
          <td>5090</td>
          <td style="text-align: right">6 min</td>
          <td>Returned a hardcoded mock string instead of calling RubyLLM</td>
      </tr>
      <tr>
          <td>Qwen 2.5 Coder 32B</td>
          <td>Coder</td>
          <td>5090</td>
          <td style="text-align: right">timeout 90m</td>
          <td>Zero files, model froze</td>
      </tr>
      <tr>
          <td>Qwen 3 32B</td>
          <td>General</td>
          <td>5090</td>
          <td style="text-align: right">4 min</td>
          <td>Partial scaffold, errors</td>
      </tr>
      <tr>
          <td>Qwen 3.5 27B Claude-distilled</td>
          <td>General + distilled</td>
          <td>5090</td>
          <td style="text-align: right">12 min</td>
          <td>Runs Rails, hallucinates the entire API</td>
      </tr>
      <tr>
          <td>Qwen 3.5 27B Sushi Coder RL</td>
          <td>Coder (RL)</td>
          <td>5090</td>
          <td style="text-align: right">6 min</td>
          <td>Infrastructure failure, couldn&rsquo;t be tested</td>
      </tr>
  </tbody>
</table>
<p>Of the three dedicated Coders, two failed catastrophically (full timeout and hardcoded mock string) and one didn&rsquo;t even run properly because of an infra bug. Meanwhile, the Qwen 3.5 35B-A3B, which is the general model in the line (not the Coder), came closest to something usable: 5 minutes of execution, recognizable Rails project, and the problem is fixable in 1-2 prompts.</p>
<p>Qwen 3 Coder 30B is particularly disappointing. It didn&rsquo;t even get far enough to misuse the API: it never really tried to use it at all. The controller it generated has this:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ruby" data-lang="ruby"><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">Api</span><span class="o">::</span><span class="no">V1</span><span class="o">::</span><span class="no">MessagesController</span> <span class="o">&lt;</span> <span class="no">ApplicationController</span>
</span></span><span class="line"><span class="cl">  <span class="k">def</span> <span class="nf">create</span>
</span></span><span class="line"><span class="cl">    <span class="n">render</span> <span class="ss">json</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="ss">response</span><span class="p">:</span> <span class="s2">&#34;This is a mock response. In a real implementation, this would connect to RubyLLM with Claude Sonnet via OpenRouter.&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">  <span class="k">end</span>
</span></span><span class="line"><span class="cl"><span class="k">end</span></span></span></code></pre></div></div>
</div>
<p>The Gemfile lists <code>gem &quot;ruby_llm&quot;</code> but nothing imports it. The service layer is nonexistent. The model decided it was easier to return a fake string and call it a day. That&rsquo;s Tier 3 garbage in a way no correction prompt fixes — you have to tell it to start over.</p>
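<p>For contrast, what was actually being asked for fits in a few lines. A sketch: the class shape and model string are placeholders of mine, while the RubyLLM calls (<code>RubyLLM.chat(model:)</code>, <code>chat.ask</code>, <code>response.content</code>) are the ones this benchmark verified as correct:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ruby" data-lang="ruby"># Sketch of the minimum honest implementation: actually call RubyLLM
# instead of returning a canned string. The model string is a placeholder.
class Api::V1::MessagesController &lt; ApplicationController
  def create
    chat = RubyLLM.chat(model: "anthropic/claude-sonnet-4.6")
    response = chat.ask(params.require(:message))
    render json: { response: response.content }
  rescue StandardError =&gt; e
    render json: { error: e.message }, status: :bad_gateway
  end
end
</code></pre></div>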
<p>Qwen 2.5 Coder 32B is even worse: 90 minutes running, zero files. The 1.8 MB <code>opencode-output.ndjson</code> shows the model spinning without managing to write anything. It probably got stuck in a planning loop without ever calling the write tools. Total slot waste.</p>
<p>Why did the &ldquo;Coder&rdquo; Qwens do so badly? My read is that their coding-specific fine-tuning leans on isolated problems (Codeforces, Leetcode, short snippets), far from agentic flows with long-running tool calling. The general Qwen 3.5 35B-A3B has broader training and handles the orchestration part better. The popular intuition &ldquo;Coder = best for coding agent&rdquo; is wrong for this kind of task. The use case where Coders shine is &ldquo;complete an isolated function,&rdquo; which is exactly what they were trained for, and that&rsquo;s a tiny fraction of what a coding agent does day to day.</p>
<h3>The question I wanted to answer<span class="hx:absolute hx:-mt-20" id="the-question-i-wanted-to-answer"></span>
    <a href="#the-question-i-wanted-to-answer" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>It was this: running locally on the 5090, which Qwen model is worth the 1-2 correction prompts to deliver code that works?</p>
<p>The honest answer is: only Qwen 3.5 35B-A3B, and maybe the Claude-distilled if you don&rsquo;t mind spending 12 minutes more.</p>
<ul>
<li>Qwen 3.5 35B-A3B on the 5090: 5 minutes, correct entry point (<code>RubyLLM.chat(model:)</code>), errors on the subsequent calls. Realistic total until it works: in the 15-20 minute range with 1-2 follow-ups. Beats cloud OSS on cost.</li>
<li>Qwen 3.5 27B Claude-distilled on the 5090: 12 minutes, deeper hallucination (entry point is invented too). Realistic total: 25-30 minutes with 2-3 follow-ups. Still competes on cost, and loses on absolute time to the real Claude.</li>
<li>The others (Coder 30B, Coder 2.5 32B, 3 32B): don&rsquo;t pay back the correction time. Each one has a structural problem that calls for a full rewrite from scratch.</li>
</ul>
<p>For folks with hardware in this category who want to escape Anthropic vendor lock-in, it now works. It wasn&rsquo;t true of last year&rsquo;s models, even on this same 5090, and forget about it on the Strix Halo. In 2026, on NVIDIA Blackwell, with the right model, it works. For folks with low-bandwidth hardware (LPDDR5x, DDR4, DDR5), it&rsquo;s still a waste of time: the clock alone takes down any plan to make this practical.</p>
<h3>Qwen 3.6: what changed from 3.5<span class="hx:absolute hx:-mt-20" id="qwen-36-what-changed-from-35"></span>
    <a href="#qwen-36-what-changed-from-35" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>We tested two flavors of Qwen 3.6: the <strong>3.6 Plus</strong> (cloud, OpenRouter, free) and the <strong>3.6 35B</strong> (local, NVIDIA 5090, Q3_K_M).</p>
<p>Qwen 3.6 Plus (cloud) completed in 17 minutes with 88,940 tokens and 9/9 on the artifact checklist: fast, and free. But the generated service uses <code>chat.add_message()</code>, a method that doesn&rsquo;t exist in RubyLLM. The first message works, the second one crashes. Same problem as 3.5.</p>
<p>Qwen 3.6 35B (local, 5090) is more interesting. It completed in 4.7 minutes at 240 tok/s, produced 169 files, and got both the entry point <code>RubyLLM.chat(model:, provider:)</code> and <code>chat.ask(message)</code> right. The bug is subtler: it returns <code>response</code> instead of <code>response.content</code> and doesn&rsquo;t do history replay. One-line fix. That&rsquo;s a real improvement over Qwen 3.5 35B-A3B (which hallucinated <code>add_message</code> and <code>complete</code>). It&rsquo;s the cleanest Qwen result I&rsquo;ve seen.</p>
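<p>The fix, for the record (a sketch; the variable names approximate the generated service, they&rsquo;re not verbatim output):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ruby" data-lang="ruby"># Qwen 3.6 35B's bug and its one-line fix. Names are approximate.
chat = RubyLLM.chat(model: @model)
response = chat.ask(message)

reply = response          # bug (as generated): hands back the whole object
reply = response.content  # fix: return the text of the reply
</code></pre></div>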
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Version</th>
          <th>Hardware</th>
          <th style="text-align: center">Correct API?</th>
          <th style="text-align: center">Multi-turn works?</th>
          <th style="text-align: right">Time</th>
          <th style="text-align: right">Tok/s</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Qwen 3.5 35B-A3B</td>
          <td>3.5</td>
          <td>NVIDIA 5090</td>
          <td style="text-align: center">Entry point yes, <code>add_message</code>/<code>complete</code> no</td>
          <td style="text-align: center">No</td>
          <td style="text-align: right">5m</td>
          <td style="text-align: right">273</td>
      </tr>
      <tr>
          <td>Qwen 3.5 27B Claude-distilled</td>
          <td>3.5</td>
          <td>NVIDIA 5090</td>
          <td style="text-align: center">No (entry point invented)</td>
          <td style="text-align: center">No</td>
          <td style="text-align: right">12m</td>
          <td style="text-align: right">129</td>
      </tr>
      <tr>
          <td>Qwen 3.6 35B</td>
          <td>3.6</td>
          <td>NVIDIA 5090</td>
          <td style="text-align: center">Entry point yes, <code>chat.ask</code> yes, missing <code>.content</code></td>
          <td style="text-align: center">No (no replay)</td>
          <td style="text-align: right">5m</td>
          <td style="text-align: right">240</td>
      </tr>
      <tr>
          <td>Qwen 3.6 Plus</td>
          <td>3.6</td>
          <td>Cloud</td>
          <td style="text-align: center">Entry point yes, <code>add_message</code> no</td>
          <td style="text-align: center">No</td>
          <td style="text-align: right">17m</td>
          <td style="text-align: right">183</td>
      </tr>
  </tbody>
</table>
<p>The gap shrank but didn&rsquo;t close. The 3.6 35B local is closer to working than any previous Qwen — the bug is a forgotten <code>.content</code>, not an entirely invented API. But it still doesn&rsquo;t do multi-turn out of the box. In practice, Qwen 3.6 35B local moves up from Tier 3 to Tier 2: it&rsquo;s the local open source model that comes closest to delivering correct code on the first try, one fix away from working.</p>
<h2>The Deep Code Review: Sonnet vs GLM 5 vs Gemini vs Kimi vs MiniMax<span class="hx:absolute hx:-mt-20" id="the-deep-code-review-sonnet-vs-glm-5-vs-gemini-vs-kimi-vs-minimax"></span>
    <a href="#the-deep-code-review-sonnet-vs-glm-5-vs-gemini-vs-kimi-vs-minimax" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>The tables above measure structural completeness. But does the project work? I did detailed code review of the models that completed the benchmark.</p>
<p><strong>Claude Sonnet 4.6 — works and is the most complete.</strong> Synchronous responses via Turbo Stream. Chat history persisted in a session cookie with full replay of previous messages on every request. Correct LLM mocking in the tests with mocha (30 tests in 328 lines). LLM logic extracted into a separate <code>LlmChatService</code>. Views decomposed into 9 partials. Minor problems: duplicated model constant, leak in the auto-resize event listener. None are blockers. Of the generated projects, it&rsquo;s the closest to something you&rsquo;d actually put into production.</p>
<p><strong>GLM 5 — works, but it&rsquo;s the bare minimum.</strong> Uses the correct API (<code>RubyLLM.chat(model:)</code> then <code>.ask()</code>), does mocking with mocha in the tests. But the project is way leaner than Sonnet&rsquo;s: 21-line controller (vs Sonnet&rsquo;s 52), no service layer (LLM logic inline in the controller), no chat history persistence, every message handled in isolation. The first message works, but the app doesn&rsquo;t keep conversation context, so you can&rsquo;t have a multi-turn dialog. The tests exist (7 methods) but they&rsquo;re skeletal: <code>ruby_llm_test.rb</code> only checks that the module is loaded, <code>chat_flow_test.rb</code> is a copy of the controller test. The Dockerfile, on the other hand, is the best of the four: multi-stage, non-root, jemalloc. But as a chat app? It&rsquo;s more of a proof of concept than something functional. Funny detail: the README says &ldquo;Powered by Claude Sonnet 4&rdquo; instead of the model that actually generated the project.</p>
<p><strong>Gemini 3.1 Pro — fast, but trips on the API.</strong> Completed in 14 minutes, tied with MiniMax for fastest among the models reviewed in this section (Grok, below, was faster still). The Rails code itself is well written: uses <code>Rails.cache</code> with session ID and a 2-hour expiration to keep state (instead of a database), Turbo Streams nicely integrated, Stimulus controller for auto-scroll, and the Dockerfile matches GLM 5&rsquo;s (multi-stage, non-root, jemalloc). The problem is the usual one: it uses <code>RubyLLM::Chat.new()</code> instead of <code>RubyLLM.chat()</code>, and calls <code>add_message()</code> which doesn&rsquo;t exist. The app boots, Docker runs, the health check passes, but the first chat message returns 500. The tests (5 methods) mock with a <code>FakeChat</code> that replicates the wrong signature, so they pass. It&rsquo;s frustrating because the rest of the code is the most &ldquo;Rails way&rdquo; of the non-Anthropic models. Fixing it would be 3 lines, but the benchmark measures what comes out the first time.</p>
<p><strong>Kimi K2.5 — ambitious but broken.</strong> Tried the most sophisticated architecture: ActionCable streaming, configurable models, dual Dockerfiles, 37 tests in 374 lines. Problem: the streaming depends on ActionCable, which is commented out in <code>config/application.rb</code>. The <code>return unless defined?(ActionCable)</code> guard makes the method do nothing. The assistant never responds. The Stimulus controller has a scope bug: <code>submitTarget</code> references a button outside the controller&rsquo;s subtree. Thread-unsafe storage with a hash in a class variable. Kimi wrote more tests than any other model (37), but none of them mock the LLM calls — so the tests pass without proving any of the functionality works.</p>
<p><strong>Grok 4.20 — fast and wrong.</strong> It was the fastest in the entire benchmark: 8 minutes, 412 tok/s. Except it was fast because it cut corners. The prompt explicitly asked for the <code>ruby_llm</code> gem, and Grok ignored it. It went straight for <code>OpenAI::Client</code> from the <code>ruby-openai</code> gem pointing at the OpenRouter URL. Technically the first message comes back, so yeah, it &ldquo;works.&rdquo; But it&rsquo;s the same trick as Step 3.5 Flash and Qwen 3.5 122B: skip the part that was actually being tested. No history, 33-line controller calling the HTTP client by hand, two tests, no real mocks. It was fast because it did less than what was asked.</p>
<p><strong>MiniMax M2.7 — looks right, crashes.</strong> Calls <code>RubyLLM.chat(model: '...', messages: [...])</code> — that signature doesn&rsquo;t exist. No message persistence. Duplicated HTML (DOCTYPE inside the layout). Committed master.key. And the tests? They mock the wrong API, so they pass but they don&rsquo;t prove anything.</p>
<p>Code review summary:</p>
<table>
  <thead>
      <tr>
          <th>Aspect</th>
          <th style="text-align: center">Sonnet 4.6</th>
          <th style="text-align: center">GLM 5</th>
          <th style="text-align: center">Gemini 3.1 Pro</th>
          <th style="text-align: center">Kimi K2.5</th>
          <th style="text-align: center">MiniMax M2.7</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Correct API</td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">No</td>
          <td style="text-align: center">No</td>
          <td style="text-align: center">No</td>
      </tr>
      <tr>
          <td>Chat history</td>
          <td style="text-align: center">Session cookie</td>
          <td style="text-align: center">None</td>
          <td style="text-align: center">Rails.cache (2h)</td>
          <td style="text-align: center">Broken (ActionCable off)</td>
          <td style="text-align: center">None</td>
      </tr>
      <tr>
          <td>Service layer</td>
          <td style="text-align: center">LlmChatService</td>
          <td style="text-align: center">Inline in controller</td>
          <td style="text-align: center">LlmService</td>
          <td style="text-align: center">LlmService</td>
          <td style="text-align: center">ChatService (wrong API)</td>
      </tr>
      <tr>
          <td>Tests (methods)</td>
          <td style="text-align: center">30</td>
          <td style="text-align: center">7</td>
          <td style="text-align: center">5</td>
          <td style="text-align: center">37</td>
          <td style="text-align: center">12</td>
      </tr>
      <tr>
          <td>LLM mocking</td>
          <td style="text-align: center">Yes (mocha)</td>
          <td style="text-align: center">Yes (mocha)</td>
          <td style="text-align: center">FakeChat (wrong API)</td>
          <td style="text-align: center">No</td>
          <td style="text-align: center">Mocks wrong API</td>
      </tr>
      <tr>
          <td>Dockerfile</td>
          <td style="text-align: center">Multi-stage</td>
          <td style="text-align: center">Multi-stage + jemalloc</td>
          <td style="text-align: center">Multi-stage + jemalloc</td>
          <td style="text-align: center">Dual (dev/prod)</td>
          <td style="text-align: center">Single-stage</td>
      </tr>
      <tr>
          <td>Actually runs?</td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">Yes (no history)</td>
          <td style="text-align: center">No (500 in chat)</td>
          <td style="text-align: center">No</td>
          <td style="text-align: center">No</td>
      </tr>
  </tbody>
</table>
<h3>GLM 5 vs GLM 5.1: what changed<span class="hx:absolute hx:-mt-20" id="glm-5-vs-glm-51-what-changed"></span>
    <a href="#glm-5-vs-glm-51-what-changed" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>GLM 5 was one of the few models that spat out functional code on the first try, so it was obvious to test the new version. One important detail before the numbers: GLM 5 ran via OpenRouter, GLM 5.1 wasn&rsquo;t there yet when I ran this test, so I used the Z.AI direct API. Different provider, different infra, different cache. The numbers below are reference, not exact measurement.</p>
<table>
  <thead>
      <tr>
          <th>Aspect</th>
          <th>GLM 5</th>
          <th>GLM 5.1</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Provider</td>
          <td>OpenRouter</td>
          <td>Z.AI direct</td>
      </tr>
      <tr>
          <td>Total time</td>
          <td>17m</td>
          <td>22m</td>
      </tr>
      <tr>
          <td>Tok/s (final phase)</td>
          <td>400</td>
          <td>167</td>
      </tr>
      <tr>
          <td>Effective new tokens</td>
          <td>1,138</td>
          <td>450</td>
      </tr>
      <tr>
          <td>Cache read</td>
          <td>58,240</td>
          <td>81,216</td>
      </tr>
      <tr>
          <td>Correct RubyLLM API</td>
          <td>Yes</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Test mocking</td>
          <td>Yes (mocha)</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Tests</td>
          <td>7</td>
          <td>24</td>
      </tr>
      <tr>
          <td>Chat history</td>
          <td>No</td>
          <td>Yes (in-memory)</td>
      </tr>
      <tr>
          <td>Service layer</td>
          <td>Inline in controller</td>
          <td><code>ChatSession</code> model with <code>add_user_message</code>/<code>add_assistant_message</code></td>
      </tr>
  </tbody>
</table>
<p>The GLM 5.1 project came out way more complete. 24 tests vs 7. Real separation between <code>ChatSession</code>, <code>ChatMessage</code> and the controller, instead of GLM 5 cramming everything inline. Chat history persisted in memory during the session, so you can actually have a real multi-turn conversation (GLM 5 treated every message like it was the first). And the RubyLLM API is still correct, the same <code>RubyLLM.chat(model:, provider:)</code> pattern followed by <code>c.user</code>/<code>c.assistant</code> to build the context. There&rsquo;s even a test covering the <code>MODEL</code> constant, which usually nobody does.</p>
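<p>Reconstructed from that review, the multi-turn pattern looks roughly like the sketch below. The session plumbing, constants and predicate names are mine; only the RubyLLM calls (<code>RubyLLM.chat(model:, provider:)</code>, <code>c.user</code>/<code>c.assistant</code>, <code>ask</code>, <code>response.content</code>) and the <code>add_user_message</code>/<code>add_assistant_message</code> helpers come from the review above:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ruby" data-lang="ruby"># Multi-turn replay as described in the GLM 5.1 review. Session storage
# and names are illustrative; the RubyLLM calls come from the review.
chat = RubyLLM.chat(model: MODEL, provider: PROVIDER)

chat_session.messages.each do |msg|
  # Replay the stored history so the model sees the whole conversation.
  msg.from_user? ? chat.user(msg.content) : chat.assistant(msg.content)
end

response = chat.ask(new_message)
chat_session.add_user_message(new_message)
chat_session.add_assistant_message(response.content)
</code></pre></div>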
<p>The price was speed. 22 minutes vs 17, and throughput dropped from 400 to 167 tok/s. Could be the provider (Z.AI direct isn&rsquo;t the same infra as OpenRouter), could be a more loaded server during the run, could be that 5.1 reasons more. I didn&rsquo;t run it multiple times to take an average, so I won&rsquo;t say 5.1 is &ldquo;slower.&rdquo; A single run doesn&rsquo;t prove a regression. What I can say is that, in my test, 5.1 delivered a better-structured project and took a bit longer to do it.</p>
<p>For folks who want to get out from under Anthropic without losing quality, GLM 5 and GLM 5.1 are the two options that work. If you need centralized billing on OpenRouter, GLM 5. If you can use Z.AI direct and want a more rounded project on the first try, GLM 5.1.</p>
<h2>Costs: API vs Subscription<span class="hx:absolute hx:-mt-20" id="costs-api-vs-subscription"></span>
    <a href="#costs-api-vs-subscription" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>First, the per-token price of each model on OpenRouter:</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/05/llm-benchmark/en/token-pricing.png" alt="Per-token price on OpenRouter"  loading="lazy" /></p>
<p>GPT 5.4 Pro charges $180 per million output tokens. Claude Opus charges $25. GLM 5 charges $2.30. And Qwen 3.6 Plus is free (with a rate limit). The log scale on the chart hides some of the brutality of the gap: from free Qwen to GPT 5.4 Pro is orders of magnitude.</p>
<p>But per-token price isn&rsquo;t the whole story. If you use Claude or GPT daily for coding, the monthly subscription can come out way cheaper than paying per token via the API:</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/05/llm-benchmark/en/monthly-pricing.png" alt="Subscription vs API: how much it costs to use Claude and GPT per month"  loading="lazy" /></p>
<table>
  <thead>
      <tr>
          <th>Approach</th>
          <th style="text-align: right">Est. $/month*</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Qwen 3.6 Plus (OpenRouter)</td>
          <td style="text-align: right">$0</td>
          <td>Free but rate-limited</td>
      </tr>
      <tr>
          <td>Local models</td>
          <td style="text-align: right">Electricity</td>
          <td>Needs hardware</td>
      </tr>
      <tr>
          <td>Claude Pro</td>
          <td style="text-align: right">$20</td>
          <td>~44K tokens/5hr</td>
      </tr>
      <tr>
          <td>ChatGPT Plus</td>
          <td style="text-align: right">$20</td>
          <td>Includes Codex</td>
      </tr>
      <tr>
          <td>Claude Max 5x</td>
          <td style="text-align: right">$100</td>
          <td>~88K tokens/5hr</td>
      </tr>
      <tr>
          <td>Claude Sonnet (OpenRouter API)</td>
          <td style="text-align: right">~$150</td>
          <td>No cap, pay-as-you-go</td>
      </tr>
      <tr>
          <td>Claude Max 20x</td>
          <td style="text-align: right">$200</td>
          <td>~220K tokens/5hr</td>
      </tr>
      <tr>
          <td>ChatGPT Pro</td>
          <td style="text-align: right">$200</td>
          <td>GPT 5.4 Pro unlimited</td>
      </tr>
      <tr>
          <td>Claude Opus (OpenRouter API)</td>
          <td style="text-align: right">~$450</td>
          <td>No cap, pay-as-you-go</td>
      </tr>
      <tr>
          <td>GPT 5.4 Pro (OpenRouter API)</td>
          <td style="text-align: right">~$990</td>
          <td>Absurdly expensive</td>
      </tr>
  </tbody>
</table>
<p>*Estimate for moderate coding use (~15M input + ~3M output tokens/month).</p>
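<p>Spelling out the footnote math (the Opus output price is the $25/M quoted above; the input price here is an assumption chosen only to illustrate how the ~$450 row is built):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ruby" data-lang="ruby"># Monthly API cost for the footnote's usage profile.
# Only the $25/M output price for Opus is from the chart above;
# the $25/M input price is an illustrative assumption.
INPUT_TOKENS  = 15e6
OUTPUT_TOKENS = 3e6

def monthly_cost(input_per_m, output_per_m)
  (INPUT_TOKENS / 1e6) * input_per_m + (OUTPUT_TOKENS / 1e6) * output_per_m
end

puts monthly_cost(25.0, 25.0) # =&gt; 450.0, the ~$450/month Opus row
# Swap in any model's per-token prices to redo the comparison
# against its subscription tier.
</code></pre></div>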
<p>The main point: if you use GPT 5.4 Pro, the ChatGPT Pro subscription at $200/month with unlimited use is 5x cheaper than paying per token on the API. For Claude, Pro at $20/month covers light use, but for heavy users (a coding marathon like mine), the Max 20x at $200/month comes out cheaper than paying for Opus per token on OpenRouter (~$450/month). The open source models on OpenRouter all sit below $2.50/M output tokens, but as we saw, most of them generate code that doesn&rsquo;t run.</p>
<h2>What works for real use<span class="hx:absolute hx:-mt-20" id="what-works-for-real-use"></span>
    <a href="#what-works-for-real-use" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>After testing 33 models across both runs and looking at the generated code in detail:</p>
<p>Tier 1 (works plug and play):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Quality</th>
          <th style="text-align: right">Cost/Run</th>
          <th>Trade-off</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Claude Opus 4.7</td>
          <td>New baseline (28 tests, FakeChat, 96.7% coverage)</td>
          <td style="text-align: right">~$1.10</td>
          <td>Incremental over 4.6, new gold standard</td>
      </tr>
      <tr>
          <td>Claude Sonnet 4.6</td>
          <td>Better than Opus 4.6 on opencode (30 vs 16 tests)</td>
          <td style="text-align: right">~$0.63</td>
          <td>Cheaper, but fails on deeper reasoning in real projects</td>
      </tr>
      <tr>
          <td>Claude Opus 4.6</td>
          <td>Previous gold standard</td>
          <td style="text-align: right">~$1.05</td>
          <td>Previous baseline</td>
      </tr>
      <tr>
          <td>GLM 5</td>
          <td>Good (7 tests, correct API)</td>
          <td style="text-align: right">~$0.11</td>
          <td>89% cheaper, non-Anthropic/OpenAI alternative that works</td>
      </tr>
      <tr>
          <td>GLM 5.1</td>
          <td>Good (24 tests, history, correct API)</td>
          <td style="text-align: right">~$0.13</td>
          <td>~88% cheaper, more complete project than GLM 5</td>
      </tr>
  </tbody>
</table>
<p>Tier 2 (works with caveats):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th style="text-align: center">Hardware</th>
          <th style="text-align: right">Cost/Run</th>
          <th>Caveat</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>GPT 5.4 (Codex)</td>
          <td style="text-align: center">Cloud</td>
          <td style="text-align: right">~$16.00</td>
          <td>Impressive architecture (22 tests, dependency injection, PORO models), but <code>add_message</code> with keyword args instead of positional hash breaks multi-turn. 7.6M tokens, 15x more expensive than Opus</td>
      </tr>
      <tr>
          <td>Step 3.5 Flash</td>
          <td style="text-align: center">Cloud</td>
          <td style="text-align: right">~$0.02</td>
          <td>Bypasses the requested gem, slow (38m)</td>
      </tr>
      <tr>
          <td>Grok 4.20</td>
          <td style="text-align: center">Cloud</td>
          <td style="text-align: right">~$0.04</td>
          <td>Bypasses the requested gem (goes straight to <code>OpenAI::Client</code>), but it&rsquo;s the fastest in the benchmark</td>
      </tr>
      <tr>
          <td>Qwen 3.6 35B</td>
          <td style="text-align: center">NVIDIA 5090</td>
          <td style="text-align: right">Free</td>
          <td>Entry point and <code>chat.ask</code> correct, missing <code>.content</code>. One-line fix. ~10-15 min total</td>
      </tr>
      <tr>
          <td>Qwen 3.5 35B-A3B</td>
          <td style="text-align: center">NVIDIA 5090</td>
          <td style="text-align: right">Free</td>
          <td>Correct entry point, hallucinates <code>add_message</code>/<code>complete</code>. Fixable in 1-2 follow-ups. ~15-20 min total</td>
      </tr>
      <tr>
          <td>Qwen 3.5 27B Claude-distilled</td>
          <td style="text-align: center">NVIDIA 5090</td>
          <td style="text-align: right">Free</td>
          <td>Claude style, complete API hallucination. 2-3 follow-ups to fix. ~25-30 min total</td>
      </tr>
      <tr>
          <td>Qwen 3.5 35B (local)</td>
          <td style="text-align: center">AMD Strix</td>
          <td style="text-align: right">Free</td>
<td>Works if a default model is configured; no test mocking; slow</td>
      </tr>
  </tbody>
</table>
<p>Tier 3 (broken code, easier to redo than to fix):</p>
<p>Kimi K2.5, MiniMax M2.7, DeepSeek V3.2, Gemini 3.1 Pro, Qwen 3 Coder Next (Strix), Qwen 3 Coder 30B (5090, returned a hardcoded mock string), Qwen 3.5 122B, Qwen 3.6 Plus — all of them either invent APIs that don&rsquo;t exist or don&rsquo;t even try to use the gem.</p>
<p>Tier 4 (didn&rsquo;t complete):</p>
<p>Gemma 4 (infinite loop on both machines), Llama 4 Scout (no parser), GPT OSS 20B (wrong directory on Strix, parser regression on 5090), Qwen 3 32B (too slow on Strix, partial scaffold on 5090), Qwen 2.5 Coder 32B (90m timeout with zero files).</p>
<h3>Simplified ranking (quality, time, price)<span class="hx:absolute hx:-mt-20" id="simplified-ranking-quality-time-price"></span>
    <a href="#simplified-ranking-quality-time-price" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>For folks who only want the report-card summary. Quality is whether the code runs and how complete it is. Time is the total runtime. Price is the estimated cost per execution on opencode. <strong>Hardware</strong> indicates where the model ran — Cloud, Strix (AMD Strix Halo, LPDDR5x 256 GB/s) or 5090 (NVIDIA RTX 5090, GDDR7 1792 GB/s). Cloud models ran via OpenRouter or the provider&rsquo;s direct API.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th style="text-align: center">Type</th>
          <th style="text-align: center">Hardware</th>
          <th style="text-align: center">Quality</th>
          <th style="text-align: center">Time</th>
          <th style="text-align: center">Price</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Claude Opus 4.7</td>
          <td style="text-align: center">Commercial</td>
          <td style="text-align: center">Cloud</td>
          <td style="text-align: center">A+</td>
          <td style="text-align: center">A</td>
          <td style="text-align: center">D</td>
      </tr>
      <tr>
          <td>Claude Sonnet 4.6</td>
          <td style="text-align: center">Commercial</td>
          <td style="text-align: center">Cloud</td>
          <td style="text-align: center">A+</td>
          <td style="text-align: center">A</td>
          <td style="text-align: center">C</td>
      </tr>
      <tr>
          <td>Claude Opus 4.6</td>
          <td style="text-align: center">Commercial</td>
          <td style="text-align: center">Cloud</td>
          <td style="text-align: center">A+</td>
          <td style="text-align: center">A</td>
          <td style="text-align: center">D</td>
      </tr>
      <tr>
          <td>GPT 5.4 (Codex)</td>
          <td style="text-align: center">Commercial</td>
          <td style="text-align: center">Cloud</td>
          <td style="text-align: center">B+</td>
          <td style="text-align: center">B</td>
          <td style="text-align: center">F</td>
      </tr>
      <tr>
          <td>GLM 5.1</td>
          <td style="text-align: center">OSS</td>
          <td style="text-align: center">Cloud</td>
          <td style="text-align: center">A</td>
          <td style="text-align: center">B</td>
          <td style="text-align: center">A</td>
      </tr>
      <tr>
          <td>GLM 5</td>
          <td style="text-align: center">OSS</td>
          <td style="text-align: center">Cloud</td>
          <td style="text-align: center">A−</td>
          <td style="text-align: center">A</td>
          <td style="text-align: center">A</td>
      </tr>
      <tr>
          <td>Qwen 3.6 35B</td>
          <td style="text-align: center">OSS</td>
          <td style="text-align: center">5090</td>
          <td style="text-align: center">B+</td>
          <td style="text-align: center">A+</td>
          <td style="text-align: center">A+ (free)</td>
      </tr>
      <tr>
          <td>Qwen 3.5 35B-A3B</td>
          <td style="text-align: center">OSS</td>
          <td style="text-align: center">5090</td>
          <td style="text-align: center">B</td>
          <td style="text-align: center">A+</td>
          <td style="text-align: center">A+ (free)</td>
      </tr>
      <tr>
          <td>Qwen 3.5 27B Claude-distilled</td>
          <td style="text-align: center">OSS</td>
          <td style="text-align: center">5090</td>
          <td style="text-align: center">C+</td>
          <td style="text-align: center">B</td>
          <td style="text-align: center">A+ (free)</td>
      </tr>
      <tr>
          <td>Gemini 3.1 Pro</td>
          <td style="text-align: center">Commercial</td>
          <td style="text-align: center">Cloud</td>
          <td style="text-align: center">C</td>
          <td style="text-align: center">A+</td>
          <td style="text-align: center">B</td>
      </tr>
      <tr>
          <td>Grok 4.20</td>
          <td style="text-align: center">Commercial</td>
          <td style="text-align: center">Cloud</td>
          <td style="text-align: center">C−</td>
          <td style="text-align: center">A+</td>
          <td style="text-align: center">A+</td>
      </tr>
      <tr>
          <td>Step 3.5 Flash</td>
          <td style="text-align: center">Commercial</td>
          <td style="text-align: center">Cloud</td>
          <td style="text-align: center">C−</td>
          <td style="text-align: center">D</td>
          <td style="text-align: center">A+</td>
      </tr>
      <tr>
          <td>Qwen 3.5 35B-A3B</td>
          <td style="text-align: center">OSS</td>
          <td style="text-align: center">Strix</td>
          <td style="text-align: center">C</td>
          <td style="text-align: center">C</td>
          <td style="text-align: center">A+ (free)</td>
      </tr>
      <tr>
          <td>Qwen 3 Coder Next</td>
          <td style="text-align: center">OSS</td>
          <td style="text-align: center">Strix</td>
          <td style="text-align: center">D+</td>
          <td style="text-align: center">A</td>
          <td style="text-align: center">A+ (free)</td>
      </tr>
      <tr>
          <td>Qwen 3 32B</td>
          <td style="text-align: center">OSS</td>
          <td style="text-align: center">5090</td>
          <td style="text-align: center">D</td>
          <td style="text-align: center">A+</td>
          <td style="text-align: center">A+ (free)</td>
      </tr>
      <tr>
          <td>Qwen 3 Coder 30B</td>
          <td style="text-align: center">OSS</td>
          <td style="text-align: center">5090</td>
          <td style="text-align: center">D−</td>
          <td style="text-align: center">A+</td>
          <td style="text-align: center">A+ (free)</td>
      </tr>
      <tr>
          <td>Qwen 3.6 Plus</td>
          <td style="text-align: center">Commercial</td>
          <td style="text-align: center">Cloud</td>
          <td style="text-align: center">D</td>
          <td style="text-align: center">A</td>
          <td style="text-align: center">A+ (free)</td>
      </tr>
      <tr>
          <td>Kimi K2.5</td>
          <td style="text-align: center">OSS</td>
          <td style="text-align: center">Cloud</td>
          <td style="text-align: center">D</td>
          <td style="text-align: center">C</td>
          <td style="text-align: center">A</td>
      </tr>
      <tr>
          <td>MiniMax M2.7</td>
          <td style="text-align: center">OSS</td>
          <td style="text-align: center">Cloud</td>
          <td style="text-align: center">D</td>
          <td style="text-align: center">A</td>
          <td style="text-align: center">A+</td>
      </tr>
      <tr>
          <td>Qwen 3.5 122B</td>
          <td style="text-align: center">OSS</td>
          <td style="text-align: center">Strix</td>
          <td style="text-align: center">D</td>
          <td style="text-align: center">F</td>
          <td style="text-align: center">A+ (free)</td>
      </tr>
      <tr>
          <td>DeepSeek V3.2</td>
          <td style="text-align: center">OSS</td>
          <td style="text-align: center">Cloud</td>
          <td style="text-align: center">F</td>
          <td style="text-align: center">F</td>
          <td style="text-align: center">A+</td>
      </tr>
      <tr>
          <td>Qwen 2.5 Coder 32B</td>
          <td style="text-align: center">OSS</td>
          <td style="text-align: center">5090</td>
          <td style="text-align: center">F</td>
          <td style="text-align: center">F</td>
          <td style="text-align: center">A+ (free)</td>
      </tr>
      <tr>
          <td>Gemma 4 31B</td>
          <td style="text-align: center">OSS</td>
          <td style="text-align: center">5090</td>
          <td style="text-align: center">F</td>
          <td style="text-align: center">A+</td>
          <td style="text-align: center">A+ (free)</td>
      </tr>
      <tr>
          <td>Gemma 4 31B</td>
          <td style="text-align: center">OSS</td>
          <td style="text-align: center">Strix</td>
          <td style="text-align: center">F</td>
          <td style="text-align: center">—</td>
          <td style="text-align: center">A+ (free)</td>
      </tr>
      <tr>
          <td>GLM 4.7 Flash</td>
          <td style="text-align: center">OSS</td>
          <td style="text-align: center">Strix</td>
          <td style="text-align: center">F</td>
          <td style="text-align: center">—</td>
          <td style="text-align: center">A+ (free)</td>
      </tr>
      <tr>
          <td>Llama 4 Scout</td>
          <td style="text-align: center">OSS</td>
          <td style="text-align: center">Strix</td>
          <td style="text-align: center">F</td>
          <td style="text-align: center">—</td>
          <td style="text-align: center">A+ (free)</td>
      </tr>
      <tr>
          <td>GPT OSS 20B</td>
          <td style="text-align: center">OSS</td>
          <td style="text-align: center">Strix</td>
          <td style="text-align: center">F</td>
          <td style="text-align: center">—</td>
          <td style="text-align: center">A+ (free)</td>
      </tr>
      <tr>
          <td>Qwen 3 32B</td>
          <td style="text-align: center">OSS</td>
          <td style="text-align: center">Strix</td>
          <td style="text-align: center">F</td>
          <td style="text-align: center">—</td>
          <td style="text-align: center">A+ (free)</td>
      </tr>
  </tbody>
</table>
<p>Quality criteria: A+ works and the code is well structured. A/B works with small to medium caveats. C runs but skips a prompt requirement or has a serious structural issue. D breaks on the first message because of an invented API. F didn&rsquo;t complete the benchmark or produced garbage. GPT 5.4 via Codex dropped from A+ to B+: the architecture is the most sophisticated in the benchmark (dependency injection, PORO models, 22 tests), but while the first message works, multi-turn breaks due to a wrong calling convention on <code>add_message</code>. Burned 7.6M tokens (~$16/run) without reaching Opus-level correctness. &ldquo;Type&rdquo; separates commercial models (closed weights) from OSS (open weights, even when used through a hosted API). Some Qwens appear twice because they ran on both hardware profiles and the results differ enough to justify it — Qwen 3.5 35B-A3B on the 5090 jumps to Tier B, on the Strix it stays at Tier C because of the wait time. Of the 33 models configured across both runs, some don&rsquo;t appear in this table because they never even executed (no quota, broken runner, infra failure, or timeout before the first message).</p>
<h3>The verdict<span class="hx:absolute hx:-mt-20" id="the-verdict"></span>
    <a href="#the-verdict" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>If you want the best result and don&rsquo;t want to think about it: <strong>Claude Opus</strong>. Opus 4.7 is the new baseline — 28 tests, correct API, 96.7% coverage, FakeChat pattern in the tests. It&rsquo;s an incremental improvement over 4.6 (which had 16 tests), not a revolution. But it doesn&rsquo;t need to be a revolution. 4.6 already worked, 4.7 works the same and delivers a slightly more polished project. If you were on 4.6, upgrading to 4.7 means switching to a model that does the same thing with more care in the tests and structure.</p>
<p>A word about Sonnet: it beat Opus in this benchmark (30 tests vs 16 for Opus 4.6, vs 28 for Opus 4.7). But this benchmark is a small, well-defined web app. In my real-world experience with bigger projects, Sonnet fails when the reasoning needs to go deeper. I&rsquo;m not talking about massive projects — just going a bit further than this benchmark (more controllers, more integrations, architectural decisions that depend on each other) is enough for Sonnet to lose the thread. Opus has a 128K max output token ceiling vs Sonnet&rsquo;s 64K, and its training was specifically aimed at long-horizon tasks, multi-step planning and deep reasoning over complex code. On a small project like the benchmark, those muscles stay idle, and in that scenario Sonnet wins by being faster and cheaper. But if you extrapolate that to &ldquo;Sonnet is better than Opus,&rdquo; you&rsquo;ll get a surprise on the first task that requires sustained reasoning. You can try Sonnet — it&rsquo;s cheaper and for small projects it works. But for real projects, you&rsquo;ll probably end up on Opus anyway.</p>
<p>On GPT 5.4: we now have objective data. Ran it via Codex CLI with xHigh reasoning effort. The architecture it generates is the most sophisticated in the benchmark — dependency injection, PORO models, session management with TTL. But it burned 7.6M tokens (~$16/run, 15x more expensive than Opus) and got the <code>add_message</code> calling convention wrong (keyword args instead of positional hash), breaking multi-turn. Spent more, got it wrong. Dropped from Tier 1 to Tier 2. Same pattern: API correctness is binary recall in the weights. It doesn&rsquo;t scale with token budget or reasoning effort.</p>
<p>If cost matters and you want to leave Anthropic: <strong>GLM 5 or GLM 5.1</strong> are the plug-and-play alternatives that work. Correct API, mocking in the tests, ~$0.11-$0.13 per run, ~88-89% cheaper than Opus. GLM 5.1 delivered a more complete project (24 tests, chat history) at the cost of about 5 more minutes.</p>
<p>If you want to avoid total vendor lock-in and you have decent hardware: <strong>Qwen 3.6 35B</strong> running locally on an NVIDIA RTX 5090. Under 5 minutes of execution at 240 tok/s, correct entry point and <code>chat.ask</code>, just missing <code>.content</code> extraction — a 1-line fix. That&rsquo;s better than the Qwen 3.5 35B-A3B which hallucinated entire API methods. The 3.6 generation is the first Qwen that&rsquo;s genuinely one fix away from working, not a rewrite away. Realistic total: ~10-15 minutes with one follow-up.</p>
<p>If you want to avoid vendor lock-in but you&rsquo;re on weak hardware: <strong>GLM 5 or GLM 5.1</strong> remain the choice. They&rsquo;re cloud, true, but at $0.11-$0.13 per run it&rsquo;s basically the price of electricity.</p>
<p>If you want to test the &ldquo;Claude at home&rdquo; gamble via distillation: the <strong>Qwen 3.5 27B Claude-distilled</strong> is sitting there to play with, but I already warned you it hallucinates exactly the same fake APIs as the base Qwen. Distillation transferred Claude&rsquo;s style, not its factual knowledge about libraries. It&rsquo;s worth it as an experiment, not as production.</p>
<p>Yes, maybe with days of tweaking llama.cpp, calibrating flags, adjusting prompts, testing different builds, you could make Gemma 4 or other models work better. For most people, that isn&rsquo;t realistic. The distance between frontier models (Claude, GPT) and self-hosted open source models is real. It isn&rsquo;t marketing. The gap is shrinking, but it still exists, and the nature of it has changed: today what&rsquo;s missing in open source is factual knowledge about specific libraries, not raw reasoning capacity. Hardware stopped being the bottleneck, at least for anyone with a recent GPU.</p>
<p>In the end, what matters is whether the code runs. A model can generate 3,405 files, write 37 tests, produce a 181-line README, and the app still won&rsquo;t work because the API it uses doesn&rsquo;t exist. Completeness metrics and test counts are necessary but not sufficient. The only reliable signal is whether the model uses real APIs correctly.</p>
<p>The full benchmark, with code, configuration, prompts and per-model results, is on <a href="https://github.com/akitaonrails/llm-coding-benchmark"target="_blank" rel="noopener">GitHub</a>.</p>
]]></content:encoded><category>llm</category><category>benchmark</category><category>open-source</category><category>claude</category><category>ai</category><category>self-hosting</category></item><item><title>Turning YouTube into a Karaoke App | Frank Karaoke</title><link>https://akitaonrails.github.io/en/2026/04/05/turning-youtube-into-a-karaoke-app-frank-karaoke/</link><guid isPermaLink="true">https://akitaonrails.github.io/en/2026/04/05/turning-youtube-into-a-karaoke-app-frank-karaoke/</guid><pubDate>Sun, 05 Apr 2026 12:00:00 GMT</pubDate><description>&lt;p&gt;Project on GitHub: &lt;a href="https://github.com/akitaonrails/frank_karaoke"target="_blank" rel="noopener"&gt;github.com/akitaonrails/frank_karaoke&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/05/frank-karaoke/scoring-in-action.jpg" alt="Real-time scoring overlaid on a YouTube karaoke video" loading="lazy" /&gt;&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve always loved karaoke. I go out to sing with family or friends every now and then. In São Paulo there are good places in Liberdade and Bom Retiro, for instance, with private Japanese-style booths. If you&amp;rsquo;ve never been to a karaoke like that: you rent a private room by the hour, there&amp;rsquo;s a huge song catalog, two microphones, and a scoring system that grades your singing in real time. The best systems are Japanese, like &lt;a href="https://www.joysound.com/"target="_blank" rel="noopener"&gt;Joysound&lt;/a&gt; and &lt;a href="https://www.clubdam.com/"target="_blank" rel="noopener"&gt;DAM&lt;/a&gt;. A score above 90 (out of 100) is considered advanced. DAM, in the LIVE DAM Ai series, even uses AI to give scores that feel more &amp;ldquo;human.&amp;rdquo;&lt;/p&gt;</description><content:encoded><![CDATA[<p>Project on GitHub: <a href="https://github.com/akitaonrails/frank_karaoke"target="_blank" rel="noopener">github.com/akitaonrails/frank_karaoke</a></p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/05/frank-karaoke/scoring-in-action.jpg" alt="Real-time scoring overlaid on a YouTube karaoke video"  loading="lazy" /></p>
<p>I&rsquo;ve always loved karaoke. I go out to sing with family or friends every now and then. In São Paulo there are good places in Liberdade and Bom Retiro, for instance, with private Japanese-style booths. If you&rsquo;ve never been to a karaoke like that: you rent a private room by the hour, there&rsquo;s a huge song catalog, two microphones, and a scoring system that grades your singing in real time. The best systems are Japanese, like <a href="https://www.joysound.com/"target="_blank" rel="noopener">Joysound</a> and <a href="https://www.clubdam.com/"target="_blank" rel="noopener">DAM</a>. A score above 90 (out of 100) is considered advanced. DAM, in the LIVE DAM Ai series, even uses AI to give scores that feel more &ldquo;human.&rdquo;</p>
<p>But not every place has that level.</p>
<h2>The problem with karaoke in Brazil<span class="hx:absolute hx:-mt-20" id="the-problem-with-karaoke-in-brazil"></span>
    <a href="#the-problem-with-karaoke-in-brazil" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>In Brazil we grew up with <a href="https://www.videoland.com.br/wwwroot/historia.asp"target="_blank" rel="noopener">Videokê</a>, the brand the Korean Seok Ha Hwang brought to the country in 1996, importing equipment from Korea. It became a craze in the 90s and 2000s, showed up in every bar, barbecue and birthday party. The problem is that those machines stopped in time. The current models, like the VSK 5.0, ship with around 12-13 thousand songs in the catalog, which you expand by buying cartridges or song packs. In practice, the repertoire is old, the interface is straight out of the 2000s, and if the song you want to sing came out after 2015, good luck.</p>
<p>The workaround a lot of bars adopted was to allow Chromecast or screen mirroring so that customers can search for songs directly on YouTube. Makes sense: on YouTube you can find karaoke for any song. With-lyrics version, instrumental version, vocal guide version.</p>
<p>But there&rsquo;s a downgrade: you lose the scoring. One of the most fun parts of karaoke is the competition. Watching your score climb, comparing with friends, trying to beat the night&rsquo;s record. If you&rsquo;re just singing on top of a YouTube video, you get no feedback. It&rsquo;s like bowling without a scoreboard.</p>
<p>And buying a professional system for home? Importing a Joysound F1 runs north of US$ 2,000 just for the hardware, not counting the monthly catalog subscription. For casual use it makes no sense.</p>
<h2>The idea: YouTube with real-time scoring<span class="hx:absolute hx:-mt-20" id="the-idea-youtube-with-real-time-scoring"></span>
    <a href="#the-idea-youtube-with-real-time-scoring" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p><a href="https://github.com/akitaonrails/frank_karaoke"target="_blank" rel="noopener">Frank Karaoke</a> came out of that frustration. If YouTube already has every song, why not build an app that works as a YouTube wrapper with a real-time scoring overlay? You search for any karaoke video, sing along, and the app analyzes your voice through the mic and shows a live score.</p>
<p>It&rsquo;s a Flutter app for Android. Internally it loads YouTube into a webview and injects an overlay in HTML/CSS/JavaScript right into the page. The score display, the pitch trail, the settings panel, the mode selector — all of it rendered inside the webview through JS injection.</p>
<picture>
  <source media="(prefers-color-scheme: dark)" srcset="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/05/frank-karaoke/karaoke-full-dark.png">
  <img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/05/frank-karaoke/karaoke-full-light.png" alt="Frank Karaoke">
</picture>
<h2>Scoring without a reference<span class="hx:absolute hx:-mt-20" id="scoring-without-a-reference"></span>
    <a href="#scoring-without-a-reference" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Now, the real problem. Every professional karaoke system depends on prebuilt reference files for each song. Every single one.</p>
<p>Sony&rsquo;s <a href="https://en.wikipedia.org/wiki/SingStar"target="_blank" rel="noopener">SingStar</a>, which sold over 12 million copies between 2004 and the end of the PS3 era, had a hand-crafted note track for every song. Every note, every syllable, all mapped manually. The mechanism compared the singer&rsquo;s pitch via FFT against that reference in real time. A detail I thought was clever: octave was ignored. If the right note was a C, it didn&rsquo;t matter if you sang C3 or C4. Men sing women&rsquo;s songs no problem.</p>
<p>Joysound and DAM in Japan go further and evaluate three separate dimensions: pitch accuracy (音感), rhythm/timing (リズム感) and expressiveness/dynamic volume (表現力). All based on MIDI data from the operator&rsquo;s server. The open source equivalent format is UltraStar, where each song has a <code>.txt</code> file like:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code>: 12 4 5 Hel-    (NoteType StartBeat Duration Pitch Syllable)</code></pre></div>
</div>
<p><code>Pitch 5</code> = MIDI 65 (F4). Scoring compares the singer&rsquo;s pitch against the note&rsquo;s pitch, modulo octave, with a tolerance of 1 semitone.</p>
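<p>To make the octave-agnostic comparison concrete, here&rsquo;s a minimal Python sketch of the idea (the helper names are mine, not from UltraStar or SingStar source):</p>
<div><pre><code class="language-python">import math

def hz_to_midi(freq_hz: float) -&gt; float:
    """Convert a frequency in Hz to a fractional MIDI note number."""
    return 69.0 + 12.0 * math.log2(freq_hz / 440.0)

def pitch_matches(sung_hz: float, reference_midi: int, tolerance: float = 1.0) -&gt; bool:
    """True if the sung pitch matches the reference note, ignoring octave."""
    diff = (hz_to_midi(sung_hz) - reference_midi) % 12.0
    if diff &gt; 6.0:            # wrap so 11.5 semitones up counts as 0.5 down
        diff = 12.0 - diff
    return diff &lt;= tolerance

# UltraStar Pitch 5 = MIDI 65 (F4). Singing F3 (~174.6 Hz) still matches:
print(pitch_matches(174.6, 65))  # True
</code></pre></div>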
<p>Frank Karaoke works with any YouTube video. There&rsquo;s no reference file. There&rsquo;s no MIDI. There&rsquo;s no melody annotation. Zero metadata about what note you&rsquo;re supposed to be singing.</p>
<p>I don&rsquo;t know anything about karaoke scoring. I don&rsquo;t know anything about audio processing, pitch detection, music theory applied to software. Nothing. So I asked Claude Code to do extensive research on the subject. What it brought back is documented in <a href="https://github.com/akitaonrails/frank_karaoke/blob/main/docs/scoring.md"target="_blank" rel="noopener"><code>docs/scoring.md</code></a> in the repository, and it&rsquo;s a lot: academic papers on singing evaluation (Nakano et al. 2006, Tsai &amp; Lee 2012, Molina et al. 2013), patents (Yamaha has one from 1999, US5889224A, that details MIDI-based scoring with 3 tolerance bands), and the source code of open source projects like UltraStar Deluxe, AllKaraoke, Vocaluxe and Nightingale.</p>
<p>The conclusion of the research: without a per-song reference, you have to evaluate vocal quality generically. Measure <em>how</em> the person is singing, not <em>what</em> they should be singing. And since no single metric works for every case, we decided to implement four different scoring modes, each measuring a different dimension of vocal quality.</p>
<h2>The phone microphone problem<span class="hx:absolute hx:-mt-20" id="the-phone-microphone-problem"></span>
    <a href="#the-phone-microphone-problem" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Before the scoring modes, I have to explain a more fundamental problem the research uncovered: the phone microphone.</p>
<p>When you sing karaoke with the phone, the mic picks up three things at once: your voice, the music coming out of the speaker, and ambient noise from the room. Your voice is physically closer to the mic, so it dominates the signal. But not enough for clean separation.</p>
<p>I tried several approaches to isolate the voice:</p>
<p>Spectral subtraction using YouTube&rsquo;s reference audio. Dropped it. The YouTube CDN blocks direct audio extraction by non-browser user-agents, and even with the reference audio in hand, the speaker&rsquo;s EQ, the room reverberation and the Bluetooth delay make the signal too different from what the mic captures. Naive subtraction produces artifacts worse than no subtraction at all.</p>
<p>Pre-emphasis + center clipping. Dropped that too. Center clipping destroys the waveform that the YIN algorithm needs for autocorrelation, and pre-emphasis amplifies noise as much as it amplifies voice.</p>
<p>What works is a 200-3500 Hz bandpass filter: a second-order IIR (Butterworth, Q=0.707) in cascade. The high-pass at 200 Hz kills bass, kick drum, bass guitar bleed from the speaker. The low-pass at 3500 Hz kills cymbals, hi-hats, high-frequency noise. Human voice fundamentals (85-300 Hz) and formants (300-3000 Hz) pass through the filter. It&rsquo;s not perfect isolation, but it improves the voice/music ratio enough for pitch detection.</p>
<p>But the bandpass alone doesn&rsquo;t solve everything. Guitars, synths and piano produce periodic signals in the same frequency range as voice, and YIN detects pitch in them too. To deal with that, the app does adaptive calibration: in the first 5 seconds of warmup (when nobody&rsquo;s singing yet), it collects RMS samples from the signal to establish a baseline of the speaker&rsquo;s level. During the song, it keeps that baseline updated (25th percentile of the last ~4 seconds of frames). For a frame to be scored, the RMS has to be at least 1.3x above the baseline. Your voice is closer to the mic, so it pushes the RMS above the speaker&rsquo;s level. The instrumental melody stays near the baseline and gets filtered out. In testing, the original singer coming out of the speaker scored around 37 with sparse dots in the trail, while someone actually singing scored ~59 with dense dots.</p>
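<p>The gate itself is simple once the pipeline hands you per-frame RMS values. A sketch of the logic, with the window size as an assumption (the real implementation is in Dart):</p>
<div><pre><code class="language-python">from collections import deque
import numpy as np

GATE_RATIO = 1.3        # frame must be 1.3x above the speaker baseline
BASELINE_FRAMES = 160   # roughly 4 s of frames (assumed frame rate)

recent_rms: deque = deque(maxlen=BASELINE_FRAMES)

def should_score(frame: np.ndarray) -&gt; bool:
    """Score a frame only if its RMS clears the rolling speaker baseline."""
    rms = float(np.sqrt(np.mean(frame ** 2)))
    recent_rms.append(rms)
    # Baseline = 25th percentile of recent frames: the quiet-ish moments
    # dominated by the speaker, not by the singer.
    baseline = float(np.percentile(recent_rms, 25))
    return rms &gt;= baseline * GATE_RATIO
</code></pre></div>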
<p>Another annoying detail: on Android, specifically on Samsungs, the DSP&rsquo;s <code>AutomaticGainControl</code> (AGC) attenuates the signal instead of amplifying it. On Galaxies, enabling AGC drops the mic peak from ~0.06 to ~0.003. Silence as far as pitch detection is concerned. So the app disables AGC, echo cancellation and noise suppression. When the peak falls below 0.01, it applies software gain (up to 30x) to bring the signal up to usable levels.</p>
<h2>The YIN algorithm<span class="hx:absolute hx:-mt-20" id="the-yin-algorithm"></span>
    <a href="#the-yin-algorithm" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>To detect the voice&rsquo;s pitch I use <a href="http://audition.ens.fr/adc/pdf/2002_JASA_YIN.pdf"target="_blank" rel="noopener">YIN</a>, by Alain de Cheveigné (IRCAM-CNRS) and Hideki Kawahara (Wakayama University). It&rsquo;s a fundamental frequency estimator in the time domain. The central idea is the Cumulative Mean Normalized Difference Function (CMNDF), which basically measures how periodic the signal is at each lag, normalizes it to reduce false positives, and uses parabolic interpolation to refine the result. It&rsquo;s lightweight enough to run in real time on a phone, which is what matters here.</p>
<p>In the app, the YIN threshold is 0.70 (tuned for mixed voice + music signals), and frames with confidence below 0.3 get discarded. Below that, it&rsquo;s probably noise or an instrument.</p>
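<p>To show the shape of the algorithm, here&rsquo;s a compact, unoptimized YIN sketch in Python (the app ships a pure-Dart version; treat this as pseudocode with real syntax):</p>
<div><pre><code class="language-python">import numpy as np

def yin_pitch(frame: np.ndarray, sr: int, threshold: float = 0.70,
              fmin: float = 80.0, fmax: float = 1000.0):
    """Return (pitch_hz, confidence), or (None, 0.0) if unvoiced."""
    tau_min, tau_max = int(sr / fmax), int(sr / fmin)
    # Step 1: difference function d(tau) for tau = 1..tau_max
    d = np.zeros(tau_max + 1)
    for tau in range(1, tau_max + 1):
        delta = frame[:-tau] - frame[tau:]
        d[tau] = np.dot(delta, delta)
    # Step 2: cumulative mean normalized difference function (CMNDF)
    cmndf = np.ones(tau_max + 1)
    cmndf[1:] = d[1:] * np.arange(1, tau_max + 1) / np.maximum(np.cumsum(d[1:]), 1e-12)
    # Step 3: first dip under the absolute threshold
    for tau in range(tau_min, tau_max):
        if cmndf[tau] &lt; threshold:
            # Step 4: parabolic interpolation refines the lag estimate
            a, b, c = cmndf[tau - 1], cmndf[tau], cmndf[tau + 1]
            denom = a - 2 * b + c
            shift = 0.5 * (a - c) / denom if denom else 0.0
            return sr / (tau + shift), 1.0 - b
    return None, 0.0

# Frames whose confidence (1 - CMNDF at the dip) falls below 0.3 get discarded.
</code></pre></div>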
<h2>The 4 scoring modes<span class="hx:absolute hx:-mt-20" id="the-4-scoring-modes"></span>
    <a href="#the-4-scoring-modes" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Each mode evaluates a different aspect of vocal quality. They all share the same audio pipeline (bandpass → YIN → confidence gate). The difference is how they interpret the detected pitch.</p>
<h3>Pitch Match<span class="hx:absolute hx:-mt-20" id="pitch-match"></span>
    <a href="#pitch-match" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Measures how cleanly you sustain notes. Uses Gaussian decay based on the standard deviation of MIDI values in a rolling ~15-frame window. Steady notes (deviation &lt; 0.3 semitones) score 85-100%. A trembling voice (deviation &gt; 2 semitones) scores near zero. Good for songs you already know well.</p>
<h3>Contour<span class="hx:absolute hx:-mt-20" id="contour"></span>
    <a href="#contour" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Measures the melodic shape of your singing. It doesn&rsquo;t matter which exact note you hit, only the direction and the flow. Evaluates the pitch range and melodic movements (jumps &gt; 0.5 semitone) in a rolling window. Monotone singing scores ~10%. Smooth melodic movement with a 2-6 semitone range scores 70-100%. Good for when you&rsquo;re learning a new song.</p>
<h3>Intervals<span class="hx:absolute hx:-mt-20" id="intervals"></span>
    <a href="#intervals" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Measures the musical quality of jumps between consecutive notes. A whole tone (2 semitones) scores highest. Thirds and fourths score well. Wild jumps of an octave or more score low. Uses a Gaussian curve centered on the whole tone. Works when you&rsquo;re singing in a different key from the original.</p>
<h3>Streak<span class="hx:absolute hx:-mt-20" id="streak"></span>
    <a href="#streak" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>It&rsquo;s Pitch Match with a combo multiplier. Each consecutive frame with a score above 0.4 increments the streak counter. The streak adds bonus points (up to +0.4 on a streak of 30+). Breaking a streak &gt; 5 frames pushes a 0.05 penalty into the EMA. Silence freezes the streak, so instrumental breaks don&rsquo;t hurt you. The most fun mode for parties.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/05/frank-karaoke/score-display-detail.jpg" alt="Score display detail: live score, overall score, pitch trail with note grid (C3-G5)"  loading="lazy" /></p>
<p>The logic behind these four modes came from the research Claude did across academic papers. Each one measures a different dimension: pitch accuracy, melodic contour, phrasing and consistency. None of them is sufficient on its own, but together they cover, reasonably well, what you can evaluate without having the song&rsquo;s reference melody.</p>
<h2>The Pitch Oracle<span class="hx:absolute hx:-mt-20" id="the-pitch-oracle"></span>
    <a href="#the-pitch-oracle" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Beyond the four purely vocal modes, the app has what I call the Pitch Oracle. The idea: instead of evaluating your voice in isolation, the app downloads the video&rsquo;s reference audio via <code>youtube_explode_dart</code>, decodes it to PCM, runs YIN on it, and builds a timestamped pitch timeline of the entire song. During scoring, if the mic&rsquo;s pitch matches the reference&rsquo;s pitch at that moment in the video, it&rsquo;s probably speaker bleed, and gets ignored. If it differs, it&rsquo;s your voice, and gets scored.</p>
<p>The synchronization works through the <code>currentTime</code> of the HTML5 video element, sent to Dart through a JS <code>timeupdate</code> listener every ~250ms. The oracle queries the reference pitch at the exact playback position, accounting for pause, seek and speed change.</p>
<p>The first time you play a song, the oracle takes 5-15 seconds to download and analyze the audio. But the timeline is saved as JSON in the app&rsquo;s local cache (<code>pitch_oracle/&lt;videoId&gt;.json</code>). If you play the same song again, it loads instantly from cache, no network request. That also fixes YouTube&rsquo;s rate limiting problem for the songs you sing the most.</p>
<p>With the oracle active, the modes change behavior. Pitch Match compares the singer&rsquo;s pitch class against the reference&rsquo;s, agnostic to octave (like SingStar). Contour uses cross-correlation between the singer&rsquo;s pitch movement and the reference&rsquo;s. Intervals compares semitone jumps against the reference&rsquo;s.</p>
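<p>The gate that falls out of the oracle is small. A sketch, assuming the cached timeline is a sorted list of <code>[seconds, midi]</code> pairs and a 1-semitone bleed tolerance (both assumptions):</p>
<div><pre><code class="language-python">import bisect
import json

class PitchOracle:
    def __init__(self, cache_path: str):
        with open(cache_path) as f:
            timeline = json.load(f)      # [[time_s, midi], ...] sorted by time
        self.times = [t for t, _ in timeline]
        self.pitches = [p for _, p in timeline]

    def reference_at(self, video_time: float):
        """Reference pitch (MIDI) nearest the current playback position."""
        if not self.times:
            return None
        i = min(bisect.bisect_left(self.times, video_time), len(self.times) - 1)
        return self.pitches[i]

    def is_speaker_bleed(self, mic_midi: float, video_time: float,
                         tolerance: float = 1.0) -&gt; bool:
        """Mic pitch matching the reference right now is probably the speaker."""
        ref = self.reference_at(video_time)
        return ref is not None and abs(mic_midi - ref) &lt;= tolerance
</code></pre></div>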
<p>When YouTube blocks the download with rate limiting (happens after many consecutive requests from the same IP, clears in 15-30 minutes), the oracle silently fails and the modes fall back to purely vocal analysis.</p>
<h2>The road to here<span class="hx:absolute hx:-mt-20" id="the-road-to-here"></span>
    <a href="#the-road-to-here" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>The app you see now went through a lot of iteration before reaching this state.</p>
<p>First, I tried to make a Linux desktop version to make debugging easier. Makes sense, right? Test on the desktop, iterate fast, then port to mobile. The problem is that Flutter has no webview backend for Linux desktop. <code>webview_flutter</code> simply doesn&rsquo;t work. I tried <code>webview_cef</code>, which is based on the Chromium Embedded Framework. CEF spawns its own GPU process, and on Hyprland (a Wayland compositor based on wlroots) that conflicts with the compositor&rsquo;s render pipeline. On my NVIDIA setup, the entire Hyprland session froze. Locked screen, no keyboard response, I had to kill it from a TTY. On top of that, CEF requires downloading a ~200MB binary on the first build. I gave up on CEF and wrote a native bridge in C++ with Claude using WebKitGTK and Flutter method channels. It worked, but every YouTube quirk required separate code for Linux and Android. <code>just_audio</code> also has no Linux desktop implementation. The Linux version turned into dead weight. I deleted ~1,500 lines of Linux-specific code and focused only on Android.</p>
<p>Then came the Samsung mic saga. On my Galaxy Z Fold, the mic was capturing an absurdly low signal. Peaks of ~0.005, basically silence as far as pitch detection was concerned. I spent two hours trying to figure it out. I lowered thresholds, raised software gain to 50x, disabled audio preprocessors. Nothing was working right. Until I figured out the real problem: Android&rsquo;s <code>AutomaticGainControl</code>. The name says &ldquo;automatic gain control,&rdquo; which suggests it <em>amplifies</em> weak signals. In the Samsung DSP implementation, it does the opposite. It <em>attenuates</em> the signal to a low reference level, optimized for voice calls. With AGC on, the peak dropped from ~0.06 to ~0.003. Disabling AGC fixed it. But then the <code>audio_session</code> package was re-enabling AGC under the hood. I removed that one too. It was three rounds of fixes, each finding one more layer of the problem.</p>
<p>And the scoring. The scoring took longer than everything else combined. The first implementation used a cumulative average, which kept the score stuck at one value and never responded to live singing. I switched to a rolling window. Then the score was stuck at ~50% because of a bug in the primary score weight. I fixed it, and it started showing 70% even with nobody singing. Fixed it again. Streak mode wasn&rsquo;t resetting properly during silence. The chromatic snap was giving high scores for anything. The pitch history wasn&rsquo;t being cleared on silence gaps and the modes were going stagnant. Every fix revealed another bug. It took more than 25 commits just on the scoring, from the first prototype to the current state.</p>
<p>The result isn&rsquo;t perfect. I know. But it works well enough to be fun, which was the goal from the start.</p>
<h2>Settings<span class="hx:absolute hx:-mt-20" id="settings"></span>
    <a href="#settings" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/05/frank-karaoke/settings-panel.jpg" alt="Settings panel: mic presets, pitch shift, calibration"  loading="lazy" /></p>
<p>The settings panel lives behind the gear icon on the overlay. There are three mic presets for different environments (clean external mic, normal room, loud party), each adjusting confidence and amplitude thresholds. There&rsquo;s a pitch shift for when the song is too high for your vocal range. The shift moves both the video audio and the scoring at the same time: it uses the HTML5 element&rsquo;s <code>playbackRate</code> with <code>preservesPitch=false</code>, so +2 semitones speeds the audio up to 1.12x (pitch goes up) and -2 semitones slows it down to 0.89x (pitch goes down). The scoring compensates for the offset, so you sing in your comfortable range and the system grades you correctly. There&rsquo;s mic calibration, a 3-second process that measures the room noise and adapts the thresholds. And there&rsquo;s a restart to reset the score without reloading the video.</p>
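<p>The rate follows the equal-tempered ratio: for a shift of <em>n</em> semitones, <code>playbackRate = 2^(n/12)</code>. A two-line check of the numbers above:</p>
<div><pre><code class="language-python">def playback_rate(semitones: int) -&gt; float:
    """playbackRate for an n-semitone shift with preservesPitch=false."""
    return 2.0 ** (semitones / 12.0)

print(round(playback_rate(+2), 2))  # 1.12 -&gt; faster, pitch up
print(round(playback_rate(-2), 2))  # 0.89 -&gt; slower, pitch down
</code></pre></div>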
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/05/frank-karaoke/scoring-modes-selector.jpg" alt="Scoring mode selector"  loading="lazy" /></p>
<p>To switch scoring modes, tap the score box during playback.</p>
<h2>Usage flow<span class="hx:absolute hx:-mt-20" id="usage-flow"></span>
    <a href="#usage-flow" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><ol>
<li>Open the app. YouTube loads inside the app with the Frank Karaoke logo.</li>
<li>Search for a karaoke video. Any video works, but instrumental tracks with on-screen lyrics give better results.</li>
<li>The video pauses briefly to initialize the mic, download the song&rsquo;s data for the pitch oracle, and prepare the overlay. The first time with a new song this takes 5-15 seconds. If you&rsquo;ve played it before, it loads from cache instantly.</li>
<li>Sing. The &ldquo;live&rdquo; score reflects your current performance (exponential moving average with alpha 0.15, ~1 second response). The &ldquo;overall&rdquo; score is the cumulative average of the entire song.</li>
<li>When the video pauses, scoring pauses with it (so it doesn&rsquo;t score ambient noise). If you seek, the score resets and gets a 5-second warmup.</li>
</ol>
<h2>How to install<span class="hx:absolute hx:-mt-20" id="how-to-install"></span>
    <a href="#how-to-install" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>The app isn&rsquo;t on the Play Store yet, I&rsquo;m waiting for Google to verify my developer identity. It should show up there in the next few days. In the meantime, it&rsquo;s an open project and you can install it directly.</p>
<p>The easiest way is to download the signed APK directly from the <a href="https://github.com/akitaonrails/frank_karaoke/releases"target="_blank" rel="noopener">GitHub releases page</a>. On your Android phone or tablet, download <code>FrankKaraoke-0.2.0-android.apk</code>, open it and tap Install. If Android complains about &ldquo;unknown sources,&rdquo; enable it under Settings &gt; Security for your browser. On the first run the app will ask for mic permission. Then go into settings (the gear icon) and calibrate the mic before singing — three seconds.</p>
<p>If you want to compile from source or contribute, the repository is on <a href="https://github.com/akitaonrails/frank_karaoke"target="_blank" rel="noopener">GitHub</a>. You&rsquo;ll need Flutter SDK 3.10+, Android SDK API 24+, and a physical device for mic testing (an emulator doesn&rsquo;t give representative results).</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">git clone https://github.com/akitaonrails/frank_karaoke.git
</span></span><span class="line"><span class="cl"><span class="nb">cd</span> frank_karaoke
</span></span><span class="line"><span class="cl">flutter pub get
</span></span><span class="line"><span class="cl">flutter run -d &lt;device_id&gt;</span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>The README has the rest.</p>
<p>Stack: Flutter + Riverpod for state management, <code>webview_flutter</code> for YouTube, <code>youtube_explode_dart</code> for audio extraction, <code>record</code> for PCM mic capture, <code>audio_decoder</code> for reference decoding via Android MediaCodec, and the YIN algorithm implemented in pure Dart.</p>
<p>The technical documentation for the scoring system is in <a href="https://github.com/akitaonrails/frank_karaoke/blob/main/docs/scoring.md"target="_blank" rel="noopener"><code>docs/scoring.md</code></a> in the repository. It covers how SingStar, Joysound and DAM work, the academic papers, the pitch oracle architecture, the voice isolation problems on Android, and the roadmap.</p>
<h2>The scoring is experimental<span class="hx:absolute hx:-mt-20" id="the-scoring-is-experimental"></span>
    <a href="#the-scoring-is-experimental" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>I have to be straight: the scoring system is experimental. Without per-song reference files, the evaluation is approximate. The app measures whether you&rsquo;re in tune, whether you follow a melodic contour, whether your intervals are musical, whether you&rsquo;re consistent. But it doesn&rsquo;t tell you whether you&rsquo;re singing the correct melody for this specific song (unless the pitch oracle manages to download the audio, and that doesn&rsquo;t always work).</p>
<p>If you have experience with audio processing, pitch detection, or music evaluation, the repository is open and the research documentation in <a href="https://github.com/akitaonrails/frank_karaoke/blob/main/docs/scoring.md"target="_blank" rel="noopener"><code>docs/scoring.md</code></a> details what was tried, what works and what doesn&rsquo;t. In particular: tuning the modes&rsquo; thresholds, improving voice isolation, and integrating with <a href="https://github.com/rakuri255/UltraSinger"target="_blank" rel="noopener">UltraSinger</a> (which generates reference files from songs using Demucs + basic-pitch + WhisperX) are areas where contribution from people who know the subject would make a real difference. I&rsquo;d appreciate any help from specialists on calibrating these systems.</p>
<p>Oh, and the name. Frank Karaoke. It&rsquo;s a tribute to Sinatra. Who else?</p>
<p>Project on GitHub: <a href="https://github.com/akitaonrails/frank_karaoke"target="_blank" rel="noopener">github.com/akitaonrails/frank_karaoke</a></p>
]]></content:encoded><category>flutter</category><category>android</category><category>karaoke</category><category>audio</category><category>pitch-detection</category><category>AI</category><category>open-source</category></item><item><title>Bitcoin on the Home Server: Sovereignty and Privacy with Coldcard, Sparrow and Fulcrum</title><link>https://akitaonrails.github.io/en/2026/04/01/bitcoin-on-the-home-server-sovereignty-with-coldcard-sparrow-fulcrum/</link><guid isPermaLink="true">https://akitaonrails.github.io/en/2026/04/01/bitcoin-on-the-home-server-sovereignty-with-coldcard-sparrow-fulcrum/</guid><pubDate>Wed, 01 Apr 2026 19:00:00 GMT</pubDate><description>&lt;p&gt;This post is a direct follow-up to my recent articles about the &lt;a href="https://akitaonrails.github.io/en/2026/03/31/migrating-my-home-server-with-claude-code/"&gt;new home server with openSUSE MicroOS&lt;/a&gt; and the &lt;a href="https://akitaonrails.github.io/en/2026/03/31/minisforum-ms-s1-max-amd-ai-max-395-review/"&gt;Minisforum MS-S1 Max&lt;/a&gt;. Those covered the foundation. Here I want to show one concrete use for it: putting together a decent Bitcoin stack at home, focused on privacy, operational sovereignty and safe transactions on my side.&lt;/p&gt;
&lt;p&gt;First things first: this isn&amp;rsquo;t an evangelism piece or a day-trading pitch. Quite the opposite. As I write this, on April 1, 2026, Bitcoin is around US$ 68k and close to R$ 391k, below the 2025 peaks. Plenty of people look at that and either panic or start fantasizing about leveraged trades. I think both reactions are wrong. There&amp;rsquo;s a &amp;ldquo;super cycle&amp;rdquo; thesis floating around based on institutional demand, spot ETFs and the lagged halving effect. Maybe. Maybe not. What I do know is that short-term candles don&amp;rsquo;t change the part I actually care about: infrastructure. If you need leverage to &amp;ldquo;speed up your gains,&amp;rdquo; you&amp;rsquo;re probably just speeding up your chances of getting liquidated.&lt;/p&gt;</description><content:encoded><![CDATA[<p>This post is a direct follow-up to my recent articles about the <a href="/en/2026/03/31/migrating-my-home-server-with-claude-code/">new home server with openSUSE MicroOS</a> and the <a href="/en/2026/03/31/minisforum-ms-s1-max-amd-ai-max-395-review/">Minisforum MS-S1 Max</a>. Those covered the foundation. Here I want to show one concrete use for it: putting together a decent Bitcoin stack at home, focused on privacy, operational sovereignty and safe transactions on my side.</p>
<p>First things first: this isn&rsquo;t an evangelism piece or a day-trading pitch. Quite the opposite. As I write this, on April 1, 2026, Bitcoin is around US$ 68k and close to R$ 391k, below the 2025 peaks. Plenty of people look at that and either panic or start fantasizing about leveraged trades. I think both reactions are wrong. There&rsquo;s a &ldquo;super cycle&rdquo; thesis floating around based on institutional demand, spot ETFs and the lagged halving effect. Maybe. Maybe not. What I do know is that short-term candles don&rsquo;t change the part I actually care about: infrastructure. If you need leverage to &ldquo;speed up your gains,&rdquo; you&rsquo;re probably just speeding up your chances of getting liquidated.</p>
<p>For me, the useful question isn&rsquo;t &ldquo;is it going up tomorrow?&rdquo; The useful question is: &ldquo;if I want to store and move Bitcoin without outsourcing everything to an exchange, a web wallet and a public API, how do I set that up properly at home?&rdquo;</p>
<h2>The real problem: too much convenience costs too much privacy<span class="hx:absolute hx:-mt-20" id="the-real-problem-too-much-convenience-costs-too-much-privacy"></span>
    <a href="#the-real-problem-too-much-convenience-costs-too-much-privacy" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Most people&rsquo;s default flow is simple: buy on an exchange, leave the balance sitting there, or install some random wallet on the phone and call it done. It works. It also concentrates risk and leaks metadata everywhere.</p>
<p>If you leave a balance on an exchange, you have custody risk. If you use a desktop wallet pointed at a public server, you have privacy risk. If you use a hardware wallet casually, bought second-hand on Mercado Livre, you have supply chain risk. Mix all that with hurry, and it gets worse.</p>
<p>That&rsquo;s why I ended up at a combination that, for someone technical who wants to run their own infra, feels pretty solid:</p>
<ul>
<li>Coldcard for cold storage</li>
<li>Sparrow Wallet on Linux as the desktop wallet and transaction coordinator</li>
<li>Fulcrum on the home server as a private Electrum server</li>
<li>bitcoind on the same server as a real full node, validating the chain and broadcasting without depending on third parties</li>
</ul>
<p>It&rsquo;s not the easiest path. But that&rsquo;s exactly the point. Real security rarely comes from the easiest path.</p>
<h2>The concepts that confuse beginners<span class="hx:absolute hx:-mt-20" id="the-concepts-that-confuse-beginners"></span>
    <a href="#the-concepts-that-confuse-beginners" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Before getting into the stack, it&rsquo;s worth aligning on four terms that usually get tossed around like everyone already knows them:</p>
<table>
  <thead>
      <tr>
          <th>Concept</th>
          <th>What it is</th>
          <th>Why it matters</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Airgap</td>
          <td>A device that never touches the internet, not even over data USB</td>
          <td>Reduces the signer&rsquo;s attack surface</td>
      </tr>
      <tr>
          <td>PSBT</td>
          <td>Partially Signed Bitcoin Transaction</td>
          <td>Standard format for preparing, signing and finalizing transactions in stages</td>
      </tr>
      <tr>
          <td>Watch-only wallet</td>
          <td>A wallet that sees balances/addresses but doesn&rsquo;t hold a private key</td>
          <td>Great for the desktop: it observes and assembles the transaction, but doesn&rsquo;t sign</td>
      </tr>
      <tr>
          <td>Full node</td>
          <td>A node that validates blocks and protocol rules locally</td>
          <td>You don&rsquo;t have to &ldquo;trust&rdquo; anyone&rsquo;s API</td>
      </tr>
      <tr>
          <td>Electrum server</td>
          <td>An indexing layer that quickly answers wallet queries</td>
          <td>Without one, desktop wallets end up dependent on public servers</td>
      </tr>
  </tbody>
</table>
<p>In plain language, the flow looks like this:</p>
<ol>
<li>Sparrow, on the desktop, builds the transaction.</li>
<li>That transaction becomes a PSBT.</li>
<li>The PSBT goes to the Coldcard via microSD.</li>
<li>The Coldcard signs it offline.</li>
<li>The signed file goes back to Sparrow.</li>
<li>Sparrow broadcasts through your own server, not through someone else&rsquo;s public infrastructure.</li>
</ol>
<p>That&rsquo;s what people mean by &ldquo;airgapped workflow.&rdquo; It&rsquo;s not magic. It&rsquo;s just disciplined separation of roles.</p>
<h2>Coldcard: cold signer, offline, the right kind of annoying<span class="hx:absolute hx:-mt-20" id="coldcard-cold-signer-offline-the-right-kind-of-annoying"></span>
    <a href="#coldcard-cold-signer-offline-the-right-kind-of-annoying" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>I use <a href="https://coldcard.com/"target="_blank" rel="noopener">Coldcard</a> as cold storage. The reason is simple: it was designed from day one as a Bitcoin-only device, with a heavy focus on airgapped operation through microSD. That alone eliminates an entire category of &ldquo;conveniences&rdquo; that many people find practical, but that I&rsquo;d rather not have anywhere near my keys.</p>
<p><a href="https://coldcard.com/mk4"target="_blank" rel="noopener"><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/bitcoin-sovereignty/coldcard-mk4-official.png" alt="Coldcard Mk4 on the official Coinkite site"  loading="lazy" /></a></p>
<p>In practice, the Coldcard holds the most important part of the system: the private key. It doesn&rsquo;t need to know about a server, Electrum, public API, exchange, or any of that. Its job is one thing: sign transactions offline.</p>
<p>That decoupling is great for two reasons:</p>
<ul>
<li>The desktop can be convenient without becoming a single point of failure.</li>
<li>The signer stays isolated even if your main machine has problems.</li>
</ul>
<p>And here&rsquo;s a warning I really want to put in mental all-caps:</p>
<p><strong>Never buy a hardware wallet second-hand. Ever.</strong></p>
<p>This isn&rsquo;t an exaggeration. You have no way to actually know what happened to that device before it reached your hands. It could have a pre-generated seed, tampered firmware, swapped components, repackaged box, compromised supply chain, or simply some dumb trick waiting for you to let your guard down. Hardware wallet is one of those categories where saving R$ 300 buying used is insanity. Always buy from the manufacturer&rsquo;s official site or from a reseller officially authorized by the manufacturer. And even then, check seals, provenance and firmware.</p>
<h2>Can you do something similar with an old phone?<span class="hx:absolute hx:-mt-20" id="can-you-do-something-similar-with-an-old-phone"></span>
    <a href="#can-you-do-something-similar-with-an-old-phone" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>You can. But I&rsquo;d treat it as a study or budget alternative, not as an obvious substitute for a Coldcard.</p>
<p>The most serious path for that today is <a href="https://airgap.it/"target="_blank" rel="noopener">AirGap Vault</a>, which was specifically designed to use an old smartphone as an offline signer over QR codes, keeping the device off the network. The idea is good, and for many people it might be the right entry point.</p>
<p>But there are trade-offs:</p>
<ul>
<li>An old smartphone wasn&rsquo;t designed as a dedicated hardware wallet</li>
<li>The device&rsquo;s prior history matters</li>
<li>An aged battery, bad screen and abandoned Android are real problems</li>
<li>The threat model is less clear than on a dedicated device</li>
</ul>
<p>So my view is simple: can you use it? Yes. Would I recommend it as the main solution for storing meaningful wealth? No. For that I still prefer dedicated hardware bought from the right source.</p>
<h2>Sparrow Wallet: the best desktop piece in this puzzle<span class="hx:absolute hx:-mt-20" id="sparrow-wallet-the-best-desktop-piece-in-this-puzzle"></span>
    <a href="#sparrow-wallet-the-best-desktop-piece-in-this-puzzle" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>On Linux, I use <a href="https://sparrowwallet.com/"target="_blank" rel="noopener">Sparrow Wallet</a>. For me, today, it&rsquo;s one of the best pieces of software in this ecosystem.</p>
<p><a href="https://www.sparrowwallet.com/features/"target="_blank" rel="noopener"><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/bitcoin-sovereignty/sparrow-transactions.png" alt="Sparrow Wallet running, showing a detailed history of the wallet and transactions"  loading="lazy" /></a></p>
<p>What I like about it:</p>
<ul>
<li>works very well on Linux desktop</li>
<li>supports hardware wallets properly</li>
<li>understands PSBT without drama</li>
<li>makes it crystal clear what&rsquo;s happening in a transaction</li>
<li>it&rsquo;s great as a watch-only wallet</li>
</ul>
<p>In my flow, Sparrow does three things:</p>
<ol>
<li>Holds the watch-only wallet.</li>
<li>Builds the transaction with outputs and fees.</li>
<li>Receives the signature back from the Coldcard and broadcasts it.</li>
</ol>
<p>That separation is elegant. The desktop becomes the coordinator. The signer stays cold.</p>
<h2>Why Coldcard + Sparrow works so well<span class="hx:absolute hx:-mt-20" id="why-coldcard--sparrow-works-so-well"></span>
    <a href="#why-coldcard--sparrow-works-so-well" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>This combo is good because each piece does what it does best:</p>
<ul>
<li>the Coldcard protects the key</li>
<li>Sparrow organizes the human use of the wallet</li>
<li>the server handles the infrastructure</li>
</ul>
<p>A lot of wallets try to do everything. I prefer this modular design. It&rsquo;s less &ldquo;magic,&rdquo; more explicit, and easier to reason about without lying to yourself.</p>
<p>If I&rsquo;m at the desktop, I want visibility. If I&rsquo;m at the signer, I want isolation. If I&rsquo;m at the server, I want validation and a local index. That division is clean.</p>
<h2>The Sparrow problem when you don&rsquo;t run your own infra<span class="hx:absolute hx:-mt-20" id="the-sparrow-problem-when-you-dont-run-your-own-infra"></span>
    <a href="#the-sparrow-problem-when-you-dont-run-your-own-infra" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Now comes the important detail. Sparrow alone doesn&rsquo;t solve privacy.</p>
<p>If you install it, open it and just use public servers, the people on the other end learn quite a lot about your wallet: your address set, xpubs or derivations, balance, history, query behavior, broadcast. It&rsquo;s not custody, but it&rsquo;s still exposure.</p>
<p>That&rsquo;s the hole Fulcrum fills.</p>
<h2>Fulcrum: the home&rsquo;s private Electrum server<span class="hx:absolute hx:-mt-20" id="fulcrum-the-homes-private-electrum-server"></span>
    <a href="#fulcrum-the-homes-private-electrum-server" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p><a href="https://github.com/cculianu/Fulcrum"target="_blank" rel="noopener">Fulcrum</a> is an Electrum server. Instead of letting Sparrow ask things of a third-party public server, it asks my own server.</p>
<p>In practice, that means:</p>
<ul>
<li>local balance lookups</li>
<li>local history</li>
<li>local address discovery</li>
<li>local broadcast</li>
</ul>
<p>In other words: the desktop wallet stops &ldquo;phoning home&rdquo; to the world every time you open the program.</p>
<p>In my current setup, Sparrow points at a Fulcrum running on the home server on the LAN, with port <code>50001</code> on the internal network and <code>50002</code> with TLS.</p>
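<p>A quick way to confirm the server answers before pointing Sparrow at it: the Electrum protocol is newline-delimited JSON-RPC over TCP, and <code>server.version</code> is a standard method. A minimal Python check (the LAN address is a placeholder; adjust to your server):</p>
<div><pre><code class="language-python">import json
import socket

HOST, PORT = "192.168.1.10", 50001   # placeholder LAN address, plain-TCP port

request = {"id": 0, "method": "server.version",
           "params": ["sanity-check", "1.4"]}

with socket.create_connection((HOST, PORT), timeout=5) as sock:
    sock.sendall((json.dumps(request) + "\n").encode())
    reply = sock.makefile().readline()
    print(json.loads(reply)["result"])   # e.g. ["Fulcrum x.y.z", "1.4"]
</code></pre></div>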
<h2>And why Fulcrum isn&rsquo;t enough on its own<span class="hx:absolute hx:-mt-20" id="and-why-fulcrum-isnt-enough-on-its-own"></span>
    <a href="#and-why-fulcrum-isnt-enough-on-its-own" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Because Fulcrum doesn&rsquo;t replace a full node. It indexes on top of a full node.</p>
<p>The thing actually validating blocks, consensus rules, scripts, transactions and the chain is <code>bitcoind</code>. Fulcrum sits in front of it as an indexing layer, because plain Bitcoin Core wasn&rsquo;t built to serve a desktop wallet with that kind of fast querying.</p>
<p>So the correct architecture is:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">Coldcard (offline signer)
</span></span><span class="line"><span class="cl">        ^
</span></span><span class="line"><span class="cl">        | microSD / PSBT
</span></span><span class="line"><span class="cl">        v
</span></span><span class="line"><span class="cl">Sparrow Wallet (desktop watch-only + coordinator)
</span></span><span class="line"><span class="cl">        |
</span></span><span class="line"><span class="cl">        v
</span></span><span class="line"><span class="cl">Fulcrum (private Electrum server)
</span></span><span class="line"><span class="cl">        |
</span></span><span class="line"><span class="cl">        v
</span></span><span class="line"><span class="cl">bitcoind (full node)</span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<h2>What I actually brought up on the home server<span class="hx:absolute hx:-mt-20" id="what-i-actually-brought-up-on-the-home-server"></span>
    <a href="#what-i-actually-brought-up-on-the-home-server" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>On my home server, the stack lives in a dedicated Docker Compose folder and is made of two containers:</p>
<ul>
<li><code>bitcoin-bitcoind</code></li>
<li><code>bitcoin-fulcrum</code></li>
</ul>
<p>The compose is simple. And that&rsquo;s good. Sensitive infra doesn&rsquo;t gain anything by getting clever in YAML.</p>
<p>The main design is this:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">services</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">bitcoin</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">image</span><span class="p">:</span><span class="w"> </span><span class="l">lncm/bitcoind:v28.0</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">container_name</span><span class="p">:</span><span class="w"> </span><span class="l">bitcoin-bitcoind</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">user</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;${BITCOIN_UID}:${BITCOIN_GID}&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">restart</span><span class="p">:</span><span class="w"> </span><span class="l">always</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">security_opt</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span>- <span class="l">label:disable</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">volumes</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span>- <span class="l">/srv/bitcoin/data:/data/.bitcoin</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">ports</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span>- <span class="s2">&#34;8333:8333&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">stop_grace_period</span><span class="p">:</span><span class="w"> </span><span class="l">5m</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">healthcheck</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">test</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">&#34;CMD&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;bitcoin-cli&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;-datadir=/data/.bitcoin&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;ping&#34;</span><span class="p">]</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">interval</span><span class="p">:</span><span class="w"> </span><span class="l">30s</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">timeout</span><span class="p">:</span><span class="w"> </span><span class="l">10s</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">retries</span><span class="p">:</span><span class="w"> </span><span class="m">5</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">start_period</span><span class="p">:</span><span class="w"> </span><span class="l">60s</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">fulcrum</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">image</span><span class="p">:</span><span class="w"> </span><span class="l">cculianu/fulcrum:latest</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">container_name</span><span class="p">:</span><span class="w"> </span><span class="l">bitcoin-fulcrum</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">restart</span><span class="p">:</span><span class="w"> </span><span class="l">always</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">security_opt</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span>- <span class="l">label:disable</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">volumes</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span>- <span class="l">/srv/bitcoin/fulcrum:/data</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span>- <span class="l">/srv/bitcoin/data:/bitcoin:ro</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">command</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">&#34;Fulcrum&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;/data/fulcrum.conf&#34;</span><span class="p">]</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">ports</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span>- <span class="s2">&#34;50001:50001&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span>- <span class="s2">&#34;50002:50002&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">depends_on</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">bitcoin</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">condition</span><span class="p">:</span><span class="w"> </span><span class="l">service_healthy</span></span></span></code></pre></div></div>
</div>
<p>In my case, the restriction of <code>50001</code> to LAN happens at the host&rsquo;s network layer. The YAML above is the skeleton of the stack, not the entire firewall policy.</p>
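<p>If you also want part of that intent visible in the compose file itself, Docker can bind a published port to a specific host address. A minimal sketch for the <code>fulcrum</code> service, assuming <code>192.168.0.10</code> is the host&rsquo;s LAN-facing IP (a placeholder, not my real address):</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">
<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml">    ports:
      # publish the plaintext Electrum port only on the LAN-facing address
      # (192.168.0.10 is a placeholder for the host's real LAN IP)
      - "192.168.0.10:50001:50001"
      - "50002:50002"</code></pre></div></div>
</div>
<p>Either way, test the exposure from another machine instead of trusting the YAML: Docker inserts its own iptables rules, and they can sidestep host firewall policy you assumed was in effect.</p>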
<p>The most important parts of this:</p>
<ul>
<li><code>restart: always</code> because this is a long-running service</li>
<li>explicit volume so state isn&rsquo;t lost</li>
<li><code>user: &quot;${BITCOIN_UID}:${BITCOIN_GID}&quot;</code> because the persistent directory needs to match the storage&rsquo;s real ownership, so I&rsquo;d rather pin UID/GID explicitly than trust the image&rsquo;s default</li>
<li>the RPC isn&rsquo;t published on the host; it stays on the Compose internal network, which is all Fulcrum needs</li>
<li>the healthcheck uses Bitcoin Core&rsquo;s own local <code>.cookie</code>, so there&rsquo;s no need to spread a fixed password through commands</li>
<li>Fulcrum mounts the node&rsquo;s datadir as read-only just to authenticate via the <code>.cookie</code> without inventing parallel credentials</li>
<li>in <code>fulcrum.conf</code>, that becomes a simple configuration: talk to <code>bitcoin:8332</code> and read the mounted <code>.cookie</code>, instead of repeating credentials in plaintext</li>
<li><code>security_opt: label:disable</code> because this host runs MicroOS with SELinux, and with sensitive bind mounts I preferred the pragmatic route of disarming this one specific friction point rather than wasting time fighting labels on a volume that&rsquo;s already handled in a controlled way</li>
<li><code>depends_on</code> with <code>service_healthy</code> so Fulcrum only comes up after bitcoind&rsquo;s RPC is responding</li>
<li><code>stop_grace_period: 5m</code> because bitcoind needs real time to flush state on a graceful shutdown</li>
</ul>
<h2>The final version<span class="hx:absolute hx:-mt-20" id="the-final-version"></span>
    <a href="#the-final-version" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Today, the design I want to keep is this: <code>bitcoind</code> with <code>txindex</code>, <code>dbcache=1024</code>, persistent volume, 5-minute graceful stop, <code>.cookie</code> authentication, and Fulcrum in front serving Sparrow over LAN or TLS.</p>
<p>The current stack looks like this:</p>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>State</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Bitcoin Core</td>
          <td><code>28.0</code></td>
      </tr>
      <tr>
          <td>Fulcrum</td>
          <td><code>2.1.0</code></td>
      </tr>
      <tr>
          <td>Container stop timeout</td>
          <td><code>300</code> seconds</td>
      </tr>
      <tr>
          <td>Node data dir</td>
          <td>dedicated persistent volume mounted at <code>/data/.bitcoin</code></td>
      </tr>
      <tr>
          <td>Network</td>
          <td><code>8333</code> for P2P, RPC only on the Compose internal network, <code>50001/50002</code> for the private Electrum</td>
      </tr>
  </tbody>
</table>
<p>I&rsquo;m not interested in turning this into a spectacle. The point is simpler: the final infrastructure has to be boring, predictable and stable.</p>
<h2>The tunings that actually matter<span class="hx:absolute hx:-mt-20" id="the-tunings-that-actually-matter"></span>
    <a href="#the-tunings-that-actually-matter" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>There&rsquo;s no magic here. There are a few parameters that make a real difference and a bunch of stuff that just decorates compose files.</p>
<p><code>stop_grace_period: 5m</code> exists because bitcoind isn&rsquo;t a disposable stateless API container. It maintains chainstate, indexes and an in-memory cache. If you don&rsquo;t give the process time to flush all of that properly, you create unnecessary work for the next start, when the node has to replay or reindex whatever was lost.</p>
<p><code>user: &quot;${BITCOIN_UID}:${BITCOIN_GID}&quot;</code> is there for a much less glamorous and much more important reason: persistent storage with the wrong permissions is an excellent way to break a working service. So I&rsquo;d rather align the container with the volume&rsquo;s actual ownership instead of leaving that implicit.</p>
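<p>For completeness, those two variables live in the usual <code>.env</code> file next to the compose, which is where Compose picks up <code>${BITCOIN_UID}</code> and <code>${BITCOIN_GID}</code> from. The values below are illustrative; what matters is that they match the real owner of <code>/srv/bitcoin/data</code> (which <code>stat -c '%u:%g' /srv/bitcoin/data</code> will tell you):</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">
<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"># .env (example values; match the actual owner of /srv/bitcoin/data)
BITCOIN_UID=1000
BITCOIN_GID=1000</code></pre></div></div>
</div>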
<p><code>dbcache=1024</code> is the sweet spot I find most reasonable for a domestic node that&rsquo;s always on. Big enough not to suffer constant I/O, small enough that flushing it on every restart isn&rsquo;t a chore.</p>
<p><code>txindex=1</code> I keep because I want the complete node, not a minimalist install just to claim &ldquo;it runs Bitcoin.&rdquo; If the goal here is operational autonomy, I&rsquo;d rather have the full index.</p>
<p><code>rpcworkqueue=512</code> and <code>rpcthreads=16</code> are the kind of tweak that makes sense when you know you&rsquo;ll have Fulcrum querying the node all day and you want some headroom.</p>
<p>On the Fulcrum side, the main parameters are:</p>
<ul>
<li><code>db_mem = 8192</code></li>
<li><code>db_max_open_files = -1</code></li>
<li><code>bitcoind_clients = 8</code></li>
<li><code>worker_threads = 0</code></li>
<li><code>peering = false</code></li>
</ul>
<p>Again: nothing esoteric. Just enough cache, reasonable parallelism and absolutely no announcing this server as a public service.</p>
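<p>Putting the connection side and these tunings together, a minimal <code>fulcrum.conf</code> for this layout looks roughly like the sketch below. The option names (<code>datadir</code>, <code>bitcoind</code>, <code>rpccookie</code>, <code>tcp</code>, <code>ssl</code>, <code>cert</code>, <code>key</code>) come from Fulcrum&rsquo;s sample config, the paths match the mounts in the compose above, and the cert/key filenames are placeholders; double-check against the version you run:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">
<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"># Fulcrum's own database, on the /data volume
datadir = /data/fulc2_db

# bitcoind over the Compose internal network, cookie auth via the read-only mount
bitcoind = bitcoin:8332
rpccookie = /bitcoin/.cookie

# plaintext Electrum for the LAN, TLS for everything else
tcp = 0.0.0.0:50001
ssl = 0.0.0.0:50002
cert = /data/fulcrum.crt
key = /data/fulcrum.key

# the tunings listed above
db_mem = 8192
db_max_open_files = -1
bitcoind_clients = 8
worker_threads = 0
peering = false</code></pre></div></div>
</div>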
<p>In my current <code>bitcoin.conf</code>, the important core ended up like this:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="cl"><span class="na">server</span><span class="o">=</span><span class="s">1</span>
</span></span><span class="line"><span class="cl"><span class="na">txindex</span><span class="o">=</span><span class="s">1</span>
</span></span><span class="line"><span class="cl"><span class="na">prune</span><span class="o">=</span><span class="s">0</span>
</span></span><span class="line"><span class="cl"><span class="na">rpcbind</span><span class="o">=</span><span class="s">0.0.0.0</span>
</span></span><span class="line"><span class="cl"><span class="na">rpcallowip</span><span class="o">=</span><span class="s">172.16.0.0/12</span>
</span></span><span class="line"><span class="cl"><span class="na">rpcthreads</span><span class="o">=</span><span class="s">16</span>
</span></span><span class="line"><span class="cl"><span class="na">rpcworkqueue</span><span class="o">=</span><span class="s">512</span>
</span></span><span class="line"><span class="cl"><span class="na">dbcache</span><span class="o">=</span><span class="s">1024</span>
</span></span><span class="line"><span class="cl"><span class="na">maxmempool</span><span class="o">=</span><span class="s">512</span></span></span></code></pre></div></div>
</div>
<p>All of this makes sense on a server with decent RAM and fast NVMe. But the detail that matters most is still the clean shutdown. Wallet infrastructure has no room for &ldquo;we&rsquo;ll deal with it later&rdquo; thinking.</p>
<h2>The actual size of all this<span class="hx:absolute hx:-mt-20" id="the-actual-size-of-all-this"></span>
    <a href="#the-actual-size-of-all-this" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>This is another point a lot of people underestimate.</p>
<p>If you look at older Bitcoin Core documentation, you&rsquo;ll find numbers like 350 GB of disk for a node with default config. That&rsquo;s outdated. More current data on the size of the blockchain points to something around <strong>725.82 GB on March 11, 2026</strong>, and that&rsquo;s just the raw chain, without the extra indexes that many technical folks will want to keep.</p>
<p>And here comes the catch: the stack I&rsquo;m describing isn&rsquo;t &ldquo;a bare Bitcoin Core just to claim you run a node.&rdquo; It&rsquo;s <code>bitcoind</code> with <code>txindex</code>, plus Fulcrum, plus headroom for rebuild, logs, snapshots and normal network growth.</p>
<p>So to put together something similar today, I&rsquo;d think like this:</p>
<ul>
<li>below 1 TB: I wouldn&rsquo;t even start</li>
<li>1 TB: pragmatic minimum</li>
<li>2 TB: comfortable range</li>
<li>above that: if you want long-term headroom, snapshots and less operational anxiety</li>
</ul>
<p>And here&rsquo;s the most important observation of all in self-hosting: don&rsquo;t assume persistence, mount and backup are right just because the YAML looks clean. Verify.</p>
<p>Another thing I wouldn&rsquo;t forget on a btrfs host: put Fulcrum&rsquo;s database (<code>fulc2_db</code>) on a separate subvolume. The reason is mundane. That directory grows, changes constantly and has nothing to do with generic automated snapshots of <code>/var</code>. If you mix everything, you end up dragging a large rebuildable index along with system snapshots, burning space and making maintenance more annoying than it needed to be. The Fulcrum index isn&rsquo;t sensitive configuration. It&rsquo;s heavy, volatile, rebuildable data. I treat it exactly like that.</p>
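<p>As a sketch of that layout, the subvolume ends up as its own <code>/etc/fstab</code> entry, mounted straight into the path the container binds. The UUID and subvolume name below are placeholders:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">
<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"># /etc/fstab: Fulcrum's index on its own btrfs subvolume,
# outside the subvolumes that get automated snapshots
UUID=xxxx-xxxx  /srv/bitcoin/fulcrum  btrfs  subvol=@bitcoin-fulcrum,noatime  0 0</code></pre></div></div>
</div>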
<h2>Hardening: what I&rsquo;ve already applied<span class="hx:absolute hx:-mt-20" id="hardening-what-ive-already-applied"></span>
    <a href="#hardening-what-ive-already-applied" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>This is where the difference between &ldquo;ran on my laptop&rdquo; and &ldquo;I&rsquo;d trust this to operate my wallet&rdquo; shows up.</p>
<p>In the current state of the stack, the points I consider important ended up like this:</p>
<ul>
<li>Bitcoin Core&rsquo;s RPC is no longer exposed on the host unnecessarily; Fulcrum talks to <code>bitcoind</code> over the Docker internal network, which is what actually matters</li>
<li><code>50001</code> is restricted to internal LAN use</li>
<li><code>50002</code> is available with TLS, which is the right move when you need to leave plaintext behind</li>
<li>shutdown is graceful, with <code>stop_grace_period: 5m</code>, so <code>bitcoind</code> has time to flush state instead of dying any old way</li>
<li>the storage mount isn&rsquo;t on a &ldquo;we&rsquo;ll see later&rdquo; basis; there&rsquo;s a mount check before Docker comes up, precisely to avoid silent drift</li>
</ul>
<p>Each of those items exists for a very concrete reason.</p>
<p>Pulling the RPC off the host&rsquo;s surface reduces attack surface at zero cost. Fulcrum is already in the same Compose and can already talk to the service by its internal name. There&rsquo;s no real gain in leaving that port exposed where it doesn&rsquo;t need to be.</p>
<p>Separating <code>50001</code> and <code>50002</code> also helps keep the house in order. Within a controlled LAN, plaintext is acceptable. Outside of that, the minimum reasonable thing is TLS. Mixing the two scenarios usually turns into a mess.</p>
<p><code>stop_grace_period: 5m</code> looks like a container detail, but it isn&rsquo;t. Anyone who&rsquo;s ever had a database, an index or a blockchain node killed without grace knows how that turns into hours of work later. A stateful service needs a decent stop.</p>
<p>And the mount check is one of those annoying things that saves you from yourself. The YAML can look beautiful. If the storage didn&rsquo;t mount and the service came up writing where it shouldn&rsquo;t, you&rsquo;ve just manufactured a really irritating problem.</p>
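<p>One way to implement that check, as a sketch: a systemd drop-in that makes Docker refuse to start until the storage is actually mounted. The drop-in path assumes a stock <code>docker.service</code>; the filename is whatever you like:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">
<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"># /etc/systemd/system/docker.service.d/require-bitcoin-storage.conf
[Unit]
# pulls in and orders docker.service after the mount unit for /srv/bitcoin;
# if the mount is missing, Docker fails loudly instead of writing to a bare directory
RequiresMountsFor=/srv/bitcoin</code></pre></div></div>
</div>
<p>After a <code>systemctl daemon-reload</code>, a missing mount becomes an explicit failure instead of silent drift.</p>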
<p>There&rsquo;s also one detail I really like in this final version of the stack: Fulcrum authenticates to <code>bitcoind</code> through the <code>.cookie</code> file, not through a fixed plaintext password. That&rsquo;s interesting for two reasons:</p>
<ul>
<li>you don&rsquo;t need to leave a static credential showing up in compose, inspect, healthcheck or documentation</li>
<li>the authentication is more aligned with the way Bitcoin Core already knows how to operate locally</li>
</ul>
<p>In practical terms, that reduces accidental leakage of operational secrets. It&rsquo;s not a magic solution to everything, but it&rsquo;s much better than spreading <code>rpcuser</code> and <code>rpcpassword</code> across files, logs and commands.</p>
<p>The only kind of hardening I try to avoid here is the one that&rsquo;s overly performative in YAML and loose in operation. I&rsquo;d rather have less &ldquo;stage engineering&rdquo; and more basic discipline:</p>
<ul>
<li>minimum network</li>
<li>minimum secrets</li>
<li>minimum privilege</li>
<li>clean shutdown</li>
<li>verified storage</li>
<li>separate subvolume for large rebuildable data, like the Fulcrum index</li>
</ul>
<p>And, again, document everything. Good infrastructure isn&rsquo;t the kind that just works today. It&rsquo;s the kind that keeps working when you come back to it six months later.</p>
<h2>Why this improves transactions on your side<span class="hx:absolute hx:-mt-20" id="why-this-improves-transactions-on-your-side"></span>
    <a href="#why-this-improves-transactions-on-your-side" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>When I build a transaction in Sparrow and sign it on the Coldcard, the chain of trust is much better defined:</p>
<ul>
<li>the private key never touches the internet</li>
<li>the desktop wallet doesn&rsquo;t have to trust a public server</li>
<li>the broadcast can come out of my own node</li>
<li>the address history doesn&rsquo;t need to land on a third-party Electrum server</li>
</ul>
<p>This doesn&rsquo;t make anything invulnerable. There&rsquo;s still a risk of malware on the desktop, badly stored seeds, human error, social engineering and physical disaster. But the design becomes much more coherent.</p>
<h2>What about Lightning? Especially in Brazil?<span class="hx:absolute hx:-mt-20" id="what-about-lightning-especially-in-brazil"></span>
    <a href="#what-about-lightning-especially-in-brazil" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>There I separate things.</p>
<p>Reserves and larger-value transactions I treat one way. Day-to-day spending I treat another.</p>
<p>For day-to-day spending, especially in Brazil, I think it&rsquo;s operationally dumb to carry a lot of balance in a hot wallet. Lightning wallets and spending apps have to be almost like a &ldquo;pocket wallet&rdquo;: just enough for daily life.</p>
<p>That goes double if you use a hybrid or custodial solution like <a href="https://www.redotpay.com/"target="_blank" rel="noopener">RedotPay</a>. I get why it&rsquo;s interesting for Brazilians: a Hong Kong company, international focus, a reasonably practical bridge between crypto and card spending. For travel, online shopping and life outside the Brazilian banking axis, it makes sense. But I&rsquo;d never treat it as a place to store wealth. That&rsquo;s a spending tool, not a vault.</p>
<p>Same logic for <a href="https://www.bitrefill.com/br/pt/"target="_blank" rel="noopener">Bitrefill Brasil</a>. I think the service is interesting precisely because it solves a real pain in Brazil: turning sats into concrete utility without selling your full position or depending on banking integration all the time. Gift cards, top-ups, small expenses. As a use tool, it makes a lot of sense.</p>
<p>For a Lightning wallet on the phone, here&rsquo;s what I&rsquo;d look at first:</p>
<ul>
<li><a href="https://phoenix.acinq.co/"target="_blank" rel="noopener">Phoenix</a> for people who want something very good and simple</li>
<li><a href="https://breez.technology/"target="_blank" rel="noopener">Breez</a> for people who want a great payments experience</li>
<li><a href="https://zeusln.com/"target="_blank" rel="noopener">ZEUS</a> if you&rsquo;re more technical and you eventually plan to operate your own Lightning node</li>
</ul>
<p>All of them, in my head, fit into the &ldquo;pocket wallet&rdquo; category. Small balance. Daily use. Don&rsquo;t turn a phone app into a retirement vault.</p>
<h2>Recent news that reinforces this reasoning<span class="hx:absolute hx:-mt-20" id="recent-news-that-reinforces-this-reasoning"></span>
    <a href="#recent-news-that-reinforces-this-reasoning" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>I&rsquo;m not building this kind of stack because I think it&rsquo;s pretty. I&rsquo;m building it because outsourcing too much goes wrong too often.</p>
<p>Two recent examples:</p>
<ul>
<li>the <a href="https://www.cnbc.com/2025/02/21/hackers-steal-1point5-billion-from-exchange-bybit-biggest-crypto-heist.html"target="_blank" rel="noopener">Bybit hack in 2025</a> showed, again, the basic risk of leaving meaningful custody at an exchange</li>
<li>the <a href="https://techcrunch.com/2025/05/15/coinbase-says-customers-personal-information-stolen-in-data-breach/"target="_blank" rel="noopener">Coinbase customer data leak in 2025</a> showed the other side of the problem: even when custody isn&rsquo;t the immediate concern, your identity, balance and history become an attack surface</li>
</ul>
<p>A stack like Coldcard + Sparrow + Fulcrum + full node doesn&rsquo;t eliminate every risk in the world. But it avoids two very real classes of problem:</p>
<ul>
<li>losing custody sovereignty</li>
<li>handing over wallet and transaction privacy on a silver platter to third parties</li>
</ul>
<h2>So is it worth it?<span class="hx:absolute hx:-mt-20" id="so-is-it-worth-it"></span>
    <a href="#so-is-it-worth-it" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>For most people, honestly, probably not in the first month. It&rsquo;s labor-intensive, has a learning curve, and demands discipline.</p>
<p>But for programmers, engineers and any technical person who wants to learn not to depend on someone else&rsquo;s service all the time, I think it&rsquo;s an excellent exercise.</p>
<p>You learn about:</p>
<ul>
<li>separation of concerns</li>
<li>persistence and state</li>
<li>graceful shutdown</li>
<li>observability</li>
<li>secret isolation</li>
<li>the trade-off between convenience and security</li>
</ul>
<p>And all of that is valuable beyond Bitcoin.</p>
<p>In the end, that&rsquo;s what interests me most about this stack. It isn&rsquo;t about preaching &ldquo;hyperbitcoinization&rdquo; or posing as a price prophet. It&rsquo;s about building a system at home that I can trust more because I&rsquo;m the one who installed it, measured it, broke it, fixed it and documented it.</p>
<p>Is it work? Yes.</p>
<p>But that kind of work teaches exactly what modern software tries to make you forget: depending less on others is more work upfront, but it usually buys a lot more control in the long run.</p>
]]></content:encoded><category>bitcoin</category><category>homeserver</category><category>self-hosting</category><category>privacy</category><category>security</category><category>lightning</category></item><item><title>My Sim Racing Cockpit - Formula FX1</title><link>https://akitaonrails.github.io/en/2026/04/01/my-sim-racing-cockpit-formula-fx1/</link><guid isPermaLink="true">https://akitaonrails.github.io/en/2026/04/01/my-sim-racing-cockpit-formula-fx1/</guid><pubDate>Wed, 01 Apr 2026 17:00:00 GMT</pubDate><description>&lt;p&gt;I&amp;rsquo;ve loved cars for as long as I can remember. My first real contact with racing games was at the arcades of the 80s and 90s. And when I say &amp;ldquo;real,&amp;rdquo; I mean sitting at a cabinet with a wheel, pedals, a hard plastic seat and the screen wrapping in front of you. &lt;a href="https://en.wikipedia.org/wiki/Out_Run"target="_blank" rel="noopener"&gt;OutRun&lt;/a&gt; (1986), &lt;a href="https://en.wikipedia.org/wiki/Rad_Mobile"target="_blank" rel="noopener"&gt;Rad Mobile&lt;/a&gt; (1991), &lt;a href="https://en.wikipedia.org/wiki/Virtua_Racing"target="_blank" rel="noopener"&gt;Virtua Racing&lt;/a&gt; (1992), &lt;a href="https://en.wikipedia.org/wiki/Ridge_Racer"target="_blank" rel="noopener"&gt;Ridge Racer&lt;/a&gt; (1993), &lt;a href="https://en.wikipedia.org/wiki/Daytona_USA_%28video_game%29"target="_blank" rel="noopener"&gt;Daytona USA&lt;/a&gt; (1994), &lt;a href="https://en.wikipedia.org/wiki/Scud_Race"target="_blank" rel="noopener"&gt;Scud Race&lt;/a&gt; (1996). Every one of those games left a mark on me. But Daytona USA stuck in a different way. That twin cabinet, two machines side by side, the &amp;ldquo;DAYTONAAA, let&amp;rsquo;s go away&amp;rdquo; track blasting through the arcade, the wheel rumbling in your hand. I still remember it.&lt;/p&gt;</description><content:encoded><![CDATA[<p>I&rsquo;ve loved cars for as long as I can remember. My first real contact with racing games was at the arcades of the 80s and 90s. And when I say &ldquo;real,&rdquo; I mean sitting at a cabinet with a wheel, pedals, a hard plastic seat and the screen wrapping in front of you. <a href="https://en.wikipedia.org/wiki/Out_Run"target="_blank" rel="noopener">OutRun</a> (1986), <a href="https://en.wikipedia.org/wiki/Rad_Mobile"target="_blank" rel="noopener">Rad Mobile</a> (1991), <a href="https://en.wikipedia.org/wiki/Virtua_Racing"target="_blank" rel="noopener">Virtua Racing</a> (1992), <a href="https://en.wikipedia.org/wiki/Ridge_Racer"target="_blank" rel="noopener">Ridge Racer</a> (1993), <a href="https://en.wikipedia.org/wiki/Daytona_USA_%28video_game%29"target="_blank" rel="noopener">Daytona USA</a> (1994), <a href="https://en.wikipedia.org/wiki/Scud_Race"target="_blank" rel="noopener">Scud Race</a> (1996). Every one of those games left a mark on me. But Daytona USA stuck in a different way. That twin cabinet, two machines side by side, the &ldquo;DAYTONAAA, let&rsquo;s go away&rdquo; track blasting through the arcade, the wheel rumbling in your hand. I still remember it.</p>
<p><img src="https://upload.wikimedia.org/wikipedia/commons/2/24/DaytonaUSA_arcade_SaoPaulo.jpg" alt="Daytona USA machines at a São Paulo mall — exactly the kind of twin cabinet that’s burned into my memory"  loading="lazy" /></p>
<p>But the game that really got me hooked on the simcade genre was the original <a href="https://en.wikipedia.org/wiki/Gran_Turismo_%28video_game%29"target="_blank" rel="noopener">Gran Turismo</a>, in 1997, on the PlayStation 1. &ldquo;The Real Driving Simulator&rdquo; on the cover. I played that game obsessively. Around the same time I was watching the <a href="https://en.wikipedia.org/wiki/Initial_D"target="_blank" rel="noopener">Initial D</a> anime, which premiered in 1998 in Japan. I bought every manga volume and read all of it from start to finish. The story of Takumi Fujiwara going down Mount Akina at dawn delivering tofu in his father&rsquo;s AE86 is, to me, one of the best motorsport stories ever told in any medium.</p>
<p>I still follow Shuichi Shigeno&rsquo;s work today. After Initial D came <a href="https://kodansha.us/series/mf-ghost/"target="_blank" rel="noopener">MF Ghost</a> (2017-2025), set in the same universe but in a near future where combustion cars have become museum pieces. And now, since July 2025, I&rsquo;m reading <a href="https://kmanga.kodansha.com/title/10664/episode/360584"target="_blank" rel="noopener">Subaru and Subaru</a>, the direct sequel that ties the Initial D and MF Ghost universes together with two protagonists named Subaru — one from Gunma, one from Kanagawa — competing in a new racing series. It&rsquo;s Shigeno at his best.</p>
<p><a href="https://kmanga.kodansha.com/title/10664/episode/360584"target="_blank" rel="noopener"><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/subaru-x-subaru-cover.png" alt="Subaru and Subaru — Shigeno’s new manga, the direct sequel to Initial D and MF Ghost"  loading="lazy" /></a></p>
<p>Two years ago I traveled to Japan with my girlfriend and made a point of going to <a href="https://en.wikipedia.org/wiki/Daikoku_Parking_Area"target="_blank" rel="noopener">Daikoku PA</a>, the famous parking area on the Shuto Expressway in Yokohama where the JDM culture concentrates. As an old fan of <a href="https://store.steampowered.com/app/2634950/Tokyo_Xtreme_Racer/"target="_blank" rel="noopener">Tokyo Xtreme Racer</a>, by Genki, I needed to see Daikoku with my own eyes at least once. And it didn&rsquo;t disappoint. Instead of renting a car, we booked a tour with a local guide in his prepped Nissan GT-R. Better that way. On the drive there he explained the history of the Wangan, how the scene works, what&rsquo;s YouTube exaggeration and what&rsquo;s real. When we got there on a Friday night and I saw the whole thing in person — Skyline R34, RX-7, Supra, GT-R, tuned kei trucks, insane bosozoku — the feeling was strange in the best possible way. It looked like Tokyo Xtreme Racer, except with the smell of fuel in the air and the sound of real exhaust pipes.</p>
<p>And there&rsquo;s another detail: I&rsquo;m playing the new <a href="https://store.steampowered.com/app/2634950/Tokyo_Xtreme_Racer/"target="_blank" rel="noopener">Tokyo Xtreme Racer</a> reboot on PC, and it&rsquo;s exactly the kind of game that understands its own audience. Strong single-player campaign, addictive progression, the right vibe, and none of the loot box nonsense. I&rsquo;d recommend it without hesitation. For the same reason, I&rsquo;m also really looking forward to <a href="https://forza.net/forzahorizon6"target="_blank" rel="noopener">Forza Horizon 6</a>, which this time is going to be set in Japan. I&rsquo;ve already pre-ordered it and I can&rsquo;t wait to play it on the new cockpit.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/daikoku---nissan.jpg" alt="At Daikoku PA with a modified GT-R"  loading="lazy" /></p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/daikoku---trueno.jpg" alt="With an AE86 Trueno at Daikoku PA — Takumi’s car"  loading="lazy" /></p>
<h2>Driving for real<span class="hx:absolute hx:-mt-20" id="driving-for-real"></span>
    <a href="#driving-for-real" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Now that I&rsquo;m semi-retired, I&rsquo;ve had the chance to take my Mercedes to track days. I&rsquo;ve driven at <a href="https://en.wikipedia.org/wiki/Aut%C3%B3dromo_Jos%C3%A9_Carlos_Pace"target="_blank" rel="noopener">Autódromo de Interlagos</a> (the Autódromo José Carlos Pace), the 4.309 km circuit in São Paulo that&rsquo;s been hosting the Brazilian F1 GP since 1973, famous for the S do Senna corner complex and the circuit&rsquo;s wild elevation changes. I&rsquo;ve also driven at <a href="https://www.velocittapark.com.br/"target="_blank" rel="noopener">Autódromo Velocitta</a>, a modern 3.443 km circuit opened in 2014 in Mogi Guaçu in the interior of São Paulo, which hosts Stock Car Brasil and Porsche Cup.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/velocitta.jpg" alt="At Velocitta with my Mercedes during an AMG track day"  loading="lazy" /></p>
<p>In Las Vegas I&rsquo;ve driven supercars on those track-day experiences. And when I traveled with my girlfriend to Gramado, in Rio Grande do Sul, we went to <a href="https://supercarros.cc/"target="_blank" rel="noopener">Super Carros</a>, which is on Av. das Hortênsias 4635. They have a 2,400 m² hangar with more than 50 cars — Ferraris, Lamborghinis, Porsches, GT-Rs, Corvettes, American muscle cars. You pick a car, head out with an instructor, and drive a roughly 17 km route between Gramado and Canela. I took out a Nissan GT-R and a Ferrari California.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/gramado---gtr.jpg" alt="With a GT-R in Gramado"  loading="lazy" /></p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/gramado---ferrari.jpg" alt="With a Ferrari California in Gramado"  loading="lazy" /></p>
<p>Three years ago I also went to Abu Dhabi with my girlfriend and we went to Ferrari World, which has some of the best racing simulators I&rsquo;ve ever tried. Hydraulic platform with 6 degrees of freedom, F1 cockpit, the works. I&rsquo;ve always loved testing simulators wherever I go.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/ferrari-park.jpg" alt="F1 simulator with a hydraulic platform at Ferrari World in Abu Dhabi"  loading="lazy" /></p>
<p>But driving real cars on real tracks is a very expensive hobby. Tires, fuel, insurance, maintenance, registration. And more important: I&rsquo;m an introvert. I prefer being alone. My simulator cockpit is perfect for when I want to drive without having to deal with anyone. That&rsquo;s why I love rally so much — it&rsquo;s me, the virtual co-driver, and the road. Nothing else.</p>
<h2>The games I play<span class="hx:absolute hx:-mt-20" id="the-games-i-play"></span>
    <a href="#the-games-i-play" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>I know most people who build a cockpit like this do it to play serious sims — iRacing, Assetto Corsa Competizione, Automobilista 2. I respect that, but it isn&rsquo;t my thing. I don&rsquo;t like playing online with other people. I have zero intention of starting a live streaming career. This is purely for my own enjoyment.</p>
<p>These days I play Gran Turismo 7 on the PS5, <a href="https://store.steampowered.com/app/2440510/Forza_Motorsport/"target="_blank" rel="noopener">Forza Motorsport</a> (the 8, from 2023) on PC, but where I have the most fun is in rally games: <a href="https://store.steampowered.com/app/1849250/EA_SPORTS_WRC/"target="_blank" rel="noopener">EA SPORTS WRC</a>, <a href="https://store.steampowered.com/app/1462810/WRC_10_FIA_World_Rally_Championship/"target="_blank" rel="noopener">WRC 10</a> and <a href="https://store.steampowered.com/app/690790/DiRT_Rally_20/"target="_blank" rel="noopener">DiRT Rally 2.0</a>. My first experience with Forza was on the Xbox One with Forza Motorsport 5 and then Forza Horizon 4, which kept me hooked for hundreds of hours.</p>
<p>And I have a huge soft spot for retro games. The original Colin McRae Rally from 1998 on the PS1 was my first rally game. But my favorite of all time is Colin McRae Rally 2.0 (2000), also on the PS1. I recently played through the entire campaign again on the PC version — you can find repacks that run in high resolution and widescreen, much better than the original PlayStation versions. I&rsquo;d recommend that for any of the titles in the series.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/colin-mcrae-rally-2-gameplay.jpg" alt="Colin McRae Rally 2.0 running in widescreen on PC — snow in Sweden with the Ford Focus"  loading="lazy" /></p>
<p>After that came Colin McRae 3 (2002), Colin McRae Rally 04 (2003) and Colin McRae 2005 (2004). Other arcade racers I revisit often: OutRun 2 SP (2004) and OutRun 2006: Coast 2 Coast (2006) — the best OutRun ever made, in my opinion.</p>
<p>But my game of the year, by a long shot, is <a href="https://store.steampowered.com/app/3218630/Super_Woden_Rally_Edge/"target="_blank" rel="noopener">Super Woden: Rally Edge</a>. An indie made by a solo developer (ViJuDa, from Spain) that launched in January 2026 for less than R$ 60. Eight countries, more than 80 cars, a career mode, local split-screen multiplayer for up to 4 players, online leaderboards. The behind-the-car camera instead of the top-down view of the previous Super Woden GP made all the difference. 96% positive reviews on Steam with more than 1,300 ratings. It&rsquo;s the kind of game that proves you don&rsquo;t need a million-dollar budget to make something amazing.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/super-woden-rally-edge.jpg" alt="Super Woden: Rally Edge — indie made by a solo dev that competes with the big ones"  loading="lazy" /></p>
<h2>The evolution of my wheel setup<span class="hx:absolute hx:-mt-20" id="the-evolution-of-my-wheel-setup"></span>
    <a href="#the-evolution-of-my-wheel-setup" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><h3>Logitech G29 era (~2015-2021)<span class="hx:absolute hx:-mt-20" id="logitech-g29-era-2015-2021"></span>
    <a href="#logitech-g29-era-2015-2021" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>I always wanted a racing wheel. I started over 15 years ago with something equivalent to Logitech&rsquo;s entry-level wheel, a G29. The G29 is a fine wheel to start with — gear-driven force feedback, pedals with a clutch, 900 degrees of rotation. But its force feedback is noisy and a bit crude. You can feel the gears turning.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/logitech-g29-and-myself.jpeg" alt="Me playing with the Logitech G29 on the couch — the beginning of it all"  loading="lazy" /></p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/logitech-g29-with-support.jpg" alt="The G29 with a simple stand in front of the couch"  loading="lazy" /></p>
<h3>Thrustmaster T300RS + SXT V2 stand era (~2021-2024)<span class="hx:absolute hx:-mt-20" id="thrustmaster-t300rs--sxt-v2-stand-era-2021-2024"></span>
    <a href="#thrustmaster-t300rs--sxt-v2-stand-era-2021-2024" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Around 2021 I upgraded to the <a href="https://www.thrustmaster.com/products/t300rs-gt-edition/"target="_blank" rel="noopener">Thrustmaster T300RS</a>, a belt-driven wheel that&rsquo;s a huge jump up from the Logitech. The force feedback is much smoother and more precise. And I bought the <a href="https://loja.cockpitextremeracing.com.br/products/suporte-para-volantes-sxt-v2?variant=51111119454489"target="_blank" rel="noopener">Extreme Sim Racing SXT V2</a> stand, which is much sturdier than those generic desk clamps.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/thrustmaster-with-support.jpg" alt="The Thrustmaster T300RS mounted on the SXT V2 stand"  loading="lazy" /></p>
<p>First I set it up in front of my desktop PC, which at the time had an RTX 3090. It worked, but it was a hassle to keep mounting and unmounting the stand and the cables every time I wanted to play.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/thrustmaster-with-pc.jpg" alt="The Thrustmaster in front of the desktop PC — functional but annoying"  loading="lazy" /></p>
<p>Then I built a setup with a long fiber-optic HDMI cable to connect my 60&quot; TV to the PC at the back of the room. I moved the stand in front of the couch. Less of a hassle, but I still had to take it down whenever I wanted to watch a movie with my girlfriend.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/thrustmaster-with-big-tv.jpg" alt="Setup with the big TV — better, but still a hassle"  loading="lazy" /></p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/thrustmaster-in-the-living-room.jpg" alt="The stand in the living room — always in the way"  loading="lazy" /></p>
<p>Around 2024 or 2025 I swapped my couch for one of those VIP cinema couches from <a href="https://www.starseat.com.br/sofa-cinema-sofa-cinema"target="_blank" rel="noopener">Star Seat</a>, which reclines and the whole nine yards. The problem: it was way taller than the previous couch. I had to do all kinds of workarounds to make the stand work at that height. I even 3D-printed mounts and sent them to PCBWay to machine steel plates so I could attach big wheels under the stand and gain a few centimeters of height. But that left the setup way too wobbly to drive comfortably.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/fanatec-support-sofa-too-high.jpg" alt="The problem: VIP couch too tall for the stand — total kludge"  loading="lazy" /></p>
<h3>Fanatec CSL DD + Direct Drive era (~2024-2025)<span class="hx:absolute hx:-mt-20" id="fanatec-csl-dd--direct-drive-era-2024-2025"></span>
    <a href="#fanatec-csl-dd--direct-drive-era-2024-2025" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>In the meantime I gave the T300 to my brother and upgraded to a <a href="https://fanatec.com/eu-en/racing-wheels-wheel-bases/racing-wheels/gran-turismo-dd-pro-5-nm-wheel-base"target="_blank" rel="noopener">Fanatec CSL DD Gran Turismo Edition</a>. The CSL DD&rsquo;s direct drive motor delivers 5 Nm of torque at the base, but the Gran Turismo DD Pro kit comes with the Boost Kit 180 that brings it up to a sustained 8 Nm with no active cooling. Direct drive means there&rsquo;s no gear or belt between the motor and the wheel — the motor&rsquo;s shaft IS the wheel&rsquo;s shaft. The difference is absurd. The T300 was already great, much better than the Logitech. But the Fanatec is on another level. You feel every asphalt texture, every bump, every incipient slide. There&rsquo;s no going back once you try it.</p>
<p>I bought it together with the <a href="https://fanatec.com/eu-en/pedals/csl-pedals"target="_blank" rel="noopener">CSL Pedals with Load Cell</a> kit, which measures pressure on the brake instead of displacement. Makes all the difference in braking — you learn to modulate by foot pressure, not by how far the pedal moves. Way more natural.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/fanatec-direct-drive-close-up.jpg" alt="Close-up of the Fanatec CSL DD direct drive motor"  loading="lazy" /></p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/fanatec-csl-pedals.jpg" alt="Fanatec CSL Pedals with Load Cell Kit"  loading="lazy" /></p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/fanatec---shifter.jpg" alt="The Fanatec shifter — manual H-pattern with chrome knob"  loading="lazy" /></p>
<p>I also wanted to try the H-pattern manual shifter with the <a href="https://fanatec.com/us/en/p/add-ons/crd-9040002-ww/clubsport-shifter-sq"target="_blank" rel="noopener">ClubSport Shifter SQ V1.5</a> from Fanatec and a separate handbrake. It&rsquo;s fun to try out old cars with a clutch and an H-pattern, but in practice I never adapted. The SXT V2 stand was already shaking a lot with the direct drive, and using the shifter on that unstable setup was frustrating. And I know there are people who want to do heel-toe, but in the simulator I prefer to keep my left foot on the brake and my right on the gas and modulate both at the same time. Works better for me. Now that I have the McLaren wheel with the analog handbrake and clutch paddles right on the wheel, the H-pattern shifter and the external handbrake have been retired. For rally, an analog handbrake on the wheel is much more natural.</p>
<p>I also bought the PS5 with Gran Turismo 7 around this time. I put on the <a href="https://dbrand.com/shop/ps5"target="_blank" rel="noopener">dbrand Darkplates</a> matte black faceplates to replace the original white plates — it looks much better and more discreet.</p>
<p>But the setup was still the SXT V2 stand in front of the VIP couch. The same kludge. The same wobble. I obviously wasn&rsquo;t going to give up the cinema couch. The situation became unsustainable.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/fanatec-with-support.jpg" alt="The Fanatec CSL DD on the SXT V2 stand — too powerful for the stand to handle"  loading="lazy" /></p>
<h2>The computer and the hardware setup<span class="hx:absolute hx:-mt-20" id="the-computer-and-the-hardware-setup"></span>
    <a href="#the-computer-and-the-hardware-setup" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>A note on my gaming hardware. I bought a <a href="https://www.minisforum.com/products/UX790-Pro.html"target="_blank" rel="noopener">Minisforum UX790 Pro</a> to be my dedicated Steam machine. It&rsquo;s a mini-PC with an Intel Core Ultra 9 285H processor, fits in the palm of your hand. Together I bought the <a href="https://www.minisforum.com/products/minisforum-deg1-egpu-dock"target="_blank" rel="noopener">Minisforum DEG1</a>, an external GPU dock that connects via OCuLink (PCIe 4.0 x4, 64 GT/s). It&rsquo;s an open design — basically a board with a PCIe x16 slot and room for an ATX or SFX power supply. There&rsquo;s no card size limit, so an RTX 4090 fits comfortably. The performance loss compared to a native PCIe slot is minimal. I put the RTX 4090 in it. The 4090 came from my desktop — at the start of 2025 I went to Miami and took the chance to buy an RTX 5090 because I was using more and more local AI and LLMs. I gave my old 3090 to my girlfriend to use for video editing. The 4090 went into the mini-PC.</p>
<p>So my gaming setup today is: Minisforum UX790 Pro + eGPU with RTX 4090 for Steam and PC games, and a PlayStation 5 with matte black Darkplates for Gran Turismo 7 and exclusives.</p>
<h2>The cockpit: Formula FX1<span class="hx:absolute hx:-mt-20" id="the-cockpit-formula-fx1"></span>
    <a href="#the-cockpit-formula-fx1" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>To round out the setup I also needed a decent monitor. I was already used to my 80&quot; Samsung OLED TV in the living room and didn&rsquo;t want to downgrade picture quality. So I invested in the <a href="https://www.samsung.com/br/monitors/gaming/odyssey-oled-g8-g81sf-32-inch-240hz-oled-uhd-ls32fg810snxzd/"target="_blank" rel="noopener">Samsung Odyssey OLED G8 32&quot;</a>. It&rsquo;s a 4K (3840x2160) OLED monitor with a 240 Hz refresh rate, 0.03 ms response time (GTG), HDR True Black 400, HDR10+, 99% DCI-P3 coverage, 1,000,000:1 contrast, and FreeSync Premium Pro. It has 2 HDMI 2.1 inputs and 1 DisplayPort 1.4.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/closed-delivery-box.jpg" alt="The box of the Samsung Odyssey OLED G8 32&#34; when it arrived"  loading="lazy" /></p>
<p>In practice: the colors pop, black is real black (it&rsquo;s OLED, no backlight), and with the RTX 4090 I run most games at 4K and 120 fps with no problem. On lighter titles like Super Woden: Rally Edge, it easily hits 240 Hz. The smoothness is absurd. For a cockpit where you&rsquo;re 60-70 cm from the screen, 32&quot; OLED in 4K is the sweet spot. Bigger and you start to see pixels. Smaller and you lose the immersion.</p>
<p>In January 2026, after years of kludges, I finally ordered a dedicated cockpit. I researched a lot. I considered the <a href="https://loja.cockpitextremeracing.com.br/products/cockpit-ax160-horizontal?variant=52008751431961"target="_blank" rel="noopener">Cockpit AX160</a>, made of aluminum profile and very modular, and the <a href="https://loja.cockpitextremeracing.com.br/products/cockpit-4-0-horizontal?variant=52004294852889"target="_blank" rel="noopener">Cockpit 4.0</a>, which is the more traditional tubular steel kind. But neither was available at the time of purchase. And then I found the <a href="https://loja.cockpitextremeracing.com.br/products/cockpit-formula-fx1-preto-e-verde?variant=51700876509465"target="_blank" rel="noopener">Formula FX1 in black and green</a> from Extreme Racing — Petronas colors, F1-styled.</p>
<p>The FX1 is very different from traditional cockpits. The whole structure is welded thick steel tubing. When I say it doesn&rsquo;t shake, I mean it doesn&rsquo;t shake at all. Zero wobble. It&rsquo;s a brutal difference compared to a stand in front of the couch. The driving position is reclined, F1-style — your feet are at the same height or higher than your hips. You&rsquo;d think it would be uncomfortable, but it isn&rsquo;t. You can sit there for hours without complaining. It comes with a padded adjustable seat, an articulating monitor mount, a tilt-adjustable pedal mount, and a height-adjustable wheel mount.</p>
<p>I had to wait about a month for delivery. In the meantime, as anyone who follows my blog knows, I dove into a 16-hour-a-day marathon testing the new AI agents from Anthropic and OpenAI — check the <a href="/en/en/tags/vibe-coding/">#vibecoding</a> and <a href="/en/en/tags/ai/">#agents</a> tags to see everything I built. After about 30 days of that insane marathon, my lower back gave out and I started developing what looks like a herniated disc. I had to see a doctor and take heavy anti-inflammatories.</p>
<p>And right that week, the cockpit decided to arrive.</p>
<h3>The build<span class="hx:absolute hx:-mt-20" id="the-build"></span>
    <a href="#the-build" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>I was in absurd pain, but I built the cockpit anyway. It took an entire day to unbox and assemble the heavy steel pieces with my back screaming, but I did it.</p>
<p>The official assembly video I followed:</p>


<div class="embed-container">
  <iframe
    src="https://www.youtube.com/embed/WnocqqhmZas"
    title="YouTube video player"
    frameborder="0"
    allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share"
    referrerpolicy="strict-origin-when-cross-origin"
    allowfullscreen>
  </iframe>
</div>

<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/first-pieces-mounted.jpg" alt="First pieces mounted — the base and the seat structure"  loading="lazy" /></p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/mostly-mounted.jpg" alt="Almost complete structure — seat, wheel mount, pedals"  loading="lazy" /></p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/mostly-mounted-from-front.jpg" alt="Front view of the almost-finished structure"  loading="lazy" /></p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/almost-mounted,-with-monitor.jpg" alt="Almost complete assembly with the monitor mount"  loading="lazy" /></p>
<h3>The McLaren GT3 V2 wheel<span class="hx:absolute hx:-mt-20" id="the-mclaren-gt3-v2-wheel"></span>
    <a href="#the-mclaren-gt3-v2-wheel" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>After assembling the cockpit, I decided that the standard wheel that comes with the CSL GT kit wasn&rsquo;t enough. I upgraded to the <a href="https://www.racingwheelbrasil.com.br/produtos/volante-fanatec-csl-elite-mclaren-gt3-v2-pc-xbox-ps4-ps5-ready/"target="_blank" rel="noopener">Fanatec CSL Elite McLaren GT3 V2</a> (~R$ 4,990). It&rsquo;s a 1:1 scale replica of the McLaren GT3 wheel, with carbon fiber, an OLED display, and compatibility with PC, Xbox, PS4 and PS5.</p>
<p>What I like most about it: it has the normal shift paddles behind it (shift up/down), but it also has two additional analog paddles that can be configured in four different modes. In mode B, which is what I use, the left paddle works as an analog handbrake and the right one as an analog clutch. That&rsquo;s perfect for rally — I can pull the handbrake mid-corner without taking my hand off the wheel. It also has two 2-position toggles, two 12-position rotaries, 7 standard buttons with interchangeable caps, and Fanatec&rsquo;s 7-direction FunkySwitch. It&rsquo;s a complete racing controller.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/mclaren-wheel.jpg" alt="The McLaren GT3 V2 wheel mounted on the cockpit — carbon fiber and OLED display"  loading="lazy" /></p>
<h2>The final setup<span class="hx:absolute hx:-mt-20" id="the-final-setup"></span>
    <a href="#the-final-setup" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>The cockpit ended up in a corner of my bedroom, between the manga shelves (you can spot Akira, Initial D, and 500-something other volumes in the background). I mounted the mini-PC and the PS5 on the cockpit&rsquo;s side structure, together with the eGPU and the RTX 4090. Everything stays permanently connected. That&rsquo;s what makes the difference: I no longer have to set anything up or take it down. I sit, turn it on, and I&rsquo;m driving in 30 seconds.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/fully-mounted-from-side.jpg" alt="Side view of the complete cockpit — between the manga shelves"  loading="lazy" /></p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/all-consoles-from-front.jpg" alt="The consoles mounted on the structure: PS5 with Darkplates, Minisforum UX790 Pro, eGPU with RTX 4090"  loading="lazy" /></p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/close-up-of-the-consoles-from-front.jpg" alt="Close-up of the consoles and cabling"  loading="lazy" /></p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/sitting-angle.jpg" alt="The driver’s perspective — this is what I see when I sit down"  loading="lazy" /></p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/monitor-on-view-from-front.jpg" alt="PS5 showing the game list — Gran Turismo 7 at the top"  loading="lazy" /></p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/another-fully-mounted-shot.jpg" alt="The complete cockpit seen from behind"  loading="lazy" /></p>
<div style="max-width: 100%; margin: 1em 0;">
  <video controls playsinline style="width: 100%; border-radius: 8px;">
    <source src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/from-front-gran-turismo-h264.mp4" type="video/mp4">
  </video>
  <em>Gran Turismo 7 running on the final cockpit</em>
</div>
<h2>The audio system<span class="hx:absolute hx:-mt-20" id="the-audio-system"></span>
    <a href="#the-audio-system" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>To round out the setup I needed dedicated audio. I didn&rsquo;t want to use the monitor&rsquo;s audio (terrible) and I didn&rsquo;t want to be on a headset all the time. The fix was to build a separate audio system with HDMI audio extraction.</p>
<p>The centerpiece is an <a href="https://www.amazon.com.br/dp/B00MNGIP2Y"target="_blank" rel="noopener">HDMI 2.1 switcher from OREI</a> with audio extraction. It has 2 inputs and 1 HDMI output, supports 4K at 120Hz (48 Gbps of bandwidth), and extracts the audio through optical TOSLINK and 3.5mm. I connect the HDMI output of the RTX 4090 to one input and the PS5 to the other. Video goes to the monitor. Audio goes out through the optical port.</p>
<p>The optical audio goes to an <a href="https://www.mercadolivre.com.br/amplificador-de-potncia-aiyima-d03-bluetooth-50-150-watts-cor-preto/p/MLB46172770"target="_blank" rel="noopener">Aiyima D03 amplifier</a>, a compact 2.1 channel amp with 150W per channel, integrated DAC (PCM1808 chip), and Bluetooth 5.0 with aptX HD. It has optical, coaxial, USB, RCA and Bluetooth inputs. It even has a dedicated subwoofer output for when I get around to adding one. It uses Texas Instruments&rsquo; TAS5624 amplifier chip and has bass and treble control through the remote. For a cockpit setup where you&rsquo;re 1 meter from the speakers, 150W is more than enough.</p>
<p>In practice, I keep the amp at 50% and Windows volume at 50%, and that&rsquo;s already loud as hell. Which is to say: this isn&rsquo;t a &ldquo;good enough&rdquo; little system. It&rsquo;s set up to actually go loud if I want.</p>
<p>The speakers are <a href="https://edifier.com.br/caixa-de-som-passiva-p12-madeira-edifier.html"target="_blank" rel="noopener">Edifier P12</a>, passive, with a 4-inch woofer and a 19mm tweeter. Frequency response from 55Hz to 20kHz, 6 ohm impedance, 20W RMS each. The MDF cabinet with wood finish has a rear bass-reflex port that helps the lows. For passive speakers this size, they deliver well. The mid-range is clean and the highs don&rsquo;t distort even at high volume.</p>
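<p>To put a number on &ldquo;more than enough&rdquo;, here&rsquo;s a rough back-of-envelope SPL estimate. One big caveat: the sensitivity figure below is my assumption of a typical value for a 4-inch passive bookshelf, since I couldn&rsquo;t confirm a published spec for the P12, so treat this as a sketch, not a measurement:</p>
<pre><code class="language-python">import math

# Back-of-envelope SPL at 1 m for a passive bookshelf speaker.
# ASSUMPTION: sensitivity of ~87 dB SPL @ 1 W / 1 m -- a typical figure
# for this class of speaker; Edifier doesn't publish one I could verify.
SENSITIVITY_DB = 87.0

def spl_at_1m(watts: float) -> float:
    """Estimated SPL one meter from the speaker at a given input power."""
    return SENSITIVITY_DB + 10 * math.log10(watts)

for watts in (1, 5, 20, 150):
    print(f"{watts:>4} W -> ~{spl_at_1m(watts):.0f} dB SPL @ 1 m")
# 1 W -> ~87 dB; 20 W (the P12's RMS rating) -> ~100 dB;
# 150 W -> ~109 dB, academic, since the speakers give out long before the amp.
</code></pre>
<p>If that assumption is in the right ballpark, the P12s hit roughly 100 dB at one meter before reaching their RMS limit, which squares with the amp at 50% already being loud as hell.</p>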
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/audio-equipment.jpg" alt="The audio gear — Aiyima D03 box and Edifier P12"  loading="lazy" /></p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/audio-speakers-next-to-other-stuff.jpg" alt="The Edifier P12s positioned on the shelves next to the cockpit, alongside the Akira collection and car miniatures"  loading="lazy" /></p>
<p>The setup logic: the HDMI switcher selects between PS5 and PC and extracts the audio to optical, the amp converts and amplifies, and the passive speakers deliver the sound. All without touching the monitor or swapping cables. I press one button on the switcher and change sources.</p>
<p>When I want to play without bothering anyone, I plug my <a href="https://www.mercadolivre.com.br/fones-de-ouvido-meze-audio-109-pro-com-fio-de-madeira-com-en/p/MLB42456685"target="_blank" rel="noopener">Meze 109 Pro</a> directly into the 3.5mm output of the HDMI switcher. The Meze 109 Pro is an open-back headphone with 50mm dynamic drivers, 40 ohm impedance, 112 dB SPL/1mW sensitivity, and 5Hz to 30kHz response. The ear cups are walnut wood with a handcrafted finish. It&rsquo;s an audiophile headphone that works perfectly without a dedicated amplifier, thanks to the low impedance and high sensitivity. The sound is warm, with full bass and rich mids. You can hear every detail of the engines.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/meze-headset.jpg" alt="Meze 109 Pro — walnut wood, audiophile sound"  loading="lazy" /></p>
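<p>The published specs are actually enough to sanity-check the &ldquo;no dedicated amp&rdquo; claim. A minimal sketch using the 112 dB SPL/mW and 40 ohm figures above; the 0.5&ndash;1 Vrms range I quote for a typical 3.5mm output is a general assumption, not something I measured on the OREI:</p>
<pre><code class="language-python">import math

# Drive requirements for the Meze 109 Pro, from its published specs:
# 112 dB SPL per 1 mW of input, 40 ohm impedance.
SENS_DB_PER_MW = 112.0
IMPEDANCE_OHM = 40.0

def power_mw(target_db: float) -> float:
    """Milliwatts required to reach a target SPL."""
    return 10 ** ((target_db - SENS_DB_PER_MW) / 10)

def volts_rms(target_db: float) -> float:
    """RMS voltage across the 40-ohm load for that power (P = V^2 / R)."""
    return math.sqrt(power_mw(target_db) / 1000 * IMPEDANCE_OHM)

for db in (100, 110):
    print(f"{db} dB SPL: ~{power_mw(db):.2f} mW, ~{volts_rms(db):.2f} Vrms")
# 100 dB SPL: ~0.06 mW, ~0.05 Vrms
# 110 dB SPL: ~0.63 mW, ~0.16 Vrms
# A typical 3.5mm output (~0.5-1 Vrms) has headroom to spare.
</code></pre>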
<p>I haven&rsquo;t picked a subwoofer yet, but it&rsquo;ll be my next upgrade. A dedicated sub will add the low-end weight that makes you feel the engine in your chest.</p>
<h2>The verdict<span class="hx:absolute hx:-mt-20" id="the-verdict"></span>
    <a href="#the-verdict" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>The couch with a stand works. The PC desk with a stand works. But neither comes anywhere close to a dedicated cockpit. The FX1&rsquo;s steel structure doesn&rsquo;t move a millimeter, even with the Fanatec direct drive at max torque. The reclined F1 position is comfortable for hours-long sessions. The load cell pedals stay firmly planted on the base. The monitor is at exactly the right height and distance. And best of all: it&rsquo;s always ready. I don&rsquo;t need to assemble anything, take anything down, or run cables. I sit and I drive.</p>
<p>For anyone who&rsquo;s wondering whether it&rsquo;s worth investing in a dedicated cockpit instead of staying with a desk or couch stand: it is. If you already have a direct drive wheel, the cockpit is the missing piece. I spent years thinking &ldquo;this is fine&rdquo; with the stand on the couch. It wasn&rsquo;t fine. The difference in driveability is something else entirely. And for my case — introvert, single-player only, simcade — I couldn&rsquo;t have built it any sooner. To be honest, I think I&rsquo;ve finally landed on the simulator setup that&rsquo;s perfect for my taste.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/monitor-on-view.jpg" alt="The monitor on with the driver’s view"  loading="lazy" /></p>
<div style="max-width: 100%; margin: 1em 0;">
  <video controls playsinline style="width: 100%; border-radius: 8px;">
    <source src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/video-testing-h264.mp4" type="video/mp4">
  </video>
  <em>Me driving on the final cockpit</em>
</div>
<h2>Shopping list: how much it all cost<span class="hx:absolute hx:-mt-20" id="shopping-list-how-much-it-all-cost"></span>
    <a href="#shopping-list-how-much-it-all-cost" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Here&rsquo;s the consolidated list of everything in my current setup, with approximate prices (some items were bought in dollars and converted to reais at the exchange rate at the time):</p>
<table>
  <thead>
      <tr>
          <th>Item</th>
          <th style="text-align: right">Estimated Price (R$)</th>
          <th>Link</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://loja.cockpitextremeracing.com.br/products/cockpit-formula-fx1-preto-e-verde?variant=51700876509465"target="_blank" rel="noopener">Cockpit Formula FX1 Black and Green</a></td>
          <td style="text-align: right">~6,290</td>
          <td>Extreme Racing</td>
      </tr>
      <tr>
          <td><a href="https://fanatec.com/us/en/p/sim-racing-bundles/crd-9020007-8nm-us/gran-turismo-dd-pro-8nm-qr2l-us"target="_blank" rel="noopener">Fanatec Gran Turismo DD Pro 8Nm (motor + wheel + pedals + Boost Kit)</a></td>
          <td style="text-align: right">~9,590</td>
          <td>Fanatec / Racing Wheel Brasil</td>
      </tr>
      <tr>
          <td><a href="https://fanatec.com/us/en/p/pedals/csl_p_lc/csl_pedals_lc"target="_blank" rel="noopener">Fanatec CSL Pedals LC (with Load Cell)</a></td>
          <td style="text-align: right">~1,500</td>
          <td>Fanatec</td>
      </tr>
      <tr>
          <td><a href="https://www.racingwheelbrasil.com.br/produtos/volante-fanatec-csl-elite-mclaren-gt3-v2-pc-xbox-ps4-ps5-ready/"target="_blank" rel="noopener">Fanatec CSL Elite McLaren GT3 V2</a></td>
          <td style="text-align: right">~4,990</td>
          <td>Racing Wheel Brasil</td>
      </tr>
      <tr>
          <td><a href="https://fanatec.com/us/en/p/add-ons/crd-9040002-ww/clubsport-shifter-sq"target="_blank" rel="noopener">Fanatec ClubSport Shifter SQ V1.5</a></td>
          <td style="text-align: right">~2,500</td>
          <td>Fanatec</td>
      </tr>
      <tr>
          <td><a href="https://www.minisforum.com/products/UX790-Pro.html"target="_blank" rel="noopener">Minisforum UX790 Pro</a></td>
          <td style="text-align: right">~5,000</td>
          <td>Minisforum</td>
      </tr>
      <tr>
          <td><a href="https://www.minisforum.com/products/minisforum-deg1-egpu-dock"target="_blank" rel="noopener">Minisforum DEG1 eGPU Dock</a> + RTX 4090</td>
          <td style="text-align: right">~12,000</td>
          <td>Minisforum / bought separately</td>
      </tr>
      <tr>
          <td>PlayStation 5 + <a href="https://dbrand.com/shop/limited-edition/ps5"target="_blank" rel="noopener">dbrand Darkplates</a></td>
          <td style="text-align: right">~4,500</td>
          <td>Sony / dbrand</td>
      </tr>
      <tr>
          <td><a href="https://www.samsung.com/br/monitors/gaming/odyssey-oled-g8-g81sf-32-inch-240hz-oled-uhd-ls32fg810snxzd/"target="_blank" rel="noopener">Samsung Odyssey OLED G8 32&quot;</a></td>
          <td style="text-align: right">~2,500</td>
          <td>Samsung</td>
      </tr>
      <tr>
          <td><a href="https://www.amazon.com.br/dp/B00MNGIP2Y"target="_blank" rel="noopener">OREI BK-21A HDMI 2.1 Switcher 2x1 with audio extraction</a></td>
          <td style="text-align: right">~450</td>
          <td>Amazon</td>
      </tr>
      <tr>
          <td><a href="https://www.mercadolivre.com.br/amplificador-de-potncia-aiyima-d03-bluetooth-50-150-watts-cor-preto/p/MLB46172770"target="_blank" rel="noopener">Aiyima D03 Amplifier</a></td>
          <td style="text-align: right">~900</td>
          <td>Mercado Livre</td>
      </tr>
      <tr>
          <td><a href="https://edifier.com.br/caixa-de-som-passiva-p12-madeira-edifier.html"target="_blank" rel="noopener">Edifier P12 (pair)</a></td>
          <td style="text-align: right">~799</td>
          <td>Edifier</td>
      </tr>
      <tr>
          <td><a href="https://www.mercadolivre.com.br/fones-de-ouvido-meze-audio-109-pro-com-fio-de-madeira-com-en/p/MLB42456685"target="_blank" rel="noopener">Meze 109 Pro</a></td>
          <td style="text-align: right">~5,390</td>
          <td>Mercado Livre / Heinrich Audio</td>
      </tr>
      <tr>
          <td>Cables (HDMI 2.1, optical, 3.5mm, power)</td>
          <td style="text-align: right">~300</td>
          <td>Various</td>
      </tr>
      <tr>
          <td><strong>TOTAL ESTIMATED</strong></td>
          <td style="text-align: right"><strong>~56,709</strong></td>
          <td></td>
      </tr>
  </tbody>
</table>
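<p>For the programmers in the audience, a quick check that the table actually adds up (values copied straight from it):</p>
<pre><code class="language-python"># Sanity check on the shopping-list total (approximate values in R$).
prices = {
    "Cockpit Formula FX1": 6_290,
    "Fanatec GT DD Pro 8Nm bundle": 9_590,
    "Fanatec CSL Pedals LC": 1_500,
    "Fanatec CSL Elite McLaren GT3 V2": 4_990,
    "Fanatec ClubSport Shifter SQ V1.5": 2_500,
    "Minisforum UX790 Pro": 5_000,
    "Minisforum DEG1 eGPU Dock + RTX 4090": 12_000,
    "PlayStation 5 + dbrand Darkplates": 4_500,
    "Samsung Odyssey OLED G8 32in": 2_500,
    "OREI BK-21A HDMI 2.1 Switcher": 450,
    "Aiyima D03 Amplifier": 900,
    "Edifier P12 (pair)": 799,
    "Meze 109 Pro": 5_390,
    "Cables": 300,
}
print(f"Total: ~R$ {sum(prices.values()):,}")  # Total: ~R$ 56,709
</code></pre>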
<p>Yes, almost R$ 57k is a lot of money. I worked like a dog for decades. Now that I&rsquo;ve managed to retire honestly, my family is well taken care of, I have no debts, and I can finally give myself something I always wanted as a kid but couldn&rsquo;t afford. When I sat at those OutRun and Daytona USA cabinets at the arcade, I dreamed of having something like this at home. It took 30-something years, but I got there.</p>
<p>And if you add up the years of kludges, stands that didn&rsquo;t work, 15-meter HDMI cables, 3D prints, machined steel plates, and the frustration of mounting and unmounting everything — a dedicated cockpit saves your sanity. Unlike a PC that depreciates fast, a steel cockpit lasts decades.</p>
]]></content:encoded><category>off-topic</category><category>gaming</category><category>sim-racing</category><category>cockpit</category><category>fanatec</category><category>gran-turismo</category><category>initial-d</category></item></channel></rss>