<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>AkitaOnRails.com</title><link>https://www.akitaonrails.com/</link><description>Fabio Akita's blog — tech, career, and assorted geek topics. English edition.</description><generator>Hugo -- gohugo.io</generator><language>en</language><lastBuildDate>Tue, 07 Apr 2026 05:18:02 GMT</lastBuildDate><atom:link href="https://www.akitaonrails.com/index.xml" rel="self" type="application/rss+xml"/><item><title>Is RAG Dead? Long Context, Grep, and the End of the Mandatory Vector DB</title><link>https://www.akitaonrails.com/en/2026/04/06/rag-is-dead-long-context/</link><guid isPermaLink="true">https://www.akitaonrails.com/en/2026/04/06/rag-is-dead-long-context/</guid><pubDate>Mon, 06 Apr 2026 11:00:00 GMT</pubDate><description>&lt;p&gt;This is one of those itches I can&amp;rsquo;t scratch. Back in the early LLM days, around 2022/2023, we had 4k of context on GPT 3.5, 8k if you were lucky, 32k was a luxury. To do anything with a real document you had no choice: chop the text into pieces, generate embeddings, throw them in a vector database, do similarity search, grab the top-5 chunks, and pray the right ones came back.&lt;/p&gt;</description><content:encoded><![CDATA[<p>This is one of those itches I can&rsquo;t scratch. Back in the early LLM days, around 2022/2023, we had 4k of context on GPT 3.5, 8k if you were lucky, 32k was a luxury. To do anything with a real document you had no choice: chop the text into pieces, generate embeddings, throw them in a vector database, do similarity search, grab the top-5 chunks, and pray the right ones came back.</p>
<p>Then it became an industry. Pinecone, Weaviate, Qdrant, Chroma, Milvus, pgvector, LangChain, LlamaIndex, Haystack. Tutorials everywhere, &ldquo;build your chatbot with your PDFs,&rdquo; entire consultancies feeding off this. It became the &ldquo;hello world&rdquo; of applied LLMs: document → chunk → embed → vector DB → query.</p>
<p>Today, in April 2026, Claude Opus 4.6 has 1 million tokens of context. Sonnet 4.6 too. Gemini 3.1 Pro too. GPT 5.4 has a smaller window but still in the comfortable range, in the hundreds of thousands. And some models already have experimental 2M token modes. The question that keeps nagging at me: what on earth do I need a vector stack for, to solve a problem that fits inside the model&rsquo;s window?</p>
<p>And there&rsquo;s more: vector databases have real problems nobody wants to talk about. False neighbors. Arbitrary chunking that splits a definition from its usage. Embeddings that age badly. Not to mention that when the result is wrong, you have absolutely no idea why.</p>
<p>The thesis I&rsquo;ve been chewing on is simple: in most cases, a well-aimed <code>grep</code> plus a generous context window beats a full RAG stack. It&rsquo;s cheaper, it&rsquo;s easier to maintain, and when it breaks you can actually debug it. Let&rsquo;s break this down.</p>
<h2>What the Claude Code leak showed<span class="hx:absolute hx:-mt-20" id="what-the-claude-code-leak-showed"></span>
    <a href="#what-the-claude-code-leak-showed" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Before we get into the theory, let me bring up something that happened a few days ago that backs this whole argument up. On March 31, 2026, Anthropic, by accident, published version 2.1.88 of the <code>@anthropic-ai/claude-code</code> package on npm with a nearly 60 MB source map attached, and roughly 512,000 lines of TypeScript from their internal tool leaked into the wild. I already <a href="/en/2026/03/31/claude-code-source-code-leaked-what-we-found-inside/">wrote about the incident last week</a>, with more detail on what showed up in the code.</p>
<p>The part that matters for this discussion is Claude Code&rsquo;s memory system. Instead of dumping everything into a vector DB, the architecture has three layers. There&rsquo;s a <code>MEMORY.md</code> that stays permanently loaded in context, but it doesn&rsquo;t hold any actual data: it&rsquo;s just an index of pointers, around 150 characters per line, kept under 200 lines and about 25 KB. The real facts live in &ldquo;topic files&rdquo; that get pulled on demand when the agent needs them. And the raw transcripts from previous sessions are never reloaded whole, only searched with grep, hunting for specific identifiers. No embedding. No Pinecone. Just write discipline (topic file first, index after) and lexical search. That&rsquo;s it.</p>
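<p>To make the three layers concrete, here&rsquo;s how the lookup side could be sketched in Ruby. The paths and helper names are my invention, not the leaked code&rsquo;s actual layout; only the shape (pointer index always loaded, topic files on demand, transcripts grep-only) comes from the leak.</p>

```ruby
require "open3"

MEMORY_DIR = "memory"  # hypothetical layout, not the leaked paths

# Layer 1: the always-loaded index. Pointers only, no actual data.
def load_index
  File.read(File.join(MEMORY_DIR, "MEMORY.md"))
end

# Layer 2: a topic file, pulled on demand when the index points at it.
def load_topic(name)
  File.read(File.join(MEMORY_DIR, "topics", "#{name}.md"))
end

# Layer 3: raw transcripts are never reloaded whole, only searched
# lexically for a specific identifier (plain grep here; rg works too).
def grep_transcripts(identifier)
  matches, _status = Open3.capture2("grep", "-ri", identifier, "transcripts")
  matches.lines
end
```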
<p>Claude Code&rsquo;s main loop also has a tiered system for handling a context that&rsquo;s filling up. As <a href="/en/2026/03/31/claude-code-source-code-leaked-what-we-found-inside/">I detailed in the previous post</a>, there are five different context compaction strategies, with names like <code>microcompact</code> (clears old tool results based on age), <code>context collapse</code> (summarizes long stretches of conversation), and <code>autocompact</code> (which fires when the context gets close to the limit). The CLAUDE.md file, which a lot of people thought was just a convention, is first-class in the architecture: the system re-reads it at every iteration of the query.</p>
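<p>My reading of that tier selection, sketched. The strategy names come from the leak; the thresholds and data shapes are entirely invented for illustration:</p>

```ruby
# Hypothetical tier selection. Strategy names are from the leak; the
# 90% threshold and the age cutoff are invented for illustration.
# (:context_collapse, which summarizes long stretches, is omitted here.)
def pick_compaction(context_used:, context_limit:, tool_results:)
  ratio = context_used.to_f / context_limit
  return :autocompact if ratio > 0.9                  # near the hard limit
  return :microcompact if tool_results.any? { |r| r[:age_turns] > 10 }
  :none                                               # plenty of room left
end
```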
<p>What this tells me: the best coding agent on the market right now, built by the company selling the most expensive model out there, <strong>does not use a vector DB</strong>. It uses files on disk, a markdown index, lexical search, and smart compaction strategies for when the context overflows. They could&rsquo;ve slapped embeddings on top, they have the money to run whatever they wanted, and they chose not to. The reason, in my reading, is exactly what this post is arguing: to retrieve text from files you control, with generous context available, a vector DB is dead weight. Better to invest in compacting the window you already have than indexing everything into an external store.</p>
<p>There&rsquo;s a curious security detail that came along with the leak: people noticed the compaction pipeline has a vulnerability they&rsquo;re calling &ldquo;context poisoning.&rdquo; Content that looks like an instruction, coming from a file the model reads (say, a CLAUDE.md from a cloned repo), can end up being preserved by the compaction model as if it were &ldquo;user feedback,&rdquo; and the next model takes that as a real user instruction. It&rsquo;s a new attack vector. But that&rsquo;s a topic for another post.</p>
<h3>The &ldquo;Dream&rdquo; system and memory consolidation<span class="hx:absolute hx:-mt-20" id="the-dream-system-and-memory-consolidation"></span>
    <a href="#the-dream-system-and-memory-consolidation" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>But what really caught my eye for the RAG debate, which I <a href="/en/2026/03/31/claude-code-source-code-leaked-what-we-found-inside/">unpacked in detail last week</a>, is the system called <code>autoDream</code>. It&rsquo;s a forked subagent, with read-only bash access to the project, that runs in the background while you&rsquo;re not using the tool. Its job is literally to dream: to consolidate memory. The name isn&rsquo;t accidental, and the obvious analogy (which I couldn&rsquo;t resist) is the human brain consolidating memory during sleep, turning short-term experience into something more stable.</p>
<p>For a dream to actually run, three gates have to open at once: 24 hours since the last dream, at least 5 sessions since the last dream, and a consolidation lock that prevents concurrent dreams. When it fires, it goes through four phases. Orient (does an <code>ls</code> on the memory directory, reads the index). Gather (looks for new signals in logs, stale memories, transcripts). Consolidate (writes or updates the topic files, converts relative dates into absolute ones, deletes facts that have been contradicted). And Prune, the final cleanup that keeps the index under 200 lines.</p>
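<p>The three gates are easy to picture in code. A minimal sketch, with the signature and constants invented by me:</p>

```ruby
DREAM_INTERVAL = 24 * 60 * 60  # 24 hours, in seconds (my constant, not theirs)
MIN_SESSIONS   = 5

# All three gates must open at once for a dream to run.
def dream_ready?(last_dream_at:, sessions_since:, lock_held:)
  time_gate    = Time.now - last_dream_at > DREAM_INTERVAL
  session_gate = sessions_since >= MIN_SESSIONS
  lock_gate    = !lock_held
  [time_gate, session_gate, lock_gate].all?
end
```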
<p>The decision to make <code>autoDream</code> a forked subagent is the detail that matters here. It does not run in the same loop as the main agent. Why? Because memory consolidation is a noisy process. The model has to re-read old transcripts, compare them against what&rsquo;s in <code>MEMORY.md</code>, decide what stays and what goes, form hypotheses about things it saw in earlier sessions. If that ran in the main context, it would pollute the &ldquo;train of thought&rdquo; of the agent that&rsquo;s trying to help you with your current task. By forking, you keep the two separate. The main agent stays focused on what you asked for, and <code>autoDream</code> does the housekeeping in parallel, with no write permission on the project.</p>
<p>And the way it figures out what needs to be consolidated is plain old lexical search. The transcripts live as JSONL files on disk, and <code>autoDream</code> uses grep to look for new signals. Just grep, on text logs. Stop and think about that for a second. The memory consolidation of the most advanced agent in the world, built by one of the richest AI companies out there, is a forked subagent running grep on text logs. If a vector DB were the right answer for this kind of problem, Anthropic would&rsquo;ve put a vector DB in there. They didn&rsquo;t.</p>
<p>And there&rsquo;s a detail that, to me, is the buried gold of the entire leak, and it fits this argument like a glove. In <code>autoDream</code>, memory is treated as a hint. The system assumes that what&rsquo;s stored may be stale, wrong, contradicted by something that happened later, and the model has to verify before it trusts it. The vector DB pitch is the opposite of that: index everything, search by similarity, return the top-k, trust the result. Claude Code went the conservative route. Index little, search by word, return a hint, and stay skeptical until you&rsquo;ve laid eyes on the actual fact.</p>
<p>The whole strategy works in two layers. Inside a single session: generous context plus grep plus smart compaction (<code>microcompact</code>, <code>context collapse</code>, <code>autocompact</code>). Between sessions: a subagent that consolidates memory asynchronously, using grep on the transcripts and treating the result as a tip, not as truth. Embeddings and vector DBs don&rsquo;t show up in either layer. The deliberate choice was a smart reader chewing on raw text, not a dumb reader being spoonfed the top-k of an embedding.</p>
<p>The practical lesson for our debate is simple. The most advanced agents on the market are heading toward generous context, lexical search, and smart compaction, not toward classic RAG pipelines. If Anthropic, with all the infrastructure and talent they&rsquo;ve got, picked this path for Claude Code, those of us building internal applications on a fraction of that budget should at least think about going the same way.</p>
<h2>Where the story started turning<span class="hx:absolute hx:-mt-20" id="where-the-story-started-turning"></span>
    <a href="#where-the-story-started-turning" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>When the ceiling was 32k of context, retrieval was the bottleneck of the entire problem. You had to pre-filter aggressively, because anything that made it into the window was sacred space. A vector DB was the only halfway-decent way to do that semantic pre-filtering. The logic was: &ldquo;the reader (LLM) is expensive and dumb, so the retriever has to be smart and selective.&rdquo;</p>
<p>Today the equation has flipped. The reader is now the smartest one at the table, and the window grew big enough to hold an entire document. So the retriever can (and maybe should) go back to being dumb. The dumber, the better. You want high recall and low precision, and you let the model do the fine work. Grep does exactly that. So does BM25. And ripgrep flies through millions of lines without breaking a sweat.</p>
<p>And this isn&rsquo;t just my hunch. The BEIR benchmarks have shown for a while now that BM25 matches or beats a lot of dense retrievers when the domain drifts away from where the embeddings were trained. Anthropic itself published a post on <a href="https://www.anthropic.com/news/contextual-retrieval" target="_blank" rel="noopener">Contextual Retrieval</a> that basically says the same thing: a lexical signal plus an LLM&rsquo;s judgment beats pure embeddings on most knowledge tasks. And take a look at Claude Code, the tool I&rsquo;ve been using every day for 500 hours: it navigates the repo with <code>Glob</code> and <code>Grep</code>. No vector DB, no embedding, no LangChain. It works ridiculously well.</p>
<h2>The real problems with vector databases nobody advertises<span class="hx:absolute hx:-mt-20" id="the-real-problems-with-vector-databases-nobody-advertises"></span>
    <a href="#the-real-problems-with-vector-databases-nobody-advertises" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>The vector DB marketing sells the dream of perfect semantic search. Reality is messier.</p>
<p>False neighbors come first. Cosine similarity rewards topical similarity, not relevance. You ask &ldquo;how do we handle authentication errors&rdquo; and the DB returns every chunk that mentions authentication. The chunk that actually answers the question may be in tenth place, or may not have been retrieved at all because the doc author wrote &ldquo;login&rdquo; instead of &ldquo;auth.&rdquo;</p>
<p>Chunking is the second one, and it&rsquo;s a disguised disaster. A 512-token window with a 64-token overlap sounds reasonable, until you realize your important table got cut in half, the function definition ended up separated from its usage, and the piece of documentation with the exact command got orphaned without the context of its section. The chunk boundary tends to land exactly where the answer was living.</p>
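<p>Here&rsquo;s the naive fixed-window chunker every tutorial teaches, sketched in Ruby with characters standing in for tokens. Watch what happens to a command that straddles a boundary:</p>

```ruby
# Naive fixed-window chunking: size-n windows with a small overlap.
# Characters stand in for tokens to keep the sketch dependency-free.
def chunk(text, size:, overlap:)
  step = size - overlap
  (0...text.length).step(step).map { |i| text[i, size] }
end

doc = "The rollback command is: deploy rollback --env prod. " \
      "Run it only after a failed migration."

chunks = chunk(doc, size: 40, overlap: 8)
```

<p>Run that and no chunk carries the full <code>deploy rollback --env prod</code> command: the first window cuts off right before the flags. That&rsquo;s the &ldquo;boundary lands where the answer was living&rdquo; problem in four lines.</p>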
<p>When it fails, it fails without leaving a trace. When BM25 misses, you know why: the word isn&rsquo;t there. When a vector DB returns garbage, you get a plausible-looking wrong chunk, with no diagnostic signal at all. Good luck debugging that in production at two in the morning.</p>
<p>The index gets stale. Every document update calls for re-embedding. If you have 10,000 docs and 200 of them change per day, that turns into a batch process, monitoring, a queue, retries, embedding API costs, and an unavoidable inconsistency window between what&rsquo;s on disk and what&rsquo;s in the index. Grep has none of that. File changed? The next query already sees it.</p>
<p>And there&rsquo;s the operating cost nobody adds up. Pinecone charges per vector. Weaviate wants a cluster to maintain. pgvector saves you a new server but you still own a schema, an index, and a re-embedding pipeline. Each of those things wants engineer time, monitoring, tests, deploys. All of that to do a search that <code>rg</code> would often crack in 200ms.</p>
<h2>Comparing the complexity<span class="hx:absolute hx:-mt-20" id="comparing-the-complexity"></span>
    <a href="#comparing-the-complexity" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Look at the diagram:</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/06/rag/rag-vs-grep-complexity.png" alt="Complexity: classic RAG vs grep &#43; long context"  loading="lazy" /></p>
<p>On one side, eight steps, four or five services, an external index that needs to be maintained and kept up to date. On the other, four steps, zero new infrastructure. This isn&rsquo;t a caricature: it is literally what you have to set up for each case.</p>
<p>The honest question: does the left column pay off? In 2023, yes, because the right column didn&rsquo;t exist (no LLM had a 200k window). In 2026, in most cases, it doesn&rsquo;t.</p>
<h2>Pros and cons of each side<span class="hx:absolute hx:-mt-20" id="pros-and-cons-of-each-side"></span>
    <a href="#pros-and-cons-of-each-side" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><h3>Classic RAG (vector DB)<span class="hx:absolute hx:-mt-20" id="classic-rag-vector-db"></span>
    <a href="#classic-rag-vector-db" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p><strong>For:</strong></p>
<ul>
<li>Works for huge document bases, on the order of hundreds of GB, where even <code>rg</code> won&rsquo;t cut it without prior indexing</li>
<li>Handles heavy paraphrase and cross-lingual queries (&ldquo;how do I cancel&rdquo; vs. &ldquo;subscription termination process&rdquo;) where the user&rsquo;s vocabulary doesn&rsquo;t match the document&rsquo;s</li>
<li>Works for non-textual modalities (image, audio) where grep has nothing to look at</li>
<li>Saves input tokens if you&rsquo;re tight on budget or absolute latency</li>
</ul>
<p><strong>Against:</strong></p>
<ul>
<li>Complex stack: embedding, vector DB, chunking, reranker, re-indexing pipeline</li>
<li>Opaque failures, hard to debug</li>
<li>Chunking destroys the context of tables, code, long definitions</li>
<li>Operational overhead (index, queue, monitoring, re-embedding cost)</li>
<li>Semantic search in practice rarely works the way the marketing promises</li>
</ul>
<h3>Grep + long context<span class="hx:absolute hx:-mt-20" id="grep--long-context"></span>
    <a href="#grep--long-context" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p><strong>For:</strong></p>
<ul>
<li>Practically zero new infrastructure: ripgrep, SQLite, or a plain <code>LIKE</code> in Postgres</li>
<li>Always fresh: file changes, the next query sees them</li>
<li>Transparent failures: the word is either there or it isn&rsquo;t</li>
<li>Loads the document in generous chunks, the model does the fine filtering with actual semantics</li>
<li>Cheaper in dev and ops, cheaper to pivot domains</li>
</ul>
<p><strong>Against:</strong></p>
<ul>
<li>Doesn&rsquo;t scale to terabytes of raw text without some kind of indexing</li>
<li>Suffers when the user&rsquo;s vocabulary is very different from the document&rsquo;s</li>
<li>Doesn&rsquo;t work for non-textual modalities</li>
<li>Per-query latency is higher in absolute terms (loading 100k tokens always costs more than loading 5k)</li>
<li>Per-query input cost is higher if you don&rsquo;t have prompt caching</li>
</ul>
<h2>But what about cost?<span class="hx:absolute hx:-mt-20" id="but-what-about-cost"></span>
    <a href="#but-what-about-cost" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>This is the argument I get hit with the most when I defend the &ldquo;load everything into context&rdquo; thesis. &ldquo;It&rsquo;ll get crazy expensive, 200k tokens of input per query is absurd.&rdquo; Let&rsquo;s actually run the numbers.</p>
<p>In <a href="/en/2026/04/05/testing-llms-open-source-and-commercial-can-anyone-beat-claude-opus/">yesterday&rsquo;s LLM benchmark post</a> I mapped out the per-token price of every model. Take Claude Sonnet 4.6: $3 per million input tokens, $15 per million output. Take GLM 5 (which I proved actually works): $0.60 input, $2.20 output. Take GPT 5.4 Pro at the top of the heap: $15 input, $180 output (yeah, that one stings, I know).</p>
<p>Before we turn &ldquo;200k tokens&rdquo; into dollars, let&rsquo;s land that number on something tangible, because a raw token count doesn&rsquo;t mean anything to anyone on its own. A token, on average, is roughly 0.75 of a word in English (Portuguese is similar, maybe a touch heavier because of longer words). So, translating:</p>
<ul>
<li><strong>100k tokens</strong> ≈ 75,000 words ≈ a whole short novel like Hemingway&rsquo;s <em>The Old Man and the Sea</em> with room to spare, or about three long Wikipedia articles glued together.</li>
<li><strong>200k tokens</strong> ≈ 150,000 words ≈ a big novel, like <em>Crime and Punishment</em> in full, or half of the first <em>Game of Thrones</em> book (which clocks in around 298k words, so roughly 400k tokens).</li>
<li><strong>400k tokens</strong> ≈ 300,000 words ≈ <em>A Game of Thrones</em> in full, the entire first book of the series in your window.</li>
<li><strong>1M tokens</strong> ≈ 750,000 words ≈ the entire <em>Lord of the Rings</em> trilogy plus <em>The Hobbit</em>, or the whole Bible (King James is around 783k words, roughly 1M tokens), or about two and a half <em>Game of Thrones</em> books stacked on top of each other.</li>
</ul>
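<p>The rule of thumb behind these equivalences is trivial to encode; 0.75 is the rough English ratio the list assumes:</p>

```ruby
# Rough English ratio assumed above: 1 token is about 0.75 words.
WORDS_PER_TOKEN = 0.75

def tokens_to_words(tokens)
  (tokens * WORDS_PER_TOKEN).round
end

tokens_to_words(200_000)   # 150000 words, Crime and Punishment territory
tokens_to_words(1_000_000) # 750000 words, whole-Bible ballpark
```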
<p>So when I say &ldquo;throw 200k tokens of input at the model,&rdquo; what that actually means in the real world is &ldquo;throw the entire <em>Crime and Punishment</em> in as the context for your question.&rdquo; That&rsquo;s a lot. And that&rsquo;s exactly what makes the argument of this post viable: today&rsquo;s models can read an entire novel in one go and still answer a specific question about it. In 2023, this was science fiction. In 2026, it&rsquo;s the base case.</p>
<p>So picture a query that throws 200k tokens of input at the model (there goes <em>Crime and Punishment</em> again) and produces 2k tokens of output (about three pages of response):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th style="text-align: right">Input ($)</th>
          <th style="text-align: right">Output ($)</th>
          <th style="text-align: right">Total per query</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Claude Sonnet 4.6</td>
          <td style="text-align: right">$0.60</td>
          <td style="text-align: right">$0.03</td>
          <td style="text-align: right"><strong>$0.63</strong></td>
      </tr>
      <tr>
          <td>Claude Opus 4.6</td>
          <td style="text-align: right">$3.00</td>
          <td style="text-align: right">$0.15</td>
          <td style="text-align: right"><strong>$3.15</strong></td>
      </tr>
      <tr>
          <td>GLM 5</td>
          <td style="text-align: right">$0.12</td>
          <td style="text-align: right">$0.0044</td>
          <td style="text-align: right"><strong>$0.12</strong></td>
      </tr>
      <tr>
          <td>Gemini 3.1 Pro</td>
          <td style="text-align: right">$0.40</td>
          <td style="text-align: right">$0.024</td>
          <td style="text-align: right"><strong>$0.42</strong></td>
      </tr>
      <tr>
          <td>GPT 5.4 Pro</td>
          <td style="text-align: right">$3.00</td>
          <td style="text-align: right">$0.36</td>
          <td style="text-align: right"><strong>$3.36</strong></td>
      </tr>
  </tbody>
</table>
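<p>The table&rsquo;s arithmetic is nothing more than per-million prices scaled by token counts. A quick sanity check in Ruby, using the prices quoted above:</p>

```ruby
# Cost of one query: prices are USD per million tokens.
def query_cost(input_per_m, output_per_m, input_tokens:, output_tokens:)
  input_tokens / 1_000_000.0 * input_per_m +
    output_tokens / 1_000_000.0 * output_per_m
end

query_cost(3.00, 15.0, input_tokens: 200_000, output_tokens: 2_000) # Sonnet 4.6: 0.63
query_cost(0.60, 2.2, input_tokens: 200_000, output_tokens: 2_000)  # GLM 5: ~0.1244
```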
<p>Now throw prompt caching into the mix. Claude has a cache that drops cached input to a fraction of the full price (in the ballpark of 10%, depending on the model). Gemini has a similar mechanism. When you fire a sequence of queries against the same 200k-token dump, the cost of subsequent queries plummets to pennies. With Sonnet cached, you can fairly call it about $0.10 per follow-up query without making things up.</p>
<p>Now compare that to the cost of running a Pinecone, or a Weaviate, or a pgvector. Setting aside the price of the subscription itself (which varies a lot), you need an engineer to wire up the pipeline, maintain it, monitor it, deal with embedding failures, redo the chunking when the domain shifts. Conservatively, you&rsquo;re looking at somewhere between 40 and 80 hours of engineering to make the thing stable. At R$ 200/hour, that&rsquo;s between R$ 8,000 and R$ 16,000. In USD, somewhere between $1,600 and $3,200 just to stand it up.</p>
<p>With $3,200, on Sonnet 4.6 with prompt caching, you can run something on the order of 30,000 queries of 200k tokens each. Thirty thousand queries, depending on the scale of the project, gives you several months or even an entire year of an average internal tool. And you didn&rsquo;t pay an engineer to wire up a pipeline. There&rsquo;s no vector DB server to maintain. And if the document changes, the system already sees it on the next query.</p>
<p>The &ldquo;RAG is cheaper in tokens&rdquo; argument ignores that tokens are the cheapest thing in the entire equation. Engineers cost a lot, servers cost a lot, bugs in production cost a whole lot more. Tokens have become a commodity, and they&rsquo;re getting cheaper with every new model release.</p>
<p>The classic RAG argument was &ldquo;the model is expensive, retrieval is cheap.&rdquo; Today it&rsquo;s the opposite: the model is the cheap part of the stack, smart retrieval is what costs a fortune to build and maintain.</p>
<h2>Where the thesis doesn&rsquo;t hold<span class="hx:absolute hx:-mt-20" id="where-the-thesis-doesnt-hold"></span>
    <a href="#where-the-thesis-doesnt-hold" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>I don&rsquo;t want to come off as a fanboy. There are cases where classic RAG still wins:</p>
<ol>
<li><strong>Massive corpora.</strong> If you have 500 GB of raw text, even <code>rg</code> won&rsquo;t solve it in acceptable time. You need some kind of indexing. It can be indexed BM25 (Tantivy, Elasticsearch), it can be a vector DB. But notice: the first option is still lexical, not vector.</li>
<li><strong>Wildly scattered vocabulary.</strong> Customer support, where the user types &ldquo;my wifi&rsquo;s down&rdquo; and the documentation says &ldquo;loss of connectivity at the physical layer.&rdquo; BM25 won&rsquo;t catch that. Embedding will. Vector DB scores a point here.</li>
<li><strong>Non-textual modalities.</strong> Image-by-image search, audio-by-audio. Embedding is mandatory.</li>
<li><strong>Critical absolute latency.</strong> If you have to answer in 100ms with a 5k input budget, a generous dump won&rsquo;t fit. Pre-filtering is necessary.</li>
<li><strong>Compliance and audit.</strong> If you have to prove that a specific document was consulted to answer a specific query, having indexed and trackable chunks helps. A 200k-token context dump is more opaque from an audit standpoint.</li>
</ol>
<p>For those cases, classic RAG still makes sense. But notice the size of the list. These are specific cases. The general case, things like &ldquo;chat with our internal docs&rdquo; or &ldquo;ask the product manual,&rdquo; almost all of it falls into the &ldquo;grep + long context handles it better&rdquo; bucket.</p>
<h2>Lazy retrieval: the recipe I&rsquo;d defend<span class="hx:absolute hx:-mt-20" id="lazy-retrieval-the-recipe-id-defend"></span>
    <a href="#lazy-retrieval-the-recipe-id-defend" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>If I were building a &ldquo;chat with docs&rdquo; tool today, from scratch, it would look more or less like this:</p>
<ol>
<li><strong>Keep the documents raw.</strong> Markdown, converted PDF, code, whatever. On disk, organized in folders that make sense for the domain.</li>
<li><strong>Fast lexical filter.</strong> <code>ripgrep</code> with regex, or BM25 with Tantivy/SQLite FTS5, or a <code>LIKE</code> in Postgres if you already have one. Returns 100-300 hits.</li>
<li><strong>Load generously.</strong> Grab not just the matching snippet, but the entire file, or a wide window around it. Throw all of it into the context.</li>
<li><strong>Let the LLM do the fine work.</strong> Pass the original question, tell the model to find what matters, drop the rest, and answer with citations.</li>
<li><strong>(Optional) Add embeddings only for the query classes where lexical fails</strong>, after you have real data showing that it fails.</li>
</ol>
<p>This is the opposite of the old advice (&ldquo;start with vectors, fall back to keyword&rdquo;). It&rsquo;s: <strong>start with keyword, and add vector only if you feel the gap</strong>. In most projects, you never will.</p>
<h2>A toy implementation in Ruby<span class="hx:absolute hx:-mt-20" id="a-toy-implementation-in-ruby"></span>
    <a href="#a-toy-implementation-in-ruby" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>To make it concrete, here&rsquo;s a Ruby script using the <a href="https://github.com/crmne/ruby_llm" target="_blank" rel="noopener"><code>ruby_llm</code></a> gem (the same one from yesterday&rsquo;s benchmark) that does exactly this flow: grep through the files, load the snippets with context, send to Claude, get the answer back. No vector DB, no chunking, no embedding, no LangChain.</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ruby" data-lang="ruby"><span class="line"><span class="cl"><span class="ch">#!/usr/bin/env ruby</span>
</span></span><span class="line"><span class="cl"><span class="nb">require</span> <span class="s2">&#34;ruby_llm&#34;</span>
</span></span><span class="line"><span class="cl"><span class="nb">require</span> <span class="s2">&#34;open3&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="no">DOCS_DIR</span> <span class="o">=</span> <span class="no">ARGV</span><span class="o">[</span><span class="mi">0</span><span class="o">]</span> <span class="o">||</span> <span class="s2">&#34;./docs&#34;</span>
</span></span><span class="line"><span class="cl"><span class="no">QUERY</span>    <span class="o">=</span> <span class="no">ARGV</span><span class="o">[</span><span class="mi">1</span><span class="o">]</span> <span class="ow">or</span> <span class="nb">abort</span> <span class="s2">&#34;usage: ./ask.rb &lt;folder&gt; &lt;question&gt;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 1. Fast lexical filter with ripgrep.</span>
</span></span><span class="line"><span class="cl"><span class="c1">#    -i case insensitive, -l file names only, --type-add covers md/txt/extracted-pdf.</span>
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">lexical_search</span><span class="p">(</span><span class="n">dir</span><span class="p">,</span> <span class="n">query</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="n">terms</span> <span class="o">=</span> <span class="n">query</span><span class="o">.</span><span class="n">downcase</span><span class="o">.</span><span class="n">scan</span><span class="p">(</span><span class="sr">/\w{4,}/</span><span class="p">)</span><span class="o">.</span><span class="n">uniq</span><span class="o">.</span><span class="n">first</span><span class="p">(</span><span class="mi">8</span><span class="p">)</span>  <span class="c1"># words with 4+ letters</span>
</span></span><span class="line"><span class="cl">  <span class="n">pattern</span> <span class="o">=</span> <span class="n">terms</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="s2">&#34;|&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="n">cmd</span> <span class="o">=</span> <span class="o">[</span><span class="s2">&#34;rg&#34;</span><span class="p">,</span> <span class="s2">&#34;-l&#34;</span><span class="p">,</span> <span class="s2">&#34;-i&#34;</span><span class="p">,</span> <span class="s2">&#34;-e&#34;</span><span class="p">,</span> <span class="n">pattern</span><span class="p">,</span> <span class="n">dir</span><span class="o">]</span>
</span></span><span class="line"><span class="cl">  <span class="n">files</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="no">Open3</span><span class="o">.</span><span class="n">capture2</span><span class="p">(</span><span class="o">*</span><span class="n">cmd</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="n">files</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s2">&#34;</span><span class="se">\n</span><span class="s2">&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">reject</span><span class="p">(</span><span class="o">&amp;</span><span class="ss">:empty?</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="k">end</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 2. Load entire files (up to a reasonable cap).</span>
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">load_context</span><span class="p">(</span><span class="n">files</span><span class="p">,</span> <span class="ss">max_chars</span><span class="p">:</span> <span class="mi">600_000</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="n">total</span> <span class="o">=</span> <span class="mi">0</span>
</span></span><span class="line"><span class="cl">  <span class="n">files</span><span class="o">.</span><span class="n">map</span> <span class="k">do</span> <span class="o">|</span><span class="n">path</span><span class="o">|</span>
</span></span><span class="line"><span class="cl">    <span class="n">body</span> <span class="o">=</span> <span class="no">File</span><span class="o">.</span><span class="n">read</span><span class="p">(</span><span class="n">path</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">next</span> <span class="k">if</span> <span class="n">total</span> <span class="o">+</span> <span class="n">body</span><span class="o">.</span><span class="n">size</span> <span class="o">&gt;</span> <span class="n">max_chars</span>
</span></span><span class="line"><span class="cl">    <span class="n">total</span> <span class="o">+=</span> <span class="n">body</span><span class="o">.</span><span class="n">size</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;## </span><span class="si">#{</span><span class="n">path</span><span class="si">}</span><span class="se">\n\n</span><span class="si">#{</span><span class="n">body</span><span class="si">}</span><span class="se">\n</span><span class="s2">&#34;</span>
</span></span><span class="line"><span class="cl">  <span class="k">end</span><span class="o">.</span><span class="n">compact</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="s2">&#34;</span><span class="se">\n</span><span class="s2">---</span><span class="se">\n</span><span class="s2">&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="k">end</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 3. Send to Claude with the question and the documents.</span>
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">ask</span><span class="p">(</span><span class="n">query</span><span class="p">,</span> <span class="n">context</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="n">chat</span> <span class="o">=</span> <span class="no">RubyLLM</span><span class="o">.</span><span class="n">chat</span><span class="p">(</span><span class="ss">model</span><span class="p">:</span> <span class="s2">&#34;anthropic/claude-sonnet-4-6&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="n">prompt</span> <span class="o">=</span> <span class="o">&lt;&lt;~</span><span class="no">PROMPT</span>
</span></span><span class="line"><span class="cl">    You have access to the documents below. Answer the user&#39;s question
</span></span><span class="line"><span class="cl">    using only what is in the documents. Cite the file name in your
</span></span><span class="line"><span class="cl">    references. If the answer is not in the documents, say so.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="o">---</span> <span class="no">DOCUMENTS</span> <span class="o">---</span>
</span></span><span class="line"><span class="cl">    <span class="c1">#{context}</span>
</span></span><span class="line"><span class="cl">    <span class="o">---</span> <span class="no">END</span> <span class="no">OF</span> <span class="no">DOCUMENTS</span> <span class="o">---</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="ss">Question</span><span class="p">:</span> <span class="c1">#{query}</span>
</span></span><span class="line"><span class="cl">  <span class="no">PROMPT</span>
</span></span><span class="line"><span class="cl">  <span class="n">chat</span><span class="o">.</span><span class="n">ask</span><span class="p">(</span><span class="n">prompt</span><span class="p">)</span><span class="o">.</span><span class="n">content</span>
</span></span><span class="line"><span class="cl"><span class="k">end</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">files</span> <span class="o">=</span> <span class="n">lexical_search</span><span class="p">(</span><span class="no">DOCS_DIR</span><span class="p">,</span> <span class="no">QUERY</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nb">abort</span> <span class="s2">&#34;no files matched&#34;</span> <span class="k">if</span> <span class="n">files</span><span class="o">.</span><span class="n">empty?</span>
</span></span><span class="line"><span class="cl"><span class="nb">puts</span> <span class="s2">&#34;Found </span><span class="si">#{</span><span class="n">files</span><span class="o">.</span><span class="n">size</span><span class="si">}</span><span class="s2"> files. Loading context...&#34;</span>
</span></span><span class="line"><span class="cl"><span class="n">context</span> <span class="o">=</span> <span class="n">load_context</span><span class="p">(</span><span class="n">files</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="nb">puts</span> <span class="n">ask</span><span class="p">(</span><span class="no">QUERY</span><span class="p">,</span> <span class="n">context</span><span class="p">)</span></span></span></code></pre></div></div>
</div>
<p>About 40 lines. No Pinecone dependency, no vector schema, no re-indexing pipeline. You run it as <code>./ask.rb ./docs &quot;how do I configure the payment webhook&quot;</code> and that&rsquo;s it.</p>
<p>That example is one-shot. You run it, it answers, done. For a real chat, with multiple questions in a row over the same documents, the design changes. Instead of running <code>lexical_search</code> upfront and shoving everything into the context at once, you expose the search as a tool to the model. Then it&rsquo;s the agent that decides when it needs to pull more docs, what term to look for, which file is worth opening in full. That&rsquo;s how Claude Code actually works: <code>Glob</code>, <code>Grep</code> and <code>Read</code> are tools, and the model picks the sequence. <code>ruby_llm</code> supports tool calling, so you can do the same thing in Ruby. It looks something like this:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ruby" data-lang="ruby"><span class="line"><span class="cl"><span class="nb">require</span> <span class="s2">&#34;ruby_llm&#34;</span>
</span></span><span class="line"><span class="cl"><span class="nb">require</span> <span class="s2">&#34;open3&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="no">DOCS_DIR</span> <span class="o">=</span> <span class="s2">&#34;./docs&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">SearchFiles</span> <span class="o">&lt;</span> <span class="no">RubyLLM</span><span class="o">::</span><span class="no">Tool</span>
</span></span><span class="line"><span class="cl">  <span class="n">description</span> <span class="s2">&#34;Searches files whose content matches the given pattern (regex). Returns a list of paths.&#34;</span>
</span></span><span class="line"><span class="cl">  <span class="n">param</span> <span class="ss">:pattern</span><span class="p">,</span> <span class="ss">desc</span><span class="p">:</span> <span class="s2">&#34;Regex pattern for the lexical search (case-insensitive)&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">  <span class="k">def</span> <span class="nf">execute</span><span class="p">(</span><span class="ss">pattern</span><span class="p">:)</span>
</span></span><span class="line"><span class="cl">    <span class="n">out</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="no">Open3</span><span class="o">.</span><span class="n">capture2</span><span class="p">(</span><span class="s2">&#34;rg&#34;</span><span class="p">,</span> <span class="s2">&#34;-l&#34;</span><span class="p">,</span> <span class="s2">&#34;-i&#34;</span><span class="p">,</span> <span class="s2">&#34;-e&#34;</span><span class="p">,</span> <span class="n">pattern</span><span class="p">,</span> <span class="no">DOCS_DIR</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">out</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s2">&#34;</span><span class="se">\n</span><span class="s2">&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">reject</span><span class="p">(</span><span class="o">&amp;</span><span class="ss">:empty?</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="k">end</span>
</span></span><span class="line"><span class="cl"><span class="k">end</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">ReadFile</span> <span class="o">&lt;</span> <span class="no">RubyLLM</span><span class="o">::</span><span class="no">Tool</span>
</span></span><span class="line"><span class="cl">  <span class="n">description</span> <span class="s2">&#34;Reads the full content of a project file.&#34;</span>
</span></span><span class="line"><span class="cl">  <span class="n">param</span> <span class="ss">:path</span><span class="p">,</span> <span class="ss">desc</span><span class="p">:</span> <span class="s2">&#34;Relative path of the file&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">  <span class="k">def</span> <span class="nf">execute</span><span class="p">(</span><span class="ss">path</span><span class="p">:)</span>
</span></span><span class="line"><span class="cl">    <span class="no">File</span><span class="o">.</span><span class="n">read</span><span class="p">(</span><span class="n">path</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="k">rescue</span> <span class="o">=&gt;</span> <span class="n">e</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;error: </span><span class="si">#{</span><span class="n">e</span><span class="o">.</span><span class="n">message</span><span class="si">}</span><span class="s2">&#34;</span>
</span></span><span class="line"><span class="cl">  <span class="k">end</span>
</span></span><span class="line"><span class="cl"><span class="k">end</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">chat</span> <span class="o">=</span> <span class="no">RubyLLM</span><span class="o">.</span><span class="n">chat</span><span class="p">(</span><span class="ss">model</span><span class="p">:</span> <span class="s2">&#34;anthropic/claude-sonnet-4-6&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            <span class="o">.</span><span class="n">with_tools</span><span class="p">(</span><span class="no">SearchFiles</span><span class="p">,</span> <span class="no">ReadFile</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            <span class="o">.</span><span class="n">with_instructions</span><span class="p">(</span><span class="o">&lt;&lt;~</span><span class="no">SYS</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">              You answer questions about the documents in <span class="c1">#{DOCS_DIR}.</span>
</span></span><span class="line"><span class="cl">              Use search_files to find relevant files and read_file to read
</span></span><span class="line"><span class="cl">              their content. Always cite the file in your answer.
</span></span><span class="line"><span class="cl">            <span class="no">SYS</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="kp">loop</span> <span class="k">do</span>
</span></span><span class="line"><span class="cl">  <span class="nb">print</span> <span class="s2">&#34;&gt; &#34;</span>
</span></span><span class="line"><span class="cl">  <span class="n">msg</span> <span class="o">=</span> <span class="nb">gets</span><span class="o">&amp;.</span><span class="n">chomp</span>
</span></span><span class="line"><span class="cl">  <span class="k">break</span> <span class="k">if</span> <span class="n">msg</span><span class="o">.</span><span class="n">nil?</span> <span class="o">||</span> <span class="n">msg</span><span class="o">.</span><span class="n">empty?</span>
</span></span><span class="line"><span class="cl">  <span class="nb">puts</span> <span class="n">chat</span><span class="o">.</span><span class="n">ask</span><span class="p">(</span><span class="n">msg</span><span class="p">)</span><span class="o">.</span><span class="n">content</span>
</span></span><span class="line"><span class="cl"><span class="k">end</span></span></span></code></pre></div></div>
</div>
<p>The model gets the question, decides whether it needs to search, calls <code>search_files</code>, sees what came back, decides whether it needs to open any file, calls <code>read_file</code>, and only then answers. On the next question it already has the previous context in the session and can ask for more if it needs to. The context only receives what the model asked for, not the whole grep dump from the earlier example.</p>
<p>The same idea works for databases: swap <code>rg</code> for a SQL query with <code>LIKE</code> or <code>tsvector</code> (Postgres full-text), load the relevant rows, throw them in the context. If you have 10k records in an internal database, this handles it. If you have 10 million, you start needing smarter pagination or a more serious pre-filtering layer. But the mental model is the same: <strong>dumb filter + smart reader</strong>.</p>
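<p>To make the &ldquo;dumb filter + smart reader&rdquo; split concrete, here is a minimal in-memory sketch of the filter half. The rows, fields, and query are all made up for illustration; in a real app this <code>select</code> would be a <code>LIKE</code> or <code>tsvector</code> query running server-side:</p>

```ruby
# In-memory stand-in for the database variant: a dumb substring filter
# (the job LIKE or tsvector would do server-side), after which only the
# matching rows reach the model. Rows and fields are hypothetical.
ROWS = [
  { id: 1, body: "webhook payment configuration lives in config/payments.yml" },
  { id: 2, body: "user onboarding flow and email templates" },
  { id: 3, body: "payment retries are handled by the billing worker" }
]

# Same trick as the rg script: extract words with 4+ letters and keep
# any row containing at least one of them. High recall, zero smarts.
def dumb_filter(rows, query)
  terms = query.downcase.scan(/\w{4,}/).uniq
  rows.select { |r| terms.any? { |t| r[:body].downcase.include?(t) } }
end

hits = dumb_filter(ROWS, "how do I configure the payment webhook")
puts hits.map { |r| r[:id] }.inspect  # => [1, 3]
```

<p>The filter over-fetches on purpose: any false positives just ride along in the context, and the smart reader sorts them out.</p>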
<h2>The point that matters<span class="hx:absolute hx:-mt-20" id="the-point-that-matters"></span>
    <a href="#the-point-that-matters" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>The most interesting thing in all of this isn&rsquo;t even the Pinecone savings. It&rsquo;s that the nature of the bottleneck has changed. In 2023, the bottleneck was retrieval: the reader was small, slow, expensive, and you needed a clever retriever to fill the window with the bare minimum. In 2026, the bottleneck is reasoning over messy context: the reader is big, relatively fast, and cheap. So it makes more sense to have a dumb retriever with high recall and let the model do the heavy lifting.</p>
<p>Anyone still designing systems with the 2023 mindset is paying a premium to solve a problem whose shape has changed. RAG didn&rsquo;t die, the &ldquo;R&rdquo; got dumber and cheaper, and that&rsquo;s an upgrade. The vector DB vendors aren&rsquo;t going to tell you this, but it&rsquo;s the path the more experienced folks have been quietly walking.</p>
<p>The next wave of LLM applications, in my bet, is going to be dominated by the people who got this inversion. Smaller stacks, simpler infrastructure, generous context, and a whole lot less LangChain.</p>
<h2>What the recent literature says<span class="hx:absolute hx:-mt-20" id="what-the-recent-literature-says"></span>
    <a href="#what-the-recent-literature-says" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Before I close out, I went and checked what the research crowd published on this. Blog hot takes age in three months in this field, so it&rsquo;s better to look at the papers.</p>
<p><a href="https://arxiv.org/abs/2407.16833"target="_blank" rel="noopener"><strong>Retrieval Augmented Generation or Long-Context LLMs?</strong></a>, out of Google DeepMind, published at EMNLP 2024, is probably the most cited piece in the debate. Their conclusion: when the model has enough resources, long context beats RAG on average quality, but RAG is still much cheaper in tokens. They propose Self-Route, an approach where the model itself decides whether it needs retrieval or whether it can just go straight through context. The token savings are big and the quality loss is small.</p>
<p>Then <a href="https://openreview.net/forum?id=CLF25dahgA"target="_blank" rel="noopener"><strong>LaRA</strong></a>, presented at ICML 2025, is more measured. The authors built 2326 test cases across four QA task types and three long-context types, ran them across 11 different LLMs, and the conclusion was: there is no silver bullet. The choice between RAG and long context depends on the model, the context size, the task type, and the retrieval characteristics. RAG wins on dialogue and generic queries, long context wins on Wikipedia-style QA.</p>
<p><a href="https://arxiv.org/abs/2501.01880"target="_blank" rel="noopener"><strong>Long Context vs. RAG for LLMs: An Evaluation and Revisits</strong></a>, from January 2025, is the one that most reinforces this post&rsquo;s thesis. Long context tends to beat RAG on QA benchmarks, especially when the base document is stable. Summarization-based retrieval comes close, and chunk-based retrieval lags behind. In other words: the old way, chunk plus embed plus top-k, is the one that comes out worst.</p>
<p>Worth keeping on the radar too is the original <a href="https://arxiv.org/abs/2307.03172"target="_blank" rel="noopener"><strong>Lost in the Middle</strong></a> (Liu et al., 2023, published in TACL in 2024). That&rsquo;s the paper that showed even models with big windows have performance that depends on the position of the relevant information. Stuff at the beginning or end of the context is found easily; stuff in the middle degrades. For a long time this got used as the argument against long context, but the paper is from 2023, with 2023 models. Today&rsquo;s models, the Claude 4.x and Gemini 3.x line, handle the middle a lot better. It&rsquo;s not a solved problem, but it&rsquo;s much smaller than it was.</p>
<p>On the lexical retrieval side, <a href="https://arxiv.org/abs/2104.08663"target="_blank" rel="noopener"><strong>BEIR</strong></a> is still the canonical reference. The classic result is that BM25, all the way from the 90s, is still ridiculously competitive in out-of-domain scenarios. Dense models only win consistently when you have in-domain data to fine-tune the embeddings. In zero-shot scenarios, which is where most projects live, BM25 is hard to beat without serious work.</p>
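<p>Since BM25 keeps coming up, a toy sketch of what it actually scores helps. This uses the standard defaults (k1 = 1.2, b = 0.75) and a made-up three-document corpus; a real system would use Lucene, Elasticsearch, or Postgres full-text rather than anything hand-rolled:</p>

```ruby
# Toy BM25 with the usual defaults. Corpus is hypothetical.
K1 = 1.2
B  = 0.75

DOCS = [
  "the payment webhook retries three times",
  "users sign in with email",
  "webhook configuration for payment providers"
]

tokenized = DOCS.map { |d| d.downcase.scan(/\w+/) }
avgdl = tokenized.sum(&:size) / tokenized.size.to_f  # average doc length

# Inverse document frequency: rare terms weigh more.
def idf(term, tokenized)
  n  = tokenized.size
  df = tokenized.count { |doc| doc.include?(term) }
  Math.log((n - df + 0.5) / (df + 0.5) + 1)
end

# Score every document against the query.
def bm25(query, tokenized, avgdl)
  terms = query.downcase.scan(/\w+/)
  tokenized.map do |doc|
    terms.sum do |t|
      f = doc.count(t)  # term frequency in this document
      idf(t, tokenized) * (f * (K1 + 1)) /
        (f + K1 * (1 - B + B * doc.size / avgdl))
    end
  end
end

scores = bm25("payment webhook", tokenized, avgdl)
best = scores.each_index.max_by { |i| scores[i] }
puts DOCS[best]  # the shorter matching doc wins on length normalization
```

<p>Both matching documents contain the same query terms once, but the shorter one scores higher because of the length normalization in the denominator, which is a big part of why BM25 holds up after thirty years.</p>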
<p>To wrap up, the <a href="https://www.anthropic.com/news/contextual-retrieval"target="_blank" rel="noopener"><strong>Anthropic post on Contextual Retrieval</strong></a>, from September 2024, is the most practical piece on the list. They show that combining contextual embedding with contextual BM25 drops the top-20 failure rate from 5.7% to 2.9%. Add a reranker and it drops to 1.9%. Important detail: BM25 is the centerpiece of their result, not a sidekick. The right reading is &ldquo;lexical plus vector plus reranker is the combination that works.&rdquo; Anyone who can only pick one picks BM25 and still gets pretty far.</p>
<p>To sum up what we can actually nail down: the literature isn&rsquo;t claiming &ldquo;RAG is dead.&rdquo; It&rsquo;s saying that long context, when you can use it, tends to win on quality. It&rsquo;s saying RAG&rsquo;s cost is still its main argument. It&rsquo;s saying lexical BM25 is much stronger than the vector DB marketing makes it sound. And it&rsquo;s saying that when you really do need heavy retrieval, the robust combination is hybrid (lexical plus vector plus reranker), not pure vector. All of that lines up with what I&rsquo;ve been defending in practice.</p>
<h2>Sources<span class="hx:absolute hx:-mt-20" id="sources"></span>
    <a href="#sources" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><ul>
<li>Li, Z. et al. (2024). <a href="https://arxiv.org/abs/2407.16833"target="_blank" rel="noopener">Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach</a>. EMNLP 2024 Industry Track.</li>
<li>Yuan, K. et al. (2025). <a href="https://openreview.net/forum?id=CLF25dahgA"target="_blank" rel="noopener">LaRA: Benchmarking Retrieval-Augmented Generation and Long-Context LLMs – No Silver Bullet for LC or RAG Routing</a>. ICML 2025.</li>
<li>Yu, T. et al. (2025). <a href="https://arxiv.org/abs/2501.01880"target="_blank" rel="noopener">Long Context vs. RAG for LLMs: An Evaluation and Revisits</a>. arXiv:2501.01880.</li>
<li>Liu, N. F. et al. (2023). <a href="https://arxiv.org/abs/2307.03172"target="_blank" rel="noopener">Lost in the Middle: How Language Models Use Long Contexts</a>. TACL 2024.</li>
<li>Thakur, N. et al. (2021). <a href="https://arxiv.org/abs/2104.08663"target="_blank" rel="noopener">BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models</a>. NeurIPS Datasets and Benchmarks 2021.</li>
<li>Anthropic (2024). <a href="https://www.anthropic.com/news/contextual-retrieval"target="_blank" rel="noopener">Introducing Contextual Retrieval</a>. Blog post.</li>
<li>Akita, F. (2026). <a href="/en/2026/03/31/claude-code-source-code-leaked-what-we-found-inside/">Claude Code&rsquo;s Source Code Leaked. Here&rsquo;s What We Found Inside.</a> — my coverage of the leak, with more detail on the memory architecture, KAIROS and <code>autoDream</code>.</li>
</ul>
]]></content:encoded><category>llm</category><category>rag</category><category>vibecoding</category><category>ai</category></item><item><title>Testing Open Source and Commercial LLMs - Can Anyone Beat Claude Opus?</title><link>https://www.akitaonrails.com/en/2026/04/05/testing-llms-open-source-and-commercial-can-anyone-beat-claude-opus/</link><guid isPermaLink="true">https://www.akitaonrails.com/en/2026/04/05/testing-llms-open-source-and-commercial-can-anyone-beat-claude-opus/</guid><pubDate>Sun, 05 Apr 2026 18:00:00 GMT</pubDate><description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; If you don&amp;rsquo;t want to read the whole analysis: the only models that produced code that actually works in our benchmark were Claude Sonnet 4.6, Claude Opus 4.6, GLM 5 and GLM 5.1 (from Z.AI, ~89% cheaper than Opus), and GPT 5.4 (which failed the benchmark due to runner incompatibility but which I tested extensively in Codex and works as well as Opus). Everything else — Kimi, DeepSeek, MiniMax, Qwen, Gemini, Grok 4.20 — invented APIs that don&amp;rsquo;t exist or ignored the gem we asked for.&lt;/p&gt;</description><content:encoded><![CDATA[<p><strong>TL;DR:</strong> If you don&rsquo;t want to read the whole analysis: the only models that produced code that actually works in our benchmark were Claude Sonnet 4.6, Claude Opus 4.6, GLM 5 and GLM 5.1 (from Z.AI, ~89% cheaper than Opus), and GPT 5.4 (which failed the benchmark due to runner incompatibility but which I tested extensively in Codex and works as well as Opus). Everything else — Kimi, DeepSeek, MiniMax, Qwen, Gemini, Grok 4.20 — invented APIs that don&rsquo;t exist or ignored the gem we asked for.</p>
<p>There&rsquo;s a new wrinkle in this update: I redid the local part of the benchmark on an RTX 5090 (instead of the AMD Strix Halo) and added a fresh batch of Qwen models, including a Qwen 3.5 27B distilled directly from Claude 4.6 Opus. That reopened the conversation on running open source models locally. The 5090&rsquo;s memory bandwidth flips the game from &ldquo;unworkable&rdquo; to &ldquo;workable with 1-2 follow-up prompts.&rdquo; The bottleneck for open source models has moved to a lack of factual knowledge about specific libraries, which I unpack in detail in the new section on the Qwen family. The Claude distillation gamble, by the way, gave a pretty frustrating result that I haven&rsquo;t seen documented in these terms before.</p>
<hr>
<p>If you&rsquo;ve been following <a href="/en/tags/vibecoding/">my previous vibe coding pieces</a>, you know I spent the last two months in a 500-hour marathon using Claude Opus as my main coding agent. The results were good, as I reported in the <a href="/en/2026/03/05/37-days-of-vibe-coding-immersion-conclusions-on-business-models/">conclusion about business models</a>. But there was an itch I couldn&rsquo;t scratch: am I locked into one model? Is there a real alternative to Claude Opus for daily use on real projects?</p>
<p>I&rsquo;ve got an RTX 5090 with 32 GB of GDDR7. I know I can run the latest open source models. I bought a <a href="/en/2026/03/31/minisforum-ms-s1-max-amd-ai-max-395-review/">Minisforum MS-S1</a> with an AMD Ryzen AI Max 395 and 128 GB of unified memory, and built a <a href="/en/2026/03/31/migrating-my-home-server-with-claude-code/">home server with Docker</a> to serve local models. The infrastructure was ready. What was missing was actually testing it.</p>
<p>I built an automated benchmark to compare open source and commercial models under identical conditions. 33 models configured in total (25 from the original run plus 8 added in the NVIDIA rerun), 27 executed, 16 completed in some form. The code is on <a href="https://github.com/akitaonrails/llm-coding-benchmark"target="_blank" rel="noopener">GitHub</a>.</p>
<h2>The bottleneck nobody explains: VRAM and KV Cache<span class="hx:absolute hx:-mt-20" id="the-bottleneck-nobody-explains-vram-and-kv-cache"></span>
    <a href="#the-bottleneck-nobody-explains-vram-and-kv-cache" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Before getting to the results, I have to explain why running large models locally is much harder than it looks.</p>
<p>Take Qwen3 32B. The model in FP16 (full precision) takes ~64 GB. Quantized to Q4 (4 bits), it drops to ~19 GB. So it fits in my RTX 5090&rsquo;s 32 GB, right? Wrong. That&rsquo;s just the model weights. There&rsquo;s a part nobody tells you about: the <strong>KV Cache</strong>.</p>
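<p>The weight-only math is simple enough to sanity-check yourself. The ~4.8 bits/weight figure below is my approximation for typical GGUF Q4 variants once you count the quantization scales; exact sizes vary per quant:</p>

```ruby
# Rough weight-only memory: parameters (in billions) × bits per weight / 8.
# Assumption: common GGUF Q4 variants average roughly 4.8 bits/weight
# once the quantization scales are counted; exact figures vary per quant.
def weights_gb(params_billions, bits_per_weight)
  params_billions * bits_per_weight / 8.0
end

puts weights_gb(32, 16).round(1)    # FP16: 64.0 GB
puts weights_gb(32, 4.8).round(1)   # Q4-ish: 19.2 GB
```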
<p>KV Cache is the memory the model uses to &ldquo;remember&rdquo; what it has already read. Every time it processes a token (a word or piece of a word), it computes two vectors — K (key) and V (value) — for every attention layer. Those vectors stick around so the model doesn&rsquo;t have to recompute everything when it generates the next token. Without that, generation would be quadratically slow.</p>
<p>The KV Cache scales linearly with the size of the context. The formula:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code>KV Memory = 2 × Layers × KV_Heads × Head_Dimension × Bytes_per_Element × Context_Tokens</code></pre></div>
</div>
<p>For a model like Llama 3.1 70B in BF16, that comes out to ~0.31 MB per token. Sounds tiny, until you realize that a 128K context eats <strong>40 GB</strong> of KV Cache alone. The model itself plus KV Cache adds up to way more VRAM than most GPUs have.</p>
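<p>Plugging Llama 3.1 70B&rsquo;s published shape into the formula reproduces both numbers (80 layers, 8 KV heads under GQA, head dimension 128, 2 bytes per element in BF16); the sketch itself is mine, not from any particular runtime:</p>

```ruby
# The formula above, as code. The 2 accounts for the K and V tensors.
def kv_cache_bytes(layers:, kv_heads:, head_dim:, bytes_per_elem:, tokens:)
  2 * layers * kv_heads * head_dim * bytes_per_elem * tokens
end

# Llama 3.1 70B in BF16: 80 layers, 8 KV heads (GQA), head_dim 128, 2 bytes.
per_token = kv_cache_bytes(layers: 80, kv_heads: 8, head_dim: 128,
                           bytes_per_elem: 2, tokens: 1)
mb_per_token = per_token / 1024.0 / 1024
puts mb_per_token.round(2)  # 0.31

gb_at_128k = kv_cache_bytes(layers: 80, kv_heads: 8, head_dim: 128,
                            bytes_per_elem: 2, tokens: 128 * 1024) / 1024.0**3
puts gb_at_128k.round(1)    # 40.0
```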
<p>And for actual coding agent use, 128K tokens isn&rsquo;t a luxury, it&rsquo;s the bare minimum. The agent has to read files, keep conversation history, receive command output. In long benchmark sessions, our models consumed between 39K and 156K tokens. Less than 100K of context isn&rsquo;t practical for day-to-day project work.</p>
<p>Google published <a href="https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/"target="_blank" rel="noopener">TurboQuant</a> (ICLR 2026), which compresses the KV Cache to 3 bits without accuracy loss — a 6x memory reduction and up to 8x speedup. It uses random vector rotation (PolarQuant) followed by a 1-bit algorithm on the residuals. Works online during inference, compressing on write and decompressing on read. Not yet implemented in the runtimes we use (llama.cpp, Ollama), but when it lands it&rsquo;ll change the equation a lot.</p>
<p>For anyone wanting to dig deeper into the VRAM math, I recommend <a href="https://x.com/TheAhmadOsman/status/2040103488714068245"target="_blank" rel="noopener">Ahmad Osman&rsquo;s article</a> &ldquo;GPU Memory Math for LLMs (2026 Edition)&rdquo;.</p>
<h2>The hardware problem: not all memory is created equal<span class="hx:absolute hx:-mt-20" id="the-hardware-problem-not-all-memory-is-created-equal"></span>
    <a href="#the-hardware-problem-not-all-memory-is-created-equal" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>&ldquo;But I have 128 GB of RAM!&rdquo; Cool, but that&rsquo;s not what matters. What matters is memory bandwidth, and the difference between types is wild:</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/05/llm-benchmark/en/memory-bandwidth.png" alt="Memory bandwidth by type"  loading="lazy" /></p>
<p>The RTX 5090 has 7x the bandwidth of the LPDDR5x memory in my Minisforum. That means even if a model fits in the AMD&rsquo;s unified RAM, inference will be proportionally slower. On my Minisforum with LPDDR5x at 256 GB/s, Qwen3 32B runs at ~7 tok/s. On the RTX 5090 at 1,792 GB/s, it&rsquo;d be much faster — if it fit entirely in VRAM alongside the KV Cache.</p>
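<p>Why bandwidth dominates: during decoding, each generated token has to stream roughly the entire active weight set through the memory bus once, so tokens per second is capped at bandwidth divided by model size. A back-of-the-envelope sketch (the ~20 GB weight size for a Q4-quantized 32B model is an assumption for illustration):</p>

```ruby
# Decode ceiling on memory-bound hardware: each token streams the weights once.
def decode_ceiling_tok_s(bandwidth_gb_s, model_gb)
  bandwidth_gb_s / model_gb.to_f
end

MODEL_GB = 20.0  # assumed footprint of a 32B model quantized to ~Q4

puts decode_ceiling_tok_s(50,    MODEL_GB)  # DDR4:      2.5 tok/s ceiling
puts decode_ceiling_tok_s(256,   MODEL_GB)  # LPDDR5x:  12.8 tok/s ceiling
puts decode_ceiling_tok_s(1_792, MODEL_GB)  # RTX 5090: 89.6 tok/s ceiling
```

<p>Measured throughput lands below these ceilings (the ~7 tok/s on the Minisforum against a 12.8 ceiling), but the ratios explain why the same model feels unusable on DDR4 and snappy on a 5090.</p>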
<p>Most folks running local models are still on DDR4. At 50 GB/s, 32B models are basically unusable. And there&rsquo;s another factor people forget: storage. When the RAM can&rsquo;t keep up and the system swaps, the storage speed becomes the bottleneck:</p>
<table>
  <thead>
      <tr>
          <th>Storage</th>
          <th style="text-align: right">Sequential Speed</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SATA SSD</td>
          <td style="text-align: right">~550 MB/s</td>
      </tr>
      <tr>
          <td>NVMe Gen3</td>
          <td style="text-align: right">~3,500 MB/s</td>
      </tr>
      <tr>
          <td>NVMe Gen4</td>
          <td style="text-align: right">~7,000 MB/s</td>
      </tr>
      <tr>
          <td>NVMe Gen5</td>
          <td style="text-align: right">~12,000 MB/s</td>
      </tr>
  </tbody>
</table>
<p>From SATA to NVMe Gen5 you&rsquo;re looking at a 22x difference. If you&rsquo;re doing partial offloading to disk (which is common when the model doesn&rsquo;t fit entirely on the GPU), NVMe Gen4 or Gen5 makes a real difference. SATA is a non-starter.</p>
<p>To sum up: running local models isn&rsquo;t just &ldquo;having enough RAM.&rdquo; You need the right kind of memory, with the right bandwidth, and fast storage as a fallback. For a lot of people, a Mac Studio with high-bandwidth unified memory (up to 800 GB/s on the M4 Ultra with 512 GB) would be the more practical option, but it costs more than US$ 10,000. The AMD Ryzen AI Max is the cheaper alternative with unified memory, but its LPDDR5x caps out at 256 GB/s.</p>
<h2>Ollama vs llama.cpp: why Ollama falls apart on benchmarks<span class="hx:absolute hx:-mt-20" id="ollama-vs-llamacpp-why-ollama-falls-apart-on-benchmarks"></span>
    <a href="#ollama-vs-llamacpp-why-ollama-falls-apart-on-benchmarks" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p><a href="https://ollama.com/"target="_blank" rel="noopener">Ollama</a> is the most popular way to run local models. Install, pull the model, run. For casual use it works. But when I tried to use it for automated benchmarks with long unattended sessions, it broke in four distinct ways across 8 models:</p>
<ol>
<li>Unloads the model mid-session. On long runs, Ollama decides the model isn&rsquo;t being used and unloads it from the GPU. The agent sits there waiting for a response from a model that no longer exists.</li>
<li>Ignores the requested context. You ask for <code>num_ctx=131072</code>, Ollama accepts, then halfway through the run it reverts to the default without warning.</li>
<li>Unstable lifecycle. Asking for <code>keep_alive: 0</code> to unload doesn&rsquo;t always work. The model stays resident and blocks the next one.</li>
<li>Incompatible formats. Native bf16 variants on Ollama failed, while the same model as a Q8 GGUF from HuggingFace worked fine.</li>
</ol>
<p>The fix: migrate to <a href="https://github.com/mostlygeek/llama-swap"target="_blank" rel="noopener">llama-swap</a>, a Go wrapper that manages llama.cpp processes with hot-swap. When a request comes in for a different model than the one currently loaded, it kills the current process and starts the new one. No context negotiation, no flaky lifecycle.</p>
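<p>Configuration is a single YAML file mapping model names to llama.cpp commands. A minimal sketch of the shape (field names follow llama-swap&rsquo;s README at the time of writing; the format has evolved between versions, so check the current docs):</p>

```yaml
# config.yaml for llama-swap: one entry per model, each its own llama-server
models:
  "qwen3-32b":
    cmd: >
      llama-server --port ${PORT}
      -m /models/Qwen3-32B-Q4_K_M.gguf
      -c 65536 -ngl 99
```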
<p>llama-swap fixed the loading of 6 of the 8 models that had failed under Ollama:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Ollama</th>
          <th>llama-swap</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Gemma 4 27B</td>
          <td>HTTP 500</td>
          <td>47.6 tok/s</td>
      </tr>
      <tr>
          <td>GLM 4.7 Flash</td>
          <td>No output</td>
          <td>47.4 tok/s</td>
      </tr>
      <tr>
          <td>Llama 4 Scout</td>
          <td>Unloaded</td>
          <td>17.5 tok/s</td>
      </tr>
      <tr>
          <td>Qwen 3.5 35B</td>
          <td>Output off-spec</td>
          <td>49.7 tok/s</td>
      </tr>
      <tr>
          <td>Qwen 3.5 122B</td>
          <td>Context drift</td>
          <td>23.1 tok/s</td>
      </tr>
      <tr>
          <td>GPT OSS 20B</td>
          <td>Model not found</td>
          <td>78.3 tok/s</td>
      </tr>
  </tbody>
</table>
<p>But llama-swap isn&rsquo;t magic.</p>
<h2>Why &ldquo;just use llama.cpp&rdquo; doesn&rsquo;t fix everything<span class="hx:absolute hx:-mt-20" id="why-just-use-llamacpp-doesnt-fix-everything"></span>
    <a href="#why-just-use-llamacpp-doesnt-fix-everything" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>llama.cpp solves Ollama&rsquo;s lifecycle problems but brings its own:</p>
<p>Each model needs specific flags. GLM and Qwen 3.5 emit <code>&lt;think&gt;</code> tags that break clients if you don&rsquo;t pass <code>--reasoning-format none</code>. Gemma 4 needs build b8665+ for the tool call parser to work.</p>
<p>Not every model supports tool calling. llama.cpp needs a dedicated parser for each model&rsquo;s tool call format. Llama 4 Scout uses a &ldquo;pythonic&rdquo; format (<code>[func(param=&quot;value&quot;)]</code>) that llama.cpp simply doesn&rsquo;t parse, so it comes out as plain text. vLLM has a parser for it, llama.cpp doesn&rsquo;t.</p>
<p>And then there are the repetition loops. Gemma 4, even with the right parser, gets into an infinite loop after ~11 tool calls in long sessions. It&rsquo;s a <a href="https://github.com/ggml-org/llama.cpp/issues/21375"target="_blank" rel="noopener">known bug</a> that PR #21418 didn&rsquo;t fully fix.</p>
<p>Tool calling compatibility per model:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Tool Calling</th>
          <th>Required Flags</th>
          <th>Benchmark Result</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Gemma 4 27B</td>
          <td>Partial (b8665+)</td>
          <td><code>--jinja --reasoning-format none</code></td>
          <td>Infinite loop after ~11 steps</td>
      </tr>
      <tr>
          <td>GLM 4.7 Flash</td>
          <td>Yes</td>
          <td><code>--jinja --reasoning-format none</code></td>
          <td>2029 files, ended mid-tool-call</td>
      </tr>
      <tr>
          <td>Qwen 3.5 (35B, 122B)</td>
          <td>Yes</td>
          <td><code>--jinja --reasoning-format none</code></td>
          <td>Completed successfully</td>
      </tr>
      <tr>
          <td>Qwen 3 Coder Next</td>
          <td>Yes</td>
          <td><code>--jinja</code></td>
          <td>Completed (best local result)</td>
      </tr>
      <tr>
          <td>GPT OSS 20B</td>
          <td>Yes</td>
          <td><code>--jinja</code></td>
          <td>Tool calls ok, but app in wrong directory</td>
      </tr>
      <tr>
          <td>Llama 4 Scout</td>
          <td>No</td>
          <td>—</td>
          <td>No parser in llama.cpp</td>
      </tr>
  </tbody>
</table>
<p>At the end of the day, llama.cpp is better than Ollama for automated runs, but &ldquo;plug and play&rdquo; it ain&rsquo;t. Each model requires specific configuration, and some just don&rsquo;t work for agentic coding yet.</p>
<h2>Reasoning: models that think vs models that wing it<span class="hx:absolute hx:-mt-20" id="reasoning-models-that-think-vs-models-that-wing-it"></span>
    <a href="#reasoning-models-that-think-vs-models-that-wing-it" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>There&rsquo;s one difference between models worth explaining: reasoning. The idea is that the model &ldquo;thinks before answering&rdquo; instead of generating tokens straight from left to right. Models with reasoning go through an internal chain-of-thought step where they evaluate the problem, consider alternatives, plan, and only then emit the response.</p>
<p>In practice this shows up as <code>&lt;think&gt;...&lt;/think&gt;</code> tags in the output, blocks of text the model writes to itself that shouldn&rsquo;t go to the end user. Claude Opus 4.6, GPT 5.4, DeepSeek V3.2 and the Qwen 3.5 line support reasoning natively. The smaller ones (Gemma 4, GPT OSS 20B, older models) don&rsquo;t have that capability.</p>
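<p>Since a chat UI typically hides or strips those blocks, a client that receives reasoning tokens verbatim has to clean them up itself. A minimal sketch, assuming the runtime passes the tags through untouched:</p>

```ruby
# Remove <think>...</think> blocks a reasoning model emits before its answer.
def strip_reasoning(text)
  text.gsub(%r{<think>.*?</think>}m, "").strip
end

raw = "<think>Plan: model, controller, then views.</think>Here is the migration."
puts strip_reasoning(raw)  # => Here is the migration.
```

<p>Clients that don&rsquo;t do even this much are the ones that &ldquo;break&rdquo; when a model starts thinking out loud.</p>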
<p>Why does it matter for coding? When a coding agent gets &ldquo;build a Rails app with 9 components,&rdquo; it has to decompose the task into steps, decide the order, anticipate dependencies, adapt when something fails. Without reasoning, the model generates code sequentially with no planning. It works for simple tasks, falls apart on projects with interdependent parts.</p>
<p>In the benchmark, the difference was clear:</p>
<ul>
<li>GPT OSS 20B (no reasoning, 20B parameters) created the app in the wrong directory. Couldn&rsquo;t keep workspace instructions in mind while generating code.</li>
<li>Qwen 3 32B has reasoning, but at 7 tok/s it was too slow. The &ldquo;thinking&rdquo; tokens drag out the generation time.</li>
<li>Gemma 4 31B, with no reasoning trained for agentic use, fell into repetitive tool calling loops.</li>
<li>GLM 5 (cloud, 745B MoE) with reasoning and 44B active parameters, finished cleanly and used the correct API.</li>
</ul>
<p>There&rsquo;s a trade-off: reasoning consumes extra tokens (the <code>&lt;think&gt;</code> blocks), which take up VRAM in the KV Cache and slow generation down. That&rsquo;s why flags like <code>--reasoning-format none</code> are needed in llama.cpp. Some clients don&rsquo;t know what to do with reasoning tokens and break. Models that emit reasoning when the runtime isn&rsquo;t expecting it can produce garbage in the output.</p>
<p>And reasoning isn&rsquo;t something you &ldquo;turn on&rdquo; in any model. It&rsquo;s a capability trained with reinforcement learning on top of the base model, using data from problems that require multi-step thinking. The smaller open source models (20B-35B) typically didn&rsquo;t go through that training, or went through it on a smaller scale. They know how to generate code, but they don&rsquo;t know how to <em>plan</em> code. On tasks that require 50+ coordinated tool calls, that difference is fatal.</p>
<h2>The benchmark: methodology<span class="hx:absolute hx:-mt-20" id="the-benchmark-methodology"></span>
    <a href="#the-benchmark-methodology" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>To compare models fairly, I built an automated harness in Python. Each model gets the exact same prompt: build a complete Ruby on Rails application, a ChatGPT-style chat SPA using the RubyLLM gem, with Hotwire/Stimulus/Turbo Streams, Tailwind CSS, Minitest tests, CI tools (Brakeman, RuboCop, SimpleCov, bundle-audit), Dockerfile, docker-compose and README.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/05/llm-benchmark/crush-screenshot.png" alt="Crush — coding CLI from Charm"  loading="lazy" /></p>
<p>The runner is <a href="https://github.com/opencode-ai/opencode"target="_blank" rel="noopener">opencode</a>, which at the time of the benchmark was the most popular open source coding CLI, competing with Claude Code and Codex. Since then the project has been archived and development continued as <a href="https://github.com/charmbracelet/crush"target="_blank" rel="noopener">Crush</a>, maintained by the original author together with the <a href="https://charm.sh/"target="_blank" rel="noopener">Charm</a> team (the folks behind Bubble Tea, Lip Gloss and several other Go terminal tools). If you read <a href="/en/2026/01/09/omarchy-3-one-of-the-best-coding-agents-crush/">my piece on Crush</a>, you already know it. Crush inherits everything from opencode — support for 75+ providers, LSP, MCP, persistent sessions — and adds the polished aesthetic that&rsquo;s a Charm trademark. It runs everywhere: macOS, Linux, Windows, Android, FreeBSD.</p>
<p>I actually tried to use Crush for the benchmark first. The problem: it advertised a <code>--yolo</code> flag in its help to auto-approve every action (essential for unattended automated runs), but at runtime it rejected the flag. Without auto-approve there&rsquo;s no way to do an unattended benchmark. opencode, on the other hand, had the <code>opencode run --agent build --format json</code> mode that emits JSON events with session IDs and token counts, perfect for automation. So we went with opencode.</p>
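<p>That JSON mode is what makes the harness possible: each run becomes a stream of JSON-lines events you can tally programmatically. A sketch of the idea in Ruby (the event and field names here are hypothetical, not opencode&rsquo;s actual schema):</p>

```ruby
require "json"

# Tally token usage from a stream of JSON-lines events.
# "type" and "tokens" are illustrative field names, not opencode's real schema.
def total_tokens(jsonl)
  jsonl.each_line.sum do |line|
    event = JSON.parse(line)
    event["type"] == "message" ? event.dig("tokens", "total").to_i : 0
  end
end

sample = <<~JSONL
  {"type":"session.start","session_id":"abc123"}
  {"type":"message","tokens":{"total":1532}}
  {"type":"message","tokens":{"total":2048}}
JSONL
puts total_tokens(sample)  # => 3580
```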
<p>I picked opencode (and not Claude Code or Codex) for two reasons:</p>
<ol>
<li>Neutrality. Claude Code is optimized for Anthropic models. Codex is optimized for OpenAI models. opencode is agnostic, same interface for all.</li>
<li>Automation. opencode exposes a machine-readable JSON format. Claude Code and Codex don&rsquo;t have an equivalent interface for external benchmarking.</li>
</ol>
<p>Cloud models ran in two phases: phase 1 (build the app) and phase 2 (validate local boot, docker build, docker compose). Local models only ran phase 1.</p>
<p>Worth mentioning: the entire benchmark cost less than $10 in tokens on OpenRouter. Apart from GPT 5.4 Pro which torched $7.20 to fail, the other 11 cloud models added up to about $2.50 total. Local models cost only electricity. The point is: running your own benchmark is cheap. If you want to know whether a model works for your use case, drop the $2 and test it. The harness code is on GitHub, just swap the prompt for your own project.</p>
<h2>Why GPT 5.4 failed the benchmark (but not in real life)<span class="hx:absolute hx:-mt-20" id="why-gpt-54-failed-the-benchmark-but-not-in-real-life"></span>
    <a href="#why-gpt-54-failed-the-benchmark-but-not-in-real-life" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>GPT 5.4 Pro is the only cloud model that consistently failed our benchmark. Two separate runs, same result: the model generated files but never reached <code>finish_reason: stop</code>. It always ended on <code>finish_reason: tool-calls</code> — wanted to keep calling tools but the loop kept breaking.</p>
<p>For folks who don&rsquo;t know: tool calling is when an LLM needs to perform an action (read a file, run a command, edit code) and emits a &ldquo;tool call&rdquo; in a structured format. The client (opencode, Claude Code, Codex) interprets it, executes it, and returns the result back to the model. Each provider has its own format: Anthropic uses <code>tool_use</code> blocks, OpenAI uses <code>function_calling</code> with proprietary JSON schemas, Google uses <code>FunctionCall</code>.</p>
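<p>Concretely, an OpenAI-style tool call arrives as structured JSON with the arguments themselves double-encoded as a JSON string — exactly the kind of detail that gets mangled when schemas are translated between providers. A sketch of parsing one (the <code>read_file</code> tool name is made up for illustration):</p>

```ruby
require "json"

# An OpenAI-style tool call: note that "arguments" is a JSON *string*, not an object.
raw = '{"tool_calls":[{"id":"call_1","type":"function",' \
      '"function":{"name":"read_file","arguments":"{\"path\":\"app/models/chat.rb\"}"}}]}'

call = JSON.parse(raw)["tool_calls"].first
args = JSON.parse(call["function"]["arguments"])  # second decode step
puts call["function"]["name"]  # => read_file
puts args["path"]              # => app/models/chat.rb
```

<p>The client has to run the named tool with those arguments and feed the result back as a new message; one mismatch in this round trip and the agent loop stalls.</p>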
<p>GPT 5.4 is heavily trained for OpenAI&rsquo;s native function calling format — <code>tool_choice</code>, <code>tools</code> with proprietary JSON schemas. When the benchmark routes through opencode → OpenRouter → GPT 5.4, the tool schemas get translated at every hop. If GPT emits tool calls in a format that OpenRouter or opencode doesn&rsquo;t parse correctly, the agent loop breaks.</p>
<p>The evidence: every other cloud model (Claude Opus, Claude Sonnet, Kimi K2.5, DeepSeek V3.2, MiniMax M2.7, GLM 5, Qwen 3.6 Plus, Step 3.5 Flash) ended on <code>finish_reason: stop</code>. Only GPT ends on <code>finish_reason: tool-calls</code>.</p>
<p>A fair comparison for GPT 5.4 would require running it in its native environment — Codex or ChatGPT Pro ($200/month). On opencode through OpenRouter, this isn&rsquo;t a fair test of GPT&rsquo;s coding ability. That said, I used Codex extensively during my vibe coding marathon and I can vouch that GPT 5.4 is as good as Opus for real projects. In some ways I actually prefer Codex: it tends to think more &ldquo;outside the box&rdquo; and arrives at more creative solutions than Opus. On the other hand, it&rsquo;s less disciplined — tends to forget previous instructions in long sessions and sometimes wanders off scope. Opus is more predictable and methodical. For me, that predictability is worth more day to day.</p>
<p>Sonnet and Opus through opencode/OpenRouter were probably also not pushed to their limits. Claude Code offers native tool support that opencode doesn&rsquo;t replicate — meaning the benchmark results represent a floor, not a ceiling, for those models.</p>
<h2>Open source models: reality vs the narrative<span class="hx:absolute hx:-mt-20" id="open-source-models-reality-vs-the-narrative"></span>
    <a href="#open-source-models-reality-vs-the-narrative" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>A lot of people are saying open source models have already caught up with the commercial ones and you can run your own &ldquo;Claude&rdquo; at home. In practice, not really.</p>
<p>The scale isn&rsquo;t comparable. Frontier models like Claude Opus 4.6 and GPT 5.4 are closed-source, but estimates put them in the hundreds of billions to trillions of parameters range, trained with compute and data no open source company can replicate. The best models that fit on reasonable hardware are:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th style="text-align: right">Total Parameters</th>
          <th style="text-align: right">Active Parameters</th>
          <th>Architecture</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Qwen 3.5 35B-A3B</td>
          <td style="text-align: right">35B</td>
          <td style="text-align: right">3B</td>
          <td>MoE (A3B)</td>
      </tr>
      <tr>
          <td>Qwen 3.5 27B</td>
          <td style="text-align: right">27B</td>
          <td style="text-align: right">27B</td>
          <td>Dense</td>
      </tr>
      <tr>
          <td>Qwen 3 32B</td>
          <td style="text-align: right">32B</td>
          <td style="text-align: right">32B</td>
          <td>Dense</td>
      </tr>
      <tr>
          <td>Qwen 3.5 122B</td>
          <td style="text-align: right">122B</td>
          <td style="text-align: right">122B</td>
          <td>Dense</td>
      </tr>
      <tr>
          <td>GPT OSS 20B</td>
          <td style="text-align: right">20B</td>
          <td style="text-align: right">20B</td>
          <td>Dense</td>
      </tr>
      <tr>
          <td>Gemma 4 31B</td>
          <td style="text-align: right">31B</td>
          <td style="text-align: right">31B</td>
          <td>Dense</td>
      </tr>
  </tbody>
</table>
<p>Post-publication correction: Qwen 3.5 35B is actually the <strong>35B-A3B</strong>, an MoE with only 3B active parameters per token (not dense, as I&rsquo;d originally written). That&rsquo;s why it runs relatively fast for its size. And for folks with 24 GB of VRAM, the model recommended by <a href="https://unsloth.ai/docs/models/qwen3.5#qwen3.5-27b"target="_blank" rel="noopener">Unsloth</a> themselves is the <strong>Qwen 3.5 27B</strong> dense — that one I didn&rsquo;t get around to testing in the benchmark, but it&rsquo;s worth a look. For anyone wanting to dig deeper into local models, <a href="https://x.com/sudoingX"target="_blank" rel="noopener">@sudoingX</a> has been doing some serious experimentation in this space. Thanks to <a href="https://x.com/thpmacedo/status/2041105305111502927"target="_blank" rel="noopener">@thpmacedo</a> for the heads-up.</p>
<p>Even the largest open source MoE (Mixture of Experts) models that companies make publicly available activate only a small fraction of parameters per token:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th style="text-align: right">Total Parameters</th>
          <th style="text-align: right">Active Parameters</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Kimi K2.5</td>
          <td style="text-align: right">1T</td>
          <td style="text-align: right">32B</td>
          <td>384 experts, top-8 + shared</td>
      </tr>
      <tr>
          <td>GLM 5</td>
          <td style="text-align: right">745B</td>
          <td style="text-align: right">44B</td>
          <td>256 experts, 8 activated</td>
      </tr>
      <tr>
          <td>DeepSeek V3.2</td>
          <td style="text-align: right">671B</td>
          <td style="text-align: right">37B</td>
          <td>Sparse Attention</td>
      </tr>
      <tr>
          <td>Qwen 3.5 397B</td>
          <td style="text-align: right">397B</td>
          <td style="text-align: right">17B</td>
          <td>MoE, cloud-only</td>
      </tr>
  </tbody>
</table>
<p>These large models aren&rsquo;t self-hostable. Kimi K2.5 with 1T parameters needs GPU clusters with hundreds of GBs of VRAM. GLM 5 with 745B is the same. Even if Alibaba or Z.AI release the weights (and some do), nobody has home hardware to run them.</p>
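<p>The MoE split matters because the two numbers govern different resources: total parameters set the memory floor (every expert must be resident), while active parameters set the per-token bandwidth cost, and therefore the decode speed. A rough sketch, assuming 1 byte per parameter (~Q8) to keep the arithmetic simple:</p>

```ruby
# MoE sizing: total params -> memory floor, active params -> decode ceiling.
# Assumes ~1 byte/param (Q8-ish) so billions of params map to GB directly.
def moe_footprint(total_b_params:, active_b_params:, bandwidth_gb_s:)
  {
    memory_floor_gb: total_b_params,  # all experts must be loaded
    decode_ceiling_tok_s: bandwidth_gb_s / active_b_params.to_f
  }
end

# Kimi K2.5 (1T total / 32B active) against an RTX 5090's 1,792 GB/s:
p moe_footprint(total_b_params: 1000, active_b_params: 32, bandwidth_gb_s: 1792)
# A 1,000 GB memory floor is hopeless on 32 GB of VRAM, even though the
# per-token bandwidth cost alone would allow ~56 tok/s if it somehow fit.
```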
<p>What fits on your home GPU are the 20B-35B models — and those have real limitations.</p>
<h3>What each local model did in the benchmark<span class="hx:absolute hx:-mt-20" id="what-each-local-model-did-in-the-benchmark"></span>
    <a href="#what-each-local-model-did-in-the-benchmark" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Results from the original run on the AMD Strix Halo:</p>
<p><strong>Qwen 3 Coder Next (30B)</strong> — Completed in 17 minutes on the Strix, generated 1675 files, Rails app with all the artifacts. But only 3 tests. And more importantly: it invented <code>RubyLLM::Client.new</code>, a class that doesn&rsquo;t exist in the gem. The app doesn&rsquo;t run.</p>
<p><strong>Qwen 3.5 35B</strong> — Completed in 28 minutes on the Strix, 1478 files, 11 tests. Used <code>RubyLLM.chat</code> without a model parameter — works only if the default is configured. No LLM mocking in the tests.</p>
<p><strong>Qwen 3.5 122B</strong> — Completed in 43 minutes on the Strix, 1503 files, 16 tests. But it ignored the RubyLLM gem completely and built a custom HTTP client for OpenRouter. The prompt explicitly asked for ruby_llm.</p>
<p><strong>GLM 4.7 Flash (local, Strix)</strong> — Produced 2029 files with all the artifacts, but the session ended mid-tool-call. The cloud version (GLM 5) works perfectly.</p>
<p><strong>Gemma 4 31B (Strix)</strong> — Infinite tool call loop after ~11 productive steps. Known llama.cpp bug.</p>
<p><strong>GPT OSS 20B (Strix)</strong> — Created the Rails app in the wrong directory (<code>project/app/</code> instead of <code>project/</code>). A 20B model doesn&rsquo;t follow workspace instructions reliably.</p>
<p><strong>Qwen 3 32B (Strix)</strong> — Way too slow (7.3 tok/s). The hardware can&rsquo;t keep up.</p>
<p>And the results from the rerun on the NVIDIA RTX 5090 (all with Q3_K_M or Q4_K_M and contexts between 64k and 128k to fit the 32 GB of VRAM):</p>
<p><strong>Qwen 3.5 35B-A3B (5090)</strong> — 5 minutes at 273 tok/s. Recognizable Rails project, entry point <code>RubyLLM.chat(model:)</code> is right, but it hallucinates <code>chat.add_message(role:, content:)</code> and <code>chat.complete</code> instead of <code>.ask</code>. Fixable in 1-2 follow-ups. The best candidate for &ldquo;OSS local that&rsquo;s actually worth trying.&rdquo;</p>
<p><strong>Qwen 3.5 27B Claude-distilled (5090)</strong> — 12 minutes at 129 tok/s. Impeccable Claude style, total API hallucination (<code>RubyLLM::Chat.new.with_model{}</code>, <code>add_message</code>, <code>response.text</code>). More details in the distillation section below.</p>
<p><strong>Qwen 3 Coder 30B (5090)</strong> — 6 minutes at 145 tok/s. Returned a hardcoded mock string instead of calling the API. Tier 3 unusable.</p>
<p><strong>Qwen 2.5 Coder 32B (5090)</strong> — 90 minutes of timeout, zero files. The model spun without ever calling a write tool.</p>
<p><strong>Qwen 3 32B (5090)</strong> — 4 minutes at 69 tok/s, partial scaffold, errors. The general version is better than the Coder one but still breaks.</p>
<p><strong>Gemma 4 31B (5090)</strong> — 8 minutes at 213 tok/s. Same repetition loop it had on the Strix. The llama.cpp bug isn&rsquo;t a hardware issue.</p>
<p><strong>Qwen 3.5 27B Sushi Coder RL (5090)</strong> — Infrastructure failure (<code>ProviderModelNotFoundError</code>), couldn&rsquo;t be evaluated. Redo on a future run.</p>
<p><strong>GPT OSS 20B (5090)</strong> — Pulled from this run because of a recent llama.cpp main regression in the harmony family tool call parser. The logs show <code>Failed to parse input at pos 755: &lt;|channel|&gt;...</code> in multi-turn sessions. It worked on the Strix with llama.cpp <code>b8643</code>, broken on today&rsquo;s main. Waiting on upstream to fix it.</p>
<h2>Cloud models: what actually works<span class="hx:absolute hx:-mt-20" id="cloud-models-what-actually-works"></span>
    <a href="#cloud-models-what-actually-works" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Of the 12 models that completed the benchmark, all of them generated a recognizable Rails project with all the requested artifacts (Gemfile, routes, views, JS, tests, README, Dockerfile, docker-compose). 9 out of 9 on the completeness checklist.</p>
<p>But here comes the question that matters: does the code run?</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/05/llm-benchmark/en/cost-vs-quality.png" alt="Cost vs time — and does the code work?"  loading="lazy" /></p>
<p>The correct RubyLLM API is simple:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ruby" data-lang="ruby"><span class="line"><span class="cl"><span class="n">chat</span> <span class="o">=</span> <span class="no">RubyLLM</span><span class="o">.</span><span class="n">chat</span><span class="p">(</span><span class="ss">model</span><span class="p">:</span> <span class="s2">&#34;anthropic/claude-sonnet-4&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">response</span> <span class="o">=</span> <span class="n">chat</span><span class="o">.</span><span class="n">ask</span><span class="p">(</span><span class="s2">&#34;Hello&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">response</span><span class="o">.</span><span class="n">content</span>  <span class="c1"># =&gt; &#34;Hi there!&#34;</span></span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>8 of the 12 models invented APIs that don&rsquo;t exist. The most common pattern: hallucinating an interface that doesn&rsquo;t match the actual gem:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>What It Invented</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DeepSeek V3.2</td>
          <td><code>RubyLLM::Client.new</code> — nonexistent class</td>
      </tr>
      <tr>
          <td>Qwen 3 Coder Next</td>
          <td><code>RubyLLM::Client.new</code> — same error</td>
      </tr>
      <tr>
          <td>Qwen 3.5 122B</td>
          <td><code>Openrouter::Client</code> — nonexistent gem</td>
      </tr>
      <tr>
          <td>Kimi K2.5</td>
          <td><code>add_message()</code> and <code>complete()</code> — nonexistent methods</td>
      </tr>
      <tr>
          <td>MiniMax M2.7</td>
          <td><code>RubyLLM.chat(messages: [...])</code> — nonexistent signature</td>
      </tr>
      <tr>
          <td>Qwen 3.6 Plus</td>
          <td><code>chat.add_message()</code> — nonexistent method</td>
      </tr>
      <tr>
          <td>Gemini 3.1 Pro</td>
          <td><code>RubyLLM::Chat.new()</code> and <code>add_message()</code> — internal API, not public</td>
      </tr>
      <tr>
          <td>Grok 4.20</td>
          <td>Ignores the gem completely — uses <code>OpenAI::Client</code> (ruby-openai) hitting the OpenRouter URL directly</td>
      </tr>
  </tbody>
</table>
<p>The models that got it right — both Claudes, GLM 5 and GLM 5.1 — used the simple two-step pattern (<code>chat = RubyLLM.chat(model:)</code> then <code>chat.ask(message)</code>). The ones that got it wrong tried to make RubyLLM look like the OpenAI Python SDK, which is a different thing. Grok 4.20 was the most brazen case: it didn&rsquo;t even try to use the gem, it went straight for <code>OpenAI::Client</code> pointing at the OpenRouter URL, ignoring the explicit prompt.</p>
<p>And the tests? Only Opus, Sonnet, GLM 5 and GLM 5.1 did proper mocking of the LLM calls. All the others either hit the real API (which fails without a key) or mocked the invented API (tests pass but prove nothing). Test count is a misleading metric: Kimi K2.5 wrote 37 tests, more than anyone else, but none of them test real functionality because the API it uses doesn&rsquo;t exist.</p>
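<p>&ldquo;Proper mocking&rdquo; here means replacing the LLM boundary so tests assert real application behavior without a key or network. The winners used mocha; the same idea in plain Ruby with a hand-rolled stand-in (the fake classes are illustrative):</p>

```ruby
# A hand-rolled stand-in for RubyLLM so tests never touch a real API.
module RubyLLM
  FakeResponse = Struct.new(:content)

  class FakeChat
    def ask(_message)
      FakeResponse.new("Hi there!")
    end
  end

  def self.chat(model:)
    FakeChat.new
  end
end

response = RubyLLM.chat(model: "anthropic/claude-sonnet-4").ask("Hello")
puts response.content  # => Hi there!
```

<p>This only proves anything if the stubbed surface (<code>RubyLLM.chat</code> → <code>chat.ask</code> → <code>response.content</code>) matches the real gem — which is precisely why mocking an invented API makes tests pass while proving nothing.</p>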
<h3>Real viability table<span class="hx:absolute hx:-mt-20" id="real-viability-table"></span>
    <a href="#real-viability-table" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><table>
  <thead>
      <tr>
          <th>Model</th>
          <th style="text-align: center">Correct API?</th>
          <th style="text-align: center">Runs?</th>
          <th style="text-align: center">Test Mocking?</th>
          <th>Problem</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Claude Sonnet 4.6</strong></td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center"><strong>Yes</strong></td>
          <td style="text-align: center">Yes (mocha)</td>
          <td>Clean implementation</td>
      </tr>
      <tr>
          <td><strong>Claude Opus 4.6</strong></td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center"><strong>Yes</strong></td>
          <td style="text-align: center">Yes (mocha)</td>
          <td>Clean implementation</td>
      </tr>
      <tr>
          <td><strong>GLM 5</strong></td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center"><strong>Yes</strong></td>
          <td style="text-align: center">Yes (mocha)</td>
          <td>Correct API, works</td>
      </tr>
      <tr>
          <td><strong>GLM 5.1</strong></td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center"><strong>Yes</strong></td>
          <td style="text-align: center">Yes</td>
          <td>Correct API, works</td>
      </tr>
      <tr>
          <td>Step 3.5 Flash</td>
          <td style="text-align: center">N/A</td>
          <td style="text-align: center"><strong>Yes</strong>*</td>
          <td style="text-align: center">No</td>
          <td>Bypasses RubyLLM, uses HTTP directly</td>
      </tr>
      <tr>
          <td>Grok 4.20</td>
          <td style="text-align: center">N/A</td>
          <td style="text-align: center"><strong>Yes</strong>*</td>
          <td style="text-align: center">No</td>
          <td>Bypasses RubyLLM, uses <code>OpenAI::Client</code> directly</td>
      </tr>
      <tr>
          <td>Qwen 3.6 Plus</td>
          <td style="text-align: center">Partial</td>
          <td style="text-align: center">Only 1st msg</td>
          <td style="text-align: center">No</td>
          <td><code>add_message()</code> doesn&rsquo;t exist</td>
      </tr>
      <tr>
          <td>Qwen 3.5 35B</td>
          <td style="text-align: center">Partial</td>
          <td style="text-align: center">Maybe</td>
          <td style="text-align: center">No</td>
          <td>No model parameter</td>
      </tr>
      <tr>
          <td>Kimi K2.5</td>
          <td style="text-align: center">No</td>
          <td style="text-align: center"><strong>No</strong></td>
          <td style="text-align: center">No</td>
          <td><code>add_message()</code>/<code>complete()</code> invented</td>
      </tr>
      <tr>
          <td>MiniMax M2.7</td>
          <td style="text-align: center">No</td>
          <td style="text-align: center"><strong>No</strong></td>
          <td style="text-align: center">No</td>
          <td><code>RubyLLM.chat</code> signature wrong</td>
      </tr>
      <tr>
          <td>DeepSeek V3.2</td>
          <td style="text-align: center">No</td>
          <td style="text-align: center"><strong>No</strong></td>
          <td style="text-align: center">No</td>
          <td><code>RubyLLM::Client</code> nonexistent</td>
      </tr>
      <tr>
          <td>Qwen 3 Coder Next</td>
          <td style="text-align: center">No</td>
          <td style="text-align: center"><strong>No</strong></td>
          <td style="text-align: center">No</td>
          <td><code>RubyLLM::Client</code> nonexistent</td>
      </tr>
      <tr>
          <td>Gemini 3.1 Pro</td>
          <td style="text-align: center">No</td>
          <td style="text-align: center"><strong>No</strong></td>
          <td style="text-align: center">Wrong mock</td>
          <td><code>RubyLLM::Chat.new()</code> and <code>add_message()</code> don&rsquo;t exist</td>
      </tr>
      <tr>
          <td>Qwen 3.5 122B</td>
          <td style="text-align: center">No</td>
          <td style="text-align: center"><strong>No</strong></td>
          <td style="text-align: center">No</td>
          <td><code>Openrouter::Client</code> gem doesn&rsquo;t exist</td>
      </tr>
  </tbody>
</table>
<p>*The asterisked &ldquo;Yes&rdquo; means the app runs only because it bypasses the gem the prompt asked for: Step 3.5 Flash calls the OpenRouter REST API directly with <code>Net::HTTP</code>, and Grok 4.20 does the same through <code>OpenAI::Client</code>.</p>
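<p>For the curious, the <code>Net::HTTP</code> shortcut looks roughly like this. The endpoint and payload shape follow OpenRouter's standard chat-completions API; the model slug is illustrative:</p>

```ruby
require "net/http"
require "json"
require "uri"

# Build a chat-completions request against OpenRouter's REST API,
# skipping RubyLLM entirely -- the shortcut Step 3.5 Flash took.
def build_openrouter_request(api_key, model, messages)
  uri = URI("https://openrouter.ai/api/v1/chat/completions")
  req = Net::HTTP::Post.new(uri)
  req["Authorization"] = "Bearer #{api_key}"
  req["Content-Type"]  = "application/json"
  req.body = JSON.generate(model: model, messages: messages)
  [uri, req]
end

# Sending it requires a real key:
#   uri, req = build_openrouter_request(ENV.fetch("OPENROUTER_API_KEY"),
#                                       "stepfun/step-3.5-flash",  # illustrative slug
#                                       [{ role: "user", content: "hi" }])
#   res = Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) { |h| h.request(req) }
#   JSON.parse(res.body).dig("choices", 0, "message", "content")
```

<p>It works, but it throws away everything the gem provides (provider abstraction, streaming, error handling), and more importantly, it isn&rsquo;t what the prompt asked for.</p>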
<p>Now, this doesn&rsquo;t mean those models are useless. If you take Kimi K2.5 or DeepSeek V3.2 and tell it &ldquo;the RubyLLM::Client class doesn&rsquo;t exist, fix it to use the gem&rsquo;s real API&rdquo;, it&rsquo;ll probably fix it. One or two follow-ups and the project becomes functional. Most of the models that failed here could deliver a working project with a few more rounds of conversation.</p>
<p>But that&rsquo;s where the trade-off lives. With Opus or GPT 5.4, the first output already works. You ask, they deliver, you test, it runs. With the cheaper models, you&rsquo;ll spend time fixing API hallucinations, debugging code that &ldquo;looks right&rdquo; but crashes, steering the model in the right direction. Each of those rounds is 10-30 minutes. Three extra rounds and you&rsquo;ve spent an hour of your time to save $0.90 in tokens.</p>
<p>You save dollars, you spend time. And time is money. For someone learning or exploring without urgency, that trade can make sense. For someone who needs to ship, the frontier models pay for themselves fast.</p>
<h3>Comparing the models that work<span class="hx:absolute hx:-mt-20" id="comparing-the-models-that-work"></span>
    <a href="#comparing-the-models-that-work" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Provider</th>
          <th style="text-align: right">Time</th>
          <th style="text-align: right">Tests</th>
          <th style="text-align: right">Cost/Run</th>
          <th>vs Opus</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Claude Sonnet 4.6</td>
          <td>OpenRouter</td>
          <td style="text-align: right">16m</td>
          <td style="text-align: right">30</td>
          <td style="text-align: right">~$0.63</td>
          <td>40% cheaper, more tests</td>
      </tr>
      <tr>
          <td>Claude Opus 4.6</td>
          <td>OpenRouter</td>
          <td style="text-align: right">16m</td>
          <td style="text-align: right">16</td>
          <td style="text-align: right">~$1.05</td>
          <td>Baseline</td>
      </tr>
      <tr>
          <td>GLM 5</td>
          <td>OpenRouter</td>
          <td style="text-align: right">17m</td>
          <td style="text-align: right">7</td>
          <td style="text-align: right">~$0.11</td>
          <td>89% cheaper</td>
      </tr>
      <tr>
          <td>GLM 5.1</td>
          <td>Z.AI direct</td>
          <td style="text-align: right">22m</td>
          <td style="text-align: right">24</td>
          <td style="text-align: right">~$0.13</td>
          <td>~88% cheaper, more tests than GLM 5</td>
      </tr>
  </tbody>
</table>
<h3>Full ranking by time and tokens<span class="hx:absolute hx:-mt-20" id="full-ranking-by-time-and-tokens"></span>
    <a href="#full-ranking-by-time-and-tokens" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/05/llm-benchmark/en/time-to-complete.png" alt="Time to complete by model"  loading="lazy" /></p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Provider</th>
          <th style="text-align: right">Time</th>
          <th style="text-align: right">Total Tokens</th>
          <th style="text-align: right">Tok/s</th>
          <th style="text-align: right">Cost/Run</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Grok 4.20</td>
          <td>OpenRouter</td>
          <td style="text-align: right">8m</td>
          <td style="text-align: right">63,457</td>
          <td style="text-align: right">412.54</td>
          <td style="text-align: right">~$0.04</td>
      </tr>
      <tr>
          <td>Gemini 3.1 Pro</td>
          <td>OpenRouter</td>
          <td style="text-align: right">14m</td>
          <td style="text-align: right">104,034</td>
          <td style="text-align: right">128.28</td>
          <td style="text-align: right">~$0.50</td>
      </tr>
      <tr>
          <td>MiniMax M2.7</td>
          <td>OpenRouter</td>
          <td style="text-align: right">14m</td>
          <td style="text-align: right">79,743</td>
          <td style="text-align: right">574.52</td>
          <td style="text-align: right">~$0.05</td>
      </tr>
      <tr>
          <td>Claude Opus 4.6</td>
          <td>OpenRouter</td>
          <td style="text-align: right">16m</td>
          <td style="text-align: right">136,806</td>
          <td style="text-align: right">347.18</td>
          <td style="text-align: right">~$1.05</td>
      </tr>
      <tr>
          <td>Claude Sonnet 4.6</td>
          <td>OpenRouter</td>
          <td style="text-align: right">16m</td>
          <td style="text-align: right">127,067</td>
          <td style="text-align: right">532.26</td>
          <td style="text-align: right">~$0.63</td>
      </tr>
      <tr>
          <td>GLM 5</td>
          <td>OpenRouter</td>
          <td style="text-align: right">17m</td>
          <td style="text-align: right">59,378</td>
          <td style="text-align: right">400.01</td>
          <td style="text-align: right">~$0.11</td>
      </tr>
      <tr>
          <td>Qwen 3.6 Plus</td>
          <td>OpenRouter</td>
          <td style="text-align: right">17m</td>
          <td style="text-align: right">88,940</td>
          <td style="text-align: right">182.91</td>
          <td style="text-align: right">Free</td>
      </tr>
      <tr>
          <td>GLM 5.1</td>
          <td>Z.AI direct</td>
          <td style="text-align: right">22m</td>
          <td style="text-align: right">81,666</td>
          <td style="text-align: right">166.62</td>
          <td style="text-align: right">~$0.13</td>
      </tr>
      <tr>
          <td>Qwen 3 Coder Next</td>
          <td>Local</td>
          <td style="text-align: right">17m</td>
          <td style="text-align: right">39,054</td>
          <td style="text-align: right">37.49</td>
          <td style="text-align: right">Electricity</td>
      </tr>
      <tr>
          <td>Qwen 3.5 35B</td>
          <td>Local</td>
          <td style="text-align: right">28m</td>
          <td style="text-align: right">76,919</td>
          <td style="text-align: right">46.03</td>
          <td style="text-align: right">Electricity</td>
      </tr>
      <tr>
          <td>Kimi K2.5</td>
          <td>OpenRouter</td>
          <td style="text-align: right">29m</td>
          <td style="text-align: right">63,638</td>
          <td style="text-align: right">160.14</td>
          <td style="text-align: right">~$0.07</td>
      </tr>
      <tr>
          <td>Step 3.5 Flash</td>
          <td>OpenRouter</td>
          <td style="text-align: right">38m</td>
          <td style="text-align: right">156,267</td>
          <td style="text-align: right">242.11</td>
          <td style="text-align: right">~$0.02</td>
      </tr>
      <tr>
          <td>Qwen 3.5 122B</td>
          <td>Local</td>
          <td style="text-align: right">43m</td>
          <td style="text-align: right">57,472</td>
          <td style="text-align: right">22.41</td>
          <td style="text-align: right">Electricity</td>
      </tr>
      <tr>
          <td>DeepSeek V3.2</td>
          <td>OpenRouter</td>
          <td style="text-align: right">60m</td>
          <td style="text-align: right">115,278</td>
          <td style="text-align: right">53.37</td>
          <td style="text-align: right">~$0.04</td>
      </tr>
  </tbody>
</table>
<p>DeepSeek V3.2 is the slowest despite being cloud — it has no prompt caching, so it resends the full context on every turn.</p>
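<p>The impact is easy to underestimate, because billed input grows quadratically when every turn resends the whole history. A toy calculation (turn counts and token sizes are illustrative, not DeepSeek&rsquo;s actual numbers):</p>

```ruby
# Without prompt caching, turn N resends everything from turns 1..N-1,
# so total billed input grows quadratically with conversation length.
initial_context = 2_000   # system prompt + project files (illustrative)
tokens_per_turn = 1_000   # new content added per turn (illustrative)
turns           = 50

no_cache = (0...turns).sum { |i| initial_context + i * tokens_per_turn }
cached   = initial_context + turns * tokens_per_turn  # mostly just the new tokens

puts no_cache  # 1,325,000 tokens billed as fresh input
puts cached    # 52,000 -- the rest is served from cache at a discount
```

<p>That 25x gap in effective input is why the cache-read column in the table above matters more than the raw token totals.</p>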
<h3>Token efficiency and cache<span class="hx:absolute hx:-mt-20" id="token-efficiency-and-cache"></span>
    <a href="#token-efficiency-and-cache" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Models with prompt caching pay much less in effective tokens:</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/05/llm-benchmark/en/token-efficiency.png" alt="Token efficiency: cache vs new"  loading="lazy" /></p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th style="text-align: right">Total Tokens</th>
          <th style="text-align: right">Cache Read</th>
          <th style="text-align: right">Effective New Tokens</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Claude Sonnet 4.6</td>
          <td style="text-align: right">127,067</td>
          <td style="text-align: right">126,429</td>
          <td style="text-align: right">638</td>
      </tr>
      <tr>
          <td>Claude Opus 4.6</td>
          <td style="text-align: right">136,806</td>
          <td style="text-align: right">135,976</td>
          <td style="text-align: right">830</td>
      </tr>
      <tr>
          <td>GLM 5</td>
          <td style="text-align: right">59,378</td>
          <td style="text-align: right">58,240</td>
          <td style="text-align: right">1,138</td>
      </tr>
      <tr>
          <td>GLM 5.1</td>
          <td style="text-align: right">81,666</td>
          <td style="text-align: right">81,216</td>
          <td style="text-align: right">450</td>
      </tr>
      <tr>
          <td>Grok 4.20</td>
          <td style="text-align: right">63,457</td>
          <td style="text-align: right">62,400</td>
          <td style="text-align: right">1,057</td>
      </tr>
      <tr>
          <td>Gemini 3.1 Pro</td>
          <td style="text-align: right">104,034</td>
          <td style="text-align: right">98,129</td>
          <td style="text-align: right">5,905</td>
      </tr>
      <tr>
          <td>DeepSeek V3.2</td>
          <td style="text-align: right">115,278</td>
          <td style="text-align: right">0</td>
          <td style="text-align: right">115,278</td>
      </tr>
      <tr>
          <td>Kimi K2.5</td>
          <td style="text-align: right">63,638</td>
          <td style="text-align: right">0</td>
          <td style="text-align: right">63,638</td>
      </tr>
  </tbody>
</table>
<h2>Speed: the chasm between cloud and local<span class="hx:absolute hx:-mt-20" id="speed-the-chasm-between-cloud-and-local"></span>
    <a href="#speed-the-chasm-between-cloud-and-local" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>There&rsquo;s an aspect that the cost tables hide: inference speed. And the difference is brutal.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/05/llm-benchmark/en/speed-comparison.png" alt="Inference speed by model"  loading="lazy" /></p>
<p>Claude Sonnet generates 532 tok/s. Qwen 3.5 122B running locally on my Minisforum (AMD Strix Halo) generates 22 tok/s. That&rsquo;s a 24x difference. In practice, what Sonnet does in 16 minutes, Qwen 3.5 122B takes 43 minutes. Qwen 3 Coder Next at 37 tok/s is the fastest of the local models on the Strix and even so it&rsquo;s 14x slower than Sonnet.</p>
<p>And it&rsquo;s not just clock time. When you&rsquo;re in an interactive coding loop — ask for a change, wait for output, test, ask for another — the model&rsquo;s speed sets your rhythm. At 37 tok/s, every long response makes you wait 30-60 seconds. At 530 tok/s, it appears almost instantly. Over a day, you feel it.</p>
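<p>That rhythm claim is just division: response length over throughput. Using the measured speeds from the table (the response length is an assumption; ~1,500 tokens is a plausible long coding answer):</p>

```ruby
# Perceived wait for one long response at the measured speeds.
response_tokens = 1_500   # assumed length of a long answer

{
  "Claude Sonnet 4.6 (cloud)" => 532,
  "Qwen 3 Coder Next (Strix)" => 37,
  "Qwen 3.5 122B (Strix)"     => 22,
}.each do |model, tok_per_s|
  puts format("%-28s %6.1f s", model, response_tokens.to_f / tok_per_s)
end
# Roughly: ~3 s for Sonnet, ~40 s and ~68 s for the local runs.
```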
<p>DeepSeek V3.2 is a curious case: it&rsquo;s cloud but it runs at 53 tok/s, slower than the locally-running Qwen 3.5 35B on the Strix (46 tok/s). The reason is that DeepSeek has no prompt caching — it resends the full context on every turn, strangling throughput. Paying for a cloud model that&rsquo;s slower than running it locally doesn&rsquo;t make any sense.</p>
<p>Local models are free in tokens, but they pay in time. On the AMD Strix, that math was a non-starter for every Qwen I tested: two minutes waiting for a long response, multiplied by 50 turns, eats your whole afternoon. But that changes when the hardware changes, and that&rsquo;s why I redid the local part of the benchmark on a different machine.</p>
<h2>AMD Strix Halo vs NVIDIA RTX 5090: what changes when the memory bandwidth doubles<span class="hx:absolute hx:-mt-20" id="amd-strix-halo-vs-nvidia-rtx-5090-what-changes-when-the-memory-bandwidth-doubles"></span>
    <a href="#amd-strix-halo-vs-nvidia-rtx-5090-what-changes-when-the-memory-bandwidth-doubles" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>To check whether the bottleneck was hardware or model, I took the same Qwen models and reran the benchmark on a workstation with an NVIDIA RTX 5090 (Blackwell, 32 GB GDDR7, 1,792 GB/s bandwidth). The numbers shift in a way that&rsquo;s worth looking at carefully.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th style="text-align: right">AMD Strix (LPDDR5x)</th>
          <th style="text-align: right">NVIDIA 5090 (GDDR7)</th>
          <th style="text-align: right">Speedup</th>
          <th style="text-align: right">Total time on 5090</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Qwen 3 32B (dense)</td>
          <td style="text-align: right">7 tok/s</td>
          <td style="text-align: right">69 tok/s</td>
          <td style="text-align: right">~10x</td>
          <td style="text-align: right">4 min</td>
      </tr>
      <tr>
          <td>Qwen 3 Coder 30B (Coder)</td>
          <td style="text-align: right">37 tok/s</td>
          <td style="text-align: right">145 tok/s</td>
          <td style="text-align: right">~4x</td>
          <td style="text-align: right">6 min</td>
      </tr>
      <tr>
          <td>Qwen 3.5 35B-A3B (MoE)</td>
          <td style="text-align: right">46 tok/s</td>
          <td style="text-align: right"><strong>273 tok/s</strong></td>
          <td style="text-align: right">~6x</td>
          <td style="text-align: right">5 min</td>
      </tr>
      <tr>
          <td>Qwen 3.5 27B Claude (distilled)</td>
          <td style="text-align: right">timeout 90m</td>
          <td style="text-align: right">129 tok/s</td>
          <td style="text-align: right">n/a</td>
          <td style="text-align: right">12 min</td>
      </tr>
      <tr>
          <td>Gemma 4 31B</td>
          <td style="text-align: right">(didn&rsquo;t test on Strix)</td>
          <td style="text-align: right">213 tok/s</td>
          <td style="text-align: right">n/a</td>
          <td style="text-align: right">8 min</td>
      </tr>
      <tr>
          <td>Qwen 2.5 Coder 32B</td>
          <td style="text-align: right">(didn&rsquo;t test on Strix)</td>
          <td style="text-align: right">2.86 tok/s</td>
          <td style="text-align: right">n/a</td>
          <td style="text-align: right">timeout 90m</td>
      </tr>
  </tbody>
</table>
<p>To put those speeds in context, remember that in the cloud Sonnet runs at 532 tok/s, Opus at 347 tok/s, Step 3.5 Flash at 242 tok/s, Gemini 3.1 Pro at 128 tok/s and Kimi K2.5 at 160 tok/s. Qwen 3.5 35B-A3B on the 5090, at 273 tok/s, is in the same neighborhood as Step 3.5 Flash, faster than Gemini, Kimi and GLM 5.1. Qwen 3 Coder 30B at 145 tok/s is in Gemini territory. The classic line &ldquo;local models are ten times slower than cloud&rdquo; stopped being true the moment the 5090 entered the conversation.</p>
<p>The practical consequence is that the &ldquo;time is money&rdquo; argument shifts. On the Strix, &ldquo;waiting an hour for a Qwen 3.5 122B to do what Sonnet does in 16 minutes&rdquo; is straight-up loss. On the 5090, waiting 5 minutes for Qwen 3.5 35B-A3B to do the work, plus 10-15 minutes for you to do 1-2 correction prompts, gives you a total in the 20-25 minute range. Sonnet does it in 16 minutes with zero corrections. The difference shrank enough that, if cost matters a lot, it&rsquo;s worth it.</p>
<p>The catch: for this to be worth it, the model has to be close enough to the right answer that 1-2 correction prompts can fix it. When the error is &ldquo;the model decided not to use the gem I asked for and returned a hardcoded mock string,&rdquo; like Qwen 3 Coder 30B did, no easy correction prompt fixes that. That&rsquo;s a redo.</p>
<h3>Before you spend money on hardware thinking it&rsquo;s the answer<span class="hx:absolute hx:-mt-20" id="before-you-spend-money-on-hardware-thinking-its-the-answer"></span>
    <a href="#before-you-spend-money-on-hardware-thinking-its-the-answer" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>I&rsquo;ve got to give a warning here, because it&rsquo;s the most common buying mistake I see right now. Every other week somebody tells me they&rsquo;re going to grab a Ryzen AI Max because it has 128 GB of unified memory and that &ldquo;lets you run huge models.&rdquo; Technically, sure — the model fits. In practice, it&rsquo;s almost unusable. The memory is LPDDR5x at 256 GB/s, seven times slower than the 5090&rsquo;s GDDR7. What fits doesn&rsquo;t run at human speed. My own Strix with Qwen 3.5 122B hit 22 tok/s and the run took 43 minutes. To do anything serious day to day, that&rsquo;s not workable.</p>
<p>The 5090 is clearly superior, and it starts to make sense even for smaller models precisely because of the memory bandwidth. A Mac Studio with high-speed unified memory (up to 800 GB/s on the M4 Ultra) is the other viable option, and sits in the same price range. But neither of those comes anywhere close to beating the commercial models on quality &mdash; and the per-token price of Claude, GPT or GLM, combined with their brutal inference speed, makes the math hard to justify for anyone who isn&rsquo;t an enthusiast or a researcher. Expensive local AI hardware is a weekend hobby, a tool for people who need to run offline for compliance reasons, or a research playground. For day-after-day production work, right now, cloud is still the rational choice. A 128 GB Ryzen AI Max may look tempting on the spec sheet, but if the goal is serious coding agent work, it&rsquo;s money badly spent.</p>
<h2>The Qwen family: Coder vs General, distillation, and why nothing is a silver bullet<span class="hx:absolute hx:-mt-20" id="the-qwen-family-coder-vs-general-distillation-and-why-nothing-is-a-silver-bullet"></span>
    <a href="#the-qwen-family-coder-vs-general-distillation-and-why-nothing-is-a-silver-bullet" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>With so many different Qwens running in this rerun, it&rsquo;s worth doing a more focused analysis. What I learned might surprise people who follow model benchmarks on Twitter.</p>
<h3>Before getting to the results: what quantization is and what distillation is<span class="hx:absolute hx:-mt-20" id="before-getting-to-the-results-what-quantization-is-and-what-distillation-is"></span>
    <a href="#before-getting-to-the-results-what-quantization-is-and-what-distillation-is" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>These two concepts come up constantly in this discussion and they deserve a quick explanation.</p>
<p><strong>Quantization</strong> is the technique of compressing the model&rsquo;s weights so they take up less memory. A model trained in FP16 (16 bits per weight) can be quantized to Q8 (8 bits), Q4 (4 bits), Q3_K_M (a K-quant averaging a bit over 3 bits per weight, with the &ldquo;medium&rdquo; quality mix), and so on. Each step down roughly halves the size of the model on disk and in VRAM, at the cost of some loss of precision. Q8 is practically lossless. Q4 already loses something measurable. Q3 loses more. Q2 is the line where the model starts saying real nonsense. The rule of thumb is that for coding and multi-step reasoning, you want to stay at Q4 or higher. Q3_K_M is the minimum that still works for many models, and it&rsquo;s what fits a 27B on the 5090 with 128k context.</p>
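<p>The sizes follow directly from bits per weight times parameter count. A sketch for a 27B model (nominal bit widths; real GGUF files add a little overhead for scales and metadata, and Q3_K_M averages closer to 3.5 bits):</p>

```ruby
# Approximate weight size: params * bits_per_weight / 8 bytes.
def weights_gb(params, bits_per_weight)
  (params * bits_per_weight / 8 / 1e9).round(1)
end

params_27b = 27e9
{ "FP16" => 16, "Q8" => 8, "Q4" => 4, "Q3_K_M" => 3.5 }.each do |quant, bits|
  puts format("%-7s ~%4.1f GB", quant, weights_gb(params_27b, bits))
end
# FP16 ~54 GB, Q8 ~27 GB, Q4 ~13.5 GB, Q3_K_M ~11.8 GB
```

<p>Those last two figures line up with the ~27 GB (Q8) and ~12 GB (Q3_K_M) files from the distillation runs.</p>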
<p>The surprise from my test, and look, this goes against the consensus, is that quantization wasn&rsquo;t the bottleneck here. I ran the Qwen 3.5 27B Claude-distilled in two versions: Q8 on the AMD Strix (~27 GB of weights) and Q3_K_M on the 5090 (~12 GB of weights). Both hallucinated exactly the same fake RubyLLM APIs. Q3_K_M even produced a cleaner Gemfile. The model&rsquo;s limitation was in what those weights know, not in the precision they were compressed to.</p>
<p><strong>Distillation</strong> is the technique of training a smaller model (the &ldquo;student&rdquo;) to imitate the output or behavior of a larger model (the &ldquo;teacher&rdquo;). The classic version is logit distillation — the student learns to approximate the teacher&rsquo;s probability distributions. The modern, more popular version for coding agents is distillation of <strong>reasoning traces</strong>: you take chain-of-thought from the big model on real problems and train the smaller one to reproduce the same reasoning style.</p>
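<p>The classic objective, reduced to its core: the student minimizes the KL divergence from the teacher&rsquo;s next-token distribution. A toy sketch with a three-token vocabulary and made-up probabilities:</p>

```ruby
# KL(teacher || student): the training signal in logit distillation.
# Zero when the student matches the teacher exactly; positive otherwise.
def kl_divergence(teacher, student)
  teacher.zip(student).sum do |t, s|
    t.zero? ? 0.0 : t * Math.log(t / s)
  end
end

teacher = [0.7, 0.2, 0.1]  # teacher's next-token probabilities (toy)
student = [0.5, 0.3, 0.2]  # student's current probabilities (toy)

puts kl_divergence(teacher, student)  # positive -> push student toward teacher
puts kl_divergence(teacher, teacher)  # 0.0 -> nothing left to learn
```

<p>Note what the signal rewards: matching the teacher&rsquo;s output distribution, not knowing what the teacher knows. That distinction is about to matter.</p>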
<p>The hype of the moment is distilling Claude and GPT into open source models. The promise is that you can have &ldquo;Claude-at-home&rdquo; running locally. I wanted to test this, and that&rsquo;s why I added <a href="https://huggingface.co/Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled"target="_blank" rel="noopener">Jackrong&rsquo;s Qwen 3.5 27B distilled from Claude 4.6 Opus</a> to the benchmark. If any open source model was going to use RubyLLM correctly, this was the bet — after all, in the entire benchmark, Claude and GLM 5 are the only ones that get the API right.</p>
<h3>What the Claude-distilled learned (and what it didn&rsquo;t)<span class="hx:absolute hx:-mt-20" id="what-the-claude-distilled-learned-and-what-it-didnt"></span>
    <a href="#what-the-claude-distilled-learned-and-what-it-didnt" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>I ran the same distillation twice: once at Q8 on the AMD Strix (which blew through the 90-minute timeout), and once at Q3_K_M on the 5090 (completed in 12 minutes). Both produced the same elegant frustration.</p>
<p>The code that comes out looks like Claude. It has <code># frozen_string_literal: true</code> at the top of every file. It has a separate <code>Response</code> class as a value object with explicit attribute readers. It has a clear separation between service, controller and model. It has doc comments at the top of every file. It correctly comments out things like <code>active_record</code>, <code>active_job</code> and <code>action_mailer</code> in <code>application.rb</code>. It has defensive <code>case</code> statements trying multiple return formats. Stylistically, it&rsquo;s Claude.</p>
<p>Functionally, it&rsquo;s a complete RubyLLM hallucination. Look at the service generated by the 5090 run:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ruby" data-lang="ruby"><span class="line"><span class="cl"><span class="no">RubyLLM</span><span class="o">::</span><span class="no">Chat</span><span class="o">.</span><span class="n">new</span><span class="o">.</span><span class="n">with_model</span><span class="p">(</span><span class="vi">@model</span><span class="p">)</span> <span class="k">do</span> <span class="o">|</span><span class="n">chat</span><span class="o">|</span>
</span></span><span class="line"><span class="cl">  <span class="n">conversation_history</span><span class="o">.</span><span class="n">each</span> <span class="k">do</span> <span class="o">|</span><span class="n">msg</span><span class="o">|</span>
</span></span><span class="line"><span class="cl">    <span class="n">chat</span><span class="o">.</span><span class="n">add_message</span><span class="p">(</span><span class="ss">role</span><span class="p">:</span> <span class="ss">:user</span><span class="p">,</span> <span class="ss">content</span><span class="p">:</span> <span class="n">msg</span><span class="o">[</span><span class="ss">:content</span><span class="o">]</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="k">end</span>
</span></span><span class="line"><span class="cl">  <span class="n">response</span> <span class="o">=</span> <span class="n">chat</span><span class="o">.</span><span class="n">ask</span><span class="p">(</span><span class="n">message</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="no">Response</span><span class="o">.</span><span class="n">new</span><span class="p">(</span><span class="ss">content</span><span class="p">:</span> <span class="n">response</span><span class="o">.</span><span class="n">text</span><span class="p">,</span> <span class="ss">usage</span><span class="p">:</span> <span class="n">build_usage</span><span class="p">(</span><span class="n">response</span><span class="p">))</span>
</span></span><span class="line"><span class="cl"><span class="k">end</span></span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>Every primitive in this code is invented:</p>
<ul>
<li><code>RubyLLM::Chat.new</code> — the constructor isn&rsquo;t public, the correct entry is <code>RubyLLM.chat(model:)</code></li>
<li><code>.with_model(@model) do |chat| ... end</code> — there&rsquo;s no block API like that</li>
<li><code>chat.add_message(role:, content:)</code> — doesn&rsquo;t exist</li>
<li><code>response.text</code> — the real API exposes <code>response.content</code></li>
<li><code>response.usage.prompt_tokens</code> — the object doesn&rsquo;t have that shape</li>
</ul>
<p>This will blow up with a <code>NoMethodError</code> on the first request. The initializer also tries <code>config.openrouter_api_base=</code> which doesn&rsquo;t exist on <code>RubyLLM.configure</code>, so the app probably won&rsquo;t even boot.</p>
<p>The Q8 version on the AMD Strix does the exact same thing, with one difference: the entry call is <code>RubyLLM.chat(model:, provider: :openrouter)</code> — the entry point is right, but <code>provider:</code> is invented and it&rsquo;s immediately followed by the same fake <code>chat.add_message(role:, content:)</code>. Worse, the Gemfile from the 90-minute run lists <code>gem &quot;ruby-openai&quot;</code> (wrong gem!), <code>gem &quot;minitest&quot;, &quot;~&gt; 6.0&quot;</code> (minitest 6.0 doesn&rsquo;t exist) and <code>gem &quot;tailwindcss&quot;</code> (wrong gem name, it&rsquo;s <code>tailwindcss-rails</code>). The Gemfile doesn&rsquo;t include the gem the service code itself is trying to use.</p>
<p>For comparison, look at the actual Claude Opus 4.6 baseline, in the same benchmark, getting it all right:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ruby" data-lang="ruby"><span class="line"><span class="cl"><span class="vi">@chat</span> <span class="o">=</span> <span class="no">RubyLLM</span><span class="o">.</span><span class="n">chat</span><span class="p">(</span><span class="ss">model</span><span class="p">:</span> <span class="n">model_id</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">response</span> <span class="o">=</span> <span class="vi">@chat</span><span class="o">.</span><span class="n">ask</span><span class="p">(</span><span class="n">message</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">response</span><span class="o">.</span><span class="n">content</span></span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>Twelve lines in the entire service. Zero hallucination. Includes streaming via block. The distilled model produced three times the code volume and got the API wrong.</p>
<p>The honest reading is that distillation transferred one layer and stopped. The layer that came along was the style: code organization, comments, class structure, the order of things. The layer that got left behind was factual memory about specific libraries. That makes sense when you think about it: Claude&rsquo;s reasoning traces, even when written carefully, rarely contain repeated references to <code>chat.ask(msg).content</code> in some obscure Ruby gem. The student only learns what the teacher repeats, and Claude never had any reason to keep whispering &ldquo;use ask, not complete&rdquo; throughout its chains of thought. Library API knowledge is binary recall memory, the kind that&rsquo;s either in the weights or it isn&rsquo;t. Decomposing that into reasoning steps is impossible because it isn&rsquo;t reasoning, it&rsquo;s just raw memorization.</p>
<p>To wrap up the practical recommendation: if you need the model to actually use RubyLLM, or any less-popular library for that matter, Claude distillation won&rsquo;t save you. Use real Claude or GLM 5. The &ldquo;Claude-stand-ins&rdquo; in open source will fail the same way the Qwen base would, just with prettier handwriting.</p>
<h3>Coder vs General: the surprise of the &ldquo;for coding&rdquo; models<span class="hx:absolute hx:-mt-20" id="coder-vs-general-the-surprise-of-the-for-coding-models"></span>
    <a href="#coder-vs-general-the-surprise-of-the-for-coding-models" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Almost everyone&rsquo;s instinct is that models with &ldquo;Coder&rdquo; in the name are the best for programming. Makes sense, they were specifically fine-tuned on code. But in the benchmark, it was exactly the opposite.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Type</th>
          <th>Hardware</th>
          <th style="text-align: right">Time</th>
          <th>Result</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Qwen 3.5 35B-A3B</td>
          <td>General (MoE)</td>
          <td>5090</td>
          <td style="text-align: right">5 min</td>
          <td>Runs Rails, hallucinates <code>add_message</code>/<code>complete</code> (1-2 follow-ups fix it)</td>
      </tr>
      <tr>
          <td>Qwen 3 Coder 30B</td>
          <td>Coder</td>
          <td>5090</td>
          <td style="text-align: right">6 min</td>
          <td>Returned a hardcoded mock string instead of calling RubyLLM</td>
      </tr>
      <tr>
          <td>Qwen 2.5 Coder 32B</td>
          <td>Coder</td>
          <td>5090</td>
          <td style="text-align: right">timeout 90m</td>
          <td>Zero files, model froze</td>
      </tr>
      <tr>
          <td>Qwen 3 32B</td>
          <td>General</td>
          <td>5090</td>
          <td style="text-align: right">4 min</td>
          <td>Partial scaffold, errors</td>
      </tr>
      <tr>
          <td>Qwen 3.5 27B Claude-distilled</td>
          <td>General + distilled</td>
          <td>5090</td>
          <td style="text-align: right">12 min</td>
          <td>Runs Rails, hallucinates the entire API</td>
      </tr>
      <tr>
          <td>Qwen 3.5 27B Sushi Coder RL</td>
          <td>Coder (RL)</td>
          <td>5090</td>
          <td style="text-align: right">6 min</td>
          <td>Infrastructure failure, couldn&rsquo;t be tested</td>
      </tr>
  </tbody>
</table>
<p>Of the three dedicated Coders, two failed catastrophically (full timeout and hardcoded mock string) and one didn&rsquo;t even run properly because of an infra bug. Meanwhile, the Qwen 3.5 35B-A3B, the general model in the lineup (not the Coder variant), came closest to something usable: 5 minutes of execution, a recognizable Rails project, and a problem that&rsquo;s fixable in 1-2 prompts.</p>
<p>Qwen 3 Coder 30B is particularly disappointing. It never even got far enough to misuse the API; it simply didn&rsquo;t try. The controller it generated has this:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ruby" data-lang="ruby"><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">Api</span><span class="o">::</span><span class="no">V1</span><span class="o">::</span><span class="no">MessagesController</span> <span class="o">&lt;</span> <span class="no">ApplicationController</span>
</span></span><span class="line"><span class="cl">  <span class="k">def</span> <span class="nf">create</span>
</span></span><span class="line"><span class="cl">    <span class="n">render</span> <span class="ss">json</span><span class="p">:</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="ss">response</span><span class="p">:</span> <span class="s2">&#34;This is a mock response. In a real implementation, this would connect to RubyLLM with Claude Sonnet via OpenRouter.&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">  <span class="k">end</span>
</span></span><span class="line"><span class="cl"><span class="k">end</span></span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>The Gemfile lists <code>gem &quot;ruby_llm&quot;</code> but nothing imports it. The service layer is nonexistent. The model decided it was easier to return a fake string and call it a day. That&rsquo;s Tier 3 garbage in a way no correction prompt fixes — you have to tell it to start over.</p>
<p>Qwen 2.5 Coder 32B is even worse: 90 minutes running, zero files. The 1.8 MB <code>opencode-output.ndjson</code> shows the model spinning without managing to write anything. It probably got stuck in a planning loop without ever calling the write tools. Total slot waste.</p>
<p>Why did the &ldquo;Coder&rdquo; Qwens do so badly? My read is that their coding-specific fine-tuning leaned on isolated problems (Codeforces, LeetCode, short snippets), far from agentic flows with long-running tool calling. The general Qwen 3.5 35B-A3B has broader training and handles the orchestration part better. The popular intuition &ldquo;Coder = best for coding agent&rdquo; is wrong for this kind of task. The use case where Coders shine is &ldquo;complete an isolated function,&rdquo; which is exactly what they were trained for, and that&rsquo;s a tiny fraction of what a coding agent does day to day.</p>
<h3>The question I wanted to answer<span class="hx:absolute hx:-mt-20" id="the-question-i-wanted-to-answer"></span>
    <a href="#the-question-i-wanted-to-answer" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>It was this: running locally on the 5090, which Qwen model is worth the 1-2 correction prompts to deliver code that works?</p>
<p>The honest answer is: only Qwen 3.5 35B-A3B, and maybe the Claude-distilled if you don&rsquo;t mind spending 12 minutes more.</p>
<ul>
<li>Qwen 3.5 35B-A3B on the 5090: 5 minutes, correct entry point (<code>RubyLLM.chat(model:)</code>), errors on the subsequent calls. Realistic total until it works: in the 15-20 minute range with 1-2 follow-ups. Beats cloud OSS on cost.</li>
<li>Qwen 3.5 27B Claude-distilled on the 5090: 12 minutes, deeper hallucination (entry point is invented too). Realistic total: 25-30 minutes with 2-3 follow-ups. Still competes on cost, and loses on absolute time to the real Claude.</li>
<li>The others (Coder 30B, Coder 2.5 32B, 3 32B): don&rsquo;t pay back the correction time. Each one has a structural problem that calls for a full rewrite from scratch.</li>
</ul>
<p>For folks with hardware in this category who want to escape Anthropic vendor lock-in, it now works. It didn&rsquo;t work on the 5090 from last year, and forget about it on the Strix Halo. In 2026, on NVIDIA Blackwell, with the right model, it works. For folks with low-bandwidth hardware (LPDDR5x, DDR4, DDR5), it&rsquo;s still a waste of time: the wall-clock time alone sinks any plan to make it practical.</p>
<h2>The Deep Code Review: Sonnet vs GLM 5 vs Gemini vs Kimi vs MiniMax<span class="hx:absolute hx:-mt-20" id="the-deep-code-review-sonnet-vs-glm-5-vs-gemini-vs-kimi-vs-minimax"></span>
    <a href="#the-deep-code-review-sonnet-vs-glm-5-vs-gemini-vs-kimi-vs-minimax" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>The tables above measure structural completeness. But does the project work? I did detailed code review of the models that completed the benchmark.</p>
<p><strong>Claude Sonnet 4.6 — works and is the most complete.</strong> Synchronous responses via Turbo Stream. Chat history persisted in a session cookie with full replay of previous messages on every request. Correct LLM mocking in the tests with mocha (30 tests in 328 lines). LLM logic extracted into a separate <code>LlmChatService</code>. Views decomposed into 9 partials. Minor problems: duplicated model constant, leak in the auto-resize event listener. None are blockers. Of the generated projects, it&rsquo;s the closest to something you&rsquo;d actually put into production.</p>
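<p>As a side note on the mocking point: the Sonnet project used mocha, but the general shape is easy to show with the stdlib&rsquo;s <code>Minitest::Mock</code>. Everything below (the service internals, the fake response struct) is an assumed sketch; only the <code>LlmChatService</code> name and the <code>ask(...).content</code> call shape come from what&rsquo;s described here:</p>

```ruby
require "minitest/mock"

# Hypothetical service in the shape described above: wraps a chat client
# and returns the text of the model's reply.
class LlmChatService
  def initialize(chat)
    @chat = chat
  end

  def reply_to(message)
    @chat.ask(message).content
  end
end

# A canned response object standing in for the real API's return value.
FakeResponse = Struct.new(:content)

# The mock plays the role of the chat object: expect one `ask` call,
# return a canned response, never touch the network.
mock_chat = Minitest::Mock.new
mock_chat.expect(:ask, FakeResponse.new("Hello from the mock"), ["Hi"])

service = LlmChatService.new(mock_chat)
puts service.reply_to("Hi") # => "Hello from the mock"
mock_chat.verify            # raises if `ask` was never called
```

<p>The point of this pattern is that the test suite exercises the controller/service plumbing without ever hitting the network, which is exactly what separates the projects whose tests prove something from the ones whose tests pass vacuously.</p>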
<p><strong>GLM 5 — works, but it&rsquo;s the bare minimum.</strong> Uses the correct API (<code>RubyLLM.chat(model:)</code> then <code>.ask()</code>), does mocking with mocha in the tests. But the project is way leaner than Sonnet&rsquo;s: 21-line controller (vs Sonnet&rsquo;s 52), no service layer (LLM logic inline in the controller), no chat history persistence, every message handled in isolation. The first message works, but the app doesn&rsquo;t keep conversation context, so you can&rsquo;t have a multi-turn dialog. The tests exist (7 methods) but they&rsquo;re skeletal: <code>ruby_llm_test.rb</code> only checks that the module is loaded, <code>chat_flow_test.rb</code> is a copy of the controller test. The Dockerfile, on the other hand, is the best of the four: multi-stage, non-root, jemalloc. But as a chat app? It&rsquo;s more of a proof of concept than something functional. Funny detail: the README says &ldquo;Powered by Claude Sonnet 4&rdquo; instead of the model that actually generated the project.</p>
<p><strong>Gemini 3.1 Pro — the fastest, but trips on the API.</strong> Completed in 14 minutes, the fastest along with MiniMax. The Rails code itself is well written: uses <code>Rails.cache</code> with session ID and a 2-hour expiration to keep state (instead of a database), Turbo Streams nicely integrated, Stimulus controller for auto-scroll, and the Dockerfile ties GLM 5&rsquo;s for best in the group (multi-stage, non-root, jemalloc). The problem is the usual one: it uses <code>RubyLLM::Chat.new()</code> instead of <code>RubyLLM.chat()</code>, and calls <code>add_message()</code> which doesn&rsquo;t exist. The app boots, Docker runs, the health check passes, but the first chat message returns 500. The tests (5 methods) mock with a <code>FakeChat</code> that replicates the wrong signature, so they pass. It&rsquo;s frustrating because the rest of the code is the most &ldquo;Rails way&rdquo; of the non-Anthropic models. Fixing it would be 3 lines, but the benchmark measures what comes out the first time.</p>
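<p>That <code>FakeChat</code> failure mode is worth spelling out. Here&rsquo;s a self-contained sketch (the class names and bodies are invented for illustration, not taken from the generated project) of why a double that mirrors a nonexistent API lets tests go green while any object with the real <code>ask</code>-returns-a-response shape blows up immediately:</p>

```ruby
# The interface the generated code *thinks* exists, faithfully faked:
class FakeChat
  def add_message(role:, content:)
    @last = content
  end

  def complete
    "canned reply to #{@last}"
  end
end

# App code written against the fake's (nonexistent) interface:
def ask_llm(chat, text)
  chat.add_message(role: :user, content: text)
  chat.complete
end

# Against the fake, everything is green:
puts ask_llm(FakeChat.new, "hi") # => "canned reply to hi"

# Against an object exposing the real shape (`ask` returning a response
# with `content`), the very first call raises:
RealResponse = Struct.new(:content)
real_chat = Object.new
real_chat.define_singleton_method(:ask) { |_text| RealResponse.new("reply") }

begin
  ask_llm(real_chat, "hi")
rescue NoMethodError => e
  puts "boom: #{e.message}" # no `add_message` on the real interface
end
```

<p>The fake and the code under test agree with each other and disagree with reality, so the suite proves internal consistency and nothing else.</p>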
<p><strong>Kimi K2.5 — ambitious but broken.</strong> Tried the most sophisticated architecture: ActionCable streaming, configurable models, dual Dockerfiles, 37 tests in 374 lines. Problem: the streaming depends on ActionCable, which is commented out in <code>config/application.rb</code>. The <code>return unless defined?(ActionCable)</code> guard makes the method do nothing. The assistant never responds. The Stimulus controller has a scope bug: <code>submitTarget</code> references a button outside the controller&rsquo;s subtree. Thread-unsafe storage with a hash in a class variable. Kimi wrote more tests than any other model (37), but none of them mock the LLM calls — so the tests pass without proving any of the functionality works.</p>
<p><strong>Grok 4.20 — fast and wrong.</strong> It was the fastest in the entire benchmark: 8 minutes, 412 tok/s. Except it was fast because it cut corners. The prompt explicitly asked for the <code>ruby_llm</code> gem, and Grok ignored it. It went straight for <code>OpenAI::Client</code> from the <code>ruby-openai</code> gem pointing at the OpenRouter URL. Technically the first message comes back, so yeah, it &ldquo;works.&rdquo; But it&rsquo;s the same trick as Step 3.5 Flash and Qwen 3.5 122B: skip the part that was actually being tested. No history, 33-line controller calling the HTTP client by hand, two tests, no real mocks. It was fast because it did less than what was asked.</p>
<p><strong>MiniMax M2.7 — looks right, crashes.</strong> Calls <code>RubyLLM.chat(model: '...', messages: [...])</code> — that signature doesn&rsquo;t exist. No message persistence. Duplicated HTML (DOCTYPE inside the layout). Committed master.key. And the tests? They mock the wrong API, so they pass but they don&rsquo;t prove anything.</p>
<p>Code review summary:</p>
<table>
  <thead>
      <tr>
          <th>Aspect</th>
          <th style="text-align: center">Sonnet 4.6</th>
          <th style="text-align: center">GLM 5</th>
          <th style="text-align: center">Gemini 3.1 Pro</th>
          <th style="text-align: center">Kimi K2.5</th>
          <th style="text-align: center">MiniMax M2.7</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Correct API</td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">No</td>
          <td style="text-align: center">No</td>
          <td style="text-align: center">No</td>
      </tr>
      <tr>
          <td>Chat history</td>
          <td style="text-align: center">Session cookie</td>
          <td style="text-align: center">None</td>
          <td style="text-align: center">Rails.cache (2h)</td>
          <td style="text-align: center">Broken (ActionCable off)</td>
          <td style="text-align: center">None</td>
      </tr>
      <tr>
          <td>Service layer</td>
          <td style="text-align: center">LlmChatService</td>
          <td style="text-align: center">Inline in controller</td>
          <td style="text-align: center">LlmService</td>
          <td style="text-align: center">LlmService</td>
          <td style="text-align: center">ChatService (wrong API)</td>
      </tr>
      <tr>
          <td>Tests (methods)</td>
          <td style="text-align: center">30</td>
          <td style="text-align: center">7</td>
          <td style="text-align: center">5</td>
          <td style="text-align: center">37</td>
          <td style="text-align: center">12</td>
      </tr>
      <tr>
          <td>LLM mocking</td>
          <td style="text-align: center">Yes (mocha)</td>
          <td style="text-align: center">Yes (mocha)</td>
          <td style="text-align: center">FakeChat (wrong API)</td>
          <td style="text-align: center">No</td>
          <td style="text-align: center">Mocks wrong API</td>
      </tr>
      <tr>
          <td>Dockerfile</td>
          <td style="text-align: center">Multi-stage</td>
          <td style="text-align: center">Multi-stage + jemalloc</td>
          <td style="text-align: center">Multi-stage + jemalloc</td>
          <td style="text-align: center">Dual (dev/prod)</td>
          <td style="text-align: center">Single-stage</td>
      </tr>
      <tr>
          <td>Actually runs?</td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">Yes (no history)</td>
          <td style="text-align: center">No (500 in chat)</td>
          <td style="text-align: center">No</td>
          <td style="text-align: center">No</td>
      </tr>
  </tbody>
</table>
<h3>GLM 5 vs GLM 5.1: what changed<span class="hx:absolute hx:-mt-20" id="glm-5-vs-glm-51-what-changed"></span>
    <a href="#glm-5-vs-glm-51-what-changed" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>GLM 5 was one of the few models that spat out functional code on the first try, so it was obvious to test the new version. One important detail before the numbers: GLM 5 ran via OpenRouter, but GLM 5.1 wasn&rsquo;t there yet when I ran this test, so I used the Z.AI direct API. Different provider, different infra, different cache. The numbers below are for reference, not an exact comparison.</p>
<table>
  <thead>
      <tr>
          <th>Aspect</th>
          <th>GLM 5</th>
          <th>GLM 5.1</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Provider</td>
          <td>OpenRouter</td>
          <td>Z.AI direct</td>
      </tr>
      <tr>
          <td>Total time</td>
          <td>17m</td>
          <td>22m</td>
      </tr>
      <tr>
          <td>Tok/s (final phase)</td>
          <td>400</td>
          <td>167</td>
      </tr>
      <tr>
          <td>Effective new tokens</td>
          <td>1,138</td>
          <td>450</td>
      </tr>
      <tr>
          <td>Cache read</td>
          <td>58,240</td>
          <td>81,216</td>
      </tr>
      <tr>
          <td>Correct RubyLLM API</td>
          <td>Yes</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Test mocking</td>
          <td>Yes (mocha)</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Tests</td>
          <td>7</td>
          <td>24</td>
      </tr>
      <tr>
          <td>Chat history</td>
          <td>No</td>
          <td>Yes (in-memory)</td>
      </tr>
      <tr>
          <td>Service layer</td>
          <td>Inline in controller</td>
          <td><code>ChatSession</code> model with <code>add_user_message</code>/<code>add_assistant_message</code></td>
      </tr>
  </tbody>
</table>
<p>The GLM 5.1 project came out way more complete. 24 tests vs 7. Real separation between <code>ChatSession</code>, <code>ChatMessage</code> and the controller, instead of GLM 5 cramming everything inline. Chat history persisted in memory during the session, so you can actually have a real multi-turn conversation (GLM 5 treated every message like it was the first). And the RubyLLM API is still correct, the same <code>RubyLLM.chat(model:, provider:)</code> pattern followed by <code>c.user</code>/<code>c.assistant</code> to build the context. There&rsquo;s even a test covering the <code>MODEL</code> constant, which usually nobody does.</p>
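<p>To make the history mechanism concrete, here&rsquo;s a rough in-memory sketch. Only the <code>add_user_message</code>/<code>add_assistant_message</code> names and the idea of replaying context through <code>user</code>/<code>assistant</code> calls come from the project described above; the storage and everything else is assumed:</p>

```ruby
# In-memory chat session: accumulates turns and can replay them into a
# chat object before the next request, so every call carries the full
# multi-turn context instead of starting cold.
class ChatSession
  Message = Struct.new(:role, :content)

  def initialize
    @messages = []
  end

  def add_user_message(content)
    @messages << Message.new(:user, content)
  end

  def add_assistant_message(content)
    @messages << Message.new(:assistant, content)
  end

  # Feed the history into anything exposing `user`/`assistant`
  # context-building methods (as the RubyLLM chat object does).
  def replay_into(chat)
    @messages.each do |msg|
      msg.role == :user ? chat.user(msg.content) : chat.assistant(msg.content)
    end
    chat
  end

  def history
    @messages.map { |m| [m.role, m.content] }
  end
end
```

<p>A session like this is what lets the second message &ldquo;see&rdquo; the first; GLM 5 skipped it entirely, which is why every message there behaved like the start of a new conversation.</p>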
<p>The price was speed. 22 minutes vs 17, and throughput dropped from 400 to 167 tok/s. Could be the provider (Z.AI direct isn&rsquo;t the same infra as OpenRouter), could be a more loaded server during the run, could be that 5.1 reasons more. I didn&rsquo;t run it multiple times to take an average, so I won&rsquo;t say 5.1 is &ldquo;slower.&rdquo; A single run doesn&rsquo;t prove a regression. What I can say is that, in my test, 5.1 delivered a better-structured project and took a bit longer to do it.</p>
<p>For folks who want to get out from under Anthropic without losing quality, GLM 5 and GLM 5.1 are the two options that work. If you need centralized billing on OpenRouter, GLM 5. If you can use Z.AI direct and want a more rounded project on the first try, GLM 5.1.</p>
<h2>Costs: API vs Subscription<span class="hx:absolute hx:-mt-20" id="costs-api-vs-subscription"></span>
    <a href="#costs-api-vs-subscription" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>First, the per-token price of each model on OpenRouter:</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/05/llm-benchmark/en/token-pricing.png" alt="Per-token price on OpenRouter"  loading="lazy" /></p>
<p>GPT 5.4 Pro charges $180 per million output tokens. Claude Opus charges $25. GLM 5 charges $2.30. And Qwen 3.6 Plus is free (with a rate limit). The log scale on the chart hides some of the brutality of the gap: from free Qwen to GPT 5.4 Pro is orders of magnitude.</p>
<p>But per-token price isn&rsquo;t the whole story. If you use Claude or GPT daily for coding, the monthly subscription can come out way cheaper than paying per token via the API:</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/05/llm-benchmark/en/monthly-pricing.png" alt="Subscription vs API: how much it costs to use Claude and GPT per month"  loading="lazy" /></p>
<table>
  <thead>
      <tr>
          <th>Approach</th>
          <th style="text-align: right">Est. $/month*</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Qwen 3.6 Plus (OpenRouter)</td>
          <td style="text-align: right">$0</td>
          <td>Free but rate-limited</td>
      </tr>
      <tr>
          <td>Local models</td>
          <td style="text-align: right">Electricity</td>
          <td>Needs hardware</td>
      </tr>
      <tr>
          <td>Claude Pro</td>
          <td style="text-align: right">$20</td>
          <td>~44K tokens/5hr</td>
      </tr>
      <tr>
          <td>ChatGPT Plus</td>
          <td style="text-align: right">$20</td>
          <td>Includes Codex</td>
      </tr>
      <tr>
          <td>Claude Max 5x</td>
          <td style="text-align: right">$100</td>
          <td>~88K tokens/5hr</td>
      </tr>
      <tr>
          <td>Claude Sonnet (OpenRouter API)</td>
          <td style="text-align: right">~$150</td>
          <td>No cap, pay-as-you-go</td>
      </tr>
      <tr>
          <td>Claude Max 20x</td>
          <td style="text-align: right">$200</td>
          <td>~220K tokens/5hr</td>
      </tr>
      <tr>
          <td>ChatGPT Pro</td>
          <td style="text-align: right">$200</td>
          <td>GPT 5.4 Pro unlimited</td>
      </tr>
      <tr>
          <td>Claude Opus (OpenRouter API)</td>
          <td style="text-align: right">~$450</td>
          <td>No cap, pay-as-you-go</td>
      </tr>
      <tr>
          <td>GPT 5.4 Pro (OpenRouter API)</td>
          <td style="text-align: right">~$990</td>
          <td>Absurdly expensive</td>
      </tr>
  </tbody>
</table>
<p>*Estimate for moderate coding use (~15M input + ~3M output tokens/month).</p>
<p>The main point: if you use GPT 5.4 Pro, the ChatGPT Pro subscription at $200/month with unlimited use is 5x cheaper than paying per token on the API. For Claude, Pro at $20/month covers light use, but for heavy users (a coding marathon like mine), the Max 20x at $200/month comes out cheaper than paying for Opus per token on OpenRouter (~$450/month). The open source models on OpenRouter all sit below $2.50/M output tokens, but as we saw, most of them generate code that doesn&rsquo;t run.</p>
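<p>The arithmetic behind that is simple enough to sketch. Using only the per-million output prices quoted above and the footnote&rsquo;s ~3M output tokens/month (input pricing varies per model and is omitted here, so these are lower bounds, not the table&rsquo;s totals):</p>

```ruby
# Back-of-the-envelope output-token cost per month. Prices are the
# per-million-output-token figures quoted in the text; input tokens
# add to these numbers, so treat each result as a floor.
OUTPUT_PRICE_PER_M = {
  "GPT 5.4 Pro" => 180.0,
  "Claude Opus" => 25.0,
  "GLM 5"       => 2.30,
}

monthly_output_tokens_m = 3.0 # millions, per the footnote's estimate

OUTPUT_PRICE_PER_M.each do |model, price|
  cost = price * monthly_output_tokens_m
  puts format("%-12s $%.2f/month on output tokens alone", model, cost)
end
```

<p>Even on output tokens alone, GPT 5.4 Pro clears $540/month, already past the $200 ChatGPT Pro subscription; fold in the ~15M input tokens and you land in the neighborhood of the ~$990 estimate above.</p>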
<h2>What works for real use<span class="hx:absolute hx:-mt-20" id="what-works-for-real-use"></span>
    <a href="#what-works-for-real-use" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>After testing 33 models across both runs and looking at the generated code in detail:</p>
<p>Tier 1 (works plug and play):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Quality</th>
          <th style="text-align: right">Cost/Run</th>
          <th>Trade-off</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Claude Sonnet 4.6</td>
          <td>Better than Opus on opencode (30 vs 16 tests)</td>
          <td style="text-align: right">~$0.63</td>
          <td>Cheaper, but on Claude Code Opus might do better</td>
      </tr>
      <tr>
          <td>Claude Opus 4.6</td>
          <td>Gold standard</td>
          <td style="text-align: right">~$1.05</td>
          <td>Baseline</td>
      </tr>
      <tr>
          <td>GPT 5.4 Pro</td>
          <td>Practically equivalent to Opus</td>
          <td style="text-align: right">~$7.20*</td>
          <td>Failed the benchmark due to opencode incompatibility, but I tested extensively in Codex and it works as well as Opus</td>
      </tr>
      <tr>
          <td>GLM 5</td>
          <td>Good (7 tests, correct API)</td>
          <td style="text-align: right">~$0.11</td>
          <td>89% cheaper, non-Anthropic/OpenAI alternative that works</td>
      </tr>
      <tr>
          <td>GLM 5.1</td>
          <td>Good (24 tests, history, correct API)</td>
          <td style="text-align: right">~$0.13</td>
          <td>~88% cheaper, more complete project than GLM 5</td>
      </tr>
  </tbody>
</table>
<p>*GPT 5.4 Pro failed the automated benchmark because opencode doesn&rsquo;t support OpenAI&rsquo;s native tool calling format. Through Codex or ChatGPT Pro ($200/month with unlimited use), it works without problems.</p>
<p>Tier 2 (works with caveats):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th style="text-align: center">Hardware</th>
          <th style="text-align: right">Cost/Run</th>
          <th>Caveat</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Step 3.5 Flash</td>
          <td style="text-align: center">Cloud</td>
          <td style="text-align: right">~$0.02</td>
          <td>Bypasses the requested gem, slow (38m)</td>
      </tr>
      <tr>
          <td>Grok 4.20</td>
          <td style="text-align: center">Cloud</td>
          <td style="text-align: right">~$0.04</td>
          <td>Bypasses the requested gem (goes straight to <code>OpenAI::Client</code>), but it&rsquo;s the fastest in the benchmark</td>
      </tr>
      <tr>
          <td>Qwen 3.5 35B-A3B</td>
          <td style="text-align: center">NVIDIA 5090</td>
          <td style="text-align: right">Free</td>
          <td>Correct entry point, hallucinates <code>add_message</code>/<code>complete</code>. Fixable in 1-2 follow-ups. ~15-20 min total</td>
      </tr>
      <tr>
          <td>Qwen 3.5 27B Claude-distilled</td>
          <td style="text-align: center">NVIDIA 5090</td>
          <td style="text-align: right">Free</td>
          <td>Claude style, complete API hallucination. 2-3 follow-ups to fix. ~25-30 min total</td>
      </tr>
      <tr>
          <td>Qwen 3.5 35B (local)</td>
          <td style="text-align: center">AMD Strix</td>
          <td style="text-align: right">Free</td>
          <td>Works if default is configured, no mocking, and slow</td>
      </tr>
  </tbody>
</table>
<p>Tier 3 (broken code, easier to redo than to fix):</p>
<p>Kimi K2.5, MiniMax M2.7, DeepSeek V3.2, Gemini 3.1 Pro, Qwen 3 Coder Next (Strix), Qwen 3 Coder 30B (5090, returned a hardcoded mock string), Qwen 3.5 122B, Qwen 3.6 Plus — all of them either invent APIs that don&rsquo;t exist or don&rsquo;t even try to use the gem.</p>
<p>Tier 4 (didn&rsquo;t complete):</p>
<p>Gemma 4 (infinite loop on both hardware), Llama 4 Scout (no parser), GPT OSS 20B (wrong directory on Strix, parser regression on 5090), Qwen 3 32B (too slow on Strix, partial scaffold on 5090), Qwen 2.5 Coder 32B (90m timeout with zero files).</p>
<h3>Simplified ranking (quality, time, price)<span class="hx:absolute hx:-mt-20" id="simplified-ranking-quality-time-price"></span>
    <a href="#simplified-ranking-quality-time-price" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>For folks who only want the report-card summary. Quality is whether the code runs and how complete it is. Time is the total runtime. Price is the estimated cost per execution on opencode. <strong>Hardware</strong> indicates where the model ran — Cloud, Strix (AMD Strix Halo, LPDDR5x 256 GB/s) or 5090 (NVIDIA RTX 5090, GDDR7 1792 GB/s). Cloud models ran via OpenRouter or the provider&rsquo;s direct API.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th style="text-align: center">Type</th>
          <th style="text-align: center">Hardware</th>
          <th style="text-align: center">Quality</th>
          <th style="text-align: center">Time</th>
          <th style="text-align: center">Price</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Claude Sonnet 4.6</td>
          <td style="text-align: center">Commercial</td>
          <td style="text-align: center">Cloud</td>
          <td style="text-align: center">A+</td>
          <td style="text-align: center">A</td>
          <td style="text-align: center">C</td>
      </tr>
      <tr>
          <td>Claude Opus 4.6</td>
          <td style="text-align: center">Commercial</td>
          <td style="text-align: center">Cloud</td>
          <td style="text-align: center">A+</td>
          <td style="text-align: center">A</td>
          <td style="text-align: center">D</td>
      </tr>
      <tr>
          <td>GPT 5.4 Pro</td>
          <td style="text-align: center">Commercial</td>
          <td style="text-align: center">Cloud</td>
          <td style="text-align: center">A+</td>
          <td style="text-align: center">—</td>
          <td style="text-align: center">F</td>
      </tr>
      <tr>
          <td>GLM 5.1</td>
          <td style="text-align: center">OSS</td>
          <td style="text-align: center">Cloud</td>
          <td style="text-align: center">A</td>
          <td style="text-align: center">B</td>
          <td style="text-align: center">A</td>
      </tr>
      <tr>
          <td>GLM 5</td>
          <td style="text-align: center">OSS</td>
          <td style="text-align: center">Cloud</td>
          <td style="text-align: center">A−</td>
          <td style="text-align: center">A</td>
          <td style="text-align: center">A</td>
      </tr>
      <tr>
          <td>Qwen 3.5 35B-A3B</td>
          <td style="text-align: center">OSS</td>
          <td style="text-align: center">5090</td>
          <td style="text-align: center">B</td>
          <td style="text-align: center">A+</td>
          <td style="text-align: center">A+ (free)</td>
      </tr>
      <tr>
          <td>Qwen 3.5 27B Claude-distilled</td>
          <td style="text-align: center">OSS</td>
          <td style="text-align: center">5090</td>
          <td style="text-align: center">C+</td>
          <td style="text-align: center">B</td>
          <td style="text-align: center">A+ (free)</td>
      </tr>
      <tr>
          <td>Gemini 3.1 Pro</td>
          <td style="text-align: center">Commercial</td>
          <td style="text-align: center">Cloud</td>
          <td style="text-align: center">C</td>
          <td style="text-align: center">A+</td>
          <td style="text-align: center">B</td>
      </tr>
      <tr>
          <td>Grok 4.20</td>
          <td style="text-align: center">Commercial</td>
          <td style="text-align: center">Cloud</td>
          <td style="text-align: center">C−</td>
          <td style="text-align: center">A+</td>
          <td style="text-align: center">A+</td>
      </tr>
      <tr>
          <td>Step 3.5 Flash</td>
          <td style="text-align: center">Commercial</td>
          <td style="text-align: center">Cloud</td>
          <td style="text-align: center">C−</td>
          <td style="text-align: center">D</td>
          <td style="text-align: center">A+</td>
      </tr>
      <tr>
          <td>Qwen 3.5 35B-A3B</td>
          <td style="text-align: center">OSS</td>
          <td style="text-align: center">Strix</td>
          <td style="text-align: center">C</td>
          <td style="text-align: center">C</td>
          <td style="text-align: center">A+ (free)</td>
      </tr>
      <tr>
          <td>Qwen 3 Coder Next</td>
          <td style="text-align: center">OSS</td>
          <td style="text-align: center">Strix</td>
          <td style="text-align: center">D+</td>
          <td style="text-align: center">A</td>
          <td style="text-align: center">A+ (free)</td>
      </tr>
      <tr>
          <td>Qwen 3 32B</td>
          <td style="text-align: center">OSS</td>
          <td style="text-align: center">5090</td>
          <td style="text-align: center">D</td>
          <td style="text-align: center">A+</td>
          <td style="text-align: center">A+ (free)</td>
      </tr>
      <tr>
          <td>Qwen 3 Coder 30B</td>
          <td style="text-align: center">OSS</td>
          <td style="text-align: center">5090</td>
          <td style="text-align: center">D−</td>
          <td style="text-align: center">A+</td>
          <td style="text-align: center">A+ (free)</td>
      </tr>
      <tr>
          <td>Qwen 3.6 Plus</td>
          <td style="text-align: center">Commercial</td>
          <td style="text-align: center">Cloud</td>
          <td style="text-align: center">D</td>
          <td style="text-align: center">A</td>
          <td style="text-align: center">A+ (free)</td>
      </tr>
      <tr>
          <td>Kimi K2.5</td>
          <td style="text-align: center">OSS</td>
          <td style="text-align: center">Cloud</td>
          <td style="text-align: center">D</td>
          <td style="text-align: center">C</td>
          <td style="text-align: center">A</td>
      </tr>
      <tr>
          <td>MiniMax M2.7</td>
          <td style="text-align: center">OSS</td>
          <td style="text-align: center">Cloud</td>
          <td style="text-align: center">D</td>
          <td style="text-align: center">A</td>
          <td style="text-align: center">A+</td>
      </tr>
      <tr>
          <td>Qwen 3.5 122B</td>
          <td style="text-align: center">OSS</td>
          <td style="text-align: center">Strix</td>
          <td style="text-align: center">D</td>
          <td style="text-align: center">F</td>
          <td style="text-align: center">A+ (free)</td>
      </tr>
      <tr>
          <td>DeepSeek V3.2</td>
          <td style="text-align: center">OSS</td>
          <td style="text-align: center">Cloud</td>
          <td style="text-align: center">F</td>
          <td style="text-align: center">F</td>
          <td style="text-align: center">A+</td>
      </tr>
      <tr>
          <td>Qwen 2.5 Coder 32B</td>
          <td style="text-align: center">OSS</td>
          <td style="text-align: center">5090</td>
          <td style="text-align: center">F</td>
          <td style="text-align: center">F</td>
          <td style="text-align: center">A+ (free)</td>
      </tr>
      <tr>
          <td>Gemma 4 31B</td>
          <td style="text-align: center">OSS</td>
          <td style="text-align: center">5090</td>
          <td style="text-align: center">F</td>
          <td style="text-align: center">A+</td>
          <td style="text-align: center">A+ (free)</td>
      </tr>
      <tr>
          <td>Gemma 4 31B</td>
          <td style="text-align: center">OSS</td>
          <td style="text-align: center">Strix</td>
          <td style="text-align: center">F</td>
          <td style="text-align: center">—</td>
          <td style="text-align: center">A+ (free)</td>
      </tr>
      <tr>
          <td>GLM 4.7 Flash</td>
          <td style="text-align: center">OSS</td>
          <td style="text-align: center">Strix</td>
          <td style="text-align: center">F</td>
          <td style="text-align: center">—</td>
          <td style="text-align: center">A+ (free)</td>
      </tr>
      <tr>
          <td>Llama 4 Scout</td>
          <td style="text-align: center">OSS</td>
          <td style="text-align: center">Strix</td>
          <td style="text-align: center">F</td>
          <td style="text-align: center">—</td>
          <td style="text-align: center">A+ (free)</td>
      </tr>
      <tr>
          <td>GPT OSS 20B</td>
          <td style="text-align: center">OSS</td>
          <td style="text-align: center">Strix</td>
          <td style="text-align: center">F</td>
          <td style="text-align: center">—</td>
          <td style="text-align: center">A+ (free)</td>
      </tr>
      <tr>
          <td>Qwen 3 32B</td>
          <td style="text-align: center">OSS</td>
          <td style="text-align: center">Strix</td>
          <td style="text-align: center">F</td>
          <td style="text-align: center">—</td>
          <td style="text-align: center">A+ (free)</td>
      </tr>
  </tbody>
</table>
<p>Quality criteria: A+ works and the code is well structured. A/B works with small to medium caveats. C runs but skips a prompt requirement or has a serious structural issue. D breaks on the first message because of an invented API. F didn&rsquo;t complete the benchmark or produced garbage. GPT 5.4 Pro stays at A+ for real use in Codex, but didn&rsquo;t run in this benchmark, hence the dash in the time column. &ldquo;Type&rdquo; separates commercial models (closed weights) from OSS (open weights, even when used through a hosted API). Some Qwens appear twice when they ran on both hardware profiles, because the results are different enough to justify it — Qwen 3.5 35B-A3B on the 5090 jumps to Tier B, on the Strix it stays at Tier C because of the wait time. Of the 33 models configured across both runs, some don&rsquo;t appear in this table because they never even executed (no quota, broken runner, infra failure, or timeout before the first message).</p>
<h3>The verdict<span class="hx:absolute hx:-mt-20" id="the-verdict"></span>
    <a href="#the-verdict" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>If cost matters and you want to leave Anthropic: <strong>GLM 5 or GLM 5.1</strong> are the plug-and-play alternatives that work. Correct API, mocking in the tests, ~$0.11-$0.13 per run, ~88-89% cheaper than Opus. GLM 5.1 delivered a more complete project (24 tests, chat history) at the cost of about 5 more minutes.</p>
<p>If you want the best result regardless of cost: <strong>Claude Sonnet 4.6</strong> beat Opus in this benchmark — cheaper, same speed, more tests, code that works. But there are two important caveats before you generalize that conclusion.</p>
<p>First, this result is on opencode, not on Claude Code. In the native environment (Claude Code), where Opus and Sonnet have access to Anthropic&rsquo;s full tool support, Opus might do better. In my 500-hour Claude Code marathon, I used Opus and the experience was consistently good.</p>
<p>Second, and this is the bigger one: our test is a small, well-defined web app. Sonnet 4.6 and Opus 4.6 share the <a href="https://platform.claude.com/docs/en/about-claude/models/whats-new-claude-4-6"target="_blank" rel="noopener">same 1M token context window</a>, so what separates the two is the reasoning capacity they can apply inside that context. Opus 4.6 has a 128K max output token ceiling vs Sonnet&rsquo;s 64K, and its training was specifically aimed at long-horizon tasks, multi-step planning and deep reasoning over complex code. On a small project like ours, those muscles stay idle, and in that scenario it&rsquo;s either a tie or Sonnet wins by being faster. In larger projects, with weeks of work, big monorepos, architectural decisions that carry real consequences, that&rsquo;s where the actual difference between Opus and Sonnet shows up. You can&rsquo;t conclude that Sonnet is better than Opus in general just from this benchmark.</p>
<p>If you want to avoid total vendor lock-in and you have decent hardware: <strong>Qwen 3.5 35B-A3B</strong> running locally on an NVIDIA RTX 5090. Five minutes of execution at 273 tok/s, a Rails project that boots, and the API error fixes itself in 1-2 follow-ups. Realistic total until it works: ~15-20 minutes. Beats Sonnet on cost (zero) and lands close on total time. This option simply didn&rsquo;t exist in the previous round of the benchmark, and it marks the point where &ldquo;running OSS local&rdquo; stops being a toy and becomes a real alternative. Important: this is specific to hardware with high memory bandwidth. On an RTX 4090 it should work similarly. On a laptop with LPDDR5x or a desktop with DDR4, forget it — you&rsquo;ll wait 10x longer and the total time kills the argument.</p>
<p>If you want to avoid vendor lock-in but you&rsquo;re on weak hardware: <strong>GLM 5 or GLM 5.1</strong> remain the choice. They&rsquo;re cloud, true, but at $0.11-$0.13 per run it&rsquo;s basically the price of electricity.</p>
<p>If you want to test the &ldquo;Claude at home&rdquo; gamble via distillation: the <strong>Qwen 3.5 27B Claude-distilled</strong> is sitting there to play with, but I already warned you it hallucinates exactly the same fake APIs as the base Qwen. Distillation transferred Claude&rsquo;s style, not its factual knowledge about libraries. It&rsquo;s worth it as an experiment, not as production.</p>
<p>Yes, maybe with days of tweaking llama.cpp, calibrating flags, adjusting prompts, testing different builds, you could make Gemma 4 or other models work better. For most people, that isn&rsquo;t realistic. The distance between frontier models (Claude, GPT) and self-hosted open source models is real. It isn&rsquo;t marketing. The gap is shrinking, but it still exists, and the nature of it has changed: today what&rsquo;s missing in open source is factual knowledge about specific libraries, not raw reasoning capacity. Hardware stopped being the bottleneck, at least for anyone with a recent GPU.</p>
<p>In the end, what matters is whether the code runs. A model can generate 3,405 files, write 37 tests, produce a 181-line README, and the app still won&rsquo;t work because the API it uses doesn&rsquo;t exist. Completeness metrics and test counts are necessary but not sufficient. The only reliable signal is whether the model uses real APIs correctly.</p>
<p>The full benchmark, with code, configuration, prompts and per-model results, is on <a href="https://github.com/akitaonrails/llm-coding-benchmark"target="_blank" rel="noopener">GitHub</a>.</p>
]]></content:encoded><category>llm</category><category>benchmark</category><category>open-source</category><category>claude</category><category>ai</category><category>self-hosting</category></item><item><title>Turning YouTube into a Karaoke App | Frank Karaoke</title><link>https://www.akitaonrails.com/en/2026/04/05/turning-youtube-into-a-karaoke-app-frank-karaoke/</link><guid isPermaLink="true">https://www.akitaonrails.com/en/2026/04/05/turning-youtube-into-a-karaoke-app-frank-karaoke/</guid><pubDate>Sun, 05 Apr 2026 12:00:00 GMT</pubDate><description>&lt;p&gt;Project on GitHub: &lt;a href="https://github.com/akitaonrails/frank_karaoke"target="_blank" rel="noopener"&gt;github.com/akitaonrails/frank_karaoke&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/05/frank-karaoke/scoring-in-action.jpg" alt="Real-time scoring overlaid on a YouTube karaoke video" loading="lazy" /&gt;&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve always loved karaoke. I go out to sing with family or friends every now and then. In São Paulo there are good places in Liberdade and Bom Retiro, for instance, with private Japanese-style booths. If you&amp;rsquo;ve never been to a karaoke like that: you rent a private room by the hour, there&amp;rsquo;s a huge song catalog, two microphones, and a scoring system that grades your singing in real time. The best systems are Japanese, like &lt;a href="https://www.joysound.com/"target="_blank" rel="noopener"&gt;Joysound&lt;/a&gt; and &lt;a href="https://www.clubdam.com/"target="_blank" rel="noopener"&gt;DAM&lt;/a&gt;. A score above 90 (out of 100) is considered advanced. DAM, in the LIVE DAM Ai series, even uses AI to give scores that feel more &amp;ldquo;human.&amp;rdquo;&lt;/p&gt;</description><content:encoded><![CDATA[<p>Project on GitHub: <a href="https://github.com/akitaonrails/frank_karaoke"target="_blank" rel="noopener">github.com/akitaonrails/frank_karaoke</a></p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/05/frank-karaoke/scoring-in-action.jpg" alt="Real-time scoring overlaid on a YouTube karaoke video"  loading="lazy" /></p>
<p>I&rsquo;ve always loved karaoke. I go out to sing with family or friends every now and then. In São Paulo there are good places in Liberdade and Bom Retiro, for instance, with private Japanese-style booths. If you&rsquo;ve never been to a karaoke like that: you rent a private room by the hour, there&rsquo;s a huge song catalog, two microphones, and a scoring system that grades your singing in real time. The best systems are Japanese, like <a href="https://www.joysound.com/"target="_blank" rel="noopener">Joysound</a> and <a href="https://www.clubdam.com/"target="_blank" rel="noopener">DAM</a>. A score above 90 (out of 100) is considered advanced. DAM, in the LIVE DAM Ai series, even uses AI to give scores that feel more &ldquo;human.&rdquo;</p>
<p>But not every place has that level.</p>
<h2>The problem with karaoke in Brazil<span class="hx:absolute hx:-mt-20" id="the-problem-with-karaoke-in-brazil"></span>
    <a href="#the-problem-with-karaoke-in-brazil" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>In Brazil we grew up with <a href="https://www.videoland.com.br/wwwroot/historia.asp"target="_blank" rel="noopener">Videokê</a>, the brand the Korean Seok Ha Hwang brought to the country in 1996, importing equipment from Korea. It became a craze in the 90s and 2000s, showed up in every bar, barbecue and birthday party. The problem is that those machines stopped in time. The current models, like the VSK 5.0, ship with around 12-13 thousand songs in the catalog, which you expand by buying cartridges or song packs. In practice, the repertoire is old, the interface is straight out of the 2000s, and if the song you want to sing came out after 2015, good luck.</p>
<p>The workaround a lot of bars adopted was to allow Chromecast or screen mirroring so that customers can search for songs directly on YouTube. Makes sense: on YouTube you can find karaoke for any song. With-lyrics version, instrumental version, vocal guide version.</p>
<p>But there&rsquo;s a downgrade: you lose the scoring. One of the most fun parts of karaoke is the competition. Watching your score climb, comparing with friends, trying to beat the night&rsquo;s record. If you&rsquo;re just singing on top of a YouTube video, you get no feedback. It&rsquo;s like bowling without a scoreboard.</p>
<p>And buying a professional system for home? Importing a Joysound F1 runs north of US$ 2,000 just for the hardware, not counting the monthly catalog subscription. For casual use it makes no sense.</p>
<h2>The idea: YouTube with real-time scoring<span class="hx:absolute hx:-mt-20" id="the-idea-youtube-with-real-time-scoring"></span>
    <a href="#the-idea-youtube-with-real-time-scoring" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p><a href="https://github.com/akitaonrails/frank_karaoke"target="_blank" rel="noopener">Frank Karaoke</a> came out of that frustration. If YouTube already has every song, why not build an app that works as a YouTube wrapper with a real-time scoring overlay? You search for any karaoke video, sing along, and the app analyzes your voice through the mic and shows a live score.</p>
<p>It&rsquo;s a Flutter app for Android. Internally it loads YouTube into a webview and injects an overlay in HTML/CSS/JavaScript right into the page. The score display, the pitch trail, the settings panel, the mode selector — all of it rendered inside the webview through JS injection.</p>
<picture>
  <source media="(prefers-color-scheme: dark)" srcset="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/05/frank-karaoke/karaoke-full-dark.png">
  <img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/05/frank-karaoke/karaoke-full-light.png" alt="Frank Karaoke">
</picture>
<h2>Scoring without a reference<span class="hx:absolute hx:-mt-20" id="scoring-without-a-reference"></span>
    <a href="#scoring-without-a-reference" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Now, the real problem. Every professional karaoke system depends on prebuilt reference files for each song. Every single one.</p>
<p>Sony&rsquo;s <a href="https://en.wikipedia.org/wiki/SingStar"target="_blank" rel="noopener">SingStar</a>, which sold over 12 million copies between 2004 and the end of the PS3 era, had a hand-crafted note track for every song. Every note, every syllable, all mapped manually. The mechanism compared the singer&rsquo;s pitch via FFT against that reference in real time. A detail I thought was clever: octave was ignored. If the right note was a C, it didn&rsquo;t matter if you sang C3 or C4. Men sing women&rsquo;s songs no problem.</p>
<p>Joysound and DAM in Japan go further and evaluate three separate dimensions: pitch accuracy (音感), rhythm/timing (リズム感) and expressiveness/dynamic volume (表現力). All based on MIDI data from the operator&rsquo;s server. The open source equivalent format is UltraStar, where each song has a <code>.txt</code> file like:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code>: 12 4 5 Hel-    (NoteType StartBeat Duration Pitch Syllable)</code></pre></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p><code>Pitch 5</code> = MIDI 65 (F4). Scoring compares the singer&rsquo;s pitch against the note&rsquo;s pitch, modulo octave, with a tolerance of 1 semitone.</p>
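<p>To make the octave-agnostic comparison concrete, here is a minimal sketch in Python (my own illustrative names, not code from any of these systems):</p>

```python
def pitch_class_distance(sung_midi, target_midi):
    """Semitone distance between two MIDI notes, ignoring octave:
    C3 (48) and C4 (60) are both 0 semitones away from a target C."""
    diff = abs(sung_midi - target_midi) % 12
    return min(diff, 12 - diff)

def note_hit(sung_midi, target_midi, tolerance=1.0):
    """SingStar-style check: within 1 semitone of the reference, any octave."""
    return pitch_class_distance(sung_midi, target_midi) <= tolerance
```

<p>So singing C3 against a reference C4 still counts as a hit, while a note 6 semitones off misses in every octave.</p>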
<p>Frank Karaoke works with any YouTube video. There&rsquo;s no reference file. There&rsquo;s no MIDI. There&rsquo;s no melody annotation. Zero metadata about what note you&rsquo;re supposed to be singing.</p>
<p>I don&rsquo;t know anything about karaoke scoring. I don&rsquo;t know anything about audio processing, pitch detection, music theory applied to software. Nothing. So I asked Claude Code to do extensive research on the subject. What it brought back is documented in <a href="https://github.com/akitaonrails/frank_karaoke/blob/main/docs/scoring.md"target="_blank" rel="noopener"><code>docs/scoring.md</code></a> in the repository, and it&rsquo;s a lot: academic papers on singing evaluation (Nakano et al. 2006, Tsai &amp; Lee 2012, Molina et al. 2013), patents (Yamaha has one from 1999, US5889224A, that details MIDI-based scoring with 3 tolerance bands), and the source code of open source projects like UltraStar Deluxe, AllKaraoke, Vocaluxe and Nightingale.</p>
<p>The conclusion of the research: without a per-song reference, you have to evaluate vocal quality generically. Measure <em>how</em> the person is singing, not <em>what</em> they should be singing. And since no single metric works for every case, we decided to implement four different scoring modes, each measuring a different dimension of vocal quality.</p>
<h2>The phone microphone problem<span class="hx:absolute hx:-mt-20" id="the-phone-microphone-problem"></span>
    <a href="#the-phone-microphone-problem" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Before the scoring modes, I have to explain a more fundamental problem the research uncovered: the phone microphone.</p>
<p>When you sing karaoke with the phone, the mic picks up three things at once: your voice, the music coming out of the speaker, and ambient noise from the room. Your voice is physically closer to the mic, so it dominates the signal. But not enough for clean separation.</p>
<p>I tried several approaches to isolate the voice:</p>
<p>Spectral subtraction using YouTube&rsquo;s reference audio. Dropped it. The YouTube CDN blocks direct audio extraction by non-browser user-agents, and even with the reference audio in hand, the speaker&rsquo;s EQ, the room reverberation and the Bluetooth delay make the signal too different from what the mic captures. Naive subtraction produces artifacts worse than no subtraction at all.</p>
<p>Pre-emphasis + center clipping. Dropped that too. Center clipping destroys the waveform that the YIN algorithm needs for autocorrelation, and pre-emphasis amplifies noise as much as it amplifies voice.</p>
<p>What works is a 200-3500 Hz bandpass filter: a second-order IIR (Butterworth, Q=0.707) in cascade. The high-pass at 200 Hz kills bass, kick drum, bass guitar bleed from the speaker. The low-pass at 3500 Hz kills cymbals, hi-hats, high-frequency noise. Human voice fundamentals (85-300 Hz) and formants (300-3000 Hz) pass through the filter. It&rsquo;s not perfect isolation, but it improves the voice/music ratio enough for pitch detection.</p>
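<p>A sketch of that cascade in plain Python, using the standard audio-EQ-cookbook biquad coefficients (Q = 0.707 gives the Butterworth response). The app itself is Dart; this just reproduces the same math:</p>

```python
import math

def biquad_coeffs(kind, f0, fs, q=0.707):
    """Audio-EQ-cookbook biquad; Q = 0.707 yields a 2nd-order Butterworth."""
    w0 = 2 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2 * q)
    cos_w0 = math.cos(w0)
    if kind == "highpass":
        b = [(1 + cos_w0) / 2, -(1 + cos_w0), (1 + cos_w0) / 2]
    else:  # lowpass
        b = [(1 - cos_w0) / 2, 1 - cos_w0, (1 - cos_w0) / 2]
    a = [1 + alpha, -2 * cos_w0, 1 - alpha]
    # normalize so a0 == 1
    return [x / a[0] for x in b], [1.0, a[1] / a[0], a[2] / a[0]]

def biquad(samples, b, a):
    """Direct-form-I IIR filter over a list of samples."""
    out = []
    x1 = x2 = y1 = y2 = 0.0
    for x in samples:
        y = b[0] * x + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1, y2, y1 = x1, x, y1, y
        out.append(y)
    return out

def voice_bandpass(samples, fs=44100):
    """200 Hz high-pass then 3500 Hz low-pass, cascaded as in the article."""
    hb, ha = biquad_coeffs("highpass", 200, fs)
    lb, la = biquad_coeffs("lowpass", 3500, fs)
    return biquad(biquad(samples, hb, ha), lb, la)
```

<p>Feed it a 60 Hz hum and it comes out heavily attenuated; a 1 kHz tone in the vocal range passes nearly untouched.</p>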
<p>But the bandpass alone doesn&rsquo;t solve everything. Guitars, synths and piano produce periodic signals in the same frequency range as voice, and YIN detects pitch in them too. To deal with that, the app does adaptive calibration: in the first 5 seconds of warmup (when nobody&rsquo;s singing yet), it collects RMS samples from the signal to establish a baseline of the speaker&rsquo;s level. During the song, it keeps that baseline updated (25th percentile of the last ~4 seconds of frames). For a frame to be scored, the RMS has to be at least 1.3x above the baseline. Your voice is closer to the mic, so it pushes the RMS above the speaker&rsquo;s level. The instrumental melody stays near the baseline and gets filtered out. In testing, the original singer coming out of the speaker scored around 37 with sparse dots in the trail, while someone actually singing scored ~59 with dense dots.</p>
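<p>The gating logic itself is tiny. A sketch using the numbers above (the window size in frames and the class API are my assumptions; only the 25th percentile and the 1.3x ratio come from the article):</p>

```python
import math
from collections import deque

class VoiceGate:
    """Adaptive RMS gate: keep ~4 s of frame RMS values, treat the 25th
    percentile as the backing-track baseline, and score a frame only when
    its RMS rises at least 1.3x above that baseline."""

    def __init__(self, window_frames=120, ratio=1.3):
        self.history = deque(maxlen=window_frames)
        self.ratio = ratio

    def is_voice(self, frame):
        rms = math.sqrt(sum(s * s for s in frame) / len(frame))
        self.history.append(rms)
        baseline = sorted(self.history)[len(self.history) // 4]  # 25th pct
        return rms > self.ratio * max(baseline, 1e-6)
```

<p>During warmup only the speaker's level fills the history, so the baseline settles at the instrumental floor; a voice close to the mic then clears the 1.3x bar while the melody from the speaker does not.</p>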
<p>Another annoying detail: on Android, specifically on Samsungs, the DSP&rsquo;s <code>AutomaticGainControl</code> (AGC) attenuates the signal instead of amplifying it. On Galaxies, enabling AGC drops the mic peak from ~0.06 to ~0.003. Silence as far as pitch detection is concerned. So the app disables AGC, echo cancellation and noise suppression. When the peak falls below 0.01, it applies software gain (up to 30x) to bring the signal up to usable levels.</p>
<h2>The YIN algorithm<span class="hx:absolute hx:-mt-20" id="the-yin-algorithm"></span>
    <a href="#the-yin-algorithm" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>To detect the voice&rsquo;s pitch I use <a href="http://audition.ens.fr/adc/pdf/2002_JASA_YIN.pdf"target="_blank" rel="noopener">YIN</a>, by Alain de Cheveigné (IRCAM-CNRS) and Hideki Kawahara (Wakayama University). It&rsquo;s a fundamental frequency estimator in the time domain. The central idea is the Cumulative Mean Normalized Difference Function (CMNDF), which basically measures how periodic the signal is at each lag, normalizes it to reduce false positives, and uses parabolic interpolation to refine the result. It&rsquo;s lightweight enough to run in real time on a phone, which is what matters here.</p>
<p>In the app, the YIN threshold is 0.70 (tuned for mixed voice + music signals), and frames with confidence below 0.3 get discarded. Below that, it&rsquo;s probably noise or an instrument.</p>
<h2>The 4 scoring modes<span class="hx:absolute hx:-mt-20" id="the-4-scoring-modes"></span>
    <a href="#the-4-scoring-modes" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Each mode evaluates a different aspect of vocal quality. They all share the same audio pipeline (bandpass → YIN → confidence gate). The difference is how they interpret the detected pitch.</p>
<h3>Pitch Match<span class="hx:absolute hx:-mt-20" id="pitch-match"></span>
    <a href="#pitch-match" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Measures how cleanly you sustain notes. Uses Gaussian decay based on the standard deviation of MIDI values in a rolling ~15-frame window. Steady notes (deviation &lt; 0.3 semitones) score 85-100%. A trembling voice (deviation &gt; 2 semitones) scores near zero. Good for songs you already know well.</p>
<h3>Contour<span class="hx:absolute hx:-mt-20" id="contour"></span>
    <a href="#contour" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Measures the melodic shape of your singing. It doesn&rsquo;t matter which exact note you hit, only the direction and the flow. Evaluates the pitch range and melodic movements (jumps &gt; 0.5 semitone) in a rolling window. Monotone singing scores ~10%. Smooth melodic movement with a 2-6 semitone range scores 70-100%. Good for when you&rsquo;re learning a new song.</p>
<h3>Intervals<span class="hx:absolute hx:-mt-20" id="intervals"></span>
    <a href="#intervals" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Measures the musical quality of jumps between consecutive notes. A whole tone (2 semitones) scores highest. Thirds and fourths score well. Wild jumps of an octave or more score low. Uses a Gaussian curve centered on the whole tone. Works when you&rsquo;re singing in a different key from the original.</p>
<h3>Streak<span class="hx:absolute hx:-mt-20" id="streak"></span>
    <a href="#streak" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>It&rsquo;s Pitch Match with a combo multiplier. Each consecutive frame with a score above 0.4 increments the streak counter. The streak adds bonus points (up to +0.4 on a streak of 30+). Breaking a streak &gt; 5 frames pushes a 0.05 penalty into the EMA. Silence freezes the streak, so instrumental breaks don&rsquo;t hurt you. The most fun mode for parties.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/05/frank-karaoke/score-display-detail.jpg" alt="Score display detail: live score, overall score, pitch trail with note grid (C3-G5)"  loading="lazy" /></p>
<p>The logic behind these four modes came from the research Claude did across academic papers. Each one measures a different dimension: pitch accuracy, melodic contour, phrasing and consistency. None of them is sufficient on its own, but together they cover, reasonably well, what you can evaluate without having the song&rsquo;s reference melody.</p>
<h2>The Pitch Oracle<span class="hx:absolute hx:-mt-20" id="the-pitch-oracle"></span>
    <a href="#the-pitch-oracle" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Beyond the four purely vocal modes, the app has what I call the Pitch Oracle. The idea: instead of evaluating your voice in isolation, the app downloads the video&rsquo;s reference audio via <code>youtube_explode_dart</code>, decodes it to PCM, runs YIN on it, and builds a timestamped pitch timeline of the entire song. During scoring, if the mic&rsquo;s pitch matches the reference&rsquo;s pitch at that moment in the video, it&rsquo;s probably speaker bleed, and gets ignored. If it differs, it&rsquo;s your voice, and gets scored.</p>
<p>The synchronization works through the <code>currentTime</code> of the HTML5 video element, sent to Dart through a JS <code>timeupdate</code> listener every ~250ms. The oracle queries the reference pitch at the exact playback position, accounting for pause, seek and speed change.</p>
<p>The first time you play a song, the oracle takes 5-15 seconds to download and analyze the audio. But the timeline is saved as JSON in the app&rsquo;s local cache (<code>pitch_oracle/&lt;videoId&gt;.json</code>). If you play the same song again, it loads instantly from cache, no network request. That also fixes YouTube&rsquo;s rate limiting problem for the songs you sing the most.</p>
<p>With the oracle active, the modes change behavior. Pitch Match compares the singer&rsquo;s pitch class against the reference&rsquo;s, agnostic to octave (like SingStar). Contour uses cross-correlation between the singer&rsquo;s pitch movement and the reference&rsquo;s. Intervals compares semitone jumps against the reference&rsquo;s.</p>
<p>When YouTube blocks the download with rate limiting (happens after many consecutive requests from the same IP, clears in 15-30 minutes), the oracle silently fails and the modes fall back to purely vocal analysis.</p>
<h2>The road to here<span class="hx:absolute hx:-mt-20" id="the-road-to-here"></span>
    <a href="#the-road-to-here" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>The app you see now went through a lot of iteration before reaching this state.</p>
<p>First, I tried to make a Linux desktop version to make debugging easier. Makes sense, right? Test on the desktop, iterate fast, then port to mobile. The problem is that Flutter has no webview backend for Linux desktop. <code>webview_flutter</code> simply doesn&rsquo;t work. I tried <code>webview_cef</code>, which is based on the Chromium Embedded Framework. CEF spawns its own GPU process, and on Hyprland (a Wayland compositor based on wlroots) that conflicts with the compositor&rsquo;s render pipeline. On my NVIDIA setup, the entire Hyprland session froze. Locked screen, no keyboard response, I had to kill it from a TTY. On top of that, CEF requires downloading a ~200MB binary on the first build. I gave up on CEF and wrote a native bridge in C++ with Claude using WebKitGTK and Flutter method channels. It worked, but every YouTube quirk required separate code for Linux and Android. <code>just_audio</code> also has no Linux desktop implementation. The Linux version turned into dead weight. I deleted ~1,500 lines of Linux-specific code and focused only on Android.</p>
<p>Then came the Samsung mic saga. On my Galaxy Z Fold, the mic was capturing an absurdly low signal. Peaks of ~0.005, basically silence as far as pitch detection was concerned. I spent two hours trying to figure it out. I lowered thresholds, raised software gain to 50x, disabled audio preprocessors. Nothing was working right. Until I figured out the real problem: Android&rsquo;s <code>AutomaticGainControl</code>. The name says &ldquo;automatic gain control,&rdquo; which suggests it <em>amplifies</em> weak signals. In the Samsung DSP implementation, it does the opposite. It <em>attenuates</em> the signal to a low reference level, optimized for voice calls. With AGC on, the peak dropped from ~0.06 to ~0.003. Disabling AGC fixed it. But then the <code>audio_session</code> package was re-enabling AGC under the hood. I removed that one too. It was three rounds of fixes, each finding one more layer of the problem.</p>
<p>And the scoring. The scoring took longer than everything else combined. The first implementation used a cumulative average, which kept the score stuck at one value and never responded to live singing. I switched to a rolling window. Then the score was stuck at ~50% because of a bug in the primary score weight. I fixed it, and it started showing 70% even with nobody singing. Fixed it again. Streak mode wasn&rsquo;t resetting properly during silence. The chromatic snap was giving high scores for anything. The pitch history wasn&rsquo;t being cleared on silence gaps and the modes were going stagnant. Every fix revealed another bug. It took more than 25 commits just on the scoring, from the first prototype to the current state.</p>
<p>The result isn&rsquo;t perfect. I know. But it works well enough to be fun, which was the goal from the start.</p>
<h2>Settings<span class="hx:absolute hx:-mt-20" id="settings"></span>
    <a href="#settings" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/05/frank-karaoke/settings-panel.jpg" alt="Settings panel: mic presets, pitch shift, calibration"  loading="lazy" /></p>
<p>The settings panel lives behind the gear icon on the overlay. There are three mic presets for different environments (clean external mic, normal room, loud party), each adjusting confidence and amplitude thresholds. There&rsquo;s a pitch shift for when the song is too high for your vocal range. The shift moves both the video audio and the scoring at the same time: it uses the HTML5 media element&rsquo;s <code>playbackRate</code> with <code>preservesPitch=false</code>, so +2 semitones speeds the audio up to 1.12x (pitch goes up) and -2 semitones slows it down to 0.89x (pitch goes down). The scoring compensates for the offset, so you sing in your comfortable range and the system grades you correctly. There&rsquo;s mic calibration, a 3-second process that measures the room noise and adapts the thresholds. And there&rsquo;s a restart to reset the score without reloading the video.</p>
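<p>The semitone math is standard equal temperament: each semitone multiplies frequency by the twelfth root of two. A quick Python check of the numbers above (the function name is mine, not the app&rsquo;s):</p>

```python
def playback_rate(semitones: float) -> float:
    """Equal temperament: each semitone scales frequency by 2^(1/12)."""
    return 2 ** (semitones / 12)

print(f"{playback_rate(+2):.2f}")  # 1.12
print(f"{playback_rate(-2):.2f}")  # 0.89
```

<p>With <code>preservesPitch=false</code>, speed and pitch are coupled, so one rate change shifts both; the trade-off is that a +2 semitone shift also makes the song ~12% faster.</p>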
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/05/frank-karaoke/scoring-modes-selector.jpg" alt="Scoring mode selector"  loading="lazy" /></p>
<p>To switch scoring modes, tap the score box during playback.</p>
<h2>Usage flow<span class="hx:absolute hx:-mt-20" id="usage-flow"></span>
    <a href="#usage-flow" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><ol>
<li>Open the app. YouTube loads inside the app with the Frank Karaoke logo.</li>
<li>Search for a karaoke video. Any video works, but instrumental tracks with on-screen lyrics give better results.</li>
<li>The video pauses briefly to initialize the mic, download the song&rsquo;s data for the pitch oracle, and prepare the overlay. The first time with a new song this takes 5-15 seconds. If you&rsquo;ve played it before, it loads from cache instantly.</li>
<li>Sing. The &ldquo;live&rdquo; score reflects your current performance (exponential moving average with alpha 0.15, ~1 second response). The &ldquo;overall&rdquo; score is the cumulative average of the entire song.</li>
<li>When the video pauses, scoring pauses with it (so it doesn&rsquo;t score ambient noise). If you seek, the score resets and gets a 5-second warmup.</li>
</ol>
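<p>The two scores in step 4 come down to two classic running statistics. A minimal Python sketch, with names that are mine rather than the app&rsquo;s:</p>

```python
class ScoreTracker:
    """'live' = exponential moving average (alpha=0.15),
    'overall' = cumulative mean over the whole song."""

    def __init__(self, alpha: float = 0.15):
        self.alpha = alpha
        self.live = 0.0
        self.total = 0.0
        self.count = 0

    def update(self, frame_score: float) -> None:
        # EMA: recent frames dominate, old ones decay geometrically.
        self.live = self.alpha * frame_score + (1 - self.alpha) * self.live
        self.total += frame_score
        self.count += 1

    @property
    def overall(self) -> float:
        return self.total / self.count if self.count else 0.0
```

<p>The EMA is what makes the live score respond within about a second: each new frame carries 15% of the weight, so a sustained change shows up quickly while one noisy frame barely moves the needle.</p>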
<h2>How to install<span class="hx:absolute hx:-mt-20" id="how-to-install"></span>
    <a href="#how-to-install" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>The app isn&rsquo;t on the Play Store yet; I&rsquo;m waiting for Google to verify my developer identity. It should show up there in the next few days. In the meantime, it&rsquo;s an open project and you can install it directly.</p>
<p>The easiest way is to download the signed APK directly from the <a href="https://github.com/akitaonrails/frank_karaoke/releases" target="_blank" rel="noopener">GitHub releases page</a>. On your Android phone or tablet, download <code>FrankKaraoke-0.2.0-android.apk</code>, open it and tap Install. If Android complains about &ldquo;unknown sources,&rdquo; allow installs from your browser under Settings &gt; Security. On the first run the app will ask for mic permission. Then go into settings (the gear icon) and calibrate the mic before singing; it takes three seconds.</p>
<p>If you want to compile from source or contribute, the repository is on <a href="https://github.com/akitaonrails/frank_karaoke"target="_blank" rel="noopener">GitHub</a>. You&rsquo;ll need Flutter SDK 3.10+, Android SDK API 24+, and a physical device for mic testing (an emulator doesn&rsquo;t give representative results).</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">git clone https://github.com/akitaonrails/frank_karaoke.git
</span></span><span class="line"><span class="cl"><span class="nb">cd</span> frank_karaoke
</span></span><span class="line"><span class="cl">flutter pub get
</span></span><span class="line"><span class="cl">flutter run -d &lt;device_id&gt;</span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>The README has the rest.</p>
<p>Stack: Flutter + Riverpod for state management, <code>webview_flutter</code> for YouTube, <code>youtube_explode_dart</code> for audio extraction, <code>record</code> for PCM mic capture, <code>audio_decoder</code> for reference decoding via Android MediaCodec, and the YIN algorithm implemented in pure Dart.</p>
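<p>Since YIN comes up: the core of the algorithm fits in a few lines. Here&rsquo;s a simplified Python sketch of the three main steps (the difference function, the cumulative mean normalized difference, and the absolute-threshold search); real implementations typically add refinements like parabolic interpolation that I&rsquo;m omitting:</p>

```python
import math

def yin_pitch(samples, sample_rate, threshold=0.15):
    """Minimal YIN sketch. Returns an f0 estimate in Hz, or None
    for unvoiced input (silence/noise). Illustrative only."""
    n = len(samples)
    max_tau = n // 2
    win = n - max_tau  # fixed comparison window

    # Step 1: difference function d(tau)
    d = [0.0] * max_tau
    for tau in range(1, max_tau):
        acc = 0.0
        for j in range(win):
            delta = samples[j] - samples[j + tau]
            acc += delta * delta
        d[tau] = acc

    # Step 2: cumulative mean normalized difference d'(tau)
    nd = [1.0] * max_tau
    running = 0.0
    for tau in range(1, max_tau):
        running += d[tau]
        nd[tau] = d[tau] * tau / running if running > 0 else 1.0

    # Step 3: first lag below the threshold, refined to the local minimum
    for tau in range(2, max_tau):
        if nd[tau] < threshold:
            while tau + 1 < max_tau and nd[tau + 1] < nd[tau]:
                tau += 1
            return sample_rate / tau
    return None

# 440 Hz sine sampled at 8 kHz
sr = 8000
frame = [math.sin(2 * math.pi * 440 * t / sr) for t in range(1024)]
f0 = yin_pitch(frame, sr)  # close to 440; integer lags quantize the estimate
```

<p>The normalization in step 2 is what makes YIN robust at low lags, where the raw difference function is always small; the threshold search then picks the first plausible period instead of the globally deepest dip, which avoids octave errors.</p>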
<p>The technical documentation for the scoring system is in <a href="https://github.com/akitaonrails/frank_karaoke/blob/main/docs/scoring.md"target="_blank" rel="noopener"><code>docs/scoring.md</code></a> in the repository. It covers how SingStar, Joysound and DAM work, the academic papers, the pitch oracle architecture, the voice isolation problems on Android, and the roadmap.</p>
<h2>The scoring is experimental<span class="hx:absolute hx:-mt-20" id="the-scoring-is-experimental"></span>
    <a href="#the-scoring-is-experimental" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>I have to be straight: the scoring system is experimental. Without per-song reference files, the evaluation is approximate. The app measures whether you&rsquo;re in tune, whether you follow a melodic contour, whether your intervals are musical, whether you&rsquo;re consistent. But it doesn&rsquo;t tell you whether you&rsquo;re singing the correct melody for this specific song (unless the pitch oracle manages to download the audio, and that doesn&rsquo;t always work).</p>
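<p>To make &ldquo;whether you&rsquo;re in tune&rdquo; concrete: without a reference melody, a key-agnostic check is the distance, in cents, from the nearest chromatic semitone. A small Python sketch of the idea (not the app&rsquo;s exact code):</p>

```python
import math

def cents_off(freq_hz: float, a4: float = 440.0) -> float:
    """Signed distance, in cents, from the nearest equal-tempered
    semitone. 0 means dead on some chromatic note; +/-50 means
    exactly between two notes."""
    semitones = 12 * math.log2(freq_hz / a4)
    return (semitones - round(semitones)) * 100

print(round(cents_off(440.0)))  # 0: A4 itself
print(round(cents_off(450.0)))  # 39: noticeably sharp of A4
```

<p>This is why the chromatic snap can reward the wrong melody: any frequency near <em>some</em> semitone scores well, whether or not it&rsquo;s the note the song actually calls for.</p>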
<p>If you have experience with audio processing, pitch detection, or music evaluation, the repository is open and the research documentation in <a href="https://github.com/akitaonrails/frank_karaoke/blob/main/docs/scoring.md" target="_blank" rel="noopener"><code>docs/scoring.md</code></a> details what was tried, what works and what doesn&rsquo;t. In particular: tuning the modes&rsquo; thresholds, improving voice isolation, and integrating with <a href="https://github.com/rakuri255/UltraSinger" target="_blank" rel="noopener">UltraSinger</a> (which generates reference files from songs using Demucs + basic-pitch + WhisperX) are areas where contributions from people who know the subject would make a real difference. I&rsquo;d appreciate any help from specialists on calibrating these systems.</p>
<p>Oh, and the name. Frank Karaoke. It&rsquo;s a tribute to Sinatra. Who else?</p>
<p>Project on GitHub: <a href="https://github.com/akitaonrails/frank_karaoke"target="_blank" rel="noopener">github.com/akitaonrails/frank_karaoke</a></p>
]]></content:encoded><category>flutter</category><category>android</category><category>karaoke</category><category>audio</category><category>pitch-detection</category><category>open-source</category></item><item><title>Bitcoin on the Home Server: Sovereignty and Privacy with Coldcard, Sparrow and Fulcrum</title><link>https://www.akitaonrails.com/en/2026/04/01/bitcoin-on-the-home-server-sovereignty-with-coldcard-sparrow-fulcrum/</link><guid isPermaLink="true">https://www.akitaonrails.com/en/2026/04/01/bitcoin-on-the-home-server-sovereignty-with-coldcard-sparrow-fulcrum/</guid><pubDate>Wed, 01 Apr 2026 19:00:00 GMT</pubDate><description>&lt;p&gt;This post is a direct follow-up to my recent articles about the &lt;a href="https://www.akitaonrails.com/en/2026/03/31/migrating-my-home-server-with-claude-code/"&gt;new home server with openSUSE MicroOS&lt;/a&gt; and the &lt;a href="https://www.akitaonrails.com/en/2026/03/31/minisforum-ms-s1-max-amd-ai-max-395-review/"&gt;Minisforum MS-S1 Max&lt;/a&gt;. Those covered the foundation. Here I want to show one concrete use for it: putting together a decent Bitcoin stack at home, focused on privacy, operational sovereignty and safe transactions on my side.&lt;/p&gt;
&lt;p&gt;First things first: this isn&amp;rsquo;t an evangelism piece or a day-trading pitch. Quite the opposite. As I write this, on April 1, 2026, Bitcoin is around US$ 68k and close to R$ 391k, below the 2025 peaks. Plenty of people look at that and either panic or start fantasizing about leveraged trades. I think both reactions are wrong. There&amp;rsquo;s a &amp;ldquo;super cycle&amp;rdquo; thesis floating around based on institutional demand, spot ETFs and the lagged halving effect. Maybe. Maybe not. What I do know is that short-term candles don&amp;rsquo;t change the part I actually care about: infrastructure. If you need leverage to &amp;ldquo;speed up your gains,&amp;rdquo; you&amp;rsquo;re probably just speeding up your chances of getting liquidated.&lt;/p&gt;</description><content:encoded><![CDATA[<p>This post is a direct follow-up to my recent articles about the <a href="/en/2026/03/31/migrating-my-home-server-with-claude-code/">new home server with openSUSE MicroOS</a> and the <a href="/en/2026/03/31/minisforum-ms-s1-max-amd-ai-max-395-review/">Minisforum MS-S1 Max</a>. Those covered the foundation. Here I want to show one concrete use for it: putting together a decent Bitcoin stack at home, focused on privacy, operational sovereignty and safe transactions on my side.</p>
<p>First things first: this isn&rsquo;t an evangelism piece or a day-trading pitch. Quite the opposite. As I write this, on April 1, 2026, Bitcoin is around US$ 68k and close to R$ 391k, below the 2025 peaks. Plenty of people look at that and either panic or start fantasizing about leveraged trades. I think both reactions are wrong. There&rsquo;s a &ldquo;super cycle&rdquo; thesis floating around based on institutional demand, spot ETFs and the lagged halving effect. Maybe. Maybe not. What I do know is that short-term candles don&rsquo;t change the part I actually care about: infrastructure. If you need leverage to &ldquo;speed up your gains,&rdquo; you&rsquo;re probably just speeding up your chances of getting liquidated.</p>
<p>For me, the useful question isn&rsquo;t &ldquo;is it going up tomorrow?&rdquo; The useful question is: &ldquo;if I want to store and move Bitcoin without outsourcing everything to an exchange, a web wallet and a public API, how do I set that up properly at home?&rdquo;</p>
<h2>The real problem: too much convenience costs too much privacy<span class="hx:absolute hx:-mt-20" id="the-real-problem-too-much-convenience-costs-too-much-privacy"></span>
    <a href="#the-real-problem-too-much-convenience-costs-too-much-privacy" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Most people&rsquo;s default flow is simple: buy on an exchange, leave the balance sitting there, or install some random wallet on the phone and call it done. It works. It also concentrates risk and leaks metadata everywhere.</p>
<p>If you leave a balance on an exchange, you have custody risk. If you use a desktop wallet pointed at a public server, you have privacy risk. If you use a hardware wallet casually, bought second-hand on Mercado Livre, you have supply chain risk. Mix all of that with haste, and it gets worse.</p>
<p>That&rsquo;s why I ended up at a combination that, for someone technical who wants to run their own infra, feels pretty solid:</p>
<ul>
<li>Coldcard for cold storage</li>
<li>Sparrow Wallet on Linux as the desktop wallet and transaction coordinator</li>
<li>Fulcrum on the home server as a private Electrum server</li>
<li>bitcoind on the same server as a real full node, validating the chain and broadcasting without depending on third parties</li>
</ul>
<p>It&rsquo;s not the easiest path. But that&rsquo;s exactly the point. Real security rarely comes from the easiest path.</p>
<h2>The concepts that confuse beginners<span class="hx:absolute hx:-mt-20" id="the-concepts-that-confuse-beginners"></span>
    <a href="#the-concepts-that-confuse-beginners" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Before getting into the stack, it&rsquo;s worth aligning on five terms that usually get tossed around like everyone already knows them:</p>
<table>
  <thead>
      <tr>
          <th>Concept</th>
          <th>What it is</th>
          <th>Why it matters</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Airgap</td>
          <td>A device that never touches the internet, not even over data USB</td>
          <td>Reduces the signer&rsquo;s attack surface</td>
      </tr>
      <tr>
          <td>PSBT</td>
          <td>Partially Signed Bitcoin Transaction</td>
          <td>Standard format for preparing, signing and finalizing transactions in stages</td>
      </tr>
      <tr>
          <td>Watch-only wallet</td>
          <td>A wallet that sees balances/addresses but doesn&rsquo;t hold a private key</td>
          <td>Great for the desktop: it observes and assembles the transaction, but doesn&rsquo;t sign</td>
      </tr>
      <tr>
          <td>Full node</td>
          <td>A node that validates blocks and protocol rules locally</td>
          <td>You don&rsquo;t have to &ldquo;trust&rdquo; anyone&rsquo;s API</td>
      </tr>
      <tr>
          <td>Electrum server</td>
          <td>An indexing layer that quickly answers wallet queries</td>
          <td>Without one, desktop wallets end up dependent on public servers</td>
      </tr>
  </tbody>
</table>
<p>In plain language, the flow looks like this:</p>
<ol>
<li>Sparrow, on the desktop, builds the transaction.</li>
<li>That transaction becomes a PSBT.</li>
<li>The PSBT goes to the Coldcard via microSD.</li>
<li>The Coldcard signs it offline.</li>
<li>The signed file goes back to Sparrow.</li>
<li>Sparrow broadcasts through your own server, not through someone else&rsquo;s public infrastructure.</li>
</ol>
<p>That&rsquo;s what people mean by &ldquo;airgapped workflow.&rdquo; It&rsquo;s not magic. It&rsquo;s just disciplined separation of roles.</p>
<h2>Coldcard: cold signer, offline, the right kind of annoying<span class="hx:absolute hx:-mt-20" id="coldcard-cold-signer-offline-the-right-kind-of-annoying"></span>
    <a href="#coldcard-cold-signer-offline-the-right-kind-of-annoying" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>I use <a href="https://coldcard.com/"target="_blank" rel="noopener">Coldcard</a> as cold storage. The reason is simple: it was designed from day one as a Bitcoin-only device, with a heavy focus on airgapped operation through microSD. That alone eliminates an entire category of &ldquo;conveniences&rdquo; that many people find practical, but that I&rsquo;d rather not have anywhere near my keys.</p>
<p><a href="https://coldcard.com/mk4"target="_blank" rel="noopener"><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/bitcoin-sovereignty/coldcard-mk4-official.png" alt="Coldcard Mk4 on the official Coinkite site"  loading="lazy" /></a></p>
<p>In practice, the Coldcard holds the most important part of the system: the private key. It doesn&rsquo;t need to know about a server, Electrum, public API, exchange, or any of that. Its job is one thing: sign transactions offline.</p>
<p>That decoupling is great for two reasons:</p>
<ul>
<li>The desktop can be convenient without becoming a single point of failure.</li>
<li>The signer stays isolated even if your main machine has problems.</li>
</ul>
<p>And here&rsquo;s a warning I really want to put in mental all-caps:</p>
<p><strong>Never buy a hardware wallet second-hand. Ever.</strong></p>
<p>This isn&rsquo;t an exaggeration. You have no way to actually know what happened to that device before it reached your hands. It could have a pre-generated seed, tampered firmware, swapped components, a repackaged box, a compromised supply chain, or simply some dumb trick waiting for you to let your guard down. Hardware wallets are one of those categories where saving R$ 300 by buying used is insanity. Always buy from the manufacturer&rsquo;s official site or from a reseller officially authorized by the manufacturer. And even then, check seals, provenance and firmware.</p>
<h2>Can you do something similar with an old phone?<span class="hx:absolute hx:-mt-20" id="can-you-do-something-similar-with-an-old-phone"></span>
    <a href="#can-you-do-something-similar-with-an-old-phone" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>You can. But I&rsquo;d treat it as a study or budget alternative, not as an obvious substitute for a Coldcard.</p>
<p>The most serious path for that today is <a href="https://airgap.it/"target="_blank" rel="noopener">AirGap Vault</a>, which was specifically designed to use an old smartphone as an offline signer over QR codes, keeping the device off the network. The idea is good, and for many people it might be the right entry point.</p>
<p>But there are trade-offs:</p>
<ul>
<li>An old smartphone wasn&rsquo;t designed as a dedicated hardware wallet</li>
<li>The device&rsquo;s prior history matters</li>
<li>An aged battery, bad screen and abandoned Android are real problems</li>
<li>The threat model is less clear than on a dedicated device</li>
</ul>
<p>So my view is simple: can you use it? Yes. Would I recommend it as the main solution for storing meaningful wealth? No. For that I still prefer dedicated hardware bought from the right source.</p>
<h2>Sparrow Wallet: the best desktop piece in this puzzle<span class="hx:absolute hx:-mt-20" id="sparrow-wallet-the-best-desktop-piece-in-this-puzzle"></span>
    <a href="#sparrow-wallet-the-best-desktop-piece-in-this-puzzle" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>On Linux, I use <a href="https://sparrowwallet.com/"target="_blank" rel="noopener">Sparrow Wallet</a>. For me, today, it&rsquo;s one of the best pieces of software in this ecosystem.</p>
<p><a href="https://www.sparrowwallet.com/features/"target="_blank" rel="noopener"><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/bitcoin-sovereignty/sparrow-transactions.png" alt="Sparrow Wallet running, showing a detailed history of the wallet and transactions"  loading="lazy" /></a></p>
<p>What I like about it:</p>
<ul>
<li>works very well on Linux desktop</li>
<li>supports hardware wallets properly</li>
<li>understands PSBT without drama</li>
<li>makes it crystal clear what&rsquo;s happening in a transaction</li>
<li>it&rsquo;s great as a watch-only wallet</li>
</ul>
<p>In my flow, Sparrow does three things:</p>
<ol>
<li>Holds the watch-only wallet.</li>
<li>Builds the transaction with outputs and fees.</li>
<li>Receives the signature back from the Coldcard and broadcasts it.</li>
</ol>
<p>That separation is elegant. The desktop becomes the coordinator. The signer stays cold.</p>
<h2>Why Coldcard + Sparrow works so well<span class="hx:absolute hx:-mt-20" id="why-coldcard--sparrow-works-so-well"></span>
    <a href="#why-coldcard--sparrow-works-so-well" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>This combo is good because each piece does what it does best:</p>
<ul>
<li>the Coldcard protects the key</li>
<li>Sparrow organizes the human use of the wallet</li>
<li>the server handles the infrastructure</li>
</ul>
<p>A lot of wallets try to do everything. I prefer this modular design. It&rsquo;s less &ldquo;magic,&rdquo; more explicit, and easier to reason about without lying to yourself.</p>
<p>If I&rsquo;m at the desktop, I want visibility. If I&rsquo;m at the signer, I want isolation. If I&rsquo;m at the server, I want validation and a local index. That division is clean.</p>
<h2>The Sparrow problem when you don&rsquo;t run your own infra<span class="hx:absolute hx:-mt-20" id="the-sparrow-problem-when-you-dont-run-your-own-infra"></span>
    <a href="#the-sparrow-problem-when-you-dont-run-your-own-infra" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Now comes the important detail. Sparrow alone doesn&rsquo;t solve privacy.</p>
<p>If you install it, open it and just use public servers, the people on the other end learn quite a lot about your wallet: your address set, xpubs or derivations, balance, history, query behavior, broadcast. It&rsquo;s not custody, but it&rsquo;s still exposure.</p>
<p>That&rsquo;s the hole Fulcrum fills.</p>
<h2>Fulcrum: the home&rsquo;s private Electrum server<span class="hx:absolute hx:-mt-20" id="fulcrum-the-homes-private-electrum-server"></span>
    <a href="#fulcrum-the-homes-private-electrum-server" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p><a href="https://github.com/cculianu/Fulcrum"target="_blank" rel="noopener">Fulcrum</a> is an Electrum server. Instead of letting Sparrow ask things of a third-party public server, it asks my own server.</p>
<p>In practice, that means:</p>
<ul>
<li>local balance lookups</li>
<li>local history</li>
<li>local address discovery</li>
<li>local broadcast</li>
</ul>
<p>In other words: the desktop wallet stops &ldquo;phoning home&rdquo; to the world every time you open the program.</p>
<p>In my current setup, Sparrow points at a Fulcrum running on the home server on the LAN, with port <code>50001</code> on the internal network and <code>50002</code> with TLS.</p>
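<p>If you want to sanity-check the server before pointing Sparrow at it: the Electrum protocol is newline-delimited JSON-RPC 2.0 over TCP, so a version handshake is enough to confirm Fulcrum is answering. A hedged Python probe (the host is a placeholder for your server&rsquo;s LAN IP, and this only covers the plain-TCP port, not TLS):</p>

```python
import json
import socket

def electrum_request(method: str, params: list) -> bytes:
    """Electrum protocol messages are JSON-RPC 2.0, one per line."""
    msg = {"jsonrpc": "2.0", "id": 0, "method": method, "params": params}
    return (json.dumps(msg) + "\n").encode()

def probe(host: str, port: int = 50001, timeout: float = 5.0) -> dict:
    """Ask an Electrum server for its version banner over plain TCP."""
    with socket.create_connection((host, port), timeout=timeout) as s:
        s.sendall(electrum_request("server.version", ["probe", "1.4"]))
        return json.loads(s.makefile().readline())
```

<p>Against a healthy Fulcrum, something like <code>probe("192.168.0.10")</code> should come back with the server&rsquo;s software banner and negotiated protocol version in the <code>result</code> field.</p>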
<h2>And why Fulcrum isn&rsquo;t enough on its own<span class="hx:absolute hx:-mt-20" id="and-why-fulcrum-isnt-enough-on-its-own"></span>
    <a href="#and-why-fulcrum-isnt-enough-on-its-own" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Because Fulcrum doesn&rsquo;t replace a full node. It indexes on top of a full node.</p>
<p>The thing actually validating blocks, consensus rules, scripts, transactions and the chain is <code>bitcoind</code>. Fulcrum sits in front of it as an indexing layer, because plain Bitcoin Core wasn&rsquo;t built to serve a desktop wallet with that kind of fast querying.</p>
<p>So the correct architecture is:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">Coldcard (offline signer)
</span></span><span class="line"><span class="cl">        ^
</span></span><span class="line"><span class="cl">        | microSD / PSBT
</span></span><span class="line"><span class="cl">        v
</span></span><span class="line"><span class="cl">Sparrow Wallet (desktop watch-only + coordinator)
</span></span><span class="line"><span class="cl">        |
</span></span><span class="line"><span class="cl">        v
</span></span><span class="line"><span class="cl">Fulcrum (private Electrum server)
</span></span><span class="line"><span class="cl">        |
</span></span><span class="line"><span class="cl">        v
</span></span><span class="line"><span class="cl">bitcoind (full node)</span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<h2>What I actually brought up on the home server<span class="hx:absolute hx:-mt-20" id="what-i-actually-brought-up-on-the-home-server"></span>
    <a href="#what-i-actually-brought-up-on-the-home-server" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>On my home server, the stack lives in a dedicated Docker Compose folder and is made of two containers:</p>
<ul>
<li><code>bitcoin-bitcoind</code></li>
<li><code>bitcoin-fulcrum</code></li>
</ul>
<p>The compose is simple. And that&rsquo;s good. Sensitive infra doesn&rsquo;t gain anything by getting clever in YAML.</p>
<p>The main design is this:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">services</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">bitcoin</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">image</span><span class="p">:</span><span class="w"> </span><span class="l">lncm/bitcoind:v28.0</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">container_name</span><span class="p">:</span><span class="w"> </span><span class="l">bitcoin-bitcoind</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">user</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;${BITCOIN_UID}:${BITCOIN_GID}&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">restart</span><span class="p">:</span><span class="w"> </span><span class="l">always</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">security_opt</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span>- <span class="l">label:disable</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">volumes</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span>- <span class="l">/srv/bitcoin/data:/data/.bitcoin</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">ports</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span>- <span class="s2">&#34;8333:8333&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">stop_grace_period</span><span class="p">:</span><span class="w"> </span><span class="l">5m</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">healthcheck</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">test</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">&#34;CMD&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;bitcoin-cli&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;-datadir=/data/.bitcoin&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;ping&#34;</span><span class="p">]</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">interval</span><span class="p">:</span><span class="w"> </span><span class="l">30s</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">timeout</span><span class="p">:</span><span class="w"> </span><span class="l">10s</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">retries</span><span class="p">:</span><span class="w"> </span><span class="m">5</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">start_period</span><span class="p">:</span><span class="w"> </span><span class="l">60s</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">fulcrum</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">image</span><span class="p">:</span><span class="w"> </span><span class="l">cculianu/fulcrum:latest</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">container_name</span><span class="p">:</span><span class="w"> </span><span class="l">bitcoin-fulcrum</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">restart</span><span class="p">:</span><span class="w"> </span><span class="l">always</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">security_opt</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span>- <span class="l">label:disable</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">volumes</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span>- <span class="l">/srv/bitcoin/fulcrum:/data</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span>- <span class="l">/srv/bitcoin/data:/bitcoin:ro</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">command</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">&#34;Fulcrum&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;/data/fulcrum.conf&#34;</span><span class="p">]</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">ports</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span>- <span class="s2">&#34;50001:50001&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span>- <span class="s2">&#34;50002:50002&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">depends_on</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">bitcoin</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">condition</span><span class="p">:</span><span class="w"> </span><span class="l">service_healthy</span></span></span></code></pre></div></div>
</div>
<p>In my case, the restriction of <code>50001</code> to LAN happens at the host&rsquo;s network layer. The YAML above is the skeleton of the stack, not the entire firewall policy.</p>
<p>The most important parts of this:</p>
<ul>
<li><code>restart: always</code> because this is a long-running service</li>
<li>explicit volume so state isn&rsquo;t lost</li>
<li><code>user: &quot;${BITCOIN_UID}:${BITCOIN_GID}&quot;</code> because the persistent directory needs to match the storage&rsquo;s real ownership, so I&rsquo;d rather pin UID/GID explicitly than trust the image&rsquo;s default</li>
<li>the RPC isn&rsquo;t published on the host; it stays on the Compose internal network, which is all Fulcrum needs</li>
<li>the healthcheck uses Bitcoin Core&rsquo;s own local <code>.cookie</code>, so there&rsquo;s no need to spread a fixed password through commands</li>
<li>Fulcrum mounts the node&rsquo;s datadir as read-only just to authenticate via the <code>.cookie</code> without inventing parallel credentials</li>
<li>in <code>fulcrum.conf</code>, that becomes a simple configuration: talk to <code>bitcoin:8332</code> and read the mounted <code>.cookie</code>, instead of repeating credentials in plaintext</li>
<li><code>security_opt: label:disable</code> because on this MicroOS host, with SELinux and sensitive bind mounts, I took the pragmatic route of disabling labeling for this one container rather than wasting time fighting relabels on a volume that&rsquo;s already handled in a controlled way</li>
<li><code>depends_on</code> with <code>service_healthy</code> so Fulcrum only comes up after bitcoind&rsquo;s RPC is responding</li>
<li><code>stop_grace_period: 5m</code> because bitcoind needs real time to flush state on a graceful shutdown</li>
</ul>
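<p>For reference, the <code>bitcoind</code> side of the Compose file follows the same shape. This is a sketch under the article&rsquo;s layout, not the literal file: the image tag, datadir path and UID/GID variables are illustrative.</p>
<pre><code class="language-yaml">services:
  bitcoin:
    image: bitcoin/bitcoin:28.0            # illustrative tag
    container_name: bitcoin-core
    restart: always
    user: "${BITCOIN_UID}:${BITCOIN_GID}"  # match the volume's real ownership
    stop_grace_period: 5m                  # give bitcoind time to flush state
    volumes:
      - /srv/bitcoin/data:/data/.bitcoin
    ports:
      - "8333:8333"                        # P2P only; the RPC stays internal
    healthcheck:
      # bitcoin-cli reads the local .cookie from the datadir,
      # so no fixed password shows up in the command line
      test: ["CMD", "bitcoin-cli", "-datadir=/data/.bitcoin", "getblockchaininfo"]
      interval: 30s
      timeout: 10s
      retries: 5
</code></pre>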
<h2>The final version<span class="hx:absolute hx:-mt-20" id="the-final-version"></span>
    <a href="#the-final-version" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Today, the design I want to keep is this: <code>bitcoind</code> with <code>txindex</code>, <code>dbcache=1024</code>, persistent volume, 5-minute graceful stop, <code>.cookie</code> authentication, and Fulcrum in front serving Sparrow over LAN or TLS.</p>
<p>The current stack looks like this:</p>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>State</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Bitcoin Core</td>
          <td><code>28.0</code></td>
      </tr>
      <tr>
          <td>Fulcrum</td>
          <td><code>2.1.0</code></td>
      </tr>
      <tr>
          <td>Container stop timeout</td>
          <td><code>300</code> seconds</td>
      </tr>
      <tr>
          <td>Node data dir</td>
          <td>dedicated persistent volume mounted at <code>/data/.bitcoin</code></td>
      </tr>
      <tr>
          <td>Network</td>
<td><code>8333</code> for P2P, RPC only on the Compose internal network, <code>50001/50002</code> for the private Electrum server</td>
      </tr>
  </tbody>
</table>
<p>I&rsquo;m not interested in turning this into a spectacle. The point is simpler: the final infrastructure has to be boring, predictable and stable.</p>
<h2>The tunings that actually matter<span class="hx:absolute hx:-mt-20" id="the-tunings-that-actually-matter"></span>
    <a href="#the-tunings-that-actually-matter" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>There&rsquo;s no magic here. There are a few parameters that make a real difference and a bunch of stuff that just decorates compose files.</p>
<p><code>stop_grace_period: 5m</code> exists because bitcoind isn&rsquo;t a disposable stateless API container. It maintains chainstate, indexes and an in-memory cache. If you don&rsquo;t give the process time to finish properly, you create unnecessary work for the next start.</p>
<p><code>user: &quot;${BITCOIN_UID}:${BITCOIN_GID}&quot;</code> is there for a much less glamorous and much more important reason: persistent storage with the wrong permissions is an excellent way to break a working service. So I&rsquo;d rather align the container with the volume&rsquo;s actual ownership instead of leaving that implicit.</p>
<p><code>dbcache=1024</code> is the spot I find most reasonable for an always-on home node: big enough not to suffer constant I/O, small enough that every restart isn&rsquo;t an ordeal.</p>
<p><code>txindex=1</code> I keep because I want the complete node, not a minimalist install just to claim &ldquo;it runs Bitcoin.&rdquo; If the goal here is operational autonomy, I&rsquo;d rather have the full index.</p>
<p><code>rpcworkqueue=512</code> and <code>rpcthreads=16</code> are the kind of tweak that makes sense when you know you&rsquo;ll have Fulcrum querying the node all day and you want some headroom.</p>
<p>On the Fulcrum side, the main parameters are:</p>
<ul>
<li><code>db_mem = 8192</code></li>
<li><code>db_max_open_files = -1</code></li>
<li><code>bitcoind_clients = 8</code></li>
<li><code>worker_threads = 0</code></li>
<li><code>peering = false</code></li>
</ul>
<p>Again: nothing esoteric. Just enough cache, reasonable parallelism and absolutely no announcing this server as a public service.</p>
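<p>Put together, a minimal <code>fulcrum.conf</code> along those lines looks like this. A sketch, not my exact file; the certificate paths are placeholders:</p>
<pre><code class="language-ini">datadir = /data
# talk to the node over the Compose network, auth via the mounted .cookie
bitcoind = bitcoin:8332
rpccookie = /bitcoin/.cookie

# plaintext for LAN, TLS for everything else (cert paths are placeholders)
tcp = 0.0.0.0:50001
ssl = 0.0.0.0:50002
cert = /data/fulcrum.crt
key = /data/fulcrum.key

db_mem = 8192
db_max_open_files = -1
bitcoind_clients = 8
# 0 lets Fulcrum size the worker pool itself
worker_threads = 0
# never announce this server as a public Electrum peer
peering = false
</code></pre>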
<p>In my current <code>bitcoin.conf</code>, the important core ended up like this:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="cl"><span class="na">server</span><span class="o">=</span><span class="s">1</span>
</span></span><span class="line"><span class="cl"><span class="na">txindex</span><span class="o">=</span><span class="s">1</span>
</span></span><span class="line"><span class="cl"><span class="na">prune</span><span class="o">=</span><span class="s">0</span>
</span></span><span class="line"><span class="cl"><span class="na">rpcbind</span><span class="o">=</span><span class="s">0.0.0.0</span>
</span></span><span class="line"><span class="cl"><span class="na">rpcallowip</span><span class="o">=</span><span class="s">172.16.0.0/12</span>
</span></span><span class="line"><span class="cl"><span class="na">rpcthreads</span><span class="o">=</span><span class="s">16</span>
</span></span><span class="line"><span class="cl"><span class="na">rpcworkqueue</span><span class="o">=</span><span class="s">512</span>
</span></span><span class="line"><span class="cl"><span class="na">dbcache</span><span class="o">=</span><span class="s">1024</span>
</span></span><span class="line"><span class="cl"><span class="na">maxmempool</span><span class="o">=</span><span class="s">512</span></span></span></code></pre></div></div>
</div>
<p>All of this makes sense on a server with decent RAM and fast NVMe. But the detail that matters most is still the clean shutdown. Wallet infrastructure has no room for &ldquo;we&rsquo;ll deal with it later&rdquo; thinking.</p>
<h2>The actual size of all this<span class="hx:absolute hx:-mt-20" id="the-actual-size-of-all-this"></span>
    <a href="#the-actual-size-of-all-this" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>This is another point a lot of people underestimate.</p>
<p>If you look at older Bitcoin Core documentation, you&rsquo;ll find numbers like 350 GB of disk for a node with default config. That&rsquo;s outdated. More current data on the size of the blockchain points to something around <strong>725.82 GB on March 11, 2026</strong>, and that&rsquo;s just the raw chain, without the extra indexes that many technical folks will want to keep.</p>
<p>And here comes the catch: the stack I&rsquo;m describing isn&rsquo;t &ldquo;a bare Bitcoin Core just to claim you run a node.&rdquo; It&rsquo;s <code>bitcoind</code> with <code>txindex</code>, plus Fulcrum, plus headroom for rebuild, logs, snapshots and normal network growth.</p>
<p>So to put together something similar today, I&rsquo;d think like this:</p>
<ul>
<li>below 1 TB: I wouldn&rsquo;t even start</li>
<li>1 TB: pragmatic minimum</li>
<li>2 TB: comfortable range</li>
<li>above that: if you want long-term headroom, snapshots and less operational anxiety</li>
</ul>
<p>And here&rsquo;s the most important observation of all in self-hosting: don&rsquo;t assume persistence, mount and backup are right just because the YAML looks clean. Verify.</p>
<p>Another thing I wouldn&rsquo;t forget on a btrfs host: put Fulcrum&rsquo;s database (<code>fulc2_db</code>) on a separate subvolume. The reason is mundane. That directory grows, changes constantly and has nothing to do with generic automated snapshots of <code>/var</code>. If you mix everything, you end up dragging a large rebuildable index along with system snapshots, burning space and making maintenance more annoying than it needed to be. The Fulcrum index isn&rsquo;t sensitive configuration. It&rsquo;s heavy, volatile, rebuildable data. I treat it exactly like that.</p>
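<p>On btrfs, that separation is a couple of commands. The paths below follow this article&rsquo;s layout and are examples, not prescriptions:</p>
<pre><code class="language-shell"># carve the Fulcrum index out into its own subvolume
btrfs subvolume create /srv/bitcoin/fulcrum/fulc2_db
chown "${BITCOIN_UID}:${BITCOIN_GID}" /srv/bitcoin/fulcrum/fulc2_db

# btrfs snapshots are not recursive, so a nested subvolume is enough
# to keep the index out of snapshots of the parent volume
</code></pre>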
<h2>Hardening: what I&rsquo;ve already applied<span class="hx:absolute hx:-mt-20" id="hardening-what-ive-already-applied"></span>
    <a href="#hardening-what-ive-already-applied" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>This is where the difference between &ldquo;ran on my laptop&rdquo; and &ldquo;I&rsquo;d trust this to operate my wallet&rdquo; shows up.</p>
<p>In the current state of the stack, the points I consider important ended up like this:</p>
<ul>
<li>Bitcoin Core&rsquo;s RPC no longer relies on unnecessary host exposure; Fulcrum talks to <code>bitcoind</code> over the Docker internal network, which is what actually matters</li>
<li><code>50001</code> is restricted to internal LAN use</li>
<li><code>50002</code> is available with TLS, which is the right move when you need to leave plaintext behind</li>
<li>shutdown is graceful, with <code>stop_grace_period: 5m</code>, so <code>bitcoind</code> has time to flush state instead of dying any old way</li>
<li>the storage mount isn&rsquo;t on a &ldquo;we&rsquo;ll see later&rdquo; basis; there&rsquo;s a mount check before Docker comes up, precisely to avoid silent drift</li>
</ul>
<p>Each of those items exists for a very concrete reason.</p>
<p>Pulling the RPC off the host reduces the attack surface at zero cost. Fulcrum is already in the same Compose file and can talk to the service by its internal name. There&rsquo;s no real gain in leaving that port exposed where it doesn&rsquo;t need to be exposed.</p>
<p>Separating <code>50001</code> and <code>50002</code> also helps keep the house in order. Within a controlled LAN, plaintext is acceptable. Outside of that, the minimum reasonable thing is TLS. Mixing the two scenarios usually turns into a mess.</p>
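<p>For <code>50002</code>, a self-signed certificate is enough on a private server, since the client can pin it on first connect. A sketch, with example paths and a made-up CN:</p>
<pre><code class="language-shell"># ten-year self-signed cert/key pair for Fulcrum's TLS listener
openssl req -x509 -newkey rsa:2048 -nodes -days 3650 \
  -subj "/CN=fulcrum.local" \
  -keyout /srv/bitcoin/fulcrum/fulcrum.key \
  -out /srv/bitcoin/fulcrum/fulcrum.crt
</code></pre>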
<p><code>stop_grace_period: 5m</code> looks like a container detail, but it isn&rsquo;t. Anyone who&rsquo;s ever had a database, an index or a blockchain node killed without grace knows how that turns into hours of work later. A stateful service needs a decent stop.</p>
<p>And the mount check is one of those annoying things that saves you from yourself. The YAML can look beautiful. If the storage didn&rsquo;t mount and the service came up writing where it shouldn&rsquo;t, you&rsquo;ve just manufactured a really irritating problem.</p>
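<p>One boring way to implement that mount check is to let systemd refuse to bring the stack up at all when the storage isn&rsquo;t mounted. Unit and path names here are illustrative:</p>
<pre><code class="language-ini"># /etc/systemd/system/bitcoin-stack.service (excerpt)
[Unit]
# wait for the storage, and hard-fail if /srv/bitcoin is a plain
# directory instead of a real mount point
RequiresMountsFor=/srv/bitcoin
ConditionPathIsMountPoint=/srv/bitcoin

[Service]
ExecStart=/usr/bin/docker compose -f /srv/bitcoin/compose.yml up
</code></pre>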
<p>There&rsquo;s also one detail I really like in this final version of the stack: Fulcrum authenticates to <code>bitcoind</code> through the <code>.cookie</code> file, not through a fixed plaintext password. That&rsquo;s interesting for two reasons:</p>
<ul>
<li>you don&rsquo;t need to leave a static credential showing up in compose, inspect, healthcheck or documentation</li>
<li>the authentication is more aligned with the way Bitcoin Core already knows how to operate locally</li>
</ul>
<p>In practical terms, that reduces accidental leakage of operational secrets. It&rsquo;s not a magic solution to everything, but it&rsquo;s much better than spreading <code>rpcuser</code> and <code>rpcpassword</code> across files, logs and commands.</p>
<p>The only kind of hardening I try to avoid here is the one that&rsquo;s overly performative in YAML and loose in operation. I&rsquo;d rather have less &ldquo;stage engineering&rdquo; and more basic discipline:</p>
<ul>
<li>minimum network</li>
<li>minimum secrets</li>
<li>minimum privilege</li>
<li>clean shutdown</li>
<li>verified storage</li>
<li>separate subvolume for large rebuildable data, like the Fulcrum index</li>
</ul>
<p>And, again, document everything. Good infrastructure isn&rsquo;t the kind that just works today. It&rsquo;s the kind that keeps working when you come back to it six months later.</p>
<h2>Why this improves transactions on your side<span class="hx:absolute hx:-mt-20" id="why-this-improves-transactions-on-your-side"></span>
    <a href="#why-this-improves-transactions-on-your-side" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>When I build a transaction in Sparrow and sign it on the Coldcard, the chain of trust is much better defined:</p>
<ul>
<li>the private key never touches the internet</li>
<li>the desktop wallet doesn&rsquo;t have to trust a public server</li>
<li>the broadcast can come out of my own node</li>
<li>the address history doesn&rsquo;t need to land on a third-party Electrum server</li>
</ul>
<p>This doesn&rsquo;t make anything invulnerable. There&rsquo;s still a risk of malware on the desktop, badly stored seeds, human error, social engineering and physical disaster. But the design becomes much more coherent.</p>
<h2>What about Lightning? Especially in Brazil?<span class="hx:absolute hx:-mt-20" id="what-about-lightning-especially-in-brazil"></span>
    <a href="#what-about-lightning-especially-in-brazil" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>There I separate things.</p>
<p>Reserves and larger-value transactions I treat one way. Day-to-day spending I treat another.</p>
<p>For day-to-day spending, especially in Brazil, I think it&rsquo;s operationally dumb to carry a lot of balance in a hot wallet. Lightning wallets and spending apps have to be almost like a &ldquo;pocket wallet&rdquo;: just enough for daily life.</p>
<p>That goes double if you use a hybrid or custodial solution like <a href="https://www.redotpay.com/"target="_blank" rel="noopener">RedotPay</a>. I get why it&rsquo;s interesting for Brazilians: a Hong Kong company, international focus, a reasonably practical bridge between crypto and card spending. For travel, online shopping and life outside the Brazilian banking axis, it makes sense. But I&rsquo;d never treat it as a place to store wealth. That&rsquo;s a spending tool, not a vault.</p>
<p>Same logic for <a href="https://www.bitrefill.com/br/pt/"target="_blank" rel="noopener">Bitrefill Brasil</a>. I think the service is interesting precisely because it solves a real pain in Brazil: turning sats into concrete utility without selling your full position or depending on banking integration all the time. Gift cards, top-ups, small expenses. As a use tool, it makes a lot of sense.</p>
<p>For a Lightning wallet on the phone, these are the ones I&rsquo;d look at first:</p>
<ul>
<li><a href="https://phoenix.acinq.co/"target="_blank" rel="noopener">Phoenix</a> for people who want something very good and simple</li>
<li><a href="https://breez.technology/"target="_blank" rel="noopener">Breez</a> for people who want a great payments experience</li>
<li><a href="https://zeusln.com/"target="_blank" rel="noopener">ZEUS</a> if you&rsquo;re more technical and you eventually plan to operate your own Lightning node</li>
</ul>
<p>All of them, in my head, fit into the &ldquo;pocket wallet&rdquo; category. Small balance. Daily use. Don&rsquo;t turn a phone app into a retirement vault.</p>
<h2>Recent news that reinforces this reasoning<span class="hx:absolute hx:-mt-20" id="recent-news-that-reinforces-this-reasoning"></span>
    <a href="#recent-news-that-reinforces-this-reasoning" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>I&rsquo;m not building this kind of stack because I think it&rsquo;s pretty. I&rsquo;m building it because outsourcing too much goes wrong too often.</p>
<p>Two recent examples:</p>
<ul>
<li>the <a href="https://www.cnbc.com/2025/02/21/hackers-steal-1point5-billion-from-exchange-bybit-biggest-crypto-heist.html"target="_blank" rel="noopener">Bybit hack in 2025</a> showed, again, the basic risk of leaving meaningful custody at an exchange</li>
<li>the <a href="https://techcrunch.com/2025/05/15/coinbase-says-customers-personal-information-stolen-in-data-breach/"target="_blank" rel="noopener">Coinbase customer data leak in 2025</a> showed the other side of the problem: even when custody isn&rsquo;t the immediate concern, your identity, balance and history become an attack surface</li>
</ul>
<p>A stack like Coldcard + Sparrow + Fulcrum + full node doesn&rsquo;t eliminate every risk in the world. But it avoids two very real classes of problem:</p>
<ul>
<li>losing custody sovereignty</li>
<li>handing over wallet and transaction privacy on a silver platter to third parties</li>
</ul>
<h2>So is it worth it?<span class="hx:absolute hx:-mt-20" id="so-is-it-worth-it"></span>
    <a href="#so-is-it-worth-it" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>For most people, honestly, probably not in the first month. It&rsquo;s labor-intensive, has a learning curve, and demands discipline.</p>
<p>But for programmers, engineers and any technical person who wants to learn not to depend on someone else&rsquo;s service all the time, I think it&rsquo;s an excellent exercise.</p>
<p>You learn about:</p>
<ul>
<li>separation of concerns</li>
<li>persistence and state</li>
<li>graceful shutdown</li>
<li>observability</li>
<li>secret isolation</li>
<li>the trade-off between convenience and security</li>
</ul>
<p>And all of that is valuable beyond Bitcoin.</p>
<p>In the end, that&rsquo;s what interests me most about this stack. It isn&rsquo;t about preaching &ldquo;hyperbitcoinization&rdquo; or posing as a price prophet. It&rsquo;s about building a system at home that I can trust more because I&rsquo;m the one who installed it, measured it, broke it, fixed it and documented it.</p>
<p>Is it work? Yes.</p>
<p>But that kind of work teaches exactly what modern software tries to make you forget: depending less on others is more work upfront, but it usually buys a lot more control in the long run.</p>
]]></content:encoded><category>bitcoin</category><category>homeserver</category><category>self-hosting</category><category>privacy</category><category>security</category><category>lightning</category></item><item><title>My Sim Racing Cockpit - Formula FX1</title><link>https://www.akitaonrails.com/en/2026/04/01/my-sim-racing-cockpit-formula-fx1/</link><guid isPermaLink="true">https://www.akitaonrails.com/en/2026/04/01/my-sim-racing-cockpit-formula-fx1/</guid><pubDate>Wed, 01 Apr 2026 17:00:00 GMT</pubDate><description>&lt;p&gt;I&amp;rsquo;ve loved cars for as long as I can remember. My first real contact with racing games was at the arcades of the 80s and 90s. And when I say &amp;ldquo;real,&amp;rdquo; I mean sitting at a cabinet with a wheel, pedals, a hard plastic seat and the screen wrapping in front of you. &lt;a href="https://en.wikipedia.org/wiki/Out_Run"target="_blank" rel="noopener"&gt;OutRun&lt;/a&gt; (1986), &lt;a href="https://en.wikipedia.org/wiki/Rad_Mobile"target="_blank" rel="noopener"&gt;Rad Mobile&lt;/a&gt; (1991), &lt;a href="https://en.wikipedia.org/wiki/Virtua_Racing"target="_blank" rel="noopener"&gt;Virtua Racing&lt;/a&gt; (1992), &lt;a href="https://en.wikipedia.org/wiki/Ridge_Racer"target="_blank" rel="noopener"&gt;Ridge Racer&lt;/a&gt; (1993), &lt;a href="https://en.wikipedia.org/wiki/Daytona_USA_%28video_game%29"target="_blank" rel="noopener"&gt;Daytona USA&lt;/a&gt; (1994), &lt;a href="https://en.wikipedia.org/wiki/Scud_Race"target="_blank" rel="noopener"&gt;Scud Race&lt;/a&gt; (1996). Every one of those games left a mark on me. But Daytona USA stuck in a different way. That twin cabinet, two machines side by side, the &amp;ldquo;DAYTONAAA, let&amp;rsquo;s go away&amp;rdquo; track blasting through the arcade, the wheel rumbling in your hand. I still remember it.&lt;/p&gt;</description><content:encoded><![CDATA[<p>I&rsquo;ve loved cars for as long as I can remember. 
My first real contact with racing games was at the arcades of the 80s and 90s. And when I say &ldquo;real,&rdquo; I mean sitting at a cabinet with a wheel, pedals, a hard plastic seat and the screen wrapping in front of you. <a href="https://en.wikipedia.org/wiki/Out_Run"target="_blank" rel="noopener">OutRun</a> (1986), <a href="https://en.wikipedia.org/wiki/Rad_Mobile"target="_blank" rel="noopener">Rad Mobile</a> (1991), <a href="https://en.wikipedia.org/wiki/Virtua_Racing"target="_blank" rel="noopener">Virtua Racing</a> (1992), <a href="https://en.wikipedia.org/wiki/Ridge_Racer"target="_blank" rel="noopener">Ridge Racer</a> (1993), <a href="https://en.wikipedia.org/wiki/Daytona_USA_%28video_game%29"target="_blank" rel="noopener">Daytona USA</a> (1994), <a href="https://en.wikipedia.org/wiki/Scud_Race"target="_blank" rel="noopener">Scud Race</a> (1996). Every one of those games left a mark on me. But Daytona USA stuck in a different way. That twin cabinet, two machines side by side, the &ldquo;DAYTONAAA, let&rsquo;s go away&rdquo; track blasting through the arcade, the wheel rumbling in your hand. I still remember it.</p>
<p><img src="https://upload.wikimedia.org/wikipedia/commons/2/24/DaytonaUSA_arcade_SaoPaulo.jpg" alt="Daytona USA machines at a São Paulo mall — exactly the kind of twin cabinet that’s burned into my memory"  loading="lazy" /></p>
<p>But the game that really got me hooked on the simcade genre was the original <a href="https://en.wikipedia.org/wiki/Gran_Turismo_%28video_game%29"target="_blank" rel="noopener">Gran Turismo</a>, in 1997, on the PlayStation 1. &ldquo;The Real Driving Simulator&rdquo; on the cover. I played that game obsessively. Around the same time I was watching the <a href="https://en.wikipedia.org/wiki/Initial_D"target="_blank" rel="noopener">Initial D</a> anime, which premiered in 1998 in Japan. I bought every manga volume and read all of it from start to finish. The story of Takumi Fujiwara going down Mount Akina at dawn delivering tofu in his father&rsquo;s AE86 is, to me, one of the best motorsport stories ever told in any medium.</p>
<p>I still follow Shuichi Shigeno&rsquo;s work today. After Initial D came <a href="https://kodansha.us/series/mf-ghost/"target="_blank" rel="noopener">MF Ghost</a> (2017-2025), set in the same universe but in a near future where combustion cars have become museum pieces. And now, since July 2025, I&rsquo;m reading <a href="https://kmanga.kodansha.com/title/10664/episode/360584"target="_blank" rel="noopener">Subaru and Subaru</a>, the direct sequel that ties the Initial D and MF Ghost universes together with two protagonists named Subaru — one from Gunma, one from Kanagawa — competing in a new racing series. It&rsquo;s Shigeno at his best.</p>
<p><a href="https://kmanga.kodansha.com/title/10664/episode/360584"target="_blank" rel="noopener"><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/subaru-x-subaru-cover.png" alt="Subaru and Subaru — Shigeno’s new manga, the direct sequel to Initial D and MF Ghost"  loading="lazy" /></a></p>
<p>Two years ago I traveled to Japan with my girlfriend and made a point of going to <a href="https://en.wikipedia.org/wiki/Daikoku_Parking_Area"target="_blank" rel="noopener">Daikoku PA</a>, the famous parking area on the Shuto Expressway in Yokohama where the JDM culture concentrates. As an old fan of <a href="https://store.steampowered.com/app/2634950/Tokyo_Xtreme_Racer/"target="_blank" rel="noopener">Tokyo Xtreme Racer</a>, by Genki, I needed to see Daikoku with my own eyes at least once. And it didn&rsquo;t disappoint. Instead of renting a car, we booked a tour with a local guide in his prepped Nissan GT-R. Better that way. On the drive there he explained the history of the Wangan, how the scene works, what&rsquo;s YouTube exaggeration and what&rsquo;s real. When we got there on a Friday night and I saw the whole thing in person — Skyline R34, RX-7, Supra, GT-R, tuned kei trucks, insane bosozoku — the feeling was strange in the best possible way. It looked like Tokyo Xtreme Racer, except with the smell of fuel in the air and the sound of real exhaust pipes.</p>
<p>And there&rsquo;s another detail: I&rsquo;m playing the new <a href="https://store.steampowered.com/app/2634950/Tokyo_Xtreme_Racer/"target="_blank" rel="noopener">Tokyo Xtreme Racer</a> reboot on PC, and it&rsquo;s exactly the kind of game that understands its own audience. Strong single-player campaign, addictive progression, the right vibe, and none of the loot box nonsense. I&rsquo;d recommend it without hesitation. For the same reason, I&rsquo;m also really looking forward to <a href="https://forza.net/forzahorizon6"target="_blank" rel="noopener">Forza Horizon 6</a>, which this time is going to be set in Japan. I&rsquo;ve already pre-ordered it and I can&rsquo;t wait to play it on the new cockpit.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/daikoku---nissan.jpg" alt="At Daikoku PA with a modified GT-R"  loading="lazy" /></p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/daikoku---trueno.jpg" alt="With an AE86 Trueno at Daikoku PA — Takumi’s car"  loading="lazy" /></p>
<h2>Driving for real<span class="hx:absolute hx:-mt-20" id="driving-for-real"></span>
    <a href="#driving-for-real" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Now that I&rsquo;m semi-retired, I&rsquo;ve had the chance to take my Mercedes to track days. I&rsquo;ve driven at <a href="https://en.wikipedia.org/wiki/Aut%C3%B3dromo_Jos%C3%A9_Carlos_Pace"target="_blank" rel="noopener">Autódromo de Interlagos</a> (the Autódromo José Carlos Pace), the 4.309 km circuit in São Paulo that&rsquo;s been hosting the Brazilian F1 GP since 1973, famous for the S do Senna corner complex and the circuit&rsquo;s wild elevation changes. I&rsquo;ve also driven at <a href="https://www.velocittapark.com.br/"target="_blank" rel="noopener">Autódromo Velocitta</a>, a modern 3.443 km circuit opened in 2014 in Mogi Guaçu in the interior of São Paulo, which hosts Stock Car Brasil and Porsche Cup.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/velocitta.jpg" alt="At Velocitta with my Mercedes during an AMG track day"  loading="lazy" /></p>
<p>In Las Vegas I&rsquo;ve driven supercars on those track-day experiences. And when I traveled with my girlfriend to Gramado, in Rio Grande do Sul, we went to <a href="https://supercarros.cc/"target="_blank" rel="noopener">Super Carros</a>, which is on Av. das Hortênsias 4635. They have a 2,400 m² hangar with more than 50 cars — Ferraris, Lamborghinis, Porsches, GT-Rs, Corvettes, American muscle cars. You pick a car, head out with an instructor, and drive a roughly 17 km route between Gramado and Canela. I took out a Nissan GT-R and a Ferrari California.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/gramado---gtr.jpg" alt="With a GT-R in Gramado"  loading="lazy" /></p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/gramado---ferrari.jpg" alt="With a Ferrari California in Gramado"  loading="lazy" /></p>
<p>Three years ago I also went to Abu Dhabi with my girlfriend and we went to Ferrari World, which has some of the best racing simulators I&rsquo;ve ever tried. Hydraulic platform with 6 degrees of freedom, F1 cockpit, the works. I&rsquo;ve always loved testing simulators wherever I go.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/ferrari-park.jpg" alt="F1 simulator with a hydraulic platform at Ferrari World in Abu Dhabi"  loading="lazy" /></p>
<p>But driving real cars on real tracks is a very expensive hobby. Tires, fuel, insurance, maintenance, registration. And more important: I&rsquo;m an introvert. I prefer being alone. My simulator cockpit is perfect for when I want to drive without having to deal with anyone. That&rsquo;s why I love rally so much — it&rsquo;s me, the virtual co-driver, and the road. Nothing else.</p>
<h2>The games I play<span class="hx:absolute hx:-mt-20" id="the-games-i-play"></span>
    <a href="#the-games-i-play" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>I know most people who build a cockpit like this do it to play serious sims — iRacing, Assetto Corsa Competizione, Automobilista 2. I respect that, but it isn&rsquo;t my thing. I don&rsquo;t like playing online with other people. I have zero intention of starting a live streaming career. This is purely for my own enjoyment.</p>
<p>These days I play Gran Turismo 7 on the PS5, <a href="https://store.steampowered.com/app/2440510/Forza_Motorsport/"target="_blank" rel="noopener">Forza Motorsport</a> (the 8, from 2023) on PC, but where I have the most fun is in rally games: <a href="https://store.steampowered.com/app/1849250/EA_SPORTS_WRC/"target="_blank" rel="noopener">EA SPORTS WRC</a>, <a href="https://store.steampowered.com/app/1462810/WRC_10_FIA_World_Rally_Championship/"target="_blank" rel="noopener">WRC 10</a> and <a href="https://store.steampowered.com/app/690790/DiRT_Rally_20/"target="_blank" rel="noopener">DiRT Rally 2.0</a>. My first experience with Forza was on the Xbox One with Forza Motorsport 5 and then Forza Horizon 4, which kept me hooked for hundreds of hours.</p>
<p>And I have a huge soft spot for retro games. The original Colin McRae Rally from 1998 on the PS1 was my first rally game. But my favorite of all time is Colin McRae Rally 2.0 (2000), also on the PS1. I recently played through the entire campaign again on the PC version — you can find repacks that run in high resolution and widescreen, much better than the original PlayStation versions. I&rsquo;d recommend that for any of the titles in the series.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/colin-mcrae-rally-2-gameplay.jpg" alt="Colin McRae Rally 2.0 running in widescreen on PC — snow in Sweden with the Ford Focus"  loading="lazy" /></p>
<p>After that came Colin McRae 3 (2002), Colin McRae Rally 04 (2003) and Colin McRae 2005 (2004). Other arcades I revisit often: OutRun 2 SP (2004) and OutRun 2006: Coast 2 Coast (2006) — the best OutRun ever made, in my opinion.</p>
<p>But my game of the year, by a long shot, is <a href="https://store.steampowered.com/app/3218630/Super_Woden_Rally_Edge/"target="_blank" rel="noopener">Super Woden: Rally Edge</a>. An indie made by a solo developer (ViJuDa, from Spain) that launched in January 2026 for less than R$ 60. Eight countries, more than 80 cars, a career mode, local split-screen multiplayer for up to 4 players, online leaderboards. The behind-the-car camera instead of the top-down view of the previous Super Woden GP made all the difference. 96% positive reviews on Steam with more than 1,300 ratings. It&rsquo;s the kind of game that proves you don&rsquo;t need a million-dollar budget to make something amazing.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/super-woden-rally-edge.jpg" alt="Super Woden: Rally Edge — indie made by a solo dev that competes with the big ones"  loading="lazy" /></p>
<h2>The evolution of my wheel setup<span class="hx:absolute hx:-mt-20" id="the-evolution-of-my-wheel-setup"></span>
    <a href="#the-evolution-of-my-wheel-setup" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><h3>Logitech G29 era (~2015-2021)<span class="hx:absolute hx:-mt-20" id="logitech-g29-era-2015-2021"></span>
    <a href="#logitech-g29-era-2015-2021" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>I always wanted a racing wheel. I started years ago with Logitech&rsquo;s entry-level wheel, the G29. The G29 is a fine wheel to start with — gear-driven force feedback, pedals with a clutch, 900 degrees of rotation. But its force feedback is noisy and a bit crude. You can feel the gears turning.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/logitech-g29-and-myself.jpeg" alt="Me playing with the Logitech G29 on the couch — the beginning of it all"  loading="lazy" /></p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/logitech-g29-with-support.jpg" alt="The G29 with a simple stand in front of the couch"  loading="lazy" /></p>
<h3>Thrustmaster T300RS + SXT V2 stand era (~2021-2024)<span class="hx:absolute hx:-mt-20" id="thrustmaster-t300rs--sxt-v2-stand-era-2021-2024"></span>
    <a href="#thrustmaster-t300rs--sxt-v2-stand-era-2021-2024" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Around 2021 I upgraded to the <a href="https://www.thrustmaster.com/products/t300rs-gt-edition/"target="_blank" rel="noopener">Thrustmaster T300RS</a>, a belt-driven wheel that&rsquo;s a huge jump up from the Logitech. The force feedback is much smoother and more precise. And I bought the <a href="https://loja.cockpitextremeracing.com.br/products/suporte-para-volantes-sxt-v2?variant=51111119454489"target="_blank" rel="noopener">Extreme Sim Racing SXT V2</a> stand, which is much sturdier than those generic desk clamps.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/thrustmaster-with-support.jpg" alt="The Thrustmaster T300RS mounted on the SXT V2 stand"  loading="lazy" /></p>
<p>First I set it up in front of my desktop PC, which at the time had an RTX 3090. It worked, but it was a hassle to keep mounting and unmounting the stand and the cables every time I wanted to play.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/thrustmaster-with-pc.jpg" alt="The Thrustmaster in front of the desktop PC — functional but annoying"  loading="lazy" /></p>
<p>Then I built a setup with a long fiber-optic HDMI cable to connect my 60&quot; TV to the PC at the back of the room. I moved the stand in front of the couch. Less of a hassle, but I still had to take it down whenever I wanted to watch a movie with my girlfriend.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/thrustmaster-with-big-tv.jpg" alt="Setup with the big TV — better, but still a hassle"  loading="lazy" /></p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/thrustmaster-in-the-living-room.jpg" alt="The stand in the living room — always in the way"  loading="lazy" /></p>
<p>Around 2024 or 2025 I swapped my couch for one of those VIP cinema couches from <a href="https://www.starseat.com.br/sofa-cinema-sofa-cinema"target="_blank" rel="noopener">Star Seat</a>, which reclines and the whole nine yards. The problem: it was way taller than the previous couch. I had to do all kinds of workarounds to make the stand work at that height. I even 3D-printed mounts and sent them to PCBWay to machine steel plates so I could attach big wheels under the stand and gain a few centimeters of height. But that left the setup way too wobbly to drive comfortably.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/fanatec-support-sofa-too-high.jpg" alt="The problem: VIP couch too tall for the stand — total kludge"  loading="lazy" /></p>
<h3>Fanatec CSL DD + Direct Drive era (~2024-2025)<span class="hx:absolute hx:-mt-20" id="fanatec-csl-dd--direct-drive-era-2024-2025"></span>
    <a href="#fanatec-csl-dd--direct-drive-era-2024-2025" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>In the meantime I gave the T300 to my brother and upgraded to a <a href="https://fanatec.com/eu-en/racing-wheels-wheel-bases/racing-wheels/gran-turismo-dd-pro-5-nm-wheel-base"target="_blank" rel="noopener">Fanatec CSL DD Gran Turismo Edition</a>. The CSL DD&rsquo;s direct drive motor delivers 5 Nm of torque at the base, but the Gran Turismo DD Pro kit comes with the Boost Kit 180 that brings it up to a sustained 8 Nm with no active cooling. Direct drive means there&rsquo;s no gear or belt between the motor and the wheel — the motor&rsquo;s shaft IS the wheel&rsquo;s shaft. The difference is absurd. The T300 was already great, much better than the Logitech. But the Fanatec is on another level. You feel every asphalt texture, every bump, every incipient slide. There&rsquo;s no going back once you try it.</p>
<p>I bought it together with the <a href="https://fanatec.com/eu-en/pedals/csl-pedals"target="_blank" rel="noopener">CSL Pedals with Load Cell</a> kit, which measures pressure on the brake instead of displacement. Makes all the difference in braking — you learn to modulate by foot pressure, not by how far the pedal moves. Way more natural.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/fanatec-direct-drive-close-up.jpg" alt="Close-up of the Fanatec CSL DD direct drive motor"  loading="lazy" /></p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/fanatec-csl-pedals.jpg" alt="Fanatec CSL Pedals with Load Cell Kit"  loading="lazy" /></p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/fanatec---shifter.jpg" alt="The Fanatec shifter — manual H-pattern with chrome knob"  loading="lazy" /></p>
<p>I also wanted to try the H-pattern manual shifter with the <a href="https://fanatec.com/us/en/p/add-ons/crd-9040002-ww/clubsport-shifter-sq"target="_blank" rel="noopener">ClubSport Shifter SQ V1.5</a> from Fanatec and a separate handbrake. It&rsquo;s fun to try out old cars with a clutch and an H-pattern, but in practice I never adapted. The SXT V2 stand was already shaking a lot with the direct drive, and using the shifter on that unstable setup was frustrating. And I know there are people who want to do heel-toe, but in the simulator I prefer to keep my left foot on the brake and my right on the gas and modulate both at the same time. Works better for me. Now that I have the McLaren wheel with the analog handbrake and clutch paddles right on the wheel, the H-pattern shifter and the external handbrake have been retired. For rally, an analog handbrake on the wheel is much more natural.</p>
<p>I also bought the PS5 with Gran Turismo 7 around this time. I put on the <a href="https://dbrand.com/shop/ps5"target="_blank" rel="noopener">dbrand Darkplates</a> matte black faceplates to replace the original white plates — it looks much better and more discreet.</p>
<p>But the setup was still the SXT V2 stand in front of the VIP couch. The same kludge. The same wobble. I obviously wasn&rsquo;t going to give up the cinema couch. The situation became unsustainable.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/fanatec-with-support.jpg" alt="The Fanatec CSL DD on the SXT V2 stand — too powerful for the stand to handle"  loading="lazy" /></p>
<h2>The computer and the hardware setup<span class="hx:absolute hx:-mt-20" id="the-computer-and-the-hardware-setup"></span>
    <a href="#the-computer-and-the-hardware-setup" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>A note on my gaming hardware. I bought a <a href="https://www.minisforum.com/products/UX790-Pro.html"target="_blank" rel="noopener">Minisforum UX790 Pro</a> to be my dedicated Steam machine. It&rsquo;s a mini-PC with an Intel Core Ultra 9 285H processor that fits in the palm of your hand. Along with it I bought the <a href="https://www.minisforum.com/products/minisforum-deg1-egpu-dock"target="_blank" rel="noopener">Minisforum DEG1</a>, an external GPU dock that connects via OCuLink (PCIe 4.0 x4, roughly 64 Gbps). It&rsquo;s an open design — basically a board with a PCIe x16 slot and room for an ATX or SFX power supply. There&rsquo;s no card size limit, so an RTX 4090 fits comfortably. The performance loss compared to a native PCIe slot is minimal. I put the RTX 4090 in it. The 4090 came from my desktop — at the start of 2025 I went to Miami and took the chance to buy an RTX 5090 because I was using more and more local AI and LLMs. I gave my old 3090 to my girlfriend to use for video editing. The 4090 went into the mini-PC.</p>
<p>So my gaming setup today is: Minisforum UX790 Pro + eGPU with RTX 4090 for Steam and PC games, and a PlayStation 5 with matte black Darkplates for Gran Turismo 7 and exclusives.</p>
<h2>The cockpit: Formula FX1<span class="hx:absolute hx:-mt-20" id="the-cockpit-formula-fx1"></span>
    <a href="#the-cockpit-formula-fx1" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>To round out the setup I also needed a decent monitor. I was already used to my 80&quot; Samsung OLED TV in the living room and didn&rsquo;t want to downgrade picture quality. So I invested in the <a href="https://www.samsung.com/br/monitors/gaming/odyssey-oled-g8-g81sf-32-inch-240hz-oled-uhd-ls32fg810snxzd/"target="_blank" rel="noopener">Samsung Odyssey OLED G8 32&quot;</a>. It&rsquo;s a 4K (3840x2160) OLED monitor with a 240 Hz refresh rate, 0.03 ms response time (GTG), HDR True Black 400, HDR10+, 99% DCI-P3 coverage, 1,000,000:1 contrast, and FreeSync Premium Pro. It has 2 HDMI 2.1 inputs and 1 DisplayPort 1.4.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/closed-delivery-box.jpg" alt="The box of the Samsung Odyssey OLED G8 32&#34; when it arrived"  loading="lazy" /></p>
<p>In practice: the colors pop, black is real black (it&rsquo;s OLED, no backlight), and with the RTX 4090 I run most games at 4K and 120 fps with no problem. On lighter titles like Super Woden: Rally Edge, it easily hits 240 Hz. The smoothness is absurd. For a cockpit where you&rsquo;re 60-70 cm from the screen, 32&quot; OLED in 4K is the sweet spot. Bigger and you start to see pixels. Smaller and you lose the immersion.</p>
<p>In January 2026, after years of kludges, I finally ordered a dedicated cockpit. I researched a lot. I considered the <a href="https://loja.cockpitextremeracing.com.br/products/cockpit-ax160-horizontal?variant=52008751431961"target="_blank" rel="noopener">Cockpit AX160</a>, made of aluminum profile and very modular, and the <a href="https://loja.cockpitextremeracing.com.br/products/cockpit-4-0-horizontal?variant=52004294852889"target="_blank" rel="noopener">Cockpit 4.0</a>, which is the more traditional tubular steel kind. But neither was available at the time of purchase. And then I found the <a href="https://loja.cockpitextremeracing.com.br/products/cockpit-formula-fx1-preto-e-verde?variant=51700876509465"target="_blank" rel="noopener">Formula FX1 in black and green</a> from Extreme Racing — Petronas colors, F1-styled.</p>
<p>The FX1 is very different from traditional cockpits. The whole structure is welded thick steel tubing. When I say it doesn&rsquo;t shake, I mean it doesn&rsquo;t shake at all. Zero wobble. It&rsquo;s a brutal difference compared to a stand in front of the couch. The driving position is reclined, F1-style — your feet are at the same height or higher than your hips. You&rsquo;d think it would be uncomfortable, but it isn&rsquo;t. You can sit there for hours without complaining. It comes with a padded adjustable seat, an articulating monitor mount, a tilt-adjustable pedal mount, and a height-adjustable wheel mount.</p>
<p>I had to wait about a month for delivery. In the meantime, as anyone who follows my blog knows, I dove into a 16-hour-a-day marathon testing the new AI agents from Anthropic and OpenAI — check the <a href="/en/tags/vibe-coding/">#vibecoding</a> and <a href="/en/tags/ai/">#agents</a> tags to see everything I built. After about 30 days of that insane marathon, my lower back gave out and I started developing what looks like a herniated disc. I had to see a doctor and take heavy anti-inflammatories.</p>
<p>And right that week, the cockpit decided to arrive.</p>
<h3>The build<span class="hx:absolute hx:-mt-20" id="the-build"></span>
    <a href="#the-build" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>I was in absurd pain, but I built the cockpit anyway. It took an entire day to unbox and assemble the heavy steel pieces with my back screaming, but I did it.</p>
<p>The official assembly video I followed:</p>
              

<div class="embed-container">
  <iframe
    src="https://www.youtube.com/embed/WnocqqhmZas"
    title="YouTube video player"
    frameborder="0"
    allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share"
    referrerpolicy="strict-origin-when-cross-origin"
    allowfullscreen>
  </iframe>
</div>

<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/first-pieces-mounted.jpg" alt="First pieces mounted — the base and the seat structure"  loading="lazy" /></p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/mostly-mounted.jpg" alt="Almost complete structure — seat, wheel mount, pedals"  loading="lazy" /></p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/mostly-mounted-from-front.jpg" alt="Front view of the almost-finished structure"  loading="lazy" /></p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/almost-mounted,-with-monitor.jpg" alt="Almost complete assembly with the monitor mount"  loading="lazy" /></p>
<h3>The McLaren GT3 V2 wheel<span class="hx:absolute hx:-mt-20" id="the-mclaren-gt3-v2-wheel"></span>
    <a href="#the-mclaren-gt3-v2-wheel" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>After assembling the cockpit, I decided that the standard wheel that comes with the CSL GT kit wasn&rsquo;t enough. I upgraded to the <a href="https://www.racingwheelbrasil.com.br/produtos/volante-fanatec-csl-elite-mclaren-gt3-v2-pc-xbox-ps4-ps5-ready/"target="_blank" rel="noopener">Fanatec CSL Elite McLaren GT3 V2</a> (~R$ 4,990). It&rsquo;s a 1:1 scale replica of the McLaren GT3 wheel, with carbon fiber, an OLED display, and compatibility with PC, Xbox, PS4 and PS5.</p>
<p>What I like most about it: it has the normal shift paddles behind it (shift up/down), but it also has two additional analog paddles that can be configured in four different modes. In mode B, which is what I use, the left paddle works as an analog handbrake and the right one as an analog clutch. That&rsquo;s perfect for rally — I can pull the handbrake mid-corner without taking my hand off the wheel. It also has two 2-position toggles, two 12-position rotaries, 7 standard buttons with interchangeable caps, and Fanatec&rsquo;s 7-direction FunkySwitch. It&rsquo;s a complete racing controller.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/mclaren-wheel.jpg" alt="The McLaren GT3 V2 wheel mounted on the cockpit — carbon fiber and OLED display"  loading="lazy" /></p>
<h2>The final setup<span class="hx:absolute hx:-mt-20" id="the-final-setup"></span>
    <a href="#the-final-setup" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>The cockpit ended up in a corner of my bedroom, between the manga shelves (you can spot Akira, Initial D, and 500-something other volumes in the background). I mounted the mini-PC and the PS5 on the cockpit&rsquo;s side structure, together with the eGPU and the RTX 4090. Everything stays permanently connected. That&rsquo;s what makes the difference: I no longer have to set anything up or take it down. I sit, turn it on, and I&rsquo;m driving in 30 seconds.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/fully-mounted-from-side.jpg" alt="Side view of the complete cockpit — between the manga shelves"  loading="lazy" /></p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/all-consoles-from-front.jpg" alt="The consoles mounted on the structure: PS5 with Darkplates, Minisforum UX790 Pro, eGPU with RTX 4090"  loading="lazy" /></p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/close-up-of-the-consoles-from-front.jpg" alt="Close-up of the consoles and cabling"  loading="lazy" /></p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/sitting-angle.jpg" alt="The driver’s perspective — this is what I see when I sit down"  loading="lazy" /></p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/monitor-on-view-from-front.jpg" alt="PS5 showing the game list — Gran Turismo 7 at the top"  loading="lazy" /></p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/another-fully-mounted-shot.jpg" alt="The complete cockpit seen from behind"  loading="lazy" /></p>
<div style="max-width: 100%; margin: 1em 0;">
  <video controls playsinline style="width: 100%; border-radius: 8px;">
    <source src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/from-front-gran-turismo-h264.mp4" type="video/mp4">
  </video>
  <em>Gran Turismo 7 running on the final cockpit</em>
</div>
<h2>The audio system<span class="hx:absolute hx:-mt-20" id="the-audio-system"></span>
    <a href="#the-audio-system" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>To round out the setup I needed dedicated audio. I didn&rsquo;t want to use the monitor&rsquo;s audio (terrible) and I didn&rsquo;t want to be on a headset all the time. The fix was to build a separate audio system with HDMI audio extraction.</p>
<p>The centerpiece is an <a href="https://www.amazon.com.br/dp/B00MNGIP2Y"target="_blank" rel="noopener">HDMI 2.1 switcher from OREI</a> with audio extraction. It has 2 inputs and 1 HDMI output, supports 4K at 120Hz (48 Gbps of bandwidth), and extracts the audio through optical TOSLINK and 3.5mm. I connect the HDMI output of the RTX 4090 to one input and the PS5 to the other. Video goes to the monitor. Audio goes out through the optical port.</p>
<p>The optical audio goes to an <a href="https://www.mercadolivre.com.br/amplificador-de-potncia-aiyima-d03-bluetooth-50-150-watts-cor-preto/p/MLB46172770"target="_blank" rel="noopener">Aiyima D03 amplifier</a>, a compact 2.1 channel amp with 150W per channel, an integrated DAC, and Bluetooth 5.0 with aptX HD. It has optical, coaxial, USB, RCA and Bluetooth inputs. It even has a dedicated subwoofer output for when I get around to adding one. It uses Texas Instruments&rsquo; TAS5624 amplifier chip and has bass and treble control through the remote. For a cockpit setup where you&rsquo;re 1 meter from the speakers, 150W is more than enough.</p>
<p>In practice, I keep the amp at 50% and Windows volume at 50%, and that&rsquo;s already loud as hell. Which is to say: this isn&rsquo;t a &ldquo;good enough&rdquo; little system. It&rsquo;s set up to actually go loud if I want.</p>
<p>The speakers are <a href="https://edifier.com.br/caixa-de-som-passiva-p12-madeira-edifier.html"target="_blank" rel="noopener">Edifier P12</a>, passive, with a 4-inch woofer and a 19mm tweeter. Frequency response from 55Hz to 20kHz, 6 ohm impedance, 20W RMS each. The MDF cabinet with wood finish has a rear bass-reflex port that helps the lows. For passive speakers this size, they deliver well. The mid-range is clean and the highs don&rsquo;t distort even at high volume.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/audio-equipment.jpg" alt="The audio gear — Aiyima D03 box and Edifier P12"  loading="lazy" /></p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/audio-speakers-next-to-other-stuff.jpg" alt="The Edifier P12s positioned on the shelves next to the cockpit, alongside the Akira collection and car miniatures"  loading="lazy" /></p>
<p>The setup logic is: HDMI switcher handles the switching between PS5 and PC, extracts audio to optical, the amp converts and amplifies, and the passive speakers deliver the sound. All without having to touch the monitor or swap cables. I press a button on the switcher and switch consoles.</p>
<p>When I want to play without bothering anyone, I plug my <a href="https://www.mercadolivre.com.br/fones-de-ouvido-meze-audio-109-pro-com-fio-de-madeira-com-en/p/MLB42456685"target="_blank" rel="noopener">Meze 109 Pro</a> directly into the 3.5mm output of the HDMI switcher. The Meze 109 Pro is an open-back headphone with 50mm dynamic drivers, 40 ohm impedance, 112 dB SPL/1mW sensitivity, and 5Hz to 30kHz response. The ear cups are walnut wood with handcrafted finish. It&rsquo;s an audiophile headphone that works perfectly without a dedicated amplifier thanks to the low impedance. The sound is warm, with full bass and rich mids. You can hear every detail of the engines.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/meze-headset.jpg" alt="Meze 109 Pro — walnut wood, audiophile sound"  loading="lazy" /></p>
<p>I haven&rsquo;t decided about a subwoofer yet, but it&rsquo;ll be my next upgrade. A dedicated sub is going to add that low-end weight that makes you feel the engine in your chest.</p>
<h2>The verdict<span class="hx:absolute hx:-mt-20" id="the-verdict"></span>
    <a href="#the-verdict" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>The couch with a stand works. The PC desk with a stand works. But neither comes anywhere close to a dedicated cockpit. The FX1&rsquo;s steel structure doesn&rsquo;t move a millimeter, even with the Fanatec direct drive at max torque. The reclined F1 position is comfortable for sessions of hours. The load cell pedals stay firm on the base. The monitor is exactly at the right height and distance. And best of all: it&rsquo;s always ready. I don&rsquo;t need to assemble anything, take anything down, run cables, none of it. I sit and I drive.</p>
<p>For anyone who&rsquo;s wondering whether it&rsquo;s worth investing in a dedicated cockpit instead of staying with a desk or couch stand: it is. If you already have a direct drive wheel, the cockpit is the missing piece. I spent years thinking &ldquo;this is fine&rdquo; with the stand on the couch. It wasn&rsquo;t fine. The difference in driveability is something else entirely. And for my case — introvert, single-player only, simcade — I couldn&rsquo;t have built it any sooner. To be honest, I think I&rsquo;ve finally landed on the simulator setup that&rsquo;s perfect for my taste.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/monitor-on-view.jpg" alt="The monitor on with the driver’s view"  loading="lazy" /></p>
<div style="max-width: 100%; margin: 1em 0;">
  <video controls playsinline style="width: 100%; border-radius: 8px;">
    <source src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/cockpit/video-testing-h264.mp4" type="video/mp4">
  </video>
  <em>Me driving on the final cockpit</em>
</div>
<h2>Shopping list: how much it all cost<span class="hx:absolute hx:-mt-20" id="shopping-list-how-much-it-all-cost"></span>
    <a href="#shopping-list-how-much-it-all-cost" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Here&rsquo;s the consolidated list of everything in my current setup, with approximate prices (some items were bought in dollars and converted to reais at the exchange rate at the time):</p>
<table>
  <thead>
      <tr>
          <th>Item</th>
          <th style="text-align: right">Estimated Price (R$)</th>
          <th>Link</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://loja.cockpitextremeracing.com.br/products/cockpit-formula-fx1-preto-e-verde?variant=51700876509465"target="_blank" rel="noopener">Cockpit Formula FX1 Black and Green</a></td>
          <td style="text-align: right">~6,290</td>
          <td>Extreme Racing</td>
      </tr>
      <tr>
          <td><a href="https://fanatec.com/us/en/p/sim-racing-bundles/crd-9020007-8nm-us/gran-turismo-dd-pro-8nm-qr2l-us"target="_blank" rel="noopener">Fanatec Gran Turismo DD Pro 8Nm (motor + wheel + pedals + Boost Kit)</a></td>
          <td style="text-align: right">~9,590</td>
          <td>Fanatec / Racing Wheel Brasil</td>
      </tr>
      <tr>
          <td><a href="https://fanatec.com/us/en/p/pedals/csl_p_lc/csl_pedals_lc"target="_blank" rel="noopener">Fanatec CSL Pedals LC (with Load Cell)</a></td>
          <td style="text-align: right">~1,500</td>
          <td>Fanatec</td>
      </tr>
      <tr>
          <td><a href="https://www.racingwheelbrasil.com.br/produtos/volante-fanatec-csl-elite-mclaren-gt3-v2-pc-xbox-ps4-ps5-ready/"target="_blank" rel="noopener">Fanatec CSL Elite McLaren GT3 V2</a></td>
          <td style="text-align: right">~4,990</td>
          <td>Racing Wheel Brasil</td>
      </tr>
      <tr>
          <td><a href="https://fanatec.com/us/en/p/add-ons/crd-9040002-ww/clubsport-shifter-sq"target="_blank" rel="noopener">Fanatec ClubSport Shifter SQ V1.5</a></td>
          <td style="text-align: right">~2,500</td>
          <td>Fanatec</td>
      </tr>
      <tr>
          <td><a href="https://www.minisforum.com/products/UX790-Pro.html"target="_blank" rel="noopener">Minisforum UX790 Pro</a></td>
          <td style="text-align: right">~5,000</td>
          <td>Minisforum</td>
      </tr>
      <tr>
          <td><a href="https://www.minisforum.com/products/minisforum-deg1-egpu-dock"target="_blank" rel="noopener">Minisforum DEG1 eGPU Dock</a> + RTX 4090</td>
          <td style="text-align: right">~12,000</td>
          <td>Minisforum / bought separately</td>
      </tr>
      <tr>
          <td>PlayStation 5 + <a href="https://dbrand.com/shop/limited-edition/ps5"target="_blank" rel="noopener">dbrand Darkplates</a></td>
          <td style="text-align: right">~4,500</td>
          <td>Sony / dbrand</td>
      </tr>
      <tr>
          <td><a href="https://www.samsung.com/br/monitors/gaming/odyssey-oled-g8-g81sf-32-inch-240hz-oled-uhd-ls32fg810snxzd/"target="_blank" rel="noopener">Samsung Odyssey OLED G8 32&quot;</a></td>
          <td style="text-align: right">~2,500</td>
          <td>Samsung</td>
      </tr>
      <tr>
          <td><a href="https://www.amazon.com.br/dp/B00MNGIP2Y"target="_blank" rel="noopener">OREI BK-21A HDMI 2.1 Switcher 2x1 with audio extraction</a></td>
          <td style="text-align: right">~450</td>
          <td>Amazon</td>
      </tr>
      <tr>
          <td><a href="https://www.mercadolivre.com.br/amplificador-de-potncia-aiyima-d03-bluetooth-50-150-watts-cor-preto/p/MLB46172770"target="_blank" rel="noopener">Aiyima D03 Amplifier</a></td>
          <td style="text-align: right">~900</td>
          <td>Mercado Livre</td>
      </tr>
      <tr>
          <td><a href="https://edifier.com.br/caixa-de-som-passiva-p12-madeira-edifier.html"target="_blank" rel="noopener">Edifier P12 (pair)</a></td>
          <td style="text-align: right">~799</td>
          <td>Edifier</td>
      </tr>
      <tr>
          <td><a href="https://www.mercadolivre.com.br/fones-de-ouvido-meze-audio-109-pro-com-fio-de-madeira-com-en/p/MLB42456685"target="_blank" rel="noopener">Meze 109 Pro</a></td>
          <td style="text-align: right">~5,390</td>
          <td>Mercado Livre / Heinrich Audio</td>
      </tr>
      <tr>
          <td>Cables (HDMI 2.1, optical, 3.5mm, power)</td>
          <td style="text-align: right">~300</td>
          <td>Various</td>
      </tr>
      <tr>
          <td><strong>TOTAL ESTIMATED</strong></td>
          <td style="text-align: right"><strong>~56,709</strong></td>
          <td></td>
      </tr>
  </tbody>
</table>
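<p>For the curious, a quick sanity check that the line items really do add up to the stated total (all numbers copied straight from the table above; the short labels are mine):</p>

```python
# Approximate prices in R$, as listed in the table above.
prices = {
    "Cockpit Formula FX1": 6290,
    "Fanatec GT DD Pro 8Nm bundle": 9590,
    "CSL Pedals LC": 1500,
    "CSL Elite McLaren GT3 V2": 4990,
    "ClubSport Shifter SQ V1.5": 2500,
    "Minisforum UX790 Pro": 5000,
    "DEG1 dock + RTX 4090": 12000,
    "PS5 + Darkplates": 4500,
    "Odyssey OLED G8 32\"": 2500,
    "OREI HDMI 2.1 switcher": 450,
    "Aiyima D03": 900,
    "Edifier P12 (pair)": 799,
    "Meze 109 Pro": 5390,
    "Cables": 300,
}

total = sum(prices.values())
print(f"R$ ~{total:,}")  # prints: R$ ~56,709
```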
<p>Yes, almost R$ 57k is a lot of money. I worked like a dog for decades. Now that I&rsquo;ve managed to retire honestly, my family is well taken care of, I have no debts, and I can finally give myself something I always wanted as a kid but couldn&rsquo;t afford. When I sat at those OutRun and Daytona USA cabinets at the arcade, I dreamed of having something like this at home. It took 30-something years, but I got there.</p>
<p>And if you add up the years of kludges, stands that didn&rsquo;t work, 15-meter HDMI cables, 3D prints, machined steel plates, and the frustration of mounting and unmounting everything — a dedicated cockpit saves your sanity. Unlike a PC that depreciates fast, a steel cockpit lasts decades.</p>
]]></content:encoded><category>gaming</category><category>sim-racing</category><category>cockpit</category><category>fanatec</category><category>gran-turismo</category><category>initial-d</category></item><item><title>Claude Code's Source Code Leaked. Here's What We Found Inside.</title><link>https://www.akitaonrails.com/en/2026/03/31/claude-code-source-code-leaked-what-we-found-inside/</link><guid isPermaLink="true">https://www.akitaonrails.com/en/2026/03/31/claude-code-source-code-leaked-what-we-found-inside/</guid><pubDate>Tue, 31 Mar 2026 17:00:00 GMT</pubDate><description>&lt;blockquote&gt;
&lt;p&gt;Updated April 2, 2026: if you already read this yesterday, jump straight to the &lt;a href="#update-2026-04-02"&gt;new update section&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This morning (March 31, 2026), security researcher &lt;a href="https://x.com/Fried_rice"target="_blank" rel="noopener"&gt;Chaofan Shou&lt;/a&gt; discovered that the entire source code for Claude Code, Anthropic&amp;rsquo;s official CLI for AI coding, was sitting there for anyone to grab on the public npm registry. 512,000 lines of TypeScript. 1,900 files. All of it exposed through a 59.8 MB source map file accidentally bundled into version 2.1.88 of the &lt;code&gt;@anthropic-ai/claude-code&lt;/code&gt; package.&lt;/p&gt;</description><content:encoded><![CDATA[<blockquote>
  <p>Updated April 2, 2026: if you already read this yesterday, jump straight to the <a href="#update-2026-04-02">new update section</a>.</p>

</blockquote>
<p>This morning (March 31, 2026), security researcher <a href="https://x.com/Fried_rice"target="_blank" rel="noopener">Chaofan Shou</a> discovered that the entire source code for Claude Code, Anthropic&rsquo;s official CLI for AI coding, was sitting there for anyone to grab on the public npm registry. 512,000 lines of TypeScript. 1,900 files. All of it exposed through a 59.8 MB source map file accidentally bundled into version 2.1.88 of the <code>@anthropic-ai/claude-code</code> package.</p>
<p>Within hours the code had been mirrored on GitHub, picked apart by thousands of developers, and Anthropic had put out a statement calling it &ldquo;human error in release packaging, not a security breach.&rdquo; Which is technically true and ignores that the result is the same.</p>
<p><img src="https://raw.githubusercontent.com/kuberwastaken/claude-code/main/public/leak-tweet.png" alt="Tweet announcing the leak"  loading="lazy" /></p>
<p>I use Claude Code every day. Some of the articles you read here, I wrote with it. So I figured I&rsquo;d take a look at what&rsquo;s inside. I actually started writing this piece in Claude Code itself, but my Max plan ran out before I finished. I closed the rest in Codex.</p>
<h2>How the leak happened<span class="hx:absolute hx:-mt-20" id="how-the-leak-happened"></span>
    <a href="#how-the-leak-happened" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Claude Code is bundled with <a href="https://bun.sh/"target="_blank" rel="noopener">Bun</a>, the JavaScript runtime Anthropic acquired in late 2024. When you build with Bun, source maps are generated by default. Those <code>.map</code> files contain the full original source code, not just mappings. Every file, every comment, every internal constant, every system prompt.</p>
<p>The initial theory was that a <a href="https://github.com/oven-sh/bun/issues/28001"target="_blank" rel="noopener">known Bun bug</a> had caused the leak: even with <code>development: false</code>, source maps were still being served and bundled in. But <a href="https://github.com/oven-sh/bun/issues/28001#issuecomment-4164447815"target="_blank" rel="noopener">Jarred Sumner</a>, Bun&rsquo;s creator, shut that down: &ldquo;This has nothing to do with claude code. This is with Bun&rsquo;s frontend development server. Claude Code is not a frontend app. It is a TUI. It doesn&rsquo;t use Bun.serve() to compile a single-file executable.&rdquo; In other words, the Bun bug affects the frontend dev server, not the build process that generated the Claude Code npm package.</p>
<p>What actually happened is simpler: somebody at Anthropic forgot to add <code>*.map</code> to <code>.npmignore</code> or didn&rsquo;t configure the bundler to skip source map generation in production builds. And worse: according to <a href="https://www.theregister.com/2026/03/31/anthropic_claude_code_source_code/"target="_blank" rel="noopener">The Register</a>, the source map didn&rsquo;t just point to the original files, it referenced a ZIP hosted on Anthropic&rsquo;s own Cloudflare R2 bucket. npm happily served it to anyone running <code>npm pack</code>, and the rest was mirroring work.</p>
<p><img src="https://raw.githubusercontent.com/kuberwastaken/claude-code/main/public/claude-files.png" alt="Source files exposed in the npm package"  loading="lazy" /></p>
<p>The irony is that the code contains a whole system called &ldquo;Undercover Mode&rdquo; built specifically to prevent Anthropic&rsquo;s internal information from leaking in commits and PRs. They built a subsystem to stop the AI from revealing internal codenames, and then a source map exposed everything.</p>
<h2>What&rsquo;s inside: the hidden features<span class="hx:absolute hx:-mt-20" id="whats-inside-the-hidden-features"></span>
    <a href="#whats-inside-the-hidden-features" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>The source code reveals 44 feature flags covering functionality that&rsquo;s ready but not yet shipped. This isn&rsquo;t vaporware. It&rsquo;s real code hiding behind flags that compile to <code>false</code> in external builds. Let me highlight the most interesting ones.</p>
<h3>KAIROS: a Claude that never stops<span class="hx:absolute hx:-mt-20" id="kairos-a-claude-that-never-stops"></span>
    <a href="#kairos-a-claude-that-never-stops" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Inside the <code>assistant/</code> directory, there&rsquo;s a mode called KAIROS, a persistent assistant that doesn&rsquo;t wait for you to type. It observes, logs, and acts proactively on things it notices. Maintains append-only daily log files, receives <code>&lt;tick&gt;</code> prompts at regular intervals to decide whether to act or stay quiet, and has a 15-second budget: any proactive action that would block the user&rsquo;s workflow for more than 15 seconds gets deferred.</p>
<p>KAIROS-exclusive tools: <code>SendUserFile</code> (sends files to the user), <code>PushNotification</code> (push notifications), <code>SubscribePR</code> (monitors pull requests). None of this exists in the public build.</p>
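<p>The tick-and-budget loop described above is easy to picture in miniature. Here&rsquo;s a hedged sketch: every name and shape below is my assumption, not the leaked implementation.</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code>// Hypothetical sketch of the KAIROS tick decision; names are assumptions.
type TickDecision = { act: boolean; estimatedSeconds: number };

const PROACTIVE_BUDGET_SECONDS = 15;

function onTick(decision: TickDecision): "run" | "defer" | "skip" {
  if (!decision.act) return "skip"; // stay quiet this tick
  // any proactive action that would block the user's workflow for longer
  // than the 15-second budget gets deferred instead of run immediately
  return decision.estimatedSeconds > PROACTIVE_BUDGET_SECONDS ? "defer" : "run";
}</code></pre></div>
</div>
<p>The interesting design choice is that &ldquo;do nothing&rdquo; is the default outcome of most ticks; acting is the exception that has to fit a budget.</p>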
<h3>BUDDY: a Tamagotchi in the terminal<span class="hx:absolute hx:-mt-20" id="buddy-a-tamagotchi-in-the-terminal"></span>
    <a href="#buddy-a-tamagotchi-in-the-terminal" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>I&rsquo;m not making this up. Claude Code has a complete pet companion system in the Tamagotchi style called &ldquo;Buddy.&rdquo; A deterministic gacha system with 18 species, rarity, shiny variants, procedurally generated stats, and a &ldquo;soul&rdquo; written by Claude on the first hatch.</p>
<p>The species is determined by a Mulberry32 PRNG seeded by the hash of the userId. Same user always gets the same buddy. There are 5 stats (DEBUGGING, PATIENCE, CHAOS, WISDOM, SNARK), 6 eye styles, 8 hat options, and sprites rendered as 5-line ASCII art with animations. The code references April 1-7, 2026 as the teaser window, with full launch slated for May 2026.</p>
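<p>Mulberry32 is a tiny public-domain PRNG, so the deterministic gacha is easy to sketch. The species table size and seeding scheme below follow the description above, but the string hash and wiring are illustrative, not the leaked code:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code>// Mulberry32: a small deterministic PRNG, seeded once per user
function mulberry32(seed: number): () => number {
  let a = seed;
  return () => {
    a |= 0;
    a = (a + 0x6D2B79F5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Illustrative 32-bit string hash (FNV-1a); the real hash is not known
function hashUserId(userId: string): number {
  let h = 0x811c9dc5;
  for (const ch of userId) {
    h ^= ch.codePointAt(0)!;
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
}

const SPECIES_COUNT = 18;

function buddySpecies(userId: string): number {
  const rng = mulberry32(hashUserId(userId));
  return Math.floor(rng() * SPECIES_COUNT); // same user, same buddy, always
}</code></pre></div>
</div>
<p>Because the seed is a pure function of the userId, there&rsquo;s no server-side state to store: the &ldquo;gacha&rdquo; is fully reproducible on any machine.</p>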
<h3>ULTRAPLAN: 30 minutes of remote planning<span class="hx:absolute hx:-mt-20" id="ultraplan-30-minutes-of-remote-planning"></span>
    <a href="#ultraplan-30-minutes-of-remote-planning" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>ULTRAPLAN offloads complex planning tasks to a remote session running Opus 4.6, gives it up to 30 minutes to think, and lets you approve the result through the browser. The terminal shows polling every 3 seconds, and once approved, a sentinel value <code>__ULTRAPLAN_TELEPORT_LOCAL__</code> &ldquo;teleports&rdquo; the result back into the local terminal.</p>
<h3>Multi-Agent: &ldquo;Coordinator Mode&rdquo;<span class="hx:absolute hx:-mt-20" id="multi-agent-coordinator-mode"></span>
    <a href="#multi-agent-coordinator-mode" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>The multi-agent orchestration system in the <code>coordinator/</code> directory turns Claude Code from a single agent into a coordinator that spawns, directs and manages multiple workers in parallel. Parallel research, synthesis by the coordinator, implementation by workers, verification by workers. The prompt teaches parallelism explicitly and forbids lazy delegation: &ldquo;Do NOT say &lsquo;based on your findings&rsquo; - read the actual findings and specify exactly what to do.&rdquo;</p>
<p>And there&rsquo;s more. The leak also shows in-process teammates with <code>AsyncLocalStorage</code> to isolate context, workers in separate processes via tmux/iTerm2 panes, memory synchronization between agents, and flags ready for <code>BRIDGE_MODE</code>, <code>VOICE_MODE</code>, <code>WORKFLOW_SCRIPTS</code>, <code>AFK mode</code>, <code>advisor-tool</code> and <code>history snipping</code>. None of that guarantees a launch, but it suggests a roadmap that&rsquo;s a lot further along than the public version lets on.</p>
<h2>The memory architecture<span class="hx:absolute hx:-mt-20" id="the-memory-architecture"></span>
    <a href="#the-memory-architecture" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>The memory system caught my eye. It&rsquo;s not a &ldquo;store everything and retrieve&rdquo; approach. It&rsquo;s a three-layer architecture:</p>
<p><a href="https://x.com/himanshustwts/status/2038924027411222533"target="_blank" rel="noopener"><img src="https://new-uploads-akitaonrails.s3.amazonaws.com/2026/03/31/claude-code-memory-architecture.jpg" alt="Summary of Claude Code’s memory architecture"  loading="lazy" /></a></p>
<p><code>MEMORY.md</code> is a lightweight pointer index (~150 characters per line) that stays permanently loaded in the context. It doesn&rsquo;t store data, it stores locations. The actual knowledge lives distributed across &ldquo;topic files&rdquo; fetched on demand. Raw transcripts are never reloaded into the context whole, only searched with grep for specific identifiers.</p>
<p>And it comes with an important discipline: the system writes to the topic file first and only then updates the index. <code>MEMORY.md</code> doesn&rsquo;t become a fact dump. It stays a map. If you let the index turn into storage, it pollutes the permanent context and degrades the whole system.</p>
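<p>That write-then-index discipline is the part worth stealing for your own agent setups. A minimal sketch of the idea, with data shapes assumed purely for illustration:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code>// Topic files hold the actual knowledge; the index holds only pointers.
const topicFiles: { [name: string]: string } = {};
const memoryIndex: string[] = []; // MEMORY.md: one short pointer line per topic

function remember(topic: string, fact: string): void {
  // 1. write the fact to the topic file first...
  topicFiles[topic] = (topicFiles[topic] ?? "") + fact + "\n";
  // 2. ...and only then update the index, with a location, never the data
  const pointer = `${topic}: see topics/${topic}.md`;
  if (!memoryIndex.includes(pointer)) memoryIndex.push(pointer);
}</code></pre></div>
</div>
<p>The failure mode the source warns about is exactly the inverse order: write the fact into the index &ldquo;for now,&rdquo; and the permanent context slowly turns into storage.</p>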
<p>The &ldquo;Dream&rdquo; system (<code>services/autoDream/</code>) is a memory consolidation engine that runs as a background subagent. The name is intentional. It&rsquo;s Claude dreaming.</p>
<p>The dream has a three-gate trigger: 24 hours since the last dream, at least 5 sessions since the last dream, and acquiring a consolidation lock (preventing concurrent dreams). All three have to pass.</p>
<p>When it runs, it follows four phases: Orient (ls in the memory directory, read the index), Gather (look for new signals in logs, stale memories, transcripts), Consolidate (write or update topic files, convert relative dates to absolute, delete contradicted facts), and Prune (keep the index under 200 lines and ~25KB).</p>
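<p>The three gates compose into a single boolean check. A sketch under assumed field names:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code>// Three-gate trigger for a "dream" run; field names are assumptions.
interface DreamState {
  lastDreamAt: number;       // epoch milliseconds
  sessionsSinceDream: number;
  lockAcquired: boolean;     // consolidation lock, blocks concurrent dreams
}

const DAY_MS = 24 * 60 * 60 * 1000;

function canDream(s: DreamState, now: number): boolean {
  if (now - s.lastDreamAt >= DAY_MS)  // gate 1: 24 hours elapsed
    if (s.sessionsSinceDream >= 5)    // gate 2: enough new sessions
      return s.lockAcquired;          // gate 3: lock held
  return false;
}</code></pre></div>
</div>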
<p>There are four memory types: <code>user</code> (user profile), <code>feedback</code> (corrections and confirmations), <code>project</code> (context about ongoing work), <code>reference</code> (pointers to external systems). The taxonomy explicitly excludes things derivable from the code (patterns, architecture, git history, file structure).</p>
<p>The dream subagent gets read-only bash. It can look at the project but can&rsquo;t modify anything. It&rsquo;s purely a consolidation pass.</p>
<p>And there&rsquo;s another detail I found elegant: memory isn&rsquo;t treated as truth. It&rsquo;s treated as a hint. The system assumes that memory may be stale, wrong or contradicted, so the model still has to verify before trusting it. That&rsquo;s the opposite of the fantasy of &ldquo;throw everything in a vector database and let the magic happen.&rdquo;</p>
<h2>&ldquo;Undercover Mode&rdquo;<span class="hx:absolute hx:-mt-20" id="undercover-mode"></span>
    <a href="#undercover-mode" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Anthropic employees (identified by <code>USER_TYPE === 'ant'</code>) use Claude Code on public and open source repositories. Undercover Mode (<code>utils/undercover.ts</code>) prevents the AI from accidentally revealing internal information in commits and PRs.</p>
<p>When active, it injects into the system prompt:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code>## UNDERCOVER MODE - CRITICAL

You are operating UNDERCOVER in a PUBLIC/OPEN-SOURCE repository. Your commit
messages, PR titles, and PR bodies MUST NOT contain ANY Anthropic-internal
information. Do not blow your cover.

NEVER include in commit messages or PR descriptions:
- Internal model codenames (animal names like Capybara, Tengu, etc.)
- Unreleased model version numbers (e.g., opus-4-7, sonnet-4-8)
- Internal repo or project names
- Internal tooling, Slack channels, or short links
- The phrase &#34;Claude Code&#34; or any mention that you are an AI
- Co-Authored-By lines or any other attribution</code></pre></div>
<p>There&rsquo;s no way to turn it off. If the system isn&rsquo;t sure it&rsquo;s in an internal repository, it stays in undercover mode. That confirms something kind of uncomfortable: Anthropic uses Claude Code to contribute to open source, and the agent is instructed to hide that it&rsquo;s an AI.</p>
<p>The internal codenames are animal names: Tengu (codename for the Claude Code project), Fennec (Opus), Capybara, Numbat (in testing). &ldquo;Fast Mode&rdquo; is internally called &ldquo;Penguin Mode&rdquo; with endpoint <code>claude_code_penguin_mode</code> and kill-switch <code>tengu_penguins_off</code>.</p>
<h2>The most paranoid parts<span class="hx:absolute hx:-mt-20" id="the-most-paranoid-parts"></span>
    <a href="#the-most-paranoid-parts" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>There&rsquo;s a piece of the analysis I almost missed because I was looking more at the hidden features. But maybe the most revealing thing about Anthropic&rsquo;s mindset is in the defense mechanisms against copying and abuse.</p>
<p>According to <a href="https://alex000kim.com/posts/2026-03-31-claude-code-source-leak/"target="_blank" rel="noopener">Alex Kim&rsquo;s analysis</a>, there&rsquo;s an anti-distillation mode that can ask the server to inject fake tools into the system prompt. The idea is to poison traffic recorded by anyone trying to distill Claude Code&rsquo;s behavior to train a competitor. There&rsquo;s also a second mechanism for summarizing connector text, cryptographically signed, so that part of the observable traffic doesn&rsquo;t match the original raw reasoning. It&rsquo;s not perfect protection. It&rsquo;s another layer of friction. But it shows the company is explicitly thinking about copy-by-observation, not just traditional security.</p>
<p>And there&rsquo;s the more aggressive part: client attestation. Every request includes a billing header with a placeholder <code>cch=00000</code>, and the Bun native runtime replaces that with a hash computed below the JavaScript layer. In other words, looking like Claude Code isn&rsquo;t enough. The binary tries to prove that it is Claude Code. That helps explain why the fight with third-party tools like OpenCode got so sensitive: it wasn&rsquo;t just a commercial or legal issue. There was technical enforcement built into the transport.</p>
<h3>Update: the DRM died in less than 24 hours<span class="hx:absolute hx:-mt-20" id="update-the-drm-died-in-less-than-24-hours"></span>
    <a href="#update-the-drm-died-in-less-than-24-hours" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Remember when I mentioned client attestation as &ldquo;the most aggressive part&rdquo;? Yeah. It lasted less than a day.</p>
<p>For context: Anthropic had been waging a war against third-party tools since January 2026. First came the server-side blocking of OAuth tokens from non-official clients. Then in March, the <a href="https://github.com/anomalyco/opencode"target="_blank" rel="noopener">OpenCode</a> maintainer merged a <a href="https://github.com/anomalyco/opencode/issues/7456"target="_blank" rel="noopener">PR</a> removing all Claude authentication from the project. The commit message was two words: &ldquo;anthropic legal requests.&rdquo; <a href="https://www.theregister.com/2026/02/20/anthropic_clarifies_ban_third_party_claude_access/"target="_blank" rel="noopener">The Register reported</a> that Anthropic updated its terms of service to make it explicit that OAuth tokens from Pro/Max subscriptions can only be used in the official Claude Code and on Claude.ai. People paying $100-200/month for Max who wanted to use the tool of their choice were left holding the bag.</p>
<p>The technical mechanism behind the block was the <code>cch=</code> header. With the leaked code you can see the system has two parts. The first is a version suffix: the <code>cc_version</code> field includes 3 hex characters derived from the user&rsquo;s first message via SHA-256, using a 12-character salt embedded in the JavaScript. The second is the body hash itself: the entire body of the request (messages, tools, metadata, model, thinking config, everything) is serialized as compact JSON with the <code>cch=00000</code> placeholder, then hashed with <a href="https://github.com/Cyan4973/xxHash"target="_blank" rel="noopener">xxHash64</a> using a fixed seed. The result is masked with <code>0xFFFFF</code> (20 bits) and formatted as 5 lowercase hex characters. The placeholder is replaced with the computed hash before the request leaves the process.</p>
<p>The detail that makes the difference: that substitution happens inside the Bun native runtime, written in Zig, below the JavaScript layer. Bun literally mutates the JavaScript string in-place, overwriting the <code>00000</code> bytes in the string buffer with the computed hash. If you ran the same bundle in Node or in a stock Bun, the placeholder would go to the server as-is and the request would be rejected.</p>
<p>And then the leak happened. With the source code exposed, <a href="https://x.com/StraughterG/status/2039344027556798476"target="_blank" rel="noopener">@StraughterG</a> (Jay Guthrie) announced that same night: &ldquo;Yesterday I said Anthropic&rsquo;s compiled Zig cch= hash was banning 3rd-party Claude clients. Tonight, the DRM is dead. We extracted the algorithm from the binary. It&rsquo;s not advanced cryptography. It&rsquo;s a static xxHash64 seed.&rdquo;</p>
<p>The seed is <code>0x6E52736AC806831E</code>. The <a href="https://a10k.co/b/reverse-engineering-claude-code-cch.html"target="_blank" rel="noopener">full algorithm</a>, as he explained in a thread of tweets, fits in a few lines of TypeScript:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-typescript" data-lang="typescript">import xxhash from &#34;xxhash-wasm&#34;;

const { h64Raw } = await xxhash();
// serialize the request with cch=00000 still in the placeholder position
const body = JSON.stringify(request);
// xxHash64 with the static seed extracted from the binary
const hash = h64Raw(new TextEncoder().encode(body), 0x6E52736AC806831En);
// mask to 20 bits and format as 5 lowercase hex characters
const cch = (hash &amp; 0xFFFFFn).toString(16).padStart(5, &#34;0&#34;);
// then replace cch=00000 with cch={computed value}</code></pre></div></div>
</div>
<p><a href="https://x.com/paoloanzn/status/2039348588741087341"target="_blank" rel="noopener">@paoloanzn</a> celebrated: &ldquo;we cracked it. the cch= signing system in claude code is fully reverse engineered.&rdquo; And he immediately put the bypass in <a href="https://github.com/paoloanzn/free-code"target="_blank" rel="noopener">free-code</a>, a fork of Claude Code with telemetry removed, system prompt guardrails stripped, and all 54 experimental feature flags unlocked.</p>
<p><a href="https://github.com/paoloanzn/free-code"target="_blank" rel="noopener"><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/2026/04/01/free-code-screenshot.png" alt="Screenshot of free-code running with experimental features"  loading="lazy" /></a></p>
<p>The technical point that matters: xxHash64 is not cryptography. It&rsquo;s a checksum hash designed for speed, not security. The seed is static, embedded in the binary. It changes with each version of Claude Code, but within a version it&rsquo;s the same for everyone. The &ldquo;security&rdquo; depended entirely on nobody being able to extract the seed from the compiled Zig binary. With the source code leaked, that obscurity evaporated in hours.</p>
<p>Now any third-party client — OpenCode, Claw-Code, whatever — can intercept the <code>fetch()</code>, hash the body with the correct seed, and pass the server&rsquo;s validation as if it were the official Claude Code. The barrier Anthropic built to protect its $2.5 billion ARR business model was, in the end, security by obscurity over a non-cryptographic hash.</p>
<p>The third paranoid detail is small but says a lot about a real product in production: the system detects user frustration with regex. Yes, regex. Profanity, insults, &ldquo;this sucks,&rdquo; that kind of thing. It&rsquo;s funny to see an LLM company doing sentiment analysis with <code>wtf|ffs|shit</code>, but it&rsquo;s also the kind of pragmatic solution you reach for when you need a cheap immediate answer, not conceptual elegance.</p>
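<p>For flavor, the detection really can be this small. An illustrative version (the actual pattern in the leaked source is longer and I&rsquo;m not reproducing it verbatim):</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code>// Regex-based "sentiment analysis"; the pattern here is illustrative.
const FRUSTRATION = /\b(wtf|ffs|shit|ugh|this sucks)\b/i;

function userSeemsFrustrated(message: string): boolean {
  return FRUSTRATION.test(message);
}</code></pre></div>
</div>
<p>It&rsquo;s O(1)-ish, runs locally, never calls a model, and catches the cases that matter most. Cheap beats elegant when the signal is this crude anyway.</p>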
<h2>What the code reveals about how you use Claude Code<span class="hx:absolute hx:-mt-20" id="what-the-code-reveals-about-how-you-use-claude-code"></span>
    <a href="#what-the-code-reveals-about-how-you-use-claude-code" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p><a href="https://x.com/iamfakeguru/status/2038965567269249484"target="_blank" rel="noopener">@iamfakeguru</a> compiled a thread with seven technical findings from the code that any user should know:</p>
<p>Claude Code has a 2,000-line cap per file read. When you ask it to read a larger file, it silently truncates. Tool results are cut off at 50,000 characters. The context window compression system drops old messages to fit more new context. And there&rsquo;s a difference between the access level of Anthropic employees (<code>USER_TYPE === 'ant'</code>) and public access: internal tools like <code>ConfigTool</code> and <code>TungstenTool</code> are invisible in the external build.</p>
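<p>Those limits are worth internalizing because they fail silently. In effect, the behavior is equivalent to something like this (a sketch of the caps as described, not the leaked code):</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code>const MAX_READ_LINES = 2000;    // per-file read cap
const MAX_TOOL_RESULT = 50000;  // characters per tool result

// Reading a larger file silently returns only the first 2,000 lines
function truncateFileRead(contents: string): string {
  return contents.split("\n").slice(0, MAX_READ_LINES).join("\n");
}

// Tool results are silently cut at 50,000 characters
function truncateToolResult(result: string): string {
  return result.slice(0, MAX_TOOL_RESULT);
}</code></pre></div>
</div>
<p>No warning, no error. If your prompt assumes the model saw the whole file, you&rsquo;re the one who has to ask it to read in chunks.</p>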
<p>The most useful finding from the thread is how Anthropic employees work around the limitations external users face. The code reveals that <code>USER_TYPE === 'ant'</code> unlocks internal tools, exclusive beta headers (<code>cli-internal-2026-02-09</code>), access to staging (<code>claude-ai.staging.ant.dev</code>), and a <code>ConfigTool</code> that lets you change configurations at runtime. External builds compile all of this to <code>false</code> via dead code elimination.</p>
<p>But the point that matters is: the CLAUDE.md you put at the root of your project gets read in full by Claude Code and injected into the system prompt. It&rsquo;s literally the place where you control how the agent behaves. <a href="https://x.com/iamfakeguru/status/2038965567269249484"target="_blank" rel="noopener">@iamfakeguru</a> published a complete override with 10 mechanical rules, and then put the whole file in a separate repository: <a href="https://github.com/iamfakeguru/claude-md"target="_blank" rel="noopener">iamfakeguru/claude-md</a>.</p>
<p><a href="https://github.com/iamfakeguru/claude-md/blob/main/CLAUDE.md"target="_blank" rel="noopener"><img src="https://new-uploads-akitaonrails.s3.amazonaws.com/2026/03/31/claude-md-production-grade-agent-directives.png" alt="Screenshot of the CLAUDE.md published by fakeguru"  loading="lazy" /></a></p>
<p>I&rsquo;m not going to paste the whole block here. What matters is the content: it forces post-edit verification (<code>tsc</code> and <code>eslint</code> before declaring success), it imposes re-reading files before editing, it requires reading large files in chunks, it assumes silent truncation of long results, and it tells you to break larger work into phases or parallel subagents. In other words: it turns into explicit rules everything that external users were having to figure out empirically.</p>
<p>These aren&rsquo;t magic instructions. They&rsquo;re guardrails. The difference is that now we know which limits the system actually has and we can write a CLAUDE.md that works with them, not against them.</p>
<h2>Cache bugs that cost real money<span class="hx:absolute hx:-mt-20" id="cache-bugs-that-cost-real-money"></span>
    <a href="#cache-bugs-that-cost-real-money" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p><a href="https://x.com/altryne/status/2038676458026189225"target="_blank" rel="noopener">@altryne</a> (Alex Volkov) reported cache invalidation bugs that make uncached tokens cost 10-20x more than cached ones. There are two bugs: a string substitution bug in Bun that affects the standalone CLI (workaround: use <code>npx @anthropic-ai/claude-code</code> instead of the installed binary), and another in the <code>--resume</code> flag that breaks the cache with no known workaround. Over 500 users reported similar quota exhaustion issues. If you&rsquo;ve been feeling like Claude Code was burning tokens faster than expected over the past few days, it probably wasn&rsquo;t your imagination.</p>
<h2>&ldquo;Staff engineer spaghetti&rdquo;<span class="hx:absolute hx:-mt-20" id="staff-engineer-spaghetti"></span>
    <a href="#staff-engineer-spaghetti" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>The code analysis revealed real problems. A comment in the source itself admits: &ldquo;1,279 sessions had 50+ consecutive failures (up to 3,272) in a single session, wasting ~250K API calls per day globally.&rdquo; The fix was three lines: limit consecutive failures to three before disabling compaction.</p>
<p>The <code>print.ts</code> file has 5,594 lines with a single 3,167-line function containing twelve levels of nesting. <code>main.tsx</code> is 803,924 bytes in a single file. <code>interactiveHelpers.tsx</code> is 57,424 bytes. These are files no human can review with confidence.</p>
<p>The most viral reaction came from <a href="https://x.com/thekitze/status/2038956521942577557"target="_blank" rel="noopener">@thekitze</a>: he asked GPT-5.4 to evaluate the codebase and the score came in at 6.5/10. The description: &ldquo;This is not junior spaghetti. This is staff-engineer spaghetti: performance-aware, feature-flagged, telemetry-instrumented, surgically optimized spaghetti.&rdquo; In other words, it isn&rsquo;t bad code from inexperience. It&rsquo;s bad code from pressure to ship fast without paying the cost of cleaning up afterwards.</p>
<p><a href="https://x.com/thekitze/status/2038986445839622405"target="_blank" rel="noopener">@thekitze</a> also elaborated in another thread on how the code shows a lack of basic engineering practices. And this is where I feel vindicated.</p>
<p>I&rsquo;ve been repeating in several posts on <a href="/en/tags/vibe-coding/">vibe coding</a> that speed without discipline produces exactly this. The principles I defend (small increments, tests at every step, review before committing, continuous refactoring, CI that rejects high cyclomatic complexity) are the same Extreme Programming principles that have worked since the early 2000s. Anthropic apparently didn&rsquo;t follow any of them on their own product.</p>
<p>A 3,167-line function with 12 levels of nesting isn&rsquo;t something that appears overnight. It&rsquo;s accumulation. It&rsquo;s the result of dozens of additions where nobody stopped to refactor because &ldquo;it works, don&rsquo;t touch it.&rdquo; It&rsquo;s the classic anti-pattern of vibe coding without discipline: generate code with AI, see that it compiles, commit, repeat. Without rigorous review. Without complexity limits in CI. Without the basic rule that if a function exceeds 50 lines, it needs to be broken up.</p>
<p>The irony is that Anthropic sells the most popular vibe coding tool on the market and doesn&rsquo;t practice what I call responsible vibe coding. Claude Code is worth $2.5 billion in ARR. The code that generates that revenue is rated 6.5/10.</p>
<h2>The &ldquo;clean room&rdquo; question<span class="hx:absolute hx:-mt-20" id="the-clean-room-question"></span>
    <a href="#the-clean-room-question" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>With the entire source code public, there&rsquo;s a serious legal and competitive implication. And here I think a lot of people started using the term &ldquo;clean room&rdquo; with a lightness that doesn&rsquo;t fit the subject.</p>
<p>A real clean room isn&rsquo;t just &ldquo;I rewrote it in another language&rdquo; or &ldquo;I didn&rsquo;t copy and paste.&rdquo; The classic model is much more annoying: one group studies the original and produces a functional spec; another group, isolated, implements from that spec without ever seeing the original code. The whole point is to reduce contamination risk.</p>
<p><a href="https://x.com/braelyn_ai/status/2039025584626397491"target="_blank" rel="noopener">@braelyn_ai</a> raised another interesting point: with generative tools, somebody could try a &ldquo;clean room rebuild&rdquo; using tests, observable behavior and documentation, without reusing the original implementation. In theory, that makes sense. In practice, what shows up in the heat of a leak usually lands in a much grayer zone.</p>
<p>The <a href="https://github.com/ultraworkers/claw-code"target="_blank" rel="noopener">Claw-Code</a> case illustrates this well. The project presents itself as an independent rewrite and has already shifted focus to Python and Rust, but the README itself admits direct study of the exposed code and even mentions a parity audit against a local archive. So I wouldn&rsquo;t call that a clean room in the strictest classic sense. I&rsquo;d call it an inspired reimplementation, with a deliberate attempt to move away from the leaked snapshot.</p>
<p>That doesn&rsquo;t mean every reimplementation is doomed. Software copyright doesn&rsquo;t protect abstract ideas, generic tool flow, high-level architecture or &ldquo;a CLI that does X.&rdquo; It protects concrete expression. But that&rsquo;s exactly why discipline matters. The more a project wants to maintain independence, the less it should rely on the leaked material as a direct benchmark.</p>
<p>There&rsquo;s a more pragmatic detail in there: the literal copies of the leaked source will probably disappear quickly once the first DMCAs start arriving. Mirrors fall easily. That&rsquo;s why a reimplementation matters more than a raw mirror. It doesn&rsquo;t erase the legal discussion, but it considerably changes the kind of fight, and the odds of staying online.</p>
<p>That&rsquo;s more or less what I did myself when I <a href="/en/2026/03/16/rewrote-openclaw-in-rust-frankclaw/">rewrote OpenClaw in Rust</a>. The point wasn&rsquo;t to copy line by line. It was to understand the behavior and rewrite the whole piece in my own code.</p>
<p>The satirical site <a href="https://malus.sh/"target="_blank" rel="noopener">malus.sh</a> appeared today offering &ldquo;Clean Room as a Service&rdquo; with the tagline &ldquo;Robot-Reconstructed, Zero Attribution.&rdquo; The joke: AI robots recreate open source projects eliminating attribution obligations, with guarantees like &ldquo;This has never happened because it legally cannot happen. Trust us.&rdquo; and indemnification via an offshore subsidiary in a jurisdiction that doesn&rsquo;t recognize software copyright. It&rsquo;s satire, but it&rsquo;s satire that describes what someone is going to actually try to do.</p>
<p><a id="update-2026-04-02"></a></p>
<h2>Update on April 2, 2026<span class="hx:absolute hx:-mt-20" id="update-on-april-2-2026"></span>
    <a href="#update-on-april-2-2026" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Since the text above was written in the heat of March 31, it&rsquo;s worth recording what happened right after. I decided to add this update after reading <a href="https://x.com/k1rallik/status/2039686500619534818"target="_blank" rel="noopener">this tweet from @k1rallik</a>, which captures the post-leak mood well but mixes verifiable facts with a touch too much epic flourish.</p>
<p>First: the DMCA part got messier than it looked. The notice itself published in the <code>github/dmca</code> repository says GitHub processed the takedown against the entire network of <strong>8,100 repositories</strong>, because the notification claimed that &ldquo;all or most of the forks&rdquo; were infringing to the same extent as the main repository. The next day, Anthropic published a <strong>partial retraction</strong>: it asked for the reinstatement of all the removed repositories, except <code>nirholas/claude-code</code> and <strong>96 forks listed individually</strong>. So the thesis that the initial attempt was too broad is correct. The final picture, however, isn&rsquo;t &ldquo;8,100 repositories were taken down.&rdquo; What happened was a formal walk-back after the mass removal.</p>
<p>Second: the <a href="https://github.com/ultraworkers/claw-code"target="_blank" rel="noopener">Claw-Code</a> project really did blow up. By the time I was updating this post, GitHub was already showing <strong>142,829 stars</strong> and <strong>101,510 forks</strong>. That alone is enough to say the story has moved out of the &ldquo;curious fork of the leak&rdquo; category and into the &ldquo;real competitive side effect&rdquo; category. The viral tweet that went around today is right about the size of the damage but exaggerates some details. The project&rsquo;s own README describes itself as &ldquo;the fastest repo in history to surpass 50K stars&rdquo; and says the milestone came in two hours. I couldn&rsquo;t independently confirm that historical record, so I&rsquo;d rather treat that as the project&rsquo;s own claim, not as a settled fact.</p>
<p>Third: the Rust part also needs nuance. Yes, there&rsquo;s already a Rust workspace on the main branch and the <code>Cargo.toml</code> is at version <code>0.1.0</code>. But I couldn&rsquo;t find a public release on GitHub to support the line &ldquo;release 0.1.0 already shipped&rdquo; as a formal launch. What I can say with confidence is something else: the project already has a Python base, already has a Rust workspace, and has already drawn enough attention to keep existing even without the literal mirror of the leaked code.</p>
<h2>What Anthropic should have done<span class="hx:absolute hx:-mt-20" id="what-anthropic-should-have-done"></span>
    <a href="#what-anthropic-should-have-done" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Anthropic responded fast. They pulled the compromised package, put out a public statement, and cleaned up what they could. But the damage was done. The code was mirrored before the takedown. Mirrors on GitHub, analyses in blogs, threads on X/Twitter. There&rsquo;s no way to un-publish something on the internet.</p>
<p>What bothers me isn&rsquo;t the leak itself. Bugs happen. What bothers me is that this was avoidable with basic engineering practices:</p>
<ol>
<li>Add <code>*.map</code> to <code>.npmignore</code>. One line.</li>
<li>Configure the bundler to not generate source maps in production builds. One flag.</li>
<li>Have a CI check that rejects publication if the package contains <code>.map</code>. A 5-line script.</li>
<li>Have a release pipeline with manual review before publishing to npm. Process, not code.</li>
</ol>
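<p>Item 3 can be sketched in a few lines of shell. This is my own illustration, not Anthropic&rsquo;s actual pipeline, and the <code>dist</code> directory name is an assumption; point it at whatever ends up in the npm tarball:</p>

```shell
# Hypothetical pre-publish guard: refuse to release if the build output
# contains source maps. "dist" is an assumed directory name.
check_no_source_maps() {
  maps=$(find "$1" -name '*.map' -type f 2>/dev/null)
  if [ -n "$maps" ]; then
    echo "ERROR: source maps found, refusing to publish:" >&2
    echo "$maps" >&2
    return 1
  fi
  echo "OK: no .map files under $1"
}

# In CI, as a step before npm publish:
# check_no_source_maps dist || exit 1
```

<p>Wired in as a CI step before <code>npm publish</code>, this fails the build the moment a <code>.map</code> file sneaks into the package.</p>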
<p>None of these are hard. They&rsquo;re all the kind of thing that gets dropped when you&rsquo;re moving too fast and don&rsquo;t have discipline in the release process. It&rsquo;s exactly what I preach as <a href="/en/tags/vibe-coding/">disciplined vibe coding</a>: moving fast doesn&rsquo;t mean skipping the guardrails.</p>
<p>And the second failure: the quality of the code itself. 512,000 lines with 3,000-line functions and 12 levels of nesting isn&rsquo;t engineering. It&rsquo;s accumulation. It&rsquo;s what happens when you generate code with AI without rigorous review, without continuous refactoring, without CI that rejects high cyclomatic complexity. The irony of being precisely the company that sells the most popular vibe coding tool in the world doesn&rsquo;t go unnoticed.</p>
<h2>Sources<span class="hx:absolute hx:-mt-20" id="sources"></span>
    <a href="#sources" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><ul>
<li><a href="https://github.com/Kuberwastaken/claude-code"target="_blank" rel="noopener">Kuberwastaken/claude-code - Complete breakdown of the leaked code</a></li>
<li><a href="https://alex000kim.com/posts/2026-03-31-claude-code-source-leak/"target="_blank" rel="noopener">Alex Kim - Claude Code Source Leak: fake tools, frustration regexes, undercover mode</a></li>
<li><a href="https://venturebeat.com/technology/claude-codes-source-code-appears-to-have-leaked-heres-what-we-know/"target="_blank" rel="noopener">VentureBeat - Claude Code&rsquo;s source code appears to have leaked</a></li>
<li><a href="https://www.theregister.com/2026/03/31/anthropic_claude_code_source_code/"target="_blank" rel="noopener">The Register - Anthropic accidentally exposes Claude Code source code</a></li>
<li><a href="https://fortune.com/2026/03/31/anthropic-source-code-claude-code-data-leak-second-security-lapse-days-after-accidentally-revealing-mythos/"target="_blank" rel="noopener">Fortune - Anthropic leaks its own AI coding tool&rsquo;s source code</a></li>
<li><a href="https://cybernews.com/security/anthropic-claude-code-source-leak/"target="_blank" rel="noopener">Cybernews - Full source code for Anthropic&rsquo;s Claude Code leaks</a></li>
<li><a href="https://gizmodo.com/source-code-for-anthropics-claude-code-leaks-at-the-exact-wrong-time-2000740379"target="_blank" rel="noopener">Gizmodo - Source Code for Anthropic&rsquo;s Claude Code Leaks at the Exact Wrong Time</a></li>
<li><a href="https://www.anthropic.com/news/anthropic-acquires-bun-as-claude-code-reaches-usd1b-milestone?s=33"target="_blank" rel="noopener">Anthropic - Anthropic acquires Bun as Claude Code reaches $1B milestone</a></li>
<li><a href="https://github.com/oven-sh/bun/issues/28001"target="_blank" rel="noopener">Bun Issue #28001 - Source maps incorrectly served in production</a></li>
<li><a href="https://news.ycombinator.com/item?id=43909409"target="_blank" rel="noopener">Hacker News - Claude&rsquo;s system prompt is over 24k tokens with tools</a></li>
<li><a href="https://malus.sh/"target="_blank" rel="noopener">malus.sh - Clean Room as a Service (satire)</a></li>
<li><a href="https://x.com/iamfakeguru/status/2038965567269249484"target="_blank" rel="noopener">@iamfakeguru - Thread with 7 technical findings from the code</a></li>
<li><a href="https://x.com/altryne/status/2038676458026189225"target="_blank" rel="noopener">@altryne - Cache bugs that cost 10-20x more</a></li>
<li><a href="https://x.com/thekitze/status/2038956521942577557"target="_blank" rel="noopener">@thekitze - &ldquo;Staff-engineer spaghetti&rdquo; 6.5/10</a></li>
<li><a href="https://x.com/braelyn_ai/status/2039025584626397491"target="_blank" rel="noopener">@braelyn_ai - Clean room and legal implications</a></li>
<li><a href="https://github.com/github/dmca/blob/master/2026/03/2026-03-31-anthropic.md"target="_blank" rel="noopener">GitHub DMCA - Anthropic takedown notice processed against the network of 8.1K repositories</a></li>
<li><a href="https://github.com/github/dmca/blob/master/2026/04/2026-04-01-anthropic-retraction.md"target="_blank" rel="noopener">GitHub DMCA - Anthropic&rsquo;s partial retraction the next day</a></li>
<li><a href="https://github.com/ultraworkers/claw-code"target="_blank" rel="noopener">ultraworkers/claw-code - Python and Rust reimplementation that became the main post-leak project</a></li>
<li><a href="https://x.com/mem0ai/status/2039041449854124229"target="_blank" rel="noopener">@mem0ai - Analysis of the memory architecture</a></li>
<li><a href="https://x.com/himanshustwts/status/2038924027411222533"target="_blank" rel="noopener">@himanshustwts - Memory architecture summary</a></li>
<li><a href="https://github.com/iamfakeguru/claude-md"target="_blank" rel="noopener">iamfakeguru/claude-md - Override published with the complete CLAUDE.md</a></li>
<li><a href="https://x.com/StraughterG/status/2039344027556798476"target="_blank" rel="noopener">@StraughterG - &ldquo;the DRM is dead&rdquo; - reverse engineering of the cch= hash</a></li>
<li><a href="https://x.com/StraughterG/status/2039344035555344550"target="_blank" rel="noopener">@StraughterG - xxHash64 seed and TypeScript bypass code</a></li>
<li><a href="https://x.com/paoloanzn/status/2039348588741087341"target="_blank" rel="noopener">@paoloanzn - &ldquo;we cracked it&rdquo; - confirmation of the reverse engineering</a></li>
<li><a href="https://github.com/paoloanzn/free-code"target="_blank" rel="noopener">paoloanzn/free-code - Fork of Claude Code with telemetry removed and features unlocked</a></li>
<li><a href="https://a10k.co/b/reverse-engineering-claude-code-cch.html"target="_blank" rel="noopener">a10k.co - What&rsquo;s cch? Reverse Engineering Claude Code&rsquo;s Request Signing</a></li>
<li><a href="https://www.theregister.com/2026/02/20/anthropic_clarifies_ban_third_party_claude_access/"target="_blank" rel="noopener">The Register - Anthropic clarifies ban on third-party tool access to Claude</a></li>
</ul>
]]></content:encoded><category>ai</category><category>security</category><category>claude-code</category><category>vibe-coding</category><category>open-source</category></item><item><title>Migrating my Home Server with Claude Code | openSUSE MicroOS</title><link>https://www.akitaonrails.com/en/2026/03/31/migrating-my-home-server-with-claude-code/</link><guid isPermaLink="true">https://www.akitaonrails.com/en/2026/03/31/migrating-my-home-server-with-claude-code/</guid><pubDate>Tue, 31 Mar 2026 16:00:00 GMT</pubDate><description>&lt;p&gt;My old home server was a mess. An Intel NUC with Ubuntu Server that I&amp;rsquo;d been patching together for two years. Containers with hardcoded paths, volumes mounted in random places (&lt;code&gt;/home/akitaonrails/docker/&lt;/code&gt;, &lt;code&gt;/home/akitaonrails/sonarr/&lt;/code&gt;, &lt;code&gt;/mnt/terachad/&lt;/code&gt;), docker-compose files scattered with no consistent layout. It worked, but if I lost the disk it would take days to rebuild everything from memory.&lt;/p&gt;
&lt;p&gt;With the &lt;a href="https://www.akitaonrails.com/en/2026/03/31/minisforum-ms-s1-max-amd-ai-max-395-review/"&gt;new Minisforum MS-S1 Max&lt;/a&gt; that I bought, I decided to do the migration properly. And I decided to use Claude Code from the start to speed up the process. It&amp;rsquo;s a home server, only I use it, so the risk of making a mistake is low. But if this were a real production server, I would never do this without rigorous human review at every step.&lt;/p&gt;</description><content:encoded><![CDATA[<p>My old home server was a mess. An Intel NUC with Ubuntu Server that I&rsquo;d been patching together for two years. Containers with hardcoded paths, volumes mounted in random places (<code>/home/akitaonrails/docker/</code>, <code>/home/akitaonrails/sonarr/</code>, <code>/mnt/terachad/</code>), docker-compose files scattered with no consistent layout. It worked, but if I lost the disk it would take days to rebuild everything from memory.</p>
<p>With the <a href="/en/2026/03/31/minisforum-ms-s1-max-amd-ai-max-395-review/">new Minisforum MS-S1 Max</a> that I bought, I decided to do the migration properly. And I decided to use Claude Code from the start to speed up the process. It&rsquo;s a home server, only I use it, so the risk of making a mistake is low. But if this were a real production server, I would never do this without rigorous human review at every step.</p>
<p>What follows is the migration writeup and a guide for anyone who wants to replicate it. If I ever have to rebuild from scratch, this post is the documentation.</p>
<h2>Choosing the operating system<span class="hx:absolute hx:-mt-20" id="choosing-the-operating-system"></span>
    <a href="#choosing-the-operating-system" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><h3>Why not Ubuntu Server again<span class="hx:absolute hx:-mt-20" id="why-not-ubuntu-server-again"></span>
    <a href="#why-not-ubuntu-server-again" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>I used Ubuntu Server on the NUC for convenience. But <code>do-release-upgrade</code> is Russian roulette. Every time Ubuntu releases a new version, the upgrade is a real risk of breaking things. Packages change, configs get overwritten, dependencies clash. For a server that has to be running constantly, that&rsquo;s unacceptable.</p>
<h3>Why not Arch Linux<span class="hx:absolute hx:-mt-20" id="why-not-arch-linux"></span>
    <a href="#why-not-arch-linux" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>I use Arch on the desktop and I like it. But Arch is a rolling release with no stability guarantees at all. For a desktop where I can stop and fix problems, fine. For a headless server running 49 Docker containers that has to work after every reboot, no.</p>
<h3>Fedora CoreOS vs openSUSE MicroOS<span class="hx:absolute hx:-mt-20" id="fedora-coreos-vs-opensuse-microos"></span>
    <a href="#fedora-coreos-vs-opensuse-microos" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>The two modern options for a container server are Fedora CoreOS and openSUSE MicroOS. Both are immutable systems: the root filesystem is read-only, updates are atomic (they either apply in full or don&rsquo;t apply at all), and rollback is instant.</p>
<p>The difference: Fedora CoreOS uses Ignition (declarative configuration before the first boot) and is designed to be provisioned automatically. MicroOS uses <code>transactional-update</code> and allows normal interactive use. For a home server where I want SSH and the ability to poke around manually, MicroOS fits better.</p>
<h3>What makes MicroOS different<span class="hx:absolute hx:-mt-20" id="what-makes-microos-different"></span>
    <a href="#what-makes-microos-different" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>The immutable system concept changes how you operate the server:</p>
<p>Every package install or <code>/etc</code> edit goes through <code>transactional-update</code>, which creates a new btrfs snapshot, applies the change in that snapshot, and on the next reboot the system boots into the updated snapshot. If the change breaks something, you do <code>transactional-update rollback</code> and you&rsquo;re back on the previous snapshot in seconds.</p>
<p>Updates are automatic and daily. The <code>transactional-update.timer</code> downloads patches, creates a snapshot, and <code>rebootmgr</code> reboots in a configured window (in my case, between 4am and 5:30am). If the update breaks the boot, GRUB automatically falls back to the previous snapshot.</p>
<p>SELinux is enforcing by default. That caused 90% of the problems during the migration, but it&rsquo;s the right setting for security.</p>
<h2>Initial setup<span class="hx:absolute hx:-mt-20" id="initial-setup"></span>
    <a href="#initial-setup" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><h3>Hardware<span class="hx:absolute hx:-mt-20" id="hardware"></span>
    <a href="#hardware" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><ul>
<li>AMD Ryzen AI Max+ 395, 128GB LPDDR5X</li>
<li>96GB allocated as VRAM via BIOS (UMA Frame Buffer Size)</li>
<li>2TB NVMe (system + Docker)</li>
<li>Wired 2.5Gbps network</li>
<li>Synology DS1821+ NAS at 192.168.0.21 (NFS)</li>
</ul>
<h3>First steps<span class="hx:absolute hx:-mt-20" id="first-steps"></span>
    <a href="#first-steps" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>MicroOS install is standard. Then:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># Create user with a UID that matches the NAS (so NFS works without permission issues)</span>
</span></span><span class="line"><span class="cl">useradd -u <span class="m">1026</span> -m akitaonrails
</span></span><span class="line"><span class="cl">passwd akitaonrails
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Configure sudo (inside transactional-update shell)</span>
</span></span><span class="line"><span class="cl">sudo transactional-update shell
</span></span><span class="line"><span class="cl"><span class="c1"># inside: add akitaonrails to sudoers</span>
</span></span><span class="line"><span class="cl"><span class="nb">exit</span>
</span></span><span class="line"><span class="cl">sudo reboot</span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<h3>Synology NFS<span class="hx:absolute hx:-mt-20" id="synology-nfs"></span>
    <a href="#synology-nfs" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>The Synology NAS exports <code>/volume1/TERACHAD</code> over NFS. The mount point on MicroOS is <code>/var/mnt/terachad</code> (not <code>/mnt/</code>, which lives on the immutable root).</p>
<p>In <code>/etc/fstab</code> (applied through transactional-update):</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code>192.168.0.21:/volume1/TERACHAD /var/mnt/terachad nfs4 nfsvers=4.1,rsize=262144,wsize=262144,hard,_netdev 0 0</code></pre></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>Details that matter: <code>nfsvers=4.1</code> because 4.2 didn&rsquo;t work with the Synology. <code>rsize=262144,wsize=262144</code> (256KB buffers) was the biggest NFS performance improvement. <code>hard</code> instead of <code>soft</code> so the mount keeps retrying indefinitely if the NAS disconnects temporarily.</p>
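<p>For the record, the buffer size in that fstab line decoded:</p>

```shell
# rsize/wsize from the fstab line above: 256 KiB per NFS read/write
# request, expressed in bytes.
echo $((256 * 1024))   # 262144
```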
<h3>GPU / ROCm<span class="hx:absolute hx:-mt-20" id="gpu--rocm"></span>
    <a href="#gpu--rocm" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>This step was a pain. The Radeon 8060S in the AI Max+ 395 is gfx1151, which ROCm doesn&rsquo;t officially support. Three steps were needed, and all three are mandatory:</p>
<ol>
<li>BIOS: set UMA Frame Buffer Size to 96GB</li>
<li>Kernel: add <code>amdttm.pages_limit=25165824 amdttm.page_pool_size=25165824</code> to <code>/etc/kernel/cmdline</code></li>
<li>Docker: use <code>HSA_OVERRIDE_GFX_VERSION=11.5.1</code> in every ROCm container</li>
</ol>
<p>Without step 2, ROCm only sees 15.5GB even with the BIOS allocation in place. The numbers are 96GB / 4KB (page size) = 25,165,824 pages.</p>
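<p>Sanity-checking the page math before touching the kernel cmdline:</p>

```shell
# 96GB of VRAM divided by the 4KiB page size gives the amdttm pages_limit
# value from the kernel parameters below.
echo $((96 * 1024 * 1024 * 1024))          # 103079215104 bytes
echo $((96 * 1024 * 1024 * 1024 / 4096))   # 25165824 pages
```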
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">sudo transactional-update shell
</span></span><span class="line"><span class="cl"><span class="nb">echo</span> <span class="s2">&#34;amdttm.pages_limit=25165824 amdttm.page_pool_size=25165824&#34;</span> &gt;&gt; /etc/kernel/cmdline
</span></span><span class="line"><span class="cl"><span class="nb">exit</span>
</span></span><span class="line"><span class="cl">sudo sdbootutil update-all-entries  <span class="c1"># OUTSIDE the transactional shell</span>
</span></span><span class="line"><span class="cl">sudo reboot</span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>Verification:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">cat /sys/class/drm/card1/device/mem_info_vram_total
</span></span><span class="line"><span class="cl"><span class="c1"># 103079215104 (96 * 1024^3)</span></span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<h2>Docker on MicroOS<span class="hx:absolute hx:-mt-20" id="docker-on-microos"></span>
    <a href="#docker-on-microos" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">sudo transactional-update --non-interactive pkg install docker
</span></span><span class="line"><span class="cl">sudo reboot
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">sudo systemctl <span class="nb">enable</span> --now docker
</span></span><span class="line"><span class="cl">sudo usermod -aG docker akitaonrails
</span></span><span class="line"><span class="cl"><span class="c1"># logout and login for the group to take effect</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Install standalone docker-compose (the openSUSE package doesn&#39;t include it)</span>
</span></span><span class="line"><span class="cl">sudo curl -L <span class="s2">&#34;https://github.com/docker/compose/releases/latest/download/docker-compose-linux-x86_64&#34;</span> <span class="se">\
</span></span></span><span class="line"><span class="cl">  -o /usr/local/bin/docker-compose
</span></span><span class="line"><span class="cl">sudo chmod +x /usr/local/bin/docker-compose
</span></span><span class="line"><span class="cl">mkdir -p ~/.docker/cli-plugins
</span></span><span class="line"><span class="cl">ln -s /usr/local/bin/docker-compose ~/.docker/cli-plugins/docker-compose</span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<h3>daemon.json<span class="hx:absolute hx:-mt-20" id="daemonjson"></span>
    <a href="#daemonjson" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;log-level&#34;</span><span class="p">:</span> <span class="s2">&#34;warn&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;log-driver&#34;</span><span class="p">:</span> <span class="s2">&#34;local&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;log-opts&#34;</span><span class="p">:</span> <span class="p">{</span><span class="nt">&#34;max-size&#34;</span><span class="p">:</span> <span class="s2">&#34;10m&#34;</span><span class="p">,</span> <span class="nt">&#34;max-file&#34;</span><span class="p">:</span> <span class="s2">&#34;5&#34;</span><span class="p">},</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;selinux-enabled&#34;</span><span class="p">:</span> <span class="kc">true</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;live-restore&#34;</span><span class="p">:</span> <span class="kc">true</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;userland-proxy&#34;</span><span class="p">:</span> <span class="kc">false</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;exec-opts&#34;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&#34;native.cgroupdriver=systemd&#34;</span><span class="p">]</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span></span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p><code>live-restore: true</code> keeps containers alive across a Docker daemon restart. <code>userland-proxy: false</code> uses iptables directly instead of proxy processes (less overhead). <code>selinux-enabled: true</code> is mandatory on MicroOS.</p>
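<p>One extra guardrail I&rsquo;d add here, my own habit rather than anything MicroOS requires: validate the JSON before restarting the daemon, because a malformed <code>daemon.json</code> keeps Docker from starting at all.</p>

```shell
# Quick syntax check for a daemon.json before restarting Docker. Any JSON
# parser works; python3 -m json.tool is just what's usually at hand.
check_daemon_json() {
  if python3 -m json.tool "$1" > /dev/null 2>&1; then
    echo "valid JSON"
  else
    echo "INVALID JSON"
    return 1
  fi
}

# Example:
# check_daemon_json /etc/docker/daemon.json && sudo systemctl restart docker
```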
<h2>SELinux and Docker: the biggest source of problems<span class="hx:absolute hx:-mt-20" id="selinux-and-docker-the-biggest-source-of-problems"></span>
    <a href="#selinux-and-docker-the-biggest-source-of-problems" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>This deserves an entire section because it was responsible for 90% of the bugs during the migration.</p>
<p>On MicroOS with SELinux enforcing, every container that writes to a host bind-mounted directory needs special handling. There are two approaches: the <code>:Z</code> suffix on volumes and the <code>security_opt: label:disable</code> option.</p>
<h3>NEVER use <code>:Z</code>. Use <code>security_opt: label:disable</code>.<span class="hx:absolute hx:-mt-20" id="never-use-z-use-security_opt-labeldisable"></span>
    <a href="#never-use-z-use-security_opt-labeldisable" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p><code>:Z</code> tells Docker to relabel the host directory with the container&rsquo;s SELinux context. Sounds like the right thing. In practice:</p>
<ul>
<li>SQLite databases break. The relabeling changes the file&rsquo;s context and SQLite may refuse to open the WAL journal.</li>
<li>NFS mounts silently ignore <code>:Z</code>: NFS doesn&rsquo;t support SELinux xattrs, so the kernel drops the relabel without an error, and the container still ends up without permission.</li>
<li><code>:ro,Z</code> mounts try to relabel even when read-only, which fails on NFS and can corrupt the context on local paths.</li>
</ul>
<p>The right solution for every container on this system:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">services</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">myservice</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">security_opt</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span>- <span class="l">label:disable    </span><span class="w"> </span><span class="c"># disables SELinux enforcement for this container</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">volumes</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span>- <span class="l">./data:/data     </span><span class="w"> </span><span class="c"># NO :Z</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span>- <span class="l">./config.yml:/etc/config.yml:ro </span><span class="w"> </span><span class="c"># NO :Z even on :ro</span></span></span></code></pre></div></div>
</div>
<p><code>label:disable</code> disables SELinux label enforcement for that container only, not for the entire system. Combined with Docker&rsquo;s network and process isolation, it&rsquo;s safe for a home server.</p>
<h2>Migrating the stacks<span class="hx:absolute hx:-mt-20" id="migrating-the-stacks"></span>
    <a href="#migrating-the-stacks" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>All Docker stacks were reorganized into <code>/var/opt/docker/&lt;stack&gt;/docker-compose.yml</code>. On the old server, they were scattered across <code>/home/akitaonrails/docker/</code>, <code>/home/akitaonrails/&lt;service&gt;/</code>, with no pattern.</p>
<h3>Substitutions applied across every compose file<span class="hx:absolute hx:-mt-20" id="substitutions-applied-across-every-compose-file"></span>
    <a href="#substitutions-applied-across-every-compose-file" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><table>
  <thead>
      <tr>
          <th>Before</th>
          <th>After</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>/mnt/terachad/</code></td>
          <td><code>/var/mnt/terachad/</code></td>
      </tr>
      <tr>
          <td><code>192.168.0.145</code></td>
          <td><code>192.168.0.90</code></td>
      </tr>
      <tr>
          <td><code>/home/akitaonrails/&lt;service&gt;/</code></td>
          <td><code>/var/opt/docker/&lt;stack&gt;/&lt;service&gt;/</code></td>
      </tr>
      <tr>
          <td><code>OLLAMA_BASE_URL=http://192.168.0.14:11434</code></td>
          <td><code>OLLAMA_BASE_URL=http://192.168.0.90:11434</code></td>
      </tr>
  </tbody>
</table>
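<p>The table translates into a mechanical find-and-replace. A sed sketch (the demo file is illustrative; the Ollama rule is anchored to its port so the <code>192.168.0.14</code> pattern cannot collide with addresses ending in <code>.145</code>):</p>

```shell
# Demo compose file carrying the old values (illustrative, not a real stack)
mkdir -p /tmp/stacks/demo
cat > /tmp/stacks/demo/docker-compose.yml <<'EOF'
    volumes:
      - /mnt/terachad/media:/media
    environment:
      - OLLAMA_BASE_URL=http://192.168.0.14:11434
EOF

# Apply the substitutions from the table across every compose file
find /tmp/stacks -name docker-compose.yml -exec sed -i \
  -e 's|/mnt/terachad/|/var/mnt/terachad/|g' \
  -e 's|192\.168\.0\.145|192.168.0.90|g' \
  -e 's|192\.168\.0\.14:11434|192.168.0.90:11434|g' {} +

cat /tmp/stacks/demo/docker-compose.yml
```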
<h3>Media stack (Plex, Radarr, Sonarr, etc.)<span class="hx:absolute hx:-mt-20" id="media-stack-plex-radarr-sonarr-etc"></span>
    <a href="#media-stack-plex-radarr-sonarr-etc" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>The media stack is the most complex. Plex needs its own LAN IP (macvlan) for direct streaming to work. The setup:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">docker network create -d macvlan <span class="se">\
</span></span></span><span class="line"><span class="cl">  --subnet<span class="o">=</span>192.168.0.0/24 <span class="se">\
</span></span></span><span class="line"><span class="cl">  --gateway<span class="o">=</span>192.168.0.1 <span class="se">\
</span></span></span><span class="line"><span class="cl">  -o <span class="nv">parent</span><span class="o">=</span>enp97s0 <span class="se">\
</span></span></span><span class="line"><span class="cl">  plex_macvlan</span></span></code></pre></div></div>
</div>
<p>In compose, Plex needs to be on two networks: the macvlan (for the IP 192.168.0.6) and the default bridge (so other containers like Seerr can talk to it):</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">plex</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">networks</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">plex_macvlan</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">ipv4_address</span><span class="p">:</span><span class="w"> </span><span class="m">192.168.0.6</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">mac_address</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;02:42:c0:a8:00:06&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">default</span><span class="p">:</span><span class="w"> </span>{}<span class="w">    </span><span class="c"># mandatory — without this, Seerr can&#39;t see Plex</span></span></span></code></pre></div></div>
</div>
<p>A detail that almost broke me: Plex stores absolute paths in its database. If the container&rsquo;s internal volume changed from <code>/media</code> to <code>/data</code>, Plex no longer finds anything. You have to use exactly the same mount target as the old compose.</p>
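<p>One way to catch a changed mount target before first boot is to diff the container-side half of every volume line between the old and new compose files. A rough sketch (the helper only understands simple <code>- host:container[:opts]</code> entries, and the two fragments below are illustrative):</p>

```shell
# Print the container-side target of each "- host:container[:opts]" volume line
extract_targets() {
  grep -E '^[[:space:]]*- .+:.+' "$1" | sed -E 's/^[[:space:]]*- [^:]+:([^:]+).*/\1/'
}

# Old and new fragments: the host path changed, but the target must stay /media
printf '    volumes:\n      - /home/akitaonrails/media:/media\n' > /tmp/old.yml
printf '    volumes:\n      - /var/mnt/terachad/media:/media\n' > /tmp/new.yml

extract_targets /tmp/old.yml > /tmp/old.targets
extract_targets /tmp/new.yml > /tmp/new.targets
diff /tmp/old.targets /tmp/new.targets && echo "mount targets unchanged"
```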
<h3>Ollama with ROCm<span class="hx:absolute hx:-mt-20" id="ollama-with-rocm"></span>
    <a href="#ollama-with-rocm" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>New stack, didn&rsquo;t exist on the previous server:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">ollama</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">image</span><span class="p">:</span><span class="w"> </span><span class="l">ollama/ollama:rocm</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">container_name</span><span class="p">:</span><span class="w"> </span><span class="l">ollama</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">devices</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span>- <span class="l">/dev/kfd:/dev/kfd</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span>- <span class="l">/dev/dri:/dev/dri</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">security_opt</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span>- <span class="l">seccomp:unconfined</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span>- <span class="l">label:disable</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">group_add</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span>- <span class="s2">&#34;485&#34;</span><span class="w">   </span><span class="c"># render group GID</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span>- <span class="s2">&#34;488&#34;</span><span class="w">   </span><span class="c"># video group GID</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">environment</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span>- <span class="l">HSA_OVERRIDE_GFX_VERSION=11.5.1</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span>- <span class="l">PYTORCH_ROCM_ARCH=gfx1151</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span>- <span class="l">OLLAMA_KEEP_ALIVE=30m</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span>- <span class="l">OLLAMA_NUM_PARALLEL=4</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span>- <span class="l">OLLAMA_FLASH_ATTENTION=1</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span>- <span class="l">OLLAMA_KV_CACHE_TYPE=q8_0</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">volumes</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span>- <span class="l">/var/lib/ollama:/root/.ollama</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">ports</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span>- <span class="s2">&#34;11434:11434&#34;</span></span></span></code></pre></div></div>
</div>
<p><code>OLLAMA_FLASH_ATTENTION=1</code> enables flash attention. <code>OLLAMA_KV_CACHE_TYPE=q8_0</code> stores the KV cache at 8 bits per value instead of 16, roughly halving the memory and bandwidth needed per token. Free performance optimizations.</p>
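<p>The &ldquo;half the bandwidth&rdquo; figure is just the element width of the cache: f16 stores 2 bytes per cached value, q8_0 roughly 1. A back-of-envelope check with hypothetical model dimensions (32 layers, 8 KV heads, head dim 128; not the numbers of any specific model):</p>

```shell
# Per-token KV cache = 2 (K and V) x layers x kv_heads x head_dim x bytes/value
layers=32; kv_heads=8; head_dim=128
f16=$(( 2 * layers * kv_heads * head_dim * 2 ))   # 2 bytes per value
q8=$((  2 * layers * kv_heads * head_dim * 1 ))   # ~1 byte per value with q8_0
echo "f16: ${f16} bytes/token, q8_0: ~${q8} bytes/token"
```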
<h3>Monitoring (Grafana + Prometheus)<span class="hx:absolute hx:-mt-20" id="monitoring-grafana--prometheus"></span>
    <a href="#monitoring-grafana--prometheus" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Grafana uses a named volume (<code>grafana_data</code>) which is NOT included in normal filesystem backups. That&rsquo;s the reason I lost all my dashboards on the first attempt. The fix is an explicit backup of the named volume:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># On the old server:</span>
</span></span><span class="line"><span class="cl">docker run --rm -v grafana_data:/data:ro -v /tmp:/backup alpine <span class="se">\
</span></span></span><span class="line"><span class="cl">  tar czf /backup/grafana_data.tar.gz -C /data .
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Transfer and restore on the new one:</span>
</span></span><span class="line"><span class="cl">docker run --rm -v grafana_data:/data -v /tmp:/backup alpine <span class="se">\
</span></span></span><span class="line"><span class="cl">  sh -c <span class="s2">&#34;cd /data &amp;&amp; tar xzf /backup/grafana_data.tar.gz&#34;</span></span></span></code></pre></div></div>
</div>
<p>Same thing for Portainer (<code>portainer_data</code>). Any volume defined in the <code>volumes:</code> block of a compose without a host path needs this treatment.</p>
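<p>A rough way to spot which volumes need that treatment is to scan the top-level <code>volumes:</code> block of each compose file for entries that have no host path. A crude line-based helper, not a YAML parser:</p>

```shell
# List named volumes declared in a compose file's top-level "volumes:" block
list_named_volumes() {
  awk '/^volumes:/{f=1;next} f&&/^[^[:space:]]/{f=0} f&&/^[[:space:]]+[A-Za-z0-9_-]+:/{gsub(/[:[:space:]]/,"");print}' "$1"
}

# Demo compose file (illustrative): grafana_data is a named volume
cat > /tmp/demo-compose.yml <<'EOF'
services:
  grafana:
    volumes:
      - grafana_data:/var/lib/grafana
volumes:
  grafana_data:
EOF

list_named_volumes /tmp/demo-compose.yml | tee /tmp/named_volumes.txt
```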
<h3>Cloudflare Tunnel<span class="hx:absolute hx:-mt-20" id="cloudflare-tunnel"></span>
    <a href="#cloudflare-tunnel" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>I use <a href="https://developers.cloudflare.com/cloudflare-one/connections/connect-networks/"target="_blank" rel="noopener">Cloudflare Tunnel</a> to access all the services from outside the house without opening ports on the router. The migration was the easiest part: copy the tunnel&rsquo;s JSON credentials file and the <code>config.yml</code>, update the IPs from <code>.145</code> to <code>.90</code>, and bring the container up. The tunnel keeps the same ID, no need to recreate DNS.</p>
<p>The hostnames live in <code>config.yml</code>: portainer, grafana, plex, seerr, qbittorrent, syncthing, radarr, sonarr, bazarr, prowlarr, vault, gitea, kavita, and others. Everything accessible via <code>https://&lt;service&gt;.example.com</code> from anywhere.</p>
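<p>That <code>config.yml</code> follows cloudflared&rsquo;s standard ingress format: hostname rules are tried top to bottom, and the file must end with a catch-all rule. A minimal sketch (tunnel ID, hostnames and ports are illustrative):</p>

```shell
# Shape of the tunnel config (written to /tmp here just to illustrate;
# the real file sits next to the tunnel's JSON credentials)
cat > /tmp/cloudflared-config.yml <<'EOF'
tunnel: TUNNEL-ID
credentials-file: /etc/cloudflared/TUNNEL-ID.json
ingress:
  - hostname: grafana.example.com
    service: http://192.168.0.90:3000
  - hostname: plex.example.com
    service: http://192.168.0.90:32400
  - service: http_status:404   # mandatory catch-all, always last
EOF
grep -c 'hostname:' /tmp/cloudflared-config.yml
```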
<h3>Gitea (image registry)<span class="hx:absolute hx:-mt-20" id="gitea-image-registry"></span>
    <a href="#gitea-image-registry" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p><a href="https://github.com/go-gitea/gitea"target="_blank" rel="noopener">Gitea</a> acts as a private Docker registry on port 3007. The Frank FBI, Frank Mega, Frank Yomik and Mila projects have Docker images that are built and pushed to Gitea. For it to work, Docker&rsquo;s <code>daemon.json</code> needs:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-json" data-lang="json"><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nt">&#34;insecure-registries&#34;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&#34;192.168.0.90:3007&#34;</span><span class="p">]</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span></span></span></code></pre></div></div>
</div>
<p>Gitea SSH gave me a problem during the migration: the old app.ini had <code>SSH_LISTEN_PORT=22</code>, but the container&rsquo;s entrypoint also starts sshd on port 22. Conflict. Fix: <code>GITEA__server__SSH_LISTEN_PORT=2222</code> as an environment variable in compose.</p>
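<p>This works because Gitea maps environment variables of the form <code>GITEA__&lt;section&gt;__&lt;KEY&gt;</code> onto <code>app.ini</code> entries, so a single compose-level variable overrides the file:</p>

```shell
# The compose-level override, equivalent to this app.ini fragment:
#   [server]
#   SSH_LISTEN_PORT = 2222
export GITEA__server__SSH_LISTEN_PORT=2222
echo "$GITEA__server__SSH_LISTEN_PORT" > /tmp/gitea_ssh_port
cat /tmp/gitea_ssh_port
```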
<h3>All 49 containers running<span class="hx:absolute hx:-mt-20" id="all-49-containers-running"></span>
    <a href="#all-49-containers-running" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>The migrated server runs 49 containers across 15 stacks. The media stack alone has 10 containers (<a href="https://www.plex.tv/"target="_blank" rel="noopener">Plex</a>, <a href="https://github.com/Radarr/Radarr"target="_blank" rel="noopener">Radarr</a>, <a href="https://github.com/Sonarr/Sonarr"target="_blank" rel="noopener">Sonarr</a>, <a href="https://github.com/morpheus65535/bazarr"target="_blank" rel="noopener">Bazarr</a>, <a href="https://github.com/Prowlarr/Prowlarr"target="_blank" rel="noopener">Prowlarr</a>, <a href="https://github.com/qbittorrent/qBittorrent"target="_blank" rel="noopener">qBittorrent</a>, <a href="https://github.com/sabnzbd/sabnzbd"target="_blank" rel="noopener">SABnzbd</a>, <a href="https://github.com/Jackett/Jackett"target="_blank" rel="noopener">Jackett</a>, <a href="https://github.com/FlareSolverr/FlareSolverr"target="_blank" rel="noopener">FlareSolverr</a>, <a href="https://github.com/seerr/seerr"target="_blank" rel="noopener">Seerr</a>). The personal projects (<a href="https://github.com/akitaonrails/frank_fbi"target="_blank" rel="noopener">Frank FBI</a>, <a href="/en/2026/02/21/vibe-code-built-a-mega-clone-in-rails-in-1-day-frankmega/">Frank Mega</a>, <a href="https://github.com/akitaonrails/FrankYomik"target="_blank" rel="noopener">Frank Yomik</a>, Mila) add another 11. Monitoring with <a href="https://github.com/grafana/grafana"target="_blank" rel="noopener">Grafana</a>, <a href="https://github.com/prometheus/prometheus"target="_blank" rel="noopener">Prometheus</a>, node-exporter and <a href="https://github.com/google/cadvisor"target="_blank" rel="noopener">cAdvisor</a>. 
Utilities like <a href="https://github.com/portainer/portainer"target="_blank" rel="noopener">Portainer</a>, <a href="https://github.com/dani-garcia/vaultwarden"target="_blank" rel="noopener">Vaultwarden</a>, <a href="https://github.com/syncthing/syncthing"target="_blank" rel="noopener">Syncthing</a>, <a href="https://github.com/causefx/Organizr"target="_blank" rel="noopener">Organizr</a>, <a href="https://github.com/containrrr/watchtower"target="_blank" rel="noopener">Watchtower</a>. <a href="https://github.com/go-gitea/gitea"target="_blank" rel="noopener">Gitea</a> as private Docker registry. <a href="https://github.com/immich-app/immich"target="_blank" rel="noopener">Immich</a> as self-hosted Google Photos. <a href="https://github.com/oae/kaizoku"target="_blank" rel="noopener">Kaizoku</a> for manga with <a href="https://github.com/Kareadita/Kavita"target="_blank" rel="noopener">Kavita</a> as the reader. <a href="https://github.com/ollama/ollama"target="_blank" rel="noopener">Ollama</a> with ROCm. And <a href="https://github.com/bitcoin/bitcoin"target="_blank" rel="noopener">Bitcoin Core</a>/<a href="https://github.com/cculianu/Fulcrum"target="_blank" rel="noopener">Fulcrum</a> indexing the blockchain off the NAS.</p>
<h2>Backups: two layers<span class="hx:absolute hx:-mt-20" id="backups-two-layers"></span>
    <a href="#backups-two-layers" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><h3>Layer 1: local btrfs snapshots (snapper)<span class="hx:absolute hx:-mt-20" id="layer-1-local-btrfs-snapshots-snapper"></span>
    <a href="#layer-1-local-btrfs-snapshots-snapper" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p><code>/var</code> lives on a 3.7TB btrfs partition. snapper creates automatic snapshots: 7 daily + 1 weekly. They&rsquo;re crash-consistent, not application-consistent (postgres can be slightly inconsistent if there&rsquo;s heavy writing during the snapshot).</p>
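<p>The &ldquo;7 daily + 1 weekly&rdquo; policy lives in snapper&rsquo;s per-config file (<code>/etc/snapper/configs/var</code>). A sketch of the relevant retention keys, written to /tmp for illustration:</p>

```shell
cat > /tmp/snapper-var.conf <<'EOF'
TIMELINE_CREATE="yes"
TIMELINE_LIMIT_HOURLY="0"
TIMELINE_LIMIT_DAILY="7"
TIMELINE_LIMIT_WEEKLY="1"
TIMELINE_LIMIT_MONTHLY="0"
TIMELINE_LIMIT_YEARLY="0"
EOF
grep 'DAILY\|WEEKLY' /tmp/snapper-var.conf
```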
<p>To recover an accidentally deleted file:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">sudo snapper -c var list
</span></span><span class="line"><span class="cl">sudo cp /var/.snapshots/5/snapshot/opt/docker/media/radarr/appdata/config/radarr.db <span class="se">\
</span></span></span><span class="line"><span class="cl">        /var/opt/docker/media/radarr/appdata/config/radarr.db</span></span></code></pre></div></div>
</div>
<p>For a full stack rollback:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">sudo docker compose -p media down
</span></span><span class="line"><span class="cl">sudo snapper -c var undochange 7..0 /var/opt/docker/media
</span></span><span class="line"><span class="cl">sudo docker compose -p media up -d</span></span></code></pre></div></div>
</div>
<h3>Layer 2: restic to the NAS (off-machine)<span class="hx:absolute hx:-mt-20" id="layer-2-restic-to-the-nas-off-machine"></span>
    <a href="#layer-2-restic-to-the-nas-off-machine" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p><a href="https://github.com/restic/restic"target="_blank" rel="noopener">restic</a> runs every night at 3am, doing incremental backups to <code>/var/mnt/terachad/homelab-backups/</code>. Retention: 7 daily + 4 weekly. Content-based deduplication, so Plex config (19GB) and Gitea repos (12GB) only transfer deltas.</p>
<p>Before restic runs, a <code>pg_dump</code> exports the postgres databases (Immich, Kaizoku). The dumps go to <code>/tmp/homelab-db-dumps/</code> and are included in the backup.</p>
<p>What&rsquo;s NOT included in the backup (re-downloadable): Bitcoin blockchain (785GB on the NAS), Docker images (re-pullable), Ollama models (re-downloadable), HuggingFace/EasyOCR caches, Plex transcoding scratch.</p>
<p>Large re-downloadable directories were converted to btrfs subvolumes so snapper ignores them: <code>/var/lib/ollama</code> and <code>/var/opt/docker/bitcoin/fulcrum/fulc2_db</code>.</p>
<h2>Performance tuning<span class="hx:absolute hx:-mt-20" id="performance-tuning"></span>
    <a href="#performance-tuning" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><h3>btrfs with zstd compression<span class="hx:absolute hx:-mt-20" id="btrfs-with-zstd-compression"></span>
    <a href="#btrfs-with-zstd-compression" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Added <code>compress=zstd:1</code> to the fstab for the <code>/var</code> partition. zstd level 1 has near-zero CPU overhead on NVMe and compresses Docker metadata, JSON configs and logs nicely. Incompressible data (SQLite, postgres) is automatically skipped by btrfs.</p>
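<p>In fstab terms it is a single extra mount option. The edit, demonstrated on a copy (the UUID is made up; on the real system you edit <code>/etc/fstab</code> and remount):</p>

```shell
cat > /tmp/fstab.demo <<'EOF'
UUID=abcd-1234  /var  btrfs  defaults  0  0
EOF
# append compress=zstd:1 to the options column of the /var btrfs entry
sed -i '/\/var[[:space:]]\+btrfs/ s/defaults/defaults,compress=zstd:1/' /tmp/fstab.demo
cat /tmp/fstab.demo
```

<p>A <code>mount -o remount,compress=zstd:1 /var</code> applies it without a reboot; files written before the change stay uncompressed until they are rewritten.</p>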
<h3>zram swap<span class="hx:absolute hx:-mt-20" id="zram-swap"></span>
    <a href="#zram-swap" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>With ~30GB of RAM available for the system (96GB go to VRAM), in-memory compressed swap helps. zram creates a ~15GB swap device (ram/2) with zstd compression, much faster than disk swap.</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="cl"><span class="c1"># /etc/systemd/zram-generator.conf</span>
</span></span><span class="line"><span class="cl"><span class="k">[zram0]</span>
</span></span><span class="line"><span class="cl"><span class="na">zram-size</span> <span class="o">=</span> <span class="s">ram / 2</span>
</span></span><span class="line"><span class="cl"><span class="na">compression-algorithm</span> <span class="o">=</span> <span class="s">zstd</span>
</span></span><span class="line"><span class="cl"><span class="na">swap-priority</span> <span class="o">=</span> <span class="s">100</span></span></span></code></pre></div></div>
</div>
<h3>btrfs nodatacow on database directories<span class="hx:absolute hx:-mt-20" id="btrfs-nodatacow-on-database-directories"></span>
    <a href="#btrfs-nodatacow-on-database-directories" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Copy-on-write + random database writes = write amplification. I disabled CoW on the directories that hold SQLite and postgres (note: <code>+C</code> only takes effect for files created after the flag is set, so an existing database file has to be copied into a fresh nodatacow location for it to apply):</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">sudo chattr +C /var/opt/docker/gitea/data/gitea.db
</span></span><span class="line"><span class="cl">sudo chattr +C /var/opt/docker/immich/db/
</span></span><span class="line"><span class="cl">sudo chattr +C /var/opt/docker/media/radarr/appdata/config/</span></span></code></pre></div></div>
</div>
<h3>CPU in performance mode<span class="hx:absolute hx:-mt-20" id="cpu-in-performance-mode"></span>
    <a href="#cpu-in-performance-mode" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>On a headless server, there&rsquo;s no point saving energy:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="nb">echo</span> performance <span class="p">|</span> sudo tee /sys/devices/system/cpu/cpu*/cpufreq/energy_performance_preference</span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>Persisted via systemd service <code>cpu-epp.service</code>.</p>
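<p>The unit is a one-liner wrapped in systemd. A sketch of what it can look like (I&rsquo;m assuming the unit contents here; only the name <code>cpu-epp.service</code> comes from my setup):</p>

```ini
# /etc/systemd/system/cpu-epp.service
[Unit]
Description=Pin CPU energy/performance preference to "performance"

[Service]
Type=oneshot
# The shell expands the cpu* glob across all cores
ExecStart=/bin/sh -c 'echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/energy_performance_preference'
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
```

<p>Then <code>sudo systemctl enable --now cpu-epp.service</code>.</p>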
<h3>Docker shutdown fix<span class="hx:absolute hx:-mt-20" id="docker-shutdown-fix"></span>
    <a href="#docker-shutdown-fix" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>A problem I discovered: Docker ships with <code>KillMode=process</code>, which means at system shutdown, systemd kills only dockerd and leaves all the <code>containerd-shim</code> processes (one per container, ~49 in my case) orphaned. systemd-shutdown then has to hunt them down one by one after the journal has already stopped, causing a silent multi-minute hang.</p>
<p>Fix:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ini" data-lang="ini"><span class="line"><span class="cl"><span class="c1"># /etc/systemd/system/docker.service.d/shutdown.conf</span>
</span></span><span class="line"><span class="cl"><span class="k">[Service]</span>
</span></span><span class="line"><span class="cl"><span class="na">KillMode</span><span class="o">=</span><span class="s">control-group</span>
</span></span><span class="line"><span class="cl"><span class="na">TimeoutStopSec</span><span class="o">=</span><span class="s">30</span></span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
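<p>Drop-ins aren&rsquo;t picked up until systemd reloads, so apply and verify with:</p>

```shell
sudo systemctl daemon-reload
systemctl show docker -p KillMode   # should now print KillMode=control-group
```
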
<h2>The problems we ran into<span class="hx:absolute hx:-mt-20" id="the-problems-we-ran-into"></span>
    <a href="#the-problems-we-ran-into" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>This is the table of actual problems we hit during the migration. If you&rsquo;re planning something similar, read it before starting:</p>
<table>
  <thead>
      <tr>
          <th>Problem</th>
          <th>Cause</th>
          <th>Fix</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ROCm only sees 15.5GB of VRAM</td>
          <td>Kernel TTM caps pages even with BIOS at 96GB</td>
          <td>Add <code>amdttm.pages_limit=25165824</code> to kernel cmdline</td>
      </tr>
      <tr>
          <td>Every container: permission denied on volumes</td>
          <td>SELinux <code>container_t</code> can&rsquo;t write to unlabeled paths</td>
          <td><code>security_opt: label:disable</code> on every service</td>
      </tr>
      <tr>
          <td>NFS with <code>:Z</code> silently fails</td>
          <td>NFS doesn&rsquo;t support SELinux xattrs</td>
          <td>Never use <code>:Z</code> on NFS paths</td>
      </tr>
      <tr>
          <td>SQLite breaks with <code>:Z</code></td>
          <td>Relabeling changes the context, WAL mode fails</td>
          <td>Drop <code>:Z</code>, use <code>label:disable</code></td>
      </tr>
      <tr>
          <td>Radarr/Sonarr showed the setup screen</td>
          <td>Backup at <code>appdata/config/</code> but compose mounted <code>appdata/</code></td>
          <td>Fix to: <code>appdata/config:/config</code></td>
      </tr>
      <tr>
          <td>Grafana lost dashboards</td>
          <td>Named volume not included in filesystem backup</td>
          <td>Explicit named volume backup</td>
      </tr>
      <tr>
          <td>Plex can&rsquo;t find media</td>
          <td>Internal path changed from <code>/media</code> to <code>/data</code></td>
          <td>Restore the original path in compose</td>
      </tr>
      <tr>
          <td>Seerr can&rsquo;t connect to Plex</td>
          <td>macvlan isolated from the bridge network</td>
          <td>Add <code>default: {}</code> to Plex networks</td>
      </tr>
      <tr>
          <td>Fulcrum crash: &ldquo;option -b missing&rdquo;</td>
          <td>Env vars not supported by the image</td>
          <td>Use CLI flags in <code>command:</code></td>
      </tr>
      <tr>
          <td>bitcoind rejects RPC</td>
          <td>Binds on <code>::1</code> by default</td>
          <td>Add <code>-rpcbind=0.0.0.0 -rpcallowip=172.0.0.0/8</code></td>
      </tr>
      <tr>
          <td>sdbootutil warning in transactional shell</td>
          <td>Has to run outside the transaction</td>
          <td>Run <code>sdbootutil update-all-entries</code> in the normal shell</td>
      </tr>
      <tr>
          <td>Watchtower permission denied on docker.sock</td>
          <td>SELinux blocks socket access</td>
          <td><code>label:disable</code></td>
      </tr>
      <tr>
          <td>Gitea SSH crash</td>
          <td>Conflict: entrypoint sshd on port 22 + app on port 22</td>
          <td><code>GITEA__server__SSH_LISTEN_PORT=2222</code></td>
      </tr>
      <tr>
          <td>docker-compose not installed with Docker</td>
          <td>The openSUSE package only installs the daemon</td>
          <td>Install standalone binary manually</td>
      </tr>
  </tbody>
</table>
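<p>For reference, this is the shape of the <code>label:disable</code> fix that keeps showing up in the table (a minimal compose sketch; the service name and paths are illustrative, not copied from my files):</p>

```yaml
services:
  gitea:
    image: gitea/gitea:latest
    security_opt:
      - label:disable        # skip SELinux labeling entirely instead of :Z
    volumes:
      - /var/opt/docker/gitea/data:/data    # note: no :Z / :z suffix
```
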
<h2>What to tell Claude Code before you start<span class="hx:absolute hx:-mt-20" id="what-to-tell-claude-code-before-you-start"></span>
    <a href="#what-to-tell-claude-code-before-you-start" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>If I were redoing the migration from scratch, I&rsquo;d give Claude Code these instructions on the very first message. In order of importance:</p>
<p>Tell it that:</p>
<ul>
<li>SELinux is enforcing and it should NEVER use <code>:Z</code> on any Docker volume; use <code>security_opt: label:disable</code> on every service instead.</li>
<li><code>/var/mnt/terachad/</code> is an NFS mount, and <code>:Z</code> should never appear on NFS paths.</li>
<li>It should always look at the original compose before rewriting and only change IPs, paths and container names, without inventing new volume layouts.</li>
<li>Named volumes need explicit backup (Grafana, Portainer).</li>
<li>Plex runs on macvlan and needs <code>default: {}</code> in its networks.</li>
<li>The GPU is gfx1151, not officially supported, and it needs UMA 96GB in the BIOS + kernel TTM params + <code>HSA_OVERRIDE_GFX_VERSION=11.5.1</code>.</li>
<li>Bitcoin/Fulcrum don&rsquo;t process environment variables; everything goes as an argument in <code>command:</code>.</li>
</ul>
<p>Those instructions would have prevented 80% of the problems we hit.</p>
<h2>Final server layout<span class="hx:absolute hx:-mt-20" id="final-server-layout"></span>
    <a href="#final-server-layout" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code>/var/opt/docker/
├── bitcoin/          (bitcoind &#43; fulcrum)
├── cloudflared/      (Cloudflare tunnel)
├── frank_fbi/        (email fraud analysis)
├── frank_mega/       (Mega clone)
├── frank_yomik/      (manga translation)
├── gitea/            (Docker registry)
├── immich/           (self-hosted Google Photos)
├── kaizoku/          (manga downloader &#43; reader)
├── media/            (Plex &#43; *arr stack)
├── mila/             (Discord bot)
├── monitor/          (Grafana &#43; Prometheus)
├── ollama/           (local LLM with ROCm)
├── rip/              (HandBrake)
└── utils/            (Portainer, Vaultwarden, Syncthing, etc.)

/var/mnt/terachad/    (Synology NFS)
├── Bitcoin/data/     (blockchain, 785GB)
├── Downloads/        (torrents &#43; nzbget)
├── Videos/           (Radarr movies &#43; Sonarr series)
└── Ollama/models/    (model overflow if local disk fills up)</code></pre></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<h2>A warning about using AI to administer servers<span class="hx:absolute hx:-mt-20" id="a-warning-about-using-ai-to-administer-servers"></span>
    <a href="#a-warning-about-using-ai-to-administer-servers" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>I used Claude Code to speed up the migration. It created compose files, wrote backup scripts, configured the firewall, diagnosed SELinux problems. It worked well for my case: home server, only I use it, and I was reviewing every step.</p>
<p>But there are traps. Claude doesn&rsquo;t know that <code>:Z</code> breaks SQLite unless you tell it. It doesn&rsquo;t know that Fulcrum doesn&rsquo;t accept env vars unless it&rsquo;s already seen the Dockerfile. It will invent &ldquo;better&rdquo; volume layouts that break Plex because Plex stores absolute paths in its database.</p>
<p>If this were real production: don&rsquo;t do any of it without review. Read every compose file Claude generates in full before applying it. Confirm every destructive command (rollback, delete, recreate) manually. And have tested backups before you start. Claude is great for generating the first version and diagnosing errors, but the architecture decisions and the safety validations are yours.</p>
<p>The previous home server posts that may give additional context:</p>
<ul>
<li><a href="/2024/04/03/meu-netflix-pessoal-com-docker-compose/">My &ldquo;Personal Netflix&rdquo; with Docker Compose</a></li>
<li><a href="/en/2025/09/09/accessing-my-home-server-with-a-real-domain/">Accessing my Home Server with a real domain</a></li>
<li><a href="/en/2025/09/10/protecting-your-home-server-with-cloudflare-zero-trust/">Protecting your Home Server with Cloudflare Zero Trust</a></li>
<li><a href="/en/2025/09/10/installing-grafana-on-my-home-server/">Installing Grafana on my Home Server</a></li>
<li><a href="/en/2025/09/10/omarchy-2-0-bitwarden-self-hosted-vaultwarden/">Self-hosted Vaultwarden</a></li>
</ul>
]]></content:encoded><category>homeserver</category><category>docker</category><category>opensuse</category><category>microos</category><category>claude-code</category><category>vibe-coding</category></item><item><title>Review: Minisforum MS-S1 Max | AMD AI Max+ 395 with 96GB of VRAM</title><link>https://www.akitaonrails.com/en/2026/03/31/minisforum-ms-s1-max-amd-ai-max-395-review/</link><guid isPermaLink="true">https://www.akitaonrails.com/en/2026/03/31/minisforum-ms-s1-max-amd-ai-max-395-review/</guid><pubDate>Tue, 31 Mar 2026 15:00:00 GMT</pubDate><description>&lt;p&gt;If you&amp;rsquo;ve been following my &lt;a href="https://www.akitaonrails.com/2024/04/03/meu-netflix-pessoal-com-docker-compose/"&gt;home server posts&lt;/a&gt;, you know I used to run everything on an Intel NUC Core i7 with 32GB of RAM. It worked. But as open source AI models grew, the NUC became the bottleneck. Without a dedicated GPU, any LLM inference would fall back to the CPU and turn unusable.&lt;/p&gt;
&lt;p&gt;I bought a Minisforum MS-S1 Max with the new AMD Ryzen AI Max+ 395 chip for one specific reason: this chip supports up to 128GB of unified RAM, and I can allocate 96GB of it as VRAM for the iGPU. That gives me more VRAM than any consumer gaming card, including the RTX 5090 (32GB). And that changes what I can run locally.&lt;/p&gt;</description><content:encoded><![CDATA[<p>If you&rsquo;ve been following my <a href="/2024/04/03/meu-netflix-pessoal-com-docker-compose/">home server posts</a>, you know I used to run everything on an Intel NUC Core i7 with 32GB of RAM. It worked. But as open source AI models grew, the NUC became the bottleneck. Without a dedicated GPU, any LLM inference would fall back to the CPU and turn unusable.</p>
<p>I bought a Minisforum MS-S1 Max with the new AMD Ryzen AI Max+ 395 chip for one specific reason: this chip supports up to 128GB of unified RAM, and I can allocate 96GB of it as VRAM for the iGPU. That gives me more VRAM than any consumer gaming card, including the RTX 5090 (32GB). And that changes what I can run locally.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/blog/2026/03/31/minisforum-desk.jpg" alt="Minisforum MS-S1 Max on the desk"  loading="lazy" /></p>
<h2>Why ditch the Intel NUC<span class="hx:absolute hx:-mt-20" id="why-ditch-the-intel-nuc"></span>
    <a href="#why-ditch-the-intel-nuc" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>The NUC was a fine Docker server for two years. But the limitation was clear: without a GPU with enough VRAM, I couldn&rsquo;t run LLMs locally in any usable way. <a href="https://github.com/akitaonrails/FrankYomik"target="_blank" rel="noopener">Frank Yomik</a>, my automatic manga translation system, needed CPU-based OCR (slow) and would connect remotely to the Ollama running on my desktop (AMD 7950X3D + RTX 5090) for translation. It worked, but it meant my desktop had to be on for the server to do its job.</p>
<p><img src="https://raw.githubusercontent.com/akitaonrails/FrankYomik/master/docs/sample_translate.png" alt="Frank Yomik - automatic manga translation"  loading="lazy" /></p>
<p>With the Minisforum, Frank Yomik now runs entirely on the server. The worker uses ROCm for OCR on the iGPU, and Ollama runs locally with 96GB of VRAM. Zero dependency on the desktop.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/blog/2026/03/31/minisforum-nuc-compare.jpg" alt="Comparison: Intel NUC (left) vs Minisforum MS-S1 Max (right)"  loading="lazy" /></p>
<p>You can get a sense of the size from the photo. The NUC is the tiny cube on the left. The Minisforum is bigger but it&rsquo;s still a mini-PC. It fits on the rack shelf under my Synology NAS without any trouble.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/blog/2026/03/31/minisforum-shelf.jpg" alt="Minisforum installed on the shelf, next to the NAS"  loading="lazy" /></p>
<h2>The specs<span class="hx:absolute hx:-mt-20" id="the-specs"></span>
    <a href="#the-specs" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/blog/2026/03/31/minisforum-fastfetch.png" alt="fastfetch on the Minisforum"  loading="lazy" /></p>
<p>The chip is the AMD Ryzen AI Max+ 395: 16 cores / 32 threads Zen 5, with an integrated Radeon 8060S iGPU and 128GB of unified LPDDR5X. In the BIOS, I set the UMA Frame Buffer Size to 96GB, which leaves ~30GB of RAM for the operating system and containers. You also need the kernel TTM parameters: without them, ROCm only sees 15.5GB even with the BIOS allocation in place.</p>
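<p>The TTM value isn&rsquo;t magic, by the way: <code>amdttm.pages_limit=25165824</code> is just 96GB counted in 4KiB pages. Quick check (assuming the standard 4KiB page size):</p>

```python
# amdttm.pages_limit counts 4 KiB pages, so 96 GiB of UMA translates to:
page_size = 4096            # bytes per page
vram_bytes = 96 * 2**30     # the 96GB UMA Frame Buffer Size set in the BIOS
pages = vram_bytes // page_size
print(pages)                # 25165824 -- the value on the kernel cmdline
```
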
<p>The operating system is openSUSE MicroOS (more on that in the <a href="/en/2026/03/31/migrating-my-home-server-with-claude-code/">next post</a>). The whole machine pulls under 100W, which is absurd if you&rsquo;re used to dedicated GPUs that draw 450W+ on their own.</p>
<h2>Minisforum vs my desktop: benchmarks<span class="hx:absolute hx:-mt-20" id="minisforum-vs-my-desktop-benchmarks"></span>
    <a href="#minisforum-vs-my-desktop-benchmarks" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>I ran a <a href="https://github.com/akitaonrails/homelab-docs/tree/master/benchmarks"target="_blank" rel="noopener">set of benchmarks</a> comparing the Minisforum against my desktop (AMD 7950X3D, 96GB DDR5, RTX 5090 32GB GDDR7). The results are clear.</p>
<h3>CPU<span class="hx:absolute hx:-mt-20" id="cpu"></span>
    <a href="#cpu" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><table>
  <thead>
      <tr>
          <th>Test</th>
          <th>7950X3D</th>
          <th>AI Max+ 395</th>
          <th>Winner</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Prime sieve (single-core)</td>
          <td>0.021s</td>
          <td>0.018s</td>
          <td>Strix Halo +14%</td>
      </tr>
      <tr>
          <td>Float pi (single-core)</td>
          <td>1.335s</td>
          <td>1.706s</td>
          <td>7950X3D +28%</td>
      </tr>
      <tr>
          <td>Multi-core sieve (32 threads)</td>
          <td>0.181s</td>
          <td>0.118s</td>
          <td>Strix Halo +53%</td>
      </tr>
      <tr>
          <td>SHA-256 throughput</td>
          <td>2.714 MB/s</td>
          <td>2.488 MB/s</td>
          <td>7950X3D +9%</td>
      </tr>
      <tr>
          <td>AES-256-CBC throughput</td>
          <td>1,613 MB/s</td>
          <td>1,410 MB/s</td>
          <td>7950X3D +14%</td>
      </tr>
  </tbody>
</table>
<p>Mixed results. The AI Max+ 395 is better at pure parallelism (multi-core sieve), probably thanks to lower latency in the unified memory architecture. The 7950X3D wins at float and crypto because of its higher clocks and the 3D V-Cache.</p>
<h3>LLM inference (models that fit on both)<span class="hx:absolute hx:-mt-20" id="llm-inference-models-that-fit-on-both"></span>
    <a href="#llm-inference-models-that-fit-on-both" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>This is where it gets interesting. For models that fit in the 32GB of the RTX 5090, the comparison is purely about memory bandwidth:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Size</th>
          <th>RTX 5090 (tok/s)</th>
          <th>Strix Halo (tok/s)</th>
          <th>5090 advantage</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>phi4</td>
          <td>9.1 GB</td>
          <td>155.1</td>
          <td>23.2</td>
          <td>6.7x</td>
      </tr>
      <tr>
          <td>qwen3:14b</td>
          <td>9.3 GB</td>
          <td>138.9</td>
          <td>22.6</td>
          <td>6.1x</td>
      </tr>
      <tr>
          <td>phi4-reasoning</td>
          <td>11.1 GB</td>
          <td>130.2</td>
          <td>19.1</td>
          <td>6.8x</td>
      </tr>
      <tr>
          <td>qwen3:32b</td>
          <td>20.2 GB</td>
          <td>66.9</td>
          <td>10.0</td>
          <td>6.7x</td>
      </tr>
  </tbody>
</table>
<p>The RTX 5090 is ~7x faster. The explanation is simple: GDDR7 has ~1,792 GB/s of bandwidth; LPDDR5X has ~256 GB/s. That ratio (7x) lines up almost exactly with the measured speed difference (6.7x). LLM inference is a problem dominated by memory bandwidth: whoever reads weights faster generates tokens faster.</p>
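<p>You can sanity-check the whole table with napkin math: generating a token requires streaming roughly the entire model from memory, so the ceiling is bandwidth divided by model size. A sketch using the figures above:</p>

```python
# Roofline estimate: tok/s ceiling = memory bandwidth / bytes read per token
GDDR7 = 1792     # GB/s, RTX 5090
LPDDR5X = 256    # GB/s, Strix Halo

def ceiling(bandwidth_gbs, model_gb):
    return bandwidth_gbs / model_gb

# phi4 weighs 9.1 GB; measured: 155.1 tok/s (5090) and 23.2 tok/s (Strix Halo)
print(round(ceiling(GDDR7, 9.1), 1))    # 196.9 theoretical ceiling
print(round(ceiling(LPDDR5X, 9.1), 1))  # 28.1 theoretical ceiling
print(round(GDDR7 / LPDDR5X, 1))        # 7.0 -- the bandwidth ratio itself
```

<p>Measured numbers land at ~80% of the ceiling on both machines, which is why the 6.7x speed gap tracks the 7x bandwidth gap so closely.</p>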
<h3>And what about prompt processing?<span class="hx:absolute hx:-mt-20" id="and-what-about-prompt-processing"></span>
    <a href="#and-what-about-prompt-processing" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><table>
  <thead>
      <tr>
          <th>Model</th>
          <th>RTX 5090 (tok/s)</th>
          <th>Strix Halo (tok/s)</th>
          <th>5090 advantage</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>phi4</td>
          <td>~1,933</td>
          <td>~212</td>
          <td>9.1x</td>
      </tr>
      <tr>
          <td>qwen3:14b</td>
          <td>~1,474</td>
          <td>~155</td>
          <td>9.5x</td>
      </tr>
      <tr>
          <td>qwen3:32b</td>
          <td>~767</td>
          <td>~68</td>
          <td>11.3x</td>
      </tr>
  </tbody>
</table>
<p>Prompt processing is even worse: 7-11x slower. That makes sense: the whole prompt is processed in parallel, which makes prefill compute-bound rather than bandwidth-bound, and raw compute is where the 5090&rsquo;s advantage over an iGPU is even larger.</p>
<h3>Where the Strix Halo wins: large models<span class="hx:absolute hx:-mt-20" id="where-the-strix-halo-wins-large-models"></span>
    <a href="#where-the-strix-halo-wins-large-models" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Now we get to the reason I bought this PC. Models that don&rsquo;t fit in the RTX 5090:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Size</th>
          <th>Strix Halo (tok/s)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>gpt-oss:20b</td>
          <td>13.8 GB MXFP4</td>
          <td>48.9</td>
          <td>MoE, faster than expected</td>
      </tr>
      <tr>
          <td>qwen3.5:35b</td>
          <td>23.9 GB</td>
          <td>43.2</td>
          <td>MoE, only ~4B active params</td>
      </tr>
      <tr>
          <td>qwen3-coder-next</td>
          <td>51.7 GB</td>
          <td>29.5</td>
          <td>MoE, 50GB+</td>
      </tr>
      <tr>
          <td>qwen3.5:122b</td>
          <td>81.4 GB Q4_K_M</td>
          <td>19.2</td>
          <td>122B params, MoE</td>
      </tr>
      <tr>
          <td>glm-4.7-flash:bf16</td>
          <td>59.9 GB</td>
          <td>17.9</td>
          <td>Full precision bf16</td>
      </tr>
      <tr>
          <td>qwen2.5:72b</td>
          <td>47.4 GB Q4_K_M</td>
          <td>4.5</td>
          <td>Dense 72B, bandwidth-limited</td>
      </tr>
  </tbody>
</table>
<p>qwen3.5:122b with 81GB of weights running at 19 tok/s. On a mini-PC. That&rsquo;s simply not possible on an RTX 5090. On the NVIDIA card, that model would have to offload layers to system RAM, dropping to 2-3 tok/s. In practice, unusable.</p>
<p>The difference between MoE and dense models is brutal. qwen3.5:35b runs at 43 tok/s because, despite having 35B total parameters, only ~4B are active per token. A dense 72B model like qwen2.5:72b has to read ~40GB of weights per token, and at 256 GB/s of bandwidth the theoretical maximum is ~6.4 tok/s. The measured 4.5 tok/s represents ~70% efficiency, which is what you&rsquo;d expect from an iGPU (overhead from the shared bus and drivers).</p>
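<p>The same napkin math, applied to the bytes actually touched per token (the ~40GB and ~4B-active figures above; Q4 at roughly 4.5 bits per weight is my approximation):</p>

```python
BW = 256.0   # GB/s, Strix Halo LPDDR5X

# Dense: qwen2.5:72b streams ~40 GB of weights for every single token
dense_ceiling = BW / 40.0
print(round(dense_ceiling, 1))   # 6.4 tok/s ceiling (4.5 measured)

# MoE: ~4B active params at ~4.5 bits/weight is only ~2.4 GB per token
moe_ceiling = BW / 2.4
print(round(moe_ceiling, 1))     # 106.7 tok/s ceiling -- attention, KV cache
                                 # and routing overhead eat the difference
```
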
<h3>Summary: when to use which machine<span class="hx:absolute hx:-mt-20" id="summary-when-to-use-which-machine"></span>
    <a href="#summary-when-to-use-which-machine" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><table>
  <thead>
      <tr>
          <th>Use case</th>
          <th>Best machine</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Interactive chat/coding (models &lt;32GB)</td>
          <td>RTX 5090 (6-7x faster)</td>
      </tr>
      <tr>
          <td>Large models (50GB+)</td>
          <td>Strix Halo (only option)</td>
      </tr>
      <tr>
          <td>Dense 70B+ models</td>
          <td>Strix Halo (only option)</td>
      </tr>
      <tr>
          <td>Full-precision bf16</td>
          <td>Strix Halo (only option)</td>
      </tr>
      <tr>
          <td>Batch processing with long context</td>
          <td>Strix Halo (more VRAM for KV cache)</td>
      </tr>
      <tr>
          <td>API serving with low latency</td>
          <td>RTX 5090 (sub-150ms TTFT)</td>
      </tr>
  </tbody>
</table>
<h3>A ROCm bug that&rsquo;s still around<span class="hx:absolute hx:-mt-20" id="a-rocm-bug-thats-still-around"></span>
    <a href="#a-rocm-bug-thats-still-around" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Not everything works. Models like deepseek-r1:70b, llama3.3:70b and llama4:scout crash with a ggml bug (<code>GGML_ASSERT(ggml_nbytes(src0) &lt;= INT_MAX) failed</code>). The embedding tensor of these models exceeds 2GB and the ROCm copy kernel uses a 32-bit integer for the size. On CUDA (NVIDIA) it&rsquo;s already been fixed, but on ROCm it hasn&rsquo;t. Waiting for the fix in Ollama 0.20.0+.</p>
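<p>To see why only the largest models trip it, compare an embedding tensor&rsquo;s byte count against <code>INT_MAX</code> (the dimensions below are illustrative, not the exact ones from those models):</p>

```python
INT_MAX = 2**31 - 1   # the 32-bit limit the ROCm copy path still uses

# Illustrative llama-class embedding: 150k vocab x 8192 dims in fp16
vocab, dim, bytes_per_weight = 150_000, 8192, 2
tensor_bytes = vocab * dim * bytes_per_weight
print(tensor_bytes)              # 2457600000
print(tensor_bytes > INT_MAX)    # True: the size overflows a signed 32-bit int
```
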
<h2>LPDDR5X vs GDDR7: why this difference exists<span class="hx:absolute hx:-mt-20" id="lpddr5x-vs-gddr7-why-this-difference-exists"></span>
    <a href="#lpddr5x-vs-gddr7-why-this-difference-exists" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>The next question is: why is LPDDR5X so much slower?</p>
<p>GDDR7 is dedicated GPU memory. It&rsquo;s soldered onto the graphics card, connected by a wide bus (384 or 512 bits on the RTX 5090) at high clocks. Its only job is to feed data to the GPU. LPDDR5X is unified memory that serves everything: the operating system, applications, and the GPU all at once. The bus is narrower and shared.</p>
<p>In practice: GDDR7 delivers ~1,792 GB/s dedicated to the GPU. LPDDR5X delivers ~256 GB/s that still need to be split between CPU and GPU. LLM inference is basically &ldquo;read all the model weights from memory, multiply by the current token, generate the next token, repeat.&rdquo; Whoever reads faster, generates faster. There&rsquo;s no shortcut.</p>
<p>The Strix Halo&rsquo;s advantage isn&rsquo;t speed. It&rsquo;s capacity. 96GB of VRAM in a 100W chip that costs a fraction of a professional GPU. The RTX 5090 is 7x faster, but it&rsquo;s stuck at 32GB. Models that don&rsquo;t fit, don&rsquo;t run.</p>
<h2>The alternatives: who else does this?<span class="hx:absolute hx:-mt-20" id="the-alternatives-who-else-does-this"></span>
    <a href="#the-alternatives-who-else-does-this" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>If 96GB isn&rsquo;t enough or you want more speed, the options are limited.</p>
<p>The Framework Desktop uses the same AI Max+ 395 chip with up to 128GB of RAM. Same platform, same performance, but with the differential of being modular and repairable (it&rsquo;s Framework, after all). In practice it&rsquo;s equivalent to the Minisforum on specs and price.</p>
<p>Above that, the alternative is a Mac Studio with M3 Ultra. The M3 Ultra chip supports up to 512GB of unified memory, with ~819 GB/s of bandwidth (more than 3x the Strix Halo). Apple mounts the memory chips on the SoC package itself, so latency and bandwidth are superior. You could potentially allocate ~400GB as VRAM and run models that don&rsquo;t fit anywhere outside of professional GPU servers.</p>
<p>Apple&rsquo;s internal NVMe is also worth a note: ~7.4 GB/s of sequential read on the M3 Ultra, versus ~14 GB/s on a Crucial T700 (PCIe 5.0). The T700 wins on raw throughput, but Apple&rsquo;s NVMe latency tends to be lower on random I/O thanks to the SoC integration.</p>
<table>
  <thead>
      <tr>
          <th>Spec</th>
          <th>Minisforum MS-S1 Max</th>
          <th>Mac Studio M3 Ultra (max)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Max RAM</td>
          <td>128 GB LPDDR5X</td>
          <td>512 GB unified</td>
      </tr>
      <tr>
          <td>Allocatable VRAM</td>
          <td>~96 GB</td>
          <td>~400 GB</td>
      </tr>
      <tr>
          <td>Memory bandwidth</td>
          <td>~256 GB/s</td>
          <td>~819 GB/s</td>
      </tr>
      <tr>
          <td>CPU</td>
          <td>Zen 5, 16C/32T</td>
          <td>Apple M3 Ultra, 32C</td>
      </tr>
      <tr>
          <td>GPU compute</td>
          <td>ROCm (gfx1151, experimental)</td>
          <td>Metal (mlx, mature)</td>
      </tr>
      <tr>
          <td>Power draw</td>
          <td>~100W</td>
          <td>~135W</td>
      </tr>
      <tr>
          <td>NVMe</td>
          <td>PCIe 5.0 (standard slot)</td>
          <td>Custom Apple (~7.4 GB/s)</td>
      </tr>
      <tr>
          <td>Price (US)</td>
          <td>~$1,500-2,000</td>
          <td>~$9,999 (512GB config)</td>
      </tr>
      <tr>
          <td>Estimated price (Brazil)</td>
          <td>~R$ 12,000-15,000</td>
          <td>~R$ 110,000+ (imported)</td>
      </tr>
  </tbody>
</table>
<p>The Brazilian price is the elephant in the room. The Mac Studio&rsquo;s max config costs $9,999 in the US. With import taxes (~60% + state ICMS), it goes past R$ 110,000. The Minisforum with 128GB lands at R$ 12,000-15,000. A nearly 8x price gap buys you a lot.</p>
<p>If you need more than 96GB of VRAM for truly enormous models (DeepSeek-V3 with 671B parameters fits in ~400GB Q4, for example), the Mac Studio with 512GB is the only consumer option. The alternative would be professional NVIDIA A6000 GPUs (48GB VRAM, ~$6,000 each, and you&rsquo;d need several in NVLink). For everything that fits in 96GB, the Minisforum gets the job done at a fraction of the cost.</p>
<h2>And projects that promise to run big LLMs on small GPUs?<span class="hx:absolute hx:-mt-20" id="and-projects-that-promise-to-run-big-llms-on-small-gpus"></span>
    <a href="#and-projects-that-promise-to-run-big-llms-on-small-gpus" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>There&rsquo;s a concept called &ldquo;layer offloading&rdquo; that projects like llama.cpp already support. The idea: if the model doesn&rsquo;t fit fully in VRAM, keep some layers on the GPU and the rest in system RAM. The GPU processes the layers it has, hands off to the CPU to process the rest, and back.</p>
<p>In practice, it doesn&rsquo;t work well. The bottleneck is PCIe: a PCIe 5.0 x16 link tops out at ~64 GB/s in theory (half that on PCIe 4.0), and sustained real-world transfers between system RAM and GPU VRAM land well below that. Each generated token needs the off-GPU weights streamed across that link. The result is that you drop from 150 tok/s (everything in VRAM) to 2-8 tok/s (partial offload). It&rsquo;s too slow for interactive use.</p>
<p>VRAM is the fundamental limitation because LLM inference is memory-bandwidth-bound, not compute-bound. The GPU has compute to spare. What&rsquo;s missing is the ability to read the model weights fast enough. When part of the weights live in system RAM through PCIe, the entire pipeline waits on the transfer.</p>
<p>That&rsquo;s why unified memory (like in the Strix Halo or Apple Silicon) makes a difference. There&rsquo;s no PCIe in the middle. CPU and GPU access the same physical memory. The Strix Halo&rsquo;s 256 GB/s is slow compared to GDDR7, but it&rsquo;s still 4-8x the effective rate of offloading through PCIe.</p>
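<p>A crude model makes the cliff obvious: token time is the sum of reading the GPU-resident weights at GPU speed plus streaming the spilled weights over the interconnect. Sketch (the PCIe figure is an assumed effective host-to-GPU rate; real numbers vary by platform):</p>

```python
def tok_per_s(model_gb, vram_frac, gpu_bw=1792.0, pcie_bw=32.0):
    """Seconds per token = VRAM-resident reads + PCIe-streamed reads."""
    t = (model_gb * vram_frac) / gpu_bw + (model_gb * (1 - vram_frac)) / pcie_bw
    return 1.0 / t

# A 40 GB model, fully in VRAM vs. 25% spilled to system RAM
print(round(tok_per_s(40, 1.00), 1))   # 44.8 tok/s
print(round(tok_per_s(40, 0.75), 1))   # 3.0 tok/s -- the PCIe hop dominates
```
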
<h2>Advances in LLM optimization (up to 2026)<span class="hx:absolute hx:-mt-20" id="advances-in-llm-optimization-up-to-2026"></span>
    <a href="#advances-in-llm-optimization-up-to-2026" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>To understand why some models run so much better than others on the Strix Halo, you have to understand what&rsquo;s changed in the ecosystem over the last two years.</p>
<h3>Mixture of Experts (MoE)<span class="hx:absolute hx:-mt-20" id="mixture-of-experts-moe"></span>
    <a href="#mixture-of-experts-moe" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>If you run local models, MoE is the advance that matters most. An MoE model has high total parameters (e.g. 122B in qwen3.5:122b), but only activates a fraction of them per token (e.g. ~4B). The inactive weights stay in VRAM but aren&rsquo;t read on every token, which drastically reduces the bandwidth needed.</p>
<p>In the Strix Halo benchmarks, MoE models run 3-10x faster than dense models of the same size. qwen3.5:35b (MoE, ~4B active) runs at 43 tok/s while qwen2.5:72b (dense, 72B active) runs at 4.5 tok/s.</p>
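<p>The same back-of-envelope logic shows what MoE buys on bandwidth-bound hardware. An illustrative sketch (real throughput lands below these ceilings because of compute, KV cache and overhead):</p>

```python
# MoE on bandwidth-bound hardware: only the active experts' weights are
# read per token, so the denominator shrinks. Illustrative sketch.

BYTES_PER_PARAM_Q4 = 0.5  # ~4 bits per weight at Q4

def decode_ceiling_tok_s(active_params_b: float, bandwidth_gb_s: float) -> float:
    """Upper bound: bandwidth divided by the GB actually read per token."""
    gb_read_per_token = active_params_b * BYTES_PER_PARAM_Q4
    return bandwidth_gb_s / gb_read_per_token

# Strix Halo unified memory: ~256 GB/s
dense = decode_ceiling_tok_s(72, 256)  # dense 72B reads all 72B weights
moe = decode_ceiling_tok_s(4, 256)     # MoE reads only the ~4B active ones

print(f"dense 72B ceiling: ~{dense:.1f} tok/s")  # ~7.1 tok/s
print(f"MoE ~4B ceiling:   ~{moe:.0f} tok/s")    # ~128 tok/s
```

<p>Measured numbers (4.5 and 43 tok/s) sit well below these ceilings, but the ratio between them is exactly what MoE buys you.</p>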
<h3>DeepSeek and training optimization<span class="hx:absolute hx:-mt-20" id="deepseek-and-training-optimization"></span>
    <a href="#deepseek-and-training-optimization" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>DeepSeek V3 (December 2024) showed that it was possible to train 671B parameter models at a cost an order of magnitude lower than predicted. They combined MoE with FP8 quantization during training (not just inference), multi-stage training with curriculum learning, and several inter-GPU communication optimizations. The impact: everyone copied. Qwen, GLM, MiniMax, all of them adopted variations of the technique.</p>
<h3>Quantization: from FP16 to Q4 without losing much<span class="hx:absolute hx:-mt-20" id="quantization-from-fp16-to-q4-without-losing-much"></span>
    <a href="#quantization-from-fp16-to-q4-without-losing-much" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Quantization compresses model weights from 16 bits (FP16) to smaller formats: 8 bits (Q8), 4 bits (Q4), or even 2 bits. A 70B model that would take ~140GB in FP16 fits in ~40GB at Q4_K_M. Quality loss exists, but in modern formats (GGUF Q4_K_M, AWQ, EXL2) it&rsquo;s small enough for practical use.</p>
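<p>The size arithmetic is simple enough to sanity-check in a few lines. A sketch with nominal bit widths; real GGUF files add a little per-tensor metadata on top:</p>

```python
# Rough model-size arithmetic for quantization (ignores the small
# metadata overhead of real GGUF files).

def model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Gigabytes needed for the weights alone."""
    return params_billions * bits_per_weight / 8

print(f"70B FP16:   ~{model_size_gb(70, 16):.0f} GB")   # ~140 GB
print(f"70B Q4_K_M: ~{model_size_gb(70, 4.5):.0f} GB")  # ~39 GB
```

<p>Q4_K_M lands around 4.5 effective bits per weight rather than a flat 4, because the K-quant blocks carry scale factors; that&rsquo;s where the ~40GB figure comes from.</p>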
<p>GGUF (the llama.cpp format) became the standard for local inference. AWQ and GPTQ are alternatives with more sophisticated calibration, but the ecosystem converged on GGUF because it works on CPU, CUDA and ROCm without recompilation.</p>
<h3>Distillation: smaller models that know more<span class="hx:absolute hx:-mt-20" id="distillation-smaller-models-that-know-more"></span>
    <a href="#distillation-smaller-models-that-know-more" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Distillation is training a small model using the responses of a large model as the teacher. Microsoft&rsquo;s Phi-4 (14B) was trained with distillation from GPT-4 and competes with 70B models on several benchmarks. Qwen3 did the same: qwen3:14b is surprisingly capable for its size.</p>
<h3>Flash Attention and optimized KV Cache<span class="hx:absolute hx:-mt-20" id="flash-attention-and-optimized-kv-cache"></span>
    <a href="#flash-attention-and-optimized-kv-cache" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Flash Attention (Tri Dao, 2022) changed how attention is computed: instead of materializing the full attention matrix in memory, it processes attention in blocks, keeping the working data in the GPU&rsquo;s on-chip SRAM and reducing memory consumption from O(n²) to O(n). Without it, contexts of 128K+ tokens would be impractical. It has since gone through versions 2 and 3, with optimizations for FP8 and async operations on the H100. PagedAttention (vLLM, UC Berkeley) did the same for the KV cache during serving: it applies virtual-memory concepts to the cache, eliminating fragmentation and improving throughput by 2-4x.</p>
<p>In Ollama, I set <code>OLLAMA_FLASH_ATTENTION=1</code> and <code>OLLAMA_KV_CACHE_TYPE=q8_0</code> on the server. The former enables flash attention; the latter stores the KV cache in 8 bits instead of fp16, cutting the bandwidth the cache needs per token in half. These are zero-hardware-cost optimizations that measurably improve throughput.</p>
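<p>For reference, this is how those two variables can be persisted on a systemd-managed Linux install. A sketch: the unit name <code>ollama.service</code> is the default on most distros, adjust if yours differs.</p>

```shell
# Persist the flags in the Ollama server's environment (systemd example).
sudo systemctl edit ollama.service
# In the override file that opens, add:
#   [Service]
#   Environment="OLLAMA_FLASH_ATTENTION=1"
#   Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
sudo systemctl restart ollama.service
# Check the effect: --verbose prints eval tok/s after each reply
ollama run qwen3.5:35b --verbose
```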
<h3>What Qwen, Kimi, MiniMax and GLM are doing<span class="hx:absolute hx:-mt-20" id="what-qwen-kimi-minimax-and-glm-are-doing"></span>
    <a href="#what-qwen-kimi-minimax-and-glm-are-doing" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Qwen (Alibaba) has consistently offered the best price/performance among open-source models. Qwen3:14b is dense and strong; Qwen3.5:122b is MoE and runs surprisingly well in 96GB. GLM-4.7 (Zhipu AI) is notable for offering full-precision bf16 versions that fit in 96GB. MiniMax experimented with long contexts (up to 4M tokens). Kimi (Moonshot AI) focused on large context windows with linear architectures.</p>
<h3>What runs well in 96GB of VRAM<span class="hx:absolute hx:-mt-20" id="what-runs-well-in-96gb-of-vram"></span>
    <a href="#what-runs-well-in-96gb-of-vram" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>With 96GB on the Strix Halo, the models that work well for daily use:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Size</th>
          <th>tok/s</th>
          <th>Use</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>qwen3.5:35b</td>
          <td>24 GB</td>
          <td>43.2</td>
          <td>General purpose, excellent</td>
      </tr>
      <tr>
          <td>qwen3-coder-next</td>
          <td>52 GB</td>
          <td>29.5</td>
          <td>Code, MoE</td>
      </tr>
      <tr>
          <td>qwen3.5:122b</td>
          <td>81 GB</td>
          <td>19.2</td>
          <td>Heavy but usable</td>
      </tr>
      <tr>
          <td>glm-4.7-flash:bf16</td>
          <td>60 GB</td>
          <td>17.9</td>
          <td>Full precision</td>
      </tr>
      <tr>
          <td>qwen2.5-coder:32b</td>
          <td>20 GB</td>
          <td>10.2</td>
          <td>Code, dense</td>
      </tr>
      <tr>
          <td>deepseek-r1:32b</td>
          <td>20 GB</td>
          <td>7.4</td>
          <td>Reasoning</td>
      </tr>
  </tbody>
</table>
<p>Dense 70B+ models (deepseek-r1:70b, llama3.3:70b) are still blocked by the ROCm bug I mentioned. When it gets fixed, they should run at ~4-6 tok/s, usable for batch but not for interactive chat.</p>
<h2>Conclusion<span class="hx:absolute hx:-mt-20" id="conclusion"></span>
    <a href="#conclusion" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>I bought the Minisforum to run models that don&rsquo;t fit in any gaming GPU. For that, it works. It&rsquo;s not fast. 19 tok/s on a 122B model isn&rsquo;t the experience you get with Claude or ChatGPT. But it&rsquo;s local, it&rsquo;s private, and it runs on my shelf consuming less power than an old lightbulb.</p>
<p>For people asking about the Mac Studio: if you have the budget, it&rsquo;s the best machine for running local LLMs. 512GB of unified memory, 819 GB/s of bandwidth, mature Metal/mlx ecosystem. You can run DeepSeek-V3 in full Q4. But in Brazil, with import duties, it crosses R$ 110k. The Minisforum with 128GB at R$ 12-15k is the realistic option.</p>
<p>And for people who think you can work around the VRAM limitation with layer offloading: you can&rsquo;t. PCIe is too slow. The model has to fit fully in VRAM for inference to be usable. It&rsquo;s the reason gaming GPUs with 32GB of ultra-fast GDDR7 are still capped on model size, and why the unified memory of the Strix Halo and Apple Silicon changed the equation.</p>
<p>In the <a href="/en/2026/03/31/migrating-my-home-server-with-claude-code/">next post</a> I tell the story of how I migrated the entire home server to the Minisforum using Claude Code, the problems I ran into, and how openSUSE MicroOS behaves as a Docker server operating system.</p>
]]></content:encoded><category>hardware</category><category>llm</category><category>homeserver</category><category>amd</category><category>review</category></item><item><title>Teaching People to Question the News | Frank Investigator</title><link>https://www.akitaonrails.com/en/2026/03/27/teaching-people-to-question-the-news-frank-investigator/</link><guid isPermaLink="true">https://www.akitaonrails.com/en/2026/03/27/teaching-people-to-question-the-news-frank-investigator/</guid><pubDate>Fri, 27 Mar 2026 10:00:00 GMT</pubDate><description>&lt;p&gt;Heads up: &lt;a href="https://github.com/akitaonrails/frank_investigator"target="_blank" rel="noopener"&gt;Frank Investigator&lt;/a&gt; is an experimental project, in active development, and isn&amp;rsquo;t meant to be the final word on any article it analyzes. It doesn&amp;rsquo;t tell you what&amp;rsquo;s true or false. What it does is ask the questions the article refused to ask, identify known rhetorical patterns, and search for external sources the author left out. If you want to help, contribute on &lt;a href="https://github.com/akitaonrails/frank_investigator"target="_blank" rel="noopener"&gt;GitHub&lt;/a&gt; or send feedback. If you want to follow the results, &lt;a href="https://themakitachronicles.com/"target="_blank" rel="noopener"&gt;The Makita Chronicles&lt;/a&gt; newsletter is going to have a new section called &amp;ldquo;Notícias Duvidosas&amp;rdquo; (Dubious News) where I&amp;rsquo;ll publish the investigator&amp;rsquo;s summary and a link to the full report.&lt;/p&gt;</description><content:encoded><![CDATA[<p>Heads up: <a href="https://github.com/akitaonrails/frank_investigator"target="_blank" rel="noopener">Frank Investigator</a> is an experimental project, in active development, and isn&rsquo;t meant to be the final word on any article it analyzes. It doesn&rsquo;t tell you what&rsquo;s true or false. 
What it does is ask the questions the article refused to ask, identify known rhetorical patterns, and search for external sources the author left out. If you want to help, contribute on <a href="https://github.com/akitaonrails/frank_investigator"target="_blank" rel="noopener">GitHub</a> or send feedback. If you want to follow the results, <a href="https://themakitachronicles.com/"target="_blank" rel="noopener">The Makita Chronicles</a> newsletter is going to have a new section called &ldquo;Notícias Duvidosas&rdquo; (Dubious News) where I&rsquo;ll publish the investigator&rsquo;s summary and a link to the full report.</p>
<p>With that out of the way, let me explain why I built this.</p>
<h2>The problem with the Brazilian press<span class="hx:absolute hx:-mt-20" id="the-problem-with-the-brazilian-press"></span>
    <a href="#the-problem-with-the-brazilian-press" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>I&rsquo;m fed up.</p>
<p>Fed up with opening the newspaper and having to do mental gymnastics to separate information from narrative. Fed up with outlets like Folha de São Paulo, UOL, Carta Capital, Brasil 247, O Globo and several others that use misleading headlines, omit context on purpose, transpose evidence from other countries with no caveat, and create the appearance of consensus among outlets that are all saying the same thing because they&rsquo;re following the same coordinated agenda.</p>
<p>This isn&rsquo;t conspiracy theory. It&rsquo;s a verifiable editorial pattern. And the worst part: most readers don&rsquo;t have the time or the tools to notice. You read the headline, read the first two paragraphs, and walk away with the impression the article planted in your head.</p>
<h2>What Frank Investigator does<span class="hx:absolute hx:-mt-20" id="what-frank-investigator-does"></span>
    <a href="#what-frank-investigator-does" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>You give it a news article URL. The system fetches the article with a headless Chromium (to get past paywalls and cookie walls), extracts the content while filtering out ads and sidebars, and decomposes the text into verifiable claims. Then the real analysis starts: divergence between headline and body, rhetorical fallacies (false causality, appeal to authority, strawman, bait-and-pivot), source distortion, temporal manipulation, selective citation, authority laundering. It expands the links cited in the article to verify whether the sources actually say what the author claims. It evaluates each claim with consensus from 3 AI models (Claude Sonnet 4.6, GPT-5.4, Gemini 3.1 Pro) through OpenRouter. It detects contextual gaps, coordinated campaigns across outlets, and measures the ratio between passion and evidence. 15 stages in total.</p>
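<p>The consensus step can be sketched roughly like this. To be clear, this is my reconstruction, not Frank Investigator&rsquo;s actual code: the model slugs are placeholders, and only the endpoint shape (OpenRouter&rsquo;s standard chat-completions API) is taken as given.</p>

```python
# Minimal sketch of a 3-model consensus check through OpenRouter.
# Placeholder model slugs -- the post names Claude Sonnet 4.6, GPT-5.4 and
# Gemini 3.1 Pro; check OpenRouter for the exact identifiers.
import json
import os
import urllib.request
from collections import Counter

MODELS = ["anthropic/claude-sonnet", "openai/gpt-5", "google/gemini-pro"]

def ask_model(model: str, claim: str) -> str:
    """One model's verdict on a single extracted claim."""
    payload = {
        "model": model,
        "messages": [{
            "role": "user",
            "content": f"Reply with exactly one word: SUPPORTED, "
                       f"UNSUPPORTED or UNCLEAR.\nClaim: {claim}",
        }],
    }
    req = urllib.request.Request(
        "https://openrouter.ai/api/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"].strip().upper()

def tally(verdicts: list) -> str:
    """Majority vote: keep a verdict only if at least 2 of 3 models agree."""
    label, count = Counter(verdicts).most_common(1)[0]
    return label if count >= 2 else "NO_CONSENSUS"

def consensus(claim: str) -> str:
    return tally([ask_model(m, claim) for m in MODELS])
```

<p>The real pipeline does much more than this single vote (15 stages), but the 2-of-3 rule is the core of the per-claim evaluation.</p>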
<p>The central principle is &ldquo;Truth Above Consensus&rdquo;: a primary source (official data, government document, original academic study) vetoes any number of secondary sources repeating the same information. Ten newspapers repeating the same thing without a primary source still adds up to zero.</p>
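<p>The veto rule is simple enough to sketch in a few lines. This is my own illustration of the principle, not the project&rsquo;s code; the field names and URLs are hypothetical:</p>

```python
# Sketch of the "Truth Above Consensus" rule: any primary source vetoes
# any number of secondaries echoing each other. Field names are mine.
from dataclasses import dataclass

@dataclass
class Source:
    url: str
    primary: bool    # official data, government document, original study
    supports: bool   # does it back the claim?

def evaluate(sources: list) -> str:
    primaries = [s for s in sources if s.primary]
    if primaries:
        # primary evidence decides, however loud the echo chamber is
        return "SUPPORTED" if all(s.supports for s in primaries) else "CONTRADICTED"
    # ten newspapers repeating each other still add up to zero
    return "UNVERIFIED"

echo = [Source(f"https://outlet{i}.example/story", primary=False, supports=True)
        for i in range(10)]
official = Source("https://data.gov.example/series", primary=True, supports=False)

print(evaluate(echo))               # UNVERIFIED
print(evaluate(echo + [official]))  # CONTRADICTED
```

<p>Ten agreeing secondaries come out as &ldquo;unverified,&rdquo; and one contradicting primary flips the whole set: that asymmetry is the entire point of the principle.</p>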
<p>Let me show you five real examples.</p>
<h2>Example 1: The Noelia Castillo case (BBC)<span class="hx:absolute hx:-mt-20" id="example-1-the-noelia-castillo-case-bbc"></span>
    <a href="#example-1-the-noelia-castillo-case-bbc" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/blog/2026/03/27/frank-investigator/noelia-original.png" alt="Original article on the BBC"  loading="lazy" /></p>
<p>The <a href="https://www.bbc.com/portuguese/articles/clyxedlekleo"target="_blank" rel="noopener">BBC published</a> a story with the headline &ldquo;Noelia Castillo dies: a 25-year-old&rsquo;s fight in the Spanish courts against her father to receive euthanasia.&rdquo; At first glance it looks like a story about the right to euthanasia and a family dispute. A young quadriplegic woman who fought against her religious father&rsquo;s opposition to exercise her right to die.</p>
<p>Except when you compare it with other outlets, like <a href="https://veja.abril.com.br/comportamento/a-decisao-extrema-tomada-por-espabhola-que-ficou-paraplegica-apos-agressao-sexual/"target="_blank" rel="noopener">Veja</a>, facts come up that the BBC almost completely omitted. And those facts change everything.</p>
<p>Noelia was taken from her family by the Spanish government at age 13 and placed under state custody. While in that custody, she suffered multiple gang rapes. The sexual violence resulted in serious psychiatric damage and a mental health history that already added up to a 67% disability rating before the events of 2022. When she attempted suicide in October 2022 by jumping from the fifth floor of a building, she was left paraplegic. Her disability rating rose to 74%.</p>
<p>The euthanasia request was approved by Catalonia&rsquo;s Guarantee and Evaluation Commission. The procedure was scheduled for August 2, 2024, but was suspended for over 600 days because of the father&rsquo;s appeals. Five judicial instances ruled on it. The Constitutional Court dismissed any violation of fundamental rights. Spain&rsquo;s Supreme Court denied the appeal. The European Court of Human Rights rejected the suspension request. On Friday, March 26, 2026, Noelia underwent euthanasia at the Sant Camil Residential Hospital, in Catalonia&rsquo;s Garraf region.</p>
<p>But Veja mentions a disturbing detail: Noelia reportedly expressed doubts before the procedure. And the hospital allegedly accelerated the process because her organs were already committed for donation.</p>
<p>The <a href="https://investigator.themakitachronicles.com/investigations/e5a27e016c"target="_blank" rel="noopener">Frank Investigator report</a> compared the coverage from several outlets.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/blog/2026/03/27/frank-investigator/noelia-contexto.png" alt="Event context and omitted facts"  loading="lazy" /></p>
<p>The cross-analysis of the articles showed that some outlets, like the BBC, omitted facts that change the interpretation of the entire case. Others, like Veja, included the full context. The episode of gang sexual violence in October 2022, the criminal investigation, the psychiatric history since age 13, all of that appears in some coverage but is completely absent in others. And among the outlets that omitted these facts, none touched on the ethical question of whether the physical basis for the euthanasia request derives from a suicide attempt.</p>
<p>The convergent framing across outlets is one of &ldquo;judicial battle,&rdquo; &ldquo;death she asked for,&rdquo; &ldquo;to stop suffering,&rdquo; &ldquo;to leave in peace,&rdquo; softening the definitive nature of the procedure and positioning Noelia as the heroic protagonist and the father as the obstructive antagonist. The father, Gerônimo Castillo, and his Christian Lawyers are labeled &ldquo;ultra-Catholics&rdquo; or &ldquo;ultra-conservatives&rdquo; without any independent source backing that editorial classification.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/blog/2026/03/27/frank-investigator/noelia-coordenada.png" alt="Coordinated narrative analysis - Noelia"  loading="lazy" /></p>
<p>Narrative coordination came in at 55%. It doesn&rsquo;t seem to be active coordination between newsrooms, but thematic editorial alignment: everyone bought the autonomy-of-the-individual narrative without questioning it. What raises the score is the convergence of omissions. No outlet explains the legal distinction between the ECHR &ldquo;actively authorizing&rdquo; the euthanasia and simply refusing the provisional protective measures the father requested, which is what actually happened. The medical and bioethical implications of approving euthanasia in a case stemming from a suicide attempt show up nowhere. And any voice critical of the procedure is automatically framed as religious or ideological, never as medical or legal.</p>
<p>It&rsquo;s the kind of case where the omission is the manipulation. The outlets that omitted these facts didn&rsquo;t lie at any point. But by framing it as a &ldquo;family dispute over euthanasia rights&rdquo; and omitting the causal chain (state custody → gang rapes → psychiatric damage → suicide attempt → paraplegia → euthanasia), the reader walks away with an impression that&rsquo;s radically different from reality. Comparing coverage is exactly the kind of thing the investigator does well: exposing what each outlet chose to show and what it chose to hide.</p>
<h2>Example 2: &ldquo;Government cuts taxes on nearly a thousand imports&rdquo; (UOL)<span class="hx:absolute hx:-mt-20" id="example-2-government-cuts-taxes-on-nearly-a-thousand-imports-uol"></span>
    <a href="#example-2-government-cuts-taxes-on-nearly-a-thousand-imports-uol" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/blog/2026/03/27/frank-investigator/corte-original.png" alt="Original article on UOL"  loading="lazy" /></p>
<p><a href="https://economia.uol.com.br/noticias/redacao/2026/03/26/de-remedios-a-lupulo-governo-corta-imposto-de-mil-importados.ghtm"target="_blank" rel="noopener">UOL published</a> that the government cut import taxes on nearly a thousand products, from medicine to hops. Positive headline, framed as a benefit to the consumer. Several other outlets ran the same thing with similar framing: the government did something good, prices are going down.</p>
<p>Except <a href="https://www.gazetadopovo.com.br/economia/governo-aumenta-imposto-importacao-recua-fake-news/"target="_blank" rel="noopener">Gazeta do Povo</a> tells the other half of the story. The government didn&rsquo;t cut old taxes. What actually happened was: at some point before February 2025, the government raised import tariffs on more than 1,200 items, a measure that would have generated an estimated R$ 14 billion in revenue. Then, under social-media and public pressure, it partially backed down. Tariffs on around 970 capital goods, computing and telecommunications items were dropped to zero. Taxes on 120 IT products were reduced. And now they&rsquo;re calling that a &ldquo;tax cut.&rdquo;</p>
<p>In other words: they raised taxes, took public pressure, partially walked it back, and rebranded it as a generous concession. The majority of the 1,200+ items that had tariffs raised still have higher tariffs than before. No final consumer prices went down. Prices went back to where they were for some products, and remain higher for most.</p>
<p>The <a href="https://investigator.themakitachronicles.com/investigations/f35bfe0176"target="_blank" rel="noopener">Frank Investigator report</a> cross-checked the articles and exposed what was omitted.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/blog/2026/03/27/frank-investigator/corte-contexto.png" alt="Context and omitted facts - tax cut"  loading="lazy" /></p>
<p>What the comparison between articles exposes: none of the outlets with the positive framing mentions that most of the 1,200+ items with raised tariffs still carry higher tariffs after the two rounds of &ldquo;cuts.&rdquo; Nobody calculates the fiscal impact on the R$ 14 billion revenue target from the original increases. Nobody quotes voices from the domestic industry that may be hurt by the tariff reduction on imported competitors. And the Gecex&rsquo;s objective criteria for defining &ldquo;insufficient supply in the internal market&rdquo; never show up.</p>
<p>The two articles analyzed build the same positive framing for the government. The paradox that the &ldquo;cuts&rdquo; are a partial reversal of increases made by the same government the previous year stays buried or simply absent.</p>
<p>It&rsquo;s the classic kind of manipulation through reframing. Nobody lied. But &ldquo;government cuts taxes&rdquo; and &ldquo;government backs down on tax hike after public pressure&rdquo; describe the same event with opposite impressions. The editorial choice of which version to publish IS the manipulation.</p>
<h2>Example 3: &ldquo;Globo apologizes for PowerPoint&rdquo; (Brasil 247)<span class="hx:absolute hx:-mt-20" id="example-3-globo-apologizes-for-powerpoint-brasil-247"></span>
    <a href="#example-3-globo-apologizes-for-powerpoint-brasil-247" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/blog/2026/03/27/frank-investigator/globo-original.png" alt="Original article on Brasil 247"  loading="lazy" /></p>
<p>This case blew up over the past few days. GloboNews showed a diagram on the Estúdio I program connecting President Lula, his ministers and Daniel Vorcaro, owner of Banco Master, who&rsquo;s at the center of documented fraud. Globo later issued a public retraction, called the material &ldquo;erroneous and incomplete,&rdquo; and fired an editor.</p>
<p><a href="https://www.brasil247.com/brasil/globo-se-desculpa-por-powerpoint-que-tentou-jogar-o-caso-master-no-colo-de-lula"target="_blank" rel="noopener">Brasil 247</a> published a story with the headline &ldquo;Globo apologizes for PowerPoint that tried to dump the Master case on Lula&rsquo;s lap.&rdquo;</p>
<p>The <a href="https://investigator.themakitachronicles.com/investigations/752d80653a"target="_blank" rel="noopener">Frank Investigator report</a> exposed what&rsquo;s going on under the hood:</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/blog/2026/03/27/frank-investigator/globo-resumo.png" alt="Investigation summary - Globo"  loading="lazy" /></p>
<p>The central fact is real: Globo apologized and fired someone. But Brasil 247&rsquo;s framing goes way beyond what the facts support. The headline saying &ldquo;tried to dump it on Lula&rdquo; attributes deliberate intent where the documents point to an editorial mistake. Globo&rsquo;s retraction described the material as &ldquo;erroneous and incomplete,&rdquo; not as an attempt to incriminate anyone.</p>
<p>What stands out in this case is the coordinated campaign. The investigator gave it 62% narrative coordination.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/blog/2026/03/27/frank-investigator/globo-coordenada.png" alt="Coordinated narrative analysis - Globo"  loading="lazy" /></p>
<p>Several outlets editorially aligned with the government used the phrase &ldquo;without evidence&rdquo; in a convergent way to describe the association between Lula and the Master case. All of them focused on Globo&rsquo;s mistake as the central narrative point, instead of investigating the actual connections. No outlet mentioned which other political names were excluded from the original PowerPoint. None investigated Vorcaro&rsquo;s documented connections with different spheres of power. The focus is meta-journalistic: they criticize the broadcaster instead of covering the scandal.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/blog/2026/03/27/frank-investigator/globo-retorica.png" alt="Rhetorical analysis - Globo"  loading="lazy" /></p>
<p>The fallacies detected: loaded language (&ldquo;tried to dump,&rdquo; &ldquo;without evidence&rdquo; used to frame an editorial mistake as a deliberate political attack), false causality (Globo&rsquo;s retraction doesn&rsquo;t prove the connections are false), cherry-picking (highlights the omission of names tied to the Lula government without contextualizing which other names were omitted), and bait-and-pivot (uses Globo&rsquo;s apology as a hook to minimize the Banco Master scandal).</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/blog/2026/03/27/frank-investigator/globo-lacunas.png" alt="Contextual gaps - Globo"  loading="lazy" /></p>
<p>And the questions none of these outlets asked: which Supreme Court justices and politicians from other parties were also excluded from the PowerPoint? Does the Master case have documented connections to Lula government figures, or are they limited to previous administrations? Does Brasil 247, which is publishing this article, have a declared editorial alignment with the Lula government? What was the journalism Ethics Council&rsquo;s reaction?</p>
<p>Overall confidence: 13%. The article doesn&rsquo;t fabricate facts. But it selects, frames and omits in a way that builds a narrative the data doesn&rsquo;t support.</p>
<h2>Example 4: &ldquo;Why expensive fuel is good&rdquo; (Folha)<span class="hx:absolute hx:-mt-20" id="example-4-why-expensive-fuel-is-good-folha"></span>
    <a href="#example-4-why-expensive-fuel-is-good-folha" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/blog/2026/03/27/frank-investigator/combustivel-original.png" alt="Original article on Folha"  loading="lazy" /></p>
<p>The economist Bernardo Guimarães published a <a href="https://www1.folha.uol.com.br/colunas/bernardo-guimaraes/2026/03/por-que-combustivel-caro-e-bom.shtml"target="_blank" rel="noopener">column on Folha</a> arguing that expensive fuel is good for society because it stimulates innovation in clean energy. He cites real academic articles (Popp 2002, NBER) and has verifiable academic credentials (PhD from Yale, professor at FGV EESP). It looks solid.</p>
<p>The <a href="https://investigator.themakitachronicles.com/investigations/e6cd2ac867"target="_blank" rel="noopener">full Frank Investigator report</a> shows another picture.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/blog/2026/03/27/frank-investigator/combustivel-summary.png" alt="Investigation summary - fuel"  loading="lazy" /></p>
<p>The article is right to cite real studies. But it omits who pays the bill. The column completely ignores the distributive impact: low-income populations and residents of peripheral and rural areas, who depend more on personal vehicles and have less access to clean alternatives. For an economist, ignoring distributive effects is either incompetence or an editorial choice.</p>
<p>There&rsquo;s a worse problem. The empirical evidence he cites (US patents, electric vehicle data in California between 2014-2017) is transposed to Brazil with no caveat at all. Brazil has an ethanol matrix and flex-fuel infrastructure that completely changes the causal mechanism. The article treats it as if the Brazilian consumer were in the same situation as the Californian one, which is false.</p>
<p>And there&rsquo;s the context the article mentions in passing but doesn&rsquo;t develop: the war in Iran is making fuel prices rise around the world. Brazil should be in a privileged position thanks to the pre-salt and to ethanol. But decades of mismanagement and corruption at Petrobras mean we&rsquo;re paying the same price as the rest of the world. Instead of questioning that, the column sells the idea that &ldquo;at least it&rsquo;ll stimulate clean energy.&rdquo; It&rsquo;s a rationalization of a problem that shouldn&rsquo;t exist.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/blog/2026/03/27/frank-investigator/combustivel-lacunas.png" alt="Contextual gaps - fuel"  loading="lazy" /></p>
<p>The investigator identified 5 questions the article refused to address, with 35% contextual completeness. The fallacies detected include false dilemma (presents expensive fuel as the only viable climate policy), cherry-picking (acknowledges short-term inelasticity but emphasizes only long-term effects), and loaded language (describes alternatives as &ldquo;playing at planting a little sapling&rdquo;).</p>
<p>Overall confidence: 25%. It isn&rsquo;t fabricated disinformation. It&rsquo;s opinion with cherry-picked evidence in favor of the thesis and omission of relevant counterpoints.</p>
<h2>Example 5: &ldquo;There&rsquo;s no strong cinema without streaming regulation&rdquo; (O Globo)<span class="hx:absolute hx:-mt-20" id="example-5-theres-no-strong-cinema-without-streaming-regulation-o-globo"></span>
    <a href="#example-5-theres-no-strong-cinema-without-streaming-regulation-o-globo" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/blog/2026/03/27/frank-investigator/cinema-original.png" alt="Original article on O Globo"  loading="lazy" /></p>
<p>Renata Magalhães, president of the Brazilian Academy of Cinema, gave <a href="https://oglobo.globo.com/blogs/miriam-leitao/post/2026/03/renata-magalhaes-nao-existe-cinema-forte-sem-regulamentacao-do-streaming.ghtml"target="_blank" rel="noopener">an interview in Miriam Leitão&rsquo;s column on O Globo</a> arguing that regulating streaming is a necessary condition for strengthening Brazilian cinema.</p>
<p>The <a href="https://investigator.themakitachronicles.com/investigations/7e4f5605c5"target="_blank" rel="noopener">Frank Investigator report</a> found serious problems.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/blog/2026/03/27/frank-investigator/cinema-resumo.png" alt="Investigation summary - cinema"  loading="lazy" /></p>
<p>First: the central claim that &ldquo;many films have low audiences&rdquo; appears with no empirical data. No numbers, no historical series. It&rsquo;s a pure authority claim with no analytical basis.</p>
<p>Second: there&rsquo;s an internal contradiction the article doesn&rsquo;t resolve. The text opens by saying that Brazilian cinema is &ldquo;in the international spotlight&rdquo; (awards, festivals), then immediately argues that the absence of regulation prevents the industry from being strengthened. But if Brazilian cinema is already winning international awards without that regulation, the argument that regulation is a necessary condition collapses. The article doesn&rsquo;t address that contradiction.</p>
<p>And here&rsquo;s the elephant in the room that none of these articles mentions: people simply don&rsquo;t want to watch most of these films. Internationally awarded Brazilian cinema is made to compete in Cannes and at the Oscars, not to fill movie theaters in Brazil. Instead of asking why the Brazilian audience isn&rsquo;t interested, the industry would rather ask for streaming regulation to force platforms to fund and screen content that has no spontaneous audience. It&rsquo;s the classic playbook: use public money and regulation to keep alive an industry that doesn&rsquo;t sustain itself in the market.</p>
<p>The interviewee is the president of the Brazilian Academy of Cinema. She has a direct institutional interest in the regulation. The article doesn&rsquo;t present any opposing voice and doesn&rsquo;t discuss the costs to consumers: subscription price increases, catalog reduction. A single source, with a declared conflict of interest, with no counterpoint.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/blog/2026/03/27/frank-investigator/cinema-coordenada.png" alt="Coordinated narrative analysis - cinema"  loading="lazy" /></p>
<p>And here comes the coordinated campaign, with 55% narrative coordination. Multiple outlets reproduce the same emotional framing: cinema &ldquo;at risk,&rdquo; regulation as &ldquo;essential.&rdquo; None of the identified sources discusses comparable empirical international evidence on the effectiveness of content quotas (Europe has experiences with contradictory results). None mentions the Brazilian Academy of Cinema&rsquo;s conflicts of interest. The fact that the internationally awarded Brazilian productions were made without the proposed regulation is omitted convergently across all outlets. Only one isolated site (targethd.net) mentioned negative impacts on consumers.</p>
<p>Overall confidence: 9%. The article is legitimate editorial advocacy, but with analytical flaws that limit its informational value to nearly zero.</p>
<h2>What the investigator analyzes<span class="hx:absolute hx:-mt-20" id="what-the-investigator-analyzes"></span>
    <a href="#what-the-investigator-analyzes" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>The 5 examples above show different patterns, but the analysis criteria are the same.</p>
<p>The strongest signal of manipulation is omission. It isn&rsquo;t what the article says that misleads, it&rsquo;s what it leaves out. The contextual gap analysis identifies the questions the article should have answered and didn&rsquo;t, and searches for counter-evidence on each one. In the examples above, articles that omit most of the relevant context aren&rsquo;t informing anyone. And when cross-comparison between outlets shows that some covered the facts and others didn&rsquo;t, like in the Noelia case or the tax cut, it&rsquo;s hard to argue that the omission was accidental.</p>
<p>Then comes the detection of coordinated campaigns. Several newspapers covering the same subject is normal. All of them using the same loaded language, focusing on the same points and omitting the same counterpoints at the same time isn&rsquo;t. The strongest signal of coordination isn&rsquo;t what outlets say in common, but what they omit in common.</p>
<p>There&rsquo;s also reframing, which is more subtle. In the tax case, the government raised tariffs, backed down under pressure, and the outlets called it a &ldquo;cut.&rdquo; Nobody lied technically, but the choice of framing completely changes the interpretation. This kind of manipulation is harder to detect because each individual statement is defensible.</p>
<p>The rhetorical fallacies catch specific constructions: false dilemma (&ldquo;either you regulate or cinema dies&rdquo;), bait-and-pivot (open with a positive fact and pivot to a crisis narrative), loaded language (&ldquo;without evidence&rdquo; used to attribute intent). Each detected fallacy comes with the exact citation of the passage and the explanation of why that construction is problematic.</p>
<p>And there&rsquo;s the principle that ties it all together: if 10 outlets repeat the same claim citing each other, the LLM consensus has to reflect that the chain of evidence is circular, not that the claim is well supported. Volume of coverage is not a proxy for truth.</p>
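<p>That circularity check can be pictured as a walk along citation chains. A hypothetical Ruby sketch (not the actual Frank Investigator code; the data shape is an assumption for illustration):</p>

```ruby
# Hypothetical sketch of the circular-evidence check: follow each
# outlet's citation chain down to its root. If every chain bottoms out
# at the same origin, ten articles "confirming" a claim are one voice.
def evidence_roots(cites)
  # cites maps each outlet to the source it cites (nil = primary source)
  cites.keys.map { |outlet|
    seen = []
    while cites[outlet] && !seen.include?(outlet)
      seen << outlet            # guard against genuine citation cycles
      outlet = cites[outlet]
    end
    outlet
  }.uniq
end

roots = evidence_roots(
  "outletA" => "wire", "outletB" => "wire",
  "outletC" => "outletA", "wire" => nil
)
puts roots.inspect
```

<p>If the result is a single origin for ten outlets, the consensus step should weigh them as one source, not ten.</p>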
<h2>Why you can&rsquo;t &ldquo;just ask ChatGPT&rdquo;<span class="hx:absolute hx:-mt-20" id="why-you-cant-just-ask-chatgpt"></span>
    <a href="#why-you-cant-just-ask-chatgpt" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>The first reaction many people will have is: &ldquo;why do I need this if I can just paste the article into ChatGPT and ask it to analyze?&rdquo;</p>
<p>Try it. Take a political article and ask ChatGPT to criticize it. It&rsquo;ll criticize. Take the same article and ask it to confirm. It&rsquo;ll find arguments to confirm. The LLM isn&rsquo;t searching for truth. It&rsquo;s predicting which response you most likely want to hear given the framing of your question. If you ask &ldquo;analyze the problems with this article,&rdquo; the model will find problems. If you ask &ldquo;is this article correct?&rdquo;, it&rsquo;ll find merits. It&rsquo;s automated confirmation bias.</p>
<p>There&rsquo;s another problem. General-purpose LLMs are trained to be agreeable. The sycophantic tendency (agreeing with the user) is documented in every large model. If your conversation history indicates you&rsquo;re left-leaning, the model tends to frame answers in a way that pleases that profile. If you&rsquo;re right-leaning, same thing in the other direction. It&rsquo;s not lying on purpose. It&rsquo;s optimizing for user satisfaction, which is literally the metric it was trained for via RLHF.</p>
<p>And worse: LLMs hallucinate. If they don&rsquo;t have enough evidence to support the answer they think you want to hear, they invent it. They fabricate plausible citations and data, attribute statements to people who never made them. If you ask it to criticize an article about fuel, it might invent a fictional study that &ldquo;proves&rdquo; the opposite of the article. It sounds convincing. But it doesn&rsquo;t exist.</p>
<p>Frank Investigator was built precisely to avoid these problems. The first design decision is that no human asks the LLM a question. There&rsquo;s no open-ended prompt like &ldquo;analyze this article.&rdquo; Each step in the pipeline has structured prompts that ask for specific analyses: &ldquo;list the rhetorical fallacies in this passage,&rdquo; &ldquo;identify which contextual information is missing,&rdquo; &ldquo;compare the headline with the body.&rdquo; The model doesn&rsquo;t know whether the operator agrees or disagrees with the article, because the operator never expresses an opinion. That eliminates confirmation bias at the root.</p>
<p>To deal with hallucination, every analyzer that uses an LLM includes the instruction &ldquo;CRITICAL — NO HALLUCINATION: Only reference URLs, sources, claims, quotes, and data that are EXPLICITLY present in the input provided to you. Do not invent, guess, or fabricate any URL, source name, statistic, quote, or claim. If you cannot verify something from the provided text, mark it as unverifiable — never fill in details.&rdquo; It doesn&rsquo;t eliminate hallucination completely, but it cuts it down a lot. And since 3 models from different companies (Anthropic, OpenAI, Google) answer the same questions, when one hallucinates the other two usually disagree. The consensus is weighted by confidence, not by simple majority. If two models say &ldquo;supported&rdquo; with 70% confidence and one says &ldquo;mixed&rdquo; with 95%, the &ldquo;mixed&rdquo; weighs more. The further apart the disagreement, the bigger the penalty on final confidence. If one model starts giving inconsistent answers, it gets put in quarantine and the other two carry on.</p>
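<p>As a rough illustration of that weighting scheme (the names and the exact formula below are my assumptions, not the production code), the idea looks something like this:</p>

```ruby
# Illustrative sketch of confidence-weighted consensus: labels compete
# by average confidence, so one very confident dissenter can outweigh
# two lukewarm agreers, and disagreement penalizes the final score.
Verdict = Struct.new(:label, :confidence)

def consensus(verdicts)
  # Average confidence per label; the highest average wins.
  by_label = verdicts.group_by(&:label)
                     .transform_values { |vs| vs.sum(&:confidence) / vs.size }
  winner, weight = by_label.max_by { |_, w| w }

  # The wider the spread between the most and least confident model,
  # the bigger the penalty on the final confidence.
  min, max = verdicts.map(&:confidence).minmax
  [winner, (weight * (1.0 - (max - min))).round(2)]
end
```

<p>With two <code>:supported</code> verdicts at 0.70 and one <code>:mixed</code> at 0.95, the <code>:mixed</code> label wins, and the disagreement penalty pulls the final confidence well below 0.95.</p>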
<p>But the safeguard I consider most important is the primary source veto. If a primary source (IBGE data, court ruling, original study) contradicts a claim, confidence is capped at 60% and the verdict is forced to &ldquo;mixed,&rdquo; even if all 3 LLMs say &ldquo;supported.&rdquo; Ten newspaper articles repeating a claim don&rsquo;t override one official datum that contradicts it. Along the same lines, if 5 stories &ldquo;confirm&rdquo; a claim but they all come from the same editorial group (Folha/UOL, Globo/G1/Valor), the system knows they&rsquo;re the same voice and reduces the weight. Volume doesn&rsquo;t replace independence.</p>
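<p>The veto rule itself is tiny. A sketch of the shape (the method and argument names here are hypothetical, not Frank Investigator's actual API):</p>

```ruby
# Illustrative primary-source veto: a contradicting primary source
# (official data, court ruling, original study) caps confidence at 60%
# and forces the verdict to :mixed, no matter how many LLMs agreed.
def apply_primary_source_veto(verdict, confidence, primary_contradicts:)
  return [verdict, confidence] unless primary_contradicts

  [:mixed, [confidence, 0.60].min]
end
```

<p>The point of hard-coding it outside the LLM loop is that no amount of model agreement can argue its way past the official datum.</p>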
<p>None of this makes the system perfect. But it&rsquo;s categorically different from pasting text into ChatGPT and asking &ldquo;what do you think?&rdquo;</p>
<h2>The numbers<span class="hx:absolute hx:-mt-20" id="the-numbers"></span>
    <a href="#the-numbers" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Commits</td>
          <td>129</td>
      </tr>
      <tr>
          <td>Active days of work</td>
          <td>~9</td>
      </tr>
      <tr>
          <td>Lines of Ruby (code)</td>
          <td>19,444</td>
      </tr>
      <tr>
          <td>Lines of test code</td>
          <td>9,190</td>
      </tr>
      <tr>
          <td>Test files</td>
          <td>108</td>
      </tr>
      <tr>
          <td>Total lines (code)</td>
          <td>24,301</td>
      </tr>
      <tr>
          <td>Services (analyzers + services)</td>
          <td>80</td>
      </tr>
      <tr>
          <td>Misinformation analyzers</td>
          <td>15</td>
      </tr>
      <tr>
          <td>ActiveRecord models</td>
          <td>14</td>
      </tr>
      <tr>
          <td>Background jobs</td>
          <td>19</td>
      </tr>
      <tr>
          <td>Database migrations</td>
          <td>31</td>
      </tr>
      <tr>
          <td>Pipeline stages</td>
          <td>15</td>
      </tr>
      <tr>
          <td>LLM models in consensus</td>
          <td>3</td>
      </tr>
      <tr>
          <td>Locales</td>
          <td>2 (en, pt-BR)</td>
      </tr>
  </tbody>
</table>
<p>Stack: Ruby 4.0.1, Rails 8.1.2, SQLite with WAL mode, Solid Queue (jobs inside Puma), Solid Cable (WebSockets), Tailwind CSS v4, headless Chromium via Ferrum CDP, deploy with Kamal to GitHub Container Registry. AGPL-3.0.</p>
<h2>The development process<span class="hx:absolute hx:-mt-20" id="the-development-process"></span>
    <a href="#the-development-process" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>The project has 129 commits in 9 active days of work (March 11-16 and 25-27). The first day was the heaviest: more than 60 commits on March 11 alone, going from zero to a working system with content extraction, claim decomposition, LLM evaluation, and a web interface with live updates via Turbo Streams.</p>
<p>The commits tell the story. It started with the foundation:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code>1e32d5a - Initial Frank Investigator foundation
564eb97 - Add recursive source crawling and RubyLLM scaffold
c8c5357 - Add Brazil source registry and authority connectors
3d30617 - Add U.S. authority profiles, source role modeling, and specialized connectors</code></pre></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>Then came the misinformation analyzers, one by one:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code>a65fef7 - Add rhetorical fallacy analyzer for detecting narrative manipulation
5dcf99d - Add headline-body divergence detection and headline citation amplification
a113efd - Add smear campaign defense with circular citation and viral volume detection
56d501a - Add media ownership modeling, syndication detection, and independence analysis</code></pre></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>The hardening phase against noise and false positives is the one that took the most time:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code>46e4bc5 - Add pre-fetch defenses: URL filtering, circuit breaker, fetch prioritization
168fea1 - Add post-fetch content gate, claim noise filter, and duplicate content skip
b2843d7 - Add paywall detection, pricing noise filter, and ofertas.* host rejection
b1b230d - Rewrite Chromium fetcher with Ferrum CDP for anti-bot evasion</code></pre></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>Then came the interface, deployment, and the more advanced analyzers:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code>c5afcea - Replace custom CSS with Tailwind CSS v4 and rewrite all templates
95a8ec4 - Add Docker deployment with bin/deploy script
ccfacc9 - Add coordinated narrative detection across media outlets
ba3e2f4 - Add 6 new misinformation detection analyzers with cross-analysis
fda984b - Add LLM-generated investigation summary with quality assessment
a4ebe78 - Add contextual gap analysis to detect manipulation through omission</code></pre></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>And in the most recent commits, simultaneous 3-model consensus and test optimization:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code>5cc9c05 - Enable 3-model LLM consensus: Sonnet 4.6, GPT-4.1 Mini, Gemini 2.5 Pro
cdf5fb5 - Batch 5 content analyzers into single LLM call, add anti-hallucination
28c305e - Add WebMock stubs for LLM and web search — tests run in 1.4s (was 540s)</code></pre></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>The tests that used to run in 9 minutes now run in 1.4 seconds with WebMock stubbing of every LLM and web search call. That made a huge difference in iteration speed.</p>
<h2>Limitations<span class="hx:absolute hx:-mt-20" id="limitations"></span>
    <a href="#limitations" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>News fact-checking is a hard problem. A single article doesn&rsquo;t contain everything needed for a complete analysis. The author chose what to include and what to omit, and that choice is already the first level of manipulation. The sources cited inside the article were selected by the author to support their narrative, not to give a balanced view.</p>
<p>Frank Investigator doesn&rsquo;t tell you what&rsquo;s true and what&rsquo;s false. The result is a report with strong and weak points, not a &ldquo;true&rdquo; or &ldquo;false&rdquo; stamp. Even with all the safeguards I described above, the counter-evidence searched for automatically may not be the most relevant. The fallacies detected can have false positives. The coordinated campaign analysis depends on what web search returns at the moment of the lookup.</p>
<p>Use the reports as a starting point to form your own opinion, not as a final verdict.</p>
<p>The project is open source, AGPL-3.0. If you want to contribute, test, report bugs or suggest improvements: <a href="https://github.com/akitaonrails/frank_investigator" target="_blank" rel="noopener">GitHub</a>. If you want to follow the analyses, the <a href="https://themakitachronicles.com/" target="_blank" rel="noopener">Makita Chronicles</a> newsletter is going to have a &ldquo;Notícias Duvidosas&rdquo; (Dubious News) section with the summary and a link to the full report of each investigation.</p>
]]></content:encoded><category>ruby</category><category>rails</category><category>ai</category><category>fact-checking</category><category>open-source</category><category>vibe-coding</category></item><item><title>I Rewrote OpenClaw in Rust. Did It Work? | FrankClaw</title><link>https://www.akitaonrails.com/en/2026/03/16/rewrote-openclaw-in-rust-frankclaw/</link><guid isPermaLink="true">https://www.akitaonrails.com/en/2026/03/16/rewrote-openclaw-in-rust-frankclaw/</guid><pubDate>Mon, 16 Mar 2026 08:00:00 GMT</pubDate><description>&lt;p&gt;Before anything else: FrankClaw is still in heavy alpha. It works for simple tasks, but I haven&amp;rsquo;t tested complex workflows. If you want to help, open Issues on &lt;a href="https://github.com/akitaonrails/frankclaw"target="_blank" rel="noopener"&gt;GitHub&lt;/a&gt; with whatever you find. There&amp;rsquo;s a lot to test.&lt;/p&gt;
&lt;p&gt;It &amp;ldquo;works,&amp;rdquo; but this project was more about the exercise. With that out of the way, let me tell you why I did it.&lt;/p&gt;
&lt;h2&gt;The problem with OpenClaw&lt;span class="hx:absolute hx:-mt-20" id="the-problem-with-openclaw"&gt;&lt;/span&gt;
&lt;a href="#the-problem-with-openclaw" class="subheading-anchor" aria-label="Permalink for this section"&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;&lt;a href="https://github.com/openclaw/openclaw"target="_blank" rel="noopener"&gt;OpenClaw&lt;/a&gt; is a gateway that connects messaging channels (Telegram, Discord, Slack, WhatsApp, etc.) to AI providers (OpenAI, Anthropic, Ollama). You configure it, bring up the server, and you can chat with LLMs straight from Telegram or any other channel. It&amp;rsquo;s a popular project with plenty of activity.&lt;/p&gt;</description><content:encoded><![CDATA[<p>Before anything else: FrankClaw is still in heavy alpha. It works for simple tasks, but I haven&rsquo;t tested complex workflows. If you want to help, open Issues on <a href="https://github.com/akitaonrails/frankclaw"target="_blank" rel="noopener">GitHub</a> with whatever you find. There&rsquo;s a lot to test.</p>
<p>It &ldquo;works,&rdquo; but this project was more about the exercise. With that out of the way, let me tell you why I did it.</p>
<h2>The problem with OpenClaw<span class="hx:absolute hx:-mt-20" id="the-problem-with-openclaw"></span>
    <a href="#the-problem-with-openclaw" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p><a href="https://github.com/openclaw/openclaw" target="_blank" rel="noopener">OpenClaw</a> is a gateway that connects messaging channels (Telegram, Discord, Slack, WhatsApp, etc.) to AI providers (OpenAI, Anthropic, Ollama). You configure it, bring up the server, and you can chat with LLMs straight from Telegram or any other channel. It&rsquo;s a popular project with plenty of activity.</p>
<p>Too much activity, actually.</p>
<p>I did a depth-1 clone of the repo and ran <code>tokei</code>:</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>TypeScript files (without tests)</td>
          <td>3,794</td>
      </tr>
      <tr>
          <td>Lines of TypeScript</td>
          <td>~1,247,000</td>
      </tr>
      <tr>
          <td>Test files</td>
          <td>2,799</td>
      </tr>
      <tr>
          <td>Dependencies (root package.json)</td>
          <td>73</td>
      </tr>
  </tbody>
</table>
<p>Over a million lines of TypeScript. The 2,799 test files sound like a lot in absolute numbers, but proportional to the codebase size, the coverage is low. Most of the code lives in 29 packages of a monorepo with 21 channel extensions.</p>
<p>I went looking for more commits to understand the development pace. In the 100 commits I managed to pull, all of them landed in just 2 days (March 9 and 10). ~50 commits per day, from 42 different contributors. Vibe coding to the extreme.</p>
<p>The conclusion is the one you&rsquo;re imagining: enormous volumes of AI-generated code being dumped into a repository at a speed that makes serious human review impossible. And that bothered me enough to go investigate further.</p>
<h2>The security audit<span class="hx:absolute hx:-mt-20" id="the-security-audit"></span>
    <a href="#the-security-audit" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>I asked Claude to do a complete security audit of OpenClaw&rsquo;s code. The <a href="https://github.com/akitaonrails/frankclaw/blob/master/docs/OPENCLAW_SECURITY_AUDIT.md" target="_blank" rel="noopener">report</a> found:</p>
<table>
  <thead>
      <tr>
          <th>Severity</th>
          <th>Count</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CRITICAL</td>
          <td>7</td>
      </tr>
      <tr>
          <td>HIGH</td>
          <td>9</td>
      </tr>
      <tr>
          <td>MEDIUM</td>
          <td>12</td>
      </tr>
      <tr>
          <td>LOW</td>
          <td>6</td>
      </tr>
      <tr>
          <td>INFO</td>
          <td>5</td>
      </tr>
  </tbody>
</table>
<p>Seven critical vulnerabilities. Let me list a few:</p>
<ul>
<li>Timing side-channel in token comparison — <code>safeEqualSecret()</code> does an early return on the type-check, letting an attacker distinguish malformed tokens from wrong tokens by measuring latency.</li>
<li><code>eval()</code> in the browser tool — arbitrary JavaScript execution with no sandbox.</li>
<li>Shell with no allowlist — any tool can run any command on the system.</li>
<li>Slack webhooks with no signature verification at all.</li>
<li>Transcripts and config in plaintext on disk, no encryption.</li>
<li>No effective rate limiting — IPs can be spoofed if the operator configures trusted proxies broadly.</li>
</ul>
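<p>The first bullet deserves a concrete illustration. The standard fix is a comparison whose running time does not depend on where (or whether) the inputs differ. A hand-rolled Ruby sketch of the idea (in practice you would reach for <code>OpenSSL.fixed_length_secure_compare</code>):</p>

```ruby
# Constant-time secret comparison: XOR every byte pair and OR the
# results together. The loop always runs to the end, so latency does
# not reveal the position of the first mismatch.
def constant_time_equal?(a, b)
  return false unless a.bytesize == b.bytesize

  diff = 0
  a.bytes.zip(b.bytes) { |x, y| diff |= x ^ y }
  diff.zero?
end
```

<p>The early return on length is the usual compromise: the length of a token or digest is not considered secret, only its contents.</p>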
<p>Those are just the ones Claude found in an automated scan. There are probably more.</p>
<p>I&rsquo;m not going to run that on my machine. Not even inside a Docker container. A gateway that receives webhooks from the internet, executes shell commands, connects to AI APIs with your keys, and stores conversation history, all of it with 7 known critical vulnerabilities? No, thanks.</p>
<p>So I did what any rational developer would do: I decided to build my own.</p>
<h2>The first attempt: Claude Code<span class="hx:absolute hx:-mt-20" id="the-first-attempt-claude-code"></span>
    <a href="#the-first-attempt-claude-code" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>I started the most direct way. Cloned OpenClaw, pointed Claude Code at the code, and asked: &ldquo;rewrite this in Rust.&rdquo;</p>
<p>Didn&rsquo;t work. The codebase is too big. Over a million lines of TypeScript spread across 29 packages. Claude can&rsquo;t keep all of it in context. The initial result was very incomplete: lots of types created but no implementation, <code>todo!()</code> everywhere, too much boilerplate and not enough functionality.</p>
<p>I switched to Codex 5.4 to test. Same thing: I asked it to analyze and rewrite. It improved a bit in certain aspects, but the fundamental problem is the same. No AI today takes a project of that size and rewrites it in one go. The context doesn&rsquo;t fit.</p>
<h2>The technique that actually works<span class="hx:absolute hx:-mt-20" id="the-technique-that-actually-works"></span>
    <a href="#the-technique-that-actually-works" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>What works is going slowly. One step at a time.</p>
<p>Ask Claude (or Codex) to analyze the original code in stages. Make a long plan detailing each feature. Then implement one feature at a time in Rust, with tests, commit, and repeat. It&rsquo;s tedious, but it produces code that compiles and actually works.</p>
<p>The reason is simple: the original code is so massive that no AI agent (and not even several in parallel) can keep all of it in context at the same time. You have to decide what matters, implement that, validate, and move on to the next thing.</p>
<p>And you have to decide what to cut. OpenClaw has 21 channel extensions: Google Chat, iMessage, IRC, Teams, Matrix, Mattermost, Nostr, Twitch&hellip; I don&rsquo;t need any of those. I kept the mainstream channels: Web, Telegram, Discord, Slack, Signal, WhatsApp and Email. TTS? Out. Polls? Out. WhatsApp Web through Baileys? Out, I use the official Cloud API. They&rsquo;re features that add complexity without proportional value.</p>
<h2>Discovering IronClaw<span class="hx:absolute hx:-mt-20" id="discovering-ironclaw"></span>
    <a href="#discovering-ironclaw" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>In the middle of development, I came across <a href="https://github.com/nichochar/iron-claw" target="_blank" rel="noopener">IronClaw</a>, which presents itself as &ldquo;OpenClaw in Rust.&rdquo; Great, I thought. Let me see what they did.</p>
<p>I cloned the repo and asked Claude to write a <a href="https://github.com/akitaonrails/frankclaw/blob/master/docs/IRONCLAW_COMPARISON.md" target="_blank" rel="noopener">comparison report</a>. IronClaw has good things. I adopted 12 features:</p>
<ul>
<li>Circuit breaker with retry and exponential backoff for LLM provider resilience.</li>
<li>Credential leak detection in the output.</li>
<li>LLM response cache with SHA-256 of the prompt.</li>
<li>Cost tracking with budget guards (warning at 80%, block at 100%).</li>
<li>Extended thinking for Claude 3.7+ and o1.</li>
<li>MCP client for external tool servers.</li>
<li>Lifecycle hooks on inbound, tool calls and outbound.</li>
<li>Smart model routing that sends simple queries to cheaper models.</li>
<li>Tunnel support (cloudflared, ngrok, tailscale).</li>
<li>Interactive REPL (<code>frankclaw chat</code>).</li>
<li>Routines with event triggers beyond cron.</li>
<li>Job state machine with auto-repair.</li>
</ul>
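<p>To make the first of those concrete: the circuit-breaker-with-backoff idea fits in a few lines. FrankClaw&rsquo;s version is in Rust; this Ruby sketch with invented names just shows the shape:</p>

```ruby
# Hypothetical circuit breaker with exponential backoff. After
# `threshold` consecutive failures the breaker opens, and the cooldown
# before the next probe doubles with every further failure.
class CircuitBreaker
  class OpenError < StandardError; end

  def initialize(threshold: 3, base_delay: 1.0)
    @threshold  = threshold
    @base_delay = base_delay
    @failures   = 0
    @opened_at  = nil
  end

  def call
    raise OpenError, "provider circuit is open" if open?
    begin
      result = yield
      @failures = 0 # a success fully closes the breaker
      result
    rescue StandardError
      @failures += 1
      @opened_at = Time.now if @failures >= @threshold
      raise
    end
  end

  private

  def open?
    return false unless @opened_at
    cooldown = @base_delay * (2**(@failures - @threshold))
    return true if Time.now - @opened_at < cooldown

    @opened_at = nil # cooldown elapsed: half-open, allow one probe
    false
  end
end
```

<p>Wrapping each provider call in <code>breaker.call { ... }</code> means a flapping API gets progressively longer rests instead of a hammering retry loop.</p>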
<p>But IronClaw depends on PostgreSQL + pgvector, has a WASM sandbox (wasmtime adds ~10MB), and is part of the NEAR AI ecosystem. I want a single binary with embedded SQLite and zero external dependencies.</p>
<h2>What FrankClaw brings from OpenClaw<span class="hx:absolute hx:-mt-20" id="what-frankclaw-brings-from-openclaw"></span>
    <a href="#what-frankclaw-brings-from-openclaw" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>The OpenClaw core is there: 7 messaging channels, multi-provider with failover, agent runtime, command system, skills, subagents, browser automation through CDP, bash tool with allowlist, cron jobs, interactive REPL. Plus the 12 IronClaw features I listed above.</p>
<p>But several pieces had to be rewritten beyond what the original offered. Context compaction, for example, uses a sliding window with token estimation, message pruning and automatic repair of tool pairs that get orphaned when the context is cut. Provider failover is now model-aware: if you ask for a Claude model and the current provider is OpenAI, it skips automatically instead of erroring out. The canvas renders SVG, HTML and Markdown with revision conflict detection. Things that in OpenClaw either didn&rsquo;t exist or were half-built.</p>
<p>Then came the phase of adding what was missing for parity. FrankClaw today has 30+ LLM tools: <code>web_fetch</code> (SSRF-safe with HTML-to-text), <code>web_search</code> (Brave API), <code>file_read</code>/<code>file_write</code>/<code>file_edit</code> (sandboxed in the workspace with path traversal protection), <code>pdf_read</code>, <code>image_describe</code> (through vision models), <code>audio_transcribe</code>, <code>sessions_list</code>/<code>sessions_history</code>/<code>sessions_delete</code>, <code>message_send</code>/<code>message_react</code>, <code>cron_list</code>/<code>cron_add</code>/<code>cron_remove</code>, <code>config_get</code> (auto-redacts secrets), <code>agents_list</code>, <code>memory_get</code>/<code>memory_search</code>, plus the browser tools (<code>browser_open</code>, <code>browser_extract</code>, <code>browser_snapshot</code>, <code>browser_click</code>, <code>browser_type</code>, <code>browser_wait</code>, <code>browser_press</code>, <code>browser_sessions</code>, <code>browser_close</code>, <code>browser_aria</code>).</p>
<p>Some additions that don&rsquo;t exist in the original OpenClaw: a memory/RAG system with SQLite FTS5 and embeddings (OpenAI, Gemini, Voyage) that automatically syncs workspace files. An OpenAI-compatible API (<code>/v1/chat/completions</code> and <code>/v1/models</code>), so any client that speaks that protocol (Cursor, Continue, Open WebUI) can use FrankClaw as a backend with no adaptation. A <code>ratatui</code> TUI for people who prefer the terminal. Interactive approval of destructive tools before execution.</p>
<p>Smaller things that make a difference in practice. You can configure multiple API keys per provider with round-robin and automatic backoff, so if one key hits the rate limit, the next one takes over. The model catalog already knows context windows and costs of OpenAI and Anthropic models without you having to configure them. URL extraction from messages has a private IP blocklist against SSRF. The command system accepts inline directives (<code>/think</code>, <code>/model</code>) in addition to aliases.</p>
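<p>The round-robin-with-backoff behavior is simple to picture. A hypothetical Ruby sketch (FrankClaw&rsquo;s implementation is Rust; all names here are illustrative):</p>

```ruby
# Illustrative key pool: keys rotate round-robin, and a key that hit
# the provider's rate limit is benched for a cooldown period while the
# rotation continues with the others.
class KeyPool
  def initialize(keys, cooldown: 60)
    @keys, @cooldown = keys, cooldown
    @benched = {} # key => time at which it may be used again
    @index = 0
  end

  def next_key(now: Time.now)
    @keys.size.times do
      key = @keys[@index % @keys.size]
      @index += 1
      # Usable if never benched, or if its cooldown already elapsed.
      return key if (@benched[key] || now) <= now
    end
    nil # every key is cooling down
  end

  # Called when a key gets a 429 from the provider.
  def bench(key, now: Time.now)
    @benched[key] = now + @cooldown
  end
end
```

<p>When <code>next_key</code> returns <code>nil</code>, the caller can fall back to the provider-level failover described earlier.</p>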
<p>On the operational side: ACP (Agent Client Protocol) over JSON-RPC 2.0 on top of NDJSON for people who want to integrate programmatically. Plugin system with manifests and enable/disable lifecycle. i18n with 9 locales via <code>FRANKCLAW_LANG</code>. Workspace identity files (<code>SOUL.md</code>, <code>IDENTITY.md</code>) to define the bot&rsquo;s personality per project. Channel health monitor with auto-restart. WebSocket with ping keepalive that survives proxy and tunnel timeouts. <code>frankclaw start/stop/status</code> for people who want to run it as a daemon with PID tracking. And the entire configuration migrated from JSON to TOML.</p>
<h2>Hardening: where the real difference is<span class="hx:absolute hx:-mt-20" id="hardening-where-the-real-difference-is"></span>
    <a href="#hardening-where-the-real-difference-is" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>The <a href="https://github.com/akitaonrails/frankclaw/blob/master/docs/OPENCLAW_SECURITY_AUDIT.md" target="_blank" rel="noopener">audit report</a> we ran on OpenClaw found 7 critical and 9 high vulnerabilities. FrankClaw fixes all of them:</p>
<table>
  <thead>
      <tr>
          <th>Area</th>
          <th>OpenClaw</th>
          <th>FrankClaw</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Token comparison</td>
          <td>SHA-256 + timingSafeEqual with early return that leaks timing</td>
          <td>Constant-time byte-by-byte comparison, no early returns</td>
      </tr>
      <tr>
          <td>Shell execution</td>
          <td>No mandatory allowlist</td>
          <td>Deny-all by default + binary allowlist + metacharacter rejection + optional ai-jail sandbox</td>
      </tr>
      <tr>
          <td>Browser tool</td>
          <td><code>eval()</code> with no sandbox</td>
          <td>CDP with 15s timeout, SSRF guard, crash recovery, ARIA inspection</td>
      </tr>
      <tr>
          <td>Slack webhook</td>
          <td>Zero signature verification</td>
          <td>HMAC-SHA256 with replay protection</td>
      </tr>
      <tr>
          <td>Discord webhook</td>
          <td>Hardcoded placeholder</td>
          <td>Ed25519 with timestamp validation</td>
      </tr>
      <tr>
          <td>Cryptography</td>
          <td>Plaintext on disk</td>
          <td>ChaCha20-Poly1305 on sessions and config</td>
      </tr>
      <tr>
          <td>Password hashing</td>
          <td>No password authentication at all</td>
          <td>Argon2id (t=3, m=64MB, p=4)</td>
      </tr>
      <tr>
          <td>File permissions</td>
          <td>0o644 (world-readable)</td>
          <td>0o600 (owner-only)</td>
      </tr>
      <tr>
          <td>Prompt injection</td>
          <td>Basic sanitization</td>
          <td>Unicode Cc/Cf stripping + boundary tags + 2MB limit</td>
      </tr>
      <tr>
          <td>Malware scanning</td>
          <td>None</td>
          <td>Optional VirusTotal on uploads</td>
      </tr>
      <tr>
          <td>Input validation</td>
          <td>No limits</td>
          <td>255 byte IDs, 800 byte session keys, configurable WS frames</td>
      </tr>
      <tr>
          <td>SSRF</td>
          <td>Partial protection</td>
          <td>Full blocklist (RFC 1918, loopback, CGNAT, link-local) + DNS rebinding defense</td>
      </tr>
      <tr>
          <td>Tool execution</td>
          <td>No user confirmation</td>
          <td>Interactive approval for mutating/destructive tools</td>
      </tr>
  </tbody>
</table>
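<p>The first row is the classic timing-leak fix, and it&rsquo;s worth spelling out: instead of returning at the first mismatching byte (which lets an attacker measure how much of a token is correct), you XOR-accumulate over every byte and check the accumulator at the end. A minimal sketch in safe Rust; this is not FrankClaw&rsquo;s literal code, and production code often reaches for the <code>subtle</code> crate instead:</p>

```rust
/// Constant-time equality for fixed-length secrets (e.g. token hashes).
/// Runtime does not depend on where the first mismatch occurs; the length
/// check is the only early exit, which leaks only the length, not contents.
fn ct_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    let mut diff: u8 = 0;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y; // stays 0 only if every byte pair matches
    }
    diff == 0
}
```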
<p>FrankClaw compiles with <code>#![forbid(unsafe_code)]</code> in all 13 crates. Zero unsafe blocks.</p>
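<p>The prompt-injection row is also sketchable in a few lines: strip Unicode control (Cc) characters and the invisible format (Cf) characters that attackers use to smuggle hidden instructions into external content. The Cf list below is a partial, illustrative subset, and newlines/tabs are deliberately whitelisted; a real sanitizer would consult the full Unicode general-category tables:</p>

```rust
/// Strip Cc and a common subset of Cf characters from untrusted text.
fn strip_cc_cf(input: &str) -> String {
    // Partial, illustrative Cf list: zero-width chars, bidi marks,
    // word joiner, BOM, soft hyphen.
    const CF_SUBSET: [char; 8] = [
        '\u{200B}', '\u{200C}', '\u{200D}', '\u{200E}',
        '\u{200F}', '\u{2060}', '\u{FEFF}', '\u{00AD}',
    ];
    input
        .chars()
        .filter(|c| *c == '\n' || *c == '\t' || (!c.is_control() && !CF_SUBSET.contains(c)))
        .collect()
}
```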
<p>And the audit didn&rsquo;t stop at OpenClaw. We did a <a href="https://github.com/akitaonrails/frankclaw/blob/master/docs/AUDIT_PLAN.md" target="_blank" rel="noopener">per-component audit</a> in 14 phases comparing each part of FrankClaw against the original: channels, providers, runtime, tools, sessions, crypto, cron, webhooks. All documented.</p>
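<p>For the SSRF row, the core idea is that every IP address a fetched URL resolves to gets checked against a blocklist before any connection is opened. A sketch of the IPv4 half using only the standard library; a real guard also covers IPv6 and re-validates after each DNS resolution to defeat rebinding:</p>

```rust
use std::net::Ipv4Addr;

/// Reject IPv4 addresses a fetched URL must never resolve to:
/// RFC 1918 private ranges, loopback, link-local, unspecified,
/// and CGNAT (100.64.0.0/10).
fn is_blocked(ip: Ipv4Addr) -> bool {
    let o = ip.octets();
    let cgnat = o[0] == 100 && (64..=127).contains(&o[1]); // 100.64.0.0/10
    ip.is_private() || ip.is_loopback() || ip.is_link_local() || ip.is_unspecified() || cgnat
}
```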
<h2>Deploy: how to install<span class="hx:absolute hx:-mt-20" id="deploy-how-to-install"></span>
    <a href="#deploy-how-to-install" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>FrankClaw runs with Docker Compose. Three containers: gateway, headless Chromium (for browser tools), and Cloudflare tunnel (to receive webhooks).</p>
<h3>1. Clone and configure<span class="hx:absolute hx:-mt-20" id="1-clone-and-configure"></span>
    <a href="#1-clone-and-configure" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">git clone https://github.com/akitaonrails/frankclaw.git
</span></span><span class="line"><span class="cl"><span class="nb">cd</span> frankclaw
</span></span><span class="line"><span class="cl">cp .env.docker.example .env.docker</span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>Edit <code>.env.docker</code> with your keys:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># AI providers (fill in the ones you use)</span>
</span></span><span class="line"><span class="cl"><span class="nv">OPENAI_API_KEY</span><span class="o">=</span>
</span></span><span class="line"><span class="cl"><span class="nv">ANTHROPIC_API_KEY</span><span class="o">=</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Channels (fill in the ones you want to use)</span>
</span></span><span class="line"><span class="cl"><span class="nv">TELEGRAM_BOT_TOKEN</span><span class="o">=</span>         <span class="c1"># via @BotFather</span>
</span></span><span class="line"><span class="cl"><span class="nv">WHATSAPP_TOKEN</span><span class="o">=</span>             <span class="c1"># Meta Business Platform</span>
</span></span><span class="line"><span class="cl"><span class="nv">WHATSAPP_PHONE_ID</span><span class="o">=</span>
</span></span><span class="line"><span class="cl"><span class="nv">WHATSAPP_VERIFY_TOKEN</span><span class="o">=</span>
</span></span><span class="line"><span class="cl"><span class="nv">DISCORD_BOT_TOKEN</span><span class="o">=</span>
</span></span><span class="line"><span class="cl"><span class="nv">SLACK_BOT_TOKEN</span><span class="o">=</span>
</span></span><span class="line"><span class="cl"><span class="nv">SLACK_APP_TOKEN</span><span class="o">=</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Embedding providers (only if you use memory/RAG)</span>
</span></span><span class="line"><span class="cl"><span class="c1"># GEMINI_API_KEY=</span>
</span></span><span class="line"><span class="cl"><span class="c1"># VOYAGE_API_KEY=</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Optional: session encryption (recommended)</span>
</span></span><span class="line"><span class="cl"><span class="c1"># Generate with: openssl rand -base64 32</span>
</span></span><span class="line"><span class="cl"><span class="nv">FRANKCLAW_MASTER_KEY</span><span class="o">=</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Optional: malware scanning on uploads</span>
</span></span><span class="line"><span class="cl"><span class="nv">VIRUSTOTAL_API_KEY</span><span class="o">=</span></span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<h3>2. Configure the gateway<span class="hx:absolute hx:-mt-20" id="2-configure-the-gateway"></span>
    <a href="#2-configure-the-gateway" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>The <code>frankclaw.toml</code> file defines agents, models and channels. Use the wizard or the examples:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># Generate a base config with the web channel</span>
</span></span><span class="line"><span class="cl">cargo run -- onboard --channel web
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Or copy from the examples</span>
</span></span><span class="line"><span class="cl">ls examples/channels/</span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>For each channel, the CLI has ready-made templates:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">cargo run -- config-example --channel telegram
</span></span><span class="line"><span class="cl">cargo run -- config-example --channel whatsapp</span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<h3>3. Cloudflare Tunnel (to receive webhooks)<span class="hx:absolute hx:-mt-20" id="3-cloudflare-tunnel-to-receive-webhooks"></span>
    <a href="#3-cloudflare-tunnel-to-receive-webhooks" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>If you&rsquo;re going to use channels that need a webhook (Telegram, Discord, Slack, WhatsApp), you need a public tunnel. The Docker Compose already includes cloudflared:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># Copy your Cloudflare credentials</span>
</span></span><span class="line"><span class="cl">cp docker/cloudflared/config.yml.example docker/cloudflared/config.yml
</span></span><span class="line"><span class="cl">cp ~/.cloudflared/&lt;tunnel-id&gt;.json docker/cloudflared/credentials.json
</span></span><span class="line"><span class="cl">cp ~/.cloudflared/cert.pem docker/cloudflared/cert.pem
</span></span><span class="line"><span class="cl"><span class="c1"># Edit config.yml with your hostname</span></span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<h3>4. Bring it up<span class="hx:absolute hx:-mt-20" id="4-bring-it-up"></span>
    <a href="#4-bring-it-up" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">docker compose up -d</span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>The gateway comes up on port 18789 (internal to Docker). cloudflared routes external traffic. Chromium stays on the internal network for browser tools.</p>
<p>To test locally without Docker:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">cargo run -- chat</span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>This opens an interactive REPL straight in the terminal (there&rsquo;s now also a <code>ratatui</code> TUI with dark mode and tabs). No gateway, no webhook. Good for validating that the AI provider is responding before you configure channels.</p>
<h3>Validation<span class="hx:absolute hx:-mt-20" id="validation"></span>
    <a href="#validation" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">cargo run -- check     <span class="c1"># validates config</span>
</span></span><span class="line"><span class="cl">cargo run -- doctor    <span class="c1"># full diagnostic</span>
</span></span><span class="line"><span class="cl">cargo run -- audit     <span class="c1"># security audit with severity ratings</span></span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p><code>audit</code> is the one I like the most. It checks whether you have encryption enabled, whether file permissions are correct, whether webhooks have signature verification, whether the bash tool is in deny-all. It exits non-zero when it finds critical issues, so you can drop it in CI.</p>
<h2>The development process<span class="hx:absolute hx:-mt-20" id="the-development-process"></span>
    <a href="#the-development-process" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>The project has 178 commits in ~5 days of work (March 10-16). Almost 57 thousand lines of Rust in 120 files, organized in 13 crates.</p>
<p>The commits tell the story. The first few dozen were scaffolding: workspace structure, basic types, the HTTP/WebSocket gateway, the first version of the channel adapters. Mass-generated code, lots of incomplete pieces.</p>
<p>Then the channel adapters started, one by one:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code>236aa1c - Minimal Discord channel adapter
fd67017 - Minimal Slack channel adapter
9f51373 - Minimal Signal channel adapter
1052e47 - WhatsApp channel webhook adapter
035f86e - Email channel adapter (IMAP inbound, SMTP outbound)</code></pre></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>When the basic channels were in place, the IronClaw integration came in one big commit:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code>c87ab32 - IronClaw-derived features: circuit breaker, retry, leak detection, cache, cost tracking, extended thinking</code></pre></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>And then came the hardening. That was the phase where I had to step in manually the most, because Claude generates functional code but doesn&rsquo;t think about attack vectors on its own:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code>db34198 - Prompt injection sanitization, external content wrapping, prompt size limit
5719b34 - Optional VirusTotal malware scanning for file uploads
ccd2b2b - Harden input validation across all user-facing entry points
aa918ee - Optional ai-jail sandbox for bash tool
2d7b1df - Security audit command with severity-rated findings
d12cc97 - 3-tier ToolRiskLevel system replacing binary browser mutation flag
21e0c91 - Timing-safe token comparison in WhatsApp, crypto audit tests
e240c1b - Webhook replay prevention with timestamp verification
876a78c - Gateway &amp; media: SSRF redirect validation, filename hardening</code></pre></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>22 security hardening commits. Plus 10 more component audits. Each finding became a commit with a fix.</p>
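<p>The replay-prevention commit in that list boils down to a freshness window: a webhook delivery whose timestamp is too far from the server&rsquo;s clock gets rejected even if its signature verifies, so a captured request can&rsquo;t be replayed later. A sketch of just the window check (the 300-second window follows Slack&rsquo;s documented five-minute guidance; the HMAC verification itself is a separate step, not shown):</p>

```rust
/// Reject webhook deliveries whose Unix timestamp falls outside a
/// freshness window around the server's current time. Tolerates small
/// clock skew in either direction.
fn is_fresh(request_ts: u64, now_ts: u64, window_secs: u64) -> bool {
    now_ts.abs_diff(request_ts) <= window_secs
}
```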
<p>Then came the per-channel audits, each one uncovering different edge cases:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code>43b085f - Discord audit: HELLO timeout, fatal close codes, message chunking
12c7cff - Telegram audit: caption overflow, parse fallback, edit idempotency
f515062 - WhatsApp audit: message type filtering, send error classification
3c42aff - Slack audit: fatal auth errors, send error classification</code></pre></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>Browser tools needed extra attention. A headless Chrome that gets URLs from an LLM is an obvious attack vector:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code>3217a96 - Browser automation: CDP timeout, SSRF guard, session limits, crash recovery
d98a803 - Gate mutating browser tools behind operator approval
014f56e - Browser screenshot/ARIA tools for accessibility tree inspection</code></pre></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>In the more recent commits, the project started diverging from OpenClaw:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code>5d73c4d - OpenAI-compatible HTTP API (/v1/chat/completions, /v1/models)
d832c36 - Memory/RAG system with SQLite FTS5, embeddings, and file sync
2b05f47 - Interactive tool approval for mutating/destructive tools
49034eb - Web console: dark mode, 8 tabs, focus mode, tool sidebar
9f51a18 - TUI, Gemini/Voyage embeddings, plugin management, ACP protocol
3c0703b - Config migration from JSON to TOML
eb130ad - Channel health monitor with auto-restart and rate limiting
2c3ea7e - Workspace bootstrap files (SOUL.md, IDENTITY.md) to system prompt
9df61bf - Model-aware failover routing, canvas SVG rendering
c7ba108 - WebSocket ping keepalive and auto-reconnect to web console</code></pre></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>The OpenAI-compatible API is the one I use the most day to day. Cursor, Continue, Open WebUI, anything that speaks the OpenAI protocol can use FrankClaw as a backend without touching anything on the client side.</p>
<h2>The numbers<span class="hx:absolute hx:-mt-20" id="the-numbers"></span>
    <a href="#the-numbers" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Commits</td>
          <td>178</td>
      </tr>
      <tr>
          <td>Days of work</td>
          <td>~5</td>
      </tr>
      <tr>
          <td>Lines of Rust</td>
          <td>56,586</td>
      </tr>
      <tr>
          <td>Rust files</td>
          <td>120</td>
      </tr>
      <tr>
          <td>Crates</td>
          <td>13</td>
      </tr>
      <tr>
          <td>LLM tools</td>
          <td>30+</td>
      </tr>
      <tr>
          <td>Security hardening commits</td>
          <td>22</td>
      </tr>
      <tr>
          <td>Audit commits</td>
          <td>10</td>
      </tr>
      <tr>
          <td>Supported channels</td>
          <td>7</td>
      </tr>
      <tr>
          <td>AI providers</td>
          <td>9 (OpenAI, Anthropic, Ollama, Google, OpenRouter, Copilot, Groq, Together, DeepSeek)</td>
      </tr>
      <tr>
          <td>OpenClaw critical vulnerabilities fixed</td>
          <td>7/7</td>
      </tr>
      <tr>
          <td>OpenClaw high vulnerabilities fixed</td>
          <td>9/9</td>
      </tr>
      <tr>
          <td>Unsafe blocks in the code</td>
          <td>0</td>
      </tr>
  </tbody>
</table>
<p>Compared to OpenClaw:</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>OpenClaw (TS)</th>
          <th>FrankClaw (Rust)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Lines of code</td>
          <td>~1,247,000</td>
          <td>56,586</td>
      </tr>
      <tr>
          <td>Source files</td>
          <td>3,794</td>
          <td>120</td>
      </tr>
      <tr>
          <td>Runtime dependencies</td>
          <td>73</td>
          <td>~40 crates</td>
      </tr>
      <tr>
          <td>Channels</td>
          <td>28</td>
          <td>7</td>
      </tr>
      <tr>
          <td>Critical vulnerabilities</td>
          <td>7 known</td>
          <td>0</td>
      </tr>
  </tbody>
</table>
<p>The numbers aren&rsquo;t directly comparable. OpenClaw has 21 channel extensions I cut, a more complete web UI, and niche features I didn&rsquo;t port. But the core (gateway, mainstream channels, providers, runtime, sessions, tools, memory) is there with 22x less code.</p>
<h2>How to help (beta testing)<span class="hx:absolute hx:-mt-20" id="how-to-help-beta-testing"></span>
    <a href="#how-to-help-beta-testing" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>FrankClaw works for basic conversation through Web and Telegram (I tested both). WhatsApp works for simple messages. Discord, Slack, Signal and Email are implemented but haven&rsquo;t had extensive testing. We haven&rsquo;t done any complex workflows yet.</p>
<p>If you want to test it: clone the repository, bring it up with Docker Compose, configure at least one channel (Telegram is the easiest) and try to use it normally. Send messages, test tool calls, try to break the system. Open Issues on GitHub with whatever you find.</p>
<p>What I know needs more eyes: workflows with tools (browser, bash, MCP), subagent orchestration, failover between providers, session persistence with encryption, smart routing between models, scheduled jobs, the memory/RAG system, and the OpenAI-compatible API. Basically everything that goes beyond &ldquo;send a message, get a reply.&rdquo;</p>
<p>You don&rsquo;t have to be a Rust developer. The bigger value is in using the system in ways I didn&rsquo;t think of and finding the edge cases that only show up under real use.</p>
<p>FrankClaw doesn&rsquo;t replace OpenClaw today. OpenClaw has more channels, more features, more people working on it. But it carries the weight of over a million lines of TypeScript generated at 50 commits per day by dozens of contributors, with documented critical vulnerabilities. FrankClaw is the alternative for anyone who looks at that and thinks &ldquo;I&rsquo;m not running this code on my machine.&rdquo;</p>
<p>But I&rsquo;ll be honest: as fun as it was to build, I personally don&rsquo;t know if I need this. FrankClaw is a generic gateway, designed to be flexible, to connect any channel to any provider, with an agent runtime, tools, subagents, jobs, hooks. It&rsquo;s a lot of infrastructure.</p>
<p>What I&rsquo;ve discovered over the last few months is that I can build custom bots for specific tasks much faster. In a day I have a working bot, focused on what I need, without carrying the weight of a generic framework. That&rsquo;s what I did with <a href="/en/2026/02/20/discord-as-an-admin-panel-behind-the-m-akita-chronicles/">Marvin</a> on the newsletter project, for instance: a custom-built Discord bot that does exactly what I need and nothing else.</p>
<p>A generic gateway like FrankClaw makes more sense for someone who wants a unified interface between several chat channels and several AI models without coding. If that&rsquo;s your case, give it a shot. If you&rsquo;re a developer and you know what you want, maybe a custom bot will serve you better. Up to you.</p>
<p>The <a href="https://github.com/akitaonrails/frankclaw" target="_blank" rel="noopener">repository is here</a>. AGPL-3.0.</p>
]]></content:encoded><category>rust</category><category>security</category><category>ai</category><category>open-source</category><category>vibe-coding</category></item><item><title>Going After Email Fraud | Frank FBI</title><link>https://www.akitaonrails.com/en/2026/03/09/going-after-email-fraud-frank-fbi/</link><guid isPermaLink="true">https://www.akitaonrails.com/en/2026/03/09/going-after-email-fraud-frank-fbi/</guid><pubDate>Mon, 09 Mar 2026 13:00:00 GMT</pubDate><description>&lt;p&gt;This past weekend I worked on 3 projects. Two of them I already published: &lt;a href="https://www.akitaonrails.com/en/2026/03/07/easy-ffmpeg-smart-wrapper-in-crystal/"&gt;easy-ffmpeg&lt;/a&gt;, a smart wrapper for FFmpeg in Crystal, and &lt;a href="https://www.akitaonrails.com/en/2026/03/07/porting-10k-lines-of-python-to-crystal-with-claude-easy-subtitle/"&gt;easy-subtitle&lt;/a&gt;, a port of 10 thousand lines of Python to Crystal in less than 40 minutes. I keep improving both, adding features and fixing edge cases as I use them day to day.&lt;/p&gt;
&lt;p&gt;But the third project is a different beast. It&amp;rsquo;s a security project. And the motivation comes from an old pain.&lt;/p&gt;</description><content:encoded><![CDATA[<p>This past weekend I worked on 3 projects. Two of them I already published: <a href="/en/2026/03/07/easy-ffmpeg-smart-wrapper-in-crystal/">easy-ffmpeg</a>, a smart wrapper for FFmpeg in Crystal, and <a href="/en/2026/03/07/porting-10k-lines-of-python-to-crystal-with-claude-easy-subtitle/">easy-subtitle</a>, a port of 10 thousand lines of Python to Crystal in less than 40 minutes. I keep improving both, adding features and fixing edge cases as I use them day to day.</p>
<p>But the third project is a different beast. It&rsquo;s a security project. And the motivation comes from an old pain.</p>
<h2>The problem: too many emails<span class="hx:absolute hx:-mt-20" id="the-problem-too-many-emails"></span>
    <a href="#the-problem-too-many-emails" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>As an ex-YouTuber and content creator, I get an absurd amount of email. Event invitations, collab proposals, sponsorship offers, requests to promote stuff. Every kind of pitch.</p>
<p>I delete 100% of them. I don&rsquo;t read them. Most of them I mark as SPAM without even opening. The easiest way to get reported by me is to send me an email — I don&rsquo;t care, because I don&rsquo;t need to. I also don&rsquo;t answer the phone. Ever. And I automatically block anyone who messages me directly on WhatsApp, regardless of the content. My time is too valuable to spend triaging messages from strangers. I delete everything and move on.</p>
<p>It works for me. But I know most people don&rsquo;t operate that way.</p>
<h2>The poison is VANITY<span class="hx:absolute hx:-mt-20" id="the-poison-is-vanity"></span>
    <a href="#the-poison-is-vanity" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Most people who fall for email phishing don&rsquo;t fall because of technical ignorance. They fall because of <strong>VANITY</strong>.</p>
<p>&ldquo;Hello, we&rsquo;d like to invite you to our exclusive event.&rdquo; &ldquo;Your brand has been selected for a special partnership.&rdquo; &ldquo;Congratulations, you&rsquo;ve been nominated as a reference in your sector.&rdquo;</p>
<p>The dopamine hits before reason has a chance to step in. Someone recognized your work, someone wants to give you money. That&rsquo;s exactly when the scammer gets you.</p>
<p>It isn&rsquo;t the multinational CEO who falls for the Nigerian prince scam. It&rsquo;s the micro-influencer who gets a sponsorship offer that&rsquo;s &ldquo;too good to be true.&rdquo; It&rsquo;s the small business owner who gets an invite to an event that looks legitimate. Vanity turns off critical thinking.</p>
<p>And the scams are getting more sophisticated by the day. With LLMs, any criminal can generate perfect emails in Portuguese, with no grammar errors, professional formatting, and domains that mimic real companies. The old &ldquo;look at the typos&rdquo; doesn&rsquo;t work as a filter anymore.</p>
<h2>The idea: Frank FBI<span class="hx:absolute hx:-mt-20" id="the-idea-frank-fbi"></span>
    <a href="#the-idea-frank-fbi" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Instead of trying to teach everyone how to spot phishing (which doesn&rsquo;t work, because vanity beats training), why not build a tool that does it automatically?</p>
<p>Got a suspicious email? Forward it to a dedicated address. In a few minutes, you get back a detailed report with everything the tool managed to find out about that email.</p>
<p>That&rsquo;s how <a href="https://github.com/akitaonrails/frank_fbi" target="_blank" rel="noopener">Frank FBI</a> was born — Fraud Bureau of Investigation.</p>
<h2>How it works<span class="hx:absolute hx:-mt-20" id="how-it-works"></span>
    <a href="#how-it-works" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>You set up a dedicated Gmail account (any Gmail with an App Password works). You register the email addresses authorized to use the service. From there, the flow is:</p>
<ol>
<li>You receive a suspicious email in your personal inbox</li>
<li>You forward it to the Frank FBI address</li>
<li>The system analyzes it automatically</li>
<li>You receive the response in the same thread, with the full report</li>
</ol>
<p>To give you a sense of the result, here&rsquo;s a report for a legitimate email (cold outreach from a real company):</p>
<p><img src="https://raw.githubusercontent.com/akitaonrails/frank_fbi/master/docs/ok-email.png" alt="Report for a legitimate email — score 80/100"  loading="lazy" /></p>
<p>And here&rsquo;s a suspicious one, where the sender&rsquo;s domain tries to impersonate another company:</p>
<p><img src="https://raw.githubusercontent.com/akitaonrails/frank_fbi/master/docs/suspect-email.png" alt="Report for a suspicious email — score 40/100"  loading="lazy" /></p>
<p>Frank FBI runs 6 layers of analysis, each with a specific weight in the final score:</p>
<h3>Layer 1 — Header authentication (15% weight)<span class="hx:absolute hx:-mt-20" id="layer-1--header-authentication-15-weight"></span>
    <a href="#layer-1--header-authentication-15-weight" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>SPF, DKIM, DMARC. Does the Reply-To match the From? Are there suspicious anti-spam headers? This layer answers the most basic question: did the email actually come from who it claims to?</p>
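<p>The gist of this layer can be sketched in a few lines of Ruby. This is illustrative only, not Frank FBI&rsquo;s actual code, and the point weights are invented: it parses the <code>Authentication-Results</code> header the receiving server stamps on the message and scores the three mechanisms.</p>

```ruby
# Illustrative sketch (not Frank FBI's actual code): score the three
# authentication mechanisms found in the Authentication-Results header.
AUTH_RESULT = /\b(spf|dkim|dmarc)=(\w+)/i

def header_auth_score(raw_header)
  results = raw_header.scan(AUTH_RESULT).to_h { |k, v| [k.downcase, v.downcase] }
  score = 0
  score += 40 unless results["spf"]  == "pass"  # sending IP not authorized for the domain
  score += 40 unless results["dkim"] == "pass"  # signature missing or invalid
  score += 20 unless results["dmarc"] == "pass" # domain policy not satisfied
  score # 0 = fully authenticated, 100 = all three checks failed
end

header = "Authentication-Results: mx.example.com; spf=pass; dkim=fail; dmarc=none"
header_auth_score(header) # => 60
```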
<h3>Layer 2 — Sender reputation (15% weight)<span class="hx:absolute hx:-mt-20" id="layer-2--sender-reputation-15-weight"></span>
    <a href="#layer-2--sender-reputation-15-weight" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Looks up the domain&rsquo;s age via WHOIS (a domain registered last week is already a signal), checks whether the IP is on DNS blacklists (DNSBL), and maintains a local reputation database that improves as more emails are analyzed. If the sender pretends to be a company but emails from a Gmail, that weighs in.</p>
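<p>To make the age signal concrete, here&rsquo;s a sketch of how a WHOIS creation date could be turned into a risk score. The thresholds and function name are invented for illustration; in the real system the date would come from the WHOIS lookup.</p>

```ruby
# Illustrative sketch: convert a domain's WHOIS creation date into a
# risk signal. Thresholds are made up for illustration.
require "date"

def domain_age_risk(creation_date, today: Date.today)
  age_days = (today - creation_date).to_i
  if age_days < 30
    90 # registered this month: classic throwaway phishing domain
  elsif age_days < 365
    50 # under a year old: unproven, worth a closer look
  else
    10 # established domain: weak signal on its own
  end
end

# An 83-day-old domain, like the impersonation example later in the
# post, lands in the middle band:
domain_age_risk(Date.new(2025, 12, 16), today: Date.new(2026, 3, 9)) # => 50
```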
<h3>Layer 3 — Content analysis (15% weight)<span class="hx:absolute hx:-mt-20" id="layer-3--content-analysis-15-weight"></span>
    <a href="#layer-3--content-analysis-15-weight" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>This is where pattern matching comes in: artificial urgency (&ldquo;your account will be locked in 24 hours&rdquo;), requests for personal data, authority impersonation, financial offers. It also detects URL shorteners and links where the displayed text doesn&rsquo;t match the real href — that classic &ldquo;click here&rdquo; pointing at a completely different domain. Checks for dangerous attachments (.exe, .scr, Office macros).</p>
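<p>The displayed-text-vs-href check is easy to sketch. This is a simplified stand-in for what such a detector does (the regexes and names are mine, not the project&rsquo;s): find anchors whose visible text shows one domain while the real link points at another.</p>

```ruby
# Illustrative sketch: flag anchor tags whose visible text shows one
# domain while the actual href points somewhere else.
require "uri"

ANCHOR = %r{<a[^>]+href="([^"]+)"[^>]*>([^<]+)</a>}i

def mismatched_links(html)
  html.scan(ANCHOR).filter_map do |href, text|
    shown = text[/[a-z0-9-]+(?:\.[a-z0-9-]+)+/i] # a domain-looking string in the text
    real  = URI.parse(href).host rescue nil
    [shown, real] if shown && real && !real.downcase.end_with?(shown.downcase)
  end
end

phish = '<a href="https://evil.example.ru/login">www.mybank.com</a>'
mismatched_links(phish) # => [["www.mybank.com", "evil.example.ru"]]
```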
<h3>Layer 4 — External APIs (15% weight)<span class="hx:absolute hx:-mt-20" id="layer-4--external-apis-15-weight"></span>
    <a href="#layer-4--external-apis-15-weight" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>URLhaus (abuse.ch) maintains a known-malicious URL database. VirusTotal aggregates results from dozens of antivirus engines. If any URL in the email has already been flagged as malware or phishing by these databases, the score goes up. Results are cached with a TTL so we don&rsquo;t blow through the rate limits of the free APIs.</p>
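<p>The caching idea is simple enough to show in full. A minimal TTL cache, assuming nothing about Frank FBI&rsquo;s internals, looks like this: the first lookup for a URL runs the expensive API call, and repeats within the TTL return the cached verdict.</p>

```ruby
# Illustrative sketch: a tiny TTL cache so repeated URL lookups don't
# burn through free-tier rate limits (VirusTotal allows 4 requests/min).
class TtlCache
  def initialize(ttl_seconds)
    @ttl = ttl_seconds
    @store = {}
  end

  # Return the cached value if fresh, otherwise run the block and cache it.
  def fetch(key)
    entry = @store[key]
    return entry[:value] if entry && Time.now - entry[:at] < @ttl
    value = yield
    @store[key] = { value: value, at: Time.now }
    value
  end
end

cache = TtlCache.new(3600)
cache.fetch("https://example.com") { :looked_up }  # miss: runs the block
cache.fetch("https://example.com") { :never_runs } # hit: cached value returned
```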
<h3>Layer 5 — Entity verification (10% weight)<span class="hx:absolute hx:-mt-20" id="layer-5--entity-verification-10-weight"></span>
    <a href="#layer-5--entity-verification-10-weight" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>This is the layer I find the most interesting. Frank FBI does OSINT — Open Source Intelligence. It uses Brave Search to verify whether the sender or company actually exists. Does WHOIS on the domain directly. Captures a screenshot of the site with headless Chrome. Cross-references all of it to try to answer: &ldquo;is this entity real and is it who it claims to be?&rdquo;</p>
<p>Here&rsquo;s a real example of this layer in action, analyzing an email that tried to impersonate a legitimate company:</p>
<p><img src="https://raw.githubusercontent.com/akitaonrails/frank_fbi/master/docs/verificacao-identidade.png" alt="Identity verification — detailed OSINT"  loading="lazy" /></p>
<p>The domain was 83 days old, registered through Namecheap, with no verifiable online presence. The system found discrepancies between the name in the email and the public records, and identified that the legitimate company&rsquo;s real domain was a different one.</p>
<h3>Layer 6 — LLM analysis (30% weight)<span class="hx:absolute hx:-mt-20" id="layer-6--llm-analysis-30-weight"></span>
    <a href="#layer-6--llm-analysis-30-weight" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Queries 3 AI models in parallel: Claude Sonnet (Anthropic), GPT-4o (OpenAI) and Grok 4 (xAI), via OpenRouter. Each model analyzes the email independently. A confidence-weighted consensus system combines the results. If all 3 agree it&rsquo;s fraud, confidence is high. If they disagree, the system weighs them. If every LLM fails, it falls back to a neutral score of 50 instead of guessing — better to admit ignorance than to hallucinate.</p>
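<p>A confidence-weighted consensus with a neutral fallback can be sketched like this. The model names come from the post; the scores, confidences, and function name are invented for illustration.</p>

```ruby
# Illustrative sketch: combine independent LLM verdicts, each weighted
# by its self-reported confidence, falling back to a neutral 50.
def llm_consensus(opinions)
  return 50 if opinions.empty? # every model failed: admit ignorance, don't guess
  total = opinions.sum { |o| o[:confidence] }
  return 50 if total.zero?
  (opinions.sum { |o| o[:score] * o[:confidence] } / total).round
end

llm_consensus([
  { model: "claude-sonnet", score: 80, confidence: 0.9 },
  { model: "gpt-4o",        score: 70, confidence: 0.8 },
  { model: "grok-4",        score: 90, confidence: 0.6 },
]) # => 79
```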
<p>The final score is a confidence-adjusted weighted average. Possible verdicts: Legitimate (0-25), Suspicious OK (26-50), Suspicious Fraud (51-75) or Fraudulent (76-100). There&rsquo;s also a risk escalation policy that imposes minimum floors: if critical indicators were detected (confirmed malicious URLs, DKIM failure), the score can&rsquo;t fall below certain thresholds, even if the other layers found nothing.</p>
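<p>Putting the layer weights and the escalation floor together, the scoring scheme described above can be sketched as follows. The weights and verdict bands are the ones from the post; the function names and the default of 50 for a layer that didn&rsquo;t run are my own simplifications.</p>

```ruby
# Illustrative sketch of the final scoring: weighted average of the six
# layers, plus a risk-escalation floor for critical indicators.
WEIGHTS = { headers: 0.15, reputation: 0.15, content: 0.15,
            external: 0.15, entity: 0.10, llm: 0.30 }

def final_score(layer_scores, critical_floor: nil)
  score = WEIGHTS.sum { |layer, w| w * layer_scores.fetch(layer, 50) }
  score = [score, critical_floor].max if critical_floor # e.g. confirmed malicious URL
  score.round
end

def verdict(score)
  case score
  when 0..25  then :legitimate
  when 26..50 then :suspicious_ok
  when 51..75 then :suspicious_fraud
  else             :fraudulent
  end
end

s = final_score({ headers: 80, reputation: 70, content: 60,
                  external: 0, entity: 90, llm: 75 })
verdict(s) # => :suspicious_fraud
```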
<h2>Self-hosted: your data stays with you<span class="hx:absolute hx:-mt-20" id="self-hosted-your-data-stays-with-you"></span>
    <a href="#self-hosted-your-data-stays-with-you" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Frank FBI is self-hosted. You run it on your own server. Your emails don&rsquo;t pass through any third-party service (except the verification APIs like VirusTotal, which only receive URLs, not the email&rsquo;s content).</p>
<p>You can install it on your home server, like I did, or inside your company&rsquo;s infrastructure for your employees to use. The deploy is via Docker Compose with 4 containers: the Rails app, a background job worker, an IMAP poller, and an initial database setup. Bring everything up with a <code>docker compose up -d</code> and it&rsquo;s running.</p>
<h2>Reporting fraud to the community<span class="hx:absolute hx:-mt-20" id="reporting-fraud-to-the-community"></span>
    <a href="#reporting-fraud-to-the-community" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Beyond analyzing emails for you, Frank FBI can optionally report confirmed fraud to community databases. This only happens when the score is &gt;= 85 and the verdict is &ldquo;fraudulent.&rdquo; It&rsquo;s opt-in.</p>
<p>ThreatFox (abuse.ch) is an open database of indicators of compromise (IOCs) maintained by the security community. When you report a malicious URL or domain there, firewalls, email filters and SIEMs around the world can consume that information to block the threat.</p>
<p>AbuseIPDB is the same idea, but for IPs. If the IP of the sender of the fraudulent email gets reported there, email providers and network admins can block malicious traffic before it reaches users.</p>
<p>And SpamCop is one of the oldest spam reporting services. Frank FBI forwards the full email to SpamCop, which analyzes the headers and reports to the responsible providers. It&rsquo;s reporting directly to whoever can act.</p>
<p>Each report is a contribution so other people don&rsquo;t fall for the same scam. And it&rsquo;s automated: if the email is clearly fraudulent, the report goes out without manual intervention.</p>
<p>But reporting wrong things does damage. That&rsquo;s why the system has anti-poisoning protections: a list of ~40 known domains (Microsoft, Apple, Google, Amazon, PayPal, governments) that are never reported; domains with clean scans are excluded; cloud infrastructure IPs (Google, Microsoft, Cloudflare) are filtered; free email domains are ignored. Only genuinely malicious IOCs reach the community databases.</p>
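<p>The reporting gate can be sketched as a single predicate. The 85-point threshold is from the post; the domain lists below are tiny illustrative subsets of the real ~40-entry allowlist, and the naive two-label domain extraction is a stand-in (production code would use a public-suffix list).</p>

```ruby
# Illustrative sketch of the anti-poisoning gate: only report an IOC when
# the score is high enough AND the domain isn't a protected brand or a
# free mailbox provider.
PROTECTED_DOMAINS = %w[microsoft.com apple.com google.com amazon.com paypal.com]
FREE_MAIL         = %w[gmail.com outlook.com yahoo.com]

def reportable?(domain, score:, verdict:)
  return false unless score >= 85 && verdict == :fraudulent
  base = domain.split(".").last(2).join(".") # naive eTLD+1; real code needs a public-suffix list
  return false if PROTECTED_DOMAINS.include?(base)
  return false if FREE_MAIL.include?(base)
  true
end

reportable?("login.paypa1-secure.ru", score: 92, verdict: :fraudulent) # => true
reportable?("mail.google.com",        score: 92, verdict: :fraudulent) # => false (protected)
reportable?("sketchy.example",        score: 70, verdict: :fraudulent) # => false (below threshold)
```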
<h2>Deployment: how to get it running<span class="hx:absolute hx:-mt-20" id="deployment-how-to-get-it-running"></span>
    <a href="#deployment-how-to-get-it-running" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Let me explain how I deployed it on my home server. If you have a VPS, NAS or any Linux machine with Docker, the process is the same.</p>
<h3>1. Clone and build the image<span class="hx:absolute hx:-mt-20" id="1-clone-and-build-the-image"></span>
    <a href="#1-clone-and-build-the-image" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>You need a private registry to store the Docker image. I use <a href="https://gitea.io/" target="_blank" rel="noopener">Gitea</a> on my home server — it&rsquo;s a lightweight, self-hosted GitHub alternative that includes a container registry. If you already use GitHub, you can push to the GitHub Container Registry (ghcr.io) instead.</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># Clone the repository</span>
</span></span><span class="line"><span class="cl">git clone https://github.com/akitaonrails/frank_fbi.git
</span></span><span class="line"><span class="cl"><span class="nb">cd</span> frank_fbi
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Build the Docker image</span>
</span></span><span class="line"><span class="cl"><span class="c1"># If using Gitea (replace with your registry&#39;s IP/port):</span>
</span></span><span class="line"><span class="cl">docker build -t your-server:3007/frank_fbi:latest .
</span></span><span class="line"><span class="cl">docker push your-server:3007/frank_fbi:latest
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># If using GitHub Container Registry:</span>
</span></span><span class="line"><span class="cl">docker build -t ghcr.io/your-username/frank_fbi:latest .
</span></span><span class="line"><span class="cl">docker push ghcr.io/your-username/frank_fbi:latest</span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<h3>2. Configure the .env<span class="hx:absolute hx:-mt-20" id="2-configure-the-env"></span>
    <a href="#2-configure-the-env" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Copy <code>.env.example</code> and fill it in. The required variables:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># Generate with: ruby -rsecurerandom -e &#39;puts SecureRandom.hex(64)&#39;</span>
</span></span><span class="line"><span class="cl"><span class="nv">SECRET_KEY_BASE</span><span class="o">=</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Generate with: bin/rails db:encryption:init (run locally, copy the 3 values)</span>
</span></span><span class="line"><span class="cl"><span class="nv">ACTIVE_RECORD_ENCRYPTION_PRIMARY_KEY</span><span class="o">=</span>
</span></span><span class="line"><span class="cl"><span class="nv">ACTIVE_RECORD_ENCRYPTION_DETERMINISTIC_KEY</span><span class="o">=</span>
</span></span><span class="line"><span class="cl"><span class="nv">ACTIVE_RECORD_ENCRYPTION_KEY_DERIVATION_SALT</span><span class="o">=</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Dedicated Gmail account for Frank FBI (create one just for this)</span>
</span></span><span class="line"><span class="cl"><span class="c1"># Enable 2FA and generate an App Password at https://myaccount.google.com/apppasswords</span>
</span></span><span class="line"><span class="cl"><span class="nv">GMAIL_USERNAME</span><span class="o">=</span>your-frank-fbi@gmail.com
</span></span><span class="line"><span class="cl"><span class="nv">GMAIL_PASSWORD</span><span class="o">=</span>xxxx-xxxx-xxxx-xxxx
</span></span><span class="line"><span class="cl"><span class="nv">GMAIL_IMAP_HOST</span><span class="o">=</span>imap.gmail.com
</span></span><span class="line"><span class="cl"><span class="nv">GMAIL_SMTP_HOST</span><span class="o">=</span>smtp.gmail.com
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Action Mailbox ingress password (any random string)</span>
</span></span><span class="line"><span class="cl"><span class="nv">RAILS_INBOUND_EMAIL_PASSWORD</span><span class="o">=</span>any-long-random-password
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># LLM via OpenRouter (required for Layer 6)</span>
</span></span><span class="line"><span class="cl"><span class="nv">OPENROUTER_API_KEY</span><span class="o">=</span>your-openrouter-key
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Your admin email (to manage authorized senders)</span>
</span></span><span class="line"><span class="cl"><span class="nv">ADMIN_EMAIL</span><span class="o">=</span>your-email@personal.com</span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>The external APIs are optional but recommended. Without them, the corresponding layers simply don&rsquo;t run:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># VirusTotal (free, 4 requests/min) - https://virustotal.com</span>
</span></span><span class="line"><span class="cl"><span class="nv">VIRUSTOTAL_API_KEY</span><span class="o">=</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># WhoisXML (free, 500 requests/month) - https://whoisxmlapi.com</span>
</span></span><span class="line"><span class="cl"><span class="nv">WHOISXML_API_KEY</span><span class="o">=</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Brave Search (free, 1 req/s) - https://brave.com/search/api/</span>
</span></span><span class="line"><span class="cl"><span class="nv">BRAVE_SEARCH_API_KEY</span><span class="o">=</span></span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>And the community reporting, which is fully opt-in. Leave it blank if you don&rsquo;t want to report:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="nv">THREATFOX_AUTH_KEY</span><span class="o">=</span>
</span></span><span class="line"><span class="cl"><span class="nv">ABUSEIPDB_API_KEY</span><span class="o">=</span>
</span></span><span class="line"><span class="cl"><span class="nv">SPAMCOP_SUBMISSION_ADDRESS</span><span class="o">=</span></span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<h3>3. Docker Compose on the server<span class="hx:absolute hx:-mt-20" id="3-docker-compose-on-the-server"></span>
    <a href="#3-docker-compose-on-the-server" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Create the <code>docker-compose.yml</code> on your server. Replace the image with your registry:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">services</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">setup</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">image</span><span class="p">:</span><span class="w"> </span><span class="l">your-server:3007/frank_fbi:latest </span><span class="w"> </span><span class="c"># or ghcr.io/your-username/frank_fbi:latest</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">command</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">&#34;./bin/rails&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;db:prepare&#34;</span><span class="p">]</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">env_file</span><span class="p">:</span><span class="w"> </span><span class="l">.env</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">environment</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span>- <span class="l">RAILS_ENV=production</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">volumes</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span>- <span class="l">./storage:/rails/storage</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">app</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">image</span><span class="p">:</span><span class="w"> </span><span class="l">your-server:3007/frank_fbi:latest</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">env_file</span><span class="p">:</span><span class="w"> </span><span class="l">.env</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">environment</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span>- <span class="l">RAILS_ENV=production</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">volumes</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span>- <span class="l">./storage:/rails/storage</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span>- <span class="l">./emails:/rails/emails</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">depends_on</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">setup</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">condition</span><span class="p">:</span><span class="w"> </span><span class="l">service_completed_successfully</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">healthcheck</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">test</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">&#34;CMD&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;curl&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;-f&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;http://localhost:3000/up&#34;</span><span class="p">]</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">interval</span><span class="p">:</span><span class="w"> </span><span class="l">10s</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">timeout</span><span class="p">:</span><span class="w"> </span><span class="l">5s</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">retries</span><span class="p">:</span><span class="w"> </span><span class="m">3</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">start_period</span><span class="p">:</span><span class="w"> </span><span class="l">15s</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">restart</span><span class="p">:</span><span class="w"> </span><span class="l">unless-stopped</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">worker</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">image</span><span class="p">:</span><span class="w"> </span><span class="l">your-server:3007/frank_fbi:latest</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">command</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">&#34;./bin/jobs&#34;</span><span class="p">]</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">env_file</span><span class="p">:</span><span class="w"> </span><span class="l">.env</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">environment</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span>- <span class="l">RAILS_ENV=production</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">volumes</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span>- <span class="l">./storage:/rails/storage</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">depends_on</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">setup</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">condition</span><span class="p">:</span><span class="w"> </span><span class="l">service_completed_successfully</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">restart</span><span class="p">:</span><span class="w"> </span><span class="l">unless-stopped</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">mail_fetcher</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">image</span><span class="p">:</span><span class="w"> </span><span class="l">seu-servidor:3007/frank_fbi:latest</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">command</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">&#34;./bin/rails&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;frank_fbi:fetch_mail&#34;</span><span class="p">]</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">env_file</span><span class="p">:</span><span class="w"> </span><span class="l">.env</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">environment</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span>- <span class="l">RAILS_ENV=production</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span>- <span class="l">APP_HOST=http://app:3000</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">volumes</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span>- <span class="l">./storage:/rails/storage</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">depends_on</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">app</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">condition</span><span class="p">:</span><span class="w"> </span><span class="l">service_healthy</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">restart</span><span class="p">:</span><span class="w"> </span><span class="l">unless-stopped</span></span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<h3>4. Bring it up and register the first user<span class="hx:absolute hx:-mt-20" id="4-bring-it-up-and-register-the-first-user"></span>
    <a href="#4-bring-it-up-and-register-the-first-user" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">docker compose up -d</span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p><code>setup</code> runs the migrations and exits. <code>app</code> brings up Rails, <code>worker</code> processes background jobs, and <code>mail_fetcher</code> polls IMAP every 30 seconds.</p>
<p>To register authorized senders, send an email from your ADMIN_EMAIL to the Frank FBI address with the subject <code>add email1@example.com, email2@example.com</code>. To list who&rsquo;s registered, send with subject <code>list</code>. To see statistics, <code>stats</code>.</p>
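<p>Under the hood, this kind of interface is just subject-line parsing. Here is a minimal sketch of what such a dispatcher could look like. The command names match the ones above, but the code itself is purely illustrative, not Frank FBI&rsquo;s actual source:</p>

```ruby
# Hypothetical subject-line command dispatcher. The command names
# ("add", "list", "stats") match the post; the parsing code is an
# illustration, not the project's source.
def parse_command(subject)
  case subject.strip
  when /\Aadd\s+(.+)\z/i
    # Split the comma-separated address list and discard blanks
    [:add, $1.split(",").map(&:strip).reject(&:empty?)]
  when /\Alist\z/i  then [:list, nil]
  when /\Astats\z/i then [:stats, nil]
  else
    [:analyze, nil] # any other subject is treated as a submission to analyze
  end
end
```

<p>Calling <code>parse_command("add a@example.com, b@example.com")</code> yields the <code>:add</code> action plus a clean list of addresses, while an ordinary forwarded subject falls through to <code>:analyze</code>.</p>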
<p>From there, any registered sender can forward suspicious emails and receive the analysis reports.</p>
<h2>The development process<span class="hx:absolute hx:-mt-20" id="the-development-process"></span>
    <a href="#the-development-process" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>This project started as a simple idea: &ldquo;what if I could forward suspicious emails somewhere and get an analysis back?&rdquo; But security projects aren&rsquo;t like normal projects.</p>
<p>In a regular project, a bug is an inconvenience. In a security project, a bug can be a vulnerability. If the system says a fraudulent email is legitimate, someone might lose money. If it incorrectly reports a legitimate domain as malicious, it can hurt an innocent company. If a malicious email manages to exploit the analyzer itself, the attacker gets access to the server.</p>
<p>That changes the development mindset. Every decision needs to consider: &ldquo;how could a bad actor abuse this?&rdquo;</p>
<h3>Evolution through commits<span class="hx:absolute hx:-mt-20" id="evolution-through-commits"></span>
    <a href="#evolution-through-commits" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Let me show how the project evolved by looking at the most relevant commits.</p>
<p>The project started with the basic email analysis structure and quickly needed access control:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code>eb59474 - Add admin access control and allowed senders whitelist
036e9f1 - Harden access control against email spoofing and admin impersonation</code></pre></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>The first one adds the authorized sender list. The second solves the obvious problem: what if someone spoofs an authorized sender&rsquo;s email? The system started verifying SPF/DKIM before accepting any submission. Per-sender rate limiting came along with that, to prevent abuse.</p>
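<p>The anti-spoofing gate boils down to: never trust the From header unless SPF and DKIM both passed for it. A toy version, assuming the results arrive as an <code>Authentication-Results</code> header stamped by your own MTA (the helper name and parsing are assumptions; a real implementation must also make sure it reads only the header your own server added, not a copy the sender injected):</p>

```ruby
# Toy SPF/DKIM gate over an Authentication-Results header string.
# Assumption: the header was stamped by your own trusted MTA; a real
# implementation must strip any copy injected by the sender first.
def sender_authenticated?(auth_results)
  auth_results.match?(/\bspf=pass\b/i) && auth_results.match?(/\bdkim=pass\b/i)
end
```

<p>Requiring both checks to pass before honoring the sender list is what makes the per-sender rate limiting meaningful: without it, an attacker could burn someone else&rsquo;s quota just by forging their address.</p>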
<p>Then came concerns about scoring quality:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code>e891270 - Fix zero-dilution scoring bug and add critical alert UX
2cbb272 - Harden fraud scoring and reporting
50d96a1 - Move text pattern detection from regex to LLM consensus layer</code></pre></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>The zero-dilution bug is subtle: when a layer fails and returns score 0 with low confidence, it would dilute the overall average downward, making fraudulent emails look safer than they were. The fix implemented dampening that discards low-quality layers instead of letting them drag the score down.</p>
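<p>The dampening idea fits in a few lines: layers whose confidence falls below a floor are excluded from the weighted average instead of pulling it toward zero. The threshold and field names here are assumptions for illustration, not the project&rsquo;s real scoring code:</p>

```ruby
# Sketch of confidence dampening: layers below a confidence floor are
# dropped from the weighted average instead of diluting it toward zero.
# Threshold and field names are invented for this example.
Layer = Struct.new(:score, :confidence)

def overall_score(layers, min_confidence: 0.3)
  usable = layers.select { |l| l.confidence >= min_confidence }
  return 0.0 if usable.empty?
  # Confidence-weighted average over the layers that actually produced data
  usable.sum { |l| l.score * l.confidence } / usable.sum(&:confidence)
end
```

<p>With a naive average, a dead layer reporting score 0 at confidence 0.05 still adds its weight to the denominator and drags a fraudulent email&rsquo;s score down; discarding it leaves the high-confidence signals in charge.</p>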
<p>The shift from regex to LLM in pattern detection was pragmatic. Fraud patterns in natural language are hard to capture with regex. Too many false positives. LLMs understand context and intent in a way regex can&rsquo;t.</p>
<p>Race conditions showed up when the pipeline started running layers in parallel:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code>2e4276d - Fix race conditions across pipeline and add Brave Search rate limiting
ac433d9 - Fix WHOIS race condition and Brave Search gzip error logging</code></pre></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>Concurrent jobs trying to write to the same email record, WHOIS queries stepping on each other, external API rate limits getting blown. The classic &ldquo;works in sequential tests, breaks in production.&rdquo;</p>
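<p>The lost-update flavor of these bugs can be shown in a few lines of plain Ruby. In Rails the usual remedy is row-level locking so concurrent jobs serialize on the record; here a <code>Mutex</code> plays that role so the sketch stays self-contained (the helper and its interface are illustrative only):</p>

```ruby
# "Works in sequential tests, breaks in production" in miniature:
# two threads do a read-modify-write on shared state. With the lock,
# each increment is serialized and no update is lost.
def concurrent_increments(per_thread, lock: nil)
  counter = 0
  threads = 2.times.map do
    Thread.new do
      per_thread.times do
        if lock
          lock.synchronize { counter += 1 } # serialized: no lost updates
        else
          counter += 1 # non-atomic: another thread can write between read and write
        end
      end
    end
  end
  threads.each(&:join)
  counter
end
```

<p>Without the lock the final count can fall short of twice <code>per_thread</code>; with it the result is exact every time, which is the same guarantee row locking gives two jobs fighting over one email record.</p>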
<p>Preventing LLM hallucination got its own dedicated commit:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code>e08da9e - Prevent LLM hallucination in fraud reports</code></pre></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>LLMs make things up. If the model says &ldquo;this domain was registered yesterday&rdquo; without evidence in the data, the report is compromised. This commit implemented cross-validation: claims from the LLMs are verified against the concrete data from the deterministic layers. If the LLM asserts something that contradicts the facts, the information is dropped or flagged.</p>
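<p>That cross-validation step can be pictured as a filter over the LLM&rsquo;s structured claims: anything without a matching deterministic fact is dropped before the report is written. The claim and fact shapes below are invented for this example:</p>

```ruby
# Sketch of cross-validating LLM claims against deterministic facts.
# A claim survives only if a concrete fact backs it; everything else
# is dropped. Claim/fact shapes are hypothetical.
def validate_claims(claims, facts)
  claims.select do |claim|
    case claim[:type]
    when :domain_age_days
      facts[:domain_age_days] == claim[:value] # must match WHOIS-derived data
    else
      false # claims with no verifiable counterpart never reach the report
    end
  end
end
```

<p>So if the model asserts a domain was &ldquo;registered yesterday&rdquo; but the WHOIS layer says it is years old, the claim simply never makes it into the report.</p>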
<p>And when community reporting was implemented:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code>a01d769 - Add community threat intelligence reporting
1b64eee - Harden community reporting against poisoning and add rate limiting</code></pre></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>The first commit implements the feature. The second adds the anti-poisoning protections. If an attacker realizes Frank FBI reports automatically, they can try to send emails that contain IOCs from legitimate companies as &ldquo;fraud,&rdquo; making the system report innocent domains to community databases. That&rsquo;s IOC poisoning, and it&rsquo;s a real attack against threat intelligence systems.</p>
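<p>One hedge against that attack, sketched here with an invented allowlist and helper name, is to refuse to report any IOC whose domain belongs to a known-legitimate company, no matter what the submitted email claimed:</p>

```ruby
# Illustrative anti-poisoning gate for outbound threat-intel reports:
# an IOC is never reported if its domain sits on (or under) a
# known-legitimate allowlist. List contents and helper name are invented.
LEGIT_DOMAINS = %w[google.com microsoft.com apple.com].freeze

def reportable_iocs(iocs)
  iocs.reject do |ioc|
    host = ioc[:domain].downcase
    # Block exact matches and any subdomain of an allowlisted domain
    LEGIT_DOMAINS.any? { |d| host == d || host.end_with?(".#{d}") }
  end
end
```

<p>A crafted email claiming <code>mail.google.com</code> is a fraud indicator gets silently filtered, while a genuinely suspicious domain still goes out to the community databases.</p>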
<h3>Continuous hardening<span class="hx:absolute hx:-mt-20" id="continuous-hardening"></span>
    <a href="#continuous-hardening" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code>9430c63 - Harden risky attachment analysis and warnings
bdc503d - Harden screenshot capture with failure recovery and pipeline timeout
125e73a - Separate suspect content from submitter signature in forwarded emails
6f7a522 - Handle forwarded message fidelity</code></pre></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>Malicious attachments that could exploit the parser. Site screenshots that could lock up headless Chrome. Sender signatures that were being confused with the content of the suspicious email. Forwarded emails that lost fidelity in the forward. Each of these fixes a different attack vector or edge case.</p>
<p>Security isn&rsquo;t a feature you implement once. It&rsquo;s a continuous process of &ldquo;what didn&rsquo;t I think could go wrong?&rdquo;</p>
<h2>The numbers<span class="hx:absolute hx:-mt-20" id="the-numbers"></span>
    <a href="#the-numbers" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Commits</td>
          <td>38</td>
      </tr>
      <tr>
          <td>Hours of work (~)</td>
          <td>17</td>
      </tr>
      <tr>
          <td>Lines of Ruby</td>
          <td>14,616</td>
      </tr>
      <tr>
          <td>Ruby files</td>
          <td>161</td>
      </tr>
      <tr>
          <td>Application code (app/)</td>
          <td>8,217 lines</td>
      </tr>
      <tr>
          <td>Test code (test/)</td>
          <td>5,312 lines</td>
      </tr>
      <tr>
          <td>Test/code ratio</td>
          <td>0.65</td>
      </tr>
      <tr>
          <td>Analysis layers</td>
          <td>6</td>
      </tr>
      <tr>
          <td>Async jobs</td>
          <td>20 classes</td>
      </tr>
      <tr>
          <td>Data models</td>
          <td>9 tables</td>
      </tr>
      <tr>
          <td>Commits/hour</td>
          <td>~2.2</td>
      </tr>
      <tr>
          <td>Lines/hour</td>
          <td>~860</td>
      </tr>
  </tbody>
</table>
<p>860 lines per hour. Obviously AI-assisted development. But look at the hardening commits: none of them came from the LLM suggesting &ldquo;hey, let&rsquo;s protect against spoofing.&rdquo; That was me stopping and thinking &ldquo;wait, what if someone spoofs the authorized sender?&rdquo; and &ldquo;what if an attacker uses IOC poisoning against my reporting system?&rdquo; The LLM doesn&rsquo;t ask those questions on its own. It implements the protections when you ask, but the one who has to spot the hole is you.</p>
<h2>A serious warning: DO NOT offer this as a service<span class="hx:absolute hx:-mt-20" id="a-serious-warning-do-not-offer-this-as-a-service"></span>
    <a href="#a-serious-warning-do-not-offer-this-as-a-service" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>If you looked at Frank FBI and thought &ldquo;cool, I&rsquo;m going to offer this as a SaaS to other people,&rdquo; I have one piece of advice: <strong>DON&rsquo;T do that</strong>.</p>
<p>I can think of several ways to exploit a service like this offered to the public. The operator would have access to every email forwarded by users — personal information, corporate data, sensitive correspondence. A centralized service becomes a high-value target: compromise the server and you have access to a continuous stream of confidential emails from people who are already in a vulnerable situation.</p>
<p>I&rsquo;ve known enough people in this industry to know that many of those who would deploy this as a service wouldn&rsquo;t worry about encryption at rest or about destroying the emails after analysis. And who guarantees that the operator won&rsquo;t read users&rsquo; emails? Nobody. It&rsquo;s the kind of thing that looks like a useful service but in practice creates a honeypot of sensitive data managed by someone with no incentive to protect it.</p>
<p>Frank FBI was built to be self-hosted. To run on your server, under your control, with your data staying with you. Or in your company&rsquo;s infrastructure, managed by your IT team.</p>
<p>And the project is licensed under AGPL-3.0. If you use my code and offer it as a network service, you&rsquo;re legally required to offer the complete source of your modified version to everyone who uses it over the network. No exceptions. I picked AGPL for that reason — to make sure nobody takes the project, adds tracking and telemetry, and offers it as a &ldquo;free email verification service&rdquo; while collecting user data behind the scenes.</p>
<p>The <a href="https://github.com/akitaonrails/frank_fbi" target="_blank" rel="noopener">repository is here</a>. AGPL-3.0.</p>
]]></content:encoded><category>ruby</category><category>rails</category><category>security</category><category>email</category><category>fraud-detection</category><category>open-source</category><category>vibe-coding</category></item><item><title>Porting 10K Lines of Python to Crystal with Claude: easy-subtitle</title><link>https://www.akitaonrails.com/en/2026/03/07/porting-10k-lines-of-python-to-crystal-with-claude-easy-subtitle/</link><guid isPermaLink="true">https://www.akitaonrails.com/en/2026/03/07/porting-10k-lines-of-python-to-crystal-with-claude-easy-subtitle/</guid><pubDate>Sat, 07 Mar 2026 22:00:00 GMT</pubDate><description>&lt;p&gt;In the &lt;a href="https://www.akitaonrails.com/en/2026/03/07/easy-ffmpeg-smart-wrapper-in-crystal/"&gt;previous article&lt;/a&gt; I showed why I picked Crystal for command-line CLIs. In this one I want to show a more ambitious case. It isn&amp;rsquo;t a tool from scratch anymore — it&amp;rsquo;s a feature-parity port of a 10,000-line Python open source project.&lt;/p&gt;
&lt;h2&gt;The problem: subtitles&lt;span class="hx:absolute hx:-mt-20" id="the-problem-subtitles"&gt;&lt;/span&gt;
&lt;a href="#the-problem-subtitles" class="subheading-anchor" aria-label="Permalink for this section"&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Anyone who maintains a movie and TV collection knows the pain. You download the MKV, but the embedded subtitle is out of sync. Or worse: there&amp;rsquo;s no subtitle at all in the language you want. So you go to OpenSubtitles, download a subtitle, and it&amp;rsquo;s 3 seconds ahead because it was made for a different release. The manual flow is:&lt;/p&gt;</description><content:encoded><![CDATA[<p>In the <a href="/en/2026/03/07/easy-ffmpeg-smart-wrapper-in-crystal/">previous article</a> I showed why I picked Crystal for command-line CLIs. In this one I want to show a more ambitious case. It isn&rsquo;t a tool from scratch anymore — it&rsquo;s a feature-parity port of a 10,000-line Python open source project.</p>
<h2>The problem: subtitles<span class="hx:absolute hx:-mt-20" id="the-problem-subtitles"></span>
    <a href="#the-problem-subtitles" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Anyone who maintains a movie and TV collection knows the pain. You download the MKV, but the embedded subtitle is out of sync. Or worse: there&rsquo;s no subtitle at all in the language you want. So you go to OpenSubtitles, download a subtitle, and it&rsquo;s 3 seconds ahead because it was made for a different release. The manual flow is:</p>
<ol>
<li>Extract subtitle tracks from the MKV with <code>mkvextract</code></li>
<li>Go to OpenSubtitles, look for a subtitle for the movie</li>
<li>Download, test, see it&rsquo;s out of sync</li>
<li>Run some sync tool (ffsubsync, alass)</li>
<li>Rename the file, move it to the right place</li>
<li>Repeat for every language, for every movie</li>
</ol>
<p>For someone with 10 movies, it&rsquo;s tolerable. For someone with hundreds, it&rsquo;s insanity. It&rsquo;s exactly the kind of thing that should be automated.</p>
<h2>Subservient: the Python solution<span class="hx:absolute hx:-mt-20" id="subservient-the-python-solution"></span>
    <a href="#subservient-the-python-solution" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Looking at what already existed, I found <a href="https://github.com/N3xigen/Subservient" target="_blank" rel="noopener">Subservient</a>. It&rsquo;s a Python project that automates exactly this flow: it extracts subtitles from MKVs, downloads them from OpenSubtitles via REST API, syncs them with ffsubsync, and cleans ads out of SRT files.</p>
<p>The project is complete. It has movie mode and series mode, smart sync (tests every candidate in parallel and picks the best one) and first-match (stops at the first one that works). It uses the OpenSubtitles hash for exact matching and cleans watermarks and ads with more than 30 regex patterns.</p>
<p>But it has the typical Python distribution problems:</p>
<ul>
<li><strong>7 pip dependencies</strong>: colorama, requests, langdetect, ffsubsync, platformdirs, pycountry, tqdm</li>
<li><strong>ffsubsync as the sync engine</strong>: which in turn depends on numpy, auditok, and a bunch more Python packages</li>
<li><strong>Interactive menu UI</strong>: good for manual use, terrible for scriptability</li>
<li><strong>Config in INI format</strong>: not the end of the world, but YAML is more ergonomic</li>
<li><strong>10,220 lines across 6 Python files</strong>: including one 2,700-line file, with individual functions running to hundreds of lines</li>
</ul>
<p>The point isn&rsquo;t that Python is bad for this. Subservient works. But installing and maintaining it in production is another story. If you want to run it on a headless server, you need Python 3.8+, pip, virtualenv (or you&rsquo;re going to pollute the system), and pray no dependency breaks with the next OS update.</p>
<h2>The experiment: porting it to Crystal with Claude<span class="hx:absolute hx:-mt-20" id="the-experiment-porting-it-to-crystal-with-claude"></span>
    <a href="#the-experiment-porting-it-to-crystal-with-claude" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Here&rsquo;s where it gets interesting. I wanted to test a hypothesis: can Claude take a large open source project, understand the architecture, and do a complete port to another language?</p>
<p>I&rsquo;m not talking about translating it file by file. I&rsquo;m talking about understanding what the project does, redesigning the architecture where it makes sense, and generating idiomatic code in Crystal.</p>
<p>What I did:</p>
<ol>
<li>Asked Claude to clone and analyze the Subservient repo</li>
<li>Explained the design decisions: use <a href="https://github.com/kaegi/alass" target="_blank" rel="noopener">alass</a> (a Rust binary, no Python dependencies) instead of ffsubsync, CLI subcommands instead of interactive menus, YAML instead of INI</li>
<li>Asked for a feature-parity port, with tests</li>
</ol>
<p>alass is an important detail. ffsubsync works fine, but it&rsquo;s a Python package that pulls in numpy and does audio analysis. alass does the same thing (subtitle synchronization through timing analysis), but it&rsquo;s a standalone Rust binary. Swapping one for the other eliminates the biggest Python dependency in the stack.</p>
<h2>The result: easy-subtitle<span class="hx:absolute hx:-mt-20" id="the-result-easy-subtitle"></span>
    <a href="#the-result-easy-subtitle" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Five commits. Less than 40 minutes from the first to the last.</p>
<table>
  <thead>
      <tr>
          <th>Commit</th>
          <th>Time</th>
          <th>What it did</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Initial implementation</td>
          <td>21:47</td>
          <td>Complete port: 42 src files, 16 test files, CI, install script</td>
      </tr>
      <tr>
          <td>Track shard.lock</td>
          <td>21:56</td>
          <td>Dependency lock for reproducible builds</td>
      </tr>
      <tr>
          <td>Prefer ~/.local/bin</td>
          <td>22:03</td>
          <td>Install script fix</td>
      </tr>
      <tr>
          <td>Add doctor command</td>
          <td>22:20</td>
          <td>New <code>doctor</code> command to validate the setup + bump v0.2.0</td>
      </tr>
      <tr>
          <td>Homebrew formula</td>
          <td>22:24</td>
          <td>Support for <code>brew install</code> and auto-update workflow</td>
      </tr>
  </tbody>
</table>
<p>The first commit already delivers a working project: 8 CLI commands, OpenSubtitles client with rate limiting, 76 passing tests, GitHub Actions with CI and release for Linux and macOS.</p>
<h3>Numbers<span class="hx:absolute hx:-mt-20" id="numbers"></span>
    <a href="#numbers" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Subservient (Python)</th>
          <th>easy-subtitle (Crystal)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Source code</td>
          <td>10,220 lines (6 files)</td>
          <td>2,516 lines (42 files)</td>
      </tr>
      <tr>
          <td>Tests</td>
          <td>0</td>
          <td>800 lines (76 specs)</td>
      </tr>
      <tr>
          <td>Runtime dependencies</td>
          <td>7 pip packages + ffsubsync</td>
          <td>0 (just webmock for tests)</td>
      </tr>
      <tr>
          <td>Binary</td>
          <td>n/a (needs Python + deps)</td>
          <td>~6MB static</td>
      </tr>
      <tr>
          <td>Config</td>
          <td>INI</td>
          <td>YAML</td>
      </tr>
      <tr>
          <td>Sync engine</td>
          <td>ffsubsync (Python)</td>
          <td>alass (Rust)</td>
      </tr>
      <tr>
          <td>UI</td>
          <td>Interactive menu</td>
          <td>CLI subcommands</td>
      </tr>
      <tr>
          <td>Concurrency</td>
          <td>ThreadPoolExecutor</td>
          <td>Crystal fibers + channels</td>
      </tr>
  </tbody>
</table>
<p>The LOC difference is loud: 10,220 vs 2,516. But that&rsquo;s not all Crystal&rsquo;s doing. The original Python has monolithic files of thousands of lines, with a lot of duplication and UI logic mixed with business logic. The port separates the responsibilities into small, focused modules.</p>
<h3>Architecture of the port<span class="hx:absolute hx:-mt-20" id="architecture-of-the-port"></span>
    <a href="#architecture-of-the-port" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code>easy-subtitle/
  src/easy_subtitle/
    cli/           # Router &#43; 9 commands (init, extract, download, sync, run, clean, scan, hash, doctor)
    core/          # Language map, SRT parser/writer/cleaner, video scanner
    acquisition/   # OpenSubtitles API client, auth, search, download, movie hash
    extraction/    # MKV track parsing, extraction, remuxing
    synchronization/  # alass runner, offset computation, smart/first-match strategies
    models/        # VideoFile, SubtitleCandidate, CoverageEntry</code></pre></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>Every module has a clear responsibility. The biggest file is 144 lines (config). In the original Python, <code>acquisition.py</code> alone has 2,726 lines.</p>
<h3>What each command does<span class="hx:absolute hx:-mt-20" id="what-each-command-does"></span>
    <a href="#what-each-command-does" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># Generate the default config</span>
</span></span><span class="line"><span class="cl">easy-subtitle init
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Extract subtitles from inside MKVs</span>
</span></span><span class="line"><span class="cl">easy-subtitle extract /path/to/movies
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Download subtitles from OpenSubtitles</span>
</span></span><span class="line"><span class="cl">easy-subtitle download -l en,pt /path/to/movies
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Sync downloaded subtitles with the video</span>
</span></span><span class="line"><span class="cl">easy-subtitle sync /path/to/movies
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Full pipeline: extract → download → sync</span>
</span></span><span class="line"><span class="cl">easy-subtitle run /path/to/movies
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Clean ads/watermarks from SRTs</span>
</span></span><span class="line"><span class="cl">easy-subtitle clean /path/to/subtitles
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># See subtitle coverage by language</span>
</span></span><span class="line"><span class="cl">easy-subtitle scan --json /path/to/movies
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Compute the OpenSubtitles hash (debug)</span>
</span></span><span class="line"><span class="cl">easy-subtitle <span class="nb">hash</span> /path/to/movie.mkv
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Validate the setup: config, credentials, dependencies</span>
</span></span><span class="line"><span class="cl">easy-subtitle doctor</span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p><code>doctor</code> is a command I added later. It checks whether the config exists, whether the API key is configured, tests login against the API, and checks whether <code>mkvmerge</code>, <code>mkvextract</code> and <code>alass</code> are on the PATH. It shows OS-specific install instructions when something is missing.</p>
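<p>The PATH checks in <code>doctor</code> boil down to locating executables. A minimal sketch of that idea in Python (the original project&rsquo;s language), purely for illustration — the tool names come from the post, the function itself is hypothetical:</p>

```python
import shutil

# External tools easy-subtitle depends on (from the post).
REQUIRED_TOOLS = ["mkvmerge", "mkvextract", "alass"]

def missing_tools(tools=REQUIRED_TOOLS):
    """Return the dependencies that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

<p>A real <code>doctor</code> would then map each missing tool to OS-specific install instructions, as described above.</p>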
<h3>Smart sync with fibers<span class="hx:absolute hx:-mt-20" id="smart-sync-with-fibers"></span>
    <a href="#smart-sync-with-fibers" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Smart sync is the part I most enjoyed seeing in the port. In the original Python, it uses <code>ThreadPoolExecutor</code> to run multiple candidates in parallel. In Crystal, the same logic is more natural with fibers and channels:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-crystal" data-lang="crystal"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">execute</span><span class="p">(</span><span class="n">candidates</span> <span class="p">:</span> <span class="nb">Array</span><span class="p">(</span><span class="n">Path</span><span class="p">),</span> <span class="n">video</span> <span class="p">:</span> <span class="n">VideoFile</span><span class="p">)</span> <span class="p">:</span> <span class="n">SyncResult?</span>
</span></span><span class="line"><span class="cl">  <span class="n">channel</span> <span class="o">=</span> <span class="nb">Channel</span><span class="p">(</span><span class="n">SyncResult</span><span class="p">)</span><span class="o">.</span><span class="n">new</span><span class="p">(</span><span class="n">candidates</span><span class="o">.</span><span class="n">size</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">  <span class="n">candidates</span><span class="o">.</span><span class="n">each</span> <span class="k">do</span> <span class="o">|</span><span class="n">candidate</span><span class="o">|</span>
</span></span><span class="line"><span class="cl">    <span class="bp">spawn</span> <span class="k">do</span>
</span></span><span class="line"><span class="cl">      <span class="n">result</span> <span class="o">=</span> <span class="n">sync_one</span><span class="p">(</span><span class="n">candidate</span><span class="p">,</span> <span class="n">video</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">      <span class="n">channel</span><span class="o">.</span><span class="n">send</span><span class="p">(</span><span class="n">result</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">end</span>
</span></span><span class="line"><span class="cl">  <span class="k">end</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">  <span class="n">results</span> <span class="o">=</span> <span class="nb">Array</span><span class="p">(</span><span class="n">SyncResult</span><span class="p">)</span><span class="o">.</span><span class="n">new</span><span class="p">(</span><span class="n">candidates</span><span class="o">.</span><span class="n">size</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="n">candidates</span><span class="o">.</span><span class="n">size</span><span class="o">.</span><span class="n">times</span> <span class="k">do</span>
</span></span><span class="line"><span class="cl">    <span class="n">results</span> <span class="o">&lt;&lt;</span> <span class="n">channel</span><span class="o">.</span><span class="n">receive</span>
</span></span><span class="line"><span class="cl">  <span class="k">end</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">  <span class="n">accepted</span> <span class="o">=</span> <span class="n">results</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="o">&amp;.</span><span class="n">accepted?</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="n">accepted</span><span class="o">.</span><span class="n">min_by?</span><span class="p">(</span><span class="o">&amp;.</span><span class="n">offset</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="k">end</span></span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>Each subtitle candidate gets synced in a separate fiber (via <code>spawn</code>). The results come back through the <code>Channel</code>. At the end, it picks the accepted one with the smallest offset. No ThreadPoolExecutor, no futures, no callbacks.</p>
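<p>For contrast, the same fan-out in the original&rsquo;s Python style might look like this — a hypothetical sketch, not Subservient&rsquo;s actual code; <code>sync_one</code> and the result objects are assumptions:</p>

```python
from concurrent.futures import ThreadPoolExecutor

def sync_all(candidates, video, sync_one):
    """Run sync_one over all candidates in parallel and keep the
    accepted result with the smallest offset (None if none accepted)."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda c: sync_one(c, video), candidates))
    accepted = [r for r in results if r.accepted]
    return min(accepted, key=lambda r: r.offset, default=None)
```

<p>Same outcome, but the executor, the implicit futures behind <code>pool.map</code>, and the thread pool sizing are all machinery the fiber version simply doesn&rsquo;t need.</p>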
<h3>API rate limiting<span class="hx:absolute hx:-mt-20" id="api-rate-limiting"></span>
    <a href="#api-rate-limiting" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>OpenSubtitles requires at least 500ms between requests. The Crystal client enforces that with a Mutex:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-crystal" data-lang="crystal"><span class="line"><span class="cl"><span class="no">RATE_LIMIT_MS</span> <span class="o">=</span> <span class="mi">500</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">private</span> <span class="k">def</span> <span class="nf">throttle!</span> <span class="p">:</span> <span class="nb">Nil</span>
</span></span><span class="line"><span class="cl">  <span class="vi">@mutex</span><span class="o">.</span><span class="n">synchronize</span> <span class="k">do</span>
</span></span><span class="line"><span class="cl">    <span class="n">elapsed</span> <span class="o">=</span> <span class="nb">Time</span><span class="o">.</span><span class="n">utc</span> <span class="o">-</span> <span class="vi">@last_request_at</span>
</span></span><span class="line"><span class="cl">    <span class="n">remaining</span> <span class="o">=</span> <span class="no">RATE_LIMIT_MS</span> <span class="o">-</span> <span class="n">elapsed</span><span class="o">.</span><span class="n">total_milliseconds</span>
</span></span><span class="line"><span class="cl">    <span class="nb">sleep</span><span class="p">(</span><span class="n">remaining</span><span class="o">.</span><span class="n">milliseconds</span><span class="p">)</span> <span class="k">if</span> <span class="n">remaining</span> <span class="o">&gt;</span> <span class="mi">0</span>
</span></span><span class="line"><span class="cl">    <span class="vi">@last_request_at</span> <span class="o">=</span> <span class="nb">Time</span><span class="o">.</span><span class="n">utc</span>
</span></span><span class="line"><span class="cl">  <span class="k">end</span>
</span></span><span class="line"><span class="cl"><span class="k">end</span></span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>Simple, thread-safe, no external library.</p>
<h3>Installation<span class="hx:absolute hx:-mt-20" id="installation"></span>
    <a href="#installation" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>The static binary comes out of GitHub Actions and can be installed three ways:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># Homebrew (macOS / Linux)</span>
</span></span><span class="line"><span class="cl">brew install akitaonrails/tap/easy-subtitle
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Install script</span>
</span></span><span class="line"><span class="cl">curl -fsSL https://raw.githubusercontent.com/akitaonrails/easy-subtitle/master/install.sh <span class="p">|</span> bash
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Or grab the binary directly from Releases</span></span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>One binary. No Python, no pip, nothing.</p>
<h2>On porting things &ldquo;just because&rdquo;<span class="hx:absolute hx:-mt-20" id="on-porting-things-just-because"></span>
    <a href="#on-porting-things-just-because" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>I&rsquo;ve always argued that porting software from one language to another out of pure language fetishism is a waste of time. How many projects have been rewritten in Rust &ldquo;just because&rdquo;? How much effort spent on rewrites that delivered no new value?</p>
<p>But I have to admit this experiment made me reconsider.</p>
<p>When the cost of porting drops from weeks/months to less than 40 minutes, the equation changes. Porting Subservient to Crystal with Claude wasn&rsquo;t an exercise in linguistic vanity. I wanted a static binary I could drop on a server and forget. No managing a Python runtime, no pip install breaking on the next system update.</p>
<p>And the result isn&rsquo;t a mechanical port. It&rsquo;s 2,516 lines across 42 files, against 10,220 in 6 monolithic ones. The port came with 76 tests the original didn&rsquo;t have, CI with automatic release for Linux and macOS, a Homebrew formula and an install script with checksum verification.</p>
<p>The point isn&rsquo;t that Python is bad. It&rsquo;s that the bar for &ldquo;is it worth porting?&rdquo; got ridiculously low. Feature-parity port with tests in less than an hour. Hard to argue against that.</p>
<p>The <a href="https://github.com/akitaonrails/easy-subtitle" target="_blank" rel="noopener">repository is here</a>. GPL-3.0, like the original.</p>
]]></content:encoded><category>crystal</category><category>python</category><category>claude</category><category>vibe-coding</category><category>subtitle</category></item><item><title>Crystal and a Smart FFmpeg Wrapper Built in 3 Hours | easy-ffmpeg</title><link>https://www.akitaonrails.com/en/2026/03/07/easy-ffmpeg-smart-wrapper-in-crystal/</link><guid isPermaLink="true">https://www.akitaonrails.com/en/2026/03/07/easy-ffmpeg-smart-wrapper-in-crystal/</guid><pubDate>Sat, 07 Mar 2026 18:00:00 GMT</pubDate><description>&lt;p&gt;Anyone who&amp;rsquo;s ever needed to convert a video on the terminal knows the FFmpeg pain. It does everything. Absolutely everything. But to do anything at all, you need to remember flag combinations that look like incantations:&lt;/p&gt;
&lt;div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code"&gt;
&lt;div&gt;&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;ffmpeg -i input.mkv -c:v libx264 -crf &lt;span class="m"&gt;23&lt;/span&gt; -preset medium &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; -profile:v high -level 4.1 -c:a aac -b:a 128k &lt;span class="se"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; -movflags +faststart output.mp4&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0"&gt;
&lt;button
class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
title="Copy code"
&gt;
&lt;div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"&gt;&lt;/div&gt;
&lt;div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"&gt;&lt;/div&gt;
&lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;What does that do? Convert to H.264 with reasonable quality for the web, using AAC for audio and moving the moov atom to the start of the file to allow progressive streaming. If you already knew that, congratulations. If you didn&amp;rsquo;t, welcome to the club of 99% of people who use FFmpeg by copying commands from Stack Overflow.&lt;/p&gt;</description><content:encoded><![CDATA[<p>Anyone who&rsquo;s ever needed to convert a video on the terminal knows the FFmpeg pain. It does everything. Absolutely everything. But to do anything at all, you need to remember flag combinations that look like incantations:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">ffmpeg -i input.mkv -c:v libx264 -crf <span class="m">23</span> -preset medium <span class="se">\
</span></span></span><span class="line"><span class="cl">  -profile:v high -level 4.1 -c:a aac -b:a 128k <span class="se">\
</span></span></span><span class="line"><span class="cl">  -movflags +faststart output.mp4</span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>What does that do? Convert to H.264 with reasonable quality for the web, using AAC for audio and moving the moov atom to the start of the file to allow progressive streaming. If you already knew that, congratulations. If you didn&rsquo;t, welcome to the club of 99% of people who use FFmpeg by copying commands from Stack Overflow.</p>
<p>And that&rsquo;s the easy case. Want to make an animated GIF from a sequence of PNGs? You need a two-pass pipeline with palette generation:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">ffmpeg -framerate <span class="m">10</span> -i frame_%04d.png <span class="se">\
</span></span></span><span class="line"><span class="cl">  -vf <span class="s2">&#34;split[s0][s1];[s0]palettegen[p];[s1][p]paletteuse&#34;</span> output.gif</span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>Want to cut a clip, resize to 720p and force a 16:9 aspect ratio? Good luck assembling the filter chain.</p>
<p>I got tired of memorizing flags. I wanted a CLI where I&rsquo;d say &ldquo;convert this to MP4 optimized for the web&rdquo; and it would figure it out. So I built <a href="https://github.com/akitaonrails/easy-ffmpeg" target="_blank" rel="noopener">easy-ffmpeg</a>.</p>
<h2>What easy-ffmpeg does<span class="hx:absolute hx:-mt-20" id="what-easy-ffmpeg-does"></span>
    <a href="#what-easy-ffmpeg-does" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>It&rsquo;s a smart wrapper. You give it the input file and the output format, and it analyzes the video with ffprobe, finds out which video, audio and subtitle streams exist, checks codec compatibility against the destination container, and decides on its own what can be copied directly (no re-encoding, instant) and what needs to be transcoded.</p>
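<p>The copy-vs-transcode decision can be pictured as a lookup against a compatibility table. A hedged sketch in Python — not easy-ffmpeg&rsquo;s actual code, and the codec lists are an illustrative subset:</p>

```python
# Which codecs each container can hold without re-encoding (illustrative subset).
CONTAINER_CODECS = {
    "mp4": {"video": {"h264", "hevc"}, "audio": {"aac", "mp3"}},
    "mkv": {"video": {"h264", "hevc", "vp9"}, "audio": {"aac", "mp3", "flac", "ac3"}},
}

def stream_action(container, stream_type, codec):
    """Decide whether a stream can be stream-copied or must be transcoded."""
    compatible = CONTAINER_CODECS[container][stream_type]
    return "copy" if codec in compatible else "transcode"
```

<p>So <code>stream_action("mp4", "video", "h264")</code> yields <code>"copy"</code> (instant remux), while <code>stream_action("mp4", "audio", "flac")</code> yields <code>"transcode"</code>. The real tool gets the codec names from ffprobe instead of taking them as arguments.</p>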
<p>The most basic use:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># Convert MKV to MP4 (copies compatible streams, no unnecessary re-encoding)</span>
</span></span><span class="line"><span class="cl">easy-ffmpeg input.mkv mp4
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Optimized for the web (H.264 + AAC, faststart)</span>
</span></span><span class="line"><span class="cl">easy-ffmpeg input.mkv mp4 --web
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Optimized for mobile (720p, AAC stereo, smaller file)</span>
</span></span><span class="line"><span class="cl">easy-ffmpeg input.mkv mp4 --mobile
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Maximum compression (H.265, CRF 28)</span>
</span></span><span class="line"><span class="cl">easy-ffmpeg input.mkv mp4 --compress
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># High quality for streaming (H.265, CRF 18)</span>
</span></span><span class="line"><span class="cl">easy-ffmpeg input.mkv mp4 --streaming</span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<h2>Examples of what you can do<span class="hx:absolute hx:-mt-20" id="examples-of-what-you-can-do"></span>
    <a href="#examples-of-what-you-can-do" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p><strong>Cut a clip from the video:</strong></p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># From minute 1:30 to 3:00</span>
</span></span><span class="line"><span class="cl">easy-ffmpeg video.mp4 mp4 --start 1:30 --end 3:00
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># The first 90 seconds</span>
</span></span><span class="line"><span class="cl">easy-ffmpeg video.mp4 mp4 --duration <span class="m">90</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Accepts several time formats: 90, 1:30, 01:30.5, 1:02:30</span></span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
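<p>Parsing those time formats is mostly a matter of splitting on colons. A sketch in Python of how <code>90</code>, <code>1:30</code>, <code>01:30.5</code> and <code>1:02:30</code> could all normalize to seconds — an assumption about the approach, not easy-ffmpeg&rsquo;s actual parser:</p>

```python
def parse_time(value: str) -> float:
    """Convert '90', '1:30', '01:30.5' or '1:02:30' into seconds."""
    seconds = 0.0
    for part in value.split(":"):
        # Each colon shifts the accumulated value up one unit (h -> min -> s).
        seconds = seconds * 60 + float(part)
    return seconds
```

<p>For example, <code>parse_time("1:02:30")</code> gives <code>3750.0</code> seconds, which is what ffmpeg&rsquo;s own <code>-ss</code>/<code>-t</code> flags expect.</p>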
<p><strong>Resize:</strong></p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># To 720p</span>
</span></span><span class="line"><span class="cl">easy-ffmpeg video.mp4 mp4 --scale hd
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># To 1080p</span>
</span></span><span class="line"><span class="cl">easy-ffmpeg video.mp4 mp4 --scale fullhd
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Presets: 2k (1440p), fullhd (1080p), hd (720p), retro (480p), icon (240p)</span></span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p><strong>Change aspect ratio:</strong></p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># Force 16:9 (adds black bars if needed)</span>
</span></span><span class="line"><span class="cl">easy-ffmpeg video.mp4 mp4 --aspect wide
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># TikTok/Stories format (9:16 vertical)</span>
</span></span><span class="line"><span class="cl">easy-ffmpeg video.mp4 mp4 --aspect tiktok
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Square for Instagram</span>
</span></span><span class="line"><span class="cl">easy-ffmpeg video.mp4 mp4 --aspect square
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Crop instead of adding bars</span>
</span></span><span class="line"><span class="cl">easy-ffmpeg video.mp4 mp4 --aspect wide --crop</span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p><strong>Image sequence to video:</strong></p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># Auto-detects the numbering pattern (frame_0001.png, frame_0002.png...)</span>
</span></span><span class="line"><span class="cl">easy-ffmpeg /folder/with/frames/ mp4
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Animated GIF with optimized palette</span>
</span></span><span class="line"><span class="cl">easy-ffmpeg /folder/with/frames/ gif
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Image sequence at 720p, 30fps</span>
</span></span><span class="line"><span class="cl">easy-ffmpeg /folder/with/frames/ mp4 --fps <span class="m">30</span> --scale hd</span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p><strong>See what it&rsquo;ll do without running it:</strong></p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">easy-ffmpeg video.mkv mp4 --web --dry-run
</span></span><span class="line"><span class="cl"><span class="c1"># Shows the exact ffmpeg command that would be executed</span></span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p><strong>Combine everything:</strong></p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># Cut, resize and compress to send over WhatsApp</span>
</span></span><span class="line"><span class="cl">easy-ffmpeg video.mkv mp4 --start 0:30 --end 2:00 --scale hd --compress</span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>And if you run <code>easy-ffmpeg</code> with no arguments in an interactive terminal, it opens a TUI mode with file selection through fuzzy search, preset choice through a menu, and time input with validation — all without having to remember any flags.</p>
<h2>Why Crystal<span class="hx:absolute hx:-mt-20" id="why-crystal"></span>
    <a href="#why-crystal" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>I hadn&rsquo;t touched Crystal in years. The last time was for fun, before version 1.0. And I wanted to revisit it.</p>
<p>Crystal occupies an interesting niche. Go and Crystal compete for the same space: compiled languages for applications, generating static binaries, with garbage collection and no runtime dependency. But the approach is very different.</p>
<p>Go is famously, deliberately simple. No generics for years (they only landed in Go 1.18), no exceptions (errors are returned as values), and little room for expressiveness. The argument is that this makes code easier to read and maintain in big teams. In practice, it produces verbose, repetitive code, with <code>if err != nil</code> after every call.</p>
<p>Crystal has static typing with inference, compile-time macros, blocks as closures (just like Ruby), exceptions, generics from day one, and a syntax any Rubyist will recognize:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-crystal" data-lang="crystal"><span class="line"><span class="cl"><span class="c1"># Crystal: read a JSON and extract data</span>
</span></span><span class="line"><span class="cl"><span class="n">streams</span> <span class="o">=</span> <span class="n">json</span><span class="o">[</span><span class="s2">&#34;streams&#34;</span><span class="o">].</span><span class="n">as_a</span><span class="o">.</span><span class="n">map</span> <span class="k">do</span> <span class="o">|</span><span class="n">s</span><span class="o">|</span>
</span></span><span class="line"><span class="cl">  <span class="n">StreamInfo</span><span class="o">.</span><span class="n">new</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="ss">codec</span><span class="p">:</span> <span class="n">s</span><span class="o">[</span><span class="s2">&#34;codec_name&#34;</span><span class="o">]?.</span><span class="n">try</span><span class="p">(</span><span class="o">&amp;.</span><span class="n">as_s</span><span class="p">)</span> <span class="o">||</span> <span class="s2">&#34;unknown&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="ss">width</span><span class="p">:</span>  <span class="n">s</span><span class="o">[</span><span class="s2">&#34;width&#34;</span><span class="o">]?.</span><span class="n">try</span><span class="p">(</span><span class="o">&amp;.</span><span class="n">as_i</span><span class="p">)</span> <span class="o">||</span> <span class="mi">0</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="ss">height</span><span class="p">:</span> <span class="n">s</span><span class="o">[</span><span class="s2">&#34;height&#34;</span><span class="o">]?.</span><span class="n">try</span><span class="p">(</span><span class="o">&amp;.</span><span class="n">as_i</span><span class="p">)</span> <span class="o">||</span> <span class="mi">0</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="k">end</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Go equivalent: would be 3x more lines with type assertions and error checks</span></span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>Crystal&rsquo;s stdlib has an HTTP server, JSON parsing, YAML, regex, fibers (green threads with a cooperative scheduler), channels (just like Go), and even <code>IO::FileDescriptor</code> with raw mode for the terminal — which I used for the interactive mode. Concurrency works with <code>spawn</code> (the equivalent of Go&rsquo;s <code>go</code>) and <code>Channel</code> (the counterpart of Go&rsquo;s <code>chan</code>). The difference is that everything comes with Ruby&rsquo;s ergonomics.</p>
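<p>To make the parallel concrete, here is a minimal sketch — not code from easy-ffmpeg, and the file list is made up — that fans work out to fibers and collects results over a channel, the same shape you&rsquo;d write in Go with <code>go</code> and <code>chan</code>:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code class="language-crystal"># Fan work out to fibers, collect results over a Channel
channel = Channel(String).new

files = ["a.mkv", "b.mkv", "c.mkv"] # hypothetical inputs
files.each do |file|
  spawn do
    # real code would shell out to ffprobe here
    channel.send("#{file}: analyzed")
  end
end

# receive exactly one result per spawned fiber
files.size.times { puts channel.receive }
</code></pre></div></div>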
<p>For a CLI that needs to compile into a static binary with no dependencies, run on Linux and macOS, and be distributed as a direct download — Crystal is perfect. The compiler generates native binaries, and with Docker Alpine you can do static linking with musl for Linux. The final easy-ffmpeg binary is around 6MB.</p>
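<p>For reference, the static Linux build boils down to one command in the official Alpine image — a sketch, assuming the entry point lives at <code>src/easy_ffmpeg.cr</code> (the actual path in the repo may differ):</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code class="language-bash"># Fully static Linux binary via musl, using the official Alpine image
docker run --rm -v "$PWD":/src -w /src \
  crystallang/crystal:latest-alpine \
  crystal build --release --static src/easy_ffmpeg.cr -o easy-ffmpeg

# sanity check: file(1) should report the binary as statically linked
file easy-ffmpeg
</code></pre></div></div>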
<p>I think Rust is better for systems code (kernels, drivers, databases, things where ownership and lifetimes matter). But for a video conversion CLI? Rust would be overkill. Crystal gives you the same end result (a fast static binary) with a third of the code and without fighting the borrow checker.</p>
<h2>The numbers<span class="hx:absolute hx:-mt-20" id="the-numbers"></span>
    <a href="#the-numbers" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>The whole project was built in one evening. 10 commits between 18:32 and 21:33 on March 7, 2026. Three hours.</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code>Language     Files       Lines     Code
Crystal      13          2,823     2,342
Shell        1           109       91
Markdown     1           266       210
YAML         2           46        34
─────────────────────────────────────────
Total        20          3,326     2,419</code></pre></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>Those 2,342 lines of Crystal cover: a CLI with 5 presets, media analysis through ffprobe, smart conversion planning with a codec compatibility matrix, a progress bar with ETA, an interactive mode with fuzzy search, support for image sequences and GIFs, and trimming/scaling/aspect ratio with validation.</p>
<p>The tool already has CI/CD on GitHub Actions, compiling static binaries for Linux (x86_64 and arm64) and macOS (arm64), with installation through a one-line curl:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">curl -fsSL https://raw.githubusercontent.com/akitaonrails/easy-ffmpeg/master/install.sh <span class="p">|</span> sh</span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>Three hours of one evening, from the first line of code to a release on GitHub with binaries for three platforms. FFmpeg keeps doing the heavy lifting — I just put a decent interface in front of it.</p>
<p>The <a href="https://github.com/akitaonrails/easy-ffmpeg"target="_blank" rel="noopener">repository is here</a>. MIT license.</p>
]]></content:encoded><category>crystal</category><category>ffmpeg</category><category>cli</category><category>vibe-coding</category><category>open-source</category></item><item><title>37 Days of Vibe Coding Immersion: Conclusions on Business Models</title><link>https://www.akitaonrails.com/en/2026/03/05/37-days-of-vibe-coding-immersion-conclusions-on-business-models/</link><guid isPermaLink="true">https://www.akitaonrails.com/en/2026/03/05/37-days-of-vibe-coding-immersion-conclusions-on-business-models/</guid><pubDate>Thu, 05 Mar 2026 14:00:00 GMT</pubDate><description>&lt;p&gt;For the last 37 days I locked myself into a vibe coding immersion. The goal was simple: actually understand what the current generation of LLMs and coding agents can do. Not on a weekend toy project, but building real things, with production deploys, users, and the maintenance pain that comes along.&lt;/p&gt;
&lt;p&gt;The result was 653 commits, ~144K lines of code across 8 projects published on GitHub, and a series of articles documenting each one. If you&amp;rsquo;ve been following along, you&amp;rsquo;ve already read the post-mortems. If you haven&amp;rsquo;t, here&amp;rsquo;s the index:&lt;/p&gt;</description><content:encoded><![CDATA[<p>For the last 37 days I locked myself into a vibe coding immersion. The goal was simple: actually understand what the current generation of LLMs and coding agents can do. Not on a weekend toy project, but building real things, with production deploys, users, and the maintenance pain that comes along.</p>
<p>The result was 653 commits, ~144K lines of code across 8 projects published on GitHub, and a series of articles documenting each one. If you&rsquo;ve been following along, you&rsquo;ve already read the post-mortems. If you haven&rsquo;t, here&rsquo;s the index:</p>
<h2>The articles<span class="hx:absolute hx:-mt-20" id="the-articles"></span>
    <a href="#the-articles" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><ul>
<li><a href="/en/2026/01/28/vibe-code-built-a-little-app-fully-with-glm-4-7-tv-clipboard/">Vibe Code: I built a small app 100% with GLM 4.7 (TV Clipboard)</a></li>
<li><a href="/en/2026/01/29/vibe-code-which-llm-is-best-lets-be-real/">Vibe Code: Which LLM is the BEST?? Let&rsquo;s get real</a></li>
<li><a href="/en/2026/02/01/frankmd-markdown-editor-vibe-code-part-1/">Vibe Code: I built a Markdown editor from scratch with Claude Code (FrankMD) PART 1</a></li>
<li><a href="/en/2026/02/01/frankmd-markdown-editor-vibe-code-part-2/">Vibe Code: I built a Markdown editor from scratch with Claude Code (FrankMD) PART 2</a></li>
<li><a href="/en/2026/02/08/rant-ai-killed-programmers/">RANT: Did AI kill programmers?</a></li>
<li><a href="/en/2026/02/16/vibe-code-zero-to-production-in-6-days-the-m-akita-chronicles/">Vibe Code: From Zero to Production in 6 DAYS - The M.Akita Chronicles</a></li>
<li><a href="/en/2026/02/20/zero-to-post-production-in-1-week-using-ai-on-real-projects-behind-the-m-akita-chronicles/">From Zero to Post-Production in 1 Week - How to Use AI on Real Projects</a></li>
<li><a href="/en/2026/02/21/vibe-code-built-a-mega-clone-in-rails-in-1-day-frankmega/">Vibe Code: I built a Mega clone in Rails in 1 day for my Home Server</a></li>
<li><a href="/en/2026/02/23/vibe-code-built-a-smart-image-indexer-with-ai-in-2-days-frank-sherlock/">Vibe Code: I built an Intelligent Image Indexer with AI in 2 days - Frank Sherlock</a></li>
<li><a href="/en/2026/02/24/rant-akita-caved-to-ai/">RANT: Did Akita open his legs to AI??</a></li>
<li><a href="/en/2026/03/01/ai-jail-sandbox-for-ai-agents-from-shell-script-to-real-tool/">ai-jail: Sandbox for AI Agents</a></li>
<li><a href="/en/2026/03/01/software-is-never-done-4-projects-life-after-deploy-one-shot-prompt-myth/">Software Is Never &lsquo;Done&rsquo; - 4 Projects, the Life After Deploy</a></li>
<li><a href="/en/2026/03/04/data-mining-system-for-my-influencer-girlfriend/">I Built a Data Mining System for My Influencer Girlfriend</a></li>
<li><a href="/en/2026/03/05/my-first-vibe-code-failure-frank-yomik/">My First Vibe Code Failure and How I Fixed It - Frank Yomik</a></li>
</ul>
<h2>The projects<span class="hx:absolute hx:-mt-20" id="the-projects"></span>
    <a href="#the-projects" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><table>
  <thead>
      <tr>
          <th>Project</th>
          <th>Commits</th>
          <th>LOC</th>
          <th>Time</th>
          <th>What it does</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/akitaonrails/FrankMD"target="_blank" rel="noopener">FrankMD</a></td>
          <td>234</td>
          <td>38K</td>
          <td>~5 days</td>
          <td>Markdown editor in Rust/Tauri</td>
      </tr>
      <tr>
          <td><a href="https://github.com/akitaonrails/FrankYomik"target="_blank" rel="noopener">FrankYomik</a></td>
          <td>131</td>
          <td>21K</td>
          <td>~9 days</td>
          <td>Manga/webtoon translator (Go + Python + Flutter)</td>
      </tr>
      <tr>
          <td><a href="https://github.com/akitaonrails/FrankSherlock"target="_blank" rel="noopener">FrankSherlock</a></td>
          <td>103</td>
          <td>37K</td>
          <td>~6 days</td>
          <td>Intelligent image indexer (Rails + Python)</td>
      </tr>
      <tr>
          <td>mila-bot (private)</td>
          <td>60</td>
          <td>30K</td>
          <td>~3 days</td>
          <td>Data mining system (Rails + Discord)</td>
      </tr>
      <tr>
          <td><a href="https://github.com/akitaonrails/tvclipboard"target="_blank" rel="noopener">TVClipboard</a></td>
          <td>49</td>
          <td>5K</td>
          <td>~2 days</td>
          <td>Cross-device clipboard app with GLM 4.7</td>
      </tr>
      <tr>
          <td><a href="https://github.com/akitaonrails/ai-jail"target="_blank" rel="noopener">ai-jail</a></td>
          <td>46</td>
          <td>6K</td>
          <td>~4 days</td>
          <td>Security sandbox for AI agents (Rust)</td>
      </tr>
      <tr>
          <td><a href="https://github.com/akitaonrails/FrankMega"target="_blank" rel="noopener">FrankMega</a></td>
          <td>29</td>
          <td>7K</td>
          <td>~1 day</td>
          <td>Mega clone for the home server (Rails)</td>
      </tr>
      <tr>
          <td>The M.Akita Chronicles</td>
          <td>1+</td>
          <td>many</td>
          <td>~6 days</td>
          <td>Full blog/newsletter in production</td>
      </tr>
  </tbody>
</table>
<p>All of this happened between January 27 and March 5, 2026.</p>
<p>I&rsquo;m not going to repeat the technical details of every project. Each post-mortem above already covered that. What I want to discuss here is the consequence.</p>
<h2>What these 37 days showed me<span class="hx:absolute hx:-mt-20" id="what-these-37-days-showed-me"></span>
    <a href="#what-these-37-days-showed-me" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Projects that used to take weeks or months for an experienced developer now take days. A complete Markdown editor in 5 days. A Mega clone in 1 day. A data mining system with 40+ bot tools in 3 days. An image indexer with vision AI in 2 days.</p>
<p>And I&rsquo;m not talking about throwaway prototypes. These projects are in production. They have tests. They have automated deployments. It&rsquo;s real software built with real engineering practices, just at a speed that was unthinkable before.</p>
<p>This speed isn&rsquo;t unique to me. Cloudflare just demonstrated something similar. <a href="https://blog.cloudflare.com/vinext/"target="_blank" rel="noopener">In a recent post</a>, they describe how an engineer reimplemented the Next.js API surface on top of Vite in a week, using Claude as the main tool. US$ 1,100 in API tokens, more than 800 sessions, 1,700 unit tests, 380 E2E tests, builds 4.4x faster than the original Next.js. Controversies aside about whether it&rsquo;s a &ldquo;complete&rdquo; reimplementation or not, the central point is real: software that used to require teams and months can now be built by one person in days.</p>
<p>And that changes everything for anyone wanting to start a company.</p>
<h2>The death of the easy startup<span class="hx:absolute hx:-mt-20" id="the-death-of-the-easy-startup"></span>
    <a href="#the-death-of-the-easy-startup" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>For years, the classic startup model worked like this: someone has an idea, puts together a small team, builds an MVP in 3-6 months, raises seed money, and tries to scale. The barrier to entry was development cost. Programmers are expensive, development is slow, and the first to market has the advantage.</p>
<p>That model is breaking.</p>
<p>If I can build a working Mega clone in 1 day, what&rsquo;s a &ldquo;cloud storage&rdquo; startup worth? If Cloudflare reimplements the Next.js core in 1 week with one engineer and US$ 1,100 in tokens, what&rsquo;s the real moat of a platform like Vercel? The &ldquo;social listening&rdquo; SaaS that charges R$ 500/month is competing with something I built in 3 days as a side project.</p>
<p>Every entrepreneur who used to show up describing their idea as &ldquo;it&rsquo;s like Uber, but for&hellip;&rdquo; or &ldquo;it&rsquo;s like Airbnb, except&hellip;&rdquo; or &ldquo;another social network, but with&hellip;&rdquo; — those people need to stop and rethink. When any competent developer with access to Claude Code or GPT Codex can replicate your MVP in a week, your idea isn&rsquo;t worth anything anymore. Execution got too cheap.</p>
<p>I&rsquo;m not exaggerating. In my 37 days, I built things I wouldn&rsquo;t even attempt in 6 months before. Small CRMs, ecommerces, content managers, productivity tools, data mining apps, processing pipelines — all of that became commodity. The code itself is no longer the differentiator.</p>
<h2>So what is the differentiator?<span class="hx:absolute hx:-mt-20" id="so-what-is-the-differentiator"></span>
    <a href="#so-what-is-the-differentiator" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>This is where most vibe coding enthusiasts miss the point. To illustrate, let me use my own project as an example.</p>
<p><a href="https://github.com/akitaonrails/FrankYomik"target="_blank" rel="noopener">Frank Yomik</a> translates manga pages from Japanese to English in near real-time. The full pipeline (balloon detection, OCR, translation, rendering) works. But the most important component of the system isn&rsquo;t any code I wrote. It&rsquo;s the <code>ogkalu/comic-text-and-bubble-detector</code> model, an RT-DETR-v2 trained on ~11,000 labeled comic images.</p>
<p>I didn&rsquo;t train that model. I couldn&rsquo;t easily train that model. Collecting 11,000 diverse comic images, manually labeling the speech balloons in each one (or generating labels with some semi-automated pipeline, which isn&rsquo;t trivial either), configuring the training, and running it on the necessary hardware — that&rsquo;s work of a different nature. It&rsquo;s work that vibe coding doesn&rsquo;t solve.</p>
<p>And that&rsquo;s the case of a relatively small model. An object detector can be trained with a few hundred to a few thousand labeled images on a single GPU in hours. Studies show useful results are possible with 100-350 images for specific domains, but robust real-world detectors usually need thousands. The cost is low, in the hundreds of dollars range.</p>
<p>Now look at what happens when we scale up to bigger models.</p>
<h2>The numbers that matter<span class="hx:absolute hx:-mt-20" id="the-numbers-that-matter"></span>
    <a href="#the-numbers-that-matter" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>GPT-4 cost more than US$ 100 million to train, according to Sam Altman himself. Stanford estimated the compute cost of Google&rsquo;s Gemini Ultra at US$ 191 million. Meta&rsquo;s Llama 3 consumed 39.3 million GPU-hours on H100s, and Meta built two clusters of 24,576 GPUs each to make it possible — and by the end of 2024 it planned to have 350,000 H100s in its infrastructure.</p>
<p>These costs are accelerating. According to Epoch AI, the cost of hardware and energy to train frontier models grows by 2.4x per year since 2016. If that trend continues, the largest training runs will cost more than US$ 1 billion before the end of 2027. Dario Amodei, Anthropic&rsquo;s CEO, has said frontier developers are probably spending close to a billion per training run now, with US$ 10 billion training runs expected within two years.</p>
<p>And the hardware? An H100 GPU costs US$ 25,000-40,000 per unit. A server with 8 GPUs runs between US$ 200,000 and US$ 400,000. The HBM memory those GPUs use is at capacity — SK Hynix, Samsung and Micron have already announced ~20% price increases for 2026. NVIDIA consumes more than 60% of the global HBM production. It&rsquo;s a structural bottleneck, not a temporary one.</p>
<p>In terms of energy: global data centers consumed ~415 TWh of electricity in 2024, according to the IEA, about 1.5% of the world&rsquo;s electricity. The projection is ~945 TWh by 2030. New data centers are being built with capacities of 100 MW to 1 GW each.</p>
<p>And the big tech investments reflect that. In 2025, the aggregate capex of Amazon (~$125B), Google (~$91B), Microsoft (~$80B) and Meta (~$71B) crossed US$ 400 billion, a 62% increase over 2024. Goldman Sachs projects more than US$ 500 billion in 2026.</p>
<p>These numbers aren&rsquo;t meant to scare. They&rsquo;re meant to put the real entry barrier in context.</p>
<h2>The new barrier: exclusive data and training capacity<span class="hx:absolute hx:-mt-20" id="the-new-barrier-exclusive-data-and-training-capacity"></span>
    <a href="#the-new-barrier-exclusive-data-and-training-capacity" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>If software got cheap to produce, the competitive differentiator migrated. The question stopped being &ldquo;who writes code the fastest?&rdquo; and became &ldquo;who has access to data nobody else has, and knows how to turn that data into useful models?&rdquo;</p>
<p>DeepSeek-V3 announced that its training cost US$ 5.5 million in compute. The press celebrated the &ldquo;cheap Chinese model.&rdquo; But The Register reported that the acquisition cost of the 256 GPU servers used was over US$ 51 million — and that excludes R&amp;D, data acquisition, data cleaning, and all the failed training runs before the final successful one. The real cost of developing the capability is one or two orders of magnitude above the marginal cost of the final successful training run.</p>
<p>That&rsquo;s why we only see large companies producing frontier models. Meta, Alibaba, Google, Amazon, Microsoft, Anthropic — companies investing tens of billions in hardware and energy. A garage startup can&rsquo;t compete on this dimension.</p>
<p>But the question goes beyond frontier models. Even smaller specialized models require something you can&rsquo;t buy: high-quality proprietary data.</p>
<p>Llama 3 was trained on 15 trillion tokens. Epoch AI has documented that we&rsquo;re approaching the limits of human-generated text data on the internet. Public data is being exhausted. Whoever has exclusive data — medical, financial, industrial, logistics, sensor, user behavior — has something that vibe coding can&rsquo;t replicate.</p>
<p>And even those who train specialized models with proprietary data face a problem: the advantage is temporary. Another competitor can collect similar data and train a competing model in months. The differentiator needs to be continuously fed: more data, better curation, more efficient training pipelines, access to hardware that&rsquo;s increasingly scarce and expensive.</p>
<h2>The picture<span class="hx:absolute hx:-mt-20" id="the-picture"></span>
    <a href="#the-picture" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>If you&rsquo;re thinking about starting a company, the question that matters has changed.</p>
<p>Before it was: &ldquo;can I build this software?&rdquo; Now the answer is almost always yes, fast and cheap.</p>
<p>The question now is: &ldquo;do I have exclusive data nobody else has, and do I know how to turn that data into something useful that&rsquo;s hard to replicate?&rdquo;</p>
<p>If the answer is no, any competitor with Claude Code replicates your product in days. And then another one shows up. And another. The race to the bottom on price is immediate when the cost of building is close to zero.</p>
<p>If the answer is yes, you have a real moat — but a temporary one. Competing models trained on similar data can show up in months. Your differentiator needs to be continuously fed.</p>
<p>The era of easy startups is over. Not because building software got harder — it got way easier. But precisely because of that: when everyone can build the same thing in a week, the competitive advantage has to come from somewhere else. And that &ldquo;somewhere else&rdquo; requires capital and infrastructure that are orders of magnitude more expensive than writing code.</p>
<p>The garage still works. But what comes out of it can no longer be &ldquo;an app.&rdquo;</p>
]]></content:encoded><category>ai</category><category>vibe-coding</category><category>startups</category><category>business</category><category>opinion</category></item><item><title>My First Vibe Code Failure and How I Fixed It | Frank Yomik</title><link>https://www.akitaonrails.com/en/2026/03/05/my-first-vibe-code-failure-frank-yomik/</link><guid isPermaLink="true">https://www.akitaonrails.com/en/2026/03/05/my-first-vibe-code-failure-frank-yomik/</guid><pubDate>Thu, 05 Mar 2026 08:00:00 GMT</pubDate><description>&lt;p&gt;Before anyone asks, all the code is in &lt;a href="https://github.com/akitaonrails/FrankYomik/"target="_blank" rel="noopener"&gt;this repository&lt;/a&gt;. And there are pre-built client binaries on the &lt;a href="https://github.com/akitaonrails/FrankYomik/releases"target="_blank" rel="noopener"&gt;releases page&lt;/a&gt;. But the app alone isn&amp;rsquo;t enough because it needs the server component, which runs on a machine of yours (local or cloud) with a GPU of at least 16GB of VRAM, and that you need to configure — I&amp;rsquo;m not going to maintain a public server, just a personal one for my own private use.&lt;/p&gt;</description><content:encoded><![CDATA[<p>Before anyone asks, all the code is in <a href="https://github.com/akitaonrails/FrankYomik/"target="_blank" rel="noopener">this repository</a>. And there are pre-built client binaries on the <a href="https://github.com/akitaonrails/FrankYomik/releases"target="_blank" rel="noopener">releases page</a>. But the app alone isn&rsquo;t enough because it needs the server component, which runs on a machine of yours (local or cloud) with a GPU of at least 16GB of VRAM, and that you need to configure — I&rsquo;m not going to maintain a public server, just a personal one for my own private use.</p>
<p>&ndash;</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/frankmd/2026/03/screenshot-2026-03-05_07-52-11.png" alt="kindle library"  loading="lazy" /></p>
<p>I have a huge collection of manga bought on Amazon.co.jp that I read through <a href="https://read.amazon.co.jp">Kindle web</a>. Shounen manga normally has furigana — that small text in hiragana next to the kanji — which helps me read, because, despite having studied Japanese, I was never formally trained. But manga aimed at adult audiences (seinen, not porn) usually comes without furigana. It&rsquo;s pure kanji and my reading speed crawls.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/frankmd/2026/03/screenshot-2026-03-05_07-53-26.png" alt="furigana"  loading="lazy" /></p>
<p>For years I&rsquo;ve wanted a tool that solves this. The idea is simple: detect the speech bubbles on a manga page, extract the text with OCR, and either add furigana to the kanji, or translate directly to English and render it back into the bubble. Sounds easy, right?</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/frankmd/2026/03/screenshot-2026-03-05_08-07-43.png" alt="no-furigana"  loading="lazy" />
<em>(kanji with no caption/furigana)</em></p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/frankmd/2026/03/screenshot-2026-03-05_08-11-21.png" alt="com furigana"  loading="lazy" />
<em>(with furigana injected in real time)</em></p>
<p>Yeah, it sounded easy. And because I knew it sounded easy but wouldn&rsquo;t be, I never had the patience to actually do it. I know how to build the first 80% of any project. The problem is the last 20% — that phase of experimentation, tweaking, fine-tuning, handling edge cases — that consumes more time than all the rest combined. And in a computer vision project, that 20% is especially treacherous.</p>
<p>But then the vibe coding era arrived. And I thought: maybe now the 20% is feasible. I started the project on February 24, 2026, at 23:10. And it became an example of how easy it is to produce massive volumes of <strong>useless</strong> code.</p>
<h2>The original idea: OpenCV and heuristics<span class="hx:absolute hx:-mt-20" id="the-original-idea-opencv-and-heuristics"></span>
    <a href="#the-original-idea-opencv-and-heuristics" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>My original plan — the concept I&rsquo;d imagined for years — was: use <a href="https://opencv.org/"target="_blank" rel="noopener">OpenCV</a> — which is a famous and old computer vision library — to detect the speech bubbles. Manga bubbles are typically white areas with a black contour inside the panels. In theory, you just threshold the image to grab white regions, find contours, filter by size and shape, and you&rsquo;re done.</p>
<p>In 24 hours I already had a working proof of concept: bubble detection, OCR with manga-ocr (a model trained specifically on Japanese manga text), furigana with MeCab for morphological analysis, translation with Ollama running Qwen3:14b locally, and rendering the text back into the bubble. The initial commit (<code>9169d73</code>) on Feb 24 already did all of that.</p>
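<p>One fiddly detail of the furigana step is worth showing: once MeCab gives you a word&rsquo;s surface form and its katakana reading, the ruby text should sit only on the kanji stem, not on the trailing okurigana. A hypothetical helper for that alignment (the MeCab parsing itself is omitted; function names and the bracket format are mine):</p>

```python
def kata_to_hira(s):
    """Shift katakana code points down to hiragana; leave everything else alone."""
    return "".join(chr(ord(c) - 0x60) if "ァ" <= c <= "ヶ" else c for c in s)

def ruby(surface, reading_kata):
    """Attach the reading only to the kanji stem, skipping trailing okurigana."""
    reading = kata_to_hira(reading_kata)
    i = 0  # length of the kana tail shared by surface and reading
    while i < len(surface) and i < len(reading) and kata_to_hira(surface[-1 - i]) == reading[-1 - i]:
        i += 1
    stem = surface[: len(surface) - i]
    if not stem:  # all-kana word: nothing needs furigana
        return surface
    return f"{stem}【{reading[: len(reading) - i]}】{surface[len(surface) - i:]}"

print(ruby("食べる", "タベル"))  # ruby goes on 食 only, not on べる
```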
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/frankmd/2026/03/shounen6-debug.jpg" alt="shounen6 debug"  loading="lazy" /></p>
<p>The next day, Feb 25, I was already extending the pipeline with a Korean webtoon translation flow. 27 commits that day. Everything seemed to be flowing.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/frankmd/2026/03/057-debug.jpg" alt="webtoon debug"  loading="lazy" /></p>
<p>And then the hell started.</p>
<h2>The hell of false positives<span class="hx:absolute hx:-mt-20" id="the-hell-of-false-positives"></span>
    <a href="#the-hell-of-false-positives" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>The problem with OpenCV bubble detection is that manga isn&rsquo;t a standardized document. Every artist has their own line, scan quality varies enormously between print eras, and color pages need completely different parameters from black-and-white pages. And the thing that looks the most like a white speech bubble on a manga page is&hellip; <strong>a face</strong>.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/frankmd/2026/03/screenshot-2026-03-05_09-15-17.png" alt="false positive face"  loading="lazy" /></p>
<p>Character faces are light areas, relatively rounded, with a dark contour. Exactly like speech bubbles. And no matter how many filters you stack, there&rsquo;s always a case where the face of a character with blue hair sails through every filter, or where a legitimate bubble gets rejected because it has an unusual shape.</p>
<p>Look at how my bubble detector ended up at peak complexity — 551 lines with 7 layers of false-positive filters (abridged version):</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># --- False positive filters ---</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 1. Edge density</span>
</span></span><span class="line"><span class="cl"><span class="n">edge_pixels</span> <span class="o">=</span> <span class="n">cv2</span><span class="o">.</span><span class="n">countNonZero</span><span class="p">(</span><span class="n">cv2</span><span class="o">.</span><span class="n">bitwise_and</span><span class="p">(</span><span class="n">edges</span><span class="p">,</span> <span class="n">edges</span><span class="p">,</span> <span class="n">mask</span><span class="o">=</span><span class="n">mask</span><span class="p">))</span>
</span></span><span class="line"><span class="cl"><span class="n">edge_density</span> <span class="o">=</span> <span class="n">edge_pixels</span> <span class="o">/</span> <span class="n">area</span>
</span></span><span class="line"><span class="cl"><span class="k">if</span> <span class="n">edge_density</span> <span class="o">&gt;</span> <span class="n">max_edge_density</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="k">continue</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 2. Bright pixel ratio</span>
</span></span><span class="line"><span class="cl"><span class="n">bright_pixels</span> <span class="o">=</span> <span class="n">cv2</span><span class="o">.</span><span class="n">countNonZero</span><span class="p">(</span><span class="n">cv2</span><span class="o">.</span><span class="n">bitwise_and</span><span class="p">(</span><span class="n">bright_thresh</span><span class="p">,</span> <span class="n">mask</span><span class="p">))</span>
</span></span><span class="line"><span class="cl"><span class="n">bright_ratio</span> <span class="o">=</span> <span class="n">bright_pixels</span> <span class="o">/</span> <span class="n">area</span>
</span></span><span class="line"><span class="cl"><span class="k">if</span> <span class="n">bright_ratio</span> <span class="o">&lt;</span> <span class="n">min_bright_ratio</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="k">continue</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 3. Mid-tone ratio</span>
</span></span><span class="line"><span class="cl"><span class="n">mid_mask</span> <span class="o">=</span> <span class="n">cv2</span><span class="o">.</span><span class="n">inRange</span><span class="p">(</span><span class="n">gray</span><span class="p">,</span> <span class="mi">80</span><span class="p">,</span> <span class="mi">220</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">mid_pixels</span> <span class="o">=</span> <span class="n">cv2</span><span class="o">.</span><span class="n">countNonZero</span><span class="p">(</span><span class="n">cv2</span><span class="o">.</span><span class="n">bitwise_and</span><span class="p">(</span><span class="n">mid_mask</span><span class="p">,</span> <span class="n">mask</span><span class="p">))</span>
</span></span><span class="line"><span class="cl"><span class="n">mid_ratio</span> <span class="o">=</span> <span class="n">mid_pixels</span> <span class="o">/</span> <span class="n">area</span>
</span></span><span class="line"><span class="cl"><span class="k">if</span> <span class="n">mid_ratio</span> <span class="o">&gt;</span> <span class="n">max_mid_ratio</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="k">continue</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 4. Contour circularity</span>
</span></span><span class="line"><span class="cl"><span class="n">circularity</span> <span class="o">=</span> <span class="mi">4</span> <span class="o">*</span> <span class="n">np</span><span class="o">.</span><span class="n">pi</span> <span class="o">*</span> <span class="n">area</span> <span class="o">/</span> <span class="p">(</span><span class="n">perimeter</span> <span class="o">*</span> <span class="n">perimeter</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="k">if</span> <span class="n">circularity</span> <span class="o">&lt;</span> <span class="mf">0.10</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="k">continue</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 5. Border darkness</span>
</span></span><span class="line"><span class="cl"><span class="n">border_mean</span> <span class="o">=</span> <span class="n">cv2</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">gray</span><span class="p">,</span> <span class="n">mask</span><span class="o">=</span><span class="n">border_only</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
</span></span><span class="line"><span class="cl"><span class="k">if</span> <span class="n">border_mean</span> <span class="o">&gt;</span> <span class="mi">160</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="k">continue</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 6. Background uniformity (white_std)</span>
</span></span><span class="line"><span class="cl"><span class="n">white_pixels</span> <span class="o">=</span> <span class="n">gray</span><span class="p">[(</span><span class="n">mask</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">)</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">gray</span> <span class="o">&gt;</span> <span class="mi">200</span><span class="p">)]</span>
</span></span><span class="line"><span class="cl"><span class="k">if</span> <span class="nb">float</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">std</span><span class="p">(</span><span class="n">white_pixels</span><span class="p">))</span> <span class="o">&gt;</span> <span class="mi">15</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="k">continue</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 7. Dark content analysis (text strokes)</span>
</span></span><span class="line"><span class="cl"><span class="n">very_dark</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">sum</span><span class="p">((</span><span class="n">inner_mask</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">)</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">gray</span> <span class="o">&lt;</span> <span class="mi">60</span><span class="p">))</span>
</span></span><span class="line"><span class="cl"><span class="n">dark_ratio_60</span> <span class="o">=</span> <span class="n">very_dark</span> <span class="o">/</span> <span class="n">inner_area</span>
</span></span><span class="line"><span class="cl"><span class="k">if</span> <span class="n">dark_ratio_60</span> <span class="o">&lt;</span> <span class="n">min_dark_ratio</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="k">continue</span></span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>Each one of those filters was added in response to a specific false positive. Edge density separated bubbles (which have sparse text strokes) from faces (which have hair, eyes, nose creating dense edges). Bright pixel ratio checked whether the region was actually white. Circularity discarded shapes that were too irregular. And so on.</p>
<p>But the worst part is that those filters interacted with each other in unpredictable ways. Look at this commit from Feb 26 (<code>70c814a</code>):</p>
<blockquote>
  <p>&ldquo;Revert rect_dark, mid_ratio, and early-split changes that caused face FPs&rdquo;</p>

</blockquote>
<p>I had tried to relax two thresholds — rect_dark from 0.10 to 0.11, mid_ratio from 0.15 to 0.16 — to recover bubbles that were being missed. Result: faces and body regions started slipping through as false positives in Adachi manga. I had to revert everything.</p>
<p>That&rsquo;s the pattern that repeated for days: recovering a missed bubble meant opening the door to false positives. Fixing a false positive meant losing a legitimate bubble. It was an endless game of whack-a-mole.</p>
<h2>The band-aids: CLAHE, edge detection, watershed<span class="hx:absolute hx:-mt-20" id="the-band-aids-clahe-edge-detection-watershed"></span>
    <a href="#the-band-aids-clahe-edge-detection-watershed" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>When the 7 basic filters weren&rsquo;t enough, I started stacking additional passes.</p>
<p>Commit <code>294e785</code> (Feb 26): added CLAHE (<em>Contrast Limited Adaptive Histogram Equalization</em>) as a second detection pass. Bubbles with mid-range brightness, near the 200 threshold, were being missed. CLAHE equalized the contrast and revealed those borderline bubbles.</p>
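<p>CLAHE itself is a single OpenCV call (<code>cv2.createCLAHE(...).apply(gray)</code>), applied per tile so local contrast gets stretched. To get a feel for why equalization reveals borderline-bright bubbles, here is plain <em>global</em> histogram equalization in stdlib Python — a simplified cousin of CLAHE, purely for illustration:</p>

```python
def equalize(gray_flat, levels=256):
    """Global histogram equalization: remap each level through the normalized CDF.

    Mid-range values get pushed apart, so regions hovering just under a
    brightness threshold become clearly bright. CLAHE does this per tile,
    with a clip limit; this global version is just the core idea.
    """
    hist = [0] * levels
    for v in gray_flat:
        hist[v] += 1
    cdf, total = [], 0
    for count in hist:
        total += count
        cdf.append(total)
    n = len(gray_flat)
    return [round((cdf[v] - 1) / (n - 1) * (levels - 1)) for v in gray_flat]

print(equalize([100, 150, 200, 250]))  # clustered mid-tones spread across the full range
```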
<p>But CLAHE also made faces pass through the filters because it artificially inflated skin brightness. So I had to add an entire validation function against the original image:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">_validate_on_original</span><span class="p">(</span><span class="n">candidate</span><span class="p">,</span> <span class="n">gray_orig</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Check if a CLAHE-detected candidate looks bubble-like on original.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="n">roi</span> <span class="o">=</span> <span class="n">gray_orig</span><span class="p">[</span><span class="n">y1</span><span class="p">:</span><span class="n">y2</span><span class="p">,</span> <span class="n">x1</span><span class="p">:</span><span class="n">x2</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">    <span class="n">mean_brightness</span> <span class="o">=</span> <span class="n">roi</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Already bright enough for pass 1 — rejected for good reason</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">mean_brightness</span> <span class="o">&gt;</span> <span class="mi">215</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="kc">False</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Must have text strokes (dark pixels)</span>
</span></span><span class="line"><span class="cl">    <span class="n">dark_ratio</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="n">roi</span> <span class="o">&lt;</span> <span class="mi">60</span><span class="p">)</span> <span class="o">/</span> <span class="n">roi</span><span class="o">.</span><span class="n">size</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">dark_ratio</span> <span class="o">&lt;</span> <span class="mf">0.07</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="kc">False</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># White pixel variance: text creates high std, face skin is uniform</span>
</span></span><span class="line"><span class="cl">    <span class="n">white_pixels</span> <span class="o">=</span> <span class="n">roi</span><span class="p">[</span><span class="n">roi</span> <span class="o">&gt;</span> <span class="mi">200</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">white_pixels</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">50</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="nb">float</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">std</span><span class="p">(</span><span class="n">white_pixels</span><span class="p">))</span> <span class="o">&lt;</span> <span class="mi">9</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="k">return</span> <span class="kc">False</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="kc">True</span></span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>Commit <code>5dddb31</code> (March 1): added an entire third detection pass based on edges (<em>edge-based segmentation</em>), to catch bubbles whose white interior blended into the white background of the page. It dilated the Canny edges, inverted them, ANDed the result with bright regions, and looked for contours in what remained.</p>
<p>Commit <code>b695295</code> (Feb 26): added recovery of small bubbles via morphological gradient + OCR validation. If OCR confirmed there was valid Japanese text in the region, it was probably a real bubble.</p>
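<p>The OCR-validation gate can be sketched in a few lines: accept a recovered region only if the OCR output actually contains Japanese characters rather than stroke noise. This is a hypothetical helper (names and thresholds are mine), using the standard Unicode blocks for hiragana, katakana, and CJK ideographs:</p>

```python
import re

# Hiragana (U+3040-309F), katakana (U+30A0-30FF), CJK unified ideographs (U+4E00-9FFF)
JAPANESE = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")

def looks_like_japanese_text(ocr_output, min_chars=2):
    """Accept a candidate region only if OCR found at least a little real Japanese."""
    return len(JAPANESE.findall(ocr_output)) >= min_chars

print(looks_like_japanese_text("こんにちは、世界"))   # real text passes
print(looks_like_japanese_text("|||—..///"))          # stroke noise is rejected
```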
<p>Each band-aid added 50-100 lines of code and another layer of complexity. And each one had its own edge cases and false positives.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/frankmd/2026/03/shounen-debug.jpg" alt="shounen debug"  loading="lazy" /></p>
<h2>The final tally for v0.1<span class="hx:absolute hx:-mt-20" id="the-final-tally-for-v01"></span>
    <a href="#the-final-tally-for-v01" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>On March 1, I tagged v0.1. At that point I had:</p>
<ul>
<li><strong>90 commits</strong> in 6 days of development (Feb 24 to March 1)</li>
<li><strong>551 lines</strong> in <code>bubble_detector.py</code> alone</li>
<li>7 layers of false-positive filters, each with empirical thresholds</li>
<li>Two complete detection passes (original + CLAHE)</li>
<li>A third edge-based pass</li>
<li>Cross-validation against the original image</li>
<li>Watershed separation for overlapping bubbles</li>
<li>Separate threshold profiles for color vs black-and-white pages</li>
<li>20+ magic numbers tuned empirically against a limited sample of pages</li>
<li>Regression tests pinning each specific false positive (face of a girl with blue hair, window frame, thin horizontal strip, concrete floor&hellip;)</li>
</ul>
<p>And even with all that, it still wasn&rsquo;t reliable. Every new manga I tested revealed some little case that broke the detector. It was an extremely brittle monolith.</p>
<blockquote>
  <p><strong>SYMPTOM:</strong> you&rsquo;re patching a fix that fixed another fix that was fixing yet another fix, and when you touch one piece, you accidentally break another: that means the code is <strong>brittle</strong>, a house of cards about to collapse. That&rsquo;s the moment to give up and rethink!</p>

</blockquote>
<h2>The decision: research alternatives<span class="hx:absolute hx:-mt-20" id="the-decision-research-alternatives"></span>
    <a href="#the-decision-research-alternatives" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>At v0.1 I stopped and asked the question I should have asked at the start: is there someone who already trained an ML model to do exactly this?</p>
<p>I asked Claude to research available models for comic bubble detection. The research produced the document <code>docs/yolo_bubble_detection_plan.md</code>, where we analyzed the alternatives.</p>
<p>The first one that came up was a <a href="https://medium.com/@beyzaakyildiz/what-is-yolov8-how-to-use-it-b3807d13c5ce"target="_blank" rel="noopener">YOLOv8 Medium from ogkalu</a> (<code>comic-speech-bubble-detector-yolov8m</code>), trained on ~8,000 images of manga, webtoon, manhua and Western comics. It detects only one class (speech bubble). But digging deeper, we found another model from the same author: <code>ogkalu/comic-text-and-bubble-detector</code>, an <a href="https://github.com/lyuwenyu/RT-DETR"target="_blank" rel="noopener">RT-DETR-v2</a> with a ResNet-50-vd backbone (42.9M parameters), trained on <strong>~11,000 images</strong>, with three classes: <code>bubble</code>, <code>text_bubble</code> and <code>text_free</code>. Both Apache 2.0.</p>
<p>We also evaluated the comic-text-detector (DBNet + YOLOv5, ~13,000 images from Manga109), but that one detected text regions and not bubbles. And as future training data, there was Roboflow with 4,492 already-labeled images, and the Manga109 dataset with 147,918 annotations across 21,142 pages.</p>
<p>The RT-DETR-v2 with 3 classes was the most promising because it distinguished speech bubbles, text inside bubbles and free text (narration, SFX). It could replace both <code>bubble_detector.py</code> and <code>text_detector.py</code> in a single inference pass.</p>
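<p>That replacement is easy to picture: one inference returns boxes tagged with one of the three classes, and routing them replaces the two separate detectors. A hypothetical post-processing sketch (the id-to-name mapping below is assumed, not taken from the model card):</p>

```python
# Assumed mapping from RT-DETR label ids to the model's three class names.
ID2LABEL = {0: "bubble", 1: "text_bubble", 2: "text_free"}

def split_detections(detections):
    """Group raw detections (dicts with 'label_id' and 'box') by class name.

    'bubble'/'text_bubble' feed the bubble pipeline; 'text_free' covers
    narration and SFX that the old text_detector.py used to find.
    """
    grouped = {name: [] for name in ID2LABEL.values()}
    for det in detections:
        grouped[ID2LABEL[det["label_id"]]].append(det["box"])
    return grouped

dets = [{"label_id": 0, "box": (0, 0, 10, 10)}, {"label_id": 2, "box": (5, 5, 8, 8)}]
print(split_detections(dets))
```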
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/frankmd/2026/03/026-en.jpg" alt="026 en"  loading="lazy" /></p>
<p>The conclusion of the research document was direct:</p>
<blockquote>
  <p>&ldquo;Even without fine-tuning, these models were trained on 8 to 11 thousand images of diverse comics. They should handle the artistic style diversity that our manual filters struggle with. The 7-filter heuristic cascade and its magic numbers would be completely eliminated.&rdquo;</p>

</blockquote>
<p>And as if that weren&rsquo;t enough, we included a fine-tuning plan using paired data (original Japanese pages vs English fan translations) that would generate labels automatically through image diffs. But first we wanted to test the baseline.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/frankmd/2026/03/shounen2-en.jpg" alt="shounen2 en"  loading="lazy" /></p>
<h2>Replacing everything: RT-DETR-v2<span class="hx:absolute hx:-mt-20" id="replacing-everything-rt-detr-v2"></span>
    <a href="#replacing-everything-rt-detr-v2" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>On March 4, I made commit <code>0df63f2</code>: &ldquo;Replace OpenCV heuristic detection with RT-DETR-v2, add bubble shape masking&rdquo;.</p>
<p>The diff speaks for itself:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code>16 files changed, 732 insertions(&#43;), 1112 deletions(-)</code></pre></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>1,112 lines deleted. More lines removed than added. The 551-line bubble detector was replaced by 262 lines — and most of those are shape mask extraction (contour mask) from the detected bbox, not the detection itself.</p>
<p>The detection core became this:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">MODEL_ID</span> <span class="o">=</span> <span class="s2">&#34;ogkalu/comic-text-and-bubble-detector&#34;</span>
</span></span><span class="line"><span class="cl"><span class="n">DEFAULT_CONFIDENCE</span> <span class="o">=</span> <span class="mf">0.35</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">detect_bubbles</span><span class="p">(</span><span class="n">img_cv</span><span class="p">,</span> <span class="n">confidence</span><span class="o">=</span><span class="n">DEFAULT_CONFIDENCE</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="n">model</span><span class="p">,</span> <span class="n">processor</span><span class="p">,</span> <span class="n">device</span> <span class="o">=</span> <span class="n">_get_model</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">    <span class="n">img_pil</span> <span class="o">=</span> <span class="n">Image</span><span class="o">.</span><span class="n">fromarray</span><span class="p">(</span><span class="n">img_cv</span><span class="p">[:,</span> <span class="p">:,</span> <span class="p">::</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span>  <span class="c1"># BGR→RGB</span>
</span></span><span class="line"><span class="cl">    <span class="n">inputs</span> <span class="o">=</span> <span class="n">processor</span><span class="p">(</span><span class="n">images</span><span class="o">=</span><span class="n">img_pil</span><span class="p">,</span> <span class="n">return_tensors</span><span class="o">=</span><span class="s2">&#34;pt&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">with</span> <span class="n">torch</span><span class="o">.</span><span class="n">no_grad</span><span class="p">():</span>
</span></span><span class="line"><span class="cl">        <span class="n">outputs</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="o">**</span><span class="n">inputs</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">results</span> <span class="o">=</span> <span class="n">processor</span><span class="o">.</span><span class="n">post_process_object_detection</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="n">outputs</span><span class="p">,</span> <span class="n">target_sizes</span><span class="o">=</span><span class="n">target_sizes</span><span class="p">,</span> <span class="n">threshold</span><span class="o">=</span><span class="n">confidence</span>
</span></span><span class="line"><span class="cl">    <span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># ... map classes, deduplicate, sort</span></span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
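<p>The elided &ldquo;deduplicate&rdquo; step is the only post-processing with any logic to it, and even that is small. A minimal sketch of greedy IoU suppression (hypothetical helper, not the repo&rsquo;s exact code; the 0.6 overlap threshold is illustrative):</p>

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def deduplicate(detections, overlap=0.6):
    """Keep the highest-confidence detection among heavily overlapping boxes."""
    kept = []
    for det in sorted(detections, key=lambda d: -d["score"]):
        if all(iou(det["box"], k["box"]) < overlap for k in kept):
            kept.append(det)
    return kept

dets = [
    {"box": (0, 0, 10, 10), "score": 0.9},
    {"box": (1, 1, 11, 11), "score": 0.5},   # overlaps the first box heavily: dropped
    {"box": (50, 50, 60, 60), "score": 0.8},
]
print(len(deduplicate(dets)))
```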
<p>No magic thresholds. No separate profiles for color vs B&amp;W. No CLAHE. No edge detection. No watershed. No 7 filter layers. A model trained on 11,000 diverse comic images already knows how to tell a bubble from a face better than my pile of &ldquo;ifs.&rdquo;</p>
<p>What also vanished alongside it:</p>
<ul>
<li>The entire <code>text_detector.py</code> (387 lines) — replaced by RT-DETR&rsquo;s <code>text_free</code> class</li>
<li>The false-positive feedback system — RT-DETR detections are reliable enough not to need manual marking</li>
<li>Hundreds of lines of regression tests pinning specific false positives</li>
</ul>
<p>In the end, I didn&rsquo;t even need to try fine-tuning my own model. The model that already existed solved more than 99% of the cases, which for me was already excellent.</p>
<p>People who don&rsquo;t understand statistics struggle with this. With my &ldquo;manual&rdquo; OpenCV procedure I was already getting 80%, maybe more. But that&rsquo;s very little. If on every page a face gets a bubble slapped on top of it, that&rsquo;s terrible.</p>
<p>Even if, with a lot of effort (another five hundred different &ldquo;ifs&rdquo;), I could get to 95%, that&rsquo;s still not enough. Reaching 80% is easy. The last 20% costs exponentially more, and the last 1% can be impossible in many cases. That&rsquo;s how things work. Everyone stops at 80%.</p>
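<p>The arithmetic behind that intuition is simple: per-detection accuracy compounds across a page. Assuming roughly ten independent bubble decisions per page (an illustrative number, not a measurement):</p>

```python
# A page is only "clean" if every one of its ~10 bubble decisions is right,
# so the per-page success rate is roughly p ** 10.
for p in (0.80, 0.95, 0.99):
    print(f"{p:.0%} per bubble -> {p ** 10:.0%} chance of a fully clean page")
# roughly: 80% -> 11%, 95% -> 60%, 99% -> 90%
```

That is why 80% per bubble feels unusable, 95% still annoys you on almost every other page, and only somewhere past 99% does the tool start to disappear into the background.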
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/frankmd/2026/03/shounen10-debug.jpg" alt="shounen10 debug"  loading="lazy" /></p>
<h2>The other pains: Flutter on Linux<span class="hx:absolute hx:-mt-20" id="the-other-pains-flutter-on-linux"></span>
    <a href="#the-other-pains-flutter-on-linux" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>While the Python and Go backend was relatively stable, the Flutter client on Linux was a separate chapter of suffering.</p>
<p>Flutter uses WebKitGTK for the WebView on Linux, and that component has painful particularities. Commit <code>8e5d168</code> (Feb 28) tells the story: WebKitGTK can&rsquo;t resolve Promise return values from asynchronous JavaScript, generating a PlatformException. I had to rewrite all the overlays as synchronous IIFEs with <code>decode().then()</code> and an <code>opacity: 0.999</code> nudge to force a texture refresh in the GPU compositor.</p>
<p>On my NVIDIA + Wayland setup, the WebView was unusable at full resolution. Commit <code>a36e1eb</code> (Feb 27) tried to fix it by forcing CPU rendering with <code>WEBKIT_SKIA_ENABLE_CPU_RENDERING=1</code> and disabling accelerated compositing. Then I had to revert that (commit <code>8e5d168</code>) and force the AMD iGPU&rsquo;s Mesa via <code>__EGL_VENDOR_LIBRARY_FILENAMES</code> to stop WebKitGTK from grabbing the NVIDIA dGPU.</p>
<p>Each one of those discoveries cost hours of debugging things that simply had no documentation. In the end I got the Linux version to a reasonable state, but it&rsquo;s nothing spectacular. I don&rsquo;t know if I&rsquo;m missing something obvious, but I found Flutter on Linux to be a drag, especially right after building a native app with Rust/Tauri (a much better experience). But since I wanted an app that worked on both Linux and Android, there weren&rsquo;t many options.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/frankmd/2026/03/078-en.jpg" alt="078 en"  loading="lazy" /></p>
<h2>The Kindle prefetch that died<span class="hx:absolute hx:-mt-20" id="the-kindle-prefetch-that-died"></span>
    <a href="#the-kindle-prefetch-that-died" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>An idea that seemed brilliant and went wrong: create a second hidden WebView in Flutter that shared cookies/session with the main WebView and kept navigating pages ahead to pre-process them.</p>
<p>The reason is that the Kindle website only loads one page at a time. It doesn&rsquo;t load everything at once or in chunks, just one page. So you can&rsquo;t process pages forward or backward. When the server was still slow at processing pages, I wanted to keep everything pre-cached, so when the user turned the page the translation was already there.</p>
<p>Commit <code>a36e1eb</code> (Feb 27) implemented the entire KindlePrefetchManager: 406 lines of Dart, with batched prefetch (3 pages ahead), trusted GDK events for turning the page (isTrusted=true), rate limiting with human pacing, a window with sensitive=FALSE so it wouldn&rsquo;t steal focus.</p>
<p>In the following days, the fixes piled up:</p>
<ul>
<li><code>cfe08eb</code>: improve prefetch reliability and overlay matching</li>
<li><code>4461b5f</code>: harden overlay selection and Kindle recovery</li>
<li><code>f141b2e</code>: stop destroying the background webview on every page turn</li>
</ul>
<p>And in the end (commit <code>2b93e99</code>, March 4), I deleted everything. 657 lines removed, replaced by a simple spinner in the toolbar.</p>
<p>With the move to the RT-DETR model and parallelized bubble translation, the pipeline became &ldquo;almost real-time,&rdquo; loading a page in under 10 seconds, so waiting for a page is no longer a big deal and you can request them one at a time. The prefetch added too much complexity for marginal gain; the right solution was simply to process each page on demand with proper visual feedback.</p>
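<p>The parallelization itself is the boring kind that works: bubbles on a page are independent, so they can be translated concurrently. A hedged sketch with a stand-in translator (the real worker calls Ollama; the function below just fakes it):</p>

```python
from concurrent.futures import ThreadPoolExecutor

def translate_bubble(text: str) -> str:
    # Stand-in for the real network call to Ollama (qwen3:14b); any
    # I/O-bound translator parallelizes the same way.
    return f"[EN] {text}"

bubbles = ["こんにちは", "待て!", "何だと?"]

# Fan the independent bubbles out instead of translating them serially.
with ThreadPoolExecutor(max_workers=4) as pool:
    translations = list(pool.map(translate_bubble, bubbles))

print(translations)
```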
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/frankmd/2026/03/adult2-furigana.jpg" alt="adult2 furigana"  loading="lazy" /></p>
<h2>The numbers<span class="hx:absolute hx:-mt-20" id="the-numbers"></span>
    <a href="#the-numbers" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>In total, the Frank Yomik project today has:</p>
<ul>
<li><strong>111 commits</strong> across 9 days of development (with a 2-day break in the middle)</li>
<li><strong>29,428 lines</strong> of code in 181 files</li>
<li><strong>~13K lines</strong> of Python (processing pipeline + worker)</li>
<li><strong>~6K lines</strong> of Dart (Flutter client)</li>
<li><strong>~3.4K lines</strong> of Go (API server)</li>
<li><strong>345 unit tests</strong> + 34 integration tests</li>
<li>Support for Japanese manga (Kindle) and Korean webtoons (Naver/Webtoons)</li>
</ul>
<p>The pipeline today works like this:</p>
<ol>
<li>The Flutter client (Android or Linux desktop) opens the Kindle or Webtoons site in a WebView</li>
<li>Captures the page image (capturing the image blob on the Kindle page, fetching the <code>&lt;img&gt;</code> for webtoons)</li>
<li>Sends it to the Go API which queues it on Redis Streams with SHA256-based dedup</li>
<li>Python worker processes: RT-DETR-v2 detects bubbles → manga-ocr or EasyOCR extracts text → Ollama qwen3:14b translates → text_renderer renders back into the image</li>
<li>Result comes back over WebSocket, the overlay replaces the original image</li>
</ol>
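<p>The SHA256-based dedup in step 3 can be sketched like this; for illustration, an in-memory set stands in for the Redis-side check, and the key format is my invention:</p>

```python
import hashlib

def page_key(image_bytes: bytes) -> str:
    # Content-addressed key: the same capture always hashes to the
    # same id, no matter how many times the client resends it.
    return "page:" + hashlib.sha256(image_bytes).hexdigest()

seen = set()  # stands in for the dedup check done against Redis

def enqueue(image_bytes: bytes, stream: list) -> bool:
    key = page_key(image_bytes)
    if key in seen:
        return False  # duplicate: already queued or processed
    seen.add(key)
    stream.append((key, image_bytes))
    return True

stream = []
print(enqueue(b"fake-jpeg-bytes", stream))  # True: first sighting
print(enqueue(b"fake-jpeg-bytes", stream))  # False: deduped
```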
<p>For furigana: fugashi (a MeCab wrapper) does morphological analysis and generates the hiragana reading for each kanji. I switched from pykakasi to fugashi because pykakasi doesn&rsquo;t consider sentence context (「人」 became にん instead of ひと).</p>
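<p>The mechanical half of that furigana step is a plain katakana-to-hiragana shift; the fugashi call in the comment assumes the UniDic feature set (<code>word.feature.kana</code>), so treat it as a sketch:</p>

```python
def kata_to_hira(reading: str) -> str:
    # UniDic returns readings in katakana; furigana wants hiragana,
    # which sits exactly 0x60 code points lower in Unicode.
    return "".join(
        chr(ord(ch) - 0x60) if "ァ" <= ch <= "ヶ" else ch
        for ch in reading
    )

print(kata_to_hira("ヒト"))  # -> ひと

# With fugashi installed (plus a dictionary such as unidic-lite), each
# token carries a context-aware reading, which is what fixes
# 人 -> ひと vs にん:
#
#   from fugashi import Tagger
#   for word in Tagger()("人を待つ"):
#       print(word.surface, kata_to_hira(word.feature.kana or word.surface))
```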
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/frankmd/2026/03/adult4-furigana.jpg" alt="adult4 furigana"  loading="lazy" /></p>
<p>If I knew then what I know now — that the RT-DETR-v2 model already existed and solved the detection problem with a confidence threshold — I would have eliminated the entire OpenCV phase. The OCR, translation, rendering and Flutter parts were already reasonably stable. Detection was the bottleneck, and it was exactly the part I could have saved myself, if I&rsquo;d given up sooner.</p>
<h2>What I learned by failing<span class="hx:absolute hx:-mt-20" id="what-i-learned-by-failing"></span>
    <a href="#what-i-learned-by-failing" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>I spent 5 days polishing an OpenCV bubble detector that had 20+ tuning commits, 7 filter layers, 3 detection passes, and that in the end was replaced by a 50-line wrapper around a pretrained model.</p>
<p>It was 1,112 lines deleted in a single commit. But it wasn&rsquo;t wasted time. Those 5 days taught me exactly why heuristics fail in computer vision. I understood the problem, the cascade where touching one threshold breaks another, and it was that understanding that made me recognize when to stop and research alternatives.</p>
<p>And here&rsquo;s where the real role of vibe coding in this story comes in.</p>
<blockquote>
  <p><strong>Prompting doesn&rsquo;t replace thinking.</strong></p>

</blockquote>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/frankmd/2026/03/shounen8-en.jpg" alt="shounen8 en"  loading="lazy" /></p>
<p>I generated code very fast with Claude, but the problem wasn&rsquo;t writing speed, it was the approach. No prompt in the world turns 7 layers of heuristic filters into a robust solution. The right solution was to change the approach completely.</p>
<p>But vibe coding made the failure cheap. In the pre-AI days, those 5 days of OpenCV would have been maybe 2-3 weeks. The cost of being wrong would have been high enough to make the decision to throw it out really hard. With vibe coding, 5 days were discarded without pain because I knew I could rebuild fast. And in fact, the RT-DETR-v2 integration and the restructuring of the entire project were done in <strong>a single day</strong> (March 4, 15 commits).</p>
<p>The question that remains: if I had done the <code>yolo_bubble_detection_plan.md</code> research on day 1 instead of day 5, I&rsquo;d probably have reached the current state in 2 days. The difference between weeks of work and a weekend was a HuggingFace search. Researching before implementing seems obvious in hindsight, but in the heat of the moment the temptation to solve it by hand is strong.</p>
<p>The project is now open source. I initially didn&rsquo;t know if I wanted to open the code, there was a lot of kludge and that 551-line detector that I was embarrassed about. But after the refactoring, the code became clean enough to share. It&rsquo;s the version I would have liked to have built from the start, but that I could only build because I screwed up first.</p>
<p>The biggest headache was the bubble detection and replacement model, but there are several other points I didn&rsquo;t detail: did you notice that webtoons are in color and the bubbles themselves contain art? I had to use an image model to do <strong>in-painting</strong> and have the AI redraw the bubble before placing the text on top.</p>
<p>Another headache I don&rsquo;t think I&rsquo;ll solve: <strong>translation coherence</strong>. Today it translates each bubble in isolation, with no context of the story before or after. Japanese doesn&rsquo;t mark gender the way English does, so the captain was talking about Nami, but the bubble says &ldquo;He&rdquo; instead of &ldquo;She.&rdquo; There&rsquo;s no way to know without reading the preceding text. For this to work better, you&rsquo;d have to feed in part of the previous dialogue, like in a GPT chat, so the model knows which gender to use or, worse, catches puns that appeared volumes earlier and get referenced later (something Oda loves to do). All those subtleties are lost if you only translate one bubble at a time.</p>
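<p>A hypothetical mitigation (not something the app does today) would be to carry a rolling window of recent bubbles into each prompt; all names below are made up for the sketch:</p>

```python
from collections import deque

# Hypothetical sketch -- the app does NOT do this today: keep a rolling
# window of recent bubbles so the model can resolve pronouns.
history = deque(maxlen=8)

def remember(source: str, translation: str) -> None:
    history.append(f"{source} => {translation}")

def build_prompt(bubble_text: str) -> str:
    context = "\n".join(history) or "(start of chapter)"
    return (
        "Earlier dialogue, for pronoun and gender consistency:\n"
        f"{context}\n\n"
        f"Translate this bubble to English:\n{bubble_text}"
    )

remember("ナミはどこだ?", "Where is Nami?")
print(build_prompt("彼女は船にいる"))
```

<p>A window like this can fix pronouns; puns planted volumes earlier would still need something much heavier, like retrieval over the whole series.</p>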
<p>I imagine that&rsquo;s why nobody has done something like this yet, translating in near real time, because for it to really be good, the translation work would be exponential for each chapter further along in the story.</p>
<p>Anyway, the name <strong>Frank Yomik</strong> comes from &ldquo;yomi&rdquo; (読み, reading in Japanese) and &ldquo;ik-da&rdquo; (읽다, to read in Korean). Frank is a nod to a frank, direct translation. The app reads in both languages.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/frankmd/2026/03/shounen9-en.jpg" alt="shounen9 en"  loading="lazy" /></p>
<p>For anyone who wants to try it: the <a href="https://github.com/akitaonrails/FrankYomik" target="_blank" rel="noopener">repository is on GitHub</a>. The server needs a GPU with at least 8GB of VRAM for the detection model + OCR + translation. The Flutter client runs on Android and Linux desktop. And if you, like me, have a stack of Japanese manga you&rsquo;d like to read more fluently — now you can.</p>
]]></content:encoded><category>ai</category><category>vibe-coding</category><category>manga</category><category>flutter</category><category>python</category><category>opencv</category><category>machine-learning</category><category>FrankYomik</category></item><item><title>I Built a Data Mining System for My Influencer Girlfriend — Tips and Tricks</title><link>https://www.akitaonrails.com/en/2026/03/04/data-mining-system-for-my-influencer-girlfriend/</link><guid isPermaLink="true">https://www.akitaonrails.com/en/2026/03/04/data-mining-system-for-my-influencer-girlfriend/</guid><pubDate>Wed, 04 Mar 2026 10:00:00 GMT</pubDate><description>&lt;p&gt;My girlfriend is an influencer in games, anime, cosplay and pop culture. She does interviews, covers conventions, produces content for several platforms, negotiates with sponsors. It&amp;rsquo;s a one-person professional operation with a lean team. And like every professional operation, she needs data.&lt;/p&gt;
&lt;p&gt;The problem is that gathering data from social networks is grunt work. Open Instagram, YouTube, X, look at numbers, compare to competitors, check the events calendar, monitor sponsors, read comments, calculate engagement, decide how much to charge for a sponsored post. All manual, all repetitive, all eating time that should go to creating content.&lt;/p&gt;</description><content:encoded><![CDATA[<p>My girlfriend is an influencer in games, anime, cosplay and pop culture. She does interviews, covers conventions, produces content for several platforms, negotiates with sponsors. It&rsquo;s a one-person professional operation with a lean team. And like every professional operation, she needs data.</p>
<p>The problem is that gathering data from social networks is grunt work. Open Instagram, YouTube, X, look at numbers, compare to competitors, check the events calendar, monitor sponsors, read comments, calculate engagement, decide how much to charge for a sponsored post. All manual, all repetitive, all eating time that should go to creating content.</p>
<p>I did what any programmer boyfriend would do: I built a full data mining system, with automated collection, LLM-driven analysis, and a Discord chatbot where she talks to the data in plain language.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/frankmd/2026/03/screenshot-2026-03-04_13-15-10.jpg" alt="mila-bot"  loading="lazy" /></p>
<p>This article isn&rsquo;t about the code itself (the project won&rsquo;t be open source). It&rsquo;s about the build process, the decisions that only show up after you start, the technical tricks that saved the project, and why a wishlist document from your user is worth more than any functional spec.</p>
<h2>Start with the Wish, Not the Architecture<span class="hx:absolute hx:-mt-20" id="start-with-the-wish-not-the-architecture"></span>
    <a href="#start-with-the-wish-not-the-architecture" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>The first thing I did was sit down with her and ask her to talk freely about what she wanted. No technical jargon, no form, no user stories. Her own words. I recorded it in an IDEA.md file that became the north star for the whole project.</p>
<p>What came out of it:</p>
<blockquote>
  <p>&ldquo;My biggest difficulty today is figuring out what kind of content to make that can really take off and why the ones that worked actually worked, what made them get the result they got. I have to read the comments on the videos to get a general sense of why.&rdquo;</p>

</blockquote>
<blockquote>
  <p>&ldquo;How much I should/could charge for a sponsored post, based on engagement research, competition, sponsors, events.&rdquo;</p>

</blockquote>
<blockquote>
  <p>&ldquo;How to make a sponsored post that doesn&rsquo;t look like a sponsored post — paid content that brings engagement without looking like just a sales pitch.&rdquo;</p>

</blockquote>
<p>She didn&rsquo;t ask for a dashboard. She didn&rsquo;t ask for charts. She didn&rsquo;t ask for PDF reports. She asked for answers to concrete day-to-day problems. That completely changed how I approached the architecture.</p>
<p>If I had started from a technical spec, I&rsquo;d have built an analytics platform with pretty charts and CSV exports. The system she actually needed was something more like an assistant that knows the data and answers questions in Portuguese.</p>
<p>The IDEA.md also listed initial competitors and sponsors. Brands she could work with (themed restaurants, figure shops, Crunchyroll, Piticas, Fanlab). Reference profiles on Instagram. All of it became seed data. The document wasn&rsquo;t a spec — it was a conversation that turned into organic requirements.</p>
<h2>58 Commits in 3 Days<span class="hx:absolute hx:-mt-20" id="58-commits-in-3-days"></span>
    <a href="#58-commits-in-3-days" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>The project was built in 3 days. Not 3 weeks, not 3 months. 58 commits, 3 distinct phases:</p>
<p><strong>Day 1 — The Data Engine.</strong> 12 commits. Rails 8 scaffold, models, collectors for YouTube/Instagram/X, LLM integration. Discovery pipeline to find new profiles automatically. Per-post performance scoring. Sentiment analysis of comments via Claude. By the end of the day, the system was collecting data from all 3 platforms and classifying every post as viral, above average, average, below or flop.</p>
<p><strong>Day 2 — The Brain and the Voice.</strong> 35 commits. The most intense day. Two big subsystems showed up: the &ldquo;Oracle&rdquo; (tracking events, conventions, game/movie/anime releases, news) and the Discord chatbot with tool calling via RubyLLM. Also production hardening, Docker, deploy guide. And a big pivot: I swapped the weekly email report for 5 daily Discord digests.</p>
<p><strong>Day 3 — Entertainment and Resilience.</strong> 11 commits. Steam games (Store API + SteamSpy), AniList for tracking seasonal anime, growth analytics, image generation via Gemini, an auto-healing system for broken URLs.</p>
<p>The final numbers:</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Files</td>
          <td>430</td>
      </tr>
      <tr>
          <td>Total lines</td>
          <td>37,088</td>
      </tr>
      <tr>
          <td>Lines of Ruby</td>
          <td>25,562</td>
      </tr>
      <tr>
          <td>Tests</td>
          <td>916 (0 failures)</td>
      </tr>
      <tr>
          <td>Models</td>
          <td>17</td>
      </tr>
      <tr>
          <td>Scheduled jobs</td>
          <td>25+</td>
      </tr>
      <tr>
          <td>Bot tools</td>
          <td>40+</td>
      </tr>
      <tr>
          <td>YAML prompts</td>
          <td>23</td>
      </tr>
      <tr>
          <td>External integrations</td>
          <td>12+ APIs</td>
      </tr>
  </tbody>
</table>
<p>These numbers aren&rsquo;t to my credit. Claude wrote a good chunk of the code. But the direction, the architectural decisions, and especially the validation of every step were human. Claude doesn&rsquo;t know what a Brazilian cosplay influencer needs. Neither did I — but she told me.</p>
<h2>The Architecture That Emerged<span class="hx:absolute hx:-mt-20" id="the-architecture-that-emerged"></span>
    <a href="#the-architecture-that-emerged" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>The final system is a headless Rails 8. Zero web interface. No views, no real controllers (just <code>/up</code> for the health check). All functionality is delivered via background jobs, rake tasks, and the Discord chatbot.</p>
<p>The stack:</p>
<ul>
<li><strong>Rails 8.1</strong> with Solid Queue (jobs) and Solid Cache</li>
<li><strong>SQLite3</strong> in production (WAL mode, bind mount between containers)</li>
<li><strong>RubyLLM</strong> for integration with Claude Sonnet via OpenRouter and Grok</li>
<li><strong>Ferrum</strong> for scraping with headless Chrome</li>
<li><strong>Discordrb</strong> for the chatbot</li>
<li><strong>Docker Compose</strong> with 4 services</li>
</ul>
<p>The <code>docker-compose.yml</code> ended up lean:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">x-app</span><span class="p">:</span><span class="w"> </span><span class="cp">&amp;app</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">build</span><span class="p">:</span><span class="w"> </span><span class="l">.</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">env_file</span><span class="p">:</span><span class="w"> </span><span class="l">.env</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">restart</span><span class="p">:</span><span class="w"> </span><span class="l">unless-stopped</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">volumes</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span>- <span class="l">./data/storage:/rails/storage</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="nt">services</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">app</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">&lt;&lt;</span><span class="p">:</span><span class="w"> </span><span class="cp">*app</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">ports</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span>- <span class="s2">&#34;127.0.0.1:3000:80&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">healthcheck</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">test</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">&#34;CMD&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;curl&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;-f&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;http://localhost:80/up&#34;</span><span class="p">]</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">deploy</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">resources</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">limits</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">          </span><span class="nt">memory</span><span class="p">:</span><span class="w"> </span><span class="l">512M</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">jobs</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">&lt;&lt;</span><span class="p">:</span><span class="w"> </span><span class="cp">*app</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">command</span><span class="p">:</span><span class="w"> </span><span class="l">bundle exec rake solid_queue:start</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">depends_on</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">app</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">condition</span><span class="p">:</span><span class="w"> </span><span class="l">service_healthy</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">deploy</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">resources</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">limits</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">          </span><span class="nt">memory</span><span class="p">:</span><span class="w"> </span><span class="l">1G</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">discord</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">&lt;&lt;</span><span class="p">:</span><span class="w"> </span><span class="cp">*app</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">command</span><span class="p">:</span><span class="w"> </span><span class="l">bundle exec rake discord:start</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">depends_on</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">app</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">condition</span><span class="p">:</span><span class="w"> </span><span class="l">service_healthy</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">chrome</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">image</span><span class="p">:</span><span class="w"> </span><span class="l">chromedp/headless-shell:stable</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">deploy</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">resources</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">limits</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">          </span><span class="nt">memory</span><span class="p">:</span><span class="w"> </span><span class="l">1G</span></span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>4 containers: app (Puma, basically just the health check), jobs (Solid Queue running 25+ scheduled jobs), discord (the bot), and chrome (headless browser for scraping). All state lives in SQLite via a bind mount on the host. No Redis, no Postgres, no extra infrastructure.</p>
<h2>Idempotency Above All<span class="hx:absolute hx:-mt-20" id="idempotency-above-all"></span>
    <a href="#idempotency-above-all" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>If I had to pick one concept to summarize this project, it would be <strong>idempotency</strong>. Every job can run twice in a row without creating duplicates, without corrupting data, without side effects.</p>
<p>The base pattern:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ruby" data-lang="ruby"><span class="line"><span class="cl"><span class="k">module</span> <span class="nn">Collection</span>
</span></span><span class="line"><span class="cl">  <span class="k">class</span> <span class="nc">BaseCollectorJob</span> <span class="o">&lt;</span> <span class="no">ApplicationJob</span>
</span></span><span class="line"><span class="cl">    <span class="no">SNAPSHOT_DEDUP_WINDOW</span> <span class="o">=</span> <span class="mi">1</span><span class="o">.</span><span class="n">hour</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">retry_on</span> <span class="no">StandardError</span><span class="p">,</span> <span class="ss">wait</span><span class="p">:</span> <span class="ss">:polynomially_longer</span><span class="p">,</span> <span class="ss">attempts</span><span class="p">:</span> <span class="mi">3</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">perform</span><span class="p">(</span><span class="ss">profile_id</span><span class="p">:</span> <span class="kp">nil</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">      <span class="n">scope</span> <span class="o">=</span> <span class="n">profile_id</span> <span class="p">?</span> <span class="no">SocialProfile</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="nb">id</span><span class="p">:</span> <span class="n">profile_id</span><span class="p">)</span> <span class="p">:</span> <span class="n">profiles_to_collect</span>
</span></span><span class="line"><span class="cl">      <span class="n">scope</span><span class="o">.</span><span class="n">find_each</span> <span class="k">do</span> <span class="o">|</span><span class="n">profile</span><span class="o">|</span>
</span></span><span class="line"><span class="cl">        <span class="n">log</span> <span class="o">=</span> <span class="n">find_or_create_log</span><span class="p">(</span><span class="n">profile</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="k">begin</span>
</span></span><span class="line"><span class="cl">          <span class="n">items</span> <span class="o">=</span> <span class="n">collect_for_profile</span><span class="p">(</span><span class="n">profile</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">          <span class="n">profile</span><span class="o">.</span><span class="n">touch</span><span class="p">(</span><span class="ss">:last_collected_at</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">          <span class="n">log</span><span class="o">.</span><span class="n">update!</span><span class="p">(</span><span class="ss">status</span><span class="p">:</span> <span class="ss">:completed</span><span class="p">,</span> <span class="ss">items_collected</span><span class="p">:</span> <span class="n">items</span> <span class="o">||</span> <span class="mi">0</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="k">rescue</span> <span class="o">=&gt;</span> <span class="n">e</span>
</span></span><span class="line"><span class="cl">          <span class="n">log</span><span class="o">.</span><span class="n">update!</span><span class="p">(</span><span class="ss">status</span><span class="p">:</span> <span class="ss">:failed</span><span class="p">,</span> <span class="ss">error_message</span><span class="p">:</span> <span class="s2">&#34;</span><span class="si">#{</span><span class="n">e</span><span class="o">.</span><span class="n">class</span><span class="si">}</span><span class="s2">: </span><span class="si">#{</span><span class="n">e</span><span class="o">.</span><span class="n">message</span><span class="si">}</span><span class="s2">&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">          <span class="k">raise</span> <span class="k">if</span> <span class="n">should_raise?</span><span class="p">(</span><span class="n">e</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="k">end</span>
</span></span><span class="line"><span class="cl">      <span class="k">end</span>
</span></span><span class="line"><span class="cl">    <span class="k">end</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">upsert_post</span><span class="p">(</span><span class="n">profile</span><span class="p">,</span> <span class="n">attrs</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">      <span class="n">post</span> <span class="o">=</span> <span class="n">profile</span><span class="o">.</span><span class="n">social_posts</span><span class="o">.</span><span class="n">find_or_initialize_by</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="ss">platform_post_id</span><span class="p">:</span> <span class="n">attrs</span><span class="o">[</span><span class="ss">:platform_post_id</span><span class="o">]</span>
</span></span><span class="line"><span class="cl">      <span class="p">)</span>
</span></span><span class="line"><span class="cl">      <span class="n">post</span><span class="o">.</span><span class="n">assign_attributes</span><span class="p">(</span><span class="n">attrs</span><span class="o">.</span><span class="n">except</span><span class="p">(</span><span class="ss">:platform_post_id</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">      <span class="n">post</span><span class="o">.</span><span class="n">save!</span>
</span></span><span class="line"><span class="cl">      <span class="n">post</span>
</span></span><span class="line"><span class="cl">    <span class="k">end</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">record_snapshot</span><span class="p">(</span><span class="n">profile</span><span class="p">,</span> <span class="n">metrics</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">      <span class="k">return</span> <span class="k">if</span> <span class="n">profile</span><span class="o">.</span><span class="n">profile_snapshots</span>
</span></span><span class="line"><span class="cl">        <span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s2">&#34;captured_at &gt; ?&#34;</span><span class="p">,</span> <span class="no">SNAPSHOT_DEDUP_WINDOW</span><span class="o">.</span><span class="n">ago</span><span class="p">)</span><span class="o">.</span><span class="n">exists?</span>
</span></span><span class="line"><span class="cl">      <span class="n">profile</span><span class="o">.</span><span class="n">profile_snapshots</span><span class="o">.</span><span class="n">create!</span><span class="p">(</span><span class="ss">captured_at</span><span class="p">:</span> <span class="no">Time</span><span class="o">.</span><span class="n">current</span><span class="p">,</span> <span class="o">**</span><span class="n">metrics</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">end</span>
</span></span><span class="line"><span class="cl">  <span class="k">end</span>
</span></span><span class="line"><span class="cl"><span class="k">end</span></span></span></code></pre></div></div>
</div>
<p>Two protection mechanisms: <code>find_or_initialize_by(platform_post_id)</code> for posts (upsert, not insert), and <code>SNAPSHOT_DEDUP_WINDOW</code> for snapshots (skip if we already collected one in the last hour). If the job crashes mid-run and Solid Queue requeues it, nothing duplicates.</p>
<p>Scraping errors (<code>BlockedError</code>, <code>RateLimitError</code>) get swallowed silently — retrying doesn&rsquo;t help, the rate limit needs to cool off, and getting blocked is expected. Real errors (database, network, bugs) bubble up and get retried with polynomial backoff.</p>
<p>This decision to &ldquo;swallow certain errors&rdquo; feels wrong until you run the system for a week and see that Instagram blocks one in every 5 collection runs. If every block triggered three retries with polynomial backoff, the system would be perpetually behind.</p>
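<p>The <code>should_raise?</code> predicate isn&rsquo;t shown above; a minimal sketch of the idea, with the error classes and constant name as assumptions of mine:</p>

```ruby
# Hypothetical error classes for illustration; the real ones
# live in the scraper layer.
class BlockedError < StandardError; end
class RateLimitError < StandardError; end

# Errors that are "expected" in scraping: retrying won't help,
# so they get logged on the collection log and swallowed.
SWALLOWED_ERRORS = [BlockedError, RateLimitError].freeze

def should_raise?(error)
  # Real failures (DB, network, bugs) re-raise so Solid Queue
  # can retry the job with polynomial backoff.
  SWALLOWED_ERRORS.none? { |klass| error.is_a?(klass) }
end
```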
<h2>nil vs Zero<span class="hx:absolute hx:-mt-20" id="nil-vs-zero"></span>
    <a href="#nil-vs-zero" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>This is one of those conceptual bugs you only discover with real data. When Instagram doesn&rsquo;t return the share count for a post (because the API simply doesn&rsquo;t expose that field), should the value be <code>nil</code> or <code>0</code>?</p>
<p>The difference matters. <code>nil</code> means <em>&ldquo;we don&rsquo;t have this data&rdquo;</em>. Zero means <em>&ldquo;we have the data and it&rsquo;s zero&rdquo;</em>. If you treat <code>nil</code> as zero, the LLM analysis concludes that nobody shares posts on Instagram — which is false. The API just doesn&rsquo;t expose that metric.</p>
<p>I built a reusable prompt snippet just for this:</p>
<blockquote>
  <p>When a field is null: don&rsquo;t say &ldquo;0 likes&rdquo; — say &ldquo;data not available for this platform&rdquo;. When comparing platforms, warn that certain metrics aren&rsquo;t comparable. When it&rsquo;s zero: report it normally — the data is real and confirmed by the API.</p>

</blockquote>
<p>That snippet gets included in every prompt that handles numeric data. Without it, the LLM happily mixes &ldquo;not collected&rdquo; with &ldquo;actually zero&rdquo; and draws wrong conclusions.</p>
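<p>A toy example of how the coercion skews an average, under the nil-means-missing reading:</p>

```ruby
# Share counts for three posts; the platform didn't expose the
# metric for the second one.
shares = [40, nil, 20]

# Wrong: coercing nil to 0 drags the average down.
wrong_mean = shares.map(&:to_i).sum / shares.size.to_f

# Right: average only over values the API actually returned.
present = shares.compact
right_mean = present.sum / present.size.to_f
```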
<h2>The Headless Chrome Trick: Host Header Bypass<span class="hx:absolute hx:-mt-20" id="the-headless-chrome-trick-host-header-bypass"></span>
    <a href="#the-headless-chrome-trick-host-header-bypass" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>The project uses <code>chromedp/headless-shell</code> in a separate container for scraping. Works perfectly. Until you try to connect Ferrum (the Ruby Chrome automation gem) to it over Docker networking.</p>
<p>The problem: <code>chromedp/headless-shell</code> uses a socat proxy on port 9222 that rejects any request whose <code>Host</code> header isn&rsquo;t <code>localhost</code> or an IP. When Ferrum tries to connect to <code>http://chrome:9222</code>, the Host header goes out as <code>chrome:9222</code>, and socat refuses.</p>
<p>The fix was to discover the WebSocket URL manually, forging the header:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ruby" data-lang="ruby"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">discover_ws_url</span><span class="p">(</span><span class="n">chrome_url</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="n">uri</span> <span class="o">=</span> <span class="no">URI</span><span class="o">.</span><span class="n">parse</span><span class="p">(</span><span class="s2">&#34;</span><span class="si">#{</span><span class="n">chrome_url</span><span class="si">}</span><span class="s2">/json/version&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="n">req</span> <span class="o">=</span> <span class="no">Net</span><span class="o">::</span><span class="no">HTTP</span><span class="o">::</span><span class="no">Get</span><span class="o">.</span><span class="n">new</span><span class="p">(</span><span class="n">uri</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="n">req</span><span class="o">[</span><span class="s2">&#34;Host&#34;</span><span class="o">]</span> <span class="o">=</span> <span class="s2">&#34;localhost&#34;</span>  <span class="c1"># bypass</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">  <span class="n">response</span> <span class="o">=</span> <span class="no">Net</span><span class="o">::</span><span class="no">HTTP</span><span class="o">.</span><span class="n">start</span><span class="p">(</span><span class="n">uri</span><span class="o">.</span><span class="n">hostname</span><span class="p">,</span> <span class="n">uri</span><span class="o">.</span><span class="n">port</span><span class="p">)</span> <span class="p">{</span> <span class="o">|</span><span class="n">http</span><span class="o">|</span> <span class="n">http</span><span class="o">.</span><span class="n">request</span><span class="p">(</span><span class="n">req</span><span class="p">)</span> <span class="p">}</span>
</span></span><span class="line"><span class="cl">  <span class="n">data</span> <span class="o">=</span> <span class="no">JSON</span><span class="o">.</span><span class="n">parse</span><span class="p">(</span><span class="n">response</span><span class="o">.</span><span class="n">body</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">  <span class="n">ws_url</span> <span class="o">=</span> <span class="n">data</span><span class="o">[</span><span class="s2">&#34;webSocketDebuggerUrl&#34;</span><span class="o">]</span>
</span></span><span class="line"><span class="cl">  <span class="k">return</span> <span class="kp">nil</span> <span class="k">unless</span> <span class="n">ws_url</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">  <span class="c1"># WS URL points to 127.0.0.1 inside the container — swap it for the reachable hostname</span>
</span></span><span class="line"><span class="cl">  <span class="n">remote_uri</span> <span class="o">=</span> <span class="no">URI</span><span class="o">.</span><span class="n">parse</span><span class="p">(</span><span class="n">chrome_url</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="n">ws_url</span><span class="o">.</span><span class="n">sub</span><span class="p">(</span><span class="sr">%r{://[^/]+}</span><span class="p">,</span> <span class="s2">&#34;://</span><span class="si">#{</span><span class="n">remote_uri</span><span class="o">.</span><span class="n">host</span><span class="si">}</span><span class="s2">:</span><span class="si">#{</span><span class="n">remote_uri</span><span class="o">.</span><span class="n">port</span><span class="si">}</span><span class="s2">&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="k">end</span></span></span></code></pre></div></div>
</div>
<p>I do the GET to <code>/json/version</code> with <code>Host: localhost</code> to slip past socat. I get the WebSocket URL back (which points at <code>127.0.0.1</code> — useless from outside the container). I rewrite the hostname to <code>chrome:9222</code> (the service name in Docker Compose). I hand the <code>ws_url</code> straight to Ferrum, which opens the WebSocket without going through socat&rsquo;s HTTP layer.</p>
<p>This kind of thing doesn&rsquo;t come to mind upfront. You discover it after an hour of <code>connection refused</code> and reading the source code of <code>chromedp/headless-shell</code>.</p>
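<p>The hostname swap in that last line is easy to get wrong, so here it is isolated as a tiny testable helper (the helper name is mine, not from the project). The result goes straight into Ferrum&rsquo;s <code>ws_url:</code> option when attaching to an already-running browser:</p>

```ruby
require "uri"

# Swap the unreachable 127.0.0.1 host that Chrome reports for
# the hostname/port we can actually reach (the Docker Compose
# service name), keeping the rest of the WebSocket path intact.
def rewrite_ws_host(ws_url, chrome_url)
  remote = URI.parse(chrome_url)
  ws_url.sub(%r{://[^/]+}, "://#{remote.host}:#{remote.port}")
end
```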
<h2>Tool Calling: The Bot That Queries the Database on Its Own<span class="hx:absolute hx:-mt-20" id="tool-calling-the-bot-that-queries-the-database-on-its-own"></span>
    <a href="#tool-calling-the-bot-that-queries-the-database-on-its-own" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>The coolest part of the project is the Discord chatbot. She types a question in Portuguese (&ldquo;how were my posts this week?&rdquo;) and the bot calls the right tools, queries the database, and answers with real data.</p>
<p>The secret is RubyLLM&rsquo;s tool calling. Every tool is a Ruby class with <code>description</code> and <code>params</code>:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ruby" data-lang="ruby"><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">PostPerformanceTool</span> <span class="o">&lt;</span> <span class="no">RubyLLM</span><span class="o">::</span><span class="no">Tool</span>
</span></span><span class="line"><span class="cl">  <span class="n">description</span> <span class="s2">&#34;Returns post performance stats: baseline, &#34;</span> <span class="p">\</span>
</span></span><span class="line"><span class="cl">              <span class="s2">&#34;breakdown by content type, best times, best hashtags.&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">  <span class="n">param</span> <span class="ss">:username</span><span class="p">,</span> <span class="ss">type</span><span class="p">:</span> <span class="ss">:string</span><span class="p">,</span> <span class="ss">desc</span><span class="p">:</span> <span class="s2">&#34;Profile username (e.g. milaoliveira.png)&#34;</span>
</span></span><span class="line"><span class="cl">  <span class="n">param</span> <span class="ss">:analysis</span><span class="p">,</span> <span class="ss">type</span><span class="p">:</span> <span class="ss">:string</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="ss">desc</span><span class="p">:</span> <span class="s2">&#34;Type: baseline, content_types, timing, hashtags&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="ss">required</span><span class="p">:</span> <span class="kp">false</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">  <span class="k">def</span> <span class="nf">execute</span><span class="p">(</span><span class="ss">username</span><span class="p">:,</span> <span class="ss">analysis</span><span class="p">:</span> <span class="s2">&#34;baseline&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">profile</span> <span class="o">=</span> <span class="no">SocialProfile</span><span class="o">.</span><span class="n">find_by</span><span class="p">(</span><span class="ss">username</span><span class="p">:</span> <span class="n">username</span><span class="o">.</span><span class="n">to_s</span><span class="o">.</span><span class="n">strip</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="s2">&#34;Profile &#39;</span><span class="si">#{</span><span class="n">username</span><span class="si">}</span><span class="s2">&#39; not found.&#34;</span> <span class="k">unless</span> <span class="n">profile</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">case</span> <span class="n">analysis</span><span class="o">.</span><span class="n">to_s</span><span class="o">.</span><span class="n">strip</span><span class="o">.</span><span class="n">downcase</span>
</span></span><span class="line"><span class="cl">    <span class="k">when</span> <span class="s2">&#34;baseline&#34;</span>
</span></span><span class="line"><span class="cl">      <span class="n">baseline</span> <span class="o">=</span> <span class="no">PostPerformance</span><span class="o">.</span><span class="n">compute_baseline</span><span class="p">(</span><span class="n">profile</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">      <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="ss">username</span><span class="p">:</span> <span class="n">profile</span><span class="o">.</span><span class="n">username</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="ss">post_count</span><span class="p">:</span> <span class="n">baseline</span><span class="o">[</span><span class="ss">:post_count</span><span class="o">]</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="ss">mean_views</span><span class="p">:</span> <span class="n">baseline</span><span class="o">[</span><span class="ss">:mean_views</span><span class="o">].</span><span class="n">round</span><span class="p">(</span><span class="mi">0</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">        <span class="ss">stddev_views</span><span class="p">:</span> <span class="n">baseline</span><span class="o">[</span><span class="ss">:stddev_views</span><span class="o">].</span><span class="n">round</span><span class="p">(</span><span class="mi">0</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">        <span class="ss">mean_engagement</span><span class="p">:</span> <span class="n">baseline</span><span class="o">[</span><span class="ss">:mean_engagement</span><span class="o">].</span><span class="n">round</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">      <span class="p">}</span>
</span></span><span class="line"><span class="cl">    <span class="k">when</span> <span class="s2">&#34;timing&#34;</span>
</span></span><span class="line"><span class="cl">      <span class="no">PostPerformance</span><span class="o">.</span><span class="n">timing_breakdown</span><span class="p">(</span><span class="n">profile</span><span class="p">)</span><span class="o">.</span><span class="n">first</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span><span class="o">.</span><span class="n">map</span> <span class="k">do</span> <span class="o">|</span><span class="n">entry</span><span class="o">|</span>
</span></span><span class="line"><span class="cl">        <span class="p">{</span> <span class="ss">day</span><span class="p">:</span> <span class="n">days</span><span class="o">[</span><span class="n">entry</span><span class="o">[</span><span class="ss">:day</span><span class="o">]]</span><span class="p">,</span> <span class="ss">hour</span><span class="p">:</span> <span class="s2">&#34;%02d:00&#34;</span> <span class="o">%</span> <span class="n">entry</span><span class="o">[</span><span class="ss">:hour</span><span class="o">]</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">          <span class="ss">avg_percentile</span><span class="p">:</span> <span class="n">entry</span><span class="o">[</span><span class="ss">:avg_percentile</span><span class="o">].</span><span class="n">round</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="p">}</span>
</span></span><span class="line"><span class="cl">      <span class="k">end</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># ...</span>
</span></span><span class="line"><span class="cl">    <span class="k">end</span>
</span></span><span class="line"><span class="cl">  <span class="k">end</span>
</span></span><span class="line"><span class="cl"><span class="k">end</span></span></span></code></pre></div></div>
</div>
<p>The LLM gets the list of 40+ tools with their descriptions, decides which ones to call with which parameters, gets the results back, and formulates the answer. The Discord bot shows real-time status while it works: <em>&ldquo;Querying profile metrics&hellip; Checking growth&hellip; Analyzing post performance&hellip; (3 queries)&rdquo;</em>.</p>
<p>The trick was making each tool return structured data (Hashes and Arrays), not formatted text. The LLM is much better at formatting raw data into a conversational reply than at parsing semi-structured text.</p>
<p>Another detail that only showed up in real use: clamping parameters. When the LLM asks for <code>limit: 999</code> on a parameter that maxes out at 50, instead of returning an error, I do <code>[[val.to_i, 1].max, 50].min</code>. The LLM gets parameters wrong more than you&rsquo;d think. Every error generates a correction round-trip that costs tokens and time.</p>
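<p>The clamp is trivial but worth making a habit of; a sketch of the helper (the name is mine):</p>

```ruby
# Clamp an LLM-supplied parameter into a sane range instead of
# erroring: a correction round-trip costs tokens and latency.
def clamp_limit(val, min: 1, max: 50)
  [[val.to_i, min].max, max].min
end
```

<p>Ruby&rsquo;s built-in <code>Comparable#clamp</code> (<code>val.to_i.clamp(1, 50)</code>) does the same thing in one call.</p>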
<h2>Composable Prompts: YAML &gt; Hardcoded Strings<span class="hx:absolute hx:-mt-20" id="composable-prompts-yaml--hardcoded-strings"></span>
    <a href="#composable-prompts-yaml--hardcoded-strings" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Every prompt in the system lives in YAML under <code>config/prompts/</code>. Each one has a <code>system</code> (fixed instructions), a <code>template</code> (with data interpolation), and optionally <code>includes</code> (reusable snippets):</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ruby" data-lang="ruby"><span class="line"><span class="cl"><span class="k">module</span> <span class="nn">Llm</span>
</span></span><span class="line"><span class="cl">  <span class="k">class</span> <span class="nc">PromptBuilder</span>
</span></span><span class="line"><span class="cl">    <span class="no">PROMPTS_DIR</span> <span class="o">=</span> <span class="no">Rails</span><span class="o">.</span><span class="n">root</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="s2">&#34;config&#34;</span><span class="p">,</span> <span class="s2">&#34;prompts&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">class</span> <span class="o">&lt;&lt;</span> <span class="nb">self</span>
</span></span><span class="line"><span class="cl">      <span class="k">def</span> <span class="nf">build</span><span class="p">(</span><span class="nb">name</span><span class="p">,</span> <span class="n">data</span> <span class="o">=</span> <span class="p">{})</span>
</span></span><span class="line"><span class="cl">        <span class="n">prompt</span> <span class="o">=</span> <span class="n">load_prompt</span><span class="p">(</span><span class="nb">name</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">base</span> <span class="o">=</span> <span class="n">load_prompt</span><span class="p">(</span><span class="ss">:_base_context</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="n">includes</span> <span class="o">=</span> <span class="n">resolve_includes</span><span class="p">(</span><span class="n">prompt</span><span class="o">[</span><span class="s2">&#34;includes&#34;</span><span class="o">]</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">system_parts</span> <span class="o">=</span> <span class="o">[</span><span class="n">base</span><span class="o">[</span><span class="s2">&#34;system&#34;</span><span class="o">]</span><span class="p">,</span> <span class="n">includes</span><span class="p">,</span> <span class="n">prompt</span><span class="o">[</span><span class="s2">&#34;system&#34;</span><span class="o">]]</span>
</span></span><span class="line"><span class="cl">          <span class="o">.</span><span class="n">compact</span><span class="o">.</span><span class="n">reject</span><span class="p">(</span><span class="o">&amp;</span><span class="ss">:blank?</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">system_message</span> <span class="o">=</span> <span class="n">system_parts</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="s2">&#34;</span><span class="se">\n\n</span><span class="s2">&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">user_message</span> <span class="o">=</span> <span class="n">interpolate</span><span class="p">(</span><span class="n">prompt</span><span class="o">[</span><span class="s2">&#34;template&#34;</span><span class="o">]</span><span class="p">,</span> <span class="n">data</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="p">{</span> <span class="nb">system</span><span class="p">:</span> <span class="n">system_message</span><span class="p">,</span> <span class="ss">user</span><span class="p">:</span> <span class="n">user_message</span> <span class="p">}</span>
</span></span><span class="line"><span class="cl">      <span class="k">end</span>
</span></span><span class="line"><span class="cl">    <span class="k">end</span>
</span></span><span class="line"><span class="cl">  <span class="k">end</span>
</span></span><span class="line"><span class="cl"><span class="k">end</span></span></span></code></pre></div></div>
</div>
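<p>The <code>interpolate</code> helper isn&rsquo;t shown; a minimal sketch assuming <code>%{key}</code>-style placeholders (the placeholder syntax is my assumption, not from the original):</p>

```ruby
# Replace %{key} placeholders with values from the data hash,
# leaving unknown placeholders intact rather than raising.
def interpolate(template, data)
  template.gsub(/%\{(\w+)\}/) do
    key = Regexp.last_match(1).to_sym
    data.key?(key) ? data[key].to_s : Regexp.last_match(0)
  end
end
```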
<p>The <code>_base_context.yml</code> carries info every prompt needs (who the influencer is, niches, audience, general guidelines). The snippets in <code>config/prompts/snippets/</code> solve recurring problems:</p>
<ul>
<li><code>null_vs_zero.yml</code> — the nil/zero distinction explained above</li>
<li><code>never_invent.yml</code> — &ldquo;NEVER invent data, only analyze what&rsquo;s actually present&rdquo;</li>
<li><code>json_only.yml</code> — &ldquo;respond ONLY in valid JSON&rdquo;</li>
</ul>
<p>When a prompt needs any combination of these, you just list them in <code>includes</code>. No duplication, no risk of conflicting instructions across prompts.</p>
<p>This decision came out of a real bug: two different prompts gave contradictory instructions on how to handle null values. One said &ldquo;use 0&rdquo;, the other said &ldquo;use null&rdquo;. The LLM obeyed one or the other depending on the day. Centralizing in snippets killed off an entire class of inconsistency.</p>
<h2>Discovery Pipeline: Autonomous Profile Mining<span class="hx:absolute hx:-mt-20" id="discovery-pipeline-autonomous-profile-mining"></span>
    <a href="#discovery-pipeline-autonomous-profile-mining" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>The initial system started from a manual list of competitors and sponsors. But the influencer lives in a dynamic ecosystem — new creators show up every week, brands appear and disappear. The discovery pipeline automates that.</p>
<p>It runs every Friday at 2 a.m. It starts by mining data we already have in the database: who&rsquo;s being @mentioned in posts, who comments often and with high engagement, which profiles show up in bio links, who uses brand hashtags (#publi, #parceria), which Linktree links point to other creators. All pure SQL, no API calls, no rate-limit risk.</p>
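<p>The mention-mining step can be sketched in plain Ruby over captions we already have stored (method name and shape are mine, for illustration):</p>

```ruby
# Extract @mentions from stored captions and rank candidates by
# how often they appear -- no API calls, no rate-limit risk.
def mention_candidates(captions, top: 5)
  captions
    .flat_map { |text| text.to_s.scan(/@([A-Za-z0-9._]+)/).flatten }
    .map(&:downcase)
    .tally
    .sort_by { |_, count| -count }
    .first(top)
    .map(&:first)
end
```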
<p>The candidates that come out of that get validated against the platforms&rsquo; APIs (does the Instagram exist? Is the YouTube channel real?), then enriched with cross-platform connections (if we validate an Instagram, we look for the YouTube and X of the same creator).</p>
<p>A batch LLM call evaluates all the candidates, classifies each one as competitor, potential sponsor or irrelevant, and assigns relevance and niche-fit scores. The top 3 become actually tracked profiles — the system starts collecting data from them automatically.</p>
<p>After that, it groups everything into <code>Creator</code> entities. The influencer is depth 0. Direct competitors, depth 1. Mentions of competitors, depth 2. Maximum 3 degrees of separation, to avoid infinite cascade.</p>
<h2>From Weekly Report to Daily Digests: Listening to the User<span class="hx:absolute hx:-mt-20" id="from-weekly-report-to-daily-digests-listening-to-the-user"></span>
    <a href="#from-weekly-report-to-daily-digests-listening-to-the-user" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>The original plan was a weekly email report with 9 sections. But midway through I decided it would be more dynamic to never leave Discord at all and keep that information always at hand.</p>
<p>I pivoted to 5 daily digests on Discord:</p>
<ul>
<li><strong>Monday</strong> — Performance Recap (followers, growth, top posts)</li>
<li><strong>Tuesday</strong> — Competitor Radar (snapshot, news, strategies)</li>
<li><strong>Wednesday</strong> — Content Playbook (schedule, trends, hashtags)</li>
<li><strong>Thursday</strong> — Opportunities and Pricing (brands, pricing, packages)</li>
<li><strong>Friday</strong> — Next Week (events, priority actions)</li>
</ul>
<p>Every digest has numbered items with feedback buttons. She marks what she found useful and what she didn&rsquo;t. That feedback feeds future analyses — the system learns which kinds of insight she values.</p>
<p>This change, which wasn&rsquo;t in any plan, is probably what improved her experience the most. Every morning at 9 a.m. she opens Discord and gets a fresh, digestible, actionable summary. The weekly email got demoted to backup, commented out in <code>recurring.yml</code>.</p>
<h2>The Oracle: Context the Algorithm Doesn&rsquo;t See<span class="hx:absolute hx:-mt-20" id="the-oracle-context-the-algorithm-doesnt-see"></span>
    <a href="#the-oracle-context-the-algorithm-doesnt-see" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/frankmd/2026/03/screenshot-2026-03-04_13-24-24.jpg" alt="playbook"  loading="lazy" /></p>
<p>Social media data without external context is kind of useless. <em>&ldquo;This post got 3x more views than average&rdquo;</em> is a fact. <em>&ldquo;This post got 3x more views because it dropped on the day the new Sonic trailer came out&rdquo;</em> is an insight.</p>
<p>The Oracle is the subsystem that provides that context.</p>
<p>The most critical part is convention tracking: CCXP, Anime Friends, BGS, and dozens of other geek calendar events. Dates, prices, venues, ticket links, the event&rsquo;s Instagram/X accounts. It monitors daily for changes (date moved, cancellation, guest announcement). The influencer plans her entire calendar around conventions. If CCXP changes its date, she needs to know that day, not the following week.</p>
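<p>The &ldquo;know it that day&rdquo; requirement is a field-by-field diff between the stored record and the fresh scrape. A sketch with hypothetical field names:</p>

```ruby
# Hypothetical change detector for a tracked convention:
# compare the stored record against today's scrape, field by field.
TRACKED_FIELDS = %i[date venue price status]

def detect_changes(stored, scraped)
  TRACKED_FIELDS.filter_map do |field|
    next if stored[field] == scraped[field]
    { field: field, from: stored[field], to: scraped[field] }
  end
end
```

<p>Any non-empty result gets surfaced in the next digest instead of waiting for the weekly cycle.</p>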
<p>It also tracks movie/series releases via TMDB, games via IGDB (with Twitch OAuth), anime via AniList — all filtered for geek content, with a 90-day forward window. Franchise anniversaries (using TMDB, IGDB, AniList, Wikimedia) and Brazilian holidays with floating-date calculation. Yes, there&rsquo;s an Easter calculator in the code based on the computus algorithm. And news via RSS and X/Twitter scraping every 6 hours, classified by relevance to the niche.</p>
<p>The event tracker scrapes convention sites with headless Chrome and uses the LLM at temperature 0.1 to extract structured metadata (date, price, venue). It has a regex fallback for Brazilian-format dates (&ldquo;15 de março de 2026&rdquo;). Deduplication by fuzzy name matching + date proximity — because &ldquo;CCXP 2026&rdquo;, &ldquo;CCXP São Paulo 2026&rdquo; and &ldquo;CCXP SP&rdquo; are the same event.</p>
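<p>The regex fallback for Brazilian dates is simple enough to sketch in full — this is my reconstruction of the idea, not the project&rsquo;s actual regex:</p>

```ruby
require "date"

# Hypothetical fallback parser for dates like "15 de marco de 2026",
# used when the LLM extraction comes back empty.
MONTHS_PT = %w[janeiro fevereiro março abril maio junho
               julho agosto setembro outubro novembro dezembro]
BR_DATE_RE = /(\d{1,2})\s+de\s+(#{MONTHS_PT.join("|")})\s+de\s+(\d{4})/i

def parse_br_date(text)
  return unless (m = text.match(BR_DATE_RE))
  Date.new(m[3].to_i, MONTHS_PT.index(m[2].downcase) + 1, m[1].to_i)
end
```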
<h2>SQLite in Production: Yes, It Works<span class="hx:absolute hx:-mt-20" id="sqlite-in-production-yes-it-works"></span>
    <a href="#sqlite-in-production-yes-it-works" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>SQLite3 in production. With WAL mode, a single writer at a time is enough because the system has a predictable write pattern (sequential jobs, never concurrent in the same second). The data lives in a Docker bind mount (<code>./data/storage:/rails/storage</code>) and backup is <code>cp data/storage/*.db backups/</code>.</p>
<p>Rails 8 treats SQLite as a first-class citizen. Solid Queue and Solid Cache run on SQLite without issues. Running a Postgres instance for a system that serves one person just isn&rsquo;t worth the overhead.</p>
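<p>For reference, the multi-database layout Rails 8 generates for SQLite — primary, queue, and cache each in their own file under <code>storage/</code> — looks roughly like this (these are the framework defaults, not necessarily this project&rsquo;s exact paths):</p>

```yaml
production:
  primary:
    adapter: sqlite3
    database: storage/production.sqlite3
  queue:
    adapter: sqlite3
    database: storage/production_queue.sqlite3
    migrations_paths: db/queue_migrate
  cache:
    adapter: sqlite3
    database: storage/production_cache.sqlite3
    migrations_paths: db/cache_migrate
```

<p>Rails enables WAL journal mode for its SQLite connections by default, which is what makes the single-writer model workable here.</p>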
<h2>25+ Scheduled Jobs<span class="hx:absolute hx:-mt-20" id="25-scheduled-jobs"></span>
    <a href="#25-scheduled-jobs" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>The <code>config/recurring.yml</code> has 25+ jobs with staggered schedules so the machine doesn&rsquo;t get overloaded:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">production</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">youtube_collection</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">class</span><span class="p">:</span><span class="w"> </span><span class="l">Collection::YoutubeCollectorJob</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">schedule</span><span class="p">:</span><span class="w"> </span><span class="l">every day at 3am</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">instagram_collection</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">class</span><span class="p">:</span><span class="w"> </span><span class="l">Collection::InstagramCollectorJob</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">schedule</span><span class="p">:</span><span class="w"> </span><span class="l">every day at 4am</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">x_collection</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">class</span><span class="p">:</span><span class="w"> </span><span class="l">Collection::XCollectorJob</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">schedule</span><span class="p">:</span><span class="w"> </span><span class="l">every 2 days at 5am</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">oracle_events</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">class</span><span class="p">:</span><span class="w"> </span><span class="l">Oracle::EventTrackerJob</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">schedule</span><span class="p">:</span><span class="w"> </span><span class="l">every Monday at 2am</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">convention_monitor</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">class</span><span class="p">:</span><span class="w"> </span><span class="l">Oracle::ConventionMonitorJob</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">schedule</span><span class="p">:</span><span class="w"> </span><span class="l">every day at 5am</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">discovery_pipeline</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">class</span><span class="p">:</span><span class="w"> </span><span class="l">Discovery::OrchestratorJob</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">schedule</span><span class="p">:</span><span class="w"> </span><span class="l">every Friday at 2am</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">comment_sentiment</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">class</span><span class="p">:</span><span class="w"> </span><span class="l">Analysis::CommentSentimentJob</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">schedule</span><span class="p">:</span><span class="w"> </span><span class="l">every Sunday at 5am</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">weekly_analysis</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">class</span><span class="p">:</span><span class="w"> </span><span class="l">Analysis::WeeklyAnalysisJob</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">schedule</span><span class="p">:</span><span class="w"> </span><span class="l">every Sunday at 6am</span></span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>Collection runs in the early morning (3am-5am). The Oracle intelligence kicks off Monday at 2am. Discovery on Friday at 2am. Heavy analysis (sentiment, strategy) on Sunday when there&rsquo;s no digest to deliver. Digests Monday through Friday at 9am, each day on a different theme. Maintenance (cleaning up finished jobs, URL health checks, aggregating old data) fills the gaps.</p>
<p>The pattern is deliberate: no heavy job competes with the morning digests. If the weekly analysis runs late, Monday&rsquo;s digest still goes out because it uses Saturday&rsquo;s collection data, not Sunday&rsquo;s analysis.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/frankmd/2026/03/screenshot-2026-03-04_13-29-08.jpg" alt="nano banana 2"  loading="lazy" /></p>
<h2>What You Can&rsquo;t Predict Without Running It<span class="hx:absolute hx:-mt-20" id="what-you-cant-predict-without-running-it"></span>
    <a href="#what-you-cant-predict-without-running-it" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Some things I only found out after the system was in production.</p>
<p>Instagram blocks scraping at random. It doesn&rsquo;t matter how stealth your browser is. The initial setup was scraping via Ferrum only. Right off the bat I added <strong>Apify</strong> as the primary source (paid API, more reliable) and the scraping became fallback.</p>
<p>LLMs hallucinate timestamps. I asked the bot to create a reminder for &ldquo;tomorrow at 3pm&rdquo; and it calculated a date in 2024. The fix was to inject <code>current_datetime</code> into every prompt that touches time and instruct it explicitly: <em>&ldquo;calculate based on current_datetime, convert to ISO 8601 with UTC-3&rdquo;</em>.</p>
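<p>The fix amounts to prepending a clock to every time-sensitive prompt. A minimal sketch — the wording of the instruction is illustrative:</p>

```ruby
require "time"

# Hypothetical prompt builder: inject the real current time so the LLM
# has an anchor for "tomorrow", "next Friday", etc.
def time_anchored_prompt(user_request, now: Time.now)
  <<~PROMPT
    current_datetime: #{now.iso8601}
    Calculate relative dates from current_datetime.
    Output timestamps in ISO 8601 with UTC-3 offset.

    User request: #{user_request}
  PROMPT
end
```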
<p>Discord has a 2000-character per-message limit, which seems obvious until your bot sends a 4000-char analysis and Discord just truncates it. I built an automatic split that breaks at the last <code>\n</code> before the limit.</p>
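<p>The splitter itself is a few lines: find the last newline that fits, cut there, repeat. A sketch — the hard-cut handling is my own, not the project&rsquo;s exact code:</p>

```ruby
DISCORD_LIMIT = 2000

# Split a long message into chunks under Discord's per-message limit,
# breaking at the last newline that still fits.
def split_for_discord(text, limit: DISCORD_LIMIT)
  chunks = []
  until text.empty?
    if text.length <= limit
      chunks << text
      break
    end
    cut = text[0, limit].rindex("\n")
    cut = limit if cut.nil? || cut.zero? # no usable newline: hard cut
    chunks << text[0, cut]
    text = text[cut..].sub(/\A\n/, "") # drop the newline we broke on
  end
  chunks
end
```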
<p>Tool calls fail on parameters more often than I expected. The LLM asks for <code>limit: 100</code> when the max is 50, or sends the username with a leading @ when the tool expects it bare. Silent clamping (<code>[[val.to_i, 1].max, 50].min</code>) and input normalization (<code>.strip</code>, <code>.delete(&quot;@&quot;)</code>) on every tool killed off an entire category of errors that were burning tokens in retry loops.</p>
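<p>Both guards fit in a couple of one-liners, applied at the top of every tool (helper names here are hypothetical):</p>

```ruby
# Hypothetical tool-input guards: clamp numeric parameters into range
# and normalize handles instead of erroring back at the LLM.
def clamp_limit(val, max: 50)
  [[val.to_i, 1].max, max].min
end

def normalize_handle(val)
  val.to_s.strip.delete("@")
end
```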
<p>Conventions change dates more than I imagined. I added an announcement system that tracks every change (date, venue, price, cancellation) with a timestamp. The event tracker runs every Monday, but the convention monitor runs DAILY, because the last thing the influencer wants is to find out a convention moved its date after she already booked the hotel.</p>
<h2>The Bot as an Interface<span class="hx:absolute hx:-mt-20" id="the-bot-as-an-interface"></span>
    <a href="#the-bot-as-an-interface" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/frankmd/2026/03/screenshot-2026-03-04_13-28-05.jpg" alt="events"  loading="lazy" /></p>
<p>The smartest decision in the project was not building a web interface. Discord is already where the influencer spends her day. The bot lives there, available any time. She types <em>&ldquo;what should I post this week?&rdquo;</em> and gets a contextualized answer with data from her own posts, from competitors, from the events calendar, from hashtag trends, and from the season&rsquo;s anime/games.</p>
<p>The chatbot combines several tools into one answer. If she asks about content strategy, the LLM calls: content digest + events calendar + hashtag trends + seasonal anime + Steam games. Five database queries, all stitched into a conversational reply in Portuguese.</p>
<p>There&rsquo;s even a <strong>&ldquo;deep thinking&rdquo; mode</strong> that activates when it detects words like &ldquo;research&rdquo;, &ldquo;analyze&rdquo;, &ldquo;investigate&rdquo;. In that mode, the bot uses ALL relevant tools, runs multiple queries, cross-references data across sources.</p>
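<p>The trigger is plain keyword detection on the incoming message. A sketch — the keyword list is illustrative, and the real bot operates in Portuguese:</p>

```ruby
# Hypothetical "deep thinking" trigger: scan the message for research-y verbs.
DEEP_KEYWORDS = %w[research analyze investigate pesquise analise investigue]

def deep_thinking?(message)
  words = message.downcase.scan(/[[:alpha:]]+/)
  DEEP_KEYWORDS.any? { |kw| words.include?(kw) }
end
```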
<h2>Conclusion<span class="hx:absolute hx:-mt-20" id="conclusion"></span>
    <a href="#conclusion" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>The whole project shipped in 3 days, from zero to production, collecting real data, sending out digests every morning. No Jira, no sprint planning, no 6-month discovery phase. An IDEA.md with the user&rsquo;s wishes in her own words, iterative development commit by commit, and continuous validation with the person who&rsquo;s going to actually use it.</p>
<p>If I had started from the architecture, I&rsquo;d have spent weeks designing a dashboard she would never open. Starting from the wish, the system ended up being a Discord chatbot she uses every day.</p>
<p>Async jobs that aren&rsquo;t idempotent are time bombs — the rate limit is going to blow, the scraping is going to fail, the container is going to restart, and if the job can&rsquo;t survive that without duplicating data, the system doesn&rsquo;t work.</p>
<p>In the end, the measure of success for the project isn&rsquo;t lines of code, isn&rsquo;t the number of passing tests, isn&rsquo;t elegant architecture. It&rsquo;s the influencer opening Discord on a Monday morning and having what she needs to plan the week.</p>
]]></content:encoded><category>Ruby on Rails</category><category>Data Mining</category><category>LLM</category><category>Discord</category><category>Docker</category><category>SQLite</category></item><item><title>ai-jail: Sandbox for AI Agents — From Shell Script to Real Tool</title><link>https://www.akitaonrails.com/en/2026/03/01/ai-jail-sandbox-for-ai-agents-from-shell-script-to-real-tool/</link><guid isPermaLink="true">https://www.akitaonrails.com/en/2026/03/01/ai-jail-sandbox-for-ai-agents-from-shell-script-to-real-tool/</guid><pubDate>Sun, 01 Mar 2026 14:00:00 GMT</pubDate><description>&lt;p&gt;This post is a direct follow-up to &lt;a href="https://www.akitaonrails.com/en/2026/01/10/ai-agents-locking-down-your-system/"&gt;AI Agents: Locking Down Your System&lt;/a&gt;, where I showed how to use bubblewrap to build a manual jail for your AI agents. If you haven&amp;rsquo;t read it, read it before continuing.&lt;/p&gt;
&lt;p&gt;&amp;ndash;&lt;/p&gt;
&lt;p&gt;In January I published a ~170-line shell script that built a sandbox with bubblewrap to run Claude Code, OpenCode, Crush and any other AI agent. It worked. It solved the problem. But it was a bash script dropped in &lt;code&gt;~/.local/bin/&lt;/code&gt; that you had to copy, paste, and pray you wouldn&amp;rsquo;t need to customize too much.&lt;/p&gt;</description><content:encoded><![CDATA[<p>This post is a direct follow-up to <a href="/en/2026/01/10/ai-agents-locking-down-your-system/">AI Agents: Locking Down Your System</a>, where I showed how to use bubblewrap to build a manual jail for your AI agents. If you haven&rsquo;t read it, read it before continuing.</p>
<p>&ndash;</p>
<p>In January I published a ~170-line shell script that built a sandbox with bubblewrap to run Claude Code, OpenCode, Crush and any other AI agent. It worked. It solved the problem. But it was a bash script dropped in <code>~/.local/bin/</code> that you had to copy, paste, and pray you wouldn&rsquo;t need to customize too much.</p>
<p>Two months of using that script every day showed me its limits. I wanted per-project configuration. I wanted macOS support for the devs on my team. I wanted to stop editing bash arrays every time I needed an extra directory. And I wanted something someone could install with <code>brew install</code> or <code>cargo install</code> in 10 seconds, without reading 170 lines of script.</p>
<p>Result: <a href="https://github.com/akitaonrails/ai-jail"target="_blank" rel="noopener">ai-jail</a>. A Rust tool, ~880KB, 4 dependencies, 124 tests, that does exactly what that script did and more. I&rsquo;ll explain what changed and why you should be using it.</p>
<h2>The Problem (again, for those who skipped the previous post)<span class="hx:absolute hx:-mt-20" id="the-problem-again-for-those-who-skipped-the-previous-post"></span>
    <a href="#the-problem-again-for-those-who-skipped-the-previous-post" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>AI coding agents need access to your filesystem. They need to run compilers, linters, grep, ls, make, cargo, npm. The minimum to be useful. The problem is that along with that access comes the ability to read <code>~/.aws/credentials</code>, exfiltrate your SSH keys, or run an <code>rm -rf</code> outside the project directory.</p>
<p>It isn&rsquo;t paranoia. Supply-chain attacks are real. Every other week some NPM, PyPI or RubyGems lib gets compromised. If the agent runs <code>npm install</code> and a malicious post-install script tries to exfiltrate your data, the only thing between the attacker and your credentials is whatever barrier you set up beforehand.</p>
<p>The answer is a sandbox. Specifically, one that lets the agent work in the project directory with the tools it needs, but makes the entire rest of the system invisible.</p>
<h2>From Script to Tool<span class="hx:absolute hx:-mt-20" id="from-script-to-tool"></span>
    <a href="#from-script-to-tool" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>The shell script from the previous post already solved this with bubblewrap. ai-jail solves the same problem, but addresses the limitations that two months of daily use revealed:</p>
<table>
  <thead>
      <tr>
          <th>Shell script</th>
          <th>ai-jail</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Configuration by editing bash arrays</td>
          <td>Per-project <code>.ai-jail</code> TOML file</td>
      </tr>
      <tr>
          <td>Linux only</td>
          <td>Linux + macOS</td>
      </tr>
      <tr>
          <td>Hardcoded GPU/Docker/Display</td>
          <td>Auto-detection with flags to turn things off</td>
      </tr>
      <tr>
          <td>No dry-run</td>
          <td><code>--dry-run --verbose</code> shows everything</td>
      </tr>
      <tr>
          <td>No lockdown</td>
          <td><code>--lockdown</code> for paranoid mode</td>
      </tr>
      <tr>
          <td>Copy/paste to install</td>
          <td><code>brew install</code>, <code>cargo install</code>, <code>mise</code></td>
      </tr>
      <tr>
          <td>No bootstrap</td>
          <td><code>--bootstrap</code> generates permission configs for Claude/Codex/OpenCode</td>
      </tr>
  </tbody>
</table>
<p>The core logic is the same: bubblewrap creates isolated PID, UTS and IPC namespaces, mounts <code>$HOME</code> as an ephemeral tmpfs, and only mounts the project directory with write access. The difference is that all of that is now configurable without editing code.</p>
<h2>Installation<span class="hx:absolute hx:-mt-20" id="installation"></span>
    <a href="#installation" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Four ways:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># Homebrew (macOS and Linux)</span>
</span></span><span class="line"><span class="cl">brew tap akitaonrails/tap <span class="o">&amp;&amp;</span> brew install ai-jail
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Cargo</span>
</span></span><span class="line"><span class="cl">cargo install ai-jail
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Mise</span>
</span></span><span class="line"><span class="cl">mise use -g ubi:akitaonrails/ai-jail
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Direct binary from GitHub Releases</span>
</span></span><span class="line"><span class="cl">curl -fsSL https://github.com/akitaonrails/ai-jail/releases/latest/download/ai-jail-linux-x86_64.tar.gz <span class="p">|</span> tar xz
</span></span><span class="line"><span class="cl">sudo mv ai-jail /usr/local/bin/</span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>On Linux, bubblewrap needs to be installed separately: <code>pacman -S bubblewrap</code> (Arch), <code>apt install bubblewrap</code> (Debian/Ubuntu), <code>dnf install bubblewrap</code> (Fedora). On macOS no extra dependency is needed.</p>
<h2>Basic Usage<span class="hx:absolute hx:-mt-20" id="basic-usage"></span>
    <a href="#basic-usage" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="nb">cd</span> ~/Projects/my-app
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Run Claude Code inside the sandbox</span>
</span></span><span class="line"><span class="cl">ai-jail claude
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Run Codex</span>
</span></span><span class="line"><span class="cl">ai-jail codex
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Run OpenCode</span>
</span></span><span class="line"><span class="cl">ai-jail opencode
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Plain bash for debugging</span>
</span></span><span class="line"><span class="cl">ai-jail bash
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Any command</span>
</span></span><span class="line"><span class="cl">ai-jail -- python script.py</span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>On the first run, it creates an <code>.ai-jail</code> file in the project directory:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-toml" data-lang="toml"><span class="line"><span class="cl"><span class="c"># ai-jail sandbox configuration</span>
</span></span><span class="line"><span class="cl"><span class="c"># Edit freely. Regenerate with: ai-jail --clean --init</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nx">command</span> <span class="p">=</span> <span class="p">[</span><span class="s2">&#34;claude&#34;</span><span class="p">]</span>
</span></span><span class="line"><span class="cl"><span class="nx">rw_maps</span> <span class="p">=</span> <span class="p">[]</span>
</span></span><span class="line"><span class="cl"><span class="nx">ro_maps</span> <span class="p">=</span> <span class="p">[]</span></span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>This file is committable to the repo. When another dev on your team clones the project and runs <code>ai-jail</code>, the same configuration applies.</p>
<p>If you want to add extra directories, you can do it from the CLI or directly in the TOML:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># Extra directory with write access</span>
</span></span><span class="line"><span class="cl">ai-jail --rw-map ~/Projects/shared-lib claude
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Extra read-only directory</span>
</span></span><span class="line"><span class="cl">ai-jail --map /opt/datasets claude</span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>Want to see what the sandbox is going to do without running anything?</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">ai-jail --dry-run --verbose claude</span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>It shows every mount point, every isolation flag, the full bubblewrap command. No surprises.</p>
<h2>Why Bubblewrap on Linux<span class="hx:absolute hx:-mt-20" id="why-bubblewrap-on-linux"></span>
    <a href="#why-bubblewrap-on-linux" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>I evaluated the alternatives before deciding. The <a href="https://github.com/akitaonrails/ai-jail/blob/master/docs/sandbox-alternatives.md"target="_blank" rel="noopener">full analysis document</a> is in the repository, but the short version is:</p>
<p>Bubblewrap (bwrap) is the sandbox Flatpak uses to isolate every desktop app. ~50KB binary, ~4000 lines of C, maintained by the GNOME team. It runs without root using <code>CLONE_NEWUSER</code> to create namespaces without elevated privileges. It&rsquo;s packaged in every relevant Linux distro and tested at scale by millions of Flatpak installations.</p>
<p>I considered and rejected the alternatives. Firejail requires setuid root, and trusting setuid to protect against agents already running on your system is contradictory. nsjail and minijail are designed for production environments (Google uses them internally), too complex for a dev workstation. systemd-nspawn requires root and is meant for system containers, not for isolating a single process.</p>
<p>Landlock is a different case. It doesn&rsquo;t replace bubblewrap — it has nothing to do with namespaces or mount isolation — but it complements it. Landlock is a Linux Security Module that controls access at the VFS level, independent of mount namespaces. That closes vectors bwrap alone doesn&rsquo;t cover: escape paths through <code>/proc</code>, symlink tricks inside permitted mounts, and it serves as a safety net against bugs in the namespace machinery itself. As of v0.4.0, ai-jail applies Landlock automatically on kernels 5.13+ as defense-in-depth. It uses ABI V3 (Linux 6.2+) with graceful degradation to V1 on older kernels, and turns into a silent no-op if the kernel doesn&rsquo;t support it. If it causes problems with some specific tool, <code>--no-landlock</code> turns it off.</p>
<p>Bubblewrap hits the exact sweet spot: real isolation without root, on every distro, and simple enough to wrap in an 880KB binary.</p>
<h2>What the Sandbox Does on Linux<span class="hx:absolute hx:-mt-20" id="what-the-sandbox-does-on-linux"></span>
    <a href="#what-the-sandbox-does-on-linux" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>When you run <code>ai-jail claude</code>, this is what happens:</p>
<p>The agent runs in isolated PID, UTS and IPC namespaces, with hostname <code>ai-sandbox</code>, and dies automatically if the parent dies (<code>--die-with-parent</code>).</p>
<p>The filesystem is mounted in a specific sequence (bubblewrap is order-dependent). <code>/usr</code>, <code>/etc</code>, <code>/opt</code>, <code>/sys</code> come in read-only for system tools. <code>/dev</code> and <code>/proc</code> are mounted for device and process access. <code>/tmp</code> and <code>/run</code> come in as fresh tmpfs. GPUs are auto-detected (<code>/dev/nvidia*</code>, <code>/dev/dri</code>). The Docker socket, X11/Wayland, and <code>/dev/shm</code> are all auto-detected and mounted if they exist.</p>
<p>The most important part is how the home directory is handled. <code>$HOME</code> is mounted as an empty tmpfs. Then, selectively, dotfiles get layered on top. <code>.gnupg</code>, <code>.aws</code>, <code>.ssh</code>, <code>.mozilla</code>, <code>.sparrow</code> are never mounted (sensitive data). <code>.claude</code>, <code>.crush</code>, <code>.codex</code>, <code>.aider</code>, <code>.config</code>, <code>.cargo</code>, <code>.cache</code>, <code>.docker</code> come in as read-write because the agents need to write here. Everything else comes in read-only. Inside <code>~/.config</code>, browser subdirectories are hidden behind tmpfs: <code>BraveSoftware</code>, <code>Bitwarden</code>. Same in <code>~/.cache</code>: <code>BraveSoftware</code>, <code>chromium</code>, <code>spotify</code>, <code>nvidia</code>. The agent can&rsquo;t even see those directories exist.</p>
<p>The current project directory is the only place with write permission (besides the tool dotdirs). The agent modifies the code, but doesn&rsquo;t touch anything else.</p>
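<p>To make the mount layering concrete, here is a reduced sketch of a <code>bwrap</code> invocation in the same spirit. The flags are real bubblewrap flags, but the exact flag set, ordering, and path list that ai-jail builds is longer; treat this as an illustration, not the tool&rsquo;s actual command line:</p>

```shell
# Sketch only: real bwrap flags, illustrative paths.
bwrap \
  --die-with-parent \
  --unshare-pid --unshare-uts --unshare-ipc \
  --hostname ai-sandbox \
  --ro-bind /usr /usr \
  --ro-bind /etc /etc \
  --proc /proc \
  --dev /dev \
  --tmpfs /tmp \
  --tmpfs "$HOME" \
  --ro-bind "$HOME/.gitconfig" "$HOME/.gitconfig" \
  --bind "$HOME/.claude" "$HOME/.claude" \
  --bind "$PWD" "$PWD" \
  --chdir "$PWD" \
  claude
```

<p>The order matters: <code>--tmpfs "$HOME"</code> has to come before the selective dotfile binds, because later mounts layer on top of earlier ones. Swap the order and the empty tmpfs would shadow everything you just mounted.</p>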
<h2>macOS: sandbox-exec<span class="hx:absolute hx:-mt-20" id="macos-sandbox-exec"></span>
    <a href="#macos-sandbox-exec" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>On macOS, the backend is <code>sandbox-exec</code> with SBPL (Sandbox Profile Language) profiles. It&rsquo;s a legacy Apple API, officially deprecated but with no public replacement. It works today, but Apple may remove it in the future.</p>
<p>ai-jail generates an SBPL profile at runtime that mirrors the same logic as Linux:</p>
<ul>
<li>Default deny on everything</li>
<li>Allows process operations (exec, fork, signal)</li>
<li>Allows network (except in lockdown)</li>
<li>Allows global reads, denies sensitive paths (<code>.gnupg</code>, <code>.aws</code>, <code>.ssh</code>, <code>~/Library/Keychains</code>, <code>~/Library/Mail</code>)</li>
<li>Allows writes only in the project directory, tool dotfiles, and <code>/tmp</code></li>
</ul>
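<p>An SBPL profile implementing those rules looks roughly like the fragment below. The operations are real SBPL, but the paths are placeholders and the profile ai-jail actually generates is considerably longer:</p>

```scheme
;; Illustrative fragment only; not the profile ai-jail emits verbatim.
(version 1)
(deny default)
(allow process-exec*)
(allow process-fork)
(allow signal)
(allow network*)
(allow file-read*)
(deny file-read*
  (subpath "/Users/you/.ssh")
  (subpath "/Users/you/.gnupg")
  (subpath "/Users/you/Library/Keychains"))
(allow file-write*
  (subpath "/Users/you/Projects/my-app")
  (subpath "/private/tmp"))
```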
<p>The limitations are real. GPU (Metal) and Display (Cocoa) are system-level on macOS; sandbox-exec can&rsquo;t restrict them. The <code>--no-gpu</code> and <code>--no-display</code> flags simply have no effect on macOS. Cross-platform parity is approximate, not exact.</p>
<p>Even with those limitations, it&rsquo;s better than running the agent completely open. sandbox-exec protects against access to sensitive filesystem areas and, in lockdown, removes write and network permissions.</p>
<h2>Windows: Not Supported (And Probably Never Will Be)<span class="hx:absolute hx:-mt-20" id="windows-not-supported-and-probably-never-will-be"></span>
    <a href="#windows-not-supported-and-probably-never-will-be" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>It&rsquo;s not for lack of interest; it&rsquo;s a lack of primitives. Windows has no userspace equivalent to Linux namespaces. No mount API like bubblewrap. AppContainers exist but use a completely different security model, require admin privileges, and mapping bwrap functionality to AppContainers would effectively mean writing another project from scratch.</p>
<p>The Windows answer is WSL 2:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># Inside WSL 2 (real Linux kernel)</span>
</span></span><span class="line"><span class="cl">sudo apt install bubblewrap
</span></span><span class="line"><span class="cl">cargo install ai-jail
</span></span><span class="line"><span class="cl"><span class="c1"># or: mise use -g ubi:akitaonrails/ai-jail</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nb">cd</span> /mnt/c/Users/you/Projects/my-app
</span></span><span class="line"><span class="cl">ai-jail claude</span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>WSL 2 runs an actual Linux kernel. Bubblewrap works normally. Windows files are accessible at <code>/mnt/c/</code>. I/O performance is slower across the 9p mount, but it works. For large projects, cloning inside <code>~/Projects/</code> on the Linux side improves performance considerably.</p>
<h2>Lockdown Mode<span class="hx:absolute hx:-mt-20" id="lockdown-mode"></span>
    <a href="#lockdown-mode" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>For workloads you really don&rsquo;t trust, there&rsquo;s <code>--lockdown</code>:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">ai-jail --lockdown bash</span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>Lockdown does everything normal mode does, but goes further. The project gets mounted read-only (not read-write). GPU, Docker, Display and mise are disabled. <code>--rw-map</code> and <code>--map</code> flags are ignored. <code>$HOME</code> becomes pure tmpfs, no host dotfiles. On Linux, the network is cut with <code>--unshare-net</code> and the environment is wiped with <code>--clearenv</code>. On macOS, environment variables are wiped and write and network rules are removed from the SBPL.</p>
<p>It&rsquo;s the most restrictive sandbox you can build short of using a VM. Useful for auditing third-party code or running agents on projects you don&rsquo;t know.</p>
<h2>Bootstrap: Automatic Permission Configuration<span class="hx:absolute hx:-mt-20" id="bootstrap-automatic-permission-configuration"></span>
    <a href="#bootstrap-automatic-permission-configuration" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p><code>ai-jail --bootstrap</code> generates permission configurations for the tools you use:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">ai-jail --bootstrap</span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>For <strong>Claude Code</strong>, it generates <code>~/.claude/settings.json</code> with allow/deny/ask lists. Allows git status, diff, log, ls, grep, cargo, npm, python, docker compose. Blocks rm -rf, sudo, chmod 777, git push --force. Asks before git push, rm, docker run.</p>
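<p>Abridged, the generated file follows Claude Code&rsquo;s permission-rule format, roughly like this (heavily shortened; the file ai-jail writes contains many more entries):</p>

```json
{
  "permissions": {
    "allow": ["Bash(git status:*)", "Bash(git diff:*)", "Bash(cargo:*)"],
    "ask": ["Bash(git push:*)", "Bash(rm:*)", "Bash(docker run:*)"],
    "deny": ["Bash(sudo:*)", "Bash(chmod 777:*)", "Bash(git push --force:*)"]
  }
}
```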
<p>For <strong>Codex</strong>, it generates <code>~/.codex/config.toml</code> with <code>approval_policy = &quot;on-request&quot;</code>.</p>
<p>For <strong>OpenCode</strong>, it generates <code>~/.config/opencode/opencode.json</code> with bash, edit, write permissions.</p>
<p>Before overwriting any file, it makes an automatic backup (<code>settings.json.bak</code>). And it rejects operations if the target is a symlink, to avoid path traversal.</p>
<p>It&rsquo;s exactly the content I put in manually in the <a href="/en/2026/01/10/ai-agents-locking-down-your-system/">previous post</a>, but automated and tested.</p>
<h2>But Claude Code Already Has Its Own Sandbox<span class="hx:absolute hx:-mt-20" id="but-claude-code-already-has-its-own-sandbox"></span>
    <a href="#but-claude-code-already-has-its-own-sandbox" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>It does. Since October 2025, Claude Code has offered a runtime sandbox via the <code>/sandbox</code> command. And guess what it uses underneath? Bubblewrap on Linux and sandbox-exec on macOS. The same stack.</p>
<p>But the differences matter.</p>
<p>ai-jail is tool-agnostic. It works with Claude, Codex, OpenCode, Crush, and any command. Claude&rsquo;s sandbox only protects Claude. If tomorrow you switch agents, ai-jail keeps working the same.</p>
<p>The thing that bothers me most is the escape hatch. When a command fails because of a sandbox restriction, Claude can retry with <code>dangerouslyDisableSandbox</code>, falling back to the normal permission flow. You can disable that (<code>&quot;allowUnsandboxedCommands&quot;: false</code>), but it&rsquo;s opt-out, not opt-in. In ai-jail, there is no escape hatch. The process runs inside bwrap or sandbox-exec, period. There&rsquo;s no way for the agent to decide on its own to leave the sandbox.</p>
<p>Another practical difference: <code>.ai-jail</code> lives in the project directory and can be committed to the repo. Any dev who clones the project inherits the same sandbox policy. Claude&rsquo;s sandbox depends on a global <code>settings.json</code>.</p>
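<p>For illustration, a minimal <code>.ai-jail</code> might look like the TOML below. The key names here are hypothetical, chosen only to show the idea of a committed per-project policy; check the repository docs for the real schema:</p>

```toml
# Hypothetical key names, for illustration only.
[mounts]
rw = ["./tmp/cache"]      # extra read-write maps (cf. the --rw-map flag)
ro = ["../shared-protos"] # extra read-only maps (cf. the --map flag)

[features]
gpu = true
docker = false
```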
<p>When run inside Docker, Claude&rsquo;s sandbox falls back to an <code>enableWeakerNestedSandbox</code> mode that, in the words of its own documentation, <em>&ldquo;considerably weakens security&rdquo;</em>. ai-jail wasn&rsquo;t designed to run inside Docker (it runs directly on the dev&rsquo;s workstation), so this problem doesn&rsquo;t exist.</p>
<p>About the network: Claude&rsquo;s sandbox routes traffic through a proxy and allows/blocks by domain. ai-jail in normal mode inherits the host network; in lockdown, it cuts the entire network with <code>--unshare-net</code>. Claude&rsquo;s approach is more granular; ai-jail&rsquo;s is simpler and harder to circumvent.</p>
<p>The two aren&rsquo;t mutually exclusive. You can run Claude&rsquo;s sandbox inside ai-jail. ai-jail handles filesystem isolation; Claude&rsquo;s sandbox adds per-domain network filtering. Security layers stack.</p>
<h2>Why Not Use --dangerously-skip-permissions Without a Jail<span class="hx:absolute hx:-mt-20" id="why-not-use-dangerously-skip-permissions-without-a-jail"></span>
    <a href="#why-not-use-dangerously-skip-permissions-without-a-jail" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>I&rsquo;ll be blunt: if you run Claude Code with <code>--dangerously-skip-permissions</code> without any sandbox, you&rsquo;re trusting blindly that the LLM will never execute anything destructive. And you&rsquo;re trusting that none of your project&rsquo;s dependencies has been compromised in a supply-chain attack.</p>
<p>Every <code>--dangerously</code> flag has that name for a reason. Claude Code is explicit: that mode exists for CI/CD and automation in environments that are already isolated (containers, throwaway VMs). Not for your personal workstation with <code>~/.aws/credentials</code>, <code>~/.gnupg/</code>, SSH keys, and your browser&rsquo;s password vault.</p>
<p>With ai-jail, the agent has total autonomy inside the sandbox. It does whatever it wants in the project directory, uses the tools it needs, and can&rsquo;t access anything outside what was explicitly permitted.</p>
<h2>ai-jail + Git: The Safety Net You Already Have<span class="hx:absolute hx:-mt-20" id="ai-jail--git-the-safety-net-you-already-have"></span>
    <a href="#ai-jail--git-the-safety-net-you-already-have" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>There&rsquo;s something I haven&rsquo;t mentioned yet that changes the risk calculus: if your project is in a Git repo, with a remote on GitHub/GitLab, and the agent doesn&rsquo;t have permission to <code>git push</code>, the damage it can cause is limited to the local directory.</p>
<p>Think about it. The worst-case scenario inside ai-jail is the agent corrupting every file in the project. Annoying? Sure. Catastrophic? No. You run <code>git checkout .</code> and you&rsquo;re back to the last commit. If it corrupts <code>.git</code> somehow (unlikely, but possible), you delete the directory and clone again. The remote was never touched.</p>
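<p>The recovery story is easy to demonstrate end to end. This is plain Git, nothing ai-jail-specific:</p>

```shell
# Simulate the worst case: a tracked file trashed, then one-command recovery.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q .
git config user.email demo@example.com
git config user.name demo
echo 'fn main() {}' > main.rs
git add main.rs
git commit -qm baseline

echo 'trashed by a rogue agent' > main.rs  # the "damage"
git checkout -- main.rs                    # back to the last commit
cat main.rs                                # prints: fn main() {}
```

<p>The remote never enters the picture: everything above happens in the local clone.</p>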
<p>That&rsquo;s why ai-jail&rsquo;s <code>--bootstrap</code> puts <code>git push</code> on the &ldquo;ask&rdquo; list (ask before running), not the &ldquo;allow&rdquo; list. And <code>git push --force</code> goes straight to &ldquo;deny&rdquo;. The agent can commit locally all it wants, can create branches, can rebase. None of that affects the remote. When it comes time to push, you review what it did and decide whether it goes live.</p>
<p>That combination, sandbox for the filesystem + Git for the code + manual push, already gives you a very reasonable security level for daily use. ai-jail protects your personal data and the system. Git protects your code. And the decision to publish stays yours.</p>
<p>If you want to go further, the next two sections cover additional layers.</p>
<h2>ai-jail vs Dev Containers<span class="hx:absolute hx:-mt-20" id="ai-jail-vs-dev-containers"></span>
    <a href="#ai-jail-vs-dev-containers" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Since I wrote ai-jail, the most frequent question is: <em>&ldquo;why not use Dev Containers?&rdquo;</em>. The short answer is that one doesn&rsquo;t replace the other. They solve different problems.</p>
<p>Dev Containers (the <a href="https://containers.dev"target="_blank" rel="noopener">containers.dev</a> spec) define a complete development environment in a <code>devcontainer.json</code>. You specify base image, tools, VS Code extensions, environment variables, and the editor mounts everything for you in a Docker container. Docker also recently launched Docker Sandboxes, which go further and run each agent in a Firecracker microVM with hardware isolation.</p>
<p>ai-jail does none of that. It doesn&rsquo;t define an environment. It doesn&rsquo;t install tools. It doesn&rsquo;t run a Docker image. It takes the environment that already exists on your machine and restricts what the process can access.</p>
<p>In practice, the difference is:</p>
<table>
  <thead>
      <tr>
          <th></th>
          <th>Dev Container</th>
          <th>ai-jail</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>What it does</td>
          <td>Defines and provisions a complete isolated environment</td>
          <td>Restricts process access to the existing filesystem</td>
      </tr>
      <tr>
          <td>Setup</td>
          <td><code>devcontainer.json</code> + Docker</td>
          <td><code>.ai-jail</code> TOML + bubblewrap</td>
      </tr>
      <tr>
          <td>Startup</td>
          <td>Seconds (image pull, container build)</td>
          <td>Milliseconds (fork + exec of bwrap)</td>
      </tr>
      <tr>
          <td>Tools</td>
          <td>Whatever you put in the image</td>
          <td>Whatever&rsquo;s already installed on your machine</td>
      </tr>
      <tr>
          <td>GPU</td>
          <td>Requires NVIDIA Container Toolkit configuration</td>
          <td>Auto-detects <code>/dev/nvidia*</code> and <code>/dev/dri</code></td>
      </tr>
      <tr>
          <td>Daemon</td>
          <td>Requires Docker daemon running</td>
          <td>Nothing besides bwrap</td>
      </tr>
      <tr>
          <td>Reproducibility</td>
          <td>High (fixed image)</td>
          <td>Depends on what&rsquo;s installed on the host</td>
      </tr>
      <tr>
          <td>Network isolation</td>
          <td>Docker Sandboxes: per-domain firewall</td>
          <td>Lockdown: cuts everything with <code>--unshare-net</code></td>
      </tr>
  </tbody>
</table>
<p>Dev Container makes more sense when you need the whole team to have exactly the same environment, or when the project has dependencies nobody wants to install on the host, or for running non-interactive agents in CI/CD. Docker Sandboxes with microVM are the strongest isolation that exists outside a dedicated VM.</p>
<p>ai-jail makes more sense when you already have the environment configured and want instant startup with no Docker daemon. Or when you use tools that are annoying to run inside a container (CUDA, Wayland, mise). Or simply when you want the same protection for any agent, not just the ones with devcontainer integration.</p>
<p>And you can combine them. I know people who run ai-jail inside a Dev Container to get environment reproducibility + filesystem restriction. Security layers stack.</p>
<h2>Immutable Operating Systems: The Last Layer<span class="hx:absolute hx:-mt-20" id="immutable-operating-systems-the-last-layer"></span>
    <a href="#immutable-operating-systems-the-last-layer" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>If you want to take security seriously, the third layer is the operating system itself.</p>
<p>Immutable systems like <a href="https://fedoraproject.org/atomic-desktops/silverblue/"target="_blank" rel="noopener">Fedora Silverblue</a>, <a href="https://nixos.org/"target="_blank" rel="noopener">NixOS</a>, and <a href="https://aeondesktop.github.io/"target="_blank" rel="noopener">openSUSE Aeon</a> have a read-only root filesystem. The base system can&rsquo;t be modified by any process, even with root. Updates are atomic: they either apply completely or not at all. And if something goes wrong, you roll back to the previous image in one reboot.</p>
<p>In practice, that means even if an AI agent escaped the sandbox (exploiting a kernel vulnerability, for example), it couldn&rsquo;t modify the system persistently. On the next reboot, the system returns to the declared state. On NixOS, the entire system is defined by a configuration file (<code>configuration.nix</code>). On Silverblue, the base is an OSTree image that gets atomic updates via <code>rpm-ostree</code>.</p>
<p>For developers, the catch is: your dev tools run in containers (Toolbox/Distrobox on Silverblue, <code>nix-shell</code> on NixOS). The base system stays untouched. Desktop apps come via Flatpak, which already runs in a sandbox. The result is that the host&rsquo;s attack surface is minimal.</p>
<p>Fedora Silverblue is probably the most accessible entry point. It&rsquo;s already Fedora underneath, with GNOME, works with hardware Fedora supports, and Toolbox gives you a containerized Fedora Server where you install whatever you want without touching the host. NixOS is more powerful (full reproducibility, declarative rollback), but the learning curve is real.</p>
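<p>On Silverblue, the Toolbox workflow is a couple of commands (the packages here are just examples):</p>

```shell
# Create and enter a mutable dev container; the host stays read-only.
toolbox create dev
toolbox enter dev
# Inside the toolbox, install whatever you want without touching the host:
sudo dnf install -y cargo ripgrep
```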
<p>The full combination looks like this: the immutable OS handles the system (read-only filesystem, atomic updates, one-reboot rollback). ai-jail handles the session (isolated namespace, ephemeral home, sensitive data invisible). And Git handles the code (remote untouched as long as the agent doesn&rsquo;t have push).</p>
<p>None of those layers is perfect on its own. But the attack that punches through all three at the same time — escaping the namespace, persisting on a read-only filesystem, and corrupting a Git remote — is a scenario I&rsquo;d be comfortable calling unlikely.</p>
<p>The best part is that none of them requires changing how you work. ai-jail is one command before your agent. Git you already use. And an immutable OS is an installation, not a workflow change.</p>
<h2>Technical Details (For Those Who Care)<span class="hx:absolute hx:-mt-20" id="technical-details-for-those-who-care"></span>
    <a href="#technical-details-for-those-who-care" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Written in Rust with 124 tests and 4 dependencies: <code>lexopt</code> (CLI parsing without clap), <code>serde</code> + <code>toml</code> (config), <code>nix</code> (Unix syscalls). No async runtime, no color framework (raw ANSI), ~880KB binary with LTO and strip.</p>
<p>Signal handling is done correctly: SIGINT, SIGTERM, and SIGHUP are forwarded to the child process. The handler only calls <code>libc::kill</code>, which is async-signal-safe. Process reaping uses <code>waitpid</code> in a loop with retry on EINTR.</p>
<p>Temporary files (like the custom <code>/etc/hosts</code> the sandbox mounts) use RAII: a <code>SandboxGuard</code> that implements <code>Drop</code> in Rust. If the parent process dies for any reason, cleanup happens.</p>
<p>Configuration compatibility is guaranteed by development policy: never remove fields, never rename, new fields always with <code>#[serde(default)]</code>, unknown fields are silently ignored. Regression tests for old <code>.ai-jail</code> formats guarantee that updating the binary never breaks existing configs. There are 32 tests just for config.</p>
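<p>That policy translates directly into serde attributes. Here is a sketch with assumed field names (these are not ai-jail&rsquo;s real config types, and the snippet needs the <code>serde</code> and <code>toml</code> crates, so take it as an illustration of the pattern):</p>

```rust
use serde::Deserialize;

// Assumed field names, illustrating the compatibility policy only.
#[derive(Deserialize, Default, Debug)]
struct Config {
    // Original field: never removed, never renamed.
    #[serde(default)]
    lockdown: bool,
    // Field added in a later release: always carries a default,
    // so a .ai-jail written before it existed still parses.
    #[serde(default)]
    no_landlock: bool,
    // deny_unknown_fields is never enabled, so a config written by a
    // *newer* binary (with fields this version doesn't know) also parses.
}

fn main() {
    let old_config = "lockdown = true"; // predates no_landlock
    let cfg: Config = toml::from_str::<Config>(old_config).unwrap();
    assert!(cfg.lockdown);
    assert!(!cfg.no_landlock);
}
```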
<h2>Roadmap<span class="hx:absolute hx:-mt-20" id="roadmap"></span>
    <a href="#roadmap" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>What&rsquo;s left:</p>
<ul>
<li>More tool backends in bootstrap (Aider, Cursor, Windsurf) as more agents standardize their permission configuration files.</li>
<li>Profile sharing for monorepos, so you don&rsquo;t have to configure each service separately.</li>
</ul>
<h2>Installation and First Use (Quick Recap)<span class="hx:absolute hx:-mt-20" id="installation-and-first-use-quick-recap"></span>
    <a href="#installation-and-first-use-quick-recap" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># 1. Install</span>
</span></span><span class="line"><span class="cl">brew tap akitaonrails/tap <span class="o">&amp;&amp;</span> brew install ai-jail
</span></span><span class="line"><span class="cl"><span class="c1"># or: cargo install ai-jail</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 2. On Linux, install bubblewrap</span>
</span></span><span class="line"><span class="cl">sudo pacman -S bubblewrap  <span class="c1"># Arch</span>
</span></span><span class="line"><span class="cl"><span class="c1"># sudo apt install bubblewrap  # Debian/Ubuntu</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 3. Enter the project and run</span>
</span></span><span class="line"><span class="cl"><span class="nb">cd</span> ~/Projects/my-app
</span></span><span class="line"><span class="cl">ai-jail claude
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 4. (Optional) Generate permission configs</span>
</span></span><span class="line"><span class="cl">ai-jail --bootstrap
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># 5. (Optional) See what the sandbox does</span>
</span></span><span class="line"><span class="cl">ai-jail --dry-run --verbose claude</span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>The <code>.ai-jail</code> file it creates can be committed to your repo. From then on, any dev who clones the project runs <code>ai-jail claude</code> and gets the same sandbox.</p>
<h2>Conclusion<span class="hx:absolute hx:-mt-20" id="conclusion"></span>
    <a href="#conclusion" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>January&rsquo;s shell script solved the problem. ai-jail solves the problem properly. Per-project config, macOS support, lockdown mode, permission bootstrap, dry-run for auditing, and an 880KB binary that installs in 10 seconds.</p>
<p>If you use AI agents to code, run them in a sandbox. The LLM&rsquo;s good intentions are no guarantee of anything, and supply-chain attacks don&rsquo;t pick their victims. Process isolation is the barrier that works.</p>
<p>The project is GPL-3.0 and is on GitHub: <a href="https://github.com/akitaonrails/ai-jail"target="_blank" rel="noopener">github.com/akitaonrails/ai-jail</a></p>
<p>Issues and PRs are welcome.</p>
]]></content:encoded><category>AI</category><category>Linux</category><category>Bubblewrap</category><category>Sandbox</category><category>Rust</category><category>Security</category></item><item><title>Software Is Never 'Done' — 4 Projects, Life After Deploy, and Why One-Shot Prompting Is a Myth</title><link>https://www.akitaonrails.com/en/2026/03/01/software-is-never-done-4-projects-life-after-deploy-one-shot-prompt-myth/</link><guid isPermaLink="true">https://www.akitaonrails.com/en/2026/03/01/software-is-never-done-4-projects-life-after-deploy-one-shot-prompt-myth/</guid><pubDate>Sun, 01 Mar 2026 12:00:00 GMT</pubDate><description>&lt;p&gt;And don&amp;rsquo;t forget to subscribe to my newsletter &lt;a href="https://themakitachronicles.com/"target="_blank" rel="noopener"&gt;The M.Akita Chronicles&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;&amp;ndash;&lt;/p&gt;
&lt;p&gt;I published the &amp;ldquo;final post&amp;rdquo; of the &lt;a href="https://www.akitaonrails.com/en/2026/02/20/zero-to-post-production-in-1-week-using-ai-on-real-projects-behind-the-m-akita-chronicles/"&gt;Behind The M.Akita Chronicles&lt;/a&gt; series 10 days ago. 274 commits, 1,323 tests, deployed to production. I wrote down the lessons, did the conclusion, dropped the quote at the end. Done.&lt;/p&gt;
&lt;p&gt;125 post-production commits later, I can confirm: &lt;strong&gt;software &lt;em&gt;&amp;ldquo;done&amp;rdquo;&lt;/em&gt; doesn&amp;rsquo;t exist.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Today the M.Akita Chronicles repo has 335 commits and 1,422 tests. Tomorrow, Monday, subscribers get the 3rd consecutive newsletter — generated, reviewed, and sent by a system that won&amp;rsquo;t stop evolving. Meanwhile, &lt;a href="https://www.akitaonrails.com/en/2026/02/23/vibe-code-built-a-smart-image-indexer-with-ai-in-2-days-frank-sherlock/"&gt;Frank Sherlock&lt;/a&gt; went from 50 commits and v0.1 to 103 commits and v0.7 with face detection. &lt;a href="https://www.akitaonrails.com/en/2026/02/01/frankmd-markdown-editor-vibe-code-part-1/"&gt;FrankMD&lt;/a&gt; got 3 external contributors and shipped v0.2.0. And even &lt;a href="https://www.akitaonrails.com/en/2026/02/21/vibe-code-built-a-mega-clone-in-rails-in-1-day-frankmega/"&gt;FrankMega&lt;/a&gt; — a project built in 1 day — needed fixes when real users showed up.&lt;/p&gt;</description><content:encoded><![CDATA[<p>And don&rsquo;t forget to subscribe to my newsletter <a href="https://themakitachronicles.com/"target="_blank" rel="noopener">The M.Akita Chronicles</a>!</p>
<p>&ndash;</p>
<p>I published the &ldquo;final post&rdquo; of the <a href="/en/2026/02/20/zero-to-post-production-in-1-week-using-ai-on-real-projects-behind-the-m-akita-chronicles/">Behind The M.Akita Chronicles</a> series 10 days ago. 274 commits, 1,323 tests, deployed to production. I wrote down the lessons, did the conclusion, dropped the quote at the end. Done.</p>
<p>125 post-production commits later, I can confirm: <strong>software <em>&ldquo;done&rdquo;</em> doesn&rsquo;t exist.</strong></p>
<p>Today the M.Akita Chronicles repo has 335 commits and 1,422 tests. Tomorrow, Monday, subscribers get the 3rd consecutive newsletter — generated, reviewed, and sent by a system that won&rsquo;t stop evolving. Meanwhile, <a href="/en/2026/02/23/vibe-code-built-a-smart-image-indexer-with-ai-in-2-days-frank-sherlock/">Frank Sherlock</a> went from 50 commits and v0.1 to 103 commits and v0.7 with face detection. <a href="/en/2026/02/01/frankmd-markdown-editor-vibe-code-part-1/">FrankMD</a> got 3 external contributors and shipped v0.2.0. And even <a href="/en/2026/02/21/vibe-code-built-a-mega-clone-in-rails-in-1-day-frankmega/">FrankMega</a> — a project built in 1 day — needed fixes when real users showed up.</p>
<p>Throughout February, I published more than a dozen posts detailing the build of each project. This post is different. It isn&rsquo;t about building from scratch, I covered that already. It&rsquo;s about what happens <strong>after</strong>. And what happens after destroys the one-shot prompt narrative. Software needs an experienced human at the wheel. Iterative development is the only thing that works. Anyone who disagrees hasn&rsquo;t put a system in production yet.</p>
<p>I&rsquo;ll prove it with <code>git log</code>.</p>
<h2>Life After Deploy<span class="hx:absolute hx:-mt-20" id="life-after-deploy"></span>
    <a href="#life-after-deploy" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><table>
  <thead>
      <tr>
          <th>Project</th>
          <th>Post-publication commits</th>
          <th>Highlight</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>The M.Akita Chronicles</td>
          <td>56</td>
          <td>3rd week in production, new features, 13 bug fixes, prompt tuning</td>
      </tr>
      <tr>
          <td>Frank Sherlock</td>
          <td>53</td>
          <td>From v0.3 to v0.7: video support, face detection, AUR publish</td>
      </tr>
      <tr>
          <td>FrankMD</td>
          <td>14</td>
          <td>3 external contributors, 4 PRs merged, v0.2.0</td>
      </tr>
      <tr>
          <td>FrankMega</td>
          <td>2</td>
          <td>MIME types nobody saw coming</td>
      </tr>
      <tr>
          <td><strong>Total</strong></td>
          <td><strong>125</strong></td>
          <td></td>
      </tr>
  </tbody>
</table>
<p>Let&rsquo;s get into the details.</p>
<h2>The M.Akita Chronicles: 3 Weeks in Production<span class="hx:absolute hx:-mt-20" id="the-makita-chronicles-3-weeks-in-production"></span>
    <a href="#the-makita-chronicles-3-weeks-in-production" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>The newsletter system has been live since February 16. It&rsquo;s already sent 2 newsletters and tomorrow it sends the third. 56 commits have happened since the &ldquo;final post&rdquo;:</p>
<table>
  <thead>
      <tr>
          <th>Date</th>
          <th>Commits</th>
          <th>What happened</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Feb 20 (Thu)</td>
          <td>16</td>
          <td>Steam Gaming (whole new section), bug fixes in paywall detection</td>
      </tr>
      <tr>
          <td>Feb 21 (Fri)</td>
          <td>12</td>
          <td>/preview command, YouTube collage, /rerun notifications</td>
      </tr>
      <tr>
          <td>Feb 22 (Sat)</td>
          <td>1</td>
          <td>Fix: /preflight sending the result to the wrong channel</td>
      </tr>
      <tr>
          <td>Feb 23 (Sun)</td>
          <td>10</td>
          <td>2nd newsletter publication, Gmail clipping warning, TTS revert</td>
      </tr>
      <tr>
          <td>Feb 24 (Mon)</td>
          <td>1</td>
          <td>Podcast prompt tweak</td>
      </tr>
      <tr>
          <td>Feb 28 (Sat)</td>
          <td>7</td>
          <td>Marvin prompt tuning, rate limiting, /rerun comments</td>
      </tr>
      <tr>
          <td>Mar 1 (Sun)</td>
          <td>9</td>
          <td>Mood system, switch to Grok-4, QA pipeline, config centralization</td>
      </tr>
  </tbody>
</table>
<h3>Features Nobody Saw Coming<span class="hx:absolute hx:-mt-20" id="features-nobody-saw-coming"></span>
    <a href="#features-nobody-saw-coming" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p><strong>Steam Gaming.</strong> Wasn&rsquo;t in the original plan. It was born because the entertainment section only had anime — and gamer readers were left out. Result: 6 commits just on launch day. The first commit (<code>bf41e25</code>) added the whole section — Steam API service, generation job, tests, Hugo shortcode. The next 5 fixed things: Portuguese date parsing, ranking offset so the top 10 didn&rsquo;t repeat, wishlist filter for releases, dark theme. No spec in the world predicts that the Steam API returns dates with Portuguese abbreviations (<code>&quot;fev&quot;</code>, <code>&quot;mar&quot;</code>, <code>&quot;abr&quot;</code>) when you ask for <code>l=brazilian</code>. You find that out when the parser breaks in production.</p>
<p><strong>Preview system.</strong> <code>/preview</code> was born because I needed to see how the auto-generated content looked <strong>before</strong> publishing the entire newsletter. Seems obvious in retrospect, but in v1 the only way to validate was to generate everything and look at the final result. 5 commits to build a preview system with 9 sections, aliases (<code>hn</code> for hacker_news, <code>steam</code> for steam_gaming), separate comment-preview mode, and rescue when one section fails without taking the others down (<code>d23c725</code> — because a section with an error was killing the entire preview).</p>
<p><strong>Marvin&rsquo;s moods.</strong> Marvin (the bot&rsquo;s sarcastic persona) suffers from a chronic LLM problem: smoothing. Without intervention, every comment ends up the same lukewarm tone. The iterative fix was a mood system — 9 modes the operator picks per story:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ruby" data-lang="ruby"><span class="line"><span class="cl"><span class="no">MARVIN_MOODS</span> <span class="o">=</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="s2">&#34;sarcastic&#34;</span>   <span class="o">=&gt;</span> <span class="s2">&#34;Be extra sarcastic and biting in your commentary.&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="s2">&#34;ironic&#34;</span>      <span class="o">=&gt;</span> <span class="s2">&#34;Use irony and dark humor to make your point.&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="s2">&#34;grounded&#34;</span>    <span class="o">=&gt;</span> <span class="s2">&#34;Be more neutral and journalist-like, factual and measured.&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="s2">&#34;provocative&#34;</span> <span class="o">=&gt;</span> <span class="s2">&#34;Be provocative and controversial, challenge assumptions.&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="s2">&#34;counter&#34;</span>     <span class="o">=&gt;</span> <span class="s2">&#34;Take the opposite position from Akita&#39;s comment.&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="s2">&#34;insightful&#34;</span>  <span class="o">=&gt;</span> <span class="s2">&#34;Find a non-obvious angle -- a historical parallel, a second-order consequence.&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="s2">&#34;positive&#34;</span>    <span class="o">=&gt;</span> <span class="s2">&#34;Find the genuinely positive angle.&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="s2">&#34;hopeful&#34;</span>     <span class="o">=&gt;</span> <span class="s2">&#34;Be cautiously optimistic. Acknowledge the good potential.&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">  <span class="s2">&#34;negative&#34;</span>    <span class="o">=&gt;</span> <span class="s2">&#34;Be extra pessimistic and bleak.&#34;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span><span class="o">.</span><span class="n">freeze</span></span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>None of this came from a spec. It came from 3 weeks of reading bland comments and thinking <em>&ldquo;this isn&rsquo;t good enough&rdquo;</em>. There are 7 commits just for prompt tuning across 10 days. I banned formulaic patterns (<code>d4bd3a6</code>). I killed off &ldquo;Ah,&rdquo; as an opener (<code>4171628</code>). I removed Marvin from entire podcast sections because he was contaminating the tone (<code>246cc60</code>). I pushed for substance instead of puns (<code>8bc47c4</code>). Each prompt commit is a micro-correction that only makes sense after reading the previous output.</p>
<h3>Bugs That Only Exist in Production<span class="hx:absolute hx:-mt-20" id="bugs-that-only-exist-in-production"></span>
    <a href="#bugs-that-only-exist-in-production" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Of the 56 commits, 13 are bug fixes. Some favorites:</p>
<ul>
<li>
<p><strong>Gmail clipping</strong> (<code>3258f5b</code>): Gmail silently cuts off emails larger than 102KB. I found out when the 2nd newsletter ran long and Gmail readers didn&rsquo;t see the end. Fix: a size check in the content preflight that warns before publishing. I also shrunk the history sections from 10 to 5 items (<code>51ce528</code>) to fit under the limit.</p>
</li>
<li>
<p><strong>False paywall detection</strong> (<code>1fd8739</code>): the scraper marked sites as paywalled when it found accidental block text in a generic footer. Only showed up when the story list grew to dozens of real sources.</p>
</li>
<li>
<p><strong>t.co link mangling</strong> (<code>6bd056f</code>): Twitter shortens URLs with t.co, but it also auto-links file names that look like domains. <code>config.yml</code> becomes a link to the <code>config.yml</code> domain. 158 lines of fix to handle tweet text correctly.</p>
</li>
<li>
<p><strong>TTS language revert</strong> (<code>7803ad9</code> -&gt; <code>c1dd668</code>): I tried switching the TTS language from &ldquo;Portuguese&rdquo; to &ldquo;Auto&rdquo;. The model produced a mixed accent. Reverted the same day.</p>
</li>
<li>
<p><strong>Podcast section ordering</strong> (<code>5bef18c</code>): podcast sections came out in the wrong order. 3-line fix + untracking of 1,504 lines of generated content that had been committed by accident.</p>
</li>
<li>
<p><strong>Elementor sites with multiple <code>&lt;article&gt;</code></strong> (<code>e70b756</code>): sites built with Elementor use the <code>&lt;article&gt;</code> tag in a non-semantic way. The content extractor was grabbing the wrong block. 171 lines of fix + tests.</p>
</li>
</ul>
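<p>The Gmail guard above, for instance, reduces to a byte-size check before publish. A minimal sketch, assuming a hypothetical <code>preflight_size_check</code> helper (the real preflight does more than this):</p>

```ruby
# Hypothetical sketch: Gmail clips messages whose body exceeds ~102KB,
# so the preflight warns before the newsletter goes out.
GMAIL_CLIP_LIMIT = 102 * 1024 # bytes

def preflight_size_check(html)
  size = html.bytesize
  return { ok: true, bytes: size } unless size > GMAIL_CLIP_LIMIT

  { ok: false, bytes: size,
    warning: "#{size} bytes; Gmail clips above #{GMAIL_CLIP_LIMIT}" }
end
```

<p>Trivial once you know the limit exists. The expensive part was discovering the limit in production.</p>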
<p>None of these bugs could have been predicted in a spec. They only exist because the system is live, processing real data, hitting real APIs, being read by real people.</p>
<h3>From 1,323 to 1,422 Tests<span class="hx:absolute hx:-mt-20" id="from-1323-to-1422-tests"></span>
    <a href="#from-1323-to-1422-tests" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>The test suite kept growing along with it:</p>
<table>
  <thead>
      <tr>
          <th>App</th>
          <th>Tests (Feb 20)</th>
          <th>Tests (Mar 1)</th>
          <th>Growth</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>marvin-bot</td>
          <td>970</td>
          <td>1,060</td>
          <td>+90</td>
      </tr>
      <tr>
          <td>newsletter</td>
          <td>353</td>
          <td>362</td>
          <td>+9</td>
      </tr>
      <tr>
          <td><strong>Total</strong></td>
          <td><strong>1,323</strong></td>
          <td><strong>1,422</strong></td>
          <td><strong>+99</strong></td>
      </tr>
  </tbody>
</table>
<p>New feature? Comes with a test. Bug fix? Comes with a regression test. Without that, the 56 post-production commits would be 56 chances to break something that worked yesterday. TDD isn&rsquo;t a phase, it&rsquo;s a habit.</p>
<h3>The Model Changed: From Claude to Grok-4<span class="hx:absolute hx:-mt-20" id="the-model-changed-from-claude-to-grok-4"></span>
    <a href="#the-model-changed-from-claude-to-grok-4" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Commit <code>8f9d11c</code>: <code>Switch default model to x-ai/grok-4</code>. The architecture had read the LLM model from an environment variable since day one, but the default model name was still hardcoded across 24 files. Result: commit <code>c8c688e</code> — <code>Centralize LLM model config</code> — 24 files touched to make swapping models a 1-line config change.</p>
<p>This kind of refactoring only shows up in operation. When you start the project, you don&rsquo;t know you&rsquo;re going to change models. When you change, you want it to be trivial. And the cleanup to make it trivial? Only happens when the pain shows up.</p>
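<p>The end state of that refactoring is the boring part that matters: one source of truth, everything else calls it. A sketch of the pattern (module and variable names are assumptions, not the repo&rsquo;s actual code):</p>

```ruby
# Hypothetical sketch of centralized model config: the model name comes
# from one env var with one default, so swapping LLMs is a 1-line change
# (or no change at all, if you just set the env var).
module LLMConfig
  DEFAULT_MODEL = "x-ai/grok-4"

  def self.model
    ENV.fetch("LLM_MODEL", DEFAULT_MODEL)
  end
end
```

<p>Every call site asks <code>LLMConfig.model</code> instead of naming a model, which is the kind of invariant the 24-file cleanup was enforcing.</p>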
<h2>Frank Sherlock: From v0.1 to v0.7 in 4 Days<span class="hx:absolute hx:-mt-20" id="frank-sherlock-from-v01-to-v07-in-4-days"></span>
    <a href="#frank-sherlock-from-v01-to-v07-in-4-days" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>The <a href="/en/2026/02/23/vibe-code-built-a-smart-image-indexer-with-ai-in-2-days-frank-sherlock/">Frank Sherlock post</a> covered the first 2 days and ~50 commits: the benchmark research, the Tauri app scaffold (Rust + React), the classification pipeline with Ollama, and the v0.1.0 release with binaries for Linux, macOS and Windows.</p>
<p>What the post didn&rsquo;t cover: the next 53 commits, in 4 days, that took the project from v0.3 to v0.7:</p>
<table>
  <thead>
      <tr>
          <th>Version</th>
          <th>Date</th>
          <th>Highlight</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>v0.4.0</td>
          <td>Feb 24</td>
          <td>Duplicate detection (SHA-256 + dHash), PDF password manager</td>
      </tr>
      <tr>
          <td>v0.5.0</td>
          <td>Feb 24</td>
          <td>Auto-update, scan performance 2x, checkpoint resume</td>
      </tr>
      <tr>
          <td>v0.6.0</td>
          <td>Feb 25</td>
          <td>Video support (11 formats), directory tree, FTS stemmer</td>
      </tr>
      <tr>
          <td>v0.7.0</td>
          <td>Feb 27</td>
          <td><strong>Face detection</strong> with native ONNX, person management</td>
      </tr>
  </tbody>
</table>
<h3>Face Detection: A Feature Born from Iteration<span class="hx:absolute hx:-mt-20" id="face-detection-a-feature-born-from-iteration"></span>
    <a href="#face-detection-a-feature-born-from-iteration" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>v0.1 was an image indexer with LLM-based classification. v0.7 has face detection with 512-dimensional embeddings, cosine-similarity clustering, and per-person search (<code>face:alice</code>). None of that was in the original plan. It came up because I started using the app with real photos and thought <em>&ldquo;I want to find every photo of so-and-so&rdquo;</em>. The feature was born from use, not from a spec.</p>
<p>The path to get there was methodical:</p>
<ol>
<li><strong><code>f34132f</code></strong>: A/B testing framework for face detection — benchmark before implementing, not after</li>
<li><strong><code>ef3be82</code></strong>: A/B results (SCRFD + ArcFace won)</li>
<li><strong><code>6d9174a</code></strong>: Native implementation with ONNX Runtime — no Python, no external dependency</li>
<li><strong><code>3b25eaf</code></strong>: Clustering, person management, complete FacesView</li>
<li><strong><code>a02d67f</code></strong>: Refactoring — extract helpers, delete dead code, share CSS</li>
</ol>
<p>Benchmark -&gt; implement -&gt; refactor. The same cycle as always. And something no one-shot prompt produces.</p>
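<p>The matching step itself is simple once the embeddings exist: cosine similarity against each known person, with a threshold for &ldquo;unknown face&rdquo;. A toy sketch in Ruby for brevity (Frank Sherlock itself is Rust, and uses 512-dimensional vectors rather than 3):</p>

```ruby
# Hypothetical sketch: match a face embedding to a known person by
# cosine similarity over embedding vectors.
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm = ->(v) { Math.sqrt(v.sum { |x| x * x }) }
  dot / (norm.(a) * norm.(b))
end

def match_person(embedding, people, threshold: 0.6)
  # people maps a name to a reference embedding; below the threshold,
  # the face stays unassigned instead of being forced onto someone.
  name, ref = people.max_by { |_, r| cosine_similarity(embedding, r) }
  cosine_similarity(embedding, ref) >= threshold ? name : nil
end
```

<p>The threshold value is the kind of knob you only tune by looking at real photos of real people, which is the whole point of this post.</p>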
<h3>Video Support: Another Emergent Feature<span class="hx:absolute hx:-mt-20" id="video-support-another-emergent-feature"></span>
    <a href="#video-support-another-emergent-feature" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>The app was built for images. But real media folders have videos. Commit <code>a67a2f9</code> added full support: scanning of 11 formats (MP4, MKV, AVI, MOV, WebM&hellip;), keyframe extraction for LLM classification, black frame skipping, .srt subtitle parsing for the full-text index, and inline preview with HTTP Range streaming.</p>
<p>1,626 lines added in one commit. An entire feature born from real use, not from a spec.</p>
<h3>7 Releases, 3 Platforms, Automatic AUR<span class="hx:absolute hx:-mt-20" id="7-releases-3-platforms-automatic-aur"></span>
    <a href="#7-releases-3-platforms-automatic-aur" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>In 7 days, Frank Sherlock had 7 releases. Each one with binaries compiled for Linux (AppImage), macOS (DMG with notarization and code signing), and Windows (MSI). The CI/CD pipeline (<code>release.yml</code>) does everything automatically. Starting at v0.5.0, an additional workflow (<code>aur-publish.yml</code>) publishes automatically to the AUR — the Arch Linux package repository.</p>
<p>Final tally: 17,210 lines of Rust, 10,863 lines of TypeScript, 621 tests (322 Rust + 299 frontend). Cross-platform. Releases published and installable by any user. One week, one dev, one AI agent.</p>
<p>The cross-platform CI/CD part is where the <em>&ldquo;works on my machine&rdquo;</em> fantasy dies. There were 17 commits just for builds: Windows UNC paths, macOS signing with notarization, Linux AppImage, release workflow permissions, the macOS Intel target removed. No LLM in the world knows that macOS requires specific entitlements for the hardened runtime, or that extended-length paths on Windows start with <code>\\?\</code>. That kind of deploy work demands someone who&rsquo;s been through that pain before.</p>
<h2>FrankMD: Real Open Source<span class="hx:absolute hx:-mt-20" id="frankmd-real-open-source"></span>
    <a href="#frankmd-real-open-source" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>FrankMD was the first project in this saga, a self-hosted Markdown editor with Rails 8. The <a href="/en/2026/02/01/frankmd-markdown-editor-vibe-code-part-1/">February posts</a> covered the build. What happened after is more interesting: other people started contributing.</p>
<p>14 commits since February 20. 3 external contributors. 4 PRs merged. v0.2.0 release on the 28th:</p>
<table>
  <thead>
      <tr>
          <th>PR</th>
          <th>Author</th>
          <th>Type</th>
          <th>What it did</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>#34</td>
          <td>@murilo-develop</td>
          <td>Bug fix</td>
          <td>Ollama integration: ModelNotFoundError + API base URL</td>
      </tr>
      <tr>
          <td>#36</td>
          <td>@rafaself</td>
          <td>Bug fix</td>
          <td>Table hint behavior in the editor (3 iterative commits)</td>
      </tr>
      <tr>
          <td>#38</td>
          <td>@LuccaRomanelli</td>
          <td>Feature</td>
          <td>Auto-sync theme with the Omarchy desktop environment</td>
      </tr>
      <tr>
          <td>#39</td>
          <td>@LuccaRomanelli</td>
          <td>Feature</td>
          <td>&ldquo;New Folder&rdquo; in the explorer&rsquo;s context menu</td>
      </tr>
  </tbody>
</table>
<h3>The Maintainer&rsquo;s Pattern<span class="hx:absolute hx:-mt-20" id="the-maintainers-pattern"></span>
    <a href="#the-maintainers-pattern" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>What&rsquo;s most telling isn&rsquo;t the contributions themselves, but what I did after every merge.</p>
<p>When @LuccaRomanelli submitted the Omarchy theme sync PR (+308 lines), I merged it and immediately committed <code>fix: harden Omarchy theme integration and fix broken tests</code> — <strong>+163 lines</strong> of fixes and tests in the next commit. The contributor implemented the feature. The maintainer hardened it.</p>
<p>When @rafaself submitted the table hint fix, there were 3 iterative commits the same day — <code>enhance</code>, <code>improve</code>, <code>streamline</code> — showing the same progressive refinement pattern I do with the AI agent. The next day, he sent a separate commit updating faraday, nokogiri and rack for security advisories.</p>
<p>On the 28th, I sat down and did everything in one shot: merged the 4 PRs, committed the hardening, updated 5 gems (brakeman, bootsnap, selenium-webdriver, web-console, mocha), and published the v0.2.0 release notes. A classic <em>&ldquo;release day&rdquo;</em>.</p>
<p>FrankMD today has 226 commits, 1,804 tests (425 Ruby + 1,379 JavaScript), and active contributors. It isn&rsquo;t <em>&ldquo;my little personal project&rdquo;</em> anymore. It became software with a community. And community doesn&rsquo;t show up for a test-less prototype that <em>&ldquo;works on my machine&rdquo;</em>. It shows up for a project with green CI, documentation, versioned releases, and code you can actually read.</p>
<h2>FrankMega: Even the Smallest Project Needs Post-Production<span class="hx:absolute hx:-mt-20" id="frankmega-even-the-smallest-project-needs-post-production"></span>
    <a href="#frankmega-even-the-smallest-project-needs-post-production" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>FrankMega was built in <a href="/en/2026/02/21/vibe-code-built-a-mega-clone-in-rails-in-1-day-frankmega/">1 day</a>. 26 commits, secure file sharing with Rails 8, 210 tests, deploy via Docker + Cloudflare Tunnel. Post published the same day. Done, right?</p>
<p>Three days later, 2 commits:</p>
<ul>
<li><code>db4bb705</code> — Add macOS, Linux, and Windows package MIME types to seed defaults</li>
<li><code>b0c4829a</code> — Fix MIME types to match Marcel detection, add normalizes strip</li>
</ul>
<p>Users tried to share <code>.dmg</code>, <code>.deb</code>, and <code>.msi</code> files. The MIME type detection (via the Marcel gem) didn&rsquo;t recognize those formats because they weren&rsquo;t in the seed defaults. Two commits. 22 lines. 15 minutes.</p>
<p>But without them, FrankMega couldn&rsquo;t be used to share installer packages — which is exactly the most common use case on a dev&rsquo;s home server.</p>
<blockquote>
  <p>No prompt in the world predicts that your users will want to share <code>.deb</code> files on day one.</p>

</blockquote>
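<p>The fix itself is almost embarrassingly small. A hypothetical sketch of its shape (the real seed data lives in the FrankMega repo; the MIME strings here are illustrative):</p>

```ruby
# Hypothetical sketch: seed the allowed-upload list with installer
# package types. The exact MIME strings are illustrative; the real fix
# normalized them to whatever the Marcel gem actually detects.
PACKAGE_MIME_SEEDS = {
  ".dmg" => "application/x-apple-diskimage",
  ".deb" => "application/vnd.debian.binary-package",
  ".msi" => "application/x-msi"
}.freeze

def allowed_upload?(filename, seeds = PACKAGE_MIME_SEEDS)
  seeds.key?(File.extname(filename).downcase)
end
```

<p>22 lines in the real commits. The hard part wasn&rsquo;t the code; it was learning which extensions users actually share.</p>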
<p>The simplest project of all, with the shortest post-production. And even then, it needed iteration.</p>
<h2>The Consolidated Numbers<span class="hx:absolute hx:-mt-20" id="the-consolidated-numbers"></span>
    <a href="#the-consolidated-numbers" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><table>
  <thead>
      <tr>
          <th>Project</th>
          <th>Commits (total)</th>
          <th>Tests</th>
          <th>Post-production</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>The M.Akita Chronicles</td>
          <td>335</td>
          <td>1,422</td>
          <td>56 commits, 10 days in production</td>
      </tr>
      <tr>
          <td>Frank Sherlock</td>
          <td>103</td>
          <td>621</td>
          <td>53 commits, 4 extra releases</td>
      </tr>
      <tr>
          <td>FrankMD</td>
          <td>226</td>
          <td>1,804</td>
          <td>14 commits, 3 contributors</td>
      </tr>
      <tr>
          <td>FrankMega</td>
          <td>28</td>
          <td>210</td>
          <td>2 commits, MIME fixes</td>
      </tr>
      <tr>
          <td><strong>Total</strong></td>
          <td><strong>692</strong></td>
          <td><strong>4,057</strong></td>
          <td><strong>125 post-publication commits</strong></td>
      </tr>
  </tbody>
</table>
<p>692 commits. 4,057 tests. 4 projects in production. February 2026. Me and an AI agent.</p>
<h2>Why One-Shot Prompting Is a Myth<span class="hx:absolute hx:-mt-20" id="why-one-shot-prompting-is-a-myth"></span>
    <a href="#why-one-shot-prompting-is-a-myth" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Look at the 125 post-production commits and tell me which one of them could have been predicted by a spec.</p>
<p>Gmail cuts off emails larger than 102KB, and you only find that out when you send the first long email. The Steam API returns dates with Portuguese abbreviations (<code>&quot;fev&quot;</code>, <code>&quot;abr&quot;</code>), and you only find that out when the parser breaks on generation Sunday. macOS users try to share <code>.dmg</code> files and can&rsquo;t because the MIME type isn&rsquo;t in the seed. Marvin opens every comment with <em>&ldquo;Ah,&rdquo;</em> and you only notice it after reading 30 in a row and wanting to claw your eyes out. The LLM model has to be swappable with 1 line of config, but it&rsquo;s spread across 24 files. Windows uses UNC paths starting with <code>\\?\</code>, and CI explodes in your face. And TTS in &ldquo;Auto&rdquo; mode? Generates an accent that sounds like Lisbon Portuguese trying to be carioca.</p>
<p>None of this is &ldquo;debugging&rdquo; in the traditional sense. It&rsquo;s navigating a problem space that only reveals itself when the software meets reality. Each one of these was a real-time decision made by someone with context.</p>
<p>The one-shot prompt fantasy is that, if you write a sufficiently detailed spec, the AI produces the perfect software. But the perfect spec would require you to know in advance everything that&rsquo;s going to go wrong. And if you knew in advance everything that&rsquo;s going to go wrong, you wouldn&rsquo;t need the spec — you&rsquo;d already have the software done in your head.</p>
<p>Good software is the result of hundreds of micro-decisions made with the system running. Not a single macro-decision made before the first line is written.</p>
<h2>The Role of the Experienced Developer<span class="hx:absolute hx:-mt-20" id="the-role-of-the-experienced-developer"></span>
    <a href="#the-role-of-the-experienced-developer" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>AI accelerated all of this. Without it, 692 commits in a month would be impossible for one person. But acceleration without direction is just faster entropy.</p>
<p>In every project there were decisions the AI couldn&rsquo;t have made on its own. Switching from Claude to Grok-4 because the previous model was weak in a specific domain. Benchmarking SCRFD against alternatives <em>before</em> implementing face detection, because the wrong choice would cost days. The +163 lines of hardening I committed immediately after merging the Omarchy PR, because I saw where it was going to break. The TTS revert from Auto mode the same day, because I know that &ldquo;Auto&rdquo; on a TTS model for Brazilian Portuguese will produce a mixed accent.</p>
<p>The circuit breaker case is the most illustrative. When I added rate limiting, the Brave Search API started returning 429s every once in a while. If I had asked, the agent would have implemented a retry with exponential backoff. But I didn&rsquo;t ask for retry. I asked for a circuit breaker:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-ruby" data-lang="ruby"><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">WebSearcher</span>
</span></span><span class="line"><span class="cl">  <span class="no">CIRCUIT_BREAKER_SECONDS</span> <span class="o">=</span> <span class="mi">120</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">  <span class="k">def</span> <span class="nc">self</span><span class="o">.</span><span class="nf">search</span><span class="p">(</span><span class="n">query</span><span class="p">,</span> <span class="ss">max_results</span><span class="p">:</span> <span class="mi">5</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="o">[]</span> <span class="k">if</span> <span class="n">circuit_open?</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">response</span> <span class="o">=</span> <span class="n">brave_search</span><span class="p">(</span><span class="n">query</span><span class="p">,</span> <span class="n">max_results</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">response</span><span class="o">.</span><span class="n">code</span> <span class="o">==</span> <span class="mi">429</span>
</span></span><span class="line"><span class="cl">      <span class="n">trip_circuit!</span>
</span></span><span class="line"><span class="cl">      <span class="no">Rails</span><span class="o">.</span><span class="n">logger</span><span class="o">.</span><span class="n">warn</span><span class="p">(</span><span class="s2">&#34;WebSearcher rate limited (429)&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">      <span class="k">return</span> <span class="o">[]</span>
</span></span><span class="line"><span class="cl">    <span class="k">end</span>
</span></span><span class="line"><span class="cl">  <span class="k">rescue</span> <span class="no">Net</span><span class="o">::</span><span class="no">OpenTimeout</span><span class="p">,</span> <span class="no">Net</span><span class="o">::</span><span class="no">ReadTimeout</span> <span class="o">=&gt;</span> <span class="n">e</span>
</span></span><span class="line"><span class="cl">    <span class="n">trip_circuit!</span>
</span></span><span class="line"><span class="cl">    <span class="o">[]</span>
</span></span><span class="line"><span class="cl">  <span class="k">end</span>
</span></span><span class="line"><span class="cl"><span class="k">end</span></span></span></code></pre></div></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>If the API returns 429 or times out, it stops trying for 120 seconds. No retry, no backoff, no queue. Why? Because I know the cron runs every day at 8am and that on newsletter generation Sundays the query volume triples. A herd of retries delaying everything is worse than empty results in one section.</p>
<p>The AI implements a circuit breaker when you ask it to. But it isn&rsquo;t going to ask to implement a circuit breaker. It doesn&rsquo;t have the operational context. That&rsquo;s knowledge that comes from running systems in production for decades.</p>
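<p>The state the excerpt leans on, <code>circuit_open?</code> and <code>trip_circuit!</code>, fits in a few lines. A minimal sketch, assuming a simple in-process timestamp (the production version may track state differently):</p>

```ruby
# Hypothetical sketch of the circuit state: one timestamp, no retries,
# no backoff. After any failure the circuit stays open for the cooldown
# window, and every call during that window returns immediately.
class CircuitState
  def initialize(cooldown_seconds)
    @cooldown = cooldown_seconds
    @tripped_at = nil
  end

  def trip!
    @tripped_at = Time.now
  end

  def open?
    return false unless @tripped_at
    @cooldown > (Time.now - @tripped_at)
  end
end
```

<p>No shared queue, no jitter math: when the 8am cron trips it, the run just skips web search for two minutes instead of piling up retries.</p>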
<blockquote>
  <p>The agent writes the code. I decide what code to write. And that decision requires experience no prompt can replace.</p>

</blockquote>
<h2>The Lessons<span class="hx:absolute hx:-mt-20" id="the-lessons"></span>
    <a href="#the-lessons" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p><strong>1. Software in production diverges from the spec in hours, not months.</strong> The first M.Akita Chronicles bugs showed up in the first real newsletter. FrankMega&rsquo;s MIME types broke on the first uploads. Anyone who thinks deploy means done is living in a world that doesn&rsquo;t exist.</p>
<p><strong>2. Post-production isn&rsquo;t &ldquo;maintenance&rdquo;.</strong> 56 commits in 10 days on the M.Akita Chronicles aren&rsquo;t patches. They&rsquo;re new features (Steam Gaming, preview, moods), architecture refactoring (LLM config centralization), security hardening (rate limiting). That&rsquo;s development. The software didn&rsquo;t stop evolving just because I said <em>&ldquo;done&rdquo;</em> in a blog post.</p>
<p><strong>3. TDD protects evolution.</strong> 99 new tests in 10 days. FrankMD&rsquo;s 1,804 tests let me merge 4 external contributor PRs without fear. Without tests, every merge is Russian roulette.</p>
<p><strong>4. Small releases keep you sane.</strong> 7 Frank Sherlock releases in 7 days. Green CI, compiled binaries, release notes. Something broke? Roll back one version. Compare that with &ldquo;6 months + big bang release&rdquo; and tell me which one works better.</p>
<p><strong>5. Community shows up for real projects.</strong> Nobody sends a PR to a test-less prototype that <em>&ldquo;works on my machine&rdquo;</em>. FrankMD got 3 contributors because it has green CI, documentation, and versioned releases.</p>
<p><strong>6. The developer&rsquo;s experience is the bottleneck, not the AI&rsquo;s speed.</strong> 692 commits in a month. But every commit that mattered required decades of experience to know it was needed. The AI types fast. I know what to type.</p>
<p><strong>7. One-shot is for demos. Iteration is for production.</strong> If the goal is a 10-minute video showing a <em>&ldquo;finished SaaS&rdquo;</em>, one-shot will do. If the goal is software that survives contact with real users, only iteration works. And sustainable iteration demands discipline: TDD, CI, small releases, continuous refactoring. No shortcut.</p>
<h2>To the Senior Still Sitting on Their Hands<span class="hx:absolute hx:-mt-20" id="to-the-senior-still-sitting-on-their-hands"></span>
    <a href="#to-the-senior-still-sitting-on-their-hands" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>So far I&rsquo;ve been hitting the amateurs who think they can fire off a prompt and the AI spits out a finished system. Fair enough. But there&rsquo;s another group that worries me just as much: the senior developer who saw the AI mess up three times, declared <em>&ldquo;this is useless&rdquo;</em>, and went back to doing everything by hand.</p>
<p>I get the reasoning. The AI hallucinates, generates code with subtle bugs, suggests over-engineered solutions. All true, and I documented every one of these problems in the previous posts. But tell me something: have you never done exactly the same thing? Have you never spent 2 hours reading documentation only to find out it was a typo in the config? The difference is that when the AI gets it wrong, you catch it in the tests and fix it in minutes. When you get it wrong on your own, it takes you the same amount of time to make the mistake and even longer to find it, because you trust your own code.</p>
<p>Let&rsquo;s look at the concrete numbers of what I shipped in February, one person and one AI agent:</p>
<table>
  <thead>
      <tr>
          <th>Project</th>
          <th>Time</th>
          <th>Result</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>M.Akita Chronicles</td>
          <td>8 days</td>
          <td>4 apps (Rails + Python + Hugo), 335 commits, 1,422 tests, 3 weeks in production</td>
      </tr>
      <tr>
          <td>Frank Sherlock</td>
          <td>7 days</td>
          <td>Tauri desktop app (Rust + React), 103 commits, 621 tests, 7 releases, binaries for 3 OSes</td>
      </tr>
      <tr>
          <td>FrankMD</td>
          <td>~4 days</td>
          <td>Rails 8 Markdown editor, 226 commits, 1,804 tests, 3 external contributors</td>
      </tr>
      <tr>
          <td>FrankMega</td>
          <td>1 day</td>
          <td>Rails 8 file sharing, 28 commits, 210 tests, Docker + Cloudflare deploy</td>
      </tr>
  </tbody>
</table>
<p>Now do the math in your head. How long would it take you to build Frank Sherlock on your own? A Tauri app from scratch, with an LLM classification pipeline in Rust, OCR via Python, full-text search with FTS5, duplicate detection by perceptual hash, face detection with native ONNX, video support with keyframe extraction, CI for 3 platforms with macOS code signing and notarization, auto-update, and automatic publishing to the AUR. With 621 tests. In Rust, which doesn&rsquo;t forgive.</p>
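<p>To make one item on that list concrete: duplicate detection by perceptual hash reduces to something like the difference hash (dHash) below, sketched here in Ruby over an already-scaled 9x8 grayscale thumbnail. The real implementation is in Rust with actual image decoding; this is just the idea:</p>

```ruby
# Difference hash (dHash): compare each pixel to its right neighbor in a
# 9x8 grayscale thumbnail, producing a 64-bit fingerprint. Near-duplicate
# images differ in only a few bits (small Hamming distance).
def dhash(gray) # gray: 8 rows x 9 columns of brightness values (0-255)
  bits = gray.flat_map { |row| row.each_cons(2).map { |l, r| l < r ? 1 : 0 } }
  bits.join.to_i(2) # 8 rows * 8 comparisons = 64 bits
end

# Number of differing bits between two hashes.
def hamming(a, b)
  (a ^ b).to_s(2).count("1")
end

img      = Array.new(8) { Array.new(9) { rand(256) } }
near_dup = img.map { |row| row.map { |px| [px + 3, 255].min } } # slight brightening
puts hamming(dhash(img), dhash(near_dup)) # small distance => likely duplicate
```

<p>Exact re-encodes, resizes, and small brightness shifts barely move the hash, which is why it catches duplicates that a byte-for-byte checksum misses.</p>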
<p>To be honest: without AI, a good senior dev would take at least 3-4 weeks on this, probably more. I did it in 7 days, alone, and every release is published with binaries anyone can download and install.</p>
<p>I already estimated The M.Akita Chronicles in the <a href="/en/2026/02/20/zero-to-post-production-in-1-week-using-ai-on-real-projects-behind-the-m-akita-chronicles/">previous post</a>: ~200 user stories. In Scrum with a senior team of 2-3 devs, no impediments, that would be 10-15 weeks. I did it in 8 days. Today, 3 weeks later, the system keeps running, evolving, with 99 more tests than when it <em>&ldquo;was done&rdquo;</em>.</p>
<p>FrankMega is more modest, but it&rsquo;s secure file sharing with Rails 8: I18n in 2 languages, 22 security issues fixed, Docker deploy, 210 tests. I did it in 1 day. A good senior, without AI, would take 1-2 weeks at best.</p>
<p>And I&rsquo;m not talking about disposable prototypes. People from outside send PRs to FrankMD. Real subscribers get the M.Akita Chronicles newsletter every Monday. Anyone can download Frank Sherlock from GitHub Releases or the AUR. Green CI, Brakeman clean, real tests, automated deploys. That&rsquo;s production software, not a conference demo.</p>
<p><em>&ldquo;Yeah, but I write better code without AI.&rdquo;</em> Maybe. But how long does it take you? What I showed here isn&rsquo;t that AI writes perfect code. Far from it. It&rsquo;s that with TDD, CI, pair programming with the agent, and continuous refactoring, the end result is production code with quality. 4,057 tests are there to prove it. Brakeman clean. 125 post-production commits show the code can take evolution without turning into a ball of mud.</p>
<p>And using AI here and there to generate a snippet of code, like glorified autocomplete, also isn&rsquo;t the answer. You&rsquo;re leaving 90% of the gain on the table. What I did in February was full-time pair programming with an agent. From the first commit to production deploy. With the same discipline I&rsquo;d use with a human pair. Result: 4 projects in production in 1 month, with quality I put my name on. Because I did put my name on it.</p>
<p>If you&rsquo;re a senior and you&rsquo;re still waiting for AI to <em>&ldquo;get better&rdquo;</em> before you really start using it, here&rsquo;s my message: it&rsquo;s already good enough. The 692 commits are there to prove it. The bottleneck now is you learning to work with it.</p>
<h2>Conclusion<span class="hx:absolute hx:-mt-20" id="conclusion"></span>
    <a href="#conclusion" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>In February 2026, I built 4 projects from zero to production with AI. But the build is the easy part. What separates real software from a demo is the 125 commits that came after, when the bugs nobody predicted appeared, when external contributors sent PRs that needed hardening, when new features came up from real use and not from a requirements spreadsheet.</p>
<p>AI made me absurdly more productive. Without it, Frank Sherlock wouldn&rsquo;t have face detection in 7 days. Without it, M.Akita Chronicles wouldn&rsquo;t be in its 3rd week of operation with 1,422 tests. The speed is real.</p>
<p>But none of the decisions that mattered came from the AI. Switching models, reverting the TTS the same day, looking at the contributor PR and seeing where it was going to break, asking for a circuit breaker instead of retry. All of that was me. The AI executed. The decisions were mine.</p>
<p>The AI is the accelerator. Extreme Programming techniques (TDD, small releases, pair programming, continuous refactoring) are the brake and the steering wheel. Without discipline, AI produces fast code that piles up technical debt even faster. With discipline, AI produces software that actually evolves, week after week.</p>
<p>692 commits. 4,057 tests. 4 projects in production. And tomorrow, Monday, at 7am, M.Akita Chronicles subscribers get the 3rd newsletter. Generated, reviewed, and sent by a system that will never be <em>&ldquo;done&rdquo;</em>. Because finished software is dead software.</p>
<blockquote>
  <p>&ldquo;Finished software is dead software. Live software iterates.&rdquo;</p>

</blockquote>
]]></content:encoded><category>themakitachronicles</category><category>frankmd</category><category>franksherlock</category><category>frankmega</category><category>agile</category><category>xp</category><category>extremeprogramming</category></item><item><title>RANT: Did Akita Bend Over for AI??</title><link>https://www.akitaonrails.com/en/2026/02/24/rant-akita-caved-to-ai/</link><guid isPermaLink="true">https://www.akitaonrails.com/en/2026/02/24/rant-akita-caved-to-ai/</guid><pubDate>Tue, 24 Feb 2026 11:54:26 GMT</pubDate><description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;: I have always &amp;ldquo;bent over&amp;rdquo; for AI. But if you only watched clips from the podcasts, I get why you are confused. Let me explain.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Anyone remember this video of mine: &lt;a href="https://www.akitaonrails.com/2023/09/20/akitando-145-16-linguagens-em-16-dias-minha-saga-da-rinha-de-backend/"&gt;&amp;ldquo;16 Languages in 16 Days&amp;rdquo;&lt;/a&gt;? It is from late 2023 (1 year after the first ChatGPT launched) and it is about the first Rinha de Backend. How do you think I wrote code in 16 languages in 16 days??&lt;/p&gt;</description><content:encoded><![CDATA[<blockquote>
  <p><strong>TL;DR</strong>: I have always &ldquo;bent over&rdquo; for AI. But if you only watched clips from the podcasts, I get why you are confused. Let me explain.</p>

</blockquote>
<p>Anyone remember this video of mine: <a href="/2023/09/20/akitando-145-16-linguagens-em-16-dias-minha-saga-da-rinha-de-backend/">&ldquo;16 Languages in 16 Days&rdquo;</a>? It is from late 2023 (1 year after the first ChatGPT launched) and it is about the first Rinha de Backend. How do you think I wrote code in 16 languages in 16 days??</p>
<p>I have been using AI to write code since early 2023 (3 years already). I did not start now: I never stopped. 🤷‍♂</p>
<p><em>&ldquo;But you said AI would never be any good, you hate AI&rdquo;</em></p>
<p>That is what you understood because you are lazy and only watch out-of-context clips or out-of-context tweets. If you read my blog, I have literally dozens of posts (particularly in the last 2 years) detailing every aspect of the evolution and use of AIs (whether for code, images, or 3D modelling).</p>
<h2>&ldquo;Hype?&rdquo;<span class="hx:absolute hx:-mt-20" id="hype"></span>
    <a href="#hype" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>And what about my famous line:</p>
<blockquote>
  <p><strong>&ldquo;Your hype about AI is inversely proportional to your knowledge of AI&rdquo;</strong> ?</p>

</blockquote>
<p>Still correct. Same as the people who used to say Low Code would replace every programmer. It will not. Same thing now: no matter how good AI gets, it is not going to replace every programmer. Only the bad ones, as I already explained in <a href="https://akitaonrails.com/2026/02/08/rant-ia-acabou-com-programadores/" target="_blank" rel="noopener">this other rant</a>.</p>
<p>Programmers like me have zero problem: we are the decision-makers who can decide what AIs cannot today and never will (it is an impossibility baked into the foundations of computing, I will come back to that).</p>
<p>The &ldquo;hyped-up&rdquo; crowd I talked about is every non-programmer startup bro who - with or without AI - throws garbage code into production and that is how things like the <strong>&ldquo;Tea&rdquo;</strong> app fiasco happen, remember?</p>
<p><a href="https://www.npr.org/2025/08/02/nx-s1-5483886/tea-app-breach-hacked-whisper-networks"target="_blank" rel="noopener"><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/frankmd/2026/02/tea.jpg" alt="tea leak"  loading="lazy" /></a></p>
<p>There was no &ldquo;hacking&rdquo; attempt at all; their database was already wide open to the public in production: anyone could just download it. That is a &ldquo;hyped about AI&rdquo; guy.</p>
<p>Or the morons who already fired all their programmers because they think AI code will be so much better. Spoiler alert: it will not.</p>
<p>So now you are even more confused. <em>&ldquo;You don&rsquo;t like AI then?&rdquo;</em> No, no, I like it. <em>&ldquo;But if you like it, it means it is already intelligent?&rdquo;</em> No, no. See how your line of reasoning makes no sense.</p>
<h2>&ldquo;Stochastic Parrot?&rdquo;<span class="hx:absolute hx:-mt-20" id="stochastic-parrot"></span>
    <a href="#stochastic-parrot" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Someone reminded me today that I also said:</p>
<blockquote>
  <p><strong>&ldquo;AI is just a glorified text generator&rdquo;</strong></p>

</blockquote>
<p>Confused? <em>&ldquo;How can it be good if it is just a text generator??&rdquo;</em></p>
<p>I really do not understand this line of reasoning. Text generators are objectively good. You and I use autocorrect in every email and messaging app every day. Whether it is the iPhone or Android keyboard. Whether it is Grammarly. None of you will say it has &ldquo;personality&rdquo; or even &ldquo;intelligence,&rdquo; but I think we all agree it is very useful. Even when it slips up now and then, most of the time it fixes our text just fine, right?</p>
<p>GPT or Claude or GLM or Gemini or DeepSeek, all of them are still &ldquo;stochastic parrots&rdquo; or &ldquo;glorified text generators.&rdquo;</p>
<ul>
<li>Stochastic parrot: something that repeats stuff using a random or partially random method (there is an element of entropy)</li>
<li>Text generators: they use probabilistic math (transformers on the complicated side, or Markov chains on the simple side) to try to figure out the most likely next word/token, given the preceding text.</li>
</ul>
<p>Every &ldquo;AI&rdquo; (more correctly &ldquo;LLM&rdquo;) has worked exactly like this from its launch in 2022 until today: gigabytes of high-dimensional matrices of weights and probabilities where your text/prompt is computed against those matrices to pick the &ldquo;next token.&rdquo; Concatenate that new token onto the original text, recompute everything again against the same matrices (more correctly, tensors), <strong>draw</strong> the next token from a small group of probabilities, concatenate, repeat, and so on.</p>
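<p>Strip away the transformer math and that loop is easy to show. A toy version in Ruby, with a hand-written bigram table standing in for the gigabytes of weights:</p>

```ruby
# Toy autoregressive text generator: given the last token, look up a
# probability distribution over next tokens, draw one, append, repeat.
TABLE = {
  "the" => { "cat" => 0.6, "dog" => 0.4 },
  "cat" => { "sat" => 0.7, "ran" => 0.3 },
  "dog" => { "sat" => 0.5, "ran" => 0.5 },
  "sat" => { "." => 1.0 },
  "ran" => { "." => 1.0 },
}

# Weighted random draw from a token => probability hash.
def sample(dist)
  r = rand
  dist.each { |tok, p| return tok if (r -= p) <= 0 }
  dist.keys.last # guard against floating-point leftovers
end

def generate(prompt, max_tokens: 10)
  tokens = prompt.split
  max_tokens.times do
    dist = TABLE[tokens.last] or break # no known continuation: stop
    tok = sample(dist)
    tokens << tok
    break if tok == "." # stop token
  end
  tokens.join(" ")
end

puts generate("the") # e.g. "the cat sat ."
```

<p>The draw is the &ldquo;stochastic&rdquo; part of the parrot: run it twice and you can get different sentences from the same prompt.</p>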
<p>That is a &ldquo;text generator.&rdquo; But yes, they are so good that they &ldquo;seem&rdquo; to have personality in their replies. And human beings are very easily fooled into &ldquo;anthropomorphizing&rdquo; non-human things. It does not take much.</p>
<p>Back in 2023, in the video <a href="/2023/06/19/akitando-142-entendendo-como-chatgpt-funciona-rodando-sua-propria-ia/">&ldquo;Understanding How ChatGPT Works&rdquo;</a> I explain the Turing Test and <a href="https://en.wikipedia.org/wiki/ELIZA" target="_blank" rel="noopener">&ldquo;Eliza&rdquo;</a>, one of the first &ldquo;conversational&rdquo; apps, which already passed the test without needing any AI at all.</p>
<p>That is why I said they are &ldquo;glorified,&rdquo; because everyone who has little knowledge of AI thinks of it the way the ape in the movie thinks of the monolith: as a divinity that must be glorified.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/frankmd/2026/02/2001-a-Space-Odyssey.jpg" alt="2001 a Space Odyssey"  loading="lazy" /></p>
<h2>&ldquo;But What About Flow?&rdquo;<span class="hx:absolute hx:-mt-20" id="but-what-about-flow"></span>
    <a href="#but-what-about-flow" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p><em>&ldquo;But Akita, on Flow you were against AIs.&rdquo;</em></p>
<p>Nope, again, you only watched clips. Look at how I open the conversation: I am talking about the hysteria around the <a href="https://ai-2027.com/" target="_blank" rel="noopener">&ldquo;AI 2027 Report&rdquo;</a>, I am talking about Sam Altman&rsquo;s nonsense that they were very close to hitting &ldquo;AGI&rdquo; (Artificial General Intelligence, an intelligence that, like a human, reasons and decides things on its own and is capable of evolving autonomously). And all the controversies about AIs coming up with a &ldquo;plan&rdquo; to kill their owner (what I call &ldquo;fanfics&rdquo;).</p>
<p>Then I spend hours explaining in detail the history and exactly how the &ldquo;next token calculation&rdquo; is done, as I also <a href="/en/2025/04/29/dissecting-an-ollama-modelfile-tuning-qwen3-for-code/">gave more details</a> here on the blog. And I always say there has to be some &ldquo;new breakthrough&rdquo; (that has not happened yet). As long as every new GPT or Claude is just an evolution of the previous one, there are limits. And I explained the limits.</p>
<p>It is not a &ldquo;matter of time.&rdquo; That does not exist. Even though we have never been to, say, Saturn, we know what it takes and more importantly: how much it costs. <strong>&ldquo;WITH TODAY&rsquo;S TECHNOLOGY&rdquo;</strong> - which is always the premise - there is no practical way, it does not make sense.</p>
<p>Tomorrow some &ldquo;new breakthrough&rdquo; may appear that changes everything. But until it shows up, we cannot &ldquo;count on it.&rdquo; And my entire explanation is based on that premise. And the conclusion is objective and mathematical:</p>
<blockquote>
  <p>AGI is not achievable. That has not changed.</p>

</blockquote>
<p>It does not mean current AIs are <strong>USELESS</strong>. That is the counter-conclusion whose reasoning I cannot understand either. On every podcast after the first one, I would come back saying: <em>&ldquo;The way I talk, some people seem to understand that I do not like AIs, but it is the opposite: I like them.&rdquo;</em></p>
<p>I got tired of repeating this on the podcasts, on the blog, on X, but that is the part everyone pretends is not there and <strong>OMITS</strong>. That is why I always put everything in writing here on the blog and, as I said, you can see I was already using it back in 2023 - 3 years ago.</p>
<blockquote>
  <p>In summary: EVERYTHING I said on the latest videos of my channel and on the podcasts <strong>STILL HOLDS</strong>. The explanation is still correct. What is wrong are YOUR CONCLUSIONS. Review them and interpret them literally and not subjectively, with the lack of knowledge you people have. Do not ignore the terms I used that you do not understand. The argument only makes sense if it is complete, and it cost me 4 hours to explain it in each video.</p>

</blockquote>
<h2>A Message to the &ldquo;Hyped&rdquo;<span class="hx:absolute hx:-mt-20" id="a-message-to-the-hyped"></span>
    <a href="#a-message-to-the-hyped" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>I said this on X the other day:</p>
<p><a href="https://x.com/AkitaOnRails/status/2025649633318555812"target="_blank" rel="noopener"><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/frankmd/2026/02/screenshot-2026-02-24_12-23-27.jpg" alt="agile vibe code"  loading="lazy" /></a></p>
<p>It does not matter if you want to call it &ldquo;Vibe Code,&rdquo; &ldquo;Agentic Engineering,&rdquo; &ldquo;AI Assisted Programming,&rdquo; it is all <strong>bullshit</strong>.</p>
<p>EVERYTHING now is &ldquo;programming with AI.&rdquo;</p>
<p>And there are only two ways to do it: the right way and the wrong way. The wrong way is the way you and the owner of the Tea app (probably, allegedly) do it: you let AI write the code, you have no clue how to review it or criticize it, you push it to production and let users use it, with no idea of the risks.</p>
<p>The right way: I spent TWO MONTHS writing more than TWENTY POSTS on this blog explaining the right way, which I called &ldquo;Agile Vibe Code,&rdquo; but it is basically <strong>&ldquo;Software Engineering applied to AI&rdquo;</strong> and guess what: you need to have studied and have experience to know it.</p>
<p>It is not in 2026, and it will not be in 2027 or anywhere near as soon as you think, that Claude or Codex &ldquo;replace ALL programmers.&rdquo; No, they WILL replace all the bad ones and that is excellent! Again, <a href="https://akitaonrails.com/2026/02/08/rant-ia-acabou-com-programadores/" target="_blank" rel="noopener">read this rant</a>.</p>
<p>You are never going to build a &ldquo;new Linux&rdquo; and &ldquo;replace Linus Torvalds&rdquo; with AI. You will manage to &ldquo;copy parts,&rdquo; sure. Copying is not enough. Bad programmers were already copying before, and it never made them good.</p>
<h2>A Message to the &ldquo;AI Haters&rdquo;<span class="hx:absolute hx:-mt-20" id="a-message-to-the-ai-haters"></span>
    <a href="#a-message-to-the-ai-haters" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Give up, this is a one-way street. Same as mobile, same as the internet, same as microcomputers, etc. Pandora&rsquo;s box is open, there is no closing it anymore. AIs are here to stay.</p>
<blockquote>
  <p><em>&ldquo;But you said AI was a bubble that was going to pop!&rdquo;</em></p>

</blockquote>
<p>Again, either you are playing dumb or you are dumb.</p>
<p>The 2001 Internet Bubble popped, but the Internet did not disappear. On the contrary, it grew! What disappeared were the companies that thought they would make easy money riding the bubble wave, and everyone who blindly believed in it. Still, whoever was an <strong>&ldquo;Internet Hater&rdquo;</strong> lost out too.</p>
<p>At this point, every anti-AI argument already sounds like a lousy excuse. Let&rsquo;s look at a few:</p>
<p><em>&ldquo;AI makes lots of mistakes, every now and then I see some half-assed code.&rdquo;</em></p>
<p>True, and that is exactly why I said not to get hyped for nothing. It makes mistakes, I catch them every day. Up until last year, the error vs. success rate was high enough that I could not recommend any amateur use it on a daily basis. Only someone who pays close attention and knows how to fix it - like me.</p>
<p>However, in 2026, that rate has improved significantly. It does in fact hit more than it misses now, enough for me to trust it without &ldquo;micro-managing&rdquo; every step.</p>
<p>The truth is every human programmer makes mistakes, and they make MANY. Whoever makes mistakes has confirmation bias and does not think they mess up that much. Only someone watching from the outside - someone like me, who has managed hundreds of programmers across dozens of projects - knows they screw up much more than they are prepared to admit.</p>
<p>I GOT TIRED of asking people to only push PRs with unit tests: everyone ignores it.</p>
<p>I GOT TIRED of asking people to at least run it once on their own machine to see if it works, instead of me pulling the PR and finding out it does not even run: everyone ignores it.</p>
<p>I GOT TIRED of asking people not to just copy and paste from stackoverflow, and at least adjust and adapt it for our project. Everyone ignores it.</p>
<p>I GOT TIRED of asking people not to dump everything into the same giant file and to refactor from time to time so technical debt does not pile up: everyone ignores it.</p>
<p>I GOT TIRED of asking people not to push more commits with messages saying &ldquo;bug fix,&rdquo; without explaining what is actually in there: everyone ignores it.</p>
<p>I GOT TIRED of asking people to remember to update the documentation when they change some feature, to make it easier for whoever is going to test it: everyone ignores it.</p>
<p>I GOT TIRED of asking people to cover a bug fix with a regression test so we would not see the same error again. The same error kept repeating: everyone ignores it.</p>
<p>I GOT TIRED of asking people to follow the conventions we agreed on in the project and not write different code with different patterns. Everyone ignores it.</p>
<p>I GOT TIRED of asking people to adjust deploy scripts, CI or things like that when there are pieces that affect the infra: everyone ignores it.</p>
<p>I GOT TIRED of asking people not to pile up a bunch of code that makes no sense together and not to <code>git add .</code> at the end and just write &ldquo;new feature&rdquo; and commit everything together. Everyone ignores it.</p>
<p>These are just a few examples. Know what is new: Claude and Codex do not ignore. Everything I just listed, which happens on EVERY project with humans, no matter the size of the project, does not happen to me anymore with Claude.</p>
<p>Understand: all of this should be the basics of the basics, intern level. But I GOT TIRED of asking seniors to be more careful, to not set a bad example for the juniors: EVERYONE IGNORES IT.</p>
<p>And now, all these people who WORE ME OUT, are precisely the ones who became &ldquo;AI Haters.&rdquo; But of course, the AI does everything they DO NOT. Look in the mirror and reflect on whether your code was really that good (Spoiler: it was not).</p>
<p>90% of all code produced is nowhere near something like &ldquo;optimization of the Linux memory management solution&rdquo; or &ldquo;bug fix for a performance regression in the file system drivers&rdquo; or &ldquo;improvement in the firewall security algorithm&rdquo; or &ldquo;rewriting this old Assembly part in C.&rdquo;</p>
<p>90% of most code produced day to day is <strong>MUNDANE</strong>, it is consuming an API, it is writing a front-end, it is one more report, one more CRUD, one more email validation, one more cleanup job, one more deploy script. Absolutely NOTHING actually &ldquo;interesting.&rdquo;</p>
<p>What did I like about LLMs? It REMOVES from my plate all the mundane tasks and lets me focus on the parts I like: research, forming hypotheses, benchmarks, a/b tests so I can make better decisions, integration tests that make sure the various parts of my system actually work. Everything I could not do before, because 90% of my time was spent fixing somebody&rsquo;s crappy CSS.</p>
<p>I HATE messing with CSS. It is about time I did not have to anymore.</p>
<p>I HATE writing CRUD. It is about time I did not have to anymore.</p>
<p>I HATE doing the initial dev environment setup for every different project. It is about time I did not have to anymore.</p>
<p>I HATE having to spec out idiotic things like mapping fields on a poorly designed fintech/bank API. It is about time I did not have to anymore.</p>
<p>I HATE having five hundred poorly designed front-end frameworks to stitch everything together and pray it works through trial and error. It is about time I did not have to anymore.</p>
<p>I HATE having to attend sprint meetings, where what was asked for keeps getting modified along the way because nobody paid attention. It is about time I did not have to anymore.</p>
<p>I HATE being blocked, having to wait around because another dev or another team is working on something that affects my side and in the end I will spend days later just fixing merge conflicts by hand. It is about time I did not have to anymore.</p>
<p>Who are the &ldquo;AI Haters&rdquo;: exactly everyone who used to block me before, the ones responsible for dragging out and delivering in bad quality all the MUNDANE tasks, and thinking they were doing something big.</p>
<h2>What History Has Taught Me<span class="hx:absolute hx:-mt-20" id="what-history-has-taught-me"></span>
    <a href="#what-history-has-taught-me" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Computers understand machine instructions. They do not give a damn about our &ldquo;favorite languages.&rdquo; That is what I spent hours explaining on the podcasts and in the videos on my channel.</p>
<p>Fuck your favorite language/framework. Fuck your favorite design pattern. At the end of the day, the machine only cares about the binary that is going to run.</p>
<p>Back in the day, it was extremely costly to feed those instructions to the machine. We had to literally key in, bit by bit, each instruction, at the exact address in memory. Whether with WIRES or with SWITCHES, ONE BIT AT A TIME:</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/frankmd/2026/02/7972f02c9ad7fe9bc30d1493b1295188.jpg" alt="eniac"  loading="lazy" /></p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/frankmd/2026/02/NV_0103_Driscoll_Large.jpg" alt="Altair 8800"  loading="lazy" /></p>
<p>Fortunately, things evolved and we improved the ways of &ldquo;INPUTTING&rdquo; instructions and data. Whether with punched paper or teletypes (electric typewriters adapted as dumb terminals):</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/frankmd/2026/02/s-l400.jpg" alt="punched card"  loading="lazy" /></p>
<p><a href="https://www.youtube.com/watch?v=zeL3mbq1mEg"target="_blank" rel="noopener"><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/frankmd/2026/02/screenshot-2026-02-24_12-55-27.jpg" alt="teletype"  loading="lazy" /></a></p>
<p>Programming was expensive, because it was VERY costly to &ldquo;fix&rdquo; an error. There was no easy &ldquo;backspace.&rdquo; The moment you went to type, you had to be VERY SURE of what you were going to type, without screwing up!</p>
<p>Even by the late 70s, early 80s, it had already evolved a lot, but having permanent storage was an OPTIONAL thing on most &ldquo;micro&rdquo; computers. We had to type the programs in from scratch to run them and when the machine shut down, whatever was in RAM was wiped and we lost everything. Recording was expensive, and one of the most popular options was recording to cassette tape:</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/frankmd/2026/02/c64set.jpg" alt="c64set"  loading="lazy" /></p>
<p>But to record/read just a measly 1 kilobyte of data, it could take MORE than 20 seconds. A little 10-kilobyte game (which is very little) would take almost 4 MINUTES to load.</p>
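<p>The arithmetic is easy to check. Assuming an effective rate of around 50 bytes per second (an assumption on my part; real machines and tape encodings varied):</p>

```ruby
# Tape-era load times at an assumed effective rate of ~50 bytes/second.
RATE = 50.0 # bytes per second (illustrative; real drives/encodings varied)

def load_time(kilobytes)
  kilobytes * 1024 / RATE # seconds
end

puts load_time(1)  # ~20 seconds for 1 KB
puts load_time(10) # ~205 seconds; slower encodings pushed a 10 KB load toward 4 minutes
```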
<p>You have no idea what 10 kilobytes means. Any PNG or SVG icon or any stupid little JS file on your website has way more than that.</p>
<p>All this to say the following:</p>
<ul>
<li>computers run machine instructions, to this day</li>
<li>it has always been costly to feed those instructions to the computer</li>
</ul>
<p>Nowadays we have NVMe SSDs that let me read 1 gigabyte today (&lt;70ms) faster than I used to read 10 kilobytes in the 80s.</p>
<p>But to compensate, our programs kept getting more and more complex and &ldquo;bloated.&rdquo; Once memory, storage, data bus, everything became orders of magnitude faster, we humans kept throwing more and more data at them. A text editor today does the same thing a text editor in the 80s did: edits text. But, of course, it has way more features: pretty fonts, smooth scrolling, auto-correct, auto-format, etc.</p>
<p>We traded resources for comfort, and that is a good thing.</p>
<p>I said in my videos, in the recent rant posts, that it was a very bad thing that the startup bubble only popped in 2022 (and its aftershocks are still going today). We traded programming efficiency for cheap bad programmers. Why bother trying to be more efficient if it was &ldquo;cheap&rdquo; to just throw more &ldquo;bodies&rdquo; at the problem? The same mentality that enabled the rise of cheap software sweatshops in the 2000s, cranking out bad software like there was no tomorrow.</p>
<p>Personally, I am very happy we finally took one more step toward feeding instructions to the machine more efficiently. LLMs, our stochastic parrots, are, in fact, the most efficient way to produce 90% of the instructions we need to give the machines so they can compute what we need. The same way I do not miss punched cards, teletypes, or cassette tape: if I no longer need an IDE in my day-to-day, I will not miss it either.</p>
<blockquote>
  <p>I did not choose to become a programmer to become an IDE operator. When I started, the concept of IDEs did not even exist.</p>

</blockquote>
<p>I chose to become a programmer because I like the idea of instructing a machine to compute the things I want. Whether it is a spreadsheet or a game. HOW those instructions are input is not the main thing for me. It never was. IDEs were just a small phase within decades of career and they will not be the last.</p>
<blockquote>
  <p>I do not understand the reasoning of the over-hyped or the over-haters. Why do you need to glorify a hammer? Why do you need to hate a hammer? That is all I needed to say 🤷‍♂</p>

</blockquote>
<h2>Some Idiotic Excuses<span class="hx:absolute hx:-mt-20" id="some-idiotic-excuses"></span>
    <a href="#some-idiotic-excuses" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p><em>&ldquo;But Anthropic (or OpenAI) are evil&rdquo;</em></p>
<p>Fuck that. Microsoft was too; it even went through an antitrust trial in the year 2000 that almost split the company. Even though I prefer Mac or Linux today, has Windows stopped being the most used operating system in the world? No.</p>
<p><em>&ldquo;But AI uses a lot of energy, what about the environment?&rdquo;</em></p>
<p>Fuck that. This was never a consensus (and most likely never will be and is not even a problem). The planet has already gone through 5 or 6 great extinction events before, the last one was a meteor that fell and wiped out the dinosaurs. We have gone through multiple Ice Ages. The planet is going to be fine, don&rsquo;t worry.</p>
<p>The problem of the last decade was the stupidity of not investing in more nuclear power plants - and now that option is finally back on the table. I said on Flow that Germany shutting down theirs was one of the dumbest decisions of all time, and I stand by it: it was.</p>
<p>Those are the two most common ones I can think of today. But that is how it is: anything is an excuse now. Pandora&rsquo;s box has been opened, there is no going back.</p>
]]></content:encoded><category>vibecode</category><category>rant</category></item><item><title>Vibe Code: I Built a Smart Image Indexer with AI in 2 Days | Frank Sherlock</title><link>https://www.akitaonrails.com/en/2026/02/23/vibe-code-built-a-smart-image-indexer-with-ai-in-2-days-frank-sherlock/</link><guid isPermaLink="true">https://www.akitaonrails.com/en/2026/02/23/vibe-code-built-a-smart-image-indexer-with-ai-in-2-days-frank-sherlock/</guid><pubDate>Mon, 23 Feb 2026 18:34:34 GMT</pubDate><description>&lt;p&gt;Over the last 48 hours, I built a complete desktop application from scratch, with published binaries for Linux, macOS, and Windows. 50 commits, ~26 hours of effective work, 8,359 lines of Rust, 5,842 of TypeScript, 338 automated tests. If you told me this 2 years ago, I would have called it a lie.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://github.com/akitaonrails/FrankSherlock/raw/master/docs/frank_sherlock.png" alt="frank sherlock" loading="lazy" /&gt;&lt;/p&gt;
&lt;p&gt;The name is &lt;a href="https://github.com/akitaonrails/FrankSherlock"target="_blank" rel="noopener"&gt;&lt;strong&gt;Frank Sherlock&lt;/strong&gt;&lt;/a&gt; — a local image cataloging and search system using AI. You point it at a folder (it can be a NAS with terabytes), it scans everything, classifies each file using a vision LLM running locally on your GPU, and gives you full-text search over the content. It&amp;rsquo;s not cloud, nothing is sent out, it runs 100% on your machine.&lt;/p&gt;</description><content:encoded><![CDATA[<p>Over the last 48 hours, I built a complete desktop application from scratch, with published binaries for Linux, macOS, and Windows. 50 commits, ~26 hours of effective work, 8,359 lines of Rust, 5,842 of TypeScript, 338 automated tests. If you told me this 2 years ago, I would have called it a lie.</p>
<p><img src="https://github.com/akitaonrails/FrankSherlock/raw/master/docs/frank_sherlock.png" alt="frank sherlock"  loading="lazy" /></p>
<p>The name is <a href="https://github.com/akitaonrails/FrankSherlock"target="_blank" rel="noopener"><strong>Frank Sherlock</strong></a> — a local image cataloging and search system using AI. You point it at a folder (it can be a NAS with terabytes), it scans everything, classifies each file using a vision LLM running locally on your GPU, and gives you full-text search over the content. It&rsquo;s not cloud, nothing is sent out, it runs 100% on your machine.</p>
<p>Here are some examples of text it extracted from some of my images:</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/frankmd/2026/02/screenshot-2026-02-23_16-08-03.png" alt="game boy example"  loading="lazy" /></p>
<p>And check out the Surya OCR details: it read the text on the Game Boy screen perfectly:</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/frankmd/2026/02/screenshot-2026-02-23_16-08-14.png" alt="gameboy screen ocr"  loading="lazy" /></p>
<p>More than that, I have directories of screenshots of payment receipts. I would never find anything in there again, but now:</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/frankmd/2026/02/screenshot-2026-02-23_16-07-37.png" alt="santander comprovante"  loading="lazy" /></p>
<p>It does make some mistakes, of course, with obscure anime (it keeps thinking everything it doesn&rsquo;t recognize is Evangelion 😂). But it surprisingly gets many of them right too. And the descriptions themselves already help a lot.</p>
<p>You&rsquo;ll need a reasonably capable GPU (I only tested on my 5090, but there are smaller models to download for smaller GPUs, and it theoretically supports AMD and Mac too, though I haven&rsquo;t tested that yet - I&rsquo;ll accept Issues and Pull Requests from anyone who wants to do beta-testing on Mac and Windows). You just need Ollama installed and running; optionally, Python (for Surya OCR, which is optional, but is the best).</p>
<p>I&rsquo;ll mark it as &ldquo;1.0&rdquo; when I have more people testing on Windows/macOS and I have the certificates to sign the executables properly. This is still a &ldquo;beta&rdquo; version! Compile and run it yourselves on your machines, everything is explained in the <a href="https://github.com/akitaonrails/FrankSherlock/blob/master/README.md"target="_blank" rel="noopener">README</a>.</p>
<p>And I did this with my <a href="2026/02/20/do-zero-a-pos-producao-em-1-semana-como-usar-ia-em-projetos-de-verdade-bastidores-do-the-m-akita-chronicles/">&ldquo;Agile Vibe Coding&rdquo;</a> — basically, programming in partnership with an LLM (in this case, Claude Code).</p>
<blockquote>
  <p><strong>Agile Vibe Coding works. And it works very well. But the idea is only 10% of the work. The other 90% is engineering.</strong></p>

</blockquote>
<p>Engineering requires experience, judgment, and knowing how to ask the right questions. The LLM is an excellent executor. But whoever decides what to execute, in what order, and why, that&rsquo;s still the developer&rsquo;s job.</p>
<p>There&rsquo;s a growing discourse that anyone with a good idea can build software now. In a certain sense, yes, you can get a prototype up fast. But the distance between &ldquo;runs on my machine&rdquo; and &ldquo;software that works on 3 operating systems, survives cancellations in the middle of processing, doesn&rsquo;t corrupt data, and scales from 94 test files to 500,000 in production&rdquo; is still enormous. That gap is engineering, and engineering still demands someone who knows what they&rsquo;re doing.</p>
<p>I&rsquo;ll tell the complete story: from initial research to release, going through benchmarks, proof of concept, architecture decisions, multi-platform CI/CD, and everything that sat between &ldquo;I have an idea&rdquo; and &ldquo;here&rsquo;s the AppImage, the DMG, and the MSI&rdquo;.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/frankmd/2026/02/screenshot-2026-02-23_15-47-30.png" alt="4 image preview"  loading="lazy" /></p>
<h2>The Original Idea<span class="hx:absolute hx:-mt-20" id="the-original-idea"></span>
    <a href="#the-original-idea" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>It all started with a simple question: can open-source vision LLMs actually classify content? I&rsquo;m not talking about &ldquo;woman on the beach&rdquo; — any model does that. I&rsquo;m talking about looking at an image and saying &ldquo;this is Ranma, from the anime Ranma 1/2 by Rumiko Takahashi, in a scene from the OVA The Battle for Miss Beachside&rdquo;. Are we at that level yet? (TL;DR: no, but close enough)</p>
<p>And if we are, can you build a smart file catalog? Something where I point at my NAS with terabytes of media accumulated over decades and can search by content, not by filename? Anyone with a home NAS knows: after a few years, files pile up and directory organization simply stops scaling. You know you have that 2019 payment receipt somewhere, but the file is called <code>IMG_20190315_142301.jpg</code> and it&rsquo;s in a directory with 3,000 other photos.</p>
<p>My hardware: AMD 7850X3D, RTX 5090, Arch Linux. Absolute restriction: no remote APIs, no OpenAI, no cloud. Everything open-source, everything local. If I&rsquo;m going to process terabytes of personal files, including financial documents and private photos, I don&rsquo;t want to send anything to third-party servers. Plus the cloud API cost for that volume would be prohibitive.</p>
<p>But first: research. Without knowing if the technology delivers what I need, there&rsquo;s no point building anything. Small scripts, quick prototypes, different models. See what works before writing the first line of code of the real app. This is a pattern I&rsquo;ve followed for years: validate the riskiest assumption first. If the vision LLM can&rsquo;t classify decently, everything else is a waste.</p>
<h2>A/B Research: Benchmark Driven Development<span class="hx:absolute hx:-mt-20" id="ab-research-benchmark-driven-development"></span>
    <a href="#ab-research-benchmark-driven-development" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>This is the part most people skip, and it&rsquo;s exactly where experience makes the difference. The temptation to jump straight into implementation is enormous. <em>&ldquo;I&rsquo;ll use model X because I read on a blog that it&rsquo;s good.&rdquo;</em> No. Before choosing model, framework, or architecture, I set up a <a href="https://github.com/akitaonrails/FrankSherlock/tree/master/_research_ab_test"target="_blank" rel="noopener">formal benchmark</a>.</p>
<p>I built a test corpus with 94 files: 60 images (photos, screenshots, anime, documents, receipts), 9 audios, 13 videos, and 12 documents. For each file, I created a ground truth in JSON with the correct classification — type, description, series (when applicable). That ground truth is what lets you measure real accuracy, not <em>&ldquo;I looked at the result and it seemed OK&rdquo;</em>.</p>
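<p>With that ground truth in hand, scoring becomes trivial. A minimal sketch of the idea (the field names are illustrative, not the benchmark&rsquo;s actual schema):</p>

```python
def type_accuracy(predictions: dict, ground_truth: dict) -> float:
    """Fraction of labeled files whose predicted 'type' matches the label.

    Both arguments map filename -> {"type": ...}; a file missing from
    `predictions` simply counts as a miss.
    """
    hits = sum(
        1
        for name, truth in ground_truth.items()
        if predictions.get(name, {}).get("type") == truth["type"]
    )
    return hits / len(ground_truth)
```

<p>The same shape works for series accuracy or any other labeled field; the point is that accuracy is computed against labels, not eyeballed.</p>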
<p>The benchmark has 6 phases, each answering a specific question:</p>
<ol>
<li><strong>Metadata</strong>: how much does it cost to extract basic metadata? (answer: cheap, 0.07s/file)</li>
<li><strong>Images</strong>: which vision model is best? Which OCR?</li>
<li><strong>Audio</strong>: does Whisper work? Which model size?</li>
<li><strong>Video</strong>: does frame-based classification work?</li>
<li><strong>Unified catalog</strong>: does the full integrated pipeline work?</li>
<li><strong>Cost projection</strong>: how much time and money to process a real NAS?</li>
</ol>
<h3>Phase 2: Vision — The Result Nobody Expected<span class="hx:absolute hx:-mt-20" id="phase-2-vision--the-result-nobody-expected"></span>
    <a href="#phase-2-vision--the-result-nobody-expected" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>I tested <code>qwen2.5vl:7b</code>, <code>llava:13b</code>, and <code>minicpm-v:8b</code> on 30 labeled images. The result:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th style="text-align: center">Type Accuracy</th>
          <th style="text-align: center">Series Accuracy</th>
          <th style="text-align: center">JSON Valid</th>
          <th style="text-align: center">Latency/img</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>qwen2.5vl:7b</td>
          <td style="text-align: center"><strong>0.80</strong></td>
          <td style="text-align: center"><strong>0.14</strong></td>
          <td style="text-align: center"><strong>0.87</strong></td>
          <td style="text-align: center">0.55s</td>
      </tr>
      <tr>
          <td>minicpm-v:8b</td>
          <td style="text-align: center">0.50</td>
          <td style="text-align: center">0.00</td>
          <td style="text-align: center">0.83</td>
          <td style="text-align: center">1.63s</td>
      </tr>
      <tr>
          <td>llava:13b</td>
          <td style="text-align: center">0.33</td>
          <td style="text-align: center">0.06</td>
          <td style="text-align: center">0.83</td>
          <td style="text-align: center">1.62s</td>
      </tr>
  </tbody>
</table>
<p>The 7B parameter model crushed the larger ones. It&rsquo;s not a typo. <code>qwen2.5vl:7b</code> beat <code>llava:13b</code> (almost twice its size) in every metric, and was also 3x faster. This contradicts the intuition of <em>&ldquo;bigger model = better model&rdquo;</em>. In practice, it depends on the task and the prompt.</p>
<p>Naturally, the next question is: what about the 32B? Same model, giant version. We should be able to get much more, right? Wrong:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th style="text-align: center">Type Accuracy</th>
          <th style="text-align: center">Series Accuracy</th>
          <th style="text-align: center">Latency/img</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>qwen2.5vl:7b</td>
          <td style="text-align: center">0.87</td>
          <td style="text-align: center">0.19</td>
          <td style="text-align: center"><strong>0.77s</strong></td>
      </tr>
      <tr>
          <td>qwen2.5vl:32b</td>
          <td style="text-align: center">0.87</td>
          <td style="text-align: center">0.25</td>
          <td style="text-align: center">22.46s</td>
      </tr>
  </tbody>
</table>
<p>The 32B gave +0.06 in <em>series accuracy</em> (literally 1 more hit out of 16 labeled items) and cost <strong>29x more time</strong>. For someone who&rsquo;s going to process hundreds of thousands of files, that trade-off doesn&rsquo;t pay off. 29x slower means a 6-hour job turns into a week-long job.</p>
<p>Here I&rsquo;ll make a comment about tooling: I did the first round with Claude Code and asked it to pick the models it thought were best. But then I decided to try GPT Codex and it made other suggestions I found interesting to test. In short: I&rsquo;ve been finding Codex much better for <strong>experimentation</strong> and <strong>exploratory code</strong>, for actual research. I find Claude better when we already know exactly what we want.</p>
<p>With Codex, I tested the new candidates <code>qwen3-vl:8b</code> and <code>qwen3-vl:30b-a3b</code> with 3 repetitions for statistical significance. The result? Both <em>worse</em> than <code>qwen2.5vl:7b</code> — <code>type_accuracy</code> of 0.55 versus 0.89 for the incumbent, with a 95% confidence interval that doesn&rsquo;t even come close. And even slower: 2x and 2.2x respectively. A newer model isn&rsquo;t always a better model for your use case. qwen3-vl frequently returned truncated or malformed JSON — a real regression in robustness.</p>
<h3>OCR: Surya vs Ollama Vision<span class="hx:absolute hx:-mt-20" id="ocr-surya-vs-ollama-vision"></span>
    <a href="#ocr-surya-vs-ollama-vision" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>For text extraction (scanned documents, receipts, screenshots with text), I tested two engines:</p>
<table>
  <thead>
      <tr>
          <th>Engine</th>
          <th style="text-align: center">Coverage</th>
          <th style="text-align: center">Similarity</th>
          <th style="text-align: center">Latency/img</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Surya</td>
          <td style="text-align: center">54/65 files</td>
          <td style="text-align: center">0.9455</td>
          <td style="text-align: center">8.15s</td>
      </tr>
      <tr>
          <td>Ollama Vision</td>
          <td style="text-align: center">38/65 files</td>
          <td style="text-align: center">0.9419</td>
          <td style="text-align: center">1.73s</td>
      </tr>
  </tbody>
</table>
<p>Surya finds text in many more files (83% vs 58%), but is 5x slower. When it finds text, the quality is practically the same (similarity &gt; 0.94 on both). Obvious solution: a hybrid approach. Use Surya when you need maximum coverage, with Ollama Vision as a fast fallback. The pipeline design became: try Surya → if it fails or doesn&rsquo;t find text → fall back to Ollama Vision. That&rsquo;s why I said at the beginning that Surya is optional: if you don&rsquo;t install it, the pipeline still works.</p>
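<p>The fallback logic itself is small enough to sketch in Python (here <code>surya</code> and <code>ollama_vision</code> are injected stand-ins for the real engines, not the project&rsquo;s actual API):</p>

```python
def extract_text(path: str, surya=None, ollama_vision=None) -> str:
    """Hybrid OCR sketch: prefer Surya (better coverage), fall back to
    Ollama Vision (faster) when Surya is absent, crashes, or finds nothing."""
    if surya is not None:
        try:
            text = surya(path)
            if text and text.strip():
                return text  # Surya found real text; done
        except RuntimeError:
            pass  # Surya blew up mid-run; fall through to the fast engine
    return ollama_vision(path)
```

<p>Pass <code>surya=None</code> and only the fast path runs, which is exactly the &ldquo;Surya is optional&rdquo; behavior.</p>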
<h3>Cost Projection<span class="hx:absolute hx:-mt-20" id="cost-projection"></span>
    <a href="#cost-projection" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Phase 6 did something I rarely see in open-source projects: cost projection for real usage. With the timings measured per file type, we extrapolated to NAS scenarios:</p>
<table>
  <thead>
      <tr>
          <th>Scenario</th>
          <th style="text-align: center">Files</th>
          <th style="text-align: center">Time</th>
          <th style="text-align: center">Electric Cost</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Test corpus</td>
          <td style="text-align: center">94</td>
          <td style="text-align: center">24 min</td>
          <td style="text-align: center">$0.02</td>
      </tr>
      <tr>
          <td>Small NAS</td>
          <td style="text-align: center">~5K</td>
          <td style="text-align: center">6.6 h</td>
          <td style="text-align: center">$0.36</td>
      </tr>
      <tr>
          <td>Medium NAS</td>
          <td style="text-align: center">~50K</td>
          <td style="text-align: center">2.6 days</td>
          <td style="text-align: center">$3.37</td>
      </tr>
      <tr>
          <td>Large NAS</td>
          <td style="text-align: center">~500K</td>
          <td style="text-align: center">26 days</td>
          <td style="text-align: center"><strong>$33.70</strong></td>
      </tr>
  </tbody>
</table>
<p>Take these numbers with some skepticism because it&rsquo;s back-of-napkin math. $34 in electricity to classify 500,000 files with a local GPU. Try doing that with the GPT-4 Vision API — at $0.01 per image (conservative), that&rsquo;s $5,000. The price of my setup (GPU + electricity) pays for itself on the first big use.</p>
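<p>The cloud-versus-local comparison is back-of-napkin math you can redo yourself:</p>

```python
# Sanity-checking the comparison above: 500K images through a cloud
# vision API at a conservative $0.01/image versus the local estimate.
files = 500_000
cloud_cost = files * 0.01          # -> $5,000
local_cost = 33.70                 # the phase-6 electricity projection
ratio = cloud_cost / local_cost    # roughly 148x cheaper to run locally
print(f"cloud: ${cloud_cost:,.0f}  local: ${local_cost}  ratio: {ratio:.0f}x")
```

<p>Even if the electricity estimate is off by a factor of 2 or 3, the gap is so wide the conclusion doesn&rsquo;t change.</p>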
<p>The lesson: <strong>benchmark, don&rsquo;t guess</strong>. I could have assumed the bigger model would be better, or that the newer one would beat the older. The data showed the opposite. ~2 hours of benchmarks saved me from wrong choices that would cost days of rework (in the old days I would have said &ldquo;weeks&rdquo;).</p>
<h2>Proof of Concept in Python<span class="hx:absolute hx:-mt-20" id="proof-of-concept-in-python"></span>
    <a href="#proof-of-concept-in-python" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>With the benchmarks in hand, the next step was to validate the complete pipeline before moving to Rust. Not the benchmark pipeline (which tests each model in isolation), but the real classification pipeline — the exact sequence of calls that the final app will make for each file.</p>
<p>I created a Python prototype — 754 lines in a single file (<a href="https://github.com/akitaonrails/FrankSherlock/blob/master/_classification/run_classification.py"target="_blank" rel="noopener"><code>classification/run_classification.py</code></a>) — implementing the winning strategy from the benchmarks.</p>
<p>The pipeline has 4 enrichment stages:</p>
<ol>
<li><strong>Primary classification</strong>: sends the image to <code>qwen2.5vl:7b</code> with a structured prompt, asks for JSON with type, description, tags, confidence</li>
<li><strong>Anime enrichment</strong>: if the primary type is anime/manga/cartoon, does a second pass with a specialized prompt asking for series, character, scene, artist</li>
<li><strong>Document/OCR</strong>: if it&rsquo;s a document or receipt, extracts text with Surya and/or Ollama Vision, then asks for structured data (dates, values, transaction IDs)</li>
<li><strong>Output</strong>: writes the result in YAML mirroring the directory structure of the source</li>
</ol>
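<p>The dispatch between those stages boils down to something like this (a sketch with injected stand-ins for qwen2.5vl:7b and Surya; field names are illustrative, not the PoC&rsquo;s exact schema):</p>

```python
def classify_file(path: str, vision, ocr) -> dict:
    """Run one file through the 4-stage enrichment pipeline (sketch).

    `vision(path, prompt=...)` returns a dict parsed from the LLM's JSON;
    `ocr(path)` returns extracted text. Both are injected stand-ins.
    """
    result = vision(path, prompt="primary")          # stage 1: base classification
    kind = result.get("type")
    if kind in ("anime", "manga", "cartoon"):
        result.update(vision(path, prompt="anime"))  # stage 2: series/character/scene
    elif kind in ("document", "receipt"):
        result["ocr_text"] = ocr(path)               # stage 3: OCR + structured data
    return result  # stage 4: caller writes this out as YAML mirroring the source tree
```

<p>The key design point is that the expensive second pass only runs when the cheap first pass says it&rsquo;s worth it.</p>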
<p>The most critical part is parsing the LLM&rsquo;s JSON. Anyone who has worked with LLM output knows they&rsquo;re&hellip; creative with formatting. Sometimes the JSON comes wrapped in markdown fences (<code>```json ... ```</code>). Sometimes there&rsquo;s text before and after it. Sometimes the JSON is almost right but missing a closing brace. The prototype implemented a 3-attempt cascade that later became the rule for the whole project:</p>
<ol>
<li><strong>Direct parse</strong>: <code>json.loads(response)</code> — works in ~70% of cases</li>
<li><strong>Brace-balancing extraction</strong>: finds the first <code>{</code>, counts opened and closed braces, extracts the substring — catches another ~20%</li>
<li><strong>Regex field salvage</strong>: if all else fails, uses regex to extract individual fields (&ldquo;type&rdquo;: &ldquo;&hellip;&rdquo;, &ldquo;description&rdquo;: &ldquo;&hellip;&rdquo;) — saves the last ~10%</li>
</ol>
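<p>The three attempts can be sketched like this (a simplification, not the PoC&rsquo;s exact code):</p>

```python
import json
import re

def parse_llm_json(response: str) -> dict:
    """Three-attempt cascade to salvage JSON from an LLM reply (sketch)."""
    # Attempt 1: direct parse -- works when the model behaved (~70%).
    try:
        return json.loads(response)
    except json.JSONDecodeError:
        pass
    # Attempt 2: brace-balancing extraction -- find the first '{', walk
    # forward until braces balance, parse that substring (~20%).
    # (Naive about braces inside quoted strings, but good enough here.)
    start = response.find("{")
    if start != -1:
        depth = 0
        for i, ch in enumerate(response[start:], start):
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    try:
                        return json.loads(response[start : i + 1])
                    except json.JSONDecodeError:
                        break
    # Attempt 3: regex field salvage -- pull individual "key": "value"
    # pairs out of whatever is left (~10%).
    return dict(re.findall(r'"(\w+)"\s*:\s*"([^"]*)"', response))
```

<p>Attempt 2 handles the markdown-fence case for free, since the fences sit outside the balanced braces.</p>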
<p>This cascade stayed in the Rust code practically identical. A well-done PoC shortens the path to production.</p>
<p>Result: 60 images processed, zero errors, 6.29s/image average. The pipeline worked end to end. Time to build the real app.</p>
<h2>Building the Tauri App<span class="hx:absolute hx:-mt-20" id="building-the-tauri-app"></span>
    <a href="#building-the-tauri-app" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>At this point, I tried to continue with Codex, but it choked trying to convert the Python PoC to Tauri. Since the choices had already been made, I went back to Claude Code, and it had no problem mapping from Python to Rust.</p>
<p>Here&rsquo;s where vibe coding shows what it can do. I&rsquo;ll tell the real timeline, commit by commit, so you get a sense of the rhythm. The timestamps are from git log, so they&rsquo;re accurate.</p>
<h3>Saturday 02/21 (19:29 → 21:08) — 6 commits, 1h39<span class="hx:absolute hx:-mt-20" id="saturday-0221-1929--2108--6-commits-1h39"></span>
    <a href="#saturday-0221-1929--2108--6-commits-1h39" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>It all started with <a href="https://github.com/akitaonrails/FrankSherlock/tree/master/_research_ab_test"target="_blank" rel="noopener">research</a>. Six commits of setup, benchmark scripts, results analysis:</p>
<div class="hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code">

<div><pre><code>f57221c 19:29 Phase 0: Project setup with uv, environment verification
4977497 19:45 Implement all phases: metadata, image/audio/video classification
af6e3aa 19:49 Add research results report
41d6c2a 20:12 Add per-file timing, OCR phase, cost estimation
0cd1b10 21:02 Fix Surya OCR, re-run on 94-file corpus
25b3ace 21:08 Add conclusions, cost analysis, recommended pipeline</code></pre></div><div class="hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0">
  <button
    class="hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50"
    title="Copy code"
  >
    <div class="hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4"></div>
<div class="hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4"></div>
  </button>
</div>
</div>
<p>Saturday night, pure research only. Not a single line of app code. But when I went to bed, I knew which model to use, which OCR approach, and how much it would cost in time and electricity.</p>
<h3>Sunday 02/22 (13:05 → 23:30) — 14 commits, ~10h25<span class="hx:absolute hx:-mt-20" id="sunday-0222-1305--2330--14-commits-10h25"></span>
    <a href="#sunday-0222-1305--2330--14-commits-10h25" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Sunday was the day of heavy construction. I woke up, had lunch, and sat down to work.</p>
<p><strong>13:05</strong> — First commit of the day: the Python classification prototype. 749 lines validating the complete classification pipeline, the JSON parsing cascade, and the conditional enrichment. This prototype wasn&rsquo;t throwaway — it was the executable design document for the Rust pipeline that would come later.</p>
<p>I took a break to go out, take a walk, have a glass of wine, then came back and continued:</p>
<p><strong>17:31</strong> — Benchmarks updated with the new candidates (qwen3-vl:8b and qwen3-vl:30b-a3b). Three repetitions each, confidence intervals. They confirmed that qwen2.5vl:7b was the right choice — not by a little, but by a huge margin.</p>
<p><strong>17:56</strong> — The &ldquo;big bang&rdquo;: Tauri app scaffold. <strong>9,631 lines inserted</strong> in a single commit. The entire app structure: Rust backend with SQLite + FTS5, React frontend, config system, file scanner, data models, query parser for natural search. At that point the app was already searching the database. It didn&rsquo;t have classification yet, but the foundation was solid — and that&rsquo;s exactly the point. The scaffold came with tests, TypeScript types, and the directory architecture I defined. Claude generated the code, but the module structure (config, db, scan, models, query_parser) came from how I wanted to organize responsibilities.</p>
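<p>The search core of a scaffold like that is surprisingly small. A sketch of FTS5 in action, in Python for brevity (table and column names are illustrative, not the app&rsquo;s actual schema; assumes your <code>sqlite3</code> was built with FTS5, as most are):</p>

```python
import sqlite3

# Full-text search over classification output with SQLite FTS5.
db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE files_fts USING fts5(rel_path, description, ocr_text)")
db.execute(
    "INSERT INTO files_fts VALUES (?, ?, ?)",
    ("receipts/IMG_2019.jpg", "Santander payment receipt, March 2019", "R$ 150,00"),
)
# MATCH searches every indexed column at once -- path, description, and OCR text.
rows = db.execute(
    "SELECT rel_path FROM files_fts WHERE files_fts MATCH ?", ("receipt",)
).fetchall()
```

<p>That one virtual table is what turns &ldquo;a directory with 3,000 photos&rdquo; into something you can actually query by content.</p>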
<p><strong>19:04</strong> — The heaviest commit of the project: <strong>4,186 lines</strong>. Classification pipeline in Rust (1,069 lines of <code>classify.rs</code> — practically the translation of the Python PoC), thumbnail generation (Lanczos3, 300px, JPEG 80%), incremental scanning with fingerprint, the Surya OCR Python script, runtime detection for Ollama, and the brutal expansion of the database with upsert, touch, delete, and FTS indexing. In one commit. With 47 tests. I had to make the architecture decisions (how the cache mirrors the rel_path, how the scan is divided into two phases, how errors propagate), but Claude wrote most of the code and the tests that came with it.</p>
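<p>An incremental-scan fingerprint can be as cheap as size plus mtime; here is one way to do it (my assumption about the scheme, the point being that you never re-read file contents just to detect changes):</p>

```python
import os

def fingerprint(path: str) -> str:
    """Cheap change-detection key: if size and mtime are both unchanged,
    skip re-classification -- no need to hash terabytes of content."""
    st = os.stat(path)
    return f"{st.st_size}:{st.st_mtime_ns}"
```

<p>On a rescan you compare the stored fingerprint against the fresh one and only re-run the expensive LLM pipeline on mismatches.</p>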
<p><strong>19:55</strong> — UI redesign, scan cancellation, auto-cleanup of orphan classifications, reorganization of the whole repo (moved everything from research to <code>_research_ab_test/</code>). CI and release workflow already configured in this commit — I knew I was going to need them the next day.</p>
<p><strong>21:02</strong> — DB resilience (WAL mode, backup, health check), root management (add, remove, list monitored directories), zoom, sidebar redesign with tree view, thumbnail fix. Four features in one commit. In a manual workflow, each one of these would be a separate PR with days of review.</p>
<p><strong>21:14 → 21:39</strong> — Read-only database mode for sandbox filesystems, resuming interrupted scans on startup, grid tiles redesign with hover overlay, selection model, infinite scroll. Three commits in 25 minutes. Claude was at an absurd pace.</p>
<p><strong>22:41</strong> — Multi-select with Ctrl/Shift click, collage preview with selected items, Ctrl+C copies to the OS clipboard.</p>
<p><strong>23:30</strong> — The big refactor: the monolithic frontend (everything in <code>App.tsx</code> - if you don&rsquo;t tell it otherwise, Claude always does this) broken into 15 components + 10 hooks + 84 frontend tests. This is the kind of thing that normally takes a full day of tedious work. With vibe coding, it was a one-hour commit. Claude extracted each component, created the hooks, set up the tests with proper mocks, and kept everything working. I just had to say <em>&ldquo;refactor this monolith into components and hooks, and write tests for each one&rdquo;</em>.</p>
<h3>Monday 02/23 (00:09 → 14:33) — 30 commits, ~14h<span class="hx:absolute hx:-mt-20" id="monday-0223-0009--1433--30-commits-14h"></span>
    <a href="#monday-0223-0009--1433--30-commits-14h" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Monday was polish, robustness, and the cross-platform marathon.</p>
<p><strong>00:09</strong> — Multi-OS abstraction: all platform-specific code isolated in the <code>platform/</code> module. This decision, taken early, saved hours of pain in CI. When Windows needed special treatment for UNC paths, the change stayed contained in <code>platform/process.rs</code> instead of spread across 8 files.</p>
<p><strong>00:12 → 01:18</strong> — Quick sequence: cargo audit in CI, duplicate removal, GPL-3.0 license, help dialog (F1) with query syntax examples, sort controls for results, formal SQLite migration system, context menu (copy, delete, rename). Six commits in just over an hour.</p>
<p><strong>09:38</strong> — After sleeping about 8 hours, the first commit of the morning: extraction of the LLM module. The monolith of Ollama calls that lived inside <code>classify.rs</code> was separated into <code>llm/client.rs</code> (HTTP calls, JSON parsing), <code>llm/management.rs</code> (download, listing, cleanup of models), and <code>llm/model_selection.rs</code> (hardware-aware selection by tier). Atomic file ops so cache doesn&rsquo;t get corrupted if the process dies in the middle of a write.</p>
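<p>The atomic-write pattern is worth spelling out, since it&rsquo;s what keeps the cache sane if the process dies mid-write. A Python sketch of the idea (the real code is Rust):</p>

```python
import os
import tempfile

def atomic_write(path: str, data: bytes) -> None:
    """Write to a temp file in the target's directory, fsync, then rename
    over the destination. os.replace is atomic on POSIX (and effectively
    so on Windows), so a crash mid-write leaves the old file intact
    instead of a torn half-written one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # make sure bytes hit disk before the rename
        os.replace(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)  # clean up the orphaned temp file
        raise
```

<p>The temp file must live in the same directory as the target, because rename is only atomic within one filesystem.</p>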
<p><strong>10:54</strong> — EXIF location extraction (GPS coordinates → readable address), metadata editing modal, setup instructions per OS (each OS has different dependencies for Ollama and Python).</p>
<p><strong>11:34</strong> — PDF support: scanning, indexing, and preview using PDFium. Not trivial — PDFium needs a native binary per platform, page rendering to image, blank page detection (so you don&rsquo;t generate a thumbnail of an empty cover), thumbnail assembly with the first 2 pages that have content, and native text extraction as a faster alternative to OCR.</p>
<p><strong>12:08</strong> — Albums and Smart Folders. Albums are manual collections (the user drags files in), Smart Folders are saved queries that appear in the sidebar and update automatically. Two database migrations, new sidebar component, drag-and-drop. In 34 minutes.</p>
<p><strong>12:11 → 12:52</strong> — macOS-inspired SVG icons in the sidebar, CLI argument support (<code>sherlock /path/to/folder</code>), copy description/OCR text in the context menu, PDFium path fix in production, icons and screenshots in the README, <code>tauri</code> script in npm, sidebar toggle, dynamic titlebar, Windows compilation fix (Unix-only imports). Ten commits in 41 minutes. Most of these were issues that showed up in CI or during manual testing.</p>
<p><strong>13:14 → 13:54</strong> — The CI fixes marathon. Auto-provision of the Surya OCR venv with progress bar in the SetupModal, icon regeneration with alpha channel, <code>cargo fmt</code> + clippy + UNC paths fix on Windows, tests for help dialog examples, individual folder rescan, Windows assertions fix, release workflow permissions fix. Seven hardening commits. Each one resolving a real bug that appeared in the CI matrix or in testing.</p>
<p><strong>14:26</strong> — Drag-and-drop to reorder roots in the sidebar, scan cancellation before deleting a root (so you don&rsquo;t leave the scan running in the background on a directory the user removed — a subtle edge case that could cause a crash).</p>
<p><strong>14:33</strong> — Last commit: responsiveness fix in scan cancellation. Check the cancel flag after each classification, immediate poll instead of waiting for the next tick. Small detail, big impact on UX.</p>
<h2>Architecture Decisions<span class="hx:absolute hx:-mt-20" id="architecture-decisions"></span>
    <a href="#architecture-decisions" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Here&rsquo;s what separates <em>&ldquo;letting the LLM write code&rdquo;</em> from &ldquo;building real software&rdquo;. None of these decisions came from a prompt. Claude didn&rsquo;t suggest any of them spontaneously. I had to ask for each one.</p>
<h3>Read-Only Principle<span class="hx:absolute hx:-mt-20" id="read-only-principle"></span>
    <a href="#read-only-principle" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>The app <strong>never</strong> writes to the scanned directories. Everything — thumbnails, classification cache, database — sits in <code>~/.local/share/frank_sherlock/</code>. This is more than good practice; it&rsquo;s respect for the user&rsquo;s data. If someone points at the company NAS, the app can&rsquo;t go around creating <code>.sherlock/</code> in every subdirectory. If the directory is mounted as read-only via NFS, the app needs to work normally. It sounds obvious, but many cataloging apps you know create caches and thumbnails <strong>inside</strong> the source directories. (cough Synology @eaDir cough)</p>
<h3>Incremental Scanning<span class="hx:absolute hx:-mt-20" id="incremental-scanning"></span>
    <a href="#incremental-scanning" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Scanning terabytes of data every time the app opens would be insane. The scan is incremental in two senses:</p>
<ol>
<li><strong>Discovery phase</strong> (fast): walks the filesystem comparing mtime + size of each file. If nothing changed since the last scan, it doesn&rsquo;t even read the content — just updates the &ldquo;seen in this scan&rdquo; marker. For a NAS with 500K files where 99% hasn&rsquo;t changed, this phase takes seconds, not hours.</li>
<li><strong>Processing phase</strong> (heavy): only for new or modified files. Calculates fingerprint (SHA-256 of the first 64KB), generates thumbnail, classifies with the LLM. And here&rsquo;s where move detection comes in: if a file changed path but the fingerprint is the same, the app preserves the entire classification already done and just updates the path. You reorganized 10,000 photos into new folders? The app detects and doesn&rsquo;t reclassify any of them.</li>
</ol>
<p>The checkpoint is per file. If the scan is interrupted (the app crashed, the user closed it, the power went out), the next time it resumes from the last processed file, not from zero. This is implemented via scan job persistence in the database: the scan cursor is saved in <code>scan_jobs</code>.</p>
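<p>The move-detection idea in the processing phase fits in a few lines. This is an illustrative sketch, not the app&rsquo;s code: the real fingerprint is SHA-256 of the first 64KB, replaced here by the stdlib hasher so the sketch needs no crates, and <code>reconcile</code> and the in-memory index are hypothetical names.</p>

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::fs::File;
use std::hash::{Hash, Hasher};
use std::io::Read;
use std::path::Path;

/// Hash of the first 64KB of a file. The real app uses SHA-256 here;
/// the stdlib DefaultHasher is a stand-in so this sketch has no crates.
fn fingerprint(path: &Path) -> std::io::Result<u64> {
    let mut buf = vec![0u8; 64 * 1024];
    let n = File::open(path)?.read(&mut buf)?;
    let mut hasher = DefaultHasher::new();
    buf[..n].hash(&mut hasher);
    Ok(hasher.finish())
}

/// Move detection: same fingerprint under a new path means the file
/// moved, so the classification is preserved and only the path updates.
fn reconcile(index: &mut HashMap<u64, String>, fp: u64, new_path: &str) -> &'static str {
    let status = match index.get(&fp) {
        Some(old) if old == new_path => "unchanged: skip",
        Some(_) => "moved: keep classification, update path",
        None => "new: classify",
    };
    if status != "unchanged: skip" {
        index.insert(fp, new_path.to_string());
    }
    status
}
```

<p>Reorganize 10,000 photos into new folders and every file lands in the &ldquo;moved&rdquo; branch: the expensive LLM classification is never repeated.</p>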
<h3>Cooperative Cancellation<span class="hx:absolute hx:-mt-20" id="cooperative-cancellation"></span>
    <a href="#cooperative-cancellation" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>The scan runs on a separate thread via <code>tokio::spawn_blocking</code>. To cancel, I use an <code>AtomicBool</code> shared between the scan thread and the frontend. The flag is checked:</p>
<ul>
<li>Before each file in the discovery phase</li>
<li>Before each classification in the processing phase</li>
<li>After each classification (in case the Ollama call takes a while)</li>
</ul>
<p>This ensures cancellation responds in at most the time of one classification (~1 second), not the time of the whole scan. Without this design, canceling a scan of 500K files could take minutes — or simply not work.</p>
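<p>A minimal sketch of the pattern, with illustrative names (the real scan runs under <code>tokio::spawn_blocking</code>; plain <code>std::thread</code> stands in here, and the sleep stands in for one classification call):</p>

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Cooperative cancellation: the worker checks a shared flag between
/// units of work, so cancel latency is bounded by one unit, never by
/// the length of the whole scan.
fn run_scan(total_files: usize, cancel: Arc<AtomicBool>) -> usize {
    let mut processed = 0;
    for _ in 0..total_files {
        // Check before each classification.
        if cancel.load(Ordering::Relaxed) {
            break;
        }
        thread::sleep(Duration::from_millis(10)); // stand-in for one classify() call
        processed += 1;
        // Check again right after, in case the model call took a while,
        // instead of waiting for the next loop tick.
        if cancel.load(Ordering::Relaxed) {
            break;
        }
    }
    processed
}
```

<p>Usage is the frontend flipping the flag: spawn the worker, later call <code>cancel.store(true, Ordering::Relaxed)</code>, and the scan stops within one work unit instead of grinding through the remaining files.</p>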
<h3>Database Resilience<span class="hx:absolute hx:-mt-20" id="database-resilience"></span>
    <a href="#database-resilience" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>SQLite with WAL mode (allows concurrent reads during writes), health check on startup, automatic backup before migrations, formal migration system via <code>rusqlite_migration</code>. Five migrations in total:</p>
<ol start="0">
<li>Initial schema (files, roots, scan_jobs tables, and the virtual FTS5 table for search)</li>
<li><code>location_text</code> column for EXIF address</li>
<li>FTS index rebuild (needed after changing tokenization)</li>
<li><code>albums</code> and <code>album_files</code> tables for manual collections</li>
<li><code>smart_folders</code> table for saved queries</li>
</ol>
<p>Migrations are identified by position and can never be edited or reordered after being published. This is the kind of rule you learn after corrupting someone&rsquo;s production database once. The rule is coded in the project&rsquo;s CLAUDE.md so future vibe coding sessions don&rsquo;t violate it.</p>
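<p>The positional contract is easy to show without the actual crate. This is a stand-in sketch (the app uses <code>rusqlite_migration</code>; here SQL execution is replaced by a Vec so the append-only rule is visible, and all names are illustrative):</p>

```rust
/// Minimal stand-in for positional migrations. In the real app this is
/// `rusqlite_migration` driving SQLite; field names are illustrative.
struct Db {
    schema_version: usize,      // SQLite stores this as `user_version`
    applied: Vec<&'static str>, // stand-in for "this SQL was executed"
}

fn migrate(db: &mut Db, migrations: &[&'static str]) {
    // Only entries past the stored version run. Published entries must
    // never be edited or reordered, or the version number stops meaning
    // anything for databases already in the wild.
    while db.schema_version < migrations.len() {
        db.applied.push(migrations[db.schema_version]);
        db.schema_version += 1;
    }
}
```

<p>A database at version 1 that meets a binary shipping three migrations runs only entries 1 and 2; entry 0 is assumed done, which is exactly why it can never change.</p>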
<h3>Hardware-Aware Model Selection<span class="hx:absolute hx:-mt-20" id="hardware-aware-model-selection"></span>
    <a href="#hardware-aware-model-selection" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Not everyone has an RTX 5090. The app detects the GPU on startup and picks the appropriate model:</p>
<ul>
<li><strong>Weak GPU or no GPU</strong>: <code>qwen2.5vl:3b</code> (small tier — runs on anything)</li>
<li><strong>GPU with &gt;= 6GB VRAM</strong>: <code>qwen2.5vl:7b</code> (medium tier, the benchmark default)</li>
<li><strong>Apple Silicon with &gt;= 48GB unified</strong>: <code>qwen2.5vl:32b</code> (large tier, only where unified memory allows without swap)</li>
</ul>
<p>Detection uses <code>nvidia-smi</code> on Linux/Windows, <code>system_profiler</code> on macOS, and <code>sysinfo</code> as fallback for system RAM. The result is cached in the Tauri <code>AppState</code> so it doesn&rsquo;t keep running subprocesses all the time.</p>
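<p>The tier logic maps to a small function. Thresholds and model names are the ones just listed; the <code>Hardware</code> struct and the detection plumbing behind it are illustrative:</p>

```rust
/// Sketch of hardware-aware tier selection. The real app fills this
/// struct from nvidia-smi / system_profiler / sysinfo probes.
struct Hardware {
    nvidia_vram_gb: u32,
    apple_unified_gb: u32,
}

fn pick_model(hw: &Hardware) -> &'static str {
    if hw.apple_unified_gb >= 48 {
        "qwen2.5vl:32b" // large tier: only where unified memory avoids swap
    } else if hw.nvidia_vram_gb >= 6 {
        "qwen2.5vl:7b" // medium tier: the benchmark default
    } else {
        "qwen2.5vl:3b" // small tier: runs on anything
    }
}
```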
<h2>Multi-OS and CI/CD<span class="hx:absolute hx:-mt-20" id="multi-os-and-cicd"></span>
    <a href="#multi-os-and-cicd" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>If there&rsquo;s one part of the project that justifies having experience, it&rsquo;s this one. Making software that compiles is easy. Making software that compiles and passes all tests on Linux, macOS, and Windows at the same time teaches you humility.</p>
<h3>Platform Module<span class="hx:absolute hx:-mt-20" id="platform-module"></span>
    <a href="#platform-module" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>All OS-specific code lives in <code>src-tauri/src/platform/</code>:</p>
<ul>
<li><code>gpu.rs</code>: GPU detection (NVIDIA via nvidia-smi, AMD via sysfs/rocm-smi, Apple via system_profiler)</li>
<li><code>clipboard.rs</code>: image copy to clipboard (xclip on Linux, pbcopy on macOS, PowerShell on Windows)</li>
<li><code>python.rs</code>: Python location (python3 vs python in PATH), venv paths per OS</li>
<li><code>process.rs</code>: subprocess execution abstraction with output handling</li>
</ul>
<p>This means <code>classify.rs</code>, <code>scan.rs</code>, <code>thumbnail.rs</code> — none of them know which OS they&rsquo;re running on. They ask the platform and the platform resolves it. When Windows needed special treatment for UNC paths (those that start with <code>\\?\</code>), the change stayed contained in <code>platform/</code>. When macOS needed a conditional import, same thing. The rest of the codebase wasn&rsquo;t touched.</p>
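<p>For the clipboard case the boundary looks roughly like this. The command names are the ones listed above; the function shapes are illustrative, not the app&rsquo;s actual API (the real module gates implementations with <code>#[cfg(target_os = ...)]</code> and shells out to the resolved command):</p>

```rust
use std::env::consts::OS;

/// Sketch of the platform boundary: callers ask for a capability,
/// the platform module answers with the OS-specific resolution.
fn clipboard_image_cmd(os: &str) -> Option<&'static str> {
    match os {
        "linux" => Some("xclip"),        // pipe image data to the X clipboard
        "macos" => Some("pbcopy"),       // macOS clipboard tooling
        "windows" => Some("powershell"), // Set-Clipboard via PowerShell
        _ => None,                       // unsupported: caller degrades gracefully
    }
}

fn current_clipboard_cmd() -> Option<&'static str> {
    clipboard_image_cmd(OS) // std reports "linux", "macos", or "windows"
}
```

<p>The payoff is that <code>classify.rs</code> and friends call <code>current_clipboard_cmd()</code> and never mention an OS by name.</p>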
<h3>GitHub Actions Matrix<span class="hx:absolute hx:-mt-20" id="github-actions-matrix"></span>
    <a href="#github-actions-matrix" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Two workflows:</p>
<ul>
<li><strong>CI</strong> (push + PR): build and test on Linux, macOS, and Windows. Each push runs <code>cargo test</code> + <code>npm test</code> on the 3 OSes. Includes <code>cargo fmt --check</code>, <code>cargo clippy -- -D warnings</code>, and <code>cargo audit</code>. If any platform fails, the PR doesn&rsquo;t pass.</li>
<li><strong>Release</strong> (tags v*): build via <code>tauri-action</code>, generates AppImage (Linux), DMG (macOS arm64), MSI (Windows), and creates a Draft Release on GitHub with the binaries attached.</li>
</ul>
<p>The 10+ CI fix commits on Monday morning were the least glamorous part of the project. Things like:</p>
<ul>
<li><code>#[cfg(not(target_os = &quot;windows&quot;))]</code> on imports that use <code>std::os::unix::fs::PermissionsExt</code></li>
<li><code>dunce::canonicalize</code> instead of <code>std::fs::canonicalize</code> because Windows generates paths with <code>\\?\</code> prefix that break string comparisons</li>
<li>Install <code>rustfmt</code> and <code>clippy</code> explicitly on the runner because they don&rsquo;t always come in the default GitHub Actions toolchain</li>
<li>Remove the macOS Intel target from the release workflow (Apple Silicon only — not worth the cost of maintaining two Mac targets)</li>
</ul>
<p>Nobody posts these things on X. But without them, your app doesn&rsquo;t build on 2 of the 3 targets.</p>
<h2>External Integrations<span class="hx:absolute hx:-mt-20" id="external-integrations"></span>
    <a href="#external-integrations" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>The app depends on 3 external systems. Each one brought its own problems.</p>
<h3>Ollama<span class="hx:absolute hx:-mt-20" id="ollama"></span>
    <a href="#ollama" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Ollama serves the vision models via local REST API (port 11434). The app does:</p>
<ul>
<li><strong>Status check</strong>: checks if Ollama is running and lists installed/loaded models</li>
<li><strong>Model download</strong>: if the recommended model isn&rsquo;t installed, offers download with a progress bar via API streaming</li>
<li><strong>Generation</strong>: sends image in base64 + prompt, receives JSON (with the 3-level parsing cascade)</li>
<li><strong>Cleanup</strong>: unloads models from VRAM when not classifying, so it doesn&rsquo;t monopolize the GPU from the user&rsquo;s other applications</li>
</ul>
<p>Ollama is the only hard requirement. Without it, classification doesn&rsquo;t work. The SetupModal guides the user through installation and model download.</p>
<h3>Surya OCR<span class="hx:absolute hx:-mt-20" id="surya-ocr"></span>
    <a href="#surya-ocr" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Surya is a Python OCR engine that runs locally. The problem: the app is Rust and can&rsquo;t depend on a system Python installation. The solution:</p>
<ul>
<li>The app maintains an isolated Python venv at <code>~/.local/share/frank_sherlock/surya_venv/</code></li>
<li>The <code>surya_ocr.py</code> script is bundled as a Tauri resource (packaged in the binary)</li>
<li>On first use, the <code>SetupModal</code> offers to automatically provision the venv (finds Python, creates venv, pip install surya-ocr + dependencies)</li>
<li>Classification calls the script via subprocess, passes the image as argument, reads the extracted text from stdout</li>
</ul>
<p>Surya is a <strong>soft requirement</strong>: if it&rsquo;s not installed, the app works normally — it just won&rsquo;t have dedicated OCR. The pipeline gracefully degrades to use Ollama Vision as fallback, which is worse in coverage but works. The user sees a warning in setup, not an error that blocks usage.</p>
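<p>The soft-requirement decision is a tiny fallback chain. A sketch, with illustrative names rather than the app&rsquo;s actual types:</p>

```rust
/// Sketch of graceful degradation for OCR: prefer Surya when its venv
/// exists, otherwise fall back to Ollama Vision with a warning, never
/// a hard error that blocks usage.
#[derive(Debug, PartialEq)]
enum OcrBackend {
    Surya,        // dedicated OCR, best coverage
    OllamaVision, // fallback: worse coverage, but always available
}

fn pick_ocr(surya_venv_present: bool) -> (OcrBackend, Option<&'static str>) {
    if surya_venv_present {
        (OcrBackend::Surya, None)
    } else {
        // Surfaced as a setup warning, not a blocking error.
        (
            OcrBackend::OllamaVision,
            Some("Surya not installed; falling back to Ollama Vision for OCR"),
        )
    }
}
```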
<h3>PDFium<span class="hx:absolute hx:-mt-20" id="pdfium"></span>
    <a href="#pdfium" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>For PDFs, I needed native text extraction and page rendering for thumbnails. PDFium is Chrome&rsquo;s PDF engine, and has Rust bindings via <code>pdfium-render</code>.</p>
<p>The PDFium binary is downloaded by a script (<code>scripts/download-pdfium.sh</code>) and bundled via <code>lib/</code> in Tauri resources. Each platform gets the correct binary (.so, .dylib, .dll). <code>lib/</code> is gitignored — the binaries are downloaded at build time, not versioned.</p>
<p>The PDF pipeline:</p>
<ol>
<li>Tries to extract native text (no OCR) — many PDFs already have a text layer</li>
<li>If there&rsquo;s enough text, uses it directly for indexing (faster and more accurate than OCR)</li>
<li>If not, renders the page and sends it to the image pipeline (Ollama Vision)</li>
<li>Detects blank pages, finds the first page with real content</li>
<li>Generates thumbnail as a montage of the first 2 pages with content (gives a better sense of the document than just the cover)</li>
</ol>
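<p>Steps 1 through 4 boil down to a per-page decision that can be sketched like this. The &ldquo;enough text&rdquo; threshold is an assumption of mine, and the names are illustrative:</p>

```rust
/// Sketch of the per-page decision in the PDF pipeline: prefer the
/// native text layer, fall back to rendering for the vision model.
enum PageAction {
    UseNativeText(String), // index directly: faster and more accurate than OCR
    RenderForVision,       // rasterize and send to the image pipeline
    SkipBlank,             // don't thumbnail or index an empty page
}

fn plan_page(native_text: &str, is_blank: bool) -> PageAction {
    const MIN_CHARS: usize = 32; // assumed threshold for "enough text"
    let text = native_text.trim();
    if is_blank {
        PageAction::SkipBlank
    } else if text.chars().count() >= MIN_CHARS {
        PageAction::UseNativeText(text.to_string())
    } else {
        PageAction::RenderForVision
    }
}
```

<p>Scanned PDFs with no text layer fall through to the second branch and get treated exactly like images; born-digital PDFs never touch OCR at all.</p>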
<h2>What Agile Vibe Coding Really Is<span class="hx:absolute hx:-mt-20" id="what-agile-vibe-coding-really-is"></span>
    <a href="#what-agile-vibe-coding-really-is" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>OK, now the point that really matters. The reason I&rsquo;m writing this.</p>
<h3>What Claude Did<span class="hx:absolute hx:-mt-20" id="what-claude-did"></span>
    <a href="#what-claude-did" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><ul>
<li>Wrote most of the Rust and TypeScript code</li>
<li>Generated 166 Rust tests and 172 frontend tests</li>
<li>Implemented JSON parsing with 3 fallback levels</li>
<li>Set up CI/CD with 3-OS matrix</li>
<li>Did massive refactors (monolith → 15 components + 10 hooks)</li>
<li>Handled edge cases of encoding, Unicode paths, and exotic file formats</li>
<li>Wrote SQL queries, migrations, FTS5 indexes</li>
<li>Implemented GPU detection, clipboard per OS, Python resolution</li>
<li>Created the setup flow with progress bar and model download</li>
<li>Debugged and fixed dozens of cross-platform compilation issues</li>
</ul>
<p>The speed is hard to describe without sounding like exaggeration. Saturday&rsquo;s 19:04 commit, the one with 4,186 lines and 47 tests, took about an hour including my review. A human dev, even a good one, would take a full day to write that with the same test coverage.</p>
<h3>What I Did<span class="hx:absolute hx:-mt-20" id="what-i-did"></span>
    <a href="#what-i-did" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><ul>
<li>Decided to do benchmarks before writing any app code</li>
<li>Chose Tauri over Electron (smaller footprint, native Rust, no Node runtime in prod)</li>
<li>Defined the read-only principle as an inviolable rule</li>
<li>Designed the incremental scan with move detection via fingerprint</li>
<li>Decided on cooperative cancellation with AtomicBool</li>
<li>Demanded formal schema migration (no ad-hoc ALTER TABLE in loose scripts)</li>
<li>Insisted on platform abstraction from the first multi-OS commit</li>
</ul>
<p>And mainly: I asked the annoying questions. <em>&ldquo;What if the scan is canceled in the middle?&rdquo;</em> became the checkpoint system. <em>&ldquo;What if the database corrupts?&rdquo;</em> became WAL + backup + health check. <em>&ldquo;What if the person doesn&rsquo;t have a good GPU?&rdquo;</em> became tier model selection. <em>&ldquo;What if Surya isn&rsquo;t installed?&rdquo;</em> became a soft requirement with fallback. <em>&ldquo;What if the user deletes a root that&rsquo;s in the middle of a scan?&rdquo;</em> became cancel-before-delete. <em>&ldquo;What if the file moves but the content is the same?&rdquo;</em> became move detection.</p>
<p>Oh, and I decided when to stop adding features and publish.</p>
<p>That last one is underrated. The temptation to keep adding &ldquo;just one more thing&rdquo; is enormous when the marginal cost of implementing is low. Claude implements any feature I ask for in minutes. But software that&rsquo;s never published is useless to anyone. Knowing when to stop is a skill no LLM will give you.</p>
<p><img src="https://new-uploads-akitaonrails.s3.us-east-2.amazonaws.com/frankmd/2026/02/screenshot-2026-02-23_16-37-22.jpg" alt="github releases"  loading="lazy" /></p>
<h3>The Real Pattern<span class="hx:absolute hx:-mt-20" id="the-real-pattern"></span>
    <a href="#the-real-pattern" class="subheading-anchor" aria-label="Permalink for this section"></a></h3><p>Agile Vibe Coding isn&rsquo;t <em>&ldquo;asking the LLM to make an app&rdquo;</em>. It&rsquo;s pair programming with a partner who&rsquo;s stupid fast, has perfect memory, and never complains about refactoring. You say what you want, it implements, you review and adjust the direction. But if you don&rsquo;t know where to go, speed doesn&rsquo;t help — you just get to the wrong place faster.</p>
<p>The questions I asked — about resilience, cancellation, multi-OS, graceful degradation, edge case detection — none of them came from the LLM. They came from years building software that has to work in real environments, with real users, on diverse hardware. Claude isn&rsquo;t going to ask you <em>&ldquo;what if the user pulls the network cable in the middle of downloading the model?&rdquo;</em>. But if you ask, it implements the handling in minutes.</p>
<p>It&rsquo;s tempting to look at this project and conclude that anyone with a good idea could have done the same. But try to imagine: without the decision to do benchmarks, I would have picked the wrong model. Without the Python PoC, I&rsquo;d have discovered JSON parsing problems in production. Without the platform abstraction, I&rsquo;d be debugging Windows issues scattered across 15 files. Without the scan checkpoint, users would lose hours of processing on every crash. Without the formal schema migration, the first update would break everyone&rsquo;s database.</p>
<p>Think of an architect with an absurdly efficient construction crew. The architect doesn&rsquo;t need to lift every wall, but needs to know where they go and what happens if you take one out. The crew executes fast, works at night, doesn&rsquo;t complain. But someone needed to have drawn the blueprint. Without a blueprint, it&rsquo;s just a pile of bricks stacked quickly.</p>
<h2>Final Numbers<span class="hx:absolute hx:-mt-20" id="final-numbers"></span>
    <a href="#final-numbers" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>For those who like concrete numbers:</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Commits</td>
          <td>50</td>
      </tr>
      <tr>
          <td>Hours of effective work</td>
          <td>~26h</td>
      </tr>
      <tr>
          <td>Lines of Rust</td>
          <td>8,359</td>
      </tr>
      <tr>
          <td>Lines of TypeScript</td>
          <td>5,842</td>
      </tr>
      <tr>
          <td>Lines of CSS</td>
          <td>1,649</td>
      </tr>
      <tr>
          <td>Rust tests</td>
          <td>166</td>
      </tr>
      <tr>
          <td>Frontend tests</td>
          <td>172</td>
      </tr>
      <tr>
          <td>Total tests</td>
          <td>338</td>
      </tr>
      <tr>
          <td>Platforms</td>
          <td>3 (Linux, macOS, Windows)</td>
      </tr>
      <tr>
          <td>Database migrations</td>
          <td>5</td>
      </tr>
      <tr>
          <td>Rust modules</td>
          <td>13+</td>
      </tr>
      <tr>
          <td>React components</td>
          <td>15+</td>
      </tr>
      <tr>
          <td>React hooks</td>
          <td>10+</td>
      </tr>
  </tbody>
</table>
<p>The first commit was Friday 02/21 at 19:29. The last was Monday 02/23 at 14:33. Discounting sleep (~8h on Saturday/Sunday night, ~8h on Sunday/Monday early morning) and breaks, that&rsquo;s ~26 hours of work distributed over a weekend.</p>
<p>From zero — without a single file in the repository — to published binaries for 3 operating systems, with automated tests running in CI on every push. Including the research phase, which on its own would justify a sprint.</p>
<h2>Conclusion<span class="hx:absolute hx:-mt-20" id="conclusion"></span>
    <a href="#conclusion" class="subheading-anchor" aria-label="Permalink for this section"></a></h2><p>Agile Vibe Coding works. But it works like any powerful tool: in the hands of someone who knows what they&rsquo;re doing.</p>
<p>The idea for Frank Sherlock would fit in a tweet: <em>&ldquo;classify images with local LLM&rdquo;</em>. But turning that into real software required: benchmarked research with formal ground truth, proof of concept validating the complete pipeline, incremental architecture, error handling at 3 levels, cooperative cancellation, formal schema migration, platform abstraction, CI/CD with 3-OS matrix, integration with 3 external systems with graceful degradation, and 338 tests to ensure none of that breaks when someone runs <code>cargo update</code>.</p>
<p>The LLM sped all that up absurdly. But it didn&rsquo;t replace the need to know what to do. If I had asked <em>&ldquo;make an app that classifies images&rdquo;</em> without the 2 hours of benchmarks, without the proof of concept, without the architecture decisions, without the annoying questions about edge cases, the result would be a prototype that works on my machine and breaks anywhere else. And I probably wouldn&rsquo;t even notice until someone complained.</p>
<p>The vibe needs an experienced conductor. For now, that conductor is still human.</p>
<p>Code at <a href="https://github.com/akitaonrails/FrankSherlock" target="_blank" rel="noopener">github.com/akitaonrails/FrankSherlock</a>. GPL-3.0, local-only, open source.</p>
]]></content:encoded><category>franksherlock</category><category>vibecode</category><category>rust</category><category>tauri</category><category>qwen</category></item></channel></rss>