
The Era of Large Frameworks Is Over

March 3, 2026

I spent 60% of my effort fighting a framework and 40% building domain logic. Then I looked at OpenClaw (500k lines) and NanoClaw (6k lines) and realized: a sturdy core plus skills that encode taste beats a large framework every time.

The thesis

A sturdy core and a good set of skills that encode the style and taste of how to extend a codebase will beat 500,000 lines of framework every time.

I arrived at this conclusion the hard way — by building a community bot on elizaOS v2, watching myself spend 60% of the effort fighting the framework, and then looking at two alternatives that confirmed what I already suspected.

What I built

QD is a community bot for the Bitcoin Quantum ecosystem. It runs on Telegram and Discord, answers questions about post-quantum cryptography, retrieves live blockchain data, and tries very hard not to hallucinate technical parameters like block times and signature sizes.

It's built on elizaOS v2 (v1.7.2), which gives you a plugin architecture: providers inject context into the LLM, actions handle user requests, evaluators reflect on responses, services manage external connections. The framework provides the runtime. I built everything else.

I also have history with this framework. I contributed local BGE embeddings support to elizaOS v1 — fastembed with BGE-small-en-v1.5, 384 dimensions, zero API cost. The kind of quality-of-life improvement that makes a framework usable for real deployments where you don't want every embedding call going through OpenAI.

Then elizaOS v2 shipped and my contribution was gone. Not deprecated. Not replaced with something better. Not even mentioned in the changelog. Just absent — removed in the v1.7.x rewrite. I didn't get a heads-up. There was no migration path. A feature I'd built for the community had been thrown away between major versions, and the only reason I noticed was that my project stopped working.

So I re-implemented it. Same model, same dimensions, vendored directly into my project this time. I wasn't going to let it disappear a second time.

That was the first sign — not just that the framework had gaps, but that contributing upstream doesn't protect you. In a fast-moving open-source AI framework, the code you wrote for everyone can vanish the moment a core maintainer refactors. If the feature matters to your deployment, you own it. You stop depending on the framework to keep it around.

The git log tells the real story

Here's what five weeks of building QD actually looked like, commit by commit:

Week 1: Vendor three plugins just to start

Before writing a single line of domain logic, I had to vendor three framework plugins:

Local embeddings — elizaOS v2 removed built-in local embedding support. The only options were OpenAI embeddings (costs money, network dependency) or Ollama (requires a separate server). I vendored my own implementation back in, using the same fastembed + BGE-small-en-v1.5 I'd contributed to v1.

OpenRouter — the standard plugin doesn't support X-Title and HTTP-Referer attribution headers. This is a business requirement for OpenRouter rankings. I had to copy 1,332 lines of plugin code to add two HTTP headers, because the plugin architecture has no hook for custom headers.
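The fix itself is two header lines; the 1,332 lines came from having to copy everything around them. A minimal sketch of what the vendored plugin adds, assuming OpenRouter's public chat completions route; the app URL and name here are placeholders, not the project's real values:

```typescript
// Sketch: the two attribution headers the stock plugin has no hook for.
function buildOpenRouterHeaders(apiKey: string, appUrl: string, appName: string) {
  return {
    Authorization: `Bearer ${apiKey}`,
    "Content-Type": "application/json",
    "HTTP-Referer": appUrl, // attribution: the app's URL, used for OpenRouter rankings
    "X-Title": appName,     // attribution: the app's display name in rankings
  };
}

// Illustrative usage against OpenRouter's chat completions endpoint.
async function openRouterChat(apiKey: string, model: string, prompt: string) {
  const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
    method: "POST",
    headers: buildOpenRouterHeaders(apiKey, "https://example.com/qd", "QD Community Bot"),
    body: JSON.stringify({ model, messages: [{ role: "user", content: prompt }] }),
  });
  if (!res.ok) throw new Error(`OpenRouter error ${res.status}`);
  return (await res.json()) as { choices: { message: { content: string } }[] };
}
```

A plugin architecture with a `headers` hook would have made this a one-line config change instead of a vendored fork.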

Telegram — the standard Telegram plugin always responds to every message. I needed mention-only mode for group chats. No configuration option exists. Vendor the whole plugin.
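The gate I needed is tiny. A sketch, with message fields loosely mirroring Telegram's Bot API; the point is that nothing like this is configurable in the stock plugin:

```typescript
// Hypothetical mention-only gate for group chats.
interface IncomingMessage {
  chatType: "private" | "group" | "supergroup";
  text: string;
  replyToBot: boolean; // true when the message replies to one of the bot's messages
}

function shouldRespond(msg: IncomingMessage, botUsername: string): boolean {
  if (msg.chatType === "private") return true; // DMs: always respond
  if (msg.replyToBot) return true;             // replies to the bot count as mentions
  return msg.text.toLowerCase().includes(`@${botUsername.toLowerCase()}`);
}
```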

Three plugins vendored. Zero domain logic written. That's the framework tax.

Week 2: The hallucination crisis

The bot started stating wrong facts — "BTQ has 6-second block times" — and then refusing to correct itself when the knowledge base contained the right answer (60 seconds).

The root cause is architectural: elizaOS's prompt composition (composeState) gives conversation history and knowledge provider output equal standing in the prompt. The model doesn't treat them equally, though: it trusts its own recent statements over the injected knowledge block, because recent conversation turns dominate its attention.

The real fix would require forking elizaOS core to reorder prompt composition. Instead, I wrote layers of workarounds:

  1. Aggressive "trust KB over memory" language in the system prompt
  2. Explicit "I don't know" examples in the character definition
  3. A ban on using Bitcoin/Ethereum training data for domain questions
  4. Technical question detection that triggers extra warning injection

These are band-aids. They mostly work. The framework's lack of a mechanism for "authoritative knowledge" forced every one of them.
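Band-aid 4 looks roughly like this. The pattern list and warning text are illustrative, not the exact prompt:

```typescript
// Detect questions about exact technical parameters; these are the ones the
// bot must answer from the knowledge base or not at all.
const TECHNICAL_PATTERNS = [
  /block\s*time/i,
  /signature\s*size/i,
  /key\s*size/i,
  /hash\s*(rate|function)/i,
  /difficulty/i,
];

function isTechnicalQuestion(text: string): boolean {
  return TECHNICAL_PATTERNS.some((p) => p.test(text));
}

function augmentSystemPrompt(base: string, userText: string): string {
  if (!isTechnicalQuestion(userText)) return base;
  return (
    base +
    "\n\nWARNING: This question concerns exact technical parameters. " +
    "Answer ONLY from the knowledge base context above. If it does not " +
    "contain the answer, say you don't know. Do not reuse figures from " +
    "Bitcoin or Ethereum training data, and do not repeat numbers from " +
    "earlier in this conversation unless the knowledge base confirms them."
  );
}
```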

Week 3: The identity crisis

When users mentioned @qdayanon_bot in Telegram, the bot ignored them. ElizaOS's should-respond evaluator checks character.name and character.username — but it checks different fields in different code paths. The character was named "QD" with username "qd_research." The Telegram handle is @qdayanon_bot. The evaluator concluded: this message is about @qdayanon_bot, not QD — IGNORE.

The fix required adding redundant identity statements across name, username, and bio fields, because different parts of the framework check different fields. Then I had to add a 👀 reaction for immediate visual feedback so users knew the bot had seen their message, plus a typing indicator, plus a force-respond rule for mentions.

Eleven commits in rapid succession, all to make the bot respond when someone said its name.
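The workaround amounts to collecting every alias the framework might check and matching against all of them. A sketch using the names from this post; the helper itself is hypothetical:

```typescript
// Force-respond check across all identity fields, since different framework
// code paths consult different ones.
interface Identity {
  name: string;        // "QD"
  username: string;    // "qd_research"
  handles: string[];   // platform handles, e.g. ["qdayanon_bot"]
}

function mentionsBot(text: string, id: Identity): boolean {
  // Aliases are assumed regex-safe (alphanumerics and underscores).
  const aliases = [id.name, id.username, ...id.handles].map((a) => a.toLowerCase());
  const lower = text.toLowerCase();
  // Match both @handle mentions and bare-name mentions on word boundaries.
  return aliases.some(
    (a) => lower.includes(`@${a}`) || new RegExp(`\\b${a}\\b`).test(lower),
  );
}
```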

Week 3 (continued): Proper nouns break embeddings

Once the bot responded to mentions, a new problem surfaced: users would ask "who is masato" and the knowledge base — which contained information about masato83, a community member — returned nothing useful.

Small embedding models don't handle proper nouns well. "masato" has no meaningful position in the embedding space of BGE-small-en-v1.5. The cosine similarity between "who is masato" and a chunk containing "masato83 is a BTQ miner" was below the relevance threshold.

The fix was a four-step evolution:

  1. Add naive keyword matching alongside semantic search
  2. Strip @mentions from keyword extraction (they were polluting results)
  3. Increase retrieval from top-3 to top-5 chunks
  4. Augment short follow-up queries with recent conversation context

Then the naive keyword matching revealed its own problems — no term frequency weighting, no inverse document frequency. Mentioning "masato" five times scored the same as once. Common words like "block" got the same boost as rare proper nouns.

So I swapped it for BM25 via MiniSearch (~25KB, zero dependencies of its own). Hybrid scoring: BM25 keyword relevance + cosine semantic similarity. This is the search algorithm that finally worked.

None of this is in elizaOS. The framework has no built-in knowledge search, no keyword matching, no hybrid retrieval. Every layer was custom.

Week 4: Qwen3 breaks the evaluator

ElizaOS's reflection evaluator asks the LLM to produce structured XML for extracting facts and relationships from conversations. Qwen3 garbles it — attributes rendered as tag bodies, truncation mid-generation, reasoning blocks injected before the XML.

The framework doesn't support per-model evaluator overrides; the evaluator lives in the bootstrap plugin and isn't configurable. I vendored the entire evaluator (~500 lines) to swap the XML format from nested elements to flat self-closing tags with attributes, add a regex-based parser, strip Qwen3's <think> blocks, and implement best-effort parsing that extracts facts even when relationships fail.
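The parsing strategy, sketched. Tag and attribute names here are illustrative, not the framework's actual schema:

```typescript
// Remove Qwen3's reasoning blocks before any parsing.
function stripThink(raw: string): string {
  return raw.replace(/<think>[\s\S]*?<\/think>/gi, "").trim();
}

interface Fact { claim: string; type: string }

// Best-effort extraction: flat self-closing tags survive truncation better
// than nested elements, because each <fact .../> stands alone and a cut-off
// tail only loses the last tag instead of invalidating the whole document.
function parseFacts(raw: string): Fact[] {
  const cleaned = stripThink(raw);
  const facts: Fact[] = [];
  const tagRe = /<fact\s+([^>]*?)\/>/g;      // whole self-closing tag
  const attrRe = /(\w+)="([^"]*)"/g;          // key="value" pairs inside it
  for (const m of cleaned.matchAll(tagRe)) {
    const attrs: Record<string, string> = {};
    for (const a of m[1].matchAll(attrRe)) attrs[a[1]] = a[2];
    if (attrs.claim) facts.push({ claim: attrs.claim, type: attrs.type ?? "fact" });
  }
  return facts;
}
```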

Week 5: The explorer fights back

The BTQ Explorer API was undocumented — endpoint descriptions only, no response schemas, no example payloads. The original plugin assumed Esplora-style responses (the most common Bitcoin explorer format). BTQ Explorer runs btc-rpc-explorer, which has completely different response formats. Twelve interface mismatches. TypeScript's as T casts don't validate at runtime, so everything compiled fine and failed silently.

Then came response validation — HTTP 200 with HTML error pages, JSON error envelopes with success: false, corrupt data fields, all cached for 30 seconds. Then the circuit breaker — 8 parallel API calls per user query hammering an Explorer on the same machine as the BTQ node, three endpoints broken server-side but retried every 30 seconds indefinitely.

The final design: background polling with sequential sweeps, per-endpoint circuit breakers with exponential backoff, and a provider that reads exclusively from cache — zero network calls per user query.
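The per-endpoint breaker, sketched with an injectable clock for testing. The thresholds are illustrative; the real design also runs the background sweep that is the only code path touching the network:

```typescript
// One breaker per explorer endpoint: exponential backoff on failure, reset on
// success, so a broken endpoint stops hammering the server but still retries.
class EndpointBreaker {
  private failures = 0;
  private openUntil = 0; // ms timestamp before which calls are skipped

  constructor(
    private baseDelayMs = 30_000,       // first backoff interval
    private maxDelayMs = 30 * 60_000,   // cap, so broken endpoints retry eventually
    private now: () => number = Date.now, // injectable clock for tests
  ) {}

  canAttempt(): boolean {
    return this.now() >= this.openUntil;
  }

  recordSuccess(): void {
    this.failures = 0;
    this.openUntil = 0;
  }

  recordFailure(): void {
    this.failures += 1;
    // 30s, 60s, 120s, ... capped at maxDelayMs.
    const delay = Math.min(this.baseDelayMs * 2 ** (this.failures - 1), this.maxDelayMs);
    this.openUntil = this.now() + delay;
  }
}
```

The user-facing provider never sees any of this: it reads from cache, and the breakers live entirely inside the background polling loop.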

Seven ADRs document these decisions. Seven.

The scorecard

After five weeks, here's where the effort went:

| Category | Effort | What it produced |
|---|---|---|
| Framework friction | ~60% | 3 vendored plugins, prompt engineering band-aids, identity workarounds, XML parser rewrite |
| Domain logic | ~40% | Knowledge sync pipeline, hybrid BM25+semantic search, explorer integration, circuit breaker, anti-hallucination guards |

The domain logic is the good stuff — the knowledge sync scripts that auto-extract parameters from btq-core, the hybrid search that handles proper nouns, the circuit breaker that protects the explorer. That's what makes the bot useful.

The framework friction is the tax — 1,332 lines vendored for two HTTP headers, eleven commits to make the bot respond to its own name, a reflection evaluator rewrite because Qwen3 doesn't produce the XML format the framework expects.

Then I looked at the alternatives

OpenClaw: 500,000 lines

OpenClaw is a self-hosted personal AI OS. WebSocket gateway connecting 25+ messaging channels. Companion apps for macOS, iOS, Android. Voice wake words. Live canvas. Browser automation. 54 bundled skills. 42 extensions. 15+ LLM providers with failover. 53 configuration files. 70+ dependencies.

It's impressive engineering. It also represents a worldview: build a framework that handles all use cases, then configure it for yours.

The problem with 500,000 lines is that you can't read them before you start. You can't understand the failure modes before they hit you. When something breaks — and it will — you're debugging a system larger than most companies' entire codebase. The configuration surface alone (53 files) is a liability. Every config option is a decision someone made that you now inherit, whether or not it matches your needs.

NanoClaw: 6,200 lines

NanoClaw is the deliberate reaction. Same creator, opposite philosophy. Single Node.js process. SQLite. One Docker container per conversation group. Claude Agent SDK inside each container.

~6,200 lines of code. Readable in an afternoon. Extension model: fork the repo and change the code. The assumption is that your AI coding assistant is your co-developer — you don't configure NanoClaw, you change it.

The container isolation model is genuinely interesting. Each group gets OS-level isolation — its own filesystem, memory, and agent session. The container boundary is the security boundary. No application-level permission system to get wrong.

The comparison

| | BTQ-Agent | OpenClaw | NanoClaw |
|---|---|---|---|
| Custom code | ~5k lines | ~500k lines | ~6.2k lines |
| Channels | 2 | 25+ | 1 built, 5 planned |
| Extension model | Plugins | Plugins + marketplace | Fork and modify |
| Security | Framework defaults | App-level allowlists | OS-level containers |
| Dependencies | ~15 | 70+ | ~10 |
| Config files | 1 (.env) | 53 | ~0 |

What this tells us

The pattern across all three projects is the same: the value is in the domain logic, not the framework. My hybrid BM25+semantic search works because I understand why proper nouns fail in embedding space, not because elizaOS provides good knowledge retrieval primitives (it doesn't). NanoClaw's container isolation works because the creator understood OS-level security boundaries, not because Claude Agent SDK makes containers easy.

The framework is just the runtime. The domain logic is what matters. And domain logic is best expressed as small, focused modules that you own and understand — not as configurations of a system too large to read.

This is why I think the era of large frameworks is ending. The alternative is a sturdy core — a runtime, a message loop, a plugin interface — plus skills that encode the taste and style of how to extend it. The core stays small enough to read. The skills carry the institutional knowledge: this is how we do knowledge retrieval, this is how we handle model-specific quirks, this is how we validate upstream API responses.

I spent 60% of my effort on framework friction and 40% on domain logic. The domain logic — the hybrid search, the circuit breaker, the anti-hallucination guards — is the hard part. It should have been most of the work.

A large framework doesn't save you from doing the hard part. It just makes everything else harder too.