Agents Done Right: A Framework Vision for 2026

The Problem with Today’s Agent Frameworks

I’ve been building with LLM-based agents for a while now, and I’ve been using a lot of other people’s tools too. The space is growing fast, but something feels off. It’s getting complex for complexity’s sake: new abstractions pile on top of old ones, and configuration options multiply. Steve Krug’s first law of usability is “don’t make me think,” and current agent frameworks violate it constantly. And yet, for all that machinery, I keep running into the same walls.

The agent starts strong, but as the task gets complex, its context window fills up. The context window is the LLM’s working memory, a limit on how much text it can hold at once, measured in tokens (roughly, chunks of words). Fill it up, and the model loses track. It forgets what it was doing. It repeats itself. It starts hallucinating, making things up with confidence. Eventually it fails. Not because the model isn’t capable, but because the architecture around it can’t manage the complexity.

We’re in the “Ruby on Rails 1.0” era of agent development. Everyone is building agents, but we’re all solving the same problems from scratch: context exhaustion, doom loops, model selection paralysis, and the cognitive overload of reviewing agent output. I once watched an agent run npm run build over and over for five minutes while I was on a phone call. It just kept going, burning tokens, stuck in a loop it couldn’t escape. That’s a doom loop. The agent repeats the same failed action, hoping for a different result.

The teams that have shipped production agents have independently discovered similar architectural patterns. But these insights are locked inside proprietary systems. The open source ecosystem is still dominated by thin wrappers around LLM APIs that avoid the hard problems.

It’s 2026. We should have a framework that makes the right architecture the easy architecture. This post is my attempt to think through what that framework should look like.


Core Principles

1. Convention Over Configuration

Ruby on Rails succeeded because it made decisions for you. Database table names, file locations, and URL structures were all derived from conventions. You could override them, but you didn’t have to think about them.

Agent frameworks today are the opposite. Every project starts with: Which model? Which embedding provider? How should I structure tools? What’s my context strategy? How do subagents communicate?

The framework should have strong defaults for all of this. Here are the conventions I’d choose:

Model selection by task complexity. Simple edits (adding a log statement, fixing a typo) use a fast model. Multi-file refactors or debugging sessions use a reasoning model. The framework infers complexity from the task description and codebase scope. You don’t specify a model.

Context budgets with inheritance. Each agent gets a context budget. Subagents inherit a portion of their parent’s remaining budget. When an agent approaches its limit, the framework triggers automatic summarization. This forces the architecture toward delegation: if a task doesn’t fit in your budget, spawn a subagent.

Mandatory checkpoints for risky operations. File deletion, multi-file modifications, and detected uncertainty (phrases like “I’m not sure” or “this might break”) require human approval. Everything else proceeds automatically. You opt out of safety, not into it.

Curated tool profiles by archetype. Searchers get search tools. Writers get write tools. Researchers get web access. Archetypes don’t get tools outside their profile unless explicitly granted. This prevents the “too many tools” problem where agents waste context evaluating irrelevant options.

These conventions encode lessons from watching agents fail. Context exhaustion, review fatigue, tool confusion: the failure modes are predictable. The conventions exist to make the right architecture the easy architecture.

// This should just work - all conventions applied
const agent = new Agent({ codebase: "./my-project" });
await agent.run("Add input validation to the signup form");

Override when you need to. But start productive.
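
For instance, overriding a convention might look something like this; the option names here are assumptions for illustration, not a settled API:

// Hypothetical overrides - the option names are illustrative, not a fixed API
const agent = new Agent({
  codebase: "./my-project",
  model: "reasoning-large",          // pin a reasoning model instead of letting the framework infer one
  contextBudget: 64_000,             // tighten the default budget
  requireApproval: ["delete_file"],  // narrow the default checkpoint set
});
await agent.run("Migrate the signup form to the new validation library");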

[Interactive demo: configuration complexity — compare minimal vs. verbose config]

2. Tasks, Not Models

The current generation of tools asks: “Which model do you want to use?”

This is the wrong question. For most tasks, users shouldn’t be thinking about models at all. When you ask an agent to “add input validation to the signup form,” you don’t care whether it uses a fast model or a reasoning model. You care that it works.

Model selection is an optimization detail. You describe what you want; the framework figures out how to deliver it efficiently. A quick code edit optimizes for speed. Complex debugging allocates thinking time. Large refactors balance cost and capability. You shouldn’t have to care which model runs under the hood.

Framework implication: No model parameter in the default API. The framework infers requirements from the task. Power users can override when needed, but the default path requires zero model knowledge.
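
As a rough sketch of what that inference could look like internally (the heuristics and tier names are assumptions, not a prescribed algorithm):

// Hypothetical sketch of how a framework might infer a model tier; the
// heuristics and tier names are illustrative, not a prescribed algorithm.
type ModelTier = "fast" | "balanced" | "reasoning";

function inferModelTier(task: string, filesInScope: number): ModelTier {
  const needsReasoning = /refactor|debug|architect|migrat/i.test(task);
  if (needsReasoning || filesInScope > 10) return "reasoning";
  if (filesInScope > 2) return "balanced";
  return "fast"; // typo fixes, log statements, single-file edits
}

inferModelTier("Fix the typo in the signup error message", 1);      // "fast"
inferModelTier("Debug the race condition in the job scheduler", 8); // "reasoning"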

[Interactive demo: the traditional approach — pick from 7 models, each with tradeoffs; model selection puts cognitive load on the user]

3. Subagents as the Primary Scaling Mechanism

Watch an agent tackle a complex task. It reads files, searches the codebase, backtracks, tries a different approach. The context window fills up. By the time it’s ready to write code, it’s forgotten the original requirements. It loops. It hallucinates. It fails.

I hit this wall constantly when researching new topics. One PDF can consume your entire context window. A large codebase is even worse. You end up working in awkward, inconvenient ways just to get access to the knowledge you need.

An obvious fix is to “make context windows bigger.” But bigger windows just delay the problem. They’re slower and more expensive.

A better approach is factoring: break the work into isolated subtasks, each with its own context. A subagent searches the codebase and returns only the relevant file paths. Another analyzes a specific function and returns only its findings. The parent agent stays focused, receiving distilled results instead of accumulating everything.

Subagents are to agents what functions are to programs. They isolate context for subtasks and return only relevant results to the parent. They can be optimized for different task types, whether speed or depth, and they enable parallelism without context collision.

The difference is dramatic. A single-agent approach might use 90% of its context window stumbling through a task. The same task with subagents? The parent uses 25%, staying sharp throughout.

Spawning subagents should feel as natural as calling a function. Not an advanced feature you learn later, but the default pattern from day one.

Framework implication:

// This should feel as natural as a function call
const result = await agent.delegate("searcher", {
  task: "locate authentication logic",
  returnFormat: "file_paths_with_snippets"
});

The framework also needs primitives that make context management automatic:

const agent = new Agent({
  contextBudget: 128_000,           // Explicit budget, inherited by subagents
  overflowStrategy: "summarize",    // What to do when approaching limit
  loopDetection: true               // Circuit-break on repetitive patterns
});

With these primitives, agents can focus on the task while the framework handles the resource management. Just like garbage collection lets programmers focus on logic instead of memory.
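
To make loopDetection concrete, here’s one possible implementation sketch, assuming the framework tracks recent tool calls and circuit-breaks when the same call repeats:

// Hypothetical loop detector: circuit-break when the same action repeats N times
function makeLoopDetector(maxRepeats = 3) {
  const history: string[] = [];
  return (toolName: string, args: unknown): boolean => {
    const signature = `${toolName}:${JSON.stringify(args)}`;
    history.push(signature);
    const recent = history.slice(-maxRepeats);
    const looping =
      recent.length === maxRepeats && recent.every((s) => s === signature);
    return looping; // caller interrupts the agent and escalates
  };
}

const isLooping = makeLoopDetector();
isLooping("run_command", { cmd: "npm run build" }); // false
isLooping("run_command", { cmd: "npm run build" }); // false
isLooping("run_command", { cmd: "npm run build" }); // true -> break the loop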

[Interactive demo: a single agent accumulates all context, filling the window]

4. Opinionated Subagent Archetypes

Watch enough production agents and you’ll see the same patterns emerge. Teams independently discover that certain subtask shapes keep recurring, and that specialized subagents for those shapes dramatically outperform general-purpose ones.

Here are the archetypes that have proven themselves:

| Archetype | What it does | Why it’s specialized |
| --- | --- | --- |
| Searcher | Finds relevant code, files, symbols | Optimized for speed; returns locations, not content |
| Thinker | Reasons through complex problems | Allocates thinking time; returns analysis, not action |
| Researcher | Gathers external knowledge (docs, APIs) | Web/retrieval access; returns synthesized context |
| Writer | Makes targeted code changes | Diff-focused; returns patches, not explanations |
| Planner | Breaks down complex tasks | Strategic focus; returns task breakdown |
| Checker | Validates and critiques work | Independent perspective; adversarial stance |

Notice the pattern: each archetype has a constrained output format. A Searcher returns locations. A Thinker returns analysis. A Writer returns patches. This constraint is the key. It forces the subagent to distill its work rather than dump everything into the parent’s context.

These archetypes should ship as built-in, customizable components. You shouldn’t have to reinvent the Searcher pattern from scratch. It should just be there, ready to use, with sensible defaults you can override when needed.
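
Customizing an archetype might look something like this; the constructor options are assumptions for illustration:

// Hypothetical customization of built-in archetypes; the options are illustrative
const checker = new Checker({
  stance: "adversarial",            // critique the work, don't rubber-stamp it
  tools: ["file_read", "run_test"], // read and verify, never write
  returns: "critique_with_severity" // constrained output format
});

const searcher = new Searcher({
  returns: "file_paths_only"        // narrow the output contract even further
});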

[Interactive demo: subagent archetypes — each optimized for a specific type of subtask]

5. Human-Agent Workflow is Part of the Framework

The bottleneck has shifted. It’s no longer “can the agent write code?” It’s “can the human review and trust the code fast enough?”

Today’s workflow: agent dumps a wall of changes, human squints at diffs, tries to understand the intent, hopes nothing broke. This doesn’t scale. As agents get more capable, the review burden grows faster than human attention.

When you run an agent, you should be able to specify where you want control. Checkpoints give you natural stopping points for review. Change proposals explain what will change and why before changes are made. Validation hooks let you plug in secondary verification (another model, static analysis, or tests). Diff streaming gives real-time visibility into changes as they happen. And approval gates let you require sign-off for high-risk operations.

Here’s what that looks like:

const result = await agent.run({
  task: "Add rate limiting to /api/users",
  checkpoints: ["before_write", "after_plan"],
  requireApproval: ["delete_file", "modify_config"],
  onProposal: (plan) => showPlanToUser(plan)
});

This isn’t UI. It’s protocol. The framework defines the contract between agent and human; UIs implement it. A CLI might show a simple approve/reject prompt. An IDE might render an interactive diff viewer. Same protocol, different presentations.
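
One way the contract could be expressed as types (the names and fields are assumptions, not a published spec):

// Hypothetical protocol types; field names are assumptions, not a published spec
type CheckpointEvent =
  | { kind: "proposal"; plan: string; affectedFiles: string[] }
  | { kind: "approval_request"; operation: string; detail: string }
  | { kind: "diff"; file: string; patch: string };

type CheckpointResponse = { approved: boolean; note?: string };

// A CLI, an IDE extension, or a web dashboard all implement the same interface
interface AgentFrontend {
  onEvent(event: CheckpointEvent): Promise<CheckpointResponse>;
}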

[Interactive demo: the checkpoint flow in a CLI — same protocol, different presentations]

6. Tools are Curated, Not Collected

The promise of universal tool interoperability sounds great in theory. In practice: more tools = more confusion.

Here’s why. Every tool you give an agent is a decision it has to make: “Should I use this?” With 5 well-chosen tools, that decision is easy. With 50 tools, many overlapping, some poorly documented, others rarely useful, the agent wastes context evaluating options, makes false starts with wrong tools, and loses focus on the actual task.

I ran into this recently with MCP (Model Context Protocol) servers. I built one for Outlook to read my mail and calendar. But with multiple MCP servers installed, the agent kept trying to look up information on the web instead of using the mail tool sitting right there. Too many options, not enough guidance about which one to pick.

What you want is a curated core toolset optimized for coding tasks, not a grab-bag of integrations. Different subagents should get different tool subsets: a Searcher doesn’t need web access. Destructive tools should require human approval. And tool usage patterns should be tracked to surface which tools help vs. which cause thrashing.

Framework implication:

const searcher = new Searcher({
  tools: ["grep", "ast_search", "file_read"],  // Curated subset
  restricted: ["web_fetch", "file_write"]      // Explicitly excluded
});

// Tools are first-class objects with metadata
const grepTool = {
  name: "grep",
  description: "Search file contents with regex",
  whenToUse: "Looking for specific strings or patterns in code",
  costEstimate: "low",
  requiresApproval: false
};

The goal: agents that pick the right tool immediately, not agents that waste time evaluating irrelevant options.

[Interactive demo: a curated toolset keeps the agent focused, with minimal wasted effort]

7. When to Use a Subagent vs. a Tool

There’s a simple heuristic for deciding whether something should be a tool or a subagent:

Tool: Stateless transformation. Input goes in, output comes out, no iteration required. Format a date. Parse JSON. Run a regex. The operation doesn’t accumulate context or require judgment.

Subagent: Iteration, judgment, or context accumulation. The operation explores, backtracks, tries alternatives, or builds up intermediate state. Search, research, analysis, planning: these need their own context window.

The architectural reason to care: if you implement an iterative operation as a tool, its entire execution trace pollutes the parent context. Every search result, every intermediate step, every dead end. All of it crowds out space for the actual task.

// Bad: "tool" that leaks context
const searchTool = {
  name: "search_codebase",
  execute: async (query) => {
    // All of this ends up in parent context
    const files = await grep(query);
    const ranked = await rerankResults(files);
    const snippets = await extractSnippets(ranked);
    return snippets;
  }
};

// Good: subagent with isolated context
const searcher = new Searcher({
  // Runs in own context window
  // Only `snippets` returned to parent
  returns: "snippets_with_locations"
});

This matters because many things we call tools are actually subagents in disguise. Web search, documentation lookup, code analysis: these involve iteration and judgment. The name “tool” makes them sound simple, but they’re not. Implementing them as proper subagents with isolated context is what makes the archetype pattern (Searcher, Thinker, Researcher) work.

[Interactive demo: context impact of a tool vs. a subagent — the subagent isolates working context from the parent]

8. Subagent Context is Ephemeral by Default

A subagent’s working context is like local variables in a function. It exists during execution and is discarded when done.

Consider a Researcher subagent investigating a library. It fetches 10 documentation pages, reads API references and examples, follows links to GitHub issues, and synthesizes findings into a summary. All of that is working context. But it returns only the summary to the parent.

The parent doesn’t need the 10 pages, the API refs, or the GitHub issues. It needs the answer. All that intermediate work is temporary scaffolding.

const researcher = new Researcher({
  // Everything inside runs in ephemeral context
  // Only the return value persists
  returns: "synthesis",

  // Optional: persist specific artifacts
  persist: ["key_code_snippets", "api_signatures"]
});

// Parent receives ~500 tokens, not 50,000
const findings = await agent.delegate(researcher, {
  task: "How does auth work in this library?"
});

The principle: Subagents accumulate context to do their job, then compress before returning. The parent receives a distilled result, not the full execution trace.

I built something like this to explore a large codebase. A custom agent walked through the source files, read them, and wrote summarized markdown documentation. The agent consumed thousands of lines of code, but what I got back was a concise capture of the service’s flow. The essence, not the exhaustive detail. That’s the pattern: do the heavy lifting in isolated context, return only what matters.

This is how human experts work too. When you ask a colleague to research something, you want their conclusion, not everything they read along the way.

[Interactive demo: the research journey — the subagent accumulates context, then compresses before returning]

The Developer Experience

What Building an Agent Should Feel Like

All of the principles above collapse into a simple question: what does it feel like to build with this framework? If the conventions are right, the code should be obvious. You declare what you want, not how to manage it. The framework handles orchestration, context, and human checkpoints. You focus on the task.

Here’s what that looks like:

import { Agent, Searcher, Thinker, Writer } from "agentkit";

// Declare a coding agent with specialized subagents
const agent = new Agent({
  name: "code-assistant",
  mode: "smart", // vs "fast" for quick iteration
  subagents: [
    new Searcher({ tools: ["grep", "ast_search", "embeddings"] }),
    new Thinker({ timeout: 120 }),
    new Writer({ requireApproval: false }),
  ],
  contextBudget: 128_000,
  humanCheckpoints: ["before_multi_file_edit", "on_uncertainty"],
});

// Run a task - framework handles subagent orchestration
const result = await agent.run({
  task: "Add rate limiting to the /api/users endpoint",
  codebase: "/path/to/repo",
});

// Result includes structured output for UI integration
result.changes;      // List of file changes
result.proposal;     // What changed and why
result.contextUsed;  // Debugging/optimization info

Why TypeScript?

TypeScript offers patterns that make agent code safer in ways that Python’s type system can’t match.

Branded types for resource budgets. Context tokens aren’t just numbers. They’re a distinct unit. Branded types prevent you from accidentally passing a line count where a token count is expected:

type ContextTokens = number & { readonly brand: unique symbol };
const budget: ContextTokens = 64_000 as ContextTokens;
// Can't accidentally pass a plain number where ContextTokens is required

Discriminated unions for checkpoint states. When a checkpoint can be pending, approved, or rejected, discriminated unions force exhaustive handling. The compiler catches missing cases at build time, not runtime:

type CheckpointState =
  | { status: "pending" }
  | { status: "approved"; by: string; at: Date }
  | { status: "rejected"; reason: string };

function handleCheckpoint(state: CheckpointState) {
  switch (state.status) {
    case "pending": return showWaiting();
    case "approved": return proceed(state.by);
    case "rejected": return showError(state.reason);
    // TypeScript errors if you miss a case
  }
}

Generic constraints for subagent return types. A Searcher<FileLocation[]> and a Thinker<Analysis> are different types. The parent agent knows exactly what shape to expect from each delegation:

const locations = await agent.delegate<FileLocation[]>(searcher, { task: "find auth" });
// locations is typed as FileLocation[], not unknown

Beyond type safety, the ecosystem fits: VS Code extensions, language servers, and most developer tooling already run on TypeScript. And with Bun, an agent can be compiled to a standalone executable with no runtime dependencies.


What This Enables

Think about what web development looked like before Rails. Every project started with the same decisions: How do I structure my code? How do I talk to the database? How do I handle routing? Teams spent months on plumbing before writing a single line of business logic. Rails changed that by encoding the answers into conventions. Suddenly, developers could go from idea to working application in hours instead of weeks.

Agent development is in that pre-Rails moment right now. Every team building production agents is solving the same problems: context management, subagent orchestration, human review workflows, tool selection. The solutions exist, but they’re locked inside proprietary systems or tribal knowledge.

With the right framework primitives, agent builders can focus on what makes their agent unique instead of reinventing context management for the hundredth time. They can experiment with novel interaction patterns instead of debugging the same doom loops. They can invest in evaluation and domain expertise instead of infrastructure.

And users get agents that actually work. Not agents that succeed on simple tasks and fall apart on complex ones. Not agents that require babysitting to avoid going off the rails. Agents with predictable behavior, transparent operation, and graceful degradation when things get hard.


Open Questions

There’s still work to figure out. These are the problems I’m thinking through, along with my current hunches.

How do subagents share learned context? There’s probably a “memory” layer that persists across tasks. Something like a project-level knowledge base that subagents can read from and write to. The Researcher finds something important, it goes into shared memory. The Writer pulls from it later. This doesn’t have to be LLM-backed memory. It could be a graph database, a structured knowledge store, or something we haven’t invented yet. But the interface has to be simple enough that it doesn’t become another configuration burden.
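
One possible shape for that interface, as a sketch with assumed names:

// Hypothetical project-memory interface; the backing store is pluggable
interface ProjectMemory {
  write(topic: string, fact: string, source: string): Promise<void>;
  query(topic: string, limit?: number): Promise<string[]>;
}

declare const memory: ProjectMemory;

// A Researcher deposits a finding; a Writer pulls it back out later
await memory.write("auth", "Tokens are validated in middleware/auth.ts", "researcher");
const knownFacts = await memory.query("auth", 5);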

How do peer agents collaborate? Most agent architectures are hierarchical. Parent spawns child, child returns result. But some tasks need peers working in parallel, sharing findings as they go. I suspect the answer is message-passing with typed channels, similar to how concurrent systems handle coordination. The framework handles the plumbing; agents just send and receive.
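
A minimal sketch of the typed-channel idea, assuming a framework-provided openChannel helper:

// Hypothetical typed channel between peer agents; the framework owns delivery
interface Channel<T> {
  send(message: T): Promise<void>;
  receive(): Promise<T>;
}

type Finding = { topic: string; summary: string; confidence: "low" | "high" };

declare function openChannel<T>(name: string): Channel<T>;

const findings = openChannel<Finding>("research-findings");
await findings.send({ topic: "rate limiting", summary: "Use a token bucket", confidence: "high" });
const peerFinding = await findings.receive();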

How do you evaluate agents systematically? This might be the hardest problem, and I don’t think anyone has solved it yet.

The core challenge is ground truth. To evaluate whether an agent did the right thing, you need to know what “right” means for that task. For coding tasks, you can check some things automatically (does the code compile? do tests pass?) but these only catch obvious failures. An agent can produce working code that’s architecturally wrong, or secure code that’s unmaintainable.

I think evaluation has to happen at multiple layers. Task success is the baseline: did the change work at all? Behavioral consistency asks whether the same task on the same codebase produces similar results across runs. If an agent gives wildly different answers each time, something’s wrong. Human preference captures whether humans actually accept the changes in practice.

The infrastructure for this is expensive to build. You need captured scenarios (codebase + task + expected behavior), replay with controlled randomness, and scoring that’s more nuanced than pass/fail. It’s similar to how you’d evaluate a human candidate through realistic work samples rather than abstract tests. The framework should probably include hooks for scenario capture and replay, even if the evaluation logic is user-defined.

There’s also a meta-problem: evaluating the evaluations. A scenario that tests whether the agent can add two numbers tells you nothing useful. The scenarios themselves need to be validated for difficulty and discriminative power. Do they distinguish good agents from bad ones? Do they catch regressions? This is the same challenge test engineers face with code coverage: 100% coverage means nothing if the tests are trivial.

My intuition is that evaluation requires three primitives: Scenario (a frozen codebase + task), Assertion (did the output compile, pass tests, match intent), and Consistency (same scenario, N runs, what’s the variance). This is closer to integration testing than unit testing.
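
As types, those primitives might look roughly like this (a sketch, not a finished design):

// Hypothetical evaluation primitives; names and fields are a sketch
interface Scenario {
  codebaseSnapshot: string;  // path or hash of a frozen repo state
  task: string;
  expectedBehavior: string;
}

interface Assertion {
  name: "compiles" | "tests_pass" | "matches_intent";
  check(outputDiff: string): Promise<boolean>;
}

interface ConsistencyReport {
  runs: number;              // same scenario, N runs
  agreementRate: number;     // how often runs converge on equivalent changes
}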

How do you budget spend across subagent hierarchies? Context tokens are one cost, but API calls add up fast when you’re spawning subagents. The framework probably needs a cost model that propagates budgets down the hierarchy. Parent allocates a budget, subagents inherit portions of it, and the framework enforces limits before you get a surprise bill.
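
The propagation itself could be as simple as a split at each delegation; in this sketch the 40% share is an arbitrary assumption:

// Hypothetical budget split at each delegation; the 40% share is an arbitrary choice
function allocateSubBudget(parentRemaining: number, share = 0.4): number {
  // The child gets a fraction of what's left; the parent keeps the rest as headroom
  return Math.floor(parentRemaining * share);
}

const parentRemaining = 90_000;
const searcherBudget = allocateSubBudget(parentRemaining);                  // 36,000 tokens
const thinkerBudget = allocateSubBudget(parentRemaining - searcherBudget);  // 21,600 tokens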

What happens when a subagent fails? The boring answer is probably the right one: retry with exponential backoff, then escalate to the parent with an error. The parent decides whether to substitute a different approach or surface the failure to the user. Automatic substitution sounds clever but might hide problems that humans should see.
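
In code, that boring answer might look roughly like this (a sketch; the helper is hypothetical):

// Hypothetical retry-then-escalate wrapper around a subagent call
async function withRetries<T>(run: () => Promise<T>, maxAttempts = 3): Promise<T> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await run();
    } catch (err) {
      if (attempt === maxAttempts) throw err;        // escalate: let the parent decide
      const backoffMs = 1_000 * 2 ** (attempt - 1);  // 1s, 2s, 4s, ...
      await new Promise((resolve) => setTimeout(resolve, backoffMs));
    }
  }
  throw new Error("unreachable");
}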


What’s Next

The patterns are emerging. Production agent systems have proven what works. Now we need to turn these lessons into infrastructure that benefits everyone.

I started this post frustrated with complexity. Every agent framework I tried added abstraction layers without solving the hard problems. Configuration options multiplied while agents still choked on context. New features shipped while doom loops went unfixed. I kept trying new tools and spending more time learning their paradigms than actually getting work done. That’s a red flag. It means we haven’t found the right way to look at this space yet.

Ruby on Rails became popular not because it introduced a lot of new ideas, but because it made easy things trivial and hard things possible in a reasonable amount of time. That’s why everyone learned Ruby. We need the same thing for agents.

The answer isn’t more complexity. It’s the right complexity: conventions that encode hard-won lessons, primitives that make the correct architecture easy, and defaults that work out of the box.

I don’t think there’s one framework to rule them all. But the patterns matter. Whether you’re building agents, evaluating frameworks, or just trying to understand why your agent keeps running npm run build in a loop, these architectural ideas can help you reason about what’s going wrong and what to try next.

If more people internalized these patterns, we’d all waste less time on plumbing and more time on problems that actually matter. And that’s the point.