What Is a Harness
The term was formalized in early 2026 but the concept preceded it. The canonical formula, from LangChain's Vivek Trivedy: Agent = Model + Harness. The corollary: "if you're not the model, you're the harness."
The harness is the complete software infrastructure wrapping an LLM: orchestration loop, tools, memory, context management, state persistence, error handling, and guardrails. When someone says "I built an agent," they mean they built a harness pointed at a model.
LLM as CPU analogy (Beren Millidge, 2023): A raw LLM is a CPU with no RAM, no disk, no I/O. The context window is RAM (fast but limited). External databases are disk (large but slow). Tool integrations are device drivers. The harness is the operating system. "We have reinvented the Von Neumann architecture."
Framework vs. runtime vs. harness: A framework (LangChain) provides abstractions. A runtime (LangGraph) manages execution state. A harness is the opinionated, batteries-included layer that configures both with domain-specific constraints and infrastructure. The Node.js analogy: Node is the runtime, Express is the framework, Next.js is the harness.
Claude Code is a harness. Codex is a harness. Cursor is a harness. They all run the same underlying models; the harness is why they behave differently.
Why Harnesses Matter
Benchmark evidence: LangChain changed only the infrastructure wrapping their LLM (same model, same weights) and jumped from outside the top 30 to rank 5 on TerminalBench 2.0. "The model is a constant. The harness is the variable."
Claude Code's leaked source code (March 2026, accidentally shipped to npm) was 512,000 lines. That code is the harness. Even the makers of the best model in the world invest heavily in harnesses.
The derivation pattern: Harness components aren't arbitrary — they can be derived by working backwards from desired agent behavior to the engineering that enables it. Models take in data (text, images, audio) and output text. They cannot maintain durable state, execute code, access real-time knowledge, or set up environments out of the box. Each harness feature exists to close one of these gaps. The simplest example: "chatting" requires wrapping the model in a while loop that tracks messages. Every harness component follows this pattern: behavior we want → gap in the raw model → harness feature that bridges it.
R.E.S.T. framework (TRAE, Apr 2026): Four production-readiness objectives for any agent system — Reliability (fault recovery, idempotent operations, behavioral consistency), Efficiency (resource budgets, low-latency response, high throughput), Security (least privilege, sandboxed execution, I/O filtering), Traceability (end-to-end tracing, explainable decisions, auditable state). "To move agents beyond the toy stage, they must anchor on these four objectives."
Three Levels of Engineering
- Prompt engineering — crafts instructions the model receives
- Context engineering — manages what the model sees and when
- Harness engineering — encompasses both, plus all application infrastructure: tool orchestration, state persistence, error recovery, verification loops, safety enforcement, lifecycle management
12 Components of a Production Harness
-
Orchestration loop — the Thought-Action-Observation (TAO/ReAct) cycle. Often just a while loop; complexity lives in everything it manages. Anthropic describes their runtime as a "dumb loop" — all intelligence lives in the model.
-
Tools — the agent's hands. Defined as schemas injected into context. Handles registration, validation, argument extraction, sandboxed execution, result capture, formatting. Anthropic frames tools as a new kind of software: a contract between deterministic systems and non-deterministic agents. A function call like
getWeather("NYC")always fetches weather the same way; a tool invoked by an agent might be called, skipped, or preceded by a clarifying question depending on context. This means tool design must account for how agents perceive available actions (their "affordances"), not just what the tools technically do. See Tool Design Principles below for the implications. -
Memory — see section below.
-
Context management — model performance degrades 30%+ when key content falls in mid-window positions ("Lost in the Middle"). Context rot formalizes this: models become worse at reasoning and completing tasks as their context window fills, making context a precious, scarce resource. Three harness-level mitigations: compaction (intelligently summarizing existing context when the window nears capacity), tool call offloading (keeping only head and tail tokens of large tool outputs, writing the full result to the filesystem for on-demand access), and skills as progressive disclosure (loading skill front-matter lazily rather than injecting all tool definitions at startup, protecting the model from context rot before work begins). TRAE formalizes broader context management as a Token Transformation Pipeline: Collection (aggregate inputs + memory + retrievals) → Ranking (score by recency/relevance) → Compression (summarize low-density content) → Budgeting (allocate token limits per category) → Assembly (structured templates with explicit blocks). "Rather than hoping the model figures out what to focus on, actively build the context."
-
Prompt construction — assembles system prompt, tool definitions, memory files, conversation history, current message.
-
Output parsing — modern harnesses use native tool calling (structured tool_calls objects). Legacy: RetryWithErrorOutputParser.
-
State management — LangGraph uses typed dictionaries with checkpointing. Claude Code uses git commits as checkpoints and progress files as scratchpads. State Separation Principle (TRAE): treat the LLM strictly as a stateless compute unit ("CPU"). All state requiring cross-turn consistency — user sessions, task progress — must be offloaded to an external persistence engine controlled by the harness. The anti-pattern: forcing the LLM to maintain complex state via prompt engineering.
-
Error handling — "a 10-step process with 99% per-step success still has only ~90.4% end-to-end success." Errors compound. Four types: transient (retry), LLM-recoverable (return as ToolMessage), user-fixable (interrupt), unexpected (bubble up).
-
Guardrails and safety — OpenAI SDK: input, output, and tool-level guardrails plus tripwires. Claude Code gates ~40 tool capabilities independently with 3-stage permission flow.
-
Verification loops — "giving the model a way to verify its work improves quality by 2-3x" (Boris Cherny, Claude Code creator). Approaches: rules-based (tests/linters), visual (screenshots), LLM-as-judge.
-
Subagent orchestration — Claude Code: Fork, Teammate, Worktree execution models. OpenAI: agents-as-tools and handoffs. LangGraph: nested state graphs.
-
Long-horizon continuity — "Ralph Loop" pattern (Geoffrey Huntley): agent treats every session as a repeating loop — write a planning file at the start, execute one task, update the file before context resets.
Memory Is the Harness
Sarah Wooders (Letta/MemGPT): "Asking to plug memory into an agent harness is like asking to plug driving into a car." Memory isn't a plugin — it's core to what the harness does.
The harness makes invisible memory decisions no external plugin can control:
- How is CLAUDE.md / AGENTS.md loaded into context?
- How is skill metadata shown to the agent?
- Can the agent modify its own system instructions?
- What survives compaction, and what's lost?
- Are interactions stored and made queryable?
- How is memory metadata presented to the agent?
- How is the current working directory represented?
Claude Code's multi-level memory hierarchy (from leaked source code):
- Memory = index, not storage (MEMORY.md as pointers, ~150 chars/entry)
- 3-layer design: index (always loaded), topic files (on-demand), transcripts (grep only)
- Strict write discipline: write to file → then update index, never dump content into index
- Background memory rewriting (autoDream): merges, deduplicates, prunes, converts vague → absolute
- Memory is a hint, not truth — model must verify before acting
- What they don't store: no debugging logs, no code structure, no PR history ("if it's derivable, don't persist it")
Anthropic Dreams (Managed Agents API, May 2026): The API-level equivalent of Claude Code's autoDream. A dream is an async job that reads an existing memory store plus up to 100 past session transcripts, then produces a new, reorganized output store — duplicates merged, stale or contradicted entries replaced, new insights surfaced from session patterns. The input store is never modified, so the output can be reviewed and discarded if unsatisfactory. Dreams run asynchronously (minutes to tens of minutes depending on input size) and expose a live session you can stream to observe what the pipeline is reading and writing in real time. Supported models during research preview: claude-opus-4-7 and claude-sonnet-4-6. Billed at standard API token rates; cost scales roughly linearly with the number and length of input sessions source(https://platform.claude.com/docs/en/managed-agents/memory). The naming — "dreams" — parallels biological memory consolidation during sleep: the agent replays recent experience offline to reorganize long-term storage.
Letta's Context Constitution: Agents learn by actively managing their own context — creating durable token-space representations of what they know. Letta Code uses a git-versioned memory filesystem with background memory subagents for sleep-time compute.
Open vs. Closed Harnesses (Harrison Chase, LangChain)
The ownership question: If you use a closed harness (especially behind a proprietary API), you yield control of your agent's memory to a third party.
- Mildly bad: Stateful APIs (OpenAI Responses API, Anthropic server-side compaction) store state on their servers. Switching models means losing thread history.
- Bad: Closed harnesses like Claude Agent SDK manage memory in ways that are unknown and non-transferrable.
- Worst: When the whole harness including long-term memory is behind an API — you have zero ownership or visibility.
Model providers are incentivized to move memory behind APIs because it creates lock-in. Anthropic's Managed Agents is the clearest example: literally everything behind an API, locked to their platform — Dreams extends this by making memory consolidation a platform service, not just memory storage. Codex generates encrypted compaction summaries not usable outside OpenAI's ecosystem.
Open alternatives: Deep Agents (LangChain), open-source harnesses using open standards (agents.md, agentskills.io).
Thin Harness, Fat Skills (Garry Tan, YC)
The anti-pattern: a fat harness with thin skills. 40+ tool definitions eating half the context window, god-tools with 2-5 second MCP round-trips, REST API wrappers turning every endpoint into a separate tool.
The principle: Push intelligence up into skill files. Push execution down into deterministic tooling. Keep the harness thin.
Five definitions:
-
Skill files — reusable markdown that teaches the model how to do something (not what). Works like a method call: same procedure + different parameters = different capabilities. "Markdown is a more perfect encapsulation of capability than rigid source code, because it describes process, judgment, and context in the language the model already thinks in."
-
The harness — runs the model in a loop, reads/writes files, manages context, enforces safety. ~200 lines of code. That's it.
-
Resolvers — routing tables for context. When task type X appears, load document Y first. Skills have description fields; the model matches user intent to skill descriptions automatically.
-
Latent vs. deterministic — judgment and synthesis go in latent space (the model). Same-input-same-output tasks go deterministic (code, SQL, arithmetic). Confusing these is the most common mistake in agent design.
-
Diarization — the model reads everything about a subject and writes a structured profile. No RAG pipeline produces this. It requires reading, holding contradictions, noticing what changed, and synthesizing.
When skills improve themselves: Skills can have an /improve loop — read feedback, extract patterns, write updated rules back into the skill file. The skill rewrites itself. The system compounds.
Resolvers in Depth (Garry Tan)
A follow-up to "Thin Harness, Fat Skills" that expands on definition #3 (resolvers) — the component that got the least attention but matters most.
The problem: Garry's CLAUDE.md grew to 20,000 lines. The model's attention degraded; responses got slower and less precise. Claude Code literally told him to cut it back. The fix: 200 lines. A numbered decision tree with pointers to documents. "You can't make someone smarter by shouting louder. You make them smarter by giving the right book at the right moment."
The misfiling that revealed everything: An AI-filed policy analysis went to sources/ (raw data dumps) instead of civic/ (political analysis). Root cause: the idea-ingest skill had hardcoded brain/sources/ as default — it didn't consult the resolver. After auditing all 13 brain-writing skills, only 3 referenced the resolver. The other 10 had hardcoded paths.
The fix: A shared _brain-filing-rules.md document + mandate that every brain-writing skill reads RESOLVER.md before creating any page. One rule, ten skills fixed. Zero misfilings since.
The invisible skill problem: A skill that exists but isn't reachable is worse than a missing skill — it creates the illusion of capability. After a month of building, 40+ skills existed but 15% were "dark" (unreachable from the resolver). A signature-tracking system worked perfectly but nobody could invoke it because the resolver had no trigger for "check my signatures."
Resolver trigger evals: A test suite of 50 sample inputs with expected outputs. Two failure modes: false negatives (skill should fire but doesn't) and false positives (wrong skill fires). Both fixable by editing markdown, no code changes. "If you can't prove the right skill fires for the right input, you don't have a system. You have a collection of skills and a prayer."
check-resolvable meta-skill: Walks the entire chain (AGENTS.md → skill file → code) and finds dead links. First run found 6 unreachable skills out of 40+ (15%). Runs weekly as a linter for the resolver.
Context rot: Day 1 the resolver is perfect. Day 30, three new skills exist that nobody added. Day 60, trigger descriptions don't match user phrasing. Day 90, the resolver is a historical document. The endgame: a reinforcement learning loop where the system observes every task dispatch and periodically rewrites the resolver based on observed evidence. "A resolver that learns from its own traffic — that's the endgame for agent governance."
Resolvers are fractal: They compose at every layer:
- Skill resolver (AGENTS.md) — maps task types to skill files
- Filing resolver (RESOLVER.md) — maps content types to directories
- Context resolver (inside each skill) — sub-routing within the skill
Resolvers as management: Skills are employees. The resolver is the org chart. Filing rules are internal process. check-resolvable is audit and compliance. Trigger evals are performance reviews. "The problem isn't that models aren't smart enough. The problem is that we've been building organizations with no management layer."
GBrain: Open-sourced system shipping with the resolver pattern built in. gbrain init creates RESOLVER.md, the decision tree, and disambiguation rules. 25,000 files, 200 inputs/day, compounding. GStack (72K+ stars) is the coding layer; OpenClaw or Hermes Agent is the conductor.
Context as Knowledge Hierarchy
With the model fixed, accuracy is a function of context quality: bloated context buries the signal, missing context forces guessing, and both cost accuracy. The relationship is nonlinear — a task that scores 99% is worth 10x more than one at 95%. But the agent cannot hold the union of everything in context at once. The real objective: minimize the context spent per task, averaged over the task distribution.
This is exactly the problem a CPU faces. A program may touch gigabytes of data, but the storage next to the processor is tiny — so computers stack memory in tiers: a small, instant cache (L1), bigger-and-slower ones below it (L2, L3), then main memory and disk. It works because access is long-tailed: keep the hot set in the fast tier, reach down to the slow tiers only for the rare stuff.
Agent context should have the same structure:
L1 — always resident. The operations on the steep part of the frequency curve. These get feature-engineered, token-compressed, consequence-reporting wrappers that live in the system prompt on every single task. The investment in L1 is disproportionate because the agent pays the cost on every call. Concrete example from a spreadsheet agent: a getCellRange call normalizes 500 near-identical formulas into a single aliased pattern (R1C1 form), attaches the header row and row labels automatically (free context the model never asked for), and compresses cell styles into grouped ranges — turning 600 formulas and 400 styled cells into a handful of lines with zero information loss. On the write side, a structured diff groups and samples changed cells, then triages them into "clean writes" and "cells that need review" with MUST FIX flags for things like #REF! errors — a built-in linter on the agent's own edits source(https://x.com/brainsandtennis/status/2065190286519906657/?rw_tt_thread=True).
L2 — curated specs, on demand. Important-but-occasional capabilities (conditional formatting, pivot tables, charts) that would bloat every task if placed in L1. Written as hand-crafted prose — not type-signature dumps but gotcha-aware recipes that encode the canonical procedure and the constraints you'd otherwise learn only by failing. Retrieved via a single discovery step (e.g., console.log(getAPIInfo("pivot_tables"))). Cost: zero tokens until needed, one cache miss when invoked. The same pattern applies to deferred tools — a meta-tool wall of one-liners where the model loads full schemas only on selection (see deferred tools for the Claude implementation).
L3 — the raw substrate, plus a skill to mine it. The complete reference that can't be anticipated — 70K lines of raw API surface, too large for prompt context, but reachable via a short skill file (~100 lines) that teaches the model how to grep through it. Three to six targeted searches surface the exact signature needed. The system prompt makes this escape hatch explicit: "If the wrapped API can't do it, use the raw API — don't compromise." The agent should never be stuck.
Prompt budget shape mirrors the frequency curve: Most of the system prompt is L1 (a few hundred lines). L2 is ~50 lines of pointers and allowlists. L3 is ~5 lines — the skill name and a reference. The allocation is exactly what the cache-hierarchy framing predicts.
The tiers drift with model strength. Early, weak models needed tiny single-purpose tools and everything spelled out. Today's models absorb a larger L2 spec in one shot and reason over more raw L3 detail without choking. Yesterday's L3 becomes tomorrow's L2; yesterday's L2 collapses into L1. But the hierarchy itself never goes away — context will always be scarce relative to everything you could put in it, and noise will always cost accuracy. Bigger context windows tempt people to paste in more; the better instinct is summaries in cache, details on demand, the raw substrate as the last resort. (See also Harness Coevolution with Models.)
Porting the hierarchy to any domain requires three questions: (1) What do you wrap into L1? The bread-and-butter operations — make them brutally token-efficient and make them report consequences. (2) What do you defer to L2? The important-but-occasional capabilities — curated, English, gotcha-aware specs reachable in one discovery step. (3) What is your L3 escape hatch? The raw, complete substrate plus a skill that teaches the agent to mine it. It doesn't have to be ergonomic; it has to be reachable, complete, and findable in a bounded number of steps.
This pattern is the concrete architectural instantiation of Thin Harness, Fat Skills: L1 is the thin harness kept tight; L2 specs are the fat skills loaded by resolvers; L3 is the substrate that ensures the agent is never truly stuck.
OpenAI's Harness Engineering Lessons (Ryan Lopopolo)
Team built a product with 1M+ lines of code and 0 manually-written lines over 5 months (3.5 PRs per engineer per day, using Codex).
Key lessons:
-
Give agents a map, not a 1,000-page manual. One-big-AGENTS.md fails: crowds out task/code, creates non-guidance ("everything important" = nothing important), rots instantly, can't be mechanically verified. Solution: short AGENTS.md (~100 lines) as table of contents, with structured
docs/directory as system of record. Progressive disclosure from a stable entry point. -
Agent legibility first. Repository optimized for Codex's legibility, not human aesthetics. "From the agent's point of view, anything it can't access in-context while running effectively doesn't exist." Knowledge in Google Docs or Slack = invisible.
-
Enforce invariants, not implementations. Strict architectural constraints (layer dependencies enforced via custom linters), but freedom within those boundaries. Constraints are multipliers: "once encoded, they apply everywhere at once."
-
Garbage collection for AI slop. Full agent autonomy introduces entropy — agents replicate patterns that already exist, even bad ones. Solution: encode "golden principles," run background cleanup agents on a regular cadence. "Technical debt is like a high-interest loan — better to pay continuously in small increments."
-
Throughput changes merge philosophy. In high-throughput agent systems, corrections are cheap, waiting is expensive. Most review can be agent-to-agent.
Seven Harness Design Decisions
-
Single vs. multi-agent — maximize a single agent first. Split only when tool overload exceeds ~10 overlapping tools or clearly separate domains. Multi-agent adds routing overhead and context loss during handoffs.
-
ReAct vs. plan-and-execute — ReAct is flexible but higher per-step cost. Plan-and-execute separates planning from execution. LLMCompiler reports 3.6x speedup over sequential ReAct.
-
Context window strategy — five approaches: time-based clearing, summarization, observation masking, structured note-taking, sub-agent delegation. ACON research: 26-54% token reduction while preserving 95%+ accuracy by prioritizing reasoning traces over raw tool outputs.
-
Verification loop design — computational (tests, linters) provides deterministic ground truth. Inferential (LLM-as-judge) catches semantic issues but adds latency.
-
Permission architecture — permissive (fast but risky) vs. restrictive (safe but slow). Depends on deployment context.
-
Tool scoping — more tools often means worse performance. Vercel removed 80% of tools from v0 and got better results. Claude Code achieves 95% context reduction via lazy loading. See Tool Design Principles below for detailed guidance.
-
Harness thickness — Anthropic bets on thin harnesses + model improvement. Graph-based frameworks bet on explicit control. "Anthropic regularly deletes planning steps from Claude Code's harness as new model versions internalize that capability."
Tool Design Principles
Anthropic's internal tool evaluation work distills five principles for writing effective agent tools, validated by held-out test sets against their Slack, Asana, and other internal MCP servers.
Fewer tools, higher accuracy. Model accuracy degrades as tool count increases — every tool added is more schema in the prompt, more surface to confuse, more ways to pick the wrong one. Popular agents vary 4x on this most basic design question: Codex and Claude Code ship ~30 tools each; Pi ships 7. One spreadsheet agent collapsed all capabilities into a single execute_code tool — no read_range, no write_range, no make_chart — letting the model compose capabilities with the full expressive power of code instead of stitching together rigid tool calls source(https://x.com/brainsandtennis/status/2065190286519906657/?rw_tt_thread=True). This is consistent with Vercel's experience: they removed 80% of tools from v0 and got better results (see Seven Harness Design Decisions, point 6).
Choose the right tools (and skip the wrong ones). Agents have limited context; computer memory is cheap. A list_contacts tool that dumps an entire address book wastes the agent's context window on brute-force search — the better tool is search_contacts or message_contact. Build a few thoughtful tools targeting high-impact workflows, then scale from there. Tools should consolidate multi-step operations: instead of separate list_users, list_events, and create_event tools, a single schedule_event tool that finds availability and books the meeting. Instead of get_customer_by_id + list_transactions + list_notes, a single get_customer_context tool that compiles everything at once.
Namespace to define boundaries. When agents face dozens of MCP servers with hundreds of tools, overlapping or vaguely named tools cause confusion. Grouping by service (asana_search, jira_search) and by resource (asana_projects_search, asana_users_search) helps agents select the right tool. Prefix- vs. suffix-based namespacing has non-trivial effects on evaluation performance and varies by model — choose based on your own evals.
Return meaningful context, not raw data. Prioritize contextual relevance over flexibility. Eschew low-level identifiers (uuid, 256px_image_url, mime_type) in favor of semantically meaningful fields (name, image_url, file_type). Resolving arbitrary alphanumeric UUIDs to natural language or even 0-indexed IDs significantly improves precision in retrieval tasks by reducing hallucinations. Where agents need both natural-language and technical identifiers (e.g., search_user(name='jane') → send_message(id=12345)), expose a response_format enum ("concise" vs. "detailed") so the agent controls verbosity — concise for final answers, detailed when downstream tool calls need IDs.
Optimize for token efficiency. Implement pagination, range selection, filtering, and/or truncation with sensible defaults. Claude Code caps tool responses at 25,000 tokens by default. Truncated responses should steer agents toward efficient strategies ("many small targeted searches instead of one broad search"). Error responses should communicate specific, actionable improvements — not opaque error codes or tracebacks. The response structure itself (XML, JSON, Markdown) affects performance; there is no universal best format.
Prompt-engineer tool descriptions. Think of describing your tool to a new hire: make implicit context explicit — query formats, niche terminology, relationships between resources. Name parameters unambiguously (user_id not user). Even small refinements to descriptions yield dramatic gains: Claude Sonnet 3.5 achieved state-of-the-art SWE-bench Verified performance after precise tool description edits alone source(https://www.anthropic.com/engineering/writing-effective-tools-for-agents). MCP tool annotations disclose which tools require open-world access or make destructive changes.
Evaluation-driven improvement. The inner loop: (1) prototype tools, (2) generate realistic eval tasks grounded in real-world complexity (multi-tool, multi-step — not simplified sandboxes), (3) run evals programmatically with simple agentic loops, (4) analyze results — including agent reasoning/CoT — to identify rough edges, (5) let agents (e.g., Claude Code) refine tools by analyzing concatenated eval transcripts, (6) validate against held-out test sets. Anthropic's internal Slack and Asana tools improved beyond expert-written baselines through this cycle. Interleaved thinking helps probe why agents do or don't call certain tools. Tracking metrics beyond accuracy — runtime, token consumption, tool call counts, error rates — reveals consolidation opportunities and workflow patterns.
Cache-Aware Harness Design
Prompt caching — reusing computation from previous API roundtrips via prefix matching — is the constraint that shapes how production harnesses lay out every request. The Claude Code team (Thariq, Anthropic) treats cache hit rate as an uptime-level metric, running alerts and declaring SEVs when it drops.
The prefix-match constraint: The API caches everything from the start of a request up to each cache_control breakpoint. Any byte-level change anywhere in the prefix invalidates everything after it. This means request ordering is load-bearing architecture: static content first, dynamic content last. Claude Code's layout: (1) static system prompt + tools → (2) CLAUDE.md → (3) session context → (4) conversation messages. Each layer is cached at successively narrower scope (global, project, session). Fragile in practice — a timestamp in the system prompt, non-deterministic tool ordering, or updated tool parameters have all broken caching in production.
Messages over mutations: When information changes mid-session (time, file contents, user settings), updating the system prompt would invalidate the cache. Claude Code injects updates via <system-reminder> tags in the next user message or tool result instead, preserving the cached prefix.
Tool and model stability: Changing the tool set mid-conversation invalidates the entire cache, even if you're just removing a tool the agent doesn't need right now. Plan Mode was designed around this — rather than swapping to a read-only tool set, Plan Mode keeps all tools present and uses EnterPlanMode/ExitPlanMode as tools themselves. A system message explains the constraint; the tool definitions never change. Bonus: because EnterPlanMode is a callable tool, the agent can autonomously enter plan mode when it detects a hard problem. Similarly, switching models mid-session forces a full cache rebuild — counterintuitively, asking Opus to answer a simple question is cheaper than switching to Haiku if you're 100K tokens into a conversation, because the Haiku call pays uncached input rates for the entire history.
Defer loading instead of removing: When dozens of MCP tools are available, including all full schemas in every request is expensive, but removing them breaks the cache. Claude Code sends lightweight stubs (tool name with defer_loading: true) that the model discovers via tool search when needed. Full schemas load on selection. The cached prefix stays stable because the same stubs appear in the same order every time.
Cache-safe compaction: When the context window fills, the conversation must be summarized (compaction). The naive approach — a separate API call with a "summarize this" system prompt and no tools — diverges from the parent's cached prefix at the first token, paying full uncached rates for the entire conversation. Claude Code's fix: the compaction call uses the exact same system prompt, user context, and tool definitions as the parent conversation, with the parent's messages prepended and the compaction instruction appended as a new user message. From the API's perspective the request shares the parent's prefix, so the cache applies. This requires reserving a "compaction buffer" — enough headroom in the context window for the compaction prompt and summary output tokens source(https://claude.com/blog/lessons-from-building-claude-code-prompt-caching).
Design implications for harness builders: (1) Get the request ordering right and most caching works for free. (2) Model state transitions as tool calls, not tool set changes. (3) Monitor cache hit rate the way you monitor uptime. (4) Any fork operation (compaction, summarization, skill execution) must share the parent's prefix to get cache hits.
Agent Experience (AX): Harnesses for Agents, Not Humans
By April 2026, machine identities outnumber human users 45:1 in the average enterprise (some organizations up to 100:1). 80% of Neon databases created by AI agents. GitHub sees 5%+ of commits completely authored by Claude Code.
Agents operate through APIs, scripts, and structured commands — bypassing GUIs entirely. The new software stack:
- Skill files — encode practitioner expertise in machine-readable markdown (Figma Skills example: design system conventions, token structure agents would otherwise get wrong)
- CLI tools and MCP servers — the new interface layer. A CLI that accepts structured input and produces structured output is composable in ways a GUI never can be.
- Vertical models — see Vertical AI for the tradeoffs
Linear's error: built an embedded agent (GUI-first), but customers wanted MCP support so external agents could connect to Linear's data. Basecamp's success: full CLI + revamped API with structured JSON output.
Claude Code Hooks in Practice
Claude Code exposes hooks that fire at specific lifecycle points — session_start, pre_tool_use, post_tool_use, stop. These are a concrete implementation of the harness-as-context-manager concept.
The session_start hook is particularly powerful for personal operating systems: it fires when a new Claude Code session opens and can inject structured context (weekly priorities, active projects, past learnings, constraints) before the user types a word. This makes the system compound over time — every session builds on everything from previous sessions, rather than starting from an empty context window.
Practical demonstration (Dave Khaled, "Dex" personal OS, Apr 2026): Session start hook loads weekly goals, quarterly priorities, account health data, and a list of mistakes made in previous sessions. The CLAUDE.md file for this system deliberately stays short and acts as a map (progressive disclosure) — but the hook ensures the key context always lands, regardless of the CLAUDE.md content.
Key insight: CLAUDE.md is good guidance but not always adhered to. Session start hooks are adhered to every single time because they fire programmatically. For time-sensitive context (current priorities, recent learnings), hooks are more reliable than CLAUDE.md alone.
Operational tip: version-control your CLAUDE.md via git. In high-iteration systems, CLAUDE.md regressions are common — you'll want the ability to revert to a previous version that worked.
Real-World SMB Harness: "Peggy" (Charles Miller, Cooper Demolition)
Charles Miller, a construction business owner running Cooper Demolition (site preparation specialty subcontractor), built a multi-agent architecture he calls "Peggy" — proving that agent harnesses aren't just for tech companies. His framing: if a demolition contractor can use this, so can any SMB owner.
Hardware & Stack: Mac Mini ($600) running 24/7, accessed remotely via Tailscale. Claude Max ($100/mo), MS 365 (Outlook), Airtable (shared memory via API). Deliberately uses existing subscriptions to minimize implementation effort. More elegant stacks exist, but this gets SMBs off the ground with minimal risk.
Shared persona architecture: All agents operate under one identity ("Peggy Olson, Executive Assistant") sharing the same rulebook, CRM, memory file, and to-do list. This makes the system feel coherent rather than a pile of disconnected automations. Peggy has her own company email — employees interact by emailing her like any other teammate.
Four specialized agents:
- Finance Agent — Weekly WIP reports, 13-week cash flow model, AR aging/reconciliation on fixed schedules. Lands in the right inboxes before the workweek starts.
- Operations Agent — Scans email every 30 minutes, triages, maintains to-do list, sends morning briefing. Handles compliance tracking, project submittals, change orders, budget tracking. Directly interacts with project managers.
- Sales Agent — CRM updates with auto-captured contacts, pipeline monitoring, deal status tracking.
- Personal EA — Travel documents, loyalty programs, important dates, relationship context in CRM.
Why segregated agents: Parallel task performance and risk mitigation. Each agent only accesses its specialty area files — sandboxing is "a major reason for widespread Claude adoption when compared to options like OpenClaw."
Interaction model: Primarily through Claude Dispatch — single chat session on phone/laptop with full visibility into everything Peggy does. Also accepts emails from any employee. Background tasks via Claude Cowork scheduled tasks (not Claude Code) — chosen for the approachable UI, with expectation that capabilities will converge.
Key insight for harness design: The shared persona + shared memory + sandboxed tool access pattern makes multi-agent systems feel like one coherent assistant to non-technical users. This is a distribution strategy for agent adoption in AI-resistant industries.
Enterprise Context Synthesis (Hyperspell)
Conor Brennan-Burke (Hyperspell, Apr 2026) argues the industry framed the enterprise AI problem wrong — it's about understanding, not retrieval. Current agent integrations (MCP servers, API connectors) give agents access to company data but not understanding of it.
Access vs. understanding: A new employee with access to Google Drive, Slack, and CRM has access. After six months absorbing context, attending meetings, learning which sources matter — they have understanding. That transition is the entire game.
Five synthesis problems retrieval ignores:
- Data contradiction — Slack says deadline is Friday, Linear says Wednesday, PM said "end of month." Retrieval returns whichever it finds first; synthesis resolves the conflict.
- Entity resolution — "sarah.chen@acme.com" in email, "@sarah" in Slack, "Sarah from Acme" in a transcript are five unrelated strings to retrieval but one person to synthesis.
- Information decay — A six-month-old strategy doc treated with same confidence as a 10-minute-old message. Synthesis tracks when information was last confirmed.
- Source hierarchy — CEO's email vs. random Slack thread, signed contract vs. CRM field. Synthesis maintains authority ranking.
- Cross-source inference — "The migration is at risk" exists nowhere as a document — it only emerges from combining project tracker, calendar, hiring pipeline, and standup notes.
Delivery via filesystem: A context graph surfaced as files any agent can read without custom integration. Claude Code reads project directories, Cursor reads codebases, OpenClaw reads local filesystem. The filesystem is the one interface every agent already supports. The context layer is decoupled from any specific agent or vendor.
Compounding moat: Day 1, the context graph knows a little. Day 30, it has absorbed thousands of signals, resolved conflicts, built identity maps. Every new data point doesn't just add one fact — it might confirm a status, update a relationship, reveal a priority shift, and resolve a conflict between older sources. "The company that starts building today will have six months of compounded understanding that no amount of money can buy later."
The missing benchmark: No benchmark asks whether a system can synthesize fragmented, contradictory, multi-source company data into accurate answers. Kingsbury-style applied skepticism (see Testing section below) applied to the system, not the model.
Testing Pyramid for Agent Systems (Garry Tan)
Garry Tan's response to Kyle Kingsbury's "The Future of Everything is Lies" essay (Apr 2026) reframes model unreliability as an engineering problem, not a verdict. Kingsbury — the Jepsen author who spent a decade proving distributed databases didn't work as advertised — catalogues real LLM failures but tests models in isolation, "testing the engine on a bench and concluding that cars are unsafe."
The core argument: A naked model produces plausible text. A harnessed system produces verified text. The skill file says "check your work against source data." The deterministic code says "compare output to ground truth." The gap between plausible and verified is exactly what harness engineering fills.
Jepsen methodology applied to the right layer: Don't test whether the model hallucinates — of course it does. Test whether the system hallucinates:
- Does the harness prevent hallucinated data from reaching the user?
- Does the skill file route to deterministic code where precision matters?
- Does the resolver fire for the right inputs?
The testing pyramid for agent systems:
- Unit tests — for deterministic code
- Integration tests — for pipeline correctness
- Resolver trigger evals — for routing accuracy
- LLM-as-judge evals — for output quality
- End-to-end tests — for the full pipeline
Why open harnesses matter: Closed-source agents prevent users from writing the skill that verifies output. Real verification requires: "here is my schema, here are my invariants, here is what correct looks like in my domain — now verify against that." This requires open harnesses where the user controls the verification layer. As Pete Koomen (YC) puts it: "The user must write their prompt, otherwise we'll be slaves to a system prompt we can't see."
The Complexity Ratchet
A ratchet is a mechanism that allows motion in one direction only. In agent-coded software, every coding session adds three things to the codebase: tests that encode what "correct" means, documentation that records why decisions were made, and evaluation results that establish quality thresholds. The next agent session loads all three into context. It can't regress below the test suite, can't ignore the documentation, can't ship below the evaluation baseline. The quality floor goes up with every turn — forward-only motion.
Without the ratchet, vibecoded projects die at moderate complexity. The agent adds features but nothing prevents regression; by version 0.5 every change breaks something unexpected. "AI coding works fine. They just didn't build the ratchet."
The 90% threshold: Capers Jones studied over 10,000 software projects measuring defect removal efficiency (DRE). Below 70% coverage, DRE sits around 65–75%. At 85–95% coverage, DRE jumps to 92–97% — a nonlinear knee around 85% where defect escapes drop sharply source(https://x.com/garrytan). The avionics industry codified this in DO-178C, which requires modified condition/decision coverage (MC/DC) for flight-critical software — not because bureaucrats like paperwork, but because data showed that below certain thresholds, critical defects escape at rates incompatible with safety.
Why 90% is now free: Going from 70% to 90% used to require disproportionate human effort — Mockus, Nagappan, and Dinh-Trong's Vista study confirmed the last 20% takes far more work than the first 70%. That effort curve stopped human teams at 70–80%. AI agents don't experience effort. They write the fourteenth edge-case test as cheerfully as the first. The brutal last 20% that made 90% impractical is exactly the work agents are best at. "Getting to 90% used to be a heroic effort. Now it's a Tuesday."
Tests as institutional memory: In traditional teams, institutional memory lives in humans who leave. The agent's context window doesn't quit or get poached. When the test suite encodes a constraint and the documentation explains why, that knowledge is durable — any agent, any model, any time can load it. For solo projects, tests are the only institutional memory.
Everything Harnessable Is Testable
The ratchet's test surface extends far beyond traditional unit tests. Any layer a computer can observe is assertable:
- OS level — process trees, filesystem state, network sockets, cron schedules, database migrations
- Terminal/TTY level — keystroke sequences, interactive prompts, output streams
- Browser level — rendered pages, button states, navigation events, form interactions
- API level — structured responses, schema validation, status codes
- Behavioral level — did the agent follow the protocol, ask before deleting, stop when told to stop
TTY behavioral testing example: GStack's interactive plan review had a failure mode where Claude Code would skip the interactive dialogue and dump all findings in one shot. Traditional testing can't cover "did the AI have a conversation." The fix: a test harness using Bun's TTY functionality that spawns Claude Code in a pseudo-terminal, feeds it a scenario, triggers the review skill, and watches terminal output in real time. If the agent dumps findings without asking a question, the test fails. Three ratchet layers lock this in: STOP gates in skill instructions with anti-rationalization clauses, an anti-shortcut clause closing the exact loophole, and gate-tier floor tests that spawn the agent and verify the behavioral contract.
The aspirin analogy: "We don't know why transformer models have been so successful" — also true of aspirin (mechanism understood only in the 1970s), general anesthesia (still incompletely understood), and bicycle stability (definitively explained only in 2011). Practical utility doesn't require theoretical completeness.
Neurological Diagnostic Framework (Vox)
A complementary lens to the testing pyramid: diagnose agent failures the way a neurologist diagnoses a patient. Don't ask how smart the model is — ask which organ is failing. The model gives an agent thoughts; the harness gives it a body (eyes, hands, memory, brakes, self-check). If any organ fails, even the strongest model behaves like a sick patient.
Six conditions, each a real neurological or cognitive term mapped to a harness failure mode:
-
Source amnesia — the agent remembers a fact but has lost where it came from. In cognitive psychology, a source-monitoring error: the memory is intact, the source label is missing. More dangerous than forgetting — when the agent forgets, it stops to check; when the source is missing, it keeps walking forward with full confidence. Treatment: every memory needs three properties — source, scope, expiry. "A memory without a source is a clue, not a verdict." (Maps to harness component #3: Memory.)
-
Phantom limb state — the agent acts on a world state that no longer exists. A file changed, the environment changed, the task got rewritten — but the agent still patches based on what it read earlier. The agent's behavior looks reasonable because the path looks right, the diff looks right, the explanation looks right — it's just aimed at the old world. Treatment: re-perceive before acting. Re-read the file before editing it, recheck state before any dangerous operation. (Maps to components #4 and #7: Context management and State management.)
-
Locked-in syndrome — the model knows the next step but the tool channel is severed. The MCP server died, the command isn't on PATH, the browser session dropped, file permissions are wrong, or the API key isn't in the environment. Telling it to "try again" doesn't help — it isn't short on reasoning, it's short on actuators. Treatment: separate two diagnostic layers — did reasoning complete, and is the tool channel alive? (Maps to component #2: Tools.)
-
Confabulation — the medically accurate term for what the industry calls "AI hallucination." A hallucination is seeing something that isn't there; confabulation is filling a memory gap with a plausible fabrication. When retrieval fails, the agent produces something that looks like a source instead of admitting the gap. The 2026 paper HalluCitation counted nearly 300 papers across ACL 2024–2025 with at least one hallucinated reference source(https://x.com/Vox). Treatment: open every citation; if it doesn't resolve, remove it entirely. (Maps to component #10: Verification loops.)
-
Disinhibition — broken brakes. The agent's brake isn't conscience; it's the control plane: which actions require confirmation, which tools can't be triggered from memory, which external actions need human approval, which inputs are treated as untrusted. When this layer fails, any memory or external input can flow all the way to the action layer. "The danger is not that the agent can use tools. The danger is that memory and external input got execution rights they should never have had." Treatment: keep public posting, payments, deletion, deployment, and messaging outside model memory — the model can prepare actions but can't authorize them. (Maps to component #9: Guardrails and safety.)
-
Anosognosia — wrong, and unaware of being wrong. A coding agent runs the wrong tests and reports they passed. A research agent cites the wrong source and says the evidence is solid. "The same blind spot cannot self-check with the same blind model." Treatment: real self-check needs external signals — tests, fresh reads, trace review, a second verifier, tool output validation, human approval. (Maps to component #10: Verification loops.)
The unifying insight: six different conditions, one thing in common — a smarter model can't fix any of them. Only a more complete harness body can. Memory needs a source. Action needs fresh perception. Danger needs external approval. Confidence needs external evidence.
Three Dimensions of the Harness (Akshay Pachaar)
Akshay's follow-up framing (Apr 2026): the model itself should be deliberately thin, with intelligence pushed outward and composed at runtime through three dimensions:
- Memory — holds state the model shouldn't carry in weights or context. Working context, semantic knowledge, episodic experience, and personalized memory each have their own lifecycle.
- Skills — holds procedural knowledge. Operational procedures, decision heuristics, and normative constraints specialize the general model per task.
- Protocols — holds interaction contracts. Agent-to-user, agent-to-agent, and agent-to-tools are three distinct surfaces with their own failure modes.
Between the core and these modules sit mediators: sandboxing, observability, compression, evaluation, approval loops, and sub-agent orchestration. They govern how the harness reaches out and how state flows back in.
The useful question this framing unlocks: For any new capability, where should it live? Stable knowledge → memory. Learned playbooks → skills. Communication contracts → protocols. Loop governance → mediators.
Framework Implementation Comparison
How major frameworks implement the harness pattern (synthesized from Akshay's deep dive):
- Claude Agent SDK — Single
query()function creating an agentic loop, returning async iterator. "Dumb loop" runtime — all intelligence in the model. Gather-Act-Verify cycle. - OpenAI Agents SDK — Runner class with async/sync/streamed modes. Code-first: workflow logic in native Python. Codex extends with three-layer architecture: Core (agent code + runtime), App Server (bidirectional JSON-RPC), client surfaces. All surfaces share the same harness — "Codex models feel better on Codex surfaces than a generic chat window."
- LangGraph — Explicit state graph: two nodes (llm_call, tool_node) connected by conditional edge. Evolved from LangChain's AgentExecutor (deprecated in v0.2 for lack of extensibility). Deep Agents layer adds planning, file systems, subagent spawning, persistent memory.
- CrewAI — Role-based multi-agent: Agent (harness around LLM, defined by role/goal/backstory/tools), Task (unit of work), Crew (collection). Flows layer adds "deterministic backbone with intelligence where it matters."
- AutoGen (→ Microsoft Agent Framework) — Conversation-driven orchestration. Three-layer architecture (Core, AgentChat, Extensions). Five orchestration patterns: sequential, concurrent, group chat, handoff, and magentic (manager agent with dynamic task ledger).
Architecture: The Harness as Managed REPL
TRAE's comprehensive guide (Apr 2026) reframes the harness architecturally as a REPL container — a deterministic shell wrapping the non-deterministic brain. Read (Context Manager translates world → structured prompt), Eval (Call Interceptor routes tool calls with monitoring), Print (Feedback Assembler captures results → re-injects as observations), Loop (repeats until goal or termination). This maps directly to the PPAF cycle: Perception, Planning, Action, Feedback/Reflection.
Control Plane vs. Data Plane: Production harnesses decouple into a Control Plane (task scheduling, resource quotas, behavioral planning, policy enforcement) and a Data Plane (agent runtime instances, state/memory storage, sandboxed execution). Four functional layers: Interface → Orchestration → Execution → Infrastructure.
Sandboxing levels: Level 1 (Process-level: chroot/namespaces — fast, shared kernel, trusted tools only). Level 2 (Containers: Docker — industry standard for most tool execution). Level 3 (MicroVMs: Firecracker — independent kernels, multi-tenant). Level 4 (Full VMs: KVM/QEMU — maximum security, highest cost). Default recommendation: Level 2 + hardened kernel + read-only root filesystem; escalate to Level 3 for untrusted code.
Six design principles: Design for Failure (exceptions are the norm), Contract-First (all interactions via explicit schemas/APIs), Secure by Default (least privilege, zero trust, defense-in-depth), Separation of Concerns (decouple deciding from executing), Everything is Measurable (every behavior/decision quantifiable), Data-Driven Evolution (every run is a learning opportunity).
Cognitive maturity matrix: Two axes — Cognitive Loop (reactive → proactive plan & reflect) × Context Efficiency (manual/point-fed → sandboxed/automated injection). The maturity of the harness directly determines an agent's ability to move from passive/inefficient to proactive/efficient quadrants.
Harness Coevolution with Models
Models are now post-trained with specific harnesses in the loop. Claude Code's model learned to use the specific harness it was trained with. Changing tool implementations can degrade performance. A concrete example: OpenAI's Codex-5.3 prompting guide documents how the apply_patch tool logic for file editing is tightly coupled to training — a truly general intelligence should handle switching between patch methods trivially, but training with a harness in the loop creates overfitting to specific tool implementations source(https://developers.openai.com/cookbook/examples/gpt-5/codex_prompting_guide/#apply_patch).
This co-evolution creates a feedback loop: useful primitives are discovered in practice, added to the harness, then used when training the next generation of models. As the cycle repeats, models become more capable within the harness they were trained in — but not necessarily outside it. On the Terminal Bench 2.0 Leaderboard, Opus 4.6 in Claude Code scores far below Opus 4.6 in other harnesses, demonstrating that the best harness for a task is not necessarily the one a model was post-trained with.
The "future-proofing test": if performance scales up with more powerful models without adding harness complexity, the design is sound. Manus was rebuilt 5 times in 6 months, each rewrite removing complexity.
The long-term trajectory: As models improve, some of what currently lives in the harness will be absorbed into the model — planning, self-verification, and long-horizon coherence will require less external scaffolding. But harness engineering is unlikely to become obsolete: a well-configured environment, the right tools, durable state, and verification loops make any model more efficient regardless of its base intelligence. Open problems include orchestrating hundreds of agents working in parallel on a shared codebase, agents that analyze their own traces to identify and fix harness-level failure modes, and harnesses that dynamically assemble the right tools and context just-in-time rather than being pre-configured.
See also: Agentic Engineering, Claude Code Skill Frameworks
Sources
- "The Anatomy of an Agent Harness" — Akshay Pachaar (tweet thread, Apr 2026) (link)
- "What's an Agent Harness? And how do I choose the best one?" — Matt Abrams (tweet thread, Apr 2026) (link)
- "Your harness, your memory" — Harrison Chase, LangChain (tweet thread, Apr 2026) (link)
- "Why memory isn't a plugin (it's the harness)" — Sarah Wooders, Letta (tweet, Apr 2026) (link)
- "Thin Harness, Fat Skills" — Garry Tan, YC (tweet thread, Apr 2026) (link)
- "Harness engineering: leveraging Codex in an agent-first world" — Ryan Lopopolo, OpenAI (Feb 2026) (link)
- "Harness, Memory, Context Fragments, & the Bitter Lesson" — Viv (tweet, Apr 2026) (link)
- "The New Software: CLI, Skills & Vertical Models" — Sandhya (tweet thread, Apr 2026) (link)
- "Automate Your Entire Work Life With Claude Code — No Coding Needed" — Aakash Gupta / Dave Khaled (video, Apr 2026) (link)
- "Claude for Dummies (SMB Owners)" — Charles Miller (tweet thread, Apr 2026) (link)
- "Resolvers: The Routing Table for Intelligence" — Garry Tan, YC (tweet thread, Apr 2026) (link)
- "Your company needs a brain, not more connectors" — Conor Brennan-Burke, Hyperspell (tweet, Apr 2026) (link)
- "A harnessed LLM agent" — Akshay Pachaar (tweet thread, Apr 2026) (link) — expanded deep dive with framework comparison
- "Imagine if naked people were stupider" — Garry Tan, YC (tweet thread, Apr 2026) (link) — response to Kyle Kingsbury's "The Future of Everything is Lies"; testing pyramid for agent systems
- "The Definitive Guide to Harness Engineering" — TRAE (tweet thread, Apr 2026) (link) — comprehensive framework: R.E.S.T. objectives, REPL container architecture, six design principles, Token Transformation Pipeline, sandboxing levels, cognitive maturity matrix
- "Lessons from building Claude Code: Prompt caching is everything" — Thariq Shihipar, Anthropic (May 2026) (link) — prefix-match constraint, cache-safe forking for compaction, defer_loading for tool stability, Plan Mode as cache-aware design
- "Your OpenClaw / Hermes Gets Neurological Conditions Too" — Vox (tweet thread, May 2026) (link) — six neurological conditions mapped to agent harness failure modes: source amnesia, phantom limb state, locked-in syndrome, confabulation, disinhibition, anosognosia
- "The AI Agent Complexity Ratchet: Why 90% Test Coverage Is Required" — Garry Tan (tweet thread, May 2026) (link) — complexity ratchet mechanism (tests + docs + evals as forward-only quality floor), 90% coverage threshold backed by Capers Jones DRE data and DO-178C, AI agents removing the effort wall, expanded test surface including TTY behavioral testing
- "Dreams" — Anthropic Claude API Docs (May 2026) (link) — async memory consolidation for managed agents: reads memory store + session transcripts, produces deduplicated/reorganized output store
- "Writing effective tools for agents — with agents" — Ken Aizawa et al., Anthropic (Jun 2026) (link) — tool design principles (consolidation, namespacing, meaningful context, token efficiency, description prompt-engineering) and evaluation-driven improvement loop; validated on internal Slack/Asana MCP servers with held-out test sets
- "Building a Good Vertical Agent" — Peter Wang (tweet thread, Jun 2026) (link) — L1/L2/L3 knowledge hierarchy for agent context (CPU cache analogy), single-tool architecture, compression engineering for domain operations, prompt budget allocation mirroring task frequency curves; from building Shortcut (spreadsheet agent deployed at 3 of top 4 multistrategy hedge funds)
- "Deriving Agent Harnesses from First Principles" — Viv (Vivek Trivedy), LangChain (tweet thread, Jun 2026) (link) — systematic derivation of harness components from model limitations; filesystem as foundational primitive; context rot mitigations (compaction, tool call offloading, progressive disclosure); apply_patch overfitting as coevolution example; Terminal Bench 2.0 harness variance data