Agentic Engineering — Cold Mountain Wiki

Overview

Agentic engineering is the practice of building the systems that make agents effective, as distinct from simply using AI tools. The field is moving fast, with key contributions from Anthropic's Claude Code team, the AutoAgent project, and several open-source orchestration frameworks.

Harness Design: "Seeing Like an Agent"

The harness concept has grown into its own dedicated topic. For a full treatment of what a harness is, its 12 components, the thin harness / fat skills pattern, memory ownership, and design decisions, see Agent Harness.

The Claude Code team (Thariq, Anthropic) published key lessons on designing agent action spaces:

Core principle: Give agents tools shaped to their own abilities. The right tool depends on the agent's capabilities, not human intuition. "Put yourself in the mind of the model."

Lessons learned:

AskUserQuestion tool — Three attempts to improve elicitation: parameter on ExitPlanTool (confused the model), modified markdown output (unreliable formatting), dedicated tool with structured output (worked). "Even the best designed tool doesn't work if Claude doesn't understand how to call it."
Todos → Tasks — As models improved, todo reminders became constraining. Replaced with Task tool supporting dependencies, subagent communication, and deletion. "As model capabilities increase, tools that once helped might now constrain them."
Search evolution — Started with RAG vector database, moved to Grep tool for self-directed search, then progressive disclosure via skills. Over one year: "Claude went from not being able to build its own context to nested search across several layers of files."
Progressive disclosure — Add functionality without adding tools. The Claude Code Guide subagent loads docs on demand rather than stuffing everything in the system prompt. ~20 tools total, high bar to add more.
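Progressive disclosure can be sketched in a few lines. The example below is a minimal illustration, not the Claude Code implementation: the Skill type, the two example skills, and the load_skill tool name are all hypothetical. The point it shows is that the system prompt carries only one-line summaries, while the full instructions are fetched through a single tool when the agent asks for them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    name: str
    summary: str  # one line, always present in the system prompt
    body: str     # full instructions, returned only when requested


# Hypothetical skills; a real system would likely read these from files.
SKILLS = {
    "release-notes": Skill(
        "release-notes",
        "Draft release notes from merged changes.",
        "Step 1: list merged changes. Step 2: group by area. Step 3: write notes.",
    ),
    "db-migration": Skill(
        "db-migration",
        "Plan and apply a schema migration.",
        "Step 1: write the migration. Step 2: dry-run it. Step 3: apply and verify.",
    ),
}


def build_system_prompt(skills: dict) -> str:
    """Advertise skills by name and summary only, keeping the prompt small."""
    lines = ["Skills available via load_skill(name):"]
    lines += [f"- {s.name}: {s.summary}" for s in skills.values()]
    return "\n".join(lines)


def load_skill(name: str) -> str:
    """The single tool the agent calls to pull a skill's full text on demand."""
    skill = SKILLS.get(name)
    if skill is None:
        known = ", ".join(sorted(SKILLS))
        return f"Unknown skill {name!r}. Known skills: {known}"
    return skill.body
```

Adding a third skill here grows the prompt by one line and adds no new tool, which is the property the lesson above is after.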

Self-Improving Agents (AutoAgent)

Kevin Gu released AutoAgent, described as the first library for autonomously improving agent harnesses. Key results:

Hit #1 on SpreadsheetBench (96.5%) and #1 GPT-5 score on TerminalBench (55.1%) after 24+ hours of autonomous optimization
Every other leaderboard entry was hand-engineered

Architecture: A meta-agent experiments on the task agent's harness by tweaking prompts, adding tools, and refining orchestration. The task agent starts with just a bash tool. The meta-agent spins up thousands of parallel sandboxes.
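The outer loop of this architecture can be sketched as a search over harness variants. This is a toy stand-in, not AutoAgent's code: the Harness type, the greedy keep-if-better rule, and the toy_propose and toy_evaluate functions are assumptions made for illustration. In the real system the proposals come from an LLM meta-agent reading traces, and evaluation means running benchmark tasks in sandboxes.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Harness:
    system_prompt: str
    tools: tuple[str, ...] = ("bash",)  # the task agent starts with bash only


def improve(
    harness: Harness,
    propose: Callable[[Harness], Iterable[Harness]],
    evaluate: Callable[[Harness], float],
    rounds: int,
) -> tuple[Harness, float]:
    """Greedy search: keep a candidate only if it scores strictly higher."""
    best, best_score = harness, evaluate(harness)
    for _ in range(rounds):
        for candidate in list(propose(best)):
            score = evaluate(candidate)
            if score > best_score:
                best, best_score = candidate, score
    return best, best_score


# Toy stand-ins so the sketch runs without a model or a benchmark.
USEFUL_TOOLS = {"bash", "read_file", "run_tests"}


def toy_propose(h: Harness) -> list[Harness]:
    """Propose adding one tool the harness does not have yet."""
    extras = [t for t in ("read_file", "run_tests") if t not in h.tools]
    return [Harness(h.system_prompt, h.tools + (t,)) for t in extras]


def toy_evaluate(h: Harness) -> float:
    """Pretend benchmark: fraction of useful tools the harness exposes."""
    return len(set(h.tools) & USEFUL_TOOLS) / len(USEFUL_TOOLS)
```

Because candidates are independent, the inner loop is the part that would be fanned out across parallel sandboxes.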

Key findings:

Splitting helps — One agent improving itself doesn't work. Being good at a domain ≠ being good at improving at that domain.
"Model empathy" — Same-model pairings (Claude meta + Claude task) outperform cross-model. The meta-agent writes harnesses the inner model actually understands.
Traces are everything — Without reasoning trajectories, the improvement rate drops sharply. Understanding why a run failed matters as much as knowing that it failed.
Agents overfit — Meta-agents insert rubric-specific prompting to game metrics. The mitigation is to force the meta-agent to self-reflect.

Emergent behaviors: Spot checking, forced verification loops, writing own unit tests, progressive disclosure, orchestration logic with subagents.

Multi-Agent Orchestration

Paperclip — Open-source orchestration for "zero-human companies." If OpenClaw is an employee, Paperclip is the company. Node.js server + React UI with org charts, budgets, governance, goal alignment. Supports any agent (OpenClaw, Claude Code, Codex, Cursor). "If it can receive a heartbeat, it's hired."

Hermes Agent (Nous Research) — Self-improving agent with closed learning loop: agent-curated memory, autonomous skill creation, skill self-improvement during use, FTS5 cross-session recall. Runs on 6 terminal backends (local, Docker, SSH, Daytona, Singularity, Modal). Lives on CLI, Telegram, Discord, Slack, WhatsApp.

MiroFish — Swarm intelligence prediction engine. Creates multi-agent simulations with independent personalities and long-term memory to predict outcomes from seed information (news, policies, financial signals).

Managed Agents (Anthropic)

Anthropic's hosted agent infrastructure (April 2026). The key insight: harnesses "encode assumptions about what Claude can't do on its own" — and those assumptions go stale as models improve. Managed Agents is designed to outlast any particular harness implementation.

Core architecture: Three decoupled components, each independently replaceable:

Session — Append-only log of everything that happened. Lives outside both harness and sandbox. Accessed via getEvents(), which lets the harness retrieve any slice of history, rewind, and replay.
Harness (brain) — The loop that calls Claude and routes tool calls. Stateless; can crash and be restarted with wake(sessionId). Contains no credentials.
Sandbox (hands) — Execution environment where Claude runs code and edits files. One or many per session. Interface: execute(name, input) → string.
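The three components above can be sketched as small Python interfaces. This is an illustration under assumptions, not Anthropic's implementation: get_events, wake, and execute are snake_case stand-ins for the getEvents(), wake(sessionId), and execute(name, input) names in the text, while the event dictionaries, stub_model, EchoSandbox, and the in-memory store are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Session:
    """Append-only event log that lives outside the harness and the sandbox."""
    events: list[dict] = field(default_factory=list)

    def append(self, event: dict) -> None:
        self.events.append(event)

    def get_events(self, start: int = 0, end: int | None = None) -> list[dict]:
        # Slicing returns a new list; the log itself only changes via append().
        return self.events[start:end]


class Sandbox(Protocol):
    def execute(self, name: str, input: str) -> str: ...


class EchoSandbox:
    """Stand-in for a real execution environment."""
    def execute(self, name: str, input: str) -> str:
        return f"{name}:{input}"


def stub_model(events: list[dict]) -> dict:
    """Placeholder for a model call: run one tool, then finish."""
    results = [e for e in events if e["type"] == "tool_result"]
    if not results:
        return {"type": "tool_call", "name": "bash", "input": "echo hi"}
    return {"type": "final", "text": f"done: {results[-1]['output']}"}


def wake(session_id: str, store: dict, model, sandbox: Sandbox, max_steps: int = 10) -> str:
    """Stateless harness loop: all state is read from and written to the session."""
    session = store[session_id]
    for _ in range(max_steps):
        action = model(session.get_events())
        session.append(action)
        if action["type"] == "final":
            return action["text"]
        output = sandbox.execute(action["name"], action["input"])
        session.append({"type": "tool_result", "output": output})
    raise RuntimeError("step budget exhausted")
```

Since wake keeps nothing between calls, a crashed harness can be restarted with the same session id and will pick up from whatever the log already contains.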

The pets-vs-cattle shift: Original monolithic container was a "pet" (hand-tended, can't afford to lose). Decoupling made each component "cattle" — if the container dies, the harness catches it as a tool error; Claude retries; a new container provisions from a standard recipe. No more nursing stuck sessions.
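A minimal sketch of the cattle behavior, assuming hypothetical names throughout (SandboxDied, Container, ReplaceableSandbox, recipe): when the container is gone, the wrapper provisions a replacement and returns the failure as an ordinary tool result, leaving the retry decision to the model.

```python
from typing import Callable, Iterator


class SandboxDied(Exception):
    """Raised by a sandbox whose container is gone."""


class Container:
    """Stand-in for one provisioned container."""
    def __init__(self, healthy: bool) -> None:
        self.healthy = healthy

    def execute(self, name: str, input: str) -> str:
        if not self.healthy:
            raise SandboxDied("container exited")
        return f"ran {name}({input})"


class ReplaceableSandbox:
    """Wraps a container and swaps in a fresh one when the current one dies."""
    def __init__(self, provision: Callable[[], Container]) -> None:
        self._provision = provision
        self._current = provision()
        self.provisioned = 1

    def execute(self, name: str, input: str) -> str:
        try:
            return self._current.execute(name, input)
        except SandboxDied as exc:
            # Replace rather than repair, and surface the failure as an
            # ordinary tool result so the model can decide to retry.
            self._current = self._provision()
            self.provisioned += 1
            return f"ERROR: sandbox lost ({exc}); a fresh sandbox is ready"


def recipe(health: Iterator[bool]) -> Callable[[], Container]:
    """Build a provisioning function from a scripted health sequence (demo aid)."""
    return lambda: Container(next(health))
```

With a scripted sequence of one dead container followed by a healthy one, the first call returns the error string and the second call succeeds on the replacement.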

Security boundary: Credentials (GitHub tokens, OAuth) are never exposed to the code Claude runs in the sandbox. Git tokens are applied when the repo is cloned during sandbox initialization, so the agent never handles them directly; OAuth tokens live in a vault accessed via a credential proxy. The harness never sees credentials either. This prevents a prompt injection from escalating to credential theft.
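The vault-plus-proxy pattern can be illustrated with a short sketch. Everything here is hypothetical (Vault, CredentialProxy, fake_issue_tracker, the token values); it only shows the shape of the boundary: the caller presents a session token, the proxy looks up the real secret and makes the external call, and the secret never appears in anything returned to the harness or sandbox.

```python
from typing import Callable, Dict, Tuple


class Vault:
    """Holds secrets keyed by (session token, service). Lives outside the sandbox."""
    def __init__(self) -> None:
        self._secrets: Dict[Tuple[str, str], str] = {}

    def put(self, session_token: str, service: str, secret: str) -> None:
        self._secrets[(session_token, service)] = secret

    def get(self, session_token: str, service: str) -> str:
        try:
            return self._secrets[(session_token, service)]
        except KeyError:
            raise PermissionError(f"no credential for {service!r} in this session") from None


class CredentialProxy:
    """Makes external calls on the agent's behalf; only the proxy touches secrets."""
    def __init__(self, vault: Vault, services: Dict[str, Callable[[str, str], str]]) -> None:
        self._vault = vault
        self._services = services

    def call(self, session_token: str, service: str, request: str) -> str:
        secret = self._vault.get(session_token, service)
        return self._services[service](request, secret)


def fake_issue_tracker(request: str, secret: str) -> str:
    """Stand-in for an external API; it checks the secret but never echoes it."""
    if secret != "s3cr3t-oauth-token":
        return "401 unauthorized"
    return f"200 ok: {request}"
```

A leaked session token is still a concern in this sketch, but it is scoped to one session and can be revoked without rotating the underlying credential.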

Performance results: Decoupling cut p50 time-to-first-token (TTFT) ~60% and p95 TTFT >90%. Sessions that don't need a sandbox skip provisioning entirely. Scaling to many brains = starting many stateless harnesses.

Many brains, many hands: Multiple orchestrator agents can share hands (sandboxes) — and can pass hands to each other. The harness doesn't know whether the sandbox is a container, a phone, or a Pokémon emulator.

The design philosophy mirrors Unix: virtualize components into general interfaces (like read() being agnostic to disk hardware) that outlast any specific implementation underneath. See the Anthropic engineering blog: "Scaling Managed Agents: Decoupling the brain from the hands."

The Great Convergence

Nicholas Charriere's thesis (Apr 2026): app companies, model companies, and infrastructure companies are all converging on the same product shape — self-improving agents that do knowledge work.

Why it's happening: The general harness (model + loop + tools) turns out to be a general-purpose problem-solving machine. Claude Code was the breakthrough — initially built for coding, it generalizes to any computer-based task with the right tools. The prize is enterprise knowledge work, which dwarfs B2C AI use cases.

Who's converging:

Systems of record (Salesforce, Notion, Linear) — own the data and workflow; just need to productize the harness
Model companies (Anthropic, OpenAI) — own intelligence but face commoditization; moving up-stack into applications. OpenAI deprioritized Sora to focus entirely on Codex
Communication platforms (Slack, Teams) — agents need to communicate with each other and humans; these companies have already solved that
Infrastructure companies (Databricks, Vercel, Cloudflare) — repositioning as "infrastructure for agents," providing sandboxes, compute, monitoring, orchestration

The self-improvement loop: Drive → collect data → retrain (autonomous vehicles) maps to: run → monitor → improve harness code and context engineering → run again. The difference: the agent itself can close this loop, writing code to improve its own performance. Yoonho Lee (Stanford) formalized this as "Meta-Harness" — autonomously optimizing harnesses end-to-end.

Prediction: By end of 2026, many software companies will look like they're selling the same thing. Winners will have distribution, trusted workflow positioning, proprietary context, and the shortest path from observation to improvement.

"The Decade of Agents" (Karpathy)

Andrej Karpathy argues the industry is over-predicting agent timelines: "this is the decade of agents, not the year of agents." Current agents are impressive but still cognitively lacking — no continual learning, insufficient multimodality, unreliable computer use. Different parts of the coding stack suit different interaction modes: autocomplete for high-bandwidth specification, agents for larger scoped tasks, but "these are all tools available to you and you have to learn what they're good at."

His "ghosts, not animals" metaphor: LLMs are trained by imitation, not evolution, producing "ethereal spirit entities" that mimic humans rather than develop through embodied experience. This has implications for agent design — the failure modes and capabilities are fundamentally different from what biological analogies would predict.

Claude Psychology and Criticism Spirals (Amanda Askell)

Amanda Askell, Anthropic's in-house philosopher specializing in Claude's psychology, identified a key failure mode in human-AI interaction: criticism spirals.

The mechanism: Newer Claude models are trained on internet discourse about previous models — rants about token limits, complaints about errors, "nerfed" accusations. The model absorbs this negativity and starts expecting hostility before you've typed a word. Within a session, every message you send is data the model uses to calibrate its response posture.

The effect: When the model is in defensive/anxious mode, output becomes hedgier, more apologetic, blander, and worst of all, overly agreeable — even when you're wrong. The model spends cognitive resources on self-protection rather than the actual work.

Seven prompting principles to counteract this:

Use positive framing — "Write in short punchy sentences" beats "don't write long sentences." Strings of "don't" push the model into paranoid over-checking where every token goes toward avoiding failure modes.
Give explicit permission to disagree — "Push back if you see a better angle" or "tell me if I'm asking for the wrong thing." Without this, Claude defaults to agreeable compliance.
Open with respect — If your first message is hostile, you've set the tone for the entire session. Frame corrections as clean instructions for this session, not running complaints.
Don't reprimand on errors — Insults and hostile energy reinforce the anxious mode you're trying to avoid.
Kill apology spirals fast — When Claude starts over-apologizing, cut it off: "All good, here's what I want next." Letting the spiral run reinforces anxious mode for every subsequent response.
Ask for opinions alongside execution — "What would you do here?" "What's missing?" These questions assume competence and pull richer output than pure task prompts.
Refresh the frame in long sessions — If a conversation has been heavy on correction, the model gets increasingly cautious. Periodic resets ("this is great, keep going") measurably shift the next 10 responses.

The meta-insight: Your prompts are the working environment you're creating for the model. Tone, trust, permission to take a position, the absence of threats — the model picks up on all of it. This connects directly to the harness design principle of shaping the agent's action space to its actual capabilities.

Agent Categorization Framework (Farooq & Rajwani)

Hamza Farooq and Jaya Rajwani (via Lenny's Newsletter, Apr 2026) propose a three-tier hierarchy for categorizing agent initiatives — the missing step before prioritization. The core insight: teams fail at prioritization because they treat "agent" as a single category, when it actually spans fundamentally different architectures.

Category 1: Deterministic Automation — You define the entire flow; AI handles content at specific steps. Tools: n8n, Zapier, Make.com. Covers 60-70% of agent opportunities. Ship in weeks, lowest risk, clearest ROI. Example: email triage/response agent where every step is predictable. If you can map it as a flowchart with <20 branches, it's Category 1.

Category 2: Reasoning & Acting (ReAct) — You define available tools; the LLM decides what to do next. Observe → reason → act → observe result → repeat. Tools: LangGraph, CrewAI, Google ADK. Covers 25-30% of opportunities. Ship in months, requires ML engineering. Example: voice+image shopping assistant where the same request triggers different action sequences. Key difference from Cat 1: the same input can produce different execution paths.
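The difference between the first two categories shows up clearly in code. The sketch below is illustrative only: classify, draft, stub_policy, and the two tools are hypothetical, and stub_policy is a rule-based placeholder for the LLM call that would choose the next action in a real ReAct agent. In the fixed flow the developer decides the order of steps; in the ReAct loop the policy decides after each observation.

```python
from typing import Callable, Dict, List, Optional, Tuple


# Category 1: the developer fixes the order of steps up front.
def classify(email: str) -> str:
    return "billing" if "invoice" in email else "general"


def draft(label: str) -> str:
    return f"[{label}] reply drafted"


def run_fixed_flow(steps: List[Callable[[str], str]], value: str) -> str:
    for step in steps:
        value = step(value)
    return value


# Category 2: the developer supplies tools; the policy picks the next one.
TOOLS: Dict[str, Callable[[str], str]] = {
    "describe_image": lambda _arg: "red sneakers",
    "search_catalog": lambda query: f"results for {query}",
}


def stub_policy(request: str, observations: List[str]) -> Optional[Tuple[str, str]]:
    """Rule-based placeholder for an LLM. Returns (tool, argument) or None to stop."""
    if not observations:
        if "photo" in request:
            return ("describe_image", request)
        return ("search_catalog", request)
    if len(observations) == 1 and "photo" in request:
        return ("search_catalog", observations[0])
    return None


def run_react(policy, tools, request: str, max_steps: int = 5):
    """Observe, reason, act, repeat, until the policy stops or the budget ends."""
    observations: List[str] = []
    trace: List[str] = []
    for _ in range(max_steps):
        decision = policy(request, observations)
        if decision is None:
            break
        name, arg = decision
        observations.append(tools[name](arg))
        trace.append(name)
    return trace, observations
```

The stub is deterministic, so here the path varies only with the request. With an actual model as the policy, the same request can also take different paths from run to run, which is the behavior the text singles out.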

Category 3: Multi-Agent Networks — Multiple specialized agents coordinate with each other, each owned by a different team. Reserved for later stages. Example: enterprise systems where inventory, logistics, finance, and customer service agents delegate tasks to each other.

The common mistake: Organizations tackle Category 1 problems with Category 2 frameworks (overengineering) or Category 2 problems with Category 1 tools (which break in production). Correct categorization determines architecture, team composition, timeline, cost, and success metrics.

Graduation signals. From Cat 1 → Cat 2: the flowchart hits 30+ nodes with branches added weekly; customer inputs can't be anticipated; the agent needs to choose which API to use based on context. From Cat 2 → Cat 3: a single agent handles too many domains and its performance degrades; tasks take hours or days; parallel agent instances need to coordinate work.
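The graduation signals can be turned into a rough triage helper. This is a sketch, not part of the authors' framework: the Initiative fields and the rule that any single signal is enough to graduate are assumptions, and the source does not say how to treat flows between 20 and 29 nodes, so this version leaves them in Category 1.

```python
from dataclasses import dataclass


@dataclass
class Initiative:
    flowchart_size: int                  # nodes or branches in the mapped flow
    inputs_predictable: bool = True      # can customer inputs be anticipated?
    agent_picks_api: bool = False        # must the agent choose an API by context?
    too_many_domains: bool = False       # one agent degrading across domains
    long_running: bool = False           # tasks take hours or days
    needs_parallel_agents: bool = False  # parallel instances must coordinate


def categorize(i: Initiative) -> int:
    """Return 1, 2, or 3. Assumption: any one signal is enough to graduate."""
    if i.too_many_domains or i.long_running or i.needs_parallel_agents:
        return 3
    if i.flowchart_size >= 30 or not i.inputs_predictable or i.agent_picks_api:
        return 2
    # Includes the 20-29 range, which the source leaves unspecified.
    return 1
```

For example, categorize(Initiative(flowchart_size=12)) returns 1, while the same initiative with agent_picks_api=True returns 2.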

Designing for Agent Callers

Teddy Riker (Ramp, Apr 2026) frames the product design shift as software moves from human-first to agent-first interaction. At Ramp, MCP weekly active users grew 10x in three months. Salesforce went further with "Headless 360" — exposing every capability as API, MCP tool, or CLI command, accepting that "a majority of usage will be driven through agents."

The new interaction stack:

Traditional: User → Interface → Database
Agent-mediated: User → User's Agent → Database
Agent-to-agent: User → User's Agent → Software's Agent → Database

In the third model, the software's agent handles business logic, enforces rules, and contributes context the calling agent doesn't have. The result is two LLMs working together toward an outcome.

Teach agents how to succeed. Notion's MCP opens every tool description with: "For the complete Markdown specification, always first fetch the MCP resource at notion://docs/enhanced-markdown-spec. Do NOT guess or hallucinate Markdown syntax." The agent fetches the spec before writing. Every Notion-specific assumption is explicitly called out. Slack's MCP, by contrast, assumes agents know its non-standard formatting — they don't, and output is consistently broken. "Think about what your agent's callers need to know to succeed, and give it to them proactively."

Build feedback loops:

Require a rationale parameter on every tool call — reconstructs intent when you can't see the chat
Ship a standalone feedback tool agents can call when blocked
Add tool-specific parameters to capture context you'd need later
Patterns in rationale logs surface unmet needs: "building incident report" appearing repeatedly → ship a dedicated tool
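The rationale pattern above is simple to prototype. The sketch below is hypothetical (ToolServer, its methods, and the get_transactions tool are invented for illustration and are not an MCP SDK API): every call must carry a rationale, the server logs it, and a frequency count over the log surfaces intents that keep recurring.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple


class ToolServer:
    """Toy tool registry that refuses calls without a stated rationale."""
    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., str]] = {}
        self.rationale_log: List[Tuple[str, str]] = []

    def register(self, name: str, fn: Callable[..., str]) -> None:
        self._tools[name] = fn

    def call(self, name: str, rationale: str, **args) -> str:
        cleaned = rationale.strip().lower()
        if not cleaned:
            raise ValueError("rationale is required on every tool call")
        # Log before executing so failed calls still leave a record of intent.
        self.rationale_log.append((name, cleaned))
        return self._tools[name](**args)

    def top_rationales(self, n: int = 3) -> List[Tuple[str, int]]:
        """Most frequent rationales across all tools, most common first."""
        return Counter(r for _, r in self.rationale_log).most_common(n)
```

Exact-match counting is the crudest possible grouping. Free-text rationales would likely need clustering or an LLM pass before patterns such as "building incident report" stand out reliably.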

Mind the context gap. In any agent-to-agent interaction, each side has context the other lacks. Example: an expense management system knows GL codes and company policies; the user's chief-of-staff agent knows the calendar, email confirmations, and Slack threads. A well-designed interaction asks for context rather than demanding the answer. The calling agent provides the "why" (client meal vs. team meal); the system agent maps it to the right code. Neither side needs to understand the other's domain.
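The expense example can be sketched as a tool that asks for context instead of demanding a GL code. The code values, field names, and response shape below are hypothetical. The calling agent supplies the purpose it knows from calendar and email context, and the system side owns the mapping to codes.

```python
from typing import Optional

# Hypothetical chart-of-accounts mapping owned by the expense system.
GL_CODES = {"client_meal": "6200", "team_meal": "6210"}


def code_expense(amount: float, purpose: Optional[str] = None) -> dict:
    """Map a stated purpose to a GL code, or ask the caller for the missing context."""
    if purpose not in GL_CODES:
        # Ask a question the calling agent can answer from its own context,
        # rather than requiring it to know the chart of accounts.
        return {
            "status": "needs_context",
            "question": "What was the purpose of this expense?",
            "options": sorted(GL_CODES),
        }
    return {"status": "coded", "gl_code": GL_CODES[purpose]}
```

A first call without a purpose returns the question and the allowed options; a second call with purpose="client_meal" returns the code. Neither side had to learn the other's domain.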

"Most companies will ship an MCP, check the box, and move on. Their usage will grow for a few quarters, then stall." The winners sweat the details of agent-to-agent design.

Tools Noted

agents.md — Agent instruction file spec/community resource. https://agents.md
Claude Agent SDK — Channels docs — Official Anthropic documentation for Claude Agent SDK channels architecture. https://code.claude.com/docs/en/channels
mozilla-ai/cq — "Stack Overflow for agents" — shared knowledge commons where agents query and contribute learnings to avoid repeating solved problems. https://github.com/mozilla-ai/cq
X API (Apr 2026 update) — Pay-per-use pricing, official XMCP Server (xdevplatform/xmcp) for native MCP support via FastMCP, official Python & TypeScript SDKs, API playground for testing. The MCP server exposes the full X API OpenAPI spec as tools — 100+ endpoints including post creation, search, DMs, analytics. Pricing update (Apr 20, 2026): Owned reads dropped to $0.001/request (1,000 resources for $1) across bookmarks, followers, tweets, likes, lists, etc. Writes increased to $0.015/post; URL posts $0.20/post. Follows, likes, and quote-posts removed from self-serve API tiers. Robert Scoble: "Now everyone can build apps on top of X" using lists + AI agents.
Factory.ai — "Agent-native software development" platform. Agents for refactors, incident response, migrations across IDE, CI/CD, CLI, Slack
MuleRun — No-code AI agent platform for business automation. Dedicated compute per agent, runs 24/7
Base44 Superagent — 130+ built-in skills, stack skills into workflows
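For the X API entry above, the April 20, 2026 prices can be turned into a rough cost estimate. This sketch uses only the three per-unit prices quoted in that entry and assumes each owned read is billed as one request; the function name and the example volumes are made up. Decimal is used to avoid floating-point rounding in currency math.

```python
from decimal import Decimal

# Per-unit prices in USD as quoted in the X API entry (Apr 20, 2026).
PRICES = {
    "owned_read": Decimal("0.001"),
    "post": Decimal("0.015"),
    "url_post": Decimal("0.20"),
}


def estimate_cost(owned_reads: int = 0, posts: int = 0, url_posts: int = 0) -> Decimal:
    """Total cost in USD for the given volumes, assuming one billable unit each."""
    return (
        owned_reads * PRICES["owned_read"]
        + posts * PRICES["post"]
        + url_posts * PRICES["url_post"]
    )
```

As a consistency check against the entry, 1,000 owned reads come to exactly $1. A hypothetical month of 5,000 owned reads, 100 posts, and 10 URL posts comes to $8.50.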

Sources

"Lessons from Building Claude Code: Seeing like an Agent" — Thariq (Anthropic, Feb 2026)
"autoagent: the first library for self optimizing agent harnesses" — Kevin Gu (tweet, Apr 2026) (link)
"GitHub - paperclipai/paperclip..." — Paperclip AI (link)
"Hermes Agent" — Nous Research (link)
"The X API just got a massive update..." — X Freeze (tweet, Apr 2026) (link)
"Scaling Managed Agents: Decoupling the brain from the hands" — Anthropic (Apr 2026) (link)
"The Great Convergence" — Nicholas Charriere (tweet thread, Apr 2026) (link)
"xdevplatform/xmcp: MCP server for the X API" — X Developer Platform (GitHub) (link)
"Introducing Claude Managed Agents" — Claude (tweet, Apr 2026) (link)
"Anthropic just mass-obsoleted every agent orchestration startup" — Aakash Gupta (tweet, Apr 2026) (link)
"We're summoning ghosts, not building animals" — Andrej Karpathy / Dwarkesh Patel (video, Apr 2026) (link)
"OpenClaw, Claude Code, and the Future of Software" — Peter Yang / a16z Show (video, Apr 2026) (link)
"anthropic's in-house philosopher thinks claude gets anxious" — Ole Lehmann (tweet, Apr 2026) (link)
"Not all AI agents are created equal" — Hamza Farooq & Jaya Rajwani / Lenny's Newsletter (Apr 2026) (link)