Agent Patterns Hub

How to Build a Self-Healing Agent System

The dirty secret of AI agent systems: they break constantly. Not in dramatic, crash-to-desktop ways -- but in subtle, quality-degrading ways that propagate silently through your pipeline until the final output is confidently wrong.

Self-healing is not about preventing failures. It is about structuring the system so that failures are detected, isolated, and recovered from automatically. The patterns below come from building an autonomous agent system over 840+ commits; they are the ones that actually work.

The Five Failure Modes

Agent systems fail in five characteristic ways, each requiring a different recovery strategy:

  1. Hallucination cascade: An agent produces incorrect output. The next agent synthesizes it into plausible-looking incorrect output. By layer 3, the error is undetectable. Recovery: quality gates at every synthesis boundary.
  2. Context exhaustion: An agent fills its context window and starts losing information. Output quality degrades without any error signal. Recovery: orchestrator succession -- checkpoint state and spawn a fresh agent.
  3. Straggler blocking: One slow agent blocks the entire pipeline. The wall-clock time is determined by the slowest agent, not the median. Recovery: straggler detection with progressive pressure (warn, demand partial output, kill).
  4. Silent dependency failure: A tool or API the agent calls returns bad data. The agent trusts the bad data and produces wrong conclusions. Recovery: tool output validation on every external call.
  5. Runaway cost: A retry loop or poorly bounded search consumes unlimited tokens. Recovery: budget caps that automatically halt new work at 80% of the budget and kill the run at 100%.
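
The budget cap in failure mode 5 can be sketched as a small guard that every agent call reports to. This is a minimal sketch, not the system's actual implementation: `TokenBudget`, `BudgetAction`, and `record` are hypothetical names, and reading "halt" as "stop starting new work" and "kill" as "terminate the run" is an assumption.

```typescript
// Hypothetical budget guard; the names and the halt/kill split are assumptions.
type BudgetAction = "continue" | "halt" | "kill";

class TokenBudget {
  private used = 0;
  private readonly capTokens: number;
  private readonly haltFraction: number;

  constructor(capTokens: number, haltFraction = 0.8) {
    if (capTokens <= 0) throw new RangeError("capTokens must be positive");
    this.capTokens = capTokens;
    this.haltFraction = haltFraction;
  }

  // Record tokens consumed by one call and report what the caller should do next.
  record(tokens: number): BudgetAction {
    this.used += tokens;
    const fraction = this.used / this.capTokens;
    if (fraction >= 1.0) return "kill"; // cap reached: terminate the run
    if (fraction >= this.haltFraction) return "halt"; // stop starting new work
    return "continue";
  }
}
```

Keeping the guard outside the agent means a retry loop cannot talk its way past it: the caller checks the returned action before issuing the next request.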

Pattern 1: Orchestrator Succession

The orchestrator is the single point of failure in any hierarchical agent system. When it exhausts its context, the entire session is lost -- including the work of all child agents.

The fix: treat the orchestrator as replicated state, not a special being. Its identity lives in a state artifact on disk, not in its memory.

// The state artifact contains everything needed to resume
interface OrchestratorState {
  task_tree: TaskNode;
  agent_registry: Record<string, AgentStatus>;
  synthesis_plan: SynthesisStep[];
  checkpoints: Checkpoint[];
  generation: number; // increments on succession
}

// Succession trigger
if (context_usage >= 0.80) {
  writeState(stateArtifact);  // persist to disk
  spawnSuccessor(stateArtifact); // new orchestrator reads state
  terminateSelf();  // clean exit
}

// The new orchestrator reads the state artifact and resumes.
// Work recorded in the artifact is not lost. The pipeline continues.
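
The successor's side of the handoff might look like the sketch below. It assumes a reduced state shape (`ResumableState`) with only the registry and generation counter, a string-valued `AgentStatus`, and a JSON-serialized artifact; `handOff` and `resume` are hypothetical helpers. Treating every agent that is not "done" as needing a re-run is also an assumption.

```typescript
// Simplified stand-ins for the article's types; field shapes are assumptions.
type AgentStatus = "pending" | "running" | "done" | "failed";

interface ResumableState {
  agent_registry: Record<string, AgentStatus>;
  generation: number;
}

// Outgoing orchestrator: bump the generation and serialize for the successor.
function handOff(state: ResumableState): string {
  const next: ResumableState = {
    agent_registry: state.agent_registry,
    generation: state.generation + 1,
  };
  return JSON.stringify(next);
}

// Successor: parse the artifact and work out which agents still need to run.
function resume(artifact: string): { state: ResumableState; todo: string[] } {
  const state = JSON.parse(artifact) as ResumableState;
  if (typeof state.generation !== "number" || state.generation < 1) {
    throw new Error("state artifact has no valid generation counter");
  }
  // Anything not marked done (pending, running, failed) is scheduled again.
  const todo = Object.keys(state.agent_registry)
    .filter((id) => state.agent_registry[id] !== "done")
    .sort();
  return { state, todo };
}
```

In a real system the artifact would be written to disk before the successor starts, as in the trigger above; the string here just keeps the sketch free of file-system dependencies.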

Pattern 2: Quality Gates at Synthesis Boundaries

The most dangerous failure mode is cascade corruption. Four mandatory checks at every synthesis boundary:

  1. Coverage: Did all expected child agents contribute? If 3 of 8 agents failed, the synthesis is working with 62.5% of the data.
  2. Diversity: Are the outputs genuinely independent? If all agents produced nearly identical results, effective N is ~1 regardless of how many you deployed.
  3. Consistency: Do the outputs contradict each other? Contradictions are the most valuable signal -- do NOT resolve them by majority vote.
  4. Confidence calibration: Does self-reported confidence match observed quality? An agent reporting 95% confidence on a task where you expected 60% is a strong signal of context loss.
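
Three of the four checks above reduce to simple arithmetic and can be sketched as follows. Everything here is illustrative: the `AgentOutput` shape, the function names, the 0.2 overconfidence margin, and the use of word-set Jaccard similarity as a stand-in for a real independence measure are all assumptions. The consistency check is omitted because detecting contradictions needs semantic comparison that a short sketch cannot do honestly.

```typescript
// Hypothetical output record; field names are assumptions.
interface AgentOutput {
  agentId: string;
  text: string;
  confidence: number; // self-reported, in [0, 1]
}

// Coverage: fraction of expected agents that actually contributed.
function coverage(expectedIds: string[], outputs: AgentOutput[]): number {
  if (expectedIds.length === 0) return 1;
  const seen = new Set(outputs.map((o) => o.agentId));
  return expectedIds.filter((id) => seen.has(id)).length / expectedIds.length;
}

function wordSet(text: string): Set<string> {
  return new Set(text.toLowerCase().split(/\s+/).filter((w) => w.length > 0));
}

// Jaccard similarity of word sets: 1 means identical vocabulary, 0 means disjoint.
function jaccard(a: string, b: string): number {
  const wa = wordSet(a);
  const wb = wordSet(b);
  if (wa.size === 0 && wb.size === 0) return 1;
  let shared = 0;
  wa.forEach((w) => { if (wb.has(w)) shared += 1; });
  return shared / (wa.size + wb.size - shared);
}

// Diversity: mean pairwise similarity. Values near 1 suggest effective N is close to 1.
function meanPairwiseSimilarity(outputs: AgentOutput[]): number {
  let total = 0;
  let pairs = 0;
  for (let i = 0; i < outputs.length; i++) {
    for (let j = i + 1; j < outputs.length; j++) {
      total += jaccard(outputs[i].text, outputs[j].text);
      pairs += 1;
    }
  }
  return pairs === 0 ? 0 : total / pairs;
}

// Calibration: agents whose confidence exceeds the expected level by more than `margin`.
function overconfident(outputs: AgentOutput[], expected: number, margin = 0.2): string[] {
  return outputs.filter((o) => o.confidence - expected > margin).map((o) => o.agentId);
}
```

With 5 of 8 expected agents reporting, `coverage` returns 0.625, matching the 62.5% figure in the coverage check.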

Pattern 3: Straggler Management

In a real 210-agent orchestration session, one agent took 2,316 seconds while others took 200-500 seconds. The entire pipeline waited for the slowest agent.

Progressive pressure timeline:

  • At 1.0x median: Send "wrap up" signal
  • At 1.5x median: Send "produce best-effort partial output NOW"
  • At 2.0x median: Kill the straggler. Proceed with partial results.
  • Optional: At 1.5x, speculatively launch a replacement. First to finish wins.
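
The timeline above maps directly onto a threshold function. The sketch below assumes the median is computed over the durations of agents that have already finished; `median`, `stragglerAction`, and the action names are hypothetical. The optional speculative replacement is left out.

```typescript
// Hypothetical action names for the three pressure levels.
type StragglerAction = "none" | "wrap_up" | "demand_partial" | "kill";

// Median of a non-empty list of durations (seconds).
function median(values: number[]): number {
  if (values.length === 0) throw new RangeError("median of empty list");
  const sorted = values.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Map elapsed time, as a multiple of the median, to the pressure level to apply.
function stragglerAction(elapsedSec: number, medianSec: number): StragglerAction {
  if (medianSec <= 0) throw new RangeError("medianSec must be positive");
  const ratio = elapsedSec / medianSec;
  if (ratio >= 2.0) return "kill"; // proceed with partial results
  if (ratio >= 1.5) return "demand_partial"; // "produce best-effort partial output NOW"
  if (ratio >= 1.0) return "wrap_up"; // "wrap up" signal
  return "none";
}
```

For illustration, if finished agents took 200, 300, 400, and 500 seconds, the median is 350 seconds, so an agent still running at 700 seconds would be killed long before reaching the 2,316 seconds seen in the session described above.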

Pattern 4: Graceful Degradation

The output is not "the answer" -- it is "the best answer achievable given which agents succeeded, which failed, and which produced uncertain results."

Degradation levels:

  • FULL: All agents succeeded, synthesis complete.
  • PARTIAL: Some agents failed. Result annotated with gaps: "Sections X and Y not covered."
  • RAW: Synthesizer failed. Raw agent outputs returned with provenance metadata.
  • RESUMED: The orchestrator was replaced mid-run (succession). Result comes from a successor generation.
  • FAILED: State artifact lost AND no checkpoint AND no task spec. Requires three simultaneous failures.

A 75% coverage result with annotated gaps is far more valuable than a binary failure signal that discards all completed work.
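
One way to attach a degradation level to every result is a small classifier like the sketch below. The `RunOutcome` and `Report` shapes are hypothetical, and the precedence used when several conditions hold at once (FAILED, then RAW, then PARTIAL, then RESUMED, then FULL) is an assumption; the levels as listed do not define an ordering.

```typescript
type DegradationLevel = "FULL" | "PARTIAL" | "RAW" | "RESUMED" | "FAILED";

// Hypothetical summary of a finished run.
interface RunOutcome {
  expectedAgents: string[];
  succeededAgents: string[];
  synthesizerOk: boolean;
  generation: number; // 0 = original orchestrator, > 0 = a successor finished the run
  stateRecoverable: boolean; // false only if artifact, checkpoint, and task spec are all gone
}

interface Report {
  level: DegradationLevel;
  coverage: number; // fraction of expected agents that succeeded
  gaps: string[]; // agents whose sections are missing from the result
}

function classify(run: RunOutcome): Report {
  const gaps = run.expectedAgents.filter((id) => run.succeededAgents.indexOf(id) === -1);
  const coverage =
    run.expectedAgents.length === 0 ? 1 : 1 - gaps.length / run.expectedAgents.length;

  let level: DegradationLevel;
  if (!run.stateRecoverable) level = "FAILED";
  else if (!run.synthesizerOk) level = "RAW";
  else if (gaps.length > 0) level = "PARTIAL";
  else if (run.generation > 0) level = "RESUMED";
  else level = "FULL";

  return { level, coverage, gaps };
}
```

A run where 6 of 8 agents succeed is reported as PARTIAL with coverage 0.75 and the two missing agent IDs in `gaps`, which is the annotated result the paragraph above argues for.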

Pattern 5: Circuit Breakers for Cost and Quality

Every agent operation should be wrapped in a circuit breaker that monitors both cost and quality. When either degrades beyond threshold, the breaker opens and blocks further operations until conditions stabilize.

The three-state model: CLOSED (normal operation) -> OPEN (all requests blocked) -> HALF-OPEN (one test request allowed). If the test succeeds, return to CLOSED. If it fails, return to OPEN.
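
The three-state model can be sketched as a small class. Several details are assumptions rather than anything stated above: the breaker opens after a fixed number of consecutive failures, "conditions stabilize" is approximated by a fixed cooldown, and the caller decides what counts as a failure (a cost or quality threshold being exceeded). `CircuitBreaker` and its method names are hypothetical.

```typescript
type BreakerState = "CLOSED" | "OPEN" | "HALF_OPEN";

class CircuitBreaker {
  private state: BreakerState = "CLOSED";
  private failures = 0;
  private openedAt = 0;
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;

  constructor(failureThreshold = 3, cooldownMs = 30000) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
  }

  getState(): BreakerState {
    return this.state;
  }

  // Ask before each operation. `now` is injectable so the logic can be tested.
  allowRequest(now: number = Date.now()): boolean {
    if (this.state === "OPEN" && now - this.openedAt >= this.cooldownMs) {
      this.state = "HALF_OPEN"; // cooldown elapsed: let exactly one test request through
      return true;
    }
    return this.state === "CLOSED"; // OPEN and HALF_OPEN both block further requests
  }

  // Call when an operation stayed within its cost and quality thresholds.
  recordSuccess(): void {
    this.state = "CLOSED";
    this.failures = 0;
  }

  // Call when an operation exceeded a cost or quality threshold.
  recordFailure(now: number = Date.now()): void {
    this.failures += 1;
    if (this.state === "HALF_OPEN" || this.failures >= this.failureThreshold) {
      this.state = "OPEN";
      this.openedAt = now;
    }
  }
}
```

A caller wraps each agent operation in `allowRequest()`, then reports the outcome with `recordSuccess()` or `recordFailure()`. A failed test request in HALF_OPEN reopens the breaker and restarts the cooldown.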

Putting It Together

A self-healing agent system combines all five patterns:

  1. The orchestrator checkpoints its state regularly (succession readiness)
  2. Every synthesis boundary has four-dimensional quality gates
  3. Slow agents are detected and managed automatically
  4. Partial results are always preferred over total failure
  5. Cost and quality circuit breakers prevent runaway damage

The result: a system that absorbs failures the way a bridge absorbs load -- through redundancy of path, not redundancy of material.

All five patterns are detailed with full implementation templates in the Protocol Playbook.