
name: agent-swarm-coordinator
description: Coordinate teams of sub-agents for parallel and sequential work at scale. Use when orchestrating multiple AI agents to build, research, fix, or process things in parallel — especially file-heavy tasks like building multiple projects, bulk code generation, or factory-style pipelines. Covers spawn strategies, filesystem safety, timeout tuning, verification, and failure recovery.

Agent Swarm Coordinator

Patterns and rules for orchestrating teams of sub-agents on large-scale tasks. Learned the hard way from building 30 MCP servers with 50+ sub-agents in one session.

Core Principle: Filesystem is the Bottleneck

Multiple agents writing to the same directory tree = guaranteed corruption. The filesystem has no merge resolution. Last write wins. Agents WILL overwrite each other.

Spawn Strategies

Strategy 1: Parallel — Separate Directories (PREFERRED)

Each agent gets its own isolated directory. Merge results after all complete.

workspace/
  agent-1-output/server-a/   ← Agent 1 writes here only
  agent-2-output/server-b/   ← Agent 2 writes here only
  agent-3-output/server-c/   ← Agent 3 writes here only

After completion: verify, then rsync or cp to final location.

Use when: Building independent projects, researching separate topics, processing separate files.
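A minimal sketch of Strategy 1, assuming a POSIX shell. The spawn itself is a placeholder comment (substitute your real sub-agent invocation); here each "agent" just drops a stub file, and the merge uses `cp` as the doc suggests:

```shell
# Strategy 1: each agent writes only inside its own directory; merge after verification.
WORKSPACE=$(mktemp -d)
FINAL="$WORKSPACE/final"
mkdir -p "$FINAL"

for i in 1 2 3; do
  out="$WORKSPACE/agent-$i-output"
  mkdir -p "$out"
  # Placeholder for: spawn_agent --task "build server-$i" --output "$out"
  echo "server-$i stub" > "$out/server-$i.txt"
done

# Merge phase: copy a directory into the final tree only if it is non-empty.
for i in 1 2 3; do
  out="$WORKSPACE/agent-$i-output"
  if [ -n "$(ls -A "$out")" ]; then
    cp -R "$out"/. "$FINAL"/
  else
    echo "agent $i produced nothing -- respawn before merging" >&2
  fi
done
```

The verification step before each copy is the point: an empty output directory means a failed agent, and merging it silently would hide the failure.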

Strategy 2: Sequential — One at a Time

One agent finishes completely before the next starts. Slow but zero conflicts.

Use when: All agents need to modify the same files/repo, or agent N depends on agent N-1's output.

Strategy 3: Parallel — Disjoint File Sets

Multiple agents write to the SAME repo but strictly different subdirectories.

Use when: Each agent owns a completely separate subdirectory (e.g., servers/zendesk/ vs servers/mailchimp/). Works IF agents never touch shared files (package.json at root, shared types, etc.).

WARNING: If ANY shared files exist (root configs, shared modules), this degrades to Strategy 1 or 2.
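One way to catch that degradation before spawning: a pre-flight check, sketched here with one path-manifest per agent (the manifest format and the shared-file list are illustrative assumptions, not part of any real tool):

```shell
# Strategy 3 pre-flight: confirm no path appears in two manifests and
# no agent claims a known shared file.
tmp=$(mktemp -d)
printf '%s\n' servers/zendesk/tools.ts servers/zendesk/index.ts > "$tmp/agent-a.paths"
printf '%s\n' servers/mailchimp/tools.ts package.json           > "$tmp/agent-b.paths"

overlap=$(sort "$tmp"/agent-*.paths | uniq -d)
shared=$(grep -hE '^(package\.json|tsconfig\.json|src/shared/)' "$tmp"/agent-*.paths || true)

if [ -n "$overlap" ] || [ -n "$shared" ]; then
  echo "NOT SAFE for parallel writes"
  if [ -n "$overlap" ]; then echo "  duplicated path(s): $overlap"; fi
  if [ -n "$shared" ]; then echo "  shared file claimed: $shared"; fi
else
  echo "disjoint -- safe to parallelize"
fi
```

Here agent-b claims `package.json`, so the check fails and the coordinator should fall back to Strategy 1 or 2.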

Strategy 4: Pipeline — Stage Handoffs

Agent A does stage 1 (research), hands off to Agent B (build tools), hands off to Agent C (build apps).

Use when: Work has clear sequential stages with different skills needed per stage.

Batch Sizing

Agent task complexity                    Recommended batch size   Timeout
Simple (one deliverable, <500 LOC)       8-10 parallel            300s (5 min)
Medium (multiple files, 500-2000 LOC)    5 parallel               600s (10 min)
Heavy (full project, 2000+ LOC)          3 parallel               900s (15 min)
Mega (multi-project or research)         1-2 parallel             900s (15 min)

Never exceed 5 heavy agents simultaneously — context pressure on the coordinator grows fast.
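The table above maps to a simple spawn loop, sketched here with the GNU `timeout` utility and a `bash -c` body standing in for a real spawn command:

```shell
# One batch: N parallel agents, each capped by a timeout sized to task complexity.
BATCH=(server-a server-b server-c server-d server-e)   # 5 medium tasks
LIMIT=600                                              # 600s for medium work

pids=()
for name in "${BATCH[@]}"; do
  timeout "$LIMIT" bash -c "sleep 0.2; echo 'built $name'" &   # placeholder agent
  pids+=("$!")
done

# Collect exit statuses so timeouts and failures are counted, not ignored.
fail=0
for pid in "${pids[@]}"; do
  wait "$pid" || fail=$((fail + 1))
done
echo "batch done: $((${#BATCH[@]} - fail)) succeeded, $fail failed or timed out"
```

Waiting on each PID individually (rather than a bare `wait`) is what lets the coordinator know exactly which agents to respawn.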

Task Scoping Rules

Single-Purpose Agents Beat Multi-Purpose Agents

BAD: "Build the complete MCP server with tools, apps, types, server, and README"
GOOD: "Build 10 tool files for the Zendesk MCP server. Tools only. Don't touch anything else."

Split big jobs:

  1. Phase 1 agent: Build API client + types
  2. Phase 2 agent: Build tools (depends on types from phase 1)
  3. Phase 3 agent: Build apps (can reference tools for context)

Each agent has ONE clear deliverable. Smaller scope = higher success rate.

Never Say "Delete Everything and Rebuild"

This is the #1 factory killer. Agent deletes all files in minute 1, times out at minute 10 with 30% rebuilt. Server is now WORSE.

Instead:

  • "Build new files alongside existing ones"
  • "Write to /tmp/rebuild-{name}/ then I'll swap after verification"
  • "Add the missing tool files. Do NOT modify or delete existing files."

Git Safety (MANDATORY for shared repos)

  1. Commit after EACH agent completes — not after the whole batch
  2. Before spawning rebuild/fix agents: git add -A && git commit -m "checkpoint before rebuild"
  3. If an agent wipes files: git checkout HEAD -- path/to/dir/ to restore instantly
  4. Never let agents run git push — coordinator pushes after verification
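The checkpoint/restore cycle above, demonstrated end to end in a throwaway repo. The `rm` stands in for an agent wiping files mid-task, and `verify` is a stub:

```shell
# Build a scratch repo to demonstrate checkpoint -> wipe -> restore.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email swarm@example.com
git config user.name swarm
mkdir -p servers/zendesk
echo "original" > servers/zendesk/index.ts
git add -A && git commit -qm "initial"

verify() { [ -s servers/zendesk/index.ts ]; }      # stub verification

git add -A && git commit -qm "checkpoint before rebuild" --allow-empty
rm servers/zendesk/index.ts                        # agent wipes files, times out
if verify; then
  git add -A && git commit -qm "rebuilt and verified"
else
  git checkout HEAD -- servers/zendesk/            # instant restore from checkpoint
fi
```

Because the checkpoint commit exists, the restore is one command; without it, the wiped files are simply gone.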

Verification Protocol

NEVER trust agent completion messages. Agents report aspirational results, not actual results.

After every agent completes:

# Count actual deliverables
find src/tools -name "*.ts" | wc -l            # tools built?
find src/ui -name "*.tsx" | wc -l              # apps built?
find src -name "*.ts" -exec cat {} + | wc -l   # total LOC? (avoids needing shopt -s globstar)
npx tsc --noEmit 2>&1 | tail -5                # compiles?

If counts don't match agent's claims → respawn a focused fix agent.
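That comparison can be wrapped in a small check. In practice the claimed count would be parsed from the agent's completion message; here it is hard-coded for the demo:

```shell
# Compare an agent's claimed deliverable count against the filesystem.
verify_claim() {
  local dir=$1 pattern=$2 claimed=$3 actual
  actual=$(find "$dir" -name "$pattern" | wc -l)
  if [ "$actual" -lt "$claimed" ]; then
    echo "MISMATCH: claimed $claimed, found $actual -- respawn a focused fix agent"
    return 1
  fi
  echo "OK: $actual deliverables on disk (claimed $claimed)"
}

d=$(mktemp -d)
touch "$d/a.ts" "$d/b.ts"                       # the agent actually produced 2 files
result=$(verify_claim "$d" '*.ts' 10 || true)   # ...but claimed 10
echo "$result"
```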

Cron Monitor Anti-Pattern

NEVER run an automated cron monitor that spawns fix agents while you're also manually spawning agents.

What happens:

  1. You see Server X is broken, spawn fix agent
  2. Cron fires 2 minutes later, sees Server X is still broken, spawns ANOTHER fix agent
  3. Both agents fight over the same files
  4. Server X is now more broken than before

Rule: Disable any automated monitors before doing manual intervention. Re-enable after manual work is complete.

Failure Recovery Playbook

Agent timed out (most common)

  • Check what files exist — it probably got 60-80% done
  • Spawn a FOCUSED agent: "Complete the remaining work. These files exist: [list]. Build only what's missing."
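One way to build that focused prompt is to enumerate the survivors mechanically rather than by hand. Directory and file names below are illustrative:

```shell
# List what the timed-out agent left behind and fold it into the follow-up prompt.
d=$(mktemp -d)
touch "$d/client.ts" "$d/types.ts"    # partial output from the timed-out agent

existing=$(cd "$d" && ls *.ts | tr '\n' ' ')
prompt="Complete the remaining work. These files exist: $existing. Build only what's missing. Do NOT modify or delete existing files."
echo "$prompt"
# Placeholder for: spawn_agent --task "$prompt" --cwd "$d"
```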

Agent returned "no output"

  • Check filesystem directly — the agent may have written files but failed to report
  • If files exist and look good → count as success
  • If files don't exist → respawn with simpler task scope

Agent wiped files then timed out

  • git checkout HEAD -- path/ to restore
  • Respawn with explicit "DO NOT DELETE" instruction

Multiple agents corrupted each other

  • git checkout HEAD -- path/ to restore to last good state
  • Switch to sequential strategy for affected directories
  • Disable any cron monitors

Token Optimization

Reduce input tokens per agent:

  • Don't paste entire API docs — give the API base URL and let the agent research
  • Don't repeat the full project context — just give the specific directory and what to build
  • Reference files by path instead of pasting content

Reduce wasted runs:

  • Verify prerequisite files exist BEFORE spawning (don't spawn a "build apps" agent if types don't exist yet)
  • Use 15min timeouts for heavy builds (10min causes 30% waste from timeouts)
  • Single-purpose agents fail less often than multi-purpose ones
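The prerequisite check can be a one-liner gate before the spawn. A sketch, with illustrative file names and the spawn reduced to a decision string:

```shell
# Refuse to spawn phase 2 until phase 1's outputs exist and are non-empty.
prereqs_ok() {
  for f in "$@"; do
    [ -s "$f" ] || { echo "missing prerequisite: $f" >&2; return 1; }
  done
}

d=$(mktemp -d)
echo "export type Ticket = { id: string }" > "$d/types.ts"   # phase 1 delivered only types

if prereqs_ok "$d/types.ts" "$d/client.ts"; then
  decision="spawn phase-2 agent"
else
  decision="hold -- respawn phase 1 first"
fi
echo "$decision"
```

Here `client.ts` is missing, so the phase-2 spawn is held instead of burning a 10-minute run that was doomed from the start.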

Reduce retry cycles:

  • Commit after each success (git safety net)
  • Verify immediately after completion (catch problems early)
  • Fix specific issues, don't "rebuild everything"

Example: Building 30 MCP Servers

Optimal approach (what we SHOULD have done):

Batch 1 (5 servers): Spawn 5 parallel agents, each building to separate dirs
  → Wait for all 5 → Verify each → Commit each → Push
  
Batch 2 (5 servers): Same pattern
  → Repeat until all 30 done

For each server, 2-phase approach:
  Phase 1: "Build API client + types + tool files for {name} MCP"  (10min)
  Phase 2: "Build 15+ React apps for {name} MCP" (10min, after phase 1 verified)
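The batch-of-5, two-phase plan above as a shell skeleton. `spawn` and `verify` are placeholders; `spawn` here just logs what it would run:

```shell
# Batched two-phase orchestration: spawn 5, wait, verify, then phase 2, then next batch.
SERVERS=(zendesk mailchimp stripe github linear)   # first 5 of the 30

LOG=$(mktemp)
spawn()  { echo "spawn $1 :: $2" >> "$LOG"; }      # placeholder for real spawn
verify() { true; }                                 # placeholder for filesystem checks

total=${#SERVERS[@]}
for start in $(seq 0 5 $((total - 1))); do
  batch=("${SERVERS[@]:start:5}")
  for s in "${batch[@]}"; do
    spawn "$s" "Phase 1: build API client + types + tools" &
  done
  wait
  for s in "${batch[@]}"; do
    verify "$s" && spawn "$s" "Phase 2: build 15+ React apps" &
  done
  wait
  # After verification: git add -A && git commit -m "batch done" && git push
done
echo "$(grep -c '^spawn' "$LOG") spawns logged"
```

Each `wait` is a hard barrier: phase 2 never starts until every phase-1 agent in the batch has finished and been verified, and the next batch never starts until the current one is committed.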

What we actually did (don't repeat):

  • Spawned 10+ agents at once on the same repo
  • Had a cron monitor spawning MORE agents every 10 minutes
  • Gave "delete and rebuild" instructions
  • Trusted agent reports without filesystem verification
  • Result: 50+ agent sessions, massive token waste, files getting wiped and restored repeatedly