---
name: agent-swarm-coordinator
description: Coordinate teams of sub-agents for parallel and sequential work at scale. Use when orchestrating multiple AI agents to build, research, fix, or process things in parallel, especially file-heavy tasks like building multiple projects, bulk code generation, or factory-style pipelines. Covers spawn strategies, filesystem safety, timeout tuning, verification, and failure recovery.
---

# Agent Swarm Coordinator

Patterns and rules for orchestrating teams of sub-agents on large-scale tasks. Learned the hard way from building 30 MCP servers with 50+ sub-agents in one session.

## Core Principle: Filesystem is the Bottleneck

Multiple agents writing to the same directory tree = guaranteed corruption. The filesystem has no merge resolution. Last write wins. Agents WILL overwrite each other.

## Spawn Strategies

### Strategy 1: Parallel, Separate Directories (PREFERRED)

Each agent gets its own isolated directory. Merge results after all complete.

```
workspace/
  agent-1-output/server-a/   ← Agent 1 writes here only
  agent-2-output/server-b/   ← Agent 2 writes here only
  agent-3-output/server-c/   ← Agent 3 writes here only
```

After completion: verify, then `rsync` or `cp` to final location.

**Use when:** Building independent projects, researching separate topics, processing separate files.

### Strategy 2: Sequential, One at a Time

One agent finishes completely before the next starts. Slow but zero conflicts.

**Use when:** All agents need to modify the same files/repo, or agent N depends on agent N-1's output.

### Strategy 3: Parallel, Disjoint File Sets

Multiple agents write to the SAME repo but strictly different subdirectories.

**Use when:** Each agent owns a completely separate subdirectory (e.g., `servers/zendesk/` vs `servers/mailchimp/`). Works IF agents never touch shared files (package.json at root, shared types, etc.).

**WARNING:** If ANY shared files exist (root configs, shared modules), this strategy is unsafe. Fall back to Strategy 1 or 2.
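One way to check the "strictly different subdirectories" condition of Strategy 3 before spawning is to compare the paths each agent is allowed to write. The sketch below assumes a hypothetical convention that is not part of the session described here: one manifest file per agent, listing one writable path per line. `find_overlaps` is an illustrative name, and the check compares exact path strings only, so it will not notice that one agent's directory contains another agent's file.

```bash
# Hypothetical helper: each argument is a manifest file listing the paths
# (one per line) that a single agent is allowed to write.
# Prints every path that appears in more than one manifest.
find_overlaps() {
  local manifest
  for manifest in "$@"; do
    # De-duplicate within a manifest so a path listed twice by one agent
    # is not mistaken for a conflict between agents.
    sort -u "$manifest"
  done | sort | uniq -d
}
```

If `find_overlaps agent-1.txt agent-2.txt` prints anything (for example a root `package.json`), the file sets are not disjoint and Strategy 1 or 2 is the safer choice.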
### Strategy 4: Pipeline, Stage Handoffs

Agent A does stage 1 (research), hands off to Agent B (build tools), hands off to Agent C (build apps).

**Use when:** Work has clear sequential stages with different skills needed per stage.

## Batch Sizing

| Agent task complexity | Recommended batch size | Timeout |
|---|---|---|
| Simple (one deliverable, <500 LOC) | 8-10 parallel | 300s (5min) |
| Medium (multiple files, 500-2000 LOC) | 5 parallel | 600s (10min) |
| Heavy (full project, 2000+ LOC) | 3 parallel | 900s (15min) |
| Mega (multi-project or research) | 1-2 parallel | 900s (15min) |

**Never exceed 5 heavy agents simultaneously.** Context pressure on the coordinator grows fast.

## Task Scoping Rules

### Single-Purpose Agents Beat Multi-Purpose Agents

BAD: "Build the complete MCP server with tools, apps, types, server, and README"

GOOD: "Build 10 tool files for the Zendesk MCP server. Tools only. Don't touch anything else."

Split big jobs:

1. **Phase 1 agent:** Build API client + types
2. **Phase 2 agent:** Build tools (depends on types from phase 1)
3. **Phase 3 agent:** Build apps (can reference tools for context)

Each agent has ONE clear deliverable. Smaller scope = higher success rate.

### Never Say "Delete Everything and Rebuild"

This is the #1 factory killer. Agent deletes all files in minute 1, times out at minute 10 with 30% rebuilt. Server is now WORSE.

Instead:

- "Build new files alongside existing ones"
- "Write to `/tmp/rebuild-{name}/` then I'll swap after verification"
- "Add the missing tool files. Do NOT modify or delete existing files."

## Git Safety (MANDATORY for shared repos)

1. **Commit after EACH agent completes**, not after the whole batch
2. **Before spawning rebuild/fix agents:** `git add -A && git commit -m "checkpoint before rebuild"`
3. **If an agent wipes files:** `git checkout HEAD -- path/to/dir/` to restore instantly
4. **Never let agents run `git push`**; the coordinator pushes after verification

## Verification Protocol

**NEVER trust agent completion messages.** Agents report aspirational results, not actual results.

After every agent completes:

```bash
# Count actual deliverables
find src/tools -name "*.ts" | wc -l            # tools built?
find src/ui -name "*.tsx" | wc -l              # apps built?
# Total LOC across all .ts files. Uses find instead of src/**/*.ts,
# which only recurses in bash when `shopt -s globstar` is enabled.
find src -name "*.ts" -exec cat {} + | wc -l
npx tsc --noEmit 2>&1 | tail -5                # compiles?
```

If counts don't match the agent's claims → respawn a focused fix agent.

## Cron Monitor Anti-Pattern

**NEVER run an automated cron monitor that spawns fix agents while you're also manually spawning agents.**

What happens:

1. You see Server X is broken, spawn fix agent
2. Cron fires 2 minutes later, sees Server X is still broken, spawns ANOTHER fix agent
3. Both agents fight over the same files
4. Server X is now more broken than before

**Rule:** Disable any automated monitors before doing manual intervention. Re-enable after manual work is complete.

## Failure Recovery Playbook

### Agent timed out (most common)

- Check what files exist; it probably got 60-80% done
- Spawn a FOCUSED agent: "Complete the remaining work. These files exist: [list]. Build only what's missing."
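The "these files exist / build only what's missing" prompt for a timed-out agent can be assembled from a filesystem check rather than from the agent's own report. This is a minimal sketch, assuming you kept a list of the files the agent was supposed to produce. `list_missing` and the paths in the usage note are hypothetical.

```bash
# Hypothetical helper: print which expected files are absent from a directory.
# Usage: list_missing <dir> <relative-path>...
list_missing() {
  local dir="$1"
  shift
  local path
  for path in "$@"; do
    # Print the relative path only when nothing exists at that location.
    [ -e "$dir/$path" ] || printf '%s\n' "$path"
  done
}
```

For example, `list_missing servers/zendesk src/tools/tickets.ts src/tools/users.ts` prints only the paths that still need to be built, which can be pasted into the focused agent's prompt. It checks existence only, so a half-written file still counts as present and needs a separate look.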
### Agent returned "no output"

- Check the filesystem directly; the agent may have written files but failed to report
- If files exist and look good → count as success
- If files don't exist → respawn with simpler task scope

### Agent wiped files then timed out

- `git checkout HEAD -- path/` to restore
- Respawn with explicit "DO NOT DELETE" instruction

### Multiple agents corrupted each other

- `git checkout HEAD -- path/` to restore to last good state
- Switch to sequential strategy for affected directories
- Disable any cron monitors

## Token Optimization

### Reduce input tokens per agent:

- Don't paste entire API docs; give the API base URL and let the agent research
- Don't repeat the full project context; just give the specific directory and what to build
- Reference files by path instead of pasting content

### Reduce wasted runs:

- Verify prerequisite files exist BEFORE spawning (don't spawn a "build apps" agent if types don't exist yet)
- Use 15min timeouts for heavy builds (10min causes 30% waste from timeouts)
- Single-purpose agents fail less often than multi-purpose ones

### Reduce retry cycles:

- Commit after each success (git safety net)
- Verify immediately after completion (catch problems early)
- Fix specific issues, don't "rebuild everything"

## Example: Building 30 MCP Servers

Optimal approach (what we SHOULD have done):

```
Batch 1 (5 servers): Spawn 5 parallel agents, each building to separate dirs
  → Wait for all 5
  → Verify each
  → Commit each
  → Push
Batch 2 (5 servers): Same pattern
  → Repeat until all 30 done

For each server, 2-phase approach:
  Phase 1: "Build API client + types + tool files for {name} MCP" (10min)
  Phase 2: "Build 15+ React apps for {name} MCP" (10min, after phase 1 verified)
```

What we actually did (don't repeat):

- Spawned 10+ agents at once on the same repo
- Had a cron monitor spawning MORE agents every 10 minutes
- Gave "delete and rebuild" instructions
- Trusted agent reports without filesystem verification
- Result: 50+ agent sessions, massive token waste, files getting wiped and restored repeatedly
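The batching in the optimal approach can be sketched as a small helper that splits a list of server names into fixed-size groups, so each group is spawned, verified, and committed before the next one starts. `batch_names` is a hypothetical name, and the sketch assumes server names contain no whitespace or quote characters, since `xargs` would split or reinterpret them.

```bash
# Hypothetical helper: print names in batches, one batch per line.
# Usage: batch_names <batch-size> <name>...
batch_names() {
  local size="$1"
  shift
  # Emit one name per line, then let xargs regroup them <size> at a time.
  # With no command given, xargs runs echo, which prints each group on one line.
  printf '%s\n' "$@" | xargs -n "$size"
}
```

A coordinator loop could then read one line at a time (`batch_names 5 zendesk mailchimp ... | while read -r batch; do ...; done`) and run the spawn, wait, verify, and commit steps for that batch before reading the next line.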