| name | description |
|---|---|
| agent-swarm-coordinator | Coordinate teams of sub-agents for parallel and sequential work at scale. Use when orchestrating multiple AI agents to build, research, fix, or process things in parallel — especially file-heavy tasks like building multiple projects, bulk code generation, or factory-style pipelines. Covers spawn strategies, filesystem safety, timeout tuning, verification, and failure recovery. |
Agent Swarm Coordinator
Patterns and rules for orchestrating teams of sub-agents on large-scale tasks. Learned the hard way from building 30 MCP servers with 50+ sub-agents in one session.
Core Principle: Filesystem is the Bottleneck
Multiple agents writing to the same directory tree = guaranteed corruption. The filesystem has no merge resolution. Last write wins. Agents WILL overwrite each other.
Spawn Strategies
Strategy 1: Parallel — Separate Directories (PREFERRED)
Each agent gets its own isolated directory. Merge results after all complete.
workspace/
  agent-1-output/server-a/   ← Agent 1 writes here only
  agent-2-output/server-b/   ← Agent 2 writes here only
  agent-3-output/server-c/   ← Agent 3 writes here only
After completion: verify, then rsync or cp to final location.
Use when: Building independent projects, researching separate topics, processing separate files.
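A minimal sketch of the separate-directories layout, assuming the coordinator is scripted in Python. The function names `make_agent_dirs` and `merge_outputs` are hypothetical, and the merge step stands in for the rsync or cp mentioned above; it should only run after verification.

```python
import shutil
from pathlib import Path


def make_agent_dirs(workspace: Path, names: list[str]) -> dict[str, Path]:
    """Create one isolated output directory per agent, keyed by project name."""
    dirs = {}
    for i, name in enumerate(names, start=1):
        out = workspace / f"agent-{i}-output" / name
        # exist_ok=False: fail loudly if two agents would share a directory.
        out.mkdir(parents=True, exist_ok=False)
        dirs[name] = out
    return dirs


def merge_outputs(dirs: dict[str, Path], final_root: Path) -> list[Path]:
    """Copy each (already verified) agent directory into the final location."""
    merged = []
    for name, src in dirs.items():
        dest = final_root / name
        if dest.exists():
            # Never silently overwrite an existing project.
            raise FileExistsError(f"refusing to overwrite {dest}")
        shutil.copytree(src, dest)
        merged.append(dest)
    return merged
```

Refusing to overwrite an existing destination keeps the merge from becoming another "last write wins" step.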
Strategy 2: Sequential — One at a Time
One agent finishes completely before the next starts. Slow but zero conflicts.
Use when: All agents need to modify the same files/repo, or agent N depends on agent N-1's output.
Strategy 3: Parallel — Disjoint File Sets
Multiple agents write to the SAME repo but strictly different subdirectories.
Use when: Each agent owns a completely separate subdirectory (e.g., servers/zendesk/ vs servers/mailchimp/). Works IF agents never touch shared files (package.json at root, shared types, etc.).
WARNING: If ANY shared files exist (root configs, shared modules), this strategy is no longer safe. Fall back to Strategy 1 or 2.
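Before spawning, the disjointness requirement can be checked mechanically. The sketch below assumes each agent is given an explicit list of paths it may write to; `find_conflicts` is a hypothetical helper, not part of any agent framework.

```python
from pathlib import PurePosixPath


def find_conflicts(assignments: dict[str, list[str]]) -> list[tuple[str, str, str]]:
    """Return (agent_a, agent_b, path) for each pair of agents with overlapping paths.

    Two paths overlap when they are equal or one is an ancestor of the other.
    The reported path is the shallower of the two.
    """
    owned = [
        (agent, PurePosixPath(p))
        for agent, paths in assignments.items()
        for p in paths
    ]
    conflicts = []
    for i, (agent_a, path_a) in enumerate(owned):
        for agent_b, path_b in owned[i + 1:]:
            if agent_a == agent_b:
                continue
            overlap = (
                path_a == path_b
                or path_a in path_b.parents
                or path_b in path_a.parents
            )
            if overlap:
                shared = path_a if len(path_a.parts) <= len(path_b.parts) else path_b
                conflicts.append((agent_a, agent_b, str(shared)))
    return conflicts
```

An empty result means the assignments are disjoint; any entry means the batch should fall back to separate directories or sequential runs.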
Strategy 4: Pipeline — Stage Handoffs
Agent A does stage 1 (research), hands off to Agent B (build tools), hands off to Agent C (build apps).
Use when: Work has clear sequential stages with different skills needed per stage.
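One way to sketch the stage handoff, assuming each stage is wrapped in a callable that spawns its agent, verifies the output, and returns True only if verification passed. `run_pipeline` and the `Stage` alias are hypothetical names.

```python
from typing import Callable

# (stage name, callable that runs the stage and returns True if its output verified)
Stage = tuple[str, Callable[[], bool]]


def run_pipeline(stages: list[Stage]) -> list[str]:
    """Run stages in order and stop at the first one that fails verification.

    Returns the names of the stages that completed, so the coordinator
    knows where to resume.
    """
    completed = []
    for name, run_and_verify in stages:
        if not run_and_verify():
            break  # later stages depend on this one, so do not start them
        completed.append(name)
    return completed
```

Stopping at the first failed stage avoids spawning an agent whose inputs do not exist yet.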
Batch Sizing
| Agent task complexity | Recommended batch size | Timeout |
|---|---|---|
| Simple (one deliverable, <500 LOC) | 8-10 parallel | 300s (5min) |
| Medium (multiple files, 500-2000 LOC) | 5 parallel | 600s (10min) |
| Heavy (full project, 2000+ LOC) | 3 parallel | 900s (15min) |
| Mega (multi-project or research) | 1-2 parallel | 900s (15min) |
These are recommended starting sizes. As a hard ceiling, never run more than 5 heavy agents simultaneously: context pressure on the coordinator grows fast.
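The table can be encoded as a lookup so batch sizes are not picked ad hoc. This is a sketch: where the table gives a range (8-10, 1-2), the lower end is used as a conservative assumption, and `POLICIES`, `BatchPolicy`, and `plan_batches` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BatchPolicy:
    max_parallel: int
    timeout_s: int


# Values taken from the batch sizing table; lower end of each range (assumption).
POLICIES = {
    "simple": BatchPolicy(max_parallel=8, timeout_s=300),
    "medium": BatchPolicy(max_parallel=5, timeout_s=600),
    "heavy": BatchPolicy(max_parallel=3, timeout_s=900),
    "mega": BatchPolicy(max_parallel=1, timeout_s=900),
}


def plan_batches(tasks: list[str], complexity: str) -> list[list[str]]:
    """Split tasks into consecutive batches no larger than the policy allows."""
    size = POLICIES[complexity].max_parallel
    return [tasks[i:i + size] for i in range(0, len(tasks), size)]
```

For example, seven heavy tasks become batches of 3, 3, and 1, each run with a 900-second timeout.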
Task Scoping Rules
Single-Purpose Agents Beat Multi-Purpose Agents
BAD: "Build the complete MCP server with tools, apps, types, server, and README"
GOOD: "Build 10 tool files for the Zendesk MCP server. Tools only. Don't touch anything else."
Split big jobs:
- Phase 1 agent: Build API client + types
- Phase 2 agent: Build tools (depends on types from phase 1)
- Phase 3 agent: Build apps (can reference tools for context)
Each agent has ONE clear deliverable. Smaller scope = higher success rate.
Never Say "Delete Everything and Rebuild"
This is the #1 factory killer. Agent deletes all files in minute 1, times out at minute 10 with 30% rebuilt. Server is now WORSE.
Instead:
- "Build new files alongside existing ones"
- "Write to `/tmp/rebuild-{name}/` then I'll swap after verification"
- "Add the missing tool files. Do NOT modify or delete existing files."
Git Safety (MANDATORY for shared repos)
- Commit after EACH agent completes, not after the whole batch
- Before spawning rebuild/fix agents: `git add -A && git commit -m "checkpoint before rebuild"`
- If an agent wipes files: `git checkout HEAD -- path/to/dir/` to restore instantly
- Never let agents run `git push`; the coordinator pushes after verification
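If the coordinator is scripted, the checkpoint and restore commands can be wrapped so they are run the same way every time. This sketch assumes `git` is on the PATH; `checkpoint_cmds`, `restore_cmd`, and `run_all` are hypothetical helper names.

```python
import subprocess
from pathlib import Path


def checkpoint_cmds(message: str) -> list[list[str]]:
    """Commands for a checkpoint commit of everything in the working tree."""
    return [["git", "add", "-A"], ["git", "commit", "-m", message]]


def restore_cmd(path: str) -> list[str]:
    """Command that restores a path to its state at the last commit."""
    return ["git", "checkout", "HEAD", "--", path]


def run_all(cmds: list[list[str]], repo: Path) -> None:
    """Run each command inside the repo, raising on the first failure.

    Note: `git commit` exits non-zero when there is nothing to commit,
    so a clean tree raises CalledProcessError here.
    """
    for cmd in cmds:
        subprocess.run(cmd, cwd=repo, check=True)
```

Passing arguments as lists rather than one shell string avoids quoting problems in commit messages and paths.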
Verification Protocol
NEVER trust agent completion messages. Agents report aspirational results, not actual results.
After every agent completes:
# Count actual deliverables
find src/tools -name "*.ts" | wc -l # tools built?
find src/ui -name "*.tsx" | wc -l # apps built?
find src -name "*.ts" -exec cat {} + | wc -l # total LOC?
npx tsc --noEmit 2>&1 | tail -5 # compiles?
If counts don't match agent's claims → respawn a focused fix agent.
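The same check can be automated by comparing filesystem counts against what the agent claimed. A sketch, assuming the `src/tools` and `src/ui` layout used in the commands above; `count_deliverables` and `shortfalls` are hypothetical names.

```python
from pathlib import Path


def count_deliverables(root: Path, patterns: dict[str, str]) -> dict[str, int]:
    """Count files under root matching each glob pattern, keyed by label."""
    return {
        label: sum(1 for p in root.glob(pattern) if p.is_file())
        for label, pattern in patterns.items()
    }


def shortfalls(claimed: dict[str, int], actual: dict[str, int]) -> dict[str, tuple[int, int]]:
    """Return {label: (claimed, actual)} for every label where actual < claimed."""
    return {
        label: (want, actual.get(label, 0))
        for label, want in claimed.items()
        if actual.get(label, 0) < want
    }
```

A typical call would use patterns such as `{"tools": "src/tools/**/*.ts", "apps": "src/ui/**/*.tsx"}`. A non-empty result from `shortfalls` is the signal to respawn a focused fix agent; a compile check is still needed separately.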
Cron Monitor Anti-Pattern
NEVER run an automated cron monitor that spawns fix agents while you're also manually spawning agents.
What happens:
- You see Server X is broken, spawn fix agent
- Cron fires 2 minutes later, sees Server X is still broken, spawns ANOTHER fix agent
- Both agents fight over the same files
- Server X is now more broken than before
Rule: Disable any automated monitors before doing manual intervention. Re-enable after manual work is complete.
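One possible way to enforce this rule is a lock file that the monitor checks before spawning anything. The sketch assumes the monitor script can be modified to call `monitor_may_spawn`; the class name, function name, and lock path are all hypothetical.

```python
import os
from pathlib import Path


class ManualIntervention:
    """Context manager that holds a lock file while a human is spawning agents."""

    def __init__(self, lock_path: Path):
        self.lock_path = lock_path

    def __enter__(self):
        # O_EXCL makes creation fail if the lock already exists,
        # so two manual sessions cannot both think they hold it.
        fd = os.open(self.lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.close(fd)
        return self

    def __exit__(self, exc_type, exc, tb):
        self.lock_path.unlink(missing_ok=True)
        return False  # do not swallow exceptions


def monitor_may_spawn(lock_path: Path) -> bool:
    """The monitor should call this first and skip its run when it returns False."""
    return not lock_path.exists()
```

Wrapping manual work in `with ManualIntervention(lock_path):` re-enables the monitor automatically when the block exits, even if the manual work raises.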
Failure Recovery Playbook
Agent timed out (most common)
- Check what files exist — it probably got 60-80% done
- Spawn a FOCUSED agent: "Complete the remaining work. These files exist: [list]. Build only what's missing."
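The focused prompt can be generated from the filesystem rather than typed by hand. A sketch, assuming the coordinator knows the list of files the task was supposed to produce; `resume_prompt` is a hypothetical helper and the wording follows the prompt quoted above.

```python
from pathlib import Path
from typing import Optional


def resume_prompt(target_dir: Path, expected: list[str]) -> Optional[str]:
    """Build a 'finish the rest' prompt, or return None if nothing is missing."""
    existing = [f for f in expected if (target_dir / f).is_file()]
    missing = [f for f in expected if f not in existing]
    if not missing:
        return None
    return (
        "Complete the remaining work. "
        f"These files exist: {', '.join(existing) or 'none'}. "
        f"Build only what's missing: {', '.join(missing)}. "
        "Do NOT modify or delete existing files."
    )
```

A file that exists but is truncated still counts as existing here, so this complements the verification step rather than replacing it.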
Agent returned "no output"
- Check filesystem directly — the agent may have written files but failed to report
- If files exist and look good → count as success
- If files don't exist → respawn with simpler task scope
Agent wiped files then timed out
- `git checkout HEAD -- path/` to restore
- Respawn with explicit "DO NOT DELETE" instruction
Multiple agents corrupted each other
- `git checkout HEAD -- path/` to restore to last good state
- Switch to sequential strategy for affected directories
- Disable any cron monitors
Token Optimization
Reduce input tokens per agent:
- Don't paste entire API docs — give the API base URL and let the agent research
- Don't repeat the full project context — just give the specific directory and what to build
- Reference files by path instead of pasting content
Reduce wasted runs:
- Verify prerequisite files exist BEFORE spawning (don't spawn a "build apps" agent if types don't exist yet)
- Use 15min timeouts for heavy builds (10min causes 30% waste from timeouts)
- Single-purpose agents fail less often than multi-purpose ones
Reduce retry cycles:
- Commit after each success (git safety net)
- Verify immediately after completion (catch problems early)
- Fix specific issues, don't "rebuild everything"
Example: Building 30 MCP Servers
Optimal approach (what we SHOULD have done):
Batch 1 (5 servers): Spawn 5 parallel agents, each building to separate dirs
→ Wait for all 5 → Verify each → Commit each → Push
Batch 2 (5 servers): Same pattern
→ Repeat until all 30 done
For each server, 2-phase approach:
Phase 1: "Build API client + types + tool files for {name} MCP" (10min)
Phase 2: "Build 15+ React apps for {name} MCP" (10min, after phase 1 verified)
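The "spawn a batch, wait for all, then verify" step above could be sketched with `asyncio`, assuming agents are launched through some awaitable call. `spawn_agent` is a hypothetical stand-in for whatever the agent framework actually provides, and `run_batch` is an illustrative name.

```python
import asyncio
from typing import Awaitable, Callable


async def run_batch(
    names: list[str],
    spawn_agent: Callable[[str], Awaitable[None]],  # hypothetical launcher
    timeout_s: float,
) -> dict[str, str]:
    """Run one agent per name in parallel, each with its own timeout.

    Returns a status per name: "done", "timeout", or "error: <message>".
    "done" only means the agent returned; its output still needs verification.
    """

    async def run_one(name: str) -> tuple[str, str]:
        try:
            await asyncio.wait_for(spawn_agent(name), timeout=timeout_s)
            return name, "done"
        except asyncio.TimeoutError:
            return name, "timeout"
        except Exception as exc:
            return name, f"error: {exc}"

    results = await asyncio.gather(*(run_one(n) for n in names))
    return dict(results)
```

Because every failure is caught per agent, one timeout does not cancel the rest of the batch, and the coordinator gets a full status map to drive verification and focused respawns.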
What we actually did (don't repeat):
- Spawned 10+ agents at once on the same repo
- Had a cron monitor spawning MORE agents every 10 minutes
- Gave "delete and rebuild" instructions
- Trusted agent reports without filesystem verification
- Result: 50+ agent sessions, massive token waste, files getting wiped and restored repeatedly