---
name: agent-swarm-coordinator
description: Coordinate teams of sub-agents for parallel and sequential work at scale. Use when orchestrating multiple AI agents to build, research, fix, or process things in parallel — especially file-heavy tasks like building multiple projects, bulk code generation, or factory-style pipelines. Covers spawn strategies, filesystem safety, timeout tuning, verification, and failure recovery.
---

# Agent Swarm Coordinator

Patterns and rules for orchestrating teams of sub-agents on large-scale tasks.
Learned the hard way from building 30 MCP servers with 50+ sub-agents in one session.

## Core Principle: Filesystem is the Bottleneck

Multiple agents writing to the same directory tree = guaranteed corruption.
The filesystem has no merge resolution. Last write wins. Agents WILL overwrite each other.

## Spawn Strategies

### Strategy 1: Parallel — Separate Directories (PREFERRED)
Each agent gets its own isolated directory. Merge results after all complete.

```
workspace/
  agent-1-output/server-a/   ← Agent 1 writes here only
  agent-2-output/server-b/   ← Agent 2 writes here only
  agent-3-output/server-c/   ← Agent 3 writes here only
```

After completion: verify, then `rsync` or `cp` to final location.

**Use when:** Building independent projects, researching separate topics, processing separate files.
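
The verify-then-copy step can be sketched as below. This is a minimal illustration, assuming the `workspace/agent-N-output/<project>/` layout shown above; `verify_project` and `merge_agent_outputs` are hypothetical helper names, and the check inside `verify_project` is a placeholder for real verification.

```bash
# Placeholder check (assumption): a project passes if it contains at least one file.
verify_project() {
  [ -n "$(find "$1" -type f | head -n 1)" ]
}

# merge_agent_outputs WORKSPACE FINAL
# Copies each verified project into FINAL; never overwrites an existing one.
merge_agent_outputs() (
  workspace=$1
  final=$2
  mkdir -p "$final"
  for project in "$workspace"/agent-*-output/*/; do
    [ -d "$project" ] || continue        # glob matched nothing
    project=${project%/}                 # drop the trailing slash
    name=$(basename "$project")
    if [ -e "$final/$name" ]; then
      echo "SKIP $name: already exists in $final" >&2
    elif verify_project "$project"; then
      cp -R "$project" "$final/$name" && echo "MERGED $name"
    else
      echo "FAILED $name: verification failed" >&2
    fi
  done
)
```

Refusing to overwrite an existing project keeps the merge step from becoming another "last write wins" hazard.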

### Strategy 2: Sequential — One at a Time
One agent finishes completely before the next starts. Slow but zero conflicts.

**Use when:** All agents need to modify the same files/repo, or agent N depends on agent N-1's output.

### Strategy 3: Parallel — Disjoint File Sets
Multiple agents write to the SAME repo but strictly different subdirectories.

**Use when:** Each agent owns a completely separate subdirectory (e.g., `servers/zendesk/` vs `servers/mailchimp/`). Works IF agents never touch shared files (package.json at root, shared types, etc.).

**WARNING:** If ANY shared files exist (root configs, shared modules), this strategy is no longer safe. Fall back to Strategy 1 or 2.
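
One way to catch an agent that strayed outside its subdirectory is to filter the list of changed paths. The sketch below is an assumption about how you might do it, not part of the original workflow; `check_ownership` is a hypothetical helper that reads paths on stdin.

```bash
# check_ownership OWNED_DIR
# Reads changed paths on stdin, prints every path outside OWNED_DIR,
# and returns non-zero if at least one was found.
check_ownership() {
  owned=${1%/}/
  violations=0
  while IFS= read -r path; do
    case $path in
      "$owned"*) ;;                                  # inside the agent's directory
      *) echo "OUTSIDE: $path"; violations=1 ;;
    esac
  done
  return "$violations"
}
```

In a git repo, the input could come from `{ git diff --name-only HEAD; git ls-files --others --exclude-standard; } | check_ownership servers/zendesk`, which covers both modified tracked files and new untracked ones.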

### Strategy 4: Pipeline — Stage Handoffs
Agent A does stage 1 (research) and hands off to Agent B (build tools), which hands off to Agent C (build apps).

**Use when:** Work has clear sequential stages with different skills needed per stage.

## Batch Sizing

| Agent task complexity | Recommended batch size | Timeout |
|---|---|---|
| Simple (one deliverable, <500 LOC) | 8-10 parallel | 300s (5min) |
| Medium (multiple files, 500-2000 LOC) | 5 parallel | 600s (10min) |
| Heavy (full project, 2000+ LOC) | 3 parallel | 900s (15min) |
| Mega (multi-project or research) | 1-2 parallel | 900s (15min) |

**The batch sizes above are recommendations; 5 simultaneous heavy agents is the hard ceiling.** Context pressure on the coordinator grows fast.
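
If the coordinator is scripted, the table can be encoded as a small lookup. This is a sketch only: `batch_plan` is a hypothetical name, and it picks the upper end of the ranges in the table (10 for simple, 2 for mega).

```bash
# batch_plan COMPLEXITY
# Prints "<max parallel agents> <timeout in seconds>" for a complexity tier.
batch_plan() {
  case $1 in
    simple) echo "10 300" ;;
    medium) echo "5 600" ;;
    heavy)  echo "3 900" ;;
    mega)   echo "2 900" ;;
    *) echo "unknown complexity: $1" >&2; return 1 ;;
  esac
}

plan=$(batch_plan heavy)
parallel=${plan% *}   # text before the space
timeout=${plan#* }    # text after the space
echo "heavy: $parallel agents, ${timeout}s timeout"
```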

## Task Scoping Rules

### Single-Purpose Agents Beat Multi-Purpose Agents

- BAD: "Build the complete MCP server with tools, apps, types, server, and README"
- GOOD: "Build 10 tool files for the Zendesk MCP server. Tools only. Don't touch anything else."

Split big jobs:
1. **Phase 1 agent:** Build API client + types
2. **Phase 2 agent:** Build tools (depends on types from phase 1)
3. **Phase 3 agent:** Build apps (can reference tools for context)

Each agent has ONE clear deliverable. Smaller scope = higher success rate.

### Never Say "Delete Everything and Rebuild"

This is the #1 factory killer. The agent deletes all files in minute 1, then times out at minute 10 with 30% rebuilt. The server is now WORSE than before the rebuild started.

Instead:
- "Build new files alongside existing ones"
- "Write to `/tmp/rebuild-{name}/` then I'll swap after verification"
- "Add the missing tool files. Do NOT modify or delete existing files."
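
The "write to a staging directory, then swap after verification" option can be sketched as follows. `swap_after_verify` is a hypothetical helper, and the verification command is whatever check you trust; nothing here is prescribed by the original workflow.

```bash
# swap_after_verify STAGING TARGET VERIFY_CMD...
# Runs VERIFY_CMD with STAGING appended as its last argument. Only on success
# is the old TARGET moved aside and STAGING moved into its place.
swap_after_verify() (
  staging=$1; target=$2; shift 2
  if ! "$@" "$staging"; then
    echo "verification failed, $target left untouched" >&2
    exit 1
  fi
  backup="$target.bak.$$"
  if [ -e "$target" ]; then
    mv "$target" "$backup" || exit 1   # keep the old version until you are sure
  fi
  mv "$staging" "$target"
)
```

A call might look like `swap_after_verify /tmp/rebuild-zendesk servers/zendesk test -d`, with a real check in place of `test -d`. The old directory stays around as `servers/zendesk.bak.<pid>` until you delete it.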

## Git Safety (MANDATORY for shared repos)

1. **Commit after EACH agent completes** — not after the whole batch
2. **Before spawning rebuild/fix agents:** `git add -A && git commit -m "checkpoint before rebuild"`
3. **If an agent wipes files:** `git checkout HEAD -- path/to/dir/` to restore instantly
4. **Never let agents run `git push`** — coordinator pushes after verification
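
Rules 1 to 3 can be wrapped in two small helpers. This is a minimal sketch; `checkpoint` and `restore_path` are hypothetical names, and both assume they run from inside the repo.

```bash
# checkpoint MESSAGE
# Stage everything and commit, but only if something actually changed.
checkpoint() {
  git add -A || return 1
  if git diff --cached --quiet; then
    echo "nothing to commit"
  else
    git commit -q -m "$1"
  fi
}

# restore_path PATH
# Bring tracked files under PATH back to their state at the last commit.
restore_path() {
  git checkout HEAD -- "$1"
}
```

Note that `restore_path` only recovers files that were committed, which is why the checkpoint has to happen before the risky agent is spawned.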

## Verification Protocol

**NEVER trust agent completion messages.** Agents report aspirational results, not actual results.

After every agent completes:
```bash
# Count actual deliverables
find src/tools -name "*.ts" | wc -l    # tools built?
find src/ui -name "*.tsx" | wc -l       # apps built?
find src -name "*.ts" -exec cat {} + | wc -l   # total LOC? (src/**/*.ts only recurses with bash globstar on)
npx tsc --noEmit 2>&1 | tail -5        # compiles?
```

If the counts don't match the agent's claims → respawn a focused fix agent.
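
Comparing the count on disk with the number the agent claimed can be automated. A sketch, assuming the claim is a simple file count; `verify_count` is a hypothetical helper.

```bash
# verify_count DIR PATTERN EXPECTED
# Counts files under DIR matching PATTERN and compares with the claimed number.
verify_count() {
  actual=$(find "$1" -type f -name "$2" 2>/dev/null | wc -l | tr -d ' ')
  if [ "$actual" -ge "$3" ]; then
    echo "OK $1: $actual/$3"
  else
    echo "SHORT $1: $actual/$3"
    return 1
  fi
}
```

For example, `verify_count src/tools "*.ts" 10 || echo "respawn a fix agent"` turns the agent's "built 10 tools" message into something checkable.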

## Cron Monitor Anti-Pattern

**NEVER run an automated cron monitor that spawns fix agents while you're also manually spawning agents.**

What happens:
1. You see Server X is broken, spawn fix agent
2. Cron fires 2 minutes later, sees Server X is still broken, spawns ANOTHER fix agent
3. Both agents fight over the same files
4. Server X is now more broken than before

**Rule:** Disable any automated monitors before doing manual intervention. Re-enable after manual work is complete.
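
One lightweight way to enforce this rule is a lock that the monitor checks before doing anything. The sketch below assumes the monitor is a shell script you control; the lock path and the function names are hypothetical.

```bash
# Lock location is an assumption; any path both sides agree on works.
LOCK_DIR=${LOCK_DIR:-/tmp/swarm-manual-intervention.lock}

# Called by the coordinator before and after manual work.
# mkdir is atomic, so it doubles as a simple lock.
pause_monitor()  { mkdir "$LOCK_DIR" 2>/dev/null; }
resume_monitor() { rmdir "$LOCK_DIR" 2>/dev/null; }

# First thing the cron job runs: bail out while a human is working.
monitor_tick() {
  if [ -d "$LOCK_DIR" ]; then
    echo "manual intervention in progress, skipping"
    return 0
  fi
  echo "monitor running"
  # health checks and fix-agent spawning would go here
}
```

Run `pause_monitor` before spawning manual fix agents and `resume_monitor` once they are verified and committed.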

## Failure Recovery Playbook

### Agent timed out (most common)
- Check what files exist — it probably got 60-80% done
- Spawn a FOCUSED agent: "Complete the remaining work. These files exist: [list]. Build only what's missing."
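
The "[list]" in that prompt can be generated from the filesystem rather than typed by hand. A minimal sketch; `resume_prompt` is a hypothetical helper and the wording is just the prompt above expanded.

```bash
# resume_prompt DIR
# Prints a focused follow-up prompt that lists what already exists in DIR.
resume_prompt() {
  echo "Complete the remaining work in $1."
  echo "These files exist (do NOT modify or delete them):"
  find "$1" -type f | sort | sed 's/^/  - /'
  echo "Build only what's missing."
}
```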

### Agent returned "no output"
- Check filesystem directly — the agent may have written files but failed to report
- If files exist and look good → count as success
- If files don't exist → respawn with simpler task scope

### Agent wiped files then timed out
- `git checkout HEAD -- path/` to restore
- Respawn with explicit "DO NOT DELETE" instruction

### Multiple agents corrupted each other
- `git checkout HEAD -- path/` to restore to last good state
- Switch to sequential strategy for affected directories
- Disable any cron monitors

## Token Optimization

### Reduce input tokens per agent:
- Don't paste entire API docs — give the API base URL and let the agent research
- Don't repeat the full project context — just give the specific directory and what to build
- Reference files by path instead of pasting content

### Reduce wasted runs:
- Verify prerequisite files exist BEFORE spawning (don't spawn a "build apps" agent if types don't exist yet)
- Use 15min timeouts for heavy builds (10min causes 30% waste from timeouts)
- Single-purpose agents fail less often than multi-purpose ones
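
The prerequisite check is cheap to script. In this sketch `require_files` is a hypothetical helper, and the file paths in the usage note are placeholders for whatever the next phase depends on.

```bash
# require_files FILE...
# Succeeds only if every listed file exists and is non-empty.
require_files() {
  missing=0
  for f in "$@"; do
    if [ ! -s "$f" ]; then
      echo "MISSING: $f" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Gate the spawn on it, for example `require_files src/types.ts src/client.ts && echo "ok to spawn the apps agent"`, so a phase 2 agent never starts without its phase 1 inputs.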

### Reduce retry cycles:
- Commit after each success (git safety net)
- Verify immediately after completion (catch problems early)
- Fix specific issues, don't "rebuild everything"

## Example: Building 30 MCP Servers

Optimal approach (what we SHOULD have done):

```
Batch 1 (5 servers): Spawn 5 parallel agents, each building to separate dirs
  → Wait for all 5 → Verify each → Commit each → Push
  
Batch 2 (5 servers): Same pattern
  → Repeat until all 30 done

For each server, 2-phase approach:
  Phase 1: "Build API client + types + tool files for {name} MCP"  (10min)
  Phase 2: "Build 15+ React apps for {name} MCP" (10min, after phase 1 verified)
```
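
The batch loop itself might look like the sketch below. `spawn_agent` is a hypothetical stand-in for however your platform starts a sub-agent (here it only prints), and the wait, verify, and commit steps are left as a comment.

```bash
# Hypothetical stand-in: replace with your platform's real spawn call.
spawn_agent() { echo "spawn: $1"; }

# run_in_batches SIZE NAME...
# Walks the list of names SIZE at a time.
run_in_batches() {
  size=$1; shift
  batch=1
  while [ "$#" -gt 0 ]; do
    echo "== batch $batch =="
    i=0
    while [ "$i" -lt "$size" ] && [ "$#" -gt 0 ]; do
      spawn_agent "$1"
      shift
      i=$((i + 1))
    done
    # wait for the whole batch, verify each server, commit each, then continue
    batch=$((batch + 1))
  done
}

run_in_batches 5 zendesk mailchimp
```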

What we actually did (don't repeat):
- Spawned 10+ agents at once on the same repo
- Had a cron monitor spawning MORE agents every 10 minutes
- Gave "delete and rebuild" instructions
- Trusted agent reports without filesystem verification
- Result: 50+ agent sessions, massive token waste, files getting wiped and restored repeatedly