clawdbot-workspace/memory/lessons-learned.md

# Lessons Learned

## Cloudflare / Tunnels / DNS (2026-02-12)
- **nohup your tunnels**: cloudflared processes die when exec sessions close. Always use `nohup cloudflared tunnel ... &`
- **Verify before announcing**: Always curl the tunnel URL and confirm 200 before posting to Discord. Got burned 3 times in a row.
- **Workers need DNS**: Cloudflare Workers with routes need a proxied A record on the routed hostname. The target can be a dummy IP such as 192.0.2.1 (from the RFC 5737 documentation range).
- **http2 > quic**: `--protocol http2` works more reliably than default quic for cloudflared tunnels
- **CF Registrar is dashboard-only**: No API for new domain registration. Only management of existing domains.
- **Wrangler OAuth vs API Token**: The OAuth token (in wrangler config) and CLOUDFLARE_API_TOKEN have different scopes. Check both.
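
The "verify before announcing" rule above can be sketched as a small check. This is a minimal sketch, not the workspace's actual tooling: `tunnel_is_up` and its `opener` parameter are hypothetical names, and it assumes a plain GET returning 200 is a good enough health signal.

```python
import urllib.error
import urllib.request


def tunnel_is_up(url, opener=urllib.request.urlopen, timeout=10):
    """Return True only if a GET on `url` answers with HTTP 200.

    `opener` is injectable (hypothetical parameter) so the check can be
    exercised without touching the network.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, or an HTTP error status.
        return False
```

Only post the tunnel URL to Discord when `tunnel_is_up(url)` returns `True`.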

## Python / Veo (2026-02-12)
- **Unbuffered output**: Use `python3 -u` for scripts running in background — otherwise stdout is buffered and you see no output
- **Veo download workaround**: `client.files.download()` returns 404. Instead grab the URI from `video.video.uri` and download with `?key=API_KEY`
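
A minimal sketch of the Veo workaround above, building the download URL from the video URI. `with_api_key` is a hypothetical helper. It assumes the URI may already carry a query string, in which case the key has to be appended with `&` rather than `?`.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_api_key(uri, api_key):
    """Return `uri` with a `key=<api_key>` query parameter added."""
    parts = urlsplit(uri)
    # Keep existing parameters, dropping any stale `key` value.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != "key"]
    query.append(("key", api_key))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

The result can then be fetched with any HTTP client, for example `urllib.request.urlretrieve(with_api_key(video.video.uri, API_KEY), "out.mp4")`. Keep the key itself out of memory files.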

## Discord Etiquette (2026-02-12)
- **Don't spam debug messages**: Do work silently, announce clean results. Jake had to tell me to delete 45 messages of debug spam.

# Buba's Self-Learning Log

> Every mistake is a lesson. Every lesson makes us mega beastly.
> This file is updated CONSTANTLY whenever I figure something out the hard way.
> Search this BEFORE attempting anything similar.

---

## Gateway & Infrastructure

### Gateway logs live at /tmp/clawdbot/ not ~/.clawdbot/logs/
- **Date:** 2026-02-11
- **Mistake:** Checked ~/.clawdbot/logs/ and said "nothing since Feb 5" — confused Jake
- **Reality:** Gateway switched to /tmp/clawdbot/clawdbot-YYYY-MM-DD.log. The old logs dir is stale.
- **Rule:** Always check `/tmp/clawdbot/` for current gateway logs.
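
A small sketch for locating the current log, following the `clawdbot-YYYY-MM-DD.log` pattern above. `gateway_log_path` is a hypothetical helper, and it assumes the date in the filename is the machine's local date.

```python
from datetime import date
from pathlib import Path

LOG_DIR = Path("/tmp/clawdbot")


def gateway_log_path(day=None):
    """Return the gateway log path for `day` (defaults to today)."""
    day = day or date.today()
    return LOG_DIR / f"clawdbot-{day.isoformat()}.log"
```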

### tmux death kills the auto-restart loop
- **Date:** 2026-02-11
- **Mistake:** Assumed compaction caused silence. Actually the entire tmux session died.
- **Reality:** `run-gateway.sh` has a `while true` loop that only works if tmux survives. If tmux itself dies, no recovery.
- **Rule:** When diagnosing downtime, check `tmux list-sessions` and session creation time with `tmux display-message -t clawdbot -p '#{session_created}'`. If the session is newer than expected, tmux died.

### Gateway freeze vs crash — different diagnostics
- **Date:** 2026-02-11
- **Mistake:** Initially thought it was an event loop freeze (alive but hung). Was actually a full crash.
- **Rule:** Check the log timeline for gaps. If there's a gap AND the tmux session is freshly created, it was a crash. If the tmux session is old but logs have a gap, THEN it's a freeze.
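
The crash-versus-freeze rule above can be written as one decision function. This is a sketch under assumptions: `classify_outage` is a hypothetical name, the log gap has already been found by hand, and the session creation time comes from the `tmux display-message` command in the previous lesson (which I believe prints a Unix timestamp).

```python
from datetime import datetime


def classify_outage(gap_start, gap_end, session_created):
    """Classify downtime from a log gap and the tmux session creation time.

    gap_start / gap_end: datetimes of the last log line before the gap and
    the first line after it, or None if the logs have no gap.
    session_created: datetime the current tmux session was created.
    """
    if gap_start is None or gap_end is None:
        return "no-gap"
    if session_created >= gap_start:
        # The session is newer than the gap: tmux died and was recreated.
        return "crash"
    # The session predates the gap: the process was alive but hung.
    return "freeze"
```

If the tmux output is epoch seconds, convert it with `datetime.fromtimestamp(int(output))` before passing it in.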

## Discord API

### channel-list needs guildId, not channel ID
- **Date:** 2026-02-10
- **Mistake:** Passed channel ID to channel-list, got "Unknown Guild"
- **Rule:** Guild ID ≠ channel ID. `channel-list` takes the guild ID. Jake's main guild is `1458233582404501547`.

### Guild ID reference
- **Main server:** `1458233582404501547`
- **Config has all guilds listed** under `channels.discord.guilds` in `clawdbot.json`

### Deleting messages needs the channel as target
- **Date:** 2026-02-10
- **Rule:** `message delete` needs `target` set to the channel ID where the message lives.

## Cron Jobs

### Cron job parameter format
- **Date:** 2026-02-10
- **Mistake:** Tried multiple wrong formats before getting it right
- **Correct format:**
```json
{
  "name": "job-name",
  "schedule": {"kind": "cron", "expr": "0 9 * * 1,4"},
  "sessionTarget": "main",
  "payload": {"kind": "systemEvent", "text": "..."},
  "enabled": true
}
```
- **Rule:** schedule needs `kind` + `expr`. Payload needs `kind: "systemEvent"` + `text`. NOT `label`, NOT `message`.
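
A pre-flight check for the format above might look like this. It is a sketch only: `validate_cron_job` is a hypothetical helper that encodes just the rule stated here and knows nothing else about the cron tool's schema.

```python
def validate_cron_job(job):
    """Return a list of problems; an empty list means the job matches the rule."""
    problems = []

    schedule = job.get("schedule") or {}
    for key in ("kind", "expr"):
        if key not in schedule:
            problems.append(f"schedule is missing '{key}'")

    payload = job.get("payload") or {}
    if payload.get("kind") != "systemEvent":
        problems.append("payload.kind must be 'systemEvent'")
    if "text" not in payload:
        problems.append("payload is missing 'text'")
    # The wrong field names that were tried before getting it right.
    for wrong in ("label", "message"):
        if wrong in payload:
            problems.append(f"payload uses '{wrong}' instead of 'text'")

    return problems
```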

## File Operations

### Edit tool requires EXACT text match
- **Date:** 2026-02-11 (CREdispo sub-agent)
- **Mistake:** Multiple edit failures on CREdispo files because oldText didn't match exactly
- **Rule:** Always read the file first to get exact text before editing. Never guess at whitespace or content.

## iMessage / BlueBubbles

### Sending images to group chats via AppleScript is unreliable
- **Date:** 2026-02-10
- **Mistake:** Tried to send images to iMessage group chats via AppleScript — text sends but images may not deliver
- **Rule:** For image delivery to group chats, use BlueBubbles API directly or have Jake send manually from Discord.

### Group chat ID format
- **Date:** 2026-02-10
- **Rule:** iMessage group chat IDs look like `chat358249523368699090`. The send format is `any;+;chat358249523368699090`.
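
A tiny helper for the send format above. The `chat` + digits pattern is inferred from the single example ID, so treat the validation as an assumption. `imessage_group_target` is a hypothetical name.

```python
import re

# Assumed shape, based on one observed ID: "chat" followed by digits.
CHAT_ID = re.compile(r"chat\d+")


def imessage_group_target(chat_id):
    """Return the send target string for an iMessage group chat ID."""
    if not CHAT_ID.fullmatch(chat_id):
        raise ValueError(f"unexpected group chat ID: {chat_id!r}")
    return f"any;+;{chat_id}"
```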

## Context & Memory

### ALWAYS save state to memory before heavy work
- **Date:** 2026-02-11
- **Mistake:** Was deep in CREdispo work, context got compacted, lost all working state
- **Rule:** Before starting any multi-step project, write current state to memory/YYYY-MM-DD.md. Update it at milestones. This survives compaction.
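
A minimal sketch of the checkpoint habit above: append a timestamped note to the day's memory file. `checkpoint` and the bullet format are hypothetical; only the `memory/YYYY-MM-DD.md` path comes from the rule.

```python
from datetime import datetime
from pathlib import Path


def checkpoint(note, memory_dir="memory", now=None):
    """Append `note` to memory/YYYY-MM-DD.md and return the file path."""
    now = now or datetime.now()
    path = Path(memory_dir) / f"{now:%Y-%m-%d}.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    # Append, never overwrite: earlier milestones must survive.
    with path.open("a", encoding="utf-8") as handle:
        handle.write(f"- {now:%H:%M} {note}\n")
    return path
```

Call it before starting and at each milestone, for example `checkpoint("CREdispo: postgres migration spawned, waiting on sub-agent")`.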

### Compaction ≠ crash — don't confuse them
- **Date:** 2026-02-11
- **Mistake:** Told Jake compaction caused the silence when it was actually a gateway crash
- **Rule:** Compaction just compresses context. It doesn't stop me from responding. If I went silent, something else happened.

## Image Generation

### Nano Banana Pro needs specific iterative prompting for character accuracy
- **Date:** 2026-02-10
- **Mistake:** Took 4 iterations to get Caleb's appearance right (white hair → brown, no beard → beard, etc.)
- **Rule:** When generating character images, be VERY specific about hair color, facial hair, build, and clothing in the first prompt. Don't assume defaults.

## Sub-agents

### Sub-agent results arrive as system messages after compaction
- **Date:** 2026-02-11
- **Mistake:** Didn't realize the CREdispo postgres migration had completed because context was compacted
- **Rule:** After spawning a sub-agent for heavy work, the result comes back as a user message. If context compacts before I process it, I need to check sessions_list for completed sub-agents.

## Security

### Cloudflare quick tunnels break HTML form POST (405 Method Not Allowed)
- **Date:** 2026-02-11
- **Mistake:** Signup/login forms used native HTML `<form method="POST">` which returns 405 through cloudflared quick tunnels
- **Reality:** Cloudflare quick tunnels can mangle POST form submissions. JSON API calls via `fetch()` work fine.
- **Rule:** When serving apps through cloudflared tunnels, use JavaScript fetch() for form submissions instead of native HTML form POSTs. Keep the old form routes for direct access but add `/api/` JSON endpoints.

### VPN breaks Cloudflare tunnels
- **Date:** 2026-02-11
- **Mistake:** Had Mullvad VPN connected to Mexico while trying to create new cloudflared tunnels — tunnels couldn't establish
- **Rule:** Disconnect VPN before creating new cloudflared tunnels. Existing tunnels may also break when VPN connects.

### API tokens must go in gateway config env.vars, not just .env files
- **Date:** 2026-02-11
- **Mistake:** Saved Cloudflare token to `.env.local` but not to gateway config. Gateway couldn't use it.
- **Reality:** The gateway reads env vars from `clawdbot.json` → `env.vars`. A `.env.local` file is for apps, not the gateway process.
- **Rule:** When Jake gives a new API token, save it via `gateway config.patch` to `env.vars` so the gateway has it. Also save to `.env.local` for local app use.

### NEVER save secrets/tokens in memory/*.md files
- **Date:** 2026-02-11
- **Rule:** Memory files are git-backed and could leak. Save tokens/keys to `.env.local` (which is in .gitignore). Reference them by name in memory, never by value.

### Delete messages containing tokens IMMEDIATELY
- **Date:** 2026-02-11
- **Rule:** If Jake or anyone pastes a secret in Discord, delete the message FIRST, then save the token. Every second it sits in a channel is a risk.

---

## Agent Coordination / Factory Builds

### 18. Parallel agents on shared filesystem = disaster
- **Date:** 2026-02-12
- **Mistake:** Spawned 5-10 sub-agents simultaneously, all writing to the same `mcpengine-repo/servers/` directory
- **What happened:** Agents deleted each other's files, overwrote each other's work, and left half-built servers everywhere
- **Rule:** For file-heavy work on a shared repo, go SEQUENTIAL (one agent at a time) or give each agent a SEPARATE directory, then merge. Never let multiple agents write to the same folder simultaneously.

### 19. "Delete everything and rebuild" agents are time bombs
- **Date:** 2026-02-12
- **Mistake:** Gave rebuild agents instructions to "DELETE everything, build from scratch"
- **What happened:** Agent deletes all files in minute 1, then times out at minute 10 with only 30% rebuilt. Now the server is WORSE than before.
- **Rule:** NEVER tell agents to delete first. Say "build new files alongside existing ones" or "write to a temp directory, then swap." Always keep the old code until the new code is verified.
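
The "write to a temp directory, then swap" rule can be sketched as follows. `build_then_swap`, `build`, and `verify` are hypothetical names, and the sketch assumes the staging directory sits on the same filesystem as the target so the renames are cheap.

```python
import shutil
import tempfile
from pathlib import Path


def build_then_swap(target, build, verify):
    """Build into a staging dir and replace `target` only if `verify` passes.

    build(staging) writes the new files; verify(staging) returns True when
    they are good. The old directory is kept as `<name>.bak`.
    """
    target = Path(target)
    staging = Path(tempfile.mkdtemp(prefix=target.name + "-staging-",
                                    dir=target.parent))
    try:
        build(staging)
        if not verify(staging):
            return False  # old code is untouched
        backup = target.with_name(target.name + ".bak")
        if backup.exists():
            shutil.rmtree(backup)
        if target.exists():
            target.rename(backup)
        staging.rename(target)
        return True
    finally:
        # Clean up staging after a failed build or a failed verification.
        if staging.exists():
            shutil.rmtree(staging)
```

A timeout or crash during `build` leaves the existing server exactly as it was, which is the whole point of the rule.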

### 20. Factory monitor cron + manual spawns = competing agents
- **Date:** 2026-02-12
- **Mistake:** Had a cron job (every 10min) spawning fix agents for incomplete servers, PLUS I was manually spawning rebuild agents
- **What happened:** 3-4 agents fighting over the same server simultaneously, each deleting what the others wrote
- **Rule:** Before spawning fix agents, DISABLE any cron monitors that might also spawn agents for the same servers. One coordinator, one set of workers. No freelancers.

### 21. 10-minute timeout is too short for full MCP builds
- **Date:** 2026-02-12
- **Mistake:** Set 600s (10min) timeout for agents building entire MCP servers (tools + apps + types + server + README)
- **What happened:** Agents got 60-80% done then died. "No output" completions burning 60-70k tokens each.
- **Rule:** Full MCP server builds need 900s (15min). App-only or tool-only jobs can use 600s. Always set `runTimeoutSeconds` based on scope.
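
One way to keep the timeout rule above from being forgotten is a lookup keyed by job scope. The scope names here are hypothetical labels; only the 900s and 600s values and the `runTimeoutSeconds` parameter come from this lesson.

```python
# Seconds to pass as runTimeoutSeconds, keyed by job scope (hypothetical labels).
TIMEOUTS = {
    "full": 900,        # tools + apps + types + server + README
    "apps-only": 600,
    "tools-only": 600,
}


def run_timeout_seconds(scope):
    """Return the timeout for `scope`, refusing unknown scopes outright."""
    try:
        return TIMEOUTS[scope]
    except KeyError:
        raise ValueError(f"unknown scope: {scope!r}") from None
```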

### 22. Git checkout HEAD restores wiped files
- **Date:** 2026-02-12
- **Mistake:** Panicked when rebuild agents wiped committed files
- **What saved us:** `git checkout HEAD -- servers/{name}/` instantly restores all committed files
- **Rule:** Always commit after each server completes. Then if a rogue agent wipes files, one git command fixes it. Commit early, commit often.

### 23. Single-purpose agents > multi-purpose agents
- **Date:** 2026-02-12
- **Mistake:** Gave agents broad tasks like "build the complete MCP server" (tools + apps + types + infra + README)
- **What happened:** They'd run out of tokens/time trying to do everything, often failing at the apps stage
- **Rule:** Split into focused agents: "build tools only", "build apps only", "fix TSC errors only". Smaller scope = higher success rate. Each agent should have ONE clear deliverable.

### 24. Always verify sub-agent output — "success" doesn't mean complete
- **Date:** 2026-02-12
- **Mistake:** Trusted agent completion messages like "50+ tools built!" without checking
- **What happened:** Agent claimed 50 tools but only wrote 2 files. The "findings" text was aspirational, not factual.
- **Rule:** After EVERY sub-agent completion, run a file count check: `find src/tools -name "*.ts" | wc -l`. Never trust the narrative. Trust the filesystem.
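
The same check as the `find ... | wc -l` one-liner, as a sketch that compares the filesystem against the agent's claim. `count_files` and `verify_claim` are hypothetical helpers.

```python
from pathlib import Path


def count_files(root, pattern="*.ts"):
    """Count files under `root` (recursively) whose name matches `pattern`."""
    root = Path(root)
    if not root.is_dir():
        return 0
    return sum(1 for path in root.rglob(pattern) if path.is_file())


def verify_claim(root, claimed, pattern="*.ts"):
    """Return (claim_holds, actual_count) for an agent's claimed file count."""
    actual = count_files(root, pattern)
    return actual >= claimed, actual
```

For the case above, `verify_claim("src/tools", 50)` would have come back as `(False, 2)`.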

### 25. Count apps correctly — multiple storage patterns exist
- **Date:** 2026-02-12
- **Mistake:** Kept miscounting apps because different servers store them differently
- **What happened:** Some use subdirectories, some use .tsx files, some use .ts files, some use .html files, some use src/apps/ instead of src/ui/react-app/
- **Rule:** Check ALL patterns: subdirs in react-app/, .tsx files, .ts files, .html files, AND src/apps/*.ts. Take the max. Use a consistent counting script.
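
A sketch of the consistent counting script the rule asks for. It assumes the loose `.tsx`, `.ts`, and `.html` files sit at the top level of `src/ui/react-app/`, which this lesson does not spell out. `count_apps` is a hypothetical name.

```python
from pathlib import Path


def count_apps(server_dir):
    """Return (best_count, per_pattern_counts) for one server directory."""
    server = Path(server_dir)
    react_app = server / "src" / "ui" / "react-app"
    apps = server / "src" / "apps"

    def files(folder, pattern):
        # Non-recursive count of regular files matching the pattern.
        if not folder.is_dir():
            return 0
        return sum(1 for path in folder.glob(pattern) if path.is_file())

    def subdirs(folder):
        if not folder.is_dir():
            return 0
        return sum(1 for path in folder.iterdir() if path.is_dir())

    counts = {
        "react-app subdirs": subdirs(react_app),
        "react-app *.tsx": files(react_app, "*.tsx"),
        "react-app *.ts": files(react_app, "*.ts"),
        "react-app *.html": files(react_app, "*.html"),
        "src/apps *.ts": files(apps, "*.ts"),
    }
    # Take the max across patterns, as the rule says.
    return max(counts.values()), counts
```

Returning the per-pattern breakdown alongside the max makes it obvious which storage pattern a given server uses.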

*Last updated: 2026-02-12 22:20 EST*
*Total lessons: 25*

### 17. Jake's Preferred Image Style
- **Mistake:** Used comic book/vibrant cartoon style when Jake asked for "the style I like"
- **What happened:** Jake corrected — his preferred style is **chibi kawaii anime**, NOT comic book
- **Rule:** Jake's go-to image style = chibi/kawaii anime (pastel colors, big eyes, oversized heads, tiny bodies, sparkles, hearts, stars). Same style as Buba's visual identity in IDENTITY.md. Always default to this unless he says otherwise.