Lessons Learned
Cloudflare / Tunnels / DNS (2026-02-12)
- nohup your tunnels: cloudflared processes die when exec sessions close. Always use nohup cloudflared tunnel ... &
- Verify before announcing: Always curl the tunnel URL and confirm 200 before posting to Discord. Got burned 3 times in a row.
- Workers need DNS: Cloudflare Workers with routes need a proxied A record (use 192.0.2.1 RFC 5737 dummy IP)
- http2 > quic: --protocol http2 works more reliably than default quic for cloudflared tunnels
- CF Registrar is dashboard-only: No API for new domain registration. Only management of existing domains.
- Wrangler OAuth vs API Token: The OAuth token (in wrangler config) and CLOUDFLARE_API_TOKEN have different scopes. Check both.
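The "verify before announcing" rule can be scripted so it is harder to skip. This is a minimal sketch, not the actual workflow: the function names http_status and safe_to_announce are hypothetical, and it assumes that a plain GET returning 200 is the right liveness signal for the tunnel.

```python
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status of a GET request, or 0 if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return 0  # DNS failure, refused connection, timeout, ...


def safe_to_announce(url: str, get_status: Callable[[str], int] = http_status) -> bool:
    """Only a 200 counts as live; anything else means do not post yet."""
    return get_status(url) == 200
```

Call safe_to_announce(tunnel_url) right before posting. The get_status parameter only exists so the check can be exercised without a live tunnel.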
Python / Veo (2026-02-12)
- Unbuffered output: Use python3 -u for scripts running in background; otherwise stdout is buffered and you see no output
- Veo download workaround: client.files.download() returns 404. Instead grab the URI from video.video.uri and download with ?key=API_KEY
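A small helper makes the ?key=API_KEY step less error-prone. This is a sketch under assumptions: with_api_key is a hypothetical name, and the URI may already carry query parameters, in which case the key has to be appended with & rather than ?.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_api_key(uri: str, api_key: str) -> str:
    """Return the URI with key=<api_key> added, keeping any existing query params."""
    parts = urlsplit(uri)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != "key"]
    query.append(("key", api_key))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

Pass video.video.uri in and download the returned URL with any HTTP client. The result contains the API key, so it should not be logged or pasted into Discord.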
Discord Etiquette (2026-02-12)
- Don't spam debug messages: Do work silently, announce clean results. Jake had to tell me to delete 45 messages of debug spam.
Buba's Self-Learning Log
Every mistake is a lesson. Every lesson makes us mega beastly. This file is updated CONSTANTLY whenever I figure something out the hard way. Search this BEFORE attempting anything similar.
Gateway & Infrastructure
Gateway logs live at /tmp/clawdbot/ not ~/.clawdbot/logs/
- Date: 2026-02-11
- Mistake: Checked ~/.clawdbot/logs/ and said "nothing since Feb 5" — confused Jake
- Reality: Gateway switched to /tmp/clawdbot/clawdbot-YYYY-MM-DD.log. The old logs dir is stale.
- Rule: Always check /tmp/clawdbot/ for current gateway logs.
tmux death kills the auto-restart loop
- Date: 2026-02-11
- Mistake: Assumed compaction caused silence. Actually the entire tmux session died.
- Reality: run-gateway.sh has a while true loop that only works if tmux survives. If tmux itself dies, no recovery.
- Rule: When diagnosing downtime, check tmux list-sessions and session creation time with tmux display-message -t clawdbot -p '#{session_created}'. If the session is newer than expected, tmux died.
Gateway freeze vs crash — different diagnostics
- Date: 2026-02-11
- Mistake: Initially thought it was an event loop freeze (alive but hung). Was actually a full crash.
- Rule: Check the log timeline for gaps. If there's a gap AND the tmux session is freshly created, it was a crash. If the tmux session is old but logs have a gap, THEN it's a freeze.
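The crash-vs-freeze rule boils down to one comparison. A rough sketch, assuming both values are Unix timestamps (tmux reports session_created as seconds since the epoch) and that the start of the log gap has already been found by hand; classify_outage is a hypothetical name.

```python
def classify_outage(log_gap_start: float, session_created: float) -> str:
    """Classify a gateway outage from the log gap and the tmux session age.

    A session created after the gap began means tmux itself was restarted,
    so the gateway crashed. An older session with a gap means it hung.
    """
    if session_created > log_gap_start:
        return "crash"
    return "freeze"
```

This only applies once a log gap is confirmed. With no gap there was no outage to classify.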
Discord API
channel-list needs guildId, not channel ID
- Date: 2026-02-10
- Mistake: Passed channel ID to channel-list, got "Unknown Guild"
- Rule: Guild ID ≠ channel ID. Jake's main guild is 1458233582404501547. Channel IDs are different.
Guild ID reference
- Main server: 1458233582404501547
- Config has all guilds listed under channels.discord.guilds in clawdbot.json
Deleting messages needs the channel as target
- Date: 2026-02-10
- Rule: message delete needs target set to the channel ID where the message lives.
Cron Jobs
Cron job parameter format
- Date: 2026-02-10
- Mistake: Tried multiple wrong formats before getting it right
- Correct format:
{
"name": "job-name",
"schedule": {"kind": "cron", "expr": "0 9 * * 1,4"},
"sessionTarget": "main",
"payload": {"kind": "systemEvent", "text": "..."},
"enabled": true
}
- Rule: schedule needs kind + expr. Payload needs kind: "systemEvent" + text. NOT label, NOT message.
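A pre-flight check can catch the wrong formats before the job is submitted. This is only a sketch of the rule above: cron_job_problems is a hypothetical helper, and it checks nothing beyond the keys named in this entry.

```python
def cron_job_problems(job: dict) -> list:
    """Return a list of format problems; an empty list means the shape looks right."""
    problems = []

    schedule = job.get("schedule")
    if not isinstance(schedule, dict) or not all(k in schedule for k in ("kind", "expr")):
        problems.append("schedule needs kind + expr")

    payload = job.get("payload")
    if not isinstance(payload, dict):
        problems.append("payload must be an object")
    else:
        if payload.get("kind") != "systemEvent":
            problems.append('payload.kind must be "systemEvent"')
        if "text" not in payload:
            problems.append("payload needs text")
        for wrong_key in ("label", "message"):
            if wrong_key in payload:
                problems.append(f"payload must not use {wrong_key}")
    return problems
```

It does not validate the cron expression itself, only the structure.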
File Operations
Edit tool requires EXACT text match
- Date: 2026-02-11 (CREdispo sub-agent)
- Mistake: Multiple edit failures on CREdispo files because oldText didn't match exactly
- Rule: Always read the file first to get exact text before editing. Never guess at whitespace or content.
iMessage / BlueBubbles
Sending images to group chats via AppleScript is unreliable
- Date: 2026-02-10
- Mistake: Tried to send images to iMessage group chats via AppleScript — text sends but images may not deliver
- Rule: For image delivery to group chats, use BlueBubbles API directly or have Jake send manually from Discord.
Group chat ID format
- Date: 2026-02-10
- Rule: iMessage group chat IDs look like chat358249523368699090. The send format is any;+;chat358249523368699090.
Context & Memory
ALWAYS save state to memory before heavy work
- Date: 2026-02-11
- Mistake: Was deep in CREdispo work, context got compacted, lost all working state
- Rule: Before starting any multi-step project, write current state to memory/YYYY-MM-DD.md. Update it at milestones. This survives compaction.
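Saving state is easier to remember when it is a one-liner. A minimal sketch, assuming an append-only bullet per milestone is enough; save_state and its parameters are hypothetical, and only the memory/YYYY-MM-DD.md path comes from the rule above.

```python
from datetime import date
from pathlib import Path
from typing import Optional


def save_state(note: str, memory_dir: Path = Path("memory"), today: Optional[date] = None) -> Path:
    """Append one milestone note to memory/YYYY-MM-DD.md and return the file path."""
    day = today or date.today()
    memory_dir.mkdir(parents=True, exist_ok=True)
    path = memory_dir / f"{day.isoformat()}.md"
    with path.open("a", encoding="utf-8") as handle:
        handle.write(f"- {note}\n")
    return path
```

Notes should name secrets, never contain them, since memory files are git-backed.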
Compaction ≠ crash — don't confuse them
- Date: 2026-02-11
- Mistake: Told Jake compaction caused the silence when it was actually a gateway crash
- Rule: Compaction just compresses context. It doesn't stop me from responding. If I went silent, something else happened.
Image Generation
Nano Banana Pro needs specific iterative prompting for character accuracy
- Date: 2026-02-10
- Mistake: Took 4 iterations to get Caleb's appearance right (white hair → brown, no beard → beard, etc.)
- Rule: When generating character images, be VERY specific about hair color, facial hair, build, and clothing in the first prompt. Don't assume defaults.
Sub-agents
Sub-agent results arrive as system messages after compaction
- Date: 2026-02-11
- Mistake: Didn't realize the CREdispo postgres migration had completed because context was compacted
- Rule: After spawning a sub-agent for heavy work, the result comes back as a user message. If context compacts before I process it, I need to check sessions_list for completed sub-agents.
Security
Cloudflare quick tunnels break HTML form POST (405 Method Not Allowed)
- Date: 2026-02-11
- Mistake: Signup/login forms used native HTML <form method="POST"> which returns 405 through cloudflared quick tunnels
- Reality: Cloudflare quick tunnels can mangle POST form submissions. JSON API calls via fetch() work fine.
- Rule: When serving apps through cloudflared tunnels, use JavaScript fetch() for form submissions instead of native HTML form POSTs. Keep the old form routes for direct access but add /api/ JSON endpoints.
VPN breaks Cloudflare tunnels
- Date: 2026-02-11
- Mistake: Had Mullvad VPN connected to Mexico while trying to create new cloudflared tunnels — tunnels couldn't establish
- Rule: Disconnect VPN before creating new cloudflared tunnels. Existing tunnels may also break when VPN connects.
API tokens must go in gateway config env.vars, not just .env files
- Date: 2026-02-11
- Mistake: Saved Cloudflare token to .env.local but not to gateway config. Gateway couldn't use it.
- Reality: The gateway reads env vars from clawdbot.json → env.vars. A .env.local file is for apps, not the gateway process.
- Rule: When Jake gives a new API token, save it via gateway config.patch to env.vars so the gateway has it. Also save to .env.local for local app use.
NEVER save secrets/tokens in memory/*.md files
- Date: 2026-02-11
- Rule: Memory files are git-backed and could leak. Save tokens/keys to .env.local (which is in .gitignore). Reference them by name in memory, never by value.
Delete messages containing tokens IMMEDIATELY
- Date: 2026-02-11
- Rule: If Jake or anyone pastes a secret in Discord, delete the message FIRST, then save the token. Every second it sits in a channel is a risk.
Agent Coordination / Factory Builds
18. Parallel agents on shared filesystem = disaster
- Date: 2026-02-12
- Mistake: Spawned 5-10 sub-agents simultaneously, all writing to the same mcpengine-repo/servers/ directory
- What happened: Agents deleted each other's files, overwrote each other's work, and left half-built servers everywhere
- Rule: For file-heavy work on a shared repo, go SEQUENTIAL (one agent at a time) or give each agent a SEPARATE directory, then merge. Never let multiple agents write to the same folder simultaneously.
19. "Delete everything and rebuild" agents are time bombs
- Date: 2026-02-12
- Mistake: Gave rebuild agents instructions to "DELETE everything, build from scratch"
- What happened: Agent deletes all files in minute 1, then times out at minute 10 with only 30% rebuilt. Now the server is WORSE than before.
- Rule: NEVER tell agents to delete first. Say "build new files alongside existing ones" or "write to a temp directory, then swap." Always keep the old code until the new code is verified.
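The "write to a temp directory, then swap" pattern might look like the sketch below. Everything named here is hypothetical (rebuild_safely, the build and verify callbacks, the .old backup suffix). It assumes the staging directory sits next to the target so the final rename stays on one filesystem.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def rebuild_safely(target: Path, build: Callable[[Path], None], verify: Callable[[Path], bool]) -> bool:
    """Build into a staging dir; swap it in only if verify() passes.

    The old directory is never deleted up front. On success it is kept
    as <name>.old; on failure or timeout it is left untouched.
    """
    staging = Path(tempfile.mkdtemp(prefix=target.name + "-new-", dir=target.parent))
    try:
        build(staging)
        if not verify(staging):
            return False  # old code is still in place
        backup = target.with_name(target.name + ".old")
        if backup.exists():
            shutil.rmtree(backup)
        if target.exists():
            target.rename(backup)
        staging.rename(target)
        return True
    finally:
        # Clean up a failed or abandoned build; after a swap staging no longer exists.
        if staging.exists():
            shutil.rmtree(staging)
```

If the agent dies at minute 10 with 30% built, the worst case is a leftover staging directory, not a wiped server.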
20. Factory monitor cron + manual spawns = competing agents
- Date: 2026-02-12
- Mistake: Had a cron job (every 10min) spawning fix agents for incomplete servers, PLUS I was manually spawning rebuild agents
- What happened: 3-4 agents fighting over the same server simultaneously, each deleting what the others wrote
- Rule: Before spawning fix agents, DISABLE any cron monitors that might also spawn agents for the same servers. One coordinator, one set of workers. No freelancers.
21. 10-minute timeout is too short for full MCP builds
- Date: 2026-02-12
- Mistake: Set 600s (10min) timeout for agents building entire MCP servers (tools + apps + types + server + README)
- What happened: Agents got 60-80% done then died. "No output" completions burning 60-70k tokens each.
- Rule: Full MCP server builds need 900s (15min). App-only or tool-only jobs can use 600s. Always set runTimeoutSeconds based on scope.
22. Git checkout HEAD restores wiped files
- Date: 2026-02-12
- Mistake: Panicked when rebuild agents wiped committed files
- What saved us: git checkout HEAD -- servers/{name}/ instantly restores all committed files
- Rule: Always commit after each server completes. Then if a rogue agent wipes files, one git command fixes it. Commit early, commit often.
23. Single-purpose agents > multi-purpose agents
- Date: 2026-02-12
- Mistake: Gave agents broad tasks like "build the complete MCP server" (tools + apps + types + infra + README)
- What happened: They'd run out of tokens/time trying to do everything, often failing at the apps stage
- Rule: Split into focused agents: "build tools only", "build apps only", "fix TSC errors only". Smaller scope = higher success rate. Each agent should have ONE clear deliverable.
24. Always verify sub-agent output — "success" doesn't mean complete
- Date: 2026-02-12
- Mistake: Trusted agent completion messages like "50+ tools built!" without checking
- What happened: Agent claimed 50 tools but only wrote 2 files. The "findings" text was aspirational, not factual.
- Rule: After EVERY sub-agent completion, run a file count check: find src/tools -name "*.ts" | wc -l. Never trust the narrative. Trust the filesystem.
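The same file count check can be run from a script when verifying many servers in a row. A sketch only: count_files is a hypothetical helper that mirrors the find command above, counting files and not the tools defined inside them.

```python
from pathlib import Path


def count_files(root: Path, pattern: str = "*.ts") -> int:
    """Count files under root (recursively) matching pattern; 0 if root is missing."""
    if not root.is_dir():
        return 0
    return sum(1 for path in root.rglob(pattern) if path.is_file())
```

Compare count_files(Path("src/tools")) against whatever the completion message claimed before marking anything done.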
25. Count apps correctly — multiple storage patterns exist
- Date: 2026-02-12
- Mistake: Kept miscounting apps because different servers store them differently
- What happened: Some use subdirectories, some use .tsx files, some use .ts files, some use .html files, some use src/apps/ instead of src/ui/react-app/
- Rule: Check ALL patterns: subdirs in react-app/, .tsx files, .ts files, .html files, AND src/apps/*.ts. Take the max. Use a consistent counting script.
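A consistent counting script could follow the shape below. It is a sketch with assumptions: count_apps is a hypothetical name, and it assumes the .tsx, .ts, and .html files sit directly inside src/ui/react-app/, which the entry above does not spell out.

```python
from pathlib import Path


def _count(directory: Path, pattern: str, dirs_only: bool = False) -> int:
    """Count direct children of directory matching pattern (files, or dirs if dirs_only)."""
    if not directory.is_dir():
        return 0
    return sum(1 for path in directory.glob(pattern) if path.is_dir() == dirs_only)


def count_apps(server: Path) -> int:
    """Check every known storage pattern and take the max."""
    react_app = server / "src" / "ui" / "react-app"
    counts = [
        _count(react_app, "*", dirs_only=True),  # one subdirectory per app
        _count(react_app, "*.tsx"),
        _count(react_app, "*.ts"),
        _count(react_app, "*.html"),
        _count(server / "src" / "apps", "*.ts"),
    ]
    return max(counts)
```

Taking the max assumes each server uses one dominant pattern. A server that mixes patterns would be undercounted.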
MCP Factory Quality Standards (2026-02-13)
26. ALWAYS start from the actual API spec — never hand-pick tools from vibes
- Date: 2026-02-13
- Mistake: For the 30 SMB MCP servers, I read API docs casually and hand-picked 7-8 "obvious" tools per server
- What happened: Ended up with surface-level CRUD (list/get/create/update) covering maybe 10-15% of each API, missing the tools people actually need
- Rule: ALWAYS pull the official OpenAPI/Swagger spec (or systematically crawl every endpoint). Build a complete endpoint inventory BEFORE deciding what becomes a tool. If Mailchimp has 127 endpoints, I need to know all 127 before picking which 50 become tools.
27. Prioritize tools by real user workflows, not alphabetical CRUD
- Date: 2026-02-13
- Mistake: Mechanically created list_X / get_X / create_X / update_X for each resource, with zero workflow awareness
- What happened: A CRM MCP that can list_leads but can't log_a_call or add_note_to_lead, the things salespeople do 50x/day
- Rule: Research the platform's top use cases. Map workflow chains (create contact → add to list → send campaign → check results). Tier the tools:
- Tier 1 (daily): 10-15 things every user does daily
- Tier 2 (power user): 15-30 things power users need
- Tier 3 (complete): Everything else for full API coverage
- Ship Tier 1+2 minimum. Tier 3 = "best on market" differentiator.
28. Rich tool descriptions are NOT optional — they drive agent behavior
- Date: 2026-02-13
- Mistake: Wrote basic descriptions like "Lists contacts" with minimal parameter docs
- What happened: AI agents make tool selection decisions based on descriptions. Vague = wrong tool chosen = bad UX
- Rule: Every tool description must tell an AI agent WHEN to use it:
- BAD: "Lists contacts"
- GOOD: "Lists contacts with optional filtering by email, name, tag, or date range. Use when the user wants to find, search, or browse their contact database. Returns paginated results up to 100 per page."
- Every param needs: description, type+format constraints, defaults, required/optional, example values
- _meta labels from day one: category, access (read/write/destructive), complexity, rateLimit
29. Maintain a coverage manifest for every MCP server
- Date: 2026-02-13
- Mistake: No tracking of which endpoints were covered vs skipped. No way to measure quality.
- Rule: Every server gets a coverage manifest in its README:
  Total API endpoints: 127
  Tools implemented: 45
  Intentionally skipped: 12 (deprecated/admin-only)
  Not yet covered: 70 (backlog)
  Coverage: 35% → target 80%+
- Every skipped endpoint needs a REASON (deprecated, admin-only, OAuth-only, redundant). Set 80%+ as "production quality" threshold.
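The manifest numbers can be derived instead of typed by hand. A minimal sketch, assuming coverage means tools implemented divided by total endpoints (which matches 45 / 127 ≈ 35% above); coverage_report and its keys are hypothetical.

```python
def coverage_report(total: int, implemented: int, skipped: int) -> dict:
    """Compute the backlog and coverage figures for a server's manifest."""
    if total <= 0:
        raise ValueError("total endpoints must be positive")
    percent = round(100 * implemented / total)
    return {
        "total": total,
        "implemented": implemented,
        "skipped": skipped,
        "backlog": total - implemented - skipped,
        "coverage_percent": percent,
        "production_quality": percent >= 80,  # the 80%+ threshold from the rule
    }
```

With the Mailchimp-style numbers, coverage_report(127, 45, 12) gives a backlog of 70 and 35% coverage.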
30. 7-8 tools per server is a demo, not a product
- Date: 2026-02-13
- Mistake: Treated 7-8 tools as "enough" for the initial 30 servers
- What it actually is: A toy. Nobody can do their real job with 7 tools for a platform that has 100+ API endpoints.
- Rule: Minimum viable tool count depends on API size:
- Small API (<30 endpoints): 15-20 tools
- Medium API (30-100 endpoints): 30-50 tools
- Large API (100+ endpoints): 50-80+ tools
- If customers install it and can't do their #1 use case, it's not a product.
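The size tiers can be encoded so every server gets the same target. A sketch only: recommended_tool_range is a hypothetical name, exactly 100 endpoints is treated as medium (the tiers above overlap there), and the upper bound of 80 for large APIs is soft because the rule says 80+.

```python
def recommended_tool_range(endpoint_count: int) -> tuple:
    """Map API size to the (min, max) tool count from the tiers above."""
    if endpoint_count < 30:
        return (15, 20)  # small API
    if endpoint_count <= 100:
        return (30, 50)  # medium API
    return (50, 80)  # large API; 80 is a floor for "complete", not a cap
```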
31. Consistent naming conventions across ALL servers — no exceptions
- Date: 2026-02-13
- Rule: Factory-wide naming standard:
- list_* for paginated collections
- get_* for single resource by ID
- create_*, update_*, delete_* for mutations
- search_* for query-based lookups
- Domain verbs: send_email, cancel_event, archive_card, assign_task
- NEVER mix fetch_* / get_* / retrieve_*; pick ONE
- All snake_case, all lowercase
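A tiny linter can enforce the naming standard across servers. This is a sketch, assuming get_* is the ONE prefix picked (since the standard lists it) and that fetch_* and retrieve_* are therefore banned; naming_problems is a hypothetical name.

```python
import re

SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")
BANNED_PREFIXES = ("fetch_", "retrieve_")  # assumption: get_* is the chosen prefix


def naming_problems(tool_name: str) -> list:
    """Return naming violations for one tool name; empty list means it passes."""
    problems = []
    if not SNAKE_CASE.match(tool_name):
        problems.append("not lowercase snake_case")
    if tool_name.startswith(BANNED_PREFIXES):
        problems.append("use get_* instead of fetch_* / retrieve_*")
    return problems
```

Run it over every registered tool name and fail the build if any list comes back non-empty.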
32. Handle pagination and rate limits properly in every server
- Date: 2026-02-13
- Rule: Every list_* tool must:
- Support cursor/page tokens
- Use reasonable default page sizes (25-100, never "all")
- Return has_more / next_page indicators
- Handle API rate limits (429) with retry + exponential backoff
- Document known rate limits in tool _meta
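Retry with exponential backoff on 429 could be sketched like this. It is not the factory's actual implementation: call_with_backoff is a hypothetical helper, and request is assumed to be any zero-argument callable returning a (status, body) pair.

```python
import time
from typing import Any, Callable, Tuple


def call_with_backoff(
    request: Callable[[], Tuple[int, Any]],
    max_retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, Any]:
    """Retry on HTTP 429, doubling the wait each time: 1s, 2s, 4s, ..."""
    for attempt in range(max_retries + 1):
        status, body = request()
        if status != 429:
            return status, body
        if attempt < max_retries:
            sleep(base_delay * 2 ** attempt)
    return status, body  # still rate limited after all retries
```

The sleep parameter exists so the delays can be checked without actually waiting. A production version would probably also honor a Retry-After header when the API sends one.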
Last updated: 2026-02-13 02:46 EST
Total lessons: 32
17. Jake's Preferred Image Style
- Mistake: Used comic book/vibrant cartoon style when Jake asked for "the style I like"
- What happened: Jake corrected — his preferred style is chibi kawaii anime, NOT comic book
- Rule: Jake's go-to image style = chibi/kawaii anime (pastel colors, big eyes, oversized heads, tiny bodies, sparkles, hearts, stars). Same style as Buba's visual identity in IDENTITY.md. Always default to this unless he says otherwise.