Remi Bot Self-Healing System

Set up on 2026-01-24 to monitor the bot automatically and recover from failures.

What Was Fixed

Root Causes of /scan Timeout (2026-01-24)

  1. Missing asyncio import in packages/core/analyzers/scoring_async.py: calling asyncio.gather() raised a NameError
  2. Synchronous HTTP calls blocking the event loop: SoundCloud API calls were synchronous and blocked the Discord heartbeat for 850+ seconds
  3. No timeouts on scoring operations, so a stalled call could hang /scan indefinitely
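
The second root cause can be reproduced in isolation. The sketch below is a minimal illustration, not code from the bot: heartbeat(), blocking_call(), and count_ticks() are hypothetical stand-ins, and the durations are shortened so it runs in well under a second. It counts how often a periodic task gets to run while a synchronous call executes on the event loop thread versus in a worker thread.

```python
import asyncio
import time


async def heartbeat(ticks: list, interval_s: float) -> None:
    """Stand-in for a periodic task such as a gateway heartbeat."""
    while True:
        await asyncio.sleep(interval_s)
        ticks.append(time.monotonic())


def blocking_call(duration_s: float) -> None:
    """Stand-in for a synchronous HTTP request."""
    time.sleep(duration_s)


async def count_ticks(offload: bool, duration_s: float = 0.2) -> int:
    """Return how many heartbeats fired while blocking_call() was in progress."""
    ticks: list = []
    task = asyncio.create_task(heartbeat(ticks, interval_s=0.02))
    if offload:
        # Runs in a worker thread; the event loop keeps scheduling heartbeat().
        await asyncio.to_thread(blocking_call, duration_s)
    else:
        # Runs on the event loop thread; nothing else can run until it returns.
        blocking_call(duration_s)
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return len(ticks)


if __name__ == "__main__":
    print("ticks while blocked:  ", asyncio.run(count_ticks(offload=False)))
    print("ticks while offloaded:", asyncio.run(count_ticks(offload=True)))
```

In the blocked case the heartbeat never fires, which is the same starvation that stalled the Discord heartbeat, only on a smaller scale.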

Fixes Applied

  1. Added the asyncio and ThreadPoolExecutor imports to scoring_async.py
  2. Changed score_batch_async() to run the synchronous scoring in a thread pool with a 5-minute timeout
  3. Added a 2-minute timeout to the enrichment phase
  4. Reduced the SoundCloud HTTP client timeout from 30s to 10s
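
The pattern behind fix 2 looks roughly like the sketch below. It is not the actual contents of scoring_async.py: score_track(), the score_fn parameter, and the pool size are assumptions made for illustration, and only the function name score_batch_async() and the 5-minute timeout come from the notes above.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def score_track(track_id: int) -> int:
    """Hypothetical synchronous scorer; the real one makes blocking HTTP calls."""
    time.sleep(0.01)
    return track_id * 2


async def score_batch_async(
    track_ids: Sequence[int],
    score_fn: Callable[[int], int] = score_track,
    timeout_s: float = 300.0,  # 5 minutes, as in the fix described above
    max_workers: int = 4,      # assumed pool size
) -> List[int]:
    """Score every track in worker threads without blocking the event loop."""
    loop = asyncio.get_running_loop()
    executor = ThreadPoolExecutor(max_workers=max_workers)
    try:
        futures = [loop.run_in_executor(executor, score_fn, t) for t in track_ids]
        # Raises asyncio.TimeoutError if the whole batch exceeds timeout_s.
        return await asyncio.wait_for(asyncio.gather(*futures), timeout=timeout_s)
    finally:
        # Do not wait for stragglers: a thread that is already running cannot
        # be interrupted, but queued work is dropped and the caller regains
        # control immediately.
        executor.shutdown(wait=False, cancel_futures=True)


if __name__ == "__main__":
    print(asyncio.run(score_batch_async([1, 2, 3])))
```

One caveat with this approach: the timeout frees the event loop, but a worker thread stuck in a slow request keeps running until the request itself returns, which is why a shorter HTTP client timeout (fix 4) still matters.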

Self-Healing Infrastructure

Watchdog Service (com.remi.watchdog)

  • Location: ~/Library/LaunchAgents/com.remi.watchdog.plist
  • Script: ~/projects/remix-sniper/scripts/watchdog.sh
  • Behavior:
    • Checks every 60 seconds whether the bot process is running
    • Auto-restarts the bot if the process dies
    • Monitors bot_error.log for critical errors
    • Writes status to ~/.bot_health file
    • Logs alerts to ~/.bot_alert file
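
The watchdog itself is a shell script and is not reproduced here. As a rough Python illustration of one check cycle, the sketch below takes the process check and the restart action as injected callables. The function names, the status strings, and the line format written to the health and alert files are all assumptions, and log monitoring is left out.

```python
import time
from pathlib import Path
from typing import Callable


def watchdog_tick(
    is_running: Callable[[], bool],
    restart: Callable[[], None],
    health_file: Path,
    alert_file: Path,
) -> str:
    """Run one check cycle and return the status that was recorded."""
    stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
    if is_running():
        status = "healthy"
    else:
        restart()
        status = "restarted"
        # Append so earlier alerts are kept for later inspection.
        with alert_file.open("a", encoding="utf-8") as fh:
            fh.write(f"{stamp} bot process not found, restart issued\n")
    # Overwrite: the health file holds only the latest status.
    health_file.write_text(f"{stamp} {status}\n", encoding="utf-8")
    return status


def run_forever(tick: Callable[[], str], interval_s: float = 60.0) -> None:
    """Call tick() every interval_s seconds (60 s matches the notes above)."""
    while True:
        tick()
        time.sleep(interval_s)
```

Passing the check and restart actions in as callables keeps the cycle easy to exercise without touching a real process.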

Health Check Script

  • Location: ~/projects/remix-sniper/scripts/health_check.sh
  • Checks: process alive, recent errors, gateway connection
  • Exit codes: 0 = healthy, 1 = issues found
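
The exit-code contract can be illustrated with a small Python sketch. This is not the contents of health_check.sh: run_checks(), no_recent_errors(), the "CRITICAL" marker, and the 200-line window are assumptions, and the gateway check is omitted because the notes do not say how it is performed.

```python
from pathlib import Path
from typing import Callable, Dict, List, Tuple


def no_recent_errors(
    log_path: Path, marker: str = "CRITICAL", tail_lines: int = 200
) -> bool:
    """True when none of the last tail_lines lines contain marker."""
    if not log_path.exists():
        return True  # no log yet means nothing to report
    lines = log_path.read_text(encoding="utf-8", errors="replace").splitlines()
    return not any(marker in line for line in lines[-tail_lines:])


def run_checks(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, List[str]]:
    """Run named checks; return (exit_code, names of failed checks)."""
    failed = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False  # a check that crashes counts as a failure
        if not ok:
            failed.append(name)
    return (1 if failed else 0), failed
```

A caller would pass the returned code to sys.exit() so that other tooling can branch on 0 versus 1, and could print the list of failed check names for diagnosis.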

Manual Commands

# Check watchdog status
launchctl list | grep remi

# View watchdog logs
tail -f ~/projects/remix-sniper/watchdog.log

# Check bot health
cat ~/projects/remix-sniper/.bot_health

# Restart bot manually
pkill -f "python.*main.py"
cd ~/projects/remix-sniper && source venv/bin/activate
nohup python packages/bot/main.py >> bot.log 2>> bot_error.log &

# Stop watchdog
launchctl unload ~/Library/LaunchAgents/com.remi.watchdog.plist

# Start watchdog
launchctl load ~/Library/LaunchAgents/com.remi.watchdog.plist

Bubabot Integration

When alerted about Remi issues:

  1. Run ~/projects/remix-sniper/scripts/health_check.sh
  2. If it fails, check bot_error.log for the root cause
  3. Fix the code under ~/projects/remix-sniper/packages/
  4. Restart the bot
  5. Test with the /scan command
  6. Report to #quick-tasks