# Remi Bot Self-Healing System
Set up 2026-01-24 to auto-monitor and recover from failures.
## What Was Fixed
### Root Causes of `/scan` Timeout (2026-01-24)
1. **Missing `asyncio` import** in `packages/core/analyzers/scoring_async.py`: caused a `NameError` when calling `asyncio.gather()`
2. **Synchronous HTTP calls blocking the event loop**: SoundCloud API calls were synchronous and blocked the Discord heartbeat for 850+ seconds
3. **No timeouts** on scoring operations
### Fixes Applied
1. Added `asyncio` and `ThreadPoolExecutor` imports to `scoring_async.py`
2. Changed `score_batch_async()` to run the synchronous scoring in a thread pool with a 5-minute timeout
3. Added a 2-minute timeout on the enrichment phase
4. Reduced the SoundCloud HTTP client timeout from 30s to 10s
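The pattern behind fix 2 can be sketched roughly as follows. This is a minimal illustration, not the actual contents of `scoring_async.py`: `score_track_sync`, the worker count, and the shape of the inputs are hypothetical stand-ins, and only the `score_batch_async()` name and the 5-minute timeout come from the notes above.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the real synchronous scorer (assumption).
def score_track_sync(track_id: int) -> float:
    time.sleep(0.05)  # simulates a blocking HTTP call
    return track_id * 1.5

# Worker count is an arbitrary choice for this sketch.
_executor = ThreadPoolExecutor(max_workers=4)

async def score_batch_async(track_ids, timeout_s: float = 300.0):
    """Run blocking scoring calls in worker threads under one overall timeout."""
    loop = asyncio.get_running_loop()
    # Each blocking call runs in the pool, so the event loop stays free
    # to service other tasks (such as a gateway heartbeat).
    futures = [
        loop.run_in_executor(_executor, score_track_sync, track_id)
        for track_id in track_ids
    ]
    # Raises asyncio.TimeoutError if the whole batch exceeds timeout_s.
    return await asyncio.wait_for(asyncio.gather(*futures), timeout=timeout_s)

if __name__ == "__main__":
    print(asyncio.run(score_batch_async([1, 2, 3])))
```

One caveat with this approach: when the timeout fires, the awaiting coroutine is released, but worker threads that are already running continue until their blocking call returns. That is why a shorter HTTP client timeout (fix 4) still matters.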
## Self-Healing Infrastructure
### Watchdog Service (`com.remi.watchdog`)
- **Location:** `~/Library/LaunchAgents/com.remi.watchdog.plist`
- **Script:** `~/projects/remix-sniper/scripts/watchdog.sh`
- **Behavior:**
  - Checks every 60 seconds if bot process is running
  - Auto-restarts bot if process dies
  - Monitors `bot_error.log` for critical errors
  - Writes status to `~/.bot_health` file
  - Logs alerts to `~/.bot_alert` file
### Health Check Script
- **Location:** `~/projects/remix-sniper/scripts/health_check.sh`
- Checks: process alive, recent errors, gateway connection
- Exit 0 = healthy, Exit 1 = issues
### Manual Commands
```bash
# Check watchdog status
launchctl list | grep remi
# View watchdog logs
tail -f ~/projects/remix-sniper/watchdog.log
# Check bot health
cat ~/projects/remix-sniper/.bot_health
# Restart bot manually
pkill -f "python.*main.py"
cd ~/projects/remix-sniper && source venv/bin/activate
nohup python packages/bot/main.py >> bot.log 2>> bot_error.log &
# Stop watchdog
launchctl unload ~/Library/LaunchAgents/com.remi.watchdog.plist
# Start watchdog
launchctl load ~/Library/LaunchAgents/com.remi.watchdog.plist
```
## Bubabot Integration
When alerted about Remi issues:
1. Run `~/projects/remix-sniper/scripts/health_check.sh`
2. If it fails, check `bot_error.log` for the root cause
3. Fix code in `~/projects/remix-sniper/packages/`
4. Restart bot
5. Test with `/scan` command
6. Report to #quick-tasks
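
Steps 1 and 2 above could be automated along these lines. This is a hedged sketch only: the `triage` helper and its report format are hypothetical, and it assumes `bot_error.log` lives in the project root (as the manual restart command suggests) and that `health_check.sh` follows the exit-code convention described earlier (0 = healthy, 1 = issues).

```python
import subprocess
from pathlib import Path

def triage(health_script: str, error_log: str, tail_lines: int = 20) -> dict:
    """Run the health check; on failure, collect the tail of the error log."""
    script = Path(health_script).expanduser()
    result = subprocess.run(["bash", str(script)], capture_output=True, text=True)
    report = {
        "healthy": result.returncode == 0,  # exit 0 = healthy
        "exit_code": result.returncode,
        "recent_errors": [],
    }
    if not report["healthy"]:
        log = Path(error_log).expanduser()
        if log.exists():
            lines = log.read_text(errors="replace").splitlines()
            report["recent_errors"] = lines[-tail_lines:]
    return report

if __name__ == "__main__":
    # Paths follow the locations listed in this document; the log path is an assumption.
    print(triage(
        "~/projects/remix-sniper/scripts/health_check.sh",
        "~/projects/remix-sniper/bot_error.log",
    ))
```

The remaining steps (fixing code, restarting, testing `/scan`, reporting) still need judgment, so the sketch deliberately stops at gathering evidence.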