# Remi Bot Self-Healing System

Set up 2026-01-24 to auto-monitor and recover from failures.

## What Was Fixed

### Root Causes of `/scan` Timeout (2026-01-24)
1. **Missing `asyncio` import** in `packages/core/analyzers/scoring_async.py`: the call to `asyncio.gather()` raised a `NameError`.
2. **Synchronous HTTP calls blocking the event loop**: SoundCloud API calls were synchronous and blocked the Discord heartbeat for 850+ seconds.
3. **No timeouts** on scoring operations, so a stalled scan was never cut off.

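
The second root cause can be reproduced in isolation. The sketch below is an illustration only, not code from the bot: `slow_sync_call` is a hypothetical stand-in for a synchronous SoundCloud request, and `count_heartbeats` is a hypothetical stand-in for the Discord heartbeat. It compares calling the blocking function directly inside a coroutine with offloading it through `asyncio.to_thread` (Python 3.9+).

```python
import asyncio
import time


def slow_sync_call(seconds: float) -> str:
    """Hypothetical stand-in for a synchronous HTTP request."""
    time.sleep(seconds)
    return "done"


async def count_heartbeats(stop: asyncio.Event, interval: float) -> int:
    """Hypothetical heartbeat: counts how often it gets to run."""
    beats = 0
    while not stop.is_set():
        await asyncio.sleep(interval)
        beats += 1
    return beats


async def run_case(offload: bool) -> int:
    stop = asyncio.Event()
    heartbeat = asyncio.create_task(count_heartbeats(stop, 0.02))
    await asyncio.sleep(0)  # yield once so the heartbeat task starts
    if offload:
        # The blocking call runs in a worker thread; the loop stays free.
        await asyncio.to_thread(slow_sync_call, 0.3)
    else:
        # The blocking call runs on the event loop thread and stalls it.
        slow_sync_call(0.3)
    stop.set()
    return await heartbeat


def compare() -> tuple[int, int]:
    """Return heartbeat counts for the blocked and the offloaded case."""
    blocked = asyncio.run(run_case(offload=False))
    offloaded = asyncio.run(run_case(offload=True))
    return blocked, offloaded


if __name__ == "__main__":
    blocked, offloaded = compare()
    print(f"heartbeats while blocked: {blocked}, while offloaded: {offloaded}")
```

In the direct call the heartbeat task cannot run until the blocking call returns, which is the same mechanism that starved the Discord heartbeat, only at a much smaller scale.
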
### Fixes Applied
1. Added `asyncio` and `ThreadPoolExecutor` imports to `scoring_async.py`.
2. Changed `score_batch_async()` to run the synchronous scoring in a thread pool with a 5-minute timeout.
3. Added a 2-minute timeout on the enrichment phase.
4. Reduced the SoundCloud HTTP client timeout from 30 s to 10 s.

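
The thread pool and timeout pattern from fix 2 might look roughly like the sketch below. Only the name `score_batch_async` and the 5-minute limit come from the list above; the `score_track` function, the `scorer` parameter, the track dictionaries, and the pool size of 4 are assumptions made for illustration, and the real implementation in `scoring_async.py` may differ.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

SCORING_TIMEOUT_SECONDS = 300  # the 5-minute limit from the fix list


def score_track(track: dict) -> float:
    """Hypothetical synchronous scorer; the real scoring logic is not shown."""
    return float(track.get("plays", 0)) / 1000.0


async def score_batch_async(tracks, timeout=SCORING_TIMEOUT_SECONDS,
                            scorer=score_track):
    """Score tracks in worker threads and give up after `timeout` seconds."""
    loop = asyncio.get_running_loop()
    pool = ThreadPoolExecutor(max_workers=4)  # pool size is an assumption
    try:
        futures = [loop.run_in_executor(pool, scorer, t) for t in tracks]
        # wait_for raises asyncio.TimeoutError if the batch takes too long.
        return await asyncio.wait_for(asyncio.gather(*futures), timeout=timeout)
    finally:
        # Do not wait here: joining the workers would block the event loop.
        # Threads that are already running cannot be interrupted; they
        # finish in the background and their results are discarded.
        pool.shutdown(wait=False, cancel_futures=True)


if __name__ == "__main__":
    sample = [{"plays": 1000}, {"plays": 2500}]
    print(asyncio.run(score_batch_async(sample)))
```

A caller would typically catch `asyncio.TimeoutError` and report the failed scan to the user instead of letting the command hang. Note that the timeout frees the event loop but does not stop the worker thread itself, which is why the shorter HTTP client timeout in fix 4 still matters.
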
## Self-Healing Infrastructure
### Watchdog Service (`com.remi.watchdog`)
- **Location:** `~/Library/LaunchAgents/com.remi.watchdog.plist`
- **Script:** `~/projects/remix-sniper/scripts/watchdog.sh`
- **Behavior:**
  - Checks every 60 seconds whether the bot process is running
  - Auto-restarts the bot if the process dies
  - Monitors `bot_error.log` for critical errors
  - Writes status to the `~/.bot_health` file
  - Logs alerts to the `~/.bot_alert` file

### Health Check Script
- **Location:** `~/projects/remix-sniper/scripts/health_check.sh`
- Checks: process alive, recent errors, gateway connection
- Exit 0 = healthy, Exit 1 = issues

### Manual Commands
```bash
# Check watchdog status
launchctl list | grep remi

# View watchdog logs
tail -f ~/projects/remix-sniper/watchdog.log

# Check bot health
cat ~/projects/remix-sniper/.bot_health

# Restart bot manually
pkill -f "python.*main.py"
cd ~/projects/remix-sniper && source venv/bin/activate
nohup python packages/bot/main.py >> bot.log 2>> bot_error.log &

# Stop watchdog
launchctl unload ~/Library/LaunchAgents/com.remi.watchdog.plist

# Start watchdog
launchctl load ~/Library/LaunchAgents/com.remi.watchdog.plist
```

## Bubabot Integration
When alerted about Remi issues:
1. Run `~/projects/remix-sniper/scripts/health_check.sh`.
2. If it fails, check `bot_error.log` for the root cause.
3. Fix the code in `~/projects/remix-sniper/packages/`.
4. Restart the bot.
5. Test with the `/scan` command.
6. Report to #quick-tasks.
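
Steps 1 and 2 of the procedure above could be automated along the lines of the following sketch. It relies only on the documented exit-code contract (0 = healthy, 1 = issues); the helper names `is_healthy`, `tail_lines`, and `triage` are hypothetical, and the 20-line default is an arbitrary choice.

```python
import subprocess
from collections import deque
from pathlib import Path


def is_healthy(command: list[str]) -> bool:
    """Run a health check command; exit code 0 means healthy."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode == 0


def tail_lines(path: Path, count: int = 20) -> list[str]:
    """Return the last `count` lines of a log file, or [] if it is missing."""
    if not path.exists():
        return []
    with path.open(encoding="utf-8", errors="replace") as handle:
        # A bounded deque keeps only the final `count` lines in memory.
        return [line.rstrip("\n") for line in deque(handle, maxlen=count)]


def triage(command: list[str], error_log: Path, count: int = 20) -> list[str]:
    """Return recent error-log lines if the check fails, else an empty list."""
    if is_healthy(command):
        return []
    return tail_lines(error_log, count)
```

A caller would pass the expanded path of `health_check.sh` as a one-element command list and the location of `bot_error.log` as `error_log`. An empty result means either that the check passed or that the log is missing or empty, so a failing check with no log output still needs a manual look.
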