# Reonomy Scraper - AGENT-BROWSER PLAN

**Date**: 2026-01-15

**Status**: Agent-browser confirmed working and ready to use

---
## 🎯 New Approach: Use Agent-Browser for Reonomy Scraper

### Why Agent-Browser Over Puppeteer

| Aspect | Puppeteer | Agent-Browser |
|--------|-----------|---------------|
| **Speed** | Fast (Node.js) | ⚡ Faster (Rust CLI + Playwright) |
| **Stability** | Medium (SPA timeouts) | ✅ High (Playwright engine) |
| **Refs** | ❌ No (CSS selectors) | ✅ Yes (deterministic @e1, @e2) |
| **Semantic Locators** | ❌ No | ✅ Yes (role, text, label, placeholder) |
| **State Persistence** | Manual (code changes) | ✅ Built-in (save/load) |
| **Sessions** | ❌ No (single instance) | ✅ Yes (parallel scrapers) |
| **API Compatibility** | ✅ Perfect (Node.js) | ✅ Perfect (Node.js) |
| **Eval Syntax** | Puppeteer `page.evaluate()` | ✅ Simple strings |

**Agent-Browser Wins:**

1. **Refs** — Snapshot once, use refs for all interactions (AI-friendly)
2. **Semantic Locators** — Find by role/text/label without CSS selectors
3. **State Persistence** — Login once, reuse across all scrapes (skip auth)
4. **Sessions** — Run parallel scrapers for different locations
5. **Playwright Engine** — More reliable than Puppeteer for SPAs

---

## 📋 Agent-Browser Workflow for Reonomy

### Step 1: Login (One-Time)

```bash
agent-browser open "https://app.reonomy.com/#!/login"
agent-browser snapshot -i                           # Get login form refs
agent-browser fill @e1 "henry@realestateenhanced.com"
agent-browser fill @e2 "9082166532"
agent-browser click @e3                             # Click login button
agent-browser wait 15000
agent-browser state save "reonomy-auth-state.txt"   # Save auth state
```

### Step 2: Load Saved State (Subsequent Runs)

```bash
# Skip login on future runs
agent-browser state load "reonomy-auth-state.txt"
```

### Step 3: Navigate to Search with Filters

```bash
# Use your search ID with phone+email filters
agent-browser open "https://app.reonomy.com/#!/search/504a2d13-d88f-4213-9ac6-a7c8bc7c20c6"
```

### Step 4: Extract Property IDs

```bash
# Get snapshot of search results
agent-browser snapshot -i

# Extract property links from refs
# (Parse JSON output to get all property IDs)
```
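
The ID-parsing step above can be sketched as follows. This assumes property links embed `/property/<uuid>` in their hrefs, matching the URL shape used in Step 5 — the real snapshot output should be checked before relying on it:

```javascript
// Sketch: collect unique property IDs from a snapshot dump.
// Assumption: property links carry "/property/<uuid>" in their href.
function extractPropertyIds(snapshotText) {
  const ids = new Set();
  const re = /\/property\/([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})/g;
  let m;
  while ((m = re.exec(snapshotText)) !== null) ids.add(m[1]);
  return [...ids];
}

// Example against a fabricated snapshot fragment:
const sample = `
- link "123 Main St" @e7 href="/#!/search/504a2d13-d88f-4213-9ac6-a7c8bc7c20c6/property/11111111-2222-3333-4444-555555555555/ownership"
- link "456 Oak Ave" @e8 href="/#!/search/504a2d13-d88f-4213-9ac6-a7c8bc7c20c6/property/aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee/ownership"
`;
console.log(extractPropertyIds(sample));
// → ["11111111-2222-3333-4444-555555555555", "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"]
```

The `/property/` prefix keeps the search ID itself from matching, and the `Set` dedupes links that appear more than once per result row.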

### Step 5: Process Each Property (Dual-Tab Extraction)

**For each property:**

```bash
# Navigate to the ownership page directly
agent-browser open "https://app.reonomy.com/#!/search/504a2d13-d88f-4213-9ac6-a7c8bc7c20c6/property/{property-id}/ownership"

# Wait for the page to load
agent-browser wait 8000

# Get snapshot
agent-browser snapshot -i

# Extract from the Building and Lot tab
# (Address, City, State, ZIP, SF, Property Type)

# Wait a moment
agent-browser wait 2000

# Extract from the Owner tab
# (Owner Names, Emails via mailto, Phones via your CSS selector)

# Screenshot for debugging
agent-browser screenshot "/tmp/property-{index}.png"
```
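
The two extraction placeholders above would be filled with `eval` calls. A rough sketch of the page-side code for the Building and Lot fields, pulling values from the tab's visible text by label — every label here ("Address", "City", "Square Footage", …) is a guess and must be verified against a real snapshot:

```javascript
// Sketch: label-based field extraction from the tab's visible text.
// All field labels are assumptions, not verified against Reonomy's DOM.
function extractBuildingAndLot(doc) {
  const text = doc.body ? doc.body.innerText : '';
  const grab = (label) => {
    const m = text.match(new RegExp(label + '\\s*[:\\n]\\s*([^\\n]+)'));
    return m ? m[1].trim() : null;
  };
  return {
    address: grab('Address'),
    city: grab('City'),
    state: grab('State'),
    zip: grab('ZIP'),
    squareFootage: grab('Square Footage'),
    propertyType: grab('Property Type'),
  };
}
```

In practice this would run via `eval` against the live page (with `doc` being `document`); the Owner-tab extraction would follow the same pattern using the email and phone selectors from the Key Selectors section.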

### Step 6: Save Results

```bash
# Output to JSON
# (Combine all property data into final JSON)
```

---

## 🎯 Key Selectors

### Email Extraction (Dual Approach)

```javascript
// Mailto links
Array.from(document.querySelectorAll('a[href^="mailto:"]')).map(a => a.href.replace('mailto:', ''))

// Text-based emails
/[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g
```
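
The two approaches can be merged and deduplicated in one page-side function. A sketch — the `doc`/`bodyText` parameters exist only for testability; in the page they would be `document` and `document.body.innerText`:

```javascript
// Sketch: merge mailto-based and text-based email extraction, deduped.
function extractEmails(doc, bodyText) {
  const fromMailto = Array.from(doc.querySelectorAll('a[href^="mailto:"]'))
    .map(a => a.getAttribute('href').replace(/^mailto:/, '').split('?')[0]);
  const fromText = bodyText.match(/[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g) || [];
  return [...new Set([...fromMailto, ...fromText])];
}
```

The `split('?')[0]` drops any `?subject=` suffix from mailto hrefs, and the `Set` removes addresses found by both methods.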

### Phone Extraction (Your Provided Selector)

```css
p.MuiTypography-root.jss1797.jss1798.MuiTypography-body2
```

Note: the `jss1797`/`jss1798` class names are generated at runtime by JSS and can change between builds or sessions; the stable `MuiTypography-body2` portion of the selector is the safer anchor.

### Owner Name Extraction

```javascript
// Text patterns
/Owns\s+(\d+)\s+properties?\s*([A-Z][a-z]+)/i
```

---

## 💡 Agent-Browser Commands to Implement

1. **Authentication**: `state save`, `state load`
2. **Navigation**: `open <url>`
3. **Snapshot**: `snapshot -i` (get refs)
4. **Extraction**: `eval <js_code>`
5. **Wait**: `wait <ms>` or `wait --text <string>`
6. **Screenshots**: `screenshot <path>`
7. **JSON Output**: `--json` flag for machine-readable output

---

## 📊 Data Structure

```json
{
  "scrapeDate": "2026-01-15",
  "searchId": "504a2d13-d88f-4213-9ac6-a7c8bc7c20c6",
  "properties": [
    {
      "propertyId": "...",
      "propertyUrl": "...",
      "address": "...",
      "city": "...",
      "state": "...",
      "zip": "...",
      "squareFootage": "...",
      "propertyType": "...",
      "ownerNames": ["..."],
      "emails": ["..."],
      "phones": ["..."]
    }
  ]
}
```
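
A small shape check (field list taken directly from the structure above) could guard against partially scraped records before they are written out:

```javascript
// Sketch: validate one property record against the structure above.
function isValidProperty(p) {
  const stringFields = ['propertyId', 'propertyUrl', 'address', 'city',
                        'state', 'zip', 'squareFootage', 'propertyType'];
  const arrayFields = ['ownerNames', 'emails', 'phones'];
  return stringFields.every((k) => typeof p[k] === 'string') &&
         arrayFields.every((k) => Array.isArray(p[k]));
}
```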

---

## 🔍 Verification Steps

Before creating the script:

1. **Test agent-browser** with the Reonomy login
2. **Snapshot search results** to verify property IDs appear
3. **Snapshot the ownership page** to verify DOM structure
4. **Test your CSS selector**: `p.MuiTypography-root.jss1797.jss1798.MuiTypography-body2`
5. **Test email extraction**: mailto links + text regex
6. **Test owner name extraction**: regex patterns

---

## 💛 Implementation Questions

1. **Should I create the agent-browser script now?**
   - Implement the workflow above
   - Add ref-based navigation
   - Implement state save/load
   - Add dual-tab extraction (Building and Lot + Owner)
   - Use your CSS selector for phones

2. **Or should I wait for your manual verification?**
   - You can test agent-browser manually with your search ID
   - Share snapshot results so I can see the actual DOM structure
   - Verify the CSS selector works for phones

3. **Any other requirements?**
   - Google Sheets export via gog?
   - CSV export format?
   - Parallel scraping for multiple locations?

---

## 🚀 Benefits of Agent-Browser Approach

| Benefit | Description |
|---------|-------------|
| ✅ **Ref-based navigation** | Snapshot once, use deterministic refs |
| ✅ **State persistence** | Login once, skip auth on future runs |
| ✅ **Semantic locators** | Find by role/text/label, not brittle CSS selectors |
| ✅ **Playwright engine** | More stable than Puppeteer for SPAs |
| ✅ **Rust CLI speed** | Faster command execution |
| ✅ **JSON output** | Machine-readable for parsing |
| ✅ **Parallel sessions** | Run multiple scrapers at once |

---

**Ready to implement when you confirm!** 💛