# Reonomy Scraper Update - Contact Extraction
## Summary
The Reonomy scraper now extracts email addresses and phone numbers from the property and owner detail pages. Previously it read only the dashboard/search results page, which left the email and phone fields empty.
## Changes Made
### 1. New Functions Added
#### `extractPropertyContactInfo(page, propertyUrl)`
- Visits each property detail page
- Extracts email and phone numbers using multiple selector strategies
- Uses regex fallback to find contact info in page text
- Returns a contact info object with: email, phone, ownerName, propertyAddress, propertyType, squareFootage
#### `extractOwnerContactInfo(page, ownerUrl)`
- Visits each owner detail page
- Extracts email and phone numbers using multiple selector strategies
- Uses regex fallback to find contact info in page text
- Returns a contact info object with: email, phone, ownerName, ownerLocation, propertyCount
#### `extractLinksFromPage(page)`
- Finds all property and owner links on the current page
- Extracts IDs from URLs and reconstructs full Reonomy URLs
- Removes duplicate URLs
- Returns arrays of property URLs and owner URLs
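The ID extraction, URL reconstruction, and de-duplication steps can be sketched as a pure function. This is a minimal sketch, not the scraper's actual code: `buildLinks` is a hypothetical name, it takes an array of href strings rather than a live `page`, the ID pattern (letters, digits, hyphens) is an assumption, and the base URL is taken from the sample log output later in this document.

```javascript
// Hypothetical helper: the real extractLinksFromPage(page) reads hrefs from the
// live page; this sketch starts from an array of href strings instead.
const BASE_URL = "https://app.reonomy.com/#!"; // as seen in the sample logs

function buildLinks(hrefs) {
  const properties = new Set();
  const owners = new Set();

  for (const href of hrefs) {
    // Capture the record type and the ID segment that follows it.
    // Assumption: IDs consist of letters, digits, and hyphens.
    const match = /\/(property|person)\/([A-Za-z0-9-]+)/.exec(href);
    if (!match) continue;

    const [, kind, id] = match;
    const url = `${BASE_URL}/${kind}/${id}`;
    (kind === "property" ? properties : owners).add(url);
  }

  // Sets drop duplicates while keeping first-seen order.
  return { propertyUrls: [...properties], ownerUrls: [...owners] };
}
```

Rebuilding each URL from its ID means that two links to the same record with different query strings or prefixes collapse into one entry.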
### 2. Configuration Options Added
- `MAX_PROPERTIES = 20` - Limits number of properties to scrape (rate limiting)
- `MAX_OWNERS = 20` - Limits number of owners to scrape (rate limiting)
- `PAGE_DELAY_MS = 3000` - Delay between page visits (3 seconds) to avoid rate limiting
### 3. Updated Main Scraper Logic
The scraper now:
1. Logs in to Reonomy
2. Performs a search
3. Extracts all property and owner links from the results page
4. **NEW**: Visits each property page (up to MAX_PROPERTIES) to extract contact info
5. **NEW**: Visits each owner page (up to MAX_OWNERS) to extract contact info
6. Saves leads with populated email and phone fields
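Steps 4 and 5 share the same shape: take at most N URLs, visit each one, and keep going if a single page fails. The sketch below illustrates that loop under stated assumptions: `visitAll` is a hypothetical name, `extract` stands in for `extractPropertyContactInfo` or `extractOwnerContactInfo`, and the log format only approximates the real output.

```javascript
// Sketch only: `extract` stands in for extractPropertyContactInfo or
// extractOwnerContactInfo, and `page` for the browser page object.
async function visitAll(page, urls, limit, extract) {
  const results = [];
  const batch = urls.slice(0, limit); // enforce MAX_PROPERTIES / MAX_OWNERS

  for (const [index, url] of batch.entries()) {
    console.log(`[${index + 1}/${batch.length}] ${url}`);
    try {
      const info = await extract(page, url);
      results.push({ url, ...info });
    } catch (err) {
      // A single failing page should not abort the whole run.
      console.warn(`Skipping ${url}: ${err.message}`);
    }
  }
  return results;
}
```

A call such as `visitAll(page, propertyUrls, MAX_PROPERTIES, extractPropertyContactInfo)` would then return one record per page that was read successfully.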
### 4. Enhanced Extraction Methods
For email detection:
- Multiple CSS selectors (`a[href^="mailto:"]`, `.email`, `[data-test*="email"]`, etc.)
- Regex patterns for email addresses
- Falls back to page text analysis
For phone detection:
- Multiple CSS selectors (`a[href^="tel:"]`, `.phone`, `[data-test*="phone"]`, etc.)
- Multiple regex patterns for US phone numbers
- Falls back to page text analysis
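The selector-first, regex-second order described above might look like the following. This is an illustrative sketch rather than the scraper's actual patterns: `extractContact`, `EMAIL_RE`, and `PHONE_RE` are hypothetical names, and the regular expressions cover only common email and US phone layouts.

```javascript
// Illustrative patterns; the scraper's real expressions may differ.
const EMAIL_RE = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/;
// Matches layouts such as (555) 123-4567, 555-123-4567, +1 555.123.4567.
const PHONE_RE = /(?:\+?1[\s.-]?)?(?:\(\d{3}\)|\d{3})[\s.-]?\d{3}[\s.-]?\d{4}/;

// selectorHits: values already read via CSS selectors, e.g. the href of an
// a[href^="mailto:"] or a[href^="tel:"] element. pageText: the visible page text.
function extractContact(selectorHits, pageText) {
  // Remove a leading "mailto:" or "tel:" scheme from an href value.
  const strip = (value, scheme) =>
    value ? value.replace(new RegExp(`^${scheme}:`), "").trim() : "";
  // Return the first regex match in the page text, or an empty string.
  const firstMatch = (re) => (pageText.match(re) || [""])[0];

  // Prefer the selector result; fall back to scanning the page text.
  const email = strip(selectorHits.email, "mailto") || firstMatch(EMAIL_RE);
  const phone = strip(selectorHits.phone, "tel") || firstMatch(PHONE_RE);
  return { email, phone };
}
```

Selector hits are preferred because they point at elements the page itself marks as contact data, while a regex over the full text can pick up unrelated numbers or addresses.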
## Rate Limiting
The scraper now includes rate limiting to avoid being blocked:
- 3-second delay between page visits (`PAGE_DELAY_MS`)
- 0.5-second delay between record saves
- Caps on the number of properties and owners visited (`MAX_PROPERTIES`, `MAX_OWNERS`)
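A fixed delay between page visits can be implemented with a small promise-based helper. The sketch below assumes nothing about the scraper beyond the `PAGE_DELAY_MS` value quoted above; `sleep` and `throttled` are hypothetical names.

```javascript
const PAGE_DELAY_MS = 3000; // value quoted in the configuration section

// Resolve after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Run `task` for each item in order, pausing `delayMs` between consecutive
// items. There is no pause before the first item or after the last one.
async function throttled(items, task, delayMs = PAGE_DELAY_MS) {
  const out = [];
  for (let i = 0; i < items.length; i++) {
    if (i > 0) await sleep(delayMs);
    out.push(await task(items[i]));
  }
  return out;
}
```

With the defaults, 20 pages take at least 19 × 3 s = 57 s of waiting on top of page load time, which is worth keeping in mind when raising `MAX_PROPERTIES` or `MAX_OWNERS`.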
## Testing Instructions
### Option 1: Using the wrapper script with 1Password
```bash
cd /Users/jakeshore/.clawdbot/workspace
./scrape-reonomy.sh --1password --location "New York, NY"
```
### Option 2: Using the wrapper script with manual credentials
```bash
cd /Users/jakeshore/.clawdbot/workspace
./scrape-reonomy.sh --location "New York, NY"
```
You'll be prompted for your email and password.
### Option 3: Direct execution with environment variables
```bash
cd /Users/jakeshore/.clawdbot/workspace
export REONOMY_EMAIL="your@email.com"
export REONOMY_PASSWORD="yourpassword"
export REONOMY_LOCATION="New York, NY"
node reonomy-scraper.js
```
### Option 4: Run in headless mode
```bash
HEADLESS=true REONOMY_EMAIL="your@email.com" REONOMY_PASSWORD="yourpassword" node reonomy-scraper.js
```
### Option 5: Save to JSON file (no Google Sheets)
```bash
REONOMY_EMAIL="your@email.com" REONOMY_PASSWORD="yourpassword" node reonomy-scraper.js
```
No extra flag is needed: if the `gog` CLI is not set up, the scraper saves the leads to `reonomy-leads.json` instead of a Google Sheet.
### Option 6: Use existing Google Sheet
```bash
REONOMY_EMAIL="your@email.com" REONOMY_PASSWORD="yourpassword" REONOMY_SHEET_ID="your-sheet-id" node reonomy-scraper.js
```
## Expected Output
After running the scraper, you should see logs like:
```
[1/10]
🏠 Visiting property: https://app.reonomy.com/#!/property/xxx-xxx-xxx
📧 Email: owner@example.com
📞 Phone: (555) 123-4567
[2/10]
🏠 Visiting property: https://app.reonomy.com/#!/property/yyy-yyy-yyy
📧 Email: Not found
📞 Phone: Not found
[1/5]
👤 Visiting owner: https://app.reonomy.com/#!/person/zzz-zzz-zzz
📧 Email: another@example.com
📞 Phone: (555) 987-6543
```
The final `reonomy-leads.json` or Google Sheet should have populated `email` and `phone` fields.
## Verification
After scraping, check the output:
### If using JSON:
```bash
jq '.leads[] | select(.email != "" or .phone != "")' reonomy-leads.json
```
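For a quick count rather than a full listing, the same check can be done in Node. This is a sketch that assumes the file has the shape `{ "leads": [ { "email": ..., "phone": ... } ] }` implied by the jq filter above; `summarizeLeads` is a hypothetical helper, and it treats missing or `null` fields as empty.

```javascript
// Assumes { "leads": [ { "email": "...", "phone": "..." }, ... ] },
// as implied by the jq filter above.
function summarizeLeads(data) {
  const leads = Array.isArray(data.leads) ? data.leads : [];
  // A field counts as populated only if it is a non-blank string.
  const filled = (value) => typeof value === "string" && value.trim() !== "";

  return {
    total: leads.length,
    withEmail: leads.filter((lead) => filled(lead.email)).length,
    withPhone: leads.filter((lead) => filled(lead.phone)).length,
    withEither: leads.filter((lead) => filled(lead.email) || filled(lead.phone)).length,
  };
}

// Usage (hypothetical):
// const fs = require("node:fs");
// const data = JSON.parse(fs.readFileSync("reonomy-leads.json", "utf8"));
// console.log(summarizeLeads(data));
```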
### If using Google Sheets:
Open the sheet at `https://docs.google.com/spreadsheets/d/{sheet-id}` and verify the Email and Phone columns are populated.
## Troubleshooting
### "No leads extracted"
- The page structure may have changed
- Check the screenshot saved at `/tmp/reonomy-no-leads.png`
- Review the log file at `reonomy-scraper.log`
### "Email/Phone not found"
- Not all properties/owners have contact information
- Reonomy may not display contact info for certain records
- The information may be behind a paywall or require higher access
### Rate limiting errors
- Increase `PAGE_DELAY_MS` in the script (default is 3000ms)
- Decrease `MAX_PROPERTIES` and `MAX_OWNERS` (default is 20 each)
- Run the scraper in smaller batches
## Key Features of the Updated Scraper
1. **Deep extraction**: Visits each detail page to find contact info
2. **Multiple fallback strategies**: Tries multiple selectors and regex patterns
3. **Rate limiting**: Built-in delays to avoid blocking
4. **Configurable limits**: Can adjust number of properties/owners to scrape
5. **Detailed logging**: Shows progress for each page visited
6. **Error handling**: Continues even if individual page extraction fails
## Next Steps
1. Test the scraper with your credentials
2. Verify email and phone fields are populated
3. Adjust limits (`MAX_PROPERTIES`, `MAX_OWNERS`) and delays (`PAGE_DELAY_MS`) as needed
4. Review the extracted data quality and refine extraction patterns if needed