
# Reonomy Scraper Update - Contact Extraction

## Summary

The Reonomy scraper now extracts email addresses and phone numbers from property and owner detail pages. Previously, it read only the dashboard/search results page, which left the email and phone fields empty.

## Changes Made

### 1. New Functions Added

#### `extractPropertyContactInfo(page, propertyUrl)`
- Visits each property detail page
- Extracts email and phone numbers using multiple selector strategies
- Uses regex fallback to find contact info in page text
- Returns a contact info object with the fields `email`, `phone`, `ownerName`, `propertyAddress`, `propertyType`, and `squareFootage`

#### `extractOwnerContactInfo(page, ownerUrl)`
- Visits each owner detail page
- Extracts email and phone numbers using multiple selector strategies
- Uses regex fallback to find contact info in page text
- Returns a contact info object with the fields `email`, `phone`, `ownerName`, `ownerLocation`, and `propertyCount`
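
Both extraction functions try several selectors before falling back to regex. A minimal sketch of that "first selector that yields text wins" step is below. It assumes a Puppeteer-style `page` object whose `$eval(selector, fn)` rejects when nothing matches. The helper name `firstMatchingText` and the exact selector list are illustrative and may differ from what the script uses.

```javascript
// Selectors named in this document; the real list may be longer.
const EMAIL_SELECTORS = ['a[href^="mailto:"]', ".email", '[data-test*="email"]'];

// Assumption: `page` is a Puppeteer-style object whose $eval(selector, fn)
// rejects when no element matches the selector.
async function firstMatchingText(page, selectors) {
  for (const selector of selectors) {
    try {
      const text = await page.$eval(selector, (el) => (el.textContent || "").trim());
      if (text) return text;
    } catch {
      // No element for this selector; try the next one.
    }
  }
  return ""; // An empty result tells the caller to fall back to regex.
}
```

An empty string rather than `null` keeps the result compatible with the empty-field check used in the Verification section.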

#### `extractLinksFromPage(page)`
- Finds all property and owner links on the current page
- Extracts IDs from URLs and reconstructs full Reonomy URLs
- Removes duplicate URLs
- Returns arrays of property URLs and owner URLs
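
The ID extraction and deduplication steps can be sketched as a pure function. This version starts from an array of `href` strings rather than a live page. The name `buildDetailUrls`, the ID character set, and the `/property/<id>` and `/person/<id>` URL shapes are assumptions inferred from the log output shown later in this document.

```javascript
// Base URL taken from the example log output in this document.
const BASE_URL = "https://app.reonomy.com/#!";

// Hypothetical helper: turns raw hrefs into deduplicated detail-page URLs.
function buildDetailUrls(hrefs) {
  const propertyUrls = new Set();
  const ownerUrls = new Set();

  for (const href of hrefs) {
    // Assumed URL shapes: .../property/<id> and .../person/<id>
    const property = href.match(/\/property\/([A-Za-z0-9-]+)/);
    const owner = href.match(/\/person\/([A-Za-z0-9-]+)/);
    if (property) {
      propertyUrls.add(`${BASE_URL}/property/${property[1]}`);
    } else if (owner) {
      ownerUrls.add(`${BASE_URL}/person/${owner[1]}`);
    }
  }

  // Sets drop duplicates while keeping first-seen order.
  return { propertyUrls: [...propertyUrls], ownerUrls: [...ownerUrls] };
}
```

Rebuilding each URL from its ID means relative links and links with query strings collapse to the same canonical form, which is what makes the deduplication effective.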

### 2. Configuration Options Added

- `MAX_PROPERTIES = 20` - Limits number of properties to scrape (rate limiting)
- `MAX_OWNERS = 20` - Limits number of owners to scrape (rate limiting)
- `PAGE_DELAY_MS = 3000` - Delay between page visits (3 seconds) to avoid rate limiting

### 3. Updated Main Scraper Logic

The scraper now:
1. Logs in to Reonomy
2. Performs a search
3. Extracts all property and owner links from the results page
4. **NEW**: Visits each property page (up to `MAX_PROPERTIES`) to extract contact info
5. **NEW**: Visits each owner page (up to `MAX_OWNERS`) to extract contact info
6. Saves leads with populated email and phone fields

### 4. Enhanced Extraction Methods

For email detection:
- Multiple CSS selectors (`a[href^="mailto:"]`, `.email`, `[data-test*="email"]`, etc.)
- Regex patterns for email addresses
- Falls back to page text analysis

For phone detection:
- Multiple CSS selectors (`a[href^="tel:"]`, `.phone`, `[data-test*="phone"]`, etc.)
- Multiple regex patterns for US phone numbers
- Falls back to page text analysis
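
As an illustration of the page-text fallback, the sketch below runs one email pattern and one US phone pattern over a block of text. The patterns and the function name `findContactInText` are assumptions for this example; the script itself uses several patterns and they may differ.

```javascript
// Deliberately simple email pattern; not a full RFC 5322 validator.
const EMAIL_RE = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/;

// Covers forms such as (555) 123-4567, 555-123-4567, 555.123.4567, +1 555 123 4567.
const PHONE_RE = /(?:\+?1[\s.-]?)?(?:\(\d{3}\)|\d{3})[\s.-]?\d{3}[\s.-]?\d{4}/;

// Hypothetical helper: returns the first email and phone found in `text`,
// or empty strings when nothing matches.
function findContactInText(text) {
  const email = text.match(EMAIL_RE);
  const phone = text.match(PHONE_RE);
  return {
    email: email ? email[0] : "",
    phone: phone ? phone[0] : "",
  };
}
```

A text-wide regex can pick up unrelated numbers or addresses (for example, a support line in the page footer), so results from this fallback are worth spot-checking.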

## Rate Limiting

The scraper now includes rate limiting to avoid being blocked:
- 3-second delay between page visits (`PAGE_DELAY_MS`)
- 0.5-second delay between saving each record
- Limits on total properties/owners scraped
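
The delay-and-limit behavior can be sketched as a small loop. `MAX_PROPERTIES` and `PAGE_DELAY_MS` carry the defaults listed above. `sleep`, `visitWithDelay`, and the `visit` callback are hypothetical names standing in for whatever the script actually does per page.

```javascript
const MAX_PROPERTIES = 20;
const PAGE_DELAY_MS = 3000;

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// `visit` is a hypothetical callback, e.g. (url) => extractPropertyContactInfo(page, url).
async function visitWithDelay(urls, visit, { limit = MAX_PROPERTIES, delayMs = PAGE_DELAY_MS } = {}) {
  const results = [];
  const batch = urls.slice(0, limit); // enforce the cap before visiting anything

  for (const [index, url] of batch.entries()) {
    try {
      results.push(await visit(url));
    } catch (err) {
      // One failing page should not abort the whole run.
      console.error(`[${index + 1}/${batch.length}] failed: ${err.message}`);
    }
    // Pause between visits, but not after the last one.
    if (index < batch.length - 1) await sleep(delayMs);
  }
  return results;
}
```

Visiting pages sequentially rather than in parallel is what makes the delay meaningful: at most one request is in flight every `delayMs` milliseconds.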

## Testing Instructions

### Option 1: Using the wrapper script with 1Password

```bash
cd /Users/jakeshore/.clawdbot/workspace
./scrape-reonomy.sh --1password --location "New York, NY"
```

### Option 2: Using the wrapper script with manual credentials

```bash
cd /Users/jakeshore/.clawdbot/workspace
./scrape-reonomy.sh --location "New York, NY"
```
You'll be prompted for your email and password.

### Option 3: Direct execution with environment variables

```bash
cd /Users/jakeshore/.clawdbot/workspace
export REONOMY_EMAIL="your@email.com"
export REONOMY_PASSWORD="yourpassword"
export REONOMY_LOCATION="New York, NY"
node reonomy-scraper.js
```

### Option 4: Run in headless mode

```bash
HEADLESS=true REONOMY_EMAIL="your@email.com" REONOMY_PASSWORD="yourpassword" node reonomy-scraper.js
```

### Option 5: Save to JSON file (no Google Sheets)

```bash
REONOMY_EMAIL="your@email.com" REONOMY_PASSWORD="yourpassword" node reonomy-scraper.js
```
No extra flag is needed: if the `gog` CLI is not set up, the scraper saves the leads to `reonomy-leads.json` instead of Google Sheets.

### Option 6: Use existing Google Sheet

```bash
REONOMY_EMAIL="your@email.com" REONOMY_PASSWORD="yourpassword" REONOMY_SHEET_ID="your-sheet-id" node reonomy-scraper.js
```

## Expected Output

After running the scraper, you should see logs like:

```
[1/10]
  🏠 Visiting property: https://app.reonomy.com/#!/property/xxx-xxx-xxx
    📧 Email: owner@example.com
    📞 Phone: (555) 123-4567

[2/10]
  🏠 Visiting property: https://app.reonomy.com/#!/property/yyy-yyy-yyy
    📧 Email: Not found
    📞 Phone: Not found

[1/5]
  👤 Visiting owner: https://app.reonomy.com/#!/person/zzz-zzz-zzz
    📧 Email: another@example.com
    📞 Phone: (555) 987-6543
```

The final `reonomy-leads.json` or Google Sheet should have populated `email` and `phone` fields.

## Verification

After scraping, check the output:

### If using JSON:
```bash
jq '.leads[] | select(.email != "" or .phone != "")' reonomy-leads.json
```
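
If `jq` is not installed, a similar check can be done in Node. This is a sketch that assumes the same file shape the `jq` filter implies: a top-level `leads` array whose items have `email` and `phone` fields. The function name `summarizeLeads` is illustrative.

```javascript
// Hypothetical helper: counts leads that have an email or a phone number.
function summarizeLeads(data) {
  const leads = Array.isArray(data.leads) ? data.leads : [];
  // Missing, null, and empty-string fields all count as "no contact info".
  const withContact = leads.filter((lead) => Boolean(lead.email) || Boolean(lead.phone));
  return { total: leads.length, withContact: withContact.length };
}
```

To run it against the output file, pass in the parsed JSON, for example `summarizeLeads(JSON.parse(require("fs").readFileSync("reonomy-leads.json", "utf8")))`.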

### If using Google Sheets:
Open the sheet at `https://docs.google.com/spreadsheets/d/{sheet-id}` and verify the Email and Phone columns are populated.

## Troubleshooting

### "No leads extracted"
- The page structure may have changed
- Check the screenshot saved at `/tmp/reonomy-no-leads.png`
- Review the log file at `reonomy-scraper.log`

### "Email/Phone not found"
- Not all properties/owners have contact information
- Reonomy may not display contact info for certain records
- The information may be behind a paywall or require higher access

### Rate limiting errors
- Increase `PAGE_DELAY_MS` in the script (default is 3000ms)
- Decrease `MAX_PROPERTIES` and `MAX_OWNERS` (default is 20 each)
- Run the scraper in smaller batches

## Key Features of the Updated Scraper

1. **Deep extraction**: Visits each detail page to find contact info
2. **Multiple fallback strategies**: Tries multiple selectors and regex patterns
3. **Rate limiting**: Built-in delays to avoid blocking
4. **Configurable limits**: Can adjust number of properties/owners to scrape
5. **Detailed logging**: Shows progress for each page visited
6. **Error handling**: Continues even if individual page extraction fails

## Next Steps

1. Test the scraper with your credentials
2. Verify email and phone fields are populated
3. Adjust limits (`MAX_PROPERTIES`, `MAX_OWNERS`) and delays (`PAGE_DELAY_MS`) as needed
4. Review the extracted data quality and refine extraction patterns if needed