# Reonomy Scraper Update - Contact Extraction

## Summary

The Reonomy scraper now extracts email addresses and phone numbers from property and owner detail pages. Previously, it only read the dashboard/search results page, which left the email and phone fields empty.

## Changes Made

### 1. New Functions Added

#### `extractPropertyContactInfo(page, propertyUrl)`
- Visits the property detail page at `propertyUrl`
- Extracts email addresses and phone numbers using multiple selector strategies
- Falls back to regex matching on the page text
- Returns a contact info object with: `email`, `phone`, `ownerName`, `propertyAddress`, `propertyType`, `squareFootage`

#### `extractOwnerContactInfo(page, ownerUrl)`
- Visits the owner detail page at `ownerUrl`
- Extracts email addresses and phone numbers using multiple selector strategies
- Falls back to regex matching on the page text
- Returns a contact info object with: `email`, `phone`, `ownerName`, `ownerLocation`, `propertyCount`

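
The "multiple selector strategies" step in both functions can be sketched roughly as follows. This is only an illustration, not the scraper's actual code: `firstMatch` is a hypothetical helper name, and `page` is assumed to expose `$(selector)` resolving to an element handle or `null`, as Puppeteer and Playwright pages do.

```javascript
// Hypothetical helper: try each selector in order and return the first
// non-empty value. Assumes `page.$(selector)` resolves to an element handle
// (with an `evaluate` method) or null.
async function firstMatch(page, selectors) {
  for (const selector of selectors) {
    const handle = await page.$(selector);
    if (!handle) continue;
    // Use a mailto:/tel: href when present, otherwise the visible text.
    const raw = await handle.evaluate((el) => {
      const href = el.getAttribute('href') || '';
      return /^(mailto:|tel:)/i.test(href) ? href : el.textContent || '';
    });
    const value = raw
      .replace(/^(mailto:|tel:)/i, '') // drop the URI scheme
      .replace(/\?.*$/, '')            // drop any ?subject=... query
      .trim();
    if (value) return value;
  }
  return null; // the caller would then fall back to regex matching on page text
}

// Selector lists taken from this document; the real script may use more.
const EMAIL_SELECTORS = ['a[href^="mailto:"]', '.email', '[data-test*="email"]'];
const PHONE_SELECTORS = ['a[href^="tel:"]', '.phone', '[data-test*="phone"]'];
```

Ordering the selectors from most to least specific means a `mailto:` or `tel:` link is preferred over a looser class or attribute match.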
#### `extractLinksFromPage(page)`
- Finds all property and owner links on the current page
- Extracts IDs from URLs and reconstructs full Reonomy URLs
- Removes duplicate URLs
- Returns arrays of property URLs and owner URLs

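
The ID extraction and de-duplication described above might look something like this. It is a sketch under assumptions: the URL shape (`https://app.reonomy.com/#!/property/<id>` and `.../#!/person/<id>`) is inferred from the sample logs in this document, and `normalizeLinks` is a hypothetical name operating on an array of `href` strings already collected from the page.

```javascript
// Assumed base URL and path shape, inferred from the sample log output.
const BASE_URL = 'https://app.reonomy.com/#!';

// Hypothetical helper: turn raw hrefs into unique, fully qualified URLs.
function normalizeLinks(hrefs) {
  const properties = new Set(); // a Set drops duplicates and keeps insertion order
  const owners = new Set();
  for (const href of hrefs) {
    const match = /\/(property|person)\/([A-Za-z0-9-]+)/.exec(href);
    if (!match) continue; // not a property or owner link
    const [, kind, id] = match;
    const url = `${BASE_URL}/${kind}/${id}`;
    (kind === 'property' ? properties : owners).add(url);
  }
  return { propertyUrls: [...properties], ownerUrls: [...owners] };
}
```

Rebuilding each URL from its ID, rather than keeping the raw `href`, makes links that differ only in query string or relative form collapse to one entry.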
### 2. Configuration Options Added

- `MAX_PROPERTIES = 20` - Maximum number of property pages to visit per run
- `MAX_OWNERS = 20` - Maximum number of owner pages to visit per run
- `PAGE_DELAY_MS = 3000` - Delay between page visits (3 seconds) to avoid rate limiting

### 3. Updated Main Scraper Logic

The scraper now:
1. Logs in to Reonomy
2. Performs a search
3. Extracts all property and owner links from the results page
4. **NEW**: Visits each property page (up to `MAX_PROPERTIES`) to extract contact info
5. **NEW**: Visits each owner page (up to `MAX_OWNERS`) to extract contact info
6. Saves leads with populated email and phone fields

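
Steps 4 and 5 amount to a capped, delayed loop over the collected URLs. The sketch below shows one way to write it; `visitAll` and `sleep` are hypothetical names, and the `visit` callback stands in for `extractPropertyContactInfo` or `extractOwnerContactInfo`.

```javascript
const MAX_PROPERTIES = 20;
const PAGE_DELAY_MS = 3000;

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical loop: visit at most `limit` URLs, pausing between visits and
// carrying on when a single page fails.
async function visitAll(urls, limit, delayMs, visit) {
  const results = [];
  const batch = urls.slice(0, limit);
  for (const [index, url] of batch.entries()) {
    console.log(`[${index + 1}/${batch.length}] ${url}`);
    try {
      results.push(await visit(url));
    } catch (err) {
      console.warn(`Skipping ${url}: ${err.message}`);
    }
    // No need to wait after the last page.
    if (index < batch.length - 1) await sleep(delayMs);
  }
  return results;
}

// Possible usage inside the scraper (names assumed from this document):
// const leads = await visitAll(propertyUrls, MAX_PROPERTIES, PAGE_DELAY_MS,
//   (url) => extractPropertyContactInfo(page, url));
```

Catching errors per URL keeps one broken detail page from aborting the whole run.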
### 4. Enhanced Extraction Methods

For email detection:
- Multiple CSS selectors (`a[href^="mailto:"]`, `.email`, `[data-test*="email"]`, etc.)
- Regex patterns for email addresses
- Fallback to scanning the full page text when no selector matches

For phone detection:
- Multiple CSS selectors (`a[href^="tel:"]`, `.phone`, `[data-test*="phone"]`, etc.)
- Multiple regex patterns for US phone numbers
- Fallback to scanning the full page text when no selector matches

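
As an illustration of the regex fallback, the following sketch scans a block of page text for the first email address and the first US-style phone number. The patterns are assumptions chosen for readability, not the ones in the script, and `findContactsInText` is a hypothetical name.

```javascript
// Illustrative patterns only; the script's own patterns may differ.
const EMAIL_RE = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/;
// Covers forms such as (555) 123-4567, 555-123-4567, 555.123.4567, +1 555 123 4567.
const PHONE_RE = /(?:\+?1[\s.-]?)?(?:\(\d{3}\)|\d{3})[\s.-]?\d{3}[\s.-]?\d{4}/;

// Hypothetical helper: return the first email and phone found in `text`,
// or empty strings when nothing matches.
function findContactsInText(text) {
  const email = EMAIL_RE.exec(text);
  const phone = PHONE_RE.exec(text);
  return {
    email: email ? email[0] : '',
    phone: phone ? phone[0] : '',
  };
}
```

A loose phone pattern like this can also match unrelated 10-digit numbers in the page text, which is one reason to try the selectors first.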
## Rate Limiting

The scraper now paces its requests to avoid being blocked:
- 3-second delay between page visits (`PAGE_DELAY_MS`)
- 0.5-second delay between record saves
- Caps on the total number of properties and owners scraped (`MAX_PROPERTIES`, `MAX_OWNERS`)

## Testing Instructions

### Option 1: Using the wrapper script with 1Password

```bash
cd /Users/jakeshore/.clawdbot/workspace
./scrape-reonomy.sh --1password --location "New York, NY"
```

### Option 2: Using the wrapper script with manual credentials

```bash
cd /Users/jakeshore/.clawdbot/workspace
./scrape-reonomy.sh --location "New York, NY"
```
You'll be prompted for your email and password.

### Option 3: Direct execution with environment variables

```bash
cd /Users/jakeshore/.clawdbot/workspace
export REONOMY_EMAIL="your@email.com"
export REONOMY_PASSWORD="yourpassword"
export REONOMY_LOCATION="New York, NY"
node reonomy-scraper.js
```

### Option 4: Run in headless mode

```bash
HEADLESS=true REONOMY_EMAIL="your@email.com" REONOMY_PASSWORD="yourpassword" node reonomy-scraper.js
```

### Option 5: Save to JSON file (no Google Sheets)

```bash
REONOMY_EMAIL="your@email.com" REONOMY_PASSWORD="yourpassword" node reonomy-scraper.js
```
If the `gog` CLI is not set up, the scraper saves leads to `reonomy-leads.json` instead of Google Sheets.

### Option 6: Use existing Google Sheet

```bash
REONOMY_EMAIL="your@email.com" REONOMY_PASSWORD="yourpassword" REONOMY_SHEET_ID="your-sheet-id" node reonomy-scraper.js
```

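
Options 3 through 6 are all driven by environment variables. A minimal sketch of how a script could read them is shown below. The variable names come from the commands above, while `loadConfig`, the returned field names, and the default location are assumptions made for illustration.

```javascript
// Hypothetical config loader. Variable names are from this document; the
// default location and the returned field names are assumptions.
function loadConfig(env = process.env) {
  const { REONOMY_EMAIL, REONOMY_PASSWORD } = env;
  if (!REONOMY_EMAIL || !REONOMY_PASSWORD) {
    throw new Error('REONOMY_EMAIL and REONOMY_PASSWORD must be set');
  }
  return {
    email: REONOMY_EMAIL,
    password: REONOMY_PASSWORD,
    location: env.REONOMY_LOCATION || 'New York, NY', // assumed default
    headless: env.HEADLESS === 'true',                // only the literal "true" enables it
    sheetId: env.REONOMY_SHEET_ID || null,            // null: no existing sheet given
  };
}
```

Failing fast on missing credentials gives a clearer error than a login that times out later.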
## Expected Output

After running the scraper, you should see logs like:

```
[1/10]
🏠 Visiting property: https://app.reonomy.com/#!/property/xxx-xxx-xxx
📧 Email: owner@example.com
📞 Phone: (555) 123-4567

[2/10]
🏠 Visiting property: https://app.reonomy.com/#!/property/yyy-yyy-yyy
📧 Email: Not found
📞 Phone: Not found

[1/5]
👤 Visiting owner: https://app.reonomy.com/#!/person/zzz-zzz-zzz
📧 Email: another@example.com
📞 Phone: (555) 987-6543
```

The final `reonomy-leads.json` or Google Sheet should have populated `email` and `phone` fields.

## Verification

After scraping, check the output:

### If using JSON:
```bash
jq '.leads[] | select(.email != "" or .phone != "")' reonomy-leads.json
```

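
If `jq` is not available, the same check can be done from Node. This is a sketch that assumes the file has the shape `{ "leads": [ { "email": ..., "phone": ... } ] }`, which is inferred from the `jq` filter above; `summarizeLeads` is a hypothetical name.

```javascript
const fs = require('fs');

// Hypothetical helper: count leads that have an email or a phone number.
// Assumes `data.leads` is an array of objects with `email` and `phone` fields.
function summarizeLeads(data) {
  const leads = Array.isArray(data.leads) ? data.leads : [];
  const withContact = leads.filter((lead) => lead.email || lead.phone);
  return { total: leads.length, withContact: withContact.length };
}

// Print a summary when the output file is present in the current directory.
const LEADS_FILE = 'reonomy-leads.json';
if (fs.existsSync(LEADS_FILE)) {
  const data = JSON.parse(fs.readFileSync(LEADS_FILE, 'utf8'));
  console.log(summarizeLeads(data));
}
```

Unlike the `jq` filter, this treats `null` or missing fields the same as empty strings.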
### If using Google Sheets:
Open the sheet at `https://docs.google.com/spreadsheets/d/{sheet-id}` and verify the Email and Phone columns are populated.

## Troubleshooting

### "No leads extracted"
- The page structure may have changed
- Check the screenshot saved at `/tmp/reonomy-no-leads.png`
- Review the log file at `reonomy-scraper.log`

### "Email/Phone not found"
- Not all properties/owners have contact information
- Reonomy may not display contact info for certain records
- The information may be behind a paywall or require higher access

### Rate limiting errors
- Increase `PAGE_DELAY_MS` in the script (default is 3000ms)
- Decrease `MAX_PROPERTIES` and `MAX_OWNERS` (default is 20 each)
- Run the scraper in smaller batches

## Key Features of the Updated Scraper

1. **Deep extraction**: Visits each detail page to find contact info
2. **Multiple fallback strategies**: Tries multiple selectors and regex patterns
3. **Rate limiting**: Built-in delays to avoid blocking
4. **Configurable limits**: Can adjust number of properties/owners to scrape
5. **Detailed logging**: Shows progress for each page visited
6. **Error handling**: Continues even if individual page extraction fails

## Next Steps

1. Test the scraper with your credentials
2. Verify email and phone fields are populated
3. Adjust limits (`MAX_PROPERTIES`, `MAX_OWNERS`) and delays (`PAGE_DELAY_MS`) as needed
4. Review the extracted data quality and refine extraction patterns if needed