Reonomy Scraper Update - Contact Extraction

Summary

The Reonomy scraper has been updated to extract email addresses and phone numbers from property and owner detail pages. Previously, it read only the dashboard/search results page, which left the email and phone fields empty.

Changes Made

1. New Functions Added

extractPropertyContactInfo(page, propertyUrl)

  • Visits each property detail page
  • Extracts email and phone numbers using multiple selector strategies
  • Uses regex fallback to find contact info in page text
  • Returns a contact info object with: email, phone, ownerName, propertyAddress, propertyType, squareFootage

extractOwnerContactInfo(page, ownerUrl)

  • Visits each owner detail page
  • Extracts email and phone numbers using multiple selector strategies
  • Uses regex fallback to find contact info in page text
  • Returns a contact info object with: email, phone, ownerName, ownerLocation, propertyCount

extractLinksFromPage(page)

  • Finds all property and owner links on the current page
  • Extracts IDs from URLs and reconstructs full Reonomy URLs
  • Removes duplicate URLs
  • Returns arrays of property URLs and owner URLs
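The ID extraction and de-duplication described for extractLinksFromPage could look roughly like the sketch below. It is not the scraper's actual code: the helper name buildDetailUrls, the BASE_URL constant, and the ID character class are assumptions, and the /property/ and /person/ URL shapes are inferred from the example logs later in this document.

```javascript
// Hypothetical helper, not the scraper's real implementation.
// BASE_URL and the URL shapes are inferred from the example log output.
const BASE_URL = 'https://app.reonomy.com/#!';

function buildDetailUrls(hrefs) {
  // Sets drop duplicate URLs while preserving first-seen order.
  const propertyUrls = new Set();
  const ownerUrls = new Set();

  for (const href of hrefs) {
    // Assumption: IDs consist of letters, digits, and hyphens.
    const property = href.match(/\/property\/([A-Za-z0-9-]+)/);
    const person = href.match(/\/person\/([A-Za-z0-9-]+)/);

    // Rebuild a canonical URL from the ID so query strings and
    // relative paths collapse to the same entry.
    if (property) propertyUrls.add(`${BASE_URL}/property/${property[1]}`);
    if (person) ownerUrls.add(`${BASE_URL}/person/${person[1]}`);
  }

  return { propertyUrls: [...propertyUrls], ownerUrls: [...ownerUrls] };
}
```

In a browser-automation script, the hrefs array would typically come from collecting the href attribute of every anchor on the results page before passing it to a helper like this one.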

2. Configuration Options Added

  • MAX_PROPERTIES = 20 - Limits number of properties to scrape (rate limiting)
  • MAX_OWNERS = 20 - Limits number of owners to scrape (rate limiting)
  • PAGE_DELAY_MS = 3000 - Delay between page visits (3 seconds) to avoid rate limiting

3. Updated Main Scraper Logic

The scraper now:

  1. Logs in to Reonomy
  2. Performs a search
  3. Extracts all property and owner links from the results page
  4. NEW: Visits each property page (up to MAX_PROPERTIES) to extract contact info
  5. NEW: Visits each owner page (up to MAX_OWNERS) to extract contact info
  6. Saves leads with populated email and phone fields

4. Enhanced Extraction Methods

For email detection:

  • Multiple CSS selectors (a[href^="mailto:"], .email, [data-test*="email"], etc.)
  • Regex patterns for email addresses
  • Falls back to page text analysis

For phone detection:

  • Multiple CSS selectors (a[href^="tel:"], .phone, [data-test*="phone"], etc.)
  • Multiple regex patterns for US phone numbers
  • Falls back to page text analysis
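As an illustration of the regex fallback, the sketch below searches plain page text for the first email address and the first US-style phone number. The patterns and the function name findContactInText are assumptions chosen for this example; the scraper's own patterns may differ.

```javascript
// Illustrative patterns only; the real scraper may use different ones.
const EMAIL_RE = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/;
// Optional +1 prefix, optional parentheses around the area code,
// and space, dot, or hyphen separators.
const US_PHONE_RE = /(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}/;

// Hypothetical helper: returns the first match of each kind,
// or an empty string when nothing is found.
function findContactInText(text) {
  const email = text.match(EMAIL_RE);
  const phone = text.match(US_PHONE_RE);
  return {
    email: email ? email[0] : '',
    phone: phone ? phone[0] : '',
  };
}
```

Returning empty strings for missing values keeps the result compatible with a check such as `.email != ""` when the leads are inspected later.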

Rate Limiting

The scraper now includes rate limiting to avoid being blocked:

  • 3-second delay between page visits (PAGE_DELAY_MS)
  • 0.5-second delay between record saves
  • Limits on total properties/owners scraped
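A minimal sketch of how the page limit and the delay between visits might fit together is shown below. The function visitWithLimits and its visit callback are hypothetical; only the constant names and default values come from this document. The loop also continues past individual failures, which matches the error-handling behavior listed under Key Features.

```javascript
// Values taken from the configuration section of this document.
const MAX_PROPERTIES = 20;
const PAGE_DELAY_MS = 3000;

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: `visit` is any async function that takes a URL and
// resolves to a contact-info object (for example, a wrapper around
// extractPropertyContactInfo).
async function visitWithLimits(urls, visit, options = {}) {
  const { limit = MAX_PROPERTIES, delayMs = PAGE_DELAY_MS } = options;
  const batch = urls.slice(0, limit); // enforce the cap up front
  const results = [];

  for (let i = 0; i < batch.length; i++) {
    try {
      results.push(await visit(batch[i]));
    } catch (err) {
      // A single failed page should not abort the whole run.
      console.error(`[${i + 1}/${batch.length}] failed: ${err.message}`);
    }
    // Pause between visits, but not after the last one.
    if (i < batch.length - 1) await sleep(delayMs);
  }

  return results;
}
```

With the defaults, a full batch of 20 pages spends about 57 seconds in delays alone (19 pauses of 3 seconds), so raising PAGE_DELAY_MS lengthens the run proportionally.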

Testing Instructions

Option 1: Using the wrapper script with 1Password

cd /Users/jakeshore/.clawdbot/workspace
./scrape-reonomy.sh --1password --location "New York, NY"

Option 2: Using the wrapper script with manual credentials

cd /Users/jakeshore/.clawdbot/workspace
./scrape-reonomy.sh --location "New York, NY"

You'll be prompted for your email and password.

Option 3: Direct execution with environment variables

cd /Users/jakeshore/.clawdbot/workspace
export REONOMY_EMAIL="your@email.com"
export REONOMY_PASSWORD="yourpassword"
export REONOMY_LOCATION="New York, NY"
node reonomy-scraper.js

Option 4: Run in headless mode

HEADLESS=true REONOMY_EMAIL="your@email.com" REONOMY_PASSWORD="yourpassword" node reonomy-scraper.js

Option 5: Save to JSON file (no Google Sheets)

REONOMY_EMAIL="your@email.com" REONOMY_PASSWORD="yourpassword" node reonomy-scraper.js

If the gog CLI is not set up, the scraper saves to reonomy-leads.json instead.

Option 6: Use existing Google Sheet

REONOMY_EMAIL="your@email.com" REONOMY_PASSWORD="yourpassword" REONOMY_SHEET_ID="your-sheet-id" node reonomy-scraper.js

Expected Output

After running the scraper, you should see logs like:

[1/10]
  🏠 Visiting property: https://app.reonomy.com/#!/property/xxx-xxx-xxx
    📧 Email: owner@example.com
    📞 Phone: (555) 123-4567

[2/10]
  🏠 Visiting property: https://app.reonomy.com/#!/property/yyy-yyy-yyy
    📧 Email: Not found
    📞 Phone: Not found

[1/5]
  👤 Visiting owner: https://app.reonomy.com/#!/person/zzz-zzz-zzz
    📧 Email: another@example.com
    📞 Phone: (555) 987-6543

The final reonomy-leads.json or Google Sheet should have populated email and phone fields.

Verification

After scraping, check the output:

If using JSON:

jq '.leads[] | select(.email != "" or .phone != "")' reonomy-leads.json
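If jq is not installed, a small Node.js check can give the same kind of signal. The sketch below assumes the file has a top-level leads array whose entries carry email and phone fields, as the jq filter above implies; the function name summarizeLeads is hypothetical.

```javascript
// Hypothetical helper: counts how many leads have an email or a phone.
// Assumes the shape { leads: [{ email, phone, ... }] } implied by the jq filter.
function summarizeLeads(data) {
  const leads = Array.isArray(data.leads) ? data.leads : [];
  const withContact = leads.filter(
    (lead) => Boolean(lead.email) || Boolean(lead.phone)
  );
  return { total: leads.length, withContact: withContact.length };
}

// Example usage against the output file:
//   const fs = require('node:fs');
//   const data = JSON.parse(fs.readFileSync('reonomy-leads.json', 'utf8'));
//   console.log(summarizeLeads(data));
```

A withContact count of zero after a run suggests the detail-page extraction did not find anything, which is the case the Troubleshooting section covers.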

If using Google Sheets:

Open the sheet at https://docs.google.com/spreadsheets/d/{sheet-id} and verify the Email and Phone columns are populated.

Troubleshooting

"No leads extracted"

  • The page structure may have changed
  • Check the screenshot saved at /tmp/reonomy-no-leads.png
  • Review the log file at reonomy-scraper.log

"Email/Phone not found"

  • Not all properties/owners have contact information
  • Reonomy may not display contact info for certain records
  • The information may be behind a paywall or require higher access

Rate limiting errors

  • Increase PAGE_DELAY_MS in the script (default is 3000ms)
  • Decrease MAX_PROPERTIES and MAX_OWNERS (default is 20 each)
  • Run the scraper in smaller batches

Key Features of the Updated Scraper

  1. Deep extraction: Visits each detail page to find contact info
  2. Multiple fallback strategies: Tries multiple selectors and regex patterns
  3. Rate limiting: Built-in delays to avoid blocking
  4. Configurable limits: Can adjust number of properties/owners to scrape
  5. Detailed logging: Shows progress for each page visited
  6. Error handling: Continues even if individual page extraction fails

Next Steps

  1. Test the scraper with your credentials
  2. Verify email and phone fields are populated
  3. Adjust limits (MAX_PROPERTIES, MAX_OWNERS) and delays (PAGE_DELAY_MS) as needed
  4. Review the extracted data quality and refine extraction patterns if needed