Scraper Research: Puppeteer Alternatives

Research Summary

I evaluated several alternatives to Puppeteer for web scraping. Here are my findings:

Top Contender: Playwright

Status: Already installed (v1.57.0)

Key Advantages over Puppeteer:

  1. Built-in Auto-Waiting

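    • No more arbitrary sleep() calls
    • Locator actions such as click() wait for the element to be actionable (visible, stable, and enabled) before acting
    • waitForSelector() waits for an element to reach a given state (attached, visible, hidden, or detached)
    • waitForFunction() waits until a custom in-page condition returns a truthy value
    • waitForResponse() waits for a network response that matches a URL or predicate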
  2. Better Selectors

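    • page.locator() re-resolves the element on every action, so it avoids the stale handles that page.$() can return after a re-render
    • Supports user-facing locators by text and ARIA role (getByText(), getByRole())
    • Locators can be chained and filtered for complex queries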
  3. Multiple Browser Support

    • Chromium (Chrome/Edge)
    • Firefox
    • WebKit (Safari)
    • Can switch between browsers with one line change
  4. Faster & More Reliable

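    • Lightweight browser contexts isolate sessions without launching a new browser for each one
    • Fewer fixed delays, so a run proceeds as soon as the page is ready
    • Auto-waiting makes interaction with dynamic content more stable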
  5. Better Debugging

    • Built-in tracing (context.tracing.start(), context.tracing.stop())
    • Video recording out of the box (recordVideo context option)
    • Screenshot API
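
As a rough illustration of points 3 and 5, the sketch below launches a browser engine chosen by name and records a trace around a task. `runTraced`, its parameters, and the `trace.zip` default are hypothetical names for this example, not part of the existing scraper; the `playwright` module is passed in as an argument so the engine choice stays a single string.

```javascript
// Sketch only: runTraced and its parameter names are illustrative, not project code.
// `playwright` is the object returned by require('playwright').
async function runTraced(playwright, engine, task, tracePath = 'trace.zip') {
  const browserType = playwright[engine]; // 'chromium' | 'firefox' | 'webkit'
  if (!browserType) throw new Error(`Unknown browser engine: ${engine}`);

  const browser = await browserType.launch({ headless: true });
  try {
    const context = await browser.newContext();
    // Tracing lives on the browser context, not on the page.
    await context.tracing.start({ screenshots: true, snapshots: true });
    try {
      const page = await context.newPage();
      return await task(page);
    } finally {
      // Write the trace even if the task throws, so failures can be inspected.
      await context.tracing.stop({ path: tracePath });
    }
  } finally {
    await browser.close();
  }
}

// Usage (assumes Playwright is installed; the URL is a placeholder):
// const playwright = require('playwright');
// runTraced(playwright, 'firefox', async (page) => {
//   await page.goto('https://example.com');
//   return page.title();
// }).then(console.log);
```

The resulting archive can be opened with `npx playwright show-trace trace.zip`. Switching engines means changing only the `engine` string.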

Other Options Considered

| Tool | Status | Verdict |
| --- | --- | --- |
| Selenium | Not installed | Mature but slower, more complex API |
| Cypress | Not installed | Focused on testing, overkill for scraping |
| Cheerio | Available | Fast but no JS execution, so it won't work for Reonomy |
| JSDOM | Available | Similar to Cheerio: not a real browser, and script execution is off by default |
| Puppeteer-Extra | Not installed | Still Puppeteer underneath |
| Zombie.js | Not installed | Less maintained, limited features |

Recommendation: Switch to Playwright

For the Reonomy scraper, Playwright is the clear winner because:

  1. Already installed in the project
  2. No arbitrary sleeps needed for dynamic content
  3. Better handling of the 30-second contact details wait
  4. More reliable element selection
  5. Faster execution

Key Changes in Playwright Version

Puppeteer (Current)

await sleep(8000);  // Arbitrary wait
const element = await page.$('selector');
await element.click();

Playwright (New)

// click() auto-waits for the element to be visible, stable, and enabled,
// so no separate waitForSelector() call is needed
await page.locator('selector').click({ timeout: 30000 });
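
Where a click triggers a network request, the fixed sleep can also be replaced by waiting for the response itself. This is a minimal sketch, assuming the data arrives in a request whose URL contains a known fragment. `clickAndAwaitResponse` and `urlPart` are hypothetical names, and no real Reonomy endpoint is implied.

```javascript
// Sketch only: helper name and parameters are illustrative.
async function clickAndAwaitResponse(page, selector, urlPart, timeoutMs = 30000) {
  // Start listening BEFORE the click so a fast response is not missed.
  const responsePromise = page.waitForResponse(
    (response) => response.url().includes(urlPart) && response.ok(),
    { timeout: timeoutMs }
  );
  try {
    // locator.click() waits for the element to be actionable on its own.
    await page.locator(selector).click();
  } catch (err) {
    // Avoid an unhandled rejection from the pending wait if the click fails.
    responsePromise.catch(() => {});
    throw err;
  }
  const response = await responsePromise;
  return response.json();
}

// Usage (selector and URL fragment are placeholders):
// const data = await clickAndAwaitResponse(page, 'selector', '/some-endpoint');
```

Creating the wait before clicking matters: if the wait were registered after the click, a response that arrives quickly could be missed and the call would run into its timeout.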

Waiting for Contact Details

Puppeteer:

// Manual polling with sleep()
for (let i = 0; i < 30; i++) {
  await sleep(1000);
  const data = await extractOwnerTabData(page);
  if (data.emails.length > 0 || data.phones.length > 0) break;
}

Playwright:

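// Intelligent wait until condition is met
// Note: Playwright's signature is waitForFunction(pageFunction, arg, options),
// so options go third (Puppeteer takes them as the second argument)
await page.waitForFunction(
  () => {
    const emails = document.querySelectorAll('a[href^="mailto:"]');
    const phones = document.querySelectorAll('a[href^="tel:"]');
    return emails.length > 0 || phones.length > 0;
  },
  undefined,
  { timeout: 30000 }
);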
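
The wait and the extraction that follows it could be combined into one helper. The sketch below is an assumption-laden outline, not the scraper's actual code: `waitForContacts` and `parseContactHrefs` are hypothetical names, and it assumes that Playwright timeout errors carry `name === 'TimeoutError'` and that an empty result is an acceptable outcome when no contact links appear in time.

```javascript
// Sketch only: function names are illustrative, not the scraper's actual helpers.

// Pure helper: split raw href values into de-duplicated emails and phone numbers.
function parseContactHrefs(hrefs) {
  const emails = new Set();
  const phones = new Set();
  for (const href of hrefs) {
    if (!href) continue;
    if (href.startsWith('mailto:')) {
      // Drop any "?subject=..." query that may follow the address.
      const address = href.slice('mailto:'.length).split('?')[0].trim();
      if (address) emails.add(address);
    } else if (href.startsWith('tel:')) {
      const number = href.slice('tel:'.length).trim();
      if (number) phones.add(number);
    }
  }
  return { emails: [...emails], phones: [...phones] };
}

async function waitForContacts(page, timeoutMs = 30000) {
  const selector = 'a[href^="mailto:"], a[href^="tel:"]';
  try {
    // Argument order is (pageFunction, arg, options): the selector is the arg.
    await page.waitForFunction(
      (sel) => document.querySelectorAll(sel).length > 0,
      selector,
      { timeout: timeoutMs }
    );
  } catch (err) {
    // Assumption: a timeout surfaces as an error named 'TimeoutError'.
    if (err.name === 'TimeoutError') return { emails: [], phones: [] };
    throw err;
  }
  // Collect every matching href in one round trip to the page.
  const hrefs = await page
    .locator(selector)
    .evaluateAll((links) => links.map((a) => a.getAttribute('href')));
  return parseContactHrefs(hrefs);
}

// Usage: const { emails, phones } = await waitForContacts(page);
```

Keeping the href parsing in a pure function means it can be checked without a browser, while the page-dependent part stays small.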

Implementation

The Playwright version will be saved as: reonomy-scraper-v11-playwright.js