clawdbot-workspace/SCRAPER-RESEARCH.md

# Scraper Research: Puppeteer Alternatives
## Research Summary
I evaluated several alternatives to Puppeteer for web scraping. Here are my findings:
### Top Contender: Playwright ✅
**Status:** Already installed (v1.57.0)
**Key Advantages over Puppeteer:**
1. **Built-in Auto-Waiting**
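- Locator actions such as `click()` wait for the element to become actionable (visible, stable, enabled) before acting, so most arbitrary `sleep()` calls can be removed
- `waitForSelector()` waits for an element to reach a given state (`attached`, `detached`, `visible`, `hidden`)
- `waitForFunction()` waits until a custom condition evaluated in the page returns a truthy value
- `waitForResponse()` waits for a matching network response
- Puppeteer has equivalents of the three explicit waits; the main difference is that Playwright actions wait on their own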
2. **Better Selectors**
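- `page.locator()` re-resolves the element on every action, so it does not go stale after a re-render the way a `page.$()` handle can
- Supports text- and role-based selectors (`getByText()`, `getByRole()`)
- Locators can be chained and filtered for complex queries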
3. **Multiple Browser Support**
- Chromium (Chrome/Edge)
- Firefox
- WebKit (Safari)
- Can switch between browsers with one line change
4. **Faster & More Reliable**
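- Lightweight, isolated browser contexts let one browser process serve several sessions
- Faster execution in practice, mainly because waits end as soon as their condition is met instead of after a fixed delay
- More stable for dynamic content, since actions retry until the element is ready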
5. **Better Debugging**
- Built-in tracing (`context.tracing.start()`, `context.tracing.stop()`)
- Video recording out of the box (`recordVideo` context option)
- Screenshot API
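
Tracing is exposed on the browser context as `context.tracing`. As a minimal sketch of how a scraping step could be wrapped so a trace is always written, assuming `context` is a Playwright `BrowserContext` (the helper name `withTrace` and the `step` callback are hypothetical, not part of the existing scraper):

```javascript
// Sketch only: record a Playwright trace around one scraping step.
// `withTrace`, `tracePath`, and `step` are illustrative names.
async function withTrace(context, tracePath, step) {
  // Capture screenshots and DOM snapshots for the trace viewer.
  await context.tracing.start({ screenshots: true, snapshots: true });
  try {
    return await step();
  } finally {
    // Write the trace archive even when the step throws.
    await context.tracing.stop({ path: tracePath });
  }
}
```

The resulting archive can be opened with `npx playwright show-trace trace.zip`, which is useful when a run fails partway through a page.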
### Other Options Considered
| Tool | Status | Verdict |
|------|--------|---------|
| **Selenium** | Not installed | Mature but slower, more complex API |
| **Cypress** | Not installed | Focus on testing, overkill for scraping |
| **Cheerio** | Available | Fast but no JS execution - won't work for Reonomy |
| **JSDOM** | Available | Similar to Cheerio - no JS execution |
| **Puppeteer-Extra** | Not installed | Still Puppeteer underneath |
| **Zombie.js** | Not installed | Less maintained, limited features |
## Recommendation: Switch to Playwright
For the Reonomy scraper, Playwright is the clear winner because:
1. ✅ Already installed in the project
2. ✅ No arbitrary sleeps needed for dynamic content
3. ✅ Better handling of the 30-second contact details wait
4. ✅ More reliable element selection
5. ✅ Faster execution
## Key Changes in Playwright Version
### Puppeteer (Current)
```javascript
await sleep(8000); // Arbitrary wait
const element = await page.$('selector');
await element.click();
```
### Playwright (New)
```javascript
// locator.click() waits for the element to be visible, stable, and enabled,
// so no separate waitForSelector() call is needed
await page.locator('selector').click({ timeout: 30000 });
```
### Waiting for Contact Details
**Puppeteer:**
```javascript
// Manual polling with sleep()
for (let i = 0; i < 30; i++) {
  await sleep(1000);
  const data = await extractOwnerTabData(page);
  if (data.emails.length > 0 || data.phones.length > 0) break;
}
```
**Playwright:**
```javascript
// Wait until the condition is met, checking inside the page
await page.waitForFunction(
  () => {
    const emails = document.querySelectorAll('a[href^="mailto:"]');
    const phones = document.querySelectorAll('a[href^="tel:"]');
    return emails.length > 0 || phones.length > 0;
  },
  null, // Playwright's second parameter is the argument passed to the page function
  { timeout: 30000 } // options go third
);
```
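
One detail worth handling explicitly: when the timeout elapses, `waitForFunction()` rejects with a timeout error instead of returning, whereas the polling loop simply fell through. The following is a sketch of how the wait and the extraction might fit together, not the final scraper code. The helper names `waitForContactDetails`, `extractContacts`, and `parseContactHrefs` are hypothetical, the selectors are the ones from the snippet above, and detecting the timeout by `error.name` is an assumption made to keep the block free of imports.

```javascript
// Hypothetical helper: resolves to true once a mailto: or tel: link exists,
// or false if the timeout elapses first.
async function waitForContactDetails(page, timeoutMs = 30000) {
  try {
    await page.waitForFunction(
      () => document.querySelector('a[href^="mailto:"], a[href^="tel:"]') !== null,
      null,                  // no argument is passed to the page function
      { timeout: timeoutMs } // options are the third parameter in Playwright
    );
    return true;
  } catch (error) {
    // Assumption: timeouts are recognised by name rather than by importing
    // errors.TimeoutError from 'playwright'.
    if (error.name === 'TimeoutError') return false;
    throw error;
  }
}

// Strip the URL scheme from collected hrefs and drop empties and duplicates.
function parseContactHrefs(hrefs) {
  const collect = (scheme) => {
    const values = hrefs
      .filter((href) => href.startsWith(scheme))
      .map((href) => href.slice(scheme.length).trim())
      .filter((value) => value.length > 0);
    return [...new Set(values)];
  };
  return { emails: collect('mailto:'), phones: collect('tel:') };
}

// Read every matching link in one round trip to the page.
async function extractContacts(page) {
  const hrefs = await page
    .locator('a[href^="mailto:"], a[href^="tel:"]')
    .evaluateAll((links) => links.map((link) => link.getAttribute('href')));
  return parseContactHrefs(hrefs);
}
```

A caller would check the boolean from `waitForContactDetails(page)` and only then call `extractContacts(page)`, which returns the same `{ emails, phones }` shape the polling loop tested.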
## Implementation
The Playwright version will be saved as: `reonomy-scraper-v11-playwright.js`