# Scraper Research: Puppeteer Alternatives

## Research Summary

I evaluated several alternatives to Puppeteer for web scraping. Here are my findings:

### Top Contender: Playwright ✅
**Status:** Already installed (v1.57.0)

**Key Advantages over Puppeteer:**

1. **Built-in Auto-Waiting**
   - No more arbitrary `sleep()` calls: locator actions such as `click()` wait for the element to be actionable
   - `waitForSelector()` waits for an element to reach a given state (for example `visible`)
   - `waitForFunction()` waits until custom conditions are met
   - `waitForResponse()` waits for a matching network response

2. **Better Selectors**
   - `page.locator()` re-resolves the element on every action, so it is more robust than a `page.$()` handle that can go stale
   - Supports text- and role-based locators (`getByText()`, `getByRole()`)
   - Chainable selectors for complex queries

3. **Multiple Browser Support**
   - Chromium (Chrome/Edge)
   - Firefox
   - WebKit (Safari's engine)
   - Can switch between browsers with a one-line change

4. **Faster & More Reliable**
   - Better resource management
   - Faster execution
   - More stable for dynamic content, thanks to auto-waiting

5. **Better Debugging**
   - Built-in tracing (`context.tracing.start()`, `context.tracing.stop()`)
   - Video recording out of the box (`recordVideo` context option)
   - Screenshot API (`page.screenshot()`)

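
A minimal sketch of how browser switching, role-based locators, and tracing from the list above might fit together. It assumes the caller passes in the object returned by `require('playwright')`; `pickBrowserType`, `runWithTrace`, and the `trace.zip` path are hypothetical names, not part of the existing scraper.

```javascript
// Sketch only: `playwright` is the object returned by require('playwright').
// pickBrowserType, runWithTrace, and tracePath are hypothetical names.
const SUPPORTED_BROWSERS = ['chromium', 'firefox', 'webkit'];

// Resolve a browser engine by name; changing the name is the "one-line change".
function pickBrowserType(playwright, name = 'chromium') {
  if (!SUPPORTED_BROWSERS.includes(name)) {
    throw new Error(`Unsupported browser: ${name}`);
  }
  return playwright[name];
}

// Open `url`, read the first heading through a role-based locator,
// and save a trace of the whole session to `tracePath`.
async function runWithTrace(playwright, url, options = {}) {
  const { browserName = 'chromium', tracePath = 'trace.zip' } = options;
  const browser = await pickBrowserType(playwright, browserName).launch();
  try {
    const context = await browser.newContext();
    await context.tracing.start({ screenshots: true, snapshots: true });
    try {
      const page = await context.newPage();
      await page.goto(url);
      // textContent() waits for the element to exist; no sleep() is needed.
      return await page.getByRole('heading').first().textContent();
    } finally {
      // Write the trace even if the steps above failed.
      await context.tracing.stop({ path: tracePath });
    }
  } finally {
    await browser.close();
  }
}
```

Calling `runWithTrace(require('playwright'), url, { browserName: 'firefox' })` would run the same steps in Firefox, and the saved file can be inspected with `npx playwright show-trace trace.zip`.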
### Other Options Considered

| Tool | Status | Verdict |
|------|--------|---------|
| **Selenium** | Not installed | Mature but slower, more complex API |
| **Cypress** | Not installed | Focused on testing, overkill for scraping |
| **Cheerio** | Available | Fast but no JS execution - won't work for Reonomy |
| **JSDOM** | Available | Similar to Cheerio - no real browser, script execution is limited and opt-in |
| **Puppeteer-Extra** | Not installed | Still Puppeteer underneath |
| **Zombie.js** | Not installed | Less maintained, limited features |

## Recommendation: Switch to Playwright

For the Reonomy scraper, Playwright is the clear winner because:
1. ✅ Already installed in the project
2. ✅ No arbitrary sleeps needed for dynamic content
3. ✅ Better handling of the 30-second contact details wait
4. ✅ More reliable element selection
5. ✅ Faster execution

## Key Changes in Playwright Version

### Puppeteer (Current)
```javascript
await sleep(8000); // Arbitrary wait
const element = await page.$('selector'); // null if nothing matches yet
await element.click();
```

### Playwright (New)
```javascript
// Wait up to 30 s for the element to become visible, then click it
await page.waitForSelector('selector', { state: 'visible', timeout: 30000 });
await page.locator('selector').click();
```

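
Since locator actions wait for actionability on their own, the explicit `waitForSelector()` call may not be needed at all. A hedged sketch of that shorter form, where `clickWhenReady` is a hypothetical helper name and `selector` stands for whatever selector the scraper already uses:

```javascript
// Sketch only: clickWhenReady is a hypothetical helper, not existing scraper code.
async function clickWhenReady(page, selector, timeoutMs = 30000) {
  // locator() is lazy: nothing is queried until an action runs.
  const target = page.locator(selector);
  // click() waits for the element to be visible, stable, and enabled,
  // and rejects with a timeout error if that takes longer than timeoutMs.
  await target.click({ timeout: timeoutMs });
  return target;
}
```

The returned locator can be reused for follow-up actions, because it re-resolves the element each time.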
### Waiting for Contact Details

**Puppeteer:**
```javascript
// Manual polling with sleep(): check once per second, for up to 30 seconds
for (let i = 0; i < 30; i++) {
  await sleep(1000);
  const data = await extractOwnerTabData(page);
  if (data.emails.length > 0 || data.phones.length > 0) break;
}
```

**Playwright:**
```javascript
// Intelligent wait until condition is met
await page.waitForFunction(
  () => {
    const emails = document.querySelectorAll('a[href^="mailto:"]');
    const phones = document.querySelectorAll('a[href^="tel:"]');
    return emails.length > 0 || phones.length > 0;
  },
  undefined, // Playwright's second parameter is the argument for the page function
  { timeout: 30000 } // options come third (Puppeteer takes them second)
);
```

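
`waitForFunction()` rejects when the timeout expires, whereas the polling loop simply falls through, so the scraper may want to treat "no contact details" as a normal outcome. A minimal sketch under that assumption; `waitForContactDetails` is a hypothetical wrapper name, and the check on `err.name` assumes Playwright timeout errors are named `TimeoutError`:

```javascript
// Sketch only: waitForContactDetails is a hypothetical wrapper name.
// Resolves to true once a mailto: or tel: link exists, false on timeout.
async function waitForContactDetails(page, timeoutMs = 30000) {
  try {
    await page.waitForFunction(
      () => document.querySelector('a[href^="mailto:"], a[href^="tel:"]') !== null,
      undefined, // no argument is passed to the page function
      { timeout: timeoutMs }
    );
    return true;
  } catch (err) {
    // Assumption: Playwright timeout errors carry the name 'TimeoutError'.
    if (err && err.name === 'TimeoutError') return false;
    throw err; // anything else (page closed, crash) is a real failure
  }
}
```

A caller could then skip extraction for a property when the function resolves to `false` instead of aborting the whole run.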
## Implementation

The Playwright version will be saved as: `reonomy-scraper-v11-playwright.js`