# Visual/Screenshot-Based Browser Automation Tools Research

## February 2026 State-of-the-Art

**Research Date:** February 5, 2026

**Focus:** Tools using screenshots + vision models for web navigation

**Key Metrics:** Accuracy, Speed, Complex Multi-Step Task Completion

---

## Executive Summary

As of February 2026, visual web agents using screenshots + vision models have made significant progress but still lag far behind human performance. The best systems achieve ~58-60% success on simplified benchmarks but only 12-38% on real-world computer tasks. Key challenges remain in GUI grounding, long-horizon planning, and operational knowledge.

**Top Performers (by category):**

- **Production Ready:** Browser-Use (with the ChatBrowserUse model), Anthropic Computer Use
- **Research/Accuracy:** SeeAct + GPT-4V, WebVoyager, Agent-Q
- **Benchmarking Standard:** OSWorld, WebArena, VisualWebArena

---
## 1. Anthropic Computer Use (Claude 3.5 Sonnet)

### Overview

Released October 2024, updated December 2024. The first frontier AI model with a publicly available computer use capability.

### Technical Approach

- **Method:** Screenshot-based visual perception + cursor/keyboard control
- **Model:** Claude 3.5 Sonnet with a specialized computer use API
- **Architecture:** Sees the screen, moves the cursor, clicks buttons, types text
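The request shape can be sketched as below. This is a minimal sketch that only builds the payload (no API call is made): the model name and the `computer_20241022` tool version are from the October 2024 beta and may have changed, and the surrounding agent loop that screenshots the desktop and executes the returned actions is omitted.

```python
# Sketch of an Anthropic computer-use request payload (Oct 2024 beta shape).
# Assumption: tool type/version strings match the beta docs; verify before use.
def build_computer_use_request(task: str, width: int = 1280, height: int = 800) -> dict:
    """Build the request body for a computer-use turn (payload only, no call)."""
    return {
        "model": "claude-3-5-sonnet-20241022",
        "max_tokens": 1024,
        "tools": [{
            "type": "computer_20241022",     # screenshot + mouse/keyboard tool
            "name": "computer",
            "display_width_px": width,       # the model reasons in these pixel coords
            "display_height_px": height,
        }],
        "messages": [{"role": "user", "content": task}],
    }

request = build_computer_use_request("Open the settings page and enable dark mode")
```

In the real loop, the response contains tool-use blocks (click/type/screenshot actions) that the caller must execute and report back as tool results.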
### Performance Metrics

- **OSWorld benchmark:** 14.9% (screenshot-only), 22.0% (when allowed more steps)
- **Complex tasks:** Can handle tasks with dozens to hundreds of steps
- **Reliability:** Still "cumbersome and error-prone," per Anthropic's own assessment
- **Early adopters:** Replit, Asana, Canva, DoorDash, The Browser Company

### Key Strengths

- Production-grade API available (Anthropic API, AWS Bedrock, Google Vertex AI)
- Integrated safety classifiers for harm detection
- Strong coding performance (49% on SWE-bench Verified)

### Key Limitations

- Actions like scrolling, dragging, and zooming remain challenging
- Error-prone on complex workflows
- Still experimental/beta quality

### Use Cases

- Software development automation (Replit Agent)
- Multi-step workflow automation
- App evaluation during development

---
## 2. SeeAct (GPT-4V-Based Web Agent)

### Overview

Published January 2024 (ICML'24); open source from the OSU NLP Group.

### Technical Approach

- **Model:** GPT-4V (vision); Gemini and LLaVA also supported
- **Grounding:** Text choices + Set-of-Mark (SoM) overlays
- **Framework:** Playwright-based; runs on live websites
### Performance Metrics

- **Accuracy:** Strong on grounded element selection
- **Multimodal:** Significantly outperforms text-only approaches
- **Mind2Web dataset:** Evaluated on 1,000+ real-world tasks
- **Real websites:** Tested on 15+ popular sites (Google, Amazon, Reddit, etc.)

### Key Strengths

- **Production-ready Python package:** `pip install seeact`
- Supports multiple LMM backends (GPT-4V, Gemini, local LLaVA)
- Chrome extension available (SeeActChromeExtension)
- Strong element grounding with SoM visual prompting
- Active maintenance and updates

### Key Limitations

- Requires manual safety monitoring (defaults to a manual confirmation mode)
- No auto-login support (a deliberate security measure)
- Can be slow on complex multi-page workflows

### Use Cases

- Web scraping and data extraction
- Form-filling automation
- Research and information gathering
- Testing and QA automation

---
## 3. WebVoyager

### Overview

Published January 2024 (ACL'24), Tencent AI Lab.

### Technical Approach

- **Model:** GPT-4V for multimodal reasoning
- **Environment:** Selenium-based; interacts with a real browser
- **Planning:** Generalist planning approach fusing visual and textual observations
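The observe-reason-act loop behind agents like WebVoyager can be sketched as follows. `browser` and `model` are hypothetical stand-ins (in the real system the browser is driven via Selenium and the model is GPT-4V); only the control flow is shown:

```python
# Skeleton of a screenshot -> reason -> act agent loop.
# `browser` and `model` are injected stand-ins, not a real API.
def run_agent(task, browser, model, max_steps=15):
    """Run the loop until the model answers or the step budget is spent."""
    history = []
    for _ in range(max_steps):
        screenshot = browser.screenshot()                 # visual observation
        action = model.decide(task, screenshot, history)  # e.g. {"type": "click", ...}
        if action["type"] == "answer":                    # task finished
            return action["content"]
        browser.execute(action)                           # click / type / scroll
        history.append(action)                            # text history aids planning
    return None                                           # budget exhausted, no answer
```

The `max_steps` cap matters in practice: as noted under Critical Gaps below, tasks needing more than ~15 steps fail frequently, so most systems bound the loop rather than let the agent wander.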
### Performance Metrics

- **Task success rate:** 59.1% on its own benchmark (15 websites, 643 tasks)
- **vs. text-only GPT-4:** Significantly better
- **vs. text-only input:** The multimodal setup consistently outperforms
- **GPT-4V auto-evaluation:** 85.3% agreement with human judgment

### Key Strengths

- End-to-end task completion on real websites
- Strong performance on diverse web tasks
- Automated evaluation protocol using GPT-4V
- Handles Booking.com, Google Flights, ArXiv, BBC News, etc.

### Key Limitations

- Some tasks are time-sensitive and need manual updates
- Non-deterministic results despite temperature = 0
- 59.1% success is still far from human level
- Requires site-specific setup

### Use Cases

- Travel booking automation
- News and research aggregation
- Cross-website information synthesis
- Complex multi-step web workflows

---
## 4. Browser-Use (Open-Source Framework)

### Overview

A modern open-source framework (in active development as of February 2026), optimized for production use.

### Technical Approach

- **Models:** ChatBrowserUse (optimized), GPT-4o, Gemini, LLaVA, local models
- **Architecture:** Playwright-based, with a cloud scaling option
- **Grounding:** State-based, with clickable-element indexing
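The clickable-element indexing idea can be illustrated with a small sketch (hypothetical data and function names, not Browser-Use's actual internals): interactive elements are numbered in the serialized page state, the model acts by index, and the framework maps the index back to a concrete element:

```python
# Sketch of state-based grounding via clickable-element indexing.
# Element dicts are hypothetical stand-ins for parsed interactive DOM nodes.
def index_elements(elements):
    """Serialize elements with numeric indices; return (state text, lookup)."""
    lines, lookup = [], {}
    for i, el in enumerate(elements):
        lookup[i] = el                       # index -> element, for acting later
        lines.append(f"[{i}]<{el['tag']}>{el['text']}</{el['tag']}>")
    return "\n".join(lines), lookup

state, lookup = index_elements([
    {"tag": "input", "text": "Search"},
    {"tag": "button", "text": "Submit"},
])
# The model would answer with something like {"action": "click", "index": 1},
# and the framework clicks lookup[1] via Playwright.
```

Compared with pixel-coordinate clicking, acting by index sidesteps much of the GUI-grounding problem: the model picks from an enumerated list instead of locating a point on the screenshot.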
### Performance Metrics

- **Speed:** 3-5x faster than other models (with ChatBrowserUse)
- **Pricing:** $0.20/M input tokens, $2.00/M output tokens (ChatBrowserUse)
- **Production:** Cloud option with stealth browsers and anti-CAPTCHA measures

### Key Strengths

- **Production-ready infrastructure:**
  - Sandbox deployment with the `@sandbox()` decorator
  - Cloud option for scalability
  - Stealth mode (fingerprinting, proxy rotation)
- **CLI for rapid iteration:** `browser-use open/click/type/screenshot`
- **Active development:** Daily updates, strong community
- **Authentication support:** Real browser profiles, session persistence
- **Integration:** Works with Claude Code and multiple LLM providers

### Key Limitations

- Newer framework, with less academic validation
- Best performance requires the proprietary ChatBrowserUse model
- CAPTCHA handling requires the cloud version

### Use Cases

- Job application automation
- Grocery shopping (Instacart integration)
- PC part sourcing
- Form filling
- Multi-site data aggregation

---
## 5. Agent-Q (Reinforcement Learning Approach)

### Overview

Research from Sentient Engineering (August 2024); uses Monte Carlo Tree Search (MCTS) plus DPO finetuning.

### Technical Approach

- **Architecture:** Multiple options:
  - Planner → Navigator multi-agent
  - Solo planner-actor
  - Actor ↔ Critic multi-agent
  - Actor-Critic + MCTS + DPO finetuning
- **Learning:** Generates DPO training pairs from MCTS exploration
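The MCTS-to-DPO step can be sketched as follows: actions explored from the same state are ranked by their MCTS value estimates, and the best and worst siblings become a (chosen, rejected) preference pair for DPO finetuning. The node structure and values below are hypothetical, for illustration only:

```python
# Sketch: turn MCTS exploration results into DPO preference pairs.
# Each node is {"state": ..., "action": ..., "value": mcts_value_estimate}.
def dpo_pairs_from_tree(nodes):
    """Group nodes by state; emit (state, chosen_action, rejected_action)."""
    by_state = {}
    for n in nodes:
        by_state.setdefault(n["state"], []).append(n)
    pairs = []
    for state, siblings in by_state.items():
        if len(siblings) < 2:
            continue                      # need at least two actions to rank
        ranked = sorted(siblings, key=lambda n: n["value"], reverse=True)
        pairs.append((state, ranked[0]["action"], ranked[-1]["action"]))
    return pairs

nodes = [
    {"state": "checkout", "action": "click_pay", "value": 0.9},
    {"state": "checkout", "action": "click_back", "value": 0.2},
    {"state": "home", "action": "search", "value": 0.6},   # no sibling: skipped
]
pairs = dpo_pairs_from_tree(nodes)
```

This is why the approach can self-improve: every MCTS rollout produces preference data for free, without human labels.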
### Performance Metrics

- Research-focused; specific benchmark numbers are not yet widely published
- Emphasis on learning and improvement over time

### Key Strengths

- Advanced reasoning architecture
- Self-improvement via reinforcement learning
- Multiple agent architectures for different complexity levels
- Open-source implementation

### Key Limitations

- More research-oriented than production-ready
- MCTS requires significant computational resources
- Less documentation for practical deployment

### Use Cases

- Research on agent learning
- Complex reasoning tasks
- Long-horizon planning experiments

---
## 6. OpenAI Operator (Rumored/Upcoming, Jan 2025)

### Overview

According to benchmark sources, OpenAI has a system called "Operator" in testing.

### Performance Metrics (Reported)

- **WebArena:** 58% (best overall as of September 2025)
- **OSWorld:** 38% (best overall)
- Significantly ahead of publicly available models

### Status

- Not yet publicly available as of February 2026
- Proprietary model and data
- Performance claims come from third-party benchmarks

---
## Benchmark Standards (Feb 2026)

### OSWorld (Most Comprehensive)

- **369 tasks** on real Ubuntu/Windows/macOS environments
- **Best performance:** 38% (OpenAI Operator), 29.9% (ARPO with RL)
- **Human performance:** 72.36%
- **Key finding:** "Significant deficiencies in GUI grounding and operational knowledge"

### WebArena

- **812 tasks** on functional websites (e-commerce, forums, dev tools)
- **Best performance:** 58% (OpenAI Operator)
- **GPT-4 baseline:** 14.41%
- **Human performance:** 78.24%

### VisualWebArena

- **Multimodal tasks** that require visual information
- Reveals gaps where text-only agents fail
- Important for realistic web tasks (visual layouts, images, charts)

### Mind2Web / Multimodal-Mind2Web

- **7,775 training actions**, 3,500+ test actions
- Real-world websites with human annotations
- Now includes screenshot+HTML alignment (Hugging Face dataset)

---
## Key Findings: What Actually Works in 2026

### 1. Multimodal > Text-Only (Consistently)

All benchmarks show that visual information significantly improves accuracy. Text-only HTML parsing misses layout, images, and visual cues.

### 2. Production Readiness Varies Wildly

- **Production:** Anthropic Computer Use, Browser-Use, SeeAct
- **Research:** WebVoyager, Agent-Q, most academic tools
- **The gap:** Most papers don't handle auth, CAPTCHAs, rate limits, etc.

### 3. Speed vs. Accuracy Tradeoff

- ChatBrowserUse: Optimized for speed (3-5x faster)
- GPT-4V: More accurate but slower
- Local models (LLaVA): Fast but less accurate

### 4. Complex Tasks Still Fail Most of the Time

- Even the best systems: 38-60% on benchmarks
- Humans: 72-78%
- Main failure modes: long-horizon planning, GUI grounding, error handling

### 5. Set-of-Mark (SoM) Grounding Works

Visual overlays with element markers significantly improve click accuracy. Used by SeeAct and many recent systems.

### 6. Context Length Matters

Longer text-based history helps, but screenshot-only history doesn't. This suggests models need semantic understanding, not just visual memory.

---
## Recommendations by Use Case

### For Production Automation (Reliability Priority)

**Choose:** Browser-Use with ChatBrowserUse, or Anthropic Computer Use

- Why: Production infrastructure, safety measures, active support
- Tradeoff: Cost vs. control

### For Research/Experimentation (Flexibility Priority)

**Choose:** SeeAct or WebVoyager

- Why: Open source, multiple backends, active development
- Tradeoff: More setup, less hand-holding

### For Learning/Adaptation (Future-Proofing)

**Choose:** Agent-Q or other MCTS-based approaches

- Why: RL enables improvement over time
- Tradeoff: Complexity and computational cost

### For Maximum Accuracy (Cost No Object)

**Choose:** OpenAI Operator (when available) or SeeAct + GPT-4V

- Why: Best benchmark scores
- Tradeoff: Proprietary, expensive, may not become publicly available

---
## Critical Gaps (Still Unsolved in 2026)

1. **Long-Horizon Planning:** Tasks longer than ~15 steps fail frequently
2. **Error Recovery:** Agents don't gracefully handle failures
3. **GUI Grounding:** Finding the right element remains hard
4. **Operational Knowledge:** Knowing *how* websites work (not just seeing them)
5. **Speed:** Visual inference is slow (hundreds of ms per action)
6. **Robustness:** UI changes, pop-ups, and unexpected dialogs break agents
7. **Authentication:** Login, CAPTCHA, and 2FA remain mostly unsolved without manual help

---
## Timeline of Progress

- **July 2023:** WebArena benchmark released (~14% GPT-4 success)
- **January 2024:** SeeAct and WebVoyager published (multimodal advantage confirmed)
- **April 2024:** OSWorld released (real-OS benchmark; all models under 15%)
- **August 2024:** Agent-Q paper (RL for web agents)
- **October 2024:** Anthropic Computer Use beta launched
- **September 2025:** OpenAI Operator rumored (58% WebArena per leaderboards)
- **February 2026:** Browser-Use in active development; ChatBrowserUse optimized

---
## Conclusion

**Best for complex multi-step tasks in Feb 2026:**

1. **Anthropic Computer Use** - Most reliable production system, proven by major companies
2. **Browser-Use + ChatBrowserUse** - Fastest iteration, best cost/performance for production
3. **SeeAct + GPT-4V** - Best open-source accuracy, flexible deployment
4. **WebVoyager** - Strong research baseline, good benchmark results

**Reality check:** Even the best systems fail 40-60% of the time on realistic tasks. Human-level performance (>70%) remains elusive. The field is improving rapidly but still faces fundamental challenges in planning, grounding, and robustness.

**Key insight:** The tool matters less than the task. Simple tasks (form filling, single clicks) work well; complex multi-step workflows across multiple pages still require human oversight and intervention.

---
## Sources

- Anthropic Computer Use announcement (Oct 2024; Dec 2024 updates)
- SeeAct (ICML'24): https://github.com/OSU-NLP-Group/SeeAct
- WebVoyager (ACL'24): https://github.com/MinorJerry/WebVoyager
- Browser-Use: https://github.com/browser-use/browser-use
- Agent-Q: https://github.com/sentient-engineering/agent-q
- OSWorld: https://os-world.github.io/ (OSWorld-Verified, July 2025)
- WebArena: https://webarena.dev/
- VisualWebArena: https://jykoh.com/vwa
- Third-party benchmarks: o-mega.ai and emergentmind.com leaderboards

---

**Report compiled:** February 5, 2026

**Status:** Active research area; tools are updating constantly