Visual/Screenshot-Based Browser Automation Tools Research
February 2026 State-of-the-Art
Research Date: February 5, 2026
Focus: Tools using screenshots + vision models for web navigation
Key Metrics: Accuracy, Speed, Complex Multi-Step Task Completion
Executive Summary
As of Feb 2026, visual web agents using screenshots + vision models have made significant progress but still lag far behind human performance. The best systems achieve ~58-60% success on simplified benchmarks but only 12-38% on real-world computer tasks. Key challenges remain in GUI grounding, long-horizon planning, and operational knowledge.
Top Performers (by category):
- Production Ready: Browser-Use (with ChatBrowserUse model), Anthropic Computer Use
- Research/Accuracy: SeeAct + GPT-4V, WebVoyager, Agent-Q
- Benchmarking Standard: OSWorld, WebArena, VisualWebArena
1. Anthropic Computer Use (Claude 3.5 Sonnet)
Overview
Released Oct 2024, updated Dec 2024. First frontier AI model with public computer use capability.
Technical Approach
- Method: Screenshot-based visual perception + cursor/keyboard control
- Model: Claude 3.5 Sonnet with specialized computer use API
- Architecture: Sees screen, moves cursor, clicks buttons, types text
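The perceive-decide-act cycle above can be sketched as a simple loop. This is an illustrative skeleton only, not Anthropic's actual API: the model call is replaced by a scripted stub, and names like `decide_action` and `capture_screenshot` are hypothetical.

```python
# Minimal sketch of the screenshot -> decide -> act loop that computer-use
# agents follow. The vision-model call is stubbed; all names here are
# illustrative placeholders, not Anthropic's computer use API.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str       # "click", "type", "done", ...
    payload: dict

def capture_screenshot(state: dict) -> bytes:
    # Placeholder: a real agent would grab the actual screen pixels.
    return repr(state).encode()

def decide_action(screenshot: bytes, goal: str, step: int) -> Action:
    # Stub standing in for the vision model. Here it "clicks" once,
    # "types" the goal, then declares the task done.
    script = [
        Action("click", {"x": 120, "y": 340}),
        Action("type", {"text": goal}),
        Action("done", {}),
    ]
    return script[min(step, len(script) - 1)]

def run_agent(goal: str, max_steps: int = 10) -> list:
    state: dict = {"focused": None}
    history = []
    for step in range(max_steps):
        shot = capture_screenshot(state)
        action = decide_action(shot, goal, step)
        history.append(action.kind)
        if action.kind == "done":
            break
        state["last_action"] = action.kind  # apply the action to the environment
    return history
```

The `max_steps` cap matters in practice: without it, an agent that never emits a terminal action loops forever, which is one reason real systems report "cumbersome and error-prone" behavior on long workflows.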
Performance Metrics
- OSWorld benchmark: 14.9% (screenshot-only), 22.0% (with more steps)
- Complex tasks: Can handle tasks with dozens to hundreds of steps
- Reliability: Still "cumbersome and error-prone" per Anthropic
- Early adopters: Replit, Asana, Canva, DoorDash, The Browser Company
Key Strengths
- Production-grade API available (Anthropic API, AWS Bedrock, Google Vertex AI)
- Integrated safety classifiers for harm detection
- Strong coding performance (49% on SWE-bench Verified)
Key Limitations
- Actions like scrolling, dragging, zooming present challenges
- Error-prone on complex workflows
- Still experimental/beta quality
Use Cases
- Software development automation (Replit Agent)
- Multi-step workflow automation
- App evaluation during development
2. SeeAct (GPT-4V-based Web Agent)
Overview
Published Jan 2024 (ICML'24), open-source from OSU NLP Group.
Technical Approach
- Model: GPT-4V (vision), Gemini, LLaVA supported
- Grounding: Text choices + Set-of-Mark (SoM) overlays
- Framework: Playwright-based, runs on live websites
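The Set-of-Mark idea can be illustrated with a small sketch: candidate elements get numeric marks overlaid on the screenshot, the model answers with a mark number, and that number is resolved back to pixel coordinates. The element schema below is an assumption for illustration, not SeeAct's actual data structures.

```python
# Illustrative sketch of Set-of-Mark (SoM) grounding: the model picks a
# numbered mark instead of raw coordinates. Element dicts are a
# hypothetical schema, not SeeAct's internal representation.

def assign_marks(elements):
    """Give each candidate element an integer mark, starting at 1."""
    return {i: el for i, el in enumerate(elements, start=1)}

def render_choices(marked):
    """Text choices shown to the model alongside the marked screenshot."""
    return "\n".join(
        f"[{i}] <{el['tag']}> {el['text']!r} at {el['box']}"
        for i, el in marked.items()
    )

def resolve(marked, answer: int):
    """Map the model's chosen mark back to a click point."""
    x0, y0, x1, y1 = marked[answer]["box"]
    return ((x0 + x1) // 2, (y0 + y1) // 2)  # click at the element center

elements = [
    {"tag": "button", "text": "Search", "box": (100, 200, 180, 230)},
    {"tag": "input", "text": "", "box": (20, 200, 90, 230)},
]
marked = assign_marks(elements)
```

Restricting the model to a closed set of marks is what makes this grounding robust: the agent can only select elements that actually exist, rather than hallucinating coordinates.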
Performance Metrics
- Accuracy: Strong on grounded element selection
- Multimodal: Significantly outperforms text-only approaches
- Mind2Web dataset: Evaluated on 1000+ real-world tasks
- Real websites: Tested on 15+ popular sites (Google, Amazon, Reddit, etc.)
Key Strengths
- Production-ready Python package: pip install seeact
- Supports multiple LMM backends (GPT-4V, Gemini, local LLaVA)
- Chrome Extension available (SeeActChromeExtension)
- Strong element grounding with SoM visual prompting
- Active maintenance and updates
Key Limitations
- Requires manual safety monitoring (the default safety mode asks for human confirmation of each action)
- No auto-login support (security measure)
- Can be slow on complex multi-page workflows
Use Cases
- Web scraping and data extraction
- Form filling automation
- Research and information gathering
- Testing and QA automation
3. WebVoyager
Overview
Published Jan 2024 (ACL'24), Tencent AI Lab.
Technical Approach
- Model: GPT-4V for multimodal reasoning
- Environment: Selenium-based, real browser interaction
- Planning: Generalist planning approach with visual+text fusion
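Visual+text fusion at each step amounts to packing the screenshot and the textual context into one multimodal prompt. The sketch below uses the OpenAI-style chat content format as a stand-in; the PNG bytes and field contents are fake placeholders, and this is not WebVoyager's exact prompt.

```python
# Sketch of fusing a screenshot with text context into one multimodal
# observation, loosely following the OpenAI-style chat content format.
# The screenshot bytes here are a fake stand-in, not a real capture.
import base64

def build_observation(screenshot_png: bytes, url: str, task: str) -> list:
    b64 = base64.b64encode(screenshot_png).decode("ascii")
    return [
        {"type": "text",
         "text": f"Task: {task}\nCurrent page: {url}\n"
                 "Decide the next browser action."},
        {"type": "image_url",
         "image_url": {"url": f"data:image/png;base64,{b64}"}},
    ]

obs = build_observation(b"\x89PNG...", "https://www.bbc.com/news",
                        "find the top headline")
```

Because the image rides along with the text in every step, the model sees layout and visual cues that an HTML-only agent misses, which is the core reason the multimodal variant outperforms text-only GPT-4 in the paper's comparisons.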
Performance Metrics
- Task Success Rate: 59.1% on their benchmark (15 websites, 643 tasks)
- vs. GPT-4 text-only: Significantly better
- vs. GPT-4V text-only: Multimodal consistently outperforms
- GPT-4V Auto-evaluation: 85.3% agreement with human judgment
Key Strengths
- End-to-end task completion on real websites
- Strong performance on diverse web tasks
- Automated evaluation protocol using GPT-4V
- Handles Booking, Google Flights, ArXiv, BBC News, etc.
Key Limitations
- Some tasks are time-sensitive (need manual updates)
- Non-deterministic results despite temperature=0
- 59.1% success still far from human-level
- Requires specific setup per website
Use Cases
- Travel booking automation
- News and research aggregation
- Cross-website information synthesis
- Complex multi-step web workflows
4. Browser-Use (Open Source Framework)
Overview
Modern open-source framework (active development as of Feb 2026), optimized for production.
Technical Approach
- Models: ChatBrowserUse (optimized), GPT-4o, Gemini, LLaVA, local models
- Architecture: Playwright-based with cloud scaling option
- Grounding: State-based with clickable element indexing
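State-based grounding with element indexing can be sketched as follows: interactive elements are filtered out of a DOM snapshot, numbered, and serialized as compact text the LLM can reference by index. The element dicts are an illustrative schema, not Browser-Use's internals.

```python
# Rough sketch of clickable-element indexing in the spirit of Browser-Use:
# filter interactive nodes, number them, and emit a compact text state.
# The dict schema is a hypothetical stand-in for a real DOM snapshot.

INTERACTIVE = {"a", "button", "input", "select", "textarea"}

def index_clickables(dom_elements, max_text=40):
    indexed, lines = {}, []
    n = 0
    for el in dom_elements:
        if el["tag"] not in INTERACTIVE or not el.get("visible", True):
            continue  # skip non-interactive or hidden nodes
        text = el.get("text", "")[:max_text]  # truncate long labels
        indexed[n] = el
        lines.append(f"[{n}]<{el['tag']}>{text}</{el['tag']}>")
        n += 1
    return indexed, "\n".join(lines)

dom = [
    {"tag": "div", "text": "header"},
    {"tag": "button", "text": "Add to cart", "visible": True},
    {"tag": "input", "text": "", "visible": True},
    {"tag": "a", "text": "Checkout", "visible": False},
]
indexed, state_text = index_clickables(dom)
```

Serializing only the actionable elements keeps the per-step prompt small, which is one plausible source of the framework's speed advantage over approaches that feed full page content to the model.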
Performance Metrics
- Speed: 3-5x faster than comparable LLM backends (with ChatBrowserUse)
- Pricing: $0.20/M input tokens, $2.00/M output (ChatBrowserUse)
- Production: Cloud option with stealth browsers, anti-CAPTCHA
Key Strengths
- Production-ready infrastructure:
  - Sandbox deployment with @sandbox() decorator
  - Cloud option for scalability
  - Stealth mode (fingerprinting, proxy rotation)
- CLI for rapid iteration: browser-use open/click/type/screenshot
- Active development: daily updates, strong community
- Authentication support: Real browser profiles, session persistence
- Integration: Works with Claude Code, multiple LLM providers
Key Limitations
- Newer framework (less academic validation)
- Best performance requires ChatBrowserUse model (proprietary)
- CAPTCHA handling requires cloud version
Use Cases
- Job application automation
- Grocery shopping (Instacart integration)
- PC part sourcing
- Form filling
- Multi-site data aggregation
5. Agent-Q (Reinforcement Learning Approach)
Overview
Research from Sentient Engineering (Aug 2024), combining Monte Carlo Tree Search (MCTS) with Direct Preference Optimization (DPO) fine-tuning.
Technical Approach
- Architecture: Multiple options:
- Planner → Navigator multi-agent
- Solo planner-actor
- Actor ↔ Critic multi-agent
- Actor-Critic + MCTS + DPO finetuning
- Learning: Generates DPO training pairs from MCTS exploration
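The MCTS-to-DPO pipeline can be illustrated with a toy sketch: actions are explored with UCB-style selection, and the explored children are then turned into preference pairs (best-valued action preferred over worst). This is a deliberately simplified model of the idea, not Agent-Q's implementation; the reward table is fabricated for the demo.

```python
# Toy sketch of the Agent-Q idea: UCB-guided exploration over candidate
# actions, then DPO preference pairs mined from the visited children.
# Purely illustrative; rewards below are made up for the example.
import math

class Node:
    def __init__(self, action):
        self.action = action
        self.visits = 0
        self.value = 0.0  # running mean reward

def ucb(node, total_visits, c=1.4):
    if node.visits == 0:
        return float("inf")  # force each untried action to be explored once
    return node.value + c * math.sqrt(math.log(total_visits) / node.visits)

def select(children, total_visits):
    return max(children, key=lambda n: ucb(n, total_visits))

def backprop(node, reward):
    node.visits += 1
    node.value += (reward - node.value) / node.visits  # incremental mean

def dpo_pairs(children):
    """Emit (chosen, rejected) pairs from explored actions."""
    tried = [n for n in children if n.visits > 0]
    if len(tried) < 2:
        return []
    best = max(tried, key=lambda n: n.value)
    worst = min(tried, key=lambda n: n.value)
    return [(best.action, worst.action)]

children = [Node("click_login"), Node("click_search"), Node("scroll")]
rewards = {"click_login": 0.1, "click_search": 0.9, "scroll": 0.3}
for t in range(1, 31):
    node = select(children, t)
    backprop(node, rewards[node.action])
```

The appeal of this loop is that the search itself generates the training signal: every rollout both improves the current decision and yields preference data for fine-tuning, which is why MCTS-based agents are computationally expensive but can improve over time.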
Performance Metrics
- Research-focused, specific benchmarks not widely published yet
- Emphasis on learning and improvement over time
Key Strengths
- Advanced reasoning architecture
- Self-improvement via reinforcement learning
- Multiple agent architectures for different complexity levels
- Open-source implementation
Key Limitations
- More research-oriented than production-ready
- Requires significant computational resources for MCTS
- Less documentation for practical deployment
Use Cases
- Research on agent learning
- Complex reasoning tasks
- Long-horizon planning experiments
6. OpenAI Operator (Rumored/Upcoming - Jan 2025)
Overview
According to benchmark sources, OpenAI has a system called "Operator" in testing.
Performance Metrics (Reported)
- WebArena: 58% (best overall as of Sept 2025)
- OSWorld: 38% (best overall)
- Significantly ahead of public models
Status
- Not yet publicly available as of Feb 2026
- Proprietary model and data
- Performance claims from third-party benchmarks
Benchmark Standards (Feb 2026)
OSWorld (Most Comprehensive)
- 369 tasks on real Ubuntu/Windows/macOS environments
- Best performance: 38% (OpenAI Operator), 29.9% (ARPO with RL)
- Human performance: 72.36%
- Key finding: "Significant deficiencies in GUI grounding and operational knowledge"
WebArena
- 812 tasks on functional websites (e-commerce, forums, dev tools)
- Best performance: 58% (OpenAI Operator)
- GPT-4 baseline: 14.41%
- Human performance: 78.24%
VisualWebArena
- Multimodal tasks requiring visual information
- Reveals gaps where text-only agents fail
- Important for realistic web tasks (visual layouts, images, charts)
Mind2Web / Multimodal-Mind2Web
- 7,775 training actions, 3,500+ test actions
- Real-world websites with human annotations
- Now includes screenshot+HTML alignment (Hugging Face dataset)
Key Findings: What Actually Works in 2026
1. Multimodal > Text-Only (Consistently)
All benchmarks show visual information significantly improves accuracy. Text-only HTML parsing misses layout, images, visual cues.
2. Production Readiness Varies Wildly
- Production: Anthropic Computer Use, Browser-Use, SeeAct
- Research: WebVoyager, Agent-Q, most academic tools
- Gap: Most papers don't handle auth, CAPTCHAs, rate limits, etc.
3. Speed vs. Accuracy Tradeoff
- ChatBrowserUse: Optimized for speed (3-5x faster)
- GPT-4V: More accurate but slower
- Local models (LLaVA): Fast but less accurate
4. Complex Tasks Still Fail Most of the Time
- Even best systems: 38-60% on benchmarks
- Humans: 72-78%
- Main failures: Long-horizon planning, GUI grounding, handling errors
5. Set-of-Mark (SoM) Grounding Works
Visual overlays with element markers significantly improve click accuracy. Used by SeeAct, many recent systems.
6. Context Length Matters
Longer text-based action history helps, but screenshot-only history does not. This suggests models need a semantic understanding of past actions, not just visual memory.
Recommendations by Use Case
For Production Automation (Reliability Priority)
Choose: Browser-Use with ChatBrowserUse or Anthropic Computer Use
- Why: Production infrastructure, safety measures, active support
- Tradeoff: Cost vs. control
For Research/Experimentation (Flexibility Priority)
Choose: SeeAct or WebVoyager
- Why: Open-source, multiple backends, active development
- Tradeoff: More setup, less hand-holding
For Learning/Adaptation (Future-Proofing)
Choose: Agent-Q or MCTS-based approaches
- Why: RL enables improvement over time
- Tradeoff: Complexity, computational cost
For Maximum Accuracy (Cost No Object)
Choose: OpenAI Operator (when available) or GPT-4V + SeeAct
- Why: Best benchmark scores
- Tradeoff: Proprietary, expensive, may not be public
Critical Gaps (Still Unsolved in 2026)
- Long-Horizon Planning: Tasks >15 steps fail frequently
- Error Recovery: Agents don't gracefully handle failures
- GUI Grounding: Finding the right element remains hard
- Operational Knowledge: Knowing how websites work (not just seeing them)
- Speed: Visual inference is slow (hundreds of ms per action)
- Robustness: UI changes, pop-ups, unexpected dialogs break agents
- Authentication: Login, CAPTCHA, 2FA mostly unsolved without manual help
Timeline of Progress
- July 2023: WebArena benchmark released (14% GPT-4 success)
- Jan 2024: SeeAct, WebVoyager published (multimodal wins confirmed)
- April 2024: OSWorld released (real OS benchmark, <15% all models)
- Aug 2024: Agent-Q paper (RL for web agents)
- Oct 2024: Anthropic Computer Use beta launched
- Sept 2025: OpenAI Operator rumored (58% WebArena per leaderboards)
- Feb 2026: Browser-Use active development, ChatBrowserUse optimized
Conclusion
Best for complex multi-step tasks in Feb 2026:
- Anthropic Computer Use - Most reliable production system, proven by major companies
- Browser-Use + ChatBrowserUse - Fastest iteration, best cost/performance for production
- SeeAct + GPT-4V - Best open-source accuracy, flexible deployment
- WebVoyager - Strong research baseline, good benchmark results
Reality check: Even the best systems fail 40-60% of the time on realistic tasks. Human-level performance (>70%) remains elusive. The field is rapidly improving but still has fundamental challenges in planning, grounding, and robustness.
Key insight: The tool matters less than the task. Simple tasks (form filling, single clicks) work well. Complex multi-step workflows across multiple pages still require human oversight and intervention.
Sources
- Anthropic Computer Use announcement (Oct 2024, Dec 2024 updates)
- SeeAct (ICML'24): https://github.com/OSU-NLP-Group/SeeAct
- WebVoyager (ACL'24): https://github.com/MinorJerry/WebVoyager
- Browser-Use: https://github.com/browser-use/browser-use
- Agent-Q: https://github.com/sentient-engineering/agent-q
- OSWorld: https://os-world.github.io/ (OSWorld-Verified July 2025)
- WebArena: https://webarena.dev/
- VisualWebArena: https://jykoh.com/vwa
- Third-party benchmarks: o-mega.ai, emergentmind.com leaderboards
Report compiled: February 5, 2026
Status: Active research area, tools updating constantly