clawdbot-workspace/browser-research-vision-feb2026.md
2026-02-05 23:01:36 -05:00


Visual/Screenshot-Based Browser Automation Tools Research

February 2026 State-of-the-Art

Research Date: February 5, 2026
Focus: Tools using screenshots + vision models for web navigation
Key Metrics: Accuracy, Speed, Complex Multi-Step Task Completion


Executive Summary

As of Feb 2026, visual web agents using screenshots + vision models have made significant progress but still lag far behind human performance. The best systems achieve ~58-60% success on simplified benchmarks but only 12-38% on real-world computer tasks. Key challenges remain in GUI grounding, long-horizon planning, and operational knowledge.

Top Performers (by category):

  • Production Ready: Browser-Use (with ChatBrowserUse model), Anthropic Computer Use
  • Research/Accuracy: SeeAct + GPT-4V, WebVoyager, Agent-Q
  • Benchmarking Standard: OSWorld, WebArena, VisualWebArena

1. Anthropic Computer Use (Claude 3.5 Sonnet)

Overview

Released Oct 2024, updated Dec 2024. First frontier AI model with public computer use capability.

Technical Approach

  • Method: Screenshot-based visual perception + cursor/keyboard control
  • Model: Claude 3.5 Sonnet with specialized computer use API
  • Architecture: Sees screen, moves cursor, clicks buttons, types text
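The see-act loop above can be sketched in a few lines. This is a toy illustration, not the Anthropic API: the model call is stubbed out, and in a real agent it would be a vision-model request that returns the next cursor/keyboard action from a screenshot.

```python
# Minimal sketch of the screenshot -> model -> action loop behind
# computer-use style agents. stub_model is a stand-in (hypothetical)
# for a vision-model API call that maps a screenshot to an action.

def stub_model(screenshot: bytes, goal: str) -> dict:
    """Stand-in for a vision-model call; always clicks one point."""
    return {"action": "click", "x": 120, "y": 48}

def run_agent(goal: str, max_steps: int = 3) -> list:
    history = []
    for _ in range(max_steps):
        screenshot = b"<png bytes>"        # would come from the OS/browser
        action = stub_model(screenshot, goal)
        history.append(action)
        if action["action"] == "done":     # model signals task completion
            break
    return history

actions = run_agent("open settings")
```

The step cap matters in practice: without it, a grounding failure turns into an infinite click loop.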

Performance Metrics

  • OSWorld benchmark: 14.9% (screenshot-only), 22.0% (with more steps)
  • Complex tasks: Can handle tasks with dozens to hundreds of steps
  • Reliability: Still "cumbersome and error-prone" per Anthropic's own assessment
  • Early adopters: Replit, Asana, Canva, DoorDash, The Browser Company

Key Strengths

  • Production-grade API available (Anthropic API, AWS Bedrock, Google Vertex AI)
  • Integrated safety classifiers for harm detection
  • Strong coding performance (49% on SWE-bench Verified)

Key Limitations

  • Actions like scrolling, dragging, zooming present challenges
  • Error-prone on complex workflows
  • Still experimental/beta quality

Use Cases

  • Software development automation (Replit Agent)
  • Multi-step workflow automation
  • App evaluation during development

2. SeeAct (GPT-4V-based Web Agent)

Overview

Published Jan 2024 (ICML'24), open-source from OSU NLP Group.

Technical Approach

  • Model: GPT-4V (vision), Gemini, LLaVA supported
  • Grounding: Text choices + Set-of-Mark (SoM) overlays
  • Framework: Playwright-based, runs on live websites
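The "text choices" grounding strategy can be illustrated with a toy prompt builder: candidate elements are rendered as a lettered multiple-choice list so the model answers with a single letter rather than raw coordinates. This is a sketch of the idea, not SeeAct's actual prompt format.

```python
# Toy version of "text choices" grounding: candidate page elements
# become a lettered multiple-choice list appended to the task prompt.
import string

def build_choice_prompt(task: str, elements: list) -> str:
    lines = [f"Task: {task}", "Which element should be acted on next?"]
    for letter, el in zip(string.ascii_uppercase, elements):
        lines.append(f"{letter}. {el}")
    return "\n".join(lines)

prompt = build_choice_prompt(
    "Search for wireless headphones",
    ['<input id="search">', '<button>Go</button>', '<a>Deals</a>'],
)
```

Constraining the output space to a fixed choice set is what makes this grounding method robust: the model cannot hallucinate an element that was never offered.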

Performance Metrics

  • Accuracy: Strong on grounded element selection
  • Multimodal: Significantly outperforms text-only approaches
  • Mind2Web dataset: Evaluated on 1000+ real-world tasks
  • Real websites: Tested on 15+ popular sites (Google, Amazon, Reddit, etc.)

Key Strengths

  • Production-ready Python package: pip install seeact
  • Supports multiple LMM backends (GPT-4V, Gemini, local LLaVA)
  • Chrome Extension available (SeeActChromeExtension)
  • Strong element grounding with SoM visual prompting
  • Active maintenance and updates

Key Limitations

  • Requires manual safety monitoring (actions run in a manual-confirmation mode)
  • No auto-login support (security measure)
  • Can be slow on complex multi-page workflows

Use Cases

  • Web scraping and data extraction
  • Form filling automation
  • Research and information gathering
  • Testing and QA automation

3. WebVoyager

Overview

Published Jan 2024 (ACL'24), Tencent AI Lab.

Technical Approach

  • Model: GPT-4V for multimodal reasoning
  • Environment: Selenium-based, real browser interaction
  • Planning: Generalist planning approach with visual+text fusion

Performance Metrics

  • Task Success Rate: 59.1% on their benchmark (15 websites, 643 tasks)
  • vs. GPT-4 text-only: Significantly better
  • vs. GPT-4V text-only: Multimodal consistently outperforms
  • GPT-4V Auto-evaluation: 85.3% agreement with human judgment

Key Strengths

  • End-to-end task completion on real websites
  • Strong performance on diverse web tasks
  • Automated evaluation protocol using GPT-4V
  • Handles Booking, Google Flights, ArXiv, BBC News, etc.

Key Limitations

  • Some tasks are time-sensitive (need manual updates)
  • Non-deterministic results despite temperature=0
  • 59.1% success still far from human-level
  • Requires specific setup per website

Use Cases

  • Travel booking automation
  • News and research aggregation
  • Cross-website information synthesis
  • Complex multi-step web workflows

4. Browser-Use (Open Source Framework)

Overview

Modern open-source framework (active development as of Feb 2026), optimized for production.

Technical Approach

  • Models: ChatBrowserUse (optimized), GPT-4o, Gemini, LLaVA, local models
  • Architecture: Playwright-based with cloud scaling option
  • Grounding: State-based with clickable element indexing
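The idea behind clickable element indexing can be shown with a stdlib-only toy: interactive elements are extracted and numbered so the model can reply with "click 2" instead of a selector or pixel coordinates. Browser-Use does this against the live DOM via Playwright; this sketch parses a static HTML string instead.

```python
# Toy illustration of clickable-element indexing using Python's
# stdlib HTML parser. Interactive tags get numeric indices the
# model can reference in its action output.
from html.parser import HTMLParser

CLICKABLE = {"a", "button", "input"}

class ClickableIndexer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.index = {}  # numeric index -> (tag, attributes)

    def handle_starttag(self, tag, attrs):
        if tag in CLICKABLE:
            self.index[len(self.index)] = (tag, dict(attrs))

indexer = ClickableIndexer()
indexer.feed('<a href="/home">Home</a><button id="buy">Buy</button><p>text</p>')
```

Non-interactive elements (the `<p>` here) are deliberately excluded, which keeps the model's action space small.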

Performance Metrics

  • Speed: Claimed 3-5x faster than general-purpose model backends (with ChatBrowserUse)
  • Pricing: $0.20/M input tokens, $2.00/M output (ChatBrowserUse)
  • Production: Cloud option with stealth browsers, anti-CAPTCHA
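The pricing above translates to very low per-task costs. A back-of-envelope calculator (the per-step token counts in the example are illustrative guesses, not measured figures):

```python
# Cost estimate from the quoted ChatBrowserUse pricing:
# $0.20 per million input tokens, $2.00 per million output tokens.
def run_cost(input_tokens: int, output_tokens: int,
             in_per_m: float = 0.20, out_per_m: float = 2.00) -> float:
    return input_tokens / 1e6 * in_per_m + output_tokens / 1e6 * out_per_m

# Hypothetical 50-step task at ~4k input / ~200 output tokens per step:
cost = run_cost(50 * 4000, 50 * 200)  # 200k input, 10k output tokens
```

At those (assumed) token counts the task costs about six cents, which is why per-step screenshot size and history length dominate the economics.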

Key Strengths

  • Production-ready infrastructure:
    • Sandbox deployment with @sandbox() decorator
    • Cloud option for scalability
    • Stealth mode (fingerprinting, proxy rotation)
  • CLI for rapid iteration: browser-use open/click/type/screenshot
  • Active development: Daily updates, strong community
  • Authentication support: Real browser profiles, session persistence
  • Integration: Works with Claude Code, multiple LLM providers

Key Limitations

  • Newer framework (less academic validation)
  • Best performance requires ChatBrowserUse model (proprietary)
  • CAPTCHA handling requires cloud version

Use Cases

  • Job application automation
  • Grocery shopping (Instacart integration)
  • PC part sourcing
  • Form filling
  • Multi-site data aggregation

5. Agent-Q (Reinforcement Learning Approach)

Overview

Research from Sentient Engineering (Aug 2024), uses Monte Carlo Tree Search + DPO finetuning.

Technical Approach

  • Architecture: Multiple options:
    • Planner → Navigator multi-agent
    • Solo planner-actor
    • Actor ↔ Critic multi-agent
    • Actor-Critic + MCTS + DPO finetuning
  • Learning: Generates DPO training pairs from MCTS exploration
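The MCTS-to-DPO step can be sketched abstractly: at each decision point, the action with the highest search value becomes the "chosen" completion and lower-valued siblings become "rejected". The data structures here are invented for illustration and are not Agent-Q's actual format.

```python
# Sketch of turning tree-search value estimates into DPO preference
# pairs: the best-valued action at a node is preferred over each of
# its lower-valued siblings.
def dpo_pairs(node_actions: dict) -> list:
    """node_actions maps action text -> MCTS value estimate."""
    ranked = sorted(node_actions, key=node_actions.get, reverse=True)
    best = ranked[0]
    return [(best, worse) for worse in ranked[1:]]

pairs = dpo_pairs({"click 'Checkout'": 0.8, "click 'Home'": 0.1, "scroll": 0.3})
```

Each pair then feeds a standard DPO loss, so exploration that never reaches a reward still produces training signal from relative rankings.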

Performance Metrics

  • Research-focused, specific benchmarks not widely published yet
  • Emphasis on learning and improvement over time

Key Strengths

  • Advanced reasoning architecture
  • Self-improvement via reinforcement learning
  • Multiple agent architectures for different complexity levels
  • Open-source implementation

Key Limitations

  • More research-oriented than production-ready
  • Requires significant computational resources for MCTS
  • Less documentation for practical deployment

Use Cases

  • Research on agent learning
  • Complex reasoning tasks
  • Long-horizon planning experiments

6. OpenAI Operator (Rumored/Upcoming - Jan 2025)

Overview

According to third-party benchmark leaderboards, OpenAI has a system called "Operator" in testing.

Performance Metrics (Reported)

  • WebArena: 58% (best overall as of Sept 2025)
  • OSWorld: 38% (best overall)
  • Significantly ahead of public models

Status

  • Not yet publicly available as of Feb 2026
  • Proprietary model and data
  • Performance claims from third-party benchmarks

Benchmark Standards (Feb 2026)

OSWorld (Most Comprehensive)

  • 369 tasks on real Ubuntu/Windows/macOS environments
  • Best performance: 38% (OpenAI Operator), 29.9% (ARPO with RL)
  • Human performance: 72.36%
  • Key finding: "Significant deficiencies in GUI grounding and operational knowledge"

WebArena

  • 812 tasks on functional websites (e-commerce, forums, dev tools)
  • Best performance: 58% (OpenAI Operator)
  • GPT-4 baseline: 14.41%
  • Human performance: 78.24%

VisualWebArena

  • Multimodal tasks requiring visual information
  • Reveals gaps where text-only agents fail
  • Important for realistic web tasks (visual layouts, images, charts)

Mind2Web / Multimodal-Mind2Web

  • 7,775 training actions, 3,500+ test actions
  • Real-world websites with human annotations
  • Now includes screenshot+HTML alignment (Hugging Face dataset)

Key Findings: What Actually Works in 2026

1. Multimodal > Text-Only (Consistently)

All benchmarks show visual information significantly improves accuracy. Text-only HTML parsing misses layout, images, visual cues.

2. Production Readiness Varies Wildly

  • Production: Anthropic Computer Use, Browser-Use, SeeAct
  • Research: WebVoyager, Agent-Q, most academic tools
  • Gap: Most papers don't handle auth, CAPTCHAs, rate limits, etc.

3. Speed vs. Accuracy Tradeoff

  • ChatBrowserUse: Optimized for speed (3-5x faster)
  • GPT-4V: More accurate but slower
  • Local models (LLaVA): Fast but less accurate

4. Complex Tasks Still Fail Most of the Time

  • Even best systems: 38-60% on benchmarks
  • Humans: 72-78%
  • Main failures: Long-horizon planning, GUI grounding, handling errors

5. Set-of-Mark (SoM) Grounding Works

Visual overlays with element markers significantly improve click accuracy. Used by SeeAct, many recent systems.
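The bookkeeping behind SoM is simple: number each element's bounding box, draw the numbers onto the screenshot (drawing omitted here), and map the model's "[N]" answer back to pixel coordinates. A minimal stdlib sketch:

```python
# Minimal Set-of-Mark bookkeeping: assign numeric marks to element
# bounding boxes, then resolve a model answer like "click [1]" back
# to the marked element's center point.
import re

def assign_marks(boxes: list) -> dict:
    """boxes are (x1, y1, x2, y2) tuples; returns mark -> center point."""
    return {i: ((x1 + x2) // 2, (y1 + y2) // 2)
            for i, (x1, y1, x2, y2) in enumerate(boxes)}

def resolve_click(answer: str, marks: dict) -> tuple:
    mark = int(re.search(r"\[(\d+)\]", answer).group(1))
    return marks[mark]

marks = assign_marks([(0, 0, 100, 40), (120, 0, 220, 40)])
point = resolve_click("click [1]", marks)  # center of the second box
```

The accuracy gain comes from the same constraint as text choices: the model only ever names a mark that was actually drawn, so clicks cannot land on arbitrary pixels.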

6. Context Length Matters

Longer text-based action history helps, but screenshot-only history does not. This suggests models need a semantic record of past steps, not just visual memory.


Recommendations by Use Case

For Production Automation (Reliability Priority)

Choose: Browser-Use with ChatBrowserUse or Anthropic Computer Use

  • Why: Production infrastructure, safety measures, active support
  • Tradeoff: Cost vs. control

For Research/Experimentation (Flexibility Priority)

Choose: SeeAct or WebVoyager

  • Why: Open-source, multiple backends, active development
  • Tradeoff: More setup, less hand-holding

For Learning/Adaptation (Future-Proofing)

Choose: Agent-Q or MCTS-based approaches

  • Why: RL enables improvement over time
  • Tradeoff: Complexity, computational cost

For Maximum Accuracy (Cost No Object)

Choose: OpenAI Operator (when available) or GPT-4V + SeeAct

  • Why: Best benchmark scores
  • Tradeoff: Proprietary, expensive, may not be public

Critical Gaps (Still Unsolved in 2026)

  1. Long-Horizon Planning: Tasks >15 steps fail frequently
  2. Error Recovery: Agents don't gracefully handle failures
  3. GUI Grounding: Finding the right element remains hard
  4. Operational Knowledge: Knowing how websites work (not just seeing them)
  5. Speed: Visual inference is slow (hundreds of ms per action)
  6. Robustness: UI changes, pop-ups, unexpected dialogs break agents
  7. Authentication: Login, CAPTCHA, 2FA mostly unsolved without manual help
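One common partial mitigation for gaps 2 and 6 is to wrap each action in a retry loop that re-observes the page before retrying, rather than replaying a stale plan. A sketch with placeholder `action`/`observe` callables (the stale-element failure below is simulated):

```python
# Retry wrapper that takes a fresh observation before each attempt,
# so a transient failure (pop-up, stale element) can be recovered
# instead of crashing the whole task.
import time

def with_recovery(action, observe, retries: int = 2, delay: float = 0.0):
    last_err = None
    for _ in range(retries + 1):
        try:
            return action(observe())      # act on a fresh observation
        except Exception as err:          # e.g. element went stale
            last_err = err
            time.sleep(delay)             # back off, let the UI settle
    raise last_err

calls = []
def flaky(state):
    calls.append(state)
    if len(calls) < 2:                    # fail once, then succeed
        raise RuntimeError("stale element")
    return "ok"

result = with_recovery(flaky, lambda: "fresh screenshot")
```

This handles transient failures only; the planning-level recovery the list describes (noticing the task has gone off the rails) remains open.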

Timeline of Progress

  • July 2023: WebArena benchmark released (14% GPT-4 success)
  • Jan 2024: SeeAct, WebVoyager published (multimodal wins confirmed)
  • April 2024: OSWorld released (real OS benchmark, <15% all models)
  • Aug 2024: Agent-Q paper (RL for web agents)
  • Oct 2024: Anthropic Computer Use beta launched
  • Sept 2025: OpenAI Operator rumored (58% WebArena per leaderboards)
  • Feb 2026: Browser-Use active development, ChatBrowserUse optimized

Conclusion

Best for complex multi-step tasks in Feb 2026:

  1. Anthropic Computer Use - Most reliable production system, proven by major companies
  2. Browser-Use + ChatBrowserUse - Fastest iteration, best cost/performance for production
  3. SeeAct + GPT-4V - Best open-source accuracy, flexible deployment
  4. WebVoyager - Strong research baseline, good benchmark results

Reality check: Even the best systems fail 40-60% of the time on realistic tasks. Human-level performance (>70%) remains elusive. The field is rapidly improving but still has fundamental challenges in planning, grounding, and robustness.

Key insight: The tool matters less than the task. Simple tasks (form filling, single clicks) work well. Complex multi-step workflows across multiple pages still require human oversight and intervention.


Report compiled: February 5, 2026
Status: Active research area, tools updating constantly