# Visual/Screenshot-Based Browser Automation Tools Research

## February 2026 State-of-the-Art

**Research Date:** February 5, 2026

**Focus:** Tools using screenshots + vision models for web navigation

**Key Metrics:** Accuracy, Speed, Complex Multi-Step Task Completion

---

## Executive Summary

As of February 2026, visual web agents using screenshots + vision models have made significant progress but still lag far behind human performance. The best systems achieve ~58-60% success on simplified benchmarks but only 12-38% on real-world computer tasks. Key challenges remain in GUI grounding, long-horizon planning, and operational knowledge.

**Top Performers (by category):**

- **Production Ready:** Browser-Use (with the ChatBrowserUse model), Anthropic Computer Use
- **Research/Accuracy:** SeeAct + GPT-4V, WebVoyager, Agent-Q
- **Benchmarking Standard:** OSWorld, WebArena, VisualWebArena

---
## 1. Anthropic Computer Use (Claude 3.5 Sonnet)

### Overview

Released October 2024, updated December 2024. The first frontier AI model with a publicly available computer use capability.

### Technical Approach

- **Method:** Screenshot-based visual perception + cursor/keyboard control
- **Model:** Claude 3.5 Sonnet with a specialized computer use API
- **Architecture:** Sees the screen, moves the cursor, clicks buttons, types text
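The request shape can be sketched as below. This is a minimal sketch that only builds the payload (no API call is made): the model name and the `computer_20241022` tool version are from the October 2024 beta and may have changed, and the surrounding agent loop that screenshots the desktop and executes the returned actions is omitted.

```python
# Sketch of an Anthropic computer-use request payload (Oct 2024 beta shape).
# Assumption: tool type/version strings match the beta docs; verify before use.
def build_computer_use_request(task: str, width: int = 1280, height: int = 800) -> dict:
    """Build the request body for a computer-use turn (payload only, no call)."""
    return {
        "model": "claude-3-5-sonnet-20241022",
        "max_tokens": 1024,
        "tools": [{
            "type": "computer_20241022",     # screenshot + mouse/keyboard tool
            "name": "computer",
            "display_width_px": width,       # the model reasons in these pixel coords
            "display_height_px": height,
        }],
        "messages": [{"role": "user", "content": task}],
    }

request = build_computer_use_request("Open the settings page and enable dark mode")
```

In the real loop, the response contains tool-use blocks (click/type/screenshot actions) that the caller must execute and report back as tool results.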
### Performance Metrics

- **OSWorld benchmark:** 14.9% (screenshot-only), 22.0% (when allowed more steps)
- **Complex tasks:** Can handle tasks with dozens to hundreds of steps
- **Reliability:** Still "cumbersome and error-prone," per Anthropic's own assessment
- **Early adopters:** Replit, Asana, Canva, DoorDash, The Browser Company

### Key Strengths

- Production-grade API available (Anthropic API, AWS Bedrock, Google Vertex AI)
- Integrated safety classifiers for harm detection
- Strong coding performance (49% on SWE-bench Verified)

### Key Limitations

- Actions like scrolling, dragging, and zooming remain challenging
- Error-prone on complex workflows
- Still experimental/beta quality

### Use Cases

- Software development automation (Replit Agent)
- Multi-step workflow automation
- App evaluation during development

---
## 2. SeeAct (GPT-4V-Based Web Agent)

### Overview

Published January 2024 (ICML'24); open source from the OSU NLP Group.

### Technical Approach

- **Model:** GPT-4V (vision); Gemini and LLaVA also supported
- **Grounding:** Text choices + Set-of-Mark (SoM) overlays
- **Framework:** Playwright-based; runs on live websites
### Performance Metrics

- **Accuracy:** Strong on grounded element selection
- **Multimodal:** Significantly outperforms text-only approaches
- **Mind2Web dataset:** Evaluated on 1,000+ real-world tasks
- **Real websites:** Tested on 15+ popular sites (Google, Amazon, Reddit, etc.)

### Key Strengths

- **Production-ready Python package:** `pip install seeact`
- Supports multiple LMM backends (GPT-4V, Gemini, local LLaVA)
- Chrome extension available (SeeActChromeExtension)
- Strong element grounding with SoM visual prompting
- Active maintenance and updates

### Key Limitations

- Requires manual safety monitoring (defaults to a manual confirmation mode)
- No auto-login support (a deliberate security measure)
- Can be slow on complex multi-page workflows

### Use Cases

- Web scraping and data extraction
- Form-filling automation
- Research and information gathering
- Testing and QA automation

---
## 3. WebVoyager

### Overview

Published January 2024 (ACL'24), Tencent AI Lab.

### Technical Approach

- **Model:** GPT-4V for multimodal reasoning
- **Environment:** Selenium-based; interacts with a real browser
- **Planning:** Generalist planning approach fusing visual and textual observations
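The observe-reason-act loop behind agents like WebVoyager can be sketched as follows. `browser` and `model` are hypothetical stand-ins (in the real system the browser is driven via Selenium and the model is GPT-4V); only the control flow is shown:

```python
# Skeleton of a screenshot -> reason -> act agent loop.
# `browser` and `model` are injected stand-ins, not a real API.
def run_agent(task, browser, model, max_steps=15):
    """Run the loop until the model answers or the step budget is spent."""
    history = []
    for _ in range(max_steps):
        screenshot = browser.screenshot()                 # visual observation
        action = model.decide(task, screenshot, history)  # e.g. {"type": "click", ...}
        if action["type"] == "answer":                    # task finished
            return action["content"]
        browser.execute(action)                           # click / type / scroll
        history.append(action)                            # text history aids planning
    return None                                           # budget exhausted, no answer
```

The `max_steps` cap matters in practice: as noted under Critical Gaps below, tasks needing more than ~15 steps fail frequently, so most systems bound the loop rather than let the agent wander.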
### Performance Metrics

- **Task success rate:** 59.1% on its own benchmark (15 websites, 643 tasks)
- **vs. text-only GPT-4:** Significantly better
- **vs. text-only input:** The multimodal setup consistently outperforms
- **GPT-4V auto-evaluation:** 85.3% agreement with human judgment

### Key Strengths

- End-to-end task completion on real websites
- Strong performance on diverse web tasks
- Automated evaluation protocol using GPT-4V
- Handles Booking.com, Google Flights, ArXiv, BBC News, etc.

### Key Limitations

- Some tasks are time-sensitive and need manual updates
- Non-deterministic results despite temperature = 0
- 59.1% success is still far from human level
- Requires site-specific setup

### Use Cases

- Travel booking automation
- News and research aggregation
- Cross-website information synthesis
- Complex multi-step web workflows

---
## 4. Browser-Use (Open-Source Framework)

### Overview

A modern open-source framework (in active development as of February 2026), optimized for production use.

### Technical Approach

- **Models:** ChatBrowserUse (optimized), GPT-4o, Gemini, LLaVA, local models
- **Architecture:** Playwright-based, with a cloud scaling option
- **Grounding:** State-based, with clickable-element indexing
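The clickable-element indexing idea can be illustrated with a small sketch (hypothetical data and function names, not Browser-Use's actual internals): interactive elements are numbered in the serialized page state, the model acts by index, and the framework maps the index back to a concrete element:

```python
# Sketch of state-based grounding via clickable-element indexing.
# Element dicts are hypothetical stand-ins for parsed interactive DOM nodes.
def index_elements(elements):
    """Serialize elements with numeric indices; return (state text, lookup)."""
    lines, lookup = [], {}
    for i, el in enumerate(elements):
        lookup[i] = el                       # index -> element, for acting later
        lines.append(f"[{i}]<{el['tag']}>{el['text']}</{el['tag']}>")
    return "\n".join(lines), lookup

state, lookup = index_elements([
    {"tag": "input", "text": "Search"},
    {"tag": "button", "text": "Submit"},
])
# The model would answer with something like {"action": "click", "index": 1},
# and the framework clicks lookup[1] via Playwright.
```

Compared with pixel-coordinate clicking, acting by index sidesteps much of the GUI-grounding problem: the model picks from an enumerated list instead of locating a point on the screenshot.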
### Performance Metrics

- **Speed:** 3-5x faster than other models (with ChatBrowserUse)
- **Pricing:** $0.20/M input tokens, $2.00/M output tokens (ChatBrowserUse)
- **Production:** Cloud option with stealth browsers and anti-CAPTCHA measures

### Key Strengths

- **Production-ready infrastructure:**
  - Sandbox deployment with the `@sandbox()` decorator
  - Cloud option for scalability
  - Stealth mode (fingerprinting, proxy rotation)
- **CLI for rapid iteration:** `browser-use open/click/type/screenshot`
- **Active development:** Daily updates, strong community
- **Authentication support:** Real browser profiles, session persistence
- **Integration:** Works with Claude Code and multiple LLM providers

### Key Limitations

- Newer framework, with less academic validation
- Best performance requires the proprietary ChatBrowserUse model
- CAPTCHA handling requires the cloud version

### Use Cases

- Job application automation
- Grocery shopping (Instacart integration)
- PC part sourcing
- Form filling
- Multi-site data aggregation

---
## 5. Agent-Q (Reinforcement Learning Approach)

### Overview

Research from Sentient Engineering (August 2024); uses Monte Carlo Tree Search (MCTS) plus DPO finetuning.

### Technical Approach

- **Architecture:** Multiple options:
  - Planner → Navigator multi-agent
  - Solo planner-actor
  - Actor ↔ Critic multi-agent
  - Actor-Critic + MCTS + DPO finetuning
- **Learning:** Generates DPO training pairs from MCTS exploration
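The MCTS-to-DPO step can be sketched as follows: actions explored from the same state are ranked by their MCTS value estimates, and the best and worst siblings become a (chosen, rejected) preference pair for DPO finetuning. The node structure and values below are hypothetical, for illustration only:

```python
# Sketch: turn MCTS exploration results into DPO preference pairs.
# Each node is {"state": ..., "action": ..., "value": mcts_value_estimate}.
def dpo_pairs_from_tree(nodes):
    """Group nodes by state; emit (state, chosen_action, rejected_action)."""
    by_state = {}
    for n in nodes:
        by_state.setdefault(n["state"], []).append(n)
    pairs = []
    for state, siblings in by_state.items():
        if len(siblings) < 2:
            continue                      # need at least two actions to rank
        ranked = sorted(siblings, key=lambda n: n["value"], reverse=True)
        pairs.append((state, ranked[0]["action"], ranked[-1]["action"]))
    return pairs

nodes = [
    {"state": "checkout", "action": "click_pay", "value": 0.9},
    {"state": "checkout", "action": "click_back", "value": 0.2},
    {"state": "home", "action": "search", "value": 0.6},   # no sibling: skipped
]
pairs = dpo_pairs_from_tree(nodes)
```

This is why the approach can self-improve: every MCTS rollout produces preference data for free, without human labels.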
### Performance Metrics

- Research-focused; specific benchmark numbers are not yet widely published
- Emphasis on learning and improvement over time

### Key Strengths

- Advanced reasoning architecture
- Self-improvement via reinforcement learning
- Multiple agent architectures for different complexity levels
- Open-source implementation

### Key Limitations

- More research-oriented than production-ready
- MCTS requires significant computational resources
- Less documentation for practical deployment

### Use Cases

- Research on agent learning
- Complex reasoning tasks
- Long-horizon planning experiments

---
## 6. OpenAI Operator (Rumored/Upcoming, Jan 2025)

### Overview

According to benchmark sources, OpenAI has a system called "Operator" in testing.

### Performance Metrics (Reported)

- **WebArena:** 58% (best overall as of September 2025)
- **OSWorld:** 38% (best overall)
- Significantly ahead of publicly available models

### Status

- Not yet publicly available as of February 2026
- Proprietary model and data
- Performance claims come from third-party benchmarks

---
## Benchmark Standards (Feb 2026)

### OSWorld (Most Comprehensive)

- **369 tasks** on real Ubuntu/Windows/macOS environments
- **Best performance:** 38% (OpenAI Operator), 29.9% (ARPO with RL)
- **Human performance:** 72.36%
- **Key finding:** "Significant deficiencies in GUI grounding and operational knowledge"

### WebArena

- **812 tasks** on functional websites (e-commerce, forums, dev tools)
- **Best performance:** 58% (OpenAI Operator)
- **GPT-4 baseline:** 14.41%
- **Human performance:** 78.24%

### VisualWebArena

- **Multimodal tasks** that require visual information
- Reveals gaps where text-only agents fail
- Important for realistic web tasks (visual layouts, images, charts)

### Mind2Web / Multimodal-Mind2Web

- **7,775 training actions**, 3,500+ test actions
- Real-world websites with human annotations
- Now includes screenshot+HTML alignment (Hugging Face dataset)

---
## Key Findings: What Actually Works in 2026

### 1. Multimodal > Text-Only (Consistently)

All benchmarks show that visual information significantly improves accuracy. Text-only HTML parsing misses layout, images, and visual cues.

### 2. Production Readiness Varies Wildly

- **Production:** Anthropic Computer Use, Browser-Use, SeeAct
- **Research:** WebVoyager, Agent-Q, most academic tools
- **The gap:** Most papers don't handle auth, CAPTCHAs, rate limits, etc.

### 3. Speed vs. Accuracy Tradeoff

- ChatBrowserUse: Optimized for speed (3-5x faster)
- GPT-4V: More accurate but slower
- Local models (LLaVA): Fast but less accurate

### 4. Complex Tasks Still Fail Most of the Time

- Even the best systems: 38-60% on benchmarks
- Humans: 72-78%
- Main failure modes: long-horizon planning, GUI grounding, error handling

### 5. Set-of-Mark (SoM) Grounding Works

Visual overlays with element markers significantly improve click accuracy. Used by SeeAct and many recent systems.

### 6. Context Length Matters

Longer text-based history helps, but screenshot-only history doesn't. This suggests models need semantic understanding, not just visual memory.

---
## Recommendations by Use Case

### For Production Automation (Reliability Priority)

**Choose:** Browser-Use with ChatBrowserUse, or Anthropic Computer Use

- Why: Production infrastructure, safety measures, active support
- Tradeoff: Cost vs. control

### For Research/Experimentation (Flexibility Priority)

**Choose:** SeeAct or WebVoyager

- Why: Open source, multiple backends, active development
- Tradeoff: More setup, less hand-holding

### For Learning/Adaptation (Future-Proofing)

**Choose:** Agent-Q or other MCTS-based approaches

- Why: RL enables improvement over time
- Tradeoff: Complexity and computational cost

### For Maximum Accuracy (Cost No Object)

**Choose:** OpenAI Operator (when available) or SeeAct + GPT-4V

- Why: Best benchmark scores
- Tradeoff: Proprietary, expensive, may not become publicly available

---
## Critical Gaps (Still Unsolved in 2026)

1. **Long-Horizon Planning:** Tasks longer than ~15 steps fail frequently
2. **Error Recovery:** Agents don't gracefully handle failures
3. **GUI Grounding:** Finding the right element remains hard
4. **Operational Knowledge:** Knowing *how* websites work (not just seeing them)
5. **Speed:** Visual inference is slow (hundreds of ms per action)
6. **Robustness:** UI changes, pop-ups, and unexpected dialogs break agents
7. **Authentication:** Login, CAPTCHA, and 2FA remain mostly unsolved without manual help

---
## Timeline of Progress

- **July 2023:** WebArena benchmark released (~14% GPT-4 success)
- **January 2024:** SeeAct and WebVoyager published (multimodal advantage confirmed)
- **April 2024:** OSWorld released (real-OS benchmark; all models under 15%)
- **August 2024:** Agent-Q paper (RL for web agents)
- **October 2024:** Anthropic Computer Use beta launched
- **September 2025:** OpenAI Operator rumored (58% WebArena per leaderboards)
- **February 2026:** Browser-Use in active development; ChatBrowserUse optimized

---
## Conclusion

**Best for complex multi-step tasks in Feb 2026:**

1. **Anthropic Computer Use** - Most reliable production system, proven by major companies
2. **Browser-Use + ChatBrowserUse** - Fastest iteration, best cost/performance for production
3. **SeeAct + GPT-4V** - Best open-source accuracy, flexible deployment
4. **WebVoyager** - Strong research baseline, good benchmark results

**Reality check:** Even the best systems fail 40-60% of the time on realistic tasks. Human-level performance (>70%) remains elusive. The field is improving rapidly but still faces fundamental challenges in planning, grounding, and robustness.

**Key insight:** The tool matters less than the task. Simple tasks (form filling, single clicks) work well; complex multi-step workflows across multiple pages still require human oversight and intervention.

---
## Sources

- Anthropic Computer Use announcement (Oct 2024; Dec 2024 updates)
- SeeAct (ICML'24): https://github.com/OSU-NLP-Group/SeeAct
- WebVoyager (ACL'24): https://github.com/MinorJerry/WebVoyager
- Browser-Use: https://github.com/browser-use/browser-use
- Agent-Q: https://github.com/sentient-engineering/agent-q
- OSWorld: https://os-world.github.io/ (OSWorld-Verified, July 2025)
- WebArena: https://webarena.dev/
- VisualWebArena: https://jykoh.com/vwa
- Third-party benchmarks: o-mega.ai and emergentmind.com leaderboards

---

**Report compiled:** February 5, 2026

**Status:** Active research area; tools are updating constantly