
macOS Computer Use Tools for AI Agents — Deep Research (Feb 2026)

Context: Evaluating the best "computer use" tools/frameworks for AI agents running on an always-on Mac mini M-series (specifically for Clawdbot/OpenClaw-style automation).


Table of Contents

  1. Anthropic Computer Use
  2. Apple Accessibility APIs
  3. Peekaboo
  4. Open Interpreter
  5. Other Frameworks
  6. Comparison Matrix
  7. Recommendations for Mac Mini Agent Setup
  8. Headless / SSH Considerations

1. Anthropic Computer Use

GitHub: anthropics/anthropic-quickstarts
Mac forks: deedy/mac_computer_use, PallavAg/claude-computer-use-macos, newideas99/Anthropic-Computer-Use-MacOS

How It Works

  • Screenshot-based. The model receives screenshots and reasons about pixel coordinates.
  • Claude sends actions (mouse_move, click, type, screenshot) to a local executor.
  • On macOS, the executor uses cliclick for mouse/keyboard and screencapture for screenshots.
  • The model identifies coordinates by "counting pixels" — trained specifically for coordinate estimation.
  • Anthropic recommends XGA (1024×768) or WXGA (1280×800) resolution for best accuracy.
  • The official demo uses Docker + Ubuntu (xdotool). macOS forks replace xdotool with cliclick and native screencapture.
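The action-to-executor mapping in the macOS forks can be sketched as follows. This is a minimal illustration, not any fork's actual code; the action names follow Anthropic's computer-use tool schema, and the cliclick syntax (`m:`, `c:`, `t:`, `kp:`) is cliclick's real command format.

```python
# Sketch: translating model-issued computer-use actions into cliclick argv
# lists, the way the macOS forks replace xdotool. Illustrative only.

def to_cliclick(action: str, **kw) -> list[str]:
    """Map a computer-use tool action to a cliclick invocation."""
    if action == "mouse_move":
        return ["cliclick", f"m:{kw['x']},{kw['y']}"]
    if action == "left_click":
        return ["cliclick", f"c:{kw['x']},{kw['y']}"]
    if action == "type":
        return ["cliclick", f"t:{kw['text']}"]
    if action == "key":
        return ["cliclick", f"kp:{kw['key']}"]  # e.g. "return", "esc"
    raise ValueError(f"unsupported action: {action}")

# One agent step: execute the mapped command, then re-screenshot for the model.
# subprocess.run(to_cliclick("left_click", x=512, y=384), check=True)
# subprocess.run(["screencapture", "-x", "/tmp/screen.png"], check=True)
```

Each cycle then uploads the fresh screenshot and waits for the next action, which is where the per-step latency comes from.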

Speed / Latency

  • Slow. Each action cycle involves: screenshot → upload image → API inference → parse response → execute action.
  • A single click-and-verify cycle takes 3-8 seconds depending on API latency.
  • Multi-step tasks (e.g., open Safari, navigate, search) can take 30-120+ seconds.
  • Screenshot upload adds ~1-3s overhead per cycle (images are typically 100-500KB).

Reliability

  • Moderate. Coordinate estimation works well for large, distinct UI elements.
  • Struggles with small buttons, dense UIs, and similar-looking elements.
  • No DOM/accessibility tree awareness — purely visual. If the UI changes between screenshot and action, clicks can miss.
  • Self-correction loop helps: model takes new screenshots after each action.
  • Prone to prompt injection from on-screen text (major security concern).
  • Simon Willison's testing (Oct 2024): works for simple tasks, fails on complex multi-step workflows.

Setup Complexity

  • Moderate. Requires: Python 3.12+, cliclick (brew install cliclick), Anthropic API key, macOS Accessibility permissions.
  • Mac forks require cloning a repo + setting up a venv + environment variables.
  • Some forks include a Streamlit UI for interactive testing.
  • Must grant Terminal/Python Accessibility permissions in System Settings → Privacy & Security → Accessibility.

Headless / SSH

  • Problematic. screencapture requires WindowServer (a GUI session).
  • Over pure SSH without a display, screencapture fails silently or returns black images.
  • Workaround: Use an HDMI dummy plug + Screen Sharing (VNC), or connect via Apple Remote Desktop. The screencapture then works against the VNC session.
  • Not designed for headless operation.

Cost

  • API costs only. Anthropic API pricing (Feb 2026):
    • Claude Sonnet 4.5: $3/M input tokens, $15/M output tokens
    • Claude Opus 4.5: $5/M input tokens, $25/M output tokens
    • Each screenshot is ~1,500-3,000 tokens (image tokens)
    • A 10-step task might cost $0.05-0.30 depending on model and complexity
  • Computer use itself is free — you run the executor locally.
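The per-task estimate above follows from simple arithmetic. The per-step token counts below are rough assumptions consistent with the figures in this section (not measured values), priced at the Sonnet 4.5 rates:

```python
# Back-of-envelope cost model for a screenshot-driven task at Sonnet 4.5
# rates. Token counts per step are assumptions, not measurements.

SONNET_IN = 3.00 / 1_000_000    # $ per input token
SONNET_OUT = 15.00 / 1_000_000  # $ per output token

def step_cost(image_tokens=2000, prompt_tokens=1000, output_tokens=300):
    """Cost of one screenshot -> inference -> action cycle."""
    return (image_tokens + prompt_tokens) * SONNET_IN + output_tokens * SONNET_OUT

ten_step = 10 * step_cost()
print(f"~${ten_step:.3f} for a 10-step task")  # lands in the $0.05-0.30 range
```

Longer conversations cost more than this linear model suggests, since prior screenshots may stay in context and get re-billed as input tokens each turn.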

Reddit Sentiment

  • Excited but cautious. r/Anthropic thread on unsandboxed Mac use got 24 upvotes, with comments calling it "dangerous but cool."
  • r/macmini discussions show interest in buying Mac Minis specifically for this use case.
  • Common complaints: slow, expensive at scale, not reliable enough for unsupervised use.
  • Benjamin Anderson's blog post captures the zeitgeist: "Claude needs his own computer" — the coding agent + computer use convergence thesis.

2. Apple Accessibility APIs

Documentation: Apple Mac Automation Scripting Guide

How It Works

  • Accessibility tree-based. macOS exposes every UI element (buttons, text fields, menus, etc.) through the Accessibility framework (AXUIElement API).
  • Three access methods:
    1. AppleScript / osascript: tell application "System Events" → tell process "Finder" → click button "OK". High-level scripting, easy to write.
    2. JXA (JavaScript for Automation): Same capabilities as AppleScript, written in JavaScript. Run via osascript -l JavaScript.
    3. AXUIElement (C/Swift/Python via pyobjc): Low-level programmatic access to the full accessibility tree. Can enumerate all UI elements, read properties (role, title, position, size), and perform actions (press, set value, etc.).
  • Does NOT rely on screenshots — reads the actual UI element tree.
  • Can traverse the entire hierarchy: Application → Window → Group → Button → etc.
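The osascript route is easy to drive from any language. A minimal Python sketch, mirroring the System Events example above (the `press_button` helper is illustrative; `osascript` itself is the real macOS built-in, and the calling process needs Accessibility permission):

```python
# Sketch: driving System Events from Python via the built-in osascript
# binary. press_button is a hypothetical helper that builds the script text.
import subprocess

def press_button(process: str, button: str) -> str:
    """Build an AppleScript line that clicks a named button in an app's front window."""
    return (
        f'tell application "System Events" to '
        f'tell process "{process}" to click button "{button}" of window 1'
    )

def run_osascript(script: str) -> str:
    """Execute an AppleScript string; macOS only, requires Accessibility access."""
    out = subprocess.run(["osascript", "-e", script],
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()

# run_osascript(press_button("Finder", "OK"))
# JXA variant: subprocess.run(["osascript", "-l", "JavaScript", "-e", src])
```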

Speed / Latency

  • Fast. AppleScript commands execute in 10-100ms. AXUIElement API calls are typically 1-10ms.
  • No image capture, no network round-trip, no model inference.
  • Menu clicks, text entry, window management — all near-instantaneous.
  • Can enumerate hundreds of UI elements in <100ms.

Reliability

  • High for supported apps. Most native macOS apps and many Electron apps expose accessibility info.
  • Apple's own apps (Finder, Safari, Mail, Calendar, Notes) have excellent accessibility support.
  • Electron apps (VS Code, Slack, Discord) expose basic accessibility but may have gaps.
  • Web content in browsers is accessible via accessibility APIs (each DOM element maps to an AX element).
  • Failure modes: Apps with custom rendering (games, some media apps) may not expose UI elements. Some apps have broken accessibility annotations.

Setup Complexity

  • Low. AppleScript is built into macOS — no installation needed.
  • osascript is available in every terminal.
  • For Python access: pip install pyobjc-framework-ApplicationServices
  • Critical requirement: Must enable Accessibility permissions for the calling application (Terminal, Python, etc.) in System Settings → Privacy & Security → Accessibility.
  • For automation across apps: System Settings → Privacy & Security → Automation.

Headless / SSH

  • Partially works. AppleScript/osascript commands work over SSH if a GUI session is active (user logged in).
  • AXUIElement requires WindowServer to be running.
  • Works well with headless Mac Mini + HDMI dummy plug + remote login session.
  • osascript may throw "not allowed assistive access" errors over SSH — the calling process (sshd, bash) needs to be in the Accessibility allow list.
  • Workaround: Save scripts as .app bundles, grant them Accessibility access, then invoke from SSH.

Cost

  • Free. Built into macOS, no API costs.

Best For

  • Structured automation: "Click the Save button in TextEdit" rather than "figure out what's on screen."
  • Fast, deterministic workflows where you know the target app and UI structure.
  • Combining with an LLM: Feed the accessibility tree to an LLM, let it decide which element to interact with. This is what Peekaboo, mcp-server-macos-use, and macOS-use all do under the hood.
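The "feed the tree to an LLM" pattern amounts to flattening the hierarchy into compact, indexed lines the model can pick from. A sketch (the `role`/`title`/`children` dict schema here is illustrative; a real implementation would build it from AXUIElement via pyobjc):

```python
# Sketch: flatten an accessibility tree into numbered lines for an LLM.
# The input dict schema is illustrative, not an Apple API.

def flatten(node, depth=0, out=None):
    """Depth-first walk producing '[index] role: title' lines."""
    out = [] if out is None else out
    out.append(f"[{len(out)}] {'  ' * depth}{node['role']}: {node.get('title', '')}")
    for child in node.get("children", []):
        flatten(child, depth + 1, out)
    return out

tree = {"role": "AXWindow", "title": "Untitled",
        "children": [{"role": "AXButton", "title": "Save"},
                     {"role": "AXTextArea", "title": ""}]}
print("\n".join(flatten(tree)))
# The LLM replies with an index (e.g. "press [1]"); the agent maps it back
# to the live AXUIElement and performs the AXPress action.
```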

Limitations

  • No visual understanding. Can't interpret images, charts, or custom-drawn content.
  • Fragile element references: If an app updates, button names/positions may change.
  • Permission hell: Each calling app needs separate Accessibility + Automation grants. Can't grant to osascript directly (it's not an .app).

3. Peekaboo

GitHub: steipete/Peekaboo
Website: peekaboo.boo
Author: Peter Steinberger (well-known iOS/macOS developer)

How It Works

  • Hybrid: screenshot + accessibility tree. This is Peekaboo's killer feature.
  • The see command captures a screenshot AND overlays element IDs from the accessibility tree, creating an annotated snapshot.
  • The click command can target elements by: accessibility ID, label text, or raw coordinates.
  • Full GUI automation suite: click, type, press, hotkey, scroll, swipe, drag, move, window management, app control, menu interaction, dock control, dialog handling, Space switching.
  • Native Swift CLI — compiled binary, not Python. Fast and deeply integrated with macOS APIs.
  • MCP server mode — can be used as an MCP tool by Claude Desktop, Cursor, or any MCP client.
  • Agent mode: peekaboo agent runs a natural-language multi-step automation loop (capture → LLM decide → act → repeat).
  • Supports multiple AI providers: OpenAI, Claude, Grok, Gemini, Ollama (local).
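The see-then-click loop is easy to script. A hedged sketch of driving the CLI from Python: the JSON flag and field names (`--json-output`, `elements`, `id`, `label`) are assumptions about Peekaboo's output schema, so check `peekaboo see --help` for the real interface; `pick_by_label` stands in for the decision an LLM would make.

```python
# Sketch: see -> pick element -> click, with assumed CLI flags and JSON schema.
import json, subprocess

def pick_by_label(elements: list[dict], label: str) -> str:
    """Return the ID of the element whose label matches (the LLM's job)."""
    for el in elements:
        if el.get("label") == label:
            return el["id"]
    raise LookupError(label)

# snapshot = json.loads(subprocess.check_output(
#     ["peekaboo", "see", "--json-output"]))
# subprocess.run(["peekaboo", "click",
#                 pick_by_label(snapshot["elements"], "Save")], check=True)

sample = [{"id": "B1", "label": "Save"}, {"id": "B2", "label": "Cancel"}]
print(pick_by_label(sample, "Save"))  # → B1
```

Because the click targets an element ID rather than pixel coordinates, the command still lands if the window has moved since the snapshot.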

Speed / Latency

  • Fast. Screenshot capture via ScreenCaptureKit is <100ms. Accessibility tree traversal is similarly fast.
  • Individual click/type/press commands execute in 10-50ms.
  • Agent mode latency depends on the LLM provider (1-5s per step with cloud APIs).
  • Much faster than pure screenshot-based approaches because clicks target element IDs, not pixel coordinates.

Reliability

  • High. Using accessibility IDs instead of pixel coordinates means:
    • Clicks don't miss due to resolution changes or slight UI shifts.
    • Elements are identified by semantic identity (button label, role), not visual appearance.
  • The annotated snapshot approach gives the LLM both visual context and structural data — best of both worlds.
  • Menu interaction, dialog handling, and window management are deeply integrated.
  • Created by Peter Steinberger — high-quality Swift code, actively maintained.

Setup Complexity

  • Low. brew install steipete/tap/peekaboo — single command.
  • Requires macOS 15+ (Sequoia), Screen Recording permission, Accessibility permission.
  • MCP server mode: npx @steipete/peekaboo-mcp@beta (zero-install for Node users).
  • Configuration for AI providers via peekaboo config.

Headless / SSH

  • Requires a GUI session (ScreenCaptureKit and accessibility APIs need WindowServer).
  • Works with Mac Mini + HDMI dummy plug + Screen Sharing.
  • Can be invoked over SSH if a GUI login session is active.
  • The CLI nature makes it easy to script and automate remotely.

Cost

  • Free and open-source (MIT license).
  • AI provider costs apply when using peekaboo agent or peekaboo see --analyze.
  • Local models via Ollama = zero marginal cost.

Reddit / Community Sentiment

  • Very well-received in the macOS developer community.
  • Peter Steinberger's reputation lends credibility.
  • Described as "giving AI agents eyes on macOS."
  • Praised for the hybrid screenshot+accessibility approach.
  • Active development — regular releases with new features.

Why Peekaboo Stands Out

  • Best-in-class for macOS-specific automation. It's what a senior macOS developer would build if they were making the perfect agent tool.
  • Complete command set: see, click, type, press, hotkey, scroll, swipe, drag, window, app, space, menu, menubar, dock, dialog.
  • Runnable automation scripts (.peekaboo.json).
  • Clean JSON output for programmatic consumption.

4. Open Interpreter

Website: openinterpreter.com
GitHub: OpenInterpreter/open-interpreter

How It Works

  • Primarily code execution, with experimental "OS mode" for GUI control.
  • Normal mode: LLM generates Python/bash/JS code, executes it locally.
  • OS mode (interpreter --os): Screenshot-based. Takes screenshots, sends to a vision model (GPT-4V, etc.), model reasons about actions, executes via pyautogui.
  • Also includes 01 Light hardware — a portable voice interface that connects to a home computer.

Speed / Latency

  • Normal mode (code execution): Fast — direct code execution, limited by LLM inference time.
  • OS mode: Slow — same screenshot→API→action loop as Anthropic Computer Use.
  • OS mode is explicitly labeled "highly experimental."

Reliability

  • Normal mode: Good for code-centric tasks. LLM writes code that runs on your machine.
  • OS mode: Low. Labeled as "work in progress." Community reports frequent failures.
  • Single monitor only. No multi-display support in OS mode.
  • Better at tasks that can be accomplished via code (file manipulation, API calls, data processing) than GUI interaction.

Setup Complexity

  • Low. pip install open-interpreter and interpreter --os.
  • Requires Screen Recording permissions on macOS.
  • API key for your chosen LLM provider.

Headless / SSH

  • Normal mode (code execution): Works perfectly over SSH.
  • OS mode: Requires GUI session (uses pyautogui + screenshots).

Cost

  • Free and open-source.
  • LLM API costs apply.

Reddit Sentiment

  • Community has cooled on Open Interpreter since the initial hype.
  • OS mode is seen as a proof-of-concept, not production-ready.
  • Normal mode (code execution) is valued but outcompeted by Claude Code, Cursor, etc.
  • 01 Light hardware project had enthusiastic reception but unclear adoption.

Verdict

  • Not recommended for computer use / GUI automation. Its strength is code execution, and dedicated coding agents (Claude Code, Codex) do that better now.
  • OS mode is too experimental and unreliable for production use.

5. Other Notable Frameworks

5.1 macOS-use (browser-use)

GitHub: browser-use/macOS-use
Install: pip install mlx-use

How it works: Screenshot-based. Takes screenshots, sends to vision model (OpenAI/Anthropic/Gemini), model returns actions (click coordinates, type text, etc.), executes via pyautogui/AppleScript.

Key details:

  • Spin-off from the popular browser-use project.
  • Supports OpenAI, Anthropic, Gemini APIs.
  • Vision: plans to support local inference via Apple MLX framework (not yet implemented).
  • Works across ALL macOS apps, not just browsers.
  • Early stage — "varying success rates depending on task prompt."
  • Security warning: Can access credentials, stored passwords, and all UI components.

Speed: Slow (cloud API round-trip per action).
Reliability: Low-moderate. Early development.
Setup: pip install mlx-use, configure API key.
Headless: Requires GUI session.
Cost: Free + API costs.
Sentiment: Exciting concept but immature. Reddit post got moderate engagement.


5.2 Agent S (Simular.ai)

GitHub: simular-ai/Agent-S
Website: simular.ai

How it works: Multi-model system using screenshot + grounding model + planning model.

  • Agent S3 (latest) uses a planning LLM (e.g., GPT-5, Claude) + a grounding model (UI-TARS-1.5-7B) for precise element location.
  • The grounding model takes screenshots and returns precise coordinates for UI elements.
  • Supports macOS, Windows, Linux.
  • State-of-the-art results: Agent S3 was the first to surpass human performance on OSWorld benchmark (72.6%).
  • ICLR 2025 Best Paper Award.

Key details:

  • Requires two models: a main reasoning model + a grounding model (UI-TARS-1.5-7B recommended).
  • The grounding model can be self-hosted on Hugging Face Inference Endpoints.
  • Optional local coding environment for code execution tasks.
  • Uses pyautogui for actions + screenshots for perception.
  • CLI interface: agent_s --provider openai --model gpt-5-2025-08-07 --ground_provider huggingface ...
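The planner/grounder split in miniature: the planning model emits a natural-language step, the grounding model converts it into screen coordinates, and pyautogui executes. Both model calls are stubbed with fixed returns here purely to show the architecture; Agent S wires in real LLM and UI-TARS endpoints.

```python
# Sketch of Agent S's two-model decomposition. plan() and ground() are
# stubs returning canned values; the real system calls remote models.

def plan(goal: str, screenshot: bytes) -> str:
    """Planning LLM: decide the next high-level step (stubbed)."""
    return "click the Save button"

def ground(step: str, screenshot: bytes) -> tuple[int, int]:
    """Grounding model: locate the step's target on screen (stubbed)."""
    return (512, 384)

step = plan("save the document", b"")
x, y = ground(step, b"")
# pyautogui.click(x, y)  # execution layer
```

Separating "what to do" from "where it is" is what lifts accuracy: the grounding model is trained specifically for coordinate localization, a task general LLMs do poorly.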

Speed: Moderate. Two-model inference adds latency. Grounding model can be local for faster inference.
Reliability: Highest reported. 72.6% on OSWorld surpasses human performance.
Setup: Complex. Requires two models, API keys, grounding model deployment.
Headless: Requires GUI session (pyautogui + screenshots).
Cost: Free (open source) + API costs for both models. UI-TARS-7B hosting adds cost.
Sentiment: Highly respected in the research community. ICLR paper, strong benchmarks. The "serious" option for computer use research.


5.3 C/ua (trycua)

GitHub: trycua/cua
Website: cua.ai
YC Company

How it works: Sandboxed virtual machines for computer use agents.

  • Runs macOS or Linux VMs on Apple Silicon using Apple's Virtualization.Framework.
  • Near-native performance (97% of native CPU speed reported).
  • Provides a complete SDK for agents to control the VM: click, type, scroll, screenshot, accessibility tree.
  • CuaBot: CLI tool that gives any coding agent (Claude Code, OpenClaw) a sandbox.
  • Includes benchmarking suite (cua-bench) for evaluating agents on OSWorld, ScreenSpot, etc.

Key details:

  • lume — macOS/Linux VM management on Apple Silicon (their virtualization layer).
  • lumier — Docker-compatible interface for Lume VMs.
  • Agent SDK supports multiple models (Anthropic, OpenAI, etc.).
  • Designed specifically for the "give your agent a computer" use case.
  • Sandboxed = safe. Agent can't damage your host system.

Speed: Near-native. VM overhead is minimal on Apple Silicon.
Reliability: Good. VM provides consistent environment.
Setup: Moderate. npx cuabot for quick start, or programmatic setup via Python SDK.
Headless: Excellent. VMs run headless by design. H.265 streaming for when you want to observe.
Cost: Free and open source (MIT). API costs for the AI model.
Sentiment: Strong interest on r/LocalLLaMA. "Docker for computer use agents" resonates. YC backing adds credibility.

Why C/ua matters: It solves the biggest problem with giving agents computer access — safety. The agent operates in an isolated VM, can't touch your host system. Perfect for always-on Mac Mini setups.


5.4 mcp-server-macos-use (mediar-ai)

GitHub: mediar-ai/mcp-server-macos-use

How it works: Accessibility tree-based. Swift MCP server that controls macOS apps through AXUIElement APIs.

  • Every action (click, type, press key) is followed by an accessibility tree traversal, giving the LLM updated UI state.
  • Tools: open_application_and_traverse, click_and_traverse, type_and_traverse, press_key_and_traverse, refresh_traversal.
  • Communicates via stdin/stdout (MCP protocol).
  • Uses the app's PID (process ID) for targeting.
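On the wire, a call to one of these tools is a JSON-RPC 2.0 `tools/call` request over stdio, per the MCP spec. A sketch of what an MCP client sends (the argument names are illustrative; check the server's tool schemas for the real parameters):

```python
# Sketch: framing an MCP tools/call request for this server. JSON-RPC 2.0
# method and params shape per the MCP spec; arguments are illustrative.
import json

request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "click_and_traverse",
        "arguments": {"pid": 4242, "x": 120, "y": 88},
    },
}
line = json.dumps(request)  # written to the server's stdin, newline-delimited
print(line)
# The server performs the click via AXUIElement, then returns the refreshed
# accessibility tree as the tool result.
```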

Speed: Fast. Native Swift, accessibility APIs are low-latency.
Reliability: High for apps with good accessibility support.
Setup: Build with swift build, configure in Claude Desktop or any MCP client.
Headless: Requires GUI session (accessibility APIs need WindowServer).
Cost: Free and open source.
Sentiment: Niche but well-designed. Good for MCP-native workflows.


5.5 mcp-remote-macos-use

GitHub: baryhuang/mcp-remote-macos-use

How it works: Screen Sharing-based remote control. Uses macOS Screen Sharing (VNC) protocol.

  • Captures screenshots and sends input over the VNC connection.
  • Doesn't require any software installed on the target Mac (just Screen Sharing enabled).
  • Deployable via Docker.
  • No extra API key needed — works with any MCP client/LLM.

Speed: Moderate (VNC overhead).
Reliability: Moderate. VNC-level interaction.
Setup: Enable Screen Sharing on target Mac, configure env vars.
Headless: Yes! Designed for remote/headless operation via Screen Sharing.
Cost: Free.
Sentiment: Practical for remote Mac control scenarios.


5.6 macOS Automator MCP (steipete)

GitHub: steipete/macos-automator-mcp

How it works: AppleScript/JXA execution via MCP. Ships with 200+ pre-built automation recipes.

  • Executes AppleScript or JXA (JavaScript for Automation) scripts.
  • Knowledge base of common automations: toggle dark mode, extract URLs from Safari, manage windows, etc.
  • Supports inline scripts, file-based scripts, and pre-built knowledge base scripts.
  • TypeScript/Node.js implementation.

Speed: Fast. AppleScript executes in milliseconds.
Reliability: High for scripted automations. Depends on script quality.
Setup: npx @steipete/macos-automator-mcp@latest — minimal.
Headless: Partially. AppleScript works over SSH with GUI session active.
Cost: Free (MIT).
Sentiment: Great companion to Peekaboo. Same author (Peter Steinberger).


5.7 mac_computer_use (deedy)

GitHub: deedy/mac_computer_use

How it works: Fork of Anthropic's official computer-use demo, adapted for native macOS.

  • Screenshot-based (screencapture + cliclick).
  • Streamlit web UI.
  • Multi-provider support (Anthropic, Bedrock, Vertex).
  • Automatic resolution scaling.

Speed: Same as Anthropic Computer Use (slow — API round-trip per action).
Reliability: Same as Anthropic Computer Use (moderate).
Setup: Clone, pip install, set API key, run streamlit.
Headless: Same limitations (needs WindowServer).
Cost: Free + API costs.


6. Comparison Matrix

| Tool | Approach | Speed | Reliability | Setup | Headless | Cost | Best For |
|---|---|---|---|---|---|---|---|
| Anthropic Computer Use | Screenshot + pixel coords | Slow | Moderate | Moderate | Needs GUI | API costs | General-purpose computer use |
| Apple Accessibility APIs | Accessibility tree | Instant | High | Low | ⚠️ Partial | Free | Deterministic automation |
| Peekaboo | Hybrid: screenshot + accessibility | Fast | High | Easy | ⚠️ Needs GUI | Free + API | Best macOS agent tool |
| Open Interpreter | Screenshots (OS mode) | Slow | Low | Easy | OS mode needs GUI | Free + API | Code execution (not GUI) |
| macOS-use | Screenshots + pyautogui | Slow | Low-Med | Easy | Needs GUI | Free + API | Cross-app automation (experimental) |
| Agent S3 | Screenshots + grounding model | Moderate | Highest | Complex | Needs GUI | Free + 2× API | Research / highest accuracy |
| C/ua | VM sandbox + screenshot/a11y | Fast | Good | Moderate | Yes | Free + API | Safest sandboxed option |
| mcp-server-macos-use | Accessibility tree (Swift) | Fast | High | Moderate | ⚠️ Needs GUI | Free | MCP-native workflows |
| mcp-remote-macos-use | VNC screen sharing | Moderate | Moderate | Easy | Yes | Free | Remote Mac control |
| macOS Automator MCP | AppleScript/JXA | Instant | High | Easy | ⚠️ Partial | Free | Scripted automations |

7. Recommendations for Mac Mini Agent Setup

🏆 Tier 1: Best Overall

Peekaboo is the clear winner for an always-on Mac Mini running AI agent automation.

Why:

  • Hybrid approach (screenshot + accessibility tree) gives the best of both worlds
  • Native Swift CLI = fast and deeply integrated with macOS
  • MCP server mode works with any MCP client
  • Complete automation toolkit (click, type, menu, window, dialog, etc.)
  • Active development by a respected macOS developer
  • Easy install (brew install steipete/tap/peekaboo)

Recommended stack:

Peekaboo (GUI automation) 
+ macOS Automator MCP (AppleScript/JXA for scripted tasks)
+ Apple Accessibility APIs (direct AXUIElement for custom automation)

🥈 Tier 2: For Safety-Critical Use

C/ua if you need sandboxed execution (agent can't damage your host system).

Why:

  • VM isolation = peace of mind for unsupervised operation
  • Near-native performance on Apple Silicon
  • Works headless by design
  • Good for running untrusted or experimental agents
  • YC-backed, strong engineering

🥉 Tier 3: For Research / Maximum Accuracy

Agent S3 if you need the highest possible task completion rate and are willing to invest in setup complexity.

Why:

  • Best benchmark results (72.6% on OSWorld, surpassing human performance)
  • Two-model approach provides better grounding
  • Research-grade quality
  • But: complex setup, higher API costs

For Clawdbot/OpenClaw Specifically

The ideal integration path:

  1. Peekaboo MCP as the primary computer-use tool (add to MCP config)
  2. macOS Automator MCP for common scripted tasks (dark mode, app control, etc.)
  3. Apple Accessibility APIs via osascript for quick deterministic actions
  4. Fall back to Anthropic Computer Use for tasks requiring pure visual reasoning

8. Headless / SSH Considerations

Running computer-use tools on a headless Mac Mini is a critical concern for always-on setups:

The Core Problem

macOS GUI automation tools (screenshots, accessibility APIs, pyautogui, cliclick) require:

  1. WindowServer to be running (a GUI session must exist)
  2. A display (real or virtual) for screenshots to capture

Solutions

  1. HDMI Dummy Plug ($5-15): Plugs into HDMI port, tricks macOS into thinking a display is connected. This is the most reliable solution for headless Mac Minis.

  2. Apple Screen Sharing / VNC: Enable Screen Sharing in System Settings. Connect from another Mac or use a VNC client. screencapture works against the active session.

  3. HDMI Dummy + Auto-Login: Configure macOS to auto-login on boot, use HDMI dummy plug for display emulation. Most robust setup for unattended operation.

  4. C/ua VMs: Run the agent in a VM — it has its own virtual display. No dummy plug needed.
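An always-on agent should verify a GUI session exists before attempting screenshot-based automation. A preflight sketch using only the built-in launchctl command; the check is a heuristic, not an official API:

```python
# Preflight sketch: does this user have a GUI (Aqua) launchd domain?
# `launchctl print gui/<uid>` exits non-zero when no GUI session exists.
import os, subprocess

def gui_session_ok() -> bool:
    """True if launchd reports a GUI domain for this user (macOS only)."""
    try:
        out = subprocess.run(
            ["launchctl", "print", f"gui/{os.getuid()}"],
            capture_output=True, text=True, timeout=5)
        return out.returncode == 0
    except FileNotFoundError:  # launchctl absent: not running on macOS
        return False
```

Running this check at agent startup turns the silent "black screenshot" failure mode into an explicit, loggable error.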

What Works Over SSH (with GUI session active)

| Capability | Works over SSH? |
|---|---|
| osascript / AppleScript | Yes (if Accessibility granted) |
| screencapture | Yes (with GUI session + display) |
| cliclick | Yes (with GUI session) |
| Peekaboo CLI | Yes (with GUI session) |
| pyautogui | Yes (with GUI session) |

Recommended Setup

Mac Mini M4 + HDMI Dummy Plug
├── Auto-login enabled
├── Screen Sharing enabled (for monitoring)
├── SSH enabled (for CLI access)
├── Peekaboo installed
├── Clawdbot/OpenClaw running as a LaunchAgent (a launch daemon runs outside the GUI session and can't drive it)
└── HDMI dummy forces 1080p display for consistent screenshots

Key tip from community: Get the cheapest HDMI dummy plug you can find (Amazon, ~$8). Without it, the Mac Mini may boot into a low-resolution or no-display mode that breaks all screenshot-based automation.


Last updated: February 18, 2026