# macOS Computer Use Tools for AI Agents — Deep Research (Feb 2026)

> **Context:** Evaluating the best "computer use" tools/frameworks for AI agents running on an always-on Mac mini M-series (specifically for Clawdbot/OpenClaw-style automation).

---

## Table of Contents

1. [Anthropic Computer Use](#1-anthropic-computer-use)
2. [Apple Accessibility APIs](#2-apple-accessibility-apis)
3. [Peekaboo](#3-peekaboo)
4. [Open Interpreter](#4-open-interpreter)
5. [Other Notable Frameworks](#5-other-notable-frameworks)
   - [macOS-use (browser-use)](#51-macos-use-browser-use)
   - [Agent S (Simular.ai)](#52-agent-s-simularai)
   - [C/ua (trycua)](#53-cua-trycua)
   - [mcp-server-macos-use (mediar-ai)](#54-mcp-server-macos-use-mediar-ai)
   - [mcp-remote-macos-use](#55-mcp-remote-macos-use)
   - [macOS Automator MCP (steipete)](#56-macos-automator-mcp-steipete)
   - [mac_computer_use (deedy)](#57-mac_computer_use-deedy)
6. [Comparison Matrix](#6-comparison-matrix)
7. [Recommendations for Mac Mini Agent Setup](#7-recommendations-for-mac-mini-agent-setup)
8. [Headless / SSH Considerations](#8-headless--ssh-considerations)

---

## 1. Anthropic Computer Use

**GitHub:** [anthropics/anthropic-quickstarts](https://github.com/anthropics/anthropic-quickstarts/tree/main/computer-use-demo)
**Mac forks:** [deedy/mac_computer_use](https://github.com/deedy/mac_computer_use), [PallavAg/claude-computer-use-macos](https://github.com/PallavAg/claude-computer-use-macos), [newideas99/Anthropic-Computer-Use-MacOS](https://github.com/newideas99/Anthropic-Computer-Use-MacOS)

### How It Works

- **Screenshot-based.** The model receives screenshots and reasons about pixel coordinates.
- Claude sends actions (`mouse_move`, `click`, `type`, `screenshot`) to a local executor.
- On macOS, the executor uses `cliclick` for mouse/keyboard and `screencapture` for screenshots.
- The model identifies coordinates by "counting pixels" — trained specifically for coordinate estimation.
- Anthropic recommends XGA (1024×768) or WXGA (1280×800) resolution for best accuracy.
- The official demo uses Docker + Ubuntu (xdotool). macOS forks replace xdotool with `cliclick` and native `screencapture`.
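
The action-to-CLI mapping that a macOS executor performs can be sketched as below. This is an illustrative sketch, not the actual fork code; real executors also add coordinate scaling, key-name translation, and error handling. The `m:`/`c:`/`t:` prefixes are cliclick's own command syntax.

```python
import subprocess

def build_command(action: str, **kw) -> list[str]:
    """Translate a computer-use action into a macOS CLI invocation.
    Sketch only -- the real forks add scaling and key mapping."""
    if action == "screenshot":
        # -x suppresses the shutter sound; path is illustrative
        return ["screencapture", "-x", "/tmp/screen.png"]
    if action == "mouse_move":
        return ["cliclick", f"m:{kw['x']},{kw['y']}"]
    if action == "left_click":
        return ["cliclick", f"c:{kw['x']},{kw['y']}"]
    if action == "type":
        return ["cliclick", f"t:{kw['text']}"]
    raise ValueError(f"unsupported action: {action}")

def execute(action: str, **kw) -> None:
    """Run the mapped command (macOS only, with permissions granted)."""
    subprocess.run(build_command(action, **kw), check=True)
```

Each model response is parsed into one of these actions, executed, and followed by a fresh `screenshot` for the next inference step.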

### Speed / Latency

- **Slow.** Each action cycle involves: screenshot → upload image → API inference → parse response → execute action.
- A single click-and-verify cycle takes **3-8 seconds** depending on API latency.
- Multi-step tasks (e.g., open Safari, navigate, search) can take **30-120+ seconds**.
- Screenshot upload adds ~1-3s overhead per cycle (images are typically 100-500KB).

### Reliability

- **Moderate.** Coordinate estimation works well for large, distinct UI elements.
- Struggles with small buttons, dense UIs, and similar-looking elements.
- No DOM/accessibility tree awareness — purely visual. If the UI changes between screenshot and action, clicks can miss.
- Self-correction loop helps: the model takes a new screenshot after each action.
- Prone to **prompt injection** from on-screen text (major security concern).
- Simon Willison's testing (Oct 2024): works for simple tasks, fails on complex multi-step workflows.

### Setup Complexity

- **Moderate.** Requires: Python 3.12+, cliclick (`brew install cliclick`), an Anthropic API key, and macOS Accessibility permissions.
- Mac forks require cloning a repo + setting up a venv + environment variables.
- Some forks include a Streamlit UI for interactive testing.
- Must grant Terminal/Python Accessibility permissions in System Settings.

### Headless / SSH

- **Problematic.** `screencapture` requires WindowServer (a GUI session).
- Over pure SSH without a display, `screencapture` fails silently or returns black images.
- **Workaround:** Use an HDMI dummy plug + Screen Sharing (VNC), or connect via Apple Remote Desktop. `screencapture` then works against the VNC session.
- Not designed for headless operation.

### Cost

- **API costs only.** Anthropic API pricing (Feb 2026):
  - Claude Sonnet 4.5: $3/M input tokens, $15/M output tokens
  - Claude Opus 4.5: $5/M input tokens, $25/M output tokens
  - Each screenshot is ~1,500-3,000 tokens (image tokens)
  - A 10-step task might cost $0.05-0.30 depending on model and complexity
- Computer use itself is free — you run the executor locally.
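
The $0.05-0.30 figure can be sanity-checked with quick arithmetic. The per-screenshot and per-step output token counts below are assumptions chosen within the ranges stated above, and the estimate ignores conversation-history growth (earlier screenshots accumulating in context), which pushes real costs toward the high end.

```python
def estimate_task_cost(steps: int,
                       tokens_per_screenshot: int = 2_500,   # assumed, mid-range
                       output_tokens_per_step: int = 300,    # assumed
                       input_price_per_m: float = 3.0,       # Sonnet 4.5 input
                       output_price_per_m: float = 15.0) -> float:
    """Back-of-envelope dollar cost for a one-screenshot-per-step task."""
    input_tokens = steps * tokens_per_screenshot
    output_tokens = steps * output_tokens_per_step
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# A 10-step task on Sonnet 4.5: 25k image tokens in, 3k tokens out,
# landing in the middle of the $0.05-0.30 range quoted above.
```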

### Reddit Sentiment

- **Excited but cautious.** An r/Anthropic thread on unsandboxed Mac use got 24 upvotes, with comments calling it "dangerous but cool."
- r/macmini discussions show interest in buying Mac Minis specifically for this use case.
- Common complaints: slow, expensive at scale, not reliable enough for unsupervised use.
- Benjamin Anderson's blog post captures the zeitgeist: "Claude needs his own computer" — the coding agent + computer use convergence thesis.

---

## 2. Apple Accessibility APIs

**Documentation:** [Apple Mac Automation Scripting Guide](https://developer.apple.com/library/archive/documentation/LanguagesUtilities/Conceptual/MacAutomationScriptingGuide/AutomatetheUserInterface.html)

### How It Works

- **Accessibility tree-based.** macOS exposes every UI element (buttons, text fields, menus, etc.) through the Accessibility framework (AXUIElement API).
- **Three access methods:**
  1. **AppleScript / osascript:** `tell application "System Events" → tell process "Finder" → click button "OK"`. High-level scripting, easy to write.
  2. **JXA (JavaScript for Automation):** Same capabilities as AppleScript, written in JavaScript. Run via `osascript -l JavaScript`.
  3. **AXUIElement (C/Swift/Python via pyobjc):** Low-level programmatic access to the full accessibility tree. Can enumerate all UI elements, read properties (role, title, position, size), and perform actions (press, set value, etc.).
- Does NOT rely on screenshots — reads the actual UI element tree.
- Can traverse the entire hierarchy: Application → Window → Group → Button → etc.
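
Method 1 can be driven from any scripting language by shelling out to `osascript`. A minimal sketch: the "Finder"/"OK" process and button names are placeholders for your target app, and the actual click only fires on macOS with Accessibility permission granted to the calling process.

```python
import subprocess
import sys

def click_button_script(process: str, button: str) -> str:
    """Build an AppleScript one-liner that presses a button via
    System Events. Process/button names are caller-supplied."""
    return (
        f'tell application "System Events" to tell process "{process}" '
        f'to click button "{button}" of front window'
    )

cmd = ["osascript", "-e", click_button_script("Finder", "OK")]
if sys.platform == "darwin":          # meaningful only on macOS with
    subprocess.run(cmd, check=False)  # Accessibility permission granted
```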

### Speed / Latency

- **Fast.** AppleScript commands execute in **10-100ms**. AXUIElement API calls are typically **1-10ms**.
- No image capture, no network round-trip, no model inference.
- Menu clicks, text entry, window management — all near-instantaneous.
- Can enumerate hundreds of UI elements in <100ms.

### Reliability

- **High for supported apps.** Most native macOS apps and many Electron apps expose accessibility info.
- Apple's own apps (Finder, Safari, Mail, Calendar, Notes) have excellent accessibility support.
- Electron apps (VS Code, Slack, Discord) expose basic accessibility but may have gaps.
- Web content in browsers is accessible via accessibility APIs (each DOM element maps to an AX element).
- **Failure modes:** Apps with custom rendering (games, some media apps) may not expose UI elements. Some apps have broken accessibility annotations.

### Setup Complexity

- **Low.** AppleScript is built into macOS — no installation needed.
- `osascript` is available in every terminal.
- For Python access: `pip install pyobjc-framework-ApplicationServices`
- **Critical requirement:** Must enable Accessibility permissions for the calling application (Terminal, Python, etc.) in System Settings → Privacy & Security → Accessibility.
- For automation across apps: System Settings → Privacy & Security → Automation.

### Headless / SSH

- **Partially works.** AppleScript/osascript commands work over SSH **if** a GUI session is active (user logged in).
- AXUIElement requires WindowServer to be running.
- Works well with a headless Mac Mini + HDMI dummy plug + remote login session.
- `osascript` may throw "not allowed assistive access" errors over SSH — the calling process (sshd, bash) needs to be in the Accessibility allow list.
- **Workaround:** Save scripts as .app bundles, grant them Accessibility access, then invoke from SSH.

### Cost

- **Free.** Built into macOS, no API costs.

### Best For

- **Structured automation:** "Click the Save button in TextEdit" rather than "figure out what's on screen."
- **Fast, deterministic workflows** where you know the target app and UI structure.
- **Combining with an LLM:** Feed the accessibility tree to an LLM, let it decide which element to interact with. This is what Peekaboo, mcp-server-macos-use, and macOS-use all do under the hood.

### Limitations

- **No visual understanding.** Can't interpret images, charts, or custom-drawn content.
- **Fragile element references:** If an app updates, button names/positions may change.
- **Permission hell:** Each calling app needs separate Accessibility + Automation grants. Can't grant to `osascript` directly (it's not an .app).

---

## 3. Peekaboo

**GitHub:** [steipete/Peekaboo](https://github.com/steipete/Peekaboo)
**Website:** [peekaboo.boo](https://www.peekaboo.boo/)
**Author:** Peter Steinberger (well-known iOS/macOS developer)

### How It Works

- **Hybrid: screenshot + accessibility tree.** This is Peekaboo's killer feature.
- The `see` command captures a screenshot AND overlays element IDs from the accessibility tree, creating an annotated snapshot.
- The `click` command can target elements by accessibility ID, label text, or raw coordinates.
- **Full GUI automation suite:** click, type, press, hotkey, scroll, swipe, drag, move, window management, app control, menu interaction, dock control, dialog handling, Space switching.
- **Native Swift CLI** — compiled binary, not Python. Fast and deeply integrated with macOS APIs.
- **MCP server mode** — can be used as an MCP tool by Claude Desktop, Cursor, or any MCP client.
- **Agent mode** — `peekaboo agent` runs a natural-language multi-step automation loop (capture → LLM decide → act → repeat).
- Supports multiple AI providers: OpenAI, Claude, Grok, Gemini, Ollama (local).

### Speed / Latency

- **Fast.** Screenshot capture via ScreenCaptureKit is <100ms. Accessibility tree traversal is similarly fast.
- Individual click/type/press commands execute in **10-50ms**.
- Agent mode latency depends on the LLM provider (1-5s per step with cloud APIs).
- Much faster than pure screenshot-based approaches because clicks target element IDs, not pixel coordinates.

### Reliability

- **High.** Using accessibility IDs instead of pixel coordinates means:
  - Clicks don't miss due to resolution changes or slight UI shifts.
  - Elements are identified by semantic identity (button label, role), not visual appearance.
- The annotated snapshot approach gives the LLM **both** visual context and structural data — best of both worlds.
- Menu interaction, dialog handling, and window management are deeply integrated.
- Created by Peter Steinberger — high-quality Swift code, actively maintained.

### Setup Complexity

- **Low.** `brew install steipete/tap/peekaboo` — single command.
- Requires macOS 15+ (Sequoia), Screen Recording permission, and Accessibility permission.
- MCP server mode: `npx @steipete/peekaboo-mcp@beta` (zero-install for Node users).
- Configuration for AI providers via `peekaboo config`.

### Headless / SSH

- **Requires a GUI session** (ScreenCaptureKit and accessibility APIs need WindowServer).
- Works with a Mac Mini + HDMI dummy plug + Screen Sharing.
- Can be invoked over SSH if a GUI login session is active.
- The CLI nature makes it easy to script and automate remotely.

### Cost

- **Free and open-source** (MIT license).
- AI provider costs apply when using `peekaboo agent` or `peekaboo see --analyze`.
- Local models via Ollama = zero marginal cost.

### Reddit / Community Sentiment

- Very well-received in the macOS developer community.
- Peter Steinberger's reputation lends credibility.
- Described as "giving AI agents eyes on macOS."
- Praised for the hybrid screenshot+accessibility approach.
- Active development — regular releases with new features.

### Why Peekaboo Stands Out

- **Best-in-class for macOS-specific automation.** It's what a senior macOS developer would build if they were making the perfect agent tool.
- Complete command set: see, click, type, press, hotkey, scroll, swipe, drag, window, app, space, menu, menubar, dock, dialog.
- Runnable automation scripts (`.peekaboo.json`).
- Clean JSON output for programmatic consumption.

---

## 4. Open Interpreter

**Website:** [openinterpreter.com](https://www.openinterpreter.com/)
**GitHub:** [OpenInterpreter/open-interpreter](https://github.com/OpenInterpreter/open-interpreter)

### How It Works

- **Primarily code execution**, with experimental "OS mode" for GUI control.
- Normal mode: LLM generates Python/bash/JS code, executes it locally.
- **OS mode** (`interpreter --os`): Screenshot-based. Takes screenshots, sends to a vision model (GPT-4V, etc.), model reasons about actions, executes via pyautogui.
- Also includes 01 Light hardware — a portable voice interface that connects to a home computer.

### Speed / Latency

- Normal mode (code execution): **Fast** — direct code execution, limited by LLM inference time.
- OS mode: **Slow** — same screenshot→API→action loop as Anthropic Computer Use.
- OS mode is explicitly labeled "highly experimental."

### Reliability

- Normal mode: **Good** for code-centric tasks. LLM writes code that runs on your machine.
- OS mode: **Low.** Labeled as "work in progress." Community reports frequent failures.
- Single monitor only. No multi-display support in OS mode.
- Better at tasks that can be accomplished via code (file manipulation, API calls, data processing) than GUI interaction.

### Setup Complexity

- **Low.** `pip install open-interpreter` and `interpreter --os`.
- Requires Screen Recording permissions on macOS.
- API key for your chosen LLM provider.

### Headless / SSH

- Normal mode (code execution): **Works perfectly** over SSH.
- OS mode: **Requires GUI session** (uses pyautogui + screenshots).

### Cost

- **Free and open-source.**
- LLM API costs apply.

### Reddit Sentiment

- Community has cooled on Open Interpreter since the initial hype.
- OS mode is seen as a proof-of-concept, not production-ready.
- Normal mode (code execution) is valued but outcompeted by Claude Code, Cursor, etc.
- 01 Light hardware project had enthusiastic reception but unclear adoption.

### Verdict

- **Not recommended for computer use / GUI automation.** Its strength is code execution, and dedicated coding agents (Claude Code, Codex) do that better now.
- OS mode is too experimental and unreliable for production use.

---

## 5. Other Notable Frameworks

### 5.1 macOS-use (browser-use)

**GitHub:** [browser-use/macOS-use](https://github.com/browser-use/macOS-use)
**Install:** `pip install mlx-use`

**How it works:** Screenshot-based. It takes screenshots, sends them to a vision model (OpenAI/Anthropic/Gemini), receives actions back (click coordinates, type text, etc.), and executes them via pyautogui/AppleScript.

**Key details:**

- Spin-off from the popular browser-use project.
- Supports OpenAI, Anthropic, and Gemini APIs.
- Vision: plans to support local inference via Apple's MLX framework (not yet implemented).
- Works across ALL macOS apps, not just browsers.
- Early stage — "varying success rates depending on task prompt."
- **Security warning:** Can access credentials, stored passwords, and all UI components.

**Speed:** Slow (cloud API round-trip per action).
**Reliability:** Low-moderate. Early development.
**Setup:** `pip install mlx-use`, configure an API key.
**Headless:** Requires GUI session.
**Cost:** Free + API costs.
**Sentiment:** Exciting concept but immature. Reddit post got moderate engagement.

---

### 5.2 Agent S (Simular.ai)

**GitHub:** [simular-ai/Agent-S](https://github.com/simular-ai/Agent-S)
**Website:** [simular.ai](https://www.simular.ai/)

**How it works:** Multi-model system using **screenshot + grounding model + planning model.**

- Agent S3 (latest) uses a planning LLM (e.g., GPT-5, Claude) + a grounding model (UI-TARS-1.5-7B) for precise element location.
- The grounding model takes screenshots and returns precise coordinates for UI elements.
- Supports macOS, Windows, Linux.
- **State-of-the-art results:** Agent S3 was the first to surpass human performance on the OSWorld benchmark (72.6%).
- ICLR 2025 Best Paper Award.

**Key details:**

- Requires two models: a main reasoning model + a grounding model (UI-TARS-1.5-7B recommended).
- The grounding model can be self-hosted on Hugging Face Inference Endpoints.
- Optional local coding environment for code execution tasks.
- Uses pyautogui for actions + screenshots for perception.
- CLI interface: `agent_s --provider openai --model gpt-5-2025-08-07 --ground_provider huggingface ...`

**Speed:** Moderate. Two-model inference adds latency. The grounding model can be local for faster inference.
**Reliability:** **Highest reported.** 72.6% on OSWorld surpasses human performance.
**Setup:** Complex. Requires two models, API keys, and grounding model deployment.
**Headless:** Requires GUI session (pyautogui + screenshots).
**Cost:** Free (open source) + API costs for both models. UI-TARS-7B hosting adds cost.
**Sentiment:** Highly respected in the research community. ICLR paper, strong benchmarks. The "serious" option for computer use research.

---

### 5.3 C/ua (trycua)

**GitHub:** [trycua/cua](https://github.com/trycua/cua)
**Website:** [cua.ai](https://cua.ai/)
**YC company**

**How it works:** **Sandboxed virtual machines** for computer use agents.

- Runs macOS or Linux VMs on Apple Silicon using Apple's Virtualization.framework.
- Near-native performance (97% of native CPU speed reported).
- Provides a complete SDK for agents to control the VM: click, type, scroll, screenshot, accessibility tree.
- **CuaBot:** CLI tool that gives any coding agent (Claude Code, OpenClaw) a sandbox.
- Includes a benchmarking suite (cua-bench) for evaluating agents on OSWorld, ScreenSpot, etc.

**Key details:**

- `lume` — macOS/Linux VM management on Apple Silicon (their virtualization layer).
- `lumier` — Docker-compatible interface for Lume VMs.
- Agent SDK supports multiple models (Anthropic, OpenAI, etc.).
- Designed specifically for the "give your agent a computer" use case.
- Sandboxed = safe. The agent can't damage your host system.

**Speed:** Near-native. VM overhead is minimal on Apple Silicon.
**Reliability:** Good. The VM provides a consistent environment.
**Setup:** Moderate. `npx cuabot` for quick start, or programmatic setup via the Python SDK.
**Headless:** **Excellent.** VMs run headless by design. H.265 streaming for when you want to observe.
**Cost:** Free and open source (MIT). API costs for the AI model.
**Sentiment:** Strong interest on r/LocalLLaMA. "Docker for computer use agents" resonates. YC backing adds credibility.

**Why C/ua matters:** It solves the biggest problem with giving agents computer access — **safety.** The agent operates in an isolated VM and can't touch your host system. Perfect for always-on Mac Mini setups.

---

### 5.4 mcp-server-macos-use (mediar-ai)

**GitHub:** [mediar-ai/mcp-server-macos-use](https://github.com/mediar-ai/mcp-server-macos-use)

**How it works:** **Accessibility tree-based.** Swift MCP server that controls macOS apps through AXUIElement APIs.

- Every action (click, type, press key) is followed by an accessibility tree traversal, giving the LLM updated UI state.
- Tools: `open_application_and_traverse`, `click_and_traverse`, `type_and_traverse`, `press_key_and_traverse`, `refresh_traversal`.
- Communicates via stdin/stdout (MCP protocol).
- Uses the app's PID (process ID) for targeting.

**Speed:** Fast. Native Swift; accessibility APIs are low-latency.
**Reliability:** High for apps with good accessibility support.
**Setup:** Build with `swift build`, configure in Claude Desktop or any MCP client.
**Headless:** Requires GUI session (accessibility APIs need WindowServer).
**Cost:** Free and open source.
**Sentiment:** Niche but well-designed. Good for MCP-native workflows.

---

### 5.5 mcp-remote-macos-use

**GitHub:** [baryhuang/mcp-remote-macos-use](https://github.com/baryhuang/mcp-remote-macos-use)

**How it works:** **Screen Sharing-based remote control.** Uses the macOS Screen Sharing (VNC) protocol.

- Captures screenshots and sends input over the VNC connection.
- Doesn't require any software installed on the target Mac (just Screen Sharing enabled).
- Deployable via Docker.
- No extra API key needed — works with any MCP client/LLM.

**Speed:** Moderate (VNC overhead).
**Reliability:** Moderate. VNC-level interaction.
**Setup:** Enable Screen Sharing on the target Mac, configure env vars.
**Headless:** **Yes!** Designed for remote/headless operation via Screen Sharing.
**Cost:** Free.
**Sentiment:** Practical for remote Mac control scenarios.

---

### 5.6 macOS Automator MCP (steipete)

**GitHub:** [steipete/macos-automator-mcp](https://github.com/steipete/macos-automator-mcp)

**How it works:** **AppleScript/JXA execution via MCP.** Ships with 200+ pre-built automation recipes.

- Executes AppleScript or JXA (JavaScript for Automation) scripts.
- Knowledge base of common automations: toggle dark mode, extract URLs from Safari, manage windows, etc.
- Supports inline scripts, file-based scripts, and pre-built knowledge base scripts.
- TypeScript/Node.js implementation.
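
Under the hood, these recipes boil down to `osascript` invocations. A minimal illustration of the JXA path (the Safari one-liner is an assumed example of a typical recipe, not taken from the server's knowledge base):

```python
import subprocess
import sys

# JXA one-liner: ask Safari for the front tab's URL. Illustrative of
# the kind of script the knowledge base wraps.
jxa = 'Application("Safari").windows[0].currentTab.url()'
cmd = ["osascript", "-l", "JavaScript", "-e", jxa]

if sys.platform == "darwin":  # needs a GUI session + Automation permission
    subprocess.run(cmd, capture_output=True, text=True)
```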

**Speed:** Fast. AppleScript executes in milliseconds.
**Reliability:** High for scripted automations. Depends on script quality.
**Setup:** `npx @steipete/macos-automator-mcp@latest` — minimal.
**Headless:** Partially. AppleScript works over SSH with a GUI session active.
**Cost:** Free (MIT).
**Sentiment:** Great companion to Peekaboo. Same author (Peter Steinberger).

---

### 5.7 mac_computer_use (deedy)

**GitHub:** [deedy/mac_computer_use](https://github.com/deedy/mac_computer_use)

**How it works:** Fork of Anthropic's official computer-use demo, adapted for native macOS.

- Screenshot-based (screencapture + cliclick).
- Streamlit web UI.
- Multi-provider support (Anthropic, Bedrock, Vertex).
- Automatic resolution scaling.

**Speed:** Same as Anthropic Computer Use (slow — API round-trip per action).
**Reliability:** Same as Anthropic Computer Use (moderate).
**Setup:** Clone, pip install, set API key, run streamlit.
**Headless:** Same limitations (needs WindowServer).
**Cost:** Free + API costs.

---

## 6. Comparison Matrix

| Tool | Approach | Speed | Reliability | Setup | Headless | Cost | Best For |
|------|----------|-------|-------------|-------|----------|------|----------|
| **Anthropic Computer Use** | Screenshot + pixel coords | ⭐⭐ Slow | ⭐⭐⭐ Moderate | ⭐⭐⭐ Moderate | ⚠️ Needs GUI | API costs | General-purpose computer use |
| **Apple Accessibility APIs** | Accessibility tree | ⭐⭐⭐⭐⭐ Instant | ⭐⭐⭐⭐ High | ⭐⭐⭐⭐ Low | ⚠️ Partial | Free | Deterministic automation |
| **Peekaboo** | **Hybrid: screenshot + accessibility** | ⭐⭐⭐⭐ Fast | ⭐⭐⭐⭐ High | ⭐⭐⭐⭐⭐ Easy | ⚠️ Needs GUI | Free + API | **Best macOS agent tool** |
| **Open Interpreter** | Screenshots (OS mode) | ⭐⭐ Slow | ⭐⭐ Low | ⭐⭐⭐⭐ Easy | ⚠️ OS mode needs GUI | Free + API | Code execution (not GUI) |
| **macOS-use** | Screenshots + pyautogui | ⭐⭐ Slow | ⭐⭐ Low-Med | ⭐⭐⭐ Easy | ⚠️ Needs GUI | Free + API | Cross-app automation (experimental) |
| **Agent S3** | Screenshots + grounding model | ⭐⭐⭐ Moderate | ⭐⭐⭐⭐⭐ Highest | ⭐⭐ Complex | ⚠️ Needs GUI | Free + 2× API | Research / highest accuracy |
| **C/ua** | VM sandbox + screenshot/a11y | ⭐⭐⭐⭐ Fast | ⭐⭐⭐⭐ Good | ⭐⭐⭐ Moderate | ✅ Yes | Free + API | **Safest sandboxed option** |
| **mcp-server-macos-use** | Accessibility tree (Swift) | ⭐⭐⭐⭐⭐ Fast | ⭐⭐⭐⭐ High | ⭐⭐⭐ Moderate | ⚠️ Needs GUI | Free | MCP-native workflows |
| **mcp-remote-macos-use** | VNC screen sharing | ⭐⭐⭐ Moderate | ⭐⭐⭐ Moderate | ⭐⭐⭐ Easy | ✅ Yes | Free | Remote Mac control |
| **macOS Automator MCP** | AppleScript/JXA | ⭐⭐⭐⭐⭐ Instant | ⭐⭐⭐⭐ High | ⭐⭐⭐⭐⭐ Easy | ⚠️ Partial | Free | Scripted automations |

("Needs GUI" means a logged-in GUI session is required; per §8, that is workable on a headless Mac mini with an HDMI dummy plug.)

---

## 7. Recommendations for Mac Mini Agent Setup

### 🏆 Tier 1: Best Overall

**Peekaboo** is the clear winner for an always-on Mac Mini running AI agent automation.

**Why:**

- Hybrid approach (screenshot + accessibility tree) gives the best of both worlds
- Native Swift CLI = fast and deeply integrated with macOS
- MCP server mode works with any MCP client
- Complete automation toolkit (click, type, menu, window, dialog, etc.)
- Active development by a respected macOS developer
- Easy install (`brew install steipete/tap/peekaboo`)

**Recommended stack:**

```
Peekaboo (GUI automation)
  + macOS Automator MCP (AppleScript/JXA for scripted tasks)
  + Apple Accessibility APIs (direct AXUIElement for custom automation)
```

### 🥈 Tier 2: For Safety-Critical Use

**C/ua** if you need sandboxed execution (the agent can't damage your host system).

**Why:**

- VM isolation = peace of mind for unsupervised operation
- Near-native performance on Apple Silicon
- Works headless by design
- Good for running untrusted or experimental agents
- YC-backed, strong engineering

### 🥉 Tier 3: For Research / Maximum Accuracy

**Agent S3** if you need the highest possible task completion rate and are willing to invest in setup complexity.

**Why:**

- Best benchmark results (72.6% on OSWorld, surpassing human performance)
- Two-model approach provides better grounding
- Research-grade quality
- But: complex setup, higher API costs

### For Clawdbot/OpenClaw Specifically

The ideal integration path:

1. **Peekaboo MCP** as the primary computer-use tool (add to MCP config)
2. **macOS Automator MCP** for common scripted tasks (dark mode, app control, etc.)
3. **Apple Accessibility APIs** via `osascript` for quick deterministic actions
4. Fall back to **Anthropic Computer Use** for tasks requiring pure visual reasoning
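
For steps 1 and 2, the MCP registration in a Claude Desktop-style config might look like the sketch below. The `mcpServers` shape is the common client convention (verify against your client's docs), and the server names are arbitrary labels:

```json
{
  "mcpServers": {
    "peekaboo": {
      "command": "npx",
      "args": ["-y", "@steipete/peekaboo-mcp@beta"]
    },
    "macos-automator": {
      "command": "npx",
      "args": ["-y", "@steipete/macos-automator-mcp@latest"]
    }
  }
}
```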

---

## 8. Headless / SSH Considerations

Running computer-use tools on a headless Mac Mini is a **critical concern** for always-on setups:

### The Core Problem

macOS GUI automation tools (screenshots, accessibility APIs, pyautogui, cliclick) require:

1. **WindowServer** to be running (a GUI session must exist)
2. **A display** (real or virtual) for screenshots to capture

### Solutions

1. **HDMI Dummy Plug** ($5-15): Plugs into the HDMI port and tricks macOS into thinking a display is connected. This is **the most reliable solution** for headless Mac Minis.
2. **Apple Screen Sharing / VNC**: Enable Screen Sharing in System Settings. Connect from another Mac or use a VNC client. `screencapture` works against the active session.
3. **HDMI Dummy + Auto-Login**: Configure macOS to auto-login on boot and use the HDMI dummy plug for display emulation. The most robust setup for unattended operation.
4. **C/ua VMs**: Run the agent in a VM — it has its own virtual display. No dummy plug needed.
### What Works Over SSH (with GUI session active)

| Capability | Works Over SSH? |
|-----------|----------------|
| `osascript` / AppleScript | ✅ Yes (if Accessibility granted) |
| `screencapture` | ✅ Yes (with GUI session + display) |
| `cliclick` | ✅ Yes (with GUI session) |
| Peekaboo CLI | ✅ Yes (with GUI session) |
| pyautogui | ✅ Yes (with GUI session) |

### Recommended Headless Setup

```
Mac Mini M4 + HDMI Dummy Plug
├── Auto-login enabled
├── Screen Sharing enabled (for monitoring)
├── SSH enabled (for CLI access)
├── Peekaboo installed
├── Clawdbot/OpenClaw running via launchd (as a LaunchAgent, so it lives inside the GUI session)
└── HDMI dummy forces 1080p display for consistent screenshots
```

**Key tip from community:** Get the cheapest HDMI dummy plug you can find (Amazon, ~$8). Without it, the Mac Mini may boot into a low-resolution or no-display mode that breaks all screenshot-based automation.

---


## Sources

- [Anthropic Computer Use Docs](https://docs.anthropic.com/en/docs/build-with-claude/computer-use)
- [Simon Willison's Computer Use Analysis](https://simonwillison.net/2024/Oct/22/computer-use/)
- [Benjamin Anderson: Should I Buy Claude a Mac Mini?](https://benanderson.work/blog/claude-mac-mini/)
- [Peekaboo GitHub](https://github.com/steipete/Peekaboo)
- [C/ua GitHub](https://github.com/trycua/cua)
- [Agent S GitHub](https://github.com/simular-ai/Agent-S)
- [macOS-use GitHub](https://github.com/browser-use/macOS-use)
- [mcp-server-macos-use GitHub](https://github.com/mediar-ai/mcp-server-macos-use)
- [macOS Automator MCP GitHub](https://github.com/steipete/macos-automator-mcp)
- [Apple Accessibility Documentation](https://developer.apple.com/library/archive/documentation/LanguagesUtilities/Conceptual/MacAutomationScriptingGuide/AutomatetheUserInterface.html)
- Various Reddit threads (r/macmini, r/Anthropic, r/MacOS, r/LocalLLaMA)

---

*Last updated: February 18, 2026*