=== NEW SERVERS ADDED (7) ===
- servers/closebot — 119 tools, 14 modules, 4,656 lines TS (Stage 7)
- servers/google-console — Google Search Console MCP (Stage 7)
- servers/meta-ads — Meta/Facebook Ads MCP (Stage 8)
- servers/twilio — Twilio communications MCP (Stage 8)
- servers/competitor-research — Competitive intel MCP (Stage 6)
- servers/n8n-apps — n8n workflow MCP apps (Stage 6)
- servers/reonomy — Commercial real estate MCP (Stage 1)
=== FACTORY INFRASTRUCTURE ADDED ===
- infra/factory-tools — mcp-jest, mcp-validator, mcp-add, MCP Inspector
- 60 test configs, 702 auto-generated test cases
- All 30 servers score 100/100 protocol compliance
- infra/command-center — Pipeline state, operator playbook, dashboard config
- infra/factory-reviews — Automated eval reports
=== DOCS ADDED ===
- docs/MCP-FACTORY.md — Factory overview
- docs/reports/ — 5 pipeline evaluation reports
- docs/research/ — Browser MCP research
=== RULES ESTABLISHED ===
- CONTRIBUTING.md — All MCP work MUST go in this repo
- README.md — Full inventory of 37 servers + infra docs
- .gitignore — Updated for Python venvs
TOTAL: 37 MCP servers + full factory pipeline in one repo. This is now the single source of truth for all MCP work.
Agent Gamma — AI/UX & Testing Review
Reviewer: Agent Gamma (AI/UX & Testing Methodology Expert)
Date: February 4, 2026
Scope: All 5 MCP Factory skills + master blueprint
Research basis: Paragon tool-calling benchmarks, Statsig agent architecture patterns, MCP Apps official spec (Jan 2026), Prompt Engineering Guide (function calling), Confident AI agent evaluation framework, WCAG 2.1 accessibility standards, Berkeley Function Calling Leaderboard findings, visual regression tooling landscape
Executive Summary
- Tool descriptions are the pipeline's hidden bottleneck. The current "What/Returns/When" formula is good but insufficient — research shows tool descriptions need negative examples ("do NOT use when..."), disambiguation cues between similar tools, and output shape previews to reach >95% routing accuracy. With 30+ servers averaging 20+ tools each, misrouting will be the #1 user-facing failure mode.
- The official MCP Apps extension (shipped Jan 2026) makes our iframe/postMessage architecture semi-obsolete. MCP now has ui:// resource URIs, _meta.ui.resourceUri on tools, and bidirectional JSON-RPC over postMessage. Our skill documents don't mention this at all — we're building to a 2025 pattern while the spec has moved forward.
- Testing is the weakest link in the pipeline. The QA skill has the right layers but lacks quantitative metrics (tool correctness rate, task completion rate), has no automated regression baseline, no accessibility auditing, and no test data fixtures. It's a manual checklist masquerading as a testing framework.
- Accessibility is completely absent. Zero mention of ARIA attributes, keyboard navigation, focus management, screen reader support, or WCAG contrast ratios across all 5 skills. Our dark theme palette fails WCAG AA for secondary text (#96989d on #1a1d23 = 3.7:1, needs 4.5:1).
- App UX patterns are solid for static rendering but miss all interactive patterns. No drag-and-drop (kanban reordering), no inline editing, no real-time streaming updates, no optimistic UI, no undo/redo, no keyboard shortcuts, no search-within-app. Apps feel like screenshots, not tools.
Per-Skill Reviews
1. MCP API Analyzer (Phase 1)
Strengths:
- Excellent reading priority hierarchy (auth → rate limits → overview → endpoints)
- The "speed technique for large APIs" using OpenAPI specs is smart
- App candidate selection criteria are well-reasoned (BUILD when / SKIP when)
- Template is thorough and would produce consistent outputs
Issues & Suggestions:
🔴 Critical: Tool description formula needs upgrading
The current formula is:
{What it does}. {What it returns}. {When to use it / what triggers it}.
Research from Paragon's 50-test-case benchmark (2025) and the Prompt Engineering Guide shows this needs expansion. Better formula:
{What it does}. {What it returns — include 2-3 key field names}.
{When to use it — specific user intents}. {When NOT to use it — disambiguation}.
{Side effects — if any}.
Example upgrade:
# Current (from skill)
"List contacts with optional filters. Returns paginated results including name, email, phone,
and status. Use when the user wants to see, search, or browse their contact list."
# Improved
"List contacts with optional filters and pagination. Returns {name, email, phone, status,
created_date} for each contact. Use when the user wants to browse, filter, or get an overview
of multiple contacts. Do NOT use for searching by specific keyword (use search_contacts instead)
or for getting full details of one contact (use get_contact instead)."
The "do NOT use" disambiguation is the single highest-impact improvement per Paragon's research — it reduced tool misrouting by ~30% in their benchmarks.
🟡 Important: Missing tool count optimization guidance
The skill says "aim for 5-15 groups, 3-15 tools per group" but doesn't address total tool count impact. Research from Berkeley Function Calling Leaderboard and the Medium analysis on tool limits shows:
- 1-10 tools: High accuracy, minimal degradation
- 10-20 tools: Noticeable accuracy drops begin
- 20+ tools: Significant degradation; lazy loading helps but descriptions still crowd the context
Recommendation: Add guidance to cap active tools at 15-20 per interaction via lazy loading, and add a "tool pruning" section for aggressively combining similar tools (e.g., list_contacts + search_contacts → single tool with optional query param).
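A pruned tool along those lines might look like the sketch below. The merged definition is hypothetical (field names and wording are illustrative, not taken from an existing server), but it shows the pattern: one tool, an optional `query` parameter that absorbs the search case, and the anti-description pointing at the remaining neighbor tool.

```typescript
// Hypothetical merged tool replacing list_contacts + search_contacts.
// The optional `query` parameter folds keyword search into the list tool,
// removing one entry from the active tool set.
const listContactsTool = {
  name: "list_contacts",
  description:
    "List or search contacts with optional filters and pagination. " +
    "Returns {name, email, phone, status, created_date} for each contact. " +
    "Pass `query` to search by keyword; omit it to browse everything. " +
    "Do NOT use for full details of one known contact (use get_contact instead).",
  inputSchema: {
    type: "object",
    properties: {
      query: { type: "string", description: "Optional keyword matched against name and email" },
      status: { type: "string", description: "Filter by contact status" },
      page: { type: "number", description: "Page number (default 1)" },
    },
    required: [] as string[], // everything optional: bare calls mean "browse all"
  },
};
```

Because `query` is optional, the routing decision the LLM previously had to make (list vs. search) disappears entirely.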
🟡 Important: No semantic clustering guidance
When tools have overlapping names (e.g., list_invoices, get_invoice_summary, get_invoice_details), LLMs struggle. Add guidance for:
- Using verb prefixes that signal intent: browse_ (list/overview), inspect_ (single item deep-dive), modify_ (create/update), remove_ (delete)
- Grouping mutually exclusive tools with "INSTEAD OF" notes in descriptions
🟢 Nice-to-have: Add example disambiguation table
For each tool group, produce a disambiguation matrix:
| User says... | Correct tool | Why not others |
|---|---|---|
| "Show me all contacts" | list_contacts | Not search (no keyword), not get (not specific) |
| "Find John Smith" | search_contacts | Not list (specific name = search), not get (no ID) |
| "What's John's email?" | get_contact | Not list/search (asking about specific known contact) |
2. MCP Server Builder (Phase 2)
Strengths:
- Solid project scaffolding with good defaults
- Auth pattern catalog covers the common cases well
- MCP Annotations decision matrix is clear and correct
- Error handling pattern (Zod → client → server levels) is well-layered
- One-file vs modular threshold (15 tools) is practical
Issues & Suggestions:
🔴 Critical: Missing MCP Apps extension support
As of January 2026, MCP has an official Apps extension (@modelcontextprotocol/ext-apps). This changes how tools declare UI:
// NEW PATTERN: Tool declares its UI resource
registerAppTool(server, "get-time", {
title: "Get Time",
description: "Returns the current server time.",
inputSchema: {},
_meta: { ui: { resourceUri: "ui://get-time/mcp-app.html" } },
}, async () => { /* handler */ });
// Resource serves the HTML
registerAppResource(server, resourceUri, resourceUri,
{ mimeType: RESOURCE_MIME_TYPE },
async () => { /* return HTML */ }
);
Our servers should be built to support BOTH our custom LocalBosses postMessage pattern AND the official MCP Apps protocol. This future-proofs the servers for use in Claude Desktop, VS Code Copilot, and other MCP hosts.
Action: Add a section on _meta.ui.resourceUri registration. Update the tool definition interface to include optional _meta field.
🟡 Important: Tool description in code doesn't match analysis guidance
The builder skill's tool group template has descriptions that are shorter and less detailed than what the analyzer skill recommends. The code template shows:
description: "List contacts with optional filters and pagination. Returns name, email, phone, and status. Use when the user wants to see, search, or browse contacts."
But the Zod schema descriptions are separate and minimal:
page: z.number().optional().default(1).describe("Page number (default 1)")
Issue: Parameter descriptions in Zod .describe() aren't always surfaced by MCP clients. The parameter descriptions in inputSchema.properties[].description are what matters for tool selection. Add explicit guidance: "Always put the most helpful description in inputSchema.properties, not just in Zod."
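One way to enforce that guidance mechanically is a small lint pass over the generated inputSchema. The helper below is a sketch (the name, threshold, and shape are assumptions, not part of any skill), flagging parameters whose client-visible description is missing or too short to help tool selection.

```typescript
// Sketch of a lint check: flag inputSchema parameters whose description
// (the field MCP clients actually surface) is missing or trivially short.
type JsonSchemaProps = Record<string, { description?: string }>;

function undocumentedParams(props: JsonSchemaProps, minLen = 15): string[] {
  return Object.entries(props)
    .filter(([, p]) => !p.description || p.description.length < minLen)
    .map(([name]) => name);
}

// Example: `page` passes, `q` is flagged for having no description at all.
const flagged = undocumentedParams({
  page: { description: "Page number (default 1)" },
  q: {},
});
```

Running this in the builder's static-analysis step would catch descriptions that exist only in Zod `.describe()` and never made it into the schema.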
🟡 Important: No output schema guidance
Tool definitions include inputSchema but nothing about expected output shapes. While MCP doesn't formally require output schemas, providing an output hint in the tool description massively helps:
- The LLM knows what data it will get back
- The LLM can better plan multi-step tool chains
- App designers know exactly what fields to expect
Add to the tool definition template:
// In the description:
"Returns: { data: Contact[], meta: { total, page, pageSize } } where Contact has {name, email, phone, status}"
🟢 Nice-to-have: Add streaming support pattern
For tools that return large datasets, add a streaming pattern using MCP's progress notifications. This is especially relevant for list/search operations that may take 2-5 seconds.
3. MCP App Designer (Phase 3)
Strengths:
- Comprehensive design system with specific hex values and spacing
- The 8 app type templates cover the most common patterns
- Three-state requirement (loading/empty/data) is excellent
- Data reception with both postMessage + polling is robust
- Responsive breakpoints and CSS are production-ready
Issues & Suggestions:
🔴 Critical: No accessibility at all
The entire skill has zero mention of:
- ARIA attributes — Tables need role="table", status badges need role="status" or aria-label
- Keyboard navigation — Interactive elements must be focusable and operable with Enter/Space
- Focus management — When data loads and replaces the skeleton, focus should move to the content
- Color contrast — Secondary text (#96989d on #1a1d23) = 3.7:1 ratio. WCAG AA requires 4.5:1 for normal text. Fix: use #b0b2b8 for secondary text (5.0:1)
- Screen reader announcements — Data state changes should use aria-live="polite" regions
- Reduced motion — The shimmer animation should respect prefers-reduced-motion
Minimum additions to base template:
<!-- Add to loading state -->
<div id="loading" role="status" aria-label="Loading content">
<span class="sr-only">Loading...</span>
<!-- skeletons -->
</div>
<!-- Add to content container -->
<div id="content" style="display:none" aria-live="polite">
/* Screen reader only class */
.sr-only { position: absolute; width: 1px; height: 1px; padding: 0; margin: -1px; overflow: hidden; clip: rect(0,0,0,0); border: 0; }
/* Respect reduced motion */
@media (prefers-reduced-motion: reduce) {
.skeleton { animation: none; background: #2b2d31; }
}
🔴 Critical: Missing interactive patterns
The 8 app types are all display patterns. Real productivity apps need:
- Inline editing — Click a cell in the data grid to edit it, sends update via postMessage to host
- Drag-and-drop — Reorder pipeline columns, prioritize items (critical for kanban boards)
- Bulk actions — Select multiple rows with checkboxes, apply action to all
- Search/filter within app — Client-side filtering without roundtripping through the AI
- Sorting — Click column headers to sort (client-side for loaded data)
- Pagination controls — Previous/Next buttons that request more data from host
- Expand/collapse — Accordion sections for detail cards with many fields
- Copy-to-clipboard — Click to copy IDs, emails, etc.
Add at least a 9th app type: Interactive Data Grid with sort, filter, select, and inline edit.
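The sorting piece of that grid can live entirely client-side. A minimal sketch (the `Row` shape and function name are illustrative, not from the skill): a pure comparator the grid would call on header click, re-rendering rows from the returned array so no roundtrip through the AI is needed.

```typescript
// Client-side column sort for the proposed Interactive Data Grid.
// Pure function so it runs outside the DOM; numbers sort numerically,
// everything else falls back to locale-aware string comparison.
type Row = Record<string, string | number>;

function sortRows(rows: Row[], key: string, dir: "asc" | "desc" = "asc"): Row[] {
  const sign = dir === "asc" ? 1 : -1;
  return [...rows].sort((a, b) => {
    const av = a[key], bv = b[key];
    if (typeof av === "number" && typeof bv === "number") return (av - bv) * sign;
    return String(av).localeCompare(String(bv)) * sign;
  });
}
```

Returning a copy (rather than sorting in place) keeps the original data order available for an "unsorted" toggle.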
🟡 Important: No data visualization beyond bar charts
The Analytics template only shows basic vertical bar charts. Missing:
- Line/area charts — For time-series trends (critical for dashboards)
- Donut/pie charts — For composition/percentage breakdowns
- Sparklines — Tiny inline charts in metric cards showing trend
- Heatmaps — For calendar/matrix data (contribution-style)
- Progress bars — For funnel conversion rates, goal tracking
- Horizontal bar charts — For ranking/comparison views
All of these can be done in pure CSS/SVG without external libraries. Add a "Visualization Primitives" section with reusable CSS/SVG snippets.
Example sparkline (pure SVG):
<svg viewBox="0 0 100 30" style="width:80px;height:24px">
<polyline fill="none" stroke="#ff6d5a" stroke-width="2"
points="0,25 15,20 30,22 45,10 60,15 75,8 90,12 100,5"/>
</svg>
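The hardcoded points above can be generated from real data with a few lines. This is a sketch (function name and padding value are assumptions): it scales a numeric series into the 100x30 viewBox used by the sparkline, with a small vertical margin so the stroke isn't clipped.

```typescript
// Build the `points` attribute for a sparkline <polyline> from raw values,
// scaling into a w x h viewBox with `pad` pixels of vertical margin.
function sparklinePoints(values: number[], w = 100, h = 30, pad = 3): string {
  const min = Math.min(...values);
  const max = Math.max(...values);
  const span = max - min || 1; // avoid divide-by-zero on a flat series
  return values
    .map((v, i) => {
      const x = (i / (values.length - 1)) * w;
      const y = h - pad - ((v - min) / span) * (h - 2 * pad);
      return `${x.toFixed(1)},${y.toFixed(1)}`;
    })
    .join(" ");
}
```

Usage: `<polyline points="${sparklinePoints(trend)}" ...>` inside the metric card's render function (needs at least two values).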
🟡 Important: No error boundary pattern
If the render function throws (malformed data, unexpected types), the entire app goes blank. Add a global error boundary:
window.onerror = function(msg, url, line) {
document.getElementById('content').innerHTML = `
<div class="empty-state">
<div class="empty-state-icon">⚠️</div>
<div class="empty-state-title">Display Error</div>
<div class="empty-state-text">The app encountered an issue rendering the data. Try sending a new message.</div>
</div>`;
showState('data');
return true;
};
🟡 Important: Missing bidirectional communication pattern
Apps currently only receive data. They should also be able to:
- Request data refresh (user clicks "Refresh" button)
- Send user actions back to host (user clicks "Delete" on a row)
- Navigate to another app (user clicks a contact name → opens contact card)
Add a sendToHost() utility:
function sendToHost(action, payload) {
window.parent.postMessage({
type: 'mcp_app_action',
action,
payload,
appId: APP_ID
}, '*');
}
// Usage: sendToHost('refresh', {});
// Usage: sendToHost('navigate', { app: 'contact-card', contactId: '123' });
// Usage: sendToHost('tool_call', { tool: 'delete_contact', args: { id: '123' } });
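The host needs a counterpart that routes these messages. A sketch of that dispatcher follows (the handler map and function name are illustrative; the message shape matches the sendToHost() payload above). It fails closed on unknown actions, which matters since any frame can post messages.

```typescript
// Host-side router for mcp_app_action messages coming from app iframes.
type AppAction = { type: string; action: string; payload: unknown; appId: string };

function makeActionRouter(
  handlers: Record<string, (payload: unknown, appId: string) => void>
) {
  return (msg: AppAction): boolean => {
    if (msg.type !== "mcp_app_action") return false; // ignore unrelated messages
    const handler = handlers[msg.action];
    if (!handler) return false; // unknown action: fail closed
    handler(msg.payload, msg.appId);
    return true;
  };
}
```

In the host this would be wired as `window.addEventListener('message', e => route(e.data))`, and the listener should also validate `e.origin`/`e.source` before trusting the payload, since the app currently posts with a '*' target origin.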
🟢 Nice-to-have: Add micro-interactions
- Stagger animation on list items appearing (each row fades in 50ms apart)
- Number counting animation on metric values
- Smooth transitions when data updates (not a hard re-render)
.row-enter { animation: fadeSlideIn 0.2s ease-out forwards; opacity: 0; }
@keyframes fadeSlideIn { from { opacity: 0; transform: translateY(4px); } to { opacity: 1; transform: translateY(0); } }
4. MCP LocalBosses Integrator (Phase 4)
Strengths:
- Extremely detailed file-by-file update guide — truly copy-paste ready
- Complete Calendly walkthrough example is great
- Cross-reference check (all 4 files must have every app ID) is critical
- System prompt engineering section covers the right principles
Issues & Suggestions:
🔴 Critical: System prompt engineering is under-specified
The current guidance is "describe capabilities in natural language" and "specify when to use each tool." This is insufficient for reliable tool routing. Research from the Prompt Engineering Guide and Statsig's optimization guide shows system prompts need:
- Explicit tool routing rules — Not just "you can manage contacts" but structured decision trees:
TOOL SELECTION RULES:
- If user asks to SEE/BROWSE/LIST multiple items → use list_* tools
- If user asks about ONE specific item by name/ID → use get_* tools
- If user asks to CREATE/ADD/NEW → use create_* tools
- If user asks to CHANGE/UPDATE/MODIFY → use update_* tools
- If user asks to DELETE/REMOVE → use delete_* tools (always confirm first)
- If user asks for STATS/METRICS/OVERVIEW → use analytics tools
- Output formatting instructions — Tell the AI exactly how to structure APP_DATA:
When returning data for the contact grid app, your APP_DATA MUST include:
- "data": array of objects, each with at minimum {name, email, status}
- "meta": {total, page, pageSize} for pagination
- "title": descriptive title matching what user asked for
- Few-shot examples — Include 2-3 example interactions showing the full input → tool call → APP_DATA flow. This is the single most effective technique per OpenAI's prompt engineering guide.
- Negative instructions — "Do NOT call tools when the user asks general questions about best practices. Do NOT use list tools when the user clearly knows which specific record they want."
🟡 Important: Intake questions need A/B testing framework
The intake question is the first interaction point and hugely impacts user experience. Currently it's hardcoded text with no measurement. Add:
- Guidance for writing intake questions that are action-oriented not question-oriented
- Alternative phrasings to test (e.g., "What contacts should I pull up?" vs "Tell me what you're looking for")
- Skip label should be the most common action (data shows 60%+ users skip — make the default great)
🟡 Important: System prompt addon is too coupled to data shape
The systemPromptAddon includes exact JSON structures, which means:
- If the app's render() function changes, the prompt is stale
- The AI treats it as a template, not understanding the data semantics
- Complex data requires enormous prompt addons
Better approach: Reference a data contract by name:
systemPromptAddon: `Generate APP_DATA conforming to the ContactGrid schema.
Required fields: data[] with {name, email, phone, status, created}, meta with {total, page, pageSize}.
Include 5-25 records matching the user's request. Realistic data only.`,
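For the contract-by-name approach to hold, something has to actually enforce the named schema. A minimal validator sketch for the hypothetical ContactGrid contract (field names taken from the addon above; the function itself is an assumption, and a real JSON Schema library would be stricter):

```typescript
// Minimal validator for the ContactGrid data contract: data[] rows with
// required string fields plus a numeric meta block. Returns a list of
// human-readable errors; empty array means the payload conforms.
function validateContactGrid(payload: any): string[] {
  const errors: string[] = [];
  if (!Array.isArray(payload?.data)) errors.push("data must be an array");
  else payload.data.forEach((row: any, i: number) => {
    for (const field of ["name", "email", "status"]) {
      if (typeof row?.[field] !== "string") errors.push(`data[${i}].${field} missing or not a string`);
    }
  });
  for (const field of ["total", "page", "pageSize"]) {
    if (typeof payload?.meta?.[field] !== "number") errors.push(`meta.${field} missing or not a number`);
  }
  return errors;
}
```

The same function doubles as the "APP_DATA Schema Match" check the QA skill needs: run it over every AI response before the app renders.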
🟢 Nice-to-have: Add channel onboarding flow
When a user enters a new channel for the first time, show a brief guided tour:
- What this channel does
- What apps are available (visual toolbar walkthrough)
- Example things to try
5. MCP QA Tester (Phase 5)
Strengths:
- Five testing layers is the right conceptual framework
- The shell script template for automated static analysis is practical
- Common issues & fixes table is immediately useful
- Visual testing with Gemini/Peekaboo is creative
Issues & Suggestions:
🔴 Critical: No quantitative metrics or benchmarks
The entire testing framework is binary pass/fail checklists. Modern LLM agent evaluation (per Confident AI's DeepEval framework and the Berkeley Function Calling Leaderboard) measures:
- Tool Correctness Rate — What % of natural language messages trigger the correct tool? Target: >95%
- Task Completion Rate — What % of end-to-end scenarios actually complete? Target: >90%
- First-Attempt Success Rate — Does the tool work on the first call without retries? Target: >85%
- APP_DATA Accuracy — Does the generated JSON match the app's expected schema? Target: 100%
- Response Latency — Time from user message to app render. Target: <3 seconds for reads, <5 for writes
Add a metrics section:
## Performance Metrics (per channel)
| Metric | Target | Method |
|--------|--------|--------|
| Tool Correctness | >95% | Run 20 NL messages, count correct tool selections |
| Task Completion | >90% | Run 10 E2E scenarios, count fully completed |
| APP_DATA Schema Match | 100% | Validate every APP_DATA block against JSON schema |
| Response Latency (P50) | <3s | Measure 10 interactions |
| Response Latency (P95) | <8s | Measure 10 interactions |
| App Render Success | 100% | All apps render data state without console errors |
| Accessibility Score | >90 | Run axe-core or Lighthouse on each app |
🔴 Critical: No regression testing baseline
The skill has no concept of baselines or regression detection. When you update a tool description, how do you know you didn't break routing for 3 other tools? When you change an app's CSS, how do you detect layout shifts?
Add:
- Screenshot baselines — Store reference screenshots per app. On each test run, compare pixel diff. Tools: BackstopJS (open source), or custom Gemini comparison.
- Tool routing baselines — Store a fixtures file of 20 NL messages → expected tool mappings. Re-run after any tool description change.
- JSON schema validation — Define schemas for each app's expected APP_DATA format. Validate every AI response against it.
# Screenshot baseline workflow
backstop init
backstop reference # Capture current state as baseline
# ... make changes ...
backstop test # Compare against baseline, flag regressions
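The tool-routing baseline reduces to a fixture list plus a scorer. A sketch (the types and function name are illustrative): each fixture pairs a natural-language message with the tool expected to fire; after a run, observed selections are compared against the baseline and the correctness rate is reported.

```typescript
// Tool-routing regression scorer: compare observed tool selections against
// the fixture baseline and return the correctness rate (0..1).
type RoutingFixture = { message: string; expectedTool: string };

function scoreRouting(
  fixtures: RoutingFixture[],
  observed: Record<string, string> // message -> tool the AI actually chose
): number {
  const correct = fixtures.filter(f => observed[f.message] === f.expectedTool).length;
  return fixtures.length ? correct / fixtures.length : 0;
}
```

Re-run this after any tool description change and fail CI when the rate drops below the 95% target, so a "fix" for one tool can't silently break routing for its neighbors.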
🔴 Critical: No accessibility testing
Zero mention of:
- Color contrast auditing (our #96989d secondary text FAILS WCAG AA)
- Keyboard navigation testing (Tab through all interactive elements)
- Screen reader testing (VoiceOver on Mac)
- axe-core or Lighthouse accessibility audits
Add Layer 2.5: Accessibility Testing:
### Accessibility Checks (per app)
- [ ] Run axe-core: `axe.run(document).then(results => console.log(results.violations))`
- [ ] All text passes WCAG AA contrast (4.5:1 normal, 3:1 large)
- [ ] All interactive elements reachable via Tab key
- [ ] All interactive elements operable with Enter/Space
- [ ] Loading/empty/data state changes announced to screen readers
- [ ] No info conveyed by color alone (icons/text supplement color badges)
🟡 Important: Testing is entirely manual
The "automated QA script" only checks file existence and compilation. The functional, visual, and integration layers are all "manual testing required." For 30+ servers, this is unscalable.
Add automated testing patterns:
- Tool routing smoke test — Script that sends 5 NL messages per channel via API and checks tool selection
- APP_DATA schema validator — Script that parses AI responses and validates JSON against schemas
- App render test — Playwright script that loads each HTML file, injects sample data, screenshots it
// Automated app render test (Playwright)
const { chromium } = require('playwright');
const path = require('path');
async function testApp(htmlPath, sampleData) {
  const browser = await chromium.launch();
  const page = await browser.newPage({ viewport: { width: 400, height: 600 } });
  // Register the console listener before loading so render-time errors are captured
  const errors = [];
  page.on('console', msg => { if (msg.type() === 'error') errors.push(msg.text()); });
  await page.goto(`file://${htmlPath}`);
  // Inject data via postMessage, mimicking the host
  await page.evaluate((data) => {
    window.postMessage({ type: 'mcp_app_data', data }, '*');
  }, sampleData);
  await page.waitForTimeout(500);
  // Screenshot for baseline comparison
  await page.screenshot({ path: `/tmp/test-${path.basename(htmlPath)}.png` });
  // Check content rendered (not still showing loading)
  const loadingVisible = await page.isVisible('#loading');
  const contentVisible = await page.isVisible('#content');
  await browser.close();
  return { errors, loadingVisible, contentVisible };
}
🟡 Important: No performance testing
No guidance on measuring:
- App file size budgets (should enforce <50KB)
- Time to first render
- Memory usage (important for many-app channels like GHL with 65 apps)
- postMessage throughput (how fast can data update?)
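The file-size budget is the easiest of these to automate. A sketch (names and the default budget are assumptions, matching the <50KB figure above): a pure check over a file listing, which a wrapper script would feed from fs.statSync() results per app HTML file.

```typescript
// File-size budget check for app HTML files: return every file over budget.
type FileSize = { name: string; bytes: number };

function overBudget(files: FileSize[], budgetBytes = 50 * 1024): FileSize[] {
  return files.filter(f => f.bytes > budgetBytes);
}
```

Failing CI when overBudget() returns a non-empty list turns the budget from a guideline into an enforced constraint.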
🟡 Important: No data fixture library
Each test requires manually crafted sample data. Create a standardized fixture library:
fixtures/
dashboard-sample.json
data-grid-sample.json
detail-card-sample.json
timeline-sample.json
calendar-sample.json
pipeline-sample.json
empty-state.json
malformed-data.json
huge-dataset.json (1000+ rows)
🟢 Nice-to-have: Add chaos testing
What happens when:
- API returns 500 on every call?
- postMessage sends data in wrong format?
- APP_DATA is 500KB+ (huge dataset)?
- User sends 10 messages rapidly?
- Two apps try to render simultaneously?
Research Findings
1. Tool Calling Optimization (Paragon / Statsig / Berkeley BFCL)
Key findings:
- LLM model choice matters most. Paragon's benchmarks showed model selection had the biggest impact on tool correctness. o3 (April 2025 update) performed best, but Claude 3.5 Sonnet was close behind.
- Reducing tool count improves accuracy. The paper "Less is More" (arxiv, Nov 2024) proved that selectively reducing available tools significantly improves function-calling performance. Our lazy loading approach is on the right track, but we should go further — only surface tools relevant to the current conversation context.
- Tool descriptions are the #1 lever after model choice. Better descriptions improved correctness by ~15-25% in Paragon's tests. The "do NOT use when" pattern was particularly impactful.
- Router-based architecture outperforms flat tool lists. Statsig recommends: big model does routing/planning, specialized sub-agents handle execution. This is aligned with our lazy loading but could be extended to per-channel tool pre-filtering.
- Requiring a rationale before tool calls improves accuracy. Adding "Before calling any tool, briefly state which tool you're choosing and why" to system prompts reduces misrouting.
Recommendations for our pipeline:
- Add "anti-descriptions" (when NOT to use) to every tool
- Implement dynamic tool activation — only surface tools relevant to detected user intent
- Add rationale requirement to system prompts
- Cap active tool count at 15-20 per interaction
2. MCP Apps Official Extension (Jan 2026)
Major protocol update we're not leveraging:
- Tools can now declare _meta.ui.resourceUri pointing to a ui:// resource
- HTML apps communicate with hosts via JSON-RPC over postMessage (not custom protocol)
- Apps can call server tools directly, receive streaming data, and update context
- Sandboxed iframe rendering with CSP controls
- Adopted by Claude Desktop, VS Code Copilot, Gemini CLI, Cline, Goose, Codex
Impact on our pipeline:
- Phase 2 (Server Builder): Should register tools with _meta.ui when they have apps
- Phase 3 (App Designer): Should support the official MCP Apps SDK client-side
- Phase 4 (Integrator): LocalBosses should support both our custom protocol AND the official one
- This enables our servers to work in ANY MCP client, not just LocalBosses
3. Agent Evaluation Framework (Confident AI / DeepEval)
Industry standard for agent testing has evolved to:
- Component-level evaluation — Test each piece (tool selection, parameter extraction, response generation) separately, not just end-to-end
- Tool Correctness metric — Exact matching between expected and actual tool calls
- Task Completion metric — LLM-scored evaluation of whether the full task was completed
- Trace-based debugging — Record every step (tool chosen, params sent, output received) for root cause analysis
What we should adopt:
- Define test cases as { prompt, expected_tools, expected_params, expected_data_shape }
- Score tool correctness and task completion quantitatively
- Store traces for debugging failed tests
- Build a regression test suite that runs on every tool description change
4. Visual Regression Tooling (2025-2026 Landscape)
Top tools for our use case:
- BackstopJS — Open source, screenshot comparison, perfect for our HTML apps. No external dependencies.
- Percy (BrowserStack) — Cloud-based, AI-powered diff detection, but SaaS cost
- Playwright screenshots — Built into our existing toolchain, can compare programmatically
Recommended approach: BackstopJS for baseline management + Gemini multimodal for subjective quality analysis. This is a two-layer approach: pixel diff catches regressions, AI analysis catches design quality issues.
5. Best MCP Servers (Competitive Analysis)
Top-starred MCP servers (June 2025):
- GitHub MCP (15.2K ⭐) — Gold standard for API-aware agents with identity/permissions
- Playwright MCP (11.6K ⭐) — Browser automation via MCP, used for QA
- AWS MCP (3.7K ⭐) — Documentation, billing, service metadata
- Context7 — Provides LLMs with up-to-date, version-specific documentation
What they do better than us:
- Scoped permissions — GitHub MCP integrates with GitHub's auth model. Our servers have flat API keys with no per-tool permission scoping.
- Rich error context — Best servers return errors with suggested fixes, not just error messages
- Documentation as tool — Context7's approach of serving relevant docs as context is something our servers could do (e.g., when a tool fails, suggest the right docs)
- Security guardrails — Pomerium's analysis shows most MCP servers lack security. We should add at least basic rate limiting per-user and audit logging.
UX & Design Gaps
1. No Progressive Loading
When a user sends a message and waits 2-5 seconds for the AI to respond with APP_DATA, the app sits in "loading skeleton" state. Users don't know if it's working. We need:
- Streaming indicator — Show "AI is thinking..." or typing dots in the app itself
- Progressive data — If possible, stream partial APP_DATA as it's generated
- Time expectation — "Usually loads in 2-3 seconds" text in the loading state
2. No Transition Between Data States
When new APP_DATA arrives (user refines their request), the app hard-replaces all content. This is jarring. Better:
- Cross-fade between old and new content
- Highlight what changed (new rows, updated values)
- Animate metric values counting up/down to new numbers
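The count-up animation can be precomputed as a frame sequence. A sketch (function name and frame count are assumptions): cubic ease-out so the number decelerates into its final value, with the app rendering one frame per requestAnimationFrame tick.

```typescript
// Precompute frames for a metric count-up: ease-out from `from` to `to`.
function countFrames(from: number, to: number, frames = 20): number[] {
  return Array.from({ length: frames }, (_, i) => {
    const t = (i + 1) / frames;       // normalized progress 0..1
    const eased = 1 - Math.pow(1 - t, 3); // cubic ease-out
    return Math.round(from + (to - from) * eased);
  });
}
```

Precomputing keeps the render loop trivial, and the final frame always lands exactly on the target value (no floating-point drift at the end of the animation).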
3. No User Memory / Preferences
Apps don't remember anything between sessions:
- Last viewed filters/sort
- Preferred view mode (grid vs list)
- Collapsed/expanded sections
- Recently viewed items
This could use host-mediated storage (not localStorage in the iframe) via postMessage.
4. No Mobile Considerations
The responsive breakpoints stop at 280px but don't consider:
- Touch targets (minimum 44x44px per WCAG)
- Swipe gestures (swipe to delete, swipe between tabs)
- Safe area insets (notch/home indicator on mobile)
- Virtual keyboard pushing content
5. No Multi-Language Support
All apps are hardcoded English. At minimum:
- Date/number formatting should respect locale (toLocaleDateString is good but inconsistent)
- No hardcoded English strings in the templates — use a simple i18n pattern
- RTL text support for international users
6. No Empty State Personalization
Every app's empty state says "Ask me a question in the chat to populate this view with data." This should be contextual:
- Dashboard: "Ask me for a performance overview or specific metrics"
- Contact Grid: "Try 'show me all active contacts' or 'contacts added this week'"
- Pipeline: "Ask to see your sales pipeline or a specific deal stage"
7. Missing "Magic Moment" Polish
The transition from "user types message" to "beautiful app appears" should feel magical. Currently it's: loading skeleton → hard pop of content. Better:
- Typing indicator appears in chat
- App shows "Preparing your view..." with subtle animation
- Content slides in with staggered row animation
- Metric numbers animate from 0 to their values
- Charts animate/grow their bars
This takes the experience from "functional" to "delightful."
Testing Methodology Gaps
1. No Test Data Management
The QA skill has no concept of:
- Fixture files — Standardized sample data for each app type
- Edge case data — Empty strings, null values, extremely long text, Unicode, HTML entities
- Scale data — 1000+ row datasets to test scroll performance
- Adversarial data — XSS payloads in text fields (currently escaped with escapeHtml, but untested)
2. No Continuous Testing
Testing is positioned as a one-time phase, not continuous. Need:
- Pre-commit hooks — Run static analysis on every commit
- CI/CD integration — Automated screenshot comparison on PR
- Monitoring — Track tool correctness rate in production over time
- Alerting — If tool misrouting rate exceeds 5%, alert
3. No Cross-Browser Testing
Apps are tested in one browser (Safari via Peekaboo). Need:
- Chrome (most common)
- Firefox (rendering differences)
- Mobile Safari (iOS webview)
- Electron (if LocalBosses is desktop-wrapped)
4. No Load Testing
What happens when:
- 10 users hit the same channel simultaneously?
- An app receives 50 data updates per minute?
- 30 threads are open across different channels?
5. No Security Testing
Zero mention of:
- XSS testing (even though apps escape HTML, test it)
- CSRF considerations in postMessage handling
- Content Security Policy validation
- API key exposure in client-side code
6. No AI Response Quality Testing
Beyond "did the right tool fire?", test:
- Is the natural language response helpful?
- Does the APP_DATA contain realistic, well-formatted data?
- Does the AI handle ambiguous requests gracefully (asking for clarification vs guessing)?
- Does the AI handle multi-intent messages? ("Show me contacts and create a new deal")
7. Missing Test Types
| Test Type | Current Coverage | Gap |
|---|---|---|
| Static analysis | ✅ Basic | No linting, no type coverage |
| Visual testing | ⚠️ Manual screenshots | No baselines, no automated diff |
| Functional testing | ⚠️ Manual NL testing | No automated tool routing tests |
| Integration testing | ⚠️ Manual E2E | No scripted scenarios |
| Accessibility testing | ❌ None | Need axe-core + keyboard + VoiceOver |
| Performance testing | ❌ None | Need file size, render time, latency |
| Security testing | ❌ None | Need XSS, CSP, postMessage validation |
| Regression testing | ❌ None | Need baselines + automated comparison |
| Chaos testing | ❌ None | Need error injection, malformed data |
| AI quality testing | ❌ None | Need response quality scoring |
Priority Recommendations
Ranked by impact on user experience and pipeline reliability:
P0 — Critical (Do Before Shipping More Servers)
- Fix accessibility contrast ratio — Change secondary text from `#96989d` to `#b0b2b8` across all apps. This is a compliance issue.
  - Impact: High (legal/compliance risk, affects all apps)
  - Effort: Low (CSS find-and-replace)
- Upgrade tool description formula — Add "do NOT use when" disambiguation to every tool description template in the API Analyzer skill.
  - Impact: Very high (directly reduces tool misrouting, the #1 user-facing failure)
  - Effort: Medium (update templates, retroactively fix existing servers)
- Add quantitative QA metrics — Define Tool Correctness Rate, Task Completion Rate, APP_DATA Schema Match, and Response Latency as required metrics. Build the 20-message routing test fixture.
  - Impact: High (enables data-driven quality improvement)
  - Effort: Medium (define metrics, build test fixture)
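Once the routing fixture exists, most of these metrics fall out of a single pass over the results. A sketch, assuming each fixture run records `{ expected, actual, completed, latencyMs }` (a hypothetical shape, not a defined schema):

```javascript
// Compute the proposed QA metrics from one fixture run.
function qaMetrics(results) {
  const n = results.length;
  const correct = results.filter((r) => r.actual === r.expected).length;
  const completed = results.filter((r) => r.completed).length;
  const latencies = results.map((r) => r.latencyMs).sort((a, b) => a - b);
  const p95Index = Math.min(n - 1, Math.ceil(n * 0.95) - 1);
  return {
    toolCorrectnessRate: correct / n,
    taskCompletionRate: completed / n,
    p95LatencyMs: latencies[p95Index],
  };
}
```

Tracking these numbers per server, per pipeline run, is what makes "data-driven quality improvement" concrete.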
- Create test data fixtures — Build a fixtures library with sample data for each app type, including edge cases and adversarial data.
  - Impact: High (unblocks automated testing, ensures consistent QA)
  - Effort: Low-medium (one-time creation)
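A seed for that library, covering the edge cases listed under gap #1 (values here are illustrative, not a mandated schema):

```javascript
// Edge-case values every app template should render without breaking
const edgeCaseFixtures = {
  emptyString: "",
  nullValue: null,
  veryLongText: "x".repeat(10000),
  unicode: "名前 • café • 🚀",
  htmlEntities: "&lt;already escaped&gt;",
  xssPayload: "<img src=x onerror=alert(1)>",
};

// Scale data for the 1000+ row scroll-performance tests
const scaleFixture = Array.from({ length: 1000 }, (_, i) => ({
  id: i,
  name: `Contact ${i}`,
}));
```

Checking every app against the same fixture set is what makes QA results comparable across the 30+ servers.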
P1 — High Priority (Next Sprint)
- Add MCP Apps extension support — Update Server Builder to optionally register `_meta.ui.resourceUri`. Update App Designer to support the official SDK client-side protocol.
  - Impact: High (future-proofs servers for all MCP hosts, not just LocalBosses)
  - Effort: Medium-high (new code patterns, update templates)
- Add interactive patterns to App Designer — At minimum: client-side sort, client-side filter/search, copy-to-clipboard, and expand/collapse. These turn apps from views into tools.
  - Impact: High (transforms user experience from "reading" to "working")
  - Effort: Medium (new template code)
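The sort and filter patterns need no framework; a sketch over a generic row array (the row shape is hypothetical):

```javascript
// Pure view transform: filter by substring match, then sort by key.
// Re-render with the returned rows; the source data is never mutated.
function applyView(rows, { query = "", sortKey = null, descending = false } = {}) {
  const q = query.toLowerCase();
  let out = rows.filter((row) =>
    Object.values(row).some((v) => String(v).toLowerCase().includes(q))
  );
  if (sortKey) {
    out = [...out].sort((a, b) =>
      a[sortKey] < b[sortKey] ? -1 : a[sortKey] > b[sortKey] ? 1 : 0
    );
    if (descending) out.reverse();
  }
  return out;
}
```

Wired to a header click and a search input, this is most of the "views into tools" upgrade.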
- Build automated app render tests — Playwright script that loads each HTML app, injects fixture data, checks for console errors, and captures screenshots.
  - Impact: High (catches visual regressions automatically)
  - Effort: Medium (one-time script, reusable across all servers)
- Improve system prompt engineering guidelines — Add structured tool routing rules, few-shot examples, rationale requirements, and negative instructions to the Integrator skill.
  - Impact: High (directly improves AI interaction quality)
  - Effort: Medium (template updates + example creation)
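For concreteness, a hedged sketch of what such a routing section could look like inside the system prompt (tool names here are hypothetical placeholders):

```
## Tool routing rules
- Use list_contacts ONLY to read existing contacts. Do NOT use it to create
  or modify records; use create_contact / update_contact for that.
- If a message contains multiple intents ("show contacts and create a deal"),
  handle each intent with its own tool call, in the order stated.
- If no tool clearly matches, ask one clarifying question instead of guessing.

Example:
User: "who did we talk to at Acme?"
Assistant: calls list_contacts with query="Acme", because the user wants
existing records, not a new one.
```

The combination of a negative instruction, a multi-intent rule, and one worked example covers the three failure modes called out in the AI quality gap above.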
P2 — Important (This Quarter)
- Add data visualization primitives — Line charts, donut charts, sparklines, and progress bars in pure CSS/SVG. Include as copy-paste snippets in App Designer.
  - Impact: Medium-high (dashboards and analytics apps become much richer)
  - Effort: Medium (design + code for each viz type)
- Add accessibility testing layer — axe-core validation, keyboard navigation testing, and color contrast auditing as part of Layer 2 in QA.
  - Impact: Medium-high (compliance + usability)
  - Effort: Medium (add tools, update checklist)
- Add screenshot regression baselines — BackstopJS integration for automated visual comparison.
  - Impact: Medium (catches unintended visual changes)
  - Effort: Medium (setup + baseline capture)
- Add error boundaries to all apps — Global error handler plus try/catch in `render()` so apps never go blank.
  - Impact: Medium (prevents worst-case "blank screen" UX)
  - Effort: Low (small code addition to base template)
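The error-boundary addition is small enough to sketch here; `showFallback` is a placeholder to be wired to the app's root element:

```javascript
// Wrap render() so a throwing render shows a fallback instead of a blank app.
function makeSafeRender(render, showFallback) {
  return (data) => {
    try {
      return render(data);
    } catch (err) {
      showFallback(err); // e.g. swap in a "something went wrong" panel
      return null;
    }
  };
}

// In the browser, also catch anything that escapes render():
// window.addEventListener("error", (e) => showFallback(e.error));
```

Pairing this with the chaos tests proposed below (malformed data injection) verifies the fallback actually appears.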
P3 — Nice-to-Have (This Quarter if Time)
- Add bidirectional app communication — `sendToHost()` pattern for refresh, navigate, and tool calls from within apps.
- Add micro-interactions — Staggered row animations, metric counting, smooth transitions.
- Add dynamic tool activation — Surface only contextually-relevant tools per interaction.
- Add AI response quality scoring — Beyond tool correctness, evaluate helpfulness and data quality.
- Add chaos testing — Error injection, malformed data, rapid-fire interactions.
- Personalize empty states — Context-specific prompts per app type.
Appendix: Contrast Ratio Audit
| Element | Current Color | Background | Ratio | WCAG AA | Fix |
|---|---|---|---|---|---|
| Primary text | #dcddde | #1a1d23 | 10.4:1 | ✅ Pass | — |
| Secondary text | #96989d | #1a1d23 | 3.7:1 | ❌ Fail | Use #b0b2b8 (5.0:1) |
| Secondary text | #96989d | #2b2d31 | 3.2:1 | ❌ Fail | Use #b0b2b8 (4.3:1) or #b8babe (5.0:1) |
| Heading text | #ffffff | #1a1d23 | 15.0:1 | ✅ Pass | — |
| Accent | #ff6d5a | #1a1d23 | 4.9:1 | ✅ Pass | — |
| Accent on card | #ff6d5a | #2b2d31 | 4.2:1 | ⚠️ Fail (normal text) | OK for large text only |
| Table header | #96989d | #2b2d31 | 3.2:1 | ❌ Fail | Use #b0b2b8 |
| Success badge text | #43b581 | badge bg | 3.8:1 | ⚠️ Marginal | Use #4cc992 |
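Future palette changes can be audited automatically with the WCAG 2.1 relative-luminance formula. A sketch; spot-check its output against a trusted checker (e.g. axe-core) before adopting its verdicts:

```javascript
// WCAG 2.1 relative luminance of a #rrggbb color.
function luminance(hex) {
  const [r, g, b] = [1, 3, 5]
    .map((i) => parseInt(hex.slice(i, i + 2), 16) / 255)
    .map((c) => (c <= 0.03928 ? c / 12.92 : ((c + 0.055) / 1.055) ** 2.4));
  return 0.2126 * r + 0.7152 * g + 0.0722 * b;
}

// Contrast ratio in the range 1:1 to 21:1.
function contrastRatio(fg, bg) {
  const [hi, lo] = [luminance(fg), luminance(bg)].sort((a, b) => b - a);
  return (hi + 0.05) / (lo + 0.05);
}
// AA requires >= 4.5:1 for normal text, >= 3:1 for large text.
```

Run over every text/background pair in the base template, this turns the audit above into a repeatable CI check rather than a one-time review.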
End of review. These recommendations are prioritized to maximize impact on user experience while maintaining the pipeline's efficiency for mass-producing MCP servers. The most critical items (contrast fix, tool descriptions, QA metrics) should be addressed before shipping the next batch of servers.