=== NEW SERVERS ADDED (7) ===
- servers/closebot — 119 tools, 14 modules, 4,656 lines TS (Stage 7)
- servers/google-console — Google Search Console MCP (Stage 7)
- servers/meta-ads — Meta/Facebook Ads MCP (Stage 8)
- servers/twilio — Twilio communications MCP (Stage 8)
- servers/competitor-research — Competitive intel MCP (Stage 6)
- servers/n8n-apps — n8n workflow MCP apps (Stage 6)
- servers/reonomy — Commercial real estate MCP (Stage 1)
=== FACTORY INFRASTRUCTURE ADDED ===
- infra/factory-tools — mcp-jest, mcp-validator, mcp-add, MCP Inspector
- 60 test configs, 702 auto-generated test cases
- All 30 servers score 100/100 protocol compliance
- infra/command-center — Pipeline state, operator playbook, dashboard config
- infra/factory-reviews — Automated eval reports
=== DOCS ADDED ===
- docs/MCP-FACTORY.md — Factory overview
- docs/reports/ — 5 pipeline evaluation reports
- docs/research/ — Browser MCP research
=== RULES ESTABLISHED ===
- CONTRIBUTING.md — All MCP work MUST go in this repo
- README.md — Full inventory of 37 servers + infra docs
- .gitignore — Updated for Python venvs
TOTAL: 37 MCP servers + full factory pipeline in one repo. This is now the single source of truth for all MCP work.
Agent Gamma — AI/UX & Testing Review
Reviewer: Agent Gamma (AI/UX & Testing Methodology Expert)
Date: February 4, 2026
Scope: All 5 MCP Factory skills + master blueprint
Research basis: Paragon tool-calling benchmarks, Statsig agent architecture patterns, MCP Apps official spec (Jan 2026), Prompt Engineering Guide (function calling), Confident AI agent evaluation framework, WCAG 2.1 accessibility standards, Berkeley Function Calling Leaderboard findings, visual regression tooling landscape
Executive Summary
- Tool descriptions are the pipeline's hidden bottleneck. The current "What/Returns/When" formula is good but insufficient — research shows tool descriptions need negative examples ("do NOT use when..."), disambiguation cues between similar tools, and output shape previews to reach >95% routing accuracy. With 30+ servers averaging 20+ tools each, misrouting will be the #1 user-facing failure mode.
- The official MCP Apps extension (shipped Jan 2026) makes our iframe/postMessage architecture semi-obsolete. MCP now has ui:// resource URIs, _meta.ui.resourceUri on tools, and bidirectional JSON-RPC over postMessage. Our skill documents don't mention this at all — we're building to a 2025 pattern while the spec has moved forward.
- Testing is the weakest link in the pipeline. The QA skill has the right layers but lacks quantitative metrics (tool correctness rate, task completion rate), has no automated regression baseline, no accessibility auditing, and no test data fixtures. It's a manual checklist masquerading as a testing framework.
- Accessibility is completely absent. Zero mention of ARIA attributes, keyboard navigation, focus management, screen reader support, or WCAG contrast ratios across all 5 skills. Our dark theme palette fails WCAG AA for secondary text (#96989d on #1a1d23 = 3.7:1, needs 4.5:1).
- App UX patterns are solid for static rendering but miss all interactive patterns. No drag-and-drop (kanban reordering), no inline editing, no real-time streaming updates, no optimistic UI, no undo/redo, no keyboard shortcuts, no search-within-app. Apps feel like screenshots, not tools.
Per-Skill Reviews
1. MCP API Analyzer (Phase 1)
Strengths:
- Excellent reading priority hierarchy (auth → rate limits → overview → endpoints)
- The "speed technique for large APIs" using OpenAPI specs is smart
- App candidate selection criteria are well-reasoned (BUILD when / SKIP when)
- Template is thorough and would produce consistent outputs
Issues & Suggestions:
🔴 Critical: Tool description formula needs upgrading
The current formula is:
{What it does}. {What it returns}. {When to use it / what triggers it}.
Research from Paragon's 50-test-case benchmark (2025) and the Prompt Engineering Guide shows this needs expansion. Better formula:
{What it does}. {What it returns — include 2-3 key field names}.
{When to use it — specific user intents}. {When NOT to use it — disambiguation}.
{Side effects — if any}.
Example upgrade:
# Current (from skill)
"List contacts with optional filters. Returns paginated results including name, email, phone,
and status. Use when the user wants to see, search, or browse their contact list."
# Improved
"List contacts with optional filters and pagination. Returns {name, email, phone, status,
created_date} for each contact. Use when the user wants to browse, filter, or get an overview
of multiple contacts. Do NOT use for searching by specific keyword (use search_contacts instead)
or for getting full details of one contact (use get_contact instead)."
The "do NOT use" disambiguation is the single highest-impact improvement per Paragon's research — it reduced tool misrouting by ~30% in their benchmarks.
🟡 Important: Missing tool count optimization guidance
The skill says "aim for 5-15 groups, 3-15 tools per group" but doesn't address total tool count impact. Research from Berkeley Function Calling Leaderboard and the Medium analysis on tool limits shows:
- 1-10 tools: High accuracy, minimal degradation
- 10-20 tools: Noticeable accuracy drops begin
- 20+ tools: Significant degradation; lazy loading helps but descriptions still crowd the context
Recommendation: Add guidance to cap active tools at 15-20 per interaction via lazy loading, and add a "tool pruning" section for aggressively combining similar tools (e.g., list_contacts + search_contacts → single tool with optional query param).
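A pruned tool along those lines might look like the sketch below. The merged definition is hypothetical (field names and wording are illustrative, not taken from an existing server), but it shows the pattern: one tool, an optional `query` parameter that absorbs the search case, and the anti-description pointing at the remaining neighbor tool.

```typescript
// Hypothetical merged tool replacing list_contacts + search_contacts.
// The optional `query` parameter folds keyword search into the list tool,
// removing one entry from the active tool set.
const listContactsTool = {
  name: "list_contacts",
  description:
    "List or search contacts with optional filters and pagination. " +
    "Returns {name, email, phone, status, created_date} for each contact. " +
    "Pass `query` to search by keyword; omit it to browse everything. " +
    "Do NOT use for full details of one known contact (use get_contact instead).",
  inputSchema: {
    type: "object",
    properties: {
      query: { type: "string", description: "Optional keyword matched against name and email" },
      status: { type: "string", description: "Filter by contact status" },
      page: { type: "number", description: "Page number (default 1)" },
    },
    required: [] as string[], // everything optional: bare calls mean "browse all"
  },
};
```

Because `query` is optional, the routing decision the LLM previously had to make (list vs. search) disappears entirely.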
🟡 Important: No semantic clustering guidance
When tools have overlapping names (e.g., list_invoices, get_invoice_summary, get_invoice_details), LLMs struggle. Add guidance for:
- Using verb prefixes that signal intent: browse_ (list/overview), inspect_ (single item deep-dive), modify_ (create/update), remove_ (delete)
- Grouping mutually exclusive tools with "INSTEAD OF" notes in descriptions
🟢 Nice-to-have: Add example disambiguation table
For each tool group, produce a disambiguation matrix:
| User says... | Correct tool | Why not others |
|---|---|---|
| "Show me all contacts" | list_contacts | Not search (no keyword), not get (not specific) |
| "Find John Smith" | search_contacts | Not list (specific name = search), not get (no ID) |
| "What's John's email?" | get_contact | Not list/search (asking about specific known contact) |
2. MCP Server Builder (Phase 2)
Strengths:
- Solid project scaffolding with good defaults
- Auth pattern catalog covers the common cases well
- MCP Annotations decision matrix is clear and correct
- Error handling pattern (Zod → client → server levels) is well-layered
- One-file vs modular threshold (15 tools) is practical
Issues & Suggestions:
🔴 Critical: Missing MCP Apps extension support
As of January 2026, MCP has an official Apps extension (@modelcontextprotocol/ext-apps). This changes how tools declare UI:
// NEW PATTERN: Tool declares its UI resource
registerAppTool(server, "get-time", {
title: "Get Time",
description: "Returns the current server time.",
inputSchema: {},
_meta: { ui: { resourceUri: "ui://get-time/mcp-app.html" } },
}, async () => { /* handler */ });
// Resource serves the HTML
registerAppResource(server, resourceUri, resourceUri,
{ mimeType: RESOURCE_MIME_TYPE },
async () => { /* return HTML */ }
);
Our servers should be built to support BOTH our custom LocalBosses postMessage pattern AND the official MCP Apps protocol. This future-proofs the servers for use in Claude Desktop, VS Code Copilot, and other MCP hosts.
Action: Add a section on _meta.ui.resourceUri registration. Update the tool definition interface to include optional _meta field.
🟡 Important: Tool description in code doesn't match analysis guidance
The builder skill's tool group template has descriptions that are shorter and less detailed than what the analyzer skill recommends. The code template shows:
description: "List contacts with optional filters and pagination. Returns name, email, phone, and status. Use when the user wants to see, search, or browse contacts."
But the Zod schema descriptions are separate and minimal:
page: z.number().optional().default(1).describe("Page number (default 1)")
Issue: Parameter descriptions in Zod .describe() aren't always surfaced by MCP clients. The parameter descriptions in inputSchema.properties[].description are what matters for tool selection. Add explicit guidance: "Always put the most helpful description in inputSchema.properties, not just in Zod."
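One way to enforce that guidance mechanically is a small lint pass over the generated inputSchema. The helper below is a sketch (the name, threshold, and shape are assumptions, not part of any skill), flagging parameters whose client-visible description is missing or too short to help tool selection.

```typescript
// Sketch of a lint check: flag inputSchema parameters whose description
// (the field MCP clients actually surface) is missing or trivially short.
type JsonSchemaProps = Record<string, { description?: string }>;

function undocumentedParams(props: JsonSchemaProps, minLen = 15): string[] {
  return Object.entries(props)
    .filter(([, p]) => !p.description || p.description.length < minLen)
    .map(([name]) => name);
}

// Example: `page` passes, `q` is flagged for having no description at all.
const flagged = undocumentedParams({
  page: { description: "Page number (default 1)" },
  q: {},
});
```

Running this in the builder's static-analysis step would catch descriptions that exist only in Zod `.describe()` and never made it into the schema.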
🟡 Important: No output schema guidance
Tool definitions include inputSchema but nothing about expected output shapes. While MCP doesn't formally require output schemas, providing an output hint in the tool description massively helps:
- The LLM knows what data it will get back
- The LLM can better plan multi-step tool chains
- App designers know exactly what fields to expect
Add to the tool definition template:
// In the description:
"Returns: { data: Contact[], meta: { total, page, pageSize } } where Contact has {name, email, phone, status}"
🟢 Nice-to-have: Add streaming support pattern
For tools that return large datasets, add a streaming pattern using MCP's progress notifications. This is especially relevant for list/search operations that may take 2-5 seconds.
3. MCP App Designer (Phase 3)
Strengths:
- Comprehensive design system with specific hex values and spacing
- The 8 app type templates cover the most common patterns
- Three-state requirement (loading/empty/data) is excellent
- Data reception with both postMessage + polling is robust
- Responsive breakpoints and CSS are production-ready
Issues & Suggestions:
🔴 Critical: No accessibility at all
The entire skill has zero mention of:
- ARIA attributes — Tables need role="table", status badges need role="status" or aria-label
- Keyboard navigation — Interactive elements must be focusable and operable with Enter/Space
- Focus management — When data loads and replaces the skeleton, focus should move to the content
- Color contrast — Secondary text (#96989d on #1a1d23) = 3.7:1 ratio. WCAG AA requires 4.5:1 for normal text. Fix: use #b0b2b8 for secondary text (5.0:1)
- Screen reader announcements — Data state changes should use aria-live="polite" regions
- Reduced motion — The shimmer animation should respect prefers-reduced-motion
Minimum additions to base template:
<!-- Add to loading state -->
<div id="loading" role="status" aria-label="Loading content">
<span class="sr-only">Loading...</span>
<!-- skeletons -->
</div>
<!-- Add to content container -->
<div id="content" style="display:none" aria-live="polite">
/* Screen reader only class */
.sr-only { position: absolute; width: 1px; height: 1px; padding: 0; margin: -1px; overflow: hidden; clip: rect(0,0,0,0); border: 0; }
/* Respect reduced motion */
@media (prefers-reduced-motion: reduce) {
.skeleton { animation: none; background: #2b2d31; }
}
🔴 Critical: Missing interactive patterns
The 8 app types are all display patterns. Real productivity apps need:
- Inline editing — Click a cell in the data grid to edit it, sends update via postMessage to host
- Drag-and-drop — Reorder pipeline columns, prioritize items (critical for kanban boards)
- Bulk actions — Select multiple rows with checkboxes, apply action to all
- Search/filter within app — Client-side filtering without roundtripping through the AI
- Sorting — Click column headers to sort (client-side for loaded data)
- Pagination controls — Previous/Next buttons that request more data from host
- Expand/collapse — Accordion sections for detail cards with many fields
- Copy-to-clipboard — Click to copy IDs, emails, etc.
Add at least a 9th app type: Interactive Data Grid with sort, filter, select, and inline edit.
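The sorting piece of that grid can live entirely client-side. A minimal sketch (the `Row` shape and function name are illustrative, not from the skill): a pure comparator the grid would call on header click, re-rendering rows from the returned array so no roundtrip through the AI is needed.

```typescript
// Client-side column sort for the proposed Interactive Data Grid.
// Pure function so it runs outside the DOM; numbers sort numerically,
// everything else falls back to locale-aware string comparison.
type Row = Record<string, string | number>;

function sortRows(rows: Row[], key: string, dir: "asc" | "desc" = "asc"): Row[] {
  const sign = dir === "asc" ? 1 : -1;
  return [...rows].sort((a, b) => {
    const av = a[key], bv = b[key];
    if (typeof av === "number" && typeof bv === "number") return (av - bv) * sign;
    return String(av).localeCompare(String(bv)) * sign;
  });
}
```

Returning a copy (rather than sorting in place) keeps the original data order available for an "unsorted" toggle.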
🟡 Important: No data visualization beyond bar charts
The Analytics template only shows basic vertical bar charts. Missing:
- Line/area charts — For time-series trends (critical for dashboards)
- Donut/pie charts — For composition/percentage breakdowns
- Sparklines — Tiny inline charts in metric cards showing trend
- Heatmaps — For calendar/matrix data (contribution-style)
- Progress bars — For funnel conversion rates, goal tracking
- Horizontal bar charts — For ranking/comparison views
All of these can be done in pure CSS/SVG without external libraries. Add a "Visualization Primitives" section with reusable CSS/SVG snippets.
Example sparkline (pure SVG):
<svg viewBox="0 0 100 30" style="width:80px;height:24px">
<polyline fill="none" stroke="#ff6d5a" stroke-width="2"
points="0,25 15,20 30,22 45,10 60,15 75,8 90,12 100,5"/>
</svg>
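The hardcoded points above can be generated from real data with a few lines. This is a sketch (function name and padding value are assumptions): it scales a numeric series into the 100x30 viewBox used by the sparkline, with a small vertical margin so the stroke isn't clipped.

```typescript
// Build the `points` attribute for a sparkline <polyline> from raw values,
// scaling into a w x h viewBox with `pad` pixels of vertical margin.
function sparklinePoints(values: number[], w = 100, h = 30, pad = 3): string {
  const min = Math.min(...values);
  const max = Math.max(...values);
  const span = max - min || 1; // avoid divide-by-zero on a flat series
  return values
    .map((v, i) => {
      const x = (i / (values.length - 1)) * w;
      const y = h - pad - ((v - min) / span) * (h - 2 * pad);
      return `${x.toFixed(1)},${y.toFixed(1)}`;
    })
    .join(" ");
}
```

Usage: `<polyline points="${sparklinePoints(trend)}" ...>` inside the metric card's render function (needs at least two values).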
🟡 Important: No error boundary pattern
If the render function throws (malformed data, unexpected types), the entire app goes blank. Add a global error boundary:
window.onerror = function(msg, url, line) {
document.getElementById('content').innerHTML = `
<div class="empty-state">
<div class="empty-state-icon">⚠️</div>
<div class="empty-state-title">Display Error</div>
<div class="empty-state-text">The app encountered an issue rendering the data. Try sending a new message.</div>
</div>`;
showState('data');
return true;
};
🟡 Important: Missing bidirectional communication pattern
Apps currently only receive data. They should also be able to:
- Request data refresh (user clicks "Refresh" button)
- Send user actions back to host (user clicks "Delete" on a row)
- Navigate to another app (user clicks a contact name → opens contact card)
Add a sendToHost() utility:
function sendToHost(action, payload) {
window.parent.postMessage({
type: 'mcp_app_action',
action,
payload,
appId: APP_ID
}, '*');
}
// Usage: sendToHost('refresh', {});
// Usage: sendToHost('navigate', { app: 'contact-card', contactId: '123' });
// Usage: sendToHost('tool_call', { tool: 'delete_contact', args: { id: '123' } });
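The host needs a counterpart that routes these messages. A sketch of that dispatcher follows (the handler map and function name are illustrative; the message shape matches the sendToHost() payload above). It fails closed on unknown actions, which matters since any frame can post messages.

```typescript
// Host-side router for mcp_app_action messages coming from app iframes.
type AppAction = { type: string; action: string; payload: unknown; appId: string };

function makeActionRouter(
  handlers: Record<string, (payload: unknown, appId: string) => void>
) {
  return (msg: AppAction): boolean => {
    if (msg.type !== "mcp_app_action") return false; // ignore unrelated messages
    const handler = handlers[msg.action];
    if (!handler) return false; // unknown action: fail closed
    handler(msg.payload, msg.appId);
    return true;
  };
}
```

In the host this would be wired as `window.addEventListener('message', e => route(e.data))`, and the listener should also validate `e.origin`/`e.source` before trusting the payload, since the app currently posts with a '*' target origin.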
🟢 Nice-to-have: Add micro-interactions
- Stagger animation on list items appearing (each row fades in 50ms apart)
- Number counting animation on metric values
- Smooth transitions when data updates (not a hard re-render)
.row-enter { animation: fadeSlideIn 0.2s ease-out forwards; opacity: 0; }
@keyframes fadeSlideIn { from { opacity: 0; transform: translateY(4px); } to { opacity: 1; transform: translateY(0); } }
4. MCP LocalBosses Integrator (Phase 4)
Strengths:
- Extremely detailed file-by-file update guide — truly copy-paste ready
- Complete Calendly walkthrough example is great
- Cross-reference check (all 4 files must have every app ID) is critical
- System prompt engineering section covers the right principles
Issues & Suggestions:
🔴 Critical: System prompt engineering is under-specified
The current guidance is "describe capabilities in natural language" and "specify when to use each tool." This is insufficient for reliable tool routing. Research from the Prompt Engineering Guide and Statsig's optimization guide shows system prompts need:
- Explicit tool routing rules — Not just "you can manage contacts" but structured decision trees:
TOOL SELECTION RULES:
- If user asks to SEE/BROWSE/LIST multiple items → use list_* tools
- If user asks about ONE specific item by name/ID → use get_* tools
- If user asks to CREATE/ADD/NEW → use create_* tools
- If user asks to CHANGE/UPDATE/MODIFY → use update_* tools
- If user asks to DELETE/REMOVE → use delete_* tools (always confirm first)
- If user asks for STATS/METRICS/OVERVIEW → use analytics tools
- Output formatting instructions — Tell the AI exactly how to structure APP_DATA:
When returning data for the contact grid app, your APP_DATA MUST include:
- "data": array of objects, each with at minimum {name, email, status}
- "meta": {total, page, pageSize} for pagination
- "title": descriptive title matching what user asked for
- Few-shot examples — Include 2-3 example interactions showing the full input → tool call → APP_DATA flow. This is the single most effective technique per OpenAI's prompt engineering guide.
- Negative instructions — "Do NOT call tools when the user asks general questions about best practices. Do NOT use list tools when the user clearly knows which specific record they want."
🟡 Important: Intake questions need A/B testing framework
The intake question is the first interaction point and hugely impacts user experience. Currently it's hardcoded text with no measurement. Add:
- Guidance for writing intake questions that are action-oriented not question-oriented
- Alternative phrasings to test (e.g., "What contacts should I pull up?" vs "Tell me what you're looking for")
- Skip label should be the most common action (data shows 60%+ users skip — make the default great)
🟡 Important: System prompt addon is too coupled to data shape
The systemPromptAddon includes exact JSON structures, which means:
- If the app's render() function changes, the prompt is stale
- The AI treats it as a template, not understanding the data semantics
- Complex data requires enormous prompt addons
Better approach: Reference a data contract by name:
systemPromptAddon: `Generate APP_DATA conforming to the ContactGrid schema.
Required fields: data[] with {name, email, phone, status, created}, meta with {total, page, pageSize}.
Include 5-25 records matching the user's request. Realistic data only.`,
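For the contract-by-name approach to hold, something has to actually enforce the named schema. A minimal validator sketch for the hypothetical ContactGrid contract (field names taken from the addon above; the function itself is an assumption, and a real JSON Schema library would be stricter):

```typescript
// Minimal validator for the ContactGrid data contract: data[] rows with
// required string fields plus a numeric meta block. Returns a list of
// human-readable errors; empty array means the payload conforms.
function validateContactGrid(payload: any): string[] {
  const errors: string[] = [];
  if (!Array.isArray(payload?.data)) errors.push("data must be an array");
  else payload.data.forEach((row: any, i: number) => {
    for (const field of ["name", "email", "status"]) {
      if (typeof row?.[field] !== "string") errors.push(`data[${i}].${field} missing or not a string`);
    }
  });
  for (const field of ["total", "page", "pageSize"]) {
    if (typeof payload?.meta?.[field] !== "number") errors.push(`meta.${field} missing or not a number`);
  }
  return errors;
}
```

The same function doubles as the "APP_DATA Schema Match" check the QA skill needs: run it over every AI response before the app renders.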
🟢 Nice-to-have: Add channel onboarding flow
When a user enters a new channel for the first time, show a brief guided tour:
- What this channel does
- What apps are available (visual toolbar walkthrough)
- Example things to try
5. MCP QA Tester (Phase 5)
Strengths:
- Five testing layers is the right conceptual framework
- The shell script template for automated static analysis is practical
- Common issues & fixes table is immediately useful
- Visual testing with Gemini/Peekaboo is creative
Issues & Suggestions:
🔴 Critical: No quantitative metrics or benchmarks
The entire testing framework is binary pass/fail checklists. Modern LLM agent evaluation (per Confident AI's DeepEval framework and the Berkeley Function Calling Leaderboard) measures:
- Tool Correctness Rate — What % of natural language messages trigger the correct tool? Target: >95%
- Task Completion Rate — What % of end-to-end scenarios actually complete? Target: >90%
- First-Attempt Success Rate — Does the tool work on the first call without retries? Target: >85%
- APP_DATA Accuracy — Does the generated JSON match the app's expected schema? Target: 100%
- Response Latency — Time from user message to app render. Target: <3 seconds for reads, <5 for writes
Add a metrics section:
## Performance Metrics (per channel)
| Metric | Target | Method |
|--------|--------|--------|
| Tool Correctness | >95% | Run 20 NL messages, count correct tool selections |
| Task Completion | >90% | Run 10 E2E scenarios, count fully completed |
| APP_DATA Schema Match | 100% | Validate every APP_DATA block against JSON schema |
| Response Latency (P50) | <3s | Measure 10 interactions |
| Response Latency (P95) | <8s | Measure 10 interactions |
| App Render Success | 100% | All apps render data state without console errors |
| Accessibility Score | >90 | Run axe-core or Lighthouse on each app |
🔴 Critical: No regression testing baseline
The skill has no concept of baselines or regression detection. When you update a tool description, how do you know you didn't break routing for 3 other tools? When you change an app's CSS, how do you detect layout shifts?
Add:
- Screenshot baselines — Store reference screenshots per app. On each test run, compare pixel diff. Tools: BackstopJS (open source), or custom Gemini comparison.
- Tool routing baselines — Store a fixtures file of 20 NL messages → expected tool mappings. Re-run after any tool description change.
- JSON schema validation — Define schemas for each app's expected APP_DATA format. Validate every AI response against it.
# Screenshot baseline workflow
backstop init
backstop reference # Capture current state as baseline
# ... make changes ...
backstop test # Compare against baseline, flag regressions
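The tool-routing baseline reduces to a fixture list plus a scorer. A sketch (the types and function name are illustrative): each fixture pairs a natural-language message with the tool expected to fire; after a run, observed selections are compared against the baseline and the correctness rate is reported.

```typescript
// Tool-routing regression scorer: compare observed tool selections against
// the fixture baseline and return the correctness rate (0..1).
type RoutingFixture = { message: string; expectedTool: string };

function scoreRouting(
  fixtures: RoutingFixture[],
  observed: Record<string, string> // message -> tool the AI actually chose
): number {
  const correct = fixtures.filter(f => observed[f.message] === f.expectedTool).length;
  return fixtures.length ? correct / fixtures.length : 0;
}
```

Re-run this after any tool description change and fail CI when the rate drops below the 95% target, so a "fix" for one tool can't silently break routing for its neighbors.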
🔴 Critical: No accessibility testing
Zero mention of:
- Color contrast auditing (our #96989d secondary text FAILS WCAG AA)
- Keyboard navigation testing (Tab through all interactive elements)
- Screen reader testing (VoiceOver on Mac)
- axe-core or Lighthouse accessibility audits
Add Layer 2.5: Accessibility Testing:
### Accessibility Checks (per app)
- [ ] Run axe-core: `axe.run(document).then(results => console.log(results.violations))`
- [ ] All text passes WCAG AA contrast (4.5:1 normal, 3:1 large)
- [ ] All interactive elements reachable via Tab key
- [ ] All interactive elements operable with Enter/Space
- [ ] Loading/empty/data state changes announced to screen readers
- [ ] No info conveyed by color alone (icons/text supplement color badges)
🟡 Important: Testing is entirely manual
The "automated QA script" only checks file existence and compilation. The functional, visual, and integration layers are all "manual testing required." For 30+ servers, this is unscalable.
Add automated testing patterns:
- Tool routing smoke test — Script that sends 5 NL messages per channel via API and checks tool selection
- APP_DATA schema validator — Script that parses AI responses and validates JSON against schemas
- App render test — Playwright script that loads each HTML file, injects sample data, screenshots it
// Automated app render test (Playwright)
const { chromium } = require('playwright');
const path = require('path');
async function testApp(htmlPath, sampleData) {
  const browser = await chromium.launch();
  const page = await browser.newPage({ viewport: { width: 400, height: 600 } });
  // Register the console listener before loading so render-time errors are captured
  const errors = [];
  page.on('console', msg => { if (msg.type() === 'error') errors.push(msg.text()); });
  await page.goto(`file://${htmlPath}`);
  // Inject data via postMessage, mimicking the host
  await page.evaluate((data) => {
    window.postMessage({ type: 'mcp_app_data', data }, '*');
  }, sampleData);
  await page.waitForTimeout(500);
  // Screenshot for baseline comparison
  await page.screenshot({ path: `/tmp/test-${path.basename(htmlPath)}.png` });
  // Check content rendered (not still showing loading)
  const loadingVisible = await page.isVisible('#loading');
  const contentVisible = await page.isVisible('#content');
  await browser.close();
  return { errors, loadingVisible, contentVisible };
}
🟡 Important: No performance testing
No guidance on measuring:
- App file size budgets (should enforce <50KB)
- Time to first render
- Memory usage (important for many-app channels like GHL with 65 apps)
- postMessage throughput (how fast can data update?)
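The file-size budget is the easiest of these to automate. A sketch (names and the default budget are assumptions, matching the <50KB figure above): a pure check over a file listing, which a wrapper script would feed from fs.statSync() results per app HTML file.

```typescript
// File-size budget check for app HTML files: return every file over budget.
type FileSize = { name: string; bytes: number };

function overBudget(files: FileSize[], budgetBytes = 50 * 1024): FileSize[] {
  return files.filter(f => f.bytes > budgetBytes);
}
```

Failing CI when overBudget() returns a non-empty list turns the budget from a guideline into an enforced constraint.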
🟡 Important: No data fixture library
Each test requires manually crafted sample data. Create a standardized fixture library:
fixtures/
dashboard-sample.json
data-grid-sample.json
detail-card-sample.json
timeline-sample.json
calendar-sample.json
pipeline-sample.json
empty-state.json
malformed-data.json
huge-dataset.json (1000+ rows)
🟢 Nice-to-have: Add chaos testing
What happens when:
- API returns 500 on every call?
- postMessage sends data in wrong format?
- APP_DATA is 500KB+ (huge dataset)?
- User sends 10 messages rapidly?
- Two apps try to render simultaneously?
Research Findings
1. Tool Calling Optimization (Paragon / Statsig / Berkeley BFCL)
Key findings:
- LLM model choice matters most. Paragon's benchmarks showed model selection had the biggest impact on tool correctness. o3 (April 2025 update) performed best, but Claude 3.5 Sonnet was close behind.
- Reducing tool count improves accuracy. The paper "Less is More" (arxiv, Nov 2024) proved that selectively reducing available tools significantly improves function-calling performance. Our lazy loading approach is on the right track, but we should go further — only surface tools relevant to the current conversation context.
- Tool descriptions are the #1 lever after model choice. Better descriptions improved correctness by ~15-25% in Paragon's tests. The "do NOT use when" pattern was particularly impactful.
- Router-based architecture outperforms flat tool lists. Statsig recommends: big model does routing/planning, specialized sub-agents handle execution. This is aligned with our lazy loading but could be extended to per-channel tool pre-filtering.
- Requiring a rationale before tool calls improves accuracy. Adding "Before calling any tool, briefly state which tool you're choosing and why" to system prompts reduces misrouting.
Recommendations for our pipeline:
- Add "anti-descriptions" (when NOT to use) to every tool
- Implement dynamic tool activation — only surface tools relevant to detected user intent
- Add rationale requirement to system prompts
- Cap active tool count at 15-20 per interaction
2. MCP Apps Official Extension (Jan 2026)
Major protocol update we're not leveraging:
- Tools can now declare _meta.ui.resourceUri pointing to a ui:// resource
- HTML apps communicate with hosts via JSON-RPC over postMessage (not custom protocol)
- Apps can call server tools directly, receive streaming data, and update context
- Sandboxed iframe rendering with CSP controls
- Adopted by Claude Desktop, VS Code Copilot, Gemini CLI, Cline, Goose, Codex
Impact on our pipeline:
- Phase 2 (Server Builder): Should register tools with _meta.ui when they have apps
- Phase 3 (App Designer): Should support the official MCP Apps SDK client-side
- Phase 4 (Integrator): LocalBosses should support both our custom protocol AND the official one
- This enables our servers to work in ANY MCP client, not just LocalBosses
3. Agent Evaluation Framework (Confident AI / DeepEval)
Industry standard for agent testing has evolved to:
- Component-level evaluation — Test each piece (tool selection, parameter extraction, response generation) separately, not just end-to-end
- Tool Correctness metric — Exact matching between expected and actual tool calls
- Task Completion metric — LLM-scored evaluation of whether the full task was completed
- Trace-based debugging — Record every step (tool chosen, params sent, output received) for root cause analysis
What we should adopt:
- Define test cases as { prompt, expected_tools, expected_params, expected_data_shape }
- Score tool correctness and task completion quantitatively
- Store traces for debugging failed tests
- Build a regression test suite that runs on every tool description change
4. Visual Regression Tooling (2025-2026 Landscape)
Top tools for our use case:
- BackstopJS — Open source, screenshot comparison, perfect for our HTML apps. No external dependencies.
- Percy (BrowserStack) — Cloud-based, AI-powered diff detection, but SaaS cost
- Playwright screenshots — Built into our existing toolchain, can compare programmatically
Recommended approach: BackstopJS for baseline management + Gemini multimodal for subjective quality analysis. This is a two-layer approach: pixel diff catches regressions, AI analysis catches design quality issues.
5. Best MCP Servers (Competitive Analysis)
Top-starred MCP servers (June 2025):
- GitHub MCP (15.2K ⭐) — Gold standard for API-aware agents with identity/permissions
- Playwright MCP (11.6K ⭐) — Browser automation via MCP, used for QA
- AWS MCP (3.7K ⭐) — Documentation, billing, service metadata
- Context7 — Provides LLMs with up-to-date, version-specific documentation
What they do better than us:
- Scoped permissions — GitHub MCP integrates with GitHub's auth model. Our servers have flat API keys with no per-tool permission scoping.
- Rich error context — Best servers return errors with suggested fixes, not just error messages
- Documentation as tool — Context7's approach of serving relevant docs as context is something our servers could do (e.g., when a tool fails, suggest the right docs)
- Security guardrails — Pomerium's analysis shows most MCP servers lack security. We should add at least basic rate limiting per-user and audit logging.
UX & Design Gaps
1. No Progressive Loading
When a user sends a message and waits 2-5 seconds for the AI to respond with APP_DATA, the app sits in "loading skeleton" state. Users don't know if it's working. We need:
- Streaming indicator — Show "AI is thinking..." or typing dots in the app itself
- Progressive data — If possible, stream partial APP_DATA as it's generated
- Time expectation — "Usually loads in 2-3 seconds" text in the loading state
2. No Transition Between Data States
When new APP_DATA arrives (user refines their request), the app hard-replaces all content. This is jarring. Better:
- Cross-fade between old and new content
- Highlight what changed (new rows, updated values)
- Animate metric values counting up/down to new numbers
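The count-up animation can be precomputed as a frame sequence. A sketch (function name and frame count are assumptions): cubic ease-out so the number decelerates into its final value, with the app rendering one frame per requestAnimationFrame tick.

```typescript
// Precompute frames for a metric count-up: ease-out from `from` to `to`.
function countFrames(from: number, to: number, frames = 20): number[] {
  return Array.from({ length: frames }, (_, i) => {
    const t = (i + 1) / frames;       // normalized progress 0..1
    const eased = 1 - Math.pow(1 - t, 3); // cubic ease-out
    return Math.round(from + (to - from) * eased);
  });
}
```

Precomputing keeps the render loop trivial, and the final frame always lands exactly on the target value (no floating-point drift at the end of the animation).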
3. No User Memory / Preferences
Apps don't remember anything between sessions:
- Last viewed filters/sort
- Preferred view mode (grid vs list)
- Collapsed/expanded sections
- Recently viewed items
This could use host-mediated storage (not localStorage in the iframe) via postMessage.
4. No Mobile Considerations
The responsive breakpoints stop at 280px but don't consider:
- Touch targets (minimum 44x44px per WCAG)
- Swipe gestures (swipe to delete, swipe between tabs)
- Safe area insets (notch/home indicator on mobile)
- Virtual keyboard pushing content
5. No Multi-Language Support
All apps are hardcoded English. At minimum:
- Date/number formatting should respect locale (toLocaleDateString is good but inconsistent)
- No hardcoded English strings in the templates — use a simple i18n pattern
- RTL text support for international users
6. No Empty State Personalization
Every app's empty state says "Ask me a question in the chat to populate this view with data." This should be contextual:
- Dashboard: "Ask me for a performance overview or specific metrics"
- Contact Grid: "Try 'show me all active contacts' or 'contacts added this week'"
- Pipeline: "Ask to see your sales pipeline or a specific deal stage"
7. Missing "Magic Moment" Polish
The transition from "user types message" to "beautiful app appears" should feel magical. Currently it's: loading skeleton → hard pop of content. Better:
- Typing indicator appears in chat
- App shows "Preparing your view..." with subtle animation
- Content slides in with staggered row animation
- Metric numbers animate from 0 to their values
- Charts animate/grow their bars
This takes the experience from "functional" to "delightful."
Testing Methodology Gaps
1. No Test Data Management
The QA skill has no concept of:
- Fixture files — Standardized sample data for each app type
- Edge case data — Empty strings, null values, extremely long text, Unicode, HTML entities
- Scale data — 1000+ row datasets to test scroll performance
- Adversarial data — XSS payloads in text fields (currently escaped with escapeHtml, but untested)
2. No Continuous Testing
Testing is positioned as a one-time phase, not continuous. Need:
- Pre-commit hooks — Run static analysis on every commit
- CI/CD integration — Automated screenshot comparison on PR
- Monitoring — Track tool correctness rate in production over time
- Alerting — If tool misrouting rate exceeds 5%, alert
3. No Cross-Browser Testing
Apps are tested in one browser (Safari via Peekaboo). Need:
- Chrome (most common)
- Firefox (rendering differences)
- Mobile Safari (iOS webview)
- Electron (if LocalBosses is desktop-wrapped)
4. No Load Testing
What happens when:
- 10 users hit the same channel simultaneously?
- An app receives 50 data updates per minute?
- 30 threads are open across different channels?
5. No Security Testing
Zero mention of:
- XSS testing (even though apps escape HTML, test it)
- CSRF considerations in postMessage handling
- Content Security Policy validation
- API key exposure in client-side code
6. No AI Response Quality Testing
Beyond "did the right tool fire?", test:
- Is the natural language response helpful?
- Does the APP_DATA contain realistic, well-formatted data?
- Does the AI handle ambiguous requests gracefully (asking for clarification vs guessing)?
- Does the AI handle multi-intent messages? ("Show me contacts and create a new deal")
7. Missing Test Types
| Test Type | Current Coverage | Gap |
|---|---|---|
| Static analysis | ✅ Basic | No linting, no type coverage |
| Visual testing | ⚠️ Manual screenshots | No baselines, no automated diff |
| Functional testing | ⚠️ Manual NL testing | No automated tool routing tests |
| Integration testing | ⚠️ Manual E2E | No scripted scenarios |
| Accessibility testing | ❌ None | Need axe-core + keyboard + VoiceOver |
| Performance testing | ❌ None | Need file size, render time, latency |
| Security testing | ❌ None | Need XSS, CSP, postMessage validation |
| Regression testing | ❌ None | Need baselines + automated comparison |
| Chaos testing | ❌ None | Need error injection, malformed data |
| AI quality testing | ❌ None | Need response quality scoring |
Priority Recommendations
Ranked by impact on user experience and pipeline reliability:
P0 — Critical (Do Before Shipping More Servers)
- Fix accessibility contrast ratio — Change secondary text from `#96989d` to `#b0b2b8` across all apps. This is a compliance issue.
  - Impact: High (legal/compliance risk, affects all apps)
  - Effort: Low (CSS find-and-replace)
- Upgrade tool description formula — Add "do NOT use when" disambiguation to every tool description template in the API Analyzer skill.
  - Impact: Very high (directly reduces tool misrouting, the #1 user-facing failure)
  - Effort: Medium (update templates, retroactively fix existing servers)
- Add quantitative QA metrics — Define Tool Correctness Rate, Task Completion Rate, APP_DATA Schema Match, and Response Latency as required metrics. Build the 20-message routing test fixture.
  - Impact: High (enables data-driven quality improvement)
  - Effort: Medium (define metrics, build test fixture)
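Once the routing fixture exists, most of these metrics fall out of a single pass over the results. A sketch, assuming each fixture run records `{ expected, actual, completed, latencyMs }` (a hypothetical shape, not a defined schema):

```javascript
// Compute the proposed QA metrics from one fixture run.
function qaMetrics(results) {
  const n = results.length;
  const correct = results.filter((r) => r.actual === r.expected).length;
  const completed = results.filter((r) => r.completed).length;
  const latencies = results.map((r) => r.latencyMs).sort((a, b) => a - b);
  const p95Index = Math.min(n - 1, Math.ceil(n * 0.95) - 1);
  return {
    toolCorrectnessRate: correct / n,
    taskCompletionRate: completed / n,
    p95LatencyMs: latencies[p95Index],
  };
}
```

Tracking these numbers per server, per pipeline run, is what makes "data-driven quality improvement" concrete.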
- Create test data fixtures — Build a fixtures library with sample data for each app type, including edge cases and adversarial data.
  - Impact: High (unblocks automated testing, ensures consistent QA)
  - Effort: Low-medium (one-time creation)
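A seed for that library, covering the edge cases listed under gap #1 (values here are illustrative, not a mandated schema):

```javascript
// Edge-case values every app template should render without breaking
const edgeCaseFixtures = {
  emptyString: "",
  nullValue: null,
  veryLongText: "x".repeat(10000),
  unicode: "名前 • café • 🚀",
  htmlEntities: "&lt;already escaped&gt;",
  xssPayload: "<img src=x onerror=alert(1)>",
};

// Scale data for the 1000+ row scroll-performance tests
const scaleFixture = Array.from({ length: 1000 }, (_, i) => ({
  id: i,
  name: `Contact ${i}`,
}));
```

Checking every app against the same fixture set is what makes QA results comparable across the 30+ servers.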
P1 — High Priority (Next Sprint)
- Add MCP Apps extension support — Update Server Builder to optionally register `_meta.ui.resourceUri`. Update App Designer to support the official SDK client-side protocol.
  - Impact: High (future-proofs servers for all MCP hosts, not just LocalBosses)
  - Effort: Medium-high (new code patterns, update templates)
- Add interactive patterns to App Designer — At minimum: client-side sort, client-side filter/search, copy-to-clipboard, and expand/collapse. These turn apps from views into tools.
  - Impact: High (transforms user experience from "reading" to "working")
  - Effort: Medium (new template code)
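The sort and filter patterns need no framework; a sketch over a generic row array (the row shape is hypothetical):

```javascript
// Pure view transform: filter by substring match, then sort by key.
// Re-render with the returned rows; the source data is never mutated.
function applyView(rows, { query = "", sortKey = null, descending = false } = {}) {
  const q = query.toLowerCase();
  let out = rows.filter((row) =>
    Object.values(row).some((v) => String(v).toLowerCase().includes(q))
  );
  if (sortKey) {
    out = [...out].sort((a, b) =>
      a[sortKey] < b[sortKey] ? -1 : a[sortKey] > b[sortKey] ? 1 : 0
    );
    if (descending) out.reverse();
  }
  return out;
}
```

Wired to a header click and a search input, this is most of the "views into tools" upgrade.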
- Build automated app render tests — Playwright script that loads each HTML app, injects fixture data, checks for console errors, and captures screenshots.
  - Impact: High (catches visual regressions automatically)
  - Effort: Medium (one-time script, reusable across all servers)
- Improve system prompt engineering guidelines — Add structured tool routing rules, few-shot examples, rationale requirements, and negative instructions to the Integrator skill.
  - Impact: High (directly improves AI interaction quality)
  - Effort: Medium (template updates + example creation)
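For concreteness, a hedged sketch of what such a routing section could look like inside the system prompt (tool names here are hypothetical placeholders):

```
## Tool routing rules
- Use list_contacts ONLY to read existing contacts. Do NOT use it to create
  or modify records; use create_contact / update_contact for that.
- If a message contains multiple intents ("show contacts and create a deal"),
  handle each intent with its own tool call, in the order stated.
- If no tool clearly matches, ask one clarifying question instead of guessing.

Example:
User: "who did we talk to at Acme?"
Assistant: calls list_contacts with query="Acme", because the user wants
existing records, not a new one.
```

The combination of a negative instruction, a multi-intent rule, and one worked example covers the three failure modes called out in the AI quality gap above.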
P2 — Important (This Quarter)
- Add data visualization primitives — Line charts, donut charts, sparklines, and progress bars in pure CSS/SVG. Include as copy-paste snippets in App Designer.
  - Impact: Medium-high (dashboards and analytics apps become much richer)
  - Effort: Medium (design + code for each viz type)
- Add accessibility testing layer — axe-core validation, keyboard navigation testing, and color contrast auditing as part of Layer 2 in QA.
  - Impact: Medium-high (compliance + usability)
  - Effort: Medium (add tools, update checklist)
- Add screenshot regression baselines — BackstopJS integration for automated visual comparison.
  - Impact: Medium (catches unintended visual changes)
  - Effort: Medium (setup + baseline capture)
- Add error boundaries to all apps — Global error handler plus try/catch in `render()` so apps never go blank.
  - Impact: Medium (prevents worst-case "blank screen" UX)
  - Effort: Low (small code addition to base template)
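The error-boundary addition is small enough to sketch here; `showFallback` is a placeholder to be wired to the app's root element:

```javascript
// Wrap render() so a throwing render shows a fallback instead of a blank app.
function makeSafeRender(render, showFallback) {
  return (data) => {
    try {
      return render(data);
    } catch (err) {
      showFallback(err); // e.g. swap in a "something went wrong" panel
      return null;
    }
  };
}

// In the browser, also catch anything that escapes render():
// window.addEventListener("error", (e) => showFallback(e.error));
```

Pairing this with the chaos tests proposed below (malformed data injection) verifies the fallback actually appears.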
P3 — Nice-to-Have (This Quarter if Time)
- Add bidirectional app communication — `sendToHost()` pattern for refresh, navigate, and tool calls from within apps.
- Add micro-interactions — Staggered row animations, metric counting, smooth transitions.
- Add dynamic tool activation — Surface only contextually-relevant tools per interaction.
- Add AI response quality scoring — Beyond tool correctness, evaluate helpfulness and data quality.
- Add chaos testing — Error injection, malformed data, rapid-fire interactions.
- Personalize empty states — Context-specific prompts per app type.
Appendix: Contrast Ratio Audit
| Element | Current Color | Background | Ratio | WCAG AA | Fix |
|---|---|---|---|---|---|
| Primary text | #dcddde | #1a1d23 | 10.4:1 | ✅ Pass | — |
| Secondary text | #96989d | #1a1d23 | 3.7:1 | ❌ Fail | Use #b0b2b8 (5.0:1) |
| Secondary text | #96989d | #2b2d31 | 3.2:1 | ❌ Fail | Use #b0b2b8 (4.3:1) or #b8babe (5.0:1) |
| Heading text | #ffffff | #1a1d23 | 15.0:1 | ✅ Pass | — |
| Accent | #ff6d5a | #1a1d23 | 4.9:1 | ✅ Pass | — |
| Accent on card | #ff6d5a | #2b2d31 | 4.2:1 | ⚠️ Fail (normal text) | OK for large text only |
| Table header | #96989d | #2b2d31 | 3.2:1 | ❌ Fail | Use #b0b2b8 |
| Success badge text | #43b581 | badge bg | 3.8:1 | ⚠️ Marginal | Use #4cc992 |
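Future palette changes can be audited automatically with the WCAG 2.1 relative-luminance formula. A sketch; spot-check its output against a trusted checker (e.g. axe-core) before adopting its verdicts:

```javascript
// WCAG 2.1 relative luminance of a #rrggbb color.
function luminance(hex) {
  const [r, g, b] = [1, 3, 5]
    .map((i) => parseInt(hex.slice(i, i + 2), 16) / 255)
    .map((c) => (c <= 0.03928 ? c / 12.92 : ((c + 0.055) / 1.055) ** 2.4));
  return 0.2126 * r + 0.7152 * g + 0.0722 * b;
}

// Contrast ratio in the range 1:1 to 21:1.
function contrastRatio(fg, bg) {
  const [hi, lo] = [luminance(fg), luminance(bg)].sort((a, b) => b - a);
  return (hi + 0.05) / (lo + 0.05);
}
// AA requires >= 4.5:1 for normal text, >= 3:1 for large text.
```

Run over every text/background pair in the base template, this turns the audit above into a repeatable CI check rather than a one-time review.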
End of review. These recommendations are prioritized to maximize impact on user experience while maintaining the pipeline's efficiency for mass-producing MCP servers. The most critical items (contrast fix, tool descriptions, QA metrics) should be addressed before shipping the next batch of servers.