=== NEW SERVERS ADDED (7) ===
- servers/closebot — 119 tools, 14 modules, 4,656 lines TS (Stage 7)
- servers/google-console — Google Search Console MCP (Stage 7)
- servers/meta-ads — Meta/Facebook Ads MCP (Stage 8)
- servers/twilio — Twilio communications MCP (Stage 8)
- servers/competitor-research — Competitive intel MCP (Stage 6)
- servers/n8n-apps — n8n workflow MCP apps (Stage 6)
- servers/reonomy — Commercial real estate MCP (Stage 1)

=== FACTORY INFRASTRUCTURE ADDED ===
- infra/factory-tools — mcp-jest, mcp-validator, mcp-add, MCP Inspector
- 60 test configs, 702 auto-generated test cases
- All 30 servers score 100/100 protocol compliance
- infra/command-center — Pipeline state, operator playbook, dashboard config
- infra/factory-reviews — Automated eval reports

=== DOCS ADDED ===
- docs/MCP-FACTORY.md — Factory overview
- docs/reports/ — 5 pipeline evaluation reports
- docs/research/ — Browser MCP research

=== RULES ESTABLISHED ===
- CONTRIBUTING.md — All MCP work MUST go in this repo
- README.md — Full inventory of 37 servers + infra docs
- .gitignore — Updated for Python venvs

TOTAL: 37 MCP servers + full factory pipeline in one repo. This is now the single source of truth for all MCP work.
Boss Kofi — Final Review & Improvement Proposals
Date: 2026-02-04
Reviewer: Boss Director Kofi — AI Agent UX, Tool Orchestration & Quality Systems Authority
Scope: MCP Factory Pipeline v1 — all 5 skills (Analyzer, Builder, App Designer, Integrator, QA Tester) + orchestration doc
Pass 1 Notes (per skill — AI interaction quality assessment)
1. MCP-FACTORY.md (Orchestration Doc)
What's great:
- Crystal clear 6-phase pipeline with defined inputs/outputs and quality gates. This is production-grade thinking.
- Agent role separation (Analyst→Builder→Designer→Integrator→QA) maps perfectly to skill specialization.
- The parallel execution insight (Agents 2+3 can run concurrently) shows real pipeline optimization awareness.
- Inventory tracking of 30 built-but-untested servers gives immediate actionable work.
What would produce mediocre experiences:
- The pipeline is linear. There's no feedback loop from QA→Builder/Designer. If QA finds that the tool descriptions cause misrouting, there's no prescribed path back to fix them — it's just "fixes" in the QA output.
- No mention of versioning or iteration. APIs change, tool descriptions need tuning based on real usage. The pipeline treats shipping as final.
- Missing: user feedback loop. After ship, how do you know if users are actually having good experiences? Tool correctness in production is never measured.
AI interaction quality:
- The APP_DATA block pattern (embedding structured JSON in LLM responses) is the biggest fragility point in the whole system. The LLM is an unreliable JSON serializer. This is the #1 source of quality drops.
2. mcp-api-analyzer/SKILL.md
What's great:
- The API Style Detection table (REST/GraphQL/SOAP/gRPC/WebSocket) with tool mapping is exceptionally thorough.
- The Pagination Pattern Catalog covering 8 distinct strategies is a reference-quality resource.
- Tool Description Best Practices with the 6-part formula (What/Returns/When/When NOT/Side effects) — this is the single most important section across all skills for end-user quality.
- Disambiguation Tables per tool group — this is gold. Explicitly mapping "User says X → Correct tool → Why not others" directly addresses the #1 cause of bad AI experiences.
- Content Annotations planning (audience + priority) shows forward-thinking about data routing.
- Elicitation Candidates section acknowledges the need for mid-flow user input.
- Token Budget Awareness with concrete targets (<200 tokens/tool, <5000 total) is practical.
What would produce mediocre experiences:
- The analysis document is extremely long. A service with 50+ endpoints produces a massive file that the Builder agent must parse. No prioritization of "which tools matter most for the user experience."
- Tool descriptions are written for LLM routing but not tested against real LLM routing. There's no feedback mechanism: "I wrote this description, then tested it with 20 queries, and it routed correctly 18/20."
- The Disambiguation Table is created once during analysis but never validated empirically. It's based on the analyst's guess about what users will say, not real user utterances.
- Missing: common user intent clustering. What do users ACTUALLY type when they want to see contacts? "Show contacts," "list my people," "who's in the CRM," "customer list," etc. The disambiguation table should be trained on diverse phrasings.
Testing theater vs real quality:
- The Quality Gate Checklist is comprehensive (23 items) but entirely self-reviewed. There's no external validation of tool description quality — the same agent that wrote them checks them.
3. mcp-server-builder/SKILL.md
What's great:
- This is an incredibly thorough server construction guide. The template variable reference table is smart — prevents the most common copy-paste error.
- Circuit breaker pattern built into the API client template is production-grade resilience.
- The pluggable pagination system supporting 5 strategies out of the box is excellent.
- Structured logging on stderr (JSON format with request IDs and timing) — this enables real debugging and performance monitoring.
- The `structuredContent` + `content` dual-return pattern ensures compatibility with both new and old MCP clients.
- The one-file vs modular threshold (≤15 tools) is a pragmatic call.
- Health check tool always included — this is a crucial debugging aid.
- Error classification (Protocol vs Tool Execution) with the insight that validation errors should be Tool Execution Errors (enabling LLM self-correction) is exactly right.
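The error-classification insight can be sketched as a small wrapper at the tool-call boundary. This is a hypothetical helper, not the builder template's actual code: validation failures come back in-band with `isError: true` so the LLM can read the message and retry; everything else propagates as a protocol-level error.

```typescript
// Sketch (hypothetical helper, not the builder template's actual code):
// map validation failures to in-band tool-execution errors so the LLM can
// self-correct; let protocol-level failures propagate as JSON-RPC errors.
type ToolResult = { content: { type: "text"; text: string }[]; isError?: boolean };

class ValidationError extends Error {}

function runTool(fn: () => ToolResult): ToolResult {
  try {
    return fn();
  } catch (err) {
    if (err instanceof ValidationError) {
      // Tool Execution Error: returned as a normal result with isError: true,
      // so the model sees the message and can retry with fixed parameters.
      return {
        content: [{ type: "text", text: `Invalid input: ${err.message}. Check the parameters and retry.` }],
        isError: true,
      };
    }
    throw err; // Protocol Error: surfaces to the client as a JSON-RPC error
  }
}
```

The key design choice: a bad parameter is the model's mistake to fix, so it must stay visible to the model, while a transport failure is not.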
What would produce mediocre experiences:
- The template is heavily oriented toward building servers but doesn't address testing them in isolation. There's no "start the server, send 5 tool calls, verify outputs" built into the build phase.
- Token budget section warns about >25 tools but doesn't provide automated measurement. You tell the builder to keep descriptions under 200 tokens but don't give them a way to count.
- The server template has `listChanged: false` in capabilities. This means if you hot-reload tool groups, clients won't know. For development iteration, this should be `true`.
- Resource URIs use a `{service}://` scheme but there's no actual Resource handler registered. The `resource_link` in tool results points to URIs that no client can resolve.
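Closing that gap means registering a real read handler so the scheme's URIs resolve. A minimal sketch — the store shape, the `crm://` example scheme, and the function names are invented for illustration; a real server would wire this into the MCP SDK's resources/read handler:

```typescript
// Sketch: resolve {service}://-style URIs referenced by resource_link
// results. The in-memory store and the crm:// scheme are illustrative.
const resourceStore = new Map<string, string>([
  ["crm://contacts/123", JSON.stringify({ id: "123", name: "Ada" })],
]);

function readResource(uri: string): { uri: string; mimeType: string; text: string } {
  const text = resourceStore.get(uri);
  // An unknown URI should be a clear error, not a silent empty result.
  if (text === undefined) throw new Error(`Unknown resource: ${uri}`);
  return { uri, mimeType: "application/json", text };
}
```

With a handler like this in place, the `resource_link` entries in tool results stop being dead ends.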
Testing theater vs real quality:
- Quality Gate has 27 items — all self-checked by the builder agent. No automated verification script. The QA tester skill has one, but that's 3 phases later.
4. mcp-app-designer/SKILL.md
What's great:
- The design system is genuinely well-crafted. WCAG AA compliance note with specific contrast ratios, the rejection of `#96989d`, the `prefers-reduced-motion` support — this shows real accessibility awareness.
- 9 app type templates with expected data shapes and customized empty states is a comprehensive library.
- The Interactive Data Grid (6.9) with sorting, filtering, bulk selection, expand/collapse, and copy-to-clipboard is genuinely interactive — not just a static table.
- Data visualization primitives (SVG charts, sparklines, donut charts, progress bars) with zero dependencies is impressive.
- Bidirectional communication via `sendToHost()` enables real interactivity (refresh, navigate, trigger tool calls).
- The error boundary (window.onerror + try/catch in render) prevents white-screen-of-death.
- Polling with exponential backoff (3s→5s→10s→30s, max 20 attempts) is well-designed fallback behavior.
- The `validateData()` function for defensive rendering is a solid pattern.
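The polling schedule can be captured in a few lines — a sketch of the 3s→5s→10s→30s ladder with the 20-attempt cap described above (function name is illustrative):

```typescript
// Sketch: polling delay ladder — 3s, 5s, 10s, then 30s for every later
// attempt, capped at 20 attempts total.
const DELAYS_MS = [3000, 5000, 10000, 30000];
const MAX_ATTEMPTS = 20;

function pollDelay(attempt: number): number | null {
  if (attempt >= MAX_ATTEMPTS) return null; // give up — show the error state
  // Clamp to the last rung so attempts 3..19 all wait 30s.
  return DELAYS_MS[Math.min(attempt, DELAYS_MS.length - 1)];
}
```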
What would produce mediocre experiences:
- The apps are static renderings. They receive data once, render it, and sit there. There's no live updating, no streaming, no real-time feel. The user asks a question, waits for the AI, then the app renders. Compare this to a real dashboard that updates continuously.
- No loading state between data updates. When the user asks a follow-up question, the app shows the OLD data until new data arrives. There's no visual indication that a refresh is happening. This creates a confusing lag where the user types a new query but sees stale data.
- The `sendToHost('tool_call', ...)` pattern isn't implemented on the host side yet. The app designer documents bidirectional communication, but the integrator skill doesn't wire up the host to listen for `mcp_app_action` messages. It's a dead feature.
- Form apps have no submit action. The form template renders input fields but has no submit button that triggers a tool call. It's a display form, not a functional form.
- No app-to-app navigation. The `sendToHost('navigate', ...)` pattern exists in code but there's no host-side handler documented in the integrator skill.
- 280px minimum is very narrow. Tables become unusable. The pipeline/kanban view horizontally scrolls at this width but the columns are too narrow to read. Should acknowledge that some app types need a wider minimum.
Testing theater vs real quality:
- Quality gate checks "every app renders with sample data" — but who provides the sample data? The designer creates apps but doesn't create test fixtures. The QA skill has fixtures, but they're generic, not per-service.
5. mcp-localbosses-integrator/SKILL.md
What's great:
- The detailed walkthrough of all 5 files to update, with exact templates, is a model of reproducible integration documentation.
- Intake Question Quality Criteria table (format hint, skipLabel, length, action-oriented, context-specific) with good/bad examples is excellent.
- APP_DATA Failure Modes table documenting 6 known LLM serialization failures with fixes is crucial real-world knowledge.
- The recommended `parseAppData()` parser with fallbacks (exact match → code block strip → heuristic JSON extraction) is battle-tested.
- System Prompt Engineering Guidelines with Prompt Budget Targets (<500 tokens channel, <300 tokens addon) prevent context bloat.
- The Integration Validation Script that cross-references all 4 files to catch missing/orphaned entries is exactly the right automated check.
- Rollback Strategy (git checkpoint, feature flag, manifest-based) shows production deployment awareness.
- Few-shot examples in systemPromptAddon — the document correctly identifies this as "the single most effective technique for consistent tool routing and APP_DATA generation."
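The fallback chain described above can be sketched roughly as follows. This is a simplified reconstruction from the description — exact marker match, then code-fence stripping, then heuristic brace extraction; the skill's actual `parseAppData()` is presumably more thorough:

```typescript
// Sketch: extract the APP_DATA JSON from an LLM response, trying
// progressively looser strategies. Marker format per the integrator docs:
// <!--APP_DATA: {...} :END_APP_DATA-->
function parseAppData(text: string): unknown {
  // 1. Exact marker match (tolerates multi-line payloads).
  const exact = text.match(/<!--APP_DATA:([\s\S]*?):END_APP_DATA-->/);
  if (exact) {
    try { return JSON.parse(exact[1].trim()); } catch { /* fall through */ }
  }
  // 2. Strip a markdown code fence the LLM may have wrapped the JSON in.
  const fenced = text.match(/`{3}(?:json)?\s*([\s\S]*?)`{3}/);
  if (fenced) {
    try { return JSON.parse(fenced[1].trim()); } catch { /* fall through */ }
  }
  // 3. Heuristic: take the span from the first "{" to the last "}".
  const first = text.indexOf("{");
  const last = text.lastIndexOf("}");
  if (first !== -1 && last > first) {
    try { return JSON.parse(text.slice(first, last + 1)); } catch { /* fall through */ }
  }
  return null; // caller shows the empty state
}
```

Each stage only runs when the previous one failed, so well-formed responses pay no extra cost.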
What would produce mediocre experiences:
- The LLM-as-JSON-serializer problem. The entire data flow depends on the LLM correctly embedding JSON in its response text (`<!--APP_DATA:...:END_APP_DATA-->`). This is the weakest link. Even with the parser fallbacks, LLMs regularly produce: multi-line JSON (breaking the "single line" rule), truncated JSON (context window limits), hallucinated data (when they don't have real tool results), and inconsistent field names (calling it `total_contacts` vs `totalContacts` vs `contacts_count`).
- No schema enforcement between tool output and APP_DATA. The tool returns `structuredContent` with a known schema. The LLM then re-serializes this as APP_DATA. But there's no validation that the LLM's APP_DATA matches what the app's `render()` function expects. The tool might return `{data: [...]}` but the LLM outputs `{contacts: [...]}`, and the app looks for `data.data` and shows the empty state.
- System prompts are duplicating tool information. The channel system prompt describes tools in natural language, and the MCP tool definitions ALSO describe tools. This is double context consumption. When tools change, the system prompt becomes stale.
- The `systemPromptAddon` examples include sample JSON structures. This consumes significant tokens showing the LLM what to output, but it's fragile — if the app's render function changes, the addon becomes a lie.
- Thread State Management relies entirely on localStorage. No server-side persistence means all thread history is lost on cache clear, device switch, or incognito mode.
Testing theater vs real quality:
- The Integration Validation Script is excellent for static cross-referencing. But it doesn't test the runtime behavior — does clicking the app actually open a thread? Does the AI actually generate valid APP_DATA? Those are left entirely to manual Phase 5 QA.
6. mcp-qa-tester/SKILL.md
What's great:
- The multi-layer testing architecture (Protocol → Static → Visual → Accessibility → Functional → Performance → Live API → Security → Integration) is genuinely comprehensive.
- Quantitative Quality Metrics with specific targets (Tool Correctness >95%, Task Completion >90%, Accessibility >90%, Cold Start <2s, Latency P50 <3s) — finally, numbers instead of checkboxes.
- MCP Protocol Compliance testing via MCP Inspector + custom JSON-RPC lifecycle tests validates the foundation correctly.
- Automated Playwright visual tests that check loading/empty/data states, dark theme compliance, and responsive layout are well-designed.
- axe-core accessibility integration with score calculation and keyboard navigation testing is real accessibility testing, not theater.
- The BackstopJS visual regression approach with 5% pixel diff threshold is solid.
- Security testing with 10 XSS payloads, postMessage origin validation, CSP checks, and API key exposure scans covers the critical vectors.
- Chaos testing (API 500s, wrong postMessage format, 500KB datasets, rapid-fire messages, concurrent apps) tests real failure modes.
- Test data fixtures library with edge cases (unicode, extremely long text, null values, XSS payloads) is thorough.
- Persistent QA reports with trend tracking across runs enables regression detection.
What would produce mediocre experiences:
- Tool Correctness testing is theoretical. The skill defines routing fixtures (20+ NL messages → expected tool) but doesn't actually send them through the LLM. It validates that fixture files exist and that tool names are real. The actual routing accuracy test requires "the AI/LLM in the loop" — acknowledged as a comment but not automated.
- No end-to-end data flow testing. There's no test that: (1) sends a message to the AI, (2) verifies the AI calls the right tool, (3) captures the AI's response, (4) extracts APP_DATA, (5) validates APP_DATA schema, (6) sends it to the app iframe, (7) screenshots the result. This end-to-end flow is the magic moment, and it's tested manually.
- MSW mocks test the handler code, not the real API. Layer 3 tests use Mock Service Worker — essential for unit testing, but the mocks are hand-crafted. There's no guarantee the mocks match the real API's response shape. If the real API returns `{results: [...]}` but the mock returns `{data: [...]}`, the tests pass but production fails.
- No APP_DATA generation testing with actual LLMs. The QA skill validates APP_DATA parsing (can we extract JSON from the text?) but not APP_DATA generation (does the LLM actually produce correct JSON given the system prompt?). This is the highest-failure-rate step.
- Visual testing requires manual baseline capture. `backstop reference` must be run when apps are "verified correct" — but who verifies? And baselines aren't stored in version control by default.
- No monitoring or production quality metrics. All testing is pre-ship. There's no guidance on tracking tool correctness, APP_DATA parse success rate, or user satisfaction in production.
Testing theater vs real quality:
- The QA skill is about 70% real testing (static analysis, visual regression, accessibility, security, chaos) and 30% theater (tool routing fixtures that aren't run through LLMs, E2E scenarios that are manual templates, live API testing that's skipped for 30/37 servers due to missing credentials).
- The biggest gap: the most important quality question — "does the user get the right data in a beautiful app within 3 seconds?" — is never tested automatically.
Pass 2 Notes (user journey trace, quality gaps, testing theater)
The Full User Journey (traced end-to-end)
```
USER types: "show me my top customers"
  │
  ▼ [QUALITY DROP POINT 1: Tool Selection]
AI reads system prompt + tool definitions
AI must select correct tool (list_contacts? search_contacts? get_analytics?)
  │
  ▼ [QUALITY DROP POINT 2: Parameter Selection]
AI must figure out what "top" means (by revenue? by recency? by deal count?)
If ambiguous, should it ask or guess?
  │
  ▼ [QUALITY DROP POINT 3: API Execution]
MCP tool calls real API → gets data or error
Error handling must be graceful (circuit breaker, retry, timeout)
  │
  ▼ [QUALITY DROP POINT 4: LLM Re-serialization ← BIGGEST GAP]
AI receives structuredContent from tool
AI must re-serialize it as APP_DATA JSON in its text response
This is where JSON gets mangled, fields get renamed, data gets truncated
  │
  ▼ [QUALITY DROP POINT 5: APP_DATA Parsing]
Frontend must parse <!--APP_DATA:...:END_APP_DATA--> from response text
The parser has fallbacks, but failure = app shows empty state
  │
  ▼ [QUALITY DROP POINT 6: Data Shape Mismatch]
App's render() expects data.data[] but receives data.contacts[]
App shows empty state or crashes — user sees nothing
  │
  ▼ [QUALITY DROP POINT 7: Render Quality]
App renders with correct data
But: is it the RIGHT data? Did the AI interpret "top customers" correctly?
  │
  ▼ USER sees result (total time: 3-10 seconds)
```
The critical insight: Quality Drop Point 4 (LLM Re-serialization) is the highest-failure-rate step, yet it has the LEAST testing coverage. The analyzer writes tool descriptions (helps point 1), the builder validates API calls (helps point 3), the QA tester checks visual rendering (helps point 7), but NOBODY systematically tests points 4-6.
Mental Testing: Ambiguous Queries
I mentally tested the tool descriptions with ambiguous queries:
| User Says | Ambiguity | Current System Response | Better Response |
|---|---|---|---|
| "show me John" | Which John? Which tool? | Probably search_contacts — but if multiple Johns, shows grid instead of card | Should ask "Which John?" via elicitation, or show grid with filter |
| "delete everything" | Delete what? | Hopefully doesn't call delete_* — system prompt says "confirm first" | Should refuse without specifics — destructive + vague = must clarify |
| "what happened today" | Activity? Calendar? Dashboard? | Could route to timeline, calendar, or dashboard depending on channel | Should default to timeline/activity feed — "what happened" implies events |
| "update the deal" | Which deal? What fields? | update_deal needs an ID — will fail with validation error | Should search deals first, then ask which one |
| "show me revenue and also add a new contact named Sarah" | Multi-intent | Will likely only handle one intent (probably the first) | Should acknowledge both, handle sequentially, or ask which to do first |
| "actually, I meant the other one" | Contextual correction | System has no memory of previous results — can't resolve "the other one" | Need conversation state tracking — remember previous result sets |
Key finding: Multi-intent messages and contextual corrections are completely unaddressed. The system prompt has no guidance for handling "actually I meant..." or "also do X."
System Prompt Sufficiency for APP_DATA
I evaluated whether the systemPromptAddon templates actually produce correct APP_DATA consistently:
The Good:
- Few-shot examples (when included) dramatically improve consistency
- The explicit field listing ("Required fields: title, metrics, recent") helps
The Bad:
- The system prompt says "SINGLE LINE JSON" but LLMs consistently produce multi-line JSON, especially for large datasets. The parser handles this, but it shouldn't have to.
- No schema validation between what the addon describes and what the app's render() expects. These can drift silently.
- The addon tells the LLM to "generate REALISTIC data" — but when using real tool results, it should use THAT data, not fabricate realistic-looking data. This instruction is confusing.
Are the Apps Actually Delightful?
What feels good:
- The dark theme is polished and consistent — it feels like a real product, not a prototype
- Loading skeletons with shimmer animation look professional
- Status badges with semantic colors (green=active, red=failed) communicate at a glance
- The Interactive Data Grid with sort/filter/expand is genuinely useful
What feels mediocre:
- Static data. Once rendered, the app is a snapshot. No live updates, no streaming data. You see "245 contacts" but it doesn't change until you ask another question.
- No visual feedback during AI processing. User types a follow-up question → sees the old app → waits → suddenly the app flashes with new data. No "updating..." indicator.
- No drill-down. You see a data grid with contacts but clicking a contact name doesn't open the detail card. The `sendToHost('navigate')` pattern exists in code but isn't wired up.
- No data persistence across sessions. Close the browser, lose all thread state and app data.
- Charts are basic. The SVG primitives are functional but look like early d3.js examples, not like a modern analytics dashboard. No tooltips on hover, no click-to-filter, no zoom.
Research Findings (latest techniques for tool optimization and agent evaluation)
1. Berkeley Function Calling Leaderboard (BFCL V4) — Key Findings
The BFCL evaluates LLMs' ability to call functions accurately across real-world scenarios. Key insights:
- Negative instructions reduce misrouting by ~30%. The MCP Factory already includes "Do NOT use when..." in tool descriptions — this is validated by BFCL research.
- Tool count vs accuracy tradeoff: Accuracy degrades significantly above 15-20 active tools per interaction. The Factory's lazy loading approach (loading groups on demand) is the right mitigation, but the `ListTools` handler returns ALL tools regardless. Clients see the full inventory.
- Multi-step tool chains are where most agents fail. Searching for a contact, then getting details, then updating — requires correct tool sequencing. The system prompts don't address multi-step chains.
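One mitigation for the ListTools gap: filter the advertised inventory down to the currently active groups. A hypothetical sketch — the group names and shapes are invented for illustration, and a real server would also emit `listChanged` notifications when groups load:

```typescript
// Sketch: advertise only tools from active groups, keeping the
// per-interaction tool count inside the 15-20 accuracy sweet spot.
type Tool = { name: string; group: string; description: string };

const ALL_TOOLS: Tool[] = [
  { name: "list_contacts", group: "contacts", description: "List CRM contacts." },
  { name: "create_invoice", group: "billing", description: "Create an invoice." },
];

const activeGroups = new Set<string>(["contacts"]);

function listTools(): Tool[] {
  // Only the active groups are visible to the client.
  return ALL_TOOLS.filter((t) => activeGroups.has(t.group));
}
```

When a group is activated mid-session, adding it to `activeGroups` (and notifying the client) grows the inventory on demand instead of exposing all 119 tools up front.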
2. Paragon's Tool Calling Optimization Research (2025-2026)
From Paragon's 50-test-case evaluation across 6 LLMs:
- LLM choice has the biggest impact on tool correctness. OpenAI o3 (2025-04-16) performed best. Claude 3.5 Sonnet was strong. The Factory's model recommendation (Opus for analysis, Sonnet for building) is sound.
- Better tool descriptions improve performance more than better system prompts. This validates the Factory's emphasis on the 6-part description formula.
- Reducing tool count (fewer tools per interaction) has a larger effect than improving descriptions. The Factory's 15-20 tools per interaction target aligns with this finding.
- DeepEval's Tool Correctness metric (correct tools / total test cases) and Task Completion metric (LLM-judged) are the industry standard for measuring tool calling quality.
3. DeepEval Agent Evaluation Framework (2025-2026)
DeepEval provides the most mature framework for evaluating AI agents:
- Separate reasoning and action evaluation. Reasoning (did the agent plan correctly?) and Action (did it call the right tools?) should be measured independently.
- Key metrics: PlanQualityMetric, PlanAdherenceMetric, ToolCorrectnessMetric, TaskCompletionMetric.
- Production monitoring: DeepEval supports `update_current_span()` for tracing agent actions in production — enabling real-time quality measurement.
- LLM-as-judge for task completion: Instead of hand-crafted ground truth, use an LLM to evaluate whether the task was completed. This scales to thousands of test cases.
Recommendation for MCP Factory: Integrate DeepEval as the evaluation framework for Layer 3 functional testing. Replace the manual routing fixture approach with automated DeepEval test runs.
4. MCP Apps Protocol (Official Extension — January 2026)
The MCP Apps extension is now live (announced January 26, 2026). Key features:
- `_meta.ui.resourceUri` on tools — tools declare which UI to render
- `ui://` resource URIs — server-side HTML/JS served as MCP resources
- JSON-RPC over postMessage — bidirectional app↔host communication
- `@modelcontextprotocol/ext-apps` SDK — standardized App class with `ontoolresult`, `callServerTool`, `updateModelContext`
- Client support: Claude, ChatGPT, VS Code, Goose — all support MCP Apps today
Critical implication for LocalBosses: The APP_DATA block pattern (`<!--APP_DATA:...:END_APP_DATA-->`) is now legacy. MCP Apps provides the official way to deliver UI from tools. The medium-term roadmap in the Integrator skill (route structuredContent directly to apps) should be accelerated, and the long-term roadmap (MCP Apps protocol) is no longer "future" — it's available NOW.
5. Tool Description Optimization Research
From academic papers and production experience:
- Explicit negative constraints in descriptions ("Do NOT use when...") reduce misrouting more than positive guidance ("Use when...")
- Field name lists in descriptions (`Returns {name, email, status}`) help the LLM understand response shape — critical for APP_DATA generation
- Parameter descriptions matter less than tool-level descriptions for routing accuracy
- Ordering tools by frequency of use in the tools list can improve selection for top tools (LLMs have position bias — first tools are slightly more likely to be selected)
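The position-bias point suggests sorting the tool list by observed usage before returning it. A minimal sketch — the usage counts and function name are illustrative:

```typescript
// Sketch: put the most-used tools first in the tools list, exploiting the
// LLM's position bias toward earlier entries.
const usageCounts: Record<string, number> = {
  list_contacts: 120,
  get_contact: 45,
  delete_contact: 2,
};

function orderByUsage(toolNames: string[]): string[] {
  // Copy first so the caller's array is untouched; unknown tools sink to the end.
  return [...toolNames].sort((a, b) => (usageCounts[b] ?? 0) - (usageCounts[a] ?? 0));
}
```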
Proposed Improvements (specific, actionable, with examples)
CRITICAL Priority (do these first)
1. Eliminate the LLM Re-serialization Bottleneck
Problem: The entire app data flow depends on the LLM correctly embedding JSON in its text response. This is the #1 source of quality failures.
Solution: Implement the "medium-term" architecture NOW — route structuredContent from tool results directly to the app iframe, bypassing LLM text generation.
Implementation:
```ts
// In chat/route.ts — intercept tool results BEFORE LLM generates text
const toolResults = await mcpClient.callTool(toolName, args);

if (toolResults.structuredContent && activeAppId) {
  // Route structured data directly to the app — no LLM re-serialization
  await sendToApp(activeAppId, toolResults.structuredContent);
}

// LLM still generates the text explanation, but doesn't need to embed JSON
// APP_DATA block becomes optional fallback, not primary data channel
```
Impact: Eliminates Quality Drop Points 4, 5, and 6 from the user journey. Data goes from tool → app with zero lossy transformation.
2. Adopt MCP Apps Protocol
Problem: The custom APP_DATA pattern works only in LocalBosses. MCP Apps is now an official standard supported by Claude, ChatGPT, VS Code, and Goose.
Solution: Migrate MCP servers to use `_meta.ui.resourceUri` on tools, serve app HTML via `ui://` resources, and use the `@modelcontextprotocol/ext-apps` SDK in apps.
Implementation path:
- Add `_meta.ui.resourceUri` to tool definitions in the server builder template
- Register app HTML files as `ui://` resources in each MCP server
- Update app template to use the `@modelcontextprotocol/ext-apps` App class for data reception
- Maintain backward compatibility with postMessage/polling for LocalBosses during transition
Impact: MCP tools work in ANY MCP client (Claude, ChatGPT, VS Code) — not just LocalBosses. Huge distribution multiplier.
3. Automated Tool Routing Evaluation with DeepEval
Problem: Tool routing accuracy is tested with static fixture files that aren't actually run through an LLM. It's the most important quality metric with the least real testing.
Solution: Integrate DeepEval's ToolCorrectnessMetric and TaskCompletionMetric into the QA pipeline.
Implementation:
```python
# tests/tool_routing_eval.py
from deepeval import evaluate
from deepeval.metrics import ToolCorrectnessMetric
from deepeval.test_case import LLMTestCase, ToolCall

test_cases = [
    LLMTestCase(
        input="Show me all active contacts",
        actual_output=agent_response,
        expected_tools=[ToolCall(name="list_contacts", input_parameters={"status": "active"})],
        tools_called=[actual_tool_call],
    ),
    # ... 20+ test cases per server
]

metric = ToolCorrectnessMetric()
evaluate(test_cases, [metric])
# Returns: Tool Correctness Rate with per-case breakdowns
```
Impact: Transforms tool routing testing from theater (fixture files exist) to real measurement (LLM actually routes correctly X% of the time).
HIGH Priority
4. Add "Updating..." State to Apps
Problem: When the user asks a follow-up question, the app shows stale data with no visual indicator that new data is incoming.
Solution: Add a fourth state: "updating" — shows a subtle overlay or indicator on the existing data while new data loads.
Implementation:
```js
// In app template — add updating state
function showState(state) {
  document.getElementById('loading').style.display = state === 'loading' ? 'block' : 'none';
  document.getElementById('empty').style.display = state === 'empty' ? 'block' : 'none';
  const content = document.getElementById('content');
  content.style.display = (state === 'data' || state === 'updating') ? 'block' : 'none';
  // Updating overlay
  const overlay = document.getElementById('updating-overlay');
  if (overlay) overlay.style.display = state === 'updating' ? 'flex' : 'none';
}

// When user sends a new message (detected via postMessage from host)
window.addEventListener('message', (event) => {
  if (event.data.type === 'user_message_sent') {
    showState('updating'); // Show "Updating..." on current data
  }
  if (event.data.type === 'mcp_app_data') {
    handleData(event.data.data); // Replace with new data
  }
});
```
Impact: User knows the system is working on their request. Reduces perceived latency by 50%+.
5. Wire Up Bidirectional Communication (App → Host)
Problem: `sendToHost('navigate')`, `sendToHost('tool_call')`, and `sendToHost('refresh')` are documented in the app designer but never wired up on the host side.
Solution: Document and implement the host-side handler in the integrator skill.
Implementation (in LocalBosses host):
```js
// In the iframe wrapper component. Note: postMessage from the iframe fires
// the 'message' event on the HOST window, not on iframe.contentWindow —
// listen on window and verify the sender.
window.addEventListener('message', (event) => {
  if (event.source !== iframe.contentWindow) return; // ignore other frames
  if (event.data.type === 'mcp_app_action') {
    switch (event.data.action) {
      case 'navigate':
        openApp(event.data.payload.app, event.data.payload.params);
        break;
      case 'refresh':
        resendLastToolCall();
        break;
      case 'tool_call':
        sendMessageToThread(`[Auto] Calling ${event.data.payload.tool}...`);
        // Trigger the tool call through the chat API
        break;
    }
  }
});
```
Impact: Enables drill-down (click contact in grid → open contact card), refresh buttons, and in-app actions. Transforms static apps into interactive ones.
6. Schema Contract Between Tools and Apps
Problem: No validation that the tool's structuredContent matches what the app's render() function expects. These can drift silently.
Solution: Generate a shared JSON schema that both the tool's outputSchema and the app's validateData() reference.
Implementation:
```
{service}-mcp/
├── schemas/
│   ├── contact-grid.schema.json   # Shared schema
│   └── dashboard.schema.json
├── src/tools/contacts.ts          # outputSchema references this
└── app-ui/contact-grid.html       # validateData() references this
```

```js
// In app template — load schema at build time (inline it)
const EXPECTED_SCHEMA = {"required":["data","meta"],"properties":{"data":{"type":"array"}}};

function validateData(data, schema) {
  // Validate against the same schema the tool declares as outputSchema.
  const missing = (schema.required || []).filter((key) => !(key in data));
  if (missing.length > 0) {
    // Show the diagnostic empty state:
    // "Data shape mismatch — tool returned X, app expected Y"
    // (showEmptyState is the app's existing empty-state helper)
    showEmptyState(`Data shape mismatch — missing fields: ${missing.join(', ')}`);
    return false;
  }
  return true;
}
```
Impact: Catches data shape mismatches during development instead of in production. Enables clear error messages when something goes wrong.
MEDIUM Priority
7. Add Multi-Intent and Correction Handling to System Prompts
Problem: Users often type multi-intent messages ("show me contacts and also create a new one") or corrections ("actually, I meant the other list"). The system prompts don't address these.
Solution: Add explicit instructions to the channel system prompt template:
```
MULTI-INTENT MESSAGES:
- If the user asks for multiple things in one message, address them sequentially.
- State which you're handling first and that you'll get to the others.
- Complete one action before starting the next.

CORRECTIONS:
- If the user says "actually", "wait", "no I meant", "the other one", etc.,
  treat this as a correction to your previous action.
- If they reference "the other one" or "that one", check the previous results
  in the conversation and clarify if needed.
- Never repeat the same action — understand what changed.
```
8. Add Token Counting to the Builder Skill
Problem: The builder skill says "keep descriptions under 200 tokens" but provides no way to actually measure token counts.
Solution: Add a token counting step to the build workflow:
# Add to build script
node -e "
const tools = require('./dist/tools/index.js');
// Count tokens per tool description (approximate: words * 1.3)
for (const tool of tools) {
const tokens = Math.ceil(tool.description.split(/\s+/).length * 1.3);
const status = tokens > 200 ? '⚠️' : '✅';
console.log(\`\${status} \${tool.name}: ~\${tokens} tokens\`);
}
"
9. Create Per-Service Test Fixtures in the Designer Phase
Problem: The QA skill has generic fixtures, but each service needs fixtures that match its specific data shapes.
Solution: The app designer should create test-fixtures/{service}/{app-name}.json alongside each HTML app, using the tool's outputSchema to generate realistic test data.
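That generation step can be sketched as a small recursive walk over the outputSchema (a simplified sketch — the function name is illustrative, and a real implementation might use a library such as json-schema-faker for richer data):

```typescript
// Simplified JSON-schema subset covering the shapes factory apps consume
type Schema = {
  type?: string;
  properties?: Record<string, Schema>;
  items?: Schema;
};

// Produce one representative fixture value for a given schema node
function fixtureFromSchema(schema: Schema): unknown {
  switch (schema.type) {
    case "object": {
      const out: Record<string, unknown> = {};
      for (const [key, prop] of Object.entries(schema.properties ?? {})) {
        out[key] = fixtureFromSchema(prop);
      }
      return out;
    }
    case "array":
      // One item is enough for a render test; chaos tests cover huge arrays
      return [fixtureFromSchema(schema.items ?? { type: "string" })];
    case "number":
    case "integer":
      return 0;
    case "boolean":
      return true;
    default:
      return "sample";
  }
}
```

The designer would run this once per app against the tool's outputSchema and write the result to test-fixtures/{service}/{app-name}.json.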
10. Add Production Quality Monitoring Guidance
Problem: All testing is pre-ship. No guidance on measuring quality in production.
Solution: Add a "Layer 6: Production Monitoring" to the QA skill:
### Layer 6: Production Monitoring (post-ship)
Metrics to track:
- APP_DATA parse success rate (target: >98%)
- Tool correctness (sample 5% of interactions, LLM-judge)
- Time to first app render (target: <3s P50, <8s P95)
- User retry rate (how often users rephrase the same request)
- Thread completion rate (% of threads where user gets desired outcome)
Implementation: Log these metrics in the chat route and aggregate weekly.
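As a sketch of what that logging could look like (all field and function names are assumptions — the real chat route would feed whatever structured-log pipeline the app already has):

```typescript
// Hypothetical per-interaction quality record; field names are illustrative
interface InteractionMetrics {
  threadId: string;
  appDataParsed: boolean;       // feeds the >98% parse-success target
  firstRenderMs: number | null; // feeds the <3s P50 / <8s P95 render targets
  userRetried: boolean;         // user rephrased the same request in-thread
}

// Emit one structured JSON line per interaction; aggregate weekly downstream
function logInteraction(m: InteractionMetrics): string {
  return JSON.stringify({ ts: Date.now(), kind: "mcp_app_interaction", ...m });
}
```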
The "Magic Moment" Audit
What makes it feel AMAZING:
- Instant visual gratification. User types "show me contacts" → within 2s, a beautiful dark-themed data grid appears with sortable columns, status badges, and realistic data. This first impression is the hook.
- The dark theme. It looks like a premium product, not a hackathon demo. The consistent color palette, proper typography, and polished components signal quality.
- Contextual empty states. Instead of "No data" → "Try 'show me all active contacts' or 'list recent invoices'" — this teaches the user what to do next.
- Loading skeletons. The shimmer effect during loading says "something is happening" — much better than a blank screen or spinner.
What makes it feel MEDIOCRE:
- The 3-8 second wait. User types → AI processes → tool calls API → AI generates response + APP_DATA → frontend parses → app renders. Every step adds latency. For "show me contacts," 3 seconds feels slow compared to clicking a button in a traditional app.
- Stale data between updates. User types a follow-up → app shows old data → eventually updates. No "updating..." indicator. Feels broken.
- Dead interactivity. Click a contact name in the grid — nothing happens. The data grid looks interactive (hover effects, click cursor) but clicking doesn't navigate to the detail card.
- One-way conversation with apps. The app is a display-only surface. You can't interact with it to drive the conversation — no "click to filter" or "select rows to export."
- JSON failures. When APP_DATA parsing fails (and it does, maybe 5-10% of the time), the app stays on the loading state. The user sees the AI's text response saying "here are your contacts" but the app shows nothing. Confusing and frustrating.
What would make it feel MAGICAL:
- Streaming data rendering. As the AI generates the response, the app starts rendering partial data. User sees the table building row by row — feels alive and fast.
- Click-to-drill-down. Click a contact name → detail card opens automatically. Click a pipeline deal → detail view. Apps are interconnected.
- App-driven conversation. Select 3 contacts in the grid → click "Send email" → AI drafts an email to those contacts. The app DRIVES the AI, not just displays data from it.
- Live dashboards. After initial render, the dashboard polls for updates every 30 seconds. Numbers tick up. Sparklines animate. Feels like a real ops dashboard.
- Inline editing. Click a field in the detail card → edit it in place → app calls sendToHost('tool_call', { tool: 'update_contact', args: { id: '123', name: 'New Name' } }). Instant save.
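The app-side half of that call can be a thin postMessage wrapper whose envelope matches the host listener shown earlier ({ type, payload }); the injectable post parameter here is an assumption added for testability, not part of the existing templates:

```typescript
type HostMessage = { type: string; payload: Record<string, unknown> };

// In an iframe-hosted app the default post targets the embedding page;
// passing a custom post lets unit tests capture messages outside a browser.
function sendToHost(
  type: string,
  payload: Record<string, unknown>,
  post: (m: HostMessage) => void = (m) =>
    (globalThis as any).parent?.postMessage(m, "*"),
): HostMessage {
  const message: HostMessage = { type, payload };
  post(message);
  return message;
}
```

An inline-edit save then becomes sendToHost('tool_call', { tool: 'update_contact', args: { id: '123', name: 'New Name' } }), which the host handler routes to the chat API.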
Testing Reality Check (what the QA skill actually catches vs what it misses)
What it CATCHES (real quality):
| Test | What it validates | Real-world impact |
|---|---|---|
| TypeScript compilation | Code compiles, types are correct | Prevents server crashes |
| MCP Inspector | Protocol compliance | Server works with any MCP client |
| Playwright visual tests | Apps render all 3 states, dark theme, responsive | Users see a polished UI |
| axe-core accessibility | WCAG AA, keyboard nav, screen reader | Accessible to all users |
| XSS payload testing | No script injection via user data | Security against malicious data |
| Chaos testing (500 errors, wrong formats, huge data) | Graceful degradation | App doesn't crash under adverse conditions |
| Static cross-reference | All app IDs consistent across 4 files | No broken routes or missing entries |
| File size budgets | Apps under 50KB | Fast loading |
What it MISSES (testing theater):
| Gap | Why it matters | Current state |
|---|---|---|
| Tool routing accuracy with real LLM | This is THE quality metric — does the AI pick the right tool? | Fixture files exist but aren't run through an LLM |
| APP_DATA generation quality | Does the LLM produce valid JSON that matches the app's expectations? | Not tested at all — parser is tested, generator is not |
| End-to-end data flow | Message → AI → tool → API → APP_DATA → app render → correct data | Manual only — no automated E2E test |
| Multi-step tool chains | "Find John's email and send him a meeting invite" — requires 3 tool calls in sequence | Not tested — all routing tests are single-tool |
| Conversation context | "Show me more details about the second one" — requires memory of previous results | Not addressed in any skill |
| Real API response shape matching | Do MSW mocks match real API responses? | Mocks are hand-crafted, never validated against real APIs |
| Production quality monitoring | Is quality maintained after ship? | No post-ship quality measurement at all |
| APP_DATA parse failure rate | How often does the LLM produce unparseable JSON? | Not measured — the parser silently falls back |
The Hard Truth:
The QA skill is excellent at testing the infrastructure (server compiles, apps render, accessibility passes, security is clean) but weak at testing the AI interaction quality (tool routing, data generation, multi-step flows). The infrastructure is maybe 40% of the user experience; the AI interaction quality is 60%. The testing effort is inverted.
Summary: Top 5 Actions by Impact
| # | Action | Impact | Effort | Priority |
|---|---|---|---|---|
| 1 | Route structuredContent directly to apps (bypass LLM re-serialization) | Eliminates the #1 failure mode, improves reliability from ~90% to ~99% | Medium — requires chat route refactor | CRITICAL |
| 2 | Adopt MCP Apps protocol | Tools work in Claude/ChatGPT/VS Code, not just LocalBosses. Future-proofs everything. | High — requires server + app template updates | CRITICAL |
| 3 | Automated tool routing evaluation with DeepEval | Transforms testing from theater to real measurement | Medium — requires DeepEval integration + test case authoring | CRITICAL |
| 4 | Wire up bidirectional communication (app → host) | Transforms static apps into interactive experiences | Low — handler code is simple | HIGH |
| 5 | Add "updating" state + schema contracts | Eliminates stale data confusion and silent data shape mismatches | Low — small template + schema file changes | HIGH |
This review was conducted with one goal: does the end user have an amazing experience? The MCP Factory pipeline is impressively thorough — it's the most complete MCP development framework I've seen. The infrastructure is production-grade. The gap is in the AI-interaction layer: the fragile LLM→JSON→app data flow, the untested tool routing accuracy, and the static nature of the apps. Fix those three things, and this system ships magic.