=== NEW SERVERS ADDED (7) ===
- servers/closebot — 119 tools, 14 modules, 4,656 lines TS (Stage 7)
- servers/google-console — Google Search Console MCP (Stage 7)
- servers/meta-ads — Meta/Facebook Ads MCP (Stage 8)
- servers/twilio — Twilio communications MCP (Stage 8)
- servers/competitor-research — Competitive intel MCP (Stage 6)
- servers/n8n-apps — n8n workflow MCP apps (Stage 6)
- servers/reonomy — Commercial real estate MCP (Stage 1)

=== FACTORY INFRASTRUCTURE ADDED ===
- infra/factory-tools — mcp-jest, mcp-validator, mcp-add, MCP Inspector
  - 60 test configs, 702 auto-generated test cases
  - All 30 servers score 100/100 protocol compliance
- infra/command-center — Pipeline state, operator playbook, dashboard config
- infra/factory-reviews — Automated eval reports

=== DOCS ADDED ===
- docs/MCP-FACTORY.md — Factory overview
- docs/reports/ — 5 pipeline evaluation reports
- docs/research/ — Browser MCP research

=== RULES ESTABLISHED ===
- CONTRIBUTING.md — All MCP work MUST go in this repo
- README.md — Full inventory of 37 servers + infra docs
- .gitignore — Updated for Python venvs

TOTAL: 37 MCP servers + full factory pipeline in one repo. This is now the single source of truth for all MCP work.
Agent Beta — Production Engineering & DX Review
Date: 2026-02-04
Reviewer: Agent Beta (Production Engineering & Developer Experience Expert)
Scope: MCP Factory pipeline — master blueprint + 5 skills
Model: Opus
Executive Summary
- The pipeline is well-structured for greenfield development but has no provisions for failure recovery, resumability, or rollback — if an agent crashes mid-Phase 3 with 12 of 20 apps built, there's no checkpoint to resume from; the entire phase starts over.
- The "30 untested servers" inventory is a ticking bomb at scale — the skills assume each server is a fresh build, but the real near-term problem is validating/remediating 30 existing servers against live APIs; the pipeline has no "audit/remediation" mode.
- Token budget and context window pressure are unaddressed — research shows 50+ tools can consume 10,000-20,000 tokens just in tool definitions; with GHL at 65 apps and potentially 100+ tools, this is a live performance issue the skills don't acknowledge.
- No gateway pattern, no centralized secret management, no health monitoring — production MCP at scale (2026 state of the art) demands an MCP gateway for routing, centralized auth, and observability; the pipeline builds 30+ independent servers with independent auth, which the industry calls "connection chaos."
- The skills are excellent reference documentation but lack operational runbooks — they tell you how to build but not how to operate, how to debug when broken at 3am, or how to upgrade when APIs change.
Per-Skill Reviews
Skill 1: mcp-api-analyzer (Phase 1)
Strengths:
- Excellent prioritized reading order (auth → rate limits → overview → endpoints → pagination). This is genuinely good engineering triage.
- The "Speed technique for large APIs" section acknowledging OpenAPI spec parsing is smart — most analysis time is wasted reading docs linearly.
- Tool description formula ("What it does. What it returns. When to use it.") is simple, memorable, and effective.
- App candidate selection criteria (build vs skip) prevents app sprawl.
Issues:
- No handling of non-REST API patterns (CRITICAL)
- The entire skill assumes REST APIs with standard HTTP verbs and JSON responses.
- Missing: GraphQL APIs (single endpoint, schema introspection, query/mutation split)
- Missing: SOAP/XML APIs (still common in enterprise: ServiceTitan, FieldEdge, some Clover endpoints)
- Missing: WebSocket/real-time APIs (relevant for chat, notifications, live dashboards)
- Missing: gRPC APIs (growing in B2B SaaS)
- Fix: Add an "API Style Detection" section upfront. If the API is non-REST, document the adaptation pattern. For GraphQL: map queries → read tools, mutations → write tools, subscriptions → skip (or note for future). For SOAP: identify the WSDL, map operations to tools.
- Pagination analysis is too shallow (HIGH)
  - Lists cursor/offset/page as the only patterns, but real APIs also use:
    - Link header pagination (GitHub-style — `Link: <url>; rel="next"`)
    - Keyset pagination (Stripe-style — `starting_after=obj_xxx`)
    - Scroll/search-after (Elasticsearch-style)
    - Composite cursors (base64-encoded JSON with multiple sort fields)
    - Token-based (AWS-style `NextToken`)
  - Fix: Expand the pagination section with a pattern catalog. Each entry should note: how to request the next page, how to detect the last page, whether a total count is available, and whether backwards pagination is supported.
- Auth flow documentation assumes the happy path (MEDIUM)
- OAuth2 has 4+ grant types (authorization code, client credentials, PKCE, device code). The template just says "OAuth2" without specifying which.
- Missing: Token storage strategy for MCP servers (they're long-running processes — how do you handle token refresh for OAuth when the server may run for days?).
- Missing: API key rotation procedures. What happens when a key is compromised?
- Fix: Add auth pattern subtypes. For OAuth2 specifically, document: grant type, redirect URI requirements, scope requirements, token lifetime, refresh token availability.
- No version/deprecation awareness (MEDIUM)
- Says "skip changelog/migration guides" which is dangerous. Many APIs (GHL, Stripe, Twilio) actively deprecate endpoints and enforce version sunsets.
- Fix: Add a "Version & Deprecation" section to the analysis template: current stable version, deprecation timeline, breaking changes in recent versions, version header requirements.
- Rate limit analysis doesn't consider burst patterns (LOW-MEDIUM)
- Many APIs use token bucket or leaky bucket algorithms, not simple "X per minute" limits.
- The analysis should capture: sustained rate, burst allowance, rate limit scope (per-key, per-endpoint, per-user), and penalty for exceeding (429 response vs temporary ban).
DX Assessment: A new agent could follow this skill clearly. The template is well-structured. The execution workflow at the bottom is a nice checklist. Main gap: the skill reads as "analyze a typical REST API" when reality is much messier.
Skill 2: mcp-server-builder (Phase 2)
Strengths:
- The one-file vs modular decision tree (≤15 tools = one file) is pragmatic and prevents over-engineering.
- Auth pattern catalog (A through D) covers the most common cases.
- The annotation decision matrix is crystal clear.
- Zod validation as mandatory before any API call is the right call — catches bad input before burning rate limit quota.
- Error handling standards (client → handler → server) with explicit "never crash" rule.
Issues:
- Lazy loading provides minimal actual benefit for stdio transport (CRITICAL MISCONCEPTION)
  - The skill emphasizes lazy loading as a key performance feature, but research shows the real issue is different:
    - For stdio MCP servers, the server process starts fresh per-session. `ListTools` is called immediately on connection, which triggers `loadAllGroups()` anyway. Lazy loading only helps if a tool is never used in a session — but the tool definitions are still loaded and sent.
    - The actual bottleneck is token consumption, not server memory. Research from CatchMetrics shows 50+ tools with 200-token average definitions = 10,000+ tokens consumed from the AI's context window before any work begins.
- What actually matters: Concise tool descriptions and minimal schema verbosity. The skill optimizes the wrong thing.
- Fix: Add a "Token Budget Awareness" section. Set a target: total tool definition tokens should stay under 5,000 for a server. For large servers (GHL with 65 apps), implement tool groups that are selectively registered based on channel context, not just lazily loaded.
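A token budget check could be as simple as the sketch below. The 4-characters-per-token heuristic is a rough approximation (not a real tokenizer), and the function names are illustrative, not part of the existing templates:

```typescript
// Rough token estimate for a server's tool definitions.
// Heuristic: ~4 characters per token (approximation, not a tokenizer).
interface ToolDef {
  name: string;
  description: string;
  schema: object; // JSON Schema / serialized Zod shape
}

function estimateDefinitionTokens(tools: ToolDef[]): number {
  const chars = tools.reduce(
    (sum, t) =>
      sum + t.name.length + t.description.length + JSON.stringify(t.schema).length,
    0,
  );
  return Math.ceil(chars / 4);
}

// Warn at build time if a server exceeds the suggested 5,000-token budget.
function checkTokenBudget(tools: ToolDef[], budget = 5000): number {
  const est = estimateDefinitionTokens(tools);
  if (est > budget) {
    console.warn(`tool definitions ~${est} tokens, over the ${budget}-token budget`);
  }
  return est;
}
```

Running this in CI per server would make description verbosity a measurable, enforceable number rather than a style guideline.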
- No circuit breaker pattern (HIGH)
  - The retry logic in `client.ts` does exponential backoff on 5xx errors, but:
    - No circuit breaker to stop hammering a down service
- No fallback responses for degraded mode
- No per-endpoint failure tracking
- Real-world scenario: ServiceTitan's API goes down at 2am. Your server retries every request 3 times with backoff, but a user sending 10 messages triggers 30 failed requests in rapid succession. Without a circuit breaker, you're amplifying the failure.
- Fix: Add a simple circuit breaker to the API client:
    - Track failure count per endpoint (or globally)
    - After N consecutive failures, enter "open" state
    - In "open" state, immediately return a cached/error response without hitting the API
    - After a timeout, try one request ("half-open")
    - If it succeeds, close the circuit; if it fails, stay open
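The steps above fit in a few dozen lines. This is a minimal sketch (class and parameter names are illustrative, not taken from the templates), intended to wrap the existing retry logic rather than replace it:

```typescript
// Minimal circuit breaker: fail fast while an upstream API is down.
type CircuitState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  private state: CircuitState = "closed";

  constructor(
    private maxFailures = 5,         // consecutive failures before opening
    private resetTimeoutMs = 30_000, // how long to stay open before a trial
  ) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === "open") {
      if (Date.now() - this.openedAt < this.resetTimeoutMs) {
        throw new Error("circuit open: failing fast"); // no upstream request
      }
      this.state = "half-open"; // allow one trial request through
    }
    try {
      const result = await fn();
      this.failures = 0;
      this.state = "closed";
      return result;
    } catch (err) {
      this.failures++;
      if (this.state === "half-open" || this.failures >= this.maxFailures) {
        this.state = "open";
        this.openedAt = Date.now();
      }
      throw err;
    }
  }
}
```

Wrapping every `client.ts` request in `breaker.call(() => fetch(...))` turns the 2am retry storm into a single failed probe every 30 seconds.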
- Pagination helper assumes uniform patterns (HIGH)
  - The `paginate()` method in client.ts assumes query-param pagination (`?page=1&pageSize=25`), but:
    - Stripe uses `starting_after` with object IDs
    - GHL uses different pagination per endpoint
    - Some APIs use a POST body for pagination (Elasticsearch)
    - Some return a `next_url` you fetch directly
  - Fix: Make pagination a pluggable strategy. Create a `PaginationStrategy` interface with implementations for offset, cursor, keyset, link-header, and next-url patterns. Each tool can specify which strategy its endpoint uses.
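A possible shape for that interface, with two strategies sketched (the interface and class names are proposals, not code from the existing client):

```typescript
// Pluggable pagination: each tool declares which strategy its endpoint uses.
interface Page<T> {
  items: T[];
  next: string | null; // opaque token for the next request; null = last page
}

interface PaginationStrategy<T> {
  buildParams(next: string | null): Record<string, string>;
  parsePage(body: unknown): Page<T>;
}

// Offset/page-number style: ?page=2&pageSize=25
class OffsetStrategy<T> implements PaginationStrategy<T> {
  constructor(private pageSize = 25) {}
  buildParams(next: string | null): Record<string, string> {
    return { page: next ?? "1", pageSize: String(this.pageSize) };
  }
  parsePage(body: any): Page<T> {
    const items: T[] = body.items ?? [];
    const page = Number(body.page ?? 1);
    // A short page means we've reached the end.
    return { items, next: items.length < this.pageSize ? null : String(page + 1) };
  }
}

// Keyset style (Stripe-like): starting_after=<last object id>
class KeysetStrategy<T extends { id: string }> implements PaginationStrategy<T> {
  buildParams(next: string | null): Record<string, string> {
    return next ? { starting_after: next } : {};
  }
  parsePage(body: any): Page<T> {
    const items: T[] = body.data ?? [];
    const hasMore = Boolean(body.has_more);
    return { items, next: hasMore && items.length ? items[items.length - 1].id : null };
  }
}
```

`paginate()` then loops on `strategy.buildParams`/`strategy.parsePage` until `next` is null, without knowing which API convention is in play.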
- No request/response logging (HIGH)
  - The server has zero observability. No structured logging. No request IDs. No timing.
  - When something breaks in production, the only signal is `console.error` on stderr.
  - Fix: Add a minimal structured logger:

```typescript
function log(level: string, event: string, data: Record<string, unknown>) {
  console.error(JSON.stringify({ ts: new Date().toISOString(), level, event, ...data }));
}
```

  - Log: tool invocations (name, duration, success/fail), API requests (endpoint, status, duration), and errors (with stack traces).
- TypeScript template has placeholder variables (MEDIUM-DX)
  - `process.env.{SERVICE}_API_KEY` — the curly braces are literal template markers that won't compile.
  - The builder agent needs to know to replace these. This is documented implicitly but could trip up an automated build.
  - Fix: Either use actual environment variable names in examples, or add an explicit "Template Variables" section listing all `{service}`, `{SERVICE}`, `{Service}` patterns that must be replaced.
- No health check or self-test capability (MEDIUM)
  - No way to verify the server is working without sending a real tool call.
  - Fix: Add a `ping` or `health_check` tool that validates: env vars are set, the API base URL is reachable, and the auth token is valid. This is invaluable for QA (Phase 5) and ongoing monitoring.
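A sketch of what that handler could look like. The option names and the idea of a status-code "probe" callback are assumptions for illustration, not the generated servers' actual API:

```typescript
// Sketch of a health_check handler: verifies env vars, reachability, and auth.
async function healthCheck(opts: {
  requiredEnv: string[];
  probe: () => Promise<number>; // performs one cheap authenticated request, returns HTTP status
}): Promise<{ ok: boolean; checks: Record<string, string> }> {
  const checks: Record<string, string> = {};
  for (const name of opts.requiredEnv) {
    checks[name] = process.env[name] ? "set" : "MISSING";
  }
  try {
    const status = await opts.probe();
    checks.api = status < 500 ? `reachable (${status})` : `server error (${status})`;
    checks.auth = status === 401 || status === 403 ? "INVALID" : "ok";
  } catch (err) {
    checks.api = `unreachable: ${(err as Error).message}`;
    checks.auth = "unknown";
  }
  const ok = Object.values(checks).every(
    (v) => !v.includes("MISSING") && !v.includes("INVALID") && !v.startsWith("unreachable"),
  );
  return { ok, checks };
}
```

Because the result is structured, the Phase 5 QA script (or a cron) can assert on `ok` instead of parsing free text.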
- Missing: Connection timeout configuration (MEDIUM)
  - The `fetch()` calls have no timeout. A hanging API response will block the tool indefinitely.
  - Fix: Add `AbortController` with a configurable timeout (default 30s) to every request.
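The whole fix is a small wrapper — a sketch, with the helper name and default chosen for illustration:

```typescript
// fetch with an AbortController-based timeout (default 30s).
async function fetchWithTimeout(
  url: string,
  init: RequestInit = {},
  timeoutMs = 30_000,
): Promise<Response> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await fetch(url, { ...init, signal: controller.signal });
  } finally {
    clearTimeout(timer); // always clear, whether the request succeeded or aborted
  }
}
```

Substituting this for bare `fetch()` in the client template bounds every request, so a hung upstream turns into a catchable `AbortError` instead of a frozen tool.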
DX Assessment: Strong skill. An agent given an analysis doc can produce a working server. The templates are copy-paste ready (after variable substitution). Biggest risk: servers work in demo but fail under real-world conditions because resilience patterns are absent.
Skill 3: mcp-app-designer (Phase 3)
Strengths:
- The design system is comprehensive and consistent. Color tokens, typography scale, spacing — this is production-quality design documentation.
- 8 app type templates cover the vast majority of use cases.
- Three required states (loading, empty, data) with the skeleton animation is excellent UX.
- Utility functions (`escapeHtml`, `formatCurrency`, `getBadgeClass`) prevent common bugs.
- `escapeHtml()` prevents XSS — security-aware by default.
Issues:
- Polling creates unnecessary load at scale (HIGH)
  - Every app polls `/api/app-data` every 3 seconds. With 10 apps open across tabs/threads, that's 200 requests/minute to the LocalBosses API.
  - The comment says "stop polling once we have data" but only if postMessage succeeds first. If the initial postMessage fails (race condition), polling continues indefinitely.
- Fix:
- Increase poll interval to 5s, then 10s, then 30s (exponential backoff on polling)
- Add a maximum poll count (stop after 20 attempts, show error state)
- Consider replacing polling with a one-time fetch + event listener pattern
    - Add a `document.hidden` check — don't poll if the tab isn't visible (`visibilitychange` event)
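Those four fixes combine into a small loop. This is a sketch; the delay schedule, `MAX_POLLS` cap, and callback names are illustrative, not the template's actual code:

```typescript
// Backoff polling with a hard attempt cap and a tab-visibility check.
const POLL_DELAYS_MS = [3000, 5000, 10000, 30000];
const MAX_POLLS = 20;

function nextPollDelay(attempt: number): number {
  // 3s, 5s, 10s, then 30s for every subsequent attempt.
  return POLL_DELAYS_MS[Math.min(attempt, POLL_DELAYS_MS.length - 1)];
}

function startPolling(
  fetchData: () => Promise<unknown | null>,
  onDone: (data: unknown) => void,
  onGiveUp: () => void,
): void {
  let attempt = 0;
  const tick = async (): Promise<void> => {
    if (attempt >= MAX_POLLS) return onGiveUp(); // cap reached: show error state
    // Skip the request while the tab is backgrounded (guarded for non-browser envs).
    const doc = (globalThis as any).document;
    if (!(doc && doc.hidden)) {
      const data = await fetchData();
      if (data !== null && data !== undefined) return onDone(data); // stop once we have data
    }
    attempt++;
    setTimeout(tick, nextPollDelay(attempt));
  };
  void tick();
}
```

The `visibilitychange` event could additionally trigger an immediate `tick()` when the tab is re-focused, so a backgrounded app catches up as soon as the user returns.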
- No data validation in render functions (HIGH)
  - The render functions do basic null checks but don't validate data shapes. If the AI returns `data.contacts` but the app expects `data.data`, you get a blank screen with no error.
  - Every app type template accesses data differently: `data.data || data.items || data.contacts || data.results` — this "try everything" pattern masks bugs and makes debugging hard.
  - Fix: Add a `validateData(data, expectedShape)` helper that checks for required fields and logs warnings for missing ones. Have each app type declare its expected data shape explicitly.
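A minimal version of that helper, assuming a declared-shape format of just a `required` key list (the signature is a proposal, not existing template code):

```typescript
// Warn-and-return validator: each app declares its required top-level fields.
function validateData(
  data: Record<string, unknown>,
  expectedShape: { required: string[] },
): string[] {
  const missing = expectedShape.required.filter((key) => !(key in data));
  for (const key of missing) {
    console.warn(`[app-data] missing expected field: ${key}`);
  }
  return missing; // empty array = data matches the declared shape
}
```

A render function can then show an explicit "data shape mismatch" state when the returned array is non-empty, instead of silently rendering a blank screen.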
- Accessibility is completely absent (MEDIUM)
- No ARIA attributes, no keyboard navigation, no focus management.
  - Tables have no `scope` attributes on headers.
  - Status badges rely solely on color (fails WCAG for color-blind users).
  - Fix: At minimum, add `role` attributes to dynamic regions, `aria-label` on interactive elements, and text alternatives for color-coded status badges (e.g., add a text prefix: "● Active" vs just the green badge).
- CSS-only charts don't handle negative values or zero-height bars (LOW-MEDIUM)
  - The analytics bar chart template uses `height:${Math.max(pct, 2)}%` — the minimum 2% height is good, but:
    - No support for negative values (common in financial data: losses, negative growth)
    - No axis labels or gridlines
    - Bar chart is the only visualization option
  - Fix: For the factory's scope this is acceptable, but add a note that complex visualizations should use a lightweight inline charting approach or SVG-based charts (still no external deps).
- File size guideline ("under 50KB") may be exceeded for complex apps (LOW)
  - The pipeline/kanban template with 20+ items in 6 stages, plus all the CSS and utility functions, can exceed 50KB.
  - Fix: The guideline is fine, but add a note about minification. Even simple whitespace removal can cut 30% off HTML file sizes. Could add a build step: `html-minifier` in the server build process.
DX Assessment: The strongest skill in terms of "copy template, customize, ship." The design system is well-documented enough that even a junior developer could build consistent apps. The templates handle 90% of cases well. The 10% edge cases (complex data, accessibility, performance) are where issues arise.
Skill 4: mcp-localbosses-integrator (Phase 4)
Strengths:
- The cross-reference check ("every app ID must appear in ALL 4 files") is critical and well-called-out.
- The complete Calendly example at the end is extremely helpful — shows all 5 files in one cohesive example.
- System prompt engineering guidelines differentiate natural language capability descriptions from raw tool names.
- The `systemPromptAddon` pattern with sample data shapes is clever — gives the AI a template to follow.
Issues:
- No automated cross-reference validation (CRITICAL)
- The skill says "verify all app IDs appear in all 4 files" but provides no automated way to do this.
- With 30+ servers × 5-15 apps each = 150-450 app IDs to track. Manual verification is guaranteed to miss something.
  - Fix: Create a validation script (should live in `scripts/validate-integration.ts`) that:
    - Parses channels.ts → extracts all mcpApps arrays
    - Parses appNames.ts → extracts all keys
    - Parses app-intakes.ts → extracts all keys
    - Parses mcp-apps/route.ts → extracts APP_NAME_MAP keys
    - Cross-references: every ID in channels must exist in the other 3 files
    - Verifies: every APP_NAME_MAP entry resolves to an actual HTML file
    - Outputs: missing entries, orphaned entries, file resolution failures
  - This script should run in CI and as part of Phase 5 QA.
- System prompt scaling problem (HIGH)
  - Each channel gets one system prompt that lists all capabilities. For GHL (65 apps, 100+ tools), this prompt is enormous.
  - The `systemPromptAddon` in app-intakes adds per-thread instructions with sample data shapes. For a channel with 15 apps, the AI's context is loaded with instructions for all 15 app types even though only 1 is active.
  - Fix:
    - System prompts should be modular: core identity + dynamically injected tool-group descriptions based on the current thread's app.
    - `systemPromptAddon` should be the ONLY app-specific instruction injected, not in addition to the full channel prompt.
    - Consider a "prompt budget" target: channel system prompt < 500 tokens, addon < 300 tokens.
- APP_DATA format is fragile (HIGH)
  - The `<!--APP_DATA:{...}:END_APP_DATA-->` format relies on the AI producing exact delimiters.
  - Real-world failure modes:
    - AI adds a line break inside the JSON (the spec says "single line" but LLMs don't reliably follow this)
    - AI adds text after END_APP_DATA
    - AI wraps the comment in a ```json code fence
    - AI forgets the block entirely (even with "MANDATORY" in the prompt)
    - AI produces invalid JSON (missing closing brace, trailing comma)
  - Fix:
    - The parser should be robust: strip whitespace/newlines from the JSON before parsing, handle code-block wrapping, and attempt JSON.parse with error recovery
    - Add a fallback: if no APP_DATA block is found, try to extract JSON from the response body (heuristic)
    - Track APP_DATA generation success rate per channel — if it drops below 90%, the system prompt needs revision
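A tolerant extractor covering the first, third, and fifth failure modes might look like this sketch (the regex and recovery steps illustrate the "robust parser" fix; they are not the production parser):

```typescript
// Tolerant APP_DATA extraction: survives stray newlines, surrounding text,
// code-fence wrapping, and trailing commas. Returns null if unrecoverable.
function extractAppData(message: string): unknown {
  // Non-greedy match works even with text (or a code fence) around the comment.
  const match = message.match(/<!--APP_DATA:([\s\S]*?):END_APP_DATA-->/);
  if (!match) return null;
  // Collapse newlines the model may have inserted inside the JSON.
  const raw = match[1].replace(/\n/g, " ").trim();
  try {
    return JSON.parse(raw);
  } catch {
    // Recovery attempt: strip trailing commas before } or ].
    try {
      return JSON.parse(raw.replace(/,\s*([}\]])/g, "$1"));
    } catch {
      return null; // caller falls back to the body-JSON heuristic
    }
  }
}
```

Logging how often each branch fires (clean parse, recovered parse, null) gives exactly the per-channel success-rate metric the fix calls for.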
- No versioning of channel configurations (MEDIUM)
  - Adding a channel requires editing 4 source files. If integration fails, there's no way to roll back cleanly.
  - Fix: Consider a channel configuration manifest (`{service}-channel.json`) that's validated and auto-wired, rather than manual edits to 4 shared TypeScript files. This would also enable automated integration and rollback.
- Thread state management not documented (MEDIUM)
- The skill mentions "thread lifecycle" but doesn't address: What happens to thread state when LocalBosses restarts? When does thread data expire? How much localStorage is consumed by 100+ threads?
- Fix: Add a thread state management section covering: storage mechanism, expiry/cleanup, maximum thread count, and what happens on storage quota exceeded.
- Intake question quality is untested (LOW-MEDIUM)
- The intake questions are written once and never validated. A question like "What would you like to see?" is vague. A question like "Which contacts would you like to view? Provide a name, email, or ID." is specific.
- Fix: Add intake question quality criteria:
- Must suggest what input format to provide
    - Must have a `skipLabel` for the most common default
    - Should be under 20 words
- Should not require domain expertise to answer
DX Assessment: This skill carries the most operational risk because errors here affect ALL users immediately (broken sidebar, missing apps, 404s). The manual 4-file editing pattern is the weakest point — error-prone and not automatable. A new developer would be able to follow it, but a new agent might miss the cross-referencing requirement.
Skill 5: mcp-qa-tester (Phase 5)
Strengths:
- The 5-layer testing pyramid is well-organized (static → visual → functional → live API → integration).
- The automated QA script template is immediately useful.
- The "Common Issues & Fixes" table at the end is a great quick-reference debugging guide.
- Visual testing with Peekaboo + Gemini is creative and leverages the existing toolchain well.
Issues:
- No automated test suite — everything is manual or script-based (CRITICAL)
- The QA skill has no actual test framework. No Jest. No Playwright. No test runner.
- The "automated test script" is a bash script that checks file existence and byte sizes — not tests.
- For 30 servers × 5-15 apps × 5 NL messages = 750-2,250 manual test cases. This doesn't scale.
- Fix: Define a minimal automated test framework:
- Unit tests: For each tool handler, test with mock API responses (Jest + MSW or similar)
- Schema tests: Validate every tool's Zod schema against real API response shapes
- App render tests: Use jsdom or Playwright to load each HTML file with sample data, verify no JS errors, verify DOM elements exist
- Integration tests: Playwright script that navigates LocalBosses, sends a message, waits for APP_DATA, captures screenshot
- Store sample API responses as fixtures for offline testing
- Visual testing relies on subjective AI judgment (HIGH)
- "Analyze this screenshot with Gemini" — the pass/fail criteria are subjective. Gemini might say "looks fine" when there's a subtle alignment bug. Might flag normal variance as a bug.
- No baseline comparison. No pixel-diff. No regression detection.
- Fix:
- Add screenshot comparison: capture a "golden" screenshot when the app is first verified as correct. On subsequent QA runs, compare against the golden image. Flag >5% pixel difference.
- Use Gemini for initial evaluation but require human sign-off on the first run.
- Store golden screenshots in the repo for each app.
- Live API testing has no credential management strategy (HIGH)
  - "Set environment variables in `.env`" — but for 30 servers, that's 30+ API keys/secrets to obtain and manage.
  - Missing: Where are test credentials stored? Are they prod or sandbox? Do they expire? Who rotates them?
  - Missing: Some APIs (ServiceTitan, FieldEdge) require business relationships to get API access — you can't just sign up for a free key.
  - Fix: Add a credential management section:
    - Centralized `.env` management (e.g., a master `.env.testing` file or a secret manager)
    - Categorize each server: has-creds, needs-creds, sandbox-available, no-sandbox
    - For servers without credentials, QA should focus on static + mock testing (Layers 1-3)
- No performance testing (MEDIUM-HIGH)
- No mention of testing: cold start time, response latency, memory usage, behavior under load.
- With 50+ servers potentially running, resource consumption matters.
- Fix: Add a Layer 2.5: Performance Testing:
    - Measure cold start time (`time node dist/index.js` → first ListTools response)
    - Measure tool invocation latency (mock API with known response time, measure overhead)
    - Measure memory usage after loading all tool groups
    - Target: cold start < 2s, tool overhead < 100ms, memory < 100MB per server
- Test report has no persistence or trending (MEDIUM)
  - Reports are written to `/tmp/` — they don't persist. No historical tracking.
  - Can't answer: "Is this server getting better or worse over time?"
  - Fix: Store reports in the workspace: `mcp-factory-reviews/{service}/qa-report-{date}.md`. Add a summary dashboard that aggregates pass/fail counts across all servers.
- No regression testing strategy (MEDIUM)
- After fixing a bug, no mechanism to ensure it doesn't recur.
- Fix: When a bug is found and fixed, add a specific test case for it. Store regression test cases per server. Run them on every QA cycle.
- E2E scenarios are only 2-3 per channel (LOW)
- For complex channels like CRM with 65 apps, 2-3 scenarios test ~5% of functionality.
- Fix: Establish a minimum: at least 1 E2E scenario per app type (dashboard, grid, card, form, timeline, calendar, pipeline). For high-value channels, expand to 2-3 per app.
DX Assessment: The weakest skill in terms of scalability. It was designed for manual QA of individual servers, not for verifying 30+ servers in a production pipeline. A QA agent following this skill would spend hours per server on manual testing with no automated regression safety net. The skill needs a fundamental shift from "manual verification" to "automated testing with manual override for judgment calls."
Research Findings: Production Patterns We Should Adopt
1. MCP Gateway Pattern (Industry Standard for Scale)
The industry has converged on the MCP Gateway as the answer to multi-server management:
"An MCP gateway is a session-aware reverse proxy and lightweight control plane that fronts many MCP servers behind one endpoint. It adds routing, centralized authn/authz, policy enforcement, observability, and lifecycle management." — Skywork AI
Key findings:
- Without a gateway, clients must maintain separate connections to each server, each with own auth, error handling, and lifecycle — this is called "connection chaos"
- Gateways provide: centralized auth (authenticate once, access many), unified logging/audit, intelligent routing + load balancing, server discovery/registration
- Major players: Lasso MCP Gateway (open-source, enterprise security), Peta MCP Suite, Azure MCP Gateway (Kubernetes-native), WSO2 (unified control plane)
- Recommendation for LocalBosses: Consider implementing a lightweight gateway layer that LocalBosses uses to route tool calls to the appropriate MCP server. This eliminates per-server connection management in the chat route.
2. Token Budget Management (The Real Performance Problem)
Research from CatchMetrics and others reveals that the #1 performance issue with multiple MCP servers isn't memory or CPU — it's context window consumption:
- Each tool definition consumes 50-1000 tokens depending on schema complexity
- A server with 20 tools averaging 200 tokens each = 4,000 tokens just for tool definitions
- 5 servers active simultaneously = 20,000 tokens consumed before any conversation
- This is 10% of Claude's 200K context window — and it compounds with system prompts and conversation history
Mitigation strategies from research:
- Ruthless schema optimization: Eliminate redundant descriptions, use references not inline docs
- Dynamic tool registration: Only register tools relevant to the current conversation context
- Plain text responses over JSON: For large datasets, return formatted text instead of full JSON — 80% token reduction
- Response pruning: Strip null/empty fields from API responses before returning to the AI
3. OpenAPI-to-MCP Automation Tools
Multiple tools now exist to auto-generate MCP servers from OpenAPI specs:
- Stainless MCP Portal: CI/CD integration — regenerates MCP server when OpenAPI spec changes
- FastMCP `from_openapi()`: Python — one-liner to create an MCP server from a spec
- openapi-mcp-generator (GitHub): CLI tool, supports TypeScript output
- Higress (Alibaba): Bulk conversion of OpenAPI specs
- ConvertMCP.com: Free online tool, supports multiple languages
Recommendation: For the 30 untested servers, check if OpenAPI specs exist for each API. If so, auto-generating a server and comparing against the hand-built version could catch missing endpoints and type mismatches. Could also be used as a "second opinion" validation step in Phase 1.
4. Production MCP Best Practices (The New Stack, Feb 2026)
Key practices from the 15-best-practices guide that our pipeline misses:
- Treat each server as a bounded context — ✅ we do this
- Prefer stateless, idempotent tool design — ✅ annotations cover this
- Choose the right transport — ⚠️ stdio only; Streamable HTTP not considered
- Elicitation for human-in-the-loop — ❌ not mentioned at all
- OAuth 2.1 mandatory for HTTP transport — ⚠️ not applicable yet (stdio)
- Structured content with outputSchema — ❌ not using June 2025 spec features
- Instrument like a production microservice — ❌ no logging, metrics, correlation IDs
- Version your surface area — ❌ no versioning strategy
- Handle streaming for large outputs — ❌ no streaming support
- Test with real hosts and failure injection — ❌ no fault injection testing
- Package as microservice (containerize) — ❌ no container strategy
- Document risks for impactful actions — ⚠️ annotations exist but no dry-run mode
5. Circuit Breaker + Retry + Rate Limiter Triad
Production API integration requires three resilience patterns working together:
- Retry: Handle transient failures (network blips, 503s) — our pipeline has this
- Rate Limiter: Prevent overwhelming the upstream API — our pipeline has basic version
- Circuit Breaker: Stop calling a failing service, fail fast — our pipeline is missing this
The research consensus is clear: retry without circuit breaker is dangerous. It amplifies failures during outages.
Missing Pieces: What the Pipeline Doesn't Cover But Should
1. Operational Runbook (CRITICAL GAP)
- What to do when a server stops responding
- How to diagnose "tool not triggering" issues
- How to update when an API changes endpoints
- How to add a new tool to an existing server without breaking others
- Emergency: how to disable a broken server without restarting everything
2. Pipeline Resumability (CRITICAL GAP)
- If Phase 3 fails after building 10 of 20 apps, how does the agent know which are done?
- If Phase 4 crashes after updating 2 of 5 files, the integration is in a broken state
- Need: checkpoint files, progress tracking, idempotent phase execution
- Pattern: Each phase should check "what's already done" before starting
3. Configuration Management at Scale (HIGH GAP)
- 30 servers × 2-5 env vars each = 60-150 secrets to manage
- Currently: individual `.env` files per server
- Need: centralized secret management (Vault, 1Password CLI, or at minimum a master `.env.all`)
- Need: environment separation (sandbox/staging/production)
4. Dependency Management (HIGH GAP)
- All 30 servers depend on `@modelcontextprotocol/sdk` — version updates affect all
- Currently: each server has its own `package.json` with pinned-ish versions
- Need: dependency update strategy. When SDK v2.0 drops, how do you update 30 servers?
- Consider: shared workspace/monorepo with unified dependency management (`pnpm workspaces` or `npm workspaces`)
5. API Version Change Detection (MEDIUM GAP)
- APIs change their endpoints, add required fields, deprecate features
- No mechanism to detect when an API change breaks a tool
- Need: periodic "smoke test" that calls each tool's primary read endpoint and validates the response shape
- Could run as a cron: every 24h, call `list_*` on each server, verify the response matches the expected schema
6. Monitoring & Alerting (MEDIUM GAP)
- No health checks for running servers
- No way to know if an API key expired, a rate limit was hit, or responses changed shape
- Need: per-server health endpoint, centralized dashboard, alerting on failure patterns
- Even simple: a daily "status check" script that tries each server's primary tool
7. Multi-Tenant / Multi-User Considerations (MEDIUM GAP)
- LocalBosses presumably has multiple users
- The pipeline assumes one set of API credentials per server
- What if different users have different API accounts? (e.g., each user has their own CRM)
- Need: at minimum, document the assumption (single-tenant). If multi-tenant needed later, the gateway pattern supports it.
8. Rollback Strategy (MEDIUM GAP)
- After Phase 4 integration, if QA (Phase 5) reveals problems, how do you un-integrate?
- Need: integration should be reversible. Either:
- Git-based: commit before integration, revert if QA fails
- Feature-flag: new channels start disabled, enable after QA pass
- Or: the manifest-based approach (JSON config per channel, delete the config to remove)
9. Documentation for Non-Agent Humans (LOW-MEDIUM GAP)
- The skills are written for AI agents to follow, but humans need to understand the system too.
- Need: a high-level architecture diagram, a "how it all fits together" overview, and a troubleshooting FAQ
- The MCP-FACTORY.md is close but focuses on process, not architecture
10. Non-REST API Support (see Skill 1 review)
- GraphQL, SOAP, WebSocket, gRPC patterns
- Several APIs in the inventory may use these (especially enterprise field service tools)
Priority Recommendations (Ranked by Impact)
P0 — Do Before Scaling to 30+ Servers
- Add integration validation script (Est: 2-4 hours)
- Automated cross-reference check for all 4 integration files
- Run before every deploy; add to CI
- Prevents the #1 cause of "app not found" errors
- Immediate ROI for the 30-server push
- Add circuit breaker to API client template (Est: 2-3 hours)
- Modify
client.tstemplate to include simple circuit breaker - Prevents cascading failures when upstream APIs go down
- Saves 3am on-call debugging
- Modify
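A minimal breaker for the `client.ts` template might look like this — a sketch, with the threshold and cooldown values as illustrative assumptions:

```typescript
// Minimal circuit-breaker sketch: after `threshold` consecutive failures the
// breaker opens and fails fast for `cooldownMs`, then allows one trial request
// (half-open). Values are illustrative defaults, not tuned numbers.
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;

  constructor(private threshold = 5, private cooldownMs = 30_000) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.failures >= this.threshold) {
      if (Date.now() - this.openedAt < this.cooldownMs) {
        // Fail fast instead of hammering a dead upstream
        throw new Error("circuit open: upstream API marked unhealthy");
      }
      this.failures = 0; // half-open: allow one trial request
    }
    try {
      const result = await fn();
      this.failures = 0; // success closes the breaker
      return result;
    } catch (err) {
      this.failures++;
      if (this.failures >= this.threshold) this.openedAt = Date.now();
      throw err;
    }
  }
}
```

Wrapping every upstream call in `breaker.call(() => fetch(...))` means a dying API degrades into fast, explicit errors instead of piles of hung requests.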
3. Add structured logging to server template (Est: 1-2 hours)
   - JSON-formatted logs on stderr: tool invocations, API calls, errors
   - Include request IDs for tracing
   - You can't fix what you can't see
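A sketch of what this could look like in the server template — the event and field names are assumptions; the one hard rule is writing to stderr, since a stdio MCP server's stdout carries protocol messages:

```typescript
// Structured-logging sketch: one JSON object per line on stderr. Event and
// field names (tool_invoked, requestId) are illustrative assumptions.
import { randomUUID } from "node:crypto";

type Level = "info" | "warn" | "error";

function formatLog(level: Level, event: string, fields: Record<string, unknown> = {}): string {
  return JSON.stringify({ ts: new Date().toISOString(), level, event, ...fields });
}

function logEvent(level: Level, event: string, fields?: Record<string, unknown>): void {
  // stderr keeps logs out of the stdio protocol stream on stdout
  process.stderr.write(formatLog(level, event, fields) + "\n");
}

// A request ID generated per tool invocation lets you trace one call across
// the tool handler, the API client, and any retries.
logEvent("info", "tool_invoked", { tool: "list_contacts", requestId: randomUUID() });
```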
4. Add request timeouts (Est: 30 min)
   - `AbortController` with 30s default on all fetch calls
   - Prevents indefinite hangs
   - Trivial to implement, prevents a whole class of production failures
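The wrapper is a few lines; a sketch assuming the Node 18+ global `fetch`:

```typescript
// Timeout wrapper sketch: abort any fetch that exceeds timeoutMs (30s default,
// as suggested). Assumes Node 18+ where fetch and AbortController are globals.
async function fetchWithTimeout(
  url: string,
  init: RequestInit = {},
  timeoutMs = 30_000,
): Promise<Response> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await fetch(url, { ...init, signal: controller.signal });
  } finally {
    clearTimeout(timer); // always clear so a held timer can't keep the process alive
  }
}
```

Swapping this in for bare `fetch` in the client template converts silent hangs into catchable `AbortError` rejections.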
P1 — Do During the 30-Server Push
1. Create automated QA test framework (Est: 1-2 days)
   - Jest tests for tool handlers with mock responses
   - Playwright tests for app rendering with sample data
   - HTML validation for all app files
   - Turns 2-3 hours of manual QA per server into 5 minutes of automated testing
2. Implement token budget awareness (Est: 4-6 hours)
   - Audit all tool descriptions for verbosity
   - Set target: <200 tokens per tool definition
   - For channels with 20+ tools, implement context-aware tool registration
   - Directly improves AI response quality
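The audit step can be mechanical; a rough sketch using the common ~4-characters-per-token heuristic (an approximation, not a real tokenizer, but enough to flag outliers against the 200-token target):

```typescript
// Token-budget audit sketch: estimate each tool definition's size and flag the
// ones over budget. The 4-chars-per-token ratio is a rough heuristic.
interface ToolDef {
  name: string;
  description: string;
}

function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function oversizedTools(tools: ToolDef[], maxTokens = 200): string[] {
  return tools
    .filter((t) => estimateTokens(t.name + t.description) > maxTokens)
    .map((t) => t.name);
}
```

Running this across a server's tool list before Phase 4 gives a concrete worklist of descriptions to trim.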
3. Add health check tool to every server (Est: 1 hour per server, templateable)
   - A `health_check` tool that validates: env vars set, API reachable, auth valid
   - Enables automated monitoring and QA Layer 4 validation
   - Investment pays back across all 30 servers
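The body of such a tool could be templated once; a sketch in which the endpoint and env-var name are placeholders each server fills in:

```typescript
// health_check body sketch: three escalating checks — env var present, API
// reachable, auth accepted. baseUrl and apiKeyEnv are per-server placeholders;
// the report shape is an assumption.
interface HealthReport {
  envOk: boolean;
  apiReachable: boolean;
  authValid: boolean;
}

async function healthCheck(baseUrl: string, apiKeyEnv: string): Promise<HealthReport> {
  const key = process.env[apiKeyEnv];
  const report: HealthReport = { envOk: Boolean(key), apiReachable: false, authValid: false };
  if (!report.envOk) return report; // no credential, nothing else to test
  try {
    const res = await fetch(baseUrl, { headers: { Authorization: `Bearer ${key}` } });
    report.apiReachable = true;
    report.authValid = res.status !== 401 && res.status !== 403;
  } catch {
    // network failure: apiReachable stays false
  }
  return report;
}
```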
4. Centralize secret management (Est: 3-4 hours)
   - Master `.env.testing` with all API credentials
   - Script to distribute credentials to individual servers
   - Documentation of which servers have/need credentials
   - Prerequisite for any automated testing
P2 — Do After Initial 30-Server Push
1. Implement MCP gateway layer (Est: 1-2 weeks)
   - Lightweight routing proxy in LocalBosses
   - Centralized auth, logging, health monitoring
   - Tool registry that clients query instead of connecting to each server
   - Architectural improvement that makes everything else easier
2. Add pipeline resumability (Est: 1 day)
   - Checkpoint files for each phase (`{service}-phase-{n}-complete.json`)
   - Each phase checks for existing outputs before re-running
   - Progress tracking for multi-app builds
   - Prevents wasted compute when agents fail mid-pipeline
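The checkpoint mechanic is small; a sketch following the `{service}-phase-{n}-complete.json` naming, with the payload fields as assumptions:

```typescript
// Checkpoint sketch: agents write a marker file when a phase finishes and skip
// phases whose marker already exists. Payload fields (completedAt, outputs)
// are illustrative assumptions.
import fs from "node:fs";
import path from "node:path";

function checkpointPath(dir: string, service: string, phase: number): string {
  return path.join(dir, `${service}-phase-${phase}-complete.json`);
}

function markPhaseComplete(dir: string, service: string, phase: number, outputs: string[]): void {
  fs.writeFileSync(
    checkpointPath(dir, service, phase),
    JSON.stringify({ completedAt: new Date().toISOString(), outputs }, null, 2),
  );
}

function isPhaseComplete(dir: string, service: string, phase: number): boolean {
  return fs.existsSync(checkpointPath(dir, service, phase));
}
```

An agent resuming after a crash checks `isPhaseComplete` before each phase and jumps straight to the first incomplete one — the Phase 3 "12 of 20 apps built" scenario stops costing a full re-run.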
3. Explore OpenAPI-to-MCP automation (Est: 2-3 days research + prototyping)
   - Test `openapi-mcp-generator` against 3-5 APIs that have specs
   - Compare auto-generated output against hand-built servers
   - Could dramatically accelerate the pipeline for spec-having APIs
   - Potential 10x speedup for Phase 1+2 combined
4. Add non-REST API support to analyzer (Est: 1 day)
   - GraphQL adaptation guide (queries → read tools, mutations → write tools)
   - SOAP/XML handling notes
   - Flag in analysis doc for API style
   - Unblocks enterprise APIs that don't fit the REST assumption
P3 — Ongoing / Future
- Containerize servers for production deployment
- Implement API change detection (daily smoke tests)
- Build shared monorepo for dependency management
- Add accessibility standards to app designer
- Implement golden-screenshot regression testing
- Explore Streamable HTTP transport for network-deployed servers
Appendix: Quick Wins (< 1 hour each)
| # | Fix | Skill | Impact |
|---|---|---|---|
| 1 | Add `AbortController` timeout to `client.ts` template | Server Builder | Prevents infinite hangs |
| 2 | Add `document.hidden` check to polling in app template | App Designer | Reduces unnecessary requests |
| 3 | Add exponential backoff to app polling (3s → 5s → 10s → 30s) | App Designer | Reduces server load |
| 4 | Add max poll count (20 attempts, then error state) | App Designer | Prevents zombie polling |
| 5 | Add "API Style" field to analysis template (REST/GraphQL/SOAP/gRPC) | API Analyzer | Flags non-REST early |
| 6 | Add pagination pattern catalog to analysis template | API Analyzer | Catches diverse patterns |
| 7 | Add `--noEmit` typecheck to QA script | QA Tester | Separates compile from build |
| 8 | Document template variable replacement rules | Server Builder | Reduces agent confusion |
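Quick wins 2-4 compose into one small polling helper; a sketch in which the visibility check is injected as a function so the logic also runs (and is testable) outside a browser:

```typescript
// Polling sketch combining quick wins 2-4: exponential backoff (3s → 5s → 10s
// → 30s), a hard cap of 20 attempts, and a document.hidden check. Defaults
// mirror the values suggested in the table above.
async function pollWithBackoff(
  poll: () => Promise<boolean>, // resolves true once the work is done
  isHidden: () => boolean = () => Boolean((globalThis as any).document?.hidden),
  maxAttempts = 20,
  delaysMs: number[] = [3000, 5000, 10000, 30000],
): Promise<void> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    // Skip the request entirely while the tab is hidden (quick win 2)
    if (!isHidden() && (await poll())) return;
    // Back off; later attempts reuse the last (largest) delay (quick win 3)
    const delay = delaysMs[Math.min(attempt, delaysMs.length - 1)];
    await new Promise((resolve) => setTimeout(resolve, delay));
  }
  // Surface an error state instead of polling forever (quick win 4)
  throw new Error(`polling gave up after ${maxAttempts} attempts`);
}
```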
Review complete. The MCP Factory pipeline is a solid foundation — it's one of the more organized approaches to systematic MCP server production I've seen. The gaps are mostly in operational maturity (resilience, monitoring, automation) rather than fundamental design. The priority should be hardening the templates for production reliability before scaling to 30+ servers, because every template improvement multiplies across the entire fleet.