Skills added:
- mcp-api-analyzer (43KB) — Phase 1: API analysis
- mcp-server-builder (88KB) — Phase 2: Server build
- mcp-server-development (31KB) — TS MCP patterns
- mcp-app-designer (85KB) — Phase 3: Visual apps
- mcp-apps-integration (20KB) — structuredContent UI
- mcp-apps-official (48KB) — MCP Apps SDK
- mcp-apps-merged (39KB) — Combined apps reference
- mcp-localbosses-integrator (61KB) — Phase 4: LocalBosses wiring
- mcp-qa-tester (113KB) — Phase 5: Full QA framework
- mcp-deployment (17KB) — Phase 6: Production deploy
- mcp-skill (exa integration)

These skills are the encoded knowledge that lets agents build production-quality MCP servers autonomously through the pipeline.
MCP QA Tester — Automated Testing Framework & Quality Metrics Pipeline
When to use this skill: Testing MCP servers, apps, and their LocalBosses integration. Use after Phase 4 (integration) to verify everything works — at the protocol level, visually, functionally, and against live APIs. This is an automated-first framework with quantitative metrics, regression baselines, and persistent reporting.
What this covers: MCP protocol compliance, automated unit/visual/functional testing, accessibility auditing, performance benchmarking, security validation, chaos testing, and quantitative quality metrics with regression tracking.
Testing Architecture
Layer 0: Protocol Compliance ─── MCP Inspector + JSON-RPC lifecycle validation
Layer 1: Static Analysis ──────── TypeScript build, linting, file structure, schema validation
Layer 2: Visual Testing ────────── Playwright screenshots, BackstopJS regression, Gemini analysis
Layer 2.5: Accessibility ────────── axe-core, keyboard nav, contrast audit, screen reader compat
Layer 3: Functional Testing ───── Tool routing smoke tests, data flow validation, thread lifecycle
Layer 3.5: Performance ────────── Cold start, latency, memory, file size budgets
Layer 4: Live API Testing ──────── Real API calls with credential management strategy
Layer 4.5: Security ────────────── XSS, CSP, postMessage origin, key exposure
Layer 5: Integration Testing ──── Full E2E scenarios, chaos testing, cross-browser validation
Every layer has quantitative pass/fail criteria. Do NOT skip layers — issues compound.
Quantitative Quality Metrics (REQUIRED)
Every QA report MUST include these metrics. No more pass/fail checklists — we measure.
| Metric | Target | Method | Priority |
|---|---|---|---|
| MCP Protocol Compliance | 100% | MCP Inspector — all checks pass | P0 |
| Tool Correctness Rate | >95% | Run 20 NL messages, count correct tool selections | P0 |
| Task Completion Rate | >90% | Run 10 E2E scenarios, count fully completed | P0 |
| APP_DATA Schema Match | 100% | Validate every APP_DATA against JSON schema | P0 |
| Response Latency P50 | <3s | Measure 10 read interactions | P1 |
| Response Latency P95 | <8s | Measure 10 interactions (reads + writes) | P1 |
| App Render Success | 100% | All apps render data state without console errors | P0 |
| Accessibility Score | >90 | axe-core audit on every app HTML | P1 |
| Cold Start Time | <2s | time node dist/index.js → first ListTools response | P1 |
| App File Size | <50KB each | Check all HTML files | P1 |
| Security Scan | 0 critical | XSS + CSP + key exposure checks | P0 |
How to calculate:
Tool Correctness Rate = (correct_tool_selections / total_test_messages) × 100
Task Completion Rate = (completed_scenarios / total_scenarios) × 100
APP_DATA Schema Match = (valid_app_data_blocks / total_app_data_blocks) × 100
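These rates, plus the latency percentiles from the table, reduce to a few lines of code. A minimal sketch — the helper names are illustrative, not part of any framework:

```typescript
// Metric helpers for the QA report. `rate` covers Tool Correctness,
// Task Completion, and APP_DATA Schema Match; `percentile` covers
// Response Latency P50/P95 (nearest-rank method).

function rate(hits: number, total: number): number {
  // (hits / total) × 100; an empty sample reports 100 rather than NaN
  return total === 0 ? 100 : (hits / total) * 100;
}

function percentile(samplesMs: number[], p: number): number {
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// e.g. 19 of 20 NL messages routed to the correct tool → 95,
// which does NOT pass the strictly-greater ">95%" gate
const toolCorrectness = rate(19, 20);
```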
Layer 0: MCP Protocol Compliance Testing
Why this layer exists: The MCP spec defines exact JSON-RPC lifecycle, tool definition formats, and error codes. If the server isn't protocol-compliant, nothing else matters. This is the foundation.
0.1 — MCP Inspector (Official Tool)
# Install and run MCP Inspector against the server
npx @modelcontextprotocol/inspector node dist/index.js
# The Inspector validates:
# ✅ initialize → initialized lifecycle
# ✅ tools/list response format
# ✅ tools/call request/response format
# ✅ JSON-RPC message framing
# ✅ Capability negotiation
# ✅ Notification handling
0.2 — Automated Protocol Test Script
Save as tests/protocol-compliance.test.ts:
import { spawn, ChildProcess } from 'child_process';
import * as readline from 'readline';
// Minimal JSON-RPC client for testing MCP servers over stdio
class MCPTestClient {
private proc: ChildProcess;
private rl: readline.Interface;
private pending: Map<number, { resolve: Function; reject: Function }> = new Map();
private nextId = 1;
private notifications: any[] = [];
constructor(command: string, args: string[]) {
this.proc = spawn(command, args, { stdio: ['pipe', 'pipe', 'pipe'] });
this.rl = readline.createInterface({ input: this.proc.stdout! });
this.rl.on('line', (line) => {
try {
const msg = JSON.parse(line);
if (msg.id && this.pending.has(msg.id)) {
this.pending.get(msg.id)!.resolve(msg);
this.pending.delete(msg.id);
} else if (!msg.id) {
this.notifications.push(msg);
}
} catch (e) { /* ignore non-JSON lines */ }
});
}
async request(method: string, params?: any): Promise<any> {
const id = this.nextId++;
const msg = JSON.stringify({ jsonrpc: '2.0', id, method, params: params || {} });
this.proc.stdin!.write(msg + '\n');
return new Promise((resolve, reject) => {
this.pending.set(id, { resolve, reject });
setTimeout(() => {
if (this.pending.has(id)) {
this.pending.delete(id);
reject(new Error(`Timeout on ${method}`));
}
}, 10000);
});
}
// True JSON-RPC notification: no id field, so no response is expected
notify(method: string, params?: any) {
this.proc.stdin!.write(JSON.stringify({ jsonrpc: '2.0', method, params: params || {} }) + '\n');
}
getNotifications() { return this.notifications; }
async close() {
this.proc.kill();
}
}
describe('MCP Protocol Compliance', () => {
let client: MCPTestClient;
beforeAll(async () => {
client = new MCPTestClient('node', ['dist/index.js']);
});
afterAll(async () => {
await client.close();
});
test('initialize → initialized lifecycle', async () => {
const initResult = await client.request('initialize', {
protocolVersion: '2025-11-25',
capabilities: {},
clientInfo: { name: 'qa-test-client', version: '1.0.0' }
});
expect(initResult.result).toBeDefined();
expect(initResult.result.protocolVersion).toBeDefined();
expect(initResult.result.capabilities).toBeDefined();
expect(initResult.result.serverInfo).toBeDefined();
expect(initResult.result.serverInfo.name).toBeTruthy();
expect(initResult.result.serverInfo.version).toBeTruthy();
// Send initialized as a real notification (no id field)
client.notify('notifications/initialized');
});
test('tools/list returns valid tool definitions', async () => {
const result = await client.request('tools/list', {});
expect(result.result).toBeDefined();
expect(result.result.tools).toBeInstanceOf(Array);
expect(result.result.tools.length).toBeGreaterThan(0);
for (const tool of result.result.tools) {
// Required fields per MCP 2025-11-25
expect(tool.name).toBeTruthy();
expect(tool.description).toBeTruthy();
expect(typeof tool.name).toBe('string');
expect(typeof tool.description).toBe('string');
// Name format: must be alphanumeric + underscores/hyphens/dots
expect(tool.name).toMatch(/^[a-zA-Z0-9_.\-]+$/);
// inputSchema must be valid JSON Schema object
if (tool.inputSchema) {
expect(tool.inputSchema.type).toBe('object');
}
// If title exists, must be string
if (tool.title) {
expect(typeof tool.title).toBe('string');
}
// If outputSchema exists, validate it
if (tool.outputSchema) {
expect(tool.outputSchema.type).toBeDefined();
}
// If annotations exist, validate known fields
if (tool.annotations) {
const validAnnotations = [
'readOnlyHint', 'destructiveHint', 'idempotentHint', 'openWorldHint'
];
for (const key of Object.keys(tool.annotations)) {
if (validAnnotations.includes(key)) {
expect(typeof tool.annotations[key]).toBe('boolean');
}
}
}
}
});
test('tools/call returns valid response for read-only tools', async () => {
// Get list of tools first
const listResult = await client.request('tools/list', {});
const readOnlyTools = listResult.result.tools.filter(
(t: any) => t.annotations?.readOnlyHint === true
);
// Test first read-only tool (safest to call)
if (readOnlyTools.length > 0) {
const tool = readOnlyTools[0];
const callResult = await client.request('tools/call', {
name: tool.name,
arguments: {}
});
expect(callResult.result).toBeDefined();
// Result must have content array
if (!callResult.result.isError) {
expect(callResult.result.content).toBeInstanceOf(Array);
for (const item of callResult.result.content) {
expect(item.type).toBeDefined();
// Text content must have text field
if (item.type === 'text') {
expect(typeof item.text).toBe('string');
}
}
}
// If structuredContent exists, validate against outputSchema
if (callResult.result.structuredContent && tool.outputSchema) {
// Basic type check — full JSON Schema validation is in the schema validator section
expect(typeof callResult.result.structuredContent).toBe('object');
}
}
});
test('error responses use correct JSON-RPC error codes', async () => {
// Call non-existent tool — should get method not found or tool error
const result = await client.request('tools/call', {
name: 'nonexistent_tool_that_should_not_exist_12345',
arguments: {}
});
// Should be an error response
expect(
result.error || result.result?.isError
).toBeTruthy();
// If protocol error, must use standard JSON-RPC codes
if (result.error) {
expect(result.error.code).toBeDefined();
expect(typeof result.error.code).toBe('number');
expect(result.error.message).toBeTruthy();
// Standard codes: -32700 (parse), -32600 (invalid request),
// -32601 (method not found), -32602 (invalid params), -32603 (internal)
}
});
test('notification handling works', async () => {
// Server should handle ping
try {
await client.request('ping', {});
// If no error, ping is supported
} catch (e) {
// Ping timeout is acceptable for some servers
}
});
});
0.3 — structuredContent Validation
// tests/structured-content.test.ts
import Ajv from 'ajv';
const ajv = new Ajv({ allErrors: true });
function validateStructuredContent(
toolName: string,
outputSchema: object,
structuredContent: any
): { valid: boolean; errors: string[] } {
const validate = ajv.compile(outputSchema);
const valid = validate(structuredContent);
return {
valid: !!valid,
errors: valid ? [] : (validate.errors || []).map(e =>
`${e.instancePath} ${e.message}`
)
};
}
// Run this after getting tools/list + tools/call results
describe('structuredContent schema validation', () => {
test('every tool with outputSchema returns conforming structuredContent', async () => {
// This would be populated from actual tool calls
const toolResults: Array<{
toolName: string;
outputSchema: object;
structuredContent: any;
}> = []; // Populate from Layer 4 results
for (const { toolName, outputSchema, structuredContent } of toolResults) {
if (structuredContent && outputSchema) {
const result = validateStructuredContent(toolName, outputSchema, structuredContent);
expect(result.valid).toBe(true);
if (!result.valid) {
console.error(`Schema mismatch for ${toolName}:`, result.errors);
}
}
}
});
});
0.4 — Tasks & Elicitation Testing (2025-11-25 Spec)
If the server declares tasks capability (async operations via SEP-1686), test the task lifecycle:
test('tasks/list returns valid task list', async () => {
const result = await client.request('tasks/list', {});
if (result.result) {
expect(result.result.tasks).toBeInstanceOf(Array);
}
// Some servers may not implement tasks — that's OK, just verify no crash
});
test('long-running tool call returns task reference when task-enabled', async () => {
// If a tool has execution.taskSupport = "required" or "optional",
// calling it with _meta.taskId should return a task reference
// rather than blocking until completion
const listResult = await client.request('tools/list', {});
const taskTools = listResult.result.tools.filter(
(t: any) => t.execution?.taskSupport === 'required' || t.execution?.taskSupport === 'optional'
);
// Log task-capable tools for the report
console.log(`Task-capable tools: ${taskTools.map((t: any) => t.name).join(', ') || 'none'}`);
});
If the server uses elicitation (elicitation/create), test that:
- Elicitation requests include a valid requestedSchema with JSON Schema
- The server handles user-provided elicitation responses gracefully
- URL mode elicitation (2025-11-25) correctly redirects to external URLs
- The server doesn't hang if elicitation is denied by the client
test('server handles elicitation denial gracefully', async () => {
// If server requests elicitation and client denies, server should
// return a useful error message, not crash or hang
// This is tested implicitly by calling tools without providing
// elicitation responses — the server should timeout or fallback
});
Quality Gate:
- MCP Inspector passes all checks
- initialize → initialized lifecycle works
- tools/list returns valid, non-empty tool array
- All tool names match /^[a-zA-Z0-9_.\-]+$/
- All tool descriptions are non-empty strings
- tools/call returns valid content arrays
- structuredContent (if present) matches outputSchema
- Error responses use correct JSON-RPC codes
- Server handles unknown methods gracefully (doesn't crash)
Layer 1: Static Analysis
1.1 — TypeScript Compilation
cd {service}-mcp
npm run build 2>&1
# Must exit 0 with no errors
# Warnings are OK but should be reviewed
# Separate type-check (catches issues build might miss)
npx tsc --noEmit 2>&1
1.2 — Code Quality Checks
# Check for `any` types (red flag)
grep -rn ": any" src/ --include="*.ts" | grep -v "node_modules" | grep -v "// eslint" | grep -v "catch"
# Goal: zero instances in tool handlers
# Exception: catch(error: any) is acceptable
# Check for console.log (should use structured logging)
grep -rn "console.log" src/ --include="*.ts" | grep -v "node_modules"
# Goal: zero — use console.error for MCP server logging
# Check SDK version is pinned appropriately
node -e "const p = require('./package.json'); console.log('SDK:', p.dependencies['@modelcontextprotocol/sdk'])"
# Should be ^1.26.0 or higher (security fix: GHSA-345p-7cg4-v4c7)
# Check Zod version
node -e "const p = require('./package.json'); console.log('Zod:', p.dependencies['zod'])"
# Should be ^3.25.0 or higher
1.3 — HTML App Validation
# Check all app HTML files exist and are within size budget
for f in app-ui/*.html ui/dist/*.html; do
if [ -f "$f" ]; then
SIZE=$(wc -c < "$f" | tr -d ' ')
if [ "$SIZE" -gt 51200 ]; then
echo "⚠️ $f ($SIZE bytes) — EXCEEDS 50KB budget"
else
echo "✅ $f ($SIZE bytes)"
fi
else
echo "❌ $f MISSING"
fi
done
1.4 — Route Mapping Cross-Reference
# Verify every app ID in channels.ts has a matching entry in ALL integration files
node -e "
const fs = require('fs');
const path = require('path');
const LB_ROOT = 'localbosses-app/src';
const files = {
channels: fs.readFileSync(path.join(LB_ROOT, 'lib/channels.ts'), 'utf8'),
appNames: fs.readFileSync(path.join(LB_ROOT, 'lib/appNames.ts'), 'utf8'),
intakes: fs.readFileSync(path.join(LB_ROOT, 'lib/app-intakes.ts'), 'utf8'),
route: fs.readFileSync(path.join(LB_ROOT, 'app/api/mcp-apps/route.ts'), 'utf8'),
};
// Extract app IDs from channels (anything in mcpApps arrays)
const channelApps = [...files.channels.matchAll(/['\"]([a-z0-9-]+)['\"]/g)]
.map(m => m[1])
.filter(id => id.length > 3 && !['true','false','null'].includes(id));
let issues = 0;
const unique = [...new Set(channelApps)];
for (const id of unique) {
const inNames = files.appNames.includes(id);
const inIntakes = files.intakes.includes(id);
const inRoute = files.route.includes(id);
if (!inNames || !inIntakes || !inRoute) {
console.log('❌ ' + id + ': ' +
(!inNames ? 'MISSING appNames ' : '') +
(!inIntakes ? 'MISSING app-intakes ' : '') +
(!inRoute ? 'MISSING route ' : ''));
issues++;
}
}
if (issues === 0) console.log('✅ All ' + unique.length + ' app IDs cross-referenced');
else console.log('\\n⚠️ ' + issues + ' cross-reference issues found');
"
Quality Gate:
- TypeScript compiles with zero errors
- tsc --noEmit passes clean
- No unintended any types in tool handlers
- SDK pinned to ^1.26.0+, Zod to ^3.25.0+ (Do NOT use Zod v4.x with SDK v1.x — known incompatibility, issue #1429)
- All HTML app files exist, are >1KB and <50KB
- All app IDs cross-referenced across channels, appNames, app-intakes, and route map
- All route mappings resolve to actual HTML files
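The last gate item (route mappings resolve to real files) can be scripted the same way as the cross-reference check in 1.4. A sketch, assuming the route map quotes app HTML files by filename — the helper names and candidate directory list are assumptions to adapt to your layout:

```typescript
import * as fs from 'fs';
import * as path from 'path';

// Pull every quoted "*.html" filename out of the route-map source
function extractHtmlRefs(routeSrc: string): string[] {
  return [...new Set(
    [...routeSrc.matchAll(/['"]([\w.-]+\.html)['"]/g)].map(m => m[1])
  )];
}

// Return the refs that exist in none of the candidate directories
function findUnresolved(
  refs: string[],
  dirs: string[],
  exists: (p: string) => boolean = fs.existsSync
): string[] {
  return refs.filter(ref => !dirs.some(d => exists(path.join(d, ref))));
}

// Usage against the real tree:
// const src = fs.readFileSync('localbosses-app/src/app/api/mcp-apps/route.ts', 'utf8');
// const missing = findUnresolved(extractHtmlRefs(src), ['app-ui', 'ui/dist']);
// missing.forEach(f => console.log('❌ unresolved route target: ' + f));
```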
Layer 2: Visual Testing
2.1 — Automated Playwright Visual Tests
Save as tests/visual.test.ts:
import { test, expect, Page } from '@playwright/test';
import * as fs from 'fs';
import * as path from 'path';
// Configuration
const APP_UI_DIR = path.resolve(__dirname, '../app-ui');
const SCREENSHOTS_DIR = path.resolve(__dirname, '../test-results/screenshots');
const BASELINES_DIR = path.resolve(__dirname, '../test-baselines/screenshots');
const FIXTURES_DIR = path.resolve(__dirname, '../test-fixtures');
// Ensure directories exist
fs.mkdirSync(SCREENSHOTS_DIR, { recursive: true });
// Discover all HTML app files
const appFiles = fs.readdirSync(APP_UI_DIR)
.filter(f => f.endsWith('.html'))
.map(f => path.join(APP_UI_DIR, f));
// Load fixture for app type (or use default)
function loadFixture(appFile: string): any {
const baseName = path.basename(appFile, '.html');
const fixturePath = path.join(FIXTURES_DIR, `${baseName}.json`);
if (fs.existsSync(fixturePath)) {
return JSON.parse(fs.readFileSync(fixturePath, 'utf8'));
}
// Default fixture
return {
title: 'Test Data',
data: [
{ name: 'Test Item 1', status: 'active', value: 100 },
{ name: 'Test Item 2', status: 'inactive', value: 200 },
{ name: 'Test Item 3', status: 'pending', value: 300 },
],
meta: { total: 3, page: 1, pageSize: 25 }
};
}
for (const appFile of appFiles) {
const appName = path.basename(appFile, '.html');
test.describe(`Visual: ${appName}`, () => {
let page: Page;
test.beforeEach(async ({ browser }) => {
page = await browser.newPage({ viewport: { width: 400, height: 600 } });
await page.goto(`file://${appFile}`);
// Collect console errors
page.on('console', msg => {
if (msg.type() === 'error') {
console.error(`[${appName}] Console error:`, msg.text());
}
});
});
test.afterEach(async () => {
await page.close();
});
test('renders loading state initially', async () => {
// Before any data, loading state should show
const loading = page.locator('#loading');
const content = page.locator('#content');
// At least one should be visible
const loadingVis = await loading.isVisible().catch(() => false);
const contentVis = await content.isVisible().catch(() => false);
expect(loadingVis || contentVis).toBe(true);
await page.screenshot({
path: path.join(SCREENSHOTS_DIR, `${appName}-loading.png`)
});
});
test('renders empty state', async () => {
// Inject empty data
await page.evaluate(() => {
window.postMessage({ type: 'mcp_app_data', data: {} }, '*');
});
await page.waitForTimeout(500);
// Should show empty state, not crash
const hasError = await page.evaluate(() => {
return document.body.innerText.includes('Error') ||
document.body.innerText.includes('undefined');
});
await page.screenshot({
path: path.join(SCREENSHOTS_DIR, `${appName}-empty.png`)
});
// No JS crashes
expect(hasError).toBe(false);
});
test('renders data state without console errors', async () => {
const fixture = loadFixture(appFile);
const consoleErrors: string[] = [];
page.on('console', msg => {
if (msg.type() === 'error') consoleErrors.push(msg.text());
});
// Inject fixture data
await page.evaluate((data) => {
window.postMessage({ type: 'mcp_app_data', data }, '*');
}, fixture);
await page.waitForTimeout(1000);
// Content should be visible (loading hidden)
const loading = page.locator('#loading');
const loadingHidden = !(await loading.isVisible().catch(() => true));
await page.screenshot({
path: path.join(SCREENSHOTS_DIR, `${appName}-data.png`)
});
expect(loadingHidden).toBe(true);
expect(consoleErrors).toHaveLength(0);
});
test('no horizontal overflow at 320px', async () => {
await page.setViewportSize({ width: 320, height: 600 });
const fixture = loadFixture(appFile);
await page.evaluate((data) => {
window.postMessage({ type: 'mcp_app_data', data }, '*');
}, fixture);
await page.waitForTimeout(500);
const hasOverflow = await page.evaluate(() => {
return document.documentElement.scrollWidth > document.documentElement.clientWidth;
});
await page.screenshot({
path: path.join(SCREENSHOTS_DIR, `${appName}-narrow.png`)
});
expect(hasOverflow).toBe(false);
});
test('dark theme compliance', async () => {
const fixture = loadFixture(appFile);
await page.evaluate((data) => {
window.postMessage({ type: 'mcp_app_data', data }, '*');
}, fixture);
await page.waitForTimeout(500);
// Check background color is dark
const bgColor = await page.evaluate(() => {
return getComputedStyle(document.body).backgroundColor;
});
// Should be dark (r,g,b each < 60)
const match = bgColor.match(/\d+/g);
if (match) {
const [r, g, b] = match.map(Number);
expect(r).toBeLessThan(60);
expect(g).toBeLessThan(60);
expect(b).toBeLessThan(60);
}
});
});
}
2.2 — BackstopJS Visual Regression
# Initialize BackstopJS (one-time setup)
npm install -g backstopjs
backstop init
# Configure backstop.json:
{
"id": "mcp-apps",
"viewports": [
{ "label": "thread-panel", "width": 400, "height": 600 },
{ "label": "narrow", "width": 320, "height": 600 },
{ "label": "wide", "width": 800, "height": 600 }
],
"scenarios": [
{
"label": "contact-grid-data",
"url": "file:///path/to/app-ui/contact-grid.html",
"onReadyScript": "inject-data.js",
"delay": 1000,
"misMatchThreshold": 5.0,
"requireSameDimensions": true
}
],
"paths": {
"bitmaps_reference": "test-baselines/backstop",
"bitmaps_test": "test-results/backstop",
"engine_scripts": "tests/backstop-scripts"
},
"engine": "playwright",
"engineOptions": {
"args": ["--no-sandbox"]
}
}
// tests/backstop-scripts/inject-data.js
module.exports = async (page, scenario, viewport, isReference, browserContext) => {
const fixtures = require('../test-fixtures/' + scenario.label.split('-')[0] + '.json');
await page.evaluate((data) => {
window.postMessage({ type: 'mcp_app_data', data }, '*');
}, fixtures);
await page.waitForTimeout(500);
};
# Capture baselines (run once when apps are verified correct)
backstop reference
# Test against baselines (run on every QA cycle)
backstop test
# Result: PASS if <5% pixel diff, FAIL otherwise
# Visual diff report opens in browser automatically
2.3 — Gemini Multimodal Analysis (Subjective Quality)
# After Playwright captures screenshots, run Gemini for subjective quality:
gemini "Analyze this MCP app screenshot. Check and rate PASS/WARN/FAIL:
1. RENDERING: Does it show real content (not blank/placeholder)?
2. DARK THEME: Background ~#1a1d23, accent ~#ff6d5a, text ~#dcddde
3. LAYOUT: Content properly aligned, no overlapping elements?
4. TYPOGRAPHY: Text readable, proper sizing, no clipping?
5. DATA QUALITY: Does the rendered data look realistic?
6. RESPONSIVENESS: Would this work at 280px (thread panel)?
7. BUGS: Any visual artifacts, broken images, misaligned elements?" -f screenshot.png
Quality Gate:
- All apps render loading → empty → data states without crashes
- Zero console errors in data state
- No horizontal overflow at 320px width
- Dark theme compliance (background RGB each <60)
- BackstopJS regression: <5% pixel diff from baselines
- Gemini subjective review: no FAIL ratings
Layer 2.5: Accessibility Testing
2.5.1 — axe-core Automated Audit
Integrate directly into Playwright tests:
// tests/accessibility.test.ts
import { test, expect, Page } from '@playwright/test';
import AxeBuilder from '@axe-core/playwright';
import * as fs from 'fs';
import * as path from 'path';
const APP_UI_DIR = path.resolve(__dirname, '../app-ui');
const FIXTURES_DIR = path.resolve(__dirname, '../test-fixtures');
const appFiles = fs.readdirSync(APP_UI_DIR)
.filter(f => f.endsWith('.html'));
for (const appFile of appFiles) {
const appName = path.basename(appFile, '.html');
test.describe(`Accessibility: ${appName}`, () => {
test('passes axe-core audit with data loaded', async ({ page }) => {
await page.goto(`file://${path.join(APP_UI_DIR, appFile)}`);
// Load fixture data
const fixturePath = path.join(FIXTURES_DIR, `${appName}.json`);
const fixture = fs.existsSync(fixturePath)
? JSON.parse(fs.readFileSync(fixturePath, 'utf8'))
: { title: 'Test', data: [{ name: 'Test', status: 'active' }] };
await page.evaluate((data) => {
window.postMessage({ type: 'mcp_app_data', data }, '*');
}, fixture);
await page.waitForTimeout(1000);
// Run axe-core
const results = await new AxeBuilder({ page })
.withTags(['wcag2a', 'wcag2aa', 'wcag21a', 'wcag21aa'])
.analyze();
// Log violations for debugging
if (results.violations.length > 0) {
console.log(`\n[${appName}] Accessibility violations:`);
for (const v of results.violations) {
console.log(` ${v.impact}: ${v.id} — ${v.description}`);
console.log(` Help: ${v.helpUrl}`);
for (const node of v.nodes.slice(0, 3)) {
console.log(` Target: ${node.target.join(' > ')}`);
}
}
}
// Calculate score: (passes / (passes + violations)) * 100
const totalChecks = results.passes.length + results.violations.length;
const score = totalChecks > 0
? Math.round((results.passes.length / totalChecks) * 100)
: 100;
console.log(`[${appName}] Accessibility score: ${score}%`);
// Target: >90% score, zero critical/serious violations
const criticalViolations = results.violations.filter(
v => v.impact === 'critical' || v.impact === 'serious'
);
expect(criticalViolations).toHaveLength(0);
expect(score).toBeGreaterThanOrEqual(90);
});
test('all interactive elements reachable via keyboard', async ({ page }) => {
await page.goto(`file://${path.join(APP_UI_DIR, appFile)}`);
// Inject data first
const fixturePath = path.join(FIXTURES_DIR, `${appName}.json`);
const fixture = fs.existsSync(fixturePath)
? JSON.parse(fs.readFileSync(fixturePath, 'utf8'))
: { title: 'Test', data: [{ name: 'Test' }] };
await page.evaluate((data) => {
window.postMessage({ type: 'mcp_app_data', data }, '*');
}, fixture);
await page.waitForTimeout(500);
// Get all interactive elements
const interactiveElements = await page.evaluate(() => {
const selectors = 'a, button, input, select, textarea, [tabindex], [role="button"], [role="link"], [role="tab"]';
const elements = document.querySelectorAll(selectors);
return Array.from(elements).map(el => ({
tag: el.tagName.toLowerCase(),
text: (el as HTMLElement).innerText?.slice(0, 50) || el.getAttribute('aria-label') || '',
tabIndex: (el as HTMLElement).tabIndex,
visible: (el as HTMLElement).offsetParent !== null,
}));
});
// Filter to visible elements
const visibleInteractive = interactiveElements.filter(el => el.visible);
// Tab through all elements and verify focus reaches each
let focusedCount = 0;
for (let i = 0; i < visibleInteractive.length + 5; i++) {
await page.keyboard.press('Tab');
const focused = await page.evaluate(() => {
const el = document.activeElement;
return el ? el.tagName.toLowerCase() : 'none';
});
if (focused !== 'body' && focused !== 'none') {
focusedCount++;
}
}
// At least 80% of visible interactive elements should be reachable
if (visibleInteractive.length > 0) {
const reachRate = focusedCount / visibleInteractive.length;
expect(reachRate).toBeGreaterThanOrEqual(0.8);
}
});
});
}
2.5.2 — Standalone axe-core Snippet (for Browser DevTools)
// Paste this into browser console on any app iframe:
(async () => {
if (!window.axe) {
const s = document.createElement('script');
s.src = 'https://cdnjs.cloudflare.com/ajax/libs/axe-core/4.10.0/axe.min.js';
document.head.appendChild(s);
await new Promise(r => s.onload = r);
}
const results = await axe.run(document, {
runOnly: ['wcag2a', 'wcag2aa', 'wcag21aa']
});
console.log('=== Accessibility Results ===');
console.log(`Passes: ${results.passes.length}`);
console.log(`Violations: ${results.violations.length}`);
const score = Math.round(
(results.passes.length / (results.passes.length + results.violations.length)) * 100
);
console.log(`Score: ${score}%`);
if (results.violations.length > 0) {
console.table(results.violations.map(v => ({
impact: v.impact,
id: v.id,
description: v.description,
nodes: v.nodes.length
})));
}
return results;
})();
2.5.3 — Color Contrast Audit
// Validate contrast ratios for all text elements
// Paste into browser console on any app iframe:
(function auditContrast() {
function luminance(r, g, b) {
const a = [r, g, b].map(v => {
v /= 255;
return v <= 0.03928 ? v / 12.92 : Math.pow((v + 0.055) / 1.055, 2.4);
});
return a[0] * 0.2126 + a[1] * 0.7152 + a[2] * 0.0722;
}
function contrastRatio(rgb1, rgb2) {
const l1 = luminance(...rgb1) + 0.05;
const l2 = luminance(...rgb2) + 0.05;
return l1 > l2 ? l1 / l2 : l2 / l1;
}
function parseRGB(color) {
const m = color.match(/\d+/g);
return m ? m.slice(0, 3).map(Number) : [0, 0, 0];
}
const textElements = document.querySelectorAll('*');
const issues = [];
textElements.forEach(el => {
const style = getComputedStyle(el);
if (!el.textContent?.trim() || style.display === 'none') return;
const fgRGB = parseRGB(style.color);
const bgRGB = parseRGB(style.backgroundColor);
// Skip if background is transparent (would need to walk up)
if (style.backgroundColor === 'rgba(0, 0, 0, 0)') return;
const ratio = contrastRatio(fgRGB, bgRGB);
const fontSize = parseFloat(style.fontSize);
const isBold = parseInt(style.fontWeight) >= 700;
const isLargeText = fontSize >= 24 || (fontSize >= 18.66 && isBold);
const required = isLargeText ? 3.0 : 4.5;
if (ratio < required) {
issues.push({
text: el.textContent.trim().slice(0, 40),
fg: style.color,
bg: style.backgroundColor,
ratio: ratio.toFixed(1),
required: required,
tag: el.tagName
});
}
});
if (issues.length === 0) {
console.log('✅ All text passes WCAG AA contrast requirements');
} else {
console.log(`❌ ${issues.length} contrast failures:`);
console.table(issues);
}
})();
2.5.4 — Screen Reader Testing (macOS VoiceOver)
### VoiceOver Manual Test Procedure:
1. Open the app in Safari (VoiceOver works best with Safari)
2. Enable VoiceOver: Cmd+F5
3. Navigate with VO+Right Arrow through all elements
4. Verify:
- [ ] App title/heading is announced
- [ ] Data table rows are announced with column headers
- [ ] Status badges announce text (not just color)
- [ ] Loading state announces "Loading" or similar
- [ ] Empty state announces helpful message
- [ ] Interactive elements announce their purpose
- [ ] No "blank" or "group" without context
5. Disable VoiceOver: Cmd+F5
Quality Gate:
- axe-core score >90% on all apps
- Zero critical/serious axe violations
- All text meets WCAG AA contrast (4.5:1 normal, 3:1 large)
- Secondary text uses #b0b2b8 or lighter (not #96989d)
- All interactive elements reachable via Tab
- VoiceOver reads meaningful content (no blank/unlabeled regions)
Layer 3: Functional Testing
3.1 — Jest Unit Tests with MSW (Mock Service Worker)
Test tool handlers without hitting real APIs:
// tests/tools.test.ts
import { http, HttpResponse } from 'msw';
import { setupServer } from 'msw/node';
// Mock API responses
const mockContacts = [
{ id: '1', name: 'John Doe', email: 'john@example.com', phone: '555-0101', status: 'active' },
{ id: '2', name: 'Jane Smith', email: 'jane@example.com', phone: '555-0102', status: 'inactive' },
{ id: '3', name: 'Bob Wilson', email: 'bob@example.com', phone: '555-0103', status: 'active' },
];
const handlers = [
// Mock the external API endpoints your tools call
http.get('https://api.example.com/v1/contacts', ({ request }) => {
const url = new URL(request.url);
const page = Number(url.searchParams.get('page') || 1);
const pageSize = Number(url.searchParams.get('pageSize') || 25);
const status = url.searchParams.get('status');
let filtered = mockContacts;
if (status) filtered = filtered.filter(c => c.status === status);
return HttpResponse.json({
data: filtered.slice((page - 1) * pageSize, page * pageSize),
meta: { total: filtered.length, page, pageSize }
});
}),
http.get('https://api.example.com/v1/contacts/:id', ({ params }) => {
const contact = mockContacts.find(c => c.id === params.id);
if (!contact) {
return new HttpResponse(null, { status: 404 });
}
return HttpResponse.json(contact);
}),
http.post('https://api.example.com/v1/contacts', async ({ request }) => {
const body = await request.json() as any;
return HttpResponse.json({
id: 'new-1',
...body,
created_at: new Date().toISOString()
}, { status: 201 });
}),
// Mock 500 error for chaos testing
http.get('https://api.example.com/v1/error-endpoint', () => {
return new HttpResponse(null, { status: 500 });
}),
];
const server = setupServer(...handlers);
beforeAll(() => server.listen({ onUnhandledRequest: 'warn' }));
afterEach(() => server.resetHandlers());
afterAll(() => server.close());
describe('Tool Handlers', () => {
test('list_contacts returns paginated results', async () => {
// Import your actual tool handler
// const { handleListContacts } = require('../src/tools/contacts');
// const result = await handleListContacts({ page: 1, pageSize: 25 });
// For now, test the API client directly
const response = await fetch('https://api.example.com/v1/contacts?page=1&pageSize=25');
const data = await response.json();
expect(data.data).toBeInstanceOf(Array);
expect(data.data.length).toBeGreaterThan(0);
expect(data.meta.total).toBeDefined();
expect(data.meta.page).toBe(1);
// Validate each contact shape
for (const contact of data.data) {
expect(contact.id).toBeTruthy();
expect(contact.name).toBeTruthy();
expect(contact.email).toBeTruthy();
}
});
test('list_contacts filters by status', async () => {
const response = await fetch('https://api.example.com/v1/contacts?status=active');
const data = await response.json();
for (const contact of data.data) {
expect(contact.status).toBe('active');
}
});
test('get_contact returns single contact', async () => {
const response = await fetch('https://api.example.com/v1/contacts/1');
const data = await response.json();
expect(data.id).toBe('1');
expect(data.name).toBe('John Doe');
});
test('get_contact returns 404 for unknown ID', async () => {
const response = await fetch('https://api.example.com/v1/contacts/unknown-99');
expect(response.status).toBe(404);
});
test('create_contact returns created entity', async () => {
const response = await fetch('https://api.example.com/v1/contacts', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ name: 'New Contact', email: 'new@test.com' })
});
const data = await response.json();
expect(response.status).toBe(201);
expect(data.id).toBeTruthy();
expect(data.name).toBe('New Contact');
});
test('handles API 500 errors gracefully', async () => {
const response = await fetch('https://api.example.com/v1/error-endpoint');
expect(response.status).toBe(500);
// Tool handler should return isError: true, not crash
});
});
MSW Mock Validation: Hand-crafted mocks can drift from real API responses. When credentials are available (Layer 4), validate that MSW mock response shapes match actual API responses. Run a script that calls the real API once and diffs the response keys/types against your mock handlers. Update mocks quarterly or whenever the API ships a new version.
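The drift check can be a small pure function: fetch one live response per endpoint (when credentials exist), then diff its key/type shape against the mock. A sketch, with illustrative sample objects standing in for the live fetch:

```typescript
// mock-drift-check.ts — compare a mock object's shape against a live API response
type Shape = Record<string, string>;

function shapeOf(obj: Record<string, unknown>): Shape {
  const shape: Shape = {};
  for (const [key, val] of Object.entries(obj)) {
    shape[key] = Array.isArray(val) ? 'array' : val === null ? 'null' : typeof val;
  }
  return shape;
}

// Returns human-readable drift entries: keys missing, added, or type-changed
function diffShape(mock: Record<string, unknown>, real: Record<string, unknown>): string[] {
  const drift: string[] = [];
  const mockShape = shapeOf(mock);
  const realShape = shapeOf(real);
  for (const key of Object.keys(mockShape)) {
    if (!(key in realShape)) drift.push(`missing in real: ${key}`);
    else if (mockShape[key] !== realShape[key]) drift.push(`type changed: ${key} (${mockShape[key]} -> ${realShape[key]})`);
  }
  for (const key of Object.keys(realShape)) {
    if (!(key in mockShape)) drift.push(`new in real: ${key}`);
  }
  return drift;
}

// Illustrative: the real API changed phone to a number and added a field
const mockContact = { id: '1', name: 'John Doe', phone: '555-0101' };
const realContact = { id: '1', name: 'John Doe', phone: 5550101, tags: [] };
console.log(diffShape(mockContact, realContact));
```

Wire the real side to a single authenticated fetch per endpoint; the diff function itself stays network-free and unit-testable.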
3.2 — Tool Routing Smoke Test
Automated script that sends NL messages and checks tool selection:
// tests/tool-routing.test.ts
import * as fs from 'fs';
import * as path from 'path';
interface RoutingFixture {
message: string;
expectedTool: string;
category: string;
}
// Load routing fixtures (maintain this file!)
const ROUTING_FIXTURES_PATH = path.resolve(__dirname, '../test-fixtures/tool-routing.json');
const routingFixtures: RoutingFixture[] = JSON.parse(
fs.readFileSync(ROUTING_FIXTURES_PATH, 'utf8')
);
describe('Tool Routing', () => {
// This test requires the AI/LLM in the loop — typically run via LocalBosses API
// or by mocking the tool selection logic
test('routing fixtures file is valid', () => {
expect(routingFixtures.length).toBeGreaterThanOrEqual(20);
for (const fixture of routingFixtures) {
expect(fixture.message).toBeTruthy();
expect(fixture.expectedTool).toBeTruthy();
expect(fixture.category).toBeTruthy();
}
});
test('all expected tools exist in server', async () => {
// Parse the server's tool definitions to get available tool names
const toolNames = new Set<string>();
// Read from compiled server or source
// This validates that routing fixtures reference real tools
const srcDir = path.resolve(__dirname, '../src/tools');
if (fs.existsSync(srcDir)) {
const toolFiles = fs.readdirSync(srcDir).filter(f => f.endsWith('.ts'));
for (const file of toolFiles) {
const content = fs.readFileSync(path.join(srcDir, file), 'utf8');
const nameMatches = content.matchAll(/name:\s*['"]([^'"]+)['"]/g);
for (const match of nameMatches) {
toolNames.add(match[1]);
}
}
}
if (toolNames.size > 0) {
for (const fixture of routingFixtures) {
expect(toolNames.has(fixture.expectedTool)).toBe(true);
}
}
});
});
// Tool routing fixtures template — save as test-fixtures/tool-routing.json:
/*
[
{ "message": "Show me all contacts", "expectedTool": "list_contacts", "category": "list" },
{ "message": "Find John Smith", "expectedTool": "search_contacts", "category": "search" },
{ "message": "What's John's email?", "expectedTool": "get_contact", "category": "get" },
{ "message": "Add a new contact", "expectedTool": "create_contact", "category": "create" },
{ "message": "Update John's phone number", "expectedTool": "update_contact", "category": "update" },
{ "message": "Remove the test contact", "expectedTool": "delete_contact", "category": "delete" },
{ "message": "Show me a summary of this month", "expectedTool": "get_dashboard", "category": "analytics" },
... (at least 20 fixtures per server)
]
*/
3.2b — DeepEval LLM-in-the-Loop Tool Routing Evaluation
Static routing fixtures validate that tool names exist, but they don't test whether the LLM actually selects the right tool. Use DeepEval for real LLM tool routing evaluation with ToolCorrectnessMetric and TaskCompletionMetric.
Setup:
pip install deepeval
deepeval login # Optional: for dashboard tracking
Test file — save as tests/tool_routing_eval.py:
# tests/tool_routing_eval.py
# Requires: pip install deepeval anthropic
# Run: deepeval test run tests/tool_routing_eval.py
import json
import os
from deepeval import evaluate
from deepeval.metrics import ToolCorrectnessMetric, TaskCompletionMetric
from deepeval.test_case import LLMTestCase, ToolCall
from anthropic import Anthropic
client = Anthropic()
def load_tool_definitions(server_dir: str) -> list[dict]:
"""Load tool definitions from compiled MCP server."""
# Read tool names/schemas from the source files
# Adapt path to your server structure
import glob
tools = []
for f in glob.glob(f"{server_dir}/src/tools/*.ts"):
with open(f) as fh:
content = fh.read()
# Extract tool definitions (simplified — adapt to your codebase)
import re
for match in re.finditer(r'name:\s*["\'](\w+)["\']', content):
tools.append({"name": match.group(1)})
return tools
def run_agent(message: str, system_prompt: str, tools: list[dict]) -> tuple[str, list[ToolCall]]:
"""Send message through Claude with tools, return response + tool calls."""
# Convert MCP tool defs to Anthropic tool format
anthropic_tools = [
{
"name": t["name"],
"description": t.get("description", f"Tool: {t['name']}"),
"input_schema": t.get("inputSchema", {"type": "object", "properties": {}})
}
for t in tools
]
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=1024,
system=system_prompt,
messages=[{"role": "user", "content": message}],
tools=anthropic_tools,
)
tool_calls = []
text_response = ""
for block in response.content:
if block.type == "tool_use":
tool_calls.append(ToolCall(name=block.name, arguments=block.input))
elif block.type == "text":
text_response += block.text
return text_response, tool_calls
# Load fixtures and system prompt
FIXTURES_PATH = "test-fixtures/tool-routing.json"
SYSTEM_PROMPT_PATH = "test-fixtures/system-prompt.txt"
with open(FIXTURES_PATH) as f:
fixtures = json.load(f)
system_prompt = ""
if os.path.exists(SYSTEM_PROMPT_PATH):
with open(SYSTEM_PROMPT_PATH) as f:
system_prompt = f.read()
# Build test cases
tool_correctness = ToolCorrectnessMetric()
task_completion = TaskCompletionMetric()
test_cases = []
for fixture in fixtures:
response_text, actual_calls = run_agent(
fixture["message"], system_prompt, load_tool_definitions(".")
)
test_cases.append(
LLMTestCase(
input=fixture["message"],
actual_output=response_text,
expected_tools=[ToolCall(name=fixture["expectedTool"])],
tools_called=actual_calls,
)
)
# Evaluate. Note: metric .score on the objects above may reflect only the last
# test case after a batch run; prefer the returned results (or the DeepEval
# dashboard) for aggregate pass rates.
results = evaluate(test_cases, [tool_correctness, task_completion])
print("\n=== DeepEval Results ===")
print(f"Tool Correctness (last case): {tool_correctness.score:.1%}")
print(f"Task Completion (last case): {task_completion.score:.1%}")
# Target: Tool Correctness >95%, Task Completion >90%
When to run: After every tool description change, system prompt update, or model upgrade. This is the REAL test of whether the AI routes correctly — fixture files alone are testing theater.
3.3 — APP_DATA Schema Validator
// tests/app-data-validator.ts
import Ajv from 'ajv';
import * as fs from 'fs';
import * as path from 'path';
const ajv = new Ajv({ allErrors: true, strict: false });
// Define expected schemas per app type
const APP_DATA_SCHEMAS: Record<string, object> = {
'dashboard': {
type: 'object',
required: ['title'],
properties: {
title: { type: 'string' },
metrics: {
type: 'array',
items: {
type: 'object',
required: ['label', 'value'],
properties: {
label: { type: 'string' },
value: { type: ['string', 'number'] },
change: { type: ['string', 'number'] },
trend: { enum: ['up', 'down', 'flat'] }
}
}
},
charts: { type: 'array' },
data: { type: ['array', 'object'] }
}
},
'data-grid': {
type: 'object',
required: ['data'],
properties: {
title: { type: 'string' },
data: {
type: 'array',
items: { type: 'object' },
minItems: 0
},
meta: {
type: 'object',
properties: {
total: { type: 'number' },
page: { type: 'number' },
pageSize: { type: 'number' }
}
},
columns: { type: 'array' }
}
},
'detail-card': {
type: 'object',
properties: {
title: { type: 'string' },
data: { type: 'object' },
sections: { type: 'array' },
fields: { type: 'array' }
}
},
'timeline': {
type: 'object',
properties: {
title: { type: 'string' },
events: {
type: 'array',
items: {
type: 'object',
required: ['date'],
properties: {
date: { type: 'string' },
title: { type: 'string' },
description: { type: 'string' },
type: { type: 'string' }
}
}
},
data: { type: 'array' }
}
},
'pipeline': {
type: 'object',
properties: {
title: { type: 'string' },
stages: {
type: 'array',
items: {
type: 'object',
required: ['name'],
properties: {
name: { type: 'string' },
items: { type: 'array' },
count: { type: 'number' },
value: { type: ['number', 'string'] }
}
}
}
}
}
};
export function validateAppData(
appType: string,
appData: any
): { valid: boolean; errors: string[]; warnings: string[] } {
const errors: string[] = [];
const warnings: string[] = [];
// Basic checks
if (!appData || typeof appData !== 'object') {
return { valid: false, errors: ['APP_DATA is null or not an object'], warnings: [] };
}
// Schema validation
const schema = APP_DATA_SCHEMAS[appType];
if (schema) {
const validate = ajv.compile(schema);
const isValid = validate(appData);
if (!isValid && validate.errors) {
for (const err of validate.errors) {
errors.push(`${err.instancePath || '/'} ${err.message}`);
}
}
} else {
warnings.push(`No schema defined for app type: ${appType}`);
}
// Common checks regardless of app type
if (appData.data && Array.isArray(appData.data)) {
if (appData.data.length === 0) {
warnings.push('data array is empty — app will show empty state');
}
// Check for null/undefined values in data items
for (let i = 0; i < Math.min(appData.data.length, 5); i++) {
const item = appData.data[i];
for (const [key, val] of Object.entries(item || {})) {
if (val === undefined) {
warnings.push(`data[${i}].${key} is undefined (will show as "undefined" in app)`);
}
}
}
}
return { valid: errors.length === 0, errors, warnings };
}
// Parse APP_DATA from AI response text
export function extractAppData(responseText: string): any | null {
// Standard format
const match = responseText.match(/<!--APP_DATA:([\s\S]*?):END_APP_DATA-->/);
if (match) {
try {
// Strip whitespace/newlines that LLMs sometimes add
const cleaned = match[1].replace(/[\n\r]/g, '').trim();
return JSON.parse(cleaned);
} catch (e) {
// Try with more aggressive cleanup
try {
const aggressive = match[1]
.replace(/[\n\r\t]/g, '')
.replace(/,\s*}/g, '}') // trailing commas
.replace(/,\s*]/g, ']') // trailing commas in arrays
.trim();
return JSON.parse(aggressive);
} catch (e2) {
return null;
}
}
}
// Fallback: try to find JSON in code blocks
const codeBlockMatch = responseText.match(/```(?:json)?\s*([\s\S]*?)```/);
if (codeBlockMatch) {
try {
return JSON.parse(codeBlockMatch[1].trim());
} catch (e) {
return null;
}
}
return null;
}
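For reference when writing parser tests, a minimal round-trip of the standard envelope that extractAppData handles (the payload here is an illustrative fixture):

```typescript
// The APP_DATA envelope as it appears embedded in an AI response
const response = [
  'Here are your contacts:',
  '<!--APP_DATA:{"title":"Contacts","data":[{"name":"John Doe"}]}:END_APP_DATA-->'
].join('\n');

// Same extraction regex the parser above uses for the standard format
const match = response.match(/<!--APP_DATA:([\s\S]*?):END_APP_DATA-->/);
const appData = match ? JSON.parse(match[1].trim()) : null;
console.log(appData?.title, appData?.data.length);
```

Fixtures like this make parser regressions (delimiter changes, whitespace handling) cheap to catch.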
3.4 — Thread Lifecycle Testing
### Thread Lifecycle: {channel}
1. [ ] Click app in toolbar → thread panel opens
2. [ ] Intake question appears in thread
3. [ ] Type response → AI processes in thread context
4. [ ] App loads in thread panel (if data returned or skipped)
5. [ ] Send follow-up message → app updates with new data
6. [ ] Close thread panel (X) → panel closes, thread indicator remains
7. [ ] Click thread indicator → panel reopens with preserved state
8. [ ] Delete thread → thread removed, parent message removed
9. [ ] Switch channels → come back → thread state persists (localStorage)
Quality Gate:
- All tool handler unit tests pass (Jest + MSW)
- Tool routing fixtures file has ≥20 test messages
- All routing fixture tools exist in the server
- APP_DATA schema validation passes for all app types
- APP_DATA parser handles malformed JSON gracefully
- Thread lifecycle completes without errors
Layer 3.5: Performance Testing
3.5.1 — Server Cold Start
#!/bin/bash
# Measure cold start time
SERVICE_DIR="$1"
cd "$SERVICE_DIR"
echo "=== Cold Start Benchmark ==="
# Measure time to first ListTools response
START=$(date +%s%N)
echo '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2025-11-25","capabilities":{},"clientInfo":{"name":"perf-test","version":"1.0.0"}}}' | \
timeout 10 node dist/index.js 2>/dev/null | head -1 > /dev/null
END=$(date +%s%N)
ELAPSED=$(( (END - START) / 1000000 ))
echo "Cold start to first response: ${ELAPSED}ms"
if [ "$ELAPSED" -gt 2000 ]; then
echo "❌ FAIL — exceeds 2000ms target"
else
echo "✅ PASS — under 2000ms target"
fi
3.5.2 — Tool Invocation Latency
// tests/performance.test.ts
import { performance } from 'perf_hooks';
describe('Performance', () => {
test('tool invocation overhead is under 100ms (excluding API time)', async () => {
// With MSW intercepting API calls (near-zero latency),
// measure the tool handler overhead itself
const times: number[] = [];
for (let i = 0; i < 10; i++) {
const start = performance.now();
// Call a read-only tool through the handler
// await toolHandler({ page: 1, pageSize: 10 });
const response = await fetch('https://api.example.com/v1/contacts?page=1&pageSize=10');
await response.json();
const elapsed = performance.now() - start;
times.push(elapsed);
}
const sorted = times.sort((a, b) => a - b);
const p50 = sorted[Math.floor(sorted.length * 0.5)];
const p95 = sorted[Math.floor(sorted.length * 0.95)];
console.log(`Tool overhead P50: ${p50.toFixed(1)}ms, P95: ${p95.toFixed(1)}ms`);
expect(p50).toBeLessThan(100);
});
test('memory usage stays under 100MB with all tools loaded', async () => {
const used = process.memoryUsage();
const heapMB = Math.round(used.heapUsed / 1024 / 1024);
const rssMB = Math.round(used.rss / 1024 / 1024);
console.log(`Heap: ${heapMB}MB, RSS: ${rssMB}MB`);
expect(rssMB).toBeLessThan(100);
});
});
3.5.3 — App File Size Budget
#!/bin/bash
echo "=== App File Size Budget (max 50KB) ==="
OVER=0
for f in app-ui/*.html; do
if [ -f "$f" ]; then
SIZE=$(wc -c < "$f" | tr -d ' ')
KB=$((SIZE / 1024))
if [ "$SIZE" -gt 51200 ]; then
echo "❌ $(basename $f): ${KB}KB (OVER BUDGET)"
OVER=$((OVER + 1))
else
echo "✅ $(basename $f): ${KB}KB"
fi
fi
done
[ "$OVER" -eq 0 ] && echo "All apps within budget" || echo "⚠️ $OVER apps over 50KB budget"
3.5.4 — App Render Performance (Playwright)
// In visual.test.ts, add:
test('time to first render is under 2s', async ({ page }) => {
const start = Date.now();
await page.goto(`file://${appFile}`);
const fixture = loadFixture(appFile);
await page.evaluate((data) => {
window.postMessage({ type: 'mcp_app_data', data }, '*');
}, fixture);
// Wait for content to be visible
await page.locator('#content').waitFor({ state: 'visible', timeout: 5000 });
const renderTime = Date.now() - start;
console.log(`[${appName}] Time to first render: ${renderTime}ms`);
expect(renderTime).toBeLessThan(2000);
});
3.5.5 — Load Testing (HTTP Transport)
For servers running with MCP_TRANSPORT=http, test concurrent connection handling:
#!/bin/bash
# load-test-http.sh — Test concurrent MCP connections
# Requires: npm install -g autocannon (or use curl + GNU parallel)
MCP_PORT="${1:-3000}"
CONCURRENCY="${2:-10}"
DURATION="${3:-10}"
echo "=== MCP HTTP Load Test ==="
echo "Target: http://localhost:${MCP_PORT}/mcp"
echo "Concurrency: ${CONCURRENCY} connections"
echo "Duration: ${DURATION}s"
echo ""
# Test 1: Concurrent initialize requests
echo "--- Test 1: Concurrent initialize ---"
for i in $(seq 1 $CONCURRENCY); do
curl -s -X POST "http://localhost:${MCP_PORT}/mcp" \
-H "Content-Type: application/json" \
-d '{"jsonrpc":"2.0","id":'$i',"method":"initialize","params":{"protocolVersion":"2025-11-25","capabilities":{},"clientInfo":{"name":"load-test-'$i'","version":"1.0.0"}}}' \
-o /dev/null -w "Connection $i: %{http_code} in %{time_total}s\n" &
done
wait
echo ""
# Test 2: Concurrent tools/list under load
echo "--- Test 2: Concurrent tools/list ---"
START=$(date +%s%N)
for i in $(seq 1 $CONCURRENCY); do
curl -s -X POST "http://localhost:${MCP_PORT}/mcp" \
-H "Content-Type: application/json" \
-d '{"jsonrpc":"2.0","id":1,"method":"tools/list","params":{}}' \
-o /dev/null -w "%{http_code} " &
done
wait
END=$(date +%s%N)
ELAPSED=$(( (END - START) / 1000000 ))
echo ""
echo "All $CONCURRENCY requests completed in ${ELAPSED}ms"
echo ""
# Test 3: Session management under load (verify no cross-session leaks)
echo "--- Test 3: Session isolation ---"
SESSION1=$(curl -s -X POST "http://localhost:${MCP_PORT}/mcp" \
-H "Content-Type: application/json" \
-d '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2025-11-25","capabilities":{},"clientInfo":{"name":"session-1","version":"1.0.0"}}}' \
-D - -o /dev/null 2>&1 | grep -i "mcp-session-id" | cut -d' ' -f2 | tr -d '\r')
SESSION2=$(curl -s -X POST "http://localhost:${MCP_PORT}/mcp" \
-H "Content-Type: application/json" \
-d '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2025-11-25","capabilities":{},"clientInfo":{"name":"session-2","version":"1.0.0"}}}' \
-D - -o /dev/null 2>&1 | grep -i "mcp-session-id" | cut -d' ' -f2 | tr -d '\r')
if [ "$SESSION1" != "$SESSION2" ] && [ -n "$SESSION1" ] && [ -n "$SESSION2" ]; then
echo "✅ Sessions are unique (no cross-session leaks)"
else
echo "⚠️ Session isolation check inconclusive"
fi
echo ""
echo "=== Load Test Complete ==="
echo "Target: ${CONCURRENCY} concurrent connections should complete without 5xx errors"
Pass criteria:
- Zero 5xx errors under 10 concurrent connections
- All responses return within 5s
- No cross-session data leaks (GHSA-345p-7cg4-v4c7 regression test)
- Memory usage stays under 200MB during load
Quality Gate:
- Cold start <2s to first ListTools response
- Tool invocation overhead P50 <100ms (excluding API latency)
- Memory usage <100MB after loading all tool groups
- All HTML app files <50KB
- Time to first render <2s for all apps
- HTTP transport handles 10 concurrent connections without errors
Layer 4: Live API Testing
4.1 — Credential Management Strategy
Before running Layer 4, categorize the server:
| Category | Description | Layer 4 Approach |
|---|---|---|
| has-creds | API key/OAuth token available in .env | Full live testing |
| needs-creds | Credentials needed but not yet obtained | Skip Layer 4, note in report |
| sandbox-available | API provides sandbox/test environment | Use sandbox creds (preferred) |
| no-sandbox | Only production credentials available | Careful read-only testing only |
Centralized credential management:
# Master credentials file (NOT committed to git)
# Location: ~/.clawdbot/workspace/.env.mcp-testing
# Format per service:
# {SERVICE}_API_KEY=xxx
# {SERVICE}_API_BASE_URL=https://api.example.com
# {SERVICE}_SANDBOX=true|false
# {SERVICE}_CRED_STATUS=has-creds|needs-creds|sandbox|no-sandbox
# {SERVICE}_CRED_EXPIRES=2026-03-01
# Script to distribute to individual servers:
cat ~/.clawdbot/workspace/.env.mcp-testing | grep "^${SERVICE}_" | sed "s/${SERVICE}_//" > ${SERVICE}-mcp/.env
For servers WITHOUT credentials, focus on Layers 0-3:
- Layer 0: Protocol compliance (no API needed)
- Layer 1: Static analysis (no API needed)
- Layer 2: Visual testing with fixture data (no API needed)
- Layer 2.5: Accessibility (no API needed)
- Layer 3: Functional testing with MSW mocks (no API needed)
- Layer 3.5: Performance with mocks (no API needed)
- Layer 4: SKIP — note in report as "No credentials available"
- Layer 4.5: Security (most checks don't need API)
- Layer 5: Partial — E2E with mocked responses
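This layer plan can be derived directly from {SERVICE}_CRED_STATUS so runners never guess. A bash sketch (service name and plan messages are illustrative):

```shell
# Decide which QA layers to run from {SERVICE}_CRED_STATUS (format as defined above)
plan_layers() {
  local service="$1"
  local status_var="${service}_CRED_STATUS"
  local status="${!status_var:-needs-creds}"   # bash indirect expansion
  case "$status" in
    has-creds|sandbox) echo "layers 0-5: full run, live API enabled" ;;
    no-sandbox)        echo "layers 0-3.5 plus careful read-only Layer 4" ;;
    *)                 echo "layers 0-3.5 only: skip Layer 4 (no credentials)" ;;
  esac
}

EXAMPLE_CRED_STATUS="sandbox"
plan_layers EXAMPLE
```

Call plan_layers at the top of the QA runner and record the chosen plan in the report so skipped layers are explicit, not silent.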
4.2 — Test Each Tool Group
### Live API Test: {service} / {tool-group}
**Auth:** {method} — Token/key set in .env
**Base URL:** {url}
**Cred Status:** {has-creds|sandbox|no-creds}
| Tool | Test Input | Expected | Actual | Latency | Status |
|------|-----------|----------|--------|---------|--------|
| list_{entities} | {} (default) | Array of items | | ms | |
| list_{entities} | { status: "active" } | Filtered array | | ms | |
| get_{entity} | { id: "known-id" } | Single item | | ms | |
| create_{entity} | { name: "QA Test" } | Created w/ ID | | ms | |
| update_{entity} | { id: "id", name: "Updated" } | Updated item | | ms | |
| delete_{entity} | { id: "qa-test-id" } | Confirmation | | ms | |
4.3 — Response Shape Verification
# For each tool, verify the response shape matches what the app expects
# Extract field references from app HTML (grep -P needs GNU grep; on macOS use ggrep)
grep -oP 'data\.\K[a-zA-Z_]+' app-ui/{app}.html | sort -u > /tmp/expected-fields.txt
# Compare with actual API response fields (one key per line, to match the grep output)
echo '{api_response}' | jq -r 'keys[]' | sort > /tmp/actual-fields.txt
# Diff (expected fields missing from the response are the dangerous direction)
diff /tmp/expected-fields.txt /tmp/actual-fields.txt
Quality Gate:
- All read-only tools return valid data
- Write tools create/update/delete correctly (use sandbox)
- Response shapes match what apps expect
- Error responses (401, 403, 404, 422, 429) handled gracefully
- All response latencies recorded for P50/P95 metrics
- Cleanup: delete any test data created during QA
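What "handled gracefully" can look like at the tool layer: map HTTP error statuses to isError results instead of throwing. A sketch (the ToolResult shape follows MCP tool results; the hint strings are illustrative):

```typescript
// error-mapping.ts — translate API error statuses into MCP-style tool errors
interface ToolResult {
  isError?: boolean;
  content: { type: 'text'; text: string }[];
}

function toToolError(status: number, detail: string): ToolResult {
  // Per-status hints help the AI explain the failure to the user
  const hints: Record<number, string> = {
    401: 'Authentication failed: check the API key in .env',
    403: 'Permission denied for this resource',
    404: 'Not found: the ID may be wrong or the record deleted',
    422: 'Validation failed: check the input fields',
    429: 'Rate limited: retry after a short delay',
  };
  const hint = hints[status] ?? `API error (HTTP ${status})`;
  return { isError: true, content: [{ type: 'text', text: `${hint}. ${detail}` }] };
}

console.log(toToolError(429, 'Quota exceeded for this key').content[0].text);
```

The Layer 4 error-handling checks then reduce to asserting that every non-2xx path returns a result like this rather than crashing the server.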
Layer 4.5: Security Testing
4.5.1 — XSS Testing
// tests/security.test.ts
import { test, expect } from '@playwright/test';
import * as path from 'path';
const XSS_PAYLOADS = [
'<script>alert("xss")</script>',
'<img src=x onerror=alert("xss")>',
'"><script>alert(1)</script>',
"';alert(String.fromCharCode(88,83,83))//",
'<svg onload=alert("xss")>',
'javascript:alert("xss")',
'<iframe src="javascript:alert(1)">',
'{{constructor.constructor("return this")().alert(1)}}',
'<details open ontoggle=alert(1)>',
'<math><mtext><table><mglyph><svg><mtext><style><img src=x onerror=alert(1)>',
];
test.describe('XSS Security', () => {
test('escapeHtml blocks all XSS payloads in text fields', async ({ page }) => {
const appFile = path.resolve(__dirname, '../app-ui/contact-grid.html');
await page.goto(`file://${appFile}`);
// Register the dialog listener once; re-registering inside the loop stacks
// listeners and double-dismisses each dialog
let alertFired = false;
page.on('dialog', async dialog => {
alertFired = true;
await dialog.dismiss();
});
for (const payload of XSS_PAYLOADS) {
alertFired = false;
// Inject data with XSS payloads in every text field
await page.evaluate((xss) => {
window.postMessage({
type: 'mcp_app_data',
data: {
title: xss,
data: [
{ name: xss, email: xss, phone: xss, status: xss },
],
meta: { total: 1, page: 1, pageSize: 25 }
}
}, '*');
}, payload);
await page.waitForTimeout(200);
expect(alertFired).toBe(false);
}
});
});
4.5.2 — postMessage Origin Validation
// Check in browser console — app should validate message origin
// Inject from a different origin simulation:
(function testOriginValidation() {
// Check if app code validates event.origin
const appScript = document.querySelector('script')?.textContent || '';
const checksOrigin = appScript.includes('event.origin') ||
appScript.includes('e.origin') ||
appScript.includes('message.origin');
if (checksOrigin) {
console.log('✅ App validates postMessage origin');
} else {
console.log('⚠️ App does NOT validate postMessage origin — potential security issue');
console.log(' Recommended: Add origin check in message event listener');
}
})();
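For apps that fail this check, a sketch of an origin-validating listener (the allowed-origin list is an assumption; use the host origins LocalBosses actually serves from):

```typescript
// origin-check.ts — validate postMessage events before rendering their payload
// ALLOWED_ORIGINS is illustrative; populate it with your real host origins
const ALLOWED_ORIGINS = new Set(['http://localhost:3000']);

interface AppMessage {
  origin: string;
  data: { type?: string; data?: unknown };
}

// Returns true only when the message should be rendered
function handleMessage(event: AppMessage): boolean {
  if (!ALLOWED_ORIGINS.has(event.origin)) return false;   // drop unknown origins
  if (event.data?.type !== 'mcp_app_data') return false;  // drop unexpected types
  // render(event.data.data) would go here
  return true;
}

// In the app: window.addEventListener('message', e => handleMessage(e));
console.log(handleMessage({ origin: 'https://evil.example', data: { type: 'mcp_app_data' } }));
```

Note that `file://` testing gives `event.origin` of "null", so the check may need a test-mode escape hatch behind an explicit flag.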
4.5.3 — Content Security Policy Check
# Check if HTML apps declare CSP
for f in app-ui/*.html; do
if grep -q "Content-Security-Policy" "$f"; then
echo "✅ $(basename $f) has CSP meta tag"
else
echo "⚠️ $(basename $f) — no CSP meta tag"
fi
done
# Check for inline event handlers (CSP-unfriendly)
for f in app-ui/*.html; do
INLINE=$(grep -c 'on[a-z]*=' "$f")  # grep -c prints 0 itself; appending || echo "0" would double the output and break -gt
if [ "$INLINE" -gt 0 ]; then
echo "⚠️ $(basename $f) has $INLINE inline event handlers"
fi
done
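A CSP meta tag that satisfies the check above might look like this. Since these single-file apps rely on inline scripts and styles, 'unsafe-inline' is unavoidable, so the hardening value comes from locking down default-src and connect-src. A starting point, not a vetted policy:

```html
<meta http-equiv="Content-Security-Policy"
      content="default-src 'none'; script-src 'unsafe-inline'; style-src 'unsafe-inline'; img-src data:; connect-src 'none'">
```

Tighten per app: only add connect-src entries if the app genuinely fetches, and prefer hashes over 'unsafe-inline' once the build pipeline can compute them.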
4.5.4 — API Key Exposure Check
# Check for leaked secrets in client-side code
echo "=== API Key Exposure Scan ==="
# Common patterns for API keys/secrets
PATTERNS=(
'api[_-]?key'
'apikey'
'secret'
'token'
'password'
'authorization.*Bearer'
'sk_live_'
'pk_live_'
'ghp_'
'gho_'
)
for f in app-ui/*.html; do
for pat in "${PATTERNS[@]}"; do
MATCHES=$(grep -ci "$pat" "$f")  # grep -c prints 0 itself; || echo "0" would double the output and break -gt
if [ "$MATCHES" -gt 0 ]; then
echo "❌ $(basename $f) may contain exposed secrets (pattern: $pat)"
grep -in "$pat" "$f" | head -3
fi
done
done
# Also check compiled JS (dist/**/* requires bash globstar)
shopt -s globstar 2>/dev/null
for f in dist/**/*.js; do
if [ -f "$f" ]; then
for pat in "${PATTERNS[@]}"; do
MATCHES=$(grep -ci "$pat" "$f")  # grep -c prints 0 itself; || echo "0" would double the output and break -gt
if [ "$MATCHES" -gt 0 ]; then
echo "⚠️ $(basename $f) references: $pat (verify not actual key)"
fi
done
fi
done
Quality Gate:
- All XSS payloads blocked (escapeHtml works)
- No alert dialogs triggered from any payload
- postMessage origin validated (or documented as acceptable risk)
- No API keys/secrets exposed in HTML app files
- No API keys/secrets in client-facing JavaScript
- CSP meta tag present (or documented why not)
Layer 5: Integration & Chaos Testing
5.1 — End-to-End Scenarios
Write at least 1 E2E scenario per app type (minimum 5 per server):
### E2E Scenario: {scenario-name}
**Channel:** {channel}
**Goal:** {what the user is trying to accomplish}
**App type:** {dashboard|grid|card|timeline|pipeline|calendar|analytics|monitor}
**Steps:**
1. Navigate to #{channel}
2. Type: "{natural language message}"
3. Verify: AI responds with correct tool call
4. Verify: APP_DATA block present and valid JSON
5. Verify: App {app-id} renders with correct data
6. In thread, type: "{follow-up message}"
7. Verify: App updates with new/refined data
8. Measure: Response latency for each step
**Metrics:**
- Tool selected correctly: ✅/❌
- APP_DATA valid: ✅/❌
- App rendered: ✅/❌
- Latency step 3: ___ms
- Latency step 7: ___ms
**Pass criteria:**
- [ ] All steps complete without errors
- [ ] Response time <5s for each step
- [ ] Zero console errors
- [ ] Data is accurate and well-formatted
5.1b — Automated End-to-End Data Flow Test (Playwright)
The magic moment: message → AI → tool → APP_DATA → app render → correct data. This test automates the entire flow:
// tests/e2e-dataflow.test.ts
import { test, expect } from '@playwright/test';
const LOCALBOSSES_URL = process.env.LB_URL || 'http://localhost:3000';
test.describe('End-to-End Data Flow', () => {
test('message triggers tool → APP_DATA → app renders correct data', async ({ page }) => {
// 0. Collect console errors from the start — the listener must be registered
// BEFORE navigation, or errors fired during the flow are never captured
const consoleErrors: string[] = [];
page.on('console', msg => {
if (msg.type() === 'error') consoleErrors.push(msg.text());
});
// 1. Navigate to the channel
await page.goto(`${LOCALBOSSES_URL}/#/channel/{channel-id}`);
await page.waitForLoadState('networkidle');
// 2. Send a test message
const chatInput = page.locator('[data-testid="chat-input"], textarea, input[type="text"]');
await chatInput.fill('Show me all active contacts');
await chatInput.press('Enter');
// 3. Wait for AI response (tool call indicator or text response)
const aiResponse = page.locator('[data-testid="ai-response"], .message-content').last();
await aiResponse.waitFor({ state: 'visible', timeout: 15000 });
// 4. Verify APP_DATA was generated — the raw block may be hidden in the UI,
// so check that the app iframe loaded
const appFrame = page.frameLocator('iframe[data-app-id]').first();
// 5. Verify app rendered with data (not empty/loading state)
const appContent = appFrame.locator('#content');
await appContent.waitFor({ state: 'visible', timeout: 10000 });
// 6. Verify correct data is displayed
// App should show contact data, not empty state
const appText = await appContent.textContent();
expect(appText).toBeTruthy();
expect(appText!.length).toBeGreaterThan(10); // Has real content
// 7. Verify no console errors were logged during the flow
expect(consoleErrors).toHaveLength(0);
// 8. Screenshot for the record
await page.screenshot({ path: 'test-results/e2e-dataflow.png', fullPage: true });
});
});
Note: This test requires LocalBosses running locally with the integrated channel. It's the most important test — it validates the complete user experience end-to-end. Run this after every integration change.
5.2 — Chaos Testing
Test resilience under adverse conditions. Note: the first test (API 500s) runs under Jest with the MSW server from Layer 3; the remaining tests use Playwright fixtures, so in practice split them into separate Jest and Playwright spec files.
// tests/chaos.test.ts
describe('Chaos Testing', () => {
test('API returns 500 on every call', async () => {
// Override MSW handlers to return 500
server.use(
http.get('https://api.example.com/*', () => {
return new HttpResponse('Internal Server Error', { status: 500 });
}),
http.post('https://api.example.com/*', () => {
return new HttpResponse('Internal Server Error', { status: 500 });
})
);
// Tool should return isError: true, NOT crash
// const result = await callTool('list_contacts', {});
// expect(result.isError).toBe(true);
// expect(result.content[0].text).toContain('error');
});
test('postMessage sends wrong format data', async ({ page }) => {
await page.goto(`file://${appFile}`);
// Send wrong type
await page.evaluate(() => {
window.postMessage({ type: 'wrong_type', data: {} }, '*');
});
await page.waitForTimeout(300);
// App should not crash — should still show loading/empty
const bodyText = await page.textContent('body');
expect(bodyText).not.toContain('undefined');
expect(bodyText).not.toContain('TypeError');
// Send data with wrong shape
await page.evaluate(() => {
window.postMessage({ type: 'mcp_app_data', data: 'not an object' }, '*');
});
await page.waitForTimeout(300);
const bodyText2 = await page.textContent('body');
expect(bodyText2).not.toContain('undefined');
});
test('APP_DATA is 500KB+ (huge dataset)', async ({ page }) => {
await page.goto(`file://${appFile}`);
// Generate huge dataset
const hugeData = {
title: 'Performance Stress Test',
data: Array.from({ length: 2000 }, (_, i) => ({
id: `item-${i}`,
name: `Contact ${i} ${'A'.repeat(100)}`,
email: `contact${i}@example.com`,
phone: `555-${String(i).padStart(4, '0')}`,
status: i % 2 === 0 ? 'active' : 'inactive',
notes: 'X'.repeat(200)
})),
meta: { total: 2000, page: 1, pageSize: 2000 }
};
const start = Date.now();
await page.evaluate((data) => {
window.postMessage({ type: 'mcp_app_data', data }, '*');
}, hugeData);
// Should render within 5 seconds even with huge data
await page.locator('#content').waitFor({ state: 'visible', timeout: 5000 });
const renderTime = Date.now() - start;
console.log(`Huge dataset render time: ${renderTime}ms`);
expect(renderTime).toBeLessThan(5000);
});
test('rapid-fire 10 messages', async ({ page }) => {
await page.goto(`file://${appFile}`);
// Send 10 data updates in quick succession
for (let i = 0; i < 10; i++) {
await page.evaluate((idx) => {
window.postMessage({
type: 'mcp_app_data',
data: {
title: `Update ${idx}`,
data: [{ name: `Item ${idx}`, status: 'active' }],
meta: { total: 1, page: 1, pageSize: 25 }
}
}, '*');
}, i);
}
await page.waitForTimeout(1000);
// App should show the LAST update (not crash or show stale data)
const content = await page.textContent('body');
expect(content).toContain('Update 9');
});
test('two apps rendering simultaneously', async ({ browser }) => {
const page1 = await browser.newPage();
const page2 = await browser.newPage();
await page1.goto(`file://${appFile}`);
await page2.goto(`file://${appFile}`);
// Send data to both simultaneously
await Promise.all([
page1.evaluate(() => {
window.postMessage({
type: 'mcp_app_data',
data: { title: 'App 1', data: [{ name: 'One' }] }
}, '*');
}),
page2.evaluate(() => {
window.postMessage({
type: 'mcp_app_data',
data: { title: 'App 2', data: [{ name: 'Two' }] }
}, '*');
})
]);
await page1.waitForTimeout(500);
await page2.waitForTimeout(500);
// Both should render their respective data
expect(await page1.textContent('body')).toContain('One');
expect(await page2.textContent('body')).toContain('Two');
await page1.close();
await page2.close();
});
});
5.3 — Cross-Browser Testing Notes
| Browser | Priority | Key Differences | How to Test |
|---|---|---|---|
| Chrome | P0 | Primary target — test all features here | Playwright chromium channel |
| Firefox | P1 | CSS Grid/Flexbox rendering can differ slightly; verify backdrop-filter support | Playwright firefox channel |
| Mobile Safari | P1 | Touch targets (min 44×44px), safe area insets, backdrop-filter needs the -webkit- prefix | Playwright webkit channel or real device |
| Electron | P2 | If LocalBosses ships as desktop app; test Node integration, contextBridge | Playwright with Electron |
// playwright.config.ts — multi-browser setup
import { defineConfig, devices } from '@playwright/test';
export default defineConfig({
projects: [
{ name: 'chromium', use: { ...devices['Desktop Chrome'] } },
{ name: 'firefox', use: { ...devices['Desktop Firefox'] } },
{ name: 'webkit', use: { ...devices['Desktop Safari'] } },
{ name: 'mobile-chrome', use: { ...devices['Pixel 5'] } },
{ name: 'mobile-safari', use: { ...devices['iPhone 13'] } },
],
});
Quality Gate:
- All E2E scenarios pass (≥1 per app type)
- Chaos tests: API 500s handled gracefully
- Chaos tests: wrong postMessage format doesn't crash app
- Chaos tests: 500KB+ dataset renders within 5s
- Chaos tests: rapid-fire messages show final state
- Cross-browser: Chrome + Firefox + WebKit all render correctly
Layer 5.5: Production Smoke Test (Post-Deployment)
After deploying a server + apps to production, run this validation before considering it shipped:
#!/bin/bash
# smoke-test.sh — Post-deployment validation
# Usage: ./smoke-test.sh <service-name> [base-url]
SERVICE="$1"
BASE_URL="${2:-http://localhost:3000}"
echo "=== Production Smoke Test: ${SERVICE} ==="
echo "Target: ${BASE_URL}"
echo ""
PASS=0
FAIL=0
# 1. Server is reachable (HTTP transport)
echo "--- Server Reachability ---"
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" -X POST "${BASE_URL}/mcp" \
-H "Content-Type: application/json" \
-d '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2025-11-25","capabilities":{},"clientInfo":{"name":"smoke-test","version":"1.0.0"}}}')
if [ "$HTTP_CODE" = "200" ]; then
echo "✅ Server responds to initialize (HTTP $HTTP_CODE)"
PASS=$((PASS + 1))
else
echo "❌ Server unreachable or error (HTTP $HTTP_CODE)"
FAIL=$((FAIL + 1))
fi
# 2. tools/list returns tools
echo "--- Tool List ---"
TOOLS_RESPONSE=$(curl -s -X POST "${BASE_URL}/mcp" \
-H "Content-Type: application/json" \
-d '{"jsonrpc":"2.0","id":2,"method":"tools/list","params":{}}')
TOOL_COUNT=$(echo "$TOOLS_RESPONSE" | grep -o '"name"' | wc -l | tr -d ' ')
if [ "$TOOL_COUNT" -gt 0 ]; then
echo "✅ tools/list returns $TOOL_COUNT tools"
PASS=$((PASS + 1))
else
echo "❌ tools/list returned 0 tools"
FAIL=$((FAIL + 1))
fi
# 3. health_check tool responds
echo "--- Health Check ---"
HEALTH=$(curl -s -X POST "${BASE_URL}/mcp" \
-H "Content-Type: application/json" \
-d '{"jsonrpc":"2.0","id":3,"method":"tools/call","params":{"name":"health_check","arguments":{}}}')
if echo "$HEALTH" | grep -q '"status"'; then
echo "✅ health_check tool responds"
PASS=$((PASS + 1))
else
echo "⚠️ health_check tool not found or error"
fi
# 4. App HTML files are served (if HTTP)
echo "--- App Files ---"
# Extract tool names portably (grep -oP is GNU-only; this works on macOS too)
for app_id in $(echo "$TOOLS_RESPONSE" | tr ',' '\n' | sed -n 's/.*"name"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -3); do
APP_HTTP=$(curl -s -o /dev/null -w "%{http_code}" "${BASE_URL}/api/mcp-apps?app=${app_id}")
if [ "$APP_HTTP" = "200" ]; then
echo "✅ App ${app_id} is served"
fi
done
# Summary
echo ""
echo "=== Smoke Test Results ==="
echo "Passed: $PASS"
echo "Failed: $FAIL"
[ "$FAIL" -eq 0 ] && echo "✅ SMOKE TEST PASSED" || echo "❌ SMOKE TEST FAILED"
Layer 6: Production Monitoring (Post-Ship)
"All testing is pre-ship. There's no guidance on tracking tool correctness, APP_DATA parse success rate, or user satisfaction in production." — Kofi
Pre-ship testing validates that everything can work. Production monitoring validates that everything does work, continuously.
6.1 — Production Quality Metrics
Track these metrics in production via logging in the chat route and aggregating weekly:
| Metric | Target | How to Measure | Alert Threshold |
|---|---|---|---|
| APP_DATA Parse Success Rate | >98% | Log every parseAppData() call: success vs fallback vs failure | <95% over 1 hour |
| Tool Correctness Sampling | >95% | Sample 5% of interactions weekly, LLM-judge correctness | <90% in weekly sample |
| Time to First App Render | P50 <3s, P95 <8s | Measure from user message send → app #content visible | P95 >12s |
| User Retry Rate | <15% | Count rephrased messages within 30s of previous message | >25% over 1 day |
| Thread Completion Rate | >80% | % of threads where user reaches a data-displaying app state | <60% over 1 week |
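The retry-rate metric needs a concrete definition of "rephrased within 30s". A minimal sketch — the `ChatMessage` shape is an assumption; adapt it to the real chat store:

```typescript
// Sketch: "user retry rate" — a message counts as a retry when the same user
// sends another message within 30s of their previous one (a proxy for rephrasing).
interface ChatMessage {
  userId: string;
  text: string;
  timestamp: number; // epoch ms
}

function retryRate(messages: ChatMessage[]): number {
  const lastSeen = new Map<string, number>(); // userId -> last message timestamp
  let retries = 0;
  const ordered = [...messages].sort((a, b) => a.timestamp - b.timestamp);
  for (const m of ordered) {
    const last = lastSeen.get(m.userId);
    if (last !== undefined && m.timestamp - last < 30_000) retries++;
    lastSeen.set(m.userId, m.timestamp);
  }
  return ordered.length === 0 ? 0 : retries / ordered.length;
}
```

Run this over a day's messages and alert when the result exceeds 0.25, per the table above.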
6.2 — Instrumentation Code
Add to the chat route to collect production metrics:
// lib/production-metrics.ts
interface MetricEvent {
timestamp: string;
channel: string;
metric: string;
value: number;
metadata?: Record<string, unknown>;
}
const metrics: MetricEvent[] = [];
export function trackMetric(channel: string, metric: string, value: number, metadata?: Record<string, unknown>) {
metrics.push({
timestamp: new Date().toISOString(),
channel,
metric,
value,
metadata,
});
// Flush to file every 100 events
if (metrics.length >= 100) flushMetrics();
}
function flushMetrics() {
// Note: require() assumes a CommonJS build; in an ESM server, use top-level
// `import * as fs from 'node:fs'` and `import * as path from 'node:path'` instead.
const fs = require('node:fs');
const path = require('node:path');
const file = path.join(process.cwd(), 'logs', `metrics-${new Date().toISOString().split('T')[0]}.jsonl`);
fs.mkdirSync(path.dirname(file), { recursive: true });
fs.appendFileSync(file, metrics.map(m => JSON.stringify(m)).join('\n') + '\n');
metrics.length = 0;
}
// Usage in chat route:
// trackMetric(channelId, 'app_data_parse', success ? 1 : 0, { fallback: usedFallback });
// trackMetric(channelId, 'tool_call_latency', latencyMs, { tool: toolName });
// trackMetric(channelId, 'thread_completed', 1);
6.3 — Weekly Quality Review
#!/bin/bash
# weekly-quality-report.sh — Aggregate production metrics
METRICS_DIR="logs"
WEEK_START=$(date -v-7d +%Y-%m-%d)  # BSD/macOS syntax; on GNU/Linux use: date -d '7 days ago' +%Y-%m-%d
echo "=== Weekly Production Quality Report ==="
echo "Period: ${WEEK_START} to $(date +%Y-%m-%d)"
echo ""
# APP_DATA parse success rate
TOTAL_PARSES=$(cat ${METRICS_DIR}/metrics-*.jsonl 2>/dev/null | grep '"app_data_parse"' | wc -l | tr -d ' ')
SUCCESS_PARSES=$(cat ${METRICS_DIR}/metrics-*.jsonl 2>/dev/null | grep '"app_data_parse"' | grep '"value":1' | wc -l | tr -d ' ')
if [ "$TOTAL_PARSES" -gt 0 ]; then
PARSE_RATE=$((SUCCESS_PARSES * 100 / TOTAL_PARSES))
echo "APP_DATA Parse Success: ${PARSE_RATE}% (${SUCCESS_PARSES}/${TOTAL_PARSES})"
else
echo "APP_DATA Parse Success: No data"
fi
echo ""
echo "Action items:"
echo "- Review any channels with parse rate <95%"
echo "- Check retry rate spikes for system prompt issues"
echo "- Sample 5 random interactions for manual correctness review"
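The bash report only aggregates parse rate. A TypeScript sketch that rolls the same JSONL lines into parse rate plus latency percentiles (nearest-rank method; `MetricEvent` fields follow lib/production-metrics.ts above):

```typescript
// Sketch: aggregate metrics JSONL lines into parse success rate and latency percentiles.
interface MetricEvent {
  metric: string;
  value: number;
}

// Nearest-rank percentile over an ascending-sorted array.
function percentile(sorted: number[], p: number): number {
  if (sorted.length === 0) return 0;
  const idx = Math.max(0, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[Math.min(idx, sorted.length - 1)];
}

function aggregateMetrics(jsonlLines: string[]) {
  const events = jsonlLines.filter(Boolean).map(l => JSON.parse(l) as MetricEvent);
  const parses = events.filter(e => e.metric === 'app_data_parse');
  const latencies = events
    .filter(e => e.metric === 'tool_call_latency')
    .map(e => e.value)
    .sort((a, b) => a - b);
  return {
    parseRate: parses.length ? parses.filter(e => e.value === 1).length / parses.length : null,
    latencyP50: percentile(latencies, 50),
    latencyP95: percentile(latencies, 95),
  };
}
```

Feed it the contents of the logs/metrics-*.jsonl files split on newlines, then compare parseRate against the 98% target and latencyP95 against the alert threshold.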
CI/CD Pipeline Template
Automate the QA pipeline in CI. Save as .github/workflows/mcp-qa.yml:
# .github/workflows/mcp-qa.yml
name: MCP QA Pipeline
on:
push:
paths: ['*-mcp/**', 'mcp-servers/**']
pull_request:
paths: ['*-mcp/**', 'mcp-servers/**']
jobs:
qa:
runs-on: ubuntu-latest
strategy:
matrix:
node-version: [22]
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with:
node-version: ${{ matrix.node-version }}
cache: 'npm'
- name: Install dependencies
run: npm ci
- name: TypeScript build
run: npm run build
- name: Type check
run: npx tsc --noEmit
- name: Jest unit tests
run: npx jest --ci --coverage
env:
NODE_ENV: test
- name: Install Playwright browsers
run: npx playwright install --with-deps
- name: Playwright visual + accessibility tests
run: npx playwright test
- name: App file size check
run: |
for f in app-ui/*.html; do
if [ -f "$f" ]; then
SIZE=$(wc -c < "$f" | tr -d ' ')
if [ "$SIZE" -gt 51200 ]; then
echo "❌ $(basename $f) exceeds 50KB ($SIZE bytes)"
exit 1
fi
echo "✅ $(basename $f) ($SIZE bytes)"
fi
done
- name: Security scan
run: |
ISSUES=0
for f in app-ui/*.html; do
for pat in "api_key" "apikey" "secret" "sk_live" "pk_live"; do
if grep -qi "$pat" "$f" 2>/dev/null; then
echo "❌ $(basename $f): potential key exposure ($pat)"
ISSUES=$((ISSUES + 1))
fi
done
done
[ "$ISSUES" -eq 0 ] || exit 1
- name: Upload test results
uses: actions/upload-artifact@v4
if: always()
with:
name: test-results
path: |
test-results/
coverage/
retention-days: 30
# Optional: DeepEval tool routing (requires API key)
tool-routing:
runs-on: ubuntu-latest
if: github.event_name == 'push' && github.ref == 'refs/heads/main'
needs: qa
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: '3.12'
- run: pip install deepeval anthropic
- name: Run DeepEval tool routing evaluation
run: deepeval test run tests/tool_routing_eval.py
env:
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
DEEPEVAL_API_KEY: ${{ secrets.DEEPEVAL_API_KEY }}
Testing Reality Check
What the QA catches vs what it misses — from Kofi's review
✅ What This QA Framework CATCHES (real quality):
| Test | What It Validates | Real-World Impact |
|---|---|---|
| TypeScript compilation | Code compiles, types correct | Prevents server crashes |
| MCP Inspector | Protocol compliance | Server works with any MCP client |
| Playwright visual tests | Apps render all 3 states, dark theme, responsive | Users see a polished UI |
| axe-core accessibility | WCAG AA, keyboard nav, screen reader | Accessible to all users |
| XSS payload testing | No script injection via user data | Security against malicious data |
| Chaos testing (500 errors, wrong formats, huge data) | Graceful degradation | App doesn't crash under adverse conditions |
| Static cross-reference | All app IDs consistent across 4 files | No broken routes or missing entries |
| File size budgets | Apps under 50KB | Fast loading |
| BackstopJS regression | Visual changes are intentional | No accidental UI regressions |
| Cold start / latency benchmarks | Performance within targets | Users don't wait too long |
❌ What This QA Framework MISSES (gaps to be aware of):
| Gap | Why It Matters | Current State | Mitigation |
|---|---|---|---|
| Tool routing accuracy with real LLM | THE quality metric — does the AI pick the right tool? | DeepEval added (3.2b) but requires API key + cost | Run DeepEval on main branch pushes, not every PR |
| APP_DATA generation quality | Does the LLM produce valid JSON matching app expectations? | Not fully tested — parser is tested, generator is probabilistic | Few-shot examples in system prompts + Layer 6 monitoring |
| Multi-step tool chains | "Find John's email and send him a meeting invite" — requires 3 tool calls | Not tested — all routing tests are single-tool | Add multi-step fixtures to DeepEval test cases |
| Conversation context | "Show me more details about the second one" — requires memory | Not addressed in any skill | Requires thread state tracking — future work |
| Real API response shape drift | MSW mocks may not match real API | MSW validation note added (3.1) but manual | Quarterly mock validation when credentials available |
| Production quality after ship | Is quality maintained over time? | Layer 6 monitoring added | Implement metric collection + weekly review |
| APP_DATA parse failure rate in production | How often does the LLM produce unparseable JSON? | Layer 6 tracks this now | Set alerting threshold at <95% success |
The Hard Truth:
This QA framework is excellent at testing infrastructure (server compiles, apps render, accessibility passes, security is clean) — roughly 40% of the user experience. The AI interaction quality (tool routing, data generation, multi-step flows) is the other 60%, and it's harder to test deterministically because the LLM is probabilistic. Layer 6 monitoring and DeepEval close this gap but don't eliminate it. Ship with awareness, monitor in production, iterate on system prompts.
Test Data Fixtures Library
Standard Fixture: Dashboard
Save as test-fixtures/dashboard.json:
{
"title": "Monthly Performance Overview",
"metrics": [
{ "label": "Total Revenue", "value": "$124,500", "change": "+12.3%", "trend": "up" },
{ "label": "New Customers", "value": 847, "change": "+5.2%", "trend": "up" },
{ "label": "Churn Rate", "value": "2.1%", "change": "-0.3%", "trend": "down" },
{ "label": "Avg Response Time", "value": "1.4h", "change": "-8.5%", "trend": "down" }
],
"charts": [
{
"type": "bar",
"title": "Revenue by Month",
"data": [
{ "label": "Sep", "value": 95000 },
{ "label": "Oct", "value": 102000 },
{ "label": "Nov", "value": 98000 },
{ "label": "Dec", "value": 115000 },
{ "label": "Jan", "value": 124500 }
]
}
],
"data": {
"summary": "Revenue is up 12.3% month-over-month with strong customer acquisition."
}
}
Standard Fixture: Data Grid
Save as test-fixtures/data-grid.json:
{
"title": "Active Contacts",
"columns": ["Name", "Email", "Phone", "Status", "Created"],
"data": [
{ "name": "John Doe", "email": "john@acmecorp.com", "phone": "555-0101", "status": "active", "created": "2026-01-15" },
{ "name": "Jane Smith", "email": "jane@techstart.io", "phone": "555-0102", "status": "active", "created": "2026-01-20" },
{ "name": "Bob Wilson", "email": "bob@globalinc.com", "phone": "555-0103", "status": "inactive", "created": "2025-12-01" },
{ "name": "Alice Brown", "email": "alice@startup.co", "phone": "555-0104", "status": "active", "created": "2026-02-01" },
{ "name": "Charlie Davis", "email": "charlie@enterprise.net", "phone": "555-0105", "status": "pending", "created": "2026-02-03" },
{ "name": "Diana Evans", "email": "diana@agency.com", "phone": "555-0106", "status": "active", "created": "2025-11-15" },
{ "name": "Frank Garcia", "email": "frank@solutions.biz", "phone": "555-0107", "status": "active", "created": "2026-01-28" },
{ "name": "Grace Hill", "email": "grace@design.studio", "phone": "555-0108", "status": "inactive", "created": "2025-10-05" }
],
"meta": { "total": 156, "page": 1, "pageSize": 25 }
}
Standard Fixture: Timeline
Save as test-fixtures/timeline.json:
{
"title": "Contact Activity Timeline",
"events": [
{ "date": "2026-02-04T14:30:00Z", "title": "Email Opened", "description": "Campaign: February Newsletter", "type": "email" },
{ "date": "2026-02-03T10:15:00Z", "title": "Meeting Scheduled", "description": "Demo call with sales team", "type": "meeting" },
{ "date": "2026-02-01T09:00:00Z", "title": "Deal Created", "description": "Enterprise Plan — $15,000/yr", "type": "deal" },
{ "date": "2026-01-28T16:45:00Z", "title": "Form Submitted", "description": "Requested pricing information", "type": "form" },
{ "date": "2026-01-25T11:30:00Z", "title": "First Visit", "description": "Visited pricing page from Google Ads", "type": "visit" }
]
}
Edge Case Fixtures
Save as test-fixtures/edge-cases.json:
{
"empty_strings": {
"data": [
{ "name": "", "email": "", "phone": "", "status": "" }
]
},
"null_values": {
"data": [
{ "name": null, "email": null, "phone": null, "status": null }
]
},
"extremely_long_text": {
"data": [
{
"name": "Bartholomew Christopherson-Williamsworth III, Esq., Ph.D., M.B.A., J.D., CPA, CFP®, CAIA®, FRM®",
"email": "bartholomew.christopherson-williamsworth.the.third.esquire.phd.mba.jd@extremely-long-company-name-international-holdings-corporation-unlimited.com",
"phone": "+1 (555) 012-3456 ext. 78901234",
"status": "active — pending final review by committee chairperson and board of directors",
"notes": "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."
}
]
},
"unicode": {
"data": [
{ "name": "田中太郎", "email": "tanaka@例え.jp", "status": "アクティブ" },
{ "name": "Müller, Günther", "email": "günther@münchen.de", "status": "aktiv" },
{ "name": "Дмитрий Иванов", "email": "dmitry@компания.ru", "status": "активный" },
{ "name": "محمد عبدالله", "email": "mohammed@شركة.sa", "status": "نشط" },
{ "name": "🧑💻 Developer", "email": "dev@🏢.com", "status": "✅ Active" }
]
},
"html_entities": {
"data": [
{ "name": "O'Brien & Sons <LLC>", "email": "info@obrien&sons.com", "notes": "He said \"hello\" & left" }
]
}
}
Adversarial Fixtures
Save as test-fixtures/adversarial.json:
{
"xss_payloads": {
"data": [
{ "name": "<script>alert('xss')</script>", "email": "test@test.com" },
{ "name": "<img src=x onerror=alert(1)>", "email": "\"><script>alert(1)</script>" },
{ "name": "<svg onload=alert('xss')>", "email": "javascript:alert(1)" },
{ "name": "{{constructor.constructor('return this')().alert(1)}}", "email": "test@test.com" },
{ "name": "<details open ontoggle=alert(1)>", "email": "<iframe src='javascript:alert(1)'>" }
]
},
"sql_injection": {
"data": [
{ "name": "'; DROP TABLE contacts; --", "email": "test@test.com" },
{ "name": "1' OR '1'='1", "email": "' UNION SELECT * FROM users --" },
{ "name": "admin'--", "email": "1; UPDATE users SET role='admin'" }
]
},
"malformed": {
"missing_fields": { "data": [{ "id": "1" }] },
"wrong_types": { "data": "not an array", "meta": "not an object" },
"nested_nulls": { "data": [{ "name": { "first": null, "last": null }, "contacts": [null, null] }] },
"circular_attempt": { "data": [{ "self": "[Circular]" }] },
"massive_nesting": { "a": { "b": { "c": { "d": { "e": { "f": { "g": "deep" } } } } } } }
}
}
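These payloads only earn their keep if the app's escaping actually neutralizes them. A self-contained check — `escapeHtml` here is illustrative; substitute whatever escaping helper the app really ships:

```typescript
// Sketch: assert that HTML escaping neutralizes the xss_payloads fixture entries.
// Ampersand must be replaced first so later entities aren't double-escaped.
function escapeHtml(s: string): string {
  return s
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

const xssPayloads: string[] = [
  "<script>alert('xss')</script>",
  '<img src=x onerror=alert(1)>',
  "<svg onload=alert('xss')>",
  '<details open ontoggle=alert(1)>',
];

for (const payload of xssPayloads) {
  const escaped = escapeHtml(payload);
  // No raw angle brackets may survive — otherwise the payload can still form a tag.
  if (escaped.includes('<') || escaped.includes('>')) {
    throw new Error(`payload not neutralized: ${payload}`);
  }
}
```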
Scale Fixture Generator
// tests/generate-scale-fixture.ts
// Run: npx ts-node tests/generate-scale-fixture.ts > test-fixtures/scale-1000.json
function generateScaleData(count: number) {
const statuses = ['active', 'inactive', 'pending', 'archived'];
const domains = ['gmail.com', 'outlook.com', 'company.co', 'startup.io', 'enterprise.net'];
return {
title: `Scale Test: ${count} Records`,
data: Array.from({ length: count }, (_, i) => ({
id: `contact-${String(i).padStart(6, '0')}`,
name: `Contact ${i + 1}`,
email: `user${i + 1}@${domains[i % domains.length]}`,
phone: `555-${String(i).padStart(4, '0')}`,
status: statuses[i % statuses.length],
created: new Date(2025, 0, 1 + (i % 365)).toISOString().split('T')[0],
value: Math.round(Math.random() * 100000) / 100,
tags: [`tag-${i % 10}`, `region-${i % 5}`]
})),
meta: { total: count, page: 1, pageSize: count }
};
}
console.log(JSON.stringify(generateScaleData(1000), null, 2));
Regression Testing Baselines
Baseline Workflow
1. CAPTURE — First time app is verified correct:
backstop reference
# Stores golden screenshots in test-baselines/backstop/
2. TEST — On every subsequent QA run:
backstop test
# Compares current screenshots against baselines
# Result: PASS (≤5% diff) or FAIL (>5% diff)
3. APPROVE — When intentional changes are made:
backstop approve
# Updates baselines to reflect new correct state
4. TRACK — Tool routing baselines:
# test-fixtures/tool-routing.json is the routing baseline
# Update ONLY when intentionally changing tool descriptions
# Run routing tests after ANY tool description change
Screenshot Baseline Structure
test-baselines/
├── backstop/
│ ├── {app-name}_thread-panel_data.png
│ ├── {app-name}_thread-panel_loading.png
│ ├── {app-name}_thread-panel_empty.png
│ ├── {app-name}_narrow_data.png
│ └── {app-name}_wide_data.png
├── tool-routing.json # NL → tool mapping baseline
└── app-data-schemas/ # JSON schemas per app type
├── dashboard.schema.json
├── data-grid.schema.json
├── detail-card.schema.json
├── timeline.schema.json
└── pipeline.schema.json
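The app-data-schemas/ baselines imply a validation step. As a lightweight stand-in for full JSON Schema validation (e.g. Ajv against data-grid.schema.json), a hand-rolled type guard for the data-grid contract might look like:

```typescript
// Sketch: structural guard for the data-grid APP_DATA contract.
// Field names mirror the data-grid fixture; adjust per app-data schema.
interface GridPayload {
  title: string;
  data: Array<Record<string, unknown>>;
  meta: { total: number; page: number; pageSize: number };
}

function isGridPayload(x: unknown): x is GridPayload {
  if (typeof x !== 'object' || x === null) return false;
  const o = x as Record<string, unknown>;
  const meta = o.meta as Record<string, unknown> | undefined;
  return (
    typeof o.title === 'string' &&
    Array.isArray(o.data) &&
    typeof meta === 'object' && meta !== null &&
    typeof meta.total === 'number' &&
    typeof meta.page === 'number' &&
    typeof meta.pageSize === 'number'
  );
}
```

Run the guard over every fixture in test-fixtures/ and over live APP_DATA samples; any `false` result is a schema-drift signal.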
Programmatic Screenshot Comparison (Without BackstopJS)
// tests/screenshot-diff.ts
import { PNG } from 'pngjs';
import * as fs from 'fs';
import pixelmatch from 'pixelmatch';
function compareScreenshots(
baselinePath: string,
currentPath: string,
diffOutputPath: string
): { diffPercent: number; pass: boolean } {
const baseline = PNG.sync.read(fs.readFileSync(baselinePath));
const current = PNG.sync.read(fs.readFileSync(currentPath));
const { width, height } = baseline;
const diff = new PNG({ width, height });
const numDiffPixels = pixelmatch(
baseline.data, current.data, diff.data,
width, height,
{ threshold: 0.1 }
);
const totalPixels = width * height;
const diffPercent = (numDiffPixels / totalPixels) * 100;
if (diffPercent > 5) {
fs.writeFileSync(diffOutputPath, PNG.sync.write(diff));
}
return {
diffPercent: Math.round(diffPercent * 100) / 100,
pass: diffPercent <= 5.0
};
}
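One way to roll the per-screenshot results from compareScreenshots into a single run verdict — mirroring the ≤5% pass threshold — is a small aggregator (sketch; names are illustrative):

```typescript
// Sketch: collapse per-screenshot diff results into one regression verdict,
// using the same 5% threshold as compareScreenshots above.
interface DiffResult {
  name: string;        // e.g. "dashboard_thread-panel_data"
  diffPercent: number; // from compareScreenshots
}

function regressionVerdict(results: DiffResult[], thresholdPct = 5): {
  pass: boolean;
  failures: string[];
} {
  const failures = results
    .filter(r => r.diffPercent > thresholdPct)
    .map(r => `${r.name} (${r.diffPercent.toFixed(2)}%)`);
  return { pass: failures.length === 0, failures };
}
```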
Automated QA Script (Full)
Save as scripts/mcp-qa.sh:
#!/bin/bash
set -euo pipefail
# MCP QA — Automated Testing Pipeline
# Usage: ./mcp-qa.sh <service-name> [--skip-layer4]
#
# Runs all automated layers and generates a persistent report.
SERVICE="${1:-}"  # default to empty so set -u doesn't abort before the usage check
SKIP_LAYER4="${2:-}"
DATE=$(date +%Y-%m-%d)
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
if [ -z "$SERVICE" ]; then
echo "Usage: $0 <service-name> [--skip-layer4]"
exit 1
fi
# Persistent report location
REPORT_DIR="$HOME/.clawdbot/workspace/mcp-factory-reviews/${SERVICE}"
mkdir -p "$REPORT_DIR"
REPORT="${REPORT_DIR}/qa-report-${DATE}.md"
# Find server directory
SERVER_DIR=""
for d in "${SERVICE}-mcp" "mcp-servers/${SERVICE}" "mcp-diagrams/mcp-servers/${SERVICE}"; do
if [ -d "$d" ]; then
SERVER_DIR="$d"
break
fi
done
if [ -z "$SERVER_DIR" ]; then
echo "❌ Server directory not found for ${SERVICE}"
exit 1
fi
cat > "$REPORT" << EOF
# MCP QA Report: ${SERVICE}
**Date:** ${DATE}
**Timestamp:** ${TIMESTAMP}
**Tester:** Automated QA Pipeline
**Server:** ${SERVER_DIR}
---
## Quantitative Metrics
| Metric | Target | Actual | Status |
|--------|--------|--------|--------|
EOF
TOTAL_PASS=0
TOTAL_FAIL=0
TOTAL_WARN=0
TOTAL_SKIP=0
pass() { TOTAL_PASS=$((TOTAL_PASS + 1)); echo "✅ $1"; }
fail() { TOTAL_FAIL=$((TOTAL_FAIL + 1)); echo "❌ $1"; }
warn() { TOTAL_WARN=$((TOTAL_WARN + 1)); echo "⚠️ $1"; }
skip() { TOTAL_SKIP=$((TOTAL_SKIP + 1)); echo "⏭️ $1"; }
echo ""
echo "========================================"
echo " MCP QA Pipeline: ${SERVICE}"
echo " $(date)"
echo "========================================"
echo ""
# ─── LAYER 0: Protocol Compliance ───
echo "--- Layer 0: Protocol Compliance ---"
echo "" >> "$REPORT"
echo "## Layer 0: Protocol Compliance" >> "$REPORT"
cd "$SERVER_DIR"
# Build first
if npm run build 2>&1 | tail -5 > /tmp/mcp-qa-build.log; then
pass "TypeScript build succeeded"
echo "- ✅ TypeScript build succeeded" >> "$REPORT"
else
fail "TypeScript build FAILED"
echo "- ❌ TypeScript build FAILED" >> "$REPORT"
cat /tmp/mcp-qa-build.log >> "$REPORT"
fi
# MCP Inspector (if available)
if command -v npx &> /dev/null; then
echo "Running MCP Inspector..."
if timeout 15 npx @modelcontextprotocol/inspector stdio node dist/index.js 2>/tmp/mcp-inspector.log; then
pass "MCP Inspector passed"
echo "- ✅ MCP Inspector passed" >> "$REPORT"
else
warn "MCP Inspector had issues (check /tmp/mcp-inspector.log)"
echo "- ⚠️ MCP Inspector had issues" >> "$REPORT"
fi
else
skip "MCP Inspector (npx not available)"
echo "- ⏭️ MCP Inspector skipped" >> "$REPORT"
fi
cd - > /dev/null
# ─── LAYER 1: Static Analysis ───
echo ""
echo "--- Layer 1: Static Analysis ---"
echo "" >> "$REPORT"
echo "## Layer 1: Static Analysis" >> "$REPORT"
# TypeScript type check
cd "$SERVER_DIR"
if npx tsc --noEmit 2>&1 | tail -3 > /tmp/mcp-qa-typecheck.log; then
pass "tsc --noEmit clean"
echo "- ✅ Type check clean" >> "$REPORT"
else
fail "tsc --noEmit has errors"
echo "- ❌ Type check errors:" >> "$REPORT"
cat /tmp/mcp-qa-typecheck.log >> "$REPORT"
fi
cd - > /dev/null
# Any types
ANY_COUNT=$(grep -rn ": any" "$SERVER_DIR/src/" --include="*.ts" 2>/dev/null | grep -cv "catch\|eslint\|node_modules" || true)
ANY_COUNT=${ANY_COUNT:-0}
if [ "$ANY_COUNT" -eq 0 ]; then
pass "No unintended 'any' types"
else
warn "${ANY_COUNT} 'any' types found"
fi
echo "- any types: ${ANY_COUNT}" >> "$REPORT"
# SDK version
SDK_VER=$(cd "$SERVER_DIR" && node -e "console.log(require('./package.json').dependencies['@modelcontextprotocol/sdk'] || 'NOT FOUND')" 2>/dev/null || echo "UNKNOWN")
echo "- SDK version: ${SDK_VER}" >> "$REPORT"
# Warn if SDK is below 1.26.0 (security fix)
if echo "$SDK_VER" | grep -q "1\.25\."; then
warn "SDK version ${SDK_VER} — should be ^1.26.0+ (security fix GHSA-345p-7cg4-v4c7)"
echo "- ⚠️ SDK should be ^1.26.0+ (security fix)" >> "$REPORT"
fi
# App files
echo "" >> "$REPORT"
echo "### App Files" >> "$REPORT"
APP_COUNT=0
APP_OVERSIZED=0
for dir in "$SERVER_DIR/app-ui" "$SERVER_DIR/ui/dist"; do
if [ -d "$dir" ]; then
for f in "$dir"/*.html; do
if [ -f "$f" ]; then
SIZE=$(wc -c < "$f" | tr -d ' ')
KB=$((SIZE / 1024))
APP_COUNT=$((APP_COUNT + 1))
if [ "$SIZE" -gt 51200 ]; then
APP_OVERSIZED=$((APP_OVERSIZED + 1))
echo "- ⚠️ $(basename $f): ${KB}KB (over 50KB budget)" >> "$REPORT"
else
echo "- ✅ $(basename $f): ${KB}KB" >> "$REPORT"
fi
fi
done
fi
done
echo "- App file size budget: ${APP_OVERSIZED}/${APP_COUNT} over 50KB $([ $APP_OVERSIZED -eq 0 ] && echo '✅' || echo '⚠️')" >> "$REPORT"
# ─── LAYER 2: Jest Unit Tests ───
echo ""
echo "--- Layer 2: Automated Tests ---"
echo "" >> "$REPORT"
echo "## Layer 2: Automated Tests" >> "$REPORT"
cd "$SERVER_DIR"
if [ -f "jest.config.ts" ] || [ -f "jest.config.js" ] || grep -q '"jest"' package.json 2>/dev/null; then
echo "Running Jest tests..."
if npx jest --ci --coverage 2>&1 | tee /tmp/mcp-qa-jest.log | tail -10; then
pass "Jest tests passed"
echo "- ✅ Jest tests passed" >> "$REPORT"
else
fail "Jest tests FAILED"
echo "- ❌ Jest tests failed" >> "$REPORT"
tail -20 /tmp/mcp-qa-jest.log >> "$REPORT"
fi
else
skip "No Jest config found"
echo "- ⏭️ No Jest test suite found" >> "$REPORT"
fi
# Playwright visual tests
if [ -f "playwright.config.ts" ] || [ -f "playwright.config.js" ]; then
echo "Running Playwright visual tests..."
if npx playwright test 2>&1 | tee /tmp/mcp-qa-playwright.log | tail -10; then
pass "Playwright tests passed"
echo "- ✅ Playwright tests passed" >> "$REPORT"
else
fail "Playwright tests FAILED"
echo "- ❌ Playwright tests failed" >> "$REPORT"
tail -20 /tmp/mcp-qa-playwright.log >> "$REPORT"
fi
else
skip "No Playwright config found"
echo "- ⏭️ No Playwright test suite found" >> "$REPORT"
fi
# BackstopJS visual regression
if [ -f "backstop.json" ]; then
echo "Running BackstopJS regression..."
if backstop test 2>&1 | tee /tmp/mcp-qa-backstop.log | tail -5; then
pass "BackstopJS regression passed"
echo "- ✅ Visual regression passed" >> "$REPORT"
else
warn "BackstopJS regression detected differences"
echo "- ⚠️ Visual regression diffs detected" >> "$REPORT"
fi
else
skip "No backstop.json found"
echo "- ⏭️ No BackstopJS config found" >> "$REPORT"
fi
cd - > /dev/null
# ─── LAYER 4: Live API (optional) ───
if [ "$SKIP_LAYER4" != "--skip-layer4" ]; then
echo ""
echo "--- Layer 4: Live API Testing ---"
echo "" >> "$REPORT"
echo "## Layer 4: Live API Testing" >> "$REPORT"
if [ -f "$SERVER_DIR/.env" ]; then
pass ".env file exists"
echo "- ✅ .env credentials found" >> "$REPORT"
echo "- ⚠️ Manual verification of live API required" >> "$REPORT"
else
skip "No .env file — skipping live API tests"
echo "- ⏭️ No credentials available" >> "$REPORT"
fi
else
skip "Layer 4 skipped (--skip-layer4)"
echo "" >> "$REPORT"
echo "## Layer 4: Live API Testing — SKIPPED" >> "$REPORT"
fi
# ─── SECURITY SCAN ───
echo ""
echo "--- Layer 4.5: Security Scan ---"
echo "" >> "$REPORT"
echo "## Layer 4.5: Security Scan" >> "$REPORT"
SECURITY_ISSUES=0
for dir in "$SERVER_DIR/app-ui" "$SERVER_DIR/ui/dist"; do
if [ -d "$dir" ]; then
for f in "$dir"/*.html; do
if [ -f "$f" ]; then
# Check for potential key exposure
for pat in "api.key" "apikey" "api_key" "secret" "sk_live" "pk_live"; do
if grep -qi "$pat" "$f" 2>/dev/null; then
SECURITY_ISSUES=$((SECURITY_ISSUES + 1))
echo "- ❌ $(basename $f): potential key exposure (${pat})" >> "$REPORT"
fi
done
fi
done
fi
done
if [ "$SECURITY_ISSUES" -eq 0 ]; then
pass "No API key exposure detected"
echo "- ✅ No API key exposure detected in app files" >> "$REPORT"
else
fail "${SECURITY_ISSUES} potential security issues"
fi
# ─── SUMMARY ───
echo ""
echo "========================================"
echo " SUMMARY"
echo "========================================"
echo " ✅ Passed: ${TOTAL_PASS}"
echo " ❌ Failed: ${TOTAL_FAIL}"
echo " ⚠️ Warnings: ${TOTAL_WARN}"
echo " ⏭️ Skipped: ${TOTAL_SKIP}"
echo "========================================"
OVERALL="PASS"
[ "$TOTAL_FAIL" -gt 0 ] && OVERALL="FAIL"
[ "$TOTAL_FAIL" -eq 0 ] && [ "$TOTAL_WARN" -gt 0 ] && OVERALL="PASS WITH WARNINGS"
cat >> "$REPORT" << EOF
---
## Summary
| Category | Count |
|----------|-------|
| ✅ Passed | ${TOTAL_PASS} |
| ❌ Failed | ${TOTAL_FAIL} |
| ⚠️ Warnings | ${TOTAL_WARN} |
| ⏭️ Skipped | ${TOTAL_SKIP} |
## Overall: **${OVERALL}**
---
*Report generated by MCP QA Pipeline v2.0*
*Saved to: ${REPORT}*
EOF
echo ""
echo "Report saved to: $REPORT"
echo "Overall: ${OVERALL}"
Test Report Template (Full)
Generate this after running all layers. Save to mcp-factory-reviews/{service}/qa-report-{date}.md:
# MCP QA Report: {Service Name}
**Date:** {YYYY-MM-DD}
**Tester:** {agent/human}
**Server:** {service}-mcp v{version}
**Apps:** {count} apps tested
**Credential Status:** {has-creds|needs-creds|sandbox|no-sandbox}
---
## Quantitative Metrics
| Metric | Target | Actual | Status |
|--------|--------|--------|--------|
| MCP Protocol Compliance | 100% | __% | ✅/❌ |
| Tool Correctness Rate | >95% | __/20 (__%) | ✅/❌ |
| Task Completion Rate | >90% | __/10 (__%) | ✅/❌ |
| APP_DATA Schema Match | 100% | __/__ (__%) | ✅/❌ |
| Response Latency P50 | <3s | __s | ✅/❌ |
| Response Latency P95 | <8s | __s | ✅/❌ |
| App Render Success | 100% | __/__ | ✅/❌ |
| Accessibility Score | >90 | __% | ✅/❌ |
| Cold Start Time | <2s | __ms | ✅/❌ |
| App File Size (max) | <50KB | __KB | ✅/❌ |
| Security (critical) | 0 | __ | ✅/❌ |
## Layer Results
| Layer | Status | Issues | Details |
|-------|--------|--------|---------|
| 0 — Protocol | ✅/⚠️/❌ | {count} | {notes} |
| 1 — Static | ✅/⚠️/❌ | {count} | {notes} |
| 2 — Visual | ✅/⚠️/❌ | {count} | {notes} |
| 2.5 — Accessibility | ✅/⚠️/❌ | {count} | {notes} |
| 3 — Functional | ✅/⚠️/❌ | {count} | {notes} |
| 3.5 — Performance | ✅/⚠️/❌ | {count} | {notes} |
| 4 — Live API | ✅/⚠️/❌/⏭️ | {count} | {notes} |
| 4.5 — Security | ✅/⚠️/❌ | {count} | {notes} |
| 5 — Integration | ✅/⚠️/❌ | {count} | {notes} |
## Overall: {PASS / PASS WITH WARNINGS / FAIL}
---
## Issues Found
### Critical (must fix before ship)
1. {issue}: {description} — {file:line}
### Warnings (should fix)
1. {issue}: {description}
### Notes (nice to have)
1. {observation}
---
## App-by-App Results
### {app-id-1}
- Visual: ✅/❌ — {notes}
- Accessibility: Score __% — {violations}
- Data flow: ✅/❌ — {notes}
- States (loading/empty/data): ✅/❌
- File size: __KB
- XSS test: ✅/❌
- Screenshot: {path}
---
## Tool Invocation Results
| # | NL Message | Expected Tool | Actual Tool | Correct? | Latency |
|---|-----------|---------------|-------------|----------|---------|
| 1 | "Show me all contacts" | list_contacts | | ✅/❌ | ms |
| 2 | "Find John Smith" | search_contacts | | ✅/❌ | ms |
| ... | | | | | |
| 20 | | | | | |
**Tool Correctness Rate: __/20 = __%**
---
## E2E Scenario Results
| # | Scenario | Steps | Completed? | Latency | Notes |
|---|----------|-------|-----------|---------|-------|
| 1 | {name} | {n} | ✅/❌ | ms | |
| ... | | | | | |
| 10 | | | | | |
**Task Completion Rate: __/10 = __%**
---
## Trend (vs Previous Report)
| Metric | Previous | Current | Change |
|--------|----------|---------|--------|
| Tool Correctness | __% | __% | +/-__% |
| Task Completion | __% | __% | +/-__% |
| Accessibility | __% | __% | +/-__% |
| Avg Latency | __s | __s | +/-__s |
---
## Recommendations
1. {what to fix/improve before shipping}
2. {items for next QA cycle}
---
*Report saved to: mcp-factory-reviews/{service}/qa-report-{date}.md*
*Previous reports in same directory for trending.*
Report Trending Script
#!/bin/bash
# Aggregate QA trends across reports
# Usage: ./qa-trend.sh <service-name>
SERVICE="$1"
REPORT_DIR="$HOME/.clawdbot/workspace/mcp-factory-reviews/${SERVICE}"
if [ ! -d "$REPORT_DIR" ]; then
echo "No reports found for ${SERVICE}"
exit 1
fi
echo "=== QA Trend: ${SERVICE} ==="
echo ""
echo "| Date | Overall | Pass | Fail | Warn |"
echo "|------|---------|------|------|------|"
for report in "$REPORT_DIR"/qa-report-*.md; do
  [ -e "$report" ] || continue
  DATE=$(basename "$report" | sed 's/^qa-report-//; s/\.md$//')
  OVERALL=$(grep "^## Overall:" "$report" 2>/dev/null | head -1 | sed 's/^## Overall:[[:space:]]*//')
  PASS=$(grep "✅ Passed" "$report" 2>/dev/null | grep -o '[0-9]*' | head -1 || echo "?")
  FAIL=$(grep "❌ Failed" "$report" 2>/dev/null | grep -o '[0-9]*' | head -1 || echo "?")
  WARN=$(grep "⚠️" "$report" 2>/dev/null | grep -o '[0-9]*' | head -1 || echo "?")
  echo "| ${DATE} | ${OVERALL} | ${PASS} | ${FAIL} | ${WARN} |"
done
Quick Reference Commands
# ─── LAYER 0 ───
# MCP Inspector (protocol compliance)
npx @modelcontextprotocol/inspector stdio node dist/index.js
# ─── LAYER 1 ───
# Quick compile + type check
cd {service}-mcp && npm run build && npx tsc --noEmit
# ─── LAYER 2 ───
# Run Playwright visual tests
npx playwright test tests/visual.test.ts
# Run BackstopJS regression
backstop test
# Capture new baselines
backstop reference
# ─── LAYER 2.5 ───
# Run accessibility tests
npx playwright test tests/accessibility.test.ts
# ─── LAYER 3 ───
# Run Jest unit tests
npx jest --verbose
# Run tool routing tests
npx jest tests/tool-routing.test.ts
# Validate APP_DATA schemas
npx ts-node tests/app-data-validator.ts
# ─── LAYER 3.5 ───
# Cold start benchmark
time echo '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2025-11-25","capabilities":{},"clientInfo":{"name":"perf","version":"1.0"}}}' | timeout 10 node dist/index.js | head -1
# File size audit
for f in app-ui/*.html; do echo "$(wc -c < "$f" | tr -d ' ') $f"; done | sort -n
# ─── LAYER 4 ───
# Start server for manual testing
node dist/index.js
# ─── LAYER 4.5 ───
# Security scan
grep -rni "apikey\|api_key\|secret\|sk_live" app-ui/ --include="*.html"
# ─── LAYER 5 ───
# Full automated pipeline
./scripts/mcp-qa.sh {service-name}
# Trend report
./scripts/qa-trend.sh {service-name}
# ─── BROWSER TOOLS ───
# Screenshot via browser tool
# browser → open → http://192.168.0.25:3000 → navigate → screenshot
# Monitor postMessages in browser console
# window.addEventListener('message', e => console.log('[PM]', e.data.type, e.data))
# axe-core in browser console (paste the snippet from Layer 2.5.2)
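The file-size audit above is a shell one-liner; for CI it can be useful to fail the build programmatically against the <50KB budget from the metrics table. A minimal TypeScript sketch (the file name and `auditSizes` helper are illustrative, not part of the framework):

```typescript
// check-size-budget.ts — flag app HTML files that exceed the 50KB budget
// ("<50KB" target from the quality metrics table above).
import { readdirSync, statSync } from "fs";
import { join } from "path";

const BUDGET_BYTES = 50 * 1024;

export function auditSizes(dir: string): { file: string; bytes: number; ok: boolean }[] {
  return readdirSync(dir)
    .filter((f) => f.endsWith(".html"))
    .map((f) => {
      const bytes = statSync(join(dir, f)).size;
      return { file: f, bytes, ok: bytes < BUDGET_BYTES };
    })
    .sort((a, b) => b.bytes - a.bytes); // largest first, like the sort -n audit
}
```

A CI wrapper would call `auditSizes("app-ui")` and exit non-zero when any entry has `ok: false`.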
Common Issues & Fixes
| Symptom | Layer | Cause | Fix |
|---|---|---|---|
| App shows blank white screen | 2 | HTML file not found or wrong path | Check APP_NAME_MAP + APP_DIRS in route.ts |
| App shows loading forever | 3 | postMessage not received | Check data block format: <!--APP_DATA:{...}:END_APP_DATA--> |
| App renders but wrong data | 3 | APP_DATA JSON shape mismatch | Compare tool response fields with app's render() expectations |
| Tool not triggered by NL | 3 | Poor tool description | Add "do NOT use when" disambiguation |
| Wrong tool triggered | 3 | Similar tool descriptions | Add negative examples to both competing tools |
| Thread panel empty | 3 | Thread state not persisted | Check localStorage lb-threads key |
| Console error: CORS | 2 | iframe cross-origin issue | Ensure app served from same origin |
| Dark theme wrong | 2 | Hardcoded light colors | Audit CSS for #fff, white, #f colors |
| Overflow at narrow width | 2 | Fixed widths in CSS | Use max-width: 100%, overflow-x: auto, flex/grid |
| axe-core contrast fail | 2.5 | Text color too dim | Use #b0b2b8+ for secondary text (not #96989d) |
| MCP Inspector fails | 0 | Protocol error in server | Check initialize handler, verify JSON-RPC framing |
| Cold start >2s | 3.5 | Heavy imports at startup | Use lazy loading for tool groups |
| structuredContent mismatch | 0 | Output doesn't match outputSchema | Validate tool return against declared schema |
| APP_DATA parse fails | 3 | LLM produced invalid JSON | Use robust parser with newline stripping + trailing comma fix |
| XSS detected | 4.5 | Missing escapeHtml on field | Add escapeHtml() to all dynamic text insertions |
| Key exposure | 4.5 | API key in HTML file | Move to server-side only, never send to client |
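The last two fixes in the table reference helpers that are not shown in this section. A minimal sketch of both, assuming the `<!--APP_DATA:{...}:END_APP_DATA-->` block format described above (function names are illustrative; the regex fixes cover only the common defects named in the table, not every malformed payload):

```typescript
// Extract and parse an APP_DATA block, tolerating the two common LLM JSON
// defects from the table: literal newlines inside the block and trailing
// commas before } or ]. Returns null when no block is found or parsing fails.
export function parseAppData(html: string): unknown | null {
  const match = html.match(/<!--APP_DATA:([\s\S]*?):END_APP_DATA-->/);
  if (!match) return null;
  const raw = match[1]
    .replace(/[\r\n]+/g, " ")        // strip literal newlines
    .replace(/,\s*([}\]])/g, "$1");  // drop trailing commas
  try {
    return JSON.parse(raw);
  } catch {
    return null;
  }
}

// Escape dynamic text before inserting it into app HTML (the XSS fix above).
export function escapeHtml(s: string): string {
  return s
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}
```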
Project Setup: Adding Tests to an Existing Server
When adding this test framework to a server that doesn't have it yet:
cd {service}-mcp
# 1. Install test dependencies
npm install -D jest ts-jest @types/jest msw playwright @playwright/test @axe-core/playwright ajv pngjs pixelmatch backstopjs
# 2. Add Jest config
cat > jest.config.ts << 'EOF'
export default {
  preset: 'ts-jest',
  testEnvironment: 'node',
  testRegex: 'tests/.*\\.test\\.ts$',
  // Playwright owns these specs; keep them out of Jest's run
  testPathIgnorePatterns: ['visual.test.ts', 'accessibility.test.ts', 'chaos.test.ts'],
  setupFilesAfterEnv: ['./tests/setup.ts'],
};
EOF
# 3. Add Playwright config
cat > playwright.config.ts << 'EOF'
import { defineConfig, devices } from '@playwright/test';
export default defineConfig({
testDir: './tests',
testMatch: ['visual.test.ts', 'accessibility.test.ts', 'chaos.test.ts'],
projects: [
{ name: 'chromium', use: { ...devices['Desktop Chrome'] } },
{ name: 'firefox', use: { ...devices['Desktop Firefox'] } },
{ name: 'webkit', use: { ...devices['Desktop Safari'] } },
],
});
EOF
# 4. Create directory structure
mkdir -p tests test-fixtures test-baselines/backstop test-baselines/app-data-schemas test-results/screenshots
# 5. Create initial fixture files
# (copy from the fixtures library section above)
# 6. Add scripts to package.json
npm pkg set scripts.test="jest"
npm pkg set scripts.test:visual="playwright test"
npm pkg set scripts.test:a11y="playwright test tests/accessibility.test.ts"
npm pkg set scripts.test:all="jest && playwright test"
npm pkg set scripts.qa="../../scripts/mcp-qa.sh $(basename $(pwd) -mcp)"
# 7. Install Playwright browsers
npx playwright install
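The command list above invokes `tests/app-data-validator.ts` but its contents are not shown in this section. The real validator can lean on the `ajv` dependency installed in step 1 with the JSON Schemas in `test-baselines/app-data-schemas/`; the core check is simple enough to sketch without it (the `FieldSpec` shape and `validateAppData` name are illustrative):

```typescript
// Minimal structural check: does a tool's APP_DATA payload expose the
// fields the app's render() expects? A fuller version would run ajv
// against the baseline schemas instead of this hand-rolled spec.
type FieldSpec = { name: string; type: "string" | "number" | "boolean" | "array" | "object" };

export function validateAppData(data: Record<string, unknown>, expected: FieldSpec[]): string[] {
  const errors: string[] = [];
  for (const field of expected) {
    const value = data[field.name];
    if (value === undefined) {
      errors.push(`missing field: ${field.name}`);
      continue;
    }
    const actual = Array.isArray(value) ? "array" : typeof value;
    if (actual !== field.type) {
      errors.push(`field ${field.name}: expected ${field.type}, got ${actual}`);
    }
  }
  return errors;
}
```

Feeding each tool's captured APP_DATA through a spec like this is what backs the "APP_DATA Schema Match" row in the metrics table.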