Skills added:
- mcp-api-analyzer (43KB) — Phase 1: API analysis
- mcp-server-builder (88KB) — Phase 2: Server build
- mcp-server-development (31KB) — TS MCP patterns
- mcp-app-designer (85KB) — Phase 3: Visual apps
- mcp-apps-integration (20KB) — structuredContent UI
- mcp-apps-official (48KB) — MCP Apps SDK
- mcp-apps-merged (39KB) — Combined apps reference
- mcp-localbosses-integrator (61KB) — Phase 4: LocalBosses wiring
- mcp-qa-tester (113KB) — Phase 5: Full QA framework
- mcp-deployment (17KB) — Phase 6: Production deploy
- mcp-skill (exa integration)

These skills are the encoded knowledge that lets agents build production-quality MCP servers autonomously through the pipeline.
MCP QA Tester — Automated Testing Framework & Quality Metrics Pipeline
When to use this skill: Testing MCP servers, apps, and their LocalBosses integration. Use after Phase 4 (integration) to verify everything works — at the protocol level, visually, functionally, and against live APIs. This is an automated-first framework with quantitative metrics, regression baselines, and persistent reporting.
What this covers: MCP protocol compliance, automated unit/visual/functional testing, accessibility auditing, performance benchmarking, security validation, chaos testing, and quantitative quality metrics with regression tracking.
Testing Architecture
Layer 0: Protocol Compliance ─── MCP Inspector + JSON-RPC lifecycle validation
Layer 1: Static Analysis ──────── TypeScript build, linting, file structure, schema validation
Layer 2: Visual Testing ────────── Playwright screenshots, BackstopJS regression, Gemini analysis
Layer 2.5: Accessibility ────────── axe-core, keyboard nav, contrast audit, screen reader compat
Layer 3: Functional Testing ───── Tool routing smoke tests, data flow validation, thread lifecycle
Layer 3.5: Performance ────────── Cold start, latency, memory, file size budgets
Layer 4: Live API Testing ──────── Real API calls with credential management strategy
Layer 4.5: Security ────────────── XSS, CSP, postMessage origin, key exposure
Layer 5: Integration Testing ──── Full E2E scenarios, chaos testing, cross-browser validation
Every layer has quantitative pass/fail criteria. Do NOT skip layers — issues compound.
Quantitative Quality Metrics (REQUIRED)
Every QA report MUST include these metrics. No more pass/fail checklists — we measure.
| Metric | Target | Method | Priority |
|---|---|---|---|
| MCP Protocol Compliance | 100% | MCP Inspector — all checks pass | P0 |
| Tool Correctness Rate | >95% | Run 20 NL messages, count correct tool selections | P0 |
| Task Completion Rate | >90% | Run 10 E2E scenarios, count fully completed | P0 |
| APP_DATA Schema Match | 100% | Validate every APP_DATA against JSON schema | P0 |
| Response Latency P50 | <3s | Measure 10 read interactions | P1 |
| Response Latency P95 | <8s | Measure 10 interactions (reads + writes) | P1 |
| App Render Success | 100% | All apps render data state without console errors | P0 |
| Accessibility Score | >90 | axe-core audit on every app HTML | P1 |
| Cold Start Time | <2s | time node dist/index.js → first ListTools response | P1 |
| App File Size | <50KB each | Check all HTML files | P1 |
| Security Scan | 0 critical | XSS + CSP + key exposure checks | P0 |
How to calculate:
Tool Correctness Rate = (correct_tool_selections / total_test_messages) × 100
Task Completion Rate = (completed_scenarios / total_scenarios) × 100
APP_DATA Schema Match = (valid_app_data_blocks / total_app_data_blocks) × 100
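These rates, plus the latency percentiles from the table, reduce to a few lines of code. A minimal sketch — the helper names are illustrative, not part of any framework:

```typescript
// Metric helpers for the QA report. `rate` covers Tool Correctness,
// Task Completion, and APP_DATA Schema Match; `percentile` covers
// Response Latency P50/P95 (nearest-rank method).

function rate(hits: number, total: number): number {
  // (hits / total) × 100; an empty sample reports 100 rather than NaN
  return total === 0 ? 100 : (hits / total) * 100;
}

function percentile(samplesMs: number[], p: number): number {
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// e.g. 19 of 20 NL messages routed to the correct tool → 95,
// which does NOT pass the strictly-greater ">95%" gate
const toolCorrectness = rate(19, 20);
```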
Layer 0: MCP Protocol Compliance Testing
Why this layer exists: The MCP spec defines exact JSON-RPC lifecycle, tool definition formats, and error codes. If the server isn't protocol-compliant, nothing else matters. This is the foundation.
0.1 — MCP Inspector (Official Tool)
# Install and run MCP Inspector against the server
npx @modelcontextprotocol/inspector node dist/index.js
# The Inspector validates:
# ✅ initialize → initialized lifecycle
# ✅ tools/list response format
# ✅ tools/call request/response format
# ✅ JSON-RPC message framing
# ✅ Capability negotiation
# ✅ Notification handling
0.2 — Automated Protocol Test Script
Save as tests/protocol-compliance.test.ts:
import { spawn, ChildProcess } from 'child_process';
import * as readline from 'readline';
// Minimal JSON-RPC client for testing MCP servers over stdio
class MCPTestClient {
private proc: ChildProcess;
private rl: readline.Interface;
private pending: Map<number, { resolve: Function; reject: Function }> = new Map();
private nextId = 1;
private notifications: any[] = [];
constructor(command: string, args: string[]) {
this.proc = spawn(command, args, { stdio: ['pipe', 'pipe', 'pipe'] });
this.rl = readline.createInterface({ input: this.proc.stdout! });
this.rl.on('line', (line) => {
try {
const msg = JSON.parse(line);
if (msg.id && this.pending.has(msg.id)) {
this.pending.get(msg.id)!.resolve(msg);
this.pending.delete(msg.id);
} else if (!msg.id) {
this.notifications.push(msg);
}
} catch (e) { /* ignore non-JSON lines */ }
});
}
async request(method: string, params?: any): Promise<any> {
const id = this.nextId++;
const msg = JSON.stringify({ jsonrpc: '2.0', id, method, params: params || {} });
this.proc.stdin!.write(msg + '\n');
return new Promise((resolve, reject) => {
this.pending.set(id, { resolve, reject });
setTimeout(() => {
if (this.pending.has(id)) {
this.pending.delete(id);
reject(new Error(`Timeout on ${method}`));
}
}, 10000);
});
}
// True JSON-RPC notification: no id field, so no response is expected
notify(method: string, params?: any) {
this.proc.stdin!.write(JSON.stringify({ jsonrpc: '2.0', method, params: params || {} }) + '\n');
}
getNotifications() { return this.notifications; }
async close() {
this.proc.kill();
}
}
describe('MCP Protocol Compliance', () => {
let client: MCPTestClient;
beforeAll(async () => {
client = new MCPTestClient('node', ['dist/index.js']);
});
afterAll(async () => {
await client.close();
});
test('initialize → initialized lifecycle', async () => {
const initResult = await client.request('initialize', {
protocolVersion: '2025-11-25',
capabilities: {},
clientInfo: { name: 'qa-test-client', version: '1.0.0' }
});
expect(initResult.result).toBeDefined();
expect(initResult.result.protocolVersion).toBeDefined();
expect(initResult.result.capabilities).toBeDefined();
expect(initResult.result.serverInfo).toBeDefined();
expect(initResult.result.serverInfo.name).toBeTruthy();
expect(initResult.result.serverInfo.version).toBeTruthy();
// Send initialized as a real notification (no id field)
client.notify('notifications/initialized');
});
test('tools/list returns valid tool definitions', async () => {
const result = await client.request('tools/list', {});
expect(result.result).toBeDefined();
expect(result.result.tools).toBeInstanceOf(Array);
expect(result.result.tools.length).toBeGreaterThan(0);
for (const tool of result.result.tools) {
// Required fields per MCP 2025-11-25
expect(tool.name).toBeTruthy();
expect(tool.description).toBeTruthy();
expect(typeof tool.name).toBe('string');
expect(typeof tool.description).toBe('string');
// Name format: must be alphanumeric + underscores/hyphens/dots
expect(tool.name).toMatch(/^[a-zA-Z0-9_.\-]+$/);
// inputSchema must be valid JSON Schema object
if (tool.inputSchema) {
expect(tool.inputSchema.type).toBe('object');
}
// If title exists, must be string
if (tool.title) {
expect(typeof tool.title).toBe('string');
}
// If outputSchema exists, validate it
if (tool.outputSchema) {
expect(tool.outputSchema.type).toBeDefined();
}
// If annotations exist, validate known fields
if (tool.annotations) {
const validAnnotations = [
'readOnlyHint', 'destructiveHint', 'idempotentHint', 'openWorldHint'
];
for (const key of Object.keys(tool.annotations)) {
if (validAnnotations.includes(key)) {
expect(typeof tool.annotations[key]).toBe('boolean');
}
}
}
}
});
test('tools/call returns valid response for read-only tools', async () => {
// Get list of tools first
const listResult = await client.request('tools/list', {});
const readOnlyTools = listResult.result.tools.filter(
(t: any) => t.annotations?.readOnlyHint === true
);
// Test first read-only tool (safest to call)
if (readOnlyTools.length > 0) {
const tool = readOnlyTools[0];
const callResult = await client.request('tools/call', {
name: tool.name,
arguments: {}
});
expect(callResult.result).toBeDefined();
// Result must have content array
if (!callResult.result.isError) {
expect(callResult.result.content).toBeInstanceOf(Array);
for (const item of callResult.result.content) {
expect(item.type).toBeDefined();
// Text content must have text field
if (item.type === 'text') {
expect(typeof item.text).toBe('string');
}
}
}
// If structuredContent exists, validate against outputSchema
if (callResult.result.structuredContent && tool.outputSchema) {
// Basic type check — full JSON Schema validation is in the schema validator section
expect(typeof callResult.result.structuredContent).toBe('object');
}
}
});
test('error responses use correct JSON-RPC error codes', async () => {
// Call non-existent tool — should get method not found or tool error
const result = await client.request('tools/call', {
name: 'nonexistent_tool_that_should_not_exist_12345',
arguments: {}
});
// Should be an error response
expect(
result.error || result.result?.isError
).toBeTruthy();
// If protocol error, must use standard JSON-RPC codes
if (result.error) {
expect(result.error.code).toBeDefined();
expect(typeof result.error.code).toBe('number');
expect(result.error.message).toBeTruthy();
// Standard codes: -32700 (parse), -32600 (invalid request),
// -32601 (method not found), -32602 (invalid params), -32603 (internal)
}
});
test('notification handling works', async () => {
// Server should handle ping
try {
await client.request('ping', {});
// If no error, ping is supported
} catch (e) {
// Ping timeout is acceptable for some servers
}
});
});
0.3 — structuredContent Validation
// tests/structured-content.test.ts
import Ajv from 'ajv';
const ajv = new Ajv({ allErrors: true });
function validateStructuredContent(
toolName: string,
outputSchema: object,
structuredContent: any
): { valid: boolean; errors: string[] } {
const validate = ajv.compile(outputSchema);
const valid = validate(structuredContent);
return {
valid: !!valid,
errors: valid ? [] : (validate.errors || []).map(e =>
`${e.instancePath} ${e.message}`
)
};
}
// Run this after getting tools/list + tools/call results
describe('structuredContent schema validation', () => {
test('every tool with outputSchema returns conforming structuredContent', async () => {
// This would be populated from actual tool calls
const toolResults: Array<{
toolName: string;
outputSchema: object;
structuredContent: any;
}> = []; // Populate from Layer 4 results
for (const { toolName, outputSchema, structuredContent } of toolResults) {
if (structuredContent && outputSchema) {
const result = validateStructuredContent(toolName, outputSchema, structuredContent);
expect(result.valid).toBe(true);
if (!result.valid) {
console.error(`Schema mismatch for ${toolName}:`, result.errors);
}
}
}
});
});
0.4 — Tasks & Elicitation Testing (2025-11-25 Spec)
If the server declares tasks capability (async operations via SEP-1686), test the task lifecycle:
test('tasks/list returns valid task list', async () => {
const result = await client.request('tasks/list', {});
if (result.result) {
expect(result.result.tasks).toBeInstanceOf(Array);
}
// Some servers may not implement tasks — that's OK, just verify no crash
});
test('long-running tool call returns task reference when task-enabled', async () => {
// If a tool has execution.taskSupport = "required" or "optional",
// calling it with _meta.taskId should return a task reference
// rather than blocking until completion
const listResult = await client.request('tools/list', {});
const taskTools = listResult.result.tools.filter(
(t: any) => t.execution?.taskSupport === 'required' || t.execution?.taskSupport === 'optional'
);
// Log task-capable tools for the report
console.log(`Task-capable tools: ${taskTools.map((t: any) => t.name).join(', ') || 'none'}`);
});
If the server uses elicitation (elicitation/create), test that:
- Elicitation requests include a valid requestedSchema with JSON Schema
- The server handles user-provided elicitation responses gracefully
- URL mode elicitation (2025-11-25) correctly redirects to external URLs
- The server doesn't hang if elicitation is denied by the client
test('server handles elicitation denial gracefully', async () => {
// If server requests elicitation and client denies, server should
// return a useful error message, not crash or hang
// This is tested implicitly by calling tools without providing
// elicitation responses — the server should timeout or fallback
});
Quality Gate:
- MCP Inspector passes all checks
- initialize → initialized lifecycle works
- tools/list returns valid, non-empty tool array
- All tool names match /^[a-zA-Z0-9_.\-]+$/
- All tool descriptions are non-empty strings
- tools/call returns valid content arrays
- structuredContent (if present) matches outputSchema
- Error responses use correct JSON-RPC codes
- Server handles unknown methods gracefully (doesn't crash)
Layer 1: Static Analysis
1.1 — TypeScript Compilation
cd {service}-mcp
npm run build 2>&1
# Must exit 0 with no errors
# Warnings are OK but should be reviewed
# Separate type-check (catches issues build might miss)
npx tsc --noEmit 2>&1
1.2 — Code Quality Checks
# Check for `any` types (red flag)
grep -rn ": any" src/ --include="*.ts" | grep -v "node_modules" | grep -v "// eslint" | grep -v "catch"
# Goal: zero instances in tool handlers
# Exception: catch(error: any) is acceptable
# Check for console.log (should use structured logging)
grep -rn "console.log" src/ --include="*.ts" | grep -v "node_modules"
# Goal: zero — use console.error for MCP server logging
# Check SDK version is pinned appropriately
node -e "const p = require('./package.json'); console.log('SDK:', p.dependencies['@modelcontextprotocol/sdk'])"
# Should be ^1.26.0 or higher (security fix: GHSA-345p-7cg4-v4c7)
# Check Zod version
node -e "const p = require('./package.json'); console.log('Zod:', p.dependencies['zod'])"
# Should be ^3.25.0 or higher
1.3 — HTML App Validation
# Check all app HTML files exist and are within size budget
for f in app-ui/*.html ui/dist/*.html; do
if [ -f "$f" ]; then
SIZE=$(wc -c < "$f" | tr -d ' ')
if [ "$SIZE" -gt 51200 ]; then
echo "⚠️ $f ($SIZE bytes) — EXCEEDS 50KB budget"
else
echo "✅ $f ($SIZE bytes)"
fi
else
echo "❌ $f MISSING"
fi
done
1.4 — Route Mapping Cross-Reference
# Verify every app ID in channels.ts has a matching entry in ALL integration files
node -e "
const fs = require('fs');
const path = require('path');
const LB_ROOT = 'localbosses-app/src';
const files = {
channels: fs.readFileSync(path.join(LB_ROOT, 'lib/channels.ts'), 'utf8'),
appNames: fs.readFileSync(path.join(LB_ROOT, 'lib/appNames.ts'), 'utf8'),
intakes: fs.readFileSync(path.join(LB_ROOT, 'lib/app-intakes.ts'), 'utf8'),
route: fs.readFileSync(path.join(LB_ROOT, 'app/api/mcp-apps/route.ts'), 'utf8'),
};
// Extract app IDs from channels (anything in mcpApps arrays)
const channelApps = [...files.channels.matchAll(/['\"]([a-z0-9-]+)['\"]/g)]
.map(m => m[1])
.filter(id => id.length > 3 && !['true','false','null'].includes(id));
let issues = 0;
const unique = [...new Set(channelApps)];
for (const id of unique) {
const inNames = files.appNames.includes(id);
const inIntakes = files.intakes.includes(id);
const inRoute = files.route.includes(id);
if (!inNames || !inIntakes || !inRoute) {
console.log('❌ ' + id + ': ' +
(!inNames ? 'MISSING appNames ' : '') +
(!inIntakes ? 'MISSING app-intakes ' : '') +
(!inRoute ? 'MISSING route ' : ''));
issues++;
}
}
if (issues === 0) console.log('✅ All ' + unique.length + ' app IDs cross-referenced');
else console.log('\\n⚠️ ' + issues + ' cross-reference issues found');
"
Quality Gate:
- TypeScript compiles with zero errors
- tsc --noEmit passes clean
- No unintended any types in tool handlers
- SDK pinned to ^1.26.0+, Zod to ^3.25.0+ (Do NOT use Zod v4.x with SDK v1.x — known incompatibility, issue #1429)
- All HTML app files exist, are >1KB and <50KB
- All app IDs cross-referenced across channels, appNames, app-intakes, and route map
- All route mappings resolve to actual HTML files
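The last gate item (route mappings resolve to real files) can be scripted the same way as the cross-reference check in 1.4. A sketch, assuming the route map quotes app HTML files by filename — the helper names and candidate directory list are assumptions to adapt to your layout:

```typescript
import * as fs from 'fs';
import * as path from 'path';

// Pull every quoted "*.html" filename out of the route-map source
function extractHtmlRefs(routeSrc: string): string[] {
  return [...new Set(
    [...routeSrc.matchAll(/['"]([\w.-]+\.html)['"]/g)].map(m => m[1])
  )];
}

// Return the refs that exist in none of the candidate directories
function findUnresolved(
  refs: string[],
  dirs: string[],
  exists: (p: string) => boolean = fs.existsSync
): string[] {
  return refs.filter(ref => !dirs.some(d => exists(path.join(d, ref))));
}

// Usage against the real tree:
// const src = fs.readFileSync('localbosses-app/src/app/api/mcp-apps/route.ts', 'utf8');
// const missing = findUnresolved(extractHtmlRefs(src), ['app-ui', 'ui/dist']);
// missing.forEach(f => console.log('❌ unresolved route target: ' + f));
```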
Layer 2: Visual Testing
2.1 — Automated Playwright Visual Tests
Save as tests/visual.test.ts:
import { test, expect, Page } from '@playwright/test';
import * as fs from 'fs';
import * as path from 'path';
// Configuration
const APP_UI_DIR = path.resolve(__dirname, '../app-ui');
const SCREENSHOTS_DIR = path.resolve(__dirname, '../test-results/screenshots');
const BASELINES_DIR = path.resolve(__dirname, '../test-baselines/screenshots');
const FIXTURES_DIR = path.resolve(__dirname, '../test-fixtures');
// Ensure directories exist
fs.mkdirSync(SCREENSHOTS_DIR, { recursive: true });
// Discover all HTML app files
const appFiles = fs.readdirSync(APP_UI_DIR)
.filter(f => f.endsWith('.html'))
.map(f => path.join(APP_UI_DIR, f));
// Load fixture for app type (or use default)
function loadFixture(appFile: string): any {
const baseName = path.basename(appFile, '.html');
const fixturePath = path.join(FIXTURES_DIR, `${baseName}.json`);
if (fs.existsSync(fixturePath)) {
return JSON.parse(fs.readFileSync(fixturePath, 'utf8'));
}
// Default fixture
return {
title: 'Test Data',
data: [
{ name: 'Test Item 1', status: 'active', value: 100 },
{ name: 'Test Item 2', status: 'inactive', value: 200 },
{ name: 'Test Item 3', status: 'pending', value: 300 },
],
meta: { total: 3, page: 1, pageSize: 25 }
};
}
for (const appFile of appFiles) {
const appName = path.basename(appFile, '.html');
test.describe(`Visual: ${appName}`, () => {
let page: Page;
test.beforeEach(async ({ browser }) => {
page = await browser.newPage({ viewport: { width: 400, height: 600 } });
await page.goto(`file://${appFile}`);
// Collect console errors
page.on('console', msg => {
if (msg.type() === 'error') {
console.error(`[${appName}] Console error:`, msg.text());
}
});
});
test.afterEach(async () => {
await page.close();
});
test('renders loading state initially', async () => {
// Before any data, loading state should show
const loading = page.locator('#loading');
const content = page.locator('#content');
// At least one should be visible
const loadingVis = await loading.isVisible().catch(() => false);
const contentVis = await content.isVisible().catch(() => false);
expect(loadingVis || contentVis).toBe(true);
await page.screenshot({
path: path.join(SCREENSHOTS_DIR, `${appName}-loading.png`)
});
});
test('renders empty state', async () => {
// Inject empty data
await page.evaluate(() => {
window.postMessage({ type: 'mcp_app_data', data: {} }, '*');
});
await page.waitForTimeout(500);
// Should show empty state, not crash
const hasError = await page.evaluate(() => {
return document.body.innerText.includes('Error') ||
document.body.innerText.includes('undefined');
});
await page.screenshot({
path: path.join(SCREENSHOTS_DIR, `${appName}-empty.png`)
});
// No JS crashes
expect(hasError).toBe(false);
});
test('renders data state without console errors', async () => {
const fixture = loadFixture(appFile);
const consoleErrors: string[] = [];
page.on('console', msg => {
if (msg.type() === 'error') consoleErrors.push(msg.text());
});
// Inject fixture data
await page.evaluate((data) => {
window.postMessage({ type: 'mcp_app_data', data }, '*');
}, fixture);
await page.waitForTimeout(1000);
// Content should be visible (loading hidden)
const loading = page.locator('#loading');
const loadingHidden = !(await loading.isVisible().catch(() => true));
await page.screenshot({
path: path.join(SCREENSHOTS_DIR, `${appName}-data.png`)
});
expect(loadingHidden).toBe(true);
expect(consoleErrors).toHaveLength(0);
});
test('no horizontal overflow at 320px', async () => {
await page.setViewportSize({ width: 320, height: 600 });
const fixture = loadFixture(appFile);
await page.evaluate((data) => {
window.postMessage({ type: 'mcp_app_data', data }, '*');
}, fixture);
await page.waitForTimeout(500);
const hasOverflow = await page.evaluate(() => {
return document.documentElement.scrollWidth > document.documentElement.clientWidth;
});
await page.screenshot({
path: path.join(SCREENSHOTS_DIR, `${appName}-narrow.png`)
});
expect(hasOverflow).toBe(false);
});
test('dark theme compliance', async () => {
const fixture = loadFixture(appFile);
await page.evaluate((data) => {
window.postMessage({ type: 'mcp_app_data', data }, '*');
}, fixture);
await page.waitForTimeout(500);
// Check background color is dark
const bgColor = await page.evaluate(() => {
return getComputedStyle(document.body).backgroundColor;
});
// Should be dark (r,g,b each < 60)
const match = bgColor.match(/\d+/g);
if (match) {
const [r, g, b] = match.map(Number);
expect(r).toBeLessThan(60);
expect(g).toBeLessThan(60);
expect(b).toBeLessThan(60);
}
});
});
}
2.2 — BackstopJS Visual Regression
# Initialize BackstopJS (one-time setup)
npm install -g backstopjs
backstop init
# Configure backstop.json:
{
"id": "mcp-apps",
"viewports": [
{ "label": "thread-panel", "width": 400, "height": 600 },
{ "label": "narrow", "width": 320, "height": 600 },
{ "label": "wide", "width": 800, "height": 600 }
],
"scenarios": [
{
"label": "contact-grid-data",
"url": "file:///path/to/app-ui/contact-grid.html",
"onReadyScript": "inject-data.js",
"delay": 1000,
"misMatchThreshold": 5.0,
"requireSameDimensions": true
}
],
"paths": {
"bitmaps_reference": "test-baselines/backstop",
"bitmaps_test": "test-results/backstop",
"engine_scripts": "tests/backstop-scripts"
},
"engine": "playwright",
"engineOptions": {
"args": ["--no-sandbox"]
}
}
// tests/backstop-scripts/inject-data.js
module.exports = async (page, scenario, viewport, isReference, browserContext) => {
const fixtures = require('../test-fixtures/' + scenario.label.split('-')[0] + '.json');
await page.evaluate((data) => {
window.postMessage({ type: 'mcp_app_data', data }, '*');
}, fixtures);
await page.waitForTimeout(500);
};
# Capture baselines (run once when apps are verified correct)
backstop reference
# Test against baselines (run on every QA cycle)
backstop test
# Result: PASS if <5% pixel diff, FAIL otherwise
# Visual diff report opens in browser automatically
2.3 — Gemini Multimodal Analysis (Subjective Quality)
# After Playwright captures screenshots, run Gemini for subjective quality:
gemini "Analyze this MCP app screenshot. Check and rate PASS/WARN/FAIL:
1. RENDERING: Does it show real content (not blank/placeholder)?
2. DARK THEME: Background ~#1a1d23, accent ~#ff6d5a, text ~#dcddde
3. LAYOUT: Content properly aligned, no overlapping elements?
4. TYPOGRAPHY: Text readable, proper sizing, no clipping?
5. DATA QUALITY: Does the rendered data look realistic?
6. RESPONSIVENESS: Would this work at 280px (thread panel)?
7. BUGS: Any visual artifacts, broken images, misaligned elements?" -f screenshot.png
Quality Gate:
- All apps render loading → empty → data states without crashes
- Zero console errors in data state
- No horizontal overflow at 320px width
- Dark theme compliance (background RGB each <60)
- BackstopJS regression: <5% pixel diff from baselines
- Gemini subjective review: no FAIL ratings
Layer 2.5: Accessibility Testing
2.5.1 — axe-core Automated Audit
Integrate directly into Playwright tests:
// tests/accessibility.test.ts
import { test, expect, Page } from '@playwright/test';
import AxeBuilder from '@axe-core/playwright';
import * as fs from 'fs';
import * as path from 'path';
const APP_UI_DIR = path.resolve(__dirname, '../app-ui');
const FIXTURES_DIR = path.resolve(__dirname, '../test-fixtures');
const appFiles = fs.readdirSync(APP_UI_DIR)
.filter(f => f.endsWith('.html'));
for (const appFile of appFiles) {
const appName = path.basename(appFile, '.html');
test.describe(`Accessibility: ${appName}`, () => {
test('passes axe-core audit with data loaded', async ({ page }) => {
await page.goto(`file://${path.join(APP_UI_DIR, appFile)}`);
// Load fixture data
const fixturePath = path.join(FIXTURES_DIR, `${appName}.json`);
const fixture = fs.existsSync(fixturePath)
? JSON.parse(fs.readFileSync(fixturePath, 'utf8'))
: { title: 'Test', data: [{ name: 'Test', status: 'active' }] };
await page.evaluate((data) => {
window.postMessage({ type: 'mcp_app_data', data }, '*');
}, fixture);
await page.waitForTimeout(1000);
// Run axe-core
const results = await new AxeBuilder({ page })
.withTags(['wcag2a', 'wcag2aa', 'wcag21a', 'wcag21aa'])
.analyze();
// Log violations for debugging
if (results.violations.length > 0) {
console.log(`\n[${appName}] Accessibility violations:`);
for (const v of results.violations) {
console.log(` ${v.impact}: ${v.id} — ${v.description}`);
console.log(` Help: ${v.helpUrl}`);
for (const node of v.nodes.slice(0, 3)) {
console.log(` Target: ${node.target.join(' > ')}`);
}
}
}
// Calculate score: (passes / (passes + violations)) * 100
const totalChecks = results.passes.length + results.violations.length;
const score = totalChecks > 0
? Math.round((results.passes.length / totalChecks) * 100)
: 100;
console.log(`[${appName}] Accessibility score: ${score}%`);
// Target: >90% score, zero critical/serious violations
const criticalViolations = results.violations.filter(
v => v.impact === 'critical' || v.impact === 'serious'
);
expect(criticalViolations).toHaveLength(0);
expect(score).toBeGreaterThanOrEqual(90);
});
test('all interactive elements reachable via keyboard', async ({ page }) => {
await page.goto(`file://${path.join(APP_UI_DIR, appFile)}`);
// Inject data first
const fixturePath = path.join(FIXTURES_DIR, `${appName}.json`);
const fixture = fs.existsSync(fixturePath)
? JSON.parse(fs.readFileSync(fixturePath, 'utf8'))
: { title: 'Test', data: [{ name: 'Test' }] };
await page.evaluate((data) => {
window.postMessage({ type: 'mcp_app_data', data }, '*');
}, fixture);
await page.waitForTimeout(500);
// Get all interactive elements
const interactiveElements = await page.evaluate(() => {
const selectors = 'a, button, input, select, textarea, [tabindex], [role="button"], [role="link"], [role="tab"]';
const elements = document.querySelectorAll(selectors);
return Array.from(elements).map(el => ({
tag: el.tagName.toLowerCase(),
text: (el as HTMLElement).innerText?.slice(0, 50) || el.getAttribute('aria-label') || '',
tabIndex: (el as HTMLElement).tabIndex,
visible: (el as HTMLElement).offsetParent !== null,
}));
});
// Filter to visible elements
const visibleInteractive = interactiveElements.filter(el => el.visible);
// Tab through all elements and verify focus reaches each
let focusedCount = 0;
for (let i = 0; i < visibleInteractive.length + 5; i++) {
await page.keyboard.press('Tab');
const focused = await page.evaluate(() => {
const el = document.activeElement;
return el ? el.tagName.toLowerCase() : 'none';
});
if (focused !== 'body' && focused !== 'none') {
focusedCount++;
}
}
// At least 80% of visible interactive elements should be reachable
if (visibleInteractive.length > 0) {
const reachRate = focusedCount / visibleInteractive.length;
expect(reachRate).toBeGreaterThanOrEqual(0.8);
}
});
});
}
2.5.2 — Standalone axe-core Snippet (for Browser DevTools)
// Paste this into browser console on any app iframe:
(async () => {
if (!window.axe) {
const s = document.createElement('script');
s.src = 'https://cdnjs.cloudflare.com/ajax/libs/axe-core/4.10.0/axe.min.js';
document.head.appendChild(s);
await new Promise(r => s.onload = r);
}
const results = await axe.run(document, {
runOnly: ['wcag2a', 'wcag2aa', 'wcag21aa']
});
console.log('=== Accessibility Results ===');
console.log(`Passes: ${results.passes.length}`);
console.log(`Violations: ${results.violations.length}`);
const score = Math.round(
(results.passes.length / (results.passes.length + results.violations.length)) * 100
);
console.log(`Score: ${score}%`);
if (results.violations.length > 0) {
console.table(results.violations.map(v => ({
impact: v.impact,
id: v.id,
description: v.description,
nodes: v.nodes.length
})));
}
return results;
})();
2.5.3 — Color Contrast Audit
// Validate contrast ratios for all text elements
// Paste into browser console on any app iframe:
(function auditContrast() {
function luminance(r, g, b) {
const a = [r, g, b].map(v => {
v /= 255;
return v <= 0.03928 ? v / 12.92 : Math.pow((v + 0.055) / 1.055, 2.4);
});
return a[0] * 0.2126 + a[1] * 0.7152 + a[2] * 0.0722;
}
function contrastRatio(rgb1, rgb2) {
const l1 = luminance(...rgb1) + 0.05;
const l2 = luminance(...rgb2) + 0.05;
return l1 > l2 ? l1 / l2 : l2 / l1;
}
function parseRGB(color) {
const m = color.match(/\d+/g);
return m ? m.slice(0, 3).map(Number) : [0, 0, 0];
}
const textElements = document.querySelectorAll('*');
const issues = [];
textElements.forEach(el => {
const style = getComputedStyle(el);
if (!el.textContent?.trim() || style.display === 'none') return;
const fgRGB = parseRGB(style.color);
const bgRGB = parseRGB(style.backgroundColor);
// Skip if background is transparent (would need to walk up)
if (style.backgroundColor === 'rgba(0, 0, 0, 0)') return;
const ratio = contrastRatio(fgRGB, bgRGB);
const fontSize = parseFloat(style.fontSize);
const isBold = parseInt(style.fontWeight) >= 700;
const isLargeText = fontSize >= 24 || (fontSize >= 18.66 && isBold);
const required = isLargeText ? 3.0 : 4.5;
if (ratio < required) {
issues.push({
text: el.textContent.trim().slice(0, 40),
fg: style.color,
bg: style.backgroundColor,
ratio: ratio.toFixed(1),
required: required,
tag: el.tagName
});
}
});
if (issues.length === 0) {
console.log('✅ All text passes WCAG AA contrast requirements');
} else {
console.log(`❌ ${issues.length} contrast failures:`);
console.table(issues);
}
})();
2.5.4 — Screen Reader Testing (macOS VoiceOver)
### VoiceOver Manual Test Procedure:
1. Open the app in Safari (VoiceOver works best with Safari)
2. Enable VoiceOver: Cmd+F5
3. Navigate with VO+Right Arrow through all elements
4. Verify:
- [ ] App title/heading is announced
- [ ] Data table rows are announced with column headers
- [ ] Status badges announce text (not just color)
- [ ] Loading state announces "Loading" or similar
- [ ] Empty state announces helpful message
- [ ] Interactive elements announce their purpose
- [ ] No "blank" or "group" without context
5. Disable VoiceOver: Cmd+F5
Quality Gate:
- axe-core score >90% on all apps
- Zero critical/serious axe violations
- All text meets WCAG AA contrast (4.5:1 normal, 3:1 large)
- Secondary text uses #b0b2b8 or lighter (not #96989d)
- All interactive elements reachable via Tab
- VoiceOver reads meaningful content (no blank/unlabeled regions)
Layer 3: Functional Testing
3.1 — Jest Unit Tests with MSW (Mock Service Worker)
Test tool handlers without hitting real APIs:
// tests/tools.test.ts
import { http, HttpResponse } from 'msw';
import { setupServer } from 'msw/node';
// Mock API responses
const mockContacts = [
{ id: '1', name: 'John Doe', email: 'john@example.com', phone: '555-0101', status: 'active' },
{ id: '2', name: 'Jane Smith', email: 'jane@example.com', phone: '555-0102', status: 'inactive' },
{ id: '3', name: 'Bob Wilson', email: 'bob@example.com', phone: '555-0103', status: 'active' },
];
const handlers = [
// Mock the external API endpoints your tools call
http.get('https://api.example.com/v1/contacts', ({ request }) => {
const url = new URL(request.url);
const page = Number(url.searchParams.get('page') || 1);
const pageSize = Number(url.searchParams.get('pageSize') || 25);
const status = url.searchParams.get('status');
let filtered = mockContacts;
if (status) filtered = filtered.filter(c => c.status === status);
return HttpResponse.json({
data: filtered.slice((page - 1) * pageSize, page * pageSize),
meta: { total: filtered.length, page, pageSize }
});
}),
http.get('https://api.example.com/v1/contacts/:id', ({ params }) => {
const contact = mockContacts.find(c => c.id === params.id);
if (!contact) {
return new HttpResponse(null, { status: 404 });
}
return HttpResponse.json(contact);
}),
http.post('https://api.example.com/v1/contacts', async ({ request }) => {
const body = await request.json() as any;
return HttpResponse.json({
id: 'new-1',
...body,
created_at: new Date().toISOString()
}, { status: 201 });
}),
// Mock 500 error for chaos testing
http.get('https://api.example.com/v1/error-endpoint', () => {
return new HttpResponse(null, { status: 500 });
}),
];
const server = setupServer(...handlers);
beforeAll(() => server.listen({ onUnhandledRequest: 'warn' }));
afterEach(() => server.resetHandlers());
afterAll(() => server.close());
describe('Tool Handlers', () => {
test('list_contacts returns paginated results', async () => {
// Import your actual tool handler
// const { handleListContacts } = require('../src/tools/contacts');
// const result = await handleListContacts({ page: 1, pageSize: 25 });
// For now, test the API client directly
const response = await fetch('https://api.example.com/v1/contacts?page=1&pageSize=25');
const data = await response.json();
expect(data.data).toBeInstanceOf(Array);
expect(data.data.length).toBeGreaterThan(0);
expect(data.meta.total).toBeDefined();
expect(data.meta.page).toBe(1);
// Validate each contact shape
for (const contact of data.data) {
expect(contact.id).toBeTruthy();
expect(contact.name).toBeTruthy();
expect(contact.email).toBeTruthy();
}
});
test('list_contacts filters by status', async () => {
const response = await fetch('https://api.example.com/v1/contacts?status=active');
const data = await response.json();
for (const contact of data.data) {
expect(contact.status).toBe('active');
}
});
test('get_contact returns single contact', async () => {
const response = await fetch('https://api.example.com/v1/contacts/1');
const data = await response.json();
expect(data.id).toBe('1');
expect(data.name).toBe('John Doe');
});
test('get_contact returns 404 for unknown ID', async () => {
const response = await fetch('https://api.example.com/v1/contacts/unknown-99');
expect(response.status).toBe(404);
});
test('create_contact returns created entity', async () => {
const response = await fetch('https://api.example.com/v1/contacts', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ name: 'New Contact', email: 'new@test.com' })
});
const data = await response.json();
expect(response.status).toBe(201);
expect(data.id).toBeTruthy();
expect(data.name).toBe('New Contact');
});
test('handles API 500 errors gracefully', async () => {
const response = await fetch('https://api.example.com/v1/error-endpoint');
expect(response.status).toBe(500);
// Tool handler should return isError: true, not crash
});
});
MSW Mock Validation: Hand-crafted mocks can drift from real API responses. When credentials are available (Layer 4), validate that MSW mock response shapes match actual API responses. Run a script that calls the real API once and diffs the response keys/types against your mock handlers. Update mocks quarterly or whenever the API ships a new version.
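The drift check can be a small pure function: fetch one live response per endpoint (when credentials exist), then diff its key/type shape against the mock. A sketch, with illustrative sample objects standing in for the live fetch:

```typescript
// mock-drift-check.ts — compare a mock object's shape against a live API response
type Shape = Record<string, string>;

function shapeOf(obj: Record<string, unknown>): Shape {
  const shape: Shape = {};
  for (const [key, val] of Object.entries(obj)) {
    shape[key] = Array.isArray(val) ? 'array' : val === null ? 'null' : typeof val;
  }
  return shape;
}

// Returns human-readable drift entries: keys missing, added, or type-changed
function diffShape(mock: Record<string, unknown>, real: Record<string, unknown>): string[] {
  const drift: string[] = [];
  const mockShape = shapeOf(mock);
  const realShape = shapeOf(real);
  for (const key of Object.keys(mockShape)) {
    if (!(key in realShape)) drift.push(`missing in real: ${key}`);
    else if (mockShape[key] !== realShape[key]) drift.push(`type changed: ${key} (${mockShape[key]} -> ${realShape[key]})`);
  }
  for (const key of Object.keys(realShape)) {
    if (!(key in mockShape)) drift.push(`new in real: ${key}`);
  }
  return drift;
}

// Illustrative: the real API changed phone to a number and added a field
const mockContact = { id: '1', name: 'John Doe', phone: '555-0101' };
const realContact = { id: '1', name: 'John Doe', phone: 5550101, tags: [] };
console.log(diffShape(mockContact, realContact));
```

Wire the real side to a single authenticated fetch per endpoint; the diff function itself stays network-free and unit-testable.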
3.2 — Tool Routing Smoke Test
Automated script that sends NL messages and checks tool selection:
// tests/tool-routing.test.ts
import * as fs from 'fs';
import * as path from 'path';
interface RoutingFixture {
message: string;
expectedTool: string;
category: string;
}
// Load routing fixtures (maintain this file!)
const ROUTING_FIXTURES_PATH = path.resolve(__dirname, '../test-fixtures/tool-routing.json');
const routingFixtures: RoutingFixture[] = JSON.parse(
fs.readFileSync(ROUTING_FIXTURES_PATH, 'utf8')
);
describe('Tool Routing', () => {
// This test requires the AI/LLM in the loop — typically run via LocalBosses API
// or by mocking the tool selection logic
test('routing fixtures file is valid', () => {
expect(routingFixtures.length).toBeGreaterThanOrEqual(20);
for (const fixture of routingFixtures) {
expect(fixture.message).toBeTruthy();
expect(fixture.expectedTool).toBeTruthy();
expect(fixture.category).toBeTruthy();
}
});
test('all expected tools exist in server', async () => {
// Parse the server's tool definitions to get available tool names
const toolNames = new Set<string>();
// Read from compiled server or source
// This validates that routing fixtures reference real tools
const srcDir = path.resolve(__dirname, '../src/tools');
if (fs.existsSync(srcDir)) {
const toolFiles = fs.readdirSync(srcDir).filter(f => f.endsWith('.ts'));
for (const file of toolFiles) {
const content = fs.readFileSync(path.join(srcDir, file), 'utf8');
const nameMatches = content.matchAll(/name:\s*['"]([^'"]+)['"]/g);
for (const match of nameMatches) {
toolNames.add(match[1]);
}
}
}
if (toolNames.size > 0) {
for (const fixture of routingFixtures) {
expect(toolNames.has(fixture.expectedTool)).toBe(true);
}
}
});
});
// Tool routing fixtures template — save as test-fixtures/tool-routing.json:
/*
[
{ "message": "Show me all contacts", "expectedTool": "list_contacts", "category": "list" },
{ "message": "Find John Smith", "expectedTool": "search_contacts", "category": "search" },
{ "message": "What's John's email?", "expectedTool": "get_contact", "category": "get" },
{ "message": "Add a new contact", "expectedTool": "create_contact", "category": "create" },
{ "message": "Update John's phone number", "expectedTool": "update_contact", "category": "update" },
{ "message": "Remove the test contact", "expectedTool": "delete_contact", "category": "delete" },
{ "message": "Show me a summary of this month", "expectedTool": "get_dashboard", "category": "analytics" },
... (at least 20 fixtures per server)
]
*/
3.2b — DeepEval LLM-in-the-Loop Tool Routing Evaluation
Static routing fixtures validate that tool names exist, but they don't test whether the LLM actually selects the right tool. Use DeepEval for real LLM tool routing evaluation with ToolCorrectnessMetric and TaskCompletionMetric.
Setup:
pip install deepeval
deepeval login # Optional: for dashboard tracking
Test file — save as tests/tool_routing_eval.py:
# tests/tool_routing_eval.py
# Requires: pip install deepeval anthropic
# Run: deepeval test run tests/tool_routing_eval.py
import json
import os
from deepeval import evaluate
from deepeval.metrics import ToolCorrectnessMetric, TaskCompletionMetric
from deepeval.test_case import LLMTestCase, ToolCall
from anthropic import Anthropic
client = Anthropic()
def load_tool_definitions(server_dir: str) -> list[dict]:
"""Load tool definitions from compiled MCP server."""
# Read tool names/schemas from the source files
# Adapt path to your server structure
import glob
tools = []
for f in glob.glob(f"{server_dir}/src/tools/*.ts"):
with open(f) as fh:
content = fh.read()
# Extract tool definitions (simplified — adapt to your codebase)
import re
for match in re.finditer(r'name:\s*["\'](\w+)["\']', content):
tools.append({"name": match.group(1)})
return tools
def run_agent(message: str, system_prompt: str, tools: list[dict]) -> tuple[str, list[ToolCall]]:
"""Send message through Claude with tools, return response + tool calls."""
# Convert MCP tool defs to Anthropic tool format
anthropic_tools = [
{
"name": t["name"],
"description": t.get("description", f"Tool: {t['name']}"),
"input_schema": t.get("inputSchema", {"type": "object", "properties": {}})
}
for t in tools
]
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=1024,
system=system_prompt,
messages=[{"role": "user", "content": message}],
tools=anthropic_tools,
)
tool_calls = []
text_response = ""
for block in response.content:
if block.type == "tool_use":
tool_calls.append(ToolCall(name=block.name, arguments=block.input))
elif block.type == "text":
text_response += block.text
return text_response, tool_calls
# Load fixtures and system prompt
FIXTURES_PATH = "test-fixtures/tool-routing.json"
SYSTEM_PROMPT_PATH = "test-fixtures/system-prompt.txt"
with open(FIXTURES_PATH) as f:
fixtures = json.load(f)
system_prompt = ""
if os.path.exists(SYSTEM_PROMPT_PATH):
with open(SYSTEM_PROMPT_PATH) as f:
system_prompt = f.read()
# Build test cases
tool_correctness = ToolCorrectnessMetric()
task_completion = TaskCompletionMetric()
test_cases = []
for fixture in fixtures:
response_text, actual_calls = run_agent(
fixture["message"], system_prompt, load_tool_definitions(".")
)
test_cases.append(
LLMTestCase(
input=fixture["message"],
actual_output=response_text,
expected_tools=[ToolCall(name=fixture["expectedTool"])],
tools_called=actual_calls,
)
)
# Evaluate. Note: metric .score on the objects above may reflect only the last
# test case after a batch run; prefer the returned results (or the DeepEval
# dashboard) for aggregate pass rates.
results = evaluate(test_cases, [tool_correctness, task_completion])
print("\n=== DeepEval Results ===")
print(f"Tool Correctness (last case): {tool_correctness.score:.1%}")
print(f"Task Completion (last case): {task_completion.score:.1%}")
# Target: Tool Correctness >95%, Task Completion >90%
When to run: After every tool description change, system prompt update, or model upgrade. This is the REAL test of whether the AI routes correctly — fixture files alone are testing theater.
3.3 — APP_DATA Schema Validator
// tests/app-data-validator.ts
import Ajv from 'ajv';
import * as fs from 'fs';
import * as path from 'path';
const ajv = new Ajv({ allErrors: true, strict: false });
// Define expected schemas per app type
const APP_DATA_SCHEMAS: Record<string, object> = {
'dashboard': {
type: 'object',
required: ['title'],
properties: {
title: { type: 'string' },
metrics: {
type: 'array',
items: {
type: 'object',
required: ['label', 'value'],
properties: {
label: { type: 'string' },
value: { type: ['string', 'number'] },
change: { type: ['string', 'number'] },
trend: { enum: ['up', 'down', 'flat'] }
}
}
},
charts: { type: 'array' },
data: { type: ['array', 'object'] }
}
},
'data-grid': {
type: 'object',
required: ['data'],
properties: {
title: { type: 'string' },
data: {
type: 'array',
items: { type: 'object' },
minItems: 0
},
meta: {
type: 'object',
properties: {
total: { type: 'number' },
page: { type: 'number' },
pageSize: { type: 'number' }
}
},
columns: { type: 'array' }
}
},
'detail-card': {
type: 'object',
properties: {
title: { type: 'string' },
data: { type: 'object' },
sections: { type: 'array' },
fields: { type: 'array' }
}
},
'timeline': {
type: 'object',
properties: {
title: { type: 'string' },
events: {
type: 'array',
items: {
type: 'object',
required: ['date'],
properties: {
date: { type: 'string' },
title: { type: 'string' },
description: { type: 'string' },
type: { type: 'string' }
}
}
},
data: { type: 'array' }
}
},
'pipeline': {
type: 'object',
properties: {
title: { type: 'string' },
stages: {
type: 'array',
items: {
type: 'object',
required: ['name'],
properties: {
name: { type: 'string' },
items: { type: 'array' },
count: { type: 'number' },
value: { type: ['number', 'string'] }
}
}
}
}
}
};
export function validateAppData(
appType: string,
appData: any
): { valid: boolean; errors: string[]; warnings: string[] } {
const errors: string[] = [];
const warnings: string[] = [];
// Basic checks
if (!appData || typeof appData !== 'object') {
return { valid: false, errors: ['APP_DATA is null or not an object'], warnings: [] };
}
// Schema validation
const schema = APP_DATA_SCHEMAS[appType];
if (schema) {
const validate = ajv.compile(schema);
const isValid = validate(appData);
if (!isValid && validate.errors) {
for (const err of validate.errors) {
errors.push(`${err.instancePath || '/'} ${err.message}`);
}
}
} else {
warnings.push(`No schema defined for app type: ${appType}`);
}
// Common checks regardless of app type
if (appData.data && Array.isArray(appData.data)) {
if (appData.data.length === 0) {
warnings.push('data array is empty — app will show empty state');
}
// Check for null/undefined values in data items
for (let i = 0; i < Math.min(appData.data.length, 5); i++) {
const item = appData.data[i];
for (const [key, val] of Object.entries(item || {})) {
if (val === undefined) {
warnings.push(`data[${i}].${key} is undefined (will show as "undefined" in app)`);
}
}
}
}
return { valid: errors.length === 0, errors, warnings };
}
// Parse APP_DATA from AI response text
export function extractAppData(responseText: string): any | null {
// Standard format
const match = responseText.match(/<!--APP_DATA:([\s\S]*?):END_APP_DATA-->/);
if (match) {
try {
// Strip whitespace/newlines that LLMs sometimes add
const cleaned = match[1].replace(/[\n\r]/g, '').trim();
return JSON.parse(cleaned);
} catch (e) {
// Try with more aggressive cleanup
try {
const aggressive = match[1]
.replace(/[\n\r\t]/g, '')
.replace(/,\s*}/g, '}') // trailing commas
.replace(/,\s*]/g, ']') // trailing commas in arrays
.trim();
return JSON.parse(aggressive);
} catch (e2) {
return null;
}
}
}
// Fallback: try to find JSON in code blocks
const codeBlockMatch = responseText.match(/```(?:json)?\s*([\s\S]*?)```/);
if (codeBlockMatch) {
try {
return JSON.parse(codeBlockMatch[1].trim());
} catch (e) {
return null;
}
}
return null;
}
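For reference when writing parser tests, a minimal round-trip of the standard envelope that extractAppData handles (the payload here is an illustrative fixture):

```typescript
// The APP_DATA envelope as it appears embedded in an AI response
const response = [
  'Here are your contacts:',
  '<!--APP_DATA:{"title":"Contacts","data":[{"name":"John Doe"}]}:END_APP_DATA-->'
].join('\n');

// Same extraction regex the parser above uses for the standard format
const match = response.match(/<!--APP_DATA:([\s\S]*?):END_APP_DATA-->/);
const appData = match ? JSON.parse(match[1].trim()) : null;
console.log(appData?.title, appData?.data.length);
```

Fixtures like this make parser regressions (delimiter changes, whitespace handling) cheap to catch.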
3.4 — Thread Lifecycle Testing
### Thread Lifecycle: {channel}
1. [ ] Click app in toolbar → thread panel opens
2. [ ] Intake question appears in thread
3. [ ] Type response → AI processes in thread context
4. [ ] App loads in thread panel (if data returned or skipped)
5. [ ] Send follow-up message → app updates with new data
6. [ ] Close thread panel (X) → panel closes, thread indicator remains
7. [ ] Click thread indicator → panel reopens with preserved state
8. [ ] Delete thread → thread removed, parent message removed
9. [ ] Switch channels → come back → thread state persists (localStorage)
Quality Gate:
- All tool handler unit tests pass (Jest + MSW)
- Tool routing fixtures file has ≥20 test messages
- All routing fixture tools exist in the server
- APP_DATA schema validation passes for all app types
- APP_DATA parser handles malformed JSON gracefully
- Thread lifecycle completes without errors
Layer 3.5: Performance Testing
3.5.1 — Server Cold Start
#!/bin/bash
# Measure cold start time
SERVICE_DIR="$1"
cd "$SERVICE_DIR"
echo "=== Cold Start Benchmark ==="
# Measure time to first ListTools response
START=$(date +%s%N)
echo '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2025-11-25","capabilities":{},"clientInfo":{"name":"perf-test","version":"1.0.0"}}}' | \
timeout 10 node dist/index.js 2>/dev/null | head -1 > /dev/null
END=$(date +%s%N)
ELAPSED=$(( (END - START) / 1000000 ))
echo "Cold start to first response: ${ELAPSED}ms"
if [ "$ELAPSED" -gt 2000 ]; then
echo "❌ FAIL — exceeds 2000ms target"
else
echo "✅ PASS — under 2000ms target"
fi
3.5.2 — Tool Invocation Latency
// tests/performance.test.ts
import { performance } from 'perf_hooks';
describe('Performance', () => {
test('tool invocation overhead is under 100ms (excluding API time)', async () => {
// With MSW intercepting API calls (near-zero latency),
// measure the tool handler overhead itself
const times: number[] = [];
for (let i = 0; i < 10; i++) {
const start = performance.now();
// Call a read-only tool through the handler
// await toolHandler({ page: 1, pageSize: 10 });
const response = await fetch('https://api.example.com/v1/contacts?page=1&pageSize=10');
await response.json();
const elapsed = performance.now() - start;
times.push(elapsed);
}
const sorted = times.sort((a, b) => a - b);
const p50 = sorted[Math.floor(sorted.length * 0.5)];
const p95 = sorted[Math.floor(sorted.length * 0.95)];
console.log(`Tool overhead P50: ${p50.toFixed(1)}ms, P95: ${p95.toFixed(1)}ms`);
expect(p50).toBeLessThan(100);
});
test('memory usage stays under 100MB with all tools loaded', async () => {
const used = process.memoryUsage();
const heapMB = Math.round(used.heapUsed / 1024 / 1024);
const rssMB = Math.round(used.rss / 1024 / 1024);
console.log(`Heap: ${heapMB}MB, RSS: ${rssMB}MB`);
expect(rssMB).toBeLessThan(100);
});
});
3.5.3 — App File Size Budget
#!/bin/bash
echo "=== App File Size Budget (max 50KB) ==="
OVER=0
for f in app-ui/*.html; do
if [ -f "$f" ]; then
SIZE=$(wc -c < "$f" | tr -d ' ')
KB=$((SIZE / 1024))
if [ "$SIZE" -gt 51200 ]; then
echo "❌ $(basename $f): ${KB}KB (OVER BUDGET)"
OVER=$((OVER + 1))
else
echo "✅ $(basename $f): ${KB}KB"
fi
fi
done
[ "$OVER" -eq 0 ] && echo "All apps within budget" || echo "⚠️ $OVER apps over 50KB budget"
3.5.4 — App Render Performance (Playwright)
// In visual.test.ts, add:
test('time to first render is under 2s', async ({ page }) => {
const start = Date.now();
await page.goto(`file://${appFile}`);
const fixture = loadFixture(appFile);
await page.evaluate((data) => {
window.postMessage({ type: 'mcp_app_data', data }, '*');
}, fixture);
// Wait for content to be visible
await page.locator('#content').waitFor({ state: 'visible', timeout: 5000 });
const renderTime = Date.now() - start;
console.log(`[${appName}] Time to first render: ${renderTime}ms`);
expect(renderTime).toBeLessThan(2000);
});
3.5.5 — Load Testing (HTTP Transport)
For servers running with MCP_TRANSPORT=http, test concurrent connection handling:
#!/bin/bash
# load-test-http.sh — Test concurrent MCP connections
# Requires: npm install -g autocannon (or use curl + GNU parallel)
MCP_PORT="${1:-3000}"
CONCURRENCY="${2:-10}"
DURATION="${3:-10}"
echo "=== MCP HTTP Load Test ==="
echo "Target: http://localhost:${MCP_PORT}/mcp"
echo "Concurrency: ${CONCURRENCY} connections"
echo "Duration: ${DURATION}s"
echo ""
# Test 1: Concurrent initialize requests
echo "--- Test 1: Concurrent initialize ---"
for i in $(seq 1 $CONCURRENCY); do
curl -s -X POST "http://localhost:${MCP_PORT}/mcp" \
-H "Content-Type: application/json" \
-d '{"jsonrpc":"2.0","id":'$i',"method":"initialize","params":{"protocolVersion":"2025-11-25","capabilities":{},"clientInfo":{"name":"load-test-'$i'","version":"1.0.0"}}}' \
-o /dev/null -w "Connection $i: %{http_code} in %{time_total}s\n" &
done
wait
echo ""
# Test 2: Concurrent tools/list under load
echo "--- Test 2: Concurrent tools/list ---"
START=$(date +%s%N)
for i in $(seq 1 $CONCURRENCY); do
curl -s -X POST "http://localhost:${MCP_PORT}/mcp" \
-H "Content-Type: application/json" \
-d '{"jsonrpc":"2.0","id":1,"method":"tools/list","params":{}}' \
-o /dev/null -w "%{http_code} " &
done
wait
END=$(date +%s%N)
ELAPSED=$(( (END - START) / 1000000 ))
echo ""
echo "All $CONCURRENCY requests completed in ${ELAPSED}ms"
echo ""
# Test 3: Session management under load (verify no cross-session leaks)
echo "--- Test 3: Session isolation ---"
SESSION1=$(curl -s -X POST "http://localhost:${MCP_PORT}/mcp" \
-H "Content-Type: application/json" \
-d '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2025-11-25","capabilities":{},"clientInfo":{"name":"session-1","version":"1.0.0"}}}' \
-D - -o /dev/null 2>&1 | grep -i "mcp-session-id" | cut -d' ' -f2 | tr -d '\r')
SESSION2=$(curl -s -X POST "http://localhost:${MCP_PORT}/mcp" \
-H "Content-Type: application/json" \
-d '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2025-11-25","capabilities":{},"clientInfo":{"name":"session-2","version":"1.0.0"}}}' \
-D - -o /dev/null 2>&1 | grep -i "mcp-session-id" | cut -d' ' -f2 | tr -d '\r')
if [ "$SESSION1" != "$SESSION2" ] && [ -n "$SESSION1" ] && [ -n "$SESSION2" ]; then
echo "✅ Sessions are unique (no cross-session leaks)"
else
echo "⚠️ Session isolation check inconclusive"
fi
echo ""
echo "=== Load Test Complete ==="
echo "Target: ${CONCURRENCY} concurrent connections should complete without 5xx errors"
Pass criteria:
- Zero 5xx errors under 10 concurrent connections
- All responses return within 5s
- No cross-session data leaks (GHSA-345p-7cg4-v4c7 regression test)
- Memory usage stays under 200MB during load
Quality Gate:
- Cold start <2s to first ListTools response
- Tool invocation overhead P50 <100ms (excluding API latency)
- Memory usage <100MB after loading all tool groups
- All HTML app files <50KB
- Time to first render <2s for all apps
- HTTP transport handles 10 concurrent connections without errors
Layer 4: Live API Testing
4.1 — Credential Management Strategy
Before running Layer 4, categorize the server:
| Category | Description | Layer 4 Approach |
|---|---|---|
| has-creds | API key/OAuth token available in .env | Full live testing |
| needs-creds | Credentials needed but not yet obtained | Skip Layer 4, note in report |
| sandbox-available | API provides sandbox/test environment | Use sandbox creds (preferred) |
| no-sandbox | Only production credentials available | Careful read-only testing only |
Centralized credential management:
# Master credentials file (NOT committed to git)
# Location: ~/.clawdbot/workspace/.env.mcp-testing
# Format per service:
# {SERVICE}_API_KEY=xxx
# {SERVICE}_API_BASE_URL=https://api.example.com
# {SERVICE}_SANDBOX=true|false
# {SERVICE}_CRED_STATUS=has-creds|needs-creds|sandbox|no-sandbox
# {SERVICE}_CRED_EXPIRES=2026-03-01
# Script to distribute to individual servers:
cat ~/.clawdbot/workspace/.env.mcp-testing | grep "^${SERVICE}_" | sed "s/${SERVICE}_//" > ${SERVICE}-mcp/.env
For servers WITHOUT credentials, focus on Layers 0-3:
- Layer 0: Protocol compliance (no API needed)
- Layer 1: Static analysis (no API needed)
- Layer 2: Visual testing with fixture data (no API needed)
- Layer 2.5: Accessibility (no API needed)
- Layer 3: Functional testing with MSW mocks (no API needed)
- Layer 3.5: Performance with mocks (no API needed)
- Layer 4: SKIP — note in report as "No credentials available"
- Layer 4.5: Security (most checks don't need API)
- Layer 5: Partial — E2E with mocked responses
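This layer plan can be derived directly from {SERVICE}_CRED_STATUS so runners never guess. A bash sketch (service name and plan messages are illustrative):

```shell
# Decide which QA layers to run from {SERVICE}_CRED_STATUS (format as defined above)
plan_layers() {
  local service="$1"
  local status_var="${service}_CRED_STATUS"
  local status="${!status_var:-needs-creds}"   # bash indirect expansion
  case "$status" in
    has-creds|sandbox) echo "layers 0-5: full run, live API enabled" ;;
    no-sandbox)        echo "layers 0-3.5 plus careful read-only Layer 4" ;;
    *)                 echo "layers 0-3.5 only: skip Layer 4 (no credentials)" ;;
  esac
}

EXAMPLE_CRED_STATUS="sandbox"
plan_layers EXAMPLE
```

Call plan_layers at the top of the QA runner and record the chosen plan in the report so skipped layers are explicit, not silent.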
4.2 — Test Each Tool Group
### Live API Test: {service} / {tool-group}
**Auth:** {method} — Token/key set in .env
**Base URL:** {url}
**Cred Status:** {has-creds|sandbox|no-creds}
| Tool | Test Input | Expected | Actual | Latency | Status |
|------|-----------|----------|--------|---------|--------|
| list_{entities} | {} (default) | Array of items | | ms | |
| list_{entities} | { status: "active" } | Filtered array | | ms | |
| get_{entity} | { id: "known-id" } | Single item | | ms | |
| create_{entity} | { name: "QA Test" } | Created w/ ID | | ms | |
| update_{entity} | { id: "id", name: "Updated" } | Updated item | | ms | |
| delete_{entity} | { id: "qa-test-id" } | Confirmation | | ms | |
4.3 — Response Shape Verification
# For each tool, verify the response shape matches what the app expects
# Extract field references from app HTML (grep -P needs GNU grep; on macOS use ggrep)
grep -oP 'data\.\K[a-zA-Z_]+' app-ui/{app}.html | sort -u > /tmp/expected-fields.txt
# Compare with actual API response fields (one key per line, to match the grep output)
echo '{api_response}' | jq -r 'keys[]' | sort > /tmp/actual-fields.txt
# Diff (expected fields missing from the response are the dangerous direction)
diff /tmp/expected-fields.txt /tmp/actual-fields.txt
Quality Gate:
- All read-only tools return valid data
- Write tools create/update/delete correctly (use sandbox)
- Response shapes match what apps expect
- Error responses (401, 403, 404, 422, 429) handled gracefully
- All response latencies recorded for P50/P95 metrics
- Cleanup: delete any test data created during QA
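What "handled gracefully" can look like at the tool layer: map HTTP error statuses to isError results instead of throwing. A sketch (the ToolResult shape follows MCP tool results; the hint strings are illustrative):

```typescript
// error-mapping.ts — translate API error statuses into MCP-style tool errors
interface ToolResult {
  isError?: boolean;
  content: { type: 'text'; text: string }[];
}

function toToolError(status: number, detail: string): ToolResult {
  // Per-status hints help the AI explain the failure to the user
  const hints: Record<number, string> = {
    401: 'Authentication failed: check the API key in .env',
    403: 'Permission denied for this resource',
    404: 'Not found: the ID may be wrong or the record deleted',
    422: 'Validation failed: check the input fields',
    429: 'Rate limited: retry after a short delay',
  };
  const hint = hints[status] ?? `API error (HTTP ${status})`;
  return { isError: true, content: [{ type: 'text', text: `${hint}. ${detail}` }] };
}

console.log(toToolError(429, 'Quota exceeded for this key').content[0].text);
```

The Layer 4 error-handling checks then reduce to asserting that every non-2xx path returns a result like this rather than crashing the server.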
Layer 4.5: Security Testing
4.5.1 — XSS Testing
// tests/security.test.ts
import { test, expect } from '@playwright/test';
import * as path from 'path';
const XSS_PAYLOADS = [
'<script>alert("xss")</script>',
'<img src=x onerror=alert("xss")>',
'"><script>alert(1)</script>',
"';alert(String.fromCharCode(88,83,83))//",
'<svg onload=alert("xss")>',
'javascript:alert("xss")',
'<iframe src="javascript:alert(1)">',
'{{constructor.constructor("return this")().alert(1)}}',
'<details open ontoggle=alert(1)>',
'<math><mtext><table><mglyph><svg><mtext><style><img src=x onerror=alert(1)>',
];
test.describe('XSS Security', () => {
test('escapeHtml blocks all XSS payloads in text fields', async ({ page }) => {
const appFile = path.resolve(__dirname, '../app-ui/contact-grid.html');
await page.goto(`file://${appFile}`);
// Register the dialog listener once; re-registering inside the loop stacks
// listeners and double-dismisses each dialog
let alertFired = false;
page.on('dialog', async dialog => {
alertFired = true;
await dialog.dismiss();
});
for (const payload of XSS_PAYLOADS) {
alertFired = false;
// Inject data with XSS payloads in every text field
await page.evaluate((xss) => {
window.postMessage({
type: 'mcp_app_data',
data: {
title: xss,
data: [
{ name: xss, email: xss, phone: xss, status: xss },
],
meta: { total: 1, page: 1, pageSize: 25 }
}
}, '*');
}, payload);
await page.waitForTimeout(200);
expect(alertFired).toBe(false);
}
});
});
4.5.2 — postMessage Origin Validation
// Check in browser console — app should validate message origin
// Inject from a different origin simulation:
(function testOriginValidation() {
// Check if app code validates event.origin
const appScript = document.querySelector('script')?.textContent || '';
const checksOrigin = appScript.includes('event.origin') ||
appScript.includes('e.origin') ||
appScript.includes('message.origin');
if (checksOrigin) {
console.log('✅ App validates postMessage origin');
} else {
console.log('⚠️ App does NOT validate postMessage origin — potential security issue');
console.log(' Recommended: Add origin check in message event listener');
}
})();
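For apps that fail this check, a sketch of an origin-validating listener (the allowed-origin list is an assumption; use the host origins LocalBosses actually serves from):

```typescript
// origin-check.ts — validate postMessage events before rendering their payload
// ALLOWED_ORIGINS is illustrative; populate it with your real host origins
const ALLOWED_ORIGINS = new Set(['http://localhost:3000']);

interface AppMessage {
  origin: string;
  data: { type?: string; data?: unknown };
}

// Returns true only when the message should be rendered
function handleMessage(event: AppMessage): boolean {
  if (!ALLOWED_ORIGINS.has(event.origin)) return false;   // drop unknown origins
  if (event.data?.type !== 'mcp_app_data') return false;  // drop unexpected types
  // render(event.data.data) would go here
  return true;
}

// In the app: window.addEventListener('message', e => handleMessage(e));
console.log(handleMessage({ origin: 'https://evil.example', data: { type: 'mcp_app_data' } }));
```

Note that `file://` testing gives `event.origin` of "null", so the check may need a test-mode escape hatch behind an explicit flag.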
4.5.3 — Content Security Policy Check
# Check if HTML apps declare CSP
for f in app-ui/*.html; do
if grep -q "Content-Security-Policy" "$f"; then
echo "✅ $(basename $f) has CSP meta tag"
else
echo "⚠️ $(basename $f) — no CSP meta tag"
fi
done
# Check for inline event handlers (CSP-unfriendly)
for f in app-ui/*.html; do
INLINE=$(grep -c 'on[a-z]*=' "$f")  # grep -c prints 0 itself; appending || echo "0" would double the output and break -gt
if [ "$INLINE" -gt 0 ]; then
echo "⚠️ $(basename $f) has $INLINE inline event handlers"
fi
done
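A CSP meta tag that satisfies the check above might look like this. Since these single-file apps rely on inline scripts and styles, 'unsafe-inline' is unavoidable, so the hardening value comes from locking down default-src and connect-src. A starting point, not a vetted policy:

```html
<meta http-equiv="Content-Security-Policy"
      content="default-src 'none'; script-src 'unsafe-inline'; style-src 'unsafe-inline'; img-src data:; connect-src 'none'">
```

Tighten per app: only add connect-src entries if the app genuinely fetches, and prefer hashes over 'unsafe-inline' once the build pipeline can compute them.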
4.5.4 — API Key Exposure Check
# Check for leaked secrets in client-side code
echo "=== API Key Exposure Scan ==="
# Common patterns for API keys/secrets
PATTERNS=(
'api[_-]?key'
'apikey'
'secret'
'token'
'password'
'authorization.*Bearer'
'sk_live_'
'pk_live_'
'ghp_'
'gho_'
)
for f in app-ui/*.html; do
for pat in "${PATTERNS[@]}"; do
MATCHES=$(grep -ci "$pat" "$f")  # grep -c prints 0 itself; || echo "0" would double the output and break -gt
if [ "$MATCHES" -gt 0 ]; then
echo "❌ $(basename $f) may contain exposed secrets (pattern: $pat)"
grep -in "$pat" "$f" | head -3
fi
done
done
# Also check compiled JS (dist/**/* requires bash globstar)
shopt -s globstar 2>/dev/null
for f in dist/**/*.js; do
if [ -f "$f" ]; then
for pat in "${PATTERNS[@]}"; do
MATCHES=$(grep -ci "$pat" "$f")  # grep -c prints 0 itself; || echo "0" would double the output and break -gt
if [ "$MATCHES" -gt 0 ]; then
echo "⚠️ $(basename $f) references: $pat (verify not actual key)"
fi
done
fi
done
Quality Gate:
- All XSS payloads blocked (escapeHtml works)
- No alert dialogs triggered from any payload
- postMessage origin validated (or documented as acceptable risk)
- No API keys/secrets exposed in HTML app files
- No API keys/secrets in client-facing JavaScript
- CSP meta tag present (or documented why not)
Layer 5: Integration & Chaos Testing
5.1 — End-to-End Scenarios
Write at least 1 E2E scenario per app type (minimum 5 per server):
### E2E Scenario: {scenario-name}
**Channel:** {channel}
**Goal:** {what the user is trying to accomplish}
**App type:** {dashboard|grid|card|timeline|pipeline|calendar|analytics|monitor}
**Steps:**
1. Navigate to #{channel}
2. Type: "{natural language message}"
3. Verify: AI responds with correct tool call
4. Verify: APP_DATA block present and valid JSON
5. Verify: App {app-id} renders with correct data
6. In thread, type: "{follow-up message}"
7. Verify: App updates with new/refined data
8. Measure: Response latency for each step
**Metrics:**
- Tool selected correctly: ✅/❌
- APP_DATA valid: ✅/❌
- App rendered: ✅/❌
- Latency step 3: ___ms
- Latency step 7: ___ms
**Pass criteria:**
- [ ] All steps complete without errors
- [ ] Response time <5s for each step
- [ ] Zero console errors
- [ ] Data is accurate and well-formatted
5.1b — Automated End-to-End Data Flow Test (Playwright)
The magic moment: message → AI → tool → APP_DATA → app render → correct data. This test automates the entire flow:
// tests/e2e-dataflow.test.ts
import { test, expect } from '@playwright/test';
const LOCALBOSSES_URL = process.env.LB_URL || 'http://localhost:3000';
test.describe('End-to-End Data Flow', () => {
test('message triggers tool → APP_DATA → app renders correct data', async ({ page }) => {
// 0. Collect console errors from the start — the listener must be registered
// BEFORE navigation, or errors fired during the flow are never captured
const consoleErrors: string[] = [];
page.on('console', msg => {
if (msg.type() === 'error') consoleErrors.push(msg.text());
});
// 1. Navigate to the channel
await page.goto(`${LOCALBOSSES_URL}/#/channel/{channel-id}`);
await page.waitForLoadState('networkidle');
// 2. Send a test message
const chatInput = page.locator('[data-testid="chat-input"], textarea, input[type="text"]');
await chatInput.fill('Show me all active contacts');
await chatInput.press('Enter');
// 3. Wait for AI response (tool call indicator or text response)
const aiResponse = page.locator('[data-testid="ai-response"], .message-content').last();
await aiResponse.waitFor({ state: 'visible', timeout: 15000 });
// 4. Verify APP_DATA was generated — the raw block may be hidden in the UI,
// so check that the app iframe loaded
const appFrame = page.frameLocator('iframe[data-app-id]').first();
// 5. Verify app rendered with data (not empty/loading state)
const appContent = appFrame.locator('#content');
await appContent.waitFor({ state: 'visible', timeout: 10000 });
// 6. Verify correct data is displayed
// App should show contact data, not empty state
const appText = await appContent.textContent();
expect(appText).toBeTruthy();
expect(appText!.length).toBeGreaterThan(10); // Has real content
// 7. Verify no console errors were logged during the flow
expect(consoleErrors).toHaveLength(0);
// 8. Screenshot for the record
await page.screenshot({ path: 'test-results/e2e-dataflow.png', fullPage: true });
});
});
Note: This test requires LocalBosses running locally with the integrated channel. It's the most important test — it validates the complete user experience end-to-end. Run this after every integration change.
5.2 — Chaos Testing
Test resilience under adverse conditions. Note: the first test (API 500s) runs under Jest with the MSW server from Layer 3; the remaining tests use Playwright fixtures, so in practice split them into separate Jest and Playwright spec files.
// tests/chaos.test.ts
describe('Chaos Testing', () => {
test('API returns 500 on every call', async () => {
// Override MSW handlers to return 500
server.use(
http.get('https://api.example.com/*', () => {
return new HttpResponse('Internal Server Error', { status: 500 });
}),
http.post('https://api.example.com/*', () => {
return new HttpResponse('Internal Server Error', { status: 500 });
})
);
// Tool should return isError: true, NOT crash
// const result = await callTool('list_contacts', {});
// expect(result.isError).toBe(true);
// expect(result.content[0].text).toContain('error');
});
test('postMessage sends wrong format data', async ({ page }) => {
await page.goto(`file://${appFile}`);
// Send wrong type
await page.evaluate(() => {
window.postMessage({ type: 'wrong_type', data: {} }, '*');
});
await page.waitForTimeout(300);
// App should not crash — should still show loading/empty
const bodyText = await page.textContent('body');
expect(bodyText).not.toContain('undefined');
expect(bodyText).not.toContain('TypeError');
// Send data with wrong shape
await page.evaluate(() => {
window.postMessage({ type: 'mcp_app_data', data: 'not an object' }, '*');
});
await page.waitForTimeout(300);
const bodyText2 = await page.textContent('body');
expect(bodyText2).not.toContain('undefined');
});
test('APP_DATA is 500KB+ (huge dataset)', async ({ page }) => {
await page.goto(`file://${appFile}`);
// Generate huge dataset
const hugeData = {
title: 'Performance Stress Test',
data: Array.from({ length: 2000 }, (_, i) => ({
id: `item-${i}`,
name: `Contact ${i} ${'A'.repeat(100)}`,
email: `contact${i}@example.com`,
phone: `555-${String(i).padStart(4, '0')}`,
status: i % 2 === 0 ? 'active' : 'inactive',
notes: 'X'.repeat(200)
})),
meta: { total: 2000, page: 1, pageSize: 2000 }
};
const start = Date.now();
await page.evaluate((data) => {
window.postMessage({ type: 'mcp_app_data', data }, '*');
}, hugeData);
// Should render within 5 seconds even with huge data
await page.locator('#content').waitFor({ state: 'visible', timeout: 5000 });
const renderTime = Date.now() - start;
console.log(`Huge dataset render time: ${renderTime}ms`);
expect(renderTime).toBeLessThan(5000);
});
test('rapid-fire 10 messages', async ({ page }) => {
await page.goto(`file://${appFile}`);
// Send 10 data updates in quick succession
for (let i = 0; i < 10; i++) {
await page.evaluate((idx) => {
window.postMessage({
type: 'mcp_app_data',
data: {
title: `Update ${idx}`,
data: [{ name: `Item ${idx}`, status: 'active' }],
meta: { total: 1, page: 1, pageSize: 25 }
}
}, '*');
}, i);
}
await page.waitForTimeout(1000);
// App should show the LAST update (not crash or show stale data)
const content = await page.textContent('body');
expect(content).toContain('Update 9');
});
test('two apps rendering simultaneously', async ({ browser }) => {
const page1 = await browser.newPage();
const page2 = await browser.newPage();
await page1.goto(`file://${appFile}`);
await page2.goto(`file://${appFile}`);
// Send data to both simultaneously
await Promise.all([
page1.evaluate(() => {
window.postMessage({
type: 'mcp_app_data',
data: { title: 'App 1', data: [{ name: 'One' }] }
}, '*');
}),
page2.evaluate(() => {
window.postMessage({
type: 'mcp_app_data',
data: { title: 'App 2', data: [{ name: 'Two' }] }
}, '*');
})
]);
await page1.waitForTimeout(500);
await page2.waitForTimeout(500);
// Both should render their respective data
expect(await page1.textContent('body')).toContain('One');
expect(await page2.textContent('body')).toContain('Two');
await page1.close();
await page2.close();
});
});
5.3 — Cross-Browser Testing Notes
| Browser | Priority | Key Differences | How to Test |
|---|---|---|---|
| Chrome | P0 | Primary target — test all features here | Playwright chromium channel |
| Firefox | P1 | CSS Grid/Flexbox rendering can differ slightly; verify backdrop-filter support | Playwright firefox channel |
| Mobile Safari | P1 | Touch targets (min 44×44px), safe area insets, backdrop-filter needs the -webkit- prefix | Playwright webkit channel or real device |
| Electron | P2 | If LocalBosses ships as desktop app; test Node integration, contextBridge | Playwright with Electron |
// playwright.config.ts — multi-browser setup
import { defineConfig, devices } from '@playwright/test';
export default defineConfig({
projects: [
{ name: 'chromium', use: { ...devices['Desktop Chrome'] } },
{ name: 'firefox', use: { ...devices['Desktop Firefox'] } },
{ name: 'webkit', use: { ...devices['Desktop Safari'] } },
{ name: 'mobile-chrome', use: { ...devices['Pixel 5'] } },
{ name: 'mobile-safari', use: { ...devices['iPhone 13'] } },
],
});
Quality Gate:
- All E2E scenarios pass (≥1 per app type)
- Chaos tests: API 500s handled gracefully
- Chaos tests: wrong postMessage format doesn't crash app
- Chaos tests: 500KB+ dataset renders within 5s
- Chaos tests: rapid-fire messages show final state
- Cross-browser: Chrome + Firefox + WebKit all render correctly
Layer 5.5: Production Smoke Test (Post-Deployment)
After deploying a server + apps to production, run this validation before considering it shipped:
#!/bin/bash
# smoke-test.sh — Post-deployment validation
# Usage: ./smoke-test.sh <service-name> [base-url]
SERVICE="$1"
BASE_URL="${2:-http://localhost:3000}"
echo "=== Production Smoke Test: ${SERVICE} ==="
echo "Target: ${BASE_URL}"
echo ""
PASS=0
FAIL=0
# 1. Server is reachable (HTTP transport)
echo "--- Server Reachability ---"
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" -X POST "${BASE_URL}/mcp" \
-H "Content-Type: application/json" \
-d '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2025-11-25","capabilities":{},"clientInfo":{"name":"smoke-test","version":"1.0.0"}}}')
if [ "$HTTP_CODE" = "200" ]; then
echo "✅ Server responds to initialize (HTTP $HTTP_CODE)"
PASS=$((PASS + 1))
else
echo "❌ Server unreachable or error (HTTP $HTTP_CODE)"
FAIL=$((FAIL + 1))
fi
# 2. tools/list returns tools
echo "--- Tool List ---"
TOOLS_RESPONSE=$(curl -s -X POST "${BASE_URL}/mcp" \
-H "Content-Type: application/json" \
-d '{"jsonrpc":"2.0","id":2,"method":"tools/list","params":{}}')
TOOL_COUNT=$(echo "$TOOLS_RESPONSE" | grep -o '"name"' | wc -l | tr -d ' ')
if [ "$TOOL_COUNT" -gt 0 ]; then
echo "✅ tools/list returns $TOOL_COUNT tools"
PASS=$((PASS + 1))
else
echo "❌ tools/list returned 0 tools"
FAIL=$((FAIL + 1))
fi
# 3. health_check tool responds
echo "--- Health Check ---"
HEALTH=$(curl -s -X POST "${BASE_URL}/mcp" \
-H "Content-Type: application/json" \
-d '{"jsonrpc":"2.0","id":3,"method":"tools/call","params":{"name":"health_check","arguments":{}}}')
if echo "$HEALTH" | grep -q '"status"'; then
echo "✅ health_check tool responds"
PASS=$((PASS + 1))
else
echo "⚠️ health_check tool not found or error"
fi
# 4. App HTML files are served (if HTTP)
echo "--- App Files ---"
# Extract tool names portably (grep -oP is GNU-only; this works on macOS too)
for app_id in $(echo "$TOOLS_RESPONSE" | tr ',' '\n' | sed -n 's/.*"name"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -3); do
APP_HTTP=$(curl -s -o /dev/null -w "%{http_code}" "${BASE_URL}/api/mcp-apps?app=${app_id}")
if [ "$APP_HTTP" = "200" ]; then
echo "✅ App ${app_id} is served"
fi
done
# Summary
echo ""
echo "=== Smoke Test Results ==="
echo "Passed: $PASS"
echo "Failed: $FAIL"
[ "$FAIL" -eq 0 ] && echo "✅ SMOKE TEST PASSED" || echo "❌ SMOKE TEST FAILED"
Layer 6: Production Monitoring (Post-Ship)
"All testing is pre-ship. There's no guidance on tracking tool correctness, APP_DATA parse success rate, or user satisfaction in production." — Kofi
Pre-ship testing validates that everything can work. Production monitoring validates that everything does work, continuously.
6.1 — Production Quality Metrics
Track these metrics in production via logging in the chat route and aggregating weekly:
| Metric | Target | How to Measure | Alert Threshold |
|---|---|---|---|
| APP_DATA Parse Success Rate | >98% | Log every parseAppData() call: success vs fallback vs failure | <95% over 1 hour |
| Tool Correctness Sampling | >95% | Sample 5% of interactions weekly, LLM-judge correctness | <90% in weekly sample |
| Time to First App Render | P50 <3s, P95 <8s | Measure from user message send → app #content visible | P95 >12s |
| User Retry Rate | <15% | Count rephrased messages within 30s of previous message | >25% over 1 day |
| Thread Completion Rate | >80% | % of threads where user reaches a data-displaying app state | <60% over 1 week |
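The retry-rate metric needs a concrete definition of "rephrased within 30s". A minimal sketch — the `ChatMessage` shape is an assumption; adapt it to the real chat store:

```typescript
// Sketch: "user retry rate" — a message counts as a retry when the same user
// sends another message within 30s of their previous one (a proxy for rephrasing).
interface ChatMessage {
  userId: string;
  text: string;
  timestamp: number; // epoch ms
}

function retryRate(messages: ChatMessage[]): number {
  const lastSeen = new Map<string, number>(); // userId -> last message timestamp
  let retries = 0;
  const ordered = [...messages].sort((a, b) => a.timestamp - b.timestamp);
  for (const m of ordered) {
    const last = lastSeen.get(m.userId);
    if (last !== undefined && m.timestamp - last < 30_000) retries++;
    lastSeen.set(m.userId, m.timestamp);
  }
  return ordered.length === 0 ? 0 : retries / ordered.length;
}
```

Run this over a day's messages and alert when the result exceeds 0.25, per the table above.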
6.2 — Instrumentation Code
Add to the chat route to collect production metrics:
// lib/production-metrics.ts
interface MetricEvent {
timestamp: string;
channel: string;
metric: string;
value: number;
metadata?: Record<string, unknown>;
}
const metrics: MetricEvent[] = [];
export function trackMetric(channel: string, metric: string, value: number, metadata?: Record<string, unknown>) {
metrics.push({
timestamp: new Date().toISOString(),
channel,
metric,
value,
metadata,
});
// Flush to file every 100 events
if (metrics.length >= 100) flushMetrics();
}
function flushMetrics() {
// Note: require() assumes a CommonJS build; in an ESM server, use top-level
// `import * as fs from 'node:fs'` and `import * as path from 'node:path'` instead.
const fs = require('node:fs');
const path = require('node:path');
const file = path.join(process.cwd(), 'logs', `metrics-${new Date().toISOString().split('T')[0]}.jsonl`);
fs.mkdirSync(path.dirname(file), { recursive: true });
fs.appendFileSync(file, metrics.map(m => JSON.stringify(m)).join('\n') + '\n');
metrics.length = 0;
}
// Usage in chat route:
// trackMetric(channelId, 'app_data_parse', success ? 1 : 0, { fallback: usedFallback });
// trackMetric(channelId, 'tool_call_latency', latencyMs, { tool: toolName });
// trackMetric(channelId, 'thread_completed', 1);
6.3 — Weekly Quality Review
#!/bin/bash
# weekly-quality-report.sh — Aggregate production metrics
METRICS_DIR="logs"
WEEK_START=$(date -v-7d +%Y-%m-%d)  # BSD/macOS syntax; on GNU/Linux use: date -d '7 days ago' +%Y-%m-%d
echo "=== Weekly Production Quality Report ==="
echo "Period: ${WEEK_START} to $(date +%Y-%m-%d)"
echo ""
# APP_DATA parse success rate
TOTAL_PARSES=$(cat ${METRICS_DIR}/metrics-*.jsonl 2>/dev/null | grep '"app_data_parse"' | wc -l | tr -d ' ')
SUCCESS_PARSES=$(cat ${METRICS_DIR}/metrics-*.jsonl 2>/dev/null | grep '"app_data_parse"' | grep '"value":1' | wc -l | tr -d ' ')
if [ "$TOTAL_PARSES" -gt 0 ]; then
PARSE_RATE=$((SUCCESS_PARSES * 100 / TOTAL_PARSES))
echo "APP_DATA Parse Success: ${PARSE_RATE}% (${SUCCESS_PARSES}/${TOTAL_PARSES})"
else
echo "APP_DATA Parse Success: No data"
fi
echo ""
echo "Action items:"
echo "- Review any channels with parse rate <95%"
echo "- Check retry rate spikes for system prompt issues"
echo "- Sample 5 random interactions for manual correctness review"
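The bash report only aggregates parse rate. A TypeScript sketch that rolls the same JSONL lines into parse rate plus latency percentiles (nearest-rank method; `MetricEvent` fields follow lib/production-metrics.ts above):

```typescript
// Sketch: aggregate metrics JSONL lines into parse success rate and latency percentiles.
interface MetricEvent {
  metric: string;
  value: number;
}

// Nearest-rank percentile over an ascending-sorted array.
function percentile(sorted: number[], p: number): number {
  if (sorted.length === 0) return 0;
  const idx = Math.max(0, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[Math.min(idx, sorted.length - 1)];
}

function aggregateMetrics(jsonlLines: string[]) {
  const events = jsonlLines.filter(Boolean).map(l => JSON.parse(l) as MetricEvent);
  const parses = events.filter(e => e.metric === 'app_data_parse');
  const latencies = events
    .filter(e => e.metric === 'tool_call_latency')
    .map(e => e.value)
    .sort((a, b) => a - b);
  return {
    parseRate: parses.length ? parses.filter(e => e.value === 1).length / parses.length : null,
    latencyP50: percentile(latencies, 50),
    latencyP95: percentile(latencies, 95),
  };
}
```

Feed it the contents of the logs/metrics-*.jsonl files split on newlines, then compare parseRate against the 98% target and latencyP95 against the alert threshold.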
CI/CD Pipeline Template
Automate the QA pipeline in CI. Save as .github/workflows/mcp-qa.yml:
# .github/workflows/mcp-qa.yml
name: MCP QA Pipeline
on:
push:
paths: ['*-mcp/**', 'mcp-servers/**']
pull_request:
paths: ['*-mcp/**', 'mcp-servers/**']
jobs:
qa:
runs-on: ubuntu-latest
strategy:
matrix:
node-version: [22]
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with:
node-version: ${{ matrix.node-version }}
cache: 'npm'
- name: Install dependencies
run: npm ci
- name: TypeScript build
run: npm run build
- name: Type check
run: npx tsc --noEmit
- name: Jest unit tests
run: npx jest --ci --coverage
env:
NODE_ENV: test
- name: Install Playwright browsers
run: npx playwright install --with-deps
- name: Playwright visual + accessibility tests
run: npx playwright test
- name: App file size check
run: |
for f in app-ui/*.html; do
if [ -f "$f" ]; then
SIZE=$(wc -c < "$f" | tr -d ' ')
if [ "$SIZE" -gt 51200 ]; then
echo "❌ $(basename $f) exceeds 50KB ($SIZE bytes)"
exit 1
fi
echo "✅ $(basename $f) ($SIZE bytes)"
fi
done
- name: Security scan
run: |
ISSUES=0
for f in app-ui/*.html; do
for pat in "api_key" "apikey" "secret" "sk_live" "pk_live"; do
if grep -qi "$pat" "$f" 2>/dev/null; then
echo "❌ $(basename $f): potential key exposure ($pat)"
ISSUES=$((ISSUES + 1))
fi
done
done
[ "$ISSUES" -eq 0 ] || exit 1
- name: Upload test results
uses: actions/upload-artifact@v4
if: always()
with:
name: test-results
path: |
test-results/
coverage/
retention-days: 30
# Optional: DeepEval tool routing (requires API key)
tool-routing:
runs-on: ubuntu-latest
if: github.event_name == 'push' && github.ref == 'refs/heads/main'
needs: qa
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: '3.12'
- run: pip install deepeval anthropic
- name: Run DeepEval tool routing evaluation
run: deepeval test run tests/tool_routing_eval.py
env:
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
DEEPEVAL_API_KEY: ${{ secrets.DEEPEVAL_API_KEY }}
Testing Reality Check
What the QA catches vs what it misses — from Kofi's review
✅ What This QA Framework CATCHES (real quality):
| Test | What It Validates | Real-World Impact |
|---|---|---|
| TypeScript compilation | Code compiles, types correct | Prevents server crashes |
| MCP Inspector | Protocol compliance | Server works with any MCP client |
| Playwright visual tests | Apps render all 3 states, dark theme, responsive | Users see a polished UI |
| axe-core accessibility | WCAG AA, keyboard nav, screen reader | Accessible to all users |
| XSS payload testing | No script injection via user data | Security against malicious data |
| Chaos testing (500 errors, wrong formats, huge data) | Graceful degradation | App doesn't crash under adverse conditions |
| Static cross-reference | All app IDs consistent across 4 files | No broken routes or missing entries |
| File size budgets | Apps under 50KB | Fast loading |
| BackstopJS regression | Visual changes are intentional | No accidental UI regressions |
| Cold start / latency benchmarks | Performance within targets | Users don't wait too long |
❌ What This QA Framework MISSES (gaps to be aware of):
| Gap | Why It Matters | Current State | Mitigation |
|---|---|---|---|
| Tool routing accuracy with real LLM | THE quality metric — does the AI pick the right tool? | DeepEval added (3.2b) but requires API key + cost | Run DeepEval on main branch pushes, not every PR |
| APP_DATA generation quality | Does the LLM produce valid JSON matching app expectations? | Not fully tested — parser is tested, generator is probabilistic | Few-shot examples in system prompts + Layer 6 monitoring |
| Multi-step tool chains | "Find John's email and send him a meeting invite" — requires 3 tool calls | Not tested — all routing tests are single-tool | Add multi-step fixtures to DeepEval test cases |
| Conversation context | "Show me more details about the second one" — requires memory | Not addressed in any skill | Requires thread state tracking — future work |
| Real API response shape drift | MSW mocks may not match real API | MSW validation note added (3.1) but manual | Quarterly mock validation when credentials available |
| Production quality after ship | Is quality maintained over time? | Layer 6 monitoring added | Implement metric collection + weekly review |
| APP_DATA parse failure rate in production | How often does the LLM produce unparseable JSON? | Layer 6 tracks this now | Set alerting threshold at <95% success |
The Hard Truth:
This QA framework is excellent at testing infrastructure (server compiles, apps render, accessibility passes, security is clean) — roughly 40% of the user experience. The AI interaction quality (tool routing, data generation, multi-step flows) is the other 60%, and it's harder to test deterministically because the LLM is probabilistic. Layer 6 monitoring and DeepEval close this gap but don't eliminate it. Ship with awareness, monitor in production, iterate on system prompts.
Test Data Fixtures Library
Standard Fixture: Dashboard
Save as test-fixtures/dashboard.json:
{
"title": "Monthly Performance Overview",
"metrics": [
{ "label": "Total Revenue", "value": "$124,500", "change": "+12.3%", "trend": "up" },
{ "label": "New Customers", "value": 847, "change": "+5.2%", "trend": "up" },
{ "label": "Churn Rate", "value": "2.1%", "change": "-0.3%", "trend": "down" },
{ "label": "Avg Response Time", "value": "1.4h", "change": "-8.5%", "trend": "down" }
],
"charts": [
{
"type": "bar",
"title": "Revenue by Month",
"data": [
{ "label": "Sep", "value": 95000 },
{ "label": "Oct", "value": 102000 },
{ "label": "Nov", "value": 98000 },
{ "label": "Dec", "value": 115000 },
{ "label": "Jan", "value": 124500 }
]
}
],
"data": {
"summary": "Revenue is up 12.3% month-over-month with strong customer acquisition."
}
}
Standard Fixture: Data Grid
Save as test-fixtures/data-grid.json:
{
"title": "Active Contacts",
"columns": ["Name", "Email", "Phone", "Status", "Created"],
"data": [
{ "name": "John Doe", "email": "john@acmecorp.com", "phone": "555-0101", "status": "active", "created": "2026-01-15" },
{ "name": "Jane Smith", "email": "jane@techstart.io", "phone": "555-0102", "status": "active", "created": "2026-01-20" },
{ "name": "Bob Wilson", "email": "bob@globalinc.com", "phone": "555-0103", "status": "inactive", "created": "2025-12-01" },
{ "name": "Alice Brown", "email": "alice@startup.co", "phone": "555-0104", "status": "active", "created": "2026-02-01" },
{ "name": "Charlie Davis", "email": "charlie@enterprise.net", "phone": "555-0105", "status": "pending", "created": "2026-02-03" },
{ "name": "Diana Evans", "email": "diana@agency.com", "phone": "555-0106", "status": "active", "created": "2025-11-15" },
{ "name": "Frank Garcia", "email": "frank@solutions.biz", "phone": "555-0107", "status": "active", "created": "2026-01-28" },
{ "name": "Grace Hill", "email": "grace@design.studio", "phone": "555-0108", "status": "inactive", "created": "2025-10-05" }
],
"meta": { "total": 156, "page": 1, "pageSize": 25 }
}
Standard Fixture: Timeline
Save as test-fixtures/timeline.json:
{
"title": "Contact Activity Timeline",
"events": [
{ "date": "2026-02-04T14:30:00Z", "title": "Email Opened", "description": "Campaign: February Newsletter", "type": "email" },
{ "date": "2026-02-03T10:15:00Z", "title": "Meeting Scheduled", "description": "Demo call with sales team", "type": "meeting" },
{ "date": "2026-02-01T09:00:00Z", "title": "Deal Created", "description": "Enterprise Plan — $15,000/yr", "type": "deal" },
{ "date": "2026-01-28T16:45:00Z", "title": "Form Submitted", "description": "Requested pricing information", "type": "form" },
{ "date": "2026-01-25T11:30:00Z", "title": "First Visit", "description": "Visited pricing page from Google Ads", "type": "visit" }
]
}
Edge Case Fixtures
Save as test-fixtures/edge-cases.json:
{
"empty_strings": {
"data": [
{ "name": "", "email": "", "phone": "", "status": "" }
]
},
"null_values": {
"data": [
{ "name": null, "email": null, "phone": null, "status": null }
]
},
"extremely_long_text": {
"data": [
{
"name": "Bartholomew Christopherson-Williamsworth III, Esq., Ph.D., M.B.A., J.D., CPA, CFP®, CAIA®, FRM®",
"email": "bartholomew.christopherson-williamsworth.the.third.esquire.phd.mba.jd@extremely-long-company-name-international-holdings-corporation-unlimited.com",
"phone": "+1 (555) 012-3456 ext. 78901234",
"status": "active — pending final review by committee chairperson and board of directors",
"notes": "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."
}
]
},
"unicode": {
"data": [
{ "name": "田中太郎", "email": "tanaka@例え.jp", "status": "アクティブ" },
{ "name": "Müller, Günther", "email": "günther@münchen.de", "status": "aktiv" },
{ "name": "Дмитрий Иванов", "email": "dmitry@компания.ru", "status": "активный" },
{ "name": "محمد عبدالله", "email": "mohammed@شركة.sa", "status": "نشط" },
{ "name": "🧑💻 Developer", "email": "dev@🏢.com", "status": "✅ Active" }
]
},
"html_entities": {
"data": [
{ "name": "O'Brien & Sons <LLC>", "email": "info@obrien&sons.com", "notes": "He said \"hello\" & left" }
]
}
}
Adversarial Fixtures
Save as test-fixtures/adversarial.json:
{
"xss_payloads": {
"data": [
{ "name": "<script>alert('xss')</script>", "email": "test@test.com" },
{ "name": "<img src=x onerror=alert(1)>", "email": "\"><script>alert(1)</script>" },
{ "name": "<svg onload=alert('xss')>", "email": "javascript:alert(1)" },
{ "name": "{{constructor.constructor('return this')().alert(1)}}", "email": "test@test.com" },
{ "name": "<details open ontoggle=alert(1)>", "email": "<iframe src='javascript:alert(1)'>" }
]
},
"sql_injection": {
"data": [
{ "name": "'; DROP TABLE contacts; --", "email": "test@test.com" },
{ "name": "1' OR '1'='1", "email": "' UNION SELECT * FROM users --" },
{ "name": "admin'--", "email": "1; UPDATE users SET role='admin'" }
]
},
"malformed": {
"missing_fields": { "data": [{ "id": "1" }] },
"wrong_types": { "data": "not an array", "meta": "not an object" },
"nested_nulls": { "data": [{ "name": { "first": null, "last": null }, "contacts": [null, null] }] },
"circular_attempt": { "data": [{ "self": "[Circular]" }] },
"massive_nesting": { "a": { "b": { "c": { "d": { "e": { "f": { "g": "deep" } } } } } } }
}
}
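These payloads only earn their keep if the app's escaping actually neutralizes them. A self-contained check — `escapeHtml` here is illustrative; substitute whatever escaping helper the app really ships:

```typescript
// Sketch: assert that HTML escaping neutralizes the xss_payloads fixture entries.
// Ampersand must be replaced first so later entities aren't double-escaped.
function escapeHtml(s: string): string {
  return s
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

const xssPayloads: string[] = [
  "<script>alert('xss')</script>",
  '<img src=x onerror=alert(1)>',
  "<svg onload=alert('xss')>",
  '<details open ontoggle=alert(1)>',
];

for (const payload of xssPayloads) {
  const escaped = escapeHtml(payload);
  // No raw angle brackets may survive — otherwise the payload can still form a tag.
  if (escaped.includes('<') || escaped.includes('>')) {
    throw new Error(`payload not neutralized: ${payload}`);
  }
}
```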
Scale Fixture Generator
// tests/generate-scale-fixture.ts
// Run: npx ts-node tests/generate-scale-fixture.ts > test-fixtures/scale-1000.json
function generateScaleData(count: number) {
const statuses = ['active', 'inactive', 'pending', 'archived'];
const domains = ['gmail.com', 'outlook.com', 'company.co', 'startup.io', 'enterprise.net'];
return {
title: `Scale Test: ${count} Records`,
data: Array.from({ length: count }, (_, i) => ({
id: `contact-${String(i).padStart(6, '0')}`,
name: `Contact ${i + 1}`,
email: `user${i + 1}@${domains[i % domains.length]}`,
phone: `555-${String(i).padStart(4, '0')}`,
status: statuses[i % statuses.length],
created: new Date(2025, 0, 1 + (i % 365)).toISOString().split('T')[0],
value: Math.round(Math.random() * 100000) / 100,
tags: [`tag-${i % 10}`, `region-${i % 5}`]
})),
meta: { total: count, page: 1, pageSize: count }
};
}
console.log(JSON.stringify(generateScaleData(1000), null, 2));
Regression Testing Baselines
Baseline Workflow
1. CAPTURE — First time app is verified correct:
backstop reference
# Stores golden screenshots in test-baselines/backstop/
2. TEST — On every subsequent QA run:
backstop test
# Compares current screenshots against baselines
# Result: PASS (≤5% diff) or FAIL (>5% diff)
3. APPROVE — When intentional changes are made:
backstop approve
# Updates baselines to reflect new correct state
4. TRACK — Tool routing baselines:
# test-fixtures/tool-routing.json is the routing baseline
# Update ONLY when intentionally changing tool descriptions
# Run routing tests after ANY tool description change
Screenshot Baseline Structure
test-baselines/
├── backstop/
│ ├── {app-name}_thread-panel_data.png
│ ├── {app-name}_thread-panel_loading.png
│ ├── {app-name}_thread-panel_empty.png
│ ├── {app-name}_narrow_data.png
│ └── {app-name}_wide_data.png
├── tool-routing.json # NL → tool mapping baseline
└── app-data-schemas/ # JSON schemas per app type
├── dashboard.schema.json
├── data-grid.schema.json
├── detail-card.schema.json
├── timeline.schema.json
└── pipeline.schema.json
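The app-data-schemas/ baselines imply a validation step. As a lightweight stand-in for full JSON Schema validation (e.g. Ajv against data-grid.schema.json), a hand-rolled type guard for the data-grid contract might look like:

```typescript
// Sketch: structural guard for the data-grid APP_DATA contract.
// Field names mirror the data-grid fixture; adjust per app-data schema.
interface GridPayload {
  title: string;
  data: Array<Record<string, unknown>>;
  meta: { total: number; page: number; pageSize: number };
}

function isGridPayload(x: unknown): x is GridPayload {
  if (typeof x !== 'object' || x === null) return false;
  const o = x as Record<string, unknown>;
  const meta = o.meta as Record<string, unknown> | undefined;
  return (
    typeof o.title === 'string' &&
    Array.isArray(o.data) &&
    typeof meta === 'object' && meta !== null &&
    typeof meta.total === 'number' &&
    typeof meta.page === 'number' &&
    typeof meta.pageSize === 'number'
  );
}
```

Run the guard over every fixture in test-fixtures/ and over live APP_DATA samples; any `false` result is a schema-drift signal.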
Programmatic Screenshot Comparison (Without BackstopJS)
// tests/screenshot-diff.ts
import { PNG } from 'pngjs';
import * as fs from 'fs';
import pixelmatch from 'pixelmatch';
function compareScreenshots(
baselinePath: string,
currentPath: string,
diffOutputPath: string
): { diffPercent: number; pass: boolean } {
const baseline = PNG.sync.read(fs.readFileSync(baselinePath));
const current = PNG.sync.read(fs.readFileSync(currentPath));
const { width, height } = baseline;
const diff = new PNG({ width, height });
const numDiffPixels = pixelmatch(
baseline.data, current.data, diff.data,
width, height,
{ threshold: 0.1 }
);
const totalPixels = width * height;
const diffPercent = (numDiffPixels / totalPixels) * 100;
if (diffPercent > 5) {
fs.writeFileSync(diffOutputPath, PNG.sync.write(diff));
}
return {
diffPercent: Math.round(diffPercent * 100) / 100,
pass: diffPercent <= 5.0
};
}
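One way to roll the per-screenshot results from compareScreenshots into a single run verdict — mirroring the ≤5% pass threshold — is a small aggregator (sketch; names are illustrative):

```typescript
// Sketch: collapse per-screenshot diff results into one regression verdict,
// using the same 5% threshold as compareScreenshots above.
interface DiffResult {
  name: string;        // e.g. "dashboard_thread-panel_data"
  diffPercent: number; // from compareScreenshots
}

function regressionVerdict(results: DiffResult[], thresholdPct = 5): {
  pass: boolean;
  failures: string[];
} {
  const failures = results
    .filter(r => r.diffPercent > thresholdPct)
    .map(r => `${r.name} (${r.diffPercent.toFixed(2)}%)`);
  return { pass: failures.length === 0, failures };
}
```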
Automated QA Script (Full)
Save as scripts/mcp-qa.sh:
#!/bin/bash
set -euo pipefail
# MCP QA — Automated Testing Pipeline
# Usage: ./mcp-qa.sh <service-name> [--skip-layer4]
#
# Runs all automated layers and generates a persistent report.
SERVICE="${1:-}"  # default to empty so set -u doesn't abort before the usage check
SKIP_LAYER4="${2:-}"
DATE=$(date +%Y-%m-%d)
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
if [ -z "$SERVICE" ]; then
echo "Usage: $0 <service-name> [--skip-layer4]"
exit 1
fi
# Persistent report location
REPORT_DIR="$HOME/.clawdbot/workspace/mcp-factory-reviews/${SERVICE}"
mkdir -p "$REPORT_DIR"
REPORT="${REPORT_DIR}/qa-report-${DATE}.md"
# Find server directory
SERVER_DIR=""
for d in "${SERVICE}-mcp" "mcp-servers/${SERVICE}" "mcp-diagrams/mcp-servers/${SERVICE}"; do
if [ -d "$d" ]; then
SERVER_DIR="$d"
break
fi
done
if [ -z "$SERVER_DIR" ]; then
echo "❌ Server directory not found for ${SERVICE}"
exit 1
fi
cat > "$REPORT" << EOF
# MCP QA Report: ${SERVICE}
**Date:** ${DATE}
**Timestamp:** ${TIMESTAMP}
**Tester:** Automated QA Pipeline
**Server:** ${SERVER_DIR}
---
## Quantitative Metrics
| Metric | Target | Actual | Status |
|--------|--------|--------|--------|
EOF
TOTAL_PASS=0
TOTAL_FAIL=0
TOTAL_WARN=0
TOTAL_SKIP=0
pass() { TOTAL_PASS=$((TOTAL_PASS + 1)); echo "✅ $1"; }
fail() { TOTAL_FAIL=$((TOTAL_FAIL + 1)); echo "❌ $1"; }
warn() { TOTAL_WARN=$((TOTAL_WARN + 1)); echo "⚠️ $1"; }
skip() { TOTAL_SKIP=$((TOTAL_SKIP + 1)); echo "⏭️ $1"; }
echo ""
echo "========================================"
echo " MCP QA Pipeline: ${SERVICE}"
echo " $(date)"
echo "========================================"
echo ""
# ─── LAYER 0: Protocol Compliance ───
echo "--- Layer 0: Protocol Compliance ---"
echo "" >> "$REPORT"
echo "## Layer 0: Protocol Compliance" >> "$REPORT"
cd "$SERVER_DIR"
# Build first
if npm run build 2>&1 | tail -5 > /tmp/mcp-qa-build.log; then
pass "TypeScript build succeeded"
echo "- ✅ TypeScript build succeeded" >> "$REPORT"
else
fail "TypeScript build FAILED"
echo "- ❌ TypeScript build FAILED" >> "$REPORT"
cat /tmp/mcp-qa-build.log >> "$REPORT"
fi
# MCP Inspector (if available)
if command -v npx &> /dev/null; then
echo "Running MCP Inspector..."
if timeout 15 npx @modelcontextprotocol/inspector stdio node dist/index.js 2>/tmp/mcp-inspector.log; then
pass "MCP Inspector passed"
echo "- ✅ MCP Inspector passed" >> "$REPORT"
else
warn "MCP Inspector had issues (check /tmp/mcp-inspector.log)"
echo "- ⚠️ MCP Inspector had issues" >> "$REPORT"
fi
else
skip "MCP Inspector (npx not available)"
echo "- ⏭️ MCP Inspector skipped" >> "$REPORT"
fi
cd - > /dev/null
# ─── LAYER 1: Static Analysis ───
echo ""
echo "--- Layer 1: Static Analysis ---"
echo "" >> "$REPORT"
echo "## Layer 1: Static Analysis" >> "$REPORT"
# TypeScript type check
cd "$SERVER_DIR"
if npx tsc --noEmit 2>&1 | tail -3 > /tmp/mcp-qa-typecheck.log; then
pass "tsc --noEmit clean"
echo "- ✅ Type check clean" >> "$REPORT"
else
fail "tsc --noEmit has errors"
echo "- ❌ Type check errors:" >> "$REPORT"
cat /tmp/mcp-qa-typecheck.log >> "$REPORT"
fi
cd - > /dev/null
# Any types
ANY_COUNT=$(grep -rn ": any" "$SERVER_DIR/src/" --include="*.ts" 2>/dev/null | grep -cv "catch\|eslint\|node_modules" || true)
ANY_COUNT=${ANY_COUNT:-0}
if [ "$ANY_COUNT" -eq 0 ]; then
pass "No unintended 'any' types"
else
warn "${ANY_COUNT} 'any' types found"
fi
echo "- any types: ${ANY_COUNT}" >> "$REPORT"
# SDK version
SDK_VER=$(cd "$SERVER_DIR" && node -e "console.log(require('./package.json').dependencies['@modelcontextprotocol/sdk'] || 'NOT FOUND')" 2>/dev/null || echo "UNKNOWN")
echo "- SDK version: ${SDK_VER}" >> "$REPORT"
# Warn if SDK is below 1.26.0 (security fix)
if echo "$SDK_VER" | grep -q "1\.25\."; then
warn "SDK version ${SDK_VER} — should be ^1.26.0+ (security fix GHSA-345p-7cg4-v4c7)"
echo "- ⚠️ SDK should be ^1.26.0+ (security fix)" >> "$REPORT"
fi
# App files
echo "" >> "$REPORT"
echo "### App Files" >> "$REPORT"
APP_COUNT=0
APP_OVERSIZED=0
for dir in "$SERVER_DIR/app-ui" "$SERVER_DIR/ui/dist"; do
if [ -d "$dir" ]; then
for f in "$dir"/*.html; do
if [ -f "$f" ]; then
SIZE=$(wc -c < "$f" | tr -d ' ')
KB=$((SIZE / 1024))
APP_COUNT=$((APP_COUNT + 1))
if [ "$SIZE" -gt 51200 ]; then
APP_OVERSIZED=$((APP_OVERSIZED + 1))
echo "- ⚠️ $(basename $f): ${KB}KB (over 50KB budget)" >> "$REPORT"
else
echo "- ✅ $(basename $f): ${KB}KB" >> "$REPORT"
fi
fi
done
fi
done
echo "- App file size budget: ${APP_OVERSIZED}/${APP_COUNT} over 50KB $([ $APP_OVERSIZED -eq 0 ] && echo '✅' || echo '⚠️')" >> "$REPORT"
# ─── LAYER 2: Jest Unit Tests ───
echo ""
echo "--- Layer 2: Automated Tests ---"
echo "" >> "$REPORT"
echo "## Layer 2: Automated Tests" >> "$REPORT"
cd "$SERVER_DIR"
if [ -f "jest.config.ts" ] || [ -f "jest.config.js" ] || grep -q '"jest"' package.json 2>/dev/null; then
echo "Running Jest tests..."
if npx jest --ci --coverage 2>&1 | tee /tmp/mcp-qa-jest.log | tail -10; then
pass "Jest tests passed"
echo "- ✅ Jest tests passed" >> "$REPORT"
else
fail "Jest tests FAILED"
echo "- ❌ Jest tests failed" >> "$REPORT"
tail -20 /tmp/mcp-qa-jest.log >> "$REPORT"
fi
else
skip "No Jest config found"
echo "- ⏭️ No Jest test suite found" >> "$REPORT"
fi
# Playwright visual tests
if [ -f "playwright.config.ts" ] || [ -f "playwright.config.js" ]; then
echo "Running Playwright visual tests..."
if npx playwright test 2>&1 | tee /tmp/mcp-qa-playwright.log | tail -10; then
pass "Playwright tests passed"
echo "- ✅ Playwright tests passed" >> "$REPORT"
else
fail "Playwright tests FAILED"
echo "- ❌ Playwright tests failed" >> "$REPORT"
tail -20 /tmp/mcp-qa-playwright.log >> "$REPORT"
fi
else
skip "No Playwright config found"
echo "- ⏭️ No Playwright test suite found" >> "$REPORT"
fi
# BackstopJS visual regression
if [ -f "backstop.json" ]; then
echo "Running BackstopJS regression..."
if backstop test 2>&1 | tee /tmp/mcp-qa-backstop.log | tail -5; then
pass "BackstopJS regression passed"
echo "- ✅ Visual regression passed" >> "$REPORT"
else
warn "BackstopJS regression detected differences"
echo "- ⚠️ Visual regression diffs detected" >> "$REPORT"
fi
else
skip "No backstop.json found"
echo "- ⏭️ No BackstopJS config found" >> "$REPORT"
fi
cd - > /dev/null
# ─── LAYER 4: Live API (optional) ───
if [ "$SKIP_LAYER4" != "--skip-layer4" ]; then
echo ""
echo "--- Layer 4: Live API Testing ---"
echo "" >> "$REPORT"
echo "## Layer 4: Live API Testing" >> "$REPORT"
if [ -f "$SERVER_DIR/.env" ]; then
pass ".env file exists"
echo "- ✅ .env credentials found" >> "$REPORT"
echo "- ⚠️ Manual verification of live API required" >> "$REPORT"
else
skip "No .env file — skipping live API tests"
echo "- ⏭️ No credentials available" >> "$REPORT"
fi
else
skip "Layer 4 skipped (--skip-layer4)"
echo "" >> "$REPORT"
echo "## Layer 4: Live API Testing — SKIPPED" >> "$REPORT"
fi
# ─── SECURITY SCAN ───
echo ""
echo "--- Layer 4.5: Security Scan ---"
echo "" >> "$REPORT"
echo "## Layer 4.5: Security Scan" >> "$REPORT"
SECURITY_ISSUES=0
for dir in "$SERVER_DIR/app-ui" "$SERVER_DIR/ui/dist"; do
if [ -d "$dir" ]; then
for f in "$dir"/*.html; do
if [ -f "$f" ]; then
# Check for potential key exposure
for pat in "api.key" "apikey" "api_key" "secret" "sk_live" "pk_live"; do
if grep -qi "$pat" "$f" 2>/dev/null; then
SECURITY_ISSUES=$((SECURITY_ISSUES + 1))
echo "- ❌ $(basename $f): potential key exposure (${pat})" >> "$REPORT"
fi
done
fi
done
fi
done
if [ "$SECURITY_ISSUES" -eq 0 ]; then
pass "No API key exposure detected"
echo "- ✅ No API key exposure detected in app files" >> "$REPORT"
else
fail "${SECURITY_ISSUES} potential security issues"
fi
# ─── SUMMARY ───
echo ""
echo "========================================"
echo " SUMMARY"
echo "========================================"
echo " ✅ Passed: ${TOTAL_PASS}"
echo " ❌ Failed: ${TOTAL_FAIL}"
echo " ⚠️ Warnings: ${TOTAL_WARN}"
echo " ⏭️ Skipped: ${TOTAL_SKIP}"
echo "========================================"
OVERALL="PASS"
[ "$TOTAL_FAIL" -gt 0 ] && OVERALL="FAIL"
[ "$TOTAL_FAIL" -eq 0 ] && [ "$TOTAL_WARN" -gt 0 ] && OVERALL="PASS WITH WARNINGS"
cat >> "$REPORT" << EOF
---
## Summary
| Category | Count |
|----------|-------|
| ✅ Passed | ${TOTAL_PASS} |
| ❌ Failed | ${TOTAL_FAIL} |
| ⚠️ Warnings | ${TOTAL_WARN} |
| ⏭️ Skipped | ${TOTAL_SKIP} |
## Overall: **${OVERALL}**
---
*Report generated by MCP QA Pipeline v2.0*
*Saved to: ${REPORT}*
EOF
echo ""
echo "Report saved to: $REPORT"
echo "Overall: ${OVERALL}"
Test Report Template (Full)
Generate this after running all layers. Save to mcp-factory-reviews/{service}/qa-report-{date}.md:
# MCP QA Report: {Service Name}
**Date:** {YYYY-MM-DD}
**Tester:** {agent/human}
**Server:** {service}-mcp v{version}
**Apps:** {count} apps tested
**Credential Status:** {has-creds|needs-creds|sandbox|no-sandbox}
---
## Quantitative Metrics
| Metric | Target | Actual | Status |
|--------|--------|--------|--------|
| MCP Protocol Compliance | 100% | __% | ✅/❌ |
| Tool Correctness Rate | >95% | __/20 (__%) | ✅/❌ |
| Task Completion Rate | >90% | __/10 (__%) | ✅/❌ |
| APP_DATA Schema Match | 100% | __/__ (__%) | ✅/❌ |
| Response Latency P50 | <3s | __s | ✅/❌ |
| Response Latency P95 | <8s | __s | ✅/❌ |
| App Render Success | 100% | __/__ | ✅/❌ |
| Accessibility Score | >90 | __% | ✅/❌ |
| Cold Start Time | <2s | __ms | ✅/❌ |
| App File Size (max) | <50KB | __KB | ✅/❌ |
| Security (critical) | 0 | __ | ✅/❌ |
## Layer Results
| Layer | Status | Issues | Details |
|-------|--------|--------|---------|
| 0 — Protocol | ✅/⚠️/❌ | {count} | {notes} |
| 1 — Static | ✅/⚠️/❌ | {count} | {notes} |
| 2 — Visual | ✅/⚠️/❌ | {count} | {notes} |
| 2.5 — Accessibility | ✅/⚠️/❌ | {count} | {notes} |
| 3 — Functional | ✅/⚠️/❌ | {count} | {notes} |
| 3.5 — Performance | ✅/⚠️/❌ | {count} | {notes} |
| 4 — Live API | ✅/⚠️/❌/⏭️ | {count} | {notes} |
| 4.5 — Security | ✅/⚠️/❌ | {count} | {notes} |
| 5 — Integration | ✅/⚠️/❌ | {count} | {notes} |
## Overall: {PASS / PASS WITH WARNINGS / FAIL}
---
## Issues Found
### Critical (must fix before ship)
1. {issue}: {description} — {file:line}
### Warnings (should fix)
1. {issue}: {description}
### Notes (nice to have)
1. {observation}
---
## App-by-App Results
### {app-id-1}
- Visual: ✅/❌ — {notes}
- Accessibility: Score __% — {violations}
- Data flow: ✅/❌ — {notes}
- States (loading/empty/data): ✅/❌
- File size: __KB
- XSS test: ✅/❌
- Screenshot: {path}
---
## Tool Invocation Results
| # | NL Message | Expected Tool | Actual Tool | Correct? | Latency |
|---|-----------|---------------|-------------|----------|---------|
| 1 | "Show me all contacts" | list_contacts | | ✅/❌ | ms |
| 2 | "Find John Smith" | search_contacts | | ✅/❌ | ms |
| ... | | | | | |
| 20 | | | | | |
**Tool Correctness Rate: __/20 = __%**
---
## E2E Scenario Results
| # | Scenario | Steps | Completed? | Latency | Notes |
|---|----------|-------|-----------|---------|-------|
| 1 | {name} | {n} | ✅/❌ | ms | |
| ... | | | | | |
| 10 | | | | | |
**Task Completion Rate: __/10 = __%**
---
## Trend (vs Previous Report)
| Metric | Previous | Current | Change |
|--------|----------|---------|--------|
| Tool Correctness | __% | __% | +/-__% |
| Task Completion | __% | __% | +/-__% |
| Accessibility | __% | __% | +/-__% |
| Avg Latency | __s | __s | +/-__s |
---
## Recommendations
1. {what to fix/improve before shipping}
2. {items for next QA cycle}
---
*Report saved to: mcp-factory-reviews/{service}/qa-report-{date}.md*
*Previous reports in same directory for trending.*
Report Trending Script
#!/bin/bash
# Aggregate QA trends across reports
# Usage: ./qa-trend.sh <service-name>
SERVICE="$1"
REPORT_DIR="$HOME/.clawdbot/workspace/mcp-factory-reviews/${SERVICE}"
if [ ! -d "$REPORT_DIR" ]; then
echo "No reports found for ${SERVICE}"
exit 1
fi
echo "=== QA Trend: ${SERVICE} ==="
echo ""
echo "| Date | Overall | Pass | Fail | Warn |"
echo "|------|---------|------|------|------|"
for report in "$REPORT_DIR"/qa-report-*.md; do
  [ -e "$report" ] || continue
  DATE=$(basename "$report" | sed 's/^qa-report-//; s/\.md$//')
  OVERALL=$(grep "^## Overall:" "$report" 2>/dev/null | head -1 | sed 's/^## Overall:[[:space:]]*//')
  PASS=$(grep "✅ Passed" "$report" 2>/dev/null | grep -o '[0-9]*' | head -1 || echo "?")
  FAIL=$(grep "❌ Failed" "$report" 2>/dev/null | grep -o '[0-9]*' | head -1 || echo "?")
  WARN=$(grep "⚠️" "$report" 2>/dev/null | grep -o '[0-9]*' | head -1 || echo "?")
  echo "| ${DATE} | ${OVERALL} | ${PASS} | ${FAIL} | ${WARN} |"
done
Quick Reference Commands
# ─── LAYER 0 ───
# MCP Inspector (protocol compliance)
npx @modelcontextprotocol/inspector stdio node dist/index.js
# ─── LAYER 1 ───
# Quick compile + type check
cd {service}-mcp && npm run build && npx tsc --noEmit
# ─── LAYER 2 ───
# Run Playwright visual tests
npx playwright test tests/visual.test.ts
# Run BackstopJS regression
backstop test
# Capture new baselines
backstop reference
# ─── LAYER 2.5 ───
# Run accessibility tests
npx playwright test tests/accessibility.test.ts
# ─── LAYER 3 ───
# Run Jest unit tests
npx jest --verbose
# Run tool routing tests
npx jest tests/tool-routing.test.ts
# Validate APP_DATA schemas
npx ts-node tests/app-data-validator.ts
# ─── LAYER 3.5 ───
# Cold start benchmark
time echo '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2025-11-25","capabilities":{},"clientInfo":{"name":"perf","version":"1.0"}}}' | timeout 10 node dist/index.js | head -1
# File size audit
for f in app-ui/*.html; do echo "$(wc -c < "$f" | tr -d ' ') $f"; done | sort -n
# ─── LAYER 4 ───
# Start server for manual testing
node dist/index.js
# ─── LAYER 4.5 ───
# Security scan
grep -rni "apikey\|api_key\|secret\|sk_live" app-ui/ --include="*.html"
# ─── LAYER 5 ───
# Full automated pipeline
./scripts/mcp-qa.sh {service-name}
# Trend report
./scripts/qa-trend.sh {service-name}
# ─── BROWSER TOOLS ───
# Screenshot via browser tool
# browser → open → http://192.168.0.25:3000 → navigate → screenshot
# Monitor postMessages in browser console
# window.addEventListener('message', e => console.log('[PM]', e.data.type, e.data))
# axe-core in browser console (paste the snippet from Layer 2.5.2)
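The file-size audit above is a shell one-liner; for CI it can be useful to fail the build programmatically against the <50KB budget from the metrics table. A minimal TypeScript sketch (the file name and `auditSizes` helper are illustrative, not part of the framework):

```typescript
// check-size-budget.ts — flag app HTML files that exceed the 50KB budget
// ("<50KB" target from the quality metrics table above).
import { readdirSync, statSync } from "fs";
import { join } from "path";

const BUDGET_BYTES = 50 * 1024;

export function auditSizes(dir: string): { file: string; bytes: number; ok: boolean }[] {
  return readdirSync(dir)
    .filter((f) => f.endsWith(".html"))
    .map((f) => {
      const bytes = statSync(join(dir, f)).size;
      return { file: f, bytes, ok: bytes < BUDGET_BYTES };
    })
    .sort((a, b) => b.bytes - a.bytes); // largest first, like the sort -n audit
}
```

A CI wrapper would call `auditSizes("app-ui")` and exit non-zero when any entry has `ok: false`.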
Common Issues & Fixes
| Symptom | Layer | Cause | Fix |
|---|---|---|---|
| App shows blank white screen | 2 | HTML file not found or wrong path | Check APP_NAME_MAP + APP_DIRS in route.ts |
| App shows loading forever | 3 | postMessage not received | Check data block format: <!--APP_DATA:{...}:END_APP_DATA--> |
| App renders but wrong data | 3 | APP_DATA JSON shape mismatch | Compare tool response fields with app's render() expectations |
| Tool not triggered by NL | 3 | Poor tool description | Add "do NOT use when" disambiguation |
| Wrong tool triggered | 3 | Similar tool descriptions | Add negative examples to both competing tools |
| Thread panel empty | 3 | Thread state not persisted | Check localStorage lb-threads key |
| Console error: CORS | 2 | iframe cross-origin issue | Ensure app served from same origin |
| Dark theme wrong | 2 | Hardcoded light colors | Audit CSS for #fff, white, #f colors |
| Overflow at narrow width | 2 | Fixed widths in CSS | Use max-width: 100%, overflow-x: auto, flex/grid |
| axe-core contrast fail | 2.5 | Text color too dim | Use #b0b2b8+ for secondary text (not #96989d) |
| MCP Inspector fails | 0 | Protocol error in server | Check initialize handler, verify JSON-RPC framing |
| Cold start >2s | 3.5 | Heavy imports at startup | Use lazy loading for tool groups |
| structuredContent mismatch | 0 | Output doesn't match outputSchema | Validate tool return against declared schema |
| APP_DATA parse fails | 3 | LLM produced invalid JSON | Use robust parser with newline stripping + trailing comma fix |
| XSS detected | 4.5 | Missing escapeHtml on field | Add escapeHtml() to all dynamic text insertions |
| Key exposure | 4.5 | API key in HTML file | Move to server-side only, never send to client |
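The last two fixes in the table reference helpers that are not shown in this section. A minimal sketch of both, assuming the `<!--APP_DATA:{...}:END_APP_DATA-->` block format described above (function names are illustrative; the regex fixes cover only the common defects named in the table, not every malformed payload):

```typescript
// Extract and parse an APP_DATA block, tolerating the two common LLM JSON
// defects from the table: literal newlines inside the block and trailing
// commas before } or ]. Returns null when no block is found or parsing fails.
export function parseAppData(html: string): unknown | null {
  const match = html.match(/<!--APP_DATA:([\s\S]*?):END_APP_DATA-->/);
  if (!match) return null;
  const raw = match[1]
    .replace(/[\r\n]+/g, " ")        // strip literal newlines
    .replace(/,\s*([}\]])/g, "$1");  // drop trailing commas
  try {
    return JSON.parse(raw);
  } catch {
    return null;
  }
}

// Escape dynamic text before inserting it into app HTML (the XSS fix above).
export function escapeHtml(s: string): string {
  return s
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}
```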
Project Setup: Adding Tests to an Existing Server
When adding this test framework to a server that doesn't have it yet:
cd {service}-mcp
# 1. Install test dependencies
npm install -D jest ts-jest @types/jest msw playwright @playwright/test @axe-core/playwright ajv pngjs pixelmatch backstopjs
# 2. Add Jest config
cat > jest.config.ts << 'EOF'
export default {
  preset: 'ts-jest',
  testEnvironment: 'node',
  testRegex: 'tests/.*\\.test\\.ts$',
  // Playwright owns these specs; keep them out of Jest's run
  testPathIgnorePatterns: ['visual.test.ts', 'accessibility.test.ts', 'chaos.test.ts'],
  setupFilesAfterEnv: ['./tests/setup.ts'],
};
EOF
# 3. Add Playwright config
cat > playwright.config.ts << 'EOF'
import { defineConfig, devices } from '@playwright/test';
export default defineConfig({
testDir: './tests',
testMatch: ['visual.test.ts', 'accessibility.test.ts', 'chaos.test.ts'],
projects: [
{ name: 'chromium', use: { ...devices['Desktop Chrome'] } },
{ name: 'firefox', use: { ...devices['Desktop Firefox'] } },
{ name: 'webkit', use: { ...devices['Desktop Safari'] } },
],
});
EOF
# 4. Create directory structure
mkdir -p tests test-fixtures test-baselines/backstop test-baselines/app-data-schemas test-results/screenshots
# 5. Create initial fixture files
# (copy from the fixtures library section above)
# 6. Add scripts to package.json
npm pkg set scripts.test="jest"
npm pkg set scripts.test:visual="playwright test"
npm pkg set scripts.test:a11y="playwright test tests/accessibility.test.ts"
npm pkg set scripts.test:all="jest && playwright test"
npm pkg set scripts.qa="../../scripts/mcp-qa.sh $(basename $(pwd) -mcp)"
# 7. Install Playwright browsers
npx playwright install
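The command list above invokes `tests/app-data-validator.ts` but its contents are not shown in this section. The real validator can lean on the `ajv` dependency installed in step 1 with the JSON Schemas in `test-baselines/app-data-schemas/`; the core check is simple enough to sketch without it (the `FieldSpec` shape and `validateAppData` name are illustrative):

```typescript
// Minimal structural check: does a tool's APP_DATA payload expose the
// fields the app's render() expects? A fuller version would run ajv
// against the baseline schemas instead of this hand-rolled spec.
type FieldSpec = { name: string; type: "string" | "number" | "boolean" | "array" | "object" };

export function validateAppData(data: Record<string, unknown>, expected: FieldSpec[]): string[] {
  const errors: string[] = [];
  for (const field of expected) {
    const value = data[field.name];
    if (value === undefined) {
      errors.push(`missing field: ${field.name}`);
      continue;
    }
    const actual = Array.isArray(value) ? "array" : typeof value;
    if (actual !== field.type) {
      errors.push(`field ${field.name}: expected ${field.type}, got ${actual}`);
    }
  }
  return errors;
}
```

Feeding each tool's captured APP_DATA through a spec like this is what backs the "APP_DATA Schema Match" row in the metrics table.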