# MCP QA Tester — Automated Testing Framework & Quality Metrics Pipeline

**When to use this skill:** Testing MCP servers, apps, and their LocalBosses integration. Use after Phase 4 (integration) to verify everything works — at the protocol level, visually, functionally, and against live APIs. This is an **automated-first** framework with quantitative metrics, regression baselines, and persistent reporting.

**What this covers:** MCP protocol compliance, automated unit/visual/functional testing, accessibility auditing, performance benchmarking, security validation, chaos testing, and quantitative quality metrics with regression tracking.

---

## Testing Architecture

```
Layer 0:   Protocol Compliance ──── MCP Inspector + JSON-RPC lifecycle validation
Layer 1:   Static Analysis ──────── TypeScript build, linting, file structure, schema validation
Layer 2:   Visual Testing ───────── Playwright screenshots, BackstopJS regression, Gemini analysis
Layer 2.5: Accessibility ────────── axe-core, keyboard nav, contrast audit, screen reader compat
Layer 3:   Functional Testing ───── Tool routing smoke tests, data flow validation, thread lifecycle
Layer 3.5: Performance ──────────── Cold start, latency, memory, file size budgets
Layer 4:   Live API Testing ─────── Real API calls with credential management strategy
Layer 4.5: Security ─────────────── XSS, CSP, postMessage origin, key exposure
Layer 5:   Integration Testing ──── Full E2E scenarios, chaos testing, cross-browser validation
```

Every layer has **quantitative pass/fail criteria**. Do NOT skip layers — issues compound.

---

## Quantitative Quality Metrics (REQUIRED)

Every QA report MUST include these metrics. No more pass/fail checklists — we measure.

| Metric | Target | Method | Priority |
|--------|--------|--------|----------|
| **MCP Protocol Compliance** | 100% | MCP Inspector — all checks pass | P0 |
| **Tool Correctness Rate** | >95% | Run 20 NL messages, count correct tool selections | P0 |
| **Task Completion Rate** | >90% | Run 10 E2E scenarios, count fully completed | P0 |
| **APP_DATA Schema Match** | 100% | Validate every APP_DATA against JSON schema | P0 |
| **Response Latency P50** | <3s | Measure 10 read interactions | P1 |
| **Response Latency P95** | <8s | Measure 10 interactions (reads + writes) | P1 |
| **App Render Success** | 100% | All apps render data state without console errors | P0 |
| **Accessibility Score** | >90 | axe-core audit on every app HTML | P1 |
| **Cold Start Time** | <2s | `time node dist/index.js` → first ListTools response | P1 |
| **App File Size** | <50KB each | Check all HTML files | P1 |
| **Security Scan** | 0 critical | XSS + CSP + key exposure checks | P0 |

### How to calculate:

```
Tool Correctness Rate = (correct_tool_selections / total_test_messages) × 100
Task Completion Rate  = (completed_scenarios / total_scenarios) × 100
APP_DATA Schema Match = (valid_app_data_blocks / total_app_data_blocks) × 100
```
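These formulas, plus the latency percentiles from the metrics table, can be computed mechanically from a results log. A minimal sketch — the helper names and the nearest-rank percentile method are assumptions, not part of the framework:

```typescript
// Rate helper for the three percentage metrics, rounded to one decimal.
function rate(passed: number, total: number): number {
  return total === 0 ? 0 : Math.round((passed / total) * 1000) / 10;
}

// Percentile for the latency metrics (P50 / P95), nearest-rank method.
function percentile(samplesMs: number[], p: number): number {
  const sorted = [...samplesMs].sort((a, b) => a - b);
  if (sorted.length === 0) return 0;
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.min(rank, sorted.length) - 1];
}
```

For example, 19 correct tool selections out of 20 messages gives `rate(19, 20)` → 95, and the latency rows are filled with `percentile(latencies, 50)` and `percentile(latencies, 95)`.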

---

## Layer 0: MCP Protocol Compliance Testing

**Why this layer exists:** The MCP spec defines exact JSON-RPC lifecycle, tool definition formats, and error codes. If the server isn't protocol-compliant, nothing else matters. This is the foundation.

### 0.1 — MCP Inspector (Official Tool)

```bash
# Install and run MCP Inspector against the server
npx @modelcontextprotocol/inspector stdio node dist/index.js

# The Inspector validates:
# ✅ initialize → initialized lifecycle
# ✅ tools/list response format
# ✅ tools/call request/response format
# ✅ JSON-RPC message framing
# ✅ Capability negotiation
# ✅ Notification handling
```

### 0.2 — Automated Protocol Test Script

Save as `tests/protocol-compliance.test.ts`:

```typescript
import { spawn, ChildProcess } from 'child_process';
import * as readline from 'readline';

// Minimal JSON-RPC client for testing MCP servers over stdio
class MCPTestClient {
  private proc: ChildProcess;
  private rl: readline.Interface;
  private pending: Map<number, { resolve: Function; reject: Function }> = new Map();
  private nextId = 1;
  private notifications: any[] = [];

  constructor(command: string, args: string[]) {
    this.proc = spawn(command, args, { stdio: ['pipe', 'pipe', 'pipe'] });
    this.rl = readline.createInterface({ input: this.proc.stdout! });
    this.rl.on('line', (line) => {
      try {
        const msg = JSON.parse(line);
        if (msg.id && this.pending.has(msg.id)) {
          this.pending.get(msg.id)!.resolve(msg);
          this.pending.delete(msg.id);
        } else if (!msg.id) {
          this.notifications.push(msg);
        }
      } catch (e) { /* ignore non-JSON lines */ }
    });
  }

  async request(method: string, params?: any): Promise<any> {
    const id = this.nextId++;
    const msg = JSON.stringify({ jsonrpc: '2.0', id, method, params: params || {} });
    this.proc.stdin!.write(msg + '\n');
    return new Promise((resolve, reject) => {
      this.pending.set(id, { resolve, reject });
      setTimeout(() => {
        if (this.pending.has(id)) {
          this.pending.delete(id);
          reject(new Error(`Timeout on ${method}`));
        }
      }, 10000);
    });
  }

  // Notifications carry no id and expect no response — do NOT use request()
  notify(method: string, params?: any): void {
    const msg = JSON.stringify({ jsonrpc: '2.0', method, params: params || {} });
    this.proc.stdin!.write(msg + '\n');
  }

  getNotifications() { return this.notifications; }

  async close() {
    this.proc.kill();
  }
}

describe('MCP Protocol Compliance', () => {
  let client: MCPTestClient;

  beforeAll(async () => {
    client = new MCPTestClient('node', ['dist/index.js']);
  });

  afterAll(async () => {
    await client.close();
  });

  test('initialize → initialized lifecycle', async () => {
    const initResult = await client.request('initialize', {
      protocolVersion: '2025-11-25',
      capabilities: {},
      clientInfo: { name: 'qa-test-client', version: '1.0.0' }
    });

    expect(initResult.result).toBeDefined();
    expect(initResult.result.protocolVersion).toBeDefined();
    expect(initResult.result.capabilities).toBeDefined();
    expect(initResult.result.serverInfo).toBeDefined();
    expect(initResult.result.serverInfo.name).toBeTruthy();
    expect(initResult.result.serverInfo.version).toBeTruthy();

    // Send initialized notification (no id = notification, no response expected)
    client.notify('notifications/initialized');
  });

  test('tools/list returns valid tool definitions', async () => {
    const result = await client.request('tools/list', {});

    expect(result.result).toBeDefined();
    expect(result.result.tools).toBeInstanceOf(Array);
    expect(result.result.tools.length).toBeGreaterThan(0);

    for (const tool of result.result.tools) {
      // Required fields per MCP 2025-11-25
      expect(tool.name).toBeTruthy();
      expect(tool.description).toBeTruthy();
      expect(typeof tool.name).toBe('string');
      expect(typeof tool.description).toBe('string');

      // Name format: must be alphanumeric + underscores/hyphens/dots
      expect(tool.name).toMatch(/^[a-zA-Z0-9_.\-]+$/);

      // inputSchema must be valid JSON Schema object
      if (tool.inputSchema) {
        expect(tool.inputSchema.type).toBe('object');
      }

      // If title exists, must be string
      if (tool.title) {
        expect(typeof tool.title).toBe('string');
      }

      // If outputSchema exists, validate it
      if (tool.outputSchema) {
        expect(tool.outputSchema.type).toBeDefined();
      }

      // If annotations exist, validate known fields
      if (tool.annotations) {
        const validAnnotations = [
          'readOnlyHint', 'destructiveHint', 'idempotentHint', 'openWorldHint'
        ];
        for (const key of Object.keys(tool.annotations)) {
          if (validAnnotations.includes(key)) {
            expect(typeof tool.annotations[key]).toBe('boolean');
          }
        }
      }
    }
  });

  test('tools/call returns valid response for read-only tools', async () => {
    // Get list of tools first
    const listResult = await client.request('tools/list', {});
    const readOnlyTools = listResult.result.tools.filter(
      (t: any) => t.annotations?.readOnlyHint === true
    );

    // Test first read-only tool (safest to call); a tool with required
    // params may return isError — that path is handled below
    if (readOnlyTools.length > 0) {
      const tool = readOnlyTools[0];
      const callResult = await client.request('tools/call', {
        name: tool.name,
        arguments: {}
      });

      expect(callResult.result).toBeDefined();

      // Result must have content array
      if (!callResult.result.isError) {
        expect(callResult.result.content).toBeInstanceOf(Array);
        for (const item of callResult.result.content) {
          expect(item.type).toBeDefined();
          // Text content must have text field
          if (item.type === 'text') {
            expect(typeof item.text).toBe('string');
          }
        }
      }

      // If structuredContent exists, validate against outputSchema
      if (callResult.result.structuredContent && tool.outputSchema) {
        // Basic type check — full JSON Schema validation is in the schema validator section
        expect(typeof callResult.result.structuredContent).toBe('object');
      }
    }
  });

  test('error responses use correct JSON-RPC error codes', async () => {
    // Call non-existent tool — should get method not found or tool error
    const result = await client.request('tools/call', {
      name: 'nonexistent_tool_that_should_not_exist_12345',
      arguments: {}
    });

    // Should be an error response
    expect(
      result.error || result.result?.isError
    ).toBeTruthy();

    // If protocol error, must use standard JSON-RPC codes
    if (result.error) {
      expect(result.error.code).toBeDefined();
      expect(typeof result.error.code).toBe('number');
      expect(result.error.message).toBeTruthy();
      // Standard codes: -32700 (parse), -32600 (invalid request),
      // -32601 (method not found), -32602 (invalid params), -32603 (internal)
    }
  });

  test('ping request is handled', async () => {
    // Server should handle ping
    try {
      await client.request('ping', {});
      // If no error, ping is supported
    } catch (e) {
      // Ping timeout is acceptable for some servers
    }
  });
});
```

### 0.3 — structuredContent Validation

```typescript
// tests/structured-content.test.ts
import Ajv from 'ajv';

const ajv = new Ajv({ allErrors: true });

function validateStructuredContent(
  toolName: string,
  outputSchema: object,
  structuredContent: any
): { valid: boolean; errors: string[] } {
  const validate = ajv.compile(outputSchema);
  const valid = validate(structuredContent);
  return {
    valid: !!valid,
    errors: valid ? [] : (validate.errors || []).map(e =>
      `${e.instancePath} ${e.message}`
    )
  };
}

// Run this after getting tools/list + tools/call results
describe('structuredContent schema validation', () => {
  test('every tool with outputSchema returns conforming structuredContent', async () => {
    // This would be populated from actual tool calls
    const toolResults: Array<{
      toolName: string;
      outputSchema: object;
      structuredContent: any;
    }> = []; // Populate from Layer 4 results

    for (const { toolName, outputSchema, structuredContent } of toolResults) {
      if (structuredContent && outputSchema) {
        const result = validateStructuredContent(toolName, outputSchema, structuredContent);
        // Log BEFORE asserting — a failed expect() throws and would skip the log
        if (!result.valid) {
          console.error(`Schema mismatch for ${toolName}:`, result.errors);
        }
        expect(result.valid).toBe(true);
      }
    }
  });
});
```

### 0.4 — Tasks & Elicitation Testing (2025-11-25 Spec)

If the server declares `tasks` capability (async operations via SEP-1686), test the task lifecycle:

```typescript
test('tasks/list returns valid task list', async () => {
  const result = await client.request('tasks/list', {});
  if (result.result) {
    expect(result.result.tasks).toBeInstanceOf(Array);
  }
  // Some servers may not implement tasks — that's OK, just verify no crash
});

test('long-running tool call returns task reference when task-enabled', async () => {
  // If a tool has execution.taskSupport = "required" or "optional",
  // calling it with _meta.taskId should return a task reference
  // rather than blocking until completion
  const listResult = await client.request('tools/list', {});
  const taskTools = listResult.result.tools.filter(
    (t: any) => t.execution?.taskSupport === 'required' || t.execution?.taskSupport === 'optional'
  );
  // Log task-capable tools for the report
  console.log(`Task-capable tools: ${taskTools.map((t: any) => t.name).join(', ') || 'none'}`);
});
```

If the server uses **elicitation** (`elicitation/create`), test that:

- Elicitation requests include valid `requestedSchema` with JSON Schema
- The server handles user-provided elicitation responses gracefully
- URL mode elicitation (2025-11-25) correctly redirects to external URLs
- The server doesn't hang if elicitation is denied by the client

```typescript
test('server handles elicitation denial gracefully', async () => {
  // If the server requests elicitation and the client denies it, the server
  // should return a useful error message, not crash or hang.
  // This is tested implicitly by calling tools without providing
  // elicitation responses — the server should time out or fall back.
});
```

### Quality Gate:
- [ ] MCP Inspector passes all checks
- [ ] initialize → initialized lifecycle works
- [ ] tools/list returns valid, non-empty tool array
- [ ] All tool names match `/^[a-zA-Z0-9_.\-]+$/`
- [ ] All tool descriptions are non-empty strings
- [ ] tools/call returns valid content arrays
- [ ] structuredContent (if present) matches outputSchema
- [ ] Error responses use correct JSON-RPC codes
- [ ] Server handles unknown methods gracefully (doesn't crash)

---

## Layer 1: Static Analysis

### 1.1 — TypeScript Compilation
```bash
cd {service}-mcp
npm run build 2>&1
# Must exit 0 with no errors
# Warnings are OK but should be reviewed

# Separate type-check (catches issues build might miss)
npx tsc --noEmit 2>&1
```

### 1.2 — Code Quality Checks
```bash
# Check for `any` types (red flag)
grep -rn ": any" src/ --include="*.ts" | grep -v "node_modules" | grep -v "// eslint" | grep -v "catch"
# Goal: zero instances in tool handlers
# Exception: catch(error: any) is acceptable

# Check for console.log (should use structured logging)
grep -rn "console.log" src/ --include="*.ts" | grep -v "node_modules"
# Goal: zero — use console.error for MCP server logging

# Check SDK version is pinned appropriately
node -e "const p = require('./package.json'); console.log('SDK:', p.dependencies['@modelcontextprotocol/sdk'])"
# Should be ^1.26.0 or higher (security fix: GHSA-345p-7cg4-v4c7)

# Check Zod version
node -e "const p = require('./package.json'); console.log('Zod:', p.dependencies['zod'])"
# Should be ^3.25.0 or higher
```
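The version checks above only print the pinned ranges; the known-bad pairing (Zod v4.x with SDK v1.x, issue #1429, noted in the quality gate below) can be flagged automatically. A sketch — the parsing is naive and only handles simple `^x.y.z` ranges:

```typescript
// Extract the major version from a simple semver range like "^1.26.0".
function majorOf(range: string): number {
  const m = range.match(/(\d+)\.\d+\.\d+/);
  return m ? Number(m[1]) : NaN;
}

// Return a list of human-readable problems for the dependency gate.
function checkDependencyGate(deps: Record<string, string>): string[] {
  const problems: string[] = [];
  const sdk = deps['@modelcontextprotocol/sdk'];
  const zod = deps['zod'];
  if (sdk && zod && majorOf(sdk) === 1 && majorOf(zod) === 4) {
    problems.push('Zod v4.x is incompatible with SDK v1.x — pin zod to ^3.25.0');
  }
  return problems;
}
```

Feed it `require('./package.json').dependencies` and fail the gate if the returned array is non-empty.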

### 1.3 — HTML App Validation
```bash
# Check all app HTML files exist and are within size budget (>1KB, <50KB)
for f in app-ui/*.html ui/dist/*.html; do
  if [ -f "$f" ]; then
    SIZE=$(wc -c < "$f" | tr -d ' ')
    if [ "$SIZE" -gt 51200 ]; then
      echo "⚠️ $f ($SIZE bytes) — EXCEEDS 50KB budget"
    elif [ "$SIZE" -lt 1024 ]; then
      echo "⚠️ $f ($SIZE bytes) — UNDER 1KB, likely a stub"
    else
      echo "✅ $f ($SIZE bytes)"
    fi
  else
    echo "❌ $f MISSING"
  fi
done
```

### 1.4 — Route Mapping Cross-Reference
```bash
# Verify every app ID in channels.ts has a matching entry in ALL integration files
node -e "
const fs = require('fs');
const path = require('path');

const LB_ROOT = 'localbosses-app/src';
const files = {
  channels: fs.readFileSync(path.join(LB_ROOT, 'lib/channels.ts'), 'utf8'),
  appNames: fs.readFileSync(path.join(LB_ROOT, 'lib/appNames.ts'), 'utf8'),
  intakes: fs.readFileSync(path.join(LB_ROOT, 'lib/app-intakes.ts'), 'utf8'),
  route: fs.readFileSync(path.join(LB_ROOT, 'app/api/mcp-apps/route.ts'), 'utf8'),
};

// Extract app IDs from channels (anything in mcpApps arrays)
const channelApps = [...files.channels.matchAll(/['\"]([a-z0-9-]+)['\"]/g)]
  .map(m => m[1])
  .filter(id => id.length > 3 && !['true','false','null'].includes(id));

let issues = 0;
const unique = [...new Set(channelApps)];
for (const id of unique) {
  const inNames = files.appNames.includes(id);
  const inIntakes = files.intakes.includes(id);
  const inRoute = files.route.includes(id);
  if (!inNames || !inIntakes || !inRoute) {
    console.log('❌ ' + id + ': ' +
      (!inNames ? 'MISSING appNames ' : '') +
      (!inIntakes ? 'MISSING app-intakes ' : '') +
      (!inRoute ? 'MISSING route ' : ''));
    issues++;
  }
}
if (issues === 0) console.log('✅ All ' + unique.length + ' app IDs cross-referenced');
else console.log('\\n⚠️ ' + issues + ' cross-reference issues found');
"
```
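The gate item "route mappings resolve to actual HTML files" has no script yet. A sketch of one — it assumes the route map references apps as string literals ending in `.html`, so adjust the regex to the real shape of `route.ts`:

```typescript
// Pull every "*.html" string literal out of the route map source.
function extractHtmlRefs(routeSource: string): string[] {
  return [...routeSource.matchAll(/['"`]([\w./-]+\.html)['"`]/g)].map(m => m[1]);
}

// Given the referenced files and the set of HTML files that actually exist
// (basenames), return the references that do not resolve.
function missingFiles(refs: string[], existing: Set<string>): string[] {
  return refs.filter(r => !existing.has(r.split('/').pop()!));
}
```

Wire it up by reading `route.ts` with `fs.readFileSync` and building the `existing` set from `fs.readdirSync('app-ui')`; fail the gate if `missingFiles` returns anything.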

### Quality Gate:
- [ ] TypeScript compiles with zero errors
- [ ] `tsc --noEmit` passes clean
- [ ] No unintended `any` types in tool handlers
- [ ] SDK pinned to `^1.26.0`+, Zod to `^3.25.0`+ (Do NOT use Zod v4.x with SDK v1.x — known incompatibility, issue #1429)
- [ ] All HTML app files exist, are >1KB and <50KB
- [ ] All app IDs cross-referenced across channels, appNames, app-intakes, and route map
- [ ] All route mappings resolve to actual HTML files

---

## Layer 2: Visual Testing

### 2.1 — Automated Playwright Visual Tests

Save as `tests/visual.test.ts`:

```typescript
import { test, expect, Page } from '@playwright/test';
import * as fs from 'fs';
import * as path from 'path';

// Configuration
const APP_UI_DIR = path.resolve(__dirname, '../app-ui');
const SCREENSHOTS_DIR = path.resolve(__dirname, '../test-results/screenshots');
const BASELINES_DIR = path.resolve(__dirname, '../test-baselines/screenshots');
const FIXTURES_DIR = path.resolve(__dirname, '../test-fixtures');

// Ensure directories exist
fs.mkdirSync(SCREENSHOTS_DIR, { recursive: true });

// Discover all HTML app files
const appFiles = fs.readdirSync(APP_UI_DIR)
  .filter(f => f.endsWith('.html'))
  .map(f => path.join(APP_UI_DIR, f));

// Load fixture for app type (or use default)
function loadFixture(appFile: string): any {
  const baseName = path.basename(appFile, '.html');
  const fixturePath = path.join(FIXTURES_DIR, `${baseName}.json`);
  if (fs.existsSync(fixturePath)) {
    return JSON.parse(fs.readFileSync(fixturePath, 'utf8'));
  }
  // Default fixture
  return {
    title: 'Test Data',
    data: [
      { name: 'Test Item 1', status: 'active', value: 100 },
      { name: 'Test Item 2', status: 'inactive', value: 200 },
      { name: 'Test Item 3', status: 'pending', value: 300 },
    ],
    meta: { total: 3, page: 1, pageSize: 25 }
  };
}

for (const appFile of appFiles) {
  const appName = path.basename(appFile, '.html');

  test.describe(`Visual: ${appName}`, () => {
    let page: Page;

    test.beforeEach(async ({ browser }) => {
      page = await browser.newPage({ viewport: { width: 400, height: 600 } });
      // Register the listener BEFORE navigation so load-time errors are caught
      page.on('console', msg => {
        if (msg.type() === 'error') {
          console.error(`[${appName}] Console error:`, msg.text());
        }
      });
      await page.goto(`file://${appFile}`);
    });

    test.afterEach(async () => {
      await page.close();
    });

    test('renders loading state initially', async () => {
      // Before any data, loading state should show
      const loading = page.locator('#loading');
      const content = page.locator('#content');
      // At least one should be visible
      const loadingVis = await loading.isVisible().catch(() => false);
      const contentVis = await content.isVisible().catch(() => false);
      expect(loadingVis || contentVis).toBe(true);

      await page.screenshot({
        path: path.join(SCREENSHOTS_DIR, `${appName}-loading.png`)
      });
    });

    test('renders empty state', async () => {
      // Inject empty data
      await page.evaluate(() => {
        window.postMessage({ type: 'mcp_app_data', data: {} }, '*');
      });
      await page.waitForTimeout(500);

      // Should show empty state, not crash
      const hasError = await page.evaluate(() => {
        return document.body.innerText.includes('Error') ||
               document.body.innerText.includes('undefined');
      });

      await page.screenshot({
        path: path.join(SCREENSHOTS_DIR, `${appName}-empty.png`)
      });

      // No JS crashes
      expect(hasError).toBe(false);
    });

    test('renders data state without console errors', async () => {
      const fixture = loadFixture(appFile);
      const consoleErrors: string[] = [];
      page.on('console', msg => {
        if (msg.type() === 'error') consoleErrors.push(msg.text());
      });

      // Inject fixture data
      await page.evaluate((data) => {
        window.postMessage({ type: 'mcp_app_data', data }, '*');
      }, fixture);
      await page.waitForTimeout(1000);

      // Content should be visible (loading hidden)
      const loading = page.locator('#loading');
      const loadingHidden = !(await loading.isVisible().catch(() => true));
      expect(loadingHidden).toBe(true);

      await page.screenshot({
        path: path.join(SCREENSHOTS_DIR, `${appName}-data.png`)
      });

      expect(consoleErrors).toHaveLength(0);
    });

    test('no horizontal overflow at 320px', async () => {
      await page.setViewportSize({ width: 320, height: 600 });
      const fixture = loadFixture(appFile);

      await page.evaluate((data) => {
        window.postMessage({ type: 'mcp_app_data', data }, '*');
      }, fixture);
      await page.waitForTimeout(500);

      const hasOverflow = await page.evaluate(() => {
        return document.documentElement.scrollWidth > document.documentElement.clientWidth;
      });

      await page.screenshot({
        path: path.join(SCREENSHOTS_DIR, `${appName}-narrow.png`)
      });

      expect(hasOverflow).toBe(false);
    });

    test('dark theme compliance', async () => {
      const fixture = loadFixture(appFile);
      await page.evaluate((data) => {
        window.postMessage({ type: 'mcp_app_data', data }, '*');
      }, fixture);
      await page.waitForTimeout(500);

      // Check background color is dark
      const bgColor = await page.evaluate(() => {
        return getComputedStyle(document.body).backgroundColor;
      });
      // Should be dark (r,g,b each < 60)
      const match = bgColor.match(/\d+/g);
      if (match) {
        const [r, g, b] = match.map(Number);
        expect(r).toBeLessThan(60);
        expect(g).toBeLessThan(60);
        expect(b).toBeLessThan(60);
      }
    });
  });
}
```

### 2.2 — BackstopJS Visual Regression

```bash
# Initialize BackstopJS (one-time setup)
npm install -g backstopjs
backstop init

# Configure backstop.json:
```

```json
{
  "id": "mcp-apps",
  "viewports": [
    { "label": "thread-panel", "width": 400, "height": 600 },
    { "label": "narrow", "width": 320, "height": 600 },
    { "label": "wide", "width": 800, "height": 600 }
  ],
  "scenarios": [
    {
      "label": "contact-grid-data",
      "url": "file:///path/to/app-ui/contact-grid.html",
      "onReadyScript": "inject-data.js",
      "delay": 1000,
      "misMatchThreshold": 5.0,
      "requireSameDimensions": true
    }
  ],
  "paths": {
    "bitmaps_reference": "test-baselines/backstop",
    "bitmaps_test": "test-results/backstop",
    "engine_scripts": "tests/backstop-scripts"
  },
  "engine": "playwright",
  "engineOptions": {
    "args": ["--no-sandbox"]
  }
}
```

```javascript
// tests/backstop-scripts/inject-data.js
module.exports = async (page, scenario, viewport, isReference, browserContext) => {
  // Scenario labels are "<app-name>-<state>" (e.g. "contact-grid-data"),
  // so drop only the trailing state segment to recover the fixture name
  const fixtureName = scenario.label.replace(/-[^-]+$/, '');
  const fixtures = require('../test-fixtures/' + fixtureName + '.json');
  await page.evaluate((data) => {
    window.postMessage({ type: 'mcp_app_data', data }, '*');
  }, fixtures);
  await page.waitForTimeout(500);
};
```

```bash
# Capture baselines (run once when apps are verified correct)
backstop reference

# Test against baselines (run on every QA cycle)
backstop test
# Result: PASS if <5% pixel diff, FAIL otherwise
# Visual diff report opens in browser automatically
```

### 2.3 — Gemini Multimodal Analysis (Subjective Quality)

```bash
# After Playwright captures screenshots, run Gemini for subjective quality:
gemini "Analyze this MCP app screenshot. Check and rate PASS/WARN/FAIL:

1. RENDERING: Does it show real content (not blank/placeholder)?
2. DARK THEME: Background ~#1a1d23, accent ~#ff6d5a, text ~#dcddde
3. LAYOUT: Content properly aligned, no overlapping elements?
4. TYPOGRAPHY: Text readable, proper sizing, no clipping?
5. DATA QUALITY: Does the rendered data look realistic?
6. RESPONSIVENESS: Would this work at 280px (thread panel)?
7. BUGS: Any visual artifacts, broken images, misaligned elements?" -f screenshot.png
```

### Quality Gate:
- [ ] All apps render loading → empty → data states without crashes
- [ ] Zero console errors in data state
- [ ] No horizontal overflow at 320px width
- [ ] Dark theme compliance (background RGB each <60)
- [ ] BackstopJS regression: <5% pixel diff from baselines
- [ ] Gemini subjective review: no FAIL ratings

---

## Layer 2.5: Accessibility Testing

### 2.5.1 — axe-core Automated Audit

Integrate directly into Playwright tests:

```typescript
// tests/accessibility.test.ts
import { test, expect, Page } from '@playwright/test';
import AxeBuilder from '@axe-core/playwright';
import * as fs from 'fs';
import * as path from 'path';

const APP_UI_DIR = path.resolve(__dirname, '../app-ui');
const FIXTURES_DIR = path.resolve(__dirname, '../test-fixtures');

const appFiles = fs.readdirSync(APP_UI_DIR)
  .filter(f => f.endsWith('.html'));

for (const appFile of appFiles) {
  const appName = path.basename(appFile, '.html');

  test.describe(`Accessibility: ${appName}`, () => {
    test('passes axe-core audit with data loaded', async ({ page }) => {
      await page.goto(`file://${path.join(APP_UI_DIR, appFile)}`);

      // Load fixture data
      const fixturePath = path.join(FIXTURES_DIR, `${appName}.json`);
      const fixture = fs.existsSync(fixturePath)
        ? JSON.parse(fs.readFileSync(fixturePath, 'utf8'))
        : { title: 'Test', data: [{ name: 'Test', status: 'active' }] };

      await page.evaluate((data) => {
        window.postMessage({ type: 'mcp_app_data', data }, '*');
      }, fixture);
      await page.waitForTimeout(1000);

      // Run axe-core
      const results = await new AxeBuilder({ page })
        .withTags(['wcag2a', 'wcag2aa', 'wcag21a', 'wcag21aa'])
        .analyze();

      // Log violations for debugging
      if (results.violations.length > 0) {
        console.log(`\n[${appName}] Accessibility violations:`);
        for (const v of results.violations) {
          console.log(`  ${v.impact}: ${v.id} — ${v.description}`);
          console.log(`    Help: ${v.helpUrl}`);
          for (const node of v.nodes.slice(0, 3)) {
            console.log(`    Target: ${node.target.join(' > ')}`);
          }
        }
      }

      // Calculate score: (passes / (passes + violations)) * 100
      const totalChecks = results.passes.length + results.violations.length;
      const score = totalChecks > 0
        ? Math.round((results.passes.length / totalChecks) * 100)
        : 100;

      console.log(`[${appName}] Accessibility score: ${score}%`);

      // Target: >90% score, zero critical/serious violations
      const criticalViolations = results.violations.filter(
        v => v.impact === 'critical' || v.impact === 'serious'
      );
      expect(criticalViolations).toHaveLength(0);
      expect(score).toBeGreaterThanOrEqual(90);
    });

    test('all interactive elements reachable via keyboard', async ({ page }) => {
      await page.goto(`file://${path.join(APP_UI_DIR, appFile)}`);

      // Inject data first
      const fixturePath = path.join(FIXTURES_DIR, `${appName}.json`);
      const fixture = fs.existsSync(fixturePath)
        ? JSON.parse(fs.readFileSync(fixturePath, 'utf8'))
        : { title: 'Test', data: [{ name: 'Test' }] };

      await page.evaluate((data) => {
        window.postMessage({ type: 'mcp_app_data', data }, '*');
      }, fixture);
      await page.waitForTimeout(500);

      // Get all interactive elements
      const interactiveElements = await page.evaluate(() => {
        const selectors = 'a, button, input, select, textarea, [tabindex], [role="button"], [role="link"], [role="tab"]';
        const elements = document.querySelectorAll(selectors);
        return Array.from(elements).map(el => ({
          tag: el.tagName.toLowerCase(),
          text: (el as HTMLElement).innerText?.slice(0, 50) || el.getAttribute('aria-label') || '',
          tabIndex: (el as HTMLElement).tabIndex,
          visible: (el as HTMLElement).offsetParent !== null,
        }));
      });

      // Filter to visible elements
      const visibleInteractive = interactiveElements.filter(el => el.visible);

      // Tab through all elements and verify focus reaches each
      let focusedCount = 0;
      for (let i = 0; i < visibleInteractive.length + 5; i++) {
        await page.keyboard.press('Tab');
        const focused = await page.evaluate(() => {
          const el = document.activeElement;
          return el ? el.tagName.toLowerCase() : 'none';
        });
        if (focused !== 'body' && focused !== 'none') {
          focusedCount++;
        }
      }

      // At least 80% of visible interactive elements should be reachable
      if (visibleInteractive.length > 0) {
        const reachRate = focusedCount / visibleInteractive.length;
        expect(reachRate).toBeGreaterThanOrEqual(0.8);
      }
    });
  });
}
```

### 2.5.2 — Standalone axe-core Snippet (for Browser DevTools)

```javascript
// Paste this into browser console on any app iframe:
(async () => {
  if (!window.axe) {
    const s = document.createElement('script');
    s.src = 'https://cdnjs.cloudflare.com/ajax/libs/axe-core/4.10.0/axe.min.js';
    document.head.appendChild(s);
    await new Promise(r => s.onload = r);
  }
  const results = await axe.run(document, {
    runOnly: ['wcag2a', 'wcag2aa', 'wcag21aa']
  });
  console.log('=== Accessibility Results ===');
  console.log(`Passes: ${results.passes.length}`);
  console.log(`Violations: ${results.violations.length}`);
  const score = Math.round(
    (results.passes.length / (results.passes.length + results.violations.length)) * 100
  );
  console.log(`Score: ${score}%`);
  if (results.violations.length > 0) {
    console.table(results.violations.map(v => ({
      impact: v.impact,
      id: v.id,
      description: v.description,
      nodes: v.nodes.length
    })));
  }
  return results;
})();
```

### 2.5.3 — Color Contrast Audit

```javascript
// Validate contrast ratios for all text elements
// Paste into browser console on any app iframe:
(function auditContrast() {
  function luminance(r, g, b) {
    const a = [r, g, b].map(v => {
      v /= 255;
      return v <= 0.03928 ? v / 12.92 : Math.pow((v + 0.055) / 1.055, 2.4);
    });
    return a[0] * 0.2126 + a[1] * 0.7152 + a[2] * 0.0722;
  }
  function contrastRatio(rgb1, rgb2) {
    const l1 = luminance(...rgb1) + 0.05;
    const l2 = luminance(...rgb2) + 0.05;
    return l1 > l2 ? l1 / l2 : l2 / l1;
  }
  function parseRGB(color) {
    const m = color.match(/\d+/g);
    return m ? m.slice(0, 3).map(Number) : [0, 0, 0];
  }

  const textElements = document.querySelectorAll('*');
  const issues = [];

  textElements.forEach(el => {
    const style = getComputedStyle(el);
    if (!el.textContent?.trim() || style.display === 'none') return;

    const fgRGB = parseRGB(style.color);
    const bgRGB = parseRGB(style.backgroundColor);

    // Skip if background is transparent (would need to walk up)
    if (style.backgroundColor === 'rgba(0, 0, 0, 0)') return;

    const ratio = contrastRatio(fgRGB, bgRGB);
    const fontSize = parseFloat(style.fontSize);
    const isBold = parseInt(style.fontWeight) >= 700;
    const isLargeText = fontSize >= 24 || (fontSize >= 18.66 && isBold);
    const required = isLargeText ? 3.0 : 4.5;

    if (ratio < required) {
      issues.push({
        text: el.textContent.trim().slice(0, 40),
        fg: style.color,
        bg: style.backgroundColor,
        ratio: ratio.toFixed(1),
        required: required,
        tag: el.tagName
      });
    }
  });

  if (issues.length === 0) {
    console.log('✅ All text passes WCAG AA contrast requirements');
  } else {
    console.log(`❌ ${issues.length} contrast failures:`);
    console.table(issues);
  }
})();
```

### 2.5.4 — Screen Reader Testing (macOS VoiceOver)

```markdown
### VoiceOver Manual Test Procedure:
1. Open the app in Safari (VoiceOver works best with Safari)
2. Enable VoiceOver: Cmd+F5
3. Navigate with VO+Right Arrow through all elements
4. Verify:
   - [ ] App title/heading is announced
   - [ ] Data table rows are announced with column headers
   - [ ] Status badges announce text (not just color)
   - [ ] Loading state announces "Loading" or similar
   - [ ] Empty state announces helpful message
   - [ ] Interactive elements announce their purpose
   - [ ] No "blank" or "group" without context
5. Disable VoiceOver: Cmd+F5
```

### Quality Gate:
- [ ] axe-core score >90% on all apps
- [ ] Zero critical/serious axe violations
- [ ] All text meets WCAG AA contrast (4.5:1 normal, 3:1 large)
- [ ] Secondary text uses #b0b2b8 or lighter (not #96989d)
- [ ] All interactive elements reachable via Tab
- [ ] VoiceOver reads meaningful content (no blank/unlabeled regions)

---

## Layer 3: Functional Testing

### 3.1 — Jest Unit Tests with MSW (Mock Service Worker)

Test tool handlers without hitting real APIs:

```typescript
// tests/tools.test.ts
import { http, HttpResponse } from 'msw';
import { setupServer } from 'msw/node';

// Mock API responses
const mockContacts = [
  { id: '1', name: 'John Doe', email: 'john@example.com', phone: '555-0101', status: 'active' },
  { id: '2', name: 'Jane Smith', email: 'jane@example.com', phone: '555-0102', status: 'inactive' },
  { id: '3', name: 'Bob Wilson', email: 'bob@example.com', phone: '555-0103', status: 'active' },
];

const handlers = [
  // Mock the external API endpoints your tools call
  http.get('https://api.example.com/v1/contacts', ({ request }) => {
    const url = new URL(request.url);
    const page = Number(url.searchParams.get('page') || 1);
    const pageSize = Number(url.searchParams.get('pageSize') || 25);
    const status = url.searchParams.get('status');

    let filtered = mockContacts;
    if (status) filtered = filtered.filter(c => c.status === status);

    return HttpResponse.json({
      data: filtered.slice((page - 1) * pageSize, page * pageSize),
      meta: { total: filtered.length, page, pageSize }
    });
  }),

  http.get('https://api.example.com/v1/contacts/:id', ({ params }) => {
    const contact = mockContacts.find(c => c.id === params.id);
    if (!contact) {
      return new HttpResponse(null, { status: 404 });
    }
    return HttpResponse.json(contact);
  }),

  http.post('https://api.example.com/v1/contacts', async ({ request }) => {
    const body = await request.json() as any;
    return HttpResponse.json({
      id: 'new-1',
      ...body,
      created_at: new Date().toISOString()
    }, { status: 201 });
  }),

  // Mock 500 error for chaos testing
  http.get('https://api.example.com/v1/error-endpoint', () => {
    return new HttpResponse(null, { status: 500 });
  }),
];

const server = setupServer(...handlers);

beforeAll(() => server.listen({ onUnhandledRequest: 'warn' }));
afterEach(() => server.resetHandlers());
afterAll(() => server.close());

describe('Tool Handlers', () => {
  test('list_contacts returns paginated results', async () => {
    // Import your actual tool handler
    // const { handleListContacts } = require('../src/tools/contacts');
    // const result = await handleListContacts({ page: 1, pageSize: 25 });

    // For now, test the API client directly
    const response = await fetch('https://api.example.com/v1/contacts?page=1&pageSize=25');
    const data = await response.json();

    expect(data.data).toBeInstanceOf(Array);
    expect(data.data.length).toBeGreaterThan(0);
    expect(data.meta.total).toBeDefined();
    expect(data.meta.page).toBe(1);

    // Validate each contact shape
    for (const contact of data.data) {
      expect(contact.id).toBeTruthy();
      expect(contact.name).toBeTruthy();
      expect(contact.email).toBeTruthy();
    }
  });

  test('list_contacts filters by status', async () => {
    const response = await fetch('https://api.example.com/v1/contacts?status=active');
    const data = await response.json();

    for (const contact of data.data) {
      expect(contact.status).toBe('active');
    }
  });

  test('get_contact returns single contact', async () => {
    const response = await fetch('https://api.example.com/v1/contacts/1');
    const data = await response.json();

    expect(data.id).toBe('1');
    expect(data.name).toBe('John Doe');
  });

  test('get_contact returns 404 for unknown ID', async () => {
    const response = await fetch('https://api.example.com/v1/contacts/unknown-99');
    expect(response.status).toBe(404);
  });

  test('create_contact returns created entity', async () => {
    const response = await fetch('https://api.example.com/v1/contacts', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ name: 'New Contact', email: 'new@test.com' })
    });
    const data = await response.json();

    expect(response.status).toBe(201);
    expect(data.id).toBeTruthy();
    expect(data.name).toBe('New Contact');
  });

  test('handles API 500 errors gracefully', async () => {
    const response = await fetch('https://api.example.com/v1/error-endpoint');
    expect(response.status).toBe(500);
    // Tool handler should return isError: true, not crash
  });
});
```

> **MSW Mock Validation:** Hand-crafted mocks can drift from real API responses. When credentials are available (Layer 4), validate that MSW mock response shapes match actual API responses. Run a script that calls the real API once and diffs the response keys/types against your mock handlers. Update mocks quarterly or whenever the API ships a new version.
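
That drift check can be automated with a small shape-diff: flatten both the live response and the mock fixture to dotted key paths with their primitive types, then compare the two maps. A sketch (the file name and the `shapeOf`/`diffShapes` helper names are assumptions — wire it to one real API call and the matching MSW fixture in your repo):

```typescript
// scripts/check-mock-drift.ts — hypothetical helper for mock-drift detection.

// Flatten a JSON value to a map of dotted paths -> primitive type names.
function shapeOf(value: unknown, prefix = ''): Map<string, string> {
  const shape = new Map<string, string>();
  if (value === null || typeof value !== 'object') {
    shape.set(prefix || '.', value === null ? 'null' : typeof value);
    return shape;
  }
  if (Array.isArray(value)) {
    if (value.length > 0) {
      // Sample the first element — assumes homogeneous arrays
      for (const [k, t] of shapeOf(value[0], `${prefix}[]`)) shape.set(k, t);
    } else {
      shape.set(`${prefix}[]`, 'unknown');
    }
    return shape;
  }
  for (const [key, v] of Object.entries(value as Record<string, unknown>)) {
    for (const [k, t] of shapeOf(v, prefix ? `${prefix}.${key}` : key)) shape.set(k, t);
  }
  return shape;
}

// Report paths the real API returns that the mock lacks, type mismatches,
// and paths the mock invents.
function diffShapes(real: unknown, mock: unknown): string[] {
  const realShape = shapeOf(real);
  const mockShape = shapeOf(mock);
  const drift: string[] = [];
  for (const [p, type] of realShape) {
    const mockType = mockShape.get(p);
    if (mockType === undefined) drift.push(`missing in mock: ${p} (${type})`);
    else if (mockType !== type) drift.push(`type mismatch at ${p}: real=${type} mock=${mockType}`);
  }
  for (const p of mockShape.keys()) {
    if (!realShape.has(p)) drift.push(`extra in mock: ${p}`);
  }
  return drift;
}
```

An empty result means the mock still matches; anything non-empty goes into the QA report and triggers a fixture update.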

### 3.2 — Tool Routing Smoke Test

Automated script that sends NL messages and checks tool selection:

```typescript
// tests/tool-routing.test.ts
import * as fs from 'fs';
import * as path from 'path';

interface RoutingFixture {
  message: string;
  expectedTool: string;
  category: string;
}

// Load routing fixtures (maintain this file!)
const ROUTING_FIXTURES_PATH = path.resolve(__dirname, '../test-fixtures/tool-routing.json');

const routingFixtures: RoutingFixture[] = JSON.parse(
  fs.readFileSync(ROUTING_FIXTURES_PATH, 'utf8')
);

describe('Tool Routing', () => {
  // This test requires the AI/LLM in the loop — typically run via LocalBosses API
  // or by mocking the tool selection logic

  test('routing fixtures file is valid', () => {
    expect(routingFixtures.length).toBeGreaterThanOrEqual(20);

    for (const fixture of routingFixtures) {
      expect(fixture.message).toBeTruthy();
      expect(fixture.expectedTool).toBeTruthy();
      expect(fixture.category).toBeTruthy();
    }
  });

  test('all expected tools exist in server', async () => {
    // Parse the server's tool definitions to get available tool names
    const toolNames = new Set<string>();

    // Read from compiled server or source
    // This validates that routing fixtures reference real tools
    const srcDir = path.resolve(__dirname, '../src/tools');
    if (fs.existsSync(srcDir)) {
      const toolFiles = fs.readdirSync(srcDir).filter(f => f.endsWith('.ts'));
      for (const file of toolFiles) {
        const content = fs.readFileSync(path.join(srcDir, file), 'utf8');
        const nameMatches = content.matchAll(/name:\s*['"]([^'"]+)['"]/g);
        for (const match of nameMatches) {
          toolNames.add(match[1]);
        }
      }
    }

    if (toolNames.size > 0) {
      for (const fixture of routingFixtures) {
        expect(toolNames.has(fixture.expectedTool)).toBe(true);
      }
    }
  });
});

// Tool routing fixtures template — save as test-fixtures/tool-routing.json:
/*
[
  { "message": "Show me all contacts", "expectedTool": "list_contacts", "category": "list" },
  { "message": "Find John Smith", "expectedTool": "search_contacts", "category": "search" },
  { "message": "What's John's email?", "expectedTool": "get_contact", "category": "get" },
  { "message": "Add a new contact", "expectedTool": "create_contact", "category": "create" },
  { "message": "Update John's phone number", "expectedTool": "update_contact", "category": "update" },
  { "message": "Remove the test contact", "expectedTool": "delete_contact", "category": "delete" },
  { "message": "Show me a summary of this month", "expectedTool": "get_dashboard", "category": "analytics" },
  ... (at least 20 fixtures per server)
]
*/
```

### 3.2b — DeepEval LLM-in-the-Loop Tool Routing Evaluation

Static routing fixtures validate that tool names exist, but they don't test whether the LLM actually selects the right tool. Use **DeepEval** for real LLM tool routing evaluation with `ToolCorrectnessMetric` and `TaskCompletionMetric`.

**Setup:**
```bash
pip install deepeval
deepeval login  # Optional: for dashboard tracking
```

**Test file** — save as `tests/tool_routing_eval.py`:

```python
# tests/tool_routing_eval.py
# Requires: pip install deepeval anthropic
# Run: deepeval test run tests/tool_routing_eval.py

import json
import os
from deepeval import evaluate
from deepeval.metrics import ToolCorrectnessMetric, TaskCompletionMetric
from deepeval.test_case import LLMTestCase, ToolCall
from anthropic import Anthropic

client = Anthropic()

def load_tool_definitions(server_dir: str) -> list[dict]:
    """Load tool definitions from compiled MCP server."""
    # Read tool names/schemas from the source files
    # Adapt path to your server structure
    import glob
    import re
    tools = []
    for f in glob.glob(f"{server_dir}/src/tools/*.ts"):
        with open(f) as fh:
            content = fh.read()
        # Extract tool definitions (simplified — adapt to your codebase)
        for match in re.finditer(r'name:\s*["\'](\w+)["\']', content):
            tools.append({"name": match.group(1)})
    return tools

def run_agent(message: str, system_prompt: str, tools: list[dict]) -> tuple[str, list[ToolCall]]:
    """Send message through Claude with tools, return response + tool calls."""
    # Convert MCP tool defs to Anthropic tool format
    anthropic_tools = [
        {
            "name": t["name"],
            "description": t.get("description", f"Tool: {t['name']}"),
            "input_schema": t.get("inputSchema", {"type": "object", "properties": {}})
        }
        for t in tools
    ]

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system=system_prompt,
        messages=[{"role": "user", "content": message}],
        tools=anthropic_tools,
    )

    tool_calls = []
    text_response = ""
    for block in response.content:
        if block.type == "tool_use":
            tool_calls.append(ToolCall(name=block.name, arguments=block.input))
        elif block.type == "text":
            text_response += block.text

    return text_response, tool_calls

# Load fixtures and system prompt
FIXTURES_PATH = "test-fixtures/tool-routing.json"
SYSTEM_PROMPT_PATH = "test-fixtures/system-prompt.txt"

with open(FIXTURES_PATH) as f:
    fixtures = json.load(f)

system_prompt = ""
if os.path.exists(SYSTEM_PROMPT_PATH):
    with open(SYSTEM_PROMPT_PATH) as f:
        system_prompt = f.read()

# Build test cases
tool_correctness = ToolCorrectnessMetric()
task_completion = TaskCompletionMetric()

tool_definitions = load_tool_definitions(".")  # load once, reuse per fixture
test_cases = []
for fixture in fixtures:
    response_text, actual_calls = run_agent(
        fixture["message"], system_prompt, tool_definitions
    )
    test_cases.append(
        LLMTestCase(
            input=fixture["message"],
            actual_output=response_text,
            expected_tools=[ToolCall(name=fixture["expectedTool"])],
            tools_called=actual_calls,
        )
    )

# Evaluate
results = evaluate(test_cases, [tool_correctness, task_completion])
print("\n=== DeepEval Results ===")
print(f"Tool Correctness: {tool_correctness.score:.1%}")
print(f"Task Completion: {task_completion.score:.1%}")
# Target: Tool Correctness >95%, Task Completion >90%
```

**When to run:** After every tool description change, system prompt update, or model upgrade. This is the REAL test of whether the AI routes correctly — fixture files alone are testing theater.

### 3.3 — APP_DATA Schema Validator

```typescript
// tests/app-data-validator.ts
import Ajv from 'ajv';
import * as fs from 'fs';
import * as path from 'path';

const ajv = new Ajv({ allErrors: true, strict: false });

// Define expected schemas per app type
const APP_DATA_SCHEMAS: Record<string, object> = {
  'dashboard': {
    type: 'object',
    required: ['title'],
    properties: {
      title: { type: 'string' },
      metrics: {
        type: 'array',
        items: {
          type: 'object',
          required: ['label', 'value'],
          properties: {
            label: { type: 'string' },
            value: { type: ['string', 'number'] },
            change: { type: ['string', 'number'] },
            trend: { enum: ['up', 'down', 'flat'] }
          }
        }
      },
      charts: { type: 'array' },
      data: { type: ['array', 'object'] }
    }
  },
  'data-grid': {
    type: 'object',
    required: ['data'],
    properties: {
      title: { type: 'string' },
      data: {
        type: 'array',
        items: { type: 'object' },
        minItems: 0
      },
      meta: {
        type: 'object',
        properties: {
          total: { type: 'number' },
          page: { type: 'number' },
          pageSize: { type: 'number' }
        }
      },
      columns: { type: 'array' }
    }
  },
  'detail-card': {
    type: 'object',
    properties: {
      title: { type: 'string' },
      data: { type: 'object' },
      sections: { type: 'array' },
      fields: { type: 'array' }
    }
  },
  'timeline': {
    type: 'object',
    properties: {
      title: { type: 'string' },
      events: {
        type: 'array',
        items: {
          type: 'object',
          required: ['date'],
          properties: {
            date: { type: 'string' },
            title: { type: 'string' },
            description: { type: 'string' },
            type: { type: 'string' }
          }
        }
      },
      data: { type: 'array' }
    }
  },
  'pipeline': {
    type: 'object',
    properties: {
      title: { type: 'string' },
      stages: {
        type: 'array',
        items: {
          type: 'object',
          required: ['name'],
          properties: {
            name: { type: 'string' },
            items: { type: 'array' },
            count: { type: 'number' },
            value: { type: ['number', 'string'] }
          }
        }
      }
    }
  }
};

export function validateAppData(
  appType: string,
  appData: any
): { valid: boolean; errors: string[]; warnings: string[] } {
  const errors: string[] = [];
  const warnings: string[] = [];

  // Basic checks
  if (!appData || typeof appData !== 'object') {
    return { valid: false, errors: ['APP_DATA is null or not an object'], warnings: [] };
  }

  // Schema validation
  const schema = APP_DATA_SCHEMAS[appType];
  if (schema) {
    const validate = ajv.compile(schema);
    const isValid = validate(appData);
    if (!isValid && validate.errors) {
      for (const err of validate.errors) {
        errors.push(`${err.instancePath || '/'} ${err.message}`);
      }
    }
  } else {
    warnings.push(`No schema defined for app type: ${appType}`);
  }

  // Common checks regardless of app type
  if (appData.data && Array.isArray(appData.data)) {
    if (appData.data.length === 0) {
      warnings.push('data array is empty — app will show empty state');
    }
    // Check for null/undefined values in data items
    for (let i = 0; i < Math.min(appData.data.length, 5); i++) {
      const item = appData.data[i];
      for (const [key, val] of Object.entries(item || {})) {
        if (val === undefined) {
          warnings.push(`data[${i}].${key} is undefined (will show as "undefined" in app)`);
        }
      }
    }
  }

  return { valid: errors.length === 0, errors, warnings };
}

// Parse APP_DATA from AI response text
export function extractAppData(responseText: string): any | null {
  // Standard format
  const match = responseText.match(/<!--APP_DATA:([\s\S]*?):END_APP_DATA-->/);
  if (match) {
    try {
      // Strip whitespace/newlines that LLMs sometimes add
      const cleaned = match[1].replace(/[\n\r]/g, '').trim();
      return JSON.parse(cleaned);
    } catch (e) {
      // Try with more aggressive cleanup
      try {
        const aggressive = match[1]
          .replace(/[\n\r\t]/g, '')
          .replace(/,\s*}/g, '}')  // trailing commas
          .replace(/,\s*]/g, ']')  // trailing commas in arrays
          .trim();
        return JSON.parse(aggressive);
      } catch (e2) {
        return null;
      }
    }
  }

  // Fallback: try to find JSON in code blocks
  const codeBlockMatch = responseText.match(/```(?:json)?\s*([\s\S]*?)```/);
  if (codeBlockMatch) {
    try {
      return JSON.parse(codeBlockMatch[1].trim());
    } catch (e) {
      return null;
    }
  }

  return null;
}
```

### 3.4 — Thread Lifecycle Testing

```markdown
### Thread Lifecycle: {channel}

1. [ ] Click app in toolbar → thread panel opens
2. [ ] Intake question appears in thread
3. [ ] Type response → AI processes in thread context
4. [ ] App loads in thread panel (if data returned or skipped)
5. [ ] Send follow-up message → app updates with new data
6. [ ] Close thread panel (X) → panel closes, thread indicator remains
7. [ ] Click thread indicator → panel reopens with preserved state
8. [ ] Delete thread → thread removed, parent message removed
9. [ ] Switch channels → come back → thread state persists (localStorage)
```

### Quality Gate:
- [ ] All tool handler unit tests pass (Jest + MSW)
- [ ] Tool routing fixtures file has ≥20 test messages
- [ ] All routing fixture tools exist in the server
- [ ] APP_DATA schema validation passes for all app types
- [ ] APP_DATA parser handles malformed JSON gracefully
- [ ] Thread lifecycle completes without errors

---

## Layer 3.5: Performance Testing

### 3.5.1 — Server Cold Start

```bash
#!/bin/bash
# Measure cold start time
# Note: %N requires GNU date — on macOS, install coreutils and use gdate
SERVICE_DIR="$1"
cd "$SERVICE_DIR"

echo "=== Cold Start Benchmark ==="

# Measure time to first ListTools response
START=$(date +%s%N)
echo '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2025-11-25","capabilities":{},"clientInfo":{"name":"perf-test","version":"1.0.0"}}}' | \
  timeout 10 node dist/index.js 2>/dev/null | head -1 > /dev/null
END=$(date +%s%N)

ELAPSED=$(( (END - START) / 1000000 ))
echo "Cold start to first response: ${ELAPSED}ms"
if [ "$ELAPSED" -gt 2000 ]; then
  echo "❌ FAIL — exceeds 2000ms target"
else
  echo "✅ PASS — under 2000ms target"
fi
```

### 3.5.2 — Tool Invocation Latency

```typescript
// tests/performance.test.ts
import { performance } from 'perf_hooks';

describe('Performance', () => {
  test('tool invocation overhead is under 100ms (excluding API time)', async () => {
    // With MSW intercepting API calls (near-zero latency),
    // measure the tool handler overhead itself
    const times: number[] = [];

    for (let i = 0; i < 10; i++) {
      const start = performance.now();
      // Call a read-only tool through the handler
      // await toolHandler({ page: 1, pageSize: 10 });
      const response = await fetch('https://api.example.com/v1/contacts?page=1&pageSize=10');
      await response.json();
      const elapsed = performance.now() - start;
      times.push(elapsed);
    }

    const sorted = times.sort((a, b) => a - b);
    const p50 = sorted[Math.floor(sorted.length * 0.5)];
    const p95 = sorted[Math.floor(sorted.length * 0.95)];

    console.log(`Tool overhead P50: ${p50.toFixed(1)}ms, P95: ${p95.toFixed(1)}ms`);
    expect(p50).toBeLessThan(100);
  });

  test('memory usage stays under 100MB with all tools loaded', async () => {
    const used = process.memoryUsage();
    const heapMB = Math.round(used.heapUsed / 1024 / 1024);
    const rssMB = Math.round(used.rss / 1024 / 1024);

    console.log(`Heap: ${heapMB}MB, RSS: ${rssMB}MB`);
    expect(rssMB).toBeLessThan(100);
  });
});
```

### 3.5.3 — App File Size Budget

```bash
#!/bin/bash
echo "=== App File Size Budget (max 50KB) ==="
OVER=0
for f in app-ui/*.html; do
  if [ -f "$f" ]; then
    SIZE=$(wc -c < "$f" | tr -d ' ')
    KB=$((SIZE / 1024))
    if [ "$SIZE" -gt 51200 ]; then
      echo "❌ $(basename $f): ${KB}KB (OVER BUDGET)"
      OVER=$((OVER + 1))
    else
      echo "✅ $(basename $f): ${KB}KB"
    fi
  fi
done
[ "$OVER" -eq 0 ] && echo "All apps within budget" || echo "⚠️ $OVER apps over 50KB budget"
```

### 3.5.4 — App Render Performance (Playwright)

```typescript
// In visual.test.ts, add:
test('time to first render is under 2s', async ({ page }) => {
  const start = Date.now();
  await page.goto(`file://${appFile}`);

  const fixture = loadFixture(appFile);
  await page.evaluate((data) => {
    window.postMessage({ type: 'mcp_app_data', data }, '*');
  }, fixture);

  // Wait for content to be visible
  await page.locator('#content').waitFor({ state: 'visible', timeout: 5000 });
  const renderTime = Date.now() - start;

  console.log(`[${appName}] Time to first render: ${renderTime}ms`);
  expect(renderTime).toBeLessThan(2000);
});
```

### 3.5.5 — Load Testing (HTTP Transport)

For servers running with `MCP_TRANSPORT=http`, test concurrent connection handling:

```bash
#!/bin/bash
# load-test-http.sh — Test concurrent MCP connections
# Uses plain curl + background jobs; autocannon (npm install -g autocannon)
# is an alternative for sustained load

MCP_PORT="${1:-3000}"
CONCURRENCY="${2:-10}"
DURATION="${3:-10}"

echo "=== MCP HTTP Load Test ==="
echo "Target: http://localhost:${MCP_PORT}/mcp"
echo "Concurrency: ${CONCURRENCY} connections"
echo "Duration: ${DURATION}s"
echo ""

# Test 1: Concurrent initialize requests
echo "--- Test 1: Concurrent initialize ---"
for i in $(seq 1 $CONCURRENCY); do
  curl -s -X POST "http://localhost:${MCP_PORT}/mcp" \
    -H "Content-Type: application/json" \
    -d '{"jsonrpc":"2.0","id":'$i',"method":"initialize","params":{"protocolVersion":"2025-11-25","capabilities":{},"clientInfo":{"name":"load-test-'$i'","version":"1.0.0"}}}' \
    -o /dev/null -w "Connection $i: %{http_code} in %{time_total}s\n" &
done
wait
echo ""

# Test 2: Concurrent tools/list under load
echo "--- Test 2: Concurrent tools/list ---"
START=$(date +%s%N)
for i in $(seq 1 $CONCURRENCY); do
  curl -s -X POST "http://localhost:${MCP_PORT}/mcp" \
    -H "Content-Type: application/json" \
    -d '{"jsonrpc":"2.0","id":1,"method":"tools/list","params":{}}' \
    -o /dev/null -w "%{http_code} " &
done
wait
END=$(date +%s%N)
ELAPSED=$(( (END - START) / 1000000 ))
echo ""
echo "All $CONCURRENCY requests completed in ${ELAPSED}ms"
echo ""

# Test 3: Session management under load (verify no cross-session leaks)
echo "--- Test 3: Session isolation ---"
SESSION1=$(curl -s -X POST "http://localhost:${MCP_PORT}/mcp" \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2025-11-25","capabilities":{},"clientInfo":{"name":"session-1","version":"1.0.0"}}}' \
  -D - -o /dev/null 2>&1 | grep -i "mcp-session-id" | cut -d' ' -f2 | tr -d '\r')
SESSION2=$(curl -s -X POST "http://localhost:${MCP_PORT}/mcp" \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2025-11-25","capabilities":{},"clientInfo":{"name":"session-2","version":"1.0.0"}}}' \
  -D - -o /dev/null 2>&1 | grep -i "mcp-session-id" | cut -d' ' -f2 | tr -d '\r')

if [ "$SESSION1" != "$SESSION2" ] && [ -n "$SESSION1" ] && [ -n "$SESSION2" ]; then
  echo "✅ Sessions are unique (no cross-session leaks)"
else
  echo "⚠️ Session isolation check inconclusive"
fi

echo ""
echo "=== Load Test Complete ==="
echo "Target: ${CONCURRENCY} concurrent connections should complete without 5xx errors"
```

**Pass criteria:**
- Zero 5xx errors under 10 concurrent connections
- All responses return within 5s
- No cross-session data leaks (GHSA-345p-7cg4-v4c7 regression test)
- Memory usage stays under 200MB during load

### Quality Gate:
- [ ] Cold start <2s to first ListTools response
- [ ] Tool invocation overhead P50 <100ms (excluding API latency)
- [ ] Memory usage <100MB after loading all tool groups
- [ ] All HTML app files <50KB
- [ ] Time to first render <2s for all apps
- [ ] HTTP transport handles 10 concurrent connections without errors

---

## Layer 4: Live API Testing

### 4.1 — Credential Management Strategy

**Before running Layer 4, categorize the server:**

| Category | Description | Layer 4 Approach |
|----------|-------------|------------------|
| **has-creds** | API key/OAuth token available in `.env` | Full live testing |
| **needs-creds** | Credentials needed but not yet obtained | Skip Layer 4, note in report |
| **sandbox-available** | API provides sandbox/test environment | Use sandbox creds (preferred) |
| **no-sandbox** | Only production credentials available | Careful read-only testing only |

**Centralized credential management:**

```bash
# Master credentials file (NOT committed to git)
# Location: ~/.clawdbot/workspace/.env.mcp-testing

# Format per service:
# {SERVICE}_API_KEY=xxx
# {SERVICE}_API_BASE_URL=https://api.example.com
# {SERVICE}_SANDBOX=true|false
# {SERVICE}_CRED_STATUS=has-creds|needs-creds|sandbox|no-sandbox
# {SERVICE}_CRED_EXPIRES=2026-03-01

# Script to distribute to individual servers:
cat ~/.clawdbot/workspace/.env.mcp-testing | grep "^${SERVICE}_" | sed "s/${SERVICE}_//" > ${SERVICE}-mcp/.env
```
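
The `{SERVICE}_CRED_EXPIRES` entries in that file can be scanned so expiring keys surface before Layer 4 runs. A minimal sketch, assuming the format above (the script path and the `expiringCreds` helper name are hypothetical):

```typescript
// scripts/check-cred-expiry.ts — hypothetical helper; flags services whose
// credentials are expired or expire within warnDays.

function expiringCreds(envText: string, today: Date, warnDays = 14): string[] {
  const warnings: string[] = [];
  for (const line of envText.split('\n')) {
    // Matches e.g. ACME_CRED_EXPIRES=2026-03-01
    const m = line.match(/^([A-Z0-9]+)_CRED_EXPIRES=(\d{4}-\d{2}-\d{2})/);
    if (!m) continue;
    const expires = new Date(`${m[2]}T00:00:00Z`);
    const daysLeft = Math.floor((expires.getTime() - today.getTime()) / 86_400_000);
    if (daysLeft < 0) warnings.push(`${m[1]}: credentials EXPIRED on ${m[2]}`);
    else if (daysLeft <= warnDays) warnings.push(`${m[1]}: credentials expire in ${daysLeft} days (${m[2]})`);
  }
  return warnings;
}

// Example wiring (path is an assumption):
// const envText = fs.readFileSync(`${os.homedir()}/.clawdbot/workspace/.env.mcp-testing`, 'utf8');
// for (const w of expiringCreds(envText, new Date())) console.warn(w);
```

Run it at the start of a QA pass so "expired token" failures are diagnosed as credential problems, not server regressions.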

**For servers WITHOUT credentials, focus on Layers 0-3:**
- Layer 0: Protocol compliance (no API needed)
- Layer 1: Static analysis (no API needed)
- Layer 2: Visual testing with fixture data (no API needed)
- Layer 2.5: Accessibility (no API needed)
- Layer 3: Functional testing with MSW mocks (no API needed)
- Layer 3.5: Performance with mocks (no API needed)
- Layer 4: **SKIP** — note in report as "No credentials available"
- Layer 4.5: Security (most checks don't need API)
- Layer 5: Partial — E2E with mocked responses

### 4.2 — Test Each Tool Group

```markdown
### Live API Test: {service} / {tool-group}

**Auth:** {method} — Token/key set in .env
**Base URL:** {url}
**Cred Status:** {has-creds|sandbox|no-creds}

| Tool | Test Input | Expected | Actual | Latency | Status |
|------|-----------|----------|--------|---------|--------|
| list_{entities} | {} (default) | Array of items | | ms | |
| list_{entities} | { status: "active" } | Filtered array | | ms | |
| get_{entity} | { id: "known-id" } | Single item | | ms | |
| create_{entity} | { name: "QA Test" } | Created w/ ID | | ms | |
| update_{entity} | { id: "id", name: "Updated" } | Updated item | | ms | |
| delete_{entity} | { id: "qa-test-id" } | Confirmation | | ms | |
```
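
The latency column in that template feeds the P50/P95 quality-gate metrics. A nearest-rank percentile helper can summarize the recorded timings — a sketch (`latencyReport` is a hypothetical name; wire it to wherever per-call latencies are collected):

```typescript
// scripts/latency-stats.ts — nearest-rank percentiles over recorded
// per-tool latencies (milliseconds).

function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('no samples');
  // Sort a copy so the caller's array is left untouched
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}

function latencyReport(samples: number[]): string {
  return `n=${samples.length} p50=${percentile(samples, 50)}ms p95=${percentile(samples, 95)}ms`;
}
```

Nearest-rank is deliberately simple: with small sample counts (a handful of calls per tool) interpolated percentiles suggest more precision than the data supports.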

### 4.3 — Response Shape Verification

```bash
# For each tool, verify the response shape matches what the app expects.
# Extract field references from app HTML (-P needs GNU grep; on macOS,
# brew install grep and use ggrep)
grep -oP 'data\.\K[a-zA-Z_]+' app-ui/{app}.html | sort -u > /tmp/expected-fields.txt

# Compare with actual API response fields (one key per line, sorted)
echo '{api_response}' | jq -r 'keys[]' | sort > /tmp/actual-fields.txt

# Diff
diff /tmp/expected-fields.txt /tmp/actual-fields.txt
```
|
||
|
||
### Quality Gate:
|
||
- [ ] All read-only tools return valid data
|
||
- [ ] Write tools create/update/delete correctly (use sandbox)
|
||
- [ ] Response shapes match what apps expect
|
||
- [ ] Error responses (401, 403, 404, 422, 429) handled gracefully
|
||
- [ ] All response latencies recorded for P50/P95 metrics
|
||
- [ ] Cleanup: delete any test data created during QA
|
||
|
||
---

## Layer 4.5: Security Testing

### 4.5.1 — XSS Testing

```typescript
// tests/security.test.ts
import { test, expect } from '@playwright/test';
import * as path from 'path';

const XSS_PAYLOADS = [
  '<script>alert("xss")</script>',
  '<img src=x onerror=alert("xss")>',
  '"><script>alert(1)</script>',
  "';alert(String.fromCharCode(88,83,83))//",
  '<svg onload=alert("xss")>',
  'javascript:alert("xss")',
  '<iframe src="javascript:alert(1)">',
  '{{constructor.constructor("return this")().alert(1)}}',
  '<details open ontoggle=alert(1)>',
  '<math><mtext><table><mglyph><svg><mtext><style><img src=x onerror=alert(1)>',
];

test.describe('XSS Security', () => {
  test('escapeHtml blocks all XSS payloads in text fields', async ({ page }) => {
    const appFile = path.resolve(__dirname, '../app-ui/contact-grid.html');
    await page.goto(`file://${appFile}`);

    // Register the dialog listener once; registering it inside the loop
    // would stack a duplicate listener on every iteration.
    let alertFired = false;
    page.on('dialog', async dialog => {
      alertFired = true;
      await dialog.dismiss();
    });

    for (const payload of XSS_PAYLOADS) {
      alertFired = false;

      // Inject data with XSS payloads in every text field
      await page.evaluate((xss) => {
        window.postMessage({
          type: 'mcp_app_data',
          data: {
            title: xss,
            data: [
              { name: xss, email: xss, phone: xss, status: xss },
            ],
            meta: { total: 1, page: 1, pageSize: 25 }
          }
        }, '*');
      }, payload);

      await page.waitForTimeout(200);
      expect(alertFired).toBe(false);
    }
  });
});
```

### 4.5.2 — postMessage Origin Validation

```javascript
// Check in browser console — app should validate message origin
// Inject from a different origin simulation:
(function testOriginValidation() {
  // Check if app code validates event.origin
  const appScript = document.querySelector('script')?.textContent || '';
  const checksOrigin = appScript.includes('event.origin') ||
                       appScript.includes('e.origin') ||
                       appScript.includes('message.origin');

  if (checksOrigin) {
    console.log('✅ App validates postMessage origin');
  } else {
    console.log('⚠️ App does NOT validate postMessage origin — potential security issue');
    console.log('   Recommended: Add origin check in message event listener');
  }
})();
```
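
If the check above flags a missing guard, the fix is a few lines in the app's message listener. A sketch; `ALLOWED_ORIGINS` is hypothetical, substitute the host app's real origin(s):

```javascript
// Sketch: validate postMessage origin before trusting the payload.
// ALLOWED_ORIGINS is an assumption; use the host app's actual origin(s).
const ALLOWED_ORIGINS = ['http://localhost:3000'];

function isTrustedOrigin(origin) {
  return ALLOWED_ORIGINS.includes(origin);
}

if (typeof window !== 'undefined') {
  window.addEventListener('message', (event) => {
    if (!isTrustedOrigin(event.origin)) return; // silently drop untrusted messages
    if (event.data && event.data.type === 'mcp_app_data') {
      // hand off to the app's existing render path, e.g. renderApp(event.data.data)
    }
  });
}
```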

### 4.5.3 — Content Security Policy Check

```bash
# Check if HTML apps declare CSP
for f in app-ui/*.html; do
  if grep -q "Content-Security-Policy" "$f"; then
    echo "✅ $(basename $f) has CSP meta tag"
  else
    echo "⚠️ $(basename $f) — no CSP meta tag"
  fi
done

# Check for inline event handlers (CSP-unfriendly)
for f in app-ui/*.html; do
  INLINE=$(grep -c 'on[a-z]*=' "$f" || true)  # `|| true`: grep -c exits 1 on zero matches
  if [ "$INLINE" -gt 0 ]; then
    echo "⚠️ $(basename $f) has $INLINE inline event handlers"
  fi
done
```

### 4.5.4 — API Key Exposure Check

```bash
# Check for leaked secrets in client-side code
echo "=== API Key Exposure Scan ==="

# Common patterns for API keys/secrets (extended regex: `?` needs grep -E)
PATTERNS=(
  'api[_-]?key'
  'apikey'
  'secret'
  'token'
  'password'
  'authorization.*Bearer'
  'sk_live_'
  'pk_live_'
  'ghp_'
  'gho_'
)

for f in app-ui/*.html; do
  for pat in "${PATTERNS[@]}"; do
    MATCHES=$(grep -Eci "$pat" "$f" || true)
    if [ "$MATCHES" -gt 0 ]; then
      echo "❌ $(basename $f) may contain exposed secrets (pattern: $pat)"
      grep -Ein "$pat" "$f" | head -3
    fi
  done
done

# Also check compiled JS (dist/** requires bash globstar)
shopt -s globstar 2>/dev/null
for f in dist/**/*.js; do
  if [ -f "$f" ]; then
    for pat in "${PATTERNS[@]}"; do
      MATCHES=$(grep -Eci "$pat" "$f" || true)
      if [ "$MATCHES" -gt 0 ]; then
        echo "⚠️ $(basename $f) references: $pat (verify not actual key)"
      fi
    done
  fi
done
```

### Quality Gate:
- [ ] All XSS payloads blocked (escapeHtml works)
- [ ] No alert dialogs triggered from any payload
- [ ] postMessage origin validated (or documented as acceptable risk)
- [ ] No API keys/secrets exposed in HTML app files
- [ ] No API keys/secrets in client-facing JavaScript
- [ ] CSP meta tag present (or documented why not)

---

## Layer 5: Integration & Chaos Testing

### 5.1 — End-to-End Scenarios

Write **at least 1 E2E scenario per app type** (minimum 5 per server):

```markdown
### E2E Scenario: {scenario-name}

**Channel:** {channel}
**Goal:** {what the user is trying to accomplish}
**App type:** {dashboard|grid|card|timeline|pipeline|calendar|analytics|monitor}

**Steps:**
1. Navigate to #{channel}
2. Type: "{natural language message}"
3. Verify: AI responds with correct tool call
4. Verify: APP_DATA block present and valid JSON
5. Verify: App {app-id} renders with correct data
6. In thread, type: "{follow-up message}"
7. Verify: App updates with new/refined data
8. Measure: Response latency for each step

**Metrics:**
- Tool selected correctly: ✅/❌
- APP_DATA valid: ✅/❌
- App rendered: ✅/❌
- Latency step 3: ___ms
- Latency step 7: ___ms

**Pass criteria:**
- [ ] All steps complete without errors
- [ ] Response time <5s for each step
- [ ] Zero console errors
- [ ] Data is accurate and well-formatted
```

### 5.1b — Automated End-to-End Data Flow Test (Playwright)

The magic moment: message → AI → tool → APP_DATA → app render → correct data. This test automates the entire flow:

```typescript
// tests/e2e-dataflow.test.ts
import { test, expect } from '@playwright/test';

const LOCALBOSSES_URL = process.env.LB_URL || 'http://localhost:3000';

test.describe('End-to-End Data Flow', () => {
  test('message triggers tool → APP_DATA → app renders correct data', async ({ page }) => {
    // Collect console errors from the start so failures in any step are caught
    const consoleErrors: string[] = [];
    page.on('console', msg => {
      if (msg.type() === 'error') consoleErrors.push(msg.text());
    });

    // 1. Navigate to the channel
    await page.goto(`${LOCALBOSSES_URL}/#/channel/{channel-id}`);
    await page.waitForLoadState('networkidle');

    // 2. Send a test message
    const chatInput = page.locator('[data-testid="chat-input"], textarea, input[type="text"]');
    await chatInput.fill('Show me all active contacts');
    await chatInput.press('Enter');

    // 3. Wait for AI response (tool call indicator or text response)
    const aiResponse = page.locator('[data-testid="ai-response"], .message-content').last();
    await aiResponse.waitFor({ state: 'visible', timeout: 15000 });

    // 4. Verify APP_DATA block was generated
    // The APP_DATA is in the raw response (may be hidden in the UI),
    // so check that the app iframe loaded instead
    const appFrame = page.frameLocator('iframe[data-app-id]').first();

    // 5. Verify app rendered with data (not empty/loading state)
    const appContent = appFrame.locator('#content');
    await appContent.waitFor({ state: 'visible', timeout: 10000 });

    // 6. Verify correct data is displayed
    // App should show contact data, not empty state
    const appText = await appContent.textContent();
    expect(appText).toBeTruthy();
    expect(appText!.length).toBeGreaterThan(10); // Has real content

    // 7. Verify no console errors were logged during the flow
    expect(consoleErrors).toHaveLength(0);

    // 8. Screenshot for the record
    await page.screenshot({ path: 'test-results/e2e-dataflow.png', fullPage: true });
  });
});
```

> **Note:** This test requires LocalBosses running locally with the integrated channel. It's the most important test — it validates the complete user experience end-to-end. Run this after every integration change.

### 5.2 — Chaos Testing

Test resilience under adverse conditions:

```typescript
// tests/chaos.test.ts
// NOTE: the first test relies on `server`/`http`/`HttpResponse` from the MSW
// setup in Layer 3 and runs under Jest; the page-based tests below run under
// Playwright, so keep the two groups in separate suites in practice.
import { test, expect } from '@playwright/test';
import * as path from 'path';

const appFile = path.resolve(__dirname, '../app-ui/contact-grid.html');

test.describe('Chaos Testing', () => {
  test('API returns 500 on every call', async () => {
    // Override MSW handlers to return 500
    server.use(
      http.get('https://api.example.com/*', () => {
        return new HttpResponse('Internal Server Error', { status: 500 });
      }),
      http.post('https://api.example.com/*', () => {
        return new HttpResponse('Internal Server Error', { status: 500 });
      })
    );

    // Tool should return isError: true, NOT crash
    // const result = await callTool('list_contacts', {});
    // expect(result.isError).toBe(true);
    // expect(result.content[0].text).toContain('error');
  });

  test('postMessage sends wrong format data', async ({ page }) => {
    await page.goto(`file://${appFile}`);

    // Send wrong type
    await page.evaluate(() => {
      window.postMessage({ type: 'wrong_type', data: {} }, '*');
    });
    await page.waitForTimeout(300);

    // App should not crash — should still show loading/empty
    const bodyText = await page.textContent('body');
    expect(bodyText).not.toContain('undefined');
    expect(bodyText).not.toContain('TypeError');

    // Send data with wrong shape
    await page.evaluate(() => {
      window.postMessage({ type: 'mcp_app_data', data: 'not an object' }, '*');
    });
    await page.waitForTimeout(300);

    const bodyText2 = await page.textContent('body');
    expect(bodyText2).not.toContain('undefined');
  });

  test('APP_DATA is 500KB+ (huge dataset)', async ({ page }) => {
    await page.goto(`file://${appFile}`);

    // Generate huge dataset
    const hugeData = {
      title: 'Performance Stress Test',
      data: Array.from({ length: 2000 }, (_, i) => ({
        id: `item-${i}`,
        name: `Contact ${i} ${'A'.repeat(100)}`,
        email: `contact${i}@example.com`,
        phone: `555-${String(i).padStart(4, '0')}`,
        status: i % 2 === 0 ? 'active' : 'inactive',
        notes: 'X'.repeat(200)
      })),
      meta: { total: 2000, page: 1, pageSize: 2000 }
    };

    const start = Date.now();
    await page.evaluate((data) => {
      window.postMessage({ type: 'mcp_app_data', data }, '*');
    }, hugeData);

    // Should render within 5 seconds even with huge data
    await page.locator('#content').waitFor({ state: 'visible', timeout: 5000 });
    const renderTime = Date.now() - start;

    console.log(`Huge dataset render time: ${renderTime}ms`);
    expect(renderTime).toBeLessThan(5000);
  });

  test('rapid-fire 10 messages', async ({ page }) => {
    await page.goto(`file://${appFile}`);

    // Send 10 data updates in quick succession
    for (let i = 0; i < 10; i++) {
      await page.evaluate((idx) => {
        window.postMessage({
          type: 'mcp_app_data',
          data: {
            title: `Update ${idx}`,
            data: [{ name: `Item ${idx}`, status: 'active' }],
            meta: { total: 1, page: 1, pageSize: 25 }
          }
        }, '*');
      }, i);
    }

    await page.waitForTimeout(1000);

    // App should show the LAST update (not crash or show stale data)
    const content = await page.textContent('body');
    expect(content).toContain('Update 9');
  });

  test('two apps rendering simultaneously', async ({ browser }) => {
    const page1 = await browser.newPage();
    const page2 = await browser.newPage();

    await page1.goto(`file://${appFile}`);
    await page2.goto(`file://${appFile}`);

    // Send data to both simultaneously
    await Promise.all([
      page1.evaluate(() => {
        window.postMessage({
          type: 'mcp_app_data',
          data: { title: 'App 1', data: [{ name: 'One' }] }
        }, '*');
      }),
      page2.evaluate(() => {
        window.postMessage({
          type: 'mcp_app_data',
          data: { title: 'App 2', data: [{ name: 'Two' }] }
        }, '*');
      })
    ]);

    await page1.waitForTimeout(500);
    await page2.waitForTimeout(500);

    // Both should render their respective data
    expect(await page1.textContent('body')).toContain('One');
    expect(await page2.textContent('body')).toContain('Two');

    await page1.close();
    await page2.close();
  });
});
```

### 5.3 — Cross-Browser Testing Notes

| Browser | Priority | Key Differences | How to Test |
|---------|----------|----------------|-------------|
| **Chrome** | P0 | Primary target — test all features here | Playwright `chromium` channel |
| **Firefox** | P1 | CSS Grid/Flexbox rendering differs slightly; `backdrop-filter` needs `-webkit-` prefix | Playwright `firefox` channel |
| **Mobile Safari** | P1 | Touch targets (min 44×44px), safe area insets, `-webkit-` prefixes, no `backdrop-filter` | Playwright `webkit` channel or real device |
| **Electron** | P2 | If LocalBosses ships as desktop app; test Node integration, `contextBridge` | Playwright with Electron |

```typescript
// playwright.config.ts — multi-browser setup
import { defineConfig, devices } from '@playwright/test';

export default defineConfig({
  projects: [
    { name: 'chromium', use: { ...devices['Desktop Chrome'] } },
    { name: 'firefox', use: { ...devices['Desktop Firefox'] } },
    { name: 'webkit', use: { ...devices['Desktop Safari'] } },
    { name: 'mobile-chrome', use: { ...devices['Pixel 5'] } },
    { name: 'mobile-safari', use: { ...devices['iPhone 13'] } },
  ],
});
```

### Quality Gate:
- [ ] All E2E scenarios pass (≥1 per app type)
- [ ] Chaos tests: API 500s handled gracefully
- [ ] Chaos tests: wrong postMessage format doesn't crash app
- [ ] Chaos tests: 500KB+ dataset renders within 5s
- [ ] Chaos tests: rapid-fire messages show final state
- [ ] Cross-browser: Chrome + Firefox + WebKit all render correctly

---

## Layer 5.5: Production Smoke Test (Post-Deployment)

After deploying a server + apps to production, run this validation before considering it shipped:

```bash
#!/bin/bash
# smoke-test.sh — Post-deployment validation
# Usage: ./smoke-test.sh <service-name> [base-url]

SERVICE="$1"
BASE_URL="${2:-http://localhost:3000}"

echo "=== Production Smoke Test: ${SERVICE} ==="
echo "Target: ${BASE_URL}"
echo ""

PASS=0
FAIL=0

# 1. Server is reachable (HTTP transport)
echo "--- Server Reachability ---"
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" -X POST "${BASE_URL}/mcp" \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2025-11-25","capabilities":{},"clientInfo":{"name":"smoke-test","version":"1.0.0"}}}')

if [ "$HTTP_CODE" = "200" ]; then
  echo "✅ Server responds to initialize (HTTP $HTTP_CODE)"
  PASS=$((PASS + 1))
else
  echo "❌ Server unreachable or error (HTTP $HTTP_CODE)"
  FAIL=$((FAIL + 1))
fi

# 2. tools/list returns tools
echo "--- Tool List ---"
TOOLS_RESPONSE=$(curl -s -X POST "${BASE_URL}/mcp" \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":2,"method":"tools/list","params":{}}')
TOOL_COUNT=$(echo "$TOOLS_RESPONSE" | grep -o '"name"' | wc -l | tr -d ' ')

if [ "$TOOL_COUNT" -gt 0 ]; then
  echo "✅ tools/list returns $TOOL_COUNT tools"
  PASS=$((PASS + 1))
else
  echo "❌ tools/list returned 0 tools"
  FAIL=$((FAIL + 1))
fi

# 3. health_check tool responds
echo "--- Health Check ---"
HEALTH=$(curl -s -X POST "${BASE_URL}/mcp" \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":3,"method":"tools/call","params":{"name":"health_check","arguments":{}}}')

if echo "$HEALTH" | grep -q '"status"'; then
  echo "✅ health_check tool responds"
  PASS=$((PASS + 1))
else
  echo "⚠️ health_check tool not found or error"
fi

# 4. App HTML files are served (if HTTP)
echo "--- App Files ---"
for app_id in $(echo "$TOOLS_RESPONSE" | grep -oP '"name":\s*"\K[^"]+' | head -3); do
  APP_HTTP=$(curl -s -o /dev/null -w "%{http_code}" "${BASE_URL}/api/mcp-apps?app=${app_id}")
  if [ "$APP_HTTP" = "200" ]; then
    echo "✅ App ${app_id} is served"
  else
    echo "⚠️ App ${app_id} not served (HTTP $APP_HTTP)"
  fi
done

# Summary
echo ""
echo "=== Smoke Test Results ==="
echo "Passed: $PASS"
echo "Failed: $FAIL"
[ "$FAIL" -eq 0 ] && echo "✅ SMOKE TEST PASSED" || echo "❌ SMOKE TEST FAILED"
```

---

## Layer 6: Production Monitoring (Post-Ship)

> *"All testing is pre-ship. There's no guidance on tracking tool correctness, APP_DATA parse success rate, or user satisfaction in production."* — Kofi

Pre-ship testing validates that everything **can** work. Production monitoring validates that everything **does** work, continuously.

### 6.1 — Production Quality Metrics

Track these metrics in production via logging in the chat route and aggregating weekly:

| Metric | Target | How to Measure | Alert Threshold |
|--------|--------|----------------|-----------------|
| **APP_DATA Parse Success Rate** | >98% | Log every `parseAppData()` call: success vs fallback vs failure | <95% over 1 hour |
| **Tool Correctness Sampling** | >95% | Sample 5% of interactions weekly, LLM-judge correctness | <90% in weekly sample |
| **Time to First App Render** | P50 <3s, P95 <8s | Measure from user message send → app `#content` visible | P95 >12s |
| **User Retry Rate** | <15% | Count rephrased messages within 30s of previous message | >25% over 1 day |
| **Thread Completion Rate** | >80% | % of threads where user reaches a data-displaying app state | <60% over 1 week |

### 6.2 — Instrumentation Code

Add to the chat route to collect production metrics:

```typescript
// lib/production-metrics.ts
import * as fs from 'fs';
import * as path from 'path';

interface MetricEvent {
  timestamp: string;
  channel: string;
  metric: string;
  value: number;
  metadata?: Record<string, unknown>;
}

const metrics: MetricEvent[] = [];

export function trackMetric(channel: string, metric: string, value: number, metadata?: Record<string, unknown>) {
  metrics.push({
    timestamp: new Date().toISOString(),
    channel,
    metric,
    value,
    metadata,
  });
  // Flush to file every 100 events
  if (metrics.length >= 100) flushMetrics();
}

function flushMetrics() {
  const file = path.join(process.cwd(), 'logs', `metrics-${new Date().toISOString().split('T')[0]}.jsonl`);
  fs.mkdirSync(path.dirname(file), { recursive: true });
  fs.appendFileSync(file, metrics.map(m => JSON.stringify(m)).join('\n') + '\n');
  metrics.length = 0;
}

// Usage in chat route:
// trackMetric(channelId, 'app_data_parse', success ? 1 : 0, { fallback: usedFallback });
// trackMetric(channelId, 'tool_call_latency', latencyMs, { tool: toolName });
// trackMetric(channelId, 'thread_completed', 1);
```

### 6.3 — Weekly Quality Review

```bash
#!/bin/bash
# weekly-quality-report.sh — Aggregate production metrics
METRICS_DIR="logs"
WEEK_START=$(date -v-7d +%Y-%m-%d)  # macOS syntax; on Linux use: date -d '7 days ago' +%Y-%m-%d

echo "=== Weekly Production Quality Report ==="
echo "Period: ${WEEK_START} to $(date +%Y-%m-%d)"
echo ""

# APP_DATA parse success rate
TOTAL_PARSES=$(cat ${METRICS_DIR}/metrics-*.jsonl 2>/dev/null | grep '"app_data_parse"' | wc -l | tr -d ' ')
SUCCESS_PARSES=$(cat ${METRICS_DIR}/metrics-*.jsonl 2>/dev/null | grep '"app_data_parse"' | grep '"value":1' | wc -l | tr -d ' ')
if [ "$TOTAL_PARSES" -gt 0 ]; then
  PARSE_RATE=$((SUCCESS_PARSES * 100 / TOTAL_PARSES))
  echo "APP_DATA Parse Success: ${PARSE_RATE}% (${SUCCESS_PARSES}/${TOTAL_PARSES})"
else
  echo "APP_DATA Parse Success: No data"
fi

echo ""
echo "Action items:"
echo "- Review any channels with parse rate <95%"
echo "- Check retry rate spikes for system prompt issues"
echo "- Sample 5 random interactions for manual correctness review"
```
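
The weekly report only covers parse rate; the latency targets in 6.1 need P50/P95 aggregation over the same JSONL files. A sketch of that aggregation, assuming the `tool_call_latency` metric name used in the trackMetric examples above and nearest-rank percentiles:

```typescript
// latency-percentiles.ts — sketch: P50/P95 from metrics-*.jsonl lines.
function percentile(sorted: number[], p: number): number {
  if (sorted.length === 0) return 0;
  // Nearest-rank: index of the smallest value covering p% of samples
  const idx = Math.max(0, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[idx];
}

function latencyStats(jsonlLines: string[]) {
  const values = jsonlLines
    .map(line => { try { return JSON.parse(line); } catch { return null; } })
    .filter((m): m is { metric: string; value: number } => !!m && m.metric === 'tool_call_latency')
    .map(m => m.value)
    .sort((a, b) => a - b);
  return { n: values.length, p50: percentile(values, 50), p95: percentile(values, 95) };
}
```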

---

## CI/CD Pipeline Template

Automate the QA pipeline in CI. Save as `.github/workflows/mcp-qa.yml`:

```yaml
# .github/workflows/mcp-qa.yml
name: MCP QA Pipeline
on:
  push:
    paths: ['*-mcp/**', 'mcp-servers/**']
  pull_request:
    paths: ['*-mcp/**', 'mcp-servers/**']

jobs:
  qa:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        node-version: [22]
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: ${{ matrix.node-version }}
          cache: 'npm'

      - name: Install dependencies
        run: npm ci

      - name: TypeScript build
        run: npm run build

      - name: Type check
        run: npx tsc --noEmit

      - name: Jest unit tests
        run: npx jest --ci --coverage
        env:
          NODE_ENV: test

      - name: Install Playwright browsers
        run: npx playwright install --with-deps

      - name: Playwright visual + accessibility tests
        run: npx playwright test

      - name: App file size check
        run: |
          for f in app-ui/*.html; do
            if [ -f "$f" ]; then
              SIZE=$(wc -c < "$f" | tr -d ' ')
              if [ "$SIZE" -gt 51200 ]; then
                echo "❌ $(basename $f) exceeds 50KB ($SIZE bytes)"
                exit 1
              fi
              echo "✅ $(basename $f) ($SIZE bytes)"
            fi
          done

      - name: Security scan
        run: |
          ISSUES=0
          for f in app-ui/*.html; do
            for pat in "api_key" "apikey" "secret" "sk_live" "pk_live"; do
              if grep -qi "$pat" "$f" 2>/dev/null; then
                echo "❌ $(basename $f): potential key exposure ($pat)"
                ISSUES=$((ISSUES + 1))
              fi
            done
          done
          [ "$ISSUES" -eq 0 ] || exit 1

      - name: Upload test results
        uses: actions/upload-artifact@v4
        if: always()
        with:
          name: test-results
          path: |
            test-results/
            coverage/
          retention-days: 30

  # Optional: DeepEval tool routing (requires API key)
  tool-routing:
    runs-on: ubuntu-latest
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    needs: qa
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - run: pip install deepeval anthropic
      - name: Run DeepEval tool routing evaluation
        run: deepeval test run tests/tool_routing_eval.py
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          DEEPEVAL_API_KEY: ${{ secrets.DEEPEVAL_API_KEY }}
```

---

## Testing Reality Check

> *What the QA catches vs what it misses — from Kofi's review*

### ✅ What This QA Framework CATCHES (real quality):

| Test | What It Validates | Real-World Impact |
|------|-------------------|-------------------|
| TypeScript compilation | Code compiles, types correct | Prevents server crashes |
| MCP Inspector | Protocol compliance | Server works with any MCP client |
| Playwright visual tests | Apps render all 3 states, dark theme, responsive | Users see a polished UI |
| axe-core accessibility | WCAG AA, keyboard nav, screen reader | Accessible to all users |
| XSS payload testing | No script injection via user data | Security against malicious data |
| Chaos testing (500 errors, wrong formats, huge data) | Graceful degradation | App doesn't crash under adverse conditions |
| Static cross-reference | All app IDs consistent across 4 files | No broken routes or missing entries |
| File size budgets | Apps under 50KB | Fast loading |
| BackstopJS regression | Visual changes are intentional | No accidental UI regressions |
| Cold start / latency benchmarks | Performance within targets | Users don't wait too long |

### ❌ What This QA Framework MISSES (gaps to be aware of):

| Gap | Why It Matters | Current State | Mitigation |
|-----|---------------|---------------|------------|
| **Tool routing accuracy with real LLM** | THE quality metric — does the AI pick the right tool? | DeepEval added (3.2b) but requires API key + cost | Run DeepEval on main branch pushes, not every PR |
| **APP_DATA generation quality** | Does the LLM produce valid JSON matching app expectations? | Not fully tested — parser is tested, generator is probabilistic | Few-shot examples in system prompts + Layer 6 monitoring |
| **Multi-step tool chains** | "Find John's email and send him a meeting invite" — requires 3 tool calls | Not tested — all routing tests are single-tool | Add multi-step fixtures to DeepEval test cases |
| **Conversation context** | "Show me more details about the second one" — requires memory | Not addressed in any skill | Requires thread state tracking — future work |
| **Real API response shape drift** | MSW mocks may not match real API | MSW validation note added (3.1) but manual | Quarterly mock validation when credentials available |
| **Production quality after ship** | Is quality maintained over time? | Layer 6 monitoring added | Implement metric collection + weekly review |
| **APP_DATA parse failure rate in production** | How often does the LLM produce unparseable JSON? | Layer 6 tracks this now | Set alerting threshold at <95% success |

### The Hard Truth:
This QA framework is excellent at testing **infrastructure** (server compiles, apps render, accessibility passes, security is clean) — roughly 40% of the user experience. The **AI interaction quality** (tool routing, data generation, multi-step flows) is the other 60%, and it's harder to test deterministically because the LLM is probabilistic. Layer 6 monitoring and DeepEval close this gap but don't eliminate it. **Ship with awareness, monitor in production, iterate on system prompts.**

---

## Test Data Fixtures Library

### Standard Fixture: Dashboard

Save as `test-fixtures/dashboard.json`:

```json
{
  "title": "Monthly Performance Overview",
  "metrics": [
    { "label": "Total Revenue", "value": "$124,500", "change": "+12.3%", "trend": "up" },
    { "label": "New Customers", "value": 847, "change": "+5.2%", "trend": "up" },
    { "label": "Churn Rate", "value": "2.1%", "change": "-0.3%", "trend": "down" },
    { "label": "Avg Response Time", "value": "1.4h", "change": "-8.5%", "trend": "down" }
  ],
  "charts": [
    {
      "type": "bar",
      "title": "Revenue by Month",
      "data": [
        { "label": "Sep", "value": 95000 },
        { "label": "Oct", "value": 102000 },
        { "label": "Nov", "value": 98000 },
        { "label": "Dec", "value": 115000 },
        { "label": "Jan", "value": 124500 }
      ]
    }
  ],
  "data": {
    "summary": "Revenue is up 12.3% month-over-month with strong customer acquisition."
  }
}
```

### Standard Fixture: Data Grid

Save as `test-fixtures/data-grid.json`:

```json
{
  "title": "Active Contacts",
  "columns": ["Name", "Email", "Phone", "Status", "Created"],
  "data": [
    { "name": "John Doe", "email": "john@acmecorp.com", "phone": "555-0101", "status": "active", "created": "2026-01-15" },
    { "name": "Jane Smith", "email": "jane@techstart.io", "phone": "555-0102", "status": "active", "created": "2026-01-20" },
    { "name": "Bob Wilson", "email": "bob@globalinc.com", "phone": "555-0103", "status": "inactive", "created": "2025-12-01" },
    { "name": "Alice Brown", "email": "alice@startup.co", "phone": "555-0104", "status": "active", "created": "2026-02-01" },
    { "name": "Charlie Davis", "email": "charlie@enterprise.net", "phone": "555-0105", "status": "pending", "created": "2026-02-03" },
    { "name": "Diana Evans", "email": "diana@agency.com", "phone": "555-0106", "status": "active", "created": "2025-11-15" },
    { "name": "Frank Garcia", "email": "frank@solutions.biz", "phone": "555-0107", "status": "active", "created": "2026-01-28" },
    { "name": "Grace Hill", "email": "grace@design.studio", "phone": "555-0108", "status": "inactive", "created": "2025-10-05" }
  ],
  "meta": { "total": 156, "page": 1, "pageSize": 25 }
}
```

### Standard Fixture: Timeline

Save as `test-fixtures/timeline.json`:

```json
{
  "title": "Contact Activity Timeline",
  "events": [
    { "date": "2026-02-04T14:30:00Z", "title": "Email Opened", "description": "Campaign: February Newsletter", "type": "email" },
    { "date": "2026-02-03T10:15:00Z", "title": "Meeting Scheduled", "description": "Demo call with sales team", "type": "meeting" },
    { "date": "2026-02-01T09:00:00Z", "title": "Deal Created", "description": "Enterprise Plan — $15,000/yr", "type": "deal" },
    { "date": "2026-01-28T16:45:00Z", "title": "Form Submitted", "description": "Requested pricing information", "type": "form" },
    { "date": "2026-01-25T11:30:00Z", "title": "First Visit", "description": "Visited pricing page from Google Ads", "type": "visit" }
  ]
}
```

### Edge Case Fixtures

Save as `test-fixtures/edge-cases.json`:

```json
{
  "empty_strings": {
    "data": [
      { "name": "", "email": "", "phone": "", "status": "" }
    ]
  },
  "null_values": {
    "data": [
      { "name": null, "email": null, "phone": null, "status": null }
    ]
  },
  "extremely_long_text": {
    "data": [
      {
        "name": "Bartholomew Christopherson-Williamsworth III, Esq., Ph.D., M.B.A., J.D., CPA, CFP®, CAIA®, FRM®",
        "email": "bartholomew.christopherson-williamsworth.the.third.esquire.phd.mba.jd@extremely-long-company-name-international-holdings-corporation-unlimited.com",
        "phone": "+1 (555) 012-3456 ext. 78901234",
        "status": "active — pending final review by committee chairperson and board of directors",
        "notes": "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."
      }
    ]
  },
  "unicode": {
    "data": [
      { "name": "田中太郎", "email": "tanaka@例え.jp", "status": "アクティブ" },
      { "name": "Müller, Günther", "email": "günther@münchen.de", "status": "aktiv" },
      { "name": "Дмитрий Иванов", "email": "dmitry@компания.ru", "status": "активный" },
      { "name": "محمد عبدالله", "email": "mohammed@شركة.sa", "status": "نشط" },
      { "name": "🧑‍💻 Developer", "email": "dev@🏢.com", "status": "✅ Active" }
    ]
  },
  "html_entities": {
    "data": [
      { "name": "O'Brien & Sons <LLC>", "email": "info@obrien&sons.com", "notes": "He said \"hello\" & left" }
    ]
  }
}
```
|
||
|
||
### Adversarial Fixtures

Save as `test-fixtures/adversarial.json`:

```json
{
  "xss_payloads": {
    "data": [
      { "name": "<script>alert('xss')</script>", "email": "test@test.com" },
      { "name": "<img src=x onerror=alert(1)>", "email": "\"><script>alert(1)</script>" },
      { "name": "<svg onload=alert('xss')>", "email": "javascript:alert(1)" },
      { "name": "{{constructor.constructor('return this')().alert(1)}}", "email": "test@test.com" },
      { "name": "<details open ontoggle=alert(1)>", "email": "<iframe src='javascript:alert(1)'>" }
    ]
  },
  "sql_injection": {
    "data": [
      { "name": "'; DROP TABLE contacts; --", "email": "test@test.com" },
      { "name": "1' OR '1'='1", "email": "' UNION SELECT * FROM users --" },
      { "name": "admin'--", "email": "1; UPDATE users SET role='admin'" }
    ]
  },
  "malformed": {
    "missing_fields": { "data": [{ "id": "1" }] },
    "wrong_types": { "data": "not an array", "meta": "not an object" },
    "nested_nulls": { "data": [{ "name": { "first": null, "last": null }, "contacts": [null, null] }] },
    "circular_attempt": { "data": [{ "self": "[Circular]" }] },
    "massive_nesting": { "a": { "b": { "c": { "d": { "e": { "f": { "g": "deep" } } } } } } }
  }
}
```

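The XSS fixtures above only prove anything if the app under test escapes them before DOM insertion. A minimal sketch of the kind of `escapeHtml()` helper the security checks elsewhere in this pipeline assume (the implementation here is illustrative, not the framework's canonical one):

```typescript
// Illustrative escapeHtml() — the helper the XSS fixtures are meant to exercise.
// Apps should pass every dynamic string through this before innerHTML insertion.
function escapeHtml(value: unknown): string {
  return String(value ?? '')
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

// Quick self-check against two of the adversarial payloads:
const payloads = ["<script>alert('xss')</script>", '<img src=x onerror=alert(1)>'];
for (const p of payloads) {
  const out = escapeHtml(p);
  if (/[<>]/.test(out)) throw new Error(`raw angle bracket survived: ${out}`);
}
```

A QA test can load `test-fixtures/adversarial.json`, render each record, and assert no `<script>` element appears in the resulting DOM.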
### Scale Fixture Generator

```typescript
// tests/generate-scale-fixture.ts
// Run: npx ts-node tests/generate-scale-fixture.ts > test-fixtures/scale-1000.json

function generateScaleData(count: number) {
  const statuses = ['active', 'inactive', 'pending', 'archived'];
  const domains = ['gmail.com', 'outlook.com', 'company.co', 'startup.io', 'enterprise.net'];

  return {
    title: `Scale Test: ${count} Records`,
    data: Array.from({ length: count }, (_, i) => ({
      id: `contact-${String(i).padStart(6, '0')}`,
      name: `Contact ${i + 1}`,
      email: `user${i + 1}@${domains[i % domains.length]}`,
      phone: `555-${String(i).padStart(4, '0')}`,
      status: statuses[i % statuses.length],
      created: new Date(2025, 0, 1 + (i % 365)).toISOString().split('T')[0],
      value: Math.round(Math.random() * 100000) / 100,
      tags: [`tag-${i % 10}`, `region-${i % 5}`]
    })),
    meta: { total: count, page: 1, pageSize: count }
  };
}

console.log(JSON.stringify(generateScaleData(1000), null, 2));
```

---

## Regression Testing Baselines

### Baseline Workflow

```
1. CAPTURE — First time app is verified correct:
   npx backstop reference
   # Stores golden screenshots in test-baselines/backstop/

2. TEST — On every subsequent QA run:
   npx backstop test
   # Compares current screenshots against baselines
   # Result: PASS (<5% diff) or FAIL (>5% diff)

3. APPROVE — When intentional changes are made:
   npx backstop approve
   # Updates baselines to reflect new correct state

4. TRACK — Tool routing baselines:
   # test-fixtures/tool-routing.json is the routing baseline
   # Update ONLY when intentionally changing tool descriptions
   # Run routing tests after ANY tool description change
```

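A minimal `backstop.json` that drives this workflow might look like the following — the `id`, scenario label, app URL, and viewport list are placeholder assumptions to adapt per app; only the paths and the 5% threshold mirror conventions used elsewhere in this skill:

```json
{
  "id": "contacts_app",
  "engine": "puppeteer",
  "viewports": [
    { "label": "narrow", "width": 480, "height": 800 },
    { "label": "wide", "width": 1280, "height": 800 }
  ],
  "scenarios": [
    {
      "label": "thread-panel_data",
      "url": "http://localhost:3000/app-ui/contacts.html",
      "misMatchThreshold": 5
    }
  ],
  "paths": {
    "bitmaps_reference": "test-baselines/backstop",
    "bitmaps_test": "test-results/backstop"
  }
}
```

`misMatchThreshold` is a percentage, so `5` matches the ±5% diff budget used by the pass/fail rule above.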
### Screenshot Baseline Structure

```
test-baselines/
├── backstop/
│   ├── {app-name}_thread-panel_data.png
│   ├── {app-name}_thread-panel_loading.png
│   ├── {app-name}_thread-panel_empty.png
│   ├── {app-name}_narrow_data.png
│   └── {app-name}_wide_data.png
├── tool-routing.json          # NL → tool mapping baseline
└── app-data-schemas/          # JSON schemas per app type
    ├── dashboard.schema.json
    ├── data-grid.schema.json
    ├── detail-card.schema.json
    ├── timeline.schema.json
    └── pipeline.schema.json
```

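The `app-data-schemas/` files are meant to be compiled with Ajv in the full validator (`tests/app-data-validator.ts`). When Ajv isn't available, a dependency-free spot check of required fields and basic types can stand in — a sketch only; the `spec` shape here is an assumption of this example, not the schema format on disk:

```typescript
// Fallback APP_DATA shape check (the real validator compiles the
// *.schema.json files with Ajv; this only checks required fields/types).
type Kind = 'string' | 'number' | 'array' | 'object';

function checkAppData(spec: Record<string, Kind>, payload: Record<string, unknown>): string[] {
  const errors: string[] = [];
  for (const [field, kind] of Object.entries(spec)) {
    const value = payload[field];
    if (value === undefined || value === null) {
      errors.push(`missing required field: ${field}`);
    } else if (kind === 'array' ? !Array.isArray(value) : typeof value !== kind) {
      errors.push(`wrong type for ${field}: expected ${kind}`);
    }
  }
  return errors;
}
```

Against a data-grid payload, `checkAppData({ title: 'string', data: 'array' }, payload)` should come back empty; the `malformed.wrong_types` adversarial fixture should not.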
### Programmatic Screenshot Comparison (Without BackstopJS)

```typescript
// tests/screenshot-diff.ts
import { PNG } from 'pngjs';
import * as fs from 'fs';
import pixelmatch from 'pixelmatch';

function compareScreenshots(
  baselinePath: string,
  currentPath: string,
  diffOutputPath: string
): { diffPercent: number; pass: boolean } {
  const baseline = PNG.sync.read(fs.readFileSync(baselinePath));
  const current = PNG.sync.read(fs.readFileSync(currentPath));

  const { width, height } = baseline;
  // pixelmatch requires identical dimensions — fail loudly instead of crashing
  if (current.width !== width || current.height !== height) {
    throw new Error('Screenshot dimensions differ from baseline — recapture baselines');
  }
  const diff = new PNG({ width, height });

  const numDiffPixels = pixelmatch(
    baseline.data, current.data, diff.data,
    width, height,
    { threshold: 0.1 }
  );

  const totalPixels = width * height;
  const diffPercent = (numDiffPixels / totalPixels) * 100;

  if (diffPercent > 5) {
    fs.writeFileSync(diffOutputPath, PNG.sync.write(diff));
  }

  return {
    diffPercent: Math.round(diffPercent * 100) / 100,
    pass: diffPercent <= 5.0
  };
}
```

---

## Automated QA Script (Full)

Save as `scripts/mcp-qa.sh`:

```bash
#!/bin/bash
set -euo pipefail

# MCP QA — Automated Testing Pipeline
# Usage: ./mcp-qa.sh <service-name> [--skip-layer4]
#
# Runs all automated layers and generates a persistent report.

SERVICE="${1:-}"   # default to empty so the usage check below fires under set -u
SKIP_LAYER4="${2:-}"
DATE=$(date +%Y-%m-%d)
TIMESTAMP=$(date +%Y%m%d-%H%M%S)

if [ -z "$SERVICE" ]; then
  echo "Usage: $0 <service-name> [--skip-layer4]"
  exit 1
fi

# Persistent report location
REPORT_DIR="$HOME/.clawdbot/workspace/mcp-factory-reviews/${SERVICE}"
mkdir -p "$REPORT_DIR"
REPORT="${REPORT_DIR}/qa-report-${DATE}.md"

# Find server directory
SERVER_DIR=""
for d in "${SERVICE}-mcp" "mcp-servers/${SERVICE}" "mcp-diagrams/mcp-servers/${SERVICE}"; do
  if [ -d "$d" ]; then
    SERVER_DIR="$d"
    break
  fi
done

if [ -z "$SERVER_DIR" ]; then
  echo "❌ Server directory not found for ${SERVICE}"
  exit 1
fi

cat > "$REPORT" << EOF
# MCP QA Report: ${SERVICE}
**Date:** ${DATE}
**Timestamp:** ${TIMESTAMP}
**Tester:** Automated QA Pipeline
**Server:** ${SERVER_DIR}

---

## Quantitative Metrics

| Metric | Target | Actual | Status |
|--------|--------|--------|--------|
EOF

TOTAL_PASS=0
TOTAL_FAIL=0
TOTAL_WARN=0
TOTAL_SKIP=0
rm -f /tmp/mcp-qa-metrics.txt   # clear stale metric rows from any previous run

pass() { TOTAL_PASS=$((TOTAL_PASS + 1)); echo "✅ $1"; }
fail() { TOTAL_FAIL=$((TOTAL_FAIL + 1)); echo "❌ $1"; }
warn() { TOTAL_WARN=$((TOTAL_WARN + 1)); echo "⚠️ $1"; }
skip() { TOTAL_SKIP=$((TOTAL_SKIP + 1)); echo "⏭️ $1"; }

echo ""
echo "========================================"
echo " MCP QA Pipeline: ${SERVICE}"
echo " $(date)"
echo "========================================"
echo ""

# ─── LAYER 0: Protocol Compliance ───
echo "--- Layer 0: Protocol Compliance ---"
echo "" >> "$REPORT"
echo "## Layer 0: Protocol Compliance" >> "$REPORT"

cd "$SERVER_DIR"

# Build first
if npm run build 2>&1 | tail -5 > /tmp/mcp-qa-build.log; then
  pass "TypeScript build succeeded"
  echo "- ✅ TypeScript build succeeded" >> "$REPORT"
else
  fail "TypeScript build FAILED"
  echo "- ❌ TypeScript build FAILED" >> "$REPORT"
  cat /tmp/mcp-qa-build.log >> "$REPORT"
fi

# MCP Inspector (if available)
if command -v npx &> /dev/null; then
  echo "Running MCP Inspector..."
  if timeout 15 npx @modelcontextprotocol/inspector stdio node dist/index.js 2>/tmp/mcp-inspector.log; then
    pass "MCP Inspector passed"
    echo "- ✅ MCP Inspector passed" >> "$REPORT"
  else
    warn "MCP Inspector had issues (check /tmp/mcp-inspector.log)"
    echo "- ⚠️ MCP Inspector had issues" >> "$REPORT"
  fi
else
  skip "MCP Inspector (npx not available)"
  echo "- ⏭️ MCP Inspector skipped" >> "$REPORT"
fi

cd - > /dev/null

# ─── LAYER 1: Static Analysis ───
echo ""
echo "--- Layer 1: Static Analysis ---"
echo "" >> "$REPORT"
echo "## Layer 1: Static Analysis" >> "$REPORT"

# TypeScript type check
cd "$SERVER_DIR"
if npx tsc --noEmit 2>&1 | tail -3 > /tmp/mcp-qa-typecheck.log; then
  pass "tsc --noEmit clean"
  echo "- ✅ Type check clean" >> "$REPORT"
else
  fail "tsc --noEmit has errors"
  echo "- ❌ Type check errors:" >> "$REPORT"
  cat /tmp/mcp-qa-typecheck.log >> "$REPORT"
fi
cd - > /dev/null

# Any types — count with wc -l ('grep -c' exits nonzero on zero matches,
# which would both trip set -e and leave a stray second "0" via a fallback echo)
ANY_COUNT=$(grep -rn ": any" "$SERVER_DIR/src/" --include="*.ts" 2>/dev/null \
  | grep -v "catch\|eslint\|node_modules" | wc -l | tr -d ' ') || true
if [ "$ANY_COUNT" -eq 0 ]; then
  pass "No unintended 'any' types"
else
  warn "${ANY_COUNT} 'any' types found"
fi
echo "- any types: ${ANY_COUNT}" >> "$REPORT"

# SDK version
SDK_VER=$(cd "$SERVER_DIR" && node -e "console.log(require('./package.json').dependencies['@modelcontextprotocol/sdk'] || 'NOT FOUND')" 2>/dev/null || echo "UNKNOWN")
echo "- SDK version: ${SDK_VER}" >> "$REPORT"
# Warn if SDK is below 1.26.0 (security fix) — matches 1.0.x through 1.25.x
if echo "$SDK_VER" | grep -qE '1\.(1?[0-9]|2[0-5])\.'; then
  warn "SDK version ${SDK_VER} — should be ^1.26.0+ (security fix GHSA-345p-7cg4-v4c7)"
  echo "- ⚠️ SDK should be ^1.26.0+ (security fix)" >> "$REPORT"
fi

# App files
echo "" >> "$REPORT"
echo "### App Files" >> "$REPORT"
APP_COUNT=0
APP_OVERSIZED=0
for dir in "$SERVER_DIR/app-ui" "$SERVER_DIR/ui/dist"; do
  if [ -d "$dir" ]; then
    for f in "$dir"/*.html; do
      if [ -f "$f" ]; then
        SIZE=$(wc -c < "$f" | tr -d ' ')
        KB=$((SIZE / 1024))
        APP_COUNT=$((APP_COUNT + 1))
        if [ "$SIZE" -gt 51200 ]; then
          APP_OVERSIZED=$((APP_OVERSIZED + 1))
          echo "- ⚠️ $(basename "$f"): ${KB}KB (over 50KB budget)" >> "$REPORT"
        else
          echo "- ✅ $(basename "$f"): ${KB}KB" >> "$REPORT"
        fi
      fi
    done
  fi
done
echo "| App File Size | <50KB each | ${APP_OVERSIZED}/${APP_COUNT} over budget | $([ "$APP_OVERSIZED" -eq 0 ] && echo '✅' || echo '⚠️') |" >> /tmp/mcp-qa-metrics.txt

# ─── LAYER 2: Jest Unit Tests ───
echo ""
echo "--- Layer 2: Automated Tests ---"
echo "" >> "$REPORT"
echo "## Layer 2: Automated Tests" >> "$REPORT"

cd "$SERVER_DIR"
if [ -f "jest.config.ts" ] || [ -f "jest.config.js" ] || grep -q '"jest"' package.json 2>/dev/null; then
  echo "Running Jest tests..."
  if npx jest --ci --coverage 2>&1 | tee /tmp/mcp-qa-jest.log | tail -10; then
    pass "Jest tests passed"
    echo "- ✅ Jest tests passed" >> "$REPORT"
  else
    fail "Jest tests FAILED"
    echo "- ❌ Jest tests failed" >> "$REPORT"
    tail -20 /tmp/mcp-qa-jest.log >> "$REPORT"
  fi
else
  skip "No Jest config found"
  echo "- ⏭️ No Jest test suite found" >> "$REPORT"
fi

# Playwright visual tests
if [ -f "playwright.config.ts" ] || [ -f "playwright.config.js" ]; then
  echo "Running Playwright visual tests..."
  if npx playwright test 2>&1 | tee /tmp/mcp-qa-playwright.log | tail -10; then
    pass "Playwright tests passed"
    echo "- ✅ Playwright tests passed" >> "$REPORT"
  else
    fail "Playwright tests FAILED"
    echo "- ❌ Playwright tests failed" >> "$REPORT"
    tail -20 /tmp/mcp-qa-playwright.log >> "$REPORT"
  fi
else
  skip "No Playwright config found"
  echo "- ⏭️ No Playwright test suite found" >> "$REPORT"
fi

# BackstopJS visual regression (npx: backstopjs is a dev dependency, not global)
if [ -f "backstop.json" ]; then
  echo "Running BackstopJS regression..."
  if npx backstop test 2>&1 | tee /tmp/mcp-qa-backstop.log | tail -5; then
    pass "BackstopJS regression passed"
    echo "- ✅ Visual regression passed" >> "$REPORT"
  else
    warn "BackstopJS regression detected differences"
    echo "- ⚠️ Visual regression diffs detected" >> "$REPORT"
  fi
else
  skip "No backstop.json found"
  echo "- ⏭️ No BackstopJS config found" >> "$REPORT"
fi

cd - > /dev/null

# ─── LAYER 4: Live API (optional) ───
if [ "$SKIP_LAYER4" != "--skip-layer4" ]; then
  echo ""
  echo "--- Layer 4: Live API Testing ---"
  echo "" >> "$REPORT"
  echo "## Layer 4: Live API Testing" >> "$REPORT"

  if [ -f "$SERVER_DIR/.env" ]; then
    pass ".env file exists"
    echo "- ✅ .env credentials found" >> "$REPORT"
    echo "- ⚠️ Manual verification of live API required" >> "$REPORT"
  else
    skip "No .env file — skipping live API tests"
    echo "- ⏭️ No credentials available" >> "$REPORT"
  fi
else
  skip "Layer 4 skipped (--skip-layer4)"
  echo "" >> "$REPORT"
  echo "## Layer 4: Live API Testing — SKIPPED" >> "$REPORT"
fi

# ─── SECURITY SCAN ───
echo ""
echo "--- Layer 4.5: Security Scan ---"
echo "" >> "$REPORT"
echo "## Layer 4.5: Security Scan" >> "$REPORT"

SECURITY_ISSUES=0
for dir in "$SERVER_DIR/app-ui" "$SERVER_DIR/ui/dist"; do
  if [ -d "$dir" ]; then
    for f in "$dir"/*.html; do
      if [ -f "$f" ]; then
        # Check for potential key exposure
        for pat in "api.key" "apikey" "api_key" "secret" "sk_live" "pk_live"; do
          if grep -qi "$pat" "$f" 2>/dev/null; then
            SECURITY_ISSUES=$((SECURITY_ISSUES + 1))
            echo "- ❌ $(basename "$f"): potential key exposure (${pat})" >> "$REPORT"
          fi
        done
      fi
    done
  fi
done

if [ "$SECURITY_ISSUES" -eq 0 ]; then
  pass "No API key exposure detected"
  echo "- ✅ No API key exposure detected in app files" >> "$REPORT"
else
  fail "${SECURITY_ISSUES} potential security issues"
fi

# ─── SUMMARY ───
echo ""
echo "========================================"
echo " SUMMARY"
echo "========================================"
echo " ✅ Passed:   ${TOTAL_PASS}"
echo " ❌ Failed:   ${TOTAL_FAIL}"
echo " ⚠️ Warnings: ${TOTAL_WARN}"
echo " ⏭️ Skipped:  ${TOTAL_SKIP}"
echo "========================================"

OVERALL="PASS"
[ "$TOTAL_FAIL" -gt 0 ] && OVERALL="FAIL"
[ "$TOTAL_FAIL" -eq 0 ] && [ "$TOTAL_WARN" -gt 0 ] && OVERALL="PASS WITH WARNINGS"

# Fold any metric rows collected during the run into the report
if [ -f /tmp/mcp-qa-metrics.txt ]; then
  {
    echo ""
    echo "### Measured Metrics"
    echo ""
    echo "| Metric | Target | Actual | Status |"
    echo "|--------|--------|--------|--------|"
    cat /tmp/mcp-qa-metrics.txt
  } >> "$REPORT"
  rm -f /tmp/mcp-qa-metrics.txt
fi

cat >> "$REPORT" << EOF

---

## Summary

| Category | Count |
|----------|-------|
| ✅ Passed | ${TOTAL_PASS} |
| ❌ Failed | ${TOTAL_FAIL} |
| ⚠️ Warnings | ${TOTAL_WARN} |
| ⏭️ Skipped | ${TOTAL_SKIP} |

## Overall: **${OVERALL}**

---

*Report generated by MCP QA Pipeline v2.0*
*Saved to: ${REPORT}*
EOF

echo ""
echo "Report saved to: $REPORT"
echo "Overall: ${OVERALL}"
```

---

## Test Report Template (Full)

Generate this after running all layers. Save to `mcp-factory-reviews/{service}/qa-report-{date}.md`:

```markdown
# MCP QA Report: {Service Name}
**Date:** {YYYY-MM-DD}
**Tester:** {agent/human}
**Server:** {service}-mcp v{version}
**Apps:** {count} apps tested
**Credential Status:** {has-creds|needs-creds|sandbox|no-sandbox}

---

## Quantitative Metrics

| Metric | Target | Actual | Status |
|--------|--------|--------|--------|
| MCP Protocol Compliance | 100% | __% | ✅/❌ |
| Tool Correctness Rate | >95% | __/20 (__%) | ✅/❌ |
| Task Completion Rate | >90% | __/10 (__%) | ✅/❌ |
| APP_DATA Schema Match | 100% | __/__ (__%) | ✅/❌ |
| Response Latency P50 | <3s | __s | ✅/❌ |
| Response Latency P95 | <8s | __s | ✅/❌ |
| App Render Success | 100% | __/__ | ✅/❌ |
| Accessibility Score | >90 | __% | ✅/❌ |
| Cold Start Time | <2s | __ms | ✅/❌ |
| App File Size (max) | <50KB | __KB | ✅/❌ |
| Security (critical) | 0 | __ | ✅/❌ |

## Layer Results

| Layer | Status | Issues | Details |
|-------|--------|--------|---------|
| 0 — Protocol | ✅/⚠️/❌ | {count} | {notes} |
| 1 — Static | ✅/⚠️/❌ | {count} | {notes} |
| 2 — Visual | ✅/⚠️/❌ | {count} | {notes} |
| 2.5 — Accessibility | ✅/⚠️/❌ | {count} | {notes} |
| 3 — Functional | ✅/⚠️/❌ | {count} | {notes} |
| 3.5 — Performance | ✅/⚠️/❌ | {count} | {notes} |
| 4 — Live API | ✅/⚠️/❌/⏭️ | {count} | {notes} |
| 4.5 — Security | ✅/⚠️/❌ | {count} | {notes} |
| 5 — Integration | ✅/⚠️/❌ | {count} | {notes} |

## Overall: {PASS / PASS WITH WARNINGS / FAIL}

---

## Issues Found

### Critical (must fix before ship)
1. {issue}: {description} — {file:line}

### Warnings (should fix)
1. {issue}: {description}

### Notes (nice to have)
1. {observation}

---

## App-by-App Results

### {app-id-1}
- Visual: ✅/❌ — {notes}
- Accessibility: Score __% — {violations}
- Data flow: ✅/❌ — {notes}
- States (loading/empty/data): ✅/❌
- File size: __KB
- XSS test: ✅/❌
- Screenshot: {path}

---

## Tool Invocation Results

| # | NL Message | Expected Tool | Actual Tool | Correct? | Latency |
|---|-----------|---------------|-------------|----------|---------|
| 1 | "Show me all contacts" | list_contacts | | ✅/❌ | ms |
| 2 | "Find John Smith" | search_contacts | | ✅/❌ | ms |
| ... | | | | | |
| 20 | | | | | |

**Tool Correctness Rate: __/20 = __%**

---

## E2E Scenario Results

| # | Scenario | Steps | Completed? | Latency | Notes |
|---|----------|-------|-----------|---------|-------|
| 1 | {name} | {n} | ✅/❌ | ms | |
| ... | | | | | |
| 10 | | | | | |

**Task Completion Rate: __/10 = __%**

---

## Trend (vs Previous Report)

| Metric | Previous | Current | Change |
|--------|----------|---------|--------|
| Tool Correctness | __% | __% | +/-__% |
| Task Completion | __% | __% | +/-__% |
| Accessibility | __% | __% | +/-__% |
| Avg Latency | __s | __s | +/-__s |

---

## Recommendations
1. {what to fix/improve before shipping}
2. {items for next QA cycle}

---

*Report saved to: mcp-factory-reviews/{service}/qa-report-{date}.md*
*Previous reports in same directory for trending.*
```

### Report Trending Script

```bash
#!/bin/bash
# Aggregate QA trends across reports
# Usage: ./qa-trend.sh <service-name>

SERVICE="${1:-}"
REPORT_DIR="$HOME/.clawdbot/workspace/mcp-factory-reviews/${SERVICE}"

if [ -z "$SERVICE" ] || [ ! -d "$REPORT_DIR" ]; then
  echo "No reports found for ${SERVICE}"
  exit 1
fi

echo "=== QA Trend: ${SERVICE} ==="
echo ""
echo "| Date | Overall | Pass | Fail | Warn |"
echo "|------|---------|------|------|------|"

# Glob expansion is already sorted, and YYYY-MM-DD dates sort chronologically
for report in "$REPORT_DIR"/qa-report-*.md; do
  [ -f "$report" ] || continue   # no reports → glob stays literal, skip it
  DATE=$(basename "$report" | sed -e 's/^qa-report-//' -e 's/\.md$//')
  # Pull the verdict out of "## Overall: **PASS**" (strip up to the first "**")
  OVERALL=$(grep "^## Overall:" "$report" 2>/dev/null | head -1 | sed -e 's/.*Overall: \*\*//' -e 's/\*\*.*//')
  PASS=$(grep "✅ Passed" "$report" 2>/dev/null | grep -o '[0-9]\+' | head -1)
  FAIL=$(grep "❌ Failed" "$report" 2>/dev/null | grep -o '[0-9]\+' | head -1)
  WARN=$(grep "⚠️ Warnings" "$report" 2>/dev/null | grep -o '[0-9]\+' | head -1)
  echo "| ${DATE} | ${OVERALL:-?} | ${PASS:-?} | ${FAIL:-?} | ${WARN:-?} |"
done
```

---

## Quick Reference Commands

```bash
# ─── LAYER 0 ───
# MCP Inspector (protocol compliance)
npx @modelcontextprotocol/inspector stdio node dist/index.js

# ─── LAYER 1 ───
# Quick compile + type check
cd {service}-mcp && npm run build && npx tsc --noEmit

# ─── LAYER 2 ───
# Run Playwright visual tests
npx playwright test tests/visual.test.ts

# Run BackstopJS regression
npx backstop test

# Capture new baselines
npx backstop reference

# ─── LAYER 2.5 ───
# Run accessibility tests
npx playwright test tests/accessibility.test.ts

# ─── LAYER 3 ───
# Run Jest unit tests
npx jest --verbose

# Run tool routing tests
npx jest tests/tool-routing.test.ts

# Validate APP_DATA schemas
npx ts-node tests/app-data-validator.ts

# ─── LAYER 3.5 ───
# Cold start benchmark
time echo '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2025-11-25","capabilities":{},"clientInfo":{"name":"perf","version":"1.0"}}}' | timeout 10 node dist/index.js | head -1

# File size audit
for f in app-ui/*.html; do echo "$(wc -c < "$f" | tr -d ' ') $f"; done | sort -n

# ─── LAYER 4 ───
# Start server for manual testing
node dist/index.js

# ─── LAYER 4.5 ───
# Security scan
grep -rn "apikey\|api_key\|secret\|sk_live" app-ui/ --include="*.html"

# ─── LAYER 5 ───
# Full automated pipeline
./scripts/mcp-qa.sh {service-name}

# Trend report
./scripts/qa-trend.sh {service-name}

# ─── BROWSER TOOLS ───
# Screenshot via browser tool
# browser → open → http://192.168.0.25:3000 → navigate → screenshot

# Monitor postMessages in browser console
# window.addEventListener('message', e => console.log('[PM]', e.data.type, e.data))

# axe-core in browser console (paste the snippet from Layer 2.5.2)
```

---

## Common Issues & Fixes

| Symptom | Layer | Cause | Fix |
|---------|-------|-------|-----|
| App shows blank white screen | 2 | HTML file not found or wrong path | Check APP_NAME_MAP + APP_DIRS in route.ts |
| App shows loading forever | 3 | postMessage not received | Check data block format: `<!--APP_DATA:{...}:END_APP_DATA-->` |
| App renders but wrong data | 3 | APP_DATA JSON shape mismatch | Compare tool response fields with app's render() expectations |
| Tool not triggered by NL | 3 | Poor tool description | Add "do NOT use when" disambiguation |
| Wrong tool triggered | 3 | Similar tool descriptions | Add negative examples to both competing tools |
| Thread panel empty | 3 | Thread state not persisted | Check localStorage `lb-threads` key |
| Console error: CORS | 2 | iframe cross-origin issue | Ensure app served from same origin |
| Dark theme wrong | 2 | Hardcoded light colors | Audit CSS for `#fff`, `white`, `#f` colors |
| Overflow at narrow width | 2 | Fixed widths in CSS | Use `max-width: 100%`, `overflow-x: auto`, flex/grid |
| axe-core contrast fail | 2.5 | Text color too dim | Use #b0b2b8+ for secondary text (not #96989d) |
| MCP Inspector fails | 0 | Protocol error in server | Check initialize handler, verify JSON-RPC framing |
| Cold start >2s | 3.5 | Heavy imports at startup | Use lazy loading for tool groups |
| structuredContent mismatch | 0 | Output doesn't match outputSchema | Validate tool return against declared schema |
| APP_DATA parse fails | 3 | LLM produced invalid JSON | Use robust parser with newline stripping + trailing comma fix |
| XSS detected | 4.5 | Missing escapeHtml on field | Add escapeHtml() to all dynamic text insertions |
| Key exposure | 4.5 | API key in HTML file | Move to server-side only, never send to client |

---

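The "robust parser with newline stripping + trailing comma fix" from the APP_DATA row can be sketched as follows — a minimal illustration, not the framework's canonical implementation; the function name is hypothetical:

```typescript
// Tolerant APP_DATA extractor: pull the JSON out of the
// <!--APP_DATA:{...}:END_APP_DATA--> block, then repair the two most common
// LLM glitches (raw newlines, trailing commas) before parsing.
function parseAppData(html: string): unknown {
  const match = html.match(/<!--APP_DATA:([\s\S]*?):END_APP_DATA-->/);
  if (!match) return null;
  const cleaned = match[1]
    .replace(/[\r\n]+/g, ' ')        // strip literal newlines
    .replace(/,\s*([}\]])/g, '$1');  // drop trailing commas before } or ]
  try {
    return JSON.parse(cleaned);
  } catch {
    return null;                     // still invalid → treat as "no data"
  }
}
```

Returning `null` instead of throwing lets the app fall back to its empty state rather than spinning on "loading forever".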
## Project Setup: Adding Tests to an Existing Server

When adding this test framework to a server that doesn't have it yet:

```bash
cd {service}-mcp

# 1. Install test dependencies
npm install -D jest ts-jest @types/jest msw playwright @playwright/test @axe-core/playwright ajv pngjs pixelmatch backstopjs

# 2. Add Jest config
cat > jest.config.ts << 'EOF'
export default {
  preset: 'ts-jest',
  testEnvironment: 'node',
  testMatch: ['**/tests/**/*.test.ts'],
  setupFilesAfterEnv: ['./tests/setup.ts'],
};
EOF

# 3. Add Playwright config
cat > playwright.config.ts << 'EOF'
import { defineConfig, devices } from '@playwright/test';
export default defineConfig({
  testDir: './tests',
  testMatch: ['visual.test.ts', 'accessibility.test.ts', 'chaos.test.ts'],
  projects: [
    { name: 'chromium', use: { ...devices['Desktop Chrome'] } },
    { name: 'firefox', use: { ...devices['Desktop Firefox'] } },
    { name: 'webkit', use: { ...devices['Desktop Safari'] } },
  ],
});
EOF

# 4. Create directory structure
mkdir -p tests test-fixtures test-baselines/backstop test-baselines/app-data-schemas test-results/screenshots

# 5. Create initial fixture files
#    (copy from the fixtures library section above)

# 6. Add scripts to package.json
npm pkg set scripts.test="jest"
npm pkg set scripts.test:visual="playwright test"
npm pkg set scripts.test:a11y="playwright test tests/accessibility.test.ts"
npm pkg set scripts.test:all="jest && playwright test"
npm pkg set scripts.qa="../../scripts/mcp-qa.sh $(basename "$(pwd)" -mcp)"

# 7. Install Playwright browsers
npx playwright install
```