
AI Learning & Feedback Loop System — How Buba Gets Smarter From Human Input

Document Type: Architecture Specification
System: GooseFactory — Continuous Learning Subsystem
Version: 1.0
Date: 2025-07-11
Author: Buba (Architecture Subagent)


Table of Contents

  1. Executive Summary
  2. Feedback Data Schema
  3. Learning Domains
  4. Feedback Processing Pipeline
  5. Memory Integration
  6. Self-Improvement Loops
  7. Dashboard Metrics
  8. Feedback-to-Prompt Engineering
  9. Privacy & Ethics
  10. Implementation Roadmap

1. Executive Summary

The GooseFactory Learning System transforms Jake's human feedback from a passive quality gate into an active training signal that compounds Buba's capabilities over time. Every approval, rejection, score, comment, and annotation becomes a data point in a continuous improvement engine.

Core Principle: Feedback Is Data, Patterns Are Knowledge, Knowledge Becomes Behavior

The system operates on three timescales:

| Timescale | What Happens | Example |
|-----------|--------------|---------|
| Immediate (seconds) | Raw feedback stored, linked to work product | Jake scores code quality 4/10 → stored |
| Session (hours) | Patterns extracted, memory files updated | "Jake rejected 3 MCP servers today for missing error handling" |
| Epoch (weekly) | Behavioral rules crystallized, prompts updated, thresholds recalibrated | .goosehints updated: "Always include retry logic with exponential backoff in MCP tool handlers" |

The system is not a traditional ML training loop — Buba is an LLM agent, not a fine-tuned model. Instead, learning happens through structured memory, dynamic prompt injection, confidence calibration, and behavioral rule extraction. This is "learning" in the pragmatic sense: the agent's outputs measurably improve because its context window carries increasingly refined instructions derived from real human judgment.


2. Feedback Data Schema

2.1 Core Entities

FeedbackEvent — The Atomic Unit

Every interaction with a modal produces exactly one FeedbackEvent:

interface FeedbackEvent {
  // Identity
  id: string;                          // UUID v7 (time-sortable)
  timestamp: string;                   // ISO 8601
  sessionId: string;                   // Links to Buba's work session
  modalId: string;                     // Which modal template was shown
  modalVersion: string;                // Modal schema version (for migration)

  // What was reviewed
  workProduct: WorkProductRef;         // Reference to the artifact under review
  pipelineStage: PipelineStage;        // Which factory stage produced this
  mcpServerType?: string;              // e.g., "shopify", "stripe", "ghl"

  // The feedback itself
  decision?: DecisionFeedback;
  scores?: DimensionScore[];
  freeText?: FreeTextFeedback;
  comparison?: ComparisonFeedback;
  annotations?: Annotation[];
  confidence?: ConfidenceFeedback;

  // Meta-signals (implicit feedback)
  meta: FeedbackMeta;
}

WorkProductRef — What Was Reviewed

interface WorkProductRef {
  type: 'mcp-server' | 'code-module' | 'design' | 'test-suite' | 'documentation' | 'pipeline-config';
  id: string;                          // Artifact ID in the factory
  version: string;                     // Git SHA or version tag
  path?: string;                       // File path if applicable
  summary: string;                     // What Buba presented to Jake
  bubaConfidence: number;              // 0-1: Buba's pre-review confidence
  bubaScorePrediction?: number;        // 1-10: What Buba predicted Jake would score
  generationContext: {
    promptHash: string;                // Hash of system prompt used
    memorySnapshot: string[];          // Which memory files were active
    modelUsed: string;                 // sonnet/opus
    generationTimeMs: number;          // How long Buba spent generating
    iterationCount: number;            // How many self-review rounds before submission
  };
}

DecisionFeedback — Approve / Reject / Needs Work

interface DecisionFeedback {
  decision: 'approved' | 'rejected' | 'needs-work' | 'deferred';
  reason?: string;                     // Optional: why this decision
  severity?: 'minor' | 'moderate' | 'major' | 'critical';  // For rejections
  blockers?: string[];                 // Specific issues that blocked approval
  suggestions?: string[];              // What would make it approvable
}

DimensionScore — Quality Ratings Per Axis

interface DimensionScore {
  dimension: QualityDimension;
  score: number;                       // 1-10
  weight?: number;                     // How much Jake cares about this (learned over time)
  comment?: string;                    // Per-dimension note
}

type QualityDimension =
  | 'code-quality'        // Clean, idiomatic, maintainable
  | 'design-aesthetics'   // Visual appeal, layout, color
  | 'design-ux'           // Usability, information hierarchy
  | 'test-coverage'       // Edge cases, thoroughness
  | 'documentation'       // Clarity, completeness
  | 'architecture'        // Structure, separation of concerns
  | 'error-handling'      // Robustness, graceful degradation
  | 'performance'         // Speed, efficiency
  | 'security'            // Auth, input validation, secrets handling
  | 'creativity'          // Novel approaches, clever solutions
  | 'completeness'        // Did it cover everything asked for?
  | 'naming'              // Variable names, function names, file organization
  | 'dx'                  // Developer experience — easy to work with?
  ;

FreeTextFeedback — Written Praise & Criticism

interface FreeTextFeedback {
  liked?: string;                      // What Jake liked
  disliked?: string;                   // What Jake didn't like
  generalNotes?: string;               // Additional context
  
  // Processed fields (filled by enrichment pipeline)
  _extractedThemes?: string[];         // NLP-extracted themes
  _sentiment?: number;                 // -1 to 1
  _actionableItems?: string[];         // Specific things to change
}

ComparisonFeedback — A vs B Preferences

interface ComparisonFeedback {
  optionA: WorkProductRef;
  optionB: WorkProductRef;
  preferred: 'A' | 'B' | 'neither' | 'both-good';
  reason?: string;
  preferenceStrength: 'slight' | 'moderate' | 'strong' | 'overwhelming';
  dimensions?: {                       // Per-dimension winner
    dimension: QualityDimension;
    winner: 'A' | 'B' | 'tie';
  }[];
}

Annotation — Specific Code/Design Region Feedback

interface Annotation {
  type: 'praise' | 'criticism' | 'question' | 'suggestion';
  target: {
    file?: string;
    lineStart?: number;
    lineEnd?: number;
    selector?: string;                 // CSS selector for design annotations
    region?: string;                   // Named region for non-code artifacts
  };
  text: string;
  severity?: 'nit' | 'minor' | 'major' | 'critical';
  category?: string;                   // e.g., "naming", "error-handling", "style"
}

ConfidenceFeedback — Jake's Self-Assessed Certainty

interface ConfidenceFeedback {
  level: 'very-unsure' | 'somewhat-unsure' | 'fairly-sure' | 'very-sure';
  wouldDelegateToAI: boolean;          // "Could Buba have made this call?"
  needsExpertReview: boolean;          // "Should someone else also look?"
}

FeedbackMeta — Implicit Signals

interface FeedbackMeta {
  timeToDecisionMs: number;            // How long the modal was open
  timeToFirstInteractionMs: number;    // How long before Jake started clicking
  fieldsModified: string[];            // Which optional fields Jake bothered to fill
  scrollDepth?: number;                // How much of the artifact Jake actually viewed
  revisits: number;                    // How many times Jake came back to this modal
  deviceType: 'desktop' | 'mobile';   // Context of where Jake reviewed
  timeOfDay: string;                   // Morning/afternoon/evening/night
  dayOfWeek: string;                   // Weekday vs weekend patterns
  concurrentModals: number;            // How many other modals were pending
  sessionFatigue: number;              // Nth modal in this review session
}

2.2 Storage Strategy

feedback/
├── raw/                               # Immutable event log (append-only)
│   ├── 2025-07-11.jsonl              # One JSONL file per day
│   ├── 2025-07-12.jsonl
│   └── ...
├── indexes/                           # Derived indexes for fast lookup
│   ├── by-server-type.json           # FeedbackEvent[] grouped by MCP server type
│   ├── by-dimension.json             # Scores grouped by quality dimension
│   ├── by-decision.json              # Approval/rejection patterns
│   └── by-stage.json                 # Feedback by pipeline stage
├── aggregates/                        # Computed summaries
│   ├── weekly/
│   │   └── 2025-W28.json            # Weekly aggregate stats
│   ├── monthly/
│   │   └── 2025-07.json
│   └── rolling-30d.json              # Always-current 30-day window
└── analysis/                          # Pattern extraction outputs
    ├── themes.json                    # Recurring feedback themes
    ├── preferences.json               # Learned preference weights
    └── calibration.json               # Confidence calibration data

Why JSONL for raw storage:

  • Append-only = safe under crashes
  • One line per event = easy grep/jq queries
  • Daily files = natural partitioning, easy archival
  • Human-readable = Jake can audit what's stored
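Because the raw JSONL log is the source of truth, any derived index can be rebuilt from it if corrupted. A minimal sketch of rebuilding the by-server-type index from one day's raw text (function name hypothetical):

```typescript
// Rebuild the by-server-type index from raw JSONL text.
// Each line is one FeedbackEvent; events without an mcpServerType are skipped.
function rebuildServerTypeIndex(jsonl: string): Record<string, unknown[]> {
  const index: Record<string, unknown[]> = {};
  for (const line of jsonl.split('\n')) {
    if (!line.trim()) continue;          // tolerate trailing newline / blank lines
    const event = JSON.parse(line) as { mcpServerType?: string };
    if (!event.mcpServerType) continue;
    (index[event.mcpServerType] ??= []).push(event);
  }
  return index;
}
```

The same pattern extends to the other three indexes; only the grouping key changes.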

3. Learning Domains

3.1 Decision Quality — The Jake Preference Model

Goal: Build a statistical model of what Jake approves vs. rejects, so Buba can predict outcomes and pre-filter.

Data Collection

For every decision event, we track:

(server_type, pipeline_stage, dimension_scores[], decision, time_to_decide)

Pattern Extraction

Approval Rate by Server Type:

| Server Type | Approved | Rejected | Needs Work | Approval Rate |
|------------|----------|----------|------------|---------------|
| shopify    | 12       | 2        | 3          | 70.6%         |
| stripe     | 8        | 5        | 4          | 47.1%         |
| ghl        | 15       | 1        | 1          | 88.2%         |

Minimum Score Thresholds (learned): After 20+ feedback events per dimension, compute the threshold below which Jake nearly always rejects:

error-handling < 5 → 95% rejection rate → HARD GATE
test-coverage < 4 → 90% rejection rate → HARD GATE
code-quality < 6 → 80% rejection rate → SOFT GATE (warn Buba)
naming < 3 → 85% rejection rate → SOFT GATE
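A sketch of how these gates might be derived from historical events, assuming each event carries the per-dimension score and whether the item was ultimately rejected (names hypothetical; the 95%/80% cutoffs and 20-event minimum come from the rules above):

```typescript
type GateKind = 'HARD' | 'SOFT' | 'NONE';

// Derive a gate for one dimension from historical (score, rejected) pairs.
// >=95% rejection below the cutoff => HARD gate, >=80% => SOFT gate;
// fewer than 20 events means the threshold isn't trusted yet.
function deriveGate(
  events: { score: number; rejected: boolean }[],
  cutoff: number,
): GateKind {
  if (events.length < 20) return 'NONE';
  const below = events.filter(e => e.score < cutoff);
  if (below.length === 0) return 'NONE';
  const rejectionRate = below.filter(e => e.rejected).length / below.length;
  if (rejectionRate >= 0.95) return 'HARD';
  if (rejectionRate >= 0.8) return 'SOFT';
  return 'NONE';
}
```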

Confidence Calibration Table:

Track Buba's predicted confidence vs. actual approval:

| Buba's Confidence | Actual Approval Rate | Calibration Error | n   |
|-------------------|---------------------|-------------------|-----|
| 90-100%           | 82%                 | +12% overconfident | 45  |
| 70-89%            | 71%                 | ~calibrated        | 38  |
| 50-69%            | 35%                 | +22% overconfident | 22  |
| <50%              | 15%                 | ~calibrated        | 12  |

Action: Buba applies a calibration curve to raw confidence. If calibrated confidence < 60%, auto-trigger a second self-review pass before sending to Jake.
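A minimal sketch of that correction, hard-coding the bucket corrections from the table above (in practice they would be recomputed from calibration.json):

```typescript
// Map Buba's raw confidence through the learned calibration curve.
// Corrections mirror the table: overconfident in the 90%+ and 50-69% buckets.
function calibrate(raw: number): number {
  if (raw >= 0.9) return Math.max(0, raw - 0.12);
  if (raw >= 0.7) return raw;                     // ~calibrated
  if (raw >= 0.5) return Math.max(0, raw - 0.22);
  return raw;                                     // ~calibrated
}

// Calibrated confidence below 60% triggers a second self-review pass.
function needsSecondPass(raw: number): boolean {
  return calibrate(raw) < 0.6;
}
```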

Decision Speed as Quality Signal

Fast decisions (< 30 sec) with approval → Buba nailed it, this pattern is solid
Fast decisions (< 30 sec) with rejection → Obvious flaw Buba should have caught
Slow decisions (> 5 min) with approval → Edge case, Jake was thinking carefully
Slow decisions (> 5 min) with rejection → Complex problem, needs better tooling

Fast rejections are the highest-value learning signals — they indicate patterns Buba should learn to self-check.
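The four quadrants can be encoded directly (signal labels hypothetical; the 30-second and 5-minute cutoffs come from the table above):

```typescript
type SpeedSignal =
  | 'pattern-solid'          // fast approval
  | 'should-have-caught'     // fast rejection — highest-value learning signal
  | 'edge-case'              // slow approval
  | 'needs-better-tooling'   // slow rejection
  | 'neutral';               // in between — no strong signal

function classifySpeedSignal(timeToDecisionMs: number, approved: boolean): SpeedSignal {
  if (timeToDecisionMs < 30_000) return approved ? 'pattern-solid' : 'should-have-caught';
  if (timeToDecisionMs > 300_000) return approved ? 'edge-case' : 'needs-better-tooling';
  return 'neutral';
}
```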

3.2 Code Quality — Learning Jake's Standards

Style Guide Extraction

After every code-related feedback event with annotations or comments:

  1. Extract rules from criticism patterns:

    • "Missing error handling" × 8 times → Rule: "Every external API call must have try/catch with typed errors"
    • "Variable names too short" × 5 → Rule: "Prefer descriptive names; no single-letter variables outside loops"
    • "No input validation" × 6 → Rule: "Validate all user inputs at function boundaries"
  2. Weight rules by frequency and severity:

    Rule Weight = (occurrence_count × avg_severity_score) / total_reviews
    
  3. Compile into .goosehints as concrete instructions (see Section 8).
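The weighting formula in step 2, transcribed as code (names hypothetical):

```typescript
// Rule Weight = (occurrence_count × avg_severity_score) / total_reviews
function ruleWeight(
  occurrenceCount: number,
  avgSeverityScore: number,
  totalReviews: number,
): number {
  if (totalReviews === 0) return 0;   // no reviews yet — nothing to weight against
  return (occurrenceCount * avgSeverityScore) / totalReviews;
}
```

Rules sorted by this weight determine the ordering of the pre-review checklist below.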

Common Rejection Reasons Registry

interface RejectionPattern {
  pattern: string;                     // "Missing retry logic on API calls"
  occurrences: number;
  lastSeen: string;
  avgSeverity: number;
  serverTypes: string[];               // Which types trigger this
  autoFixable: boolean;                // Can Buba fix this programmatically?
  checklistItem: string;               // Becomes a pre-submit checklist entry
}

Top 5 Rejection Reasons (updated weekly, surfaced in Buba's pre-review prompt):

## Pre-Review Checklist (from learned feedback)
1. ☐ All external API calls have retry logic with exponential backoff
2. ☐ Error messages include actionable context (not just "Error occurred")
3. ☐ Input validation on all public function parameters
4. ☐ Rate limiting implemented on all HTTP-facing tools
5. ☐ TypeScript strict mode, no `any` types

3.3 Design Skills — Aesthetic Preference Learning

Preference Vector

From side-by-side comparisons, build a multi-dimensional preference vector:

Jake's Design Preferences (learned from 30+ comparisons):
- Whitespace: prefers generous (0.8/1.0)
- Color palette: prefers muted/professional (0.7/1.0) over vibrant
- Typography: prefers system fonts (0.6/1.0), values readability over style
- Layout: prefers card-based (0.9/1.0) over list-based
- Animation: prefers subtle or none (0.3/1.0)
- Information density: prefers moderate (0.5/1.0) — not too sparse, not too packed

Each comparison updates these weights using a simple ELO-like system:

  • If Jake picks A over B, A's design attributes gain points, B's lose
  • Preference strength multiplies the update magnitude
  • After 50+ comparisons, the vector stabilizes
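A sketch of that update for a single design attribute, assuming weights live in [0, 1] and a base step size `k` (both hypothetical tuning choices):

```typescript
// Preference strength multiplies the update magnitude, per the text above.
const STRENGTH_MULTIPLIER = {
  slight: 0.5,
  moderate: 1,
  strong: 1.5,
  overwhelming: 2,
} as const;

// Winner's attribute weight gains a step, loser's loses one; both clamped to [0, 1].
function updatePreference(
  winnerWeight: number,
  loserWeight: number,
  strength: keyof typeof STRENGTH_MULTIPLIER,
  k = 0.05,
): { winner: number; loser: number } {
  const step = k * STRENGTH_MULTIPLIER[strength];
  return {
    winner: Math.min(1, winnerWeight + step),
    loser: Math.max(0, loserWeight - step),
  };
}
```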

Design Quality Predictor

A simple scoring function that estimates Jake's likely score before review:

predicted_score = Σ (dimension_weight[i] × self_assessed_score[i])

Where dimension_weight comes from a regression on historical (predicted_features, actual_score) pairs. Initially use equal weights; after 30+ data points, fit a weighted model.
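A sketch of the scoring function, with the equal-weights fallback for the cold-start case (names hypothetical):

```typescript
// predicted_score = Σ (dimension_weight[i] × self_assessed_score[i]),
// with weights normalized to sum to 1. Until 30+ data points exist,
// learnedWeights is null and every dimension counts equally.
function predictScore(
  selfScores: { dimension: string; score: number }[],
  learnedWeights: Record<string, number> | null,
): number {
  const weights = selfScores.map(s => learnedWeights?.[s.dimension] ?? 1);
  const total = weights.reduce((a, b) => a + b, 0);
  return selfScores.reduce((sum, s, i) => sum + (weights[i] / total) * s.score, 0);
}
```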

3.4 Delegation Skills — Reducing Jake's Cognitive Load

Task Classification Model

Over time, classify tasks into:

| Category | Criteria (Learned) | Action |
|----------|--------------------|--------|
| Auto-approve | Approval rate >95% over 20+ instances, avg time-to-decide <15s | Skip HITL, log for audit |
| Quick review | Approval rate >80%, low complexity | Show minimal modal (approve/reject only) |
| Standard review | Approval rate 50-80% | Full modal with all dimensions |
| Deep review | Approval rate <50% or high complexity | Extended modal, request annotations |
| Escalate | Novel task type or <5 historical data points | Flag as "needs careful review" |
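The classification table as a sketch, assuming per-task-type stats are available from the feedback indexes (the stat names and the complexity flag are hypothetical):

```typescript
type ReviewDepth = 'auto-approve' | 'quick' | 'standard' | 'deep' | 'escalate';

// Pick a review depth from historical stats, following the table above.
function classifyReviewDepth(stats: {
  n: number;                  // historical data points for this task type
  approvalRate: number;       // 0-1
  avgDecisionMs: number;
  highComplexity: boolean;
}): ReviewDepth {
  if (stats.n < 5) return 'escalate';   // novel task type
  if (stats.approvalRate > 0.95 && stats.n >= 20 && stats.avgDecisionMs < 15_000) {
    return 'auto-approve';
  }
  if (stats.approvalRate > 0.8 && !stats.highComplexity) return 'quick';
  if (stats.approvalRate >= 0.5 && !stats.highComplexity) return 'standard';
  return 'deep';                        // <50% approval or high complexity
}
```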

Notification Timing Optimization

Track Jake's engagement patterns:

| Time Window       | Avg Response Time | Completion Rate | Quality of Feedback |
|-------------------|-------------------|-----------------|---------------------|
| 9-11 AM           | 3 min             | 95%             | High (detailed)     |
| 11 AM-1 PM        | 8 min             | 88%             | Medium              |
| 2-5 PM            | 5 min             | 92%             | High                |
| 5-8 PM            | 25 min            | 60%             | Low (rushed)        |
| 8 PM-12 AM        | 45 min            | 40%             | Low                 |

Action: Batch non-urgent reviews for high-engagement windows. Send only critical decisions during low-engagement times.

Session Fatigue Detection

If Jake has reviewed 5+ items in a row:

  • Decision quality likely drops (less detailed feedback, faster but less thoughtful)
  • Auto-insert a break suggestion: "You've reviewed 6 items — want me to batch the rest for later?"
  • Weight feedback from fatigued sessions lower in the learning pipeline

3.5 Communication Skills — Better Presentations

Format Effectiveness Tracking

For each modal format variant, track:

format_effectiveness = decision_speed × feedback_quality × completion_rate

| Presentation Style | Avg Time to Decide | Feedback Richness | Effectiveness Score |
|--------------------|--------------------|-------------------|---------------------|
| Code diff + summary | 45s | High | 0.92 |
| Full file + annotations | 120s | Very High | 0.85 |
| Summary only | 15s | Low | 0.60 |
| Screenshot + bullet points | 30s | Medium | 0.78 |

Action: For each pipeline stage and artifact type, select the format with highest effectiveness score.

Detail Level Calibration

Track which summary sections Jake actually reads (via scroll depth and time-on-section):

  • If Jake never reads the "Architecture Decisions" section → compress or remove it
  • If Jake always expands the "Test Coverage" section → make it prominent
  • If Jake clicks "Show More" on error handling details → default to expanded

4. Feedback Processing Pipeline

┌─────────────┐    ┌──────────────┐    ┌─────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│    Modal    │───▶│  Validation  │───▶│ Enrichment  │───▶│   Storage    │───▶│  Analysis    │───▶│   Action     │
│  Response   │    │              │    │             │    │              │    │              │    │              │
└─────────────┘    └──────────────┘    └─────────────┘    └──────────────┘    └──────────────┘    └──────────────┘
                          │                   │                  │                   │                    │
                     Reject malformed    Add timestamps     Raw JSONL log       Pattern detect      Update memory
                     Fill defaults       Link to artifact   Update indexes      Theme extraction    Update prompts
                     Sanitize text       Compute meta       Rebuild aggregates  Calibration update  Trigger self-check
                     Schema validate     Extract themes     Archive old data    Anomaly detection   Adjust thresholds

Stage 1: Validation

function validateFeedback(raw: unknown): FeedbackEvent | ValidationError {
  // 1. Schema validation — does it match FeedbackEvent shape?
  const parsed = FeedbackEventSchema.safeParse(raw);
  if (!parsed.success) return { error: 'schema', details: parsed.error };

  // 2. Referential integrity — does the workProductRef exist?
  if (!artifactExists(parsed.data.workProduct.id)) {
    return { error: 'orphaned', details: 'Work product not found' };
  }

  // 3. Range validation — scores in bounds, timestamps sane
  for (const score of parsed.data.scores ?? []) {
    if (score.score < 1 || score.score > 10) {
      return { error: 'range', details: `Score ${score.score} out of bounds` };
    }
  }

  // 4. Deduplication — reject duplicate submissions (same modal, same session)
  if (isDuplicate(parsed.data)) {
    return { error: 'duplicate', details: 'Already processed' };
  }

  return parsed.data;
}

Stage 2: Enrichment

function enrichFeedback(event: FeedbackEvent): EnrichedFeedbackEvent {
  return {
    ...event,
    
    // Add implicit meta-signals
    meta: {
      ...event.meta,
      timeOfDay: categorizeTime(event.timestamp),
      dayOfWeek: getDayOfWeek(event.timestamp),
      sessionFatigue: countPriorModalsInSession(event.sessionId),
    },

    // Link to work product details
    workProduct: {
      ...event.workProduct,
      _generationPrompt: getPromptUsed(event.workProduct.id),
      _pipelineContext: getPipelineState(event.workProduct.id),
    },

    // NLP enrichment on free text
    freeText: event.freeText ? {
      ...event.freeText,
      _extractedThemes: extractThemes(event.freeText),
      _sentiment: analyzeSentiment(event.freeText),
      _actionableItems: extractActionItems(event.freeText),
    } : undefined,

    // Calibration data
    _predictionDelta: event.workProduct.bubaScorePrediction != null && event.scores?.length
      ? computeAvgScore(event.scores) - event.workProduct.bubaScorePrediction
      : undefined,
  };
}

Stage 3: Storage

async function storeFeedback(event: EnrichedFeedbackEvent): Promise<void> {
  // 1. Append to daily raw log (immutable)
  const dayFile = `feedback/raw/${formatDate(event.timestamp)}.jsonl`;
  await appendLine(dayFile, JSON.stringify(event));

  // 2. Update indexes (mutable, rebuilt from raw if corrupted)
  await updateIndex('by-server-type', event.mcpServerType, event);
  await updateIndex('by-dimension', event.scores, event);
  await updateIndex('by-decision', event.decision?.decision, event);
  await updateIndex('by-stage', event.pipelineStage, event);

  // 3. Update rolling aggregates
  await updateRollingAggregate('rolling-30d', event);
}

Stage 4: Analysis (runs after every N events or on schedule)

async function analyzeFeedback(): Promise<AnalysisResults> {
  const recent = await loadFeedback({ days: 30 });
  
  return {
    // Pattern detection
    approvalPatterns: computeApprovalRates(recent),
    dimensionTrends: computeDimensionTrends(recent),
    
    // Theme extraction — cluster free-text feedback into themes
    themes: clusterFeedbackThemes(recent.map(e => e.freeText).filter(Boolean)),
    
    // Calibration — compare predictions to actuals
    calibration: computeCalibrationCurve(
      recent.map(e => ({
        predicted: e.workProduct.bubaConfidence,
        actual: e.decision?.decision === 'approved' ? 1 : 0,
      }))
    ),
    
    // Anomalies — sudden changes in feedback patterns
    anomalies: detectAnomalies(recent),
    
    // New rules — patterns strong enough to become rules
    candidateRules: extractCandidateRules(recent),
  };
}

Stage 5: Action

async function applyLearnings(analysis: AnalysisResults): Promise<void> {
  // 1. Update memory files
  await updateMemoryFile('memory/feedback-patterns.md', analysis.themes);
  await updateMemoryFile('memory/quality-standards.md', analysis.dimensionTrends);
  await updateMemoryFile('memory/jake-preferences.md', analysis.approvalPatterns);
  
  // 2. Update pre-review checklist
  await updateChecklist(analysis.candidateRules);
  
  // 3. Recalibrate confidence thresholds
  await updateCalibration(analysis.calibration);
  
  // 4. Adjust HITL frequency for auto-approvable task types
  await updateHITLThresholds(analysis.approvalPatterns);
  
  // 5. Log what changed
  await appendToImprovementLog({
    timestamp: new Date().toISOString(),
    changes: analysis.candidateRules.filter(r => r.confidence > 0.8),
    metrics: { approvalRate: analysis.approvalPatterns.overall },
  });
}

5. Memory Integration

5.1 Memory File Architecture

The learning system writes to four memory files, each with a specific purpose:

memory/feedback-patterns.md

# Feedback Patterns (Auto-Updated)
> Last updated: 2025-07-11 | Based on 147 feedback events

## Recurring Themes (Top 10)
1. **Error handling depth** (mentioned 23x) — Jake wants specific error types, not generic catches
2. **Input validation** (18x) — Validate at boundaries, use Zod schemas
3. **Naming clarity** (15x) — Descriptive function names, no abbreviations
4. **Test edge cases** (12x) — Empty arrays, null values, Unicode, rate limits
5. **Documentation completeness** (10x) — Every public function needs JSDoc
...

## Anti-Patterns (Things Jake Consistently Rejects)
- Using `any` type in TypeScript (rejection rate: 100%)
- Console.log for error reporting (95%)
- Missing rate limiting on external APIs (90%)
- Hardcoded configuration values (85%)
- Missing input sanitization (88%)

## Positive Patterns (Things Jake Consistently Praises)
- Comprehensive error types with context (+0.8 avg score bonus)
- Builder pattern for complex configurations (+0.6)
- Graceful degradation with fallbacks (+0.7)
- Clear separation of concerns (+0.5)

memory/quality-standards.md

# Quality Standards (Learned Thresholds)
> Last calibrated: 2025-07-11 | Confidence: High (100+ data points)

## Minimum Acceptable Scores (Hard Gates)
| Dimension | Minimum Score | Based On | Jake's Avg |
|-----------|--------------|----------|------------|
| error-handling | 6 | 45 reviews | 7.2 |
| test-coverage | 5 | 38 reviews | 6.8 |
| security | 7 | 22 reviews | 7.9 |
| code-quality | 6 | 52 reviews | 7.1 |

## Target Scores (What "Great" Looks Like)
| Dimension | Target | Top Quartile |
|-----------|--------|-------------|
| error-handling | 8+ | 9.1 |
| code-quality | 8+ | 8.7 |
| completeness | 9+ | 9.4 |

## Confidence Calibration
Buba's raw confidence of X% maps to actual approval rate:
- 90%+ raw → 78% actual (apply -12% correction)
- 70-89% raw → 68% actual (apply -8% correction)
- 50-69% raw → 35% actual (apply -25% correction — major overconfidence zone)

memory/jake-preferences.md

# Jake's Preferences (Personal Taste Model)
> Last updated: 2025-07-11

## Decision Patterns
- Approves quickly: boilerplate MCP servers with standard patterns (<20s avg)
- Deliberates on: novel integrations, security-sensitive tools (>5min avg)
- Defers: design decisions when tired (evening reviews)
- Auto-rejects: anything with TypeScript `any`, missing error handling

## Communication Preferences
- Prefers: code diffs over full files for review
- Prefers: bullet points over paragraphs for summaries
- Reads: error handling sections carefully (high scroll depth)
- Skips: architecture decision records (low scroll depth)
- Best review times: 9-11 AM, 2-4 PM
- Worst review times: after 8 PM (fast but low-quality feedback)

## Design Taste
- Prefers: clean, minimal UI with generous whitespace
- Prefers: card-based layouts over tables for dashboards
- Dislikes: excessive animations, gradient-heavy designs
- Font preference: Inter/system fonts > decorative fonts
- Color preference: muted professional palettes > vibrant/playful

## Code Style
- Prefers: functional patterns over class-based
- Prefers: explicit over implicit (no magic)
- Prefers: small focused functions over large multi-purpose ones
- Cares deeply about: naming, error messages, edge cases
- Cares less about: premature optimization, DRY absolutism

memory/improvement-log.md

# Improvement Log — What Changed and Why
> Tracks every behavioral change derived from feedback

## 2025-07-11
- **Rule added:** "Always validate API responses with Zod schemas" 
  - Source: 5 rejections for missing validation in past 2 weeks
  - Confidence: 92%
- **Threshold adjusted:** error-handling minimum raised 5→6
  - Source: Jake rejected 3 items scoring 5 on error-handling
- **Calibration updated:** Buba overconfident by 12% in 90%+ range
  - Action: applying -12% correction to raw confidence

## 2025-07-04
- **Rule added:** "Include retry logic with exponential backoff on all HTTP calls"
  - Source: mentioned in 8 separate free-text feedback entries
- **Format changed:** Switched MCP server reviews to diff-based presentation
  - Source: decision time dropped 40% in A/B test

5.2 Memory Compression Strategy

Feedback data grows. Memory files must stay lean for context windows.

Compression Schedule:

Age Storage Memory Representation
0-7 days Full raw events in JSONL Referenced directly if relevant
8-30 days Raw events + aggregates Summarized in weekly aggregate
31-90 days Raw archived, aggregates active Compressed to patterns only
90+ days Raw cold-stored Only durable rules and anti-patterns preserved

Compression Algorithm:

  1. Scan feedback older than 30 days
  2. For each theme/pattern, check if it's already captured in feedback-patterns.md
  3. If yes → delete from active memory, keep in cold archive
  4. If no → extract the novel pattern, add to memory, then archive
  5. Rules with <5 occurrences that are >90 days old get pruned (they were noise)
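Step 5's pruning rule as a predicate (a sketch; using `lastSeen` from the RejectionPattern interface as the age reference is an assumption):

```typescript
// Rules with <5 occurrences that are >90 days old get pruned — they were noise.
function shouldPrune(
  rule: { occurrences: number; lastSeen: string },
  now: Date,
): boolean {
  const ageDays = (now.getTime() - new Date(rule.lastSeen).getTime()) / 86_400_000;
  return rule.occurrences < 5 && ageDays > 90;
}
```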

5.3 Contextual Memory Retrieval

When Buba starts a new task, the memory system surfaces relevant past feedback:

function getRelevantFeedback(task: Task): RelevantMemory {
  return {
    // Always loaded (small, high-value)
    preferences: loadMemory('jake-preferences.md'),
    standards: loadMemory('quality-standards.md'),
    
    // Task-type specific
    patterns: loadFilteredPatterns({
      serverType: task.mcpServerType,
      pipelineStage: task.stage,
      dimensions: task.relevantDimensions,
    }),
    
    // Recent relevant feedback (last 5 similar tasks)
    recentSimilar: loadRecentFeedback({
      type: task.type,
      limit: 5,
      minSimilarity: 0.7,
    }),
    
    // Specific anti-patterns for this task type
    antiPatterns: loadAntiPatterns(task.mcpServerType),
  };
}

6. Self-Improvement Loops

6.1 Pre-Review Self-Check

Before any work product is sent to Jake, Buba runs it through a learned checklist:

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐    ┌──────────────┐
│ Generated Work  │───▶│ Self-Check Gate  │───▶│ Pass? Submit to │───▶│ Jake Reviews │
│ Product         │    │ (learned rules)  │    │ Jake's Queue    │    │              │
└─────────────────┘    └──────────────────┘    └─────────────────┘    └──────────────┘
                              │ Fail                                          │
                              ▼                                               │
                       ┌──────────────┐                                       │
                       │ Auto-Fix or  │                                       │
                       │ Re-generate  │◀──────────────────────────────────────┘
                       └──────────────┘        (feedback updates self-check rules)

Self-Check Implementation:

interface SelfCheckResult {
  passed: boolean;
  score: number;                       // Predicted Jake score
  confidence: number;                  // Calibrated confidence
  violations: SelfCheckViolation[];
  recommendations: string[];
}

async function selfCheck(artifact: WorkProduct): Promise<SelfCheckResult> {
  const rules = await loadLearnedRules();
  const standards = await loadQualityStandards();
  const violations: SelfCheckViolation[] = [];

  // Check each learned rule
  for (const rule of rules) {
    const result = await evaluateRule(rule, artifact);
    if (!result.passed) {
      violations.push({
        rule: rule.id,
        description: rule.description,
        severity: rule.severity,
        autoFixable: rule.autoFixAvailable,
        suggestion: rule.fixSuggestion,
      });
    }
  }

  // Predict Jake's score
  const predictedScore = await predictScore(artifact, standards);
  const calibratedConfidence = applyCalibration(computeRawConfidence(violations));

  // Rank severities explicitly — string comparison ('minor' >= 'major') would pass the wrong items
  const severityRank: Record<string, number> = { nit: 0, minor: 1, moderate: 2, major: 3, critical: 4 };

  return {
    passed: violations.filter(v => severityRank[v.severity] >= severityRank['major']).length === 0,
    score: predictedScore,
    confidence: calibratedConfidence,
    violations,
    recommendations: violations
      .filter(v => v.autoFixable)
      .map(v => v.suggestion),
  };
}

6.2 Prediction Scoring Loop

Every time Buba submits work, it records a prediction. After Jake reviews:

Buba predicts: "Jake will give this 7.5/10 with 80% approval confidence"
Jake scores:   8/10, approved
Delta:         +0.5 (Buba was slightly pessimistic — recalibrate upward for this type)

Over time, this creates a calibration dataset:

interface CalibrationEntry {
  taskType: string;
  predicted: { score: number; confidence: number };
  actual: { score: number; approved: boolean };
  delta: number;
  timestamp: string;
}

// Calibration update (bucketed averages of predicted vs. actual per confidence decile)
function updateCalibration(entries: CalibrationEntry[]): CalibrationCurve {
  const buckets = groupBy(entries, e => Math.floor(e.predicted.confidence * 10) / 10);
  return Object.fromEntries(
    Object.entries(buckets).map(([bucket, items]) => [
      bucket,
      {
        avgPredicted: mean(items.map(i => i.predicted.confidence)),
        avgActual: mean(items.map(i => i.actual.approved ? 1 : 0)),
        n: items.length,
        correction: mean(items.map(i => i.delta)),
      }
    ])
  );
}

6.3 Diminishing Review System

As Buba proves competence in specific task types, HITL frequency decreases:

Level 0 (New task type):              Review 100% of outputs
Level 1 (10+ approvals):              Review 80%, spot-check 20%
Level 2 (25+ approvals, >85% rate):   Review 50%, auto-approve simple ones
Level 3 (50+ approvals, >90% rate):   Review 20%, auto-approve standard patterns
Level 4 (100+ approvals, >95% rate):  Review 5%, audit-only mode
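
The ladder above can be sketched as an ordered threshold table (field and function names are illustrative):

```typescript
interface AutonomyTier {
  level: number;
  minApprovals: number;
  minApprovalRate: number; // 0 means the tier has no rate requirement
  reviewFraction: number;  // share of outputs Jake still reviews
}

// Tiers mirror the ladder above; checked from most to least autonomous so
// the agent lands on the highest tier it qualifies for.
const TIERS: AutonomyTier[] = [
  { level: 4, minApprovals: 100, minApprovalRate: 0.95, reviewFraction: 0.05 },
  { level: 3, minApprovals: 50, minApprovalRate: 0.90, reviewFraction: 0.20 },
  { level: 2, minApprovals: 25, minApprovalRate: 0.85, reviewFraction: 0.50 },
  { level: 1, minApprovals: 10, minApprovalRate: 0, reviewFraction: 0.80 },
  { level: 0, minApprovals: 0, minApprovalRate: 0, reviewFraction: 1.0 },
];

function determineAutonomyLevel(approvals: number, approvalRate: number): AutonomyTier {
  return TIERS.find(
    t => approvals >= t.minApprovals && approvalRate >= t.minApprovalRate,
  )!; // level 0 always matches
}
```

Note that a high approval count never outruns a weak approval rate: 120 approvals at 86% still resolves to Level 2.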

Regression Detection: If at any level, the next 5 reviews show approval rate dropping below the level's threshold, immediately regress one level and log the regression in improvement-log.md.

function checkForRegression(taskType: string): RegressionStatus {
  const recentReviews = loadRecentReviews(taskType, 5);
  const approvalRate = recentReviews.filter(r => r.approved).length / recentReviews.length;
  const currentLevel = getAutonomyLevel(taskType);
  const threshold = LEVEL_THRESHOLDS[currentLevel];

  if (approvalRate < threshold) {
    return {
      regressed: true,
      from: currentLevel,
      to: Math.max(0, currentLevel - 1),
      reason: `Approval rate ${approvalRate} < threshold ${threshold}`,
    };
  }
  return { regressed: false };
}

6.4 A/B Learning

When Buba is uncertain between two approaches:

  1. Generate both versions
  2. Present as a comparison modal to Jake
  3. Record which Jake prefers and why
  4. Update preference model

Trigger conditions for A/B:

  • Buba's confidence < 65% on the "right" approach
  • Two valid design patterns exist (e.g., REST vs GraphQL adapter)
  • Jake has shown mixed preferences for this domain in the past
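
Folded into a single predicate, the trigger conditions might look like this (the two non-confidence inputs are assumed to be precomputed by earlier pipeline stages):

```typescript
interface ABContext {
  confidence: number;                 // Buba's confidence in the preferred approach
  validAlternatives: number;          // count of defensible design patterns
  mixedHistoricalPreference: boolean; // Jake has gone both ways in this domain
}

// Run an A/B comparison when any trigger condition from the list holds.
function shouldTriggerAB(ctx: ABContext): boolean {
  return (
    ctx.confidence < 0.65 ||
    ctx.validAlternatives >= 2 ||
    ctx.mixedHistoricalPreference
  );
}
```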

7. Dashboard Metrics

7.1 Key Performance Indicators

┌───────────────────────────────────────────────────────────────┐
│                    BUBA LEARNING DASHBOARD                    │
├──────────────────────┬────────────────────────────────────────┤
│ Overall Health       │ ████████░░ 82% (↑ from 74% last mo)    │
├──────────────────────┼────────────────────────────────────────┤
│ Approval Rate        │ 78% → 85% → 89% (3-month trend ↑)      │
│ Avg Quality Score    │ 6.2 → 7.1 → 7.6 (3-month trend ↑)      │
│ Time-to-Decision     │ 120s → 85s → 62s (3-month trend ↓)     │
│ Prediction Accuracy  │ ±2.1 → ±1.4 → ±0.8 (calibrating ↑)     │
│ Auto-Approve Rate    │ 0% → 12% → 28% (HITL reduction ↑)      │
│ Regression Events    │ 3 → 1 → 0 (stability improving ↑)      │
├──────────────────────┼────────────────────────────────────────┤
│ Top Improvement Area │ Error handling (score +2.1 since rule) │
│ Top Problem Area     │ Security patterns (still at 6.2 avg)   │
│ Feedback Backlog     │ 3 unprocessed events                   │
│ Active Rules         │ 23 learned rules (18 high-confidence)  │
└──────────────────────┴────────────────────────────────────────┘

7.2 Metric Definitions

| Metric | Formula | Target Trend |
|--------|---------|--------------|
| Approval Rate | approved / total_decisions per rolling 30 days | ↑ (asymptote at ~95%) |
| Avg Quality Score | mean(all dimension scores) per rolling 30 days | ↑ (target: 8.0+) |
| Time-to-Decision | mean(timeToDecisionMs) per rolling 30 days | ↓ (faster = better context) |
| Prediction Accuracy | mean(abs(predicted_score - actual_score)) | ↓ (target: ±0.5) |
| Auto-Approve Rate | auto_approved / total_tasks | ↑ (Jake reviews less) |
| Feedback Volume | total_feedback_events per week | ↓ (less HITL needed) |
| Rule Effectiveness | quality_delta after rule introduced | ↑ (rules actually help) |
| Regression Rate | regression_events / total_level_checks | ↓ (target: 0) |
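
Two of the rolling-window formulas, sketched over a simplified event shape (field names are assumptions, not the full FeedbackEvent schema):

```typescript
interface ReviewEvent {
  timestamp: string; // ISO 8601
  approved: boolean;
  predictedScore: number;
  actualScore: number;
}

const THIRTY_DAYS_MS = 30 * 24 * 60 * 60 * 1000;

function inWindow(events: ReviewEvent[], now: Date): ReviewEvent[] {
  return events.filter(
    e => now.getTime() - new Date(e.timestamp).getTime() <= THIRTY_DAYS_MS,
  );
}

// Approval Rate: approved / total_decisions over the rolling 30 days.
function approvalRate(events: ReviewEvent[], now = new Date()): number {
  const windowed = inWindow(events, now);
  return windowed.filter(e => e.approved).length / windowed.length;
}

// Prediction Accuracy: mean absolute error of predicted vs actual scores.
function predictionAccuracy(events: ReviewEvent[], now = new Date()): number {
  const windowed = inWindow(events, now);
  return (
    windowed.reduce((s, e) => s + Math.abs(e.predictedScore - e.actualScore), 0) /
    windowed.length
  );
}
```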

7.3 Alerting Thresholds

| Condition | Alert Level | Action |
|-----------|-------------|--------|
| Approval rate drops >10% week-over-week | 🔴 Critical | Pause auto-approvals, full HITL |
| Prediction accuracy >±2.0 | 🟡 Warning | Recalibrate immediately |
| New rejection reason appearing 3+ times | 🟡 Warning | Extract rule, add to checklist |
| Time-to-decision increasing | 🟠 Notice | Review presentation format |
| No feedback events for 48+ hours | 🟠 Notice | Check pipeline health |
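
A sketch covering three rows of the threshold table; ">10%" is read here as 10 percentage points week-over-week, which is an interpretation rather than spec:

```typescript
type AlertLevel = 'critical' | 'warning' | 'notice';

interface AlertInputs {
  approvalRateThisWeek: number;
  approvalRateLastWeek: number;
  predictionErrorAvg: number;   // mean |predicted - actual| score delta
  hoursSinceLastFeedback: number;
}

// Evaluate a subset of the alerting conditions and return any that fire.
function checkAlerts(m: AlertInputs): { level: AlertLevel; action: string }[] {
  const alerts: { level: AlertLevel; action: string }[] = [];
  if (m.approvalRateLastWeek - m.approvalRateThisWeek > 0.10) {
    alerts.push({ level: 'critical', action: 'pause auto-approvals, full HITL' });
  }
  if (m.predictionErrorAvg > 2.0) {
    alerts.push({ level: 'warning', action: 'recalibrate immediately' });
  }
  if (m.hoursSinceLastFeedback >= 48) {
    alerts.push({ level: 'notice', action: 'check pipeline health' });
  }
  return alerts;
}
```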

8. Feedback-to-Prompt Engineering

8.1 Rule Extraction Pipeline

Feedback → NLP Theme Extraction → Frequency Analysis → Confidence Check → Rule Formulation → Prompt Injection

interface LearnedRule {
  id: string;
  source: 'feedback-pattern' | 'rejection-analysis' | 'comparison' | 'annotation';
  rule: string;                        // Human-readable rule
  prompt_instruction: string;          // How to inject into system prompt
  confidence: number;                  // 0-1 based on data strength
  occurrences: number;                 // How many feedback events support this
  first_seen: string;
  last_seen: string;
  effectiveness?: number;              // Post-deployment quality delta
  domain: QualityDimension;
  applies_to: string[];                // Server types or 'all'
}
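
The last two pipeline stages, the frequency/confidence check and rule formulation, might be sketched like this. The `LearnedRule` copy is trimmed (`QualityDimension` narrowed to `string`), the saturating confidence curve is an assumption, and the 5-data-point floor follows §9.3:

```typescript
// Trimmed copy of the LearnedRule interface above.
interface LearnedRule {
  id: string;
  source: 'feedback-pattern' | 'rejection-analysis' | 'comparison' | 'annotation';
  rule: string;
  prompt_instruction: string;
  confidence: number;
  occurrences: number;
  first_seen: string;
  last_seen: string;
  domain: string;
  applies_to: string[];
}

interface ThemeStats {
  theme: string;
  occurrences: number;
  firstSeen: string;
  lastSeen: string;
  domain: string;
  serverTypes: string[];
}

// Confidence grows with supporting occurrences and saturates below 1.0.
// The exact curve is an assumption; the pipeline only specifies thresholds.
function themeConfidence(occurrences: number): number {
  return Math.min(0.95, occurrences / (occurrences + 2));
}

function promoteToRule(stats: ThemeStats): LearnedRule | null {
  if (stats.occurrences < 5) return null; // minimum sample size (see §9.3)
  return {
    id: `rule-${stats.theme}`,
    source: 'feedback-pattern',
    rule: `Address recurring feedback theme: ${stats.theme}`,
    prompt_instruction: `ALWAYS check for: ${stats.theme}`,
    confidence: themeConfidence(stats.occurrences),
    occurrences: stats.occurrences,
    first_seen: stats.firstSeen,
    last_seen: stats.lastSeen,
    domain: stats.domain,
    applies_to: stats.serverTypes,
  };
}
```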

8.2 Dynamic .goosehints Generation

The system auto-generates a .goosehints addendum from learned rules:

# .goosehints (auto-generated section — DO NOT EDIT BELOW THIS LINE)
# Generated: 2025-07-11 | Based on 147 feedback events | 23 active rules

## Code Standards (from Jake's feedback)
- ALWAYS use Zod schemas for runtime validation of API responses
- ALWAYS include retry logic with exponential backoff (base 1s, max 30s, 3 attempts)
- ALWAYS use typed error classes, never throw raw strings
- NEVER use TypeScript `any` — use `unknown` with type guards instead
- PREFER functional composition over class hierarchies
- PREFER explicit error handling over try/catch-all blocks
- Name functions as verb-noun: `fetchUser`, `validateInput`, `buildResponse`
- Every public function needs a JSDoc comment with @param, @returns, @throws

## Design Standards (from Jake's feedback)
- Use 16-24px padding in card components
- Prefer Inter or system font stack
- Use muted color palettes: slate, zinc, gray families
- Card-based layouts for dashboard content
- Minimal animations: 150ms transitions only for state changes

## Quality Gates (auto-calibrated)
- Error handling score must be ≥6 before submission
- Test coverage score must be ≥5 before submission
- Security score must be ≥7 for auth-related MCP servers
- Overall predicted score must be ≥6.5 or self-review is triggered

## Review Presentation (optimized for Jake)
- Lead with a code diff, not full files
- Bullet-point summaries, max 5 bullets
- Highlight error handling approach prominently
- Include test coverage percentage in summary header

8.3 Per-Domain Instruction Sets

Beyond global rules, maintain domain-specific prompt sections:

prompts/
├── global-rules.md                    # Always included
├── domains/
│   ├── mcp-server.md                 # Rules specific to MCP server generation
│   ├── ui-design.md                  # Rules for UI/design tasks
│   ├── test-suite.md                 # Rules for test generation
│   └── documentation.md              # Rules for docs
└── server-types/
    ├── shopify.md                     # Shopify-specific learned patterns
    ├── stripe.md                      # Stripe-specific patterns
    └── ghl.md                         # GHL-specific patterns

Each file is rebuilt when new rules cross the confidence threshold (>0.8, supported by 5+ data points).
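
The rebuild condition can be sketched as a filter over the rule store (`RuleSummary` trims §8.1's `LearnedRule` to the fields the filter needs):

```typescript
interface RuleSummary {
  prompt_instruction: string;
  confidence: number;
  occurrences: number;
  applies_to: string[]; // server types, or ['all']
}

// Select the rules that belong in a given domain file: above the confidence
// threshold, with enough supporting data points, and scoped to this domain.
function rulesForDomainFile(rules: RuleSummary[], domain: string): string[] {
  return rules
    .filter(r => r.confidence > 0.8 && r.occurrences >= 5)
    .filter(r => r.applies_to.includes('all') || r.applies_to.includes(domain))
    .map(r => `- ${r.prompt_instruction}`);
}
```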

8.4 Quality Rubric Generation

From accumulated scores, generate rubrics that Buba uses for self-assessment:

## Code Quality Rubric (generated from 52 scored reviews)

### Score 9-10 (Exceptional) — Jake gave this 8% of the time
- Typed errors with full context chain
- 95%+ test coverage including edge cases
- Zero TypeScript errors in strict mode
- Comprehensive JSDoc on all exports
- Performance considerations documented

### Score 7-8 (Good) — Jake gave this 45% of the time
- Typed errors on main paths
- 80%+ test coverage
- Minimal TypeScript warnings
- JSDoc on public API
- Basic error handling on all external calls

### Score 5-6 (Acceptable) — Jake gave this 30% of the time
- Try/catch on external calls
- 60%+ test coverage
- Some TypeScript warnings
- Partial documentation

### Score 1-4 (Rejection Zone) — Jake gave this 17% of the time
- Generic error handling or none
- <60% test coverage
- TypeScript `any` usage
- Missing documentation
- No input validation

9. Privacy & Ethics

9.1 Data Retention Policy

| Data Type | Retention | Rationale |
|-----------|-----------|-----------|
| Raw feedback events | 90 days hot, indefinite cold | Source of truth for auditing |
| Free-text comments | 90 days verbatim, then extracted themes only | Privacy — reduce stored opinions over time |
| Decision metadata | Indefinite (aggregated) | Low-sensitivity, high-value for patterns |
| Annotation data | 90 days linked to code, then de-linked | Code context changes; old annotations mislead |
| Timing/behavioral meta | 30 days, then averaged into aggregates | Sensitive behavioral data, compress aggressively |

9.2 Handling Contradictory Feedback

Jake may give contradictory feedback (approves pattern X on Monday, rejects it Friday).

Resolution Strategy:

  1. Recency weighting: Recent feedback counts 2× as much as older feedback
  2. Context matching: Check if the contradiction is actually context-dependent (e.g., pattern X is fine for internal tools but not for public APIs)
  3. Confidence-weighted: Feedback where Jake marked "very sure" outweighs "somewhat unsure"
  4. Volume threshold: Don't form a rule until the same signal appears 3+ times
  5. Clarification trigger: If two equally-strong signals conflict, flag for clarification in the next modal: "You've previously both approved and rejected [pattern]. Which do you prefer for [context]?"
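
Steps 1, 3, 4, and 5 compose into one resolution function (the 30-day recency cutoff and the ±0.2 clarification margin are assumed values):

```typescript
interface PatternSignal {
  approved: boolean;
  daysAgo: number;
  sureness: 'very-sure' | 'neutral' | 'unsure';
}

const SURENESS_WEIGHT: Record<PatternSignal['sureness'], number> = {
  'very-sure': 2.0,
  neutral: 1.0,
  unsure: 0.5,
};

type Resolution = 'prefer' | 'avoid' | 'insufficient-data' | 'needs-clarification';

function resolvePattern(signals: PatternSignal[]): Resolution {
  if (signals.length < 3) return 'insufficient-data';   // volume threshold (4)
  let score = 0;
  let total = 0;
  for (const s of signals) {
    const recency = s.daysAgo <= 30 ? 2 : 1;            // recency weighting (1)
    const w = recency * SURENESS_WEIGHT[s.sureness];    // confidence weighting (3)
    score += (s.approved ? 1 : -1) * w;
    total += w;
  }
  const balance = score / total; // -1 (always rejected) .. +1 (always approved)
  if (Math.abs(balance) < 0.2) return 'needs-clarification'; // trigger (5)
  return balance > 0 ? 'prefer' : 'avoid';
}
```

Context matching (step 2) is assumed to happen upstream, so that `signals` only contains feedback for one pattern in one context.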

9.3 Overfitting Prevention

Problem: Buba might optimize for Jake's momentary mood rather than durable preferences.

Safeguards:

  1. Minimum sample size: No rule is formed from fewer than 5 data points
  2. Time spread: Rules require data points spread across at least 3 different days
  3. Fatigue filtering: Feedback given after 5+ consecutive reviews in one session gets a 0.7× weight
  4. Trend smoothing: Use 30-day rolling averages, not daily snapshots, for threshold calibration
  5. Preference decay: Old preferences decay at 5% per month unless reinforced by new data
  6. Explicit override: Jake can always mark feedback as "strong preference" (2× weight) or "weak/unsure" (0.5× weight)
  7. Regular audit prompt: Monthly, present Jake with current learned rules for validation: "Here are the rules I've learned. Still accurate?"
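
Safeguards 3, 5, and 6 compose into a single per-event weight (a sketch; the decay is applied multiplicatively per month):

```typescript
interface WeightInputs {
  reviewsThisSession: number; // consecutive reviews before this one
  monthsOld: number;
  marked?: 'strong' | 'weak'; // Jake's explicit override
}

// Effective weight = fatigue factor x preference decay x explicit override.
function feedbackWeight(f: WeightInputs): number {
  const fatigue = f.reviewsThisSession >= 5 ? 0.7 : 1.0;  // safeguard 3
  const decay = Math.pow(0.95, f.monthsOld);              // safeguard 5: 5%/month
  const override =
    f.marked === 'strong' ? 2.0 : f.marked === 'weak' ? 0.5 : 1.0; // safeguard 6
  return fatigue * decay * override;
}
```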

9.4 Bias Awareness

  • Track if Buba becomes overly conservative (rejection avoidance → boring/safe outputs)
  • Monitor creativity scores alongside approval rates — if creativity drops as approval rises, inject more experimental variation
  • Preserve an "exploration budget" — 10% of outputs can deliberately try new patterns outside the safe zone, clearly marked as "experimental"

10. Implementation Roadmap

Phase 1: Foundation (Week 1-2)

  • Implement FeedbackEvent schema and validation
  • Build JSONL storage layer with daily rotation
  • Create basic index builders (by-server-type, by-dimension, by-decision)
  • Wire modal responses to storage pipeline
  • Initialize memory files (feedback-patterns.md, quality-standards.md, jake-preferences.md, improvement-log.md)

Phase 2: Analysis (Week 3-4)

  • Build theme extraction from free-text feedback (keyword clustering)
  • Implement approval rate computation per server type and stage
  • Build confidence calibration table
  • Create the pre-review self-check framework (rule-based)
  • Implement prediction scoring (predict → store → compare loop)

Phase 3: Learning Loops (Week 5-6)

  • Auto-generate .goosehints addendum from learned rules
  • Build the diminishing review system (autonomy levels 0-4)
  • Implement regression detection and auto-escalation
  • Create the A/B comparison flow for uncertain decisions
  • Build per-domain instruction generators

Phase 4: Optimization (Week 7-8)

  • Notification timing optimization
  • Session fatigue detection
  • Format effectiveness tracking
  • Memory compression and archival
  • Dashboard metrics computation and display

Phase 5: Maturity (Ongoing)

  • Monthly rule audit system
  • Exploration budget for creative variation
  • Cross-domain pattern transfer (learnings from code quality improving design reviews)
  • Community patterns (if multi-human review is ever added)

Appendix A: Example End-to-End Flow

1. Buba generates a Shopify MCP server
2. Pre-review self-check runs:
   - ✅ Retry logic present
   - ✅ Zod validation on API responses
   - ❌ Missing rate limiting (learned rule #7)
   - Buba auto-fixes: adds rate limiting
   - Re-check: all rules pass
   - Predicted score: 7.8 (calibrated confidence: 74%)
3. Submitted to Jake's review queue
   - Presentation format: code diff + 4-bullet summary (learned optimal format)
   - Timing: queued for 9-11 AM window (Jake's peak engagement)
4. Jake reviews in modal:
   - Decision: Approved
   - Scores: code-quality: 8, error-handling: 8, test-coverage: 7, completeness: 9
   - Free text: "Nice use of builder pattern for config. Wish the webhook handler had more specific error types."
   - Time to decide: 45 seconds
5. Feedback processed:
   - Validation: ✅
   - Enrichment: themes ["builder-pattern-positive", "error-specificity-gap"], sentiment: 0.7
   - Storage: appended to 2025-07-11.jsonl, indexes updated
   - Analysis: "error specificity" theme now at 6 occurrences → approaching rule threshold
   - Calibration: predicted 7.8, actual 8.0 → delta +0.2, slight pessimism
6. Learning applied:
   - "builder pattern" noted as positive for Shopify servers
   - "error specificity in webhooks" at 6 mentions → candidate rule queued (needs 2 more for confidence)
   - Calibration nudged +0.1 for Shopify server type
   - Shopify MCP autonomy: 15 approvals, 88% rate → Level 1 maintained

Appendix B: Storage Size Estimates

| Component | Per Event | Daily (10 events) | Monthly | Yearly |
|-----------|-----------|-------------------|---------|--------|
| Raw JSONL | ~2 KB | ~20 KB | ~600 KB | ~7.2 MB |
| Indexes | N/A | ~50 KB (rebuilt) | ~50 KB | ~50 KB |
| Aggregates | N/A | ~10 KB | ~30 KB | ~120 KB |
| Memory files | N/A | ~5 KB (updated) | ~8 KB | ~15 KB |
| Total | | ~85 KB/day | ~688 KB/mo | ~7.4 MB/yr |

Extremely lightweight. Even at 50 reviews/day, yearly storage stays under 40 MB. No database needed — flat files with JSONL are sufficient for this scale.
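
That claim checks out with back-of-the-envelope arithmetic over the table's estimates (treating indexes, aggregates, and memory files as roughly fixed-size year over year):

```typescript
// Rough yearly storage in MB: ~2 KB per raw JSONL event, growing linearly,
// plus the table's fixed-size index, aggregate, and memory-file footprints.
function yearlyStorageMB(eventsPerDay: number): number {
  const rawKB = eventsPerDay * 2 * 365;
  const fixedKB = 50 + 120 + 15; // indexes + aggregates + memory files
  return (rawKB + fixedKB) / 1024;
}
```

At 10 events/day this lands near the table's ~7.4 MB/yr; at 50 reviews/day it stays under the 40 MB ceiling quoted above.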


This document is a living architecture. As the system collects real feedback, specific thresholds, weights, and rules will be calibrated from actual data rather than these initial estimates.