49 KiB
AI Learning & Feedback Loop System — How Buba Gets Smarter From Human Input
Document Type: Architecture Specification
System: GooseFactory — Continuous Learning Subsystem
Version: 1.0
Date: 2025-07-11
Author: Buba (Architecture Subagent)
Table of Contents
- Executive Summary
- Feedback Data Schema
- Learning Domains
- Feedback Processing Pipeline
- Memory Integration
- Self-Improvement Loops
- Dashboard Metrics
- Feedback-to-Prompt Engineering
- Privacy & Ethics
- Implementation Roadmap
1. Executive Summary
The GooseFactory Learning System transforms Jake's human feedback from a passive quality gate into an active training signal that compounds Buba's capabilities over time. Every approval, rejection, score, comment, and annotation becomes a data point in a continuous improvement engine.
Core Principle: Feedback Is Data, Patterns Are Knowledge, Knowledge Becomes Behavior
The system operates on three timescales:
| Timescale | What Happens | Example |
|---|---|---|
| Immediate (seconds) | Raw feedback stored, linked to work product | Jake scores code quality 4/10 → stored |
| Session (hours) | Patterns extracted, memory files updated | "Jake rejected 3 MCP servers today for missing error handling" |
| Epoch (weekly) | Behavioral rules crystallized, prompts updated, thresholds recalibrated | .goosehints updated: "Always include retry logic with exponential backoff in MCP tool handlers" |
The system is not a traditional ML training loop — Buba is an LLM agent, not a fine-tuned model. Instead, learning happens through structured memory, dynamic prompt injection, confidence calibration, and behavioral rule extraction. This is "learning" in the pragmatic sense: the agent's outputs measurably improve because its context window carries increasingly refined instructions derived from real human judgment.
2. Feedback Data Schema
2.1 Core Entities
FeedbackEvent — The Atomic Unit
Every interaction with a modal produces exactly one FeedbackEvent:
interface FeedbackEvent {
// Identity
id: string; // UUID v7 (time-sortable)
timestamp: string; // ISO 8601
sessionId: string; // Links to Buba's work session
modalId: string; // Which modal template was shown
modalVersion: string; // Modal schema version (for migration)
// What was reviewed
workProduct: WorkProductRef; // Reference to the artifact under review
pipelineStage: PipelineStage; // Which factory stage produced this
mcpServerType?: string; // e.g., "shopify", "stripe", "ghl"
// The feedback itself
decision?: DecisionFeedback;
scores?: DimensionScore[];
freeText?: FreeTextFeedback;
comparison?: ComparisonFeedback;
annotations?: Annotation[];
confidence?: ConfidenceFeedback;
// Meta-signals (implicit feedback)
meta: FeedbackMeta;
}
WorkProductRef — What Was Reviewed
interface WorkProductRef {
type: 'mcp-server' | 'code-module' | 'design' | 'test-suite' | 'documentation' | 'pipeline-config';
id: string; // Artifact ID in the factory
version: string; // Git SHA or version tag
path?: string; // File path if applicable
summary: string; // What Buba presented to Jake
bubaConfidence: number; // 0-1: Buba's pre-review confidence
bubaScorePrediction?: number; // 1-10: What Buba predicted Jake would score
generationContext: {
promptHash: string; // Hash of system prompt used
memorySnapshot: string[]; // Which memory files were active
modelUsed: string; // sonnet/opus
generationTimeMs: number; // How long Buba spent generating
iterationCount: number; // How many self-review rounds before submission
};
}
DecisionFeedback — Approve / Reject / Needs Work
interface DecisionFeedback {
decision: 'approved' | 'rejected' | 'needs-work' | 'deferred';
reason?: string; // Optional: why this decision
severity?: 'minor' | 'moderate' | 'major' | 'critical'; // For rejections
blockers?: string[]; // Specific issues that blocked approval
suggestions?: string[]; // What would make it approvable
}
DimensionScore — Quality Ratings Per Axis
interface DimensionScore {
dimension: QualityDimension;
score: number; // 1-10
weight?: number; // How much Jake cares about this (learned over time)
comment?: string; // Per-dimension note
}
type QualityDimension =
| 'code-quality' // Clean, idiomatic, maintainable
| 'design-aesthetics' // Visual appeal, layout, color
| 'design-ux' // Usability, information hierarchy
| 'test-coverage' // Edge cases, thoroughness
| 'documentation' // Clarity, completeness
| 'architecture' // Structure, separation of concerns
| 'error-handling' // Robustness, graceful degradation
| 'performance' // Speed, efficiency
| 'security' // Auth, input validation, secrets handling
| 'creativity' // Novel approaches, clever solutions
| 'completeness' // Did it cover everything asked for?
| 'naming' // Variable names, function names, file organization
| 'dx' // Developer experience — easy to work with?
;
FreeTextFeedback — Written Praise & Criticism
interface FreeTextFeedback {
liked?: string; // What Jake liked
disliked?: string; // What Jake didn't like
generalNotes?: string; // Additional context
// Processed fields (filled by enrichment pipeline)
_extractedThemes?: string[]; // NLP-extracted themes
_sentiment?: number; // -1 to 1
_actionableItems?: string[]; // Specific things to change
}
ComparisonFeedback — A vs B Preferences
interface ComparisonFeedback {
optionA: WorkProductRef;
optionB: WorkProductRef;
preferred: 'A' | 'B' | 'neither' | 'both-good';
reason?: string;
preferenceStrength: 'slight' | 'moderate' | 'strong' | 'overwhelming';
dimensions?: { // Per-dimension winner
dimension: QualityDimension;
winner: 'A' | 'B' | 'tie';
}[];
}
Annotation — Specific Code/Design Region Feedback
interface Annotation {
type: 'praise' | 'criticism' | 'question' | 'suggestion';
target: {
file?: string;
lineStart?: number;
lineEnd?: number;
selector?: string; // CSS selector for design annotations
region?: string; // Named region for non-code artifacts
};
text: string;
severity?: 'nit' | 'minor' | 'major' | 'critical';
category?: string; // e.g., "naming", "error-handling", "style"
}
ConfidenceFeedback — Jake's Self-Assessed Certainty
interface ConfidenceFeedback {
level: 'very-unsure' | 'somewhat-unsure' | 'fairly-sure' | 'very-sure';
wouldDelegateToAI: boolean; // "Could Buba have made this call?"
needsExpertReview: boolean; // "Should someone else also look?"
}
FeedbackMeta — Implicit Signals
interface FeedbackMeta {
timeToDecisionMs: number; // How long the modal was open
timeToFirstInteractionMs: number; // How long before Jake started clicking
fieldsModified: string[]; // Which optional fields Jake bothered to fill
scrollDepth?: number; // How much of the artifact Jake actually viewed
revisits: number; // How many times Jake came back to this modal
deviceType: 'desktop' | 'mobile'; // Context of where Jake reviewed
timeOfDay: string; // Morning/afternoon/evening/night
dayOfWeek: string; // Weekday vs weekend patterns
concurrentModals: number; // How many other modals were pending
sessionFatigue: number; // Nth modal in this review session
}
2.2 Storage Strategy
feedback/
├── raw/ # Immutable event log (append-only)
│ ├── 2025-07-11.jsonl # One JSONL file per day
│ ├── 2025-07-12.jsonl
│ └── ...
├── indexes/ # Derived indexes for fast lookup
│ ├── by-server-type.json # FeedbackEvent[] grouped by MCP server type
│ ├── by-dimension.json # Scores grouped by quality dimension
│ ├── by-decision.json # Approval/rejection patterns
│ └── by-stage.json # Feedback by pipeline stage
├── aggregates/ # Computed summaries
│ ├── weekly/
│ │ └── 2025-W28.json # Weekly aggregate stats
│ ├── monthly/
│ │ └── 2025-07.json
│ └── rolling-30d.json # Always-current 30-day window
└── analysis/ # Pattern extraction outputs
├── themes.json # Recurring feedback themes
├── preferences.json # Learned preference weights
└── calibration.json # Confidence calibration data
Why JSONL for raw storage:
- Append-only = safe under crashes
- One line per event = easy grep/jq queries
- Daily files = natural partitioning, easy archival
- Human-readable = Jake can audit what's stored
3. Learning Domains
3.1 Decision Quality — The Jake Preference Model
Goal: Build a statistical model of what Jake approves vs. rejects, so Buba can predict outcomes and pre-filter.
Data Collection
For every decision event, we track:
(server_type, pipeline_stage, dimension_scores[], decision, time_to_decide)
Pattern Extraction
Approval Rate by Server Type:
| Server Type | Approved | Rejected | Needs Work | Approval Rate |
|------------|----------|----------|------------|---------------|
| shopify | 12 | 2 | 3 | 70.6% |
| stripe | 8 | 5 | 4 | 47.1% |
| ghl | 15 | 1 | 1 | 88.2% |
Minimum Score Thresholds (learned): After 20+ feedback events per dimension, compute the threshold below which Jake always rejects:
error-handling < 5 → 95% rejection rate → HARD GATE
test-coverage < 4 → 90% rejection rate → HARD GATE
code-quality < 6 → 80% rejection rate → SOFT GATE (warn Buba)
naming < 3 → 85% rejection rate → SOFT GATE
Confidence Calibration Table:
Track Buba's predicted confidence vs. actual approval:
| Buba's Confidence | Actual Approval Rate | Calibration Error | n |
|-------------------|---------------------|-------------------|-----|
| 90-100% | 82% | +12% overconfident | 45 |
| 70-89% | 71% | ~calibrated | 38 |
| 50-69% | 35% | +22% overconfident | 22 |
| <50% | 15% | ~calibrated | 12 |
Action: Buba applies a calibration curve to raw confidence. If calibrated confidence < 60%, auto-trigger a second self-review pass before sending to Jake.
Decision Speed as Quality Signal
Fast decisions (< 30 sec) with approval → Buba nailed it, this pattern is solid
Fast decisions (< 30 sec) with rejection → Obvious flaw Buba should have caught
Slow decisions (> 5 min) with approval → Edge case, Jake was thinking carefully
Slow decisions (> 5 min) with rejection → Complex problem, needs better tooling
Fast rejections are the highest-value learning signals — they indicate patterns Buba should learn to self-check.
3.2 Code Quality — Learning Jake's Standards
Style Guide Extraction
After every code-related feedback event with annotations or comments:
-
Extract rules from criticism patterns:
- "Missing error handling" × 8 times → Rule: "Every external API call must have try/catch with typed errors"
- "Variable names too short" × 5 → Rule: "Prefer descriptive names; no single-letter variables outside loops"
- "No input validation" × 6 → Rule: "Validate all user inputs at function boundaries"
-
Weight rules by frequency and severity:
Rule Weight = (occurrence_count × avg_severity_score) / total_reviews -
Compile into
.goosehintsas concrete instructions (see Section 8).
Common Rejection Reasons Registry
interface RejectionPattern {
pattern: string; // "Missing retry logic on API calls"
occurrences: number;
lastSeen: string;
avgSeverity: number;
serverTypes: string[]; // Which types trigger this
autoFixable: boolean; // Can Buba fix this programmatically?
checklistItem: string; // Becomes a pre-submit checklist entry
}
Top 5 Rejection Reasons (updated weekly, surfaced in Buba's pre-review prompt):
## Pre-Review Checklist (from learned feedback)
1. ☐ All external API calls have retry logic with exponential backoff
2. ☐ Error messages include actionable context (not just "Error occurred")
3. ☐ Input validation on all public function parameters
4. ☐ Rate limiting implemented on all HTTP-facing tools
5. ☐ TypeScript strict mode, no `any` types
3.3 Design Skills — Aesthetic Preference Learning
Preference Vector
From side-by-side comparisons, build a multi-dimensional preference vector:
Jake's Design Preferences (learned from 30+ comparisons):
- Whitespace: prefers generous (0.8/1.0)
- Color palette: prefers muted/professional (0.7/1.0) over vibrant
- Typography: prefers system fonts (0.6/1.0), values readability over style
- Layout: prefers card-based (0.9/1.0) over list-based
- Animation: prefers subtle or none (0.3/1.0)
- Information density: prefers moderate (0.5/1.0) — not too sparse, not too packed
Each comparison updates these weights using a simple ELO-like system:
- If Jake picks A over B, A's design attributes gain points, B's lose
- Preference strength multiplies the update magnitude
- After 50+ comparisons, the vector stabilizes
Design Quality Predictor
A simple scoring function that estimates Jake's likely score before review:
predicted_score = Σ (dimension_weight[i] × self_assessed_score[i])
Where dimension_weight comes from a regression on historical (predicted_features, actual_score) pairs. Initially use equal weights; after 30+ data points, fit a weighted model.
3.4 Delegation Skills — Reducing Jake's Cognitive Load
Task Classification Model
Over time, classify tasks into:
| Category | Criteria (Learned) | Action |
|---|---|---|
| Auto-approve | Approval rate >95% over 20+ instances, avg time-to-decide <15s | Skip HITL, log for audit |
| Quick review | Approval rate >80%, low complexity | Show minimal modal (approve/reject only) |
| Standard review | Approval rate 50-80% | Full modal with all dimensions |
| Deep review | Approval rate <50% or high complexity | Extended modal, request annotations |
| Escalate | Novel task type or <5 historical data points | Flag as "needs careful review" |
Notification Timing Optimization
Track Jake's engagement patterns:
| Time Window | Avg Response Time | Completion Rate | Quality of Feedback |
|-------------------|-------------------|-----------------|---------------------|
| 9-11 AM | 3 min | 95% | High (detailed) |
| 11 AM-1 PM | 8 min | 88% | Medium |
| 2-5 PM | 5 min | 92% | High |
| 5-8 PM | 25 min | 60% | Low (rushed) |
| 8 PM-12 AM | 45 min | 40% | Low |
Action: Batch non-urgent reviews for high-engagement windows. Send only critical decisions during low-engagement times.
Session Fatigue Detection
If Jake has reviewed 5+ items in a row:
- Decision quality likely drops (less detailed feedback, faster but less thoughtful)
- Auto-insert a break suggestion: "You've reviewed 6 items — want me to batch the rest for later?"
- Weight feedback from fatigued sessions lower in the learning pipeline
3.5 Communication Skills — Better Presentations
Format Effectiveness Tracking
For each modal format variant, track:
format_effectiveness = decision_speed × feedback_quality × completion_rate
| Presentation Style | Avg Time to Decide | Feedback Richness | Effectiveness Score |
|---|---|---|---|
| Code diff + summary | 45s | High | 0.92 |
| Full file + annotations | 120s | Very High | 0.85 |
| Summary only | 15s | Low | 0.60 |
| Screenshot + bullet points | 30s | Medium | 0.78 |
Action: For each pipeline stage and artifact type, select the format with highest effectiveness score.
Detail Level Calibration
Track which summary sections Jake actually reads (via scroll depth and time-on-section):
- If Jake never reads the "Architecture Decisions" section → compress or remove it
- If Jake always expands the "Test Coverage" section → make it prominent
- If Jake clicks "Show More" on error handling details → default to expanded
4. Feedback Processing Pipeline
┌─────────────┐ ┌──────────────┐ ┌─────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Modal │───▶│ Validation │───▶│ Enrichment │───▶│ Storage │───▶│ Analysis │───▶│ Action │
│ Response │ │ │ │ │ │ │ │ │ │ │
└─────────────┘ └──────────────┘ └─────────────┘ └──────────────┘ └──────────────┘ └──────────────┘
│ │ │ │ │
Reject malformed Add timestamps Raw JSONL log Pattern detect Update memory
Fill defaults Link to artifact Update indexes Theme extraction Update prompts
Sanitize text Compute meta Rebuild aggregates Calibration update Trigger self-check
Schema validate Extract themes Archive old data Anomaly detection Adjust thresholds
Stage 1: Validation
function validateFeedback(raw: unknown): FeedbackEvent | ValidationError {
// 1. Schema validation — does it match FeedbackEvent shape?
const parsed = FeedbackEventSchema.safeParse(raw);
if (!parsed.success) return { error: 'schema', details: parsed.error };
// 2. Referential integrity — does the workProductRef exist?
if (!artifactExists(parsed.data.workProduct.id)) {
return { error: 'orphaned', details: 'Work product not found' };
}
// 3. Range validation — scores in bounds, timestamps sane
for (const score of parsed.data.scores ?? []) {
if (score.score < 1 || score.score > 10) {
return { error: 'range', details: `Score ${score.score} out of bounds` };
}
}
// 4. Deduplication — reject duplicate submissions (same modal, same session)
if (isDuplicate(parsed.data)) {
return { error: 'duplicate', details: 'Already processed' };
}
return parsed.data;
}
Stage 2: Enrichment
function enrichFeedback(event: FeedbackEvent): EnrichedFeedbackEvent {
return {
...event,
// Add implicit meta-signals
meta: {
...event.meta,
timeOfDay: categorizeTime(event.timestamp),
dayOfWeek: getDayOfWeek(event.timestamp),
sessionFatigue: countPriorModalsInSession(event.sessionId),
},
// Link to work product details
workProduct: {
...event.workProduct,
_generationPrompt: getPromptUsed(event.workProduct.id),
_pipelineContext: getPipelineState(event.workProduct.id),
},
// NLP enrichment on free text
freeText: event.freeText ? {
...event.freeText,
_extractedThemes: extractThemes(event.freeText),
_sentiment: analyzeSentiment(event.freeText),
_actionableItems: extractActionItems(event.freeText),
} : undefined,
// Calibration data
_predictionDelta: event.workProduct.bubaScorePrediction
? computeAvgScore(event.scores) - event.workProduct.bubaScorePrediction
: undefined,
};
}
Stage 3: Storage
async function storeFeedback(event: EnrichedFeedbackEvent): Promise<void> {
// 1. Append to daily raw log (immutable)
const dayFile = `feedback/raw/${formatDate(event.timestamp)}.jsonl`;
await appendLine(dayFile, JSON.stringify(event));
// 2. Update indexes (mutable, rebuilt from raw if corrupted)
await updateIndex('by-server-type', event.mcpServerType, event);
await updateIndex('by-dimension', event.scores, event);
await updateIndex('by-decision', event.decision?.decision, event);
await updateIndex('by-stage', event.pipelineStage, event);
// 3. Update rolling aggregates
await updateRollingAggregate('rolling-30d', event);
}
Stage 4: Analysis (runs after every N events or on schedule)
async function analyzeFeedback(): Promise<AnalysisResults> {
const recent = await loadFeedback({ days: 30 });
return {
// Pattern detection
approvalPatterns: computeApprovalRates(recent),
dimensionTrends: computeDimensionTrends(recent),
// Theme extraction — cluster free-text feedback into themes
themes: clusterFeedbackThemes(recent.map(e => e.freeText).filter(Boolean)),
// Calibration — compare predictions to actuals
calibration: computeCalibrationCurve(
recent.map(e => ({
predicted: e.workProduct.bubaConfidence,
actual: e.decision?.decision === 'approved' ? 1 : 0,
}))
),
// Anomalies — sudden changes in feedback patterns
anomalies: detectAnomalies(recent),
// New rules — patterns strong enough to become rules
candidateRules: extractCandidateRules(recent),
};
}
Stage 5: Action
async function applyLearnings(analysis: AnalysisResults): Promise<void> {
// 1. Update memory files
await updateMemoryFile('memory/feedback-patterns.md', analysis.themes);
await updateMemoryFile('memory/quality-standards.md', analysis.dimensionTrends);
await updateMemoryFile('memory/jake-preferences.md', analysis.approvalPatterns);
// 2. Update pre-review checklist
await updateChecklist(analysis.candidateRules);
// 3. Recalibrate confidence thresholds
await updateCalibration(analysis.calibration);
// 4. Adjust HITL frequency for auto-approvable task types
await updateHITLThresholds(analysis.approvalPatterns);
// 5. Log what changed
await appendToImprovementLog({
timestamp: new Date().toISOString(),
changes: analysis.candidateRules.filter(r => r.confidence > 0.8),
metrics: { approvalRate: analysis.approvalPatterns.overall },
});
}
5. Memory Integration
5.1 Memory File Architecture
The learning system writes to four memory files, each with a specific purpose:
memory/feedback-patterns.md
# Feedback Patterns (Auto-Updated)
> Last updated: 2025-07-11 | Based on 147 feedback events
## Recurring Themes (Top 10)
1. **Error handling depth** (mentioned 23x) — Jake wants specific error types, not generic catches
2. **Input validation** (18x) — Validate at boundaries, use Zod schemas
3. **Naming clarity** (15x) — Descriptive function names, no abbreviations
4. **Test edge cases** (12x) — Empty arrays, null values, Unicode, rate limits
5. **Documentation completeness** (10x) — Every public function needs JSDoc
...
## Anti-Patterns (Things Jake Consistently Rejects)
- Using `any` type in TypeScript (rejection rate: 100%)
- Console.log for error reporting (95%)
- Missing rate limiting on external APIs (90%)
- Hardcoded configuration values (85%)
- Missing input sanitization (88%)
## Positive Patterns (Things Jake Consistently Praises)
- Comprehensive error types with context (+0.8 avg score bonus)
- Builder pattern for complex configurations (+0.6)
- Graceful degradation with fallbacks (+0.7)
- Clear separation of concerns (+0.5)
memory/quality-standards.md
# Quality Standards (Learned Thresholds)
> Last calibrated: 2025-07-11 | Confidence: High (100+ data points)
## Minimum Acceptable Scores (Hard Gates)
| Dimension | Minimum Score | Based On | Jake's Avg |
|-----------|--------------|----------|------------|
| error-handling | 6 | 45 reviews | 7.2 |
| test-coverage | 5 | 38 reviews | 6.8 |
| security | 7 | 22 reviews | 7.9 |
| code-quality | 6 | 52 reviews | 7.1 |
## Target Scores (What "Great" Looks Like)
| Dimension | Target | Top Quartile |
|-----------|--------|-------------|
| error-handling | 8+ | 9.1 |
| code-quality | 8+ | 8.7 |
| completeness | 9+ | 9.4 |
## Confidence Calibration
Buba's raw confidence of X% maps to actual approval rate:
- 90%+ raw → 78% actual (apply -12% correction)
- 70-89% raw → 68% actual (apply -8% correction)
- 50-69% raw → 35% actual (apply -25% correction — major overconfidence zone)
memory/jake-preferences.md
# Jake's Preferences (Personal Taste Model)
> Last updated: 2025-07-11
## Decision Patterns
- Approves quickly: boilerplate MCP servers with standard patterns (<20s avg)
- Deliberates on: novel integrations, security-sensitive tools (>5min avg)
- Defers: design decisions when tired (evening reviews)
- Auto-rejects: anything with TypeScript `any`, missing error handling
## Communication Preferences
- Prefers: code diffs over full files for review
- Prefers: bullet points over paragraphs for summaries
- Reads: error handling sections carefully (high scroll depth)
- Skips: architecture decision records (low scroll depth)
- Best review times: 9-11 AM, 2-4 PM
- Worst review times: after 8 PM (fast but low-quality feedback)
## Design Taste
- Prefers: clean, minimal UI with generous whitespace
- Prefers: card-based layouts over tables for dashboards
- Dislikes: excessive animations, gradient-heavy designs
- Font preference: Inter/system fonts > decorative fonts
- Color preference: muted professional palettes > vibrant/playful
## Code Style
- Prefers: functional patterns over class-based
- Prefers: explicit over implicit (no magic)
- Prefers: small focused functions over large multi-purpose ones
- Cares deeply about: naming, error messages, edge cases
- Cares less about: premature optimization, DRY absolutism
memory/improvement-log.md
# Improvement Log — What Changed and Why
> Tracks every behavioral change derived from feedback
## 2025-07-11
- **Rule added:** "Always validate API responses with Zod schemas"
- Source: 5 rejections for missing validation in past 2 weeks
- Confidence: 92%
- **Threshold adjusted:** error-handling minimum raised 5→6
- Source: Jake rejected 3 items scoring 5 on error-handling
- **Calibration updated:** Buba overconfident by 12% in 90%+ range
- Action: applying -12% correction to raw confidence
## 2025-07-04
- **Rule added:** "Include retry logic with exponential backoff on all HTTP calls"
- Source: mentioned in 8 separate free-text feedback entries
- **Format changed:** Switched MCP server reviews to diff-based presentation
- Source: decision time dropped 40% in A/B test
5.2 Memory Compression Strategy
Feedback data grows. Memory files must stay lean for context windows.
Compression Schedule:
| Age | Storage | Memory Representation |
|---|---|---|
| 0-7 days | Full raw events in JSONL | Referenced directly if relevant |
| 8-30 days | Raw events + aggregates | Summarized in weekly aggregate |
| 31-90 days | Raw archived, aggregates active | Compressed to patterns only |
| 90+ days | Raw cold-stored | Only durable rules and anti-patterns preserved |
Compression Algorithm:
- Scan feedback older than 30 days
- For each theme/pattern, check if it's already captured in
feedback-patterns.md - If yes → delete from active memory, keep in cold archive
- If no → extract the novel pattern, add to memory, then archive
- Rules with <5 occurrences that are >90 days old get pruned (they were noise)
5.3 Contextual Memory Retrieval
When Buba starts a new task, the memory system surfaces relevant past feedback:
function getRelevantFeedback(task: Task): RelevantMemory {
return {
// Always loaded (small, high-value)
preferences: loadMemory('jake-preferences.md'),
standards: loadMemory('quality-standards.md'),
// Task-type specific
patterns: loadFilteredPatterns({
serverType: task.mcpServerType,
pipelineStage: task.stage,
dimensions: task.relevantDimensions,
}),
// Recent relevant feedback (last 5 similar tasks)
recentSimilar: loadRecentFeedback({
type: task.type,
limit: 5,
minSimilarity: 0.7,
}),
// Specific anti-patterns for this task type
antiPatterns: loadAntiPatterns(task.mcpServerType),
};
}
6. Self-Improvement Loops
6.1 Pre-Review Self-Check
Before any work product is sent to Jake, Buba runs it through a learned checklist:
┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐ ┌──────────────┐
│ Generated Work │───▶│ Self-Check Gate │───▶│ Pass? Submit to │───▶│ Jake Reviews │
│ Product │ │ (learned rules) │ │ Jake's Queue │ │ │
└─────────────────┘ └──────────────────┘ └─────────────────┘ └──────────────┘
│ Fail │
▼ │
┌──────────────┐ │
│ Auto-Fix or │ │
│ Re-generate │◀─────────────────────────────────────┘
└──────────────┘ (feedback updates self-check rules)
Self-Check Implementation:
interface SelfCheckResult {
passed: boolean;
score: number; // Predicted Jake score
confidence: number; // Calibrated confidence
violations: SelfCheckViolation[];
recommendations: string[];
}
async function selfCheck(artifact: WorkProduct): Promise<SelfCheckResult> {
const rules = await loadLearnedRules();
const standards = await loadQualityStandards();
const violations: SelfCheckViolation[] = [];
// Check each learned rule
for (const rule of rules) {
const result = await evaluateRule(rule, artifact);
if (!result.passed) {
violations.push({
rule: rule.id,
description: rule.description,
severity: rule.severity,
autoFixable: rule.autoFixAvailable,
suggestion: rule.fixSuggestion,
});
}
}
// Predict Jake's score
const predictedScore = await predictScore(artifact, standards);
const calibratedConfidence = applyCalibration(computeRawConfidence(violations));
return {
passed: violations.filter(v => v.severity >= 'major').length === 0,
score: predictedScore,
confidence: calibratedConfidence,
violations,
recommendations: violations
.filter(v => v.autoFixable)
.map(v => v.suggestion),
};
}
6.2 Prediction Scoring Loop
Every time Buba submits work, it records a prediction. After Jake reviews:
Buba predicts: "Jake will give this 7.5/10 with 80% approval confidence"
Jake scores: 8/10, approved
Delta: +0.5 (Buba was slightly pessimistic — recalibrate upward for this type)
Over time, this creates a calibration dataset:
interface CalibrationEntry {
taskType: string;
predicted: { score: number; confidence: number };
actual: { score: number; approved: boolean };
delta: number;
timestamp: string;
}
// Calibration update (exponential moving average)
function updateCalibration(entries: CalibrationEntry[]): CalibrationCurve {
const buckets = groupBy(entries, e => Math.floor(e.predicted.confidence * 10) / 10);
return Object.fromEntries(
Object.entries(buckets).map(([bucket, items]) => [
bucket,
{
avgPredicted: mean(items.map(i => i.predicted.confidence)),
avgActual: mean(items.map(i => i.actual.approved ? 1 : 0)),
n: items.length,
correction: mean(items.map(i => i.delta)),
}
])
);
}
6.3 Diminishing Review System
As Buba proves competence in specific task types, HITL frequency decreases:
Level 0 (New task type): Review 100% of outputs
Level 1 (10+ approvals): Review 80%, spot-check 20%
Level 2 (25+ approvals, >85% rate): Review 50%, auto-approve simple ones
Level 3 (50+ approvals, >90% rate): Review 20%, auto-approve standard patterns
Level 4 (100+ approvals, >95% rate): Review 5%, audit-only mode
Regression Detection:
If at any level, the next 5 reviews show approval rate dropping below the level's threshold, immediately regress one level and log the regression in improvement-log.md.
function checkForRegression(taskType: string): RegressionStatus {
const recentReviews = loadRecentReviews(taskType, 5);
const approvalRate = recentReviews.filter(r => r.approved).length / recentReviews.length;
const currentLevel = getAutonomyLevel(taskType);
const threshold = LEVEL_THRESHOLDS[currentLevel];
if (approvalRate < threshold) {
return {
regressed: true,
from: currentLevel,
to: Math.max(0, currentLevel - 1),
reason: `Approval rate ${approvalRate} < threshold ${threshold}`,
};
}
return { regressed: false };
}
6.4 A/B Learning
When Buba is uncertain between two approaches:
- Generate both versions
- Present as a comparison modal to Jake
- Record which Jake prefers and why
- Update preference model
Trigger conditions for A/B:
- Buba's confidence < 65% on the "right" approach
- Two valid design patterns exist (e.g., REST vs GraphQL adapter)
- Jake has shown mixed preferences for this domain in the past
7. Dashboard Metrics
7.1 Key Performance Indicators
┌────────────────────────────────────────────────────────────┐
│ BUBA LEARNING DASHBOARD │
├──────────────────────┬─────────────────────────────────────┤
│ Overall Health │ ████████░░ 82% (↑ from 74% last mo) │
├──────────────────────┼─────────────────────────────────────┤
│ Approval Rate │ 78% → 85% → 89% (3-month trend ↑) │
│ Avg Quality Score │ 6.2 → 7.1 → 7.6 (3-month trend ↑) │
│ Time-to-Decision │ 120s → 85s → 62s (3-month trend ↓) │
│ Prediction Accuracy │ ±2.1 → ±1.4 → ±0.8 (calibrating ↑) │
│ Auto-Approve Rate │ 0% → 12% → 28% (HITL reduction ↑) │
│ Regression Events │ 3 → 1 → 0 (stability improving ↑) │
├──────────────────────┼─────────────────────────────────────┤
│ Top Improvement Area │ Error handling (score +2.1 since rule)│
│ Top Problem Area │ Security patterns (still at 6.2 avg) │
│ Feedback Backlog │ 3 unprocessed events │
│ Active Rules │ 23 learned rules (18 high-confidence) │
└──────────────────────┴─────────────────────────────────────┘
7.2 Metric Definitions
| Metric | Formula | Target Trend |
|---|---|---|
| Approval Rate | approved / total_decisions per rolling 30 days |
↑ (asymptote at ~95%) |
| Avg Quality Score | mean(all dimension scores) per rolling 30 days |
↑ (target: 8.0+) |
| Time-to-Decision | mean(timeToDecisionMs) per rolling 30 days |
↓ (faster = better context) |
| Prediction Accuracy | mean(abs(predicted_score - actual_score)) |
↓ (target: ±0.5) |
| Auto-Approve Rate | auto_approved / total_tasks |
↑ (Jake reviews less) |
| Feedback Volume | total_feedback_events per week |
↓ (less HITL needed) |
| Rule Effectiveness | quality_delta after rule introduced |
↑ (rules actually help) |
| Regression Rate | regression_events / total_level_checks |
↓ (target: 0) |
7.3 Alerting Thresholds
| Condition | Alert Level | Action |
|---|---|---|
| Approval rate drops >10% week-over-week | 🔴 Critical | Pause auto-approvals, full HITL |
| Prediction accuracy >±2.0 | 🟡 Warning | Recalibrate immediately |
| New rejection reason appearing 3+ times | 🟡 Warning | Extract rule, add to checklist |
| Time-to-decision increasing | 🟠 Notice | Review presentation format |
| No feedback events for 48+ hours | 🟠 Notice | Check pipeline health |
8. Feedback-to-Prompt Engineering
8.1 Rule Extraction Pipeline
Feedback → NLP Theme Extraction → Frequency Analysis → Confidence Check → Rule Formulation → Prompt Injection
interface LearnedRule {
id: string;
source: 'feedback-pattern' | 'rejection-analysis' | 'comparison' | 'annotation';
rule: string; // Human-readable rule
prompt_instruction: string; // How to inject into system prompt
confidence: number; // 0-1 based on data strength
occurrences: number; // How many feedback events support this
first_seen: string;
last_seen: string;
effectiveness?: number; // Post-deployment quality delta
domain: QualityDimension;
applies_to: string[]; // Server types or 'all'
}
8.2 Dynamic .goosehints Generation
The system auto-generates a .goosehints addendum from learned rules:
# .goosehints (auto-generated section — DO NOT EDIT BELOW THIS LINE)
# Generated: 2025-07-11 | Based on 147 feedback events | 23 active rules
## Code Standards (from Jake's feedback)
- ALWAYS use Zod schemas for runtime validation of API responses
- ALWAYS include retry logic with exponential backoff (base 1s, max 30s, 3 attempts)
- ALWAYS use typed error classes, never throw raw strings
- NEVER use TypeScript `any` — use `unknown` with type guards instead
- PREFER functional composition over class hierarchies
- PREFER explicit error handling over try/catch-all blocks
- Name functions as verb-noun: `fetchUser`, `validateInput`, `buildResponse`
- Every public function needs a JSDoc comment with @param, @returns, @throws
## Design Standards (from Jake's feedback)
- Use 16-24px padding in card components
- Prefer Inter or system font stack
- Use muted color palettes: slate, zinc, gray families
- Card-based layouts for dashboard content
- Minimal animations: 150ms transitions only for state changes
## Quality Gates (auto-calibrated)
- Error handling score must be ≥6 before submission
- Test coverage score must be ≥5 before submission
- Security score must be ≥7 for auth-related MCP servers
- Overall predicted score must be ≥6.5 or self-review is triggered
## Review Presentation (optimized for Jake)
- Lead with a code diff, not full files
- Bullet-point summaries, max 5 bullets
- Highlight error handling approach prominently
- Include test coverage percentage in summary header
8.3 Per-Domain Instruction Sets
Beyond global rules, maintain domain-specific prompt sections:
prompts/
├── global-rules.md # Always included
├── domains/
│ ├── mcp-server.md # Rules specific to MCP server generation
│ ├── ui-design.md # Rules for UI/design tasks
│ ├── test-suite.md # Rules for test generation
│ └── documentation.md # Rules for docs
└── server-types/
├── shopify.md # Shopify-specific learned patterns
├── stripe.md # Stripe-specific patterns
└── ghl.md # GHL-specific patterns
Each file is rebuilt when new rules cross the confidence threshold (>0.8, supported by 5+ data points).
8.4 Quality Rubric Generation
From accumulated scores, generate rubrics that Buba uses for self-assessment:
## Code Quality Rubric (generated from 52 scored reviews)
### Score 9-10 (Exceptional) — Jake gave this 8% of the time
- Typed errors with full context chain
- 95%+ test coverage including edge cases
- Zero TypeScript errors in strict mode
- Comprehensive JSDoc on all exports
- Performance considerations documented
### Score 7-8 (Good) — Jake gave this 45% of the time
- Typed errors on main paths
- 80%+ test coverage
- Minimal TypeScript warnings
- JSDoc on public API
- Basic error handling on all external calls
### Score 5-6 (Acceptable) — Jake gave this 30% of the time
- Try/catch on external calls
- 60%+ test coverage
- Some TypeScript warnings
- Partial documentation
### Score 1-4 (Rejection Zone) — Jake gave this 17% of the time
- Generic error handling or none
- <60% test coverage
- TypeScript `any` usage
- Missing documentation
- No input validation
9. Privacy & Ethics
9.1 Data Retention Policy
| Data Type | Retention | Rationale |
|---|---|---|
| Raw feedback events | 90 days hot, indefinite cold | Source of truth for auditing |
| Free-text comments | 90 days verbatim, then extracted themes only | Privacy — reduce stored opinions over time |
| Decision metadata | Indefinite (aggregated) | Low-sensitivity, high-value for patterns |
| Annotation data | 90 days linked to code, then de-linked | Code context changes; old annotations mislead |
| Timing/behavioral meta | 30 days, then averaged into aggregates | Sensitive behavioral data, compress aggressively |
9.2 Handling Contradictory Feedback
Jake may give contradictory feedback (approves pattern X on Monday, rejects it Friday).
Resolution Strategy:
- Recency weighting: Recent feedback counts 2× older feedback
- Context matching: Check if the contradiction is actually context-dependent (e.g., pattern X is fine for internal tools but not for public APIs)
- Confidence-weighted: Feedback where Jake marked "very sure" outweighs "somewhat unsure"
- Volume threshold: Don't form a rule until the same signal appears 3+ times
- Clarification trigger: If two equally-strong signals conflict, flag for clarification in the next modal: "You've previously both approved and rejected [pattern]. Which do you prefer for [context]?"
9.3 Overfitting Prevention
Problem: Buba might optimize for Jake's momentary mood rather than durable preferences.
Safeguards:
- Minimum sample size: No rule is formed from fewer than 5 data points
- Time spread: Rules require data points spread across at least 3 different days
- Fatigue filtering: Feedback given after 5+ consecutive reviews in one session gets a 0.7× weight
- Trend smoothing: Use 30-day rolling averages, not daily snapshots, for threshold calibration
- Preference decay: Old preferences decay at 5% per month unless reinforced by new data
- Explicit override: Jake can always mark feedback as "strong preference" (2× weight) or "weak/unsure" (0.5× weight)
- Regular audit prompt: Monthly, present Jake with current learned rules for validation: "Here are the rules I've learned. Still accurate?"
9.4 Bias Awareness
- Track if Buba becomes overly conservative (rejection avoidance → boring/safe outputs)
- Monitor creativity scores alongside approval rates — if creativity drops as approval rises, inject more experimental variation
- Preserve a "exploration budget" — 10% of outputs can deliberately try new patterns outside the safe zone, clearly marked as "experimental"
10. Implementation Roadmap
Phase 1: Foundation (Week 1-2)
- Implement
FeedbackEventschema and validation - Build JSONL storage layer with daily rotation
- Create basic index builders (by-server-type, by-dimension, by-decision)
- Wire modal responses to storage pipeline
- Initialize memory files (
feedback-patterns.md,quality-standards.md,jake-preferences.md,improvement-log.md)
Phase 2: Analysis (Week 3-4)
- Build theme extraction from free-text feedback (keyword clustering)
- Implement approval rate computation per server type and stage
- Build confidence calibration table
- Create the pre-review self-check framework (rule-based)
- Implement prediction scoring (predict → store → compare loop)
Phase 3: Learning Loops (Week 5-6)
- Auto-generate
.goosehintsaddendum from learned rules - Build the diminishing review system (autonomy levels 0-4)
- Implement regression detection and auto-escalation
- Create the A/B comparison flow for uncertain decisions
- Build per-domain instruction generators
Phase 4: Optimization (Week 7-8)
- Notification timing optimization
- Session fatigue detection
- Format effectiveness tracking
- Memory compression and archival
- Dashboard metrics computation and display
Phase 5: Maturity (Ongoing)
- Monthly rule audit system
- Exploration budget for creative variation
- Cross-domain pattern transfer (learnings from code quality improving design reviews)
- Community patterns (if multi-human review is ever added)
Appendix A: Example End-to-End Flow
1. Buba generates a Shopify MCP server
2. Pre-review self-check runs:
- ✅ Retry logic present
- ✅ Zod validation on API responses
- ❌ Missing rate limiting (learned rule #7)
- Buba auto-fixes: adds rate limiting
- Re-check: all rules pass
- Predicted score: 7.8 (calibrated confidence: 74%)
3. Submitted to Jake's review queue
- Presentation format: code diff + 4-bullet summary (learned optimal format)
- Timing: queued for 9-11 AM window (Jake's peak engagement)
4. Jake reviews in modal:
- Decision: Approved
- Scores: code-quality: 8, error-handling: 8, test-coverage: 7, completeness: 9
- Free text: "Nice use of builder pattern for config. Wish the webhook handler had more specific error types."
- Time to decide: 45 seconds
5. Feedback processed:
- Validation: ✅
- Enrichment: themes ["builder-pattern-positive", "error-specificity-gap"], sentiment: 0.7
- Storage: appended to 2025-07-11.jsonl, indexes updated
- Analysis: "error specificity" theme now at 6 occurrences → approaching rule threshold
- Calibration: predicted 7.8, actual 8.0 → delta +0.2, slight pessimism
6. Learning applied:
- "builder pattern" noted as positive for Shopify servers
- "error specificity in webhooks" at 6 mentions → candidate rule queued (needs 2 more for confidence)
- Calibration nudged +0.1 for Shopify server type
- Shopify MCP autonomy: 15 approvals, 88% rate → Level 1 maintained
Appendix B: Storage Size Estimates
| Component | Per Event | Daily (10 events) | Monthly | Yearly |
|---|---|---|---|---|
| Raw JSONL | ~2 KB | ~20 KB | ~600 KB | ~7.2 MB |
| Indexes | N/A | ~50 KB (rebuilt) | ~50 KB | ~50 KB |
| Aggregates | N/A | ~10 KB | ~30 KB | ~120 KB |
| Memory files | N/A | ~5 KB (updated) | ~8 KB | ~15 KB |
| Total | ~85 KB/day | ~688 KB/mo | ~7.4 MB/yr |
Extremely lightweight. Even at 50 reviews/day, yearly storage stays under 40 MB. No database needed — flat files with JSONL are sufficient for this scale.
This document is a living architecture. As the system collects real feedback, specific thresholds, weights, and rules will be calibrated from actual data rather than these initial estimates.