# AI Learning & Feedback Loop System — How Buba Gets Smarter From Human Input

> **Document Type:** Architecture Specification
> **System:** GooseFactory — Continuous Learning Subsystem
> **Version:** 1.0
> **Date:** 2025-07-11
> **Author:** Buba (Architecture Subagent)

---

## Table of Contents

1. [Executive Summary](#1-executive-summary)
2. [Feedback Data Schema](#2-feedback-data-schema)
3. [Learning Domains](#3-learning-domains)
4. [Feedback Processing Pipeline](#4-feedback-processing-pipeline)
5. [Memory Integration](#5-memory-integration)
6. [Self-Improvement Loops](#6-self-improvement-loops)
7. [Dashboard Metrics](#7-dashboard-metrics)
8. [Feedback-to-Prompt Engineering](#8-feedback-to-prompt-engineering)
9. [Privacy & Ethics](#9-privacy--ethics)
10. [Implementation Roadmap](#10-implementation-roadmap)

---

## 1. Executive Summary

The GooseFactory Learning System transforms Jake's human feedback from a passive quality gate into an **active training signal** that compounds Buba's capabilities over time. Every approval, rejection, score, comment, and annotation becomes a data point in a continuous improvement engine.

### Core Principle: Feedback Is Data, Patterns Are Knowledge, Knowledge Becomes Behavior

The system operates on three timescales:

| Timescale | What Happens | Example |
|-----------|-------------|---------|
| **Immediate** (seconds) | Raw feedback stored, linked to work product | Jake scores code quality 4/10 → stored |
| **Session** (hours) | Patterns extracted, memory files updated | "Jake rejected 3 MCP servers today for missing error handling" |
| **Epoch** (weekly) | Behavioral rules crystallized, prompts updated, thresholds recalibrated | `.goosehints` updated: "Always include retry logic with exponential backoff in MCP tool handlers" |

The system is **not** a traditional ML training loop — Buba is an LLM agent, not a fine-tuned model. Instead, learning happens through **structured memory**, **dynamic prompt injection**, **confidence calibration**, and **behavioral rule extraction**. This is "learning" in the pragmatic sense: the agent's outputs measurably improve because its context window carries increasingly refined instructions derived from real human judgment.

---

## 2. Feedback Data Schema

### 2.1 Core Entities

#### `FeedbackEvent` — The Atomic Unit

Every interaction with a modal produces exactly one `FeedbackEvent`:

```typescript
interface FeedbackEvent {
  // Identity
  id: string;                      // UUID v7 (time-sortable)
  timestamp: string;               // ISO 8601
  sessionId: string;               // Links to Buba's work session
  modalId: string;                 // Which modal template was shown
  modalVersion: string;            // Modal schema version (for migration)

  // What was reviewed
  workProduct: WorkProductRef;     // Reference to the artifact under review
  pipelineStage: PipelineStage;    // Which factory stage produced this
  mcpServerType?: string;          // e.g., "shopify", "stripe", "ghl"

  // The feedback itself
  decision?: DecisionFeedback;
  scores?: DimensionScore[];
  freeText?: FreeTextFeedback;
  comparison?: ComparisonFeedback;
  annotations?: Annotation[];
  confidence?: ConfidenceFeedback;

  // Meta-signals (implicit feedback)
  meta: FeedbackMeta;
}
```

#### `WorkProductRef` — What Was Reviewed

```typescript
interface WorkProductRef {
  type: 'mcp-server' | 'code-module' | 'design' | 'test-suite' | 'documentation' | 'pipeline-config';
  id: string;                      // Artifact ID in the factory
  version: string;                 // Git SHA or version tag
  path?: string;                   // File path if applicable
  summary: string;                 // What Buba presented to Jake
  bubaConfidence: number;          // 0-1: Buba's pre-review confidence
  bubaScorePrediction?: number;    // 1-10: What Buba predicted Jake would score
  generationContext: {
    promptHash: string;            // Hash of system prompt used
    memorySnapshot: string[];      // Which memory files were active
    modelUsed: string;             // sonnet/opus
    generationTimeMs: number;      // How long Buba spent generating
    iterationCount: number;        // How many self-review rounds before submission
  };
}
```

#### `DecisionFeedback` — Approve / Reject / Needs Work

```typescript
interface DecisionFeedback {
  decision: 'approved' | 'rejected' | 'needs-work' | 'deferred';
  reason?: string;                 // Optional: why this decision
  severity?: 'minor' | 'moderate' | 'major' | 'critical';   // For rejections
  blockers?: string[];             // Specific issues that blocked approval
  suggestions?: string[];          // What would make it approvable
}
```

#### `DimensionScore` — Quality Ratings Per Axis

```typescript
interface DimensionScore {
  dimension: QualityDimension;
  score: number;                   // 1-10
  weight?: number;                 // How much Jake cares about this (learned over time)
  comment?: string;                // Per-dimension note
}

type QualityDimension =
  | 'code-quality'                 // Clean, idiomatic, maintainable
  | 'design-aesthetics'            // Visual appeal, layout, color
  | 'design-ux'                    // Usability, information hierarchy
  | 'test-coverage'                // Edge cases, thoroughness
  | 'documentation'                // Clarity, completeness
  | 'architecture'                 // Structure, separation of concerns
  | 'error-handling'               // Robustness, graceful degradation
  | 'performance'                  // Speed, efficiency
  | 'security'                     // Auth, input validation, secrets handling
  | 'creativity'                   // Novel approaches, clever solutions
  | 'completeness'                 // Did it cover everything asked for?
  | 'naming'                       // Variable names, function names, file organization
  | 'dx';                          // Developer experience — easy to work with?
```

#### `FreeTextFeedback` — Written Praise & Criticism

```typescript
interface FreeTextFeedback {
  liked?: string;                  // What Jake liked
  disliked?: string;               // What Jake didn't like
  generalNotes?: string;           // Additional context

  // Processed fields (filled by enrichment pipeline)
  _extractedThemes?: string[];     // NLP-extracted themes
  _sentiment?: number;             // -1 to 1
  _actionableItems?: string[];     // Specific things to change
}
```

#### `ComparisonFeedback` — A vs B Preferences

```typescript
interface ComparisonFeedback {
  optionA: WorkProductRef;
  optionB: WorkProductRef;
  preferred: 'A' | 'B' | 'neither' | 'both-good';
  reason?: string;
  preferenceStrength: 'slight' | 'moderate' | 'strong' | 'overwhelming';
  dimensions?: {                   // Per-dimension winner
    dimension: QualityDimension;
    winner: 'A' | 'B' | 'tie';
  }[];
}
```

#### `Annotation` — Specific Code/Design Region Feedback

```typescript
interface Annotation {
  type: 'praise' | 'criticism' | 'question' | 'suggestion';
  target: {
    file?: string;
    lineStart?: number;
    lineEnd?: number;
    selector?: string;             // CSS selector for design annotations
    region?: string;               // Named region for non-code artifacts
  };
  text: string;
  severity?: 'nit' | 'minor' | 'major' | 'critical';
  category?: string;               // e.g., "naming", "error-handling", "style"
}
```

#### `ConfidenceFeedback` — Jake's Self-Assessed Certainty

```typescript
interface ConfidenceFeedback {
  level: 'very-unsure' | 'somewhat-unsure' | 'fairly-sure' | 'very-sure';
  wouldDelegateToAI: boolean;      // "Could Buba have made this call?"
  needsExpertReview: boolean;      // "Should someone else also look?"
}
```

#### `FeedbackMeta` — Implicit Signals

```typescript
interface FeedbackMeta {
  timeToDecisionMs: number;            // How long the modal was open
  timeToFirstInteractionMs: number;    // How long before Jake started clicking
  fieldsModified: string[];            // Which optional fields Jake bothered to fill
  scrollDepth?: number;                // How much of the artifact Jake actually viewed
  revisits: number;                    // How many times Jake came back to this modal
  deviceType: 'desktop' | 'mobile';    // Context of where Jake reviewed
  timeOfDay: string;                   // Morning/afternoon/evening/night
  dayOfWeek: string;                   // Weekday vs weekend patterns
  concurrentModals: number;            // How many other modals were pending
  sessionFatigue: number;              // Nth modal in this review session
}
```

### 2.2 Storage Strategy

```
feedback/
├── raw/                       # Immutable event log (append-only)
│   ├── 2025-07-11.jsonl       # One JSONL file per day
│   ├── 2025-07-12.jsonl
│   └── ...
├── indexes/                   # Derived indexes for fast lookup
│   ├── by-server-type.json    # FeedbackEvent[] grouped by MCP server type
│   ├── by-dimension.json      # Scores grouped by quality dimension
│   ├── by-decision.json       # Approval/rejection patterns
│   └── by-stage.json          # Feedback by pipeline stage
├── aggregates/                # Computed summaries
│   ├── weekly/
│   │   └── 2025-W28.json      # Weekly aggregate stats
│   ├── monthly/
│   │   └── 2025-07.json
│   └── rolling-30d.json       # Always-current 30-day window
└── analysis/                  # Pattern extraction outputs
    ├── themes.json            # Recurring feedback themes
    ├── preferences.json       # Learned preference weights
    └── calibration.json       # Confidence calibration data
```

**Why JSONL for raw storage:**
- Append-only = safe under crashes
- One line per event = easy grep/jq queries
- Daily files = natural partitioning, easy archival
- Human-readable = Jake can audit what's stored

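
Because each event is a single JSON object per line, ad-hoc audits need nothing beyond the shell. A hypothetical query (the file name and field values are illustrative samples matching the schema above, not real data):

```shell
mkdir -p feedback/raw
# One event per line, appended by the storage stage (sample line for illustration)
echo '{"mcpServerType":"stripe","decision":{"decision":"rejected"}}' >> feedback/raw/2025-07-11.jsonl

# Ad-hoc audit: how many stripe events were rejected that day?
grep '"mcpServerType":"stripe"' feedback/raw/2025-07-11.jsonl | grep -c '"rejected"'
```

The same files work equally well with `jq` for structured queries once filters get more complex.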
---

## 3. Learning Domains

### 3.1 Decision Quality — The Jake Preference Model

**Goal:** Build a statistical model of what Jake approves vs. rejects, so Buba can predict outcomes and pre-filter.

#### Data Collection

For every decision event, we track:

```
(server_type, pipeline_stage, dimension_scores[], decision, time_to_decide)
```

#### Pattern Extraction

**Approval Rate by Server Type:**

| Server Type | Approved | Rejected | Needs Work | Approval Rate |
|-------------|----------|----------|------------|---------------|
| shopify     | 12       | 2        | 3          | 70.6%         |
| stripe      | 8        | 5        | 4          | 47.1%         |
| ghl         | 15       | 1        | 1          | 88.2%         |

**Minimum Score Thresholds (learned):**
After 20+ feedback events per dimension, compute the threshold below which Jake almost always rejects:

```
error-handling < 5  →  95% rejection rate  →  HARD GATE
test-coverage  < 4  →  90% rejection rate  →  HARD GATE
code-quality   < 6  →  80% rejection rate  →  SOFT GATE (warn Buba)
naming         < 3  →  85% rejection rate  →  SOFT GATE
```

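
A sketch of how such gates could be derived, assuming a flat review record per (dimension, score, decision) triple; the field names are assumptions, while the 20-event minimum and the rate cutoffs mirror the numbers above:

```typescript
interface ScoredReview {
  dimension: string;
  score: number;        // 1-10
  rejected: boolean;
}

type Gate = 'HARD' | 'SOFT' | 'NONE';

// Classify a candidate cutoff: HARD if sub-threshold reviews are almost
// always rejected, SOFT if usually rejected. Cutoff rates are illustrative.
function classifyGate(reviews: ScoredReview[], dimension: string, cutoff: number): Gate {
  const below = reviews.filter(r => r.dimension === dimension && r.score < cutoff);
  if (below.length < 20) return 'NONE';   // not enough evidence yet
  const rejectionRate = below.filter(r => r.rejected).length / below.length;
  if (rejectionRate >= 0.9) return 'HARD';
  if (rejectionRate >= 0.75) return 'SOFT';
  return 'NONE';
}
```

Running this per (dimension, cutoff) pair over the rolling 30-day window yields the gate table automatically as new feedback arrives.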
**Confidence Calibration Table:**

Track Buba's predicted confidence vs. actual approval:

| Buba's Confidence | Actual Approval Rate | Calibration Error  | n   |
|-------------------|----------------------|--------------------|-----|
| 90-100%           | 82%                  | +12% overconfident | 45  |
| 70-89%            | 71%                  | ~calibrated        | 38  |
| 50-69%            | 35%                  | +22% overconfident | 22  |
| <50%              | 15%                  | ~calibrated        | 12  |

**Action:** Buba applies a calibration curve to raw confidence. If calibrated confidence < 60%, auto-trigger a second self-review pass before sending to Jake.

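
One way to apply such a curve is a piecewise correction per confidence band. A minimal sketch — the band edges and offsets below echo the table's example numbers, not measured data:

```typescript
// Piecewise confidence correction learned from (predicted, actual) history.
// Bands and offsets are illustrative values taken from the table above.
const corrections: { min: number; offset: number }[] = [
  { min: 0.9, offset: -0.12 },
  { min: 0.7, offset: 0.0 },
  { min: 0.5, offset: -0.22 },
  { min: 0.0, offset: 0.0 },
];

function calibrate(rawConfidence: number): number {
  // Bands are sorted descending, so the first match is the tightest band
  const band = corrections.find(c => rawConfidence >= c.min)!;
  return Math.min(1, Math.max(0, rawConfidence + band.offset));
}

// If calibrated confidence drops below 0.6, trigger another self-review pass
function needsSecondPass(rawConfidence: number): boolean {
  return calibrate(rawConfidence) < 0.6;
}
```

The offsets get refreshed whenever the calibration table is recomputed, so the correction tracks Buba's current bias rather than a fixed constant.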
#### Decision Speed as Quality Signal

```
Fast decisions (< 30 sec) with approval  → Buba nailed it, this pattern is solid
Fast decisions (< 30 sec) with rejection → Obvious flaw Buba should have caught
Slow decisions (> 5 min) with approval   → Edge case, Jake was thinking carefully
Slow decisions (> 5 min) with rejection  → Complex problem, needs better tooling
```

Fast rejections are the **highest-value learning signals** — they indicate patterns Buba should learn to self-check.

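
The quadrants above can be encoded directly; a sketch with the table's cutoffs (the signal labels are illustrative names, not part of the schema):

```typescript
type DecisionSignal =
  | 'pattern-solid'         // fast approve: Buba nailed it
  | 'missed-obvious-flaw'   // fast reject: highest-value learning signal
  | 'edge-case'             // slow approve: Jake thought carefully
  | 'needs-better-tooling'  // slow reject: complex problem
  | 'inconclusive';         // mid-range timing carries little signal

function classifyDecisionSpeed(timeToDecisionMs: number, approved: boolean): DecisionSignal {
  if (timeToDecisionMs < 30_000) return approved ? 'pattern-solid' : 'missed-obvious-flaw';
  if (timeToDecisionMs > 300_000) return approved ? 'edge-case' : 'needs-better-tooling';
  return 'inconclusive';
}
```

`timeToDecisionMs` comes straight from `FeedbackMeta`, so this runs in the enrichment stage with no extra data collection.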
### 3.2 Code Quality — Learning Jake's Standards

#### Style Guide Extraction

After every code-related feedback event with annotations or comments:

1. **Extract rules** from criticism patterns:
   - "Missing error handling" × 8 times → Rule: "Every external API call must have try/catch with typed errors"
   - "Variable names too short" × 5 → Rule: "Prefer descriptive names; no single-letter variables outside loops"
   - "No input validation" × 6 → Rule: "Validate all user inputs at function boundaries"

2. **Weight rules** by frequency and severity:
   ```
   Rule Weight = (occurrence_count × avg_severity_score) / total_reviews
   ```

3. **Compile into `.goosehints`** as concrete instructions (see Section 8).

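
The weighting formula in step 2, as code (field names are assumptions; a numeric severity scale such as nit=1 through critical=4 is implied but not specified by the source):

```typescript
interface CriticismPattern {
  occurrenceCount: number;
  avgSeverityScore: number;   // e.g. nit=1 … critical=4 (assumed scale)
}

// Rule Weight = (occurrence_count × avg_severity_score) / total_reviews
function ruleWeight(p: CriticismPattern, totalReviews: number): number {
  if (totalReviews === 0) return 0;   // no reviews yet, nothing to weight
  return (p.occurrenceCount * p.avgSeverityScore) / totalReviews;
}
```

Rules are then sorted by weight so the compiled `.goosehints` leads with the complaints Jake raises most often and most severely.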
#### Common Rejection Reasons Registry

```typescript
interface RejectionPattern {
  pattern: string;          // "Missing retry logic on API calls"
  occurrences: number;
  lastSeen: string;
  avgSeverity: number;
  serverTypes: string[];    // Which types trigger this
  autoFixable: boolean;     // Can Buba fix this programmatically?
  checklistItem: string;    // Becomes a pre-submit checklist entry
}
```

**Top 5 Rejection Reasons** (updated weekly, surfaced in Buba's pre-review prompt):

```markdown
## Pre-Review Checklist (from learned feedback)
1. ☐ All external API calls have retry logic with exponential backoff
2. ☐ Error messages include actionable context (not just "Error occurred")
3. ☐ Input validation on all public function parameters
4. ☐ Rate limiting implemented on all HTTP-facing tools
5. ☐ TypeScript strict mode, no `any` types
```

### 3.3 Design Skills — Aesthetic Preference Learning

#### Preference Vector

From side-by-side comparisons, build a multi-dimensional preference vector:

```
Jake's Design Preferences (learned from 30+ comparisons):
- Whitespace: prefers generous (0.8/1.0)
- Color palette: prefers muted/professional (0.7/1.0) over vibrant
- Typography: prefers system fonts (0.6/1.0), values readability over style
- Layout: prefers card-based (0.9/1.0) over list-based
- Animation: prefers subtle or none (0.3/1.0)
- Information density: prefers moderate (0.5/1.0) — not too sparse, not too packed
```

Each comparison updates these weights using a simple **Elo-like system**:
- If Jake picks A over B, A's design attributes gain points, B's lose
- Preference strength multiplies the update magnitude
- After 50+ comparisons, the vector stabilizes

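
A minimal sketch of that update, assuming each design attribute holds a 0-1 weight and each comparison nudges the winner's attributes up and the loser's down, scaled by preference strength. The base step of 0.02 and the multipliers are assumptions, not tuned values:

```typescript
const strengthMultiplier: Record<string, number> = {
  slight: 0.5, moderate: 1, strong: 2, overwhelming: 3,
};

// weights: attribute name -> 0-1 preference score, defaulting to neutral 0.5.
// Winner's attributes move up, loser's move down, clamped to [0, 1].
function updatePreferences(
  weights: Record<string, number>,
  winnerAttrs: string[],
  loserAttrs: string[],
  strength: keyof typeof strengthMultiplier,
): void {
  const step = 0.02 * strengthMultiplier[strength];   // base step is an assumption
  for (const a of winnerAttrs) weights[a] = Math.min(1, (weights[a] ?? 0.5) + step);
  for (const a of loserAttrs) weights[a] = Math.max(0, (weights[a] ?? 0.5) - step);
}
```

With repeated comparisons the weights drift toward Jake's taste; the vector stabilizes once individual updates become small relative to the accumulated signal.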
#### Design Quality Predictor

A simple scoring function that estimates Jake's likely score before review:

```
predicted_score = Σ (dimension_weight[i] × self_assessed_score[i])
```

Where `dimension_weight` comes from a regression on historical (predicted_features, actual_score) pairs. Initially use equal weights; after 30+ data points, fit a weighted model.

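
The predictor as code, with the equal-weights fallback the text describes (the function name is illustrative; the regression fit itself is out of scope here):

```typescript
// predicted_score = Σ (dimension_weight[i] × self_assessed_score[i])
// With no fitted model yet, every dimension gets an equal share of weight.
function predictJakeScore(
  selfScores: Record<string, number>,       // dimension -> self-assessed 1-10
  fittedWeights?: Record<string, number>,   // from regression, once 30+ points exist
): number {
  const dims = Object.keys(selfScores);
  const equal = 1 / dims.length;
  return dims.reduce(
    (sum, d) => sum + (fittedWeights?.[d] ?? equal) * selfScores[d],
    0,
  );
}
```

Fitted weights should sum to 1 so the prediction stays on the same 1-10 scale as the inputs.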
### 3.4 Delegation Skills — Reducing Jake's Cognitive Load

#### Task Classification Model

Over time, classify tasks into:

| Category | Criteria (Learned) | Action |
|----------|--------------------|--------|
| **Auto-approve** | Approval rate >95% over 20+ instances, avg time-to-decide <15s | Skip HITL, log for audit |
| **Quick review** | Approval rate >80%, low complexity | Show minimal modal (approve/reject only) |
| **Standard review** | Approval rate 50-80% | Full modal with all dimensions |
| **Deep review** | Approval rate <50% or high complexity | Extended modal, request annotations |
| **Escalate** | Novel task type or <5 historical data points | Flag as "needs careful review" |

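
The table reads naturally as a cascade of checks, most restrictive first. A sketch using the table's thresholds (the stats shape is an assumption):

```typescript
type ReviewTier = 'auto-approve' | 'quick' | 'standard' | 'deep' | 'escalate';

interface TaskStats {
  instances: number;          // historical data points for this task type
  approvalRate: number;       // 0-1
  avgTimeToDecideMs: number;
  highComplexity: boolean;
}

function routeReview(stats: TaskStats): ReviewTier {
  if (stats.instances < 5) return 'escalate';   // too little history to trust
  if (stats.approvalRate > 0.95 && stats.instances >= 20 && stats.avgTimeToDecideMs < 15_000)
    return 'auto-approve';
  if (stats.approvalRate < 0.5 || stats.highComplexity) return 'deep';
  if (stats.approvalRate > 0.8) return 'quick';   // implies low complexity (checked above)
  return 'standard';
}
```

Checking `deep` before `quick` guarantees that high-complexity work never slips into the minimal modal even when its approval rate is high.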
#### Notification Timing Optimization

Track Jake's engagement patterns:

| Time Window | Avg Response Time | Completion Rate | Quality of Feedback |
|-------------|-------------------|-----------------|---------------------|
| 9-11 AM     | 3 min             | 95%             | High (detailed)     |
| 11 AM-1 PM  | 8 min             | 88%             | Medium              |
| 2-5 PM      | 5 min             | 92%             | High                |
| 5-8 PM      | 25 min            | 60%             | Low (rushed)        |
| 8 PM-12 AM  | 45 min            | 40%             | Low                 |

**Action:** Batch non-urgent reviews for high-engagement windows. Send only critical decisions during low-engagement times.

#### Session Fatigue Detection

If Jake has reviewed 5+ items in a row:
- Decision quality likely drops (less detailed feedback, faster but less thoughtful)
- Auto-insert a break suggestion: "You've reviewed 6 items — want me to batch the rest for later?"
- Weight feedback from fatigued sessions lower in the learning pipeline

### 3.5 Communication Skills — Better Presentations

#### Format Effectiveness Tracking

For each modal format variant, track:

```
format_effectiveness = decision_speed × feedback_quality × completion_rate
```

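
A sketch of the scoring and selection, assuming each factor has been normalized to 0-1 before multiplying (the normalization scheme itself is not specified by the source):

```typescript
interface FormatStats {
  style: string;
  decisionSpeed: number;     // 0-1, higher = faster decisions
  feedbackQuality: number;   // 0-1
  completionRate: number;    // 0-1
}

// format_effectiveness = decision_speed × feedback_quality × completion_rate
const effectiveness = (f: FormatStats): number =>
  f.decisionSpeed * f.feedbackQuality * f.completionRate;

// Pick the highest-scoring format for a given stage/artifact type
function bestFormat(candidates: FormatStats[]): FormatStats {
  return candidates.reduce((best, f) => (effectiveness(f) > effectiveness(best) ? f : best));
}
```

Because the factors multiply, a format that fails badly on any one factor scores near zero — a deliberately harsh penalty for formats Jake abandons.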
| Presentation Style | Avg Time to Decide | Feedback Richness | Effectiveness Score |
|--------------------|--------------------|-------------------|---------------------|
| Code diff + summary | 45s | High | 0.92 |
| Full file + annotations | 120s | Very High | 0.85 |
| Summary only | 15s | Low | 0.60 |
| Screenshot + bullet points | 30s | Medium | 0.78 |

**Action:** For each pipeline stage and artifact type, select the format with highest effectiveness score.

#### Detail Level Calibration

Track which summary sections Jake actually reads (via scroll depth and time-on-section):
- If Jake never reads the "Architecture Decisions" section → compress or remove it
- If Jake always expands the "Test Coverage" section → make it prominent
- If Jake clicks "Show More" on error handling details → default to expanded

---

## 4. Feedback Processing Pipeline

```
┌─────────────┐    ┌──────────────┐    ┌─────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│   Modal     │───▶│  Validation  │───▶│ Enrichment  │───▶│   Storage    │───▶│   Analysis   │───▶│    Action    │
│  Response   │    │              │    │             │    │              │    │              │    │              │
└─────────────┘    └──────────────┘    └─────────────┘    └──────────────┘    └──────────────┘    └──────────────┘
                          │                   │                    │                    │                   │
                   Reject malformed    Add timestamps       Raw JSONL log        Pattern detect      Update memory
                   Fill defaults       Link to artifact     Update indexes       Theme extraction    Update prompts
                   Sanitize text       Compute meta         Rebuild aggregates   Calibration update  Trigger self-check
                   Schema validate     Extract themes       Archive old data     Anomaly detection   Adjust thresholds
```

### Stage 1: Validation

```typescript
function validateFeedback(raw: unknown): FeedbackEvent | ValidationError {
  // 1. Schema validation — does it match FeedbackEvent shape?
  const parsed = FeedbackEventSchema.safeParse(raw);
  if (!parsed.success) return { error: 'schema', details: parsed.error };

  // 2. Referential integrity — does the workProductRef exist?
  if (!artifactExists(parsed.data.workProduct.id)) {
    return { error: 'orphaned', details: 'Work product not found' };
  }

  // 3. Range validation — scores in bounds, timestamps sane
  for (const score of parsed.data.scores ?? []) {
    if (score.score < 1 || score.score > 10) {
      return { error: 'range', details: `Score ${score.score} out of bounds` };
    }
  }

  // 4. Deduplication — reject duplicate submissions (same modal, same session)
  if (isDuplicate(parsed.data)) {
    return { error: 'duplicate', details: 'Already processed' };
  }

  return parsed.data;
}
```

### Stage 2: Enrichment

```typescript
function enrichFeedback(event: FeedbackEvent): EnrichedFeedbackEvent {
  return {
    ...event,

    // Add implicit meta-signals
    meta: {
      ...event.meta,
      timeOfDay: categorizeTime(event.timestamp),
      dayOfWeek: getDayOfWeek(event.timestamp),
      sessionFatigue: countPriorModalsInSession(event.sessionId),
    },

    // Link to work product details
    workProduct: {
      ...event.workProduct,
      _generationPrompt: getPromptUsed(event.workProduct.id),
      _pipelineContext: getPipelineState(event.workProduct.id),
    },

    // NLP enrichment on free text
    freeText: event.freeText ? {
      ...event.freeText,
      _extractedThemes: extractThemes(event.freeText),
      _sentiment: analyzeSentiment(event.freeText),
      _actionableItems: extractActionItems(event.freeText),
    } : undefined,

    // Calibration data
    _predictionDelta: event.workProduct.bubaScorePrediction
      ? computeAvgScore(event.scores) - event.workProduct.bubaScorePrediction
      : undefined,
  };
}
```

### Stage 3: Storage

```typescript
async function storeFeedback(event: EnrichedFeedbackEvent): Promise<void> {
  // 1. Append to daily raw log (immutable)
  const dayFile = `feedback/raw/${formatDate(event.timestamp)}.jsonl`;
  await appendLine(dayFile, JSON.stringify(event));

  // 2. Update indexes (mutable, rebuilt from raw if corrupted)
  await updateIndex('by-server-type', event.mcpServerType, event);
  await updateIndex('by-dimension', event.scores, event);
  await updateIndex('by-decision', event.decision?.decision, event);
  await updateIndex('by-stage', event.pipelineStage, event);

  // 3. Update rolling aggregates
  await updateRollingAggregate('rolling-30d', event);
}
```

### Stage 4: Analysis (runs after every N events or on schedule)

```typescript
async function analyzeFeedback(): Promise<AnalysisResults> {
  const recent = await loadFeedback({ days: 30 });

  return {
    // Pattern detection
    approvalPatterns: computeApprovalRates(recent),
    dimensionTrends: computeDimensionTrends(recent),

    // Theme extraction — cluster free-text feedback into themes
    themes: clusterFeedbackThemes(recent.map(e => e.freeText).filter(Boolean)),

    // Calibration — compare predictions to actuals
    calibration: computeCalibrationCurve(
      recent.map(e => ({
        predicted: e.workProduct.bubaConfidence,
        actual: e.decision?.decision === 'approved' ? 1 : 0,
      }))
    ),

    // Anomalies — sudden changes in feedback patterns
    anomalies: detectAnomalies(recent),

    // New rules — patterns strong enough to become rules
    candidateRules: extractCandidateRules(recent),
  };
}
```

### Stage 5: Action

```typescript
async function applyLearnings(analysis: AnalysisResults): Promise<void> {
  // 1. Update memory files
  await updateMemoryFile('memory/feedback-patterns.md', analysis.themes);
  await updateMemoryFile('memory/quality-standards.md', analysis.dimensionTrends);
  await updateMemoryFile('memory/jake-preferences.md', analysis.approvalPatterns);

  // 2. Update pre-review checklist
  await updateChecklist(analysis.candidateRules);

  // 3. Recalibrate confidence thresholds
  await updateCalibration(analysis.calibration);

  // 4. Adjust HITL frequency for auto-approvable task types
  await updateHITLThresholds(analysis.approvalPatterns);

  // 5. Log what changed
  await appendToImprovementLog({
    timestamp: new Date().toISOString(),
    changes: analysis.candidateRules.filter(r => r.confidence > 0.8),
    metrics: { approvalRate: analysis.approvalPatterns.overall },
  });
}
```

---

## 5. Memory Integration

### 5.1 Memory File Architecture

The learning system writes to four memory files, each with a specific purpose:

#### `memory/feedback-patterns.md`

```markdown
# Feedback Patterns (Auto-Updated)
> Last updated: 2025-07-11 | Based on 147 feedback events

## Recurring Themes (Top 10)
1. **Error handling depth** (mentioned 23x) — Jake wants specific error types, not generic catches
2. **Input validation** (18x) — Validate at boundaries, use Zod schemas
3. **Naming clarity** (15x) — Descriptive function names, no abbreviations
4. **Test edge cases** (12x) — Empty arrays, null values, Unicode, rate limits
5. **Documentation completeness** (10x) — Every public function needs JSDoc
...

## Anti-Patterns (Things Jake Consistently Rejects)
- Using `any` type in TypeScript (rejection rate: 100%)
- Console.log for error reporting (95%)
- Missing rate limiting on external APIs (90%)
- Hardcoded configuration values (85%)
- Missing input sanitization (88%)

## Positive Patterns (Things Jake Consistently Praises)
- Comprehensive error types with context (+0.8 avg score bonus)
- Builder pattern for complex configurations (+0.6)
- Graceful degradation with fallbacks (+0.7)
- Clear separation of concerns (+0.5)
```

#### `memory/quality-standards.md`

```markdown
# Quality Standards (Learned Thresholds)
> Last calibrated: 2025-07-11 | Confidence: High (100+ data points)

## Minimum Acceptable Scores (Hard Gates)
| Dimension | Minimum Score | Based On | Jake's Avg |
|-----------|---------------|----------|------------|
| error-handling | 6 | 45 reviews | 7.2 |
| test-coverage | 5 | 38 reviews | 6.8 |
| security | 7 | 22 reviews | 7.9 |
| code-quality | 6 | 52 reviews | 7.1 |

## Target Scores (What "Great" Looks Like)
| Dimension | Target | Top Quartile |
|-----------|--------|--------------|
| error-handling | 8+ | 9.1 |
| code-quality | 8+ | 8.7 |
| completeness | 9+ | 9.4 |

## Confidence Calibration
Buba's raw confidence of X% maps to actual approval rate:
- 90%+ raw → 78% actual (apply -12% correction)
- 70-89% raw → 68% actual (apply -8% correction)
- 50-69% raw → 35% actual (apply -25% correction — major overconfidence zone)
```

#### `memory/jake-preferences.md`

```markdown
# Jake's Preferences (Personal Taste Model)
> Last updated: 2025-07-11

## Decision Patterns
- Approves quickly: boilerplate MCP servers with standard patterns (<20s avg)
- Deliberates on: novel integrations, security-sensitive tools (>5min avg)
- Defers: design decisions when tired (evening reviews)
- Auto-rejects: anything with TypeScript `any`, missing error handling

## Communication Preferences
- Prefers: code diffs over full files for review
- Prefers: bullet points over paragraphs for summaries
- Reads: error handling sections carefully (high scroll depth)
- Skips: architecture decision records (low scroll depth)
- Best review times: 9-11 AM, 2-4 PM
- Worst review times: after 8 PM (fast but low-quality feedback)

## Design Taste
- Prefers: clean, minimal UI with generous whitespace
- Prefers: card-based layouts over tables for dashboards
- Dislikes: excessive animations, gradient-heavy designs
- Font preference: Inter/system fonts > decorative fonts
- Color preference: muted professional palettes > vibrant/playful

## Code Style
- Prefers: functional patterns over class-based
- Prefers: explicit over implicit (no magic)
- Prefers: small focused functions over large multi-purpose ones
- Cares deeply about: naming, error messages, edge cases
- Cares less about: premature optimization, DRY absolutism
```

#### `memory/improvement-log.md`

```markdown
# Improvement Log — What Changed and Why
> Tracks every behavioral change derived from feedback

## 2025-07-11
- **Rule added:** "Always validate API responses with Zod schemas"
  - Source: 5 rejections for missing validation in past 2 weeks
  - Confidence: 92%
- **Threshold adjusted:** error-handling minimum raised 5→6
  - Source: Jake rejected 3 items scoring 5 on error-handling
- **Calibration updated:** Buba overconfident by 12% in 90%+ range
  - Action: applying -12% correction to raw confidence

## 2025-07-04
- **Rule added:** "Include retry logic with exponential backoff on all HTTP calls"
  - Source: mentioned in 8 separate free-text feedback entries
- **Format changed:** Switched MCP server reviews to diff-based presentation
  - Source: decision time dropped 40% in A/B test
```

### 5.2 Memory Compression Strategy

Feedback data grows. Memory files must stay lean for context windows.

**Compression Schedule:**

| Age | Storage | Memory Representation |
|-----|---------|-----------------------|
| 0-7 days | Full raw events in JSONL | Referenced directly if relevant |
| 8-30 days | Raw events + aggregates | Summarized in weekly aggregate |
| 31-90 days | Raw archived, aggregates active | Compressed to patterns only |
| 90+ days | Raw cold-stored | Only durable rules and anti-patterns preserved |

**Compression Algorithm:**
1. Scan feedback older than 30 days
2. For each theme/pattern, check if it's already captured in `feedback-patterns.md`
3. If yes → delete from active memory, keep in cold archive
4. If no → extract the novel pattern, add to memory, then archive
5. Rules with <5 occurrences that are >90 days old get pruned (they were noise)

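
Step 5's pruning rule as a predicate (the field names are assumptions about the stored rule shape):

```typescript
interface StoredRule {
  occurrences: number;
  firstSeen: string;   // ISO 8601
}

// Rules with <5 occurrences that are >90 days old were probably noise
function shouldPrune(rule: StoredRule, now: Date): boolean {
  const ageDays = (now.getTime() - new Date(rule.firstSeen).getTime()) / 86_400_000;
  return rule.occurrences < 5 && ageDays > 90;
}
```

Keeping the predicate pure makes the compression pass trivially testable and safe to re-run: pruning the same archive twice yields the same result.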
### 5.3 Contextual Memory Retrieval

When Buba starts a new task, the memory system surfaces **relevant** past feedback:

```typescript
function getRelevantFeedback(task: Task): RelevantMemory {
  return {
    // Always loaded (small, high-value)
    preferences: loadMemory('jake-preferences.md'),
    standards: loadMemory('quality-standards.md'),

    // Task-type specific
    patterns: loadFilteredPatterns({
      serverType: task.mcpServerType,
      pipelineStage: task.stage,
      dimensions: task.relevantDimensions,
    }),

    // Recent relevant feedback (last 5 similar tasks)
    recentSimilar: loadRecentFeedback({
      type: task.type,
      limit: 5,
      minSimilarity: 0.7,
    }),

    // Specific anti-patterns for this task type
    antiPatterns: loadAntiPatterns(task.mcpServerType),
  };
}
```

---

## 6. Self-Improvement Loops

### 6.1 Pre-Review Self-Check

Before any work product is sent to Jake, Buba runs it through a learned checklist:

```
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐    ┌──────────────┐
│ Generated Work  │───▶│ Self-Check Gate  │───▶│ Pass? Submit to │───▶│ Jake Reviews │
│     Product     │    │ (learned rules)  │    │  Jake's Queue   │    │              │
└─────────────────┘    └──────────────────┘    └─────────────────┘    └──────────────┘
                               │ Fail                                        │
                               ▼                                            │
                       ┌──────────────┐                                     │
                       │ Auto-Fix or  │                                     │
                       │ Re-generate  │◀────────────────────────────────────┘
                       └──────────────┘     (feedback updates self-check rules)
```

**Self-Check Implementation:**
|
||
|
||
```typescript
|
||
interface SelfCheckResult {
|
||
passed: boolean;
|
||
score: number; // Predicted Jake score
|
||
confidence: number; // Calibrated confidence
|
||
violations: SelfCheckViolation[];
|
||
recommendations: string[];
|
||
}
|
||
|
||
async function selfCheck(artifact: WorkProduct): Promise<SelfCheckResult> {
|
||
const rules = await loadLearnedRules();
|
||
const standards = await loadQualityStandards();
|
||
const violations: SelfCheckViolation[] = [];
|
||
|
||
// Check each learned rule
|
||
for (const rule of rules) {
|
||
const result = await evaluateRule(rule, artifact);
|
||
if (!result.passed) {
|
||
violations.push({
|
||
rule: rule.id,
|
||
description: rule.description,
|
||
severity: rule.severity,
|
||
autoFixable: rule.autoFixAvailable,
|
||
suggestion: rule.fixSuggestion,
|
||
});
|
||
}
|
||
}
|
||
|
||
// Predict Jake's score
|
||
const predictedScore = await predictScore(artifact, standards);
|
||
const calibratedConfidence = applyCalibration(computeRawConfidence(violations));
|
||
|
||
return {
|
||
passed: violations.filter(v => v.severity >= 'major').length === 0,
|
||
score: predictedScore,
|
||
confidence: calibratedConfidence,
|
||
violations,
|
||
recommendations: violations
|
||
.filter(v => v.autoFixable)
|
||
.map(v => v.suggestion),
|
||
};
|
||
}
|
||
```

### 6.2 Prediction Scoring Loop

Every time Buba submits work, it records a prediction. After Jake reviews:

```
Buba predicts: "Jake will give this 7.5/10 with 80% approval confidence"
Jake scores:   8/10, approved
Delta:         +0.5 (Buba was slightly pessimistic — recalibrate upward for this type)
```

Over time, this creates a calibration dataset:

```typescript
interface CalibrationEntry {
  taskType: string;
  predicted: { score: number; confidence: number };
  actual: { score: number; approved: boolean };
  delta: number;
  timestamp: string;
}

// Calibration update (per-bucket reliability averages over the dataset)
function updateCalibration(entries: CalibrationEntry[]): CalibrationCurve {
  const buckets = groupBy(entries, e => Math.floor(e.predicted.confidence * 10) / 10);
  return Object.fromEntries(
    Object.entries(buckets).map(([bucket, items]) => [
      bucket,
      {
        avgPredicted: mean(items.map(i => i.predicted.confidence)),
        avgActual: mean(items.map(i => i.actual.approved ? 1 : 0)),
        n: items.length,
        correction: mean(items.map(i => i.delta)),
      }
    ])
  );
}
```
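
The `applyCalibration` step referenced in the self-check (§6.1) has to map this curve back onto a raw confidence value. A minimal sketch, assuming the curve is passed in explicitly and keyed by the same one-decimal buckets `updateCalibration` produces; the blending weight and the 5-sample minimum are illustrative assumptions, not values from the spec:

```typescript
// Shape matches the entries updateCalibration emits above.
type CalibrationCurve = Record<string, {
  avgPredicted: number;
  avgActual: number;
  n: number;
  correction: number;
}>;

function applyCalibration(raw: number, curve: CalibrationCurve): number {
  // Bucket key: confidence floored to one decimal, same as updateCalibration.
  const bucket = (Math.floor(raw * 10) / 10).toString();
  const entry = curve[bucket];
  // With no data (or too little) for this bucket, pass the raw value through.
  if (!entry || entry.n < 5) return raw;
  // Shift raw confidence toward the observed approval rate for this bucket,
  // weighting by how much data backs the bucket (capped at 20 samples).
  const weight = Math.min(entry.n, 20) / 20;
  const calibrated = raw + weight * (entry.avgActual - entry.avgPredicted);
  return Math.min(1, Math.max(0, calibrated));
}
```

With enough samples in a bucket, the correction converges on the bucket's observed over- or under-confidence.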

### 6.3 Diminishing Review System

As Buba proves competence in specific task types, HITL frequency decreases:

```
Level 0 (New task type):              Review 100% of outputs
Level 1 (10+ approvals):              Review 80%, spot-check 20%
Level 2 (25+ approvals, >85% rate):   Review 50%, auto-approve simple ones
Level 3 (50+ approvals, >90% rate):   Review 20%, auto-approve standard patterns
Level 4 (100+ approvals, >95% rate):  Review 5%, audit-only mode
```
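
The ladder above can be encoded as a simple table walk. A sketch, taking the approval count and rolling approval rate as inputs; treating the table's rate thresholds as inclusive is an assumption:

```typescript
// Sketch of the autonomy ladder. Thresholds mirror the table;
// reviewRate is the fraction of outputs still sent to Jake.
interface AutonomyLevel {
  level: number;
  minApprovals: number;
  minRate: number; // rolling approval rate required to hold this level
  reviewRate: number;
}

const LADDER: AutonomyLevel[] = [
  { level: 0, minApprovals: 0,   minRate: 0,    reviewRate: 1.0  },
  { level: 1, minApprovals: 10,  minRate: 0,    reviewRate: 0.8  },
  { level: 2, minApprovals: 25,  minRate: 0.85, reviewRate: 0.5  },
  { level: 3, minApprovals: 50,  minRate: 0.90, reviewRate: 0.2  },
  { level: 4, minApprovals: 100, minRate: 0.95, reviewRate: 0.05 },
];

function getAutonomyLevel(approvals: number, approvalRate: number): number {
  // Walk from the top: the highest level whose gates are both satisfied wins.
  for (let i = LADDER.length - 1; i >= 0; i--) {
    const l = LADDER[i];
    if (approvals >= l.minApprovals && approvalRate >= l.minRate) return l.level;
  }
  return 0;
}
```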

**Regression Detection:**
If, at any level, the most recent 5 reviews show the approval rate dropping below that level's threshold, **immediately regress one level** and log the regression in `improvement-log.md`.

```typescript
function checkForRegression(taskType: string): RegressionStatus {
  const recentReviews = loadRecentReviews(taskType, 5);
  const approvalRate = recentReviews.filter(r => r.approved).length / recentReviews.length;
  const currentLevel = getAutonomyLevel(taskType);
  const threshold = LEVEL_THRESHOLDS[currentLevel];

  if (approvalRate < threshold) {
    return {
      regressed: true,
      from: currentLevel,
      to: Math.max(0, currentLevel - 1),
      reason: `Approval rate ${approvalRate} < threshold ${threshold}`,
    };
  }
  return { regressed: false };
}
```

### 6.4 A/B Learning

When Buba is uncertain between two approaches:

1. Generate both versions
2. Present them as a comparison modal to Jake
3. Record which version Jake prefers and why
4. Update the preference model

**Trigger conditions for A/B:**
- Buba's confidence < 65% on the "right" approach
- Two valid design patterns exist (e.g., REST vs GraphQL adapter)
- Jake has shown mixed preferences for this domain in the past
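
The trigger list can be sketched as a single predicate, assuming any one condition is sufficient to start a comparison; the `ABContext` shape and the `mixedPreferenceDomains` input are hypothetical (the latter would be derived from feedback history):

```typescript
interface ABContext {
  confidence: number;           // Buba's confidence in the preferred approach, 0-1
  candidatePatterns: string[];  // Valid design patterns for this task
  domain: string;
  mixedPreferenceDomains: Set<string>; // Domains where Jake's past feedback conflicts
}

function shouldRunABTest(ctx: ABContext): boolean {
  if (ctx.confidence < 0.65) return true;          // uncertain about the "right" approach
  if (ctx.candidatePatterns.length >= 2) return true; // multiple valid design patterns
  return ctx.mixedPreferenceDomains.has(ctx.domain);  // historically mixed preferences
}
```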

---

## 7. Dashboard Metrics

### 7.1 Key Performance Indicators

```
┌───────────────────────────────────────────────────────────────┐
│                    BUBA LEARNING DASHBOARD                    │
├──────────────────────┬────────────────────────────────────────┤
│ Overall Health       │ ████████░░ 82% (↑ from 74% last mo)    │
├──────────────────────┼────────────────────────────────────────┤
│ Approval Rate        │ 78% → 85% → 89%   (3-month trend ↑)    │
│ Avg Quality Score    │ 6.2 → 7.1 → 7.6   (3-month trend ↑)    │
│ Time-to-Decision     │ 120s → 85s → 62s  (3-month trend ↓)    │
│ Prediction Accuracy  │ ±2.1 → ±1.4 → ±0.8  (calibrating ↑)    │
│ Auto-Approve Rate    │ 0% → 12% → 28%    (HITL reduction ↑)   │
│ Regression Events    │ 3 → 1 → 0   (stability improving ↑)    │
├──────────────────────┼────────────────────────────────────────┤
│ Top Improvement Area │ Error handling (score +2.1 since rule) │
│ Top Problem Area     │ Security patterns (still at 6.2 avg)   │
│ Feedback Backlog     │ 3 unprocessed events                   │
│ Active Rules         │ 23 learned rules (18 high-confidence)  │
└──────────────────────┴────────────────────────────────────────┘
```

### 7.2 Metric Definitions

| Metric | Formula | Target Trend |
|--------|---------|-------------|
| **Approval Rate** | `approved / total_decisions` per rolling 30 days | ↑ (asymptote at ~95%) |
| **Avg Quality Score** | `mean(all dimension scores)` per rolling 30 days | ↑ (target: 8.0+) |
| **Time-to-Decision** | `mean(timeToDecisionMs)` per rolling 30 days | ↓ (faster = better context) |
| **Prediction Accuracy** | `mean(abs(predicted_score - actual_score))` | ↓ (target: ±0.5) |
| **Auto-Approve Rate** | `auto_approved / total_tasks` | ↑ (Jake reviews less) |
| **Feedback Volume** | `total_feedback_events` per week | ↓ (less HITL needed) |
| **Rule Effectiveness** | `quality_delta after rule introduced` | ↑ (rules actually help) |
| **Regression Rate** | `regression_events / total_level_checks` | ↓ (target: 0) |
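
As an illustration of the rolling-window formulas above, a sketch of the approval-rate computation; the minimal `DecisionEvent` shape stands in for the full `FeedbackEvent` schema:

```typescript
interface DecisionEvent {
  timestamp: string; // ISO 8601
  approved: boolean;
}

// Rolling approval rate: approved / total_decisions over the last windowDays.
function rollingApprovalRate(events: DecisionEvent[], now: Date, windowDays = 30): number {
  const cutoff = now.getTime() - windowDays * 24 * 60 * 60 * 1000;
  const window = events.filter(e => new Date(e.timestamp).getTime() >= cutoff);
  if (window.length === 0) return 0;
  return window.filter(e => e.approved).length / window.length;
}
```

The other per-30-day metrics follow the same filter-then-aggregate shape, swapping the final reduction (mean of scores, mean of decision times).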

### 7.3 Alerting Thresholds

| Condition | Alert Level | Action |
|-----------|------------|--------|
| Approval rate drops >10% week-over-week | 🔴 Critical | Pause auto-approvals, full HITL |
| Prediction accuracy >±2.0 | 🟡 Warning | Recalibrate immediately |
| New rejection reason appearing 3+ times | 🟡 Warning | Extract rule, add to checklist |
| Time-to-decision increasing | 🟠 Notice | Review presentation format |
| No feedback events for 48+ hours | 🟠 Notice | Check pipeline health |

---

## 8. Feedback-to-Prompt Engineering

### 8.1 Rule Extraction Pipeline

Feedback → NLP Theme Extraction → Frequency Analysis → Confidence Check → Rule Formulation → Prompt Injection

```typescript
interface LearnedRule {
  id: string;
  source: 'feedback-pattern' | 'rejection-analysis' | 'comparison' | 'annotation';
  rule: string;               // Human-readable rule
  prompt_instruction: string; // How to inject into system prompt
  confidence: number;         // 0-1 based on data strength
  occurrences: number;        // How many feedback events support this
  first_seen: string;
  last_seen: string;
  effectiveness?: number;     // Post-deployment quality delta
  domain: QualityDimension;
  applies_to: string[];       // Server types or 'all'
}
```
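
Injecting these rules into a prompt might look like the following sketch, which filters to high-confidence rules for a server type and renders them as a bullet list. The minimal `RuleSlice` type stands in for the full `LearnedRule` interface, and the section-header format is illustrative; the 0.8-confidence and 5-occurrence gates match the thresholds stated in §8.3:

```typescript
interface RuleSlice {
  prompt_instruction: string;
  confidence: number;
  occurrences: number;
  applies_to: string[]; // server types, or 'all'
}

function buildPromptSection(rules: RuleSlice[], serverType: string): string {
  const applicable = rules.filter(r =>
    r.confidence > 0.8 &&
    r.occurrences >= 5 &&
    (r.applies_to.includes('all') || r.applies_to.includes(serverType))
  );
  // Highest-confidence rules first, so any truncation drops the weakest ones.
  applicable.sort((a, b) => b.confidence - a.confidence);
  const lines = applicable.map(r => `- ${r.prompt_instruction}`);
  return `## Learned Rules (${serverType})\n${lines.join('\n')}`;
}
```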

### 8.2 Dynamic `.goosehints` Generation

The system auto-generates a `.goosehints` addendum from learned rules:

```markdown
# .goosehints (auto-generated section — DO NOT EDIT BELOW THIS LINE)
# Generated: 2025-07-11 | Based on 147 feedback events | 23 active rules

## Code Standards (from Jake's feedback)
- ALWAYS use Zod schemas for runtime validation of API responses
- ALWAYS include retry logic with exponential backoff (base 1s, max 30s, 3 attempts)
- ALWAYS use typed error classes, never throw raw strings
- NEVER use TypeScript `any` — use `unknown` with type guards instead
- PREFER functional composition over class hierarchies
- PREFER explicit error handling over try/catch-all blocks
- Name functions as verb-noun: `fetchUser`, `validateInput`, `buildResponse`
- Every public function needs a JSDoc comment with @param, @returns, @throws

## Design Standards (from Jake's feedback)
- Use 16-24px padding in card components
- Prefer Inter or system font stack
- Use muted color palettes: slate, zinc, gray families
- Card-based layouts for dashboard content
- Minimal animations: 150ms transitions only for state changes

## Quality Gates (auto-calibrated)
- Error handling score must be ≥6 before submission
- Test coverage score must be ≥5 before submission
- Security score must be ≥7 for auth-related MCP servers
- Overall predicted score must be ≥6.5 or self-review is triggered

## Review Presentation (optimized for Jake)
- Lead with a code diff, not full files
- Bullet-point summaries, max 5 bullets
- Highlight error handling approach prominently
- Include test coverage percentage in summary header
```

### 8.3 Per-Domain Instruction Sets

Beyond global rules, maintain domain-specific prompt sections:

```
prompts/
├── global-rules.md        # Always included
├── domains/
│   ├── mcp-server.md      # Rules specific to MCP server generation
│   ├── ui-design.md       # Rules for UI/design tasks
│   ├── test-suite.md      # Rules for test generation
│   └── documentation.md   # Rules for docs
└── server-types/
    ├── shopify.md         # Shopify-specific learned patterns
    ├── stripe.md          # Stripe-specific patterns
    └── ghl.md             # GHL-specific patterns
```

Each file is rebuilt when new rules cross the confidence threshold (>0.8, supported by 5+ data points).
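
The rebuild gate can be sketched as a filter over candidate rules. The `CandidateRule` shape and the `alreadyPublished` flag are assumptions; file paths follow the tree above:

```typescript
interface CandidateRule {
  domainFile: string;        // e.g. 'domains/mcp-server.md'
  confidence: number;
  occurrences: number;
  alreadyPublished: boolean; // rule already baked into the file
}

// A file is rebuilt when any rule for it newly crosses the gate
// (confidence > 0.8 and 5+ supporting data points).
function filesToRebuild(candidates: CandidateRule[]): string[] {
  const files = candidates
    .filter(r => !r.alreadyPublished && r.confidence > 0.8 && r.occurrences >= 5)
    .map(r => r.domainFile);
  return [...new Set(files)]; // each file rebuilt at most once per pass
}
```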

### 8.4 Quality Rubric Generation

From accumulated scores, generate rubrics that Buba uses for self-assessment:

```markdown
## Code Quality Rubric (generated from 52 scored reviews)

### Score 9-10 (Exceptional) — Jake gave this 8% of the time
- Typed errors with full context chain
- 95%+ test coverage including edge cases
- Zero TypeScript errors in strict mode
- Comprehensive JSDoc on all exports
- Performance considerations documented

### Score 7-8 (Good) — Jake gave this 45% of the time
- Typed errors on main paths
- 80%+ test coverage
- Minimal TypeScript warnings
- JSDoc on public API
- Basic error handling on all external calls

### Score 5-6 (Acceptable) — Jake gave this 30% of the time
- Try/catch on external calls
- 60%+ test coverage
- Some TypeScript warnings
- Partial documentation

### Score 1-4 (Rejection Zone) — Jake gave this 17% of the time
- Generic error handling or none
- <60% test coverage
- TypeScript `any` usage
- Missing documentation
- No input validation
```

---

## 9. Privacy & Ethics

### 9.1 Data Retention Policy

| Data Type | Retention | Rationale |
|-----------|-----------|-----------|
| Raw feedback events | 90 days hot, indefinite cold | Source of truth for auditing |
| Free-text comments | 90 days verbatim, then extracted themes only | Privacy — reduce stored opinions over time |
| Decision metadata | Indefinite (aggregated) | Low-sensitivity, high-value for patterns |
| Annotation data | 90 days linked to code, then de-linked | Code context changes; old annotations mislead |
| Timing/behavioral meta | 30 days, then averaged into aggregates | Sensitive behavioral data, compress aggressively |

### 9.2 Handling Contradictory Feedback

Jake may give contradictory feedback (approves pattern X on Monday, rejects it Friday).

**Resolution Strategy:**
1. **Recency weighting:** Recent feedback counts 2× as much as older feedback
2. **Context matching:** Check if the contradiction is actually context-dependent (e.g., pattern X is fine for internal tools but not for public APIs)
3. **Confidence-weighted:** Feedback where Jake marked "very sure" outweighs "somewhat unsure"
4. **Volume threshold:** Don't form a rule until the same signal appears 3+ times
5. **Clarification trigger:** If two equally-strong signals conflict, flag for clarification in the next modal: "You've previously both approved and rejected [pattern]. Which do you prefer for [context]?"
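
Steps 1, 3, 4, and 5 above can be combined into one weighting sketch. The 2× and 0.5× multipliers come from the list; the 30-day recency cutoff and the 20% "too close to call" margin are illustrative assumptions:

```typescript
interface Signal {
  ageDays: number;
  sureness: 'very-sure' | 'somewhat-unsure' | 'normal';
  occurrences: number; // how often this signal has appeared (volume)
}

function signalWeight(s: Signal): number {
  const recency = s.ageDays <= 30 ? 2 : 1; // recent feedback counts 2x
  const sureness =
    s.sureness === 'very-sure' ? 2 :
    s.sureness === 'somewhat-unsure' ? 0.5 : 1;
  return s.occurrences * recency * sureness;
}

// Returns 'a', 'b', or 'clarify' when the two sides are too close to call.
function resolveConflict(a: Signal, b: Signal): 'a' | 'b' | 'clarify' {
  const wa = signalWeight(a);
  const wb = signalWeight(b);
  if (Math.abs(wa - wb) / Math.max(wa, wb) < 0.2) return 'clarify';
  return wa > wb ? 'a' : 'b';
}
```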

### 9.3 Overfitting Prevention

**Problem:** Buba might optimize for Jake's momentary mood rather than durable preferences.

**Safeguards:**

1. **Minimum sample size:** No rule is formed from fewer than 5 data points
2. **Time spread:** Rules require data points spread across at least 3 different days
3. **Fatigue filtering:** Feedback given after 5+ consecutive reviews in one session gets a 0.7× weight
4. **Trend smoothing:** Use 30-day rolling averages, not daily snapshots, for threshold calibration
5. **Preference decay:** Old preferences decay at 5% per month unless reinforced by new data
6. **Explicit override:** Jake can always mark feedback as "strong preference" (2× weight) or "weak/unsure" (0.5× weight)
7. **Regular audit prompt:** Monthly, present Jake with current learned rules for validation: "Here are the rules I've learned. Still accurate?"
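
Safeguards 3, 5, and 6 can be folded into a single per-event weight. The individual factors (0.7× fatigue, 5%/month decay, 2×/0.5× overrides) are the values listed above; combining them multiplicatively is an assumption:

```typescript
interface WeightInput {
  reviewsThisSession: number; // reviews Jake already completed in this session
  ageMonths: number;
  override?: 'strong' | 'weak';
}

function feedbackWeight(w: WeightInput): number {
  let weight = 1.0;
  if (w.reviewsThisSession >= 5) weight *= 0.7; // fatigue filtering
  weight *= Math.pow(0.95, w.ageMonths);        // 5% decay per month
  if (w.override === 'strong') weight *= 2;     // explicit strong preference
  if (w.override === 'weak') weight *= 0.5;     // explicit weak/unsure
  return weight;
}
```

Rule formation (safeguards 1 and 2) would then sum these weights rather than count raw events, so fatigued or stale feedback contributes proportionally less.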

### 9.4 Bias Awareness

- Track if Buba becomes overly conservative (rejection avoidance → boring/safe outputs)
- Monitor creativity scores alongside approval rates — if creativity drops as approval rises, inject more experimental variation
- Preserve an "exploration budget" — 10% of outputs can deliberately try new patterns outside the safe zone, clearly marked as "experimental"

---

## 10. Implementation Roadmap

### Phase 1: Foundation (Week 1-2)
- [ ] Implement `FeedbackEvent` schema and validation
- [ ] Build JSONL storage layer with daily rotation
- [ ] Create basic index builders (by-server-type, by-dimension, by-decision)
- [ ] Wire modal responses to storage pipeline
- [ ] Initialize memory files (`feedback-patterns.md`, `quality-standards.md`, `jake-preferences.md`, `improvement-log.md`)

### Phase 2: Analysis (Week 3-4)
- [ ] Build theme extraction from free-text feedback (keyword clustering)
- [ ] Implement approval rate computation per server type and stage
- [ ] Build confidence calibration table
- [ ] Create the pre-review self-check framework (rule-based)
- [ ] Implement prediction scoring (predict → store → compare loop)

### Phase 3: Learning Loops (Week 5-6)
- [ ] Auto-generate `.goosehints` addendum from learned rules
- [ ] Build the diminishing review system (autonomy levels 0-4)
- [ ] Implement regression detection and auto-escalation
- [ ] Create the A/B comparison flow for uncertain decisions
- [ ] Build per-domain instruction generators

### Phase 4: Optimization (Week 7-8)
- [ ] Notification timing optimization
- [ ] Session fatigue detection
- [ ] Format effectiveness tracking
- [ ] Memory compression and archival
- [ ] Dashboard metrics computation and display

### Phase 5: Maturity (Ongoing)
- [ ] Monthly rule audit system
- [ ] Exploration budget for creative variation
- [ ] Cross-domain pattern transfer (learnings from code quality improving design reviews)
- [ ] Community patterns (if multi-human review is ever added)

---

## Appendix A: Example End-to-End Flow

```
1. Buba generates a Shopify MCP server
2. Pre-review self-check runs:
   - ✅ Retry logic present
   - ✅ Zod validation on API responses
   - ❌ Missing rate limiting (learned rule #7)
   - Buba auto-fixes: adds rate limiting
   - Re-check: all rules pass
   - Predicted score: 7.8 (calibrated confidence: 74%)
3. Submitted to Jake's review queue
   - Presentation format: code diff + 4-bullet summary (learned optimal format)
   - Timing: queued for 9-11 AM window (Jake's peak engagement)
4. Jake reviews in modal:
   - Decision: Approved
   - Scores: code-quality: 8, error-handling: 8, test-coverage: 7, completeness: 9
   - Free text: "Nice use of builder pattern for config. Wish the webhook handler
     had more specific error types."
   - Time to decide: 45 seconds
5. Feedback processed:
   - Validation: ✅
   - Enrichment: themes ["builder-pattern-positive", "error-specificity-gap"], sentiment: 0.7
   - Storage: appended to 2025-07-11.jsonl, indexes updated
   - Analysis: "error specificity" theme now at 6 occurrences → approaching rule threshold
   - Calibration: predicted 7.8, actual 8.0 → delta +0.2, slight pessimism
6. Learning applied:
   - "builder pattern" noted as positive for Shopify servers
   - "error specificity in webhooks" at 6 mentions → candidate rule queued (needs 2 more for confidence)
   - Calibration nudged +0.1 for Shopify server type
   - Shopify MCP autonomy: 15 approvals, 88% rate → Level 1 maintained
```

---

## Appendix B: Storage Size Estimates

| Component | Per Event | Daily (10 events) | Monthly | Yearly |
|-----------|-----------|-------------------|---------|--------|
| Raw JSONL | ~2 KB | ~20 KB | ~600 KB | ~7.2 MB |
| Indexes | N/A | ~50 KB (rebuilt) | ~50 KB | ~50 KB |
| Aggregates | N/A | ~10 KB | ~30 KB | ~120 KB |
| Memory files | N/A | ~5 KB (updated) | ~8 KB | ~15 KB |
| **Total** | | ~85 KB/day | ~688 KB/mo | ~7.4 MB/yr |

Extremely lightweight. Even at 50 reviews/day, yearly storage stays under 40 MB. No database needed — flat files with JSONL are sufficient for this scale.
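
The arithmetic behind these estimates is simple enough to sanity-check in code (a sketch; the ~2 KB/event figure comes from the table):

```typescript
// Back-of-envelope yearly estimate for the raw JSONL component, in MB.
function yearlyRawJsonlMB(eventsPerDay: number, kbPerEvent = 2): number {
  return (eventsPerDay * kbPerEvent * 365) / 1024;
}
```

At 10 events/day this lands around 7 MB/yr, and even 50 reviews/day stays in the mid-30s MB, consistent with the "under 40 MB" claim above.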

---

*This document is a living architecture. As the system collects real feedback, specific thresholds, weights, and rules will be calibrated from actual data rather than these initial estimates.*