.agents/memory/2026-02-25-backwards-deduplication-pass-planning.md

# 2026-02-25 Session Notes

## Backwards Deduplication Pass Planning

Nicholai reviewed and approved a detailed implementation plan for adding retroactive deduplication to the memory pipeline. The feature addresses legacy databases that accumulated memories without the extraction pipeline, which currently only catch exact duplicates at write time via content_hash uniqueness constraints.

The plan spans 6 files and introduces a two-phase repair action: exact hash clustering (cheap SQL GROUP BY) followed by optional semantic clustering via KNN on embedded memories. Keeper selection uses a weighted scoring algorithm (importance × 3, normalized access/update counts, recency tiebreaker) with special protection for pinned and manually-overridden memories. Tag merging combines comma-separated strings, and all changes are soft-deletes with full audit trails.

Configuration adds four new fields to PipelineRepairConfig (cooldown, hourly budget, semantic threshold 0.92, batch size 100). Diagnostics gains a DuplicateHealth metric tracking excess dupes and duplicate ratio, wired into the composite health score at 0.04 weight. Maintenance worker adds dedup recommendations when duplicate ratio exceeds 5%. Two HTTP endpoints expose stats and execution (GET/POST `/api/repair/dedup-stats` and `/api/repair/deduplicate`).

All repair operations are rate-limited, policy-gated, idempotent, and transactional per cluster. A dry-run mode allows preview before execution.