.agents/memory/2026-02-25-ingestion-pipeline-cherry-pick-planning.md

1.9 KiB

2026-02-25 Session Notes

Ingestion Pipeline Cherry-Pick Planning

Nicholai outlined a detailed plan to cherry-pick the document ingestion pipeline from PR #25 (web3-identity branch) onto a clean branch off main. The ingestion pipeline at packages/core/src/ingest/ parses markdown, PDFs, code repositories, Slack/Discord exports, and git history into Signet memories via LLM extraction (Ollama).

Scope

Copying 14 self-contained files from the ingest directory, plus one migration file to be renumbered from 014 to 013. The pipeline includes parsers for multiple formats (markdown, PDF, code, Discord, Slack, entire.io sessions), a chunker, extractors for LLM processing, and provenance tracking for deduplication.

Key Fixes Required

Seven code quality fixes were identified from the code review:

  1. Prompt injection protection: Wrap untrusted content in XML delimiters across three extractor files
  2. Type safety: Replace inline db as {...} casts with a formal DatabaseLike interface
  3. PDF parser typing: Remove as any by defining interfaces for pdf-parse v2 API
  4. Non-null assertions: Replace ! with explicit guards in slack-parser.ts
  5. Error logging: Add warn-level logging for silent memory insert failures
  6. Validation: Add field presence checks before casting Discord/Slack exports
  7. Cleanup: Remove unused loop variable in markdown-parser.ts

Next Steps

Implementation plan: create branch nicholai/ingest-pipeline off main, copy files, apply all 7 fixes, register migration, then build/typecheck/lint/test to verify.

Technical Notes

  • Migration renumbering: main ends at 012-scheduled-tasks, so ingestion becomes 013
  • No package.json changes needed (pdf-parse is optional dynamic import)
  • No daemon routes or CLI changes included in this cherry-pick
  • Branch names: source is web3-identity, target is nicholai/ingest-pipeline