2026-02-25 Session Notes
Ingestion Pipeline Cherry-Pick Planning
Nicholai outlined a detailed plan to cherry-pick the document ingestion pipeline from PR #25 (web3-identity branch) onto a clean branch off main. The ingestion pipeline at packages/core/src/ingest/ parses markdown, PDFs, code repositories, Slack/Discord exports, and git history into Signet memories via LLM extraction (Ollama).
Scope
The cherry-pick copies 14 self-contained files from the ingest directory, plus one migration file that is renumbered from 014 to 013. The pipeline includes parsers for multiple formats (markdown, PDF, code, Discord, Slack, entire.io sessions), a chunker, extractors for LLM processing, and provenance tracking for deduplication.
Key Fixes Required
Seven code quality fixes were identified from the code review:
- Prompt injection protection: Wrap untrusted content in XML delimiters across three extractor files
- Type safety: Replace inline `db as {...}` casts with a formal `DatabaseLike` interface
- PDF parser typing: Remove `as any` by defining interfaces for pdf-parse v2 API
- Non-null assertions: Replace `!` with explicit guards in slack-parser.ts
- Error logging: Add warn-level logging for silent memory insert failures
- Validation: Add field presence checks before casting Discord/Slack exports
- Cleanup: Remove unused loop variable in markdown-parser.ts
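Two of the fixes above (prompt injection protection and export validation) could look roughly like the sketch below. This is illustrative only: the tag name `untrusted_content`, the helper names `wrapUntrusted`, `isSlackMessage`, and `parseSlackExport`, and the minimal `SlackMessage` shape are all assumptions, not the code in the extractor or parser files.

```typescript
// Hypothetical delimiter tag; the real extractors may use a different one.
const UNTRUSTED_TAG = "untrusted_content";

// Wrap untrusted text in XML delimiters before it is placed in an LLM prompt.
// Any literal closing tag inside the content is escaped so the content
// cannot terminate the delimited region early.
function wrapUntrusted(content: string): string {
  const closing = /<\/untrusted_content\s*>/gi;
  const safe = content.replace(closing, "&lt;/" + UNTRUSTED_TAG + "&gt;");
  return "<" + UNTRUSTED_TAG + ">\n" + safe + "\n</" + UNTRUSTED_TAG + ">";
}

// Assumed minimal shape of one exported Slack message; real exports carry
// more fields than this.
interface SlackMessage {
  ts: string;
  text: string;
  user?: string;
}

// Field presence check used instead of casting parsed JSON directly.
function isSlackMessage(value: unknown): value is SlackMessage {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  return (
    typeof record.ts === "string" &&
    typeof record.text === "string" &&
    (record.user === undefined || typeof record.user === "string")
  );
}

// Keep only entries that pass the check; anything else is dropped rather
// than cast and trusted.
function parseSlackExport(raw: unknown): SlackMessage[] {
  if (!Array.isArray(raw)) return [];
  return raw.filter(isSlackMessage);
}
```

Escaping the closing tag is one possible design choice; rejecting content that contains the tag would be an equally valid alternative.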
Next Steps
Implementation plan: create branch nicholai/ingest-pipeline off main, copy files, apply all 7 fixes, register migration, then build/typecheck/lint/test to verify.
Technical Notes
- Migration renumbering: main ends at 012-scheduled-tasks, so ingestion becomes 013
- No package.json changes needed (pdf-parse is optional dynamic import)
- No daemon routes or CLI changes included in this cherry-pick
- Branch names: source is web3-identity, target is nicholai/ingest-pipeline
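The migration renumbering rule can be sketched as a small helper. This assumes migrations follow the `NNN-slug` naming seen in 012-scheduled-tasks; the function name `nextMigrationName` and the slug `ingest` are hypothetical and used only for illustration.

```typescript
// Hypothetical helper: derive the next migration name from the names that
// already exist on the target branch. Assumes a three-digit "NNN-slug" scheme.
function nextMigrationName(existing: readonly string[], slug: string): string {
  let highest = 0;
  for (const name of existing) {
    const match = /^(\d{3})-/.exec(name);
    if (match === null) continue; // ignore files that are not migrations
    highest = Math.max(highest, Number(match[1]));
  }
  // Left-pad the next number to three digits.
  let digits = String(highest + 1);
  while (digits.length < 3) digits = "0" + digits;
  return digits + "-" + slug;
}

// main ends at 012, so the cherry-picked migration lands at 013.
const renamed = nextMigrationName(["012-scheduled-tasks"], "ingest");
// renamed === "013-ingest"
```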