reddit-scraper/CLAUDE.md
2026-01-21 05:35:27 -07:00

75 lines
2.0 KiB
Markdown

reddit trend analyzer
===
a CLI tool that scrapes reddit discussions, embeds them with ollama, stores in qdrant, and provides a TUI dashboard for discovering common problems/trends.
running
---
```bash
bun start # run the app
bun dev # run with watch mode
```
prerequisites
---
- ollama running locally with nomic-embed-text model (`ollama pull nomic-embed-text`)
- qdrant accessible at QDRANT_URL (or localhost:6333)
env vars
---
```
QDRANT_URL=https://vectors.biohazardvfx.com
QDRANT_API_KEY=<your-key>
OLLAMA_HOST=http://localhost:11434 # optional, defaults to this
```
architecture
---
```
src/
index.ts # entry point, connection checks, TUI setup
scraper/
reddit.ts # fetch subreddit posts with pagination
comments.ts # fetch comments for each post
types.ts # reddit json response types
embeddings/
ollama.ts # batch embed text with nomic-embed-text (768 dims)
storage/
qdrant.ts # create collection, upsert, search
types.ts # point payload schema
tui/
app.ts # main dashboard, wires everything together
components/
url-input.ts # subreddit url input
progress.ts # scraping/embedding progress bars
stats.ts # collection stats panel
trending.ts # trending topics view
search.ts # semantic search interface
export.ts # export to json/csv
utils/
rate-limit.ts # delay helper for reddit api
text.ts # text preprocessing for embedding
```
keybindings
---
- `q` or `ctrl+c` - quit
- `enter` - start scrape (when url is entered)
- `tab` - switch between url and search inputs
- `e` - export results to json
- `c` - export results to csv
- `r` - refresh stats from qdrant
coding notes
---
- uses @opentui/core standalone (no react/solid)
- reddit rate limiting: 3s delay between requests
- embeddings batched in groups of 10
- qdrant collection: reddit_trends with indexes on subreddit, type, created, score