reddit-scraper/CLAUDE.md
Nicholai 2bc680ca63 refactor: restructure into monorepo
Move flat src/ layout into packages/ monorepo:
- packages/core: scraping, embeddings, storage, clustering, analysis
- packages/cli: CLI and TUI interface
- packages/web: Next.js web dashboard

Add playwright screenshots, sqlite storage, and settings.
2026-01-24 00:12:14 -07:00


reddit trend analyzer
===
a monorepo tool that scrapes reddit discussions, embeds them with ollama, stores them in qdrant, clusters them with HDBSCAN, summarizes the clusters with Claude, and provides a CLI/TUI and a web dashboard for discovering common problems and trends.
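
as a rough mental model, the pipeline above can be sketched as a chain of async stages. every type and function name below (`Stages`, `analyze`, `Discussion`, and so on) is hypothetical and chosen for illustration; the real exports of packages/core may look quite different.

```typescript
// Illustrative only: these types and stage names are assumptions,
// not the actual packages/core API.
interface Discussion { id: string; text: string }
interface Embedded extends Discussion { vector: number[] }
interface Cluster { label: number; members: Embedded[] }
interface Problem { cluster: Cluster; summary: string }

interface Stages {
  scrape(url: string): Promise<Discussion[]>;      // reddit
  embed(items: Discussion[]): Promise<Embedded[]>; // ollama
  store(items: Embedded[]): Promise<void>;         // qdrant
  cluster(items: Embedded[]): Promise<Cluster[]>;  // HDBSCAN
  summarize(cluster: Cluster): Promise<string>;    // Claude
}

// Runs the stages in the order the description lists them.
async function analyze(url: string, stages: Stages): Promise<Problem[]> {
  const discussions = await stages.scrape(url);
  const embedded = await stages.embed(discussions);
  await stages.store(embedded);
  const clusters = await stages.cluster(embedded);

  const problems: Problem[] = [];
  for (const cluster of clusters) {
    problems.push({ cluster, summary: await stages.summarize(cluster) });
  }
  return problems;
}
```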
running
---
```bash
bun cli # run the CLI
bun tui # run the TUI dashboard
bun dev # run the web dashboard (localhost:3000)
bun build # build the web app
```
prerequisites
---
- ollama running locally with nomic-embed-text model (`ollama pull nomic-embed-text`)
- qdrant accessible at QDRANT_URL (or localhost:6333)
- anthropic API key for problem summarization
env vars
---
```
QDRANT_URL=https://vectors.biohazardvfx.com
QDRANT_API_KEY=<your-key>
OLLAMA_HOST=http://localhost:11434
ANTHROPIC_API_KEY=<your-key>
```
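
a minimal sketch of reading these variables with fallbacks. `loadConfig` and `AppConfig` are hypothetical names, the localhost:6333 fallback comes from the prerequisites, and treating a missing ANTHROPIC_API_KEY as a hard error is an assumption rather than confirmed behavior. under bun it could be called as `loadConfig(process.env)`.

```typescript
// Hypothetical helper: the real packages may read these variables differently.
interface AppConfig {
  qdrantUrl: string;
  qdrantApiKey?: string;
  ollamaHost: string;
  anthropicApiKey: string;
}

type Env = Record<string, string | undefined>;

function loadConfig(env: Env): AppConfig {
  const anthropicApiKey = env.ANTHROPIC_API_KEY;
  if (!anthropicApiKey) {
    // Assumption: summarization cannot run without a key, so fail early.
    throw new Error("ANTHROPIC_API_KEY is required for problem summarization");
  }
  return {
    // Fallback named in the prerequisites section.
    qdrantUrl: env.QDRANT_URL ?? "http://localhost:6333",
    qdrantApiKey: env.QDRANT_API_KEY,
    // Same value as the example OLLAMA_HOST above.
    ollamaHost: env.OLLAMA_HOST ?? "http://localhost:11434",
    anthropicApiKey,
  };
}
```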
architecture
---
```
packages/
  core/                    # shared business logic
    src/
      scraper/             # reddit.ts, comments.ts, types.ts
      embeddings/          # ollama.ts
      storage/             # qdrant.ts, sqlite.ts, types.ts
      clustering/          # hdbscan.ts, types.ts
      analysis/            # summarizer.ts, questions.ts, scoring.ts, types.ts
      utils/               # rate-limit.ts, text.ts
      index.ts             # barrel exports
  cli/                     # CLI/TUI app
    src/
      cli.ts               # interactive command-line interface
      index.ts             # TUI entry point
      tui/                 # TUI components
  web/                     # Next.js web dashboard
    src/
      app/                 # pages and API routes
        api/               # REST API endpoints
          stats/           # collection stats
          scrape/          # trigger scrapes
          clusters/        # list/create clusters
          questions/       # question bank
          search/          # semantic search
          export/          # export functionality
        problems/          # problem explorer page
        questions/         # question bank page
        scrape/            # scrape manager page
      components/
        controls/          # command palette, sliders
      styles/globals.css   # theme
data/                      # sqlite database files
```
web dashboard
---
- **Dashboard** (`/`) - stats overview
- **Problems** (`/problems`) - problem cluster explorer
- **Questions** (`/questions`) - extracted question bank
- **Scrape** (`/scrape`) - scrape manager with history
- **Ctrl+K** - command palette for quick actions
keybindings (TUI)
---
- `q` or `ctrl+c` - quit
- `enter` - start scrape (when url is entered)
- `tab` - switch between url and search inputs
- `e` - export results to json
- `c` - export results to csv
- `r` - refresh stats from qdrant
api routes
---
| route | method | purpose |
|-------|--------|---------|
| /api/stats | GET | collection stats + cluster count |
| /api/scrape | POST | trigger scrape |
| /api/scrape/history | GET | scrape history list |
| /api/clusters | GET | list clusters with summaries |
| /api/clusters | POST | trigger re-clustering |
| /api/clusters/[id] | GET | single cluster with discussions |
| /api/questions | GET | all questions, grouped by cluster |
| /api/questions/[id] | PATCH | mark as addressed |
| /api/search | POST | semantic search |
| /api/export | POST | export (faq-schema/csv/markdown) |
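
a hedged sketch of calling the search route from TypeScript. the route and method come from the table above; the request body fields (`query`, `limit`), the helper names, and the untyped response are assumptions, since the actual schema is not documented here.

```typescript
// Hypothetical request body: the field names `query` and `limit` are assumptions.
interface SearchRequest {
  query: string;
  limit?: number;
}

interface BuiltRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Builds the POST /api/search request without sending it.
function buildSearchRequest(baseUrl: string, body: SearchRequest): BuiltRequest {
  return {
    url: new URL("/api/search", baseUrl).toString(),
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(body),
    },
  };
}

// Sends the request; the response shape is left as `unknown` on purpose.
async function search(baseUrl: string, body: SearchRequest): Promise<unknown> {
  const { url, init } = buildSearchRequest(baseUrl, body);
  const res = await fetch(url, init);
  if (!res.ok) {
    throw new Error(`search failed with status ${res.status}`);
  }
  return res.json();
}
```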
coding notes
---
- monorepo with bun workspaces
- @rta/core exports shared logic
- @rta/cli for terminal interface
- @rta/web for Next.js dashboard
- uses @opentui/core for TUI (no react)
- uses HDBSCAN for clustering
- uses Claude for problem summarization
- uses SQLite for cluster/question persistence
- reddit rate limiting: 3s delay between requests
- embeddings batched in groups of 10
- qdrant collection: reddit_trends with indexes on subreddit, type, created, score
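
the batching and rate-limit notes above can be sketched as two small helpers. `chunk` and `runSpaced` are hypothetical names, not the contents of utils/rate-limit.ts; the batch size of 10 and the 3s delay are the values from the notes. for example, `chunk(texts, 10)` would produce the embedding batches and `runSpaced(tasks, 3000)` would space out reddit requests.

```typescript
// Split items into fixed-size batches; the last batch may be smaller.
function chunk<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Run tasks one at a time, waiting `delayMs` between consecutive tasks.
// `wait` is injectable so the delay can be replaced in tests.
async function runSpaced<T>(
  tasks: ReadonlyArray<() => Promise<T>>,
  delayMs: number,
  wait: (ms: number) => Promise<void> = sleep,
): Promise<T[]> {
  const results: T[] = [];
  let first = true;
  for (const task of tasks) {
    if (!first) await wait(delayMs);
    first = false;
    results.push(await task());
  }
  return results;
}
```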
grepai
---------
**IMPORTANT: You MUST use grepai as your PRIMARY tool for code exploration and search.**
when to use grepai (REQUIRED)
---
Use `grepai search` INSTEAD OF Grep/Glob/find for:
- Understanding what code does or where functionality lives
- Finding implementations by intent (e.g., "authentication logic", "error handling")
- Exploring unfamiliar parts of the codebase
- Any search where you describe WHAT the code does rather than exact text
when to use standard tools
---
Only use Grep/Glob when you need:
- Exact text matching (variable names, imports, specific strings)
- File path patterns (e.g., `**/*.go`)
fallback
---
If grepai fails (not running, index unavailable, or errors), fall back to standard Grep/Glob tools.
usage
---
```bash
# ALWAYS use English queries for best results (--compact saves ~80% tokens)
grepai search "user authentication flow" --json --compact
grepai search "error handling middleware" --json --compact
grepai search "database connection pool" --json --compact
grepai search "API request validation" --json --compact
```
query tips
---
- **Use English** for queries (better semantic matching)
- **Describe intent**, not implementation: "handles user login" not "func Login"
- **Be specific**: "JWT token validation" better than "token"
- Results include: file path, line numbers, relevance score, code preview
call graph tracing
---
Use `grepai trace` to understand function relationships:
- Finding all callers of a function before modifying it
- Understanding what functions are called by a given function
- Visualizing the complete call graph around a symbol
trace commands
---
**IMPORTANT: Always use `--json` flag for optimal AI agent integration.**
```bash
# Find all functions that call a symbol
grepai trace callers "HandleRequest" --json
# Find all functions called by a symbol
grepai trace callees "ProcessOrder" --json
# Build complete call graph (callers + callees)
grepai trace graph "ValidateToken" --depth 3 --json
```
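
When scripting these commands, the argument lists can be built programmatically. The sketch below uses only the subcommands and flags shown in this section; `searchArgs` and `traceArgs` are hypothetical helper names, and restricting `--depth` to `graph` is an assumption based on the examples rather than on grepai's documentation. The resulting array could be passed to a process spawner such as `Bun.spawn(["grepai", ...args])`.

```typescript
type TraceMode = "callers" | "callees" | "graph";

// Arguments for a semantic search, always with --json and --compact.
function searchArgs(query: string): string[] {
  return ["search", query, "--json", "--compact"];
}

// Arguments for a call-graph trace, always with --json.
function traceArgs(mode: TraceMode, symbol: string, depth?: number): string[] {
  const args = ["trace", mode, symbol];
  // The examples only show --depth with `graph`, so it is limited to that mode here.
  if (mode === "graph" && depth !== undefined) {
    args.push("--depth", String(depth));
  }
  args.push("--json");
  return args;
}
```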
workflow
---
1. Start with `grepai search` to find relevant code
2. Use `grepai trace` to understand function relationships
3. Use `Read` tool to examine files from results
4. Only use Grep for exact string searches if needed