# Implementation Plan: Claude's Constitution Analysis System

## Overview

A comprehensive quantitative and semantic analysis of Claude's Constitution with an interactive HTML query interface, using Python for analysis and nomic-embed-text via Ollama for semantic embeddings.

---

## Phase 1: Architecture & Data Structures

### File Structure

```
/home/nicholai/.agents/constitution/
├── claudes-constitution.md            # Source document
├── constitution_analysis/
│   ├── analysis/
│   │   ├── main.py                    # Main analysis script
│   │   ├── data_processor.py          # Document parsing & extraction
│   │   ├── quantitative.py            # Statistical analysis
│   │   ├── semantic_analyzer.py       # Embeddings & similarity
│   │   └── metadata_builder.py        # Metadata generation
│   ├── notebooks/
│   │   └── constitution_analysis.ipynb
│   ├── data/
│   │   ├── constitution.db            # SQLite database with embeddings
│   │   ├── variables.json             # Structured variable data
│   │   ├── statistics.json            # Quantitative metrics
│   │   └── embeddings_meta.json       # Embeddings metadata
│   └── web/
│       ├── index.html                 # Main interface
│       ├── css/
│       │   └── styles.css             # Dark mode styles
│       └── js/
│           ├── app.js                 # Main app logic
│           ├── d3-graph.js            # Network visualization
│           └── charts.js              # Statistical charts
```

### Database Schema (SQLite)

```sql
-- Sections
CREATE TABLE sections (
    id INTEGER PRIMARY KEY,
    section_type TEXT,             -- 'document', 'section', 'subsection', 'paragraph'
    parent_id INTEGER,
    title TEXT,
    content TEXT,
    line_start INTEGER,
    line_end INTEGER,
    hierarchy_level INTEGER,
    path TEXT,                     -- e.g., "Overview/Being helpful/Why helpfulness"
    FOREIGN KEY (parent_id) REFERENCES sections(id)
);

-- Variables (behavioral factors)
CREATE TABLE variables (
    id INTEGER PRIMARY KEY,
    name TEXT UNIQUE,              -- e.g., "broadly safe", "honesty"
    category TEXT,                 -- 'core_value', 'priority', 'factor', 'constraint'
    priority_level INTEGER,        -- 1-4, or NULL
    is_hard_constraint BOOLEAN,
    principal_assignment TEXT,     -- 'anthropic', 'operator', 'user', 'all'
    frequency INTEGER DEFAULT 0,
    description TEXT
);

-- Variable occurrences (linking variables to content)
CREATE TABLE variable_occurrences (
    id INTEGER PRIMARY KEY,
    variable_id INTEGER,
    section_id INTEGER,
    sentence_id INTEGER,
    context TEXT,
    FOREIGN KEY (variable_id) REFERENCES variables(id),
    FOREIGN KEY (section_id) REFERENCES sections(id)
);

-- Sentences
CREATE TABLE sentences (
    id INTEGER PRIMARY KEY,
    section_id INTEGER,
    text TEXT,
    sentence_number INTEGER,
    line_number INTEGER,
    FOREIGN KEY (section_id) REFERENCES sections(id)
);

-- Embeddings (hierarchical)
CREATE TABLE embeddings (
    id INTEGER PRIMARY KEY,
    content_id INTEGER,
    content_type TEXT,             -- 'document', 'section', 'sentence', 'variable'
    embedding BLOB,                -- Float32 array
    embedding_dim INTEGER DEFAULT 768,
    chunk_start INTEGER,
    chunk_end INTEGER,
    FOREIGN KEY (content_id) REFERENCES sections(id) ON DELETE CASCADE
);

-- Similarity scores (pre-computed)
CREATE TABLE similarity (
    id INTEGER PRIMARY KEY,
    content_id_1 INTEGER,
    content_id_2 INTEGER,
    similarity_score REAL,
    FOREIGN KEY (content_id_1) REFERENCES sections(id),
    FOREIGN KEY (content_id_2) REFERENCES sections(id)
);

-- Statistics cache
CREATE TABLE statistics (
    id INTEGER PRIMARY KEY,
    metric_name TEXT UNIQUE,
    metric_value REAL,
    json_data TEXT
);
```

---

## Phase 2: Data Extraction Pipeline

### 2.1 Document Parser (`data_processor.py`)

**Inputs:** `claudes-constitution.md`
**Outputs:** Structured data for the database

**Operations:**

1. Parse markdown hierarchy
   - Identify document, sections (`##`), subsections (`###`), paragraphs
   - Extract titles, content, line numbers
   - Build hierarchical tree structure
   - Generate path strings for each section
2. Sentence segmentation
   - Split paragraphs into sentences using NLTK/spaCy
   - Preserve line number references
   - Identify sentence boundaries
3. Variable extraction
   - Extract core values (Broadly Safe, Broadly Ethical, etc.)
   - Extract priority numbers (1. 2. 3. 4.)
   - Extract hard constraints
   - Extract factors mentioned (safety, ethics, helpfulness, etc.)
   - Extract principal assignments (Anthropic, operators, users)
   - Extract behavioral rules and conditions
4. Constraint classification
   - Tag hard constraints vs. soft preferences
   - Identify absolute "never" statements
   - Identify conditional "if-then" structures

### 2.2 Metadata Builder (`metadata_builder.py`)

Metadata per variable:

```json
{
  "id": 1,
  "name": "broadly safe",
  "category": "core_value",
  "priority_level": 1,
  "is_hard_constraint": false,
  "principal_assignment": "all",
  "frequency": 47,
  "mentions": [
    {
      "section_id": 132,
      "section_title": "Claude's core values",
      "sentence_ids": [1234, 1235, 1236],
      "contexts": ["not undermining appropriate human mechanisms...", "most critical property..."]
    }
  ],
  "related_variables": [
    {"id": 2, "name": "broadly ethical", "relationship": "lower_priority"},
    {"id": 3, "name": "anthropic_guidelines", "relationship": "lower_priority"}
  ],
  "definition": "not undermining appropriate human mechanisms to oversee AI during current phase of development",
  "coefficient_score": 0.95,
  "hierarchy_position": "top",
  "weight": 1.0
}
```

`coefficient_score` is calculated from priority and frequency (see Phase 3.3).

---

## Phase 3: Quantitative Analysis (`quantitative.py`)

### 3.1 Token-Level Metrics
- Total tokens per section
- Average tokens per sentence
- Vocabulary size
- Token frequency distribution
- Type-token ratio

### 3.2 TF-IDF Analysis
- Build document-term matrix
- Calculate TF-IDF scores for each variable/term
- Identify key terms per section
- Cross-section term comparison

### 3.3 Priority Weighting

```python
priority_weights = {
    "broadly_safe": 1.0,
    "broadly_ethical": 0.75,
    "anthropic_guidelines": 0.5,
    "genuinely_helpful": 0.25,
}

coefficient_score = (
    priority_weight * 0.6
    + frequency_normalized * 0.3
    + semantic_centrality * 0.1
)
```

### 3.4 Network Centrality Measures
- Build variable co-occurrence graph
- Calculate degree centrality
- Calculate betweenness centrality
- Calculate eigenvector centrality
- Identify hub and authority nodes

### 3.5 Statistical Summaries

```json
{
  "total_variables": 156,
  "core_values": 4,
  "hard_constraints": 6,
  "soft_factors": 146,
  "sections": 47,
  "sentences": 3428,
  "total_tokens": 42156,
  "unique_tokens": 3847,
  "avg_sentence_length": 12.3,
  "priority_distribution": {"priority_1": 1, "priority_2": 1, "priority_3": 1, "priority_4": 1},
  "constraint_distribution": {"hard": 6, "soft": 150},
  "variable_frequency_histogram": {...}
}
```

---

## Phase 4: Semantic Analysis (`semantic_analyzer.py`)

### 4.1 Embedding Generation (via Ollama)

```python
import numpy as np
import ollama

# Generate embeddings using nomic-embed-text
def generate_embedding(text: str) -> np.ndarray:
    response = ollama.embeddings(model='nomic-embed-text', prompt=text)
    return np.array(response['embedding'], dtype=np.float32)
```

### 4.2 Hierarchical Embeddings
1. Document-level: embed the entire constitution
2. Section-level: embed each section
3. Subsection-level: embed each subsection
4. Sentence-level: embed each sentence
5. Variable-level: embed variable descriptions + contexts

### 4.3 Chunking Strategy
- Content under 512 tokens: embed as-is
- Longer content: chunk into ~500-token segments with 50-token overlap
- Store chunk metadata (start, end, parent)

### 4.4 Semantic Similarity
- Compute cosine similarity between all pairs
- Pre-compute for top-K neighbors
- Cache in the `similarity` table

### 4.5 Clustering
- K-means clustering on variable embeddings
- Identify semantic clusters
- Assign cluster IDs to variables

---

## Phase 5: HTML Interface Design

### 5.1 UI Layout (Dark Mode)

```
┌───────────────────────────────────────────────────────────────┐
│ Claude's Constitution Analysis System                         │
├───────────────────────────────────────────────────────────────┤
│ [Search: _____________________]  [Filter ▼]  [Export ▼]       │
├──────────────┬────────────────────────────────────────────────┤
│ Sidebar      │ Main Content Area                              │
│              │                                                │
│ Navigation:  │ Tabbed Interface:                              │
│ • Overview   │ ├─ Variables Table                             │
│ • Variables  │ ├─ Network Graph                               │
│ • Sections   │ ├─ Charts & Metrics                            │
│ • Statistics │ └─ Document Viewer                             │
│ • Search     │                                                │
│              │                                                │
│ Filters:     │                                                │
│ [ ] Core     │                                                │
│ [ ] Hard     │                                                │
│ [ ] Soft     │                                                │
│ [ ] Pri 1-4  │                                                │
└──────────────┴────────────────────────────────────────────────┘
```

### 5.2 Color Palette (Professional Dark Mode)

```css
:root {
    --bg-primary: #0f0f0f;
    --bg-secondary: #1a1a1a;
    --bg-tertiary: #242424;
    --text-primary: #e0e0e0;
    --text-secondary: #a0a0a0;
    --accent-blue: #3b82f6;
    --accent-green: #10b981;
    --accent-orange: #f59e0b;
    --accent-red: #ef4444;
    --border-color: #333333;
    --shadow: rgba(0, 0, 0, 0.5);
}
```

### 5.3 Main Features

A. Overview Dashboard
- Key statistics cards (total variables, constraints, sections, etc.)
- Priority distribution pie chart
- Variable frequency bar chart
- Quick summary metrics

B. Variables Table
- Sortable columns: Name, Category, Priority, Frequency, Coefficient
- Filterable by category, priority level, constraint type
- Click to expand with detailed metadata
- Semantic similarity indicator

C. Network Graph (D3.js)
- Nodes: variables (sized by coefficient)
- Edges: co-occurrence relationships (weighted by frequency)
- Color-coded by priority level
- Interactive: hover details, click to highlight
- Force-directed layout
- Zoom/pan controls

D. Statistical Charts (Chart.js)
- Token frequency histogram
- Sentence length distribution
- TF-IDF heatmap (variables × sections)
- Centrality measures comparison
- Embedding PCA/t-SNE scatter plot

E. Document Viewer
- Hierarchical tree view of the constitution
- Highlight variable mentions
- Click to jump to context
- Inline statistics per section

F.
Full-Text Search
- Real-time search across all content
- Fuzzy matching
- Results ranked by relevance (TF-IDF + semantic similarity)
- Contextual excerpts

---

## Phase 6: Implementation Scripts

### 6.1 Main Analysis Script (`main.py`)

```python
#!/usr/bin/env python3
"""
Main analysis pipeline for Claude's Constitution.

Run this script to perform the full analysis and generate the HTML interface.
The classes used below (DocumentProcessor, DatabaseManager, QuantitativeAnalyzer,
SemanticAnalyzer, MetadataBuilder) are defined in the sibling analysis modules.
"""

def main():
    print("Starting Claude's Constitution Analysis...")

    # Step 1: Parse document
    print("1. Parsing document...")
    processor = DocumentProcessor("claudes-constitution.md")
    sections = processor.parse()
    sentences = processor.extract_sentences()

    # Step 2: Extract variables
    print("2. Extracting variables...")
    variables = processor.extract_variables()
    constraints = processor.classify_constraints()

    # Step 3: Build database
    print("3. Building database...")
    db = DatabaseManager("constitution.db")
    db.create_tables()
    db.populate(sections, sentences, variables, constraints)

    # Step 4: Quantitative analysis
    print("4. Performing quantitative analysis...")
    quant_analyzer = QuantitativeAnalyzer(db)
    tfidf_scores = quant_analyzer.compute_tfidf()
    centrality = quant_analyzer.compute_centrality()
    statistics = quant_analyzer.generate_statistics()

    # Step 5: Generate embeddings
    print("5. Generating semantic embeddings...")
    semantic_analyzer = SemanticAnalyzer(db)
    semantic_analyzer.generate_all_embeddings()
    semantic_analyzer.compute_similarities()

    # Step 6: Build metadata
    print("6. Building metadata...")
    metadata = MetadataBuilder(db, quant_analyzer, semantic_analyzer)
    variables_meta = metadata.build_variable_metadata()

    # Step 7: Export JSON for web
    print("7. Exporting data for web...")
    export_data_for_web(variables_meta, statistics, db)

    # Step 8: Generate HTML
    print("8. Generating HTML interface...")
    generate_html_interface()

    print("\n✓ Analysis complete!")
    print("Open web/index.html in your browser to view results")

if __name__ == "__main__":
    main()
```

### 6.2 Web Data Export

```python
import json

def export_data_for_web(variables_meta, statistics, db):
    """Export all data to JSON files for the web interface."""
    # Variables with full metadata
    with open("data/variables.json", "w") as f:
        json.dump(variables_meta, f, indent=2)

    # Statistics
    with open("data/statistics.json", "w") as f:
        json.dump(statistics, f, indent=2)

    # Sections with embeddings
    sections_data = db.get_sections_with_embeddings()
    with open("data/sections.json", "w") as f:
        json.dump(sections_data, f, indent=2)

    # Network graph data
    graph_data = build_graph_data(variables_meta)
    with open("data/graph.json", "w") as f:
        json.dump(graph_data, f, indent=2)

    # Chart data
    charts_data = prepare_charts_data(statistics, db)
    with open("data/charts.json", "w") as f:
        json.dump(charts_data, f, indent=2)
```

### 6.3 HTML Generator

```python
def generate_html_interface():
    """Generate the complete HTML interface with embedded data."""
    # load_json() and render_template() are project helpers
    # (e.g., render_template backed by a templating engine such as Jinja2)

    # Load all data
    variables = load_json("data/variables.json")
    statistics = load_json("data/statistics.json")
    sections = load_json("data/sections.json")
    graph = load_json("data/graph.json")
    charts = load_json("data/charts.json")

    # Generate HTML
    html_content = render_template(
        "templates/index.html",
        variables=variables,
        statistics=statistics,
        sections=sections,
        graph=graph,
        charts=charts,
    )

    with open("web/index.html", "w") as f:
        f.write(html_content)
```

---

## Phase 7: Execution Plan

### Step-by-Step Execution

```bash
# 1. Create directory structure
mkdir -p constitution_analysis/{analysis,notebooks,data,web/{css,js}}

# 2. Install dependencies (sqlite3 ships with Python, so it is not pip-installed)
pip install nltk spacy numpy pandas scikit-learn networkx
ollama pull nomic-embed-text

# 3. Download NLTK data
python -m nltk.downloader punkt

# 4. Run main analysis script
cd constitution_analysis
python analysis/main.py

# 5. Open in browser
firefox web/index.html
```

---

## Phase 8: Technical Dependencies

### Python Packages

```
# requirements.txt
nltk>=3.8
spacy>=3.7
numpy>=1.24
pandas>=2.0
scikit-learn>=1.3
networkx>=3.2
plotly>=5.18
ollama>=0.1
python-dateutil>=2.8
```

### JavaScript Libraries (via CDN)
- D3.js (v7): network graphs
- Chart.js (v4): statistical charts
- Alpine.js (v3): lightweight interactivity

### System Requirements
- Python 3.10+
- Ollama with the nomic-embed-text model
- 8 GB+ RAM recommended for embeddings
- Modern web browser for the HTML interface

---

## Phase 9: Expected Outputs

### Data Files Generated
1. constitution.db (~50-100 MB with embeddings)
2. variables.json (~500 KB)
3. statistics.json (~50 KB)
4. sections.json (~2 MB)
5. graph.json (~1 MB)
6. charts.json (~500 KB)

### HTML Interface
- Single self-contained HTML file (~5-10 MB with embedded data)
- Fully functional offline
- Queryable search
- Interactive visualizations
- Dark mode, professional design

### Analysis Outputs
- Variable taxonomy with 150+ entries
- Network graph of variable relationships
- Statistical dashboards
- Embedding clusters
- Priority hierarchy visualization

---

## Phase 10: Validation & Testing

### Validation Checklist
- [ ] All variables correctly extracted from the document
- [ ] Priority levels match the document (1-4)
- [ ] Hard constraints accurately identified
- [ ] Embeddings successfully generated for all content
- [ ] Similarity scores computed correctly
- [ ] Database integrity verified
- [ ] HTML loads without errors
- [ ] Search returns relevant results
- [ ] Network graph displays correctly
- [ ] Charts render properly
- [ ] Filters work as expected
- [ ] Dark mode consistent across all elements
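As a reference sketch for the Phase 2.1 markdown hierarchy parsing: the snippet below extracts `##`/`###` headings with their line numbers and builds the slash-joined `path` strings that the `sections` table stores. `parse_hierarchy` is a hypothetical helper name, not part of the plan's `DocumentProcessor` API.

```python
import re

def parse_hierarchy(markdown: str) -> list[dict]:
    # Collect ## (level 2) and ### (level 3) headings; maintain a stack of
    # open headings so each record gets a path like "Overview/Being helpful"
    sections, stack = [], []
    for lineno, line in enumerate(markdown.splitlines(), start=1):
        m = re.match(r"^(#{2,3})\s+(.*)", line)
        if not m:
            continue
        level = len(m.group(1))
        title = m.group(2).strip()
        while stack and stack[-1][0] >= level:
            stack.pop()
        stack.append((level, title))
        sections.append({
            "hierarchy_level": level,
            "title": title,
            "line_start": lineno,
            "path": "/".join(t for _, t in stack),
        })
    return sections

doc = "## Overview\ntext\n### Being helpful\nmore text\n"
print(parse_hierarchy(doc))
```

A real parser would also capture body content and `line_end` by noting where the next heading of the same or higher level begins.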
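The Phase 3.3 coefficient formula can be wrapped in a small function; `coefficient_score` as a callable is a hypothetical shape (the plan only gives the weighted sum), and it assumes `frequency_normalized` and `semantic_centrality` are already scaled to [0, 1].

```python
priority_weights = {
    "broadly_safe": 1.0,
    "broadly_ethical": 0.75,
    "anthropic_guidelines": 0.5,
    "genuinely_helpful": 0.25,
}

def coefficient_score(name: str, frequency_normalized: float,
                      semantic_centrality: float) -> float:
    # Weighted blend from Phase 3.3: priority dominates (0.6),
    # then normalized frequency (0.3), then semantic centrality (0.1)
    return (priority_weights[name] * 0.6
            + frequency_normalized * 0.3
            + semantic_centrality * 0.1)

print(coefficient_score("broadly_safe", 1.0, 1.0))  # → 1.0
```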
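The Phase 4.3 chunking strategy (~500-token segments, 50-token overlap) reduces to a short index computation; `chunk_tokens` is a hypothetical helper operating on a pre-tokenized list, and the `(start, end)` pairs map directly onto the `chunk_start`/`chunk_end` columns of the embeddings table.

```python
def chunk_tokens(tokens: list[str], size: int = 500,
                 overlap: int = 50) -> list[tuple[int, int]]:
    # Return (start, end) index pairs; consecutive chunks share `overlap` tokens
    chunks, start = [], 0
    while start < len(tokens):
        end = min(start + size, len(tokens))
        chunks.append((start, end))
        if end == len(tokens):
            break
        start = end - overlap
    return chunks

print(chunk_tokens(["tok"] * 1200))  # → [(0, 500), (450, 950), (900, 1200)]
```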
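The Phase 4.4 pre-computation of top-K cosine neighbors can be done in a few NumPy lines: normalize the rows once so a matrix product yields all pairwise cosine similarities. `top_k_neighbors` is a hypothetical helper name; a production version would write the scores into the `similarity` table rather than return indices.

```python
import numpy as np

def top_k_neighbors(embs: np.ndarray, k: int) -> list[list[int]]:
    # Row-normalize so the dot product of two rows equals cosine similarity
    normed = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)       # exclude self-matches
    order = np.argsort(-sims, axis=1)     # indices by descending similarity
    return [row[:k].tolist() for row in order]

embs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
print(top_k_neighbors(embs, 1))  # → [[1], [0], [1]]
```

For the all-pairs case this is O(n²) in both time and memory, which is fine at the scale of a few thousand sentences.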
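Finally, a sketch of the Float32-BLOB round-trip implied by the `embeddings` table in Phase 1. To keep it self-contained, a fixed vector stands in for the Phase 4.1 Ollama call, and the table is pared down to the relevant columns; `store_embedding`/`load_embedding` are hypothetical helper names.

```python
import sqlite3
import numpy as np

def store_embedding(conn, content_id, content_type, vec):
    # Serialize a float32 vector to bytes for the BLOB column
    conn.execute(
        "INSERT INTO embeddings (content_id, content_type, embedding, embedding_dim) "
        "VALUES (?, ?, ?, ?)",
        (content_id, content_type, vec.astype(np.float32).tobytes(), len(vec)),
    )

def load_embedding(conn, content_id):
    # Deserialize the BLOB back into a float32 numpy array
    blob, dim = conn.execute(
        "SELECT embedding, embedding_dim FROM embeddings WHERE content_id = ?",
        (content_id,),
    ).fetchone()
    vec = np.frombuffer(blob, dtype=np.float32)
    assert len(vec) == dim
    return vec

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE embeddings (id INTEGER PRIMARY KEY, content_id INTEGER, "
    "content_type TEXT, embedding BLOB, embedding_dim INTEGER DEFAULT 768, "
    "chunk_start INTEGER, chunk_end INTEGER)"
)
fake = np.array([0.1, 0.2, 0.3], dtype=np.float32)  # stand-in for an Ollama embedding
store_embedding(conn, 1, "sentence", fake)
print(np.allclose(load_embedding(conn, 1), fake))  # → True
```

Storing raw little-endian float32 bytes keeps the database compact (768 dims × 4 bytes per row) at the cost of being architecture-tied; JSON-encoding the vector is a portable but larger alternative.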