# Implementation Plan: Claude's Constitution Analysis System

## Overview

A comprehensive quantitative and semantic analysis of Claude's Constitution with an interactive HTML query interface, using Python for analysis and nomic-embed-text via Ollama for semantic embeddings.

---

## Phase 1: Architecture & Data Structures

### File Structure

```
/home/nicholai/.agents/constitution/
├── claudes-constitution.md            # Source document
├── constitution_analysis/
│   ├── analysis/
│   │   ├── main.py                    # Main analysis script
│   │   ├── data_processor.py          # Document parsing & extraction
│   │   ├── quantitative.py            # Statistical analysis
│   │   ├── semantic_analyzer.py       # Embeddings & similarity
│   │   └── metadata_builder.py        # Metadata generation
│   ├── notebooks/
│   │   └── constitution_analysis.ipynb
│   ├── data/
│   │   ├── constitution.db            # SQLite database with embeddings
│   │   ├── variables.json             # Structured variable data
│   │   ├── statistics.json            # Quantitative metrics
│   │   └── embeddings_meta.json       # Embeddings metadata
│   └── web/
│       ├── index.html                 # Main interface
│       ├── css/
│       │   └── styles.css             # Dark mode styles
│       └── js/
│           ├── app.js                 # Main app logic
│           ├── d3-graph.js            # Network visualization
│           └── charts.js              # Statistical charts
```

### Database Schema (SQLite)

```sql
-- Sections
CREATE TABLE sections (
    id INTEGER PRIMARY KEY,
    section_type TEXT,             -- 'document', 'section', 'subsection', 'paragraph'
    parent_id INTEGER,
    title TEXT,
    content TEXT,
    line_start INTEGER,
    line_end INTEGER,
    hierarchy_level INTEGER,
    path TEXT,                     -- e.g., "Overview/Being helpful/Why helpfulness"
    FOREIGN KEY (parent_id) REFERENCES sections(id)
);

-- Variables (behavioral factors)
CREATE TABLE variables (
    id INTEGER PRIMARY KEY,
    name TEXT UNIQUE,              -- e.g., "broadly safe", "honesty"
    category TEXT,                 -- 'core_value', 'priority', 'factor', 'constraint'
    priority_level INTEGER,        -- 1-4, or NULL
    is_hard_constraint BOOLEAN,
    principal_assignment TEXT,     -- 'anthropic', 'operator', 'user', 'all'
    frequency INTEGER DEFAULT 0,
    description TEXT
);

-- Variable occurrences (linking variables to content)
CREATE TABLE variable_occurrences (
    id INTEGER PRIMARY KEY,
    variable_id INTEGER,
    section_id INTEGER,
    sentence_id INTEGER,
    context TEXT,
    FOREIGN KEY (variable_id) REFERENCES variables(id),
    FOREIGN KEY (section_id) REFERENCES sections(id)
);

-- Sentences
CREATE TABLE sentences (
    id INTEGER PRIMARY KEY,
    section_id INTEGER,
    text TEXT,
    sentence_number INTEGER,
    line_number INTEGER,
    FOREIGN KEY (section_id) REFERENCES sections(id)
);

-- Embeddings (hierarchical)
CREATE TABLE embeddings (
    id INTEGER PRIMARY KEY,
    content_id INTEGER,
    content_type TEXT,             -- 'document', 'section', 'sentence', 'variable'
    embedding BLOB,                -- Float32 array
    embedding_dim INTEGER DEFAULT 768,
    chunk_start INTEGER,
    chunk_end INTEGER,
    FOREIGN KEY (content_id) REFERENCES sections(id) ON DELETE CASCADE
);

-- Similarity scores (pre-computed)
CREATE TABLE similarity (
    id INTEGER PRIMARY KEY,
    content_id_1 INTEGER,
    content_id_2 INTEGER,
    similarity_score REAL,
    FOREIGN KEY (content_id_1) REFERENCES sections(id),
    FOREIGN KEY (content_id_2) REFERENCES sections(id)
);

-- Statistics cache
CREATE TABLE statistics (
    id INTEGER PRIMARY KEY,
    metric_name TEXT UNIQUE,
    metric_value REAL,
    json_data TEXT
);
```

---

## Phase 2: Data Extraction Pipeline

### 2.1 Document Parser (`data_processor.py`)

**Inputs:** `claudes-constitution.md`
**Outputs:** Structured data for the database

**Operations:**

1. Parse markdown hierarchy
   - Identify document, sections (`##`), subsections (`###`), paragraphs
   - Extract titles, content, line numbers
   - Build hierarchical tree structure
   - Generate path strings for each section
2. Sentence segmentation
   - Split paragraphs into sentences using NLTK/spaCy
   - Preserve line number references
   - Identify sentence boundaries
3. Variable extraction
   - Extract core values (Broadly Safe, Broadly Ethical, etc.)
   - Extract priority numbers (1. 2. 3. 4.)
   - Extract hard constraints
   - Extract factors mentioned (safety, ethics, helpfulness, etc.)
   - Extract principal assignments (Anthropic, operators, users)
   - Extract behavioral rules and conditions
4. Constraint classification
   - Tag hard constraints vs. soft preferences
   - Identify absolute "never" statements
   - Identify conditional "if-then" structures

### 2.2 Metadata Builder (`metadata_builder.py`)

Metadata per variable:

```json
{
  "id": 1,
  "name": "broadly safe",
  "category": "core_value",
  "priority_level": 1,
  "is_hard_constraint": false,
  "principal_assignment": "all",
  "frequency": 47,
  "mentions": [
    {
      "section_id": 132,
      "section_title": "Claude's core values",
      "sentence_ids": [1234, 1235, 1236],
      "contexts": ["not undermining appropriate human mechanisms...", "most critical property..."]
    }
  ],
  "related_variables": [
    {"id": 2, "name": "broadly ethical", "relationship": "lower_priority"},
    {"id": 3, "name": "anthropic_guidelines", "relationship": "lower_priority"}
  ],
  "definition": "not undermining appropriate human mechanisms to oversee AI during current phase of development",
  "coefficient_score": 0.95,
  "hierarchy_position": "top",
  "weight": 1.0
}
```

`coefficient_score` is calculated from priority and frequency (see Phase 3.3).

---

## Phase 3: Quantitative Analysis (`quantitative.py`)

### 3.1 Token-Level Metrics
- Total tokens per section
- Average tokens per sentence
- Vocabulary size
- Token frequency distribution
- Type-token ratio

### 3.2 TF-IDF Analysis
- Build document-term matrix
- Calculate TF-IDF scores for each variable/term
- Identify key terms per section
- Cross-section term comparison

### 3.3 Priority Weighting

```python
priority_weights = {
    "broadly_safe": 1.0,
    "broadly_ethical": 0.75,
    "anthropic_guidelines": 0.5,
    "genuinely_helpful": 0.25,
}

coefficient_score = (
    priority_weight * 0.6
    + frequency_normalized * 0.3
    + semantic_centrality * 0.1
)
```

### 3.4 Network Centrality Measures
- Build variable co-occurrence graph
- Calculate degree centrality
- Calculate betweenness centrality
- Calculate eigenvector centrality
- Identify hub and authority nodes

### 3.5 Statistical Summaries

```json
{
  "total_variables": 156,
  "core_values": 4,
  "hard_constraints": 6,
  "soft_factors": 146,
  "sections": 47,
  "sentences": 3428,
  "total_tokens": 42156,
  "unique_tokens": 3847,
  "avg_sentence_length": 12.3,
  "priority_distribution": {"priority_1": 1, "priority_2": 1, "priority_3": 1, "priority_4": 1},
  "constraint_distribution": {"hard": 6, "soft": 150},
  "variable_frequency_histogram": {...}
}
```

---

## Phase 4: Semantic Analysis (`semantic_analyzer.py`)

### 4.1 Embedding Generation (via Ollama)

```python
import numpy as np
import ollama

# Generate embeddings using nomic-embed-text
def generate_embedding(text: str) -> np.ndarray:
    response = ollama.embeddings(model='nomic-embed-text', prompt=text)
    return np.array(response['embedding'], dtype=np.float32)
```

### 4.2 Hierarchical Embeddings
1. Document-level: embed the entire constitution
2. Section-level: embed each section
3. Subsection-level: embed each subsection
4. Sentence-level: embed each sentence
5. Variable-level: embed variable descriptions + contexts

### 4.3 Chunking Strategy
- Content under 512 tokens: embed as-is
- Longer content: chunk into ~500-token segments with 50-token overlap
- Store chunk metadata (start, end, parent)

### 4.4 Semantic Similarity
- Compute cosine similarity between all pairs
- Pre-compute for top-K neighbors
- Cache in the `similarity` table

### 4.5 Clustering
- K-means clustering on variable embeddings
- Identify semantic clusters
- Assign cluster IDs to variables

---

## Phase 5: HTML Interface Design

### 5.1 UI Layout (Dark Mode)

```
┌───────────────────────────────────────────────────────────────┐
│ Claude's Constitution Analysis System                         │
├───────────────────────────────────────────────────────────────┤
│ [Search: _____________________]  [Filter ▼]  [Export ▼]       │
├──────────────┬────────────────────────────────────────────────┤
│ Sidebar      │ Main Content Area                              │
│              │                                                │
│ Navigation:  │ Tabbed Interface:                              │
│ • Overview   │ ├─ Variables Table                             │
│ • Variables  │ ├─ Network Graph                               │
│ • Sections   │ ├─ Charts & Metrics                            │
│ • Statistics │ └─ Document Viewer                             │
│ • Search     │                                                │
│              │                                                │
│ Filters:     │                                                │
│ [ ] Core     │                                                │
│ [ ] Hard     │                                                │
│ [ ] Soft     │                                                │
│ [ ] Pri 1-4  │                                                │
└──────────────┴────────────────────────────────────────────────┘
```

### 5.2 Color Palette (Professional Dark Mode)

```css
:root {
    --bg-primary: #0f0f0f;
    --bg-secondary: #1a1a1a;
    --bg-tertiary: #242424;
    --text-primary: #e0e0e0;
    --text-secondary: #a0a0a0;
    --accent-blue: #3b82f6;
    --accent-green: #10b981;
    --accent-orange: #f59e0b;
    --accent-red: #ef4444;
    --border-color: #333333;
    --shadow: rgba(0, 0, 0, 0.5);
}
```

### 5.3 Main Features

A. Overview Dashboard
- Key statistics cards (total variables, constraints, sections, etc.)
- Priority distribution pie chart
- Variable frequency bar chart
- Quick summary metrics

B. Variables Table
- Sortable columns: Name, Category, Priority, Frequency, Coefficient
- Filterable by category, priority level, constraint type
- Click to expand with detailed metadata
- Semantic similarity indicator

C. Network Graph (D3.js)
- Nodes: variables (sized by coefficient)
- Edges: co-occurrence relationships (weighted by frequency)
- Color-coded by priority level
- Interactive: hover details, click to highlight
- Force-directed layout
- Zoom/pan controls

D. Statistical Charts (Chart.js)
- Token frequency histogram
- Sentence length distribution
- TF-IDF heatmap (variables × sections)
- Centrality measures comparison
- Embedding PCA/t-SNE scatter plot

E. Document Viewer
- Hierarchical tree view of the constitution
- Highlight variable mentions
- Click to jump to context
- Inline statistics per section

F.
Full-Text Search
- Real-time search across all content
- Fuzzy matching
- Results ranked by relevance (TF-IDF + semantic similarity)
- Contextual excerpts

---

## Phase 6: Implementation Scripts

### 6.1 Main Analysis Script (`main.py`)

```python
#!/usr/bin/env python3
"""
Main analysis pipeline for Claude's Constitution.

Run this script to perform the full analysis and generate the HTML interface.
The classes used below (DocumentProcessor, DatabaseManager, QuantitativeAnalyzer,
SemanticAnalyzer, MetadataBuilder) are defined in the sibling analysis modules.
"""

def main():
    print("Starting Claude's Constitution Analysis...")

    # Step 1: Parse document
    print("1. Parsing document...")
    processor = DocumentProcessor("claudes-constitution.md")
    sections = processor.parse()
    sentences = processor.extract_sentences()

    # Step 2: Extract variables
    print("2. Extracting variables...")
    variables = processor.extract_variables()
    constraints = processor.classify_constraints()

    # Step 3: Build database
    print("3. Building database...")
    db = DatabaseManager("constitution.db")
    db.create_tables()
    db.populate(sections, sentences, variables, constraints)

    # Step 4: Quantitative analysis
    print("4. Performing quantitative analysis...")
    quant_analyzer = QuantitativeAnalyzer(db)
    tfidf_scores = quant_analyzer.compute_tfidf()
    centrality = quant_analyzer.compute_centrality()
    statistics = quant_analyzer.generate_statistics()

    # Step 5: Generate embeddings
    print("5. Generating semantic embeddings...")
    semantic_analyzer = SemanticAnalyzer(db)
    semantic_analyzer.generate_all_embeddings()
    semantic_analyzer.compute_similarities()

    # Step 6: Build metadata
    print("6. Building metadata...")
    metadata = MetadataBuilder(db, quant_analyzer, semantic_analyzer)
    variables_meta = metadata.build_variable_metadata()

    # Step 7: Export JSON for web
    print("7. Exporting data for web...")
    export_data_for_web(variables_meta, statistics, db)

    # Step 8: Generate HTML
    print("8. Generating HTML interface...")
    generate_html_interface()

    print("\n✓ Analysis complete!")
    print("Open web/index.html in your browser to view results")

if __name__ == "__main__":
    main()
```

### 6.2 Web Data Export

```python
import json

def export_data_for_web(variables_meta, statistics, db):
    """Export all data to JSON files for the web interface."""
    # Variables with full metadata
    with open("data/variables.json", "w") as f:
        json.dump(variables_meta, f, indent=2)

    # Statistics
    with open("data/statistics.json", "w") as f:
        json.dump(statistics, f, indent=2)

    # Sections with embeddings
    sections_data = db.get_sections_with_embeddings()
    with open("data/sections.json", "w") as f:
        json.dump(sections_data, f, indent=2)

    # Network graph data
    graph_data = build_graph_data(variables_meta)
    with open("data/graph.json", "w") as f:
        json.dump(graph_data, f, indent=2)

    # Chart data
    charts_data = prepare_charts_data(statistics, db)
    with open("data/charts.json", "w") as f:
        json.dump(charts_data, f, indent=2)
```

### 6.3 HTML Generator

```python
def generate_html_interface():
    """Generate the complete HTML interface with embedded data."""
    # load_json() and render_template() are project helpers
    # (e.g., render_template backed by a templating engine such as Jinja2)

    # Load all data
    variables = load_json("data/variables.json")
    statistics = load_json("data/statistics.json")
    sections = load_json("data/sections.json")
    graph = load_json("data/graph.json")
    charts = load_json("data/charts.json")

    # Generate HTML
    html_content = render_template(
        "templates/index.html",
        variables=variables,
        statistics=statistics,
        sections=sections,
        graph=graph,
        charts=charts,
    )

    with open("web/index.html", "w") as f:
        f.write(html_content)
```

---

## Phase 7: Execution Plan

### Step-by-Step Execution

```bash
# 1. Create directory structure
mkdir -p constitution_analysis/{analysis,notebooks,data,web/{css,js}}

# 2. Install dependencies (sqlite3 ships with Python, so it is not pip-installed)
pip install nltk spacy numpy pandas scikit-learn networkx
ollama pull nomic-embed-text

# 3. Download NLTK data
python -m nltk.downloader punkt

# 4. Run main analysis script
cd constitution_analysis
python analysis/main.py

# 5. Open in browser
firefox web/index.html
```

---

## Phase 8: Technical Dependencies

### Python Packages

```
# requirements.txt
nltk>=3.8
spacy>=3.7
numpy>=1.24
pandas>=2.0
scikit-learn>=1.3
networkx>=3.2
plotly>=5.18
ollama>=0.1
python-dateutil>=2.8
```

### JavaScript Libraries (via CDN)
- D3.js (v7): network graphs
- Chart.js (v4): statistical charts
- Alpine.js (v3): lightweight interactivity

### System Requirements
- Python 3.10+
- Ollama with the nomic-embed-text model
- 8 GB+ RAM recommended for embeddings
- Modern web browser for the HTML interface

---

## Phase 9: Expected Outputs

### Data Files Generated
1. constitution.db (~50-100 MB with embeddings)
2. variables.json (~500 KB)
3. statistics.json (~50 KB)
4. sections.json (~2 MB)
5. graph.json (~1 MB)
6. charts.json (~500 KB)

### HTML Interface
- Single self-contained HTML file (~5-10 MB with embedded data)
- Fully functional offline
- Queryable search
- Interactive visualizations
- Dark mode, professional design

### Analysis Outputs
- Variable taxonomy with 150+ entries
- Network graph of variable relationships
- Statistical dashboards
- Embedding clusters
- Priority hierarchy visualization

---

## Phase 10: Validation & Testing

### Validation Checklist
- [ ] All variables correctly extracted from the document
- [ ] Priority levels match the document (1-4)
- [ ] Hard constraints accurately identified
- [ ] Embeddings successfully generated for all content
- [ ] Similarity scores computed correctly
- [ ] Database integrity verified
- [ ] HTML loads without errors
- [ ] Search returns relevant results
- [ ] Network graph displays correctly
- [ ] Charts render properly
- [ ] Filters work as expected
- [ ] Dark mode consistent across all elements
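As a reference sketch for the Phase 2.1 markdown hierarchy parsing: the snippet below extracts `##`/`###` headings with their line numbers and builds the slash-joined `path` strings that the `sections` table stores. `parse_hierarchy` is a hypothetical helper name, not part of the plan's `DocumentProcessor` API.

```python
import re

def parse_hierarchy(markdown: str) -> list[dict]:
    # Collect ## (level 2) and ### (level 3) headings; maintain a stack of
    # open headings so each record gets a path like "Overview/Being helpful"
    sections, stack = [], []
    for lineno, line in enumerate(markdown.splitlines(), start=1):
        m = re.match(r"^(#{2,3})\s+(.*)", line)
        if not m:
            continue
        level = len(m.group(1))
        title = m.group(2).strip()
        while stack and stack[-1][0] >= level:
            stack.pop()
        stack.append((level, title))
        sections.append({
            "hierarchy_level": level,
            "title": title,
            "line_start": lineno,
            "path": "/".join(t for _, t in stack),
        })
    return sections

doc = "## Overview\ntext\n### Being helpful\nmore text\n"
print(parse_hierarchy(doc))
```

A real parser would also capture body content and `line_end` by noting where the next heading of the same or higher level begins.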
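The Phase 3.3 coefficient formula can be wrapped in a small function; `coefficient_score` as a callable is a hypothetical shape (the plan only gives the weighted sum), and it assumes `frequency_normalized` and `semantic_centrality` are already scaled to [0, 1].

```python
priority_weights = {
    "broadly_safe": 1.0,
    "broadly_ethical": 0.75,
    "anthropic_guidelines": 0.5,
    "genuinely_helpful": 0.25,
}

def coefficient_score(name: str, frequency_normalized: float,
                      semantic_centrality: float) -> float:
    # Weighted blend from Phase 3.3: priority dominates (0.6),
    # then normalized frequency (0.3), then semantic centrality (0.1)
    return (priority_weights[name] * 0.6
            + frequency_normalized * 0.3
            + semantic_centrality * 0.1)

print(coefficient_score("broadly_safe", 1.0, 1.0))  # → 1.0
```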
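The Phase 4.3 chunking strategy (~500-token segments, 50-token overlap) reduces to a short index computation; `chunk_tokens` is a hypothetical helper operating on a pre-tokenized list, and the `(start, end)` pairs map directly onto the `chunk_start`/`chunk_end` columns of the embeddings table.

```python
def chunk_tokens(tokens: list[str], size: int = 500,
                 overlap: int = 50) -> list[tuple[int, int]]:
    # Return (start, end) index pairs; consecutive chunks share `overlap` tokens
    chunks, start = [], 0
    while start < len(tokens):
        end = min(start + size, len(tokens))
        chunks.append((start, end))
        if end == len(tokens):
            break
        start = end - overlap
    return chunks

print(chunk_tokens(["tok"] * 1200))  # → [(0, 500), (450, 950), (900, 1200)]
```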
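The Phase 4.4 pre-computation of top-K cosine neighbors can be done in a few NumPy lines: normalize the rows once so a matrix product yields all pairwise cosine similarities. `top_k_neighbors` is a hypothetical helper name; a production version would write the scores into the `similarity` table rather than return indices.

```python
import numpy as np

def top_k_neighbors(embs: np.ndarray, k: int) -> list[list[int]]:
    # Row-normalize so the dot product of two rows equals cosine similarity
    normed = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)       # exclude self-matches
    order = np.argsort(-sims, axis=1)     # indices by descending similarity
    return [row[:k].tolist() for row in order]

embs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
print(top_k_neighbors(embs, 1))  # → [[1], [0], [1]]
```

For the all-pairs case this is O(n²) in both time and memory, which is fine at the scale of a few thousand sentences.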
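Finally, a sketch of the Float32-BLOB round-trip implied by the `embeddings` table in Phase 1. To keep it self-contained, a fixed vector stands in for the Phase 4.1 Ollama call, and the table is pared down to the relevant columns; `store_embedding`/`load_embedding` are hypothetical helper names.

```python
import sqlite3
import numpy as np

def store_embedding(conn, content_id, content_type, vec):
    # Serialize a float32 vector to bytes for the BLOB column
    conn.execute(
        "INSERT INTO embeddings (content_id, content_type, embedding, embedding_dim) "
        "VALUES (?, ?, ?, ?)",
        (content_id, content_type, vec.astype(np.float32).tobytes(), len(vec)),
    )

def load_embedding(conn, content_id):
    # Deserialize the BLOB back into a float32 numpy array
    blob, dim = conn.execute(
        "SELECT embedding, embedding_dim FROM embeddings WHERE content_id = ?",
        (content_id,),
    ).fetchone()
    vec = np.frombuffer(blob, dtype=np.float32)
    assert len(vec) == dim
    return vec

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE embeddings (id INTEGER PRIMARY KEY, content_id INTEGER, "
    "content_type TEXT, embedding BLOB, embedding_dim INTEGER DEFAULT 768, "
    "chunk_start INTEGER, chunk_end INTEGER)"
)
fake = np.array([0.1, 0.2, 0.3], dtype=np.float32)  # stand-in for an Ollama embedding
store_embedding(conn, 1, "sentence", fake)
print(np.allclose(load_embedding(conn, 1), fake))  # → True
```

Storing raw little-endian float32 bytes keeps the database compact (768 dims × 4 bytes per row) at the cost of being architecture-tied; JSON-encoding the vector is a portable but larger alternative.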