
# VoiceInk Real-Time Streaming Voice-to-Text Architecture

## How It Works (HYBRID approach)

1. **While speaking:** The Parakeet EOU model streams partial transcriptions into the active text field in real time (~320 ms latency).
2. **When done speaking:** The streamed preview text is deleted and replaced with a high-accuracy batch transcription (Whisper large-v3-turbo).

This gives the user instant visual feedback while ensuring the final output is accurate.

---

## Files & Their Roles

### 1. `WhisperState.swift` — Orchestrator
The main state machine that coordinates everything. Key streaming methods:
- `startStreamingTranscription()` — Sets up audio callback + starts Parakeet EOU streaming BEFORE recording begins
- `handleStreamingUpdate(newText)` — Receives partial transcripts and pastes them (with differential updates to reduce flicker)
- `finishStreamingTranscription()` — Stops streaming, deletes preview text, returns final text
- `cancelStreamingTranscription()` — Aborts streaming and cleans up
- `handleStreamingCompletion()` — HYBRID: finishes streaming preview, then runs batch transcription for accuracy

Key properties:
- `streamingUpdateTask` — Task consuming the AsyncStream of partial transcripts
- `lastStreamedText` — Tracks what's currently pasted so we can do differential updates
- `isStreamingActive` — Guard flag
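
The differential update in `handleStreamingUpdate(newText)` can be sketched as a pure decision function. This is a minimal illustration, not the actual implementation: `StreamingEdit` and `planStreamingEdit` are hypothetical names, and the real method also performs the paste and delete side effects.

```swift
/// Hypothetical description of what the paster should do for one partial transcript.
enum StreamingEdit: Equatable {
    case unchanged                                // nothing new to show
    case append(String)                           // paste only the new suffix
    case replace(deleteCount: Int, text: String)  // backspace old text, paste new text
}

/// Decide how to turn `lastStreamedText` (already on screen) into `newText`.
func planStreamingEdit(lastStreamedText: String, newText: String) -> StreamingEdit {
    if newText == lastStreamedText { return .unchanged }
    if lastStreamedText.isEmpty { return .append(newText) }
    if newText.hasPrefix(lastStreamedText) {
        // Append mode: the old text is still valid, so only the delta is pasted.
        let delta = String(newText.dropFirst(lastStreamedText.count))
        return .append(delta)
    }
    // Full replace mode: the model revised earlier words.
    return .replace(deleteCount: lastStreamedText.count, text: newText)
}
```

After applying the returned edit, the caller would set `lastStreamedText` to `newText` so the next partial is compared against what is actually on screen.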

### 2. `ParakeetTranscriptionService.swift` — Streaming Engine
Manages the Parakeet EOU (End-of-Utterance) model for real-time transcription:
- `startStreaming(model:)` — Downloads EOU 320ms models if needed, creates `StreamingEouAsrManager`, returns `AsyncStream<String>`
- `streamAudio(samples:frameCount:sampleRate:channels:)` — Called from audio thread, creates mono PCM buffer, feeds to EOU manager
- `finishStreaming()` — Gets final text from EOU manager
- `cancelStreaming()` — Resets EOU manager
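
To illustrate the mono conversion step in `streamAudio`, here is a minimal sketch of a downmix. It assumes interleaved `Float` samples and averages the channels; both are assumptions, since the source only says a mono PCM buffer is created. `downmixToMono` is a hypothetical helper, and the real code builds an `AVAudioPCMBuffer` rather than returning an array.

```swift
/// Average interleaved multi-channel samples down to a single mono channel.
/// Assumes the layout [L0, R0, L1, R1, ...] for stereo input.
func downmixToMono(samples: [Float], frameCount: Int, channels: Int) -> [Float] {
    guard channels > 0, frameCount > 0, samples.count >= frameCount * channels else {
        return []
    }
    if channels == 1 {
        return Array(samples.prefix(frameCount))
    }
    var mono = [Float](repeating: 0, count: frameCount)
    for frame in 0..<frameCount {
        var sum: Float = 0
        for channel in 0..<channels {
            sum += samples[frame * channels + channel]
        }
        mono[frame] = sum / Float(channels)
    }
    return mono
}
```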

Uses `StreamingEouAsrManager` from FluidAudio SDK with:
- 320ms chunk size (balance of accuracy vs latency)
- 1280ms EOU debounce (end-of-utterance detection after ~1.3s silence)
- Partial callback yields to AsyncStream continuation
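
The chunk and debounce figures translate into sample counts as in the sketch below. It only illustrates the arithmetic and is not how FluidAudio buffers audio internally. `ChunkAccumulator` is a hypothetical type, and the 16 kHz sample rate used in the comments is an assumption that the source does not state.

```swift
/// Hypothetical accumulator that groups incoming mono samples into fixed-duration chunks.
struct ChunkAccumulator {
    let samplesPerChunk: Int
    private var pending: [Float] = []

    init(chunkMilliseconds: Int, sampleRate: Int) {
        // For example, 320 ms at an assumed 16 kHz gives 5120 samples per chunk.
        samplesPerChunk = sampleRate * chunkMilliseconds / 1000
    }

    /// Append samples and return every complete chunk that is now available.
    mutating func append(_ samples: [Float]) -> [[Float]] {
        pending.append(contentsOf: samples)
        var chunks: [[Float]] = []
        while samplesPerChunk > 0 && pending.count >= samplesPerChunk {
            chunks.append(Array(pending.prefix(samplesPerChunk)))
            pending.removeFirst(samplesPerChunk)
        }
        return chunks
    }
}

/// A 1280 ms debounce spans this many consecutive 320 ms chunks.
let chunksPerDebounce = 1280 / 320
```

With these numbers the debounce covers four chunks, which matches the "~1.3s silence" figure above.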

### 3. `CursorPaster.swift` — Text Injection
Handles pasting text into the active application:
- `setStreamingMode(enabled)` — Skips clipboard save/restore during streaming (prevents race conditions)
- `pasteAtCursor(text)` — Sets clipboard + simulates Cmd+V
- `deleteCharacters(count)` — Simulates backspace key presses (used to delete old streaming text before pasting updated text)
- `pressEnter()` — Simulates Enter key (for auto-send in Power Mode)
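
The effect of `setStreamingMode(enabled)` can be sketched with an in-memory clipboard. `FakeClipboard` and `SketchPaster` are hypothetical stand-ins, not the real `ClipboardManager` or `CursorPaster`. The sketch restores the clipboard immediately, whereas the real code presumably restores after a delay, which is where rapid successive pastes could race.

```swift
/// Hypothetical stand-in for the system pasteboard.
final class FakeClipboard {
    var contents: String?
}

/// Hypothetical paster that records pastes instead of simulating Cmd+V.
final class SketchPaster {
    private let clipboard: FakeClipboard
    private var streamingMode = false
    private(set) var pasted: [String] = []

    init(clipboard: FakeClipboard) {
        self.clipboard = clipboard
    }

    func setStreamingMode(_ enabled: Bool) {
        streamingMode = enabled
    }

    func pasteAtCursor(_ text: String) {
        let saved = clipboard.contents
        clipboard.contents = text
        pasted.append(text)  // the real code would simulate Cmd+V here
        if !streamingMode {
            // Outside streaming, put the user's clipboard contents back.
            clipboard.contents = saved
        }
    }
}
```

In streaming mode the clipboard is left holding the most recent partial, so a later paste never competes with a pending restore from an earlier one.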

### 4. `ClipboardManager.swift` — Clipboard Helper
Simple clipboard read/write with transient paste support.

### 5. `CoreAudioRecorder.swift` — Audio Capture
Low-level CoreAudio recording with streaming callback:
- `streamingAudioCallback` property — Called on every audio render cycle with raw samples
- The render callback passes each buffer of samples to `streamingAudioCallback`, which routes them to `ParakeetTranscriptionService`

### 6. `Recorder.swift` — Recording Coordinator
Wraps CoreAudioRecorder:
- `setStreamingAudioCallback(callback)` — Stores callback and applies to CoreAudioRecorder
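
A minimal sketch of how the callback could be stored and forwarded follows. The `Sketch…` classes and the callback shape are assumptions for illustration. The real callback most likely receives raw sample pointers on the audio thread rather than a Swift array.

```swift
/// Hypothetical callback shape: samples, sample rate, channel count.
typealias StreamingAudioCallback = (_ samples: [Float], _ sampleRate: Double, _ channels: Int) -> Void

/// Hypothetical stand-in for the low-level recorder.
final class SketchCoreAudioRecorder {
    var streamingAudioCallback: StreamingAudioCallback?

    /// Stands in for one audio render cycle.
    func renderCycle(samples: [Float], sampleRate: Double, channels: Int) {
        streamingAudioCallback?(samples, sampleRate, channels)
    }
}

/// Hypothetical stand-in for the coordinator that wraps the low-level recorder.
final class SketchRecorder {
    private let coreRecorder = SketchCoreAudioRecorder()
    private var storedCallback: StreamingAudioCallback?

    /// Store the callback and apply it to the underlying recorder.
    func setStreamingAudioCallback(_ callback: StreamingAudioCallback?) {
        storedCallback = callback
        coreRecorder.streamingAudioCallback = callback
    }

    /// Test hook that triggers one simulated render cycle.
    func simulateRenderCycle(samples: [Float], sampleRate: Double, channels: Int) {
        coreRecorder.renderCycle(samples: samples, sampleRate: sampleRate, channels: channels)
    }
}
```

Passing `nil` clears the callback, so render cycles after streaming ends do no extra work.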

### 7. `TranscriptionServiceRegistry.swift` — Service Lookup
Lazy-initializes `ParakeetTranscriptionService` and routes transcription requests.
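
Lazy initialization in Swift is typically expressed with `lazy var`. The sketch below assumes a factory closure so that the deferred construction is observable. `SketchRegistry` and `SketchParakeetService` are hypothetical names and do not reflect the real registry's API.

```swift
/// Hypothetical stand-in for the streaming service.
final class SketchParakeetService {}

/// Hypothetical registry that builds the service on first access only.
final class SketchRegistry {
    private let makeParakeetService: () -> SketchParakeetService

    /// Created the first time it is read, then reused.
    private(set) lazy var parakeetService: SketchParakeetService = makeParakeetService()

    init(makeParakeetService: @escaping () -> SketchParakeetService) {
        self.makeParakeetService = makeParakeetService
    }
}
```

Deferring construction this way means users who never pick a Parakeet model do not pay the service's setup cost.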

---

## Data Flow

```
Microphone Audio
    │
    ▼
CoreAudioRecorder (render callback on audio thread)
    │
    ▼ streamingAudioCallback
    │
ParakeetTranscriptionService.streamAudio()
    │  creates mono AVAudioPCMBuffer
    │
    ▼ Task.detached
    │
StreamingEouAsrManager.process(audioBuffer)  [FluidAudio SDK]
    │  320ms chunks → partial transcripts
    │
    ▼ partialCallback
    │
AsyncStream<String>.Continuation.yield(text)
    │
    ▼ for await text in transcriptStream
    │
WhisperState.handleStreamingUpdate(newText)
    │
    ├─ If newText starts with lastStreamedText:
    │     → CursorPaster.pasteAtCursor(deltaOnly)  [APPEND mode - no flicker]
    │
    └─ Otherwise:
          → CursorPaster.deleteCharacters(oldText.count)
          → wait for deletions
          → CursorPaster.pasteAtCursor(newText)     [FULL REPLACE mode]
    │
    ▼ When recording stops
    │
WhisperState.handleStreamingCompletion()
    │
    ├─ finishStreamingTranscription() → deletes preview text
    │
    └─ Runs batch Whisper transcription for accuracy
         → CursorPaster.pasteAtCursor(accurateText)
```
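
The final step of the flow, removing the preview and pasting the batch result, can be sketched against an in-memory text field. `SketchTextField` and `completeHybrid` are hypothetical. The real `handleStreamingCompletion()` is asynchronous and drives the focused application through `CursorPaster`.

```swift
/// Hypothetical in-memory stand-in for the focused text field.
struct SketchTextField {
    var text = ""

    mutating func paste(_ string: String) {
        text += string
    }

    mutating func deleteCharacters(_ count: Int) {
        let removable = max(0, min(count, text.count))
        text.removeLast(removable)
    }
}

/// Delete the streamed preview, then paste the high-accuracy batch transcript.
func completeHybrid(field: inout SketchTextField,
                    lastStreamedText: String,
                    batchTranscribe: () -> String) {
    // Corresponds to finishStreamingTranscription(): remove the preview.
    field.deleteCharacters(lastStreamedText.count)
    // Corresponds to the batch pass: paste the accurate text.
    field.paste(batchTranscribe())
}
```

Only the characters counted in `lastStreamedText` are removed, so text that was in the field before dictation started is left untouched.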

---

## Key Design Decisions

1. **Streaming setup BEFORE recording starts** — Avoids losing early audio
2. **Differential paste (append vs full replace)** — Reduces visual flicker during continuous speech
3. **Streaming mode in CursorPaster** — Skips clipboard save/restore to prevent race conditions
4. **HYBRID batch + stream** — Parakeet streams (6.05% WER) for preview, Whisper batch (2.7% WER) for final accuracy
5. **320ms chunk size** — Larger chunks = fewer corrections/flicker, acceptable latency tradeoff
6. **Inter-keystroke delays in deleteCharacters** — 1.5ms pause every 5 backspaces to prevent keystroke loss
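
The pacing in decision 6 can be sketched as a loop with injected side effects. `pacedDelete` is a hypothetical helper. Whether the real `deleteCharacters` also pauses after the final backspace is an assumption here.

```swift
/// Press backspace `count` times, pausing briefly after every fifth press.
/// `pressBackspace` and `pause` are injected so the pacing can be checked without real key events.
func pacedDelete(count: Int,
                 pressBackspace: () -> Void,
                 pause: (_ seconds: Double) -> Void) {
    guard count > 0 else { return }
    for index in 1...count {
        pressBackspace()
        if index % 5 == 0 {
            pause(0.0015)  // 1.5 ms, per the design note above
        }
    }
}
```

In an app, `pause` could be `{ Thread.sleep(forTimeInterval: $0) }` (requires `import Foundation`) and `pressBackspace` would post the simulated key event.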