clawdbot-workspace/voiceink-streaming-code/STREAMING_ARCHITECTURE.md
2026-02-06 23:01:30 -05:00

# VoiceInk Real-Time Streaming Voice-to-Text Architecture
## How It Works (HYBRID approach)
1. **While speaking:** The Parakeet EOU model streams partial transcriptions into the active text field in real time (~320ms latency).
2. **When done speaking:** The streamed preview text is deleted, and a high-accuracy batch transcription (Whisper large-v3-turbo) replaces it.

This gives the user instant visual feedback while ensuring the final output is accurate.

---
## Files & Their Roles
### 1. `WhisperState.swift` — Orchestrator
The main state machine that coordinates everything. Key streaming methods:
- `startStreamingTranscription()` — Sets up audio callback + starts Parakeet EOU streaming BEFORE recording begins
- `handleStreamingUpdate(newText)` — Receives partial transcripts and pastes them (with differential updates to reduce flicker)
- `finishStreamingTranscription()` — Stops streaming, deletes preview text, returns final text
- `cancelStreamingTranscription()` — Aborts streaming and cleans up
- `handleStreamingCompletion()` — HYBRID: finishes streaming preview, then runs batch transcription for accuracy

Key properties:
- `streamingUpdateTask` — Task consuming the AsyncStream of partial transcripts
- `lastStreamedText` — Tracks what's currently pasted so we can do differential updates
- `isStreamingActive` — Guard flag
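
The append-versus-replace decision behind `handleStreamingUpdate` can be sketched as a pure function. This is a minimal illustration, not VoiceInk's actual code: `StreamingEdit` and `planStreamingEdit` are hypothetical names, and the real method also performs the paste and backspace side effects and updates `lastStreamedText`.

```swift
/// Hypothetical description of the edit needed to bring the text field
/// from the previously pasted preview to the latest partial transcript.
enum StreamingEdit: Equatable {
    case unchanged                                  // nothing to do
    case append(String)                             // paste only the new suffix
    case replace(deleteCount: Int, insert: String)  // backspace, then paste everything
}

/// Decides how to update the field, given what is currently pasted
/// (`previous`) and the newest partial transcript (`updated`).
func planStreamingEdit(previous: String, updated: String) -> StreamingEdit {
    if updated == previous { return .unchanged }
    if previous.isEmpty { return .append(updated) }
    if updated.hasPrefix(previous) {
        // The model only extended the text: paste the delta, no deletion needed.
        return .append(String(updated.dropFirst(previous.count)))
    }
    // The model revised earlier words: delete the old preview and paste afresh.
    return .replace(deleteCount: previous.count, insert: updated)
}
```

Keeping the decision separate from the keystroke simulation makes the flicker-reducing rule easy to unit-test without a focused text field.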
### 2. `ParakeetTranscriptionService.swift` — Streaming Engine
Manages the Parakeet EOU (End-of-Utterance) model for real-time transcription:
- `startStreaming(model:)` — Downloads EOU 320ms models if needed, creates `StreamingEouAsrManager`, returns `AsyncStream<String>`
- `streamAudio(samples:frameCount:sampleRate:channels:)` — Called from audio thread, creates mono PCM buffer, feeds to EOU manager
- `finishStreaming()` — Gets final text from EOU manager
- `cancelStreaming()` — Resets EOU manager

Uses `StreamingEouAsrManager` from FluidAudio SDK with:
- 320ms chunk size (balance of accuracy vs latency)
- 1280ms EOU debounce (end-of-utterance detection after ~1.3s silence)
- Partial callback yields to AsyncStream continuation
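
`streamAudio(samples:frameCount:sampleRate:channels:)` is described as producing a mono buffer. One common way to do that, shown here as a sketch and not necessarily what VoiceInk does, is to average the channels of each interleaved frame. The function name `downmixToMono` and the interleaved sample layout are assumptions; the real method wraps its result in an `AVAudioPCMBuffer`, which is omitted here to keep the example free of platform frameworks.

```swift
/// Averages interleaved multi-channel samples into a single mono channel.
/// Assumes samples are laid out frame by frame: [L0, R0, L1, R1, ...].
func downmixToMono(interleaved samples: [Float], channels: Int) -> [Float] {
    // Mono (or invalid channel counts) pass through unchanged.
    guard channels > 1 else { return samples }

    // Any trailing partial frame is dropped by the integer division.
    let frameCount = samples.count / channels
    var mono = [Float](repeating: 0, count: frameCount)

    for frame in 0..<frameCount {
        var sum: Float = 0
        for channel in 0..<channels {
            sum += samples[frame * channels + channel]
        }
        mono[frame] = sum / Float(channels)
    }
    return mono
}
```

Because this work is triggered from the audio thread, a production version would likely avoid allocating a fresh array on every call; the sketch favors clarity over real-time safety.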
### 3. `CursorPaster.swift` — Text Injection
Handles pasting text into the active application:
- `setStreamingMode(enabled)` — Skips clipboard save/restore during streaming (prevents race conditions)
- `pasteAtCursor(text)` — Sets clipboard + simulates Cmd+V
- `deleteCharacters(count)` — Simulates backspace key presses (used to delete old streaming text before pasting updated text)
- `pressEnter()` — Simulates Enter key (for auto-send in Power Mode)
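
The effect of `setStreamingMode` on clipboard handling can be illustrated with an in-memory model. This is a sketch under stated assumptions: `FakeClipboard` and `PasterSketch` are hypothetical stand-ins, the simulated Cmd+V is replaced by recording the pasted text, and the restore happens immediately instead of after whatever delay the real implementation uses.

```swift
/// Hypothetical in-memory stand-in for the system clipboard.
final class FakeClipboard {
    var contents: String?
    init(contents: String? = nil) { self.contents = contents }
}

/// Hypothetical paster that records "pastes" instead of simulating Cmd+V.
final class PasterSketch {
    private let clipboard: FakeClipboard
    private var streamingMode = false
    private(set) var pasted: [String] = []

    init(clipboard: FakeClipboard) { self.clipboard = clipboard }

    func setStreamingMode(_ enabled: Bool) { streamingMode = enabled }

    func pasteAtCursor(_ text: String) {
        let saved = clipboard.contents
        clipboard.contents = text
        pasted.append(text)  // stands in for the simulated Cmd+V

        // Outside streaming mode, put the user's clipboard back.
        // In streaming mode the restore is skipped entirely.
        if !streamingMode {
            clipboard.contents = saved
        }
    }
}
```

If the real restore runs after a delay, as is typical for paste helpers, rapid streaming pastes could overlap with a pending restore and paste stale clipboard contents; skipping save/restore while streaming sidesteps that, at the cost of leaving the last preview on the clipboard until streaming ends.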
### 4. `ClipboardManager.swift` — Clipboard Helper
Simple clipboard read/write with transient paste support.
### 5. `CoreAudioRecorder.swift` — Audio Capture
Low-level CoreAudio recording with streaming callback:
- `streamingAudioCallback` property — Called on every audio render cycle with raw samples
- The render callback passes each buffer of samples to `streamingAudioCallback`, which forwards them to `ParakeetTranscriptionService`
### 6. `Recorder.swift` — Recording Coordinator
Wraps CoreAudioRecorder:
- `setStreamingAudioCallback(callback)` — Stores callback and applies to CoreAudioRecorder
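
Storing the callback in the coordinator, not only on the low-level recorder, is what allows streaming to be configured before a recording exists. The sketch below illustrates that pattern with hypothetical types (`CoreAudioRecorderSketch`, `RecorderSketch`) and an assumed callback signature of `([Float]) -> Void`; the real callback also carries frame count, sample rate, and channel count.

```swift
/// Assumed callback shape for illustration only.
typealias StreamingAudioCallback = ([Float]) -> Void

/// Hypothetical stand-in for the low-level recorder.
final class CoreAudioRecorderSketch {
    var streamingAudioCallback: StreamingAudioCallback?

    /// Stands in for one CoreAudio render cycle delivering samples.
    func simulateRenderCycle(samples: [Float]) {
        streamingAudioCallback?(samples)
    }
}

/// Hypothetical coordinator that owns the low-level recorder.
final class RecorderSketch {
    private var callback: StreamingAudioCallback?
    private var recorder: CoreAudioRecorderSketch?

    /// Remembers the callback and applies it to a live recorder, if any.
    func setStreamingAudioCallback(_ callback: StreamingAudioCallback?) {
        self.callback = callback
        recorder?.streamingAudioCallback = callback
    }

    /// Creates the low-level recorder and hands it the stored callback,
    /// so the very first render cycle is already routed to streaming.
    func startRecording() {
        let newRecorder = CoreAudioRecorderSketch()
        newRecorder.streamingAudioCallback = callback
        recorder = newRecorder
    }

    func simulateRenderCycle(samples: [Float]) {
        recorder?.simulateRenderCycle(samples: samples)
    }
}
```

Setting the callback first and starting the recording second mirrors the ordering the document calls out for avoiding lost early audio.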
### 7. `TranscriptionServiceRegistry.swift` — Service Lookup
Lazy-initializes `ParakeetTranscriptionService` and routes transcription requests.

---
## Data Flow
```
Microphone Audio
    │
    ▼
CoreAudioRecorder (render callback on audio thread)
    │
    ▼ streamingAudioCallback
ParakeetTranscriptionService.streamAudio()
    │ creates mono AVAudioPCMBuffer
    ▼ Task.detached
StreamingEouAsrManager.process(audioBuffer)  [FluidAudio SDK]
    │ 320ms chunks → partial transcripts
    ▼ partialCallback
AsyncStream<String>.Continuation.yield(text)
    │
    ▼ for await text in transcriptStream
WhisperState.handleStreamingUpdate(newText)
    ├─ If newText starts with lastStreamedText:
    │    → CursorPaster.pasteAtCursor(deltaOnly)  [APPEND mode - no flicker]
    └─ Otherwise:
         → CursorPaster.deleteCharacters(oldText.count)
         → wait for deletions
         → CursorPaster.pasteAtCursor(newText)  [FULL REPLACE mode]
    │
    ▼ When recording stops
WhisperState.handleStreamingCompletion()
    ├─ finishStreamingTranscription() → deletes preview text
    └─ Runs batch Whisper transcription for accuracy
         → CursorPaster.pasteAtCursor(accurateText)
```
---
## Key Design Decisions
1. **Streaming setup BEFORE recording starts** — Avoids losing early audio
2. **Differential paste (append vs full replace)** — Reduces visual flicker during continuous speech
3. **Streaming mode in CursorPaster** — Skips clipboard save/restore to prevent race conditions
4. **HYBRID batch + stream** — Parakeet streams (6.05% WER) for preview, Whisper batch (2.7% WER) for final accuracy
5. **320ms chunk size** — Larger chunks = fewer corrections/flicker, acceptable latency tradeoff
6. **Inter-keystroke delays in deleteCharacters** — 1.5ms pause every 5 backspaces to prevent keystroke loss
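
The timing figures in decisions 5 and 6 can be checked with some quick arithmetic. The sketch below uses hypothetical helper names, assumes a 16 kHz input sample rate (not stated in this document), and assumes one pause after every full group of 5 backspaces.

```swift
/// Samples in one chunk of `chunkMs` milliseconds at `sampleRate` Hz.
func samplesPerChunk(chunkMs: Int, sampleRate: Int) -> Int {
    return sampleRate * chunkMs / 1000
}

/// Whole chunks that fit in the end-of-utterance debounce window.
func chunksPerDebounce(debounceMs: Int, chunkMs: Int) -> Int {
    return debounceMs / chunkMs
}

/// Total pause time added while deleting `count` characters, assuming
/// one pause of `pauseMs` after every full group of `groupSize` backspaces.
func totalDeletePauseMs(count: Int, groupSize: Int = 5, pauseMs: Double = 1.5) -> Double {
    guard count > 0, groupSize > 0 else { return 0 }
    return Double(count / groupSize) * pauseMs
}

// 320 ms at an assumed 16 kHz is 5120 samples per chunk.
let chunkSamples = samplesPerChunk(chunkMs: 320, sampleRate: 16_000)

// The 1280 ms debounce spans 4 chunks of silence.
let silentChunks = chunksPerDebounce(debounceMs: 1280, chunkMs: 320)

// Deleting a 100-character preview adds 20 pauses, or 30 ms in total.
let deletePause = totalDeletePauseMs(count: 100)
```

Under these assumptions, even a full replace of a long preview adds only tens of milliseconds of deliberate pausing, which is small next to the 320ms chunk cadence.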