# VoiceInk Real-Time Streaming Voice-to-Text Architecture

## How It Works (HYBRID approach)

1. **While speaking:** Parakeet EOU model streams partial transcriptions into the active text field in real-time (~320ms latency)
2. **When done speaking:** The streamed preview text is deleted, and a high-accuracy batch transcription (Whisper large-v3-turbo) replaces it

This gives the user instant visual feedback while ensuring the final output is accurate.

---

## Files & Their Roles

### 1. `WhisperState.swift`: Orchestrator

The main state machine that coordinates everything. Key streaming methods:

- `startStreamingTranscription()`: Sets up audio callback + starts Parakeet EOU streaming BEFORE recording begins
- `handleStreamingUpdate(newText)`: Receives partial transcripts and pastes them (with differential updates to reduce flicker)
- `finishStreamingTranscription()`: Stops streaming, deletes preview text, returns final text
- `cancelStreamingTranscription()`: Aborts streaming and cleans up
- `handleStreamingCompletion()`: HYBRID: finishes streaming preview, then runs batch transcription for accuracy

Key properties:

- `streamingUpdateTask`: Task consuming the AsyncStream of partial transcripts
- `lastStreamedText`: Tracks what's currently pasted so we can do differential updates
- `isStreamingActive`: Guard flag

### 2. `ParakeetTranscriptionService.swift`: Streaming Engine

Manages the Parakeet EOU (End-of-Utterance) model for real-time transcription:

- `startStreaming(model:)`: Downloads EOU 320ms models if needed, creates `StreamingEouAsrManager`, returns `AsyncStream`
- `streamAudio(samples:frameCount:sampleRate:channels:)`: Called from audio thread, creates mono PCM buffer, feeds to EOU manager
- `finishStreaming()`: Gets final text from EOU manager
- `cancelStreaming()`: Resets EOU manager

Uses `StreamingEouAsrManager` from FluidAudio SDK with:

- 320ms chunk size (balance of accuracy vs latency)
- 1280ms EOU debounce (end-of-utterance detection after ~1.3s silence)
- Partial callback yields to AsyncStream continuation

### 3. `CursorPaster.swift`: Text Injection

Handles pasting text into the active application:

- `setStreamingMode(enabled)`: Skips clipboard save/restore during streaming (prevents race conditions)
- `pasteAtCursor(text)`: Sets clipboard + simulates Cmd+V
- `deleteCharacters(count)`: Simulates backspace key presses (used to delete old streaming text before pasting updated text)
- `pressEnter()`: Simulates Enter key (for auto-send in Power Mode)

### 4. `ClipboardManager.swift`: Clipboard Helper

Simple clipboard read/write with transient paste support.

### 5. `CoreAudioRecorder.swift`: Audio Capture

Low-level CoreAudio recording with streaming callback:

- `streamingAudioCallback` property: Called on every audio render cycle with raw samples
- In the render callback: feeds samples to the callback which routes to ParakeetTranscriptionService

### 6. `Recorder.swift`: Recording Coordinator

Wraps CoreAudioRecorder:

- `setStreamingAudioCallback(callback)`: Stores callback and applies to CoreAudioRecorder

### 7. `TranscriptionServiceRegistry.swift`: Service Lookup

Lazy-initializes `ParakeetTranscriptionService` and routes transcription requests.
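
The "partial callback yields to AsyncStream continuation" step in `ParakeetTranscriptionService` can be illustrated with a minimal sketch. It is not VoiceInk's code and does not use the FluidAudio API: `FakeRecognizer`, `simulatePartial`, and `makeTranscriptStream` are hypothetical stand-ins, and only `AsyncStream.makeStream(of:)` (Swift 5.9+) is a real standard-library call. The sketch assumes the recognizer reports partials through a plain closure.

```swift
// Hypothetical callback-based recognizer standing in for the SDK's
// streaming manager. It only forwards text to whoever registered a callback.
final class FakeRecognizer {
    var partialCallback: ((String) -> Void)?

    func simulatePartial(_ text: String) {
        partialCallback?(text)
    }
}

// Bridge the callback into an AsyncStream: every partial delivered to the
// callback is yielded into the stream; `finish` closes the stream so that
// a consuming `for await` loop can end.
func makeTranscriptStream(for recognizer: FakeRecognizer)
    -> (stream: AsyncStream<String>, finish: () -> Void) {
    let (stream, continuation) = AsyncStream.makeStream(of: String.self)
    recognizer.partialCallback = { text in
        continuation.yield(text)
    }
    return (stream, { continuation.finish() })
}

let recognizer = FakeRecognizer()
let (transcriptStream, finish) = makeTranscriptStream(for: recognizer)

// Partials yielded before the consumer starts are buffered
// (the default buffering policy is unbounded).
recognizer.simulatePartial("hello")
recognizer.simulatePartial("hello world")
finish()

// Consumer side, analogous to the loop that feeds handleStreamingUpdate.
var received: [String] = []
for await text in transcriptStream {
    received.append(text)
}
```

In a real app the consuming loop would typically live in a `Task` (the role `streamingUpdateTask` plays above), so cancelling that task stops the consumption of partials.
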

---

## Data Flow

```
Microphone Audio
    │
    ▼
CoreAudioRecorder (render callback on audio thread)
    │
    ▼ streamingAudioCallback
    │
ParakeetTranscriptionService.streamAudio()
    │  creates mono AVAudioPCMBuffer
    │
    ▼ Task.detached
    │
StreamingEouAsrManager.process(audioBuffer)  [FluidAudio SDK]
    │  320ms chunks → partial transcripts
    │
    ▼ partialCallback
    │
AsyncStream.Continuation.yield(text)
    │
    ▼ for await text in transcriptStream
    │
WhisperState.handleStreamingUpdate(newText)
    │
    ├─ If newText starts with lastStreamedText:
    │     → CursorPaster.pasteAtCursor(deltaOnly)  [APPEND mode - no flicker]
    │
    └─ Otherwise:
          → CursorPaster.deleteCharacters(oldText.count)
          → wait for deletions
          → CursorPaster.pasteAtCursor(newText)  [FULL REPLACE mode]
    │
    ▼ When recording stops
    │
WhisperState.handleStreamingCompletion()
    │
    ├─ finishStreamingTranscription() → deletes preview text
    │
    └─ Runs batch Whisper transcription for accuracy
          → CursorPaster.pasteAtCursor(accurateText)
```

---

## Key Design Decisions

1. **Streaming setup BEFORE recording starts**: Avoids losing early audio
2. **Differential paste (append vs full replace)**: Reduces visual flicker during continuous speech
3. **Streaming mode in CursorPaster**: Skips clipboard save/restore to prevent race conditions
4. **HYBRID batch + stream**: Parakeet streams (6.05% WER) for preview, Whisper batch (2.7% WER) for final accuracy
5. **320ms chunk size**: Larger chunks = fewer corrections/flicker, acceptable latency tradeoff
6. **Inter-keystroke delays in deleteCharacters**: 1.5ms pause every 5 backspaces to prevent keystroke loss
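
The append vs full-replace decision in `handleStreamingUpdate` can be sketched as a pure function that computes what the paster should do, separate from the actual key simulation. This is an illustrative sketch, not the project's implementation: `PastePlan` and `planStreamingUpdate` are hypothetical names, and it assumes one backspace removes one Swift `Character`, which may not hold in every target application.

```swift
// Hypothetical description of one preview update:
// how many characters to delete, then which text to paste.
struct PastePlan: Equatable {
    let deleteCount: Int
    let textToPaste: String
}

// Decide between APPEND mode and FULL REPLACE mode.
// - lastStreamedText: what is currently pasted in the text field
// - newText: the latest partial transcript
func planStreamingUpdate(lastStreamedText: String, newText: String) -> PastePlan {
    if newText.starts(with: lastStreamedText) {
        // APPEND mode: the preview is still a prefix of the new transcript,
        // so only the new suffix needs to be pasted (no deletion, no flicker).
        let delta = String(newText.dropFirst(lastStreamedText.count))
        return PastePlan(deleteCount: 0, textToPaste: delta)
    }
    // FULL REPLACE mode: the recognizer revised earlier words, so the whole
    // preview is deleted and the new transcript is pasted in its place.
    return PastePlan(deleteCount: lastStreamedText.count, textToPaste: newText)
}

// Append: only " world" is pasted.
let appendPlan = planStreamingUpdate(lastStreamedText: "hello", newText: "hello world")

// Revision: "word" became "world", so all 10 preview characters are replaced.
let replacePlan = planStreamingUpdate(lastStreamedText: "hello word", newText: "hello world")
```

Keeping the decision pure makes it easy to unit-test without a focused text field; the caller would then pass `deleteCount` to something like `deleteCharacters(count)` and `textToPaste` to `pasteAtCursor(text)`, and update `lastStreamedText` to `newText`.
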