VoiceInk Real-Time Streaming Voice-to-Text Architecture
How It Works (HYBRID approach)
- While speaking: Parakeet EOU model streams partial transcriptions into the active text field in real-time (~320ms latency)
- When done speaking: The streamed preview text is deleted, and a high-accuracy batch transcription (Whisper large-v3-turbo) replaces it
This gives the user instant visual feedback while ensuring the final output is accurate.
Files & Their Roles
1. WhisperState.swift — Orchestrator
The main state machine that coordinates everything. Key streaming methods:
- startStreamingTranscription() - Sets up the audio callback and starts Parakeet EOU streaming BEFORE recording begins
- handleStreamingUpdate(newText) - Receives partial transcripts and pastes them (with differential updates to reduce flicker)
- finishStreamingTranscription() - Stops streaming, deletes the preview text, returns the final text
- cancelStreamingTranscription() - Aborts streaming and cleans up
- handleStreamingCompletion() - HYBRID: finishes the streaming preview, then runs batch transcription for accuracy
Key properties:
- streamingUpdateTask - Task consuming the AsyncStream of partial transcripts
- lastStreamedText - Tracks what is currently pasted so that updates can be applied differentially
- isStreamingActive - Guard flag
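The differential update in handleStreamingUpdate(newText) can be sketched as a pure decision function. This is a minimal illustration, not the actual WhisperState code: the PreviewEdit type and planPreviewEdit function are hypothetical names introduced here, and only the append-vs-replace rule comes from this document.

```swift
/// Hypothetical type: the kinds of edit the on-screen preview can need.
enum PreviewEdit: Equatable {
    /// The new text is identical to what is already pasted.
    case unchanged
    /// The new text extends the pasted text: paste only the suffix.
    case append(String)
    /// The new text diverges: delete `deleteCount` characters, then paste `text`.
    case replace(deleteCount: Int, text: String)
}

/// Hypothetical helper: decide how to move the preview from
/// `lastStreamedText` (what is pasted now) to `newText` (the latest partial).
func planPreviewEdit(lastStreamedText: String, newText: String) -> PreviewEdit {
    if newText == lastStreamedText {
        return .unchanged
    }
    if newText.hasPrefix(lastStreamedText) {
        // APPEND mode: only the delta is pasted, so nothing already on screen moves.
        let delta = String(newText.dropFirst(lastStreamedText.count))
        return .append(delta)
    }
    // FULL REPLACE mode: the model revised earlier words.
    return .replace(deleteCount: lastStreamedText.count, text: newText)
}
```

For example, going from "hello" to "hello world" yields an append of " world", while going from "hello word" to "hello world" yields a full replace that first deletes 10 characters.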
2. ParakeetTranscriptionService.swift — Streaming Engine
Manages the Parakeet EOU (End-of-Utterance) model for real-time transcription:
- startStreaming(model:) - Downloads the EOU 320ms models if needed, creates a StreamingEouAsrManager, returns an AsyncStream<String>
- streamAudio(samples:frameCount:sampleRate:channels:) - Called from the audio thread, creates a mono PCM buffer, feeds it to the EOU manager
- finishStreaming() - Gets the final text from the EOU manager
- cancelStreaming() - Resets the EOU manager
Uses StreamingEouAsrManager from FluidAudio SDK with:
- 320ms chunk size (balance of accuracy vs latency)
- 1280ms EOU debounce (end-of-utterance detection after ~1.3s silence)
- Partial callback yields to AsyncStream continuation
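To make the timing numbers and the mono conversion concrete, here is a small sketch. It assumes 16 kHz audio and interleaved input samples, neither of which is stated in this document, and the function names samplesPerChunk and downmixToMono are hypothetical. The real service builds an AVAudioPCMBuffer; plain arrays are used here so the example stays self-contained.

```swift
/// Hypothetical helper: number of samples in `milliseconds` of audio at `sampleRate` Hz.
func samplesPerChunk(milliseconds: Int, sampleRate: Int) -> Int {
    return sampleRate * milliseconds / 1000
}

/// Hypothetical helper: average interleaved multi-channel samples into one mono channel.
/// Assumes the layout [L0, R0, L1, R1, ...]; trailing samples that do not
/// form a complete frame are dropped.
func downmixToMono(interleaved: [Float], channels: Int) -> [Float] {
    guard channels > 1 else { return interleaved }
    let frameCount = interleaved.count / channels
    var mono = [Float](repeating: 0, count: frameCount)
    for frame in 0..<frameCount {
        var sum: Float = 0
        for channel in 0..<channels {
            sum += interleaved[frame * channels + channel]
        }
        mono[frame] = sum / Float(channels)
    }
    return mono
}

// Under the 16 kHz assumption, a 320ms chunk is 5120 samples,
// and the 1280ms EOU debounce spans four such chunks.
let chunkSamples = samplesPerChunk(milliseconds: 320, sampleRate: 16_000)
let debounceChunks = samplesPerChunk(milliseconds: 1280, sampleRate: 16_000) / chunkSamples
```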
3. CursorPaster.swift — Text Injection
Handles pasting text into the active application:
- setStreamingMode(enabled) - Skips clipboard save/restore during streaming (prevents race conditions)
- pasteAtCursor(text) - Sets the clipboard and simulates Cmd+V
- deleteCharacters(count) - Simulates backspace key presses (used to delete old streaming text before pasting updated text)
- pressEnter() - Simulates the Enter key (for auto-send in Power Mode)
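The effect of streaming mode on the clipboard can be illustrated with stand-in types. Everything below is hypothetical (FakeClipboard, SketchPaster, the sendPasteKeystroke closure); the real CursorPaster talks to the system pasteboard and simulates Cmd+V, and this sketch restores the clipboard synchronously only to keep the example simple.

```swift
/// Hypothetical stand-in for the system pasteboard.
final class FakeClipboard {
    var contents: String = ""
}

/// Hypothetical paster. `sendPasteKeystroke` stands in for simulating Cmd+V
/// and receives whatever is on the clipboard at that moment.
final class SketchPaster {
    private let clipboard: FakeClipboard
    private let sendPasteKeystroke: (String) -> Void
    private var streamingMode = false

    init(clipboard: FakeClipboard, sendPasteKeystroke: @escaping (String) -> Void) {
        self.clipboard = clipboard
        self.sendPasteKeystroke = sendPasteKeystroke
    }

    func setStreamingMode(_ enabled: Bool) {
        streamingMode = enabled
    }

    func pasteAtCursor(_ text: String) {
        let saved = clipboard.contents
        clipboard.contents = text
        sendPasteKeystroke(clipboard.contents)
        // Outside streaming mode, give the user their clipboard back.
        // During streaming the restore is skipped, so rapid successive
        // pastes never compete with a pending restore.
        if !streamingMode {
            clipboard.contents = saved
        }
    }
}
```

With streaming mode off, the user's clipboard survives a paste; with it on, the clipboard simply holds the most recent partial text until streaming ends.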
4. ClipboardManager.swift — Clipboard Helper
Simple clipboard read/write with transient paste support.
5. CoreAudioRecorder.swift — Audio Capture
Low-level CoreAudio recording with streaming callback:
- streamingAudioCallback property - Called on every audio render cycle with raw samples
- In the render callback: feeds samples to the callback, which routes them to ParakeetTranscriptionService
6. Recorder.swift — Recording Coordinator
Wraps CoreAudioRecorder:
- setStreamingAudioCallback(callback) - Stores the callback and applies it to CoreAudioRecorder
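One plausible reading of "stores and applies" is that the wrapper keeps the callback so it can be handed to a recorder created later. The sketch below illustrates that reading only; the class names, the callback signature, and the render(...) method are all hypothetical stand-ins, not the real Recorder or CoreAudioRecorder API.

```swift
/// Hypothetical callback shape: raw samples plus their format.
typealias StreamingAudioCallback = (_ samples: [Float], _ sampleRate: Double, _ channels: Int) -> Void

/// Hypothetical stand-in for the low-level recorder.
final class SketchCoreAudioRecorder {
    var streamingAudioCallback: StreamingAudioCallback?

    /// Stand-in for one audio render cycle.
    func render(samples: [Float], sampleRate: Double, channels: Int) {
        streamingAudioCallback?(samples, sampleRate, channels)
    }
}

/// Hypothetical stand-in for the coordinator that wraps the low-level recorder.
final class SketchRecorder {
    private var storedCallback: StreamingAudioCallback?
    private(set) var coreRecorder: SketchCoreAudioRecorder?

    func setStreamingAudioCallback(_ callback: StreamingAudioCallback?) {
        // Store it for recorders created later, and apply it to one that already exists.
        storedCallback = callback
        coreRecorder?.streamingAudioCallback = callback
    }

    func startRecording() {
        let recorder = SketchCoreAudioRecorder()
        recorder.streamingAudioCallback = storedCallback
        coreRecorder = recorder
    }
}
```

Because the callback is stored first, it can be registered before recording begins and still receive the very first render cycle.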
7. TranscriptionServiceRegistry.swift — Service Lookup
Lazy-initializes ParakeetTranscriptionService and routes transcription requests.
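Lazy initialization here means the service is only constructed the first time a request needs it. A minimal sketch using Swift's lazy stored properties follows; SketchParakeetService, SketchServiceRegistry, and the factory closure are hypothetical, and the real registry's routing logic is not shown in this document.

```swift
/// Hypothetical stand-in for the streaming service.
final class SketchParakeetService {
    func transcribe(_ audioName: String) -> String {
        return "transcript of \(audioName)"
    }
}

/// Hypothetical registry that builds its service on first use.
final class SketchServiceRegistry {
    private let makeParakeetService: () -> SketchParakeetService

    // The factory runs once, on first access, and the result is cached.
    private(set) lazy var parakeetService: SketchParakeetService = self.makeParakeetService()

    init(makeParakeetService: @escaping () -> SketchParakeetService) {
        self.makeParakeetService = makeParakeetService
    }

    /// Route a request to the (lazily created) service.
    func transcribe(_ audioName: String) -> String {
        return parakeetService.transcribe(audioName)
    }
}
```

Creating the registry costs nothing until the first transcription request, and repeated requests reuse the same service instance.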
Data Flow
Microphone Audio
│
▼
CoreAudioRecorder (render callback on audio thread)
│
▼ streamingAudioCallback
│
ParakeetTranscriptionService.streamAudio()
│ creates mono AVAudioPCMBuffer
│
▼ Task.detached
│
StreamingEouAsrManager.process(audioBuffer) [FluidAudio SDK]
│ 320ms chunks → partial transcripts
│
▼ partialCallback
│
AsyncStream<String>.Continuation.yield(text)
│
▼ for await text in transcriptStream
│
WhisperState.handleStreamingUpdate(newText)
│
├─ If newText starts with lastStreamedText:
│ → CursorPaster.pasteAtCursor(deltaOnly) [APPEND mode - no flicker]
│
└─ Otherwise:
→ CursorPaster.deleteCharacters(oldText.count)
→ wait for deletions
→ CursorPaster.pasteAtCursor(newText) [FULL REPLACE mode]
│
▼ When recording stops
│
WhisperState.handleStreamingCompletion()
│
├─ finishStreamingTranscription() → deletes preview text
│
└─ Runs batch Whisper transcription for accuracy
→ CursorPaster.pasteAtCursor(accurateText)
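The final step of the flow above (delete the preview, then paste the batch result) can be sketched with a recording stand-in for CursorPaster. RecordingInjector, completeHybridTranscription, and the runBatchTranscription closure are hypothetical names; the real handleStreamingCompletion() presumably runs asynchronously and does more cleanup than shown.

```swift
/// Hypothetical stand-in for CursorPaster that records what it was asked to do.
final class RecordingInjector {
    enum Operation: Equatable {
        case delete(Int)
        case paste(String)
    }

    private(set) var operations: [Operation] = []

    func deleteCharacters(_ count: Int) {
        operations.append(.delete(count))
    }

    func pasteAtCursor(_ text: String) {
        operations.append(.paste(text))
    }
}

/// Hypothetical helper: replace the streamed preview with the accurate batch result.
func completeHybridTranscription(
    lastStreamedText: String,
    injector: RecordingInjector,
    runBatchTranscription: () -> String
) {
    // 1. Remove whatever preview text is currently in the text field.
    if !lastStreamedText.isEmpty {
        injector.deleteCharacters(lastStreamedText.count)
    }
    // 2. Run the slower, more accurate batch pass and paste its output.
    let accurateText = runBatchTranscription()
    if !accurateText.isEmpty {
        injector.pasteAtCursor(accurateText)
    }
}
```

Deleting by the character count of lastStreamedText is why that property has to track exactly what is on screen at every moment.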
Key Design Decisions
- Streaming setup BEFORE recording starts — Avoids losing early audio
- Differential paste (append vs full replace) — Reduces visual flicker during continuous speech
- Streaming mode in CursorPaster — Skips clipboard save/restore to prevent race conditions
- HYBRID batch + stream — Parakeet streams (6.05% WER) for preview, Whisper batch (2.7% WER) for final accuracy
- 320ms chunk size — Larger chunks = fewer corrections/flicker, acceptable latency tradeoff
- Inter-keystroke delays in deleteCharacters — 1.5ms pause every 5 backspaces to prevent keystroke loss
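The backspace pacing in the last item can be sketched with the key press and the pause injected as closures, which keeps the example free of any macOS event APIs. The function name, parameter names, and the choice not to pause after the final backspace are assumptions made for illustration; only the "1.5ms every 5 backspaces" rule comes from this document.

```swift
/// Hypothetical pacing loop: press backspace `count` times, pausing briefly
/// after every `batchSize` presses so the target application can keep up.
func deleteCharactersPaced(
    count: Int,
    batchSize: Int = 5,
    pauseSeconds: Double = 0.0015,
    pressBackspace: () -> Void,
    pause: (Double) -> Void
) {
    guard count > 0, batchSize > 0 else { return }
    for index in 1...count {
        pressBackspace()
        // Pause after each full batch, except when no presses remain (assumption).
        if index % batchSize == 0 && index < count {
            pause(pauseSeconds)
        }
    }
}
```

In a real implementation pressBackspace would post a keyboard event and pause would sleep the calling thread; deleting 12 characters under this sketch produces 12 presses and two 1.5ms pauses.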