# VoiceInk Real-Time Streaming Voice-to-Text Architecture

## How It Works (HYBRID approach)

1. **While speaking:** The Parakeet EOU model streams partial transcriptions into the active text field in real time (~320ms latency).
2. **When done speaking:** The streamed preview text is deleted and replaced by a high-accuracy batch transcription (Whisper large-v3-turbo).

This gives the user instant visual feedback while ensuring the final output is accurate.

---

## Files & Their Roles

### 1. `WhisperState.swift` — Orchestrator
The main state machine that coordinates everything. Key streaming methods:
- `startStreamingTranscription()` — Sets up audio callback + starts Parakeet EOU streaming BEFORE recording begins
- `handleStreamingUpdate(newText)` — Receives partial transcripts and pastes them (with differential updates to reduce flicker)
- `finishStreamingTranscription()` — Stops streaming, deletes preview text, returns final text
- `cancelStreamingTranscription()` — Aborts streaming and cleans up
- `handleStreamingCompletion()` — HYBRID: finishes streaming preview, then runs batch transcription for accuracy

Key properties:
- `streamingUpdateTask` — Task consuming the AsyncStream of partial transcripts
- `lastStreamedText` — Tracks what's currently pasted so we can do differential updates
- `isStreamingActive` — Guard flag indicating whether streaming is active

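The differential update in `handleStreamingUpdate(newText)` can be sketched as a pure decision function. This is a minimal sketch, not the actual implementation: `StreamingEdit` and `planStreamingEdit` are hypothetical names introduced for illustration, and the real method also performs the paste and delete calls.

```swift
/// What to do with the text field for one streaming update (hypothetical type).
enum StreamingEdit: Equatable {
    case unchanged                                // transcript did not change
    case append(String)                           // paste only the new suffix
    case replace(deleteCount: Int, text: String)  // backspace old text, paste new text
}

/// Hypothetical helper: chooses between APPEND and FULL REPLACE.
func planStreamingEdit(lastStreamedText: String, newText: String) -> StreamingEdit {
    if newText == lastStreamedText { return .unchanged }
    if newText.hasPrefix(lastStreamedText) {
        // The model only extended the transcript, so paste the delta.
        return .append(String(newText.dropFirst(lastStreamedText.count)))
    }
    // The model revised earlier words, so everything pasted so far must go.
    return .replace(deleteCount: lastStreamedText.count, text: newText)
}
```

Note that `String.count` counts Swift `Character` values; whether one backspace removes exactly one `Character` in the target application is an assumption of this sketch.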
### 2. `ParakeetTranscriptionService.swift` — Streaming Engine
Manages the Parakeet EOU (End-of-Utterance) model for real-time transcription:
- `startStreaming(model:)` — Downloads EOU 320ms models if needed, creates `StreamingEouAsrManager`, returns `AsyncStream<String>`
- `streamAudio(samples:frameCount:sampleRate:channels:)` — Called from audio thread, creates mono PCM buffer, feeds to EOU manager
- `finishStreaming()` — Gets final text from EOU manager
- `cancelStreaming()` — Resets EOU manager

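The "creates mono PCM buffer" step in `streamAudio` could look roughly like the downmix below. This is a sketch under assumptions: it assumes interleaved `Float` samples and averages the channels, whereas the real service may pick a single channel or use a converter, and it wraps the result in an `AVAudioPCMBuffer` rather than returning an array. `downmixToMono` is a hypothetical name.

```swift
/// Averages interleaved frames down to one channel (hypothetical helper).
/// Assumes interleaved input: [L0, R0, L1, R1, ...].
func downmixToMono(samples: [Float], frameCount: Int, channels: Int) -> [Float] {
    guard channels > 0, frameCount > 0, samples.count >= frameCount * channels else {
        return []
    }
    if channels == 1 {
        // Already mono: copy the requested frames unchanged.
        return Array(samples.prefix(frameCount))
    }
    var mono = [Float](repeating: 0, count: frameCount)
    for frame in 0..<frameCount {
        var sum: Float = 0
        for channel in 0..<channels {
            sum += samples[frame * channels + channel]
        }
        mono[frame] = sum / Float(channels)
    }
    return mono
}
```

Code running on a real audio thread would normally avoid allocating arrays like this; the sketch only illustrates the arithmetic.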
Uses `StreamingEouAsrManager` from FluidAudio SDK with:
- 320ms chunk size (balance of accuracy vs latency)
- 1280ms EOU debounce (end-of-utterance detection after ~1.3s silence)
- Partial callback yields to AsyncStream continuation

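Bridging a partial-transcript callback into an `AsyncStream<String>` might be done as follows. This is a sketch only: `FakeStreamingRecognizer` is a hypothetical stand-in for the SDK manager (its real API is not shown here), and `AsyncStream.makeStream()` requires Swift 5.9 or later.

```swift
/// Hypothetical stand-in for a callback-based streaming recognizer.
final class FakeStreamingRecognizer {
    var onPartial: ((String) -> Void)?
    func emit(_ text: String) { onPartial?(text) }
}

/// Wraps the recognizer's partial callback in an AsyncStream.
/// Returns the stream plus a closure that ends it.
func makeTranscriptStream(
    from recognizer: FakeStreamingRecognizer
) -> (stream: AsyncStream<String>, finish: () -> Void) {
    let (stream, continuation) = AsyncStream<String>.makeStream()
    // Each partial transcript is forwarded to whoever iterates the stream.
    recognizer.onPartial = { text in continuation.yield(text) }
    return (stream, { continuation.finish() })
}

/// Consumer side: the same `for await` shape the orchestrator uses.
func collect(_ stream: AsyncStream<String>) async -> [String] {
    var received: [String] = []
    for await text in stream { received.append(text) }
    return received
}
```

With the default unbounded buffering policy, partials yielded before the consumer starts iterating are queued rather than dropped.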
### 3. `CursorPaster.swift` — Text Injection
Handles pasting text into the active application:
- `setStreamingMode(enabled)` — Skips clipboard save/restore during streaming (prevents race conditions)
- `pasteAtCursor(text)` — Sets clipboard + simulates Cmd+V
- `deleteCharacters(count)` — Simulates backspace key presses (used to delete old streaming text before pasting updated text)
- `pressEnter()` — Simulates Enter key (for auto-send in Power Mode)

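The pacing inside `deleteCharacters(count)` can be sketched with the key press and the pause injected as closures, so the logic runs without posting real key events. The parameter names and the choice to skip the pause after the final key press are assumptions; on macOS the `pressBackspace` closure would typically post a `CGEvent` key-down/key-up pair.

```swift
/// Sends `count` backspaces, pausing after every `batchSize` presses.
/// `pressBackspace` and `pause` are injected (hypothetical parameters).
func deleteCharacters(
    count: Int,
    batchSize: Int = 5,
    pauseSeconds: Double = 0.0015,
    pressBackspace: () -> Void,
    pause: (Double) -> Void
) {
    guard count > 0, batchSize > 0 else { return }
    for index in 1...count {
        pressBackspace()
        // Pause after each full batch, but not after the last key press.
        if index % batchSize == 0 && index < count {
            pause(pauseSeconds)
        }
    }
}
```

In production the `pause` closure could be something like `{ Thread.sleep(forTimeInterval: $0) }` (which needs `import Foundation`).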
### 4. `ClipboardManager.swift` — Clipboard Helper
Simple clipboard read/write with transient paste support.

### 5. `CoreAudioRecorder.swift` — Audio Capture
Low-level CoreAudio recording with streaming callback:
- `streamingAudioCallback` property — Called on every audio render cycle with raw samples
- The render callback passes samples to `streamingAudioCallback`, which routes them to `ParakeetTranscriptionService`

### 6. `Recorder.swift` — Recording Coordinator
Wraps CoreAudioRecorder:
- `setStreamingAudioCallback(callback)` — Stores callback and applies to CoreAudioRecorder

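Storing the callback in the coordinator as well as applying it to the low-level recorder is what lets streaming be configured before recording starts. A rough sketch, assuming a simplified callback signature and hypothetical `Sketch…` types rather than the real classes:

```swift
/// Simplified callback shape; the real signature is an assumption here.
typealias StreamingAudioCallback =
    (_ samples: [Float], _ frameCount: Int, _ sampleRate: Double, _ channels: Int) -> Void

/// Hypothetical stand-in for the low-level recorder.
final class SketchCoreAudioRecorder {
    var streamingAudioCallback: StreamingAudioCallback?

    /// Stand-in for one audio render cycle.
    func render(samples: [Float], sampleRate: Double, channels: Int) {
        guard let callback = streamingAudioCallback, channels > 0 else { return }
        callback(samples, samples.count / channels, sampleRate, channels)
    }
}

/// Hypothetical stand-in for the recording coordinator.
final class SketchRecorder {
    private var recorder: SketchCoreAudioRecorder?
    private var streamingAudioCallback: StreamingAudioCallback?

    func setStreamingAudioCallback(_ callback: StreamingAudioCallback?) {
        streamingAudioCallback = callback            // kept for recorders created later
        recorder?.streamingAudioCallback = callback  // applied to a live recorder, if any
    }

    func startRecording() -> SketchCoreAudioRecorder {
        let newRecorder = SketchCoreAudioRecorder()
        newRecorder.streamingAudioCallback = streamingAudioCallback
        recorder = newRecorder
        return newRecorder
    }
}
```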
### 7. `TranscriptionServiceRegistry.swift` — Service Lookup
Lazy-initializes `ParakeetTranscriptionService` and routes transcription requests.

---

## Data Flow

```
Microphone Audio
    │
    ▼
CoreAudioRecorder (render callback on audio thread)
    │
    ▼ streamingAudioCallback
    │
ParakeetTranscriptionService.streamAudio()
    │ creates mono AVAudioPCMBuffer
    │
    ▼ Task.detached
    │
StreamingEouAsrManager.process(audioBuffer)  [FluidAudio SDK]
    │ 320ms chunks → partial transcripts
    │
    ▼ partialCallback
    │
AsyncStream<String>.Continuation.yield(text)
    │
    ▼ for await text in transcriptStream
    │
WhisperState.handleStreamingUpdate(newText)
    │
    ├─ If newText starts with lastStreamedText:
    │     → CursorPaster.pasteAtCursor(deltaOnly)  [APPEND mode - no flicker]
    │
    └─ Otherwise:
          → CursorPaster.deleteCharacters(oldText.count)
          → wait for deletions
          → CursorPaster.pasteAtCursor(newText)  [FULL REPLACE mode]
    │
    ▼ When recording stops
    │
WhisperState.handleStreamingCompletion()
    │
    ├─ finishStreamingTranscription() → deletes preview text
    │
    └─ Runs batch Whisper transcription for accuracy
          → CursorPaster.pasteAtCursor(accurateText)
```
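The final step of the flow (remove the streamed preview, then paste the batch result) can be illustrated against an in-memory buffer. `FakeTextField` and `completeHybrid` are hypothetical names used only for this sketch; the real code drives the focused application through `CursorPaster` and waits for the batch transcription first.

```swift
/// Hypothetical stand-in for the focused text field.
struct FakeTextField {
    private(set) var contents = ""

    mutating func paste(_ text: String) { contents += text }

    mutating func deleteCharacters(_ count: Int) {
        // Never delete more than what is present.
        contents.removeLast(min(count, contents.count))
    }
}

/// Removes the streamed preview, then inserts the batch result.
func completeHybrid(field: inout FakeTextField,
                    lastStreamedText: String,
                    accurateText: String) {
    field.deleteCharacters(lastStreamedText.count)
    field.paste(accurateText)
}
```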

---

## Key Design Decisions

1. **Streaming setup BEFORE recording starts** — Avoids losing early audio
2. **Differential paste (append vs full replace)** — Reduces visual flicker during continuous speech
3. **Streaming mode in CursorPaster** — Skips clipboard save/restore to prevent race conditions
4. **HYBRID batch + stream** — Parakeet streams (6.05% WER) for preview, Whisper batch (2.7% WER) for final accuracy
5. **320ms chunk size** — Larger chunks mean fewer corrections and less flicker, at an acceptable latency cost
6. **Inter-keystroke delays in deleteCharacters** — 1.5ms pause every 5 backspaces to prevent keystroke loss
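The timing figures above translate into concrete numbers as follows. The 16 kHz sample rate is an assumption made for this sketch (it is not stated in this document), and `deletionPauseMilliseconds` is a hypothetical helper that counts one 1.5ms pause per full group of 5 backspaces.

```swift
let assumedSampleRate = 16_000.0  // Hz; assumption, not taken from this document
let chunkSeconds = 0.320          // 320ms chunk size
let debounceSeconds = 1.280       // 1280ms EOU debounce

// Samples per chunk at the assumed rate, and chunks of silence per debounce window.
let samplesPerChunk = Int((assumedSampleRate * chunkSeconds).rounded())
let chunksPerDebounce = Int((debounceSeconds / chunkSeconds).rounded())

/// Upper bound on the pacing delay added when deleting `backspaces` characters.
func deletionPauseMilliseconds(backspaces: Int) -> Double {
    Double(backspaces / 5) * 1.5
}
```

Under these assumptions, a full replace of a 100-character preview adds at most about 30ms of pacing delay, which is small next to the 320ms chunk interval.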