Jake Shore 16db42bf7e Daily backup: 2026-02-06

2026-02-06 23:01:30 -05:00

VoiceInk Real-Time Streaming Voice-to-Text Architecture

How It Works (HYBRID approach)

While speaking: The Parakeet EOU model streams partial transcriptions into the active text field in real time (~320 ms latency)
When done speaking: The streamed preview text is deleted and replaced with a high-accuracy batch transcription (Whisper large-v3-turbo)

This gives the user instant visual feedback while ensuring the final output is accurate.

Files & Their Roles

1. `WhisperState.swift` — Orchestrator

The main state machine that coordinates everything. Key streaming methods:

startStreamingTranscription() — Sets up audio callback + starts Parakeet EOU streaming BEFORE recording begins
handleStreamingUpdate(newText) — Receives partial transcripts and pastes them (with differential updates to reduce flicker)
finishStreamingTranscription() — Stops streaming, deletes preview text, returns final text
cancelStreamingTranscription() — Aborts streaming and cleans up
handleStreamingCompletion() — HYBRID: finishes streaming preview, then runs batch transcription for accuracy

Key properties:

streamingUpdateTask — Task consuming the AsyncStream of partial transcripts
lastStreamedText — Tracks what's currently pasted so we can do differential updates
isStreamingActive — Guard flag
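
The differential update in handleStreamingUpdate can be sketched as a pure function that decides how many characters to delete and what to paste. This is a minimal illustration, not the actual VoiceInk code: `PasteEdit` and `planStreamingUpdate` are hypothetical names, and only `lastStreamedText` and `newText` come from the description above.

```swift
/// Hypothetical value type: what the paster should do for one partial transcript.
struct PasteEdit: Equatable {
    var deleteCount: Int   // backspaces to send before pasting
    var insert: String     // text to paste afterwards
}

/// Compares the text already on screen with the newest partial transcript.
/// Assumed logic: append only the new suffix when possible, otherwise replace everything.
func planStreamingUpdate(lastStreamedText: String, newText: String) -> PasteEdit {
    if newText == lastStreamedText {
        // Nothing changed, so nothing needs to be typed.
        return PasteEdit(deleteCount: 0, insert: "")
    }
    if newText.hasPrefix(lastStreamedText) {
        // APPEND mode: paste only the delta, which avoids flicker.
        let delta = String(newText.dropFirst(lastStreamedText.count))
        return PasteEdit(deleteCount: 0, insert: delta)
    }
    // FULL REPLACE mode: the model revised earlier words.
    return PasteEdit(deleteCount: lastStreamedText.count, insert: newText)
}
```

After applying the edit, the caller would store `newText` in `lastStreamedText` so the next partial is compared against what is actually on screen.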

2. `ParakeetTranscriptionService.swift` — Streaming Engine

Manages the Parakeet EOU (End-of-Utterance) model for real-time transcription:

startStreaming(model:) — Downloads EOU 320ms models if needed, creates StreamingEouAsrManager, returns AsyncStream<String>
streamAudio(samples:frameCount:sampleRate:channels:) — Called from audio thread, creates mono PCM buffer, feeds to EOU manager
finishStreaming() — Gets final text from EOU manager
cancelStreaming() — Resets EOU manager
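
The mono conversion step in streamAudio could look roughly like the sketch below. It assumes the incoming samples are interleaved Float32 and that channels are averaged; the source does not say which layout or mixing rule is used, and `downmixToMono` is a hypothetical helper. The real method would additionally copy the result into an AVAudioPCMBuffer.

```swift
/// Averages interleaved multi-channel samples down to one mono channel.
/// Assumption: `samples` holds `frameCount * channels` interleaved values.
func downmixToMono(samples: [Float], frameCount: Int, channels: Int) -> [Float] {
    precondition(channels > 0, "channels must be positive")
    precondition(frameCount >= 0, "frameCount must not be negative")
    precondition(samples.count >= frameCount * channels, "sample buffer too short")

    // Already mono: just take the requested number of frames.
    if channels == 1 { return Array(samples.prefix(frameCount)) }

    var mono = [Float](repeating: 0, count: frameCount)
    for frame in 0..<frameCount {
        var sum: Float = 0
        for channel in 0..<channels {
            sum += samples[frame * channels + channel]
        }
        mono[frame] = sum / Float(channels)
    }
    return mono
}
```

Because this runs on the audio thread, a production version would likely reuse a preallocated buffer instead of allocating an array per call.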

Uses StreamingEouAsrManager from FluidAudio SDK with:

320ms chunk size (balance of accuracy vs latency)
1280ms EOU debounce (end-of-utterance detection after ~1.3s silence)
Partial callback yields to AsyncStream continuation
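
The bridge from a partial-transcript callback to an AsyncStream can be sketched with the standard library alone. `PartialTranscriptSource`, `emitPartial`, and `collectPartials` are hypothetical stand-ins; the FluidAudio `StreamingEouAsrManager` API is not reproduced here.

```swift
/// Hypothetical stand-in for the streaming engine: it owns the stream's
/// continuation and yields each partial transcript into it.
final class PartialTranscriptSource {
    private var continuation: AsyncStream<String>.Continuation?

    /// Creates the stream and keeps its continuation for later yields.
    func startStreaming() -> AsyncStream<String> {
        AsyncStream<String> { continuation in
            self.continuation = continuation
        }
    }

    /// Plays the role of the partial callback.
    func emitPartial(_ text: String) {
        continuation?.yield(text)
    }

    /// Ends the stream so the consumer's `for await` loop exits.
    func finish() {
        continuation?.finish()
        continuation = nil
    }
}

/// Consumer side, mirroring the loop that streamingUpdateTask would run.
func collectPartials() async -> [String] {
    let source = PartialTranscriptSource()
    let stream = source.startStreaming()
    source.emitPartial("hello")
    source.emitPartial("hello world")
    source.finish()

    var seen: [String] = []
    for await text in stream {
        seen.append(text)
    }
    return seen
}
```

With the default buffering policy, values yielded before the consumer starts iterating are kept, which is why the loop above still sees both partials.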

3. `CursorPaster.swift` — Text Injection

Handles pasting text into the active application:

setStreamingMode(enabled) — Skips clipboard save/restore during streaming (prevents race conditions)
pasteAtCursor(text) — Sets clipboard + simulates Cmd+V
deleteCharacters(count) — Simulates backspace key presses (used to delete old streaming text before pasting updated text)
pressEnter() — Simulates Enter key (for auto-send in Power Mode)
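
The effect of streaming mode on clipboard handling can be illustrated with an in-memory clipboard. Everything here is a simplified assumption: `Clipboard`, `InMemoryClipboard`, and `SketchPaster` are hypothetical, the paste shortcut is injected as a closure instead of a simulated Cmd+V, and the restore happens synchronously, whereas a real paster would have to wait for the target app to read the clipboard.

```swift
/// Hypothetical clipboard abstraction so the logic runs without AppKit.
protocol Clipboard: AnyObject {
    var contents: String? { get set }
}

final class InMemoryClipboard: Clipboard {
    var contents: String?
    init(contents: String? = nil) { self.contents = contents }
}

final class SketchPaster {
    private let clipboard: Clipboard
    private let sendPasteShortcut: () -> Void   // stands in for simulating Cmd+V
    private(set) var streamingMode = false

    init(clipboard: Clipboard, sendPasteShortcut: @escaping () -> Void) {
        self.clipboard = clipboard
        self.sendPasteShortcut = sendPasteShortcut
    }

    func setStreamingMode(_ enabled: Bool) {
        streamingMode = enabled
    }

    func pasteAtCursor(_ text: String) {
        // Outside streaming mode, remember what the user had on the clipboard.
        let saved = streamingMode ? nil : clipboard.contents
        clipboard.contents = text
        sendPasteShortcut()
        // In streaming mode the restore is skipped, so rapid successive
        // pastes never fight over the clipboard contents.
        if !streamingMode {
            clipboard.contents = saved
        }
    }
}
```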

4. `ClipboardManager.swift` — Clipboard Helper

Simple clipboard read/write with transient paste support.

5. `CoreAudioRecorder.swift` — Audio Capture

Low-level CoreAudio recording with streaming callback:

streamingAudioCallback property — Called on every audio render cycle with raw samples
Render callback: passes each block of samples to streamingAudioCallback, which routes them to ParakeetTranscriptionService

6. `Recorder.swift` — Recording Coordinator

Wraps CoreAudioRecorder:

setStreamingAudioCallback(callback) — Stores callback and applies to CoreAudioRecorder
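
The callback hand-off between the two recorder layers might be wired as in this sketch. The closure signature is an assumption modelled on streamAudio(samples:frameCount:sampleRate:channels:), and `SketchRecorder`, `SketchCoreAudioRecorder`, and `render` are hypothetical names; thread-safety of the callback swap is deliberately ignored here.

```swift
/// Assumed callback shape, mirroring the streamAudio parameter list.
typealias StreamingAudioCallback =
    (_ samples: UnsafePointer<Float>, _ frameCount: Int, _ sampleRate: Double, _ channels: Int) -> Void

final class SketchCoreAudioRecorder {
    var streamingAudioCallback: StreamingAudioCallback?

    /// Stand-in for the CoreAudio render callback.
    func render(samples: UnsafePointer<Float>, frameCount: Int, sampleRate: Double, channels: Int) {
        streamingAudioCallback?(samples, frameCount, sampleRate, channels)
    }
}

final class SketchRecorder {
    private var callback: StreamingAudioCallback?
    private(set) var coreRecorder: SketchCoreAudioRecorder?

    /// Stores the callback and applies it to the underlying recorder if one exists.
    func setStreamingAudioCallback(_ callback: StreamingAudioCallback?) {
        self.callback = callback
        coreRecorder?.streamingAudioCallback = callback
    }

    /// A recorder created later still receives the stored callback,
    /// so streaming can be set up before recording begins.
    func startRecording() {
        let recorder = SketchCoreAudioRecorder()
        recorder.streamingAudioCallback = callback
        coreRecorder = recorder
    }
}
```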

7. `TranscriptionServiceRegistry.swift` — Service Lookup

Lazy-initializes ParakeetTranscriptionService and routes transcription requests.
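
Lazy initialization of this kind is commonly written with a `lazy var`. The sketch below assumes a two-provider setup; `ModelProvider`, `TranscriptionService`, and the `Sketch…` types are hypothetical, and the `onInit` hook exists only to make the deferred construction observable.

```swift
/// Hypothetical provider tag used to pick a service.
enum ModelProvider { case parakeet, whisper }

protocol TranscriptionService {
    var name: String { get }
}

struct SketchWhisperService: TranscriptionService {
    let name = "whisper"
}

final class SketchParakeetService: TranscriptionService {
    let name = "parakeet"
    init(onInit: () -> Void) { onInit() }   // hook to observe construction
}

final class SketchServiceRegistry {
    private let onParakeetInit: () -> Void
    private let whisper = SketchWhisperService()

    // Built on first access only, so the cost is deferred until needed.
    private lazy var parakeet = SketchParakeetService(onInit: onParakeetInit)

    init(onParakeetInit: @escaping () -> Void = {}) {
        self.onParakeetInit = onParakeetInit
    }

    /// Routes a request to the matching service.
    func service(for provider: ModelProvider) -> TranscriptionService {
        switch provider {
        case .parakeet: return parakeet
        case .whisper: return whisper
        }
    }
}
```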

Data Flow

Microphone Audio
    │
    ▼
CoreAudioRecorder (render callback on audio thread)
    │
    ▼ streamingAudioCallback
    │
ParakeetTranscriptionService.streamAudio()
    │  creates mono AVAudioPCMBuffer
    │
    ▼ Task.detached
    │
StreamingEouAsrManager.process(audioBuffer)  [FluidAudio SDK]
    │  320ms chunks → partial transcripts
    │
    ▼ partialCallback
    │
AsyncStream<String>.Continuation.yield(text)
    │
    ▼ for await text in transcriptStream
    │
WhisperState.handleStreamingUpdate(newText)
    │
    ├─ If newText starts with lastStreamedText:
    │     → CursorPaster.pasteAtCursor(deltaOnly)  [APPEND mode - no flicker]
    │
    └─ Otherwise:
          → CursorPaster.deleteCharacters(lastStreamedText.count)
          → wait for deletions
          → CursorPaster.pasteAtCursor(newText)     [FULL REPLACE mode]
    │
    ▼ When recording stops
    │
WhisperState.handleStreamingCompletion()
    │
    ├─ finishStreamingTranscription() → deletes preview text
    │
    └─ Runs batch Whisper transcription for accuracy
         → CursorPaster.pasteAtCursor(accurateText)
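
The final stage of the flow above (delete the preview, run the batch pass, paste the accurate text) can be sketched with injected closures. `HybridCompletion` and its members are hypothetical; in the real app these steps would call into CursorPaster and the Whisper batch service, and error handling is omitted.

```swift
/// Hypothetical container for the three operations the completion step needs.
struct HybridCompletion {
    var deleteCharacters: (Int) async -> Void
    var runBatchTranscription: () async -> String
    var pasteAtCursor: (String) async -> Void

    /// Replaces the streamed preview with the batch result and returns it.
    func finish(previewText: String) async -> String {
        // 1. Remove whatever the streaming preview left in the text field.
        if !previewText.isEmpty {
            await deleteCharacters(previewText.count)
        }
        // 2. Run the slower, more accurate batch transcription.
        let accurateText = await runBatchTranscription()
        // 3. Paste the final text where the preview used to be.
        await pasteAtCursor(accurateText)
        return accurateText
    }
}
```

Injecting the operations keeps the ordering logic testable without touching the keyboard, clipboard, or a model.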

Key Design Decisions

Streaming setup BEFORE recording starts — Avoids losing early audio
Differential paste (append vs full replace) — Reduces visual flicker during continuous speech
Streaming mode in CursorPaster — Skips clipboard save/restore to prevent race conditions
HYBRID batch + stream — Parakeet streams (6.05% WER) for preview, Whisper batch (2.7% WER) for final accuracy
320ms chunk size — Larger chunks = fewer corrections/flicker, acceptable latency tradeoff
Inter-keystroke delays in deleteCharacters — 1.5ms pause every 5 backspaces to prevent keystroke loss
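
The backspace pacing can be sketched as a schedule of key steps. `KeyStep` and `backspaceSchedule` are hypothetical names; the batch size of 5 and the 1.5 ms pause come from the list above, while skipping the pause after the final backspace is an assumption.

```swift
/// One step in a deletion sequence: a key press or a short wait.
enum KeyStep: Equatable {
    case backspace
    case pause(microseconds: UInt32)
}

/// Builds the sequence of backspaces with a pause after every full batch.
func backspaceSchedule(count: Int,
                       batchSize: Int = 5,
                       pauseMicroseconds: UInt32 = 1_500) -> [KeyStep] {
    precondition(batchSize > 0, "batchSize must be positive")
    var steps: [KeyStep] = []
    for index in 0..<max(count, 0) {
        steps.append(.backspace)
        let sent = index + 1
        // Pause between batches, but not after the very last key press.
        if sent % batchSize == 0 && sent < count {
            steps.append(.pause(microseconds: pauseMicroseconds))
        }
    }
    return steps
}
```

A caller would walk the schedule, posting a key event for each `.backspace` and sleeping for each `.pause`.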
