clawdbot-workspace/voiceink-streaming-code/STREAMING_ARCHITECTURE.md
2026-02-06 23:01:30 -05:00

VoiceInk Real-Time Streaming Voice-to-Text Architecture

How It Works (HYBRID approach)

  1. While speaking: The Parakeet EOU model streams partial transcriptions into the active text field in real time (~320ms latency)
  2. When done speaking: The streamed preview text is deleted, and a high-accuracy batch transcription (Whisper large-v3-turbo) replaces it

This gives the user near-instant visual feedback while the final output comes from the more accurate batch model.


Files & Their Roles

1. WhisperState.swift — Orchestrator

The main state machine that coordinates everything. Key streaming methods:

  • startStreamingTranscription() — Sets up audio callback + starts Parakeet EOU streaming BEFORE recording begins
  • handleStreamingUpdate(newText) — Receives partial transcripts and pastes them (with differential updates to reduce flicker)
  • finishStreamingTranscription() — Stops streaming, deletes preview text, returns final text
  • cancelStreamingTranscription() — Aborts streaming and cleans up
  • handleStreamingCompletion() — HYBRID: finishes streaming preview, then runs batch transcription for accuracy

Key properties:

  • streamingUpdateTask — Task consuming the AsyncStream of partial transcripts
  • lastStreamedText — Tracks what's currently pasted so we can do differential updates
  • isStreamingActive — Guard flag
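
The choice between appending and fully replacing in handleStreamingUpdate(newText) can be sketched as a pure function. This is a minimal illustration, not the actual WhisperState code: StreamingEdit and planStreamingEdit are hypothetical names, and it assumes one backspace removes one Swift Character (grapheme cluster) in the target app.

```swift
/// The edit needed to turn the text already pasted into the latest partial transcript.
/// (Hypothetical type, for illustration only.)
enum StreamingEdit: Equatable {
    case unchanged                               // nothing to do
    case append(String)                          // paste only the new suffix
    case replace(deleteCount: Int, with: String) // backspace the old text, paste the new text
}

/// Decides between APPEND and FULL REPLACE based on what is currently on screen.
func planStreamingEdit(lastStreamedText: String, newText: String) -> StreamingEdit {
    if newText == lastStreamedText {
        return .unchanged
    }
    // Character-wise prefix check, so the suffix offset below stays consistent.
    if newText.starts(with: lastStreamedText) {
        let delta = String(newText.dropFirst(lastStreamedText.count))
        return .append(delta)
    }
    // The model revised earlier words: remove everything and paste again.
    return .replace(deleteCount: lastStreamedText.count, with: newText)
}
```

After applying the planned edit, the caller would set lastStreamedText to newText so the next partial is compared against what is actually in the text field.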

2. ParakeetTranscriptionService.swift — Streaming Engine

Manages the Parakeet EOU (End-of-Utterance) model for real-time transcription:

  • startStreaming(model:) — Downloads EOU 320ms models if needed, creates StreamingEouAsrManager, returns AsyncStream<String>
  • streamAudio(samples:frameCount:sampleRate:channels:) — Called from audio thread, creates mono PCM buffer, feeds to EOU manager
  • finishStreaming() — Gets final text from EOU manager
  • cancelStreaming() — Resets EOU manager
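
As a rough illustration of the "creates mono PCM buffer" step in streamAudio, the sketch below averages interleaved channels into a single channel. It assumes interleaved Float samples and plain arrays; the real service works with AVAudioPCMBuffer and may downmix differently. downmixToMono is a hypothetical helper name.

```swift
/// Averages interleaved multi-channel samples into one mono channel.
/// Assumes `samples` holds at least `frameCount * channels` interleaved values.
func downmixToMono(samples: [Float], frameCount: Int, channels: Int) -> [Float] {
    precondition(channels > 0, "channels must be positive")
    precondition(samples.count >= frameCount * channels, "not enough samples")

    // Mono input needs no mixing; just take the requested frames.
    if channels == 1 {
        return Array(samples.prefix(frameCount))
    }

    var mono = [Float](repeating: 0, count: frameCount)
    for frame in 0..<frameCount {
        var sum: Float = 0
        for channel in 0..<channels {
            sum += samples[frame * channels + channel]
        }
        mono[frame] = sum / Float(channels)
    }
    return mono
}
```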

Uses StreamingEouAsrManager from FluidAudio SDK with:

  • 320ms chunk size (balance of accuracy vs latency)
  • 1280ms EOU debounce (end-of-utterance detection after ~1.3s silence)
  • Partial callback yields to AsyncStream continuation
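
The callback-to-AsyncStream bridge described above can be sketched with a stand-in engine. StubStreamingEngine, makeTranscriptStream, and collectPartials are hypothetical names; the stub is not the FluidAudio StreamingEouAsrManager API, whose exact callback signature is not shown in this document.

```swift
/// Stand-in for a streaming ASR engine that reports partial transcripts via a callback.
final class StubStreamingEngine {
    var partialCallback: ((String) -> Void)?
    var onFinish: (() -> Void)?

    func emit(_ text: String) { partialCallback?(text) }
    func finish() { onFinish?() }
}

/// Wraps the engine's callbacks in an AsyncStream so a consumer can use `for await`.
func makeTranscriptStream(engine: StubStreamingEngine) -> AsyncStream<String> {
    AsyncStream<String> { continuation in
        engine.partialCallback = { text in
            _ = continuation.yield(text)   // each partial becomes one stream element
        }
        engine.onFinish = {
            continuation.finish()          // ends the consumer's `for await` loop
        }
    }
}

/// Example consumer: gathers every partial until the stream finishes.
func collectPartials() async -> [String] {
    let engine = StubStreamingEngine()
    let stream = makeTranscriptStream(engine: engine)

    // With the default unbounded buffering, values yielded before iteration are kept.
    engine.emit("hello")
    engine.emit("hello world")
    engine.finish()

    var received: [String] = []
    for await text in stream {
        received.append(text)
    }
    return received
}
```

In the real app the consumer loop would live in a long-running task (the streamingUpdateTask property of WhisperState) rather than collecting into an array.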

3. CursorPaster.swift — Text Injection

Handles pasting text into the active application:

  • setStreamingMode(enabled) — Skips clipboard save/restore during streaming (prevents race conditions)
  • pasteAtCursor(text) — Sets clipboard + simulates Cmd+V
  • deleteCharacters(count) — Simulates backspace key presses (used to delete old streaming text before pasting updated text)
  • pressEnter() — Simulates Enter key (for auto-send in Power Mode)
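
To illustrate what streaming mode changes, here is a small sketch with an in-memory clipboard and an injected paste action. InMemoryClipboard and PasterSketch are hypothetical; the real CursorPaster uses the system pasteboard and a simulated Cmd+V, and the timing of its restore step is not described in this document.

```swift
/// In-memory stand-in for the system clipboard.
final class InMemoryClipboard {
    var contents: String?
}

/// Simplified paster: sets the clipboard, triggers a paste, and restores the
/// previous clipboard contents only when streaming mode is off.
final class PasterSketch {
    private let clipboard: InMemoryClipboard
    private let simulatePaste: () -> Void   // stands in for the Cmd+V key event
    private var streamingMode = false

    init(clipboard: InMemoryClipboard, simulatePaste: @escaping () -> Void) {
        self.clipboard = clipboard
        self.simulatePaste = simulatePaste
    }

    func setStreamingMode(_ enabled: Bool) {
        streamingMode = enabled
    }

    func pasteAtCursor(_ text: String) {
        let saved = clipboard.contents
        clipboard.contents = text
        simulatePaste()
        // During streaming, pastes arrive in quick succession; skipping the
        // restore avoids one paste's restore overwriting the next paste's text.
        if !streamingMode {
            clipboard.contents = saved
        }
    }
}
```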

4. ClipboardManager.swift — Clipboard Helper

Simple clipboard read/write with transient paste support.

5. CoreAudioRecorder.swift — Audio Capture

Low-level CoreAudio recording with streaming callback:

  • streamingAudioCallback property — Called on every audio render cycle with raw samples
  • The render callback passes each buffer of samples to streamingAudioCallback, which routes them to ParakeetTranscriptionService

6. Recorder.swift — Recording Coordinator

Wraps CoreAudioRecorder:

  • setStreamingAudioCallback(callback) — Stores callback and applies to CoreAudioRecorder
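
One way to read "stores callback and applies to CoreAudioRecorder" is that the callback can be registered before the low-level recorder exists and is handed over when recording starts. The sketch below shows that pattern under stated assumptions: the type names, the array-based callback signature, and startRecording() are all hypothetical simplifications of the real classes.

```swift
/// Assumed callback shape for illustration; the real one likely passes raw sample pointers.
typealias StreamingAudioCallback = (_ samples: [Float], _ frameCount: Int, _ sampleRate: Double, _ channels: Int) -> Void

/// Stand-in for the low-level recorder that invokes the callback on every render cycle.
final class CoreAudioRecorderSketch {
    var streamingAudioCallback: StreamingAudioCallback?

    func simulateRenderCycle(samples: [Float], sampleRate: Double, channels: Int) {
        streamingAudioCallback?(samples, samples.count / channels, sampleRate, channels)
    }
}

/// Stand-in for the coordinator that owns the low-level recorder.
final class RecorderSketch {
    private var storedCallback: StreamingAudioCallback?
    private(set) var coreRecorder: CoreAudioRecorderSketch?

    /// Remembers the callback and applies it right away if a recorder already exists.
    func setStreamingAudioCallback(_ callback: StreamingAudioCallback?) {
        storedCallback = callback
        coreRecorder?.streamingAudioCallback = callback
    }

    /// Creates the low-level recorder and hands it the stored callback before any audio flows.
    func startRecording() {
        let recorder = CoreAudioRecorderSketch()
        recorder.streamingAudioCallback = storedCallback
        coreRecorder = recorder
    }
}
```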

7. TranscriptionServiceRegistry.swift — Service Lookup

Lazy-initializes ParakeetTranscriptionService and routes transcription requests.


Data Flow

Microphone Audio
    │
    ▼
CoreAudioRecorder (render callback on audio thread)
    │
    ▼ streamingAudioCallback
    │
ParakeetTranscriptionService.streamAudio()
    │  creates mono AVAudioPCMBuffer
    │
    ▼ Task.detached
    │
StreamingEouAsrManager.process(audioBuffer)  [FluidAudio SDK]
    │  320ms chunks → partial transcripts
    │
    ▼ partialCallback
    │
AsyncStream<String>.Continuation.yield(text)
    │
    ▼ for await text in transcriptStream
    │
WhisperState.handleStreamingUpdate(newText)
    │
    ├─ If newText starts with lastStreamedText:
    │     → CursorPaster.pasteAtCursor(deltaOnly)  [APPEND mode - no flicker]
    │
    └─ Otherwise:
          → CursorPaster.deleteCharacters(lastStreamedText.count)
          → wait for deletions
          → CursorPaster.pasteAtCursor(newText)     [FULL REPLACE mode]
    │
    ▼ When recording stops
    │
WhisperState.handleStreamingCompletion()
    │
    ├─ finishStreamingTranscription() → deletes preview text
    │
    └─ Runs batch Whisper transcription for accuracy
         → CursorPaster.pasteAtCursor(accurateText)
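
The final stage of the flow above (delete the preview, then paste the batch result) can be sketched against a simulated text field. TextFieldSketch and handleStreamingCompletionSketch are hypothetical names, and the batch transcription is injected as a synchronous closure for brevity; the real call would be asynchronous.

```swift
/// Simulated text field that supports pasting at the end and backspacing.
final class TextFieldSketch {
    private(set) var text = ""

    func paste(_ string: String) {
        text += string
    }

    func deleteCharacters(_ count: Int) {
        text.removeLast(min(count, text.count))
    }
}

/// Removes the streamed preview, then pastes the higher-accuracy batch transcript.
func handleStreamingCompletionSketch(
    field: TextFieldSketch,
    lastStreamedText: String,
    batchTranscribe: () -> String
) {
    // Step 1: delete exactly what the streaming preview put on screen.
    field.deleteCharacters(lastStreamedText.count)
    // Step 2: run the batch model and paste its output in place of the preview.
    let accurateText = batchTranscribe()
    field.paste(accurateText)
}
```

Only the preview is removed, so any text the user already had in the field before dictation is left untouched.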

Key Design Decisions

  1. Streaming setup BEFORE recording starts — Avoids losing early audio
  2. Differential paste (append vs full replace) — Reduces visual flicker during continuous speech
  3. Streaming mode in CursorPaster — Skips clipboard save/restore to prevent race conditions
  4. HYBRID batch + stream — Parakeet streams (6.05% WER) for preview, Whisper batch (2.7% WER) for final accuracy
  5. 320ms chunk size — Larger chunks = fewer corrections/flicker, acceptable latency tradeoff
  6. Inter-keystroke delays in deleteCharacters — 1.5ms pause every 5 backspaces to prevent keystroke loss
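
The batching in decision 6 can be sketched with injected actions so the pacing logic is visible without real key events. deleteCharactersSketch is a hypothetical function; the real deleteCharacters posts backspace key events and sleeps, and whether it also pauses after the final batch is an assumption here.

```swift
/// Sends `count` backspaces, pausing briefly after every `batchSize` of them.
/// `sendBackspace` and `pause` are injected so the pacing can be observed in isolation.
func deleteCharactersSketch(
    count: Int,
    batchSize: Int = 5,
    pauseSeconds: Double = 0.0015,
    sendBackspace: () -> Void,
    pause: (Double) -> Void
) {
    guard count > 0, batchSize > 0 else { return }
    for index in 1...count {
        sendBackspace()
        // Give the target app time to process events after each full batch.
        if index % batchSize == 0 {
            pause(pauseSeconds)
        }
    }
}
```

With the default values, deleting 12 characters sends 12 backspaces and pauses twice (after the 5th and 10th), adding about 3ms in total.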