.agents/memory/VOICE-WORKFLOW.md

voice message workflow

when you receive a voice message:

  1. transcribe it:
# convert to 16 kHz mono wav (replace <input.ogg> with the path of the received file)
ffmpeg -i <input.ogg> -ar 16000 -ac 1 -f wav /tmp/voice_input.wav -y 2>&1 | tail -3

# transcribe using hyprwhspr's whisper
~/.local/share/hyprwhspr/venv/bin/python << 'EOF'
from pywhispercpp.model import Model
m = Model('base.en', n_threads=4)
result = m.transcribe('/tmp/voice_input.wav')
# concatenate all segments (fixes truncation for longer audio)
full_text = ' '.join(seg.text for seg in result) if result else ''
print(full_text)
EOF
  2. respond normally with text

  3. generate voice reply:

curl -s -X POST http://localhost:8765/tts \
  -H "Content-Type: application/json" \
  -d '{"text":"YOUR REPLY TEXT HERE","format":"ogg"}' \
  --output /tmp/voice_reply.ogg
  4. send voice reply:

discord (preferred method, inline MEDIA tag): include this line in your text reply and clawdbot auto-attaches the file:

MEDIA:/tmp/voice_reply.ogg

telegram (via message tool):

clawdbot message send --channel telegram --target 6661478571 --media /tmp/voice_reply.ogg

fallback (if the message tool has auth issues): use the MEDIA: tag method. It works on all channels because it goes through clawdbot's internal reply routing rather than the gateway HTTP API.
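
The reply-generation and send steps above can also be sketched in Python, which avoids hand-escaping quotes inside the JSON body passed to curl. This is a minimal sketch, not the way clawdbot itself does it: the helper names (`build_tts_request`, `synthesize`, `with_media_tag`) and the 120-second timeout are assumptions for illustration, and placing the MEDIA: line last in the reply is a guess. The endpoint, payload fields, and output path are the ones used in the curl command.

```python
import json
import urllib.request

TTS_URL = "http://localhost:8765/tts"   # endpoint from the curl command above
REPLY_PATH = "/tmp/voice_reply.ogg"     # output path from the curl command above


def build_tts_request(text, fmt="ogg"):
    """Build the POST request; json.dumps handles quotes and newlines in text."""
    body = json.dumps({"text": text, "format": fmt}).encode("utf-8")
    return urllib.request.Request(
        TTS_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, out_path=REPLY_PATH, timeout=120.0):
    """Send the request and write the returned audio to out_path.

    The generous timeout is an assumption: the model loads lazily,
    so the first request after an idle period may be slow.
    """
    with urllib.request.urlopen(build_tts_request(text), timeout=timeout) as resp:
        audio = resp.read()
    with open(out_path, "wb") as fh:
        fh.write(audio)
    return out_path


def with_media_tag(reply_text, media_path=REPLY_PATH):
    """Append the MEDIA: tag on its own line (hypothetical helper)."""
    return f"{reply_text.rstrip()}\nMEDIA:{media_path}"
```

Typical use would be `synthesize(reply)` followed by sending `with_media_tag(reply)` as the text reply.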

tts service details:

  • running on port 8765
  • using qwen3-tts-12hz-1.7b-base (upgraded from 0.6b for better accent preservation)
  • voice cloning with nicholai's snape voice impression
  • reference audio: /mnt/work/clawdbot-voice/reference_snape_v2.wav
  • systemd service: clawdbot-tts.service
  • auto-starts on boot, restarts on failure
  • idle timeout: automatically unloads the model after 120 s of inactivity (frees ~3.5 GB VRAM)
  • lazy loading: model loads on first request, not at startup
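
The lazy-loading and idle-timeout behaviour listed above can be illustrated with a small wrapper. This is a sketch of the general pattern only; the actual code in clawdbot-tts.service is not shown in this note and may work differently. `LazyModel`, `loader`, and `unload_if_idle` are hypothetical names, and the clock is injectable so the logic can be checked without waiting.

```python
import threading
import time


class LazyModel:
    """Load a model on first use and drop it after a period of inactivity."""

    def __init__(self, loader, idle_seconds=120.0, clock=time.monotonic):
        self._loader = loader            # callable that returns the loaded model
        self._idle_seconds = idle_seconds
        self._clock = clock
        self._lock = threading.Lock()
        self._model = None
        self._last_used = 0.0

    @property
    def loaded(self):
        return self._model is not None

    def get(self):
        """Return the model, loading it if needed, and reset the idle timer."""
        with self._lock:
            if self._model is None:
                self._model = self._loader()
            self._last_used = self._clock()
            return self._model

    def unload_if_idle(self):
        """Meant to be called periodically; returns True if the model was dropped."""
        with self._lock:
            idle = self._clock() - self._last_used
            if self._model is not None and idle >= self._idle_seconds:
                # A real service would also release GPU memory at this point.
                self._model = None
                return True
            return False
```

A request handler would call `get()` on every request, while a background thread or timer calls `unload_if_idle()` every few seconds.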

transcription details:

  • using pywhispercpp (whisper.cpp python bindings)
  • model: base.en (same as hyprwhspr)
  • venv: ~/.local/share/hyprwhspr/venv/
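
For repeated use, the transcription step can be wrapped as a single function. The sketch below assumes it is run with the venv's interpreter listed above and that ffmpeg is on PATH. The function names are hypothetical, and stripping whitespace from each segment before joining is a small addition to the original one-liner.

```python
import subprocess


def ffmpeg_to_wav_cmd(src, dst="/tmp/voice_input.wav"):
    """Build the ffmpeg command that converts src to 16 kHz mono WAV."""
    return ["ffmpeg", "-i", src, "-ar", "16000", "-ac", "1", "-f", "wav", dst, "-y"]


def join_segments(segments):
    """Concatenate all segment texts; returns '' when there are no segments."""
    return " ".join(seg.text.strip() for seg in segments) if segments else ""


def transcribe_voice(src, wav="/tmp/voice_input.wav"):
    """Convert the voice message, then transcribe it with the base.en model."""
    subprocess.run(ffmpeg_to_wav_cmd(src, wav), check=True, capture_output=True)
    # Imported here so the helpers above work without pywhispercpp installed.
    from pywhispercpp.model import Model

    model = Model("base.en", n_threads=4)
    return join_segments(model.transcribe(wav))
```

Passing the command as a list avoids shell quoting problems when the input filename contains spaces.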