voice message workflow

when you receive a voice message:

transcribe it:

# convert to 16 kHz mono wav (-y overwrites any previous file)
ffmpeg -y -i <input.ogg> -ar 16000 -ac 1 -f wav /tmp/voice_input.wav 2>&1 | tail -3

# transcribe using hyprwhspr's whisper
~/.local/share/hyprwhspr/venv/bin/python << 'EOF'
from pywhispercpp.model import Model
m = Model('base.en', n_threads=4)
result = m.transcribe('/tmp/voice_input.wav')
# concatenate all segments (fixes truncation for longer audio)
full_text = ' '.join(seg.text for seg in result) if result else ''
print(full_text)
EOF
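
The two steps above can also be run from a single Python script. This is a minimal sketch, assuming `ffmpeg` is on the PATH and that the script runs under the hyprwhspr venv interpreter. The helper names (`ffmpeg_cmd`, `join_segments`, `transcribe_voice`) are made up for illustration and are not part of clawdbot or hyprwhspr.

```python
import subprocess

WAV_PATH = "/tmp/voice_input.wav"  # same scratch path as the commands above


def ffmpeg_cmd(src, dst=WAV_PATH):
    """Build the ffmpeg call that converts src to 16 kHz mono WAV."""
    return ["ffmpeg", "-y", "-i", str(src),
            "-ar", "16000", "-ac", "1", "-f", "wav", str(dst)]


def join_segments(segments):
    """Concatenate all segment texts so longer audio is not truncated."""
    return " ".join(seg.text.strip() for seg in segments or [])


def transcribe_voice(src, model_name="base.en", n_threads=4):
    """Convert src with ffmpeg, then transcribe the WAV with pywhispercpp."""
    # Imported here so the helpers above stay usable outside the venv.
    from pywhispercpp.model import Model

    # check=True raises CalledProcessError if ffmpeg fails.
    subprocess.run(ffmpeg_cmd(src), check=True, capture_output=True)
    model = Model(model_name, n_threads=n_threads)
    return join_segments(model.transcribe(WAV_PATH))
```

Usage would look like `print(transcribe_voice("/path/to/message.ogg"))`. Unlike the `| tail -3` pipeline, a failed conversion stops the script instead of passing a stale or missing WAV to whisper.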

respond normally with text

generate voice reply:

curl -s -X POST http://localhost:8765/tts \
  -H "Content-Type: application/json" \
  -d '{"text":"YOUR REPLY TEXT HERE","format":"ogg"}' \
  --output /tmp/voice_reply.ogg
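
Pasting reply text into the single-quoted JSON above breaks as soon as the text contains a quote or a newline. The sketch below builds the same request in Python and lets `json.dumps` do the escaping. It assumes, as the `--output` flag suggests, that the service returns the audio bytes directly in the response body. The function names and the 120-second timeout are illustrative choices, not part of the service.

```python
import json
import urllib.request

TTS_URL = "http://localhost:8765/tts"  # endpoint from the curl command above


def build_tts_request(text, fmt="ogg"):
    """Build the POST request; json.dumps escapes quotes and newlines in text."""
    body = json.dumps({"text": text, "format": fmt}).encode("utf-8")
    return urllib.request.Request(
        TTS_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, out_path="/tmp/voice_reply.ogg", timeout=120):
    """Send text to the TTS service and write the returned audio to out_path."""
    # Generous timeout (an assumption): the first request may wait for the model to load.
    with urllib.request.urlopen(build_tts_request(text), timeout=timeout) as resp:
        audio = resp.read()
    with open(out_path, "wb") as fh:
        fh.write(audio)
    return out_path
```

Calling `synthesize('He said "hello"')` writes `/tmp/voice_reply.ogg` just as the curl command does, without hand-escaping the text.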

send voice reply:

discord (preferred method, inline MEDIA tag): include this line in your text reply and clawdbot attaches the file automatically:

MEDIA:/tmp/voice_reply.ogg

telegram (via message tool):

clawdbot message send --channel telegram --target 6661478571 --media /tmp/voice_reply.ogg

fallback (if message tool has auth issues): Use the MEDIA: tag method — it works on all channels since it goes through clawdbot's internal reply routing, not the gateway HTTP API.
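
If replies are assembled in a script, the MEDIA: line can be appended programmatically. This is a hypothetical helper, not a clawdbot API. It assumes the tag goes on its own line at the end of the reply, and it checks that the audio file exists so a failed TTS call does not produce a dangling tag.

```python
import os


def reply_with_media(text, media_path="/tmp/voice_reply.ogg"):
    """Append a MEDIA: line to the reply text, failing early if the audio is missing."""
    if not os.path.isfile(media_path) or os.path.getsize(media_path) == 0:
        raise FileNotFoundError(f"no audio at {media_path}")
    return f"{text.rstrip()}\nMEDIA:{media_path}"
```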

tts service details:

running on port 8765
using qwen3-tts-12hz-1.7b-base (upgraded from 0.6b for better accent preservation)
voice cloning with nicholai's snape voice impression
reference audio: /mnt/work/clawdbot-voice/reference_snape_v2.wav
systemd service: clawdbot-tts.service
auto-starts on boot, restarts on failure
idle timeout: automatically unloads model after 120s of inactivity (frees ~3.5GB VRAM)
lazy loading: model loads on first request, not at startup
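
The lazy loading and idle timeout behavior can be sketched as follows. This is not the code of clawdbot-tts.service, only an illustration of the pattern under stated assumptions: `loader` is a hypothetical callable that returns the loaded model, and something else (a background timer, for example) calls `unload_if_idle` periodically.

```python
import threading
import time


class LazyModel:
    """Load a model on first use and drop it after idle_s seconds without use."""

    def __init__(self, loader, idle_s=120.0, clock=time.monotonic):
        self._loader = loader  # hypothetical callable that returns the loaded model
        self._idle_s = idle_s
        self._clock = clock
        self._lock = threading.Lock()
        self._model = None
        self._last_used = 0.0

    @property
    def loaded(self):
        return self._model is not None

    def get(self):
        """Return the model, loading it if it is not currently in memory."""
        with self._lock:
            if self._model is None:
                self._model = self._loader()
            self._last_used = self._clock()
            return self._model

    def unload_if_idle(self):
        """Drop the model if it has been idle for idle_s seconds or more."""
        with self._lock:
            idle = self._clock() - self._last_used
            if self._model is not None and idle >= self._idle_s:
                self._model = None  # a real service would also release VRAM here
                return True
            return False
```

The practical consequence for callers is that the first request after boot, or after 120s of inactivity, pays the model load time, so client timeouts should allow for it.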

transcription details:

using pywhispercpp (whisper.cpp python bindings)
model: base.en (same as hyprwhspr)
venv: ~/.local/share/hyprwhspr/venv/
