voice message workflow
when you receive a voice message:
- transcribe it:
# convert to 16 kHz mono wav
ffmpeg -i <input.ogg> -ar 16000 -ac 1 -f wav /tmp/voice_input.wav -y 2>&1 | tail -3
# transcribe using hyprwhspr's whisper
~/.local/share/hyprwhspr/venv/bin/python << 'EOF'
from pywhispercpp.model import Model
m = Model('base.en', n_threads=4)
result = m.transcribe('/tmp/voice_input.wav')
# concatenate all segments (fixes truncation for longer audio)
full_text = ' '.join(seg.text for seg in result) if result else ''
print(full_text)
EOF
- respond normally with text
- generate voice reply:
curl -s -X POST http://localhost:8765/tts \
-H "Content-Type: application/json" \
-d '{"text":"YOUR REPLY TEXT HERE","format":"ogg"}' \
--output /tmp/voice_reply.ogg
- send voice reply:
discord (preferred method — inline MEDIA tag): Include this line in your text reply and clawdbot auto-attaches it:
MEDIA:/tmp/voice_reply.ogg
telegram (via message tool):
clawdbot message send --channel telegram --target 6661478571 --media /tmp/voice_reply.ogg
fallback (if message tool has auth issues): Use the MEDIA: tag method — it works on all channels since it goes through clawdbot's internal reply routing, not the gateway HTTP API.
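note on the generate step: the curl call embeds the reply text inside a single-quoted JSON string, so a reply containing quotes or newlines would need manual escaping. a minimal python sketch of the same request, assuming the /tts endpoint takes exactly the two JSON fields shown above and returns the audio bytes as the response body. the function names and the 120s timeout are hypothetical, not part of clawdbot:

```python
import json
import urllib.request

TTS_URL = "http://localhost:8765/tts"  # endpoint from the curl example


def build_tts_payload(text: str, fmt: str = "ogg") -> bytes:
    """Serialize the request body; json.dumps escapes quotes and newlines."""
    return json.dumps({"text": text, "format": fmt}).encode("utf-8")


def synthesize(text: str, out_path: str = "/tmp/voice_reply.ogg",
               timeout: float = 120.0) -> str:
    """POST the text to the TTS service and write the response bytes to out_path.

    The timeout is an assumed value; the first request after an idle period
    may be slow because the model loads lazily.
    """
    request = urllib.request.Request(
        TTS_URL,
        data=build_tts_payload(text),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        audio = response.read()
    with open(out_path, "wb") as handle:
        handle.write(audio)
    return out_path
```

calling synthesize('He said "hello"') would write the reply to /tmp/voice_reply.ogg, ready for the MEDIA: tag or the message tool.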
tts service details:
- running on port 8765
- using qwen3-tts-12hz-1.7b-base (upgraded from 0.6b for better accent preservation)
- voice cloning with nicholai's snape voice impression
- reference audio: /mnt/work/clawdbot-voice/reference_snape_v2.wav
- systemd service: clawdbot-tts.service
- auto-starts on boot, restarts on failure
- idle timeout: automatically unloads model after 120s of inactivity (frees ~3.5GB VRAM)
- lazy loading: model loads on first request, not at startup
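the lazy loading and idle timeout behaviour can be sketched as follows. this illustrates the general pattern only and is not the actual clawdbot-tts code: LazyModel, loader, and unload_if_idle are hypothetical names, and a real service would also need to release GPU memory explicitly when unloading.

```python
import threading
import time
from typing import Any, Callable, Optional


class LazyModel:
    """Load a model on first use and drop it after a period of inactivity."""

    def __init__(self, loader: Callable[[], Any], idle_seconds: float = 120.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader            # hypothetical: whatever builds the model
        self._idle_seconds = idle_seconds
        self._clock = clock              # injectable so the timeout can be tested
        self._lock = threading.Lock()
        self._model: Optional[Any] = None
        self._last_used = 0.0

    def get(self) -> Any:
        """Return the model, loading it first if it is not resident."""
        with self._lock:
            if self._model is None:
                self._model = self._loader()
            self._last_used = self._clock()
            return self._model

    def unload_if_idle(self) -> bool:
        """Drop the model if unused for idle_seconds; return True if it was dropped."""
        with self._lock:
            if self._model is None:
                return False
            if self._clock() - self._last_used < self._idle_seconds:
                return False
            self._model = None  # only drops the reference in this sketch
            return True

    @property
    def loaded(self) -> bool:
        with self._lock:
            return self._model is not None
```

in this sketch each request handler would call get(), and a background thread or timer would call unload_if_idle() every few seconds.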
transcription details:
- using pywhispercpp (whisper.cpp python bindings)
- model: base.en (same as hyprwhspr)
- venv: ~/.local/share/hyprwhspr/venv/
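the two transcription steps can also be wrapped in one python helper, run with the venv interpreter listed above. a minimal sketch, assuming ffmpeg is on PATH and that pywhispercpp segments expose a .text attribute as in the workflow snippet. the function names are hypothetical:

```python
import subprocess
from typing import Iterable, List


def ffmpeg_args(src: str, dst: str = "/tmp/voice_input.wav") -> List[str]:
    """Build the conversion command: 16 kHz, mono, WAV, overwrite existing output."""
    return ["ffmpeg", "-y", "-i", src, "-ar", "16000", "-ac", "1", "-f", "wav", dst]


def join_segments(segments: Iterable) -> str:
    """Concatenate segment texts, trimming stray whitespace and skipping empty ones."""
    parts = (seg.text.strip() for seg in segments or [])
    return " ".join(p for p in parts if p)


def transcribe(src: str, wav: str = "/tmp/voice_input.wav") -> str:
    """Convert src to WAV, then transcribe it with the base.en model."""
    subprocess.run(ffmpeg_args(src, wav), check=True, capture_output=True)
    # imported here so the helpers above work outside the hyprwhspr venv
    from pywhispercpp.model import Model
    model = Model("base.en", n_threads=4)
    return join_segments(model.transcribe(wav))
```

passing the command as a list avoids shell quoting problems when the input filename contains spaces.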