# voice message workflow

## when you receive a voice message:

1. **transcribe it:**

```bash
# convert to 16 kHz mono wav
# INPUT_FILE is a placeholder: set it to the path of the received voice message
ffmpeg -i "$INPUT_FILE" -ar 16000 -ac 1 -f wav /tmp/voice_input.wav -y 2>&1 | tail -3

# transcribe using hyprwhspr's whisper
~/.local/share/hyprwhspr/venv/bin/python << 'EOF'
from pywhispercpp.model import Model

m = Model('base.en', n_threads=4)
result = m.transcribe('/tmp/voice_input.wav')
# concatenate all segments (fixes truncation for longer audio)
full_text = ' '.join(seg.text for seg in result) if result else ''
print(full_text)
EOF
```

2. **respond normally with text**

3. **generate voice reply:**

```bash
curl -s -X POST http://localhost:8765/tts \
  -H "Content-Type: application/json" \
  -d '{"text":"YOUR REPLY TEXT HERE","format":"ogg"}' \
  --output /tmp/voice_reply.ogg
```

4. **send voice reply:**

**discord (preferred method, inline MEDIA tag):**
Include this line in your text reply and clawdbot auto-attaches it:

```
MEDIA:/tmp/voice_reply.ogg
```

**telegram (via message tool):**

```bash
clawdbot message send --channel telegram --target 6661478571 --media /tmp/voice_reply.ogg
```

**fallback (if message tool has auth issues):**
Use the MEDIA: tag method. It works on all channels because it goes through clawdbot's internal reply routing, not the gateway HTTP API.

## tts service details:

- running on port 8765
- using qwen3-tts-12hz-1.7b-base (upgraded from 0.6b for better accent preservation)
- voice cloning with nicholai's snape voice impression
- reference audio: /mnt/work/clawdbot-voice/reference_snape_v2.wav
- systemd service: clawdbot-tts.service
- auto-starts on boot, restarts on failure
- **idle timeout**: automatically unloads model after 120s of inactivity (frees ~3.5 GB VRAM)
- lazy loading: model loads on first request, not at startup

## transcription details:

- using pywhispercpp (whisper.cpp python bindings)
- model: base.en (same as hyprwhspr)
- venv: ~/.local/share/hyprwhspr/venv/