.agents/memory/VOICE-WORKFLOW.md

# voice message workflow
## when you receive a voice message:
1. **transcribe it:**
```bash
# convert to 16 kHz mono wav (replace <input.ogg> with the path to the received file)
ffmpeg -i <input.ogg> -ar 16000 -ac 1 -f wav /tmp/voice_input.wav -y 2>&1 | tail -3
# transcribe with pywhispercpp from hyprwhspr's venv
~/.local/share/hyprwhspr/venv/bin/python << 'EOF'
from pywhispercpp.model import Model
m = Model('base.en', n_threads=4)
result = m.transcribe('/tmp/voice_input.wav')
# concatenate all segments (fixes truncation for longer audio)
full_text = ' '.join(seg.text for seg in result) if result else ''
print(full_text)
EOF
```
2. **respond normally with text**
3. **generate voice reply:**
```bash
curl -s -X POST http://localhost:8765/tts \
-H "Content-Type: application/json" \
-d '{"text":"YOUR REPLY TEXT HERE","format":"ogg"}' \
--output /tmp/voice_reply.ogg
```
4. **send voice reply:**
**discord (preferred method — inline MEDIA tag):**
Include this line in your text reply, and clawdbot attaches the file automatically:
```
MEDIA:/tmp/voice_reply.ogg
```
**telegram (via message tool):**
```bash
clawdbot message send --channel telegram --target 6661478571 --media /tmp/voice_reply.ogg
```
**fallback (if message tool has auth issues):**
Use the `MEDIA:` tag method. It works on all channels because it goes
through clawdbot's internal reply routing rather than the gateway HTTP API.
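
Steps 3 and 4 can also be scripted. The sketch below mirrors the curl call in Python, mainly so the reply text is JSON-escaped properly when it contains quotes or newlines. The function names (`build_tts_request`, `synthesize`, `media_line`) and the 120-second timeout are illustrative assumptions, not part of clawdbot or the TTS service. It also assumes the endpoint returns raw audio bytes in the response body, as the `--output` flag above implies.

```python
import json
import urllib.request

TTS_URL = "http://localhost:8765/tts"  # endpoint from the curl example


def build_tts_request(text, url=TTS_URL, fmt="ogg"):
    """Build the POST request; json.dumps escapes quotes and newlines."""
    payload = json.dumps({"text": text, "format": fmt}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, out_path="/tmp/voice_reply.ogg", timeout=120.0):
    """Send the request and save the audio (hypothetical helper).

    The timeout is an assumed value, chosen generously because the
    model loads lazily on the first request.
    """
    with urllib.request.urlopen(build_tts_request(text), timeout=timeout) as resp:
        audio = resp.read()
    with open(out_path, "wb") as fh:
        fh.write(audio)
    return out_path


def media_line(path):
    """Return the inline tag that clawdbot picks up from a text reply."""
    return f"MEDIA:{path}"
```

Calling `media_line(synthesize("reply text"))` would produce the line to append to the text reply, provided the TTS service is running.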
## tts service details:
- running on port 8765
- using qwen3-tts-12hz-1.7b-base (upgraded from 0.6b for better accent preservation)
- voice cloning with nicholai's snape voice impression
- reference audio: /mnt/work/clawdbot-voice/reference_snape_v2.wav
- systemd service: clawdbot-tts.service
- auto-starts on boot, restarts on failure
- **idle timeout**: automatically unloads the model after 120 s of inactivity (frees ~3.5 GB VRAM)
- lazy loading: model loads on first request, not at startup
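
The lazy-loading and idle-timeout behaviour can be pictured with a small wrapper like the one below. This is a minimal sketch and not the actual code of `clawdbot-tts.service`: the class name `LazyModel`, the `loader` callable, and the injectable `clock` are all hypothetical. A real service would likely also need to release GPU memory explicitly after dropping the reference, and something would have to call `unload_if_idle()` periodically, for example from a timer thread.

```python
import threading
import time


class LazyModel:
    """Load a model on first use and drop it after a period of inactivity."""

    def __init__(self, loader, idle_timeout=120.0, clock=time.monotonic):
        self._loader = loader              # callable that builds the model
        self._idle_timeout = idle_timeout  # seconds, 120 as in the notes above
        self._clock = clock                # injectable so it can be tested
        self._lock = threading.Lock()
        self._model = None
        self._last_used = 0.0

    @property
    def loaded(self):
        with self._lock:
            return self._model is not None

    def get(self):
        """Return the model, loading it if needed, and mark it as used."""
        with self._lock:
            if self._model is None:
                self._model = self._loader()
            self._last_used = self._clock()
            return self._model

    def unload_if_idle(self):
        """Drop the model if it has been unused for idle_timeout seconds."""
        with self._lock:
            if self._model is None:
                return False
            if self._clock() - self._last_used < self._idle_timeout:
                return False
            self._model = None
            return True
```

With this shape, nothing is loaded at startup, the first `get()` pays the load cost, and a later `get()` after an unload reloads the model transparently.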
## transcription details:
- using pywhispercpp (whisper.cpp python bindings)
- model: base.en (same as hyprwhspr)
- venv: ~/.local/share/hyprwhspr/venv/
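
For reuse, the transcription step could be wrapped as a small module run with the venv's interpreter. This is a sketch under assumptions: the helper names (`ffmpeg_cmd`, `join_segments`, `transcribe`) are hypothetical, and it relies only on the `Model(...)` and `transcribe(...)` calls already used above. Unlike the inline script, `join_segments` strips each segment's text and skips empty segments before joining.

```python
import subprocess


def ffmpeg_cmd(src, dst="/tmp/voice_input.wav"):
    """Build the ffmpeg call that converts src to 16 kHz mono wav."""
    return ["ffmpeg", "-y", "-i", str(src),
            "-ar", "16000", "-ac", "1", "-f", "wav", str(dst)]


def join_segments(segments):
    """Join every segment's text so longer audio is not truncated."""
    texts = (seg.text.strip() for seg in (segments or []))
    return " ".join(t for t in texts if t)


def transcribe(src, model_name="base.en", n_threads=4,
               wav="/tmp/voice_input.wav"):
    """Convert src with ffmpeg, then transcribe it with pywhispercpp."""
    # Imported here so the pure helpers above work outside the venv.
    from pywhispercpp.model import Model

    subprocess.run(ffmpeg_cmd(src, wav), check=True, capture_output=True)
    model = Model(model_name, n_threads=n_threads)
    return join_segments(model.transcribe(wav))
```

Passing the command as a list avoids shell quoting problems with unusual file names, and `check=True` raises if ffmpeg fails instead of transcribing a stale file.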