I'm working on a project where I need to modify the caller's voice in real time before it reaches the recipient. Following some Twilio blog posts (Blog-1, Blog-2), I've built a WebSocket server that processes audio from a Twilio Media Stream through OpenAI's Realtime API and returns the modified voice.
Current Setup
We're using the Twilio SDK to place calls from our UI, which triggers a webhook that returns this TwiML:
<?xml version="1.0" encoding="UTF-8"?>
<Response>
<Dial callerId="+123456789">
<Number>+1987654322</Number>
</Dial>
</Response>
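For context, the webhook handler is essentially the following (a minimal sketch: the numbers are placeholders and the TwiML is built as a plain string here rather than with the twilio helper library, so it runs standalone):

```python
# Minimal sketch of the webhook's job: produce the TwiML above that
# bridges the caller to the recipient. In the real app this string is
# returned as the HTTP response body by our web framework's route handler.
def dial_twiml(caller_id: str, to_number: str) -> str:
    """Build the TwiML that dials the recipient with the given caller ID."""
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        "<Response>"
        f'<Dial callerId="{caller_id}">'
        f"<Number>{to_number}</Number>"
        "</Dial>"
        "</Response>"
    )

if __name__ == "__main__":
    print(dial_twiml("+123456789", "+1987654322"))
```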
Attempted Solutions
I've tried adding `<Connect>` and `<Stream>` to this TwiML, without success. Placing them after `<Dial>` doesn't work because TwiML verbs execute sequentially: `<Connect>` and `<Stream>` only ran after the recipient ended the call. The WebSocket server returned audio as expected, but only the caller's voice was modified, and it was played back to the caller instead of being relayed to the recipient.
I also attempted to start a Stream on the Call resource via the REST API, but those streams are unidirectional, which doesn't work for my use case.
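For reference, starting a stream on an in-progress call goes through the Streams subresource of the Call resource. This sketch builds that request without sending it (account SID, call SID, and WebSocket URL are placeholders; I normally use the twilio helper library, this is just to show the shape of the call):

```python
import urllib.parse
import urllib.request

# Placeholders: real values come from the Twilio console and the live call.
ACCOUNT_SID = "ACXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
CALL_SID = "CAXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"

def build_stream_request(ws_url: str, track: str = "inbound_track") -> urllib.request.Request:
    """Build (but don't send) the POST that forks call audio to a WebSocket.

    These streams are fork-only: Twilio sends audio to the socket, but audio
    written back on the socket is not injected into the call, which is why
    this approach didn't fit my use case.
    """
    endpoint = (
        f"https://api.twilio.com/2010-04-01/Accounts/{ACCOUNT_SID}"
        f"/Calls/{CALL_SID}/Streams.json"
    )
    body = urllib.parse.urlencode({"Url": ws_url, "Track": track}).encode()
    return urllib.request.Request(endpoint, data=body, method="POST")

req = build_stream_request("wss://example.com/audio")
```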
Potential Solutions I'm Considering
- Using the AudioProcessor API in the Twilio Voice JS SDK: intercept the audio stream on the UI side, send it to my WebSocket server for processing, then feed the modified voice back into the call bridge.
- Conference approach: set up a conference where I can access both call legs (caller and recipient) and modify only the caller's voice before it reaches the recipient.
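For the conference idea, my rough plan is to return TwiML like this for the caller's leg (a sketch only: the room name and stream URL are placeholders, and I haven't yet verified how the processed audio would be injected back into the conference for the recipient to hear):

```python
def conference_twiml(room: str, stream_url: str) -> str:
    """Sketch: fork this leg's audio to a WebSocket, then join the conference.

    <Start><Stream> is asynchronous, so <Dial> runs immediately afterwards
    while the fork keeps sending this leg's audio to the server. The open
    question is how the modified voice gets played into the conference.
    """
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        "<Response>"
        f'<Start><Stream url="{stream_url}"/></Start>'
        f"<Dial><Conference>{room}</Conference></Dial>"
        "</Response>"
    )

if __name__ == "__main__":
    print(conference_twiml("voice-mod-room", "wss://example.com/audio"))
```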
Questions:
- Has anyone tried these approaches for real-time voice modification? I'm worried about latency and want to know whether they hold up in practice.
- Is there a simpler way to plug my voice modification server into Twilio's call flow that I'm overlooking?
- I'd love to hear from someone who's built something similar! What pitfalls did you encounter? Any tips that saved you hours of debugging?
Thanks in advance; any insights would be greatly appreciated.