What is the difference between this and Indian support staff pretending to be in your vicinity by telling you about the local weather? Your version is arguably even worse because it can plausibly fool people more competently.
So you're telling the caller that it is an AI, and yet you can have a pleasant background audio experience.
OpenAI realtime voices are really bad though, so you can also configure your session to accept AUDIO and output TEXT, and then use any TTS provider (like ElevenLabs or InWorld.ai, my favorite for cost) to generate the audio.
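A minimal sketch of that audio-in, text-out setup, written as the JSON event a client would send over the Realtime websocket. The `session.update` event type and the `modalities` field follow the beta Realtime API as I understand it; newer API versions may name the field differently (for example `output_modalities`), so treat the exact field names as assumptions and check the current reference. `build_text_only_session` is a hypothetical helper, not part of any SDK.

```python
import json


def build_text_only_session(instructions: str) -> str:
    """Build a session.update event that asks for text-only responses.

    Assumption: field names follow the beta Realtime API; verify them
    against the current API reference before relying on this.
    """
    event = {
        "type": "session.update",
        "session": {
            # Caller audio is still streamed in separately; only the
            # model's output is restricted to text here.
            "modalities": ["text"],
            "instructions": instructions,
        },
    }
    return json.dumps(event)


payload = build_text_only_session("You are a friendly phone assistant.")
print(payload)
```

The text deltas that come back can then be forwarded to whichever TTS provider you prefer.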
I just want to provide:
- business logic
- tools
- configuration metadata (e.g. which voice to use)
I don't like Vapi because of 1) its heavily GUI-driven experience and 2) its cost.
Or Pipecat Cloud / LiveKit Cloud (I think they charge 1 cent per minute?)
e.g. Deepgram (STT) via websocket -> DO -> LLM API -> TTS?
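To make the shape of that chain concrete, here is a rough sketch of STT -> LLM -> TTS as chained async generators. Every stage is a hypothetical stub (`fake_stt`, `fake_llm`, `fake_tts`, `microphone`); a real version would replace them with websocket or HTTP clients for the actual providers, and the coordinating piece in the middle would be the Durable Object.

```python
import asyncio
from typing import AsyncIterator


async def microphone() -> AsyncIterator[bytes]:
    """Hypothetical audio source: two 'utterances' as raw bytes."""
    for chunk in (b"hello", b"goodbye"):
        yield chunk


async def fake_stt(audio: AsyncIterator[bytes]) -> AsyncIterator[str]:
    """Stub for a streaming STT service: one transcript per audio chunk."""
    async for chunk in audio:
        yield chunk.decode("utf-8")  # pretend the bytes are already words


async def fake_llm(transcripts: AsyncIterator[str]) -> AsyncIterator[str]:
    """Stub for the LLM call: produces a canned reply per transcript."""
    async for text in transcripts:
        yield f"You said: {text}"


async def fake_tts(replies: AsyncIterator[str]) -> AsyncIterator[bytes]:
    """Stub for a TTS service: 'synthesizes' each reply into bytes."""
    async for reply in replies:
        yield reply.encode("utf-8")


async def run_pipeline() -> list[bytes]:
    # Each stage consumes the previous one lazily, so data flows through
    # as soon as it is available instead of waiting for the whole call.
    audio_out = fake_tts(fake_llm(fake_stt(microphone())))
    return [frame async for frame in audio_out]


frames = asyncio.run(run_pipeline())
print(frames)
```

The point of the generator chain is that no stage buffers the whole conversation, which is what keeps the end-to-end latency down.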
Same with TTS: some, like Deepgram and ElevenLabs, let you stream the LLM text (or chunks per sentence) over their websocket API, making your voice AI bot really, really low latency.
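The "chunks per sentence" idea can be sketched like this: buffer the streamed LLM tokens and emit a sentence as soon as its end is seen, so the TTS can start speaking before the LLM has finished. This is only an illustration; `sentences_from_tokens` is a hypothetical helper, and the naive regex will split incorrectly on abbreviations such as "Dr.".

```python
import re
from typing import Iterable, Iterator

# A sentence boundary: whitespace that follows '.', '!' or '?'.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences from a stream of partial LLM tokens."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Everything except the last part is a finished sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        # The last part may still be growing, so keep it buffered.
        buffer = parts[-1]
    # Flush whatever is left when the stream ends.
    if buffer.strip():
        yield buffer.strip()


llm_tokens = ["Hel", "lo there", ". How", " can I", " help", "? Bye"]
for sentence in sentences_from_tokens(llm_tokens):
    print(sentence)
```

Each yielded sentence would then be sent over the TTS provider's streaming connection as soon as it is available.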
(If you do need SIP, this Asterisk project looks really great.)
Pipecat has 90 or so integrations with all the models/services people use for voice AI these days. NVIDIA, AWS, all the foundation labs, all the voice AI labs, most of the video AI labs, and lots of other people use/contribute to Pipecat. And there's lots of interesting stuff in the ecosystem, like the open source, open data, open training code Smart Turn audio turn detection model [2], and the Pipecat Flows state machine library [3].
[1] - https://docs.pipecat.ai/guides/telephony/twilio-websockets
[2] - https://github.com/pipecat-ai/smart-turn
[3] - https://github.com/pipecat-ai/pipecat-flows/
Disclaimer: I spend a lot of my time working on Pipecat. Also writing about both voice AI in general and Pipecat in particular. For example: https://voiceaiandvoiceagents.com/
That’s why I created a stack that runs entirely on Cloudflare Workers and Durable Objects, in JavaScript.
Providers like AssemblyAI and Deepgram now integrate VAD in their realtime APIs, so our voice AI only needs networking (no CPU anymore).
It runs at around 50 cents per hour using AssemblyAI or Deepgram as the STT, Gemini Flash as the LLM and InWorld.ai as the TTS (for me it’s on par with ElevenLabs and super fast).
(It's free up to 20 articles because there are real costs: I use Gemini to summarize the pages you open)
AI voices run locally on your iPhone/iPad (web extension version coming soon).
When you find something useful, you can share the overviews online (free hosting), e.g. https://voiceview.app/a/2J49UnwK
Hope this helps cut the noise and helps folks save time.
Laurent