GPT-realtime and Realtime API updates

Seeing this title, I really hoped they'd have a parameter to force the language for voice recognition. Being Finnish with a heavy rally-Finnish accent, I find the real-time modes quite often end up transcribing my English speech as Finnish text, and strangely, I get the correct Finnish translation of what I said in English. It might not happen during the first sentence, but after a few queries and replies it usually does.

According to the OpenAI forums this is a common problem. I see they've addressed this in the post by prompting the model to stick to one language, but previously this didn’t work consistently, and in their Playground the newest `User transcript model` is still the same as before (`gpt-4o-transcribe`), so I don’t have high hopes. Must be hard to implement.

edit: Tried it again (with a prompt requesting English like always). By my 6th message it suddenly started transcribing to Finnish, and after that it became more common. Better than it used to be, but in many ways still useless. Though I'm sure it works better for people with lighter accents.

I worked on the SIP stuff. If anyone has questions or problems, reach out anytime. I'd love to help :)

tschellenbach · 6 months ago

Lower cost, higher quality, loving it so far. Question: we run this voice proxy, but here the calls are typically started on our end: https://getstream.io/video/voice-agents/

With SIP this logic changes: https://platform.openai.com/docs/guides/realtime-sip

Sounds like we can listen to the webhook and start from there?

Sean-Der · 6 months ago

I also want to support outbound SIP INVITEs! I don't think it will be that hard.

But for now, yeah, waiting for the webhook is the way to do it.
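For anyone wondering what "start from the webhook" might look like, here is a rough sketch, not the official flow. It assumes the incoming-call webhook carries a `type` of `realtime.call.incoming` and a `data.call_id`, and that the call is accepted by POSTing to a `/v1/realtime/calls/{call_id}/accept` endpoint. Those names are my assumptions, so check them against the Realtime SIP guide linked above. The helper `plan_accept_request` is hypothetical. It only builds the request and leaves out sending it and verifying the webhook signature.

```python
import json

# Assumed names: verify the event type and endpoint against the Realtime SIP guide.
INCOMING_CALL_EVENT = "realtime.call.incoming"
API_BASE = "https://api.openai.com/v1/realtime/calls"


def plan_accept_request(webhook_body: str, api_key: str):
    """Hypothetical helper: return (url, headers, body) to accept a call, or None.

    Only builds the request. Sending it and verifying the webhook
    signature are left to the caller.
    """
    event = json.loads(webhook_body)
    if event.get("type") != INCOMING_CALL_EVENT:
        return None  # not an incoming-call event, ignore it

    call_id = (event.get("data") or {}).get("call_id")
    if not call_id:
        return None  # nothing to accept without a call id

    url = f"{API_BASE}/{call_id}/accept"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    # Assumed session config shape; the instructions text is a placeholder.
    body = json.dumps({
        "type": "realtime",
        "model": "gpt-realtime",
        "instructions": "You are a helpful phone assistant.",
    })
    return url, headers, body
```

The returned tuple can be handed to whatever HTTP client the proxy already uses, which keeps the webhook handler itself easy to test.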

politelemon · 6 months ago

Are there any mechanisms in place to prevent scams or fake calls?

Sean-Der · 6 months ago

Realtime API has checks that are running on inputs/outputs.

Outside of OpenAI, lots of mechanisms exist, such as STIR/SHAKEN[0].

[0] https://en.wikipedia.org/wiki/STIR/SHAKEN

bayesianbot · 6 months ago

keleftheriou · 6 months ago

These improvements are very welcome for the “VoiceGPT” app I’m building for Apple Watch.

I have a TestFlight beta for those who want to try it out, and I hope to have the new model included in the next beta build:

https://x.com/keleftheriou/status/1956932258293755955?s=46

moltar · 6 months ago

I’m interested.

Here you go: https://testflight.apple.com/join/bp9B5Pp2

Cu3PO42 · 6 months ago

I think it would be neat to hook up a (realtime) speech-to-speech model to something like Home Assistant for smart home controls + generic questions. HA has this feature, but is currently using a STT + text model + TTS pipeline, which works fine, but has higher delays and feels less... natural.

zebomon · 6 months ago

I've been using the voice chat in ChatGPT more and more frequently. I'm curious now to see how the costs associated with this would work through the API on some user-facing features. It's a cool update at a glance.

daft_pink · 6 months ago

Uh oh, with SIP support we’re going to start getting AI scammers all the time!

OutOfHere · 6 months ago

Yes, and they will go straight to voicemail. I don't know of anyone who picks up calls from random numbers anymore, at least in this country.

hectormalot · 6 months ago

I just happen to have some stats on that (non-US context): 60% pick up a local number, and about 40% pick up a foreign number (specifically, the stat I have is for a US number calling someone in a non-US geography).

More than I expected.