Show HN: I open-sourced my AI toy company that runs on ESP32 and OpenAI realtime

Something that really kills the 'effect' of most of the Voice > AI demos that I see is the cold start / latency.

The OpenAI "Voice Mode" is closer, but when we can have near instantaneous and natural back and forth voice mode, that will be a big in terms of it feeling magical. Today, it is say something, awkwardly wait N seconds then listen to the reply and sometimes awkwardly interrupt it.

Even if the models were no smarter than they are today, if we could crack that "conversational" piece and performance piece, it would be a big difference in my opinion.

akadeb · 4 months ago

Yeah the way I am handling this is turn detection which feels unnatural. I like how Livekit handles turn detection with a small model[0][1] [0]https://www.youtube.com/watch?v=EYDrSSEP0h0 [1]https://docs.livekit.io/agents/build/turns/turn-detector/

``` turn_detection: { type: "server_vad", threshold: 0.4, prefix_padding_ms: 400, silence_duration_ms: 1000, }, ```

Sean-Der · 4 months ago

I think it will always feel unnatural as long as 'AI Speech' is turn based. Right now developers used Voice Activity Detection to detect when the user has stopped talking.

What would be REALLY cool is if we had something that would interrupt you during conversation like talking with a real human.

conductr · 4 months ago

I can see how interruptions would prove even more unnatural and annoying pretty quick. There's a lot of nuance in knowing how to interrupt properly and often, people that interrupt only do so quickly, then yield, allow person to finish then resume - very situational and tons of nuance. Otherwise, with current level of sophistication, you'd just have the AI talking over you the entire time, not allowing you to complete your thoughts/questions/commands/etc and people would quickly be more frustrated and just turn it off.

dgellow · 4 months ago

Have you recently tried OpenAI voice mode from ChatGPT Plus? It's basically what you describe

drakenot · 4 months ago

Yes, I mentioned this in the comment.

I think it is closer, although still even it has a cold start problem. Once you are connected and in-session, it is a better experience.

There is still some "turn based" conversational aspect to it that can be awkward but it is much better. It also helps that you can "tap and hold" to override, which is a bit of a hack but works well in practice for that mobile use-case.

This looks like so much fun! I have recently gotten into working with electronics, so it seems like a nice little project to undertake.

I noticed that it is dependent on openAIs realtime API, so it got me wondering what open alternatives there are as I would love a more realtime alexa-like device in my home that doesnt contact the cloud. I have only played with software, but the existing solutions have never felt realtime to me.

I could only find <https://github.com/fixie-ai/ultravox> that would seem to really work as realtime. It seems to be some model that wires up llama and whisper somehow, rather than treating them as separate steps which is common with other projects.

What other options are available for this kind of real-time behaviour?

Sean-Der · 4 months ago

My plan is that Espressif’s WebRTC code[0] will hook up to pipe at [1] that gets you the freedom to do whatever you want.

The design of OpenAI + WebRTC was to lean on WebRTC as much as possible to make it easier for users.

[0] https://github.com/espressif/esp-webrtc-solution

[1] https://github.com/pipecat-ai/pipecat

akadeb · 4 months ago

Pipecat is awesome! is it similar to what livekit provides?

I think Realtime API adoption would be higher if it is offered on Arduino rather than ESP-IDF as the latter is not very beginner friendly. That was one of the main reasons I built this repo using edge functions instead of a direct WebRTC connection.

supermatt · 4 months ago

Fantastic! This will save a ton of work

_neil · 4 months ago

Not on-device but for local network I’ve been looking at Speaches[0]. Haven’t tried it yet, but I have been running kokoru-web[1] and the quality and speed is really good.

[0] https://speaches.ai/ [1] https://huggingface.co/spaces/Xenova/kokoro-web

3D30497420 · 4 months ago

Maybe inspiration from how Home Assistant can do local speech-to-text and vice versa? https://www.home-assistant.io/voice_control/voice_remote_loc...

Pretty sure you'd need to host this on something more robust than an ESP32 though.

supermatt · 4 months ago

Yeah, I was looking at home assistant as well, but it doesnt feel real-time, likely due to it having the transcription stage separate from the inference.

This is wonderful, really great job on this! For me physical devices is when it really starts to feel magical. My pre-schooler never engaged with Speech-to-Speech examples I showed her on a screen. However, when I showed her a reindeer toy[1] on my desk that tells joke that is when it became real. It is the same joy/wonder I felt playing Myst for the first time.

----

If anyone is trying to build physical devices with Realtime API I would love to help. I work at OpenAI on Realtime API and worked on [0] (was upstreamed) and I really believe in this space. I want to see this all built with Open/Interoperable standards so we don't have vendor lock-in and developers can build the best thing possible :)

[0] https://github.com/openai/openai-realtime-embedded

[1] https://youtu.be/14leJ1fg4Pw?t=804

StefMyb · 4 months ago

I would love to chat further with you about this. I am working on building a educational conversational toy. The toy will tell stories and sing but the conversational aspect is the only thing at this stage that requires AI. The whole idea came from my daughter who was in Kinder at the time

sean @ pion.ly please email me any time.

Offer is open for anyone. If you need help with WebRTC/Realtime API/Embedded I am here to help. I have an open meeting link on my website.

hoppp · 4 months ago

Its great.lovely. but on the long run these toys rely on subscription payment?

Both the supabase Api and OpenAI billing is per api call.

So the lovely talking toys can die if the company stops being profitable.

I would love to see a version with decent hardware that runs a local model, that could have a long lifespan and work offline.

xp84 · 4 months ago

> lovely talking toys can die if the company stops being profitable.

This is a good point to me as a parent -- in a world where this becomes a precious toy, it would be a serious risk of emotional pain if the child experienced this scenario like the death of a pet or friend.

> version with decent hardware that runs a local model

I feel like something small and efficient enough to meet that (today) would be dumb as a post. Like Siri-level dumb.

Personally, I'd prefer a toy which was tethered to a home device. Without a cloud (and thus commercial) dependency, the toy wouldn't be 'smart' outside of Wi-fi range, but I'd design it so that it got 'sleepy' when away from Wi-fi, able to be "woken up" and, in that state, to respond to a few phrases with canned, Siri-like answers. Perhaps new content could be made up for it daily and downloaded to local storage while at home, so that it could still "tell me a story" offline etc.

scottmcf · 4 months ago

> This is a good point to me as a parent -- in a world where this becomes a precious toy, it would be a serious risk of emotional pain if the child experienced this scenario like the death of a pet or friend.

We've already seen this exact scenario play out with "Moxie" a few months ago:

https://www.axios.com/2024/12/10/moxie-kids-robot-shuts-down

zild3d · 4 months ago

well for now its either small device that uses APIs or Paddington Bear needs a backpack for his GPU

empath75 · 4 months ago

When someone figures this out, it's going to be a multi billion dollar company, but the safety concerns for actually putting something like this into the hands of children are unbelievable.

mithr · 4 months ago

This. The idea is super cool in theory! But given how these sort of things work today, having a toy that can have an independent conversation with a kid and that, despite the best intentions of the prompt writer, isn't guaranteed to stay within its "sandbox", is terrifying enough to probably not be worth the risk.

IMO this is only exacerbated by how little children (who are the presumably the target audience for stuffed animals that talk) often don't follow "normal" patterns of conversation or topics, so it feels like it'd be hard to accurately simulate/test ways in which unexpected & undesirable responses could come out.

I'm trying to use my imagination, but what exactly is the fear? Perhaps the AI will explain where baby's come from in graphic detail before the parent is ready to have that conversation or something similar? Or, for us in US, maybe it tells your kid they should wear a bullet proof vest to pre-K instead of bringing a stuffy for naptime?

Essentially, telling kids the truth before they're ready and without typical parental censorship? Or is there some other fear, like the AI will get compromised by a pedo and he'll talk your kid into who knows what? Or similar for "fill in state actor" using mind control on your kid (which, honestly, I feel like is normalized even for adults; eg. Fox News, etc., again US-centric)

georgemcbay · 4 months ago

Reminds me of Conan O'Brien's old WikiBear skits

https://youtu.be/0SfSx9ts46A

Babies often have ipads now. I think they should make an offline toy with decent hardware inside. That would be somethin.

justanotheratom · 4 months ago

This is quite cool. Two questions:

- why do you need nextjs frontend for what looks like a headless use case? - how much would be the OpenAI bill if there is 15 minutes of usage per day?

irq-1 · 4 months ago

> This equates to approximately $0.06 per minute of audio input and $0.24 per minute of audio output.

https://openai.com/index/introducing-the-realtime-api/

About the nextjs site, I was thinking maybe its difficult to have supabase hold long connections, or route the response? I'm curious too.

The long connections are ultimately handled by Deno Edge so the site isn't used there. The NextJS frontend (which also could be an iOS/Android app) helps provide an interface to select character, create AI characters, set ESP32 volume, and view conversation history.

thank you! The nextjs frontend is to set things like device volume, selecting which character you are interacting with, viewing conversation history etc. I just tried it and for a 15 minute chat, it's roughly 20c. Roughly 570 input tokens

JKCalhoun · 4 months ago

And I am wondering, why use an ESP32 if you don't need the WiFi? (And, please, no WiFi in a toy!)

Currently we connect to a Wifi network to reach the Deno edge server. Some popular toys doing it: Yoto, Toniebox

behnamoh · 4 months ago

am I the only one who finds the unnecessarily positive vibes of OpenAI realtime voices unrealistic, too much, and borderline creepy?

mickael-kerjean · 4 months ago

Yep and having it in a child toy is way beyond the border of creepy

Currently our device is a toy accessory. And for children we are strictly focusing on `Story mode`. Where adventure stories / fairy tales feel more engaging. I think there's value in getting the AI to create epic stories consistently

3np · 4 months ago

Moreso from the consent- and privacy angle.

mst · 4 months ago

OpenAI stuff in general seems (to me, at least) to be overly positive and confident in terms of how it replies.

While I make no foolish claims that it's perfect, I've found Claude feels much less arrogant, and was genuinely appreciative when one of its replies started with an (accurate, of course I checked primary sources to verify that) analysis of the first half of my question, and then for the more obscure second half said "I'm not sure if I can answer that without hallucinating, but here's some stuff you could try researching."

Certainly Claude's tone and "attitude" (FSVO) works much better for me than any other LLM I've tried, though mileage will, of course, vary.

(I have zero connection to the company and am still on a free account, I'm just quietly impressed relative to the competition)

scyzoryk_xyz · 4 months ago

You’re not the only one, same here.

I believe there will be interest in extracting insights from speech-related fields, performing arts etc. Kind of how there was this transfer of design principles in the 90’s-00’s from traditional typographers, letterform revivals, print techniques.

It’ll be interesting to see an evolution of expectations and culture emerge around AI voices depending on role. Maybe we’ll see these positive voice vibes as silly and naive the same way we see MySpace aesthetics today?

Dead Comment

Deleted Comment