Posted by u/hassaanr a year ago
Show HN: A real time AI video agent with under 1 second of latency
Hey it’s Hassaan & Quinn – co-founders of Tavus, an AI research company and developer platform for video APIs. We’ve been building AI video models for ‘digital twins’ or ‘avatars’ since 2020.

We’re sharing some of the challenges we faced building an AI video interface that has realistic conversations with a human, including getting it to under 1 second of latency.

To try it, talk to Hassaan’s digital twin: https://www.hassaanraza.com, or to our "demo twin" Carter: https://www.tavus.io

We built this because until now, we've had to adapt communication to the limits of technology. But what if we could interact naturally with a computer? Conversational video makes it possible – we think it'll eventually be a key human-computer interface.

To make conversational video effective, it has to have really low latency and conversational awareness. A fast-paced conversation between friends has ~250 ms between utterances, but if you’re talking about something more complex or with someone new, there is additional “thinking” time. So, less than 1000 ms latency makes the conversation feel pretty realistic, and that became our target.

Our architecture decisions had to balance 3 things: latency, scale, & cost. Getting all of these was a huge challenge.

The first lesson we learned was that to make it low-latency, we had to build it from the ground up. We went from a team that cared about seconds to a team that counts every millisecond. We also had to support thousands of conversations happening all at once, without getting destroyed on compute costs.

For example, during early development, each conversation had to run on an individual H100 in order to fit all components and model weights into GPU memory just to run our Phoenix-1 model faster than 30fps. This was unscalable & expensive.

We developed a new model, Phoenix-2, with a number of improvements, including inference speed. We switched from a NeRF-based backbone to Gaussian Splatting for a multitude of reasons, one being the requirement that we generate frames faster than real time, at 70+ fps on lower-end hardware. We exceeded this and focused on optimizing GPU memory and core usage so that lower-end hardware could run it all. We did other things to save time and cost, like streaming instead of batching, parallelizing processes, etc. But those are stories for another day.
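To make the streaming-vs-batching point concrete, here's a toy sketch (illustrative only, not our production code; the render step and payloads are stand-ins). The client sees its first frame after one render step instead of after the whole clip:

```python
import time

def render_frame(i: int) -> bytes:
    """Stand-in for the renderer; assume it runs faster than real time."""
    time.sleep(0.01)                      # pretend rendering takes ~10 ms/frame
    return f"frame-{i}".encode()

def stream_clip(n_frames: int, send) -> None:
    # Streaming: ship each frame as soon as it exists.
    # Time to first frame ~= one render step.
    for i in range(n_frames):
        send(render_frame(i))

def batch_clip(n_frames: int, send) -> None:
    # Batching: render everything, then ship.
    # Time to first frame ~= n_frames render steps.
    frames = [render_frame(i) for i in range(n_frames)]
    for frame in frames:
        send(frame)
```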

We still had to lower the utterance-to-utterance time to hit our goal of under a second of latency. This meant each component (vision, ASR, LLM, TTS, video generation) had to be hyper-optimized.
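As a rough picture of how the components chain together (toy code, not our actual stack; the stage outputs are placeholders): every stage streams partial results downstream immediately, so utterance-to-utterance latency ends up closer to the sum of each stage's time-to-first-output than the sum of full processing times.

```python
import asyncio

# Toy stages: each is an async generator that yields partial results as soon
# as it has them. Real models would sit behind the same streaming shape.

async def asr(audio_chunks):
    async for chunk in audio_chunks:
        yield f"partial transcript of {chunk!r}"

async def llm(transcripts):
    async for text in transcripts:
        for token in ("sure,", "let's", "talk"):   # tokens stream out as decoded
            yield token

async def tts(tokens):
    phrase = []
    async for token in tokens:
        phrase.append(token)
        if len(phrase) == 3:                       # synthesize per short phrase, not per reply
            yield "audio(" + " ".join(phrase) + ")"
            phrase.clear()
    if phrase:                                     # flush whatever is left at end of turn
        yield "audio(" + " ".join(phrase) + ")"

async def video(audio_segments):
    async for segment in audio_segments:
        yield f"lip-synced frames for {segment}"

async def run_pipeline(audio_chunks, send_frame):
    # Chain the stages; frames start flowing before the reply is finished.
    async for frame in video(tts(llm(asr(audio_chunks)))):
        await send_frame(frame)
```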

The worst offender was the LLM. It didn't matter how fast the tokens per second (t/s) were; it was the time to first token (TTFT) that really made the difference. That meant services like Groq were actually too slow – they had high t/s, but slow TTFT. Most providers were too slow.
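To make that concrete, here's a generic harness you could point at any streaming provider (illustrative sketch; the token iterator is whatever your client library yields). For conversational latency, only the first number matters, because the reply can start being spoken while the rest of the generation streams in behind it:

```python
import time
from typing import Iterable, Tuple

def measure_stream(tokens: Iterable[str]) -> Tuple[float, float]:
    """Return (time-to-first-token, tokens-per-second) for any token stream."""
    start = time.monotonic()
    ttft = None
    count = 0
    for _ in tokens:
        count += 1
        if ttft is None:
            ttft = time.monotonic() - start   # only this sits on the critical path
    elapsed = time.monotonic() - start
    tps = count / elapsed if elapsed > 0 else 0.0
    return ttft, tps
```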

The next worst offender was actually detecting when someone stopped speaking. This is hard. Basic solutions use time after silence to ‘determine’ when someone has stopped talking, but that adds latency. Tune it too short and the AI agent will talk over you; too long and it’ll take a while to respond. We needed a model dedicated to accurately detecting end-of-turn based on conversation signals, and to speculating on inputs to get a head start.
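For reference, the basic silence-timer approach looks roughly like this (toy sketch with a made-up hold time). The hold is pure added latency, which is why we moved to a dedicated model instead:

```python
SILENCE_HOLD_MS = 700   # hypothetical hold time; every ms of it delays the reply

def naive_end_of_turn(vad_flags, frame_ms=20):
    """Silence-timer endpointing over per-frame voice-activity flags
    (True = speech detected). Too short a hold and the agent talks over
    pauses; too long and every response waits out the full hold."""
    silence_ms = 0
    heard_speech = False
    for is_speech in vad_flags:
        if is_speech:
            heard_speech = True
            silence_ms = 0
        else:
            silence_ms += frame_ms
            if heard_speech and silence_ms >= SILENCE_HOLD_MS:
                return True
    return False
```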

We went from 3–5 seconds to under 1 second (and as fast as 600 ms) with these architectural optimizations, while running on lower-end hardware.

All this allowed us to ship with less than 1 second of latency, which we believe is the fastest out there. We have a bunch of customers, including Delphi, a professional coach and expert cloning platform. They have users whose conversations with digital twins span from minutes to one hour to even four hours (!) – which is mind blowing, even to us.

Thanks for reading! Let us know what you think and what you would build. If you want to play around with our APIs after seeing the demo, you can sign up for free on our website: https://www.tavus.io.

causal · a year ago
1) Your website, and the dialup sounds, might be my favorite thing about all of this. I also like the cowboy hat.

2) Maybe it's just degrading under load, but I didn't think either chat experience was very good. Both avatars interrupted themselves a lot, and the chat felt more like a jumbled mess of half-thoughts than anything.

3) The image recognition is pretty good though, when I could get one of the avatars to slow down long enough to identify something I was holding.

Anyway great progress, and thanks for sharing so much detail about the specific hurdles you've faced. I'm sure it'll get much better.

hassaanr · a year ago
Glad you liked the website – it was such a fun project. We're getting the hug of death from HN, so that might be why you're getting a worse experience. Please try again :)
Nadya · a year ago
It was disabled yesterday due to the high traffic - but I was able to connect today and after saying hello the chat immediately kicked me off after I asked a question. So unfortunately I've not been able to test it out for more than a few seconds of the "Hello, how can I help you today?"

One thing I've noticed for a lot of these AI video agents, and I've noticed it in Meta's teaser for their virtual agents as well as some other companies, is they seem to love to move their head constantly. It makes them all a bit uncanny and feel like a video game NPC that reacts with a head movement on every utterance. It's less apparent on short 5-10s video clips but the longer the clips the more the constant head movements give it away.

I'm assuming this is, of course, a well known and tough problem to solve and is being worked on. Since swinging too far in the other direction of stiff/little head movements would make it even more uncanny. I'd love to hear what has been done to try and tackle the problem or if at this point it is an accepted "tell" so that one knows when they're speaking with a virtual agent?

causal · a year ago
Tried again today; latency seemed a little better – still a lot of interrupting himself to change thoughts.

I'm still most impressed by the image recognition - could clearly read even tiny or partially obscured print on products I held up and name them accordingly. Curious how you're achieving that level of fidelity without sacrificing throughput.

qingcharles · a year ago
Just tried this. Most amazing thing I've ever seen. Utterly incredible that this is where we're at.
karolist · a year ago
Felt like talking to a person – I couldn't bring myself to treat it like a piece of code, that's how real it felt. I wanted to be polite and diplomatic, and caught myself thinking about "how I look to this person". This got me thinking about the conscious effort we put in when we talk with people, and how sloppy and relaxed we can be when interacting with algorithms.

For a little example, when searching Google I default to a minimal set of keywords required to get the result, instead of typing full sentences. I'm sort of afraid this technology will train people to behave like that when video chatting with virtual assistants, and that that attitude will bleed into real-life interactions.

whiplash451 · a year ago
I see it the other way around.

I think our human-human interaction style will “leak” into the way we interact with humanoid AI agents. Movie-Her style.

amelius · a year ago
Only if the AI gets annoyed when you don't treat it with respect.
tstrimple · a year ago
Mine certainly has. I type to ChatGPT much more like I would to a human than to a search engine. It feels more natural to me because it's context-aware in a way search engines never were. I can ask follow-up questions, ask for more details about a specific portion, or ask for the analysis I just walked it through to be applied to another data set.

"Now dump those results into a markdown table for me please."

bpanahij · a year ago
Thanks for that insight. Brian here, one of the engineers for CVI. I've spoken with CVI so much, and as it has become more natural, I've found myself becoming more comfortable with a conversational style of interaction with the vastness of information contained within the LLMs and context under the hood. Whereas with Google or other search-based interactions, I'm more point-and-shoot. I find CVI is more of an experience, and for me it yields more insight.
alwa · a year ago
I’m having trouble understanding what CVI means here. Is it the firm Computer Vision Inc. (https://www.cvi.ai/)?

The firm in the post seems to be called Tavus, and their products either “digital twins” or “Carter.”

Not meaning to be pedantic, I’m just wondering whether the “V” in the thing you’ve spoken to indicates more “voice” or “video” conversations.

wantsanagent · a year ago
Functionality for a demo launch: 9.5/10

Creepiness: 10/10

CapeTheory · a year ago
I was just about to try it, but the idea of allowing Firefox access to my audio/video to talk to a machine-generated person gave me such a bad feeling, I couldn't go through with it even fuelled by my morbid curiosity.
oniony · a year ago
I did it with my finger over the camera and it even commented on me having my finger over the camera!
butlike · a year ago
I did it. The demo is kinda cool. If they want to steal an unshowered, back-lit, messy hair picture of me, go for it. I can't imagine it'd be that useful right now.
handfuloflight · a year ago
Super awkward. But promising. It should have taken more control of the conversation.
elaus · a year ago
It left me speechless after it commented on the (small) text on my hoodie – this made it feel super personal all of a sudden (which is amazing for an AI, of course).
pookeh · a year ago
I joined while in the bathroom where the camera was facing upwards looking up to the hanging towel on the wall…and it said “looks like you got a cozy bathroom here”

You have to be kidding me.

hassaanr · a year ago
Appreciate you not flashing Carter or my digital twin haha
turnsout · a year ago
Incredibly impressive on a technical level. The Carter avatar seems to swallow nervously a lot (LOL), and there's some weirdness with the mouth/teeth, but it's quite responsive. I've seen more lag on Zoom talking to people with bad wifi.

Honestly this is the future of call centers. On the surface it might seem like the video/avatar is unnecessary, and that what really matters is the speech-to-speech loop. But once the avatar is expressive enough, I bet the CSAT would be higher for video calls than voice-only.

nick3443 · a year ago
Actually what really matters for a call center is having the problem I called in for resolved promptly.
tomp · a year ago
I don't understand why call centers exist in the first place.

If you just exposed all the functionality as buttons on the website, or even as AI, I'd be able to fix the problems myself!

And I say that while working for a company making call centre AIs... double ironic!

gh2k · a year ago
Agreed. I've been frustrated by the proliferation of AI in technical support. Sometimes it can't answer a question but thinks it can, so we go round and round in circles.

A couple have had a low threshold for "this didn't solve my problem" and directed me to a human, but others are impossible to escape.

On the other hand, I've had more success recently with a problem actually getting resolved by a chatbot without speaking to someone... but not a lot more. Usually, I think, it's because I skew technical and treat support as a last resort, so I've already tried everything it wants to suggest.

turnsout · a year ago
Right, so do you want to wait 45 minutes for a human, or get it resolved via AI in 2 minutes?
myprotegeai · a year ago
>Honestly this is the future of call centers.

This feels like retro-futurism, where we take old ideas and apply a futuristic twist. It feels much more likely that call centers will cease to be relevant before this tech is ever integrated into them.

turnsout · a year ago
Tell that to my mom
caseyy · a year ago
Amazing work technically – less than 1 second is very impressive. It's quite scary, though, that I might FaceTime someone one day soon and they won't be real.

What do you think about the societal implications for this? Today we have a bit of a loneliness crisis due to a lack of human connection.

btbuildem · a year ago
Another nail in the coffin for WFH, too. "They" will be scared we're not actually working even when on calls.
kredd · a year ago
The question is, what'll come first: AI agents that replace white-collar jobs, so you don't even need the employees, or companies not trusting WFH employees and bringing everyone back in person?
kwindla · a year ago
If you're interested in low-latency, multi-modal AI, Tavus is sponsoring a hackathon Oct 19th-20th in SF. (I'm helping to organize it.) There will also be a remote track for people who aren't in SF, so feel free to sign up wherever you are in the world.

https://x.com/kwindla/status/1839767364981920246

kristopolous · a year ago
Hey, I used to work for you a long time ago in a galaxy far away. Nice to hear from you.
kwindla · a year ago
Hi!
hassaanr · a year ago
Big +1 here! Also shoutout to the Daily team who helped build this!
myprotegeai · a year ago
Can you say more about how developers will use this? Is the API going to be exposed to participants?
hassaanr · a year ago
The API is exposed now – you can sign up at tavus.io, and at the hackathon we'll be giving out credits to build!
heroprotagonist · a year ago
Sooo, are you scouting talent and good ideas with this, or is it the kind of hackathon where people give up rights to any IP they produce?

Not to be rude, but these days it's best to ask.

kabirgoel · a year ago
As someone who's attended events run by Daily/Kwindla, I can guarantee that you’ll have fun and leave with your IP rights intact. :) (In fact, I don't even know that they're looking for talent and good ideas... the motivation for organizing these is usually to get people excited about what you're building and create a community you can share things with.)
kwindla · a year ago
What? No. That’s crazy. (I believe you. I’ve just … never heard of giving up IP rights because you participated in a hackathon.)

This is about community and building fun things. I can’t speak for all the sponsors, but what I want is to show people the Open Source tooling we work on at Daily, and see/hear what other people interested in real-time AI are thinking about and working on.

gavmor · a year ago
> the kind of hackathon where people give up rights to any IP they produce

Wow, I have been attending public hackathons for over a decade, and I have never heard of something like this. That would be an outrage!

radarsat1 · a year ago
As someone not super familiar with deployment, but familiar enough to know that GPUs are difficult to work with due to being costly and sometimes hard to allocate: apart from optimizing the models themselves, what's the trick for handling cloud GPU resources at scale to serve something like this, supporting many realtime connections with low latency? Do you just allocate a GPU per websocket connection? That would mean keeping a pool of GPU instances allocated in case someone connects (otherwise cold-start time would be bad), but isn't that super expensive? I feel like I'm missing some trick in the cloud space that makes this kind of thing possible and affordable.
bpanahij · a year ago
We're partnering with GPU infrastructure providers like Replicate. In addition, we have done some engineering to bring down our stack's cold and warm boot times. With sufficient caches on disk, and potentially a running process/memory snapshot, we can bring these cold/warm boot times down to under 5 seconds. Of course, we're making progress every week on this, and it's getting better all the time.
whiplash451 · a year ago
Not the author, but their description implies that they are running more than one stream per GPU.

So you can basically spin up a few GPUs as a baseline, allocate streams to them, then boot up a new GPU when the existing ones get overwhelmed.

It doesn't look very different from standard cloud compute management. I'm not saying it's easy, but it's definitely not rocket science either.
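Roughly something like this (the per-GPU capacity and the provisioning step are made up, purely illustrative):

```python
from dataclasses import dataclass, field

STREAMS_PER_GPU = 4          # made-up capacity per GPU

@dataclass
class Gpu:
    name: str
    streams: set = field(default_factory=set)

class StreamAllocator:
    def __init__(self, warm_pool):
        self.gpus = list(warm_pool)

    def allocate(self, stream_id: str) -> Gpu:
        # Pack the stream onto the first GPU with headroom...
        for gpu in self.gpus:
            if len(gpu.streams) < STREAMS_PER_GPU:
                gpu.streams.add(stream_id)
                return gpu
        # ...and only boot another instance when everything is full.
        gpu = Gpu(name=f"gpu-{len(self.gpus)}")
        self.gpus.append(gpu)
        gpu.streams.add(stream_id)
        return gpu

    def release(self, stream_id: str) -> None:
        for gpu in self.gpus:
            gpu.streams.discard(stream_id)
```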

pavlov · a year ago
You can do parallel rendering jobs on a GPU. (Think of how each GPU-accelerated window on a desktop OS has its own context for rendering resources.)

So if the rendering is lightweight enough, you can multiplex potentially lots of simultaneous jobs onto a smaller pool of beefy GPU server instances.

Still, all these GPU-backed cloud services are expensive to run. Right now it’s paid by VC money — just like Uber used to be substantially cheaper than taxis when they were starting out. Similarly everybody in consumer AI hopes to be the winner who can eventually jack up prices after burning billions getting the customers.

kabirgoel · a year ago
(Not the author but I work in real-time voice.) WebSockets don't really translate to actual GPU load, since they spend a ton of time idling. So strictly speaking, you don't need a GPU per WebSocket assuming your GPU infra is sufficiently decoupled from your user-facing API code.

That said, a GPU per generation (for some operational definition of "generation") isn't uncommon, but there's a standard bag of tricks, like GPU partitioning and batching, that you can use to maximize throughput.
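In sketch form (the websocket and inference interfaces here are placeholders, not any particular library):

```python
import asyncio

job_queue: asyncio.Queue = asyncio.Queue()

async def handle_socket(websocket):
    # Connection handling lives on cheap CPU workers; an idle socket
    # holds no GPU at all.
    async for message in websocket:                 # placeholder websocket API
        reply = asyncio.get_running_loop().create_future()
        await job_queue.put((message, reply))
        await websocket.send(await reply)

async def gpu_worker(run_inference):
    # A small pool of these drains the shared queue; batching several
    # queued requests per step is where the throughput tricks live.
    while True:
        message, reply = await job_queue.get()
        reply.set_result(await run_inference(message))
```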

diggan · a year ago
> that you can use to maximize throughput

While sometimes degrading the experience, a little or by a lot, thanks to possible "noisy neighbors". Worth keeping in mind that most things are trade-offs somehow :) Mostly important for "real-time" rather than batched/async stuff, of course.

ilaksh · a year ago
It is expensive. They charge in 6 second increments. I have not found anywhere that says how much per 6 second stream.

Okay found it, $0.24 per minute, on the bottom of the pricing page.

That means they can spend about $14.40/hour on GPU and still break even. So I believe that leaves a bit of room for profit.
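Spelling out that arithmetic:

```python
price_per_minute = 0.24                  # from the bottom of the pricing page
revenue_per_hour = price_per_minute * 60
print(revenue_per_hour)                  # 14.4 -> ~$14.40/hour of GPU cost to break even
```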

bpanahij · a year ago
Scroll down the page and the per minute pricing is there: https://www.tavus.io/pricing

We bill in 6 second increments, so you only pay for what you use in 6 second bins.