We’re sharing some of the challenges we faced building an AI video interface that has realistic conversations with a human, including getting it to under 1 second of latency.
To try it, talk to Hassaan’s digital twin: https://www.hassaanraza.com, or to our "demo twin" Carter: https://www.tavus.io
We built this because until now, we've had to adapt communication to the limits of technology. But what if we could interact naturally with a computer? Conversational video makes it possible – we think it'll eventually be a key human-computer interface.
To make conversational video effective, it has to have really low latency and conversational awareness. A fast-paced conversation between friends has ~250 ms between utterances, but if you’re talking about something more complex or with someone new, there is additional “thinking” time. So, less than 1000 ms latency makes the conversation feel pretty realistic, and that became our target.
Our architecture decisions had to balance 3 things: latency, scale, & cost. Getting all of these was a huge challenge.
The first lesson we learned was that to make it low-latency, we had to build it from the ground up. We went from a team that cared about seconds to a team that counts every millisecond. We also had to support thousands of conversations happening all at once, without getting destroyed on compute costs.
For example, during early development, each conversation had to run on an individual H100 in order to fit all components and model weights into GPU memory just to run our Phoenix-1 model faster than 30fps. This was unscalable & expensive.
We developed a new model, Phoenix-2, with a number of improvements, including inference speed. We switched from a NeRF-based backbone to Gaussian Splatting for a multitude of reasons, one being the requirement that we could generate frames faster than real time, at 70+ fps, on lower-end hardware. We exceeded this and focused on optimizing GPU memory and core usage so that lower-end hardware could run it all. We did other things to save on time and cost, like streaming instead of batching, parallelizing processes, etc. But those are stories for another day.
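For a flavor of what streaming (vs batching) buys you, here is a minimal, purely illustrative sketch – not our actual Phoenix pipeline – where each frame is pushed downstream as soon as it's rendered instead of waiting for the whole clip:

```python
import asyncio
import time

async def render_frame(i: int) -> bytes:
    """Stand-in for a real renderer; just sleeps to simulate ~5 ms of work."""
    await asyncio.sleep(0.005)
    return f"frame-{i}".encode()

async def stream_frames(n: int):
    """Yield each frame as soon as it's ready instead of batching the whole clip."""
    for i in range(n):
        yield await render_frame(i)

async def main():
    start = time.perf_counter()
    first_frame_at = None
    async for _frame in stream_frames(90):
        # In a real system each frame would be pushed onto a WebRTC track here;
        # the point is that playback can start long before the clip is "done".
        if first_frame_at is None:
            first_frame_at = time.perf_counter() - start
    print(f"first frame ready after {first_frame_at:.3f}s, "
          f"all 90 frames after {time.perf_counter() - start:.3f}s")

asyncio.run(main())
```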
We still had to lower the utterance-to-utterance time to hit our goal of under a second of latency. This meant each component (vision, ASR, LLM, TTS, video generation) had to be hyper-optimized.
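To make "hyper-optimized" concrete, here's a toy latency budget for a single turn. The stage names are the components above; the numbers are illustrative placeholders, not our production figures:

```python
# Illustrative per-stage budget (milliseconds) for one utterance-to-utterance turn.
# The stages come from the pipeline above; the numbers are made-up placeholders.
budget_ms = {
    "end-of-turn detection": 200,
    "ASR finalization": 100,
    "LLM time-to-first-token": 300,
    "TTS first audio chunk": 150,
    "video generation first frame": 150,
}

total = sum(budget_ms.values())
print(f"total: {total} ms (target: < 1000 ms)")
assert total < 1000, "budget blown -- something has to get faster"
```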
The worst offender was the LLM. It didn't matter how fast the tokens per second (t/s) were; it was the time-to-first-token (TTFT) that really made the difference. That meant services like Groq were actually too slow – they had high t/s but slow TTFT. Most providers were too slow.
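A simple way to see why TTFT dominates in a voice loop: measure the time until the first streamed token arrives, not the overall tokens per second. This is a generic sketch against a hypothetical streaming client, not any particular provider's SDK:

```python
import time
from typing import Iterable, Iterator

def fake_llm_stream(prompt: str) -> Iterator[str]:
    """Hypothetical stand-in for a provider's streaming endpoint."""
    time.sleep(0.4)           # time-to-first-token: this is what the voice loop feels
    for tok in "sure, here is an answer".split():
        time.sleep(0.005)     # per-token time: great tokens/sec, but it barely matters
        yield tok

def measure_ttft(stream: Iterable[str]) -> float:
    start = time.perf_counter()
    next(iter(stream))        # block until the first token arrives
    return time.perf_counter() - start

print(f"TTFT: {measure_ttft(fake_llm_stream('hi')):.3f}s")
```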
The next worst offender was actually detecting when someone stopped speaking. This is hard. Basic solutions use the amount of silence after speech to ‘determine’ when someone has stopped talking, but that adds latency: tune it too short and the AI agent talks over you; too long and it takes a while to respond. We needed a model dedicated to accurately detecting end-of-turn from conversational signals, and to speculating on inputs to get a head start.
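For reference, this is roughly what the naive silence-timeout approach looks like, and where its latency comes from – a toy sketch, not our actual turn-detection model:

```python
import time

SILENCE_TIMEOUT = 0.7   # seconds of silence before the turn is called "over"
                        # (tune down -> agent interrupts; tune up -> agent feels slow)

class NaiveEndOfTurn:
    def __init__(self, timeout: float = SILENCE_TIMEOUT):
        self.timeout = timeout
        self.last_speech = None

    def on_audio_frame(self, is_speech: bool, now: float) -> bool:
        """Return True once we decide the speaker has finished their turn."""
        if is_speech:
            self.last_speech = now
            return False
        if self.last_speech is None:
            return False
        return (now - self.last_speech) >= self.timeout

# Simulated 20 ms frames: speech, then silence. The decision lands a full
# `timeout` after the real end of speech -- that wait is exactly the latency
# a dedicated end-of-turn model plus speculative generation tries to claw back.
detector = NaiveeEndOfTurn() if False else NaiveEndOfTurn()
t = 0.0
for is_speech in [True] * 10 + [False] * 40:
    if detector.on_audio_frame(is_speech, t):
        print(f"end of turn detected at t={t:.2f}s")
        break
    t += 0.02
```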
We went from 3–5 seconds to under 1 second (and as fast as 600 ms) with these architectural optimizations, while running on lower-end hardware.
All this allowed us to ship with less than a second of latency, which we believe is the fastest out there. We have a bunch of customers, including Delphi, a professional coach and expert cloning platform. Their users have conversations with digital twins that span from minutes to an hour, to even four hours (!) – which is mind-blowing, even to us.
Thanks for reading! Let us know what you think and what you would build. If you want to play around with our APIs after seeing the demo, you can sign up for free on our website: https://www.tavus.io.
2) Maybe it's just degrading under load, but I didn't think either chat experience was very good. Both avatars interrupted themselves a lot, and the chat felt more like a jumbled mess of half-thoughts than anything.
3) The image recognition is pretty good though, when I could get one of the avatars to slow down long enough to identify something I was holding.
Anyway great progress, and thanks for sharing so much detail about the specific hurdles you've faced. I'm sure it'll get much better.
One thing I've noticed with a lot of these AI video agents – and I've seen it in Meta's teaser for their virtual agents as well as from some other companies – is that they seem to love to move their head constantly. It makes them all a bit uncanny, like a video game NPC that reacts with a head movement on every utterance. It's less apparent in short 5-10 s video clips, but the longer the clip, the more the constant head movements give it away.
I'm assuming this is, of course, a well-known and tough problem that's being worked on, since swinging too far in the other direction toward stiff/minimal head movements would make it even more uncanny. I'd love to hear what has been done to try to tackle the problem, or whether at this point it's an accepted “tell” that lets you know when you're speaking with a virtual agent.
I'm still most impressed by the image recognition - could clearly read even tiny or partially obscured print on products I held up and name them accordingly. Curious how you're achieving that level of fidelity without sacrificing throughput.
For a little example: when searching Google I default to the minimal set of keywords required to get the result, instead of typing full sentences. I'm sort of afraid this technology will train people to behave like that when video chatting with virtual assistants, and that the same attitude will bleed into real-life interactions in society.
I think our human-human interaction style will “leak” into the way we interact with humanoid AI agents. Movie-Her style.
"Now dump those results into a markdown table for me please."
The firm in the post seems to be called Tavus, and their products are either “digital twins” or “Carter.”
Not meaning to be pedantic, I’m just wondering whether the “V” in the thing you’ve spoken to indicates more “voice” or “video” conversations.
Creepiness: 10/10
You have to be kidding me.
Honestly this is the future of call centers. On the surface it might seem like the video/avatar is unnecessary, and that what really matters is the speech-to-speech loop. But once the avatar is expressive enough, I bet the CSAT would be higher for video calls than voice-only.
If you just exposed all the functionality as buttons on the website, or even as AI, I'd be able to fix the problems myself!
And I say that while working for a company making call centre AIs... double ironic!
A couple have had a low threshold for "this didn't solve my issue" and directed me to a human, but others are impossible to escape.
On the other hand, I've had more success recently with a problem actually getting resolved by a chatbot without speaking to someone... but not a lot more. Usually I think that's because I skew technical and treat Support as a last resort, so I've already tried everything it wants to suggest.
This feels like retro futurism, where we take old ideas and apply a futuristic twist. It feels much more likely that call centers will cease to be relevant, before this tech is ever integrated into them.
What do you think about the societal implications for this? Today we have a bit of a loneliness crisis due to a lack of human connection.
https://x.com/kwindla/status/1839767364981920246
Not to be rude, but these days it's best to ask.
This is about community and building fun things. I can’t speak for all the sponsors, but what I want is to show people the Open Source tooling we work on at Daily, and see/hear what other people interested in real-time AI are thinking about and working on.
Wow, I have been attending public hackathons for over a decade, and I have never heard of something like this. That would be an outrage!
So you can basically spin up a few GPUs as a baseline, allocate streams to them, then boot up a new GPU when the existing ones get overwhelmed.
That doesn't look very different from standard cloud compute management. I'm not saying it's easy, but it's definitely not rocket science either.
So if the rendering is lightweight enough, you can multiplex potentially lots of simultaneous jobs onto a smaller pool of beefy GPU server instances.
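Something like this, as a toy sketch of that multiplexing logic (made-up per-GPU capacity, no real cloud API calls):

```python
from dataclasses import dataclass, field

STREAMS_PER_GPU = 8   # assumed capacity; depends entirely on the model and hardware

@dataclass
class Gpu:
    name: str
    streams: list = field(default_factory=list)

class Pool:
    def __init__(self, baseline: int = 2):
        self.gpus = [Gpu(f"gpu-{i}") for i in range(baseline)]

    def allocate(self, stream_id: str) -> str:
        # Pack onto the least-loaded GPU that still has headroom...
        candidates = [g for g in self.gpus if len(g.streams) < STREAMS_PER_GPU]
        if not candidates:
            # ...or boot a new one when everything is full (the slow, expensive
            # step you want to hide behind the baseline capacity).
            gpu = Gpu(f"gpu-{len(self.gpus)}")
            self.gpus.append(gpu)
        else:
            gpu = min(candidates, key=lambda g: len(g.streams))
        gpu.streams.append(stream_id)
        return gpu.name

pool = Pool()
for i in range(20):
    pool.allocate(f"conversation-{i}")
print({g.name: len(g.streams) for g in pool.gpus})
```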
Still, all these GPU-backed cloud services are expensive to run. Right now it’s paid by VC money — just like Uber used to be substantially cheaper than taxis when they were starting out. Similarly everybody in consumer AI hopes to be the winner who can eventually jack up prices after burning billions getting the customers.
That said, a GPU per generation (for some operational definition of "generation") isn't uncommon, but there's a standard bag of tricks, like GPU partitioning and batching, that you can use to maximize throughput.
While sometimes degrading the experience, a little or a lot, thanks to possible "noisy neighbors". Worth keeping in mind that most things are trade-offs somehow :) Mostly important for "real-time" rather than batched/async stuff, of course.
Okay, found it: $0.24 per minute, at the bottom of the pricing page.
That's $14.40 per hour of conversation, which means they can spend about $14/hour on GPU and still break even. So I believe that leaves a bit of room for profit.
We bill in 6-second increments, so you only pay for what you use, in 6-second bins.
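To illustrate how that works out against the $0.24/minute figure mentioned above (assuming, though it isn't stated here, that partial bins round up):

```python
import math

PRICE_PER_MINUTE = 0.24          # from the pricing page mentioned in this thread
BIN_SECONDS = 6

def cost(seconds_used: float) -> float:
    bins = math.ceil(seconds_used / BIN_SECONDS)      # round up to the next 6 s bin
    return bins * BIN_SECONDS / 60 * PRICE_PER_MINUTE

print(f"10 min call: ${cost(600):.2f}")   # $2.40
print(f"1 hour call: ${cost(3600):.2f}")  # $14.40 -- the break-even figure above
print(f"100 s call:  ${cost(100):.3f}")   # billed as 17 bins = 102 s
```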