While it is impressive and I like to follow the advancements in this field, it is incredibly frustrating to listen to. I can't put my finger on why exactly. It's definitely closer to human-sounding, but the uncanny valley is so deep here that I find myself thinking "I just want the point, not the fake personality that is coming with it". I can't make it through a 30s demo.
We're used to hearing some kind of identity behind voices -- we unconsciously sense clusters of vocabulary, intonation patterns, tics, frequent interruption vs quiet patience, silence tolerance, response patterns to various triggers, etc that communicate a coherent person of some kind.
We may not know that a given speaker is a GenX Methodist from Wisconsin who grew up at skate parks in the suburbs, but we hear clusters of speech behavior that let our brain go "yeah, I'm used to things fitting together in this way sometimes".
These don't have that.
Instead, they seem to mostly smudge together behaviors that are just generally common in aggregate across the training data. The speakers all voice interrupting acknowledgements eagerly, they all use a bright, enunciated podcaster tone, they all draw on similar word choice, etc -- they distinguish gender and each have a stable overall vocal tone, but no identity.
I don't doubt that this'll improve quickly though, by training specific "AI celebrity" voices narrowed to sound more coherent, natural, identifiable, and consistent. (And then, probably, leasing out those voices for $$$.)
As a tech demo for "render some vague sense of life behind this generated dialog" this is pretty good, though.
To be fair, the majority of podcasts are from a group of generic white guys, and they almost sound identical to these AI-generated ones. The AI actually seems to do a better job too.
Whether this stops at the uncanny valley or progresses to specific "AI celebrity" voices, I'm left thinking the engineers involved in this never stopped to think carefully about whether this ought to be done in the first place.
Agreed. To me it sounds like bad voice-over actors reading from a script. So the natural parts of a conversation where you might say the wrong thing and step back to correct yourself are all gone. Impressive for sure.
Totally agree. Maybe it’s just the clips they chose, but it feels overfit on the weird conversational elements that make it impressive? Like the “oh yeahs” from the other person when someone is speaking. It is cool to see that natural flow in a conversation generated by a model, but there’s waaaay too much of it in these examples to sound natural.
And I say all that completely slackjawed that this is possible.
I love the technology, but I really don't want AI to sound like this.
Imagine being stuck on a call with this.
> "Hey, so like, is there anything I can help you with today?"
> "Talk to a person."
> "Oh wow, right. (chuckle) You got it. Well, before I connect you, can you maybe tell me a little bit more about what problem you're having? For example, maybe it's something to do with..."
I'd love to see stats on disfluency rate in conversation, podcasts, and this sample to get an idea of where it lies. It seems like they could have cranked it up, but there's also the chance that it's just the frequency illusion because we were primed to pay attention to it.
For me it isn’t uncanny from a lack of humanity. Rather, it triggers all my “fake and shallow” personality biases. It certainly sounds human enough, just not the type of humans I like.
That could just be the context though. Listening to a clip that's a demo of what the model can produce is very different to listening to a YouTube video that's using the model to generate speech about something you'd actually want to watch a video of.
Yeah... It isn't that it doesn't sound like human speech... it just sounds like how humans speak when they are uncomfortable or reading something prepared and aren't good at it.
Probably because you're expecting it and looking at a demo page. Put these voices behind a real video or advertisement and I would imagine most people wouldn't be able to tell that it's AI generated at all.
It'd be annoying to me whether it was AI or human. The faux-excitement and pseudo-bonhomie is grating. They should focus on how people actually talk, not on copying the vocal intonation of coked-up public radio presenters just back from a positive affirmation seminar.
I agree. It’s profoundly sad that so much energy is being invested in solving the non-problem of making long documents accessible. To think that people will ignore carefully written work for the “chat show” output of an LLM is horrifying and a harbinger of a societal slide into happy stupidity and willing ignorance.
> Example of a multi-speaker dialogue generated by NotebookLM Audio Overview, based on a few potato-related documents.
Listening to this on 1.75x speed is excellent. I think the generated speaking speed is kept slow for audio quality, because it'd be much harder to slow down the generated audio while retaining quality than vice versa.
It's due to the epidemic of histrionics that we are going through.
A lot of people are just like that IRL.
They cannot just say "the food was fine", it's usually some crap like "What on earth! These are the best cheese sticks I've had IN MY EN TI R E LIFE!".
I tuned it out instantly because I have that feeling with most Americans / podcasts / etc already. That said, it's a convincing enough analog for that kind of content I think.
It doesn't feel any different to me than listening to a random radio station where I don't know who is speaking. I didn't feel any uncanny valley, but I'm not a native English speaker so I might miss some nuances. However, there are relatively few native English speakers around the world, so this might not be a problem for us.
The problem is that people talking over each other is not a format I long to listen to.
> While it is impressive and I like to follow the advancements in this field...
Please don't think that I'm trying to suggest... anything. It's just that I'm getting used to reading this pattern in the output of LLMs. "While this and that is great...". Maybe we're mimicking them now? I catch myself using these disclaimers even in spoken language.
I like to preface negativity with a positive note. Maybe I am influenced in my word choice but my intent was to point out that this is a very, very impressive feat and I don't want to undermine it.
Whilst I don't doubt you feel like that, the general response to the NotebookLM podcast feature (which uses this) has been very positive.
In general people find the back and forth between the "hosts" engaging, and it also gives them time to digest the contents.
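This is good, but certainly not yet great.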
When I got to the bit where they referred to the smaller training set of paid voice actors, that hit it for me. It certainly sounds like they are throwing the 'um's and 'ah's into a script -- not naturally.
There's a certain fakeness to the rhythm of the space between words. Particularly the "uh" and "um" filler sounds. To me it sounds like they always either come in abnormally early or late after speaking those sounds.
It's probably because it's trained on "professional audio" -- ads, movies, audiobooks -- and not "normal people talking". Like the effect when diffusion models were mostly trained on stock photos.
I got a similar feeling. I think it was overdoing the ums and uhhs for something trying to sound like even a slightly professional podcast.
I think I put my finger on exactly why it sounds a bit uncanny-valley: it sounds like humans who are reading from a prepared 'bit' or 'script'.
We've all been on those webinars where it's clear -- despite the infusions (on cue) of "enthusiasm" from the speaker attempting to make it sound more natural and off-the-cuff -- that they are reading from a script.
It's a difficult-to-mask phenomenon for humans.
That all said, I actually have more grace for an AI sounding like this than I do for a human presenter reading from a script. Like, if I'm here "live" and paying attention to what you're saying, at least do me the service of truly being "here" with me and authentically communicating vs. simply reading something.
If you're going to simply read something, then just send it to me to read too - don't pretend it's a spontaneously synchronous communication.
But what's the end goal and audience here? I don't believe people will resonate with robots making "um" and "ohs" because people usually resonate with an artist, a producer, a writer, a singer etc. A human layer with which people can empathize is essential. This can work as long as people are deceived and don't know there is no human behind it. If however I find out that a video is AI-generated I instantly lose interest in it. There are e.g. a lot of AI-generated architecture videos on YouTube at the moment; I have never wanted to listen to one, because I know the emotions will be fake.
> I don't believe people will resonate with robots making "um" and "ohs" because people usually resonate with an artist, a producer, a writer, a singer etc.
I think they absolutely will, because "resonating" is not a material phenomenon, it's something people decide that they're doing. Your connection with an actor on television is not an actual connection. Most of acting is learning the times and length to be silent while making a particular face (dictated by the director) in order for the audience to project feelings and thoughts onto you. You're thinking about your camera blocking, or your groceries, and your audience sees you thinking about some plot point in a fictional world.
I've got a theory that we severely damaged a generation of girls by inundating them with images of girls their own age singing songs and acting parts all written and directed by middle-aged men - ones who chose as a profession to write songs in the voices of, write fiction in the voices of, and to direct, photograph and choreograph in person, tween girls. Their models of themselves have come from looking at these depictions of girls, who were never allowed to speak for themselves, and resonating.
1. Voice acting for low-budget/no-budget animations and games.
2. Billions of YouTube "top 50 building demolitions" where the forgettable presentation is narrated by forgettable AI. Now we'll get "podcast style" conversation narration over those videos. Instead of bailing after 30 sec with regret, you might make a whole minute.
3. Reaction videos? Sometimes I weaken. I want to see a random person's reaction to their "first time listening" to the famous song they somehow have never heard until this moment. If we humans lower ourselves to reaction videos, we'll watch/listen to AI chatting to itself about things we love. Once the content gets "spicy", beyond the potato salad Google demos, the floodgates will open. God help us.
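Is this related to LLMs, or is this a completely different branch of AI, and is it just a coincidence? I am curious.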
It's very related to LLMs, though instead of text tokens you are working with audio tokens (e.g. from SoundStream), and you train on an audio corpus instead of a text corpus.
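A toy sketch of that pipeline, for the curious -- with a nearest-neighbour quantizer standing in for a learned neural codec like SoundStream, and a bigram counter standing in for the transformer. Every name and number here is illustrative, not any real API:

    import numpy as np

    def quantize_audio(samples, codebook):
        # Map each audio frame to the index of its nearest codebook vector --
        # the job a neural codec like SoundStream learns to do.
        frames = samples.reshape(-1, codebook.shape[1])
        dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        return dists.argmin(axis=1)  # a stream of discrete audio tokens

    def train_bigram_lm(tokens, vocab):
        # Toy stand-in for the transformer: smoothed next-token counts.
        counts = np.ones((vocab, vocab))
        for a, b in zip(tokens[:-1], tokens[1:]):
            counts[a, b] += 1
        return counts / counts.sum(axis=1, keepdims=True)

    rng = np.random.default_rng(0)
    codebook = rng.normal(size=(16, 4))   # 16 codes over 4-sample frames
    corpus = rng.normal(size=4096)        # fake raw-audio "corpus"
    tokens = quantize_audio(corpus, codebook)
    lm = train_bigram_lm(tokens, vocab=16)

    # Generation: sample tokens autoregressively, then decode via the codebook.
    tok, out = tokens[0], [tokens[0]]
    for _ in range(31):
        tok = rng.choice(16, p=lm[tok])
        out.append(tok)
    waveform = codebook[np.array(out)].reshape(-1)  # back to "audio" samples

The real systems differ in scale and in learning both the codec and the model, but the shape is the same: audio in, tokens out, language modeling over the tokens, tokens back to audio.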
Is there a free (ad supported?) online tool without login that reads text that you paste into it?
I often would like to listen to a blog post instead of reading it, but haven't found an easy, quick solution yet.
I tried piping text through OpenAI's tts-1-hd model, and it is the first one I ever found that is human-like enough for me to like listening to it. So I could write a tool for my own use case that pipes the text to tts-1-hd and plays the audio.
But maybe there is already something with a public web interface out there?
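For what it's worth, the pipe you describe is only a few lines. A minimal sketch, assuming the openai Python package and an OPENAI_API_KEY in the environment (the voice preset and filenames are just illustrative):

    import sys
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    text = sys.stdin.read()
    with client.audio.speech.with_streaming_response.create(
        model="tts-1-hd",
        voice="alloy",      # one of the documented preset voices
        input=text[:4096],  # the endpoint caps input length per request
    ) as response:
        response.stream_to_file("post.mp3")

Run it as e.g. "python tts_pipe.py < post.txt" and play the mp3 with anything; longer posts would need chunking across multiple requests.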
Both Windows and macOS (the operating systems) have this built-in under accessibility. It's worth a try, and I use it sometimes when I want to read something while cooking.
There is on iOS. No ads. "Reader" by Eleven Labs. I haven't used it that much but have listened to some white papers and blogs (some of which were like 45 minutes) and it "just worked". Even lets you click text you want to jump to.
And it's Eleven Labs quality -- which unless I've fallen behind the times is the highest quality TTS by a margin.
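I did a bit of research and it seems to be, by far, the highest-quality TTS engine that is free and you can do things like pause and continue. There are other options that have higher-quality voices, but they aren't free.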
We've been using this at work to get inside our customers' perspective. It's helpful to throw e.g. a bunch of point-of-sale data sync challenges into NotebookLM and pass a 10-minute audio overview to the team so they can understand where our work fits in.
I've cut and pasted weeks of Slack conversations into NotebookLM and it was quite entertaining to then listen to a podcast talking humorously about all the arguments in the #management channel.
In a similar vein, I'm glad they told me it was a funny story, because otherwise I wouldn't have known.