Posted by u/leobg 2 years ago
Show HN: AI dub tool I made to watch foreign language videos with my 7-year-old (speakz.ai...)
Hey HN!

I love watching YouTube with my 7-year-old daughter. Unfortunately, the best stuff is often in English (we're German). So I made an AI tool that translates videos directly, using the original voices. All other sounds, as well as background music, are preserved, too.

Turns out that it works for many other language pairs, too. So far, it can create dubs in English, Mandarin Chinese, Spanish, Arabic, French, Russian, German, Italian, Korean, Polish and Dutch.

The main challenge in building this was to get the balance right between translating the original meaning and getting the timing right. Especially for language pairs like English -> German, where the target is often longer than the source ("bat" -> "Fle-der-maus", "speed" -> "Ge-schwin-dig-keit").

Let me know what you think! :)

sorenjan · 2 years ago
I know Germany dubs most video, but wouldn't a seven year old be able to read subtitles? It's a great way for her to learn English, it's how most Swedes learn it before starting school. I think there's a pretty strong correlation between countries' average English proficiency and how common dubbing is.

https://haonowshaokao.com/2013/05/18/does-dubbing-tv-harm-la...

Edit: I forgot to mention that the samples on the website are impressive and well made. How do you do the speaker diarization and voice cloning?

leobg · 2 years ago
Yeah, this isn't really helpful for her to learn English. This is more when we watch The Anatomy Lab, or BBC's "The Incredible Human Journey". She'll already be asking me a lot of questions about the content. So if I had to translate on top of that, it would be tedious.

Subtitles - those are actually being generated as well. I've generated SRT files during development. Color coded by speaker, and on a per-word basis, for me to get the timing right.
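A minimal sketch of what a per-word, speaker-colored SRT export could look like. The `Word` tuple, the color palette, and the use of `<font color>` tags (supported by many players, but not all) are assumptions for illustration, not the actual Speakz.ai code.

```python
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    start: float   # seconds
    end: float     # seconds
    speaker: int   # hypothetical speaker index from diarization


# Assumed palette: one color per speaker, reused if there are more speakers.
COLORS = ["#ffcc00", "#66ccff", "#99ff99"]


def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"


def words_to_srt(words: list[Word]) -> str:
    """Emit one SRT cue per word, colored by speaker."""
    cues = []
    for index, w in enumerate(words, start=1):
        color = COLORS[w.speaker % len(COLORS)]
        cues.append(
            f"{index}\n"
            f"{srt_time(w.start)} --> {srt_time(w.end)}\n"
            f'<font color="{color}">{w.text}</font>\n'
        )
    return "\n".join(cues)
```

One cue per word makes timing drift easy to spot by eye while the video plays, which is the debugging use described above.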

Basically, if you have a YouTube channel, you can take any video from your channel, run it through Speakz.ai, and you'll get 15+ additional audio tracks in different languages, plus 15+ subtitle files (SRT).

Voice cloning and speaker diarization were a bit of a challenge. On the one hand, I want to duplicate the voice that is being spoken right now. On the other hand, sometimes "right now" is just a short "Yeah" (like in the Elon interview) which doesn't give you a lot of "meat" to work with in terms of embedding the voice's characteristics.

Right now, I'm using a mix of signals:

- Is the utterance padded by pauses before/after?
- Is the utterance a complete sentence?
- Does the voice of the utterance sound significantly different from the voice of the previous utterance?
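A rough sketch of how the three signals above might be combined into a single speaker-change score. Everything here is an assumption for illustration: the `Utterance` fields, the thresholds, equal weighting, using terminal punctuation as a stand-in for "complete sentence", and cosine similarity over a voice embedding produced by some speaker encoder.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Utterance:
    text: str
    start: float             # seconds
    end: float               # seconds
    embedding: list[float]   # voice embedding from a speaker encoder (assumed)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def speaker_change_score(
    prev: Utterance,
    cur: Utterance,
    nxt: Optional[Utterance] = None,
    min_pause: float = 0.4,        # assumed threshold, seconds
    max_similarity: float = 0.75,  # assumed threshold
) -> float:
    """Fraction of the three signals that point to a new speaker (0.0 to 1.0)."""
    padded = cur.start - prev.end >= min_pause and (
        nxt is None or nxt.start - cur.end >= min_pause
    )
    complete = cur.text.rstrip().endswith((".", "!", "?"))
    different = cosine(prev.embedding, cur.embedding) < max_similarity
    return sum([padded, complete, different]) / 3
```

A short "Yeah." that is set off by pauses and sounds unlike the previous voice scores 1.0, while a mid-sentence continuation in the same voice scores 0.0.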

It's a deep, deep rabbit hole. I was tempted to go much deeper. But I thought I better check if anybody besides myself actually cares before I do that... :)

supafastcoder · 2 years ago
I think it’s a cultural difference. I’m also from a non-dubbing country (Netherlands) and I can’t stand dubbed content either. On the other hand people tell me they can’t stand subtitles because it “reveals” what they’re going to say before they say it.
lukan · 2 years ago
"people tell me they can’t stand subtitles because it “reveals” what they’re going to say before they say it."

I love watching movies in the original language, but this is something I hate as well, but something that can be avoided.

Some movies get it right, though. The timing, just the words that are spoken and even different colors for different persons speaking (very rare, cannot even remember where I have seen it). That should be standard, but with most movies you can be lucky if the subs even match the plot and do not reveal too much.

vidarh · 2 years ago
I'm Norwegian, and Norway used to be near-universally non-dubbing other than for TV for the very youngest children, and even then almost exclusively cartoons or stop motion etc. where it wasn't so jarring. But the target age of material being dubbed has crept up as it has become relatively-speaking cheaper to do compared to revenues generated in what is a tiny market.

The thing that annoys me the most about it is that it often alters the feel of the material. E.g. I watched Valiant (2005) with my son in Norwegian first, because he got it on DVD from his grandparents. He doesn't understand much Norwegian, but when he first got the DVD he was so little that it didn't matter. A few years later we watched the English language version.

It comes across as much darker in the English version. The voice acting is much more somber than the relatively cheerful way the Norwegian dub was done, and while it's still a comedy, in comparison it feels like the Norwegian version obscures a lot of the tension, and it makes it feel almost like a different movie.

I guess that could go both ways, but it does often feel like the people dubbing something are likely to have less time and opportunity to get direction on how to play the part, and you can often hear the consequences.

matsemann · 2 years ago
I think you get used to it. Like a punchline I've read, but I don't "register" it until the proper thing happens on the screen.
alexdbird · 2 years ago
I prefer subs over dubbing for foreign languages, but I cannot stand closed captions (for people who can’t hear at all) because having your eye drawn to the bottom of the screen for a description of something I don’t need to know about is horrible!
darkwater · 2 years ago
A 7yo can barely keep up with subtitles in their mother tongue, depending on the speed. And that's probably true for a p90 reader. There is no way a p50 reader can follow subs and understand what they say. Now, being a video, they might be able to interpolate from what they see, so it might be a nice challenge. But doing this with subtitles in a foreign language is only for a few, privileged minds.

Source: father of an 8yo with VERY good reading skills (already reading books in 2 languages targeted at tweens)

ChemSpider · 2 years ago
Dubbing in Germany is horrible and pervasive. Even in the news and interviews. Subtitles are cheaper and better.

As others have said, it is better to expose kids (that can read) to the original language plus subtitles.

So in other words, your solution, while technically great, is pedagogically not wise. A typical geek approach to a problem ;)

rob74 · 2 years ago
The worst thing about dubbing is that it's more important for the translations to have roughly the same length and correspondence to the original mouth movements than to be accurate. So the original meaning is often altered, and you don't even know it because of course you have no easy access to the original most of the time. But unfortunately Germans are so used to dubbing that subtitles don't really stand a chance. There are a few cinemas here and there that show original-language movies with subtitles, and on TV there was one experiment that I'm aware of a few years ago (on Pro Sieben Maxx) to show TV series with subtitles, but it was cancelled after some time. AFAIK it's also more expensive to secure the rights to show English-language content compared to dubbed content.
wodenokoto · 2 years ago
> but wouldn't a seven year old be able to read subtitles?

No, they wouldn't.

I don't believe that most swedes learn English by reading subtitles before starting school.

> I think there's a pretty strong correlation between countries' average English proficiency and how common dubbing is.

That I agree with.

input_sh · 2 years ago
> I don't believe that most swedes learn English by reading subtitles before starting school.

It's not about learning the language per se, it's about familiarizing yourself with the sound of the language, which then makes formal learning feel much more intuitive. English becomes an easy subject because you always feel a little ahead of the material. When faced with a "fill in the blank" type of questions, you're able to answer them by what feels right, even when you can't quite explain why it feels right.

It's why #1 rule of language learning at any stage in life is always gonna be immersing yourself with the language you want to learn, and by far the most effective way to immerse yourself (excluding moving to another country) is to consume content in your target language.

NicoJuicy · 2 years ago
Most people in Belgium learn English through that before school.

Why wouldn't swedes?

konschubert · 2 years ago
My 6 year old has been watching 20 minutes of cartoons every night for the past two years. This is the only exposure to the English language that she has ever had.

She has learned to understand what is said in the cartoons. Of course she misses some things, but it's surprising how much she gets.

Like, when I ask her "what did Bluey just say?", she can explain it.

Children's brains are awesome.

But actually, grown-ups can also pick up quite a lot if they actually immerse themselves.

ivanhoe · 2 years ago
Young kids don't even need subtitles, their brains are wired to figure out spoken languages, after all that's how we all learn our mother tongue initially. Last summer my then 3.5 years old, to my huge surprise, started talking in (simple, but correct) English with some tourist kids she met in the park. We never spoke English at home with her before, so I presume she picked it up from YouTube and her older brother, but I had no idea she could form full sentences - including conditionals and past tense. And at first she was a bit slow to express herself, but after a few hours of play with those kids she sounded totally relaxed and fluent.
vidarh · 2 years ago
Subtitles in a foreign language? Probably not. Subtitles translated into their original language? I think it's probably an exaggeration that people have learnt it before starting school because it implies a lot about what learning it means, but picking up a number of words, sure.

Deleted Comment

anhner · 2 years ago
>> but wouldn't a seven year old be able to read subtitles?

> No, they wouldn't.

hard disagree

poulsbohemian · 2 years ago
>I know Germany dub most video, but wouldn't a seven year old be able to read subtitles?

I gotta say... while sometimes it is a necessary evil, I would so rather not have to read subtitles. I often want to listen to a show so that I can also continue working on catching up on email, etc., i.e. I can't read two things at once, but I can listen to one thing and continue working on something else.

waldrews · 2 years ago
Impressively done! It sounds like you're:

1) doing voice recognition with voice time clues, which Whisper and the like provide, breaking it up into sentence (or similar) units; you don't need to time match individual words, but you need to time match at coarser grain.

2) using a translation engine that allows for multiple alternative translations

3) cloning the original voice, regardless of language

4) choosing the translation that has the best time match (possibly by syllable counting, or by actually rendering and timing the translations). If there isn't a close translation, maybe you're asking ChatGPT to forcibly rephrase?

5) Maybe some modest pitch-corrected rate control to pick the path that gets you closest to the timing?

Did I get any of that right?
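To make step 4 concrete, here is a small sketch of picking the best-fitting translation by syllable counting. The vowel-run heuristic, the speech rate of 4 syllables per second, and the function names are all assumptions; a real system might render each candidate with TTS and time it instead.

```python
import re

# Crude heuristic: count each run of vowels as one syllable.
VOWEL_RUNS = re.compile(r"[aeiouyäöü]+", re.IGNORECASE)


def count_syllables(text: str) -> int:
    """Estimate the syllable count as the number of vowel runs."""
    return len(VOWEL_RUNS.findall(text))


def best_fit(
    candidates: list[str],
    slot_seconds: float,
    syllables_per_second: float = 4.0,  # assumed average speech rate
) -> str:
    """Return the candidate whose estimated duration is closest to the slot."""

    def misfit(text: str) -> float:
        estimated = count_syllables(text) / syllables_per_second
        return abs(estimated - slot_seconds)

    return min(candidates, key=misfit)
```

With this heuristic "bat" counts as 1 syllable and "Fledermaus" as 3, so for a 1.5 second slot `best_fit` prefers "Die Fledermaus fliegt weg" over "Die Fledermaus fliegt sehr schnell davon".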

leobg · 2 years ago
Very good!

Yes, that's basically how it works.

I don't do any pitch-correction. But I do check the TTS output for length, and I re-generate if it doesn't match my time constraints.
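The check-and-regenerate loop could look roughly like this. It is a sketch under assumptions: `synthesize` is a hypothetical callable standing in for whatever TTS engine is used, it is assumed to produce a different take per attempt, and only an upper bound on duration (with a 10% tolerance) is enforced.

```python
from typing import Callable, Optional

# Hypothetical TTS interface: (text, attempt) -> (audio bytes, duration in seconds)
Synthesizer = Callable[[str, int], tuple[bytes, float]]


def synthesize_to_fit(
    text: str,
    slot_seconds: float,
    synthesize: Synthesizer,
    tolerance: float = 0.10,   # assumed: allow 10% overrun
    max_attempts: int = 5,     # assumed retry budget
) -> Optional[bytes]:
    """Re-run TTS until the take fits the slot; None if no take fits."""
    limit = slot_seconds * (1 + tolerance)
    for attempt in range(max_attempts):
        audio, duration = synthesize(text, attempt)
        if duration <= limit:
            return audio
    return None
```

Returning `None` instead of the shortest take leaves the decision to the caller, for example asking for a shorter translation rather than speeding the audio up.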

I also have an arranger that tries to figure out when to play an utterance early (i.e. earlier than in the original) in order to make up for the translated version being longer.
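One way such an arranger might work, sketched as two passes over the clips. The `Clip` fields, the `max_lead` limit, and the two-pass strategy are assumptions for illustration, not the author's algorithm.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    orig_start: float     # when the utterance starts in the original, seconds
    dub_duration: float   # length of the translated audio, seconds


def arrange(clips: list[Clip], max_lead: float = 1.0) -> list[float]:
    """Return a start time for each dubbed clip.

    A clip may start up to max_lead seconds early to make room for its
    longer translation, and is pushed later if the previous dubbed clip
    is still playing.
    """
    starts = [c.orig_start for c in clips]

    # Backward pass: pull a clip earlier if it would run into the next one.
    for i in range(len(clips) - 2, -1, -1):
        latest = starts[i + 1] - clips[i].dub_duration
        earliest = max(0.0, clips[i].orig_start - max_lead)
        starts[i] = max(earliest, min(starts[i], latest))

    # Forward pass: never overlap the previous dubbed clip.
    for i in range(1, len(clips)):
        previous_end = starts[i - 1] + clips[i - 1].dub_duration
        starts[i] = max(starts[i], previous_end)

    return starts
```

For clips starting at 0.0, 3.0 and 5.0 with dubbed lengths of 2.0, 3.0 and 1.0 seconds, the second clip is moved up to 2.0 so that it finishes before the third begins. When the lead is not enough, the forward pass delays the following clip, which is where accumulated overrun comes from.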

I try to make the translations match the speaker's character, as well as the context. So ideally, Alex (Sample 2) will still say "Salut" even in German (instead of translating that greeting, too).

And I need to monitor for speaker changes. This is because I can't clone the voice unless I have a decent amount of sample data. If Elon just says "Yes", cloning the voice based on just that one syllable will make it sound like a robot. But I also can't just blindly grab any voice around it, since that might be somebody else's voice.

davidzweig · 2 years ago
I think it's a speech to speech model, I know about seamlessm4t: https://www.google.com/amp/s/about.fb.com/news/2023/08/seaml...

Interesting, but what inference engine supports it to run at decent speeds?

waldrews · 2 years ago
Ooh and you're probably doing a split into voice and non-voice tracks of the original, and keeping non-voice at original volume, but lowering the voice track.
euazOn · 2 years ago
I also noticed that the third sample with Chinese sounds slightly sped up in the first English segment, so there may be also an element of postprocessing the dub (speeding it up/slowing it down).
leobg · 2 years ago
Yes. Though I don't like this solution. It breaks the flow. And it also doesn't really fully solve the problem. Overruns still accumulate if they happen too frequently. One second here, one second there... the further you get into the video, the worse it gets.

I think it would be better to either slow down the underlying video or solve the overrun issue on the translation level. A good professional dubber will find translations that will even out in terms of timing. That's something an AI should be able to do better instead of worse.

odiroot · 2 years ago
The last sample from BBC is really hilarious when translated to Polish. Something definitely went wrong and the voice speaks like a drunkard.
scrollaway · 2 years ago
I know this is HN so I don't want to distract from the technical achievement and how genuinely useful this can be.

I also don't want to tell you how to raise your kid. You do you, it's not my family. But I want to share how important it is to watch foreign spoken language movies and TV, especially as a kid, to be able to speak multiple languages later in life. You'll notice that in every country where TV and movies are regularly dubbed in the local language, the English levels go to shit. Dubbing is partially responsible for this because kids are not exposed to a different language on a regular basis.

I remember wanting to watch a dubbed movie with my mom as a kid, and she told me "We will watch the original instead, dubbed movies don't have a soul". It stuck with me. She was absolutely right. Today I am working on my sixth spoken language. Causation not guaranteed, merely implied.

jeroenhd · 2 years ago
>We will watch the original instead, dubbed movies don't have a soul

I disagree. It's all about the quality of the voice actors and the effort put into localisation.

Having grown up on Dutch dubs of many cartoons, I honestly find the Dutch voice actor of Spongebob better than the original. I'm missing the extra energy that the Dutch VA seems to have put into the voice when I hear the original, even if the original is very good. Though text on screen isn't translated, puns and references are, sometimes overhauled completely.

The talent pool for Dutch voice actors isn't as big as I would've liked (you often hear the same five VAs in every show on a given channel), but some of them really put in the work. Many of them only do kids TV and commercials (really freaked me out to hear Ash Ketchum try to sell me soap one day) and not every VA is as good/paid enough/gets decent scripts, but there are some real gems to be found in dubs.

Last year I found out how Ukrainian dubs work and I was astounded by how weird the experience was. I'm used to dubs having only the voice track swapped out, but the Ukrainian shows seemed to just have the actors talk over the original show, like this AI tool does, and I honestly can't imagine ever getting into a show that's dubbed like that. I assume people get used to this, but I found it rather annoying.

Blanket statements like "dubs have no soul" serve nobody. There are good dubs and there are bad dubs, and the ratio will probably differ depending on the language you're talking about. Dismissing all dubs ignores the real heart and soul some dubbing teams have put into their works. That doesn't mean I disagree with the idea of exposing kids to more languages, but I wouldn't expect kids to learn much from just TV shows and movies in the first place.

jamager · 2 years ago
Voices in dubbed movies don't have any depth, for instance.

That doesn't have anything to do with the quality of the voice actors. Everything sounds flat because that is just how they record it.

Dubbing is a useful convenience, an accessibility feature (even if it wasn't born that way). But they have way less soul.

imp0cat · 2 years ago
I think the main point is that small kids get the basic building blocks for learning languages from anything they hear (even if they don't understand it yet), so listening to as many languages as possible when they are young will make learning languages easier for them later in their lives.
Freak_NL · 2 years ago
Mostly I agree with this, but for animated works dubs can be an integral part of the product when done right, and some are even tweaked for different languages (although I strongly reject adjusting the actual cultural content for different locales). The dubs have to be made in concert with the original though. There is also a lot of plain crap out there.

But absolutely; for anything featuring live action, dubs just damage the original.

I watch a German man building his massive Lego city on YouTube (narrated and recorded quite professionally) with my five year old son for a few minutes before bed. He is now at the point where he is trying to give this weird language (to him) a place in his head. Some words are familiar (being Dutch), some are foreign, and you can see the feedback loop happening when words do land; he wants to know what that man is saying. I don't expect him to pick up any German at this point, but the basics of immersion in another language are there.

scrollaway · 2 years ago
Yes, I agree with you. Actually, good-quality dubbed animated movies (= Disney) are what I often use to help learn a new language.
leobg · 2 years ago
FYI, I agree with you in all points.

As I said in another comment, I wouldn't want to live in a world where everything was dubbed into my language.

Any translation takes something away from the original. And dubbing even more so.

I also believe that being exposed to a foreign language long before you ever make a conscious attempt to learn it is important. I wouldn't think I'd succeed in teaching my toddler to say "Daddy" if he hadn't been listening to the rest of us speaking for many months before.

I can see how this headline can make me seem like a buffoon of a dad. But I think I'm really not. :) When I watch The Anatomy Lab with my daughter, that's a time when I want our conversation to focus on how digestion works. Not on what the guy on the screen was saying just now. But of course there will also be times where I'll want our conversation to be about exactly that: What a foreign speaker just said. How those words come together. How they may have the same root as the words we use in German. Also, while AI has its place, I prefer to have these conversations with her myself.

Baeocystin · 2 years ago
I can't say I agree with dubbed movies having no soul. Greater accessibility to a wider audience is not something to deride, or hold in contempt.

That being said, I do agree that listening to other languages is a great thing. My father was a linguist, and when we would watch subtitled media, we'd play a game where we'd try and hear the cognates, pick out the most common words, figure out the basics of the grammar as we went along. It was a lot of fun!

leobg · 2 years ago
One of my favorite movies was "Scent Of A Woman". But when I watched it in the German-dubbed version, I was appalled. It made the whole movie suddenly seem like a comedy. To me, the translation had killed its "soul", for lack of a better term.

I still want my kids to learn English. And ideally also one or two other languages, like Chinese.

As Nietzsche said:

"So you have mounted on horseback? And now ride briskly up to your goal? Well, my friend - but your lame foot is also riding with you!"

true_religion · 2 years ago
Counterpoint. My parents didn’t let me watch dubbed shows, and didn’t speak our native language because they wanted me to speak unaffected English.

I can’t speak any languages, but in school my English was insanely good. To the point of perfect scores in the college scholastic exams, and when I was in uni for engineering, I took on an English major for fun with essentially no impact on my work load.

You can generalize but you can also specialize.

bufferoverflow · 2 years ago
I speak Russian, and I gotta say, the Lex sample is incredible. It sounds like real dubbing. Maybe not pro-level dubbing, but it's very very good. The voices are also very close to Lex's and Elon's.

Congrats! Very well done.

leobg · 2 years ago
Thank you! Yeah, I also had a big grin on my face, hearing Lex and Elon suddenly talk in another language. :)

Consistency isn't perfect yet, as we're building the voice from scratch basically for each utterance. On the one hand, you want that, because the utterance might be more upbeat, or lower pitch, than the speaker's "average" voice. On the other hand, it sometimes introduces variance that makes the listener's brain go, "Uh.... is that another person speaking now?". If I had to dub 200 videos of a single YouTube channel, I would be able to fine-tune the voices of the main characters, and reserve the ad-hoc cloning for guest characters.

pavelboyko · 2 years ago
Please consider adding Simplified English as an output language option, preferably with a level, e.g., A2, B1, etc. This way, I can adjust the language complexity to my kids' level and then gradually remove the crutches as they improve in English.
leobg · 2 years ago
Yes! I love this!

So you'd be translating English to Simplified English? Or are you talking from another source language?

I've already been playing with this concept w.r.t. books:

I take a non-fiction book. I'll have an LLM translate it with a specific audience in mind (say, a 7 year old girl with a certain background), explaining concepts and words that are likely unknown to that audience. And then converting the whole thing into an audiobook. Optional parental controls built in ("exclude violence", etc.). Nowhere near showtime, though.

Another thing I'd love to work on is filtering existing content. There are millions of videos on YouTube. Right now, finding quality stuff that's fun to watch with my kid depends a lot on dumb luck. But what if I could filter by topic (semantic whitelist/blacklist, i.e. not keyword dependent), personality traits (OCEAN, MBTI), values (e.g. "curiosity") and language (reading level, vocabulary, words per minute, etc.)? I'd love that.

vidarh · 2 years ago
I'd love what they suggested as well, for other languages. I'm working on improving my French (and occasionally German), and I'm at a stage where I can follow along some French shows reasonably well if they're not speaking too fast (one of the first French phrases my French teacher in school taught us was "plus lentement, s'il vous plaît" - "slower/slowly please", for a reason), and if they're not speaking any particularly difficult accents, and not too much slang, but it's limiting and I'm often forced to keep English subtitles on as a consequence and it's sometimes too much of a crutch. It doesn't help that my hearing isn't what it was.

Being able to "step down" the difficulty so that I can either turn off subtitles entirely or rely on French subtitles, or even mix "difficult speech" and "simple subtitles" or vice versa seems like it'd be very useful in getting over that hump faster.

solardev · 2 years ago
This is really impressive! Can't wait to see this more fleshed out. I'd gladly pay for something like this (by the video, ideally).

Some page feedback though: It seems to me that the video just keeps playing, with no way to restart it or scrub through the timeline. Each time I click a language, it changes the spoken audio but just keeps playing where it left off. That makes it hard to compare the same passage across different languages.

Separately, I think there are also some errors in translation. For Sample 3 (about the vines), the original in Mandarin Chinese says something like "if this tree gets grabbed, the weed will climb up and wrap around it, and the tree won't be able to photosynthesize and will die". But the English mistranslation says "If it gets scared by people, it gets pulled off and messed with. It can't function. The evil effects? It just dies."

There are also timing issues where the translations don't match up with the original subtitles or dialogue, and certain parts of the original audio just seem to be altogether ignored and not translated.

Maybe displaying the translated subtitles, along with a way for users to report errors, would help...?

leobg · 2 years ago
Thank you very much!

Yes. You cannot control the video playback on the demo page. I made it so because I wanted a way to showcase how you can switch between languages. You can go from Elon speaking English to German, Russian and Chinese, each with just one click. Activating the player controls would have made the UI more complex and distracting. And it would have also made it harder for me to sync the timing between languages.

Of course, the real output would be a proper player, with all of the controls. Or, for creators, raw files (video and/or audio, plus SRT subtitles).

I also noticed problems in the translation of the Chinese video. I put it up there anyway, because I figured most people coming to my site would be English speakers, and being able to understand a Chinese video might be another interesting aspect, in addition to the idea of being able to turn your own English content into languages you don't speak.

If this had been a pitch deck, I would have cherry picked the samples. But I wanted to share where the project is right now and see if anyone was interested. Premature optimization is the root of all evil in programming. I think Knuth said that. And it's a trap I regularly fall into. So I tried to be disciplined this time.

But if any Chinese YouTuber asked me to dub their work today, I'd make darn sure that the translations were close to perfect. Meaning I'd allow the system to make changes to the way things are phrased if that's necessary for the purpose of timing or cultural context. But I wouldn't allow it to skip a thought from the original video, or say something different.

I've translated books by hand in the past. So this is something I care about. If the demo isn't perfect in this regard, it's because I didn't know if anyone was going to even look at my project. When I first posted this yesterday, my submissions didn't go beyond one comment for several hours. I already thought I had built another solution looking for a problem. :)

If you're seeing dropped phrases, that's most likely because my arranging function failed. Basically, the translation ran longer than the original. The algorithm tried to speed it up and fit it in. But it failed and dropped it. Better handling of these overruns is on my to-do list. Neither drops nor speedups should be tolerated.

In terms of self-correction, I plan to feed the translated audio back into the transcription engine. Then, an LLM can compare the translation with the original transcript. If anything is missing, the pipeline will be forced to run again with slightly different parameters. There shouldn't be a human necessary in the loop. Translation is what Transformers are best at.
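A sketch of what that self-correction loop might look like. Two things are swapped in here and are assumptions: the LLM comparison is replaced by a simple word-level `difflib.SequenceMatcher` ratio, and `render` is a hypothetical callable that runs the dubbing pipeline for a given attempt and returns the dubbed audio transcribed and translated back into the source language.

```python
from difflib import SequenceMatcher
from typing import Callable


def round_trip_score(original: str, back_translated: str) -> float:
    """Word-level similarity between 0.0 and 1.0 (stand-in for an LLM judge)."""
    a = original.lower().split()
    b = back_translated.lower().split()
    return SequenceMatcher(None, a, b).ratio()


def dub_until_good(
    original: str,
    render: Callable[[int], str],  # hypothetical: attempt -> back-translated transcript
    min_score: float = 0.8,        # assumed acceptance threshold
    max_attempts: int = 3,         # assumed retry budget
) -> tuple[int, float]:
    """Re-run the pipeline until the round trip scores well enough.

    Returns (best attempt index, its score).
    """
    best_attempt, best_score = -1, -1.0
    for attempt in range(max_attempts):
        score = round_trip_score(original, render(attempt))
        if score > best_score:
            best_attempt, best_score = attempt, score
        if score >= min_score:
            break
    return best_attempt, best_score
```

A dropped phrase shows up as a low score: comparing "the weed climbs up and the tree will die" with "the tree will die" gives roughly 0.62, below the assumed 0.8 threshold, so the loop would try again.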

solardev · 2 years ago
Gotcha, thanks for the great walk-through and in-depth explanations! Excited to see how this thing progresses.

I'd totally pay to have something like this as a Chrome plugin for YouTube, for example.

jeroenhd · 2 years ago
I've only ever experienced Dutch dubs in kids' TV but I feel like these examples show that your Dutch model may need some work. I can't judge other languages well, but I found the Taiwanese documentary dub especially hard to follow. I wouldn't have expected Dutch to be in there for how little the language is spoken and how often Dutch speakers will understand English, though!

/offtopic It seems to do a pretty interesting thing where the first male voice has a bit of a Flemish/southern accent while the second male voice has an accent much closer to "Netherlands TV" Dutch. Reminded me a bit of the Lion King dub where the dub studio used Flemish voice actors to do the jungle animals (and Dutch voice actors for the savannah animals) to underline the "different world" Simba arrived in.

leobg · 2 years ago
Yes, that issue is also present in the German translation.

I'm planning to monitor the output quality. Basically, feeding the translated audio back into the transcriber. Then compare it to the original transcript. Like a higher level loss function. I'll need this already because I don't speak all of these languages myself. But I can also use it to make the pipeline self-regulate and generate a new, better version if the last one scored too poorly.

jeroenhd · 2 years ago
Interesting, I can see how that approach would catch the weird voice lines.

Just the different ways the languages get picked up and processed by the AI system could be interesting. If you find anything cool, I'd love to read a blog post about it!

daremon · 2 years ago
This is really amazing! Well done.

I already joined the beta but I want to point out another use case here as well:

In many countries (e.g. Greece, where I'm from) movies and TV shows never get dubbed. We rely on subtitles. This means that if you can't see well (disability or age-related eye problems) and if your English is not excellent, then you are doomed to only watch locally produced movies & shows.

This can be a real life-changer.

leobg · 2 years ago
Thank you!

With movies, I think I could get into legally challenging territory. I guess all AI apps are, in a way. But with movies, there's an entire industry behind enforcing copyright. So I must tread carefully on that front.

I made the jump from the courtroom into VS Code years ago. I really don't want to go backwards.

daremon · 2 years ago
I honestly don't see how movies are different from any other content, e.g. YouTube videos. I am pretty sure MrBeast etc. have the same lawyers as any big studio.

Could this run locally? I would certainly pay for that and you're off the hook on how anyone uses it.