I love watching YouTube with my 7-year-old daughter. Unfortunately, the best stuff is often in English (we're German). So I made an AI tool that translates videos directly, using the original voices. All other sounds, as well as background music, are preserved, too.
Turns out that it works for many other language pairs, too. So far, it can create dubs in English, Mandarin Chinese, Spanish, Arabic, French, Russian, German, Italian, Korean, Polish and Dutch.
The main challenge in building this was striking the balance between preserving the original meaning and getting the timing right. Especially for language pairs like English -> German, where the target is often longer than the source ("bat" -> "Fle-der-maus", "speed" -> "Ge-schwin-dig-keit").
Let me know what you think! :)
https://haonowshaokao.com/2013/05/18/does-dubbing-tv-harm-la...
Edit: I forgot to mention that the samples on the website are impressive and well made. How do you do the speaker diarization and voice cloning?
Subtitles - those are actually being generated as well. I've generated SRT files during development. Color coded by speaker, and on a per-word basis, for me to get the timing right.
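A per-word, speaker-colored SRT along those lines could be emitted roughly like this (a minimal sketch; the tuple format, color palette, and helper names are my assumptions, not the actual Speakz.ai code):

```python
# Minimal sketch: emit per-word SRT cues, color-coded by speaker.
# The word-tuple layout and colors are illustrative assumptions.

def fmt(t: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    h, rem = divmod(t, 3600)
    m, s = divmod(rem, 60)
    return f"{int(h):02}:{int(m):02}:{int(s):02},{int(round((s % 1) * 1000)):03}"

COLORS = {0: "#ffcc00", 1: "#66ccff"}  # one color per speaker id

def to_srt(words):
    """words: list of (start_sec, end_sec, speaker_id, text) tuples."""
    lines = []
    for i, (start, end, spk, text) in enumerate(words, 1):
        color = COLORS.get(spk, "#ffffff")
        lines.append(f"{i}\n{fmt(start)} --> {fmt(end)}\n"
                     f'<font color="{color}">{text}</font>\n')
    return "\n".join(lines)

print(to_srt([(0.0, 0.4, 0, "Hello"), (0.4, 0.9, 1, "Hi")]))
```

Most players that accept SRT at all will at least honor the timing; the `<font color>` tags are an extension that some players ignore.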
Basically, if you have a YouTube channel, you can take any video from your channel, run it through Speakz.ai, and you'll get 15+ additional audio tracks in different languages, plus 15+ subtitle files (SRT).
Voice cloning and speaker diarization were a bit of a challenge. On the one hand, I want to duplicate the voice that is being spoken right now. On the other hand, sometimes "right now" is just a short "Yeah" (like in the Elon interview) which doesn't give you a lot of "meat" to work with in terms of embedding the voice's characteristics.
Right now, I'm using a mix of signals:
- Is the utterance padded by pauses before/after?
- Is the utterance a complete sentence?
- Does the voice of the utterance sound significantly different from the voice of the previous utterance?
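A toy sketch of how such signals could be combined (the weights, threshold, and embedding-distance input are invented for illustration, not the production logic):

```python
# Toy sketch: combine heuristic signals into a "new speaker?" decision.
# Weights and threshold are invented for illustration only.

def speaker_changed(pause_before: float, pause_after: float,
                    is_complete_sentence: bool,
                    embedding_distance: float) -> bool:
    """Decide whether the current utterance starts a new speaker turn.

    embedding_distance: cosine distance between the voice embedding of
    this utterance and the previous one (0 = identical-sounding voice).
    """
    score = 0.0
    if pause_before > 0.5 and pause_after > 0.5:  # padded by pauses
        score += 1.0
    if is_complete_sentence:                       # full sentence
        score += 0.5
    score += 2.0 * embedding_distance              # voice sounds different
    return score >= 1.5

# A short mid-sentence "Yeah" in a similar voice stays with the speaker:
speaker_changed(0.1, 0.1, False, 0.2)   # False
# A pause-padded sentence in a clearly different voice flips the decision:
speaker_changed(0.8, 0.8, True, 0.6)    # True
```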
It's a deep, deep rabbit hole. I was tempted to go much deeper. But I thought I better check if anybody besides myself actually cares before I do that... :)
I love watching movies in the original language, but this is something I hate as well, though it can be avoided.
Some movies get it right, though. The timing, subtitles limited to just the words that are spoken, and even different colors for different speakers (very rare; I cannot even remember where I have seen it). That should be standard, but with most movies you can count yourself lucky if the subs even match the plot and do not reveal too much.
The thing that annoys me the most about it is that it often alters the feel of the material. E.g. I watched Valiant (2005) with my son in Norwegian first, because he got it on DVD from his grandparents. He doesn't understand much Norwegian, but when he first got the DVD he was so little that it didn't matter. A few years later we watched the English language version.
It comes across as much darker in the English version. The voice acting is much more somber than the relatively cheerful way the Norwegian dub was done, and while it's still a comedy, in comparison it feels like the Norwegian version obscures a lot of the tension, and it makes it feel almost like a different movie.
I guess that could go both ways, but it does often feel like the people dubbing something are likely to have less time and opportunity to get direction on how to play the part, and you can often hear the consequences.
Source: father of an 8yo with VERY good reading skills (already reading books in 2 languages targeted at tweens)
As others have said, it is better to expose kids (that can read) to the original language plus subtitles.
So in other words, your solution, while technically great, is pedagogically not wise. A typical geek approach to a problem ;)
No, they wouldn't.
I don't believe that most Swedes learn English by reading subtitles before starting school.
> I think there's a pretty strong correlation between countries' average English proficiency and how common dubbing is.
That I agree with.
It's not about learning the language per se, it's about familiarizing yourself with the sound of the language, which then makes formal learning feel much more intuitive. English becomes an easy subject because you always feel a little ahead of the material. When faced with "fill in the blank" type questions, you're able to answer them by what feels right, even when you can't quite explain why it feels right.
It's why the #1 rule of language learning at any stage in life is always gonna be immersing yourself in the language you want to learn, and by far the most effective way to immerse yourself (excluding moving to another country) is to consume content in your target language.
Why wouldn't Swedes?
She has learned to understand what is said in the cartoons. Of course she misses some things, but it's surprising how much she gets.
Like, when I ask her "what did Bluey just say?", she can explain it.
Children's brains are awesome.
But actually, grown-ups can also pick up quite a lot if they actually immerse themselves.
> No, they wouldn't.
hard disagree
I gotta say... while sometimes it is a necessary evil, I would so rather not have to read subtitles. I often want to listen to a show so that I can also continue working on catching up on email, etc., i.e. I can't read two things at once, but I can listen to one thing and continue working on something else.
1) doing voice recognition with voice time clues, which Whisper and the like provide, breaking it up into sentence (or similar) units; you don't need to time-match individual words, but you need to time-match at a coarser grain.
2) using a translation engine that allows for multiple alternative translations
3) cloning the original voice, regardless of language
4) choosing the translation that has the best time match (possibly by syllable counting, or by actually rendering and timing the translations). If there isn't a close translation, maybe you're asking ChatGPT to forcibly rephrase?
5) Maybe some modest pitch-corrected rate control to pick out path that gets you closest to the timing?
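The timing-match idea in step 4 could be sketched like this (counting vowel groups as syllables is a rough, assumed heuristic, and the speaking rate is made up; actually rendering each candidate with TTS and measuring it would be more accurate):

```python
import re

# Sketch: pick the alternative translation whose estimated spoken
# duration best fits the original utterance's time slot.

SECONDS_PER_SYLLABLE = 0.25  # assumed average speaking rate

def syllables(text: str) -> int:
    """Approximate syllable count via vowel groups (rough heuristic)."""
    return max(1, len(re.findall(r"[aeiouyäöü]+", text.lower())))

def best_fit(candidates, slot_seconds):
    """Return the candidate whose estimated duration is closest to the slot."""
    return min(candidates,
               key=lambda c: abs(syllables(c) * SECONDS_PER_SYLLABLE
                                 - slot_seconds))

# "Tempo" (2 syllables) fits a short slot better than "Geschwindigkeit" (4+):
best_fit(["Geschwindigkeit", "Tempo"], 0.5)
```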
Did I get any of that right?
Yes, that's basically how it works.
I don't do any pitch-correction. But I do check the TTS output for length, and I re-generate if it doesn't match my time constraints.
I also have an arranger that tries to figure out when to play an utterance early (i.e. earlier than in the original) in order to make up for the translated version being longer.
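A rough sketch of how regenerate-on-overrun and early-start arranging could fit together (the function names, attempt limit, and tolerance are illustrative assumptions, not the actual implementation):

```python
# Sketch: fit a translated utterance into its time slot by
# re-generating on overrun, then letting an "arranger" borrow
# leading silence to start the utterance early.
# tts() is a stand-in for a real TTS call; all limits are invented.

MAX_ATTEMPTS = 3
TOLERANCE = 0.15  # seconds of acceptable overrun

def fit_utterance(text, slot_start, slot_end, silence_before, tts):
    """Return (start_time, audio), or None if the utterance can't be fitted.

    tts(text) -> (duration_seconds, audio); assumed non-deterministic,
    so re-generating can yield a shorter take.
    """
    slot = slot_end - slot_start
    for _ in range(MAX_ATTEMPTS):
        duration, audio = tts(text)
        overrun = duration - slot
        if overrun <= TOLERANCE:
            return slot_start, audio            # fits the original slot
        if overrun <= silence_before:
            return slot_start - overrun, audio  # start early, end on time
    return None  # caller must rephrase, time-stretch, or flag the segment
```

A real pipeline would then fall back to rephrasing the translation or mild time-stretching rather than dropping the segment.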
I try to make the translations match the speaker's character, as well as the context. So ideally, Alex (Sample 2) will still say "Salut" even in German (instead of translating that greeting, too).
And I need to monitor for speaker changes. This is because I can't clone the voice unless I have a decent amount of sample data. If Elon just says "Yes", cloning the voice based on just that one syllable will make it sound like a robot. But I also can't just blindly grab any voice around it, since that might be somebody else's voice.
Interesting, but what inference engine supports it to run at decent speeds?
I think it would be better to either slow down the underlying video or solve the overrun issue on the translation level. A good professional dubber will find translations that will even out in terms of timing. That's something an AI should be able to do better instead of worse.
I also don't want to tell you how to raise your kid. You do you, it's not my family. But I want to share how important it is to watch foreign spoken language movies and TV, especially as a kid, to be able to speak multiple languages later in life. You'll notice that in every country where TV and movies are regularly dubbed in the local language, the English levels go to shit. Dubbing is partially responsible for this because kids are not exposed to a different language on a regular basis.
I remember wanting to watch a dubbed movie with my mom as a kid, and she told me "We will watch the original instead, dubbed movies don't have a soul". It stuck with me. She was absolutely right. Today I am working on my sixth spoken language. Causation not guaranteed, merely implied.
I disagree. It's all about the quality of the voice actors and the effort put into localisation.
Having grown up on Dutch dubs of many cartoons, I honestly find the Dutch voice actor of Spongebob better than the original. I'm missing the extra energy that the Dutch VA seems to have put into the voice when I hear the original, even if the original is very good. Though text on screen isn't translated, puns and references are, sometimes overhauled completely.
The talent pool for Dutch voice actors isn't as big as I would've liked (you often hear the same five VAs in every show on a given channel), but some of them really put in the work. Many of them only do kids TV and commercials (really freaked me out to hear Ash Ketchum try to sell me soap one day) and not every VA is as good/paid enough/gets decent scripts, but there are some real gems to be found in dubs.
Last year I found out how Ukrainian dubs work and I was astounded by how weird the experience was. I'm used to dubs having only the voice track swapped out, but the Ukrainian shows seemed to just have the actors talk over the original show, like this AI tool does, and I honestly can't imagine ever getting into a show that's dubbed like that. I assume people get used to this, but I found it rather annoying.
Blanket statements like "dubs have no soul" serve nobody. There are good dubs and there are bad dubs, and the ratio will probably differ depending on the language you're talking about. Dismissing all dubs ignores the real heart and soul some dubbing teams have put into their works. That doesn't mean I disagree with the idea of exposing kids to more languages, but I wouldn't expect kids to learn much from just TV shows and movies in the first place.
That doesn't have anything to do with the quality of the voice actors. Everything sounds flat because that is just how they record it.
Dubbing is a useful convenience, an accessibility feature (even if it wasn't born that way). But they have way less soul.
But absolutely; for anything featuring live action, dubs just damage the original.
I watch a German man building his massive Lego city on Youtube (narrated and recorded quite professionally) with my five year old son for a few minutes before bed. He is now at the point where he is trying to give this weird language (to him) a place in his head. Some words are familiar (being Dutch), some are foreign, and you can see the feedback loop happening when words do land; he wants to know what that man is saying. I don't expect him to pick up any German at this point, but the basics of immersion in another language are there.
As I said in another comment, I wouldn't want to live in a world where everything was dubbed into my language.
Any translation takes something away from the original. And dubbing even more so.
I also believe that being exposed to a foreign language long before you ever make a conscious attempt to learn it is important. I wouldn't think I'd succeed in teaching my toddler to say "Daddy" if he hadn't been listening to the rest of us speaking for many months before.
I can see how this headline can make me seem like a buffoon of a dad. But I think I'm really not. :) When I watch The Anatomy Lab with my daughter, that's a time when I want our conversation to focus on how digestion works. Not on what the guy on the screen was saying just now. But of course there will also be times where I'll want our conversation to be about exactly that: What a foreign speaker just said. How those words come together. How they may have the same root as the words we use in German. Also, while AI has its place, I prefer to have these conversations with her myself.
That being said, I do agree that listening to other languages is a great thing. My father was a linguist, and when we would watch subtitled media, we'd play a game where we'd try and hear the cognates, pick out the most common words, figure out the basics of the grammar as we went along. It was a lot of fun!
I still want my kids to learn English. And ideally also one or two other languages, like Chinese.
As Nietzsche said:
"So you have mounted on horseback? And now ride briskly up to your goal? Well, my friend - but your lame foot is also riding with you!"
I can’t speak any languages, but in school my English was insanely good. To the point of perfect scores in the college scholastic exams, and when I was in uni for engineering, I took on an English major for fun with essentially no impact on my work load.
You can generalize but you can also specialize.
Congrats! Very well done.
Consistency isn't perfect yet, as we're building the voice from scratch basically for each utterance. On the one hand, you want that, because the utterance might be more upbeat, or lower pitch, than the speaker's "average" voice. On the other hand, it sometimes introduces variance that makes the listener's brain go, "Uh.... is that another person speaking now?". If I had to dub 200 videos of a single YouTube channel, I would be able to fine-tune the voices of the main characters, and reserve the ad-hoc cloning for guest characters.
So you'd be translating English to Simplified English? Or are you talking from another source language?
I've already been playing with this concept w.r.t. books:
I take a non-fiction book. I'll have an LLM translate it with a specific audience in mind (say, a 7 year old girl with a certain background), explaining concepts and words that are likely unknown to that audience. And then converting the whole thing into an audiobook. Optional parental controls built in ("exclude violence", etc.). Nowhere near showtime, though.
Another thing I'd love to work on is filtering existing content. There are millions of videos on YouTube. Right now, finding quality stuff that's fun to watch with my kid depends a lot on dumb luck. But what if I could filter by topic (semantic whitelist/blacklist, i.e. not keyword dependent), personality traits (OCEAN, MBTI), values (e.g. "curiosity") and language (reading level, vocabulary, words per minute, etc.)? I'd love that.
Being able to "step down" the difficulty so that I can either turn off subtitles entirely or rely on French subtitles, or even mix "difficult speech" and "simple subtitles" or vice versa, seems like it'd be very useful in getting over that hump faster.
Some page feedback though: It seems to me that the video just keeps playing, with no way to restart it or scrub through the timeline. Each time I click a language, it changes the spoken audio but just keeps playing where it left off. That makes it hard to compare the same passage across different languages.
Separately, I think there are also some errors in translation. For Sample 3 (about the vines), the original in Mandarin Chinese says something like "if this tree gets grabbed, the weed will climb up and wrap around it, and the tree won't be able to photosynthesize and will die". But the English mistranslation says "If it gets scared by people, it gets pulled off and messed with. It can't function. The evil effects? It just dies."
There are also timing issues where the translations don't match up with the original subtitles or dialogue, and certain parts of the original audio just seem to be altogether ignored and not translated.
Maybe displaying the translated subtitles, along with a way for users to report errors, would help...?
Yes. You cannot control the video playback on the demo page. I made it so because I wanted a way to showcase how you can switch between languages. You can go from Elon speaking English to German, Russian and Chinese, each with just one click. Activating the player controls would have made the UI more complex and distracting. And it would have also made it harder for me to sync the timing between languages.
Of course, the real output would be a proper player, with all of the controls. Or, for creators, raw files (video and/or audio, plus SRT subtitles).
I also noticed problems in the translation of the Chinese video. I put it up there anyway, because I figured most people coming to my site would be English speakers, and being able to understand a Chinese video might be another interesting aspect, in addition to the idea of being able to turn your own English content into languages you don't speak.
If this had been a pitch deck, I would have cherry-picked the samples. But I wanted to share where the project is right now and see if anyone was interested. "Premature optimization is the root of all evil." I think Knuth said that. And it's a trap I regularly fall into. So I tried to be disciplined this time.
But if any Chinese YouTuber would ask me to dub their work today, I'd make darn sure that the translations were close to perfect. Meaning I'd allow the system to make changes to the way things are phrased if that's necessary for the purpose of timing or cultural context. But I wouldn't allow it to skip a thought from the original video, or say something different.
I've translated books by hand in the past. So this is something I care about. If the demo isn't perfect in this regard, it's because I didn't know if anyone was going to even look at my project. When I first posted this yesterday, my submission didn't get beyond one comment for several hours. I already thought I had built another solution looking for a problem. :)
If you're seeing dropped phrases, that's most likely because my arranging function failed. Basically, the translation ran longer than the original. The algorithm tried to speed it up and fit it in. But it failed and dropped it. Better handling of these overruns is on my to-do list. Neither drops nor speedups should be tolerated.
In terms of self-correction, I plan to feed the translated audio back into the transcription engine. Then, an LLM can compare the translation with the original transcript. If anything is missing, the pipeline will be forced to run again with slightly different parameters. There shouldn't need to be a human in the loop. Translation is what Transformers are best at.
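That planned loop could be structured roughly like this (`dub`, `transcribe`, and `llm_compare` are placeholders for the real pipeline stages; the retry logic, parameter nudge, and threshold are my sketch, not the actual design):

```python
# Sketch of the planned self-correction loop: dub, transcribe the dub,
# have an LLM compare it to the original transcript, retry if needed.
# All three stage functions are placeholders, not real APIs.

MAX_RETRIES = 3

def dub_with_verification(video, dub, transcribe, llm_compare):
    """dub(video, params) -> dubbed audio
    transcribe(audio) -> transcript of that audio
    llm_compare(original, back) -> score in [0, 1], 1 = nothing missing
    """
    params = {"temperature": 0.3}          # illustrative pipeline knob
    original = transcribe(video.audio)
    best_score, best_audio = -1.0, None
    for _ in range(MAX_RETRIES):
        audio = dub(video, params)
        score = llm_compare(original, transcribe(audio))
        if score > best_score:
            best_score, best_audio = score, audio
        if score >= 0.95:                  # good enough, stop early
            break
        params["temperature"] += 0.2       # nudge parameters and retry
    return best_audio, best_score
```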
I'd totally pay to have something like this as a Chrome plugin for YouTube, for example.
/offtopic It seems to do a pretty interesting thing where the first male voice has a bit of a Flemish/southern accent while the second male voice has an accent much closer to "Netherlands TV" Dutch. Reminded me a bit of the Lion King dub where the dub studio used Flemish voice actors to do the jungle animals (and Dutch voice actors for the savannah animals) to underline the "different world" Simba arrived in.
I'm planning to monitor the output quality. Basically, feeding the translated audio back into the transcriber. Then compare it to the original transcript. Like a higher level loss function. I'll need this already because I don't speak all of these languages myself. But I can also use it to make the pipeline self-regulate and generate a new, better version if the last one scored too poorly.
Just the different ways the languages get picked up and processed by the AI system could be interesting. If you find anything cool, I'd love to read a blog post about it!
I already joined the beta but I want to point out another use case here as well:
In many countries (e.g. Greece, where I'm from) movies and TV shows never get dubbed. We rely on subtitles. This means that if you can't see well (disability or age-related eye problems) and if your English is not excellent, then you are doomed to only watch locally produced movies & shows.
This can be a real life-changer.
With movies, I think I could get into legally challenging territory. I guess all AI apps are, in a way. But with movies, there's an entire industry behind enforcing copyright. So I must tread carefully on that front.
I made the jump from the courtroom into VS Code years ago. I really don't want to go backwards.
Could this run locally? I would certainly pay for that and you're off the hook on how anyone uses it.