cyp0633 · 5 months ago
The same happens with whisper-large-v3 on Chinese transcription: silence is transcribed as something like "please upvote, share and favourite this video". I suspect they trained the model on random YouTube videos without carefully picking really useful data.
ttflee · 5 months ago
In Chinese, it always added something like "For study/research purposes only. Please delete after 48 hours." This is what volunteer subtitlers add to subtitles of (pirated) movies/shows.
codedokode · 5 months ago
Fair point: if AI companies are allowed to download pirated content for "learning", why can't ordinary people?
kgeist · 5 months ago
Interesting, in Russian, it often ends with "Subtitles by %some_username%"
cyp0633 · 5 months ago
That is not the case here - I never encountered this with whisper-large-v3 or similar ASR models. Part of the reason, I guess, is that those subs are burnt into the movie, which makes them hard to extract. Standalone subs need the corresponding video to align audio and text. So nothing beats YouTube videos, which are already aligned.
isoprophlex · 5 months ago
Indeed, with another model I would persistently get silent parts transcribed as 'Thanks for watching!' or '[MUSIC]'. Pretty dumb that this failure mode wasn't caught in some QA process, and that multiple transcription models now suffer from the same issue. Silent parts in your input audio seem like they should be a very common occurrence...
rollcat · 5 months ago
When I was taught mathematics, the zero value was always considered the most important edge case. You prove something for N=0 (or N=1), then for N=M+1.

It's even more important in audio DSP: processing near-zeroes can end up being extremely CPU intensive, look up denormal/subnormal floats.

wahnfrieden · 5 months ago
whisper MUST be combined with silence detection / VAD
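A minimal sketch of that combination, assuming the faster-whisper package (its vad_filter option runs a silero-vad pass and drops non-speech audio before decoding); the model size and VAD parameters here are illustrative, not a recommendation:

    from faster_whisper import WhisperModel

    model = WhisperModel("large-v3")
    segments, info = model.transcribe(
        "input.wav",
        vad_filter=True,  # run silero-vad first and skip stretches with no detected speech
        vad_parameters={"min_silence_duration_ms": 500},  # illustrative threshold
    )
    for seg in segments:
        print(f"[{seg.start:.2f} -> {seg.end:.2f}] {seg.text}")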
xigoi · 5 months ago
Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?
madcaptenor · 5 months ago
I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.
indrora · 5 months ago
When YouTube began building automatic transcriptions for captions, it regularly flagged any noise or music -- typically industrial noise -- with "[foreign]"

If it couldn't understand it, it was "foreign" for the longest time.

the_af · 5 months ago
Hey, Netflix occasionally still puts in its English subtitles "[foreign music]", it always cracks me up.
stndef · 5 months ago
Yeah, I can confirm seeing that a fair bit specifically during non-verbal parts of videos when someone is using a tool.
st_goliath · 5 months ago
That's interesting: the few times I tried playing with whisper, I had the impression that YouTube-style videos or random cellphone videos were something it did particularly badly with (compared to movies). My guess at the time was that most of the training material might be subtitles and raw screenplays.

The videos I tried to transcribe were also Mandarin Chinese, using whisper-large-v3. Besides the usual complaints that it would phonetically "mishear" things and generate nonsense, it was still surprisingly good, compared to other software I played around with.

That said, it would often invent names for the speakers and prefix their lines, or randomly switch between simplified and traditional Chinese. For the videos I tested, intermittent silence would often result in repeating the last line several times, or occasionally, it would insert direction cues (in English for some reason). I've never seen credits or anything like that.

In one video I transcribed, somebody had a cold and was sniffling. Whisper decided the person was crying (transcribed as "* crying *"), and a cough was turned into "* door closing *". It then transcribed the next line as something quite unfriendly. It stopped doing that after I cut the sniffling out (but then the output switched back to traditional Chinese again).

mmcwilliams · 5 months ago
Similar in the English model. Pretty clear they trained on YouTube videos where creators will put that in otherwise silent sections to ensure it shows up for people with CC on.
probably_wrong · 5 months ago
The number one hallucination in my transcriptions was "Subtitles by the Amara.org community".
philipwhiuk · 5 months ago
> I suspect they trained the model on some random YouTube video without carefully picking really useful data.

They trained the model on every YouTube video they could, and hoped the aggregate was useful data.

PhasmaFelis · 5 months ago
This reminds me: some years ago, as Google was expanding its translation service, someone tried translating text into and out of an obscure African language (I don't recall which), and it always came out as weird Biblical-sounding semi-gibberish.

My revelation was that machine translation needs a corpus of bilingual documents to learn from, and if the language is sufficiently obscure, there may not be any bilingual documents except for the Bible, which missionaries have translated into just about every language on Earth.

danirod · 5 months ago
This is totally happening with other models too, at least with Spanish. Many transcriptions will end with something that roughly translates to "Thanks for watching!" even if it's never present in the original audio.

horseradish7k · 5 months ago
oh yeah this happens a lot on reddit on videos in foreign languages
tonyhart7 · 5 months ago
lmao
dlcarrier · 5 months ago
Classic overfitting

It's the LLM equivalent of thinking that an out-of-office reply is the translation: https://www.theguardian.com/theguardian/2008/nov/01/5

stingraycharles · 5 months ago
How is this overfitting, rather than a data quality / classification issue?
bGl2YW5j · 5 months ago
If the model were able to generalise, you’d expect it to output something like “[silence]” or “…” in response to silence.

Instead, it reverted to what it had seen before (in the training data), hence the overfit.

hsn915 · 5 months ago
The Arabic text is the translator's self-credit

"Translated by Nancy Qanfar"

mort96 · 5 months ago
Isn't overfitting just when the model picks up on an unintended pattern in the training data? Isn't that precisely what this is?
maxbond · 5 months ago
It is a data quality issue which caused the model to overfit.
RamblingCTO · 5 months ago
As I didn't see one correct definition of overfitting:

Overfitting means the model is too closely aligned to the training data, has picked up noise, and does not generalize well to *new, unseen* data. Think of students who learn to reproduce questions and their answers for a test instead of learning concepts and transferring knowledge to new questions that involve the same concepts.

While this sounds like overfitting, I'd just say it's garbage in, garbage out; wrong classification. The training data is shit and didn't have (enough) correct examples to learn from.

sivers · 5 months ago
to save you a lookup:

The Arabic text "رجمة نانسي قنقر" translates to English as: "Nancy Qanqar's translation" or "Translation by Nancy Qanqar"

"رجمة" means "translation" and "نانسي قنقر" is the name "Nancy Qanqar"

mormegil · 5 months ago
In Czech, Whisper usually transcribes music as "Titulky vytvořil JohnyX" ("subtitles made by JohnyX") for the same reason.
actionfromafar · 5 months ago
Haha, trained on torrented movies! :-D

The MPA must be so proud.

aprilthird2021 · 5 months ago
And it seems to be because the training data is largely unofficial subtitles from movies, which often have a string like "Translated by X" at the end, where the audio is usually silent while the credits roll.
rob74 · 5 months ago
Looks like they used more official sources for German - there, silence is apparently hallucinated as "Untertitelung des ZDF für funk, 2017" ("Subtitling by ZDF for funk, 2017"), according to one of the comments on the issue. Which makes sense, as the public broadcasters' "Mediathek" is probably the largest freely available resource of subtitled videos in Germany. I wonder if the ZDF gave its approval for it being used for LLM training, though?
4gotunameagain · 5 months ago
I'm sure they totally did not pirate the audio of said movies.

iqfareez · 5 months ago
makes sense...
beshrkayali · 5 months ago
You've got a little typo, it's not "رجمة", it's "ترجمة" that means translation, the ت at the beginning is missing.
nottorp · 5 months ago
Title should be changed to "OpenAI publishes evidence they trained on pirated movies".
pjc50 · 5 months ago
Of course. Piracy is legal when you have a bigger pile of money than the studios.
codedokode · 5 months ago
Let's not forget that some real pirates (corsairs, for example) were also legal and carried out legitimate piracy against the ships of foreign countries.
onlyrealcuzzo · 5 months ago
Isn't piracy legal in many parts of the world?

Legally, why wouldn't they be able to do the piracy parts in one of those jurisdictions and then ship the outputs back to the mothership?

foogazi · 5 months ago
Too big to nail
berkes · 5 months ago
How is this evidence of that fact? Honest question.

I can see how this might show that subtitles from online sub communities are used, or that maybe even original subtitles from e.g. DVDs are used. But isn't it already known and admitted (and allowed?) that AI uses all sorts of copyrighted material to train models?

0points · 5 months ago
> I can see how this might show that subtitles from online sub communities are used, or that maybe even original subtitles from e.g. DVDs are used.

Indeed, the captioning is copyrighted work and you are not legally allowed to copy and redistribute it.

> But isn't it already known and admitted (and allowed?)

No, and I don't see where you got that from. Meta [1], OpenAI [2] and everybody else is being sued as we speak.

1: https://petapixel.com/2025/01/10/lawsuit-alleges-mark-zucker...

2: https://www.reuters.com/legal/litigation/openai-hit-with-new...

nemomarx · 5 months ago
The Chinese subtitles for silence use a common marker for pirated media in that language, according to other commenters here. In general, it's pretty likely that non-professional subtitles were distributed with pirated media in some form; that's where you get the most fansubs, after all.
jcranmer · 5 months ago
> How is this evidence of that fact?

The contention is that the specific translated text comes largely from illegal translations (i.e., fansubs) and not from authorized translations. And from a legal perspective, that would basically mean there's no way they could have legally appropriated that material.

> But isn't it already known and admitted (and allowed?) that AI uses all sorts of copyrighted material to train models?

Technically, everything is copyrighted. But your question is really about permission. Some of the known corpuses for AI training include known pirate materials (e.g., libgen), but it's not known whether or not the AI companies are filtering those materials out of training. There's a large clutch of cases ongoing right now about whether AI training is fair use, and the ones that have resolved at this point have done so on technical grounds rather than answering the question at stake.

Hnrobert42 · 5 months ago
HN is pretty strict about not editorializing titles. Even if your statement were unequivocally correct, the post would get flagged.

dandiep · 5 months ago
Whisper is unusable IMO because of the hallucinations. Widely documented. Removing silence from audio clips helps, but even then it will auto-correct grammar, translate bilingual speech, etc. Improved in the latest audio models, but not solved [1]

1. https://news.ycombinator.com/item?id=43427376

ilyakaminsky · 5 months ago
I wouldn't describe it as "unusable" so much as needing to understand its constraints and how to work around them. I built a business on top of Whisper [1] and one of the early key insights was to implement a good voice activity detection (VAD) model in order to reduce Whisper's hallucinations on silence.

[1] https://speechischeap.com

poly2it · 5 months ago
How does this make a profit? Whisper should cost $0.006 to $0.010 per minute, but your rate is less than $0.001? Do you 10x the audio?
eric-burel · 5 months ago
That's the problem with raw large models: they should always be coupled with satellite small models and logic. It's (probably) easier to detect hallucinations using a traditional ML/DL model that catches mismatches (it's easy to build a synthetic dataset for this) than to transcribe. And the simplest piece of code can detect a silence and enforce that it matches no text.
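For that last point, a minimal sketch of such a silence check, assuming 16 kHz mono float32 audio as returned by whisper.load_audio; the RMS threshold is an illustrative value that would need tuning per source:

    import numpy as np
    import whisper

    def is_silent(audio: np.ndarray, rms_threshold: float = 0.01) -> bool:
        # Treat the clip as silence if its RMS energy falls below the threshold.
        return float(np.sqrt(np.mean(np.square(audio)))) < rms_threshold

    model = whisper.load_model("base")
    audio = whisper.load_audio("input.wav")
    # Enforce "silence should match no text" instead of letting the model hallucinate.
    text = "" if is_silent(audio) else model.transcribe(audio)["text"]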
horseradish7k · 5 months ago
well, auto-correcting grammar happens in normal subtitles too... "Why don't subtitles match dubbing?" by Tom Scott: https://youtu.be/pU9sHwNKc2c
Hobadee · 5 months ago
Little did you all know, this is just being mechanical turked by Nancy Qunqar.

Way to go Nancy! Keep up the good work, ya crazy bastard!

whamlastxmas · 5 months ago
Is this spam? That name only shows as an instagram account and this thread. If you pay for insta followers is this how they get them now? Haha
DAlperin · 5 months ago
That’s the name in the Arabic text hallucinated by the model :)
haiku2077 · 5 months ago
I've noticed this also happens in english Whisper models with the phrases:

"[ sub by sk cn2 ]"

or

"Anyways, thanks for watching! Please subscribe and like! Thanks for watching! Bye!"

or

"This is the end of the video. Thank you for watching. If you enjoyed this video, please subscribe to the channel. Thank you."

OSDeveloper · 5 months ago
Because they train on pirated media and/or YouTube videos. Good method, until you get slop or get caught.
flexagoon · 5 months ago
In Russian it often hallucinates "Субтитры сделал DimaTorzok" ("Subtitles by DimaTorzok") at the end of things. Interestingly, I wasn't able to find any YouTube videos with that name in the subtitles, so it's not like it's in a lot of training data.
codedokode · 5 months ago
I tried googling this and found Telegram users asking why voice message recognition sometimes produces this phrase and who this person is. I also found this thread [1] claiming that the "subtitles by DimaTorzok" come from some Russian gaming YouTube videos like [2].

[1] https://github.com/openai/whisper/discussions/2372

[2] https://www.youtube.com/watch?v=FAqyUuahMlc&t=401s

flexagoon · 5 months ago
Yeah, I know about this from Telegram, because they use Whisper for voice message recognition. There are a bunch of other artifacts it often produces.
berkes · 5 months ago
Could it be someone distributing subs online, e.g. showing up in the opensubtitles.org dataset?
voidUpdate · 5 months ago
Or possibly someone subtitling pirated movies? That seems to be a common thing according to other comments