In transcribing a talk by Andrej, you already picked the most challenging case possible, speed-wise. His natural talking speed is already >=1.5x that of a normal human. He's one of the people for whom you absolutely have to set your YouTube speed back down to 1x to follow what's going on.
In the spirit of making more of an OpenAI minute, don't send it any silence.
E.g. running the file through ffmpeg's silenceremove filter will cut the talk down from 39m31s to 31m34s, by replacing every silence longer than 20ms (at a -50dB threshold) with a 20ms pause. And to keep with the spirit of your post, I only measured that the input file got shorter; I didn't look at all at the quality of the transcription produced from the shorter version.
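A sketch of such an invocation, reconstructed from the parameters above (untested, and silenceremove's exact semantics vary between ffmpeg versions, so treat the numbers as a starting point):

    # Trim silence anywhere in the stream (stop_periods=-1): anything quieter
    # than -50dB for longer than 20ms is cut, keeping a 20ms pause in its place.
    ffmpeg -i talk.mp3 -af "silenceremove=stop_periods=-1:stop_duration=0.02:stop_threshold=-50dB:stop_silence=0.02" talk-trimmed.mp3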
From my own experience with whisper.cpp, normalizing the audio and removing silence not only shortens the processing time significantly, but also improves the quality of the transcription a lot, since silence can mean hallucinations. You can do that graphically with Audacity too, if you don't want to deal with the command line. You also don't need any special hardware to run whisper.cpp; with the small model, literally any computer should be able to do it if you can wait a bit (less than the audio length).
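For the command-line route, a minimal sketch of that preprocessing plus a whisper.cpp run (filenames, model path, and the silence threshold are illustrative; newer whisper.cpp builds name the binary whisper-cli rather than main):

    # Normalize loudness, strip long silences, and resample to the 16 kHz mono WAV whisper.cpp expects.
    ffmpeg -i meeting.mp3 -af "loudnorm,silenceremove=stop_periods=-1:stop_duration=0.5:stop_threshold=-50dB" -ar 16000 -ac 1 meeting.wav
    # Transcribe with the small model; -otxt writes the text next to the input file.
    ./main -m models/ggml-small.bin -f meeting.wav -otxt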
One half-interesting, half-depressing observation I made: at my workplace, any meeting recording I tried to transcribe this way had its length reduced to almost 2/3 of the original after cutting out the silence. Makes you think about the efficiency (or lack thereof) of holding long(ish) meetings.
Others pointed out the value of silence, but I just wanted to say it saddens me when humanity is misclassified as inefficiency. The other day Sam Altman made a jest about how much energy is wasted by people saying "thanks" to ChatGPT. The corollary is how much human energy is "wasted" on humans saying thanks to each other. When making a judgement about inefficiency, one is making a judgement about what is valuable, a very biased judgement that isn't necessarily aligned with what makes us thrive. =) (<-- a wasteful smiley)
1/3 of the meeting is silence? That’s a good thing. It’s allowing people time to think over what they’re hearing, there are pauses to allow people to contribute or participate. What do you think a better percentage of silent time would be?
If a human meeting had a lot of silence (assuming it's between words and not before/after), I would consider it a very efficient meeting, where just enough information was exchanged with adequate absorption, processing, and response time.
Andrej's talk seemed normal to listen at 2x but I've also listened to everything at 2x for a long time.
Unfortunately, a byproduct of listening to everything at 2x is that I've had a number of folks say they have to watch my videos at 0.75x, but even when I play back my own videos it feels painfully slow unless it's 2x.
For reference I've always found John Carmack's pacing perfect / natural and watchable at 2x too.
A recent video of mine is https://www.youtube.com/watch?v=pL-qft1ykek. It was posted on HN by someone else the other day so I'm not trying to do any self promotion here, it's just an example of a recent video I put up and am generally curious if anyone finds that too fast or it's normal. It's a regular unscripted video where I have a rough idea of what I want to cover and then turn on the mic, start recording and let it pan out organically. If I had to guess I'd say the last ~250-300 videos were recorded this way.
To me you talk at what I would consider "1.2x" of podcast speed (which to me is a decent average measure of spoken-word speed - I usually do 1.5x on all podcasts). You're definitely still in the normal distribution for tech YouTubers, in my experience - in fact it feels like a lot of tech YouTubers talk like they've had a bit too much Adderall, but you don't come off that way. Naturally people may choose to slow down tutorials, because the person giving the tutorial can never truly know what someone learning would or wouldn't understand. So overall I think your speed is totally fine! Also, very timely video, I was interested in the exact topic, so I'm happy I found this.
Your actual speed of talking sounds a little faster than average but not notably so.
But it feels (very subjectively) faster to me than usual because you don't really seem to take any pauses. It's like the whole video is a single run-on sentence that I keep buffering, but I never get a chance to process it and flush the buffer.
Your video sounded a tad fast at 2x and pretty fine at 1.5.
Now I think speed adjustment comes less from the natural speaking pace of the person than from the subject matter.
I'm thinking of a channel like Accented Cinema (https://youtu.be/hfruMPONaYg), with a slowish talking pace, but as there's a visual component going on at all times, it doesn't actually feel slow to my ear.
I felt the same about videos explaining concepts I have no familiarity with, so I see it as a question of how fast the brain can process the info, less than the talking speed per se.
> but even when I play back my own videos it feels painfully slow unless it's 2x.
Watching your video at 1x still feels too slow, and it's just right for me at 2x speed (that's approximately how fast I normally talk if others don't tell me to slow down), although my usual YouTube watching speed is closer to 2.5-3x. That is to say, you're still faster than a lot of others.
I think it just takes practice --- I started at around 1.25x for videos, and slowly moved up from there. As you have noticed, once you've consumed enough sped-up content, your own speaking speed will also naturally increase.
Your speaking speed is noticeably faster than usual, but I think it's good for this kind of video. When the content is really dense and every word is chosen for maximum information value, a slower speed would be good, but for relatively natural speech with a normal amount of redundancy I think it's fine to go at this speed.
> Andrej's talk seemed normal to listen at 2x but I've also listened to everything at 2x for a long time.
We get used to higher speeds when we consume a lot of content that way. Have you heard the systems used by experienced blind people? I cannot even understand the words in them, but months of training would probably fix that.
> His natural talking speed is already >=1.5x that of a normal human. He's one of the people for whom you absolutely have to set your YouTube speed back down to 1x to follow what's going on.
I wonder if there's a way to automatically detect how "fast" a person talks in an audio file. I know it's subjective and different people talk at different paces within a recording, but it'd be cool to know when OP's trick fails (they mention 4x ruined the output; maybe for Karpathy that would happen at 2x).
> I wonder if there's a way to automatically detect how "fast" a person talks in an audio file.
Stupid heuristic: take a segment of video, transcribe text, count number of words per utterance duration. If you need speaker diarization, handle speaker utterance durations independently. You can further slice, such as syllable count, etc.
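As a rough sketch of that heuristic with off-the-shelf tools (untested; filenames are placeholders, and splitting on spaces is exactly as stupid as promised):

    # Transcribe locally, emitting segment-level timestamps as JSON.
    whisper talk.mp3 --model base --output_format json --output_dir .
    # Total words divided by total utterance duration = words per second of actual speech.
    jq '[.segments[] | {w: (.text | split(" ") | length), d: (.end - .start)}]
        | (map(.w) | add) / (map(.d) | add)' talk.json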
Even a last-decade transcription model could be used to detect a rough number of syllables per unit time, and the accuracy of that model could be used to guide speed-up and dead-time detection before sending to a more expensive model. As with all things, it's a question of whether the cost savings justify the engineering work.
It's a shame platforms don't generally support speeds greater than 2x. One of my "superpowers" or a curse is that I cannot stand normal speaking pace. When I watch lectures, I always go for maximum speed and that still is too slow for me.
I wish platforms included 4x, but done properly (with minimal artefacts).
In the meantime I realized that the apad part is nonsensical: it pads the end of the stream, not each silence-removed cut. I wanted to get angry at o3 for proposing this, but then I had a look at the silenceremove documentation myself: https://ffmpeg.org/ffmpeg-filters.html#silenceremove
Good god. You couldn't make that any more convoluted and hard-to-grasp if you wanted to. You gotta love ffmpeg!
I wish there was a 2.25x YouTube option for "normal" humans. I already use every shortcut, and listen at 2x 90% of the time. But Andrej I can't take faster than 1.25x
YouTube ran an experiment with up to 4x playback on mobile (???), but it went away in February. I get that a lot of the experiments they do are experiments, but why simply letting the slider go farther is such a back-and-forth hoopla is beyond me. It's one of the oft-touted features of third-party apps and extensions, with nearly zero UI impact on those who don't want it (just don't slide the slider past 2x if you don't want past 2x).
The interesting thing here is that OpenAI likely has a layer that trims down videos exactly how you suggest, so they can still charge by the full length while costing less for them to actually process the content.
Gemini charges by tokens rather than minutes. I used VAD to trim silence, hoping the token count would go down. I noticed the token count wasn't much different (e.g. 30 seconds of background noise had the same count as 2 seconds of background noise). Either the Gemini API trims silence under the hood, or tokenization depends on speech content rather than length. Not sure which.
In either case, I bet OpenAI is doing the same optimization under the hood and keeping the savings for themselves.
I've heard of people doing this for podcasts and audiobooks and never understood it all that much there. Just feels like 'skimming' a real book instead of actually reading it.
Often, I'll come across speakers who just speak slowly and listening at 1.5x or 2x barely feels sped-up.
Additionally, the brain tends to adjust to a faster talking speed very quickly. If I'm watching an average-paced person talk and speed them up by 2x, the first couple minutes of listening might be difficult and will require more intent-listening. However, the brain starts processing it as the new normal and it does not feel sped-up anymore. To the extent that if I go back to 1x, it feels like the speaker is way too slow.
That's completely different. Imagine you are reading a book and the words only get revealed to you at 1 word a second. You would get annoyed if your natural reading speed was higher than that.
Same with a video. A lot of people speak considerably slower than you could process the information they are conveying, so you speed it up. You still get the same content and are not skipping parts as you would when skimming a book.
> Just feels like 'skimming' a real book instead of actually reading it.
That's the goal for me lately. I primarily use YouTube for technical assistance (where are the screws to adjust this carburetor? how do I remove this brake hub? etc.). There used to be short 1-2 minute videos on this kind of stuff, but nowadays I have to suffer through a 10-15 minute video with multiple ad breaks.
So now I always watch YouTube at 2x speed while rapidly jumping the slider forward to find the relevant portions.
Some people talk slower than your natural listening speed. It's less like skimming and more like if some books used 36pt font and you normalized the size back down to a comfortable information-dense size.
That's an amusing perspective. I really struggle with watching any video at double speed, but I've never had trouble listening to any of his talks at 1x. To me, he seems to speak at a perfectly reasonable pace.
A point on skimming vs taking the time to read something properly.
I read a transcript + summary of that exact talk. I thought it was fine, but uninteresting, I moved on.
Later I saw it had been put on youtube and I was on the train, so I watched the whole thing at normal speed. I had a huge number of different ideas, thoughts and decisions, sparked by watching the whole thing.
This happens to me in other areas too. Watching a conference talk in person is far more useful to me than watching it online with other distractions. Watching it online is more useful again than reading a summary.
Going for a walk to think about something deeply beats a 10 minute session to "solve" the problem and forget it.
Slower is usually better for thinking.
Seriously this is bonkers to me. I, like many hackers, hated school because they just threw one-size-fits-all knowledge at you and here we are, paying for the privilege to have that in every facet of our lives.
Reading is a pleasure. Watching a lecture or a talk and feeling the pieces fall into place is great. Having your brain work out the meaning of things is surely something that defines us as a species. We're willingly heading for such stupidity, I don't get it. I don't get how we can all be so blind at what this is going to create.
> I, like many hackers, hated school because they just threw one-size-fits-all knowledge at you
"This specific knowledge format doesnt work for me, so I'm asking OpenAI to convert this knowledge into a format that is easier for me to digest" is exactly what this is about.
I'm not quite sure what you're upset about? Unless you're referring to "one size fits all knowledge" as simplified topics, so you can tackle things at a surface level? I love having surface level knowledge about a LOT of things. I certainly don't have time to have go deep on every topic out there. But if this is a topic I find I am interested in, the full talk is still available.
Breadth and depth are both important, and well summarized talks are important for breadth, but not helpful at all for depth, and that's ok.
University didn't agree with me mostly because I can't pay attention to the average lecturer. Getting bored in between words or while waiting for them to write means I absorbed very little and had to teach myself nearly everything.
Audiobooks before speed tools were the worst (were they trying to speak extra slowly?). But when I can speed things up, comprehension is just fine.
> I, like many hackers, hated school because they just threw one-size-fits-all knowledge at you and here we are, paying for the privilege to have that in every facet of our lives.
But now we get to browse the knowledge rather than having it thrown at us. That's more important than the quality or formatting of the content.
For what it's worth, I completely agree with you, for all the reasons you're saying. With talks in particular I think it's seldom about the raw content and ideas presented and more about the ancillary ideas they provoke and inspire, like you're describing.
There is just so much content out there. And context is everything. If the person sharing it had led with some specific ideas or thoughts I might have taken the time to watch and looked for those ideas. But in the context it was received—a quick link with no additional context—I really just wanted the "gist" to know what I was even potentially responding to.
In this case, for me, it was worth it. I can go back and decide if I want to watch it. Your comment has intrigued me so I very well might!
Depends on what you're listening to. If it's a recap of something and you're just looking for the answer to "what happened?", that can be fine for 2x. If you're getting into the "why?" maybe slower is better. Or if there are a lot of players involved.
I'm trying to imagine listening to War and Peace faster. On the one hand, there are a lot of threads and people to keep track of (I had a notepad of who is who). On the other hand, having the stories compressed in time might help remember what was going on with a character when finally returning to them.
Listening to something like Dune quickly, someone might come out thinking only of the main political thrusts and the action, without building the same world in their mind that they would if they read it slower.
Not to discount slower speeds for thinking but I wonder if there is also value in dipping into a talk or a subject and then revisiting (re-watching) with the time to ponder on the thoughts a little more deeply.
This is similar to strategies in "How to Read a Book" (Adler).
Understanding the outline and themes of a book (or lecture, I suppose) makes it easier to piece together thoughts as you delve deeper into the full content.
Was it the speed, or the additional information carried by the audio and video? If someone is a compelling speaker, the same message will be far more effective in an audiovisual format. The audio carries emphasis on certain parts of the content, for example, which is missing from the transcript or summary entirely. Video has gestural and facial cues, also often used to make a point.
I was trying to summarize a 40-minute talk with OpenAI’s transcription API, but it was too long. So I sped it up with ffmpeg to fit within the 25-minute cap. It worked quite well (up to 3x speed) and was cheaper and faster, so I wrote about it.
Felt like a fun trick worth sharing. There’s a full script and cost breakdown. [1]
[1] https://speechischeap.com
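The speed-up step itself is a one-liner; a minimal sketch (filenames are illustrative; atempo only accepts factors above 2.0 per instance on reasonably recent ffmpeg builds, older ones need a chain):

    # 3x the audio; speech-to-text models tolerate this surprisingly well.
    ffmpeg -i talk.mp3 -filter:a "atempo=3.0" -b:a 64k talk-3x.mp3
    # Equivalent on older ffmpeg, where atempo is capped at 2.0 per instance:
    ffmpeg -i talk.mp3 -filter:a "atempo=2.0,atempo=1.5" -b:a 64k talk-3x.mp3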
> I don’t know—I didn’t watch it, lol. That was the whole point. And if that answer makes you uncomfortable, buckle-up for this future we're hurtling toward. Boy, howdy.
This is a great bit of work, and the author accurately summarizes my discomfort.
That's why I find the idea of training on breaking news from Reddit or Twitter funny: wild exaggerations and targeted spin are the sort of stuff that does best on those sites and generates the most comments, so 50% of the output would be lies.
As if human-generated transcriptions of audio ever came with guarantees of accuracy?
This kind of transformation has always come with flaws, and I think that will continue to be expected implicitly. Far more worrying is the public's trust in _interpretations_ and claims of _fact_ produced by gen AI services, or at least the popular idea that "AI" is more trustworthy/unbiased than humans, journalists, experts, etc.
There was a similar trick which worked with Gemini versions prior to Gemini 2.0: they charged a flat rate of 258 tokens for an image, and it turns out you could fit more than 258 tokens of text in an image of text and use that for a discount!
I built a Chrome extension with one feature that transcribes audio to text in the browser using huggingface/transformers.js, running the OpenAI Whisper model with WebGPU. It works perfectly! Here is a list of examples of all the things you can do in the browser with WebGPU for free. [0]
The last thing in the world I want to do is listen to or watch presidential social media posts, but, on the other hand, sometimes enormously stupid things are said which move the S&P 500 up or down $60 in a session. So this feature queries for new posts every minute, does OCR image-to-text and transcribes video audio to text locally, sends the post with text for analysis, all in the background inside a Chrome extension, before notifying me of anything economically significant.
[0] https://github.com/huggingface/transformers.js/tree/main/exa...
[1] https://github.com/adam-s/doomberg-terminal
Groq is ~$0.02/hr with distil-large-v3, or ~$0.04/hr with whisper-large-v3-turbo. I believe OpenAI comes out to like ~$0.36/hr.
We do this internally with our tool that automatically transcribes local government council meetings right when they get uploaded to YouTube. It uses Groq by default, but I also added support for Replicate and Deepgram as backups because sometimes Groq errors out.
[0] https://groq.com/pricing/
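For reference, Groq's transcription endpoint is OpenAI-compatible, so the call looks roughly like this (a sketch; check current model names against their docs):

    curl https://api.groq.com/openai/v1/audio/transcriptions \
      -H "Authorization: Bearer $GROQ_API_KEY" \
      -F file=@meeting.mp3 \
      -F model=whisper-large-v3-turbo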
If you have a recent MacBook, you can run the same Whisper model locally for free. People are really sleeping on how cheap the compute they already own is.
I don't. I have a MacBook Pro from 2019 with an Intel chip and 16 GB of memory. Pretty sure when I tried the large whisper model it took like 30 minutes to an hour to do something that took hardly any time via Groq. It's been a while though so maybe my times are off.
Interesting! At $0.02 to $0.04 an hour I don't suspect you've been hunting for optimizations, but I wonder if this "speed up the audio" trick would save you even more.
> We do this internally with our tool that automatically transcribes local government council meetings right when they get uploaded to YouTube
Doesn't YouTube do this for you automatically these days within a day or so?
Oh yeah, we do a check first and use youtube-transcript-api if there's an automatic one available: https://github.com/jdepoix/youtube-transcript-api
The tool usually detects them within like ~5 mins of being uploaded though, so usually none are available yet. Then it'll send the summaries to our internal Slack channel for our editors, in case there's anything interesting to 'follow up on' from the meeting.
Probably would be a good idea to add a delay to it and wait for the automatic ones though :)
You could use Hugging Face's Inference API (which supports all of these providers) directly, making it easier to switch between them; e.g. look at the panel on the right at: https://huggingface.co/openai/whisper-large-v3
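If I remember the serverless API correctly, it accepts raw audio bytes, something along these lines (a sketch; the newer provider-routing variant may use a different base URL):

    curl https://api-inference.huggingface.co/models/openai/whisper-large-v3 \
      -H "Authorization: Bearer $HF_TOKEN" \
      -H "Content-Type: audio/flac" \
      --data-binary @talk.flac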
Let me know if you are interested in a more reliable transcription API. I'm building Lemonfox.ai and we've optimized our transcription API to be highly available and very fast for large files. Happy to give you a discount (email: bruno at lemonfox.ai)
I am a blue collar electrician. Not a coder (but definitely geeky).
Whisper works quite well on Apple Silicon with a simple drag/drop install (i.e. no terminal commands). The program is free; you can get an M4 mini for ~$550; I don't see how an online platform can even compete with this, except for one-off customers (i.e. not great repeat customers).
We used it to transcribe ddaayyss of audio microcassettes which my mother had made during her lifetime. Whisper.app even transcribed a few hours that are difficult to comprehend as a human listener. It is VERY fast.
I've used the text to search for timestamps worth listening to, skipping most dead-space (e.g. she made most while driving, in a stream of not-always-focused consciousness).
I came here to ask the same question. This is a well-solved problem; red-queen racing it seems utterly pointless, a symptom of reflexive adversarialism.
> I only measured that the input file got shorter; I didn't look at all at the quality of the transcription
guys how hard is it to toss both versions into like diffchecker or something haha, you're just comparing text
https://en.m.wikipedia.org/wiki/James_Goodnight
I have watched one or two videos of his, and he spoke slowly, compared to the average person. I liked that. It sounded good.
https://en.wikipedia.org/wiki/Spreading_(debate)
> I wonder if there's a way to automatically detect how "fast" a person talks in an audio file.
Transcribe it locally using whisper and output tokens/sec?
Hilbert transform and FFT to get phoneme rate would work.
I now think this might be a good solution:
https://www.theverge.com/news/603581/youtube-premium-experim...
I listen to a lot of videos on 3 or even 4x.
Is it common for people to watch YouTube sped up?
There is too much information. People are trying to optimize breadth over depth, but obviously there are costs to this.
Your doomerism and superiority don't follow from your initial "I, like many hackers, don't like one-size-fits-all".
This is literally offering you MANY sizes, and you have the freedom to choose. Somehow you're pretending it's pushed-down uniformity.
Consume it however you want, and come up with actual criticisms next time?
++ to "Slower is usually better for thinking"
Yeah, I see people talking about listening to podcasts or audiobooks on 2x or 3x.
Sometimes I set mine to 0.8x. I find you get time to absorb and think. Am I an outlier?
A newspaper is essentially just an inaccurate summary of what really happened, so I don't find this realization that uncomfortable.
At this point you'll need to at least check how much running ffmpeg costs. Probably less than $0.01 per hour of audio (approximate savings) but still.
Last time I checked, I think the Google auto-captions were noticeably worse quality than whisper, but maybe that has changed.
https://developers.cloudflare.com/workers-ai/models/whisper-...
With faster-whisper (int8, batch=8) you can transcribe 13 minutes of audio in 51 seconds on CPU.
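A minimal sketch of that setup, assuming the faster-whisper Python package (model and filename are illustrative):

    pip install faster-whisper
    python - <<'EOF'
    # int8 quantization on CPU with batched inference, matching the numbers quoted above.
    from faster_whisper import WhisperModel, BatchedInferencePipeline

    model = WhisperModel("large-v3", device="cpu", compute_type="int8")
    batched = BatchedInferencePipeline(model=model)
    segments, info = batched.transcribe("talk.mp3", batch_size=8)
    print(" ".join(s.text.strip() for s in segments))
    EOF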
Is there a definition for this expression? I don't follow.
> ... using corporate technology for the solved problem is a symptom of self-directed skepticism by the user against the corporate institutions ...
Eh?