Readit News
Posted by u/notjoemartinez 2 years ago
Show HN: YouTube Full Text Search – Search all of a channel from the commandline
github.com/NotJoeMartinez...
yt-fts is a simple Python script that uses yt-dlp to scrape all of a YouTube channel's subtitles and load them into an SQLite database that is searchable from the command line. It allows you to query a channel for a specific keyword or phrase and will generate timestamped YouTube URLs to the videos containing the keyword.
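A rough sketch of the idea described above, not the project's actual code: the table name `subtitles`, its columns, the helper names, and the video ID are all assumptions made for illustration. It stores subtitle lines in SQLite, searches them with a plain `LIKE`, and turns each hit into a `youtu.be` link that starts at the matching timestamp.

```python
import sqlite3


def to_seconds(timestamp: str) -> int:
    """Convert an 'HH:MM:SS.mmm' subtitle timestamp to whole seconds."""
    hours, minutes, seconds = timestamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + int(float(seconds))


def build_db(rows):
    """Load (video_id, start_time, text) rows into an in-memory database."""
    db = sqlite3.connect(":memory:")
    # Hypothetical schema: one row per subtitle line.
    db.execute(
        "CREATE TABLE subtitles (video_id TEXT, start_time TEXT, text TEXT)"
    )
    db.executemany("INSERT INTO subtitles VALUES (?, ?, ?)", rows)
    return db


def search(db, phrase):
    """Return (timestamped_url, text) pairs for lines containing the phrase."""
    cursor = db.execute(
        "SELECT video_id, start_time, text FROM subtitles "
        "WHERE text LIKE ? ORDER BY video_id, start_time",
        (f"%{phrase}%",),
    )
    return [
        (f"https://youtu.be/{video_id}?t={to_seconds(start)}", text)
        for video_id, start, text in cursor
    ]


if __name__ == "__main__":
    # VIDEO_ID_1 is a placeholder, not a real video.
    db = build_db([
        ("VIDEO_ID_1", "00:00:05.000", "welcome back to the channel"),
        ("VIDEO_ID_1", "00:22:28.000", "a distribution that is radially symmetric"),
    ])
    for url, text in search(db, "radially"):
        print(url, text)
```

The `?t=` parameter takes an offset in seconds, so the only real work is converting the subtitle timestamp.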
derefr · 2 years ago
I love that a third party is stepping up here, but it's incomprehensible to me that Google doesn't do this themselves. They're a search company, and they own YouTube. The YouTube data — including the subtitle files — is already sitting there on their servers; they don't have to scrape it, they just have to index it. What are they even doing?

Fun thing to try: do a Google search with "site:youtube.com" in it. You get basically nothing, no matter what keywords you use. It seems that Google actually entirely ignores/excludes YouTube from their regular HTML indexing, and instead only relies on the YouTube backend to actively push content into (a special, separate part of) the search index. Which gets you "results from YouTube" and "video search" — but doesn't get you the ability to search youtube videos pages qua web pages. (Consider: you can find a post in a Reddit comment thread on Google. Can you find a post in a YouTube video comments section on Google?)

Heck, when I first heard about YouTube's autogenerated captions, my first thought was "oh, so this is Google building deep indexing of video through audio transcription, because they can't trust externally-provided subtitles, right?" But it's been 10 years, and I couldn't have been more wrong.

hysan · 2 years ago
I would posit that google has determined that it’s more profitable to keep viewers on YouTube via controlling the viewing experience vs whatever additional ad revenue they’d gain by making videos more easily indexable. I’m basing this hypothesis on YouTube’s consistent trend of removing features related to controlling your own viewing experience. For example, removing subscription collections.
HardlyCurious · 2 years ago
Actually, letting people search video text would enable less watching and that is probably the reason they aren't interested in it.
derefr · 2 years ago
How would making videos more indexible move people off of YouTube? Once you're there, you'd stay there, with all the current recommender algorithms still in effect; all the external indexability would do is give Google (and Bing, and everyone else) more reasons to lead you into that labyrinth.
roncesvalles · 2 years ago
Whenever I notice an "obvious" potential feature that could improve user experience in a Google product (there are MANY), I automatically assume it's because of one of two things:

a. Doing it won't get anybody a promo.

b. They've considered it and determined that it will lose revenue to a degree that is not justified by the usability improvement.

This is one of the pitfalls of having an ad-based product instead of a fee-based product. User experience is just no longer the top priority.

reaperducer · 2 years ago
> I would posit that google has determined that it’s more profitable to keep viewers on YouTube via controlling the viewing experience vs whatever additional ad revenue they’d gain by making videos more easily indexable

This is exactly it. It's the same business logic that is employed in the regular Google Search.

Google doesn't make money from giving you correct search results. It makes money by keeping you searching for the results you want.

hackernewds · 2 years ago
I despise YouTube purely for inflicting mandatory Shorts on the user.
j33zusjuice · 2 years ago
I’m willing to bet that the features you’ve seen removed are taken out because their utilization is low, and there’s some associated maintenance cost. I’d also like to know what other features were removed. I had never heard of subscription collections, for example, and a search returned [this video](https://youtu.be/qGSHPhR8k8g) (I wish markdown worked on here) that says collections were a test feature (I’m guessing it didn’t do well enough to make it to prod).
StrangeATractor · 2 years ago
Have you tried searching through your call history on a Google phone? It's awful. You'd think they'd have a solid search built in but it's nearly useless. Especially considering you're usually searching for that number you only dialed once and isn't in your contacts and is for some reason excluded from your recents, so you go into your call history (strangely hidden behind a menu) and...there's no search function. WTF? You'd think you could filter by area code, date or a general time which the call took place, Google is a search company after all, right? Or am I subtly being nudged to use some of their more profitable products to try and find it again?

FOSS dialer recommendations are welcome btw.

zo1 · 2 years ago
As much as I like to generally, I wouldn't blame Google for this, rather I blame the entire field of "UX".
bob_theslob646 · 2 years ago
I have this same exact problem. I thought I was crazy. The call history on Android/Pixel is absolutely terrible.
whitemary · 2 years ago
It works as well as Windows search, no better
gniv · 2 years ago
Beyond conspiracy theories it's interesting to speculate why Google is not providing native search-in-subtitles and search-in-comments. The easiest explanation is that they don't trust the quality. They probably tried it and reached the conclusion that it doesn't improve search in any meaningful direction.

I know from experience that search in user reviews is very hard. Unless you really understand the review (which was tried via sentiment analysis) you cannot rank results well. But now with the new LLM models I think it would work better.

ramraj07 · 2 years ago
I’m pretty sure google uses the captions in search. As long as you search from within YouTube. I regularly search for keywords and find hourlong videos where it’s mentioned somewhere in the middle and nowhere in the description.
davidy123 · 2 years ago
I think Google is now thoroughly infected with Big-Company-itis. A couple departments would like to and know how to comprehensively use AI across many services, which the consumers would love (though some would be confused by it). But legal, marketing, and some guy in a department called "Annex B?" are preventing it. So then the people in those departments get bored and go somewhere else and their perspectives and skills are lost.
yosito · 2 years ago
At one point Google claimed their mission was to "organize the world's information and make it accessible". That's proven to be just as much of a joke as "Do no evil".
MildlySerious · 2 years ago
For all I care, they lost that claim with [Deleted video] - and by that I don't mean that they remove videos, for which there are countless valid reasons, but that there is no way to see what it was you liked, and that the lists you curate just deteriorate. There are many other, maybe more valid reasons they fail this mission, but that's the one that has been plaguing me the longest.
zip1 · 2 years ago
It's quite intriguing that Google doesn't offer full-text search capabilities for YouTube, considering its position as a leading search company. However, I think there are several reasons for this, some of which may not be immediately apparent.

Firstly, if Google did offer this feature, it would likely be targeted by Search Engine Optimization (SEO) exploits. In essence, any time a new search parameter is introduced, there's a risk of it being manipulated to prioritize certain content—especially by those interested in gaming the system for increased visibility or monetary gain. If YouTube's search feature were to be plagued by such spamming, it could severely degrade the user experience and lead to Google having to strip it away. While not a guarantee, it's a probable outcome given the history of SEO misuse.

Secondly, YouTube's primary focus is on its recommendation algorithm rather than search functionality. With billions of videos hosted, the key goal is to keep users engaged by serving up content they're likely to enjoy, thereby increasing view times and ad revenue. The search feature, while useful, is not as integral to this objective. Further, offering full-text search could provide yet another avenue to manipulate the algorithm, which YouTube surely wants to avoid.

Finally, implementing and maintaining such a feature would require substantial resources. It would necessitate hiring teams of high-salaried employees to moderate and ensure fair use of the feature, adding considerable operational costs. Considering these factors, it seems that Google has made a strategic decision to avoid this feature for now.

That said, the fact that third-party solutions are emerging, such as the one shared here, shows that there's a demand for full-text search capabilities. It also underscores the potential that these solutions have when unencumbered by the constraints faced by a tech giant like Google. This provides a fascinating insight into the dynamic relationship between third-party developers and tech corporations and the way they can complement each other.

dingledork69 · 2 years ago
> With billions of videos hosted, the key goal is to keep users engaged by serving up content they're likely to enjoy, thereby increasing view times and ad revenue

Maybe for some users. I just use youtube to find a specific video I need (because people have stopped writing useful how-to's now that they can just make a 10 minute video covering about 1 minute worth of text), and a full text search would be so, so useful.

2020aj · 2 years ago
Regarding your second point... I think it's still important because recommendation algorithms work better when users can find content they enjoy outside of the recommended content. If they can't then the recommendations will become stale.
crazygringo · 2 years ago
Google already does this themselves. If you search for rare words (e.g. try "indubitably") it will absolutely pull up videos that have the word in the auto-generated transcript and nowhere else (not in descriptions, not in comments).

Also, using "site:youtube.com" on Google works perfectly for me. If I look up "site:youtube.com david letterman" it gives me the David Letterman channel, followed by a seemingly infinite number of Letterman clips. Precisely what I'd expect.

The only thing I can reproduce that you're complaining about is that Google (and YouTube) search don't seem to index YouTube user comments, in contrast to Reddit. But Google doesn't seem to index comments-attached-to-content anywhere on the internet -- not even comments on articles at mainstream publications like the New York Times. Which is probably more of a feature than a bug -- comments on both YouTube videos and news articles tend to be a lot of emotional reactions and repeated opinions which aren't worth searching at all. In contrast, many (not all) Reddit threads are often very informative and the "main content", so it makes sense Google indexes them.

So I don't really see anything to complain about here, from my perspective.

harshreality · 2 years ago
What I'd expect is that

"having a distribution that's both radially symmetric" site:youtube.com

would return 3b1b's "Why π is in the normal distribution" video, which has that in subtitles at 22:28.

Even without the site: term, all I get is an allreadable.com page that's scraped the subtitles for that video. Allreadable appears at first glance to be a site owned by someone in China and hosted on liquidweb.

dizhn · 2 years ago
Google doesn't even do a lot simpler things like searching by language or location. And the search is garbage. I am trying to learn Italian so that's what I am interested in, but even when I enter a search term spelled correctly with its accents and everything I get anything from Brazilian Portuguese to French. They do a very helpful translation of the term and return results that are unfortunately useless to me. (I would have loved to speak every language but I don't.)
Jakob · 2 years ago
The default behaviour for multiple languages is bad, but the settings for both region and search results language work well, in case you haven’t seen the settings page yet.
gniv · 2 years ago
On web search, you can append &lr=lang_it to the URL. Maybe make it into a Chrome extension.
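A small sketch of that URL tweak, using only the standard library. The function name and the example query are made up for illustration, and whether the search engine honours `lr` for a given query is not something this code can guarantee.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def restrict_language(search_url: str, lang_code: str) -> str:
    """Return the URL with an lr=lang_<code> parameter added or replaced."""
    parts = urlparse(search_url)
    # Drop any existing lr parameter so it is not duplicated.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "lr"
    ]
    query.append(("lr", f"lang_{lang_code}"))
    return urlunparse(parts._replace(query=urlencode(query)))


if __name__ == "__main__":
    print(restrict_language("https://www.google.com/search?q=ricetta+carbonara", "it"))
```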
pmoriarty · 2 years ago
YouTube doesn't have to be good. It just needs to do the minimum it needs to keep users from switching to a different platform... which, because of the network effect of so many people and videos being on it, is not much at all.

If they had serious competition, they'd have to do more to keep users, but no such competition exists.

lettergram · 2 years ago
There’s Twitch, and I think Rumble is going to give it a run for its money.

I grant Rumble is only 1% the size of YouTube by viewership, but I think that’ll shift fairly rapidly and we can see 10% on rumble in 3 years.

My analysis https://austingwalters.com/an-analysis-on-rumble-nasdaq-rum/

stingraycharles · 2 years ago
There’s a company founded by a friend of mine called MediaDistillery, who are doing awesome stuff in this area. Real-time searchable massive video archives, with contextual understanding (e.g. “a WW2 fragment with a Jewish mother holding her baby”). Super useful for so many purposes.

And then there’s YouTube where you can’t even search subtitles. It makes me shake my head. Google seems to be at the forefront of AI, but doesn’t seem to be able to turn all that expertise into relevant products. Maybe the recent disruptions will shake them awake?

methou · 2 years ago
I mustn't be searching it right, I can only get this company: https://mediadistillery.com/ , the feature you mentioned is not accessible, and has that Ad-tech smell.
sharess · 2 years ago
Social media companies are in a constant tug-of-war against the end users when it comes to controlling what the users see. The ideal is that the user has absolutely no control over on what they see and the social media company can fully dictate content. That is what makes the money.

Allowing users to freely query content in their own websites is completely antithetical to what they are trying to do. YouTube is also very aggressive in preventing scraping and limiting the usage of the official API. Which is quite ironic considering the history of the company.

flenserboy · 2 years ago
Google: appeared open to start things off, then went whole hog on the MS embrace-and-extend philosophy, aiming to crush the life out of the entire web.
Pazzaz · 2 years ago
I think you're exaggerating a little when you say "site:youtube.com" doesn't work. If I search 'site:youtube.com apple watch' I get 143 000 000 results, and if I search something more specific like 'site:youtube.com "Featuring Dr James Grime"' I'll find exactly what I'm looking for. But you're correct that it doesn't seem to search video comments, only titles and descriptions.
derefr · 2 years ago
The problem I ran into is that, like I said, YouTube doesn't seem to get indexed by hypertext inward-edges from other sites like a regular website does — and so you can't search by how you recall a video being described in pages that link to it; instead, you have to remember how the video describes itself. Which it may not always do well.
dredmorbius · 2 years ago
Some of us recall that when Google+ launched it lacked any search whatsoever.

That this was the case with a company whose name is synonymous with online search was ... simply mind-boggling.

The platform eventually did get search (actually a few different implementations), which varied between mostly useless to actually reasonably functional, though I'll note that HN's Algolia-based search is vastly more useful on an ongoing basis.

G+'s content, to the extent it survives at all, is largely on the Internet Archive's Wayback Machine which ... lacks search.

ranting-moth · 2 years ago
Conflict of interests. They want you faffing around provisioning all those valuable clicks for them to sell.
rasz · 2 years ago
Google does it, Youtube does not.

Google will find a video when you search for a phrase that was said in it (as long as its bad speech recognition got it right), it will find a video with a text that shows up on an object with enough clarity to OCR (for example electronic component name on a pcb in the foreground). There is one plot twist - Google will not always do it when you search for VIDEO specifically :) but will gladly give you videos when searching text/images :)

Youtube search on the other hand will try

- suggest something you liked that has nothing to do with the search term entered

- popular videos at this moment

- videos mildly related to proper results. One of them had a horse in it? clearly you want more horses!

- videos with title mildly related to search term

- to ignore upload date filter when they panic (Christchurch mosque massacre).

For example YT search for "Si5351A" limited to this month will give 11 somewhat proper results mostly with "Si5351" (no A postfix) in title/description AND some dude DXing in Indonesian "Menerima Modif Radio Yaesu FTC 1540A Ke DDS System" because "Si5351A" is a "DDS" so it's the same thing, right? It's like when I'm looking for "NSR Ro80" you should show me plenty of other cars because Ro80 is a car :). Searching for Si5351A without quote marks will show one additional video with Si5351B in the title.

Gets better, searching Google Video for Si5351A last month also gives ~11 results, but only 4 of those are direct YT links :]

mrazomor · 2 years ago
Probably because YouTube != Google Search, while YouTube is still a subset of Google. So, going the extra mile for YouTube and not for others might land Google Search in anti-competition trouble.

Then again, I also find it absurd. YouTube is one of the most valuable parts of the Internet. And its lack of searchability is criminal. At least the YT search itself should make up for it. It's a shame it doesn't.

derefr · 2 years ago
Google doesn't necessarily have to do anything special for YouTube, though. Google could "just" index YouTube videos as if they were any other web pages, in a standard way. It would then be YouTube's job, to make the data inside those video pages legible to Google's indexer. Where Google could enable this, by pushing for web standards to increase machine-legibility of video in HTML — e.g. standardized ARIA-accessible captions sources for the <video> element, etc.

If they got it set up such that in theory any web spider could come along and index a YouTube video — then there would be no anti-trust reason that Google couldn't just directly ingest the subtitle files off their own servers; it'd just be a bandwidth-saving optimization over the scraping process that they could otherwise do.

choppaface · 2 years ago
They do have search in video but the launch was kinda miffed

https://www.socialmediatoday.com/news/google-tests-text-sear...

jameshart · 2 years ago
How do you think monopoly regulators would like it if YouTube videos were indexed with higher accuracy and detail in google searches than Vimeo video?

So sure, google can say ‘here’s a standard way to provide subtitles for a video which we’ll index’, but then that becomes a complete SEO side channel - google needs to validate that the subtitles actually match the content. And that means their bots downloading the video itself. And google really doesn’t want to go out there and argue that video needs to be downloadable by bots, because that’s the whole YouTube-dl case right there.

ttctciyf · 2 years ago
> Google search with "site:youtube.com" in it. You get basically nothing

That is not my experience! I regularly resort to this when the crappy inbuilt youtube search, which prefers to throw out algorithmic recommendations over returning actual search results, fails to come up with the goods.

Do you really get no results for, say: https://www.google.com/search?&q=intitle%3A%22thomas+brinkma... ?

Deleted Comment

reaperducer · 2 years ago
> Heck, when I first heard about YouTube's autogenerated captions

Off-topic, but since I don't use YouTube and you do, in your experience, how are the auto-generated captions? Are they accurate?

I've been unimpressed by speech-to-text engines in the past, so I'm interested to hear if this is a problem that Google's managed to solve.

andai · 2 years ago
As far as I can tell, Google (but not YouTube?) does search YouTube transcripts.

I have successfully Googled text in a video's transcript and found that video.

The transcripts themselves are pretty bad though (Google's using old tech).

They're usually good enough for auto-summarization though.

guerrilla · 2 years ago
> but it's incomprehensible to me that Google doesn't do this themselves.

How is it incomprehensible that they don't give a shit about what you want to see and only care about what's profitable to them for you to see?

fatneckbeard · 2 years ago
i used to imagine similar opportunities with google books. but they have done basically nothing with it. and that's been like 20 years.

if anyone could have disrupted the corrupt and unfair academic publishing world, it was Google. they just found it an uninteresting task. they preferred to work on G+, Stadia, Google Code, etc, https://killedbygoogle.com/

jimmySixDOF · 2 years ago
Rule #50 The better the Catalog, the Worse the Interface

Spotify and YouTube are the leading examples but there are definitely others.

whitemary · 2 years ago
YouTube profits from people scrubbing videos. Why on Earth would they want to offer full text search instead?
drumhead · 2 years ago
It's Google; the obvious seems to elude them even when it's sitting in front of them.
freedomben · 2 years ago
I believe Bard has the capability of searching youtube transcripts
gorgoiler · 2 years ago
Very nice! FYI: sqlite ships with a full text search engine featuring a Boolean query language, highlight(), snippet() and scoring:

https://www.sqlite.org/fts5.html

I’ve not used it with enough content to know how much faster it is than LIKE ‘%my query%’ but it should be a lot quicker.

(Also, in most cases you don’t need to create an id column — every table has one already in the form of rowid.)
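For anyone who wants to try it, here is a minimal sketch of those FTS5 features from Python's built-in `sqlite3` module. It assumes the SQLite library bundled with your Python was compiled with FTS5 (common, but not guaranteed), and the table name `subs`, the helper names, and the sample rows are invented for the example.

```python
import sqlite3


def make_index(rows):
    """Create an in-memory FTS5 table from (video_id, text) rows."""
    db = sqlite3.connect(":memory:")
    # video_id is stored but not indexed; only text is searchable.
    db.execute("CREATE VIRTUAL TABLE subs USING fts5(video_id UNINDEXED, text)")
    db.executemany("INSERT INTO subs (video_id, text) VALUES (?, ?)", rows)
    return db


def search(db, query):
    """Run a Boolean FTS5 query; best matches first."""
    return db.execute(
        """
        SELECT video_id,
               highlight(subs, 1, '[', ']')          AS marked,
               snippet(subs, 1, '[', ']', '...', 8)  AS excerpt
        FROM subs
        WHERE subs MATCH ?
        ORDER BY rank
        """,
        (query,),
    ).fetchall()


if __name__ == "__main__":
    db = make_index([
        ("vid_a", "the normal distribution is radially symmetric"),
        ("vid_b", "a circle has radial symmetry"),
        ("vid_c", "pi shows up in the normal distribution"),
    ])
    print(search(db, "normal AND radially"))
    print(search(db, "normal NOT radially"))
```

`highlight()` returns the whole column with matches wrapped in the given markers, while `snippet()` returns a short excerpt around them; `ORDER BY rank` sorts by the built-in relevance score.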

lennxa · 2 years ago
Not sure if this includes fuzzy search, but having it will make this much more usable.
Boltgolt · 2 years ago
In what cases is it unwise to rely on rowid over an id field?
gorgoiler · 2 years ago
I don’t think there are any. They are one and the same — if you create an integer primary key named id it is aliased to rowid:

https://www.sqlite.org/rowidtable.html
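A quick way to see the aliasing for yourself, as a sketch with made-up table names:

```python
import sqlite3


def rowid_demo():
    """Return the ids SQLite assigns with and without a declared id column."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE with_id (id INTEGER PRIMARY KEY, text TEXT)")
    db.execute("CREATE TABLE without_id (text TEXT)")
    db.execute("INSERT INTO with_id (text) VALUES ('hello')")
    db.execute("INSERT INTO without_id (text) VALUES ('hello')")
    # In with_id, id and rowid are two names for the same value.
    aliased = db.execute("SELECT id, rowid FROM with_id").fetchone()
    # without_id still has a rowid even though no id column was declared.
    implicit = db.execute("SELECT rowid FROM without_id").fetchone()
    return aliased, implicit


if __name__ == "__main__":
    print(rowid_demo())
```

One caveat from the linked page: a rowid that is not aliased by an INTEGER PRIMARY KEY is not guaranteed to be stable (VACUUM may renumber it), so declaring the column is still the safer choice if other tables refer to those ids.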

nomilk · 2 years ago
> yt-fts is a simple python script that uses yt-dlp to scrape all of a youtube channels subtitles and load them into an sqlite database that is searchable from the command line.

Critically, this is per channel. I wonder if we can optionally configure this to share the downloaded transcripts to a central repository so eventually a good proportion of youtube's transcripts could be downloaded as one big text file.

hnarn · 2 years ago
> share the downloaded transcripts to a central repository

Sure, are you willing to host it and handle the absolutely inevitable legal issues?

moritonal · 2 years ago
I wish IPFS was better. It'd be an obvious solution to this. Content hash the YouTube ID and then distribute hosting.

Deleted Comment

prometheon1 · 2 years ago
There is something similar to a central repository at https://filmot.com/
simonw · 2 years ago
It looks like you're running searches using LIKE: https://github.com/NotJoeMartinez/yt-fts/blob/050981c0519a96...

SQLite has a really powerful full-text search mechanism built in - FTS5. It can handle things like stemming and stop words and relevance ranking.

My sqlite-utils Python library includes helper methods for setting that up: https://sqlite-utils.datasette.io/en/stable/python-api.html#...
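To illustrate the difference without any extra dependency, here is a sketch using the standard `sqlite3` module directly rather than sqlite-utils. It assumes FTS5 is compiled into your SQLite build; the table and function names are hypothetical. It shows stemming via the `porter` tokenizer and ranking via `bm25()`, next to a plain `LIKE` for comparison.

```python
import sqlite3


def build_tables(lines):
    """Store the same lines in a plain table and in a stemmed FTS5 table."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE subs (text TEXT)")
    db.execute(
        "CREATE VIRTUAL TABLE subs_fts USING fts5(text, tokenize = 'porter unicode61')"
    )
    params = [(line,) for line in lines]
    db.executemany("INSERT INTO subs (text) VALUES (?)", params)
    db.executemany("INSERT INTO subs_fts (text) VALUES (?)", params)
    return db


def like_search(db, term):
    """Substring match: only finds the exact characters typed."""
    rows = db.execute("SELECT text FROM subs WHERE text LIKE ?", (f"%{term}%",))
    return [row[0] for row in rows]


def stemmed_search(db, term):
    """Full-text match on word stems, most relevant rows first."""
    rows = db.execute(
        "SELECT text FROM subs_fts WHERE subs_fts MATCH ? ORDER BY bm25(subs_fts)",
        (term,),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    db = build_tables([
        "he was dancing all night",
        "a beautiful dance",
        "nothing relevant here",
    ])
    print(like_search(db, "dances"))     # substring search
    print(stemmed_search(db, "dances"))  # stem-based search
```

With the Porter stemmer, "dances", "dance", and "dancing" reduce to the same stem, so the full-text query finds lines the substring search misses.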

notjoemartinez · 2 years ago
Thank you! I was able to integrate this into the project[1]. I'm also looking into using your openai-to-sqlite[2] library for semantic search.

[1]https://github.com/NotJoeMartinez/yt-fts/pull/25 [2]https://github.com/simonw/openai-to-sqlite

lfconsult · 2 years ago
You're right. Thanks for sharing the link to your full-text search helper, really neat.
lopatin · 2 years ago
This will come in handy. I’ve always wanted to count how many times Lex Fridman has referred to something as a beautiful dance.
noman-land · 2 years ago
I want to do a word count on the word "love".
notjoemartinez · 2 years ago
> yt-fts search "love" --channel "Lex Fridman" | grep "love" | wc -l

> 7060
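Note that `grep | wc -l` counts matching lines, so a line that says the word twice counts once, and substrings such as "loved" match too. If you want actual word occurrences, something like the following sketch would do it; the function name and sample lines are invented, and whether inflected forms should count is a choice you would have to make.

```python
import re


def count_word(lines, word):
    """Return (matching_lines, total_occurrences) for a whole word."""
    # \b keeps "loved" or "glove" from matching "love".
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    matching_lines = sum(1 for line in lines if pattern.search(line))
    occurrences = sum(len(pattern.findall(line)) for line in lines)
    return matching_lines, occurrences


if __name__ == "__main__":
    sample = ["love is love", "I loved it", "Love wins"]
    print(count_word(sample, "love"))
```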

expertentipp · 2 years ago
I think I'll start to use exclusively CLI tools for discovering and downloading of YT content. The entire experience which starts from typing "youtube.com" in the address bar and pressing enter is obnoxiously unbearable.
lawrencehook · 2 years ago
self-promo, but you might find my extension helpful. https://lawrencehook.com/rys/
hackernewds · 2 years ago
Downloading? Do you also not believe creators should be compensated for their content?
galleywest200 · 2 years ago
I already block advertisements on the web, so I see none when on a Desktop web version of YouTube. But I do not use Sponsor Block so those creators still get to show me their ExpressVPN ads or whatever the flavor is today. Also I use Patreon.
smcleod · 2 years ago
I think they should be compensated, but I don't think Google should be.
harlanji · 2 years ago
I hate being the poo pooer who says that subtitles are available via the API and wish the tool went that route.

I'm all for stuff being archivable with tools like youtube-dl, but I much prefer to see tools like this use the API despite its quotas because it goes beyond archiving a copy for reference. Tools that (ab)use scraping only justify anti-scraping efforts that journalists and the like use and escalate that arms race. I think one could still scrape a channel or two per day within API usage limits [1]--50 units per list, 200 units per download; quota 10,000 units per day.

[1]: https://developers.google.com/youtube/v3/docs/captions/downl...
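To put rough numbers on that, here is a back-of-the-envelope sketch using the unit costs quoted above. It assumes one list call and one caption download per video, which is my simplification; real usage may need more calls.

```python
LIST_COST = 50        # units per captions list call, as quoted above
DOWNLOAD_COST = 200   # units per caption download, as quoted above


def videos_per_day(daily_quota, tracks_per_video=1):
    """How many videos fit in a daily quota under the stated assumption."""
    cost_per_video = LIST_COST + DOWNLOAD_COST * tracks_per_video
    return daily_quota // cost_per_video


def days_to_fetch(video_count, daily_quota):
    """Days needed to fetch captions for a channel, rounding up."""
    per_day = videos_per_day(daily_quota)
    return -(-video_count // per_day)  # ceiling division


if __name__ == "__main__":
    print(videos_per_day(10_000))      # default quota
    print(days_to_fetch(500, 10_000))  # a hypothetical 500-video channel
```

Under these assumptions the default quota covers about 40 videos a day, so "a channel or two per day" would only hold for small channels.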

tonto · 2 years ago
agree that using the API is likely the nicer route. you can also apply for a quota increase, I recently applied for youtube API quota increase to 100,000 units and it was approved for my app (https://cmdcolin.github.io/ytshuffle/) I was concerned they wouldn't like that the app downloads so much data but it was approved without much question, they just wanted terms of service prominently displayed to end users
ggerganov · 2 years ago
For videos without subtitles one could chain Whisper to auto-generate transcripts, though that would require downloading the audio and processing it
cced · 2 years ago
This is exactly what I’ve built. Nothing fancy just ytdlp + whisper + ripgrep + fzf and I’ve got a pretty interesting way to ctrl+f my YT history.
gukoff · 2 years ago
Mind sharing? I'd like to try this out.