yt-fts is a simple Python script that uses yt-dlp to scrape all of a YouTube channel's subtitles and load them into an SQLite database that is searchable from the command line. It lets you query a channel for a specific keyword or phrase and generates time-stamped YouTube URLs pointing to the spot in the video where the keyword occurs.
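For anyone wondering what that pipeline boils down to, here is a minimal sketch, not the actual yt-fts code. It assumes the subtitles have already been downloaded and parsed into (video_id, start_seconds, text) rows; the table name, column names, video ids and sample cues are all made up for illustration.

```python
import sqlite3


def timestamped_url(video_id: str, start_seconds: float) -> str:
    """Build a YouTube link that starts playback at the given offset."""
    return f"https://youtu.be/{video_id}?t={int(start_seconds)}"


def build_db(cues):
    """Load (video_id, start_seconds, text) rows into an in-memory database."""
    db = sqlite3.connect(":memory:")
    # Hypothetical schema; the real tool may lay this out differently.
    db.execute(
        "CREATE TABLE subtitles (video_id TEXT, start_seconds REAL, text TEXT)"
    )
    db.executemany("INSERT INTO subtitles VALUES (?, ?, ?)", cues)
    return db


def search(db, phrase):
    """Return (url, text) pairs for every cue containing the phrase."""
    rows = db.execute(
        "SELECT video_id, start_seconds, text FROM subtitles "
        "WHERE text LIKE ? ORDER BY video_id, start_seconds",
        (f"%{phrase}%",),
    )
    return [(timestamped_url(v, s), t) for v, s, t in rows]


# Hand-written sample cues with placeholder video ids (not real videos).
SAMPLE_CUES = [
    ("vid00000001", 12.5, "welcome back to the channel"),
    ("vid00000001", 95.0, "let's look at the schematic first"),
    ("vid00000002", 3.0, "today we are soldering a breakout board"),
]

if __name__ == "__main__":
    db = build_db(SAMPLE_CUES)
    for url, text in search(db, "soldering"):
        print(url, text)
```

The `?t=` parameter takes whole seconds, so the cue start time is truncated before it goes into the link.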
Fun thing to try: do a Google search with "site:youtube.com" in it. You get basically nothing, no matter what keywords you use. It seems that Google actually entirely ignores/excludes YouTube from their regular HTML indexing, and instead only relies on the YouTube backend to actively push content into (a special, separate part of) the search index. That gets you "results from YouTube" and "video search", but not the ability to search YouTube video pages qua web pages. (Consider: you can find a post in a Reddit comment thread on Google. Can you find a post in a YouTube video's comments section on Google?)
Heck, when I first heard about YouTube's autogenerated captions, my first thought was "oh, so this is Google building deep indexing of video through audio transcription, because they can't trust externally-provided subtitles, right?" But it's been 10 years, and I couldn't have been more wrong.
a. Doing it won't get anybody a promo.
b. They've considered it and determined that it will lose revenue to a degree that is not justified by the usability improvement.
This is one of the pitfalls of having an ad-based product instead of a fee-based product. User experience is just no longer the top priority.
This is exactly it. It's the same business logic that is employed in the regular Google Search.
Google doesn't make money from giving you correct search results. It makes money by keeping you searching for the results you want.
FOSS dialer recommendations are welcome btw.
I know from experience that search in user reviews is very hard. Unless you really understand the review (which was tried via sentiment analysis) you cannot rank results well. But now with the new LLM models I think it would work better.
Firstly, if Google did offer this feature, it would likely be targeted by Search Engine Optimization (SEO) exploits. In essence, any time a new search parameter is introduced, there's a risk of it being manipulated to prioritize certain content—especially by those interested in gaming the system for increased visibility or monetary gain. If YouTube's search feature were to be plagued by such spamming, it could severely degrade the user experience and lead to Google having to strip it away. While not a guarantee, it's a probable outcome given the history of SEO misuse.
Secondly, YouTube's primary focus is on its recommendation algorithm rather than search functionality. With billions of videos hosted, the key goal is to keep users engaged by serving up content they're likely to enjoy, thereby increasing view times and ad revenue. The search feature, while useful, is not as integral to this objective. Further, offering full-text search could provide yet another avenue to manipulate the algorithm, which YouTube surely wants to avoid.
Finally, implementing and maintaining such a feature would require substantial resources. It would necessitate hiring teams of high-salaried employees to moderate and ensure fair use of the feature, adding considerable operational costs. Considering these factors, it seems that Google has made a strategic decision to avoid this feature for now.
That said, the fact that third-party solutions are emerging, such as the one shared here, shows that there's a demand for full-text search capabilities. It also underscores the potential that these solutions have when unencumbered by the constraints faced by a tech giant like Google. This provides a fascinating insight into the dynamic relationship between third-party developers and tech corporations and the way they can complement each other.
Maybe for some users. I just use youtube to find a specific video I need (because people have stopped writing useful how-to's now that they can just make a 10 minute video covering about 1 minute worth of text), and a full text search would be so, so useful.
Also, using "site:youtube.com" on Google works perfectly for me. If I look up "site:youtube.com david letterman" it gives me the David Letterman channel, followed by a seemingly infinite number of Letterman clips. Precisely what I'd expect.
The only thing I can reproduce that you're complaining about is that Google (and YouTube) search don't seem to index YouTube user comments, in contrast to Reddit. But Google doesn't seem to index comments-attached-to-content anywhere on the internet -- not even comments on articles at mainstream publications like the New York Times. Which is probably more of a feature than a bug -- comments on both YouTube videos and news articles tend to be a lot of emotional reactions and repeated opinions which aren't worth searching at all. In contrast, many (not all) Reddit threads are often very informative and the "main content", so it makes sense Google indexes them.
So I don't really see anything to complain about here, from my perspective.
"having a distribution that's both radially symmetric" site:youtube.com
would return 3b1b's "Why π is in the normal distribution" video, which has that in subtitles at 22:28.
Even without the site: term, all I get is an allreadable.com page that's scraped the subtitles for that video. Allreadable appears at first glance to be a site owned by someone in China and hosted on liquidweb.
If they had serious competition, they'd have to do more to keep users, but no such competition exists.
I grant Rumble is only 1% the size of YouTube by viewership, but I think that’ll shift fairly rapidly and we can see 10% on rumble in 3 years.
My analysis https://austingwalters.com/an-analysis-on-rumble-nasdaq-rum/
And then there’s YouTube where you can’t even search subtitles. It makes me shake my head. Google seems to be at the forefront of AI, but doesn’t seem to be able to turn all that expertise into relevant products. Maybe the recent disruptions will shake them awake?
Allowing users to freely query content in their own websites is completely antithetical to what they are trying to do. YouTube is also very aggressive in preventing scraping and limiting the usage of the official API. Which is quite ironic considering the history of the company.
That this was the case with a company whose name is synonymous with online search was ... simply mind-boggling.
The platform eventually did get search (actually a few different implementations), which ranged from mostly useless to actually reasonably functional, though I'll note that HN's Algolia-based search is vastly more useful on an ongoing basis.
G+'s content, to the extent it survives at all, is largely on the Internet Archive's Wayback Machine which ... lacks search.
Google will find a video when you search for a phrase that was said in it (as long as its bad speech recognition got it right), it will find a video with a text that shows up on an object with enough clarity to OCR (for example electronic component name on a pcb in the foreground). There is one plot twist - Google will not always do it when you search for VIDEO specifically :) but will gladly give you videos when searching text/images :)
YouTube search, on the other hand, will:
- suggest something you liked that has nothing to do with the search term entered
- show videos that are popular at the moment
- show videos mildly related to the proper results. One of them had a horse in it? Clearly you want more horses!
- show videos with titles mildly related to the search term
- ignore the upload date filter when they panic (Christchurch mosque massacre).
For example, a YT search for "Si5351A" limited to this month will give 11 somewhat proper results, mostly with "Si5351" (no A postfix) in the title/description, AND some dude DXing in Indonesian, "Menerima Modif Radio Yaesu FTC 1540A Ke DDS System", because "Si5351A" is a "DDS" so it's the same thing, right? It's like when I'm looking for "NSR Ro80" you should show me plenty of other cars because Ro80 is a car :). Searching for Si5351A without quote marks will show one additional video with Si5351B in the title.
Gets better, searching Google Video for Si5351A last month also gives ~11 results, but only 4 of those are direct YT links :]
Then again, I also find it absurd. YouTube is one of the most valuable parts of the Internet. And its lack of searchability is criminal. At least the YT search itself should make up for it. It's a shame it doesn't.
If they got it set up such that in theory any web spider could come along and index a YouTube video — then there would be no anti-trust reason that Google couldn't just directly ingest the subtitle files off their own servers; it'd just be a bandwidth-saving optimization over the scraping process that they could otherwise do.
https://www.socialmediatoday.com/news/google-tests-text-sear...
So sure, google can say ‘here’s a standard way to provide subtitles for a video which we’ll index’, but then that becomes a complete SEO side channel - google needs to validate that the subtitles actually match the content. And that means their bots downloading the video itself. And google really doesn’t want to go out there and argue that video needs to be downloadable by bots, because that’s the whole YouTube-dl case right there.
That is not my experience! I regularly resort to this when the crappy inbuilt youtube search, which prefers to throw out algorithmic recommendations over returning actual search results, fails to come up with the goods.
Do you really get no results for, say: https://www.google.com/search?&q=intitle%3A%22thomas+brinkma... ?
Off-topic, but since I don't use YouTube and you do, in your experience, how are the auto-generated captions? Are they accurate?
I've been unimpressed by speech-to-text engines in the past, so I'm interested to hear if this is a problem that Google's managed to solve.
I have successfully Googled text in a video's transcript and found that video.
The transcripts themselves are pretty bad though (Google's using old tech).
They're usually good enough for auto-summarization though.
How is it incomprehensible that they don't give a shit about what you want to see and only care about what's profitable to them for you to see?
if anyone could have disrupted the corrupt and unfair academic publishing world, it was Google. they just found it an uninteresting task. they preferred to work on G+, Stadia, Google Code, etc, https://killedbygoogle.com/
Spotify and YouTube are the leading examples but there are definitely others.
https://www.sqlite.org/fts5.html
I’ve not used it with enough content to know how much faster it is than LIKE ‘%my query%’ but it should be a lot quicker.
(Also, in most cases you don’t need to create an id column — every table has one already in the form of rowid.)
https://www.sqlite.org/rowidtable.html
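A quick sketch of the rowid point, using Python's built-in sqlite3 module. The table and column names are made up for illustration; the table has no explicit id column, yet every row can still be addressed by its implicit rowid. (The "in most cases" caveat covers tables declared WITHOUT ROWID, which do not have one.)

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Hypothetical table with no explicit id column.
db.execute("CREATE TABLE subtitles (video_id TEXT, text TEXT)")
db.executemany(
    "INSERT INTO subtitles (video_id, text) VALUES (?, ?)",
    [("vidA", "first line"), ("vidA", "second line"), ("vidB", "third line")],
)

# rowid has to be named explicitly; SELECT * does not include it.
rows = db.execute(
    "SELECT rowid, video_id, text FROM subtitles ORDER BY rowid"
).fetchall()

# Look up a single row by its implicit rowid.
second = db.execute(
    "SELECT text FROM subtitles WHERE rowid = ?", (2,)
).fetchone()[0]

if __name__ == "__main__":
    for row in rows:
        print(row)
    print(second)
```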
Critically, this is per channel. I wonder if we can optionally configure this to share the downloaded transcripts to a central repository so eventually a good proportion of youtube's transcripts could be downloaded as one big text file.
Sure, are you willing to host it and handle the absolutely inevitable legal issues?
SQLite has a really powerful full-text search mechanism built in - FTS5. It can handle things like stemming and stop words and relevance ranking.
My sqlite-utils Python library includes helper methods for setting that up: https://sqlite-utils.datasette.io/en/stable/python-api.html#...
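As a rough illustration of the stemming and ranking parts, here is a sketch that talks to FTS5 directly through Python's built-in sqlite3 module rather than through sqlite-utils. It assumes the SQLite library bundled with your Python was compiled with FTS5, which is common but not guaranteed. The table name, column names and sample rows are invented for the example.

```python
import sqlite3


def build_index(rows):
    """Create an in-memory FTS5 table and load (video_id, text) rows."""
    db = sqlite3.connect(":memory:")
    # The porter tokenizer wraps unicode61 and adds English stemming.
    # video_id is stored but kept out of the full-text index.
    db.execute(
        "CREATE VIRTUAL TABLE subtitles_fts USING fts5("
        "video_id UNINDEXED, text, tokenize = 'porter unicode61')"
    )
    db.executemany(
        "INSERT INTO subtitles_fts (video_id, text) VALUES (?, ?)", rows
    )
    return db


def search(db, query):
    """Run a MATCH query; ORDER BY rank sorts by FTS5's built-in bm25 score."""
    return db.execute(
        "SELECT video_id, text FROM subtitles_fts "
        "WHERE subtitles_fts MATCH ? ORDER BY rank",
        (query,),
    ).fetchall()


# Invented sample rows with placeholder video ids.
SAMPLE_ROWS = [
    ("vidA", "we searched the whole archive for that clip"),
    ("vidB", "searching subtitles is much faster with an index"),
    ("vidC", "this video is about soldering"),
]

if __name__ == "__main__":
    db = build_index(SAMPLE_ROWS)
    # Thanks to stemming, "search" matches both "searched" and "searching".
    for video_id, text in search(db, "search"):
        print(video_id, text)
```

With a plain LIKE '%search%' you would get the same two rows here, but only by scanning every row as a substring match; the FTS5 version goes through an inverted index and matches on word stems.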
[1] https://github.com/NotJoeMartinez/yt-fts/pull/25
[2] https://github.com/simonw/openai-to-sqlite
> 7060
I'm all for stuff being archivable with tools like youtube-dl, but I much prefer to see tools like this use the API despite its quotas because it goes beyond archiving a copy for reference. Tools that (ab)use scraping only justify anti-scraping efforts that journalists and the like use and escalate that arms race. I think one could still scrape a channel or two per day within API usage limits [1]--50 units per list, 200 units per download; quota 10,000 units per day.
[1]: https://developers.google.com/youtube/v3/docs/captions/downl...
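To put rough numbers on that, here is a back-of-the-envelope sketch using the quota figures quoted above (50 units per list, 200 units per download, 10,000 units per day). It assumes each video needs exactly one captions list call and one download call, which is my simplification, and the figures themselves are taken from the comment rather than checked against the current API docs.

```python
import math

# Figures as quoted in the comment above; treat them as assumptions.
DAILY_QUOTA_UNITS = 10_000
LIST_COST_UNITS = 50
DOWNLOAD_COST_UNITS = 200


def videos_per_day(lists_per_video=1, downloads_per_video=1,
                   quota=DAILY_QUOTA_UNITS):
    """How many videos' captions fit into one day's quota."""
    cost_per_video = (lists_per_video * LIST_COST_UNITS
                      + downloads_per_video * DOWNLOAD_COST_UNITS)
    return quota // cost_per_video


def days_for_channel(video_count):
    """Days needed to fetch captions for a whole channel at that rate."""
    return math.ceil(video_count / videos_per_day())


if __name__ == "__main__":
    print(videos_per_day())        # videos per day under these assumptions
    print(days_for_channel(100))   # days for a hypothetical 100-video channel
```

Under those assumptions each video costs 250 units, so the daily budget covers 40 videos, and a hypothetical 100-video channel would take 3 days.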