pvankessel · 2 years ago
This is such a clever way of sampling, kudos to the authors. Back when I was at Pew we tried to map YouTube using random walks through the API's "related videos" endpoint and it seemed like we hit a saturation point after a year, but the magnitude described here suggests there's quite a long tail that flies under the radar. Google started locking down the API almost immediately after we published our study, so I'm glad to see folks still pursuing research with good old-fashioned scraping. Our analysis was at the channel level and focused only on popular ones, but it's interesting how some of the figures on TubeStats are pretty close to what we found (e.g. language distribution): https://www.pewresearch.org/internet/2019/07/25/a-week-in-th...
m463 · 2 years ago
> Google started locking down the API almost immediately after we published our study

Isn't this ironic, given how google bots scour the web relentlessly and hammer sites almost to death?

LeonM · 2 years ago
> google bots scour the web relentlessly and hammer sites almost to death

I have been hosting sites and online services for a long time now and have never had this problem, nor heard of this issue before.

If your site can't even handle a crawler, you need to seriously question your hosting provider, or your architecture.

LocalH · 2 years ago
"Rules for thee, but not for me"
dotandgtfo · 2 years ago
This is one of the most important parts of the EU's upcoming Digital Services Act, in my opinion. Platforms have to share data with (vetted) researchers, public interest groups and journalists.
MBCook · 2 years ago
This would find things like unlisted videos which don’t have links to them from recommendations.
trogdor · 2 years ago
That’s a really good point. I wonder if they have an estimate of the percentage of YouTube videos that are unlisted.
0x1ceb00da · 2 years ago
This technique isn't new. Biologists use it to count the number of fish in a lake. (Catch 100 fish, tag them, wait a week, catch 100 fish again, and count the number of tagged fish in this batch.)
pants2 · 2 years ago
That's typically the Lincoln-Petersen estimator. You can use this type of approach to estimate the number of bugs in your code too: if reviewer A catches 4 bugs and reviewer B catches 5 bugs, with 2 being the same, then you can estimate there are 10 total bugs in the code (7 caught, 3 uncaught).
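A quick sketch of that estimator in Python (the reviewer numbers from above, plus a made-up fish example):

    def lincoln_petersen(n1, n2, m):
        # n1 = items found in the first pass, n2 = in the second, m = overlap
        return n1 * n2 / m

    print(lincoln_petersen(4, 5, 2))       # reviewers: ~10 bugs total
    print(lincoln_petersen(100, 100, 25))  # fish: 25 tagged recaptures -> ~400 fish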
justinpombrio · 2 years ago
That's not actually the technique the authors are using. Catching 100 fish would be analogous to "sample 100 YouTube videos at random", but they don't have a direct method of doing so. Instead, they're guessing possible YouTube video links at random and seeing how many resolve to videos.

In the "100 fish" example, the formula for approximating the total number of fish is:

    total ~= marked * caught / tagged
    (where marked = caught = 100 in the example)
In their YouTube sampling method, the formula for approximating the total number of videos is:

    total ~= (valid / tried) * 2^64
Notice that this is flipped: in the fish example the main measurement is "tagged" (the number of fish that were tagged the second time you caught them), which is in the denominator. But when counting YouTube videos, the main measurement is "valid" (the number of urls that resolved to videos), which is in the numerator.
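A minimal sketch of that flipped estimator with made-up counts (these are not the paper's actual numbers; the only real constant is the 2^64 ID space):

    ID_SPACE = 2 ** 64          # a YouTube ID encodes a 64-bit value

    tried = 14_000_000_000_000  # random IDs effectively checked (hypothetical)
    valid = 10_000              # of those, how many resolved to a video (hypothetical)

    print(f"{valid / tried * ID_SPACE:.2e}")  # ~1.3e+10 with these numbers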

zellyn · 2 years ago
Do you get the same 100 dumb fish?
dclowd9901 · 2 years ago
I made the same connection, but it's still the first time I've seen it used to look up IDs in reverse.
midasuni · 2 years ago
It’s not even new in the YouTube space, as they acknowledge; this goes back to 2011:

https://dl.acm.org/doi/10.1145/2068816.2068851

krackers · 2 years ago
Also related is the unseen species problem (if you sample N things, and get Y repeats, what's the estimated total population size?).

https://en.wikipedia.org/wiki/Unseen_species_problem
http://www.stat.yale.edu/~yw562/reprints/species-si.pdf

neurostimulant · 2 years ago
> You generate a five character string where one character is a dash – YouTube will autocomplete those URLs and spit out a matching video if one exists.

Won't this mess up the stats though? It's like a lake monster randomly swapping an untagged fish for a tagged fish as you catch them.

fergbrain · 2 years ago
Isn’t this just a variation of the Monte Carlo method?
layer8 · 2 years ago
That's only vaguely the same. It would be much closer if they divided the lake into a 3D grid and sampled random cubes from it.

gaucheries · 2 years ago
I think YouTube locked down their APIs after the Cambridge Analytica scandal.
herval · 2 years ago
in the end, that scandal was the open web's official death sentence :(
nextaccountic · 2 years ago
In which ways were the Cambridge Analytica thing and the openness of Youtube APIs (or other web APIs) related? I just don't see the connection
pvankessel · 2 years ago
They actually held out for a couple of years after Facebook and didn't start forcing audits and cutting quotas until 2019/2020

alex_young · 2 years ago
This is an interesting way to attack mitigations to the German Tank Problem [0].

I expect the optimal solution is to increase the address space to prevent random samples from collecting enough data to arrive at a statistically significant conclusion. There are probably other good solutions which attempt to vary the distribution in different ways, but a truly random sample should limit that angle.

[0] https://en.m.wikipedia.org/wiki/German_tank_problem

consp · 2 years ago
I didn't see it addressed in the article, but this hinges on the IDs following a discrete uniform distribution. Who knows what kind of shenanigans Google did to the identifiers.
cbolton · 2 years ago
Actually the method works regardless of the distribution. It's an interesting and important feature of random sampling. Consider 1000 IDs assigned in the worst (most skewed) way: as a sequence from 0 to 999. If there are 20 videos they will have IDs 0 to 19. If you draw 500 independent random numbers between 0 and 999, each number will have 2% probability of being in the range 0 to 19. So on average you will find that 2% of your 500 attempts find an existing video. From that you conclude that 2% of the whole space of 1000 IDs are assigned. You find correctly that there are 0.02*1000 = 20 videos.
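A quick simulation of exactly that worst-case assignment (just the toy numbers from above, nothing from the paper):

    import random

    ID_SPACE = 1000
    assigned = set(range(20))   # worst case: the 20 videos get IDs 0..19
    samples = 500

    hits = sum(random.randrange(ID_SPACE) in assigned for _ in range(samples))
    print(hits / samples * ID_SPACE)   # ~20 on average, despite the clustering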
Maxion · 2 years ago
It would be quite the challenge to skew the distribution over the address space enough to mitigate this type of scraping while at the same time minimizing the risk of collisions.
867-5309 · 2 years ago
This is exactly what springs to mind now that it has emerged that Google "conveniently" autocompletes under certain circumstances, thus making those identifiers more likely to be targeted. This completely skews the sample from the outset.
pbhjpbhj · 2 years ago
How does a random sample account for, say, a clustered distribution? Don't the estimations rely on assumptions of continuity?

Suppose I have addresses /v=0x00 to 0xff, but I only use f0 to ff; if you assume the videos are distributed randomly then your estimates will always be skewed, no?

So I take the addressable space and apply an arbitrary filter before assigning addresses.

Equally, random samples will be off by the same amount, but you don't know the sparsity that I've applied with my filter?

barrkel · 2 years ago
As long as the sampling isn't skewed, and is properly random and covers the whole space evenly, it will estimate cardinality correctly regardless of the underlying distribution.

There is no way for clustering to alter the probability of a hit or a miss. There is nowhere to "hide". The probability of a hit remains the proportion of the space which is filled.

zX41ZdbW · 2 years ago
I recommend checking the dataset of "YouTube dislikes": https://clickhouse.com/docs/en/getting-started/example-datas...

(it is named this way because it was an archival effort to collect the information before the dislike feature was removed)

It can be used to find things like the most controversial videos; top videos with a description in a certain language, etc.

chii · 2 years ago
It's important to know stats like these dislike counts, because YouTube is such a large and public platform that it's borderline a public utility.

from the article:

> It’s possible that YouTube will object to the existence of this resource or the methods we used to create it. Counterpoint: I believe that high level data like this should be published regularly for all large user-generated media platforms. These platforms are some of the most important parts of our digital public sphere, and we need far more information about what’s on them, who creates this content and who it reaches.

The gov't ought to make it a regulation that platforms expose stats like these, so that they can be collected by the statistics bureaus.

otteromkram · 2 years ago
I disagree.

Nobody is stopping users from selecting one of the many YouTube competitors out there (e.g. Twitch, Facebook, Vimeo) to host their content. We could also argue that savvy marketers/influencers use multiple hosting platforms.

YouTube's data is critical for YouTube and Google, which is basically an elaborate marketing company.

Governments should only enforce oversight on matters such as user rights and privacy, anticompetitive practices, content subject matter, etc.

tap-snap-or-nap · 2 years ago
> YouTube is such a large and public platform that it's borderline a public utility.

So have the big banks, large corporations, and land, but they all feed off each other and the government. What we want as a community is usually quite different from what they decide to do.

danlark · 2 years ago
Disclaimer: the author of the comment is the CEO of ClickHouse
drewtato · 2 years ago
I was expecting to find out how much data YouTube has, but that number wasn't present. I've used the stats to roughly calculate that the average video is 500 seconds long. Then using a bitrate of 400 KB/s and 13 billion videos, that gives us 2.7 exabytes.

I got 400 KB/s from some FHD 24-30 fps videos I downloaded, but this is very approximate. YouTube will encode sections containing less perceptible information at a lower bitrate, and of course, videos come in all kinds of different resolutions and frame rates, with the distribution changing over the history of the site. If we assume every video is 4K with a bitrate of 1.5 MB/s, that's 10 exabytes.

This estimate is low for the amount of storage YouTube needs, since it would store popular videos in multiple datacenters, in both VP9 and AV1. It's possible YouTube compresses unpopular videos or transcodes them on-demand from some other format, which would make this estimate high, but I doubt it.
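For anyone who wants to tweak the assumptions, here's the back-of-envelope arithmetic (all inputs are my rough guesses from above):

    videos = 13e9          # estimated number of videos
    avg_seconds = 500      # rough average length

    for label, bps in [("FHD ~400 KB/s", 400e3), ("4K ~1.5 MB/s", 1.5e6)]:
        print(f"{label}: ~{videos * avg_seconds * bps / 1e18:.1f} EB")
    # -> roughly the 2.7 EB and 10 EB figures above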

ksec · 2 years ago
That storage number is highly likely to be off by an order of magnitude.

400 KB/s, or 3.2 Mbps as we would commonly express it in video encoding, is quite low for an original-quality upload in FHD, commonly known as 1080p. The 4K number is just about right for an average original upload.

You then have to take into account that YouTube compresses those into at least two video codecs, H.264 and VP9. Each codec gets all the resolutions from 320p to 1080p or higher, depending on the original upload quality, with many popular videos and 4K videos also encoded in AV1 as well. Some even come in HEVC for 360° video. Yes, you read that right: H.265 HEVC on YouTube.

And all of that doesn't even include replication or redundancy.

I would not be surprised if the total easily exceeds 100 EB, which is 100 times the size of Dropbox (as of 2020).

jl6 · 2 years ago
For comparison, 100-200EB is roughly the order of magnitude of all HDDs shipped per quarter:

https://blocksandfiles.com/2023/07/11/disk-drive-numbers/

xapata · 2 years ago
> EB

I pine for the day when "hella-" extends the SI prefixes. Sadly, they added "ronna-" and "quetta-" in 2022. Seems like I'll have to wait quite some time.

financypants · 2 years ago
So is YouTube storing somewhere in the realm of 50-100 EB? How many data centers is that?
38 · 2 years ago
> The 4K video number is just about right for average original upload.

No, it definitely is not.

onlyrealcuzzo · 2 years ago
I was under the impression that all Google storage including GCP (and replication) is under 1ZB.

IIUC, ~1ZB is basically the entire hard drive market for the last 5 years, and drives don't last that long...

I suspect YouTube isn't >10% of all Google.

the-rc · 2 years ago
On one hand: just two formats? There are more, e.g. H264. And there can be multiple resolutions. On the same hand: there might be or might have been contractual obligations to always deliver certain resolutions in certain formats.

On the other hand: there might be a lot of videos with ridiculously low view counts.

On the third hand: remember that YouTube had to come up with their own transcoding chips. As they say, it's complicated.

Source: a decade ago, I knew the answer to your question and helped the people in charge of the storage bring costs down. (I found out just the other day that one of them, R.L., died this February... RIP)

drewtato · 2 years ago
For resolutions over 1080, it's only VP9 (and I guess AV1 for some videos), at least from the user perspective. 1080 and lower have H264, though. And I don't think the resolutions below 1080 are enough to matter for the estimate. They should affect it by less than 2x.

The many videos with low view counts are accounted for by the article. It sounds like the only ones not included are private videos, which are probably not that numerous.

laluser · 2 years ago
You’re forgetting about replication and erasure coding overhead. 10 exabytes seems very low tbh. I’d put it closer to 50-100EB at this point.
virtuallynathan · 2 years ago
I did the math on this back in 2013, based on the annually reported number of hours uploaded per minute, and came up with 375PB of content, adding 185TB/day, with a 70% annual growth rate. This does not take into account storing multiple encodes or the originals.
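Rough sanity check of those figures, assuming the ~100 hours uploaded per minute that YouTube publicized around 2013 (that rate is my assumption here):

    hours_per_day = 100 * 60 * 24   # ~100 h uploaded per minute (assumed 2013 figure)
    daily_bytes = 185e12            # 185 TB/day from the estimate above

    per_hour = daily_bytes / hours_per_day
    print(f"~{per_hour / 1e9:.1f} GB per hour of video, "
          f"~{per_hour * 8 / 3600 / 1e6:.1f} Mbps average")  # ~1.3 GB/h, ~2.9 Mbps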
charcircuit · 2 years ago
Keep in mind that YouTube permanently keeps a copy of the original upload, which may have an even larger file size.
qingcharles · 2 years ago
Do you know that for certain? I always suspected they would, so they could transcode to better formats in the future, but never found anything to confirm it.
staplers · 2 years ago
I'm interested in the carbon footprint.

judge2020 · 2 years ago
The outcome of this article is this linked accompanying website: https://tubestats.org/
bane · 2 years ago
Google used to ask scaling questions about YouTube for some positions. They often ended up in some big-O line of questioning about syncing log data across a growing, distributed infrastructure. The result was some ridiculous big-O(f(n)) where the function was almost impossible to even describe verbally. Fun fun.

Source: interviewed by Google a few times.

verginer · 2 years ago
The author notes that they used "cheats". Depending on what these do, the i.i.d. assumption (the samples being independent) could be violated. If it is akin to snowball sampling, it could have an "excessive" success rate, thereby inflating the numbers.

> Jason found a couple of cheats that makes the method roughly 32,000 times as efficient, meaning our “phone call” connects lots more often

Woberto · 2 years ago
Just keep reading TFA

> it was discovered by Jia Zhou et. al. in 2011, and it’s far more efficient than our naïve method. (You generate a five character string where one character is a dash – YouTube will autocomplete those URLs and spit out a matching video if one exists.)

bsdetector · 2 years ago
There's probably a checksum in the URL so that typos can be detected without actually trying to access the video.

If you don't know how the checksum is created you can still try all values of it for one sample of the actual ID space.

oh_sigh · 2 years ago
I assume the cheat is something like using the playlist API that returns individual results for whether a video exists or not.

So you issue an API call to create a playlist with video IDs x, x+1, x+2, ..., and then when you retrieve the list, only x+2 is in it since it is the assigned ID.

stravant · 2 years ago
The data probably wouldn't look so clean if it were skewed. If Google were doing something interesting it probably wouldn't be skewed only by a little bit.
verginer · 2 years ago
Admittedly, I did not read the linked paper. But my point is not about Google doing something funny. Even if we assume that IDs are truly random and uniformly distributed, the sampling method itself still needs to be i.i.d. This problem is similar to density estimation, where rejection sampling is super inefficient but converges to the correct solution, whereas MCMC-type approaches might need to run multiple times to be sure of having found the solution.
mocamoca · 2 years ago
I agree.

Proving that using cheats and autocomplete does not break sample independence, and keeps the sampling as random as possible, would be needed here for stats beginners such as me!

Drunk dialing but having a human operator that each time tries to help you connect with someone, even if you mistyped the number... Doesn't look random to me.

However, I did not read the 85-page paper... Maybe it's addressed there.

Mogzol · 2 years ago
Pages 9 & 10 of the paper [1] go into some detail:

> By constructing a search query that joins together 32 randomly generated identifiers using the OR operator, the efficiency of each search increases by a factor of 32. To further increase search efficiency, randomly generated identifiers can take advantage of case insensitivity in YouTube’s search engine. A search for either "DQW4W9WGXCQ” or “dqw4w9wgxcq” will return an extant video with the ID “dQw4w9WgXcQ”. In effect, YouTube will search for every upper- and lowercase permutation of the search query, returning all matches. Each alphabetical character in positions 1 to 10 increases search efficiency by a factor of 2. Video identifiers with only alphabetical characters in positions 1 to 10 (valid characters for position 11 do not benefit from case-insensitivity) will maximize search efficiency, increasing search efficiency by a factor of 1024. By constructing search queries with 32 randomly generated alphabetical identifiers, each search can effectively search 32,768 valid video identifiers.

They also mention some caveats to this method, namely, that it only includes publicly listed videos:

> As our method uses YouTube search, our random set only includes public videos. While an alternative brute force method, involving entering video IDs directly without the case sensitivity shortcut that requires the search engine, would include unlisted videos, too, it still would not include private videos. If our method did include unlisted videos, we would have omitted them for ethical reasons anyway to respect users’ privacy through obscurity (Selinger & Hartzog, 2018). In addition to this limitation, there are considerations inherent in our use of the case insensitivity shortcut, which trusts the YouTube search engine to provide all matching results, and which oversamples IDs with letters, rather than numbers or symbols, in their first ten characters. We do not believe these factors meaningfully affect the quality of our data, and as noted above a more direct “brute force” method - even for the purpose of generating a purely random sample to compare to our sample - would not be computationally realistic.

[1]: https://journalqd.org/article/view/4066
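A rough sketch of how such a query could be constructed (the exact search syntax and the set of valid final ID characters are my assumptions; the quoted passage only specifies 32 random all-letter IDs joined with OR):

    import random
    import string

    # Assumed: positions 1-10 are letters only (to maximize the case-insensitivity
    # gain) and position 11 is one of the 16 characters a YouTube ID can end with.
    LAST_CHARS = "AEIMQUYcgkosw048"

    def random_id():
        return ("".join(random.choice(string.ascii_letters) for _ in range(10))
                + random.choice(LAST_CHARS))

    query = " OR ".join(f'"{random_id()}"' for _ in range(32))
    print(query)  # one search covering ~32 * 2**10 = 32,768 candidate IDs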

stochtastic · 2 years ago
This is a fun dataset. The paper leaves a slight misimpression about channel statistics: IIUC, they do not reweight by sampling propensity when looking at subscriber counts (channels should be weighted by ~1/# of videos per channel, since the probability of a given channel appearing is proportional to the number of public videos that channel has, as long as the sample is a small fraction of the population).
bevan · 2 years ago
I noticed that too. Seems very unlikely that 1,000,000 subscribers represents the 98th percentile and not the 99.999th.