verginer (u/verginer)

verginer commented on Raspberry Pi Drag Race: Pi 1 to Pi 5 – Performance Comparison the-diy-life.com/raspberr... · Posted by u/verginer

verginer · 17 days ago

I finally found a job for my Raspberry Pi 1 Model B from 2012. It had been sitting in a drawer for years, but about two years ago I added it to my Tailscale network as an exit node.

It’s a single-core 700MHz ARMv6 chip with 512MB of RAM. It’s a fossil: a Pi 5 is 600x faster (according to the video). But for the 'low-bandwidth' task of routing some banking traffic or running a few changedetection watches via a Hetzner VPS (where the actual Docker image runs), it’s rock solid. There’s something deeply satisfying about giving 'e-waste' a second life as a weekend project.

verginer commented on How big is YouTube? ethanzuckerman.com/2023/1... · Posted by u/MBCook

Mogzol · 2 years ago

Pages 9 and 10 of the paper [1] go into some detail:

> By constructing a search query that joins together 32 randomly generated identifiers using the OR operator, the efficiency of each search increases by a factor of 32. To further increase search efficiency, randomly generated identifiers can take advantage of case insensitivity in YouTube’s search engine. A search for either “DQW4W9WGXCQ” or “dqw4w9wgxcq” will return an extant video with the ID “dQw4w9WgXcQ”. In effect, YouTube will search for every upper- and lowercase permutation of the search query, returning all matches. Each alphabetical character in positions 1 to 10 increases search efficiency by a factor of 2. Video identifiers with only alphabetical characters in positions 1 to 10 (valid characters for position 11 do not benefit from case-insensitivity) will maximize search efficiency, increasing search efficiency by a factor of 1024. By constructing search queries with 32 randomly generated alphabetical identifiers, each search can effectively search 32,768 valid video identifiers.

They also mention some caveats to this method, namely, that it only includes publicly listed videos:

> As our method uses YouTube search, our random set only includes public videos. While an alternative brute force method, involving entering video IDs directly without the case sensitivity shortcut that requires the search engine, would include unlisted videos, too, it still would not include private videos. If our method did include unlisted videos, we would have omitted them for ethical reasons anyway to respect users’ privacy through obscurity (Selinger & Hartzog, 2018). In addition to this limitation, there are considerations inherent in our use of the case insensitivity shortcut, which trusts the YouTube search engine to provide all matching results, and which oversamples IDs with letters, rather than numbers or symbols, in their first ten characters. We do not believe these factors meaningfully affect the quality of our data, and as noted above a more direct “brute force” method - even for the purpose of generating a purely random sample to compare to our sample - would not be computationally realistic.

[1]: https://journalqd.org/article/view/4066
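The arithmetic in the first quoted passage can be checked in a few lines. This is only a restatement of the numbers the paper gives (32 identifiers per query, a factor of 2 for each of 10 alphabetical positions); the variable names are made up for illustration.

```python
# Numbers taken from the quoted passage; names are illustrative only.
IDS_PER_QUERY = 32          # identifiers joined with OR in one search
CASE_FOLDED_POSITIONS = 10  # positions 1 to 10, each alphabetical

# Each alphabetical position matches both its upper- and lowercase form.
case_variants_per_id = 2 ** CASE_FOLDED_POSITIONS

# One query therefore covers this many distinct case-sensitive IDs.
ids_covered_per_query = IDS_PER_QUERY * case_variants_per_id

print(case_variants_per_id)   # 1024
print(ids_covered_per_query)  # 32768
```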

verginer · 2 years ago

Good observation, but they also acknowledge:

> there are considerations inherent in our use of the case insensitivity shortcut, which trusts the YouTube search engine to provide all matching results, and which oversamples IDs with letters, rather than numbers or symbols, in their first ten characters. We do not believe these factors meaningfully affect the quality of our data, and as noted above a more direct “brute force” method - even for the purpose of generating a purely random sample

In short, I do believe that the sample is valuable, but it is not a true random sample in the spirit in which the post is written: there is a heuristic to get "more hits".
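To put a rough number on the "oversamples IDs with letters" caveat, here is a back-of-the-envelope sketch. It assumes IDs are drawn from a 64-symbol URL-safe alphabet (A-Z, a-z, 0-9, "-", "_"); the quoted passage does not state the alphabet, so treat that as an assumption.

```python
import string

# Assumption (not stated in the quote): 64 possible symbols per position.
ALPHABET_SIZE = 64
LETTERS = len(string.ascii_letters)  # 52 upper- and lowercase letters
POSITIONS = 10                       # the first ten characters of the ID

# Share of the ID space whose first ten characters are all letters,
# i.e. the only part that letters-only queries can ever reach.
letters_only_share = (LETTERS / ALPHABET_SIZE) ** POSITIONS

print(f"{letters_only_share:.3f}")
```

Under that assumption, letters-only queries reach about one eighth of the ID space. Whether this biases the estimate depends on whether videos are spread evenly across the whole space, which is the part that has to be taken on trust.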

verginer commented on How big is YouTube? ethanzuckerman.com/2023/1... · Posted by u/MBCook

stravant · 2 years ago

The data probably wouldn't look so clean if it were skewed. If Google were doing something interesting it probably wouldn't be skewed only by a little bit.

verginer · 2 years ago

Admittedly, I did not read the linked paper. But my point is not about Google doing something funny. Even if we assume that IDs are truly random and uniformly distributed, the sampling method still has to be iid. This problem is similar to density estimation, where rejection sampling is super inefficient but converges to the correct solution, whereas MCMC-type approaches might need to be run multiple times to be sure they have found the solution.

verginer commented on How big is YouTube? ethanzuckerman.com/2023/1... · Posted by u/MBCook

verginer · 2 years ago

The author notes that they used "cheats". Depending on what these do, the iid assumption (that the samples are independent) could be violated. If the method is akin to snowball sampling, it could have an "excessive" success rate, thereby inflating the numbers.

> Jason found a couple of cheats that makes the method roughly 32,000 times as efficient, meaning our “phone call” connects lots more often
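For context, the "phone call" idea amounts to a hit-rate estimate: probe random IDs, count how many exist, and scale up by the size of the ID space. The toy simulation below is a sketch under the assumption that probes are independent and uniform, which is exactly the iid condition in question. The space size, true count, and function name are invented for illustration and have nothing to do with YouTube's real numbers.

```python
import random

def estimate_population(space_size, occupied, n_probes, rng):
    """Estimate how many IDs exist from the hit rate of uniform random probes."""
    hits = sum(1 for _ in range(n_probes) if rng.randrange(space_size) in occupied)
    return hits / n_probes * space_size

rng = random.Random(0)  # fixed seed so the run is repeatable

# Toy numbers, chosen only to keep the simulation fast.
SPACE_SIZE = 1_000_000
TRUE_COUNT = 20_000
occupied = set(rng.sample(range(SPACE_SIZE), TRUE_COUNT))

estimate = estimate_population(SPACE_SIZE, occupied, 200_000, rng)
print(round(estimate), "estimated vs", TRUE_COUNT, "true")
```

With independent uniform probes the estimate lands close to the true count. If the probes were correlated with where the hits are, as in snowball sampling, the hit rate and therefore the estimate would be inflated.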