> there’s unfortunately no easy way to sort out the least viewed pages, short of a very slow linear search for the needle in the haystack
So this sentence made me wonder why he didn't actually just go do it; 6M pages isn't really all that big of a data set. Turns out that it's a problem of how the data is arranged. The raw files are divided by year/month/day/second[0], and then each of those seconds is a zipped file of about 500MB in size, where pages are listed like this:
en.m Alcibiades_(character) 1 0
en.m Alcibiades_DeBlanc 2 0
en.m Alcibiades_the_Schoolboy 1 0
en.m Alcide_De_Gasperi 2 0
en.m Alcide_Herveaux 1 0
en.m Alcide_Laurin 1 0
en.m Alcide_de_Gasperi 1 0
en.m Alcides_Escobar 3 0
en.m Alcimus_(mythology) 1 0
with en.m getting a separate listing from the desktop en, and all the other language codes getting their own listings too. So just collating the data would be a huge job.
The API also doesn't offer a specific list of all pages by time, so you'd have to go and make a separate call for each of the 6M pages for a given year and then collate that data too.
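For what it's worth, the collating step itself is not much code; the cost is in reading every file. Below is a minimal sketch, assuming each row has the four space-separated fields shown above (domain code, title, view count, and a trailing number) and that "en" and "en.m" are the desktop and mobile rows for the same wiki. The function name `collate_views` is made up for illustration.

```python
from collections import Counter

def collate_views(lines, lang="en"):
    """Sum view counts per title, merging desktop and mobile rows for one wiki.

    Assumes rows look like "en.m Alcide_Laurin 1 0" (as in the sample above).
    """
    totals = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 4 or not parts[2].isdigit():
            continue  # skip rows that don't match the assumed layout
        domain, title, views, _unused = parts
        if domain not in (lang, lang + ".m"):
            continue  # a different wiki (or a different project)
        totals[title] += int(views)
    return totals

sample = [
    "en.m Alcide_De_Gasperi 2 0",
    "en Alcide_De_Gasperi 5 0",
    "en.m Alcide_Laurin 1 0",
]
totals = collate_views(sample)
least_viewed = min(totals.items(), key=lambda kv: kv[1])
```

To run it over a real dump you would feed it lines from each file in turn (e.g. via `gzip.open(path, "rt")`) and add the resulting counters together. Note that a title only shows up in `totals` if it appears in at least one file, so pages absent from the files would need to come from a separate list of all titles.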
Should be titled differently: all of the least viewed articles are now going to be viewed, a lot. I'm really having trouble myself not visiting them. Mind you, it's time some obscure moth species articles got some improvement :)
This is the interesting number paradox: the least interesting number in the world ceases to be boring once it has the interesting property of being the least interesting attached to it.
Also a paradox of any value system based on scarcity and obscurity of what is ultimately a zero-marginal-cost production function.
Cult followings of music, art, or fashion; mass-produced luxury fashion labels (where market growth is the kiss of death, though a new brand can, of course, be easily launched, and is); a secret vacation spot (where travel costs are largely the same as to any other point on Earth); etc.
I noticed moths came up a couple of times. A brief guess why: Wikipedia says it's "one of most speciose orders" (besides flies and beetles). But maybe it has the most pages because they are so easy to catch with a light in your backyard that it's far easier to name them all than something like parasitoid wasps [1]?
Although this same paper says that "more species of beetles (>350,000) have been described than any other order of animal, insect or otherwise"
Mother here. The examples listed in the article aren't of much use, as there's not much there you can use to identify species. If you're interested in moths, you'll visit a specialist web site, and if you aren't, you won't visit the Wikipedia pages either. The web site I use is https://www.ukmoths.org.uk, which has photographs of each (UK) species, as well as its range, food plant, and flying time.
Regarding the number of beetle species, JBS Haldane (probably) said that "God is incredibly fond of beetles" (or words to that effect).
As an aside, is mother actually the term? My first readthrough of your comment had me envisioning an incredibly knowledgeable parent with a cool hobby before I realized what you meant!
Interestingly, two of these — one of the disambiguation pages, and one of the least-viewed non-disambiguation pages — were created by the same user, Carlossuarez46, and are both about a location in Iran.
Furthermore, the user Ruigeroeland has contributed to three of the insect pages.
So I guess these users have the distinction of having contributed to multiple of the least-seen articles on Wikipedia. (They probably also contributed to many widely-seen articles!)
This is addressed at the end of the article. Many are probably made with automated tools:
> For example, the 12-word stub Pottallinda (5 views last year) was created on 18 January 2011 by User:Ser Amantio di Nicolao, who happens to be the most active editor in all of Wikipedia (as measured by number of edits). Within 60 seconds of creating this page, the same editor also created Polmalagama, Polommana, Polpitiya, Polwatta, and dozens of other substantially identical articles.
> The mathematician and philosopher Alex Bellos suggested in 2014 that a candidate for the lowest uninteresting number would be 247 because it was, at the time, "the lowest number not to have its own page on Wikipedia".
And a quick way to get there is the Page information link in the sidebar. On the info page for each page[0] is a count and plot of views in the past 30 days, and at the bottom is a link to external tools like the one above.
Great sleuthing. Is there a convenient alternative algorithm that might be used for random article that would also continue to work fine as more articles are added or removed?
You can make a binary tree where each node stores the total number of nodes below it (children counted recursively). Updating these counts is O(log N) as long as the tree stays balanced.
To insert a node, generate a random bit string: descend the tree and when the counts are equal then take the branch corresponding to that position in the bit string. When the counts are unequal, take the branch with the smallest count.
To remove a node just remove it from the tree and update the counts up along its path to the root. This assumes that articles are added as often as they are removed.
To sample, just generate another random bit string and traverse the tree according to it.
But I agree with the sibling comment that the current non-uniform algorithm is fine for a cute “Random article” button.
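Here is a rough sketch of that counted tree, in case it helps. It is one possible reading of the description, not a reference implementation: the class names and the `where` lookup table are made up for illustration, and sampling uses the stored counts (pick a rank, then descend) instead of a raw bit string, so the result is uniform even when the tree is not perfectly balanced.

```python
import random

class Node:
    __slots__ = ("item", "left", "right", "parent", "size")

    def __init__(self, item, parent=None):
        self.item, self.parent = item, parent
        self.left = self.right = None
        self.size = 1  # this node plus everything below it

def _size(node):
    return node.size if node else 0

class RandomTree:
    def __init__(self, rng=None):
        self.root = None
        self.where = {}  # item -> node currently holding it (assumed helper)
        self.rng = rng or random.Random()

    def __len__(self):
        return _size(self.root)

    def insert(self, item):
        if item in self.where:
            return
        if self.root is None:
            self.root = self.where[item] = Node(item)
            return
        node = self.root
        while True:
            node.size += 1
            l, r = _size(node.left), _size(node.right)
            # smaller side wins; a random bit breaks ties
            go_left = l < r or (l == r and self.rng.getrandbits(1) == 0)
            child = node.left if go_left else node.right
            if child is None:
                new = self.where[item] = Node(item, node)
                if go_left:
                    node.left = new
                else:
                    node.right = new
                return
            node = child

    def remove(self, item):
        node = self.where.pop(item)
        leaf = node
        while leaf.left or leaf.right:  # walk down the heavier side to a leaf
            leaf = leaf.left if _size(leaf.left) >= _size(leaf.right) else leaf.right
        if leaf is not node:  # move the leaf's item into the vacated node
            node.item = leaf.item
            self.where[node.item] = node
        parent = leaf.parent
        if parent is None:
            self.root = None
        elif parent.left is leaf:
            parent.left = None
        else:
            parent.right = None
        while parent:  # fix counts along the path to the root
            parent.size -= 1
            parent = parent.parent

    def sample(self):
        k = self.rng.randrange(len(self))  # raises ValueError if empty
        node = self.root
        while k:
            k -= 1
            l = _size(node.left)
            if k < l:
                node = node.left
            else:
                k, node = k - l, node.right
        return node.item
```

Removal here swaps in a leaf from the heavier side, which is one way to delete an interior node without restructuring; other choices would work too.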
If the articles have some reasonably dense small IDs associated with them, then an easy algorithm is to simply pick a random ID between 0 and the max, check if it's a still-existing article, and repeat if not. There are plenty of distributed queue designs capable of distributing IDs to servers so they can be given to new articles, and re-using the id from a deleted article is fine to keep things denser.
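A minimal sketch of that retry loop, assuming IDs run from 0 to some known maximum and that there is a cheap way to check whether an ID is still live. The names `random_article`, `exists`, and `max_tries` are placeholders, not anything MediaWiki provides.

```python
import random

def random_article(max_id, exists, rng=random, max_tries=1000):
    """Pick a uniformly random live ID in [0, max_id] by rejection sampling.

    `exists` is an assumed callback that returns True for IDs that still
    belong to an article.
    """
    for _ in range(max_tries):
        candidate = rng.randint(0, max_id)  # inclusive on both ends
        if exists(candidate):
            return candidate
    raise RuntimeError("no live ID found; the ID space may be too sparse")

live_ids = {1, 5, 9}
picked = random_article(10, live_ids.__contains__)
```

Every live ID is equally likely, and the expected number of tries is the size of the ID range divided by the number of live IDs, which is why keeping the IDs dense matters.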
The database itself may internally store a denser identifier than an autoincrementing primary key: for example, ctid in Postgres or, under certain conditions, rowid in SQLite. These should be fully dense after a vacuum, and rejection sampling can be used to paper over tombstones generated between vacuums.
Seems like it would be pretty easy to just maintain an alternative index of articles, with an integer ID counting from 1 to N. Then just pick a random number from 1 to N.
It could even be maintained entirely outside the Wikimedia servers, relying on database dumps.
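One way such an index could stay gap-free as articles come and go is to move the last entry into the slot of whatever gets deleted. The sketch below is only an illustration of that idea; `DenseIndex` and its method names are made up, and it uses 0-based slots internally where the comment says 1 to N.

```python
import random

class DenseIndex:
    """Keeps titles in slots 0..N-1 with no gaps, so a random slot is always valid."""

    def __init__(self):
        self.titles = []  # slot -> title
        self.slot = {}    # title -> slot

    def add(self, title):
        if title not in self.slot:
            self.slot[title] = len(self.titles)
            self.titles.append(title)

    def remove(self, title):
        i = self.slot.pop(title)
        last = self.titles.pop()
        if i < len(self.titles):
            # the removed title was not in the last slot: fill the hole
            self.titles[i] = last
            self.slot[last] = i

    def pick(self, rng=random):
        return self.titles[rng.randrange(len(self.titles))]

index = DenseIndex()
for name in ["Pottallinda", "Polmalagama", "Polommana", "Polpitiya"]:
    index.add(name)
index.remove("Polmalagama")
choice = index.pick()
```

Adds, removes, and picks are all constant time, at the cost of slot numbers not being stable across deletions.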
[0] For example, https://dumps.wikimedia.org/other/pageviews/2019/2019-01/
https://en.wikipedia.org/wiki/Interesting_number_paradox
[1] https://www.biorxiv.org/content/10.1101/274431v1.full.pdf
Just speculation...
"the smallest uninteresting number is itself interesting because it is the smallest uninteresting number, thus producing a contradiction."
https://en.wikipedia.org/wiki/Interesting_number_paradox
Incidentally, the last line of that article is:
> The mathematician and philosopher Alex Bellos suggested in 2014 that a candidate for the lowest uninteresting number would be 247 because it was, at the time, "the lowest number not to have its own page on Wikipedia".
(I didn't see this linked in the post. Apologies if I missed it)
[0] example: https://en.wikipedia.org/w/index.php?title=Weimer_Township&a...
However, there was a brief period when randomPage used Elasticsearch to get the random article instead.
Maybe a hashing function could work.