onychomys · 3 years ago
> there’s unfortunately no easy way to sort out the least viewed pages, short of a very slow linear search for the needle in the haystack

So this sentence made me wonder why he didn't actually just go do it; 6M pages isn't really all that big of a data set. Turns out that it's a problem of how the data is arranged. The raw files are divided by year/month/day/hour[0], and then each of those hours is a zipped file of about 500MB in size, where pages are listed like this:

en.m Alcibiades_(character) 1 0

en.m Alcibiades_DeBlanc 2 0

en.m Alcibiades_the_Schoolboy 1 0

en.m Alcide_De_Gasperi 2 0

en.m Alcide_Herveaux 1 0

en.m Alcide_Laurin 1 0

en.m Alcide_de_Gasperi 1 0

en.m Alcides_Escobar 3 0

en.m Alcimus_(mythology) 1 0

with en.m getting a separate listing from the desktop en, and all the other country codes getting their own listings too. So just collating the data would be a huge job.
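
A rough sketch of what that collation step might look like for a single file, assuming each line is four whitespace-separated fields (project code, title, view count, and a trailing field that is ignored here). The `en` and `de` rows in `SAMPLE` and the `collate` helper are made up for illustration; only the `en.m` rows come from the listing above.

```python
from collections import Counter

# A few lines in the dump's layout. The "en" and "de" rows are invented
# for illustration; the "en.m" rows are copied from the listing above.
SAMPLE = """\
en.m Alcibiades_(character) 1 0
en Alcide_De_Gasperi 4 0
en.m Alcide_De_Gasperi 2 0
de Alcide_De_Gasperi 7 0
en.m Alcides_Escobar 3 0
"""

def collate(lines, language="en"):
    """Sum views per title across the desktop and mobile listings of one language."""
    totals = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        project, title, views, _ignored = parts
        if project in (language, language + ".m"):
            totals[title] += int(views)
    return totals

totals = collate(SAMPLE.splitlines())
# Sort ascending by view count, then by title, to surface the least viewed pages.
least_viewed = sorted(totals.items(), key=lambda kv: (kv[1], kv[0]))
```

Running the same loop over every file in a year and adding the counters together would give yearly totals, though pages with zero views never appear in the dumps at all and would have to come from a separate list of titles.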

The API also doesn't offer a specific list of all pages by time, so you'd have to go and make a separate call for each of the 6M pages for a given year and then collate that data too.

[0]for example, https://dumps.wikimedia.org/other/pageviews/2019/2019-01/

melony · 3 years ago
MapReduce and dump everything into something like DuckDB.
softgrow · 3 years ago
Should be titled differently, all of the least viewed articles are now going to be viewed, lots. I'm really having trouble myself not visiting them. Mind you it's time some obscure moth species articles got some improvement :)
thematrixturtle · 3 years ago
This is the interesting number paradox: the least interesting number in the world ceases to be boring once it has the interesting property of being the least interesting attached to it.

https://en.wikipedia.org/wiki/Interesting_number_paradox

McBeige · 3 years ago
Doesn't that just mean the least interesting number definitely exists but is simply unknowable?
dredmorbius · 3 years ago
Also a paradox of any value system based on scarcity and obscurity of what is ultimately a zero-marginal-cost production function.

Cult-followings of music, art, or fashion, mass-produced luxury fashion labels (where market growth is the kiss of death, though a new brand can, of course, be easily launched, and is), a secret vacation spot (where travel costs are largely the same as to any other point on Earth), etc., etc.

simonh · 3 years ago
It's not as if 'not being visited' is some special protected status we need to preserve for posterity.

dmix · 3 years ago
I noticed moths came up a couple times, a brief guess why: Wikipedia says it's "one of most speciose orders" (besides flies and beetles). But maybe it has the most pages because they are so easy to catch with a light in your backyard that it's far easier to name them all than something like parasitoid wasps [1]?

Although this same paper says that "more species of beetles (>350,000) have been described than any other order of animal, insect or otherwise"

[1] https://www.biorxiv.org/content/10.1101/274431v1.full.pdf

Just speculation...

DonaldFisk · 3 years ago
Mother here. The examples listed in the article aren't of much use, as there's not much there you can use to identify species. If you're interested in moths, you'll visit a specialist web site, and if you aren't, you won't visit the Wikipedia pages either. The web site I use is https://www.ukmoths.org.uk, which has photographs of each (UK) species, as well as its range, food plant, and flying time.

Regarding the number of beetle species, JBS Haldane (probably) said that "God is incredibly fond of beetles" (or words to that effect).

beowulfey · 3 years ago
As an aside, is mother actually the term? My first readthrough of your comment had me envisioning an incredibly knowledgeable parent with a cool hobby before I realized what you meant!
jl6 · 3 years ago
So Wikipedia, just like the real world, is filled with staggering quantities of different types of bugs.
simonh · 3 years ago
My brain parsed that as if you were talking about software bugs for a second.
DavidSJ · 3 years ago
Interestingly, two of these — one of the disambiguation pages, and one of the least-viewed non-disambiguation pages — were created by the same user, Carlossuarez46, and are both about a location in Iran.

Furthermore, the user Ruigeroeland has contributed to three of the insect pages.

So I guess these users have the distinction of having contributed to multiple of the least-seen articles on Wikipedia. (They probably also contributed to many widely-seen articles!)

shmageggy · 3 years ago
This is addressed at the end of the article. Many are probably made with automated tools

> For example, the 12-word stub Pottallinda (5 views last year) was created on 18 January 2011 by User:Ser Amantio di Nicolao, who happens to be the most active editor in all of Wikipedia (as measured by number of edits). Within 60 seconds of creating this page, the same editor also created Polmalagama, Polommana, Polpitiya, Polwatta, and dozens of other substantially identical articles.

iamacyborg · 3 years ago
Yeah, he’s likely using AutoWikiBrowser for this, it’s a great tool for automating MediaWiki tasks.
bryanrasmussen · 3 years ago
Since they have links to all of these least viewed articles they will soon cease to be the least viewed articles.
inopinatus · 3 years ago
yesenadam · 3 years ago
Also seems related to the interesting number paradox:

"the smallest uninteresting number is itself interesting because it is the smallest uninteresting number, thus producing a contradiction."

https://en.wikipedia.org/wiki/Interesting_number_paradox

Incidentally, the last line of that article is:

> The mathematician and philosopher Alex Bellos suggested in 2014 that a candidate for the lowest uninteresting number would be 247 because it was, at the time, "the lowest number not to have its own page on Wikipedia".

IAmEveryone · 3 years ago
Note that page view data is also available at https://pageviews.wmcloud.org/?project=en.wikipedia.org&plat...

(I didn't see this linked in the post. Apologies if I missed it)

mgdlbp · 3 years ago
And a quick way to get there is the Page information link in the sidebar. On the info page for each page[0] is a count and plot of views in the past 30 days, and at the bottom is a link to external tools like the one above.

[0] example: https://en.wikipedia.org/w/index.php?title=Weimer_Township&a...

onos · 3 years ago
Great sleuthing. Is there a convenient alternative algorithm that might be used for random article that would also continue to work fine as more articles are added or removed?
zarzavat · 3 years ago
You can make a binary tree where each node counts the total (recursive) number of nodes below it. Updating these counts is O(log N).

To insert a node, generate a random bit string: descend the tree and when the counts are equal then take the branch corresponding to that position in the bit string. When the counts are unequal, take the branch with the smaller count.

To remove a node just remove it from the tree and update the counts up along its path to the root. This assumes that articles are added as often as they are removed.

To sample, just generate another random bit string and traverse the tree according to it.

But I agree with the sibling comment that the current non-uniform algorithm is fine for a cute “Random article” button.
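
The counted tree described above could be sketched roughly like this. The `Node`, `size`, `insert`, and `sample` names are hypothetical, and removal is left out. One deliberate substitution: instead of following a raw bit string when sampling, this version picks a rank from the root count and walks down to it, which stays exactly uniform even when the tree is not perfectly balanced.

```python
import random

class Node:
    def __init__(self, item):
        self.item = item
        self.left = None
        self.right = None
        self.count = 1  # nodes in this subtree, including this one

def size(node):
    return node.count if node else 0

def insert(root, item, rng=random):
    """Add item, steering toward the smaller subtree; ties use a random bit."""
    new = Node(item)
    if root is None:
        return new
    node = root
    while True:
        node.count += 1  # the new node will end up somewhere below
        left, right = size(node.left), size(node.right)
        go_left = left < right or (left == right and rng.getrandbits(1) == 0)
        if go_left:
            if node.left is None:
                node.left = new
                return root
            node = node.left
        else:
            if node.right is None:
                node.right = new
                return root
            node = node.right

def sample(root, rng=random):
    """Pick a rank uniformly in [0, root.count) and walk down to that node."""
    k = rng.randrange(root.count)
    node = root
    while True:
        left = size(node.left)
        if k < left:
            node = node.left
        elif k == left:
            return node.item
        else:
            k -= left + 1
            node = node.right
```

Because insertion always favors the smaller side, sibling subtrees never differ by more than one node, so both operations touch O(log N) nodes.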

HALtheWise · 3 years ago
If the articles have some reasonably dense small IDs associated with them, then an easy algorithm is to simply pick a random ID between 0 and the max, check if it's a still-existing article, and repeat if not. There are plenty of distributed queue designs capable of distributing IDs to servers so they can be given to new articles, and re-using the id from a deleted article is fine to keep things denser.
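
A minimal sketch of that retry loop, assuming the live articles sit in a dict keyed by integer ID and that `max_id` is at least as large as every key. The `random_article` helper and the toy `articles` table are hypothetical.

```python
import random

def random_article(articles, max_id, rng=random):
    """Draw IDs uniformly from 0..max_id until one maps to a live article."""
    if not articles:
        raise LookupError("no articles to sample from")
    while True:
        candidate = rng.randint(0, max_id)  # inclusive on both ends
        if candidate in articles:
            return candidate, articles[candidate]

# Toy table: ID 2 belonged to a deleted article, so it is a gap.
articles = {0: "A", 1: "B", 3: "D", 4: "E"}
article_id, title = random_article(articles, max_id=4)
```

The expected number of draws is (max_id + 1) divided by the number of live articles, so this stays cheap as long as the ID space is mostly full.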
anewhnaccount2 · 3 years ago
The database itself may internally store a denser identifier than an autoincrementing primary key, for example ctid in Postgres or, under certain conditions, rowid in SQLite. These should be fully dense after a vacuum, and rejection sampling can be used to paper over tombstones generated between vacuums.
stevage · 3 years ago
Seems like it would be pretty easy to just maintain an alternative index of articles, with an integer ID counting from 1 to N. Then just pick a random number from 1 to N.

It could even be maintained entirely outside the Wikimedia servers, relying on database dumps.
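
One way such a side index might stay dense when articles are deleted is to move the last entry into the freed slot. This is only a sketch: the `ArticleIndex` class is hypothetical, and it numbers slots from 0 to N-1 rather than 1 to N.

```python
import random

class ArticleIndex:
    """Dense list of titles plus a title -> slot map, so add, remove, and sample are O(1)."""

    def __init__(self):
        self._titles = []    # slot number -> title, with no gaps
        self._position = {}  # title -> slot number

    def add(self, title):
        if title in self._position:
            return  # already indexed
        self._position[title] = len(self._titles)
        self._titles.append(title)

    def remove(self, title):
        pos = self._position.pop(title)
        last = self._titles.pop()
        if last != title:
            # Fill the hole with the former last title to keep slots dense.
            self._titles[pos] = last
            self._position[last] = pos

    def sample(self, rng=random):
        return self._titles[rng.randrange(len(self._titles))]

    def __len__(self):
        return len(self._titles)

index = ArticleIndex()
for name in ["A", "B", "C", "D"]:
    index.add(name)
index.remove("B")  # "D" moves into the slot "B" occupied
```

Slot numbers are not stable across deletions, which is fine when they exist only to pick a random article.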

bawolff · 3 years ago
I disagree that anything is wrong with the current algorithm.

However there was a brief time period where randomPage used elasticsearch to get the random article instead.

amai · 3 years ago
To get a more equal spacing between the articles maybe one can make use of quasirandom numbers instead of random numbers: https://en.wikipedia.org/wiki/Low-discrepancy_sequence
blackboxlogic · 3 years ago
For the smallest change with the fairest result, I would reseed the random IDs on a schedule.
bombcar · 3 years ago
Added is easy. It’s removed that is the tricky part, though perhaps you could reuse IDs somehow.

Maybe a hashing function could work.