I happen to have a corpus which includes pretty much every word ever written in a book, including many misspelled, mistranscribed, or otherwise non-dictionary words.
After eliminating nonsense, non-English, or other mistakes, I think the real winner, coming in at 12 characters, is:
teetertotter
That's a relatively common word. Even though it's usually seen hyphenated, the unhyphenated form is recognized by all the online dictionaries I found.
----
And some other candidates, just for fun, in the 13 or 12 character range:
"proproprietor" seems more like a misspelling. Should have a hyphen, or be two words.
"priorityqueue" is of course familiar to hackers here, but is more of a jargon term, and is only concatenated due to appearing in source code. Invariably it's two words when actually written out.
"preprototype" is used exactly as is, in lots of scientific papers, up to the current day. That's a pretty good one too, and could be a tie for "teetertotter", but it's verging on jargon.
How did you scrape that data? How do you store and retrieve it? Is it just a standard db or a vector db?
Sorry for the questions, but it seems like an interesting, yet probably common data set and as someone who is venturing down this path, I’d like to learn more about building my own dataset similar to this from scratch.
Works for Ubuntu, too. My Colemak self can only get fluffy (6) from the front row, that's the longest word. Middle row really shines though, I can get hardheartedness (15) or assassinations (14).
Dyalog APL, using the enable1 wordlist. I don't know its origins, but you can get it from Peter Norvig's website https://norvig.com/ngrams/enable1.txt or various GitHub repos and Gists:
Reading from the right, "test each word by removing 'qwertyuiop' and see if it leaves an empty string, use the test results to filter the input word list, descending-sort the length of each word and use that to arrange(index) the remaining words, flatten the array and take the top 7".
(Longest from the middle row is 'haggadahs' then 'alfalfas', third row is 'mm')
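For anyone who doesn't read APL, the same steps can be sketched in Python. This is only an illustration of the description above, not a translation of the original one-liner: the names `QWERTY_ROWS` and `longest_on_row` are made up here, and the short `sample` list stands in for a real wordlist file.

```python
# Sketch only: QWERTY_ROWS and longest_on_row are illustrative names, not from the thread.
QWERTY_ROWS = {
    "top": "qwertyuiop",
    "middle": "asdfghjkl",
    "bottom": "zxcvbnm",
}


def longest_on_row(words, row, top=7):
    """Return up to `top` words spelled only with letters in `row`, longest first."""
    delete_row = str.maketrans("", "", row)  # translation table that deletes the row's letters
    # A word qualifies if nothing is left once the row's letters are removed.
    hits = [w for w in words if w and w.lower().translate(delete_row) == ""]
    # Longest first; ties are broken alphabetically so the result is deterministic.
    return sorted(hits, key=lambda w: (-len(w), w))[:top]


# Tiny stand-in list; a real run would read one word per line from a file
# such as enable1.txt or /usr/share/dict/words.
sample = ["typewriter", "rupturewort", "teetertotter", "proprietor",
          "haggadahs", "alfalfas", "mm", "keyboard"]

for name, row in QWERTY_ROWS.items():
    print(name, longest_on_row(sample, row, top=3))
```

Passing a different row string should cover other layouts too, for example `"arstdhneio"` for the Colemak home row.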
So it seems that in addition to having parts of its kernel based on FreeBSD, there are also a lot of similarities between the wordlist at /usr/share/dict/words on macOS and that of FreeBSD :) perhaps they're even the same?
C:\>grep '^[qwertyuiop]*$' /usr/share/dict/words | awk '{print length, $0}' | sort -rn | head
Bad command or file name
Bad command or file name
SORT: Too many parameters
Bad command or file name
Here's something you may not know, the *-insane dictionaries, which are giant, are functions of OCR output and are known to contain lots of errors.
I found a few earlier this year and I was going to file a bug so I did some research to find out this is a known and expected behavior.
If the computer, say, reads stubborn as stubbum, the smaller dictionaries are the ones that have been cross-checked to filter those out. The insane ones have not. It's a good name: "lack of sanity checks".
Here's an example word I found, "suabilities". You'll find it only on wordlist sites that used this wordlist and I guess, now here.
Just saw this. I've got no idea how kanji OCR works, but I do know enough Japanese to know what most of those characters are attempting to refer to; my penmanship has certainly been that bad. I still don't understand how it would make its way into the standard unless that part wasn't written by someone who is competent in Japanese.
I wonder how often that happens - surely there are tons of people dealing with Japanese text who can't read it and just use diligence to make sure the "letters are the same"
I've used the insane dictionaries a number of times for puzzle stuff and I never knew that they were derived from OCR output. Thanks for mentioning that!
You might find the... 'translation'[1] of Genesis 1 using only keys on the Colemak home row interesting:
In the start The One has risen the stars and the earth.
The earth had no order, and nothin' resided there; and shade resided on the nonendin' 'neath. And The One rided on the seas.
Then The One said: "I desire it to shine"; and it shone.
And The One had seen the shine, that it's neat; and The One sorted the shine on one side, and the shade on the other.
The One then denoted the shine and the shade. So the nite and the shine that are date no. one had ended.
I tried to do some other fun things like going row by row with each row only contributing one letter and seeing what’s the longest word I could come up with.
If I start at the top row and go down, I can make TAXES but couldn’t think of a longer word. The third row having no vowels makes it so hard.
Starting at the bottom row and going up, I came up with CHICKEN which is delicious and neat that it ends where it started. Chickens is longer but ends on the middle row which is not as neat I feel like :(
> If I start at the top row and go down, I can make TAXES but couldn’t think of a longer word. The third row having no vowels makes it so hard.
A dictionary search turned up "paxwaxes" as the longest word I could find that starts in the top row and goes down, wrapping around to the top every three letters.
> Starting at the bottom row and going up, I came up with CHICKEN which is delicious and neat that it ends where it started. Chickens is longer but ends on the middle row which is not as neat I feel like :(
Chickens is indeed the longest.
If you start at the bottom row and go up-and-down: cataclysms, or catamarans.
If you start at the top row and go down-and-up: escapable.
If you start in the middle and go down-and-up: scarabaean
If you start in the middle and go up-and-down, I didn't find anything longer than 7 letters, and there were 39 seven-letter words, including "discard", "grandpa", and "stacked".
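If anyone wants to check these row patterns themselves, here is a rough Python sketch of how such a test could look, assuming a standard US QWERTY layout. The function name `follows` and the pattern constants are invented for this example; the actual search behind the results above wasn't posted.

```python
from itertools import cycle

# Standard US QWERTY letter rows (assumption: letters only, no punctuation keys).
ROWS = {"top": "qwertyuiop", "middle": "asdfghjkl", "bottom": "zxcvbnm"}
ROW_OF = {ch: name for name, letters in ROWS.items() for ch in letters}

# Hypothetical pattern names; each tuple repeats for as long as the word lasts.
WRAP_DOWN = ("top", "middle", "bottom")                     # taxes, paxwaxes
WRAP_UP = ("bottom", "middle", "top")                       # chicken, chickens
BOUNCE_FROM_BOTTOM = ("bottom", "middle", "top", "middle")  # cataclysms
BOUNCE_FROM_TOP = ("top", "middle", "bottom", "middle")     # escapable
MIDDLE_DOWN_FIRST = ("middle", "bottom", "middle", "top")   # scarabaean
MIDDLE_UP_FIRST = ("middle", "top", "middle", "bottom")     # discard


def follows(word, pattern):
    """True if the i-th letter of `word` sits on the i-th row of the repeating pattern."""
    # Characters that are not on any letter row map to None and never match.
    return all(ROW_OF.get(ch) == row for ch, row in zip(word.lower(), cycle(pattern)))


for word, pattern in [("paxwaxes", WRAP_DOWN), ("chickens", WRAP_UP),
                      ("catamarans", BOUNCE_FROM_BOTTOM), ("grandpa", MIDDLE_UP_FIRST)]:
    print(word, follows(word, pattern))
```

Filtering a wordlist with `follows` and sorting the hits by length would then give the longest word for each pattern.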
Related, is there a high quality plaintext dictionary file for running similar searches? I’ve spent several hours but couldn’t find one that’s both comprehensive and accurate.
What are your rules for what counts as a "word"? If you go with the basic scrabble rules (i.e. nothing that would be capitalized or punctuated) then YAWL[1] is pretty good, with the downside being the most recent version I know of is from 2008.
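As a rough illustration of that "nothing capitalized or punctuated" rule, here is a small Python sketch for filtering an arbitrary wordlist down to plain lowercase entries. It is an approximation made up for this comment (`is_plain_word` and `plain_words` are hypothetical names), not how YAWL itself was built.

```python
import string

LOWERCASE = set(string.ascii_lowercase)


def is_plain_word(entry):
    """True for non-empty entries made only of the letters a-z (no capitals, hyphens, apostrophes)."""
    entry = entry.strip()
    return bool(entry) and set(entry) <= LOWERCASE


def plain_words(lines):
    """Keep only the plain entries, with surrounding whitespace and newlines removed."""
    return [line.strip() for line in lines if is_plain_word(line)]


# Stand-in for the lines of a wordlist file.
print(plain_words(["yawl\n", "Yawl\n", "teeter-totter\n", "it's\n", "rupturewort\n"]))
```

Note this only screens the spelling of each entry; it says nothing about whether an entry is a real word.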
FYI, rupturewort is the sole 11-letter word answer to TFA in YAWL; found using:
Some common Linux distributions have packages that provide word list files in /usr/share/dict/ in several languages. English files are likely to be preinstalled. I've had plenty of fun practising regex and pipes with these word lists!
"reporterette" is antique, but appeared in a NYTimes headline as late as 2018 - the author reflected on her career, including sexist epithets. https://www.nytimes.com/2018/12/02/opinion/george-hw-bush-ma...
lol, it's a 42MB text file from Google Books Ngrams.
The format looks like this:
I queried it with perl and sort. I can't remember exactly which file I downloaded, but according to my notes I got it from here back in 2012 or so:
https://storage.googleapis.com/books/ngrams/books/datasetsv2...
There seems to be a newer corpus published in 2020:
https://storage.googleapis.com/books/ngrams/books/datasetsv3...
I note that hardheartedness and hotheadedness threaten the darnedest nonstandard assassinations. Such sordidness!
Middle/Second row result is
8 flagfall "Flagfall, or flag fall, is a common Australian expression for a fixed start fee, especially in the taxi, haulage, railway, and toll road industries."
8 galagala "A name in the Philippine Islands of Dammara Philippinensis, a coniferous tree yielding dammar-resin."
Lower/Third Row:
- None
There are no vowels on the bottom row. So no words. I've been typing at ~ 50wpm for 30 years, and I don't think I'd ever actually consciously recognized this fact about the bottom row.
(standard US keyboard layout)
And yeah, nothing in the bottom row other than acronyms and similar pseudo-words.
https://godexperiment.org/beginnings-an-alliterative-rewrite...
It was inspired by these versions:
https://llamasandmystegosaurus.blogspot.com/2017/05/alpha.ht...
https://calvinballing.github.io/saga/
First row
$ awk '/^[,.pyfgcrl]*$/ { print length(), $0 }' /usr/share/dict/words | sort -nr | head
3 pry / ply / fry / cry
Second row
$ awk '/^[aoeuidhtns]*$/ { print length(), $0 }' /usr/share/dict/words | sort -nr | head
15 tendentiousness
14 assassinations
13 instantaneous
13 insidiousness
Third row
$ awk '/^[;qjkxbmwvz]*$/ { print length(), $0 }' /usr/share/dict/words | sort -nr | head
4 xxxv
3 xxx
3 xxv
2 xx
[...]
15 sententiousness
15 sinuatodentated
15 soundheadedness
15 tendentiousness
15 uninitiatedness
16 antisensuousness
16 ostentatiousness
17 dissentaneousness
17 instantaneousness
18 unostentatiousness
$ awk '/^[;qjkxbmwvz]*$/ { print length(), $0 }' /usr/share/dict/words | sort -nr | head
4 xxxv
3 xxx
3 xxv
2 xx
2 xv
1: https://github.com/elasticdog/yawl
It's 170k English words, no placenames or people's names or anything like that, but it does have some whose validity I question.
/usr/share/dict/words is always destination zero for words.