neilk · 2 years ago
I think we can do better than 11 characters!

I happen to have a corpus which includes pretty much every word ever written in a book, including many misspelled, mistranscribed, or otherwise non-dictionary words.

After eliminating nonsense, non-English, or other mistakes, I think the real winner, coming in at 12 characters, is:

    teetertotter
That's a relatively common word. Even though it's usually seen hyphenated, the unhyphenated form is recognized by all the online dictionaries I found.

----

And some other candidates, just for fun, in the 13 or 12 character range:

    proproprietor
    priorityqueue
    reporterette
    preprototype

"proproprietor" seems more like a misspelling. Should have a hyphen, or be two words.

"priorityqueue" is of course familiar to hackers here, but is more of a jargon term, and is only concatenated due to appearing in source code. Invariably it's two words when actually written out.

"reporterette" is antique, but appeared in a NYTimes headline as late as 2018 - the author reflected on her career, including sexist epithets. https://www.nytimes.com/2018/12/02/opinion/george-hw-bush-ma...

"preprototype" is used exactly as is, in lots of scientific papers, up to the current day. That's a pretty good one too, and could be a tie for "teetertotter", but it's verging on jargon.

soultrees · 2 years ago
How did you scrape that data? How do you store and retrieve it? Is it just a standard db or a vector db?

Sorry for the questions, but it seems like an interesting, yet probably common, data set, and as someone venturing down this path I'd like to learn more about building a similar dataset of my own from scratch.

neilk · 2 years ago
> standard db or vector db

lol, it's a 42MB text file from Google Books Ngrams.

The format looks like this:

    $ head words-all.txt

    a       14219615690
    a!      196012
    a"      84
    a'      47713
    a'0     3036
    a'1     4070
    a'10    99
    a'11    56
I queried it with perl and sort.

    $ time perl -wlane 'if ($F[0] =~ /^[qwertyuiop]+$/) { print length($F[0]), "\t", $F[0] }' words-all.txt | sort -rn > qwertywords

    real 0m1.915s
    user 0m1.896s
    sys 0m0.025s

I can't remember exactly which file I downloaded, but according to my notes I got it from here back in 2012 or so.

https://storage.googleapis.com/books/ngrams/books/datasetsv2...

There seems to be a newer corpus published in 2020:

https://storage.googleapis.com/books/ngrams/books/datasetsv3...
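
For reference, if anyone wants to rebuild something like words-all.txt from the raw downloads, here is a rough sketch of the aggregation step, assuming the v2 1-gram format of "ngram TAB year TAB match_count TAB volume_count" (the filenames are illustrative, and you may want extra filtering, e.g. of the part-of-speech-tagged entries):

    $ zcat googlebooks-eng-all-1gram-*.gz \
        | awk -F'\t' '
            # column 1 = ngram, column 2 = year, column 3 = match_count
            { total[$1] += $3 }
            END { for (w in total) print w "\t" total[w] }' \
        > words-all.txt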


bnjmn · 2 years ago
On any macOS computer (or replace /usr/share/dict/words with your own word list):

  grep '^[qwertyuiop]*$' /usr/share/dict/words | \
  awk '{ print length(), $0 }' | \
  sort -n

juujian · 2 years ago
Works for Ubuntu, too. My Colemak self can only get fluffy (6) from the top row; that's the longest word. The middle row really shines, though: I can get hardheartedness (15) or assassinations (14).
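
For reference, the same pipeline with the Colemak home row swapped in should reproduce those numbers (assuming the standard Colemak layout, where the home row is arstdhneio):

  grep '^[arstdhneio]*$' /usr/share/dict/words | \
  awk '{ print length(), $0 }' | \
  sort -n | tail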
tiltowait · 2 years ago
Interesting that your dictionary doesn't have "tenderheartedness", which is two letters longer.
seabass-labrax · 2 years ago
Gulp, fluffy puppy pug! Yup. Fly, ugly pup, fly.

I note that hardheartedness and hotheadedness threaten the darnedest nonstandard assassinations. Such sordidness!

travisgriggs · 2 years ago
Nice.

Middle/Second row result is

8 flagfall "Flagfall, or flag fall, is a common Australian expression for a fixed start fee, especially in the taxi, haulage, railway, and toll road industries."

8 galagala "A name in the Philippine Islands of Dammara Philippinensis, a coniferous tree yielding dammar-resin."

Lower/Third row: none.

There are no vowels on the bottom row, so no words. I've been typing at ~50 wpm for 30 years, and I don't think I'd ever consciously recognized this fact about the bottom row.

(standard US keyboard layout)

JoshTriplett · 2 years ago
For QWERTY, I found two nine-letter words using only the middle row: halakhahs and haggadahs.

And yeah, nothing in the bottom row other than acronyms and similar pseudo-words.

Symbiote · 2 years ago
Dvorak:

  ',.PY FGCRL   pry or Lyly
  AOEUI DHTNS   tendentiousness
  ;QJKX BMWVZ   xxxv, www, bbq or mm
After 'apt install wbritish-insane'

  pyrryl (a chemical group)
  unostentatiousnesses (and anaesthetisations is good too)
  mmmm

tedunangst · 2 years ago
Knuth vs McIlroy all over again.
susam · 2 years ago
On macOS version 12.1 Monterey:

  $ grep '^[qwertyuiop]*$' /usr/share/dict/words | awk '{print length, $0}' | sort -rn | head
  11 rupturewort
  11 proterotype
  11 proprietory
  10 typewriter
  10 tetterwort
  10 repetitory
  10 repertoire
  10 proprietor
  10 pretorture
  10 prerequire
On Debian GNU/Linux 11 (bullseye):

  $ grep '^[qwertyuiop]*$' /usr/share/dict/words | awk '{print length, $0}' | sort -rn | head
  10 typewriter
  10 repertoire
  10 proprietor
  10 perpetuity
  9 typewrote
  9 typewrite
  9 territory
  9 repertory
  9 puppeteer
  9 prototype

jodrellblank · 2 years ago
Dyalog APL, using the enable1 wordlist; I don't know its origins, but you can get it from Peter Norvig's website https://norvig.com/ngrams/enable1.txt or various GitHubs and Gists:

          ↑7↑{⍵[⍒≢¨⍵]}words/⍨{''≡⍵~'qwertyuiop'}¨words
    ┌→─────────┐
    ↓peppertree│
    │perpetuity│
    │prerequire│
    │proprietor│
    │repertoire│
    │typewriter│
    │etiquette │
    └──────────┘
Reading from the right, "test each word by removing 'qwertyuiop' and see if it leaves an empty string, use the test results to filter the input word list, descending-sort the length of each word and use that to arrange(index) the remaining words, flatten the array and take the top 7".

(Longest from the middle row is 'haggadahs' then 'alfalfas', third row is 'mm')

JoshTriplett · 2 years ago
For Debian, try installing one of the larger wordlists, such as wamerican-huge or wbritish-huge; those have "rupturewort".
codetrotter · 2 years ago
FreeBSD 13.2

    % grep '^[qwertyuiop]*$' /usr/share/dict/words | awk '{print length, $0}' | sort -rn | head

    11 rupturewort
    11 proterotype
    11 proprietory
    10 typewriter
    10 tetterwort
    10 repetitory
    10 repertoire
    10 proprietor
    10 pretorture
    10 prerequire
So it seems that, in addition to basing parts of its kernel on FreeBSD, macOS also has a /usr/share/dict/words that is very similar to FreeBSD's :) perhaps even the same?
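
One quick way to check, if you have both systems handy, is to hash the file on each and compare the digests (macOS ships shasum; FreeBSD has sha256 in the base system):

    % shasum -a 256 /usr/share/dict/words    # macOS
    % sha256 /usr/share/dict/words           # FreeBSD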

p1mrx · 2 years ago
MS-DOS 6.22

    C:\>grep '^[qwertyuiop]*$' /usr/share/dict/words | awk '{print length, $0}' | sort -rn | head
    Bad command or file name
    Bad command or file name
    SORT: Too many parameters
    Bad command or file name

kazinator · 2 years ago
Awk greps!

   awk '/^[qwertyuiop]+$/ {print length, $0}'


kristopolous · 2 years ago
Here's something you may not know: the *-insane dictionaries, which are giant, are derived from OCR output and are known to contain lots of errors.

I found a few earlier this year and was going to file a bug, but some research showed that this is known and expected behavior.

If the OCR reads, say, "stubborn" as "stubbum", the smaller dictionaries are the ones that have been cross-checked and had those filtered out. The insane ones have not. It's a good name: "lack of sanity checks".

Here's an example word I found, "suabilities". You'll find it only on wordlist sites that used this wordlist and, I guess, now here.
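
If you want to see what the insane list adds over the standard one, something like this works (assuming the Debian wordlist paths; adjust the filenames for your system):

    $ # comm -23 keeps lines that appear only in the first (insane) list
    $ comm -23 <(sort /usr/share/dict/british-english-insane) \
               <(sort /usr/share/dict/british-english) | less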

colinchartier · 2 years ago
Reminds me of the ghost Unicode character saga: https://www.dampfkraft.com/ghost-characters.html
kristopolous · 2 years ago
Just saw this. I've got no idea how kanji OCR works, but I do know enough Japanese to know what most of those characters are attempting to refer to; my penmanship has certainly been that bad. I still don't understand how it made its way into the standard, unless that part wasn't written by someone competent in Japanese.

I wonder how often that happens - surely there are tons of people dealing with Japanese text who can't read it and just use diligence to make sure the "letters are the same".

schoen · 2 years ago
I've used the insane dictionaries a number of times for puzzle stuff and I never knew that they were derived from OCR output. Thanks for mentioning that!
seabass-labrax · 2 years ago
You might find the... 'translation'[1] of Genesis 1 using only keys on the Colemak home row interesting:

  In the start The One has risen the stars and the earth.

  The earth had no order, and nothin' resided there; and shade resided on the nonendin' 'neath. And The One rided on the seas.

  Then The One said: "I desire it to shine"; and it shone.

  And The One had seen the shine, that it's neat; and The One sorted the shine on one side, and the shade on the other.

  The One then denoted the shine and the shade. So the nite and the shine that are date no. one had ended.
[1]: https://colemak.com/Fun

SethTro · 2 years ago
For Dvorak, with a little assist from Unix.

First row:

    $ awk '/^[,.pyfgcrl]*$/ { print length(), $0 }' /usr/share/dict/words | sort -nr | head
    3 pry / ply / fry / cry

Second row:

    $ awk '/^[aoeuidhtns]*$/ { print length(), $0 }' /usr/share/dict/words | sort -nr | head
    15 tendentiousness
    14 assassinations
    13 instantaneous
    13 insidiousness

Third row:

    $ awk '/^[;qjkxbmwvz]*$/ { print length(), $0 }' /usr/share/dict/words | sort -nr | head
    4 xxxv
    3 xxx
    3 xxv
    2 xx

rwl4 · 2 years ago
Hmm. My Mac shows these:

  [...]
  15 sententiousness
  15 sinuatodentated
  15 soundheadedness
  15 tendentiousness
  15 uninitiatedness
  16 antisensuousness
  16 ostentatiousness
  17 dissentaneousness
  17 instantaneousness
  18 unostentatiousness

Nekhrimah · 2 years ago
Not sure about that third row, the "A" is in the second row.
SethTro · 2 years ago
Awkward, now the third row doesn't return anything

    $ awk '/^[;qjkxbmwvz]*$/ { print length(), $0 }' /usr/share/dict/words | sort -nr | head
    4 xxxv
    3 xxx
    3 xxv
    2 xx
    2 xv

lovehashbrowns · 2 years ago
I tried some other fun things, like going row by row with each row contributing only one letter, to see what's the longest word I could come up with.

If I start at the top row and go down, I can make TAXES but couldn’t think of a longer word. The third row having no vowels makes it so hard.

Starting at the bottom row and going up, I came up with CHICKEN which is delicious and neat that it ends where it started. Chickens is longer but ends on the middle row which is not as neat I feel like :(

JoshTriplett · 2 years ago
> If I start at the top row and go down, I can make TAXES but couldn’t think of a longer word. The third row having no vowels makes it so hard.

A dictionary search turned up "paxwaxes" as the longest word I could find that starts in the top row and goes down, wrapping around to the top every three letters.
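
Roughly what such a search can look like - a sketch, not necessarily the exact script used, assuming the three QWERTY letter rows and a strict top -> middle -> bottom cycle:

    awk '{
      # letter 1 must come from the top row, letter 2 from the middle row,
      # letter 3 from the bottom row, then wrap back around to the top
      rows[0] = "qwertyuiop"; rows[1] = "asdfghjkl"; rows[2] = "zxcvbnm"
      w = tolower($0); ok = 1
      for (i = 1; i <= length(w); i++)
        if (index(rows[(i-1) % 3], substr(w, i, 1)) == 0) { ok = 0; break }
      if (ok) print length(w), w
    }' /usr/share/dict/words | sort -n | tail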

> Starting at the bottom row and going up, I came up with CHICKEN which is delicious and neat that it ends where it started. Chickens is longer but ends on the middle row which is not as neat I feel like :(

Chickens is indeed the longest.

If you start at the bottom row and go up-and-down: cataclysms, or catamarans.

If you start at the top row and go down-and-up: escapable.

If you start in the middle and go down-and-up: scarabaean

If you start in the middle and go up-and-down, I didn't find anything longer than 7 letters, and there were 39 seven-letter words, including "discard", "grandpa", and "stacked".

DylanDmitri · 2 years ago
Related, is there a high quality plaintext dictionary file for running similar searches? I’ve spent several hours but couldn’t find one that’s both comprehensive and accurate.
aidenn0 · 2 years ago
What are your rules for what counts as a "word"? If you go with the basic Scrabble rules (i.e. nothing that would be capitalized or punctuated) then YAWL[1] is pretty good, with the downside that the most recent version I know of is from 2008.

FYI, rupturewort is the sole 11-letter word answer to TFA in YAWL; found using:

    grep '^[qwertyuiop]*$' word.list | while read -r line; do echo "${#line} ${line}"; done | sort -n | tail
1: https://github.com/elasticdog/yawl

jodrellblank · 2 years ago
I linked in another comment, I use "enable1.txt" which is here on Peter Norvig's site: https://norvig.com/ngrams/enable1.txt

It's 170k English words, no placenames or people's names or anything like that, but it does have some whose validity I question.

praash · 2 years ago
Some common Linux distributions have packages that provide word list files in /usr/share/dict/ in several languages, and an English list is likely preinstalled. I've had plenty of fun practising regex and pipes with these word lists!
koolba · 2 years ago
Dictionary or word list?

/usr/share/dict/words is always destination zero for words.

JoshTriplett · 2 years ago
I'd recommend the SCOWL wordlist, which also has usage data (so you can decide how rare a word you're willing to include).