Readit News logoReadit News
knodi123 · 3 years ago
I read an article where a software engineer was about to go on a long plane flight, so he downloaded a file of all english words and tried to write a spell checker during his flight, with no internet access, just as an exercise. And when he landed, he had finished, and his algorithm was beautiful, simple, elegant, easy to understand, and never something I'd have come up with in a million years.

It was actually a little depressing - it made me realize that some people just have a "something" that I'm not going to get through hard work or whatnot.

*edit: ah, found it. https://norvig.com/spell-correct.html

andersource · 3 years ago
To be fair, Peter Norvig [0] isn't just another software engineer, he's a hardcore CS researcher, co-authored the most popular (pre-DL) AI textbook, was head of NASA's Computational Sciences Division, etc.

[0] https://en.wikipedia.org/wiki/Peter_Norvig

e_i_pi_2 · 3 years ago
I think what they're saying is that some people are Peter Norvig and got lucky, and other people won't get there with any amount of study or practice. He's definitely not an average engineer, and as a result most people shouldn't expect to have the same level of ability

I had the same thought - read the URL and then said "Oh well that's Peter Norvig he's just weird in a cool way"

rbancroft · 3 years ago
And he was probably one of the worlds foremost experts on this at the time given his position at google, haha.
azurelake · 3 years ago
I mean, he definitely knew about the concept of edit distance: https://en.wikipedia.org/wiki/Edit_distance before writing this code, which is the most difficult part. Here's a practice problem: https://leetcode.com/problems/edit-distance/.

So in this case, the special "something" he has is quite literally the amount of hard work he put in to learn this stuff. And if you take this opportunity to learn about edit distance, you'll be one step closer to that special someone you want to be.

pasquinelli · 3 years ago
maybe someday i'll finally be the person i want to be, and then i can really start to live.
JoshCole · 3 years ago
You know how there is that joke about reasoning about multi-dimensional structures?

"To deal with hyper-planes in a 14-dimensional space, visualize a 3-D space and say 'fourteen' to yourself very loudly. Everyone does it."

– Geoffrey Hinton, A geometrical view of perceptrons

Well, you actually can come up with Norvig's spellchecker very easily with a similar trick. Just tell yourself: to deal with a really hard problem, just argmax with an okayish heuristic. Everyone does it.

Solutions like this seem impossibly elegant because they tolerate a whole lot of error, but they are actually extremely obvious when you tolerate error.

notfish · 3 years ago
That quote is incredible, and now I know what coffee feels like leaving my nose
teshigahara · 3 years ago
If it makes you feel any better the original is not the same as what is displayed now[0] so it did not take just one flight to achieve elegance, and also included some bugs which were mentioned in the errata later[1].

[0] - https://web.archive.org/web/20070410053746/http://norvig.com...

[1] - https://web.archive.org/web/20150906100448/http://www.norvig... : ctrl+f "Update"

TacticalCoder · 3 years ago
It's really nice and I'm a fan of the man but... Even back then it was a really simple spellchecker. Back then stuff like the "metaphone" and "double metaphone" algorithms already existed (suggesting candidates based on similarly sounding words). Fun algorithms 'em metaphones I'd say. Relatively easy to implement/port (been there, done that).

I'm sure there's better stuff now.

Also it's possible to better ranks candidates to fix typos by taking into account the layout of the keyboard (e.g. on a QWERTY keyboard is you see "wery" it is, despite having an edit distance of 1, highly unlikely the person typing meant to type "very". "weary" is much more likely). So just sorting by edit distance and then suggesting the most common word often doesn't result in the best candidate being shown first.

And then context. Ah, context. I type this and my browser isn't aware I'm talking about algorithms and not only says "metaphone" is incorrect but suggests "megaphone".

So... The matter still isn't settled if you ask me.

WalterBright · 3 years ago
Soundex is an earlier method.
GuB-42 · 3 years ago
This kind of feat is the exact opposite of what the article is about.

This code is very simple, very short, very unoptimized and done in a few hours, free solo. The kind of code the article could have linked as an example of how simple it is now to write a spellchecker, in fact it explicitly talks about how you can do it in a few lines of Python, i.e. this.

What the article is about is complex, highly optimized code that took months to do, because computers at the time didn't even have enough space to store the complete word list uncompressed, neither in storage nor in memory, and even with compression, you had to be clever.

chii · 3 years ago
so a question that could be asked today is: what could be achieved, if only the same cleverness was applied to the utilization of the massive resources available today?

One potential answer is indeed the recent developments in AI (generative stuff like GPT & stable diffusion etc).

stevenspasbo · 3 years ago
You're in for a bad time if you're going to be comparing yourself to Peter Norvig.
wheels · 3 years ago
The prowess here is greatly overstated. His implementation would be pretty obvious for anyone that had read an Information Retrieval textbook any time in the couple decades between when these methods were devised and when he wrote that post.

That's kind of like watching someone use Red-Black Trees and assuming that they intuited them out of thin air. It's just remembering a basic algorithm that would be familiar to anyone in his particular field.

feoren · 3 years ago
These ideas are extremely powerful. I built a spell-checker largely based on this article, by parsing English Wikipedia. At scale it needs a trie and a distance metric with better-scaling performance metric, but it works really well. These go together: your error metric is a priority-based multi-headed traversal of your trie -- this emulates comparing your word to every word in your dictionary; you can compare against a few million words very quickly.

Because it's based on Wikipedia, it can correct proper names too. It's very extensible: by changing the error model, you can get fuzzy auto-complete that can look at "Shwarz" and suggest "Schwarzenegger" (even though you missed the 'c'). You can extend it to looking at multiple words instead of just letters, for correcting common word-use issues as well.

WalterBright · 3 years ago
I built a spell checker in the D compiler, where the dictionary consists of the symbols in scope. It is very satisfactory.
agencies · 3 years ago
Is the code or expanded explanation available?
corobo · 3 years ago
Do you allow yourself to be bored at all?

I recently realised I've been missing my "something" for a while.. ya I've not had a dopamine free moment since doom scrolling became a thing.

Not comparing somethings, I probably couldn't write a spell checker much less a beautiful one, but my creative juices are much less clogged up since allowing myself to get bored.

I believe it's being able to combine a learned skillset with raw creativity (boredom) that gives a person that something you refer to.

andai · 3 years ago
For me it's largely "the joy of finding things out". If I know there's a standard solution to a problem, I'll avoid the "spoilers" because I want to see how far I can get on my own. (And yeah, sometimes I just get bored of doomscrolling... Building things is a much more satisfying way to spend my time!)

I often avoid things because I think they'll be hard, but then discover that they turn out to be easy. "Why didn't anyone tell me how easy this was! I could have done this ages ago!" was a running theme for me for a good while.

Of course, sometimes things are actually harder than you thought. But they're still doable! Whereas I have this cognitive bias where "difficult" is (emotionally) equivalent to "impossible", which is just really unfortunate. I'm sort of starting to see through that, that everything's made of atoms, and once you break a task down far enough, the individual steps are trivial.

rjh29 · 3 years ago
Worth noting he did not perform the "major feat of software engineering" like the OP says (given 256k ram or lower). He just put all the words into a hash. The main key is knowing about edit distance, if you know that, you can probably write the solution.

So yes he's very talented, but if you study enough computer science, you can also do it. Or don't. It's totally fine to be interested in other parts of engineering instead.

alfalfasprout · 3 years ago
Don't get me wrong, Peter Norvig is awesome, but if you're already familiar with levenshtein or even hamming distances, it's pretty straightforward to hack together a spellchecker.
rzimmerman · 3 years ago
Reminds me of a dumb project I did that takes a person's name and returns a name that is close, but not quite right: https://github.com/rzimmerman/shitty-autoreply

The idea was to set up an auto-reply to my email that looked automated, but surely couldn't be because it messed up the sender's name. Not nearly as clever as Peter Norvig's program and even less useful.

I'd say the bias toward common American names is a bug, but given that the goal of the project is to be obnoxious it's probably a feature.

pixel_tracing · 3 years ago
To be fair even Norvig’s implementation wasn’t great for a mobile device. It can be improved with a more efficient trie structure…
Xeoncross · 3 years ago
Do you have any links?
cortesoft · 3 years ago
It is a bit strange to me that it would be surprising that there are people way better at programming than you. The chance that you would be at the far right of the bell curve for any skill is pretty small, by definition. Almost everyone is going to not be the best at your thing. I don’t know why it would be depressing to not be the best.
addaon · 3 years ago
> any skill is pretty small, by definition

I challenge this, especially the "by definition" part. It depends on how many skills you recognize.

The chance of being the best artist (to the extent that such a statement is reasonable) might be one in eight billion.

The chance of being the best automotive sharpie artist is, similarly, one in eight billion.

The chance of being the best artist /of some particular skill set/ is, then, much higher -- if you recognize only a few thousand distinct skills here, you're already at one in a million.

Then realize that those who are likely to be surprised that there are better programmers than them are /not/ likely to be surprised that there are better artists than them. So we're assessing across all skills.

The chance of being on the far right of the bell curve of a specific skill is pretty small. The chance of being on the far right of the bell curve for /any/ skill is quite high, with only eight billion people around.

Find what you do better than anyone else, embrace it, and enjoy it.

knicholes · 3 years ago
I used it to apply to a job at Twitch, as they tasked me on creating a spellchecker. Thanks, Peter! Also, they didn't give me an offer.
kachurovskiy · 3 years ago
I've come to accept the fact that there are millions of people better than me at something. It's depressing and hopeless trying to be the best, there isn't even an objective measure. My goal is instead to be useful.
nottathrowaway3 · 3 years ago
That "something" is basically passion and motivation, a child-like fascination with computers and the powers they can give you. People like this are so valuable and great to work with not because they have some innate power to understand algorithms, but because they sell a brand of optimism reminiscent of the "old" tech industry, the American postwar optimism of the 90s, where new things actually were revolutionary.

Let's be real, in 5-6 hours he designed a very nice, simple algorithm to produce likely words from a dictionary based on frequency analysis, and produced 30 lines of Python code implementing it. A lot of engineers today could have done this.

The "specialness" was that he did it on his own time and enjoyed it.

There was an xkcd relevant here about enjoying Python: https://xkcd.com/353/

aflag · 3 years ago
Many developers would come up with that once they know the answer. Not as many will find the answer.
projektfu · 3 years ago
The optimism of the 90s. That's a strange one I hadn't heard before.
devoutsalsa · 3 years ago
I’d bet you have a “something” that isn’t obvious to you, or that people don’t compliment you on. And even if you don’t, comparing one’s self to anyone is sure to be depressing.
2143 · 3 years ago
Peter Norvig is at god-tier, comparable perhaps to Torvalds, van Rossum, Wozniak, Stallman, ESR, Kernighan, Ritchie, Gates, Cormen, and many others I've left out.
oska · 3 years ago
Gates is definitely not 'god-tier' (I won't argue about any of the others, although I don't know who Cormen is)

Deleted Comment

armatav · 3 years ago
Yeah and that “something” is a deep interest.

You can do similar things in domains where you’re so interested you can’t think of doing anything else.

anu7df · 3 years ago
Left the punchline for the link huh? Peter Norvig was the "a software engineer" nuf said, Mike drop and all that.
technological · 3 years ago
If you realize that how hard you try you won’t able to do certain things, do ppl keep trying or accept the limitations ?

Deleted Comment

yieldcrv · 3 years ago
The guy making revenue is a better developer than the guy making elegant algorithms

Very few of us actually want to be the guy accumulating accolades for their engineering, we want to take flights to things we actually want to be doing, not opening our laptop once or even bringing it

Everything in between is a coping mechanism

Waterluvian · 3 years ago
You are a very big fish when measured against countless ponds.

Dead Comment

lb1lf · 3 years ago
No one posted Martha Snow's epic poem yet?

Eye halve a spelling chequer It came with my pea sea It plainly marques four my revue Miss steaks eye kin knot sea.

Eye strike a quay and type a word And weight four it two say Weather eye am wrong oar write It shows me strait a weigh.

As soon as a mist ache is maid It nose bee fore two long And eye can put the error rite It's rare lea ever wrong.

Eye have run this poem threw it I am shore your pleased two no It's letter perfect awl the weigh My chequer tolled me sew.

Deleted Comment

pornel · 3 years ago
For the record, ChatGPT can actually correct spelling of this (or it has seen spoilers somewhere…)
nathan_f77 · 3 years ago
This is very impressive. I ran it through ChatGPT and got the following response. The only mistake it made is "really" instead of "rarely".

------

> Correct the spelling:

Eye halve a spelling chequer It came with my pea sea It plainly marques four my revue Miss steaks eye kin knot sea.

Eye strike a quay and type a word And weight four it two say Weather eye am wrong oar write It shows me strait a weigh.

As soon as a mist ache is maid It nose bee fore two long And eye can put the error rite It's rare lea ever wrong.

Eye have run this poem threw it I am shore your pleased two no It's letter perfect awl the weigh My chequer tolled me sew.

> I have corrected the spelling errors in the poem:

"I have a spelling checker

It came with my PC

It plainly marks for my review

Mistakes I cannot see.

I strike a key and type a word

And wait for it to say

Whether I am wrong or right

It shows me straight away.

As soon as a mistake is made

It knows before too long

And I can put the error right

It's really ever wrong.

I have run this poem through it

I am sure you're pleased to know

It's letter perfect all the way

My checker told me so."

jandrese · 3 years ago
It still blows my mind that when growing up I had a Commodore 64 word processor with spell check. To run the spell checker you had to hit the button to start the spell checker and then turn the floppy upside down while it ran through the document.

This means you had to store in a measly 64kb not only the code for the spell checker, the entire contents of the document, KERNAL overhead, and enough data for the spell checker to run. Remember that the Commodore 64 had an incredibly slow disk drive (300bps) and the disks only supported 160kb of data. Any scheme that required a lot of disk access would be terribly slow. Yet somehow it worked and wasn't outrageously slow. I suspect the program was unloading big chunks of its own program to aggressively reuse the memory space but even then it seems like magic.

zimpenfish · 3 years ago
See also ZX Spectrum Scrabble[1] with an 11k word dictionary (and game code) in ~41kb[2], loaded from tape - no swapping / reading / etc.

[1] https://spectrumcomputing.co.uk/zxsr.php?id=4375

[2] 48k - 6.75k for the screen.

ghaff · 3 years ago
It's pretty amazing between now and then.

Way back when I wrote a DOS file manager that fit in 32K of assembly language. As I recall, at one point I upgraded it to use two different 64K segments when running by putting the code and in-memory information about the files in different segments.

networked · 3 years ago
When this was posted in 2012, an HN user described how they developed a spellchecker for the Commodore 64 as a teenager in the 1980s.

https://news.ycombinator.com/item?id=4641019

reaperducer · 3 years ago
There were several spell checkers for the TRS-80 model 100, which maxed out at 32K of RAM. More typically, it has 24K, because the weird RAM chips were so expensive.

Plus, few people had floor drives for the 100 —you had to operate from a cassette tape.

jgtrosh · 3 years ago
I was going to point out my Panasonic W1030 word processor as probably even more impressive since it's not even really a PC.

But then I found specs and its dictionary uses 384kB [1]. Then again it's from the latter 80s, maybe previous word processors attempted more constrained spellcheckers.

[1]: https://archive.org/details/kx-w1000-service-manual/KX-W1000...

Xeoncross · 3 years ago
If you think this is great and/or immediately thought of using a trie to reduce storage costs, then let me introduce you to https://blog.burntsushi.net/transducers/ which pushes storage reuse over the edge to almost competing with compression like gzip while still allowing you to consider edit distance while finding possible alternatives.

typeaheads/autocomplete and spellchecking both need this so worth a read.

polytely · 3 years ago
Great stuff, I never deal with stuff like this in my day job and was pleasantly surprised at how easy it was to follow what was going on, which means the author did a great job.
Xeoncross · 3 years ago
Yes, sad that we don't get to work on actual algorithms/leetcode-like problems in the job. Instead it's just hundreds of cocoapods, gems, npm packages, and build tools to make a contact form or profile page.
denvaar · 3 years ago
Also recommend "Compact Data Structures" by Gonzalo Navarro if you're into compression and data structures.
ccooffee · 3 years ago
Great article, thanks for sharing!
dang · 3 years ago
It used to be a popular submission topic too (still is, and used to as well):

A spellchecker used to be a major feat of software engineering (2008) - https://news.ycombinator.com/item?id=25296900 - Dec 2020 (143 comments)

A Spellchecker Used to Be a Major Feat of Software Engineering (2008) - https://news.ycombinator.com/item?id=10789019 - Dec 2015 (29 comments)

A Spellchecker Used to Be a Major Feat of Software Engineering - https://news.ycombinator.com/item?id=4640658 - Oct 2012 (70 comments)

A Spellchecker Used To Be A Major Feat of Software Engineering - https://news.ycombinator.com/item?id=3466388 - Jan 2012 (61 comments)

A Spellchecker Used to Be a Major Feat of Software Engineering - https://news.ycombinator.com/item?id=212221 - June 2008 (22 comments)

Willish42 · 3 years ago
Pretty funny how the oldest one has the same top comment as this submission (https://norvig.com/spell-correct.html), 15 years later
VLM · 3 years ago
Three other things to think about:

1980s memory models especially dos-era PC was a nightmare but you could at great effort work around it. The C / unix memory model is actually quite luxurious compared to programming on dos 3.3.

In the old days non-English support was pretty complicated. Now you rely on the OS to support UTF-8 and call it good, more or less. Was a bigger problem 'back in the day'.

In the old days you'd run a spell checker as a batch process, possibly not even part of the editor itself. The point I'm making is WRT speed, now a days speed doesn't matter, not because processors are faster (although they are, software always expands faster than hardware) but because we have integrated app with threading, so instead of saving this into a file, then exiting my browser and running a separate spell check app then exit and start the browser again and load this file back into it, the browser simply has a thread doing spell checking "on the fly". So in the old days it was VERY important WRT labor costs to spell check a 10000 word essay as fast as possible, and on a 8 mhz XT that was a trick indeed. But today my multi core multi ghz desktop merely needs to spell check on the fly as I'm typing and I can't be faster than 100 wpm or so. My desktop had "like one second or so" to decide that ghz and mhz are not words and put red squigglies under them whereas in the old days you'd have to spell check like 1000 words/second to keep the users happy.

So simple ineffective code survives in 2023 because it can lean on the OS, its not just that hardware is faster or larger.

jrumbut · 3 years ago
A good one still is. When I got my most recent phone (Galaxy 22, but I jumped several versions in the upgrade so I don't know when the change happened), the spellcheck engine regressed massively from previous editions.

I'd be so curious to hear what the change was. It's absolutely awful to use!

benj111 · 3 years ago
Is it not based on your history? I know the suggestions seem to be. If so and it's stored locally it could just be that it's having to start again from scratch?
jrumbut · 3 years ago
It used to be very clearly based on history, so every time I restarted my phone I'd get the classic "duck" correction.

Now I couldn't tell you what it's doing, but it doesn't appear to be using history. For instance, the same technical terms get autocorrected again and again.

progmetaldev · 3 years ago
It definitely does make use of your history.
bediger4000 · 3 years ago
He's got a point: increase in minimum memory and increase in CPU performance means that superior programming isn't always necessary.

People regularly produce for fun what used to be a masters thesis level of work these days. I'll cite that YouTube video of someone making a reaction wheel system that can get an inverted Lego pendulum to stand up, and stay up under perturbations. The Lego part isn't the main work. The main work is software, a PID controller of the reaction wheel.

Something else has happened. There's enough people who know enough that the knowledge is just sloshing around, waiting for others to do cool things with it.

simcop2387 · 3 years ago
> He's got a point: increase in minimum memory and increase in CPU performance means that superior programming isn't always necessary.

Fun part about this is that it might actually make it slower! Between cache misses, pre-fetchers and branch mispredictions the more naive search may make things significantly more predictable for the out of order execution on modern processors that the result is actually faster when you don't try to be clever with some of this.

You still have to benchmark and not do things completely wacky but it's kind of interesting how things have changed because of the scale of the pieces.

BoorishBears · 3 years ago
A good example from Bjarne Stroustrup https://www.youtube.com/watch?v=YQs6IC-vgmo