… where the first falsehood is that a computer could be able to parse an address at all (let alone normalize it). Just take the address as given and leave the rest to the mail delivery person.
I think fundamentally, no parsing/normalizing library can be effective for addresses. A much better approach is to have a search library which finds the address you're looking for within a dataset of all the addresses in the world.
Addresses are fundamentally unstructured data. You can't validate them structurally. It's trivial to create nonexistent addresses which any parsing library will parse just fine. On the flipside, there's enough variety in real addresses that your parser has to be extremely tolerant in what it accepts--so tolerant that it basically tolerates everything. The entire purpose of a parser for addresses is to reject invalid addresses, so if your parser tolerates everything it's pointless.
The only validation that makes any sense is "does this address exist in the real world?". And the way to do that is not parsing, it's by comparing to a dataset of all the addresses in the world.
I haven't evaluated this project enough to understand confidently what they're doing, but I hope they're approaching this as a search engine for address datasets, and not as a parsing/normalizing library.
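To make the "compare against a dataset" idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: `canonical`, `exists`, and the tiny `KNOWN_ADDRESSES` set are made-up names standing in for a real licensed address dataset, and the normalization is deliberately crude. A real system would presumably need fuzzy search rather than exact matching.

```python
import re

def canonical(address: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", address.lower())
    return " ".join(cleaned.split())

# Hypothetical stand-in for a dataset of real addresses (illustrative entries only).
KNOWN_ADDRESSES = {
    canonical(a)
    for a in [
        "10 Downing Street, London, SW1A 2AA",
        "Third on right of main, Tiwi College, Melville Island, 0822, AU",
    ]
}

def exists(address: str) -> bool:
    """Validate by lookup, not by structure: is this address in the dataset?"""
    return canonical(address) in KNOWN_ADDRESSES
```

Note that no field is ever parsed here: an oddly shaped but real address passes, and a well-formed but nonexistent one fails.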
And keeping such datasets up to date is another matter entirely, because clearly a lot of companies rely on datasets that were outdated before their company even existed.
A trivially simple example of how messy this gets when people try to constrain it: it was nearly random whether a given carrier would insist that I give an incorrect address for my previous place, seemingly because, before 1965, the address was in Surrey, England.
The "postcode area name" for my old house is Croydon, and Croydon has legally been in London since 1965, and was allocated its own postcode area in 1966. "Surrey" hasn't been correct for addresses in Croydon since then.
But at least one delivery company insisted my old address was invalid unless I changed the town/postcode area to "Surrey", and refused to even attempt a delivery. Never mind they had my house number and postcode, which was sufficient to uniquely identify my house.
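A rough sketch of what "house number and postcode is enough" could look like in code, assuming (as was the case for that house) that the pair really is unique. The names `delivery_key`, `resolve`, and `DELIVERY_POINTS` are hypothetical, and the postcode in the table is made up:

```python
from typing import Optional, Tuple

def delivery_key(house_number: str, postcode: str) -> Tuple[str, str]:
    """Build a lookup key from house number and postcode only."""
    return (house_number.strip().upper(), "".join(postcode.split()).upper())

# Hypothetical delivery-point table; the entry below is invented for illustration.
DELIVERY_POINTS = {
    delivery_key("12", "CR0 9ZZ"): "delivery point A",
}

def resolve(house_number: str, postcode: str, town: Optional[str] = None) -> Optional[str]:
    """Look up a delivery point; the town is accepted but deliberately ignored."""
    return DELIVERY_POINTS.get(delivery_key(house_number, postcode))
```

With a lookup like this, writing "Croydon", "Surrey", or nothing at all in the town field makes no difference to whether the parcel can be routed.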
Agreed. Keeping an up-to-date dataset of addresses is enormously hard. It's impossible to do perfectly, and only a few companies are capable of doing it passably, while the rest of us have no choice but to buy from them.
But notably, to validate a parser/normalizer, you need this dataset anyway, so creating a parser/normalizer isn't even saving you that work. It's just giving you a worse result for more work.
There are many useful applications of libpostal, and it's an impressive library, but one I would caution against is for the purpose of address matching, at least as the 'primary' approach.
The problem is that the hardest-to-parse addresses are also often the hardest to match, making the problem somewhat circular. I wrote about this more in a recent blog on address matching: https://www.robinlinacre.com/address_matching/
I somehow doubt this will pass the sniff test of one of my old addresses, which Australia Post successfully delivered to on a weekly basis:
Third on right of main,
Tiwi College,
Melville Island, 0822, AU.
You can try to normalize that... But "Main Road" is in another city. Because I wasn't living in a city. There were no road names. And the 3rd position was an empty plot, not the third house. We had a bunch of houses around a strip of land, a few minutes from the airstrip - the only egress.
(For today’s 10000, that’s Terry Pratchett. The autocrat of the city of Ankh-Morpork amuses himself, at times, by figuring out where unreadably-addressed mail should go - in this case, a baker (“duzbuns” == does buns) across the street (“hopsit” == opposite) from a pharmacy, which in his extremely detailed knowledge of the city means only one place.)
I recall an episode of Frasier where Niles moved into “The Montana” and it was so famous that he could just have people write his name followed by “The Montana” on envelopes to send mail to him. I believe that was based on the Dakota apartments in NYC. I have no idea if people at the actual Dakota apartments can do that, but I suspect the post offices in NYC would know to send mail there if it simply said a name followed by “The Dakota”.
Something like that has not worked in Finland for several years. All addresses are scanned and matched by the postal service against a DB of "valid addresses". There is a big student dorm in this city here, which has had problems with mail delivery for years. Not that students would receive a lot of mail. Most businesses charge extra for paper bills, most authorities prefer electronic messages and private postcards don't seem to be common in that age group either.
After years of undeliverable mail it was found that the building permit for the dorm was registered incorrectly by the city and as a result the rooms were never registered as residential addresses in the postal DB.
Wow, ambitious project. Anybody who has tried to verify addresses can tell you that the staggering number of different formats and conventions around the world makes it an almost intractable problem. So many countries have wildly informal standards, with people putting down just whatever they want because the mailman "just knows".
Maxmind is the quintessential example of what devs want to build in their heart of hearts. Low-touch sales but b2b. Almost a monopoly. Prints money for decades. Not a public company, so they never increase costs to a usurious amount. Open source never quite meets the level needed.
Why would one try to "verify" addresses that one knows nothing about?
> because the mailman "just knows"
The mailman does "just know", and the mailman is who the address is for. Web forms I have seen that have tried to "verify" my address have never done so in a way that made the address better for the mailman.
EDIT: I've long thought that web forms should not have separate "street", "street line 2", "number", "apartment", "whatever" fields. Instead they should offer a multi-line input field labeled "this will go straight on the address label, write whatever you like but it's your problem if it doesn't arrive". You'd probably still need separate fields for town/postcode for calculating postage. And of course it wouldn't work because the downstream delivery company would also insist on something it can "verify".
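For what it's worth, the form I have in mind maps to a very small data model. This is only a sketch, and `ShippingAddress`, `label_lines`, and `postage_key` are names I'm making up for illustration: the free text is kept as typed for the label, and only town/postcode are structured.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class ShippingAddress:
    label_text: str  # free text, printed on the label as the customer wrote it
    town: str        # structured only because postage is calculated from it
    postcode: str

    def label_lines(self) -> List[str]:
        """Return the free text as typed, minus blank lines and trailing spaces."""
        return [line.rstrip() for line in self.label_text.splitlines() if line.strip()]

    def postage_key(self) -> Tuple[str, str]:
        """The only part of the address the shop itself needs to interpret."""
        return (self.town.strip(), self.postcode.strip())
```

Nothing in `label_text` is validated or split into street/number/apartment; whether it is deliverable is between the customer and the mailman.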
For the US the underlying need for parsing is to determine a definitive location so that taxation, which can vary down to the municipality level, can be computed.
In the same vein, there is also Google's excellent libphonenumber for parsing, formatting, and validating international phone numbers.
And because I had no idea before I worked on a project where we had to deal with customer data: many companies also use commercial services for address and phone number validation and normalization.
<https://news.ycombinator.com/item?id=18775099> Libpostal: A C library for parsing/normalizing street addresses around the world - 117 points by polm23 on Dec 29, 2018 (25 comments)
<https://news.ycombinator.com/item?id=11173920> Libpostal: international street address parsing in C trained on OpenStreetMap (mapzen.com) 74 points by riordan on Feb 25, 2016 (7 comments)
Discussed on HN here: https://news.ycombinator.com/item?id=8907301
You are equating two things that are not equatable.
You verify addresses so that you aren't shipping your product to some place that doesn't exist. Also, some KYC requirements mean you have to verify the address of the person.
IIRC libpostal takes gigabytes of storage space and has significant runtime requirements.
Also, while it's implemented in C, there are language bindings for most major languages [1].
It's one of those things where it's most likely best deployed as an independent service on a dedicated machine.
[1] https://github.com/openvenues/libpostal?tab=readme-ov-file#b...