zigzag312 · 4 years ago
Author states:

>And that’s the point, rules and complexity have completely unknowable downsides. Downsides like the destruction of the whole project. With each rule and added complexity you make the system less human and less fun. You make it a Computer Scientists rube goldberg machine while sterilizing it of all the joy of life.

While too many rules and too much complexity can certainly be bad, some basic amount of standardization actually reduces complexity and really doesn't cause the "destruction of the whole project".

As a counterpoint, too much flexibility can also increase complexity. For example, without defined rules, 5.6.2022 can mean 5 June 2022 or 6 May 2022. Neither the user nor a parser can know for sure what it means if no standard is defined. This kind of flexibility certainly isn't fun.
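The ambiguity is easy to demonstrate (illustrative Python; both parses succeed on the very same string, while the ISO form admits only one reading):

```python
from datetime import datetime

raw = "5.6.2022"

# Both interpretations parse cleanly -- the string alone can't tell you which.
day_first = datetime.strptime(raw, "%d.%m.%Y")    # 5 June 2022
month_first = datetime.strptime(raw, "%m.%d.%Y")  # 6 May 2022

# An ISO 8601 date has exactly one reading.
iso = datetime.strptime("2022-06-05", "%Y-%m-%d")

print(day_first.date(), month_first.date(), iso.date())
```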

Example from OSM wiki for "Key:source:date":

> There is no standing recommendation as to the date format to be used. However, the international standard ISO 8601 appears to be followed by 9 of the top 10 values for this tag. The ISO 8601 basic date format is YYYY-MM-DD.

https://wiki.openstreetmap.org/wiki/Key:source:date

Just define some essential standards. It won't lead to destruction of the project!

And while you are making breaking changes, please fix the 'way' element. Maps are big. Storing the points of a way as 64-bit node IDs, while the coordinates in the referenced nodes are also just 64 bits (32-bit lon and 32-bit lat), leads to wasted space and wasted processing time. There are billions of these nodes, and nearly all of them carry no tags, just coordinates. There is no upside to this level of indirection. And in cases where a point does need tags, that can already be solved with a separate node and a 'relation' element.
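A rough back-of-envelope of that overhead (illustrative, uncompressed in-memory sizes; real planet files are delta-encoded and compressed, so absolute numbers shrink, but the ratio makes the point):

```python
# One million untagged geometry points, raw record sizes in bytes.
points = 1_000_000

# Current model: the way stores a 64-bit node id per point,
# and each referenced node stores its own id plus packed coordinates.
ref_in_way = 8           # 64-bit node id inside the way
node_record = 8 + 4 + 4  # node id + 32-bit lon + 32-bit lat
indirect = points * (ref_in_way + node_record)

# Hypothetical inline model: the way stores the coordinates directly.
inline = points * (4 + 4)  # 32-bit lon + 32-bit lat

print(indirect // inline)  # the indirection costs 3x the space
```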

OSM data format could certainly be improved and it would benefit end users, as better tools/apps could be made more quickly and easily.

atoav · 4 years ago
The date example is a good one. No one has fun choosing their own date format. This puts the burden of choice onto the user: they came to think about map stuff, and now they have to think about date format stuff.

Of course projects like these have to strike a balance between the strictest bureaucratic nightmare and a structure so loose that people are overwhelmed by the available options at every corner.

I think a lot of that complexity can (and should!) live in the tools themselves. Who cares about a date format when the tool that creates it offers a date picker or extracts the correct date from the metadata of an image? The date format in the backend should be fixed, and then you offer flexibility in the frontend for user input.
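A minimal sketch of that split, assuming the frontend knows which locale formats to accept (the format list here is hypothetical):

```python
from datetime import datetime

# Frontend: accept a few known input shapes (locale-specific, hypothetical list).
ACCEPTED_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%d %B %Y")

def normalize(user_input: str) -> str:
    """Return the one fixed backend representation: ISO 8601."""
    for fmt in ACCEPTED_FORMATS:
        try:
            return datetime.strptime(user_input, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {user_input!r}")
```

The backend then only ever stores and compares one canonical shape, e.g. `normalize("15.06.2022")` yields `"2022-06-15"`.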

Archelaos · 4 years ago
> The date format in the backend should be fixed and then you should offer flexibility in the frontend for user input.

Agreed. However, it might not be so easy for historical dates, because doing it correctly requires great diligence on the part of the tool developer, as well as from the user, in choosing the correct calendar system. For example:

  Q: What is the correct representation of the date of Caesar's death, 15 March 44 BC in ISO-8601?

  A: -0043-03-13

  Why? -- Ancient dates are typically given according to the Julian calendar excluding a year 0, but ISO-8601 uses a proleptic Gregorian calendar including a year 0.
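The conversion can be checked mechanically with textbook Julian day number arithmetic (standard Fliegel–Van-Flandern-style integer algorithms, using astronomical year numbering where 44 BC = −43):

```python
def julian_to_jdn(year, month, day):
    """Julian-calendar date -> Julian day number (astronomical years)."""
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    return day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083

def jdn_to_gregorian(jdn):
    """Julian day number -> proleptic Gregorian date (with year 0)."""
    a = jdn + 32044
    b = (4 * a + 3) // 146097
    c = a - 146097 * b // 4
    d = (4 * c + 3) // 1461
    e = c - 1461 * d // 4
    m = (5 * e + 2) // 153
    day = e - (153 * m + 2) // 5 + 1
    month = m + 3 - 12 * (m // 10)
    year = 100 * b + d - 4800 + m // 10
    return year, month, day

# Ides of March, 44 BC (Julian) = year -43 astronomically.
print(jdn_to_gregorian(julian_to_jdn(-43, 3, 15)))  # (-43, 3, 13)
```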

joshxyz · 4 years ago
why not use the universal date format [1] that works for everyone?

yyyy mm dd mm yyyy

[1] https://twitter.com/dan_abramov/status/1447710863960551433

Grimburger · 4 years ago
Unix time is the real universal date format, surely? It underpins basically everything in datetime database entries.

The problem is that most people can't read it, and the French never managed to convince the world to adopt decimal time either.
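To be fair, the unreadability is one stdlib call away (sketch; the timestamp is just an arbitrary example):

```python
from datetime import datetime, timezone

ts = 1654214400  # seconds since the Unix epoch, 1970-01-01T00:00:00Z
readable = datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()
print(readable)  # 2022-06-03T00:00:00+00:00
```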

RcouF1uZ4gsC · 4 years ago
It actually doesn't work for Americans who write:

mm dd yyyy

867-5309 · 4 years ago
you'll have to round up those 86,400ths!
RicoElectrico · 4 years ago
The proposed improvements would obsolete a bunch of problems, such as broken polygons [1], which happen regularly. They would also make processing OSM more accessible by removing the need to randomly seek over GBs of node locations just to assemble geometries, which accounts for a significant share of osm2pgsql's runtime.

For me Steve Coast lost his credibility when he joined the closed and proprietary what3words.

[1] https://wiki.openstreetmap.org/wiki/OSM_Inspector/Views/Mult...

jasonwatkinspdx · 4 years ago
What 3 Words makes me so angry.

There's around 5.1e14 square meters on the surface of Earth. It takes 49 bits to address each one uniquely. If we use one of EFF's diceware-style short word lists (6^4 = 1296 words, roughly 10.3 bits each), we need 5 words to describe any point on Earth with 1-meter precision.
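The arithmetic, for anyone who wants to check it (5.1e14 m² is the usual figure for Earth's surface area):

```python
import math

area_m2 = 5.1e14                               # Earth's surface, square meters
bits_needed = math.ceil(math.log2(area_m2))    # bits to address every square meter
bits_per_word = math.log2(6 ** 4)              # EFF short list: 1296 words, ~10.34 bits
words_needed = math.ceil(bits_needed / bits_per_word)
print(bits_needed, words_needed)  # 49 5
```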

If we use a projection like, say, S2 (though plenty of other options exist), these 5-word locators will show strong hierarchical locality. In any specific area, for example, there are likely only 3 distinct top-level words. Likewise, the last word adds useful but usually unnecessary precision for day-to-day "find the building" use. So the middle 3 words will be sufficient to be unambiguous in most cases, and if people used this system they'd naturally become familiar with the phrases typical to their locale.

All of this can be done with an algorithm a freshman cs student can understand, with a trivial amount of reference data. It can run on any mobile device made in the last 15 years without an internet connection.
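A toy version of such a scheme (entirely hypothetical: a placeholder 1296-entry vocabulary stands in for a real word list, coordinates go onto a simple equirectangular grid, and a Morton/Z-order interleave provides the hierarchical locality described above — none of this is what3words' actual algorithm):

```python
# Toy open word-geocode sketch. All names and constants are made up.
LAT_BITS, LON_BITS = 25, 26                      # ~0.6 m grid steps
WORDS = [f"word{i:04d}" for i in range(6 ** 4)]  # placeholder 1296-word list

def _interleave(lat_q: int, lon_q: int) -> int:
    """Morton/Z-order code: nearby cells share their high-order bits."""
    code = 0
    for i in range(LON_BITS):
        code |= ((lon_q >> i) & 1) << (2 * i)
        if i < LAT_BITS:
            code |= ((lat_q >> i) & 1) << (2 * i + 1)
    return code

def _deinterleave(code: int) -> tuple:
    lat_q = lon_q = 0
    for i in range(LON_BITS):
        lon_q |= ((code >> (2 * i)) & 1) << i
        if i < LAT_BITS:
            lat_q |= ((code >> (2 * i + 1)) & 1) << i
    return lat_q, lon_q

def encode(lat: float, lon: float) -> list:
    lat_q = round((lat + 90) / 180 * (2 ** LAT_BITS - 1))
    lon_q = round((lon + 180) / 360 * (2 ** LON_BITS - 1))
    code = _interleave(lat_q, lon_q)   # 51-bit cell index
    words = []
    for _ in range(5):                 # 5 base-1296 digits cover 2^51
        code, digit = divmod(code, len(WORDS))
        words.append(WORDS[digit])
    return words[::-1]                 # most-significant word first

def decode(words: list) -> tuple:
    code = 0
    for w in words:
        code = code * len(WORDS) + WORDS.index(w)
    lat_q, lon_q = _deinterleave(code)
    return (lat_q / (2 ** LAT_BITS - 1) * 180 - 90,
            lon_q / (2 ** LON_BITS - 1) * 360 - 180)
```

`decode(encode(lat, lon))` recovers the input to within roughly a meter, and because the Morton code puts the coarse bits first, nearby locations tend to share their leading words.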

I designed a scheme like this for fun years ago, just because it was a natural outflow of some stuff I was doing with dicewords for default credentials in a consulting context, and I just find spatial subdivision structures neat.

It's hard to interpret what3words' scheme as anything but craven rent seeking. They want to keep the mapping obscure, and they fundamentally sacrifice usability in the interest of that. That what3words markets this specifically as a solution for low-income nations, and dupes NGOs that are not tech savvy in the process, is utterly #$%@$#ing revolting.

Imagine trying to rent seek by selling poor people their own street addresses, if you'll allow me to be slightly hyperbolic.

There is no reason a scheme like this can't simply be a standard from some appropriate body, and a few open source reference implementations.

firen777 · 4 years ago
This comment thread is the first time I've heard of w3w. It hurts my brain trying to come up with any reasoning by which such a concept is not some kind of parody one-off project intended to be posted on HN or reddit for the lolz. Instead, it is actually being used by emergency services?

Trying to google with the query "what3words explained site:reddit.com" gave me this r/911dispatchers post as the first result: [What3Words and why it's trash.](https://www.reddit.com/r/911dispatchers/comments/olcxdv/what...)

(Amusingly, this 10-month-old post was last edited 2 days ago.)

vjk800 · 4 years ago
> There is no reason a scheme like this can't simply be a standard from some appropriate body, and a few open source reference implementations.

Yet no-one did this and I think that's the point here.

The world is full of rent seeking in the form of stuff that is dead simple to do but that no one does without a financial incentive.

In w3w the hard part is not the system itself, but getting people to use it, which must be done because the value of the system comes from the network effect.

TuringTest · 4 years ago
But you don't need to complicate the storage format to fix a problem like that. You can build validation tools that check whether the stored data conforms to the specified geometry, and have them emit only valid polygons to later tools in the pipeline.

"Be liberal in what you accept and strict in what you send" is still a good principle. The problem with rejecting invalid structures at the data storage format instead of a later validation step is that it hurts flexibility and extensibility. If later on you need a different type of polygon that would be rejected by the specification, you'll need to create a new version of the file format and update all tools reading it even if they won't handle the new type, instead of just having old tools silently ignoring the new format that they don't understand.
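A minimal sketch of that idea, assuming polygons arrive as coordinate rings (these checks are deliberately cheap and only catch the simplest classes of broken rings; a real pipeline would use a full geometry library for self-intersection tests):

```python
# Post-hoc polygon sanity filter: accept anything, emit only valid rings.
def ring_is_valid(ring):
    if len(ring) < 4:                 # need at least a closed triangle
        return False
    if ring[0] != ring[-1]:           # ring must be explicitly closed
        return False
    for a, b in zip(ring, ring[1:]):  # no zero-length segments
        if a == b:
            return False
    return True

def emit_valid(rings):
    """Pass only well-formed rings on to the next pipeline stage."""
    return [r for r in rings if ring_is_valid(r)]
```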

seoaeu · 4 years ago
> "Be liberal in what you accept and strict in what you send" is still a good principle.

No, it is a terrible principle which produces brittle software and impossible-to-implement standards. The problem is that no one actually follows the “be strict in what you send” part; everyone just goes with whatever cobbled-together mess the other existing software seems to accept. Before long, a spec-compliant implementation can’t actually understand any of the messages that are being sent.

> just having old tools silently ignoring the new format that they don't understand.

This sounds like another headache. I don’t want my tools silently breaking.

matkoniecz · 4 years ago
> You can build validation tools that will check whether the stored data conforms to the correct specified geometry, and only emit valid polygons to later tools in the pipeline when they do.

It does not help at all when the problem is that important areas have disappeared.

It also does nothing for other mappers or for a confused newbie.

maxerickson · 4 years ago
The "expression" layer of the data model has had 20 years to evolve and has largely been static for a decade.

Making everything slower and harder to retain flexibility you don't need isn't a great tradeoff.

RicoElectrico · 4 years ago
The best thing is not to allow invalid geometries to begin with. Any validation would need to be done offline for a number of reasons (such as needing to retrieve any referenced OSM elements), and by that time you can't automatically revert offending changes, as any revert carries a chance of an object-version conflict.
danShumway · 4 years ago
I don't have horribly strong opinions here, but the argument feels circular to me:

- The format should be kept simple to encourage more people to build tools on top of it, and users will be more likely to work with it.

- We should deal with the emergent complexity of bad validation by making tools more complicated and having them detect errors on their end.

If users are going to use a validation tool to work with data, then they can also use a helper tool to generate data. And if the goal is to make it easier to build on top of data, import it, etc... allowing developers to do less work validating everything makes it easier for them to build things.

I'm going over the various threads on this page, and half of the critics here are saying that user data should be user facing, and the other half are saying that separate tools/validators should be used when submitting data. I don't know how to reconcile those two ideas, particularly the few comments I'm seeing that validation should live primarily client-side, embedded in tools.

Again, no strong opinions, and I'll freely admit I'm not familiar enough with OSM's data model to really have an opinion on whether simplification is necessary. But one of the good things about user facing data should be that you can confidently manipulate it without requiring a validator. If you need a validator, then why not also just use a tool to generate/translate the data?

To me, "just use a tool" doesn't seem like a convincing argument for making a data structure more error prone, at least not if the idea is that people should be able to work directly with that data structure.

----

> you'll need to create a new version of the file format and update all tools reading it even if they won't handle the new type, instead of just having old tools silently ignoring the new format that they don't understand.

Again, not sure that I understand the full scope of the problem here, and I'm not trying to make a strong claim, but extensible/backwards-compatible file formats exist. And again, I don't really see how validation solves this problem, you're just as likely to end up with a validator in your pipeline that rejects extensions as invalid, or a renderer that doesn't know how to handle a data extension that used to be invalid or impossible.

Wouldn't it be nicer to have a clear definition of what's possible that everyone is aware of and can reason about without inspecting the entire validation stack? Wouldn't it be nice to not finish a big mapping project and only then find out that it has errors when you submit it? Or to know that if your viewer supports vWhatever of the spec, it is guaranteed to actually work, and won't fall over when it encounters a novel extension to the data format that it doesn't understand or didn't think was possible? Personally, I'd rather know right off the bat what a program supports than have to intuit it by seeing how it behaves and looking around for missing data.

Part of what's nice about trying to do extensions explicitly rather than implicitly through assumptions about data shape, is that it's easier to explicitly identify what is and isn't an extension.

stevage · 4 years ago
In the first part of the article I was thinking, oh, maybe Steve Coast isn't such a jerk after all.

Then I got to the meat of it. Oh dear.

As one of the many, many people who have had to deal with OSM data, I curse people with this attitude that the mess is somehow desirable or necessary. It's not. There is a long spectrum between totally free-form and completely constrained, and OSM's data model sits painfully far down the wrong end, causing enormous harm to all kinds of potential reuses of the data.

It also causes harm to the people creating data. Try adding bike paths and figuring out what tags are appropriate in your area. Try working out how to tag different kinds of parks, or which sorts of administrative boundaries should be added or how they should be maintained. It puts many people off, me included.

Bah.

delusional · 4 years ago
I tend to agree with you, having done a fair bit of cursing at the OSM format as well.

Yet they've made an open source map, and I haven't. The data tells me that I'm wrong.

maxerickson · 4 years ago
For a crowd-sourced dataset, a strict ontology wouldn't work anyway. Instead of messy tag definitions, you'd have tag use that didn't align with the definitions.

I don't mean that as an argument against improving the tagging!

The biggest friction point is probably that people resist rationalization of tagging schemes that have demonstrated themselves to be problematic.

The tagging system in the iD editor tries to address the issue, supporting search terms and suggesting related tags and so on.

The article is more about the underlying storage of the geometries (I don't think there is the same level of interest in changing the basic approach to tagging/categorization).

Sujan · 4 years ago
I think those are the important bits:

> The Engineering Working Group (EWG) of the OSMF has “commissioned” (I think that’s OSMF language for paid) a longstanding proponent of rules and complexity to, uh, investigate how to add rules and complexity to OSM.

> [...]

> Let us pray that the EWG is just throwing Jochen a bone to go play in the corner and stop annoying the grownups.

It's a "response" to https://blog.openstreetmap.org/2022/06/02/announcement-data-...

everybodyknows · 4 years ago
This seems an important bit to me:

> Facebook solved this in a beautifully OSM-like way: daylight. Daylight is a sanitized, consistent and cleaned up map based on OSM

https://daylightmap.org/

https://registry.opendata.aws/daylight-osm/#usageexamples

https://gist.github.com/jenningsanderson/3e42a99dcb8f760038a...

matkoniecz · 4 years ago
> The harder you make it for them to edit, the less volunteers you’ll get.

And that is why a dedicated area type (rather than representing areas with lines or special relations[0]) could help new mappers and new users of the data.

There would be very significant transition costs, but maybe it would be overall beneficial.

It is possible to have objects that are both an area and a line at once. Or an area according to one tool/map/editor and a line according to another.

And many multipolygon relations are in an inconsistent state and require manual fixup.

Also, the complexity of the whole area baggage makes explaining things to newbies harder. You can either try to hide the complexity (as the iD in-browser editor does), leaving people hopelessly confused when things get complicated, or present the full complexity (JOSM), leaving people overwhelmed.

See https://wiki.openstreetmap.org/wiki/Area#Tags_implying_area_... for a start of a complexity fractal.

[0] https://wiki.openstreetmap.org/wiki/Area

pramsey · 4 years ago
1000 times yes! I am a spatial data expert but only a some-time OSM editor, and I still have yet to figure out how to create a polygonal feature more complex than a single building footprint. The theoretical advantage of a unified topology model of just nodes/edges, where polygons and lines share core geometry, is nullified by cultural rules that tell editors "don't do that" (I had a bunch of parks that shared a boundary with a road reverted with nasty notes). The current setup is not just hard for processors; it's hard for non-experts to understand, and therefore a higher barrier than a simple polygon model would be.
matkoniecz · 4 years ago
> how to create a polygonal feature more complex than a single building footprint

In iD (the default editor) you can select an area and an area inside it, or two disjoint areas, then right-click and choose "Merge". Or press "c" while the areas are selected to combine them.

In JOSM there is an equivalent "create multipolygon" (or "update multipolygon") action.

https://wiki.openstreetmap.org/wiki/Relation:multipolygon#Ho...

> parks that shared a boundary with a road

FYI, that is because a highway=* road line represents the centerline of the carriageway, so unless the park somehow ends in the middle of the road and includes half of its surface, sharing the boundary will not be correct.

It also makes future editing quite nasty.

JackFr · 4 years ago
I know it’s nothing to do with the main thrust of the article, but the author fundamentally misrepresents KYC. Know-your-customer is a facet of anti-money laundering and anti-corruption regulation. It has nothing to do with talking to users.
myself248 · 4 years ago
Perhaps an existing term was co-opted by financial legislation...
bornfreddy · 4 years ago
Maybe, but not likely. The quoted text fits the common term definition:

> The answer, as any product owner will tell you, is to get close to the customer. To talk to them. To understand them. To feel their pain. The {big short}:

> Deutsche Bank had a program it called KYC (Know Your Customer), which, while it didn't involve anything so radical as actually knowing their customers, did require them to meet their customers, in person, at least once.

matkoniecz · 4 years ago
> Let us pray that the EWG is just throwing Jochen a bone to go play in the corner and stop annoying the grownups.

That is neither helpful nor useful, nor does it make me more likely to treat this diatribe seriously.

NelsonMinar · 4 years ago
It is the kind of disrespectful rhetoric that defines the OSM community though.
matkoniecz · 4 years ago
I do not consider it defining, and it is definitely neither desirable nor good for one's standing.

For reference: I am extremely active in the OSM community. On channels that I moderate, this would result in the user being warned/kicked (but not banned, except in cases of repeated insults).

kawsper · 4 years ago
I’ve started playing with data from OpenStreetMap. It began with me trying to fetch all the places where I could get water while moving around Copenhagen, which turned out not to be as easy as first envisioned, because OSM has a lot of different ways to categorise available water. Which makes sense: OSM and its tagging system aren't there to support only my use case, and describing my idea doesn't map 1:1 onto the model.

I identified the following tags to look out for:

amenity=drinking_water, https://wiki.openstreetmap.org/wiki/Tag:amenity%3Ddrinking_w...

man_made=water_tap, https://wiki.openstreetmap.org/wiki/Tag:man_made%3Dwater_tap

amenity=water_point, https://wiki.openstreetmap.org/wiki/Tag:amenity%3Dwater_poin...

drinking_water=*, https://wiki.openstreetmap.org/wiki/Key:drinking_water
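One way to pull all four at once is a single Overpass API union query. The sketch below only assembles the query string; the Copenhagen bounding box is approximate, and the tag list is the one identified above:

```python
# Build an Overpass QL union over several water-related tags.
# (None as the value means "key present with any value", i.e. drinking_water=*.)
WATER_TAGS = [
    ("amenity", "drinking_water"),
    ("man_made", "water_tap"),
    ("amenity", "water_point"),
    ("drinking_water", None),
]

def build_water_query(bbox=(55.61, 12.45, 55.73, 12.65)) -> str:
    """bbox is (south, west, north, east); the default roughly covers Copenhagen."""
    b = ",".join(str(c) for c in bbox)
    clauses = "\n".join(
        f'  nwr["{k}"]({b});' if v is None else f'  nwr["{k}"="{v}"]({b});'
        for k, v in WATER_TAGS
    )
    return f"[out:json][timeout:25];\n(\n{clauses}\n);\nout center;"
```

POSTing the resulting string to a public Overpass endpoint (e.g. https://overpass-api.de/api/interpreter) returns all matching nodes, ways, and relations as JSON.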

It's a tough problem to map out the world and describe it, especially when everyone can add or modify the data, but anything that could improve the import experience of tools like osm2pgsql would be welcome.

Aachen · 4 years ago
I don't understand how this doesn't fit your use case. The tags are for different things, e.g.

> for places where you can get larger amounts of "drinking water" for filling a fresh water holding tank, such as found on caravans, RVs and boats

versus

> a man-made construction providing access to water, supplied by centralized water distribution system (unlike in case of man_made=water_well [...]). The tag man_made=water_tap is used for publicly usable water taps, such as those in the cities and graveyards. Water taps may provide potable and technical water, which can be specified with drinking_water=yes and drinking_water=no.

And another tag for when you're not mapping a separate water point, but indicating whether a given feature has drinking water (for example a well or mountain hut).

You're saying that it's tough when anyone can mess with the data rather than working in a structured way, but these tags have distinct definitions and seem perfectly sensible to me (there are much worse examples, like highway=track, which spawned huge discussions in various places within the community). How do these tags not match your use case of selecting the tags you need and displaying them the way you want (e.g. as a list or a map)?

francisofascii · 4 years ago
When features are sometimes tagged specifically and other times tagged more generically, it is impossible to get clean results. You either have to filter on the more specific tag (leaving out valid features) or include the generically tagged features (pulling in features that should not be there).
snickerer · 4 years ago
I spent 20 months of my life traveling around Europe and Asia and I found the sources to fill up my camper's water tank mostly using OSM data! It works very well in most areas.

I used the app Maps.me for that (which, by the way, I would not recommend anymore). Maps.me's internal search function is not intuitive, but I found out the right keywords to get to drinking water sources.

To your list I would add searching for springs. Especially in mountainous areas you often find usable springs (sometimes pipes coming out of a wall) with drinking water.

matkoniecz · 4 years ago
> I used the the app Maps.me for that (which by the way I would not recommend anymore).

Organic Maps is its successor - https://organicmaps.app/