I used it to compare Databricks SQL Serverless (Medium) vs Databricks Jobs clusters with Apache Sedona 1.7 across 12 queries (from simple filters to joins, distance joins, multi-way joins, KNN) at SF100 and SF1000 (SF1000 is roughly 500GB uncompressed Parquet).
TL;DR: apart from one query, Sedona was up to ~6x better on cost per query, and it also completed more queries under the same 10-hour timeout guardrail. Some queries didn’t finish or errored on either side, so there is a capability matrix in the post.
My bet is that most of the actually useful spatial ST_ functions are not implemented in this one, as they are not in the DuckDB offering.
In terms of number of functions, PostGIS is still the leader, but for analytical functions (spatial relationships, distances, etc.) having those in place in these systems is important. DuckDB started this, but SedonaDB has a spatially focused engine. You can use the two together: PostGIS for transactional processing and queries, and SedonaDB for processing and data prep.
A combination of tools makes a lot of sense here especially as the data starts to grow.
A project that crunches real estate and satellite imagery data with scalable spatial joins
Sedona basically makes spatial at scale way more accessible. Instead of rolling your own hacks, you get Spark's distributed compute with geospatial APIs baked in.
Yeah, but any analysis you're likely to perform is approximate enough that the fact that ZIP codes aren't polygons is basically a rounding error.
Plus, it's a lot easier to get ZIP codes, and they're more reliably correct, so you might still get better results than you would with another indicator that is either (a) less reliable or (b) less available.
You're not going to wind up with a situation where ZIP codes with the same regional marker end up on different coasts.
A 5+4 formatted ZIP code maps to just a handful of addresses. In cities with larger populations, the +4 could map to a single building, and in more sparsely populated places, it might include houses on a handful of roads.
For smaller datasets, ZIP+4 might as well be a unique household identifier. I just checked a 10 million address database and 60% of entries had a unique ZIP+4, so one other bit of PII would be enough to be a 99.99% unique identifier per person.
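If anyone wants to run the same kind of check on their own address list, here is a rough sketch of how the "share of entries with a unique ZIP+4" number could be computed. The function name and the sample codes are made up for illustration, and it assumes the ZIP+4 values are already normalized strings:

```python
from collections import Counter


def unique_zip4_share(zip4_codes):
    """Return the fraction of entries whose ZIP+4 appears exactly once."""
    codes = list(zip4_codes)
    if not codes:
        return 0.0
    counts = Counter(codes)
    # Count entries (not distinct codes) whose code is shared with nobody else.
    unique_entries = sum(1 for code in codes if counts[code] == 1)
    return unique_entries / len(codes)


# Hypothetical sample, not real addresses: two entries share a ZIP+4,
# the other two each have their own.
sample = ["12345-0001", "12345-0001", "12345-0002", "67890-1234"]
print(f"{unique_zip4_share(sample):.0%} of entries have a unique ZIP+4")
```

On a real dataset you would feed it the ZIP+4 column. The higher the share, the closer ZIP+4 gets to being a household identifier on its own.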
With a geo-coded ZIP+4 database, you could locate people with a precision that's proportional to the population density of their region.
They don't represent geography at all, they represent the organizational structure of USPS.
They work by making the address on a letter almost meaningless. For some smaller-population ZIP codes you can practically just put the name and ZIP code down and achieve delivery.