mattforrest (u/mattforrest)

mattforrest commented on Making geo joins faster with H3 indexes floedb.ai/blog/how-we-mad... · Posted by u/matheusalmeida

mattforrest · a month ago

I wrote a post about using H3, or any DGGS for that matter. Yes, it speeds things up, but you lose accuracy. If search is the primary concern it can help, but if any level of accuracy matters I would just use a better engine with GeoParquet to handle it. https://sedona.apache.org/latest/blog/2025/09/05/should-you-...
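
To make the accuracy trade-off concrete, here is a minimal sketch. It assumes a plain square grid as a stand-in for a DGGS (real H3 cells are hexagons, and the h3 library is not used here), and the points, cell size, and radius are made up for illustration. It compares an exact distance join with a "same cell" join:

```python
import math

def cell_of(x, y, size):
    """Return the (col, row) index of the square grid cell containing (x, y)."""
    return (math.floor(x / size), math.floor(y / size))

def exact_pairs(a, b, radius):
    """Exact distance join: every (i, j) pair within `radius`, by brute force."""
    return {(i, j) for i, p in enumerate(a) for j, q in enumerate(b)
            if math.dist(p, q) <= radius}

def cell_pairs(a, b, size):
    """Approximate join: every (i, j) pair whose points share a grid cell."""
    index = {}
    for j, q in enumerate(b):
        index.setdefault(cell_of(q[0], q[1], size), []).append(j)
    return {(i, j) for i, p in enumerate(a)
            for j in index.get(cell_of(p[0], p[1], size), [])}

# Hypothetical data: two stores and three customers on a unitless plane.
stores = [(0.9, 0.5), (2.5, 2.5)]
customers = [(1.1, 0.5), (2.1, 2.9), (0.1, 0.1)]

exact = exact_pairs(stores, customers, radius=0.5)
approx = cell_pairs(stores, customers, size=1.0)

missed = exact - approx          # true matches split by a cell boundary
false_matches = approx - exact   # same cell, but farther apart than the radius
```

In this toy data the cell join misses the one true match (the two points sit on opposite sides of a cell edge) and reports two pairs that are farther apart than the radius. The cell lookup is a hash join rather than a pairwise comparison, which is where the speedup comes from.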

mattforrest commented on Price-performance benchmarks for spatial SQL (Databricks Serverless vs. Sedona) sedona.apache.org/latest/... · Posted by u/mattforrest

mattforrest · 2 months ago

SpatialBench is an open benchmark suite for spatial SQL. The goal is to compare engines on price-performance, since serverless makes raw “x times faster” claims hard to interpret.

I used it to compare Databricks SQL Serverless (Medium) vs Databricks Jobs clusters with Apache Sedona 1.7 across 12 queries (from simple filters to joins, distance joins, multi-way joins, KNN) at SF100 and SF1000 (SF1000 is roughly 500GB uncompressed Parquet).

TLDR: apart from one query, Sedona was up to ~6x better on cost per query, and also covered more queries under the same 10-hour timeout guardrails. Some queries didn’t finish or errored on either side, so there is a capability matrix in the post.

mattforrest commented on SedonaDB: A new geospatial DataFrame library written in Rust sedona.apache.org/latest/... · Posted by u/MrPowers

larodi · 6 months ago

Somehow I don't see this as applicable to 90% of all current spatial needs, where PostGIS does just fine, and the same IMHO goes for DuckDB. There perhaps exists 10% of businesses where the data is so immense you want to hit it with Rust & whatnot, but all the others do just fine in Postgres.

My bet is most of actually useful spatial ST_ functions are not implemented in this one, as they are not in the DuckDB offering.

mattforrest · 6 months ago

I wrote a book on PostGIS and used it for years, and these single-node analytical tools make sense when PostGIS performance starts to break down. For many tasks PostGIS works great, but again you are limited by the fact that your tables have to live in the DB and can only scale as much as the computing resources you have allocated.

In terms of number of functions PostGIS is still the leader, but for analytical functions (spatial relationships, distances, etc.) having those in place in these systems is important. DuckDB started this, but SedonaDB is a spatially focused engine. You can use the two together: PostGIS for transactional processing and queries, and then SedonaDB for processing and data prep.

A combination of tools makes a lot of sense here especially as the data starts to grow.

mattforrest commented on Apache Sedona tutorial for Spark based spatial processing [video] youtube.com/watch?v=V__Lq... · Posted by u/mattforrest

mattforrest · 6 months ago

I put together a tutorial for Apache Sedona which brings geospatial to Spark.

It walks through a project that crunches real estate and satellite imagery data with scalable spatial joins.

Sedona basically makes spatial at scale way more accessible. Instead of rolling your own hacks, you get Spark's distributed compute with geospatial APIs baked in.

mattforrest commented on Stop using zip codes for geospatial analysis (2019) carto.com/blog/zip-codes-... · Posted by u/voxadam

mcphage · a year ago

> The consequence is that performing any analysis with an assumption that ZIP codes are polygons is bound to be error-prone.

Yeah, but any analysis you're likely to perform is approximate enough that the fact that ZIP codes aren't polygons is basically a rounding error.

Plus, it's a lot easier to get ZIP codes, and they're more reliably correct, so you might still get better results than you would going with another indicator that is either (a) less reliable or (b) less available.

mattforrest · a year ago

They aren’t reliably correct, actually. The boundaries that the Census publishes are called ZIP Code Tabulation Areas, which are approximations of zip codes and include overlaps.

mattforrest commented on Stop using zip codes for geospatial analysis (2019) carto.com/blog/zip-codes-... · Posted by u/voxadam

Spivak · a year ago

Right but this ends up being a good approximation for geography because the reality of logistics is that you end up doing a cute n-ary search of the geography. When you know the regional hub you can say for certain a huge chunk of the US the zip code doesn't represent. And then you keep n-secting. Sometimes the land-mass you get at the end is specific enough for your uses.

You're not going to wind up with a situation where zip codes with the same regional marker end up on different coasts.

mattforrest · a year ago

Just use a spatial query. That’s what they are made for.
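
As a rough illustration of what a spatial query does under the hood, here is a minimal point-in-polygon sketch using the standard ray-casting test. The function name and the sample square are hypothetical, and a real engine would use an indexed predicate such as ST_Contains rather than hand-written code like this:

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: True if `point` lies inside `polygon`.

    `polygon` is a list of (x, y) vertices in order; the last vertex is
    implicitly joined back to the first. Points exactly on an edge are
    not handled specially in this sketch.
    """
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point matter.
        if (y1 > y) != (y2 > y):
            # x coordinate where this edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside  # one more crossing to the right of the point
    return inside

# Hypothetical service area: a 4 x 4 square.
service_area = [(0, 0), (4, 0), (4, 4), (0, 4)]
inside_result = point_in_polygon((1, 1), service_area)
outside_result = point_in_polygon((5, 1), service_area)
```

The point is tested against the actual boundary, so the answer does not depend on which postal code the point happens to carry.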

mattforrest commented on Stop using zip codes for geospatial analysis (2019) carto.com/blog/zip-codes-... · Posted by u/voxadam

mywittyname · a year ago

> For some smaller population zip codes you can practically just put the name and zip code down and achieve delivery.

A 5+4 formatted ZIP code maps to just a handful of addresses. In cities with larger populations, the +4 could map to a single building, and in more sparsely populated places, it might include houses on a handful of roads.

For smaller datasets, ZIP+4 might as well be a unique household identifier. I just checked a 10 million address database and 60% of entries had a unique ZIP+4, so one other bit of PII would be enough to be a 99.99% unique identifier per person.

With a geo-coded ZIP+4 database, you could locate people with a precision that's proportional to the population density of their region.

mattforrest · a year ago

Yeah, but we have that already in the census hierarchy. Plus you have to pay to access ZIP+4 geospatial data, and it changes sometimes as frequently as quarterly.

mattforrest commented on Stop using zip codes for geospatial analysis (2019) carto.com/blog/zip-codes-... · Posted by u/voxadam

cogman10 · a year ago

Zip codes (in the US) are machine readable numbers a mail sorter can use to send a parcel to the right delivery truck for final delivery. In the US, they represent the hierarchy of postal centers with the most significant digit representing the primary hub for a region and the smallest number the actual post office that will be in charge of delivering the letter (or truck if you do the extended post code).

They don't represent geography at all, they represent the organizational structure of USPS.

They work by making the address on a letter almost meaningless. For some smaller population zip codes you can practically just put the name and zip code down and achieve delivery.
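
The routing hierarchy described above can be sketched by splitting a ZIP code into its digit groups. This is a minimal illustration that assumes the commonly documented layout (first digit for the national area, first three digits for the sectional center, last two for the local post office or delivery area); the function and field names are made up for this example:

```python
def split_zip(zip_code):
    """Split a 5-digit ZIP (optionally ZIP+4) into its routing levels.

    The field names are illustrative, not official USPS terminology.
    """
    base, _, plus4 = zip_code.partition("-")
    if len(base) != 5 or not base.isdigit():
        raise ValueError(f"expected a 5-digit ZIP code, got {zip_code!r}")
    if plus4 and (len(plus4) != 4 or not plus4.isdigit()):
        raise ValueError(f"expected a 4-digit add-on, got {plus4!r}")
    return {
        "national_area": base[0],       # most significant digit: broad region
        "sectional_center": base[:3],   # regional sorting facility prefix
        "delivery_area": base[3:],      # local post office / delivery area
        "plus4": plus4 or None,         # optional extended segment
    }

parts = split_zip("10001-1234")
```

Every level is a sorting decision, not a boundary, which is why the digits alone say nothing about the shape of the area served.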

mattforrest · a year ago

Well put