DuckDB is probably the most important geospatial software of the last decade

“import geopandas” also exists and has for some time. Snark aside, WHAT is special about duckDB? I wish the author had actually shown some practical examples so I could understand their claims better.

maxxen · 8 months ago

I replied to another comment, but I think a big part is that duckdbs spatial extension provides a SQL interface to a whole suite of standard foss gis packages by statically bundling everything (including inlining the default PROJ database of coordinate projection systems into the binary) and providing it for multiple platforms (including WASM). I.E there are no transitive dependencies except libc.

Yes, DuckDB does a whole lot more, vectorized larger-than-memory execution, columnar compressed storage and a ecosystem of other extensions that make it more than the sum of its parts. But while Ive been working hard on making the spatial extension more performant and more broadly useful (I designdd a new geometry engine this year, and spatial join optimization just got merged on the dev-branch), the fact that you can e.g. convert too and from a myriad of different geospatial formats by utilizing GDAL, transforming through SQL, or pulling down the latest overture dump without having the whole workflow break just cause you updated QGIS has probably been the main killer feature for a lot of the early adopters.

(Discmaimer, I work on duckdb-spatial @ duckdblabs)

timschmidt · 8 months ago

I'm not the OP, but thank you for such a detailed answer. The integration and reduced barriers to entry you mention mirror my own experiences with tooling in another area, and your explanation made parallels clear.

carstonh · 8 months ago

Here's an example building on @maxxen's work - because DuckDB (+spatial extension) can compile to Wasm, I built an in-browser Shapefile to CSV converter tool: https://www.honeycombmaps.com/tools/shapefile-to-csv-convert...

larsiusprime · 8 months ago

This is an excellent reply and what I wish the article had been, thanks!

Amadiro · 8 months ago

Is there any strong reason to use GeoParquet instead of straight up parquet if all I'm interested in is storing and operating on lat/lons?

I'm curious if it compresses them better or something like that. I see lots of people online saying it compresses well (but mostly compared to .shp or similar) but normal parquet (.gz.parquet or .snappy.parquet) already does that really well. So it's not clear to me if I should spend time investigating it...

I mostly process normal parquet with spark and sometimes clickhouse right now.

larodi · 8 months ago

> a big part is that duckdbs spatial extension provides a SQL interface to a whole suite of standard foss gis packages by statically bundling everything (including inlining the default PROJ database of coordinate projection systems into the binary) and providing it for multiple platforms (including WASM). I.E there are no transitive dependencies except libc.

and for the last twenty, not ten years, this is what PostGIS was pioneering, and also teaching everyone get used to. DuckDB was not something that people even knew in GIS world. I'm not even sure whether QGIS connects to DuckDB, perhaps it does for a while, but it sure knows Spatialite for very long and last, but not least - ESRI sure as f*ck still have not heard of DuckDB. This is already half the geospatial world out there.

This whole article is superb biased and its very sad.

dahauns · 8 months ago

As someone unfamiliar with DuckDB but at least somewhat with geospatial tools (it's been a few years, though): Dang - see, now that is seriously cool. The whole ETL shebang always was the biggest hassle, even with serious commercial tools, and the idea of a stable, all-in-one, ready-to-go layer is incredibly appealing.

It's just something the writer of the article should probably have at least mentioned when going full hyperbole with the title (hey, after reading this, it might actually be justified! :) ).

Rereading the article that focuses on the one thing that isn't a standout (the installation itself), though, I can't help but chuckle and think "worst sales pitch ever". ;)

tmpz22 · 8 months ago

I've been researching DuckDB - while it has many technical merits I think the main argument will be ease of use. It has a lot of the operational advantages of sqlite paired with strong extensibility and good succinct documentation.

Folks who have been doing DevOps work are exasperated with crummy SaaS vendors or antiquated OSS options that have a high setup cost. DuckDB is just a mature project that offers an alternative, hence an easy fan favorite among hobbyists (I imagine at scale the opportunity costs change and it becomes less attractive).

wenc · 8 months ago

How is the adoption among DevOps folks?

I'm still getting feedback that many devs are not too comfortable with reading and writing SQL. They learned simple SELECT statements in school, but get confused by JOINs and GROUP BYs.

jjtheblunt · 8 months ago

duckdb has parquet support and can operate, in SQL syntax, on enormous 'tables' spread across huge collections of parquet files as if one virtual file. i believe the underlying implication is opportunities to leverage vector instructions on parquet. it's very "handy".

getnormality · 8 months ago

Everything is special about DuckDB. Pandas is way, way behind the state of the art in tabular data analysis.

Deleted Comment

dbreunig · 8 months ago

Author here: what's special is that you can go from 0 to spatial data incredibly quickly, in the data generalist tool you're already using. It makes the audience of people working with geospatial data much bigger.

(Geopandas is great, too.)

dopidopHN · 8 months ago

I’m very familiar with Postgres and spinning one with postgis seems easy enough. Do I get more with duckdb?

Most of the time I store locations and compute distance to them. Would that being faster to implement with duckdb

tsss · 8 months ago

For one it doesn't have the god awful pandas API.

hokusad · 8 months ago

Yeah, that's one of my favourite aspects to it

joshvm · 8 months ago

I haven't used duckDB but the real comparison is presumably postgis? Which is also absent from the discussion, but I think what the author alludes to.

I have no major qualm with pandas and geopandas. However I use it when it's the only practical solution, not because I enjoy using it as a library. It sounds like pandas (or similar) vs a database?

stevage · 8 months ago

Yeah, PostGIS is readily available, and postgres is much more widely used than DuckDB. Either I don't understand OP's argument for why this is so important or I just don't buy it.

If you're using JavaScript you install Turf. The concept that you can readily install spatial libraries is hardly earth shattering.

tom_m · 8 months ago

Convenience will always be a personal preference.

serjester · 8 months ago

Ask anyone that's just starting out with geo pandas about their experience, and I'd be shocked if anyone calls it intuitive and straightforward. To geopandas credit, I think they just inherited many of Pandas' faults (why does a user need to understand indexes to do basic operations, no multi core support, very poor type hinting, etc).

I work on geospatial apps and the software I think I am most excited about is https://felt.com/. I want to see them expand their tooling such that maps and data source authentication/authorization was controllable by the developer, to enable tenant isolation with proprietary data access. They could really disrupt how geospatial tech gets integrated into consumer apps.

This article doesn't acknowledge how niche this stuff is and it's a lot of training to get people to up to speed on coordinate systems, projections, transformations, etc. I would replace a lot of my custom built mapping tools with Felt if it were possible, so I could focus on our core geospatial processes and not the code to display and play with it in the browser, which is almost as big if not bigger in terms of LOC to maintain.

As mentioned by another commenter, this DuckDB DX as described is basically the same as PostGIS too.

dbreunig · 8 months ago

Author here: the beauty of DuckDB spatial is that the projections and CRS options are hidden until you need them. For 90% of geospatial data usage people don't and shouldn't need to know about projections or CRS.

Yes, there are so many great tools to handle the complexity for the capital-G Geospatial work.

I love Felt too! Sam and team have built a great platform. But lots of times a map isn't needed; an analyst just needs it as a column.

PostGIS is also excellent! But having to start up a database server to work with data doesn't lend itself to casual usage.

The beauty of DuckDB is that it's there in a moment and in reach for data generalists.

korkoros · 8 months ago

My experience has been that data generalists should stay away from geospatial analysis precisely because they lack a full appreciation of the importance of spatial references. I've seen people fail at this task in so many ways. From "I don't need a library to reproject, I'll just use a haversine function" to "I'll just do a spatial join of these address points in WGS84 to these parcels in NAD27" to "these North Korean missiles aren't a threat because according to this map using a Mercator projection, we are out of range."

DuckDB is great, but the fact that it makes it easier for data generalists to make mistakes with geospatial data is mark against it, not in its favor.

jparishy · 8 months ago

I think we're mostly making the same point about complexity, ya.

To me, I think it's mostly a frontend problem stopping the spread of mapping in consumer apps. Backend geo is easy tbh. There is so much good, free tooling. Mapping frontend is hell and there is no good off the shelf solution I've seen. Some too low level, some too high level. I think we need a GIS-lite that is embeddable to hide the complexity and let app developers focus on their value add, and not paying the tax of having frontend developers fix endless issues with maps they don't understand.

edit: to clarify, I think there's a relationship between getting mapping valued by leadership such that the geo work can be even be done by analysts, and having more mapping tools exist in frontend apps such that those leaders see them and understand why geo matters. it needs to be more than just markers on the map, with broad exposure. hence my focus on frontend web. sorry if that felt disjointed

febed · 8 months ago

Last I checked DuckDB spatial didn’t support handling projections. It couldn’t load the CRS from a .prj file. This makes it useless for serious geospatial stuff.

jandrewrogers · 8 months ago

> it's a lot of training to get people to up to speed on coordinate systems, projections, transformations, etc

This can mostly be avoided entirely with a proper spheroidal reference system, computational geometry implementation, and indexing. Most uses of geospatial analytics are not cartographic in nature. The map is at best a presentation layer, it is not the data model, and some don’t use a map at all. Forcing people to learn obscure and esoteric cartographic systems to ask simple intuitive questions about geospatial relationships is a big part of the problem. There is no reason this needs to be part of the learning curve.

I’ve run experiments on unsophisticated users a few times with respect to this. If you give them a fully spheroidal WGS84 implementation for geospatial analytics, it mostly “just works” for them anywhere on the globe and without regard for geospatial extent. Yes, the software implementation is much less trivial but it is qualitatively superior UX because “the world” kind of behaves how people intuit it should without having to know anything about projections, transforms, etc. And to be honest, even if you do know about projections and transforms, the results are still often less than optimal.

The only issue that comes up is that a lot of cartographic visualization toolkits are somewhat broken if you have global data models or a lot of complex geometry. Lots of rendering artifacts. Something else to work on I guess.

maxxen · 8 months ago

Im inclined to agree, but unfortunately a huge amount of the existing data and processes in this space does not assume a spheroidal earth and come provided with a coordinate reference system. Ultimately there are also some domains where you got data that you explicitly don't want to interpret using spheroidal semantics, e.g. when working with a city plan - in which case the map _is_ the data model, and you definitely want the angles of a triangle to sum up to 180.

groggo · 8 months ago

> the software implementation is much less trivial

Aren't most geospatial tools just doing simple geometry? And therefore need to work on some sort of projection?

If you can do the math on the spheroidal model, ok you get better results and its easier to intuit like you said, but it's much more complicated math. Can you actually do that today with tools like QGIS and GDAL?

kashifr · 8 months ago

Have you tried https://geobase.app/ they recently also had a post about duckdb integration: https://geobase.app/blog/duckdb-1-1-3

jparishy · 8 months ago

I had not, looks pretty cool but solves the inverse of the problem as I see it. I want a backend agnostic frontend toolset that is a GIS that I can customize to my needs. I don't want to implement the tools myself, that's too low level. I don't want the service to manage, control, or own the data, that's too high level. There's a sweet spot I don't think is being hit yet.

stevage · 8 months ago

I was just about to get into Felt then they took away the free tier and made it very expensive.

mtmail · 8 months ago

https://atlas.co/ still has a free tier. Less features I think, depends on your use case of course.

Dead Comment

fastasucan · 7 months ago

>As mentioned by another commenter, this DuckDB DX as described is basically the same as PostGIS too.

No, its not.

I'm a big fan of DuckDB and I do geospatial analysis, mostly around partitioning geographies (into Uber H3 hexagons), calculating Haversine distances, calculating areas of geometries, figuring out which geometry a point falls in, etc. Many of these features have existed in some form or other in geopandas or postgis, so DuckDB's spatial extensions bring nothing new.

But what DuckDB as an engine does is it lets me work directly on parquet/geoparquet files at scale (vectorized and parallelized) on my local desktop. It beats geopandas in that respect. It's a quality of life improvement to say the least.

DuckDB also has an extension architecture that admits more exotic geospatial features like Hilbert curves, Uber H3 support.

https://duckdb.org/docs/stable/extensions/spatial/functions....

https://duckdb.org/community_extensions/extensions/h3.html

sroerick · 8 months ago

I totally agree with this. DuckDB for me was a huge QoL improvement just working with random datasets. I found it much easier to explore datasets using DuckDB rather than Pandas, Postgres or Databricks.

The spatial features were just barely out when I was last doing a lot of heavy geospatial work, but even then they were very nice.

An aside, I had a Junior who would just load datasets into PowerBI to explore them for the first time, and that was actually a shockingly useful workflow.

pandas is very nice and was my bread and butter for a long time, but I frequently ran into memory issues and problems at scale with pandas, which I would never hit with polars or duckdb. I'm not sure if this holds true today as I know there's been updates, but it was certainly a problem then. Using geopandas ran into the same issues.

Just using GDAL and other libraries out of the box is frankly not a great experience. If you have a QGIS (another wonderful tool) workflow, it's frustrating to be dropping into Jupyter notebooks to do translations, but that seemed to be the best option.

In general, it just feels like geospatial analysis is about 10 years behind regular data analysis. Shapefiles are common because of ESRI dominance, but frankly not a great format. PostGIS is great, geopandas is great, but there's a lot more things in the data ecosystem than just Postgres and pandas. PowerBI barely had geospatial support a couple years ago. I think PowerBI Shapemaps exclusively used TopoJSON?

All of this is to say, DuckDB geospatial is very cool and helpful.

ngrilly · 8 months ago

> An aside, I had a Junior who would just load datasets into PowerBI to explore them for the first time, and that was actually a shockingly useful workflow.

What was shockingly useful in PowerBI compared to DuckDB?

wodenokoto · 8 months ago

Why do you use haver-sine over geodesic or reprojection?

I’ve been doing the reprojection thing, projecting coordinates to a “local” CRS, for previous projects mainly because that’s what geopandas recommend and is built around, but I am reaching a stage where I’d like to calculate distance for objects all over the globe, and I’m genuinely interested to learn what’s a good choice here.

code_biologist · 8 months ago

Just an app dev, not a geospatial expert, but reprojection always seemed like something a library should handle under the hood unless one has specific needs. I'm used to the ergonomics / moron-proofing of something like Postgis' `ST_Distance(geography point1, geography point2)` and it gives you the the right answer in meters. You can easily switch to spherical or Cartesian distances if you need distance calculations to go faster. `ST_Area(geography geog)` and it gives you the size of your shape in square meters wherever on the planet.

colkassad · 8 months ago

Look into Vincenty [1] or Karney (for a more robust solution) [2]. Vincenty should be good enough for most use cases.

[1] https://en.wikipedia.org/wiki/Vincenty's_formulae

[2] https://github.com/pbrod/karney

Reprojection is accurate locally but inaccurate at scale.

Geodesics are the most accurate (Vincenty etc) but are computationally heavy.

Haversine is a nice middle ground.

everybodyknows · 8 months ago

Looking just at the Hilbert reference, I'm wondering why there is no function to return, for a given level of precision, the set of segments along the curve containing corresponding to a sub-rectangle of the space. Is this functionality packaged up elsewhere?

Demiurge · 8 months ago

> Prior to this, getting up and running from a cold-start might’ve required installing or even compiling severall OSS packages, carefully noting path locations, standing up a specialized database… Enough work that a data generalist might not have bothered, or their IT department might not have supported it.

I've been able to "CREATE EXTENSION postgis;" for more than a decade. There have been spatial extensions for PG, MySQL, Oracle, MS SQL Server, and SQLite for a long time. DuckDB doesn't make any material difference in how easy it is to install.

That requires data to already be in Postgres, otherwise you have to ETL data into it first.

DuckDB on the other hand works with data as-is (Parquet, TSV, sqlite, postgres... whether on disk, S3, etc.) with requiring an ETL step (though if the data isn't already in a columnar format, things are gonna be slow... but it will still work).

I work with Parquet data directly with no ETL step. I can literally drop into Jupyter or a Python REPL and duckdb.query("from '*.parquet'")

Correct me if I'm wrong, but I don't think that's possible with Postgis. (even pg_parquet requires copying? [1])

[1] https://www.crunchydata.com/blog/pg_parquet-an-extension-to-...

Yeah, if you want to work with GeoParquet, and you want to keep your data in that format. I can see how that's easer to use your example. That's not what a lot of geospatial data is in. You might have shapefiles, geopackages, geojsons, who knows? There is a lot of software, from QGIS to ESRI to work with different formats to solve different problems. I don't think GeoParquet, even though it might be the fastest geospatial vector data format right now, is that common, and the article did not claim that either. So, given an average user trying to answer some GIS question, some ETL is pretty much a given, on average. And given that, installing PostGIS and installing DuckDB, both require some ETL, and learning some query and analytics language. DuckDB might be an improvement, but it's certainly not as much of a leap as quote is making it out to be.

edoceo · 8 months ago

Not wrong. Load to PG, then query. Duck UVP is like bringing 8 common tools/features under one tent.

paradox460 · 8 months ago

The original article feels a tremendous amount like another piece of DuckDB marketing, from the breathless admiration to the baseless claims like the title

keynesyoudigit · 7 months ago

The benefits of a simple install were exaggerated, I think the real point is the cloud native integration and the sheer scalability

jokoon · 8 months ago

I tested spatialite, it works okay, but the setup is a bit tedious when inserting data.

perrygeo · 8 months ago

Why? The article is light on details. Yes, having spatial analysis combined with SQL is awesome and very natural. There's nothing special about 2D geometries that makes them significantly different from floats and strings in an RDBMS perspective - geometry is just another column type, albeit with some special operators and indexes. We've been doing it with PostGIS, Spatialite, etc for two decades at this point.

What DuckDB brings to the table is cloud-native formats. This is standard geospatial functionality attached to an object store instead of a disk. As such, it doesn't require running a database process - data is always "at rest" and available over HTTP. I'm not downplaying the accomplishment, it's really convenient. But know that this is a repackaging of existing tech to work efficiently within a cloud IO environment. If anything, the major innovation of DuckDB is in data management not geospatial per se.

mpalmer · 8 months ago

The argument seemed pretty clear to me; more accessible tooling means more users and contributors. But as an apparent SME I could see how you'd feel clickbaited by the title.

Mainly a point of clarification - I don't think DuckDB represents anything new in geospatial. It represents a new paradigm for data management and data architecture. Understanding the difference isn't optional.

DuckDB handles floating point numbers too - is DuckDB the most important thing in floating point data? Of course not, the data types and operators haven't changed. The underlying compute environment has. That's where the innovation is. I'd simply appreciate if technical writers took two seconds to make this vital distinction - being precise isn't hard and I don't know what we gain by publishing intentionally muddled technical articles.

I’m not sure I agree that “install geospatial” is a game changer in simplicity compared to “pip install geopandas”.

They are both one line.

I think a big part is that duckdbs spatial extension doesnt have any transitive dependencies (except libc). It statically packages the standard suite of foss gis tools (including a whole database of coordinate systems) for multiple platforms (including WASM) and provides a unified SQL interface to it all.

(Disclaimer, I work on duckdb-spatial @duckdblabs)

jessekv · 8 months ago

I tried out duckdb-spatial for a hobby project and my takeaway is the main thing it offers over geopandas is performance in batch processing.

If I can reduce my spatial analysis to SQL and optimize that SQL a little bit, duckdb will happily saturate my CPUs and memory to process things fast in a way that a Python script struggles to do.

Yes, the static linking is nice. It helps with the problem of reproducible environments in Python. But that's not a game changer IMO.

lillecarl · 8 months ago

"Accessibility" is too often dismissed, yes you CAN do things with things, but getting people to do it is a craft and art. This is often where open-ish-core differs from the enterprise version of something too

WD-42 · 8 months ago

Is it that much simpler than ‘load extension postgis’? I know geos and gdal have always kinda been a pain, but I feel like docker has abstracted it all away anyway. ‘docker pull postgis’ is pretty easy, granted I’m not familiar with what else duckdb offers.

Yes. The difference between provisioning a server and running 'install spatial' in a CLI is night and day.

Docker has been a big improvement (when I was first learning PostGIS, the amount of time I had to hunt for proj directories or compile software just to install the plugin was a major hurdle), but it's many steps away from:

``` $ duckdb D install spatial; ```

What do you mean by "provisioning a server"? That's a strange requirement. You can install Postgis on a macbook in one command, or actually on all 3 major OS's in one command: "brew install postgis", "apt-get install postgresql-postgis, and "choco install postgis-9.3". Does DuckDB not require a "server" or a "computer"? What does Docker have to do with anything? This is a very confusing train of thought.

Doctor_Fegg · 8 months ago

PostGIS is included in Postgres.app which is a single executable for Mac. DuckDB appears also to be a single file download for Mac. I’m not sure your “when I was first learning PostGIS” experience reflects the current situation.

https://postgresapp.com/

frainfreeze · 8 months ago

I mean I like duckdb but this feels like you're pushing for it. On my system postgis comes from apt install, and it's one command to activate the "plugin". Is the night and day part not having to run random sh script from the internet to install software on my system?

It is not simpler. I use it with testcontainers in the notebooks usually https://testcontainers-python.readthedocs.io/en/latest/

twelvechairs · 8 months ago

DuckDB is a great thing for geospatial but most important of the past decade? There's so many tools in different categories it wouldnt come near top for me. Some might be QGIS, postGIS (still the standard), ArcGIS online (still the standard), JS mapping tools like mapbox (i prefer deckgl), new data types like COG, geopackage and geoparquet, photogrammetry tools, 3d tiles, core libraries like gdal and now pdal, shapely, etc.

Most of those tools came out circa ~2000.

Yeah, I feel old.

It clearly says "most important of the past 10 years" not "most important that has been invented in the past 10 years". Even taking your definition that would narrow down the list like half maybe and you should probably know that