rabernat commented on Loading a trillion rows of weather data into TimescaleDB   aliramadhan.me/2024/03/31... · Posted by u/PolarizedPoutin
ohmahjong · a year ago
This is a bit off-topic but I'm interested in the same space you are in.

There seems to be an inherent tension between large chunks (great for visualising large extents and larger queries) and smaller chunks (better for point-based or timeseries queries). It's possible but not very cost-effective to maintain separately chunked versions of these large geospatial datasets. I have heard of "kerchunk" being used to try to get the best of both, but then I _think_ you lose out on the option of compressing the data, and it introduces quite a lot of complexity.
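
For anyone unfamiliar, the kerchunk workflow looks roughly like this (the file path is illustrative); it indexes the chunks of an existing file rather than rewriting them, so the source's chunking and compression carry over:

```python
# Minimal kerchunk sketch: build a JSON reference index over an existing NetCDF/HDF5 file.
# The references map Zarr keys to byte ranges in the original file, so the data itself is
# never rewritten -- its chunking and compression stay whatever the source file used.
import json
import fsspec
from kerchunk.hdf import SingleHdf5ToZarr

url = "era5_2020.nc"  # illustrative path; could equally be an s3:// or gs:// URL
with fsspec.open(url, "rb") as f:
    refs = SingleHdf5ToZarr(f, url, inline_threshold=300).translate()

with open("era5_2020.json", "w") as out:
    json.dump(refs, out)
```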

What are your thoughts on how to strike that balance between use cases?

rabernat · a year ago
> It's possible but not very cost-effective to maintain separately-chunked versions of these large geospatial datasets.

Like all things in tech, it's about tradeoffs. S3 storage costs about $275 per TB per year. Typical weather datasets are ~10 TB. If you're running a business that uses weather data in operations to make money, you could easily afford to make 2-3 copies that are optimized for different query patterns. We see many teams doing this today in production. That's still much cheaper (and more flexible) than putting the same volume of data in an RDBMS, given the relative cost of S3 vs. persistent disks.
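
Back-of-the-envelope, assuming S3 Standard pricing of roughly $0.023 per GB-month (exact prices vary by region and storage class):

```python
# Rough storage-cost sketch; the $0.023/GB-month figure is an assumption (S3 Standard).
per_tb_per_year = 0.023 * 1000 * 12            # ~= $276 per TB per year
copies, dataset_tb = 3, 10                     # three differently-chunked copies of a ~10 TB dataset
print(f"${copies * dataset_tb * per_tb_per_year:,.0f} per year")  # -> $8,280 per year
```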

The real hidden cost of all of these solutions is the developer time spent operating the data pipelines for the transformation.

rabernat commented on Loading a trillion rows of weather data into TimescaleDB   aliramadhan.me/2024/03/31... · Posted by u/PolarizedPoutin
counters · a year ago
Why?

Most weather and climate datasets - including ERA5 - are highly structured on regular latitude-longitude grids. Even if you were solely doing timeseries analyses for specific locations plucked from this grid, the strength of this sort of dataset is its intrinsic spatiotemporal structure and context, and it makes very little sense to completely destroy that structure unless your sole and exclusive goal is to extract point timeseries. And even then, you'd probably want to decimate the data pretty dramatically, since there is very little use for, say, a point timeseries of surface temperature in the middle of the ocean!

The vast majority of research and operational applications of datasets like ERA5 are probably better suited by leveraging cloud-optimized replicas of the original dataset, such as ARCO-ERA5 published on the Google Public Datasets program [1]. These versions of the dataset preserve the original structure, and chunk it in ways that are amenable to massively parallel access via cloud storage. In almost any case I've encountered in my career, a generically chunked Zarr-based archive of a dataset like this will be more than performant enough for the majority of use cases that one might care about.

[1]: https://cloud.google.com/storage/docs/public-datasets/era5

rabernat · a year ago
True, but in fact, the Google ERA5 public data suffers from the exact chunking problem described in the post: it's optimized for spatial queries, not timeseries queries. I just ran a benchmark, and it took me 20 minutes to pull a timeseries of a single variable at a single point!
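
A minimal sketch of that kind of point extraction with Xarray (the store URI, variable name, and coordinates below are illustrative, not my exact benchmark; check the ARCO-ERA5 listing for current paths):

```python
import xarray as xr

# Illustrative ARCO-ERA5 store URI and variable name -- verify against the public listing.
store = "gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3"
ds = xr.open_zarr(store, chunks=None, storage_options={"token": "anon"})

# One variable, one grid point, one year. With chunking optimized for spatial queries,
# a long point timeseries like this has to touch a very large number of chunks.
ts = (
    ds["2m_temperature"]
    .sel(latitude=40.75, longitude=286.0, method="nearest")
    .sel(time=slice("2020-01-01", "2020-12-31"))
    .load()
)
```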

This highlights the need for timeseries-optimized chunking if that is your anticipated usage pattern.

rabernat commented on Loading a trillion rows of weather data into TimescaleDB   aliramadhan.me/2024/03/31... · Posted by u/PolarizedPoutin
rabernat · a year ago
Great post! Hi Ali!

I think what's missing here is an analysis of what is gained by moving the weather data into an RDBMS. The motivation is to speed up queries. But what's the baseline?

As someone very familiar with this tech landscape (maintainer of Xarray and Zarr, founder of https://earthmover.io/), I know that serverless solutions + object storage can deliver very low-latency (sub-second) performance for timeseries queries on weather data--much faster than the 30 minutes cited here--_if_ the data are chunked appropriately in Zarr. Given the difficulty of data ingestion described in this post, it's worth seriously evaluating those solutions before going down the RDBMS path.
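
As a rough illustration of what "chunked appropriately" means (store paths and chunk sizes below are placeholders, not a recipe):

```python
import xarray as xr

# Sketch: write a timeseries-optimized copy of a (time, latitude, longitude) dataset.
# Paths and chunk sizes are placeholders -- tune chunks to the queries you actually run.
ds = xr.open_zarr("s3://my-bucket/era5-spatially-chunked.zarr")

rechunked = ds.chunk({"time": -1, "latitude": 16, "longitude": 16})
for var in rechunked.variables.values():
    var.encoding.pop("chunks", None)  # drop chunk encoding inherited from the source store

rechunked.to_zarr("s3://my-bucket/era5-timeseries-chunked.zarr", mode="w")
```

(At the multi-terabyte scale you would typically use a dedicated tool like rechunker rather than a naive read-rechunk-write, but the target layout is the same idea.)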

rabernat commented on We're wasting money by only supporting gzip for raw DNA files   bioinformaticszen.com/pos... · Posted by u/michaelbarton
rabernat · 3 years ago
The Zarr format is used in some genomics workflows (see https://github.com/zarr-developers/community/issues/19) and supports a wide range of modern compressors (e.g. Zstd, Zlib, BZ2, LZMA, ZFPY, Blosc), as well as many filters.
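
For example, picking a compressor when writing a Zarr array looks like this with the Zarr v2 Python API (array shape, dtype, and codec parameters here are just illustrative):

```python
import numpy as np
import zarr
from numcodecs import Blosc

# Illustrative only: Blosc+Zstd with bit-shuffle; any numcodecs codec can be swapped in.
compressor = Blosc(cname="zstd", clevel=5, shuffle=Blosc.BITSHUFFLE)
z = zarr.open(
    "genotypes.zarr",
    mode="w",
    shape=(1_000_000, 100),
    chunks=(10_000, 100),
    dtype="i1",
    compressor=compressor,
)
z[:10_000, :] = np.random.randint(0, 3, size=(10_000, 100), dtype="i1")
```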
rabernat commented on Request for Startups: Climate Tech   ycombinator.com/blog/rfs-... · Posted by u/jeremylevy
smaddox · 3 years ago
I'm curious, have you considered using PyTorch or JAX for tensor processing? ML libraries seem to be much further along when it comes to performing compute-intensive, hardware-accelerated operations on tensors. And you get gradients basically for free (in terms of developer time). Also, the kernel compiler being added in PyTorch 2 looks very promising.
rabernat · 3 years ago
PyTorch and JAX are used heavily in climate science on the ML side. For more general analytics, not so much. Many of our users like to use Xarray as a high-level API. There has been some work to integrate Xarray with PyTorch (https://github.com/pydata/xarray/issues/3232) but we're not there yet.

The Python Array API standard should help align these different back-ends: https://data-apis.org/array-api/latest/
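
To give a flavour, code written against the Array API standard is backend-agnostic (a minimal sketch; it assumes the input arrays come from a library that implements the standard's __array_namespace__ protocol, e.g. recent NumPy):

```python
# Backend-agnostic RMSE under the Python Array API standard: the same function works for
# any library whose arrays expose __array_namespace__ (recent NumPy does; others are adopting it).
def rmse(pred, target):
    xp = pred.__array_namespace__()
    return xp.sqrt(xp.mean((pred - target) ** 2))
```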

rabernat commented on Request for Startups: Climate Tech   ycombinator.com/blog/rfs-... · Posted by u/jeremylevy
jandrewrogers · 3 years ago
I've worked in "climate intelligence" for many years. The list overlooks one of the largest and most immediate opportunities around that market: the data infrastructure and analysis tools we have today are profoundly unfit for purpose. Just about everyone is essentially using cartography tools to do large-scale spatiotemporal analysis of sensor and telemetry data. The gaps for both features and practical scalability are massive.

It has made most of the climate intelligence analysis we'd like to do, and for which data is available, intractable. And what we can do is so computationally inefficient that we figuratively burn down a small forest every time we run an analysis on a non-trivial model, which isn't very green either.

(This is definitely something I'd work on if I had the bandwidth, it is a pretty pure deep tech software problem.)

rabernat · 3 years ago
Agree 100%. This is a big part of the motivation behind our new startup Earthmover: https://earthmover.io/

Our mission is to make it easier to work with scientific data at scale in the cloud, focusing mainly on the climate, weather, and geospatial vertical.

My cofounder Joe Hamman and I are climate scientists who helped create the Pangeo project. We are also core devs on the Python packages Xarray and Zarr. We think that a layer of managed services (think a "modern data stack" oriented around the multidimensional array data model) is exactly what this ecosystem needs to make it easier for teams to build data-intensive products in the climate-tech space.

And we're hiring! https://earthmover.io/posts/earthmover-is-hiring/

u/rabernat

Karma: 30 · Cake day: February 15, 2016
About
Startup Founder, Scientist and Software Developer

- CEO and co-founder of Earthmover: https://earthmover.io/
- Associate professor in the Columbia University Department of Earth and Environmental Science and Lamont-Doherty Earth Observatory: https://ocean-transport.github.io/
- Co-founder and community leader of the Pangeo Project: https://pangeo.io/
