> This project seems very similar to Apache Arrow, if OP or anyone else is around to explain why one might be used over the other that would be great.
Arrow is primarily a serialization format for transferring data between distributed systems. It uses zero-copy reads and other techniques to quickly process and store large data sets in memory.
Other libraries allow you to query Arrow data once processed.
This project is an in-memory columnar data store with querying and other capabilities.
It's actually possible: columns are simple Go interfaces and can be redefined for specific types. You can easily build implementations of columns that load data from disk or even a remote server (RDBMS, S3, ...) and retain the indexing capability.
On the flip side, you could actually fit more data in memory than with non-columnar methods: since the storage is column-by-column, it compresses very well. For example, boolean values are stored as bitmaps in this implementation, and strings could be deduplicated via a hash map so only one copy of each distinct string is kept in memory, even if you have millions of rows.
Is there a Go equivalent of Calcite? If so, you could probably bolt that onto the query path and handle translating the logical plan into the physical plan, which is what the currently provided query API amounts to.
Very cool. This kind of storage is similar to what's typically being used in Entity Component Systems like EnTT [0], which can also be interpreted as in-memory column oriented databases.
Recently I've started to like this style of programming over OOP. Each part of your application accesses a global in-memory DB holding the full state, but unlike a traditional persistent DB it's really fast.
I've wondered about bitmap indexing before. Is it an optimization for speed, too, or just memory?
If I had an array of a million things, and I wanted to specify some large subset of them via a separate million-element array (like in numpy/pandas), is it faster to do it via a million bytes or via a million bits (i.e. I think this is bitmap indexing, right?)? I would think that the bytes would be faster, even though terribly wasteful of memory. From my rudimentary knowledge of CPUs, I thought they didn't really operate at the bit level, so you'd have to do a few instructions of calculations. Or would it be made up for by the cache line reading in more of the indexer in one fetch?
I'm a little naive on this subject, but just wondering what are the use cases for in-memory columnar stores? I was under the impression that columnar stores are good for OLAP use cases involving massive amounts of data. For datasets that fit within memory, are there still benefits in organizing data in a columnar manner and are the performance gains appreciable?
Also see my comment above, but you find this kind of storage commonly in game development [0] where you are optimizing for batch access on specific columns to minimize cache misses. It's usually used as the storage layer for Entity Component Systems. It's also called data-oriented design [1]
I’m not sure about any performance gains or working with large datasets, but the ancient Metakit[1] was just a really pleasant relational algebra library ( ≠ SQL data model library; it could do e.g. relations as values, which are difficult for row-oriented databases). I’d say that Metakit & OOMK in Tcl are strictly better than the relational part of Pandas in Python, except the documentation is somewhere between bad and nonexistent.
Not a subject-matter expert, but a few that come to mind: memory can become a bottleneck; reading sequential data instead of chasing pointers and reading useless data that thrashes caches gives much better throughput; and compression applied to columnar data is more efficient, which can give a throughput boost when memory bandwidth becomes the bottleneck on systems with many CPUs.
I am merely a dabbler in this area and definitely not an expert, but my understanding is that columnar stores tend to be substantially more efficient for analytical operations over large sets of in-memory data by virtue of the data being easier to operate on with vectorized instructions like SIMD.
OT: Not a Go dev here but have some side projects written in it... Isn't docker for Go a bit unorthodox? I had a few nice headaches setting up my local env to use docker with Go to mirror my python workflow (all projects have a Dockerfile, no dependencies installed locally). I was under the impression that Pro Go people do not use docker for local Go dev. Please correct me if I am wrong.
Docker and Go work fine together, but using docker for Go dev is just an unnecessary hassle, especially if (like me) you’re doing dev on macOS: you have to cross-compile to Linux, which is slower, and then build and deploy the container, versus the very quick compile-run cycle of regular Go.
As a reformed Java developer I can say that docker didn’t add much time to the build cycle and gave us a better way to package resources for Java code, but Go is far more ergonomic, so taking a <2 second compile time for a small microservice and adding docker to turn it into a 30 second build time just isn’t worth whatever utility you get from containers at dev time.
Go doesn’t benefit as much from docker, but if you’re already living in a docker world (i.e. everything you deploy is a docker image, and it’s managed by compose or kubernetes) then it’s easier to use docker than not.
We build images (about 20, each with a Dockerfile) from a monorepo with a single go.mod. I have basically a full replica of prod running locally in k3s — letting k3s manage it all is easier than dealing with the pile of environment variables that would be needed to get everything hooked up properly. And with kustomize, we can reuse a bunch of yaml from prod.
Sometimes I’ll run go binaries locally on my machine for debugging (the builds still work because go’s packaging is finally stable). But the difference is minimal — using docker/k8s is more about streamlining deployment/config/rollback (and the occasional co-packaged asset) than anything else.
I agree that adding docker to a Go dev setup is not worth it, but I think commenter was asking for a docker image for running it. In that case, I’d say that docker could be worth it for the end user.
That's the idea: a transaction commit log decoupled from the underlying durable storage lets you build your own persistence layers. I'm still thinking about building a simple (memory-mapped?) layer, but as an optional, separate lib.
[0] https://github.com/skypjack/entt
https://www.amazon.com/Component-Software-Object-Oriented-Pr...
Programming against Objective-C protocols, COM interfaces, Component Pascal framework, and so forth.
[0] http://cowboyprogramming.com/2007/01/05/evolve-your-heirachy...
[1] https://en.wikipedia.org/wiki/Data-oriented_design
[1]: https://git.jeelabs.org/jcw/metakit