> This project seems very similar to Apache Arrow, if OP or anyone else is around to explain why one might be used over the other that would be great.
Arrow is primarily a serialization format for transferring data between distributed systems. It uses zero-copy reads and other techniques to quickly process and store large data sets in memory.
Other libraries allow you to query Arrow data once processed.
This project is an in-memory columnar data store with querying and other capabilities.
It's actually possible: columns are simple Go interfaces and can be redefined for specific types. You can easily build implementations of columns that load data from disk or even a remote server (RDBMS, S3, ...) and retain the indexing capability.
On the flip side, you could actually fit more data in memory than with non-columnar methods: since the storage is column-by-column, it compresses very well. For example, boolean values are stored as bitmaps in this implementation, and strings could be deduplicated via a hash map so only one copy of each distinct string is kept in memory, even if you have millions of rows.
Is there a Go equivalent of Calcite? If so, you could probably bolt that onto the query path and handle translating the logical plan into the physical plan, which is what the currently provided query API amounts to.
Very cool. This kind of storage is similar to what's typically being used in Entity Component Systems like EnTT [0], which can also be interpreted as in-memory column oriented databases.
Recently I've started to like this style of programming over OOP. Each part of your application accesses a global in-memory DB holding the full state, but unlike a traditional persistent DB it's really fast.
I've wondered about bitmap indexing before. Is it an optimization for speed, too, or just memory?
If I had an array of a million things, and I wanted to specify some large subset of them via a separate million-element array (like in numpy/pandas), is it faster to do it via a million bytes or via a million bits (i.e. I think this is bitmap indexing, right?)? I would think that the bytes would be faster, even though terribly wasteful of memory. From my rudimentary knowledge of CPUs, I thought they didn't really operate at the bit level, so you'd have to do a few instructions of calculations. Or would it be made up for by the cache line reading in more of the indexer in one fetch?
I'm a little naive on this subject, but just wondering what are the use cases for in-memory columnar stores? I was under the impression that columnar stores are good for OLAP use cases involving massive amounts of data. For datasets that fit within memory, are there still benefits in organizing data in a columnar manner and are the performance gains appreciable?
Also see my comment above, but you find this kind of storage commonly in game development [0] where you are optimizing for batch access on specific columns to minimize cache misses. It's usually used as the storage layer for Entity Component Systems. It's also called data-oriented design [1]
I’m not sure about any performance gains or working with large datasets, but the ancient Metakit[1] was just a really pleasant relational algebra library ( ≠ SQL data model library; it could do e.g. relations as values, which are difficult for row-oriented databases). I’d say that Metakit & OOMK in Tcl are strictly better than the relational part of Pandas in Python, except the documentation is somewhere between bad and nonexistent.
Not a subject-matter expert, but a few that come to mind: memory can become a bottleneck; reading sequential data instead of chasing pointers and reading useless data that thrashes caches gives much better throughput; and compression applied to columnar data is more efficient, which can give a throughput boost when memory bandwidth becomes the bottleneck on systems with many CPUs.
I am merely a dabbler in this area and definitely not an expert, but my understanding is that columnar stores tend to be substantially more efficient for analytical operations over large sets of in-memory data by virtue of the data being easier to operate on with vectorized instructions like SIMD.
OT: Not a Go dev here but have some side projects written in it... Isn't docker for Go a bit unorthodox? I had a few nice headaches setting up my local env to use docker with Go to mirror my python workflow (all projects have a Dockerfile, no dependencies installed locally). I was under the impression that Pro Go people do not use docker for local Go dev. Please correct me if I am wrong.
Docker and Go work fine together, but using docker for Go dev is just an unnecessary hassle, especially if (like me) you’re doing dev on macOS: you have to cross-compile to Linux, which is slower, and then build and deploy the container, versus the very quick compile-run cycle of regular Go.
As a reformed Java developer I can say that docker didn’t add much time to the build cycle and gave us a better way to package resources for Java code, but Go is far more ergonomic, so taking a <2 second compile time for a small microservice and adding docker to turn it into a 30 second build time just isn’t worth whatever utility you get from containers at dev time.
Go doesn’t benefit as much from docker, but if you’re already living in a docker world (i.e. everything you deploy is a docker image, and it’s managed by compose or kubernetes) then it’s easier to use docker than not.
We build images (about 20, each with a Dockerfile) from a monorepo with a single go.mod. I have basically a full replica of prod running locally in k3s — letting k3s manage it all is easier than dealing with the pile of environment variables that would be needed to get everything hooked up properly. And with kustomize, we can reuse a bunch of yaml from prod.
Sometimes I’ll run go binaries locally on my machine for debugging (the builds still work because go’s packaging is finally stable). But the difference is minimal — using docker/k8s is more about streamlining deployment/config/rollback (and the occasional co-packaged asset) than anything else.
I agree that adding docker to a Go dev setup is not worth it, but I think commenter was asking for a docker image for running it. In that case, I’d say that docker could be worth it for the end user.
That's the idea: a transaction commit log decoupled from the underlying durable storage lets you build your own persistence layers. I'm still thinking about building a simple (memory-mapped?) layer, but as an optional, separate lib.
[0] https://github.com/skypjack/entt
https://www.amazon.com/Component-Software-Object-Oriented-Pr...
Programming against Objective-C protocols, COM interfaces, Component Pascal framework, and so forth.
[0] http://cowboyprogramming.com/2007/01/05/evolve-your-heirachy...
[1] https://en.wikipedia.org/wiki/Data-oriented_design
[1]: https://git.jeelabs.org/jcw/metakit