If I'm guessing right, Motherduck will likely be acquired by GCP, since most of the founding team is ex-BQ. Snowflake already purchased Modin, and polars is still too immature to be acquisition-ready. So what does that leave us with? There's also EDB, which competes in the enterprise Postgres space.
Folks I know in the industry are not very happy with Databricks. Databricks themselves were hinting to people that they might be acquired by Azure as Azure tried to compete in the data warehouse space. But then everyone became an AI company, which left Databricks in an awkward spot. Their bdev team isn't the best, from my limited interactions with them (lots of Starbucks drinkers and "let me get back to you after a 3-month PTO"), so they don't know who should lead them through an AI pivot, or how. With cash to burn from overinvestment and the Snowflake/Databricks conferences coming up fast, they needed a big announcement, and this is that big announcement.
Should have sobered up before writing this though. But who cares.
Yeah, big companies gobbling up everything does not lead to a healthy ecosystem. Congrats to the founders on the acquisition, but everyone else loses with moves like this.
I'm still sour about their Redash purchase, which instantly "killed" the open source version. The Tabular acquisition was also a bit controversial, since one of the founders is the PMC Chair for Iceberg, which "competes" directly with Databricks' own Delta Lake. The mere presence of these giants (mostly Databricks and Snowflake) makes the whole data ecosystem, both closed and open source, really hostile.
Take the RLE encoding, which switches between run-length encoding and bit-packing. The way bit-packing has been implemented is to generate 74,000 lines of Java to read/write every combination of bit width, endianness, and value length.
I just can't believe this is optimal except in maybe very specific CPU-only cases (e.g. Parquet-Java running on a giant cluster somewhere).
If it were just bit-packing I could easily offload a whole data page to a GPU and not care about having per-bitwidth optimised implementations, but having to switch encodings at random intervals just makes this a headache.
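To make the switching concrete, here's a minimal sketch of the hybrid decode loop as I read the Parquet encodings spec: a ULEB128 run header whose low bit selects between an RLE run and a bit-packed run, with values packed starting from the least significant bit of each byte. The function names, the missing length prefix, and the `u32` output are my own simplifications, not the actual parquet-java code.

```rust
/// Read a ULEB128 varint, advancing `pos`.
fn read_uleb128(data: &[u8], pos: &mut usize) -> u64 {
    let (mut value, mut shift) = (0u64, 0);
    loop {
        let byte = data[*pos];
        *pos += 1;
        value |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            return value;
        }
        shift += 7;
    }
}

/// Sketch of the RLE/bit-packing hybrid: every run starts with a varint
/// header, and its low bit tells you which of the two encodings follows.
fn decode_rle_hybrid(data: &[u8], bit_width: usize, num_values: usize) -> Vec<u32> {
    let mut out = Vec::with_capacity(num_values);
    let mut pos = 0;
    while out.len() < num_values {
        let header = read_uleb128(data, &mut pos);
        if header & 1 == 0 {
            // RLE run: repeat count in the header, one value padded to whole bytes.
            let count = (header >> 1) as usize;
            let byte_width = (bit_width + 7) / 8;
            let mut value = 0u32;
            for i in 0..byte_width {
                value |= u32::from(data[pos + i]) << (8 * i);
            }
            pos += byte_width;
            out.extend(std::iter::repeat(value).take(count));
        } else {
            // Bit-packed run: (header >> 1) groups of 8 values, each bit_width
            // bits wide, packed from the least significant bit of each byte.
            let values = (header >> 1) as usize * 8;
            for i in 0..values {
                let mut value = 0u32;
                for b in 0..bit_width {
                    let bit_pos = i * bit_width + b;
                    let bit = (data[pos + bit_pos / 8] >> (bit_pos % 8)) & 1;
                    value |= u32::from(bit) << b;
                }
                out.push(value);
            }
            pos += values * bit_width / 8;
        }
    }
    out.truncate(num_values); // the last bit-packed group may be padded
    out
}

fn main() {
    // bit_width = 3: an RLE run of five 4s, then one bit-packed group holding 0..=7.
    let data: [u8; 6] = [0x0A, 0x04, 0x03, 0x88, 0xC6, 0xFA];
    let decoded = decode_rle_hybrid(&data, 3, 13);
    assert_eq!(decoded, vec![4, 4, 4, 4, 4, 0, 1, 2, 3, 4, 5, 6, 7]);
    println!("{:?}", decoded);
}
```

The point being: the inner bit loop above is exactly the part the generated Java replaces with one hand-specialized class per bit width, and the run switching is the part you can't vectorise away.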
It would be really nice if actual design documents existed that explain why this is a good idea based on real-world data patterns.
I haven't thought much about it, but I believe the ideal reference implementation would be a highly optimized "service-like" process that you run alongside your engine, using Arrow to share zero-copy buffers between the engine and the parquet service. Parquet predates Arrow by quite a few years, and Java was (unfortunately) the standard for big data stuff back then, so they simply stuck with it.
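For what it's worth, a minimal sketch of that hand-off using the Rust `arrow` crate's IPC stream format might look like the following. The in-memory byte buffer stands in for whatever transport the "service" would actually use; true zero-copy would need shared memory or the Arrow C Data Interface, and the column contents here are made up, so treat this purely as an illustration of the boundary.

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int32Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::error::ArrowError;
use arrow::ipc::reader::StreamReader;
use arrow::ipc::writer::StreamWriter;
use arrow::record_batch::RecordBatch;

fn main() -> Result<(), ArrowError> {
    let schema = Arc::new(Schema::new(vec![Field::new("ids", DataType::Int32, false)]));

    // "Parquet service" side: pretend this column was just decoded from a page.
    let column: ArrayRef = Arc::new(Int32Array::from(vec![1, 1, 1, 7, 7, 42]));
    let batch = RecordBatch::try_new(schema.clone(), vec![column])?;

    // Serialize as an Arrow IPC stream; with a shared-memory segment the
    // reader could reference these buffers in place instead of copying.
    let mut ipc_bytes = Vec::new();
    {
        let mut writer = StreamWriter::try_new(&mut ipc_bytes, &schema)?;
        writer.write(&batch)?;
        writer.finish()?;
    }

    // "Engine" side: consume the stream batch by batch.
    let reader = StreamReader::try_new(ipc_bytes.as_slice(), None)?;
    for maybe_batch in reader {
        let batch = maybe_batch?;
        println!("engine received {} rows", batch.num_rows());
    }
    Ok(())
}
```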
> The way bit-packing has been implemented is to generate 74,000 lines of Java to read/write every combination of bitwidth, endianness and value-length
I think they did this to avoid dynamic dispatch in Java. With C++ or Rust something very similar would happen, but at the compiler level, which is a much saner way of doing this kind of thing.
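Rough illustration of "at the compiler level": in Rust the bit width can be a const generic parameter, and the compiler monomorphizes one specialized unpacker per width you actually instantiate, which is roughly what the generated Java classes do by hand in source form. The function name and the LSB-first packing convention here are my own assumptions, not anything from parquet-java.

```rust
/// Unpack `out.len()` values of WIDTH bits each, packed LSB-first.
/// Each distinct WIDTH becomes its own specialized function in the binary.
fn unpack_bits<const WIDTH: usize>(packed: &[u8], out: &mut [u32]) {
    for (i, slot) in out.iter_mut().enumerate() {
        let mut value = 0u32;
        for b in 0..WIDTH {
            let bit_pos = i * WIDTH + b;
            let bit = (packed[bit_pos / 8] >> (bit_pos % 8)) & 1;
            value |= u32::from(bit) << b;
        }
        *slot = value;
    }
}

fn main() {
    // Four 2-bit values (0, 1, 2, 3) packed LSB-first into one byte.
    let packed = [0b1110_0100u8];
    let mut out = [0u32; 4];
    unpack_bits::<2>(&packed, &mut out);
    assert_eq!(out, [0, 1, 2, 3]);
}
```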