ignoreusernames commented on The two versions of Parquet   jeronimo.dev/the-two-vers... · Posted by u/tanelpoder
willtemperley · 10 days ago
The reference implementation for Parquet is a gigantic Java library. I'm unconvinced this is a good idea.

Take the RLE encoding which switches between run-length encoding and bit-packing. The way bit-packing has been implemented is to generate 74,000 lines of Java to read/write every combination of bitwidth, endianness and value-length.

I just can't believe this is optimal except in maybe very specific CPU-only cases (e.g. Parquet-Java running on a giant cluster somewhere).

If it were just bit-packing I could easily offload a whole data page to a GPU and not care about having per-bitwidth optimised implementations, but having to switch encodings at random intervals just makes this a headache.

It would be really nice if actual design documents existed that specified why this is a good idea based on real-world data patterns.

ignoreusernames · 10 days ago
> The reference implementation for Parquet is a gigantic Java library. I'm unconvinced this is a good idea.

I haven't thought much about it, but I believe the ideal reference implementation would be a highly optimized "service-like" process that you run alongside your engine, using arrow to share zero-copy buffers between the engine and the parquet service. Parquet predates arrow by quite a few years and java was (unfortunately) the standard for big data stuff back then, so they simply stuck with it.
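
A minimal sketch of the zero-copy part with pyarrow's IPC format (the path and columns are made up, and a real setup would more likely use shared memory or Arrow Flight than a plain file):

  import pyarrow as pa

  table = pa.table({"user_id": [1, 2, 3], "score": [0.1, 0.9, 0.5]})

  # producer side (the hypothetical parquet service): write an Arrow IPC file
  with pa.OSFile("/tmp/shared.arrow", "wb") as sink:
      with pa.ipc.new_file(sink, table.schema) as writer:
          writer.write_table(table)

  # consumer side (the engine): memory-map the same bytes and read them without copying
  with pa.memory_map("/tmp/shared.arrow", "r") as source:
      shared = pa.ipc.open_file(source).read_all()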

> The way bit-packing has been implemented is to generate 74,000 lines of Java to read/write every combination of bitwidth, endianness and value-length

I think they did this to avoid java's dynamic dispatch. In C++ or Rust something very similar would happen through templates or monomorphization, but at the compiler level, which is a much saner way of doing this kind of thing.
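
For reference, the loop being specialized is roughly this (a Python sketch of LSB-first unpacking, not the actual parquet-java code); the generated variants just turn bit_width, the shifts and the masks into constants:

  def unpack_bits(data: bytes, bit_width: int, count: int) -> list[int]:
      # generic LSB-first bit-unpacking; the generated Java effectively has one
      # hand-unrolled copy of this per bit width / endianness / value length
      mask = (1 << bit_width) - 1
      out, buf, bits = [], 0, 0
      it = iter(data)
      for _ in range(count):
          while bits < bit_width:
              buf |= next(it) << bits   # pull in the next byte
              bits += 8
          out.append(buf & mask)        # the low bit_width bits are the next value
          buf >>= bit_width
          bits -= bit_width
      return out

  # the example from the spec: values 0..7 packed with bit width 3
  print(unpack_bits(bytes([0x88, 0xC6, 0xFA]), 3, 8))  # [0, 1, 2, 3, 4, 5, 6, 7]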

ignoreusernames commented on Databricks in talks to acquire startup Neon for about $1B   upstartsmedia.com/p/scoop... · Posted by u/ko_pivot
wqtz · 4 months ago
Databricks acquired bit.io and subsequently shut it down quite fast. Afaik bit.io had a very small team, and the founder was a serial entrepreneur who was not going to stick around, and he did not. I am not sure who from bit.io is still around at databricks.

If I am guessing right, Motherduck will likely be acquired by GCP because most of the founding team was ex-BQ. Snowflake purchased Modin, and polars is still too immature to be acquisition-ready. So what does that leave us with? There is also EDB, who is competing in the enterprise Postgres space.

Folks I know in the industry are not very happy with databricks. Databricks themselves were hinting to people that they would potentially be acquired by Azure as Azure tries to compete in the data warehouse space. But everyone became an AI company, which left Databricks in an awkward spot. Their bdev team is not the best, from my limited interactions with them (lots of starbucks drinkers and "let me get back to you after a 3-month PTO"), so they do not know who should lead them to an AI pivot, or how. With cash to burn from overinvestment and the snowflake/databricks conf coming up fast, they needed a big announcement, and this is that big announcement.

Should have sobered up before writing this though. But who cares.

ignoreusernames · 4 months ago
> Folks I know in the industry are not very happy with databricks

Yeah, big companies gobbling up everything does not lead to a healthy ecosystem. Congrats to the founders on the acquisition, but everyone else loses with moves like this.

I'm still sour after their Redash purchase, which instantly "killed" the open source version. The Tabular acquisition was also a bit controversial, since one of the founders is the PMC Chair for Iceberg, which "competes" directly with Databricks' own delta lake. The mere presence of these giants (mostly databricks and snowflake) makes the whole data ecosystem (both closed and open source) really hostile.

ignoreusernames commented on Anatomy of a SQL Engine   dolthub.com/blog/2025-04-... · Posted by u/ingve
ignoreusernames · 4 months ago
I recommend that anyone who works with databases write a simple engine. It's a lot simpler than you may think and it's a great exercise. If using python, sqlglot (https://github.com/tobymao/sqlglot) lets you skip all the parsing, and it even does some simple optimizations. From the parsed query tree it's pretty straightforward to build a logical plan and execute that. You can even use python's builtin ast module to convert sql expressions into python ones (so no need for a custom interpreter!)
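
A rough sketch of the shape this takes (the table, rows and tiny expression walker below are made up for the example; I'm using a toy interpreter instead of the ast trick to keep it short):

  import operator
  import sqlglot
  from sqlglot import exp

  TABLES = {"users": [{"id": 1, "age": 17}, {"id": 2, "age": 42}, {"id": 3, "age": 30}]}
  OPS = {exp.GT: operator.gt, exp.LT: operator.lt, exp.EQ: operator.eq}

  def eval_expr(node, row):
      # tiny expression evaluator; only columns, literals and a few comparisons
      if isinstance(node, exp.Column):
          return row[node.name]
      if isinstance(node, exp.Literal):
          return node.this if node.is_string else float(node.this)
      if type(node) in OPS:
          return OPS[type(node)](eval_expr(node.this, row), eval_expr(node.expression, row))
      raise NotImplementedError(node.sql())

  def execute(sql):
      tree = sqlglot.parse_one(sql)              # parsing handled by sqlglot
      rows = TABLES[tree.find(exp.Table).name]   # scan
      where = tree.args.get("where")
      if where:                                  # filter
          rows = [r for r in rows if eval_expr(where.this, r)]
      cols = [c.name for c in tree.expressions]  # project (bare columns only)
      return [{c: r[c] for c in cols} for r in rows]

  print(execute("SELECT id, age FROM users WHERE age > 18"))
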
ignoreusernames commented on ClickHouse gets lazier and faster: Introducing lazy materialization   clickhouse.com/blog/click... · Posted by u/tbragin
kwillets · 4 months ago
ignoreusernames · 4 months ago
Same thing with columnar/vectorized execution. It has been known for a long time that it's the "correct" way to process data for olap workloads, but it only became "mainstream" in the last few years (mostly due to arrow).

It's awesome that clickhouse is adopting it now, but a shame that it's not standard on anything that does analytics processing.
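
A toy contrast with pyarrow (numbers made up), just to show what "vectorized" buys you:

  import pyarrow as pa
  import pyarrow.compute as pc

  prices = pa.array([10.0, 12.5, 9.9, 20.0] * 1_000_000)  # one contiguous column

  # row-at-a-time (volcano-style): one interpreter-level step per value
  slow_total = sum(v.as_py() for v in prices)

  # columnar/vectorized: a single call over the whole buffer, free to use SIMD
  fast_total = pc.sum(prices).as_py()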

ignoreusernames commented on Apache DataFusion   datafusion.apache.org/... · Posted by u/thebuilderjr
bdndndndbve · 8 months ago
Spark is "in-memory" in the sense that it isn't forced to spill results to disk between operations, which used to be a point of comparison to MapReduce specifically. Not ground-breaking nowadays but when I was doing this stuff 10+ years ago we didn't have all the open-source horizontally scalable SQL databases you get now - Oracle could do it and RedShift was new hotness.
ignoreusernames · 8 months ago
> Spark is "in-memory" in the sense that it isn't forced to spill results to disk between operations

I see your point, but that's only true within a single stage. Any operator that requires partitioning (groupBys and joins, for example) requires writing to disk.

> [...] which used to be a point of comparison to MapReduce specifically.

So each mapper in hadoop wrote partial results to disk? LOL, this was way worse than I remember then. It's been a long time since I've dealt with hadoop.

> Not ground-breaking nowadays but when I was doing this stuff 10+ years

I would say that it wouldn't have been groundbreaking even 20 years ago. I feel like hadoop's influence held our entire field back for years. Most of the stuff that arrow made mainstream, and that is being used by a bunch of the engines mentioned in this thread, has been known for a long time. It's like, as a community, we had blindfolds on. Sorry about the rant, but I'm glad the hadoop fog is finally dissipating.

ignoreusernames commented on Apache DataFusion   datafusion.apache.org/... · Posted by u/thebuilderjr
hipadev23 · 8 months ago
Absolutely agree. Spark is the same garbage as Hadoop but in-memory.
ignoreusernames · 8 months ago
just out of curiosity, why do you say that spark is "in-memory"? I see a lot of people claiming that, including several that I've interviewed in the past few years, but it's not very accurate (at least in the default case). Spark SQL execution uses a bog-standard volcano-ish iterator model (with a pretty shitty codegen operator-merging part) built on top of their RDD engine. The exchange (shuffle) is disk-based by default (both for sql queries and lower-level RDD code), so unless you mount the shuffle directory on a ramdisk I would say that spark is disk-based. You can try it out in the spark shell:

  spark.sql("SELECT explode(sequence(0, 10000))").write.parquet("sample_data")
  spark.read.parquet("sample_data").groupBy($"col").count().count()
After running the code, you should see a /tmp/blockmgr-{uuid} directory that holds the exchange data.

ignoreusernames commented on Improving Parquet Dedupe on Hugging Face Hub   huggingface.co/blog/impro... · Posted by u/ylow
ignoreusernames · a year ago
> Most Parquet files are bulk exports from various data analysis pipelines or databases, often appearing as full snapshots rather than incremental updates

I'm not really familiar with how datasets are managed by them, but all of the table formats (iceberg, delta and hudi) support appending and some form of "merge-on-read" deletes that could help with this use case. Instead of always fully replacing datasets on each dump, more granular operations could be done. The issue is that this requires changing pipelines and some extra knowledge about the datasets themselves. A fun idea might involve taking a table format like iceberg and, instead of using parquet to store the data, just storing the column data, with the metadata externally defined somewhere else. On each new snapshot, a set of transformations (sorting, splitting blocks, etc) could be applied that minimizes the potential byte diff against the previous snapshot.
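
For example, with the deltalake (delta-rs) python package the difference between a full snapshot and an incremental dump is just the write mode (path and columns made up):

  import pyarrow as pa
  from deltalake import write_deltalake

  day_1 = pa.table({"id": [1, 2, 3], "value": ["a", "b", "c"]})
  day_2 = pa.table({"id": [4, 5], "value": ["d", "e"]})

  # full snapshot export: every dump rewrites the whole dataset
  write_deltalake("/tmp/dataset", day_1, mode="overwrite")

  # incremental export: only the new rows become new files, the existing
  # files (and their dedupe-friendly bytes) are left untouched
  write_deltalake("/tmp/dataset", day_2, mode="append")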

ignoreusernames commented on Sail – Unify stream processing, batch processing and compute-intensive workloads   github.com/lakehq/sail... · Posted by u/chenxi9649
ignoreusernames · a year ago
From the announcement: “As of now, we have mined 1,580 PySpark tests from the Spark codebase, among which 838 (53.0%) are successful on Sail. We have also mined 2,230 Spark SQL statements or expressions, among which 1,396 (62.6%) can be parsed by Sail”

Kinda early to call this a drop-in replacement with those numbers, no?

But, with enough parity this project could be a dream for anybody dealing with spark’s dreadful performance. Kudos to the team

ignoreusernames commented on Portugal brings back tax breaks for foreigners in bid to woo digital nomads   fortune.com/europe/2024/0... · Posted by u/WWWMMMWWW
A_D_E_P_T · a year ago
https://archive.ph/NleMx

How does this make any sense in light of Lisbon now being "one of the world’s most unlivable cities"? Seems like it can't possibly be good for the native Portuguese.

https://blogs.mediapart.fr/jacobin/blog/040922/real-estate-s...

ignoreusernames · a year ago
This is a fair argument that's often brought up, but I never see actual raw data backing it up. Housing is fucked in several places around the world, including a bunch of countries in Europe that don't have any tax breaks for specialized labor. I would love to look at some metrics like:

- How many units of housing are built each year

- How many units are rented and to what demographic (Portuguese families, immigrants sharing rooms, students, etc)

- How many migrants (legal and illegal)

- How many specialized migrants each year and the % of them that eventually buy a home

- How many units are bought up by funds and other financial entities

- How much tax and social security contributions are collected per year from specialized migrants and how that money is reinvested

- etc

I know it's basically impossible to have an accurate picture since those numbers are way too "politically loaded". Politics and facts don't mix very well, so we just default to whoever yells the loudest (especially true in Portugal, unfortunately).

EDIT: Format bullet points

ignoreusernames commented on The AWS S3 Denial of Wallet Amplification Attack   blog.limbus-medtec.com/th... · Posted by u/croes
ignoreusernames · a year ago
Early Athena (managed prestodb by AWS) had a similar bug when measuring columnar file scans. If it touched the file, it counted the whole file instead of just the column chunks read. If I’m not mistaken, this was a bug in presto itself, but it was a simple patch that landed upstream a long time before we did the tests. This was the first and only time we considered using a relatively early AWS product. It was so bad that our half-assed self-deployed version outperformed Athena on every metric that we cared about.
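
For anyone unfamiliar with why that distinction matters: with a columnar format you only have to read (and should only be billed for) the column chunks you touch, e.g. with pyarrow (made-up file and column names):

  import pyarrow.parquet as pq

  # reads only the "user_id" column chunks, not the whole file
  table = pq.read_table("events.parquet", columns=["user_id"])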
