Readit News
hendiatris commented on Jony Ive's OpenAI Device Barred From Using 'io' Name   macrumors.com/2025/12/05/... · Posted by u/thm
yellow_lead · 21 days ago
The device would be popular in England
hendiatris · 21 days ago
And Brazil, Portugal, and other Portuguese-speaking places
hendiatris commented on A cryptography expert on how Web3 started, and how it’s going   spectrum.ieee.org/web3-ha... · Posted by u/warrenm
Hizonner · 4 months ago
> Mine is going on 15 years.

I've been doing it for more like 30.

> In practice, you have to actually try pretty hard to misbehave enough for Google and Microsoft to notice and block you.

As far as I can tell, the "misbehavior" that got me blocked by Microsoft was being hosted on Linode... where I'd been for around ten years at that time, all on the same IP address. Tiny server, had never emitted a single even slightly spammy message, all the demanded technical measures in place, including the stupid ones.

Because of the huge number of people stupid enough to receive their email through Microsoft, I had to spend a bunch of time "appealing". That's centralization.

On edit: Oh, and the random yahoos out there running freelance blocklists can do a lot of damage to you, too, by causing smaller operators to reject your mail.

hendiatris · 4 months ago
They probably threw a fat CIDR block into their IP blacklist to fight off a spam campaign, and your IP was caught in the dragnet. This is how the big companies do it: they evaluate the risk of false positives, and as long as it stays below a threshold, they proceed.
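A minimal sketch of the mechanism (Python; the addresses and range below are made up for illustration, not any provider's actual blocklist): the blocklist stores whole network ranges, and any sender whose address happens to fall inside one gets rejected, spammer or not.

    import ipaddress

    # A whole /18 gets blocked to stop a spam campaign (made-up range).
    blocklist = [ipaddress.ip_network("172.104.0.0/18")]

    # An unrelated, long-standing mail server happens to sit inside it.
    sender = ipaddress.ip_address("172.104.33.7")

    for net in blocklist:
        if sender in net:
            print(f"{sender} rejected: falls inside blocked range {net}")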
hendiatris commented on DuckLake is an integrated data lake and catalog format   ducklake.select/... · Posted by u/kermatt
amluto · 7 months ago
I have a personal pet peeve about Parquet that is solved, incompatibly, by basically every "data lake / lakehouse" layer on top, and I'd love to see a compatible solution: ranged partitioning.

I have an application which ought to be a near-perfect match for Parquet. I have a source of timestamped data (basically a time series, except that the intervals might not be evenly spaced -- think log files). A row is a timestamp and a bunch of other columns, and all the columns have data types that Parquet handles just fine [0]. The data accumulates, and it's written out in batches, and the batches all have civilized sizes. The data is naturally partitioned on some partition column, and there is only one writer for each value of the partition column. So far, so good -- the operation of writing a batch is a single file creation or create call to any object store. The partition column maps to the de-facto sort-of-standard Hive partitioning scheme.

Except that the data is (obviously) also partitioned on the timestamp -- each batch covers a non-overlapping range of timestamps. And Hive partitioning can't represent this. So none of the otherwise excellent query tools can naturally import the data unless I engage in a gross hack:

I could also partition on a silly column like "date". This involves aligning batches to date boundaries and also makes queries uglier.

I could just write the files and import "*.parquet". This kills performance and costs lots of money.

I could use Iceberg or Delta Lake or whatever for the sole benefit that their client tools can handle ranged partitions. Gee thanks. I don't actually need any of the other complexity.

It would IMO be really, really nice if everyone could come up with a directory-name or filename scheme for ranged partitioning (a sketch of what that could look like follows at the end of this comment).

[0] My other peeve is that a Parquet row and an Arrow row and a Thrift message and a protobuf message, etc., are *almost* but not quite the same thing. It would be awesome if there was a companion binary format for a single Parquet row or a stream of rows so that tools could cooperate more easily on producing the data that eventually gets written into Parquet files.
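To make the ranged-partitioning wish concrete, here is a purely hypothetical sketch (Python; the naming convention is invented for illustration and is not recognized by any tool today) of a filename scheme that encodes each batch's timestamp range, so files can be pruned before they are ever opened:

    from datetime import datetime
    from pathlib import Path

    FMT = "%Y%m%dT%H%M%SZ"  # all timestamps treated as naive UTC for simplicity

    def batch_path(root: str, device: str, ts_min: datetime, ts_max: datetime) -> Path:
        # e.g. root/device=abc/ts_min=20240101T000000Z--ts_max=20240101T060000Z.parquet
        name = f"ts_min={ts_min.strftime(FMT)}--ts_max={ts_max.strftime(FMT)}.parquet"
        return Path(root) / f"device={device}" / name

    def overlaps(path: Path, lo: datetime, hi: datetime) -> bool:
        # Recover the batch's range from the filename and keep only files
        # whose range overlaps the query window.
        lo_s, hi_s = (part.split("=", 1)[1] for part in path.stem.split("--"))
        f_lo = datetime.strptime(lo_s, FMT)
        f_hi = datetime.strptime(hi_s, FMT)
        return f_lo <= hi and f_hi >= lo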

hendiatris · 7 months ago
In the lower-level Arrow/Parquet libraries you can control the row groups, and even the data pages (although that is a lot more work). I have used this heavily with the arrow-rs crate to drastically improve (roughly 10x) how quickly data can be queried from files. Some row groups have just a few rows, others have thousands, but being able to skip searching most row groups makes the skew irrelevant.

Just beware of one limit: the maximum number of row groups per file (2^15).
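As a rough illustration of the reading side (the comment above is about arrow-rs, but the same idea is shorter to show in PyArrow; the file name, column name, and query window are placeholders): inspect per-row-group statistics and read only the groups that can match.

    import pyarrow.parquet as pq

    QUERY_LO, QUERY_HI = 1_700_000_000_000, 1_700_086_400_000  # epoch-ms window

    pf = pq.ParquetFile("events.parquet")
    ts_idx = pf.schema_arrow.get_field_index("ts")  # assumed int64 epoch-ms column

    keep = []
    for i in range(pf.metadata.num_row_groups):
        stats = pf.metadata.row_group(i).column(ts_idx).statistics
        # Skip any row group whose min/max range cannot overlap the window;
        # with well-chosen row group boundaries this is where the big wins come from.
        if stats is not None and stats.has_min_max and stats.max >= QUERY_LO and stats.min <= QUERY_HI:
            keep.append(i)

    table = pf.read_row_groups(keep, columns=["ts", "value"])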

hendiatris commented on Apache iceberg the Hadoop of the modern-data-stack?   blog.det.life/apache-iceb... · Posted by u/samrohn
Gasp0de · 10 months ago
The problem is that the initial writing is already so expensive that I guess we'd have to write multiple sensors into the same file instead of having one file per sensor per interval. I'll look into Parquet access options; if we could write 10k sensors into one file but still read a single sensor from that file, that could work.
hendiatris · 10 months ago
You may be able to get close with sufficiently small row groups, but you will have to run some tests. You can do this in a few hours of work: take some sensor data, sort it by the identifier, and write it to Parquet with one row group per sensor. The ParquetWriter class in PyArrow works for this, as does anything else that gives you fine-grained control over how the file is written. I just checked and saw that you can have around 7 million row groups per file, so you should be fine.

Then spin up DuckDB and do some performance tests. I'm not sure this will work; there is some overhead to reading Parquet, which is why small files and small row groups are discouraged.
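A rough sketch of that test (PyArrow plus DuckDB; the schema and the synthetic data below are placeholders, not the real sensor feed):

    import duckdb
    import pyarrow as pa
    import pyarrow.parquet as pq

    schema = pa.schema([("sensor_id", pa.int32()),
                        ("ts", pa.int64()),       # epoch ms
                        ("value", pa.float64())])

    # Synthetic stand-in: 100 sensors, 1000 rows each, already grouped by sensor.
    with pq.ParquetWriter("sensors.parquet", schema) as writer:
        for sensor_id in range(100):
            rows = pa.table({
                "sensor_id": pa.array([sensor_id] * 1000, pa.int32()),
                "ts": pa.array(range(1000), pa.int64()),
                "value": pa.array([float(sensor_id)] * 1000, pa.float64()),
            })
            # One write_table call per sensor -> one row group per sensor,
            # as long as each table stays under the writer's row_group_size.
            writer.write_table(rows)

    print(pq.ParquetFile("sensors.parquet").metadata.num_row_groups)  # 100

    # Quick selectivity check: the filter on sensor_id lets the reader skip
    # row groups whose min/max statistics exclude the value.
    print(duckdb.sql(
        "SELECT count(*), min(ts), max(ts) FROM 'sensors.parquet' WHERE sensor_id = 42"
    ))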

hendiatris commented on Apache iceberg the Hadoop of the modern-data-stack?   blog.det.life/apache-iceb... · Posted by u/samrohn
indoordin0saur · 10 months ago
Also very interested in the Parquet tuning. I have been building my data lake, and most of the optimization I do is just efficient partitioning.
hendiatris · 10 months ago
I will write something up when the dust settles; I'm still testing things out. It's a project where the data is fairly standardized but there is about a petabyte to deal with, so I think it makes sense to invest in efficiency at the lower level rather than throw tons of resources at it. That has meant a custom parser for the input data written in Rust, lots of analysis of the statistics of the data, etc. It has been a different approach to data engineering, and one I hope we see more of.

Regarding reading materials, I found this DuckDB post to be especially helpful in realizing how parquet could be better leveraged for efficiency: https://duckdb.org/2024/03/26/42-parquet-a-zip-bomb-for-the-...

hendiatris commented on Apache iceberg the Hadoop of the modern-data-stack?   blog.det.life/apache-iceb... · Posted by u/samrohn
hendiatris · 10 months ago
This is a huge challenge with Iceberg. I have found that there is substantial bang for your buck in tuning how Parquet files are written, particularly the row group size and column-level bloom filters. In addition, I make heavy use of the encoding options (dictionary/RLE) while denormalizing data into as few files as possible. This has allowed me to rely on DuckDB for querying terabytes of data at low cost and with acceptable performance.
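A hedged sketch of that write-side tuning in PyArrow, followed by a DuckDB check (the paths and column names are placeholders; bloom-filter options are left out because how they are exposed varies by writer library and version):

    import duckdb
    import pyarrow.parquet as pq

    table = pq.read_table("raw/part-000.parquet")   # assumed input file

    # Sort on the columns you filter by most, so per-row-group min/max statistics
    # become selective and dictionary/RLE encodings stay compact.
    table = table.sort_by([("tenant_id", "ascending"), ("ts", "ascending")])

    pq.write_table(
        table,
        "tuned/part-000.parquet",
        row_group_size=128_000,                      # smaller groups -> finer pruning
        use_dictionary=["tenant_id", "event_type"],  # dictionary-encode low-cardinality columns
        compression="zstd",
        write_statistics=True,
    )

    # EXPLAIN shows whether the predicate is pushed into the Parquet scan,
    # which is what lets DuckDB skip row groups entirely.
    print(duckdb.sql(
        "EXPLAIN SELECT count(*) FROM 'tuned/part-000.parquet' WHERE tenant_id = 'acme'"
    ))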

What we are lacking now is tooling that gives you insight into how you should configure Iceberg. Does something like this exist? I have been looking for something that would show me the query plan developed from Iceberg metadata, but I haven't found anything. It would go a long way toward showing where the bottleneck is for queries.

hendiatris commented on Luigi Mangione's account has been renamed on Stack Overflow   substack.evancarroll.com/... · Posted by u/OsrsNeedsf2P
CaptainFever · a year ago
https://stratpolitics.org/2024/12/unitedhealthcare-poll/

31% positive for those under 45, 8% positive for those above 45.

41% negative for those under 45, 77% negative for those above 45.

Not the majority, even for younger people. And remember, this is just U.S. opinion; people in other countries might view this differently (likely even more negatively).

hendiatris · a year ago
Look at the website for that polling company. It is bizarre. None of the people on the people page have the company on their LinkedIn pages. Seems to be astroturf.

Edit: look at the photos of the people… AI generated perhaps?

hendiatris commented on Pgroll – Zero-downtime, reversible, schema changes for PostgreSQL (new website)   pgroll.com/... · Posted by u/todsacerdoti
rixed · a year ago
Pgroll shines if you are doing slow rollouts.

Recently, while on the market for a tool to manage SQL migration patches with no need for slow rollouts, I reviewed many such tools, and the one that impressed me was Sqitch: https://github.com/sqitchers/sqitch

So if you are interested in this field and Pgroll is not quite what you are looking for, I recommend you have a look at Sqitch.

hendiatris · a year ago
Sqitch is an incredibly underappreciated tool. It doesn't have a business pushing it the way Flyway and Liquibase do, so it isn't as widely known, but I vastly prefer it to comparable migration tools.
hendiatris commented on Show HN: Vekos – a Rust OS with Built-In Cryptographic Verification   github.com/JGiraldo29/vek... · Posted by u/jgiraldo29
hendiatris · a year ago
How quickly does the append-only chain grow? What are the storage needs for it?
hendiatris commented on Roman Emperors' Outrageously Lavish Dinner Parties   atlasobscura.com/articles... · Posted by u/diodorus
hendiatris · 2 years ago
There are some inaccuracies in this article. For example, the villa on Capri was built by Tiberius, not Augustus. https://en.wikipedia.org/wiki/Villa_Jovis
