I've been doing it for more like 30.
> In practice, you have to actually try pretty hard to misbehave enough for Google and Microsoft to notice and block you.
As far as I can tell, the "misbehavior" that got me blocked by Microsoft was being hosted on Linode... where I'd been for around ten years at that time, all on the same IP address. Tiny server, had never emitted a single even slightly spammy message, all the demanded technical measures in place, including the stupid ones.
Because of the huge number of people stupid enough to receive their email through Microsoft, I had to spend a bunch of time "appealing". That's centralization.
On edit: Oh, and the random yahoos out there running freelance blocklists can do a lot of damage to you, too, by causing smaller operators to reject your mail.
I have an application which ought to be a near-perfect match for Parquet. I have a source of timestamped data (basically a time series, except that the intervals might not be evenly spaced -- think log files). A row is a timestamp and a bunch of other columns, and all the columns have data types that Parquet handles just fine [0]. The data accumulates, and it's written out in batches, and the batches all have civilized sizes. The data is naturally partitioned on some partition column, and there is only one writer for each value of the partition column. So far, so good -- the operation of writing a batch is a single file creation or create call to any object store. The partition column maps to the de-facto sort-of-standard Hive partitioning scheme.
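To make that concrete, here is a minimal sketch of what one batch write looks like, assuming pyarrow and made-up column names ("source" as the partition column, "ts" as the timestamp):

    import pyarrow as pa
    import pyarrow.parquet as pq

    # One batch: a timestamp column plus other typed columns.
    # "source" is the partition column; one writer per source value.
    batch = pa.table({
        "ts": pa.array([1700000000, 1700000060, 1700000125], type=pa.timestamp("s")),
        "source": ["host-a", "host-a", "host-a"],
        "level": ["info", "warn", "info"],
        "msg": ["started", "slow disk", "done"],
    })

    # Writes Hive-style paths like data/source=host-a/<uuid>.parquet.
    # Each batch is a single new file, so any object store works.
    pq.write_to_dataset(batch, root_path="data", partition_cols=["source"])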
Except that the data is (obviously) also partitioned on the timestamp -- each batch covers a non-overlapping range of timestamps. And Hive partitioning can't represent this. So none of the otherwise excellent query tools can naturally import the data unless I engage in a gross hack:
I could also partition on a silly column like "date". This involves aligning batches to date boundaries and also makes queries uglier.
I could just write the files and import "*.parquet" as a glob. This kills performance and costs lots of money, since the query engine has to list and open every file just to prune on time.
I could use Iceberg or Delta Lake or whatever for the sole benefit that their client tools can handle ranged partitions. Gee thanks. I don't actually need any of the other complexity.
It would IMO be really really nice if everyone could come up with a directory-name or filename scheme for ranged partitioning.
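For illustration only, here is one hypothetical shape such a scheme could take: encode each batch's timestamp range in the directory name and prune files before handing them to a query engine. No standard tool understands this layout today; the names and layout are made up.

    import re
    from pathlib import Path

    # Hypothetical layout (no standard tool understands it):
    #   data/source=host-a/ts_min=1700000000,ts_max=1700000999/batch.parquet
    RANGE_RE = re.compile(r"ts_min=(\d+),ts_max=(\d+)")

    def prune(root: str, lo: int, hi: int) -> list[str]:
        """Keep only files whose encoded [ts_min, ts_max] overlaps [lo, hi]."""
        keep = []
        for f in Path(root).rglob("*.parquet"):
            m = RANGE_RE.search(str(f))
            if m and int(m.group(1)) <= hi and lo <= int(m.group(2)):
                keep.append(str(f))
        return keep

The pruned list could then go straight to something like DuckDB's read_parquet(), which accepts a list of files, so only the overlapping batches are ever opened.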
[0] My other peeve is that a Parquet row and an Arrow row and a Thrift message and a protobuf message, etc, are almost but not quite the same thing. It would be awesome if there were a companion binary format for a single Parquet row or a stream of rows so that tools could cooperate more easily on producing the data that eventually gets written into Parquet files.

Just beware that one issue you can hit is the limit of row groups per file (2^15).
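A quick way to check how close your files are to that limit, a sketch assuming pyarrow and a hypothetical file path:

    import pyarrow.parquet as pq

    # Path is hypothetical; point it at one of your batch files.
    md = pq.ParquetFile("data/source=host-a/batch.parquet").metadata
    print(md.num_row_groups)   # the per-file limit mentioned above is 2^15
    print(md.num_rows, md.num_columns)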
Then spin up duckdb and do some performance tests. I'm not sure this will work; there is some overhead with reading Parquet, which is why small files and small row groups are discouraged.
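Something like the following, assuming duckdb's Python package and the Hive layout sketched earlier; the paths and column names are made up:

    import time
    import duckdb

    # hive_partitioning=true turns source=host-a path segments into a column.
    query = """
        SELECT count(*), min(ts), max(ts)
        FROM read_parquet('data/**/*.parquet', hive_partitioning = true)
        WHERE source = 'host-a'
    """

    start = time.perf_counter()
    print(duckdb.sql(query).fetchall())
    print(f"elapsed: {time.perf_counter() - start:.3f}s")

Note that a predicate on ts still can't skip whole files from the path alone; DuckDB has to read each file's footer, which is exactly the per-file overhead in question.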
Regarding reading materials, I found this DuckDB post to be especially helpful in realizing how parquet could be better leveraged for efficiency: https://duckdb.org/2024/03/26/42-parquet-a-zip-bomb-for-the-...
What we are lacking now is tooling that gives you insight into how you should configure Iceberg. Does something like this exist? I have been looking for something that would show me the query plan that is developed from Iceberg metadata, but didn’t find anything. It would go a long way to showing where the bottleneck is for queries.
31% positive for those under 45, 8% positive for those above 45.
41% negative for those under 45, 77% negative for those above 45.
Not the majority, even for younger people. And remember, this is just U.S. opinion; people in other countries might view this differently (likely even more negatively).
Edit: look at the photos of the people… AI generated perhaps?
Recently in the market for a tool to manage SQL migration patches with no need for slow rollouts, I reviewed many such tools, and the one that impressed me was sqitch: https://github.com/sqitchers/sqitch
So if you are interested in this field and Pgroll is not quite what you are looking for, I recommend you have a look at sqitch.