If you're looking to give Iceberg a spin, here's how to get it running locally, on AWS[0] and on GCP[1]. The posts use DuckDB as the query engine, but you could swap in Trino (or even chDB / ClickHouse).
[0] https://www.definite.app/blog/cloud-iceberg-duckdb-aws
[1] https://www.definite.app/blog/cloud-iceberg-duckdb
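To make the DuckDB side concrete, here's a minimal read-only sketch using DuckDB's iceberg extension from Python; the table path is a placeholder, and for S3/GCS you'd additionally need the httpfs extension and credentials configured:

    import duckdb

    con = duckdb.connect()
    con.sql("INSTALL iceberg")
    con.sql("LOAD iceberg")  # the extension is read-only today

    # Placeholder path: point this at an Iceberg table directory (or a metadata .json file).
    con.sql("""
        SELECT count(*) AS n
        FROM iceberg_scan('warehouse/analytics.db/events')
    """).show()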
I think Iceberg solves a lot of big-data problems around handling huge amounts of data on blob storage, including partitioning, compaction, and ACID semantics.
I really like the way the catalog standard can decouple the underlying storage as well.
My biggest concern is how inaccessible the implementations are: Java/Spark has the only mature implementation right now, and even DuckDB doesn't support writing yet.
I built a tool to stream data to Iceberg which uses the Python Iceberg client: https://www.linkedin.com/pulse/streaming-iceberg-using-sqlfl...
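That's the linked tool's territory, but as a rough sketch of what appending a micro-batch looks like with the Python client (pyiceberg); the catalog config, table name, and columns here are all placeholders:

    import pyarrow as pa
    from pyiceberg.catalog import load_catalog

    # Placeholder REST catalog; substitute whatever catalog you actually run.
    catalog = load_catalog("default", **{"type": "rest", "uri": "http://localhost:8181"})

    table = catalog.load_table("analytics.events")  # assumes the table already exists

    # One micro-batch from the stream, as an Arrow table matching the Iceberg schema.
    batch = pa.table({
        "user_id": pa.array([1, 2, 3], type=pa.int64()),
        "event_ts": pa.array([1735689600000000] * 3, type=pa.timestamp("us")),
    })

    table.append(batch)  # commits a new Iceberg snapshot containing the batch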
Hidden partitioning is the most interesting Iceberg feature, because most of the very large datasets are timeseries fact tables.
I don't remember seeing that in Delta Lake [1], which is probably because the industry-standard benchmarks use date as a column (TPC-H) or join a date dimension table (TPC-DS), rather than filtering on timestamp ranges.
[1] https://github.com/delta-io/delta/issues/490
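For anyone who hasn't used it, here's roughly what hidden partitioning looks like from the Python client; the catalog, table name, and schema are made up. The table is partitioned by days(event_ts), and readers simply filter on event_ts; there is no separate date column to populate or remember to filter on:

    from pyiceberg.catalog import load_catalog
    from pyiceberg.partitioning import PartitionField, PartitionSpec
    from pyiceberg.schema import Schema
    from pyiceberg.transforms import DayTransform
    from pyiceberg.types import LongType, NestedField, TimestampType

    catalog = load_catalog("default")  # assumes a catalog configured in ~/.pyiceberg.yaml

    schema = Schema(
        NestedField(field_id=1, name="user_id", field_type=LongType(), required=False),
        NestedField(field_id=2, name="event_ts", field_type=TimestampType(), required=False),
    )

    # Hidden partitioning: store data partitioned by days(event_ts). Queries that
    # filter on event_ts get partition pruning without a user-visible date column.
    spec = PartitionSpec(
        PartitionField(source_id=2, field_id=1000, transform=DayTransform(), name="event_day"),
    )

    catalog.create_table("analytics.events", schema=schema, partition_spec=spec)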
> Hilbert-curve based clustering which solves a lot of the downsides of hive partitioning
Yes, that solves the two-column high-NDV partitioning issue. If you have your IP traffic sorted on destination or source, you need Z-curves to do the same thing, and those are a little easier to implement with bit twiddling for fixed-width types.
Hive would either write a large number of small files when partitioned like that, or you'd lose efficiency when scanning on the non-partitioned column.
This does fix the high-NDV issue, but Netflix wrote hidden partitioning specifically to avoid sorting on high-NDV columns and to reduce the sort complexity of writes (most daily writes won't need any partitioned inserts at all), while clustering on timestamp will force a sort even if the write only touches a single day.
[1] Open Table Formats: https://www.starburst.io/data-glossary/open-table-formats/
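For readers who haven't seen the trick, this is the bit-twiddling flavor of a Z-order (Morton) key for two fixed-width columns mentioned above; a toy illustration only, not how any particular engine implements its clustering:

    def z_order_key(x: int, y: int, bits: int = 32) -> int:
        """Interleave the bits of two unsigned fixed-width ints into one sort key.

        Sorting rows by this key clusters nearby (x, y) pairs together, so a range
        filter on either column touches far fewer files than a sort on one column alone.
        """
        z = 0
        for i in range(bits):
            z |= ((x >> i) & 1) << (2 * i)      # even bit positions come from x
            z |= ((y >> i) & 1) << (2 * i + 1)  # odd bit positions come from y
        return z

    # e.g. cluster flow records by (source ip, destination ip) as 32-bit ints
    src, dst = 0xC0A80001, 0x08080808  # 192.168.0.1 and 8.8.8.8
    print(hex(z_order_key(src, dst)))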
I think this mischaracterizes the state of the space. Iceberg is the winner of this competition as of a few months ago: all major vendors who didn't directly invent one of the others now support Iceberg or have announced plans to do so.
Building lakehouse products on any table format but Iceberg, starting now, seems to me like it must be a mistake.
Yeah working in the data space I see a ton of customers using Iceberg and some using Delta Lake if they're already a Databricks shop. Virtually no Hudi.
The table on that page makes it look like all three of these are very similar, with schema evolution and partition evolution being the key differences. Is that really it?
I’d also love to see a good comparison between “regular” Iceberg and AWS’s new S3 Tables.
When AWS launched S3 Tables last month I wrote a blog post with my first impressions: https://meltware.com/2024/12/04/s3-tables
There may be more in-depth comparisons available by now, but it's at least a good starting point for understanding how S3 Tables integrates with Iceberg.
ClickHouse has a solid Iceberg integration. It has an Iceberg table function[0] and an Iceberg table engine[1] for interacting with Iceberg data stored in S3, GCS, Azure, Hadoop, etc.
[0] https://clickhouse.com/docs/en/sql-reference/table-functions...
[1] https://clickhouse.com/docs/en/engines/table-engines/integra...
https://github.com/ClickHouse/ClickHouse/issues/52054
Right now, StarRocks or Trino are likely your best options, but all the major query engines (ClickHouse, Snowflake, Databricks, even DuckDB) are improving their support too.
https://github.com/duckdb/duckdb-iceberg/pull/78
https://www.starrocks.io/
https://trino.io/
https://www.starburst.io/platform/icehouse/
Yes, mainly driven by cost. BigQuery is really unpredictable when dashboards with filters are being used intensively, and we don't want to limit our users in their data exploration.
What I like about Iceberg is that table partitions are not tightly coupled to the subfolder structure of the storage layer. At least logically: at the end of the day the partitions are still subfolders with files, but the metadata is not tied to that layout, so you can change a table's partitioning going forward and still query a mix of old and new partition time ranges.
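In pyiceberg, that forward-only change looks roughly like this (recent versions expose a spec-update API; the table and partition field names here are hypothetical). Files written under the old spec keep their layout, new writes use the new one, and both stay queryable:

    from pyiceberg.catalog import load_catalog
    from pyiceberg.transforms import DayTransform

    catalog = load_catalog("default")               # assumes a configured catalog
    table = catalog.load_table("analytics.events")  # hypothetical table

    # Evolve the partition spec going forward: drop a coarse monthly field, add a daily one.
    # Existing data files are untouched; only new writes use the updated spec.
    with table.update_spec() as update:
        update.remove_field("event_month")                          # hypothetical existing field
        update.add_field("event_ts", DayTransform(), "event_day")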
On the other hand, since one of the use cases it was created for at Netflix was consuming directly from real-time systems, managing file creation when the data is updated is less trivial (the CoW vs. MoR problem and how to compact small files), and that becomes important on multi-petabyte tables with lots of users and frequent updates. This is something I assume not a lot of companies pay much attention to (heck, not even Netflix), and it has big performance and cost implications.
It’s been on the up in recent years though as it appears to have won the format wars. Every vendor is rallying around it and there were new open source catalogues and support from AWS at the end of 2024.
Yeah, I'll admit I was worried when Databricks acquired Tabular[0] that it would hurt Iceberg's momentum (e.g. Databricks would push Delta instead), but it seems the opposite has happened.
[0] https://www.definite.app/blog/databricks-tabular-acquisition
Use it with Dropwizard/Spring Boot and you get to expose REST APIs too.