Readit News
ryanworl commented on DuckLake is an integrated data lake and catalog format   ducklake.select/... · Posted by u/kermatt
tishj · 9 months ago
One thing to add to this: snapshots can be retained (though rewritten) even through compaction.

In a format like Iceberg, compaction deletes the buildup of many small add/delete files, and as a consequence you lose the ability to time travel to those earlier states.

With DuckLake's ability to refer to parts of parquet files, we can preserve the ability to time travel, even after deleting the old parquet files.

ryanworl · 9 months ago
Does this trick preclude the ability to sort your data within a partition? You wouldn’t be able to rely on the row IDs being sequential anymore to be able to just refer to a prefix of them within a newly created file.
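
To make the question concrete, here is a toy model of the prefix idea. It is only a sketch: the function names and the list-of-rows representation are hypothetical and say nothing about DuckLake's actual metadata layout. It assumes compaction concatenates small files in commit order, so each earlier snapshot corresponds to a prefix of the compacted file.

```python
def compact(small_files):
    """Concatenate small files in commit order.

    Returns the merged rows and, for each input file, the cumulative row
    count after it was appended (the prefix length for that snapshot).
    """
    merged, prefix_lengths = [], []
    for rows in small_files:
        merged.extend(rows)
        prefix_lengths.append(len(merged))
    return merged, prefix_lengths


def rows_at_snapshot(merged, prefix_length):
    """Time travel: a snapshot sees only the first `prefix_length` rows."""
    return merged[:prefix_length]


# Three hypothetical appends, one small file each.
appends = [[3, 1], [2], [5, 4]]
merged, prefix_lengths = compact(appends)

# Snapshot 0 is still recoverable from the compacted file.
snapshot0 = rows_at_snapshot(merged, prefix_lengths[0])

# If the compacted file is sorted, the same prefix holds different rows.
sorted_prefix = rows_at_snapshot(sorted(merged), prefix_lengths[0])
```

In this model the unsorted prefix reproduces the first append exactly, while the sorted prefix does not, which is the concern raised above. A real format could avoid it by tracking per-row snapshot IDs instead of relying on prefixes.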
ryanworl commented on ClickHouse gets lazier and faster: Introducing lazy materialization   clickhouse.com/blog/click... · Posted by u/tbragin
simianwords · a year ago
Maybe I'm too inexperienced in this field but reading the mechanism I think this would be an obvious optimisation. Is it not?

But credit where it is due, obviously ClickHouse is an industry leader.

ryanworl · a year ago
This is a well-known class of optimization and the literature term is “late materialization”. It is a large set of strategies including this one. Late materialization is about as old as column stores themselves.
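
A minimal sketch of the general idea, assuming a table stored as a dict of column lists. The function and column names are hypothetical and this is not ClickHouse's implementation: the filter column is scanned first, and the other columns are read only at the positions that survive the filter.

```python
def late_materialize(columns, filter_col, predicate, output_cols):
    """Filter on one column, then fetch the remaining columns lazily."""
    # Step 1: scan only the filter column and keep matching row positions.
    positions = [i for i, v in enumerate(columns[filter_col]) if predicate(v)]
    # Step 2: read the output columns only at those positions.
    return [{c: columns[c][i] for c in output_cols} for i in positions]


# Hypothetical column store with one cheap column and one wide column.
table = {
    "status": [200, 500, 200, 500],
    "body": ["ok", "error: disk", "ok", "error: net"],
}

errors = late_materialize(table, "status", lambda s: s == 500, ["body"])
```

Here only two of the four `body` values are ever touched. The saving grows with the width of the deferred columns and the selectivity of the filter.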
ryanworl commented on The Query Condition Cache   clickhouse.com/blog/intro... · Posted by u/zX41ZdbW
ryanworl · a year ago
The equivalent to this feature is one of my favorite parts of Husky, Datadog’s storage and query system for event data.

https://youtu.be/mNneCaZewTg?si=N68fsBlYS3tuvLe3 begins at 34:32

ryanworl commented on Tiered storage won't fix Kafka   warpstream.com/blog/tiere... · Posted by u/itunpredictable
wokwokwok · 2 years ago
I guess we’ll have to wait for a full write up of this, but it does seem like having multiple categories of object storage is *pulls off hood* tiered storage!

…rebranded with a different name, again.

Again complex, again no obvious way to query storage directly, again unclear performance characteristics, again no obvious reason why the networking costs don't make the savings from it largely meaningless.

You have to admit it’s a bit of a hard sell without any comeback after literally just saying that people were just inventing new names for minor variations on tiered storage…

ryanworl · 2 years ago
We're still drafting our next post in this series, but the answer is actually very simple: two tiers of object storage do not have the same drawbacks as a combination of object storage and local disk. We wanted to explain that in this post too, but it would've been unreasonably long.

We've designed WarpStream to work extremely well on the slower, harder-to-use one first, and that is how 95+% of our workloads run in production. The tiered storage solutions from other streaming vendors do the opposite, where they were first designed for local SSDs and then bolted on object storage later.

The equivalent would be if we were pitching our support for an even slower, cheaper tier of object storage like AWS S3 Glacier.

ryanworl commented on Tiered storage won't fix Kafka   warpstream.com/blog/tiere... · Posted by u/itunpredictable
kdavyd · 2 years ago
The article doesn't mention which EBS volume type was used, but since Provisioned IOPS are mentioned, I assume it's gp3 or io2. One pattern that is especially often used in Time Series databases, but could work for Kafka too, is not tiering down to S3, but changing older volumes to a slower volume type, such as sc1 ($0.015/GiB-Mo). This can be done completely transparently to the application.

Another thing worth looking into is S3 Mountpoint with or without read caching, which offers a POSIX-like interface for S3 to applications that don't natively support S3.

ryanworl · 2 years ago
This strategy will not work well for Apache Kafka because it is extremely IOPS hungry if you have more than a few partitions, and a replay of a large topic will require lots of IO bandwidth. It would work well for, e.g., a columnar database where a query targeting old data may only require reading a small fraction of the size of the volume, but Kafka is effectively a row-oriented storage system, so the IO pattern is different.
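
A back-of-the-envelope sketch of that difference. All numbers and function names here are made up for illustration; the point is only that a replay reads every byte, while a columnar query reads roughly (fraction of columns) x (fraction of rows) of the volume.

```python
def replay_read_gib(volume_gib):
    """Row-oriented replay: every record is read in full."""
    return volume_gib


def columnar_read_gib(volume_gib, column_pct, row_pct):
    """Columnar query: only the selected columns over the selected rows."""
    return volume_gib * column_pct * row_pct / 10_000


# Hypothetical 1000 GiB volume; the query touches 5% of the column bytes
# over 10% of the rows.
replay = replay_read_gib(1000)
query = columnar_read_gib(1000, column_pct=5, row_pct=10)
```

With these assumed inputs the replay reads the full 1000 GiB and the columnar query about 5 GiB, which is why a slow, cheap volume type hurts the first workload far more than the second.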
ryanworl commented on Tiered storage won't fix Kafka   warpstream.com/blog/tiere... · Posted by u/itunpredictable
mschuster91 · 2 years ago
> But if you could rebuild streaming from the ground up for the cloud, you could achieve something a lot better than fewer disks – zero disks. The difference between some disks and zero disks is night and day. Zero disks, with everything running directly through object storage with no intermediary disks, would be better.

That's still a trade-off. Object storage, simply by the overhead of HTTP + SSL, has higher latency than EFS, which has higher latency than EBS, which has higher latency than local SSD. So in the end your service (no matter if it's Kafka or anything else) has _higher_ latency if you also want consistency (aka resilience against "everything goes dark in an instant") as all writes on all machines in the pool have to be committed to storage.

The only way a "zero disk" anything makes sense is if you have enough machines in enough diverse locations with enough RAM to cover the entire workload and to pray there's never any event taking the entire cloud provider offline.

ryanworl · 2 years ago
(WarpStream co-founder here)

We're not talking about no disks as in no storage, just nothing other than object storage. This does have a latency trade-off, but with the advent of S3 Express One Zone and Azure's equivalent high-performance tier (with GCP surely not far behind), a system designed purely around object storage can now trade cost for latency where it makes sense. WarpStream already has support for writing to a quorum of S3 Express One Zone buckets to provide regional availability, so there's not an availability trade-off here either.
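
For readers unfamiliar with quorum writes, a minimal sketch of the idea follows. It is not WarpStream's implementation: `InMemoryBucket` is a hypothetical stand-in for a single-zone bucket, and a real system would issue the writes in parallel against an actual object store client.

```python
class InMemoryBucket:
    """Hypothetical stand-in for one single-zone bucket."""

    def __init__(self, available=True):
        self.available = available
        self.objects = {}

    def put(self, key, value):
        if not self.available:
            raise ConnectionError("bucket unavailable")
        self.objects[key] = value


def quorum_write(buckets, key, value):
    """Write to every bucket; succeed if a strict majority acknowledges."""
    acks = 0
    for bucket in buckets:
        try:
            bucket.put(key, value)
            acks += 1
        except ConnectionError:
            pass  # a failed zone does not fail the write by itself
    return acks >= len(buckets) // 2 + 1


# One of three zones down: the write still succeeds.
one_down = [InMemoryBucket(), InMemoryBucket(available=False), InMemoryBucket()]
ok = quorum_write(one_down, "segment-0001", b"payload")

# Two of three zones down: no majority, so the write is rejected.
two_down = [InMemoryBucket(), InMemoryBucket(False), InMemoryBucket(False)]
rejected = quorum_write(two_down, "segment-0001", b"payload")
```

Because a majority of buckets must acknowledge, losing any single zone leaves both the write path and a readable copy intact.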

u/ryanworl

Karma: 1454
Cake day: December 15, 2012
About
@ryanworl on twitter

ryantworl@gmail.com
