Readit News
jamesblonde commented on We built another object storage   fractalbits.com/blog/why-... · Posted by u/fractalbits
kburman · 2 days ago
I feel like this product is optimizing for an anti-pattern.

The blog argues that AI workloads are bottlenecked by latency because of 'millions of small files.' But if you are training on millions of loose 4KB objects directly from network storage, your data pipeline is the problem, not the storage layer.

Data Formats: Standard practice is to use formats like WebDataset, Parquet, or TFRecord to chunk small files into large, sequential blobs. This negates the need for high-IOPS metadata operations and makes standard S3 throughput the only metric that matters (which is already plentiful).

Caching: Most high-performance training jobs hydrate local NVMe scratch space on the GPU nodes. S3 is just the cold source of truth. We don't need sub-millisecond access to the source of truth, we need it at the edge (local disk/RAM), which is handled by the data loader pre-fetching.

It seems like they are building a complex distributed system to solve a problem that is better solved by tar -cvf
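As a rough illustration of the sharding idea in the comment above, here is a minimal Python sketch that packs many small samples into a few tar archives, which is the container format WebDataset builds on. The helper names (`pack_shards`, `read_shard`), the sample names, and the shard size are assumptions made for the example; a real pipeline would write the shards to object storage rather than keep them in memory.

```python
import io
import tarfile

def pack_shards(samples, max_per_shard):
    """Group (name, payload) pairs into in-memory tar archives (hypothetical helper)."""
    shards = []
    for start in range(0, len(samples), max_per_shard):
        buf = io.BytesIO()
        with tarfile.open(fileobj=buf, mode="w") as tar:
            for name, payload in samples[start:start + max_per_shard]:
                info = tarfile.TarInfo(name=name)
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
        shards.append(buf.getvalue())
    return shards

def read_shard(shard_bytes):
    """Read one shard back sequentially as a list of (name, payload) pairs."""
    with tarfile.open(fileobj=io.BytesIO(shard_bytes), mode="r") as tar:
        return [(m.name, tar.extractfile(m).read()) for m in tar.getmembers()]

# Ten made-up 4KB samples packed four per shard: three large reads instead of ten small ones.
samples = [(f"sample_{i:05d}.bin", bytes([i % 256]) * 4096) for i in range(10)]
shards = pack_shards(samples, max_per_shard=4)
restored = [item for shard in shards for item in read_shard(shard)]
```

The point of the sketch is only that the reader issues one sequential read per shard, so the object count (and the metadata traffic that comes with it) drops by the shard size.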

jamesblonde · 2 days ago
I agree that this is an anti-pattern for training. In training, you are often I/O bound over S3 - high b/w networking doesn't fix it (.safetensors files are typically 4GB in size). You need NVMe and high b/w networking along with a distributed file system.

We do this with tiered storage over S3 using HopsFS, which has an HDFS API with a FUSE client, so training can just read data (from the HopsFS datanodes' NVMe cache) as if it is local, but it is pulled from NVMe disks over the network. In contrast, writes go straight to S3 via HopsFS's write-through NVMe cache.
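The read and write paths described above can be sketched as a small read-through, write-through cache. This is not HopsFS code: the `TieredStore` class is hypothetical, and plain dictionaries stand in for S3 and for the NVMe cache, purely to show which tier each operation touches.

```python
class TieredStore:
    """Toy two-tier store: a fast cache in front of a slow backing store."""

    def __init__(self, backing):
        self.backing = backing   # stand-in for S3 (assumption: a plain dict)
        self.cache = {}          # stand-in for the NVMe cache
        self.backing_reads = 0   # counts how often the slow tier is read

    def write(self, key, data):
        # Write-through: the backing store is updated on every write,
        # and the cache keeps a copy so later reads stay local.
        self.backing[key] = data
        self.cache[key] = data

    def read(self, key):
        # Read-through: serve from cache, fall back to the backing store on a miss.
        if key in self.cache:
            return self.cache[key]
        self.backing_reads += 1
        data = self.backing[key]
        self.cache[key] = data
        return data

s3 = {"model.safetensors": b"weights"}
store = TieredStore(s3)
first = store.read("model.safetensors")    # miss: pulled from the backing store
second = store.read("model.safetensors")   # hit: served from the cache
store.write("checkpoint.bin", b"step-1")   # lands in both tiers
```

After these calls only one read has reached the slow tier, and the new checkpoint is already durable in the backing store.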

jamesblonde commented on Cardiac implantable electronic devices' longevity: A novel modelling tool   journals.plos.org/plosone... · Posted by u/PaulHoule
jamesblonde · 18 days ago
I have one (top of the line!). Here's how bad the engineers were. For the last 6 months, the device has emitted 10 audible beeps every 6 hours. I do a lot of customer meetings and public speaking. People would sometimes ask, "What is that noise?" I would say, "No idea, but if you wait 8 seconds, it will stop!"

Also, my heart rate would sometimes drop below 40 bpm. Then it would start pacing, which I didn't want and which was extremely uncomfortable.

P.S. The reason the battery ran out is that I found a treatment for my condition that works really well, by talking to experts around the world (I am a computer scientist). I wrote a case study paper about my condition to help others, co-authored by my doctors. https://www.slideshare.net/slideshow/arvc-and-flecainide-cas... 16 years later, the device is still in place, but I will have it removed early next year.

jamesblonde commented on Agent design is still hard   lucumr.pocoo.org/2025/11/... · Posted by u/the_mitsuhiko
the_mitsuhiko · 23 days ago
> EU waking hours have comments that seem disconnected from genAI. And, while the US hours show a lot of resistance, it's more fear than a feeling that the tools are worthless.

I don't think it's because the audience is different but because the moderators are asleep when Europeans are up. There are certain topics which don't really survive on the frontpage when moderators are active.

jamesblonde · 23 days ago
Anything sovereign AI or whatever is gone immediately when the mods wake up. Got an EU cloud article? Publish it at 11am CET, it disappears around 12.30.
jamesblonde commented on Peter Thiel sells off all Nvidia stock, stirring bubble fears   thestreet.com/investing/p... · Posted by u/hypeatei
kibibu · a month ago
It's spot on to say that Greta Thunberg is the literal manifestation of the biblical antichrist, heralding the end of days?
jamesblonde · 25 days ago
See, Peter Thiel is smart. There are enough idiots who will buy his shtick - it's not just maga who get pointed in the direction he wants society to go (serfdom).
jamesblonde commented on Cloudflare outage on November 18, 2025 post mortem   blog.cloudflare.com/18-no... · Posted by u/eastdakota
jamesblonde · a month ago
Cloudflare tried to build their own feature store, and got a grade F.

I wrote a book on feature stores for O'Reilly. The bad query they wrote in ClickHouse could have been caused by another error - duplicate rows in materialized feature data. For example, Hopsworks prevents duplicate rows by building on primary key uniqueness enforcement in Apache Hudi. In contrast, Delta Lake and Iceberg do not enforce primary key constraints, and neither does ClickHouse. So they could have the same bug again due to a bug in feature ingestion - and given they hacked together their feature store, it is not beyond the bounds of possibility.

Reference: https://www.oreilly.com/library/view/building-machine-learni...
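To make the duplicate-row failure mode concrete, here is a minimal Python sketch comparing an append-only table with one keyed on a primary key. It does not use the Hudi, Hopsworks, or ClickHouse APIs; the function names, the column names, and the choice of `(user_id, ts)` as the key are all assumptions for illustration.

```python
def append_only(table_rows, batch):
    """No key enforcement: a retried batch is simply written again."""
    table_rows.extend(batch)

def upsert(table, batch, primary_key):
    """Key enforcement: a row with an existing key overwrites the old row."""
    for row in batch:
        key = tuple(row[col] for col in primary_key)
        table[key] = row

# Hypothetical feature rows; the same batch is ingested twice, as in a retry.
batch = [
    {"user_id": 1, "ts": "2025-11-18", "score": 0.2},
    {"user_id": 2, "ts": "2025-11-18", "score": 0.9},
]

appended = []
append_only(appended, batch)
append_only(appended, batch)

keyed = {}
upsert(keyed, batch, primary_key=("user_id", "ts"))
upsert(keyed, batch, primary_key=("user_id", "ts"))
```

The append-only table ends up with four rows and the keyed table with two, so any downstream query that assumes one row per key only stays correct in the second case.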

jamesblonde commented on Google boss says AI investment boom has 'elements of irrationality'   bbc.com/news/articles/cwy... · Posted by u/jillesvangurp
johnisgood · a month ago
My former friend in Finland finished med school, "for free", while living in a really nice apartment "for free", receiving ADHD medications "for free", and then went to a business school "for free". He has not worked a single day in his life and he is in his very late 20s.
jamesblonde · a month ago
Apart from the free apartment, that's the same as most countries in Europe.
jamesblonde commented on Europe's defence spending spree must fund domestic AI, official says   ft.com/content/fb744eaa-b... · Posted by u/jamesblonde
jamesblonde · a month ago
Henna Virkkunen says 10% of investment should be directed to artificial intelligence and quantum computing developed in EU.
jamesblonde commented on Peter Thiel sells off all Nvidia stock, stirring bubble fears   thestreet.com/investing/p... · Posted by u/hypeatei
iammjm · a month ago
Peter Thiel the guy that believes that the literal biblical antichrist is walking amongst us? How much rational thought can one really expect from this guy? I am not questioning AI being a good investment or not, I am questioning this dude's sanity
jamesblonde · a month ago
He's scarily sane. I am not joking when I say he invented the anti-christ thing as a way to entrench the billionaire class. He's basically saying the anti-christ will be somebody who wants to solve the world's problems through collective action. So collective action is bad. We are in such an unbalanced world that this is the best argument he can come up with for why we should allow such wealth inequality in society.
jamesblonde commented on 650GB of Data (Delta Lake on S3). Polars vs. DuckDB vs. Daft vs. Spark   dataengineeringcentral.su... · Posted by u/tanelpoder
l_c_m · a month ago
This is the most misrepresented article, on two fronts:

1. It tested column pruning, and the data you access would have been 2 columns + metadata for the Parquet files, so it probably fit in memory even without streaming.

2. Most of the processing time would be IO bound on S3 and the access patterns/simultaneous connection limits etc. would have more of an impact than any processing code.

Love that you went through the pain of trying the different systems but I'd like to see an actual larger than memory query.

jamesblonde · a month ago
1. Important point: the query is a projection that returns only a fraction of the 650GB, and that fraction fits in memory. DuckDB is good at streaming larger-than-memory queries; Polars is less mature there. That would show in the results.

2. S3 defaults shouldn't prevent all available threads/cpus from reading the files in parallel, so I would assume that the network bandwidth of the VM (or container) would be the bottleneck.
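The parallel-read pattern in point 2 can be sketched with a thread pool. This is only an illustration under stated assumptions: `FAKE_BUCKET` and `fetch` are made-up stand-ins for an object store and a GET request, and the file names and sizes are invented. With a real client, each call would block on the network, which is where threads help.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for an object store: key -> object bytes.
FAKE_BUCKET = {f"part-{i:04d}.parquet": bytes(1024 * (i + 1)) for i in range(8)}

def fetch(key):
    """Pretend GET request; a real client call would wait on network I/O here."""
    return FAKE_BUCKET[key]

def read_all(keys, max_workers):
    """Fetch every key concurrently; results come back in the order of `keys`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, keys))

keys = sorted(FAKE_BUCKET)
blobs = read_all(keys, max_workers=4)
total_bytes = sum(len(blob) for blob in blobs)
```

Once enough requests are in flight, adding workers stops helping and the shared network link becomes the limit, which is the bottleneck the comment assumes.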

u/jamesblonde

Karma: 4052 · Cake day: October 13, 2013