ammo1662 · 6 months ago
For those who are interested, the design was originally published here:

(Chinese) https://www.high-flyer.cn/blog/3fs/

They have been developing and using this file system for several years.

Compared to traditional file systems, it is focused on model training workloads, which involve a lot of random reads. Read caching and prefetching are useless in that case, so they designed the file system without those features to improve performance.

I google translated some key parts here:

3FS is a special file system because it is used almost exclusively for batch reading of sample data on compute nodes during AI training, accelerating model training through high-speed interaction between computation and storage. This is a large-scale random-read workload, and data that has been read will not be used again soon, so we cannot use the most important tool for optimizing file reads, the read cache, and even readahead is useless. The implementation of 3FS is therefore quite different from that of other file systems.

Specifically, as shown in the figure above, 3FS uses the Linux AIO and io_uring interfaces to read samples. In the 3FS scenario the file cache is of no use at all; it only consumes system memory in a way that is hard for users to control and interferes with subsequent tasks, so we turned off the file cache and read data only in Direct I/O mode. Note, however, that when reading this way the buffer pointer, offset, and length all need to be aligned. Leaving that alignment to the user would cause extra memory copies, so we do the alignment inside the file system, which both improves performance and is more convenient for users.
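
For readers who haven't used Direct I/O, here is a minimal sketch of what that alignment requirement looks like with plain Linux O_DIRECT. This is not 3FS code; the file name, the 4096-byte alignment, and the 1 MiB read size are just assumptions for illustration:

    // Minimal O_DIRECT read sketch (not 3FS code): with O_DIRECT the kernel
    // bypasses the page cache, so the buffer address, the file offset and the
    // read length must all be aligned (4096 bytes assumed here).
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>
    #include <cstdlib>

    int main() {
        constexpr size_t kAlign = 4096;      // assumed logical block size
        constexpr size_t kLen   = 1 << 20;   // 1 MiB, a multiple of kAlign

        // "sample.bin" is just a placeholder file name for this sketch.
        int fd = open("sample.bin", O_RDONLY | O_DIRECT);
        if (fd < 0) { std::perror("open"); return 1; }

        void* buf = nullptr;
        if (posix_memalign(&buf, kAlign, kLen) != 0) return 1;  // aligned buffer

        off_t offset = 8 * kAlign;           // file offset must also be aligned
        ssize_t n = pread(fd, buf, kLen, offset);
        if (n < 0) std::perror("pread");     // EINVAL usually means bad alignment

        std::free(buf);
        close(fd);
        return 0;
    }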

vlovich123 · 6 months ago
I hope they chose a multiple of 4096 for the alignment to minimize flash read amplification. QLC drives even use 16 KiB pages.
dekhn · 6 months ago
How critical is random reading of training data when assembling batches?

Put another way: in my experience, supporting fast random reads is a challenging problem, while supporting high sequential read throughput is fairly straightforward. When is random access to a training set absolutely necessary for training a model?

c4wrd · 6 months ago
Imagine you're studying for a test where you are shown an image and need to answer with the correct class. To prepare, you're given a deck of flashcards with an image on the front and its class on the back.

(Random) You shuffle the deck every time you go through it. You're forced to learn the images and their classifications without relying on any specific sequence, as the data has no signal from sequence order.

(Fixed order) Every time you go through the deck, the images appear in the exact same order. Over time you may start to unconsciously memorize the sequence of flashcards, rather than the actual classification of each image.

When it comes to actually training a model, if the batches are sampled sequentially from a dataset, the model risks learning correlations caused by the ordering of the data, resulting in poor generalization. In contrast, when you sample the batches randomly, the model is instead encouraged to learn features from the data itself rather than from signals that are artifacts of the ordering.
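
As a minimal sketch of the usual pattern (not 3FS or DeepSeek code; the dataset size, seed, and the hypothetical per-sample read are made up for illustration): sample indices are reshuffled with a seeded PRNG before every epoch, so each pass reads the same data in a different random order.

    // Per-epoch random sampling sketch: reshuffle indices before every pass.
    #include <algorithm>
    #include <cstddef>
    #include <cstdio>
    #include <numeric>
    #include <random>
    #include <vector>

    int main() {
        const std::size_t num_samples = 10;        // tiny dataset, just for illustration
        std::vector<std::size_t> indices(num_samples);
        std::iota(indices.begin(), indices.end(), std::size_t{0});

        std::mt19937_64 rng(42);                   // seeded PRNG: reproducible, still "random" order
        for (int epoch = 0; epoch < 3; ++epoch) {
            std::shuffle(indices.begin(), indices.end(), rng);   // new order every epoch
            for (std::size_t idx : indices) {
                // a real loader would read sample `idx` here -- a random read
                std::printf("%zu ", idx);
            }
            std::printf("\n");
        }
        return 0;
    }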

arjvik · 6 months ago
On an SSD, random and sequential reads have nearly the exact same performance. Even on large arrays of spinning rust this is essentially true.
rvba · 6 months ago
Why is that a random read? Also, is it truly random, or generated from a seed? But if it's a PRNG then they could cache, right?
kilburn · 6 months ago
The randomness comes from a PRNG. They still cannot cache, though, because they do many reading "passes" through the same data.

If you build a cache that gets hits on the first pass, it won't help for the second and later passes.

codingwagie · 6 months ago
I think the difference between DeepSeek and OpenAI/Anthropic is partly the difference between practitioners and academics. Of course there is world-class talent at OpenAI. But there are also a lot of "I went to Harvard and want to work in AI" types, and those people simply don't have the technical exposure to even think of building something like this.
tway223 · 6 months ago
I would say most if not every large company in China has its own AI infra stack, partly because tech talent is relatively more abundant and partly because some of the tech leads have been exposed to Western tech via open source and work experience, so they have a good success rate (which makes it a more common practice). Anecdotally, ex-employees of Google and FB from their overseas offices, and of MSFT and Intel from their China offices, may have been the key elements of this trend over the past two decades (Google left China around 2010).

The infra work is usually technically tedious, so I think it may become something of a lost art in the West, just like those manufacturing jobs.

smallmancontrov · 6 months ago
As opposed to the US, where every large company has its own AI infra stack, often extending down to the silicon and up to large open source projects?

What's going on here, why are people forgetting what's around them? Does familiarity breed contempt? Are attention spans so shot that failure to participate in this week's news cycle is enough for "out of sight, out of mind"? Or is HN full of Chinese bots now?

sureglymop · 6 months ago
I think it would be a bit irrational to claim that so broadly. I've met some incredibly talented people in academia, and I've also met people who made me question how they even passed.

My hypothesis is that there is not such a big difference at all. All three of the companies you mentioned are world-class competitors here. DeepSeek was the latest of them to have a "hit", but that isn't an indication that they, rather than the other two (or some yet-unknown entity), will have the next one. We try to predict what happens next, but perhaps we should instead focus on who or what we want to succeed. For me it's quite clear: it should be open source, or in the long term I'm not that interested.

mustpax · 6 months ago
Someone should write a blog post about the prestige/effectiveness negative feedback loop. This is also the Achilles heel of top tier SV VCs including YC.
bugglebeetle · 6 months ago
The problem isn't the prestige; it's that prestigious institutions in America don't produce high-quality talent. They're instead mostly corrupt credentialing mills for the rich and well-connected. From what I understand, DeepSeek also only hires from the best universities in China, but there "best" actually means something, given how hard it is to get into those institutions and how demanding their coursework is.
djtango · 6 months ago

Deleted Comment

dvaun · 6 months ago
Can you expand on this?
robotnikman · 6 months ago
Makes me wonder where the best place is to learn how to put together and operate something like this, then? Surely there are resources out there somewhere to teach yourself?
cma · 6 months ago
Weren't the FlashAttention authors not just from academia but in academia at the time?
thohj4234234324 · 6 months ago
This is very humbling.

OpenAI et al. have also gone quite deep down the systems rabbit hole (e.g. Triton), but I can't think of anyone else (outside of Google/Facebook) who pays this much attention to these things.

Great work; hope Deepseek does even more awesome things going forward.

richardw · 6 months ago
I’ve assumed that it’s partly because the company has done a lot of HFT, which is very focused on performance. But I’m not an expert in either.
WiSaGaN · 6 months ago
Indeed, the blog mentioned in the other comment shows that part of the 3FS code was completed as early as 2019, when this was still a project of the quant fund. In HFT you tend to build a lot of things in-house to achieve low latency and high performance, sometimes simply because an HFT system only needs to do one specific thing, while off-the-shelf software caters to a much wider range of scenarios that HFT doesn't care about. Here you see a similar case: they focus specifically on loading large amounts of data during training, and implement that to the extreme.

Deleted Comment

tetron · 6 months ago
I was curious how they get such performance with a FUSE-based design. It seems that they sort of cheat: FUSE is used to manage metadata, but to get high performance you have to link in the C++ client library and do all your reads and writes through that. So it isn't general purpose; you have to modify your application to take advantage of it. Still, that's a clever trick, and it makes me wonder if there's an LD_PRELOAD strategy that could generalize.
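
As a rough illustration of the generic LD_PRELOAD interposition technique (not something 3FS ships; the build and run commands are assumptions, and a real shim would also need to cover openat, pread, mmap, and friends):

    // Build (assumed): g++ -shared -fPIC -O2 shim.cpp -o shim.so -ldl
    // Run (assumed):   LD_PRELOAD=./shim.so ./your_training_binary
    #include <dlfcn.h>
    #include <unistd.h>

    // Intercept read() and forward to the real implementation.
    extern "C" ssize_t read(int fd, void* buf, size_t count) {
        using read_fn = ssize_t (*)(int, void*, size_t);
        static read_fn real_read =
            reinterpret_cast<read_fn>(dlsym(RTLD_NEXT, "read"));
        // A real shim would check here whether `fd` lives on the special mount
        // and, if so, route the request to the user-space client library.
        return real_read(fd, buf, count);
    }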
grohan · 6 months ago
They appear to have Python bindings, which seems reasonable from an API/usability perspective? https://github.com/deepseek-ai/smallpond

In terms of fast FUSE - that was also my first question; it appears to be `io_uring` + FUSE :)

https://github.com/deepseek-ai/3FS/blob/main/src/lib/api/Usr...

amelius · 6 months ago
Why is FUSE that much slower than providing your own read/write functions? I get that it has to go through the kernel, but the operations are on entire blocks, and the network should be the bottleneck by far (or disk/main memory if the data is local).
vlovich123 · 6 months ago
You have to bounce through the kernel and back out to user space, so the number of syscalls is quite high. In many cases this is mitigated somewhat by the page cache making reads cheaper, but that's explicitly not wanted in this design.

I believe there's work to minimize this using io_uring so that you can talk to the FUSE driver without the kernel being in the middle, but that work wasn't ready last time I checked.

For what it's worth, at Palm we had a similar problem because our applications were stored compressed but exposed through FUSE uncompressed. Instead of O_DIRECT, I just did an fadvise to drop the cache after a read. Not as high throughput, but the least risky change to get the same effect.
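
A rough sketch of that fadvise approach (not the actual Palm code; the file name and read size are placeholders): read through the page cache as usual, then advise the kernel that the range won't be reused so it can evict it.

    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>
    #include <vector>

    int main() {
        // "sample.bin" is just a placeholder file name for this sketch.
        int fd = open("sample.bin", O_RDONLY);
        if (fd < 0) { std::perror("open"); return 1; }

        std::vector<char> buf(1 << 20);
        ssize_t n = pread(fd, buf.data(), buf.size(), 0);   // ordinary cached read
        if (n > 0) {
            // Hint that the range just read will not be needed again, so the
            // kernel may drop it from the page cache (only clean pages are evicted).
            posix_fadvise(fd, 0, n, POSIX_FADV_DONTNEED);
        }
        close(fd);
        return 0;
    }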

pella · 6 months ago
Related research paper (English, HTML): https://arxiv.org/html/2408.14158v2

arXiv:2408.14158v2 [cs.DC] 31 Aug 2024

"Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning"

Abstract:

"The rapid progress in Deep Learning (DL) and Large Language Models (LLMs) has exponentially increased demands of computational power and bandwidth. This, combined with the high costs of faster computing chips and interconnects, has significantly inflated High Performance Computing (HPC) construction costs. To address these challenges, we introduce the Fire-Flyer AI-HPC architecture, a synergistic hardware-software co-design framework and its best practices. For DL training, we deployed the Fire-Flyer 2 with 10,000 PCIe A100 GPUs, achieved performance approximating the DGX-A100 while reducing costs by half and energy consumption by 40%. We specifically engineered HFReduce to accelerate allreduce communication and implemented numerous measures to keep our Computation-Storage Integrated Network congestion-free. Through our software stack, including HaiScale, 3FS, and HAI-Platform, we achieved substantial scalability by overlapping computation and communication. Our system-oriented experience from DL training provides valuable insights to drive future advancements in AI-HPC."

hintymad · 6 months ago
A distributed file system is regarded as one of the trickiest pieces of software to write, and we are usually advised not to write a file system from scratch (even on top of FUSE), let alone a highly optimized one. While a Silicon Valley company is holding its 100th meeting to align on god-knows-what, a team of fewer than 60 has already come up with a production-grade, highly efficient parallel file system.

Have we at the Valley companies lost our touch?

htrp · 6 months ago
> team of fewer than 10

The High-Flyer team is pretty well resourced... I think they have more than 10 people.

hintymad · 6 months ago
Thanks! Updated to 60 per their author list in their paper.
ycui1986 · 6 months ago
yes
jauntywundrkind · 6 months ago
Man, 6.6 TB/s across 180 nodes is about 300 Gbps per node, or 37.5 GB/s.

That's with 14 unnamed SSDs per node. I wonder how this would scale with higher-end SSDs, going from PCIe 4 to PCIe 5 or PCIe 6... and particularly whether one could scale down!
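
Back-of-envelope, taking the figures above at face value (decimal units assumed):

    6.6 TB/s / 180 nodes ≈ 36.7 GB/s per node ≈ 293 Gbps per node
    36.7 GB/s / 14 SSDs  ≈ 2.6 GB/s per SSD

At roughly 2.6 GB/s each, the drives sit well below what a single PCIe 4.0 NVMe SSD can sustain on large reads (~7 GB/s), which suggests the per-node network, not the flash, is the limit at this ratio.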

bee_rider · 6 months ago
They sure are productive.

What are we going to see tomorrow? DeepSeek OS or something?

vitaflo · 6 months ago
To be fair they’ve been working on this since 2019 for HFT. So it’s not like they just whipped this up.
bee_rider · 6 months ago
The amount of brainpower wasted on HFT games that will never see the light of day is kind of a bummer. Congrats to China.
logicallee · 6 months ago
>They sure are productive.

I have a theory as to why...

tuyguntn · 6 months ago
enlighten us
digdugdirk · 6 months ago
996 work culture?

Deleted Comment