Architecturally, it is a scale-out metadata filesystem [ref]. Other related distributed file systems are Colossus (Google), Tectonic (Meta), ADLSv2 (Microsoft), HopsFS (Hopsworks), and I think PolarFS (Alibaba). They all use different distributed row-oriented DBs for storing metadata: 3FS uses FoundationDB, Colossus uses BigTable, Tectonic some KV store, ADLSv2 (not sure), and HopsFS uses RonDB.
What's important here with 3FS is that it supports (1) a FUSE client - it just makes life so much easier - and (2) NVMe storage - so that training pipelines aren't disk I/O bound (you can't always split files small enough, or parallelize reads/writes enough, against an S3 object store).
Disclaimer: I worked on HopsFS. HopsFS adds tiered storage - NVMe for recent data and S3 for archival.

[ref]: https://www.hopsworks.ai/post/scalable-metadata-the-new-bree...

Tectonic is built on ZippyDB, which is a distributed DB built on RocksDB.

> What's important here with 3FS is that it supports (1) a FUSE client - it just makes life so much easier

Tectonic also has a FUSE client built for GenAI workloads on clusters backed by 100% NVMe storage.

https://engineering.fb.com/2024/03/12/data-center-engineerin...
Personally, what stands out to me about 3FS isn't just that it has a FUSE client, but that they made it more of a hybrid of FUSE client and native IO path. You open the file just like normal, but once you have an fd you use their native library to do the actual IO. You still need to adapt whatever AI training code you have to use 3FS natively if you want to avoid the FUSE overhead, but now the FUSE client handles all the metadata operations that the native client would otherwise have needed to implement.

https://github.com/deepseek-ai/3FS/blob/ee9a5cee0a85c64f4797...

And NVMe optimisations, e.g. NVMe-oF in OpenEBS (Mayastor). None of it is particularly groundbreaking, just a lot of pieces brought together.
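Coming back to the hybrid FUSE/native IO pattern, here is a minimal sketch of what it looks like from application code; `native_3fs` and its functions are hypothetical stand-ins for 3FS's native client library, not its real API:

    import os

    # Hypothetical binding for 3FS's native IO library; the module and
    # function names are illustrative, not the real 3FS API.
    import native_3fs

    # Metadata path: a plain open() goes through the FUSE mount, so lookup,
    # permissions and namespace handling all reuse the FUSE client.
    fd = os.open("/3fs/dataset/shard-00042.bin", os.O_RDONLY)

    # Data path: hand the fd to the native library, which performs the
    # actual reads over RDMA and bypasses FUSE for the heavy IO.
    handle = native_3fs.register_fd(fd)
    buf = native_3fs.read_at(handle, offset=0, length=4 << 20)  # one 4 MiB read

    native_3fs.deregister(handle)
    os.close(fd)

The appeal of the split is that the native library only has to implement the data plane; paths, permissions, and directory listings stay ordinary POSIX via FUSE.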
There is also ObjectiveFS that supports FUSE and uses S3 for both data and metadata storage, so there is no need to run any metadata nodes. Using S3 instead of a separate database also allows scaling both data and metadata with the performance of the S3 object store.
I think the author is spot on; there are a few dimensions along which you should evaluate these systems: theoretical limits, efficiency, and practical limits.
From a theoretical point of view, like others have pointed out, parallel distributed file systems have existed for years -- most notably Lustre. These file systems should be capable of scaling out their storage and throughput to, effectively, infinity -- if you add enough nodes.
Then you start to ask: how much storage and throughput can I get with a node that has X TiB of disk -- i.e., you start to evaluate efficiency. I ran some calculations (against FSx for Lustre, since I'm an AWS guy), and it appears that you can run 3FS in AWS for about 12-30% cheaper than FSxL, depending on the replication factor you choose (which is good, but not great, considering that you're now managing the cluster yourself).
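For what it's worth, the shape of that comparison is roughly the sketch below; every number in it is a placeholder I made up, and only the structure - replication multiplying your effective $/TiB - is the point:

    # Self-managed vs. managed $/usable-TiB-month; every number here is a
    # placeholder, not a real AWS or FSxL quote.
    def usable_cost(raw_price_tib: float, replication: int) -> float:
        """Raw storage price scaled by replication = cost per usable TiB."""
        return raw_price_tib * replication

    managed = 145.0                      # hypothetical FSxL-like $/TiB-month
    self_managed = usable_cost(40.0, 3)  # hypothetical instance NVMe, 3x replication

    print(f"saving vs managed: {1 - self_managed / managed:.0%}")  # ~17% here

Which is why the saving swings across a band like 12-30% as the replication factor changes.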
Then, the third thing you start to ask is: anecdotally, are people actually able to configure these file systems at the deployment size I want? (This is where you hear things like "oh, it's hard to get Ceph to 1 TiB/s".) And that remains to be seen for something like 3FS.
Ultimately, I obviously believe that storage and data are really important keys to how these AI companies operate -- so it makes sense that DeepSeek would build something like this in-house to get the properties that they're looking for. My hope is that we, at Archil, can find a better set of defaults that work for most people without needing to manage a giant cluster or even worry about how things are replicated.
Maybe AWS could start by making fast NVMe available - without requiring multi-TB disks just to get 1 GB/s. The 3FS experiments were run on NVMe disks doing 14 GB/s - an order of magnitude more throughput than anything available in AWS today.

SSDs Have Become Ridiculously Fast, Except in the Cloud: https://news.ycombinator.com/item?id=39443679
On my home LAN, with 10 Gbps fiber between a MacBook Pro and a server 10 feet away, I get about 1.5 Gbps, vs the disks' non-network speed of ~50 Gbps (bits, not bytes).
I worked this out to be the macOS SMB implementation really sucking. I set up NFS and it got about twice as fast, but it's annoying to mount and use, and still far from the disks' capabilities.
I’ve mostly resorted to abandoning the network (after large expense) and using Thunderbolt and physical transport of the drives.
The other important thing to note is: what is the filesystem designed to be used for?
For example, 3FS looks like it's optimised for read throughput (which makes sense: like most training workloads, it's read-heavy), while write operations look very heavy.
Can you scale the metadata server, and what is the cost of metadata operations? Is there a throttling mechanism to stop a single client from sucking up all of the metadata server's IO? Does it support locking? Is it a COW filesystem?
IMO they serve use cases that look similar at a glance but are actually very different.
SeaweedFS [1] is more about amazing small-object read performance, because you effectively have no metadata to query to read an object. You just distribute the volume id and file id (+cookie) to clients.
3FS is less extreme in this regard: it supports an actual POSIX interface, and isn't particularly good at how fast you can open() files. On the other hand, it shards files into smaller (e.g. 512KiB) chunks, demands RDMA NICs, and makes reading randomly from large files scary fast [0]. If your dataset is immutable you can emulate what SeaweedFS does, but if it isn't, SeaweedFS is better.
[0] By scary fast I mean being able to completely saturate 12 PCIe Gen 4 NVMe SSDs at 4K random reads on a single storage server, and you can horizontally scale that: https://www.high-flyer.cn/blog/3fs/

[1] https://github.com/seaweedfs/seaweedfs
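Napkin math on what saturating those 12 drives implies; the per-drive figure below is an assumed ballpark from Gen 4 spec sheets, not a 3FS number:

    # Aggregate 4K random read capability of one storage server (assumed specs).
    drives = 12
    iops_per_drive = 1_000_000           # ~1M 4K random read IOPS per Gen 4 drive
    block_size = 4 * 1024                # 4 KiB

    total_iops = drives * iops_per_drive
    bytes_per_sec = total_iops * block_size

    print(f"{total_iops / 1e6:.0f}M IOPS -> ~{bytes_per_sec / 2**30:.1f} GiB/s")
    # 12M IOPS -> ~45.8 GiB/s, i.e. roughly 390 Gbit/s per server -- which is
    # why RDMA NICs are a hard requirement rather than a nice-to-have.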
My guess is that performance is pretty comparable, but it looks like Seaweed contains a lot more management features (such as tiered storage), which you may or may not be using.
I'm just a small business & homelab guy, so I'll probably never use one of these big distributed file systems. But when people start talking petabytes, I always wonder: are these things actually backed up, and what do you use for backup and recovery?
It is common for the backup of these systems to be a secondary data center.
Remember that there are two purposes for backup. One is hardware failures, the second is fat fingers. Hardware failures are dealt with by redundancy which always involves keeping redundant information across multiple failure domains. Those domains can be as small as a cache line or as big as a data center. These failures can be dealt with transparently and automagically in modern file systems.
With fat fingers, the failure domain has no natural boundaries other than time. As such, snapshots kept in the file system are the best choice, especially if you have a copy-on-write file system that can keep snapshots with very little overhead.
There is also the special case of adversarial fat fingering which appears in ransomware. The answer is snapshots, but the core problem is timely detection since otherwise you may not have a single point in time to recover from.
Backup and recovery is a process with a non-zero failure rate. The more you test it, the lower the rate, but there is always a failure mode.
With these systems, the runtime guarantees of data integrity are very high and the failure rate is very low. And best of all, failure is constantly happening as a normal activity in the system.
So once you have data integrity guarantees that are better in your runtime system than in your backup process, why back up?
There are still reasons, but they become more specific to the data being stored and less important as a general datastore feature.
Because of mistakes and malicious actors...
Well, for active data, the idea is that the replication within the system is enough to keep the data alive from instance failure (assuming that you're doing the proper maintenance and repairing hosts pretty quickly after failure). Backup and recovery, in that case, is used more for saving yourself against fat-fingering an "rm -rf /" type command. Since it's just a file system, you should be able to use any backup and recovery solution that works with regular files.
Because of the replication factor here, I assume that this filesystem is optimised for read throughput rather than capacity. Either way, there is a concept of "nearline" storage: a storage tier that is designed to be accessed only by a backup agent. The general idea is that it stores a snapshot of the main file system every n hours.
After that you have as many snapshots as you can afford.
Speaking from experience working at a hyperscaler: 1. cross-regional mirroring, 2. good old tape backups.
This seems like a pretty complex setup with lots of features which aren't obviously important for a deep learning workload.
Presumably the key necessary features are PBs worth of storage, read/write parallelism (which can be achieved by splitting a 1 PB file into, say, 10,000 100 GB shards, and then having each client read only the shards it needs), and redundancy.
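A sketch of that shard-selection logic, using the sizes above (the helper and names are mine, purely illustrative):

    # Map a byte range of the logical 1 PB file to the shard files a client
    # actually needs to read. Sizes follow the example above.
    SHARD_SIZE = 100 * 10**9  # 100 GB per shard; 10,000 shards = 1 PB

    def shards_for_range(offset: int, length: int) -> list[int]:
        """Shard indices covering [offset, offset + length)."""
        first = offset // SHARD_SIZE
        last = (offset + length - 1) // SHARD_SIZE
        return list(range(first, last + 1))

    # A client reading 100 GB starting at byte 150 GB touches shards 1 and 2:
    print(shards_for_range(150 * 10**9, 100 * 10**9))  # [1, 2]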
Consistency is hard to achieve and seems to have no use here - your programmers can manage to make sure different processes are writing to different filenames.
> Consistency is hard to achieve and seems to have no use here
Famous last words.
It is very common when operating data platforms like this at this scale to lose a lot of nodes over time, especially in the cloud. So having a robust consistency/replication mechanism is vital to making sure your training job doesn't need to be restarted just because a block it needs isn't on a particular node.
What follows is a long period of saying "see, distributed systems are easy for genius developers like me"

The last words are typically "oh shit", shortly followed oxymoronically by "bye! gotta go"
Indeed, redundancy is fairly important (although for the largest part, the training data, it actually doesn't matter if chunks are missing).
But the type of consistency they were talking about is strong ordering - the kind of thing you might want on a database with lots of people reading and writing tiny bits of data, potentially the same bits of data, where you need to make sure a user's writes are rejected if impossible to fulfil, and reads never return an impossible intermediate state. That isn't needed for machine learning.
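That convention ("different processes write different filenames") is cheap to enforce; a sketch, with made-up paths, of how writers avoid ever touching the same file:

    import os

    # Each worker writes only to paths derived from its own rank, so writers
    # never contend on the same file and no cross-writer ordering is needed.
    def checkpoint_path(root: str, step: int, rank: int) -> str:
        return os.path.join(root, f"step-{step:08d}", f"rank-{rank:05d}.ckpt")

    print(checkpoint_path("/3fs/ckpts", 1200, 17))
    # /3fs/ckpts/step-00001200/rank-00017.ckpt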
Yes, I think this is probably true. I've worked with a lot of different hedge funds who have a similar problem -- lots of shared data that they need in a file system so that they can do backtesting of strategies with things like kdb+. Generally, these folks are using NFS, which is kind of a pain -- especially for scalability -- so building your own for that specific use case (which happens to have a similar usage pattern to AI training) makes a lot of sense.
Why not use CephFS instead? It has been thoroughly tested in real-world scenarios and has demonstrated reliability even at petabyte scale. As an open-source solution, it can run on the fastest NVMe storage, achieving very high IOPS with 10 Gigabit or faster interconnect.
I think their "Other distributed filesystem" section does not answer this question.
Among other things, the OSD was not designed with NVMe drives in mind - which is fair, given how old it is - so it's nowhere close to being able to handle modern NVMe IO throughput and IOPS.
For that you need zero-copy, RDMA, etc.
Note that there is a next-generation OSD project called Crimson [0], however it's been a while, and I'm not sure how well it's going.
It's based on the awesome Seastar framework [1], backing ScyllaDB.
Achieving such performance would also require many changes to the client (RDMA, etc).
Something like Weka [2] has a much better design for this kind of performance.
[0] https://ceph.io/en/news/crimson/

[1] https://seastar.io/

[2] https://www.weka.io/

I do agree that NVMe-oF is the next hurdle for Ceph performance.

"We were reading data at 635 GiB/s. We broke 15 million 4k random read IOPS."

Source: https://ceph.io/en/news/blog/2024/ceph-a-journey-to-1tibps/
I don't know man, I think 15M random read IOPS is actually quite fast. I've built multi-million IOPS clusters in enterprise settings, all on NVMe, in the past.
DigitalOcean uses Ceph underneath their S3 and block volume products. When I was there they had 2 teams just managing Ceph, not even any of the control plane stuff built on top.
It is a complete bear to manage and tune at scale. And DO never greenlit offering anything based on CephFS either because it was going to be a whole other host of things to manage.
Then of course you have to fight with the maintainers (Red Hat devs) to get any improvements contributed, assuming you even have team members with the requisite C++ expertise.
If my systems guys are telling me the truth, it is a real time sink to run and can require an awful lot of babysitting at times.
IMO this is the problem with all storage clusters that you run yourself, not just Ceph. Ultimately, keeping data alive through instance failures is just a lot of maintenance that needs to happen (even with automation).
I thought they used Ceph too. But I started looking around and it seems like they have switched to CernVM-FS and an in-house solution. I'm not sure what changed.
Similar to the SeaweedFS question in sibling comment, how does this compare to JuiceFS?
In particular, for my homelab setup I'm planning to run JuiceFS on top of S3 Garage. I know Garage is replication-only, without any erasure coding or sharding, so it's not really comparable, but I don't need all that and it looked a lot simpler to set up to me.
It's a very different architecture. 3FS is storing everything on SSDs, which makes it extremely expensive but also low latency (think ~100-300us for access). JuiceFS stores data in S3, which is extremely cheap but very high latency (~20-60ms for access). The performance scalability should be pretty similar, if you're able to tolerate the latency numbers. Of course, they both use databases for the metadata layer, so assuming you pick the same one -- the metadata performance should also be similar.
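To make those numbers concrete: a single thread doing dependent small reads is bounded by 1/latency, so (using the latency figures above, purely illustrative midpoints) the gap looks like this:

    # Serial small-read rate implied by access latency (figures from above).
    for name, latency in [("3FS-style local SSD", 200e-6), ("S3-backed JuiceFS", 40e-3)]:
        print(f"{name}: ~{1 / latency:,.0f} reads/s per thread")
    # 3FS-style local SSD: ~5,000 reads/s per thread
    # S3-backed JuiceFS: ~25 reads/s per thread

That is a roughly 200x per-thread gap, which you can only hide with massive parallelism or caching.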
For the most part I can live with the latency. For frequently or repeatedly accessed files I believe JuiceFS also has a caching layer to bring down the latency on those. My use cases are quite different from what DeepSeek requires. I was just curious whether I needed to investigate 3FS since it's the new kid on the block.
What appeals to me about JuiceFS is that I can easily mount it into my containers for a distributed FS on a small cluster for some resilience. It looked simpler than Ceph or GlusterFS for that and it's backed by S3 so that I could swap out my S3 Garage setup for a cloud based S3 backend if required.
How easy is it to disable DeepSeek's distributed FS? Say for example a US college has been authorized to use DeepSeek for research, but must ensure no data leaves the local research cluster filesystem?
Edit: I am a DeepSeek newbie BTW, so if this question makes no sense at all, that's why ;)
I might need more clarification, but if one is paranoid or is dealing with information this sensitive, the DeepSeek models and 3FS can be deployed locally, offline and not connected to the internet.
DeepSeek is a company. This article is about a distributed file system they have developed. It is a separate, unrelated piece of software from their open-weight models (DeepSeek-R1, DeepSeek-V3, etc.).
In your example it is likely the US college has been authorized to use a DeepSeek model for research, not the DeepSeek 3FS distributed file system.