Posted by u/philippb 4 years ago
Ask HN: How would you store 10PB of data for your startup today?
I'm running a startup and we're storing north of 10PB of data and growing. We're currently on AWS and our contract is up for renewal. I'm exploring other storage solutions.

Min requirements: those of AWS S3 One Zone-IA (https://aws.amazon.com/s3/storage-classes/?nc=sn&loc=3)

How would you store >10PB if you were in my shoes? The thought experiment can be with and without the data transfer cost out of the current S3 buckets. Please also mention what your experience is based on. Ideally you store large amounts of data yourself and can speak from first-hand experience.

Thank you for your support!! I will post a thread once we've come to a decision on what we ended up doing.

Update: Should have mentioned earlier, the data needs to be accessible at all times. It’s user generated data that is downloaded in the background to a mobile phone, so super low latency is not important, but less than 1000ms is required.

The data is all images and videos, and no queries need to be performed on the data.

pmlnr · 4 years ago
Non-cloud:

HPE sells their Apollo 4000[^1] line, which takes 60x3.5" drives - with 16TB drives, that's 960TB per machine, so one rack of 10 of these is 9.6PB, which nearly covers your 10PB needs. (We have some racks like this.) They are not cheap. (Note: Quanta makes servers that can take 108x3.5" drives, but they need special deep racks.)

The problem here would be the "filesystem" (read: the distributed service): I don't have much experience with Ceph, and ZFS across multiple machines is nasty as far as I'm aware, but I could be wrong. HDFS would work, but the latency can be completely random there.

[^1]: https://www.hpe.com/uk/en/storage/apollo-4000.html

So unless you are desperate to save money in the long run, stick to the cloud, and let someone else sweat about the filesystem level issues :)

EDIT: btw, we let the dead drives "rot": replacing them would cost more, and the failure rate is not that bad, so they stay in the machine, and we disable them in fstabs, configs, etc.

EDIT2: at 10PB HDFS would be happy; buy 3 racks of those Apollos and you're done. We first started struggling at 1000+ nodes; now, with 2400 nodes, nearly 250PB raw capacity, and literally a billion filesystem objects, we are slow as f*, so plan carefully.

walrus01 · 4 years ago
> The problem here would be the "filesystem" (read: the distributed service): I don't have much experience with Ceph,

I think at that scale you would want a ceph expert on staff as a full time salaried position.

For an organization that has 10PB now and can project a growth path to 15, 20, 25PB in the future, you should talk with management about creating a vacant position for that role, and filling it.

> EDIT: btw, we let the dead drives "rot": replacing them would cost more, and the failure rate is not that bad, so they stay in the machine, and we disable them in fstabs, configs, etc.

I am a huge advocate of hosting stuff yourself on bare metal you own, but this is a ridiculous statement. Any drive in that class should come with a 3 or 5 year warranty. And the manual labor and hassle time to replace one (you have hundreds of thousands of dollars of storage and no ready to go cold spares on a shelf?!?!) is infinitesimal.

pmlnr · 4 years ago
OK, clarification: most of our fleet is a LOT of Supermicro machines where it's impossible to identify a drive except by serial number. There's no UID light, the machine needs to go offline, some 10 screws need to come out to open the chassis, and 4 more per drive.

The amount of downtime this would generate for a single machine, plus the operational cost, isn't worth the hassle unless the machine loses a significant chunk of drives.

closeparen · 4 years ago
If the colo is far and there’s plenty of headroom, it might not justify much urgency.
secabeen · 4 years ago
You can also get units like this direct from Western Digital/HGST. We have a system with 3 of their 4U60 units, and they weren't all that expensive. Ordering direct from HGST, we only paid a small premium on top of the cost of the SAS drives.
Terretta · 4 years ago
This is the answer that worked for us storing petabytes a decade ago.

We collaborated with OEMs and also shared/compared notes with Backblaze on rackable mass storage for commodity drives.

Backblaze published a series of iterations of designs of multi-drive chassis, and one of the OEMs would make them for other buyers as well. If you’re doing this route, read through those for considerations and lessons learned.

Performance was > 10x better than enterprise solutions. A policy to “leave dead disks dead” aka “let them rot” as said elsewhere in this thread kept maintenance cheap.

The secret sauce part making this viable for commercial online storage hosting (we hosted video) was we used disks as JBOD with an in-house meta index with P2P health awareness to place objects redundantly across disks, chassis, racks, colocation providers, and regions.
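
For illustration, a minimal sketch of that placement idea (Python; the inventory, failure-domain granularity, and replica count are assumptions, not the system described above): rank disks by a per-object hash and take the top candidates that don't share a failure domain.

    import hashlib

    # Hypothetical inventory: (site, rack, chassis, disk) identifies each disk.
    DISKS = [
        (site, f"rack{r}", f"chassis{c}", f"disk{d}")
        for site in ("east", "west")
        for r in range(2) for c in range(3) for d in range(10)
    ]

    def place(object_key, replicas=3):
        """Pick `replicas` disks so no two share the same (site, rack)."""
        def rank(disk):
            # Deterministic per-object ordering (rendezvous-style hashing).
            return hashlib.sha256((object_key + "/".join(disk)).encode()).hexdigest()
        chosen, used_domains = [], set()
        for disk in sorted(DISKS, key=rank):
            domain = disk[:2]                 # failure domain = (site, rack)
            if domain in used_domains:
                continue
            chosen.append(disk)
            used_domains.add(domain)
            if len(chosen) == replicas:
                return chosen
        raise RuntimeError("not enough independent failure domains")

    print(place("user123/video456.mp4"))

Ceph's CRUSH and HDFS rack awareness do the same job with more machinery; the hard part in practice is keeping the index consistent and repairing placement when disks die.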

more_corn · 4 years ago
Don't buy HPE gear. Qualify the gear with sample units from a few competing vendors and you'll see why.
specktr · 4 years ago
I’ve done qualification on hpe and various competing vendors and honestly haven’t seen dramatic differences in terms of performance and failure rates. From my experience the biggest difference was with vendor support services rather than the actual hardware. I’d be curious to hear more about your qualification experience with this particular vendor if you’d be willing.
metabrew · 4 years ago
When we set up user content storage of images and mp3s for Last.fm back in 2006ish we used MogileFS (from the bradfitz LJ perl days) running on our own hardware. 3/4/5/6u machines stuffed full of disks. I still think it's an elegant concept – easy to grok, easy to debug, easy to reason about. No special distributed filesystem to worry about.

Don't take this as an endorsement of the MogileFS perl codebase in 2021, but worth considering this style of storage system depending on your precise needs.

monstrado · 4 years ago
MinIO is an option as well and would allow you to transition from testing in S3 to your own MinIO cluster seamlessly.
goliatone · 4 years ago
I wonder if anyone can comment on whether they have experience running MinIO at scale. It would be a pleasant surprise if a “simple” MinIO cluster could handle such a workload.
dividedbyzero · 4 years ago
Does it scale that far?
lars_francke · 4 years ago
I'd be interested to learn more about your HDFS usage and your experience at that scale. Would you be willing to have a chat? If so, my email is in my profile.
skynet-9000 · 4 years ago
At that kind of scale, S3 makes zero sense. You should definitely be rolling your own.

10PB costs more than $210,000 per month at S3, or more than $12M after five years.

RackMountPro offers a 4U server with 102 bays, similar to the BackBlaze servers, which fully configured with 12GB drives is around $11k total and stores 1.2 PB per server. (https://www.rackmountpro.com/product.php?pid=3154)

That means that you could fit all 15TB (for erasure encoding with Minio) in less than two racks for around $150k up-front.

Figure another $5k/mo for monthly opex as well (power, bandwidth, etc.)

Instead of $12M spent after five years, you'd be at less than $500k, including traffic (also far cheaper than AWS.) Even if you got AWS to cut their price in half (good luck with that), you'd still be saving more than $5 million.
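
Parameterizing that comparison makes it easy to re-run with whichever tier or discount applies (a rough sketch; the self-hosted figures are the ones claimed above, which other replies dispute):

    # Rough 5-year cost comparison. All inputs are assumptions; swap in
    # your own quotes (tier pricing, hardware, colo/bandwidth opex).

    GB_PER_PB = 1_000_000  # decimal GB, which is how S3 bills

    def cloud_cost(pb, price_per_gb_month, months=60):
        return pb * GB_PER_PB * price_per_gb_month * months

    def self_hosted_cost(upfront, opex_per_month, months=60):
        return upfront + opex_per_month * months

    print(f"S3 Standard (~$0.021/GB-mo), 5y:   ${cloud_cost(10, 0.021):>12,.0f}")
    print(f"S3 One Zone-IA (~$0.01/GB-mo), 5y: ${cloud_cost(10, 0.01):>12,.0f}")
    print(f"Self-hosted ($150k + $5k/mo), 5y:  ${self_hosted_cost(150_000, 5_000):>12,.0f}")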

Getting the data out of AWS won't be cheap, but check out the snowball options for that: https://aws.amazon.com/snowball/pricing/

cdavid · 4 years ago
[disclaimer: while I have some small experience putting things in DC, including big GPU servers, I have never been anywhere near that scale, certainly not storage]

~$10k is for a server with no hard drives. With 12TB disks, and with enough RAM, we're talking closer to $40-50k per server. Let's say for simplicity you're going to need to buy 15 of those, and that you only need to replace 2 of them per year. That's 25 servers over five years, already ~$750k.

And then you need to factor in the network equipment, the hosting in a colocation space, and if storage is your core value, you need to think about disaster recovery.

You will need at least 2 people full time on this, in the US that means minimum 2x 150k$ of costs per year: over 5 years, that's 1.5m$. If you use software-defined storage, that's likely gonna cost you much more because of the skill demand.

Altogether that's all gonna cost you much more than 500k$ over 5 years. I would say you would need at least 5x to 10x this.

cerved · 4 years ago
yes, the TCO needs consideration, not just the metal
hpcjoe · 4 years ago
After a certain size, AWS et al simply don't make sense unless you have infinitely deep pockets. For storage that you pull from, AWS et al charge bandwidth costs, and those costs are non-trivial for non-trivial IO. I worked up financial operational models for one of my previous employers when we were looking at the cost of remaining on S3 versus rolling it into our own DCs. The download cost, the DC space, staff, etc. came to far less per year than the cold storage costs (and the download cost is a one-time cost).

Up to about 1PB with infrequent use, AWS et al might be better. When you look at 10-100PB and beyond (we were at 500PB usable or so last I remembered) the costs are strongly biased towards in-house (or in-DC) vs cloud. That is, unless you have infinitely deep pockets.

hpcjoe · 4 years ago
I should add to this comment, as it may give the impression that I'm anti cloud. I'm not. Quite pro-cloud for a number of things.

The important point to understand in all of this is that there are cross-over points in the economics at which one becomes better than the other. Part of the economics is the speed of standing up new bits (the opportunity cost of not having those new bits instantly available). This flexibility and velocity is where cloud generally wins on design, for small projects (well below 10PB).

This said, if your use case appears to be rapidly blasting through these cross-over points, the economics usually dictates a hybrid strategy (best case) or a migration strategy (worst case).

And while your use case may be rapidly approaching these limits (you need to determine where they are if you are growing/shrinking), there are things you can do to reduce the risk and cost of the transition ahead of time.

Hybrid as a strategy can work well, as long as your hot tier is outside of the cloud. Hybrid makes sense also if you have to include the possibility of deplatforming from cloud providers (which, sadly, appears to be a real, and significant, risk to some business models and people).

None of this analysis is trivial. You may not even need to do it, if you are below 1PB, and your cloud bills are reasonable. This is the approach that works best for many folks, though as you grow, it is as if you are a frog in ever increasing temperature water (with regard to costs). Figuring out the pain point where you need to make changes to get spending on a different (better) trajectory for your business is important then.
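
A toy version of that cross-over calculation (every number here is made up; the point is the shape of the curves, not the values):

    # Month at which cumulative cloud spend overtakes self-hosted
    # capex + opex. All figures are placeholders.

    def crossover_month(cloud_per_month, capex, opex_per_month, horizon=120):
        cloud, onprem = 0.0, float(capex)      # self-hosting pays upfront
        for month in range(1, horizon + 1):
            cloud += cloud_per_month
            onprem += opex_per_month
            if cloud >= onprem:
                return month
        return None                            # no cross-over within horizon

    # e.g. $100k/mo of S3 vs $1.5M of hardware plus $25k/mo to run it:
    print(crossover_month(100_000, 1_500_000, 25_000))   # -> 20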

papageek · 4 years ago
And at an even larger size it makes sense again, with >80% discounts on compute and $0 egress.
cricalix · 4 years ago
The thing about fitting everything in one rack, potentially, is vibration. There have been several studies into drive performance degradation from vibration, and there's a noticeable impact in some scenarios. The Open Compute "Knox" design as used by Facebook spins drives up when needed and then back down, though whether that's to limit vibration impact, I don't know (it's used for their cold storage [0]).

0: https://datacenterfrontier.com/inside-facebooks-blu-ray-cold...

https://www.dtc.umn.edu/publications/reports/2005_08.pdf

https://digitalcommons.mtu.edu/cgi/viewcontent.cgi?article=1...

pokler · 4 years ago
Here is Brendan Gregg showing how vibrations can affect disk latency:

https://www.youtube.com/watch?v=tDacjrSCeq4

Johnny555 · 4 years ago
> 10PB costs more than $210,000 per month at S3, or more than $12M after five years.

Your pricing is off by a 2X - he said he's ok with infrequent access, 1 zone, which is $0.01/GB, or $100K/month.

If he rarely needs to read most of the data, he can cut the price to 1/10th of that by using Deep Archive, $0.00099 per GB, so $10K/month, or around $600K over 5 years, not including retrieval costs.

Titan2189 · 4 years ago
Nope, can't use Deep Archive as he specified max retrieval time of 1000ms. But you're correct with S3-IA
warrenm · 4 years ago
>RackMountPro offers a 4U server with 102 bays, similar to the BackBlaze servers, which fully configured with 12GB drives is around $11k total and stores 1.2 PB per server. (https://www.rackmountpro.com/product.php?pid=3154)

I dare you to buy 102 12TB drives for $11k

The cheapest consumer-class 12TB hdd is ~$275 a pop

That's $28k just for the drives

atomicity · 4 years ago
If you have PBs of data that you rarely access, it seems to make sense to compress it first.

I've rarely seen any non-giants with PBs of data properly compressed. For example, small JSON files converted into larger, compressed Parquet files will use 10-100x less space. I am not familiar with images, but I see no reason why encoding batches of similar images together couldn't get similar or even better compression ratios.

Also, if you decide to move off later on, your transfer costs will also be cheaper if you can move it off in a compressed form first.

cerved · 4 years ago
could be wrong but I don't believe compression of batches of already-compressed images compresses well

but I'd be very interested to hear about techniques for this because I have a lot of space eaten up by timelapses myself

darkr · 4 years ago
I’ve heard reports that minio gets slow beyond the hundreds of millions of objects threshold
tinus_hn · 4 years ago
You are mixing up your units, with 12GB drives and 15TB in a rack.
SergeAx · 4 years ago
You didn't take personnel cost into account. You will need at least two system administrators to look after those racks (even if remote hands to change faulty drives are in the monthly opex). That quickly takes you upwards of $200k/year at current prices (which will rise another 50% in 5 years).

On the other hand, you may negotiate a very sizable discount from AWS for 10PB of storage over 5 years.

FireBeyond · 4 years ago
Does Snowball let you exfiltrate data from AWS? I was under the impression it was only for bulk ingestion.
skynet-9000 · 4 years ago
First sentence on the linked page: "With AWS Snowball, you pay only for your use of the device and for data transfer out of AWS."
user5994461 · 4 years ago
You realize you can't fit 10 appliances of 4U in a rack? (A rack is 42U)

There's network equipment and power equipment that require space in the rack. There are power and weight limitations on the rack that prevent you from filling it to the brim.

jedberg · 4 years ago
I've put 39U of drives in a rack before. You only need 1U for a network switch, and you can get power that attaches vertically to the back, so it doesn't take up any space. If you have a cabinet with rack in front and back and all the servers have rails, the weight shouldn't be an issue.

The biggest issue will be cooling depending on how hot your servers run.

Specifically, it was a rack full of Xserve RAIDs, which are 3U each and about 100lbs each. So that was over 1300lbs.

shiftpgdn · 4 years ago
Gold standard APC PDUs are all 0U side mount.
user5994461 · 4 years ago
What if you want to move off S3? Let's do the math.

* To store 10+ PB of data.

* You need 15 PB of storage (running at 66% capacity)

* You need 30 PB of raw disks (twice for redundancy).

You're looking at buying thousands of large disks, on the order of a million dollars upfront. Do you have that sort of money available right now?

Maybe you do. Then, are you ready to receive and handle entire pallets of hardware? That will need to go somewhere with power and networking. They won't show up for another 3-6 months because that's the lead time to receive an order like that.

If you talk to Dell/HP/other, they can advise you and sell you large storage appliances. Problem is, the larger appliances will only host 1 or 2 PB. That's nowhere near enough.

There is a sweet spot in moving off the cloud, if you can fit your entire infrastructure into one rack. You're not in that sweet spot.

You're going to be filling multiple racks, which is a pretty serious issue in terms of logistics (space, power, upfront costs, networking).

Then you're going to have to handle "sharding" on top of the storage because there's no filesystem that can easily address 4 racks of disks. (Ceph/Lustre is another year long project for half a person).

The conclusion of this story: S3 is pretty good. Your time would be better spent optimizing the software. What is expensive? The storage, the bandwidth, or both?

* If it's the bandwidth. You need to improve your CDN and caching layer.

* If it's the storage. You should work on better compression for the images and videos. And check whether you can adjust retention.

latch · 4 years ago
> Let's do the math.

Offers no math.

At retail, 625 16TB drives run about $400,000. This is about 2x the MONTHLY retail S3 pricing. Further, as we all know, AWS bandwidth pricing is absolutely bonkers (1).

I think your conclusion that S3 is "pretty good" needs a lot more math to support.

(1) https://twitter.com/eastdakota/status/1371252709836263425

_nickwhite · 4 years ago
The math should also include the price of the staff who babysit 625 spinning metal disks, and who likely drive to a data center multiple times a week to swap failed drives. I shudder to think of this job falling into my lap!
chii · 4 years ago
> 625 16TB drives is $400000

how much is the real estate cost of 625 drives and associated machinery to run it?

At a guess, AWS has an operating margin of about 30%, so you can approximate their cost of hardware, bandwidth, and other fixed costs as 70% of their sticker price. As a startup, can you actually get your costs lower than that? I actually don't think you can, unless your operation is very small and can be done out of a home/small office.

gamegoblin · 4 years ago
FWIW you can get great redundancy with far less than 2x storage factor. e.g. Facebook uses a 10:14 erasure coding scheme[1] so they can lose up to 4 disks without losing data, and that only incurs a 1.4x storage factor. If one's data is cold enough, one can go wider than this, e.g. 50:55 or something has a 1.1x factor.

Not that this fundamentally changes your analysis and other totally valid points, but the 2x bit can probably be reduced a lot.

[1] https://engineering.fb.com/2015/05/04/core-data/under-the-ho...
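
The arithmetic behind those factors is simple enough to tabulate (a quick sketch; the scheme list is illustrative):

    # k data shards + m parity shards: survives any m shard losses
    # at a storage factor of (k + m) / k.

    schemes = [(10, 4), (50, 5), (6, 3), (1, 2)]   # (1, 2) is plain 3x replication

    for k, m in schemes:
        print(f"{k}+{m}: tolerates {m} losses, {(k + m) / k:.2f}x raw storage")

The trade-off versus replication is repair cost: rebuilding one lost shard means reading from k surviving ones, which is why the very wide schemes tend to be reserved for cold data.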

dsyrk · 4 years ago
https://en.wikipedia.org/wiki/Parchive

Basically they use par2 multi-file archives for cold storage with each archive file segment scattered across different physical locations. Always fun to see the kids rediscovering tricks from the old days.

Nacraile · 4 years ago
> If you talk to Dell/HP/other, they can advise you and sell you large storage appliances. Problem is, the larger appliances will only host 1 or 2 PB. That's nowhere near enough.

This is just incorrect.

If you talk to HPE, they should be quite happy to sell you my employer's software (Qumulo) alongside their hardware. 10+ PB is definitely supported. (The HPE part is not required.)

If you talk to Dell EMC, they will quite happily sell you their competing product, which is also quite capable of scaling beyond 1-2PB.

hedora · 4 years ago
Most (all?) enterprise vendors will go well beyond 1-2PB.

Four years ago, one of the all flash vendors routinely advertised “well under a dollar a gigabyte”. Their prices have dropped dramatically since then, but the out of date numbers translate to “well under a million per PB”. That’s at the high end of performance with posix (nfs) or crash coherent (block) semantics. (Some also do S3, if that’s preferable for some reason)

With a 5 year depreciation cycle, those old machines were at << $16K / month per PB. Today’s all flash systems fit multiple PB per rack, and need less than one full time admin.

Hope that helps.

user5994461 · 4 years ago
I've checked what I could find on Qumulo. It is software that you run on top of regular servers, to form a storage cluster.

It seems to me you're only confirming my previous point, that you need to invest in complicated/expensive software to make the raw storage usable.

>>> Then you're going to have to handle "sharding" on top of the storage because there's no filesystem that can easily address 4 racks of disks. (Ceph/Lustre is another year long project for half a person).

There's no listed price on the website, you will need to call sales. Wouldn't be surprised if it started at 6 figures a year for a few servers.

It looks like it may not run on just any server, but may need certified server hardware from HP or Qumulo.

qwertykb · 4 years ago
Always fun stumbling across another Qumulon on here :)
quantumofalpha · 4 years ago
AWS is ridiculously expensive at that scale, both for storage and egress. But the choice is not only between that and building a staffed on-premise storage facility.

You can compromise at a middle ground - rent a bunch of VPS/managed servers and let the hosting companies deal with all the nastiness of managing physical hardware and CAPEX. Cost around $1.6-2/TB/month (e.g. Hetzner's SX) for raw non-redundant storage, an order of magnitude better than AWS. Comes with far more reasonably priced bandwidth too.

Build some error correction on top using one of the many open-source distributed filesystems out there, or perhaps an in-house software solution (Reed-Solomon isn't exactly rocket science). And for some 30+% overhead, depending on workload (you can have very low overhead if you have few reads or very relaxed latency requirements), you should have decently fault-tolerant distributed storage at a fraction of AWS costs.
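
As a flavor of how little code the degenerate case takes, here's a single-parity (m=1) sketch in Python; a real deployment would use proper Reed-Solomon (e.g. via zfec or ISA-L) to survive more than one loss, plus all the placement and repair plumbing around it:

    # Minimal m=1 erasure code: split a blob into k data shards plus one
    # XOR parity shard; any single missing shard can be rebuilt.

    def encode(blob: bytes, k: int):
        shard_len = -(-len(blob) // k)                 # ceiling division
        padded = blob.ljust(k * shard_len, b"\0")
        shards = [padded[i * shard_len:(i + 1) * shard_len] for i in range(k)]
        parity = bytes(shard_len)                      # zero-filled
        for s in shards:
            parity = bytes(a ^ b for a, b in zip(parity, s))
        return shards + [parity]

    def rebuild(shards, missing_index):
        """Recover the shard at missing_index by XOR-ing all the others."""
        size = len(next(s for s in shards if s is not None))
        out = bytearray(size)
        for i, s in enumerate(shards):
            if i == missing_index:
                continue
            for j, byte in enumerate(s):
                out[j] ^= byte
        return bytes(out)

    data = b"user-uploaded video bytes..."
    pieces = encode(data, k=4)
    lost = pieces[2]
    pieces[2] = None                                   # simulate a dead disk
    print(rebuild(pieces, 2) == lost)                  # True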

mark_l_watson · 4 years ago
I agree that considering Hetzner is a good idea. I have used them often, never any problems, and very low pricing.
montroser · 4 years ago
> You need to improve your CDN and caching layer.

Depends on usage patterns, but if this is 10PB of users' personal photos and videos, then you're not going to get much value from caching because the hit rate will be so low.

pier25 · 4 years ago
> If it's the bandwidth. You need to improve your CDN and caching layer.

What would you recommend for this?

(considering data is stored in S3)

shyn3 · 4 years ago
Verizon and Redis have worked well for me.
crescentfresh · 4 years ago
> * To store 10+TB of data.

> * You need 15 TB of storage (running at 66% capacity)

> * You need 30 TB of raw disks (twice for redundancy).

Did you mean PB?

user5994461 · 4 years ago
Corrected.
louwrentius · 4 years ago
Very good advice!
epistasis · 4 years ago
If you have good sysadmin/devops types, this is a few racks of storage in a datacenter. Ceph is pretty good at managing something this size, and offers an S3 interface to the data (with a few quirks). We were mostly storing massive keys that were many gigabytes, so I'm not sure about performance/scaling limits with smaller keys and 10PB. I'd be sure to give your team a few months to build a test cluster, then build and scale the full-size cluster. And a few months to transfer the data...

But you'll need to balance the cost of finding people with that level of knowledge and adaptability against the cost of bundled storage packages. We were running super lean, got great deals on bandwidth and power, and had low performance requirements. When we ran the numbers for all-in costs, it was less than we thought we could get from any other vendor. And if you commit to buying the server racks it will take to fit 10PB, you can probably get somebody like Quanta to talk to you.

philippb · 4 years ago
This is amazing. Thank you. I’ve been looking at Backblaze storage pods, which seem to be designed for that use case. Never rented rack space.

Do you remember roughly the math on how much cheaper it was, or how you thought about upfront cost vs ongoing? Just order of magnitude would be great.

mceachen · 4 years ago
Roughly a decade ago S3 storage pricing had a ~10x premium over self-hosted. The convenience of not having to touch any hardware is expensive.
cmeacham98 · 4 years ago
I've run the math on this for 1PB of similar data (all pictures), and for us it was about 1.5-2 orders of magnitude cheaper over the span of 10 years (our guess for depreciation on the hardware).

Note that we were getting significantly cheaper bandwidth than S3 and similar providers, which made up over half of our savings.

epistasis · 4 years ago
Upfront costs, with networking, racked and stacked, and wired, were far under $100/TB raw, around $40-$60, but this was quite a while ago and I don't know how it looks in the era of 10+TB drives. Also remember that once you are off S3 you are doing your own backup, and the use case dictates the required availability when things fail... we didn't need anything online, but mirrored to a second site. With erasure coding, you can get by with about 1.5x copies at each site, with a performance hit. So properly backed up with a full double, it's about 3x raw...

Opex (power, data center rent, and internet access) is hugely, hugely variable. And of course, the personnel will be at least 1 full-time person who's extremely competent.
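
Spelling that multiplier out explicitly (drive size and price below are placeholder assumptions, not the numbers from back then):

    # Raw capacity once you add erasure-coding overhead and a mirrored
    # second site, and roughly what that means in drives. Assumptions inline.

    logical_pb   = 10
    ec_overhead  = 1.5      # ~1.5x raw per site with erasure coding
    sites        = 2        # primary + mirror
    drive_tb     = 16       # assumed drive size
    drive_price  = 350      # assumed street price per drive, USD

    raw_pb = logical_pb * ec_overhead * sites
    drives = raw_pb * 1000 / drive_tb
    print(f"{raw_pb:.0f} PB raw, ~{drives:.0f} drives, ~${drives * drive_price:,.0f} in disks alone")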

dangerboysteve · 4 years ago
If you have looked at BB storage pods, you should look at 45drives.com, the child of Protocase, which manufactures the BB pods.
kfrzcode · 4 years ago
Totally out-of-band for this thread, but... what are the uses for a multi-gigabyte key?! I'm clearly unaware of some cool tech, any key words I can search?
dTal · 4 years ago
I'm no expert but I would guess it's just a fancy word for "file", as in "key-value store", as opposed to a god-proof encryption key.
epistasis · 4 years ago
When I say "key" I mean the blob that gets stored, but I may be misremembering or misusing S3 terms... it was large amounts of DNA sequencing data, and one of the first tasks was to add S3 support for indexed reads to our internal HTSlib fork, and since then somebody else's implementation has been added to the library. In any case, I quickly forgot about most of the details of S3 when I no longer had to deal with it directly...
tootie · 4 years ago
This is outside my domain and I don't know how the pricing works out, but AWS Outposts will sell you a physical rack that is fully S3 compatible and redundant to the cloud.
aynsof · 4 years ago
The pricing would be prohibitive, I reckon. S3 on Outposts is $0.1/GB/mo, whereas the S3 single zone IA that OP is using as a baseline is $0.01/GB/mo - an order of magnitude less. (Prices are based on us-east-1.)
maestroia · 4 years ago
There are four hidden costs which not many have touched upon.

1) Staff. You'll need at least one, maybe two, to build, operate, and maintain any self-hosted solution. A quick peek at Glassdoor and Salary shows the unloaded salary for a Storage Engineer runs $92,000-130,000 US. Multiply by 1.25-1.4 for the loaded cost of an employee (things like FICA, insurance, laptop, facilities, etc). Storage Administrators run lower, but still around $70K US unloaded. Point is, you'll be paying around $100K+/year per storage staff position.

2) Facilities (HVAC, electrical, floor loading, etc). If you host on-site (not in a hosting facility), you'd better make certain your physical facilities can handle it. Can your HVAC handle the cooling, or will you need to upgrade it? What about your electrical? Can you get the increased electrical capacity in your area? How much will your UPS and generator cost? Can the physical structure of the building (floor loading, etc) handle the weight of racks and hundreds of drives, the vibration of mechanical drives, the air cycling?

3) Disaster Recovery/Business Continuity. Since you're using S3 One Zone IA, you have no multi-zone duplicated redundancy. Its use case is secondary backup storage for data, not the primary data store for running a startup. When there is an outage/failure (and it will happen), the startup may be toast, and investors none too happy. So this is another expense you're going to have to seriously consider, whether you stick with S3 or roll your own.

4) Cost of money. With rolling your own, you're going to be doing CAPEX and OPEX. How much upfront and ongoing CAPEX can the startup handle? Would the depreciation on storage assets be helpful financially? You really need to talk to the CPA/finance person before this. There may be better tax and financial benefits in staying on S3 (OPEX). Or not.

Good luck.

aledalgrande · 4 years ago
I 100% agree with this. Especially cash flow for a startup is going to be harder to manage. I think S3 is still the answer.
ktpsns · 4 years ago
I have worked in HPC (academia), where cluster storage has been measured in multiples of PB for a decade. Since latency and bandwidth are killer requirements there, InfiniBand (instead of Ethernet) is the de facto standard for connecting the storage pools to the computing nodes.

Maintaining such a (storage) cluster requires 1-2 people on site who replace a few hard disks every day.

Nevertheless, if I continuously needed massive amounts of data, I would opt to do it myself any time instead of using cloud services. I just know how well these clusters run, and there is little to no saving in outsourcing it.

craigyk · 4 years ago
I am a researcher in academia who handles most of my own sysadmin needs. It’s way cheaper to do yourself than some of these comments make it sound (if you have good server rack space available). I ordered two 60-drive JBODs that I racked by myself (I removed all the drives first to lighten them) for ~$82k. I used ZFS and 10-drive raidz2 vdevs for a total capacity of ~960TB of usable file system space. Installing the servers, testing some setups, and putting it into use took about 4-5 days. In four years I’ve put many PBs of reads and writes through these and had to replace 3 drives. I’d estimate I spend about 2% of my active work focus on maintaining and troubleshooting it. Scaling up to 10PB I’d probably switch to a supported SDS solution, which would be much more expensive, but still way, way cheaper than cloud.
glbrew · 4 years ago
Since he needs a 1000ms response from storage, isn't Ethernet the better option? It can reach 400Gb/s on the fastest hardware now. I thought InfiniBand was only reasonable to use when machines need to quickly access other machines' primary memory. I would like to know if I'm wrong about this, though.
alfalfasprout · 4 years ago
Agreed, and at this point with RoCE there's little reason to go with InfiniBand, given you can find fast Ethernet hardware that'll go toe to toe with InfiniBand on latency and throughput.
shiftpgdn · 4 years ago
I've done multiple multipetabyte scale projects and you only need to swap disks once a month or so. I had a project (as a solo engineer) 2 hours away and I drove there once in six months.
jtchang · 4 years ago
I would host in a datacenter of your choice and do a cross connect into AWS: https://aws.amazon.com/directconnect/pricing/

This allows you to read the data into AWS instances at no cost and process it as needed since there is 0 cost for ingress into AWS. I have some experience with this (hosting using Equinix)

aynsof · 4 years ago
Direct Connect isn't required from a cost perspective - ingress into AWS is free in all cases I can think of, but certainly in the case of S3 [0]. DX is useful when customers need assurances of bandwidth/throughput, or if they want to avoid their traffic routing over the internet.

[0] "You pay for all bandwidth into and out of Amazon S3, except for the following: Data transferred in from the internet..." - https://aws.amazon.com/s3/pricing/

philippb · 4 years ago
Thanks for the pointer. Never thought about this as an option. Great stuff!!!
pickle-wizard · 4 years ago
I had a similar problem at a past job, though we only had a PB of data. We used a product called SwiftStack. It is open source, but they have paid support. I recommend getting support, as their support is really good. It is an object store like S3, but it has its own API, though I think they have an S3-compatible gateway now.

We had about 25 Dell R730xd servers. When the cluster would start to fill up, we would just replace drives with larger drives. Upgrading drives with SwiftStack is a piece of cake. When I left we were upgrading to 10TB drives as that was the best pricing. We didn't buy the drives from Dell as they were crazy expensive. We just bought drives from Amazon/New Egg, and kept some spares onsite. We got a better warranty that way too. Dell only had a 1 year warranty, but the drives we were buying had a 5 year warranty.

TechBro8615 · 4 years ago
I’m not an AWS pricing expert, but you should be aware you’re still on the hook for S3 requests even if you can get out of paying for bandwidth. Is AWS direct connect a pure peering arrangement? I wonder what their requirements are for that. Guess I’ll read the link :)

Idk what your team’s expertise is, but I’d advise avoiding the cloud as long as possible. If you can build out an on-premise infrastructure, it will be a huge competitive advantage for your company because it will allow you to offer features that your competitors can’t.

Examples of this:

- Cloudflare built up their own network and infrastructure and it’s always been their biggest asset. They set the standard for free tier of CDN pricing, and nobody who builds a CDN on top of an existing cloud provider will ever beat it.

- Zoom. By hosting their own servers and network, Zoom is similarly able to offer a free tier where they are not subject to variable costs from free customers losing them money on bandwidth charges.

- WhatsApp. They scaled to hundreds of millions of users with less than a dozen engineers, a few dozen (?) servers, and some Erlang code.

IMO defaulting to the cloud is one of the worst mistakes a young company can make. If your app is not business critical, you can probably afford up to a day of downtime or even some data loss. And that is unlikely to happen anyway, as long as you’ve got a capable team looking after it who chooses standard and robust software.

dsyrk · 4 years ago
I’d like to say I agree with the parent comment and add some specifics.

Buy storage servers from 45drives; they basically build the same hardware Backblaze uses. Add copper 10G NICs to the servers.

https://www.45drives.com/

Get the necessary switches: 10G with 40G uplink ports, whatever your favorite is. Use 10GBase-T to the servers.

Install the hardware in a quality data center, like one of Digital Realty's:

https://www.digitalrealty.com/

And get 10G virtual cross connects to AWS.

Back-of-the-envelope calculation: you need 30PB raw, so about 60 servers. They aren't really that power hungry, so 10 per cabinet, 6 cabinets, and at least 6+2 switches.

Software wise you have lots of options with this infra. High upfront cost but low MRC vs all other options. Assuming you have skilled sys admins who know what they are doing.

comboy · 4 years ago
+ some deep archive glacier? I think waiting 12h for data is acceptable if your datacenter burns down but it may not be the case for you.
staticassertion · 4 years ago
It's going to depend entirely on a number of factors.

How are you storing this data? Is it tons of small objects, or a smaller number of massive objects?

If you can aggregate the small objects into larger ones, can you compress them? Is this 10PB compressed or not? If this is video or photo data, compression won't buy you nearly as much. If you have to access small bits of data, and this data isn't something like Parquet or JSON, S3 won't be a good fit.

Will you access this data for analytics purposes? If so, S3 has querying functionality like Athena and S3 Select. If it's instead for serving small files, S3 may not be a good fit.

Really, at PB scale these questions are all critically important, and any one of them completely changes the answer. There is no easy "store PBs of data" architecture; you're going to need to optimize heavily for your specific use case.

philippb · 4 years ago
Great question. I updated the original post. It’s user generated images and videos. We download those to the phones in the background.

We don’t touch the data at all.

staticassertion · 4 years ago
> Update: Should have mentioned earlier, data needs to be accessible at all time. It’s user generated data that is downloaded in the background to a mobile phone, so super low latency is not important, but less than 1000ms required.

> The data is all images and videos, and no queries need to be performed on the data.

OK, so this definitely helps a bit.

At 10PB my assumption is that storage costs are the major thing to optimize for. Compression is an obvious must, but as it's image and video you're going to have some trouble there.

Aggregation where you can is probably a good idea - like if a user has a photo album, it might make sense to store all of those photos together, compressed, and then store an index of photo ID to album. Deduplication is another thing to consider architecting for - if the user has the same photo across N albums, you should ensure it's only stored once. Depending on what you expect to be more or less common, this will change your approach a lot.

Of course, you want to avoid mutating objects in S3 too - so an external index to track all of this will be important. You don't want to have to pull from S3 just to determine that your data was never there. You can also store object metadata and query that first.
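
A sketch of what that index-plus-dedup layer can look like (standard library only; the schema, key layout, and names are made up for illustration): store each blob once under its content hash and keep per-user references in a small metadata store, so existence checks never hit S3.

    # Content-addressed index sketch: blobs stored once per content hash,
    # references tracked in SQLite so "does this exist / where is it"
    # never requires touching object storage. Names/schema are illustrative.

    import hashlib
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
      CREATE TABLE blobs (
        content_hash TEXT PRIMARY KEY,   -- sha256 of the bytes
        s3_key       TEXT NOT NULL,      -- where the single copy lives
        size_bytes   INTEGER NOT NULL
      );
      CREATE TABLE refs (
        user_id      TEXT NOT NULL,
        album_id     TEXT NOT NULL,
        photo_id     TEXT NOT NULL,
        content_hash TEXT NOT NULL REFERENCES blobs(content_hash),
        PRIMARY KEY (user_id, album_id, photo_id)
      );
    """)

    def ingest(user_id, album_id, photo_id, data: bytes):
        h = hashlib.sha256(data).hexdigest()
        if db.execute("SELECT 1 FROM blobs WHERE content_hash=?", (h,)).fetchone() is None:
            s3_key = f"blobs/{h[:2]}/{h}"      # upload to object storage here
            db.execute("INSERT INTO blobs VALUES (?,?,?)", (h, s3_key, len(data)))
        db.execute("INSERT OR REPLACE INTO refs VALUES (?,?,?,?)",
                   (user_id, album_id, photo_id, h))
        db.commit()
        return h

    # Same photo in two albums -> stored once, referenced twice.
    ingest("u1", "album-a", "p1", b"\xff\xd8...jpeg bytes...")
    ingest("u1", "album-b", "p9", b"\xff\xd8...jpeg bytes...")
    print(db.execute("SELECT COUNT(*) FROM blobs").fetchone()[0])  # 1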

AFAIK S3 is the cheapest way to store a huge amount of data other than running your own custom hardware. I don't think you're at that scale yet.

Latency is probably an easy one. Just don't use Glacier, basically, or use it sparingly for data that is extremely rare to access, e.g. if you back up disabled user accounts in case they come back, or something like that.
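
If some slice of the data really is near-dead (the disabled-accounts case), an S3 lifecycle rule can demote just that prefix; a hedged sketch with boto3, where the bucket name, prefix, and day count are placeholders:

    # Transition one rarely-touched prefix to Glacier Deep Archive.
    # Bucket, prefix, and day count are placeholders for illustration.

    import boto3

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket="example-user-media",
        LifecycleConfiguration={
            "Rules": [{
                "ID": "archive-disabled-accounts",
                "Filter": {"Prefix": "disabled-accounts/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 90, "StorageClass": "DEEP_ARCHIVE"}],
            }]
        },
    )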

I think this'll be less of a "do we use S3 or XYZ" and more of a "how do we organize our data so that we can compress as much of it together, deduplicate as much of it as possible, and access the least bytes necessary".