Dear Customer,

Unfortunately, we have to inform you that there was a data loss incident that affects a small number of your snapshots on Hetzner Cloud.

All snapshots you create are stored on our highly available storage systems. The snapshot contents are distributed over multiple internal servers, and data is stored in a way that allows up to two separate disks to fail without impacting data integrity. This means a snapshot can still be accessed even if two disks fail at the same time.

Due to a recent, very unfortunate series of events in one of our clusters, multiple disks failed in short succession and caused a small number of snapshots to become unavailable. We immediately tried to recover the affected snapshots, but unfortunately the data is lost and we have exhausted all our options.
Affected snapshots in your account: XXXXXXXXX
The snapshots have been removed from our system as they are no longer accessible. We sincerely hope this doesn’t cause too much trouble for you; we know losing data is the worst-case scenario. Also, we have added 20€ as Cloud Credits to your account (valid for one year). While we know that this will not bring back your data, we still hope that you will accept the gesture.

In response to this incident, we will re-evaluate our snapshot cluster data replication strategies as well as our strategies for replacing disks and rebuilding redundancy after replacement.
Best Regards, Hetzner Cloud
There are various options here, but the one I happened to have read up on (and then contributed documentation to make it easier to use safely) is Restic with rest-server in append-only mode. It's not perfect (e.g. no compression yet, only dedup) but it's pretty good. If you're an individual or a small-to-medium business, put your favorite Raspberry Pi equivalent in a closet and invest a hundred bucks in a hard drive big enough for the important documents, pictures of loved ones, etc.
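Roughly, the client side of that is just a couple of restic invocations against the rest-server URL. Here's a minimal Python sketch; the host name, repo path, password handling, and backup paths are all made up, and the rest-server instance on the Pi is assumed to have been started with its append-only option so clients can add snapshots but not delete them:

    # Minimal sketch of driving restic against a rest-server repository.
    import os
    import subprocess

    REPO = "rest:http://backup-pi.lan:8000/laptop"   # hypothetical rest-server URL
    env = {**os.environ, "RESTIC_PASSWORD": "use-a-real-secret-here"}  # placeholder

    def restic(*args: str) -> None:
        """Run a restic subcommand against the (append-only) repository."""
        subprocess.run(["restic", "-r", REPO, *args], env=env, check=True)

    restic("init")                                   # one-time repository setup
    restic("backup", "/home/me/Documents", "/home/me/Pictures")
    restic("snapshots")                              # list what's stored so far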
A single drive shouldn't be your only backup either; just don't put 100% of your trust in the magic cloud. See also Atlassian this week: their service may be down, but if you need the data from ticket xyz for a lead this week and you have a local data export, you can still find what you need.
For durable backup I'd recommend storing objects in S3/Glacier. If you need to image an entire volume, make an AMI, which is stored in S3. Much more durable and targeted than a snapshot, and in the case of S3 objects, gives you benefits such as logging and version control.
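For the object route, a rough boto3 sketch (the bucket name, key, and file are placeholders, and an existing bucket with versioning enabled is assumed):

    import boto3

    s3 = boto3.client("s3")
    with open("db-dump-2022-04-12.tar.gz", "rb") as fh:
        s3.put_object(
            Bucket="my-backup-bucket",                     # placeholder bucket
            Key="backups/db-dump-2022-04-12.tar.gz",
            Body=fh,
            StorageClass="DEEP_ARCHIVE",                   # or "GLACIER" / "GLACIER_IR"
        )

    # With bucket versioning on, overwriting a key keeps the old object around,
    # which is the "version control" benefit mentioned above.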
For data you can’t afford to lose, please, for the love of ${DEITY}, don’t store it with just one vendor. You never know what might happen.
https://www.backblaze.com/blog/the-3-2-1-backup-strategy/
For my cloud data, I sync it down to a local server and then sync that to removable offline media.
This seems like an excellent approach. Personally, I use BackupPC, which internally uses rsync and allows for incremental backups, compression, and deduplication of said backups as needed; I then copy the data over to additional HDDs with cron.
Works pretty decently on a slow residential connection, since it all happens in the background through WireGuard and one of my homelab boxes. Plus, BackupPC lets you test restoring backups with a single click, on a per-file or per-directory basis.
Of course, internally the solution is a bit of an Eldritch mess with Perl and whatnot, so other pieces of software like Bacula might be more popular in comparison.
Make 3 copies of your systems (just the data, if you can rebuild the infra within your RTO).
Use 2 different media (compress, then encrypt; see the sketch after this list).
Move 1 copy off-site (print the encryption key and store it in a fireproof safe).
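A minimal compress-then-encrypt sketch in Python, assuming the "cryptography" package; the paths are placeholders, and the printed key is what would go on paper into the safe:

    import tarfile
    from cryptography.fernet import Fernet

    # 1. Compress the data you care about.
    with tarfile.open("backup.tar.gz", "w:gz") as tar:
        tar.add("/home/me/important-documents")

    # 2. Encrypt the compressed archive (whole-file, fine for a small archive).
    key = Fernet.generate_key()
    with open("backup.tar.gz", "rb") as fh:
        encrypted = Fernet(key).encrypt(fh.read())
    with open("backup.tar.gz.enc", "wb") as out:
        out.write(encrypted)

    print("Encryption key (store offline):", key.decode())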
My setup is 4-2-2, using 2 different backup implementations for the remote copies, so a bug in one of them can't silently corrupt every backup at the same time.
Apparently the 1 local and the 2 remote copies keep generations of incremental backups as well.
The €20 of credit made air accelerate out of my nostrils.
Also they made sure you knew it's only valid for one year!
[1] https://www.youtube.com/watch?v=5eo8nz_niiM
Yeah, I've seen way, way worse. Stuff like power cables zip-tied to network cables and fibre lines. DUST. Derelict hardware just sitting everywhere.
This is nice. What are you looking for, IBM dudes in white shirts running around?
Ever been in a firewall/router manufacturer's R&D/QA lab room? Those are so, so much worse. This is heaven by comparison.
Depends on how localized the fire extinguishers are and how the power is distributed, which is hard to see from the video.
Not sure what you expect; using 1-2U boxes or blades or something won't change the setup that much. It pretty much boils down to "how many machines will die at the same time from power/fire" and "how can we cool them as cheaply as possible".
It seems odd that they'd be quite so vague and circumspect about it. Why not just say what it was, so we're not all left to speculate, perhaps for the worst?
Also a favorite of mine is power cycling a machine. With enough disks, chances are high that a couple won't come back after they spin down.
One of my favorites is: if you want to store three copies of something across a lot of disks, randomly select which three disks to send the copies to, so that each piece of content ends up on three random disks. It superficially seems very appealing; I have 3 copies of everything, nothing is likely to go wrong!
However, what you've actually created is a situation where, once you scale up, any three disks going down means losing data: the probability that those three disks were the three randomly chosen disks for some piece of content goes to 1 as the amount of content grows. The probability of three disks going down at some point also likewise trends to 1 as you scale up.
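Some back-of-the-envelope numbers for that; the disk and object counts below are made up purely to illustrate the scaling:

    # If any specific 3 disks die, a given object is lost only if those were
    # exactly its 3 randomly chosen disks, but with enough objects some object
    # almost certainly was placed on them.
    from math import comb

    def p_some_object_lost(n_disks: int, n_objects: int) -> float:
        """P(at least one object lands entirely on one specific failed 3-disk set)."""
        p_single = 1 / comb(n_disks, 3)          # one object on exactly those 3 disks
        return 1 - (1 - p_single) ** n_objects   # at least one object out of n_objects

    for objects in (10_000, 1_000_000, 100_000_000):
        print(objects, round(p_some_object_lost(1_000, objects), 4))

    # With 1,000 disks there are ~166 million possible triples, so at 100 million
    # objects the loss probability for any given triple failure is already ~45%.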
Not saying that was the case here, just sharing it as one of the well-known pitfalls I find particularly interesting.
Aside from backups, the snapshots can also be used as base images to provision new nodes, though; losing those could indeed be inconvenient.
Edit: people use snapshots to complement backups too; you can do snapshots hourly and back a snapshot up daily. (They don't act as backups, more like "oops" revert solutions; humans are more likely to fail than machines.)
If you're using VMware or Hyper-V, the backups rely on a dirty "bitmap" of which blocks have been written since the last backup, so the backup tool only has to pull those from the system; for some reason this doesn't work once there are additional snapshots. Enterprise backup solutions will just pull the entire machine and compare what's changed anyway, but you usually get a warning about it. Hence grumpy admins.
I believe EBS snapshots are equivalent to your temp-snapshot, incremental/deduplicated backup, delete-temp-snapshot combination. Logically it's still a snapshot, because it contains all the data of the volume at a specific point in time, but the cloud provider abstracts away the details of how it achieves durability.
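A toy sketch of that idea (not how EBS actually implements it): each snapshot's manifest logically covers every block of the volume, but only blocks whose content hasn't been stored before consume new space.

    import hashlib

    chunk_store: dict[str, bytes] = {}            # deduplicated chunk storage

    def take_snapshot(volume_blocks: list[bytes]) -> list[str]:
        """Return a full logical snapshot (one hash per block), storing only new chunks."""
        manifest = []
        for block in volume_blocks:
            digest = hashlib.sha256(block).hexdigest()
            chunk_store.setdefault(digest, block)  # stored once, reused by later snapshots
            manifest.append(digest)
        return manifest

    def restore(manifest: list[str]) -> list[bytes]:
        """Rebuild the complete point-in-time volume from any snapshot's manifest."""
        return [chunk_store[digest] for digest in manifest]

    snap1 = take_snapshot([b"boot", b"data-v1", b"logs-v1"])
    snap2 = take_snapshot([b"boot", b"data-v2", b"logs-v1"])   # only "data-v2" is new
    assert restore(snap1)[1] == b"data-v1" and restore(snap2)[1] == b"data-v2"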
Highly available right until they are not. Nice to see that your data is worth 20 bucks to Hetzner. I always liked them, but this is a bit rude, to put it mildly.
S3, for example, is designed for 99.99% availability, meaning it can fail roughly 1 in every 10,000 requests on average, and for 99.999999999% durability, meaning it can lose about one object per year for every 100,000,000,000 objects you store, on average.
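A quick sanity check of that arithmetic, using the published eleven-nines design figure:

    durability = 0.99999999999                    # 11 nines
    objects_stored = 100_000_000_000
    expected_losses_per_year = objects_stored * (1 - durability)
    print(expected_losses_per_year)               # ~1 object per year at this scale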
They communicated what went wrong, how it happened (in a nutshell), apologized and stated how they're planning to do better.
They even threw in 20€, which, compared to the pricing of their snapshot storage, is more than fair.
Shit happens, things go wrong. If you lost very important data because you only stored it in one location, then you're equally at fault. Especially when taking into account that they are a rather cheap service.
What more do you want?