Posted by u/bratao 3 years ago
Hetzner Cloud – Data loss incident
Data loss incident (snapshots)

Dear Customer, Unfortunately, we have to inform you that there was a data loss incident that affects a small number of your snapshots on Hetzner Cloud. All snapshots you create are stored on our highly available storage systems. The snapshot contents are distributed over multiple internal servers and data is stored in a way that allows up to two separate disks to fail without impacting data integrity. This means the snapshot can still be accessed, even if two disks fail at the same time. Due to a recent, very unfortunate series of events in one of our clusters, multiple disks failed in short succession and caused a small number of snapshots to become unavailable. We immediately tried to recover the affected snapshots, but unfortunately the data is lost and we have exhausted all our options.

Affected snapshots in your account: XXXXXXXXX

The snapshots have been removed from our system as they are no longer accessible. We sincerely hope this doesn’t cause too much trouble for you; we know losing data is the worst-case scenario. Also, we have added 20€ as Cloud Credits to your account (valid for one year). While we know that this will not bring back your data, we still hope that you will accept the gesture. In response to this we will re-evaluate our snapshot cluster data replication strategies as well as our strategies for replacing disks and rebuilding redundancy after replacement.

Best Regards, Hetzner Cloud

uuyi · 3 years ago
Lost an EBS snapshot on AWS once. The only way to provide some assurance in this space is to make sure your data is stored in more than one physically and commercially separate location.
Aachen · 3 years ago
This is the way. I get that not doing small-scale things yourself saves costs and that's why "cloud" exists, but if something is important to you or your organization, I cannot stress enough how useful having just a simple old data copy is. I've lost data to bad drives too often myself. Microsoft can erroneously close your paid-for account without recourse, or ransomware can try to extort you, but your data will simply not be lost if you have an offline (or more advanced: append-only) system with an hourly/daily/... backup on it.

There are various options here, but the one I happened to have read up on (and then contributed documentation to make it easier to use safely) is Restic with rest-server in append-only mode. It's not perfect (e.g. no compression yet, only dedup), but it's pretty good. If you're an individual or a small–medium business, put your favorite Raspberry Pi equivalent in a closet and invest a hundred bucks in a hard drive big enough for the important documents, pictures of loved ones, etc.
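For anyone wanting a starting point, a minimal sketch of what that can look like (hypothetical host, repo name, and paths; assumes restic is installed and rest-server was started on the backup box with its append-only flag, e.g. rest-server --append-only --path /srv/restic):

```python
import os
import subprocess

# Hypothetical repository served by a rest-server instance in append-only mode.
REPO = "rest:http://backup-pi.lan:8000/myrepo"          # made-up host and repo name
SOURCES = ["/home/me/documents", "/home/me/photos"]     # made-up source paths

env = dict(os.environ, RESTIC_PASSWORD="use-a-real-secret-here")

# One-time initialisation of the repository (fails harmlessly if it already exists).
subprocess.run(["restic", "-r", REPO, "init"], env=env, check=False)

# Back up the important directories; run this hourly/daily from cron or a systemd timer.
subprocess.run(["restic", "-r", REPO, "backup", *SOURCES], env=env, check=True)

# Sanity-check that the repository is still readable.
subprocess.run(["restic", "-r", REPO, "check"], env=env, check=True)
```

With the server in append-only mode, a compromised client can add new snapshots but can't delete or overwrite the old ones, which is the part that matters against ransomware.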

A single drive shouldn't be your only backup; just don't put 100% of your trust in the magic cloud. See also Atlassian this week: the service may be down, but if you need the data from ticket xyz for a lead this week and you have a local data export, you can still find what you need.

mahastore · 3 years ago
you do get what you paid for
sidewndr46 · 3 years ago
That works until you find out the commercially separate location was using Amazon for their storage solution as well.
ckozlowski · 3 years ago
I would argue that EBS snapshots should not be used as a backup solution. EBS is reliable, but isn't engineered for the same durability as, say, S3. These also become more unwieldy and less useful the older they get. Use EBS snapshots for rollback prior to making changes to your environment.

For durable backup I'd recommend storing objects in S3/Glacier. If you need to image an entire volume, make an AMI, which is stored in S3. Much more durable and targeted than a snapshot, and in the case of S3 objects, gives you benefits such as logging and version control.

forty · 3 years ago
Aren't EBS snapshots stored on S3? :)
imtringued · 3 years ago
For most people Glacier is a terrible solution.
cpach · 3 years ago
Folks, remember the 3-2-1 rule. IMO, it’s still extremely relevant, even in today’s cloud-centric world.

For data you can’t afford to lose, please, for the love of ${DEITY}, don’t store it with just one vendor. You never know what might happen.

https://www.backblaze.com/blog/the-3-2-1-backup-strategy/

bombcar · 3 years ago
I think that people forget that you need two copies of your data under your control - and when your data is primarily on the cloud it is not data under your control; it is under the cloud company's control. You should have at least one copy that is physically in your possession, offline preferably, so that you have access to it.

For my cloud data, I sync it down to a local server and then sync that to removable offline media.

KronisLV · 3 years ago
> For my cloud data, I sync it down to a local server and then sync that to removable offline media.

This seems like an excellent approach. Personally I use BackupPC, which internally uses rsync and allows for incremental backups, compression, and deduplication as needed; the data then gets copied over to additional HDDs with cron.

Works pretty decently on a slow residential connection, since it all happens in the background through WireGuard and one of my homelab boxes. Plus, BackupPC lets you test restoring backups with a single click, on a per-file or per-directory basis.

Of course, internally the solution is a bit of an Eldritch mess with Perl and whatnot, so other pieces of software like Bacula might be more popular in comparison.

1970-01-01 · 3 years ago
You forgot to explain it:

Make 3 copies of your systems (just the data, if you can rebuild the infra within your RTO).

Use 2 different media (compress, then encrypt).

Move 1 copy off-site (print the encryption key and store it in a fireproof safe).
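A minimal sketch of the "compress, then encrypt" step in Python (hypothetical paths; assumes the third-party cryptography package is installed):

```python
import tarfile
from cryptography.fernet import Fernet

# 1. Compress: pack the data directory into a gzipped tar archive.
with tarfile.open("backup.tar.gz", "w:gz") as tar:
    tar.add("/data/important", arcname="important")   # hypothetical source path

# 2. Encrypt: generate a key, encrypt the archive, write out the ciphertext.
key = Fernet.generate_key()                            # this goes in the fireproof safe
with open("backup.tar.gz", "rb") as f:
    ciphertext = Fernet(key).encrypt(f.read())
with open("backup.tar.gz.enc", "wb") as f:
    f.write(ciphertext)

print("encryption key (store offline):", key.decode())
# backup.tar.gz.enc is what gets copied to the second medium and moved off-site.
```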

freemint · 3 years ago
Maybe don't compress and encrypt, depending on your threat model.
mekster · 3 years ago
I don't like 3-2-1. It's pretty fragile. I don't know how any sysadmins can sleep with only 1 remote backup.

My setup is 4-2-2, using 2 different backup implementations for the remotes, so a bug in one of them can't silently corrupt all the backups at the same time.

The 1 local and 2 remote copies keep generations of incremental backups as well.

c7DJTLrn · 3 years ago
That's not good but you do get what you pay for. Should never put all your eggs in one basket like that.

The €20 of credit made air accelerate out of my nostrils.

cle · 3 years ago
> The €20 of credit made air accelerate out of my nostrils.

Also they made sure you knew it's only valid for one year!

systemvoltage · 3 years ago
I came across a tour of a Hetzner datacenter[1] and it looks like a completely unprofessional setup, like a castle on stilts. It does not inspire confidence.

[1] https://www.youtube.com/watch?v=5eo8nz_niiM

jamal-kumar · 3 years ago
If that's your standard for a garbage data centre... hmmm

Yeah, I've seen way, way worse. Stuff like power cables zip-tied to network cables and fibre lines. DUST. Derelict hardware just sitting everywhere.

This is nice. What are you looking for, IBM dudes in white shirts running around?

brokenodometer · 3 years ago
I just watched the whole thing and thought it looked really cool! Seems like a scrappy and self-reliant company that is focused on cost effectiveness and efficiency.
tobltobs · 3 years ago
Could you tell us some of the red flags you see in this video? Considering it is a low-cost host, it looks fine to me, but I am not an expert.
singingboyo · 3 years ago
Looks like standard machines and not really servers, but there's nothing wrong there. Fairly clean wiring, no random cables, etc etc. HDD testing is a little sketchy, but they need to be quick-swapped, so it makes sense.

Ever been in a firewall/router manufacturers R&D/QA lab room? Those are so, so much worse. This is heaven by comparison.

throwaway71271 · 3 years ago
Looks pretty good.

It depends how localized the fire extinguishers are and how the power is distributed, which is hard to see from the video.

Not sure what you expect; using 1-2U boxes or blades or something won't change the setup that much. It pretty much boils down to "how many machines will die at the same time from a power failure or fire" and "how cheaply can we cool them".

jacquesm · 3 years ago
I've seen far worse. That said, the compute density is pretty low and the gear is rather older than what you would expect; they are clearly optimizing not for power efficiency but for capital efficiency, which means that as long as they can operate a box profitably they'd rather add more buildings than upgrade their servers. Interesting thermal arrangement; I wonder how well it would work in a fire. It looks like a giant chimney to me, but maybe there is some secret sauce.
native_samples · 3 years ago
FWIW if you focus on the modern areas, Google datacenters don't look much different.
philjohn · 3 years ago
That's why they're so cheap, compared to a lot of other offerings.
drfuchs · 3 years ago
Anybody have more color on the nature of the "recent, very unfortunate series of events in one of our clusters [such that] multiple disks failed in short succession"? What kind of "events"? A failed climate-control system? A berserk employee with a sledge hammer? Russian hackers? The Spanish Inquisition?

It seems odd that they'd be quite so vague and circumspect about it. Why not just say what it was, so we're not all left to speculate, perhaps for the worst?

aeyes · 3 years ago
I used to have a couple of filers with thousands of disks on-premises; we'd always have a dead disk somewhere. Keeping up with dead disks was almost a full-time job: you'd open a case with the vendor, get them access to the datacenter, then be on the phone to guide them to the rack and verify that the new disk came online. To not go completely crazy, we usually batched these jobs once or twice per week. And even with double parity and replicated filers in two locations, things can go wrong.

Also a favorite of mine is power cycling a machine. With enough disks, chances are high that a couple won't come back after they spin down.

throwawayboise · 3 years ago
A batch of disks in a RAID array that were all from the same manufacturing lot. Such setups have been known to have multiple drives fail within hours/days of each other. Best practice is to mix manufacturers or at least manufacturing lots when you build the storage array.
jerf · 3 years ago
There are a couple of other well-known ways to screw up redundancy too, though I hope they're far enough in to know them.

One of my favorites is: if you want to store three copies of something on a lot of disks, randomly select which three disks to send the copies to, so that each bit of content is stored on three random disks. It superficially seems very appealing: I have 3 copies of everything, so nothing is likely to go wrong!

However, what you've actually created is a situation where if any three disks go down, you're all but guaranteed to lose data: as you scale up, the probability that those three disks were the three randomly chosen disks for some piece of data goes to 1. The probability of three disks going down at some point likewise trends to 1 as you scale up.
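To put rough numbers on it, here is a quick Python sketch with made-up disk and object counts (nothing specific to Hetzner): with N disks and each object placed on a uniformly random 3-disk subset, the chance that one particular 3-disk failure destroys at least one object is 1 - (1 - 1/C(N,3))^M for M objects, which heads to 1 as M grows.

```python
from math import comb

def p_data_loss(n_disks: int, n_objects: int) -> float:
    """Probability that one specific set of 3 failed disks holds all 3 replicas
    of at least one object, when each object's replicas go on a random 3-disk subset."""
    p_hit = 1 / comb(n_disks, 3)          # one object lands exactly on those 3 disks
    return 1 - (1 - p_hit) ** n_objects   # at least one object does

# Illustrative numbers only: 1000 disks, increasing object counts.
for m in (10**6, 10**8, 10**9):
    print(f"{m:>13,} objects -> P(loss | 3 disks die) = {p_data_loss(1000, m):.4f}")
```

With 1000 disks there are about 166 million possible 3-disk subsets, so by the time you store a billion objects, essentially every subset is "owned" by some object and any 3 simultaneous failures lose data.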

Not saying that was the case here, just sharing it as one of the well-known pitfalls I find particularly interesting.

hoofhearted · 3 years ago
I believe the more likely scenario was that they lost a disk and replaced it. While the array was restriping, another disk in the RAID array failed because of the restriping load, and at that point all data would be lost in a RAID setup as they described.

raxxorraxor · 3 years ago
A snapshot of the current data? So the data is still available and another snapshot can be created?
kro · 3 years ago
It reads like it indeed does not affect live data on running cloud servers, yes.

Aside from backups, though, the snapshots can also be used as base images to provision new nodes; losing those could indeed be inconvenient.

throwawayboise · 3 years ago
I was always under the impression that a snapshot should be a short-lived thing. E.g. you take a snapshot, make a backup or copy from it, then delete the snapshot. Where I work, the VM admins will only allow one snapshot at any point in time without a special request and justification, and they encourage deleting the snapshot within 72 hours.
carlhjerpe · 3 years ago
Depends on the solution, but in many systems, like ZFS or Btrfs, snapshots are cheap. I can imagine Ceph snapshots being quite cheap too.

Edit: people use snapshots to complement backups too; you can take snapshots hourly and back a snapshot up daily. (They don't act as backups, more like "oops" revert solutions; humans are more likely to fail than machines.)

If you're using VMware or Hyper-V, the backups rely on a dirty "bitmap" of which blocks have been written since the last backup, so the backup tool only has to pull those from the system; for some reason this stops working once there are more snapshots. Enterprise backup solutions will just pull the entire machine and compare what's changed anyway, but you usually get a warning about it. Hence grumpy admins.
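The idea behind that dirty bitmap (generically, changed block tracking) is roughly the following; this is a toy Python sketch of the concept, not VMware's or Hyper-V's actual implementation:

```python
class ChangedBlockTracker:
    """Toy changed-block tracking: remember which blocks were written since the
    last backup so an incremental backup only copies those."""

    def __init__(self, n_blocks: int):
        self.dirty = [False] * n_blocks          # the "bitmap"

    def on_write(self, block: int):
        self.dirty[block] = True                 # marked on every guest write

    def incremental_backup(self, read_block, store_block):
        for i, is_dirty in enumerate(self.dirty):
            if is_dirty:
                store_block(i, read_block(i))    # copy only the changed blocks
        self.dirty = [False] * len(self.dirty)   # reset after a successful backup

# Usage sketch: an 8-block disk where blocks 2 and 5 changed since the last backup.
disk = [b"\x00" * 512 for _ in range(8)]
cbt = ChangedBlockTracker(8)
for blk in (2, 5):
    disk[blk] = b"\x01" * 512
    cbt.on_write(blk)
cbt.incremental_backup(read_block=lambda i: disk[i],
                       store_block=lambda i, data: print(f"backing up block {i}"))
```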

CodesInChaos · 3 years ago
You're using a rather narrow, implementation-specific definition of snapshot.

I believe EBS snapshots are equivalent to your take-temp-snapshot, make-incremental/deduplicated-backup, delete-temp-snapshot combination. Logically it's still a snapshot, because it contains all the data of the volume at a specific point in time, but the cloud provider abstracts away the details of how it achieves durability.

k8sToGo · 3 years ago
Also old snapshots
op00to · 3 years ago
Snapshots are often used for deployment of systems. What happens if the snapshot you rely on as a base image gets lost?
justinholmes · 3 years ago
Base images should be repeatable anyway, using Packer etc.
traspler · 3 years ago
Create a new one?
jacquesm · 3 years ago
> All snapshots you create are stored on our highly available storage systems.

Highly available, right up until they are not. Nice to see that your data is worth 20 bucks to Hetzner. I always liked them, but this is a bit rude, to put it mildly.

xmaayy · 3 years ago
The storage system is highly available, but the data is not guaranteed to be :)
charcircuit · 3 years ago
Availability and durability are different things.

S3, for example, is designed for an availability target that allows it to fail 1 in every 10,000 requests on average, and for a durability target that allows it to lose one object per 100,000,000,000 objects stored per year, on average.
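Back-of-the-envelope, with hypothetical request and object counts, those two targets work out to something like this:

```python
# Rough expected-value arithmetic for the design targets mentioned above.
availability = 0.9999              # ~1 failed request per 10,000
durability = 0.99999999999         # "eleven nines": ~1 object lost per 100 billion per year

requests_per_day = 1_000_000       # hypothetical workload
objects_stored = 10_000_000        # hypothetical bucket size

print("expected failed requests/day:", requests_per_day * (1 - availability))  # ~100
print("expected objects lost/year:  ", objects_stored * (1 - durability))      # ~0.0001
```

In other words, at that scale you'd see failed requests every day but would expect to wait on the order of ten thousand years to lose a single object.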

jacquesm · 3 years ago
NSS.
rmetzler · 3 years ago
Better communication than Atlassian apparently.
idxodse · 3 years ago
It's really odd to see so many complaining when yesterday a thread about Atlassian was asking for precisely this type of communication.

They communicated what went wrong, how it happened (in a nutshell), apologized and stated how they're planning to do better.

They even threw in 20€, which, compared to the pricing of their snapshot storage, is more than fair.

Shit happens, things go wrong. If you lost very important data because you only stored it in one location, then you're equally at fault. Especially when taking into account that they are a rather cheap service.

What more do you want?