CrLf · 9 years ago
The public report is nice, and from it we can see a sequence of mishaps that shouldn't have been allowed to happen but which (unfortunately) are not that uncommon. I've made my share of mistakes, I know what it's like to be in emergency mode and too tired to think straight, so I'm going to refrain from criticizing individual actions.

What I'm going to criticize is the excess of transparency:

You absolutely DO NOT publish postmortems referencing actions by NAMED individuals, EVER.

From reading the whole report it's clear that the group is at fault, not a single individual. But most people won't read the whole thing, and even fewer people will try to understand the whole picture. That's why failures are always attributed publicly to the whole team, and actions by individuals are handled internally only.

And they're making it even worse by livestreaming the thing! It's like having your boss looking over your shoulder but a million times worse...

YorickPeterse · 9 years ago
I myself initially added my name to the document in various places; this was later changed to just initials. I specifically told my colleagues it was OK to keep it in the document. I have no problem taking responsibility for mistakes and making sure they never happen again.
scaryclam · 9 years ago
I think perhaps you'll want to not do this in the future.

Incident reports are about focusing on the "what" and "when" not the "who". This is not about taking responsibility (you don't need to be published on the internet to do that) and you can always have a follow up post after the incident report has been published as a "what I learned during incident X".

While it's great you're OK with publishing your name, you've now set a precedent that says it's OK to do this to other developers. A blanket policy on keeping names out of the incident report protects others who may not be as willing to get their name on HN (as well as not having to make amendments or retractions if the initial assumptions are incorrect). It also keeps a sense of professionalism as it's clear that no blame is being assigned. I know that you guys are not assigning blame, but if I were to show this to someone outside of this discussion, they'd assume that it was a finger-pointing exercise, which does not reflect well on Gitlab.

jawilson2 · 9 years ago
That's awesome, but why publicize it? This isn't an act of contrition for you; no one outside your team really needs to see your dirty laundry, and it actually comes off as unprofessional to me. The gitlab team is a team, and you take responsibility as a team. Placing names and initials in the liveblog makes it look like SOMEONE is trying to assign and pass off blame, even if that is not what is happening.

Presumably in the coming days there will be a number of team meetings where you discover what went wrong, and what the action items are for everyone moving forward. The public looking info just needs to say what went wrong, how it is being fixed, and what will be done in the future to prevent it from happening again. I don't need names to get that.

testUser69 · 9 years ago
I don't know why people on hacker news are against transparency. I'm glad you guys are live streaming this, others would feel too inadequate to do so. Being this transparent only makes me want to use (and contribute to) gitlab even more.
RA_Fisher · 9 years ago
That's courageous but this wasn't your mistake, it was the CEO's mistake. They owe you a vacation and apology for putting you in that terrible position!
nanch · 9 years ago
Anybody whose opinion matters understands that this type of event is a process problem, not a person problem.

GitLab has always blazed their own trail with their transparency, whether through their open run books, open source code, or in this case their open problem resolution. Kudos to them for doing it in whatever manner they want (with or without names).

To be honest, through all of the comments, yours seems the most high-strung, and you're the one complaining about high-pressure situations like having your boss looking over your shoulder. Relax buddy. :)

CrLf · 9 years ago
In a few years the guy doing the `rm -rf` is going to be in a job interview and someone will recall bits of this report. Enough bits to remember the guy, not enough bits to remember that it wasn't his (individual) fault.

Transparency doesn't mean publicly throwing people under the bus.

I'm not a GitLab customer, I'm relaxed. :)

rmc · 9 years ago
> Anybody whose opinion matters understands that this type of event is a process problem, not a person problem.

That's how the world should be. Not how it is.

Yes, someone with hiring/firing ability might blame the individual, and you could claim "Oh, you shouldn't listen to them, they're an idiot". But that's not much comfort if you're out of a job and gonna be kicked out of your house. In that situation, the idiot with hiring/firing power matters to your life a lot.

sbuttgereit · 9 years ago
I completely agree. Trying to get low-level details out to the public while in the heat of the issue is a misstep; you can still be transparent while not risking over-communication that could haunt you later.

While I think most of the HN audience understands that some days you have a bad day and that sometimes very small actions, like typing a single command at a prompt, can have dramatic consequences in technology, there are nonetheless less enlightened souls in hiring positions who might simply find fragments of this in a search on the name when that time comes.

Being too transparent could also invite legal problems, at least for the company, if someone decides that they had a material loss over this. Terms of service or the likelihood of a challenge prevailing don't necessarily matter: you can be sued for any reason, and since there's no loser-pays provision in any U.S. jurisdiction that I know of, even a win in court could be very costly. Being overly transparent in a case like this can bolster a claim of gross negligence (justified or not), and the law/courts/judges/juries cannot be relied upon to be consistently rational or properly informed.

Part of the problem is that this isn't actually a postmortem: they're basically live-blogging/streaming in real time. What would be helpful for us (users) and them (GitLab) in terms of real-time transparency:

* Acknowledge there were problems during a maintenance and that data may or may not have been lost.
* If some data is known to be safe: what data that is.
* What stage we are at. Still figuring it out? Waiting for backups to restore? Verification?
* Broad -estimated- time to recovery: make clear it's a guess. Even coarsely: days away, 10's of hours away, etc.
* When to expect the next public update on the issue.

None of this needs to be very detailed and likely shouldn't include actual technical detail. It just needs to be honest, forthright, and timely. That meets the transparency test while also protecting employees and the company.

Later, when there is time for thoughtful consideration, a technical postmortem at a good level of detail is completely appropriate.

[edit for clarity]

mverwijs · 9 years ago
At one company I worked for we had a saying: "You're not one of the team until you've brought down the network."

We all mess up. Much respect to GitLab for being open about it.

drvdevd · 9 years ago
That's the kind of team I want! All hands on deck. No lame responsibility shifters.
mateus1 · 9 years ago
Yep, I'd feel awful if I were the employee in this headline "GitLab Goes Down After Employee Deletes the Wrong Folder" [0].

It's the process and the team who are at fault.

[0] https://www.bleepingcomputer.com/news/hardware/gitlab-goes-d...

beamatronic · 9 years ago
Modern companies tend to have people in roles and teams. What I've done in postmortems is to use role names and team names, not people's names. Even if the team is just one person. This helps keep it about the team and the process. We're all professionals doing our best and striving for continuous improvement.
jjawssd · 9 years ago
Maybe these individuals don't mind, it could just be a cultural thing.

ejcx · 9 years ago
Many have echoed you, but I agree.

The person who made the error is just the straw that broke the camel's back. I'm sure these folks knew that they needed to prioritize their backups but other things kept getting in the way. You don't throw people under the bus.

Ajedi32 · 9 years ago
Am I missing something? Where in this report are any individuals actually named? My understanding was that they're using initials in place of names specifically because they want to _avoid_ naming anyone.
sbuttgereit · 9 years ago
The original versions of the document had names. Those were later replaced with initials.

I think the issue was in part that this document didn't appear to be a public "here's what's going on" doc as much as a doc they seemed to be using as a focal point for their own coordination efforts.

DanielDent · 9 years ago
I'm a huge Gitlab fan. But I long ago lost faith in their ability to run a production service at scale.

Nothing important of mine is allowed to live exclusively on Gitlab.com.

It seems like they are just growing too fast for their level of investment in their production environment.

One of the only reasons I was comfortable using Gitlab.com in the first place was because I knew I could migrate off it without too much disruption if I needed to (yay open source!). Which I ended up forced to do on short notice when their CI system became unusable for people who use their own runners (overloaded system + an architecture which uses a database as a queue. ouch.).

Which put an end to what seemed like constant performance issues. It was overdue, and made me sleep well about things like backups :).

A while back one of their database clusters went into split brain mode, which I could tell as an outsider pretty quickly... but for those on the inside, it took them a while before they figured it out. My tweet on the subject ended up helping document when the problem had started.

If they are going to continue offering Gitlab.com I think they need to seriously invest in their talent. Even with highly skilled folks doing things efficiently, at some point you just need more people to keep up with all the things that need to be done. I know it's a hard skillset to recruit for - us devopish types are both quite costly and quite rare - but I think operating the service as they do today seriously tarnishes the Gitlab brand.

I don't like writing things like this because I know it can be hard to hear/demoralizing. But it's genuine feedback that, taken in the kind spirit it is intended, will hopefully be helpful to the Gitlab team.

Perihelion · 9 years ago
Hey Daniel, I want to thank you for your candid feedback. Rest assured that this sort of thing makes it back to the team and is truly appreciated no matter how harsh it is.

You're absolutely right -- we need to do better. We're aware of several issues related to the .com service, mostly focused on reliability and speed, and have prioritized these issues this quarter. The site is down so I can't link directly, but here's a link to a cached version of the issue where we're discussing all of this if you'd like to chime in once things are back up: https://webcache.googleusercontent.com/search?q=cache:YgzBJm...

b0ti · 9 years ago
I'm running a remote-only company and we moved to GitLab.com last summer from a cloud-hosted Trac+git/svn combo (xp-dev). The reason we picked GitLab.com was that the stack is awesome and Trac is showing its age. We also wanted a solution that could be run on premises if needed. We spent about a month migrating stuff over to GitLab from Trac. Once we were settled, the reliability issues started to show. We were hoping that these would be quickly sorted out, given that the pace of development on the UI and features was quite speedy.

A sales rep reached out and I told him we would be happy to pay if that's what it takes to use the cloud-hosted version reliably, but I got no response. Certainly we could host GitLab EE or CE on our own, but this is what we wanted to avoid, preferring to leave it to those who know it best. In the 6 years we actively used it, xp-dev never had any downtime longer than 10 minutes. I'm still paying them so that I can search older projects, as their response time is instant while GitLab takes more than 10 seconds to search.

Besides the slow response times and frequent 500 and timeout errors that we got accustomed to, gitlab.com displays the notorious "Deploy in progress" message every other day for 20-30 minutes, preventing us from working. I really hoped that 6-7 months would be enough time to sort these problems out, but it only seems to be worsening, and this incident kinda makes it more apparent that there are more serious architectural issues, i.e. the whole thing running on one single PostgreSQL instance that can't be restored in one day.

We have an open issue on gitlab.com to create automated backups of all our projects so that we could migrate to our own hosted instance (or perhaps GitHub), but AFAIR gitlab.com does not support exporting the issues. This currently locks us into gitlab.com.

On one hand I'm grateful to you guys because of the great service, as we haven't paid a penny; on the other hand I feel that it was a big mistake picking gitlab.com, since we could be paying GitHub and be productive instead of watching your Twitter feed for a day waiting for the PostgreSQL database to be restored. If anyone can offer a paid hosted GitLab service that we could escape to, I'd be curious to hear about it.

iagooar · 9 years ago
I'm a bit curious here. Do you think that your issues with scalability and reliability have to do with your tech choice (I think it was Ruby on Rails)? Don't want to bash Rails, I'm just genuinely curious, since I come from a Rails background as well and have seen issues similar to yours in the past.
tssva · 9 years ago
I have searched the gitlab website and repositories looking for processes and procedures addressing change management, release management, incident management or really anything. I have found work instructions but no processes or procedures. Until you develop and enforce some appropriate processes and the resulting procedures I'm afraid you will never be able to deliver and maintain an enterprise level service.

Hopefully this will be the learning experience that allows you to place an emphasis on these things going forward and not fall into the trap of thinking formal processes and procedures are somehow incompatible with speedy time to market or technological innovation, or in conflict with DevOps.

throwaway3347 · 9 years ago
Like you, I would like to add my 2 cents, which I hope will be taken positively, as I would like to see them provide healthy competition for GitHub for years to come.

Since GitLab is so transparent about everything, from their marketing/sales/feature proposals/technical issues/etc., they make it glaringly obvious, from time to time, that they lack very fundamental core skills to do things right/well. In my opinion, they really need to focus on recruiting top talent with domain expertise.

They (GitLab) need to convince those who would work for Microsoft or GitHub to work for GitLab instead. With their current hiring strategy they are getting capable employees, but not employees who can help solidify their place online (gitlab.com) and in the enterprise. The fact that they were so nonchalant about running bare metal and talked about implementing features that they have no basic understanding of clearly shows the need for better technical guidance.

They really should focus on creating jobs that pay $200,000+ a year, regardless of living location, to attract the best talent from around the world. Getting 3-6 top talent that can help steer the company in the right direction can make all the difference in the long run.

GitLab right now is building a great company to help address low-hanging-fruit problems, but not a team that can truly compete with GitHub, Atlassian, and Microsoft in the long run. Once the low-hanging-fruit problems have been addressed, people are going to expect more from Git hosting, and this is where Atlassian, GitHub, Microsoft and others that have top talent/domain expertise will have the advantage.

Let this setback be a vicious reminder that you truly get what you pay for and that it's not too late to build a better team for the future.

jldugger · 9 years ago
> They really should focus on creating jobs that pay $200,000+ a year, regardless of living location

For those who haven't been following along, Gitlab's compensation policy is pretty much intentionally designed to not pay people to live in SF. It's a somewhat reasonable strategy for an all remote company. But they seem to have some pretty ambitious plans that may not be compatible with operating a physical plant.

throwaway14859 · 9 years ago
Look, I love GitLab. Gitlab was there for me when both my son and I got cancer, and they were more than fair with me when I needed to get healthy and planned to return to work. I have nothing but high praises for Sid and the Ops team.

With that said, I'll agree that the salary realities for GitLab employees are far below the base salary expected for a senior-level DevOps person. I've got about 10 years of experience in the space, and the salary was around $60K less than what I had been making at my previous job. I took the job at GitLab because I believe in the product, believe in the team, and believe in what GitLab could become...

With that said, starting from day 1 we were limited by an Azure infrastructure that didn't want to give us disk IOPS, legacy code and build processes that made automation difficult at times, and a culture that proclaimed openness but didn't really seem to be that open. Some of the moves they've made (OpenShift, rolling their own infrastructure, etc.) have been in the right direction, but they still haven't solved the underlying stability issues -- and these issues are a marathon, not a sprint. They've been saying that the performance, stability, and reliability of gitlab.com is a priority -- and it has been since 2014 -- but adding complexity to the application isn't helping. If I were engineering management, I'd take two or three releases and just focus on .com. Rewrite code. Focus on pages that take longer than 4 seconds to return and rewrite them. When you've got all of that, work on getting that down to three seconds. Make GitLab so that you can run it on a small DO droplet for a team of one or two people. Include LE support out of the box. Work on getting rid of the omnibus insanity. Caching should be a first-class citizen in the GitLab ecosystem.

I still believe in Gitlab. I still believe in the Leadership team. Hell, if Sid came to me today and said, "Hey, we really need your technical expertise here, could you help us out," I'd do so in a heartbeat -- because I want to see GitLab succeed (because we need to have quality open source alternatives to Jira, SourceForge Enterprise Edition, and others).

Not trying to be combative, but "You truly get what you pay for" seems a little vindictive here. The one thing I wish they had done was be open about the salary from the beginning -- but Sid made it very clear that the offer he would give me was going to be "considerably less" than what I was making.

ahawkins · 9 years ago
> They really should focus on creating jobs that pay $200,000+ a year, regardless of living location, to attract the best talent from around the world. Getting 3-6 top talent that can help steer the company in the right direction can make all the difference in the long run.

SIGN ME UP! That would be a freaking great opportunity!!

hueving · 9 years ago
Why would they try to recruit from Microsoft? Most of the software engineers at Microsoft are not focused on developing scalable web services architectures. And the ones that do have built up all of their expertise with Microsoft technologies (.net running on Windows server talking to mssql).

>Microsoft and others that have top talent/domain expertise, will have the advantage.

Again, Microsoft isn't even in this same field (git hosting), or if they are, they're effectively irrelevant due to little market/mindshare. Are you an employee there or something?

devwastaken · 9 years ago
>us devopish types are both quite costly and quite rare - but I think operating the service as they do today seriously tarnishes the Gitlab brand.

The sad thing is it doesn't have to be this way. Software stacks and sysadmin skills are out there for the learning, but due to the incentives of moving jobs every two years, nobody wants to invest in making those people; we all know we'll find /someone/ to do it anyway.

inframouse · 9 years ago
I think they are running to catch up on the GitLab system itself, let alone running it as a production service. The bugs in the last few months have been epic: backups not working, merge requests broken, Chrome users seeing bugs, chaotic support. Basically their QA and release processes are not remotely enterprise-ready.
CaliforniaKarl · 9 years ago
If I understand correctly, the public Gitlab is similar to what you can get with a private Gitlab instance. That makes me wonder, instead of trying to scale the one platform up, would it be OK to spin up a second public silo? I mean yeah, it would be a different silo, but for something free I'd say "meh".

I think it's totally fine admitting when you've stopped being able to scale up, and need to start scaling out.

DanielDent · 9 years ago
They could, and as a stopgap measure that might work, but..

(1) Some of the collaboration features (e.g. work on Gitlab itself) depend on having everyone on the same instance.

(2) Gitlab.com gives them a nice dogfood-esque environment for what it's like to actually operate Gitlab at scale. If they are having problems scaling it, then potentially so are their customers. Fixing the root cause is usually a good thing and is often an imperative to avoid being drowned in technical debt.

(3) It moves the problem in some respects. Modern devops techniques mostly make the number of identical servers largely irrelevant, but still... the more unique instances of Gitlab, the more overhead there will be in managing those instances (and figuring out which projects/people go on which instances).

It's a simple approach which I'm sure would work, but it also means a bunch of new problems are introduced which don't currently exist.

rovr138 · 9 years ago
There is a free version and paid version.

The one they offer for free is the paid version.

You can run your own, but you won't have every feature unless you pay.

elementalest · 9 years ago
>1. LVM snapshots are by default only taken once every 24 hours. YP happened to run one manually about 6 hours prior to the outage

>2. Regular backups seem to also only be taken once per 24 hours, though YP has not yet been able to figure out where they are stored. According to JN these don’t appear to be working, producing files only a few bytes in size.

>3. SH: It looks like pg_dump may be failing because PostgreSQL 9.2 binaries are being run instead of 9.6 binaries. This happens because omnibus only uses Pg 9.6 if data/PG_VERSION is set to 9.6, but on workers this file does not exist. As a result it defaults to 9.2, failing silently. No SQL dumps were made as a result. Fog gem may have cleaned out older backups.

>4. Disk snapshots in Azure are enabled for the NFS server, but not for the DB servers. The synchronisation process removes webhooks once it has synchronised data to staging. Unless we can pull these from a regular backup from the past 24 hours they will be lost. The replication procedure is super fragile, prone to error, relies on a handful of random shell scripts, and is badly documented

>5. Our backups to S3 apparently don’t work either: the bucket is empty

>So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place.

Sounds like it was only a matter of time before something like this happened. How could so many systems be not working and no one notice?
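
Item 3 in particular (pg_dump silently running the wrong binary) is the kind of thing a tiny pre-flight check in the backup script would have caught. A rough sketch, not their actual setup (the database name and output path are invented):

  #!/usr/bin/env bash
  set -euo pipefail
  # Refuse to dump if the pg_dump client doesn't match the server's version.
  client=$(pg_dump --version | grep -oE '[0-9]+\.[0-9]+' | head -1)
  server=$(psql -At -c 'SHOW server_version;' | grep -oE '[0-9]+\.[0-9]+' | head -1)
  if [ "$client" != "$server" ]; then
    echo "pg_dump $client does not match server $server -- aborting" >&2
    exit 1
  fi
  pg_dump -Fc mydb > /var/backups/db/mydb-$(date +%F).dump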

gizmo · 9 years ago
What if I told you all of society is held together by duct tape? If you're surprised that startups cut corners you're in for a rude awakening. I'm frequently amazed anything works at all.
cnst · 9 years ago
Startups only, you say?

Everything, everywhere, is held together by duct tape!

StringEpsilon · 9 years ago
Reminds me of this:

> Websites that are glorified shopping carts with maybe three dynamic pages are maintained by teams of people around the clock, because the truth is everything is breaking all the time, everywhere, for everyone. Right now someone who works for Facebook is getting tens of thousands of error messages and frantically trying to find the problem before the whole charade collapses. There's a team at a Google office that hasn't slept in three days. Somewhere there's a database programmer surrounded by empty Mountain Dew bottles whose husband thinks she's dead. And if these people stop, the world burns.

https://www.stilldrinking.org/programming-sucks

And if you think that paragraph does not apply to your applications / company, ask yourself if that's really true. My company sends around incident statistics and there's always some shit that broke. Always.

hardwaresofton · 9 years ago
The real question is what holds together duct tape?
SZJX · 9 years ago
I don't think a lot of things work at all. We who live in developed countries are really the minority. Most things are simply non-functional and downright ugly. Such complex systems the world has. And now apparently even the developed world is in for some trouble. The peace and progress of the last few decades just really isn't the norm in human history.

That said, one can't deny that there are indeed things that do work, and work very well, and people who make that happen, and one can always be amazed/inspired by those. There are good things as well as haphazard things. It's just that the latter generally outnumber the former in many settings. It doesn't necessarily imply a sweeping statement about everything though.

overcast · 9 years ago
There's a big difference between holding together things with duct tape, and literally taking zero backups of it.
tjalfi · 9 years ago
It's called the Fundamental Failure-Mode Theorem - "Complex systems usually operate in an error mode". https://en.wikipedia.org/wiki/Systemantics has more rules from the book. It's worth the read.
x0x0 · 9 years ago
if #2 is correct, holy shit did gitlab get lucky someone snapshotted 6 hours before.

Dear you: it's not a backup until you've (1) backed up, (2) pushed to external media / S3, (3) redownloaded and verified the checksum, (4) restored back into a throwaway, (5) verified that whatever is supposed to be there is, in fact, there, and (6) alerted if anything went wrong. Lots of people say this, and it's because the people saying it, me included, learned the hard way. You can shortcut the really painful learning process by scripting the above.
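
To make (3) through (6) concrete, here's a rough sketch of what "scripting the above" could look like -- the bucket name, paths, and sanity query are invented, assuming bash, the aws CLI, and a scratch Postgres instance:

  #!/usr/bin/env bash
  set -euo pipefail
  # (3) re-download the backup and verify its checksum against the one recorded at upload time
  mkdir -p /tmp/restore-test
  aws s3 cp s3://my-backups/db/latest.dump /tmp/restore-test/
  aws s3 cp s3://my-backups/db/latest.dump.sha256 /tmp/restore-test/
  (cd /tmp/restore-test && sha256sum -c latest.dump.sha256)

  # (4) restore into a throwaway database
  dropdb --if-exists restore_test
  createdb restore_test
  pg_restore -d restore_test /tmp/restore-test/latest.dump

  # (5) verify that whatever is supposed to be there actually is
  rows=$(psql -At -d restore_test -c 'SELECT count(*) FROM projects;')
  [ "$rows" -gt 0 ] || { echo "restore check failed: projects table is empty" >&2; exit 1; }

  # (6) set -e makes any failure above exit non-zero;
  #     run this from cron and alert on that exit code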

orblivion · 9 years ago
Do you have to download the entire backup or is a test backup using the same flow acceptable? I'm thinking about my personal backups, and I don't know if I have the time or space to try the full thing.
shimon_e · 9 years ago
Need to save this comment.
victor9000 · 9 years ago
I'm guessing they all worked at some point in time, but they failed to set up any sort of monitoring to verify the state of their infrastructure over time.
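
Even a dumb cron job would have caught the "backup files only a few bytes in size" failure mode. A minimal sketch of that kind of check (the path, thresholds, and alert address are made up):

  #!/usr/bin/env bash
  # Alert if the newest backup is missing, too old, or too small to be plausible.
  latest=$(ls -t /var/backups/db/*.dump 2>/dev/null | head -1)
  if [ -z "$latest" ]; then
    echo "CRITICAL: no backups found" | mail -s "backup check failed" ops@example.com
    exit 1
  fi
  age_hours=$(( ($(date +%s) - $(stat -c %Y "$latest")) / 3600 ))
  size=$(stat -c %s "$latest")
  if [ "$age_hours" -gt 26 ] || [ "$size" -lt 1000000 ]; then
    echo "CRITICAL: $latest is ${age_hours}h old and ${size} bytes" | mail -s "backup check failed" ops@example.com
    exit 1
  fi
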
onion2k · 9 years ago
> failing silently

I really wish all the applications I use had an option to never do that.

skarap · 9 years ago
Systems failing one by one and nobody caring about them is not uncommon at all. See https://en.wikipedia.org/wiki/Bhopal_disaster (search for "safety devices" in the article).
krenoten · 9 years ago
everything is broken, and we are usually late to notice due to the infrequency of happy-path divergence.
js2 · 9 years ago
If you're a sysadmin long enough, you will eventually execute a destructive command on the wrong machine. I'm fortunate that it happened to me very early in my career, and I made two changes in how I work at the suggestion of a wiser SA.

1) Before executing a destructive command, pause. Take your hands off the keyboard and perform a mental check that you're executing the right command on the right machine. I was explicitly told to literally sit on my hands while doing this check, and for a long time I did so. Now I just remove my hands from the keyboard and lower them to my side while re-considering my action.

2) Make your production shells visually distinct. I set up staging machine shells with a yellow prompt and production shells with a red prompt, with the full hostname in the prompt. You can also color your terminal window background. Or use a routine such as: production terminal windows are always on the right of the screen. Close/hide all windows that aren't relevant to the production task at hand. It should always be obvious what machine you're executing a command on and especially whether it is production. (edit: I see this is in the outage remediation steps.)
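
For what it's worth, a minimal version of that prompt setup in bash (the colours and labels are just an example):

  # ~/.bashrc on production hosts: red background, full hostname, hard to miss
  export PS1='\[\e[1;41m\][PROD \H]\[\e[0m\] \u:\w\$ '
  # ~/.bashrc on staging hosts: yellow background
  export PS1='\[\e[1;43m\][staging \H]\[\e[0m\] \u:\w\$ '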

One last thing: I try never to run 'rm -rf /some/dir' straight out. I'll almost always rename the directory and create a new directory. I don't remove the old directory till I confirm everything is working as expected. Really, 'rm -rf' should trigger red-alerts in your brain, especially if a glob is involved, no matter if you're running it in production or anywhere else. DANGER WILL ROBINSON plays in my brain every time.
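
In other words, something like this instead of an immediate rm -rf (the paths are hypothetical):

  mv /srv/app/cache /srv/app/cache.old-$(date +%Y%m%d)  # park it instead of deleting it
  mkdir /srv/app/cache                                   # recreate fresh
  # ...confirm everything still works, then, days later:
  rm -rf /srv/app/cache.old-20170131                     # the directory you parked earlier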

Lastly, I'm sorry for your loss. I've been there, it sucks.

ams6110 · 9 years ago
> 23:00-ish

> YP thinks that perhaps pg_basebackup is being super pedantic about there being an empty data directory, decides to remove the directory. After a second or two he notices he ran it on db1.cluster.gitlab.com, instead of db2.cluster.gitlab.com

Good lesson on the risks of working on a live production system late at night when you're tired and/or frustrated.

theptip · 9 years ago
Also, as a safety net, sometimes you don't need to run `rm -rf` (a command which should always be prefaced with 5 minutes of contemplation on a production system). In this case, `rmdir` would have been much safer, as it errors on non-empty directories.
piinbinary · 9 years ago
Or use `mv x x.bak` when `rmdir` fails
bpchaps · 9 years ago
These days, I've been very explicit in how I run rm. To the extent that I don't run rm -rf or rmdir immediately, but in separate steps, something like:

  pushd dir && find . -type f -ls | less && find . -type f -exec rm '{}' \; && popd && rm -rf dir
It takes a lot longer to do, but I've seen and made enough mistakes over the years that the forced extra time spent feels necessary. It's worked pretty well so far -- knock knock.

foobiekr · 9 years ago
I am actually curious what user they were logged in as and what permissions were in effect.

Unfortunately, the answer most places is that the diagnostic account (as opposed to the corrective action account) is fully privileged (or worse, root).

weavie · 9 years ago
Alternatively, if you have the luxury - `zfs snapshot`.
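
e.g., assuming the data lives on a ZFS dataset named tank/pgdata (an invented name):

  zfs snapshot tank/pgdata@pre-maintenance-$(date +%F)   # cheap, instant, read-only copy
  # ...do the risky work; if it goes wrong:
  zfs rollback tank/pgdata@pre-maintenance-$(date +%F)
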
Bad_CRC · 9 years ago
lsof -> check if a process is accessing that file/directory.
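
e.g., a quick pre-flight before touching a directory (the path is just an example):

  lsof +D /var/opt/gitlab/postgresql/data   # lists any process with files open under that tree
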
eric_h · 9 years ago
Good lesson on making command prompts on machines always tell you exactly what machine you're working on.
m_fam_wa_k · 9 years ago
I like to color code my terminal. Production systems are always red. Dev are blue/green. Staging is yellow.
wtbob · 9 years ago
'db1' vs. 'db2' is still insufficiently clear, though. Even better would be e.g. to name development systems after planets and production systems after superheroes. Very few people would mistake 'superman' for either 'green-lantern' or 'pluto,' but it's really easy to mistake 'sfnypudb13' for 'sfnydudb13.'
echelon · 9 years ago
This doesn't really help if there are multiple production databases. It could be sharded, replicated, multi-tenant, etc.
dorfsmay · 9 years ago
Uh no! Don't rely on the command prompt; there are hardcoded ones out there, and cloning scripts have duplicated them.

uname -n

Takes seconds.

sbuttgereit · 9 years ago
Also a good lesson for testing your availability and disaster recovery measures for effectiveness.

Far, far too many companies get production going and then, as far as their safety nets go, just check that certain things "completed successfully" or didn't throw an overt alert.

Just because things seem to be working doesn't mean they are or that they are working in a way that is recoverable.

craigkerstiens · 9 years ago
I'm not sure there is a "late at night" or "tired" in the incident report. All times are UTC, and it's unclear where the team is located, but if in SF then this was at 4pm, right when the incident occurred. It doesn't change the point that you shouldn't be firefighting in hero-mode for extremely long cycles, but it's not exactly the same as being exhausted and powering away for hours.
Scriptor · 9 years ago
Gitlab's team is worldwide. YP seems to be someone in EU time, which right now is UTC+1.

That would mean that the incident happened around 11pm or midnight in YP's local time.

sbuttgereit · 9 years ago
Actually, there is a brief comment:

"At this point frustration begins to kick in. Earlier this night YP explicitly mentioned he was going to sign off as it was getting late (23:00 or so local time), but didn’t due to the replication problems popping up all of a sudden."

anamoulous · 9 years ago
I audibly gasped when I hit this part of the document.
1_2__3 · 9 years ago
I think I can count on one hand the number of times I've run an rm command on a production server. I'll move it at worst, and only delete anything if I'm critically low on disk space. But even then I don't even like typing those characters if I can avoid it, regardless of if I'm running as root or a normal user.
cesarb · 9 years ago
> Good lesson on the risks of working on a live production system late at night when you're tired and/or frustrated.

In these situations, I always keep the following xkcd in mind: https://xkcd.com/349/

mbrain · 9 years ago
Just a thought, but is it possible to override destructive commands to confirm the hostname by typing it before running the command?
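
A rough way to approximate that is a shell wrapper that makes you retype the hostname before anything destructive runs -- a sketch in bash (the function name is made up):

  # e.g. in ~/.bashrc: run "danger rm -rf /some/dir" instead of calling rm directly
  danger() {
    echo "You are on: $(hostname -f)"
    read -r -p "Type the hostname to confirm: " answer
    if [ "$answer" = "$(hostname -f)" ]; then
      "$@"
    else
      echo "Hostname mismatch, not running: $*" >&2
      return 1
    fi
  }
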
nanch · 9 years ago
I remember when we accidentally deleted our customers' data. That was the worst feeling I ever had running our business. It was about 4% of our entire storage set, and we had to let our customers know and start the resyncs. Those first 12 hours of panic were physically and emotionally debilitating - more than they had any right to be. I learned an important lesson that day: business is business and personal is personal. I remember it like it was yesterday, the moment I consciously decided I would no longer allow business operations to determine my physical health (stress level, heart rate, sleep schedule).

For what it's worth, it was a lesson worth learning despite what seemed like catastrophic world-ending circumstances.

We survived, and GitLab will too. GitLab has been an extraordinary service since the beginning. Even if their repos were to get wiped (which seems not to be the case), I'd still continue supporting them (after I re-up'd from my local repos). I appreciate their transparency and hope that they can turn this situation into a positive lesson in the long run.

Best of luck to GitLab sysops and don't forget to get some sleep and relax.

totally · 9 years ago
I had a great manager a little while back who said they had an expression in Spain:

> "The person who washes the dishes is the one who breaks them."

Not, like, all the time. But sometimes. If you don't have one of these under your belt, you might ask yourself if you're moving too slow.

If that didn't help, he would also point out:

> "This is not a hospital."

Whatever the crisis, and there were some good ones, we weren't going to save anyone's life by running around.

Sure, data loss sucks, but nobody died today because of this.

I really appreciate the raw timeline. I feel your pain. Get some sleep. Tomorrow is a new day.

drewmate · 9 years ago
> Get some sleep.

Definitely get sleep, but it would be nice if the site were back online before that. I actually just created a new GitLab account and project a couple days ago for a project I needed to work on with a collaborator tonight. This is not a good first impression.

sidlls · 9 years ago
Paid or unpaid account and project?
niftich · 9 years ago
I applaud their forthrightness and hope that it's recoverable so that most of the disaster is averted.

To me the most illuminating lesson is that debugging 'weird' issues is enough of a minefield; doing it in production is fraught with even more peril. Perhaps we as users (or developers with our 'user' hat on) expect so much availability as to cause companies to prioritize it so high, but (casually, without really being on the hook for any business impact) I'd say availability is nice to have, while durability is mandatory. To me, an emergency outage would've been preferable to give the system time to catch up or recover, with the added bonus of also kicking off the offending user causing spurious load.

My other observation is that troubleshooting -- the entire workflow -- is inevitably pure garbage. We engineer systems to work well -- these days often with elaborate instrumentation to spin up containers of managed services and whatnot -- but once they no longer work well we have to dip down to the lowest adminable levels, tune obscure flags, restart processes to see if it's any better, muck about with temp files, and use shell commands that were designed 40 years ago, when it was a different time. This is a terrible idea. I don't have an easy solution for the 'unknown unknowns', but the collective state of 'what to do if this application is fucking up' feels like it's in the stone ages compared to what we've accomplished for when things are actually working.

chatmasta · 9 years ago
Be careful not to overlook the benefits of instrumentation even in the "unknown unknowns" scenario. If you implement it properly, the instruments will alert you to where the problem is, saving you time from debugging in the wrong place.

The initial goal of instrumentation should be to provide sufficient coverage of the broad areas of failure (database, network, CPU, etc.), so that in the event of a failure you immediately know where to look. Then, once those broad areas are covered, move on to more fine-grained instrumentation, preferably prioritized by failure rates and previous experience. A bug should never be undetectable a second time.
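
Even something crude covers those broad areas -- a sketch of the idea rather than anyone's actual setup (hostnames, paths, and thresholds are invented):

  #!/usr/bin/env bash
  # One coarse probe per broad failure area, so an alert at least says *where* to look.
  psql -h db1.example.com -At -c 'SELECT 1;' >/dev/null 2>&1 || echo "ALERT: primary db unreachable"
  df -P /var/opt | awk 'NR==2 && $5+0 > 90 {print "ALERT: data disk at " $5}'
  lag=$(psql -h db2.example.com -At -c \
    "SELECT COALESCE(EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp()), 0);")
  awk -v l="$lag" 'BEGIN { if (l > 300) print "ALERT: replica lag " l "s" }'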

As a contrived example, it was "instrumentation," albeit crudely targeted, that alerted GitLab that the problem was with the database. This instrumentation only pointed them to the general area of the problem, but of course that's a necessary first step. Now that they've had this problem, they can improve their database-specific instrumentation and catch the error faster next time.

zenlikethat · 9 years ago
Engineering things to work well and the troubleshooting process at a low level are one and the same. It's just that in some cases other people found these bugs and issues before you and fixed them. But this is the cost of OSS, you get awesome stuff for free (as in beer) and are expected to be on the hook for helping with this process. If you don't like it, pay somebody.

Really, everyone could benefit from learning more about the systems they rely on, such as Linux, and observability tools like vmstat, etc. The fewer lucky guesses or cargo-culted solutions you use, the better.