Posted by u/Ozzie_osman 5 years ago
Ask HN: Best “I brought down production” story?
What is your best "and then I brought down production" story?
theon144 · 5 years ago
Not me, but a colleague - he wanted to look around the system as the `uwsgi` user, so he ran `sudo -u uwsgi -s /bin/bash`.

Except that he typoed, and instead ran `sudo -c uwsgi -s /bin/bash`. What that does is instead of launching the (-s)hell as the uwsgi (-u)ser, it interprets the rest as a (-c)ommand. Now, `uwsgi` is also a binary, and unfortunately, it does support a `-s` switch. It tries to open a socket at that address - or a filesystem path, as the case may be. Meaning that the command (under root) overwrote /bin/bash with 0 bytes.

Within minutes, jobs started failing, the machine couldn't be SSH'd into, but funnily enough, as /bin/bash was the login shell for all users, not even logging in via a tty through KVM worked.

Perhaps not the best story, but certainly a fun way to blow your foot off on a Monday morning :)

soneil · 5 years ago
That's beautiful. I'm not sure I'd have had a clue what just happened even if it was me making the typo.
jhugo · 5 years ago
`ssh $host /bin/sh` (or another shell) should work?
hddqsb · 4 years ago
That won't work because sshd runs the command using the user's login shell. From https://man.openbsd.org/sshd#LOGIN_PROCESS:

> When a user successfully logs in, sshd does the following:

> ...

> 9. Runs user's shell or command. All commands are run under the user's login shell as specified in the system password database.
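
To make the quoted step concrete, here is a minimal Python sketch of how the executed argv ends up being assembled. It assumes the usual behavior that a requested command is handed to the login shell with `-c`; the function name `sshd_exec_argv` and the sample passwd entry are hypothetical, not taken from OpenSSH.

```python
from typing import List, Optional


def sshd_exec_argv(passwd_line: str, command: Optional[str]) -> List[str]:
    """Return the argv a server would exec for this user (illustrative only)."""
    # passwd fields: name:password:uid:gid:gecos:home:shell
    shell = passwd_line.strip().split(":")[6]
    if command is None:
        # Interactive login: only the login shell is started (simplified).
        return [shell]
    # A requested command is still run *through* the login shell.
    return [shell, "-c", command]


# Hypothetical passwd entry for illustration.
entry = "alice:x:1000:1000::/home/alice:/bin/bash"
print(sshd_exec_argv(entry, "/bin/sh"))
```

This prints `['/bin/bash', '-c', '/bin/sh']`: the broken /bin/bash is still the first thing executed, so asking for /bin/sh never gets a chance to run.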

dotancohen · 5 years ago
Logging into another shell would be attempted only if someone knew at the time why the logins were failing.

But thanks, I've just added another technique to my toolbox.

thayne · 5 years ago
Depending on the distribution, /bin/sh might be a symlink to bash
bch · 5 years ago
On a Linux box isn’t that just a link to bash?
bentcorner · 5 years ago
Now I'm curious how you managed to recover. I only know enough of my way around a shell to be dangerous and I'd be SoL if I ended up in this situation.
lights0123 · 5 years ago
Recovery disk, then either copy the disk's copy of bash (if it doesn't depend on a later glibc version), copy another shell to /bin/bash (as the system probably doesn't depend on bash-specific commands to boot), chroot and use the package manager, or use the package manager with an explicit sysroot (e.g. pacman --sysroot). The first two steps are very easy compared to the latter two, but should be followed by a reinstallation of the package that provides bash.
hamburglar · 5 years ago
Does the proc entry for a running process still link to the now-deleted file in that situation? If so, you might be able to save yourself from a running bash shell by doing a “cat /proc/$$/exe > /bin/bash”
linsomniac · 5 years ago
Probably not if it was overwritten (": >/bin/bash") rather than removed and recreated ("rm -f /bin/bash; : >/bin/bash"). The former will cause all processes to see the empty file, the latter would leave processes with access to the old contents.

In this case if you noticed and still had a shell, you could just copy another shell over ("cp /bin/sh /bin/bash"), to at least get back to probably able to login, until you could pull a copy from another machine or backups.
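
A minimal sketch of the overwrite-versus-recreate difference, assuming a POSIX filesystem. It uses an ordinary open file handle as a stand-in for a process that already has the file open, not a running executable, and the names `read_via_open_handle` and `fake_bash` are made up for illustration.

```python
import os
import shutil
import tempfile


def read_via_open_handle(replace_how: str) -> bytes:
    """Open a file, replace it on disk, then read through the old handle."""
    workdir = tempfile.mkdtemp()
    path = os.path.join(workdir, "fake_bash")
    try:
        with open(path, "wb") as f:
            f.write(b"original contents")

        handle = open(path, "rb")  # opened before the file is replaced
        try:
            if replace_how == "truncate":
                # Like ": >/bin/bash": same inode, emptied in place.
                open(path, "wb").close()
            elif replace_how == "unlink":
                # Like "rm -f /bin/bash; : >/bin/bash": the old inode
                # stays alive for as long as someone holds it open.
                os.remove(path)
                open(path, "wb").close()
            else:
                raise ValueError(replace_how)
            return handle.read()
        finally:
            handle.close()
    finally:
        shutil.rmtree(workdir)


print(read_via_open_handle("truncate"))  # old handle sees the emptied file
print(read_via_open_handle("unlink"))    # old handle still sees old contents
```

With truncation the old handle reads back nothing; with remove-and-recreate it still reads the original bytes, which is the case where copying the contents back out of an open handle can work.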

maxk42 · 5 years ago
Back in the days of MyISAM and before Google had their own ad network I worked for the world's largest advertising network. It had a global reach of 75%, meaning three quarters of people saw at least one of our ads daily.

I was trying to learn MySQL and the CTO made the mistake of giving me access to the prod database. This huge network that served most of the ads in the world ran off of only two huge servers running in an office outside Los Angeles.

MyISAM uses a read lock on every SELECT query. I did not know this at the time. I was running a number of queries that were trying to pull historical performance data for all our ads across all time. They were taking a long time so I let them run in the background while working on a spreadsheet somewhere else.

A little while later I hear some murmuring. Apparently the whole network was down. The engineering team was frantically trying to find the cause of the problem. Eventually, the CTO approaches my desk. "Were you running some queries on the database?" "Yes." "The query you ran was trying to generate billions of rows of results and locked up the entire database. Roughly three quarters of the ads in the world have been gone for almost two hours."

After the second time I did this, he showed me the MySQL EXPLAIN command and I finally twigged that some kinds of JOINs can go exponential.

Kudos to him for never revoking my access and letting me learn things the hard way. Also, if he worked for me I would have fired him.

mansoon · 5 years ago
I’m confused by the last part of your post.

Sounds like you appreciated that your boss gave you space to learn, and understood that you made an honest mistake, but you’d fire someone who made this mistake if they were working for you?

How do you square those two things internally?

notatoad · 5 years ago
It's not good to punish people for making mistakes in the course of their work (especially if that work is meant to be educational)

It is good to punish people who give access to production databases to people who shouldn't have it. And the guy learning MySQL should not be given that access.

Taking down prod is always a symptom of a systemic failure. The person responsible for the systemic failure should see the consequences, not the person responsible for the symptom.

lenova · 5 years ago
It's a poetic way of saying the boss was/is a better person than the OP.
puchatek · 4 years ago
I think the hint is in "after the second time I did this". I would also wonder whether to keep them on at that point.
maxk42 · 4 years ago
> How do you square those two things internally?

Easy: (1) He wasn't my boss. (2) He allowed a person not associated with his team or even the tech department to conduct potentially harmful operations on the production database without supervision. (3) Those actions resulted in millions of dollars of lost revenue and make-goods. (4) He did not coach the person who brought the database down. (5) He repeated the mistake.

mikeywazowski · 5 years ago
Wait, this happened twice? Weren't you at great pains to avoid it reoccurring after the first time?
iJohnDoe · 4 years ago
Yeah, about firing him for making the same mistake twice that took down production both times.

Sounds like the boss was cool.

CodesInChaos · 4 years ago
> some kinds of JOINs can go exponential.

How? AFAIK a single join can be at most quadratic, and multiple joins should at most be polynomial, where the exponent is the number of joins. To go exponential, you'd need some kind of recursion or self reference, but I know no way to express such a thing as an ordinary join statement.

(of course quadratic performance is already prohibitively slow on large tables, so there is no need to go exponential in order to take "forever")
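
A minimal sketch of that growth, counting the rows a join produces when no condition narrows it down (the worst case). The function name `cross_join_rows` is hypothetical, and the tables are just ranges of integers.

```python
from itertools import product


def cross_join_rows(table_size: int, num_tables: int) -> int:
    """Count rows from joining num_tables tables with no usable join condition."""
    table = range(table_size)
    # Every combination of one row per table is emitted.
    return sum(1 for _ in product(table, repeat=num_tables))


print(cross_join_rows(10, 2))  # 100
print(cross_join_rows(10, 3))  # 1000
```

For a fixed number of tables, the row count is a power of the table size, so it is polynomial in the table size rather than exponential.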

2muchcoffeeman · 4 years ago
Unfortunately exponential sometimes means “grows rapidly” and not exponential in a mathematical sense.
mormegil · 4 years ago
In our software, a minority codepath sometimes reported database deadlocks. Nothing critical but it littered the ops error logs and probably displayed error messages to a few customers. So I added a pessimistic exclusive lock to a query which basically solved the deadlock problem (not a great solution but it worked). However, what I missed was that the query, even though in a minority code path, was touching another table used in basically all hot-path queries. So the code seemed to work fine until deployed to all servers, when all operations of the whole cluster got serialized through this single lock. So, yeah, database locks can bite you hard!
vietvu · 4 years ago
Not as bad as yours, but this was also MySQL and also blocking a prod table: at my first job after graduating, I once ran a delete command on about 20 rows of a quite large table (maybe 500M+ rows), and it caused a deadlock because of a gap lock. It has been 6 years, so I don't really remember the details.

I was no expert, but I did know about MySQL optimization at that time. It looks like sometimes you just do things without thinking them through.

15 minutes later, the sysadmin team PM'd me to ask WTF I was doing, and I realized what had happened.

samus · 4 years ago
Someone hogging the database with an analytics query is an honest error because of an insidious footgun inherent in the technology stack. On the other hand, the CTO permitted access to the production database ... why? To learn MySQL, it would have been sufficient to set up a local instance. Or connect to testing/staging environments to get at some data.

oceanghost · 5 years ago
Can it be a story I was involved in but I didn't do it?

I used to work for a major university as a student systems admin. The only thing that was "student" about it was the pay-- I had a whole lab of Sun and SGI servers/desktops, including an INCREDIBLE 1TB of storage-- we had 7xSun A1000's (an array of arrays) if memory serves.

Our user directories were about 100GB at the time. I had sourced this special tape drive that could do that, but it was fidgety (which is not something you want in a backup drive admittedly). The backups worked, I'd say, 3/4ths of the time. I think the hardware was buggy, but the vendor could never figure it out. Also, before you lecture me, we were very constrained with finances, I couldn't just order something else.

So I graduated, and as such had to find a new admin. We interviewed two people, one was very sharp and wore black jeans and a black shirt-- it was obvious he couldn't afford a suit which would have been the correct thing to wear. The other candidate had a suit, and he was punching below his weight. Over my objections, suit guy gets hired.

Friday night, my last day of employment I throw tapes into the machine and start a full L0 backup which would take all weekend to complete.

Monday morning I get a panicked phone call from my former colleagues. "The new guy deleted the home directories!"

The suit guy had literally, in his first few hours, destroyed the entire lab's research. All of it. Anyways, I said something to the effect of, "Is the light on the AIT array green or amber?"

"Green."

"You're some lucky sons of bitches. I'll be down in an hour and we'll straighten it out."

takoid · 5 years ago
Hilarious story, thanks for sharing.

> "Is the light on the AIT array green or amber?"

Can you explain this? What is an AIT array?

oceanghost · 5 years ago
As the others have correctly intuited-- it's a tape backup system. We couldn't afford a proper tape library with a robot. I can't remember why, but there was a limitation of the software or SunOS that we needed to be able to get an L0 onto one tape. This thing was two Sony AIT tape machines that had a special SCSI board that made them look like one single drive to the host thus doubling the capacity. It was just enough to skate by. I didn't have much faith in differentials and didn't like to let them run more than a few days as well.

I always assumed the fault was in that SCSI board. The hand-off between tape-1 and tape-2 was what usually failed. The problem might occur 24 hours into a backup so it was difficult to get good backups. Also, I was not a full time employee (being a student), so I couldn't babysit this thing 5 days a week like a full time employee. Also, it kind of killed the performance of the system, so I had to do them at odd hours.

I am proud to say I never lost a single bit at that job.

In essence, if this backup had failed, months of research would have been lost (maybe a two-to-four-week-old backup x 12 researchers).

Anyways, thank you for reading my silly story!

LgWoodenBadger · 5 years ago
I assume the 3/4-reliable tape drive?
mrleinad · 5 years ago
I take it deleting everything was an accident, right? Not that the guy applied for a job only to destroy that information... lol
oceanghost · 5 years ago
I do not actually know what happened. But something basically resulted in an rm -rf /home and was accidental.

By all reports, he eventually became a well liked and good admin. I just don't think he knew that much Unix when he started.

lamontcg · 5 years ago
I was a system engineer at Amazon from 2001-2006. Sometime around 2004/2005 or so there was a development team working on the "a9 search engine" (meant to compete with Google) down in SF. They were sort of an official "shadow IT" offshoot and asked for special treatment and they got me assigned specifically to them to build out the first of their two webservers.

They did the usual mistake of wanting to jettison all the developer tooling and start from scratch. So there was a special request to just install a base O/S, put accounts on the box, and setup a plain old apache webserver with a simple /var/www/index.html (this was well outside of how Amazon normally deployed webservers which was all customized apache builds and software deployment pipelines and had a completely different file system layout).

They didn't specify what was to go into the index.html files on the servers.

So I just put "FOO" in the index.html and validated that hitting them on port 80 produced "FOO".

Then I handed off the allocated IPs to the networking team to setup a single VIP in a loadbalancer that had these two behind it.

The network engineer brought up the VIP on a free public IP address, as asked.

What nobody knew was that the IP was a decommissioned IP for www.amazon.com from a year or two earlier, when there was some great network renumbering project, and it had been pointing at a cluster of webservers on the old internal fabric.

The DNS loadbalancers were still configured for that IP address and they were still in the rotation for www.amazon.com. And all they did as a health check was pull GET / and look for 200s and then based on the speed of the returns they'd adjust their weighting.

They found that this VIP was incredibly well optimized for traffic and threw most all the new incoming requests over to these two webservers.
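
A rough sketch of why that kind of health check loved the new VIP. The inverse-latency weighting below is an assumption made for illustration, not Amazon's actual algorithm, and the backend names, latencies and the function `rotation_weights` are all hypothetical.

```python
from typing import Dict, Tuple


def rotation_weights(probes: Dict[str, Tuple[int, float]]) -> Dict[str, float]:
    """Map backend name -> share of traffic, from (status, latency_seconds) probes."""
    # Only the status code is checked; the response body is never inspected,
    # so a page that just says "FOO" counts as perfectly healthy.
    scores = {
        name: 1.0 / latency
        for name, (status, latency) in probes.items()
        if status == 200 and latency > 0
    }
    total = sum(scores.values())
    if total == 0:
        return {}
    return {name: score / total for name, score in scores.items()}


# Hypothetical probe results: two real clusters and the tiny static page.
probes = {
    "old-cluster-a": (200, 0.200),
    "old-cluster-b": (200, 0.250),
    "foo-vip": (200, 0.005),
}
for name, share in rotation_weights(probes).items():
    print(name, round(share, 3))
```

Under these made-up numbers the static page gets roughly 96% of the traffic. A check that also compared the response body against expected content would have dropped it from rotation instead.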

I learned of this when my officemate said "what is this sev1... users reporting 'foo' on the website..."

This is why you always keep it professional, kids...

exikyut · 4 years ago
Awesome.

Sadly, it seems that the Web Archive didn't happen to grab any pages from Amazon during the (presumably-small) window this was live.

Specifically, I ran the CDX query hxxp://web.archive-dot-org/cdx/search/cdx?url=amazon.com&matchType=domain, and then grepped through the 174MB of results (1,274,038 lines) for response lengths 9999 bytes and less (ie, [0-9]{1,4}), on the assumption this should find every tiny response. The only such responses are 30x redirects and a couple of 503s. :(

(That's a normal URL above - s/xx/tt/ and s/-dot-/./ - but since it spits out 174MB of text I figured I'd save IA the bandwidth from crawlers fetching everything they see on HN etc.)
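
The filtering step can be sketched in Python instead of grep. This assumes the CDX output is space-separated with seven fields ending in statuscode, digest and length; the sample lines, digests and the function name `small_responses` are made up for illustration.

```python
import re

# Same idea as the grep pattern: lengths of 1 to 4 digits, i.e. under 10,000 bytes.
SMALL_LENGTH = re.compile(r"[0-9]{1,4}")


def small_responses(cdx_lines):
    """Yield (statuscode, length, original_url) for captures under 10,000 bytes."""
    for line in cdx_lines:
        fields = line.split()
        if len(fields) != 7:
            continue  # skip anything that is not a plain 7-field record
        _urlkey, _timestamp, original, _mime, status, _digest, length = fields
        if SMALL_LENGTH.fullmatch(length):
            yield status, int(length), original


# Hypothetical sample records (not real archive data).
sample = [
    "com,amazon)/ 20040101000000 http://www.amazon.com/ text/html 302 AAAA 412",
    "com,amazon)/ 20040102000000 http://www.amazon.com/ text/html 200 BBBB 23871",
]
print(list(small_responses(sample)))
```

On the sample, only the 412-byte redirect is returned; the 23,871-byte page is filtered out.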

gorbachev · 4 years ago
> This is why you always keep it professional, kids...

In the late 90s someone in a company I worked at put a placeholder HTML file on a client's production web server that instead of the usual lorem ipsum stuff had something like "this shit is beneath me", except much worse. The placeholder was never removed, and while it wasn't the index page and if I remember correctly wasn't even directly linked from anywhere, someone still found it. The incident found its way into the national newspapers.

imvetri · 5 years ago
Hahaha
Animats · 5 years ago
Back when I was working on proof of correctness, when that was a very new thing, I was using the Boyer-Moore theorem prover remotely on a large time-shared mainframe at SRI International. At the time, you needed a mainframe to run LISP. I was working on proofs of basic numeric functions for bounded arithmetic. So I was writing theorems with numbers such as 65536.

This caused the mainframe to run out of memory, page out to disk, and thrash, bringing other users to a crawl. It took a while to figure out why relatively simple theorems were doing this.

Boyer and Moore explained to me that the internal representation of numbers was exactly that of their constructive mathematics theory. 2 was (ADD1 (ADD1 (ZERO))). 65536 was a very long string of CONS cells. I was told that most of their theorems involved numbers like 1.

They went on to improve the number representation in their prover, after which it could prove useful theorems about bounded arithmetic.

(I still keep a copy of their prover around. It's on Github, at [1]. It's several thousand times faster on current hardware than it was on the DECSYSTEM 2060.)

[1] https://github.com/John-Nagle/nqthm
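
A toy sketch of that unary representation, using nested Python tuples as stand-ins for CONS cells. This is not how nqthm is implemented; the names `ZERO`, `add1`, `from_int` and `cell_count` are only for illustration.

```python
ZERO = ()


def add1(numeral):
    """Successor: wrap the previous numeral in one more cell."""
    return (numeral,)


def from_int(value):
    """Build the numeral for a non-negative int, one ADD1 at a time."""
    numeral = ZERO
    for _ in range(value):
        numeral = add1(numeral)
    return numeral


def cell_count(numeral):
    """Walk the chain and count how many cells the numeral occupies."""
    count = 0
    while numeral:  # the empty tuple (ZERO) is falsy
        numeral = numeral[0]
        count += 1
    return count


print(from_int(2))                  # (((),),)
print(cell_count(from_int(65536)))  # 65536
```

The storage for a number grows linearly with its value, so 65536 needs 65,536 cells where a machine integer needs one word.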

yjftsjthsd-h · 5 years ago
> Boyer and Moore explained to me that the internal representation of numbers was exactly that of their constructive mathematics theory. 2 was (ADD1 (ADD1 (ZERO))). 65536 was a very long string of CONS cells. I was told that most of their theorems involved numbers like 1.

Can we have a moment for the folks who managed to turn numerical purity into integers being O(n)? It's unbearably beautiful... And I do mean unbearably...

gumby · 5 years ago
> Can we have a moment for the folks who managed to turn numerical purity into integers being O(n)?

It’s not an unreasonable approach given the application was automatic theorem proving: https://en.wikipedia.org/wiki/Peano_axioms

Both Boyer and Moore are quite sharp so I wouldn’t jump to the conclusion that they didn’t know what they were doing.

shoo · 4 years ago
could have been much worse: https://en.wikipedia.org/wiki/Dedekind_cut
tluyben2 · 5 years ago
Long ago this one (close to 2000), but we were hosting some of our clients on machines in our office because it was (a lot) cheaper for our startup to do so. We had 3 rather large (for our company) clients running at the time in the server room in the office. The servers were hooked up to second-hand APCs, so power failures went unnoticed if they were short.

One Friday afternoon, we had some drinks (TGIF) and a bunch of us (I was the CTO....) were fooling around in one of the rooms throwing tennis balls; I threw one straight into the fire alarm. This was just a glass with a button behind it: if you pressed the glass it would trigger for the entire (20 story) building. This cut the power and switched on the emergency lights.

There was no fire obviously, but, per bureaucratic rules, the fire brigade had to pull up and inspect the building, we had to sign docs etc., and then they switched off the alarm and switched the power back on. Too late for our APCs: everything was down and (like I said: a long time ago for Linux etc.) we had to run fsck and basically spent a large portion of the evening getting it all back. We moved to xs4all colocation after that incident...
nullc · 5 years ago
A friend of mine ran a large and relatively popular (as in at least 30 users online at any given time ...) PvP MUD on a server of mine back in the (late?) 90s.

I didn't play muds and my experience was mostly limited to helping him fix C programming bugs from time to time and fielding an occasional irate phone call from users who got my number off the whois data. But because of the programming help I had some kind of god-access to the mud.

One afternoon I had a ladyfriend over that I was presumably trying to impress and she'd asked about the mud. We hopped on and I summoned into existence a harmless raggedy-ann doll. That was kind of boring so I thought it would be fun to attach an NPC script to it, -- I went through the list and saw something called a "Zombie Lord" which sounded promising. I applied it, and suddenly the doll started nattering on about the apocalypse and ran off. Turned out that it killed everyone it encountered, turned them all into zombie lords, creating an exponential wave of destruction that rapidly took over the whole game.

I found the mental image of some little doll running around bringing on the apocalypse to be just too funny-- until my phone started ringing. Ultimately the game state had to be reverted to a backup from a day or two prior.

[I've posted a couple examples, -- I dunno which one is best, but people can vote. :)]

sneak · 5 years ago
RHSeeger · 5 years ago
That was an annoying time in the game. Not horrible, but you pretty much had to avoid town because people were purposefully trying to spread the plague. Certainly memorable though.
arminiusreturns · 4 years ago
I was there for it, but I literally walked into Ironforge right after the first wave hit, and it was just corpses everywhere. I had no idea what was going on at the time!
ShakataGaNai · 5 years ago
Oh man. MUDs in the late 90s. Just randomly reminded me of all the time I spent in Medivia (I'm fairly sure that was the one). Those were good times. Surprised the name even came to me since I haven't thought about those in... 20ish years.
sjaak · 5 years ago
Medivia? Judging from screenshots that looks a lot like Tibia (https://www.tibia.com)
parkersweb · 5 years ago
This was 20 years ago now - it was my first day in a new job working for a startup.

Our startup was based in the garden office of a large house and the production server was situated in a cupboard in the same room.

The day I started was a cold January day and I’d had to cycle through flooded pathways to get to work that morning - so by the time I arrived my feet were soaked.

Once I’d settled down to a desk I asked if I could plug a heater in to dry my shoes. As we were in a garden office every socket was an extension cable so I plugged the heater in to the one under my desk.

A few minutes later I noticed that I couldn’t access the live site I’d been looking through - and others were noticing the same.

It turned out the heater I was using had popped the fuse on the socket. The extension I was using was plugged into the UPS used by the servers. So the battery had warmed my feet for a few minutes before shutting down and taking the servers down too.

And that’s how I brought production down within 3 hours of starting my first job in the web industry…

tsjq · 4 years ago
Quite close: the first few minutes of this Apple WWDC from a few years ago. https://youtu.be/oaqHdULqet0?t=32