This is a technical question really. With such deep layoffs, I think a lot of people expected that Twitter would fall apart as a service pretty soon. That hasn't happened, and the site and app, at least to me, seem to be running perfectly fine. World Cup coverage has been totally normal.
Could some engineering folks chime in about how this is possible?
Back in the 1990s my wife worked at the ag school and they had a moment of panic when they realized they had no idea where the web server was. Turned out they had a tiny little HP PA-RISC machine in a closet covered in dust bunnies that had been running for two years without anybody thinking about it.
Last night I wanted to create a webhook and decided to use AWS Lambda. I have a few things in AWS including Lambda functions. I figured I'd look at my old ones as a reference for my new one, but I was shocked to realize I had things that had been running for five years without any intervention at all.
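For what it's worth, the Lambda webhook in question is the kind of thing that fits in a few lines, which is part of why such functions can run untouched for years. This is only a minimal sketch of a generic webhook handler, assuming an HTTP trigger (API Gateway or a function URL); the names and response shape are illustrative, not the actual function:

```python
import json

def handler(event, context):
    # Parse the incoming webhook payload from the HTTP request body.
    # event.get("body") is how API Gateway proxy integrations deliver it.
    body = json.loads(event.get("body") or "{}")

    # ...react to the payload here (enqueue work, call another service, etc.)...

    # Return an HTTP-style response so the caller knows we accepted it.
    return {"statusCode": 200, "body": json.dumps({"received": True})}
```

Once deployed, AWS handles the underlying hosts, patching, and scaling, which is exactly why there was no intervention needed for five years.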
In both of my cases you have middling software and negligent management but the underlying hardware or services are reliable and high quality. It's not like the entrepreneur I knew who was always finding web hosting that was a lot cheaper than anyone else with the downside that every few months we had to move to another data center in a hurry.
A growing company is frequently changing. A company that launches new features is changing. A company trying to fix its architecture is changing. The large workforces many valley companies maintain are built around, and justified by, this growth and change.
The change that Twitter will likely experience now is machine failure (probably around 3 per 1,000 machines a day), hard drive expiration, potentially database promotions, and failures of cache machines.
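To put that failure rate in perspective, here is a back-of-envelope calculation. The fleet size is a made-up round number for illustration, not Twitter's actual server count:

```python
# Expected daily machine failures, given the ~3-per-1,000 daily
# failure rate mentioned above. Fleet size is hypothetical.
fleet_size = 100_000            # illustrative server count
daily_failure_rate = 3 / 1000   # ~0.3% of machines fail per day

expected_failures_per_day = fleet_size * daily_failure_rate
print(expected_failures_per_day)  # -> 300.0
```

Hundreds of machines a day is routine with good automation (failed hosts drained and replaced without a human in the loop), but it becomes a grind for a skeleton crew doing it by hand.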
Automation can drive a lot of these to very small workloads, but capacity management is a potentially existential crisis looming over all tech companies.
Then you get to the real problems that Twitter faces: political change, security change, and workforce rot.
Political/regulatory change poses a problem because it often requires changes to infrastructure. This creates the type of change that can result in failure.
Security change can be supply chain problems or bug reports. Maybe keys need to get rotated, new encryption added, software updated. All of these are change. All can result in failure, and potentially catastrophic failure.
Lastly, the largest existential problem is that the engineers left at Twitter are likely not its best, and many of them are probably coerced into staying by H-1B regulation. Now you run into a problem of attrition and replacing that attrition. When your good engineers leave (or are overworked), it's harder to hire good engineers. The difference between a good engineer and a bad engineer is their `complexity to result` ratio. Good engineers can create simple solutions, while bad engineers create complex solutions, even though both might produce the same end result.
Failure is also proportional to complexity and outage duration is most impacted by complexity.
No serious engineer likes complexity for the sake of complexity. This may only apply to juniors practicing RDD (Resume-Driven Development).
There are times when a simple solution is not obvious even to the seniors, but these are generally very rare cases.
However, people-driven functions like moderation, sales, etc. are a different issue. Degradation, if any, is likely to be in quality, not system crashes.
It's not easy to tell which people are really pulling their weight, particularly when you have people in operations who are doing things that are essential but not flashy and people in development and marketing who are doing things that are flashy but not essential.
When there are mass layoffs often the best people jump ship early figuring they'll have an easy time getting work elsewhere and an even easier time if they are the first to go. Some of the people who stay are the people who don't feel they have a choice.
I am not a fan of OKRs, stack ranking, and other practices that create arcane, "high stakes" processes for measuring value, because a pathological narcissist's core competence is convincing management that their glass is 70% full and that your glass is 30% empty.
Unlike your Volkswagen CEO with a dim view of software in the past, Elon M understands software and packet switching at millisecond resolution and has demonstrated experience debugging complexity at that scale, equipping him to make informed and impactful decisions.
Less is more.
If you fire 80% of the operations staff you better fire 80% of the eng staff too.
As changes get made and released I expect to see some more outages.
Also good staff made the plans and laid the groundwork for resiliency. That work continues to provide benefits after they’ve gone.