From this operations engineer's perspective, there are only 3 main things that bring a site down: new code, disk space, and 'outages'. If you don't push new code, your apps will be pretty stable. If you don't run out of disk space, your apps will keep running. And if your network/power/etc doesn't mysteriously disappear, your apps will keep running. And running, and running, and running.
The biggest thing that brings down a site is changes. Typically code changes, but also schema/data changes, infra/network/config changes, etc. As long as nothing changes, and you don't run out of disk space (from logs for example), things stay working pretty much just fine. The trick is to design it to be as immutable and simple as possible.
There are other things that can bring a site down, like security issues, or bugs triggered by unusual states, too much traffic, etc. But generally speaking those things are rare and don't bring down an entire site.
The last thing off the top of my head that will absolutely bring a site down over time is expired certs. If, for any reason at all, a cert fails to be regenerated (say, your etcd certs, or some weird one-off tool underpinning everything that somebody has to remember to regen every 360 days), they will expire, and it will be a very fun day at the office. Over a long enough period of time, your web server's TLS version will be obsoleted in new browser versions, and nobody will be able to load it.
It's crazy to think about, but many people who use and build software today, including HN readers/commenters, are young enough to have only been exposed to the SaaS, cloud-first era, where software built with microservices deployed from CI/CD systems multiple times per day is just the way things are done.
You're totally right; if you don't make changes to the software, it's unlikely to spontaneously stop working, especially after that first 6-12 months of "hardening" where bugs are found and patched.
Many people working in tech have never been exposed to a piece of software which isn't being constantly changed in small increments and forced upon end users. People are assuming that software is inherently unstable simply because they never use anything that isn't a "cloud service".
This probably comes off as "old man yells at cloud" but I'm not trying to bash cloud here. The cloud/SaaS approach has a ton of advantages for both consumers and businesses. But the average tech person in their 20s vastly underestimates how stable software can be when you aren't constantly pushing new features.
Absolutely. I remember we built an unimaginably brittle application many years ago. I think it was running on Windows XP and glued a complex system together with COM calls into a single-page webapp, even before React was a thing. It was built on a very small budget, serving the core business of a very tiny company.
Like maybe 8 years later I found out it was still humming along happily, without really even a sysadmin attending to it, on a single workstation using consumer hardware, servicing the company that had grown tenfold in size.
It blew my mind it still just worked all these years.
Your broad point is obviously correct (most outages are caused by code or config changes) but there are still classes of failures that can happen without any real changes, like various performance degradations (maybe your table grows too large) or occasional catastrophic failures from things like disk space or id overflow or something.
There's also the stability of third party systems: forced deprecations, security EOL, etc. The cert expiration stuff people have been mentioning is in this category too. I wouldn't be surprised if something does slip through the cracks at Twitter in the next 4 or 6 months.
I remember how it was over a decade ago and one hallmark of such systems was that they were easy to exploit.
My friend in college would just go into Wordpress admin panels and the like by using common exploits because nobody updated PHP on their VPSes back then.
As someone who spent most of their career to date as a front-end developer, I learned that as long as they have the budget, stakeholders are insatiable. It's just that ten years ago most of their ideas were either technically not feasible or very expensive.
Nowadays browsers are much more capable, so the pressure to produce more features is much greater.
The other side of that is browsers. Even if you don't change your code, the platform people are running your code in changes, automatically in many cases. New JS or CSS behavior in the next Safari or Chrome? You need to patch/push to accommodate running environments that are outside your control.
The ecosystem has changed since then.
Now updates of your binary's environment are more frequent, and often enforced. How often did you update or patch your Windows 98 or NT?
Today, it's false to assume that fire-and-forget releases will work even for standalone Windows binaries.
Another thing we noticed at Netflix was that after services didn't get pushed for a while (weeks), performance started degrading because of things like undiscovered memory leaks, thread leaks, and disks filling up. You wouldn't notice during normal operations because of regular autoscaling and code pushes, but code freezes tended to reveal these issues.
We used to have a horribly written node process that was running in a Mesos cluster (using Marathon). It had a memory leak and would start to fill up memory after about a week of running, depending on what customers were doing and if they were hitting it enough.
The solution, rather than investing time in fixing the memory leak, was to add a cron job that would kill/reset the process every three days. This was easier and more foolproof than adding any sort of intelligent monitoring around it. I think an engineer added the cron job in the middle of the night after getting paged, and it stuck around forever... at least for the 6 years I was there, and it was still running when I left.
We couldn't fix the leak because the team that made it had been let go and we were understaffed, so nobody had the time to go and learn how it worked to fix it. It wasn't a critical enough piece of infrastructure to rewrite, but it was needed for a few features that we had.
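For anyone curious what that kind of band-aid looks like in practice, here is a minimal sketch of a scheduled kill-and-restart job. The process name is hypothetical, and it assumes a supervisor (Marathon, systemd, whatever) brings the process back up after it dies; it's an illustration of the approach, not the actual cron job described above.

```python
# Minimal sketch of a periodic "kill the leaky process" job, meant to be
# invoked from cron (e.g. every three days). Assumes a supervisor restarts it.
import psutil  # third-party: pip install psutil

TARGET = "leaky-node-service"  # hypothetical name; match however the process is identified

def main():
    for proc in psutil.process_iter(["pid", "name", "cmdline"]):
        cmdline = " ".join(proc.info["cmdline"] or [])
        if TARGET in cmdline:
            proc.terminate()              # SIGTERM first, let it exit cleanly
            try:
                proc.wait(timeout=30)
            except psutil.TimeoutExpired:
                proc.kill()               # escalate to SIGKILL if it hangs

if __name__ == "__main__":
    main()
```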
Agreed. One of the craziest bugs I had to deal with: we had a distributed system using lots of infrastructure, and it started having trouble communicating with random nodes and sub-systems. I spent 3 hard days finding a Linux kernel bug where the ARP cache was not removing least-recently-accessed network addresses. Normally this wouldn't be a big deal, because few networks fill up the default ARP cache size. That was even true for ours, except that we would slowly add and remove infrastructure over the course of a couple of months, until eventually the ARP cache would fill and random network devices would drop out... It wasn't even our distributed application code... Some bugs take time to manifest themselves in very creative ways.
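For reference, a rough sketch of how you might watch for that failure mode on Linux: the neighbour-table limits live under net.ipv4.neigh.default.gc_thresh{1,2,3} (gc_thresh3 being the hard cap, 1024 by default on many distros), and /proc/net/arp lists the current IPv4 entries. The warning threshold here is an assumption for illustration, not what we actually ran.

```python
# Rough Linux-only check: compare current ARP/neighbour entries against the
# kernel's hard limit so the cache filling up doesn't come as a surprise.
from pathlib import Path

def arp_entries() -> int:
    # /proc/net/arp: one header line, then one line per IPv4 neighbour entry
    return max(len(Path("/proc/net/arp").read_text().splitlines()) - 1, 0)

def gc_thresh(n: int) -> int:
    return int(Path(f"/proc/sys/net/ipv4/neigh/default/gc_thresh{n}").read_text())

if __name__ == "__main__":
    used, hard_cap = arp_entries(), gc_thresh(3)
    print(f"neighbour entries: {used} / gc_thresh3 = {hard_cap}")
    if used > 0.8 * hard_cap:
        print("WARNING: ARP cache nearing its hard limit; expect evicted/unreachable hosts")
```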
If resource leaks became a serious issue I imagine they could buy time by restarting. I'm curious what the causes were for code freezes. At Meta they would freeze around Thanksgiving and NYE because of unusually high traffic.
I once debugged a kernel memory leak in an internal module that manifested after around 6 years of (physical) server uptime. There are surprises lurking very far down the road.
Back in the Pleistocene I worked in a ColdFusion shop (USG was all CF back then and we were contractors) and we had two guys whose job was to bounce stacks when performance fell under some defined level.
> you don't run out of disk space (from logs for example)
For a social media / user-generated content application, the macro storage concerns are a lot more important than the micro ones. By this I mean, care more about overall fleet-wide capacity for product DBs and media storage, instead of caring about a single server filling up its disk with logs.
With UGC applications, product data just grows and grows, forever, never shrinking. Even if the app becomes less popular over time, the data set will still keep growing -- just more slowly than before.
Even if your database infrastructure has fully automated sharding, with bare metal hosting you still need to keep doing capacity planning and acquiring new database hardware. If no one is doing this, it's game over, there's simply nowhere to store new tweets (or new photos, or whichever infra tier runs out of hardware first...)
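The arithmetic here is simple but worth making explicit; a back-of-the-envelope runway calculation (all numbers invented for illustration) is enough to tell you whether hardware needs ordering before the lead time runs out:

```python
# Toy capacity-planning math: how long until the fleet runs out of room,
# versus how long it takes to get new hardware racked. Numbers are made up.
fleet_capacity_tb = 5_000      # total usable storage across the DB/media fleet
used_tb = 3_800                # currently consumed
daily_growth_tb = 4.5          # new UGC per day (it only ever grows)
hardware_lead_time_days = 120  # order, deliver, rack, burn in

runway_days = (fleet_capacity_tb - used_tb) / daily_growth_tb
print(f"storage runway: {runway_days:.0f} days")
if runway_days < hardware_lead_time_days:
    print("order hardware now, or there is nowhere to put new tweets/photos")
```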
Staffing problems in other eng areas can exacerbate this. For example, if automated bot detection becomes inadequate, bot posting volume goes way up and takes up an increasing amount of storage space.
> absolutely bring a site down over time is expired certs
From Casey Newton's newsletter today:
In early December, a number of Twitter’s security certificates are set to expire — particularly those that power various back-end functions of the site. (“Certs,” as they are usually called, serve to reassure users that the website they are visiting is authentic. Without proper certs, a modern web browser will refuse to establish the connection or warn users not to visit the site). Failure to renew these certs could make Twitter inaccessible for most users for some period of time.
We’re told by some members of Twitter’s engineering team that the people responsible for renewing these certs have largely resigned — raising concerns that Twitter’s site could go down without the people on hand to bring it back. Others have told us that the renewal process is largely automated, and such a failure is highly unlikely. But the issue keeps coming up in conversations we have with current and former employees.
I can imagine both cases being true: that the renewal process is automated and that certs won't get renewed because institutional knowledge has walked out the door. Where I'm at, service-to-service TLS certificates (the bulk of our certs) are automatically rotated by our deploy systems. But there are always the edge cases: the certificates manually created a long time ago (predating any standardized monitoring systems) with long expiry dates, and certificates for systems that simply can't run off the standard infrastructure. Sometimes, they'll bring down systems with low SLOs; other times, they'll block all internal development.
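For those edge-case certs, even a dumb external probe beats nothing. A minimal sketch (hostname and alert threshold are made up; this is the generic stdlib pattern, not any particular company's monitoring):

```python
# Crude cert-expiry probe for certs that sit outside automated rotation.
import socket
import ssl
from datetime import datetime, timezone

def days_until_expiry(host: str, port: int = 443) -> int:
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    # notAfter looks like "Jun  1 12:00:00 2025 GMT"
    not_after = datetime.strptime(cert["notAfter"], "%b %d %H:%M:%S %Y %Z")
    return (not_after.replace(tzinfo=timezone.utc) - datetime.now(timezone.utc)).days

if __name__ == "__main__":
    remaining = days_until_expiry("internal.example.com")  # hypothetical host
    print(f"{remaining} days left")
    if remaining < 30:
        print("renew now, before it becomes a very fun day at the office")
```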
> There are other things that can bring a site down, like security issues, or bugs triggered by unusual states, too much traffic, etc.
In my experience as a data engineer, unusual states are one of the leading causes of issues, at least after something is built for the first time. You can spend half a year running into weird corner cases like "this thing we assumed had to always be a number apparently can arbitrarily get filled in with a string, now everything is broken."
Also, changing conditions forcing code changes is the norm, not the exception: definitely in the beginning, but often later too. Most services aren't written and done - they evolve as user needs evolve and the world evolves.
> As long as nothing changes, and you don't run out of disk space (from logs for example), things stay working pretty much just fine.
> ...
> There are other things that can bring a site down, like security issues, or bugs triggered by unusual states, too much traffic, etc. But generally speaking those things are rare and don't bring down an entire site.
Aren't these changes inevitable, though? There is no such thing as bug free code.
Another thing that forces consistent code changes is compliance: any time a 0-day is discovered or some library we're using comes out with a critical fix, we would have to go update things that hadn't been touched sometimes in years.
At my last job, I spent a significant amount of time just re-learning how to update and deploy services that somebody who left the company years ago wrote, usually with little-to-no documentation. And yes, things broke when we would deploy the service anew, but we were beholden to government agencies to make the changes or else lose our certifications to do business with them.
Eventually, Twitter will have to push code changes, if only to patch security vulnerabilities. Just waiting for another Heartbleed to come around...
Software never goes stale; it's the environment around it that goes stale.
Something from the 70s works perfectly fine, except it can't run on anything bare any longer, and the hard drives etc. have all long since failed or their PSU capacitors have blown....so Twitter will absolutely rot, how fast depends on several factors.
I personally suspect the infrastructure used to build Twitter will rot faster than Twitter itself, and of course the largest, most dramatic source of rot is the power required to run it - several large communities have abandoned it already, making it much less relevant, meaning the funding for it will also dry up, meaning more wasted CPU cycles and the like.
That's of course assuming it's left in some sort of limbo, and it doesn't sound like that's the case with the current management; it's only a matter of time before it topples over from shitty low-rate contractor code. Honestly, the app worked like so much hot garbage already, I could see it falling over itself and imploding with a couple of poorly placed loops...
But real-world conditions can force code changes. For example, a region abandons daylight saving time, or a court order comes in on copyright infringement. Someone unqualified working a system they are unfamiliar with could blow it up. Losing that knowledge of how the system works is a risk.
An example where something that correlates with time can reveal pre-existing bugs long after the system was chugging along just fine: counter limits/overflows.
Simple example: you have a DB with a table with an auto-incrementing primary key. You chose a small integer type for the key, and after years of working just fine, you finally saturate that integer type and can no longer insert rows into the table. Imagine now this has cascading effects in other systems that depend on this database indirectly, and you end up with an "outage".
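The headroom check is trivial once you know the type's ceiling; a toy calculation (table name, current max id, and insert rate are all invented) turns "we chose INT" into "we have roughly N days left":

```python
# Toy headroom check for a signed 32-bit auto-increment key (e.g. MySQL INT).
SIGNED_INT_MAX = 2**31 - 1        # 2,147,483,647

current_max_id = 1_900_000_000    # hypothetical: SELECT MAX(id) FROM some_table
rows_per_day = 2_500_000          # hypothetical insert rate

remaining = SIGNED_INT_MAX - current_max_id
print(f"ids remaining: {remaining:,} (~{remaining / rows_per_day:.0f} days at current rate)")
```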
> The biggest thing that brings down a site is changes
Absolutely agreed. In that vein, there is such a thing as too much automation. Sometimes build chains are set up to always pull in the newest and the freshest -- and given the staggering number of dependencies software generally has, this might mean small changes all the time. Even when your code does not change, it can eventually break.
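One common mitigation (a generic illustration, shown here for the Python ecosystem with made-up versions) is to pin exact dependency versions, so a rebuild resolves the same tree it did last month and version bumps happen deliberately rather than on every build:

```
# requirements.txt -- pinned, hypothetical versions; rebuilds stay reproducible
requests==2.28.1
urllib3==1.26.12
certifi==2022.9.24
```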
It's been my experience that a notable part of software development (in the cloud age, anyway) is about keeping up with all the small incremental changes. It takes bodies to keep up with this churn, bodies which twitter now does not have.
It'll be interesting to keep observing this. So far it's been a testament to the teams that built it and set up the infra -- it keeps running, despite a monkey loose in a server room. It's very impressive.
How do your databases perform when their CPUs are near capacity? Or disks? Or I/O? I've seen Postgres do some "weird s%$#" where query times don't go exponential, but they do go hockey stick.
* Fan-out and fan-in
These can peg CPU, RAM, I/O. Peg any one of these and you're in trouble. Even run close to capacity for any one of these and you're liable to experience heisenbugs. Troublesome fan-out and fan-in can sometimes be a result of...
* Unintended consequences
The engineering decision made months or years ago may have been perfectly legitimate for the knowledge available at the time. However, we live in a volatile, uncertain, complex, and ambiguous (VUCA) world; conditions change. If your inputs deviate significantly, whether qualitatively or quantitatively, you risk resource utilization issues or, possibly, good ol' fashioned application errors.
"No battle plan survives contact with the enemy." -- Stormin' Norman
Same with software systems. They're living entities that can only maintain homeostasis so long as their environment remains predictable within norms. Deviate far enough from that and boom.
I worked as an engineer for a very large non tech company (but used a lot of tech, both bought and in-house). We had 100s of teams supporting services, internal apps (web and mobile), external apps (web and mobile), and connections to vendors plus a huge infrastructure in the real world that interconnected to all of this. One time someone changed something in a single data center (I vaguely remember some kind of DNS or routing update) and every single system worldwide failed in a short time. Even after the issue was resolved, it took most of a day and hundreds of people to successfully restart everything, all while our actual business had to continue without pissing off all of our customers. The triage was brutal as to what mattered most.
You can't do this without a lot of people. Sure you could pare it down, maybe improve some architecture, but without a ton of people involved who understand the systems and how they connect, when things might go south they may never return.
I have an old project I gave up on - haven't touched it, done any code changes or maintenance in... almost a decade? At least one stubborn client is still using it, successfully. And it's not an old guy in a living room, but an honest small-sized company that has this software as the core of its operations.
So yeah, I totally agree with you. No code changes = long life.
You didn't mention data scale. Just because the disks have room, doesn't mean the data access patterns in perfectly stable code will perform well at continual multiples if old data isn't somehow moved to colder storage.
> There are other things that can bring a site down, like [...] too much traffic[.] But generally speaking those things are rare and don't bring down an entire site.
I agree with your assessment, but I do want to highlight that this condition is not rare for Twitter. Load is very spiky, sometimes during predictable periods (e.g., the World Cup, New Year's Eve) and sometimes during unpredictable periods (e.g., Queen Elizabeth II's death, the January 6th US Capitol attack). It isn't going to cause a total site failure (anymore), but it can degrade user experience in subtle or not-so-subtle ways.
An aside on the "anymore", there was a time when the entire site did go down due to high-traffic events. A lot of the complication in the infrastructure was built to add resiliency and scalability to the backend services to allow Twitter to handle these events more gracefully. That resiliency is going to help keep the services up even if maintenance is understaffed and behind a learning curve.
Sorry for hijacking your expertise, but why no mention of memory leaks? In my experience they can cause really weird bugs not obvious at first, and are difficult to reproduce, i.e. triggered by edge cases that happen infrequently. Or are you assuming services automatically restart when memory is depleted?
It depends how well the service was "operationalized":
1) Best case: Monitoring of the service checks for service degradation outside of a sliding window. In this case, more than X percent of responses are not 2xx or 3xx. After a given time period (say, 30 minutes of this) the service can be restarted automatically. This allows you to auto-heal the service for any given "degradation" coming from that service itself. (This does not detect upstream degradation, of course, so everything upstream needs its own monitoring and autohealing, which is difficult to figure out, because it might be specific to this one service. The development/product team needs to put more thought into this in order to properly detect it, or use something like chaos engineering to see the problem and design a solution)
2) If you have a health check on the service (that actually queries the service, not just hits a static /healthcheck endpoint that always returns 200 OK), and a memory leak has caused the service to stop responding (but not die), the failed health check can trigger an automatic service restart.
3) The memory leak makes the process run out of memory and die, and the service is automatically restarted.
4) Ghetto engineering: Restart the service every few days or N requests. This extremely dumb method works very well, until you get so much traffic that it starts dying well before the restart, and you notice that your service just happens to go down on regular intervals for no reason.
5) The failed health check (if it exists) is not set up to trigger a restart, so when the service stops responding due to memory leak (but doesn't exit) the service just sits there broken.
6) Worst case: Nothing is configured to restart the service at all, so it just sits there broken.
If you do the best practice and put dynamic monitoring, a health check, and automatic restart in place, the service will self-heal in the face of memory leaks.
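As a concrete illustration of cases (1) and (2) above, here's a minimal sliding-window health-check loop. The endpoint, thresholds, and restart command are placeholders; a real setup would use whatever the monitoring and init systems already provide.

```python
# Sketch: probe a health endpoint, keep a sliding window of results, and
# restart the service when too many probes fail. Names/values are placeholders.
import collections
import subprocess
import time
import urllib.request

URL = "http://localhost:8080/healthcheck"  # should exercise the service, not just return 200
WINDOW = 30        # number of recent probes to keep
FAIL_RATIO = 0.5   # restart when more than half the window failed
INTERVAL_S = 60    # seconds between probes

results = collections.deque(maxlen=WINDOW)
while True:
    try:
        with urllib.request.urlopen(URL, timeout=5) as resp:
            results.append(200 <= resp.status < 400)
    except Exception:
        results.append(False)

    if len(results) == WINDOW and results.count(False) / WINDOW > FAIL_RATIO:
        subprocess.run(["systemctl", "restart", "my-service"])  # hypothetical unit name
        results.clear()

    time.sleep(INTERVAL_S)
```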
> If, for any reason at all, a cert fails to be regenerated (say, your etcd certs, or some weird one-off tool underpinning everything that somebody has to remember to regen every 360 days), they will expire, and it will be a very fun day at the office. Over a long enough period of time, your web server's TLS version will be obsoleted in new browser versions, and nobody will be able to load it.
At least for expired certs, most people have learned the hard way just how bad that is, and either implemented automated renewal (thank heavens for cert-manager, LetsEncrypt, AWS ACM and friends) or, where that doesn't work (MS AD...), monitoring.
I'll add one: when usage scales beyond anticipated levels. Then the code that was "good enough" will no longer be, and serious intervention may be required - by senior engineers with history.
Takes me back to the first broken mess of an environment I worked in. Change freezes were a way of life, and while they lasted, magically, nothing would break.
Now, those change freezes even extended to preventative maintenance, one of the dual PSUs in a core switch went bad and we couldn't get an exception to replace it... for 6 months. We got an exception when the second one went down and we had to move a few connections to its still alive mate.
I think that without code push they won’t be able to maintain compatibility. With updated APIs from third parties, new hardware, new encryption requirements from clients or browsers etc. It’s a slow descent into chaos indeed.
> This left a lot wondering what exactly was going on with all those engineers and made it seem like it was all just bloat.
I was partly expecting the rest of the article to explain to me why exactly it wasn't just bloat. But it goes on talking about this 1~3-person cache SRE team that built solid infra automation that's really resilient to both hardware and software failures. If anything, the article might actually persuade me that it was all bloat.
> the article might actually persuade me that it was all bloat
First of all, how does it persuade you of that? The article touches a really small (though incredibly important for up-time) subject.
Secondly, in any large company, the majority is 'bloat'. It's security engineers, code reviews, data architecture, HR, internal audit teams, content moderators, scrum masters, and I can keep going. In a start-up many of these roles can be ignored, because growth > stability. In a large organization, part of the bloat helps ensure a certain amount of stability that's necessary to keep an organization alive.
If a product is mature enough, like Twitter seems to be, removing engineers won't instantly crash the product. It'll happen slowly. Bugs will creep in, because less time is spent on review and overall architecture. Security issues will creep in because of roughly the same issues and less oversight. Then, once this causes enough issues for the product to actually crash, the right people to fix it quickly might not be there anymore. That's when fixing the issues suddenly takes a lot more time.
If the current state of affairs at Twitter keeps up, it'll probably be a slow descent into chaos. Especially with Elon pushing for new features to be implemented quickly, inevitably by people who cannot fully understand the implications of said features, because 80% of knowledge is missing.
By flowing from "many people think it's bloat; I'll tell you what's really going on" to "a tiny team of 1~3 built the whole infra for a critical component."
I'm not really trying to make commentary on whether or not Twitter engineering was bloat, or whether or not I think it'll hit problems in the future. Just commenting on the fact that the article broke my expectations a little bit as a reader.
> In a start-up many of these roles can be ignored, because growth > stability. In a large organization, part of the bloat helps ensure a certain amount of stability that's necessary to keep an organization alive.
It also (a) increases the bus factor, [1] and (b) allows people to take vacations and time off without having to watch their phones like a hawk.
>removing engineers won't instantly crash the product. It'll happen slowly
It's amazing to me how many people following the Twitter saga, some familiar with or actually working in technology, thought that Twitter would crash within days of the engineers being fired. And because it didn't, the job cuts are justified.
> If the current state of affairs at Twitter keeps up, it'll probably be a slow descent into chaos.
It's not like Twitter was bug free before. How many times did it annoyingly refresh the timeline while I was reading something, or show a notification that it failed to send the DM, and when you retry it says "you've already wrote this", or you open the reply dialog but it freezes and has no send button at all, so you have to re-open it. All of this was happening to me pretty regularly long before Elon came along.
As we all know, just hiring more people is not necessarily the solution to every problem, and to me it seems that was exactly what Twitter tried to do in the past. Now they've deconstructed it to the bare bones, which will clearly show what the core problems and requirements are. They basically turned Twitter back into a startup. And from that new starting point they can hire again to cover needs as they arise. If they succeed it will be a huge success, as they'll end up with a far more optimal team (and huge savings); and of course, if they fail to catch up with problems it will be a huge failure. We'll see how well Musk can manage it...
Twitter had 7,500 employees. Most of the roles you mention (security engineers, code reviews, data architecture, HR, internal audit teams, content moderators, scrum masters) are not bloat. So the question is: what are the other 7,000 people doing?
I think Twitter was stuck in a not-especially profitable niche. They shift into fast-mode to get out of it and find a better spot, then they can shift back into stable mode once they occupy a better equilibrium.
That said, there are lots of bugs in Twitter now, today, when they presumably had the benefit of being in stable mode for a long time. For example, Twitter regularly refreshes and loads new tweets while I'm reading them, pushing the tweet I was in the middle of reading out of view. That seems like a pretty silly bug to exist in a mature product. I regularly reach a state where I have to kill the app and relaunch it because all of the "back" commands just minimize the app instead of taking me back to the timeline. I could go on.
Have you implemented a system which stores hundreds of billions of pieces of media content and makes different slices of them immediately available to hundreds of millions of users?
You focused mostly on additive bloat, there's also multiplicative bloat in the form of multiple teams focused on building separate versions of the same product to increase likelihood of success and empire building where leaders don't actually have a remit large enough to support the team size they have, but they have woven a narrative that defends the necessity nonetheless. Put everything together and teams are very easily 6x+ larger than they absolutely need to be to get a product into market.
The whole point of the article is that Twitter was designed to be resilient (and it shows, Twitter has great uptime). And the whole point of resiliency, beyond not negatively impacting customer experience, is to buy engineers time to fix things when stuff breaks.
What we are watching is a massive failure event right now and the question really is if there's enough time for twitter management to fill in the gaps before there's an outage.
If 80% of knowledge is missing due to 80% of people being gone, then the team who built it failed to document or automate themselves out of a job, meaning even other people on the team in the good ol' days would still have faced the same issues and merely had to chase down the original authors. That doesn't sound like it would be completely true. Maybe 5-20% of un-documented knowledge walked out the door? Completely rough guess. But even 0.5% of knowledge can sometimes be critical.
I think you're misinterpreting the comment you're replying to. They would agree with you that the tiny SRE team described in the article sounds very effective, and likely have a lot to do with why the site is still up and running currently. Work like that should continue. But if 1-3 people can have that degree of impact, what are the other 8000 doing? (Again, this is just me attempting to interpret the point made by the parent, not trying to make one myself.)
No, I think the article makes it very clear what the value and function of SRE is. The point of the comment you're responding to is that the author was the only one doing this—not a team of ten, not even a team of two. This is Twitter's whole cache system! Probably the most important part of their hardware stack, in terms of "is the site performing well for users". There are other SRE needs at Twitter, but not that many. What were the other 9k people at the company doing? It raises the question.
To be honest, I was very surprised to hear what a cache SRE was working on. It sounded like he had to build all of the handling of hardware issues, rack awareness, and other basic datacenter stuff himself. Does it mean that every specialized team also had to do it? Why would a cache engineer need to know about hardware failures at all? It's the datacenter team's responsibility to detect and predict issues and shut down servers gracefully if possible. It should be completely abstracted from the cache SRE, like the cloud abstracts you from it. Yet he and his team spent years on automation around this stuff using the Mesos stack that they probably regret adopting by now.
I feel like in this zoomed-in case of Twitter caches, what they were working on is questionable, but the team size seems to be adequate to the task, so my takeaway is that like any older, larger company, Twitter accumulated a fair amount of tech debt and there is no one to take a large-scale initiative to eliminate it.
>As someone paraphrased, a car without brakes and a steering wheel works just fine until you hit the first bend.
On the other hand, a car without a second and third steering wheel, 20 windscreen wipers, and an oven in the back, keeps running just fine, even after the first bend...
And it's almost inevitable. That's why the overwhelming "Elon is so dumb" narrative seems so odd to me.
"Major harm," to me, would be either bankruptcy or a competitor overtaking significant chunks of Twitter's users. Even if Elon literally has to re-hire half the roles he fired for, or Twitter is down for a few days or a week, or they struggle to get advertisers for a little while, that's nothing for the long-term. In six months, the chances that it'll look like these firings were a bad idea are minimal.
And likewise with the "the only realistic way to moderate an online platform is the way Twitter was doing it" narrative. In six months, all it takes is the ship to still be floating without the old moderation to prove it out, and that's by far the most likely outcome.
This is exactly why this thread is full of slightly insecure comments making vague predictions. I'd suggest to most of them to get off HN and back to work, now is the time to make yourself useful!
That small team seems to have been running the caches for other teams, by using infrastructure provided by another team, in two massive datacenters operated by other teams, using monitoring tools managed by another team, and a ticketing system run by another, on hardware purchased by another team…
All just to put caching in front of services that actually do anything.
That being said, it's not like Twitter is a massively complex product with lots of different features. I can imagine you could keep it running with a skeleton team. Liaising with ad buyers excepted.
> I was partly expecting the rest of the article to explain to me why exactly it wasn't just bloat
Same here. I guess his headline was on point about why Twitter is still up; but I was also interested in hearing about why Twitter actually needs all those people. If it can be run with 50-80% of the staff gone, that does sound like some bloat at least.
Slack space leads to innovations, like developing infrastructure automation and improving capacity planning. SRE as a practice needs slack space for operations teams to work on improvements and fixes in addition to BAU fault fixing, deployments and patching.
I'm puzzled by this statement. Do you think of resiliency as waste? That twitter would have been fine without it?
The article makes a point that the reason Twitter is running ok on 20% of personnel at this moment is exactly because it was built to be resilient, not because the personnel was bloated. A large part of this so-called bloat, the 80%, is responsible for Twitter running right now. Calling this bloat implies it is actually not important for Twitter to be available all the time (or at all).
>>If anything, the article might actually persuade me that it was all bloat.
Not for me
This is almost exactly like the new manager coming in, noticing that the floors and surfaces are all clean, all the systems work, the trash is emptied, etc., and so deciding that the entire maintenance staff is unnecessary and firing them.
The place doesn't become a decrepit pigsty the next morning; it slowly degrades.
Same for these systems. They were designed, built, tuned, and maintained over the course of years to go from requiring constant manual intervention to running largely unattended and with a good buffer of ready hardware and automatic failover for failures. That "largely" in "largely unattended" is doing some very heavy lifting.
The system WILL require human intervention to keep running, and more than just a skeleton crew. The only question is whether it will happen before the new crew gets up to speed to handle the inevitable degradation.
This does NOT mean that the SREs were bloat - it means that they were doing an excellent job and could safely take a break. We're now in just the two-week vacation zone - same as if the entire SRE team went on a holiday. We'd expect it to work. Now let's see what happens in two months.
In addition to the many great comments here, remember that superstar engineers don't exactly fix problems from day to day. They fix the problems before they become problems.
The engineer was doing stability planning for 6 months out for the purpose of cost optimization. I guess we can assume that the cost of infrastructure is about to go up and reliability is about to go down in the coming months.
I have been under the good faith assumption that most (though definitely not all) of the employees that have departed Twitter were probably necessary and valuable to the company. I left the article with the same impression as you. This single person did this very important job, seemingly well, and didn't appear to be drowning in the work. What were the other 8-9k doing?
> But it goes on talking about this 1~3-person cache SRE team that built solid infra automation that's really resilient to both hardware and software failures.
... for the Cache component. There are many others.
Musk might have been right about some things. There probably was some degree of bloat. But to say he's badly mishandled this whole saga is a gross understatement. It is very difficult to utterly kill a site like Twitter; the fact that we're even considering that as a realistic possibility shows just how badly.
I think Musk is used to Tesla and SpaceX, which are both companies that a lot of people are (or at least were) excited to work for because they believe in the mission and what's being created. Plus there aren't many alternatives if you want to do that work. Twitter really isn't like that for most people; a Twitter developer has many other options to do similar work. Add to that the fact that he's both cranked up the intensity of the abuse and that it's more visible to everyone, and you can't expect a lot of good people to stick around. And despite the fact that it might coast for quite a while on the back of excellent work in the past, eventually you do need good people to keep a business going. (This is leaving aside the direct impacts of his actions on users and advertisers!)
Which signs? They launched a new feature (blue checks for $8) and had to turn it off immediately because it was bleeding money and ruining the platform, and they have less ad revenue booked for next year than they had at the same time last year.
I don't think one should judge the new twitter course yet, but "well, the site is still up" is a very bad measure of success.
It's way too soon to tell. On a dramatically smaller scale my team went through a big drop in headcount. Day 1 the impact was nil. Day 10 the impact was negligible. Day 30 some minor problems were identified. But it wasn't until about Day 90 that we had our first outage and Day 270 that we had our first lengthy outage.
>Was Musk right? All signs so far are pointing to, yes.
Huh? He's been in charge for, like, two weeks. Did you think it could implode the instant the engineers received pink slips? Let's give it a year before we say he was right.
Way too soon to declare that Musk was right. I don't even think signs are pointing there. Twitter is bleeding some of its most valuable users, the content creators, to things like Mastodon. There do appear to be cracks happening at the edges. Bots and hate speech do appear to have increased.
Thing is, I think Twitter was bloated and it needed a kick in the rear. Pre-acquisition I heard the same from many I follow. How Musk has gone about it has been the problem. Ignoring his perpetual hates, he had a decent amount of goodwill the day the deal closed. Then, he squandered it with all his antics. A transparent content moderation board turns out to be a game-able Twitter poll. Blue check for all was completely missing any point. No one wants a blue check for money w/o the associated verification. Verification for all would have been awesome.
Ads quality has dropped from what I've seen. It looks like people are pulling out, albeit slowly. MBAs will be studying this, but how things are going means we may look back and see this as Twitter's Yahoo/AOL moment when it sells for a few billion in a couple years.
Did Musk fire the right people? Or did he slash based on obedience, mood, whim? In what situation would one want THE senior SRE for the cache systems (which are unique and affect twitter across the board) GONE? There's a reason why managers were trying to claw back some engineers and employees after the big layoff. Sure you chop off a few fingers and keep working.... but lose a thumb?
What do you mean by "right"? He grossly overpaid for a business that isn't profitable and likely made it even less profitable. Even if it limps along and some husk of Twitter survives, it's hard to see Musk as making the right move here.
I guess you were also declaring Putin a tactical genius for reaching Kyiv so quickly after invading Ukraine? Military historians will be studying that advance for years.
And so is the cache setup. It’s permanently (and deliberately) running at less than 50% utilization to prevent an issue that comes up only once every 5 years (according to the author).
Of course it's all bloat. Software runs on computers, not engineers. The default assumption for software is that it will go on working. The state might devolve, but the software is exactly as reliable right now as it was before. I'm no friend of Elon, and I think it's hilarious to think that he can be king of Twitter, but all these people talking about "code entropy" are certifiably insane.
Big tech maintains talent so that they won't use their knowledge of the system to produce an identical competitor without the technical debt or investor liability of the original.
I read it as the exact opposite: the reason Twitter is still ok* is not because all these people were just browsing reddit at work. You can't just gut Twitter to run on only a couple of hundred people and still expect the same results in the long term.
Twitter was not a leftist commie welfare company as Musk and his fans want it to be. It was actually the fine work SRE (amongst others) put into it that makes it still tick along as it does... for now.
* actually some things are already breaking, but it will take some time for the real damage to surface on a technical level
I guess there are different definitions of "bloat", but how is it bloat to have a tiny team taking care of a fundamental piece of infrastructure? If the team is now gone, an issue there would mean hours of downtime. If that's acceptable then yes, it was "bloat".
I’m suspicious that most of the value in these systems comes from a small fraction of the effort and many technology jobs boil down to knowing you’re a huge cost center and putting on a performance to hide that.
This, on reflection, is the one insightful comment that should be higher up, even if it came too late. Twitter's userbase did not expand significantly in the last couple of years. Revenue increased. Why did costs increase so much?
Of course it was bloat. This whole “twitter is going to crash and burn” thing is a weird fantasy. Most likely it will just be run more efficiently by far less people.
Well, WhatsApp had ~50 employees and Instagram around ~15 when FB acquired them, and they were around the same order of magnitude of complexity as Twitter.
The only concern Id have is that by having so many people, your design probably comes to rely on them whereas a smaller team would be forced to make the system easier to maintain.
Personally, if I were Elon, I’d build an entirely new backend and point the clients to that rather than trying to incrementally improve what they have.
Get 50-100 10x engineers that are loyal to Elon, with big equity stakes, and crush it
A lot of people who don't understand why Twitter owns two datacenters point to "complexity" as an argument and completely disregard scale. It turns out that massive scale adds a lot of complexity to a system, particularly around many-to-many pubsub systems (like social media). It also means a lot of features, like compliance with government regulations around the world.
Everything that is happening with Twitter proves Musk is yet another wealthy idiot who doesn't know shit about shit, except how to blow his own horn. Musk is simply lucky that Twitter used to have such excellent engineers as this SRE, so that the site isn't yet on fire.
I did SRE consulting work for a phase of my career... as the author points out, these systems are scaled out and resilient, but what happens next is entropy. Team sizes shrink, everything starts to be viewed through a cost cutting / savings lens, overtaxed staff start ignoring problems or the long-term view because they are in firefighting mode, it becomes hard to attract new talent because the perception is
"the good times are over." Things start to become brittle and/or not get the attention needed, contractors are brought in because they are cheaper and/or bits get outsourced to the cheapest bidder... the professional care and attention like the author clearly brought just starts to shift over time. Consultants like me are brought in to diagnose what's wrong - the good staff could write our briefs, they know what's going on - and generally we slap a band-aid on the problem because management just wants to squeeze whatever value they can out of the assets rather than actually improve anything.
The reality is most huge companies are majority bloat. The hiring numbers are also in part crap that goes into Series X raise pitch decks. Oftentimes a lot of the new bloat pisses off competent people, because their work doesn't actually decrease; it increases. Not only do they now have to nanny people who are often not actually competent at their job (they just happened to get through the coding interview with wholly unqualified interviewers), but they also have to handle nightmare features built by people completely disconnected from the other side.
I'm not a friend of Elon's, but outside of the flashiness of the whole thing, I don't think his firing spree was wholly unwarranted.
The other day I saw a video of a bunch of people leaving Twitter who had been there for a decade or so. I mean, holy crap, this reminds me of old German industry where people retire in the place they started.
Except in these firing sprees it's not competent employees that remain - those people are usually the first to jump ship because they have options and see where it's headed. It's not like they get raises for staying to get a part of the savings from the cuts - usually they get pay freezes.
Restructuring is done on whatever idea the new owner(s) have in plan - which could be equally disconnected from "real product".
I'm not that optimistic about Twitter long term - never was TBH but this Musk thing is turning to a shitshow and also exposes what it's like working for him in his other companies.
Yes and no. I think we'd all agree that large tech companies have tons of really obvious staffing inefficiencies. There are many teams working on what are essentially vanity projects and many other teams have more staffing than they realistically need.
On the other hand the slash and burn Elon approach seems objectively terrible. Indiscriminately firing most of the company kills morale and is likely to send the company into a hiring death spiral where your good employees leave and you can't attract good talent. This won't automatically kill the product or the company but it's not going to lend itself to big positive successes in the future.
> I mean, holy crap, this reminds me of old German industry where people retire in the place they started.
German here. I think this actually is a huge part of the success of the famous Mittelstand - all that institutional knowledge these people have is extremely valuable. It's not just basic stuff like "know time tracking, billing and other admin systems and internal processes", but also the stuff that really can speed up your work: whom to ask on the "kurzer Dienstweg" aka short-circuiting bureaucracy when needed, personal relationships with people in other departments on whose knowledge you rely (it's one thing if you get a random email asking for some shit from someone you don't know, but I'll always find some minutes to help out someone who has helped me out in the past), all the domain-specific knowledge about the precise needs and desires of your customers...
Attrition is bad for a company as a whole, the problem is US-centric capitalism cannot quantify that impact (and it doesn't want to, given that attrition-related problems are long-term issues with years of time to impact), and so there is no KPI for leadership other than attrition rate itself.
The only problem is that over the last years, employers' mindset has shifted from regular wage raises to paying the bare minimum which makes changing jobs every few years a virtual requirement for employees to get raises, and so we are already seeing the first glimpses of US employment culture and its issues cropping up.
> The reality is most huge companies are majority bloat.
This is true, and in my opinion, true for a reason. And that reason is not "most huge companies are dumb", as opposed to what Musk's cult seem to believe. The reality is, measuring what exactly is "bloat" and precisely cutting that bloat is extremely difficult and firing more than half of your workforce is probably like using a warhammer to do brain surgery.
Software engineers at Twitter got used to working at a money-losing company for 10 years and being told they're great at their job. Then they were fired en masse because someone called them out on it.
If your company is losing money all this time, you are likely to be fired eventually in the real world. Job security in the software world had become so high that no one seemed to expect it. Everyone assumed "sure, we're losing money and the company has no direction, but all is fine."
They all stayed there at their desks, working for 10 years in a rudderless company as if it were a government job.
>The reality is most huge companies are majority bloat.
Something I learned early in my career is that some companies consider this a feature not a bug. As in, they are hoarding talent so that they don't go to work for their competitor.
Yeah, but that's not what's happening here, is it? Musk is pretty obviously looking for a corporate culture shift, not turning the company into a cash cow - and for this to work, entropy needs only to work slower than half a year. Which this article argues pretty convincingly is going to happen.
I don't think it will fail on a technical level. As this article says lots of engineering has gone in to make the thing pretty resilient. I would also say that there are still enough engineers who work there who can figure out what's gone wrong and "turn it on and off again" or w/e makes it splutter back to life.
In terms of changes to the platform, ditto. The changes aren't so difficult that a team of hundreds of devs who aren't 100% aware of what's already there can't figure them out. I've taken over systems that I knew very little about and that were pretty big (not as big as Twitter), and I managed for years to make changes without drastically breaking stuff. In any event, if they do break stuff they will be able to fix what they have broken.
No, the real failure here is the massive debt burden and the fact that there is no way that Twitter can ever service that amount of debt. Note that before EM took over, it was ticking along with a relatively small loss. If they had cut headcount by maybe 10% they would have easily broken even. There is no way that's possible with $4 million of interest per day. They have to radically change the way they monetise the platform to get to that level. I don't think they will ever get there, and Musk will sell off at a bargain-basement price at some point in the future to pay back the debt.
The fact that he was able to "buy" Twitter and yet transfer a significant amount of debt to the company rather than being liable himself is just another sign of how different "rich people accounting" is. He will walk away from this with a bit less theoretical money but no material impact to his life, while thousands of people are having their lives up-ended. How long are we going to keep letting shit like this happen?
Yeah I love it. I would like to "buy" a multi-million dollar mansion in the Bahamas with big bank money, but have the bank only encumber the property itself, and I get to live in it and pay service staff for my luxury, maybe the odd AirBnB letting to maintain the occasional pretense of repayment, up until the time 'it' can't sustain 'itself' and I walk away debt-free.
But apparently I'm not a rich person so that kind of accounting doesn't apply to me.
Leveraged buyouts are certainly a thing that sounds like it shouldn't be possible the first time you hear of it. It seems extremely odd that you can buy a company with money you don't have and then have the company take on that debt instead of having to take it on yourself.
As I understand it, this works by rounding up potential loans, approaching the board of the company and getting them to sign over ownership of the company for a pittance in return for the shareholders being paid out by the company using the loans you brought in. This feels more like an emergent property of the system (specifically, contract law and how publicly traded companies operate) than how the system is intended to function.
Intuitively this shouldn't be possible as it's acting against the company's own self-interest despite being in the interest of the shareholders (and the buyer), but I think "the company's interest" in practice is defined by "the owners' interest" (and the owners in this case are the shareholders, who sell the company). I guess corporations aren't people after all.
You were not around during the period of highly leveraged takeovers starting in the 80s, were you? That is classical hedge fund behavior, if done right. The fact that the banks lending to Musk only get around 60 cents on the dollar on the market for the loans they gave him shows that it might not have been done "right" when it comes to Twitter.
Yes, that's the approximate amount they have to pay due to the way he's done the financing of the deal. Keep in mind that the current interest is at a similar level to their revenue from last year.
That's the reason to deeply cut expenses and to try and make more money. He could probably have serviced that debt if everything just kept going as normal (no big ad spend cut) and he had fired everyone but that's an impossible scenario.
It’s going to be something like that. He bought it for 44 billion. That was mostly loans. Those loans are transferred to the company and the company pays interest on them.
Yep, agree. Twitter's revenue was primarily from ads, and now I'd bet their ads revenue has dropped a huge amount. Given how Musk is behaving now (reinstating Trump etc.), and losing many pivotal sales people, and the firing of a ton of people in charge of dealing with hate speech and such, it seems unlikely advertisers will return now.
Advertisers won't care in a couple months. The situation has lots of public attention now, which is what the advertisers are actually afraid of. They don't want their brand associated with the craziness. They don't actually care one way or the other about any choice Musk makes. They're not going to just walk away from a 300+ million person audience permanently due to principles.
If you consider that engineers cost about $1000 a day and he got rid of 4,000 of them… oddly it works out to about $4 million per day.
I'm not an expert in math, but it seems pretty possible.
Plus they’ll have a lot more impressions to sell now that people are allowed to speak. Ad rates might drop but then someone will write an article about how they are getting great CPC on Twitter and everything will be back to normal after the blue checks have their sob fest.
Dead cat bounce. Engineers are paid $1000 a day because they create value to the tune of $10 000 a day. So maybe you saved $4 million a day in very short term, but now the momentum is gone, and soon you will not be earning $40 million a day from the value that now was not being created, just in a "few" short months.
I wish Elon the best, but he could have hired his own team of "hardcore" engineers and put them to work 80 hours a week, spending a fraction of the price that was paid.
I think the real question is: Twitter grew 3x on the headcount front with a flat stock price over the course of less than 5 years. What exactly were these thousands of employees actually doing, and why did the previous CEO think what they were doing was worth hiring them for? That's just basic accountability from a stockholder or employee perspective. That's apparently a ton of money being wasted on nothing at all.
Twitter used to experience significant downtime compared to all other major platforms, and one of the reasons was its lack of redundancy across everything. Headcount is one such thing, and it takes manpower to automate infrastructure as discussed in the post.
Sure, you can run the platform with 1/10 headcount with significantly degraded user experiences (say ~98%). This is not a problem for startups but people usually have higher expectations for established companies. As always, the last 2% is a hard problem, and businesses don't really want to deal with such an unreliable platform. You want to onboard big advertisers who potentially spend $100M ARR? Then you need to assign a dedicated account manager to handle all customer escalations. PMs then triage and plan their feature requests, and later engineers implement them. Which all adds up.
And they also use your competitors' products: Google, FB, TikTok, etc. Twitter is very much the underdog here, so you need to support at least a minimal, essential subset of the features in those products to convince them to spend their money on Twitter. That alone takes hundreds of engineers, data scientists, and PMs, thanks to modern ad-serving stacks with massive complexity.
Yeah, it ultimately boils down to the simple fact that it's really hard to take other folks' money. You need to earn their trust first. They want to see if your product is capable of following the modern standard of digital ad serving, now and for the foreseeable future. Twitter spent a lot of time earning that trust, and the original post is one piece of evidence of those efforts. And this usually needs more manpower. You might be able to do it more efficiently, but I don't think that's as simple as firing 75% of your entire headcount.
> Sure, you can run the platform with 1/10 the headcount with a significantly degraded user experience (say, ~98%). This is not a problem for startups, but people usually have higher expectations of established companies.
This exactly. During the recent WhatsApp outage, many threads popped up on HN about how big of an issue this is in Europe, since WhatsApp is the main messaging platform there. Thankfully, these outages are short and few and far between, so they never actually cause real issues. This is obviously costing Meta/Facebook a lot of money, but it allows them to be an essential service. So essential, in fact, that every major news outlet in my country sends a push message as soon as WhatsApp is down.
If Twitter wants to be a comparably important platform, they need that same stability. And Twitter, for me, is very much the best place to stay up-to-date on any current event (in near real-time). Reddit used to be pretty good with Live, but that's pretty much died (and was mostly a summary of tweets anyway). I really hope Twitter survives Elon, because I don't know of an alternative right now that has the same value in this use case.
I think the opposite. Much software was at its best when the team was small. Software companies have to hire many people because they need to report growth to investors, and headcount is one of the measurements of growth. It is not necessarily good for the product; actually, many times it hurts the product. But overall it is good for the company: the company will enter new areas and can explore new things.
What Twitter is doing is scaling down first and focusing on the product, and once it gains traction, it can definitely scale up again. I don't think it will hurt the product very much.
The headcount at WhatsApp in 2013 was somewhere between 50 and 100, at which time they were servicing approximately 400M MAU, which is more users than Twitter has been able to boast for most of its existence.
Coincidentally, in 2013 SpaceX was just starting to provide commercial launch capacity, at which point I think they too had < 100 software engineers. A few short years later they were re-using rockets, a feat many people had thought unlikely/impossible and one that requires some hardcore software engineering.
Not surprised Elon Musk thinks he can run twitter with a skeleton crew.
Now the systems are stable, but human workers eventually get sick, leave, or die.
Raising pay has diminishing returns. You can't prevent workers from losing interest, getting sick, or dying by throwing more money at them.
The article wrote about achieving stability through a distributed system, so the unexpected death of one rack doesn't affect service availability. The same can be done for human workers who unexpectedly stop working. Having multiple workers covering the same things improves stability.
Sure, it's inefficient in terms of money. But the alternative is that one important employee catches COVID-19 and dies, and the knowledge of the system is lost with them. Documentation doesn't solve it, because you want manual operation available right now, not a few months later once replacement workers have learned from the documents.
> Raising pay has diminishing returns. You can't prevent workers from losing interest, getting sick, or dying by throwing more money at them.
People would absolutely be more engaged and more excited about their work if they were paid more. The only reason people work is literally for money…
Yeah, and they were even profitable with ~3k employees. Then the hiring spree started and they went negative. Even without Musk they would have had to let go of at least 30% of the people.
It's largely focussed on the event stream behind the core service and data analytics. There's maybe one entry on the main data store and one on search over the last few years.
Not just that. From my personal experience, Twitter has to be one of the slowest websites I use. Even on my 2020 Mac, it often shows the memory warning in Safari. Things take a while to load. And the UX is terrible, with having to constantly click to read child comments, having to click on "show hidden replies", etc. I honestly have no idea how a company with thousands of employees and a billion in losses was able to operate such a terribly performing website.
> What exactly where these thousands of employees actually doing
Maybe trying to increase traffic, attract advertisers, add efficiencies. Not everyone is an SRE. As long as your efforts increase revenue by more than you cost as an employee, you are adding value.
If the 3x headcount increase really did add no value, the remaining third should still be profitable employees. In fact, giant layoffs tend to cut the best people first, because they are the ones who feel comfortable walking. The people who are the last to go are the ones who are very entrenched in the organization and who don't rate their chances outside of it highly, and that's the exact description of who Elon thinks he's laying off.
I've found the opposite. I almost never see low-performing employees fired outside of a mass layoff. In every layoff I've seen, 10x as many people were fired as quit. So you lose a bunch of low performers involuntarily, and a few top performers both voluntarily and involuntarily, and that leads to the average quality improving.
> In fact giant layoffs tend to cut the best people first because they are the ones who feel comfortable walking.
This is more true when the layoffs happen because the company’s situation deteriorates. If the company cuts jobs because revenues fall and products fail, better employees are indeed more likely to move to greener pastures before mediocre ones do. If, however, the company prospects improve, rather than worsen, this is no longer the case.
"This is the real question"? Your question has nothing to do with the blog post, and if you take a look around, what Twitter did was literally done across the entire industry, hence all the recent layoffs. There was a hiring glut to take advantage of cheap capital during the COVID recovery. The capital has dried up, the glut has ended, and a lot of people lost their jobs. Why is that so hard to see? None of this is unique in any way to Twitter.
A bunch of people just got axed from Twitter because covid cash dried up? Shit I thought it was because Elon took over and fired anyone who refused to work at the office instead of at home.
Why did none of them stop this headcount increase, if the problem is as easily reduced to "too much headcount bad, smaller headcount good"? These are paid professionals who are supposedly wealthy, good at their jobs, smart, informed, etc.
How can we commenters on Hacker News sit in our armchairs and say "ah, goofballs should've just not let headcount get so high!"?
These qualified people thought at the time it was a good idea to get up to 7.5k people. How were they all wrong?
It's not goofballs. It's generally misaligned incentives. Managing a 10,000-person org leads to better job prospects than a 1,000-person org, which beats a 100-person org, which beats a pizza-box team.
Organizations tend to bloat.
Random, rapid cuts might not be the fix here, but headcount was too high.
> How can we commenters on Hacker News sit in our armchairs and say "ah, goofballs should've just not let headcount get so high!"?
The cliche HN comment on sites like Twitter (and many, many others, any time headcount comes up) has always been "why do they need so many people?" I've mostly dismissed it the same way I dismiss "I could build Uber in a weekend," but with every other tech giant laying people off, maybe I shouldn't. Maybe the effect of all that extra money sloshing around in the system was to incentivize hiring everyone to make sure you didn't accidentally get a false negative, and not all of those hires were good ones.
> These qualified people thought at the time it was a good idea to get up to 7.5k people. How were they all wrong?
Come on, just look at the tech industry. When rates were low and stock prices kept going up, "headcount" was used as an indicator of future growth. Grow headcount, investors are happy. After all, the promise of tech stocks was "growth". Usually you're not looking to cut costs until you think growth is over. Of course, Twitter was a dog and did nothing useful for years, no innovation, no new products, nothing. But tech investors definitely saw rising headcount as a good thing...
>Why did none of them stop this headcount increase, if the problem is as easily reduced to "too much headcount bad, smaller headcount good"?
For the same reason that colleges and universities have seen their administrative bloat skyrocket at 10x the rate of student enrollment. Administrative bloat inevitably creeps into all large organizations. Many of the people in the trenches making hiring decisions weren't considering the overall financial performance of Twitter as a company. They were making hiring decisions based on what was happening in their own department, or how that decision would help advance their own agenda, increase their budget, or add manpower to a favored project. When you further consider that many at Twitter openly conceded (and in many cases bragged) that they viewed their role at Twitter as moral arbiters of society, crucial to policing public discourse, it is not hard to see how enlisting as many true believers as possible to the cause would be seen as desirable, regardless of the larger financial implications.
> These are paid professionals who are supposedly wealthy, good at their jobs, smart, informed, etc.
Wealth is not a valid indicator of ability.
I'm not judging the execs and board members individually, but rather questioning your assumption. I did see that you wrote "supposedly", but it reads as rhetorical.
I've worked in multiple financial services companies where management is incentivised to be as ruthless as possible, and they are always overstaffed in some areas and understaffed in others. I've been in teams of 10 people that could have been staffed by 2.
Hiring often isn't done because of current requirements. Senior execs come and go and with them so do strategic objectives. You accumulate people and they're often not laid off when the thing they work on becomes redundant. Large scale layoffs are awful for morale and usually only come after a 'crisis' occurs.
Has no company ever been mismanaged? Have they ever grossly misallocated funds? The answer is of course yes; that happens all the time. Corporate leadership is not infallible.
All this does is point out that smart people worked at Twitter who may now no longer work there, whether on their own accord, or due to Elon’s bulldogging tactics.
Elon thinks he knows what he’s doing, but what he is going to be left with are people who are willing to work hard by his standards, but not necessarily smart.
The simple truth is Elon knows nothing about the actual work involved in tech. He knows words or elicits help from others on what to say that sounds like tech speak (RPCs!), but when it comes to being truly knowledgeable in this space, he is losing his most valuable assets because of his amazingly poor managerial and ownership style.
I know there are a lot of Elon fans on this site who will disagree with all of this; but his abilities have not at all been proven. Yes, he knows how to spend money to claim credit for technical advances, but until he actually gets his hands dirty in the muck of the hard work of tech, he will always be a glorified self-promoter with no substance.
John Carmack, "Elon is definitely an engineer. He is deeply involved with technical decisions at spacex and Tesla. He doesn’t write code or do CAD today, but he is perfectly capable of doing so."
Kevin Watson, who developed the avionics for Falcon 9 and Dragon and previously managed the Advanced Computer Systems and Technologies Group within the Autonomous Systems Division at NASA's Jet Propulsion laboratory: "Elon is brilliant. He’s involved in just about everything. He understands everything. If he asks you a question, you learn very quickly not to go give him a gut reaction.
He wants answers that get down to the fundamental laws of physics. One thing he understands really well is the physics of the rockets. He understands that like nobody else. The stuff I have seen him do in his head is crazy.
He can get in discussions about flying a satellite and whether we can make the right orbit and deliver Dragon at the same time and solve all these equations in real time. It’s amazing to watch the amount of knowledge he has accumulated over the years."
Elon also understands deep neural nets a lot more than I think people imagine. He starts with good intuitions and mental models, but also actively asks for technical deep dives, and has very good retention. E.g. I recall teaching him about our use of focal loss in contrast to binary cross-entropy for the object detection neural net (I said it had given us a 5% bump and he asked to know more) and he understood how it works about as quickly as you'd expect a PhD student to. The fact that he can do this across many technical disciplines is impressive and borderline superhuman. I don't think people understand or would believe how low-level and technical typical meetings with him are. Just saying because I get triggered reading way-off, inaccurate takes on this topic (original comment).
> "Anyone who actually writes software, please report to the 10th floor at 2 pm today. Before doing so, please email a bullet point summary of what your code commands have achieved in the past ~6 months, along with up to 10 screenshots of the most salient lines of code"
Actual quote.
Anyone using the term "code commands" comes off a little detached from programming reality, and the rest of this request reads like it's straight out of a Dilbert strip.
Many people who have worked with Musk have shared similar sentiments in interviews. But it seems that people just refuse to believe any of it. People think that there's no way it's possible for someone to be that deeply technical and be a CEO of multiple companies at the same time. I've talked to people about it and they straight up refuse to believe it saying that it's impossible and that any evidence of him being technical in interviews is all set up and that he was trained on the materials and questions ahead of time.
Channing Robertson, the face of Stanford's chemical engineering department and the associate dean of Stanford's School of Engineering, who taught and mentored Elizabeth Holmes, had the following to say about her:
“She had somehow been able to take and synthesize these pieces of science and engineering and technology in ways that I had never thought of.”
“I never encountered a student like this before of the then thousands of students that I had talked”
“You start to realize you are looking in the eyes of another Bill Gates, or Steve Jobs.”
He also maintained that Holmes was a once-in-a-generation genius, comparing her to Newton, Einstein, Mozart, and Leonardo da Vinci.
Excerpt from: "Bad Blood: Secrets and Lies in a Silicon Valley Startup" by John Carreyrou.
Adding another datapoint from one of my previous comments:
In response to someone saying on Twitter that Elon doesn't understand the technical stuff of rocketry, Tom Mueller, former CTO of Propulsion at SpaceX and the designer of many of their engines, responded:
"I worked for Elon directly for 18 1/2 years, and I can assure you, you are wrong"
You know, I think Musk is an ass, and would never work for him, but don't you think that someone who has managed to launch and then run many successful and complex technology projects might actually know a thing or two about launching and running simpler technology projects?
And if you're going to claim that his successes have been due to the people surrounding him who actually know what they are doing, then all that tells me is that you are acknowledging that he knows how to surround himself with people who know what they are doing.
We're not fans (I'm certainly not), but it takes a special kind of mind to look at Musk's track record of successes and conclude that his latest project is doomed.
Well, I think the issue is precisely considering Twitter a "simple technology project", and it's the same mistake Musk makes. Twitter isn't a "software and servers business", as he said. Twitter is a social community, and while in some regards it might be easier, it's also far more difficult in others. Just compare how many businesses and institutions can reliably launch rockets or build cars with how many can reliably create social networks.
America is such a great country that a random person can just fecklessly blunder into creating a revolutionary electric car company and cluelessly blunder into creating a rocket company that is the envy of the world.
Automotive and aerospace are not that similar to social media. People buying into the vision of "get the planet off fossil fuels for transport" and "get this species to Mars" are probably willing to make sacrifices that people working on social media are not.
It's the Halo Effect fallacy to think competence in one field automatically translates to another. Especially when the founder in question has displayed increasingly erratic behavior in the meantime.
Is today's Elon capable of doing what Elon from 15 years ago did at Tesla? I don't think that is necessarily in evidence, much less in a very different industry.
Judging by some of the old patents he's filed [1], I'd guess he has at least a decent understanding of the tech involved. Probably less so, when it comes to the details of more modern distributed systems, but I also wouldn't be surprised if he's spent some effort towards all that as well - he's been working in/around pretty cutting edge tech for quite a while. Could he sit down and code it himself? probably not, but that's hardly required in his situation.
You aren't wrong, but his playbook is familiar to anyone who's gone through acquisitions (especially leveraged ones) and many companies were in a strong enough position to start with that they do manage to limp through and get sold off despite all the abuse.
You underestimate Elon Musk. Many people have done that before and lost that bet. If anything, he repeatedly succeeded in building world class software and hardware teams for Tesla, SpaceX, and a few other companies. The notion that he won't be able to attract world class talent is ludicrous. Yes, he is a bit of a liability and his management style is obnoxious and unconventional. But he does get things right once in a while.
And he hates bloated inefficient teams. His decrees on meetings are infamous. Tripling the team at Twitter implies a lot of internal politics, fiefdoms, communication overhead, and generally a lot of headless chickens running around. There's no nice way to fix such a team. A sledge hammer is one way to fix it and obviously he likes getting results quickly.
So, the notion of laying off most of that team was a foregone conclusion. The notion that a lot of the better people would get upset about that and leave as well is also highly predictable. What's left is a team with some gaps but also a lot of breathing room. And he can always lure key people back in by throwing money at them.
Simple plan. It might actually work. At the cost of a bit of drama, temporary instability, and lots of free publicity. Exactly his style. Cringe worthy and effective. I can see the logic here.
Agree, I wonder how much it is thought through strategy and how much is just "natural" style applied indiscriminately. I think one more important part is that he has money/resources to be able to make mistakes without bankrupting and stubbornness to plough thru even when things go wrong.
> If anything, he repeatedly succeeded in building world class software and hardware teams for Tesla, SpaceX, and a few other companies
The point is that Twitter doesn't really need someone to build a world-class software and hardware team. The technical challenges in reliability and speed seemed pretty much solved, or on track to be solved, already. Twitter's problem was that they never knew how to properly manage the community and make the company profitable.
Twitter doesn't have a tech problem, it has a community problem.
These people behave just like the irrational fanboys, except they just do the exact opposite. Being a sheep and being a contrarian sheep are the same thing.
Years ago I had a massively downvoted comment when I criticised his AR-for-CAD vapourware. As someone who was fully in that area at the time, what he was showing, while it looked fancy, had no practical application in the area of design he was talking about.
Ever watch someone do CAD/CAM modelling? They need extreme precision of input that AR sausage fingers just aren't going to help with. You need a num-pad and a good mouse with a stepped click wheel.
I get it. We share some similar neurodivergent traits. He wants to be right in the detail. Constantly jumping from interest to interest, seeing the hidden patterns and connections that aren't apparent to others. But there are times when I know I just need to shut up and let someone more experienced talk, despite my brain wanting to lead every discussion right into solution mode or additional-context mode.
I've spent the last 6 years in management consulting (without formal business education), I agree with him when he says MBAs are useless. We know that the best solutions come from diverse teams with diverse backgrounds, skills and knowledge. Not 5 clones who know how to build value driver trees, not to say the tools they bring aren't useful, but they can be incredibly limiting.
For someone who hates MBAs he's sure going about this take-over like someone who barely passed one (i.e. knows more than enough to be dangerous). Sure, you're hemorrhaging money in operations. You need to cut costs and find new revenue streams.
What are your biggest costs?
Labor. Slash / burn. The old McKinsey 7% FTE reduction will give you some extra operating cash from the year's remaining budget, and you know the cut isn't so deep that people (in fear of their jobs) won't just pick up the slack to keep everything moving. Do it quickly, because you need to rip the band-aid off and get all that accrued leave, restricted cash, etc. off your books too.
Equipment. Redundancy? Sounds like unused resources we can fire sale.
Contracts. Renegotiate? The only two meaningful levers are price and quantity. Start cutting quantity now, renegotiate price later.
This is all dummies guide stuff and tends to go terribly in reality when implemented all at once all together.
For instance, research has shown that companies that lay off staff when under pressure end up underperforming the ones that chose not to.
Now who's going to help build and operate those new revenue streams?
Quick fixes for a quick buck and a whole lot of extra risk.
And Twitter's problems are nowhere near technological. The site needed to make more money, not reengineer the whole thing while advertisers are fleeing because Trump is back on, on a whim!
>while advertisers are fleeing because Trump is back on
[citation needed]
CNN's ratings were never better than under Trump. He's fantastic for advertising. So is Musk. All controversial figures are. That, oddly, isn't controversial in advertising.
>on a whim!
He created a public poll, and when the vote to allow Trump back won, he unbanned him, tweeting "vox populi, vox dei" ("the voice of the people is the voice of god"). Had he unbanned him despite the poll saying "no", you could argue it was a whim, but that isn't the reality we're in. He also refused to unban Alex Jones, citing the exploitation of child deaths and a personal story. Not unbanning Alex Jones was more whimsical than unbanning Trump was, factually speaking. Why do people always misrepresent his actions? And why is it always upvoted and not flagged here?
I think this is kind of baked in though. Part of the thought process seems to be, at least for non-paying customers, it's not actually necessary to have five nines for Twitter, because people will just put up with it if it's less reliable.
I don’t have personal experience in this, so obviously I can’t speak with any authority. But I have heard from colleagues that tons of little factors can dramatically affect user engagement. For example, even a couple dozen milliseconds of longer load times can push a noticeable number of users away from your app.
True. If Twitter, Facebook, Reddit, and Hacker News went down for a couple of days it wouldn't affect me at all. If GitHub and npm went down I'd be mildly annoyed but could still work.
We had an obituary for Fred Brooks on here just the other day. I'd suggest that his thesis in The Mythical Man-Month conflicts with your comment above (that reduction in staff count for a software project has a good correlation with the ability to maintain it / evolve it / innovate on top of it).
I've never heard of (or thought of) your interpretation of the corollary to Brooks' Law, but removing people from projects until they succeed and are on time seems like a bold strategy.
IBM is a "tech" company that employs 282,000 employees, and when was the last time they invented something? I don't remember the last time I heard IBM in the news about something they made.
The bigger the company, the less innovation and the more administration and bureaucracy you tend to find.
The reason startups can survive is their small size, which makes them very flexible and adaptable to chaos and change; that gives them the edge over bigger companies.
Homeostasis is a good metaphor, but it implies a living, dynamic system. Something that resists entropy by itself being in a state of flow -- the matter constantly changing, while maintaining the form.
In modern software environments, the entropy is almost violent -- the changes in all the constituent dependencies are constant and relentless. Something frozen in time does not stand a chance, unless it's entirely stand-alone and dependency-free -- an unlikely scenario with a service of Twitter's size.
Hey, sorry for the new account, I just like to try my best to keep my online identity separate; this one, for better or worse, has my real name on it. Hope this is interesting!
What did you make of Mudge's report regarding resiliency of data-centers?
> Insufficient data center redundancy, without a plan to cold-boot or recover from even minor overlapping data center failure, raising the risk of a brief outage to that of a catastrophic and existential risk for Twitter's survival.
Thanks! There was a lot in that and I didn't follow it closely. Honestly, that always confused me about the DC-failure prep. I get that it would be a good idea to have a documented plan, but at the same time, we did lose a whole DC and figured out how to recover (not that it was easy). I'm not sure having a cold-boot prep plan would have helped that much. The things that failed that I dealt with personally during that event, I don't know if I would have foreseen. The site went on afterwards, so it didn't seem to be existential.
My friend in college would just go into Wordpress admin panels and the like by using common exploits because nobody updated PHP on their VPSes back then.
As someone who has spent most of their career to date as a front-end developer, I've learned that as long as they have the budget, stakeholders are insatiable. It's just that ten years ago most of their ideas were either technically infeasible or very expensive.
Nowadays browsers are much more capable, so the pressure to produce more features is much greater.
To our own peril, we can do much more now.
Today, it's false to assume that fire and forget releasing will work even for standalone Windows binaries.
You don't have to push constant updates for a cloud / SaaS product - many choose to - but ultimately you don't have to.
A year of 'no new features' should be something customers and vendor alike benefit from.
The solution, rather than investing time in fixing the memory leak, was to add a cron job that would kill/reset the process every three days. This was easier and more foolproof than adding any sort of intelligent monitoring around it. I think an engineer added the cron job in the middle of the night after getting paged, and it stuck around forever... at least for the 6 years I was there, and it was still running when I left.
We couldn't fix the leak because the team that made it had been let go and we were understaffed, so nobody had the time to go and learn how it worked to fix it. It wasn't a critical enough piece of infrastructure to rewrite, but it was needed for a few features that we had.
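For anyone who hasn't lived this: a minimal sketch of what that cron-driven band-aid tends to look like. This is a reconstruction, not the actual script; the unit name and the systemd assumption are mine.

    #!/usr/bin/env python3
    # Hypothetical stand-in for the "just restart it every few days" cron job.
    # Assumes the leaky process runs as a systemd unit; adapt to your init system.
    import subprocess
    import sys

    SERVICE = "leaky-backend.service"  # made-up unit name

    def restart(unit: str) -> int:
        """Restart the unit and return systemctl's exit code."""
        return subprocess.run(["systemctl", "restart", unit]).returncode

    if __name__ == "__main__":
        sys.exit(restart(SERVICE))

Wired into root's crontab on a three-day schedule, it treats the symptom without anyone ever having to understand the leak.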
Deleted Comment
For a social media / user-generated content application, the macro storage concerns are a lot more important than the micro ones. By this I mean, care more about overall fleet-wide capacity for product DBs and media storage, instead of caring about a single server filling up its disk with logs.
With UGC applications, product data just grows and grows, forever, never shrinking. Even if the app becomes less popular over time, the data set will still keep growing -- just more slowly than before.
Even if your database infrastructure has fully automated sharding, with bare metal hosting you still need to keep doing capacity planning and acquiring new database hardware. If no one is doing this, it's game over, there's simply nowhere to store new tweets (or new photos, or whichever infra tier runs out of hardware first...)
Staffing problems in other eng areas can exacerbate this. For example, if automated bot detection becomes inadequate, bot posting volume goes way up and takes up an increasing amount of storage space.
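To make the macro point concrete, here's a toy version of the forecast someone has to keep re-running; every number below is invented for illustration.

    # Toy storage-capacity forecast for an ever-growing UGC data set.
    # All figures are illustrative, not Twitter's.
    daily_growth_tb = 2.0          # new media/tweets landing per day
    remaining_capacity_tb = 900.0  # usable space left across the fleet
    hardware_lead_time_days = 90   # time to order, rack, and burn in new machines

    days_until_full = remaining_capacity_tb / daily_growth_tb
    order_deadline_days = days_until_full - hardware_lead_time_days
    print(f"Fleet full in ~{days_until_full:.0f} days; "
          f"order hardware within {order_deadline_days:.0f} days")

If nobody owns that number, the first warning you get may be the fleet refusing writes.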
From Casey Newton's newsletter today:
In early December, a number of Twitter’s security certificates are set to expire — particularly those that power various back-end functions of the site. (“Certs,” as they are usually called, serve to reassure users that the website they are visiting is authentic. Without proper certs, a modern web browser will refuse to establish the connection or warn users not to visit the site). Failure to renew these certs could make Twitter inaccessible for most users for some period of time.
We’re told by some members of Twitter’s engineering team that the people responsible for renewing these certs have largely resigned — raising concerns that Twitter’s site could go down without the people on hand to bring it back. Others have told us that the renewal process is largely automated, and such a failure is highly unlikely. But the issue keeps coming up in conversations we have with current and former employees.
In my experience as a data engineer, unusual states are one of the leading causes of issues, at least after something is built for the first time. You can spend half a year running into weird corner cases like "this thing we assumed had to always be a number apparently can arbitrarily get filled in with a string, now everything is broken."
Also, conditions changing causing code changes is the norm, not the exception, definitely in the beginning but also often later. Most services aren't written and done - they evolve as user needs evolve and the world evolves.
Aren't these changes inevitable, though? There is no such thing as bug free code.
Another thing that forces constant code changes is compliance: any time a 0-day is discovered or some library we're using comes out with a critical fix, we would have to go update things that sometimes hadn't been touched in years.
At my last job, I spent a significant amount of time just re-learning how to update and deploy services that somebody who left the company years ago wrote, usually with little-to-no documentation. And yes, things broke when we would deploy the service anew, but we were beholden to government agencies to make the changes or else lose our certifications to do business with them.
Eventually, Twitter will have to push code changes, if only to patch security vulnerabilities. Just waiting for another Heartbleed to come around...
Something from the 70s works perfectly fine, except it can't run on anything bare-metal any longer, and the hard drives etc. have all long since failed or their PSU capacitors have blown... so Twitter will absolutely rot; how fast depends on several factors.
I personally suspect the infrastructure used to build Twitter will rot faster than Twitter itself, and of course the largest, most dramatic source of rot is the power required to run it: several large communities have abandoned it already, making it much less relevant, meaning the funding for it will also dry up, meaning more wasted CPU cycles and the like.
That's of course assuming it's left in some sort of limbo; it doesn't sound like that's the case with the current management. It's only a matter of time before it topples over from shitty low-rate contractor code. Honestly, the app already worked like so much hot garbage; I could see it falling over itself and imploding with a couple of poorly placed loops...
Security fixes.
Simple example: you have a DB with a table with an auto-incrementing primary key. You chose a small integer type for the key, and after years of this just working fine, you finally saturate that integer type and can no longer insert rows into the table. Now imagine this has cascading effects in other systems that depend on the database indirectly, and you end up with an "outage".
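To put a rough number on how quietly that clock runs out (the insert rate below is made up):

    # How long a signed 32-bit auto-increment key lasts at a given insert rate.
    INT32_MAX = 2**31 - 1         # 2,147,483,647 possible ids
    inserts_per_day = 5_000_000   # hypothetical write rate

    days_left = INT32_MAX // inserts_per_day
    print(f"~{days_left} days until inserts start failing at this rate")

Nothing changed in the code or the config; the data set simply grew into the limit.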
Absolutely agreed. In that vein, there is such a thing as too much automation. Sometimes, build chains are set up to always pull in the newest and the freshest -- and given the staggering number of dependencies software generally has, this might mean, small changes all the time. Even when your code does not change, it can eventually break.
It's been my experience that a notable part of software development (in the cloud age, anyway) is about keeping up with all the small incremental changes. It takes bodies to keep up with this churn, bodies which twitter now does not have.
It'll be interesting to keep observing this. So far it's been a testament to the teams that built it and set up the infra -- it keeps running, despite a monkey loose in a server room. It's very impressive.
* Power outages and general acts of God
* Resource utilization
How do your databases perform when their CPUs are near capacity? Or their disks? Or I/O? I've seen Postgres do some "weird s%$#" where query times don't go exponential, but they do go hockey stick (see the queueing sketch after this list).
* Fan-out and fan-in
These can peg CPU, RAM, I/O. Peg any one of these and you're in trouble. Even run close to capacity for any one of these and you're liable to experience heisenbugs. Troublesome fan-out and fan-in can sometimes be a result of...
* Unintended consequences
The engineering decision made months or years ago may have been perfectly legitimate for the knowledge available at the time. However, we live in a volatile, uncertain, complex, and ambiguous (VUCA) world; conditions change. If your inputs deviate qualitatively or quantitatively significantly, you risk resource utilization issues or, possibly, good ol' fashioned application errors.
"No battle plan survives contact with the enemy." -- Stormin' Norman
Same with software systems. They're living entities that can only maintain homeostasis so long as their environment remains predictable within norms. Deviate far enough from that and boom.
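The hockey stick mentioned above isn't mysterious, by the way; even the simplest queueing model shows it. A minimal sketch with a made-up service rate:

    # Mean wait in an M/M/1 queue: W = 1 / (service_rate - arrival_rate).
    # Latency stays flat for a long time, then explodes as utilization nears 100%.
    service_rate = 1000.0  # queries/sec the database can handle (illustrative)

    for utilization in (0.50, 0.80, 0.90, 0.95, 0.99):
        arrival_rate = utilization * service_rate
        mean_wait_ms = 1000.0 / (service_rate - arrival_rate)
        print(f"{utilization:.0%} utilized -> ~{mean_wait_ms:.1f} ms mean wait")

Real databases aren't M/M/1 queues, but the shape of the curve is the same: fine, fine, fine, cliff.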
You can't do this without a lot of people. Sure you could pare it down, maybe improve some architecture, but without a ton of people involved who understand the systems and how they connect, when things might go south they may never return.
So yeah, I totally agree with you. No code changes = long life.
I agree with your assessment, but I do want to highlight that this condition is not rare for Twitter. Load is very spiky, sometimes during predictable periods (e.g., the World Cup, New Year's Eve) and sometimes during unpredictable periods (e.g., Queen Elizabeth II's death, the January 6th US Capitol attack). It isn't going to cause a total site failure (anymore), but it can degrade user experience in subtle or not-so-subtle ways.
An aside on the "anymore", there was a time when the entire site did go down due to high-traffic events. A lot of the complication in the infrastructure was built to add resiliency and scalability to the backend services to allow Twitter to handle these events more gracefully. That resiliency is going to help keep the services up even if maintenance is understaffed and behind a learning curve.
1) Best case: Monitoring of the service checks for service degradation outside of a sliding window. In this case, more than X percent of responses are not 2xx or 3xx. After a given time period (say, 30 minutes of this) the service can be restarted automatically. This allows you to auto-heal the service for any given "degradation" coming from that service itself. (This does not detect upstream degradation, of course, so everything upstream needs its own monitoring and autohealing, which is difficult to figure out, because it might be specific to this one service. The development/product team needs to put more thought into this in order to properly detect it, or use something like chaos engineering to see the problem and design a solution)
2) If you have a health check on the service (that actually queries the service, not just hits a static /healthcheck endpoint that always returns 200 OK), and a memory leak has caused the service to stop responding (but not die), the failed health check can trigger an automatic service restart.
3) The memory leak makes the process run out of memory and die, and the service is automatically restarted.
4) Ghetto engineering: Restart the service every few days or N requests. This extremely dumb method works very well, until you get so much traffic that it starts dying well before the restart, and you notice that your service just happens to go down on regular intervals for no reason.
5) The failed health check (if it exists) is not set up to trigger a restart, so when the service stops responding due to memory leak (but doesn't exit) the service just sits there broken.
6) Worst case: Nothing is configured to restart the service at all, so it just sits there broken.
If you do the best practice and put dynamic monitoring, a health check, and automatic restart in place, the service will self-heal in the face of memory leaks.
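For case 2, the key detail is that the probe exercises a real code path instead of a static endpoint. A minimal sketch, with a hypothetical URL and threshold, meant to be run by whatever supervises the service:

    # Health probe that hits a real query path; a non-zero exit code is what the
    # supervisor (systemd, Kubernetes, etc.) can use to decide to restart things.
    import sys
    import urllib.request

    URL = "http://localhost:8080/timeline/home?probe=1"  # hypothetical real path
    TIMEOUT_S = 2

    def healthy(url: str) -> bool:
        try:
            with urllib.request.urlopen(url, timeout=TIMEOUT_S) as resp:
                return 200 <= resp.status < 300
        except Exception:
            # Timeouts, connection errors, and 4xx/5xx all count as unhealthy.
            return False

    if __name__ == "__main__":
        sys.exit(0 if healthy(URL) else 1)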
At least for expired certs, most people have learned the hard way just how bad that is, and have either implemented automated renewal (thank heavens for cert-manager, LetsEncrypt, AWS ACM and friends) or, where that doesn't work (MS AD...), monitoring.
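Where renewal can't be automated, even a dumb expiry monitor beats finding out from users. A rough sketch with a made-up hostname and threshold (it assumes the cert still validates, i.e. you run it before the fun day, not after):

    # Report how many days remain on a host's TLS certificate and exit non-zero
    # if it's inside the warning window, so the scheduler/alerting can page someone.
    import datetime
    import socket
    import ssl

    HOST, PORT = "internal-tool.example.com", 443  # hypothetical host
    WARN_DAYS = 30

    def days_until_expiry(host: str, port: int) -> int:
        ctx = ssl.create_default_context()
        with socket.create_connection((host, port), timeout=5) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                cert = tls.getpeercert()
        # notAfter looks like 'Jun  1 12:00:00 2024 GMT'
        not_after = datetime.datetime.strptime(cert["notAfter"], "%b %d %H:%M:%S %Y %Z")
        return (not_after - datetime.datetime.utcnow()).days

    if __name__ == "__main__":
        remaining = days_until_expiry(HOST, PORT)
        print(f"{HOST}: {remaining} days of cert validity left")
        if remaining < WARN_DAYS:
            raise SystemExit(1)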
Now, those change freezes even extended to preventative maintenance: one of the dual PSUs in a core switch went bad and we couldn't get an exception to replace it... for 6 months. We got an exception when the second one went down and we had to move a few connections to its still-alive mate.
Well, Elon is talking about a massive amount of changes coming down the pipe, so I guess we'll see how that goes!
Deleted Comment
I was partly expecting the rest of the article to explain to me why exactly it wasn't just bloat. But it goes on talking about this 1~3-person cache SRE team that built solid infra automation that's really resilient to both hardware and software failures. If anything, the article might actually persuade me that it was all bloat.
First of all, how does it persuade you of that? The article touches a really small (though incredibly important for up-time) subject.
Secondly, in any large company, the majority is 'bloat'. It's security engineers, code reviews, data architecture, HR, internal audit teams, content moderators, scrum masters, and I could keep going. In a start-up many of these roles can be ignored, because growth > stability. In a large organization, part of the bloat helps ensure the amount of stability that's necessary to keep the organization alive.
If a product is mature enough, as Twitter seems to be, removing engineers won't instantly crash it. It'll happen slowly. Bugs will creep in, because less time is spent on review and overall architecture. Security issues will creep in for roughly the same reasons, plus less oversight. Then, once this causes enough issues for the product to actually crash, the right people to fix it quickly might not be there anymore. That's when fixing the issues suddenly takes a lot more time.
If the current state of affairs at Twitter keeps up, it'll probably be a slow descent into chaos. Especially with Elon pushing for new features to be implemented quickly, inevitably by people who cannot fully understand the implications of said features, because 80% of knowledge is missing.
By flowing from "many people think it's bloat - I'll tell you what's really going on" to "a tiny team of 1~3 built the whole infra for this critical component".
I'm not really trying to make commentary on whether or not Twitter engineering was bloat, or whether or not I think it'll hit problems in the future. Just commenting on the fact that the article broke my expectations a little bit as a reader.
It also (a) increases the bus factor, [1] and (b) allows people to take vacations and time off without having to watch their phones like hawk.
[1] https://en.wikipedia.org/wiki/Bus_factor
It's amazing to me how many people following the Twitter saga, some familiar with or actually working in technology, thought that Twitter would crash within days of the engineers being fired. And because it didn't, the job cuts are justified.
It's not like Twitter was bug-free before. How many times has it annoyingly refreshed the timeline while I was reading something, or shown a notification that it failed to send a DM (and when you retry it says "you've already wrote this"), or the reply dialog freezes with no send button at all, so you have to re-open it? All of this was happening to me pretty regularly long before Elon came along.
As we all know, just hiring more people is not necessarily the solution to every problem, and to me it seems that was exactly what Twitter tried to do in the past. Now they've deconstructed it to the bare bones, which will clearly show what the core problems and requirements are. They've basically turned Twitter back into a startup. And from that new starting point they can hire again to cover needs as they arise. If they succeed it will be a huge success, as they'll end up with a far more optimal team (and huge savings); and of course, if they fail to catch up with the problems, it will be a huge failure. We'll see how well Musk can manage it...
https://twitter.com/IlluminatiGanga/status/15946097904324444...
new members joining in 1970. hmmm.
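If you've never debugged that artifact: "joined in 1970" is almost always a missing or zero timestamp being rendered as the Unix epoch, not actual 52-year-old accounts. Just a guess at what's going on there, illustrated:

    # A null/zero "created_at" rendered naively comes out as the Unix epoch.
    import datetime
    missing_created_at = 0  # what a defaulted or absent timestamp often ends up as
    print(datetime.datetime.utcfromtimestamp(missing_created_at))  # 1970-01-01 00:00:00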
That said, there are lots of bugs in Twitter now, today, when they presumably had the benefit of being in stable mode for a long time. For example, Twitter regularly refreshes and loads new tweets while I'm reading them, pushing the tweet I was in the middle of reading out of view. That seems like a pretty silly bug to exist in a mature product. I regularly reach a state where I have to kill the app and relaunch it because all of the "back" commands just minimize the app instead of taking me back to the timeline. I could go on.
Have you implemented a system which stores hundreds of billions of pieces of media content and makes different slices of them immediately available to hundreds of millions of users?
You focused mostly on additive bloat, there's also multiplicative bloat in the form of multiple teams focused on building separate versions of the same product to increase likelihood of success and empire building where leaders don't actually have a remit large enough to support the team size they have, but they have woven a narrative that defends the necessity nonetheless. Put everything together and teams are very easily 6x+ larger than they absolutely need to be to get a product into market.
What we are watching is a massive failure event right now and the question really is if there's enough time for twitter management to fill in the gaps before there's an outage.
https://freakonomics.com/podcast/in-praise-of-maintenance/
That's how it couldn't prove the claim.
Deleted Comment
Dead Comment
It takes _effort_ to make it work this smoothly now, _and in the future_.
SRE is about _preventing_ issues. Not mopping up after them.
To me, the article read like every successful sysadmin story: there are no fires, so the sysadmin must be bloat.
Your car is working perfectly fine so why should you pay for maintenance?
1) HR 2) Legal 3) Sales 4) Marketing 5) Payroll 6) Admin staff 7) Most of Engineering, other than the bare minimum of L1/L2/3 support.
As someone paraphrased, a car without brakes and a steering wheel works just fine until you hit the first bend.
On the other hand, a car without a second and third steering wheel, 20 windscreen wipers, and an oven in the back, keeps running just fine, even after the first bend...
No: radio, air-conditioning, seat padding, wipers, lights, radar, etc...
Oh, and no maintenance.
It'll drive... for now. But that's it.
If Twitter survives this without any major harm it will have profound consequences for the whole software industry.
"Major harm," to me, would be either bankruptcy or a competitor overtaking significant chunks of Twitter's users. Even if Elon literally has to re-hire half the roles he fired for, or Twitter is down for a few days or a week, or they struggle to get advertisers for a little while, that's nothing for the long-term. In six months, the chances that it'll look like these firings were a bad idea are minimal.
And likewise with the "the only realistic way to moderate an online platform is the way Twitter was doing it" narrative. In six months, all it takes is the ship to still be floating without the old moderation to prove it out, and that's by far the most likely outcome.
That said, Instagram was run by just a dozen people back then, while it had hundreds of millions of users, right? So it's not a new data point.
Storing, retrieving, indexing, and managing 280-char blobs (with links, threads, embeds) is not exactly the most hardcore of problem domains.
Microblogs are the typical tutorial topic, and twitter's only "innovation" was to increase the text limit.
https://www.theguardian.com/technology/2022/nov/20/twitter-f...
All just to put caching in front of services that actually do anything.
Same here. I guess his header was on point about why Twitter is still up, but I was also interested in hearing about why Twitter actually needs all those people. If it can be run with 50-80% of the staff gone, that does sound like at least some bloat.
The article makes the point that the reason Twitter is running OK on 20% of its personnel at this moment is exactly because it was built to be resilient, not because the personnel was bloat. A large part of this so-called bloat, the 80%, is responsible for Twitter running right now. Calling this bloat implies it is actually not important for Twitter to be available all the time (or at all).
Not for me
This is almost exactly like the new manager coming in, noticing that the floors and surfaces are all clean, all the systems work, the trash is emptied, etc., and so deciding that the entire maintenance staff is unnecessary and firing them.
The place doesn't become a decrepit pigsty the next morning; it slowly degrades.
Same for these systems. They were designed, built, tuned, and maintained over the course of years to go from requiring constant manual intervention to running largely unattended and with a good buffer of ready hardware and automatic failover for failures. That "largely" in "largely unattended" is doing some very heavy lifting.
The system WILL require human intervention to keep running, and more than just a skeleton crew. The only question is whether it will happen before the new crew gets up to speed to handle the inevitable degradation.
This does NOT mean that the SREs were bloat - it means that they were doing an excellent job and could safely take a break. We're now in just the two-week vacation zone - same as if the entire SRE team went on a holiday. We'd expect it to work. Now let's see what happens in two months.
The engineer was doing stability planning for 6 months out for the purpose of cost optimization. I guess we can assume that the costs of infrastructure is about to go up and reliability is about to go down in the coming months.
It's become an adult daycare, https://twitter.com/DavidSacks/status/1561096423243800576
Twitter's layoffs followed by that 1AM photo of hackers at work is terrifying to lifestyle employees.
It's the Return of the Nerds.
Google, Meta, Netflix, Microsoft are all watching.
... for the Cache component. There are many others.
I think Musk is used to Tesla and SpaceX, which are both companies that a lot of people are (or at least were) excited to work for because they believe in the mission and what's being created. Plus there aren't many alternatives if you want to do that work. Twitter really isn't like that for most people; a Twitter developer has many other options to do similar work. Add to that the fact that he's both cranked up the intensity of the abuse and that it's more visible to everyone, and you can't expect a lot of good people to stick around. And despite the fact that it might coast for quite a while on the back of excellent work in the past, eventually you do need good people to keep a business going. (This is leaving aside the direct impacts of his actions on users and advertisers!)
Which signs? They launched a new feature (blue checks for $8) and had to turn it off immediately because it was bleeding money and ruining the platform, and they have less ad revenue booked for next year than they had at the same time last year.
I don't think one should judge the new twitter course yet, but "well, the site is still up" is a very bad measure of success.
Huh? He's been in charge for, like, two weeks. Did you think it could implode the instant the engineers received pink slips? Let's give it a year before we say he was right.
Thing is, I think Twitter was bloated and needed a kick in the rear. Pre-acquisition I heard the same from many people I follow. How Musk has gone about it has been the problem. Ignoring his perpetual haters, he had a decent amount of goodwill the day the deal closed. Then he squandered it with all his antics. A transparent content moderation board turned out to be a game-able Twitter poll. Blue checks for all was completely missing the point. No one wants a blue check for money without the associated verification. Verification for all would have been awesome.
Ad quality has dropped from what I've seen. It looks like people are pulling out, albeit slowly. MBAs will be studying this, but the way things are going, we may look back and see this as Twitter's Yahoo/AOL moment, when it sells for a few billion in a couple of years.
Big tech maintains talent so that they won't use their knowledge of the system to produce an identical competitor without the technical debt or investor liability of the original.
Twitter was not the leftist commie welfare company that Musk and his fans want it to be. It was actually the fine work the SREs (amongst others) put into it that keeps it ticking along as it does... for now.
* actually some things are already breaking, but it will take some time for the real damage to surface on a technical level
The only concern I'd have is that with so many people, your design probably comes to rely on them, whereas a smaller team would be forced to make the system easier to maintain.
Personally, if I were Elon, I’d build an entirely new backend and point the clients to that rather than trying to incrementally improve what they have.
Get 50-100 10x engineers that are loyal to Elon, with big equity stakes, and crush it
That he’s not going to realise these totally obvious first order consequences people are raising seems unlikely.
I'm not a friend of Elon's, but outside of the flashiness of the whole thing, I don't think his firing spree was wholly unwarranted.
The other day I saw a video of a bunch of people leaving Twitter who had been there for a decade or so. I mean, holy crap, this reminds me of old German industry where people retire at the place they started.
Restructuring is done on whatever idea the new owner(s) have in plan - which could be equally disconnected from "real product".
I'm not that optimistic about Twitter long term - never was, TBH - but this Musk thing is turning into a shitshow and also exposes what it's like working for him at his other companies.
On the other hand the slash and burn Elon approach seems objectively terrible. Indiscriminately firing most of the company kills morale and is likely to send the company into a hiring death spiral where your good employees leave and you can't attract good talent. This won't automatically kill the product or the company but it's not going to lend itself to big positive successes in the future.
This is honestly uncalled for. Job hopping every 2-3yrs should not be an expected task.
What is wrong with that?
German here. I think this actually is a huge part of the success of the famous Mittelstand - all that institutional knowledge these people have is extremely valuable. It's not just basic stuff like "know time tracking, billing and other admin systems and internal processes", but also the stuff that really can speed up your work: whom to ask on the "kurzer Dienstweg" aka short-circuiting bureaucracy when needed, personal relationships with people in other departments on whose knowledge you rely (it's one thing if you get a random email asking for some shit from someone you don't know, but I'll always find some minutes to help out someone who has helped me out in the past), all the domain-specific knowledge about the precise needs and desires of your customers...
Attrition is bad for a company as a whole; the problem is that US-centric capitalism cannot quantify that impact (and it doesn't want to, given that attrition-related problems are long-term issues that take years to show up), and so there is no KPI for leadership other than the attrition rate itself.
The only problem is that over the last few years, employers' mindsets have shifted from regular wage raises to paying the bare minimum, which makes changing jobs every few years a virtual requirement for employees to get raises, and so we are already seeing the first glimpses of US employment culture and its issues cropping up.
This is true, and in my opinion, true for a reason. And that reason is not "most huge companies are dumb", as opposed to what Musk's cult seem to believe. The reality is, measuring what exactly is "bloat" and precisely cutting that bloat is extremely difficult and firing more than half of your workforce is probably like using a warhammer to do brain surgery.
If your company is losing money all this time, in the real world you are likely to be fired eventually. Job security in the software world has become so high that no one seemed to expect that. Everyone's attitude was "sure, we're losing money and the company has no direction, but all is fine."
They all stayed there at their desks, working for 10 years in a rudderless company as if it were a government job.
Something I learned early in my career is that some companies consider this a feature not a bug. As in, they are hoarding talent so that they don't go to work for their competitor.
Also new owner: "What even is Mesos? Why are we running something called Aurora? Obviously pure bloat. Fire the lot of them."
In terms of changes to the platform, ditto. These changes aren't so difficult that a team of hundreds of devs who aren't 100% aware of what's already there can't figure them out. I've taken over systems that I knew very little about and that were pretty big (not as big as Twitter), and I managed for years to make changes without drastically breaking stuff. In any event, if they do break stuff, they will be able to fix what they have broken.
No, the real failure here is the massive debt burden and the fact that there is no way Twitter can ever service that amount of debt. Note that before EM took over, it was ticking along with a relatively small loss. If they had cut headcount by maybe 10%, they would have been break-even easily. There is no way that's possible with roughly $4 million of interest per day. They have to radically change the way they monetise the platform to get to that level. I don't think they will ever get there, and Musk will sell off at a bargain-basement price at some point in the future to pay back the debt.
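For rough intuition only (a back-of-the-envelope sketch; the ~$13B debt figure and the ~11% blended rate are my assumptions, not from the comment above), the daily-interest number works out to roughly the $4M/day being discussed:

```python
# Back-of-the-envelope sketch of the debt service burden.
# Assumptions (not from the original post): ~$13B of acquisition debt
# at a blended annual interest rate of roughly 11%.
debt = 13_000_000_000          # total acquisition debt, USD (assumed)
blended_rate = 0.11            # assumed blended annual interest rate

annual_interest = debt * blended_rate
daily_interest = annual_interest / 365

print(f"Annual interest: ${annual_interest / 1e9:.2f}B")   # ~ $1.43B per year
print(f"Daily interest:  ${daily_interest / 1e6:.2f}M")    # ~ $3.9M per day
```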
But apparently I'm not a rich person so that kind of accounting doesn't apply to me.
As I understand it, this works by rounding up potential loans, approaching the board of the company and getting them to sign over ownership of the company for a pittance in return for the shareholders being paid out by the company using the loans you brought in. This feels more like an emergent property of the system (specifically, contract law and how publicly traded companies operate) than how the system is intended to function.
Intuitively this shouldn't be possible as it's acting against the company's own self-interest despite being in the interest of the shareholders (and the buyer), but I think "the company's interest" in practice is defined by "the owners' interest" (and the owners in this case are the shareholders, who sell the company). I guess corporations aren't people after all.
That's the reason to deeply cut expenses and to try and make more money. He could probably have serviced that debt if everything just kept going as normal (no big ad spend cut) and he had fired everyone but that's an impossible scenario.
I’m not an expert in math but it seems pretty possible.
Plus they’ll have a lot more impressions to sell now that people are allowed to speak. Ad rates might drop but then someone will write an article about how they are getting great CPC on Twitter and everything will be back to normal after the blue checks have their sob fest.
Advertisers have already said they don't want their ads screenshotted next to slurs.
I wish Elon the best, but he could have hired his own team of "hardcore" engineers and put them to work 80 hours a week, spending a fraction of the price that was paid.
What a waste.
Sure, you can run the platform with 1/10 of the headcount, with a significantly degraded user experience (say, ~98% of what it was). This is not a problem for startups, but people usually have higher expectations for established companies. As always, the last 2% is a hard problem, and businesses don't really want to deal with such an unreliable platform. You want to onboard big advertisers that potentially spend $100M ARR? Then you need to assign a dedicated account manager to handle all customer escalations. PMs then triage and plan their feature requests, and later engineers implement them. Which all adds up.
And they also use your competitors' products, like Google, FB, TikTok, etc... Twitter is a severe underdog here, so you need to support at least a minimal, essential subset of the features in those products to convince them to spend their money on Twitter. That alone takes hundreds of engineers, data scientists and PMs, thanks to modern ad-serving stacks with massive complexity.
Yeah, it ultimately boils down to the simple fact that it's really hard to take other folks' money. You need to earn their trust first. They want to see whether your product is capable of following the modern standards of digital ad serving, now and for the foreseeable future. Twitter has spent a lot of time earning that trust, and the original post is one piece of evidence of those efforts. That usually takes more manpower. You might be able to do it more efficiently, but I don't think it's as simple as firing 75% of your entire headcount.
This exactly. During the recent WhatsApp outage, many threads popped up on HN about how big of an issue this is in Europe, since WhatsApp is the main messaging platform here. Thankfully, these outages are short and few and far between, so they never actually cause real issues. This is obviously costing Meta/Facebook a lot of money, but it allows them to be an essential service. So essential, in fact, that every major news outlet in my country sends a push message as soon as WhatsApp goes down.
If Twitter wants to be a comparably important platform, they need that same stability. And Twitter, for me, is very much the best place to stay up-to-date on any current event (in near real-time). Reddit used to be pretty good with Live, but that's pretty much died (and was mostly a summary of tweets anyway). I really hope Twitter survives Elon, because I don't know of an alternative right now that has the same value in this use case.
What Twitter is doing is to scale down first, focus on the product, and once it gains traction, it definitely can scale up again. I don't think it will hurt the product very much.
Coincidentally, in 2013 SpaceX was just starting to provide commercial launch capacity, at which point I think they too had < 100 software engineers. A few short years later they were re-using rockets, a feat many people had thought unlikely/impossible and one that requires some hardcore software engineering.
Not surprised Elon Musk thinks he can run twitter with a skeleton crew.
Now the systems are stable, but human workers eventually get sick, leave, or die.
Raising pay has diminishing returns. You can't prevent workers from leaving because they've lost interest, gotten sick, or died by throwing more money at them.
The article describes achieving stability through a distributed system, so that the unexpected death of one rack doesn't affect service availability. The same can be done for human workers who unexpectedly stop working. Having multiple workers doing the same things improves stability.
Sure, it's inefficient in terms of money. But the alternative is that one important employee catches COVID-19 and dies, and the knowledge of the system is lost with them. Documentation doesn't solve it, because you want that manual operation available right now rather than a few months later, once replacement workers have learned from the documents.
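To put a number on the redundancy analogy (a minimal sketch; the per-worker availability figure is made up for illustration), the same "at least one replica is up" math applies to people who can perform a critical manual operation:

```python
# Sketch: redundancy improves availability, whether the unit is a rack or a person.
# p = probability a single worker/replica is available on a given day
# (illustrative value, not from the original post).
def availability(p: float, n: int) -> float:
    """Probability that at least one of n independent workers is available."""
    return 1 - (1 - p) ** n

p = 0.95  # assumed per-worker availability
for n in (1, 2, 3):
    print(f"{n} worker(s): {availability(p, n):.4%} chance someone is available")
# 1 -> 95.0000%, 2 -> 99.7500%, 3 -> 99.9875%
```

Two or three people who can each do the job already takes the "nobody is around when it breaks" case from a likely event to a rare one, which is the whole point of paying for the apparent inefficiency.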
People would absolutely be more engaged and more excited about their work if they were paid more. The only reason people work is literally for money…
It's largely focussed on the event stream behind the core service and data analytics. There's maybe one entry on the main data store and one on search over the last few years.
This is more true when the layoffs happen because the company’s situation deteriorates. If the company cuts jobs because revenues fall and products fail, better employees are indeed more likely to move to greener pastures before mediocre ones do. If, however, the company prospects improve, rather than worsen, this is no longer the case.
They had wine on tap.
https://www.tiktok.com/@realpankhilpatel/video/7159187292631...
Normal people don't have vacations like that.
There were multiple executives making $10m/yr+
There were board members
There were shareholders
Why did all of them not stop this headcount increase, if the problem reduces as easily as "too much headcount bad, smaller headcount good"? These are paid professionals who are supposedly wealthy, good at their jobs, smart, informed, etc.
How can we commenters on Hacker News sit in our armchairs and say, "ah, the goofballs should've just not let headcount get so high!"?
These qualified people thought at the time it was a good idea to get up to 7.5k people. How were they all wrong?
Organizations tend to bloat.
Random, rapid cuts might not be the fix here, but headcount was too high.
The cliche HN comment on sites like Twitter (and many, many others, any time headcount comes up) has always been "why do they need so many people?" I've mostly dismissed it the same way I dismiss "I could build Uber in a weekend," but with every other tech giant laying people off, maybe I shouldn't. Maybe the effect of all that extra money sloshing around in the system was to incentivize hiring everyone to make sure you didn't accidentally get a false negative, and not all of those hires were good ones.
Come on, just look at the tech industry. When rates were low and stock prices kept going up, "headcount" was used as an indicator of future growth. Grow headcount, investors are happy. After all, the promise of tech stocks was "growth". Usually you're not looking to cut costs until you think growth is over. Of course, Twitter was a dog and did nothing useful for years, no innovation, no new products, nothing. But tech investors definitely saw rising headcount as a good thing...
For the same reason that colleges and universities have seen their administrative bloat skyrocket at 10x the rate of student enrollment. Administrative bloat inevitably creeps into all large organizations. Many of the people in the trenches making hiring decisions weren't considering the overall financial performance of Twitter as a company. They were making hiring decisions based on what was happening in their own department, or how that decision would help advance their own agenda, or increase their budget, or increase manpower on a favored project. When you further consider that many at Twitter openly conceded (and in many cases bragged) that they viewed their role at Twitter as moral arbiters of society, crucial to policing the discourse of the public, it is not hard to see how enlisting as many true believers as possible to the cause would be seen as desirable, regardless of the larger financial implications.
> These are paid professionals who are supposedly wealthy, good at their jobs, smart, informed, etc.
Wealth is not a valid indicator of ability.
I'm not judging the execs and board members individually, but rather questioning your assumption. I did notice your "supposedly", though it reads as a rhetorical hedge.
Hiring often isn't done because of current requirements. Senior execs come and go and with them so do strategic objectives. You accumulate people and they're often not laid off when the thing they work on becomes redundant. Large scale layoffs are awful for morale and usually only come after a 'crisis' occurs.
https://nitter.lacontrevoie.fr/libsoftiktok/status/158539526...
Elon thinks he knows what he’s doing, but what he is going to be left with are people who are willing to work hard by his standards, but not necessarily smart.
The simple truth is Elon knows nothing about the actual work involved in tech. He knows words or elicits help from others on what to say that sounds like tech speak (RPCs!), but when it comes to being truly knowledgeable in this space, he is losing his most valuable assets because of his amazingly poor managerial and ownership style.
I know there are a lot of Elon fans on this site, and will disagree with all of this; but his abilities have not at all been proven. Yes, he knows how to spend money to claim credit for technical advances, but until he actually has his hands dirty in the muck of the hard work of tech, he will always be a glorified self-promoter with no substance.
And Twitter will suffer for it.
Kevin Watson, who developed the avionics for Falcon 9 and Dragon and previously managed the Advanced Computer Systems and Technologies Group within the Autonomous Systems Division at NASA's Jet Propulsion laboratory: "Elon is brilliant. He’s involved in just about everything. He understands everything. If he asks you a question, you learn very quickly not to go give him a gut reaction.
He wants answers that get down to the fundamental laws of physics. One thing he understands really well is the physics of the rockets. He understands that like nobody else. The stuff I have seen him do in his head is crazy.
He can get in discussions about flying a satellite and whether we can make the right orbit and deliver Dragon at the same time and solve all these equations in real time. It’s amazing to watch the amount of knowledge he has accumulated over the years."
Actual quote. Anyone using the term "code commands" comes off as a little detached from programming reality, and the rest of the request reads like it's straight out of a Dilbert strip.
“She had somehow been able to take and synthesize these pieces of science and engineering and technology in ways that I had never thought of.”
“I never encountered a student like this before of the then thousands of students that I had talked to”
“You start to realize you are looking in the eyes of another Bill Gates, or Steve Jobs.”
He also maintained that Holmes was a once-in-a-generation genius, comparing her to Newton, Einstein, Mozart, and Leonardo da Vinci.
Excerpt from: "Bad Blood: Secrets and Lies in a Silicon Valley Startup" by John Carreyrou.
In response to someone saying on Twitter that Elon doesn't understand the technical stuff of rocketry, Tom Mueller, former CTO of Propulsion at SpaceX and the designer of many of their engines, responded:
"I worked for Elon directly for 18 1/2 years, and I can assure you, you are wrong"
https://twitter.com/lrocket/status/1512919230689148929?s=20&...
And if you're going to claim that his successes have been due to the people surrounding him who actually know what they are doing, then all that tells me is that you are acknowledging that he knows how to surround himself with people who know what they are doing.
We're not fans (I'm certainly not), but it takes a special kind of mind to look at Musk's track record of successes and conclude that his latest project is doomed.
It's the Halo Effect fallacy to think competence in one field automatically translates to another. Especially when the founder in question has displayed increasingly erratic behavior in the meantime.
Is today's Elon capable of doing what Elon from 15 years ago did at Tesla? I don't think that is necessarily in evidence, much less in a very different industry.
1. https://patents.justia.com/inventor/elon-musk
And he hates bloated inefficient teams. His decrees on meetings are infamous. Tripling the team at Twitter implies a lot of internal politics, fiefdoms, communication overhead, and generally a lot of headless chickens running around. There's no nice way to fix such a team. A sledge hammer is one way to fix it and obviously he likes getting results quickly.
So, the notion of laying off most of that team was a foregone conclusion. The notion that a lot of the better people would get upset about that and leave as well is also highly predictable. What's left is a team with some gaps but also a lot of breathing room. And he can always lure key people back in by throwing money at them.
Simple plan. It might actually work. At the cost of a bit of drama, temporary instability, and lots of free publicity. Exactly his style. Cringe worthy and effective. I can see the logic here.
The point is that Twitter didn't really need someone to build a world-class software and hardware team. The technical challenges in reliability and speed seemed pretty much solved, or on track to be solved, already. Twitter's problem was that it never knew how to properly manage the community and make the company profitable.
Twitter doesn't have a tech problem, it has a community problem.
It's not about being a fan or not, it's that you're not actually providing any real insight other than signalling how smart you are.
Ever watch someone do CAD/CAM modelling? They need extreme precision of input that AR sausage fingers just aren't going to help with. You need a num-pad and a good mouse with a stepped click wheel.
I get it. We share some similar neurodivergent traits. He wants to be right on the detail. Constantly jumping from interest to interest, seeing the hidden patterns and connections that aren't apparent to others. But there are times when I know I just need to shut up and let someone more experienced talk, despite my brain wanting to lead every discussion straight into solution mode, or into providing-additional-context mode.
I've spent the last 6 years in management consulting (without a formal business education), and I agree with him when he says MBAs are useless. We know that the best solutions come from diverse teams with diverse backgrounds, skills and knowledge. Not five clones who know how to build value driver trees. That's not to say the tools they bring aren't useful, but they can be incredibly limiting.
For someone who hates MBAs he's sure going about this take-over like someone who barely passed one (i.e. knows more than enough to be dangerous). Sure, you're hemorrhaging money in operations. You need to cut costs and find new revenue streams.
What are your biggest costs?
Labor. Slash and burn. The old McKinsey 7% FTE reduction will give you some extra operating cash from the year's remaining budget, and you know it's not so much that people (in fear for their jobs) won't just pick up the slack to keep everything moving. Do it quickly, because you need to rip the band-aid off and get all that accrued leave, restricted cash, etc. off your books too.
Equipment. Redundancy? Sounds like unused resources we can fire sale.
Contracts. Renegotiate? The only two meaningful levers are price and quantity. Start cutting quantity now, renegotiate price later.
This is all dummies'-guide stuff and tends to go terribly in reality when implemented all at once.
For instance, research has shown that companies that lay people off when under pressure end up underperforming the ones that choose not to.
Now who's going to help build and operate those new revenue streams?
Quick fixes for a quick buck and a whole lot of extra risk.
[citation needed]
CNN's ratings were never better than under Trump. He's fantastic for advertising. So is Musk. All controversial figures are. That, oddly, isn't controversial in advertising.
>on a whim!
He created a public poll, and when the vote to allow Trump back won, he unbanned him, tweeting "vox populi, vox dei" ("the voice of the people is the voice of god"). Had he unbanned him despite the poll saying "no", you could argue it was a whim, but that isn't the reality we're in. He also refused to unban Alex Jones, citing his exploitation of child deaths and a personal story. Not unbanning Alex Jones was more whimsical than unbanning Trump was, factually speaking. Why do people always misrepresent his actions? And why is it always upvoted and not flagged here?
You can screenshot this: Musk will cut down costs, make Twitter profitable, and take it public again when markets are better placed.
The markets will give the newly listed Twitter the same "Musk-boost" as Tesla and ramp up the valuation to $100B.
You can get rid of 80% of the work force and the existing homeostasis systems will keep things running smoothly despite known day-to-day chaos.
Where you’re really going to run into trouble is inventing responses to novel chaos and gradually changing times.
The bigger a ship is, the slower it is to turn.
IBM is a "tech" company that employs 282,000 people, and when was the last time they invented something? I don't remember the last time I heard about IBM in the news for something they made.
The bigger the company, the less innovation and the more administration & bureaucracy you tend to find.
The reason startups can survive is because of its small size that makes it very flexible and adaptable to chaos and change, that gives it the edge over bigger companies.
In modern software environments, the entropy is almost violent -- the changes in all the constituent dependencies are constant and relentless. Something frozen in time does not stand a chance, unless it's entirely stand-alone and dependency-free -- an unlikely scenario with a service of Twitter's size.
What did you make of Mudge's report regarding resiliency of data-centers?
> Insufficient data center redundancy, without a plan to cold-boot or recover from even minor overlapping data center failure, raising the risk of a brief outage to that of a catastrophic and existential risk for Twitter's survival.
- https://techpolicy.press/wp-content/uploads/2022/08/whistleb...