Yep, there's a premium on making your architecture more cloudy. However, the best case for Use One Big Server is not necessarily your big monolithic API server, but your database.
Use One Big Database.
Seriously. If you are a backend engineer, nothing is worse than breaking up your data into self-contained service databases, where everything is passed over REST/RPC. Your product asks will consistently want to combine these data sources (they don't know how your distributed databases look, and oftentimes they really do not care).
It is so much easier to do these joins efficiently in a single database than fanning out RPC calls to multiple different databases, not to mention dealing with inconsistencies, lack of atomicity, etc. etc. Spin up a specific reader of that database if there needs to be OLAP queries, or use a message bus. But keep your OLTP data within one database for as long as possible.
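As a sketch of the difference (hypothetical `users`/`orders` tables, SQLite standing in for the One Big Database): what would be N RPC calls plus client-side merging across service databases is a single consistent query here:

```python
import sqlite3

# Hypothetical schema: in a single database, "combining data sources"
# is just a join; with per-service databases it becomes several RPC
# round trips plus client-side merging, with no shared snapshot.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
    INSERT INTO users  VALUES (1, 'ada'), (2, 'lin');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 5.0), (12, 2, 40.0);
""")

# One query, one round trip, one consistent view of both "services":
rows = db.execute("""
    SELECT u.name, SUM(o.total)
    FROM users u JOIN orders o ON o.user_id = u.id
    GROUP BY u.id ORDER BY u.name
""").fetchall()
print(rows)  # [('ada', 30.0), ('lin', 40.0)]
```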
You can break apart a stateless microservice, but few things in the world of software are as stagnant as data. Keeping it in one place will keep you nimble for new product features. And the boxes that cloud vendors offer today for managed databases are giant!
> Seriously. If you are a backend engineer, nothing is worse than breaking up your data into self-contained service databases, where everything is passed over REST/RPC. Your product asks will consistently want to combine these data sources (they don't know how your distributed databases look, and oftentimes they really do not care).
This works until it doesn't, and then you land in the position my company finds itself in, where our databases can't handle the load we generate. We can't get bigger or faster hardware because we are using the biggest and fastest hardware you can buy.
Distributed systems suck, sure, and they make querying across systems a nightmare. However, by giving those aspects up, what you gain is the ability to add new services, features, etc. without running into Scotty yelling "She can't take much more of it!"
Once you get to that point, it becomes SUPER hard to start splitting things out. All of a sudden you have 10,000 "just a one-off" queries against several domains that break when you try to carve out a domain into a single owner.
I don't know the complexity of your project, but more often than not the feeling of doom that comes from hitting that wall is bigger than the actual effort it takes to solve it.
People often feel they should have anticipated and avoided the scaling issues altogether, but moving from a single DB to a master/replica model, and/or shards or other solutions, is fairly doable, and it doesn't come with worse tradeoffs than if you had sharded/split services from the start. It always feels fragile and bolted-on compared to the elegance of the single DB, but you'd also have many dirty hacks to make a multi-DB setup work properly.
Also, you do that from a position where you usually have money, resources and a good knowledge of your core parts, which is not true when you're still growing full speed.
I've basically been building CRUD backends for websites and later apps since about 1996.
I've fortunately/unfortunately never yet been involved in a project that we couldn't comfortably host using one big write master and a handful of read slaves.
Maybe one day a project I'm involved with will approach "FAANG scale" where that stops working, but you can 100% run 10s of millions of dollars a month in revenue with that setup, at least in a bunch of typical web/app business models.
Early on I did hit the "OMG, we're cooking our database" point, where we needed to add read caching. When I first did that, memcached was still written in Perl. So that joined my toolbox very early on (sometime in the late 90s).
Once read caching started to not keep up, it was easy enough to make the read cache/memcached layer understand and distribute reads across read slaves. I remember talking to Monty Widenius at The Open Source Conference, I think in San Jose around 2001 or so, about getting MySQL replication to use SSL so I could safely replicate to read slaves in Sydney and London from our write master in PAIX.
I have twice committed the sin of premature optimisation and sharded databases "because this one was _for sure_ going to get too big for our usual database setup". It only ever brought unneeded grief and never actually proved necessary.
Many databases can be distributed horizontally if you put in the extra work, would that not solve the problems you're describing? MariaDB supports at least two forms of replication (one master/replica and one multi-master), for example, and if you're willing to shell out for a MaxScale license it's a breeze to load balance it and have automatic failover.
Shouldn't your company have started to split things out and plan for hitting the limit of hardware a couple of box sizes back? I feel there is a happy middle ground between "spend months making everything a service for our 10 users" and "welp, looks like we can't upsize the DB anymore, guess we should split things off now?"
That is, one huge table keyed by (for instance) alphabet and when the load gets too big you split it into a-m and n-z tables, each on either their own disk or their own machine.
Then just keep splitting it like that. All of your application logic stays the same … everything stays very flat and simple … you just point different queries to different shards.
I like this because the shards can evolve from their own disk IO to their own machines… and later you can reassemble them if you acquire faster hardware, etc.
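The routing logic for that kind of range sharding is tiny - a sketch with hypothetical shard names, where splitting a shard is just an edit to the routing table:

```python
# Range-based sharding as described above: each key range maps to a
# shard (names are hypothetical). Splitting a-m into a-f and g-m later
# only changes this table; application queries stay the same.
SHARDS = {("a", "m"): "db_a_m", ("n", "z"): "db_n_z"}

def shard_for(key: str) -> str:
    """Return the shard that owns this key, based on its first letter."""
    first = key[0].lower()
    for (lo, hi), shard in SHARDS.items():
        if lo <= first <= hi:
            return shard
    raise KeyError(f"no shard owns {key!r}")

print(shard_for("alice"))   # db_a_m
print(shard_for("zardoz"))  # db_n_z
```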
> Once you get to that point, it becomes SUPER hard to start splitting things out.
Maybe, but if you split it from the start you die by a thousand cuts, and likely pay the cost up front, even if you’d never get to the volumes that’d require a split.
>Once you get to that point, it becomes SUPER hard to start splitting things out. All of a sudden you have 10,000 "just a one-off" queries against several domains that break when you try to carve out a domain into a single owner.
But that's survivorship bias, looking back at things from the perspective of your current problems.
You know what the least future-proof and scalable project is? The one that gets canceled because it failed to deliver any value in a reasonable time in the early phase. Once you reach "huge project" status you can afford a glacial pace. Most of the time you can't afford that early on - so even if by some miracle you knew what scaling issues you were going to have long term and invested in fixing them early, it's rarely been a good tradeoff in my experience.
I've seen more projects fail because they tangle themselves up in unnecessary complexity early on and fail to execute on the core value proposition than I've seen fail from being unable to manage the tech debt 10 years in. Developers like to complain about the second kind, but they get fired for the first kind. Unfortunately, in today's job market they just resume-pad their failures as "relevant experience" and move on to the next project - so there is no correcting feedback.
I'd be curious to know what your company does which generates this volume of data (if you can disclose), what database you are using and how you are planning to solve this issue.
You can get a machine with multiple terabytes of ram and hundreds of CPU cores easily. If you can afford that, you can afford a live replica to switch to during maintenance.
FastComments runs on one big DB in each region, with a hot backup... no issues yet.
Before you go to microservices you can also shard, as others have mentioned.
This is absolutely true - when I was at Bitbucket (ages ago at this point) and we were having issues with our DB server (mostly due to scaling), almost everyone we talked to said "buy a bigger box until you can't any more" because of how complex (and indirectly expensive) the alternatives are - sharding and microservices both have a ton more failure points than a single large box.
I'm sure they eventually moved off that single primary box, but for many years Bitbucket was run off 1 primary in each datacenter (with a failover), and a few read-only copies. If you're getting to the point where one database isn't enough, you're either doing something pretty weird, are working on a specific problem which needs a more complicated setup, or have grown to the point where investing in a microservice architecture starts to make sense.
One issue I've seen with this is that if you have a single, very large database, it can take a very, very long time to restore from backups - or, for that matter, just to take backups.
I'd be interested to know if anyone has a good solution for that.
I'm glad this is becoming conventional wisdom. I used to argue this in these pages a few years ago and would get downvoted below the posts telling people to split everything into microservices separated by queues (although I suppose it's making me lose my competitive advantage when everyone else is building lean and mean infrastructure too).
But it is also about pushing the limits of what is physically possible in computing. As Admiral Grace Hopper would point out (https://www.youtube.com/watch?v=9eyFDBPk4Yw), covering distance over network wires involves hard latency constraints, not to mention congestion on those wires.
Physical efficiency is about keeping data close to where it's processed. Monoliths can make much better use of L1, L2, L3, and RAM caches than distributed systems, for speedups often on the order of 100X to 1000X.
Sure it's easier to throw more hardware at the problem with distributed systems but the downsides are significant so be sure you really need it.
Now there is a corollary to using monoliths. Since you only have one DB, that DB should be treated as somewhat sacred; you want to avoid wasting resources inside it. This means being a bit more careful about how you are storing things: using the smallest data structures, normalizing when you can, etc. This is not to save disk - disk is cheap. It is to make efficient use of L1, L2, L3, and RAM.
I've seen boolean true or false values saved as large JSON documents: {"usersetting1": true, "usersetting2": false, "setting1name": "name", etc.} - 10 bits of data ending up as a 1 KB JSON document. Avoid this! Storing documents means the keys - the full table schema - are repeated in every row. It has its uses, but if you can predefine your schema and use the smallest types needed, you gain much performance, mostly through much higher cache efficiency!
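To make the size difference concrete (setting names here are hypothetical): ten booleans stored as a JSON document versus the same ten booleans packed into a two-byte bitfield:

```python
import json
import struct

# Ten boolean settings as a JSON document: the key names are stored
# again in every single row.
settings = {f"usersetting{i}": (i % 2 == 0) for i in range(10)}
as_json = json.dumps(settings).encode()

# The same ten booleans packed into a 16-bit integer: 2 bytes per row,
# with the "schema" (which bit means what) defined once, outside the data.
bitfield = sum(1 << i for i, v in enumerate(settings.values()) if v)
as_bits = struct.pack("<H", bitfield)

# The JSON form is two orders of magnitude larger per row.
print(len(as_json), len(as_bits))
```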
It's not though. You're just seeing the most popular opinion on HN.
In reality it is nuanced like most real-world tech decisions are. Some use cases necessitate a distributed or sharded database, some work better with a single server and some are simply going to outsource the problem to some vendor.
My hunch is that computers caught up. Back in the early 2000s, horizontal scaling was the only way. You simply couldn't handle even reasonably mediocre loads on a single machine.
As computing becomes cheaper, horizontal scaling is starting to look more and more like unnecessary complexity for even surprisingly large/popular apps.
I mean you can buy a consumer off-the-shelf machine with 1.5TB of memory these days. 20 years ago, when microservices started gaining popularity, 1.5TB RAM in a single machine was basically unimaginable.
'over the wire' is less obvious than it used to be.
If you're in a k8s pod, those calls are really kernel calls. Sure, you're serializing and process switching where you could be just making a method call, but we had to do something.
I'm seeing fewer 'balls of mud' with microservices. That's not zero balls of mud, but it's not a given for almost every code base I wander into.
>"I'm glad this is becoming conventional wisdom. "
Yup, this is what I've always done and it works wonders. Since I do not have bosses, just clients, I do not give a flying fuck about the latest fashion and do what actually makes sense for me and said clients.
I've never understood this logic for webapps. If you're building a web application, congratulations: you're building a distributed system, you don't get a choice. You can't actually use transactional integrity or ACID compliance because you've got to send everything to and from your users via HTTP request/response. So you pay all the performance, scalability, flexibility, and especially reliability costs of an RDBMS, carefully ration how much data you're storing, and get zilch for it. You end up building a system that's still last-write-wins and still loses user data whenever two users do anything at the same time (or you build your own transactional logic to solve that - exactly the same way as you would if you were using a distributed datastore).
Distributed systems can also make efficient use of cache, in fact they can do more of it because they have more of it by having more nodes. If you get your dataflow right then you'll have performance that's as good as a monolith on a tiny dataset but keep that performance as you scale up. Not only that, but you can perform a lot better than an ACID system ever could, because you can do things like asynchronously updating secondary indices after the data is committed. But most importantly you have easy failover from day 1, you have easy scaling from day 1, and you can just not worry about that and focus on your actual business problem.
Relational databases are largely a solution in search of a problem, at least for web systems. (They make sense as a reporting datastore to support ad-hoc exploratory queries, but there's never a good reason to use them for your live/"OLTP" data).
>As Admiral Grace Hopper would point out (https://www.youtube.com/watch?v=9eyFDBPk4Yw ) doing distance over network wires involves hard latency constraints, not to mention dealing with congestions over these wires.
Even accounting for CDNs, a distributed system is inherently more capable of bringing data closer to geographically distributed end users, thus lowering latency.
I think a strong test a lot of "let's use Google scale architecture for our MVP" advocates fail is: can your architecture support a performant paginated list with dynamic sort, filter and search where eventual consistency isn't acceptable?
Pretty much every CRUD app needs this at some point and if every join needs a network call your app is going to suck to use and suck to develop.
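A minimal sketch of that pattern against a single database (hypothetical `items` table, SQLite for illustration): dynamic filter, sort, and pagination are one strongly consistent query, with the sort column whitelisted rather than interpolated from raw user input:

```python
import sqlite3

# Hypothetical catalog table standing in for "every CRUD app's list view".
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT, price REAL);
    INSERT INTO items VALUES (1,'apple',3.0),(2,'banana',1.0),
                             (3,'cherry',5.0),(4,'date',2.0);
""")

# Whitelist sort columns; never interpolate user input into the ORDER BY.
ALLOWED_SORTS = {"name", "price"}

def page(search: str, sort: str, limit: int, offset: int):
    """Filter + sort + paginate in one consistent query."""
    assert sort in ALLOWED_SORTS
    return db.execute(
        f"SELECT name FROM items WHERE name LIKE ? ORDER BY {sort} "
        "LIMIT ? OFFSET ?",
        (f"%{search}%", limit, offset),
    ).fetchall()

print(page("a", "price", 2, 0))  # [('banana',), ('date',)]
```

With per-service databases, the same screen needs a cross-service fan-out plus in-memory sorting and paging, and eventual consistency between the sources leaks into the UI.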
I’ve found the following resource invaluable for designing and creating “cloud native” APIs where I can tackle that kind of thing from the very start without a huge amount of hassle https://google.aip.dev/general
I don't believe you. Eventual consistency is how the real world works, what possible use case is there where it wouldn't be acceptable? Even if you somehow made the display widget part of the database, you can't make the reader's eyeballs ACID-compliant.
> if every join needs a network call your app is going to suck to use and suck to develop.
And yet developers do this every single day without any issue.
It is bad practice to have your authentication database be the same as your app database. Or you have data coming from SaaS products, third party APIs or a cloud service. Or even simply another service in your stack. And with complex schemas often it's far easier to do that join in your application layer.
I've seen this evolve into tightly coupled microservices that could be deployed independently in theory, but required exquisite coordination to work.
If you want them to be on a single server, that's fine, but having multiple databases or schemas will help enforce separation.
And, if you need one single place for analytics, push changes to that space asynchronously.
Having said that, I've seen silly optimizations being employed that make sense when you are Twitter, and to nobody else. Slice services up to the point they still do something meaningful in terms of the solution and avoid going any further.
I have done both models. My previous job we had a monolith on top of a 1200 table database. Now I work in an ecosystem of 400 microservices, most with their own database.
What it fundamentally boils down to is that your org chart determines your architecture. We had a single team in charge of the monolith, and it was ok, and then we wanted to add teams and it broke down. On the microservices architecture, we have many teams, which can work independently quite well, until there is a big project that needs coordinated changes, and then the fun starts.
Like always there is no advice that is absolutely right. Monoliths, microservices, function stores. One big server vs kubernetes. Any of those things become the right answer in the right context.
Although I’m still in favor of starting with a modular monolith and splitting off services when it becomes apparent they need to change at a different pace from the main body. That is right in most contexts I think.
To clarify the advice, at least how I believe it should be done…
Use One Big Database Server…
… and on it, use one software database per application.
For example, one Postgres server can host many databases that are mostly* independent from each other. Each application or service should have its own database and be unaware of the others, communicating with them via the services if necessary. This makes splitting up into multiple database servers fairly straightforward if needed later. In reality most businesses will have a long tail of tiny databases that can all be on the same server, with only bigger databases needing dedicated resources.
*you can have interdependencies when you’re using deep features sometimes, but in an application-first development model I’d advise against this.
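In SQLite terms (standing in for one-database-per-application on a single Postgres server, names hypothetical), the idea looks like this - each application gets its own database, so splitting later means moving one database, not untangling one shared schema:

```python
import os
import sqlite3
import tempfile

# One "server" (directory), one database per application. Each app only
# ever opens its own file; moving shop.db to dedicated hardware later
# requires no schema untangling.
server_root = tempfile.mkdtemp()
auth = sqlite3.connect(os.path.join(server_root, "auth.db"))
shop = sqlite3.connect(os.path.join(server_root, "shop.db"))

auth.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
shop.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")

# The "server" hosts both, but the applications stay unaware of each other:
print(sorted(os.listdir(server_root)))  # ['auth.db', 'shop.db']
```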
There's no need for "microservices" in the first place then. That's just logical groupings of functionality that can be separate as classes, namespaces or other modules without being entirely separate processes with a network boundary.
Breaking apart a stateless microservice and then basing it around a giant single monolithic database is pretty pointless - at that stage you might as well just build a monolith and get on with it as every microservice is tightly coupled to the db.
Note that quite a bit of the performance problems come from writing. You can get away with A LOT if you accept that 1. the current service doesn't do (much) writing and 2. it can live with slightly old data. Which I think covers 90% of use cases.
So you can end up with those services living on separate machines and connecting to read-only DB replicas, for virtually limitless scalability. And when one realizes it needs to do an update, it either switches the DB connection to a master, or it forwards the whole request to another instance connected to a master DB.
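One possible sketch of that routing (the SELECT-prefix heuristic is a simplification; a real router also has to handle transactions, CTEs, and functions with side effects):

```python
# Route reads to a replica and writes to the primary. The backends here
# are recording stubs; in practice they'd be real driver connections.
class RoutingConnection:
    def __init__(self, primary, replica):
        self.primary = primary
        self.replica = replica

    def execute(self, sql, params=()):
        # Naive heuristic: anything starting with SELECT is a read.
        is_read = sql.lstrip().upper().startswith("SELECT")
        target = self.replica if is_read else self.primary
        return target.execute(sql, params)

class FakeDB:
    """Stub backend that records queries and reports its own name."""
    def __init__(self, name):
        self.name = name
        self.log = []

    def execute(self, sql, params=()):
        self.log.append(sql)
        return self.name

router = RoutingConnection(FakeDB("primary"), FakeDB("replica"))
print(router.execute("SELECT * FROM t"))       # replica
print(router.execute("UPDATE t SET x = 1"))    # primary
```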
(1) Different programming languages, e.g. you've written your app in Java but now you need to do something for which the perfect Python library is available.
(2) Different parts of your software need different types of hardware. Maybe one part needs a huge amount of RAM for a cache, but other parts are just a web server. It'd be a shame to have to buy huge amounts of RAM for every server. Splitting the software up and deploying the different parts on different machines can be a win here.
I reckon the average startup doesn't need any of that, not suggesting that monoliths aren't the way to go 90% of the time. But if you do need these things, you can still go the microservices route, but it still makes sense to stick to a single database if at all possible, for consistency and easier JOINs for ad-hoc queries, etc.
Agree. Nothing worse than having different programs changing data in the same database. The database should not be an integration point between services.
I disagree. Suppose you have an enormous DB that's mainly written to by workers inside a company, but has to be widely read by the public outside. You want your internal services on machines with extra layers of security, perhaps only accessible by VPN. Your external facing microservices have other things like e.g. user authentication (which may be tied to a different monolithic database), and you want to put them closer to users, spread out in various data centers or on the edge. Even if they're all bound to one database, there's a lot to recommend keeping them on separate, light cheap servers that are built for http traffic and occasional DB reads. And even more so if those services do a lot of processing on the data that's accessed, such as building up reports, etc.
yah, this is something i learned when designing my first server stack (using Sun machines) for a real business back during the dot-com boom/bust era. our single database server was the beefiest machine by far in the stack, 5U in the rack (we also had a hot backup), while the other servers were 1U or 2U in size. most of that girth was for memory and disk space, with decent but not the fastest processors.
one big db server with a hot backup was our best tradeoff for price, performance, and reliability. part of the mitigation was that the other servers could be scaled horizontally to compensate for a decent amount of growth without needing to scale the db horizontally.
Definitely use a big database, until you can't. My advice to anyone starting with a relational data store is to use a proxy from day 1 (or some point before adding something like that becomes scary).
When you need to start sharding your database, having a proxy is like having a super power.
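A toy illustration of why: the application keeps talking to one endpoint while the proxy's shard map changes underneath it (backend names are hypothetical):

```python
import hashlib

# A proxy owns the mapping from key to physical backend. The application
# never learns which backend it hit, so resharding is a proxy-side change.
class ShardProxy:
    def __init__(self, backends):
        self.backends = backends  # index = logical shard, value = server

    def backend_for(self, key: str) -> str:
        # Stable hash so the same key always lands on the same logical shard.
        shard = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(self.backends)
        return self.backends[shard]

# Day 1: four logical shards, all living on the one big server.
proxy = ShardProxy(["db1", "db1", "db1", "db1"])
before = proxy.backend_for("user:42")
print(before)  # db1

# Later: move half the logical shards to a second server. No app changes.
proxy.backends = ["db1", "db2", "db1", "db2"]
```

Because the logical shards existed from day 1, the painful part of sharding (rewriting every call site) never happens; only data moves.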
We see both use cases: single large database vs. multiple small, decoupled ones. I agree with the sentiment that a large database offers simplicity, until access patterns change.
We focus on distributing database data to the edge using caching. Typically this eliminates read-replicas and a lot of the headache that goes with app logic rewrites or scaling "One Big Database".
Yep, with a passive replica or online (log) backup.
Keeping things centralized can reduce your hardware requirement by multiple orders of magnitude. The one huge exception is a traditional web service, those scale very well, so you may not even want to get big servers for them (until you need them).
If you do this then you'll have the hardest possible migration when the time comes to split it up. It will take you literally years, perhaps even a decade.
Shard your datastore from day 1, get your dataflow right so that you don't need atomicity, and it'll be painless and scale effortlessly. More importantly, you won't be able to paper over crappy dataflow. It's like using proper types in your code: yes, it takes a bit more effort up-front compared to just YOLOing everything, but it pays dividends pretty quickly.
This is true IFF you get to the point where you have to split up.
I know we're all hot and bothered about getting our apps to scale up to be the next unicorn, but most apps never need to scale past the limit of a single very high-performance database. For most people, this single huge DB is sufficient.
Also, for many (maybe even most) applications, designated outages for maintenance are not only acceptable, but industry standard. Banks have had, and continue to have, designated outages all the time, usually on weekends when the impact is reduced.
Sure, what I just wrote is bad advice for mega-scale SaaS offerings with millions of concurrent users, but most of us aren't building those, as much as we would like to pretend that we are.
I will say that TWO of those servers, with some form of synchronous replication, and point in time snapshots, are probably a better choice, but that's hair-splitting.
(and I am a dyed in the wool microservices, scale-out Amazon WS fanboi).
> If you do this then you'll have the hardest possible migration when the time comes to split it up. It will take you literally years, perhaps even a decade.
At which point a new OneBigServer will be 100x as powerful, and all your upfront work will be for nothing.
It’s never one big database. Inevitably there are backups, replicas, testing environments, staging, development. In an ideal, unchanging world where nothing ever fails and workload is predictable, the one big database is also ideal.
What happens in the real world is that the one big database becomes such a roadblock to change and growth that organisations often throw away the whole thing and start from scratch.
> It’s never one big database. Inevitably there are backups, replicas, testing environments, staging, development. In an ideal, unchanging world where nothing ever fails and workload is predictable, the one big database is also ideal.
But if you have many small databases, you need
> backups, replicas, testing environments, staging, development
all times `n`. Which doesn't sound like an improvement.
> What happens in the real world is that the one big database becomes such a roadblock to change and growth that organisations often throw away the whole thing and start from scratch.
Bad engineering orgs will snatch defeat from the jaws of victory no matter what the early architectural decisions were. The one-vs-many databases/services question is almost entirely moot.
Just FYI, you can have one big database, without running it on one big server. As an example, databases like Cassandra are designed to be scaled horizontally (i.e. scale out, instead of scale up).
There are trade-offs when you scale horizontally even if a database is designed for it. For example, DataStax's Storage Attached Indexes or Cassandra's hidden-table secondary indexing allow for indexing on columns that aren't part of the clustering/partitioning, but when you're reading you're going to have to ask all the nodes to look for something if you aren't including a clustering/partitioning criteria to narrow it down.
You've now scaled out, but you now have to ask each node when searching by secondary index. If you're asking every node for your queries, you haven't really scaled horizontally. You've just increased complexity.
Now, maybe 95% of your queries can be handled with a clustering key and you just need secondary indexes to handle 5% of your stuff. In that case, Cassandra does offer an easy way to handle that last 5%. However, it can be problematic if people take shortcuts too much and you end up putting too much load on the cluster. You're also putting your latency for reads at the highest latency of all the machines in your cluster. For example, if you have 100 machines in your cluster with a mean response time of 2ms and a 99th percentile response time of 150ms, you're potentially going to be providing a bad experience to users waiting on that last box on secondary index queries.
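The tail-latency arithmetic in that last example can be made concrete. Assuming node slowness is independent, a scatter-gather query that waits on all n nodes is slow whenever any single node is:

```python
# If each node independently exceeds its p99 latency with probability p,
# a fan-out query waiting on all n nodes is slow whenever any one of
# them is slow: P(slow) = 1 - (1 - p)^n.
def p_slow(n: int, p: float = 0.01) -> float:
    return 1 - (1 - p) ** n

print(round(p_slow(1), 3))    # 0.01  - one node: 1% of queries are slow
print(round(p_slow(100), 3))  # 0.634 - 100 nodes: ~63% hit some node's p99
```

So a secondary-index query fanned out to 100 nodes experiences one node's p99 on roughly two out of every three requests, which is why keeping such queries to a small fraction of traffic matters.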
This isn't to say that Cassandra isn't useful - Cassandra has been making some good decisions to balance the problems engineers face. However, it does come with trade-offs when you distribute the data. When you have a well-defined problem, it's a lot easier to design your data for efficient querying and partitioning. When you're trying to figure things out, the flexibility of a single machine and much cheaper secondary index queries can be important - and if you hit a massive scale, you figure out how you want to partition it then.
Cassandra may be great when you have to scale a database that you're no longer developing significantly. The problem with this DB system is that you have to know all the queries up front, before you can define the schema.
A relative worked for a hedge fund that used this idea. They were a C#/MSSQL shop, so they just bought whatever was the biggest MSSQL server at the time, updating frequently. They said it was a huge advantage, where the limit in scale was more than offset by productivity.
I think it's an underrated idea. There's a lot of people out there building a lot of complexity for datasets that in the end are less than 100 TB.
But it also has limits. Infamously Twitter delayed going to a sharded architecture a bit too long, making it more of an ugly migration.
I do, and it is running on the same big (relatively) server as my native C++ backend talking to the database. The performance smokes your standard cloudy setup big time: serving a thousand requests per second on 16 cores without breaking a sweat. I am all for monoliths running on real, non-cloudy hardware. As long as the business scale is reasonable and does not approach FAANG (true for 90% of businesses), this solution is superior to everything else money-, maintenance-, and development-time-wise.
I agree with this sentiment but it is often misunderstood as a means to force everything into a single database schema. More people need to learn about logically separating schemas with their database servers!
Another area for consolidation is auth. Use one giant keycloak, with individual realms for every one of the individual apps you are running. Your keycloak is back ended by your one giant database.
I agree that 1BDB is a good idea, but having one ginormous schema has its own costs. So I still think data should be logically partitioned between applications/microservices - in PG terms, one “cluster” but multiple “databases”.
We solved the problem of collecting data from the various databases for end users by having a GraphQL layer which could integrate all the data sources. This turned out to be absolutely awesome. You could also do something similar using FDW. The effort was not significant relative to the size of the application.
The benefits of this architecture were manifold but one of the main ones is that it reduces the complexity of each individual database, which dramatically improved performance, and we knew that if we needed more performance we could pull those individual databases out into their own machine.
I'd say, one big database per service. Often times there are natural places to separate concerns and end up with multiple databases. If you ever want to join things for offline analysis, it's not hard to make a mapreduce pipeline of some kind that reads from all of them and gives you that boundless flexibility.
Then if/when it comes time for sharding, you probably only have to worry about one of those databases first, and you possibly shard it in a higher-level logical way that works for that kind of service (e.g. one smaller database per physical region of customers) instead of something at a lower level with a distributed database. Horizontally scaling DBs sound a lot nicer than they really are.
>>(they don't know how your distributed databases look, and oftentimes they really do not care)
Nor should they, it's the engineer's/team's job to provide the database layer to them with high levels of service without them having to know the details
I'm pretty happy to pay a cloud provider to deal with managing databases and hosts. It doesn't seem to cause me much grief, and maybe I could do it better but my time is worth more than our RDS bill. I can always come back and Do It Myself if I run out of more valuable things to work on.
Similarly, paying for EKS or GKE or the higher-level container offerings seems like a much better place to spend my resources than figuring out how to run infrastructure on bare VMs.
Every time I've seen a normal-sized firm running on VMs, they have one team who is responsible for managing the VMs, and either that team is expecting a Docker image artifact or they're expecting to manage the environment in which the application runs (making sure all of the application dependencies are installed in the environment, etc) which typically implies a lot of coordination between the ops team and the application teams (especially regarding deployment). I've never seen that work as smoothly as deploying to ECS/EKS/whatever and letting the ops team work on automating things at a higher level of abstraction (automatic certificate rotation, automatic DNS, etc).
That said, I've never tried the "one big server" approach, although I wouldn't want to run fewer than 3 replicas, and I would want reproducibility so I know I can stand up the exact same thing if one of the replicas go down as well as for higher-fidelity testing in lower environments. And since we have that kind of reproducibility, there's no significant difference in operational work between running fewer larger servers and more smaller servers.
"Your product asks will consistently want to combine these data sources (they don't know how your distributed databases look, and oftentimes they really do not care)."
This isn't a problem if state is properly divided along proper business domains and the people who need to access the data have access to it. In fact, many use cases require it - publicly traded companies can't let anyone in the organization access financial info, and healthcare companies can't let anyone access patient data. And of course there are performance concerns as well if anyone in the organization can arbitrarily execute queries on any of the organization's data.
I would say YAGNI applies to data segregation as well and separations shouldn't be introduced until they are necessary.
"combine these data sources" doesn't necessarily mean data analytics. Just as an example, it could be something like "show a badge if it's the user's birthday", which if you had a separate microservice for birthdays would be much harder than joining a new table.
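To make the birthday example concrete, here's a toy sketch with an invented schema (sqlite stand-in): when users and birthdays live in one database, the product ask is a single join rather than a fan-out to a separate birthday service.

```python
import sqlite3

# Toy illustration with made-up tables: one database, one join.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE birthdays (user_id INTEGER REFERENCES users(id),
                            month INTEGER, day INTEGER);
    INSERT INTO users VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO birthdays VALUES (1, 6, 15), (2, 12, 1);
""")

def users_with_badge(month, day):
    # One query answers the product ask; no cross-service RPC, no
    # consistency worries between two separate stores.
    rows = db.execute("""
        SELECT u.name FROM users u
        JOIN birthdays b ON b.user_id = u.id
        WHERE b.month = ? AND b.day = ?
    """, (month, day))
    return [name for (name,) in rows]

print(users_with_badge(6, 15))  # ['alice']
```

With a separate birthday microservice, the same feature means an RPC per user (or a batch endpoint), plus handling the case where that service is down.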
At my current job we have four different databases, so I concur with this assessment. I think it's okay to have some data in different DBs if they're significantly different - say, user login data could be in its own database. But anything we do that is a combination of e-commerce and testing/certification should be in one big database so I can run reasonable queries for the information we need. This doesn't include two other databases we have on-prem: one is a Salesforce setup and the other is an internal application system that essentially marries Salesforce to it. It's a weird, wild environment to navigate when adding features.
> Your product asks will consistently want to combine these data sources (they don't know how your distributed databases look, and oftentimes they really do not care).
I'm not sure how to parse this. What should "asks" be?
Mostly agree, but you have to be very strict with the DB architecture. Have very reasonable schema. Punish long running queries. If some dev group starts hammering the DB cut them off early on, don't let them get away with it and then refuse to fix their query design.
The biggest nemesis of big DB approach are dev teams who don't care about the impact of their queries.
Also move all the read-only stuff that can be a few minutes behind to a separate (smaller) server with custom views updated in batches (e.g. product listings). And run analytics out of peak hours and if possible in a separate server.
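The "custom views updated in batches" idea can be sketched like this (sqlite stand-in with made-up tables; in production this refresh would run on the separate read server from a scheduled job, out of peak hours):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE reviews (product_id INTEGER, rating INTEGER);
    INSERT INTO products VALUES (1, 'widget');
    INSERT INTO reviews VALUES (1, 4), (1, 5);
""")

def refresh_listing():
    # Batch job: rebuild a denormalized listing table so OLTP traffic
    # never pays for the aggregation. Readers see data a few minutes old.
    db.executescript("""
        DROP TABLE IF EXISTS product_listing;
        CREATE TABLE product_listing AS
        SELECT p.id, p.name,
               AVG(r.rating) AS avg_rating,
               COUNT(r.rating) AS review_count
        FROM products p LEFT JOIN reviews r ON r.product_id = p.id
        GROUP BY p.id, p.name;
    """)

refresh_listing()
print(db.execute("SELECT name, avg_rating FROM product_listing").fetchall())
# [('widget', 4.5)]
```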
The rule is: keep related data together. Exceptions: different customers (who usually don't require each other's data) can be isolated, and if the database becomes the bottleneck you can separate unrelated services.
Surely having separate DBs all sit on the One Big Server is preferable in many cases. For cases where you really need to extract large amounts of data derived from multiple DBs, there's no real harm in having some cross-DB joins defined in views somewhere. If there are sensible logical ways to break a monolithic service into component stand-alone services, and good business reasons to do so (or it's already been designed that way), then having each talk to its own DB on a shared server should be able to scale pretty well.
If you get your services right there is little or no communication between the services, since a microservice should have all the data it needs in its own store.
Hardware engineers are pushing the absolute physical limits of getting state (memory/storage) as close as possible to compute. A monumental accomplishment as impactful as the invention of agriculture and the industrial revolution.
Software engineers: let's completely undo all that engineering by moving everything apart as far as possible. Hmmm, still too fast. Let's next add virtualization and software stacks with shitty abstractions.
Fast and powerful browser? Let's completely ignore 20 years of performance engineering and reinvent...rendering. Hmm, sucks a bit. Let's add back server rendering. Wait, now we have to render twice. Ah well, let's just call it a "best practice".
The mouse that I'm using right now (an expensive one) has a 2GB desktop Electron app that seems to want to update itself twice a week.
The state of us, the absolute garbage that we put out, and the creative ways in which we try to justify it. It's like a mind virus.
Actually, those who push for these cloudy solutions do so in part to bring data close to you. I am talking mostly about CDNs; I don't think YouTube and Netflix would have been possible without them.
Google is a US company, but you don't want people in Australia to connect to the other side of the globe every time they need to access Google services, it would be an awful waste of intercontinental bandwidth. Instead, Google has data centers in Australia to serve people in Australia, and they only hit US servers when absolutely needed. And that's when you need to abstract things out. If something becomes relevant in Australia, move it in there, and move it out when it no longer matters. When something big happens, copy it everywhere, and replace the copies by something else as interest wanes.
Big companies need to split everything, they can't centralize because the world isn't centralized. The problem is when small businesses try to do the same because "if Google is so successful doing that, it must be right". Scale matters.
Agreed and I think it's easier to compare tech to the movie industry. Just look at all the crappy movies they produce with IMDB ratings below 5 out of 10, that is movies that nobody's going to even watch; then there are the shitty blockbusters with expensive marketing and greatly simplified stories optimized for mindless blockbuster movie goers; then there are rare gems, true works of art that get recognized at festivals at best but usually not by the masses. The state of the movie industry is overall pathetic, and I see parallels with the tech here.
> Software engineers: let's completely undo all that engineering by moving everything apart as far as possible. Hmmm, still too fast. Let's next add virtualization and software stacks with shitty abstractions.
That's because the concept which is even more impactful than agriculture and the computer, and which makes them and everything else in our lives possible, is abstraction. It makes it possible to reason about large and difficult problems, to specialize, to have multiple people working on them.
Computer hardware is as full of abstraction and separation and specialization as software is. The person designing the logic for a multiplier unit has no more need to know how transistors are etched into silicon than a javascript programmer does.
Heh, there's a mention here of Andy and Bill's Law, "What Andy giveth, Bill taketh away," which is a reference to Andy Grove (Intel) and Bill Gates (Microsoft).
Since I have a long history with Sun Microsystems, upon seeing "Andy and Bill's Law" I immediately thought this was a reference to Andy Bechtolsheim (Sun hardware guy) and Bill Joy (Sun software guy). Sun had its own history of software bloat, with the latest software releases not fitting into contemporary hardware.
> The mouse that I'm using right now (an expensive one) has a 2GB desktop Electron app that seems to want to update itself twice a week.
I'm using a Logitech MX Master 3, and it comes with the "Logi Options+" to configure the mouse. I'm super frustrated with the cranky and slow app. It updates every other day and crashes often.
The experience is much better when I can configure the mouse with an open-source driver [^0] while using Linux.
I use Logi Options too, but while it's stable for me, it still uses a bafflingly high amount of CPU. But if I don't run Logi Options, then mouse buttons 3+4 stop working :-/
It's been like that for years.
Logitech's hardware is great, so I don't know why they think it's OK to push out such shite software.
Let me add fuel to the fire. When I started my career, users were happy to select among a handful of 8x8 bitmap fonts. Nowadays, users expect to see a scalable male-doctor-skin-tone-1 emoji. The former can be implemented by blitting 8 bytes from ROM. The latter requires an SVG engine -- just to render one character.
While bloatware cannot be excluded, let's not forget that user expectations have tremendously increased.
We're not a very serious industry, despite, uhm, it pretty much running the world. We're a joke. Sometimes I feel it doesn't even earn the term "engineering" at all, and rather than improving, it seems to get ever worse.
Which really is a stunning accomplishment in a backdrop of spectacular hardware advances, ever more educated people, and other favorable ingredients.
Software engineers don't want to be managing physical hardware and often need to run highly available services. When a team lacks the skill, geographic presence or bandwidth to manage physical servers but needs to deliver a highly-available service, I think the cloud offers legitimate improvements in operations with downsides such as increased cost and decreased performance per unit of cost.
> However, cloud providers have often had global outages in the past, and there is no reason to assume that cloud datacenters will be down any less often than your individual servers.
A nice thing about being in a big provider is when they go down a massive portion of the internet goes down, and it makes news headlines. Users are much less likely to complain about your service being down when it's clear you're just caught up in the global outage that's affecting 10 other things they use.
This is a huge one -- value in outsourcing blame. If you're down because of a major provider outage in the news, you're viewed more as a victim of a natural disaster rather than someone to be blamed.
I hear this repeated so many times at my workplace, and it's so totally and completely uninformed.
Customers who have invested millions of dollars into making their stack multi-region, multi-cloud, or multi-datacenter aren't going to calmly accept the excuse that "AWS Went Down" when you can't deliver the services you contractually agreed to deliver. There are industries out there where having your service casually go down a few times a year is totally unacceptable (Healthcare, Government, Finance, etc). I worked adjacent to a department that did online retail a while ago and even an hour of outage would lose us $1M+ in business.
Agreed. Recently I was discussing the same point with a non-technical friend who was explaining that his CTO had decided to move from Digital Ocean to AWS, after DO experienced some outage. Apparently the CEO is furious at him and has assumed that DO are the worst service provider because their services were down for almost an entire business day. The CTO probably knows that AWS could also fail in a similar fashion, but by moving to AWS it becomes more or less an Act of God type of situation and he can wash his hands of it.
I find this entire attitude disappointing. Engineering has moved from "provide the best reliability" to "provide the reliability we won't get blamed for the failure of". Folks who have this attitude missed out on the dang ethics course their college was teaching.
If rolling your own is faster, cheaper, and more reliable (it is), then the only justification for cloud is assigning blame. But you know what you also don't get? Accolades.
I throw a little party of one here when Office 365 or Azure or AWS or whatever Google calls its cloud products this week is down but all our staff are able to work without issue. =)
If you work in B2B you can put the blame on Amazon and your customers will ask "understandable, take the necessary steps to make sure it doesn't happen again". AWS going down isn't an act of God, it's something you should've planned for, especially if it happened before.
I don't really have much to do with contracts - but my company states that we have uptime of 99.xx%.
In terms of the contract, customers don't care if I have Azure/AWS or I keep my server in a box under the stairs. Yes, they do due diligence and would not buy my services if I kept it in a shoe box.
But then if they lose business they come to me. I can go after Azure/AWS, but I am so small they will throw some free credits at me and tell me to go away.
Maybe if you are in the B2C area then yeah - your customers will probably shrug and say it was M$ or Amazon if you write a sad blog post with excuses.
Users are much more sympathetic to outages when they're widespread. But, if there's a contractual SLA then their sympathy doesn't matter. You have to meet your SLA. That usually isn't a big problem as SLAs tend to account for some amount of downtime, but it's important to keep the SLA in mind.
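For a sense of how little slack a typical SLA leaves, a quick back-of-the-envelope (the percentages below are illustrative, not from any particular contract):

```python
# Quick back-of-the-envelope: how much downtime a given SLA actually permits.
def downtime_allowed(sla_pct, period_hours):
    """Hours of downtime permitted per period at a given SLA percentage."""
    return period_hours * (1 - sla_pct / 100)

monthly = downtime_allowed(99.9, 720)    # 30-day month at three nines: ~0.72 h
yearly = downtime_allowed(99.99, 8760)   # full year at four nines: ~0.88 h
print(round(monthly * 60), round(yearly * 60))  # minutes: 43 53
```

A single multi-hour provider outage can blow an entire year's four-nines budget, sympathy or not.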
There is also the consideration that this isn't even an argument of "other things are down too!" or "outsourcing blame" as much as, depending on what your service is of course, you are unlikely to be operating in a bubble. You likely have some form of external dependencies, or you are an external dependency, or have correlated/cross-dependency usage with another service.
Guaranteeing isolation between all of these different moving parts is very difficult. Even if you're not directly affected by a large cloud outage, it's becoming less and less common that you, or your customers, are truly isolated.
As well, if your AWS-hosted service mostly exists to service AWS-hosted customers, and AWS is down, it doesn't matter if you are down. None of your customers are operational anyways. Is this a 100% acceptable solution? Of course not. But for 95% of services/SaaS out there, it really doesn't matter.
Depends on how technical your customer base is. Even as a developer I would tend not to ascribe too much signal to that message. All it tells me is that you don't use AWS.
"We stayed online when GCP, AWS, and Azure go down" is a different story. On the other hand, if those three go down simultaneously, I suspect the state of the world will be such that I'm not worried about the internet.
You also have to factor in the complexity of running thousands of servers vs running just one server. If you run just one server, it's unlikely to go down even once in its lifetime. Meanwhile, cloud providers are guaranteed to have outages due to the sheer complexity of managing thousands of servers.
When migrating from [no-name CRM] to [big-name CRM] at a recent job, the manager pointed out that when [big-name CRM] goes down, it's in the Wall Street Journal, and when [no-name] goes down, it's hard to get their own Support Team to care!
No. Your users have no idea that you rely on AWS (they don't even know what it is), and they don't think of it as a valid or reasonable excuse as to why your service is down.
If you are not maxing out or even getting above 50% utilization of 128 physical cores (256 threads), 512 GB of memory, and 50 Gbps of bandwidth for $1,318/month, I really like the approach of multiple low-end consumable computers as servers. I have been using arrays of Intel NUCs at some customer sites for years with considerable cost savings over cloud offerings. Keep an extra redundant one in the array ready to swap out a failure.
Another often overlooked option is that in several fly-over states it is quite easy and cheap to register as a public telecommunication utility. This allows you to place a powered pedestal in the public right-of-way, where you can get situated adjacent to an optical meet point and get considerable savings on installation costs of optical Internet, even from a tier 1 provider. If your server bandwidth is peak utilized during business hours and there is an apartment complex nearby you can use that utility designation and competitively provide residential Internet service to offset costs.
> competitively provide residential Internet service to offset costs.
I uh. Providing residential Internet for an apartment complex feels like an entire business in and of itself and wildly out of scope for a small business? That's a whole extra competency and a major customer support commitment. Is there something I'm missing here?
It depends on the scale - it does not have to be a major undertaking. You are right, it is a whole extra competency and a major customer support commitment, but for a lot of the entrepreneurial folk on HN quite a rewarding and accessible learning experience.
The first time I did anything like this was in late 1984 in a small town in Iowa where GTE was the local telecommunication utility. Absolutely abysmal Internet service, nothing broadband from them at the time or from the MSO (Mediacom). I found out there was a statewide optical provider with cable going through the town. I incorporated an LLC, became a utility and built out less than 2 miles of single mode fiber to interconnect some of my original software business customers at first. Our internal motto was "how hard can it be?" (more as a rebuke to GTE). We found out. The whole 24x7 public utility thing was very difficult for just a couple of guys. But it grew from there. I left after about 20 years and today it is a thriving provider.
Technology has made the whole process so much easier today. I am amazed more people do not do it. You can get a small rack-mount sheet metal pedestal with an AC power meter and an HVAC unit for under $2k. Being a utility will allow you to place that on a concrete pad or vault in the utility corridor (often without any monthly fee from the city or county). You place a few bollards around it so no one drives into it. You want to get quotes from some tier 1 providers [0]. They will help you identify the best locations to engineer an optical meet and those are the locations you run by the city/county/state utilities board or commission.
For a network engineer wanting to implement a fault tolerant network, you can place multiple pedestals at different locations on your provider's/peer's network to create a route diversified protected network.
After all, when you are buying expensive cloud based services that literally is all your cloud provider is doing ... just on a completely more massive scale. The barrier to entry is not as high as you might think. You have technology offerings like OpenStack [1], where multiple competitive vendors will also help you engineer a solution. The government also provides (financial) support [2].
The best perk is the number of parking spaces the requisite orange utility traffic cone opens up for you.
You're missing "apartment complex" - you as the service provider contract with the apartment management company to basically cover your costs, and they handle the day-to-day along with running the apartment building.
Done right, it'll be cheaper for them (they can advertise "high speed internet included!" or whatever) and you won't have much to do assuming everything on your end just works.
The days where small ISPs provided things like email, web hosting, etc, are long gone; you're just providing a DHCP IP and potentially not even that if you roll out carrier-grade NAT.
I have only done a few midwestern states. Call them and ask [0] - (919) 733-7328. You may want to first call your proposed county commissioner's office or city hall (if you are not rural), and ask them who to talk with about a new local business providing Internet service. If you can show the Utilities Commission that you are working with someone at the local level I have found they will treat you more seriously. In certain rural counties, you can even qualify for funding from the Rural Utilities Service of the USDA.
EDIT: typos + also most states distinguish between facilities-based ISP's (ie with physical plant in the regulated public right-of-way) and other ISPs. Tell them you are looking to become a facilities-based ISP.
We have a different take on running "one big database." At ScyllaDB we prefer vertical scaling because you get better utilization of all your vCPUs, but we still will keep a replication factor of 3 to ensure that you can maintain [at least] quorum reads and writes.
So we would likely recommend running 3x big servers. For those who want to plan for failure, though, they might prefer to have 6x medium servers, because then the loss of any one means you don't take as much of a "torpedo hit" when any one server goes offline.
So it's a balance. You want to be big, but you don't want to be monolithic. You want an HA architecture so that no one node kills your entire business.
I also suggest that people planning systems create their own "torpedo test." We often benchmark to tell maximal optimum performance, presuming that everything is going to go right.
But people who are concerned about real-world outage planning may want to "torpedo" a node to see how a 2-out-of-3-nodes-up cluster operates, versus a 5-out-of-6-nodes-up cluster.
This is like failure planning for jets: can you keep flying on 2 of 3 engines, or 1 of 2?
Obviously, if you have 1 engine, there is nothing you can do if you lose that single point of failure. At that point, you are updating your resume, and checking on the quality of your parachute.
I think this is the right approach, and I really admire the work you do at ScyllaDB. For something truly critical, you really do want to have multiple nodes available (at least 2, and probably 3 is better). However, you really should want to have backup copies in multiple datacenters, not just the one.
Today, if I were running something that absolutely needed to be up 24/7, I would run a 2x2 or 2x3 configuration with async replication between primary and backup sites.
Exactly. Regional distribution can be vital. Our customer Kiwi.com had a datacenter fire. 10 of their 30 nodes were turned to a slag heap of ash and metal. But 20 of 30 nodes in their cluster were in completely different datacenters so they lost zero data and kept running non-stop. This is a rare story, but you do NOT want to be one of the thousands of others that only had one datacenter, and their backups were also stored there and burned up with their main servers. Oof!
Well said. Caring about vertical scale doesn't mean you have to throw out a lot of the lessons learned about still being horizontally scalable or high availability.
Some comments wrongly equate bare-metal with on-premise. Bare-metal servers can be rented, colocated, or installed on-premise.
Also, when renting, the company takes care of hardware failures. Furthermore, as hard disk failures are the most common issue, you can have hot spares and opt to let damaged disks rot, instead of replacing them.
For example, in ZFS, you can mirror disks 1 and 2, while having 3 and 4 as hot spares, with the following command:
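A plausible form of that command (the `/dev/sd*` paths below are placeholders; prefer stable `/dev/disk/by-id` names in practice):

```shell
# Create a pool "tank" with disks 1 and 2 mirrored,
# and disks 3 and 4 attached as hot spares.
zpool create tank mirror /dev/sda /dev/sdb spare /dev/sdc /dev/sdd
```

With `autoreplace=on` set on the pool, a failed mirror member is resilvered onto a spare automatically, so the dead disk can be left in place.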
Disregarding the security risks of multi-tenant cloud instances, bare-metal is more cost-effective once your cloud bill exceeds $3,000 per year, which is the cost of renting two bare-metal servers.
---
Here's how you can create a two-server infrastructure:
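One common pattern for this (a sketch under assumptions, not a prescription): two rented bare-metal machines in active/passive failover, sharing a floating IP via keepalived/VRRP. The interface name and IP below are placeholders.

```shell
# On both servers (Debian/Ubuntu shown):
apt-get install keepalived

# Primary's /etc/keepalived/keepalived.conf; on the backup, set
# state BACKUP and a lower priority (e.g. 90).
cat > /etc/keepalived/keepalived.conf <<'EOF'
vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100
    virtual_ipaddress {
        203.0.113.10    # placeholder floating IP
    }
}
EOF

systemctl restart keepalived
```

Both machines run the full stack; if the primary stops answering VRRP advertisements, the backup claims the floating IP within a few seconds.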
IMO microservices primarily solve organizational problems, not technical problems.
They allow a team to release independently of other teams that have or want to make different risk/velocity tradeoffs. Also smaller units being released means fewer changes and likely fewer failed releases.
I have been doing this for two decades. Let me tell you about bare metal.
Back in the day we had 1,000 physical servers to run a large scale web app. 90% of that capacity was used only for two months. So we had to buy 900 servers just to make most of our money over two events in two seasons.
We also had to have 900 servers because even one beefy machine has bandwidth and latency limits. Your network switch simply can't pump more than a set amount of traffic through its backplane or your NICs, and the OS may have piss-poor packet performance too. Lots of smaller machines allow easier scaling of network load.
But you can't just buy 900 servers. You always need more capacity, so you have to predict what your peak load will be, and buy for that. And you have to do it well in advance because it takes a long time to build and ship 900 servers and then assemble them, run burn-in, replace the duds, and prep the OS, firmware, software. And you have to do this every 3 years (minimum) because old hardware gets obsolete and slow, hardware dies, disks die, support contracts expire. But not all at once, because who knows what logistics problems you'd run into and possibly not get all the machines in time to make your projected peak load.
If back then you told me I could turn on 900 servers for 1 month and then turn them off, no planning, no 3 year capital outlay, no assembly, burn in, software configuration, hardware repair, etc etc, I'd call you crazy. Hosting providers existed but nobody could just give you 900 servers in an hour, nobody had that capacity.
And by the way: cloud prices are retail prices. Get on a savings plan or reserve some instances and the cost can be half. Spot instances are a quarter or less the price. Serverless is pennies on the dollar with no management overhead.
If you don't want to learn new things, buy one big server. I just pray it doesn't go down for you, as it can take up to several days for some cloud vendors to get some hardware classes in some regions. And I pray you were doing daily disk snapshots, and can get your dead disks replaced quickly.
The thing that confuses me is, isn't every publicly accessible service bursty on a long timescale? Everything looks seasonal and predictable until you hit the front page of Reddit, and you don't know what day that will be. You don't decide how much traffic you get, the world does.
> I have been doing this for two decades. Let me tell you about bare metal.
> Back in the day we had 1,000 physical servers to run a large scale web app. 90% of that capacity was used only for two months. So we had to buy 900 servers just to make most of our money over two events in two seasons.
> We also had to have 900 servers because even one beefy machine has bandwidth and latency limits. Your network switch simply can't pump more than a set amount of traffic through its backplane or your NICs, and the OS may have piss-poor packet performance too. Lots of smaller machines allow easier scaling of network load.
I started working with real (bare metal) servers on real internet loads in 2004 and retired in 2019. While there's truth here, there's also missing information. In 2004, all my servers had 100M ethernet, but in 2019, all my new servers had 4x10G ethernet (2x public, 2x private), actually some of them had 6x, but with 2x unconnected, I dunno why. In the meantime, CPUs, NICs, and operating systems have improved such that if you're not getting line rate for full MTU packets, it's probably because your application uses a lot of CPU, or you've hit a pathological case in the OS (which happens, but if you're running 1000 servers, you've probably got someone to debug that).
If you still need 1000 beefy 10G servers, you've got a pretty formidable load, but splitting it up into many more smaller servers is asking for problems of different kinds. Otoh, if your load really scales to 10x for a month, and you're at that scale, cloud economics are going to work for you.
My seasonal loads were maybe 50% more than normal, but usage trends (and development trends) meant that the seasonal peak would become the new normal soon enough; cloud managing the peaks would help a bit, but buying for the peak and keeping it running for the growth was fine. Daily peaks were maybe 2-3x the off-peak usage, 5 or 6 days a week; a tightly managed cloud provisioning could reduce costs here, but probably not enough to compete with having bare metal for the full day.
Let me take you back to March, 2020. When millions of Americans woke up to find out there was a pandemic and they would be working from home now. Not a problem, I'll just call up our cloud provider and request more cloud compute. You join a queue of a thousand other customers calling in that morning for the exact same thing. A few hours on hold and the CSR tells you they aren't provisioning any more compute resources. east-us is tapped out, central-europe tapped out hours ago, California got a clue and they already called to reserve so you can't have that either.
I use cloud all the time but there are also blackswan events where your IaaS can't do anymore for you.
I never had this problem on AWS though I did see some startups struggle with some more specialized instances. Are midsize companies actually running into issues with non-specialized compute on AWS?
That's a good point about cloud services being retail. My company gets a very large discount from one of the most well-known cloud providers. This is available to everybody - typically if you commit to 12 months of a minimum usage then you can get substantial discounts. What I know is so far everything we've migrated to the cloud has resulted in significantly reduced total costs, increased reliability, improved scalability, and is easier to enhance and remediate. Faster, cheaper, better - that's been a huge win for us!
The entire point of the article is that your dated example no longer applies: you can fit the vast majority of common loads on a single server now, they are this powerful.
Redundancy concerns are also addressed in the article.
> If you don't want to learn new things, buy one big server. I just pray it doesn't go down for you
You are taking this a bit too literally. The article itself says one server (and backups).
So "one" here just means a small number not literally no fallback/backup etc. (obviously... even people you disagree with are usually not morons)
> If you don't want to learn new things, buy one big server. I just pray it doesn't go down for you
There's intermediate ground here. Rent one big server, reserved instance. Cloudy in the sense that you get the benefits of the cloud provider's infrastructure skills and experience, and uptime, plus easy backup provisioning; non-cloudy in that you can just treat that one server instance like your own hardware, running (more or less) your own preferred OS/distro, with "traditional" services running on it (e.g. in our case: nginx, gitea, discourse, mantis, ssh)
I handled an 8x increase in traffic to my website from a YouTuber reviewing our game by increasing the cache timer and fixing the wiki, which was creating session table entries for logged-out users even though it required accounts to edit.
We were already getting multiple millions of page hits a month before this happened.
This server had 8 cores, but 5 of them were reserved for the game servers running on the same machine, which push 10 TB a month in bandwidth.
If you needed 1,000 physical computers to run your webapp, you fucked up somewhere along the line.
Once you get to that point, it becomes SUPER hard to start splitting things out. All of a sudden you have 10,000 "just a one off" queries against several domains that break when you try to carve out a domain into a single owner.
People often feel they should have anticipated and avoided the scaling issues altogether, but moving from a single DB to a master/replica model, and/or shards or other solutions, is fairly doable, and it doesn't come with worse tradeoffs than if you had sharded/split services from the start. It always feels fragile and bolted on compared to the elegance of the single DB, but you'd also have needed many dirty hacks to make a multi-DB setup work properly.
Also, you do that from a position where you usually have money, resources and a good knowledge of your core parts, which is not true when you're still growing full speed.
I've fortunately/unfortunately never yet been involved in a project that we couldn't comfortably host using one big write master and a handful of read slaves.
Maybe one day a project I'm involved with will approach "FAANG scale" where that stops working, but you can 100% run 10s of millions of dollars a month in revenue with that setup, at least in a bunch of typical web/app business models.
Early on I did hit the "OMG, we're cooking our database" point where we needed to add read caching. When I first did that, memcached was still written in Perl. So that joined my toolbox very early on (sometime in the late 90s).
Once read caching started to not keep up, it was easy enough to make the read cache/memcached layer understand and distribute reads across read slaves. I remember talking to Monty Widenius at The Open Source Conference, I think in San Jose around 2001 or so, about getting MySQL replication to use SSL so I could safely replicate to read slaves in Sydney and London from our write master in PAIX.
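The read-through pattern described above is simple enough to sketch. Here a plain dict stands in for memcached, and the cache is populated on miss; names are illustrative, and TTLs/eviction are omitted:

```python
import sqlite3

class ReadThroughCache:
    """Read-through cache in front of a DB, memcached-style (sketch:
    a dict stands in for the cache; no TTL or eviction shown)."""
    def __init__(self, conn):
        self.conn = conn
        self.cache = {}
        self.hits = self.misses = 0

    def get_user(self, user_id):
        key = f"user:{user_id}"
        if key in self.cache:           # serve from cache
            self.hits += 1
            return self.cache[key]
        self.misses += 1                # fall through to the database
        row = self.conn.execute(
            "SELECT name FROM users WHERE id = ?", (user_id,)).fetchone()
        self.cache[key] = row           # populate on miss
        return row

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'monty')")
c = ReadThroughCache(conn)
c.get_user(1); c.get_user(1)
print(c.hits, c.misses)  # 1 1
```

The later step of distributing reads across replicas slots in naturally: the miss path picks a read slave instead of the single connection.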
I have twice committed the sin of premature optimisation and sharded databases "because this one was _for sure_ going to get too big for our usual database setup". It only ever brought unneeded grief and never actually proved necessary.
That is, one huge table keyed by (for instance) alphabet and when the load gets too big you split it into a-m and n-z tables, each on either their own disk or their own machine.
Then just keep splitting it like that. All of your application logic stays the same … everything stays very flat and simple … you just point different queries to different shards.
I like this because the shards can evolve from their own disk IO to their own machines… and later you can reassemble them if you acquire faster hardware, etc.
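A sketch of that routing logic, assuming alphabet-range shards. The split boundaries are just configuration, so "keep splitting" means adding a boundary while application logic stays the same:

```python
def shard_for(key: str, splits=("m",)) -> str:
    """Route a key to an alphabet-range shard. splits=('m',) gives two
    shards, a-m and n-z; more boundaries give more shards. (Illustrative
    sketch -- real keys need normalization for digits/punctuation.)"""
    first = key[0].lower()
    lo = "a"
    for hi in splits:
        if first <= hi:
            return f"{lo}-{hi}"
        lo = chr(ord(hi) + 1)  # next range starts after this boundary
    return f"{lo}-z"

print(shard_for("alice"))                    # a-m
print(shard_for("zeke"))                     # n-z
print(shard_for("kim", splits=("g", "p")))   # h-p
```

Reassembling shards onto faster hardware later is the same operation in reverse: remove a boundary and merge the tables.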
Maybe, but if you split it from the start you die by a thousand cuts, and likely pay the cost up front, even if you’d never get to the volumes that’d require a split.
But that's survivorship bias and looking back at things from current problems perspective.
You know what's the least future-proof and scalable project? The one that gets canceled because it failed to deliver any value in a reasonable time in the early phase. Once you get to "huge project status" you can afford a glacial pace. Most of the time you can't afford that early on - so even if by some miracle you knew what scaling issues you were going to have long term and invested in fixing them early, it's rarely been a good tradeoff in my experience.
I've seen more projects fail because they tangle themselves up in unnecessary complexity early on and fail to execute on the core value proposition than I've seen fail from being unable to manage the tech debt 10 years in. Developers like to complain about the second, but they get fired for the first kind. Unfortunately, in today's job market they just resume-pad their failures as "relevant experience" and move on to the next project - so there is no correcting feedback.
FastComments runs on one big DB in each region, with a hot backup... no issues yet.
Before you go to microservices you can also shard, as others have mentioned.
I'm sure they eventually moved off that single primary box, but for many years Bitbucket was run off 1 primary in each datacenter (with a failover), and a few read-only copies. If you're getting to the point where one database isn't enough, you're either doing something pretty weird, are working on a specific problem which needs a more complicated setup, or have grown to the point where investing in a microservice architecture starts to make sense.
I'd be interested to know if anyone has a good solution for that.
In my mind, reasons involve keeping transactional integrity, ACID compliance, better error propagation, avoiding the hundreds of impossible to solve roadblocks of distributed systems (https://groups.csail.mit.edu/tds/papers/Lynch/MIT-LCS-TM-394...).
But it is also about pushing the limits of what is physically possible in computing. As Admiral Grace Hopper would point out (https://www.youtube.com/watch?v=9eyFDBPk4Yw), covering distance over network wires imposes hard latency constraints, not to mention congestion on those wires.
Physical efficiency is about keeping data close to where it's processed. Monoliths can make much better use of L1, L2, L3, and ram caches than distributed systems for speedups often in the order of 100X to 1000X.
Sure it's easier to throw more hardware at the problem with distributed systems but the downsides are significant so be sure you really need it.
Now there is a corollary to using monoliths. Since you only have one DB, that DB should be treated as somewhat sacred: you want to avoid wasting resources inside it. This means being a bit more careful about how you store things, using the smallest data structures, normalizing when you can, etc. This is not to save disk; disk is cheap. It is to make efficient use of L1, L2, L3, and RAM.
I've seen boolean true-or-false values saved as large JSON documents, e.g. {"usersetting1": true, "usersetting2": false, "setting1name": "name", ...}, with 10 bits of data ending up as a 1 KB JSON document. Avoid this! Storing documents means the keys, effectively the full table schema, are repeated in every row. It has its uses, but if you can predefine your schema and use the smallest types needed, you gain much performance, mostly through much higher cache efficiency!
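To put rough numbers on it, here are the same 10 boolean settings stored as a JSON document versus a 2-byte bitfield in a fixed-schema column (the setting names are hypothetical):

```python
import json

# Ten boolean user settings, two ways.
settings = {f"usersetting{i}": (i % 2 == 0) for i in range(10)}

# As a JSON document: every row repeats every key name.
as_json = json.dumps(settings).encode()

# As a bitfield: 10 bits fit in 2 bytes, schema lives in the table, not the row.
bits = 0
for i in range(10):
    if settings[f"usersetting{i}"]:
        bits |= 1 << i
as_bitfield = bits.to_bytes(2, "little")

# The JSON form is two orders of magnitude larger per row.
print(len(as_json), len(as_bitfield))
```

That size difference multiplies straight through your cache hierarchy: a page of bitfield rows holds ~100x more users than a page of JSON rows.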
It's not though. You're just seeing the most popular opinion on HN.
In reality it is nuanced like most real-world tech decisions are. Some use cases necessitate a distributed or sharded database, some work better with a single server and some are simply going to outsource the problem to some vendor.
My hunch is that computers caught up. Back in the early 2000s, horizontal scaling was the only way; you simply couldn't handle even reasonably mediocre loads on a single machine.
As computing becomes cheaper, horizontal scaling is starting to look more and more like unnecessary complexity for even surprisingly large/popular apps.
I mean you can buy a consumer off-the-shelf machine with 1.5TB of memory these days. 20 years ago, when microservices started gaining popularity, 1.5TB RAM in a single machine was basically unimaginable.
If you're in a k8s pod, those calls are really kernel calls. Sure, you're serializing and process-switching where you could be just making a method call, but we had to do something.
I'm seeing fewer "balls of mud" with microservices. That's not zero balls of mud, but it's not a given for almost every code base I wander into.
Yup, this is what I've always done and it works wonders. Since I do not have bosses, just clients, I do not give a flying fuck about the latest fashion and do what actually makes sense for me and said clients.
Distributed systems can also make efficient use of cache, in fact they can do more of it because they have more of it by having more nodes. If you get your dataflow right then you'll have performance that's as good as a monolith on a tiny dataset but keep that performance as you scale up. Not only that, but you can perform a lot better than an ACID system ever could, because you can do things like asynchronously updating secondary indices after the data is committed. But most importantly you have easy failover from day 1, you have easy scaling from day 1, and you can just not worry about that and focus on your actual business problem.
Relational databases are largely a solution in search of a problem, at least for web systems. (They make sense as a reporting datastore to support ad-hoc exploratory queries, but there's never a good reason to use them for your live/"OLTP" data).
Even accounting for CDNs, a distributed system is inherently more capable of bringing data closer to geographically distributed end users, thus lowering latency.
Pretty much every CRUD app needs this at some point and if every join needs a network call your app is going to suck to use and suck to develop.
The patterns section covers all of this and more
And yet developers do this every single day without any issue.
It is bad practice to have your authentication database be the same as your app database. Or you have data coming from SaaS products, third party APIs or a cloud service. Or even simply another service in your stack. And with complex schemas often it's far easier to do that join in your application layer.
All of these require a network call and join.
_at some point_ is the key word here.
Most startups (and businesses) can likely get away with this well into Series A or Series B territory.
I emphatically disagree.
I've seen this evolve into tightly coupled microservices that could be deployed independently in theory, but required exquisite coordination to work.
If you want them to be on a single server, that's fine, but having multiple databases or schemas will help enforce separation.
And, if you need one single place for analytics, push changes to that space asynchronously.
Having said that, I've seen silly optimizations being employed that make sense when you are Twitter, and to nobody else. Slice services up to the point they still do something meaningful in terms of the solution and avoid going any further.
What it fundamentally boils down to is that your org chart determines your architecture. We had a single team in charge of the monolith, and it was ok, and then we wanted to add teams and it broke down. On the microservices architecture, we have many teams, which can work independently quite well, until there is a big project that needs coordinated changes, and then the fun starts.
Like always there is no advice that is absolutely right. Monoliths, microservices, function stores. One big server vs kubernetes. Any of those things become the right answer in the right context.
Although I’m still in favor of starting with a modular monolith and splitting off services when it becomes apparent they need to change at a different pace from the main body. That is right in most contexts I think.
Use One Big Database Server…
… and on it, use one software database per application.
For example, one Postgres server can host many databases that are mostly* independent from each other. Each application or service should have its own database and be unaware of the others, communicating with them via the services if necessary. This makes splitting up into multiple database servers fairly straightforward if needed later. In reality most businesses will have a long tail of tiny databases that can all be on the same server, with only bigger databases needing dedicated resources.
*you can have interdependencies when you’re using deep features sometimes, but in an application-first development model I’d advise against this.
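One way to keep that discipline is to make the DSN the only thing a service knows about its database; splitting a hot database onto its own server later is then a config change, not a code change. A sketch (service names and hostnames are made up):

```python
# One database server, one logical database per service.
# Each service looks up only its own DSN and never touches the others.
DSNS = {
    "billing": "postgresql://db.internal:5432/billing",
    "auth":    "postgresql://db.internal:5432/auth",
    "catalog": "postgresql://db.internal:5432/catalog",
}

def dsn_for(service: str) -> str:
    return DSNS[service]

# Later, when billing outgrows the shared box: change one line of config
# to point it at dedicated hardware. No application code changes.
DSNS["billing"] = "postgresql://billing-db.internal:5432/billing"
print(dsn_for("billing"))
```

This is exactly the "mostly independent" property above: the shared server is an operational convenience, not something the applications can observe.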
If you are creating microservices, you must segment them all the way through.
So you can end up with those services living on separate machines and connecting to read-only DB replicas, for virtually limitless scalability. And when a service realizes it needs to do an update, it either switches the DB connection to a master, or forwards the whole request to another instance connected to a master DB.
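A naive sketch of that read/write routing (real proxies such as ProxySQL do statement classification far more carefully; this deliberately just sniffs for SELECT):

```python
import re

class ConnectionRouter:
    """Route reads to a replica pool, writes to the primary (sketch).
    Hostnames are made up; statement sniffing here is deliberately naive
    and ignores e.g. SELECT ... FOR UPDATE."""
    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = replicas
        self._i = 0

    def route(self, sql: str) -> str:
        if re.match(r"\s*select\b", sql, re.IGNORECASE):
            self._i = (self._i + 1) % len(self.replicas)  # round-robin reads
            return self.replicas[self._i]
        return self.primary  # INSERT/UPDATE/DELETE/DDL go to the primary

r = ConnectionRouter("primary:5432", ["replica1:5432", "replica2:5432"])
print(r.route("SELECT * FROM users"))        # one of the replicas
print(r.route("UPDATE users SET name='x'"))  # primary:5432
```

The "forward the whole request" variant moves this decision up a layer: the replica-connected instance proxies write requests to an instance that holds a master connection.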
(1) Different programming languages, e.g. you've written your app in Java but now you need to do something for which the perfect Python library is available.
(2) Different parts of your software need different types of hardware. Maybe one part needs a huge amount of RAM for a cache, but other parts are just a web server. It'd be a shame to have to buy huge amounts of RAM for every server. Splitting the software up and deploying the different parts on different machines can be a win here.
I reckon the average startup doesn't need any of that, not suggesting that monoliths aren't the way to go 90% of the time. But if you do need these things, you can still go the microservices route, but it still makes sense to stick to a single database if at all possible, for consistency and easier JOINs for ad-hoc queries, etc.
99% of apps are best fit as monolithic apps and databases and should focus on business value rather than scale they'll never see.
What is the point of that? it doesn't add anything. Just more shit to remember and get right (and get wrong!)
yah, this is something i learned when designing my first server stack (using sun machines) for a real business back during the dot-com boom/bust era. our single database server was the beefiest machine by far in the stack, 5U in the rack (we also had a hot backup), while the other servers were 1U or 2U in size. most of that girth was for memory and disk space, with decent but not the fastest processors.
one big db server with a hot backup was our best tradeoff for price, performance, and reliability. part of the mitigation was that the other servers could be scaled horizontally to compensate for a decent amount of growth without needing to scale the db horizontally.
When you need to start sharding your database, having a proxy is like having a super power.
We see both use cases: single large database vs multiple small, decoupled. I agree with the sentiment that a large database offer simplicity, until access patterns change.
We focus on distributing database data to the edge using caching. Typically this eliminates read-replicas and a lot of the headache that goes with app logic rewrites or scaling "One Big Database".
[1] https://www.polyscale.ai/
Yep, with a passive replica or online (log) backup.
Keeping things centralized can reduce your hardware requirement by multiple orders of magnitude. The one huge exception is a traditional web service, those scale very well, so you may not even want to get big servers for them (until you need them).
Shard your datastore from day 1, get your dataflow right so that you don't need atomicity, and it'll be painless and scale effortlessly. More importantly, you won't be able to paper over crappy dataflow. It's like using proper types in your code: yes, it takes a bit more effort up-front compared to just YOLOing everything, but it pays dividends pretty quickly.
I know we're all hot and bothered about getting our apps to scale up to be the next unicorn, but most apps never need to scale past the limit of a single very high-performance database. For most people, this single huge DB is sufficient.
Also, for many (maybe even most) applications, designated outages for maintenance are not only acceptable, but industry standard. Banks have had, and continue to have designated outages all the time, usually on weekends when the impact is reduced.
Sure, what I just wrote is bad advice for mega-scale SaaS offerings with millions of concurrent users, but most of us aren't building those, as much as we would like to pretend that we are.
I will say that TWO of those servers, with some form of synchronous replication, and point in time snapshots, are probably a better choice, but that's hair-splitting.
(and I am a dyed in the wool microservices, scale-out Amazon WS fanboi).
At which point a new OneBigServer will be 100x as powerful, and all your upfront work will be for nothing.
What about using something like CockroachDB from day 1?
It's never one big database. Inevitably there are backups, replicas, testing environments, staging, development. In an ideal, unchanging world where nothing ever fails and workload is predictable, the one big database is also ideal.
What happens in the real world is that the one big database becomes such a roadblock to change and growth that organisations often throw away the whole thing and start from scratch.
But if you have many small databases, you need
> backups, replicas, testing environments, staging, development
all times `n`. Which doesn't sound like an improvement.
> What happens in the real world is that the one big database becomes such a roadblock to change and growth that organisations often throw away the whole thing and start from scratch.
Bad engineering orgs will snatch defeat from the jaws of victory no matter what the early architectural decisions were. The one-vs-many databases/services question is almost entirely moot.
https://cassandra.apache.org/_/cassandra-basics.html
You've scaled out, but now you have to ask each node when searching by a secondary index. If you're asking every node for your queries, you haven't really scaled horizontally. You've just increased complexity.
Now, maybe 95% of your queries can be handled with a clustering key and you just need secondary indexes to handle 5% of your stuff. In that case, Cassandra does offer an easy way to handle that last 5%. However, it can be problematic if people take shortcuts too much and you end up putting too much load on the cluster. You're also putting your latency for reads at the highest latency of all the machines in your cluster. For example, if you have 100 machines in your cluster with a mean response time of 2ms and a 99th percentile response time of 150ms, you're potentially going to be providing a bad experience to users waiting on that last box on secondary index queries.
This isn't to say that Cassandra isn't useful - Cassandra has been making some good decisions to balance the problems engineers face. However, it does come with trade-offs when you distribute the data. When you have a well-defined problem, it's a lot easier to design your data for efficient querying and partitioning. When you're trying to figure things out, the flexibility of a single machine and much cheaper secondary index queries can be important - and if you hit a massive scale, you figure out how you want to partition it then.
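The tail-latency effect of asking every node is easy to quantify: if each node independently blows its p99 1% of the time, the chance that at least one of n nodes is slow on a given query is 1 - 0.99^n:

```python
# Fan-out tail latency: a scatter-gather query is as slow as its
# slowest node, so the probability of hitting *some* node's bad tail
# grows quickly with cluster size.
def p_any_slow(n_nodes: int, p_slow: float = 0.01) -> float:
    return 1 - (1 - p_slow) ** n_nodes

for n in (1, 10, 100):
    print(n, round(p_any_slow(n), 3))
# 1 0.01
# 10 0.096
# 100 0.634
```

So on a 100-node cluster, roughly two out of three secondary-index queries experience at least one node's p99, which is why that "last 5%" of queries can dominate the user experience.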
I think it's an underrated idea. There's a lot of people out there building a lot of complexity for datasets that in the end are less than 100 TB.
But it also has limits. Infamously Twitter delayed going to a sharded architecture a bit too long, making it more of an ugly migration.
I do; it is running on the same big (relatively) server as my native C++ backend talking to the database. The performance smokes your standard cloudy setup big time: serving a thousand requests per second on 16 cores without breaking a sweat. I am all for monoliths running on real non-cloudy hardware. As long as the business scale is reasonable and does not approach FAANG (true for 90% of businesses), this solution is superior to everything else money-, maintenance-, and development-time-wise.
We solved the problem of collecting data from the various databases for end users by having a GraphQL layer which could integrate all the data sources. This turned out to be absolutely awesome. You could also do something similar using FDW. The effort was not significant relative to the size of the application.
The benefits of this architecture were manifold but one of the main ones is that it reduces the complexity of each individual database, which dramatically improved performance, and we knew that if we needed more performance we could pull those individual databases out into their own machine.
Then if/when it comes time for sharding, you probably only have to worry about one of those databases first, and you possibly shard it in a higher-level logical way that works for that kind of service (e.g. one smaller database per physical region of customers) instead of something at a lower level with a distributed database. Horizontally scaling DBs sound a lot nicer than they really are.
Nor should they; it's the engineer's/team's job to provide the database layer to them with a high level of service, without them having to know the details.
It may be reasonable to have two databases, e.g. a class A and a class B for PCI compliance. So context still deeply matters.
Also having a dev DB with mock data and a live DB with real data is a common setup in many companies.
Similarly, paying for EKS or GKE or the higher-level container offerings seems like a much better place to spend my resources than figuring out how to run infrastructure on bare VMs.
Every time I've seen a normal-sized firm running on VMs, they have one team who is responsible for managing the VMs, and either that team is expecting a Docker image artifact or they're expecting to manage the environment in which the application runs (making sure all of the application dependencies are installed in the environment, etc) which typically implies a lot of coordination between the ops team and the application teams (especially regarding deployment). I've never seen that work as smoothly as deploying to ECS/EKS/whatever and letting the ops team work on automating things at a higher level of abstraction (automatic certificate rotation, automatic DNS, etc).
That said, I've never tried the "one big server" approach, although I wouldn't want to run fewer than 3 replicas, and I would want reproducibility so I know I can stand up the exact same thing if one of the replicas go down as well as for higher-fidelity testing in lower environments. And since we have that kind of reproducibility, there's no significant difference in operational work between running fewer larger servers and more smaller servers.
This isn't a problem if state is properly divided along the proper business domain and the people who need to access the data have access to it. In fact many use cases require it - publicly traded companies can't let anyone in the organization access financial info and healthcare companies can't let anyone access patient data. And of course are performance concerns as well if anyone in the organization can arbitrarily execute queries on any of the organization's data.
I would say YAGNI applies to data segregation as well and separations shouldn't be introduced until they are necessary.
I'm not sure how to parse this. What should "asks" be?
The biggest nemesis of big DB approach are dev teams who don't care about the impact of their queries.
Also move all the read-only stuff that can be a few minutes behind to a separate (smaller) server with custom views updated in batches (e.g. product listings). And run analytics out of peak hours and if possible in a separate server.
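A "poor man's materialized view" version of that batch-updated listing, sketched with sqlite3 (the schema is made up; a real setup would run the refresh off-peak against the smaller read server):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, product TEXT, total REAL);
    INSERT INTO orders VALUES (1, 'widget', 10.0), (2, 'widget', 15.0),
                              (3, 'gadget', 7.5);
    -- Denormalized listing table, rebuilt in batches; readers hit this
    -- instead of the OLTP tables and may be a few minutes behind.
    CREATE TABLE product_listing (product TEXT PRIMARY KEY, revenue REAL);
""")

def refresh_listing(conn):
    # Truncate and rebuild inside one transaction, so readers never
    # observe a half-refreshed listing.
    with conn:
        conn.execute("DELETE FROM product_listing")
        conn.execute("""
            INSERT INTO product_listing
            SELECT product, SUM(total) FROM orders GROUP BY product
        """)

refresh_listing(conn)
print(conn.execute(
    "SELECT * FROM product_listing ORDER BY product").fetchall())
# [('gadget', 7.5), ('widget', 25.0)]
```

On Postgres the same idea is a `MATERIALIZED VIEW` plus a scheduled `REFRESH`, which keeps the expensive aggregation out of the OLTP hot path entirely.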
Nor should they.
Hardware engineers are pushing the absolute physical limits of getting state (memory/storage) as close as possible to compute. A monumental accomplishment as impactful as the invention of agriculture and the industrial revolution.
Software engineers: let's completely undo all that engineering by moving everything apart as far as possible. Hmmm, still too fast. Let's next add virtualization and software stacks with shitty abstractions.
Fast and powerful browser? Let's completely ignore 20 years of performance engineering and reinvent...rendering. Hmm, sucks a bit. Let's add back server rendering. Wait, now we have to render twice. Ah well, let's just call it a "best practice".
The mouse that I'm using right now (an expensive one) has a 2GB desktop Electron app that seems to want to update itself twice a week.
The state of us, the absolute garbage that we put out, and the creative ways in which we try to justify it. It's like a mind virus.
I want my downvotes now.
Google is a US company, but you don't want people in Australia to connect to the other side of the globe every time they need to access Google services, it would be an awful waste of intercontinental bandwidth. Instead, Google has data centers in Australia to serve people in Australia, and they only hit US servers when absolutely needed. And that's when you need to abstract things out. If something becomes relevant in Australia, move it in there, and move it out when it no longer matters. When something big happens, copy it everywhere, and replace the copies by something else as interest wanes.
Big companies need to split everything, they can't centralize because the world isn't centralized. The problem is when small businesses try to do the same because "if Google is so successful doing that, it must be right". Scale matters.
CDN = good distribution.
Microservices = bad distribution.
That's because the concept which is even more impactful than agriculture and the computer, and makes them and everything else in our lives, is abstraction. It makes it possible to reason about large and difficult problems, to specialize, to have multiple people working on them.
Computer hardware is as full of abstraction and separation and specialization as software is. The person designing the logic for a multiplier unit has no more need to know how transistors are etched into silicon than a javascript programmer does.
The web is slower than ever. Desktop apps 20 years ago were faster than today's garbage. We failed.
Since I have a long history with Sun Microsystems, upon seeing "Andy and Bill's Law" I immediately thought this was a reference to Andy Bechtolsheim (Sun hardware guy) and Bill Joy (Sun software guy). Sun had its own history of software bloat, with the latest software releases not fitting into contemporary hardware.
I'm using a Logitech MX Master 3, and it comes with the "Logi Options+" to configure the mouse. I'm super frustrated with the cranky and slow app. It updates every other day and crashes often.
The experience is much better when I can configure the mouse with an open-source driver [^0] while using Linux.
[^0] https://github.com/PixlOne/logiops
It's been like that for years.
Logitech's hardware is great, so I don't know why they think it's OK to push out such shite software.
[] https://www.youtube.com/watch?v=ZSRHeXYDLko
While bloatware cannot be excluded, let's not forget that user expectations have tremendously increased.
Which really is a stunning accomplishment against a backdrop of spectacular hardware advances, ever more educated people, and other favorable ingredients.
Seems like a fair trade-off to make.
Speak for yourself, I need to get some use out of my winter jacket ever since winters stopped being a thing.
A nice thing about being in a big provider is when they go down a massive portion of the internet goes down, and it makes news headlines. Users are much less likely to complain about your service being down when it's clear you're just caught up in the global outage that's affecting 10 other things they use.
Customers who have invested millions of dollars into making their stack multi-region, multi-cloud, or multi-datacenter aren't going to calmly accept the excuse that "AWS Went Down" when you can't deliver the services you contractually agreed to deliver. There are industries out there where having your service casually go down a few times a year is totally unacceptable (Healthcare, Government, Finance, etc). I worked adjacent to a department that did online retail a while ago and even an hour of outage would lose us $1M+ in business.
You will primarily be judged by how much of an inconvenience the outage was to every individual.
The best you can hope for is that the local ISP gets the blame, but honestly, it can't be more than a rounding error in the end.
If rolling your own is faster, cheaper, and more reliable (it is), then the only justification for cloud is assigning blame. But you know what you also don't get? Accolades.
I throw a little party of one here when Office 365 or Azure or AWS or whatever Google calls its cloud products this week is down but all our staff are able to work without issue. =)
The real reason that talented engineers secretly support all of the middle management we vocally complain about.
I don't really have much to do with contracts - but my company states that we have uptime of 99.xx%.
In terms of the contract, customers don't care if I use Azure/AWS or keep my server in a box under the stairs. Yes, they do due diligence and would not buy my services if I kept it in a shoebox.
But then if they lose business they come to me. I can go after Azure/AWS, but I am so small they will throw some free credits at me and tell me to go away.
Maybe if you are in the B2C area then yeah - your customers will probably shrug and accept that it was M$ or Amazon if you write a sad blog post with excuses.
This is terrible for many reasons, but I wouldn't be surprised to hear someone has done this.
Guaranteeing isolation between all of these different moving parts is very difficult. Even if you're not directly affected by a large cloud outage, it's becoming less and less common that you, or your customers, are truly isolated.
As well, if your AWS-hosted service mostly exists to service AWS-hosted customers, and AWS is down, it doesn't matter if you are down. None of your customers are operational anyways. Is this a 100% acceptable solution? Of course not. But for 95% of services/SaaS out there, it really doesn't matter.
Imagine the clout of saying : "we stayed online while AWS died"
"We stayed online when GCP, AWS, and Azure go down" is a different story. On the other hand, if those three go down simultaneously, I suspect the state of the world will be such that I'm not worried about the internet.
Another often overlooked option is that in several fly-over states it is quite easy and cheap to register as a public telecommunication utility. This allows you to place a powered pedestal in the public right-of-way, where you can get situated adjacent to an optical meet point and get considerable savings on installation costs of optical Internet, even from a tier 1 provider. If your server bandwidth is peak utilized during business hours and there is an apartment complex nearby you can use that utility designation and competitively provide residential Internet service to offset costs.
The first time I did anything like this was in late 1984 in a small town in Iowa where GTE was the local telecommunication utility. Absolutely abysmal Internet service, nothing broadband from them at the time or from the MSO (Mediacom). I found out there was a statewide optical provider with cable going through the town. I incorporated an LLC, became a utility and built out less than 2 miles of single mode fiber to interconnect some of my original software business customers at first. Our internal motto was "how hard can it be?" (more as a rebuke to GTE). We found out. The whole 24x7 public utility thing was very difficult for just a couple of guys. But it grew from there. I left after about 20 years and today it is a thriving provider.
Technology has made the whole process so much easier today. I am amazed more people do not do it. You can get a small rack-mount sheet metal pedestal with an AC power meter and an HVAC unit for under $2k. Being a utility will allow you to place that on a concrete pad or vault in the utility corridor (often without any monthly fee from the city or county). You place a few bollards around it so no one drives into it. You want to get quotes from some tier 1 providers [0]. They will help you identify the best locations to engineer an optical meet and those are the locations you run by the city/county/state utilities board or commission.
For a network engineer wanting to implement a fault tolerant network, you can place multiple pedestals at different locations on your provider's/peer's network to create a route diversified protected network.
After all, when you are buying expensive cloud based services that literally is all your cloud provider is doing ... just on a completely more massive scale. The barrier to entry is not as high as you might think. You have technology offerings like OpenStack [1], where multiple competitive vendors will also help you engineer a solution. The government also provides (financial) support [2].
The best perk is the number of parking spaces the requisite orange utility traffic cone opens up for you.
[0] https://en.wikipedia.org/wiki/Tier_1_network
[1] https://www.openstack.org/
[2] https://www.usda.gov/reconnect
Done right, it'll be cheaper for them (they can advertise "high speed internet included!" or whatever) and you won't have much to do assuming everything on your end just works.
The days where small ISPs provided things like email, web hosting, etc, are long gone; you're just providing a DHCP IP and potentially not even that if you roll out carrier-grade NAT.
Is North Carolina one of those states? I'm intrigued…
[0] https://www.ncuc.net/
EDIT: typos + also most states distinguish between facilities-based ISPs (i.e. with physical plant in the regulated public right-of-way) and other ISPs. Tell them you are looking to become a facilities-based ISP.
Stares at the 3 NUCs on my desk waiting to be clustered for a local sandbox.
So we would likely recommend running 3x big servers. For those who want to plan for failure, though, they might prefer to have 6x medium servers, because then the loss of any one means you don't take as much of a "torpedo hit" when any one server goes offline.
So it's a balance. You want to be big, but you don't want to be monolithic. You want an HA architecture so that no one node kills your entire business.
I also suggest that people planning systems create their own "torpedo test." We usually benchmark to measure peak performance, presuming that everything is going to go right.
But people who are concerned about real-world outage planning may want to "torpedo" a node to see how a 2-out-of-3-nodes-up cluster operates, versus a 5-out-of-6-nodes-up cluster.
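A back-of-the-envelope way to run that comparison on paper (the 60% utilization figure below is purely an assumption for illustration):

```python
def surviving_load(nodes: int, failed: int, utilization: float) -> float:
    """Per-node load after `failed` nodes drop out, assuming the cluster
    was evenly loaded and traffic redistributes evenly across survivors."""
    survivors = nodes - failed
    if survivors <= 0:
        raise ValueError("cluster has no surviving capacity")
    return utilization * nodes / survivors

# Torpedo one node out of a cluster running at 60% utilization:
print(surviving_load(3, 1, 0.60))  # 3 big servers  -> survivors run at 0.90
print(surviving_load(6, 1, 0.60))  # 6 medium boxes -> survivors run at 0.72
```

The 3-node cluster is one failure away from redlining; the 6-node cluster has headroom for a second hit.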
This is like planning for major jets, to see if you can work with 2 of 3 engines, or 1 of 2.
Obviously, if you have 1 engine, there is nothing you can do if you lose that single point of failure. At that point, you are updating your resume, and checking on the quality of your parachute.
The ordering of these events seems off but that's understandable considering we're talking about distributed systems.
Today, if I were running something that absolutely needed to be up 24/7, I would run a 2x2 or 2x3 configuration with async replication between primary and backup sites.
https://www.scylladb.com/2021/03/23/kiwi-com-nonstop-operati...
Also, when renting, the company takes care of hardware failures. Furthermore, as hard disk failures are the most common issue, you can have hot spares and opt to let damaged disks rot, instead of replacing them.
For example, in ZFS, you can mirror disks 1 and 2, while having 3 and 4 as hot spares, with the following command:
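(The command itself appears to have been dropped from the comment. Assuming FreeBSD-style device names `da1`–`da4` and a hypothetical pool named `tank`, a zpool invocation along those lines would be:)

```shell
# Mirror da1 and da2; keep da3 and da4 as hot spares
zpool create tank mirror da1 da2 spare da3 da4
```

Depending on the platform, zfsd(8) on FreeBSD or the ZFS event daemon (zed) on Linux handles activating a spare automatically when a mirror member fails.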
---
The 400 Gbps figure is now 700 Gbps:
https://twitter.com/DanRayburn/status/1519077127575855104
---
About the break even point:
Disregarding the security risks of multi-tenant cloud instances, bare-metal is more cost-effective once your cloud bill exceeds $3,000 per year, which is the cost of renting two bare-metal servers.
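The break-even arithmetic, spelled out (all prices are illustrative assumptions, not quotes from any provider):

```python
# Hypothetical figures: ~$125/mo per rented bare-metal server,
# two servers for redundancy -> the claimed $3,000/yr break-even point.
bare_metal_monthly = 125
redundant_servers = 2
bare_metal_yearly = bare_metal_monthly * redundant_servers * 12
print(bare_metal_yearly)  # 3000

# Any annual cloud spend above that favors bare metal on raw cost.
cloud_yearly = 3600
print(cloud_yearly > bare_metal_yearly)  # True
```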
---
Here's how you can create a two-server infrastructure:
https://blog.uidrafter.com/freebsd-jails-network-setup
"grug wonder why big brain take hardest problem, factoring system correctly, and introduce network call too
seem very confusing to grug"
https://grugbrain.dev/#grug-on-microservices
They allow a team to release independently of other teams that have or want to make different risk/velocity tradeoffs. Also smaller units being released means fewer changes and likely fewer failed releases.
The interfaces are the hard part, so you may have fewer internal failures but problems between services seem more likely.
Back in the day we had 1,000 physical servers to run a large scale web app. 90% of that capacity was used only for two months. So we had to buy 900 servers just to make most of our money over two events in two seasons.
We also had to have 900 servers because even one beefy machine has bandwidth and latency limits. Your network switch simply can't pump more than a set amount of traffic through its backplane or your NICs, and the OS may have piss-poor packet performance too. Lots of smaller machines allow easier scaling of network load.
But you can't just buy 900 servers. You always need more capacity, so you have to predict what your peak load will be, and buy for that. And you have to do it well in advance because it takes a long time to build and ship 900 servers and then assemble them, run burn-in, replace the duds, and prep the OS, firmware, software. And you have to do this every 3 years (minimum) because old hardware gets obsolete and slow, hardware dies, disks die, support contracts expire. But not all at once, because who knows what logistics problems you'd run into and possibly not get all the machines in time to make your projected peak load.
If back then you told me I could turn on 900 servers for 1 month and then turn them off, no planning, no 3 year capital outlay, no assembly, burn in, software configuration, hardware repair, etc etc, I'd call you crazy. Hosting providers existed but nobody could just give you 900 servers in an hour, nobody had that capacity.
And by the way: cloud prices are retail prices. Get on a savings plan or reserve some instances and the cost can be half. Spot instances are a quarter or less the price. Serverless is pennies on the dollar with no management overhead.
If you don't want to learn new things, buy one big server. I just pray it doesn't go down for you, as it can take up to several days for some cloud vendors to get some hardware classes in some regions. And I pray you were doing daily disk snapshots, and can get your dead disks replaced quickly.
The point was that most people don't have that kind of load, and even their bursts can fit on a single server. This is my experience as well.
> Back in the day we had 1,000 physical servers to run a large scale web app. 90% of that capacity was used only for two months. So we had to buy 900 servers just to make most of our money over two events in two seasons.
> We also had to have 900 servers because even one beefy machine has bandwidth and latency limits. Your network switch simply can't pump more than a set amount of traffic through its backplane or your NICs, and the OS may have piss-poor packet performance too. Lots of smaller machines allow easier scaling of network load.
I started working with real (bare metal) servers on real internet loads in 2004 and retired in 2019. While there's truth here, there's also missing information. In 2004, all my servers had 100M ethernet, but in 2019, all my new servers had 4x10G ethernet (2x public, 2x private); actually some of them had 6x, with 2x unconnected, I dunno why. In the meantime, CPUs, NICs, and operating systems have improved such that if you're not getting line rate for full-MTU packets, it's probably because your application uses a lot of CPU, or you've hit a pathological case in the OS (which happens, but if you're running 1,000 servers, you've probably got someone to debug that).
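The "line rate for full-MTU packets" claim is easy to sanity-check. Assuming a 1500-byte MTU and the usual 38 bytes of per-frame Ethernet overhead (header, FCS, preamble, inter-frame gap):

```python
link_bps = 10_000_000_000           # one 10G port
frame_bytes = 1500 + 38             # MTU payload + Ethernet framing overhead
pps = link_bps / (frame_bytes * 8)  # packets per second at line rate
print(round(pps))                   # 812744
```

At ~800k full-MTU packets/sec per port, a modern kernel keeps up comfortably; it's small-packet workloads (tens of millions of pps at 64 bytes) where the OS becomes the bottleneck.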
If you still need 1000 beefy 10G servers, you've got a pretty formidable load, but splitting it up into many more smaller servers is asking for problems of different kinds. Otoh, if your load really scales to 10x for a month, and you're at that scale, cloud economics are going to work for you.
My seasonal loads were maybe 50% more than normal, but usage trends (and development trends) meant that the seasonal peak would become the new normal soon enough; cloud managing the peaks would help a bit, but buying for the peak and keeping it running for the growth was fine. Daily peaks were maybe 2-3x the off-peak usage, 5 or 6 days a week; a tightly managed cloud provisioning could reduce costs here, but probably not enough to compete with having bare metal for the full day.
I use cloud all the time, but there are also black-swan events where your IaaS can't do any more for you.
Redundancy concerns are also addressed in the article.
You are taking this a bit too literally. The article itself says one server (and backups). So "one" here just means a small number not literally no fallback/backup etc. (obviously... even people you disagree with are usually not morons)
There's intermediate ground here. Rent one big server, reserved instance. Cloudy in the sense that you get the benefits of the cloud provider's infrastructure skills and experience, and uptime, plus easy backup provisioning; non-cloudy in that you can just treat that one server instance like your own hardware, running (more or less) your own preferred OS/distro, with "traditional" services running on it (e.g. in our case: nginx, gitea, discourse, mantis, ssh)
> it can take up to several days for some cloud vendors to get some hardware classes in some regions.
I wonder how these two can be true at the same time…
we were already getting multiple millions of page hits a month when this happened.
This server had 8 cores, but 5 of them were reserved for the game servers running on the same machine, which pushed 10 TB a month in bandwidth.
If you needed 1,000 physical computers to run your webapp, you fucked up somewhere along the line.