> The markup cost of using RDS (or any managed database) is worth it.
Every so often I price out RDS to replace our colocated SQL Server cluster and it's so unrealistically expensive that I just have to laugh. It's absurdly far beyond what I'd be willing to pay. The markup is enough to pay for the colocation rack, the AWS Direct Connects, the servers, the SAN, the SQL Server licenses, the maintenance contracts, and a full-time in-house DBA.
https://calculator.aws/#/estimate?id=48b0bab00fe90c5e6de68d0...
Total 12 months cost: 547,441.85 USD
Once you get past the point where the markup can pay for one or more full-time employees, I think you should consider doing that instead of blindly paying more and more to scale RDS up. You're REALLY paying for it with RDS. At least re-evaluate the choices you made as a fledgling startup once you reach the scale where you're paying AWS "full time engineer" amounts of money.
Some orgs are looking at moving back on-prem because they're figuring this out. For a while it was in vogue to go from capex to opex costs, and C-suite people were incentivized to do that via comp structures, hence "digital transformation", i.e. migration to public cloud infrastructure. Now, those same orgs are realizing that renting computers actually costs more than owning them when you're utilizing them to a significant degree.
Just like any other asset.
I was once part of an acquisition from a much larger corporate entity. The new parent company was in the middle of a huge cloud migration, and as part of our integration into their org, we were required to migrate our services to the cloud.
Our calculations said it would cost 3x as much to run our infra on the cloud.
We pushed back, and were greenlit on creating a hybrid architecture that allowed us to launch machines both on-prem and in the cloud (via a direct link to the cloud datacenter). This gave us the benefit of autoscaling our volatile services, while maintaining our predictable services on the cheap.
After I left, apparently my former team was strong-armed into migrating everything to the cloud.
A few years go by, and guess who reaches out on LinkedIn?
The parent org was curious how we built the hybrid infra, and wanted us to come back to do it again.
I didn't go back.
Context: I build internal tools and platforms. Traffic on them varies, but some of them are quite active.
My nasty little secret is that for single-server databases I have zero fear of over-provisioning disk IOPS and running them on SQLite, or standing up a single RDBMS server in a container. I've never actually run into an issue with this. It surprises me how many internal tools I see that depend on large RDS installations yet have piddly requirements.
Besides, running things locally can be refreshingly simple if you are just starting something and you don't need tons of extra stuff, which becomes accidental complexity between you, the problem, and a solution. This old post described that point quite well by comparing Unix to Taco Bell: http://widgetsandshit.com/teddziuba/2010/10/taco-bell-progra.... See HN discussion: https://news.ycombinator.com/item?id=10829512.
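For what it's worth, the single-file route really is this small. A minimal sketch in Python's stdlib, where the table and pragmas are just illustrative:

```python
import sqlite3

# Open (or create) the tool's database; one file, no server to manage.
conn = sqlite3.connect("internal_tool.db")

# WAL mode lets readers proceed while a writer commits, which is
# usually all the concurrency an internal tool ever needs.
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("PRAGMA synchronous=NORMAL")

# Hypothetical schema for illustration only.
conn.execute("""
    CREATE TABLE IF NOT EXISTS audit_log (
        id INTEGER PRIMARY KEY,
        actor TEXT NOT NULL,
        action TEXT NOT NULL,
        at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    )
""")
with conn:  # the context manager wraps the insert in a transaction
    conn.execute("INSERT INTO audit_log (actor, action) VALUES (?, ?)",
                 ("alice", "export_report"))
```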
That's made possible by orchestration platforms such as Kubernetes being standardized; as a result, you can get pretty close to a cloud experience while having all your infrastructure on-premise.
Same experience here. As a small organization, the quotes we got from cloud providers have always been prohibitively expensive compared to running things locally, even when we accounted for geographical redundancy, generous labor costs, etc. Plus, we get to keep know-how and avoid lock-in, which are extremely important things in the long term.
I am sure for some use-cases cloud services might be worth it, especially if you are a large organization and you get huge discounts. But I see lots of business types blindly advocating for clouds, without understanding costs and technical tradeoffs. Fortunately, the trend seems to be plateauing. I see an increasing demand for people with HPC, DB administration, and sysadmin skills.
I would have a hard time doing servers as cheaply as Hetzner, for example, including the routing and everything.
It's not an either/or. Many businesses both own and rent things.
If price is the only factor, your business model (or your executives' decision-making) is questionable. Buy only the cheapest shit, spend your time building your own office chair rather than talking to a customer, and you aren't making a premium product, which means you're not differentiated.
For example, how long does it take to rent another rack that you didn't plan for?
And not to mention that the cloud management platforms you have to deploy to manage these owned assets are not free either.
I mean, why don't even large consumers of electricity buy and own their own infrastructure to generate it?
RDS pricing is deranged at the scales I've seen too.
$60k/year for something I could run on just a slice of one of my on-prem $20k servers. This is something we would have run 10s of. $600k/year operational against sub-$100k capital cost pays DBAs, backups, etc with money to spare.
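Spelling out that arithmetic (the amortization period and DBA cost below are my assumptions, not the poster's):

```python
# Rough comparison using the figures above; the 5-year amortization
# and the fully loaded DBA cost are assumptions, not data.
rds_annual_per_db = 60_000           # quoted RDS cost per database
db_count = 10                        # "we would have run 10s of" -> low end
rds_annual_total = rds_annual_per_db * db_count      # $600k/year

server_capex = 100_000               # "sub-$100k capital cost", upper bound
amortization_years = 5
onprem_annual_hw = server_capex / amortization_years  # $20k/year

dba_fully_loaded = 200_000           # assumed fully loaded DBA cost
surplus = rds_annual_total - (onprem_annual_hw + dba_fully_loaded)
print(f"left over after hardware + DBA: ${surplus:,.0f}/year")  # ~$380k
```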
Sure, maybe if you are some sort of SaaS with a need for a small single DB, one that also needs to be resilient, backed up, rock-solid bulletproof... it makes sense? But how many cases are there of this? If it's so fundamental to your product and needs such uptime & redundancy, what are the odds it's also reasonably small?
> Sure, maybe if you are some sort of SaaS with a need for a small single DB, one that also needs to be resilient, backed up, rock-solid bulletproof... it makes sense? But how many cases are there of this?
Most software startups these days? The blog post is about work done at a startup after all. By the time your db is big enough to cost an unreasonable amount on RDS, you're likely a big enough team to have options. If you're a small startup, saving a couple hundred bucks a month by self-managing your database is rarely a good choice. There are more valuable things to work on.
I have a small MySQL database that’s rather important, and RDS was a complete failure.
It would have cost a negligible amount. But the sheer amount of time I wasted before I gave up was honestly quite surprising. Let’s see:
- I wanted one simple extension. I could have compromised on this, but getting it to work on RDS was a nonstarter.
- I wanted RDS to _import the data_. Nope, RDS isn’t “SUPER,” so it rejects a bunch of stuff that mysqldump emits. Hacking around it with sed was not confidence-inspiring.
- The database uses GTIDs and needed to maintain replication to a non-AWS system. RDS nominally supports GTID, but the documented way to enable it at import time strongly suggests that whoever wrote the docs doesn’t actually understand the purpose of GTID, and it wasn’t clear that RDS could do it right. At least Azure’s docs suggested that I could have written code to target some strange APIs to program the thing correctly.
Time wasted: a surprising number of hours. I’d rather give someone a bit of money to manage the thing, but it’s still on a combination of plain cloud servers and bare metal. Oh well.
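For anyone hitting the same mysqldump wall: the sed hack usually amounts to stripping the statements that need SUPER. A rough Python equivalent, with patterns that are my guess at the usual offenders (DEFINER clauses, SQL_LOG_BIN, GTID_PURGED):

```python
import re
import sys

# Statements mysqldump emits that RDS rejects without SUPER.
DROP_LINES = (
    re.compile(r"^SET @@SESSION\.SQL_LOG_BIN"),
    re.compile(r"^SET @@GLOBAL\.GTID_PURGED"),
)
# DEFINER clauses reference users that may not exist on the target.
DEFINER = re.compile(r"DEFINER=`[^`]+`@`[^`]+`\s*")

for line in sys.stdin:
    if any(p.match(line) for p in DROP_LINES):
        continue  # NOTE: dropping GTID_PURGED silently changes replication
                  # semantics -- exactly the subtlety the comment above hit.
    sys.stdout.write(DEFINER.sub("", line))
```

Used as a filter: `mysqldump ... | python strip_super.py | mysql -h <rds-endpoint> ...`. It does nothing for the GTID replication problem itself.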
> Sure, maybe if you are some sort of SaaS with a need for a small single DB, one that also needs to be resilient, backed up, rock-solid bulletproof... it makes sense? But how many cases are there of this?
Very small businesses with phone apps or web apps are often using it. There are cheaper options of course, but when there is no "prem" and there are 1-5 employees then it doesn't make much sense to hire for infra. You outsource all digital work to an agency who sets you up a cloud account so you have ownership, but they do all software dev and infra work.
> If it's so fundamental to your product and needs such uptime & redundancy, what are the odds it's also reasonably small?
Small businesses again. Some of my clients could probably run off a Pentium 4 from 2008, but due to the nature of the org and the agency engagement it often needs to live in the cloud somewhere.
I am constantly beating the drum to reduce costs and use as little infra as needed though, so in a sense I agree, but the engagement is what it is.
Additionally, everyone wants to believe they will need to hyperscale, so even medium-scale businesses over-provision, and some agencies are happy to do that for them as they profit off the margin.
Lots of cases. It doesn't even have to be a tiny database. In the <1 TB range there are a huge number of online companies that don't need to do more than hundreds of queries per second, but need the reliability and quick failover that RDS gives them. The $600k cost is absurd indeed, but it's not the range of what those companies spend.
Also, Aurora gives you the block level cluster that you can't deploy on your own - it's way easier to work with than the usual replication.
RDS is not as bulletproof as advertised, and the support is first arrogant, then (maybe) helpful.
People pay for RDS because they want to believe in a fairy tale: that it will keep potential problems away and that it worked well for other customers. But those mythical other customers also paid based on that same belief. Plus, no one wants to admit that they pay money in such an irrational way. It's a bubble.
> $600k/year operational against sub-$100k capital cost pays DBAs, backups, etc with money to spare.
One of these is not like the others (DBAs are not capex.)
Have you ever considered that if a company can get the same result for the same price ($100K opex for RDS vs same for human DBA), it actually makes much more sense to go the route that takes the human out of the loop?
The human shows up hungover, goes crazy, gropes Stacy from HR, etc.
RDS just hums along without all the liabilities.
That's a huge instance with an enterprise license on top. Most large SaaS companies can run off of $5k/m or cheaper RDS deployments, which isn't enough to pay someone. The number of people running half-a-million-a-year RDS bills might not be that large. For most people RDS is worth it as soon as you have backup requirements and would have to implement them yourself.
> Most large SaaS companies can run off of $5k / m or cheaper RDS
Hard disagree. An r6i.12xl Multi-AZ with 7500 IOPS / 500 GiB io1 books at $10K/month on its own. Add a read replica, even Single-AZ at a smaller size, and you’re half that again. And this is without the infra required to run a load balancer / connection pooler.
I don’t know what your definition of “large” is, but the described would be adequate at best at the ~100K QPS level.
RDS is expensive as hell, because they know most people don’t want to take the time to read docs and understand how to implement a solid backup strategy. That, and they’ve somehow convinced everyone that you don’t have to tune RDS.
Definitely--I recommend this after you've reached the point where you're writing huge checks to AWS. Maybe this is just assumed but I've never seen anyone else add that nuance to the "just use RDS" advice. It's always just "RDS is worth it" full stop, as in this article.
>Most large SaaS companies can run off of $5k / m or cheaper RDS deployments which isn't enough to pay someone.
After initial setup, managing the equivalent of a $5k/m RDS is not a full-time job. Add to this that wages differ a lot around the world, and $5k can take you very, very far in terms of paying someone.
Discount rates are actually much better on the bigger instances too, so the "sticker price" that people compare on the public site is nowhere close to a fair comparison.
We technically aren't supposed to talk about pricing publicly, but I'm just going to say that we run a few 8XL and 12XL RDS instances and we pay ~40% off the sticker price.
If you switch to the Aurora engine the pricing is absurdly complex (it's basically impossible to determine without a simulation calculator), but AWS is even more aggressive with discounting on Aurora, not to mention there are some legit amazing feature benefits to switching.
I'm still in agreement that you could do it cheaper yourself at a data center. But there are some serious tradeoffs in doing it that way. One is complexity, and it certainly requires several new hiring decisions. Those have their own tangible costs, but there is a huge amount of intangible cost as well: pure inconvenience, more people management, more hiring, split expertise, more complex network systems, reduced elasticity of decisions, longer commitments, etc. It's harder to put a price on that.
When you account for the discounts at this scale, I think the cost gap between the two solutions is much smaller and these inconveniences and complexities by rolling it yourself are sometimes worth bridging that smaller gap in cost in order to gain those efficiencies.
This is because you are using SQL Server. Microsoft has intentionally made cloud pricing for SQL Server prohibitively expensive for non-Azure cloud workloads by requiring per-core licensing that is extremely punitive for the way EC2 and RDS are architected. This has the effect of making RDS vastly more expensive than running the same workload on bare metal or Azure.
Frankly, this is anti-competitive and the FTC should look into it. However, Microsoft has been anti-competitive and customer-hostile for decades, so if you're still using their products, you must have accepted the abuse already.
You don't get the higher end machines on AWS unless you're a big guy. We have Epyc 9684X on-prem. Cannot match that at the price on AWS. That's just about making the choices. Most companies are not DB-primary.
I think most people who’ve never experienced native NVMe for a DB are also unaware of just how blindingly fast it is. Even io2 Block Express isn’t the same.
Elsewhere today I recommended RDS, but was thinking of small startup cases that may lack infrastructure chops.
But you are totally right that it can be expensive. I worked with a startup that had some inefficient queries; normally it wouldn't matter, but with RDS it cost $3,000 a month for a tiny user base and not that much data (millions of rows at most).
Also, it is often overlooked that you still need skilled people to run RDS. It's certainly not "2-clicks and forget" and "you don't need to pay anyone running your DB".
I haven't run a Postgres instance with proper backup and restore, but it doesn't seem like rocket science using barman or pgbackrest.
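Agreed that the basic shape is simple. A minimal sketch of the DIY version, assuming pg_dump on the path, connection settings in the environment (PGHOST etc.), and a hypothetical versioned S3 bucket; barman/pgbackrest add the parts that matter at scale (WAL archiving, incrementals, verified restores):

```python
import gzip
import subprocess
from datetime import datetime, timezone

import boto3

BUCKET = "example-db-backups"   # hypothetical, versioned + lifecycle-managed
DB = "appdb"                    # hypothetical database name

def nightly_backup() -> str:
    """Dump the database, compress it, and ship it to S3."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    key = f"postgres/{DB}/{stamp}.sql.gz"
    # pg_dump reads connection settings from the environment.
    dump = subprocess.run(
        ["pg_dump", "--format=plain", DB],
        check=True, capture_output=True,
    ).stdout
    boto3.client("s3").put_object(
        Bucket=BUCKET, Key=key, Body=gzip.compress(dump))
    return key
```

The dump itself is the easy half; regularly restoring it somewhere and checking the result is the part that actually earns the word "proper."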
Data isn't cheap; it never was. Paying licensing fees on top makes it more expensive. It really depends on the circumstance: a managed database usually has extended support from the company providing it. You have to weigh a team's expertise to manage a solution on your own and ensure you spend ample time making it resilient. The other half is the cost of upgrading hardware; sometimes it is better to just pay a cloud provider if your business does not have enough income to buy hardware outright. There is always an upfront cost.
For small databases or test-environment databases, you can also leverage Kubernetes to host an operator for that tiny DB. When it comes to serious data that needs a beeline recovery strategy: RDS.
Really it should be a mix: self-hosted for the things you aren't afraid to break, hosted for the things you put at high risk.
> Data is the most critical part of your infrastructure. You lose your network: that’s downtime. You lose your data: that’s a company ending event. The markup cost of using RDS (or any managed database) is worth it.
You need well-run, regularly tested, air gapped or otherwise immutable backups of your DB (and other critical biz data). Even if RDS was perfect, it still doesn't protect you from the things that backups protect you from.
After you have backups, the idea of paying enormous amounts for RDS in order to keep your company from ending is more far-fetched.
I agree that RDS is stupidly expensive and not worth it provided that the company actually hires at least 2x full-time database owners who monitor, configure, scale and back up databases. Most startups will just save the money and let developers "own" their own databases or "be responsible for" uptime and backups.
RDS makes perfect sense for them.
Even for small workloads it's a difficult choice. I ran a small but vital db, and RDS was costing us like 60 bucks a month per env. That's 240/month/app.
DynamoDB as a replacement, pay per request, was essentially free.
I found Dynamo foreign and rather ugly to code for initially, but am happy with the performance and especially price at the end.
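For context, the pay-per-request behavior is just a billing mode chosen at table creation. A sketch with boto3, with a hypothetical table and key schema:

```python
import boto3

dynamodb = boto3.client("dynamodb")

# PAY_PER_REQUEST: no provisioned capacity, you're billed per read/write.
# An idle dev/staging environment therefore costs (almost) nothing.
dynamodb.create_table(
    TableName="app-events",                      # hypothetical table
    AttributeDefinitions=[
        {"AttributeName": "pk", "AttributeType": "S"},
        {"AttributeName": "sk", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "pk", "KeyType": "HASH"},
        {"AttributeName": "sk", "KeyType": "RANGE"},
    ],
    BillingMode="PAY_PER_REQUEST",
)
```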
For big companies such as banks this cost comparison is not as straightforward. They have whole data centres just sitting there for disaster recovery. They periodically do switchovers to test DR. All of this expense goes away when they migrate to cloud.
They need to replicate everything in multiple availability zones, which is going to be more expensive than replicating data centres.
From what I’ve read, a common model for mmorpg companies is to use on-prem or colocated as their primary and then provision a cloud service for backup or overage.
Seems like a solid cost effective approach for when a company reaches a certain scale.
Lots of companies, like Grinding Gear Games and Square Enix, just rent whole servers for a tiny fraction of the price compared to what the price-gouging cloud providers would charge for the same resources. They get the best of both worlds. They can scale up their infrastructure in hours or even minutes and they can move to any other commodity hardware in any other datacenter at the drop of a hat if they get screwed on pricing. Migrating from one server provider (such as IBM) to another (such as Hetzner) can take an experienced team 1-2 weeks at most. Given that pricing updates are usually given 1-3 quarters ahead at a minimum, they have massive leverage over their providers because they can so easily switch. Meanwhile, if AWS decides to jack up their prices, well, you're pretty much screwed in the short term if you designed around their cloud services.
I know this is an unpopular opinion but I think google cloud is amazing compared to AWS. I use google cloud run and it works like a dream. I have never found an easier way to get a docker container running in the cloud. The services all have sensible names, there are fewer more important services compared to the mess of AWS services, and the UI is more intuitive. The only downside I have found is the lack of community resulting in fewer tutorials, difficulty finding experienced hires, and fewer third party tools. I recommend trying it. I'd love to get the user base to an even dozen.
The reasoning the author cites is that AWS has more responsive customer service, and maybe I am missing out, but it would never even occur to me to speak to someone from a cloud provider. They mention having "regular cadence meetings with our AWS account manager" and I am not sure what could be discussed. I must be doing simpler stuff.
> "regular cadence meetings with our AWS account manager" and I am not sure what could be discusse.
Having been on a number of those calls: it's just a bunch of crap where they talk like a scripted bot reading from a corporate buzzword bingo card over a slideshow. Their real intention is twofold: to sell you even more AWS complexity/services, and to provide "value" to their point of contact (which is a person working in your company).
We're paying north of 500K per year in AWS support (which is highway robbery), and in return you get a "team" of people supposedly dedicated to you, which sounds good in theory, but you get a labyrinth of irresponsibility, stalling and frustration in reality.
So even when you want to reach out to that team you first have to go through L1 support, which I'm sure will be replaced by bots soon (and no value will be lost) and which is useful in 1 out of 10 cases. Then, if you're not satisfied with L1's answer(s), you try to escalate to your "dedicated" support team, and they schedule a call in three days' time, or if that lands around Friday, that means Monday, etc.
Their goal is to stall so that you figure out and fix stuff on your own, shielding their better-quality teams. No wonder our top engineers have just abandoned all AWS communication, and in cases where it's unavoidable they delegate it to junior people who still think they are getting something in return.
> We're paying north of 500K per year in AWS support (which is highway robbery), and in return you get a "team" of people supposedly dedicated to you, which sounds good in theory, but you get a labyrinth of irresponsibility, stalling and frustration in reality.
I’ve found a lot of the time the issues we run into are self-inflicted. When we call support for these, they have to reverse-engineer everything which takes time.
However when we can pinpoint the issue to AWS services, it has been really helpful to have them on the horn to confirm & help us come up with a fix/workaround. These issues come up more rarely, but are extremely frustrating. Support is almost mandated in these cases.
It’s worth mentioning that we operate at a scale where the support cost is a non-issue compared to overall engineering costs. There’s a balance, and we have an internal structure that catches most of the first type of issue nowadays.
In my experience all questions I've had for AWS were:
1. Their bugs, which won't be fixed in the near future anyway.
2. Their transient failures, which will be fixed soon anyway.
So there's zero value in ever contacting AWS support.
We are a reasonably large AWS customer and our account manager sends out regular emails with NDA information on what's coming up, we have regular meetings with them about things as wide ranging as database tuning and code development/deployment governance.
They often provide that consulting for free, and we know their biases. There's nothing hidden about the fact that they will push us to use AWS services.
On the other hand, they will also help us optimize those services and save money that is directly measurable.
GCP might have a better API and better "naming" of their services, but the breadth of AWS services, the incorporation of IAM across their services, governance, and automation all make it worthwhile.
Cloud has come a long way from "it's so easy to spin up a VM/container/lambda".
> There's nothing hidden about the fact that they will push us to use AWS services.
Our account team don't even do that. We use a lot of AWS anyway and they know it, so they're happy to help with competitor offerings and integrating with our existing stack. Their main push on us has been to not waste money.
In a previous role I got all of these things from GCP – they ran training for us, gave us early access to some alpha/beta stage products (under NDA), we got direct onboarding from engineers on those, they gave us consulting level support on some things and offered much more of it than we took up.
I don't have as much experience with AWS, but I do hate GCP. The UI is slow and buggy. The way they want things to authenticate is half-baked and only implemented in some libraries, and it isn't always clear which library supports it. The gcloud command line tool regularly just doesn't work; it hangs and never times out, forcing you to kill it manually, wondering if it did anything and whether you'll mess something up by running it again. The way they update client libraries by running code generation means there are tons of commits that aren't relevant to the library you're actually using. Features are not available across all client libraries. Documentation contradicts itself or contradicts support recommendations. Core services like BigQuery lack any emulator or Docker image to facilitate CI or testing without having to set up a separate project you have to pay for.
Oh, friend, you have not known UI pain until you've used portal.azure.com. That piece of junk requires actual page reloads to make any changes show up. That Refresh button is just like the close-door elevator button: it's there for you to blow off steam, but it for damn sure does not DO anything. I have boundless screenshots showing when their own UI actually pops up a dialog saying "ok, I did what you asked but it's not going to show up in the console for 10 minutes so check back later". If you forget to always reload the page, and accidentally click on something that it says exists but doesn't, you get the world's ugliest error message and only by squinting at it do you realize it's just the 404 page rendered as if the world has fallen over
I suspect the team that manages it was OKR-ed into using AJAX but come from a classic ASP background, so don't understand what all this "single page app" fad is all about and hope it blows over one day
Totally agree. GCP is far easier to work with and get things up and running on, for how my brain works, compared to AWS. Also, GCP names stuff in a way that tells me what it does; AWS names things like a teenage boy trying to be cool.
That's completely opposite to my experience. Do you have any examples of AWS naming that you think is "teenage boy trying to be cool"? I am genuinely curious.
I have had the experience of an AWS account manager helping me by getting something fixed (working at a big client). But more commonly, I think the account manager’s job at AWS or any cloud or SAAS is to create a reality distortion field and distract you from how much they are charging you.
> I think the account manager’s job at AWS or any cloud or SAAS is to create a reality distortion field and distract you from how much they are charging you.
Maybe your TAM is different, but ours regularly does presentations about cost breakdown, future planning and possible reservations. There's nothing distracting there.
AWS enterprise support (basically first-line support that you pay for) is actually really, really good. They will look at your metrics/logs and share solid insights with you. For anything more, you can talk to a TAM, who can then reach out to the relevant engineering teams.
Heartily seconded. Also don't forget the docs: Google Cloud docs are generally fairly sane and often even useful, whereas my stomach churns whenever I have to dive into AWS's labyrinth of semi-outdated, nigh-unreadable crap.
To be fair, there are lots of GCP docs, but I cannot say they are as good as AWS's. Everything is CLI-based, and some things are broken or hello-world-useless. It takes time to go through multiple duplicate articles to find anything decent. I have never had this issue with AWS.
GCP SDK docs must be mentioned separately, as they're bizarre auto-generated nonsense. Have you seen them? How can you even say that GCP docs are good after that?
We're relatively small GCP users (low six figures) and have monthly cadence meetings with our Google account manager. They're very accommodating, and will help with contacts, events and marketing.
Oh I disagree - we migrated from azure to AWS, and running a container on Fargate is significantly more work than Azure Container Apps [0]. Container Apps was basically "here's a container, now go".
GCP support is atrocious. I've worked at one of their largest clients and we literally had to get executives into the loop (on both sides) to get things done sometimes. Multiple times they broke some functionality we depended on (one time they fixed it weeks later except it was still broken) or gave us bad advice that cost a lot of money (which they at least refunded if we did all the paperwork to document it). It was so bad that my team viewed even contacting GCP as an impediment and distraction to actually solving a problem they caused.
I also worked at a smaller company using GCP. GCP refused to do a small quota increase (which AWS just does via a web form) unless I got on a call with my sales representative and listened to a 30 minute upsell pitch.
> I’ve had technicians at both GCP and Azure debug code and spend hours on developing services.
Almost every time Google pulled in a specialist engineer working on a service/product we had issues with, it was very, very clear the engineer had no desire to be on that call or to help us. In other words, they'd get no benefit from helping us and it was taking away from things that would help their career at Google. Sometimes they didn't even show up to the first call and only showed up to the second after an escalation up the management chain.
GCP's SDK and documentation is a mess compared to AWS. And looking at the source code I don't see how it can get better any time soon. AWS seems to have proper design in mind and uses less abstractions giving you freedom to build what you need. AWS CDK is great for IAC.
The only weird part I experienced with AWS is their SNS API. Maybe due to legacy reasons, but what a bizarre mess when you try doing it cross-account. This one is odd.
I have been trying GCP for a while and the DevX was horrible. The only part that more-or-less works is the CLI, but the naming there is inconsistent and not as well done as in AWS. But it's relative and subjective, so I guess someone likes it. I have experienced GCP official guides that are broken, untested, or utterly braindead hello-world-useless. They are also numerous and spread out, so it takes time to find anything decent.
No dark mode is an extra punch. Seriously. I tried to make it myself with an extension, but their page is an Angular hell of millions of embedded divs. No thank you.
And since you mentioned Cloud Run: it takes 3 seconds to deploy a Lambda version in AWS and a minute or more for a GCP Cloud Function.
The author leads infrastructure at Cresta. Cresta is a customer service automation company. His first point is about how happy he is to have picked AWS and their human-based customer service, versus Google's robot-based customer service.
I'm not saying there's anything wrong, and I'm oversimplifying a bit, but I still find this amusing.
Haha very good catch. I prefer GCP but I will admit any day of the week that their support is bad. Makes sense that they would value good support highly.
We used to use AWS and GCP at my previous company. GCP support was fine, and I never saw anything from AWS support that GCP didn't also do. I've heard horror stories about both, including some security support horror stories from AWS that are quite troubling.
Utter insanity. So much cost and complexity, and for what? Startups don’t think about costs or runway anymore, all they care about is “modern infrastructure”.
The argument for RDS seems to be “we can’t automate backups”. What on earth?
I see this argument a lot. Then most startups use that time to create rushed half-assed features instead of spending a week on their db that'll end up saving hundreds of thousands of dollars. Forever.
All that infra doesn’t integrate itself. Everywhere I’ve worked that had this kind of stack employed at least one if not a team of DevOps people to maintain it all, full time, the year round. Automating a database backup and testing it works takes half a day unless you’re doing something weird
> The argument for RDS seems to be “we can’t automate backups”. What on earth?
I can automate backups, and I'm extremely happy that, with some extra cost in RDS, I don't have to.
Also, at some size, automating the database backup becomes non-trivial. I mean, I can manage a replica (which needs to be updated at specific times after the writer), then regularly stop replication for a snapshot, which is then encrypted, shipped to storage, then manage the lifecycle of that storage, then set up monitoring for all of that, then... Or I can set one parameter on the Aurora cluster and have all of that happen automatically.
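For readers who haven't seen it, that one parameter is essentially the cluster's backup retention window. A sketch with boto3, where the cluster identifier is a placeholder:

```python
import boto3

rds = boto3.client("rds")

# Continuous backups with point-in-time restore come with the cluster;
# you mostly just pick how long AWS keeps them (1-35 days).
rds.modify_db_cluster(
    DBClusterIdentifier="app-aurora-cluster",   # hypothetical cluster
    BackupRetentionPeriod=14,
    PreferredBackupWindow="03:00-04:00",        # UTC
    ApplyImmediately=True,
)
```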
The argument for RDS (and other services along those lines) is "we can't do it as good, for less".
And, when factoring in all costs and considering all things the service takes care of, it seems like a reasonable assumption that in a free market a team that specializes in optimizing this entire operation will sell you a db service at a better net rate than you would be able to achieve on your own.
Which might still turn out to be false, but I don't think it's obvious why.
I agree but also I'm not entirely sure how much of this is avoidable. Even the most simple web applications are full of what feels like needless complexity, but I think actually a lot of it is surprisingly essential. That said, there is definitely a huge amount of "I'm using this because I'm told that we should" over "I'm using this because we actually need it"
Everyone who says they can run a database better than Amazon is probably lying or has a story about how they had to miss a family event because of an outage.
The point isn’t that you can’t do it, the point is that it’s less work for extremely high standards. It is not easy to configure multi region failover without an entire network team and database team unless you don’t give a shit about it actually working. Oh yea, and wait until you see how much SOC2 costs if you roll your own database.
One doesn't necessarily need to run a DB better than Amazon, just sufficiently well for the product/service you're working on. And depending on specifics it may cost much less (but your mileage may vary).
My contrarian view is that EC2 + ASG is so pleasant to use. It’s just conceptually simple: I launch an image into an ASG, and configure my autoscale policies. There are very few things to worry about. On the other hand, using k8s has always been a big deal. We built a whole team to manage k8s. We introduce dozens of concepts of k8s or spend person-years on “platform engineering” to hide k8s concepts. We publish guidelines and sdks and all kinds of validators so people can use k8s “properly”. And we still write 10s of thousands lines of YAML plus 10s of thousands of code to implement an operator. Sometimes I wonder if k8s is too intrusive.
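For comparison, here's roughly what that workflow looks like against the EC2 Auto Scaling API. A sketch with placeholder names; the launch template (your baked image plus config) and subnets are assumed to exist:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Launch an AMI (via a launch template) into an ASG...
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web",                        # hypothetical
    LaunchTemplate={"LaunchTemplateName": "web-lt",    # prebuilt image+config
                    "Version": "$Latest"},
    MinSize=2,
    MaxSize=20,
    VPCZoneIdentifier="subnet-aaa,subnet-bbb",         # placeholder subnets
)

# ...and let target tracking keep average CPU near 50%.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="web",
    PolicyName="cpu-target",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"},
        "TargetValue": 50.0,
    },
)
```

Two API calls and you're done, which is the conceptual simplicity the comment is pointing at.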
K8S is a disastrous complexity bomb. You need millions upon millions of lines of code just to build a usable platform. Securing Kubernetes is a nightmare. And lock-in never really went away because it's all coupled with cloud specific stuff anyway.
Many of the core concepts of Kubernetes should be taken to build a new alternative without all the footguns. Security should be baked in, not an afterthought when you need ISO/PCI/whatever.
> K8S is a disastrous complexity bomb. You need millions upon millions of lines of code just to build a usable platform.
I don't know what you have been doing with Kubernetes, but I run a few web apps out of my own Kubernetes cluster and the full extent of my lines of code are the two dozen or so LoC kustomize scripts I use to run each app.
kubeadm + fabric + helm got me 99% of the way there. My direct report, a junior engineer, wrote the entire helm chart from our docker-compose. It will not entirely replace our remote environment but it is nice to have something in between our SDK and remote deployed infra. Not sure what you meant by security; could you elaborate? I just needed to expose one port to the public internet.
To me, it sounds like your company went through a complex re-architecting exercise at the same time you moved to Kubernetes, and your problems have more to do with your (probably flawed) migration strategy than with the tool.
Lifting and shifting an "EC2 + ASG" set-up to Kubernetes is a straightforward process unless your app is doing something very non-standard. It maps to a Deployment in most cases.
The fact that you even implemented an operator (a very advanced use-case in Kubernetes) strongly suggests to me that you're doing way more than just lifting and shifting your existing set-up. Is it a surprise then that you're seeing so much more complexity?
> My contrarian view is that EC2 + ASG is so pleasant to use.
Sometimes I think that managed kubernetes services like EKS are the epitome of "give the customers what they want", even when it makes absolutely no sense at all.
Kubernetes is about stitching together COTS hardware to turn it into a cluster where you can deploy applications. If you do not need to stitch together COTS hardware, you already have far better tools available to get your app running. You don't need to know or care which node your app is supposed to run on, what your ingress controller is, whether you need to evict nodes, etc. You have container images, you want to run containers out of them, you want them to scale a certain way, and so on.
I tend to agree that for most things on AWS, EC2 + ASG is superior. It's very polished. EKS is very bare bones. I would probably go so far as to just run Kubernetes on EC2 if I had to go that route.
But in general k8s provides incredibly solid abstractions for building portable, rigorously available services. Nothing quite compares. It's felt very stable over the past few years.
Sure, EC2 is incredibly stable, but I don't always do business on Amazon.
So by and large I agree with the things in this article. It's interesting that the points I disagree with the author on are all SaaS products:
> Moving off JIRA onto linear
I don't get the hype. Linear is fine and all but I constantly find things I either can't or don't know how to do. How do I make different ticket types with different sets of fields? No clue.
> Not using Terraform Cloud No Regrets
I generally recommend Terraform Cloud: it's easy to grow your own in-house system that works fine for a few years and gradually ends up costing you in the long run if you don't.
> GitHub actions for CI/CD Endorse-ish
Use Gitlab
> Datadog Regret
Strong disagree - it's easily the best monitoring/observability tool on the market by a wide margin.
Cost is the most common complaint and it's almost always from people who don't have it configured correctly (which to be fair Datadog makes it far too easy to misconfigure things and blow up costs).
> Pagerduty Endorse
Pagerduty charges like 10x what Opsgenie does and offers no better functionality.
When I had a contract renewal with Pagerduty I asked the sales rep what features they had that Opsgenie didn't.
He told me they're positioning themselves as the high end brand in the market.
Cool so I'm okay going generic brand for my incident reporting.
Every CFO should use this as a litmus test to understand if their CTO is financially prudent IMO.
> Cost is the most common complaint and it's almost always from people who don't have it configured correctly (which to be fair Datadog makes it far too easy to misconfigure things and blow up costs).
I loved Datadog 10 years ago when I joined a company that already used it where I never once had to think about pricing. It was at the top of my list when evaluating monitoring tools for my company last year, until I got to the costs. The pricing page itself made my head swim. I just couldn’t get behind subscribing to something with pricing that felt designed to be impossible to reason about, even if the software is best in class.
Linear has a lot going for it. It doesn't support custom fields, so if that's a critical feature for you, I can see it falling short. In my experience though, custom fields just end up being a mess anytime a manager changes and decides to do things differently, things get moved around teams, etc.
- It's fast. It's wild that this is a selling point, but it's actually a huge deal. JIRA and so many other tools like it are as slow as molasses. Speed is honestly the biggest feature.
- It looks pretty. If your team is going to spend time there, this will end up affecting productivity.
- It has a decent degree of customization and an API. We've automated tickets moving across columns whenever something gets started, a PR is up for review, when a change is merged, when it's deployed to beta, and when it's deployed to prod. We've even built our own CLI tools for being able to action on Linear without leaving your shell. (There's a sketch of the API after this list.)
- It has a lot of keyboard shortcuts for power users.
- It's well featured. You get teams, triaging, sprints (cycles), backlog, project management, custom views that are shareable, roadmaps, etc...
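As an illustration of the API point above: Linear speaks GraphQL at a single endpoint. A minimal sketch; the mutation and field names are from memory of Linear's schema and the IDs are placeholders, so treat the whole thing as an assumption to verify against their docs:

```python
import requests

LINEAR_API = "https://api.linear.app/graphql"
API_KEY = "lin_api_..."          # personal API key, elided

# Move an issue to a new workflow state when, say, a PR merges.
MUTATION = """
mutation($issueId: String!, $stateId: String!) {
  issueUpdate(id: $issueId, input: {stateId: $stateId}) {
    success
  }
}
"""

resp = requests.post(
    LINEAR_API,
    json={"query": MUTATION,
          "variables": {"issueId": "ISSUE_UUID", "stateId": "STATE_UUID"}},
    headers={"Authorization": API_KEY},
)
resp.raise_for_status()
```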
OpsGenie’s cheapest is $9 per user month but arbitrarily crippled, the plan anybody would want to use is $19 per user month
So instead of a factor of ten, it's ten percent cheaper. And I just kind of expect Atlassian to suck.
Datadog is ridiculously expensive and on several occasions I’ve run into problems where an obvious cause for an incident was hidden by bad behavior of datadog.
Grafana OnCall can be self hosted for free or you can pay $20 a month, and still always have the option to migrate to self hosting if you want to save money
I just started building out on-call rotation scheduling to fit teams that already have an alerting solution and need simple automated scheduling. I’d love to get some feedback: https://majorpager.com
Datadog is a freaking beast. My wife works at Workday (a huge employee management system) and they have a very large number of tutorials, videos, "working hours" and other tools to ensure their customers are making the best use of it.
Datadog, on the other side... their "DD University" is a shame, and we as paying customers are overwhelmed and left with no real guidance. DD should assign some time for integration for new customers, even if it is proportional to what you pay annually. (I think I pay around $6,000 USD annually.)
In terms of Datadog: the per-host pricing on infrastructure in a k8s/microservices world is perhaps the most egregious of pricing models across all Datadog services. Triply true if you use spot instances for short-lived workloads.
For folks running k8s at any sort of scale, I generally recommend aggregating metrics BEFORE sending them to Datadog, either on a per-deployment or per-cluster level. Individual host metrics tend to matter less once you have a large fleet anyway.
You can use opensource tools like veneur (https://github.com/stripe/veneur) to do this. And if you don't want to set this up yourself, third party services like Nimbus (https://nimbus.dev/) can do this for you automatically (note that this is currently a preview feature). Disclaimer also that I'm the founder of Nimbus (we help companies cut datadog costs by over 60%) and have a dog in this fight.
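To make the aggregation idea concrete, here's a minimal sketch of the client side: counters accumulate in-process and flush once per interval, deliberately without a host tag, to a shared aggregator. The address, metric handling, and flush strategy are all assumptions; veneur and similar tools do the production-grade version of this.

```python
import socket
import threading
from collections import Counter

class AggregatingStatsd:
    """Accumulate counters locally; flush one datagram set per interval."""

    def __init__(self, addr=("aggregator.internal", 8125), interval=10.0):
        self.addr = addr            # shared aggregator, not a per-host agent
        self.interval = interval
        self.counts = Counter()
        self.lock = threading.Lock()
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        threading.Timer(self.interval, self._flush).start()

    def incr(self, metric: str, n: int = 1):
        with self.lock:
            self.counts[metric] += n

    def _flush(self):
        with self.lock:
            pending, self.counts = self.counts, Counter()
        for metric, n in pending.items():
            # statsd counter wire format; note: no host tag on purpose,
            # so billing sees one logical source instead of N spot hosts.
            self.sock.sendto(f"{metric}:{n}|c".encode(), self.addr)
        threading.Timer(self.interval, self._flush).start()
```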
I mostly agreed with OP's article, but you basically nailed all of the points of disagreement I did have.
Jira: It's overhyped and overpriced. Most people HATE Jira. I guess I don't care enough. I've never met a ticket system that I loved. Jira is fine. It's overly complex, sure. But once you set it up, you don't need to change it very often. I don't love it, I don't hate it. No one ever got fired for choosing Jira, so it gets chosen. Welcome to the tech industry.
Terraform Cloud: The gains for Terraform Cloud are minimal. We just use Gitlab for running Terraform pipelines and have a super nice custom solution that we enjoy. It wasn't that hard to do either. We maintain state files remotely in S3 with versioning for the rare cases when we need to restore a foobar'd statefile. Honestly I like having Terraform pipelines in the same place as the code and pipelines for other things.
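The versioning piece of that setup is a one-time bucket setting. A sketch with boto3 and a hypothetical bucket:

```python
import boto3

s3 = boto3.client("s3")

# Every write to terraform.tfstate keeps the previous version around,
# so a foobar'd state is a copy-back away.
s3.put_bucket_versioning(
    Bucket="example-terraform-state",            # hypothetical bucket
    VersioningConfiguration={"Status": "Enabled"},
)

# Listing versions of the state file to pick one to restore:
versions = s3.list_object_versions(
    Bucket="example-terraform-state", Prefix="prod/terraform.tfstate")
for v in versions.get("Versions", []):
    print(v["VersionId"], v["LastModified"], v["IsLatest"])
```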
GitHub Actions: Yeah switch to GitLab. I used to like Github Actions until I moved to a company with Gitlab and it is best in class, full stop. I could rave about Gitlab for hours. I will evangelize for Gitlab anywhere I go that is using anything else.
DataDog: As mentioned, DataDog is the best monitoring and observability solution out there. The only reason NOT to use it is the cost. It is absurdly expensive. Yes, truly expensive. I really hate how expensive it is. But luckily I work somewhere that lets us have it and its amazing.
Pagerduty: Agree, switch to OpsGenie. OpsGenie is considerably cheaper and does all the pager stuff of PagerDuty. All the stuff that PagerDuty tries to tack on top to justify its cost is stuff you don't need. OpsGenie does all the stuff you need. It's fine. Similar to Jira, it's not something anyone wants anyway. No one's going to love it; no one loves being on call. So just save money with OpsGenie. If you're going to fight for the "brand name" of something, fight for Datadog instead, not a cooler pager system.
I'll be dead in the ground before I use TFC. 10 cents per resource per month, my ass. We have around 100k resources at the early-stage startup I'm at, our AWS bill is roughly $50/mo, and TFC wants to charge me $10k/mo for that? We can hire a senior dev to maintain an in-house tool full time for that much.
Agreed on PagerDuty
It doesn't really do a lot, administering it is fairly finicky, and most shops barely use half the functionality it has anyway.
To me its whole schedule interface is atrocious for its price, given from an SRE/dev perspective, that's literally its purpose - scheduled escalations.
Why GitLab? GitHub Actions are a mess, but GitLab's hosted CI/CD is not much better at all, and self-hosting opens a whole different can of worms. At least with GitHub Actions you have a plugin ecosystem that makes the super janky underlying platform a bit more bearable.
> Cost is the most common complaint and it's almost always from people who don't have it configured correctly (which to be fair Datadog makes it far too easy to misconfigure things and blow up costs).
Datadog's cheapest pricing is $15/host/month. I believe that is based on the largest sustained peak usage you have.
We run spot instances on AWS for machine learning workflows. A lot of them if we're training, and none otherwise. Usually we're using zero. Using Datadog at its lowest price would basically double the cost of those instances.
Interesting. Atlassian also just launched an integration with OpsGenie. I have the same opinion of JIRA. I've tried many competitors (not Linear so far) and regretted it every time.
I agree. I’m afraid I’m one of those 00s developers and can relate. Back then many startups were being launched on super simple stacks.
With all of that complexity/word salad from TFA, where’s the value delivered? Presumably there’s a product somewhere under all that infrastructure, but damn, what’s left to spend on it after all the infrastructure variable costs?
I get it’s a list of preferences, but still once you’ve got your selection that’s still a ton of crap to pay for and deal with.
Do we ever seek simplicity in software engineering products?
We had FB up to 6 figures in servers and a billion MAUs (conservatively) before even tinkering with containers.
The “control plane” was ZooKeeper. Everything had bindings to it, Thrift/Protobuf goes in a znode fine. List of servers for FooService? znode.
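For readers who haven't seen the pattern, here it is sketched with the kazoo client (the ensemble, paths, and payload are illustrative, not from the original setup): each server registers an ephemeral znode, and consumers watch the directory.

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")  # placeholder ensemble
zk.start()

# A FooService instance registers itself; ephemeral means the znode
# vanishes automatically if the server dies or loses its session.
zk.ensure_path("/services/foo")
zk.create("/services/foo/server-", b"10.0.0.12:9090",
          ephemeral=True, sequence=True)

# Consumers get the live server list, re-delivered on every change.
@zk.ChildrenWatch("/services/foo")
def on_change(children):
    print("FooService servers:", children)
```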
The packaging system was a little more complicated than a tarball, but it was spiritually a tarball.
Static link everything. Dependency hell: gone. Docker: redundant.
The deployment pipeline used hypershell to drop the packages and kick the processes over.
There were hundreds of services and dozens of clusters of them, but every single one was a service because it needed a different SKU (read: instance type), or needed to be in Java or C++, or had some other engineering reason. If it didn't have a real reason, it went in the monolith.
This was dramatically less painful than any of the two dozen server type shops I’ve consulted for using kube and shit. It’s not that I can’t use Kubernetes, I know the k9s shortcuts blindfolded. But it’s no fun. And pros built these deployments and did it well, serious Kubernetes people can do everything right and it’s complicated.
After 4 years of hundreds of elite SWEs and PEs (SRE) building a Borg-alike, we’d hit parity with the bash and ZK stuff. And it ultimately got to be a clear win.
But we had an engineering reason to use containers: we were on bare metal, containers can make a lot of sense on bare metal.
In a hyperscaler that has a zillion SKUs on-demand? Kubernetes/Docker/OCI/runc/blah is the friggin Bezos tax. You’re already virtualized!
Some of the new stuff is hot shit, I’m glad I don’t ssh into prod boxes anymore, let alone run a command on 10k at the same time. I’m glad there are good UIs for fleet management in the browser and TUI/CLI, and stuff like TailScale where mortals can do some network stuff without a guaranteed zero day. I’m glad there are layers on top of lock servers for service discovery now. There’s a lot to keep from the last ten years.
But this yo dawg I heard you like virtual containers in your virtual machines so you can virtualize while you virtualize shit is overdue for its CORBA/XML/microservice/many-many-many repos moment.
You want reproducibility. Statically link. Save Docker for a CI/CD SaaS or something.
You want pros handling the datacenter because pets are for petting: pay the EC2 markup.
You can’t take risks with customer data: RDS is a very sane place to splurge.
Half this stuff is awesome, let’s keep it. The other half is job security and AWS profits.
The funny thing is a lot of smaller startups are seeing just how absurdly expensive these services are, and are just switching back to basic bare-metal server hosting.
For 99% of businesses it's a wasteful, massive overkill expense. You don't NEED all the shiny tools they offer; they don't add anything to your business but cost. Unless you're a Netflix or an Apple that needs massive global content distribution and processing services, there's a good chance you're throwing money away.
I am a '10s developer/systems engineer and my eyes kept getting wider with each new technology on the list. I don't know if it's overkill or just the state of things right now.
There is no way one person can thoroughly understand so many complex pieces of technology. I have worked for 10 years more or less at this point, and I would only call myself confident on 5 technical products, maybe 10 if I'm being generous to myself.
Not really, it's just like counting: awk, grep, sed, uniq, tail, etc.
"CloudOS" is in it's early days right now.
You need to be careful on what tool or library you pick.
No, not at all. Maybe baffled by the use of expensive cloud services instead of running on your own bare metal where the cost is in datacenter space and bandwidth. The loss of control coupled with the cost is baffling.
Reading this I couldn’t help but think: yeah all of these points make sense in isolation, but if you look at the big picture, this is an absurd level of complexity.
Why do we need entire teams making 1000s of micro decisions to deploy our app?
I’m hungry for a simpler way, and I doubt I’m alone in this.
You’re not alone. There is a constant undercurrent of pushback against this craziness. You see it all the time here on hacker news and with people I talk to irl.
That doesn't mean each of these things doesn't solve problems. The issue is, as always, the complexity-utility tradeoff. Some of these things have too much complexity for too little utility. I'm not qualified to judge here, but if the suspects have Turing-complete YAML templates on their hands, it probably ties them to the crime scene.
Every so often I price out RDS to replace our colocated SQL Server cluster and it's so unrealistically expensive that I just have to laugh. It's absurdly far beyond what I'd be willing to pay. The markup is enough to pay for the colocation rack, the AWS Direct Connects, the servers, the SAN, the SQL Server licenses, the maintenance contracts, and a full-time in-house DBA.
https://calculator.aws/#/estimate?id=48b0bab00fe90c5e6de68d0...
Total 12 months cost: 547,441.85 USD
Once you get past the point where the markup can pay for one or more full-time employees, I think you should consider doing that instead of blindly paying more and more to scale RDS up. You're REALLY paying for it with RDS. At least re-evaluate the choices you made as a fledgling startup once you reach the scale where you're paying AWS "full time engineer" amounts of money.
Just like any other asset.
I was once part of an acquisition from a much larger corporate entity. The new parent company was in the middle of a huge cloud migration, and as part of our integration into their org, we were required to migrate our services to the cloud.
Our calculations said it would cost 3x as much to run our infra on the cloud.
We pushed back, and were greenlit on creating a hybrid architecture that allowed us to launch machines both on-prem and in the cloud (via a direct link to the cloud datacenter). This gave us the benefit of autoscaling our volatile services, while maintaining our predictable services on the cheap.
After I left, apparently my former team was strong-armed into migrating everything to the cloud.
A few years go by, and guess who reaches out on LinkedIn?
The parent org was curious how we built the hybrid infra, and wanted us to come back to do it again.
I didn't go back.
My nasty little secret is for single server databases I have zero fear of over provisioning disk iops and running it on SQLite or making a single RDBMS server in a container. I've never actually run into an issue with this. It surprises me the number of internal tools I see that depend on large RDS installations that have piddly requirements.
Besides, running things locally can be refreshingly simple if you are just starting something and you don't need tons of extra stuff, which becomes accidental complexity between you, the problem, and a solution. This old post described that point quite well by comparing Unix to Taco Bell: http://widgetsandshit.com/teddziuba/2010/10/taco-bell-progra.... See HN discussion: https://news.ycombinator.com/item?id=10829512.
I am sure for some use-cases cloud services might be worth it, especially if you are a large organization and you get huge discounts. But I see lots of business types blindly advocating for clouds, without understanding costs and technical tradeoffs. Fortunately, the trend seems to be plateauing. I see an increasing demand for people with HPC, DB administration, and sysadmin skills.
I would have a hard time doing servers as cheap as hetzner for example including the routing and everything
If price is the only factor, your business model (or executives' decision-making) is questionable. Buy only the cheapest shit, spend your time building your own office chair rather than talking to a customer, you aren't making a premium product, and that means you're not differentiated.
For example, how long does it take to rent another rack that you didnt plan for?
And not to mention that the cost of cloud management platforms that you have to deploy to manage these owned assets is not free.
I mean, how come even large consumers of electricity does not buy and own their own infrastructure to generate it?
Sure, maybe if you are some sort of SaaS with a need for a small single DB, that also needs to be resilient, backed up, rock solid bulletproof.. it makes sense? But how many cases are there of this? If its so fundamental to your product and needs such uptime & redundancy, what are the odds its also reasonably small?
Most software startups these days? The blog post is about work done at a startup after all. By the time your db is big enough to cost an unreasonable amount on RDS, you’re likely a big enough team to have options. If you’re a small startup, saving a couple hundred bucks a month by self managing your database is rarely a good choice. There’re more valuable things to work on.
It would have cost a negligible amount. But the sheer amount of time I wasted before I gave up was honestly quite surprising. Let’s see:
- I wanted one simple extension. I could have compromised on this, but getting it to work on RDS was a nonstarter.
- I wanted RDS to _import the data_. Nope, RDS isn’t “SUPER,” so it rejects a bunch of stuff that mysqldump emits. Hacking around it with sed was not confidence-inspiring.
- The database uses GTIDs and needed to maintain replication to a non-AWS system. RDS nominally supports GTID, but the documented way to enable it at import time strongly suggests that whoever wrote the docs doesn’t actually understand the purpose of GTID, and it wasn’t clear that RDS could do it right. At least Azure’s docs suggested that I could have written code to target some strange APIs to program the thing correctly.
Time wasted: a surprising number of hours. I’d rather give someone a bit of money to manage the thing, but it’s still on a combination of plain cloud servers and bare metal. Oh well.
Very small businesses with phone apps or web apps are often using it. There are cheaper options of course, but when there is no "prem" and there are 1-5 employees then it doesn't make much sense to hire for infra. You outsource all digital work to an agency who sets you up a cloud account so you have ownership, but they do all software dev and infra work.
> If its so fundamental to your product and needs such uptime & redundancy, what are the odds its also reasonably small?
Small businesses again, some of my clients could probably run off a Pentium 4 from 2008, but due to nature of the org and agency engagement it often needs to live in the cloud somewhere.
I am constantly beating the drum to reduce costs and use as little infra as needed though, so in a sense I agree, but the engagement is what it is.
Additionally, everyone wants to believe they will need to hyperscale, so even medium scale businesses over-provision and some agencies are happen to do that for them as they profit off the margin.
Also, Aurora gives you the block level cluster that you can't deploy on your own - it's way easier to work with than the usual replication.
People pay for RDS because they want to believe in a fairy tale that it will keep potential problems away and that it worked well for other customers. But those mythical other customers also paid based on such belief. Plus, no one wants to admit that they pay money in such irrational way. It's a bubble
One of these is not like the others (DBAs are not capex.)
Have you ever considered that if a company can get the same result for the same price ($100K opex for RDS vs same for human DBA), it actually makes much more sense to go the route that takes the human out of the loop?
The human shows up hungover, goes crazy, gropes Stacy from HR, etc.
RDS just hums along without all the liabilities.
Hard disagree. An r6i.12xl Multi-AZ with 7500 IOPS / 500 GiB io1 books at $10K/month on its own. Add a read replica, even Single-AZ at a smaller size, and you’re half that again. And this is without the infra required to run a load balancer / connection pooler.
I don’t know what your definition of “large” is, but the described would be adequate at best at the ~100K QPS level.
RDS is expensive as hell, because they know most people don’t want to take the time to read docs and understand how to implement a solid backup strategy. That, and they’ve somehow convinced everyone that you don’t have to tune RDS.
After initial setup, managing the equivalent of a $5k/month RDS deployment is not a full-time job. Add to this that wages differ a lot around the world, and $5k can take you very, very far in terms of paying someone.
We technically aren't supposed to talk about pricing publicly, but I'm just going to say that we run a few 8XL and 12XL RDS instances and we pay ~40% off the sticker price.
If you switch to the Aurora engine, the pricing is absurdly complex (it's basically impossible to determine without a simulation calculator), but AWS is even more aggressive with discounting on Aurora, not to mention there are some legit amazing feature benefits to switching.
I'm still in agreement that you could do it cheaper yourself at a data center. But there are some serious tradeoffs made by doing it that way. One is complexity, and it certainly requires several new hiring decisions. Those have their own tangible costs, but there is a huge amount of intangible cost as well: pure inconvenience, more people management, more hiring, split expertise, more complex networking between systems, reduced elasticity of decisions, longer commitments, etc. It's harder to put a price on that.
When you account for the discounts at this scale, the cost gap between the two solutions is much smaller, and avoiding the inconveniences and complexities of rolling it yourself is sometimes worth paying that smaller gap.
Frankly, this is anti-competitive, and the FTC should look into it, however, Microsoft has been anti-competitive and customer hostile for decades, so if you're still using their products, you must have accepted the abuse already.
But you are totally right that it can be expensive. I worked with a startup that had some inefficient queries; normally it wouldn't matter, but on RDS it cost $3,000 a month for a tiny user base and not that much data (millions of rows at most).
I haven't run a Postgres instance with proper backup and restore, but it doesn't seem like rocket science using barman or pgbackrest.
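For what it's worth, a minimal pgbackrest setup is roughly the following (stanza name, paths, and PostgreSQL version are placeholders; check the pgbackrest docs before trusting this with real data):

    # /etc/pgbackrest/pgbackrest.conf
    cat > /etc/pgbackrest/pgbackrest.conf <<'EOF'
    [global]
    repo1-path=/var/lib/pgbackrest
    repo1-retention-full=2

    [main]
    pg1-path=/var/lib/postgresql/16/main
    EOF

    # in postgresql.conf, ship WAL to the repo:
    #   archive_mode = on
    #   archive_command = 'pgbackrest --stanza=main archive-push %p'

    pgbackrest --stanza=main stanza-create
    pgbackrest --stanza=main check
    pgbackrest --stanza=main --type=full backup

    # point-in-time restore later:
    #   pgbackrest --stanza=main --type=time "--target=2024-01-01 12:00:00" restore

The part that actually takes discipline isn't the tooling, it's regularly testing the restore path.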
For small databases or test-environment databases, you can also leverage Kubernetes and host an operator for that tiny DB. When it comes to serious data that needs a bulletproof recovery strategy: RDS.
Really it should be a mix: self-hosted for the things you aren't afraid to break, managed for the things that put you at high risk.
> Data is the most critical part of your infrastructure. You lose your network: that’s downtime. You lose your data: that’s a company ending event. The markup cost of using RDS (or any managed database) is worth it.
You need well-run, regularly tested, air-gapped or otherwise immutable backups of your DB (and other critical biz data). Even if RDS were perfect, it still doesn't protect you from the things that backups protect you from.
After you have backups, the idea of paying enormous amounts for RDS in order to keep your company from ending is more far fetched.
RDS makes perfect sense for them.
DynamoDB as a replacement, pay per request, was essentially free.
I found Dynamo foreign and rather ugly to code for initially, but am happy with the performance and especially price at the end.
They need to replicate everything across multiple availability zones, which is going to be more expensive than replicating data centres.
They still need to test that their cloud infrastructure works.
Just to pay someone else enough money to provide the same service and make a profit while doing it.
Seems like a solid cost effective approach for when a company reaches a certain scale.
That argument bakes in two assumptions:
1. Hiring someone full time to work on the database means migrating off RDS
2. Database work is only about spend reduction
I know this is an unpopular opinion, but I think Google Cloud is amazing compared to AWS. I use Google Cloud Run and it works like a dream. I have never found an easier way to get a Docker container running in the cloud. The services all have sensible names, there are fewer, more important services compared to the mess of AWS services, and the UI is more intuitive. The only downside I have found is the lack of community, resulting in fewer tutorials, difficulty finding experienced hires, and fewer third-party tools. I recommend trying it. I'd love to get the user base to an even dozen.
The reasoning the author cites is that AWS has more responsive customer service. Maybe I am missing out, but it would never even occur to me to speak to someone from a cloud provider. They mention having "regular cadence meetings with our AWS account manager" and I am not sure what could be discussed. I must be doing simpler stuff.
Having been on a number of those calls: it's just a bunch of crap where they talk like a scripted bot reading from a corporate buzzword bingo card over a slideshow. Their real intention is twofold: to sell you even more AWS complexity/services, and to provide "value" to their point of contact (a person working in your company).
We're paying north of 500K per year for AWS support (which is highway robbery), and in return you get a "team" of people supposedly dedicated to you, which sounds good in theory, but in reality you get a labyrinth of irresponsibility, stalling, and frustration.
So even when you want to reach out to that team, you first have to go through L1 support, which I'm sure will be replaced by bots soon (and no value will be lost) and is useful in 1 out of 10 cases. Then if you're not satisfied with L1's answers, you try to escalate to your "dedicated" support team, and they schedule a call in three days' time, or if that lands around Friday, that means Monday, etc.
Their goal is to stall so that you figure out and fix stuff on your own, shielding their better-quality teams. No wonder our top engineers just abandoned all AWS communication, and in cases where it's unavoidable they delegate it to junior people who still think they're getting something in return.
I’ve found a lot of the time the issues we run into are self-inflicted. When we call support for these, they have to reverse-engineer everything which takes time.
However when we can pinpoint the issue to AWS services, it has been really helpful to have them on the horn to confirm & help us come up with a fix/workaround. These issues come up more rarely, but are extremely frustrating. Support is almost mandated in these cases.
It’s worth mentioning that we operate at a scale where the support cost is a non-issue compared to overall engineering costs. There’s a balance, and we have an internal structure that catches most of the first type of issue nowadays.
In my experience, all the questions I've had for AWS fell into two buckets: 1. their bugs, which won't be fixed in the near future anyway; 2. their transient failures, which will be fixed soon anyway.
So there's zero value in ever contacting AWS support.
I am so tired of the support team having all the real metrics, especially on IO and throttling, and not surfacing them to us somehow.
And cadence is really an opportunity for them to sell to you, the parent is completely right.
They often provide that consulting for free, and we know their biases. There's nothing hidden about the fact that they will push us to use AWS services.
On the other hand, they will also help us optimize those services and save money that is directly measurable.
GCP might have a better API and better "naming" of its services, but the breadth of AWS services, the incorporation of IAM across those services, and the governance and automation all make it worthwhile.
Cloud has come a long way from "it's so easy to spin up a VM/container/lambda".
Our account team don't even do that. We use a lot of AWS anyway and they know it, so they're happy to help with competitor offerings and integrating with our existing stack. Their main push on us has been to not waste money.
And never did I miss something in GCP that I could find in AWS. Not sure the breadth is adding much compared to a simpler product suite in GCP.
I suspect the team that manages it was OKR-ed into using AJAX but come from a classic ASP background, so don't understand what all this "single page app" fad is all about and hope it blows over one day
How do they do this jedi mind trick?
GCP SDK docs must be mentioned separately: they're bizarre auto-generated nonsense. Have you seen them? How can you even say that GCP docs are good after that?
I don't have a ton of Azure or cloud experience but I run an Unraid server locally which has a decent Docker gui.
Getting a docker container running in Azure is so complicated. I gave up after an hour of poking around.
[0] https://azure.microsoft.com/en-gb/products/container-apps
I also worked at a smaller company using GCP. GCP refused to do a small quota increase (which AWS just does via a web form) unless I got on a call with my sales representative and listened to a 30 minute upsell pitch.
I’ve had technicians at both GCP and Azure debug code and spend hours on developing services.
Almost every time Google pulled in a specialist engineer working on a service/product we had issues with it was very very clear the engineer had no desire to be on that call or to help us. In other words they'd get no benefit from helping us and it was taking away from things that would help their career at Google. Sometimes they didn't even show up to the first call and only did to the second after an escalation up the management chain.
We started using Azure Container Apps (ACA) and it seems simple enough.
Create ACA, point to GitHub repo, it runs.
Push an update to GitHub and it redeploys.
The only weird part I experienced with AWS is their SNS API. Maybe due to legacy reasons, but what a bizarre mess when you try doing it cross-account. This one is odd.
I have been trying GCP for a while and the DevX was horrible. The only part that more-or-less works is the CLI, but the naming there is inconsistent and not as well done as in AWS. That's relative and subjective, though, so I guess someone likes it. I have run into official GCP guides that are broken, untested, or utterly braindead hello-world-useless. They are also numerous and scattered, so it takes time to find anything decent.
No dark mode is an extra punch. Seriously. I tried to make it myself with an extension, but their page is an Angular hell of a million embedded divs. No thank you.
And since you mentioned Cloud Run: it takes 3 seconds to deploy a Lambda version in AWS and a minute or more for a GCP Cloud Function.
I'm not saying there's anything wrong, and I'm oversimplifying a bit, but I still find this amusing.
The argument for RDS seems to be “we can’t automate backups”. What on earth?
For me that's short-sighted.
I can automate backups, and I'm extremely happy that, for some extra cost in RDS, I don't have to.
Also, at some size automating the database backup becomes non-trivial. I mean, I can manage a replica (which needs to be updated at specific times after the writer), then regularly stop replication for a snapshot, which is then encrypted, shipped to storage, then manage the lifecycle of that storage, then setup monitoring for all of that, then... Or I can set one parameter on the Aurora cluster and have all of that happen automatically.
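For reference, that one parameter is the cluster's backup retention, set with something like the following (cluster name hypothetical). Aurora then ships continuous backups to S3 and gives you point-in-time restore within the retention window:

    aws rds modify-db-cluster \
        --db-cluster-identifier my-aurora-cluster \
        --backup-retention-period 14 \
        --preferred-backup-window 03:00-04:00 \
        --apply-immediately

Everything on the self-managed list above (snapshots, encryption, lifecycle, monitoring) is bundled behind that one flag, which is exactly the value proposition being argued about.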
And, when factoring in all costs and considering all things the service takes care of, it seems like a reasonable assumption that in a free market a team that specializes in optimizing this entire operation will sell you a db service at a better net rate than you would be able to achieve on your own.
Which might still turn out to be false, but I don't think it's obvious why.
The point isn’t that you can’t do it, the point is that it’s less work for extremely high standards. It is not easy to configure multi region failover without an entire network team and database team unless you don’t give a shit about it actually working. Oh yea, and wait until you see how much SOC2 costs if you roll your own database.
My contrarian view is that EC2 + ASG is so pleasant to use. It’s just conceptually simple: I launch an image into an ASG, and configure my autoscale policies. There are very few things to worry about. On the other hand, using k8s has always been a big deal. We built a whole team to manage k8s. We introduce dozens of concepts of k8s or spend person-years on “platform engineering” to hide k8s concepts. We publish guidelines and sdks and all kinds of validators so people can use k8s “properly”. And we still write 10s of thousands lines of YAML plus 10s of thousands of code to implement an operator. Sometimes I wonder if k8s is too intrusive.
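To make the contrast concrete, the entire EC2 + ASG setup described above is roughly two CLI calls (group name and subnets are placeholders, and the launch template is assumed to already exist):

    aws autoscaling create-auto-scaling-group \
        --auto-scaling-group-name web-asg \
        --launch-template LaunchTemplateName=web,Version='$Latest' \
        --min-size 2 --max-size 20 \
        --vpc-zone-identifier "subnet-aaa,subnet-bbb"

    # target tracking: keep average CPU near 50%
    aws autoscaling put-scaling-policy \
        --auto-scaling-group-name web-asg \
        --policy-name cpu-target \
        --policy-type TargetTrackingScaling \
        --target-tracking-configuration '{
            "PredefinedMetricSpecification": {"PredefinedMetricType": "ASGAverageCPUUtilization"},
            "TargetValue": 50.0
        }'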
Many of the core concepts of Kubernetes should be taken to build a new alternative without all the footguns. Security should be baked in, not an afterthought when you need ISO/PCI/whatever.
I don't know what you have been doing with Kubernetes, but I run a few web apps out of my own Kubernetes cluster, and the full extent of my code is the two dozen or so lines of kustomize scripts I use to run each app.
Who exactly needs millions of lines of code?
Lifting and shifting an "EC2 + ASG" set-up to Kubernetes is a straightforward process unless your app is doing something very non-standard. It maps to a Deployment in most cases (see the sketch below).
The fact that you even implemented an operator (a very advanced use-case in Kubernetes) strongly suggests to me that you're doing way more than just lifting and shifting your existing set-up. Is it a surprise then that you're seeing so much more complexity?
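For comparison with the ASG commands upthread, the Deployment-plus-autoscaler equivalent is about the same amount of typing (image name is a placeholder, and the CPU-based autoscaler assumes the pods have CPU requests set):

    # image in, replicas out: roughly the Deployment equivalent of an ASG
    kubectl create deployment web --image=registry.example.com/web:1.2.3 --replicas=3

    # and roughly the HPA equivalent of a target-tracking policy
    kubectl autoscale deployment web --min=3 --max=20 --cpu-percent=50

In practice you'd check the generated YAML into a repo, but as a lift-and-shift sketch this is the whole mapping.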
Sometimes I think that managed kubernetes services like EKS are the epitome of "give the customers what they want", even when it makes absolutely no sense at all.
Kubernetes is about stitching together COTS hardware to turn it into a cluster where you can deploy applications. If you do not need to stitch together COTS hardware, you already have far better tools available to get your app running. You don't need to know or care which node your app is supposed to run on, what your ingress controller is, whether you need to evict nodes, etc. You have container images, you want to run containers out of them, you want them to scale a certain way, and so on.
But in general k8s provides incredibly solid abstractions for building portable, rigorously available services. Nothing quite compares. It's felt very stable over the past few years.
Sure, EC2 is incredibly stable, but I don't always do business on Amazon.
> Moving off JIRA onto linear
I don't get the hype. Linear is fine and all but I constantly find things I either can't or don't know how to do. How do I make different ticket types with different sets of fields? No clue.
> Not using Terraform Cloud: No Regrets
I generally recommend Terraform Cloud: if you don't use it, it's easy to grow your own in-house system that works fine for a few years and gradually ends up costing you in the long run.
> GitHub Actions for CI/CD: Endorse-ish
Use GitLab.
> Datadog: Regret
Strong disagree - it's easily the best monitoring/observability tool on the market by a wide margin.
Cost is the most common complaint, and it's almost always from people who don't have it configured correctly (though, to be fair, Datadog makes it far too easy to misconfigure things and blow up costs).
> Pagerduty: Endorse
Pagerduty charges like 10x what Opsgenie does and offers no better functionality.
When I had a contract renewal with Pagerduty I asked the sales rep what features they had that Opsgenie didn't.
He told me they're positioning themselves as the high end brand in the market.
Cool so I'm okay going generic brand for my incident reporting.
Every CFO should use this as a litmus test to understand if their CTO is financially prudent IMO.
I loved Datadog 10 years ago when I joined a company that already used it where I never once had to think about pricing. It was at the top of my list when evaluating monitoring tools for my company last year, until I got to the costs. The pricing page itself made my head swim. I just couldn’t get behind subscribing to something with pricing that felt designed to be impossible to reason about, even if the software is best in class.
- It's fast. It's wild that this is a selling point, but it's actually a huge deal. JIRA and so many other tools like it are as slow as molasses. Speed is honestly the biggest feature.
- It looks pretty. If your team is going to spend time there, this will end up affecting productivity.
- It has a decent degree of customization and an API. We've automated tickets moving across columns whenever something gets started, a PR is up for review, when a change is merged, when it's deployed to beta, and when it's deployed to prod. We've even built our own CLI tools for being able to action on Linear without leaving your shell.
- It has a lot of keyboard shortcuts for power users.
- It's well featured. You get teams, triaging, sprints (cycles), backlog, project management, custom views that are shareable, roadmaps, etc...
OpsGenie's cheapest plan is $9 per user per month but arbitrarily crippled; the plan anybody would actually want is $19 per user per month.
So instead of a factor of ten, it's ten percent cheaper. And I just kind of expect Atlassian to suck.
Datadog is ridiculously expensive, and on several occasions I've run into problems where an obvious cause of an incident was hidden by bad behavior of Datadog.
So if that's the upgrade path you're going down I'd expect it to be fantastic.
Datadog, on the other side... their "DD University" is a sham, and we as paying customers are overwhelmed and given no real guidance. DD should assign some onboarding time to new customers, even if it is proportional to what you pay annually. (I think I pay around 6,000 USD annually.)
For folks running k8s at any sort of scale, I generally recommend aggregating metrics BEFORE sending them to Datadog, either at a per-deployment or per-cluster level. Individual host metrics also tend to matter less once you have a large fleet.
You can use opensource tools like veneur (https://github.com/stripe/veneur) to do this. And if you don't want to set this up yourself, third party services like Nimbus (https://nimbus.dev/) can do this for you automatically (note that this is currently a preview feature). Disclaimer also that I'm the founder of Nimbus (we help companies cut datadog costs by over 60%) and have a dog in this fight.
Jira: It's overhyped and overpriced. Most people HATE Jira. I guess I don't care enough. I've never met a ticket system that I loved. Jira is fine. It's overly complex, sure, but once you set it up, you don't need to change it very often. I don't love it, I don't hate it. No one ever got fired for choosing Jira, so it gets chosen. Welcome to the tech industry.
Terraform Cloud: The gains for Terraform Cloud are minimal. We just use Gitlab for running Terraform pipelines and have a super nice custom solution that we enjoy. It wasn't that hard to do either. We maintain state files remotely in S3 with versioning for the rare cases when we need to restore a foobar'd statefile. Honestly I like having Terraform pipelines in the same place as the code and pipelines for other things.
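The core of that setup is just a versioned S3 bucket behind the s3 backend; a sketch of the recovery-relevant bits (bucket and key names are hypothetical):

    # one-time: make a foobar'd statefile recoverable
    aws s3api put-bucket-versioning \
        --bucket my-tf-state \
        --versioning-configuration Status=Enabled

    # restoring = listing the old versions and copying one back
    aws s3api list-object-versions \
        --bucket my-tf-state --prefix prod/terraform.tfstate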
GitHub Actions: Yeah, switch to GitLab. I used to like GitHub Actions until I moved to a company using GitLab, and it is best in class, full stop. I could rave about GitLab for hours. I will evangelize for GitLab anywhere I go that is using anything else.
DataDog: As mentioned, DataDog is the best monitoring and observability solution out there. The only reason NOT to use it is the cost. It is absurdly expensive. Yes, truly expensive. I really hate how expensive it is. But luckily I work somewhere that lets us have it and its amazing.
Pagerduty: Agree, switch to OpsGenie. OpsGenie is considerably cheaper and does all the pager stuff of PagerDuty. All the stuff PagerDuty tacks on top to justify its cost is stuff you don't need. OpsGenie does all the stuff you need. It's fine. Similar to Jira, it's not something anyone wants anyway. No one's going to love it; no one loves being on call. So just save money with OpsGenie. If you're going to fight for the "brand name" of something, fight for DataDog instead, not a cooler pager system.
I'll be dead in the ground before I use TFC. 10 cents per resource per month, my ass. We have ~100k resources at the early-stage startup I'm at, our AWS bill is ~$50/mo, and TFC wants to charge me $10k/mo for that? We can hire a senior dev to maintain an in-house tool full time for that much.
To me, its whole scheduling interface is atrocious for the price, given that from an SRE/dev perspective that's literally its purpose: scheduled escalations.
Datadog's cheapest pricing is $15/host/month. I believe that is based on the largest sustained peak usage you have.
We run spot instances on AWS for machine learning workflows. A lot of them if we're training, and none otherwise. Usually we're using zero. Using DataDog at its lowest price would basically double the cost of those instances.
You're staying within an ecosystem you know and it seems to offer almost all of the necessary functionality
With all of that complexity/word salad from TFA, where’s the value delivered? Presumably there’s a product somewhere under all that infrastructure, but damn, what’s left to spend on it after all the infrastructure variable costs?
I get that it's a list of preferences, but still, once you've made your selections, that's a ton of crap to pay for and deal with.
Do we ever seek simplicity in software engineering products?
Things were far more manual and much less secure, scalable, and reliable in the past, but they were also far, far simpler.
The “control plane” was ZooKeeper. Everything had bindings to it, Thrift/Protobuf goes in a znode fine. List of servers for FooService? znode.
The packaging system was a little more complicated than a tarball, but it was spiritually a tarball.
Static link everything. Dependency hell: gone. Docker: redundant.
The deployment pipeline used hypershell to drop the packages and kick the processes over.
There were hundreds of services and dozens of clusters of them, but every single one was a service because it needed a different SKU (read: instance type), or needed to be in Java or C++, or had some other engineering reason. If it didn't have a real reason, it went in the monolith.
This was dramatically less painful than any of the two dozen server type shops I’ve consulted for using kube and shit. It’s not that I can’t use Kubernetes, I know the k9s shortcuts blindfolded. But it’s no fun. And pros built these deployments and did it well, serious Kubernetes people can do everything right and it’s complicated.
After 4 years of hundreds of elite SWEs and PEs (SRE) building a Borg-alike, we’d hit parity with the bash and ZK stuff. And it ultimately got to be a clear win.
But we had an engineering reason to use containers: we were on bare metal, containers can make a lot of sense on bare metal.
In a hyperscaler that has a zillion SKUs on-demand? Kubernetes/Docker/OCI/runc/blah is the friggin Bezos tax. You’re already virtualized!
Some of the new stuff is hot shit, I’m glad I don’t ssh into prod boxes anymore, let alone run a command on 10k at the same time. I’m glad there are good UIs for fleet management in the browser and TUI/CLI, and stuff like TailScale where mortals can do some network stuff without a guaranteed zero day. I’m glad there are layers on top of lock servers for service discovery now. There’s a lot to keep from the last ten years.
But this yo dawg I heard you like virtual containers in your virtual machines so you can virtualize while you virtualize shit is overdue for its CORBA/XML/microservice/many-many-many repos moment.
You want reproducibility: statically link (one-liner below). Save Docker for a CI/CD SaaS or something.
You want pros handling the datacenter because pets are for petting: pay the EC2 markup.
You can’t take risks with customer data: RDS is a very sane place to splurge.
Half this stuff is awesome, let’s keep it. The other half is job security and AWS profits.
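The static-link one-liner mentioned above, for the common case; assuming a Go service (the flags are the standard Go toolchain, the paths are placeholders):

    # fully static binary: no libc dependency, no container image needed
    CGO_ENABLED=0 go build -trimpath -ldflags='-s -w' -o app ./cmd/app
    file app   # reports "statically linked": scp it anywhere and run it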
For 99% of businesses it's a wasteful, massively overkill expense. You don't NEED all the shiny tools they offer; they add nothing to your business but cost. Unless you're a Netflix or an Apple that needs massive global content distribution and processing, there's a good chance you're throwing money away.
There is no way one person can thoroughly understand so many complex pieces of technology. I have worked for roughly 10 years at this point, and I would only call myself confident in 5 technical products, maybe 10 if I'm being generous to myself.
Why do we need entire teams making 1000s of micro decisions to deploy our app?
I’m hungry for a simpler way, and I doubt I’m alone in this.
That doesn't mean these things don't solve problems. The issue, as always, is the complexity-utility tradeoff: some of these things have too much complexity for too little utility. I'm not qualified to judge here, but if the suspects have Turing-complete YAML templates on their hands, it probably ties them to the crime scene.