Has anyone ever switched clouds from one service provider to another (e.g. AWS to Azure or vice versa), partially or entirely?
If so, why? They all offer almost identical services. Do small (but maybe significant?) differences or unique products (e.g. Spanner) make such a big difference that they've swayed someone to switch their cloud infrastructure?
I wonder how much these little things matter, how such a transition (partial or complete) went, and how key stakeholders (who were possibly heavily invested in one cloud, or felt responsible for the initial choice) were convinced to make the switch.
I'd love to hear some stories from real-world experiences and, crucially, what it was that tipped the last domino and made you move.
Their savings from using the credits were at least 20x what the migrations cost.
We did the migration by having reverse proxies in each environment that could proxy to backends in either place, setting up a VPN between them, and switching DNS. The trickiest part was the database failover and ensuring updates would be retried transparently after switching the master.
The upside was that afterwards they had a provider-agnostic setup, ready to do transparent failover of every part of the service, all effectively paid for through the free credits they got.
* Set up haproxy, nginx or similar as a reverse proxy and carefully decide whether you can handle retries on failed queries. If you want a true zero-downtime migration, the challenge is making sure you have a setup that lets you add and remove backends transparently. There are many ways of doing this, of varying complexity. I've tended to favour dynamic DNS updates for this; in this specific instance we used HashiCorp's Consul to keep DNS updated with services. I've also used ngx_mruby for cases where I needed more complex backend selection (it allows writing Ruby code that executes within nginx).
* Set up a VPN (or more, depending on your networking setup) between the locations so that the reverse proxy can reach backends in both/all locations, and so that the backends can reach databases in both places.
* Replicate the database to the new location.
* Ensure your app has a mechanism for determining which database to use as the master. Just as for the reverse proxy, we used Consul to select it. All backends would switch on promotion of a replica to master.
* Ensure you have a fast method to promote a database replica to a master. You don't want to be in a situation of having to fiddle with this. We had fully automated scripts to do the failover.
* Ensure your app gracefully handles database failure of whatever it thinks the current master is. This is the trickiest bit in some cases: you either need to make sure updates are idempotent, or you need to make sure updates during the switchover either reliably fail or reliably succeed. In the case I mentioned we were able to safely retry requests (see the sketch after this list), but in many cases it'll be safer to just punt on a true zero-downtime migration, assuming your setup can handle promotion of the new master fast enough. In our case the promotion of the new Postgres master took literally a couple of seconds, during which any failing updates just translated into some page loads being slow as they retried; if we hadn't been able to retry, it would have meant a few seconds of downtime.
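For illustration, a minimal sketch of what that retry path can look like, assuming psycopg2, an idempotent statement, and a Consul DNS name that always resolves to the current master (the DSN, table and backoff are illustrative, not the exact setup described above):

```python
# Hedged sketch: retry an idempotent write across a master failover.
# Assumes a Consul DNS name that re-resolves to the newly promoted master.
import time
import psycopg2

MASTER_DSN = "host=master.postgres.service.consul dbname=app user=app"  # illustrative

def execute_with_retry(sql, params, attempts=5, delay=0.5):
    """Run an idempotent statement, reconnecting if the master goes away."""
    last_error = None
    for attempt in range(attempts):
        conn = None
        try:
            # Reconnect on every attempt; after failover the DNS name points
            # at the newly promoted replica.
            conn = psycopg2.connect(MASTER_DSN)
            with conn, conn.cursor() as cur:  # commits on success
                cur.execute(sql, params)
            return
        except psycopg2.OperationalError as exc:
            last_error = exc
            time.sleep(delay * (attempt + 1))  # simple linear backoff
        finally:
            if conn is not None:
                conn.close()
    raise last_error

# Safe to retry only because the statement is idempotent.
execute_with_retry(
    "INSERT INTO settings (key, value) VALUES (%s, %s) "
    "ON CONFLICT (key) DO UPDATE SET value = EXCLUDED.value",
    ("feature_flag", "on"),
)
```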
Once you have the new environment running and capable of handling requests (but using the database in the old environment):
* Reduce DNS record TTL.
* Ensure the new backends are added to the reverse proxy. You should start seeing requests flow through the new backends and can verify error rates aren't increasing. This should be quick to undo if you see errors.
* Update DNS to add the new environment's reverse proxy. You should start seeing requests hit the new reverse proxy, and some of them should flow through the new backends. Wait to see if any issues appear.
* Promote the replica in the new location to master and verify everything still works. Ensure whatever replication you need from the new master works. You should now see all database requests hitting the new master.
* Drain connections from the old backends: remove them from the pool (see the sketch after this list), but leave them running until they're no longer handling any requests. You should now have all traffic past the reverse proxy going via the new environment.
* Update DNS to remove the old environment reverse proxy. Wait for all traffic to stop hitting the old reverse proxy.
* When you're confident everything is fine, you can disable the old environment and bring DNS TTL back up.
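As an illustration of the "remove from the pool and drain" step: if the proxy's backend list follows Consul, draining can be as simple as deregistering the service on the old backend and waiting out in-flight requests. A minimal sketch against Consul's standard agent HTTP API (the service ID and grace period are illustrative assumptions):

```python
# Hedged sketch: drain an old backend by deregistering it from Consul, assuming
# the reverse proxy builds its backend list from Consul's catalog/DNS.
import time
import requests

CONSUL = "http://127.0.0.1:8500"    # local Consul agent on the old backend
SERVICE_ID = "app-backend-old-01"   # illustrative service ID

def drain(service_id, grace_seconds=60):
    # Deregister so the proxy stops sending new requests to this backend...
    resp = requests.put(f"{CONSUL}/v1/agent/service/deregister/{service_id}")
    resp.raise_for_status()
    # ...then keep the process running long enough to finish in-flight requests.
    time.sleep(grace_seconds)

drain(SERVICE_ID)
```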
The precise sequencing is very much a question of preference; the point is that you're switching over and testing change by change, and for most of the steps you can go back a step without too much trouble. I tend to prefer doing the changes that are low effort to reverse first. Keep in mind that some changes (like DNS) can take time to propagate.
EDIT: You'll note most of this is basically to treat both sites as one large environment using a VPN to tie them together and ensure you have proper high availability. Once you do, the rest of the migration is basically just failing over.
- don't use any cloud service that isn't a packaged version of an installable/usable OSS project
- architect your services to be able to double-write and switch over the read source with A/B deployments (see the sketch below)
If you can migrate your database without downtime this way, then you are much more flexible than if not.
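A minimal sketch of one way the double-write can look, under the assumption that the old database stays the source of truth, writes to the new one are best-effort, and failed secondary writes are queued for replay by a reconciliation job rather than rolled back (connection strings, table and retry log are illustrative):

```python
# Hedged sketch: authoritative write to the primary, best-effort write to the
# secondary, with failures queued for idempotent replay. Names are illustrative.
import json
import logging
import psycopg2

PRIMARY_DSN = "host=old-db.internal dbname=app"
SECONDARY_DSN = "host=new-db.internal dbname=app"
UPSERT = ("INSERT INTO settings (key, value) VALUES (%s, %s) "
          "ON CONFLICT (key) DO UPDATE SET value = EXCLUDED.value")

def _upsert(dsn, key, value):
    conn = psycopg2.connect(dsn)
    try:
        with conn, conn.cursor() as cur:  # commits on success
            cur.execute(UPSERT, (key, value))
    finally:
        conn.close()

def write(key, value, retry_log="secondary_retry.log"):
    # 1. The primary write is authoritative; failures propagate to the caller.
    _upsert(PRIMARY_DSN, key, value)
    # 2. The secondary write is best-effort; on failure, record it so a
    #    reconciliation job can replay it (the idempotent upsert makes replay safe).
    try:
        _upsert(SECONDARY_DSN, key, value)
    except psycopg2.Error:
        logging.exception("secondary write failed, queuing for replay")
        with open(retry_log, "a") as f:
            f.write(json.dumps({"key": key, "value": value}) + "\n")
```

Switching the read source is then an A/B-style config flip once the secondary has caught up.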
Can you share any details on how to achieve this?
For instance, if the first database accepts the write but the second is temporarily inaccessible or throws an error, do you roll back the transaction on the first and throw an error, or <insert_clever_thing_here> ... ?
But with migrations from one database to another at different locations, I'm lukewarm to it because it means having to handle split-brain scenarios, and often that ends up resulting in a level of complexity that costs far more than it's worth. Of course your mileage may vary - there are situations where it's easy enough and/or where the value in doing it is high enough.
Would you say bare metal cost a lot of extra monitoring/maintenance, or is this something you did on the cloud hardware as well anyway? Do you run virtualization on the Hetzner machines?
In terms of monitoring, it boils down to picking a solution and building the appropriate monitoring agent into your deployment.
I've run basically everything in some virtualized environment or other since ~2006 at least, be it OpenVZ (ages ago), KVM, or Docker. And that goes for Hetzner too. It makes it easy to ensure the app environment is identical no matter where you move things. I managed one environment where we had stuff running on-prem, in several colos, on dedicated servers at Hetzner, and in VMs, and even on the VMs we still containerised everything; deployment of new containers was identical no matter where you deployed. All of the environment-specific details were hidden in the initial host setup.
https://elest.io
https://nimbusws.com (I'm building this one so I'm biased for it).
> Would you say bare metal cost a lot of extra monitoring/maintenance, or is this something you did on the cloud hardware as well anyway? Do you run virtualization on the Hetzner machines?
It cost a lot of monitoring/maintenance up front, but once things are purring the costs amortize really well. Hetzner has the occasional hard drive failure[0], but you should be running in a RAIDed setup (that's the default for Hetzner-installed OSes), so you do have some time. They also replace drives very quickly.
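If you want a quick self-check on top of that, a minimal sketch of the kind of thing you can cron: it parses /proc/mdstat for degraded md arrays (the path and the exit-code alerting are assumptions; wire it into whatever monitoring you already run):

```python
# Hedged sketch: flag degraded Linux software RAID (md) arrays. Healthy mirrors
# show e.g. "[2/2] [UU]" in /proc/mdstat; a dropped member shows "[2/1] [U_]".
import re
import sys

def degraded_arrays(path="/proc/mdstat"):
    with open(path) as f:
        text = f.read()
    degraded = []
    for match in re.finditer(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", text):
        expected, active, flags = match.groups()
        if expected != active or "_" in flags:
            degraded.append(match.group(0))
    return degraded

if __name__ == "__main__":
    bad = degraded_arrays()
    if bad:
        print("DEGRADED RAID:", bad)
        sys.exit(1)  # non-zero exit so cron/monitoring can alert on it
    print("all md arrays healthy")
```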
If you really want to remove this headache, you run something like Ceph and make sure data copies are properly replicated to multiple hosts; then you'll be fine even if two drives on a single host die at the same time. Obviously nothing is ever that easy, but I know that I spend pretty much no time thinking about it these days.
I run a Kubernetes cluster, and most of my downtime/outages have been self-inflicted; I'm wonderfully happy with my cluster now. Another nice thing to note is that control plane downtime != workload downtime, and you can hook up Grafana etc. to monitor it all.
[0]: https://vadosware.io/post/handling-your-first-dead-hetzner-h...
For these companies it wasn't a problem to have a few minutes of downtime, so the task was simply recreating their (usually AWS) production environment in Google Cloud.
It's nice that you ended up with a provider agnostic capability to deploy anywhere, but none of that was free in terms of ownership costs to get there.
So, no, it wasn't free, but it saved them far more money than it cost them, both in the initial transition and in ongoing total cost of operation.
In fact, my first project for them was to do extensive cost-modelling of their development and operations.
I think at some point Azure announced $X in free credits for YC members, and GitLab determined this would save us something like a year's worth in bills (quite a bit of money at the time). Moving over was rather painful, and I think in the months that we used Azure literally nobody was happy with it. In addition, I recall we burned through the free credits _very_ fast.
I don't recall the exact reasoning for switching to GCP, but I do recall it being quite a challenging process that took quite some time. Our experiences with GCP were much better, but I wouldn't call it perfect. In particular, GCP had/has a tendency to just randomly terminate VMs for no clear reason whatsoever. Sometimes they would terminate cleanly, other times they would end up in a sort of purgatory/in between state, resulting in other systems still trying to connect to them but eventually timing out, instead of just erroring right away. IIRC over time we got better at handling it, but it felt very Google-like to me to just go "Something broke, but we'll never tell you why".
Looking back, if I were to start a company I'd probably stick with something like Hetzner or another affordable bare metal provider. Cloud services are great _if_ you use their services to the fullest extent possible, but I suspect for 90% of the cases it just ends up being a huge cost factor, without the benefits making it worth it.
One additional suggestion to people considering bare metal: consider baking in a VPN setup from the start, and pick a service discovery mechanism (such as Consul) that is reasonably easy to operate across data centres. Then you have what you need to do a migration if you ever need to, and you also have the backbone to turn your setup into a hybrid one that can extend into whichever cloud provider you want.
A reason for wanting that is that one of the best ways I've found of cutting the cost of bare metal even further is the ability to handle bursts by spinning up cloud instances in a pinch. It allows you to safely increase the utilisation of your bare metal setup substantially, with corresponding cost savings, even if in practice you rarely end up needing the burst capability. It doesn't even need to be fully automated, as long as your infra setup is flexible enough to accommodate it reasonably rapidly. E.g. just having an AMI ready to go with whatever you need to have it connect to a VPN endpoint and hook into your service discovery/orchestration on startup can be enough.
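To make that concrete, a minimal sketch of launching burst capacity from a pre-built image whose startup script joins the VPN and service discovery (assumes boto3; the AMI ID, instance type and systemd unit names are illustrative assumptions):

```python
# Hedged sketch: spin up burst workers from a pre-built AMI. The image is
# assumed to already contain the app plus VPN and Consul agents; the unit
# names, AMI ID and instance type below are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="eu-central-1")

USER_DATA = """#!/bin/bash
systemctl start vpn-to-colo.service   # illustrative unit that brings up the VPN
systemctl start consul.service        # joins service discovery on boot
"""

def launch_burst_workers(count=2):
    resp = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",  # placeholder pre-built image
        InstanceType="c5.xlarge",
        MinCount=count,
        MaxCount=count,
        UserData=USER_DATA,
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "role", "Value": "burst-worker"}],
        }],
    )
    return [i["InstanceId"] for i in resp["Instances"]]
```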
Do you just use iptables? Or do you build out more complex solutions like software routers running on Linux/BSD?
I work in online gaming, and we're constantly seeing attacks on our infra.
You can add a magic header to traffic and drop anything that doesn’t contain the header.
Since this is done in hardware it operates at line speed. In our case 100GBit/s.
If there isn't a possibility for internal networking (free), then I'd probably use the included iptables for a firewall on each machine. You should honestly have this running on the game servers anyway, if only to restrict traffic to what comes from the reverse proxy (see the sketch below).
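A minimal sketch of that host firewall, applied from a small script (the proxy address and game port are illustrative; in practice you'd persist the rules via iptables-persistent or your config management):

```python
# Hedged sketch: only the reverse proxy may reach the game port; everything
# else hitting that port is dropped. Addresses and port are placeholders.
import subprocess

PROXY_IP = "10.0.0.5"   # reverse proxy, reachable over the internal network/VPN
GAME_PORT = "7777"

RULES = [
    # Accept game traffic only from the reverse proxy...
    ["-A", "INPUT", "-p", "tcp", "--dport", GAME_PORT, "-s", PROXY_IP, "-j", "ACCEPT"],
    # ...and drop that port for everyone else.
    ["-A", "INPUT", "-p", "tcp", "--dport", GAME_PORT, "-j", "DROP"],
]

for rule in RULES:
    subprocess.run(["iptables"] + rule, check=True)
```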
https://www.hetzner.com/unternehmen/ddos-schutz
[0]: https://nimbusws.com
That sounds like the worst reason to change cloud providers: "because that provider bribed me to"
Running our own stuff on high-powered servers is very easy and less trouble than you'd think. Sorting out the deploy with a "git push" and build container(s) meant we could just set it and forget it.
We have a bit under a terabyte of PostgreSQL data. Any cloud is prohibitively expensive.
I think some people think the cloud is as good as sliced bread. It does not really save any developer time.
But it's slower and more expensive than your own server by a huge margin. And I can always do my own stuff on my own iron. Really, I can't see a compelling reason to be in the cloud for the majority of mid-level workloads like ours.
I work on a very small team. We have a few developers who double as ops. None of us are or want to be sysadmins.
For our case, Amazon's ECS is a massive time and money saver. I spent a week or two a few years ago getting all of our services containerized. Since then we have not had a single full production outage (they were distressingly common before), and our sysadmin work has consisted exclusively of changing versions in Dockerfiles from time to time.
Yes, most of the problems we had before could have been solved by a competent sysadmin, but that's precisely the point: hiring a good sysadmin is way more expensive for us than paying a bit extra to Amazon and just telling them "please run these containers with this config."
It's such a huge misconception that by using a cloud provider you can avoid having "sysadmins" or don't need those kinds of skills. You still need them, no matter which cloud and which service you use.
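Roughly, such a deploy boils down to registering a new task definition revision and pointing the service at it; ECS handles the rollout. A minimal sketch with boto3 (cluster, service, container names and sizes are illustrative, not necessarily how any particular setup is wired):

```python
# Hedged sketch: "please run these containers with this config" on ECS.
# Register a new task definition revision and roll the service onto it.
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

def deploy(image_uri, family="web-app", cluster="prod", service="web-app"):
    # Register a new revision with the updated image...
    task_def = ecs.register_task_definition(
        family=family,
        containerDefinitions=[{
            "name": "web",
            "image": image_uri,
            "memory": 512,                     # MiB, container-level limit
            "portMappings": [{"containerPort": 8080}],
            "essential": True,
        }],
    )
    arn = task_def["taskDefinition"]["taskDefinitionArn"]
    # ...and point the service at it; ECS drains old tasks and starts new ones.
    ecs.update_service(cluster=cluster, service=service, taskDefinition=arn)
    return arn
```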
It's lower level than functions as a service, but much cheaper, more performant, matches local developer setups more closely (allowing for local development vs. trying to debug AWS Lambda or Cloudflare FaaS using an approximation of how the platform would work).
My company takes security very seriously, so if these two systems were running on bare metal I'd probably be spending one day a week patching servers rather than implementing new functionality across two products.
And yet… Sysadmin tasks take up maybe 2 hours a month.
Your theory is right, though, if no one on your team knows how to set up servers; in your case the cloud makes sense.
I've been running a PostgreSQL cluster with significant usage for a few years now, never had more than a few seconds of downtime, and I spend next to no time maintaining the database servers (apart from patching). If most requests are read-only, clusters are so easy to do in Postgres. And even if one of the providers banned my account, I'd just promote a server at another provider to master and could continue without downtime.
I recently calculated what a switch to a cloud provider would cost, and it was at least 10x what I pay now, for less performance and with lock-in effects.
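A minimal sketch of that promotion step, assuming PostgreSQL 12+ where pg_promote() is available (on older versions you'd run pg_ctl promote on the replica host instead); the host name is illustrative:

```python
# Hedged sketch: promote a replica at another provider to master, then repoint
# clients (e.g. via DNS/Consul) at it. Assumes PostgreSQL 12+ and sufficient
# privileges to call pg_promote(); the DSN is a placeholder.
import psycopg2

REPLICA_DSN = "host=db2.other-provider.example dbname=postgres user=postgres"

def promote_replica(dsn, wait_seconds=30):
    conn = psycopg2.connect(dsn)
    conn.autocommit = True  # run the call as a single standalone statement
    try:
        with conn.cursor() as cur:
            # pg_promote(wait, wait_seconds) returns true once promotion completes.
            cur.execute("SELECT pg_promote(true, %s)", (wait_seconds,))
            promoted = cur.fetchone()[0]
    finally:
        conn.close()
    if not promoted:
        raise RuntimeError(f"promotion did not complete within {wait_seconds}s")

promote_replica(REPLICA_DSN)
```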
But I understand that there are business cases and industries where outsourcing makes sense.
For a lot of big organizations it's a matter of accountability. If they can say "AWS went down" rather than "our dedicated servers went down", it matters a lot for insurance and clients.
What I don't get is 4-person startups paying thousands to AWS ... because everybody does it.
One company had 6 servers and used AWS snapshots for backup + managed MySQL.
Backup and recovery of that DB is something more people on the team can handle than if it were running as a non-managed service.
And while we made it work by sticking to the lowest common denominator, which was FaaS/IaaS (Lambda, S3, API GW, K8s), it was certainly not easy. We also ignored tools that would have helped us greatly but only on a single cloud, in order to stay multi-cloud.
The conclusion after 2 years for us is kind of not that exciting.
[1] AWS is the most mature one, Azure is best suited for Microsoft products and old-enterprise features, and IBM is best if you only use K8s.
[2] Each cloud has a lot of unique closed-code features that are amazing for certain use cases (such as Athena for S3 in AWS, or Cloud Run in GCP). But relying on them means you are trapped in that cloud. Looking back, Athena could have simplified our specific solution if we had been on AWS only.
[3] Moving between clouds, given shared features, is possible, but it's definitely not a couple of clicks or a couple of Jenkins jobs away. Moving between clouds is a full-time job. Finding how to do that little VM thing you did in AWS, now in Azure, will take time and learning. And moving between AWS IAM and Azure AD permissions? Time, time and more time.
[4] Could we have known which cloud was best for us beforehand? No. Only after developing our product did we know exactly which cloud would offer us the most suitable set of features. Not to mention the different credits and discounts we got as a startup.
Hope this helps.
[1] https://cloud.google.com/run
Apache Drill is in a similar space to Athena and can query unstructured or semi-structured data in object storage like S3.
Maybe I should have said "closed internal access features"
Did you run simultaneously in 3 clouds? Can you explain the setup?
If not, did you just run on each for a while to test, or did you have a reason to switch?
This is probably an impossible question to answer, but: were the savings/benefits of doing this actually worth the engineering costs involved in the end? E.g. even if you had chosen what turned out to be the most expensive, worst option, would the company ultimately have been in a better place by having engineering focused on building things that increase customer value instead?
> Did you run simultaneously in 3 clouds? Can you explain the setup?
The solution itself could run on a single cloud. But we work in the finance sector and target highly regulated clients, and we got a tip very early on that each client could ask for deployment in their own cloud account, monitored by them, which would probably be AWS or Azure. Today we know only some require that. So it helped somewhat.
> were the savings/benefits of doing this actually worth the engineering costs involved in the end?
Like you said, very hard to know. In our case we had a DevOps/cloud guy working a full-time job, so it was not noticeable. Probably because:
[1] Although he had problems to solve on all clouds, cloud deployments eventually get stable enough, so the pressure was spread out.
[2] Although all clouds still need constant maintenance, it's asynchronous (you can't plan ahead for when AWS EKS will force a new K8s version), so the pressure was spread out and it never stopped client feature building.
But who knows, maybe for other architectures or a bigger company, it would have become noticeable.
I went from AWS (cost ~£25/mo) to Microsoft Azure DE because I didn't want any user data to be subject to search by US law enforcement/national security. I thought the bill would be about the same, but it more than quadrupled almost overnight even though traffic levels, etc., were the same (i.e., practically non-existent).
What was happening was that Azure was absolutely shafting me on bandwidth, even with Cloudflare in the mix (CF really doesn't help unless you're getting a decent amount of traffic).
In the end I moved to a dedicated server with OVH that was far more powerful and doesn't have the concerns around bandwidth charges (unless I start using a truly silly amount of bandwidth, of course).
10 big dedicated servers can probably handle the same load as 100s to 1000s of cloud nodes for a fraction of the cost. Configuration and general complexity might even be simpler without the cloud.
It's not as hard as people make it out to be to set up backup and redundancy/failover.
The first group are almost always better served by dedicated VMs or hardware from a provider specializing in them, if the VM is long-lived.
Managing your own infrastructure (with dedicated servers, so no hardware management) isn't too hard, even if you're a small shop. And managing a fleet of AWS services isn't necessarily less work.
Maybe there's a reason all ads for cloud tend to compare it to running your own data center. Because once you get rid of hardware management, it's not really that much easier being in the cloud, at the risk of lock-in and huge surprise bills.
The cloud is only ever worth it when you use the higher-tier services, like AWS Lambda and the likes. Even running Kubernetes in the cloud is only semi worth it, because it's not high enough in the stack and still touches too many low-level IaaS services.
Of course, higher tier means more vendor lock-in and more knowledge required and all that. But if you are not willing to pay that price, then OVH, Hetzner and the likes will have better offerings for you.
That still isn't much traffic (at all) in the grand scheme of things, and Cloudflare's lowest paid tier deals with around 80% of the bandwidth. Still, it's not hard to imagine that bill blowing up to several hundred pounds per month had I chosen not to act. That would translate to several thousand pounds over the course of a year. I don't know very many people for whom such an expenditure, particularly if it's unnecessary and avoidable, would be something they'd regard as insignificant.
Putting it into everyday terms: it could have grown into my second largest outgoing after my mortgage. That doesn't really seem proportionate or sensible, so why wouldn't I look for a better deal?
With your own hardware, scaling is not as easy. You'll have to do a lot more plumbing too: networking, security, and many other things you'll have to address yourself. Stuff that has already been solved for you.
Especially for internal use (CI/CD, analytics), you'd rather queue a few things up than always have to consider your budget when you want to run something.
Honestly? It's quite fun. Despite considering myself more of a programmer than a devops person, I really like the devops stuff, as long as I'm doing it as part of a team where I know the domain and the code, and I'm not that general devops guy who gets dropped into a project, does devops for them, and gets pulled into another one to do the same.
Working out all those little details of switching from AWS to Azure is fun and enjoyable, and I also feel like I'm doing something meaningful and impactful. Luckily there's not much vendor lock-in, as most of the stuff lives in k8s; otherwise it would be much trickier.
Anybody's cloud strategy should try to stick to the most basic services/building blocks possible (containers/VMs, storage, databases, queues, load balancers, DNS, etc.) to facilitate multi-cloud and/or easy switching.
Not that each cloud doesn't have its quirks that you'll have to wrap your head around, but if you go all in with the managed services you're eventually going to have a bad time.
Google does have some innovative big data products like BigQuery and Dataflow. In general, choosing GCP over AWS shouldn't hinder a company's growth at this point IMO.
Is there a particular reason for this?