Interested in how they handle DB updates/migrations (I don't know what Slack uses as a data storage backend).
IMO those DB migrations are the most difficult and fraught with risk, because you need to ensure that all the server versions running while the deploy is in progress can work with whatever state your DB is in at the moment.
It's always nice to see how other teams do it. Nothing too groundbreaking here but that's a good thing.
I did notice the screenshot of "Checkpoint", their deployment tracking UI. Are there solid open source or SaaS tools doing something similar? I've seen various companies build similar tools in-house, but most deployment processes seem consistent enough that a 3rd-party tool could be useful for most teams.
I've built that tool 2-3 times now. The issue is really the deploy function and what controls it. It's always a one-off, or so tightly integrated into the hosting environment, that reaching in with a SaaS product is somewhat difficult. That being said, the new lowest-common-denominator standards like K8s make it way easier. If anyone is interested in using a tool just leave a comment and I'll reach out.
Sleuth is a SaaS deployment tracker that pulls deployments from source repositories, feature flags, and other sources, in addition to pushes via curl. You can see Sleuth used to, well, track Sleuth at https://app.sleuth.io/sleuth
I can also recommend Sleuth. We use it at our company and the integration is very good. Their team is constantly working on new features, integrations and better UI.
Sure, I see your point. I'd just like to see a pattern that works for most that could gain some traction. At the end of the day we're all trying to do the same thing (deploy high quality software), just in different ways. Deployment strategy shouldn't need to be a main competency of most teams.
I've never seen anything that could even remotely give us what we wanted. We ultimately decided to roll our own devops management platform in-house which was 100% focused on our specific needs. We are now on generation 4 of this system. We just rewrote our principal devops management interface using Blazor w/ Bootstrap4. The capabilities of the management system relative to each environment are fairly absolute - Build/Deploy/Tracing/Configuration/Reporting/etc. is all baked in. We can go from merging a PR in GitHub to a client system being updated with a fresh build of master in exactly 5 button clicks with our new system.
The central service behind the UI is a pure .NET Core solution which is responsible for executing the actual builds. The entire process is self-contained within the codebase itself. You get very powerful contract enforcement when the application you are building and tracking is part of the same type system as the application building and tracking it.
I'm curious what a Jenkins + Octopus system is missing that your system provides. Most companies would have a hard time justifying the expense to build a bespoke system just for devops.
Fun to read, but there's a lack of detail here that I'd like to see. For example, this talks purely about code changes. However, sometimes a code change requires a database schema change (as mentioned above), different APIs to be used, etc. In the percentage-based rollout where multiple versions are in use at once, how are these differences handled?
For database schema changes, here is the standard practice (a toy sketch follows the list):
- You have version 1 of the software, supporting schema A.
- You deploy version 2, supporting both schema A and the new schema B. Both versions coexist until the deployment is complete and all version 1 instances are stopped. During all this time the database is still on schema A, which is fine because your instances, both version 1 and 2, support schema A.
- Now you do the schema upgrade. This is fine because your instances, now all running version 2, support schema B.
- At last, if you wish, you can deploy a version 3 that drops support for schema A.
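A toy sketch of what the version 2 "supports both schemas" code might look like. The full_name vs. first_name/last_name split is invented for illustration, not anything from the article:

    # Hypothetical example: schema A stores "full_name"; schema B splits it
    # into "first_name"/"last_name". Version 2 must read rows in either shape.
    def read_user(row: dict) -> dict:
        """Normalize a user row regardless of which schema wrote it."""
        if "full_name" in row:                    # schema A row
            first, _, last = row["full_name"].partition(" ")
        else:                                     # schema B row
            first, last = row["first_name"], row["last_name"]
        return {"first": first, "last": last}

    # Version 2 keeps writing schema A until the migration flips, so any
    # version 1 instances still running can read everything it writes.
    print(read_user({"full_name": "Ada Lovelace"}))
    print(read_user({"first_name": "Ada", "last_name": "Lovelace"}))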
We do it the other way (and I’ve always seen it done this way): database change is compatible with current code and new code. So deploy the database change, then deploy the code change. It usually allows you to rollback code changes.
My company uses HBase currently for things on premise and we're moving to a mix of psql and BigTable in GCP. This is how we do things except all of our "schemas" are defined by the client so we just have to make sure that serialization/deserialization works correctly. With psql we might have to figure out a migration strategy, but for now we'll just be using it to store raw bytes.
Always make your code compatible with the old and new schema. Migrate the database separately. Then after the migration, remove the code that supports the old schema.
It would be good practice to first make a DB change alone which is compatible with both old and new code, so you don't need rollbacks. Then separately deploy the code change.
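To make that ordering concrete, a minimal sqlite3 sketch (table and column names invented): an additive, nullable column is the classic change that both old and new code tolerate, which is what keeps code rollbacks safe:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

    # Step 1: additive DB change, compatible with BOTH old and new code.
    # Old code never mentions the column; new code can start using it.
    con.execute("ALTER TABLE users ADD COLUMN email TEXT")

    # Old code keeps working unchanged:
    con.execute("INSERT INTO users (name) VALUES ('ada')")
    # Step 2, a separate deploy: new code starts writing the column.
    con.execute("INSERT INTO users (name, email) VALUES ('grace', 'g@example.com')")
    print(con.execute("SELECT id, name, email FROM users").fetchall())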
> Even strategies like parallel rsyncs had their limits.
They don't really go into detail as to what limitations they hit by pushing code to servers instead of pulling. Does anyone have any ideas as to what those might be? I can't think of any bottlenecks that wouldn't apply in both directions, and pushing is much simpler in my experience, but I've also never been involved with deployments at this scale.
I can't speak for Slack, but it's not unreasonable to believe that a single machine's available output bandwidth (~10-40 Gbps) can be saturated during a deploy of a ~GB package to hundreds of machines. Pushing the package to S3 and fetching it back down lets the bandwidth be spread over more machines and over different network paths (e.g. in other data centers).
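Rough back-of-the-envelope numbers, with a made-up package size and fleet size:

    # Made-up numbers: a 1 GB package pushed from one machine to 500 hosts.
    package_gb, hosts, nic_gbps = 1, 500, 25

    total_gbits = package_gb * 8 * hosts       # 4000 Gbit leaves one NIC
    push_seconds = total_gbits / nic_gbps      # ~160 s with the NIC saturated
    print(f"push from a single host: ~{push_seconds:.0f}s of saturated NIC")
    # Pulling from S3 spreads those 4000 Gbit over many S3 front-ends and
    # network paths, so no single NIC is the bottleneck.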
We do it similarly except we push an image to a docker registry (backed by multi-region S3), then you can use e.g. ansible to pull it to 5, 10, 25, 100% of your machines. It "feels" like push though, except that you're staging the artifact somewhere. But when booting a new host it'll fetch it from the same place.
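A hypothetical sketch of that staged pull; the inventory, stage percentages, and ssh/docker commands are all invented stand-ins for what would really be an ansible play:

    import subprocess

    HOSTS = [f"web-{i:03d}" for i in range(200)]   # invented inventory
    STAGES = [0.05, 0.10, 0.25, 1.00]              # 5% -> 10% -> 25% -> 100%

    def rollout(image: str) -> None:
        done = 0
        for pct in STAGES:
            target = int(len(HOSTS) * pct)
            for host in HOSTS[done:target]:
                # Stand-in for an ansible play: the host pulls the image itself.
                subprocess.run(
                    ["ssh", host, f"docker pull {image} && docker restart app"],
                    check=True)
            done = target
            input(f"{done}/{len(HOSTS)} hosts updated; Enter for next stage...")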
Considering they are not bringing machines out of rotation or draining connections in the example given with the errors, I assume that more than 10 machines produces too many errors or takes too long to have two versions of the code deployed, and wherever they pull from is not scalable. All those problems can be easily solved though.
I'm surprised at the 12 deployments per day, if that's truly to production. There are bugfixes etc., but feature-wise Slack has been... let's say slow. Not Twitter slow, but still slow, in making any user-visible changes.
Far too many people on HN seem to think the public facing code that we see is all that the engineering team in a large company works on. There's so much more to running a large SaaS business. If Slack is like all the other SaaS companies I've encountered they'll have dozens of internal apps for sales, comms, analytics, engineering, etc that they work on that people outside of the business never see[1]. Those all need developing and all need deploying.
[1] They might buy in solutions for some business functions like accounting, HR and support, but they'll still have tons of homegrown stuff. Every tech company does.
Lots of places do a lot of deploys but hide significant new features behind A/B testing and feature flags. So the two things are disconnected from each other.
User visible changes are dependent on the product development process rather than the rate of deploys. Whether you deploy 12 times a day or once a month, it's not like code is getting written any faster.
I wonder why they didn't at some point evaluate an immutable-infrastructure approach, leveraging tools like Spinnaker to manage the deploy. They surely have the muscle and numbers to use it and even contribute to it actively, no? I mean, I know that deploying your software is usually something pretty tied to a specific engineering team, but I really like the immutable approach, and I was wondering why a company the size of Slack, born and grown in the "right" time, did not consider it.
I'm kind of surprised they don't have a branch-based staging. Every place I've worked at has evolved in the direction of needing the ability to spin up an isolated staging environment that was based on specific tags or branches.
It's cool to see how big organizations have set up their deployments, but it feels like there aren't enough resources about how one should set up a deployment system for a new startup just at the beginning.
The setup I currently use is custom bash scripts setting up EC2 instances. Each instance installs a copy of the git repo(s), and runs a script to pull updates from the production/staging branches, compile a new build, replace the binaries & frontend assets, restart the service, and send a Slack message with the list of changes just deployed.
It works well enough for a startup with 2 engineers. However, I'd like to know: what could be better? What could save me the time of maintaining my own deployment system in the AWS world, without investing days of resources into K8s?
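For reference, the per-instance flow described above might look roughly like this; the repo path, make target, service name, and webhook URL are all placeholders, not the poster's actual setup:

    import json, subprocess, urllib.request

    REPO = "/srv/app"  # placeholder checkout location

    def deploy(branch: str = "production") -> None:
        run = lambda *cmd: subprocess.run(
            cmd, cwd=REPO, check=True, capture_output=True, text=True)
        run("git", "fetch", "origin")
        # Collect the changes we're about to ship, for the Slack message.
        log = run("git", "log", "--oneline", f"HEAD..origin/{branch}").stdout
        run("git", "merge", "--ff-only", f"origin/{branch}")
        run("make", "build")                      # placeholder build step
        subprocess.run(["systemctl", "restart", "app"], check=True)
        req = urllib.request.Request(
            "https://hooks.slack.com/services/PLACEHOLDER",
            data=json.dumps({"text": f"Deployed {branch}:\n{log}"}).encode(),
            headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req)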
You don't have to do a big-bang style Google thing. You can just invest in some continuous improvement over the next few years:
Iteration 0: What you have now.
Iteration 1: A build server builds your artifact, and your EC2 instances download the artifact from the build server (sketched below, after the list).
Iteration 2: The build server builds the artifact and builds a container and pushes it to ECR. Your EC2 instances now pull the image into Docker and start it.
Iteration 3: You use ECS for basic container orchestration. Your build server instructs your ECS instances to download the image and run them, with blue-green deployments linked to your load balancer.
Iteration 4: You set up K8s and your build server instructs it to deploy.
I went in a similar trajectory, and I'm at iteration 3 right now, on the verge of moving to K8s.
It's your call on how long the timespan is here, and commercial pressures will drive it. It could be 6 months, it could be 3 years.
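A minimal sketch of Iteration 1's fetch step, assuming the build server exposes artifacts over plain HTTP (the URL and paths are invented):

    import tarfile, urllib.request

    ARTIFACT_URL = "http://build.internal:8080/artifacts/app-1234.tar.gz"

    def fetch_and_unpack(dest: str = "/srv/app") -> None:
        # Download the centrally built artifact to a temp file, then unpack
        # it over the app directory; no compiler needed on the instance.
        path, _ = urllib.request.urlretrieve(ARTIFACT_URL)
        with tarfile.open(path) as tar:
            tar.extractall(dest)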
For me, it feels a bit "wrong" to be building on each production server.
Firstly, production servers are usually "hardened", and only have installed what they need to run, reducing the attack surface as much as possible.
Secondly, for proprietary code, I don't want it on production servers.
But most importantly, I want a single, consistent set of build artifacts that can be deployed across the server/container fleet.
You can do this with CI/CD tools, such as Azure DevOps (my personal favourite), GitHub Actions, CircleCI, Jenkins and AppVeyor.
The way it works is you set up an automated build pipeline, so when you push new code, it's built once, centrally, and the build output is made available as "build artifacts". In another pipeline stage, you can then push out the artifacts to your servers using various means (rsync, FTP, build agent, whatever), or publish them somewhere (S3, Docker Registry, whatever) where your servers can pull them from. You can have more advanced workflows, but that's the basic version.
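A toy version of that two-stage shape, with made-up hosts and a made-up make target; a real pipeline would express this as CI config rather than a script:

    import subprocess

    def build_stage() -> str:
        subprocess.run(["make", "dist"], check=True)   # build exactly once
        return "dist/app.tar.gz"                       # the "build artifact"

    def deploy_stage(artifact: str, hosts: list[str]) -> None:
        for host in hosts:
            # Every host receives byte-identical output of the same build.
            subprocess.run(["rsync", "-az", artifact, f"{host}:/srv/app/"],
                           check=True)

    deploy_stage(build_stage(), ["web-1", "web-2"])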
Automate compilation on a build server and run the tests there; if everything is OK, push the artifacts to your servers. This way you can guarantee that the code is tested and all running versions come from the same build environment.
If you make your application stateless and have it in a container, there are many managed services out there that can do this for you. For example, in AWS there are Fargate and EKS.
[Disclaimer: am a Sleuth co-founder]
Hi Don :)
Definitely disagree with this. I have never worked at two places with deploy processes similar enough to benefit from a generic tool.
Here's what it looks like: https://twitter.com/shl/status/1128039742308737024/photo/2
Apart from tracking deployments, we're really focused on tracking bills of materials and communication between Business and Tech teams.
- migrate the DB and create the new field
- deploy code that writes to the new field (not reading it yet), in parallel with the old field
- backfill the data for older records
- deploy code with a feature flag to read the new field in workflows, while still writing to both fields (see the sketch after this list)
- switch the read feature flag on
- make sure everything works for a few weeks
- switch the write feature flag to only use the new field
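A rough sketch of the read/write flag steps above; the field names and the in-process flag dict are invented (a real system would use a flag service):

    FLAGS = {"read_new_field": False, "write_old_field": True}

    def save_order(row: dict, amount_cents: int) -> None:
        row["amount_cents"] = amount_cents        # new field: always written
        if FLAGS["write_old_field"]:
            row["amount"] = amount_cents / 100    # old field, kept during migration

    def load_amount_cents(row: dict) -> int:
        if FLAGS["read_new_field"]:
            return row["amount_cents"]            # new read path
        return round(row["amount"] * 100)         # old read path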
Edit: also suggested by Martin Fowler https://www.martinfowler.com/bliki/BlueGreenDeployment.html