How to Do a Full Rewrite

Completed a full rewrite of many components of the Kraken.com backend in about 4 years.

The new system is around 1.5M LoC of Rust. There was no serious alternative to rewriting; sometimes you find yourself in a corner and need to fix issues and pay the price.

I wrote about it 3 years ago here https://blog.kraken.com/product/engineering/oxidizing-kraken...

Everything in that blog post still rings true, and in hindsight we were right. But it was a massive grind and required extreme dedication to get it done; for a variety of reasons, that work was very taxing.

We also didn't stop feature development and kept the two systems running concurrently (which explains why it took so long, also growing and training a new team 10x the size took time, so there are many factors).

I'm also against rewrites if I can help it, but reality is complex and sometimes we can't help it. Now however, since we removed the last pieces of legacy that were preventing larger DB schema changes (or required massive, unreasonable changes to the legacy systems), we've been shipping faster and easier than ever and caught up on a lot of the accumulated backlog, including some of the more ambitious projects that were unthinkable in the legacy systems due to limitations.

SenorKimchi · 2 years ago

Huge fan of Kraken.

Looking back, is there anything that you would have done differently? I find that half or more of the rewrites that I have dealt with have been driven by all the wrong motivations. You get inevitable turnover and at some point people dislike code that they didn't write themselves and push for a rewrite, maybe changing the stack to something trendy, justifying it with thin arguments. Once the rewrite starts the company ends up treading water for years while incurring a ton of costs. For me, I think only 1 rewrite that I was part of was a good decision in my 15 years in tech. If I could go back in time, I think I would kill all rewrite discussions the moment that someone first whispers the idea.

How did you guys enjoy switching to Rust? I assume the safety and performance benefits for the trading system are a huge plus (didn't Kraken trading go down for an entire week a few years ago?). Did you also rewrite the webapp backend in Rust as well? How has staffing and budgeting been affected? I would assume that the supply of Rust developers is much lower unless you train them in house. Rust sounds fun, but I can't imagine trying to justify a rewrite of a legacy system, a major tech stack change, and training/building a new team all at the same time.

Sorry for the onslaught of questions. The "rewrite it in rust" fever has spread to my work and I'm fighting myself on how to respond.

simag · 2 years ago

With hindsight, considering the cards we were dealt, there's not much I would have done differently. If I had known better before, I would have ensured stronger buy-in because after a while our internal stakeholders were often pushing back on the effort, and that led to concessions where throwaway code in the legacy systems was built even for weak business outcomes.

Overall I share your concerns. Having the right reasons to rewrite is key. I believe this blog[1] about software as theory building does a great job at describing the challenge with software gardening, and the times where a rewrite is the solution are few. Even then, it's critical to handle the rewrite in ways that can work - in our case, we chose to progressively eat the legacy software without making major changes when we could avoid them. The legacy software we had was mostly the result of one-man heroics and traded off performance and availability for correctness and security. It also was designed to be maintained by a small group. Solid choices if you are early Kraken - but like many successful startups, we were victims of our own success and we needed it all.

When it became clear that we had to rewrite the stack (the 2017 3-day shutdown happened just before that realization), those in charge at the time decided to experiment with Rust. It was a crazy bet in early 2018: it was still Rust 2015, no NLL, no async, far rougher ecosystem. The fact that it became successful enough to warrant pursuing a full rewrite is to the credit of some lucky hires who made it a success.

In that regard, Rust was a very strong talent magnet. In my experience, having hired 200+ Rust engineers over the last 5 years, there are a few kinds of engineers attracted to Rust: (1) some just like shiny things/hype, (2) some are perfectionists and never complete a project, (3) some just are doers who have found that Rust is a particularly effective language.

Overall, Rust has been great to hire for. Many engineers out there want to use Rust, even if it's their 1st Rust professional experience. We were also known in the Rust community for hiring for full time Rust, probably also the place currently with the highest density of Rust talent (there are massive companies with more Rust devs, but smaller % overall). Budget wise, our Rust engineers are not paid particularly better than other engineers in the company, but the compensations at Kraken are generally in the higher tier.

At the risk of sounding boastful, in my experience Rust is reasonably easy to learn for experienced/strong developers (we have some very young outstanding Rust devs as well, most of the time they learned before joining). Average developers struggle and may never become productive. Again, we have an engineering excellence culture so it is okay for us, YMMV.

Re scope, yes we use Rust for everything in the backend, including CRUD type of work like Web APIs. We've found we're at least as productive as with other languages (Go / Java+Spring / Ruby / PHP) while having far fewer incidents, and easier maintenance / cheaper KTLO. Rust's ability for reuse is excellent which means that there are very strong network effects when having more services in Rust, including the Web layer.

A nice "side effect" of a full Rust stack is that our p99.9 latency internally is usually stable around 3-4ms for most operations, even though multiple services are involved. That's coming from a much higher baseline with much more deviation across operations providing the same functionality (60-100ms).
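For readers unfamiliar with the "p99.9" notation: it is the latency below which 99.9% of requests complete. A minimal sketch of one common way to compute it (the nearest-rank method) follows. The function name `percentile_us`, the microsecond unit, and the per-mille parameter are assumptions made for illustration; nothing here describes how Kraken actually measures latency.

```rust
/// Nearest-rank percentile over latency samples, given in microseconds.
/// `per_mille` is the percentile in tenths of a percent (999 means p99.9).
/// Hypothetical helper for illustration only.
fn percentile_us(samples: &[u64], per_mille: u64) -> Option<u64> {
    if samples.is_empty() || per_mille == 0 || per_mille > 1000 {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    let n = sorted.len() as u64;
    // rank = ceil(n * per_mille / 1000), done in integers to avoid
    // floating-point rounding at the boundary.
    let rank = (n * per_mille + 999) / 1000;
    Some(sorted[(rank - 1) as usize])
}
```

With 999 samples at 3000 µs and a single outlier at 80000 µs, `percentile_us(&samples, 999)` reports 3000 while the maximum is 80000, which is why tail percentiles are quoted rather than averages or maxima.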

Regarding your own rewrite discussions, you're not going to be convinced by a post on HN. I'll just say that I am very reluctant to even think about working at a workplace that doesn't predominantly use Rust. I've been in the industry for 20+ years, across many stacks, and there's a before and an after Rust for me. It has been a superpower and made our life easier. It makes it easier to model business problems thanks to algebraic data types and their usage for error handling (versus inheritance), traits abstract behavior better than OOP-style interfaces, the absence of data races is a game changer for multi-threaded code, dependency management is trivial, the ecosystem is rich and things work well. A lot of these are properties found in other languages but no other has the same full package and is on its way to become mainstream.

[1] https://www.baldurbjarnason.com/2022/theory-building/
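To make the point about algebraic data types, error handling, and traits concrete, here is a minimal sketch. The domain (`Order`, `OrderError`, `PriceSource`, `cost_cents`) is entirely hypothetical and chosen only for illustration; it is not taken from Kraken's codebase.

```rust
/// Hypothetical order type: each variant carries only the data valid for it.
#[derive(Debug, Clone, PartialEq)]
enum Order {
    Market { quantity: u64 },
    Limit { quantity: u64, price_cents: u64 },
}

/// Hypothetical failure modes, modelled as plain data instead of an
/// exception class hierarchy.
#[derive(Debug, PartialEq)]
enum OrderError {
    ZeroQuantity,
    InsufficientFunds { needed_cents: u64, available_cents: u64 },
}

/// A trait abstracts over anything that can quote a price.
trait PriceSource {
    fn last_price_cents(&self) -> u64;
}

/// Trivial implementation used for the example.
struct FixedPrice(u64);

impl PriceSource for FixedPrice {
    fn last_price_cents(&self) -> u64 {
        self.0
    }
}

/// Returns the cost of the order in cents, or a typed error.
/// The compiler forces every `Order` variant to be handled in the `match`.
fn cost_cents(
    order: &Order,
    prices: &impl PriceSource,
    balance_cents: u64,
) -> Result<u64, OrderError> {
    let (quantity, unit_price) = match order {
        Order::Market { quantity } => (*quantity, prices.last_price_cents()),
        Order::Limit { quantity, price_cents } => (*quantity, *price_cents),
    };
    if quantity == 0 {
        return Err(OrderError::ZeroQuantity);
    }
    let needed = quantity.saturating_mul(unit_price);
    if needed > balance_cents {
        return Err(OrderError::InsufficientFunds {
            needed_cents: needed,
            available_cents: balance_cents,
        });
    }
    Ok(needed)
}
```

Adding a new `Order` variant later makes the `match` fail to compile until the new case is handled, and callers cannot ignore the `Result` without an explicit decision. That is roughly the property the comment above credits for fewer incidents.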

I have seen many organisations try this and end up with a second and even third system with fewer features running in parallel with the first, never catching up to be feature complete.

I am of the opinion the only real way to do this is take small pieces and replace them and keep the main system running, slowly replacing parts of it. It will never be complete but progress can be made in areas that need improvements. A complete rewrite isn't worth it, they fail so often, cost far more than anyone thinks they do and rarely achieve the magic improvements they were sold on.
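One way to picture "replace small pieces while the main system keeps running" is a routing layer in front of both systems that decides, per route, which one serves the request. The sketch below is a simplified assumption of how such a layer could look; `MigrationRouter`, `Backend`, and the route strings are hypothetical names, and a real deployment would typically do this in a reverse proxy or API gateway rather than in-process.

```rust
use std::collections::HashSet;

/// Which system should serve a given request.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Backend {
    Legacy,
    New,
}

/// Hypothetical routing facade sitting in front of both systems.
/// Everything goes to the legacy system unless explicitly migrated.
struct MigrationRouter {
    migrated: HashSet<String>,
}

impl MigrationRouter {
    fn new() -> Self {
        MigrationRouter { migrated: HashSet::new() }
    }

    /// Move one route to the new system.
    fn migrate(&mut self, route: &str) {
        self.migrated.insert(route.to_string());
    }

    /// Send a route back to the legacy system if the new one misbehaves.
    fn roll_back(&mut self, route: &str) {
        self.migrated.remove(route);
    }

    /// Decide which backend handles the route.
    fn backend_for(&self, route: &str) -> Backend {
        if self.migrated.contains(route) {
            Backend::New
        } else {
            Backend::Legacy
        }
    }
}
```

Each migration is a small, reversible step: calling `migrate("/balances")` switches only that route, and `roll_back("/balances")` restores the old behaviour without touching anything else.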

heisenbit · 2 years ago

The only way is to accept that the new system is less complete and switch over and suffer the consequences. The new system will never have all the old features - the question is whether it is good enough to use now and extend in the future. The mission of the builders of the new system should not be feature completeness nor being better but to kill the old quickly while maintaining a reasonable level of future proofness. This cannot be driven from the bottom (re-write for code beauty) as it requires business level commitment to bear the pain of the changeover. Software is coded business processes, and new software means new processes. A major cost driver for software is inflexible requirements, and taking old code and processes as gospel is a guarantee for a cost explosion.

whstl · 2 years ago

This is true in my experience as well. Business needs and expectations should be fully aligned, and there must be commitment. Business/Product can't expect devs to do an unsupervised 1:1 rewrite that magically includes all features. You gotta treat it as if it were new software: do it incrementally and watch things closely, test and approve or ask for change. Business must also learn to adapt and modify their workflow.

To me this is where a lot of rewrites fail. Not only rewrites, but even implementations of off-the-shelf systems. The company (both management and other workers) is so entrenched in the current processes that they will keep pushing for the time and money available to be spent on constantly tweaking the new system until it is exactly what it was, and then all the old problems come back again. I've seen companies blowing millions on consultants because of this.

But this also happens for new features. The current system I work on is on its 4th permission system, for internal permissions. The problem is not so much that requirements change, but that nobody in product/business really knows them. So each permission system starts alternating between being too malleable (and then it devolves into chaos in the settings) or too rigid/simple (and then it devolves into people asking for too many permissions).

You gotta fix the business before fixing the software.

lazyasciiart · 2 years ago

The strangler pattern. I agree.

crdrost · 2 years ago

Please be careful with calling it that.

Surveys differ a bit on this but somewhere between 3-10% of women report having been strangled by an intimate partner at some time in their lives, this may be 20% higher when non-intimate partners are also factored in. So like “one in 12” is not crazy-talk. Even if your dev-team right now is all-male, your design docs may live to see you diversify and you probably have female non-technical coworkers who will overhear you talking about it. It's not worth taking a one in 12 chance of potentially reminding them of past domestic abuse, I get that it's not your intention but.

Yes I get that it's by analogy to the “strangler fig,” but,

(a) it was a crappy analogy in the first place[1],

(b) you can just call it the “strangler-fig pattern” and the extra syllable makes it like 20% more rhythmic, 10% more clear and 50% less alienating without sacrificing any googlability as this is what the cloud companies call it,

(c) you didn't really need the analogy, “migrate” was already the established term of art for this pattern, means the same thing and it is already verb-ified for you! “Our first priority is to migrate all our current requests,” vs “our first priority is to use our strangler-pattern to subsume the current requests under the new architecture,” whyyyyyyyyy. So you're gonna use the word “migrate” anyway and if used precisely[3] there is nothing added by evoking the humble strangler-fig, apart from, you know, accidentally sounding positive with regard to domestic abuse.

1. Taking the definition as “incremental replacement behind a proxy until the original system eventually dies,” literally the only thing that is correct about the analogy is “eventually dies”. Even if you are calling a tree’s contribution to the leaf canopy its “feature set” and the developer attention is its “nutrients” to try to save the analogy, you come to the conclusion that strangler figs do “rapid feature development” at first, rather than trying to rewrite the system. And this early development is symbiotic rather than parasitic [2]. Strangler figs don't do the strangler pattern, they do Embrace-Extend-Extinguish.

2. https://link.springer.com/article/10.1007/s13199-017-0484-5

3. So, “migrate functionality” is popular but very imprecise, you want to say that you are migrating the requests, or the request-handling, or the users, to the new platform. The word functionality should probably also be thrown in the trash, it literally adds three clumsy syllables to a word which is already its synonym, the “function” of a thing is already its “functionality” and if you really wanted to not use the mathematical word function, just talk about its “features” or “feature-set.”

snvzz · 2 years ago

Rewrites typically work well when the people behind the rewrite are the same people who wrote the original code and maintain it.

Often, the requirements changed along the way as the problem domain incrementally became better understood. At that point, the original design is not helping, but sabotaging everything.

This is why that first version should always be considered a prototype. And the next version will probably also be.

Not rewriting will have a much larger cost down the line.

mekoka · 2 years ago

Our experience may not have been the same, but I beg to differ. If the old solution is so problematic that nothing can be salvaged from it and the best solution is a full rewrite, instead of some refactoring or modifications, then your problem is not so much that the code is solving the wrong problems. The problem is the team that built it. Every time I was ever called to rewrite a code base, there were certainly some elements that pointed to a clearer understanding of the requirements, but mostly, the necessity to do it from scratch pointed at a people problem.

Many successful projects started as reverse engineered clones of old and established ones, that then became improved versions of their predecessors. Those are rewrites, just not done by the original project's team.

My opinionated rules of full rewrites, informed by experience and observation:

1- Bring in one, just one, project lead who is a specialist of the technical domain. E.g. if you're building a web app, hire a seasoned web developer, instead of relying on your in-house electrical engineer, whom you allowed to architect the previous solution, because they managed to convince you that code is code.

2- Let the new lead vet every member of the old team, including (especially) the old team leaders.

3- Allow the new lead to drop any dead wood.

rewmie · 2 years ago

> If the old solution is so problematic that nothing can be salvaged from it and the best solution is a full rewrite, instead of some refactoring or modifications, then your problem is not so much that the code is solving the wrong problems. The problem is the team that built it.

I think you're too quick to point fingers and too desperate to throw people under the bus to pass yourself as saviour.

There are plenty of everyday scenarios where software piles up technical debt in spite of the developers. All it takes is a single requirement to change for an entire tech stack to become a problem instead of problem-solver, and all it takes is a business goal to be met at record time to pile up quick-and-dirty solutions instead of well-architected implementations. These happen far more often than replacing whole teams, and new project leads solve nothing.

I've worked on a legacy project which started as a multi-platform desktop app that in the meantime became Windows-only. You can imagine the cruft that resulted from this requirements change alone. During the same period, business requirements changed to support new major features, and Microsoft started pushing for Windows 11. Of course we discussed a major rewrite, as the legacy tech stack didn't support native Windows features well and the legacy application was riddled with multi-platform code that made no sense anymore. Switching to vanilla WPF alone would eliminate 90% of the project's pain points.

Tell me exactly how the team created this problem, and how you would be the key to fix it.

bazoom42 · 2 years ago

Why can’t the existing code just be adapted to the new requirement? Code is supposed to be malleable. But perhaps the code was designed too rigidly to support changing requirements. In that case you will have exactly the same problem after the rewrite, next time requirements change again.

BillyTheKing · 2 years ago

It depends a little, sometimes companies (especially start-ups) pivot quite substantially - With a Fintech that I've worked with from the beginning (which went through ups and downs, but ultimately ended up quite successful) we started issuing debit cards and pivoted to loans. At some point we just needed a different app that was properly written for the use-case that we ended up serving. It's like the parent said, the re-write was done by the same team (mostly) - those were 4 busy months but it ended up quite successful and imo it was worth it at the end. And I have to admit I was actually very critical of the re-write approach initially..

Well, you kinda answered yourself: code is supposed to be malleable, but sometimes it isn't.

Even when it follows trends and best practices, you might end up with non-malleable code. Perhaps especially when you over-do trends (eg: metaprogramming, 15 years ago) or best practices (design patterns, 100% unit test coverage, etc).

You will only have the same problem if you fail to solve the malleability problem.

Obviously satire, but in real life, there are reasons why a full rewrite becomes appealing. Perhaps the solution even.

One is overwhelming technical debt. Code where the project manager didn't believe in encapsulation, or refactoring, or none of that "architectural nonsense", and was only fired 5 years too late. Code that is difficult to understand, maintain, test, debug, change. Code that follows you home after-hours and on the week-ends. Code that nurses you to bed at night, shows up in your dreams, and wakes you up in the morning. Code that has made many a colleague look for employment elsewhere and new hires give up and quit in their first week.

Every time I see someone profess with assurance that you don't rewrite, I just know that that person has never really experienced the hell I've described above.

anonzzzies · 2 years ago

I am against rewrites of any significance because they generally just end up worse. Joel Spolsky wrote about that a long time ago as did others; most software that’s older has millions of badly documented changes applied by 1000s of people over the decades and rewriting tends to take literally forever (never finishes) or at least much longer than anyone estimated, times pi. And then the end result is usually just as crappy but with more bugs.

The Dutch tax software rewrite attempts are an example I am personally familiar with. These attempts made me create a services company to help companies keep legacy software running forever. We support gnarly stuff over 25 years old and still wouldn’t recommend a rewrite for the above reasons.

There are of course cases where rewrites (of significance; rewriting a 50k LoC codebase is not going to be hard) work, but usually the rewrite is done by the same people that did the original, the original wasn’t actually that bad but just too hard to extend in modern times, etc.

Joel published that post the same year that Microsoft announced their .NET framework.

I've said this in different threads, but I think it's worth repeating here. I've seen many successful projects that started out as reverse engineered clones of older, well established ones, then go on to become better versions of their original model. Those are in effect rewrites, just not done by the same team. For a number of years now, it's been feasible to compete technically with an incumbent in a matter of months. I see it as a trend indicating an increase in software production capability, be it because of better practices, better tools, or more developer availability.

So although I tend to agree with Joel's old post, and still lean toward refactoring as the likely more economical approach, I hold that advice less religiously.

BlargMcLarg · 2 years ago

> Code where the project manager didn't believe in encapsulation, or refactoring, or none of that "architectural nonsense"

If anything I find the largest proponents to have drunk too deep from that well and cause the rewrites to never be considered, as the time required to do it becomes far too long to be worth the pay-out.

This excludes the worst kind: the over-architected old mess in need of a rewrite as it was based on the wrong assumptions and is now bogged down by 10 layers of abstractions and indirection which don't do anything.

BigJono · 2 years ago

It depends on the level of the architecture. Architecture that splits the project into chunks that you can take a meat cleaver to and refactor at will is great. "Architecture" that takes one of those chunks and adds 15 abstractions to it is awful.

The former lets you recover from the latter without a full rewrite, which I'm guessing is where advice like "never rewrite" comes from.

Posts that are pro/anti 'architecture' could refer to either, so I never know whether to agree with them or not. They're kinda meaningless out of context.

tcbawo · 2 years ago

What you are describing sounds more like a problem with leadership to empower ICs and groups to make improvements, or possibly a culture of dumping code, declaring victory, and moving on. If new hires are bailing in their first week, the problems run far deeper than the codebase. Rewriting anything is not likely to change the long term end state unless the company culture has shifted. I have yet to experience an organization where a real cultural shift has happened.

l0b0 · 2 years ago

  > One is overwhelming technical debt. […]

This sounds like an organisation with bigger problems than a single manager. The devs hate the job, but don't have the clout to convince management to let them do things properly? Run. It's not worth it. An organisation doesn't suddenly "heal" once a bad apple is gone. More likely, everybody is at least five years behind current best practice, and very much used to doing things that way.

I have seen horrible code, but the point is that a rewrite will not solve this. You will just lose years of opportunity and end up in exactly the same place again after the rewrite. Because the reasons which led the first version to become a big ball of mud will also cause the rewrite to end in the same state.

Aeolun · 2 years ago

Somehow I felt like you were describing my Factorio factory…

PaulKeeble · 2 years ago

iamflimflam1 · 2 years ago

What I always love about “the rewrite” is the sheer optimism of the people involved - “we’ll be done in six months, and the new system will be a thing of wonder”. Fast forward to several years later. The original proponents will have moved on leaving behind the accumulated bodges and shortcuts from the increasingly desperate efforts to try and get something live… and then the cycle repeats.

There are ways to do this properly, but no one wants to put the effort into understanding the existing code base and the reason for why it is the shape it is. Everyone is happy with the “who wrote this crap - we can’t work on it” line.

bakuninsbart · 2 years ago

I've only witnessed one full rewrite in my (admittedly short) career so far, and it went shockingly well. The goal was to rewrite a C++ application from the 90s in Java, since the company had largely moved to the web and C++ devs were getting close to retirement. The C++ application was also written in the underfunded startup phase, and then over decades new features had been "tagged on", so the architecture wasn't that robust.

The team tasked with the rewrite set a goal to finish in a year, the first 4-6 months were entirely spent on planning. After that, features were implemented in a modular and iterative approach. I think overall it took a bit longer than a year to be feature complete, but by the end of the year they had a working platform with all the core features implemented.

I think the key here really was very good planning, and the pitfalls you describe can be at least partially ascribed to agile development not being the right tool if you have a very clear and large set of requirements.

SenHeng · 2 years ago

Would you mind elaborating on the kind of planning that went on?

I’ve been involved in instances of planning that meant writing out pseudo code on the whiteboard for the entire software, including errors and exceptions, and the final task was ‘just write the code!’. It started off well when coding the big picture modules but as we started getting into the details, bugs, unexpected race conditions and more started coming up.

> the first 4-6 months were entirely spent on planning

This is the most surprising part of the story. I have never seen this kind of “big design up front” actually work. Can you elaborate on how you made this work?

loveparade · 2 years ago

I see this story here all the time, but I have never seen it play out in the real world. Most rewrites I've seen have been hugely successful. This sounds more like a sticky narrative that everyone keeps repeating to tell a nice story, get attention, and make them look experienced. In reality, no experienced engineer is so naive as to think that the new system will be a perfect thing of wonder or not look at the tradeoffs made in the old codebase. Reality is more nuanced than these simplistic fictional narratives.

I wish that were true. I’ve seen ridiculous catastrophic failure twice. I still don’t know if they actually believed their own bullshit, or just wanted to do the rewrite and lied to make it happen.

TheAceOfHearts · 2 years ago

I think a rewrite is fine as long as it's incremental and well-integrated each step along the way, rather than starting off from scratch. Unfortunately this seems like the sort of lesson each engineer needs to learn through experience rather than hearing someone tell you about it. Live and learn I guess. :)

tnr23 · 2 years ago

I founded a company and reached $6M ARR with >50% net profit margin. After 4 years decided to do a full rewrite. Finished it successfully within a year including full migration of all clients. Was the best decision ever.

dieselgate · 2 years ago

Do you consider the rewrite a good decision because it helped increase profits or for other reasons?

yard2010 · 2 years ago

Not op but I would say that profits are only one part of the equation. If you have massive profits and you don't put resources in other areas (HR, engineering, etc.), you lose sustainability and might stall and die eventually. It's a tradeoff between long term sustainability and short term money making IMO

seb1204 · 2 years ago

Congratulations. It reads like you were in control, the driver seat and into the code base. What I want to say is that I consider a rewrite possible because you controlled most /all aspects. Currently we are moving from several ERP instances/configurations due to acquisitions to one completely new ERP version and architecture. This is a huge project consuming enormous resources and needs a director reporting to the CEO.

Scarblac · 2 years ago

Am in the middle of a rewrite now. I'm always against them, but this case is clearly perfect for it.

- The old system used to have many users, but only one large organisation was left.

- The way they use the system is atypical of what it was originally intended for. They use like 20% of the original functionality.

- The old system started as the very first software our company ever produced (long before they brought in professional software developers). Credit to their choice of Django that it worked for fifteen years, but it wasn't very good.

- But in the last nine years of that, hardly any maintenance has been done on it and now none of the build tools work.

We're making a much more focused, modern application now that does everything the customer used in the old one but looks completely different.

So there exists at least one situation where a rewrite is the answer.

baz00 · 2 years ago

How to do a full rewrite:

1. Create a subsidiary company and transfer the code ownership to that.

2. Sell the subsidiary company to an investor for big money and run away quickly.

3. While rolling around in VC cash, green field a better product from the ground up without all the horrible things you did last time.

4. Goto 1