> We carefully vet what we eager-load depending on the type of request and we optimize towards reducing instances of N+1 queries.
> Reducing Memory Allocations
> Implementing Efficient Caching Layers
All of those steps seem like pretty standard ways of optimizing a Rails application. I wish the article made it clearer why they decided to pursue such a complex route (the whole custom Lua/nginx routing and two applications instead of a monolith).
Shopify surely has tons of Rails experts and I assume they pondered a lot before going for this unusual rewrite, so of course they have their reasons, but I really didn't understand (from the article) what they accomplished here that they couldn't have done in the Rails monolith.
You don't need to ditch Rails if you just don't want to use ActiveRecord.
The project does still use code from Rails. Some parts of ActiveSupport in particular are really not worth rewriting; they work fine and already have a lot of investment behind them.
The MVC part of Rails is not used for this project, because the Shopify storefront works in a very different way from a CRUD app and doesn’t benefit nearly as much from it. The custom code is a lot smaller and easier to understand and optimize. Outside of the storefront, Shopify still benefits a lot from Rails MVC.
I’ll also add that storefront serves a majority of requests made to Shopify but it’s a surprisingly tiny fraction of the actual code.
Someone replied but deleted right when I was posting this answer, so I'm replying to myself:
What I didn't understand was why the listed performance optimizations couldn't be implemented in the monolith itself and instead required the development of a new application, which is still Ruby.
In a production env, the request reaches the Rails controller pretty fast.
I know for a fact that the view layer (.html.erb) can be a little slow if you compare it to, say, just a `render json:`, but if you're still going to be sending fully-rendered HTML pages over the wire, the listed optimizations (caching, query optimization and memory allocation) could all be implemented in Rails itself to a huge extent, and that's what I'd love to know more about.
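To make that concrete, here is a minimal sketch of what those in-framework optimizations usually look like in plain Rails; the model, service, and cache-key names are hypothetical and purely illustrative, not anything from Shopify's codebase.

```ruby
class ProductsController < ApplicationController
  def index
    shop = Shop.find(params[:shop_id])

    # Query optimization: eager-load the associations the view needs, avoiding N+1 queries.
    @products = shop.products.includes(:variants, :images).limit(50)

    # Caching: keep an expensive, rarely-changing computation in the Rails cache store.
    @currency_rates = Rails.cache.fetch(["currency_rates", shop.currency], expires_in: 1.hour) do
      CurrencyRateService.fetch(shop.currency) # hypothetical service object
    end
  end
end
```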
They talk about reducing memory allocations. My guess is the rest of the app is very large and they’re benefiting from not sharing memory and GC with that.
Of course, everything you said is true for a small-to-medium sized Rails application.
They likely could have explored a separate Rails app to meet this goal, but then they'd have to maintain the dependency tree and its security risks twice. And if Rails core later refactors away whatever their optimizations hook into, they have to keep maintaining those optimizations and integrating the upstream changes.
There’s definitely some wiggle room and a judgement call here but their custom implementation has merit.
Don't forget that a Shopify store is 100% customizable by merchants using Liquid (Turing complete, not that you should try). There is no .html.erb layer. Think of Storefront Renderer as a Liquid interpreter using optimized presenters for the business models.
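For readers who haven't touched it, here is a toy illustration of what a Liquid interpreter does, using Shopify's open-source liquid gem; the template and data below are made up, not an actual storefront theme.

```ruby
require "liquid"

# Merchant-authored template: parsed once, then rendered against plain data, no .html.erb involved.
template = Liquid::Template.parse(<<~LIQUID)
  <h1>{{ product.title }}</h1>
  {% if product.available %}<span>{{ product.price }}</span>{% else %}<span>Sold out</span>{% endif %}
LIQUID

puts template.render("product" => { "title" => "Tea Pot", "available" => true, "price" => "$19.00" })
```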
I didn't especially care about the technical details; what I like about this article is that the first thing they mention is the success criteria of the project (hopefully defined at the very beginning, before any implementation). Then, on top of that, they created an automated tool to verify those criteria automatically and objectively.
This is a great approach and unfortunately I don't think many (most?) software projects start out like that.
Not defining conditions of victory and scope creep are possibly the biggest risks in software projects.
It’s also important to remember that not everything worth doing, or every “success” state you set, can have KPIs defined (it may be genuinely impossible, or the science may not be there yet).
Shopify has traditionally been an example people have pointed to for scaling a monolith with a large growth factor in all areas: team size, features, user base size, general "scale" of the company.
Does anyone on here who has worked on this project, or internally at Shopify, feel that this project was successful? Do you think this is the first step in a long and gradual process by which Shopify will rewrite itself into a microservice architecture? It seems like the mentality behind this project shares a lot of the commonly claimed benefits of microservices.
> Over the years, we realized that the “storefront” part of Shopify is quite different from the other parts of the monolith
Different goals that need to be solved with different architectural approaches.
> storefront requests progressively became slower to compute as we saw more storefront traffic on the platform. This performance decline led to a direct impact on our merchant storefronts’ performance, where time-to-first-byte metrics from Shopify servers slowly crept up as time went on
Noisy neighbors.
> We learned a lot during the process of rewriting this critical piece of software. The strong foundations of this new implementation make it possible to deploy it around the world, closer to buyers everywhere, to reduce network latency involved in cross-continental networking, and we continue to explore ways to make it even faster while providing the best developer experience possible to set us up for the future.
Smaller deployable units: you don't have to deploy all of Shopify at the edge, only the component that benefits from running there.
I’m aware that Ruby/Rails isn’t that quick, but it seems mind-boggling that an 800ms server response time is considered tolerated, and 200ms is satisfying. I’ve never used Ruby in production, so maybe my reference point is off and this is more impressive than I’m giving it credit for.
I'm not sure this has anything to do with Ruby, they're talking about user experience: what's perceptible to humans and what causes frustrations.
Also, in most apps the DB and the frontend take way more time than the Rails stack.
But you should also account for up to 100-200ms network latency (especially with mobile networks) plus some rendering time. A 200ms server response time can already lead to a perceived 500ms loading time.
This is very interesting. N+1 queries from lazy loading are a very common problem that profilers can spot, but eager loading also has a Cartesian product problem: if you have an entity with 6 of one sub-item and 100 of another, you'll end up getting 600 rows just to construct a single object / view model.
I have recently been playing with RavenDB (from my all-time favorite engineer turned CEO). It approaches most of these as an indexing problem in the database, where the view models are calculated offline as part of the indexing pipeline. It approaches the problem from a very pragmatic angle: its goal is to be a database that is very application-centric.
It remains to be seen whether we will end up adopting it, but it'll be interesting to play with.
Disclaimer: I am a former NHibernate contributor, and have been very intimate with AR features and other pitfalls.
Didn't NHibernate have the Cartesian product problem solved in a neat way with its various fetch strategies?
You could specify that some collections be eagerly loaded and have NHibernate issue additional select statements to load the children, producing a maximum of 2-3 queries (depending on the eager-loading depth) while avoiding both the N+1 problem and the Cartesian row-explosion problem.
Yes, that's the common method, but you still end up issuing multiple network calls. The problem with issuing additional select statements to load the children is that you have to wait for the first (root) query to finish before you can issue the others, which adds to the network latency (usually low, but it depends). It's still not as good as having materialized view models on the server, where you can issue a single query to get everything you need. The disadvantage is the storage cost, though.
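For the ActiveRecord equivalent of the trade-off discussed above, here is a minimal sketch using hypothetical Order/LineItem/Shipment models (assume 6 line items and 100 shipments per order); it is not anything from Shopify's codebase.

```ruby
# `eager_load` builds one LEFT OUTER JOIN query, so joining two has-many associations
# returns the Cartesian product: 6 * 100 = 600 joined rows per matching order.
orders = Order.eager_load(:line_items, :shipments).where(shop_id: 1).to_a

# `preload` trades the row explosion for extra round trips: the root SELECT runs first,
# then one SELECT per association keyed on the returned order ids (3 queries total here).
orders = Order.preload(:line_items, :shipments).where(shop_id: 1).to_a
```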
Naive question: the "storefront" piece seems like it's a static page. Why does it need SSR? Even so, it could be SSR'ed to static _once_ (kind of like how NextJS does it from 9.3+), and then served by a CDN/edge. I'm probably missing something here.
Throwing opinions here, but after working a bit with Shopify themes, there might be some reasons to stick with SSR rather than aggressive caching. First, the storefront can be dynamic depending on visitor region and login state. Second, Shopify has most of the logic on the backend, even having non-JS HTML nodes for ordering / add to cart. Third, I don't think the visit distribution of the stores makes caching economically viable (the top 20% of stores probably don't account for 60%+ of server load).
That’s also my question after reading this post. When trying to shave off milliseconds by going for a full rewrite, moving away from Ruby seems like an obvious decision... at least intuitively.
Their monolith was written in Rails, so Ruby alone was not the source of the slow performance. In fact, the solution had more to do with cloning the database in order to isolate reads from writes, so it's not even a programming-language problem at all.
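As a rough sketch of what that kind of read/write isolation can look like at the framework level (this is just Rails 6+ multiple-database support with hypothetical role names, not a claim about Shopify's actual setup):

```ruby
# config/database.yml is assumed to define `primary` and `primary_replica` entries.
class ApplicationRecord < ActiveRecord::Base
  self.abstract_class = true
  connects_to database: { writing: :primary, reading: :primary_replica }
end

# Read-heavy, storefront-style queries can then be pinned to the replica:
ActiveRecord::Base.connected_to(role: :reading) do
  Product.where(shop_id: 1).limit(50).to_a
end
```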
I'm assuming the details of exactly what the new implementation is have been deliberately withheld for some future post where they talk specifics (especially if it's something exciting like Rust/Elixir/Go). This keeps the focus of this post on the approach to migration, using the old implementation as a reference in order to burn down the list of divergences, etc.
Any interesting/successful patterns you can share, or resources on said patterns?
1) What is the goal? What defines success?
2) What are the KPIs? How are we going to measure them?
These are baseline questions for any endeavor of substance. Yet they are rarely defined.
It makes more sense for us to extract things than to make everything a microservice.
Storefront makes sense as its own service, so we are making it so.
- Handcrafted SQL.
- Reduce memory usage, e.g. by using a mutable map.
- Aggressive caching with layers of caches: a DB result cache, an app-level object cache, and an HTTP cache. Some DB queries are partitioned and each partitioned result is cached in a key-value store (rough sketch below).
Page rendered in 12.2ms - 18.3ms, giving plenty of room for network latency.
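A rough sketch of what such a layered cache can look like in Ruby; the class, keys, and TTLs here are hypothetical and exist only to illustrate the layering, with the HTTP cache layer sitting in front of this (e.g. Cache-Control/ETag headers).

```ruby
class ProductLookup
  # Layer 1: in-process, app-level object cache.
  LOCAL = ActiveSupport::Cache::MemoryStore.new(size: 64.megabytes)

  def self.fetch(product_id)
    LOCAL.fetch(["product", product_id], expires_in: 30.seconds) do
      # Layer 2: shared key-value store (Rails.cache, e.g. Memcached/Redis) holding DB results.
      Rails.cache.fetch(["product", product_id], expires_in: 5.minutes) do
        # Layer 3: the actual (handcrafted) SQL, only hit on a double cache miss.
        Product.find_by_sql(["SELECT id, title, price FROM products WHERE id = ?", product_id]).first
      end
    end
  end
end
```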
Are you going to restructure literally thousands of employees and their teams, staffed with Rubyists and organized around your current setup?
Will you re-hire and/or re-train everyone?
That doesn't seem so obvious... At the scale of a team like Shopify, rewriting in a different language is probably a non-starter.