Elixir saves Pinterest $2M a year in server costs

coldtea · 2 years ago

By the title alone, this means nothing. How much would it cost otherwise? What is the percentage savings?

In TFA, it gets better though: "Steve: That’s pretty easy. When I started on the spam team, we had close to 1,400 servers running. When we converted several parts to Elixir, we reduced that by around 95%. One of the systems that ran on 200 Python servers now runs on four Elixir servers (it can actually run on two servers, but we felt that four provided more fault tolerance). The combined effect of better architecture and Elixir saved Pinterest over $2 million per year in server costs. In addition, the performance and reliability of the systems went up despite running on drastically less hardware. When our notifications system was running on Java, it was on 30 c32.xl instances. When we switched over to Elixir, we could run on 15. Despite running on less hardware, the response times dropped significantly, as did errors."

threeseed · 2 years ago

> When our notifications system was running on Java, it was on 30 c32.xl instances. When we switched over to Elixir, we could run on 15.

Would be curious to know how they tried to optimise the Java stack.

Because on every benchmark I've seen the JVM is faster in every which way than Elixir. Except for memory where often people will over-provision the JVM rather than look at where their code might be over-allocating or leaking.

sph · 2 years ago

The advantages of Elixir are not performance-related.

There is a lot of focus on raw performance on web-related services, when in reality most of their running time is spent waiting for IO. If there are two things the BEAM excels at, it is IO and turning almost any problem into half a dozen processes that are scheduled and run in parallel, if not geographically distributed, with 1/50th the effort of any other language.

We live in a world with 32+ core CPUs. If your load is not spread uniformly all over those cores, you're losing a ton of performance. Handling requests over separate threads, like 99% of languages do, still isn't enough if all the business logic runs on the same thread.

I'm currently writing a web crawler in Elixir, and it is easier to design it so every request is done and processed in parallel, than to write a naive sequential one you'd do in any other language in half a day.

TacticalCoder · 2 years ago

They rewrote, which is known to help too. Going from 30 to 15 instances is not bad, but it's very likely that a Java-to-Java rewrite would have brought the count down too.

The big one, however, is going from 200 Python servers to 4 Erlang ones: a 50x reduction is quite something, and a Python-to-Python rewrite would not have achieved a 50x gain:

> All this is possible because Elixir, and the Erlang platform underneath, are fundamentally designed for always-online software with many users. When you use the right tool for the job, the benefits are clear.
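For reference, the ratios being argued over can be checked directly from the numbers quoted from the article. This is only a sketch of the arithmetic: the `reduction` helper is a made-up name for illustration, and the inputs are the article's rounded figures ("close to 1,400", "around 95%"), so the results are approximate.

```python
# Figures are the ones quoted from the article; reduction() is a
# hypothetical helper, not something from the article or from Pinterest.
def reduction(before: int, after: int) -> tuple[float, float]:
    """Return (shrink factor, percent fewer servers) for a before/after count."""
    return before / after, 100 * (before - after) / before

python_to_elixir = reduction(200, 4)   # 200 Python servers -> 4 Elixir servers
java_to_elixir = reduction(30, 15)     # 30 Java instances -> 15 Elixir instances

# "close to 1,400 servers ... reduced by around 95%" leaves roughly:
spam_team_after = round(1400 * (1 - 0.95))

print(python_to_elixir, java_to_elixir, spam_team_after)
```

So the Python case is a 50x (98%) reduction and the Java case is 2x (50%), which is why the two are being treated so differently in this subthread.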

Cthulhu_ · 2 years ago

I think it's more important to look at the re-architecting than the different language. Second, I think certain architectures - like the actor model described in the article - work better and are more intuitive if you use a different language.

That said, I'm sure a 2x performance improvement could've been done in Java as well if they did a re-architecture. They could also have made a lateral move to a different JVM language, like Scala, which also has an actor concurrency model + accompanying syntax.

klohto · 2 years ago

That’s exactly the thing? Why would you bother optimizing the code, looking for overallocations and leakage, and tweaking parameters, when you can just take a friendlier language for the same benefits?

neeleshs · 2 years ago

Yep. We got a 10x improvement on throughput for our backend runtime, which was in Java, by moving to a better architecture for performance hotspots.. using Java again.

In a rewrite with a different design/architecture, that new design typically accounts for most gains, rather than language.

A language may make some parts of that rewrite simpler.

coldtea · 2 years ago

>Would be curious to know how they tried to optimise the Java stack

They say "The combined effect of better architecture and Elixir".

So the post is mostly useless.

boxed · 2 years ago

> except memory

That's an enormous "except". Absolutely gigantic.

jjav · 2 years ago

>Would be curious to know how they tried to optimise the Java stack.

Fairly safe to say not at all.

The JVM is extremely fast, very efficient and very scalable (you can write java code that scales linearly with available cores). If performance or scalability is a metric to care about, it is nearly impossible to outscore java. You can, with very skilled C/C++ developers, but it's going to be difficult to find those people and it'll be a lot of work. If you need extreme performance on a reasonable budget, you can't do better than java. I know java isn't hip anymore, so this is not a popular truth, but it is.

panick21_ · 2 years ago

And of course you can just go and buy a much faster JVM from Azul rather than rewrite everything.

asabil · 2 years ago

Faster doesn’t mean more responsive, in fact very often an increase in throughput leads to an increase in latency.

DoesntMatter22 · 2 years ago

Elixir is slower than plain PHP according to the techempower benchmarks. I'm not even sure how that's possible but it is. By like a factor of 2 iirc. I'm not sure how elixir is that slow since it's compiled.

miroljub · 2 years ago

Second system syndrome.

They achieved savings by re-implementing their services and attributed those savings to the new tool / programming language.

I wonder what the saving would look like if they chose another tool for the second / optimized system. I doubt it would differ much if they went with Go, Java or stayed with Python.

winter_blue · 2 years ago

This is a pretty realistic balanced real-world benchmark of web frameworks (and languages): https://www.techempower.com/benchmarks/

It shows that Python with Django is literally 40 times slower than the fastest framework. Python with uvicorn is 10 times slower.

The use of languages like Python and Ruby literally results in >10x the servers being used; which not only results in higher cost, but also greater electricity use, and pollution and carbon emissions if the grid where the data center is located uses fossil fuels.

Not to mention, dynamically-typed languages are truly horrible from a code readability point of view. Large code bases are IMO difficult to read and make sense of, hard to debug, and more prone to bugs, without static types. I'm aware that Elixir is dynamically-typed, but it (along with JS) is an exception in terms of speed. Most dynamically-typed languages are quite slow. Not only do dynamically-typed languages damage the environment as they're typically an order of magnitude slower, they also lower developer productivity, and damage the robustness and reliability of the software written in it. To be clear, I'm in favor of anything that increases productivity. If Kotlin were 10 times slower, I'd be happy to pay that price, since it is genuinely a great language to work with, is statically typed, and developers are more productive in it. I'm not sure how Elixir mitigates the downsides of dynamic typing (maybe lots of 'type checks' with pattern matching?), but it would definitely be super-nice if a well-designed (Kotlin or Haskell like?) statically-typed language targeting the BEAM existed...

riffraff · 2 years ago

forgive the nitpicking but "second system syndrome" usually means that re-architecting goes wrong because of bloat[0].

This is the opposite case, when lessons learned in the first write are actually useful for a second rewrite.

I think the commonly associated soundbite is "build one to throw away."

[0] https://wiki.c2.com/?SecondSystemEffect

throwaway894345 · 2 years ago

I think second system syndrome is probably a significant factor, but I can also believe that ditching Python was also a significant factor.

EDIT: I’m being rate limited because I guess my comments are too spicy for the HN mods, but anyway I agree that there’s no reason other non-Python languages would fare much worse than Elixir.

liveoneggs · 2 years ago

While you are probably right it's also true that python is just straight up slow - especially in normal configurations (django, flask, uwsgi, lambdas, etc) and elixir is pretty fast while offering a great/fun/friendly dev experience + BEAM-HA/scaling-benefits.

They could have also blown everything out of the water with C++ - or probably even golang - but if elixir can do it on 2-4 boxes it's fast enough.

loloquwowndueo · 2 years ago

Um that’s not what second system syndrome is. It’s a simple rewrite.

paulsutter · 2 years ago

Python is 100x slower than a real language (I like python and use it often, not meant as a dig at the language just stating facts)

My favorite thing about Python is writing prototypes. The big risk with writing a prototype is that it survives into production. Using Python ensures that the code will be replaced by real code

Python is great for other non-production code like Jupyter notebooks, numpy experiments, etc

andersa · 2 years ago

It's really strange to me how people continue to build services using python, knowing it is 100 times slower than appropriate languages, and then get surprised by it being 100 times slower, so they eventually rewrite it.

acdha · 2 years ago

It’s extremely rare for a system to be 10x slower, much less 100x, and developer productivity is huge. When you see huge numbers being tossed around for an entire system, they almost always mean “our first architecture wasn’t right for the problem” and the question to ask is how much time it would have taken the same team to discover the correct shape of the problem with the other candidates.

zelphirkalt · 2 years ago

I think it is a lack of know-how. Most businesses/managements are not capable of making the decision to go for an ecosystem like Elixir, because they either don't even know it exists (that is also true for many devs) or they do not dare to do anything non-conventional or non-mainstream, or they have the wrong impression, that the "programming language does not matter". (Well, it does! Since it connects you to an ecosystem that comes with it and its language design choices influence how easily you can do things ...)

So then Python comes along and you find loads of devs for that. Once Python is entrenched, businesses have a hard time telling their devs to actually learn something new. And few devs will already explore things like Elixir on their own in their free time. And so they continue to hire Python devs.

(One could also replace "Python" with "Java" or "NodeJS" or similar, the principles remain the same.)

IshKebab · 2 years ago

So really "Python cost Pinterest $2M/year"...

Mawr · 2 years ago

More like "Python got Pinterest to the point of being able to afford to spend extra $2M/year".

naillo · 2 years ago

I think it's a generally practiced strategy at this point to spew out dev posts on company blogs to build SEO or act as an ad, regardless of quality.

princevegeta89 · 2 years ago

The problem with Elixir is that it is such a foreign language to most junior developers, and a radical shift to dig into when coming from object-oriented and other higher-level languages. This is worsened by the fact that there are not many Elixir jobs, and that development tooling and IDE support are limited.

To someone who starts their job on an Elixir codebase, it is just not a smooth onboarding at all. While the performance aspect is unparalleled compared to most of the popular scripting languages in the last decade, the price to pay to settle into Elixir seems huge to me.

no_wizard · 2 years ago

sounds like a feature not a bug. Poor onboarding is a cultural problem in my experience, not really a technology one.

When you get junior (or even non junior) developers onboarded in a new language, you have a unique opportunity to break them of bad habits and expand horizons.

Yes, there is a cost to it as it extends in the short term the time it takes to get developers ramped up, however the long tail payoff is huge

hinkley · 2 years ago

How many man hours did it take to achieve that?

We just did a Prometheus migration that I suspect will take us 5 years to break even on the development effort investment. And I'm not even counting opportunity costs, which were immeasurable.

I like Elixir and I want it to do well, but bad articles make that harder, not easier.

sanitycheck · 2 years ago

I wonder how much of their traffic is angry people who ended up there by mistake. They could save a lot more than $2M if they just set robots.txt to disallow everything.

kramerger · 2 years ago

That perfectly describes my usage of Pinterest.

People have written browser extensions to remove Pinterest from search results, as it is almost always a dead-end

giancarlostoro · 2 years ago

Maybe to you, but my wife uses Pinterest a ton. It is where a lot of the women I know go to for ideas for just about everything, from house decor to even the ideas for our wedding.

Pinterest is useful, just not for us nerdy guys. I am not sure why Google keeps it though or how they benefit, unless Pinterest uses AdSense exclusively, then one can determine that it's some sort of partnership. You would think Google would be smarter about who to send over to Pinterest if that's the case.

qwerty456127 · 2 years ago

Or allow convenient usage without registration. It seems whatever I look for mostly is there, I just go away because I don't want to sign-up&in nor be tracked. I wouldn't really mind if there were some reasonable non-intrusive content-relevant ads though.

I probably might even sign-up some day if I weren't repelled by this being required every time I come. I even stopped reading Quora, and, most recently, Twitter because of this - they started requiring signing-in while I don't want to stay signed in and be tracked even though I actually have respective accounts.

wahnfrieden · 2 years ago

Why on earth would they do that? It’s intentional and drives growth for them

pictur · 2 years ago

it hurt

andersrs · 2 years ago

And then they take that $2M and bribe someone at Google to ensure 59% of the image search results are from Pinterest.

scq · 2 years ago

I'm really surprised Google Images doesn't block Pinterest.

Cthulhu_ · 2 years ago

It earns them money, and people aren't using competitors; Google has seemed to give up on the quality of results a long time ago in favor of plain volume and whatever numbers they use to measure success.

dazc · 2 years ago

Google are returning results for image search without the overhead cost. This is the only logical reason I can think of why Pinterest still exists.

cbg0 · 2 years ago

Is this based on something or just a random conspiracy?

andersrs · 2 years ago

79% of statistics are made up.

Yes it's a joke... just like Google's results in 2023.

sergioisidoro · 2 years ago

I am very skeptical about this. The title makes it look like it's all about Elixir, but there seems to have been a fair amount of re-architecting.

"One of the systems that ran on 200 Python servers now runs on four Elixir servers"

This alone is a major telltale.

throwaway894345 · 2 years ago

I don’t know about Elixir specifically, but python is slllllooooowwww. If the operation is CPU bound, you can easily get a 100X performance improvement by rewriting carefully optimized Python in naive Go, Java, Rust, C#, etc. And if you make an optimization pass on that you can usually eke out another 10X.

Even on I/O bound operations, in Python you have to choose between the error-prone async framework if you want to improve resource utilization or you stick to the synchronous world and accept extremely low resource saturation.

Either way, I can entirely believe that another language would beat Python on both counts. I’ve seen similar results rewriting a Python system in Go with extremely minimal rearchitecting.

The silliest thing is that the title credits the improvement to moving toward Elixir rather than moving away from Python (or maybe their case really is ideal for the BEAM VM and wouldn’t translate easily to, say, Go’s runtime model although I doubt it).
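The I/O-bound trade-off mentioned above can be illustrated with a small sketch. This is not Pinterest's code and says nothing about their workload: `fake_io` is a hypothetical stand-in for a network call, with `asyncio.sleep` simulating the wait. It only shows why the async route improves utilization, since the waits overlap instead of running back to back.

```python
import asyncio
import time

async def fake_io(delay: float) -> float:
    # asyncio.sleep stands in for a network or database call (an assumption
    # for illustration; a real service would await an actual client here).
    await asyncio.sleep(delay)
    return delay

async def run_concurrently(delays):
    # gather() schedules every coroutine on the same event loop, so the
    # waits overlap on a single thread.
    return await asyncio.gather(*(fake_io(d) for d in delays))

def timed(delays):
    start = time.perf_counter()
    results = asyncio.run(run_concurrently(delays))
    return results, time.perf_counter() - start

# 20 simulated calls of 50 ms each: about 1 s if done one after another.
results, elapsed = timed([0.05] * 20)
print(f"{len(results)} calls finished in {elapsed:.2f}s")
```

A synchronous loop over the same 20 calls would take roughly the sum of the delays, which is the "low resource saturation" side of the trade-off.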

Cthulhu_ · 2 years ago

For sure; Elixir comes with a whole new architecture as well, and they COULD have gained a significant performance improvement if they rewrote / re-architected it in Python.

However, could they have done a 50x performance improvement in Python? And what about the other numbers, like speed and concurrency?

That said, I'm confident they crunched the numbers and did the tradeoffs; after all, adding another language and/or architecture will make your company more complex and make hiring more complex.

sergioisidoro · 2 years ago

Depends. Were they already using gunicorn? What about cPython? Was the code written by a junior developer without much consideration for memoization and dictionary access?

And sure, you can get some performance improvements by rewriting things in those languages, at the expense of losing the entire python open source environment.

So I would need way more information about the previous system to take this even remotely seriously.

Python scales quite reasonably for most small to medium companies.

hinkley · 2 years ago

They would have had to create a bad implementation of half of Erlang to accomplish it. Or you could just fuckin' use Elixir.

__alexs · 2 years ago

> I'm confident they crunched the numbers and did the tradeoffs;

Here is the math they did.

Pros: Looks good in my promotion packet.

Cons: Need a bigger garage to fit all of these new sports cars.

fergie · 2 years ago

Anecdotally, I am beginning to hear more and more about organisations moving away from high level cloud infrastructure (such as lambda and cloud gateway) and going back to plain old virtual servers (like EC2), or even on-prem. Often the cost of supposedly "cheap" cloud environments is WAY more than you might expect and all booked as operational rather than capital expenditure (the latter being often preferable to shareholders).

whstl · 2 years ago

My previous company went from very expensive cloud CI/CD servers to on-prem off-the-shelf servers.

The cost and the incredible performance gains we got by moving to a bunch of local computers were enough to make the whole thing pay for itself in about two months. Yep, physical computers cost less than two months of cloud. Plus the gains in productivity from having to wait minutes instead of hours.

Maintenance was never a problem, and we didn't need to hire new people to take extra care of the servers.

My current company is thinking of doing the same for AI servers. It's just too expensive in AWS.

devjab · 2 years ago

I'm not sure why that would be too surprising to you. In the enterprise organizations I've worked with over the past few decades, IT strategy is always long term and always cost based. 10-15 years ago organizations moved from the basement to placing their owned hardware in rented racks in data centers that were run by third-party organizations because it was cheaper. Then Azure came along and made it sort of a "no-brainer" to move into Azure because you already had a lot of Microsoft products and Azure was cheap. Now with so many Azure price hikes and those third-party data centers improving their business models, the pendulum is swinging away from Azure.

That doesn't mean that the move into Azure wasn't the right one at the time, or that it was more expensive than not going into Azure. It's simply that the market evolves.

PestoDiRucola · 2 years ago

I wonder if this will be another effect of higher-than-0 interest rates, with tech companies choosing to leave cloud service providers in order to reduce their costs.

_rwo · 2 years ago

> (...) and going back to plain old virtual servers

Some of us never left ;)

redocneknurd · 2 years ago

Pinterest - ah, that app that has hijacked all images in Google search results. Created an account once but have not really used it.

interactivecode · 2 years ago

While the Google Images thing is kinda shitty, Pinterest is really really good at image search and image recommendations.

ReleaseCandidat · 2 years ago

Pinterest without an account is useless. Or did that change since the last time I cared to look?

xk_id · 2 years ago

It's actually one of my favourite apps, after using it for years. I'm not even joking. Their feed is incredibly good.

cmrdporcupine · 2 years ago

Yeah, one thing it really shines at is recommending eating disorder content to teenage girls.

robbintt · 2 years ago

-site:pinterest.com will fix it as needed

hermannj314 · 2 years ago

If anyone is struggling with analysis paralysis, remember it is ok to do things wrong the first time because then you can farm that sweet internet karma bragging about how you fixed your crappy first iteration.

If you always do things right the first time, you don't get to brag about putting out the fire you started.

mabbo · 2 years ago

Exactly. Goal number one is a working product. All you know is python? Write it in Python.

hcks · 2 years ago

It’s interesting because while this is an extreme case of performance improvement, the ROI doesn’t seem amazing.

"rewriting in another language reduced the number of servers by 95%" is hard to beat, but at the same time, this saves "only" 2m a year, or about 0.3% of FY22 cost of revenue (per another post)

Pinterest's per-employee revenue seems to be around 1m, which basically suggests that this could even be a worse-than-average allocation of resources.

My takeaway would be "don’t bother with this kind of optimisation before you reach a scale where you can afford to do marginal improvements"
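The back-of-the-envelope numbers in this comment can be spelled out as follows. Everything here uses the comment's own rounded figures (2m saved, "about 0.3%", "around 1m" per employee); the implied cost of revenue is derived from those, not taken from Pinterest's filings, so treat it as a rough sketch.

```python
# All inputs are the rounded figures stated in the comment above;
# the variable names are made up for illustration.
annual_savings = 2_000_000            # "saves 'only' 2m a year"
share_of_cost_of_revenue = 0.003      # "about 0.3% of FY22 cost of revenue"
revenue_per_employee = 1_000_000      # "around 1m"

# Cost of revenue implied by the two figures above (derived, not reported).
implied_cost_of_revenue = annual_savings / share_of_cost_of_revenue

# The savings expressed as "employees' worth of revenue".
employee_equivalents = annual_savings / revenue_per_employee

print(f"implied cost of revenue: ~${implied_cost_of_revenue / 1e6:.0f}M")
print(f"savings equal the revenue of ~{employee_equivalents:.0f} employees")
```

On these figures the savings match the revenue attributed to about two employees, which is the basis for the "worse than average allocation" argument if the rewrite took more than a couple of engineer-years.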

impulser_ · 2 years ago

It's more than just server costs. They reduced the complexity of their services by a lot.

Having to maintain 95% less servers is worth it even if they didn't save any money IMO.

This also could lead to them reducing their engineering team that maintains these services which would reduce costs even more.

robbintt · 2 years ago

This isn't a huge difference most of the time. Once you have cluster management at scale, 3 vs 3000 are pretty similar.

paxys · 2 years ago

Can switching to an esoteric functional language from Python and Java really be considered reducing complexity? No matter how well it is written, I'm willing to bet that way fewer people in the company/industry understand the new codebase and can make changes to it.

manmal · 2 years ago

Wouldn't you also need a smaller ops team that way, further reducing your costs?

diarrhea · 2 years ago

Managing 200 Python servers/nodes certainly sounds operationally challenging.

boxed · 2 years ago

2 million dollars a year is "marginal improvement" just because the sales and customer support team is big? That's your logic?