As an aside, I’m always surprised at just how bad Meta’s software quality is, especially for a FAANG company with lots of expensive engineers.
Instagram and Facebook are some of the buggiest platforms I use. I often run into bugs in several key user journeys that go unresolved and it baffles me how they don’t pick up on them.
One of the problems is it’s basically impossible to get in touch with Meta to report them.
If you think they are buggy, become a seller on Amazon. The buggiest software I have ever had to use. And they're the influencers behind the "two pizza rule". The second worst is Spotify: after a decade it still does not remember where you stopped listening on radio plays. From the influencer people who brought you "tribes".
Strong agree here, at least on FB (I don't use Instagram, because why would I). I fully tolerated their growing pains back in the day, when the site went down for hours or threw frequent blank pages after some standard click. The thing is, even though things improved over time, it's still buggy as hell.
I had so many problems with uploading photo albums, for example, constantly over the past 15 years. I used to put thousands of pictures from my full-frame camera travel and mountain adventures on FB, until I got fed up with wasting so much time redoing it all and battling the site. They literally pushed away a genuine content creator; at least in my social graph I was by far the most active one, and folks liked what I uploaded.
Sometimes an album uploaded twice. Sometimes some photos refused to upload while being identical in every respect to the rest, which worked fine. Descriptions I had painstakingly added to every photo got wiped out. Sometimes, even these days, the whole feed is blank, just menus on the side. I deleted their mobile app since it was snooping for all the data it could get and draining the battery while not in use at all. That was actually a great move for my own personal happiness, so no complaints about that one.
I'd say FB is this successful despite its consistent lack of technical quality. They perfectly nailed a hole in the market people didn't even know they wanted filled, and their timing. This is not unique to FB: with little pressure from roughly equal competition, every business I've seen is subpar and/or overpriced.
Then you use Google's products and it's night and day. I don't recall a single bug that affected me, ever. Too bad Google+ never stood a chance.
> I had so many problems with uploading photo albums for example
I suspect it is by design, in a way. FB is not a photo storage platform; they need to compress the images a lot. A single photo may increase interaction and hence metrics like MAU (monthly active users), but a whole album is unlikely to do so.
I can't speak to what is going on but one thing I've heard that makes things difficult for at least the frontend teams is that they are battling ad blockers. In order to work around them, they create incredibly obfuscated HTML that no one would normally ever create.
somehow every comment in here is downplaying this achievement as “low” or unimpressive
you truly underestimate the scale of an operation like this
the vast majority of software companies will never count a trillion of anything. even big companies that scale will only have a small subset of teams work on something this large
The article is too light on details to judge whether trillions is impressive or not. For example, if my single-server system easily handles 100 million calls per day and the load is almost exclusively CPU-bound (like most AI tasks), then scaling to 1 trillion per day might be as easy as buying 10k servers, which is totally a thing that mid-to-large-sized companies do to scale up.
What makes this Meta paper impressive is NOT scaling up to 1 trillion per day; it's that they manage to do so while keeping request latency low and CPU utilization high. Anyone who's been with Heroku long enough probably remembers when instances would suddenly be 80% idle and yet requests were still slow: that was when Heroku changed their routing from intelligent to dumb. Meta is doing the opposite here, reducing overall deployment costs by squeezing more requests out of each instance than would be possible with a simple random load balancer.
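To make that contrast concrete, here is a rough Python sketch of random routing versus routing by outstanding requests. This is only an illustration of the general idea, not anything from the paper; the names and structure are made up.

    import random

    class Instance:
        def __init__(self, name):
            self.name = name
            self.outstanding = 0  # requests currently in flight

    def pick_random(instances):
        # "Dumb" routing: any instance, regardless of current load. A slow
        # instance keeps receiving work and builds a queue while faster
        # instances sit partly idle.
        return random.choice(instances)

    def pick_least_loaded(instances):
        # Load-aware routing: send the call to the instance with the fewest
        # requests in flight, so capacity is used more evenly.
        return min(instances, key=lambda i: i.outstanding)

    # Hypothetical dispatch: increment on send, decrement when the call returns.
    instances = [Instance(f"worker-{i}") for i in range(4)]
    target = pick_least_loaded(instances)
    target.outstanding += 1

Keeping those in-flight counts accurate across many routers and regions is, of course, the hard part.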
It's an interesting paper, but they've made some weird trade-offs, sacrificing latency for resource efficiency, that make it seem niche, especially for FaaS tech, and the TPS they're hitting is surprisingly low for something that is supposedly in widespread use at their company. Some of their suggestions at the end are also already features of FaaS products in some form or another.
I think this paper is awesome and, to be clear, this platform is not a trivial piece of engineering, but it doesn't seem particularly novel or even close to reaching the larger workloads that public cloud services offer.
>the vast majority of software companies will never count a trillion of anything
As others have noted, it's not impossible that many of our own laptops have run a "trillion functions"... the devil is in the details here for systems researchers and engineers, and based on the details XFaaS isn't nearly as novel as, say, Presto was.
And this is also the HN where people boast they could rebuild $SUCCESSFUL_SOFTWARE on their own over a three-day weekend.
There are tons of very brilliant and very smart people here, but there are also many who are too fond of themselves or who have real trouble understanding a problem's ramifications in real life / real business.
“Trillions of functions” is a metric that’s hard to know whether to be impressed by or not. I don’t think it’s impossible that my laptop runs “trillions of functions” every day.
But the callers on this platform are likely remote, and therefore it handles I/O as well, etc. Like I said, hard to understand whether it’s impressive or not.
The calls per server is probably not the difficult part - this is the type of scale where you start hitting much, much harder problems, e.g.:
- Load balancing across regions [0] without significant latency overhead
- Service-to-service mesh/discovery that scales with less than O(# of servers)
- Reliable processing (retries without causing retry storms, optimistic hedging; see the backoff sketch after this comment)
- Coordinating error handling and observability
All without the engineers who actually write the functions needing to know anything about the system (which requires airtight reliability, abstractions, and observability).
I don't mean to comment on whether this is impressive or not, just pointing out that per-server throughput would never be the difficult part of reaching this scale.
[0] And apparently for this system, load balancing across time, which is at least a mildly interesting way of thinking about it
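On the retry-storm bullet above: one common mitigation (not necessarily what Meta does) is capped exponential backoff with full jitter, so a burst of simultaneous failures doesn't become a burst of simultaneous retries. A minimal sketch, with made-up names:

    import random
    import time

    def call_with_backoff(fn, max_attempts=4, base=0.1, cap=5.0):
        """Retry fn() with capped exponential backoff and full jitter."""
        for attempt in range(max_attempts):
            try:
                return fn()
            except Exception:
                if attempt == max_attempts - 1:
                    raise
                # Sleep a random amount in [0, min(cap, base * 2^attempt)).
                # The randomness spreads retries out in time, so thousands of
                # callers failing together don't all retry in the same instant.
                time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))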
> One example of load they demonstrate has 20 million function calls submitted to XFaaS within 15 minutes.
> Meta’s XFaaS is their serverless platform that “processes trillions of function calls per day on more than 100,000 servers spread across tens of datacenter regions.”
Just for one trillion (not "trillions"), that'd be 10 million function calls per server per day, ~7k per minute. Sounds about right to me. They'll surely want to leave some room for traffic spikes, server failures and unforeseen issues. Server uplink could also be a factor - at least that was the major bottleneck the last time I ran infrastructure serving lots of clients (about 100 million), but they are probably smarter about this than I was back then.
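Spelling that arithmetic out with the figures quoted from the paper (one trillion per day as a lower bound, 100,000 servers):

    calls_per_day = 1_000_000_000_000   # "trillions of function calls per day" (lower bound)
    servers = 100_000                   # "more than 100,000 servers"

    per_server_per_day = calls_per_day / servers            # 10,000,000
    per_server_per_minute = per_server_per_day / (24 * 60)  # ~6,944, i.e. ~7k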
The only impressive thing here is the technology abuse required to do such a slow job at such a large scale.
I recall Ruby on Rails doing at most 150 requests per second on a laptop 15 years ago, which was laughably low even for those days. I could do hundreds of times more using C++, or at least tens of times more using unoptimized Python.
Their achievement should not be praised; on the contrary, they need to be told they're wasting resources, heating up the planet, and emitting carbon dioxide for no good reason. This is not what optimization looks like.
And no, it is not serverless, since obviously something serves that stuff. Stop lying already.
I assume these are mostly procedures, not functions (that is, they have side effects), as applying a function 1.3M times per minute immediately raises the question of caching, which doesn't seem to be important for them.
But if you execute that many procedure calls, how do you guard against them influencing each other due to their side effects? Memory leaks come to mind, or other weird bugs. Also, how do you manage the credentials to effectively issue side effects in a (hopefully) zero-trust environment?
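The paper presumably has its own answers, but a common pattern for the memory-leak part of this question is to run calls in short-lived worker processes and recycle each worker after a bounded number of calls, so leaked memory and polluted global state die with the process. A generic sketch using Python's standard library (not Meta's design; the credential question is a separate problem, typically handled with short-lived, narrowly scoped tokens issued per function):

    from multiprocessing import Pool

    def run_isolated(calls, workers=4, calls_per_worker=100):
        """Run side-effecting calls in worker processes that get recycled.

        maxtasksperchild makes the pool replace each worker process after it
        has handled a fixed number of tasks, so a leak in one call cannot
        keep growing across millions of later calls. Each call is a
        (picklable_function, args_tuple) pair.
        """
        with Pool(processes=workers, maxtasksperchild=calls_per_worker) as pool:
            results = [pool.apply_async(fn, args) for fn, args in calls]
            return [r.get() for r in results]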
Off topic: I ended up not finishing the article, because first it presented me with a full screen overlay asking me to sign up to their newsletter, forcing me to click on "I want to read it first" to see the article.
Then after scrolling down two paragraphs, it presented me with another overlay blocking the article, again asking me to sign up for the newsletter.
Can we please stop rewarding these kinds of practices with traffic by posting and upvoting them?
I wonder how much actual conversion happens in these kind of forms, anyway. Who are the people that sign up to newsletters of a random website before even reading anything? And who signs up after reading just an intro?
I get the idea of the mailing lists - generate traffic that does not rely on external platforms, with their ever changing algorithms that need to be appeased and courted, and who might even change the rules or ban you from promoting your content.
But surely the conversions these full screen modal overlays bring must be outweighed by the attrition from scaring away passing readers?
> I often run into bugs in several key user journeys that go unresolved and it baffles me how they don't pick up on them.
No negative incentives, it's not safety-critical so nobody will recall or sue.
> scaling to 1 trillion per day might be as easy as buying 10k servers
I doubt that. How would you distribute the requests between those? An instance of mod_proxy_balancer?
>the vast majority of software companies will never count a trillion of anything
Compare its budget to WhatsApp before it was acquired!
> Like I said, hard to understand whether it's impressive or not.
I’ll assume it’s impressive.
QPS is the standard measure, and per day is just a sly way to multiply by 86400.
If you want an average measure then report the average and peak QPS ...
1T/day = 11.575M/s
I personally find 11.5M/s a lot more impressive sounding. Though another comment suggests 100k servers — for about 100/s per server.
10ms per request isn’t particularly good or bad; volume is still impressive.
> Meta’s XFaaS is their serverless platform that “processes trillions of function calls per day on more than 100,000 servers spread across tens of datacenter regions.”
1,000,000,000,000 / 100,000 = 10,000,000 calls per server per day
10,000,000 / 24 / 3600 ≈ 115.7 calls per server per second
This seems very low per server?
> And no, it is not serverless, since obviously something serves that stuff. Stop lying already.
“Serverless” just means you’re not running a long-running process that handles these specific requests.
Serverless means you have no server. Stop repeating Amazon's lies. It is nonsense.
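For what it's worth, the practical meaning is about who operates the servers, not whether they exist. As a familiar (non-Meta) example, here is the AWS Lambda Python handler convention: the author writes only this function and never manages the host machine or a long-running process; the payload field shown is hypothetical.

    # AWS Lambda's Python handler convention, used here purely as an example
    # of the "serverless" programming model: servers still exist, but the
    # function author neither provisions nor operates them.
    def lambda_handler(event, context):
        user_id = event.get("user_id")  # hypothetical payload field
        return {"statusCode": 200, "body": f"processed user {user_id}"}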
> Who are the people that sign up to newsletters of a random website before even reading anything?
It can't easily be blocked by a captcha as that would hurt conversion rates of legitimate users.