As an aside, I’m always surprised at just how bad Meta’s software quality is, especially for a FAANG company with lots of expensive engineers.
Instagram and Facebook are some of the buggiest platforms I use. I often run into bugs in several key user journeys that go unresolved and it baffles me how they don’t pick up on them.
One of the problems is it’s basically impossible to get in touch with Meta to report them.
If you think they are buggy, become a seller on Amazon. The buggiest software I have ever had to use. And they're the influencers behind the "two pizza rule". The second worst is Spotify: after a decade it still does not remember where you stopped listening on radio plays. From the influencer people who brought you "tribes".
Strong agree here, at least on FB (I don't use Instagram, because why would I). I fully tolerated their growing pains back in the day, when the site went down for hours or threw frequent blank pages after some standard click. The thing is, even though things improved over time, it's still buggy as hell.
I had so many problems with uploading photo albums, for example, constantly over the past 15 years. I used to put thousands of pictures from my full-frame camera travel and mountain adventures on FB, until I got fed up with wasting so much time redoing it all and battling the site. They literally pushed away a genuine content creator; at least in my social graph I was by far the most active one, and folks liked what I uploaded.
Sometimes an album uploaded twice. Sometimes some photos refused to upload while being identical in every respect to the rest, which worked fine. Descriptions I had painstakingly added to every photo got wiped out. Sometimes, even these days, the whole feed is blank, just menus on the side. I deleted their mobile app since it was snooping for all the data it could get and draining the battery while not in use at all. That was actually a great move for my own personal happiness, so no complaints about that one.
I'd say FB is this successful despite its consistent lack of technical quality. They perfectly nailed a hole in the market people didn't even know they wanted filled, and their timing. This is not unique to FB: with little pressure from roughly equal competition, every business I've seen is subpar and/or overpriced.
Then you use Google's products and it's night and day. I don't recall a single bug that affected me, ever. Too bad Google+ never stood a chance.
> I had so many problems with uploading photo albums for example
I suspect it is by design, in a way. FB is not a photo storage platform; they need to compress the images a lot. A single photo may increase interaction and hence metrics like MAU (monthly active users), but a whole album is unlikely to do so.
I can't speak to what is going on but one thing I've heard that makes things difficult for at least the frontend teams is that they are battling ad blockers. In order to work around them, they create incredibly obfuscated HTML that no one would normally ever create.
somehow every comment in here is downplaying this achievement as “low” or unimpressive
you truly underestimate the scale of an operation like this
the vast majority of software companies will never count a trillion of anything. even big companies that scale will only have a small subset of teams work on something this large
The article is too light on details to judge whether trillions is impressive or not. For example, if my single-server system easily handles 100 million calls per day and the load is almost exclusively CPU-bound (like most AI tasks), then scaling to 1 trillion per day might be as easy as buying 10k servers, which is totally a thing that mid-to-large-sized companies do to scale up.
What makes this Meta paper impressive is NOT scaling up to 1 trillion per day; it's that they manage to do so while keeping request latency low and CPU utilization high. Anyone who's been with Heroku long enough probably remembers when instances would suddenly be 80% idle and yet requests were still slow: that was when Heroku changed their routing from intelligent to dumb. Meta is doing the opposite here, reducing overall deployment costs by squeezing more requests out of each instance than would be possible with a simple random load balancer.
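To make that contrast concrete, here is a rough Python sketch of random routing versus routing by outstanding requests. This is only an illustration of the general idea, not anything from the paper; the names and structure are made up.

    import random

    class Instance:
        def __init__(self, name):
            self.name = name
            self.outstanding = 0  # requests currently in flight

    def pick_random(instances):
        # "Dumb" routing: any instance, regardless of current load. A slow
        # instance keeps receiving work and builds a queue while faster
        # instances sit partly idle.
        return random.choice(instances)

    def pick_least_loaded(instances):
        # Load-aware routing: send the call to the instance with the fewest
        # requests in flight, so capacity is used more evenly.
        return min(instances, key=lambda i: i.outstanding)

    # Hypothetical dispatch: increment on send, decrement when the call returns.
    instances = [Instance(f"worker-{i}") for i in range(4)]
    target = pick_least_loaded(instances)
    target.outstanding += 1

Keeping those in-flight counts accurate across many routers and regions is, of course, the hard part.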
It's an interesting paper, but they've made some weird trade-offs, sacrificing latency for resource efficiency, that make it seem niche, especially for FaaS tech, and the TPS they're hitting is surprisingly low for something that is supposedly in widespread use at their company. Some of their suggestions at the end are also already features of FaaS products in some form or another.
I think this paper is awesome and, to be clear, this platform is not a trivial piece of engineering, but it doesn't seem particularly novel or even close to reaching the larger workloads that public cloud services offer.
>the vast majority of software companies will never count a trillion of anything
As others have noted, it's not impossible that many of our own laptops have run a "trillion functions"... the devil is in the details here for systems researchers and engineers, and based on the details XFaaS isn't nearly as novel as, say, Presto was.
And this is also the HN where people boast they could rebuild $SUCCESSFUL_SOFTWARE on their own over a three-day weekend.
There are tons of very brilliant and very smart people here, but there are also many who are too fond of themselves or who have real trouble understanding a problem's ramifications in real life / real business.
“Trillions of functions” is a metric that’s hard to know whether to be impressed by or not. I don’t think it’s impossible that my laptop runs “trillions of functions” every day.
But the callers on this platform are likely remote, and therefore it handles I/O as well, etc. Like I said, hard to understand whether it’s impressive or not.
The calls per server is probably not the difficult part - this is the type of scale where you start hitting much, much harder problems, e.g.:
- Load balancing across regions [0] without significant latency overhead
- Service-to-service mesh/discovery that scales with less than O(# of servers)
- Reliable processing (retries without causing retry storms, optimistic hedging; see the backoff sketch after this comment)
- Coordinating error handling and observability
All without the engineers who actually write the functions needing to know anything about the system (which requires airtight reliability, abstractions, and observability).
I don't mean to comment on whether this is impressive or not, just pointing out that per-server throughput would never be the difficult part of reaching this scale.
[0] And apparently for this system, load balancing across time, which is at least a mildly interesting way of thinking about it
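On the retry-storm bullet above: one common mitigation (not necessarily what Meta does) is capped exponential backoff with full jitter, so a burst of simultaneous failures doesn't become a burst of simultaneous retries. A minimal sketch, with made-up names:

    import random
    import time

    def call_with_backoff(fn, max_attempts=4, base=0.1, cap=5.0):
        """Retry fn() with capped exponential backoff and full jitter."""
        for attempt in range(max_attempts):
            try:
                return fn()
            except Exception:
                if attempt == max_attempts - 1:
                    raise
                # Sleep a random amount in [0, min(cap, base * 2^attempt)).
                # The randomness spreads retries out in time, so thousands of
                # callers failing together don't all retry in the same instant.
                time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))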
> One example of load they demonstrate has 20 million function calls submitted to XFaaS within 15 minutes.
> Meta’s XFaaS is their serverless platform that “processes trillions of function calls per day on more than 100,000 servers spread across tens of datacenter regions.”
Just for one trillion (not "trillions"), that'd be 10 million function calls per server per day, ~7k per minute. Sounds about right to me. They'll surely want to leave some room for traffic spikes, server failures and unforeseen issues. Server uplink could also be a factor - at least that was the major bottleneck the last time I ran infrastructure serving lots of clients (about 100 million), but they are probably smarter about this than I was back then.
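Spelling that arithmetic out with the figures quoted from the paper (one trillion per day as a lower bound, 100,000 servers):

    calls_per_day = 1_000_000_000_000   # "trillions of function calls per day" (lower bound)
    servers = 100_000                   # "more than 100,000 servers"

    per_server_per_day = calls_per_day / servers            # 10,000,000
    per_server_per_minute = per_server_per_day / (24 * 60)  # ~6,944, i.e. ~7k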
The only impressive thing here is the technology abuse required to do such a slow job at such a large scale.
I recall Ruby on Rails doing at most 150 requests per second on a laptop 15 years ago, which was laughably low even for those days. I could do hundreds of times more using C++, or at least tens of times more using unoptimized Python.
Their achievement should not be praised; on the contrary, they need to be told they're wasting resources, heating up the planet, and emitting carbon dioxide for no good reason. This is not what optimization looks like.
And no, it is not serverless, since obviously something serves that stuff. Stop lying already.
I assume these are mostly procedures, not functions (that is, they have side effects), as applying a function 1.3M times per minute immediately raises the question of caching, which doesn't seem to be important for them.
But if you execute that many procedure calls, how do you guard against them influencing each other due to their side effects? Memory leaks come to mind, or other weird bugs. Also, how do you manage the credentials to effectively issue side effects in a (hopefully) zero-trust environment?
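The paper presumably has its own answers, but a common pattern for the memory-leak part of this question is to run calls in short-lived worker processes and recycle each worker after a bounded number of calls, so leaked memory and polluted global state die with the process. A generic sketch using Python's standard library (not Meta's design; the credential question is a separate problem, typically handled with short-lived, narrowly scoped tokens issued per function):

    from multiprocessing import Pool

    def run_isolated(calls, workers=4, calls_per_worker=100):
        """Run side-effecting calls in worker processes that get recycled.

        maxtasksperchild makes the pool replace each worker process after it
        has handled a fixed number of tasks, so a leak in one call cannot
        keep growing across millions of later calls. Each call is a
        (picklable_function, args_tuple) pair.
        """
        with Pool(processes=workers, maxtasksperchild=calls_per_worker) as pool:
            results = [pool.apply_async(fn, args) for fn, args in calls]
            return [r.get() for r in results]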
Off topic: I ended up not finishing the article, because first it presented me with a full screen overlay asking me to sign up to their newsletter, forcing me to click on "I want to read it first" to see the article.
Then after scrolling down two paragraphs, it presented me with another overlay blocking the article, again asking me to sign up for the newsletter.
Can we please stop rewarding these kinds of practices with traffic by posting and upvoting them?
I wonder how much actual conversion happens in these kind of forms, anyway. Who are the people that sign up to newsletters of a random website before even reading anything? And who signs up after reading just an intro?
I get the idea of the mailing lists - generate traffic that does not rely on external platforms, with their ever changing algorithms that need to be appeased and courted, and who might even change the rules or ban you from promoting your content.
But surely the conversions these full screen modal overlays bring must be outweighed by the attrition from scaring away passing readers?
> I often run into bugs in several key user journeys that go unresolved and it baffles me how they don't pick up on them.
No negative incentives, it's not safety-critical so nobody will recall or sue.
> scaling to 1 trillion per day might be as easy as buying 10k servers
I doubt that. How would you distribute the requests between those? An instance of mod_proxy_balancer?
>the vast majority of software companies will never count a trillion of anything
Compare its budget to WhatsApp before it was acquired!
> Like I said, hard to understand whether it's impressive or not.
I’ll assume it’s impressive.
QPS is the standard measure, and per day is just a sly way to multiply by 86400.
If you want an average measure then report the average and peak QPS ...
1T/day = 11.575M/s
I personally find 11.5M/s a lot more impressive sounding. Though another comment suggests 100k servers — for about 100/s per server.
10ms per request isn’t particularly good or bad; volume is still impressive.
> Meta’s XFaaS is their serverless platform that “processes trillions of function calls per day on more than 100,000 servers spread across tens of datacenter regions.”
1,000,000,000,000 / 100,000 = 10,000,000 calls per server per day
10,000,000 / 24 / 3600 ≈ 115.7 calls per server per second
This seems very low per server?
> And no, it is not serverless, since obviously something serves that stuff. Stop lying already.
“Serverless” just means you’re not running a long-running process that handles these specific requests.
Serverless means you have no server. Stop repeating Amazon's lies. It is nonsense.
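For what it's worth, the practical meaning is about who operates the servers, not whether they exist. As a familiar (non-Meta) example, here is the AWS Lambda Python handler convention: the author writes only this function and never manages the host machine or a long-running process; the payload field shown is hypothetical.

    # AWS Lambda's Python handler convention, used here purely as an example
    # of the "serverless" programming model: servers still exist, but the
    # function author neither provisions nor operates them.
    def lambda_handler(event, context):
        user_id = event.get("user_id")  # hypothetical payload field
        return {"statusCode": 200, "body": f"processed user {user_id}"}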
> Who are the people that sign up to newsletters of a random website before even reading anything?
It can't easily be blocked by a captcha as that would hurt conversion rates of legitimate users.