freedomben · 2 years ago
It's interesting to see the rapidly growing demand graph. Is that because people want to adopt it, or is it being mandated or encouraged as "best practice" etc? I'm not implying that true organic demand wouldn't exist because it definitely might, but I have seen in practice where leadership encourages or even mandates usage of FaaS, so the numbers go up even though on a neutral field people wouldn't necessarily choose it. There's also the "I'd like to try it" group who haven't yet had experience with it and choose it as a target for learning/curiosity reasons.

I wonder because my own experience with FaaS has been mostly bad. There are some nice things for sure, and a handful of use cases where it's wonderfully superior, such as executing jobs where the number of concurrent processes fluctuates wildly, making rapid scalability highly desirable. The canonical use case being "make a thumbnail for this image."
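A minimal sketch of that canonical case, written as an AWS Lambda-style Python handler reacting to an S3 upload (the thumbnail size and the THUMBS_BUCKET setting are made-up illustrations, not anything from the article):

```python
# Hypothetical thumbnail function: S3 "object created" event in, thumbnail out.
import os
import boto3
from PIL import Image

s3 = boto3.client("s3")

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        local = f"/tmp/{os.path.basename(key)}"
        s3.download_file(bucket, key, local)

        img = Image.open(local)
        img.thumbnail((256, 256))          # resize in place, keeping aspect ratio
        thumb = f"/tmp/thumb-{os.path.basename(key)}"
        img.save(thumb)

        # THUMBS_BUCKET is an assumed configuration value
        s3.upload_file(thumb, os.environ["THUMBS_BUCKET"], f"thumbs/{key}")
```

The appeal is that concurrency is the platform's problem: a burst of 10,000 uploads just means 10,000 short-lived invocations.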

For web servers though, I find the opacity, the poor observability, and the difficulty of running locally to be significant hindrances. There's also lock-in, which I hate with a passion. At this point I'd rather manage static EC2 instances than Lambdas, for example. (To be clear, I'm not advocating static EC2 instances. My preference is Kubernetes all the things. Not perfect of course, but K8s makes horizontal scalability very easy while improving visibility, but that's a different conversation.)

madsbuch · 2 years ago
I think one of the core arguments for larger organisations is that incompetence cannot ruin everything.

Firebase is a good case study for this: they push hard for using Firestore and serverless functions. If you succeed in solving your problems with their offerings, then they will also guarantee that things run well and scale well.

Firestore, as of when I last used it, did not support all the operations that can bring a traditional DBMS to its knees. You have document-level isolation, i.e. no joins or aggregating functions (like count or sum across documents). These operations need to be implemented another way, using aggregators or indices.
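For example, the usual workaround for the missing count/sum is to maintain the aggregate yourself on every write. A rough sketch using the Python google-cloud-firestore client and its Increment field transform (collection names and the aggregate layout are illustrative):

```python
# Sketch: write the document and bump a per-collection aggregate in one batch,
# since Firestore can't SUM/COUNT across documents for you (at least back then).
from google.cloud import firestore

db = firestore.Client()

def record_order(order_id: str, amount: int) -> None:
    batch = db.batch()
    batch.set(db.collection("orders").document(order_id), {"amount": amount})
    batch.set(
        db.collection("aggregates").document("orders"),
        {"count": firestore.Increment(1), "total": firestore.Increment(amount)},
        merge=True,
    )
    batch.commit()  # both writes land together; readers just fetch the aggregate doc
```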

So I agree that development is more fun on proper runtimes with fully fledged databases. But when you manage several thousand developers at all levels, I think it makes sense to impose another architecture.

KaiserPro · 2 years ago
> Is that because people want to adopt it, or is it being mandated or encouraged as "best practice"

It's because it's the simplest, fastest way to get compute for non-realtime bits of code. It's much easier to deploy stuff to, and it's really simple to trigger it from other services.

A lot of things in FB are communicated by RPC, so it's not really "web" functions that run on there, it's more generic ETL-type stuff. (as in system x has updated y, this triggers a function to update paths to use the latest version)

asdfman123 · 2 years ago
> It's because it's the simplest, fastest way to get compute for non-realtime bits of code. It's much easier to deploy stuff to, and it's really simple to trigger it from other services.

That's all true, but IMO the problem with functions is that they are deceptively simple at first. Deploying code isn't actually that hard a problem. The hard part is growing and maintaining the codebase over time.

I'm not saying there ISN'T a use case for them, but there should be a very good reason why you want to split them off from other services.

jacobr1 · 2 years ago
> as in system x has updated y, this triggers a function to update paths to use the latest version

This is the sweet spot. Eventing. Any kind of queue-based workload, especially one with variable load, is a potential candidate for a FaaS-style architecture. The alternative is a worker pool specific to the workload. FaaS just moves the process-management abstraction up to a global worker pool.
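A rough contrast of the two shapes, using an in-process queue.Queue as a stand-in for Kafka/SQS (everything here is illustrative):

```python
import queue
import threading

def update_paths(event: dict) -> None:
    # The business logic. In the FaaS model this function is all the team ships;
    # the platform decides how many instances run, and when.
    print("repointing consumers of", event["artifact"], "to", event["version"])

# The alternative: a workload-specific worker pool the team sizes and operates itself.
events: "queue.Queue[dict]" = queue.Queue()

def worker_loop() -> None:
    while True:
        update_paths(events.get())
        events.task_done()

for _ in range(4):   # capacity fixed by the team, not by the current queue depth
    threading.Thread(target=worker_loop, daemon=True).start()

events.put({"artifact": "y", "version": "1.2.3"})
events.join()
```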

willvarfar · 2 years ago
Yeah it sucks to be a FaaS customer on a cloud run by someone else... you're just overpaying for easy instead of simple etc.

But if you're in Meta, and you're running on an abstraction built and maintained by Meta, then it's run for cost instead of profit, and the incentives all align between user and infrastructure provider?

I imagine the development and deployment ease for a system at Meta for Meta could be dreamy. At least, it has the potential to be... :)

foofie · 2 years ago
> It's interesting to see the heavily growing demand graph. Is that because people want to adopt it, or is it being mandated or encouraged as "best practice" etc?

FaaS is well justified from the point of view of an infrastructure provider. You get far better utilization from your hardware, with the tradeoff of a more convoluted software architecture and development model.

In theory you also get systems that are easier to manage as you don't have teams owning deployments from the OS and up, nor do they have to bother with managing their scaling needs.

It also makes sense on the technical side, because when a team launches a service, 90% of the thing is just infrastructure code that needs to be in place to ultimately implement request handlers.

If that's all your team needs, why not get that redundancy out of the way?

Nevertheless we need to keep things in perspective, and avoid this FANG-focused cargo-cult idiocy of mindlessly imitating any arbitrary decision regardless of whether it makes sense. FaaS makes sense if you are the infrastructure provider, and only if you have a pressing need to squeeze every single drop of utilization from your hardware. If your company does not fit this pattern, odds are you will be making a huge mistake by mimicking this decision.

DeathArrow · 2 years ago
>FaaS is well justified from the point of view of an infrastructure provider.

What if you are both provider and user? Are the tradeoffs justified?

charcircuit · 2 years ago
From the paper:

>The rapid growth at the end of 2022 is due to the launch of a new feature that allows for the use of Kafka-like data streams [12] to trigger function calls

Regarding opacity and observability, I don't see why running PHP code on XFaaS would be any worse than running the PHP code on another host.

freedomben · 2 years ago
Ah thanks, I missed that. Makes sense!
bagels · 2 years ago
Most teams at Meta are free to choose which internal tools to use. We evaluated the maturity, performance, staffing levels, and roadmap when choosing tools.
fragmede · 2 years ago
> There's also lock-in which I hate with a passion.

I wonder if this isn't clouding your judgement here. FaaS is subject to lock-in, absolutely, but for teams that need a bit of code run and need to not manage an instance, functions are the way to go. You need to be at an organization that's able to support that properly, but that's table stakes at this point.

tyingq · 2 years ago
It could also be that it's the "easy path" to get something in production with either lighter governance or faster turnaround, etc.
DeathArrow · 2 years ago
As a developer who has worked at small to medium-sized companies, mainly developing microservices running on Kubernetes, I don't see a big advantage to using FaaS as a customer. Maybe I am missing something? I am sure the provider gets more utilization out of the hardware, but for the customer, what's the point?
WJW · 2 years ago
Most of the companies I've worked for or consulted with had some sort of background worker system, usually with a queue in Postgres/Redis/RabbitMQ/Kafka/whatever in between. It usually gets used for non-latency-sensitive work like sending emails, instead of doing that from the main request handler, so that the user gets their response without having to wait for the other stuff to be completed. Bigger companies sometimes have multiple groups of background workers, one for each service and each with their own queues, servers and autoscaling policies.

As I understand the XFaaS system from the article, it functions a bit like a consolidated background worker system that any service can submit work to, and it will (try to) make sure the work gets done within whatever SLO is specified. This is usually cheaper because low-urgency work from service B can fill in around urgent jobs from service A and vice versa, leading to higher average server usage. For small/medium businesses this is probably not something you'd want, since developing and operating such a system costs more than you save. But if you have a multi-million-dollar bill for your background workers alone, then it might be worth it.
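A toy version of that consolidated queue, just to make the SLO idea concrete (service names and numbers are invented): every producer tags its job with a deadline, and one shared pool always runs whatever is due soonest.

```python
import heapq
import itertools
import time

_seq = itertools.count()
pending: list = []   # entries are (deadline_ts, tie_breaker, service, payload)

def submit(service: str, payload: dict, slo_seconds: float) -> None:
    heapq.heappush(pending, (time.time() + slo_seconds, next(_seq), service, payload))

def run_next() -> None:
    deadline, _, service, payload = heapq.heappop(pending)
    print(f"running {service} job {payload} (due in {deadline - time.time():.0f}s)")

submit("emails", {"to": "user@example.com"}, slo_seconds=3600)   # low urgency
submit("payments", {"order": 42}, slo_seconds=30)                # urgent
run_next()   # the payments job runs first, even though the email was submitted earlier
```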

sanderjd · 2 years ago
Isn't it that you don't have to set up and maintain a kubernetes cluster?
btreecat · 2 years ago
Right, you just need a collection of XFaaS services, which need to be hosted somewhere... So you'll want some container orchestration tool, might as well use k8s.
DeathArrow · 2 years ago
If setting up a Kubernetes cluster isn't a terrible overhead for us, would switching to FaaS mean we have more control or less control? I am thinking mainly of migration to another provider.

Maybe we would benefit more if we would have large spikes in resource usage, but we don't.

danpalmer · 2 years ago
> Furthermore, XFaaS explicitly does not handle functions and the path of a user-interaction

The wording of this is a little strange, but does this really mean that XFaaS is not used to serve end-user traffic at all?

The general approach here is interesting. I've always thought that there are two potential benefits for Function-based hosting ("Serverless") – low cost of components via scale-to-zero infrastructure (good for rare events, or highly variable traffic), and developer experience (coding against an infra framework rather than build-your-own infra). I'd have expected the former to be much less of an issue at scale, to the point where it's not the leading benefit, so I expected to hear more about the dev productivity side, but instead this article focuses significantly on the performance and cost side of things.

withinboredom · 2 years ago
> but does this really mean that XFaaS is not used to serve end-user traffic at all?

When I worked at a company with a similar-ish setup, we were told that if you could do it in less than a minute and the user has to wait on it, do it in the request, don't send it to a job. This was because it was "cheaper" to keep a thread running vs. spinning up the resources to do a job. When I say cheaper, I mean in dev-time, user experience, and actual resources.

1) on the front-end, the dev doesn't have to "refresh" or "poll" an endpoint to get the status.

2) user experience is better. Many people naively check "every 30s" or something ... well, what happens if services are degraded and it takes more than 30s to respond to your status check? Now, after a few minutes, you have dozens of pending requests for the exact same resource (a backoff sketch follows after this list).

3) we didn't have a fancy SLO priority queue, just a regular one, so sometimes your job could get stuck behind a ton of higher-priority jobs. What was only 30s in your tests now takes 2 hours, with the user polling your status endpoint and randomly refreshing, then eventually contacting support asking why it isn't working.
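For what it's worth, the pile-up in 2) is avoidable on the client side by never having more than one status request in flight and backing off between polls. A small sketch (the URL and response shape are hypothetical):

```python
import time
import requests

def wait_for_job(job_id: str, max_wait: float = 7200.0) -> dict:
    delay, waited = 5.0, 0.0
    while waited < max_wait:
        # the next poll only starts after the previous one has returned
        status = requests.get(f"https://api.example.com/jobs/{job_id}").json()
        if status["state"] in ("done", "failed"):
            return status
        time.sleep(delay)
        waited += delay
        delay = min(delay * 2, 300.0)   # exponential backoff, capped at 5 minutes
    raise TimeoutError(f"job {job_id} still pending after {max_wait:.0f}s")
```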

KaiserPro · 2 years ago
I think they are trying to hammer home that it's not a realtime system. Anything you pass in will be executed at some point in the future, so if you want something synchronous, it's not the tool to use.
mlerner · 2 years ago
Author of the paper summary here - my understanding is that XFaaS doesn't run functions triggered by user input (e.g. XFaaS does not execute code that fetches and returns data because a user clicked on a button).
chrishare · 2 years ago
Intuitively, the popularisation of FaaS feels inevitable as lower level technical challenges get solved and give way to abstractions like this. Who knows though what our stack will look like in 20 years.
pjmlp · 2 years ago
Looking back at what I have seen since I started paying attention in the 1980s, this is how it looks today: the same ideas resold under new marketing terms, by newly founded startups that are disrupting the ecosystem.
dijit · 2 years ago
I agree, FaaS is treading deep into "CGI" territory, except there are a lot more resources available for nice-to-have operational overheads, so people think it's largely different.
mhd · 2 years ago
As Gibson said, "The future is already here, it's just not very evenly distributed". And in IT, sometimes the future takes a long nap. I wouldn't be too surprised if e.g. descendants of Linda spaces get re-discovered because external circumstances favor that model.

Although with FaaS, I don't see anything particularly new from a development perspective. It's, well, functions. Most of the time they aren't communicating in any kind of novel way, and often they're doing the decades-old stuff of reading files and slurping databases. The abstraction is more on the operational side of things this time.

Not that I'm complaining or doing the "it's just CGI" dance. Compared to other "cloudy" tech, there's actually potential for simplifying things rather than just simulating VAX computers with more effort…

asdfman123 · 2 years ago
But what is left to "solve" here? Functions already work pretty well for what they do.

But any abstraction runs the risk of revealing itself as a leaky abstraction, not worth the overhead. That's been my experience with functions.

zokier · 2 years ago
> To reduce latency, a common approach is to keep a VM idle for 10 minutes or longer after a function invocation to allow for potential reuse [45]. In contrast, if a FaaS platform is optimized for hardware utilization and throughput, this waiting time should be reduced by a factor of 10 or more, because starting a VM consumes significantly fewer resources than having a VM idle for 10 minutes

This seems somewhat surprising in such a controlled environment. Why does an idle function consume so many resources? If I think of a normal Linux system, an idle process doesn't really consume many resources at all.

Shrezzing · 2 years ago
A server idling on Function A can't be called for Function B. 144 servers idling for 10 minutes each before expiring adds up to a day of compute time for 1 server. For every 144 expected idle servers, Meta needs to purchase and maintain an additional physical machine to keep up healthy throughput on XFaaS.

Every 15 minutes, the 100k server network has 1,500,000 compute-minutes available.

For an extreme example, every 15 minutes, 50k DB Cleanup processes run for 5 minutes, and then sit idle for 10 minutes, totalling 15 mins each. In this scenario, to satisfy 250,000 demanded compute-minutes (16.6%), 750,000 compute-minutes were supplied (50%). All else equal, to keep up healthy throughput on the rest of the network, Meta needs to purchase and maintain an excess 33,400 (50%-16.6%) physical servers to satisfy this DB Cleanup process.

Reducing the idle time to 1 minute: the 50k processes run for 5 minutes, then sit idle for 1 minute, totalling 6 minutes each. In this scenario, to satisfy 250,000 demanded compute-minutes (16.6%), 300,000 compute-minutes were supplied (20%). All else equal, to keep up healthy throughput on the rest of the network, Meta only needs to purchase and maintain an excess ~3,400 (20%-16.6%) physical servers.

That's an extreme example, but hopefully it demonstrates why, at Meta's scale, the energy requirements of the servers can be secondary when optimising for hardware utilisation, since that optimisation can make a difference large enough to mothball a moderately sized datacenter.
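The same back-of-envelope arithmetic as a quick script, using the numbers from the comment (all figures approximate):

```python
FLEET = 100_000     # servers in the network
WINDOW = 15         # minutes per cycle
JOBS = 50_000       # concurrent DB-cleanup invocations
RUN = 5             # minutes of real work per invocation

def excess_servers(idle_minutes: float) -> float:
    available = FLEET * WINDOW                 # compute-minutes per window
    demanded = JOBS * RUN                      # useful compute-minutes
    supplied = JOBS * (RUN + idle_minutes)     # minutes tied up, keep-alive included
    return FLEET * (supplied - demanded) / available

print(excess_servers(10))   # ~33,300 extra servers with a 10-minute keep-alive
print(excess_servers(1))    # ~3,300 extra servers with a 1-minute keep-alive
```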

zokier · 2 years ago
> A server idling on Function A can't be called for Function B

why not?

FooBarWidget · 2 years ago
I think it means reserving resources. When resources are reserved, they cannot be used for anything else (even if you're not actually using that reservation) so one can also say that those resources are "consumed".
mike_hearn · 2 years ago
RAM, disk. Functions are often written in high-level JITed languages, they may need to load large reference datasets from disk, they may require a lot of code to be transferred over the network before they can begin running, and because nobody trusts the security of the Linux kernel, you also have to pay the VM startup time.

An idle process on Linux isn't much different: it consumes RAM, disk and a kernel. If you trust the software you're running enough you can amortize the kernel cost but the others remain.

dig1 · 2 years ago
> and because nobody trusts the security of the Linux kernel

Seriously? I imagine all the infra worldwide is run on FaaS and short-lived VMs, right?

willvarfar · 2 years ago
Are we talking about VMs (e.g. K8s or something) or JVMs (or other equivalent garbage-collected runtimes for other languages with hot code loading)?

I'm imagining that sharing actual JVMs is a massive saving over each function getting its own VM to launch a JVM in?

bagels · 2 years ago
Memory is used.
Rodeoclash · 2 years ago
Any insight into what kinds of work are being run in the functions? I run a lot of jobs in background queues, but I'm wondering if I'm missing out on some new paradigm.
xylophile · 2 years ago
It's fundamentally the same as a job queue, but the difference is that the people writing the job are not creating a running OS process. You literally just write a function, and it gets compiled into a process owned and executed by the job system.

Why would you want that? Well, who really wants to think about the OS, or how to get their data into main()? You just want to write business logic, and FaaS lets your developers focus on that. It's a small development process optimization, but a significant one at scale if you have enough developers / unique jobs being created. And it lets platform engineers focus on the best way to shovel data into main() in your particular environment.
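Roughly, the split looks like this (all names invented): the product developer writes only the top function; the bottom half, the "getting data into main()" part, is what the platform owns.

```python
# --- what the product developer writes --------------------------------------
def summarize_report(payload: dict) -> dict:
    return {"rows": len(payload.get("rows", [])), "status": "ok"}

# --- what the platform team owns: shovelling data into main() ---------------
import json
import sys

def main() -> None:
    payload = json.load(sys.stdin)   # in reality: queue consumption, auth, retries, metrics...
    print(json.dumps(summarize_report(payload)))

if __name__ == "__main__":
    main()
```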

withinboredom · 2 years ago
It's a "job queue" in old-people-words, just managed by someone else. Sometimes they even offer integrations like putting a job on the queue on database triggers, or file changes.

Nothing really new, except that you don't have to build it yourself ... but if you do and you sell it to your coworkers/clients/customers, you just call it FaaS.

maigret · 2 years ago
They recently mentioned functions as well in an article about the Threads infrastructure. https://engineering.fb.com/2023/12/19/core-infra/how-meta-bu...
WJW · 2 years ago
No, it's basically just a big background job cluster, but for all the microservices at the same time, so they can get a higher average utilization out of their servers by averaging out the peaks.
rafaelturk · 2 years ago
Innovation regarding FaaS comes from Big Tech. I wish we had more open-source FaaS solutions; despite OpenFaaS's amazing achievements, it still lacks a more robust ecosystem.
smt88 · 2 years ago
What would the point of self-hosted FaaS be?

They lose their whole value proposition when you're paying for dedicated servers or when you don't have large numbers of servers.

oxfordmale · 2 years ago
Self-hosted FaaS is only worth it if you are at the same scale as Meta. If a smaller company does this, it would be a big red flag for me.
xylophile · 2 years ago
The pendulum has swung (is swinging?) for many companies back to on-prem for cost savings. Self-hosted FaaS allows your developers to retain the abstraction over the compute platform (a step beyond what containers provide), and grants those running the physical infra significant flexibility in managing it. It's also arguably less complex than k8s for basically everyone, if your use case supports short-lived functions.
willvarfar · 2 years ago
FaaS could be sharing your hardware at a function level inside the JVMs or whatever you are running, rather than at the VM or hypervisor level. You can (potentially) get much better elasticity and utilisation?
nmhancoc · 2 years ago
Cloudflare open-sourced workerd; there's also Apache Airflow, which covers some similar use cases, albeit not strictly with functions.
mike_hearn · 2 years ago
There is the fn project:

https://fnproject.io/

withinboredom · 2 years ago
There's also Dapr. https://dapr.io/