I posted Serving Netflix Video Traffic at 800Gb/s and Beyond [1] in 2022. For those who are unaware of the context, you may want to read the previous PDF and thread. Now we have an update; quoting:
> Important Performance Milestones:
> 2022: First 800Gb/s CDN server, 2x AMD 7713, NIC kTLS offload
> 2023: First 100Gb/s CDN server consuming only 100W of power, Nvidia Bluefield-3, NIC kTLS offload
My immediate question is whether the 2x AMD 7713 configuration actually consumes more than 800W of power, i.e. more watts per Gbps. Even if it does, it is based on 7nm Zen 3 with DDR4 and came out in 2021. Would a Zen 5 system with DDR5 outperform Bluefield in watts per Gbps?
[1] https://news.ycombinator.com/item?id=32519881
I haven't read the post yet, but I used to work in this space at a cable company. The nice bit with BlueField-3 is that you can run nginx on the on-board ARM cores, which have sufficient RAM for the live HLS use case. You can use a Liqid PCIe fabric and support 20 BlueField NICs off a single-CPU 1U box. You essentially turn the 1U box into a mid-tier cache for the NICs. Doing this I was able to generate 120Gb/s off each NIC from a 1U HP plus the PCIe fabric/cards. I worked with Liqid and the HP lab here in Colorado prototyping it. Edit: I ran the CDN edge directly on the NIC using Yocto Linux.
Note that your power consumption is more than just the CPUs (combined TDP of 2x225W [0]). You also have to consider the SSDs (16x20W when fully loaded [1]), the NICs (4x24W [2]), and the rest of the actual system itself (e.g. cooling, backplane).
Or, you can skip all the hand calculations and just fiddle with Dell's website to put together an order for a rack while trying to mirror the specs as closely as possible (I only included two NICs, since it complained that the configuration didn't have enough low-profile PCIe slots for four):
https://www.dell.com/en-us/shop/dell-poweredge-servers/power...
In this case, I selected a 1100W power supply and it's still yelling at me that it's not enough; 1400W is enough to make that nag go away.
[0] https://www.amd.com/en/products/processors/server/epyc/7003-...
[1] I couldn't find 14TB enterprise SSDs on Intel's website, so I'm using the numbers from 6.4TB drives: https://ark.intel.com/content/www/us/en/ark/products/202708/...
[2] I'm not sure offhand which model number to use, but both models that support 200GbE on pages 93-96 have this maximum wattage: https://docs.nvidia.com/nvidia-connectx-6-dx-ethernet-adapte...
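To put the watts-per-Gbps question in concrete terms, here is a rough back-of-envelope calculation using the component figures from this comment; the 150W allowance for cooling/backplane/PSU losses is purely my own assumption, as is reading the Bluefield-3 milestone's 100W as whole-system power:

    # Rough wattage budget for the 2x AMD 7713 800Gb/s build (numbers from the
    # comment above); "other_w" is an assumed allowance for fans, backplane, PSU loss.
    cpu_w   = 2 * 225   # two EPYC 7713, 225W TDP each
    ssd_w   = 16 * 20   # sixteen NVMe drives, ~20W each under load
    nic_w   = 4 * 24    # four 200GbE NICs, ~24W each
    other_w = 150       # assumption: cooling, backplane, PSU overhead

    amd_total_w = cpu_w + ssd_w + nic_w + other_w
    amd_gbps = 800

    # Bluefield-3 milestone as quoted: 100Gb/s at 100W (assumed to be whole-system).
    bf3_total_w = 100
    bf3_gbps = 100

    print(f"2x7713: {amd_total_w}W total, {amd_total_w / amd_gbps:.2f} W per Gb/s")
    print(f"BF3:    {bf3_total_w}W total, {bf3_total_w / bf3_gbps:.2f} W per Gb/s")

Under those assumed numbers the two designs land within roughly 30% of each other on watts per Gbps, which is why the "rest of the system" term matters so much to the comparison.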
Well, I am assuming the memory and SSDs are the same. The only difference should be CPU + NIC, since the Bluefield itself is the NIC. Maybe Drewg123 could expand on that (if he is allowed to).
Another observation: using the host CPU to manage the NVMe storage for VOD content is also a bottleneck. You can use the Liqid 8x NVMe PCIe cards and address them directly from the BlueField-3 ARM processes using NVMe-oF. You are then just limited by PCIe 5 switch capacity among the 10-20 shared NVMe/BlueField-3 cards.
Sometimes it's still hard to deal with the psychology of people who are used to the “comfort” of the “sta[b]le” branches.
So at work I came up with the following observation: if you are a consumer and are afraid of unpredictable events at the head of the master/main branch, then by using the head of master/main from 21 days ago you get 3 weeks of completely predictable and modifiable future.
Any cherry-picks are made during the build process, so there is no branch divergence - if the fix gets upstreamed, it is not cherry-picked anymore.
Thus, unlike with stable branches, by default it converges back to master.
“But what if the fix is not upstreamed?” - then it stays there, and depending on the nature of the code it carries a bigger or smaller ongoing cost - which reflects the technical debt for what it is.
This has worked pretty well for the past 4-5 years and is now used for quite a few projects.
This is how OS updates have worked at every company I've been at. Either you have a handful of devices that get them immediately and you scream-test, or you simply wait 3 weeks and then roll them out. (Minus security CVEs, of course.)
The one additional bonus of this scheme, which popped up somewhat as a side effect, is that one can deploy a new feature into the “prod” build even before it hits master: code it up, add the instruction to cherry-pick the change into the branch of a builder, test it, and if it works then merge that instruction into the “master” builder.
Then, the feature will remain in prod builds as a custom cherry-pick until it’s merged upstream.
And if it isn't merged - then one has a recurring reminder of the technical debt they have incurred by diverging. Once adopted, it became a pretty cool way to adopt the “right” long-term incentives without door-stopping the short-term “urgent” deliverables.
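A minimal sketch of what the scheme described above could look like as a builder step, assuming a plain git checkout; the 21-day pin, the origin/main ref, and the cherry-picks.txt file of not-yet-upstreamed commits are illustrative choices of mine, not anything from the original comments:

    # build_pin.py - check out main as of 21 days ago, then apply build-time cherry-picks.
    # Hypothetical sketch; the repo layout and cherry-picks.txt format are assumptions.
    import subprocess, pathlib

    def git(*args):
        return subprocess.run(["git", *args], check=True,
                              capture_output=True, text=True).stdout.strip()

    git("fetch", "origin")

    # The newest commit on origin/main that is at least 21 days old.
    pin = git("rev-list", "-1", "--before=21 days ago", "origin/main")
    git("checkout", "--detach", pin)

    # Apply local fixes; once a fix lands upstream (and enters the 21-day window),
    # its line is simply deleted here and the build converges back to main.
    picks = pathlib.Path("cherry-picks.txt")
    if picks.exists():
        for sha in picks.read_text().split():
            git("cherry-pick", sha)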
Competition is good (not everyone using Linux, I mean), and I've run FreeBSD on my desktop and server for a few years.
But whenever Netflix's use of FreeBSD comes up, I never come away with a concrete understanding of one thing: is the cost/performance Netflix gets with FreeBSD really not doable on Linux?
I'm trying to understand if it's inertia, or if not, why more cloud companies with similar traffic (all the CDN companies for example) aren't also using FreeBSD somewhere in their stack.
If it were just the case that they like FreeBSD, that would be fine and I wouldn't ask more. But they mention in the slides that FreeBSD is a performance-focused OS, which raises the question.
The community is such that if either FreeBSD or Linux really outperformed the other in some capacity, they'd rally and close the gap pretty quickly. There are benchmarks that flatter each, but they tend to converge and be pretty close.
Netflix has a team of, I think it was 10 from the PDF, engineers working on this platform. A substantial number of them are FreeBSD contributors with relationships to the project. It's a very special team; that's the difference maker here. If it were a team of former Red Hat guys, I'm sure they'd be on Linux. If it were a team of old Solaris guys, I wouldn't be surprised if they were on one of the Solaris offspring. Netflix knew that to make it work at their scale, they had to build this stuff out. That was something they figured out pretty quickly: they found the right team and then built around them. It's far more sophisticated than "we replaced Red Hat Enterprise with FreeBSD and reaped immediate performance advantages." At their scale, I don't think there is an off-the-shelf solution.
I think you're close but missing an element. FreeBSD is a centrally designed and implemented OS, from kernel to libc to userland. The entire OS is in the source tree, with updated and correct documentation and a coherent design.
Any systems-level team will prefer this kind of tightly integrated solution to the systems-layer problem if they are responsible for a highly built-out, specialized distributed application like Netflix. The reasons for design choices going all the way back to FreeBSD 1 are available on the mailing list, in a central place. Everything is there.
Trying to maintain your own Linux distro is insanely difficult; in fact, Google famously ran a years-old kernel with a huge pile of custom back-ported updates for a long time.
The resources needed to match the relatively small FreeBSD design and implementation team are minuscule compared to the infinite sprawl of Linuxes; a team of 10 FreeBSD sysprogs is basically the same number of people responsible for designing and building the entire thing.
It comes down to the sources of truth. In the world of FreeBSD, that's the single FreeBSD repo and the single mailing list. For a Linux system, that could be thousands of sources of truth across hundreds of platforms.
It's hard to do apples to apples comparison, because you'd need two excellent, committed teams working on this.
I'm a FreeBSD supporter, but I'm sure you could get things to work in Linux too. I haven't seen any posts like 'yeah, we got 800 gbps out of our Linux CDN boxes too', but I don't see a lot of posts about CDN boxes at all.
As someone else downthread wrote, using FreeBSD gives a lot of control, and IMHO provides a much more stable base to work with.
Where I've worked, we didn't follow -CURRENT and tended to avoid .0 releases, but it was rare to see breakage across upgrades, and it was typically easy to track down what changed because there usually weren't a lot of changes in the suspect system. That's not really been my experience with Linux.
A small community probably helps get their changes upstreamed regularly too.
The truth is that some senior engineer at Netflix chose to use FreeBSD and they have stuck with that choice since then. FreeBSD is not better; it just happens to be the solution they chose.
All the benefits they added to FreeBSD could be added the same way in Linux if they were missing.
YouTube / Google CDN is much bigger than Netflix and runs 100% on Linux; you can make pretty much everything work on modern solutions / languages / frameworks.
> YouTube / Google CDN is much bigger than Netflix
YouTube and Netflix are on par. According to Sandvine, Netflix sneaked past YouTube in volume in 2023 [1]. I believe their 2024 report shows them still neck-and-neck.
> you can make pretty much everything work on modern solution
Presenting a false equivalence without evidence is not convincing. "You could write it in SNOBOL and deploy it on TempleOS." Netflix didn't choose something arbitrary or by mistake. They chose one of the world's highest-performing and most robustly tested infrastructure operating systems. It's the reason a FreeBSD derivative lies at the core of Juniper routers, Isilon and NetApp storage heads, and every PlayStation 3/4/5, and why it got mashed into NeXTSTEP to spawn Darwin and thence macOS, iOS, etc.
It continues to surprise me how folks in the tech sector routinely fail to notice how much BSD is deployed in the infrastructure around them.
> All the benefits they added to FreeBSD could be added the same way in Linux
They cannot. Case in point: the bisect analysis described in the presentation above doesn't exist for Linux, where userland distributions develop independently from the kernel. Netflix is touting here the value of FreeBSD's unified release, since the bisect process fundamentally relies on a single dimension of change (please ignore the mathematicians muttering about Schur's transform).
[1] https://www.sandvine.com/press-releases/sandvines-2023-globa...
* The networking stack was faster at the time
* dtrace
* async sendfile(2) https://lists.freebsd.org/pipermail/svn-src-head/2016-Januar...
Could they have contributed async sendfile(2) to Linux as well? Probably. In 2024 these advantages seem to be moot: eBPF, io_uring, more maturity in the Linux network stack, plus FreeBSD losing more and more vendor support by the day.
That's not really possible with Linux.
Might be a manpower thing. By hiring a bunch of FreeBSD Core devs, Netflix might be able to get a really talented team for cheaper than they might get a less ideologically flavored team. (I say this as I set up i3 on FreeBSD on my Thinkpad X280, I'm a big fan!)
They also get much more control. If they employ most of the core FreeBSD devs, they basically have their own OS that they can do what they like with. If they want some change that benefits their use case to the detriment of other people, they can pretty much just do it.
Are there papers out there from other companies that detail what performance levels have been achieved using Linux in a similar device to the Netflix OCA? Maybe they just use two devices that have 90% of the performance?
> Had we moved between -stable branches, bisecting 3+ years of changes could have taken weeks
Would it really? Going by their number of 4 hours per bisect step, you get 6 bisections per day, which cuts the range to 1/64th of the original. The difference between "three years of changes" and "three weeks of changes" is a factor of 50x. I.e. within one day, they'd already have identified the problematic three week range. After that, the remaining bisection takes exactly as long as this bisection did.
Even if, for some reason, they're limited to doing the bisections only during working hours in one timezone, you'd still get those six bisection steps done in just three days. It still would not add weeks.
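A quick sanity check of that arithmetic (my own sketch, using the 4-hour figure quoted from the presentation):

    import math

    step_hours = 4                # per-bisection cost quoted in the presentation
    range_ratio = (3 * 52) / 3    # ~3 years of history vs ~3 weeks, by time span

    # Extra steps needed to narrow 3 years down to a 3-week window.
    extra_steps = math.ceil(math.log2(range_ratio))   # ceil(log2(52)) = 6

    print(f"extra steps: {extra_steps}")
    print(f"round the clock: {extra_steps * step_hours / 24:.1f} days")
    print(f"one 8h timezone: {extra_steps * step_hours / 8:.1f} working days")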
Author here:
Note that the 4 hours per bisection is the time to ramp the server up and down and re-image it. It does not count the time to actually do each bisection step, because in our case of following head, the merging and conflicts were trivial for each step of the 3-week bisection. E.g., the bisections we're doing are far simpler than the ones we'd be doing if we didn't track the main branch and had a far larger patchset to re-merge at each bisection point. I'm cringing just imagining the conflicts.
Does "the server" imply you're only using a single server for this? I would have expected that at Netflix's scale it wouldn't be too difficult to do the whole bisect-and-test on a couple dozen servers in parallel.
That's a very good point, but merging in three years of changes would also have pulled in a lot of minor performance changes, and possibly some incompatible changes that would require some code updates. That would slow down each bisection step, and also make it harder to pinpoint the problem.
If you know that some small set of recent changes caused a 7% regression, you can be fairly confident there's a single cause. If 3 years of updates cause a 6% or 8% regression (say), it's not obvious that there's going to be a single cause. Even if you find a commit that looks bad, it might already have been addressed in a later commit.
Edit to clarify: you're technically correct (the best kind of correct!) but I'd still much prefer to merge 3 weeks rather than 3 years, even though their justification isn't quite right.
lol, came here to say this, armed with an identical log2 53 == 5.7 :-) The replies to your comment are, of course, spot on, though. Finding an 8% performance regression in three years of commits could have taken a looooong time.
I don't think those are reasonable work scenarios, so it's more like 2 bisects (maybe 3!) per day, rather than 6.
But also, I addressed exactly this objection in the second paragraph :)
Why would you alphabetically order initializations?
Every complex system I've ever worked on that had a large number of initializations was sensitive to their order.
Languages with module support like Wirth's Modula-2 ensure that if module A uses B, B's initialization will execute before A's. If there is no circular dependency, that order will never be wrong.
The reverse order could work too, but it's a crapshoot then. Module dependency doesn't logically entail initialization order dependency: A's initializations might not require B's initializations to have completed.
If you're initializing by explicit calls in a language that doesn't automate the dependencies, the baseline safest thing to do is to call things in a fixed order that is recorded somewhere in the code: array of function addresses, or just an init procedure that calls others directly.
If you sort the init calls, it has to be on some property linked to dependency order, otherwise don't do it. Unless you've encoded something related to dependencies into the module names, lexicographic order is not right.
In the firmware application I work on now, all modules have a statically allocated signature word that is initially zero and is set to a known pattern when the module is initialized. The external API functions all assert that the pattern has the correct value, which is strong evidence that the module has been initialized before use.
On one occasion I debugged a static array overrun which trashed these signatures, causing the affected modules to assert.
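The guard itself is just a sentinel plus an assert at every entry point; a rough, hypothetical sketch of the shape of it (the magic value is made up, and only a statically allocated word in the real C firmware gives the corruption-detection property described above):

    # Hypothetical sketch of the init-signature guard pattern described above.
    _INIT_PATTERN = 0xC0FFEE   # made-up magic value; 0 means "not initialized"
    _signature = 0             # in the real firmware: a statically allocated word

    def module_init():
        global _signature
        # ... hardware setup, buffer allocation, etc. would happen here ...
        _signature = _INIT_PATTERN

    def module_api_call():
        # Every external entry point checks the signature first, so use-before-init
        # (or corruption that trashes the word) fails loudly instead of silently.
        assert _signature == _INIT_PATTERN, "module used before init (or signature trashed)"
        # ... actual work ...

    module_init()
    module_api_call()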
Having a consistent ordering avoids differences in results from inconsistent ordering by construction. IIUC, Alpha sort was/is used as a tie breaker after declared dependencies or other ordering information.
In this case, two (or more) modules indicated they could handle the same hardware and there was no priority information if both were present. Probably this should be detected / raise a fault, but under the previous regime of alpha sort it was handled nicely, because the preferred drivers happened to sort first.
A topological sort of the dependency graph is just as consistent as any other sort, so long as you have a deterministic tie breaking mechanism for the case of multiple valid toposorts (which can just be another sort based on some unique property).
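For illustration, a minimal sketch of such a deterministic topological sort (Kahn's algorithm, draining the ready set in name order as the tie-breaker; the example modules are made up):

    from collections import defaultdict
    import heapq

    def init_order(deps):
        """deps maps module -> set of modules it depends on (which must init first)."""
        modules = set(deps) | {d for ds in deps.values() for d in ds}
        dependents = defaultdict(set)
        indegree = {m: 0 for m in modules}
        for mod, ds in deps.items():
            for d in ds:
                dependents[d].add(mod)
                indegree[mod] += 1

        # Heap keyed by name: among modules whose deps are satisfied, always pick
        # the lexicographically smallest -> the overall order is deterministic.
        ready = [m for m, n in indegree.items() if n == 0]
        heapq.heapify(ready)
        order = []
        while ready:
            m = heapq.heappop(ready)
            order.append(m)
            for dep in dependents[m]:
                indegree[dep] -= 1
                if indegree[dep] == 0:
                    heapq.heappush(ready, dep)
        if len(order) != len(modules):
            raise ValueError("circular dependency")
        return order

    # e.g. clock before uart/spi, uart before console:
    print(init_order({"console": {"uart"}, "uart": {"clock"}, "spi": {"clock"}, "clock": set()}))
    # -> ['clock', 'spi', 'uart', 'console']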
Besides not being able to do zero copy on ZFS (yet), it probably also has to do with them not using RAID for content, and single-drive ZFS doesn't make much sense in that scenario.
Single-drive ZFS brings snapshots and COW, as well as bitrot detection, but for Netflix OCAs, snapshots are not used, and it's mostly read-only content, so there is not much use for COW, and bitrot is less of a problem with media. Yes, you may get a corrupt pixel every now and then (assuming it's not caught by the codec), but the media will be reseeded every n days, so the problem will solve itself.
I assume they have ample redundancy in the rest of the CDN, so if a drive fails, they can simply redirect traffic to the next nearest mirror, and when the drive is eventually replaced, it will be seeded again by normal content seeding.
Am I missing something, or is that "yet" more like "maybe sometime, if ever"?
In addition to the lack of zero-copy sendfile from ZFS, we also have the problem that ZFS is lacking async sendfile support (e.g., the ability to continue the sendfile operation from the disk interrupt handler when the blocks arrive from disk).
> Making a diff of some kind of text dumps describing current system configuration before and after would be way faster and easier.
> Guys, how do we know our systems are built in reproducible manner at all?
Last-minute edit: My bad, “configured at startup”, not “built”. Building can already be controlled, obviously; starting up is more dynamic and opaque.
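For what the quoted suggestion might look like in practice, a minimal sketch that dumps and diffs textual configuration snapshots; the choice of sysctl -a and kldstat as the "configuration", and the file handling, are assumptions of mine:

    # config_snap.py - dump a text description of system config, or diff two dumps.
    # Usage: python config_snap.py dump > before.txt
    #        ... reboot / upgrade / reconfigure ...
    #        python config_snap.py dump > after.txt
    #        python config_snap.py diff before.txt after.txt
    import subprocess, sys, difflib

    COMMANDS = ["sysctl -a", "kldstat"]   # assumed proxies for "system configuration"

    def dump():
        for cmd in COMMANDS:
            print(f"### {cmd}")
            out = subprocess.run(cmd.split(), capture_output=True, text=True)
            print(out.stdout)

    def diff(before, after):
        a = open(before).read().splitlines()
        b = open(after).read().splitlines()
        sys.stdout.writelines(line + "\n" for line in
                              difflib.unified_diff(a, b, before, after, lineterm=""))

    if __name__ == "__main__":
        if sys.argv[1] == "dump":
            dump()
        else:
            diff(sys.argv[2], sys.argv[3])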