Early on (15+ years ago) I spent a few weeks there on contract and I noticed they used Java EVERYWHERE, and not always well. They had a CS app named after a key Star Wars character that was in all likelihood a breach of the Geneva Convention. A code atrocity with the performance of a sloth on its 8th bong rip and a UX from hell.
If it helps that CS app was rewritten about 10 years ago (when I worked there, but not on that app) in part due to the complaints you mention. It's totally true that most resources were spent on customer facing apps. Internal apps were definitely not of the same quality, because they didn't need to be.
Good to hear. The fact that in 2005 you had an app that required seemingly petabytes of memory to operate, and was put on machines barely powerful enough to play Minesweeper, was in and of itself a series of bad decisions... but the app itself, and its layout, were just maddening. It's like MC Escher was the UX lead.
For some reason whenever I'm on my work's VPN, Apple Music lets me play 1 album and then the next time I try to start a song it will tell me I'm not logged in and I'll have to force quit and relaunch (frequently a few times) before it will let me play another album.
Apple Music is the only app that has this problem.
When I was there a decade ago, it started becoming more polyglot friendly (node apps had to use a jvm sidecar to do internal communications originally!)
My team wrote some of the Python libraries for internal services just so we could avoid that Java sidecar! It took 10 times longer to boot the sidecar than the Python app.
Another reminder that acronyms are pretty terrible for communication. Every time I onboard with a new org there's a whole new set of acronyms to learn, and using them is barely faster than typing out the unabbreviated version. Nice to save a couple of seconds when the cost is a bunch of people unable to follow along when people are communicating.
To be clear: not ragging on OP in particular at all but more at the widespread practice at a company level.
Interesting the article jumps straight from REST to GraphQL and forgets Falcor[0] - Netflix's alternative vision for federated services. For a while it looked like it might be a contender to GraphQL but it never really seemed to take off despite being simpler to adopt.
Falcor is actually part of the "old" architecture described in the talk. Because it's mostly unknown and no longer used I didn't go into the details of it.
Falcor was developed at the time Facebook was developing GraphQL in-house. It has similar concepts, but never took off the way GraphQL did.
I was at the React Rally conference where Falcor was publicly announced in August of 2015. I recall that Facebook gave a GraphQL presentation right before.
It seems GraphQL was first announced publicly in February 2015.
I think this means a 20% improvement in CPU utilization: earlier the app was memory-bound and/or GC was consuming the CPU, and now the app has 20% more CPU available, so it should be doing correspondingly more work. This could definitely have been written more clearly.
> Bakker provided a retrospective of their JDK 17 upgrade that provided performance benefits, especially since they were running JDK 8 as recently as this year. Netflix observed a 20% increase of CPU usage
Seems like it's exactly that, OP cropped out the relevant bit where they list it having an overall performance benefit for that extra CPU time. Otherwise it could be assumed that it just hogs more CPU to get the same result.
I haven't dealt with this side of Java in a while, but it reflects my experience poking at Java 8 performance. At some (surprisingly early) point you'd hit a performance wall due to saturating the memory bus.
A new GC could alleviate this by either going easier on the memory itself, or by doing allocations in a way that achieves better locality of reference.
Most modern GCs trade off CPU usage and latency. Less latency means the CPU has to do more work, e.g. on a separate thread, to figure out what can be garbage collected. JDK 8 had at best an early version of the G1 collector (and not as the default, I think), so they would probably have been using one of the now deprecated garbage collectors that collect less often but have a more open-ended stop-the-world phase. It used to be that this required careful tuning and could get out of hand and start taking seconds.
The new ZGC uses more CPU but it provides some hard guarantees that it won't block for more than a certain amount of milliseconds. And it supports much larger heap sizes. More CPU sounds worse than it is because you wouldn't want to run your application servers anywhere near 100% CPU typically anyway. So, there is a bit of wiggle room. Also, if your garbage collector is struggling, it's probably because you are nearly running out of memory. So, more memory is the solution in that case.
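If you want to see which collector a service is actually running with and how much it is collecting, here is a small Java sketch using the standard management beans (the launch flags in the comment are the usual HotSpot options, nothing Netflix-specific):

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    public class GcInfo {
        public static void main(String[] args) {
            // Launch with e.g. -XX:+UseG1GC or -XX:+UseZGC to compare collectors;
            // -Xlog:gc adds per-pause logging on JDK 9+.
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                System.out.printf("%s: %d collections, %d ms total%n",
                        gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            }
        }
    }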
The figure is about the overall improvement; I'm not sure why it reads as an "increase".
On JDK 8 we were already using G1 for our modern application stack, and we saw a reduction in CPU utilisation with the upgrade, with few exceptions (saw what I believe is our first regression today: a busy wait in ForkJoinPool with parallel streams; fixed in 19 and later, it seems).
G1 has seen the greatest improvement from 8 to 17 compared to its counterparts, and you also see reduced allocation rates due to compact strings (20-30%), so that reduces GC total time.
It's a virtuous cycle for the GRPC services doing the heavy lifting: reduced pauses means reduced tail latencies, fewer server cancellations and client hedging and retries. So improvements to application throughput reduce RPS, and further reduce required capacity over and above the CPU utilisation reduction due to efficiency improvements.
JDK 21 is a much more modest improvement when upgrading from 17, perhaps 3%. Virtual threads are incredibly impressive work, and despite having an already highly asynchronous/non-blocking stack, we expect to see many benefits. Generational ZGC is fantastic, but losing compressed oops (it requires 64-bit pointers) is about a 20% memory penalty. Haven't yet done a head-to-head with GenShen. We already have some JDK 21 in production, including a very large DGS service.
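To make the virtual-threads point concrete, a minimal JDK 21 sketch (a generic illustration, not code from any of our services): each blocking task gets its own virtual thread, so plain blocking code scales without hand-rolled async plumbing.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class VirtualThreadsDemo {
        public static void main(String[] args) {
            // JDK 21: each task gets a cheap virtual thread instead of a pooled platform thread.
            try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
                for (int i = 0; i < 10_000; i++) {
                    int id = i;
                    executor.submit(() -> {
                        Thread.sleep(100); // blocking parks the virtual thread, not an OS thread
                        return id;
                    });
                }
            } // close() waits for the submitted tasks to finish
        }
    }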
A somewhat common problem is to be limited by the throughput of CPU heavy tasks while the OS reports lower than expected CPU usage. A lot of companies/teams just kind of handwave it away as "hyperthreading is weird", and allocate more machines. Actual causes might be poor cache usage causing programs to wait on data to be loaded from memory, which depending on the CPU metrics you use, may not show as CPU busy time.
For companies at much smaller scale than Netflix, where employee time is relatively more costly than computer time, this might even be the right decision. So you might end up with 20 servers at 50% usage, when 10 servers would take twice as long yet still appear to be at 50% usage.
If the bottlenecks and overhead are reduced such that it's able to make more full use of the CPU, you might be able to reduce to e.g. 15 machines at 75% CPU usage. Consequently the increased CPU usage represents more efficient use of resources.
>> while the OS reports lower than expected CPU usage
>> which depending on the CPU metrics you use, may not show as CPU busy time
If your userspace process is waiting on memory (be that cache, or RAM) then you’ll show as CPU busy when you look in top or whatever - even though if you look under the covers such as via perf counters, you’ll see a lack of instructions executed.
The CPU is busy in this case and the OS won't context switch to another task; your stalled process will be treated as running by the OS. At the hardware thread level it will hopefully use the opportunity to run another thread thanks to hyperthreading, but at the OS level your process will show as userspace CPU bound. You'll have to look at perf counters to see what's actually happening.
>> you might end up with 20 servers at 50% usage, but using 10 servers will take twice as long but still appear to be at 50% usage.
Queueing theory is fascinating; the latency change when dropping to half the servers may not be just a doubling. It depends on the arrival rate and processing time, but the results can be wild, like 10x worse.
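As a rough illustration (an M/M/1 simplification with made-up numbers, not anyone's real service), pushing a single server from 50% to 90% utilisation multiplies the mean time in system by 5x, not 1.8x:

    public class MM1Latency {
        // Mean time in system for an M/M/1 queue: W = 1 / (mu - lambda), valid for lambda < mu.
        static double meanTimeInSystem(double arrivalRate, double serviceRate) {
            return 1.0 / (serviceRate - arrivalRate);
        }

        public static void main(String[] args) {
            double mu = 100.0; // a hypothetical server handling 100 requests/sec
            System.out.println("50% load: " + meanTimeInSystem(50, mu) * 1000 + " ms"); // 20 ms
            System.out.println("90% load: " + meanTimeInSystem(90, mu) * 1000 + " ms"); // 100 ms
        }
    }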
When you put it like that, yes. Hardware is cheap and all that. In practice I think that an organization that doesn't understand the software it is developing has a people problem. And people problems generally can't be solved with hardware.
If somebody knows how to make that insight actionable, let me know. No, hiring new people is not the answer; in all likelihood that swaps one hard problem for an even harder one.
> Help me here, why do GC improvements cause CPU increase?
In Java 8 (afaik) the default garbage collector was not concurrent, so garbage collection would happen in a stop-the-world manner: all work gets put on hold, garbage collection happens, then the work can resume.
If you have a better GC, you have shorter and less frequent needs to do a stop the world pause.
Hence the code can run on the CPU for more of the time, getting you higher CPU usage.
Higher cpu usage is often actually good in situations like this: it means you're getting more work done with the same cpu/memory configuration.
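One way to sanity-check that on a running JVM (a sketch, not anything from the talk): compare cumulative GC time with uptime; the smaller the fraction, the more CPU time went to your actual code.

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    public class GcOverhead {
        public static void main(String[] args) {
            long gcMillis = 0;
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                gcMillis += gc.getCollectionTime(); // cumulative time spent in this collector
            }
            long uptimeMillis = ManagementFactory.getRuntimeMXBean().getUptime();
            // Rough share of wall-clock time spent collecting instead of running application code.
            System.out.printf("GC overhead: %.2f%%%n", 100.0 * gcMillis / uptimeMillis);
        }
    }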
I haven’t seen the specific profiling data, but it’s possible that the garbage collector is running a collection thread, concurrently with regular processing threads, and thereby preventing entire world synchronization points which would idle processor cores.
Higher CPU usage paradoxically means better performance. When I last did ops, we used to watch the total CPU usage of all services, and if it was not 100% we started looking for a bottleneck to fix.
No, it's like improving a form to minimize the need for follow-up questions to the customer, and now seeing your workers (the same you had before) processing 20% more forms instead of waiting for responses.
Most of the postings for backend positions at Netflix I've seen call out nodejs. Can I assume they do both? Is one legacy and the other newer stuff, or are they more complementary?
Things are certainly more of a blend now than what's presented in this presentation, but the presenter is a big Java platform guy here. I would say ~70% of the services I interact with on a day to day basis are Java, another 20% in Node, and then the last 10% is a hodgepodge of Python, Go, and more esoteric stuff.
It varies from team to team; the "Studio" organization that supports creating Netflix content does lots of nodeJS due to the perception that it's faster to iterate on a UI and API together if they're both in the same language. On my team, we're very close to 50/50 due to managing a bunch of backend, business process type systems (Java), and a very complex UI (with a NodeJS backing service to provide a graphql query layer). Regardless, the tooling is really quite good, so interacting with a Node service is roughly identical to interacting with a Java service is roughly identical to interacting with anything else. We lean into code generation for clients pretty heavily, so graphQL is a good fit, but gRPC and Swagger are still used pretty frequently.
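For a flavour of the Java side, a DGS datafetcher is roughly the shape below. This mirrors the framework's public getting-started example; the Show type and titleFilter argument are illustrative, and a matching GraphQL schema file is assumed.

    import com.netflix.graphql.dgs.DgsComponent;
    import com.netflix.graphql.dgs.DgsQuery;
    import com.netflix.graphql.dgs.InputArgument;
    import java.util.List;

    @DgsComponent
    public class ShowsDataFetcher {

        // Stand-in for a schema-generated type; the fields are made up for illustration.
        public record Show(String title, int releaseYear) {}

        // Resolves the `shows` query field defined in the schema.
        @DgsQuery
        public List<Show> shows(@InputArgument String titleFilter) {
            List<Show> all = List.of(new Show("Stranger Things", 2016), new Show("Ozark", 2017));
            if (titleFilter == null) {
                return all;
            }
            return all.stream().filter(s -> s.title().contains(titleFilter)).toList();
        }
    }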
No, just no. Performance and debugging are just plain horrible. The Spring team loves to force you into their automagic shit, and this bean stuff is so annoying. You get almost no compile-time safety in this stack. It's the bane of my existence. I'd like to know that a compiled program will run; that seems virtually impossible with Java/Spring Boot.
I'm not sure what "no compile time safety in this stack" even means in the context of a strongly-typed compiled language.
If you are referring to the dependency injection container making use of reflection, then Spring Native graduated from experimental add-on to part of the core framework some years ago. You can now opt for Quarkus/Micronaut-style static build-time dependency injection, and even AOT compilation to Go-style native executables, if you're willing to trade off the flexibility that comes with avoiding reflection. For example, not being able to use any of the "@ConditionalOnXXX" annotations to make your DI more dynamic.
(Personally, I don't believe that those trade-offs are worth it in most cases. And I believe that all the Spring magic in the universe doesn't amount to 10% of what Python brings to the table in a minimal Django/Flask/FastAPI microservice. But the option is there if your use case truly calls for it.)
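For anyone who hasn't seen the conditional annotations mentioned above, this is the kind of runtime-flexible wiring that gets fixed at build time once you opt into AOT and native images; the bean types here are hypothetical placeholders.

    import org.springframework.boot.autoconfigure.condition.ConditionalOnMissingBean;
    import org.springframework.boot.autoconfigure.condition.ConditionalOnProperty;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;

    @Configuration
    public class CacheConfig {

        // Hypothetical types, defined inline to keep the sketch self-contained.
        interface CacheClient {}
        static class RedisCacheClient implements CacheClient {}
        static class InMemoryCacheClient implements CacheClient {}

        // Only registered when cache.provider=redis is set in the environment.
        @Bean
        @ConditionalOnProperty(name = "cache.provider", havingValue = "redis")
        CacheClient redisCache() {
            return new RedisCacheClient();
        }

        // Fallback when no other CacheClient bean exists.
        @Bean
        @ConditionalOnMissingBean(CacheClient.class)
        CacheClient inMemoryCache() {
            return new InMemoryCacheClient();
        }
    }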
Honestly, I've never run into anyone who considers Spring to be "the bane of their existence" where the real issue wasn't simply that the bulk of their experience was in something else: they were thrown into someone else's project and resent working with decisions made by other people, but don't want to either dig in and learn the tech or search for a new job where they get to make the choices on a greenfield project.
Spring is basically a standard in itself, and it is easier to hire people for it. It also normalizes large pieces of the backend application, so even though they are written by different people they end up similar.
Once you learn the annotation based configuration it also saves a lot of time.
The performance complaint is valid, but it will only keep improving.
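For reference, the annotation-based model means a working HTTP endpoint is roughly this much code (a generic sketch with an illustrative /health route, assuming the spring-boot-starter-web dependency):

    import org.springframework.boot.SpringApplication;
    import org.springframework.boot.autoconfigure.SpringBootApplication;
    import org.springframework.web.bind.annotation.GetMapping;
    import org.springframework.web.bind.annotation.RestController;

    @SpringBootApplication
    @RestController
    public class DemoApplication {

        public static void main(String[] args) {
            SpringApplication.run(DemoApplication.class, args);
        }

        // Component scanning, the embedded server and JSON serialization all come
        // from the annotations and starter; no XML or servlet wiring needed.
        @GetMapping("/health")
        public String health() {
            return "ok";
        }
    }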
It's funny to see this perspective! I used to work at a few local companies that had adopted the early Java EE style for their applications, and my experience is exactly the opposite. With Spring I'm usually diagnosing issues at the application layer (i.e. business issues, not framework issues), while in the Java EE applications I was often having to fix issues down in the custom persistence layer each company had built, etc. I see where you're coming from having looked at the "old" (non-Boot) Spring stack, and I can see people getting mad over the configuration hell and how stuff is hidden behind XML... much like Java EE!
I completely agree with this. Spring was an absolute nightmare during the short period of time where I had the misfortune of using it. It also didn't help that the codebase was a monstrosity... classes following no design patterns and having 40k lines. But still...
That has not been my experience on the inside - I spend most of my days working on a Spring Boot based service at Netflix and frankly it's one of the most effortless environments I've ever worked in. Granted, there's a lot of ecosystem support from the rest of the company, but things are very low effort, and generally very predictable. I can usually drop a breakpoint in a debugger in exactly the right spot and find a problem immediately.
The issue with the Spring ecosystem is that people use it without knowing why or which problem it solves, just because almost everyone else is using it. And most of the time they don't need Spring (maybe a company like Netflix did, but it didn't prove to be the right choice in the end).
It's not quite as good as compile-time or type-based guarantees, but IME configuration errors with Spring are almost always flagged up immediately on application startup. As long as you have at least one test that initializes the application context (i.e. @SpringBootTest), then this should be caught easily.
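For anyone unfamiliar, that test is usually just this (class name illustrative); if a bean is missing or a property can't be bound, the context fails to start and the test fails:

    import org.junit.jupiter.api.Test;
    import org.springframework.boot.test.context.SpringBootTest;

    @SpringBootTest
    class ApplicationContextTest {

        // Starting the full application context catches most wiring and configuration
        // errors at build time rather than at deploy time.
        @Test
        void contextLoads() {
        }
    }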
Performance and debugging are simple, and compile-time safety is Java's core domain. I think you're over-focusing on proxying or enhancement of beans, but if you look at the documentation for a reasonable amount of time there's really nothing to it.
100% agree, Java and Spring are a mess and there's no justifiable reason to use them in 2023 (and no, "that's what we've always used" isn't a good justification)
Like srsly even DropWizard is better than Spring lol, let alone other even simpler frameworks like Ktor which is built on a much improved language over Java
What do you propose as an alternative? Something like Micronaut trades more compile time for stricter checks and faster runtime. Do you use something like that?
Spring is a safe and reliable choice I'd say; not the most exciting, but neither code nor frameworks should be exciting, they're used to solve a problem, they shouldn't become the problem itself.
GraphQL is interesting to me. I thought the clients were pretty similar across all platforms, meaning their API usage should also be similar enough not to need the flexible nature of GraphQL. But then, it allows for a lot more flexibility and decoupling: if a client needs an extra field, the API contract does not need to be updated, and not all clients need to be updated at once. Not all clients will be updated either; they will need to support 5-10+ year old clients that haven't updated yet for whichever reason.
Well, if the field is not available then new backend code will need to be written: resolvers, integrations, etc. But it does allow UIs to take less info over the wire, and either fewer joins need to be done or fewer performance-oriented APIs need building, as you say.
The stack is tremendously productive, but history has taught me a few things when dealing with Spring:
1. It's always best to start people off with plain old Spring, even with an XML context, so that they understand the concepts at play behind higher-level abstractions like Boot; there's a rough sketch of this after point 3. Hell, I even start with a servlet and singletons to elucidate the shortcomings of rolling your own.
2. Don't fall prey to hype around new projects in the Spring ecosystem, such as their OAuth2 implementation, since they often become abandonware. It's always best to take a wait-and-see approach.
3. Spring Security is/was terrible to read, understand, and extend ;)
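Re point 1, a rough sketch of the "plain old Spring with an XML context" starting point (GreetingService and beans.xml are made up for illustration; only spring-context on the classpath is assumed):

    import org.springframework.context.support.ClassPathXmlApplicationContext;

    public class PlainSpringDemo {

        // Hypothetical service type for illustration.
        public static class GreetingService {
            public String greet(String name) { return "Hello, " + name; }
        }

        public static void main(String[] args) {
            /* beans.xml on the classpath would contain something like:
               <beans xmlns="http://www.springframework.org/schema/beans"
                      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                      xsi:schemaLocation="http://www.springframework.org/schema/beans
                          http://www.springframework.org/schema/beans/spring-beans.xsd">
                   <bean id="greetingService" class="PlainSpringDemo$GreetingService"/>
               </beans>
            */
            ClassPathXmlApplicationContext ctx = new ClassPathXmlApplicationContext("beans.xml");
            GreetingService service = ctx.getBean("greetingService", GreetingService.class);
            System.out.println(service.greet("world"));
            ctx.close();
        }
    }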
Ha ha, Spring Security is tricky and has a high chance of surprising someone while "boot"strapping a new project. But once it's done, it stays out of the way.
I did not like the XML much, because it always seemed like a lot of duplication: all you're doing is copying bean definitions and changing the bean id and class/interface most of the time. But it became a non-issue over time. Now Spring Boot has made it really easy with all those annotations.
I am a big fan of Spring Boot; it's one of the few frameworks that just works and lets me focus 100% on solving business problems. I've tried Micronaut, Quarkus and Dropwizard, but they slow me down too much compared to just using Spring Boot.
For me delivering business value is the most important metric when I am comparing frameworks. Spring Boot wins every time.
Sounds like Apple Music.
I just want to know where I can buy a bong ripping sloth [1] and whether they're legal in California.
[1] https://imgur.com/a/S3NVS16
Also, that description made me lmao, thanks
[0] https://netflix.github.io/falcor/
https://netflixtechblog.com/migrating-netflix-to-graphql-saf...
Help me here, why do GC improvements cause CPU increase?
I always appreciate numbers and the differentiation between relative and absolute numbers in this case.
"We doubled our workforce in one week!" - CEO's first hire... ;)
I guess it depends on if they mean "we used 20% more CPU for the same output", or "we could utilize the CPUs 20% more".
Anyone on the inside know?
Things may have changed in the last 5 years, though.
I agree it is not common to do it; most teams follow the autoconfiguration madness.
It was exciting when J2EE was dominating.