Early on (15+ years ago) I spent a few weeks there on contract and I noticed they used Java EVERYWHERE, and not always well. They had a CS app named after a key Star Wars character that was in all likelihood a breach of the Geneva Convention. A code atrocity with the performance of a sloth on its 8th bong rip and a UX from hell.
If it helps that CS app was rewritten about 10 years ago (when I worked there, but not on that app) in part due to the complaints you mention. It's totally true that most resources were spent on customer facing apps. Internal apps were definitely not of the same quality, because they didn't need to be.
Good to hear. The fact that in 2005 you had an app that required seemingly petabytes of memory to operate, and was put on machines barely powerful enough to play Minesweeper, was in and of itself a series of bad decisions... but the app itself, and its layout, were just maddening. It's like MC Escher was the UX lead.
For some reason whenever I'm on my work's VPN, Apple Music lets me play 1 album and then the next time I try to start a song it will tell me I'm not logged in and I'll have to force quit and relaunch (frequently a few times) before it will let me play another album.
Apple Music is the only app that has this problem.
When I was there a decade ago, it started becoming more polyglot friendly (node apps had to use a jvm sidecar to do internal communications originally!)
My team wrote some of the Python libraries for internal services just so we could avoid that Java sidecar! It took 10 times longer to boot the sidecar than the Python app.
Another reminder that acronyms are pretty terrible for communication. Every time I onboard with a new org there's a whole new set of acronyms to learn, and using them is barely faster than typing out the unabbreviated version. Nice to save a couple of seconds when the cost is a bunch of people unable to follow along when people are communicating.
To be clear: not ragging on OP in particular at all but more at the widespread practice at a company level.
Interesting the article jumps straight from REST to GraphQL and forgets Falcor[0] - Netflix's alternative vision for federated services. For a while it looked like it might be a contender to GraphQL but it never really seemed to take off despite being simpler to adopt.
Falcor is actually part of the "old" architecture described in the talk. Because it's mostly unknown and no longer used I didn't go into the details of it.
Falcor was developed at the time Facebook was developing GraphQL in-house. It has similar concepts, but never took off the way GraphQL did.
I was at the React Rally conference where Falcor was publicly announced in August of 2015. I recall that Facebook gave a GraphQL presentation right before.
It seems GraphQL was first announced publicly in February 2015.
I think this means a 20% improvement in CPU utilization: earlier the app was memory-bound and/or GC was consuming the CPU, and now the app has 20% more CPU available, so it should be doing correspondingly more work. This could definitely have been written more clearly.
> Bakker provided a retrospective of their JDK 17 upgrade that provided performance benefits, especially since they were running JDK 8 as recently as this year. Netflix observed a 20% increase of CPU usage
Seems like it's exactly that, OP cropped out the relevant bit where they list it having an overall performance benefit for that extra CPU time. Otherwise it could be assumed that it just hogs more CPU to get the same result.
I haven't dealt with this side of Java in a while, but it reflects my experience poking at Java 8 performance. At some (surprisingly early) point you'd hit a performance wall due to saturating the memory bus.
A new GC could alleviate this by either going easier on the memory itself, or by doing allocations in a way that achieves better locality of reference.
Most modern GCs trade off CPU usage and latency. Less latency means the CPU has to do more work, e.g. on a separate thread, to figure out what can be garbage collected. JDK 8 had at best an early version of the G1 collector (and not as the default, I think), so they would probably have been using one of the now deprecated garbage collectors that collect less often but have a more open-ended stop-the-world phase. It used to be that this required careful tuning and could get out of hand and start taking seconds.
The new ZGC uses more CPU but it provides some hard guarantees that it won't block for more than a certain amount of milliseconds. And it supports much larger heap sizes. More CPU sounds worse than it is because you wouldn't want to run your application servers anywhere near 100% CPU typically anyway. So, there is a bit of wiggle room. Also, if your garbage collector is struggling, it's probably because you are nearly running out of memory. So, more memory is the solution in that case.
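If you want to see which collector a service is actually running with and how much it is collecting, here is a small Java sketch using the standard management beans (the launch flags in the comment are the usual HotSpot options, nothing Netflix-specific):

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    public class GcInfo {
        public static void main(String[] args) {
            // Launch with e.g. -XX:+UseG1GC or -XX:+UseZGC to compare collectors;
            // -Xlog:gc adds per-pause logging on JDK 9+.
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                System.out.printf("%s: %d collections, %d ms total%n",
                        gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            }
        }
    }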
The figure is about the overall improvement; I'm not sure why it reads as an "increase".
On JDK 8 we were already using G1 for our modern application stack, and we saw a reduction in CPU utilisation with the upgrade, with few exceptions (saw what I believe is our first regression today: a busy wait in ForkJoinPool with parallel streams; fixed in 19 and later, it seems).
G1 has seen the greatest improvement from 8 to 17 compared to its counterparts, and you also see reduced allocation rates due to compact strings (20-30%), so that reduces GC total time.
It's a virtuous cycle for the GRPC services doing the heavy lifting: reduced pauses means reduced tail latencies, fewer server cancellations and client hedging and retries. So improvements to application throughput reduce RPS, and further reduce required capacity over and above the CPU utilisation reduction due to efficiency improvements.
JDK 21 is a much more modest improvement when upgrading from 17, perhaps 3%. Virtual threads are incredibly impressive work, and despite having an already highly asynchronous/non-blocking stack, we expect to see many benefits. Generational ZGC is fantastic, but losing compressed oops (it requires 64-bit pointers) is about a 20% memory penalty. Haven't yet done a head-to-head with GenShen. We already have some JDK 21 in production, including a very large DGS service.
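To make the virtual-threads point concrete, a minimal JDK 21 sketch (a generic illustration, not code from any of our services): each blocking task gets its own virtual thread, so plain blocking code scales without hand-rolled async plumbing.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class VirtualThreadsDemo {
        public static void main(String[] args) {
            // JDK 21: each task gets a cheap virtual thread instead of a pooled platform thread.
            try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
                for (int i = 0; i < 10_000; i++) {
                    int id = i;
                    executor.submit(() -> {
                        Thread.sleep(100); // blocking parks the virtual thread, not an OS thread
                        return id;
                    });
                }
            } // close() waits for the submitted tasks to finish
        }
    }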
A somewhat common problem is to be limited by the throughput of CPU heavy tasks while the OS reports lower than expected CPU usage. A lot of companies/teams just kind of handwave it away as "hyperthreading is weird", and allocate more machines. Actual causes might be poor cache usage causing programs to wait on data to be loaded from memory, which depending on the CPU metrics you use, may not show as CPU busy time.
For companies at much smaller scale than Netflix, where employee time is relatively more costly than computer time, this might even be the right decision. So you might end up with 20 servers at 50% usage, when 10 servers would take twice as long yet still appear to be at 50% usage.
If the bottlenecks and overhead are reduced such that it's able to make more full use of the CPU, you might be able to reduce to e.g. 15 machines at 75% CPU usage. Consequently the increased CPU usage represents more efficient use of resources.
>> while the OS reports lower than expected CPU usage
>> which depending on the CPU metrics you use, may not show as CPU busy time
If your userspace process is waiting on memory (be that cache, or RAM) then you’ll show as CPU busy when you look in top or whatever - even though if you look under the covers such as via perf counters, you’ll see a lack of instructions executed.
The CPU is busy in this case and the OS won't context switch to another task; your stalled process will be treated as running by the OS. At the hardware thread level it will hopefully use the opportunity to run another thread thanks to hyperthreading, but at the OS level your process will show as userspace CPU bound. You'll have to look at perf counters to see what's actually happening.
>> you might end up with 20 servers at 50% usage, but using 10 servers will take twice as long but still appear to be at 50% usage.
Queueing theory is fascinating; the latency change when dropping to half the servers may not be just a doubling. It depends on the arrival rate and processing time, but the results can be wild, like 10x worse.
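As a rough illustration (an M/M/1 simplification with made-up numbers, not anyone's real service), pushing a single server from 50% to 90% utilisation multiplies the mean time in system by 5x, not 1.8x:

    public class MM1Latency {
        // Mean time in system for an M/M/1 queue: W = 1 / (mu - lambda), valid for lambda < mu.
        static double meanTimeInSystem(double arrivalRate, double serviceRate) {
            return 1.0 / (serviceRate - arrivalRate);
        }

        public static void main(String[] args) {
            double mu = 100.0; // a hypothetical server handling 100 requests/sec
            System.out.println("50% load: " + meanTimeInSystem(50, mu) * 1000 + " ms"); // 20 ms
            System.out.println("90% load: " + meanTimeInSystem(90, mu) * 1000 + " ms"); // 100 ms
        }
    }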
When you put it like that, yes. Hardware is cheap and all that. In practice I think that an organization that doesn't understand the software it is developing has a people problem. And people problems generally can't be solved with hardware.
If somebody knows how to make that insight actionable, let me know. No, hiring new people is not the answer; in all likelihood that swaps one hard problem for an even harder one.
> Help me here, why do GC improvements cause CPU increase?
In Java 8 (afaik) the default garbage collector was not concurrent, so garbage collection would happen in a stop-the-world manner: all work gets put on hold, garbage collection happens, then the work can resume.
If you have a better GC, you have shorter and less frequent needs to do a stop the world pause.
Hence the code can run on the CPU for more of the time, getting you higher CPU usage.
Higher cpu usage is often actually good in situations like this: it means you're getting more work done with the same cpu/memory configuration.
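One way to sanity-check that on a running JVM (a sketch, not anything from the talk): compare cumulative GC time with uptime; the smaller the fraction, the more CPU time went to your actual code.

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    public class GcOverhead {
        public static void main(String[] args) {
            long gcMillis = 0;
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                gcMillis += gc.getCollectionTime(); // cumulative time spent in this collector
            }
            long uptimeMillis = ManagementFactory.getRuntimeMXBean().getUptime();
            // Rough share of wall-clock time spent collecting instead of running application code.
            System.out.printf("GC overhead: %.2f%%%n", 100.0 * gcMillis / uptimeMillis);
        }
    }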
I haven’t seen the specific profiling data, but it’s possible that the garbage collector is running a collection thread, concurrently with regular processing threads, and thereby preventing entire world synchronization points which would idle processor cores.
Higher CPU usage paradoxically means better performance. When I last did ops, we used to watch the total CPU usage of all services, and if it was not 100% we started looking for a bottleneck to fix.
No, it's like improving a form to minimize the need for follow-up questions to the customer, and now seeing your workers (the same you had before) processing 20% more forms instead of waiting for responses.
Most of the postings for backend positions at Netflix I've seen call out nodejs. Can I assume they do both? Is one legacy and the other newer stuff, or are they more complementary?
Things are certainly more of a blend now than what's presented in this presentation, but the presenter is a big Java platform guy here. I would say ~70% of the services I interact with on a day to day basis are Java, another 20% in Node, and then the last 10% is a hodgepodge of Python, Go, and more esoteric stuff.
It varies from team to team; the "Studio" organization that supports creating Netflix content does lots of nodeJS due to the perception that it's faster to iterate on a UI and API together if they're both in the same language. On my team, we're very close to 50/50 due to managing a bunch of backend, business process type systems (Java), and a very complex UI (with a NodeJS backing service to provide a graphql query layer). Regardless, the tooling is really quite good, so interacting with a Node service is roughly identical to interacting with a Java service is roughly identical to interacting with anything else. We lean into code generation for clients pretty heavily, so graphQL is a good fit, but gRPC and Swagger are still used pretty frequently.
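For a flavour of the Java side, a DGS datafetcher is roughly the shape below. This mirrors the framework's public getting-started example; the Show type and titleFilter argument are illustrative, and a matching GraphQL schema file is assumed.

    import com.netflix.graphql.dgs.DgsComponent;
    import com.netflix.graphql.dgs.DgsQuery;
    import com.netflix.graphql.dgs.InputArgument;
    import java.util.List;

    @DgsComponent
    public class ShowsDataFetcher {

        // Stand-in for a schema-generated type; the fields are made up for illustration.
        public record Show(String title, int releaseYear) {}

        // Resolves the `shows` query field defined in the schema.
        @DgsQuery
        public List<Show> shows(@InputArgument String titleFilter) {
            List<Show> all = List.of(new Show("Stranger Things", 2016), new Show("Ozark", 2017));
            if (titleFilter == null) {
                return all;
            }
            return all.stream().filter(s -> s.title().contains(titleFilter)).toList();
        }
    }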
No, just no. Performance and debugging are just plain horrible. The Spring team loves to force you into their automagic shit, and this bean stuff is so annoying. You get almost no compile-time safety in this stack. It's the bane of my existence. I'd like to know that a compiled program will run; that seems virtually impossible with Java/Spring Boot.
I'm not sure what "no compile time safety in this stack" even means in the context of a strongly-typed compiled language.
If you are referring to the dependency injection container making use of reflection, then Spring Native graduated from experimental add-on to part of the core framework some years ago. You can now opt for Quarkus/Micronaut-style static build-time dependency injection, and even AOT compilation to Go-style native executables, if you're willing to trade off the flexibility that comes with avoiding reflection. For example, not being able to use any of the "@ConditionalOnXXX" annotations to make your DI more dynamic.
(Personally, I don't believe that those trade-offs are worth it in most cases. And I believe that all the Spring magic in the universe doesn't amount to 10% of what Python brings to the table in a minimal Django/Flask/FastAPI microservice. But the option is there if your use case truly calls for it.)
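For anyone who hasn't seen the conditional annotations mentioned above, this is the kind of runtime-flexible wiring that gets fixed at build time once you opt into AOT and native images; the bean types here are hypothetical placeholders.

    import org.springframework.boot.autoconfigure.condition.ConditionalOnMissingBean;
    import org.springframework.boot.autoconfigure.condition.ConditionalOnProperty;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;

    @Configuration
    public class CacheConfig {

        // Hypothetical types, defined inline to keep the sketch self-contained.
        interface CacheClient {}
        static class RedisCacheClient implements CacheClient {}
        static class InMemoryCacheClient implements CacheClient {}

        // Only registered when cache.provider=redis is set in the environment.
        @Bean
        @ConditionalOnProperty(name = "cache.provider", havingValue = "redis")
        CacheClient redisCache() {
            return new RedisCacheClient();
        }

        // Fallback when no other CacheClient bean exists.
        @Bean
        @ConditionalOnMissingBean(CacheClient.class)
        CacheClient inMemoryCache() {
            return new InMemoryCacheClient();
        }
    }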
Honestly, I've never run into anyone who considers Spring to be "the bane of their existence" where the real issue wasn't simply that the bulk of their experience was in something else: they were thrown into someone else's project and resent working with decisions made by other people, but don't want to either dig in and learn the tech or search for a new job where they get to make the choices on a greenfield project.
Spring is basically a standard in itself, and it is easier to hire people for it. It also normalizes large pieces of the backend application, so even though they are written by different people they end up similar.
Once you learn the annotation based configuration it also saves a lot of time.
The performance complaint is valid, but it will only keep improving.
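For reference, the annotation-based model means a working HTTP endpoint is roughly this much code (a generic sketch with an illustrative /health route, assuming the spring-boot-starter-web dependency):

    import org.springframework.boot.SpringApplication;
    import org.springframework.boot.autoconfigure.SpringBootApplication;
    import org.springframework.web.bind.annotation.GetMapping;
    import org.springframework.web.bind.annotation.RestController;

    @SpringBootApplication
    @RestController
    public class DemoApplication {

        public static void main(String[] args) {
            SpringApplication.run(DemoApplication.class, args);
        }

        // Component scanning, the embedded server and JSON serialization all come
        // from the annotations and starter; no XML or servlet wiring needed.
        @GetMapping("/health")
        public String health() {
            return "ok";
        }
    }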
It's funny to see this perspective! I used to work at a few local companies that had adopted the early Java EE style for their applications, and my experience is exactly the opposite. With Spring I'm usually diagnosing issues at the application layer (i.e. business issues, not framework issues), while in the Java EE applications I was often having to fix issues down in the custom persistence layer each company had built, etc. I see where you're coming from having looked at the "old" (non-Boot) Spring stack, and I can see people getting mad over the configuration hell and how stuff is hidden behind XML... much like Java EE!
I completely agree with this. Spring was an absolute nightmare during the short period of time where I had the misfortune of using it. It also didn't help that the codebase was a monstrosity... classes following no design patterns and having 40k lines. But still...
That has not been my experience on the inside - I spend most of my days working on a Spring Boot based service at Netflix and frankly it's one of the most effortless environments I've ever worked in. Granted, there's a lot of ecosystem support from the rest of the company, but things are very low effort, and generally very predictable. I can usually drop a breakpoint in a debugger in exactly the right spot and find a problem immediately.
The issue with the Spring ecosystem is that people use it without knowing why or which problem it solves, just because almost everyone else is using it. And most of the time they don't need Spring (maybe a company like Netflix did, but it didn't prove to be the right choice in the end).
It's not quite as good as compile-time or type-based guarantees, but IME configuration errors with Spring are almost always flagged up immediately on application startup. As long as you have at least one test that initializes the application context (i.e. @SpringBootTest), then this should be caught easily.
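For anyone unfamiliar, that test is usually just this (class name illustrative); if a bean is missing or a property can't be bound, the context fails to start and the test fails:

    import org.junit.jupiter.api.Test;
    import org.springframework.boot.test.context.SpringBootTest;

    @SpringBootTest
    class ApplicationContextTest {

        // Starting the full application context catches most wiring and configuration
        // errors at build time rather than at deploy time.
        @Test
        void contextLoads() {
        }
    }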
Performance and debugging are simple, and compile-time safety is Java's core domain. I think you're over-focusing on proxying or enhancement of beans, but if you look at the documentation for a reasonable amount of time there's really nothing to it.
100% agree, Java and Spring are a mess and there's no justifiable reason to use them in 2023 (and no, "that's what we've always used" isn't a good justification)
Like srsly even DropWizard is better than Spring lol, let alone other even simpler frameworks like Ktor which is built on a much improved language over Java
What do you propose as an alternative? Something like Micronaut trades more compile time for stricter checks and faster runtime. Do you use something like that?
Spring is a safe and reliable choice I'd say; not the most exciting, but neither code nor frameworks should be exciting, they're used to solve a problem, they shouldn't become the problem itself.
GraphQL is interesting to me. I thought the clients were pretty similar across all platforms, meaning their API usage should also be similar enough not to need the flexible nature of GraphQL. But then, it allows for a lot more flexibility and decoupling: if a client needs an extra field, the API contract does not need to be updated, and not all clients need to be updated at once. Not all clients will be updated either; they will need to support 5-10+ year old clients that haven't updated yet for whichever reason.
Well, if the field is not available then new backend code will need to be written: resolvers, integrations, etc. But it does allow UIs to take less info over the wire, and either fewer joins need to be done or fewer performance-oriented APIs need building, as you say.
The stack is tremendously productive, but history has taught me a few things when dealing with Spring:
1. It's always best to start people off with plain old Spring, even with an XML context, so that they understand the concepts at play behind higher-level abstractions like Boot; there's a rough sketch of this after point 3. Hell, I even start with a servlet and singletons to elucidate the shortcomings of rolling your own.
2. Don't fall prey to hype around new projects in the Spring ecosystem, such as their OAuth2 implementation, since they often become abandonware. It's always best to take a wait-and-see approach.
3. Spring Security is/was terrible to read, understand, and extend ;)
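Re point 1, a rough sketch of the "plain old Spring with an XML context" starting point (GreetingService and beans.xml are made up for illustration; only spring-context on the classpath is assumed):

    import org.springframework.context.support.ClassPathXmlApplicationContext;

    public class PlainSpringDemo {

        // Hypothetical service type for illustration.
        public static class GreetingService {
            public String greet(String name) { return "Hello, " + name; }
        }

        public static void main(String[] args) {
            /* beans.xml on the classpath would contain something like:
               <beans xmlns="http://www.springframework.org/schema/beans"
                      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                      xsi:schemaLocation="http://www.springframework.org/schema/beans
                          http://www.springframework.org/schema/beans/spring-beans.xsd">
                   <bean id="greetingService" class="PlainSpringDemo$GreetingService"/>
               </beans>
            */
            ClassPathXmlApplicationContext ctx = new ClassPathXmlApplicationContext("beans.xml");
            GreetingService service = ctx.getBean("greetingService", GreetingService.class);
            System.out.println(service.greet("world"));
            ctx.close();
        }
    }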
Ha ha, Spring Security is tricky and has a high chance of surprising someone while "boot"strapping a new project. But once it's done, it stays out of the way.
I did not like the XML much, because it always seemed like a lot of duplication: all you're doing is copying bean definitions and changing the bean id and class/interface most of the time. But it became a non-issue over time. Now Spring Boot has made it really easy with all those annotations.
I am a big fan of Spring Boot; it's one of the few frameworks that just works and lets me focus 100% on solving business problems. I've tried Micronaut, Quarkus and Dropwizard, but they slow me down too much compared to just using Spring Boot.
For me delivering business value is the most important metric when I am comparing frameworks. Spring Boot wins every time.
Sounds like Apple Music.
I just want to know where I can buy a bong ripping sloth [1] and whether they're legal in California.
[1] https://imgur.com/a/S3NVS16
Also, that description made me lmao, thanks
[0] https://netflix.github.io/falcor/
https://netflixtechblog.com/migrating-netflix-to-graphql-saf...
Help me here, why do GC improvements cause CPU increase?
I always appreciate numbers and the differentiation between relative and absolute numbers in this case.
"We doubled our workforce in one week!" - CEO's first hire... ;)
I guess it depends on if they mean "we used 20% more CPU for the same output", or "we could utilize the CPUs 20% more".
Anyone on the inside know?
Things may have changed in the last 5 years, though.
I agree it is not common to do it; most teams follow the autoconfiguration madness.
It was exciting when J2EE was dominating.