The author raises interesting points, but none of them are novel. The general tone here seems to be along the lines of: there are just so many moving variables that no fair comparison can be made. But I judge such positions against a simple heuristic: "But, is it actionable?" Pragmatism demands we at least take a glance at languages, frameworks, and platforms from the performance perspective.
Here is a practical example: There may be some issues with the TechEmpower web framework benchmarks, but -- across a battery of tests -- one may clearly discern that there are NO python or ruby frameworks among the top 75% of web frameworks by performance, yet there are plenty of go, Java, and PHP frameworks there. c++, js, and rust dominate the top 20%, and the differences in requests-per-second among them, though less statistically significant, dwarf the total rps numbers for the python or ruby frameworks. If you (or your company) are paying cloud fees for hosting your services (including serverless), you are absolutely throwing money away, perhaps for a "good" reason ('we have a huge rails codebase'), if you ignore this information.
> That’s assuming CPU bound workloads. C++ spends just as long as Ruby waiting on a database query.
Some of the TechEmpower benchmarks have a significant database component as well. Obviously if you have database queries taking 0.5s or more then it's a different story, but in my experience the server language/framework still has a large impact on total performance even with the database in the equation.
In my experience, memory footprint is also a huge variable. For example, I rewrote a certain gRPC service in Rust and cut the memory usage by almost 80% relative to the initial implementation in a higher level language. We weren't bottlenecked on requests/second performance but being able to scale down our memory usage had a huge impact on cloud costs.
Something to observe from those benchmarks is that "fast" languages are still fast in those benchmarks. The server isn't waiting on I/O 100% of the time - it still has to do serialization and parsing and all other CPU-bound tasks, and those still benefit from a "faster" language.
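A back-of-the-envelope measurement makes the point. A minimal Python sketch (payload shape and iteration counts are made up) of the serialization cost that a "mostly I/O-bound" handler still pays on every request:

    import json
    import timeit

    # A hypothetical response payload; every request pays this CPU cost
    # even if the handler spends most of its wall-clock time waiting on I/O.
    payload = {"items": [{"id": i, "name": f"item-{i}"} for i in range(1000)]}

    seconds = timeit.timeit(lambda: json.dumps(payload), number=1000)
    print(f"{seconds:.3f}s for 1000 serializations")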
This is a good point, and there are ways to take advantage of this. The c++ frameworks that beat rust advertise in their repos that they are using a forked PostgreSQL DB driver which has been re-written to be asynchronous. There's a js framework named just-js which tops the current (nightly) DB Update benchmark on TechEmpower and I suspect they are doing the same.
Additionally, Java, python, and ruby frameworks tend to gobble RAM (I work in a large corp with many deployed Java web services). This requires instances to autoscale based on mem usage or to use instances with more mem. Rust and c++ frameworks often utilize a fraction of the memory for the same service (yes, we've tested this). So we can use smaller and/or fewer tasks/instances in our autoscaling pools.
You can hide latency with sufficient parallelism. Since read requests are trivially parallelizable, as long as your database queries are mostly reads the database latency is probably not the limiting factor for throughput.
(Even if your C++ program processes too many requests for one replica to handle, it's not hard to round-robin to multiple database replicas.)
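As a rough illustration of what "sufficient parallelism" can look like, here is a minimal asyncio sketch, assuming asyncpg and two hypothetical read replicas (connection strings and the query are made up):

    import asyncio
    import asyncpg  # assumed available: pip install asyncpg

    # Hypothetical read replicas; reads are round-robined across them.
    REPLICAS = ["postgres://app@replica-1/db", "postgres://app@replica-2/db"]

    async def fetch_user(pool, user_id):
        async with pool.acquire() as conn:
            return await conn.fetchrow("SELECT * FROM users WHERE id = $1", user_id)

    async def main():
        pools = [await asyncpg.create_pool(dsn) for dsn in REPLICAS]
        # 100 reads in flight at once; database latency overlaps instead of adding up.
        tasks = [fetch_user(pools[i % len(pools)], i) for i in range(100)]
        rows = await asyncio.gather(*tasks)
        print(len(rows), "rows fetched")

    asyncio.run(main())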
Filtered down to the Fortunes test with Full Stack, ORM Only, MySQL/Postgres DB, and Realistic Approach, the difference is still 50x. That is already lower than the 100 to 150x you see with the raw SQL approach.
And if we just look at .NET and ASP.NET, those entries were benchmarked before all the optimisations that came in the latest .NET and ASP.NET versions.
> If you (or your company) are paying cloud fees for hosting your services (including serverless), you are absolutely throwing money away, perhaps for a "good" reason ('we have a huge rails codebase'), if you ignore this information.
It’s worth pointing out that in some cases developer productivity can swamp server cost. It might be reasonable to pay the server bill necessary to develop in Ruby rather than C++, for example.
Obviously the relationship between performance and developer productivity isn’t strictly linear.
I recently wanted to build a serverless application, but as far as I can tell, no serverless platform takes code in C or C++, so I was stuck working in a language I didn't know well.
There may be an opportunity for someone to build a serverless platform that uses verified functions in C/C++/Rust, so you can get both performance and convenience.
> TechEmpower web framework benchmarks, but -- across a battery of tests -- one may clearly discern that there are NO python or ruby frameworks among the top 75% of web frameworks by performance, yet there are plenty of go, Java, and PHP frameworks there.
Have you seen the source code for the Django app there compared to one of the top performers? The Django app is just the simple code any developer would write, while Rust's top performers are full of hacks, like checking the N-th letter of the HTTP method, that nobody would use in daily life.
While what you say is true, I don’t see how this invalidates the thesis of the parent comment. Two examples do not invalidate the general trend there.
If, however, all implementations are flawed like that, then we have a different problem, right?
But yes, super-optimized cases should be removed or marked as such.
This is a common claim -- but no one has yet provided any evidence to back it up. So I got curious and took a look at the TechEmpower/FrameworkBenchmarks repo on GitHub. For the top rust frameworks (actix and ntex), the submissions are surprisingly short and easily comprehensible. Those two, at least, don't pull anything particularly tricky other than perhaps caching the initial portion of a successful HTTP response string (they all do that).
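For what it's worth, that trick amounts to something like the following (a hypothetical Python sketch, not code lifted from the actix or ntex submissions):

    # Precompute the static prefix of a successful response once; only the
    # parts that change per request get formatted at runtime.
    OK_PREFIX = b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n"

    def render(body: bytes) -> bytes:
        return (OK_PREFIX
                + b"Content-Length: " + str(len(body)).encode()
                + b"\r\n\r\n" + body)

    print(render(b"Hello, World!"))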
By the same token, there has not been any evidence I have seen to support the claim that the scripting language frameworks have not also done their best with their submissions.
> If you (or your company) are paying cloud fees for hosting your services (including serverless), you are absolutely throwing money away
You're almost certainly not. You almost certainly are nowhere near the RPS limit. You almost certainly are bottlenecked on your own code far more than your framework/language. You're almost certainly running at a performance level of "just barely not bad enough to spend time improving" and would almost certainly be doing the same in any other language. You're probably using a slower framework than the fastest python or ruby framework, so the fact that faster frameworks exist in your language is irrelevant. And if you are in the (rare) case where you're handling enough requests for this choice to matter, then you're big enough to benchmark some proper prototypes that measure the kind of thing your application actually needs to do.
Almost anything is a better use of your time than looking at the benchmarks. And if you are going to look at the benchmarks, you shouldn't look at the RPS total; what you should do is first estimate how many requests your app needs to handle and then apply that as a yes/no filter to the frameworks you're considering. "It can handle 1000x as many requests as we need instead of 100x" is not a good reason to pick a framework, much less a language.
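To make that yes/no filter concrete, it is just back-of-the-envelope arithmetic; a sketch with entirely made-up traffic numbers:

    # Hypothetical traffic estimate; all numbers are illustrative.
    daily_requests = 5_000_000
    avg_rps = daily_requests / 86_400    # ~58 requests/second on average
    peak_factor = 10                     # assume peaks are 10x the average
    headroom = 3                         # safety margin
    required_rps = avg_rps * peak_factor * headroom  # ~1,700 rps

    # Any framework whose benchmarked throughput comfortably exceeds
    # required_rps passes the filter; differences beyond that don't matter.
    print(round(avg_rps), round(required_rps))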
I wish this were true -- but it isn't. Partly because Java and scripting frameworks can easily get memory bound instead of CPU bound. I've worked for several large enterprises running externally accessible services in the cloud (and this is true of my current employer). Of course, all of our services have at least 2 "tasks" or instances to help with fault tolerance, yet most of them do, in fact, auto-scale up to more instances during a day based on either RAM or CPU load. (for the curious, apps which read from message queues tend to get CPU bound while web services in Java or a scripting language tend to get memory bound) If you must run 3 instances instead of 2, that is 50% more instance costs for that time period, etc.
Moving a service from Java 11 to rust brought the memory load down enough that we didn't need to autoscale at all. Some other service might still autoscale, but to fewer max instances. And this adds up, multiplied across days and services.
This raises an interesting question to me. When you look at high-performance Python libraries, they achieve the performance by just offloading the critical paths to compiled C modules. I remember reading some comparison of aiohttp to other Python frameworks a while back where the author expected aiohttp to be faster since it doesn't need to block on the network, but it wasn't, largely because of how slow it is to parse http requests in Python. If the bottleneck in serving a ton of requests per second is that Python is bad at serializing and deserializing http headers, why has no one just written a package that offloads that part of the library to a compiled C module? There is seemingly nothing preventing maintaining the convenience of Python for implementing your high-level business logic, but moving the bottlenecks to C. That has been exactly what programmers in scientific computing fields have been doing for 30 years and why projects like NumPy exist in the first place. So why do Python web programmers not just do the same thing? Heck, aiohttp already does this for the event loop, which just uses libuv the same as Node does. Both teams recognize that is a critical path component and it would be really slow to implement it in either Python or Javascript.
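For what it's worth, packages along those lines do exist; httptools, for example, wraps a C HTTP parser (the one from the Node ecosystem, if I recall correctly) and is what uvicorn uses for this hot path. A rough sketch of the idea, assuming httptools is installed (the request bytes are made up):

    import httptools  # assumed available: pip install httptools

    class RequestCollector:
        """Callbacks the C parser invokes as it walks the raw request."""
        def __init__(self):
            self.url = None
            self.headers = {}
            self.complete = False

        def on_url(self, url: bytes):
            self.url = url

        def on_header(self, name: bytes, value: bytes):
            self.headers[name] = value

        def on_message_complete(self):
            self.complete = True

    collector = RequestCollector()
    parser = httptools.HttpRequestParser(collector)
    parser.feed_data(b"GET /items?id=42 HTTP/1.1\r\nHost: example.com\r\n\r\n")

    # Parsing happened in C; only the business logic stays in Python.
    print(parser.get_method(), collector.url, collector.headers)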
I tend to think of these things in terms of some limit approaching reality, whatever that is, as the benchmarks continue. If you have an open competition, and there's enough editing and revision by enough actors over time, it gives you some idea of how fast something is likely to be in practice. That open scrutiny and process of suggested revision by experts in a language is a rough sampling of what is likely to be encountered in practice, with a bunch of experts chipping away at the problem. If a difference in times exists, that's about as good as you're likely to get, with further time differences decreasing in size and probability as the whole process continues.
I'd even probably argue that even if you have a bunch of naive implementations, it still tells you about what's likely to be encountered in practice, because those naive implementations are often not so naive, and many programmers are more naive than you would like to be the case.
Finally, I think in a lot of cases, it's clearer how to do a certain benchmark in a certain language than the essay assumes. Sure, a complex one probably might require some unusual level of expertise. But many benchmarks are fairly small and circumscribed, for better or worse, and consequently pretty tractable.
> Or, in other words, the performance difference comes somewhere else than the choice of programming language.
We're comparing the code generated by a compiler or VM/JIT of a specific programming language, not the programming language itself. I'm amazed that this is even up for discussion in the article. It is clear that even with the same language there can be differences in implementation. Here is e.g. a measurement series I did to compare different Smalltalk implementations (LjSOM is mine): http://software.rochus-keller.ch/are-we-fast-yet_crystal_lua.... Smalltalk is a tough nut to crack because nearly all statements are closures; the Open Smalltalk VM team invested two decades to achieve decent performance, but it's still quite a bit behind e.g. V8 or a statically typed AOT-compiled language.
The Are-we-fast-yet suite (https://stefan-marr.de/papers/dls-marr-et-al-cross-language-...) gives more credible results than CLBG (referenced in the article) from my point of view, because of the stricter rules the implementations have to adhere to (in favour of fair comparisons) and the larger benchmark apps. In return, the implementation takes significantly more work, but it is doable and worth it; here is one of my implementations that helped me significantly improve the compiler (still in progress): https://github.com/rochus-keller/Oberon/tree/master/testcase... This also demonstrates that - given the same platform (LuaJIT bytecode) - Oberon+ was implemented with much less effort and with a much more performant result than Smalltalk. And the comparison is "most likely not wrong".
What the author talks about is the theoretical performance of a programming language and I honestly believe that sample programs are the wrong experiment for that.
My hypothesis is that every programming language can approach optimal performance with enough effort. So the interesting question would be to measure the effort. As a proxy, one could maybe measure the time and experience-level of the programmers.
If I am really interested in the theoretical performance of a language, I'd use microbenchmarks for standard library functions. How much time does it take to sort an array? Reverse a list? Split a string? Etc.
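In Python, for instance, that kind of microbenchmark is a few lines with timeit (sizes and iteration counts are arbitrary):

    import timeit

    data = list(range(10_000))

    # How long do common standard-library operations take?
    print(timeit.timeit(lambda: sorted(data), number=1_000), "sort")
    print(timeit.timeit(lambda: list(reversed(data)), number=1_000), "reverse")
    print(timeit.timeit(lambda: "a,b,c,d,e".split(","), number=100_000), "split")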
Then there is of course the performance of the compiler or garbage collector. To compare this between different languages is the most difficult, I think. The best approach here is probably to take some real-world program (say a web server or a database), translate it nearly 1:1 and then compare the results. That's a tremendous effort for a single data point, though (albeit it would yield a nice collection of re-implementations, that could only help software quality).
> If I am really interested in the theoretical performance of a language, I'd use microbenchmarks for standard library functions. How much time does it take to sort an array? Reverse a list? Split a string? Etc.
The standard library may not be implemented in the language you're evaluating. Especially for scripting languages, and things like regular expressions and bignums.
So, I know Ruby is not faster than python, and Java is not faster than C++, and so on.
But I've been programming for... 15ish years now, professionally for almost 7, and I've yet to be unable to make a program go "fast enough" regardless of the programming language or runtime it's in. And I've had to do some honestly kinda stupid stuff like doing a bunch of CPU-intensive work in python.
And I don't think I'm like, any kind of whiz. I don't know all the deep C/C++/jvm compiler magicks (I honestly find that sort of thing kind of tiresome). I've just found that some combination of multiple processes (like via xargs -P), multiprocessing (the python module, or the equivalent in a different language, or threading in a language with real threads and no interpreter lock), and sometimes shelling out to an OS utility (written in C, usually) for a bit of hard work tends to topple any problem, and usually with less damage to the code than a ton of deep magicks would inflict and with less damage to the business function than if it were written in the more-performant, less-dev-productive language in the first place.
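A hedged sketch of what I mean, in Python: spread the work across processes and shell out to a C utility for the hard part (file names and the grep usage are purely illustrative):

    import subprocess
    from multiprocessing import Pool

    def count_errors(path: str) -> int:
        # Shell out to grep (a C utility) for the heavy scan.
        result = subprocess.run(["grep", "-c", "ERROR", path],
                                capture_output=True, text=True)
        return int(result.stdout or 0)

    if __name__ == "__main__":
        files = ["app1.log", "app2.log", "app3.log", "app4.log"]  # hypothetical inputs
        with Pool(processes=4) as pool:  # multiple processes sidestep the interpreter lock
            counts = pool.map(count_errors, files)
        print(dict(zip(files, counts)))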
I feel like I need to make a disclaimer since I know several friends who would feel MUCH chagrin at my disparagement of the deep C++ compiler magicks. It's cool, and I have immense respect for their practitioners, but I think my personal interest in programming stems from interest in logic (like the formal stuff you'd learn in a philosophy class) and how comp sci intersects that with being able to build really cool shit (in a product sense). So, you'll have to understand me when I say that it's only in extremely specific scenarios (like HFT, where I used to work, which is how I know all these chagrin'd people in the first place...) that these things tend to make a meaningful difference in the product outcome.
> and I've yet to be unable to make a program go "fast enough" regardless of the programming language or runtime it's in.
That's because you won't even attempt to do things that can't run that fast, because you know from experience that they won't work.
Let's take an example unrelated to programming languages (but easier to relate maybe?). Git branches. Git branches are fast to create and delete, so how do they get used? People create them all the time, for a million reasons. Remember SVN, or god forbid CVS, or even SourceSafe? You wouldn't create a branch for shits and giggles for those, because it was slow. You only did it when you had a good reason. Fast branches changed the way we even thought about how to use source control systems.
So let's go back to programming languages. Let's take JavaScript for an easy target. I've built real time image processing that happens on mouse over at 60 frames per second even on old android devices. But when all we had was older versions of internet explorer? I wouldn't even attempt it.
The same holds true for any environment, any kind of databases (regular transactional vs column stores) and yeah, C++ vs Python. People are rebuilding much of the JS toolchain in rust, purely for performance reasons. Because it's never fast enough. On the server, you could save money by using cheaper AWS instances if it was faster. And so on and so forth.
I like to compare program efficiency to fuel-efficient engines. Most people could get along just fine with a 1950s clunker that gets 10mpg. They might even tell everyone they don't want to upgrade and lose the ability to fix it with nothing more than duct tape and a wrench. There are still clear advantages to all the extra complexity and cost in an efficient engine though.
I don’t know, both of your examples seem to be about data structures/algorithms used. I don’t think any of them would be significantly more performant given a reimplementation of the same thing in a different language.
It's not necessarily that a developer won't attempt it; not all specialties of computing require it. I am in my field because it interests me, not because I can't or won't try other things.
Sounds like you’re saying that you’ve been able to make any language fast enough where performance is not critical, but that you realize that when it is (you mention HFT, but there is also HPC for science and engineering, a vast area), the slow languages won’t cut it. I can’t disagree, but your comment also seems like a bit of a tautology.
I think GP is implying a stronger point than you're making it out to be: that the situations where performance is critical occur "only in extremely specific scenarios", backed by the claim that they've managed to avoid such scenarios for 7-15 years.
tangent: one of the more satisfying patches of my programming career was improving system stability and roughly preserving performance while removing the use of parallel processing
the code grew like this:
1. some python tool is written to grab a bunch of data from a SQL DB through an ORM and do some kind of graph analysis on it. the code is annoyingly slow.
2. the problem is embarrassingly parallel (the tool performs analysis in terms of domain-specific paths, which can be processed independently for each region of the graph), so one of the devs chunks each problem into sub-problems, distributes the work to a cluster, then aggregates the results. it's not lightning fast but a good speedup, good enough for now
3. the tool executes work in parallel but is still regarded by the business and the dev team as kind of slow. later on, another dev has time to do some profiling, sees the SQL queries generated are fairly daft and are failing to exploit ways to efficiently search the data, figures out some different queries. the tool is now running pretty fast.
4. there's an unrelated priority to uplift security. TLS is applied to the job queuing system used by the tool to execute work in parallel. the formerly-slow embarrassingly-parallel code is sending rather large volumes of intermediate data through the cluster, and alas, for technical reasons, when TLS is enabled the additional encryption overhead causes the queuing system to fail sometimes (it is so busy doing crypto on large messages that it delays sending heartbeats, which causes its clustered friends to infer a network partition, which causes the cluster to kick into panic-network-partition-recovery mode, which periodically breaks production and requires manual intervention to recover)
5. what's the fix? remember that we fixed the queries in step 3, meaning the original partial speedup from processing all the work in parallel is no longer required, so we delete all the parallel-chunking code to invert step 2, and the resulting system is fast enough when run sequentially, without using the queue to aggregate large intermediate results
I feel obligated to be That Guy who points out that for some programs, JIT performance makes this surprisingly untrue, although it sounds like you might be the kind of dev who can beat the JIT:)
> I've yet to be unable to make a program go "fast enough" regardless of the programming language or runtime it's in.
On "fast enough": Performance matters to latency and throughput but also resource consumption and thus cost.
Ok maybe you haven't worked on anything where measurements like latency and throughput matter.
But surely you've worked on something where cost matters?
Whether it's paying for additional hardware or VMs or increased memory requirements or more runtime minutes for "serverless" hosting, slower code always costs more to run.
If traffic is super low maybe it's the difference between $0.10 and $1.00, so nobody cares. But grow that company and suddenly it's $10K vs. $100K, and the CFO is paying attention.
A couple startups ago I redesigned a system so it was able to handle ~50x the user load per server (no special magic, just JVM tuning, DB tuning, removing some of the worst-offender slow code). The ops team had been in the process of ramping up server count and instead they ended up removing 30+% of the servers assigned to it.
From a customer perspective it wasn't faster (it actually was faster, but not in a way any customer would've noticed) but higher performance code sure saved us a lot of money.
> and I've yet to be unable to make a program go "fast enough"
ok but in certain problem domains it's sort of difficult to figure out what "fast enough" means. For example, if you have a site and you spin up another server whenever there are problems, the code is running fast enough because you just need to spin up another server. On the other hand, if the code could be improved, you wouldn't need to handle the problem with extra servers in the first place...
Please, correct me if I'm wrong: when I invoke "time" on a Unix system (eg. 'time ./a.out'), is the reported execution time calculated from start to finish, or is it calculated from the sum of CPU cycles? Perhaps a silly question, but I'm curious if *nix developers had the foresight to play with this at any point.
Yes, "real" is the time from start to finish and for a multi-threaded program "user" and "sys" will be the sums of the corresponding times for all threads, so normally "user" will be many times greater than "real" in that case.
> I feel obligated to be That Guy who points out that for some programs, JIT performance makes this surprisingly untrue, although it sounds like you might be the kind of dev who can beat the JIT :)
https://stackoverflow.com/questions/4516778/when-is-java-fas...
But then you allocate C++ objects on the stack, and that's a couple of orders of magnitude faster.
It really comes down to whether you can use the stack or you have to use the heap.
On "fast enough": Performance matters to latency and throughput but also resource consumption and thus cost.
Ok maybe you haven't worked on anything where measurements like latency and throughput matter.
But surely you've worked on something where cost matters?
Whether it's paying for additional hardware or VMs or increased memory requirements or more runtime minutes for "serverless" hosting, slower code always costs more to run.
If traffic is super low maybe it's the difference between $0.10 and $1.00 so nobody cares. But grow that company and suddenly it's $10K vs. $100K, the CFO is now paying attention.
A couple startups ago I redesigned a system so it was able to handle ~50x the user load per server (no special magic, just JVM tuning, DB tuning, removing some of the worst-offender slow code). The ops team had been in the process of ramping up server count and instead they ended up removing 30+% of the servers assigned to it.
From a customer perspective it wasn't faster (it actually was faster, but not in a way any customer would've noticed) but higher performance code sure saved us a lot of money.
ok but in certain problem domains it's sort of difficult to figure out what fast enough means, for example if you have a site and when there are problems you spin up another server - the code is running fast enough because you just need to spin up another server on the other hand if the code could be improved so that you didn't need to handle the problem with using extra servers...
% time sleep 1
sleep 1 0.00s user 0.01s system 0% cpu 1.011 total
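The same distinction is easy to see from inside a program, too; a small Python sketch (the workload is arbitrary):

    import time

    wall0, cpu0 = time.perf_counter(), time.process_time()

    time.sleep(1)                     # waiting counts toward "real" but not "user"/"sys"
    sum(i * i for i in range(10**6))  # CPU work counts toward both

    wall1, cpu1 = time.perf_counter(), time.process_time()
    print(f"wall ('real'-like): {wall1 - wall0:.2f}s")  # roughly 1.1s
    print(f"cpu  ('user'-like): {cpu1 - cpu0:.2f}s")    # roughly 0.1s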