MathMonkeyMan · 5 months ago
Chandler Carruth told a similar story in one of his talks.

He met Ken Thompson and saw beautiful C code for the first time after running into a performance problem in a service. The service had to choose a policy to enforce (or something) based on the incoming request, and it was taking too long to match the criteria of each potential policy against the request.

Ken wrote a finite automata based pattern matcher that would simultaneously advance through all of the policies' predicates. It was perfect, and it was much faster than the existing code.

Then somebody noticed that 99.9% of requests matched a particular policy, so they changed the existing code to just check that predicate first, and the code sped up a zillion times, much more than with Ken's solution.

snvzz · 5 months ago
My take is that code is disposable, and often needs to be written to figure out what is really needed, then disposed of.

Learning this is worth writing the otherwise useless code.

BigJ1211 · 5 months ago
This is why I strongly adhere to WET¹: you discover so much about the domain the first time you write something. By the time you have it 'working' you realize you should've written vastly different code to solve or cut off edge cases.

That's why these days I just start writing code, get it to 'it works!' level. To then throw it all away and start again. Takes a little more effort, but the payoff is far less problematic code and tech debt.

¹: Write Everything Twice

Deleted Comment

5pl1n73r · 5 months ago
My reaction to the article title was recalling how old assembly teachers would drill into our heads that we needed proof of a 200% gain before implementing an optimization. However, the article shows how sometimes such planning doesn't work.
Cthulhu_ · 5 months ago
Or one should spend a little more time up front to fully understand the problem and, in both the mentioned case and the OP's, use realistic data.
ddalex · 5 months ago
I came to the same conclusion months ago with the advent of coding LLMs.... I used to treat code as precious, carefully saving, cataloguing and referencing git commits and SQL scripts and command line histories, because they were painful to write and thus valuable. I mean, they were, because when I had the same need again, I could find previous instances and reuse stuff.

Now? I just dump the problem into an LLM and it spits out a script that solves it, and then I throw away the script because it's horribly buggy for any case other than my particular problem. But hey, I just solved my problem.

tasn · 5 months ago
This is such a great anecdote, thanks for sharing!

Somewhat relatedly, I still remember the first time I heard about profile-guided optimization, which is essentially the same but for all of your code at once (well, same idea; I'm not sure it's aggressive enough to reach the same result as the anecdote you shared).

scottlamb · 5 months ago
> Somewhat relatedly, I still remember the first time I heard about profile-guided optimization, which is essentially the same but for all of your code at once (well, same idea; I'm not sure it's aggressive enough to reach the same result as the anecdote you shared).

Exactly: profile-guided optimization is pretty awesome if you have the right infrastructure. You can get maybe 15% speedup on your program without going deep into all the code. But it has limits. IIUC, it gathers information about which code is hot, including which branches are taken, and enables certain compiler optimizations based on that. [1] But it's not going to make huge algorithm changes, it's not going to change data structure definitions [2], and it's not going to prepare good microbenchmark input for you to compare your new algorithm against the current one.

Also, this article is about Java, and profile-guided optimization is something Java just has baked into its JIT already.

[1] I don't know exactly which, and they likely vary by compiler version. But I think a variety of stuff to better use cache: breaking code apart into likely-"hot" vs likely-"cold" pages, deciding which loops are worth aligning, and where inlining is worthwhile. Also where branches should be optimized for likely vs unlikely vs unpredictable (conditional instructions), and when it's worth autovectorizing. But it's not as good at SIMD as a human who puts two weeks into it, and SIMD usage is constrained by the inability to change data structures.

[2] In theory, with Rust's default #[repr(Rust)] it could reorder a struct's members, but it's not going to, say, change from array-of-structs to struct-of-arrays format or from hashmap to btree.

RedNifre · 5 months ago
Okay, but how about checking the 99.9% policy first and if it doesn't match, run Ken's solution?
fifilura · 5 months ago
You'd be better off with some stupid code that junior devs (or Grug brained senior devs) could easily understand for the 0.1% cases.
TrapLord_Rhodo · 5 months ago
That's what he said they did. That's the joke.

>so they changed the existing code to just check that predicate first, and the code sped up a zillion times, much more than with Ken's solution.

But you've now spent all this time developing some beautiful solution for 0.1% of the actual data flow. Not the best use of dev time, essentially.

lispitillo · 5 months ago
If a case doesn't match, then speeding up the remaining 0.1% is not going to change the overall speed much. Hence, a non-optimized algorithm may be enough. But when speed and speed variance are critical, optimization is a must.
hinkley · 5 months ago
The devil is in the details. A pair of my coworkers spent almost two years working on our page generation to split it into a static part and a dynamic part. Similar to Amazon’s landing pages. Static skeleton, but lots of lazy-loaded blocks. When they finally finished it, the work was mediocre for two reasons. One, they ended up splitting one request into several (but usually two), and they showed average response times dropping 20% but neglected to show the overall request rate also went up by more than 10%. They changed the denominator. Rookie mistake.

Meanwhile, one of my favorite coworkers smelled a caching opportunity in one of our libraries that we use absolutely everywhere, and I discovered a whole can of worms around it. The call was too expensive, which was most of the reason a cache seemed desirable. My team had been practicing Libraries Over Frameworks, and we had been chipping away for years at a NIH framework that a few of them, and a lot of people who had since left, had created in-house. At some point we lost sight of how we had sped up the “meat” of workloads at the expense of higher setup time initializing these library functions over, and over, and over again. Example: the logging code depended on the feature toggle code, which depended on a smaller module that depended on a smaller one still. Most modules depended on the logging and the feature toggle modules. Those leaf node modules were being initialized more than fifty times per request.

Three months later I had legitimately cut page load time by 20% by running down bad code patterns. 1/3 of that gain was two dumb mistakes. In one we were reinitializing a third party library once per instantiation instead of once per process.

So on the day they shipped, they claimed a 20% response-time reduction (which was less than the proposal expected to achieve), but a lot of their upside came from avoiding work I’d already made cheaper. They missed that the last of my changes went out in the same release, so their real contribution was 11%, and because of the extra requests it only resulted in a 3% reduction in cluster size, after three man-years of work. Versus 20% and 7% respectively for my three months of work spread over five.

Always pay attention to the raw numbers and not just the percentages in benchmarks, kids.

ralegh · 5 months ago
This is fine assuming the popular request types don’t change, but arguably, if both new versions of matching are sufficiently fast, I would prefer Ken's long term, as the other could become slow again if the distribution of request types changes.
sfilmeyer · 5 months ago
As a counterpoint: what fraction of the future engineers who will touch the project are likely to be able to competently edit the finite-automata-based version without introducing bugs, and what fraction will be able to competently edit the if statement that checks the particular policy?
andrepd · 5 months ago
Nonsense. The pre-check can literally be one line (if common_case {fast_path()} else {slow_path()}), and thus enabling or disabling it is dead simple and obvious if the problem changes in the future.

Lines of thinking like that are part of the reason most modern software is so sloooow :)

scott_w · 5 months ago
I think you missed the point. Ken's version wasn't removed, it was simply prepended with something like:

  if value in most_common_range:
    take_fast_path()
  else:
    use_kens_solution()

sylware · 5 months ago
I am writing assembly, and often you have many code paths/data structures which "fit". Each combination of code paths/data structures will favor some specific usage/data at the expense of the others. And this is not real "optimization", since usually those code paths have roughly "the same" cost. The bottom line: it is what you think of the semantics and real-life usage of this code that will drive those "choices".

That said, you can still have generic optimizations which will benefit nearly all (for instance quick sort).

underdeserver · 5 months ago
This is a good place to mention profile-guided optimization.

In PGO you profile your application running in the real world (via automatic instrumentation) and the compiler uses the output to make optimization decisions.

I'm not sure what exactly they use it for, but I imagine it could be used for loop unrolling, switch case ordering, branch prediction optimization, memory ordering and more.

menaerus · 5 months ago
What if your "application" is server-side code hosting a multitude of different types of workloads, e.g. databases? I was never really convinced of the PGO benefits for such open-ended and complex workloads.
n3t · 5 months ago
Does anyone have a link to the talk?
n3t · 5 months ago
Found it! https://youtu.be/nXaxk27zwlk?t=393 (there is more context before the timestamp)
MathMonkeyMan · 5 months ago
Neither I nor ChatGPT could find it again.
jasonthorsness · 5 months ago
Identifying a representative usage scenario to optimize towards and then implementing that scenario in a microbenchmark test driver are both massively difficult to get right, and a "failure" in this regard, as the author found, can be hard to detect before you sink a lot of time into it.

Even for seemingly trivial scenarios like searching an array, the contents and length of the array make a massive difference in results and how to optimize (as shown in the last section of this write-up where I tried to benchmark search algorithms correctly https://www.jasonthorsness.com/23).

I've not seen a perfect solution to this that isn't just "thinking carefully about the test setup" (profile-guided optimization or production profiles replayed for benchmarks seem like they could be an alternative, but I haven't seen that used much).

bgirard · 5 months ago
> Identifying a representative usage scenario to optimize towards and then implementing that scenario in a microbenchmark test driver are both massively difficult to get right, and a "failure" in this regard, as the author found, can be hard to detect before you sink a lot of time into it.

I can vouch for this. I've been doing web performance for nearly 15 years, and finding clear, representative problems is one of the hardest parts of my job. Once I understood this and worked on getting better at finding representative issues, it became the single thing I did that boosted my productivity the most.

jasonthorsness · 5 months ago
What have you found are the most useful approaches to collecting the representative issues in a way you can reuse and test? I haven’t worked as much in the web space.
viraptor · 5 months ago
Tracing and continuous profiling made this task significantly easier than it was in the past, fortunately.
sfink · 5 months ago
It's a subset of profile-guided optimization, but I fairly frequently will gather summary histograms from more or less representative cases and then use those to validate my assumptions. (And once in a while, write a benchmark that produces a similar distribution, but honestly it usually comes down to "I think this case is really rare" and checking whether it is indeed rare.)
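
As an illustration of that kind of histogram check, here's a minimal Java sketch (the sampling rate, bucket boundaries, and names are all hypothetical, not from any real codebase):

  import java.util.concurrent.ThreadLocalRandom;
  import java.util.concurrent.atomic.AtomicLongArray;

  // Sample a small fraction of inputs and bucket their sizes, so the
  // "this case is really rare" assumption can be checked against real
  // traffic before optimizing for it.
  final class SizeHistogram {
    static final double SAMPLE_RATE = 0.0001;          // hypothetical sampling rate
    static final int[] BUCKETS = {16, 64, 256, 1024};  // hypothetical boundaries
    static final AtomicLongArray COUNTS = new AtomicLongArray(BUCKETS.length + 1);

    static void record(int size) {
      if (ThreadLocalRandom.current().nextDouble() >= SAMPLE_RATE) return;
      int i = 0;
      while (i < BUCKETS.length && size > BUCKETS[i]) i++;  // find the bucket
      COUNTS.incrementAndGet(i);                            // last slot = "everything larger"
    }
  }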
anonymoushn · 5 months ago
I have asked for, like, the ability to apply a filter that maps real strings from real users to a very small number of character classes, so that I may run some codec on data that is equivalent to user data for the purposes of the codec. I have heard that this is not a good use of my time, and that if I really care so much I should instead make various guesses about the production data distribution of user-created strings (that need to be json-escaped), and then we can deploy each of them and keep the best, if we can even tell the difference.
jasonthorsness · 5 months ago
If the real data is sensitive, it's hard to distill test data from it that succeeds in fully removing the sensitivity but that is still useful. Depending on the domain even median string length could be sensitive.
kazinator · 5 months ago
If they were forking that JVM in order to offer it as a development tool to a broad developer base, then such an optimization might be worth keeping.

Someone out there might have an application in which they are serializing large numbers of large integers.

The case for abandoning it is not as strong, since you don't know who is doing what.

It's a guessing game in programming language run-times and compilers: "will anyone need this improvement?" You have room to invent "yes" vibes. :)

grogers · 5 months ago
If the distribution of numbers you are serializing is uniformly random, you generally wouldn't choose a variable length encoding. Sometimes the choice is made for you of course, but not usually.
necovek · 5 months ago
The problem is that this is a complex assembly implementation that needs to be maintained over the course of decades in the future. Unless you have a good reason to keep it, you should drop it.
spott · 5 months ago
Why is just capturing real values not the answer here?

Something like grab .01% of the inputs to that function for a day or something like that (maybe a lower fraction over a week or month).

Is the cost of grabbing this that high?

adrianN · 5 months ago
That works for some cases, but if you have a service that processes customer data you probably don’t want to make copies without consulting with legal.
heipei · 5 months ago
Counterpoint: I once wrote a paper on accelerating block ciphers (AES et al.) using CUDA, and while doing so I realised that most (if not all) previous academic work which had claimed incredible speedups had done so by benchmarking exclusively on zero bytes. Since these block ciphers are implemented using lookup tables, this meant perfect cache hits on every block to be encrypted. Benchmarking on random data painted a very different, and in my opinion more realistic, picture.
atiedebee · 5 months ago
Why not use real world data instead? Grab a large pdf or archive and use that as the benchmark.
bluGill · 5 months ago
A common case is to compress data before encrypting it, so random is realistic. A PDF might allow some optimization (how, I don't know) that is not representative.
jbaber · 5 months ago
Or at least the canonical test vectors. All zeros is a choice.
almostgotcaught · 5 months ago
There are an enormous number of papers like this. I wrote a paper accelerating a small CNN classifier on an FPGA, and when I compared against the previous SOTA GPU implementation, the numbers were way off from the paper. I did a git blame on their repo and found that after the paper was published they deleted the lines that short-circuited eval if the sample was all 0 (which much of their synthetic data was ¯\_(ツ)_/¯).
michaelcampbell · 5 months ago
I'm sure I'm getting this wrong, but I think I remember someone pointing out that no one could even come close to replicating Borland's Turbo Pascal "compiled lines per second" figure, until someone wrote a legitimate Pascal program that had essentially that number of lines containing only ";", or something along those lines.

It WAS still a great compiler, and way faster than the competition at the time.

Marazan · 5 months ago
Back when Genetic Algorithms were a hot topic, I discovered that a large number of papers discussing optimal parameterisation of the approach (mutation rate, cross-over, populations, etc.) were written using '1-Max' as the "problem" to be solved by the GA. (1-Max being attempting to make every bit of the candidate string a 1 rather than a 0.)

This literally made the GA encoding exactly the same as the solution and also very, very obviously favoured techniques that would MAKE ALL THE BITS 1!

imtringued · 5 months ago
This reminds me of this paper:

https://link.springer.com/article/10.1007/s10710-021-09425-5

They do the convergence analysis on the linear system Ax = 0, which means any iteration matrix (including a zero matrix) that produces shrinking numbers will converge to the obvious solution x = 0 and the genetic program just happens to produce iteration matrices with lots of zeros.

rdc12 · 5 months ago
Do you have a link to that paper?
scottlamb · 5 months ago
> For example, there was minimal use of java.lang.String because ain’t nobody got time for a second allocation of a char[] with associated indirection and GC churn.

An example of why I'd prefer to avoid Java for something like that. Java has a reputation for being slow because of JIT compilation and GC, but I also think a lot of it comes down to all non-primitive types being boxed Objects (boxed = in a separate allocation). This means it works the garbage collector so much harder, there's poorer locality, and it fills RAM/cache with extra Object stuff like mutexes that no one ever uses. (Apparently the per-object header is now 8 bytes after a lot of effort to reduce it. https://openjdk.org/jeps/450)

deepsun · 5 months ago
> it fills RAM/cache with extra Object stuff

But it has a much smaller impact than in, say, Rust, where it's really an allocation (asking the kernel for more RAM). Java's Object "allocation" happens in its own heap, which is a big chunk of RAM already pre-allocated from the kernel. So in the JVM, boxing is really cheap, often just a pointer increment. Also, oftentimes the wrapper object and its value are located near each other in the same memory page, so we're not adding a RAM access, more like an L2 access.

PS: In some cases it can be even faster than Rust or C++, because it pre-allocates larger pages and drops them in chunks within the generational GC (e.g. all values "allocated" to process one HTTP request can be GCed immediately after), while C++ is eager to destruct each object immediately. Also, a GC sweep can happen in a separate thread, not bothering the main thread the user is waiting for. One can do the same in Rust and C++ using arenas, of course.

scottlamb · 5 months ago
> But it has much lesser impact than in, say, Rust, where it's really an allocation (asking kernel for more RAM). Java's Object "allocation" happens in its own heap, which is a big chunk of RAM already pre-allocated from kernel.

What? No. Rust doesn't ask the kernel for each allocation individually. That'd be insane; besides the crushing system call overhead, the kernel only does allocations in whole pages (4 KiB on x86-64) so it'd be incredibly wasteful of RAM.

Rust does the same thing as virtually every non-GCed language. It uses a memory allocator [1] that does bookkeeping in userspace and asks the kernel for big chunks via sbrk and/or mmap/munmap. Probably not the whole heap as a single chunk of virtual memory as in Java, but much closer to that than to the other extreme of a separate kernel call for each allocation.

[1] by default, just libc's malloc and free, although you can override this, and many people choose jemalloc or mimalloc instead. My high-level description applies equally well to any of the three.

IshKebab · 5 months ago
Aside from Rust not working like that (as scottlamb said), Rust is faster than Java even if Java has a faster allocator because Rust code usually does much less allocation in the first place.
Mawr · 5 months ago
> So in JVM boxing is really cheap, often just a pointer increment.

Nonsense. Sure, the act of creating a new allocation is cheap, but that's not where the expense lies at all. And of course, to make allocations cheap, you needed to give up something else.

Each allocation needs to be at some point handled by the GC, so the more allocations, the more GC pressure. That then forces you to make your GC generational, which constrains your GC design. You end up with a slower GC no matter what you do.

Moreover, every access to that allocation goes through a pointer, which is basically a guaranteed cache miss.

The classic case of this is iterating over an Array<Integer>. Not only did you have to make n allocations that the GC now has to keep track of, it's not possible to efficiently fetch items of this array from memory at all. You need to first fetch the pointers, then request the pointed to memory. Even in the best possible case of the pointer living right next to the data it points to, you're still paying the overhead of extra memory taking up space in your L1 cache.

Compare with an array of simple integers. Zero GC overhead, zero pointers, zero memory overhead, trivial to prefetch.
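
To make the contrast concrete, here is a minimal sketch (illustrative only, not a proper benchmark; you'd want JMH and realistic data for real measurements):

  class BoxedVsPrimitive {
    public static void main(String[] args) {
      // Boxed: a million separate heap objects; every access is a pointer
      // chase plus unboxing, and every object is extra work for the GC.
      Integer[] boxed = new Integer[1_000_000];
      for (int i = 0; i < boxed.length; i++) boxed[i] = i;
      long boxedSum = 0;
      for (Integer x : boxed) boxedSum += x;

      // Primitive: one contiguous allocation, sequential cache-friendly
      // reads, and nothing extra for the GC to track.
      int[] primitive = new int[1_000_000];
      for (int i = 0; i < primitive.length; i++) primitive[i] = i;
      long primitiveSum = 0;
      for (int x : primitive) primitiveSum += x;

      System.out.println(boxedSum + " " + primitiveSum);
    }
  }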

---

This is a case of a horrid approach to design. Making allocation cheap must come at the cost of slowing down some other part of the GC, and it just encourages programmers to allocate more, which puts more and more pressure on the GC, making it even slower. It's a self-defeating approach that is therefore completely bone-headed.

Allocations should be expensive!

With expensive allocations you get breathing room in your GC to make it otherwise faster. Moreover, since programmers are now encouraged not to allocate, the GC pressure is lower, making it even faster. But it gets even better. Optimizing becomes very clear and straightforward - just reduce allocations. This is great because it allows for targeted optimizations - if a particular piece of code is slow, just reduce its allocation rate and you'll get a nice speedup. Very easy to reason about.

That's why code written C/C++/Go/Rust tends to be so conscious of allocations and any indirections, but Java is full of linked lists, arrays of pointers, allocations everywhere and thousands of layers of indirections.

Cleaning up heaps of garbage can never be faster than not creating the garbage in the first place.

bsder · 5 months ago
It's not like C++ users don't say the exact same thing about String, though.

The problem is that String really isn't a language primitive.

Different people and programs always have a different notion of what a String should do (Should it be mutable? Should it always be valid UTF-8? Which operations should be O(1), O(n) or O(n log n)? etc.)

scottlamb · 5 months ago
> It's not like C++ users don't say the exact same thing about String, though.

If they do, they're wrong, as the two languages are quite different here. In Java, String requires two allocations: your variable is implicitly a pointer to the String allocation, which in turn has a pointer to a char[] allocation. In C++, the std::string itself is a value type. The actual bytes might be inline (short string optimization) or behind a single allocation accessible from the std::string.

Rust's std::string::String is somewhere between: it's a value type but does not have a short string optimization (unless you count the empty string returned by String::new).

> Different people and programs always have a different notion of what a String should do (Should it be mutable? Should it always be valid UTF-8? Which operations should be O(1), O(n) or O(n log n)? etc.)

Sure, there can be a call for writing your own String type. But what's unique about Java as compared to, say, C, C++, Go, Rust, and even to some extent C#, is that you can't have a class or struct that bundles up the parts of your data structure (in the case of a mutable string, two fields: data pointer/capacity + the used length) without boxing. There's a heavy cost to any non-primitive data type.

pjc50 · 5 months ago
I like how dotnet has acknowledged this and provided a whole lot of machinery recently for things like value types on the stack and Span<T> to reduce array copying.
pjmlp · 5 months ago
I would have liked both of them to have acknowledged this to the level of languages like Oberon (all variants predate them), Eiffel, Modula-3, CLU, Cedar, ...

While those were used as inspiration in their design, the AOT approach, automatic memory management, and programming-language features for low-level coding really took a while to come back from them; even NGEN/unsafe in C# v1.0, while welcome, wasn't that great.

I always think that in an alternate reality, had they had those features already in version 1.0, C and C++ would have lost even more mindshare at the turn of the century, and something like C++11 would not have been as impactful.

mcosta · 5 months ago
An example of why I'd prefer to avoid Java for somethi [Core dumped]
snickerdoodle12 · 5 months ago
> but I also think a lot of it comes down to all non-primitive types being boxed Objects

They're working on it: https://openjdk.org/projects/valhalla/

eulgro · 5 months ago
Project Valhalla started over 10 years ago. It keeps being mentioned, but with seemingly no end in sight. Is there an ETA?
charleslmunger · 5 months ago
I tried implementing an optimized varint encoder that did something similar, by populating an 8-byte value and then storing it to RAM, but unaligned overlapping stores caused big regressions. The approach that worked required a different technique. This one is for encoding backwards:

1. One branch for the one-byte case, always inlined, just store the byte

2. Calculate the size of the unencoded zero prefix with a branch-free construction: (((uint32_t)clz + 7) * 9) >> 6 (see the sketch below)

3. Hand roll a jump table taking advantage of arm's fixed instruction width to calculate the jump target with a shift.

https://github.com/protocolbuffers/protobuf/commit/b039dfe26...

This results in one branch for 1 byte varints, plus one additional branch for any larger size, and the branch predictor does not have to track a varying trip count through a loop. This approach resulted in a 2.64% speedup for overall encoding (which includes a lot more than just varints) on mid sized arm cores.

I think it's very difficult to beat a single comparison and branch for the 1 byte case for actual encoding forwards or backwards, unless you know there's going to be long runs of one-byte values.
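
Since the thread is about Java, here is a rough Java translation of just the length computation in step 2 above (illustrative only, not the actual protobuf code; Long.numberOfLeadingZeros(0) returns 64, so zero needs no special casing):

  // Branch-free varint length: how many of the maximum 10 encoded bytes
  // a 64-bit value actually needs.
  static int varintLength(long v) {
    int clz = Long.numberOfLeadingZeros(v);   // 0..64 leading zero bits
    int unusedBytes = ((clz + 7) * 9) >>> 6;  // bytes of the 10-byte max encoding left as zero prefix
    return 10 - unusedBytes;                  // encoded length, 1..10
  }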

chaosmachine · 5 months ago
This reminds me of Benford's law[1]

"Benford's law, also known as the Newcomb–Benford law, the law of anomalous numbers, or the first-digit law, is an observation that in many real-life sets of numerical data, the leading digit is likely to be small."

[1] https://en.m.wikipedia.org/wiki/Benford%27s_law

OhMeadhbh · 5 months ago
<snark>Time spent optimizing assembly language is never wasted.</snark>

That's only half-snark. It was "wasted" in the sense that it wasn't as useful as the OP thought it would be. But diving deep into a technical problem? That left an indelible pattern on the OP's psyche. If they ever have to do something like that again, it will certainly be easier. And since AI is going to be doing all the simple coding tasks, humans will have to dive into the deep, hard tasks.

bluGill · 5 months ago
Also, there is no way for the author to know that his optimizations wouldn't work without trying them. We are not even sure that his analysis of why they didn't work is correct (though it sounds reasonable). There is also the possibility that the author could use the newfound information to fix his optimizations so they are better.
xandrius · 5 months ago
It could also burn someone out, defeating the real reason technology exists: to help someone.
labadal · 5 months ago
Kind of like math, huh? Some people burn out on "meaningless" impossible problems. Others get obsessed with them and make significant contributions along the way to failure.
corysama · 5 months ago
Reminds me of https://ridiculousfish.com/blog/posts/old-age-and-treachery....

Data conditions can definitely affect runtime performance, all the way down to the microarchitecture level. Random, sorted, reverse-sorted, and uniform data sets often lead to dramatic differences in perf. Then you get into "Oops, my dataset had a surprisingly large number of NaNs or denormalized floats!"