eric4smith · 3 years ago
Slightly off topic...

Back before my Phoenix/Elixir days, we used to have to cache almost everything. Now we don't cache anything -- even with more traffic and fewer servers. The projects I work on aren't getting millions of views every day, but they are still important for many customers.

We don't really design anything that needs to be cached anymore. Just better access patterns and table design as we get wiser.

Take WordPress as an example. There really should not be a need to cache a WordPress site, but some of the most popular plugins are caching plugins. To be fair, I suppose, its legacy, non-optimal database design, access patterns, and code are responsible for this.

Just getting something up and aiming for product-market fit is priority #1; you can struggle with caching later if it gets popular, at which point more resources can be thrown at it.

sethammons · 3 years ago
Elixir is not some golden ticket, and in my experience it still needed caching (ETS and/or Redis). I worked on an Elixir service that topped out at around 2k requests per second with 5 servers, even with some caching of the high-read/low-write flows. This was mostly limited by DB CPU pressure on Postgres: lots of materialized views and monstrous queries made inefficient by over-normalization and, maybe, partly affected by UUID primary keys.

I came to this system with prior experience using Go and MySQL, where I expected a base performance of 1-5k queries per second per node before any optimizations like (non-CPU) caching, though we also didn't use triggers or anything fancy, which helped avoid DB CPU pressure.

This got a bit rambling and perhaps a bit off topic, but Elixir is not something that sidesteps the need for a cache when your DB is maxed out.

andy_ppp · 3 years ago
2000 requests per second implies your system was badly designed. If you're saying it was mostly database bottlenecks, why would that not affect the Go version in the same way?
locutous · 3 years ago
> Take WordPress as an example. There really should not be a need to cache a WordPress site, but some of the most popular plugins are caching plugins.

Caching gives sites about two orders of magnitude speed uplift in my experience. Sometimes more. Especially if you are building the pages live and compressing them.

Brotli -11 is really, really expensive, and egress data transfer is one of the more expensive commodities to purchase from cloud vendors.

Caching also reduces latency significantly - which is a much better user experience.

AussieWog93 · 3 years ago
>Take WordPress as an example. There really should not be a need to cache a WordPress site, but some of the most popular plugins are caching plugins. To be fair, I suppose, its legacy, non-optimal database design, access patterns, and code are responsible for this.

To anyone reading this: do not assume that you don't need to cache your WordPress site! I remember running a test on my site once (a basic WooCommerce site running on a DigitalOcean VPS), and simply turning on the free version of WP Fastest Cache was the difference between crashing at 30 users/min and serving 1000. It drastically lowered page load times too.

basilgohar · 3 years ago
In my experience, caching WordPress became essential on shared hosting, which one of my clients was somewhat bound to for... reasons. Without it, each page load was atrociously slow, but once caching was enabled and the cache was warmed, the site moved as quickly as anywhere else.

It's interesting because a custom legacy PHP application on the same shared host didn't demonstrate such slowness, but I did write some aggressive in-request caching (static variables for repeatedly called functions, mainly) to great performance effect.
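
Roughly the same trick, sketched in C++ rather than PHP (the function names here are made up, just to show the shape of it):

    #include <string>
    #include <unordered_map>

    // Stand-in for an expensive per-request lookup (a DB query, file read, etc.).
    std::string load_option_from_db(const std::string& key) {
      return "value-for-" + key;  // pretend this hits the database
    }

    // Function-local static cache: repeated calls with the same key skip the
    // expensive lookup for the rest of the process, much like a PHP static
    // variable inside a repeatedly called function.
    const std::string& get_option(const std::string& key) {
      static std::unordered_map<std::string, std::string> cache;
      auto it = cache.find(key);
      if (it == cache.end())
        it = cache.emplace(key, load_option_from_db(key)).first;
      return it->second;
    }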

taskforcegemini · 3 years ago
Just throwing more resources at it will make your hosting company happy. I work for one, so I won't disagree with you. Though if I were allowed to mention its name, I would disagree, since a lot of what we do is helping our clients make their web applications run more efficiently (and thus saving them a lot of money in the long run).
withinboredom · 3 years ago
WordPress by itself isn’t so bad. It’s the plugin/theme developers who likely have no idea what they’re doing or why what they’re doing is bad.
andai · 3 years ago
My client is running 21 plugins :(
ignoramous · 3 years ago
Keeping operating modes simple is something that pays off especially in large scale systems (let alone smaller ones): https://archive.is/2S2ME (Doing Constant Work, Colm MacCárthaigh, AWS Builders' Library).

Generally, I've found that, in the guise of doing the simplest thing, I've done the most naive thing possible, which unfortunately comes with the burden of tech debt. So, as a rule of thumb, given the small tech shop we are, I tend to favour systems that are fully managed or serverless, and I prefer to take on whatever tech debt using them throws at me.

pas · 3 years ago
WP caching is "necessary" if you run on a low-end shared host, where, between contention for resources and context switching, basically any dynamic feature of the site gets too slow in a big traffic surge.
osigurdson · 3 years ago
1M views per day is only ~12 views per second (1,000,000 / 86,400 ≈ 11.6). It should be possible to easily run at least 1K DB queries per second. Based on this, it seems that you would have quite a bit of room with a non-cached model. Of course, peak traffic estimates are really the key.

I don't understand how Phoenix/Elixir impacts this however.

tl · 3 years ago
Phoenix / Elixir (really, BEAM) enforce a worldview that results in non-bottlenecking code. You can do this yourself with a certain amount of discipline in other languages. You should do this yourself if you are getting millions (or billions, etc.) of hits per second, to gain access to a system with a lower constant factor than BEAM. This is why Discord (for example) shifted to Rust.

Two interesting things are true:

1. Engineers have, through lack of experience (or care), bottlenecked systems that had 12 views per day. Guardrails are useful, particularly if there are shared resources or if bottlenecks can impact busier neighbors.

2. The threshold where a shift out of Elixir is a good idea climbs at the same rate as Moore's law. Discord made their Elixir->Rust moves a lot later than Twitter made Ruby->Java ones.

idiocrat · 3 years ago
Aren't you just pushing the caching down from the HTML and app layers to the DB, OS, disk, and CPU cache layers?

Something always needs to be cached, so the DB and disk need at least block caches.

tinus_hn · 3 years ago
If your main page is put together by running a bunch of queries it’s just a waste of resources to run these for every bot hit.
jasfi · 3 years ago
I believe in caching where it makes sense. This is usually app and implementation specific. But I do agree that a cache shouldn't make up for bad DB design and tuning.
dragontamer · 3 years ago
> Cache coherency ensures that the behavior is correct, but every time a cache is invalidated and the same memory has to be retrieved from main memory again, it pays the performance penalty of reading from main memory.

1. First: no. If any such rereads occur, they will be served from the LLC (last-level cache, i.e. the L3 cache on Intel/AMD CPUs), not from main memory.

2. Second: no. IIRC, modern caches are snooping and/or directory caches. This means that when Core#0 says "I'm changing cacheline X", Core#0 knows that Core#1 has it in its L2 (or deeper) caches. So Core#0 will publish the results of that change to Core#1.

3. The examples they gave are missing read/write barriers, which are rather important. #2 will only happen during read/write barriers. That is to say: your code is memory-unsafe by default. If order and cache-invalidation matter, you actually need a barrier instruction (!!!) to make sure all these messages are passed to the right place at the right time.

---------

For better or for worse, modern CPU design is about doing things thread-unsafe by default. If the programmer recognizes a problem may occur, it is the responsibility of the programmer to put the memory barriers in the right place.

This is because the CPU is not the only thing that can reorder things... but also the cache system... as well as the compiler. So memory barriers inform the compiler + CPU + cache system when to synchronize.
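
A minimal C++ sketch of what "putting the barriers in the right place" looks like in practice, using release/acquire atomics (my own toy example, not from the article):

    #include <atomic>
    #include <cassert>
    #include <thread>

    int payload = 0;
    std::atomic<bool> ready{false};

    void producer() {
      payload = 42;                                  // plain write
      ready.store(true, std::memory_order_release);  // release: 'payload' is published before 'ready'
    }

    void consumer() {
      while (!ready.load(std::memory_order_acquire)) // acquire: pairs with the release above
        ;                                            // spin until the flag is set
      assert(payload == 42);                         // guaranteed to see 42, not a stale value
    }

    int main() {
      std::thread t1(producer), t2(consumer);
      t1.join();
      t2.join();
    }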

The original Netflix article is more precisely worded.

> This consistency is ensured with so-called “cache coherency protocol.”

With a link to MESIF (F for Forwarding state, which is one such protocol / way to share L1 cache info without going all the way to LLC or DRAM)

I'm pretty sure that GPUs actually invalidate the cache dumbly, without any MESI / MESIF (or whatever). GPUs are actually really bad at these kinds of ping-pong operations and synchronization... preferring thread-fences and other synchronization mechanisms instead.

------

That being said, I think the blog post is a good introduction to the subject. But note that it's a bit imprecise with some of the low-level details. I guess it's correct for the GPU world though, so it's not completely inaccurate...

-----

The original Netflix article has an even more difficult problem revolving around Java's superclass cache and how it leads to "true sharing" of a cache line. I'm not very familiar with JVM internals, so I lost track of the discussion at that point.

nwallin · 3 years ago
> 3. The examples they gave are missing read/write barriers, which are rather important. #2 will only happen during read/write barriers. That is to say: your code is memory-unsafe by default. If order and cache-invalidation matter, you actually need a barrier instruction (!!!) to make sure all these messages are passed to the right place at the right time.

Wait... wait what? Am I to understand that if I have code like this:

    #include <cstdio>
    #include <mutex>
    #include <thread>

    struct {
      int foo;   // foo and bar will typically sit on the same 64-byte cache line
      int bar;
    } foobar;

    std::mutex io_mutex;

    void baz(int* x, int n) {
      *x = 0;
      for (int i = 0; i <= n; i++)
        *x += i;
      std::lock_guard<std::mutex> lock(io_mutex);
      printf("%d\n", *x);
    }

    int main() {
      std::thread t1(baz, &foobar.foo, 10000);
      std::thread t2(baz, &foobar.bar, 10000);
      t1.join();
      t2.join();
    }
You're saying this code is memory-unsafe without adding barrier instructions (as opposed to just having terrible performance)? (Let's also assume the compiler isn't smart enough to just compile that down to `*x = n * (n + 1) / 2;` and actually reads/writes the value through the pointer.)

dragontamer · 3 years ago

    std::lock_guard<std::mutex> lock(io_mutex);
    printf("%d\n", *x);
Have you ever looked at lock/unlock code? There's a memory + compiler barrier inside of any mutex.

The easiest way to use memory barriers is through locking/unlocking mutexes. Alas, that's too slow for a number of applications, so advanced programmers have begun to use more complicated patterns.

vouwfietsman · 3 years ago
But you aren't actually mixing data here, right? Even though foo and bar could be on the same cache-line, the threads need to read/write either foo OR bar, so the respective core caches being out of sync for the var they don't use does not really cause any safety issues. Maybe I'm missing the point of your example.

No idea how the two cores' copies of that cache line get merged back to main memory in the end, though. Is there some kind of write mask for that?

Also I imagine them being in the same struct is really just an implementation detail and shouldn't actually influence the memory safety?

ChuckMcM · 3 years ago
I was going to say something like this but I like this version better. Key thing to keep in mind: this is the Intel Core microarchitecture; other microarchitectures may make other choices.

In particular, using snooping or shoot-downs is one of those choices. Why use one or the other? Consider whether or not your memory controller is "single threaded"[1]. If you have multiple memory controllers that have access to the same "slow" memory, then a shoot-down (cache invalidation message) over the cache coherency interconnect can start the refresh transaction for an unoccupied memory controller context. In these cases you can overlap memory transactions and "hide" some of the latency. If you're snooping and see a write-through (where the write is headed for main memory), you can mirror a copy of that write-through to update your cache, but you might simply invalidate your cache if there is a shared cache with the CPU (thread) in question, as it will get changed in the shared cache and the invalidation will cause the shared cache to move it from L2 to L1 as soon as you look at it again.

The availability of multi-core RISC-V designs for FPGAs give someone who is interested in this an opportunity to actually try the different strategies and to compare instruction throughput (instructions per clock) and maximum latency for correct operation.

There was a good paper at Microprocessor Forum in the early 2000's from Toshiba on a multi-core cache design where the L2 cache was visible but segmented between cores (it appeared as a single cache, but you didn't have access to invalidate all of the lookaside buffers). I thought it was very creative, and it of course rewarded algorithms that understood that (small working set per thread, larger working set for the application).

[1] Basically how many memory transactions can it have outstanding at any given time.

throwawaylinux · 3 years ago
Directory vs. snooping just means that either you know exactly which other agents have an interest in a line, so you can target them directly, or you don't, so you broadcast and other agents have to "snoop" requests and respond to the relevant ones.

The logical coherency protocol that is (somewhat) above that level is what would dictate whether you can forward modified data between CPUs or it would have to write back and be fetched from memory.

AFAIK protocols don't tend to have the writer publish its modification to other owners, for two major reasons. First, because you would then no longer own the line exclusively and would have to go off-chip if you wanted to modify it again immediately afterwards. Second, because problems with supporting atomic operations and preventing deadlocks in the protocol can become unmanageable if you could have other owners while you are writing lines. I believe the transfer of data is generally demand-driven: if core 1 wanted to read the line, it would ask core 0 for it. And modifications will always remove access permission from other cores before the modification completes, so core 0 would have to send an intent-to-write message to core 1 in this case before its store could reach coherency.

Barriers don't necessarily drive any coherency on a cache line level. Caches are always coherent (ignoring access by incoherent agents, where the coherency has to be managed at a software level). Barriers provide ordering between more than one memory access.

The MESIF page is a good basic intro to the concepts; naturally, the state of the art has moved beyond that. Here, for example, https://course.ece.cmu.edu/~ece742/f12/lib/exe/fetch.php?med... a 15-year-old CPU has 13 states in its coherency protocol. It's difficult to find a lot of public information about this stuff. Suffice to say these things are so complex that mathematical proofs are used to ensure correctness, no deadlocks, etc. You are right that data, including updates, can go directly between agents rather than via memory.

But no matter how smart the hardware gets, the advice to programmers is still the same: stores to a line will slow down access to that line by any other agent, so sharing read-only lines between CPUs is fine, but don't add a lot of writes to the mix.

namibj · 3 years ago
> For better or for worse, modern CPU design is about doing things thread-unsafe by default. If the programmer recognizes a problem may occur, it is the responsibility of the programmer to put the memory barriers in the right place.

This so much. I truly wish RISC-V had decided on the weak memory model (WMM)[0] that allowed load-value speculation in the absence of explicit `Reconcile` fences between dependent loads (i.e., when first loading an address from memory and then loading a value from that address, it can load an old value that was stored after the value's address was stored to memory, provided it speculated the value (of pointer-type) of the first load correctly and loaded from this speculated address before it properly executed the first load).

Not needing to repeat the load from a speculated address in response to that address getting a cache invalidation before the load that fetched the (correctly speculated) pointer "goes through" severs the cache snooping/invalidating feed to the reorder buffer (ROB). Essentially, the difference from banning the dependent load-load reordering is that WMM frees up the second load's ROB entry once the value speculation of the first load confirms, so when the first load instruction returns and affirms the value speculated for it, the second load can retire immediately with the first load to free up both ROB entries. The decoupling between an L2 invalidating other L2s and responding to another L2's read request allows the "Invalidation Buffer" of an L2 to function similarly to an L1's store buffer. So if L3 is implemented by just giving each core a large multi-bank L2, so the L3 slices are just the different cores' L2s, write congestion into an L2 from other cores all writing to the same bank is just buffered away (and redundant stores can be squashed / adjacent stores write-combined, useful if the writing cores specify they themselves don't want read access to the cachelines those writes target).

See this excerpt from the paper:

> Speculations: OOO can issue a load speculatively by aggressive predictions, such as branch prediction (Figure 3d), memory dependency prediction (Figure 3e) and even load-value prediction (Figure 3f). As long as all predictions related to the load eventually turn out to be correct, the load result got from the speculative execution can be preserved. No further check is needed. Speculations effectively reorder dependent instructions, e.g., load-value speculation reorders data-dependent loads. Since WMM does not require preserving any dependency ordering, speculations will neither break WMM nor affect the above correspondence between OOO and WMM.

Yes, WMM is similar to Alpha, just that Alpha allowed a load from some address to be reordered after a store to a different address (which is forbidden in WMM).

[0]: https://arxiv.org/abs/1707.05923

bluGill · 3 years ago
Not all the world is x86; ARM doesn't do as much.
dragontamer · 3 years ago
ARM is cache coherent, meaning it has some kind of MESI (or better, like MESIF) protocol going on.

ARM is _RELAXED_ however, or perhaps... "more relaxed" than x86. More orderings are possible on ARM, but that only makes memory-fences (load-acquire, store-release) even more important.

beambot · 3 years ago
It has earned its title as "one of the two hardest problems in computer science": cache invalidation, naming things, and off-by-one errors.
moring · 3 years ago
IMHO off-by-one errors only deserve to be there for the irony. In reality, they are orders of magnitude easier than the other two.
javajosh · 3 years ago
Yes, off-by-one is easier to fix, but far more common. Hence if suffering is the product of frequency and difficulty, then off-by-one suffering (S=FD+1) is the same order as cache invalidation (S=FD).
stavros · 3 years ago
That is, indeed, the joke.
willis936 · 3 years ago
Perhaps the category was improperly named.
jdthedisciple · 3 years ago
the multiple layers of irony in this one just always amaze me
mklepaczewski · 3 years ago
You may also like this one (from https://twitter.com/mathiasverraes/status/632260618599403520):

There are only two hard problems in distributed systems: 2. Exactly-once delivery 1. Guaranteed order of messages 2. Exactly-once delivery

ricardo81 · 3 years ago
I like how your answer wasn't just one of the list. I was almost not going to post this reply to illustrate your point.
MontyCarloHall · 3 years ago
Despite the complexity of the cause of the problem, the fix was really simple: they just introduced 64 bytes of padding between the two struct members to guarantee that they never could fall within the same cache line. Sometimes the best solution is brute force!
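
The actual patch was in JVM internals, but the shape of the fix is easy to sketch in C++ (field names made up): keep the two hot fields on different 64-byte cache lines.

    #include <cstdint>

    // Manual spacer, mirroring the 64 bytes of padding described in the post.
    struct Counters {
      std::uint64_t hot_a;
      char pad[64];        // >= one cache line, so hot_a and hot_b never share a line
      std::uint64_t hot_b;
    };

    // Equivalent and more explicit: let the compiler align each field.
    struct CountersAligned {
      alignas(64) std::uint64_t hot_a;
      alignas(64) std::uint64_t hot_b;
    };
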
almost_usual · 3 years ago
They still saw the same performance problem, though, due to "true sharing": CPU-enforced memory ordering on a highly contended variable.

Patching the JDK to skip the secondary cache altogether was the real solution.

pestatije · 3 years ago
Weird they added 64 bytes and not 56, which is the minimum required to have the two variables in different cache lines.
basilgohar · 3 years ago
Maybe the 64 bytes was better for some kind of alignment with other architectural structures....
greenyoda · 3 years ago
Here's the HN discussion of the original Netflix TechBlog article that this one references: https://news.ycombinator.com/item?id=33540091
tbrownaw · 3 years ago
> false sharing

Usually the "cache invalidation" remark is about correctness and knowing when to invalidate, like for example making sure all your login tokens get expired when you change your password. This is about performance impacts from having multiple logical entities in one shared cached entity.

stingraycharles · 3 years ago
Yes I’d say this is more along the lines of concurrency problems than caching problems: it’s about synchronizing access to memory (cache) between multiple cores.
tremon · 3 years ago
More generally, it's about making sure that no consumer uses stale data, in the presence of multiple concurrent producers and consumers.
bornfreddy · 3 years ago
In the fix in the original Netflix blogpost, shouldn't they pad with 56 instead of 64 bytes? Otherwise they are wasting some cache space.

I would also expect the fix to reflect the platform it is applicable for, but maybe they simplified it for the blog post...

stingraycharles · 3 years ago
You’re correct as far as I know, and now I wonder why the Netflix people didn’t elaborate on this choice.
sedeki · 3 years ago
Off topic: What level of sophistication about modern CPUs is _good_ to have? And where does one learn it? Resources?

Say that I want to work with HPC-type applications and otherwise just squeeze the most out of the processor, e.g. quant finance / trading systems.

pastaguy1 · 3 years ago
The first question is a bit subjective, but Hennessy and Patterson is a good place to start.

It's thick but reads fairly easy in my recollection

anonymoushn · 3 years ago
There are a lot of issues here, so I can share some stuff about some of them and hope that some helpful internet commenters come along and point out where I have neglected important things.

A single modern CPU core is superscalar and has a deep instruction pipeline. With your help, it will decode and reorder many instructions and execute many instructions concurrently. Each of those instructions can operate on a lot of data.

As famous online controversial opinion haver Casey Muratori tells us, most software just sucks, like really really bad (e.g. commonly people will post hash table benchmarks of high-performance hash tables that do bulk inserts in ~100ns/op, but you can do <10ns/op easily if you try), and using SIMD instructions is table stakes for making good use of the machinery inside of a single CPU core. SIMD instructions are not just for math! They are tools for general purpose programming. When your program needs to make decisions based on data that does not contain obvious patterns, it is often a lot cheaper to compute both possible outcomes and have a data dependency than to have a branch. Instructions like pshufb or blendv or just movemask and then using a dang lookup table can replace branches. Often these instructions can replace 32 or 64 branches at a time[0]. Wojciech Muła's web site[1] is the best collection of notes about using SIMD instructions for general-purpose programming, but I have found some of the articles to be a bit terse or sometimes incorrect, and I have not yet done anything to fix the issue. "Using SIMD" ends up meaning that you choose the low-level layout of your data to be more suitable to processing using the instructions available.
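
For a flavour of what "SIMD as a general-purpose tool" means, here is a small SSE2 sketch (my own example, not from Muła's site): counting the bytes in a buffer equal to some value with compare + movemask + popcount instead of a per-byte branch.

    #include <cstddef>
    #include <cstdint>
    #include <immintrin.h>

    // Counts bytes equal to `needle`, 16 bytes per iteration with SSE2.
    // Assumes len is a multiple of 16 to keep the sketch short; a real version
    // would handle the tail. __builtin_popcount is the GCC/Clang builtin.
    std::size_t count_byte(const std::uint8_t* buf, std::size_t len, std::uint8_t needle) {
      const __m128i pattern = _mm_set1_epi8(static_cast<char>(needle));
      std::size_t count = 0;
      for (std::size_t i = 0; i < len; i += 16) {
        __m128i chunk = _mm_loadu_si128(reinterpret_cast<const __m128i*>(buf + i));
        __m128i eq    = _mm_cmpeq_epi8(chunk, pattern);                 // 0xFF where equal, 0x00 elsewhere
        unsigned mask = static_cast<unsigned>(_mm_movemask_epi8(eq));   // one bit per byte
        count += static_cast<std::size_t>(__builtin_popcount(mask));    // no branch per byte
      }
      return count;
    }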

Inside your single CPU core there is hardware for handling virtual -> physical address translation. This is a special cache called the translation lookaside buffer (TLB). Normally, chips other than recent Apple chips have a couple hundred entries of 1 4KiB page each in the TLB, and recent Apple chips have a couple hundred entries of 1 16KiB page each. Normal programs deal with a bit more than 1 meg of RAM today, and as a result they spend a huge portion of their execution time on TLB misses. You can fix this by using explicit huge pages on Linux. This feature nominally exists on Windows but is basically unusable for most programs because it requires the application to run as administrator and because the OS will never compact memory once it is fragmented (so the huge pages must be obtained at startup and never released, or they will disappear until you reboot). I have not tried it on Mac. As an example of a normal non-crazy program that is helped by larger pages, one person noted[2] that Linux builds 16% faster on 16K vs on 4K pages.
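
For reference, explicit huge pages on Linux look roughly like this (a sketch, assuming huge pages have been reserved beforehand, e.g. via /proc/sys/vm/nr_hugepages):

    #include <sys/mman.h>
    #include <cstddef>
    #include <cstdio>

    int main() {
      const std::size_t kSize = 2 * 1024 * 1024;  // one 2 MiB huge page
      void* p = mmap(nullptr, kSize, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
      if (p == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");  // happens if no huge pages are reserved
        return 1;
      }
      // This 2 MiB region occupies a single TLB entry instead of 512 4 KiB ones.
      static_cast<char*>(p)[0] = 1;
      munmap(p, kSize);
      return 0;
    }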

Inside your single CPU core is a small hierarchy of set-associative caches. With your help, it will have the data it needs in cache almost all the time! An obvious aspect of this is that when you need to work on some data repeatedly, if you have a choice, you should do it before you have worked on a bunch of other data and caused that earlier data to be evicted (that is, you can rearrange your work to avoid "capacity misses"). A less obvious aspect of this is that if you operate on data that is too-aligned, you will greatly reduce the effective size of your cache, because all the data you are using will go into the same tiny subset of your cache! An easy way to run into this issue is to repeatedly request slabs of memory from an allocator that returns pretty-aligned slabs of memory, and then use them all starting at the beginning. That this could cause problems at all seems relatively unknown, so I would guess lots and lots of software is losing 5-10% of its performance because of this sort of thing. Famous online good opinion haver Dan Luu wrote about this here[3]. The links included near the bottom of that post are also excellent resources for the topics you've asked about.
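
To make the "too-aligned" point concrete, here is a toy illustration (the cache geometry assumed is a typical 32 KiB, 8-way, 64-set L1d; numbers vary by CPU):

    #include <cstdint>
    #include <cstdio>
    #include <cstdlib>

    int main() {
      // For a typical 64-set L1d, the set index is bits 6..11 of the address.
      constexpr std::uintptr_t kLineSize = 64, kNumSets = 64;
      void* bufs[4];
      for (int i = 0; i < 4; i++) {
        bufs[i] = std::aligned_alloc(4096, 4096);  // page-aligned slabs, like many allocators return
        auto addr = reinterpret_cast<std::uintptr_t>(bufs[i]);
        // Every buffer prints set 0: the hot first line of each slab competes
        // for the same few ways instead of spreading across the cache.
        std::printf("buffer %d starts in L1 set %lu\n", i,
                    static_cast<unsigned long>((addr / kLineSize) % kNumSets));
      }
      for (void* p : bufs) std::free(p);
      return 0;
    }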

When coordinating between multiple CPU cores, as noted in TFA, it is helpful to avoid false sharing[4]. People who write trading systems have mostly found that it is helpful to avoid sharing *at all*, which is why they have work explicitly divided among cores and communicate over queues rather than dumping things into a concurrent hash map and hoping things work out. In general this is not a popular practice, and if you go online and post stuff like "Well, just don't allocate any memory after startup and don't pass any data between threads other than by using queues" you will generally lose imaginary internet points.
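
A bare-bones version of the "queues, not shared maps" style is a single-producer/single-consumer ring buffer with the indices kept on separate cache lines (a toy sketch; see [4] for a serious treatment):

    #include <array>
    #include <atomic>
    #include <cstddef>
    #include <optional>

    template <typename T, std::size_t N>  // N must be a power of two
    class SpscQueue {
      static_assert((N & (N - 1)) == 0, "N must be a power of two");
      std::array<T, N> buf_;
      // Each index on its own cache line, so producer and consumer don't false-share.
      alignas(64) std::atomic<std::size_t> head_{0};  // advanced by the consumer
      alignas(64) std::atomic<std::size_t> tail_{0};  // advanced by the producer

     public:
      bool push(const T& v) {                         // producer thread only
        std::size_t t = tail_.load(std::memory_order_relaxed);
        if (((t + 1) & (N - 1)) == head_.load(std::memory_order_acquire))
          return false;                               // full (one slot kept empty on purpose)
        buf_[t] = v;
        tail_.store((t + 1) & (N - 1), std::memory_order_release);
        return true;
      }

      std::optional<T> pop() {                        // consumer thread only
        std::size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire))
          return std::nullopt;                        // empty
        T v = buf_[h];
        head_.store((h + 1) & (N - 1), std::memory_order_release);
        return v;
      }
    };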

There are some incantations you may want to apply if you would like Linux to prioritize running your program, which are documented in the Red Hat Low Latency Performance Tuning guide[5] and Erik Rigtorp's web site[6].

Some other various resources are highload.fun[7], a web site where you can practice this sort of thing, a list of links associated with highload.fun[8], Sergey Slotin's excellent online book Algorithms for Modern Hardware[9], and Denis Bakhvalov's online course Perf Ninja[10] and blog easyperf[11].

> Off topic: What level of sophistication about modern CPUs is _good_ to have?

Probably none? These skills are basically unemployable as far as I can tell.

[0]: https://github.com/lemire/despacer

[1]: http://0x80.pl/articles/index.html

[2]: https://twitter.com/AtTheHackOfDawn/status/13338951151741870...

[3]: https://danluu.com/3c-conflict/

[4]: https://rigtorp.se/ringbuffer/

[5]: https://access.redhat.com/sites/default/files/attachments/20...

[6]: https://rigtorp.se/low-latency-guide/

[7]: https://highload.fun/

[8]: https://github.com/Highload-fun/platform/wiki

[9]: https://en.algorithmica.org/hpc/

[10]: https://github.com/dendibakh/perf-ninja

[11]: https://easyperf.net/notes/

di4na · 3 years ago
All of this, especially the end.

Funnily enough, the part about not sharing caches feels a lot like Erlang with less scheduling...

eloff · 3 years ago
Hey, thanks for sharing! That was a lot of effort to type all that up, with references to boot.
pramodbiligiri · 3 years ago
If you’re just starting out, I suggest the introductory book by Patterson and Hennessy (not the Quantitative Approach, which is a tome) - https://www.amazon.in/Computer-Organization-Design-MIPS-Inte...

Another one is Computer Systems: A Programmer’s Perspective: http://csapp.cs.cmu.edu/