I recommend reading "The Collapse of Complex Societies" by Joseph Tainter [0].
Complex societies tend to address problems by adding more and more rules and regulations, simply because they have always done so and it has been successful in the past. More importantly, though, it is typically the only tool they have. Essentially these societies are increasingly overfitting their legislation to narrow special cases until it cannot handle anything unexpected anymore. Such a society is highly fragile. I witness this firsthand every day in my own country. Living here feels like lying in a Procrustean bed.
[0] https://www.amazon.com/Collapse-Complex-Societies-Studies-Ar...
"What's the origin of the phrase 'Hard cases make bad law'?
... 'Hard cases make bad law' isn't so much a universal proverb as a legal adage. It came to light in a comment made by Judge Robert Rolf in the case of Winterbottom v Wright in 1842:
This is one of those unfortunate cases...in which, it is, no doubt, a hardship upon the plaintiff to be without a remedy but by that consideration we ought not to be influenced. Hard cases, it has frequently been observed, are apt to introduce bad law.
The case required a judgment on whether third parties are able to sue for injury. The unusual nature of the case caused the judge to realise that, in the true sense of the expression, exceptions prove the rule and that, unfair as it might have appeared in some circumstances, the law was better drafted under the influence of the average case rather than the exceptional one.
The point was made explicitly in 1903 by V. S. Lean, in Collectanea:
Hard cases make bad law. that is, lead to legislation for exceptions."
https://www.phrases.org.uk/meanings/hard-cases-make-bad-law....
Something like this could (should?) perhaps be handled by a governmental "remedy" service: a small insurance levy or tax on everyone that judges can draw on to award a remedy without forcing the defendant to pay.
I personally enjoyed reading it, and from my limited exposure to the topic, my impression was that although some anthropologists disagree and theory has advanced since, The Collapse of Complex Societies was a landmark work and its ideas are still taken seriously.
https://www.indiatoday.in/mail-today/story/narendra-modi-law...
Interesting idea! Seeing as my "should read" list is already hundreds of books long, can you give us a spoiler as to what he suggests (if anything) to fix this? Or is it simply inevitable and therefore not actionable?
The big idea is diminishing marginal returns – increasing societal complexity pays off enormously at first, but eventually the returns level off while the overhead remains high or even continues to increase. Collapse is then seen as a natural (and not necessarily cataclysmic) response to this. It's a process of simplification.
Tainter also points out that collapse may not occur (1) when there are strong, neighboring powers, which simply absorb the foundering society, or (2) when there are strong cultural reasons, such as national identity, for which the members of the society may be willing to put up with the bureaucratic overhead indefinitely (in order to prevent being absorbed); he gives Europe as an example, I believe.
JIT was the savior of manufacturing, until people learned that a single traffic jam that delayed a single delivery could create costs far in excess of the inventory savings.
Optimizations are critical, everywhere. And measuring optimizations is important because it is easier, cheaper, and available earlier than measuring end results. Measuring days of inventory, or dollars in inventory, is a pretty good proxy for supply chain efficiency, until Covid hits and suddenly “efficiency” means stability rather than minimum costs.
Over and over. Branch prediction / Spectre. Mortgage-Backed Securities. Optimizations based on second- and third-order effects blow up in the real world where any abstraction is approximate.
So it’s not the efficiency that’s the problem, it’s changing the focus from maximizing the desired output to maximizing efficiency, and measuring success based on the performance of efficiency optimizations.
I feel like I have to repeat this very often: if a single traffic jam or other predictable common-cause variation results in your JIT implementation costing "far in excess of inventory savings", all you're telling me is that you have a really shitty JIT implementation.
JIT means having the buffer on hand to handle at the very least common-cause variation. But it also should come with the flexibility to handle variation of assignable causes at a reasonable cost.
Critically, you can't just press delete on all your inventory and then call it JIT. You have to adapt your processes for it, work with local authorities for improvements to infrastructure, and then always keep a larger buffer than you think you need.
JIT is not about deleting inventory. It's about reducing variation in your processes until most of your inventory genuinely becomes unnecessary.
When you have a good JIT implementation, you are more flexible to take advantage of changing external conditions, not less.
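As a rough illustration of what "a buffer sized for common-cause variation" can mean, here is a minimal sketch using the textbook safety-stock formula; the demand figures, lead time, and service level are made-up assumptions, not numbers from any real operation:

    # Rough safety-stock sizing against common-cause (everyday) demand variation.
    # All inputs below are hypothetical placeholders.
    from statistics import NormalDist

    daily_demand_mean = 120.0   # units/day (assumed)
    daily_demand_sd = 25.0      # day-to-day variation in demand (assumed)
    lead_time_days = 3.0        # replenishment lead time (assumed)
    service_level = 0.98        # target probability of not stocking out per cycle

    z = NormalDist().inv_cdf(service_level)               # ~2.05 for 98%
    safety_stock = z * daily_demand_sd * lead_time_days ** 0.5
    reorder_point = daily_demand_mean * lead_time_days + safety_stock

    print(f"safety stock  ~ {safety_stock:.0f} units")
    print(f"reorder point ~ {reorder_point:.0f} units")

Assignable-cause events (a blocked canal, a pandemic) don't come from that distribution at all, which is why the buffer alone isn't enough and the flexibility mentioned above matters.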
You've now redefined JIT to mean "buffered with the minimum viable buffer", where minimum viable buffer is "wherever I drop the goalposts when defining common-cause variation."
It's the definitional version of working out a variable in an equation, making a mistake, and ending up looking at 0=0.
Happens all the time.
Yes? My point was that people get carried away with maximizing optimizations because they start measuring the degree of optimization rather than the ultimate goal.
I’m not saying JIT always goes wrong, any more than neural net training always goes wrong. Just that when it does, it is often because of an overzealousness that leads to a disconnect between the local and global goals.
I'm not sure there's a need for a "strong version" of Goodhart's law, or I fail to understand the distinction the author is trying to make.
Goodhart is "so true" precisely because it warns about the fact that the measure is often not the goal itself, just like "the map is not the territory", and is indeed an approximation of the goal.
One finds a similar problem with the word "best": best according to which metric? That's why people have different ideas of what is best.
I highly recommend playing the beer game with different inventory sizes, and looking at the results.
Inventory management is not a simple task and cannot be generalized like this. JIT was adopted because it reduced the number of supply chain disasters, not despite increasing them as you claim. But, of course, that reduction wasn't homogeneous and not every single place saw a gain.
I apologize for being unclear. I was not trying to assert that JIT is fundamentally wrong and always leads to disaster.
My point was that any optimization process can go wrong when people start focusing on maximizing the optimization rather than the ultimate goal. That is how JIT goes wrong. It is also how overfitting appears in neural net training, how security flaws appear in branch prediction, etc.
I'm an MBA. I love JIT. But it's undeniable that JIT has led to disasters. I'm not blaming the approach, I'm blaming specific implementations.
A major consideration people seem to miss when discussing JIT is that it originated in a country with extremely reliable transportation. I had some chance to observe highly resilient and flexible logistical operations in Japan in the 1990s, and have since wondered just how well JIT in the USA follows that model.
The issue with JIT is not so much about proxies as it is about optimizing for the general case at the cost of worst-case performance. Things that look like an inefficiency 99.99% of the time can be an indispensable redundancy or buffer in the face of events that are difficult or impossible to account for.
Efficiency is usually a tradeoff that limits flexibility. If you don't happen to need flexibility, you can have amazing efficiency: a wheel of a rail car is so much more efficient than a foot. But only when it stays on a rail.
If we look at somewhat more difficult terrain, a foot suddenly turns out to be a better deal, because it adapts to a variety of conditions. Of course, at the expense of efficiency (complexity, energy consumption, the need for much more advanced control circuitry, etc.).
In business you usually have to have some safety margin for the unexpected. If you squeeze it out in the name of efficiency (and profit), all goes well until it does not, and then it fails much harder, maybe catastrophically.
(Nassim Taleb wrote a whole book about that, "Antifragile", where he gives a ton of fun real-life examples to make basically the above point.)
I would argue "efficiency" is the wrong word for what we discuss. Efficiency means optimising resource usage while achieving goals. If we need flexibility to achieve our goals consistently (and you usually do) then flexibility is part of efficiency (and effectiveness and efficacy) rather than opposed to it.
Phrased differently: if you define "efficiency" to mean "optimise for a single proxy metric and not what you're actually interested in" then yes, of course efficiency and effectiveness will be opposed, except occasionally by chance. But that's a dumb definition of efficiency!
To me, efficiency means achieving a desired goal while consuming a minimum of resources (time, energy, space, …).
If the goal is defined too narrowly, i.e. under your 'dumb' definition of efficiency, the flexibility may be missing. Still, that particular goal may be reached efficiently.
So it’s not an issue with the definition of efficiency, but rather with scoping the problem. As the article states, it may not always be possible to scope the problem in an easily measurable way, hence optimizing for proxy targets.
Systems start out very fragile.
Over time they get more and more robust, as edge cases and events result in more rules and rigidity.
This works until they become so robust that they can no longer cope when something unexpected, i.e. a black swan event, happens.
The only way to counter this is to make a system that improves at coping with disorder/chaos as it encounters it.
[1] https://en.wikipedia.org/wiki/Antifragility
Also reminds me of min/maxing in a multi-player game with an economy.
Sometimes in a balanced game like this, a new min/max strategy is found that quickly becomes the dominant form of wealth generation in the economy. Asset inflation typically explodes in this scenario, driving prices for everything way up.
The game itself typically becomes less fun at this point as min/maxing is now the only valid economic activity, hence intervention by the gods in the form of a patch. But while this fixes the exploit, it does not fix the economic ruin. Players that made massive amounts of wealth during that period can have strangleholds on the economy and cause stagflation as material prices only drop slowly. New/poor players are hobbled with poor purchasing power.
It would be interesting to see how many of the effects in this article could be studied by gamifying these situations.
The problem with overfitting and inability to extrapolate out of sample is variance. The problem with Goodhart's law is bias.
I don't think a training sample is to the full population what a proxy metric is to the objective -- not theoretically, and not practically. The training sample faithfully represents the full population (by definition, if it was randomly selected). Any difference in composition is down to sampling error, and this is known from theory.
When overfitting we are still optimising for the objective, only adapting more to individual data points than desirable. Goodhart's law implies optimising for the wrong thing entirely. We have no theoretical tools to deal with this and I suspect we never will, because it's a problem of subjective judgment.
Overfitting can happen in many ways -- your training objective can be different at train and test time, or as you suggest the datapoints you use can be different at train and test time.
For overfitting induced by datapoints: If you include the datapoints in your problem specification, then you can say they induce bias at test time. If you treat the choice of training datapoints as a random variable, separate from the problem specification, then you can say they induce variance at test time. The difference is essentially semantic though. In general, you can freely move contributions to the error between bias and variance terms by changing which aspects of the modeling framework you define as fixed by the problem definition, and which you take to be stochastic.
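To make that bookkeeping concrete, here's a toy simulation (my own construction, not from the article; the sine target, noise level, and polynomial degrees are arbitrary choices): redraw the training set many times, refit an underparameterised and an overparameterised polynomial, and estimate the bias squared and variance of the predictions on a fixed test grid.

    # Toy bias/variance decomposition: where does test error come from
    # for a too-simple vs. a too-flexible model class?
    import numpy as np

    rng = np.random.default_rng(0)
    true_f = lambda x: np.sin(2 * np.pi * x)
    noise_sd = 0.3
    x_test = np.linspace(0, 1, 50)

    def fit_and_predict(degree, n_train=30):
        x = rng.uniform(0, 1, n_train)
        y = true_f(x) + rng.normal(0, noise_sd, n_train)
        return np.polyval(np.polyfit(x, y, degree), x_test)

    for degree in (1, 9):  # underfit vs. overfit
        preds = np.stack([fit_and_predict(degree) for _ in range(300)])
        bias2 = np.mean((preds.mean(axis=0) - true_f(x_test)) ** 2)
        variance = np.mean(preds.var(axis=0))
        print(f"degree {degree}: bias^2 ~ {bias2:.3f}, variance ~ {variance:.3f}")

Which term you call bias and which variance does depend on what you hold fixed, as the parent says; the simulation just fixes one conventional choice (training sets random, everything else fixed by the problem definition).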
My first impression is also in agreement with the parent. The blog post appears to use some terms loosely in order to make the connection between overfitting and Goodhart's law stronger. For example, calling the training sample a "proxy" and stating that it is a slightly different goal is already leading towards the pre-defined conclusion.
And the reply also leaves me with a similar impression:
> your training objective can be different at train and test time
But this is not overfitting, this is concept drift, a different and well-defined thing in ML.
> the datapoints you use can be different at train and test time
Both train and test data came from the same population. They are just different incomplete random samples.
I guess what I am getting at is that overfitting happens because we know we are training a model on an incomplete representation of the whole. But that representation is not a proxy, as suggested in the article; it is not slightly different from the goal. It's an incomplete piece of the goal.
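A tiny sketch of that distinction (a completely made-up setup; the 0.7 offset is an arbitrary stand-in for a systematically biased proxy): estimating against an incomplete random sample gets better with more data, while optimising against a proxy that measures something slightly different does not.

    # Goal: pick theta minimising E[(theta - y)^2] for y ~ N(5, 1), so theta* = 5.
    # Compare: (a) optimise against a finite random sample of y,
    #          (b) optimise against a proxy z that is systematically off.
    import numpy as np

    rng = np.random.default_rng(1)
    theta_star = 5.0

    for n in (10, 1_000, 100_000):
        y = rng.normal(theta_star, 1.0, n)   # incomplete piece of the real goal
        z = y + 0.7                          # hypothetical biased proxy measurement
        print(f"n={n:>7}:  sample-optimum error {abs(y.mean() - theta_star):.3f},  "
              f"proxy-optimum error {abs(z.mean() - theta_star):.3f}")

The first column shrinks roughly like 1/sqrt(n); the second stays near 0.7 no matter how much data you collect. Overfitting lives in the first column, Goodhart in the second.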
Maybe the ultimate proxy measurement is the pursuit of human economic productivity at the expense of the planet's longevity.
Contrived example: humans need food to survive. We need food, efficiently, since there are so many of us. So we need efficient farming. Efficient means using the smallest amount of land to produce the most biomass (plant, animal). For that, we need single-purpose farmland and artificial feed, fertilizer, and pesticide. To have single-purpose farmland, we purge the land of all other life forms, i.e. we kill all the plants and animals already there. To have the artificial stuff, we need energy. To have energy, we expend fossil energy. To have fossil energy, we, the earth, started capturing solar energy millions of years ago.
Meanwhile, some farmer right now is grazing their cows and chickens on "unindustrialized" land. The cows graze on the pasture, the chickens eat the bugs, their droppings feed the grass, the stomping excites the earthworms, the earthworms aerate the land, the grass captures the sunlight, the sunlight feeds everyone, in a perfect balance. Zero GDP generated.
I think it is mentioned in The Omnivore's Dilemma [1] that, to produce 1 unit of food energy, we use 14 units of fossil fuel energy and 3% of the workforce. Before the industrial age (200-ish years ago), the ratio was 1 unit of food energy to 2 units of solar energy and 90% of the workforce.
Efficiency? Check. Quality of life? Check. All the numbers show that we are improving.
[1] https://www.goodreads.com/book/show/3109.The_Omnivore_s_Dile...
Ironically, we're not very efficient at all with our land usage today. Nearly 60% of global agricultural land is used for beef (either for pasture or for growing feed) and yet it accounts for only 2% of our calories. And about 50% of all food is wasted – simply thrown away.
If we went with the latter, we would need more land to feed the same number of people. There is more "nature" when we leave the forest as is and don't cut it down to convert it into poorly run "human land".
> There were more cereal calories per person in 2020 than in 1992. And this abundance was brought about without massive increases in the area being farmed. While industrial emissions rocketed, emissions due to land-use change fell by a quarter.
We were able to more than triple our output in that period.
https://www.economist.com/special-report/2022/11/01/a-lot-ca...
Not sure I agree. The problem with overfitting is fitting too closely to the data points at hand, but you might still be measuring the right thing, as discussed in other posts here.
The problem with Goodhart's law is, as I've always taken it, closer to the Lucas critique in economics than to the bias-variance trade-off in machine learning. Namely, when it comes to human behavior, structural relations that are very real and present in the training data may break down once you put pressure on them for control purposes.
When you use machine learning to, say, detect skin cancer, you might accidentally learn the markers put into the images to highlight the cancerous region rather than the skin properties - that's overfitting. But the skin cells themselves don't care - they won't alter their behavior whether you detect them correctly (and remove them) or not. If you use a model to find a relation between some input and a human behavior output, humans might very much start to change their behavioral responses once you start to make changes. The entire relation breaks down, even if you've measured it correctly beforehand, because people, unlike particles, have their own interests.
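The marker half of that example is easy to reproduce on synthetic data. Here is a minimal sketch (my own toy setup, nothing to do with a real dermatology pipeline; the "real" and "marker" features are invented): a spurious annotation feature tracks the label in training but not at test time, so the model that leans on it falls apart.

    # Shortcut learning: a spurious "marker" feature tracks the label in training
    # but not at test time, so the model that relies on it fails on test data.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    def make_data(n, marker_follows_label):
        y = rng.integers(0, 2, n)
        real = y + rng.normal(0, 1.5, n)        # genuine but noisy signal
        marker = y.astype(float) if marker_follows_label else rng.integers(0, 2, n).astype(float)
        return np.column_stack([real, marker]), y

    X_train, y_train = make_data(2_000, marker_follows_label=True)
    X_test, y_test = make_data(2_000, marker_follows_label=False)

    clf = LogisticRegression(max_iter=1_000).fit(X_train, y_train)
    print("train accuracy:", clf.score(X_train, y_train))   # looks near perfect
    print("test accuracy:", clf.score(X_test, y_test))      # far worse
    print("weights [real, marker]:", clf.coef_[0])           # marker dominates

The larger point still stands, though: the skin cells don't change their behaviour in response to the classifier, whereas people do.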
A note that the datapoints you train on are part of the training objective. If you are using different data at test time than you use at training time, then you are measuring the wrong thing during training, the same as if you used a different loss function at training time.
Also -- as you say, feedback loops and non-stationarity make everything more complex, and are ubiquitous in the real world! But in machine learning we also see overfitting phenomena in systems with feedback loops -- e.g. in reinforcement learning or robotics, where the system changes depending on the agent's behavior.
(blog author here)
Cool that you're responding here. Well, regarding robotics, I'm sure there's all sorts of problems when it comes to training models, but I'm not sure that Goodhart's law is one of them, unless you can give a concrete example. It's really geared towards social problems. Sure, some natural systems may also exhibit the kind of adaptive response that leads to the breakdown of structural relations (e.g. the cancer cells mentioned before may evolve to avoid detection by the AI), but that happens on completely different timescales.
Except that AI models, especially large deep ones, do NOT overfit like the author thinks. They exhibit what is now called "deep double descent" -- the validation error declines, then increases, and then declines again:
https://openai.com/blog/deep-double-descent/
A question I've pondered for a while is whether complex systems in the real world also exhibit double descent.
For example, transitioning an online application that currently serves thousands of users to one that can serve millions and then billions requires reorganizing all code, processes, and infrastructure, making software development harder at first, but easier down the road. Anyone who has gone through it will tell you that it's like going through "phase transitions" that require "getting over humps."
Similarly, startups that want to transition from small-scale to mid-size and then to large-scale businesses must increase operational complexity, making everything harder at first, but easier down the road. Anyone who has been with a startup that has grown from tiny two-person shop to large corporation will tell you that it's like going through "phase transitions" that require "getting over humps."
Finally, it may be that whole countries and economies that want to improve the lives of their citizens may have to go through an interim period of less efficiency, making everything harder at first, but easier down the road. It may be that human progress involves "phase transitions" that require "getting over humps."
A brief note that I do discuss the deep double descent phenomenon in the blog. See the section starting with "One of the best understood causes of extreme overfitting is that the expressivity of the model being trained too closely matches the complexity of the proxy task."
I avoided using the actual term double descent, since I thought it would add unnecessary complexity. Lesson learned for next time -- I should have at least had an endnote using that terminology!
As you probably know, the big deal about double descent is that once sufficiently large AI models cross the so-called "interpolation threshold" in training, and get over the hump, they start generalizing better -- the opposite of overfitting. State-of-the-art performance in fact requires getting over the hump. As far as I can tell, you did not mention any of that explicitly anywhere in your post.
Also, all your plots show only the classical overfitting curve, not the actual curve we now see all the time with larger AI models like Transformers.
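For anyone who wants to see the second descent rather than take it on faith, here is a small random-features sketch (my own toy construction, not from the post; the ReLU random-features setup and the sine target are arbitrary choices): with minimum-norm least squares, test error typically spikes when the feature count is near the number of training points and then falls again beyond the interpolation threshold.

    # Double descent with random ReLU features and minimum-norm least squares.
    # Expect test error to peak near n_features ~ n_train, then decline again.
    import numpy as np

    rng = np.random.default_rng(0)
    n_train, n_test, d = 100, 2_000, 10

    def target(X):
        return np.sin(X @ np.ones(d) / np.sqrt(d))

    X_train = rng.normal(size=(n_train, d))
    X_test = rng.normal(size=(n_test, d))
    y_train = target(X_train) + rng.normal(0, 0.1, n_train)
    y_test = target(X_test)

    def relu_features(X, W):
        return np.maximum(X @ W, 0.0)

    for n_features in (10, 50, 90, 100, 110, 200, 1_000, 5_000):
        W = rng.normal(size=(d, n_features)) / np.sqrt(d)
        # lstsq returns the minimum-norm solution once the system is underdetermined
        coef, *_ = np.linalg.lstsq(relu_features(X_train, W), y_train, rcond=None)
        test_mse = np.mean((relu_features(X_test, W) @ coef - y_test) ** 2)
        print(f"{n_features:>5} features: test MSE {test_mse:.3f}")

A single random seed is noisy; averaging a few repetitions per width makes the peak and the second descent much smoother.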