Metadata
- Author: Cedric Chin
- Full Title: Goodhart’s Law Isn’t as Useful as You Might Think
Highlights
- One of the more interesting things about the WBR is that the folks at Amazon have developed a number of ways to solve for Goodhart’s Law. We’ll use the set of practices around the WBR as an example in a bit. But the main idea that I want to highlight here is that the WBR’s practices came from a body of work; that body of work offers us a bunch of principles to use in our own contexts. (View Highlight)
- There’s a fairly interesting paper by David Manheim and Scott Garrabrant titled Categorising Variants of Goodhart’s Law that lays out four categories of the phenomenon. I summarised the paper a number of years back, in which I talked about some of their proposed solutions for each of the categories. I do recommend the paper if you’d like a more general take on various forms of Goodhart’s Law — which is useful if you’re into, say, AI alignment research. But I did not think highly of the solutions — they seemed too academic, too theoretical, for my taste. (View Highlight)
- The first step is to turn Goodhart’s Law into a narrower, more actionable formulation. The one that I like the most is from Deming contemporary Donald Wheeler, who writes, in Understanding Variation:
When people are pressured to meet a target value there are three ways they can proceed:
- They can work to improve the system
- They can distort the system
- Or they can distort the data (View Highlight)
- Say that you’re working in a widget factory, and management has decided you’re supposed to produce 10,000 widgets per month. If production this month is above the target, you may be tempted to stockpile the surplus and count it toward next month’s quota (distorting the system). If production is below target, you may be tempted to bring back skids of finished product from the warehouse, unpack them, load them back onto the conveyor belt, and have the automatic counting machine at the end of the production line count the product as finished units (thus distorting the data). Of course, at the end of the year this deception would show up as an inventory shortage, and as plant manager, you’re likely to be fired. But if there is high, unyielding pressure to meet production targets in the short term, and no time for process improvement, the common response is to resort to trickery. Wheeler writes: (View Highlight)
- Notice how the emphasis upon meeting the production target was the origin of all the turmoil in this case. People were fired and hired, money was spent, all because the production foreman did not like to explain, month after month, why they had not met the production quota. (View Highlight)
- This list of possible responses to quantitative targets is attributed to Brian Joiner, who ‘came up with this list several years ago’ — likely in the 80s. I immediately glommed onto this list as a more useful formulation than Goodhart’s Law. Joiner’s list suggests a number of solutions:
- Make it difficult to distort the system.
- Make it difficult to distort the data, and
- Give people the slack necessary to improve the system (a tricky thing to do, which we’ve covered elsewhere). (View Highlight)
- The third point is really important. Preventing distortions is just one half of the solution. Avoiding Goodhart’s Law requires you to also give people the space to improve the system. Which begs the question: how do you encourage people to do just that? (View Highlight)
- There’s a nuanced point that Wheeler makes immediately after giving us this list. He writes (emphasis mine):
Before you can improve any system you must listen to the voice of the system (the Voice of the Process). Then you must understand how the inputs affect the outputs of the system. Finally, you must be able to change the inputs (and possibly the system) in order to achieve the desired results. This will require sustained effort, constancy of purpose, and an environment where continual improvement is the operating philosophy.
Comparing numbers to specifications will not lead to the improvement of the process. Specifications are the Voice of the Customer, not the Voice of the Process. The specification approach does not reveal any insights into how the process works.
So if you only compare the data to the specifications, then you will be unable to improve the system, and will therefore be left with only the last two ways of meeting your goal (i.e. distorting the system, or distorting the data). When a current value is compared to an arbitrary numerical target (… it) will always create a temptation to make the data look favourable. And distortion is always easier than working to improve the system. (View Highlight)
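Wheeler’s tool for listening to the Voice of the Process is the process behaviour chart (the XmR chart), which computes “natural process limits” from the data itself rather than from any target. The sketch below is mine, not from the article: the monthly production numbers are invented to match the widget-factory scenario, though the 2.66 moving-range multiplier is the standard XmR constant.

```python
# Sketch: XmR-chart natural process limits (Voice of the Process)
# vs. an arbitrary target (Voice of the Customer). Data is invented.

def xmr_limits(values):
    """Natural process limits computed from the data itself,
    using the standard XmR constant 2.66 for individual values."""
    mean = sum(values) / len(values)
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    avg_mr = sum(moving_ranges) / len(moving_ranges)
    return mean - 2.66 * avg_mr, mean, mean + 2.66 * avg_mr

# Hypothetical monthly widget production
production = [9700, 9900, 10150, 9600, 9850, 10050, 9750, 9950]
lower, centre, upper = xmr_limits(production)
target = 10000  # the Voice of the Customer

for month, units in enumerate(production, start=1):
    routine = lower <= units <= upper
    print(f"month {month}: {units} units, "
          f"{'routine variation' if routine else 'investigate!'} "
          f"(target {'met' if units >= target else 'missed'})")
```

Every month in this made-up series falls inside the natural process limits even though most of them “miss” the 10,000-unit target, which is precisely Wheeler’s point: comparing data to specifications tells you nothing about how the process actually behaves.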
- ‘Voice of the Customer’ and ‘Voice of the Process’ are fancy ways to say something simple. A target, goal, or budget usually represents some kind of ‘specification’ — some form of demand from the customer or from company management. This is the so-called ‘Voice of the Customer’. The ‘Voice of the Process’, on the other hand, is how the process actually works. (View Highlight)
- Most of us, when faced with a goal, will fixate on the difference between the current output and our desired target. In other words, we think the way to hit our goals is to obsess over the goal. (View Highlight)
- The natural thing to do is to set a 2kg weight reduction goal each month, weigh yourself every morning, and then pat yourself on the back if you’re on track to hitting your target reduction for the month, or exercise more / eat less if you’re not. A similar thing might occur in business: “We need to get to 100 new deals by the end of the quarter, which means 33 new deals per month, now GET ON IT!” This is a naive view of process improvement, and while it may work for something as simple as weight control, it is not going to work for the kind of complex processes that you would find in a typical business. (View Highlight)
- Let’s think about the weight control example first. Losing weight is a process with two well-known inputs: calories in (what you eat) and calories out (what you burn through exercise). This means that the primary difficulty of hitting a weight loss goal is to figure out how your body responds to different types of exercise or different types of foods, and how these new habits might fit into your daily life (this assumes you’re disciplined enough to stick to those habits in the first place, which, well, you know). (View Highlight)
- By contrast, business processes are often processes where you don’t know the inputs to your desired output. So the first step is to figure out what those inputs are, and then figure out what subset of those you can influence, and then, finally, figure out the causal relationships between the input metrics and output metrics. A causal relationship looks something like: “an X% lift in input metric A usually leads to a Y% lift in output metric B. Oh, and output metric B is affected by five different input metrics, of which A is merely one”. It is not an accident that the best growth teams are able to say things like “a 30% increase in newsletter performance should drive a 5% improvement in new account signups” — if you ever hear something like this, you should assume that they’ve listened to the Voice of the Process very, very carefully. (View Highlight)
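A crude way to arrive at a statement like “an X% lift in input metric A usually leads to a Y% lift in output metric B” is to regress week-over-week changes in the output on changes in the input. This sketch is mine, not the author’s; the metric names and weekly numbers are invented, and the slope measures association, not proof of causation.

```python
# Sketch: estimating "an X% lift in input A tends to come with a Y% lift
# in output B" via least-squares regression on weekly percentage changes.
# All data below is invented for illustration.

def pct_changes(series):
    """Week-over-week percentage changes."""
    return [(b - a) / a * 100 for a, b in zip(series, series[1:])]

def slope(xs, ys):
    """Ordinary least-squares slope of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Weekly values: newsletter sends (input A) and account signups (output B)
newsletter = [1000, 1300, 1100, 1500, 1400, 1800]
signups    = [200,  210,  205,  218,  214,  228]

beta = slope(pct_changes(newsletter), pct_changes(signups))
print(f"a 10% lift in A is associated with a {10 * beta:.1f}% lift in B")
# Association only: B is usually driven by several inputs, of which A
# is merely one, so a real growth team would control for the others.
```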
- This is a long-winded way of saying that if you want to improve some process, you have to ignore the goal first, in favour of examining the process itself. On the face of it, this is common sense: you cannot expect your team to improve some metric if that metric isn’t directly controllable. No, you must first work out what set of controllable input metrics leads to the output-metric outcomes you desire, before you can even begin to talk about hitting targets. You’ll need to figure out what levers to pull in order to hit 10,000 units a month; you’ll need to figure out what drivers exist before you push for 100 new deals a quarter. The way you get to this state is nothing at all like obsessively watching your target, measuring how far off you are from it, and yelling at your team about the underperformance — down that path lies Goodhart’s Law. (View Highlight)
- in Wheeler’s words: “You cannot improve a process by listening to the Voice of the Customer. You can only improve a process by listening to the Voice of the Process.” (View Highlight)
- My personal belief is that Amazon’s adoption of the WBR may be traced back to this period of crisis, and the format of the meeting was influenced by folk with strong Operations Research backgrounds. How else to explain the uncanny implementation of just about every principle in Donald Wheeler’s Understanding Variation? (View Highlight)
- The Amazon WBR is a weekly operational metrics review meeting in which Amazon’s leadership team gathers and reviews 400-500 metrics within 60-90 minutes. It occurs — or so I’m told — every Wednesday morning. I should note that a) a more detailed description of the WBR may be found in Chapter 6 of Working Backwards, and that b) the authors note that there is no single, standard playbook for using and reviewing metrics across Amazon; the details of the WBR here are based on their own experiences, as well as on the recollections of various Amazon execs they talked to whilst writing the book. (View Highlight)
- The way that a WBR deck is constructed is instructive. Broadly speaking, Amazon divides its metrics into ‘controllable input metrics’ and ‘output metrics’. Output metrics are not typically discussed in detail, because there is no way of directly influencing them. (Yes, Amazon leaders understand that they are evaluated based on their output metrics, but they recognise these are lagging indicators and are not directly actionable). Instead, the majority of discussions during WBR meetings focus on exceptions and trends in controllable input metrics. In other words, a metrics owner is expected to explain abnormal variation or a worrying trend (slowing growth rate, say, or if a metric is lagging behind target) — and is expected to announce “nothing to see here” if the metric is within normal variance and on track to hit target. In the latter case, the entire room glances at the metric for a second, and then moves on to the next graph. (View Highlight)
- How do you come up with the right set of controllable input metrics? The short answer is that you do so by trial and error. Let’s pretend that you want to influence ‘Marketing Qualified Leads’ (or MQLs) and you hypothesise that ‘percentage of newsletters sent that is promotional’, ‘number of webinars conducted per week’ and ‘number of YouTube videos produced’ are controllable input metrics that affect this particular output metric. You include these three metrics in your WBR metrics deck, and charge the various metrics owners to drive up those numbers. Over the period of a few months (and recall, the WBR is conducted every week) your leadership team will soon say things like “Hmm, we’ve been driving up promotional newsletters for a while now but there doesn’t seem to be a big difference in MQLs; maybe we should stop doing that” or “Number of webinars seems pretty predictive of a bump in MQLs, but why is the bump in numbers this week so large? You say it’s because of the joint webinar we did with Google Cloud? Well, should we track ‘number of webinars executed with a partner’ as a new controllable input metric and see if we can drive that up?” (View Highlight)
- You can see how picking the wrong controllable input metric temporarily created a Goodhart’s Law type of situation within Amazon. But the nature of the WBR prevented the situation from persisting. Implicit in the WBR process is the understanding that the initial controllable input metrics you pick might be the wrong ones. As a result, the WBR acts as a safety net — a weekly checkpoint to examine the relationships between controllable input metrics (which are set up as targets for operational teams) and corresponding output metrics (which represent the fundamental business outcomes that Amazon desires). If the relationship is non-existent or negative, Amazon’s leadership knows to kill that particular input metric. Said differently, the WBR assumes that controllable input metrics are only important if they drive desirable outcomes — if the metric is wrong, or the metrics stops driving output metrics at some point in the future, the metric is simply dropped. (View Highlight)
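The keep-or-kill logic described here can be sketched as a simple screening loop: check how each hypothesised input metric co-moves with the output metric, and flag weak or negative relationships as candidates to drop. This is my own illustration, not Amazon’s actual method; the metric names, weekly numbers, and the 0.5 threshold are all invented.

```python
# Sketch of a WBR-style "keep or kill" screen: for each hypothesised
# controllable input metric, measure how its weekly values co-move with
# the output metric (weekly MQLs), and flag weak relationships.
# Metric names, data, and the threshold are invented for illustration.

def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

weekly_mqls = [40, 44, 43, 50, 48, 55, 52, 60]
candidate_inputs = {
    "promo_newsletter_pct": [20, 30, 25, 35, 28, 22, 33, 26],
    "webinars_per_week":    [1, 2, 2, 3, 2, 4, 3, 5],
    "youtube_videos":       [5, 4, 6, 5, 4, 6, 5, 4],
}

for name, series in candidate_inputs.items():
    r = correlation(series, weekly_mqls)
    verdict = "keep driving" if r > 0.5 else "candidate to kill"
    print(f"{name}: r = {r:+.2f} -> {verdict}")
```

A real review would be far less mechanical than a correlation threshold, since it draws on the leadership team’s causal model of the business, but the weekly cadence of inspect-then-drop is the same.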
- This is the third solution in Joiner’s list (“give people enough slack to improve the system so that they do so”). The WBR simply functions as a mechanism to let that happen. (View Highlight)
- A larger point I want to make is that the WBR becomes a weekly synchronisation mechanism for the company’s entire leadership team. This is more important than it might first seem. Colin told me that the week-in, week-out cadence explains how Amazon’s leadership is eventually able to go through 400-500 metrics in a single hour. An external observer might be overwhelmed. But an insider who has been engaged in the WBR process over a period of months won’t see 500 different metrics — instead, their felt experience of a WBR deck is that they are looking at clusters of controllable input metrics that map to other clusters of output metrics. In other words, the WBR process forces them to build a causal model of the business in their heads. (View Highlight)
- Repeated viewings of the same set of metrics will eventually turn into a ‘fingertip-feel’ of the business — execs will be able to say things like “this feels wrong, this dip is more than expected seasonal variation, what’s up?” — which can really only happen if you’re looking at numbers and trends every week. (This is why glancing at metrics is important, even when the metric owner announces “nothing to see here.”) Most importantly, though, the entire leadership team shares in the same causal model, since they would have been present for the laborious trial and error process to identify, control, and then manipulate each and every metric that mattered. (View Highlight)
- Incidentally, this also explains why there are so many goddamn metrics in a typical WBR deck. Any output metric of importance in a business will typically be influenced by multiple input metrics. Pretending otherwise is to deny the multi-faceted nature of business. This means that if you track 50 output metrics, the number of controllable input metrics you’ll need to examine will greatly exceed that number — and may change from week to week! However you cut it, your WBR deck will expand to become much larger than an outside observer might expect. This is why it is often a mistake to ‘present a small, simple set of metrics’ or to anchor on a single ‘North Star metric’ for a business. (View Highlight)
- One of the things that I find perpetually irritating about using data in operations is that even proposing to measure outcomes often sets off resistance along the lines of “oh, but Goodhart’s Law mumble mumble blah blah.” Which, yes, Goodhart’s Law is a thing, and really bad things happen when it occurs. But then what’s your solution for preventing it? It can’t be that you want your org to run without numbers. And it can’t be that you eschew quantitative goals forever! (View Highlight)
- I think the biggest lesson of this blog post is just how difficult it is to be data driven — and to do it properly. A friend of mine likens data-driven decision making to nuclear power: extremely powerful if used correctly, but so very easy to get wrong, and when things go wrong the whole thing blows up in your face. I think about this analogy a lot. (View Highlight)
- So what have we covered today? I opened this piece by introducing the common sense idea that Goodhart’s Law isn’t particularly useful. Instead, I argued that Donald Wheeler’s Understanding Variation gives us a more actionable formulation of Goodhart’s Law (quoting an observation from Brian Joiner, from the 80s, and drawing more broadly from the field of Statistical Process Control). Joiner points out that when you’re incentivising organisational behaviour with metrics, there are really only three ways people might respond: 1) they might improve the system, 2) they might distort the system, or 3) they might distort the data. (View Highlight)
- Joiner’s list immediately suggests three solutions: a) you make it hard to distort the system, b) you make it hard to distort the data, and finally c) you make it possible for people to change the inputs to the system, or the system itself. That last solution isn’t as easy as you might think — but at least we know it’s necessary to avoid Goodhart’s Law. (View Highlight)
- A brief note so you know the source of these principles: many of these ideas were worked out by W. Edwards Deming and his colleagues in the 70s and 80s, as part of a body of work known today as ‘Statistical Process Control’. I’ve talked a little about how I fell into this rabbit hole in the recent past; the short version is that I did some work for Colin Bryar to explicate Amazon’s Weekly Business Review process for his company’s clients, and during that project I discovered that many of the ideas in the WBR were actually taken from the field of Statistical Process Control. As a result, I started digging into SPC to see what other principles or ideas might be applicable to business. (View Highlight)