# Regression Abuse

As I write this, I realize it takes me a while to get to climate.  Stick with me; there is an important climate point.

The process goes by a number of names, but multi-variate regression is a mathematical technique (really only made practical by computer processing power) for determining a numerical relationship between one output variable and one or more input variables.

Regression is absolutely blind to the real world — it only knows numbers.  What do I mean by this?  Take the famous example of Washington Redskins football and presidential elections:

> For nearly three quarters of a century, the Redskins have successfully predicted the outcome of each and every presidential election. It all began in 1933 when the Boston Braves changed their name to the Redskins, and since that time, the result of the team’s final home game before the election has always correctly picked who will lead the nation for the next four years.
>
> And the formula is simple. If the Redskins win, the incumbent wins. If the Redskins lose, the challenger takes office.

Plug all of this into a regression and it would show a direct, predictive correlation between Redskins football and Presidential winners, with a high degree of certainty.  But we denizens of the real world would know that this is insane.  A meaningless coincidence with absolutely no predictive power.

You won’t often find me whipping out nuggets from my time at the Harvard Business School, because I have not always found a lot of that program to be relevant to my day-to-day business experience.  But one thing I do remember is my managerial economics teacher hammering us over and over with one caveat to regression analysis:

> Don’t use regression analysis to go on fishing expeditions.  Include only variables that you have real-world evidence actually affect the output variable to which you are regressing.

Let’s say one wanted to model the historic behavior of Exxon stock.  One approach would be to plug in a thousand or so variables that we could find in economics databases, crank the model up, and just see what comes out.  This is a fishing expedition.  With that many variables, by the math, you are almost bound to get a good fit (one characteristic of regressions is that adding an additional variable, no matter how irrelevant, never worsens the fit).   And the odds are high you will end up with relationships to variables that look strong but are only coincidental, like the Redskins and elections.

Instead, I was taught to be thoughtful.  Interest rates, oil prices, gold prices, and value of the dollar are all sensible inputs to Exxon stock price.  But at this point my professor would have a further caveat.  He would say that one needs to have an expectation of the sign of the relationship.  In other words, I should have a theory in advance not just that oil prices affect Exxon stock price, but whether we expect higher oil prices to increase or decrease Exxon stock price.   In this he was echoing my freshman physics professor, who used to always say in the lab — if you are uncertain about the sign of a relationship, then you don’t really understand the process at all.

So let’s say we ran the Exxon stock price model expecting higher oil prices to increase Exxon stock price, and our regression result actually showed the opposite, a strong relationship but with the opposite sign – higher oil prices seem to correlate better with lower Exxon stock prices.  So do we just accept this finding?  Do we go out and bet a fortune on it tomorrow?  I sure wouldn’t.

No, what we do instead is take this as a sign that we don’t know enough and need to research more.  Maybe my initial assumption was right, but my data is corrupt.  Maybe I was right about the relationship, but in the study period some other more powerful variable was dominating  (example – oil prices might have increased during the 1929 stock market crash, but all the oil company stocks were going down for other reasons).  It might be there is no relation between oil prices and Exxon stock prices.  Or it might be I was wrong, that in fact Exxon is dominated by refining and marketing rather than oil production and actually is worse off with higher oil prices.    But all of this points to needed research – I am not going to write an article immediately after my regression results pop out and say “New Study: Exxon stock prices vary inversely with oil prices” without doing more work to study what is going on.

Which brings us to climate (finally!) and temperature proxies.  We obviously did not have accurate thermometers measuring temperature in the year 1200, but we would still like to know something about temperatures.  One way to do this is to look at certain physical phenomena, particularly natural processes that result in some sort of annual layers, and try to infer things from these layers.  Tree rings are the most common example – tree ring widths can be related to temperature and precipitation and other climate variables, so that by measuring tree ring widths (each of which can be matched to a specific year) we can infer things about climate in past years.

There are problems with tree rings for temperature measurement (not the least of which is that more things than just temperature affect ring width), so scientists search for other “proxies” of temperature.  One such proxy is the layered sediment found in certain northern lakes, which builds up annually like tree rings.  Scientists had a theory that the amount of organic matter in a sediment layer was related to the amount of growth activity in that year, which in turn increased with temperature  (it is always ironic to me that climate scientists who talk about global warming catastrophe rely on increased growth and life in proxies to measure higher temperature).  Because more organic matter reduces the X-ray density of samples, an inverse relationship between X-ray density and temperature could be formulated — in this case we will look at the Tiljander study of lake sediments.   Here is one core result:

The yellow band with lower X-ray density (meaning higher temperatures by the way the proxy is understood) corresponds pretty well with the Medieval Warm Period that is fairly well documented, at least in Europe (this proxy is from Finland).  The big drop in modern times is thought by most (including the original study authors) to be corrupted data, where modern agriculture has disrupted the sediments and what flows into the lake, eliminating its usefulness as a meaningful proxy.  It doesn’t mean that temperatures have dropped lately in the area.

But now the interesting part.  Michael Mann, among others, used this proxy series (despite the well-known corruption) among a number of others in an attempt to model the last thousand years or so of global temperature history.   To simplify what is in fact more complicated, his models regress each proxy series like this against measured temperatures over the last 100 years or so.  But look at the last 100 years on this graph.  Measured temperatures are going up, so his regression locked onto this proxy and … flipped the sign.  In effect, it reversed the proxy.  As far as his models are concerned, this proxy is averaged in with values of the opposite sign, like this:

A number of folks, particularly Steve McIntyre, have called Mann on this, saying that he can’t flip the proxy upside down.  Mann’s response is that the regression doesn’t care about the sign, and that it’s all in the math.

Hopefully, after our background exposition, you see the problem.  Mann started with a theory that more organic material in lake sediments (as shown by lower x-ray densities) correlated with higher temperatures.  But his regression showed the opposite relationship — and he just accepted this, presumably because it yielded the hockey stick shape he wanted.  But there is absolutely no physical theory as to why our historic understanding of organic matter deposition in lakes should be reversed, and Mann has not even bothered to provide one.  In fact, he says he doesn’t even need to.

This mistake (fraud?) is even more egregious because it is clear that the jump in x-ray values in recent years is due to a spurious signal and corruption of the data.  Mann’s algorithm is locking into meaningless noise, and converting it into a “signal” that there is a hockey stick shape to the proxy data.

As McIntyre concludes:

> In Mann et al 2008, there is a truly remarkable example of opportunistic after-the-fact sign selection, which, in addition, beautifully illustrates the concept of spurious regression, a concept that seems to baffle signal mining paleoclimatologists.

Postscript: If you want an even more absurd example of this data-mining phenomenon, look no further than Steig’s study of Antarctic temperatures.   In the case of proxies, it is possible (though unlikely) that we might really reverse our understanding of how the proxy works based on the regression results.  But in Steig, they were taking individual temperature station locations and creating a relationship between them and a synthesized continental temperature number.  Steig used regression techniques to weight the various thermometers in rolling up the continental measure.  But five of the weights were negative!!

As I wrote then,

> Do you see the problem?  Five stations actually have negative weights!  Basically, this means that in rolling up these stations, these five thermometers were used upside down!  Increases in these temperatures in these stations cause the reconstructed continental average to decrease, and vice versa.  Of course, this makes zero sense, and is a great example of scientists wallowing in the numbers and forgetting they are supposed to have a physical reality.  Michael Mann has been quoted as saying the multi-variable regression analysis doesn’t care as to the orientation (positive or negative) of the correlation.  This is literally true, but what he forgets is that while the math may not care, Nature does.

# More Proxy Hijinx

Steve McIntyre digs into more proxy hijinx from the usual suspects.  This is a pretty good summary of what he tends to find, time and again in these studies:

> The problem with these sorts of studies is that no class of proxy (tree ring, ice core isotopes) is unambiguously correlated to temperature and, over and over again, authors pick proxies that confirm their bias and discard proxies that do not. This problem is exacerbated by author pre-knowledge of what individual proxies look like, leading to biased selection of certain proxies over and over again into these sorts of studies.

The temperature proxy world seems to have developed into a mono-culture, with the same 10 guys creating new studies, doing peer review, and leading IPCC sub-groups.  The most interesting issue McIntyre raises is that this new study again uses proxies “upside down.”  I explained this issue more here and here, but a summary is:

Scientists are trying to reconstruct past climate variables like temperature and precipitation from proxies such as tree rings.  They begin with a relationship they believe exists based on a physical understanding of a particular system – ie, for tree rings, trees grow faster when it’s warm, so tree rings are wider in warm years.  But as they manipulate the data over and over in their computers, they start to lose touch with this physical reality.

> …. in one temperature reconstruction, scientists have changed the relationship opportunistically between the proxy and temperature, reversing their physical understanding of the process and how similar proxies are handled in the same study, all in order to get the result they want to get.

Sometimes in modeling and data analysis one can get so deep in the math that one forgets there is a physical reality those numbers are supposed to represent.  This is a common theme on this site, and a good example was here.

Jeff Id, writing at Watts Up With That, brings us another example from Steig’s study on Antarctic temperature changes.  In this study, one step Steig takes is to reconstruct older, pre-satellite continental temperature averages from station data at a few discrete stations.  To do so, he uses more recent data to create weighting factors for the individual stations.  In some sense, this is basically regression analysis, to see what combination of weighting factors times station data since 1982 best fits the continental averages from the satellite.

Here are the weighting factors the study came up with:

Do you see the problem?  Five stations actually have negative weights!  Basically, this means that in rolling up these stations, these five thermometers were used upside down!  Increases in these temperatures in these stations cause the reconstructed continental average to decrease, and vice versa.  Of course, this makes zero sense, and is a great example of scientists wallowing in the numbers and forgetting they are supposed to have a physical reality.  Michael Mann has been quoted as saying the multi-variable regression analysis doesn’t care as to the orientation (positive or negative) of the correlation.  This is literally true, but what he forgets is that while the math may not care, Nature does.

For those who don’t follow, let me give you an example.  Let’s say we have market prices in a number of cities for a certain product, and we want to come up with an average.  To do so, we will have to weight the various local prices based on the sizes of the cities, or perhaps their populations, or whatever.  But the one thing we can almost certainly predict is that none of the individual city weights will be negative.  We won’t, for example, ever find that the average western price of a product goes up because one component of the average, say the price in Portland, goes down.  This flies in the face of our understanding of how an arithmetic average should work.

It may happen that in certain time periods the price in Portland went down in the same month that the Western average went up, but the decline in price in Portland did not drive the Western average up — in fact, its decline had to have actually limited the growth of the Western average below what it would have been had Portland also increased.   Someone looking at that one month and not understanding the underlying process might draw the conclusion that prices in Portland were related to the Western average price by a negative coefficient, but that conclusion would be wrong.
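The negative-weight pathology is easy to reproduce with a toy rollup (the numbers below are mine, not Steig’s data): regress a “continental” series on two nearly identical station series, solving the 2x2 normal equations by hand, and unconstrained least squares cheerfully hands one station a negative weight even though both stations track the continent.

```python
def center(xs):
    """Subtract the mean so we can ignore the intercept term."""
    m = sum(xs) / len(xs)
    return [x - m for x in xs]

station_a = [0.0, 1.0, 2.0, 3.0, 4.0]    # warming trend
station_b = [0.2, 0.9, 2.3, 2.8, 4.1]    # same trend plus small noise
continent = [-0.1, 1.2, 1.8, 3.1, 4.0]   # same trend plus other noise

a, b, y = center(station_a), center(station_b), center(continent)
saa = sum(x * x for x in a)
sbb = sum(x * x for x in b)
sab = sum(p * q for p, q in zip(a, b))
say = sum(p * q for p, q in zip(a, y))
sby = sum(p * q for p, q in zip(b, y))

# Solve the 2x2 normal equations for the station weights
det = saa * sbb - sab * sab
w_a = (say * sbb - sab * sby) / det
w_b = (saa * sby - sab * say) / det
print(f"weight A = {w_a:+.2f}, weight B = {w_b:+.2f}")  # B comes out negative
```

Because the two stations are nearly collinear, the fit improves slightly by loading up on one and subtracting the other, which is mathematically fine and physically absurd for an average of thermometers.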

The Id post goes on to list a number of other failings of the Steig study on Antarctica, as does this post.  Years ago I wrote an article arguing that while the GISS and other bodies claim they have a statistical method for eliminating individual biases of measurement stations in their global averages, it appeared to me that all they were doing was spreading the warming bias around a larger geographic area like peanut butter.  Steig’s study appears to do the same thing, spreading the warming from the Antarctic Peninsula across the whole continent, in part based on its choice to use just three PCs, a number that is both oddly small and coincidentally exactly the choice required to get the maximum warming value from their methodology.

# Numbers Divorced from Reality

This article on Climate Audit really gets at an issue that bothers many skeptics about the state of climate science:  the profession seems to spend so much time manipulating numbers in models and computer systems that they start to forget that those numbers are supposed to have physical meaning.

I discussed the phenomenon once before.  Scientists are trying to reconstruct past climate variables like temperature and precipitation from proxies such as tree rings.  They begin with a relationship they believe exists based on an understanding of a particular system – ie, for tree rings, trees grow faster when it’s warm so tree rings are wider in warm years.  But as they manipulate the data over and over in their computers, they start to lose touch with this physical reality.

In this particular example, Steve McIntyre shows how, in one temperature reconstruction, scientists have changed the relationship opportunistically between the proxy and temperature, reversing their physical understanding of the process and how similar proxies are handled in the same study, all in order to get the result they want to get.

McIntyre’s discussion may be too arcane for some, so let me give you an example.  Suppose that, as a graduate student, I have been tasked with proving that people are getting taller over time and estimating by how much.  As it turns out, I don’t have access to good historic height data, but by a fluke I inherited a hundred years of sales records from about 10 different shoe companies.  After talking to some medical experts, I gain some confidence that shoe size is positively correlated with height.  I therefore start collating my 10 series of shoe sales data, pursuing the original theory that the average size of the shoes sold should correlate with the average height of the target population.

It turns out that for four of my data sets, I find a nice pattern of steadily rising shoe sizes over time, reflecting my intuition that people’s height and shoe size should be increasing over time.  In three of the data sets I find the results to be equivocal — there is no long-term trend in the sizes of shoes sold and the average size jumps around a lot.  In the final three data sets, there is actually a fairly clear negative trend – shoe sizes are decreasing over time.

So what would you say if I did the following:

- Kept the four positive data sets and used them as-is
- Threw out the three equivocal data sets
- Kept the three negative data sets, but inverted them
- Built a model for historic human heights based on seven data sets – four with positive coefficients between shoe size and height and three with negative coefficients.

My correlation coefficients are going to be really good, in part because I have flipped some of the data sets and in part because I have thrown out the ones that don’t fit my initial bias as to what the answer should be.  Have I done good science?  Would you trust my output?  No?

Well what I describe is identical to how many of the historical temperature reconstruction studies have been executed  (well, not quite — I have left out a number of other mistakes like smoothing before coefficients are derived and using de-trended data).

Mann once wrote that multivariate regression methods don’t care about the orientation of the proxy. This is strictly true – the math does not care. But people who recognize that there is an underlying physical reality that makes a proxy a proxy do care.

It makes no sense to physically change the sign of the relationship of our final three shoe databases.  There is no anatomical theory that would predict declining shoe sizes with increasing heights.  But this seems to happen all the time in climate research.  Financial modellers who try this go bankrupt.  Climate modellers who try this to reinforce an alarmist conclusion get more funding.  Go figure.

# The First Rule of Regression Analysis

Here is the first thing I was ever taught about regression analysis — never, ever use multi-variable regression analysis to go on a fishing expedition.  In other words, never throw in a bunch of random variables and see what turns out to have the strongest historical relationship.  Because the odds are that if you don’t understand the relationship between the variables and why you got the answer that you did, it is very likely a spurious result.

The purpose of a regression analysis is to confirm and quantify a relationship that you have a theoretical basis for believing to exist.  For example, I might think that home ownership rates might drop as interest rates rose, and vice versa, because interest rate increases effectively increase the cost of a house, and therefore should reduce demand.  This is a perfectly valid proposition to test.  What would not be valid is to throw interest rates, population growth, regulatory levels, skirt lengths, Super Bowl winners, and yogurt prices together into a regression with housing prices and see what pops up as having a correlation.   Another red flag: had we run our original regression between home ownership and interest rates and found the opposite result from what we expected, with home ownership rising with interest rates, we would need to be very, very suspicious of the correlation.  If we don’t have a good theory to explain it, we should treat the result as spurious, likely the result of mutual correlation of the two variables to a third variable, or of time lags we have not considered correctly, etc.

Make sense?  Well then, what do we make of this:  Michael Mann builds temperature reconstructions from proxies.  An example is tree rings.  The theory is that warmer temperatures lead to wider tree rings, so one can correlate tree ring growth to temperature.  The same is true for a number of other proxies, such as sediment deposits.

In the particular case of the Tiljander sediments, Steve McIntyre observed that Mann had included the data upside down – meaning he had essentially reversed the sign of the proxy data.  This would be roughly equivalent to our running our interest rate – home ownership regression but plugging the changes in home ownership with the wrong sign (ie decreases shown as increases and vice versa).

> You can see that the data was used upside down by comparing Mann’s own graph with the orientation of the original article, as we did last year. In the case of the Tiljander proxies, Tiljander asserted that “a definite sign could be a priori reasoned on physical grounds” – the only problem is that their sign was opposite to the one used by Mann. Mann says that multivariate regression methods don’t care about the orientation of the proxy.

The world is full of statements that are strictly true and totally wrong at the same time.  Mann’s statement that regression methods don’t care about the orientation of the proxy is such a case.  It is strictly true – the regression does not care if you get the sign right; it will still find a correlation.  But it is totally insane, because this implies that the correlation it is getting is exactly the opposite of what your physics told you to expect.  It’s like getting a positive correlation between interest rates and home ownership.  Or finding that tree rings got larger when temperatures dropped.

This is a mistake that Mann seems to make a lot — he gets buried so far down in the numbers that he forgets they have physical meaning.  They are describing physical systems, and what they are saying in this case makes no sense.  He is using a proxy that is behaving exactly the opposite of the way his physics tell him it should – in fact, behaving exactly opposite to the whole theory of why it should be a proxy for temperature in the first place.  And this does not seem to bother him enough to toss it out.

PS - These flawed Tiljander sediments matter.  It has been shown that the Tiljander series have an inordinate influence on Mann’s latest proxy results.  Remove them, and a couple of other flawed proxies (and by flawed, I mean ones with manually made-up data), and much of the hockey stick shape he loves so much goes away.