The First Rule of Regression Analysis

Here is the first thing I was ever taught about regression analysis: never, ever use multi-variable regression analysis to go on a fishing expedition. In other words, never throw in a bunch of random variables and see what turns out to have the strongest historical relationship. If you don’t understand the relationship between the variables and why you got the answer that you did, the result is very likely spurious.
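To see how easily a fishing expedition catches something, here is a minimal sketch in plain Python. Everything in it is made up for illustration: the target and all 200 candidate variables are pure random noise, and the function names and sample sizes are my own choices, not anything from a real study.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def best_spurious_correlation(n_obs=20, n_candidates=200, seed=0):
    """Correlate a noise target with many noise candidates; keep the strongest."""
    rng = random.Random(seed)
    # The "thing we want to explain": 20 observations of pure noise.
    target = [rng.gauss(0, 1) for _ in range(n_obs)]
    best = 0.0
    for _ in range(n_candidates):
        # Each candidate is unrelated to the target by construction.
        candidate = [rng.gauss(0, 1) for _ in range(n_obs)]
        r = pearson(candidate, target)
        if abs(r) > abs(best):
            best = r
    return best


strongest = best_spurious_correlation()
print(f"strongest correlation among pure-noise candidates: {strongest:+.2f}")
```

None of the candidates has anything to do with the target, yet with only 20 observations and 200 tries, the best one will typically look like a respectable relationship. That is all a fishing expedition is finding.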

The purpose of a regression analysis is to confirm and quantify a relationship that you have a theoretical basis for believing to exist. For example, I might think that home ownership rates drop as interest rates rise, and vice versa, because interest rate increases effectively increase the cost of a house, and therefore should reduce the demand. This is a perfectly valid proposition to test. What would not be valid is to throw interest rates, population growth, regulatory levels, skirt lengths, Super Bowl winners, and yogurt prices together into a regression with housing prices and see what pops up as having a correlation. Another red flag: had we run our original regression between home ownership and interest rates and found the opposite of the result we expected, with home ownership rising with interest rates, we would need to be very, very suspicious of the correlation. If we don’t have a good theory to explain it, we should treat the result as spurious, likely the result of mutual correlation of the two variables to a third variable, or of time lags we have not considered correctly, etc.
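The "mutual correlation to a third variable" problem can be sketched in a few lines. This is a toy simulation under assumptions I have invented: a hidden driver `third` feeds into two series `a` and `b`, which otherwise have nothing to do with each other. The names and noise levels are hypothetical and do not represent any real economic data.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(42)
n = 500

# Hypothetical hidden driver that influences both observed series.
third = [rng.gauss(0, 1) for _ in range(n)]

# a and b each equal the driver plus their own independent noise.
# Neither one has any direct effect on the other.
a = [z + rng.gauss(0, 0.5) for z in third]
b = [z + rng.gauss(0, 0.5) for z in third]

r_raw = pearson(a, b)

# Because we built the data ourselves, we know the driver enters with
# coefficient 1, so subtracting it leaves only the independent noise.
a_resid = [x - z for x, z in zip(a, third)]
b_resid = [y - z for y, z in zip(b, third)]
r_controlled = pearson(a_resid, b_resid)

print(f"correlation of a and b:                 {r_raw:+.2f}")
print(f"correlation after removing the driver:  {r_controlled:+.2f}")
```

The raw correlation is strong, and it mostly vanishes once the common driver is taken out. With real data you rarely know what the driver is, let alone its coefficient, which is why you need a theory before you trust the number.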

Makes sense?  Well, then, what do we make of this:  Michael Mann builds temperature reconstructions from proxies.  An example is tree rings.  The theory is that warmer temperatures lead to wider tree rings, so one can correlate tree ring growth to temperature.  The same is true for a number of other proxies, such as sediment deposits.

In the particular case of the Tiljander sediments, Steve McIntyre observed that Mann had included the data upside down – meaning he had essentially reversed the sign of the proxy data. This would be roughly equivalent to our running our interest rate – home ownership regression but plugging in the changes in home ownership with the wrong sign (i.e., decreases shown as increases and vice versa).

You can see that the data was used upside down by comparing Mann’s own graph with the orientation of the original article, as we did last year. In the case of the Tiljander proxies, Tiljander asserted that “a definite sign could be a priori reasoned on physical grounds” – the only problem is that their sign was opposite to the one used by Mann. Mann says that multivariate regression methods don’t care about the orientation of the proxy.

The world is full of statements that are strictly true and totally wrong at the same time. Mann’s statement above, that multivariate regression methods don’t care about the orientation of the proxy, is such a case. It is strictly true: the regression does not care whether you get the sign right, and it will still find a correlation. But it is totally insane, because it implies that the correlation the regression finds is exactly the opposite of what your physics told you to expect. It’s like getting a positive correlation between interest rates and home ownership. Or finding that tree rings got larger when temperatures dropped.
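The "strictly true" half of this is easy to check with a one-variable least-squares fit. The sketch below uses numbers I made up purely for illustration (they are not Tiljander data or anything from Mann’s work): it fits the same series once with the proxy as given and once with its sign flipped.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x.

    Returns (slope, r_squared).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, r_squared


# Invented illustrative values: a proxy and the temperature it tracks.
proxy = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
temperature = [0.9, 2.1, 2.9, 4.2, 4.8, 6.1]

# The same proxy entered "upside down".
flipped_proxy = [-p for p in proxy]

slope, r2 = fit_line(proxy, temperature)
slope_flipped, r2_flipped = fit_line(flipped_proxy, temperature)

print(f"as given: slope {slope:+.3f}, R^2 {r2:.3f}")
print(f"flipped:  slope {slope_flipped:+.3f}, R^2 {r2_flipped:.3f}")
```

The fit quality is identical either way; only the sign of the slope changes. So the regression will happily accept the flipped series, but the fitted coefficient now says the proxy moves opposite to the way the physical argument says it should, and that sign is the only place the mistake shows up.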

This is a mistake that Mann seems to make a lot: he gets buried so far down in the numbers that he forgets they have physical meaning. They are describing physical systems, and what they are saying in this case makes no sense. He is essentially using a proxy that is behaving exactly the opposite of what his physics tells him it should, in fact exactly opposite to the whole theory of why it should be a proxy for temperature in the first place. And this does not seem to bother him enough to toss it out.

PS: These flawed Tiljander sediments matter. It has been shown that the Tiljander series have an inordinate influence on Mann’s latest proxy results. Remove them, and a couple of other flawed proxies (and by flawed, I mean ones with manually made up data), and much of the hockey stick shape he loves so much goes away.

4 thoughts on “The First Rule of Regression Analysis”

  1. you mean that if we shorten hemlines, it won’t make the stock market go up?


    it seemed such a cost effective way to end the current downturn…

  2. Occasionally I think of a response at the right time rather than saying, “Gee I wish I had said that.” It doesn’t happen very often. Some years ago, I had a chap discussing a correlation between cancer cases and locale. I was an environmental scientist then and said, “Correlation doesn’t mean causality”.

    I then offered an example. I drew a chart of car radios per capita in the US and increased life expectancy. For the sixty year period examined at the time, the correlation coefficient was 0.95, which seems close to perfect. I then asked the good professor to explain why the increase in numbers of car radios led to increased life expectancy on a causal basis. He could not do it. A researcher must have some proof of relationship between a phenomenon and a result to presume a causal connection between the two.

    I caution young scientists not to assume facts not entered into evidence and subject to examination. The same caution applies to all things in life.

  3. Brilliant discussion on the perils of linear regression. Regression has its limitations, as the author points out, especially when you understand precisely how it works.

    For this reason it is a good idea to get away from regressions, which assume stationarity and linearity, and use instead a copula-co-movement time-fit to a pseudoautoregressive ARMA process. This would capture the non-linear temporal changes, especially as it relates to tree ring variation.

    It is remarkable to what extent people rely on linear regressions when much better alternatives exist. The obvious reason, of course, is that linear regressions give the answers people want to hear and we all know what that is. Even statisticians who know perfectly well that regressions fail to capture heteroskedastic invariance continue to use them simply because the results are easy to communicate to whoever controls their budgets.
