How to detect spurious correlations, and how to find the real ones

*Originally posted on DataScienceCentral, by Dr. Granville.*

Designed in our research lab specifically for big data, the new and simple *strong correlation* synthetic metric proposed in this article should be used whenever you want to check whether there is a real association between two variables, especially in large-scale automated data science or machine learning projects. Use this new metric now to avoid being accused of reckless data science, or even being sued for wrongful analytic practice.

In this article, the traditional correlation is referred to as the *weak correlation*, as it captures only a small part of the association between two variables. Relying on *weak correlation* produces spurious correlations and predictive modeling deficiencies, even with as few as 100 variables. In short, our *strong correlation* (with a value between 0 and 1) is high (say above 0.80) only if the *weak correlation* is high (in absolute value) and the internal structures (auto-dependencies) of the two variables X and Y that you want to compare exhibit a similar pattern or correlogram. Yet this new metric is simple and involves just one parameter **a** (with **a** = 0 corresponding to *weak correlation*, and **a** = 1 being the recommended value for *strong correlation*). This setting is designed to avoid over-fitting.

Our *strong correlation* blends the concept of ordinary or *weak regression* (indeed, an improved, robust, outlier-resistant version of ordinary regression; or see my book pages 130-140) with the concept of X and Y sharing similar bumpiness (or see my book pages 125-128).

In short, even nowadays, what makes two variables X and Y *seem* related in most scientific articles, and in pretty much all articles written by journalists, is based on ordinary (weak) regression. But there are plenty of other metrics that you can use to compare two variables. Including bumpiness in the mix (together with weak regression, in a single blended metric called *strong correlation*, to boost accuracy) ensures that a high *strong* correlation means the two variables are really associated: not just through flawed, old-fashioned *weak* correlation, but also through similar internal auto-dependencies and structure. To put it differently, two variables can be highly *weakly* correlated yet have very different bumpiness coefficients, as shown in my original article, meaning that there might be no causal relationship (or see my book pages 165-168) or that hidden factors might explain the link. An artificial example is provided below in figure 3.
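The claim that two variables can be highly weakly correlated yet differ in bumpiness can be checked on a toy case. The sketch below is not the example from the original article or from figure 3: the data are synthetic, the function names `weak_corr` and `bumpiness` are hypothetical, and NumPy's Pearson correlation stands in for the rank-based correlation recommended in this article. Bumpiness is taken here to be the lag-1 auto-correlation.

```python
import numpy as np

def weak_corr(x, y):
    """Absolute value of the ordinary (Pearson) correlation."""
    return abs(float(np.corrcoef(x, y)[0, 1]))

def bumpiness(x):
    """Lag-1 auto-correlation: correlate x[0..n-2] with x[1..n-1]."""
    x = np.asarray(x, dtype=float)
    return weak_corr(x[:-1], x[1:])

# Synthetic data, chosen purely for illustration (an assumption).
t = np.arange(200, dtype=float)
zigzag = np.where(t % 2 == 0, 1.0, -1.0)  # +1, -1, +1, -1, ...
x = t                                     # smooth linear trend
y = t + 25.0 * zigzag                     # same trend plus a zigzag

print("weak correlation c(X, Y):", round(weak_corr(x, y), 3))
print("bumpiness of X:", round(bumpiness(x), 3))
print("bumpiness of Y:", round(bumpiness(y), 3))
```

In this construction the shared trend keeps the weak correlation above 0.9, while the zigzag pulls the lag-1 auto-correlation of Y well below that of X, which is essentially 1.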

Using *strong*, rather than *weak* correlation, eliminates the majority of these spurious correlations, as we shall see in the examples below. This *strong correlation* metric is designed to be integrated in automated data science algorithms.

**1. Formal definition of strong correlation**

Let’s define

- *Weak correlation* c(X, Y) as the absolute value of the ordinary correlation, with value between 0 and 1. This number is high (close to 1) if X and Y are highly correlated. I recommend using my rank-based, L-1 correlation (or see my book pages 130-140) to eliminate problems caused by outliers.
- c1(X) as the lag-1 auto-correlation for X, that is, if we have n observations X_1 … X_n, then c1(X) = c(X_1 … X_{n-1}, X_2 … X_n)
- c1(Y) as the lag-1 auto-correlation for Y
- *d-correlation* d(X, Y) = exp( -**a** · | ln( c1(X) / c1(Y) ) | ), with possible adjustment if numerator or denominator is zero, and parameter **a** must be positive or zero. This number, with value between 0 and 1, is high (close to 1) if X and Y have similar lag-1 auto-correlations.
- *Strong correlation* r(X, Y) = min{ c(X, Y), d(X, Y) }

Note that c1(X) and c1(Y) are the bumpiness coefficients (or see my book pages 125-128) for X and Y. Also, d(X, Y) and thus r(X, Y) are between 0 and 1, with 1 meaning strong similarity between X and Y, and 0 meaning either dissimilar lag-1 auto-correlations for X and Y, or lack of old-fashioned correlation.
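A minimal Python sketch of these quantities follows. It rests on several assumptions that should be checked against the original article: d(X, Y) is taken to be exp(-a · |ln(c1(X) / c1(Y))|) and r(X, Y) to be min{c(X, Y), d(X, Y)}; NumPy's Pearson correlation replaces the recommended rank-based correlation; and the zero adjustment is a simple floor `eps`, which is only one possible choice. All function names are hypothetical.

```python
import numpy as np

def weak_corr(x, y):
    """c(X, Y): absolute value of the ordinary (Pearson) correlation."""
    return abs(float(np.corrcoef(x, y)[0, 1]))

def lag1_autocorr(x):
    """c1(X) = c(X_1 ... X_{n-1}, X_2 ... X_n)."""
    x = np.asarray(x, dtype=float)
    return weak_corr(x[:-1], x[1:])

def d_corr(x, y, a=1.0, eps=1e-12):
    """d(X, Y), assumed to be exp(-a * |ln(c1(X) / c1(Y))|)."""
    if a < 0:
        raise ValueError("parameter a must be positive or zero")
    # Assumed zero adjustment: floor both auto-correlations at eps.
    cx = max(lag1_autocorr(x), eps)
    cy = max(lag1_autocorr(y), eps)
    return float(np.exp(-a * abs(np.log(cx / cy))))

def strong_corr(x, y, a=1.0):
    """r(X, Y), assumed to be min{c(X, Y), d(X, Y)}."""
    return min(weak_corr(x, y), d_corr(x, y, a))

# Synthetic usage example (data are an assumption, for illustration only).
rng = np.random.default_rng(0)
x = np.cumsum(rng.normal(size=500))       # random walk
y = x + rng.normal(scale=2.0, size=500)   # same walk plus white noise

print("weak correlation  :", weak_corr(x, y))
print("d-correlation     :", d_corr(x, y))
print("strong correlation:", strong_corr(x, y))
```

Because r is a minimum, it can never exceed the weak correlation: a pair of variables scores high only when both the weak correlation and the similarity of lag-1 auto-correlations are high.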

The *strong correlation* between X and Y is, by definition, r(X, Y). This is an approximation to having both spectra identical, a solution mentioned in my article The curse of Big Data (see also my book pages 41-45).

This definition of strong correlation was initially suggested in one of our weekly challenges.

**2. Comparison with traditional (weak) correlation**

When **a** = 0, weak and strong correlations are identical. Note that the *strong correlation* r(X, Y) still shares the same properties as the *weak correlation* c(X, Y): it is symmetric and invariant under linear transformations (such as re-scaling) of variables X or Y, regardless of **a**.
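These properties can be checked numerically. The sketch below is self-contained and makes the same assumptions as any implementation based on this article would have to: d(X, Y) = exp(-a · |ln(c1(X) / c1(Y))|), r(X, Y) = min{c(X, Y), d(X, Y)}, Pearson correlation in place of the rank-based version, and a hypothetical floor `eps` as the zero adjustment. The data are synthetic.

```python
import numpy as np

def strong_corr(x, y, a=1.0, eps=1e-12):
    """Assumed form: r = min{c, d}, with d = exp(-a * |ln(c1(X)/c1(Y))|)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    def c(u, v):
        # Absolute value of the ordinary (Pearson) correlation.
        return abs(float(np.corrcoef(u, v)[0, 1]))

    cx = max(c(x[:-1], x[1:]), eps)   # lag-1 auto-correlation of X
    cy = max(c(y[:-1], y[1:]), eps)   # lag-1 auto-correlation of Y
    d = float(np.exp(-a * abs(np.log(cx / cy))))
    return min(c(x, y), d)

# Synthetic data (an assumption, for illustration only).
rng = np.random.default_rng(1)
x = np.cumsum(rng.normal(size=300))
y = 0.5 * x + rng.normal(size=300)

weak = abs(float(np.corrcoef(x, y)[0, 1]))
r = strong_corr(x, y)

# a = 0 reduces the strong correlation to the weak correlation.
print(np.isclose(strong_corr(x, y, a=0.0), weak))
# Symmetry: r(X, Y) = r(Y, X).
print(np.isclose(r, strong_corr(y, x)))
# Invariance under linear transformations with non-zero slope.
print(np.isclose(r, strong_corr(3.0 * x + 7.0, -2.0 * y + 1.0)))
```

The negative slope applied to Y flips the sign of the ordinary correlation, but the absolute value in c(X, Y) absorbs it, so all three checks should print `True`.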

Published at Mon, 26 May 2014 23:43:14 +0000
