With big data, one sometimes has to compute correlations involving thousands of buckets of paired observations or time series. For instance a data bucket corresponds to a node in a decision tree, a customer segment, or a subset of observations having the same multivariate feature. Specific contexts of interest include multivariate feature selection (a combinatorial problem) or identification of best predictive set of metrics.
In large data sets, some buckets will contain outliers or meaningless data, and buckets might have different sizes. We need something better than the tools offered by traditional statistics. In particular, we want a correlation metric that satisfies the following
- Independent of sample size to allow comparisons across buckets of different sizes: a correlation of (say) 0.6 must always correspond to (say) the 74-th percentile among all potential paired series (X, Y) of size n, regardless of n
- Same bounds as old-fashioned correlation for back-compatibility : it must be between -1 and +1, with -1 and +1 attained by extreme, singular data sets, and 0 meaning no correlation
- More general than traditional correlation: it measures the degree of monotonicity between two variables X and Y (does X grow when Y grows?) rather than linearity (Y = a + b*X + noise, with a, b chosen to minimize noise). Yet not as general as the distance correlation (equal to zero if and only if X and Y are independent) or my structuredness coefficient.
- Not sensitive to outliers, robust
- More intuitive, more compatible with the way the human brain perceives correlations
Note that R-Squared, a goodness-of-fit measure used to compare model efficiency across multiple models, is typically the square of the correlation coefficient between observations and predicted values, measured on a training set via sound cross-validation techniques. It suffers the same drawbacks, and benefits from the same cures as traditional correlation. So we will focus here on the correlation.
To illustrate the first condition (dependence on n), let’s consider the following made-up data set with two paired variables or time series X, Y: …