This article discusses a far more general version of the technique described in our article The best kept secret about regression. Here we adapt our methodology so that it applies to data sets with a more complex structure, in particular with highly correlated independent variables.
Our goal is to produce a regression tool that can be used as a black box: robust, parameter-free, and easy for non-statisticians to use and interpret. It is part of a bigger project: automating many fundamental data science tasks, to make them easy, scalable, and cheap for data consumers, not just for data experts. Our previous attempts at automation include:
Readers are invited to further formalize the technology outlined here, and to challenge our proposed methodology.
As in our previous paper, without loss of generality, we focus on linear regression with centered variables (zero mean) and no intercept. Generalization to logistic regression or to non-centered variables is straightforward.
Thus we are still dealing with the following regression framework:
Y = a_1 * X_1 + … + a_n * X_n + noise
Remember that the solution proposed in our previous paper was
- b_i = cov(Y, X_i) / var(X_i), i = 1, …, n
- a_i = M * b_i, i = 1, …, n
- M (a real number, not a matrix) is chosen to minimize var(Z), with Z = Y – (a_1 * X_1 + … + a_n * X_n)
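The three steps above are easy to implement directly. Here is a minimal sketch in Python with numpy; the function name `fit_scaled_regression` is my own, and it uses the fact that var(Y – M*S) is quadratic in M, so the minimizing M has the closed form cov(Y, S) / var(S), where S = b_1 * X_1 + … + b_n * X_n:

```python
import numpy as np

def fit_scaled_regression(X, Y):
    """Proposed regression: b_i = cov(Y, X_i) / var(X_i), then all b_i
    are scaled by one factor M chosen to minimize var(Z),
    where Z = Y - M * (b_1*X_1 + ... + b_n*X_n).
    X: (n_obs, n_vars) with centered columns; Y: (n_obs,), centered."""
    b = np.array([np.cov(Y, X[:, i], bias=True)[0, 1] / np.var(X[:, i])
                  for i in range(X.shape[1])])
    S = X @ b                       # unscaled predicted response
    # var(Y - M*S) = var(Y) - 2*M*cov(Y, S) + M^2 * var(S),
    # a parabola in M minimized at:
    M = np.cov(Y, S, bias=True)[0, 1] / np.var(S)
    a = M * b                       # final regression coefficients
    return a, b, M
```

Note that no matrix inversion is involved, which is what makes the method robust when the X_i's are highly correlated.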
When cov(X_i, X_j) = 0 for i < j, our regression and the classical regression produce identical regression coefficients, and M = 1.
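This special case is easy to verify numerically. The sketch below (my own illustration, not from the original article) builds two exactly orthogonal centered columns and a noise-free response, and compares the b_i's and M against classical least squares:

```python
import numpy as np

# Two exactly orthogonal, centered columns: cov(X_1, X_2) = 0.
X = np.array([[ 1.0,  1.0],
              [ 1.0, -1.0],
              [-1.0,  1.0],
              [-1.0, -1.0]])
Y = X @ np.array([2.0, 3.0])       # noise-free response, a_1 = 2, a_2 = 3

# b_i = cov(Y, X_i) / var(X_i), as in the previous paper
b = np.array([np.cov(Y, X[:, i], bias=True)[0, 1] / np.var(X[:, i])
              for i in range(2)])
S = X @ b
M = np.cov(Y, S, bias=True)[0, 1] / np.var(S)

# classical (ordinary least squares) coefficients for comparison
a_classical = np.linalg.lstsq(X, Y, rcond=None)[0]
```

With orthogonal columns, b recovers (2, 3) directly, M comes out equal to 1, and M * b matches the classical coefficients exactly.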
Terminology: Z is the noise, Y is the (observed) response, the a_i’s are the regression coefficients, and S = a_1 * X_1 + … + a_n * X_n is the estimated or predicted response. The X_i’s are the independent variables or features.
2. Re-visiting our previous data set
I have added more cross-correlations to the previous simulated dataset consisting of 4 independent variables, still denoted as x, y, z, u in the updated spreadsheet attached here. Now corr(x, y) = 0.99.
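The attached spreadsheet is not reproduced here, but a hypothetical stand-in with the same key feature, corr(x, y) = 0.99, can be simulated as follows (the noise scale 0.1425 is my own choice: for y = x + s * noise with independent standard normal terms, corr(x, y) = 1 / sqrt(1 + s^2), so s ≈ 0.1425 targets 0.99):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

x = rng.normal(size=n)
# y is x plus a small independent perturbation, giving corr(x, y) ~ 0.99
y = x + 0.1425 * rng.normal(size=n)
z = rng.normal(size=n)
u = rng.normal(size=n)

print(round(np.corrcoef(x, y)[0, 1], 3))
```

Such near-collinear pairs are exactly the setting where classical regression coefficients become unstable, and where the scaled-coefficient approach above is meant to help.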