Multicollinearity
A multiple regression is said to exhibit multicollinearity when the explanatory variables are correlated with one another. Almost all multiple regressions have some degree of multicollinearity. The extent to which multicollinearity is a problem is widely misunderstood. Multicollinearity is not a violation of the classical statistical assumptions underlying multiple regression. Specifically, multicollinearity does not cause either biased coefficients or incorrect standard errors. For this reason, while identifying multicollinearity can be helpful in understanding the outcome of a regression, “corrections” to reduce multicollinearity are rarely appropriate.
In the regression model
$$y_i = b_1 x_{i1} + b_2 x_{i2} + \cdots + b_K x_{iK} + e_i$$
there is multicollinearity if the x variables are correlated with one another, as is usually the case. The consequence of such correlation is that the estimates of the regression coefficients are less precise than they would be absent such correlation. For example, in the regression $y_i = a + b_1 x_{i1} + b_2 x_{i2} + e_i$ with $n$ observations, the variance of the estimated coefficient $\hat{b}_1$ can be thought of as
$$\operatorname{var}(\hat{b}_1) = \frac{1}{n} \cdot \frac{\sigma_e^2}{\operatorname{var}(x_1)\bigl(1 - \operatorname{corr}(x_1, x_2)^2\bigr)}.$$
When $x_1$ and $x_2$ have a high correlation $\operatorname{corr}(x_1, x_2)$, the uncertainty about $b_1$ will be large. Because the formulas for standard errors take this correlation into account, the uncertainty will be correctly reflected in the reported regression statistics.
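To make the role of the correlation concrete, here is a minimal Monte Carlo sketch in Python (the sample size, coefficient values, and error variance are illustrative assumptions, not from the text): it draws $x_1$ and $x_2$ with a chosen correlation and shows the sampling variability of $\hat{b}_1$ growing as the correlation approaches 1, in line with the formula above.

```python
import numpy as np

rng = np.random.default_rng(0)

def sd_of_b1_hat(rho, n=200, reps=1000, b1=1.0, b2=1.0, sigma_e=1.0):
    """Monte Carlo standard deviation of the OLS estimate of b1."""
    cov = np.array([[1.0, rho], [rho, 1.0]])
    estimates = []
    for _ in range(reps):
        x = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        e = rng.normal(0.0, sigma_e, size=n)
        y = b1 * x[:, 0] + b2 * x[:, 1] + e
        X = np.column_stack([np.ones(n), x])        # intercept, x1, x2
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        estimates.append(coef[1])                   # estimated b1
    return np.std(estimates)

for rho in (0.0, 0.5, 0.9, 0.99):
    print(f"corr(x1, x2) = {rho:4.2f}   sd of b1-hat ≈ {sd_of_b1_hat(rho):.3f}")
```

The reported spread increases roughly in proportion to $1/\sqrt{1 - \operatorname{corr}(x_1, x_2)^2}$, which is exactly what the variance formula predicts.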
Fundamentally, a regression estimates the effect of one explanatory variable holding constant the other explanatory variables. If two or more variables tend to move together in the available data (in which case the data are multicollinear), then the data provide very little evidence about the effect of any single variable, as is reflected in the variance formula above.
The only “cures” for multicollinearity are (1) to find data with less correlation among the explanatory variables, or (2) to use a priori information to specify a value for the coefficient of one of the correlated variables, and by so doing avoid the need to separately estimate the effect of each variable.
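As a sketch of cure (2), assuming outside information pins the coefficient on $x_2$ at a known value $c$ (the value of $c$, the variable names, and the simulated data below are illustrative assumptions), the restriction can be imposed by moving $c\,x_2$ to the left-hand side and estimating only the remaining coefficient:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + rng.normal(scale=0.3, size=n)    # x2 highly correlated with x1
y = 1.0 + 2.0 * x1 + 0.5 * x2 + rng.normal(size=n)

c = 0.5                                           # a priori value assumed for b2
y_star = y - c * x2                               # impose the restriction
X = np.column_stack([np.ones(n), x1])             # only intercept and x1 remain
coef, *_ = np.linalg.lstsq(X, y_star, rcond=None)
print("estimated b1 with b2 fixed at", c, ":", round(coef[1], 3))
```

Because only one of the two correlated effects is estimated, the restricted estimate of $b_1$ is far more precise than the unrestricted one, at the cost of relying on the assumed value of $c$.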
If one explanatory variable equals a linear combination of other explanatory variables (for example, if $x_1 = x_2 + x_3$), the regression has perfect multicollinearity. Perfect multicollinearity makes it impossible to estimate the regression model, as indicated by the infinite variance in the formula above. However, perfect multicollinearity almost always indicates an error in specifying the model. One common error is the dummy variable trap, in which a complete set of dummy variables and an intercept, or more than one complete set of dummy variables, are included in a regression. For example, including a variable for female gender (coded 1/0), a variable for male gender, and an intercept would cause the regression to fail.
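A small sketch with hypothetical data illustrates the trap: the female and male dummies sum to the intercept column, so the design matrix loses full column rank, and dropping one of the collinear columns restores it.

```python
import numpy as np

female = np.array([1.0, 0.0, 1.0, 1.0, 0.0])
male = 1.0 - female                       # male dummy is exactly 1 - female
intercept = np.ones(5)

# Intercept plus both dummies: the columns are perfectly collinear
X = np.column_stack([intercept, female, male])
print("female + male == intercept:", np.allclose(female + male, intercept))
print("rank of X:", np.linalg.matrix_rank(X), "of", X.shape[1], "columns")

# Dropping one dummy restores full column rank
X_fixed = np.column_stack([intercept, female])
print("rank after dropping the male dummy:",
      np.linalg.matrix_rank(X_fixed), "of", X_fixed.shape[1], "columns")
```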
Because of limits on the numerical accuracy of computer arithmetic, a high degree of multicollinearity can lead to numerical, as opposed to statistical, errors in computing regression results. This is rarely a problem with modern software, which typically includes internal checks for such errors.
One indication of significant multicollinearity is that individual coefficients are insignificant while sets of coefficients are jointly significant. For example, a set of indicators of underlying socioeconomic status (e.g., mother’s education and father’s education) may be jointly significant even though no single indicator is significant on its own. In such situations, investigators sometimes drop all but one indicator. While not strictly rigorous, such a procedure is not harmful so long as the coefficient on the retained variable is interpreted as a proxy for the entire set of socioeconomic indicators rather than as the effect of the specific variable that was retained. (One might retain only mother’s education but interpret the effect loosely as “parents’ education.”)
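A hedged simulated example of this pattern (the variable names and all numbers below are assumptions made for illustration): two highly correlated proxies can each have a small individual t-statistic while a joint F-test on both coefficients clearly rejects.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 100
mother = rng.normal(12.0, 2.0, n)                   # "mother's education"
father = 0.9 * mother + rng.normal(0.0, 0.5, n)     # highly correlated proxy
y = 5.0 + 0.3 * mother + 0.3 * father + rng.normal(0.0, 3.0, n)

def ols(X, y):
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return coef, resid @ resid                      # coefficients, SSR

X_u = np.column_stack([np.ones(n), mother, father])  # unrestricted model
X_r = np.ones((n, 1))                                # both slopes restricted to 0
(coef_u, ssr_u), (_, ssr_r) = ols(X_u, y), ols(X_r, y)

# Individual t-statistics from the usual OLS variance estimate
df_resid = n - X_u.shape[1]
sigma2 = ssr_u / df_resid
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X_u.T @ X_u)))
print("t-statistics (const, mother, father):", np.round(coef_u / se, 2))

# Joint F-test that both education coefficients are zero
q = 2
F = ((ssr_r - ssr_u) / q) / (ssr_u / df_resid)
print(f"joint F = {F:.2f}, p-value = {stats.f.sf(F, q, df_resid):.4f}")
```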
Another indication of multicollinearity that is sometimes used is a high variance inflation factor (VIF), which measures the increase in the variance of $\hat{b}_i$ due to correlation between $x_i$ and the other explanatory variables. In the two-regressor example above, the VIF is $1/(1 - \operatorname{corr}(x_1, x_2)^2)$.
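A minimal sketch of the two-regressor case (simulated data; the variable names are illustrative): the VIF follows directly from the sample correlation, and in the general case it equals $1/(1 - R_i^2)$, where $R_i^2$ comes from the auxiliary regression of $x_i$ on the other explanatory variables.

```python
import numpy as np

rng = np.random.default_rng(2)
x1 = rng.normal(size=500)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=500)   # x2 correlated with x1 by construction

r = np.corrcoef(x1, x2)[0, 1]                     # sample correlation of the regressors
vif = 1.0 / (1.0 - r**2)
print(f"corr(x1, x2) = {r:.3f}, VIF for b1 = {vif:.2f}")
```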
SEE ALSO Least Squares, Ordinary; Principal Components; Properties of Estimators (Asymptotic and Exact)
Richard Startz