Ordinary Least Squares Regression
Ordinary least squares (OLS) regression is a statistical method that estimates the relationship between one or more independent variables and a dependent variable. It does so by fitting a straight line that minimizes the sum of the squared differences between the observed and predicted values of the dependent variable. In this entry, OLS regression is discussed in the context of a bivariate model, that is, a model in which only one independent variable (X) predicts a dependent variable (Y). However, the logic of OLS regression extends readily to the multivariate model, in which there are two or more independent variables.
Social scientists are often concerned with questions about the relationship between two variables. These include the following: Among women, is there a relationship between education and fertility? Do more-educated women have fewer children, and less-educated women have more children? Among countries, is there a relationship between gross national product (GNP) and life expectancy? Do countries with higher levels of GNP have higher levels of life expectancy, and countries with lower levels of GNP, lower levels of life expectancy? Among countries, is there a positive relationship between employment opportunities and net migration? Among people, is there a relationship between age and values of baseline systolic blood pressure? (Lewis-Beck 1980; Vittinghoff et al. 2005).
As Michael Lewis-Beck notes, these examples are specific instances of the common query, “What is the relationship between variable X and variable Y?” (1980, p. 9). If the relationship is assumed to be linear, bivariate regression may be used to address this issue by fitting a straight line to a scatterplot of observations on variable X and variable Y. The simplest statement of such a relationship between an independent variable, labeled X, and a dependent variable, labeled Y, may be expressed as a straight line in this formula:

$$Y = a + bX + e \tag{1}$$
where a is the intercept and indicates where the straight line intersects the Y-axis (the vertical axis); b is the slope and indicates the degree of steepness of the straight line; and e represents the error.
The error term indicates that the relationship predicted in the equation is not perfect. That is, the straight line does not perfectly predict Y. This lack of a perfect prediction is common in the social sciences. For instance, in terms of the education and fertility relationship mentioned above, we would not expect all women with exactly sixteen years of education to have exactly one child, and women with exactly four years of education to have exactly eight children. But we would expect that a woman with a lot of education would have fewer children than a woman with a little education. Stated in another way, the number of children born to a woman is likely to be a linear function of her education, plus some error. Actually, in low-fertility societies, Poisson and negative binomial regression methods are preferred over ordinary least squares regression methods for the prediction of fertility (Poston 2002; Poston and McKibben 2003).
We first introduce a note about the notation used in this entry. In the social sciences we almost always undertake research with samples drawn from larger populations, say, a 1 percent random sample of the U.S. population. Greek letters such as α and β denote the parameters (i.e., the intercept and slope values) representing the relationship between X and Y in the larger population, whereas lowercase Roman letters such as a and b denote the corresponding estimates computed from the sample.
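The distinction between population parameters and sample estimates can be illustrated with a small simulation. The sketch below is hypothetical throughout: the population size, the values chosen for α and β, the error spread, and the helper name `fit_line` are assumptions made for illustration, and the fitting step uses the standard closed-form least squares expressions for a bivariate line.

```python
import random

def fit_line(xs, ys):
    """Return (a, b) for the least squares line Y-hat = a + b*X."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

rng = random.Random(42)      # fixed seed so the run is repeatable
ALPHA, BETA = 60.0, 1.2      # assumed population intercept and slope

# Build a hypothetical population of 100,000 (X, Y) pairs.
population = []
for _ in range(100_000):
    x = rng.uniform(0.0, 50.0)
    y = ALPHA + BETA * x + rng.gauss(0.0, 5.0)  # linear relation plus error
    population.append((x, y))

# Draw a 1 percent simple random sample and estimate a and b from it.
sample = rng.sample(population, k=len(population) // 100)
a, b = fit_line([x for x, _ in sample], [y for _, y in sample])

print(f"population: alpha = {ALPHA}, beta = {BETA}")
print(f"sample:     a = {a:.2f}, b = {b:.2f}")
```

In a run of this kind, a and b should land close to α and β without matching them exactly, which is the sense in which the Roman letters estimate the Greek ones.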
When postulating relationships in the social sciences, linearity is often assumed, but this assumption does not always hold; indeed, many relationships are not linear. When one hypothesizes the form of a relationship between two variables, one needs to be guided both by the theory being used and by an inspection of the data.
But given that we wish to use a straight line for relating variable Y, the dependent variable, with variable X, the independent variable, there is a question about which line to use. In any scatterplot of observations of X and Y values (see Figure 1), there would be an infinite number of straight lines that might be used to represent the relationship. Which line is the best line?
The chosen straight line needs to be the one that minimizes the amount of error between the predicted values of Y and the actual values of Y. Specifically, if for each observation i in the sample one were to square the difference between the observed value, Y_i, and the predicted value, Ŷ_i = a + bX_i, and then sum these squared differences, the best line would have the lowest sum of squared errors (SSE), represented as follows:

$$\mathrm{SSE} = \sum_{i=1}^{n} \left(Y_i - \hat{Y}_i\right)^2 \tag{2}$$
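The idea of comparing candidate lines by their sum of squared errors can be sketched in a few lines of Python. The five data points and the three candidate lines below are hypothetical and are not drawn from any dataset discussed in this entry; the third candidate happens to be the least squares line for these toy data.

```python
def sse(xs, ys, a, b):
    """Sum of squared errors for the line Y-hat = a + b*X."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical toy data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

# Each candidate is an (intercept, slope) pair.
candidates = {
    "flat line": (4.0, 0.0),
    "steep line": (0.0, 1.5),
    "least squares line": (2.2, 0.6),
}

for name, (a, b) in candidates.items():
    print(f"{name}: SSE = {sse(xs, ys, a, b):.2f}")
```

Of the three candidates, the least squares line yields the smallest SSE; any other intercept and slope tried on these data would give a larger value.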
Ordinary least squares regression is a statistical method that produces the one straight line that minimizes the total squared error.
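Using calculus, it may be shown that SSE is the lowest or the “least” amount when the coefficients a and b are calculated with these formulas (Hamilton 1992, p. 33):

$$b = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \tag{3}$$

$$a = \bar{Y} - b\bar{X} \tag{4}$$

where X̄ and Ȳ are the sample means of X and Y.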
These values of a and b are known as least squares coefficients, or sometimes as ordinary least squares coefficients or OLS coefficients.
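As a minimal sketch, the least squares coefficients can be computed directly from the standard closed-form expressions for the bivariate case: the slope b is the sum of cross-products of deviations from the means divided by the sum of squared deviations of X, and the intercept a is the mean of Y minus b times the mean of X. The function name `ols_coefficients` and the toy data are assumptions for illustration.

```python
def ols_coefficients(xs, ys):
    """Return (a, b), the intercept and slope of the least squares line."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired observations")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of cross-products of deviations from the means.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Sum of squared deviations of X from its mean.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("X has no variation, so the slope is undefined")
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

# Hypothetical toy data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
a, b = ols_coefficients(xs, ys)
print(f"a = {a:.2f}, b = {b:.2f}")
```

The guard against zero variation in X reflects a requirement of the formula itself: if every observation has the same X value, no slope can be estimated.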
We now apply the least squares principle to an example. We are interested in the relationship, among the counties of China, between the fertility rate (the dependent variable) and the level of illiteracy (the independent variable). China had 2,372 counties in 1982. We hypothesize that counties with high levels of illiteracy will have higher fertility rates than counties with low levels of illiteracy.
The dependent variable, Y, is the general fertility rate, GFR, that is, the number of children born in 1982 per 1,000 women in the age group fifteen to forty-nine. The independent variable, X, is the percentage of the population in the county in 1981 aged twelve or more who are illiterate.
The relationship may be graphed in the scatterplot in Figure 1. The association between the GFR and the illiteracy rate appears to be linear and positive. Each dot refers to a county of China; there are 2,372 dots on the scatterplot.
Equation (1) may be estimated using the least squares formulas for a and b in equations (3) and (4). This produces the following:

$$\hat{Y} = 57.56 + 1.19X \tag{5}$$
The OLS results in equation (5) indicate that the intercept value is 57.56, and the slope value is 1.19. The intercept, or a, indicates the point where the regression line “intercepts” the Y-axis. It gives the expected value of Y when X = 0. Thus, in this China dataset, the value of a indicates that a county with no illiterate persons in the population would have an expected fertility rate of 57.6 children per 1,000 women aged fifteen to forty-nine.
The slope coefficient, or b, indicates the average change in Y associated with a one-unit change in X. In the China example, b = 1.19, meaning that an increase of one percentage point in a county’s illiteracy rate is associated with an average GFR increase, or gain, of 1.19 children per 1,000 women aged fifteen to forty-nine.
We would probably want to interpret this b coefficient in the other direction; that is, it makes more sense to say that reducing a county’s illiteracy rate by one percentage point would be associated with an average reduction of about 1.2 children per 1,000 women aged fifteen to forty-nine. This kind of interpretation would be consistent with a policy intervention that a government might wish to use; that is, a lower illiteracy rate would tend to result in a lower fertility rate.
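The interpretation of the intercept and slope can be checked with a little arithmetic. The sketch below uses only the two coefficients reported in the text (57.56 and 1.19); the function name `predicted_gfr` and the illiteracy percentages fed to it are hypothetical inputs chosen for illustration, not values from the China dataset.

```python
INTERCEPT = 57.56  # a, as reported for the China county example
SLOPE = 1.19       # b, as reported for the China county example

def predicted_gfr(illiteracy_pct):
    """Predicted births per 1,000 women aged 15-49 at a given illiteracy percentage."""
    return INTERCEPT + SLOPE * illiteracy_pct

# Hypothetical illiteracy levels, in percent.
for pct in (0, 10, 11, 30):
    print(f"illiteracy {pct:>2}%: predicted GFR = {predicted_gfr(pct):.2f}")

# Moving from 11 percent to 10 percent lowers the prediction by the slope.
change = predicted_gfr(10) - predicted_gfr(11)
print(f"change for a one-point reduction: {change:.2f}")
```

At an illiteracy level of zero the prediction equals the intercept, and each one-point difference in illiteracy shifts the prediction by the slope, whichever direction one reads it.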
The regression line may be plotted in the above scatterplot, as shown in Figure 2.
Note that while the relationship between illiteracy and fertility is linear in general, there is a lot of error in the prediction of county fertility from county illiteracy. Some counties lie right on or close to the regression line, so their illiteracy rates perfectly or near perfectly predict their fertility rates; the predictions for other counties are not as good.
One way to appraise the overall predictive efficiency of the OLS model is to “eyeball” the relationship as we have done above. How well does the above OLS equation correspond with variation in the fertility rates of the counties? As we noted above, the relationship appears to be positive and linear. A more accurate statistical approach to the question of how well the data points fit the regression line is the coefficient of determination (R²).
We start by considering the problem of predicting Y, the fertility rate, when we have no other knowledge about the observations (the counties). That is, if we only know the values of Y for the observations, then the best prediction of Y, the fertility rate, is the mean of Y. It is believed that Carl Friedrich Gauss (1777–1855) was the first to demonstrate that lacking any other information about a variable’s value for any one subject, the arithmetic mean is the most probable value (Gauss [1809] 2004, p. 244).
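One concrete sense in which the mean is the best single guess can be shown numerically: among all constant predictions, the mean gives the smallest sum of squared errors. This is a least squares property, narrower than the probabilistic argument attributed to Gauss, and the sketch below is offered only as an illustration. The data values and the function name `sse_for_guess` are hypothetical.

```python
def sse_for_guess(ys, guess):
    """Sum of squared errors when every case is predicted to equal `guess`."""
    return sum((y - guess) ** 2 for y in ys)

# Hypothetical values of Y, chosen so that the mean and median differ.
ys = [1.0, 2.0, 2.0, 3.0, 12.0]

mean_y = sum(ys) / len(ys)
median_y = sorted(ys)[len(ys) // 2]  # middle value; valid because len(ys) is odd

for label, guess in [("mean", mean_y), ("median", median_y), ("maximum", max(ys))]:
    print(f"guess = {label} ({guess}): SSE = {sse_for_guess(ys, guess):.1f}")
```

The squared error at the mean is the quantity that the entry goes on to call the total sum of squares; any other constant guess produces a larger value.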
But if we guess the mean of Y for every case, we will have lots of poor predictions and lots of error. When we have information about the values of X, predictive efficiency may be improved, as long as X has a relationship with Y. “The question then is, how much does this knowledge of X improve our prediction of Y?” (Lewis-Beck 1980, p. 20).
First, consider the sum of the squared differences of each observation’s value on Y from the mean of Y. This is the total sum of squares (TSS) and represents the total amount of statistical variation in Y, the dependent variable.
Values on X are then introduced for all the observations (the Chinese counties), and the OLS regression equation is estimated. The regression line is plotted (as in the scatterplot in Figure 2), and the actual values of Y for all the observations are compared to their predicted values of Y. The sum of the squared differences between the predicted values of Y and the mean of Y is the explained sum of squares (ESS), sometimes referred to as the model sum of squares. This represents the amount of the total variation in Y that is accounted for by X. The difference between TSS and ESS is the amount of the variation in Y that is not explained by X, known as the residual sum of squares (RSS).
The coefficient of determination (R²) is:

$$R^2 = \frac{\mathrm{ESS}}{\mathrm{TSS}} \tag{6}$$
The coefficient of determination, when multiplied by 100, represents the percentage of variation in Y (the fertility rates of the Chinese counties) that is accounted for by X (the illiteracy rates of the counties). The R² values range from 0 to 1. If R² = 1.0, the X variable perfectly accounts for variation in Y. Conversely, when R² = 0 (in this case the slope of the line, b, would also equal 0), the X variable does not account for any of the variation in Y (Vittinghoff et al. 2005, p. 44; Lewis-Beck 1980, pp. 21–22).
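The decomposition of variation described above can be sketched as follows: fit the least squares line, compute the total, explained, and residual sums of squares, and take the ratio of explained to total. The function name `sums_of_squares` and the toy data are assumptions for illustration and do not come from the China dataset.

```python
def sums_of_squares(xs, ys):
    """Return (TSS, ESS, RSS, R-squared) for the bivariate least squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Least squares slope and intercept.
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    predicted = [a + b * x for x in xs]

    tss = sum((y - y_bar) ** 2 for y in ys)              # total variation in Y
    ess = sum((p - y_bar) ** 2 for p in predicted)       # variation explained by X
    rss = sum((y - p) ** 2 for y, p in zip(ys, predicted))  # unexplained variation
    return tss, ess, rss, ess / tss

# Hypothetical toy data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

tss, ess, rss, r2 = sums_of_squares(xs, ys)
print(f"TSS = {tss:.2f}, ESS = {ess:.2f}, RSS = {rss:.2f}, R-squared = {r2:.2f}")
```

For these toy data the explained and residual parts add up to the total, and X accounts for 60 percent of the variation in Y.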
SEE ALSO Cliometrics; Least Squares, Three-Stage; Least Squares, Two-Stage; Linear Regression; Logistic Regression; Methods, Quantitative; Probabilistic Regression; Regression; Regression Analysis; Social Science; Statistics in the Social Sciences; Tobit
BIBLIOGRAPHY
Gauss, Carl Friedrich. [1809] 2004. Theory of Motion of the Heavenly Bodies Moving About the Sun in Conic Sections: A Translation of Theoria Motus. Mineola, NY: Dover.
Hamilton, Lawrence C. 1992. Regression with Graphics: A Second Course in Applied Statistics. Pacific Grove, CA: Brooks/Cole.
Lewis-Beck, Michael S. 1980. Applied Regression: An Introduction. Beverly Hills, CA: Sage.
Poston, Dudley L., Jr. 2002. The Statistical Modeling of the Fertility of Chinese Women. Journal of Modern Applied Statistical Methods 1 (2): 387–396.
Poston, Dudley L., Jr., and Sherry L. McKibben. 2003. Zero-inflated Count Regression Models to Estimate the Fertility of U.S. Women. Journal of Modern Applied Statistical Methods 2 (2): 371–379.
Vittinghoff, Eric, David V. Glidden, Stephen C. Shiboski, and Charles E. McCulloch. 2005. Regression Methods in Biostatistics: Linear, Logistic, Survival, and Repeated Measures Models. New York: Springer.
Dudley L. Poston Jr.