Linear Hypotheses

I. Regression, by E. J. Williams

II. Analysis of Variance, by Julian C. Stanley

III. Multiple Comparisons, by Peter Nemenyi

I REGRESSION

Regression analysis, as it is presented in this article, is an important and general statistical tool. It is applicable to situations in which one observed variable has an expected value that is assumed to be a function of other variables; the function usually has a specified form with unspecified parameters. For example, an investigator might assume that under appropriate circumstances the expected score on an examination is a linear function of the length of training period. Here there are two parameters, slope and intercept of the line. The techniques of regression analysis may be classified into two kinds: (1) testing the concordance of the observations with the assumed model, usually in the framework of some broader model, and (2) carrying out estimation, or other sorts of inferences, about the parameters when the model is assumed to be correct. This area of statistics is sometimes known as “least squares,” and in older publications it was called “the theory of errors.”

In the regression relations discussed in this article only one variable is regarded as random; the others are either fixed by the investigator (where experimental control is possible) or selected in some way from among the possible values. The relation between the expected value of the random variable (called the dependent variable, the predictand, or the regressand) and the nonrandom variables (called regression variables, independent variables, predictors, or regressors) is known as a regression relation. Thus, if a random variable Y, depending on a variable x, varies at random about a linear function of x, we can write

Y = β0 + β1x + e,

which expresses a linear regression relation. The parameters β0 and β1 are the regression coefficients or parameters, and e is a random variable with expected value zero. Usually the e’s corresponding to different values of Y are assumed to be uncorrelated and to have the same variance. If η denotes the expected value of Y, the basic relation may be expressed alternatively as

E(Y) = η = β0 + β1x.

The parameters in the relation will be either unknown or given by theory; observations of Y for different values of x provide the means of estimating these parameters or testing the concordance of the simple linear structure with the data.

Linear models, linear hypotheses

A regression relation that is linear in the unknown parameters is known as a linear model, and the assertion of such a model as a basis for inference is the assertion of a linear hypothesis. Often the term “linear hypothesis” refers to a restriction on the linear model (for example, specifying that a parameter has the value 7 or that two parameters are equal) that is to be tested. The importance of the linear model lies in its ease of application and understanding; there is a well-developed body of theory and techniques for the statistical treatment of linear models, in particular for the estimation of their parameters and the testing of hypotheses about them.

Needless to say, the description of a phenomenon by means of a linear model is usually a matter of convenience; the model is accepted until some more elaborate one is required. Nevertheless, the linear model has a wide range of applicability and is of great value in elucidating relationships, especially in the early stages of an investigation. Often a linear model is applicable only after transformations of the independent variables (like x in the above example), the dependent variable (Y, above), or both [see Statistical Analysis, Special Problems of, article on Transformations of Data].

In its most general form, regression analysis includes a number of other statistical techniques as special cases. For instance, it is not necessary that the x’s be defined as metric variables. If the values of the observations on Y are classified into a number of groups (say, p), then the regression relation is written E(Y) = β1x1 + β2x2 + ... + βpxp, and xi may be taken to be 1 for all observations in the ith group and 0 for all the others. The p x-variables will then specify the different groups, and the regression relation will define the mean value of Y for each group. In the simplest case, with two groups,

E(Y) = β1x1 + β2x2,

where x1 = 1 and x2 = 0 for the first group, and vice versa for the second.
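The claim that indicator regressors reproduce the group means can be checked numerically. The following is a minimal sketch, not part of the original article: the function name `indicator_regression` and the data are hypothetical, and groups are labelled 0, 1, ..., p − 1 for convenience. Because the indicators of different groups never overlap, the least squares equations decouple and each coefficient is simply a group total divided by a group size.

```python
def indicator_regression(y, group, p):
    """Least squares fit of E(Y) = b1*x1 + ... + bp*xp with 0/1 group indicators.

    `group[j]` is the (0-based) group label of observation j; every group
    is assumed to contain at least one observation.
    """
    # Indicator regressors: x[i][j] is 1 if observation j is in group i, else 0.
    x = [[1.0 if g == i else 0.0 for g in group] for i in range(p)]
    # Sums of squares and products of the regressors, and sums of products with Y.
    t = [[sum(a * b for a, b in zip(x[h], x[i])) for i in range(p)] for h in range(p)]
    u = [sum(yj * xj for yj, xj in zip(y, x[i])) for i in range(p)]
    # Indicators of different groups never overlap, so all cross products vanish.
    assert all(t[h][i] == 0.0 for h in range(p) for i in range(p) if h != i)
    # Each coefficient is (sum of Y in group i) / (number of observations in group i).
    return [u[i] / t[i][i] for i in range(p)]


# Hypothetical data: three observations in the first group, two in the second.
y = [4.0, 6.0, 5.0, 9.0, 11.0]
group = [0, 0, 0, 1, 1]
b = indicator_regression(y, group, 2)
print(b)  # the two fitted coefficients are the two group means
```

With these data the fitted coefficients equal the sample means of the two groups, which is the sense in which the comparison of groups falls under the regression model.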

The estimation of the population mean from a sample is a special case, since the model is then just

E(Y) = β0,

β0 being the mean of the population.

This treatment of the comparison of different groups is somewhat artificial, although it is important to note that it falls under the regression rubric. Such comparisons are generally carried out by means of the technique known as the analysis of variance [see Linear Hypotheses, article on Analysis of Variance].

When a regressor is not measured quantitatively but is given only as a ranking (for example, order in time of archeological specimens, social position of occupation), it may still provide a regression relation suitable for estimation or prediction. The simplest way to include such a variable in a relation is to replace the qualitative values (rankings) by arbitrary numerical scores, equally spaced (see, for example, Strodtbeck et al. 1957). More refined methods would use scores spaced according to some measure of “distance” between successive rankings; thus, in some instances the scores have been chosen so that their frequency distribution approximates a grouped normal distribution. Since any method of scoring is arbitrary, the method that is used must be judged by the relations based on it as well as by its theoretical cogency. Simple scoring systems, which can be easily understood, are usually to be preferred.

When both the dependent variable and the regression variable are qualitative, each may be replaced by arbitrary scores as indicated above. Alternative methods determine scores for the dependent variable that are most highly correlated (formally) with the regressor scores or, if the regressor scores for any set of data are open to choice, choose scores for both variables so that the correlation is maximized. The calculation and interpretation of the regression relations for such situations have been discussed by Yates (1948) and by Williams (1952).

Regression, correlation, functional relation

The regression relation is a one-way relation between variables in which the expected value of one random variable is related to nonrandom values of the other variables. It is to be distinguished from other types of statistical relations, in particular from correlation and functional relationships. [See Multivariate Analysis, articles on Correlation.] Correlation is a relation between two or more random variables and may be described in terms of the amount of variation in one of the variables associated with variation in the other variable or variables. The functional relation, by contrast, is a relation between the expected values of random variables. If quantities related by some physical law are subject to errors of measurement, the functional relation between expected values, rather than the regression relation, is what the investigator generally wants to determine.

Although the regression relation relates a random variable to other, nonrandom variables, in many situations it will apply also when the regression variables are random; then the regression, conditional on the observed values of the random regression variables, is determined. Here the expected value of one random variable is related to the observed values of the other random variables. For a discussion of the fitting of regression lines when the regression variables are subject to error, see Madansky (1959). When more than one variable is to be considered as random, the problem is usually thought of as one of multivariate analysis [see Multivariate Analysis].

History

The method of least squares, on which most methods of estimation for linear models are based, was apparently first published by Adrien Legendre (1805), but the first treatment along the lines now familiar was given by Carl Friedrich Gauss (1821, see in 1855). Gauss showed that the method gives estimators of the unknown parameters with minimum variance among unbiased linear estimators. This basic result is sometimes known as the Gauss-Markov theorem, and the least squares estimators as Gauss-Markov estimators.

The term “regression” was first used by Francis Galton, who applied it to certain relations in the theory of heredity, but the term is now applied to relationships in general and to nonlinear as well as to linear relationships.

The linearity of linear hypotheses rests in the way the parameters appear; the x’s may be highly nonlinear functions of underlying nonrandom variables. For example,

and

$$\eta = \beta_1 e^{x_1} + \beta_2 \tan x_1$$

both fall squarely under the linear hypothesis model, whereas

$$\eta = \beta_1 e^{\beta_2 x_1}$$

does not fit that model.

There is now a vast literature dealing with the linear model, and the subject is also treated in most statistical textbooks.

Application in the social sciences

There has been a good deal of discussion about the type of model that should be used to describe relations between variables in the social sciences, particularly in economics. Linear regression models have often been considered inadequate for complex economic phenomena, and more complicated models have been developed. Recent work, however, indicates that ordinary linear regression methods have a wider scope than had been supposed. For example, there has been much discussion about how to treat data correlated in time, for which the residuals from the regression equation (the e’s) show autocorrelation. This autocorrelation may be the result of autocorrelation in the variables not included in the model. Geary (1963) suggests that in such circumstances the inclusion of additional regression variables may effectively eliminate the autocorrelation among the residuals, so that standard methods may be applied.

Further discussion of the applicability of regression methods to economic data is given by Ezekiel and Fox (1930, chapters 20 and 24) and also by Wold and Jureen (see in Wold 1953).

Investigators should be encouraged to employ the simple methods of regression analysis as a first step before turning to more elaborate techniques. Despite the relative simplicity of its ideas, it is a powerful technique for elucidating relations, and its results are easily understood and applied. More elaborate techniques, by contrast, do not always provide a readily comprehensible interpretation.

Assumptions in regression analysis

A regression model may be expressed in the following way:

E(Y) = η = β0 + β1x1 + ... + βpxp,
Y = η + e,

where Y is the random variable, η is its expected value, the x’s are known variables, the β’s are unknown coefficients, and e is a random error or deviation with zero mean. In the notation for variables, either fixed or random, subscripts are used only to distinguish the different variables but not to distinguish different observations of the same variable. The context generally makes the meaning clear. Thus, the above expression is an abbreviated form of

$$E(Y_j) = \eta_j = \beta_0 + \beta_1 x_{1j} + \beta_2 x_{2j} + \cdots + \beta_p x_{pj}, \qquad Y_j = \eta_j + e_j, \qquad j = 1, 2, \ldots, n.$$

This model is perfectly general; however, in estimating the coefficients, it is usually assumed that the ej are mutually uncorrelated and are of equal variance (homoscedastic).

If there is no regressor variable that is identically one (as in the two-sample situation described earlier), the β0 term might well be omitted. This is primarily a matter of notational convention.

The additional assumption that the errors are normally distributed is convenient and simplifies the theory. It can be shown that, on this assumption, the linear estimators given by least squares are in fact the maximum likelihood (m.l.) estimators of the parameters. In addition, the residual sum of squares Σ(Y − η̂)² (see below) is the basis for the m.l. estimator of the error variance (σ²).

Apart from the theoretical advantages of the assumption of normality of the e’s, there are the practical advantages that efficient methods of estimation and suitable tests of significance are relatively easy to apply and that the test statistics have well-known properties and are extensively tabulated. The normality assumption is often reasonable in applications, since even appreciable departures from it do not as a rule seriously invalidate regression analyses based upon normality [see Errors, article on Effects of Errors in Statistical Assumptions].

Some departures from assumptions may be expected in certain situations. For example, if some of the measurements of Y are much larger than others, the associated errors, either errors of measurement or errors resulting from uncontrolled random effects, may well be correspondingly larger, so that the variances of the errors will be heterogeneous (the errors are heteroscedastic). Again, with annual data it is to be expected that errors may arise from unobserved factors whose influence from year to year will be associated, so the errors will not be independent (but see Geary 1963). It is often possible in particular cases to transform the data so that they conform more closely to the assumptions; for instance, a logarithmic or square-root transformation of Y will often give a variable whose errors have approximately constant variance [see Statistical Analysis, Special Problems of, article on Transformations of Data]. This will amount to replacing the linear model for Y with a linear model for the transformed variable. In practice this often gives a satisfactory representation of the data, any departure from the model being attributed to error.

The method of least squares determines, for the parameters βi in the regression equation, estimators that minimize the sum of squares of deviations of the Y-values from the values given by the equation. This sum of squares is

$$\sum (Y - \eta)^2 = \sum (Y - \beta_0 - \beta_1 x_1 - \cdots - \beta_p x_p)^2.$$

In the following discussion the estimated β’s are denoted by b’s, and the corresponding estimator of η is denoted by η̂, so that

$$\hat{\eta} = b_0 + b_1 x_1 + \cdots + b_p x_p,$$

and the minimized sum of squared deviations is Σ(Y − η̂)².

The method has the twofold merit of minimizing not only the sum of squares of deviations but also the variance of the estimators bi (among unbiased linear estimators). Thus, for most practical purposes the method of least squares gives estimators with satisfactory properties. Sometimes such estimators are not appropriate (for example, when errors in one direction are more serious than those in the other), but those cases are usually apparent to the investigator.

The method of least squares applies equally well when the errors are heteroscedastic or even correlated, provided the covariance structure of the errors is known (apart from a constant of proportionality, which may be estimated from the data).

The method can be generalized to take account of the general correlation structure or, equivalently, a linear transformation of the observations may be used to reduce the problem to the simpler case of uncorrelated homoscedastic errors. (Details may be found in Rao 1965, chapter 4.)

When the correlation structure is unknown, the method of least squares may still be applied. If the data are analyzed as though the errors are uncorrelated and homoscedastic, the estimators of the parameters will be unbiased, although they will be less precise than if based on the correct model.

On the other hand, if the assumed linear model is incorrect—for example, if the relation is quadratic in one of the variables but only a linear model is fitted—then the estimators are liable to serious bias.

Since the form of the underlying model is almost always unknown there is usually a corresponding risk of bias. This problem has been studied in various contexts, but there is still much to be done; see Box and Wilson (1951), Box and Andersen (1955), and Plackett (1960, chapter 2).

Simple linear regression

In the simple linear regression model the expected value of Y is a linear function of a single variable, x1:

E(Y) = η = β0 + β1x1

The parameter β0 is the intercept, and the parameter β1 is the slope, of the regression line. This model is a satisfactory one in many cases, even if a number of variables affect the expected value of Y, for one of these may have a predominating influence, and although the omission of variables from the relation will lead to some bias, this may not be important in view of the increased simplicity of the model.

In studying the relation between two variables it is almost always desirable to plot a scatter diagram of the points representing the observations. The x-axis, or abscissa, is usually used for the regression variable and the y-axis, or ordinate, for the random variable. If the regression relation is linear the points should show a tendency to fall near a straight line, though if the variation is large this tendency may well be masked. Although for some purposes a line drawn “by eye” is adequate to represent the regression, in general such a line is not sufficiently accurate. There is always the risk of bias in both the position and the slope of the line. Because there is a tendency for the deviations from the line in both the x and y directions to be taken into account in determining the fit, lines fitted by eye are often affected by the scales of measurement used for the two axes. Since Y is the random variable, only the deviations in the y direction should be taken into account in determining the fit of the line. Often the investigator, knowing that there may be error in x1, may attempt to take it into account. It should be understood that this procedure will give an estimate not of the regression relation but of the underlying structure, which often differs from the regression relation. Another and more serious shortcoming of lines drawn by eye is that they do not provide an estimate of the variance about the line, and such an estimate is almost always required.

The method of least squares is commonly used when an arithmetical method of fitting is required, because of its useful properties and its relative ease of application. The equations for the least squares estimators, bi, based on n pairs of observations (x1, Y), are as follows:

$$b_1 = \frac{\sum Y (x_1 - \bar{x}_1)}{\sum (x_1 - \bar{x}_1)^2},$$

$$b_0 = \bar{Y} - b_1 \bar{x}_1.$$

Here the summation is over the observed values, x̄1 = Σx1/n, and Ȳ = ΣY/n. (Note that the observations on x1 need not be all different, although they must not all be the same.) The estimated regression function is

$$\hat{\eta} = b_0 + b_1 x_1.$$
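The two estimating equations translate directly into a few lines of code. The sketch below is illustrative only: the function name `fit_line` and the five data pairs are hypothetical, and no numerical library is assumed.

```python
def fit_line(x, y):
    """Least squares slope b1 and intercept b0 for n pairs (x1, Y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of squares of the x-values about their mean.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("the x-values must not all be the same")
    # b1 = sum of Y (x1 - mean of x1), divided by the sum of squares about the mean.
    b1 = sum(yi * (xi - x_bar) for xi, yi in zip(x, y)) / sxx
    # b0 = mean of Y minus b1 times mean of x1.
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical observations.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
b0, b1 = fit_line(x, y)
print(b0, b1)
```

The guard on `sxx` reflects the remark that the x-values may repeat but must not all coincide, since the slope is then undefined.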

The minimized sum of squares of deviations is

$$\sum (Y - \hat{\eta})^2 = \sum (Y - \bar{Y})^2 - b_1^2 \sum (x_1 - \bar{x}_1)^2.$$

The standard errors (estimated standard deviations) of the estimators may be derived from the minimized sum of squares of deviations. Two independent linear parameters have been fitted, and it may readily be shown that the expected value of this minimized sum of squares is (n − 2)σ², where σ² is the common variance of the residual errors. Consequently, an unbiased estimator of σ² is given by

$$s^2 = \frac{\sum (Y - \hat{\eta})^2}{n - 2},$$

and this is the conventional estimator of σ². The sum of squares for deviations, Σ(Y − η̂)², is said to have n − 2 degrees of freedom, representing the number of linearly independent quantities on which it is based.

The estimated variances of the estimators are

$$\text{est. var}(b_1) = \frac{s^2}{\sum (x_1 - \bar{x}_1)^2} \quad \text{and} \quad \text{est. var}(b_0) = s^2 \left[ \frac{1}{n} + \frac{\bar{x}_1^2}{\sum (x_1 - \bar{x}_1)^2} \right],$$

and the estimated covariance is

$$\text{est. cov}(b_0, b_1) = -\frac{s^2 \bar{x}_1}{\sum (x_1 - \bar{x}_1)^2}.$$
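A short numerical sketch may help fix these quantities. It assumes the usual textbook expressions for simple linear regression, namely est. var(b1) = s²/Σ(x1 − x̄1)², est. var(b0) = s²[1/n + x̄1²/Σ(x1 − x̄1)²], and est. cov(b0, b1) = −s²x̄1/Σ(x1 − x̄1)²; the function name `line_with_errors` and the data are hypothetical.

```python
def line_with_errors(x, y):
    """Fit Y on x1 and return s^2 with the estimated variances and covariance."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum(yi * (xi - x_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    # Minimized sum of squares of deviations about the fitted line.
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    # Two parameters were fitted, so n - 2 degrees of freedom remain.
    s2 = rss / (n - 2)
    return {
        "b0": b0,
        "b1": b1,
        "s2": s2,
        "var_b1": s2 / sxx,
        "var_b0": s2 * (1.0 / n + x_bar ** 2 / sxx),
        "cov_b0_b1": -s2 * x_bar / sxx,
    }


# Hypothetical observations.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
fit = line_with_errors(x, y)
print(fit)
```

The standard errors are the square roots of `var_b0` and `var_b1`. The negative covariance for a positive mean of x1 shows why the intercept and slope estimates are not independent unless the x-values are centered.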

Separate confidence limits for the parameters β0 and β1 may be determined from the estimators and their standard errors, using Student’s t-distribution [see Estimation, article on Confidence Intervals and Regions]. If $t_{\alpha, n-2}$ denotes the α-level of this distribution for n − 2 degrees of freedom, the 1 − α confidence limits for β1 are

$$b_1 \pm t_{\alpha, n-2} \, \frac{s}{\sqrt{\sum (x_1 - \bar{x}_1)^2}}.$$

Confidence limits for the intercept, β0, may be determined in a similar way but are not usually of interest. In a few cases it may be necessary to determine whether the estimator b0 is in agreement with some theoretical value of the intercept. Thus, in some situations it is reasonable to expect the regression line to pass through the origin, so that β0 = 0. It will then be necessary to test the significance of the departure of b0 from zero or, equivalently, to determine whether the confidence limits for β0 include zero.

When it is assumed that β0 = 0 and there is no need to test this hypothesis, then the regression has only one unknown parameter; in such a case the sum of squares for deviations from the regression line, used to estimate the residual variance, will have n — 1 degrees of freedom.

When the parameters β0 and β1 are both of interest, a joint confidence statement about them may be useful. The joint confidence region is usually an ellipse centered at (b0, b1) and containing all values of (β0, β1) from which (b0, b1) does not differ with statistical significance as measured by an F-test (see the section on significance testing, below). [The question of joint confidence regions is discussed further in Linear Hypotheses, article on Multiple Comparisons.]

Choice of experimental values

The formula for the variance of the regression coefficient b1 shows that it is the more accurately determined the larger is Σ(x1 − x̄1)², the sum of squares of the values of x1 about their mean. This is in accordance with common sense, since a greater spread of experimental values will magnify the regression effect yet will in general leave the error component unaltered. If accurate estimation of β1 were the only criterion, the optimum allocation of experimental points would be in equal numbers at the extreme ends of the possible range. However, the assumption that a regression is linear, although satisfactory over most of the possible range, is often likely to fail near the ends of the range; for this and other reasons it may be desirable to check the linearity of the regression, and to do so points other than the two extreme values must be observed. In practice, where little is known about the form of the regression relation it is usually desirable to take points distributed uniformly throughout the range. If the experimental points are equally spaced, this will facilitate the fitting of quadratic or higher degree polynomials, using tabulated orthogonal polynomials as described below.
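The trade-off between precision of the slope and the ability to check linearity can be illustrated with two hypothetical six-point designs on the range 0 to 10. Since the variance of b1 is σ² divided by the sum of squares of x1 about its mean, comparing that sum of squares for the two designs is enough; the function name and the design values below are illustrative assumptions.

```python
def sum_sq_about_mean(x):
    """Sum of squares of the x-values about their mean."""
    mean = sum(x) / len(x)
    return sum((xi - mean) ** 2 for xi in x)


# Design A: equal numbers of points at the two ends of the range.
extremes = [0.0, 0.0, 0.0, 10.0, 10.0, 10.0]
# Design B: equally spaced points throughout the range.
uniform = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]

sxx_extremes = sum_sq_about_mean(extremes)
sxx_uniform = sum_sq_about_mean(uniform)

# var(b1) is inversely proportional to the sum of squares, so this ratio is
# var(b1) under the uniform design relative to the two-point design.
variance_ratio = sxx_extremes / sxx_uniform
print(sxx_extremes, sxx_uniform, variance_ratio)
```

In this example the equally spaced design roughly doubles the variance of the slope estimate, which is the price paid for being able to detect curvature.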

Confidence limits for the regression line

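The estimated regression function is

$$\hat{\eta} = b_0 + b_1 x_1 = \bar{Y} + b_1 (x_1 - \bar{x}_1),$$

and corresponding to any specified value, x1*, of x1 the variance of η̂ is estimated as

$$\text{est. var}(\hat{\eta}) = s^2 \left[ \frac{1}{n} + \frac{(x_1^* - \bar{x}_1)^2}{\sum (x_1 - \bar{x}_1)^2} \right].$$

Thus, for any specified value of x1, confidence limits for η can be determined according to the formula

$$\hat{\eta} \pm t_{\alpha, n-2} \, s \sqrt{\frac{1}{n} + \frac{(x_1^* - \bar{x}_1)^2}{\sum (x_1 - \bar{x}_1)^2}}.$$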

The locus of these limits consists of the two branches of a hyperbola, lying on either side of the fitted regression line; this locus defines what may be described as a confidence curve. A typical regression line fitted to a set of points is shown in Figure 1 with the 95 per cent confidence curve shown as the two inside upper and lower curves, YL.

The above limits are appropriate for the estimated value of η corresponding to a given value of x1. They do not, however, set limits to the whole line. Such limits are given by a method developed by Working and Hotelling, as described, e.g., by Kendall and Stuart ([1943–1946] 1958–1966, vol. 2, chapter 28). [See also Linear Hypotheses, article on Multiple Comparisons.] As might be expected, these limits lie outside the corresponding limits for the same probability for a single value of x1. The limits may be regarded as arising from the envelope of all lines whose parameters fall within a suitable confidence region. These limits are given by

$$\hat{\eta} \pm \sqrt{2 F_{1-\alpha;\, 2,\, n-2}} \; s \sqrt{\frac{1}{n} + \frac{(x_1^* - \bar{x}_1)^2}{\sum (x_1 - \bar{x}_1)^2}},$$

where $F_{1-\alpha;\, 2,\, n-2}$ is the tabulated value for the F-distribution with 2 and n − 2 degrees of freedom at confidence level 1 − α. These limits, for a 95 per cent confidence level, are shown as a pair of broken lines in Figure 1.

Figure 1 — Regression line and associated 95 per cent confidence regions*

* The Yp curves, although they appear straight in the figure, are hyperbolas like the other Y curves.

Source of data: Martin, Jean I., 1965, Refugee Settlers: A Study of Displaced Persons in Australia. Canberra: Australian National University.

The user of the confidence limits must be clear about which type of limits he requires. If he is interested in the limits on the estimated η for a particular value x1* or in only one pair of limits at a time, the inner limits, YL, will be appropriate, but if he is interested in limits for many values of x1* (some of which may not be envisaged when the calculations are being made), the Working-Hotelling limits, YWH, will be needed.

Application of the regression equation

The regression equation is usually determined not only to provide an empirical law relating variables but also as a means of making future estimates or predictions. Thus, in studies of demand, regression relations of demand on price and other factors enable demand to be predicted for future occasions when one or more of these factors is varied. Such prediction is provided directly by the regression equation. It should be noted, however, that the standard error of prediction will be greater than the standard error of the estimated points (η̂) on the regression line. This is because a future observation will vary about its regression value with variance equal to the variance of individual values about the regression in the population. When standard errors are being quoted, it is important to distinguish between the standard error of the point η̂ on the regression line and the standard error of prediction. The estimated variance of prediction is

$$\text{est. var}(\hat{\eta}_P) = s^2 \left[ 1 + \frac{1}{n} + \frac{(x_1^* - \bar{x}_1)^2}{\sum (x_1 - \bar{x}_1)^2} \right].$$

The outside upper and lower curves in Figure 1 are confidence limits for prediction, YP, based on this variance. Clearly, for making predictions of this sort there is little point in determining the regression line with great accuracy. The major part of the error in such cases will be the variance of individual values.

The formula for the standard error of η̂ or η̂P shows that the error of estimation increases as the x1-value departs from the mean of the sample, so that when the deviation from the mean is large the variance of estimate can be so great as to make the estimate worthless. This is one reason why investigators should be discouraged from attempting to draw inferences beyond the range of the observed values of x1. The other reason is that the assumed linear regression, even though satisfactory within the observed range, may not hold true outside this range.
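The three kinds of limits discussed in this and the preceding section (YL, the Working-Hotelling limits, and YP) can be compared numerically. The sketch below assumes the standard expressions: the variance of the fitted point is s²[1/n + (x1* − x̄1)²/Σ(x1 − x̄1)²], the prediction variance adds s² to it, and the Working-Hotelling multiplier is the square root of 2F. The function name, the data, and the dictionary keys are hypothetical; the critical values are supplied by the user from tables (here the two-sided 5 per cent point of t with 3 degrees of freedom, 3.182, and the 95 per cent point of F with 2 and 3 degrees of freedom, 9.552).

```python
import math


def band_half_widths(x, y, x_star, t_crit, f_crit):
    """Half-widths at x_star of the limits YL, YWH and YP about the fitted line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum(yi * (xi - x_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    s2 = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)
    # Bracketed factor common to all three variances.
    h = 1.0 / n + (x_star - x_bar) ** 2 / sxx
    se_line = math.sqrt(s2 * h)          # standard error of the fitted point
    se_pred = math.sqrt(s2 * (1.0 + h))  # standard error of prediction
    return {
        "YL": t_crit * se_line,                      # single value of x1
        "YWH": math.sqrt(2.0 * f_crit) * se_line,    # whole line
        "YP": t_crit * se_pred,                      # future observation
    }


# Hypothetical observations (n = 5, so n - 2 = 3 degrees of freedom).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
T_CRIT = 3.182  # tabulated two-sided 5 per cent point of t, 3 d.f.
F_CRIT = 9.552  # tabulated 95 per cent point of F, 2 and 3 d.f.

at_mean = band_half_widths(x, y, 3.0, T_CRIT, F_CRIT)
far_out = band_half_widths(x, y, 9.0, T_CRIT, F_CRIT)
print(at_mean)
print(far_out)
```

Evaluating the half-widths at the sample mean and at a point well outside the observed range shows both effects described in the text: the whole-line and prediction limits are wider than the single-value limits, and every limit widens as x1* moves away from the mean.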

Inverse estimation

In many situations the investigator is primarily interested in determining the value of x1 corresponding to a given level or value, η*. Thus, although it is still appropriate to determine the regression of the random variable Y on the fixed variable x1, the inference has to be carried out in reverse. For example, if a drug that affects the reaction time of individuals is being tested at different levels, the reaction time Y will be a random variable with regression on the dose level x1. However, the purpose of the investigation may be to determine a dose level that will lead to a given time of reaction on the average. The experimental doses, being fixed, cannot be treated as random, so that it is inappropriate to determine a regression of x1 on Y, and such a pseudo regression would give spurious results. In such situations the value of x1 corresponding to a given value of η has to be estimated from the regression of Y on x1.

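The regression equation can be rearranged to give an estimator of x1 corresponding to a given value, η*,

$$\hat{X}^* = \frac{\eta^* - b_0}{b_1}.$$

The approximate estimated variance of the estimator is

$$\text{est. var}(\hat{X}^*) \approx \frac{s^2}{b_1^2} \left[ \frac{1}{n} + \frac{(\hat{X}^* - \bar{x}_1)^2}{\sum (x_1 - \bar{x}_1)^2} \right].$$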

A more precise method of treating such a problem is to determine confidence limits for η given x1 and to determine from these, by rearranging the equation, confidence limits for x1. For the regression shown in Figure 1, the 95 per cent confidence curves (the inner curves, YL, on either side of the line) will in this way give confidence limits for x1 corresponding to a given value of η. The point at which the horizontal line Y = η* cuts the regression line gives the estimate of x1; the points at which the line cuts the upper and lower curves give, respectively, lower and upper confidence limits for x1. This may be demonstrated by an extension of the reasoning leading to confidence limits. [See Estimation, article on Confidence Intervals and Regions.]

Sometimes, rather than a hypothetical regression value, η*, a single observed value, y* (not in the basic sample), is given, and limits are required for the value of x1 that could be associated with such a value. The estimator X̂* is given by

$$\hat{X}^* = \frac{y^* - b_0}{b_1},$$

and its approximate estimated variance (which must take into account the variation between responses on Y to a given value of x1) is

$$\text{est. var}(\hat{X}^*) \approx \frac{s^2}{b_1^2} \left[ 1 + \frac{1}{n} + \frac{(\hat{X}^* - \bar{x}_1)^2}{\sum (x_1 - \bar{x}_1)^2} \right].$$

Using this augmented variance, confidence limits on x1 corresponding to a given y* may be found. For more precise determination of the confidence limits for prediction, the locus of limits for y* given x1 may be inverted to give limits for x1 given y*. In Figure 1, the outer curves are these loci (for the 95 per cent confidence level); the 95 per cent limits for x1 will be given by the intersection of the line Y = y* with these confidence curves for prediction.
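The point estimate and its approximate variance can be sketched as follows. The variance expressions used here are the usual large-sample (delta-method) approximations, (s²/b1²)[1/n + (X̂* − x̄1)²/Σ(x1 − x̄1)²] for a hypothetical regression value and the same expression with 1 added inside the bracket for a single new observation; they are stated as assumptions, and the function name, flag, and data are hypothetical.

```python
def inverse_estimate(x, y, target, single_observation=False):
    """Estimate the x1-value giving `target`, with its approximate variance.

    `target` is either a regression value (eta*) or, when
    `single_observation` is True, one newly observed value y*.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum(yi * (xi - x_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    s2 = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)
    # Rearranged regression equation.
    x_hat = (target - b0) / b1
    bracket = 1.0 / n + (x_hat - x_bar) ** 2 / sxx
    if single_observation:
        # A single observation also varies about its own regression value.
        bracket += 1.0
    return x_hat, s2 / b1 ** 2 * bracket


# Hypothetical observations.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
x_for_eta, var_for_eta = inverse_estimate(x, y, 4.6)
x_for_obs, var_for_obs = inverse_estimate(x, y, 4.6, single_observation=True)
print(x_for_eta, var_for_eta, var_for_obs)
```

The approximation deteriorates when b1 is small relative to its standard error, which is the situation in which inverting the confidence curves, as described in the text, is the safer procedure.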

Multiple regression

In many situations where a single regression variable is not adequate to represent the variation in the random variable Y, a multiple regression is appropriate. In other situations there may be only one regression variable, but the assumed relation, rather than being linear, is a quadratic or a polynomial of higher degree. Since both multiple linear regression and polynomial regression relations are linear in the unknown parameters, the same techniques are applicable to both; in fact, polynomial regression is a special case of multiple regression. The number of variables to include in a multiple regression, or the degree of polynomial to be applied, is to some extent a matter of judgment and convenience, although it must be remembered that a regression equation containing a large number of variables is usually inconvenient to use as well as difficult to calculate. With the use of electronic computers, however, there is greater scope for increasing the number of regression variables, since the computations are routine.

Consider the multiple regression equation

E(Y) = η = β0 + β1x1 + ... + βpxp,

with p regression variables and a constant term. The estimation of these p + 1 unknown parameters can be systematically carried out if β0 is also regarded as a regression coefficient corresponding to a regression variable x0 that is always unity. As in simple regression, the method of least squares provides unbiased linear estimators of the coefficients with minimum variance and also provides estimators of the standard errors of these coefficients. The quantities required for determining the estimators are the sums of squares and products of the x-values, the sums of products of the observed Y with each of the x-values, and the sum of squares of Y. The method of least squares gives a set of linear equations for the b’s, called the normal equations:

$$\begin{aligned} t_{00} b_0 + t_{01} b_1 + \cdots + t_{0p} b_p &= u_0, \\ t_{10} b_0 + t_{11} b_1 + \cdots + t_{1p} b_p &= u_1, \\ &\;\;\vdots \\ t_{p0} b_0 + t_{p1} b_1 + \cdots + t_{pp} b_p &= u_p, \end{aligned}$$

where thi = tih = Σxhxi and ui = ΣYxi. These equations can be written in matrix form as

Tb = u,

where T = (thi) and u is the vector of the ui. The solution requires the inversion of the matrix T, the inverse matrix being denoted by T⁻¹ (with typical element $t^{hi}$). The solution may be written in matrix form as

$$b = T^{-1} u$$

or in extended form as

$$b_0 = t^{00} u_0 + t^{01} u_1 + \cdots + t^{0p} u_p,$$

and so forth.

The variance of bi is $t^{ii}\sigma^2$, and the covariance of bi and bj is $t^{ij}\sigma^2$, where $t^{ij}$ denotes an element of the inverse matrix T⁻¹. It should be remarked that in the special case of “regression through the origin” (that is, when the constant term β0 is assumed to be zero) the first equation and the first term of each other equation are omitted; the constant regressor x0 and its coefficient β0 thus have the same status as any other regression variable.

When the constant term is included, computational labor may be reduced and arithmetical accuracy increased if the sums of squares and products are taken about the means. That is, the $t_{hi}$ and $u_i$ are replaced by

$$\sum (x_h - \bar{x}_h)(x_i - \bar{x}_i)$$

and

$$\sum (Y - \bar{Y})(x_i - \bar{x}_i),$$

respectively. All the sums of products with zero subscripts then vanish, and the sums of squares are reduced in magnitude. The constant term has to be estimated separately; it is given by

$$b_0 = \bar{Y} - b_1\bar{x}_1 - \cdots - b_p\bar{x}_p.$$

The computational aspects of matrix inversion and the determination of the regression coefficients are dealt with in many statistical texts, including Williams (1959); in addition, many programs for matrix inversion are available for electronic computers.

Effect of heteroscedasticity

When the error variance of the dependent variable Y is different in different parts of its range (or, strictly, of the range of its expected value, η), estimators of regression coefficients ignoring the heteroscedasticity will be unbiased but of reduced accuracy, as already mentioned. The calculation of improved estimators may then sometimes be necessary.

There are some problems in taking heteroscedasticity into account. Among them is the problem of specification: defining the relation between expected value and variance. Often, with adequate data, the estimated values ($\hat{\eta}$) from the usual unweighted regression line can be grouped and the mean squared deviation from these values for each group used as a rough measure of the variance. The regression can then be refitted, each value being given a weight inversely proportional to the estimated variance. Two iterations of this method are likely to give estimates of about the accuracy practically attainable. If an empirical relation between expected value and error variance can be deduced, this simplifies the problem somewhat; however, the weight for each observation has to be determined from a provisionally fitted relation, so iteration is still required.

To calculate a weighted regression, each observation $Y_j$ (j = 1, 2, ..., n) is given a weight $w_j$ instead of unit weight as in the standard calculation. These weights will be the reciprocals of the estimated variances of each value. Then, if weighted quantities are distinguished by the subscript w,

$$t_{whi} = \sum w x_h x_i,$$
$$u_{wi} = \sum w Y x_i,$$

and the normal equations are

$$t_{w00}b_{w0} + t_{w01}b_{w1} + \cdots + t_{w0p}b_{wp} = u_{w0},$$

and so on, or in matrix form

$$T_w b_w = u_w.$$

The solution is

$$b_w = T_w^{-1}u_w,$$

and the variances of the estimators $b_{wi}$ are approximately $t_w^{ii}\sigma^2$, where $t_w^{ii}$ is the ith diagonal element of $T_w^{-1}$.

When the weights are estimated from the data, as in the iterative method just described, some allowance has to be made in an exact analysis for errors in the weights. This inaccuracy will somewhat reduce the precision of the estimators. However, for most practical purposes, and provided that the number of observations in each group for which weights are estimated is not too small, the errors in the weights may be ignored. (For further discussion of this question see Cochran & Carroll 1953.)
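The weighted calculation can be sketched for the simplest case of a straight line, where the 2 × 2 weighted matrix can be inverted in closed form. In the Python sketch below, the function name `weighted_line`, the three data points, and the weights are all hypothetical; the weights are simply assumed known (the third observation is taken to have half the variance of the others), so the iterative estimation of weights is not shown.

```python
def weighted_line(x, Y, w):
    """Weighted least-squares fit of Y = b0 + b1 * x with weights w."""
    t00 = sum(w)                                        # sum of w * x0 * x0, x0 = 1
    t01 = sum(wj * xj for wj, xj in zip(w, x))          # sum of w * x0 * x1
    t11 = sum(wj * xj * xj for wj, xj in zip(w, x))     # sum of w * x1 * x1
    u0 = sum(wj * yj for wj, yj in zip(w, Y))           # sum of w * Y * x0
    u1 = sum(wj * yj * xj for wj, yj, xj in zip(w, Y, x))
    det = t00 * t11 - t01 * t01
    # Elements of the inverse of the 2 x 2 weighted matrix.
    inv00, inv01, inv11 = t11 / det, -t01 / det, t00 / det
    b0 = inv00 * u0 + inv01 * u1
    b1 = inv01 * u0 + inv11 * u1
    return b0, b1, (inv00, inv11)


x = [0, 1, 2]
Y = [0, 1, 4]

unit = weighted_line(x, Y, [1, 1, 1])        # ordinary (unit-weight) fit
weighted = weighted_line(x, Y, [1, 1, 2])    # third point given double weight

print("unit weights:    b0, b1 =", unit[0], unit[1])
print("unequal weights: b0, b1 =", weighted[0], weighted[1])
```

With unit weights the function reduces to the ordinary calculation, which makes it easy to see how far the weighting moves the fitted line.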

Estimability of the coefficients

It is intuitively clear in a general way that the p + 1 regression variables included in a regression equation should not be too nearly linearly dependent on one another, for then it might be expected that these regression variables could be approximately expressed in terms of a smaller number.

More precisely, in order that meaningful estimators of the regression coefficients exist, it is necessary that the variables be linearly independent (or, equivalently, T must be nonsingular). That is, no one variable should be expressible as a linear combination of the others or, expressed symmetrically, no linear combination of the variables vanishes unless all coefficients are zero. Clearly, if only p - r of the variables are linearly independent, then the regression relation may be represented as a regression on these p - r, together with arbitrary multiples of the vanishing linear combinations. From the practical point of view, this lack of estimability will cause no problems, provided that the regression on a set of p - r linearly independent variables is calculated. Estimation from the equation will be unaffected, but for testing the significance of the regression it must be noted that the regression sum of squares has not p + 1 but p - r degrees of freedom, and the residual has n - p + r.

However, if the lack of estimability is ignored, the calculations to determine the p + 1 coefficients either will fail (since the matrix T, being singular, has no inverse) or will give misleading results (if an approximate value of T, having an inverse, is used in calculation and the lack of estimability is obscured).

When the regression variables, although linearly independent, are barely so (in the sense that the matrix T, although of rank p + 1, is “almost singular,” having a small but nonvanishing determinant), the regression coefficients will be estimable but will have large standard errors. In typical cases, many of the estimated coefficients will not differ with statistical significance from zero; this merely reflects the fact that the corresponding regression variable may be omitted from the equation and the remaining coefficients adjusted without significant worsening of the fit.

In this situation, as in the case of linear dependence, these effects are not usually important in practice; however, they may suggest the advisability of reducing the number of regression variables included in the equation. [For further discussion, see Statistical Identifiability.]
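The connection between linear dependence of the regression variables and singularity of T can be checked numerically. The Python sketch below uses hypothetical data in which one regressor is exactly twice another, and compares the determinant of T with that of a second, linearly independent set; the helper names `cross_products` and `det3` are illustrative only.

```python
def cross_products(X):
    """Matrix T of sums of squares and products of the regressors."""
    k = len(X[0])
    return [[sum(r[h] * r[i] for r in X) for i in range(k)] for h in range(k)]


def det3(M):
    """Determinant of a 3 x 3 matrix by cofactor expansion."""
    (a, b, c), (d, e, f), (g, h, i) = M
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


x1 = [1, 2, 3, 4]

# Linearly dependent case: x2 is exactly 2 * x1, so T has no inverse.
dependent = [[1, x, 2 * x] for x in x1]

# Linearly independent case: x2 is not a combination of x0 and x1.
x2 = [1, 0, 0, 1]
independent = [[1, a, c] for a, c in zip(x1, x2)]

det_dependent = det3(cross_products(dependent))
det_independent = det3(cross_products(independent))

print("dependent regressors:   det T =", det_dependent)
print("independent regressors: det T =", det_independent)
```

Integer data are used so that the vanishing determinant is exact; with measured data the determinant of a nearly dependent set would be small rather than exactly zero, which is the "almost singular" situation.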

Conditions on the coefficients

Sometimes the regression coefficients βi are assumed to satisfy some conditions based on theory. Provided these conditions are expressible as linear equations in the coefficients, the method of least squares carries through and leads, as before, to unbiased estimators satisfying the conditions and with minimum variance among linear estimators. It will be clear that with p + 1 regression coefficients subject to r + 1 independent linear restrictions, r + 1 of the coefficients may be eliminated, so that the restricted regression is equivalent to one with p - r coefficients. Thus, in principle there is a choice between expressing the model in terms of p - r unrestricted coefficients or p + 1 restricted ones; often the latter has advantages of symmetry and interpretability.

A simple example of restricted regression is one in which η is a weighted average of the x’s but with unknown weights, β1, ..., βp. Here the side conditions would be β0 = 0, β1 + ... + βp = 1.

As the introduction of side conditions effectively reduces the number of linearly independent coefficients, such conditions are useful in restoring estimability when the coefficients are nonestimable. In many problems these side conditions may be chosen to have practical significance. For example, where an overall mean and a number of treatment “effects” are being estimated, it is conventional to specify the effects so that their mean vanishes; with this specification they represent deviations from the overall mean.

When a restricted regression is being estimated, it will often be possible and of interest to estimate the unrestricted regression as well, in order to see the effect of the restrictions and to test whether the data are concordant with the conditions assumed. The test of significance consists of comparing the (p + 1)-variable (unrestricted) regression with the (p - r)-variable (restricted) regression, in the manner described in the section on significance testing. This test of concordance is independent of the test of significance of any of the restricted coefficients.

Further details and examples of restricted regression are given by Rao (1965, p. 189) and Williams (1959, pp. 49-58). In the remainder of this article, the notation will presume unrestricted regression.

Missing values

When observations on some of the variables are missing, the simplest and usually the only practicable procedure is to ignore the corresponding values of the other variables—that is, to work only with complete sets of observations. However, it is sometimes possible to make use of the incomplete data, provided some additional assumptions are made. Methods have been developed under the assumption that (a) the missing values are in some sense randomly deleted, or the assumption that (b) the variables are all random and follow a multivariate normal distribution. Assumption (b) is treated by Anderson (1957) and Rao (1952, pp. 161-165). It is sometimes found, after the least squares equations for the constants in a regression relation have been set up, that some of the values of the dependent variable are unreliable or missing altogether. Rather than recalculate the equations it is often more convenient to replace the missing value by the value expected from the regression relation. This substitution conserves the form of the estimating equations, usually with little disturbance to the significance tests or the variances of the estimators.

The techniques of “fitting missing values” have been most fully developed for experiments designed in such a way that the estimators of various constants are either uncorrelated or have a symmetric pattern of correlations and the estimating equations have a symmetry of form that simplifies their solution. Missing values in such experiments destroy the symmetry and make estimation more difficult; it is therefore a great practical convenience to replace the missing values. Details of the method applied to designed experiments will be found in Cochran and Cox (1950). For applications to general regression models see Kruskal (1961).

The technique is itself an application of the method of least squares. To replace a missing value $Y_j$, a value $\hat{\eta}_j$ is chosen so as to minimize its contribution to the residual sum of squares. Thus, the estimate is equivalent to the one that would have been obtained by a fresh analysis; the calculation is simplified by the fact that estimates for only one or a few values are being calculated. The degrees of freedom for the residual sum of squares are reduced by the number of values thus fitted. For most practical purposes it is then sufficiently accurate to treat the fitted values as though they were original observations. The exact analysis is described by Yates (1933) and, in general terms, by Kruskal (1961).
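The equivalence between fitting a missing value and making a fresh analysis of the complete observations can be illustrated for a straight line. The Python sketch below is a minimal example on hypothetical data: the missing value is replaced by a provisional figure, the line is refitted, the fitted value is substituted, and the cycle is repeated until the substituted value no longer changes. The iterative scheme and the name `fit_line` are assumptions made for illustration; the source does not prescribe a particular algorithm.

```python
def fit_line(x, Y):
    """Ordinary least-squares intercept and slope for Y on x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(Y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, Y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


x = [0, 1, 2, 3, 4]
Y = [1, 2, 3, 5, None]            # hypothetical data; last value missing
missing = Y.index(None)

observed = [y for y in Y if y is not None]
filled = list(Y)
filled[missing] = sum(observed) / len(observed)   # provisional starting value

# Replace the missing value by its fitted value until it stops changing.
for _ in range(200):
    b0, b1 = fit_line(x, filled)
    new_value = b0 + b1 * x[missing]
    change = abs(new_value - filled[missing])
    filled[missing] = new_value
    if change < 1e-12:
        break

b0, b1 = fit_line(x, filled)

# Fresh analysis using only the complete observations, for comparison.
complete_b0, complete_b1 = fit_line(x[:missing], observed)

# The residual has one degree of freedom fewer than the filled-in data
# would suggest: (5 - 1 - 1) - 1 = 2.
print(b0, b1, filled[missing])
print(complete_b0, complete_b1)
```

At convergence the substituted value has zero residual, so it contributes nothing to the residual sum of squares and the two sets of coefficients agree.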

Significance testing

In order to determine the standard errors of the regression coefficients and to test their significance, it is necessary to estimate the residual variance, σ². The sum of squares of deviations, $\sum (Y - \hat{\eta})^2$, which may readily be shown to satisfy

$$\sum (Y - \hat{\eta})^2 = \sum Y^2 - \sum b_i u_i,$$

is found under p + 1 constraints and so may be said to have n - p - 1 degrees of freedom; if the model assumed is correct, so that the deviations are purely random, the expected value of the sum of squares is (n - p - 1)σ². Accordingly, the residual mean square,

$$s^2 = \frac{\sum Y^2 - \sum b_i u_i}{n - p - 1},$$

is an unbiased estimator of the residual variance. The variances of the regression coefficients are estimated by

$$\text{est. var}(b_i) = t^{ii}s^2,$$

and the standard errors are the square roots of these quantities. The inverse matrix thus is used both in the calculation of the estimators and in the determination of their standard errors. From the off-diagonal elements $t^{hi}$ of the inverse matrix are derived the estimated covariances between the estimators,

$$\text{est. cov}(b_h, b_i) = t^{hi}s^2.$$

The splitting of the total sum of squares of Y into two parts, a part associated with the regression effects and a residual part independent of them, is a particular example of what is known as the analysis of variance [see Linear Hypotheses, article on Analysis of Variance].

Testing for regression effects. The regression sum of squares, being based on p + 1 estimated quantities, will have p + 1 degrees of freedom. When regression effects are nonexistent, the expected value of each part is proportional to its degrees of freedom. Accordingly, it is often convenient and informative to present these two parts, and their corresponding mean squares, in an analysis-of-variance table, such as Table 1.

In the table, the final column gives the expected values of the two mean squares; it shows that real regression effects inflate the regression sum of squares but not the residual sum of squares. This fact provides the basis for tests of significance of a calculated regression, since large values of the ratio of regression mean square to residual mean square give evidence for the existence of a regression relation.

Significance of a single coefficient. The question may arise whether one or more of the regression variables contribute to the relation anything that is not already provided by the other variables. In such circumstances the relevant hypothesis to be examined is that the β’s corresponding to these variables are zero. A more general hypothesis that may sometimes need to be tested is that certain of the β’s take assigned values.

The simplest test is that of the statistical significance of a single coefficient, say $b_i$. The test will be of its departure from zero, if the contribution of $x_i$ to the regression is in question. More generally, when $\beta_i$ is specified as, say, $\beta_i^*$, it will be relevant to test the significance of the departure of $b_i$ from $\beta_i^*$. The significance test in either case is the same; the squared difference between estimated and hypothesized values is compared with the estimated variance of that difference, which is $s^2 t^{ii}$.

The ratio $F = (b_i - \beta_i^*)^2/(s^2 t^{ii})$ has the F-distribution with 1 and n - p - 1 degrees of freedom if the difference is in fact due to sampling fluctuations alone; in this case, the F-statistic is just the square of the usual t-statistic. When $\beta_i$ differs from $\beta_i^*$ the F-statistic will tend to be larger, so that a right-tail test is indicated.

Table 1 – Analysis-of-variance table for testing regression effects

| Source | Degrees of freedom | Sum of squares | Mean square | Expected mean square |
| --- | --- | --- | --- | --- |
| Regression | p + 1 | $\sum b_i u_i$ | $\sum b_i u_i/(p + 1)$ | $\sigma^2 + \sum_h \sum_i t_{hi}\beta_h\beta_i/(p + 1)$ |
| Residual | n - p - 1 | $\sum Y^2 - \sum b_i u_i = (n - p - 1)s^2$ | $s^2$ | $\sigma^2$ |
| Total | n | $\sum Y^2$ | | |
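The test of a single coefficient can be followed step by step in a small numerical case. The Python sketch below fits a straight line to five hypothetical observations, computes the residual mean square from the identity (sum of squares of Y minus the sum of the products of each coefficient with the corresponding sum of products), and forms the F-ratio for the slope against a hypothesized value of zero. The variable names are illustrative; the 2 × 2 inverse is written out in closed form.

```python
x = [0, 1, 2, 3, 4]
Y = [1, 3, 2, 5, 4]          # hypothetical observations
n, p = len(x), 1             # one regression variable plus a constant term

# Elements of T and u, with the constant regressor x0 = 1.
t00, t01, t11 = n, sum(x), sum(v * v for v in x)
u0 = sum(Y)
u1 = sum(y * v for y, v in zip(Y, x))

# Inverse of the 2 x 2 matrix T.
det = t00 * t11 - t01 * t01
inv00, inv01, inv11 = t11 / det, -t01 / det, t00 / det

b0 = inv00 * u0 + inv01 * u1
b1 = inv01 * u0 + inv11 * u1

# Residual sum of squares and residual mean square.
residual_ss = sum(y * y for y in Y) - (b0 * u0 + b1 * u1)
s2 = residual_ss / (n - p - 1)

# Standard error of the slope and F-ratio for the hypothesis beta_1 = 0.
beta1_hypothesized = 0.0
se_b1 = (inv11 * s2) ** 0.5
F = (b1 - beta1_hypothesized) ** 2 / (s2 * inv11)

print("b0, b1 =", b0, b1)
print("s2 =", s2, " se(b1) =", se_b1, " F =", F)
```

The F-ratio is referred to the F-distribution with 1 and n - p - 1 = 3 degrees of freedom; the tabulated upper 5 per cent point for those degrees of freedom is about 10.13, so a ratio smaller than that would not be judged significant at that level.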

Testing several coefficients. To test a number of regression variables (or, more precisely, their regression coefficients), the method of least squares is equivalent to fitting a regression with and without the variables in question and testing the difference in the regression sums of squares against the estimated error variance. To choose a specific example, suppose the last q coefficients in a p-variable regression are to be tested. If the symbol $S^2$ is used to stand for sum of squares, the sum of squares for regression on all p variables may be written

$$S_p^2 = \sum_{i=0}^{p} b_i u_i,$$

with p + 1 degrees of freedom, and the corresponding sum of squares on the first p - q variables as

$$S_{p-q}^2 = \sum_{i=0}^{p-q} b_i' u_i,$$

with p - q + 1 degrees of freedom; here the $b_i'$ denote the coefficients re-estimated with only the first p - q variables in the equation. The difference, $S_p^2 - S_{p-q}^2$, a sum of squares with q degrees of freedom, provides a criterion for testing the significance of the q regression coefficients. The ratio

$$F = \frac{(S_p^2 - S_{p-q}^2)/q}{s^2}$$

has, under the null hypothesis that the last q coefficients are zero, the F-distribution with q and n - p - 1 degrees of freedom. This simultaneous test of q coefficients may also be adapted to testing the departure of the q coefficients from theoretical values, not necessarily zero.

The significance test may be conveniently set out as in Table 2, where only the mean squares required for the significance test appear in the last column.

When q = 1, this test reduces to the test for a single regression coefficient, and the F-ratio

$$F = \frac{S_p^2 - S_{p-1}^2}{s^2} = \frac{b_p^2}{s^2 t^{pp}}$$

is then identical with the F-ratio given above for making such a test.
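The comparison of regressions with and without the variables under test can be sketched as follows. In this Python example the helper `regression_ss`, which solves the normal equations by Gaussian elimination and returns the regression sum of squares, is a hypothetical name, and the four observations are invented; the two regressors are chosen to be orthogonal so that the arithmetic is easy to verify.

```python
def regression_ss(X, Y):
    """Regression sum of squares (sum of b_i * u_i) for regressors X."""
    k = len(X[0])
    T = [[float(sum(r[h] * r[i] for r in X)) for i in range(k)] for h in range(k)]
    u = [float(sum(y * r[i] for r, y in zip(X, Y))) for i in range(k)]
    # Solve T b = u by Gaussian elimination with partial pivoting.
    A = [row + [ui] for row, ui in zip(T, u)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, k):
            factor = A[r][col] / A[col][col]
            A[r] = [a - factor * c for a, c in zip(A[r], A[col])]
    b = [0.0] * k
    for i in reversed(range(k)):
        b[i] = (A[i][k] - sum(A[i][j] * b[j] for j in range(i + 1, k))) / A[i][i]
    return sum(bi * ui for bi, ui in zip(b, u))


x1 = [-1, -1, 1, 1]
x2 = [-1, 1, -1, 1]
Y = [1, 2, 3, 6]                 # hypothetical observations
n, p, q = len(Y), 2, 1           # test the last q = 1 of p = 2 coefficients

full = [[1, a, c] for a, c in zip(x1, x2)]
reduced = [[1, a] for a in x1]

ss_full = regression_ss(full, Y)         # regression on all p variables
ss_reduced = regression_ss(reduced, Y)   # regression on the first p - q
s2 = (sum(y * y for y in Y) - ss_full) / (n - p - 1)

F = ((ss_full - ss_reduced) / q) / s2
print("F =", F, "with", q, "and", n - p - 1, "degrees of freedom")
```

Because q = 1 here, the same value is obtained from the single-coefficient form, the squared coefficient divided by its estimated variance.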

Linear combinations of coefficients. Sometimes it is necessary to test the significance of one or more linear combinations of the coefficients, that is, to test hypotheses about linear combinations of the β’s. A common example is the comparison of two coefficients, β1 and β2, say, for which the comparison $b_1 - b_2$ is relevant. The F-test applies to such comparisons also. Thus, for the difference $b_1 - b_2$, the estimated variance is $s^2(t^{11} - 2t^{12} + t^{22})$, and $F = (b_1 - b_2)^2/[s^2(t^{11} - 2t^{12} + t^{22})]$, with 1 and n - p - 1 degrees of freedom.

In general, to test the departure from zero of k linear combinations of regression coefficients the procedure is as follows. Let the linear combinations (expressed in matrix notation) be

$$\Gamma' b,$$

where Γ is a (p + 1) × k matrix of known constants. Then the estimated covariance matrix of these linear combinations is

$$s^2\,\Gamma' T^{-1} \Gamma,$$

and the F-ratio is

$$F = \frac{b'\Gamma(\Gamma' T^{-1}\Gamma)^{-1}\Gamma' b}{k s^2},$$

with k and n - p - 1 degrees of freedom. Of course, this test can also be adapted to testing the departure of these linear combinations from pre-assigned values other than zero.

When the population coefficients βi are in fact nonzero, the expected value of the regression mean square in the analysis of variance shown in Table 1 will be larger than σ2 by a term that depends on both the magnitude of the coefficients and the accuracy with which they are estimated (see, for example, the last column of Table 1). Clearly, the greater this term, called the noncentrality, the greater the probability that the null hypothesis will be rejected at the adopted significance level. The F-test has certain optimum properties, but other tests may be preferred in special circumstances.

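Table 2 – Analysis-of-variance table for testing several regression coefficients

| Source | Degrees of freedom | Sum of squares | Mean square |
| --- | --- | --- | --- |
| Regression on p - q variables | p - q + 1 | regression sum of squares for the first p - q variables | |
| Additional q variables | q | difference between the two regression sums of squares | difference divided by q |
| Regression on all p variables | p + 1 | regression sum of squares for all p variables | |
| Residual | n - p - 1 | $(n - p - 1)s^2$ | $s^2$ |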

Multivariate analogues

Although hitherto only the regression of a single dependent variable Y on one or more regressors xi has been discussed, it will be realized that often the simultaneous regressions of a number of random variables on the same regressors will be of importance. For instance, in a sociological study of immigrants the regressions of annual income and size of family on age, educational level, and period of residence in the country may be determined; here there are two dependent variables and three regressors.

Often the relations among the different dependent variables will also be of interest, or various linear combinations of the variables, rather than the original variables themselves, may be studied. The linear combination that is most highly correlated with the regressors may sometimes be relevant to the investigation, but the linear compounds will usually be chosen for their practical relevance rather than their statistical properties. [For further discussion of multivariate analogues, see Multivariate Analysis, especially the general article, Overview, and the article on Classification and Discrimination.]

Polynomial regression

When the relation between two variables, x1 and Y, appears to be curvilinear, it is natural to fit some form of smooth curve to the data. For some purposes a freehand curve is adequate to represent the relation, but if the curve is to be used for prediction or estimation and standard errors are required, some mathematical method of fitting, such as the method of least squares, must be used. The freehand fitting of a curvilinear relation has all the disadvantages of freehand fitting of a straight line, with the added disadvantage that it is more difficult to distinguish real trends from random fluctuations.

The polynomial form is

$$\eta = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \cdots + \beta_p x_1^p.$$

Being a linear model, it has the advantages of simplicity, flexibility, and relative ease of calculation. It is for such reasons, not because it necessarily represents the theoretical form of the relation, that a polynomial regression is often fitted to data.

Orthogonal polynomials

The computations in polynomial regression are exactly the same as those in multiple regression, except that some simplification of the arithmetic may be introduced if the same values of $x_1$ are used repeatedly. Then instead of using the powers of $x_1$ as the regression variables, these are replaced by orthogonal polynomials of successively increasing degree, so defined that the sum of products of any pair of them, over their chosen values, is zero.

This procedure has the twofold advantage that, first, all the off-diagonal elements of the matrix T are zero, so the calculation of regression coefficients and their standard errors is much simplified, and, second, the regression coefficient on each polynomial and the corresponding sum of squares can be independently determined.

Because it is common for investigators to use data with values of the independent variables equally spaced, the orthogonal polynomials for this particular case have been extensively tabulated. Fisher and Yates (1938) tabulate these orthogonal polynomials up to those of fifth degree, for numbers of equally spaced points up to 75. However, if the data are not equally spaced the tabulated polynomials are not applicable, and the regression must be calculated directly.
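The simplification brought by orthogonal polynomials can be seen in a small case. The Python sketch below uses the standard linear and quadratic orthogonal polynomial values for five equally spaced points, and hypothetical observations; because the off-diagonal elements of T vanish, each coefficient is a single ratio and each sum of squares is obtained independently. The helper name `dot` is illustrative.

```python
def dot(a, b):
    """Sum of products of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


# Orthogonal polynomial values for five equally spaced points
# (degree 0, 1 and 2).
xi0 = [1, 1, 1, 1, 1]
xi1 = [-2, -1, 0, 1, 2]
xi2 = [2, -1, -2, -1, 2]
polynomials = (xi0, xi1, xi2)

# Hypothetical observations at the five points.
Y = [0, 1, 4, 9, 16]

# Sums of products of distinct polynomials are zero, so T is diagonal.
off_diagonal = [dot(xi0, xi1), dot(xi0, xi2), dot(xi1, xi2)]

# Each coefficient and its sum of squares is found independently.
coefficients = [dot(Y, xi) / dot(xi, xi) for xi in polynomials]
component_ss = [dot(Y, xi) ** 2 / dot(xi, xi) for xi in polynomials]

fitted = [sum(c * xi[j] for c, xi in zip(coefficients, polynomials))
          for j in range(len(Y))]

print("coefficients:", coefficients)
print("sums of squares:", component_ss, "total:", dot(Y, Y))
```

Adding or dropping the quadratic term leaves the lower-degree coefficients unchanged, which is the practical advantage over regression on the raw powers.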

Testing adequacy of fit

The question of what degree of polynomial is appropriate to fit to a set of data is discussed below (see “Considerations in regression model choice”). If for each value of $x_1$ there is an array of values for Y, the variation in the data can be analyzed into parts between and within arrays by the techniques of analysis of variance [see Linear Hypotheses, article on Analysis of Variance]. The sum of squares between arrays can be further analyzed into that part accounted for by regression and that part not so accounted for (deviation from regression). The adequacy of a polynomial fitted to the data is indicated by nonsignificant deviation from regression.

When there is but one observation of Y for each value of $x_1$, such an analysis is not possible. To test the adequacy of a pth-degree polynomial regression, a common though not strictly defensible procedure is to fit a polynomial of degree p + 1 and test whether the coefficient $b_{p+1}$ of $x_1^{p+1}$ is significant. Anderson (1962) has treated this problem as a multiple decision problem and has provided optimal procedures that can readily be applied.

Estimation of maxima

Sometimes a polynomial regression is fitted in order to estimate the value of $x_1$ that yields a maximum value of η. A detailed discussion of the estimation of maxima is given by Hotelling (1941). To give an idea of the methods that are used, consider a quadratic regression of the form

$$\hat{\eta} = b_0 + b_1 x_1 + b_2 x_1^2.$$

A maximum (or minimum) value of $\hat{\eta}$ occurs at the point $x_m = -b_1/(2b_2)$, and this value is taken as the estimated position of the maximum. Confidence limits for the position can be determined by means of the following device. If the position of the maximum of the true regression curve is ξ, then $\xi = -\beta_1/(2\beta_2)$, so that $\beta_1 + 2\beta_2\xi = 0$. Consequently the quantity

$$b_1 + 2b_2\xi$$

is distributed with mean zero and estimated variance

$$s^2(t^{11} + 4\xi t^{12} + 4\xi^2 t^{22}).$$

The confidence limits for ξ with confidence coefficient 1 - α are given by the roots of the equation

$$(b_1 + 2b_2\xi)^2 = F_{\alpha;1,n-3}\, s^2 (t^{11} + 4\xi t^{12} + 4\xi^2 t^{22}),$$

where $F_{\alpha;1,n-3}$ is the α-point of the F-distribution with 1 and n - 3 degrees of freedom, abbreviated below as $F_\alpha$. The solution of this equation may be simplified by writing

$$g_{11} = \frac{F_\alpha s^2 t^{11}}{b_1^2}, \qquad g_{12} = \frac{F_\alpha s^2 t^{12}}{b_1 b_2}, \qquad g_{22} = \frac{F_\alpha s^2 t^{22}}{b_2^2},$$

so that the confidence limits become

$$\xi = -\frac{b_1}{2b_2} \cdot \frac{(1 - g_{12}) \pm \sqrt{(1 - g_{12})^2 - (1 - g_{11})(1 - g_{22})}}{1 - g_{22}}.$$

Note that these limits are not, in general, symmetrically placed about the estimated value $-b_1/(2b_2)$, since allowance is made for the skewness of the distribution of the ratio. Note also that the limits will include infinite values and will therefore not be of practical use, unless $b_2$ is significant at the α-level. In terms of the g-values, this means that $g_{22}$ must not exceed 1.
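The calculation of the limits can be sketched as a short routine. In the Python example below, the function name `maximum_limits` and all the numerical inputs are hypothetical, and the g-values are assumed to be defined as the tabulated F-value times the estimated variance (or covariance) of the coefficients, divided by the corresponding product of coefficients; with that assumption the limits are the two roots of the quadratic equation in ξ. The F-value must be supplied from tables, and the sketch assumes the linear coefficient is nonzero.

```python
import math


def maximum_limits(b1, b2, t11, t12, t22, s2, F_alpha):
    """Estimated position of the maximum and its confidence limits.

    t11, t12, t22 are elements of the inverse matrix; F_alpha is the
    tabulated point of F with 1 and n - 3 degrees of freedom.
    Returns (xm, None) when the limits are not finite (g22 >= 1).
    """
    xm = -b1 / (2 * b2)
    g11 = F_alpha * s2 * t11 / b1 ** 2
    g12 = F_alpha * s2 * t12 / (b1 * b2)
    g22 = F_alpha * s2 * t22 / b2 ** 2
    if g22 >= 1:
        return xm, None      # b2 not significant: no usable limits
    root = math.sqrt((1 - g12) ** 2 - (1 - g11) * (1 - g22))
    limits = sorted(xm * ((1 - g12) + sign * root) / (1 - g22)
                    for sign in (-1, 1))
    return xm, tuple(limits)


# Hypothetical fitted quantities, for illustration only.
xm, limits = maximum_limits(b1=4.0, b2=-1.0, t11=0.25, t12=-0.05,
                            t22=0.0625, s2=1.0, F_alpha=4.0)
print("estimated position:", xm)
print("confidence limits:", limits)
```

With these inputs the limits fall on either side of the estimated position but at unequal distances from it, which illustrates the asymmetry noted in the text.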

When the regression model is a polynomial in two or more variables, investigation of maxima and other aspects of shape becomes more complex. [A discussion of this problem appears in Experimental Design, article on Response Surfaces.]

Nonlinear models

In a nonlinear model the regression function is nonlinear in one or more of the parameters. Familiar examples are the exponential regression,

$$\eta = \beta_0 + \beta_1 e^{\beta_2 x_1},$$

and the logistic curve,

β2 being the nonlinear parameter in each example. Such nonlinear models usually originate from theoretical considerations but nevertheless are often useful when applied to observational data.

Sometimes the model can be reduced to a linear form by a transformation of variables (and a corresponding change in the specification of the errors). The exponential regression with β0 = 0 may thus be reduced by taking logarithms of the dependent variable and assuming that the errors of the logarithms, rather than the errors of the original values, are distributed about zero. If $Z = \log_e Y$ and E(Z) = ξ, the exponential model with β0 = 0 reduces to $\xi = \log_e \beta_1 + \beta_2 x_1$, a linear model.
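The reduction to linear form can be illustrated numerically. The Python sketch below generates hypothetical error-free values from an exponential relation with no constant term, regresses the natural logarithm of Y on the regression variable by ordinary least squares, and transforms the intercept back; the parameter values 2 and 0.5 are arbitrary choices for the illustration.

```python
import math

# Hypothetical error-free data from Y = 2 * exp(0.5 * x1).
x1 = [0, 1, 2, 3]
Y = [2.0 * math.exp(0.5 * x) for x in x1]

# Transform the dependent variable: Z = log_e Y.
Z = [math.log(y) for y in Y]

# Simple linear regression of Z on x1.
n = len(x1)
xbar = sum(x1) / n
zbar = sum(Z) / n
slope = (sum((x - xbar) * (z - zbar) for x, z in zip(x1, Z))
         / sum((x - xbar) ** 2 for x in x1))
intercept = zbar - slope * xbar

# The slope estimates beta_2; the intercept estimates log_e(beta_1).
beta2_hat = slope
beta1_hat = math.exp(intercept)
print("beta1 estimate:", beta1_hat, " beta2 estimate:", beta2_hat)
```

With real data the two fits differ, since least squares on the logarithms weights the observations differently from least squares on the original scale; that is the change in error specification referred to above.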

The general models shown above cannot be reduced to linear models in this way. For nonlinear models generally, the nonlinear parameters must be estimated by successive approximation. The following method is straightforward and of general applicability.

Suppose the model is

$$\eta = \beta_0 + \beta_1 f(x_1, \beta_2),$$

where $f(x_1, \beta_2)$ is a nonlinear function of β2, and the estimated regression, determined by least squares, is

$$\hat{\eta} = b_0 + b_1 f(x_1, c).$$

If $c_0$ is a trial value of c (estimated by graphical or other means), the values of $f(x_1, c)$ and its first derivative with respect to c (denoted, for brevity, by f and f′, respectively) are calculated for each value of $x_1$, with $c = c_0$. The regression of Y on f and f′ is then determined in the usual way, yielding the regression equation

$$\hat{\eta} = b_0 + b_1 f + b_2 f'.$$

A first adjustment to $c_0$ is given by $b_2/b_1$, giving the new approximation

$$c_1 = c_0 + b_2/b_1.$$

The process of recalculating the regression on f and f’ and determining successive approximations to c can be continued until the required accuracy is attained (for further details see Williams 1959).
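The successive-approximation scheme can be sketched for the exponential regression, taking the nonlinear function to be the exponential of c times the regression variable, so that its derivative with respect to c is the regression variable times the same exponential. In the Python example below the helper names `solve` and `regress`, the error-free data, the trial value 0.45, and the stopping tolerance are all hypothetical choices made for the illustration.

```python
import math


def solve(T, u):
    """Solve T b = u by Gaussian elimination with partial pivoting."""
    k = len(u)
    A = [list(map(float, row)) + [float(ui)] for row, ui in zip(T, u)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, k):
            factor = A[r][col] / A[col][col]
            A[r] = [a - factor * c for a, c in zip(A[r], A[col])]
    b = [0.0] * k
    for i in reversed(range(k)):
        b[i] = (A[i][k] - sum(A[i][j] * b[j] for j in range(i + 1, k))) / A[i][i]
    return b


def regress(columns, Y):
    """Least-squares coefficients of Y on the given regressor columns."""
    k = len(columns)
    T = [[sum(a * c for a, c in zip(columns[h], columns[i])) for i in range(k)]
         for h in range(k)]
    u = [sum(y * a for y, a in zip(Y, columns[i])) for i in range(k)]
    return solve(T, u)


# Hypothetical error-free data from eta = 1 + 2 * exp(0.5 * x1).
x1 = [0, 1, 2, 3, 4]
Y = [1.0 + 2.0 * math.exp(0.5 * x) for x in x1]

c = 0.45                                  # trial value of the nonlinear parameter
for _ in range(50):
    f = [math.exp(c * x) for x in x1]
    f_prime = [x * math.exp(c * x) for x in x1]   # derivative of f with respect to c
    b0, b1, b2 = regress([[1.0] * len(x1), f, f_prime], Y)
    adjustment = b2 / b1                  # correction to the current value of c
    c += adjustment
    if abs(adjustment) < 1e-9:
        break

print("c =", c, " b0 =", b0, " b1 =", b1)
```

When the adjustment has become negligible, the coefficient on the derivative term is effectively zero and the remaining two coefficients are the least-squares estimates of the linear parameters. A poor trial value may require more cycles or may fail to converge, so a graphical first estimate remains useful.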

The method is an adaptation of the delta method, which utilizes the principle of propagation of error. If a small change, $\delta\beta_2$, is made in a parameter β2, the corresponding change in a function $f(\beta_2)$ is, to a first approximation, $f'(\beta_2)\,\delta\beta_2$. The use of this method allows the replacement of the nonlinear equations for the parameters by approximate linear equations for the adjustments. For a regression relation of the form

$$\eta = \beta_0 + \beta_1 \beta_2^{x_1},$$

Stevens (1951) provides a table to facilitate the calculation of the nonlinear parameter by a method similar to that described above, and Pimentel Gomes (1953) provides tables from which, with a few preliminary calculations, the least squares estimate of the nonlinear parameter can be read off easily.

Considerations in regression model choice

In deciding which of several alternative models shall be used to interpret a relationship, a number of factors must be taken into account. Other things being equal, the model which represents the predictands most closely (where “closeness” is measured in terms of some criterion such as minimum mean square error among linear estimators) will be used. However, questions of convenience and simplicity should also be considered. A regression equation that includes a large number of regression variables is not convenient to use, and an equation with fewer variables may be only slightly less accurate. In deciding between alternative models, the residual variance is therefore not the only factor to take into account.

In polynomial regression particularly, the assumed polynomial form of the model is usually chosen for convenience, so that a polynomial of given degree is not assumed to be the true regression model. Because of this, the testing of individual polynomial coefficients is little more than a guide in deciding on the degree of polynomial to be fitted. Of far more importance is a decision on what degree of variability about the regression model is acceptable, and this decision will be based on practical rather than merely statistical considerations.

Besides the question of including additional variables in a regression, for which significance tests have already been described, there is also the question of alternative regression variables. The alternatives for a regression relation could be different variables or different functions of the same variable, for instance, $x_1$ and $\log x_1$.

For comparison of two or more individual variables as predictors, a test devised by Hotelling (1940) is suitable, although not strictly accurate. It is based on the correlations between Y and the different predictors and of the predictors among themselves. For comparing two regression variables $x_1$ and $x_2$, the test statistic is

which is distributed approximately as F with 1 and n - 3 degrees of freedom. Here, as before,

and $s^2$ is the mean square of residuals from the regression of Y on $x_1$ and $x_2$, with n - 3 degrees of freedom.

E. J. Williams

BIBLIOGRAPHY

Anderson, T. W. 1957 Maximum Likelihood Estimates for a Multivariate Normal Distribution When Some Observations Are Missing. Journal of the American Statistical Association 52:200–203.

Anderson, T. W. 1962 The Choice of the Degree of a Polynomial Regression as a Multiple Decision Problem. Annals of Mathematical Statistics 33:255–265.

Box, George E. P.; and Andersen, S. L. 1955 Permutation Theory in the Derivation of Robust Criteria and the Study of Departures From Assumption. Journal of the Royal Statistical Society Series B 17:1-26.

Box, George E. P.; and Wilson, K. B. 1951 On the Experimental Attainment of Optimum Conditions. Journal of the Royal Statistical Society Series B 13:1-45. → Contains seven pages of discussion.

Cochran, William G.; and Carroll, Sarah P. 1953 A Sampling Investigation of the Efficiency of Weighting Inversely as the Estimated Variance. Biometrics 9:447-459.

Cochran, William G.; and Cox, Gertrude M. (1950) 1957 Experimental Designs. 2d ed. New York: Wiley.

Ezekiel, Mordecai; and Fox, Karl A. (1930) 1961 Methods of Correlation and Regression Analysis: Linear and Curvilinear. New York: Wiley.

Fisher, R. A.; and Yates, Frank (1938) 1963 Statistical Tables for Biological, Agricultural and Medical Research. 6th ed., rev. & enl. Edinburgh: Oliver & Boyd; New York: Hafner.

Gauss, Carl F. 1855 Méthode des moindres carrés: Mémoires sur la combinaison des observations. Translated by J. Bertrand. Paris: Mallet-Bachelier. → An authorized translation of Carl Friedrich Gauss’s works on least squares.

Geary, R. C. 1963 Some Remarks About Relations Between Stochastic Variables: A Discussion Document. Institut International de Statistique, Revue 31:163–181.

Hotelling, Harold 1940 The Selection of Variates for Use in Prediction With Some Comments on the General Problem of Nuisance Parameters. Annals of Mathematical Statistics 11:271–283.

Hotelling, Harold 1941 Experimental Determination of the Maximum of a Function. Annals of Mathematical Statistics 12:20-45.

Kendall, Maurice G.; and Stuart, Alan (1943–1946) 1958–1966 The Advanced Theory of Statistics. New ed. 3 vols. New York: Hafner; London: Griffin. → Volume 1: Distribution Theory, 1958. Volume 2: Inference and Relationship, 1961. Volume 3: Design and Analysis, and Time-series, 1966. Kendall was the sole author of the 1943–1946 edition.

Kruskal, William H. 1961 The Coordinate-free Approach to Gauss-Markov Estimation and Its Application to Missing and Extra Observations. Volume 1, pages 435–451 in Symposium on Mathematical Statistics and Probability, Fourth, Berkeley, Proceedings. Berkeley and Los Angeles: Univ. of California Press.

Legendre, Adrien M. (1805) 1959 On a Method of Least Squares. Volume 2, pages 576–579 in David Eugene Smith, A Source Book in Mathematics. New York: Dover. → First published as “Sur la méthode des moindres carrés” in Legendre’s Nouvelles méthodes pour la détermination des orbites des comètes.

Madansky, Albert 1959 The Fitting of Straight Lines When Both Variables Are Subject to Error. Journal of the American Statistical Association 54:173–205.

Pimentel Gomes, Frederico 1953 The Use of Mitscherlich’s Regression Law in the Analysis of Experiments With Fertilizers. Biometrics 9:498–516.

Plackett, R. L. 1960 Principles of Regression Analysis. Oxford: Clarendon.

Rao, C. Radhakrishna 1952 Advanced Statistical Methods in Biometric Research. New York: Wiley.

Rao, C. Radhakrishna 1965 Linear Statistical Inference and Its Applications. New York: Wiley.

Stevens, W. L. 1951 Asymptotic Regression. Biometrics 7:247–267.

Strodtbeck, Fred L.; McDonald, Margaret R.; and Rosen, Bernard C. 1957 Evaluation of Occupations: A Reflection of Jewish and Italian Mobility Differences. American Sociological Review 22:546–553.

Williams, Evan J. 1952 Use of Scores for the Analysis of Association in Contingency Tables. Biometrika 39: 274-289.

Williams, Evan J. 1959 Regression Analysis. New York: Wiley.

Wold, Herman 1953 Demand Analysis: A Study in Econometrics. New York: Wiley.

Yates, Frank 1933 The Analysis of Replicated Experiments When the Field Results Are Incomplete. Empire Journal of Experimental Agriculture 1:129–142.

Yates, Frank 1948 The Analysis of Contingency Tables With Groupings Based on Quantitative Characters. Biometrika 35:176–181.

II ANALYSIS OF VARIANCE

Analysis of variance is a body of statistical procedures for analyzing observational data that may be regarded as satisfying certain broad assumptions about the structure of means, variances, and distributional form. The basic notion of analysis of variance (or ANOVA) is that of comparing and dissecting empirical dispersions in the data in order to understand underlying central values and dispersions.

This basic notion was early noted and developed in special cases by Lexis and von Bortkiewicz [see Lexis; Bortkiewicz]. Not until the pioneering work of R. A. Fisher (1925; 1935), however, were the fundamental principles of analysis of variance and its most important techniques worked out and made public [see Fisher, R. A.]. Early applications of analysis of variance were primarily in agriculture and biology. The methodology is now used in every field of science and is one of the most important statistical areas for the social sciences. (For further historical material see Sampford 1964.)

Much basic material of analysis of variance may usefully be regarded as a special development of regression analysis [see Linear Hypotheses, article on regression]. Analysis of variance extends, however, to techniques and models that do not strictly fall under the regression rubric.

In analysis of variance all the standard general theories of statistics, such as point and set estimation and hypothesis testing, come into play. In the past there has sometimes been overemphasis on testing hypotheses.

One-factor analysis of variance

A simple experiment will now be described as an example of ANOVA. Suppose that the publisher of a junior-high-school textbook is considering styles of printing type for a new edition; there are three styles to investigate, and the same chapter of the book has been prepared in each of the three styles for the experiment. Junior-high-school pupils are to be chosen at random from an appropriate large population of such pupils, randomly assigned to read the chapter in one of the three styles, and then given a test that results in a reading-comprehension score for each pupil.

Suppose that the experiment is set up so that P1, P2, and P3 pupils (where P1 = P2 = P3 = P) read the chapter in styles 1, 2, and 3, respectively, and that Xps denotes the comprehension score of the pth pupil reading style s. (Here s = 1, 2, 3; in general, s = 1, 2, ..., S.) There is a hypothetical mean, or expected, value of Xps, μs, but Xps differs from μs because, first, the pupils are chosen randomly from a population of pupils with different inherent means and, second, a given pupil, on hypothetical repetitions of the experiment, would not always obtain the same score. This is expressed by writing

$$X_{ps} = \mu_s + e_{ps}. \tag{1}$$

Then the assumptions are made that the eps are all independent, that they are all normally distributed, and that they have a common (usually unknown) variance, σ². By definition, the expectation of eps is zero.

Because differences among the pupils reading a particular style of type are thrown into the random “error” terms (eps), μs, the expectation of Xps, does not depend on p. It is convenient to rewrite (1) as

$$X_{ps} = \mu + (\mu_s - \mu) + e_{ps},$$

where μ = (∑μs)/S, the average of the μs. For simplicity, set αs = μs – μ (so that α1 + α2 + ... + αS = 0) and write the structural equation finally in the conventional form

$$X_{ps} = \mu + \alpha_s + e_{ps}.$$

Here αs is the differential effect on comprehension scores of style s for the relevant population of pupils. The unknowns are μ, the αs, and σ².

Note that this structure falls under the linear regression hypothesis with coefficients 0 or 1. For example, if E(Xps) represents the expected value of Xps,

$$E(X_{p1}) = 1\cdot\mu + 1\cdot\alpha_1 + 0\cdot\alpha_2 + 0\cdot\alpha_3 + \cdots + 0\cdot\alpha_S,$$

$$E(X_{p2}) = 1\cdot\mu + 0\cdot\alpha_1 + 1\cdot\alpha_2 + 0\cdot\alpha_3 + \cdots + 0\cdot\alpha_S.$$
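The 0-or-1 coding can be made concrete with a short sketch. The function names and the parameter values below are hypothetical, chosen only so that the style effects sum to zero; the point is that each expected score is the dot product of a row of indicator coefficients with the parameter vector (μ, α1, ..., αS).

```python
def design_row(style, n_styles):
    """Coefficients of (mu, alpha_1, ..., alpha_S) for a pupil reading `style` (1-based)."""
    return [1] + [1 if s == style else 0 for s in range(1, n_styles + 1)]


def expected_score(style, mu, alphas):
    """E(X_ps) as the dot product of the 0/1 coefficients with the parameters."""
    params = [mu] + list(alphas)
    return sum(c * b for c, b in zip(design_row(style, len(alphas)), params))


# Hypothetical parameter values (not from the article); the alphas sum to zero.
mu, alphas = 10.0, [2.0, -0.5, -1.5]

print(design_row(1, 3))               # coefficients for style 1
print(design_row(2, 3))               # coefficients for style 2
print(expected_score(1, mu, alphas))  # mu + alpha_1
```

Every row starts with a 1 for μ, and exactly one of the remaining coefficients is 1, which is what places the one-factor layout inside the regression framework.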

Consider how this illustrative experiment might be conducted. After defining the population to which he wishes to generalize his findings, the experimenter would use a table of random numbers to choose pupils to read the chapter printed in the different styles. (Actually, he would probably have to sample intact school classes rather than individual pupils, so the observations analyzed might be class means instead of individual scores, but this does not change the analysis in principle.) After the three groups have read the same chapter under conditions that differ only in style of type, a single test covering comprehension of the material in the chapter would be administered to all pupils.

The experimenter’s attention would be focused on differences between average scores of the three style groups (that is, X̄.1 versus X̄.2, X̄.2 versus X̄.3, and X̄.1 versus X̄.3) relative to the variability of the test scores within these groups. He estimates the μs via the X̄.s, and he attempts to determine which of the three averages, if any, differ with statistical significance from the others. Eventually he hopes to help the publisher decide which style of type to use for his new edition.

ANOVA of random numbers—an example

An imaginary experiment of the kind outlined above will be analyzed here to illustrate how ANOVA is applied. Suppose that the three Ps are each 20, that in fact the μs are all exactly equal to 0, and that σ = 1 (setting μs = 0 is just a convenience corresponding to a conventional origin for the comprehension-score scale).

Sixty random normal deviates, with mean 0 and variance 1, were chosen by use of an appropriate table (RAND Corporation 1955). They are listed in Table 1, where the second column from the left should be disregarded for the moment; it will be used later, in a modified example. From the “data” of Table 1 the usual estimates of the μs are just the column averages, X̄.1 = –0.09, X̄.2 = 0.10, and X̄.3 = 0.08. The estimate of μ is the overall mean, X̄.. = 0.03, and the estimates of the αs are –0.09 – 0.03 = –0.12, 0.10 – 0.03 = 0.07, and 0.08 – 0.03 = 0.05. Note that these add to zero,

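Table 1 – Data for hypothetical experiment; 60 random normal deviates

| | Xp1 | Xp1 + 1* | Xp2 | Xp3 |
| --- | --- | --- | --- | --- |
| | 0.477 | 1.477 | –0.987 | 1.158 |
| | –0.017 | 0.983 | 2.313 | 0.879 |
| | 0.508 | 1.508 | 0.016 | 0.068 |
| | –0.512 | 0.488 | 0.483 | 1.116 |
| | –0.188 | 0.812 | 0.157 | 0.272 |
| | –1.073 | –0.073 | 1.107 | –0.396 |
| | –0.412 | 0.588 | –0.023 | –0.983 |
| | 1.201 | 2.201 | 0.898 | –0.267 |
| | –0.676 | 0.324 | –1.404 | 0.327 |
| | –1.012 | –0.012 | –0.080 | 0.929 |
| | 0.997 | 1.997 | –1.258 | –0.603 |
| | –0.127 | 0.873 | –0.017 | 0.493 |
| | 1.178 | 2.178 | 1.607 | –1.243 |
| | –1.507 | –0.507 | 0.005 | –0.145 |
| | 1.010 | 2.010 | 0.163 | 1.334 |
| | –0.528 | 0.472 | –0.771 | –0.906 |
| | –0.139 | 0.861 | 0.485 | –1.633 |
| | 0.621 | 1.621 | 0.147 | 0.424 |
| | –2.078 | –1.078 | –1.764 | –0.433 |
| | 0.485 | 1.485 | 0.986 | 1.245 |
| Mean | –0.09 | 0.91 | 0.10 | 0.08 |
| Variance | 0.83 | 0.83 | 1.03 | 0.78 |

* This column was obtained by adding 1 to each deviate of the first column.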

as required. In ANOVA, for this case, two quantities are compared. The first is the dispersion of the three μs estimates, that is, the sum of the (X̄.s – X̄..)², conveniently multiplied by 20, the common sample size. This is called the between-styles dispersion or sum of squares. Here it is 0.4466. (These calculations, as well as those below, are made with the raw data of Table 1, not with the rounded means appearing there.) The second quantity is the within-sample dispersion, the sum of the three quantities ∑p (Xps – X̄.s)². This is called the within-style dispersion or sum of squares. Here it is 50.1253.

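This comparison corresponds to the decomposition

$$X_{ps} - \bar{X}_{..} = (\bar{X}_{.s} - \bar{X}_{..}) + (X_{ps} - \bar{X}_{.s})$$

and to the sum-of-squares identity

$$\sum_{s}\sum_{p} (X_{ps} - \bar{X}_{..})^2 = 20\sum_{s} (\bar{X}_{.s} - \bar{X}_{..})^2 + \sum_{s}\sum_{p} (X_{ps} - \bar{X}_{.s})^2,$$

which shows how the factor of 20 arises. Such identities in sums of squares are basic in most elementary expositions of ANOVA.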
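The two dispersions and the identity connecting them can be checked numerically. The following is a minimal sketch using the three style columns of Table 1; the helper name `one_way_ss` is hypothetical, and the ninth entry for style 3 is read as 0.327, which is an assumption about a digit that is garbled in the printed table.

```python
# Table 1 deviates for styles 1, 2, 3 (columns X_p1, X_p2, X_p3).
# Assumption: the ninth entry for style 3 is read as 0.327.
style1 = [0.477, -0.017, 0.508, -0.512, -0.188, -1.073, -0.412, 1.201, -0.676, -1.012,
          0.997, -0.127, 1.178, -1.507, 1.010, -0.528, -0.139, 0.621, -2.078, 0.485]
style2 = [-0.987, 2.313, 0.016, 0.483, 0.157, 1.107, -0.023, 0.898, -1.404, -0.080,
          -1.258, -0.017, 1.607, 0.005, 0.163, -0.771, 0.485, 0.147, -1.764, 0.986]
style3 = [1.158, 0.879, 0.068, 1.116, 0.272, -0.396, -0.983, -0.267, 0.327, 0.929,
          -0.603, 0.493, -1.243, -0.145, 1.334, -0.906, -1.633, 0.424, -0.433, 1.245]


def one_way_ss(groups):
    """Return (between, within, total) sums of squares for a one-factor layout."""
    n_total = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Between: group size times squared deviation of each group mean from the grand mean.
    between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    # Within: squared deviations of observations from their own group mean.
    within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    # Total: squared deviations of observations from the grand mean.
    total = sum((x - grand) ** 2 for g in groups for x in g)
    return between, within, total


ss_between, ss_within, ss_total = one_way_ss([style1, style2, style3])
print(round(ss_between, 4), round(ss_within, 4), round(ss_total, 4))
```

The total computed directly from the grand mean agrees with the sum of the between and within parts, which is the identity in numerical form.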

Table 2 – Analysis-of-variance table for one-factor experiment

(a) ANOVA of 60 random normal deviates

| Source of variation | df | SS | MS | F | Tabled F.05;2,57 |
| --- | --- | --- | --- | --- | --- |
| Between styles | 3 – 1 = 2 | 0.4466 | 0.2233 | 0.25 | 3.16 |
| Within styles | 60 – 3 = 57 | 50.1253 | 0.8794* | | |
| Total | 60 – 1 = 59 | 50.5719 | | | |

* Actually, σ² here is known to be 1.

(b) ANOVA of general one-factor experiment with S treatments

| Source of variation | df | MS | EMS |
| --- | --- | --- | --- |
| Between treatments | S – 1 | $\sum_s P_s(\bar{X}_{.s} - \bar{X}_{..})^2/(S - 1)$ | $\sigma^2 + \sum_s P_s\alpha_s^2/(S - 1)$ |
| Within treatments | P+ – S* | $\sum_s\sum_p (X_{ps} - \bar{X}_{.s})^2/(P_+ - S)$ | $\sigma^2$ |
| Total | P+ – 1* | | |

* Here P+ is used for ∑s Ps, the total number of observations.

The fundamental notion is that the within-style dispersion, divided by its so-called degrees of freedom (here, degrees of freedom for error), unbiasedly estimates σ². Here the degrees of freedom for error are 57 (equals 60 [for the total number of observations] minus 3 [for the number of μs estimated]). On the other hand, the between-styles dispersion, divided by its degrees of freedom (here 2), estimates σ² unbiasedly if and only if the μs are equal; otherwise the estimate will tend to be larger than σ². Furthermore, the between-styles and within-style dispersions are statistically independent. Hence, it is natural to look at the ratio of the two dispersions, each divided by its degrees of freedom. The result is the F-statistic, here

$$F = \frac{0.4466/2}{50.1253/57} = \frac{0.2233}{0.8794} = 0.25.$$

In repeated trials with the null hypothesis (that there are no differences between the μs) true, the F-statistic follows an F-distribution with (in this case) 2 and 57 degrees of freedom [see Distributions, statistical, article on special continuous distributions]. Level of significance is denoted by “α” (which should not be confused with the totally unrelated “αs,” denoting style effect; the notational similarity stems from the juxtaposition of two terminological traditions and the finite number of Greek letters). The F-test at level of significance α of the null hypothesis that the styles are equivalent rejects that hypothesis when the F-statistic is too large, greater than its 100α percentage point, here Fα;2,57. If α = 0.05, which is a conventional level, then F.05;2,57 = 3.16, so 0.25 is much smaller than the cutoff point, and the null hypothesis is, of course, not rejected. This is consonant with the fact that the null hypothesis is true in the imaginary experiment under discussion.
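The cutoff point need not be read from a table in this particular case. When the numerator has exactly 2 degrees of freedom, the upper-tail probability of the F-distribution has a closed form, P(F > f) = (1 + 2f/ν)^(–ν/2), where ν is the error degrees of freedom. That identity is a standard special case and is not taken from this article; the function names below are hypothetical, and the sketch does not extend to other numerator degrees of freedom.

```python
def f_upper_tail_2df(f, df_error):
    """P(F > f) for an F-distribution with 2 and df_error degrees of freedom."""
    return (1.0 + 2.0 * f / df_error) ** (-df_error / 2.0)


def f_critical_2df(alpha, df_error):
    """Upper 100*alpha percentage point, obtained by inverting the tail formula."""
    return (df_error / 2.0) * (alpha ** (-2.0 / df_error) - 1.0)


# Mean squares from Table 2(a).
ms_between, ms_within = 0.2233, 0.8794
f_stat = ms_between / ms_within
critical = f_critical_2df(0.05, 57)
p_value = f_upper_tail_2df(f_stat, 57)

print(round(f_stat, 2), round(critical, 2), round(p_value, 2))
print("reject H0" if f_stat > critical else "do not reject H0")
```

The large tail probability says the same thing as the comparison with the cutoff: an F-ratio this small is entirely ordinary when the styles are equivalent.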

Table 2 summarizes the above discussion in both algebraic and numerical form. The algebraic form is for S styles with Ps students at the sth style.

To reiterate, in an analysis of variance each kind of effect (treatment, factor, and others to be discussed later) is represented by two basic numbers. The first is the so-called sum of squares (SS), corresponding to the effect; it is random, depending upon the particular sample, and has two fundamental properties: (a) If the effect in question is wholly absent, its sum of squares behaves probabilistically like a sum of squared independent normal deviates with zero means. (b) If the effect in question is present, its sum of squares tends to be relatively large; in fact, it behaves probabilistically like a sum of squared independent normal deviates with not all means zero.

The second number is the so-called degrees of freedom (df). This quantity is not random but depends only on the structure of the experimental design. The df is the number of independent normal deviates in the description of sums of squares just given.

A third (derived) number is the so-called mean square (MS), which is computed by dividing the sum of squares by the degrees of freedom. When an effect is wholly absent, its mean square is an unbiased estimator of underlying variance, σ². When an effect is present, its mean square has an expectation greater than σ².

In the example considered here, each observation is regarded as the sum of (a) a grand mean, (b) a printing-style effect, and (c) error. It is conventional in analysis-of-variance tables not to have a line corresponding to the grand mean and to work with sample residuals centered on it; that convention is followed here. Printing-style effect and error differ in that the latter is assumed to be wholly random, whereas the former is not random but may be zero. The mean square for error estimates underlying variance unbiasedly and is a yardstick for judging other mean squares.

In the standard simple designs to which ANOVA is applied, it is customary to define effects so that the several sums of squares are statistically independent, from which additivity both of sums of squares and of degrees of freedom follows [see Probability, article on formal probability]. In the example, SSbetween + SSwithin = SStotal, and dfb + dfw = dftotal. (Here, and often below, the subscripts “b” and “w” are used to stand for “between” and “within,” respectively.) This additivity is computationally useful, either to save arithmetic or to verify it.

Analysis-of-variance tables, which, like Table 2, are convenient and compact summaries of both the relevant formulas and the computed numbers, usually also show expected mean squares (EMS), the average value of the mean squares over a (conceptually) infinite number of experiments. In fixed-effects models (such as the model of the example) these are always of the form σ2 (the underlying variance) plus an additional term that is zero when the relevant effect is absent and positive when it is present. The additional term is a convenient measure of the magnitude of the effect.

Expected mean squares, such as those given by the two formulas in Table 2, provide a necessary condition for the F-statistic to have an F-distribution when the null hypothesis is true. (Other conditions, such as independence, must also be met.) Note that if the population mean of the sth treatment, μs, is the same for all treatments (that is, if αs = 0 for all s) then the expected value of MSb will be σ², the same as the expected value of MSw. If the null hypothesis is true, the average value of the F from a huge number of identical experiments employing fresh, randomly sampled experimental units will be (P+ – S)/(P+ – S – 2), which is very nearly 1 when, as is usually the case, the total number of experimental units, P+, is large compared with S. (In the example, this is 57/55, or about 1.04.) Expected mean squares become particularly important in analyses based on models of a nature somewhat different from the one illustrated in Tables 1 and 2, because in those cases it is not always easy to determine which mean square should be used as the denominator of F (see the discussion of some of these other models, below).

The simplest t-tests

It is worth digressing to show how the familiar one-sample and two-sample t-tests (or Student tests) fall under the analysis-of-variance rubric, at least for the symmetrical two-tail versions of these tests.

Single-sample t-test. In the single-sample t-test context, one considers a random sample, X1, X2, ..., XP, of independent normal observations with the same unknown mean, μ, and the same unknown variance, σ². Another way of expressing this is to write

$$X_p = \mu + e_p, \qquad p = 1, \ldots, P,$$

where the ep are independent normal random variables, with mean 0 and common variance σ². The usual estimator of μ is X̄., the average of the Xp, and this suggests the decomposition into average and deviation from average,

$$X_p = \bar{X}_{.} + (X_p - \bar{X}_{.}),$$

from which one obtains the sum-of-squares identity

$$\sum_p X_p^2 = P\bar{X}_{.}^2 + \sum_p (X_p - \bar{X}_{.})^2$$

(since ∑(Xp – X̄.) = 0), a familiar algebraic relationship. Since the usual unbiased estimator of σ² is s² = ∑(Xp – X̄.)²/(P – 1), the sum-of-squares identity may be written

$$\sum_p X_p^2 = P\bar{X}_{.}^2 + (P - 1)s^2.$$

Ordinarily the analysis-of-variance table is not written out for this simple case; it is, however, the one shown in Table 3. In Table 3 the total row is the actual total including all observations; it is of the essence that the row for mean is separated out.

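Table 3

| Effect | df | SS | EMS |
| --- | --- | --- | --- |
| Mean | 1 | $P\bar{X}_{.}^2$ | $\sigma^2 + P\mu^2$ |
| Error | P – 1 | $\sum_p (X_p - \bar{X}_{.})^2$ | $\sigma^2$ |
| Total | P | $\sum_p X_p^2$ | |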

The F-statistic for testing that μ = 0 is the ratio of the mean squares for mean and error,

$$F = \frac{P\bar{X}_{.}^2}{s^2},$$

which, under the null hypothesis, has an F-distribution with 1 and P – 1 degrees of freedom. Notice that the above F-statistic is the square of

$$t = \frac{\sqrt{P}\,\bar{X}_{.}}{s},$$

which is the ordinary t-statistic (or Student statistic) for testing μ = 0. If a symmetrical two-tail test is wanted, it is immaterial whether one deals with the t-statistic or its square. On the other hand, for a one-tail test the t-statistic would be referred to the t-distribution with P – 1 degrees of freedom [see Distributions, statistical, article on special continuous distributions].

It is important to note that a confidence interval for μ may readily be established from the above discussion [see Estimation, article on confidence intervals and regions]. The symmetrical form is

$$\bar{X}_{.} \pm \sqrt{F_{\alpha;1,P-1}}\;\frac{s}{\sqrt{P}}.$$

Alternatively, √Fα;1,P–1 can be replaced by the upper 100(α/2) per cent point for the t-distribution with P – 1 degrees of freedom, tα/2;P–1.

Suppose, for example, that from a normally distributed population there has been drawn a random sample of 25 observations for which the sample mean, X̄., is 34.213 and the sample variance, s², is 49.000. What is the population mean, μ? The usual point estimate from this sample is 34.213. How different from μ is this value likely to be? For α = .05, a 95 per cent confidence interval is constructed by looking up t.025;24 = 2.064 in a table (for instance, McNemar [1949] 1962, p. 430) and substituting in the formula

$$\bar{X}_{.} \pm t_{.025;24}\,\frac{s}{\sqrt{P}}.$$

Thus,

$$34.213 \pm 2.064\,\frac{7}{\sqrt{25}} = 34.213 \pm 2.890, \quad\text{that is,}\quad 31.323 \le \mu \le 37.103.$$

This result means that if an infinite number of samples, each of size P = 25, were drawn randomly from a normally distributed population and a confidence interval for each sample were set up in the above way, only 5 per cent of the intervals would fail to cover the mean of the population (which is a certain fixed value).

Similarly, from this one sample the unbiased point estimate of σ² is the value of s², 49.000. Brownlee ([1960] 1965, page 282) shows how to find confidence intervals for σ² [see also Variances, statistical study of].

Is it “reasonable” to suppose that the mean of the population from which this sample was randomly chosen is as large as, say, 40? No, because that number does not lie within even the 99 per cent confidence interval. Therefore it would be unreasonable to conclude that the sample was drawn from a population with a mean as great as 40. The relevant test of statistical significance is

$$t = \frac{\bar{X}_{.} - 40}{s/\sqrt{P}} = \frac{34.213 - 40}{7/\sqrt{25}} = -4.13,$$

the absolute magnitude of which lies beyond the 99.95th percentile point (3.745) in the tabled t-distribution for 24 degrees of freedom. Therefore, the difference is statistically significant beyond the 0.0005 + 0.0005 = 0.001 level. The null hypothesis being tested was H0: μ = 40, against the alternative hypothesis Ha: μ ≠ 40. Just as the confidence interval indicated that it is unreasonable to suppose the mean to be equal to 40, this test also shows that 40 will lie outside the 99 per cent confidence interval; however, of the two procedures, the confidence interval gives more information than the significance test.
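The arithmetic of this example is easy to reproduce. The sketch below takes the summary figures quoted in the text (P = 25, sample mean 34.213, s² = 49.000) and the tabled value t.025;24 = 2.064 as given constants; the variable names are hypothetical.

```python
import math

n_obs = 25            # sample size P
sample_mean = 34.213  # sample mean
sample_var = 49.000   # unbiased variance estimate s^2
t_tabled = 2.064      # tabled t for a 95 per cent interval with 24 df, as quoted in the text

# Standard error of the mean, s / sqrt(P).
std_error = math.sqrt(sample_var / n_obs)

# Symmetrical 95 per cent confidence interval for the population mean.
lower = sample_mean - t_tabled * std_error
upper = sample_mean + t_tabled * std_error

# t-statistic for H0: mu = 40, and the equivalent F-statistic with 1 and 24 df.
t_stat = (sample_mean - 40.0) / std_error
f_stat = n_obs * (sample_mean - 40.0) ** 2 / sample_var

print(round(lower, 3), round(upper, 3))
print(round(t_stat, 2), round(f_stat, 2))
```

The F-statistic printed on the last line is the square of the t-statistic, which is the single-sample case of the equivalence described above.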

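Two-sample t-test. In the two-sample t-test context, there are two random samples from normal distributions assumed to have the same variance, σ², and to have means μ1 and μ2. Call the observations in the first sample X11, ..., XP1,1 and the observations in the second sample X12, ..., XP2,2. The most usual null hypothesis is μ1 = μ2, and for that the t-statistic is

$$t = \frac{\bar{X}_{.1} - \bar{X}_{.2}}{s\sqrt{1/P_1 + 1/P_2}},$$

where the P’s are the sample sizes, the X̄’s are the sample means, and s² is the estimate of σ² based on the pooled within-sample sum of squares,

$$s^2 = \frac{\sum_p (X_{p1} - \bar{X}_{.1})^2 + \sum_p (X_{p2} - \bar{X}_{.2})^2}{P_1 + P_2 - 2}.$$

Here P1 + P2 – 2 is the number of degrees of freedom for error, the total number of observations less the number of estimated means (X̄.1 and X̄.2 estimate μ1 and μ2, respectively). Under the null hypothesis, the t-statistic has the t-distribution with P1 + P2 – 2 degrees of freedom.

The basic decomposition is

$$X_{ps} - \bar{X}_{..} = (\bar{X}_{.s} - \bar{X}_{..}) + (X_{ps} - \bar{X}_{.s}),$$

leading to the sum-of-squares decomposition

$$\sum_s\sum_p (X_{ps} - \bar{X}_{..})^2 = \sum_s P_s(\bar{X}_{.s} - \bar{X}_{..})^2 + \sum_s\sum_p (X_{ps} - \bar{X}_{.s})^2.$$

Since s has only the values 1 and 2,

$$\bar{X}_{..} = \frac{P_1\bar{X}_{.1} + P_2\bar{X}_{.2}}{P_1 + P_2},$$

and therefore

$$\sum_s P_s(\bar{X}_{.s} - \bar{X}_{..})^2 = \frac{P_1P_2}{P_1 + P_2}\,(\bar{X}_{.1} - \bar{X}_{.2})^2.$$

The analysis-of-variance table may be written as in Table 4. The F-statistic for the null hypothesis that μ1 = μ2 is

$$F = \frac{P_1P_2}{P_1 + P_2}\cdot\frac{(\bar{X}_{.1} - \bar{X}_{.2})^2}{s^2},$$

and this is exactly the square of the t-statistic for the two-sample problem.

Table 4

| Effect | df | SS | EMS* |
| --- | --- | --- | --- |
| Style | 1 | $\frac{P_1P_2}{P_1+P_2}(\bar{X}_{.1} - \bar{X}_{.2})^2$ | $\sigma^2 + \frac{P_1P_2}{P_1+P_2}(\mu_1 - \mu_2)^2$ |
| Error | P1 + P2 – 2 | $\sum_s\sum_p (X_{ps} - \bar{X}_{.s})^2$ | $\sigma^2$ |
| Total | P1 + P2 – 1 | $\sum_s\sum_p (X_{ps} - \bar{X}_{..})^2$ | |

* Note that the expected mean square for style is σ² plus what is obtained by formal substitution for the random variables (X̄.1, X̄.2) in the sum of squares of their respective expectations (divided by df, which here is 1). This relationship is a perfectly general one in the analysis-of-variance model now under discussion, but it must be changed for other models that will be mentioned later.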

Note that the two-sample problem as it is analyzed here is only a special case (with S = 2) of the S-sample problem presented earlier.
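The equivalence of the two-sample t-test and the one-factor F-test with S = 2 can be confirmed on any pair of samples. The following sketch uses two small made-up samples of unequal size (hypothetical numbers, not from the article) and computes the pooled-variance t-statistic and the ANOVA F-ratio by separate routes.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def two_sample_t(a, b):
    """Pooled-variance t-statistic for H0: mu1 = mu2."""
    ss_within = sum((x - mean(a)) ** 2 for x in a) + sum((x - mean(b)) ** 2 for x in b)
    s2 = ss_within / (len(a) + len(b) - 2)
    return (mean(a) - mean(b)) / math.sqrt(s2 * (1 / len(a) + 1 / len(b)))


def one_way_f(groups):
    """F-ratio of between-groups to within-groups mean squares."""
    n_total = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (len(groups) - 1)) / (ss_within / (n_total - len(groups)))


# Hypothetical scores for two styles, with P1 = 4 and P2 = 3.
sample1 = [12.0, 15.0, 11.0, 14.0]
sample2 = [9.0, 13.0, 10.0]

t_stat = two_sample_t(sample1, sample2)
f_stat = one_way_f([sample1, sample2])
print(t_stat ** 2, f_stat)  # the two values agree up to rounding error
```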

The numerical example continued

Returning to the numerical example of Table 1, add 1 to every number in the leftmost column to obtain the second column and consider the numbers in the second column as the observations for style 1. Now μ1 = 1 and μ2 = μ3 = 0. What happens to the analysis of variance and the F-test? Table 5 shows the result; the F-statistic is 4.4623/0.8794 = 5.07, which is statistically significant at the .01 level since F.01;2,57 = 5.00. Thus, one would correctly reject the null hypothesis of equality among the three μs.

The actual value of μ is 1/3 ≅ 0.33, and that of α1 is 2/3 ≅ 0.67. The estimate of μ is 0.36, and that of α1 is 0.55.
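The modified example can be recomputed directly from Table 1. This sketch shifts the style 1 column by 1 and recomputes the Table 5 sums of squares, the F-ratio, and the estimates of μ and α1. As before, reading the ninth entry for style 3 as 0.327 is an assumption about a garbled digit, and the variable names are hypothetical.

```python
# Table 1 deviates (assumption: ninth entry for style 3 read as 0.327).
style1 = [0.477, -0.017, 0.508, -0.512, -0.188, -1.073, -0.412, 1.201, -0.676, -1.012,
          0.997, -0.127, 1.178, -1.507, 1.010, -0.528, -0.139, 0.621, -2.078, 0.485]
style2 = [-0.987, 2.313, 0.016, 0.483, 0.157, 1.107, -0.023, 0.898, -1.404, -0.080,
          -1.258, -0.017, 1.607, 0.005, 0.163, -0.771, 0.485, 0.147, -1.764, 0.986]
style3 = [1.158, 0.879, 0.068, 1.116, 0.272, -0.396, -0.983, -0.267, 0.327, 0.929,
          -0.603, 0.493, -1.243, -0.145, 1.334, -0.906, -1.633, 0.424, -0.433, 1.245]

# Style 1 now has true mean 1; styles 2 and 3 keep true mean 0.
groups = [[x + 1 for x in style1], style2, style3]

means = [sum(g) / len(g) for g in groups]
grand = sum(means) / len(means)  # valid as the overall mean because group sizes are equal

ss_between = 20 * sum((m - grand) ** 2 for m in means)
ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
f_ratio = (ss_between / 2) / (ss_within / 57)

mu_hat = grand
alpha1_hat = means[0] - grand

print(round(ss_between, 4), round(ss_within, 4), round(f_ratio, 2))
print(round(mu_hat, 3), round(alpha1_hat, 3))
```

Adding a constant to one column changes the between-styles sum of squares but leaves the within-style sum of squares untouched, so the error mean square is the same yardstick as in Table 2.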

With three styles, one can consider many contrasts: for example, style 1 versus style 2, style 1 versus style 3, style 2 versus style 3, or 1/2(style 1 + style 2) versus style 3. There are special methods for dealing with several contrasts simultaneously [see Linear Hypotheses, article on multiple comparisons].

ANOVA with more than one factor

In the illustrative example being considered here, suppose that the publisher had been interested not only in style of type but also in a second factor, such as the tint of the printing ink (t). If he had three styles and four tints, a complete “crossed” factorial design would require 3 × 4 = 12 experimental conditions (s1t1, s1t2, ..., s3t4). From 12P experimental units he would assign P units at random to each of the 12 conditions, conduct his experiment, and obtain outcome measures to analyze. The total variation between the 12P outcome measures can be partitioned into four sources rather than into the two found with one factor. The sources of variation are the following: between styles, between tints, interaction of styles with tints, and within style-tint combinations (error).

The usual model for the two-factor crossed design is

$$X_{pst} = \mu + \alpha_s + \beta_t + \gamma_{st} + e_{pst},$$

where ∑sαs = ∑tβt = ∑sγst = ∑tγst = 0 and the epst are independent normally distributed random variables with mean 0 and equal variance σ² for each st combination. The analysis-of-variance procedure for this design appears in Table 6. The αs and βt represent main effects of the styles and tints; the γst denote (two-factor) interactions.

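Table 5 – One-factor ANOVA of 60 transformed random normal deviates

| Source of variation | df | SS | MS | EMS | F |
| --- | --- | --- | --- | --- | --- |
| Between styles | 2 | 8.9246 | 4.4623 | $\sigma^2 + 20\sum_s \alpha_s^2/2$ | 5.07 |
| Within styles | 57 | 50.1253 | 0.8794 | $\sigma^2$ | |
| Total | 59 | 59.0499 | | | |

Table 6 – ANOVA of a complete, crossed-classification, two-factor factorial design with P experimental units for each factor-level combination

| Source of variation | df | SS | EMS |
| --- | --- | --- | --- |
| Between styles | S – 1 | $PT\sum_s (\bar{X}_{.s.} - \bar{X}_{...})^2$ | $\sigma^2 + PT\sum_s \alpha_s^2/(S - 1)$ |
| Between tints | T – 1 | $PS\sum_t (\bar{X}_{..t} - \bar{X}_{...})^2$ | $\sigma^2 + PS\sum_t \beta_t^2/(T - 1)$ |
| Styles × tints (interaction) | (S – 1)(T – 1) | $P\sum_s\sum_t (\bar{X}_{.st} - \bar{X}_{.s.} - \bar{X}_{..t} + \bar{X}_{...})^2$ | $\sigma^2 + P\sum_s\sum_t \gamma_{st}^2/[(S - 1)(T - 1)]$ |
| Within style-tint combinations | ST(P – 1) | $\sum_s\sum_t\sum_p (X_{pst} - \bar{X}_{.st})^2$ | $\sigma^2$ |
| Total | PST – 1 | $\sum_s\sum_t\sum_p (X_{pst} - \bar{X}_{...})^2$ | |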
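The four-way partition for the crossed two-factor design can be sketched in a few lines. The scores below are made up for illustration (3 styles, 2 tints, P = 2 pupils per combination), and the function name `two_way_anova` is hypothetical; the sums of squares follow the usual formulas for a balanced fixed-effects layout.

```python
def two_way_anova(cells):
    """cells[s][t] holds the P scores for style s and tint t (balanced design)."""
    S, T, P = len(cells), len(cells[0]), len(cells[0][0])
    cell = [[sum(obs) / P for obs in row] for row in cells]          # cell means
    style = [sum(row) / T for row in cell]                           # style means
    tint = [sum(cell[s][t] for s in range(S)) / S for t in range(T)] # tint means
    grand = sum(style) / S
    ss = {
        "styles": P * T * sum((m - grand) ** 2 for m in style),
        "tints": P * S * sum((m - grand) ** 2 for m in tint),
        "interaction": P * sum((cell[s][t] - style[s] - tint[t] + grand) ** 2
                               for s in range(S) for t in range(T)),
        "within": sum((x - cell[s][t]) ** 2
                      for s in range(S) for t in range(T) for x in cells[s][t]),
        "total": sum((x - grand) ** 2 for row in cells for obs in row for x in obs),
    }
    df = {"styles": S - 1, "tints": T - 1, "interaction": (S - 1) * (T - 1),
          "within": S * T * (P - 1), "total": P * S * T - 1}
    ms_within = ss["within"] / df["within"]
    # Each fixed effect is tested against the within-combination mean square.
    f = {k: (ss[k] / df[k]) / ms_within for k in ("styles", "tints", "interaction")}
    return ss, df, f


# Hypothetical comprehension scores: scores[s][t] lists the pupils in that cell.
scores = [[[5, 7], [6, 8]],
          [[4, 6], [9, 9]],
          [[3, 5], [7, 10]]]

ss, df, f = two_way_anova(scores)
for source in ("styles", "tints", "interaction", "within", "total"):
    print(source, df[source], round(ss[source], 3))
print({k: round(v, 2) for k, v in f.items()})
```

The four component sums of squares add to the total, and so do their degrees of freedom, which is the additivity property described earlier for the one-factor case.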

Interaction

The two-factor design introduces interaction, a concept not relevant in one-factor experiments. It might be found, for example, that, although in general s1 is an ineffective style and t3 is an ineffective tint, the particular combination s1t3 produces rather good results. It is then said that style interacts with tint to produce nonadditive effects; if the effects were additive, an ineffective style combined with an ineffective tint would produce an ineffective combination.

Interaction is zero if E(Xpst) = μ + αs + βt for every st, because under this condition the population mean of the stth combination is the population grand mean plus the sum of the effects of the sth style and the tth tint. Then the interaction effect, γst, is zero for every combination. Table 7 contains hypothetical data showing population means, μst, for zero interaction (Lubin 1961 discusses types of interaction). Note that for every cell of Table 7, μst – (μ̄s. – μ) – (μ̄.t – μ) = μ = 3. (Here μ̄.. is written as μ for simplicity.) For example, for tint 1 and style 1, 3 – (5 – 3) – (1 – 3) = 3.

One tests for interaction by computing F = MSstyles × tints/MSwithin style-tint and comparing this F with the F’s tabled at various significance levels for (S – 1)(T – 1) and ST(P – 1) degrees of freedom.

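Table 7 – Zero interaction of two factors (hypothetical population means μst)

| Style | Tint 1 | Tint 2 | Tint 3 | Tint 4 | Row means (μ̄s.) |
| --- | --- | --- | --- | --- | --- |
| 1 | 3 | 4 | 5 | 8 | 5 |
| 2 | 0 | 1 | 2 | 5 | 2 |
| 3 | 0 | 1 | 2 | 5 | 2 |
| Column means (μ̄.t) | 1 | 2 | 3 | 6 | 3 = μ |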
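The additivity of Table 7 can be verified cell by cell. This small sketch computes the interaction effect for each style-tint combination as the cell mean minus the row mean minus the column mean plus the grand mean; the variable names are hypothetical.

```python
# Population means from Table 7: rows are styles 1-3, columns are tints 1-4.
cell_means = [[3, 4, 5, 8],
              [0, 1, 2, 5],
              [0, 1, 2, 5]]

S, T = len(cell_means), len(cell_means[0])
row_means = [sum(row) / T for row in cell_means]
col_means = [sum(cell_means[s][t] for s in range(S)) / S for t in range(T)]
grand_mean = sum(row_means) / S

# Interaction effect for each cell: what remains after removing both main effects.
interaction = [[cell_means[s][t] - row_means[s] - col_means[t] + grand_mean
                for t in range(T)]
               for s in range(S)]

print(row_means, col_means, grand_mean)
for row in interaction:
    print(row)
```

Every entry of the interaction array is zero, so each cell mean is exactly the grand mean plus a style effect plus a tint effect.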

If there were but one subject reading with each style-tint combination (that is, if there were no replication), further assumptions would have to be made to permit testing of hypotheses about main effects. In particular, it is commonly then assumed that the style × tint interaction is zero, so that the expected mean square for interaction in Table 6 reduces to the underlying variance, and MSstyles × tints may be used in the denominator of the F’s for testing main effects. No test of the assumption of additivity is possible through MSwithin style-tint, because this quantity cannot be calculated. However, Tukey (1949; see also Winer 1962, pp. 216–220) has provided a one-degree-of-freedom test for interaction, or nonadditivity, of a special kind that can be used for testing the hypothesis of no interaction for these unreplicated experiments of the fixed-effects kind. (See Scheffé 1959, pp. 129–134.)

The factorial design may be extended to three or more factors. With three factors there are four sums of squares for interactions: one for the three-factor interaction (sometimes called a second-order interaction, because a one-factor “interaction” is a main effect) and one each for the three two-factor (that is, first-order) interactions. If the three factors are A, B, and C, their interactions might be represented as A × B × C, A × B, A × C, and B × C. For example, a style of type that for the experiment as a whole yields excellent comprehension may, when combined with a generally effective size of type and a tint of paper that has overall facilitative effect, yield rather poor results. One three-factor factorial experiment permits testing of the hypothesis that there is no second-order interaction and permits the magnitude of such interaction to be estimated, whereas three one-factor experiments or a two-factor experiment and a one-factor experiment do not. Usually, three-factor nonadditivity is difficult to explain substantively.

A large number of more complex designs, most of them more or less incomplete in some respect as compared with factorial designs of the kind discussed above, have been proposed. [See Experimental design; see also Winer 1962; Fisher 1935.]

The analysis of covariance

Suppose that the publisher in the earlier, style-of-type example had known reading-test scores for his 60 pupils prior to the experiment. He could have used these antecedent scores in the analysis of the comprehension scores to reduce the magnitude of the mean square within styles, which, as the estimate of underlying variance, is the denominator of the computed F. At the same time he would adjust the subsequent style means to account for initial differences between reading-test-score means in the three groups. One way of carrying out this more refined analysis would be to perform an analysis of variance of the differences between final comprehension scores and initial reading scores, say, Xps – Yps. A better prediction of the outcome measure, Xps, might be secured by computing α + βYps, where α and β are constants to be estimated.

By a statistical procedure called the analysis of covariance one or more antecedent variables may be used to reduce the magnitude of the sum of squares within styles and also to adjust the observed style means for differences between groups in average initial reading scores. If β ≠ 0, then the adjusted sum of squares within treatments (which provides the denominator of the F-ratio) will be less than the unadjusted SSw of Table 2, thereby tending to increase the magnitude of F. For each independent antecedent variable one uses, one degree of freedom is lost for SSw and none for SSb; the loss of degrees of freedom for SSw will usually be more than compensated for by the decrease in its magnitude.

A principal statistical condition needed for the usual analysis of covariance is that the regression of outcome scores on antecedent scores is the same for every style, because one computes a single within-style regression coefficient to use in adjusting the within-style sum of squares. Homogeneity of regression can be tested statistically; see Winer (1962, chapter 11). Some procedures to adopt in the case of heterogeneity of regression are given in Brownlee (1960).

The regression model chosen must be appropriate for the data if the use of one or more antecedent variables is to reduce MSW appreciably. Usually the regression of outcome measures on antecedent measures is assumed to be linear.
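A one-covariate analysis of covariance can be sketched as follows, assuming a linear regression with a single slope common to all styles. The data are made up for illustration, the function name `ancova_one_way` is hypothetical, and the formulas are the standard textbook ones for a one-factor layout with one antecedent variable, not a procedure quoted from this article.

```python
def ancova_one_way(groups):
    """groups: one list per style of (y, x) pairs; y = antecedent score, x = outcome."""
    pooled = [pair for g in groups for pair in g]
    n_total, n_groups = len(pooled), len(groups)

    def sums(pairs):
        """Means and corrected sums of squares and products for a set of pairs."""
        my = sum(y for y, _ in pairs) / len(pairs)
        mx = sum(x for _, x in pairs) / len(pairs)
        syy = sum((y - my) ** 2 for y, _ in pairs)
        sxx = sum((x - mx) ** 2 for _, x in pairs)
        sxy = sum((y - my) * (x - mx) for y, x in pairs)
        return my, mx, syy, sxx, sxy

    grand_y, _, syy_t, sxx_t, sxy_t = sums(pooled)
    per_group = [sums(g) for g in groups]
    syy_w = sum(s[2] for s in per_group)
    sxx_w = sum(s[3] for s in per_group)
    sxy_w = sum(s[4] for s in per_group)

    slope = sxy_w / syy_w                       # single within-style regression coefficient
    ss_within_adj = sxx_w - sxy_w ** 2 / syy_w  # within SS after removing the regression
    ss_total_adj = sxx_t - sxy_t ** 2 / syy_t
    ss_between_adj = ss_total_adj - ss_within_adj
    df_between, df_error = n_groups - 1, n_total - n_groups - 1  # one df lost to the slope
    return {
        "slope": slope,
        "ss_within": sxx_w,
        "ss_within_adjusted": ss_within_adj,
        "ss_between_adjusted": ss_between_adj,
        "df_error": df_error,
        "f": (ss_between_adj / df_between) / (ss_within_adj / df_error),
        "adjusted_means": [mx - slope * (my - grand_y) for my, mx, *_ in per_group],
    }


# Hypothetical (antecedent reading score, comprehension score) pairs for three styles.
data = [[(10, 20), (12, 23), (14, 25), (16, 30)],
        [(11, 24), (13, 27), (15, 28), (17, 33)],
        [(9, 18), (12, 22), (13, 26), (18, 31)]]

result = ancova_one_way(data)
print(round(result["slope"], 3), result["df_error"])
print(round(result["ss_within"], 3), round(result["ss_within_adjusted"], 3))
print([round(m, 2) for m in result["adjusted_means"]])
```

With a covariate this strongly related to the outcome, the adjusted within-style sum of squares is far smaller than the unadjusted one, at the cost of a single degree of freedom for error.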

The analysis of covariance can be extended to more than one antecedent variable and to more complex designs. (For further details see Cochran 1957; Smith 1957; Winer 1962; McNemar 1949.)

Models—fixed, finite, random, and mixed

In the example, the publisher’s “target population” of styles of print consisted of just those 3 styles that he tried out, so he exhausted the population of styles of interest to him. Suppose that, instead, he had been considering 39 different styles and had drawn at random from these 39 the 3 styles he used in the experiment. His intention is to determine from the experiment based on these 3 styles whether it would make any difference which one of the 39 styles he used for the textbook (of course, in practice a larger sample of styles would be drawn). If the styles did seem to differ in effectiveness, he would estimate from his experimental data involving only 3 styles the variance of the 39 population means of the styles. Then he might perform further experiments to find the most effective styles.

Finite-effects models

Thus far in this article the model assumed has been the fixed-effects model, in which one uses in the experiment itself all the styles of type to which one wishes to generalize. The 3-out-of-39 experiment mentioned above illustrates a finite-effects model, with only a small percentage (8 per cent, in the example given) of the styles drawn at random for the experiment but where one has the intention of testing the null hypothesis

$$H_0\colon \mu_1 = \mu_2 = \cdots = \mu_{39}$$

against all alternative hypotheses and estimating the “variance” of the 39 μs from the between-styles and within-style mean squares.

Random-effects models

If the number of “levels” of the factor is very large, so that the number of levels drawn randomly for the experiment is a negligible percentage of the total number, then one has a random-effects model, sometimes called a components-of-variance model or Model II. This model would apply if, for example, one drew 20 raters at random from an actual or hypothetical population of 100,000 raters and used those 20 to rate each of 25 subjects who had been chosen at random from a population of half a million. (Strictly speaking, the number of raters and the number of subjects in the respective populations would have to be infinite to produce the random-effects model, but for practical purposes 100,000/20 and 500,000/25 are sufficiently large.) If every rater rated every subject on one trait (say, gregariousness) there would be 20 × 25 = 500 ratings, one for each experimental combination, that is, one for each rater-subject combination.

This, then, would be a two-factor design without replication, that is, with just one rating per rater–subject combination. (Even if the experimenter had used available raters and subjects rather than drawing them randomly from any populations, he would probably want to generalize to other raters and subjects “like” them; see Cornfield & Tukey 1956, p. 913.)

The usual model for an experiment thus conceptualized is

$$X_{rs} = \mu + a_r + b_s + e_{rs},$$

where µ is a grand mean, the a’s are the (random) rater effects, the b’s are (random) subject effects, and the e’s combine interaction and inherent measurement error. The 20 + 25 + (20 × 25) random variables are supposed to be independent and assumed to have variances as follows:

$$\operatorname{var}(a_r) = \sigma_r^2, \qquad \operatorname{var}(b_s) = \sigma_s^2, \qquad \operatorname{var}(e_{rs}) = \sigma_e^2.$$

For F-testing purposes, a, b, and e are supposed to be normally distributed.

The analysis-of-variance table in such a case is similar to those presented earlier, except that the expected mean square (EMS) column is changed to the one shown in Table 8.

Table 8
Effect | EMS
Rater | $\sigma_e^2 + 25\sigma_r^2$
Subject | $\sigma_e^2 + 20\sigma_s^2$
Error | $\sigma_e^2$

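The F-statistic for testing the hypothesis that the main effect of subjects is absent (that is, $\sigma_s^2 = 0$) is $MS_{\text{subject}}/MS_{\text{error}}$, where

$$MS_{\text{subject}} = \frac{20\sum_{s}(\bar{X}_{\cdot s} - \bar{X}_{\cdot\cdot})^2}{24}, \qquad MS_{\text{error}} = \frac{\sum_{r}\sum_{s}(X_{rs} - \bar{X}_{r\cdot} - \bar{X}_{\cdot s} + \bar{X}_{\cdot\cdot})^2}{19 \times 24}.$$

Here $\bar{X}_{r\cdot}$ is the mean of the ratings given by rater r, $\bar{X}_{\cdot s}$ is the mean of the ratings received by subject s, and $\bar{X}_{\cdot\cdot}$ is the mean of all 500 ratings. Under the null hypothesis that $\sigma_s^2 = 0$ the F-statistic has an F-distribution with 24 and 19 × 24 degrees of freedom. (A similar F-statistic is used for testing $\sigma_r^2 = 0$.) An unbiased estimator of $\sigma_s^2$ is

$$\hat{\sigma}_s^2 = \frac{MS_{\text{subject}} - MS_{\text{error}}}{20},$$

with a similar estimator for $\sigma_r^2$. A serious difficulty with these estimators is that they may take negative values; perhaps the best resolution of that difficulty is to enlarge the model. See Nelder (1954), and for another approach and a bibliography, see Thompson (1962).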
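The computations for such a rater-by-subject table can be sketched in Python. This is only an illustrative sketch under the assumptions stated in the text (one rating per rater–subject combination, both factors random); the function name, the dictionary keys, and the small ratings table are hypothetical and are not taken from any of the works cited.

```python
from statistics import mean

def two_way_random_anova(x):
    """x[r][s] is the single rating given by rater r to subject s."""
    R, S = len(x), len(x[0])
    grand = mean(v for row in x for v in row)
    rater_means = [mean(row) for row in x]
    subject_means = [mean(x[r][s] for r in range(R)) for s in range(S)]

    # Sums of squares for the two main effects and the residual.
    ss_rater = S * sum((m - grand) ** 2 for m in rater_means)
    ss_subject = R * sum((m - grand) ** 2 for m in subject_means)
    ss_error = sum(
        (x[r][s] - rater_means[r] - subject_means[s] + grand) ** 2
        for r in range(R) for s in range(S)
    )
    ms_rater = ss_rater / (R - 1)
    ms_subject = ss_subject / (S - 1)
    ms_error = ss_error / ((R - 1) * (S - 1))
    return {
        "F_subject": ms_subject / ms_error,
        "F_rater": ms_rater / ms_error,
        # Unbiased variance-component estimates; these can be negative.
        "var_subject": (ms_subject - ms_error) / R,
        "var_rater": (ms_rater - ms_error) / S,
        "var_error": ms_error,
    }

# Hypothetical ratings: 3 raters (rows) by 4 subjects (columns).
ratings = [
    [4, 6, 5, 9],
    [5, 7, 5, 10],
    [3, 6, 4, 8],
]
result = two_way_random_anova(ratings)
```

With 20 raters and 25 subjects the divisors R and S become 20 and 25, matching the estimator given in the text. A negative value of either variance estimate signals the difficulty discussed by Nelder (1954) and Thompson (1962).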

Note that here it appears impossible to separate random interaction from inherent variability, both of which contribute to σ2e, the variance of the e’s; in the random-effects model, however, this does not jeopardize significance tests for main effects.

In more complex Model II situations, the F-tests used are inherently different from their Model I analogues; in particular, sample components of variance are often most reasonably compared, not with the “bottom” estimator of σ², but with some other component of variance, usually an interaction. (See Hays 1963, pp. 356–489; Brownlee [1960] 1965, pp. 309–396, 467–529.)

Mixed models

If all the levels of one factor are used in an experiment while a random sample of the levels of another factor is used, a mixed model results. Mixed models present special problems of analysis that have been discussed by Scheffé (1959, pp. 261–290) and by Mood and Graybill (1963).

Other topics in ANOVA

Robustness of ANOVA

Fixed-effects models are better understood than the other models and therefore, where appropriate, can be used with considerable confidence. Fixed-effects ANOVA seems “robust” for type I errors to departures from certain mathematical assumptions underlying the F-test, provided that the number of experimental units is the same for each experimental combination. Two of these assumptions are that the e’s are normally distributed and that they have common variance σ² for every one of the experimental combinations. In particular, the common-variance assumption can be relaxed without greatly affecting the probability values for computed F’s. If the number of experimental units does not vary from one factor-level combination to another, then it may be unnecessary to test for heterogeneity of variances preliminary to performing an ANOVA, because ANOVA is robust to such heterogeneity. (In fact, it may be unwise to make such a test, because the usual test for heterogeneity of variance is more sensitive to nonnormality than is ANOVA.) For further discussion of this point see Lindquist (1953, pp. 78–86), Winer (1962, pp. 239–241), Brownlee ([1960] 1965, chapter 9), and Glass (1966). Brownlee (1960) and others have provided the finite-model expected mean squares for the complete three-factor factorial design, from which one can readily determine expected mean squares for three-factor fixed, mixed, and random models.

Analysis-of-variance F’s are unaffected by linear transformation of the observations, that is, by changes in the $X_{ps}$ of the form $a + bX_{ps}$, where a and b are constants (b ≠ 0). Multiplying every observation by b multiplies every mean square by $b^2$. Adding a to every observation does not change the mean squares. Thus, if observations are two-decimal numbers running from, say, –1.22 upward, one could, to simplify calculations, drop the decimal (multiply each number by 100) and then add 122 to each observation. The lowest observation would become 100(–1.22) + 122 = 0. Each mean square would become $100^2$ = 10,000 times as large as for the decimal fractions. With the increasing availability of high-speed digital computers, coding of data is becoming less important than it was formerly.
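The invariance just described can be checked numerically. The following Python sketch computes a one-factor F before and after the coding mentioned above (multiply by 100, add 122); the function name and the scores are hypothetical and serve only as an illustration.

```python
from statistics import mean

def one_way_f(groups):
    """Return (MS_between, MS_within, F) for a one-factor fixed-effects layout."""
    all_values = [v for g in groups for v in g]
    grand = mean(all_values)
    k, n_total = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between, ms_within, ms_between / ms_within

# Hypothetical two-decimal scores for three groups.
raw = [[-1.22, 0.35, 0.10], [0.80, 1.45, 0.95], [-0.40, 0.25, 0.60]]
# Coding described in the text: multiply by 100, then add 122.
coded = [[round(100 * v + 122) for v in g] for g in raw]

msb_raw, msw_raw, f_raw = one_way_f(raw)
msb_coded, msw_coded, f_coded = one_way_f(coded)
```

Apart from floating-point rounding, `f_coded` equals `f_raw`, while each coded mean square is 10,000 times its uncoded counterpart.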

A brief classification of factors

The ANOVA “factors” considered thus far are style of printing type, tint of ink, rater, and subject. Styles differ from each other qualitatively, as do raters and subjects. Tint of ink might vary more quantitatively than do styles, raters, and subjects—as would, for example, size of printing type or temperature in a classroom. Thus, one basis for classifying factors is whether or not their levels are ordered and, if they are, whether meaningful numbers can be associated with the factor levels.

Another basis for classification is whether the variable is manipulated by the experimenter. In order to conduct a “true” experiment, one must assign his experimental units in some (simple or restrictive) random fashion to the levels of at least one manipulated factor. ANOVA may be applied to other types of data, such as the scores of Englishmen versus Americans on a certain test, but this is an associational study, not a stimulus-response experiment. Obviously, nationality is not an independent variable in the same sense that printing type is. The direct “causal” inference possible from a well-conducted style-of-type experiment differs from the associational information obtained from the comparison of Englishmen’s scores with those of Americans (see Stanley 1961; 1965; 1966; Campbell & Stanley 1963). Some variables, such as national origin, are impossible to manipulate in meaningful ways, whereas others, such as “enrolls for Latin versus does not enroll for Latin,” can in principle be manipulated, even though they usually are not.

Experimenters use nonmanipulated, classification variables for two chief reasons. First, they may wish to use a factor explicitly in a design in order to isolate the sum of squares for the main effect of that factor so that it will not inflate the estimate of underlying variance, that is, so that it will not make the denominator mean square of F unnecessarily large. For example, if the experimental units available for experimentation are children in grades seven, eight, and nine, and if IQ scores are available, it is wise in studying the three styles of type to use the three (ordered) grades as one fixed-effects factor and a number of ordered IQ levels (say, four) as another fixed-effects factor. If the experimenter suspects that girls and boys may react differently to the styles, he will probably use this two-level, unordered classification (girls versus boys) as the third factor. This would produce 3 × 4 × 2 × 3 = 72 experimental combinations, so with at least 2 children per combination he needs not less than 144 children.

Probably most children in the higher grades read better, regardless of style, than do most children in the lower grades, and children with high IQ’s tend to read better than children with lower IQ’s, so the main effects of grade and of IQ should be large. Therefore, the variation within grade–IQ–sex–style groups should be considerably less than within styles alone.

A second reason for using such stratifying or leveling variables is to study their interactions with the manipulated variable. Ninth graders might do relatively better with one style of type and seventh graders relatively better with another style, for example. If so, the experimenter might decide to recommend one style of type for ninth graders and another for seventh graders. With the above design one can isolate and examine one four-factor interaction, four three-factor interactions, six two-factor interactions, and four main effects, a total of 2⁴ – 1 = 15 sources of variation across conditions. In the fixed-effects model all of these are tested against the variation within the experimental combinations, pooled from all combinations. Testing 15 sources of variation instead of 1 will tend to cause more apparently significant F’s at a given tabled significance level than would be expected under the null hypothesis. For any one of the significance tests, given that the null hypothesis is true, one expects 5 spurious rejections of the true null hypothesis out of 100 tests; thus, if an analyst keeps making F-tests within an experiment, he has more than a .05 probability of securing at least one statistically significant F, even if no actual effects exist. There are systematic ways to guard against this (see, for example, Pearson & Hartley 1954, pp. 39–40). At least, one should be suspicious of higher-order interactions that seem to be significant at or near the .05 level. Many an experimenter utilizing a complex design has worked extremely hard trying to interpret a spuriously significant high-order interaction and in the process has introduced his fantasies into the journal literature.
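A short Python sketch makes the counting and the inflation of the error rate concrete. It treats the 15 F-tests as if they were independent, which is an assumption made here only for illustration (the tests share a common denominator mean square, so the true figure differs somewhat); the variable names are hypothetical.

```python
from math import comb

n_factors = 4          # grade, IQ level, sex, style of type
alpha = 0.05           # nominal level of each F-test

# Number of main effects (order 1) and interactions of each higher order.
sources_by_order = {order: comb(n_factors, order) for order in range(1, n_factors + 1)}
n_sources = sum(sources_by_order.values())      # 2**4 - 1

# Chance of at least one spurious rejection if the tests were independent
# (an illustrative assumption; they actually share a denominator).
p_at_least_one = 1 - (1 - alpha) ** n_sources
expected_false_rejections = alpha * n_sources
```

Under that simplifying assumption the chance of at least one spuriously significant F among the 15 is roughly .54, far above the nominal .05.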

Studies in which researchers do not manipulate any variables are common and important in the social sciences. These include opinion surveys, studies of variables related to injury in automobile accidents, and studies of the Hiroshima and Nagasaki survivors. ANOVA proves useful in many such investigations. [See Campbell & Stanley 1963; Lindzey 1954; see also Experimental design, article on quasi-experimental design.]

“Nesting” and repeated measurements

Many studies and experiments in the social sciences involve one or more factors whose levels do not “cross” the levels of certain other factors. Usually these occur in conjunction with repeated measurements taken on the same individuals. For example, if one classification is school and another is teacher within school, where each teacher teaches two classes within her school with different methods, then teachers are said to be “nested” within schools. Schools can interact with methods (a given method may work relatively better in one school than in another) and teachers can interact with methods within schools (a method that works relatively better for one teacher does not necessarily produce better results for another teacher in the same school), but schools cannot interact with teachers, because teachers do not “cross” schools; that is, the same teacher does not teach at more than one school.

This does not mean that a given teacher might not be more effective in another school but merely that the experiment provides no evidence on that point. One could, somewhat inconveniently, devise an experiment in which teachers did cross schools, teaching some classes in one school and some in another. But an experimenter could not, for example, have boys cross from delinquency to non-delinquency and vice versa, because delinquency–nondelinquency is a personal rather than an environmental characteristic. (For further discussion of nested designs see Brownlee [1960] 1965, chapters 13 and 15.)

If the order of repeated measurements on each individual is randomized, as when each person undergoes several treatments successively in random order, there is more likelihood that ANOVA will be appropriate than when the order cannot be randomized, as occurs, for instance, when the learning process is studied over a series of trials. Complications occur also if the successive treatments have differential residual effects; taking a difficult test first may discourage one person in his work on the easier test that follows but make another person try harder. These residual effects seem likely to be of less importance if enough time occurs between successive treatment levels for some of the immediate influence of the treatment to dissipate. Human beings cannot have their memories erased like calculating machines, however, so repeated-measurement designs, although they usually reduce certain error terms because intraindividual variability tends to be less than interindividual variability, should not be used indiscriminately when analogous designs without repeated measurements are experimentally and financially feasible. (For further discussion see Winer 1962; Hays 1963, pp. 455–456; Campbell & Stanley 1963.)

Missing observations

For two factors with levels s = 1, 2, ..., S and t = 1, 2, ..., T in the experiment, such that the number of experimental units for the stth experimental combination is nst , one usually designs the experiment so that nst = n, a constant for all st. A few missing observations at the end of the experiment do not rule out a slightly adjusted simple ANOVA, if they were not caused differentially by the treatments. If, for example, one treatment was to administer a severe shock on several occasions, and the other was to give ice cream each time, it would not be surprising to find that fewer shocked than fed experimental subjects come for the final session. The outcome measure might be arithmetical-reasoning score; but if only the more shock-resistant subjects take the final test, comparison of the two treatments may be biased. There would be even more difficulty with, say, a male–female by shocked–fed design, because shocking might drive away more women than men (or vice versa).

When attrition is not caused differentially by the factors one may, for one-factor ANOVA, perform the usual analysis. For two or more factors, adjustments in the analysis are required to compensate for the few missing observations. (See Winer 1962, pp. 281–283, for example, for appropriate techniques.)

The power of the F-test

There are two kinds of errors that one can make when testing a null hypothesis against alternative hypotheses: one can reject the null hypothesis when in fact it is true, or one can fail to reject the null hypothesis when in fact it is false. Rejecting a true null hypothesis is called an “error of the first kind,” or a “type I error.” Failing to reject an untrue null hypothesis is called an “error of the second kind” or “type II error.” The probability of making an error of the first kind is called the size of the significance test and is usually signified by α. The probability of making an error of the second kind is usually signified by β. The quantity 1 – β is called the power of the significance test.

If there is no limitation on the number of experimental units available one can fix both α and β at any desired levels prior to the experiment. To do this some prior estimate of σ2 is required, and it is also necessary to state what nonnull difference among the factor-level means is considered large enough to be worth detecting. This latter requirement is quite troublesome in many social science experiments, because a good scale of value (such as dollars) is seldom available. For example, how much is a one-point difference between the mean of style 1 and style 2 on a reading-comprehension test worth educationally? Intelligence quotients and averages of college grades are quasi-utility scales, although one seldom thinks of them in just that way. How much is a real increase in IQ from 65 to 70 worth? How much more utility for the college does a grade-point average of 2.75 (where C = 2 and B = 3) have than a grade-point average of 2.50? (For further discussion of this topic see Chernoff & Moses 1959.)

In the hypothetical printing-styles example (Tables 1 and 5) it is known that σ² = 1 and that the population mean of style 1 is one point greater than the population means of styles 2 and 3, so with this information it is simple to enter Winer’s Table B.11 (1962, p. 657) with, for example, α = .05 and β = .10 and to find that for each of the three styles P = 20 experimental units are needed.
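The size of the nonnull difference in this example can be summarized in a single quantity before a table or chart is consulted. The worked equations below are a sketch only: the symbols $\delta_s$, $\lambda$, and $\phi$ are introduced here for illustration (they follow a common convention for power charts and are not necessarily Winer’s notation), and the population means are written as $\mu + 1$, $\mu$, $\mu$ for an arbitrary $\mu$.

$$\bar{\mu} = \mu + \tfrac{1}{3}, \qquad \delta_1 = \tfrac{2}{3}, \quad \delta_2 = \delta_3 = -\tfrac{1}{3}, \qquad \sum_{s=1}^{3} \delta_s^2 = \tfrac{4}{9} + \tfrac{1}{9} + \tfrac{1}{9} = \tfrac{2}{3}$$

$$\lambda = \frac{P \sum_s \delta_s^2}{\sigma^2} = \frac{20 \times \tfrac{2}{3}}{1} \approx 13.3, \qquad \phi = \sqrt{\lambda / 3} \approx 2.11$$

Larger values of $\lambda$ (more units per style, larger differences among the means, or smaller $\sigma^2$) correspond to greater power of the F-test with 2 and 3(P – 1) = 57 degrees of freedom.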

In actual experiments, where σ and the population means of interest to the experimenter are usually not known, the situation is more difficult (see Brownlee [1960] 1965, pp. 97–111; McNemar [1949] 1962, pp. 63–69; Hays 1963; and especially Scheffé 1959, pp. 38–42, 62–65, 437–455).

Alternatives to analysis of variance

If one conducted an experiment to determine how well ten-year-old boys add two-digit numbers at five equally spaced atmospheric temperatures, he could use the techniques of regression analysis to determine the equation for the line that best fits the five means (in the sense of minimum squared discrepancies). This line might be of the simple form α + βT (that is, straight with slope β and intercept α) or it might be based on some other function of T. [See Winer 1962 for further discussion of trend analysis; see also Linear hypotheses, article on regression.]
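Fitting the straight line α + βT to five means by least squares can be sketched in a few lines of Python. The temperatures and mean scores below are hypothetical, as is the function name; the formulas are the usual least-squares expressions for slope and intercept.

```python
def fit_line(t, y):
    """Least-squares intercept and slope for points (t[i], y[i])."""
    n = len(t)
    t_bar, y_bar = sum(t) / n, sum(y) / n
    sxy = sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y))
    sxx = sum((ti - t_bar) ** 2 for ti in t)
    slope = sxy / sxx
    intercept = y_bar - slope * t_bar
    return intercept, slope

temps = [60, 65, 70, 75, 80]                    # hypothetical equally spaced temperatures
mean_scores = [14.0, 15.5, 16.0, 15.0, 13.5]    # hypothetical mean addition scores
intercept, slope = fit_line(temps, mean_scores)

# Discrepancies of the five means from the fitted straight line.
residuals = [y - (intercept + slope * t) for t, y in zip(temps, mean_scores)]
```

In these made-up data the residuals are negative at both ends and positive in the middle, the kind of pattern that would lead one to consider some other function of T.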

The symmetrical two-tail t-test is a special case of the F-test: $t_\nu^2 = F_{1,\nu}$, where ν is the number of degrees of freedom of t. Likewise, the unit normal deviate (z), called “critical ratio” in old statistics textbooks when used for testing significance, is a special case of F: $z^2 = F_{1,\infty}$. The F-distribution is closely related to the chi-square distribution. [For further discussion of these relationships, see Distributions, statistical, article on special continuous distributions.]
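The relation between t and F for two groups can be verified directly. The Python sketch below computes the pooled-variance t statistic and the one-factor ANOVA F for the same two samples; the scores and function names are hypothetical.

```python
from math import sqrt
from statistics import mean

def pooled_t(a, b):
    """Two-sample t statistic with a pooled variance estimate."""
    na, nb = len(a), len(b)
    ss = sum((v - mean(a)) ** 2 for v in a) + sum((v - mean(b)) ** 2 for v in b)
    pooled_var = ss / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var * (1 / na + 1 / nb))

def two_group_f(a, b):
    """One-factor ANOVA F statistic for two groups (1 and na + nb - 2 df)."""
    grand = mean(a + b)
    # With two groups the between-groups sum of squares has 1 df,
    # so it is also the between-groups mean square.
    ms_between = len(a) * (mean(a) - grand) ** 2 + len(b) * (mean(b) - grand) ** 2
    ss_within = sum((v - mean(a)) ** 2 for v in a) + sum((v - mean(b)) ** 2 for v in b)
    return ms_between / (ss_within / (len(a) + len(b) - 2))

style_1 = [12, 15, 11, 14, 13]   # hypothetical scores
style_2 = [10, 9, 12, 11, 8]
t = pooled_t(style_1, style_2)
f = two_group_f(style_1, style_2)
```

For these data t = 3 and F = 9, so that the square of t equals F, as the identity requires.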

For speed and computational ease, or when assumptions of ANOVA are violated so badly that results would seem dubious even if the data were transformed, there are other procedures available (see Winer 1962). Some of these procedures involve consecutive, untied ranks, whose means and variances are parameters dependent only on the number of ranks; an important example is the Kruskal–Wallis analysis of variance for ranks (Winer 1962, pp. 622–623). Other procedures employ the binomial expansion $(p + q)^n$ or the chi-square approximation to it for “sign tests.” Still others involve dichotomizing the values for each treatment at the median and computing χ². Range tests may be used also. [See Winer 1962, p. 77; McNemar (1949) 1962, chapter 19. Some of these procedures are discussed in Nonparametric statistics.]
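As an illustration of a rank procedure, the Kruskal–Wallis statistic for untied observations can be sketched in Python as follows. The formula used is the usual one, H = 12/[N(N + 1)] Σ R²/n – 3(N + 1), where R is the rank sum and n the size of each group; the function name and the scores are hypothetical, and the sketch deliberately refuses tied data rather than applying a tie correction.

```python
def kruskal_wallis_h(groups):
    """H statistic for untied observations; groups is a list of lists."""
    pooled = sorted(v for g in groups for v in g)
    if len(set(pooled)) != len(pooled):
        raise ValueError("this sketch assumes no tied observations")
    rank = {v: i + 1 for i, v in enumerate(pooled)}   # ranks 1..N
    n_total = len(pooled)
    rank_sums = [sum(rank[v] for v in g) for g in groups]
    h = 12 / (n_total * (n_total + 1)) * sum(
        r ** 2 / len(g) for r, g in zip(rank_sums, groups)
    ) - 3 * (n_total + 1)
    return h, rank_sums

# Hypothetical comprehension scores for three styles of type.
scores = [[27, 31, 35, 38], [22, 25, 29, 30], [18, 20, 21, 24]]
h, rank_sums = kruskal_wallis_h(scores)
```

Under the null hypothesis H is referred, approximately, to the chi-square distribution with one fewer degrees of freedom than the number of groups.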

When the normal assumption is reasonable, there are often available testing and other procedures that are competitive with the F-test. The latter has factotum utility, and it has optimal properties when the alternatives of interest are symmetrically arranged relative to the null hypothesis. But when the alternatives are asymmetrically arranged, or in other special circumstances, competitors to F procedures may be preferable. Particularly worthy of mention are Studentized range tests (see Scheffé 1959, pp. 82–83) and half-normal plotting (see Daniel 1959).

Special procedures are useful when the alternatives specify an ordering. For example, in the style-of-type example it might be known before the experiment that if there is any difference between the styles, style 1 is better than style 2, and style 2 better than style 3 (see Bartholomew 1961; Chacko 1963).

It is also important to mention here the desirability of examining residuals (observations less the estimates of their expectations) as a check on the model and as a source of suggestions toward useful modifications. [See Statistical analysis, special problems of, article on transformations of data; see also Anscombe & Tukey 1963.]

Often an observed value appears to be so distant from the other values that the experimenter is tempted to discard it before performing an ANOVA. [For a discussion of procedures in such cases, see Statistical analysis, special problems of, article on outliers.]

Multivariate analysis of variance

The analysis of variance is multivariate in the independent variables (the factors) but univariate in the dependent variables (the outcome measures). S. N. Roy (for example, see Roy & Gnanadesikan 1959) and others have developed a multivariate analysis of variance (MANOVA), multivariate with respect to both independent and dependent variables, of which ANOVA is a special case. A few social scientists (for example, Rodwan 1964; Bock 1963) have used MANOVA, but as yet it has not been used widely by workers in these disciplines.

Julian C. Stanley

BIBLIOGRAPHY

Anscombe, F. J.; and Tukey, John W. 1963 The Examination and Analysis of Residuals. Technometrics 5:141–160.

Bartholomew, D. J. 1961 Ordered Tests in the Analysis of Variance. Biometrika 48:325–332.

Bock, R. Darrell 1963 Programming Univariate and Multivariate Analysis of Variance. Technometrics 5: 95–117.

Brownlee, Kenneth A. (1960) 1965 Statistical Theory and Methodology in Science and Engineering. 2d ed. New York: Wiley.

Campbell, Donald T.; and Stanley, Julian C. 1963 Experimental and Quasi-experimental Designs for Research on Teaching. Pages 171–246 in Nathaniel L. Gage (editor), Handbook of Research on Teaching. Chicago: Rand McNally. → Republished in 1966 as a separate monograph titled Experimental and Quasi-experimental Designs for Research.

Chacko, V. J. 1963 Testing Homogeneity Against Ordered Alternatives. Annals of Mathematical Statistics 34:945–956.

Chernoff, Herman; and Moses, Lincoln E. 1959 Elementary Decision Theory. New York: Wiley.

Cochran, William G. 1957 Analysis of Covariance: Its Nature and Uses. Biometrics 13:261–281.

Cornfield, Jerome; and Tukey, John W. 1956 Average Values of Mean Squares in Factorials. Annals of Mathematical Statistics 27:907–949.

Daniel, Cuthbert 1959 Use of Half–normal Plots in Interpreting Factorial Two–level Experiments. Technometrics 1:311–341.

Fisher, R. A. (1925) 1958 Statistical Methods for Research Workers. 13th ed. New York: Hafner. → Previous editions were also published by Oliver & Boyd.

Fisher, R. A. (1935) 1960 The Design of Experiments. 7th ed. London: Oliver & Boyd; New York: Hafner.

Glass, Gene V. 1966 Testing Homogeneity of Variances. American Educational Research Journal 3:187–190.

[Gosset, William S.] (1908) 1943 The Probable Error of a Mean. Pages 11–34 in William S. Gosset, “Student’s” Collected Papers. London: University College, Biometrika Office. → First published in Volume 6 of Biometrika.

Hays, William L. 1963 Statistics for Psychologists. New York: Holt.

Lindquist, Everet F. 1953 Design and Analysis of Experiments in Psychology and Education. Boston: Houghton Mifflin.

Lindzey, Gardner (editor) (1954) 1959 Handbook of Social Psychology. 2 vols. Cambridge, Mass.: Addison–Wesley. → Volume 1: Theory and Method. Volume 2: Special Fields and Applications. A second edition, edited by Gardner Lindzey and Elliot Aronson, is in preparation.

Lubin, Ardie 1961 The Interpretation of Significant Interaction. Educational and Psychological Measurement 21:807–817.

McLean, Leslie D. 1967 Some Important Principles for the Use of Incomplete Designs in Behavioral Research. Chapter 4 in Julian C. Stanley (editor), Improving Experimental Design and Statistical Analysis. Chicago: Rand McNally.

McNemar, Quinn (1949) 1962 Psychological Statistics. 3d ed. New York: Wiley.

Mood, Alexander M.; and Graybill, Franklin A. 1963 Introduction to the Theory of Statistics. 2d ed. New York: McGraw–Hill. → The first edition was published in 1950.

Nelder, J. A. 1954 The Interpretation of Negative Components of Variance. Biometrika 41:544–548.

Pearson, Egon S.; and Hartley, H. O. (editors) (1954) 1966 Biometrika Tables for Statisticians. Volume 1. 3d ed. Cambridge Univ. Press. → A revision of Tables for Statisticians and Biometricians (1914), edited by Karl Pearson.

RAND Corporation 1955 A Million Random Digits With 100,000 Normal Deviates. Glencoe, Ill.: Free Press.

Rodwan, Albert S. 1964 An Empirical Validation of the Concept of Coherence. Journal of Experimental Psychology 68:167–170.

Roy, S. N.; and Gnanadesikan, R. 1959 Some Contributions to ANOVA in One or More Dimensions: I and II. Annals of Mathematical Statistics 30:304–317, 318–340.

Sampford, Michael R. (editor) 1964 In Memoriam Ronald Aylmer Fisher, 1890–1962. Biometrics 20, no. 2:237–373.

Scheffé, Henry 1959 The Analysis of Variance. New York: Wiley.

Smith, H. Fairfield 1957 Interpretation of Adjusted Treatment Means and Regressions in Analysis of Co–variance. Biometrics 13:282–308.

Stanley, Julian C. 1961 Studying Status vs. Manipulating Variables. Phi Delta Kappa Symposium on Educational Research, Annual Phi Delta Kappa Symposium on Educational Research: [Proceedings] 2:173–208. → Published in Bloomington, Indiana.

Stanley, Julian C. 1965 Quasi–experimentation. School Review 73:197–205.

Stanley, Julian C. 1966 A Common Class of Pseudo–experiments. American Educational Research Journal 3:79–87.

Thompson, W. A., Jr. 1962 The Problem of Negative Estimates of Variance Components. Annals of Mathematical Statistics 33:273–289.

Tukey, John W. 1949 One Degree of Freedom for Non–additivity. Biometrics 5:232–242.

Winer, B. J. 1962 Statistical Principles in Experimental Design. New York: McGraw–Hill.

III MULTIPLE COMPARISONS

Multiple comparison methods deal with a dilemma arising in statistical analysis: On the one hand, it would be unfortunate not to analyze the data thoroughly in all its aspects; on the other hand, performing several significance tests, or constructing several confidence intervals, for the same data compounds the error rates (significance levels), and it is often difficult to compute the overall error probability.

Multiple comparison and related methods are designed to give simple overall error probabilities for analyses that examine several aspects of the data simultaneously. For example, some simultaneous tests examine all differences between several treatment means.

Cronbach (1949, especially pp. 399-403) describes the problem of inflation of error probabilities in multiple comparisons. The solutions now available are, for the most part, of a later date (see Ryan 1959; Miller 1966). Miller’s book provides a comprehensive treatment of the major aspects of multiple comparisons.

Normal means—confidence regions, tests

1. Simultaneous limits for several means

As a simple example of a situation in which multiple comparison methods might be applied, suppose that independent random samples are drawn from three normal populations with unknown means, $\mu_1, \mu_2, \mu_3$, but known variances, $\sigma_1^2, \sigma_2^2, \sigma_3^2$. If only the first sample were available, a 99 per cent confidence interval could be constructed for $\mu_1$:

$$\bar{X}_1 - 2.58\,\sigma_1/\sqrt{n_1} < \mu_1 < \bar{X}_1 + 2.58\,\sigma_1/\sqrt{n_1}, \tag{1}$$

where $\bar{X}_1$ is the sample mean, and $n_1$ the size, of the first sample. In hypothetical repetitions of the procedure, the confidence interval covers, or includes, the true value of $\mu_1$ 99 per cent of the time in the long run. [See Estimation, article on confidence intervals and regions.]

If all three samples are used, three statements like (1) can be made, successively replacing the subscript “1” by “2” and “3.” The probability that all three statements together are true, however, is not .99 but .99 x .99 x .99, or .9703.

In a coordinate system with three axes marked μ1, μ2, and μ3, the three intervals together define a 97 per cent (approximately) confidence box. This confidence box is shown in Figure 1. In order to obtain a 99 per cent confidence box (that is, to have all three statements hold simultaneously with probability .99) the confidence levels for the three individual statements must be increased. One method would be to make each individual confidence level equal to .9967, the cube root of .99.
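The arithmetic for the confidence box is easy to reproduce. The Python sketch below assumes three independent intervals with known variances, as in the text; the variable names are hypothetical, and the normal multipliers are obtained from the standard library’s `statistics.NormalDist`.

```python
from statistics import NormalDist

k = 3                     # number of independent intervals
joint_target = 0.99       # desired confidence for the whole box

joint_if_each_99 = 0.99 ** k                 # about .9703
per_interval = joint_target ** (1 / k)       # about .9967

# Two-sided normal multipliers: 2.58 for a single 99 per cent interval,
# and the larger value needed for each side of the 99 per cent box.
z_single = NormalDist().inv_cdf(1 - (1 - 0.99) / 2)
z_box = NormalDist().inv_cdf(1 - (1 - per_interval) / 2)
```

The multiplier for the 99 per cent box comes out at about 2.93, compared with 2.58 for a single interval.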

The simple two-tail test of the null hypothesis (H0) μ1 = 0 rejects it (at significance level .01) if the value 0 is not caught inside the confidence interval (1). It is natural to think of extending this test to the composite null hypothesis μ1 = 0 and μ2 = 0 and μ3 = 0 by rejecting the composite hypothesis if the point (0,0,0) is outside the confidence box corresponding to (1). The significance level of this procedure, however, is not .01 but 1 – .9703, almost .03. In order to reduce the significance level to .01, “2.58” in (1) must be replaced by a higher number. If this is done symmetrically, the significance level for each of the three individual statements like (1) must be .0033. In this argument any hypothetical values of the three means may be used in place of 0, 0, 0 to specify the null hypothesis; the point with those coordinates then takes the place of (0,0,0).

The same principles can be applied just as easily to the case where the three variances are not known but are estimated from the respective samples, in which case 1 per cent points of Student’s t-distribution take the place of 2.58. Of course, any other significance levels may also be used instead of 1 per cent.

Pooled estimate of variance. The problem considered so far is atypically simple because the three intervals are statistically independent, so that probabilities can simply be multiplied. This is no longer true if the variances are unknown but are assumed to be equal and are estimated by a single pooled estimate of variance, $\hat{\sigma}^2$, which is the sum of the three within-sample sums of squares divided by $n_1 + n_2 + n_3 - 3$. This is equal to the mean square used in the denominator of an analysis-of-variance F [see Linear hypotheses, article on analysis of variance]. The conditions

$$|\bar{X}_i - \mu_i| < M\,\hat{\sigma}/\sqrt{n_i}, \qquad i = 1, 2, 3$$

(where M is a constant to be chosen), use the same $\hat{\sigma}$ and hence are not statistically independent. Thus, the probability that all three hold simultaneously is not the product of the three separate probabilities, although this is still a surprisingly good approximation, adequate for most purposes.

Critical values, $M_\alpha$, have, however, been computed for α = .05 and .01 and for any number of degrees of freedom ($n_1 + n_2 + n_3 - 3$) of $\hat{\sigma}^2$. If $M_\alpha$ is substituted for M in the three intervals, the probability that all three conditions simultaneously hold is 1 – α (Tukey 1953).

Exactly the same principles described for the problem of estimating, or testing, three population means also apply to k means. A table providing critical values $M_\alpha$ for k = 2, 3, …, 10 and for various numbers of degrees of freedom, N – k, has been computed by Pillai and Ramachandran (1954). Part of the table is reproduced in Miller (1966). The square of $M_\alpha$ was tabulated earlier by Nair (1948a) for use in another context (see Section 7, below). This table is reproduced in Pearson and Hartley ([1954] 1966, table 19).

Notation. In the following exposition, “$\bar{X}_i$” and “$\mu_i$” represent sample and population means, respectively (i = 1, …, k), “$\sigma^2$” the population variance, generally assumed to be common to all k populations, “$\hat{\sigma}^2$” the pooled sample estimate of $\sigma^2$, and “SE” the estimated standard error of a statistic (SE will depend on “$\hat{\sigma}^2$,” on the particular statistic, and on the sample sizes involved). The symbol “∑” always denotes summation over i, from 1 to k, unless otherwise specified; N denotes $\sum n_i$, the total sample size, and “ddf” stands for “denominator degrees of freedom,” the degrees of freedom of $\hat{\sigma}^2$.

2. Treatments versus control (Dunnett)

Many studies are concerned with the difference between means rather than with the means themselves. For example, sample 1 may consist of controls (that is, observations taken under standard conditions) to be used for comparison with samples 2, 3, …, k (taken under different treatments or nonstandard conditions), for the purpose of estimating the treatment effects, $\mu_2 - \mu_1, \ldots, \mu_k - \mu_1$. For k = 3, 4, …, 10, for any number of denominator degrees of freedom, N – k, greater than 4, and for α = .05 and .01, Dunnett (1955; also in Miller 1966) has tabulated critical values $D_\alpha$ such that with probability approximately equal to 1 – α, all k – 1 statements

$$|(\bar{X}_i - \bar{X}_1) - (\mu_i - \mu_1)| < D_\alpha\,SE, \qquad i = 2, \ldots, k,$$

will be simultaneously true; that is, all k – 1 effects $\mu_i - \mu_1$ will be covered by confidence intervals centered at $\bar{X}_i - \bar{X}_1$ with half-lengths $D_\alpha SE$, where

$$SE = \hat{\sigma}\sqrt{1/n_i + 1/n_1}.$$

The overall probability is exactly 1 – α if all k sample sizes are equal. It is not the product of k – 1 probabilities (obtained from Student’s t-distribution) of the separate confidence statements, because these are not statistically independent; dependence comes not only from the common estimator of σ in all statements but also from the correlation (ρ = .5 for sample sizes roughly the same) between any two differences $\bar{X}_i - \bar{X}_1$ with $\bar{X}_1$ in common. Surprisingly enough, the product rule gives a close approximation just the same.
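The value .5 quoted for the correlation can be obtained in a few lines. The derivation below is a sketch that assumes independent samples with common variance $\sigma^2$; the indices i and j denote two different treatment samples, and $\rho_{ij}$ is a symbol introduced here for the correlation between the two differences.

$$\operatorname{cov}(\bar{X}_i - \bar{X}_1,\ \bar{X}_j - \bar{X}_1) = \operatorname{var}(\bar{X}_1) = \frac{\sigma^2}{n_1}, \qquad \operatorname{var}(\bar{X}_i - \bar{X}_1) = \sigma^2\left(\frac{1}{n_i} + \frac{1}{n_1}\right)$$

$$\rho_{ij} = \frac{1/n_1}{\sqrt{(1/n_i + 1/n_1)(1/n_j + 1/n_1)}}, \qquad \text{so that } \rho_{ij} = \frac{1/n}{2/n} = \frac{1}{2} \text{ when } n_1 = n_i = n_j = n.$$

Enlarging the control sample relative to the treatment samples makes $\rho_{ij}$ smaller than .5.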

Viewed as restrictions on the point $(\mu_1, \mu_2, \mu_3)$ in three-space, the two (pairs of) inequalities for k = 3 define a confidence region that is the intersection of the slab bounded by two parallel planes, $\mu_2 - \mu_1 = \bar{X}_2 - \bar{X}_1 \pm D_\alpha SE$, and another slab at 45° to the first slab. This is illustrated in Figure 2, where for simplicity all $n_i$ are assumed to be equal. The region is a prism that is infinite in length, is parallel to the 45° line $\mu_1 = \mu_2 = \mu_3$, and has a rhombus as its cross section.

Dunnett’s significance test rejects the null hypothesis, H0: μ2 = … = μk = μ1, in favor of the alternative hypothesis that one or more of the μi differ from μ1 if the k − 1 confidence intervals do not all contain the value 0 or, equivalently, if

|ti1| = |X̄i − X̄1| / SE ≥ Dα

for any i (i = 2, …, k). If the null hypothesis is of the less trivial form μi − μ1 = di1, where the di1 are any specified constants, then di1 is subtracted from the differences of sample means in the numerators of the ti1.

The probability of rejecting H0 if it is true, called the error rate experimentwise, is exactly the stated α if all sample sizes are equal, and is approximately α for unequal ni, provided the inequality is not gross. Dunnett (1955) showed that a design using equal ni, i = 2, …, k, but with n1 larger in about the proportion √(k − 1) : 1 is most efficient. Unfortunately this leads to true error rates exceeding the stated α if Dunnett’s table is used, and it is then safer to substitute a Bonferroni t-statistic for Dunnett’s Dα if k is as big as 6 or 10 (for Bonferroni t, see Section 14, below; see also Miller 1966, table 2).

Simultaneous one-tail tests are of the same form as (2), above, except that the absolute-value signs are removed and an appropriate smaller critical value, Dα, also tabulated in Dunnett (1955), is used. The corresponding confidence intervals are one-sided, extending to infinity on the other side.

3. All differences— Tukey method

In order to compare several means with one another rather than only with a single control, a method of Tukey’s (1953) is suitable. It provides simultaneous confidence intervals (or significance tests, if desired) for all ½k(k − 1) differences, μi − μj, among k means.

A constant, Tα, is chosen so that the probability is at least 1 − α that all ½k(k − 1) statements

|(X̄i − X̄j) − (μi − μj)| < Tα SE,

or, equivalently,

|tij| = |(X̄i − X̄j) − (μi − μj)| / SE < Tα,

will be simultaneously true. Here SE is equal to σ̂√(1/ni + 1/nj), and the simultaneous confidence intervals for the μi − μj are centered at X̄i − X̄j and have half-lengths Tα SE. In a significance test of the null hypothesis, H0, that the differences, μi − μj, have any specified (mutually consistent) values, dij (often 0), one substitutes dij for μi − μj in the t-ratios and rejects H0 if the largest ratio is not less than Tα.

The constant, Tα, is Rα/√2 = .707Rα, where Rα is the upper α-point in the distribution of the Studentized range. Table 29 of Pearson and Hartley ([1954] 1966) shows Rα for α = .10, .05, and .01, for values of k up to 20, and for any number of ddf. Briefer tables are found in Vianelli (1959) and in a number of textbooks, for example, Winer (1962). More extensive tables prepared by Harter (1960) can also be found in Miller (1966).

Geometrically, Tukey’s (1 − α)-confidence region can be obtained, for k = 3, by widening and thickening Dunnett’s prism (Figure 2) in the proportion Tα : Dα and then removing a pair of triangular prisms by intersection with a third slab. The cross section is hexagonal.

Tukey’s multiple comparisons are frequently used after an F-test rejects H0 but may also be used in place of F.

Simplified multiple t-tests. Simplified multiple t-tests, which were developed by Tukey, use the sum of sample ranges in place of σ̂ and a critical value, Tα, adjusted accordingly. (See Kurtz et al. 1965.)

4. One outlying mean (slippage)

In comparing k populations it may be desirable to find out whether one of them (which one is not specified in advance) is outstanding (has “slipped”) relative to the others. Then using k independent treatment samples one may examine the differences, X̄i − X̄, where X̄ is the mean of the combined sample.

Halperin provided critical values Hα such that with probability approximately 1 − α,

simultaneously for i = 1, …, k (Halperin et al. 1955). The probability is exactly 1 − α in the case of equal ni. This provides two-sided tests for the null hypothesis that all μi = μ̄ and simultaneous confidence intervals for all the μi − μ̄ in the usual way. In case the table is not at hand, a good approximation to the right-hand side of the inequality is (upper (α/2k)-point of Student’s t) ×

Critical values for the corresponding one-sided test, to ascertain whether one of the means has slipped in a specified direction (for example, whether it has slipped down), were first computed by Nair (1952). David (1962a; 1962b) provides improved tables. A refinement of Nair’s test and of Halperin’s is presented by Quesenberry and David (1961). In Pearson and Hartley ([1954] 1966), tables 26a and 26b (and the explanation on p. 51) pertain to these methods, whereas table 26 is Nair’s statistic.

5. Contrasts—Scheffe method

A contrast in k population means is a linear combination, Σciμi, with coefficients adding up to zero, Σci = 0. This is always equal to a multiple of the difference between weighted averages of two sets of means, that is, constant × (Σaiμi − Σbjμj), with summations running over two subsets of the subscripts (1, …, k) having no subscript in common and with Σai = 1, Σbj = 1. The simple differences, μi − μj, are special contrasts. Some other examples include contrasts representing a difference between two groups of means (for example, ⅓[μ2 + μ3 + μ5] − ½[μ1 + μ4]), or slippage of one mean (for example, μ2 − μ̄, since this is equal to {[k − 1]/k}{μ2 − [μ1 + μ3 + μ4 + … + μk]/[k − 1]}), or trend (for example, −3μ1 − μ2 + μ3 + 3μ4).
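The decomposition of a contrast into a multiple of the difference between two weighted averages can be sketched numerically. The helper below is illustrative only (the name `split_contrast` is not from the source); it assumes that the positive coefficients define one subset of means and the negative coefficients the other.

```python
def split_contrast(c, tol=1e-12):
    """Rewrite a contrast as constant * (sum a_i*mu_i - sum b_j*mu_j).

    The weights a and b are non-negative, each set sums to 1, and no
    subscript carries weight in both sets.
    """
    if abs(sum(c)) > tol:
        raise ValueError("coefficients of a contrast must sum to zero")
    constant = sum(ci for ci in c if ci > 0)
    if constant == 0:
        raise ValueError("all coefficients are zero")
    a = [ci / constant if ci > 0 else 0.0 for ci in c]
    b = [-ci / constant if ci < 0 else 0.0 for ci in c]
    return constant, a, b


# Trend contrast from the text: -3*mu1 - mu2 + mu3 + 3*mu4
constant, a, b = split_contrast([-3, -1, 1, 3])
print(constant, a, b)
```

For the trend contrast the constant is 4, so the contrast is four times the difference between a weighted average of μ3 and μ4 and a weighted average of μ1 and μ2.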

In an exploratory study to compare k means when little is known to suggest a specific pattern of differences in advance, any and all striking contrasts revealed by the data will be of interest. Also, when looking for slippage or simple differences one may wish to take account of some other, unanticipated, pattern displayed by the data.

Any of the systems of multiple comparisons discussed in sections 1–4 can be adapted to obtain tests, or simultaneous intervals, for all contrasts. For example, the k − 1 simultaneous conditions |(X̄i − X̄1) − (μi − μ1)| < Dα SE, where Dα represents the critical value of the Dunnett statistic as defined in Section 2, above, imply that every contrast, Σciμi, falls into an interval of half-length Dα SE (|c2| + … + |ck|) centered at ΣciX̄i in the case of equal sample sizes.

The following method, developed by Scheffé, however, is more efficient for all-contrasts analyses, because it yields shorter intervals for most contrasts. Scheffé proved that

√[(k − 1)F] = max |ΣciX̄i| / SE,

the largest of all the (infinitely many) Studentized contrasts, where F is the analysis-of-variance F-ratio for testing equality of all the μi, and where SE = σ̂√(Σci²/ni). Thus, with probability 1 − α, |ΣciX̄i − Σciμi| < √[(k − 1)Fα] SE for all contrasts simultaneously. Simultaneous confidence intervals for all contrasts, Σciμi, are centered at ΣciX̄i and have half-lengths √[(k − 1)Fα] SE. The confidence level is exactly the stated 1 − α, regardless of whether sample sizes are equal.

For k = 3, any particular interval can be depicted in (μ1, μ2, μ3)-space by a pair of parallel planes equidistant from the line given by μ1 − X̄1 = μ2 − X̄2 = μ3 − X̄3 through the point (X̄1, X̄2, X̄3). Together these planes constitute all the tangent planes of the cylinder (in the “variables” μ1, μ2, μ3)

where Fα has degrees of freedom 3 − 1 and N − 3. This cylinder, like the prism of Figure 2, is infinite in length and equally inclined to the coordinate axes. (As in the case of the regions for Dunnett’s and Tukey’s procedures, the addition of the same constant to each of the coordinates X1, X2, X3 of a point on the surface will move this point along the surface.) See Figure 3.

Significance test. A value of F ≥ Fα implies |ΣciX̄i| / SE ≥ √[(k − 1)Fα] for at least one contrast (namely, at least for the maximum Studentized contrast). Scheffé’s multiple comparison test declares ΣciX̄i to be statistically significant (that is, Σciμi different from zero) for all those contrasts for which the inequality is true. Thus, one may test every contrast of interest, or every contrast that looks promising, and incur a risk of just α of falsely declaring any Σciμi whatsoever to be different from zero; in other words, the probability of making no false statement of the form Σciμi ≠ 0 is 1 − α; the probability of making one or more such statements is α. Of course, the Scheffé approach gives a larger confidence interval (or decreased power) than the analogous procedure if only a single contrast is of interest.

General linear combinations. Simultaneous confidence intervals, or tests, can also be obtained for all possible linear combinations, Σciμi, with the restriction Σci = 0 lifted. Then Scheffé’s confidence and significance statements for contrasts remain applicable, except that (k − 1)Fα is changed to kFα and the numerator degrees of freedom of F are changed from k − 1 to k. (See Miller 1966, chapter 2, sec. 2.)

A confidence region for all (standardized) linear combinations consists of the ellipsoid in the k-dimensional space with axes labeled μ1, μ2, …, μk: Σni(X̄i − μi)² < kFασ̂². For k = 3, any particular interval can be depicted in (μ1, μ2, μ3)-space by a pair of parallel planes equidistant from the point (X̄1, X̄2, X̄3). Together these planes constitute all the tangent planes of the confidence ellipsoid (in the “variables” μ1, μ2, μ3).

Tukey (1953) and Miller (1966) also discuss the generalization of the application of intervals based on the Studentized range (referred to in Section 3, above) to take care of all linear combinations. Simultaneous intervals for all linear combinations can also be based on the Studentized maximum modulus (Section 1); half-lengths become (Tukey 1953).

All of these methods dealing with contrasts and general linear combinations are described in Miller (1966).

Further discussion of normal populations

6. Newman-Keuls and Duncan procedures

The Newman-Keuls procedure is a multiple comparison test for all differences. It does not provide a confidence region. The sample means are arranged and renumbered in order of magnitude, so that X̄1 < X̄2 < … < X̄k. The first step is the same as Tukey’s test; the null hypothesis is rejected or accepted according as X̄k − X̄1, the range of the sample means, is ≥ or < Tα;k SE, where Tα;k is the upper α-point of Tukey’s statistic for k means and N − k ddf.

Accepting H0 means that there is not enough evidence to establish differences between any of the population means, and the analysis is complete (all k means are then called “homogeneous”). On the other hand, if the null hypothesis is rejected, so that μk, the population mean corresponding to the largest sample mean, is declared to be different from μ1, the population mean corresponding to the smallest sample mean, the next step is to test X̄k−1 − X̄1 and X̄k − X̄2 similarly, but with Tα;k−1 in place of Tα;k (the original pooled variance estimator σ̂² and N − k ddf are used throughout). A subrange of means that is not found statistically significant is called homogeneous. As long as a subrange is statistically significant, the two subranges obtained by removing in one case its largest and in the other case its smallest X̄i are tested, using a critical value Tα;h, where h is the number of means left in the new subranges. Testing is limited, however, by the rule that every subrange contained in a homogeneous range of means is not tested but is automatically declared to be homogeneous. The result of the whole procedure is to group the means into homogeneous sets, which may also be represented diagrammatically by connecting lines, as in the example presented in Section 10, below.

Critics of the Newman–Keuls method object that the error probabilities, such as that of falsely declaring μ2 ≠ μ5, are not even known in this test; its supporters, however, argue that power should not be wasted by judging subranges by the same stringent criterion used for the full range of all k sample means.

Duncan (1955) goes a step further, arguing that even Tα;h is too stringent a criterion because the differences between h means have only h − 1 degrees of freedom. He concludes that Tγ;h should be used instead, where 1 − γ = (1 − α)^(h−1). This further increases the power and, with it, the effective type I error probability. For a study of error rates of Tukey, Newman–Keuls, Duncan, and Student tests, see Harter (1957).
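As a small numerical sketch of how quickly Duncan's per-range level grows, the function below evaluates γ from 1 − γ = (1 − α)^(h−1). The function name is illustrative, and the relation is the standard form of Duncan's protection level, assumed here to be the one intended in the text.

```python
def duncan_level(alpha, h):
    """Significance level gamma applied to a subrange of h means,
    defined by 1 - gamma = (1 - alpha) ** (h - 1)."""
    if h < 2:
        raise ValueError("a range needs at least two means")
    return 1.0 - (1.0 - alpha) ** (h - 1)


for h in (2, 3, 4, 5, 6):
    print(h, round(duncan_level(0.05, h), 4))
```

For h = 2 the level is α itself; for wider ranges it exceeds α, which is why Duncan's critical values are smaller than the corresponding Tukey values.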

7. General Model I design

The F-test in the one-way analysis of variance and the multiple comparison methods already discussed are based on the fact that ddf times σ̂²/σ² has a chi-square distribution and is independent of the sample means. This condition is also satisfied by the residual variance used in randomized blocks, factorial designs, Latin squares, and all Model I designs. Therefore, all these designs permit the use of the methods, and tables, of sections 1–6, to compare the means defined by any one factor, provided that these are independent.

In certain instances of nonparametric multiple comparisons and in certain instances of multiple comparisons of interactions in balanced factorial designs, where the (adjusted or transformed) observations are not independent but equicorrelated, the multiple comparison methods of sections 2–6 still apply: The use of the adjusted error variance, (1 − ρ)σ̂², to compute standard errors fully compensates for the effect of equal correlations (see Tukey 1953; Scheffé 1953; Miller 1966, pp. 41–42, 46–47). Scheffé’s method can also be adapted for use with unequal correlations (see Miller 1966, p. 53).

When several factors, and perhaps some interactions, are t-tested in the same experiment, the question arises whether extra adjustment should not be made for the resulting additional compounding of error probabilities. One method open to an experimenter willing to sacrifice power for strict experimentwise control of type I error is the conservative one of using error rates per t-test of α/(number of t-tests contemplated), that is, using Bonferroni t-statistics (see Section 14). For experimentwise control of error rates in the special case of a 2^r factorial design, Nair (1948a) has tabulated percentage points of the largest of r independent χ²’s with one degree of freedom, divided by an independent variance estimator (Pearson & Hartley [1954] 1966, table 19). The statistic is equal to the square of the Studentized maximum modulus introduced in Section 1.
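The Bonferroni adjustment can be sketched as follows. This is a minimal illustration that assumes ddf = ∞, so that Student's t is replaced by the standard normal distribution; the function name and the number of tests used in the example are hypothetical.

```python
from statistics import NormalDist


def bonferroni_z(alpha, m, two_sided=True):
    """Critical value for each of m tests at per-test level alpha / m.

    Assumes an effectively infinite number of denominator degrees of
    freedom, so the normal distribution stands in for Student's t.
    """
    per_test = alpha / m
    tail = per_test / 2 if two_sided else per_test
    return NormalDist().inv_cdf(1.0 - tail)


# Hypothetical case: six t-tests with experimentwise alpha = .05
print(round(bonferroni_z(0.05, 6), 3))
```

With finite ddf the same per-test level would be looked up in a table of Student's t, which gives a somewhat larger critical value.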

8. An example—juxtaposition of methods

Three competing theories about how hostility evoked in people by willfully imposed frustration may be diminished led Rothaus and Worchel (1964) to goad 192 experimental subjects into hostility by unfair administration of a test of coordination and then to apply the following “treatments” to four groups, each composed of 48 subjects: (1) no treatment (control); (2) fair readministration of the test, seemingly as a result of a grievance procedure (instrumental communication); (3) an opportunity for verbal expression of hostility (catharsis); (4) conversation to the effect that the test was unfair and the result therefore not indicative of failure on the subjects’ part (ego support). After treatment all subjects were given another (fair) test of coordination. Each treatment group was subdivided into three subgroups, a different experimenter working with each subgroup. All subjects had been given Behavioral Items for Hostility Scales (BIHS) three weeks before the experiment.

Table 1 – Analysis of variance of hostility scores

| Source | df | Mean square | F-ratio |
| --- | --- | --- | --- |
| 4 treatments | 3 | 369.77 | 3.38* |
| 3 subgroups | 2 | 151.31 | 1.38 |
| 2 sexes | 1 | 41.14 | 0.38 |
| 2 BIHS levels | 1 | 2.68 | 0.02 |
| All the interactions, none of them statistically significant | 40 | | |
| 4 replications (nested) | 144 | 109.54 = σ̂² | |

* Denotes statistical significance at the 5 per cent level.

The experimental plan was factorial: 4 treatments x 3 subgroups x 2 sexes x 2 BIHS score groups (high versus low) x 4 replications. The study variable, X, was hostility measured on the Social Sensitivity Scale at the end of the experiment.

The sample means (unordered) for the four treatment groups were x̄1 = 47.08, x̄2 = 42.00, x̄3 = 48.53, x̄4 = 45.40.

In fact, the numbers in Table 1 reflect an analysis of covariance. The mean squares shown are adjusted mean squares, the sample means are adjusted means, and σ̂² has 143 df. But for the sake of simplicity of interpretation the data will be treated as if they had come from a 4 × 3 × 2 × 2 factorial analysis of variance. The estimated standard error for differences between two means, SE, is √(2σ̂²/48) = √(2 × 109.54/48) = 2.136.

Dunnett comparisons. The Dunnett method, with α = .05, would be applied to the data of the experiment as analyzed in Table 2. As indicated in Table 2, the one-tail test in the direction of the theory (H1) under study declares μ2 to be less than μ1. Thus, the conclusion, if the one-sided Dunnett test and the 5 per cent significance level are adopted, is that instrumental communication reduces hostility but that the evidence does not confirm any reduction due to ego support or catharsis. If the two-tail test had been chosen, allowing for a possible increase in hostility due to treatment, the conclusion would be that there is insufficient evidence to reject the null hypothesis.

Table 2 – Dunnett comparisons of control with three treatments, α = .05

| Pair | X̄i − X̄j | Two-sided test (Dα = 2.40) | Two-sided confidence interval (half-length = 2.136Dα = 5.13) | One-sided test (Dα = 2.08) | One-sided confidence interval (lower length = 2.136Dα = 4.44) |
| --- | --- | --- | --- | --- | --- |
| (1)–(2) | 5.08 | 2.38 (near significance) | (−0.05, 10.21) | * | (0.64, ∞) |
| (1)–(3) | −1.45 | −0.68 | (−6.58, 3.68) | | (−5.89, ∞) |
| (1)–(4) | 1.68 | 0.79 | (−3.45, 6.81) | | (−2.76, ∞) |

* Statistically significant at the 5 per cent level; all other comparisons do not reach statistical significance at the 5 per cent level.

Table 3 – All pairs, by Tukey and by Scheffé method, α = .05

| Pair | X̄i − X̄j | Tukey test (Tα = 2.60) | Tukey confidence interval (half-length = 2.136Tα = 5.55) | Scheffé test | Scheffé confidence interval |
| --- | --- | --- | --- | --- | --- |
| (1)–(2) | 5.08 | 2.38 | (−0.47, 10.63) | | (−0.90, 11.06) |
| (1)–(3) | −1.45 | −0.68 | (−7.00, 4.10) | | (−7.43, 4.53) |
| (1)–(4) | 1.68 | 0.79 | (−3.87, 7.23) | | (−4.30, 7.66) |
| (2)–(3) | −6.53 | −3.06* | (−12.08, −0.98) | * | (−12.51, −0.55) |
| (2)–(4) | −3.40 | −1.59 | (−8.95, 2.15) | | (−9.38, 2.58) |
| (3)–(4) | 3.13 | 1.47 | (−2.42, 8.68) | | (−2.85, 9.11) |

* Statistically significant at the 5 per cent level; all other pairs do not reach statistical significance at the 5 per cent level.
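The Dunnett comparisons of this example can be retraced with a short calculation. The sketch below takes the treatment means, the residual mean square, and the two critical values (2.40 two-sided, 2.08 one-sided) as quoted in the article; the variable and function names are illustrative.

```python
import math

means = {1: 47.08, 2: 42.00, 3: 48.53, 4: 45.40}  # adjusted treatment means
n = 48                       # subjects per treatment group
var_hat = 109.54             # residual mean square (Table 1)
d_two, d_one = 2.40, 2.08    # Dunnett critical values quoted in Table 2

# Standard error of a difference between two means with equal n
se = math.sqrt(var_hat * (1 / n + 1 / n))


def dunnett_rows(means, control=1):
    """Difference, t-ratio and intervals for control minus each treatment."""
    rows = {}
    for i, m in means.items():
        if i == control:
            continue
        diff = means[control] - m
        rows[i] = {
            "diff": diff,
            "t": diff / se,
            "two_sided": (diff - d_two * se, diff + d_two * se),
            "one_sided_lower": diff - d_one * se,  # interval runs to +infinity
        }
    return rows


rows = dunnett_rows(means)
for i, r in rows.items():
    lo, hi = r["two_sided"]
    print(f"(1)-({i}): diff={r['diff']:.2f} t={r['t']:.2f} "
          f"two-sided=({lo:.2f}, {hi:.2f}) one-sided=({r['one_sided_lower']:.2f}, inf)")
```

Only the one-sided interval for (1)–(2) excludes zero, which matches the single starred entry in the table.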

All pairs: Tukey and Scheffé methods. A comparison of all possible pairs of means by the methods of Tukey and Scheffé is shown in Table 3. The tests of Tukey and Scheffé in this case both discount x̄1 − x̄3 but declare x̄2 − x̄3 “significant.” The conclusion is that instrumental communication leaves the mean hostility of frustrated subjects lower than ego support does, but no other difference is established; specifically, neither test would conclude that instrumental communication actually reduces hostility as compared with no treatment or that ego support increases (or reduces) it.

In addition to the simple differences, the data suggest testing a contrast related to the alternate hypothesis µ2 < µ4 < µ1 < µ3, for example, the contrast −3µ2 − µ4 + µ1 + 3µ3. (It is legitimate, for these procedures, to choose such a contrast after inspecting the data.) For the present example, −3x̄2 − x̄4 + x̄1 + 3x̄3 = 21.27; the SE for a Scheffé test is √(σ̂²Σci²/48) = √(109.54 × 20/48) = 6.755, and t = 21.27/6.755 = 3.15, statistically significant at the 5 per cent level (3.15 > 2.80). The conclusion is that µ2 ≤ µ4 ≤ µ1 ≤ µ3 with at least one strict inequality holding. A Scheffé 95 per cent confidence interval for −3µ2 − µ4 + µ1 + 3µ3 is (2.36, 40.18). A Tukey test would not find this contrast statistically significant. For this analysis 2.136 is multiplied by ½(|−3| + |−1| + |1| + |3|) = 4, instead of by √10 = 3.162, yielding 8.544. Thus, in this case t = 21.27/8.544 = 2.49, which is less than 2.60, and a confidence interval is (−0.94, 43.48).
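The two treatments of the trend-like contrast can be compared side by side in a few lines. This sketch uses the means, the residual mean square, and the 5 per cent critical values (2.80 for Scheffé, 2.60 for Tukey) quoted in the article; the variable names are illustrative.

```python
import math

means = [47.08, 42.00, 48.53, 45.40]    # treatments (1) to (4)
c = [1, -3, 3, -1]                      # -3*mu2 - mu4 + mu1 + 3*mu3
n, var_hat = 48, 109.54
scheffe_crit, tukey_crit = 2.80, 2.60   # values quoted in the text

estimate = sum(ci * m for ci, m in zip(c, means))

# Scheffe: standard error of the contrast itself
se_scheffe = math.sqrt(var_hat * sum(ci ** 2 for ci in c) / n)

# Tukey: SE of a simple difference times half the sum of |c_i|
se_pair = math.sqrt(2 * var_hat / n)
se_tukey = se_pair * 0.5 * sum(abs(ci) for ci in c)

t_scheffe = estimate / se_scheffe
t_tukey = estimate / se_tukey

print(f"estimate = {estimate:.2f}")
print(f"Scheffe: SE = {se_scheffe:.3f}, t = {t_scheffe:.2f}, "
      f"significant = {t_scheffe >= scheffe_crit}")
print(f"Tukey:   SE = {se_tukey:.3f}, t = {t_tukey:.2f}, "
      f"significant = {t_tukey >= tukey_crit}")
```

The Scheffé multiplier is larger than Tukey's, but its standard error for this many-term contrast is enough smaller that only the Scheffé test reaches significance.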

The SE for individual x̄i, also used in slippage statistics, is √(σ̂²/48) = √(109.54/48) = 1.51. For k = 4, M.05 = 2.50, and simultaneous confidence intervals for the four µi are centered at the x̄i and have half-lengths 2.50 × 1.51 = 3.78.

The 5 per cent critical value tabulated by Halperin et al. for two-sided slippage tests is 2.23; thus |x̄2 − x̄|/1.51 = 2.48 is statistically significant, whereas the other three t-ratios for slippage are not. The conclusion of this test would be that mean hostility after instrumental communication is low compared with that after other treatments; no other treatment can be singled out as leaving hostility either low or high compared with that after other treatments.

An example of Newman–Keuls and Duncan tests is given in Section 10, below.

Other multiple comparison methods

9. Nonparametric multiple comparisons

The multiple comparison approach has been articulated with nonparametric (or distribution-free) methods in several ways [for background see Nonparametric statistics].

For example, one of the simplest nonparametric tests is the sign test. Suppose that an experiment concerning techniques for teaching reading deals with school classes and that each class is divided in half at random. One half is taught by method 1, the other by method 2, the methods being allocated at random. Suppose further that improvement in average score on a reading test after two months is the basic observation but that one chooses to consider only whether the pupils taught by method 1 gain more than the pupils taught by method 2, or vice versa, and not the magnitude of the difference. If C is the number of classes for which the pupils taught with method 1 have a larger average gain than those taught with method 2, then the (two-sided) sign test rejects the null hypothesis of equal method effect when the absolute value of C is larger than a critical value. The critical value comes simply from a symmetrical binomial distribution.
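The two-sample sign test described here can be sketched with an exact binomial calculation. The function name and the counts in the example are hypothetical, and the sketch assumes that tied classes have been dropped so that every class favors one method or the other.

```python
from math import comb


def sign_test_p(c, n):
    """Exact two-sided sign-test P-value.

    c is the number of classes in which method 1 shows the larger gain,
    out of n untied classes; under the null hypothesis c is binomial
    with success probability 1/2.
    """
    tail = min(c, n - c)
    p_tail = sum(comb(n, j) for j in range(tail + 1)) / 2 ** n
    return min(1.0, 2 * p_tail)


# Hypothetical outcome: method 1 ahead in 16 of 20 classes
p = sign_test_p(16, 20)
print(round(p, 4))
```

Because the null distribution is symmetric, doubling the smaller tail gives the two-sided P-value directly.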

Suppose now that there are k teaching methods, where k might be 3 or 4, and the classes are each divided at random into k groups and assigned to methods. Let Cij (i ≠ j) be the number of classes for which the average gain in reading-test score for the group taught by method i is greater than that for the group taught by method j. Each Cij taken separately has (under the null hypothesis that the corresponding two methods are equally effective) a symmetric binomial distribution which is approximated asymptotically by where n is the number of classes, z is a standard normal variable, and 1/2 is a continuity correction. But to test for the equality of all k methods, the largest |Cij| should be used. The critical values of this statistic may be approximated by where Tα is the upper α point for Tukey’s statistic with k groups and ddf = ∞.

The same procedure is feasible for other two-sample test statistics, for example, rank sums. An analogous method works for comparing k − 1 treatments with a control; in the teaching-method experiment, if method 1 were the control, this would mean using as the test statistic the maximum over j ≠ 1 of |C1j| (or of C1j in the one-sided case). For a discussion of this material, see Steel (1959).

Joint nonparametric confidence intervals may sometimes be obtained in a similar way. Given a confidence interval estimation procedure related to any two-sample test statistic with critical value Sα (see Moses 1953; 1965), the same procedure with Sα replaced by its multiple comparison analogue Cα yields confidence intervals with a joint confidence level of 1 − α.

A second class of nonparametric multiple comparison tests arises by analogy with normal theory analysis of variance for the one-way classification and other simple designs [see Linear hypotheses, article on analysis of variance]. The procedures start by transforming the observations into ranks or other kinds of simplified scores (except that the so-called permutation tests leave the observations unaltered). The analysis is conditional on the totality of scores and uses as its null distribution that obtained from random allocations of the observed scores to treatments. The test statistic may be the ordinary F-ratio on the scores, but modified so that the denominator is the exact overall variance of the given scores. This statistic’s null distribution is approximately F, with k − 1 and ∞ as degrees of freedom (where k is the number of treatments), or, equivalently, k − 1 times the F-test statistic has as approximate null distribution the chi-square distribution with k − 1 degrees of freedom. Similar adaptations hold for the Tukey test statistic and others. The approach may also be extended to randomized block designs; in another direction, the approach may be extended to compare dispersion, rather than location. Discussions of this material are given by Nemenyi (1963) and Miller (1966, chapter 2, sec. 1.4, and chapter 4, sec. 7.5).

A difficulty with these test procedures is that confidence sets cannot generally be obtained in a straightforward way.

A third nonparametric approach to multiple comparisons is described by Walsh (1965, pp. 535–536). The basic notion applies when there are a number of observations for each treatment or treatment combination. Such a set of observations is divided into several subsets; the average of each subset is taken. These averages are then treated by normal theory procedures of the kind discussed earlier.

For convenient reference, a few 5 per cent and 1 per cent critical points of multiple comparison statistics with ddf = ∞ are listed in Table 4.

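Table 4 – Selected 5 per cent and 1 per cent critical points of multiple comparison statistics with ddf = ∞

5 per cent level

| k | Dunnett (k − 1 versus one), one-tail | Dunnett (k − 1 versus one), two-tail | Tukey (all pairs) | Nair-Halperin (outlier tests), one-tail | Nair-Halperin (outlier tests), two-tail | Scheffé | Duncan |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 2 | 1.64 | 1.96 | 1.96 | 1.39 | 1.39 | 1.96 | 1.96 |
| 3 | 1.92 | 2.21 | 2.34 | 1.74 | 1.91 | 2.45 | 2.06 |
| 4 | 2.06 | 2.35 | 2.57 | 1.94 | 2.14 | 2.80 | 2.13 |
| 5 | 2.16 | 2.44 | 2.73 | 2.08 | 2.28 | 3.08 | 2.18 |
| 6 | 2.23 | 2.51 | 2.85 | 2.18 | 2.39 | 3.33 | 2.23 |

1 per cent level

| k | Dunnett (k − 1 versus one), one-tail | Dunnett (k − 1 versus one), two-tail | Tukey (all pairs) | Nair-Halperin (outlier tests), one-tail | Nair-Halperin (outlier tests), two-tail | Scheffé | Duncan |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 2 | 2.33 | 2.58 | 2.58 | 1.82 | 1.82 | 2.58 | 2.58 |
| 3 | 2.56 | 2.79 | 2.91 | 2.22 | 2.38 | 3.03 | 2.68 |
| 4 | 2.68 | 2.92 | 3.11 | 2.43 | 2.61 | 3.37 | 2.76 |
| 5 | 2.77 | 3.00 | 3.25 | 2.57 | 2.76 | 3.64 | 2.81 |
| 6 | 2.84 | 3.06 | 3.36 | 2.68 | 2.87 | 3.88 | 2.86 |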
Table 5 – Frequency of church attendance of scientists in four different fields

| Church attendance | (1) Chemical engineers | (2) Physicists | (3) Zoologists | (4) Geologists | Combined sample | Score u |
| --- | --- | --- | --- | --- | --- | --- |
| Never | 44 | 65 | 66 | 72 | 247 | −1 |
| Not often | 38 | 19 | 21 | 30 | 108 | 0 |
| Often | 52 | 46 | 49 | 38 | 185 | 1 |
| Very often | 33 | 29 | 19 | 17 | 98 | 2 |
| Sample size, ni | 167 | 159 | 155 | 157 | N = 638 | |
| Ti = Σ frequency × u | 74 | 39 | 21 | 0* | T = 134 | |
| ūi = Ti/ni | 0.443 | 0.245 | 0.136 | 0.000 | ū = 0.210 | |
| 1/ni | 0.005988 | 0.006289 | 0.006452 | 0.006369 | 1/N = 0.001567 | |

* It is purely accidental that T4 exactly equals 0.
Source: Vaughan et al. 1966.

10. An example

As an illustration of some distribution-free multiple comparison methods, consider the following data from Vaughan, Sjoberg, and Smith (1966), who sent questionnaires to a sample of scientists listed in American Men of Science in order to compare scientists in four different fields with respect to the role that traditional religion plays in their lives. Table 5 summarizes responses to the question about frequency of church attendance and shows some of the calculations.

Using the data of Table 5, illustrative significance tests of the null hypothesis of four identical population distributions, against various alternatives, will be performed at the 1 per cent level.

The method of Yates (1948) begins by assigning ascending numerical scores, u, to the four ordered categories; arithmetically convenient scores, as shown in the last column of Table 5, are −1, 0, 1, and 2. Sample totals of scores are calculated (for example, T1 = 44(−1) + 38(0) + 52(1) + 33(2) = 74), and the average score for sample i is ūi = Ti/ni. From the combined sample (margin) Yates computes an average score, ū = T/N = .210, and the variance of scores,

σ² = [N/(N − 1)][Σ(frequency × u²)/N − ū²], summed over the four categories,

giving (638/637){[247(1) + 108(0) + 185(1) + 98(4)]/638 − .210²} = 1.2494.

Yates then computes a variance between means, Σ(Ti²/ni) − T²/N = 17.05, and the critical ratio used is either F = (17.05/3)/1.2494 or χ² = 17.05/1.2494 = 13.7. The second of these is referred to a table of chi-square with 3 df and found significant at the 1 per cent level (in fact, P = .0034).
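Yates's calculation can be retraced from the frequencies of Table 5. The sketch below is illustrative: the variable names are not from the source, and the P-value uses the closed-form upper tail of the chi-square distribution with 3 degrees of freedom, so no statistical library is needed.

```python
import math

scores = [-1, 0, 1, 2]   # Never, Not often, Often, Very often
freq = {
    "chemical engineers": [44, 38, 52, 33],
    "physicists": [65, 19, 46, 29],
    "zoologists": [66, 21, 49, 19],
    "geologists": [72, 30, 38, 17],
}

n = {g: sum(f) for g, f in freq.items()}
T = {g: sum(u * x for u, x in zip(scores, f)) for g, f in freq.items()}
N, T_all = sum(n.values()), sum(T.values())
u_bar = T_all / N

# Variance of the scores in the combined sample (margin)
margin = [sum(f[j] for f in freq.values()) for j in range(len(scores))]
mean_sq = sum(m * u * u for m, u in zip(margin, scores)) / N
variance = N / (N - 1) * (mean_sq - u_bar ** 2)

# Between-samples sum of squares and the chi-square ratio
between = sum(T[g] ** 2 / n[g] for g in freq) - T_all ** 2 / N
chi2 = between / variance


def chi2_sf_3df(x):
    """Upper-tail probability of chi-square with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


p_value = chi2_sf_3df(chi2)
print(T, round(variance, 4), round(between, 2), round(chi2, 2), round(p_value, 4))
```

Dividing `between` by 3 before taking the ratio gives the F form of the same statistic.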

It follows that some contrasts must be statistically significant. The almost linear progression of the sample mean scores suggests calculating the contrast 3ū1 + ū2 − ū3 − 3ū4 = 1.438. For the denominator, (3²/167 + 1²/159 + 1²/155 + 3²/157)σ² = .1240 × 1.2494 = .1549, so that χ² = 1.438²/.1549 = 13.35, or its square root, z = 3.65. (This comes close to the value √13.7 = 3.70 of the largest standardized contrast; see Section 5.) When 3.65 is referred to the Scheffé table (in Table 4, above) for k = 4, or when 13.35 is referred to a table of chi-square with 3 df, each is found to be statistically significant (in fact, P = .0040). The conclusion that can be drawn from this one-sided test for trend is that the population mean scores are ordered μ̄1 ≥ μ̄2 ≥ μ̄3 ≥ μ̄4 with at least one strict inequality holding. Had a trend in this particular order been predicted ahead of time and postulated as the sole alternative hypothesis to be considered, z = 3.65 could have been judged by the normal table, yielding P = .00013. The two-tail version of this test is Yates’s one-degree-of-freedom chi-square for trend (1948).
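The trend contrast can be checked numerically from the summary rows of Table 5. This sketch takes the score totals, sample sizes, and score variance (1.2494) from the article and assumes the linear trend coefficients 3, 1, −1, −3; the variable names are illustrative.

```python
import math
from statistics import NormalDist

T = [74, 39, 21, 0]            # score totals, fields (1) to (4)
n = [167, 159, 155, 157]       # sample sizes
variance = 1.2494              # variance of scores in the combined sample
c = [3, 1, -1, -3]             # linear trend coefficients (sum to zero)

u_bar = [t / m for t, m in zip(T, n)]
estimate = sum(ci * u for ci, u in zip(c, u_bar))
var_est = variance * sum(ci ** 2 / m for ci, m in zip(c, n))
z = estimate / math.sqrt(var_est)

# One-tail normal P-value, appropriate only if this trend was the sole
# alternative specified in advance
p_one_tail = 1 - NormalDist().cdf(z)

print(round(estimate, 3), round(var_est, 4), round(z, 2), f"{p_one_tail:.5f}")
```

When the contrast is chosen after looking at the data, z is compared with the Scheffé critical value for k = 4 (3.37 at the 1 per cent level) rather than with the normal table.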

Another contrast that may be tested is the simple difference ū1 − ū4 = .443 − .000. Here SE = .1243, and z14 = .443/.1243 = 3.57. Because it is greater than 3.37, this contrast is statistically significant. Similarly, z13 = (.443 − .136)/.1239 = 2.48, but this is not significant at the 1 per cent level, and the other simple differences are still smaller.

If Tukey’s test had been adopted instead of Scheffé’s, the same ratios would be compared with the critical value 3.11 (k = 4, α= .01). The conclusions would be the same in the present case. Tukey’s method could also be used to test other contrasts.

In the present example, the Newman-Keuls procedure would also have led to the same conclusions about simple differences: z14 = 3.57 is called significant because it is greater than 3.11; then z13 (which equals 2.48) and z24 (which is still smaller) are compared with 2.91 and found “not significant,” and the procedure ends. The conclusions may be summarized as follows:

.443 .245 .136 .000,

where the absence of a line connecting ū1 with ū4 signifies that ̄μ1 and ̄μ4 are declared unequal. It may be argued that a conclusion of the form “A, B, and C homogeneous, B, C, and D homogeneous, but A, B, C, and D not homogeneous” is self-contradictory. This is not necessarily the case if the interpretation is the usual one that A, B, and C may be equal (not enough evidence to prove them unequal) and B, C, and D may be equal, but A and D are not equal.

In Duncan’s procedure the critical value 3.11 used in the first stage would be replaced by 2.76 (see Table 4), and the critical value 2.91 used at the second stage (k = 3) would be replaced by 2.68. Since 3.57 > 2.76 but 2.48 < 2.68, Duncan’s test leads to the same conclusion in the present example as the Newman-Keuls procedure.
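The step-down logic shared by the Newman-Keuls and Duncan procedures can be sketched as follows. The function is an illustration rather than a reference implementation: it takes a table `crit` of critical values indexed by the number h of means in a subrange (the 1 per cent values for ddf = ∞ quoted in the article) and applies the rule that subranges inside a homogeneous range are not tested.

```python
import math


def step_down(means, sizes, variance, crit):
    """Step-down range test of the Newman-Keuls type.

    means, sizes: dicts keyed by group label; crit[h] is the critical
    value for a subrange of h ordered means. Returns two lists of
    (lowest, highest) label pairs: significant and homogeneous ranges.
    """
    labels = sorted(means, key=means.get)        # ascending sample means
    k = len(labels)
    significant, homogeneous = [], []            # ranges as index pairs
    for h in range(k, 1, -1):                    # widest ranges first
        for lo in range(k - h + 1):
            hi = lo + h - 1
            if any(a <= lo and hi <= b for a, b in homogeneous):
                continue                         # inside a homogeneous range
            i, j = labels[lo], labels[hi]
            se = math.sqrt(variance * (1 / sizes[i] + 1 / sizes[j]))
            z = (means[j] - means[i]) / se
            (significant if z >= crit[h] else homogeneous).append((lo, hi))

    def as_labels(ranges):
        return [(labels[a], labels[b]) for a, b in ranges]

    return as_labels(significant), as_labels(homogeneous)


sizes = {1: 167, 2: 159, 3: 155, 4: 157}
totals = {1: 74, 2: 39, 3: 21, 4: 0}
means = {g: totals[g] / sizes[g] for g in sizes}
variance = 1.2494

nk_crit = {4: 3.11, 3: 2.91, 2: 2.58}        # Tukey column of Table 4
duncan_crit = {4: 2.76, 3: 2.68, 2: 2.58}    # Duncan column of Table 4

nk = step_down(means, sizes, variance, nk_crit)
duncan = step_down(means, sizes, variance, duncan_crit)
print(nk)
print(duncan)
```

Both runs declare only the full range, fields (4) to (1), significant and leave the two three-mean subranges homogeneous, in agreement with the conclusions in the text.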

A Halperin outlier test would use max |ūi − ū|, in this case .443 − .210 = .233, divide it by its standard error, and compare the resulting ratio, 2.72, with the critical value, 2.61 (k = 4, 1 per cent level). The next largest ratio is (.210 − .000)/.08944 = 2.35. The conclusion is that chemical engineers tend to report more frequent church attendance than the other groups, but nothing can be said about geologists. If the outlier contrasts had been tested as part of a Scheffé test for all contrasts, none of them would have been found significant at the 1 per cent level (critical value 3.37) or even at the 5 per cent level.

What would happen if unequally spaced scores had been used instead of −1, 0, 1, 2 to quantify the four degrees of religious loyalty? In fact, Vaughan and his associates described the ordered categories not verbally but as frequency of church attendance per month grouped into 0, 1, 2–4, 5+. Although we do not know whether frequency of church attendance is a linear measure of the importance of religion in a person’s life, the scores (0, 1, 3, 6) could reasonably have been assigned. In the present case this would lead to essentially the same conclusions that the other scoring led to: The mean scores become 2.35, 2.08, 1.82, and 1.57, Yates’s χ² changes from 13.7 to 12.5, the standardized contrast for trend changes only slightly, z14 changes from 3.57 to 3.36, and z13 and z24 again have values too small for statistical significance by Tukey’s criterion or by Newman-Keuls’.

A fundamentally different assignment of scores (for example, 1, 0, 0, 1) would be used to test for differences in spread. It yields sample means, ūi, of .461, .591, .548, .576, ū = 0.541, and a variance, σ², of .2484. Yates’s analysis-of-variance χ² is 1.449/0.2484, that is, only 5.83, so P = .12. Thus, no contrast is called significant in a Scheffé test (or, it turns out, in any other multiple comparison test at the 1 per cent significance level). In the present example these tests for spread are unreliable, because the presence of sample location differences, noted above, can vitiate the results of the test for differences in spread.

Throughout the numerical calculations in this section, the continuity correction has been neglected. In the case of unequal sample sizes it is difficult to determine what continuity correction would yield the most accurate results, and the effect of the adjustment would be slight anyway. When sample sizes are equal, the use of |Ti − Tj| reduced by a continuity correction, in place of |Ti − Tj| itself, is recommended, as it frequently (although not invariably) improves the fit of the asymptotic approximation used.

11. Comparisons for differences in scale

Standard multiple comparisons of variances of k normal populations, by Cochran (1941), David (1952), and others, utilize ratios of the sample variances. These methods should be used with caution, because they are ultrasensitive to slight nonnormality.

Distribution-free multiple comparison tests for scale differences are also available. Any rank test may be used with a Siegel-Tukey reranking [see Nonparametric Statistics, article on Ranking Methods]. Such methods, too, require caution, because any sizable location differences may masquerade as differences in scale, especially in a joint ranking of all k samples (Moses 1963).

Safer methods, although with efficiencies of only about 50 per cent for normal distributions, are adaptations of some tests by Moses (1963). In these tests a small integer, s, such as 2 or 3, is chosen, and each sample is randomly subdivided into subgroups of s observations. Let y be the range or variance of a subgroup. Then any multiple comparison tests may be applied to the k samples of y’s (or log y’s), at the sacrifice of between-subgroups information. The effective sample sizes have been reduced to [ni/s]; if these are small (about 6), either a nonparametric test or, at any rate, log y’s should be used. (Some nonparametric multiple comparison tests, such as the median test, have no power, that is, they cannot possibly reject the null hypothesis, at significance levels such as .05 with small samples. But rank tests can be used with several samples as small as 4 or 5.)
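The random subdivision step can be sketched as follows. This is only an illustration under stated assumptions: y is taken to be the subgroup variance, leftover observations beyond a multiple of s are discarded, no subgroup consists of identical values (so the logarithm exists), and the two samples are simulated, hypothetical data. The function name is not from the source.

```python
import math
import random
import statistics

def subgroup_log_variances(sample, s=3, rng=None):
    """Randomly split `sample` into subgroups of s observations and return
    log(variance) for each subgroup; floor(len(sample) / s) values result."""
    rng = rng or random.Random()
    shuffled = list(sample)
    rng.shuffle(shuffled)                      # random subdivision
    n_groups = len(shuffled) // s              # effective sample size [n_i / s]
    logs = []
    for g in range(n_groups):
        group = shuffled[g * s:(g + 1) * s]
        logs.append(math.log(statistics.variance(group)))
    return logs

rng = random.Random(1)   # fixed seed so the subdivision is repeatable
samples = {
    "A": [rng.gauss(0, 1) for _ in range(18)],   # hypothetical data
    "B": [rng.gauss(0, 3) for _ in range(18)],   # hypothetical data, larger scale
}
log_y = {name: subgroup_log_variances(obs, s=3, rng=rng) for name, obs in samples.items()}
for name, values in log_y.items():
    print(name, len(values), round(statistics.mean(values), 2))
```

The resulting k samples of log y’s would then be passed to whichever multiple comparison test is chosen.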

12. Multiple comparisons of proportions

A simultaneous test for all differences between k proportions p1, …, pk, based on large samples, can be obtained by comparing

with a critical value of Tukey’s statistic (Section 3), where Xi, i = 1,..., k, denotes the number of “successes” in sample i and X = ∑Xi. Analogous asymptotic tests can be used for comparison of several treatments with a control and other forms of multiple comparisons. If X/N is small, the sample sizes must be very large for this asymptotic approximation to be adequate. (For a similar method see Ryan 1960.)

Small-sample multiple comparison tests of proportions may be carried out by transforming the counts into normal variables with known equal variances and then applying any test of sections 1–7 to these standardized variables (using ∞ ddf). [See Statistical Analysis, Special Problems of, article on Transformations of Data; see also Siotani & Ozawa 1958.]

A (1 − α)-confidence region for k population proportions is composed of a (1 − α)^(1/k)-confidence interval for each of them. Simultaneous confidence intervals for a set of differences of proportions may be approximated by using Bonferroni’s inequality (see Section 14). For a discussion of confidence regions for multinomial proportions, see Goodman (1965).
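As a rough illustration of how the per-interval confidence level depends on k, the sketch below compares two choices. It assumes k independent samples, so that intervals each at level (1 − α)^(1/k) jointly cover with probability 1 − α, and contrasts this with the more conservative Bonferroni level 1 − α/k. The function name is hypothetical.

```python
def per_interval_levels(alpha, k):
    """Return (level for independent intervals, Bonferroni level) so that
    k simultaneous intervals have joint confidence at least 1 - alpha."""
    exact = (1.0 - alpha) ** (1.0 / k)    # assumes the k samples are independent
    bonferroni = 1.0 - alpha / k          # valid without independence, slightly conservative
    return exact, bonferroni

for k in (2, 4, 10):
    exact, bonferroni = per_interval_levels(0.05, k)
    print(k, round(exact, 5), round(bonferroni, 5))
```

The two levels differ very little for small α, which is why the Bonferroni approximation is serviceable for differences of proportions, where independence of the intervals no longer holds.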

Some discussion of multiple comparisons of proportions can be found in Walsh (1965, for example, pp. 536–537).

13. Selection and ranking

The approach called selection or ranking assumes a difference between populations and seeks to select the population(s) with the highest mean (or variance or proportion) or to arrange all k populations in order [see Screening and Selection; see also Bechhofer 1958]. Bechhofer, Kiefer, and Sobel (1967) have written a monograph on the subject.

Error rates, choice of method, history

14. Error rates and choice of method

In a significance test comparing two populations, the significance level is defined as

a = (number of false rejections of the null hypothesis) / (number of comparisons made)

in repeated use of the same criterion. This is termed the error rate per comparison. The corresponding confidence level for confidence intervals is 1 – a.

For analyses of k-sample experiments one may instead define the error rate per experiment,

a’ = (number of false rejections of null hypotheses) / (number of experiments).

This is related to what Miller (1966) terms the “expected error rate.” For m (computed or implied) comparisons per experiment, a’ = ma; a = a’/m (see Stanley 1957).

Standard multiple comparison tests specify an error rate experimentwise (or “familywise”):

α = (number of experiments containing at least one false rejection) / (number of experiments).

Miller refers to this as the “probability of a non-zero family error rate” or “probability error rate.”

The only difference between a’ and α is that α counts multiple rejections in a single experiment as only one error whereas a’ counts them as more than one. Hence, α ≤ a’; this is termed Bonferroni’s inequality.

On the other hand, it is also true that unless a’ is large, α is almost equal to a’, so that α and a’ may be used interchangeably for practical purposes. For example, Dα for 6 treatments and a control, Mα for k = 6, and Tα for k = 4 (m = 6), are all approximately equal to the two-tailed critical value |t|α/6 = tα/12 of Student’s t. More generally, m individual comparisons may safely be made using any statistic at significance level α/m per comparison when it is desired to avoid error rates greater than α experimentwise; this procedure may be applied to comparisons of several correlation coefficients or other quantities for which multiple comparison tables are not available. Only when α is about .10 or more, or when m is very large, does this lead to serious waste. Then a’ grossly overstates α, power is lost, and confidence intervals are unnecessarily long (see Stanley 1957; Ryan 1959; Dunn 1961).
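The relation between the two rates can be illustrated numerically. The sketch below assumes, purely for illustration, m independent comparisons in which every null hypothesis is true, so that α = 1 − (1 − a)^m; the inequality α ≤ a’ itself does not require independence. The function name is not from the source.

```python
def error_rates(a, m):
    """Return (a_prime, alpha) for m independent comparisons, each made at
    per-comparison level a with all null hypotheses true (an assumption)."""
    a_prime = m * a                      # error rate per experiment
    alpha = 1.0 - (1.0 - a) ** m         # probability of at least one false rejection
    return a_prime, alpha

for a, m in [(0.01, 6), (0.05, 6), (0.05, 50)]:
    a_prime, alpha = error_rates(a, m)
    print(f"a={a}, m={m}: a'={a_prime:.3f}, alpha={alpha:.3f}")
```

In the first case the two rates nearly coincide; in the last, a’ exceeds 1 while α cannot, which is the situation in which the approximation becomes wasteful.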

Some authors refer to (α/m)-points as Bonferroni statistics and to their use in multiple comparisons as the Bonferroni method. Table 2 in Miller (1966) shows Bonferroni t-statistics, (.05/2m)-points of Student’s t for various m and various numbers of ddf.
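A Bonferroni critical value is simple to compute in the limiting case of infinite ddf, where Student’s t reduces to the normal distribution. The sketch below covers only that limiting case (an assumption made to stay within the Python standard library); for finite ddf a t-quantile would be needed instead. The function name is illustrative.

```python
from statistics import NormalDist

def bonferroni_z(alpha, m):
    """Two-tailed critical value for each of m comparisons, keeping the
    experimentwise error rate at or below alpha (normal case, infinite ddf)."""
    per_comparison = alpha / m                       # level allotted to each comparison
    return NormalDist().inv_cdf(1.0 - per_comparison / 2.0)

for m in (1, 6, 20):
    print(m, round(bonferroni_z(0.05, m), 3))
```

With m = 1 this is the ordinary two-tailed normal critical value; the value grows slowly as m increases.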

Bonferroni’s second inequality (see Halperin et al. 1955, p. 191) may sometimes be used to obtain an upper limit for the discrepancy a’ − α and a second approximation to critical values for error rates α experimentwise. This works best in the case of slippage statistics and was used by Halperin and his associates (1955), Doornbos and Prins (1958), Thompson and Willke (1963), and others.

The choice between “experimentwise” and “per comparison” is largely a matter of taste. An experimenter should make it consciously, aware of the implications: A given error probability, a, per comparison implies that the risk of at least one type I error in the analysis is much greater than a; indeed, about a × m such errors will probably occur.

Perhaps analyses reporting error rates experimentwise are generally the most honest, or transparent. However, too dogmatic an application of this principle would lead to all sorts of difficulties. Should not the researcher who in the course of his career analyzes 86 experiments involving 1,729 relevant contrasts control the error rate lifetimewise? If he does not, he is almost bound to make a false positive inference sooner or later.

Sterling (1959) discusses the related problem of concentration of type I errors in the literature that result from the habit of selecting significant findings for publication [see Fallacies, Statistical, for further discussion of this problem].

There is another context in which the problem of choosing error rates arises: If an experimenter laboriously sets up expensive apparatus for an experiment to compare two treatments or conditions in which he is especially interested, he often feels that it would be unfortunate to pass up the opportunity to obtain additional data of secondary interest at practically no extra cost or trouble; so he makes observations on populations 3, 4,..., k as well. It is then possible that the results are such that a two-sample test on the data of primary interest would have shown statistical significance, but no “significant differences” are found in a multiple comparison test. If the bonus observations thus drown out, so to speak, the significant difference, was the experimenter wrong to read them? He was not—the opportunity to obtain extra information should not be wasted, but the analysis should be planned ahead of time with the experimenter’s interests and priorities in mind. He could decide to analyze his primary and subsidiary results as if they had come from separate experiments, or he could conduct multiple comparisons with an overall error rate enlarged to avoid undue loss of power, or he could use a method of analysis which sub-divides α, allocating a certain (large) part to the primary comparison and the rest to “data snooping” among the extra observations (Miller 1966, chapter 2, sec. 2.3).

Whenever it is decided to specify error rates experimentwise, a choice between different systems of multiple comparisons (different shapes of confidence regions) remains to be made. In order to study simple differences or slippage only, one of the methods of sections 2–4 above (or a nonparametric version of them) is best (that is, yields the shortest confidence intervals and most powerful tests), provided the ni are (nearly) equal. But Scheffé’s approach (see section 5) is better if a variety of contrasts may receive attention.

When sample sizes are grossly unequal, probability statements based on existing Tukey or Dunnett tables, computed for equal n’s, become too inaccurate. Pending the appearance of appropriate new tables, it is better to use Scheffé’s method, which furnishes exact probabilities. The Bonferroni statistics discussed above offer an alternative solution, preferable whenever attention is strictly limited to a few contrasts chosen ahead of time. Miller (1966, especially chapter 2, secs. 2.3 and 3.3) discusses these questions in some detail.

15. History of multiple comparisons

An early, isolated example of a multiple comparison method was one developed by Working and Hotelling (1929) to obtain a confidence belt for a regression line (see Miller 1966, chapter 3; Kerrich 1955). This region also corresponds to simultaneous confidence intervals for the intercept and slope [see Linear Hypotheses, article on Regression]. Hotelling (1927) had already developed the idea of simultaneous confidence interval estimation earlier in connection with the fitting of logistic curves to population time series. In his famous paper introducing the T2-statistic, Hotelling (1931) also introduced the idea of simultaneous tests and a confidence ellipsoid for the components of a multivariate normal mean.

The systematic development of multiple comparison methods and theory began later, in connection with the problem of comparing several normal means. The usual method had been the analysis-of-variance F-test, sometimes accompanied by t-tests at a stated significance level, a (usually 5 per cent), per comparison.

Fisher, in the 1935 edition of The Design of Experiments, pointed out the problem of inflation of error probabilities in such multiple t-tests and recommended the use of t-tests at a stated level a’ per experiment. Pearson and Chandra Sekar further discussed the problem (1936). Newman (1939), acting on an informal suggestion by Student, described a test for all differences based on tables of the Studentized range and furnished a table of approximate 5 per cent and 1 per cent points. Keuls formulated Newman’s test more clearly much later (Keuls 1952).

Nair made two contributions in 1948, the one-sided test for slippage of means and a table for simultaneous F-tests in a 2^r factorial design. Also in the late 1940s, Duncan and Tukey experimented with various tests for normal means which were forerunners of the multiple comparison tests now associated with their names.

The standard methods for multiple comparisons of normal means were developed between 1952 and 1955 by Tukey, Scheffé, Dunnett, and Duncan. Tukey wrote a comprehensive volume on the subject which was widely circulated in duplicated form and extensively quoted but which has not been published (1953). The form of Tukey’s method described in Section 3 for unequal n’s was given independently by Kurtz and by Kramer in 1956. Also in the early and middle 1950s, some multiple comparison methods for normal variances were published, by Hartley, David, Truax, Krishnaiah, and others. Cochran’s slippage test for normal variances was published, for use as a substitute for Bartlett’s test for homogeneity of variances, as early as 1941 (see Cochran 1941).

Selection and ranking procedures for means, variances, and proportions have been developed since 1953 by Bechhofer and others.

An easy, distribution-free slippage test was proposed by Mosteller in 1948—simply count the number of observations in the most extreme sample lying beyond the most extreme value of all the other samples and refer to a table by Mosteller and Tukey (1950). Other distribution-free multiple comparison methods—although some of them can be viewed as applications of S. N. Roy’s work of 1953—did not begin to appear until after 1958.
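The counting step of this slippage test can be sketched as follows. The sketch covers only slippage to the right (the largest values) and only the count itself; referring the count to the Mosteller and Tukey table is not reproduced. The function name and the data are hypothetical.

```python
def mosteller_slippage_count(samples):
    """Return (index of the sample holding the overall maximum, number of its
    observations that exceed every observation in all the other samples)."""
    top = max(range(len(samples)), key=lambda i: max(samples[i]))
    rival_max = max(max(s) for i, s in enumerate(samples) if i != top)
    count = sum(1 for x in samples[top] if x > rival_max)
    return top, count

# Hypothetical data: three samples of unequal size.
samples = [[3.1, 4.0, 5.2], [2.2, 6.5, 7.1, 8.0], [1.0, 4.4, 6.0]]
print(mosteller_slippage_count(samples))
```

Here the second sample holds the overall maximum, and three of its observations lie beyond the largest value (6.0) found in the other samples.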

The most important applications of the very general methodology developed by the school of Roy and Bose since 1953 have been multivariate multiple comparison tests and confidence regions. Such work by Roy, Bose, Gnanadesikan, Krishnaiah, Gabriel, and others is generally recognizable by the word “simultaneous” in the title—for example, SMANOVA, that is, simultaneous multivariate analysis of variance (see Miller 1966, chapter 5).

Another recent development is the appearance of some Bayesian techniques for multiple comparisons. These are discussed by Duncan in the May 1965 issue of Technometrics, an issue which is devoted to articles on multiple comparison methods and theory and reflects a cross section of current trends in this field.

Peter Nemenyi

BIBLIOGRAPHY

The only comprehensive source for the subject of multiple comparisons to date is Miller 1966. Multiple comparisons of normal means (and variances) are summarized by a number of authors, notably David 1962a and 1962b. Several textbooks on statistics (e.g., Winer 1962) also cover some of this ground. Many of the relevant tables, for normal means and variances, can also be found in David 1962a and 1962b; Vianelli 1959; and Pearson & Hartley 1954; these volumes also provide explanations of the derivation and use of the tables.

Bechhofer, R. E. 1958 A Sequential Multiple-decision Procedure for Selecting the Best One of Several Normal Populations With a Common Unknown Variance, and Its Use With Various Experimental Designs. Biometrics 14:408–429.

Bechhofer, R. E.; Kiefer, J.; and Sobel, M. 1967 Sequential Ranking Procedures. Unpublished manuscript. → Projected for publication by the University of Chicago Press in association with the Institute of Mathematical Statistics.

Cochran, W. G. 1941 The Distribution of the Largest of a Set of Estimated Variances as a Fraction of Their Total. Annals of Eugenics 11:47–52.

Cronbach, Lee J. 1949 Statistical Methods Applied to Rorschach Scores: A Review. Psychological Bulletin 46:393–429.

David, H. A. 1952 Upper 5 and 1% Points of the Maximum F-ratio. Biometrika 39:422–424.

David, H. A. 1962a Multiple Decisions and Multiple Comparisons. Pages 144–162 in Ahmed E. Sarhan and Bernard G. Greenberg (editors), Contributions to Order Statistics. New York: Wiley.

David, H. A. 1962b Order Statistics in Shortcut Tests. Pages 94–128 in Ahmed E. Sarhan and Bernard G. Greenberg (editors), Contributions to Order Statistics. New York: Wiley.

Doornbos, R.; and Prins, H. J. 1958 On Slippage Tests. Part 3: Two Distribution-free Slippage Tests and Two Tables. Indagationes mathematicae 20:438–447.

Duncan, David B. 1955 Multiple Range and Multiple F Tests. Biometrics 11:1–42.

Duncan, David B. 1965 A Bayesian Approach to Multiple Comparisons. Technometrics 7:171–222.

Dunn, Olive J. 1961 Multiple Comparisons Among Means. Journal of the American Statistical Association 56:52–64.

Dunnett, Charles W. 1955 A Multiple Comparison Procedure for Comparing Several Treatments With a Control. Journal of the American Statistical Association 50:1096–1121.

Fisher, R. A. (1935) 1960 The Design of Experiments. 7th ed. London: Oliver & Boyd; New York: Hafner.

Fisher, R. A.; and Yates, Frank (1938) 1963 Statistical Tables for Biological, Agricultural, and Medical Research. 6th ed., rev. & enl. Edinburgh: Oliver & Boyd; New York: Hafner.

Gabriel, K. R. 1966 Simultaneous Test Procedures for Multiple Comparisons on Categorical Data. Journal of the American Statistical Association 61:1081–1096.

Goodman, Leo A. 1965 On Simultaneous Confidence Intervals for Multinomial Proportions. Technometrics 7:247–252.

Halperin, M.; Greenhouse, S.; Cornfield, J.; and Zalokar, J. 1955 Tables of Percentage Points for the Studentized Maximum Absolute Deviate in Normal Samples. Journal of the American Statistical Association 50:185–195.

Harter, H. Leon 1957 Error Rates and Sample Sizes for Range Tests in Multiple Comparisons. Biometrics 13:511–536.

Harter, H. Leon 1960 Tables of Range and Studentized Range. Annals of Mathematical Statistics 31:1122–1147.

Hartley, H. O. 1950 The Maximum F-ratio as a Shortcut Test for Heterogeneity of Variance. Biometrika 37:308–312.

Hotelling, Harold 1927 Differential Equations Subject to Error, and Population Estimates. Journal of the American Statistical Association 22:283–314.

Hotelling, Harold 1931 The Generalization of Student’s Ratio. Annals of Mathematical Statistics 2:360–378.

Kerrich, J. E. 1955 Confidence Intervals Associated With a Straight Line Fitted by Least Squares. Statistica neerlandica 9:125–129.

Keuls, M. 1952 The Use of “Studentized Range” in Connection With an Analysis of Variance. Euphytica 1:112–122.

Kramer, Clyde Y. 1956 Extension of Multiple Range Tests to Group Means With Unequal Number of Replications. Biometrics 12:307–310.

Kramer, Clyde Y. 1957 Extension of Multiple Range Tests to Group Correlated Adjusted Means. Biometrics 13:13–18.

Krishnaiah, P. R. 1965a On a Multivariate Generalization of the Simultaneous Analysis of Variance Test. Institute of Statistical Mathematics (Tokyo), Annals 17, no. 2:167–173.

Krishnaiah, P. R. 1965b Simultaneous Tests for the Equality of Variance Against Certain Alternatives. Australian Journal of Statistics 7:105–109.

Kurtz, T. E. 1956 An Extension of a Method of Making Multiple Comparisons (Preliminary Report). Annals of Mathematical Statistics 27:547 only.

Kurtz, T. E.; Link, R. F.; Tukey, J. W.; and Wallace, D. L. 1965 Short-cut Multiple Comparisons for Balanced Single and Double Classifications. Part 1: Results. Technometrics 7:95–169.

McHugh, Richard B.; and Ellis, Douglas S. 1955 The “Post Mortem” Testing of Experimental Comparisons. Psychological Bulletin 52:425–428.

Miller, Rupert G. 1966 Simultaneous Statistical Inference. New York: McGraw-Hill.

Moses, Lincoln E. 1953 Nonparametric Methods. Pages 426–450 in Helen M. Walker and Joseph Lev, Statistical Inference. New York: Holt.

Moses, Lincoln E. 1963 Rank Tests of Dispersion. Annals of Mathematical Statistics 34:973–983.

Moses, Lincoln E. 1965 Confidence Limits From Rank Tests (Reply to a Query). Technometrics 7:257–260.

Mosteller, Frederick W.; and Tukey, John W. 1950 Significance Levels for a k-sample Slippage Test. Annals of Mathematical Statistics 21:120–123.

Nair, K. R. 1948a The Studentized Form of the Extreme Mean Square Test in the Analysis of Variance. Biometrika 35:16–31.

Nair, K. R. 1948b The Distribution of the Extreme Deviate From the Sample Mean and Its Studentized Form. Biometrika 35:118–144.

Nair, K. R. 1952 Tables of Percentage Points of the “Studentized” Extreme Deviate From the Sample Mean. Biometrika 39:189–191.

Nemenyi, Peter 1963 Distribution-free Multiple Comparisons. Ph.D. dissertation, Princeton Univ.

Newman, D. 1939 The Distribution of the Range in Samples From a Normal Population, Expressed in Terms of an Independent Estimate of Standard Deviation. Biometrika 31:20–30.

Pearson, Egon S.; and Chandrasekar, C. 1936 The Efficiency of Statistical Tools and a Criterion for the Rejection of Outlying Observations. Biometrika 28:308–320.

Pearson, Egon S.; and Hartley, H. O. (editors) (1954) 1966 Biometrika Tables for Statisticians. Vol. 1. 3d ed. Cambridge Univ. Press. → Only the first volume of this edition has as yet been published.

Pillai, K. C. S.; and Ramachandran, K. V. 1954 On the Distribution of the Ratio of the ith Observation in an Ordered Sample From a Normal Population to an Independent Estimate of the Standard Deviation. Annals of Mathematical Statistics 25:565–572.

Quesenberry, C. P.; and David, H. A. 1961 Some Tests for Outliers. Biometrika 48:379–390.

Roessler, R. G. 1946 Testing the Significance of Observations Compared With a Control. American Society for Horticultural Science, Proceedings 47:249–251.

Rothaus, Paul; and Worchel, Philip 1964 Ego Support, Communication, Catharsis, and Hostility. Journal of Personality 32:296–312.

Roy, S. N.; and Bose, R. C. 1953 Simultaneous Confidence Interval Estimation. Annals of Mathematical Statistics 24:513–536.

Roy, S. N.; and Gnanadesikan, R. 1957 Further Contributions to Multivariate Confidence Bounds. Biometrika 44:399–410.

Ryan, Thomas A. 1959 Multiple Comparisons in Psychological Research. Psychological Bulletin 56:26–47.

Ryan, Thomas A. 1960 Significance Tests for Multiple Comparisons of Proportions, Variances, and Other Statistics. Psychological Bulletin 57:318–328.

Scheffé, Henry 1953 A Method for Judging All Contrasts in the Analysis of Variance. Biometrika 40:87–104.

Siotani, M.; and Ozawa, Masaru 1958 Tables for Testing the Homogeneity of k Independent Binomial Experiments on a Certain Event Based on the Range. Institute of Statistical Mathematics (Tokyo), Annals 10:47–63.

Stanley, Julian C. 1957 Additional “Post Mortem” Tests of Experimental Comparisons. Psychological Bulletin 54:128–130.

Steel, Robert G. D. 1959 A Multiple Comparison Sign Test: Treatments vs. Control. Journal of the American Statistical Association 54:767–775.

Sterling, Theodore D. 1959 Publication Decisions and Their Possible Effects on Inferences Drawn From Tests of Significance—or Vice Versa. Journal of the American Statistical Association 54:30–34.

Thompson, W. A. Jr.; and Willke, T. A. 1963 On an Extreme Rank Sum Test for Outliers. Biometrika 50:375–383.

Truax, Donald R. 1953 An Optimum Slippage Test for the Variances of k Normal Populations. Annals of Mathematical Statistics 24:669–674.

Tukey, J. W. 1953 The Problem of Multiple Comparisons. Unpublished manuscript, Princeton Univ.

Vaughan, Ted R.; Sjoberg, G.; and Smith, D. H. 1966 Religious Orientations of American Natural Scientists. Social Forces 44:519–526.

Vianelli, Silvio 1959 Prontuari per calcoli statistici: Tavole numeriche e complementi. Palermo: Abbaco.

Walsh, John E. 1965 Handbook of Nonparametric Statistics. Volume 2: Results for Two and Several Sample Problems, Symmetry, and Extremes. Princeton, N.J.: Van Nostrand.

Winer, B. J. 1962 Statistical Principles in Experimental Design. New York: McGraw-Hill.

Working, Holbrook; and Hotelling, Harold 1929 Application of the Theory of Error to the Interpretation of Trends. Journal of the American Statistical Association 24 (March Supplement):73–85.

Yates, Frank 1948 The Analysis of Contingency Tables With Groupings Based on Quantitative Characters. Biometrika 35:176–181.