Goodness of Fit
A goodness of fit procedure is a statistical test of a hypothesis that the sampled population is distributed in a specific way, for example, normally with mean 100 and standard deviation 15. Corresponding confidence procedures for a population distribution also fall under this topic. Related tests are for broader hypotheses, for example, that the sampled population is normal (without further specification). Others test hypotheses that two or more population distributions are the same.
Populations arise because of variability, of which various sources (sometimes acting together) can be distinguished. First, there is inherent variability among experimental units, for example, the heights, IQ’s, or ages of the students in a class each vary among themselves. Then there is measurement error, a more abstract or conceptual notion. The age of a student may have negligible measurement error, but his IQ does not; it depends on a host of accidental factors: how the student slept, the particular questions chosen for the test, and so on. There are also other conceptual populations, not properly thought of in terms of measurement error—the population of subject responses, for example, in the learning experiment below.
The distribution of a numerical population trait is often portrayed by a histogram, a density function, or some other device that shows the proportion of cases for which a particular value of the numerical trait is achieved (or the proportion within a small interval around a particular value). The shape of the histogram or density function is important; it may or may not be symmetrical. If it is not, it is said to be skew. If it is symmetrical, it may have a special kind of shape called normal. For example, populations of scores on intelligence tests are often assumed normally distributed by psychologists. Indeed, the construction of the test may aim at normality, at least for some group of individuals. Again, lifetimes of machines may be assumed to have negative exponential distributions, meaning that expected remaining life does not vary with age. [See Distributions, Statistical, article on Special Continuous Distributions; Probability; Statistics, Descriptive.]
It is technically often convenient, especially in connection with goodness of fit tests, to deal with the cumulative distribution function (c.d.f.) rather than with the density function. The c.d.f. evaluated at x is the proportion of cases with numerical values less than or equal to x; thus, if f(x) is a density function, the corresponding c.d.f. is

$$F(x) = \int_{-\infty}^{x} f(t)\,dt.$$
For explicitness, a subscript will be added to F, indicating the population, distribution, or random variable to which it applies. It is a matter of convention that cumulation is from the left and that it is based on “less than or equal to” rather than just “less than.”
The sample c.d.f. is the steplike function whose value at x is the proportion of observations less than or equal to x. Many goodness of fit procedures are based on geometrically suggested measures of discrepancy between sample and hypothetical population c.d.f.’s. Some informal procedures use “probability” graph paper, especially normal paper (on which a normal c.d.f. becomes a straight line).
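As an illustration of the sample c.d.f. just described, the following sketch compares it with a hypothesized normal c.d.f. at a few points; NumPy/SciPy and the simulated scores are this example's own assumptions, not part of the original discussion.

```python
# A sketch of the sample c.d.f.; libraries and data are illustrative.
import numpy as np
from scipy import stats

def sample_cdf(data, x):
    """Proportion of observations less than or equal to x."""
    data = np.sort(np.asarray(data))
    return np.searchsorted(data, x, side="right") / len(data)

rng = np.random.default_rng(0)
sample = rng.normal(loc=100, scale=15, size=50)   # hypothetical IQ-like scores

# Compare the sample c.d.f. with the hypothesized normal c.d.f. at a few points;
# informally, close agreement corresponds to a straight line on normal paper.
for x in (85, 100, 115):
    print(x, sample_cdf(sample, x), stats.norm.cdf(x, loc=100, scale=15))
```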
For nominal populations (for example, proportions of people expressing allegiance to different religions or to none) there is no concept corresponding to the c.d.f. The main emphasis of this article is on numerical populations.
Although goodness of fit procedures address themselves principally to the shape of population c.d.f.’s, the term “goodness of fit” is sometimes applied more generally than in this article. In particular, some authors write of goodness of fit of observed regressions to hypothetical forms, for example, to a straight line. [This topic is dealt with in Linear Hypotheses, article on Regression.]
Hypotheses—simple, composite, approximate. A test of goodness of fit, based on a sample from a population, assesses the plausibility that the population distribution has specified form; in brief, tests the hypothesis that FX has shape F0. The specification may be complete, that is, the population distribution may be specified completely, in which case the hypothesis is called simple. Alternatively, the form may be specified only up to certain unknown parameters, which often are the parameters of location and scale. In this case the hypothesis is called composite. Still another type of hypothesis is an approximate one, which is composite in a certain sense. Here one specifies first what one would consider a material departure from a hypothesized shape (Hodges & Lehmann 1954). For example, in the case of a simple approximate hypothesis, one might agree that FX departs materially from F0 if the maximum vertical deviation between the actual and hypothesized cumulative distribution functions exceeds .07. The approximate hypothesis then states that the actual and hypothesized distributions do not differ materially in this sense.
Approximate hypotheses specialize to the others, so that a complete theory of testing for the former would be desirable. This is especially true since, as has been pointed out by Karl Pearson (1900) and Joseph Berkson (1938), tests of “exact” hypotheses, being as a rule consistent, have problematical logical status: unless the exact hypothesis is exactly correct and all of the sampling assumptions are exactly met, rejection of the hypothesis is assured (for fixed significance level) when sample size is large. Unfortunately, such a complete theory does not now exist, but the strong early interest in “exact” hypotheses was not misspent: the testing and “acceptance” of “exact” hypotheses concerning FX seems to have much the same status as the provisional adoption of physical or other “laws.” If the latter has helped the advancement of science, so has no doubt the former; this is true notwithstanding that old hypotheses or theories will almost surely be discarded as additional data become available. This point has been made by Cochran (1952) and Chapman (1958). Cochran also suggests that the tests of “exact” hypotheses are “invertible” into confidence sets, in the usual manner, thus providing statistical procedures somewhat similar in intent to tests of approximate hypotheses [see Estimation, article on Confidence Intervals and Regions].
Conducting a test of goodness of fit. Many tests of goodness of fit have been developed; as with statistical tests generally, a test of goodness of fit is conveniently conducted by computing from the sample a statistic and its sample significance level [see Hypothesis Testing]. In the case of a test of goodness of fit, the statistic will measure the discrepancy between what the sample in fact is and what a sample from a population of hypothesized form ought to be. The sample significance level of an observed measure of discrepancy, d0, is, at least for all the standard goodness of fit procedures, the probability, Pr{d ≥ d0}, that d equals or exceeds d0 under random sampling from a population of hypothesized form. In other words, it is the proportion of like discrepancy measures, d, equaling or exceeding d0, computed on the basis of many successive hypothetical random samples of the same size from a population of hypothesized form. For many tests of goodness of fit, there exist tables (for extensive bibliography see Greenwood & Hartley 1962) that give those values of d0 corresponding to given significance level and sample size (n). Many of these standard tests are nonparametric, which means that Pr{d ≥ d0} is the same for a very large class of hypotheses F0, so that only one such tabulation is required [see Nonparametric Statistics].
If, as is usual, the relevant alternative population distributions (more generally, alternative probabilistic models for the generation of the sample at hand) tend to encourage large values of d0, the hypothesized population distribution will be judged implausible if the sample significance level is small (conventionally .05 or less). If the sample significance level is not small, it means that the statistic has a value unsurprising under the null hypothesis, so that the test gives no reason to reject the null hypothesis. If, however, the sample significance level is very large, say .95 or more, one may construe this as a warning of possible trouble, say, that an overzealous proponent of the hypothesis has slanted the data or that the sampling was not random. Note here an awkward usage prevalent in statistics generally: an observed measure of discrepancy d0 with low probability Pr{d ≥ d0} usually is described as highly significant.
Choosing a test of goodness of fit. Choosing a test of goodness of fit amounts to deciding in what sense the discrepancy between the hypothesized population distribution and the sample is to be measured: The sample c.d.f. may be compared directly with the hypothesized population c.d.f., as is done in the case of tests of the Kolmogorov-Smirnov type. For example, the original Kolmogorov-Smirnov test itself, as described below, summarizes the discrepancy by the maximum absolute deviation between the hypothesized population c.d.f., F0, and the sample c.d.f. Alternatively, one may compare uncumulated frequencies, as for the χ2 test. Again, a standard shape parameter, such as skewness, may be computed for the sample and for the hypothesized population and the two compared.
Any reasonable measure of discrepancy will of course tend to be small if the population yielding the sample conforms to the null hypothesis. A good measure of discrepancy will, in addition, tend to be large under the likely alternative forms of the population distribution, a property designated technically by the term power. For example, the sample skewness coefficient might have good power if the hypothesized population distribution were normal (zero population skewness coefficient) and the relevant alternative distributional forms were appreciably skew.
Two general considerations. Two general considerations should be kept in mind. First it is important that the particular goodness of fit test used be selected without consideration of the sample at hand, at least if the calculated significance level is to be meaningful. This is because a measure of discrepancy chosen in the light of an observed sample anomaly will tend to be inordinately large. Receiving license plate 437918 hardly warrants the inference that, this year, the first and second digits add to the third, and the fifth and sixth to the fourth. It may of course be true, in special instances, that some adjustment of the test procedure in the light of the data does not affect the significance computations appreciably—as, for example, when choosing category intervals, based on the sample mean and variance, for the χ2 test (Watson 1957).
Second, a goodness of fit test, like any other statistical test, leads to an inference from a sample to the population sampled. Indeed, the usual hypothesis under test is that the sample is in fact a random sample from an infinite population of hypothesized form, and the tabulated probabilities, Pr{d ≥ d0}, almost always presuppose this. (In principle, one could obtain goodness of fit tests for more complex kinds of probability samples than random ones, but little seems to be known about such possibilities.) It is therefore essential that the sample to which a standard test is applied can be thought of as a random sample. If it cannot, then one must be prepared either to do one’s own nonstandard significance probability computations or to defend the adequacy of the approximation involved in using the standard tabulations. Consider, for example, starting with a random sample involving considerable repetition, say the sample of response words obtained from a panel of subjects taking a psychological word association test or the sample of nationalities obtained from a survey of the United Nations. Suppose now that one tallies the number of items in the sample (response words, nationalities) appearing exactly once, exactly twice, etc. There results a new set of data, consisting of a certain number of ones, a certain number of twos, etc. This collection of integers has the outward appearance of a random sample, and the literature contains instances of the application of the standard tests of goodness of fit to such observed frequencies. Yet the probability mechanism that generates these integers has no resemblance whatever to random sampling, and the standard probability tabulations cannot be assumed to apply. Other examples arise when the data are generated by time series; for some of these the requisite nonstandard probability computations have been done (Patankar 1954), while, in other cases, special devices have made the standard computations apply. For example, in the case of the learning experiment by Suppes and his associates (1964), the sample consists of the time series of a subject’s responses to successive stimuli. Certain theories of learning predict a particular bimodal long-run response population distribution; but the goodness of fit test of this hypothesized shape, on the basis of a series of subject responses, is hampered by the statistical dependence of neighboring responses. However, theory suggests, and a test of randomness confirms, that the subsample consisting of every fifth response is effectively random, enabling a standard χ2 test of goodness of fit to be carried out on the basis of this subsample. Whether four-fifths of the sample is a reasonable price to pay for validly carrying out a standard procedure is of course a matter of debate.
Tests of simple hypotheses
The χ2 test
The χ2 test was first proposed in 1900 by Karl Pearson. To apply the test, one first divides the possible range of numbers (number pairs in the bivariate case) into k regions. For example, if only nonnegative numbers are possible, one might use the categories 0 to .2, .2 to .5, .5 to .7, and .7 and beyond. Next, one computes the probabilities, pi, associated with each of these regions (intervals in the example just given) under the hypothesized F0. This is often done by subtracting values of F0 from each other; for example, when F0 is the exponential cumulative distribution function $1 - e^{-x}$,

$$p_2 = F_0(.5) - F_0(.2) = e^{-.2} - e^{-.5}.$$
The expected numbers Ei of observations in each category are (under the null hypothesis) Ei = npi where n is the size of the random sample.
After the sample has been collected, there also will be observed numbers, Oi, of sample members in each category. The chi-square measure of discrepancy dχ2 is then computed by summing squared differences of class frequencies, weighted in such a way as to bring to bear standard distribution theory,

$$d_{\chi^2} = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i},$$

with observed value dχ2,0, where the subscript 0 indicates the specific sample value of dχ2. (Often “X2” or “χ2” is used to denote this statistic.)
As is shown, for example, by Cochran (1952), the probability distribution of dχ2, when FX = F0, can be approximated by the chi-square distribution with k − 1 degrees of freedom, $\chi^2_{k-1}$. This fact, to which the test owes its name, was first demonstrated by Karl Pearson. The larger the expectations Ei, the better is the approximation; this has been pointed out, for example, by Mann and Wald (1942). Hence, the significance, Pr{dχ2 ≥ dχ2,0}, is evaluated to a good approximation by consulting a tabulation of the $\chi^2_{k-1}$ distribution. For example, if k, as above, equals 4, and dχ2,0 had happened to be 4.6, then Pr{dχ2 ≥ dχ2,0} ≅ .20. With a sample significance level of .20, most statisticians would not question the plausibility of F0. However, were dχ2,0 larger, and the corresponding significance equal to .05 or less, the consensus would be reversed.
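The following sketch carries out the χ2 computation for the running exponential example; the simulated sample, the seed, and the use of NumPy and SciPy are illustrative assumptions of the sketch.

```python
# A sketch of the chi-square computation for the exponential example above.
import numpy as np
from scipy import stats

edges = np.array([0.0, 0.2, 0.5, 0.7, np.inf])   # the four categories above
F0 = lambda t: 1.0 - np.exp(-t)                  # hypothesized c.d.f. 1 - e^(-x)
p = np.diff(F0(edges))                           # category probabilities p_i

rng = np.random.default_rng(1)
x = rng.exponential(size=200)                    # random sample of size n = 200
O = np.histogram(x, bins=edges)[0]               # observed counts O_i
E = len(x) * p                                   # expected counts E_i = n p_i

d0 = np.sum((O - E) ** 2 / E)                    # observed chi-square discrepancy
print(d0, stats.chi2.sf(d0, df=len(p) - 1))      # Pr{d >= d0} from chi-square, k - 1 d.f.
```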
At what point is the distributional approximation endangered by small Ei? An early study of this problem, performed by Cochran in 1942 (referred to in Cochran 1952), shows that a few Ei near 1 among several large ones do not materially affect the approximation. Recent studies, by Kempthorne (1966) and by Slakter (1965), show that this is true as well when all Ei are near 1.
These and other studies indicate that, although some care must be taken to avoid very small Ei, much latitude remains for choosing categories. How is this to be done? To begin with, in keeping with the spirit of remarks by Birnbaum (1953), if the relevant alternatives F* to F0 are such that

$$d_{\chi^2}(F^*, F_0) = n \sum_{i=1}^{k} \frac{(p^*_i - p_i)^2}{p_i},$$

where p*i denotes the category probabilities under F*, is large for a certain choice of k categories, it is these categories that should be selected. Among various sets of k categories, those yielding large dχ2(F*, F0) are preferred.
In the absence of detailed knowledge of the alternatives, the usual recommendation, at least in the one-dimensional case, is to use intervals of equal Ei. There remains the question of how many such intervals there should be. The typical statistical criterion for this is power, that is, the likelihood that the value of dχ2 will be large enough to warrant rejection of the hypothesis F0 when the population is in fact a relevant alternative one. If large power is desired for all alternative population c.d.f.’s departing from F0 at some x by at least a given fixed amount, Mann and Wald (1942) recommend a number of categories of the order of $4n^{2/5}$. Williams (1950) has shown that this figure can easily be halved.
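A minimal illustration of the Mann-Wald category count of order $4n^{2/5}$, and of Williams' halved figure, for a few sample sizes; the rounding is this sketch's own choice.

```python
# Order-of-magnitude category counts: Mann-Wald 4 n^(2/5) and Williams' half.
for n in (50, 200, 1000):
    k_mw = 4 * n ** 0.4
    print(n, round(k_mw), round(k_mw / 2))
```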
The χ2 test is versatile; it is readily adapted to problems involving nominal rather than numerical populations [seeCounted data]. It can also be adapted to bivariate and multivariate problems, as, for example, by Keats and Lord (1962), where the joint distribution of two types of mental test scores is considered. As opposed to many of its competitors, the χ2 test is not biased, in the sense that there are no alternatives F* to F0 under which acceptance of F0 is more likely than it is under F0 itself. It is readily adapted to composite and approximate testing problems. Also, it seems to be true that the χ2 test is in the best position among its competitors with regard to the practical computation of power. As is pointed out by Cochran (1952), such computations are performed by means of the noncentral chi-square distribution with k − 1 degrees of freedom.
Modifications of the χ2 test
Important modifications of the χ2 test, intended to increase its power against specific families F of distributions alternative to F0, are given by Neyman (1949) and by Fix, Hodges, and Lehmann (1954). Here F is assumed to include F0 and to allow differentiable parametric representation of the category expectations Ei. Note that the inclusion of F0 in F differs from the point of view adopted, for example, by Mann and Wald (1942). These modifications are essentially likelihood ratio tests of F0 versus F and are similar to procedures used to test composite and approximate hypotheses.
Another modification, capable of orientation against specific “smooth” alternatives, Neyman’s ψ2 test, was introduced in 1937. Other important modifications are described in detail in Cochran (1954).
Other procedures
When (X1, …, Xn) is a random sample from a population distributed according to a continuous c.d.f. F0, then (U1, …, Un) = (F0(X1), …, F0(Xn)) has all the probabilistic properties of a random sample from a population distributed uniformly over the numbers between zero and one. (If the population has a density function, the c.d.f. is continuous.) No matter what the hypothesized F0, the initial application of this probability integral transformation thus reduces all probability computations to the case of this uniform population distribution and gives a nonparametric character to any procedure based on the transformed sample (U1, …, Un). Most goodness of fit tests of simple hypotheses are nonparametric in this sense, including the χ2 test itself, when categories are chosen so as to assign specified values, for example, the constant value 1/k, to the category probabilities pi.
Another common test making use of the transformation U = F0(X) is the Kolmogorov-Smirnov test, first suggested by Kolmogorov (1933) and explained in detail by Goodman (1954) and Massey (1951). The test bears Smirnov’s name, as well as Kolmogorov’s, presumably because Smirnov (as Doob and Donsker did later) gave an alternate derivation of its asymptotic null distribution, tabulated this distribution, and also extended the test to the two-sample case discussed below (1939a). Denote by Fn(x) the sample c.d.f., that is, Fn(x) is the proportion of sample values less than or equal to x. The test is based on the maximum absolute vertical deviation between Fn(x) and F0(x),

$$d_K = \max_x \left| F_n(x) - F_0(x) \right|,$$
the dependence of dK on the quantities Ui = F0(Xi) being best brought out by the alternate formula

$$d_K = \max_{1 \le i \le n} \, \max\!\left( \frac{i}{n} - u_i,\; u_i - \frac{i-1}{n} \right),$$
where u1 is the smallest Ui, u2 is the next to smallest, etc.; the equivalence of the two formulas is made clear by a sketch. As Kolmogorov noted in his original paper, the probabilities tabulated for dK are conservative when F0 is not continuous, in the sense that, for discontinuous F0, actual probabilities of dK ≥ dK,0 will tend to be less than those obtained from the tabulations, leading to occasional unwarranted acceptance of F0.
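The alternate formula may be checked numerically; in the sketch below (simulated data, standard normal F0, NumPy and SciPy assumed), the hand computation agrees with the statistic returned by scipy.stats.kstest.

```python
# A numerical check of the alternate formula for d_K; data are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(size=100)
n = len(x)

u = np.sort(stats.norm.cdf(x))              # ordered transforms u_i = F0(X_(i))
i = np.arange(1, n + 1)
d_K = np.max(np.maximum(i / n - u, u - (i - 1) / n))

# scipy's Kolmogorov-Smirnov routine returns the same sup-distance statistic.
print(d_K, stats.kstest(x, "norm").statistic)
```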
Computations (Shapiro & Wilk 1965) suggest that this test has low power against alternatives with mean and variance equal to those of the hypothesized distribution. It has, however, been argued, for example, by Birnbaum (1953) and Kac, Kiefer, and Wolfowitz (1955), that the test yields good minimum power over classes of alternatives F* satisfying dK(F*, F0) ≥ δ; these, as the reader will note, are precisely the classes of alternatives envisaged by Mann and Wald (1942) in optimizing the number of categories used in the χ2 test. A detrimental feature of the Kolmogorov-Smirnov test is its bias, pointed out in Massey (1951).
An important feature of the test is that it can be “inverted” in keeping with the usual method to provide a confidence band for FX(x) centered on Fn(x), which, except for the narrowing caused by the restriction 0 ≤ FX(x) ≤ 1, has constant width [see Estimation, article on Confidence Intervals and Regions]. The construction of such a band has been suggested by Wald and Wolfowitz and is described by Goodman (1954). Attaching a significance probability to an observed dK,0 amounts to ascertaining the band width required in order just to include wholly the hypothesized F0 in the confidence band.
The Kolmogorov-Smirnov test has been modified in several ways; the first of these converts the test into a “one-sided” procedure based on the discrepancy

$$d_{K^+} = \max_x \left( F_n(x) - F_0(x) \right).$$
A useful feature of this modification is the simplicity of the large sample computation of significance probabilities associated with observed discrepancies dK+,0; abbreviating the latter to d, one has Pr{dK+ ≥ d} ≅ $e^{-2nd^2}$. It is verified by Chapman (1958) that dK+ yields good minimum power over those classes of alternatives F* that satisfy dK+(F*, F0) ≥ δ.
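A sketch of the one-sided computation, comparing the large-sample approximation with the one-sided significance level reported by scipy.stats.kstest; the simulated uniform sample (the null case) and seed are illustrative.

```python
# A sketch of the one-sided discrepancy d_K+ and its large-sample significance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.uniform(size=200)
u = np.sort(x)                               # F0 is already the uniform c.d.f.
n = len(u)
i = np.arange(1, n + 1)

d_plus = np.max(i / n - u)                   # d_K+ = max_x (F_n(x) - F0(x))
p_approx = np.exp(-2 * n * d_plus ** 2)      # large-sample Pr{d_K+ >= d}

# Compare with scipy's one-sided Kolmogorov-Smirnov significance level.
print(d_plus, p_approx, stats.kstest(x, "uniform", alternative="greater").pvalue)
```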
Other, more complex modifications provide greater power against special alternatives, as in the weight function modifications (Darling 1957), which provide greater power against discrepancies from F0 in the tails. Another sort of modification, introduced and tabulated by Kuiper in 1960, calls for a measure of discrepancy dV that is especially suited to testing goodness of fit to hypothesized circular distributions, being invariant under arbitrary choices of the angular origin. This property could be important, for example, in psychological studies involving responses to the color wheel, or in the learning experiment mentioned above. The measure dV also has been singled out by E. S. Pearson (1963) as the generally most attractive in competition with dK and the discrepancy measures dω2 and dU mentioned below.
A second general class of procedures also making use of the transformation U = F0(X) springs from the discrepancy measure

$$d_{\omega^2} = n \int_{-\infty}^{\infty} \left[ F_n(x) - F_0(x) \right]^2 dF_0(x),$$
first proposed by Cramér in 1928 and also by von Mises in 1931 (see Darling 1957). Marshall (1958) has verified a startling agreement between the asymptotic and small sample distributions of dω2 for sample sizes n as low as 3. Power considerations for dω2 are similar to those expressed for dK, and are discussed also in the sources cited by Marshall; the test based on dω2 can be expected to have good minimum power over classes of alternatives F* satisfying the condition dω2(F*, F0) ≥ δ. However, the test is biased (as is that based on dK).
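The statistic dω2 has a standard computing formula in terms of the ordered transforms ui, namely 1/(12n) + Σi [ui − (2i − 1)/(2n)]²; the sketch below checks the hand computation against scipy.stats.cramervonmises (assumed available in the installed SciPy), with simulated data.

```python
# A check of the computing formula for d_omega2 against scipy's routine.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.normal(size=30)
n = len(x)

u = np.sort(stats.norm.cdf(x))               # ordered transforms u_i
i = np.arange(1, n + 1)
d_w2 = 1 / (12 * n) + np.sum((u - (2 * i - 1) / (2 * n)) ** 2)

print(d_w2, stats.cramervonmises(x, "norm").statistic)
```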
As in the case of dK, dω2 has weight function modifications for greater power selectivity, and also a modification dU, analogous to the modification dV of dK and introduced by Watson (1961), which does not depend on the choice of angular origin and is thus also suited for testing the goodness of fit to hypothesized circular distributions.
Other procedures include those based on the Fisher-Pearson measures $d^{(1)} = -2\sum_i \log u_i$ and $d^{(2)} = -2\sum_i \log\,(1 - u_i)$, apparently first suggested in connection with goodness of fit in 1938 by E. S. Pearson. As pointed out by Chapman (1958), the tests based on d(1) and d(2) are uniformly most powerful against polynomial alternatives to F0(u) = u of form $u^k$ and $1 - (1-u)^k$, k > 1, and hence are “smooth” in the sense of Neyman’s ψ2 test. Computations by Chapman suggest that, dually to dK, d(2) has good maximum power over classes of alternatives F* satisfying dK(F*, F0) ≤ δ.
Another set of procedures, discussed and defended by Pyke (1965) and extensively studied by Weiss (1958), are based on functions of the spacings ui+1 − ui of the u’s from each other, or of the deviations ui − i/(n + 1) from their expected locations under F0. Still another criterion (Smirnov 1939b) examines the number of crossings of Fn(x) and F0(x).
An important modification, applicable to all of the procedures in this section, is suggested in Durbin (1961). This modification is intended to increase the power of any procedure based on the transforms Ui, against a certain class of alternatives described in that paper.
Since there are multivariate probability integral transformations, applying an initial “uniformizing” transformation is possible in the multivariate case as well. However, one of several possible transformations must now be chosen, and, related to this nonuniqueness, the direct analogues of the univariate discrepancy measures are no longer functions of uniformly distributed transforms and do not lead to nonparametric tests (Rosenblatt 1952).
Tests of composite hypotheses
The χ2 test
In the composite case, the null hypothesis specifies only that FX(x) is a member of a certain parametric class {Fθ(x)}. Typically, but not necessarily, θ is the pair (μ, σ), μ a parameter of location and σ a parameter of scale, in which case Fθ(x) may be written F0[(x − μ)/σ]. In any event, there arises the question of modifying the measure dχ2 of discrepancy between the sample and a particular cumulative distribution function into a measure Dχ2 of discrepancy between the sample and the class {Fθ(x)}. A natural approach is to set
$$D_{\chi^2} = \min_\theta \, d_{\chi^2}.$$
If θ is composed of m parameters, it can be shown that, under quite general conditions, Dχ2 is approximately distributed according to the $\chi^2_{k-m-1}$ distribution when FX(x) equals any one of the Fθ(x). Hence significance probability computations can once again be referred to tabulations of the χ2 distribution. The requisite minimization with respect to θ can be cumbersome, and several modifications have been proposed, for example, the following by Neyman (1949):
Suppose that one defines dχ2(θ) as the discrepancy dχ2 between the observed sample and the particular distribution Fθ(x). Then Dχ2 is defined also by
$$D_{\chi^2} = d_{\chi^2}(\tilde{\theta}),$$
with the estimator θ̃ computed from

$$d_{\chi^2}(\tilde{\theta}) = \min_\theta \, d_{\chi^2}(\theta),$$
that is, with θ̃ the minimum chi-square estimator of θ. The suggested modifications involve using estimators of θ alternate to θ̃ in this last definition of Dχ2, that is, estimators that “essentially” minimize dχ2(θ); among these are the so-called grouped-data or partial-information maximum likelihood estimators.
Frequently used but not equivalent estimators are the ordinary “full-information” maximum likelihood estimators θ̂ of θ, for example, (x̄, s) for (μ, σ) in the normal case. These do not “essentially” minimize dχ2(θ) and consequently tend to inflate Dχ2 beyond values predicted by the $\chi^2_{k-m-1}$ distribution, leading to some unwarranted rejections of the composite hypothesis. However, it is indicated by Chernoff and Lehmann (1954), and also by Watson (1957), that no serious distortion will result if the number of categories is ten or more.
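A sketch of the composite χ2 test for the normal location-scale family follows; the equal-count categories built from the sample (in the spirit of Watson 1957), the simulated data, and the crude Nelder-Mead minimization standing in for the minimum chi-square estimator θ̃ are all assumptions of the example.

```python
# A sketch of the composite chi-square test with numerical minimization over theta.
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(5)
x = rng.normal(loc=10.0, scale=2.0, size=500)     # simulated sample
n = len(x)

edges = np.quantile(x, np.linspace(0, 1, 11))     # 10 equal-count categories
edges[0], edges[-1] = -np.inf, np.inf
O = np.histogram(x, bins=edges)[0]                # observed counts O_i

def d_chi2(theta):
    """Chi-square discrepancy d_chi2(theta) for candidate (mu, sigma)."""
    mu, sigma = theta
    E = n * np.diff(stats.norm.cdf(edges, loc=mu, scale=sigma))
    return np.sum((O - E) ** 2 / E)

# D_chi2 = min over theta of d_chi2(theta), via a crude numerical minimization.
res = optimize.minimize(d_chi2, x0=[x.mean(), x.std(ddof=1)], method="Nelder-Mead")
D = res.fun

# Refer D to chi-square with k - m - 1 = 10 - 2 - 1 degrees of freedom.
print(D, stats.chi2.sf(D, df=10 - 2 - 1))
```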
Composite analogues of other tests
Adaptation of the tests based on the probability integral transformation to the composite case proceeds much as in the case of χ2. With definitions of dω2(θ) and dK(θ) analogous to that of dχ2(θ), Darling (1955) has investigated the large sample probability distributions of Dω2 = dω2(θ̂) and DK = dK(θ̂) for efficient estimators θ̂ of θ analogous to the estimators θ̃ for χ2. Note that, in the absence of any χ2-like categories, the ordinary full-information maximum likelihood estimators now do qualify as estimators θ̂.
A major problem now is, however, that the modified procedures are no longer nonparametric. Thus a special investigation is required for every composite hypothesis. This is done by Kac, Kiefer, and Wolfowitz (1955) for the normal scale-location family, and the resulting large sample distribution is partly tabulated.
Tests based on special characteristics
The alternatives of concern sometimes differ from a composite null hypothesis in a manner easily described by a standard shape parameter. Special tests have been proposed for such cases. For example, the sample skewness coefficient has been suggested (Geary 1947) for testing normality against skew alternatives. Again, for testing Poissonness against alternatives with variance unequal to mean, R. A. Fisher has recommended the variance-to-mean ratio (see Cochran 1954). This measure is approximately distributed as $\chi^2_{n-1}/(n-1)$ when Poissonness in fact obtains, for λ > 1 and n > 15 (Sukhatme 1938); this follows from the fact that the denominator is then a high-precision estimate of λ, and the numerator is approximately distributed as $\lambda\chi^2_{n-1}/(n-1)$. Analogous recommendations apply to testing binomiality. Essentially the same point of view underlies tests of normality based on the ratio of the mean deviation, or of the range, to the standard deviation.
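A sketch of the variance-to-mean (dispersion) computation just described, referring (n − 1) times the ratio to the $\chi^2_{n-1}$ distribution; the simulated counts are illustrative.

```python
# A sketch of the variance-to-mean (dispersion) test for Poissonness.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x = rng.poisson(lam=4.0, size=60)
n = len(x)

ratio = x.var(ddof=1) / x.mean()             # variance-to-mean ratio
# (n - 1) * ratio is referred to chi-square with n - 1 degrees of freedom.
print(ratio, stats.chi2.sf((n - 1) * ratio, df=n - 1))
```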
Transforming into simple hypotheses
Another interesting approach to the composite problem, advocated by Hogg and also by Sarkadi (1960), is to transform certain composite hypotheses into equivalent simple ones.
Specifically, there are location-scale parametric families {F0[(x − μ)/σ]} with the following property: a random sample from any particular F0[(x − μ)/σ] is reducible by a transformation T to a new set of random variables, Y = T(X), constituting in effect a random sample from a distribution G(y) involving no unknown parameters at all. Moreover, only random samples from distributions F0[(x − μ)/σ] lead to G(y) when operated on by T.
It then follows that testing the composite hypothesis H that (X1, …, Xn) is a random sample from a distribution F0[(x − μ)/σ], with some μ and some σ, is equivalent to testing the hypothesis H′ that (Y1, …, Ym) is a random sample from the distribution G(y). Any of the tests for simple hypotheses is then available for testing H′. An example is provided by a negative exponential F0 and uniform G, in which case the ordered exponential random sample (X(1), …, X(n)) is transformed into an ordered uniform random sample (Y(1), …, Y(n−2)) by the transformation

$$Y_{(i)} = \frac{\sum_{j=1}^{i} (n-j)\left(X_{(j+1)} - X_{(j)}\right)}{\sum_{j=1}^{n-1} (n-j)\left(X_{(j+1)} - X_{(j)}\right)}, \qquad i = 1, \ldots, n-2.$$
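A sketch of this reduction, using the normalized-spacings construction as one concrete realization of the transformation T (the specific form is an assumption of the sketch); the resulting n − 2 values are then tested for uniformity by Kolmogorov's test.

```python
# A sketch of the exponential-to-uniform reduction via normalized spacings.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
mu, sigma = 3.0, 2.0                              # unknown in practice
x = np.sort(mu + sigma * rng.exponential(size=50))
n = len(x)

i = np.arange(1, n)                               # i = 1, ..., n - 1
d = (n - i) * np.diff(x)                          # normalized spacings: i.i.d.
                                                  # exponential, free of the location mu
s = np.cumsum(d)
y = s[:-1] / s[-1]                                # Y(1), ..., Y(n-2): ordered uniform,
                                                  # free of the scale sigma

# The composite hypothesis is now the simple hypothesis of uniformity.
print(stats.kstest(y, "uniform"))
```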
Conditioning
Another way of neatly doing away with the unknown parameter is to consider the conditional distribution of the sample, given a sufficient estimate of it. This method is advocated, at least for testing Poissonness, in Fisher (1950).
Tests related to probability plots
S. S. Shapiro and M. B. Wilk have quantified in various ways the departure from linearity of the sorts of probability plots mentioned above, in particular of the plot of the ordered sample values against the expected values of the standardized order statistics [see Nonparametric Statistics, article on Order Statistics]. This new approach bears some similarity to one given in Darling (1955), which is based on the measure dω2 modified for the composite case. Both approaches, in a sense, compare adjusted observed order statistics with standardized order statistic expectations. But the approach of Shapiro and Wilk is tailored more explicitly to particular scale-location families, by using their particular order statistic variances and covariances. It is no wonder that preliminary evaluations of this sort of approach (for example, by Shapiro & Wilk 1965) have shown exceptional promise. As an added bonus, the procedure is similar over the entire scale-location family; that is, its probability distribution is independent of location and scale.
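The Shapiro-Wilk procedure is available directly in SciPy; a minimal illustration on simulated normal and skew samples (data and seed are the example's own).

```python
# Shapiro-Wilk tests on simulated data, via scipy.stats.shapiro.
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
print(stats.shapiro(rng.normal(loc=5, scale=2, size=40)))   # large p-value expected
print(stats.shapiro(rng.exponential(size=40)))              # normality implausible
```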
Approximate hypotheses
The first, and seemingly most practically developed, attempt to provide the requisite tests of approximate hypotheses is found in Hodges and Lehmann (1954). Hodges and Lehmann assume the k typical categories of the χ2 test and formulate the approximate simple hypothesis in terms of the discrepancy d(p, p0) between the category probabilities pi under FX and the category probabilities p0,i under a simple hypothesis F0. A very tractable discrepancy measure of this type is ordinary distance, for which the approximate hypothesis takes the form

$$d(p, p_0) = \left[ \sum_{i=1}^{k} (p_i - p_{0,i})^2 \right]^{1/2} \le \delta.$$
Denoting Oi/n by oi, the suggested test reduces, essentially, to the one-sided test of the hypothesis d(p, p0) = δ based on the approximately normal statistic S = [d(o, p0) − δ]/σ̂, where σ̂ is the standard deviation, estimated from the sample frequencies oi, of d(o, p0). For example, when F0 specifies k categories with p0,i = 1/k, one treats as unit normal (under the null hypothesis) the statistic S computed with d(o, p0) = [Σi (oi − 1/k)2]1/2, and uses an upper-tail test. Thus a value of S of 1.645 leads to a sample significance level of .05. This approach lends itself easily to the computation of power and is extended as well by Hodges and Lehmann to the testing of approximate composite hypotheses.
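A sketch of the test of the approximate hypothesis d(p, p0) ≤ δ follows; here σ̂ is obtained by a multinomial delta-method estimate, which is one concrete way, assumed for this example, of realizing the estimated standard deviation above.

```python
# A sketch of a Hodges-Lehmann style test of d(p, p0) <= delta.
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
k, n, delta = 5, 400, 0.05
p0 = np.full(k, 1.0 / k)                       # hypothesized equal category probabilities
O = rng.multinomial(n, [0.22, 0.20, 0.20, 0.20, 0.18])
o = O / n                                      # observed relative frequencies o_i

d = np.sqrt(np.sum((o - p0) ** 2))             # distance d(o, p0); assumed nonzero here
g = (o - p0) / d                               # gradient of d(., p0) at o
cov = (np.diag(o) - np.outer(o, o)) / n        # estimated multinomial covariance of o
sigma_hat = np.sqrt(g @ cov @ g)               # delta-method standard deviation of d(o, p0)

S = (d - delta) / sigma_hat                    # approximately unit normal under the hypothesis
print(S, stats.norm.sf(S))                     # upper-tail sample significance level
```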
Extension of other tests for simple hypotheses to the testing of approximate hypotheses has been considered by J. Rosenblatt (1962) and by Kac, Kiefer, and Wolfowitz (1955).
Further topics
That the sample is random may itself be in doubt, and tests have been designed to have power against specific sorts of departure from randomness. For example, tests of the hypothesis of randomness against the alternative hypothesis that the data are subject to a Markov structure are given by Billingsley (1961) and Goodman (1959); the latter work also covers testing that the data have Markov structure of a given order against the alternative that the data have Markov structure of higher order, and the testing of hypothesized values of transition probabilities when a Markov structure of given order is assumed [seeMarkov chains].
Many of the tests described in this article can be extended to several-sample procedures for testing the hypothesis that several populations are in fact distributed identically; thus, as first suggested in Smirnov (1939a), if Gm(x) denotes the proportion of values less than or equal to x, in an independent random sample (Y1, …, Ym) from a second population, dK(Fn, Gm) provides a natural test of the hypothesis that the two continuous population distribution functions FX and GY coincide. Many of these extensions are functions only of the relative ranks of the two samples and, as such, are nonparametric, that is, their null probability distributions do not depend on the common functional form of FX and GY. [Several-sample nonparametric procedures are discussed in Nonparametric Statistics.]
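A minimal two-sample illustration of the Smirnov extension, using scipy.stats.ks_2samp on simulated samples.

```python
# Two-sample Kolmogorov-Smirnov test on simulated samples.
import numpy as np
from scipy import stats

rng = np.random.default_rng(10)
x = rng.normal(size=80)                      # sample from F_X
y = rng.normal(loc=0.5, size=120)            # sample from G_Y
print(stats.ks_2samp(x, y))                  # d_K(F_n, G_m) and its significance level
```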
Another topic is that of tests of goodness of fit as preliminary tests of significance, in a sense discussed, for example, by Bancroft (1964). That tests of goodness of fit are typically applied in this sense is recognized by Chapman (1958), and the probabilistic properties of certain “nested” sequences of tests beginning with a test of goodness of fit have been considered by Hogg (1965). The Bayes and information theory approaches to χ2 tests of goodness of fit are also important (see Lindley 1965; Kullback 1959).
H. T. David
[Directly related are the entries Hypothesis Testing; Significance, Tests of. Other relevant material may be found in Counted Data; Estimation; Nonparametric Statistics.]
BIBLIOGRAPHY
Bancroft, T. A. 1964 Analysis and Inference for Incompletely Specified Models Involving the Use of Preliminary Test(s) of Significance. Biometrics 20:427–442.
Berkson, Joseph 1938 Some Difficulties of Interpretation Encountered in the Chi-square Test. Journal of the American Statistical Association 33:526–536.
Billingsley, Patrick 1961 Statistical Methods in Markov Chains. Annals of Mathematical Statistics 32:12–40.
Birnbaum, Z. W. 1953 Distribution-free Tests of Fit for Continuous Distribution Functions. Annals of Mathematical Statistics 24:1–8.
Chapman, Douglas G. 1958 A Comparative Study of Several One-sided Goodness-of-fit Tests. Annals of Mathematical Statistics 29:655–674.
Chernoff, Herman; and Lehmann, E. L. 1954 The Use of Maximum Likelihood Estimates in Tests for Goodness of Fit. Annals of Mathematical Statistics 25:579–586.
Cochran, William G. 1952 The χ2 Test of Goodness of Fit. Annals of Mathematical Statistics 23:315–345.
Cochran, William G. 1954 Some Methods for Strengthening the Common χ2 Tests. Biometrics 10:417–451.
Darling, D. A. 1955 The Cramer-Smirnov Test in the Parametric Case. Annals of Mathematical Statistics 26:1–20.
Darling, D. A. 1957 The Kolmogorov-Smirnov, Cramer-Von Mises Tests. Annals of Mathematical Statistics 28:823–838.
Durbin, J. 1961 Some Methods of Constructing Exact Tests. Biometrika 48:41–55.
Fisher, R. A. 1924 The Conditions Under Which χ2 Measures the Discrepancy Between Observation and Hypothesis. Journal of the Royal Statistical Society 87:442–450.
Fisher, R. A. 1950 The Significance of Deviations From Expectation in a Poisson Series. Biometrics 6:17–24.
Fix, Evelyn; Hodges, J. L. Jr.; and Lehmann, E. L. 1954 The Restricted Chi-square Test. Pages 92-107 in Ulf Grenander (editor), Probability and Statistics. New York: Wiley.
Geary, R. C. 1947 Testing for Normality. Biometrika 34:209–242.
Goodman, Leo A. 1954 Kolmogorov-Smirnov Tests for Psychological Research. Psychological Bulletin 51: 160–168.
Goodman, Leo A. 1959 On Some Statistical Tests for mth Order Markov Chains. Annals of Mathematical Statistics 30:154–164.
Greenwood, Joseph A.; and Hartley, H. O. 1962 Guide to Tables in Mathematical Statistics. Princeton Univ. Press. → A sequel to the guides to mathematical tables produced by and for the Committee on Mathematical Tables and Aids to Computation of the National Academy of Sciences–National Research Council of the United States.
Hodges, J. L. Jr.; and Lehmann, E. L. 1954 Testing the Approximate Validity of Statistical Hypotheses. Journal of the Royal Statistical Society Series B 16: 261–268.
Hogg, Robert V. 1965 On Models and Hypotheses With Restricted Alternatives. Journal of the American Statistical Association 60:1153–1162.
Kac, M.; Kiefer, J.; and Wolfowitz, J. 1955 On Tests of Normality and Other Tests of Goodness of Fit Based on Distance Methods. Annals of Mathematical Statistics 26:189–211.
Keats, J. A.; and Lord, Frederic M. 1962 A Theoretical Distribution for Mental Test Scores. Psychometrika 27:59–72.
Kempthorne, O. 1966 The Classical Problem of Inference: Goodness of Fit. Unpublished manuscript. → Paper presented at the Fifth Berkeley Symposium on Mathematical Statistics and Probability; Proceedings to be published.
Kolmogorov, A. N. 1933 Sulla determinazione empirica di una legge di distribuzione. Istituto Italiano degli Attuari, Giornale 4:83–99.
Kuiper, Nicolaas H. 1960 Tests Concerning Random Points on a Circle. Akademie van Wetenschappen, Amsterdam, Proceedings Series A 63:38–47.
Kullback, S. 1959 Information Theory and Statistics. New York: Wiley.
Lindley, D. V. 1965 Introduction to Probability and Statistics From a Bayesian Viewpoint. Volume 2: Inference. Cambridge Univ. Press.
Mann, H. B.; and Wald, A. 1942 On the Choice of the Number of Class Intervals in the Application of the Chi Square Test. Annals of Mathematical Statistics 13:306–317.
Marshall, A. W. 1958 The Small Sample Distribution of nω2. Annals of Mathematical Statistics 29:307–309.
Massey, Frank J. Jr. 1951 The Kolmogorov-Smirnov Test for Goodness of Fit. Journal of the American Statistical Association 46:68–78.
Neyman, Jerzy 1937 “Smooth Test” for Goodness of Fit. Skandinavisk aktuarietidskrift 20:149–199.
Neyman, Jerzy 1949 Contribution to the Theory of the χ2 Test. Pages 239–273 in Berkeley Symposium on Mathematical Statistics and Probability, First, Proceedings. Berkeley: Univ. of California Press.
Patankar, V. N. 1954 The Goodness of Fit of Frequency Distributions Obtained From Stochastic Processes. Biometrika 41:450–462.
Pearson, E. S. 1938 The Probability Integral Transformation for Testing Goodness of Fit and Combining Independent Tests of Significance. Biometrika 30:134–148.
Pearson, E. S. 1963 Comparison of Tests for Randomness of Points on a Line. Biometrika 50:315–325.
Pearson, Karl 1900 On the Criterion That a Given System of Deviations From the Probable in the Case of a Correlated System of Variables Is Such That It Can Be Reasonably Supposed to Have Arisen From Random Sampling. Philosophical Magazine 5th Series 50:157–175.
Pyke, Ronald 1965 Spacings. Journal of the Royal Statistical Society Series B 27:395–449.
Rosenblatt, Judah 1962 Testing Approximate Hypotheses in the Composite Case. Annals of Mathematical Statistics 33:1356–1364.
Rosenblatt, Murray 1952 Remarks on a Multivariate Transformation. Annals of Mathematical Statistics 23: 470–472.
Sarkadi, Károly 1960 On Testing for Normality. Magyar Tudományos Akadémia, Matematikai Kutató Intézet, Közleményei Series A 5:269–274.
Shapiro, S. S.; and Wilk, M. B. 1965 An Analysis of Variance Test for Normality (Complete Samples). Biometrika 52:591–611.
Slakter, Malcolm J. 1965 A Comparison of the Pearson Chi-square and Kolmogorov Goodness-of-fit Tests With Respect to Validity. Journal of the American Statistical Association 60:854–858.
Smirnov, N. V. 1939a On the Estimation of the Discrepancy Between Empirical Curves of Distribution for Two Independent Samples. Moscow, Universitet, Bulletin mathématique, Série internationale 2, no. 2:3–26.
Smirnov, N. V. 1939b Ob ukloneniiakh empiricheskoi krivoi raspredeleniia (On the Deviations of the Empirical Distribution Curve). Matematicheskii sbornik New Series 6, no. 1:1–26. → Includes a French resume.
Sukhatme, P. V. 1938 On the Distribution of χ2 in Samples of the Poisson Series. Journal of the Royal Statistical Society 5 (Supplement):75–79.
Suppes, Patrick et al. 1964 Empirical Comparison of Models for a Continuum of Responses With Non-contingent Bimodal Reinforcement. Pages 358–379 in R. C. Atkinson (editor), Studies in Mathematical Psychology. Stanford Univ. Press.
Watson, G. S. 1957 The χ2 Goodness-of-fit Test for Normal Distributions. Biometrika 44:336–348.
Watson, G. S. 1961 Goodness-of-fit Tests on a Circle. Biometrika 48:109–114.
Weiss, Lionel 1958 Limiting Distributions of Homogeneous Functions of Sample Spacings. Annals of Mathematical Statistics 29:310–312.
Williams, C. Arthur Jr. 1950 On the Choice of the Number and Width of Classes for the Chi-square Test of Goodness of Fit. Journal of the American Statistical Association 45:77–86.