Hypothesis Testing
The formulation of hypotheses and their testing through observation are essential steps in the scientific process. A detailed discussion of their role in the development of scientific theories is given by Popper (1934). On the basis of observational evidence, a hypothesis is either accepted for the time being (until further evidence suggests modification) or rejected as untenable. (In the latter case, it is frequently desirable to indicate also the direction and size of the departure from the hypothesis.)
It is sometimes possible to obtain unequivocal evidence regarding the validity of a hypothesis. More typically, the observations are subject to chance variations, such as measurement or sampling errors, and the same observations could have occurred whether the hypothesis is true or not, although they are more likely in one case than in the other. It then becomes necessary to assess the strength of the evidence and, in particular, to decide whether the deviations of the observations from what ideally would be expected under the hypothesis are too large to be attributed to chance. This article deals with methods for making such decisions: the testing of statistical hypotheses.
Probability models. A quantitative evaluation of the observational material is possible only on the basis of quantitative assumptions regarding the errors and other uncertainties to which the observations are subject. Such assumptions are conveniently formulated in terms of a probability model for the observations. In such a model, the observations appear as the values of random variables, and the hypothesis becomes a statement concerning the distribution of these variables [see Probability, article on Formal probability].
The following are examples of some simple basic classes of probability models. Some of the most important applications of these models are to samples drawn at random from large populations and possibly subject to measurement errors.
Example 1—binomial model. If X is the number of successes in n independent dichotomous trials with constant probability p of “success,” then X has a binomial distribution. This model is applicable to large (nominally infinite) populations whose members are of two types (voters favoring one of two candidates, inmates of mental institutions who are or are not released within one year) of which one is conventionally called “success” and the other “failure.” The trials are the drawings of the n members of the population to be included in the sample. This model is realistic only if the population is large enough so that the n drawings are essentially independent.
Example 2—binomial two-sample model. To compare two proportions referring to two different (large) populations (voters favoring candidate A in two different districts, mental patients in two different institutions), a sample is drawn from each population. If the sample sizes are m and n, the observed proportions in the samples are X/m and Y/n, and the proportions in the populations are p1 and p2, then the model may assume that X and Y have independent binomial distributions. The same model may also be applicable when two samples are drawn from the same large population and subjected to different treatments.
Example 3—multinomial model. If a sample of size n is drawn from a (large) population whose members are classified into k types and the number of members in the sample belonging to each type is X1,…,Xk respectively, then an appropriate model may assign to (X1,…,Xk) a multinomial distribution.
Example 4—normal model. If Z1,…,Zn are measurements of the same characteristic taken on the n members of a sample (for example, test scores on a psychological test for n subjects or skull width for n skulls), an appropriate model may assume that Z1,…,Zn are independently and normally distributed with common mean μ and variance σ2.
Example 5—normal two-sample model. To study the effect of a treatment (for example, the effect of training or of a drug on a test score) two independent samples may be obtained, of which the first serves as control (is not treated) and the second receives the treatment. If the measurements of the untreated subjects are X1,…,Xm and those of the treated subjects are Y1,…,Yn, it may be reasonable to assume that X1,…,Xm ; Y1,…,Yn are all independently normally distributed—the X’s with mean μx and variance σx², the Y’s with mean μy and variance σy². Frequently it may be realistic to make the additional assumption that σx² = σy², that is, that the variance of the measurements is not affected by the treatment.
Example 6—nonparametric one-sample model. If the normality assumption in example 4 cannot be justified, it may instead be assumed only that Z1,…,Zn are independently distributed, each according to the same continuous distribution, F, about which no other assumption is made.
Example 7—nonparametric two-sample model. If the normality assumption in example 5 cannot be justified, it may instead be assumed only that X1,…,Xm ; Y1,…,Yn are independently distributed, the X’s according to a continuous distribution F, the Y’s according to G. It may be realistic to suppose that the treatment has no effect on the shape or spread of the distribution but only on its location.
In a testing problem the model is never completely specified, for if there were no unknown element in the model, it would be known whether the hypothesis is true or false. One is thus dealing not with a single model but rather with a class of models, say Ω. For example, in a problem for which the models of example 1 are appropriate, Ω may consist of all binomial models corresponding to n trials and with p having any value between 0 and 1. If the model is specified except for certain parameters (the probabilities p1,…,pk in example 3, the mean, μ, and the variance, σ2, in example 4), the class Ω is called parametric; otherwise, as in examples 6 and 7, it is nonparametric [see Nonparametric statistics].
Statistical hypotheses. A hypothesis, when expressed in terms of a class of probability models, becomes a statement imposing additional restrictions on the class of models or on the distributions specified by the models.
Example 8. The hypothesis that the probability p2 of a cure with a new treatment is no higher than the probability p1 of a cure with the standard treatment, in the model of example 2, states that the parameters p1, p2 satisfy H: p2 ≤ p1.
Example 9. Consider the hypothesis that the rate at which a rat learns to run a maze is unaffected by its having previously learned to run a different maze. If X1,…,Xm denote the learning times required by m control rats who have not previously run a maze and Y1,…,Yn denote the learning times of n rats with previous experience on another maze, and if the model of example 7 is assumed, then the hypothesis of no effect states that the distributions F and G satisfy H: G = F. (Since, as in the present example, hypotheses frequently state the absence of an effect, a hypothesis under test is sometimes referred to as the null hypothesis.)
Hypotheses about a single parameter. Hypotheses in parametric classes of models frequently concern only a single parameter, such as μ in example 4 or μy - μx in example 5, the remaining parameters being “nuisance parameters.” The most common hypotheses concerning a single parameter θ either (a) completely specify the value of the parameter, for example, state that p = ½ in example 1, that μ = 0 in example 4, or that μy - μx = 0 in example 5—in general, such a hypothesis states that θ = θ0, where θ0 is the specified value; or (b) state that the parameter does not exceed (or does not fall short of) a specified value, for example, the hypothesis p ≤ ½ in example 1 or μx - μy ≤ 0 in example 5—the general form of such a hypothesis is H1 : θ ≤ θ0 (or H2 : θ ≥ θ0).
Two other important, although not quite so common, hypotheses state (c) that the parameter θ does not differ from a specified value θ0 more than a given amount Δ : |θ - θ0| ≤ Δ or, equivalently, that θ lies in some specified interval a ≤ θ ≤ b or (d) that the parameter θ lies outside some specified interval.
Hypotheses about several parameters. In a parametric model involving several parameters, the hypothesis may of course concern more than one parameter. Thus, in example 3, one may wish to test the hypothesis that all the probabilities p1,…,pk have specified values. In example 5, the hypothesis might state that the point (μx, μy) lies in a rectangle H : a1 ≤ μx ≤ a2, b1 ≤ μy ≤ b2, or that it lies in a circle (μx − a)² + (μy − b)² ≤ c², etc.
Hypotheses in nonparametric models. The variety of hypotheses that may arise in nonparametric models is illustrated by the following hypotheses, which have often been considered in connection with examples 6 and 7. In example 6, (1) F is the normal distribution with zero mean and unit variance; (2) F is a normal distribution (mean and variance unspecified); and (3) F is symmetric about the origin. In example 7, (1) G = F; (2) G(x) ≤ F(x) for all x; and (3) for no x do G(x) and F(x) differ by more than a specified value Δ.
Simple and composite hypotheses. A hypothesis, by imposing restrictions on the original class Ω of models, defines the subclass ΩH of those models of Ω that satisfy the restrictions. If the hypothesis H completely specifies the model, so that ΩH contains only a single model, then H is called simple; otherwise it is composite. Examples of simple hypotheses are the hypothesis in example 1 and the hypothesis that F is the normal distribution with zero mean and unit variance in example 6. Examples of composite hypotheses are the hypothesis p2 ≤ p1 in example 2, the hypothesis μ = 0 in example 4 when σ2 is unknown, and the hypothesis that F is a normal distribution (mean and variance unspecified) in example 6.
Tests of hypotheses. A test of a hypothesis H is a rule that specifies for each possible set of values of the observations whether to accept or reject H, should these particular values be observed. It is therefore a division of all possible sets of values (the so-called sample space) into two groups: those for which the (null) hypothesis will be accepted (the acceptance region) and those for which it will be rejected (the rejection region or critical region).
Tests are typically defined in terms of a test statistic T, extreme values of which are highly unlikely to occur if H is true but are not surprising if H is false. To be specific, suppose that large values of T (and no others) are surprising if H is true but are not surprising if it is false. It is then natural to reject H when T is sufficiently large, say when

(1) T ≥ c,

where c is a suitable constant, called the critical value.
The above argument shows that the choice of an appropriate test does not depend only on the hypothesis. The choice also depends on the ways in which the hypothesis can be false, that is, on the models of Ω not satisfying H (not belonging to ΩH); these are called the alternatives (or alternative hypotheses) to H. Thus, in example 8, the alternatives consist of the models of example 2 satisfying p2 > p1; in example 9, they consist of the models of example 7 satisfying G ≠ F.
The following two examples illustrate how the choice of the values of T for which H is rejected and the choice of T itself depend on the class of alternatives.
Example 10. Consider in example 1 the hypothesis H: p = ½ and the three different sets of alternatives: p is less than ½, p is greater than ½, or p is different from (either less or greater than) ½. Since one expects the proportion X/n of successes to be close to p, it is natural to reject H against the alternative p < ½ if X/n is too small. (Very small values of X/n would be surprising under H but not under the alternatives.) Similarly, one would reject H against the alternative p > ½ if X/n is too large. Finally, H would be rejected against the alternative p ≠ ½ if X/n is either too large or too small, for example, if |X/n − ½| is sufficiently large.
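A minimal computational sketch of example 10, assuming Python with the scipy library and illustrative values n = 20 and α = .05 (the sample size, level, and variable names are chosen only for this illustration):

```python
# A sketch of the rejection regions of example 10: testing H: p = 1/2 in the
# binomial model with (illustrative) n = 20 trials and nominal level alpha = .05.
from scipy.stats import binom

n, p0, alpha = 20, 0.5, 0.05

# Against p > 1/2: reject when X >= c, with c the smallest value for which
# P(X >= c) <= alpha under H.
c_upper = min(c for c in range(n + 1) if binom.sf(c - 1, n, p0) <= alpha)

# Against p < 1/2: reject when X <= c, with c the largest value for which
# P(X <= c) <= alpha under H.
c_lower = max(c for c in range(n + 1) if binom.cdf(c, n, p0) <= alpha)

# Against p != 1/2: reject when X/n is too far from 1/2 in either direction,
# here splitting alpha equally between the two tails.
c_two = min(c for c in range(n + 1) if binom.sf(c - 1, n, p0) <= alpha / 2)

print(c_upper, c_lower, c_two)
```

Because X is discrete, the attained sizes fall somewhat below the nominal α; this point is taken up under "Determination of critical value" below.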
Alternatives of the first two types of this example and the associated tests are called one-sided; those corresponding to the third type are called two-sided.
Example 11. In example 7, consider the hypothesis H: G = F that the Y’s and X’s have the same distribution against the alternatives that the Y’s tend to be larger than the X’s. A standard test for this problem is based on the Wilcoxon statistic, W, which counts the number among the mn pairs (Xi, Yj) for which Yj exceeds Xi. The hypothesis is rejected if W is too large [see Nonparametric statistics].
Suppose instead that the alternatives to H state that the Y’s are more spread out than the X’s, or only that G and F are unequal without specifying how they differ. Then W is no longer an appropriate test statistic, since very large (or small) values of W are not necessarily more likely to occur under such alternatives than under the hypothesis.
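As an illustration of how W might be computed, the following sketch assumes Python with numpy and scipy; the data are invented, and the library routine used (mannwhitneyu) computes the same count in its Mann-Whitney form.

```python
# A sketch of the Wilcoxon statistic W of example 11, on invented data.
import numpy as np
from scipy.stats import mannwhitneyu

x = np.array([1.1, 2.3, 0.7, 1.9])           # controls (X's)
y = np.array([2.0, 2.8, 3.1, 1.5, 2.6])      # treated (Y's)

# W counts, among the m*n pairs (Xi, Yj), those in which Yj exceeds Xi.
W = sum(yj > xi for xi in x for yj in y)

# The same count is the Mann-Whitney U statistic; the one-sided alternative
# is that the Y's tend to be larger, so H is rejected for large W.
U, p_value = mannwhitneyu(y, x, alternative="greater")
print(W, U, p_value)
```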
Significance. To specify the test (1) completely, it is still necessary to select a critical value. This selection is customarily made on the basis of the following consideration. The values T ≥ c, for which the hypothesis will be rejected, could occur even if the hypothesis H were true; they would then, however, be very unlikely and hence very surprising. A measure of how surprising such values are under H is the probability of observing them when H is true. This probability, PH(T ≥ c), is called the significance level (or size) of the test. The traditional specification of a critical value is in terms of significance level. A value α (typically a small value such as .01 or .05) of this level is prescribed, and the critical value c is determined by the equation

(2) PH(T ≥ c) = α.
Values of T that are greater than or equal to c, and for which the hypothesis is therefore rejected, are said to be (statistically) significant at level α. This expresses the fact that although such extreme values could have occurred under H, this event is too unlikely (its probability being only α) to be reasonably explained by random fluctuations under H.
Tests and hypotheses suggested by data. In stating that the test determined by (1) and (2) rejects the hypothesis H with probability α when H is true, it is assumed that H and the rejection region (1) were determined before the observations were taken. If, instead, either the hypothesis or the test was suggested by the data, the actual significance level of the test will be greater than α, since then other sets of observations would also have led to rejection. In such cases, the prescribed significance levels can be obtained by carrying out the test on a fresh set of data. There also exist certain multiple-decision procedures that permit the testing, at a prescribed level, of hypotheses suggested by the data [see Linear hypotheses, article on Multiple comparisons].
Determination of critical value. The actual determination of c from equation (2) for a given value of α is simple if there exists a table of the distribution of T under H. In cases where a complete table is not available, selected percentage points of the distribution, that is, the values of c corresponding to selected values of α, may have been published. If, instead, c has to be computed, it is frequently convenient to proceed as follows.
Let t be the observed value of the test statistic T. Then the probability â of obtaining a value at least as extreme as that observed is called the significance probability (also P-value, sample significance level, and descriptive level of significance) of the observed value t and is given by

(3) â = PH(T ≥ t).
For the observed value t, the hypothesis is rejected if t ≥ c (and hence if â ≤ α) and is otherwise accepted. By computing â one can therefore tell whether H should be rejected or accepted at any given significance level α from the rule

(4) reject H if â ≤ α; accept H if â > α.
This rule, which is equivalent to the test defined by (1) and (2), requires only the computation of the probability (3); this is sometimes more convenient than determining c from (2).
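The following sketch illustrates the computation of the significance probability (3) and the rule (4) in the binomial model of example 1, assuming Python with scipy; the observed value and the level are invented for the illustration.

```python
# A sketch of the significance probability (3) and rule (4) in the binomial
# model of example 1, with H: p = 1/2 and the one-sided alternative p > 1/2.
from scipy.stats import binom

n, p0 = 20, 0.5
t = 15                                  # observed number of successes (invented)

a_hat = binom.sf(t - 1, n, p0)          # (3): the probability of T >= t under H

alpha = 0.05
decision = "reject H" if a_hat <= alpha else "accept H"   # rule (4)
print(a_hat, decision)
```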
When publishing the result of a statistical test, it is good practice to state not only whether the hypothesis was accepted or rejected at the chosen significance level (particularly since the choice of level is typically rather arbitrary) but also to publish the significance probability. This enables others to perform the test at the level of their choice by applying (4). It also provides a basis for combining the results of the test with those of other independent tests that may be performed at other times. (Various methods for combining a number of independent significance probabilities are discussed by Birnbaum 1954.) If no tables are available for the distribution of T but the critical values c of (2) are tabled for a number of different levels α, it is desirable at least to give the largest tabled value α at which the observations are nonsignificant and the smallest level at which they are significant, for example, “significant at 5 per cent, nonsignificant at 1 per cent.” Actually, whenever possible, some of the basic data should be published so as to permit others to carry the statistical analysis further (for example, to estimate the size of an effect, to check the adequacy of the model, etc.).
In addition to the above uses, the significance probability—by measuring the degree of surprise at getting a value of T as extreme as or more extreme than the observed value t—gives some indication of the strength of the evidence against H.
The smaller â is, the more surprising it is to get this extreme a value under H and, therefore, the stronger the evidence against H.
The use of equation (2) for determining c from α involves two possible difficulties:
(a) If H is composite, the left-hand side of (2) may have different values for the different distributions of ΩH. In this case, equation (2) is replaced by

(5) maximum over ΩH of PH(T ≥ c) = α,
that is, the significance level or size is defined as the maximum probability of rejection under H. As an illustration, let T be Student’s t-statistic for testing H: μy ≤ μx in example 5 [see Linear hypotheses, article on Analysis of variance]. Here the maximum probability of rejection under H occurs when μy = μx, so that c is determined by the condition Pμy=μx(T ≥ c) = α. This example illustrates the fact that PH(T ≥ c) typically takes on its maximum on the boundary between the hypothesis and the alternatives.
(b) If the distribution of T is discrete, there may not exist a value c for which (2) or (5) holds. In practice, it is then usual to replace the originally intended significance level by the closest smaller (or larger) value that is attainable. In theoretical comparisons of tests, it is sometimes preferable instead to get the exact value α through randomization—namely, to reject H if T > c, to accept H if T < c, and if T = c to reject or accept with probability ρ and 1 − ρ respectively, where ρ is determined by the equation PH(T > c) + ρPH(T = c) = α.
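A sketch of this randomization device for a discrete statistic, assuming Python with scipy and an illustrative binomial model (the values of n and α are invented):

```python
# A sketch of the randomization device for a discrete test statistic, in an
# illustrative binomial model with n = 10 and exact level alpha = .05.
from scipy.stats import binom

n, p0, alpha = 10, 0.5, 0.05

# Choose c so that P(T > c) <= alpha <= P(T >= c) under H.
c = min(k for k in range(n + 1) if binom.sf(k, n, p0) <= alpha)

# The randomization probability rho solves P(T > c) + rho * P(T = c) = alpha.
rho = (alpha - binom.sf(c, n, p0)) / binom.pmf(c, n, p0)
print(c, rho)    # reject if T > c; if T = c, reject with probability rho
```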
Power and choice of level. Suppose that a drug is being tested for possible beneficial effect on schizophrenic patients, with the hypothesis H stating that the drug has no such effect. Then a small significance level, by controlling the probability of falsely rejecting H when it is true, gives good protection against the possibility of falsely concluding that the drug is beneficial when in fact it is not. The test may, however, be quite unsatisfactory in its ability to detect a beneficial effect when one exists. This ability is measured by the probability of rejecting H when it is false, that is, by the probability PA(rejecting H), where A indicates an alternative to H (in the example, an average effect of a given size). This probability, which for tests of the form (1) is equal to PA(T ≥ c), is called the power of the test against the alternative A. The probability of rejecting H and the complementary probability of accepting H, as functions of the model or of the parameters specifying the model, are called respectively the power function and the operating characteristic of the test. Either of these functions describes the performance of the test against the totality of models of Ω.
Unfortunately, the requirements of high power and small level are in conflict, since the smaller the significance level, the larger is c and the smaller, therefore, is the power of the test. When choosing a significance level (and hence c), it is necessary to take into account the power of the resulting test against the alternatives of interest. If this power is too low to be satisfactory, it may be necessary to permit a somewhat larger significance level in order to provide for an increase in power. To increase power without increasing significance level, it is necessary to find a better test statistic or to improve the basic structure of the procedure, for example, by increasing sample size.
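The conflict between level and power can be seen numerically in the binomial model of example 1. The following sketch (Python with scipy; the values of n and of the alternative are illustrative) computes the power of the one-sided test at two levels:

```python
# A sketch of the power of the one-sided binomial test of H: p <= 1/2 against
# the alternative p = 0.7, at two levels (illustrative values, n = 20).
from scipy.stats import binom

n, p0, pA = 20, 0.5, 0.7

def power(alpha):
    # critical value: the smallest c with P(X >= c) <= alpha under H
    c = min(k for k in range(n + 1) if binom.sf(k - 1, n, p0) <= alpha)
    # power: P(X >= c) computed under the alternative p = 0.7
    return binom.sf(c - 1, n, pA)

print(power(0.05), power(0.01))   # the smaller level yields the smaller power
```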
The problem of achieving a balance between significance level and power may usefully be considered from a slightly different point of view. A test, by deciding to reject or to accept H, may come to an erroneous decision in two different ways: by rejecting when H is true (error of the first kind) or by accepting when H is false (error of the second kind). The probabilities of these two kinds of errors are

(6) PH(rejecting H) = PH(T ≥ c)

and

(7) PA(accepting H) = PA(T < c).
It is desirable to keep both the first probability and the second probability low, and these two requirements are usually in conflict. Any rule for balancing them must involve, at least implicitly, a weighing of the seriousness of the two kinds of error.
Choice of test. The constant c determines the size of the test (1) not only in the technical sense of (2) or (5) but also in the ordinary sense of the word, since the larger c is, the smaller is the rejection region. In the same sense, the test statistic T determines the shape of the test, that is, of the rejection region. The problem of selecting T is one of the main concerns of the theory of hypothesis testing.
A basis for making this selection is provided by the fact that of two tests of the same size, the one that has the higher power is typically more desirable, for with the same control of the probability of an error of the first kind, it gives better protection against errors of the second kind. The most satisfactory level α test against a particular alternative A is therefore the test that, subject to (5), maximizes the power against A: the most powerful level α test against A.
The fundamental result underlying all derivations of optimum tests is the Neyman-Pearson lemma, which states that for testing a simple hypothesis against a particular alternative A, the statistic T of the most powerful test is given for each possible set x of the observations by

(8) T(x) = PA(x)/PH(x)

(or by any monotone function of T), where PA and PH denote the probabilities (or probability densities) of x under A and H respectively.
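A sketch of the statistic (8) for a simple normal testing problem, assuming Python with numpy and scipy (the hypothesis, alternative, and data are invented for the illustration):

```python
# A sketch of the Neyman-Pearson statistic (8) for a simple hypothesis:
# H: mu = 0 against the simple alternative A: mu = 1, normal data, variance 1.
import numpy as np
from scipy.stats import norm

def neyman_pearson_ratio(x, mu_H=0.0, mu_A=1.0, sigma=1.0):
    # T(x) = P_A(x) / P_H(x), computed on the log scale for numerical stability;
    # here T is a monotone function of the sample mean, so the most powerful
    # test is equivalent to rejecting for a large sample mean.
    log_T = np.sum(norm.logpdf(x, mu_A, sigma) - norm.logpdf(x, mu_H, sigma))
    return np.exp(log_T)

x = np.array([0.3, 1.2, -0.4, 0.9, 1.5])    # invented observations
print(neyman_pearson_ratio(x))
```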
In most problems there are many alternatives to H. For example, if the hypothesis specifies that a treatment has no effect and the alternatives specify that it has a beneficial effect, a different alternative will correspond to each possible size of this effect. If it happens that the same test is simultaneously most powerful against all possible alternatives, this test is said to be uniformly most powerful (UMP). However, except for one-tailed tests (and for tests of the hypothesis that the parameter lies outside a specified interval) in the simplest one-parameter models, a UMP test typically does not exist; instead, different tests are most powerful against different alternatives.
If a UMP test does not exist, tests may be sought with somewhat weaker optimum properties. One may try, for example, to find a test that is UMP among all tests possessing certain desirable symmetry properties or among all unbiased tests, a test being unbiased if its power never falls below the level of significance. Many standard tests have one or the other of these properties.
A general method of test construction that frequently leads to satisfactory results is the likelihood ratio method, which in analogy to (8) defines T by

(9) T(x) = [maximum of P(x) over Ω] / [maximum of P(x) over ΩH].
Here the denominator is the maximum probability of x when H is true, while the numerator is the over-all maximum of this probability. If the numerator is sufficiently larger than the denominator, this indicates that x has a much higher probability under one of the alternatives than under H, and it then seems reasonable to reject H when x is observed. The distribution of the test statistic (9) under H has a simple approximation when the sample sizes are large, and in this case the likelihood ratio test also has certain approximate optimum properties (Kendall & Stuart 1961, vol. 2, p. 230; Lehmann 1959, p. 310).
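As an illustration, the following sketch computes the likelihood ratio statistic (9) for testing μ = 0 in the normal model of example 4 and applies the large-sample chi-square approximation (assuming Python with numpy and scipy; the data are invented):

```python
# A sketch of the likelihood ratio statistic (9) for H: mu = 0 in the normal
# model of example 4 (variance unknown), with the large-sample chi-square
# approximation to the null distribution of 2 log T.
import numpy as np
from scipy.stats import chi2

def likelihood_ratio_test(z, mu0=0.0):
    n = len(z)
    s2_full = np.mean((z - z.mean()) ** 2)   # ML variance under the full model
    s2_null = np.mean((z - mu0) ** 2)        # ML variance with mu fixed at mu0
    two_log_T = n * np.log(s2_null / s2_full)
    p_value = chi2.sf(two_log_T, df=1)       # approximate significance probability
    return two_log_T, p_value

z = np.array([0.8, 1.5, -0.2, 1.1, 0.6, 1.9])   # invented observations
print(likelihood_ratio_test(z))
```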
The specification problem. For many standard testing problems, tests with various optimum properties have been worked out and can be found in the textbooks. As a result, the principal difficulty, in practice, is typically not the choice of α or T but the problem of specification, that is, of selecting a class Ω of models that will adequately represent the process generating the observations. Suppose, for example, that one is contemplating the model of example 5. Then the following questions, among others, must be considered: (a) May the experimental subjects reasonably be viewed as randomly chosen from the population of interest? (b) Are the populations large enough so that the X’s and Y’s may be assumed to be independent? (c) Does the normal shape provide an adequate approximation for the distribution of the observations or of some function of the observations? (d) Is it realistic to suppose that σx = σy? The answers to such questions, and hence the choice of model, require considerable experience both with statistics and with the subject matter in which the problem arises. (Protection against some of the possible failures of the model of example 5 in particular may be obtained through the method of randomization discussed below.)
Problems of robustness. The difficulty of the specification problem naturally raises the question of how strongly the performance of a test depends on the particular class of models from which it is derived. There are two aspects to this question, namely robustness (insensitivity to departures from assumption) of the size of the test and robustness of its power [see Errors, article on Effects of errors in statistical assumptions].
Two typical results concerning robustness of size of a test are the following: (a) In example 5, assuming σy = σx, the size of Student’s t-test for testing H: μy = μx is fairly robust against nonnormality except for very small sample sizes. (b) In example 5, the F-test for testing σy = σx is very nonrobust against nonnormality.
The second aspect, the influence of the model on power, may again be illustrated by the case of Student’s t-test for testing H: μy = μx in example 5. It unfortunately turns out—and this result is typical—that the power of the test is not robust against distributions with heavy tails (distributions that assign relatively high probability to very large positive and negative values), for example, if normality is disturbed by the presence of gross errors. This difficulty can be avoided by the use of nonparametric tests such as the Wilcoxon test defined in example 11, which give only limited weight to extreme observations. The Wilcoxon test, at the expense of a very slight efficiency loss (about 5 per cent) in the case of normality, gives much better protection against gross errors. In addition, its size is completely independent of the assumption of normality, so that it is in fact a test of the hypothesis G = F in the nonparametric model of example 7.
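A small simulation along these lines (assuming Python with numpy and scipy; the heavy-tailed distribution, shift, and sample sizes are invented for the illustration) compares the rejection frequencies of the t-test and the Wilcoxon test:

```python
# A sketch of a small robustness simulation: power of Student's t-test and of
# the Wilcoxon test when the data are heavy-tailed (Student t with 2 degrees
# of freedom) and the treatment shifts the Y's upward. All values are invented.
import numpy as np
from scipy.stats import ttest_ind, mannwhitneyu

rng = np.random.default_rng(0)
m = n = 25
shift, alpha, n_sim = 1.0, 0.05, 2000

reject_t = reject_w = 0
for _ in range(n_sim):
    x = rng.standard_t(df=2, size=m)            # controls
    y = rng.standard_t(df=2, size=n) + shift    # treated
    if ttest_ind(y, x, alternative="greater").pvalue <= alpha:
        reject_t += 1
    if mannwhitneyu(y, x, alternative="greater").pvalue <= alpha:
        reject_w += 1

# The Wilcoxon test typically rejects more often here, reflecting its better
# power against heavy-tailed (gross-error) departures from normality.
print(reject_t / n_sim, reject_w / n_sim)
```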
Design. Analysis of data, through performance of an appropriate test, is not the only aspect of a testing problem to which statistical considerations are relevant. At least of equal importance is the question of what observations should be taken or what type of experiment should be performed. The following are illustrations of some of the statistical problems relating to the proper design of an investigation [see Experimental Design].
Sample size. Once it has been decided in principle what kind of observations to take, for example, what type of sampling to use, it is necessary to determine the number of observations required. This can be done by fixing, in addition to the significance level α, the minimum power β that one wishes to achieve against the alternatives of interest. When the sample size is not fixed in advance, a compromise in the values of α and β is no longer necessary, since both can now be controlled. Instead, the problem may arise of balancing the desired error control against the cost of sampling. If the sample size n required to achieve a desired α and β is too large, a compromise between n and the values of α and β becomes necessary.
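For the normal two-sample comparison of example 5 with known common σ, the required sample size per group can be approximated as in the following sketch (Python with scipy; the numerical values of α, β, δ, and σ are illustrative):

```python
# A sketch of the sample size per group for the one-sided comparison of two
# normal means (example 5), using the normal approximation with known common
# sigma; alpha is the level and beta the desired power (illustrative values).
from scipy.stats import norm

def sample_size_per_group(alpha, beta, delta, sigma):
    z_alpha = norm.ppf(1 - alpha)     # upper-alpha point of the standard normal
    z_beta = norm.ppf(beta)           # point corresponding to the desired power
    return 2 * ((z_alpha + z_beta) * sigma / delta) ** 2

print(sample_size_per_group(alpha=0.05, beta=0.90, delta=0.5, sigma=1.0))
# roughly 69 subjects per group for these illustrative values
```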
Sequential designs. Instead of using a fixed sample size, it may be more advantageous to take the observations one at a time, or in batches, and to let the decision of when to stop depend on the observations. (The stopping rule, which states for each possible sequence of observations when to stop, must of course be specified before any observations are taken.) With such a sequential design, one would stop early when the particular values observed happen to give a strong indication of the correct decision and continue longer when this is not the case. In this way, it is usually possible to achieve the same control of significance level and power with a (random) number of observations, which on the average is smaller than that required by the corresponding test of fixed sample size. When using a sequential test, account must be taken of the stopping rule to avoid distortion of the significance level [see Sequential analysis].
Grouping for homogeneity. The power of tests of the hypothesis μy = μx in example 5 depends not only on the size of the difference μy − μx but also on the inherent variability of the subjects, as measured by σx and σy. Frequently, the power can be increased by subdividing the subjects into more homogeneous groups and restricting the comparison between treatment and control to subjects within the same group [see Experimental design, article on The design of experiments].
An illustration of such a design is the method of paired comparisons, where each group consists of only two subjects (for example, twins) chosen for their likeness, one of which receives the treatment and the other serves as control. If a sample of n such pairs is drawn and the difference of the measurements in the treated and control subjects of the ith pair is Zi = Yi − Xi, then Z1,…, Zn are distributed as in example 4. The appropriate test is Student’s one-sample t-test, which is now, however, based on fewer degrees of freedom than for the design of example 5. To determine in any specific case whether the design of example 5, the paired-comparison design, or some intermediate design with group size larger than two is best, it is necessary to balance the reduction of variability due to grouping against the loss in degrees of freedom in the resulting t-test.
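A sketch of the paired-comparison analysis (assuming Python with numpy and scipy; the paired data are invented):

```python
# A sketch of the paired-comparison analysis: Student's one-sample t-test
# applied to the differences Zi = Yi - Xi (invented data for n = 6 pairs).
import numpy as np
from scipy.stats import ttest_1samp, ttest_rel

x = np.array([10.2, 9.8, 11.0, 10.5, 9.9, 10.7])    # control member of each pair
y = np.array([11.1, 10.4, 11.3, 11.2, 10.1, 11.5])  # treated member of each pair

z = y - x
print(ttest_1samp(z, 0.0, alternative="greater"))   # test of no treatment effect
print(ttest_rel(y, x, alternative="greater"))       # the equivalent paired form
```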
Randomization. When testing the effect of a treatment by comparing the results on n treated subjects with those on m controls, it may not be possible to obtain the subjects as a random sample from the population of interest. A probabilistic basis for inference and, at the same time, protection against various biases can be achieved by assigning the subjects to treatment and control not on a systematic or haphazard basis but at random, that is, in such a way that all possible such assignments are equally likely. Randomization is possible also in a paired-comparisons situation where within each pair one of the two possible assignments of treatment and control to the two subjects is chosen, for example, by tossing a coin. In the model resulting from such randomization, it is possible to carry out a test of the hypothesis of no treatment effect without any further assumptions.
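A sketch of such a randomization (permutation) test of the hypothesis of no treatment effect, assuming Python with numpy; the data and the number of re-randomizations are invented, and the observed difference in means serves as the test statistic:

```python
# A sketch of a randomization (permutation) test of no treatment effect.
# Under the randomization model all assignments of the m + n subjects to
# treatment and control are equally likely, so the significance probability
# can be approximated by re-randomizing the observed labels many times.
import numpy as np

rng = np.random.default_rng(1)
x = np.array([12.1, 10.4, 11.8, 9.9, 10.7])     # controls (invented)
y = np.array([12.9, 11.6, 13.2, 12.4, 11.1])    # treated (invented)

observed = y.mean() - x.mean()                  # test statistic
pooled = np.concatenate([x, y])
n_treated, n_sim = len(y), 10000

count = 0
for _ in range(n_sim):
    perm = rng.permutation(pooled)
    diff = perm[:n_treated].mean() - perm[n_treated:].mean()
    if diff >= observed:
        count += 1

print(count / n_sim)    # approximate significance probability
```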
Relation to other statistical procedures. The formulation of a problem as one of hypothesis testing may present a serious oversimplification if, in case of rejection, it is important to know which alternative or type of alternative is indicated. The following two situations provide typical and important examples.
(a) Suppose a two-sided test of the hypothesis H: θ = θ0 rejects H when a test statistic T is either too small or too large. In case of rejection, it is usually not enough to conclude that θ differs from θ0; one would wish to know in addition whether θ is less than or greater than θ0. Here a three-decision procedure of the following form is called for:
conclude that θ < θ0 if T ≤ c1,
conclude that θ > θ0 if T ≥ c2,
accept H if c1 < T < c2.
The constants c1, c2 can be determined by specifying the two error probabilities α1 = Pθ0(T ≤ c1) and α2 = Pθ0(T ≥ c2), whose sum is equal to the error probability, α, of the two-sided test that rejects for T ≤ c1 and T ≥ c2. How the total error probability, α, is divided between α1 and α2 would depend on the relative seriousness of the two kinds of error involved and on the relative importance of detecting when θ is in fact less than or greater than θ0. If concern is about equal between values of θ > θ0 and values of θ < θ0, the procedure with α1 = α2 = α/2 may be reasonable. It is interesting to note that this three-decision procedure may be interpreted as the simultaneous application of two tests: namely, T ≤ c1 as a test of the hypothesis H1: θ ≥ θ0 at level α1 and T ≥ c2 as a test of H2: θ ≤ θ0 at level α2.
(b) If θ1,…, θc denote the (average) effects of c treatments, one may wish to test the hypothesis H: θ1 = … = θc to see if there are any appreciable differences. In case of rejection, one might wish to determine which of the θ’s is largest, or to single out those θ’s that are substantially larger than the rest, or to obtain a complete ranking of the θ’s, or to divide the θ’s into roughly comparable groups, etc.
Under suitable normality assumptions, the last of these objectives can be achieved by applying a t-test to each of the hypotheses Hij: θi = θj, or rather by applying to each the three-decision procedure based on t-tests of the type discussed in (a). Combining the conclusions (θi < θj, θi = θj, or θi > θj) obtained in this way leads to a grouping of the kind desired. In determining the significance level, say α′, at which the individual t-tests should be performed, one must of course relate it to the significance level α that one wishes to achieve for the original hypothesis H. (For further details and references regarding this procedure, see Lehmann 1959, p. 275; Mosteller & Bush 1954, p. 304.)
These two examples illustrate how a procedure involving several choices may sometimes be built up by the simultaneous consideration of a number of situations involving only two choices, that is, a number of testing problems. A similar approach also leads to the method of estimation by confidence sets [see Estimation, article on Confidence intervals and regions].
The simultaneous consideration of a number of different tests also arises in other contexts. Frequently, investigators wish to explore a number of different aspects of the same data and for this purpose carry out multiple tests. This raises serious difficulties, since the stated significance levels relate to a single test without relation to others. Essentially the same difficulties arise when testing hypotheses suggested by the data, since this may be viewed as testing a (possibly large) number of potential hypotheses but reporting only the most significant outcome. This, of course, again invalidates the stated significance level. [Methods for dealing with such problems are discussed in Linear hypotheses, article on Multiple comparisons.]
History. Isolated examples of tests, as statements of (statistical) significance or nonsignificance of a set of observations, occurred throughout the eighteenth and nineteenth centuries. A systematic use of hypothesis testing, but without explicit mention of alternatives to the hypothesis being tested, began with the work of Karl Pearson (1900) and owes much of its further development to R. A. Fisher (1925; 1935). That the choice of an appropriate test must depend on the alternatives as well as on the hypothesis was first explicitly recognized by Neyman and Pearson (1928; 1933), who introduced the concept of power and made it the cornerstone of the theory of hypothesis testing described here. Two other approaches to the subject, based on concepts of probability other than that of long-run frequency, which has been implicitly assumed here, are the Bayesian approach and that of Jeffreys [1939; see also Bayesian inference].
E. L. Lehmann
[See also Estimation; Linear Hypotheses; Significance, Tests Of.]
BIBLIOGRAPHY
Birnbaum, Allan 1954 Combining Independent Tests of Significance. Journal of the American Statistical Association 49:559-574.
Fisher, R. A. (1925) 1958 Statistical Methods for Research Workers. 13th ed. New York: Hafner. → Previous editions were also published by Oliver & Boyd.
Fisher, R. A. (1935) 1960 The Design of Experiments. 7th ed. New York: Hafner. → Previous editions were also published by Oliver & Boyd.
Hodges, Joseph L. Jr.; and Lehmann, E. L. 1964 Basic Concepts of Probability and Statistics. San Francisco: Holden-Day. → Contains a more leisurely exposition of the basic concepts of hypothesis testing.
Jeffreys, Harold (1939) 1961 Theory of Probability. 3d ed. Oxford: Clarendon.
Kendall, Maurice G.; and Stuart, Alan 1961 The Advanced Theory of Statistics. New ed. Volume 2: Inference and Relationship. New York: Hafner; London: Griffin. → The first edition, published in 1946, was written by M. G. Kendall.
Lehmann, Erich L. 1959 Testing Statistical Hypotheses. New York: Wiley.
Mosteller, Frederick; and Bush, Robert R. (1954) 1959 Selected Quantitative Techniques. Volume 1, pages 289-334 in Gardner Lindzey (editor), Handbook of Social Psychology. Cambridge, Mass.: Addison-Wesley.
Neyman, J.; and Pearson, E. S. 1928 On the Use and Interpretation of Certain Test Criteria for Purposes of Statistical Inference. Biometrika 20A: 175-240, 263-294.
Neyman, J.; and Pearson, E. S. 1933 On the Problem of the Most Efficient Tests of Statistical Hypotheses. Royal Society of London, Philosophical Transactions Series A 231:289-337.
Pearson, Karl 1900 On the Criterion That a Given System of Deviations From the Probable in the Case of a Correlated System of Variables Is Such That It Can Be Reasonably Supposed to Have Arisen From Random Sampling. Philosophical Magazine Fifth Series 50:157-175.
Popper, Karl R. (1934) 1959 The Logic of Scientific Discovery. New York: Basic Books. → First published as Logik der Forschung. A paperback edition was published by Harper in 1965.
Hypothesis Testing
The method psychologists employ to prove or disprove the validity of their hypotheses.
When psychologists engage in research, they generate specific questions called hypotheses. Research hypotheses are informed speculations about the likely results of a project. In a typical research design, researchers might want to know whether people in two groups differ in their behavior. For example, psychologists have asked whether the amount that we can remember increases if we can find a way to organize related information. The hypothesis here might be that the organization of related information increases the amount that a person can remember in a learning task.
The researcher knows that such a strategy might have no effect, however. Learning may not change or it may actually worsen. In research, psychologists set up their projects to find out which of two conclusions is more likely, the research hypothesis (i.e., whether organizing related information helps memory) or its complement (i.e., whether organizing related information does not help memory). The possibility that organizing related information will make no difference is called the Null Hypothesis, because it speculates that there may be no change in learning. (The word "null" means "nothing" or "none.") The other possibility, that organizing related information helps to learn, is called the Research Hypothesis or the Alternate Hypothesis. To see which hypothesis is true, people will be randomly assigned to one of two groups that differ in the way they are told to learn. Then the memory of the people in the two groups is compared.
As a rule, psychologists attempt to rule out the Null Hypothesis and to accept the Research Hypothesis because their research typically tries to focus on changes from one situation to the next, not failure to change. In hypothesis testing, psychologists are aware that they may make erroneous conclusions. For example, they might reject the Null Hypothesis and conclude that performance of people in two groups is different, that is, that one group remembers more than the other because they organize the information differently. In reality, one group might have gotten lucky and if the study were performed a second time, the result might be different. In hypothesis testing, this mistaken conclusion is called a Type I error.
Sometimes researchers erroneously conclude that the difference in the way the two groups learn is not important. That is, they fail to reject the Null Hypothesis when they should. This kind of error is called a Type II error. The table below indicates the relationship among errors and correct decisions.

| | You conclude that the two groups differ, so you reject the Null Hypothesis | You conclude that the two groups do not differ, so you fail to reject the Null Hypothesis |
| --- | --- | --- |
| Two groups really do differ | You correctly rejected the Null Hypothesis. You made a good decision. | You made a Type II error. You should have said there is a difference, but you made a mistake and said there wasn't. |
| Two groups really do not differ | You made a Type I error. You said that the groups are different, but you made a mistake. | You correctly failed to reject the Null Hypothesis. You said that the groups are not different, and you were right. |
Unfortunately, when researchers conduct a single experiment, they may be making an error without realizing it. This is why other researchers may try to replicate the research of others in order to spot any errors that previous researchers may have made.
See also Scientific method
hypothesis
hy·poth·e·sis /hīˈpäTHəsis/ • n. (pl. -ses /-ˌsēz/) a supposition or proposed explanation made on the basis of limited evidence as a starting point for further investigation: professional astronomers attacked him for popularizing an unconfirmed hypothesis. ∎ Philos. a proposition made as a basis for reasoning, without any assumption of its truth. ORIGIN: late 16th cent.: via late Latin from Greek hupothesis ‘foundation,’ from hupo ‘under’ + thesis ‘placing.’
Hypothesis
A hypothesis may be thought of as a well-informed guess that is drawn from a theory or collection of ideas. It provides the basis from which a reasoned prediction about the relationship between two or more factors is made (e.g., early attachment and the child's later educational attainment). The prediction should define clearly the factors and the group of people within which the relationship can be observed. After planning how best to control for the effect of the factors and assembling participants who reflect the defined group, scientific testing can proceed. The data gathered are then checked to see whether the hypothesis is supported or not.
Supporting data, however, cannot be taken as conclusive proof of the theory. Logically it is more persuasive to predict no relationship between two factors and then find through testing that there is a relationship. This is called the null hypothesis and is the theoretical basis for statistical examination of the data to test whether the relationship between the factors is greater than chance.
See also: METHODS OF STUDYING CHILDREN
Anthony Lee