Quality Control, Statistical
I. Acceptance Sampling    H. C. Hamaker
II. Process Control    E. S. Page
III. Reliability and Life Testing    Marvin Zelen
“Quality control,” in its broadest sense, refers to a spectrum of managerial methods for attempting to maintain the quality of manufactured articles at a desired level. “Statistical quality control” can refer to all those methods that use statistical principles and techniques for the control of quality. In this broad sense, statistical quality control might be regarded as embracing in principle all of statistical methodology. Some areas of statistics, however, have, both historically and as currently used, a special relationship to quality control; it is these areas that are discussed in the articles under this heading. The methods of quality control described in these articles have applications in fields other than industrial manufacturing. For example, process control has been used successfully as an administrative technique in large-scale data-handling organizations, such as census bureaus. Reliability concepts are important in engineering design and seem potentially useful in small-group research.
I ACCEPTANCE SAMPLING
It is generally recognized that mass-production processes inevitably turn out a small amount of product that does not satisfy the specification requirements. Certainly, however, the fraction of defective product should be kept under control, and acceptance sampling is one of the methods used for this purpose. Although primarily used in manufacturing situations, the techniques of acceptance sampling can also be applied to nonmanufacturing operations, such as interviewing or editing.
Mass products are usually handled in discrete lots, and these are the units to which acceptance sampling applies. A sample from each lot is inspected, and on the basis of the data provided by the sample it is decided whether the lot as a whole shall be accepted or rejected.
The actual procedures differ according to the nature of the product and the method of inspection. A distinction must be made between discrete products (nuts and bolts, lamps, radios) and bulk products (coal, fertilizer, liquids). Another important distinction is that between inspection by the method of attributes, where each item inspected is simply classified as defective or nondefective, and inspection by the method of variables, where meaningful numerical measurements on the sample are used to sentence a lot.
The following discussion will mainly concern attribute inspection, which is the technique most widely applied. Variables inspection will be mentioned only briefly.
Systematic investigations of acceptance sampling problems were initiated by two now classic papers of Dodge and Romig (1929-1941), which were later published in book form. Most of the basic concepts discussed here, such as the AOQL, the LTPD, and single and double sampling, go back to their work.
Further important developments took place during World War II, when acceptance sampling was extensively applied to military supplies. Research was carried out by the Statistical Research Group, Columbia University, and published in two books (1947; 1948).
Since then there has been a steady flow of publications concerned with a variety of both theoretical and practical aspects, such as modifications of sampling procedures, economic principles, the development of sampling standards, or applications in particular situations.
Method of attributes
Basic concepts
Attribute sampling applies to discrete products that are classified as defective or nondefective, regardless of the nature and seriousness of the defects observed. The simplest procedure consists in a single sampling plan:
From each lot submitted, a random sample of size n is inspected. The lot is accepted when the number of defectives, x, found in the sample is less than or equal to the acceptance number, c. When x is greater than or equal to the rejection number, c + 1, the lot is rejected.
Suppose n = 100 and c = 2; that is, 2 per cent defectives are permitted in the sample. This does not mean that a lot containing 2 per cent defectives will always be accepted. Chance fluctuations come into play; sometimes the sample will be better and sometimes worse than the lot as a whole. Statistical theory teaches that if the size of the sample is much smaller than the size of the lot, then a lot with 2 per cent defectives has a probability of acceptance P_A = 0.68 = 68 per cent. On the average, from 100 such lots, 68 will be accepted and 32 rejected.
It is clear that the probability of acceptance is a function of the per cent defective, p, in the lot. A plot of P_A as a function of p is called the operating characteristic curve, or the O.C. curve. Every sampling plan has a corresponding O.C. curve, which conveniently portrays its practical performance. The O.C. curves for three single sampling plans are presented in Figure 1.
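The O.C. curve of a single sampling plan is easily computed. The following Python sketch (the plan n = 100, c = 2 is the illustrative one used above; all values printed are purely for illustration) evaluates the probability of acceptance both exactly, from the binomial distribution, and by the Poisson approximation that is commonly used when the per cent defective is small.

```python
# Sketch: O.C. curve of a single sampling plan with sample size n and
# acceptance number c (accept the lot if the number of defectives x <= c).
import math

def prob_accept_binomial(p, n=100, c=2):
    """P(X <= c) when X ~ Binomial(n, p): exact probability of acceptance."""
    return sum(math.comb(n, x) * p**x * (1 - p)**(n - x) for x in range(c + 1))

def prob_accept_poisson(p, n=100, c=2):
    """Poisson approximation with mean n*p, adequate when p is small."""
    m = n * p
    return sum(math.exp(-m) * m**x / math.factorial(x) for x in range(c + 1))

if __name__ == "__main__":
    for percent in (0.5, 1, 2, 3, 5, 8):
        p = percent / 100
        print(f"{percent:4.1f}% defective:  P_A(binomial) = {prob_accept_binomial(p):.3f},"
              f"  P_A(Poisson) = {prob_accept_poisson(p):.3f}")
    # At 2 per cent defective the probability of acceptance is about 0.68,
    # in agreement with the figure quoted in the text.
```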
O.C. curves always have similar shapes, and for practical purposes they can be sufficiently specified by two parameters only. The most important parameters that have been proposed are the following:
(1) Producer’s risk point (PRP) = p_95 = that per cent defective for which the probability of acceptance, P_A, is 95 per cent.
(2) Consumer’s risk point (CRP), more often called the lot tolerance percentage defective (LTPD) = p_10 = that per cent defective for which P_A = 10 per cent.
(3) Indifference quality (IQ) or point of control = p_50 = that per cent defective for which P_A = 50 per cent.
(4) Average outgoing quality limit (AOQL), which, however, applies only to rectifying inspection, that is, inspection whereby rejected lots are completely screened (100 per cent inspection) and accepted after the removal of all defective items. It can be shown that under rectifying inspection the average outgoing quality (AOQ), that is, the average per cent defective in the accepted lots, can never surpass a certain upper limit, even under the most unfavorable circumstances. This upper limit is the AOQL.
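These characteristic points can be located numerically from the O.C. curve. The sketch below uses the illustrative plan n = 100, c = 2 again, the Poisson approximation, and, for the AOQL, the assumption of rectifying inspection of lots much larger than the sample (so that AOQ is approximately p times P_A); it finds the PRP, IQ, LTPD, and AOQL by a simple grid search.

```python
# Sketch: characteristic points (PRP, IQ, LTPD) and AOQL of a single plan.
import math

def p_accept(p, n=100, c=2):
    m = n * p
    return sum(math.exp(-m) * m**x / math.factorial(x) for x in range(c + 1))

def percent_defective_for(target_pa, n=100, c=2):
    """Smallest fraction defective (on a fine grid) whose P_A drops to target_pa."""
    p = 0.0
    while p_accept(p, n, c) > target_pa:
        p += 0.0001
    return p

if __name__ == "__main__":
    n, c = 100, 2
    prp = percent_defective_for(0.95, n, c)   # producer's risk point, P_A = 95%
    iq = percent_defective_for(0.50, n, c)    # indifference quality, P_A = 50%
    ltpd = percent_defective_for(0.10, n, c)  # lot tolerance per cent defective, P_A = 10%
    # AOQL: maximum of the average outgoing quality AOQ(p) ~ p * P_A(p)
    # over all incoming qualities (rectifying inspection, large lots assumed).
    aoql = max(p * p_accept(p, n, c) for p in (i * 0.0001 for i in range(1, 2001)))
    print(f"PRP ~ {100*prp:.2f}%   IQ ~ {100*iq:.2f}%   LTPD ~ {100*ltpd:.2f}%")
    print(f"AOQL ~ {100*aoql:.2f}% defective in the accepted product")
```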
These four parameters are essentially characteristics of the O.C. curves or of the corresponding sampling plans. But the choice of a sampling plan cannot be determined by looking at the O.C. curves alone; in some way the choice must be in keeping with the capabilities of the production process under consideration. Realizing this, Dodge and Romig introduced the process average, that is, the average per cent defective found in the samples, as one of the quantities to be used in deciding on a sampling plan. Their sampling plans are such that for lots with a per cent defective equal to the process average the probability of acceptance is always fairly high.
In practice this is not quite satisfactory, because it means that the acceptance sampling system is more lenient toward a poor producer with a high process average than toward a good one with a low average.
Hence the process average has now been superseded by the acceptable quality level (AQL). This is again a per cent defective and may be considered as an upper limit of acceptability for the process average. Definitions have varied in the course of time. In the earliest version of the Military Standard 105, the 105A, which was developed for acceptance sampling of military supplies during World War II and published in 1950, the AQL was defined as a “nominal value to be specified by the U.S. Government.” The sampling plans in this standard are so constructed that the probability of acceptance for lots that meet the AQL standard is higher than 90 per cent. This has led other authors to define the AQL instead as the per cent defective with a probability of acceptance of 95 per cent, thus identifying the AQL with the PRP. Thereby, however, the AQL would be essentially a characteristic of a sampling plan and no longer a tolerance requirement of production. The uncertainty seems to have been finally settled by defining the AQL as the maximum per cent defective that, for purposes of acceptance sampling, can be considered satisfactory as a process average. This definition has been adopted in the latest version of the Military Standard 105, the 105D, and in a standard on terminology established by the American Society for Quality Control.
Strictly, the O.C. curve of a single sampling plan also depends on the size of the lot, N, but if, as in most cases, the sample is only a small fraction of the lot, the O.C. curve is almost completely determined by the sample size, n, and the acceptance number, c; the influence of the lot size is unimportant and can be ignored. Since, moreover, the per cent defective is generally very low, a few per cent only, the Poisson formula can be used for computing the O.C. curves [see Distributions, Statistical, article on Special Discrete Distributions].
An important logical distinction between contexts must be made: does the O.C. curve relate to the specific lot at hand or to the underlying process that produced the lot? Definitions of consumer’s risk, etc., also may be made in either of these contexts. In practice, however, the numerical results are nearly identical, whether one regards as basic a particular lot or the underlying process.
It will be noticed that the O.C. curve of a sampling plan has a close resemblance to the power curve of a one-sided test of a hypothesis [see Hypothesis Testing]. Indeed, acceptance sampling can be considered as a practical application of hypothesis testing. If on the basis of the findings in the sample the hypothesis H_0: p ≤ PRP is tested with a significance level of 5 per cent, the lot is rejected if the hypothesis is rejected and the lot is accepted if the hypothesis is not rejected. Likewise, H_0: p = PRP and H_1: p = LTPD can be considered as two alternative hypotheses tested against each other, with errors of the first and second kind of 5 per cent and 10 per cent respectively.
Curtailed, double, multiple, and sequential sampling. Curtailed sampling means that inspection is stopped as soon as the decision to accept or reject is evident. For example, with a single sampling plan with n = 100, c = 2, inspection can be stopped as soon as a third defective item is observed. This reduces the number of observations without any change in the O.C. curve. The gain is small, however, because bad lots, which are rejected, occur infrequently in normal situations.
Better efficiency is achieved by double sampling plans, which proceed in two stages. First, a sample of size n_1 is inspected, and the lot is accepted if x_1 ≤ c_1 and rejected if x_1 ≥ c_2, x_1 being the number of defectives observed. But when c_1 < x_1 < c_2, a second sample, of size n_2, is inspected; and the lot is then finally accepted if the total number of defectives, x_1 + x_2, is ≤ c_3 and is rejected otherwise. The basic idea is that clearly good or bad lots are sentenced by the first sample and only in doubtful cases is a second sample required.
Multiple sampling operates on the same principle but in more than two successive steps, often six or eight. After each step it is decided again whether to accept, reject, or proceed to inspection of another sample. In (fully) sequential sampling, inspection is carried out item by item and a three-way decision taken after each item is inspected.
The basic parameters discussed above apply also to double, multiple, and sequential sampling plans. These procedures can be so constructed that they possess O.C. curves almost identical with that of a given single sampling plan; on the average, however, they require fewer observations. The amount of saving is on the order of 25 per cent for double sampling and up to 50 per cent for multiple and sequential sampling.
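The saving can be illustrated numerically. In the following sketch the double plan's constants n_1, n_2, c_1, c_2, and c_3 are invented for illustration rather than taken from any published standard; the average sample number (ASN) shows the reduction in inspection effort relative to a single plan of roughly comparable O.C. curve.

```python
# Sketch: O.C. and average sample number of an illustrative double sampling plan,
# compared with the single plan n = 100, c = 2.
import math

def binom_pmf(x, n, p):
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

def single_plan(p, n=100, c=2):
    return sum(binom_pmf(x, n, p) for x in range(c + 1))

def double_plan(p, n1=60, n2=60, c1=1, c2=4, c3=3):
    """Accept on the first sample if x1 <= c1, reject if x1 >= c2;
    otherwise draw a second sample and accept iff x1 + x2 <= c3."""
    p_accept, asn = 0.0, n1
    for x1 in range(n1 + 1):
        q1 = binom_pmf(x1, n1, p)
        if x1 <= c1:
            p_accept += q1
        elif x1 < c2:                      # doubtful zone: second sample needed
            asn += q1 * n2
            p_accept += q1 * sum(binom_pmf(x2, n2, p) for x2 in range(c3 - x1 + 1))
    return p_accept, asn

if __name__ == "__main__":
    for percent in (1, 2, 4, 6):
        p = percent / 100
        pa2, asn = double_plan(p)
        print(f"{percent}%:  single P_A = {single_plan(p):.3f}   "
              f"double P_A = {pa2:.3f}   double ASN = {asn:.1f} (single uses 100)")
```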
Disadvantages of these plans as compared with single sampling plans are more complicated administration and a variable inspection load. Double sampling is fairly often adopted, but multiple and sequential sampling are used only in cases where the cost of inspection is very high, so that a high economy is essential. In many situations single sampling is preferred for the sake of simplicity.
The choice of a sampling plan
There are many considerations involved in choosing a sampling plan.
Sampling standards. If numerical values are specified for two of the four parameters listed above, the O.C. curve is practically fixed and the corresponding sampling plan may be derived by computation or from suitable tables. In practice, however, the use of standard sampling tables is much more common.
The best known of these tables is the Military Standard 105A (U.S. Department of Defense, Standardization Division 1950), which prescribes a sampling plan in relation to an AQL and the size of the lot; the sample size is made to increase with the lot size so as to reduce the risk of wrong decisions. Three separate tables give single, double, and multiple plans. The user can also choose between different inspection levels, corresponding to systematic changes in all sample sizes; a sample size n = 50 at level I is increased to n = 150 at level III, for example.
The Military Standard 105 also contains rules for a transition to tightened or reduced inspection. The idea is that, when the sample data indicate that the average per cent defective in the lots received is significantly higher than the required AQL, the consumer should tighten his inspection in order to prevent the acceptance of too many bad lots and to stimulate the producer to improve his production. If, on the other hand, the average per cent defective is significantly lower than the AQL, this indicates that the production process is tightly controlled; inspection is then less essential and the amount of inspection can be reduced.
The Military Standard 105 has several times been revised; the successive versions are known as the Military Standard 105A, 105B, 105C, and 105D (U.S. Department of Defense, Standardization Division 1963). The differences between A, B, and C are relatively unimportant; but in 105D, issued in 1963, more drastic changes were introduced. These changes were proposed by a special committee of experts jointly appointed by the departments of defense in the United States, Great Britain, and Canada. The basic principles underlying the Military Standard 105, however, have not been altered.
The Military Standard 105 was originally intended for acceptance sampling of military supplies, but it has also been widely applied in industry for other purposes. Other standards have been established by large industrial firms or by governmental offices in other countries. Sometimes these are only modifications of the Military Standard 105A; sometimes they have been developed independently and from different points of view. The merits and demerits of a number of these standards are discussed by Hamaker (1960).
Acceptance sampling as an economic problem. Many authors have attempted to deduce an optimum sampling plan from economic considerations. Given (a) the cost of inspection of an item, (b) the a priori distribution, that is, the distribution of the per cent defective among the lots submitted for inspection, (c) the loss caused by accepted defectives, and (d) the loss due to rejected nondefectives, one can, in principle, derive an optimum sampling plan, for which the total cost (cost of inspection plus losses) is a minimum.
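A minimal sketch of such a cost comparison follows; the lot size, the unit costs, and the a priori distribution used in it are all assumed values chosen purely for illustration (they are exactly the quantities that are hard to obtain in practice), and rejected lots are assumed to be completely screened.

```python
# Sketch: expected total cost per lot for a few candidate single sampling plans,
# given an assumed prior over lot quality and assumed unit costs.
import math

LOT_SIZE = 2000
COST_INSPECT = 0.10            # cost of inspecting one item (assumed)
LOSS_ACCEPTED_DEFECT = 5.00    # loss per defective item passed on (assumed)
PRIOR = {0.005: 0.6, 0.02: 0.3, 0.06: 0.1}   # assumed distribution of lot quality

def p_accept(p, n, c):
    m = n * p
    return sum(math.exp(-m) * m**x / math.factorial(x) for x in range(c + 1))

def expected_cost(n, c):
    total = 0.0
    for p, weight in PRIOR.items():
        pa = p_accept(p, n, c)
        cost_accept = n * COST_INSPECT + (LOT_SIZE - n) * p * LOSS_ACCEPTED_DEFECT
        cost_reject = LOT_SIZE * COST_INSPECT          # screen the whole lot
        total += weight * (pa * cost_accept + (1 - pa) * cost_reject)
    return total

if __name__ == "__main__":
    for n, c in [(50, 1), (100, 2), (200, 5), (315, 7)]:
        print(f"n={n:4d}, c={c}:  expected cost per lot ~ {expected_cost(n, c):8.2f}")
```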
These economic theories have had little practical success except in isolated cases, mainly, I believe, because their basis is too restricted and because they require detailed information not readily available in industry. The a priori distribution is usually not known and cannot easily be obtained; the consequences of rejecting a lot depend to a very high degree on the stock available and may consequently vary from day to day. Rejection often does not mean that a lot is actually refused; it only means that the inspector cannot accept without consultation with a higher authority or some other authorization.
Since even the most simple products can show a variety of defects, some much more serious than others, it is not easy to see how the loss due to accepted defectives should be estimated. In the Military Standard 105 this problem has been solved by a classification into critical, major, and minor defects, with a separate AQL for each class. An alternative method, demerit rating, consists in scoring points, say 10, 3, and 1, for critical, major, and minor defects and sentencing the lot by the total score resulting from a sample. The advantage is a single judgment instead of three separate judgments, but the theory of O.C. curves no longer applies in a simple and straightforward manner. There are practical problems as to how different defects should be classified and what scores should be assigned to the different classes. These should be solved by discussions between parties interested in the situation envisaged. Some case studies have been described in the literature. No attempts have so far been made to incorporate the classification of defects into economic theories; that would make them yet more complicated and unworkable. This is a decided drawback because in actual practice the degree of seriousness of the defects observed always has a considerable influence on the final decisions concerning the lots inspected. Besides, the economic theories assume that the prevention of accepted defectives is the only purpose of acceptance sampling and this is not correct. It also serves to show the consumer’s interest in good quality and to stimulate the producer to take good care of his production processes. Some large firms have developed vendor rating systems for this very purpose. The information supplied by the samples is systematically collected for each supplier separately and is used in deciding where to place a new contract.
It should not be concluded that no attention is devoted to the economic aspects of the problem of choosing a sampling plan. In acceptance sampling there is always a risk of taking a wrong decision by accepting a bad lot or rejecting a good lot. The larger the lot, the more serious are the economic consequences, but the risk of wrong decisions can be reduced by using larger samples, which give steeper O.C. curves. All existing sampling tables prescribe increased sample sizes with increasing lot sizes, and this practice is derived from economic considerations.
Also, before installing sampling inspection it is always necessary to make some inquiries about the situation to be dealt with. Do bad lots occur, and if so how frequently? How bad are those bad lots? Do accepted defectives cause a lot of trouble? The answers to such questions as these provide crude information about the a priori distribution and the economic aspects, and in choosing an AQL and an inspection level this information is duly taken into account.
Looking at the problem from this point of view, industrial statistics always uses a rough Bayesian approach [See Bayesian Inference]. A statistician will never be successful in industry if he does not properly combine his statistical practices with existing technical knowledge and experience and with cost considerations. It is only when an attempt is made to apply the Bayesian principles in a more precise way that problems arise, because the basic parameters required for that purpose cannot easily be estimated with sufficient accuracy. [It is interesting in this connection to compare the empirical Bayesian approach discussed in Decision Theory.]
Method of variables
In sampling by the method of variables a numerical quality characteristic, x, is measured on each item in the sample. It is usually supposed that x has a normal distribution [See Distributions, Statistical] and that a product is defective when x falls beyond a single specification limit or outside a specification interval: that is, when x < L and/or x > U. The basic idea is that from the mean, x̄, and standard deviation, s, computed from the sample, an estimate of the per cent defective in the lot can be derived, and this estimate can be compared with the AQL required. For a single specification limit the criterion operates as follows: If x is a quality such as length, and the specification requires that not more than a small percentage, p, of the units in the lot shall have a length greater than U (upper limit), then the lot is accepted when U − x̄ ≥ ks and rejected when U − x̄ < ks, where k is a constant that depends on p and the sample size n and is derived from the theory. For double limits the technique is somewhat more complicated. When the standard deviation is known, s is replaced by σ and a different value of k has to be used.
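For a single upper limit the criterion is easily programmed, as in the sketch below; the measurements, the limit U, and the constant k shown are illustrative only, since in practice k is read from the tables of the standard as a function of the sample size and the AQL.

```python
# Sketch: single-upper-limit variables criterion, accept when U - xbar >= k*s.
from statistics import mean, stdev

def accept_by_variables(measurements, upper_limit, k):
    """Return True if the lot is accepted under the (U - xbar) >= k*s rule."""
    xbar = mean(measurements)
    s = stdev(measurements)            # sample standard deviation
    return upper_limit - xbar >= k * s

if __name__ == "__main__":
    sample = [9.8, 10.1, 10.0, 10.3, 9.9, 10.2, 10.1, 10.0, 9.7, 10.2]  # assumed data
    U = 11.0        # upper specification limit (assumed)
    k = 1.72        # illustrative constant; the real value comes from the tables
    verdict = "accept" if accept_by_variables(sample, U, k) else "reject"
    print(f"xbar = {mean(sample):.2f}, s = {stdev(sample):.3f} -> {verdict} the lot")
```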
On the basis of earlier suggestions, the theory of sampling by variables was worked out in detail by the Statistical Research Group, Columbia University (1947), and by Bowker and Goode (1952). This theory led to the establishment of the Military Standard 414 in 1957. This standard has a structure similar to the Military Standard 105A; lot size and AQL are the main parameters that determine a sampling plan. Tables are given for one and two specification limits, and for σ both known and unknown.
The advantage of sampling by variables is a more effective use of the information provided by the sample and consequently fewer observations. Where the Military Standard 105A prescribes a sample size n = 150, the Military Standard 414 uses n = 50 when σ is unknown and n ranging from 8 to 30 when σ is known.
Disadvantages are that the performance and handling of measurements require more highly trained personnel and that the assumption of a normal distribution is a risky one when it is not known under what circumstances a lot has been produced. For these reasons sampling by variables has found only limited application; sampling by attributes is often preferred.
Present-day reliability requirements have led to an increased interest in life testing procedures and hence to the development of acceptance sampling techniques based on the exponential and the Weibull distributions [See Quality Control, Statistical, article on Reliability and Life Testing].
Acceptance sampling of bulk materials
Acceptance sampling of bulk material constitutes a separate problem. A liquid can be homogenized through stirring, and then the analysis of a single specimen will suffice. Solid material, such as fertilizer or coal, is handled in bales, barrels, wagons, etc. Then, there may be a variability within bales and additional variability among bales. Extensive research is often needed for each product separately before an adequate acceptance sampling procedure can be developed. A good example is Duncan’s investigation of fertilizer (1960).
The theory of sampling by attributes is fairly complete. Research into the economics of acceptance sampling will probably continue for quite a while, but this seems to be an interesting academic exercise that will not lead to drastic changes in industrial practices. A common international sampling standard would be of great practical value, but since in different countries different standards are already established, this is a goal that will not be easily reached.
The theory of sampling by variables may perhaps require some further development. In the Military Standard 414 the sample sizes prescribed when σ is unknown are three times the size of those for σ known, even for samples as large as 100 or more items. Further research may make possible the reduction of this ratio.
Perhaps the most important new developments will come from new fields of application. It is, for example, recognized that using accountancy to check a financial administration can also be considered as an acceptance sampling procedure and should be dealt with as such. However, the nature of the material and the requirements to be satisfied are entirely different from those in the technological sector. Industrial techniques cannot be taken over without considerable modification, and suitable methods have to be developed afresh. (See Vance & Neter 1956; Trueblood & Cyert 1957.)
H. C. Hamaker
BIBLIOGRAPHY
Bowker, Albert H.; and Goode, Henry P. 1952 Sampling Inspection by Variables. New York: McGraw-Hill.
Columbia University, Statistical Research Group 1947 Selected Techniques of Statistical Analysis for Scientific and Industrial Research, and Production and Management Engineering. New York: McGraw-Hill. →See especially Chapter 1 on the use of variables in acceptance inspection for per cent defective.
Columbia University, Statistical Research Group 1948 Sampling Inspection. New York: McGraw-Hill. → Describes theory and methods of attribute inspection developed during World War II.
Deming, W. Edwards 1960 Sample Design in Business Research. New York: Wiley. → Contains practical examples of the use of samples in business administration; some of these applications can be considered as problems in acceptance sampling.
Dodge, Harold F.; and Romig, Harry G. (1929-1941) 1959 Sampling Inspection Tables: Single and Double Sampling. 2d ed., rev. & enl. New York: Wiley; London: Chapman. → This book is a republication of fundamental papers published by the authors in 1929 and 1941 in the Bell System Technical Journal.
Duncan, Acheson J. (1952) 1965 Quality Control and Industrial Statistics. 3d ed. Homewood, Ill.: Irwin. → Contains five chapters on acceptance sampling.
Duncan, Acheson J. 1960 An Experiment in the Sampling and Analysis of Bagged Fertilizer. Journal of the Association of Official Agricultural Chemists 43: 831-904. → A good example of the research needed to establish acceptance-sampling procedures of bulk material.
Grant, Eugene L. (1946) 1964 Statistical Quality Control. 3d ed. New York: McGraw-Hill. → The merits and demerits of various acceptance-sampling procedures and standards as well as Military Standard 105D are discussed in detail.
Hald, H. A. 1960 The Compound Hypergeometric Distribution and a System of Single Sampling Inspection Plans Based on Prior Distributions and Costs. Technometrics 2:275-340. → Considers sampling from an economic point of view, and contains a fairly complete list of references to earlier literature on this aspect of the sampling problem.
Hamaker, Hugo C. 1958 Some Basic Principles of Acceptance Sampling by Attributes. Applied Statistics 1:149-159. → Contains a discussion of the difficulties hampering the practical application of economic theories.
Hamaker, Hugo C. 1960 Attribute Sampling in Operation. International Statistical Institute, Bulletin 37, no. 2:265-281. → A discussion of the merits and demerits of a number of existing sampling standards.
Quality Control and Applied Statistics: Abstract Service. → Published since 1956. A useful source of information giving fairly complete abstracts of papers published in statistical and technical journals. Of special importance are the classification numbers 200-299: Sampling Principles and Plans, and number 823: Sampling for Reliability.
Trueblood, Robert M.; and Cyert, Richard M. 1957 Sampling Techniques in Accounting. Englewood Cliffs, N.J.: Prentice-Hall.
U.S. Department Of Defense, Standardization Division 1950 Military Standard 105A. Washington: Government Printing Office. → A sampling standard widely applied. In 1959 this standard was slightly revised in Military Standard 105B. Recently, a further, more drastic revision has been effected jointly by the departments of defense in Canada, Great Britain, and the United States.
U.S. Department Of Defense, Standardization Division 1957 Military Standard 414. Washington: Government Printing Office.→ A standard for variables inspection corresponding to Military Standard 105A.
U.S. Department Of Defense, Standardization Division 1963 Military Standard 105D. Washington: Government Printing Office. → Earlier versions of this standard are known as Military Standard 105A, 105B, and 105C.
Vance, Lawrence L.; and Neter, John 1956 Statistical Sampling for Auditors and Accountants. New York: Wiley.
II PROCESS CONTROL
The present article discusses that aspect of statistical quality control relating to the control of a routinely operating process. The traditional and most common field of use is in controlling the quality of manufactured products, but applications are possible in fields as diverse as learning experiments, stock exchange prices, and error control in the preparation of data for automatic computers. Process control of this kind is usually effected by means of charts that exhibit graphically the temporal behavior of the process; hence, the subject is sometimes called, somewhat superficially, “control charts.”
Inspection for control
The concept of quality control in its industrial context and the first widely used methods were introduced by W. A. Shewhart (1931). One of the prime concerns was to detect whether the items of output studied had characteristics that behaved like independent observations from a common statistical distribution, that is, whether groups of such items had characteristics behaving like random samples. If the procedure suggested acceptance of the hypothesis of a single distribution, with independence between observations, the production process was said to be “in control” (Shewhart 1931, p. 3), although at this stage the quality of the output, determined by the parameters of the distribution, might not be acceptable. A major contribution was the explicit recognition that such a state of control is necessary before any continuous control of the process can succeed. When the initial examination of the sample data shows the process to be out of control in the above sense, reasons connected with the operation of the process are sought (the search for “assignable causes”; Shewhart 1931, p. 13), improved production methods are introduced, and a further examination is made to see if control has been achieved. Thus, this initial phase of study of a process employs a significance test of a hypothesis in a way encountered in other applications of statistics. It should be noted, however, that often no alternative hypotheses are specified and that, indeed, they are frequently only vaguely realized at the beginning of the investigation.
Inspection of a process in control
When the process is in a state of control and its output has the relevant quality characteristics following a distribution with the values of the parameters at their targets (that is, those values of the parameters chosen to cause the output to meet the design specifications with as much tolerance as possible), one task is to determine when the process departs from this state, so that prompt restoring action may be taken. A quality control scheme for this purpose needs, therefore, to give a signal when action is demanded. One of the features of importance for a process inspection scheme is thus the speed at which it detects a change from target.
In conflict with the desire for rapid detection of a change is the necessity for infrequent signals demanding action when no change from target has occurred. The “errors” of signaling a change when none has occurred and of failing to give the signal immediately after a change are similar to the two types of error familiar in hypothesis testing. Whereas in the latter case the probability of such errors is a suitable measure of their occurrence, in quality control the repetition of sampling as the process continues operation and the possibility of combining several samples to decide about a signal make probabilities less convenient measures of error behavior. Instead, the average run length (A.R.L.), defined as the average number of samples taken up to the appearance of a signal, gives a convenient means of comparison for different process inspection schemes (Barnard 1959, p. 240); the A.R.L. is easily related both to the amount of substandard output produced between any change and the signal and to the frequency of unnecessary interference with a controlled process.
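The A.R.L. of almost any inspection scheme can be estimated by straightforward simulation, as in the following sketch; the target mean, standard deviation, sample size, number of trials, and the simple three-sigma rule used as an example are all assumed values chosen for illustration.

```python
# Sketch: Monte Carlo estimate of the average run length (A.R.L.) of an
# inspection scheme.  A "scheme" is any function that looks at the sequence of
# sample means observed so far and says whether to signal.
import random
from statistics import mean

def run_length(scheme, shift=0.0, mu0=0.0, sigma=1.0, n=4, max_samples=100_000):
    """Number of samples taken until the scheme signals, for a given mean shift."""
    means = []
    for t in range(1, max_samples + 1):
        sample = [random.gauss(mu0 + shift, sigma) for _ in range(n)]
        means.append(mean(sample))
        if scheme(means):
            return t
    return max_samples

def average_run_length(scheme, shift, trials=1000):
    return mean(run_length(scheme, shift) for _ in range(trials))

def shewhart_3sigma(means, mu0=0.0, sigma=1.0, n=4):
    """Example scheme: signal when the latest sample mean falls outside
    3-sigma-of-the-mean limits around the target."""
    limit = 3 * sigma / n**0.5
    return abs(means[-1] - mu0) > limit

if __name__ == "__main__":
    random.seed(1)
    print("A.R.L. with process on target :", round(average_run_length(shewhart_3sigma, 0.0)))
    print("A.R.L. after a 1-sigma shift  :", round(average_run_length(shewhart_3sigma, 1.0)))
```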
Ancillary tasks of process inspection. In addition to providing a signal after a change, a process inspection scheme may be required to yield other information. For example, the magnitude of change may need to be estimated so that a dial may be adjusted. Again, coupled with the provision of a signal to take remedial action may be a rule about the destination of any recent production for which there is evidence of a fall in standard, and so the position of any change may need estimation; in this case the scheme is partly one for the deferred sentencing of output, that is, for retrospective acceptance or rejection. In other cases a satisfactory record of a process inspection scheme may guarantee the acceptance of the output by a consumer.
In these and other manufacturing applications where the aim is financial gain, a comparison of alternative schemes in monetary terms should be attempted and not, as so frequently has happened in practice, ignored. Although quantification of many aspects of real situations is difficult or impossible, a monetary comparison can avoid the confusion of using in one situation a measure appropriate for comparing schemes in a totally different situation.
Types of inspection schemes
Schemes for the various applications base their rules for the appropriate action to be taken on the results either of single samples of observation or of sequences of such samples; the sequences may be of fixed or variable length. All such schemes are closely related, for those employing a single sample or a fixed number of samples can be exhibited as special cases of schemes using a variable number of samples. However, the appearance of the graphical records is quite different. In the first two cases the individual sample results are marked on the chart (Figure 1), while in the last case the sums of all the previous observations, corrected for any expected trend, are plotted (Figure 2). The application of such techniques to different distributions changes the schemes only slightly; the constants involved differ, and sometimes attention is concentrated upon evidence of changes in the distribution parameter in one direction only—for example, increases in the fraction of defective articles produced or in the variance of a measured characteristic are often particularly interesting.
Besides the normal distribution model to be discussed below, another useful model is the Poisson for applications concerning the number of occurrences of a particular type in a given item or length of time, for example, the number of blemishes in sheets of glass of fixed size. The most commonly used schemes are those for controlling the mean of a normal distribution; they are typical of those for other distributions, both continuous and discrete, although schemes for each distribution need separate consideration for calculation of their properties.
The samples used in this sort of activity are usually small, of perhaps four to six observations.
The Shewhart procedure. The classical Shewhart procedure for providing a signal when a process in control departed from its target mean μ used two “action lines” drawn at μ ± kσ/√n, where σ is the process standard deviation estimated from many earlier observations, n is the size of sample, and k is a constant chosen pragmatically to be 3 (Shewhart 1931, p. 277) or 3.09 (Dudding & Jennett 1942, p. 64). The signal of a change was received when any sample point, the mean of a sample, fell outside the action lines (Shewhart 1931, p. 290). These rules are such that only about 1 in 500 samples would yield a point outside the action lines if the output were in control with the assumed parameter values; the 1-in-500 choice rests on experience and lacks any foundation from consideration of costs, but the success of these schemes has, in many applications, afforded abundant justification of their use and represented a major advance in process control. For processes where information is required about small changes in the mean, the single small-sample schemes were insufficiently sensitive—small changes need large samples for quick detection—but limitations in sampling effort and a reluctance to sacrifice the possibility of rapid detection of large changes with small samples have caused additions to the original Shewhart scheme.
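The arithmetic behind the 1-in-500 figure is elementary, as the following sketch shows; the target value, standard deviation, and sample size used in it are illustrative, and normal sample means with known σ are assumed.

```python
# Sketch: action lines at mu +/- k*sigma/sqrt(n) and the chance that an
# in-control sample mean falls outside them when k = 3.09.
import math

def normal_cdf(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def action_lines(mu, sigma, n, k=3.09):
    half_width = k * sigma / math.sqrt(n)
    return mu - half_width, mu + half_width

if __name__ == "__main__":
    mu, sigma, n = 50.0, 2.0, 4          # illustrative target, sigma, sample size
    lower, upper = action_lines(mu, sigma, n)
    p_outside = 2 * (1 - normal_cdf(3.09))
    print(f"action lines: {lower:.2f} and {upper:.2f}")
    print(f"chance an in-control point falls outside: {p_outside:.5f} "
          f"(about 1 in {round(1 / p_outside)})")
```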
Warning line schemes. Charts of warning line schemes have drawn on them the action lines and two other lines—warning lines—at μ ± k′σ/√n, where k′ is a constant less than k (Dudding & Jennett 1942, p. 14; Grant 1946, art. 159). Accordingly, the chart is divided into action, warning, and good regions (Figure 1). A change in parameter is signaled by rules such as the following: Take action if m out of the last n sample points fell in one of the warning regions or if the last point falls in an action region (Page 1955).
Schemes using runs of points are special cases (Grant 1946, art. 88); some that are popular base their action rule on runs of sample points on one side of the target mean (the case k′ = 0, k = ∞, m = n; that is, the action regions disappear and the two warning regions are separated only by a line at the target value: action is taken when a long enough sequence of consecutive points falls in one region).
These schemes retain some of the advantages of small samples and seek to combine the results of a fixed number of samples in a simple way to increase the sensitivity for sustained small changes in parameter.
Cumulative sum schemes. An extension of this idea is provided by the cumulative sum schemes (or cusum schemes), which enjoy the advantages of both large and small samples by combining the relevant information from all recent samples (Barnard 1959, p. 270). Instead of plotting individual sample means, x̄, on the chart, the differences of these means from the target value, x̄ − μ, are cumulated and the running total plotted after each sample is taken (Figure 2). A change in process mean causes a change in the direction of the trend of plotted points. One method of defining the conditions for a signal is to place a V-mask on the chart and take action if the arms of the V obscure any of the sample points. The angle of the V and the position of the vertex relative to the last plotted point (the lead distance) can be chosen to achieve the required A.R.L.s. Tables of these constants exist for several important distributions (Ewan & Kemp 1960; Goldsmith & Whitfield 1961; Kemp 1961; Page 1962; 1963). Alternative methods of recording may be adopted for schemes to detect one-sided or two-sided deviations in the parameter. Gauging devices may be used instead of measuring in order to speed the manual operation of such a scheme or to make automatic performance simpler (Page 1962).
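A cumulative sum scheme is also simple to operate arithmetically. The sketch below uses the one-sided "decision interval" form, which is equivalent to applying a V-mask to the plotted sums; the reference value k, decision interval h, and the sample means shown are illustrative rather than taken from the published tables.

```python
# Sketch: one-sided (upward) cusum scheme in decision-interval form.
def cusum_signals(sample_means, target, k, h):
    """Return the indices at which the upward cusum scheme signals."""
    signals, s = [], 0.0
    for i, xbar in enumerate(sample_means):
        s = max(0.0, s + (xbar - target) - k)   # accumulate excess over target + k
        if s > h:
            signals.append(i)
            s = 0.0                             # restart the scheme after a signal
    return signals

if __name__ == "__main__":
    # Illustrative data: the mean drifts upward from sample 10 onward.
    means = [10.1, 9.8, 10.0, 10.2, 9.9, 10.0, 10.1, 9.7, 10.0, 10.2,
             10.6, 10.4, 10.7, 10.5, 10.8, 10.6, 10.9, 10.7, 10.8, 11.0]
    print("signals at samples:", cusum_signals(means, target=10.0, k=0.25, h=1.0))
```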
Recent developments
Attention has been given to schemes for the complete control of the process, that is, procedures for automatically detecting departures from target and making adjustments of appropriate sizes to the control variables (Box & Jenkins 1963; 1964). Naturally, such methods of control are applicable only to those processes for which both measurements of the quality characteristics and the adjustments can be made automatically. Other work has examined different stochastic models of process behavior (Barnard 1959; Bather 1963) and has attempted a comprehensive study of all the costs and savings of a process inspection scheme (Duncan 1956). In industrial applications the financial benefits accruing from the operation of such a scheme are usually of paramount importance, and however difficult it may be to assess them quantitatively, they deserve careful consideration at all stages of the selection and operation of the scheme.
E. S. Page
BIBLIOGRAPHY
Barnard, G. A. 1959 Control Charts and Stochastic Processes. Journal of the Royal Statistical Society Series B 21:239-271. → Introduces the V-mask cusum chart and estimation methods.
Bather, J. A. 1963 Control Charts and the Minimization of Cost. Journal of the Royal Statistical Society Series B 25:49-80.
Box, G. E. P.; and Jenkins, G. M. 1962 Some Statistical Aspects of Adaptive Optimization and Control. Journal of the Royal Statistical Society Series B 24:297-343.
Box, G. E. P.; and Jenkins, G. M. 1964 Further Contributions to Adaptive Quality Control; Simultaneous Estimation of Dynamics: Non-zero Costs. International Statistical Institute, Bulletin 40:943-974.
Dudding, B. P.; and Jennett, W. J. 1942 Quality Control Charts. London: British Standards Institution.
Duncan, Acheson J. (1952) 1959 Quality Control and Industrial Statistics. Rev. ed. Homewood, Ill.: Irwin.
Duncan, Acheson J. 1956 The Economic Design of x̄ Charts Used to Maintain Current Control of a Process. Journal of the American Statistical Association 51:228-242.
Ewan, W. D.; and Kemp, K. W. 1960 Sampling Inspection of Continuous Processes With No Autocorrelation Between Successive Results. Biometrika 47:363-380. → Gives tables and a nomogram for one-sided cusum charts on a normal mean and fraction defective.
Goldsmith, P. L.; and Whitfield, H. 1961 Average Run Lengths in Cumulative Chart Quality Control Schemes. Technometrics 3:11-20. → Graphs of V-mask schemes for normal means.
Grant, Eugene L. (1946)1964 Statistical Quality Control. 3d ed. New York: McGraw-Hill. → Many examples of the Shewhart chart for different distributions.
Kemp, K. W. 1961 The Average Run Length of the Cumulative Sum Chart When a V-mask Is Used. Journal of the Royal Statistical Society Series B 23:149-153. → Gives tables of cusum schemes for a normal mean.
Page, E. S. 1954 Continuous Inspection Schemes. Biometrika 41:100-115. → Introduces cusum schemes.
Page, E. S. 1955 Control Charts With Warning Lines. Biometrika 42:242-257.
Page, E. S. 1962 Cumulative Sum Schemes Using Gauging. Technometrics 4:97-109.
Page, E. S. 1963 Controlling the Standard Deviation by Cusums and Warning Lines. Technometrics 5:307-315. → Gives tables of cusum schemes for a normal range.
Shewhart, Walter A. 1931 Economic Control of Quality of Manufactured Product. Princeton, N.J.: Van Nostrand. → The classic volume introducing control chart methods.
III RELIABILITY AND LIFE TESTING
Technology has been characterized since the end of World War II by the development of complex systems containing large numbers of subsystems, components, and parts. This trend to even larger and more complex systems is accelerating with the development of space vehicles, electronic computers, and communications and weapons systems. Many of these systems may fail or may operate inefficiently if a single part or component fails. Hence, there is a high premium on having the components operate efficiently so that the system operates in a reliable, trustworthy manner.
In order to have reliable systems, it is not only necessary initially to design the system to be reliable, but also, once the system is in operation, to have appropriate maintenance and check-out schedules. This requires quantitative estimates of the reliability of the entire system, as well as reliability estimates for the major components, parts, and circuits that make up the system. While formalized definitions of reliability vary, there is general agreement that it refers to the probability of satisfactory performance under clearly specified conditions.
Reliability of components and the system
An important problem is to predict the reliability of a system from knowledge of the reliability of the components that make it up. For example, a satellite system may be regarded as composed of subsystems that perform the propulsion, guidance, communication, and instrument functions. All subsystems must function in order for the satellite to function. It is desired to predict the reliability of the satellite from knowledge of the reliability of the basic components or subsystems.
More abstractly, consider a system made up of n major components, each of which must function in order that the whole system function. An idealization of this system is to regard the components as connected in a series network (Figure 1a), although the actual connections may be much more complex. The input signal to a component is passed on to a connecting component only if the component is functioning. Let p_i refer to the probability that the ith component functions, and assume that the components operate independently of one another. Then the probability of the entire system functioning correctly is equal to the product of the probabilities for each major component; that is,

Reliability of series system: R = p_1 p_2 ⋯ p_n.
It is clear that the reliability of the system is no greater (and generally rather less) than the reliability of any single component. For example, if a system has three major components with reliabilities of .95, .90, and .90, the reliability of the entire system would be R = (.95)(.90)(.90) = .7695.
One way of improving the reliability of a system is to introduce redundancy. Redundancy, as usually employed, refers to replacing a low-reliability component by several components having identical functions. These are connected so that it is necessary only that at least one of these components function in order for the system to function. Auxiliary power systems in hospital operating rooms and two sets of brakes on automobiles are common examples of redundancy. Redundant components may be idealized as connected in a parallel network (cf. Figure 1b). When there are n components in parallel, such that each component operates independently of the others, the reliability of the system is given by

Reliability of parallel system: R = 1 − (1 − p_1)(1 − p_2) ⋯ (1 − p_n).
Since 1 − R = ∏_i (1 − p_i) ≤ 1 − p_i for any i, it can be seen that R ≥ p_i for any i. Consequently, using redundancy always increases the reliability of a system (assuming 0 < p_i < 1). The most frequent use of redundancy is when a single component is replaced by n identical components in parallel. Then the reliability of this parallel system is R = 1 − (1 − p)^n, where p refers to the probability of a single component functioning.
The formulas for the reliability of series and parallel systems are strictly valid only if the components function independently of one another. In some applications this may not be true, as failure of one component may throw added stress on other components. In practice, also, an entire system will usually be made up of both series and parallel systems.
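The two formulas are easily evaluated, as in the following sketch, which assumes independent components and reproduces the .7695 figure quoted above; the numerical reliabilities used are the illustrative ones from the text.

```python
# Sketch: reliability of series and parallel systems of independent components.
from functools import reduce

def series_reliability(ps):
    """R = p1 * p2 * ... * pn : all components must function."""
    return reduce(lambda a, b: a * b, ps, 1.0)

def parallel_reliability(ps):
    """R = 1 - (1 - p1)(1 - p2)...(1 - pn) : at least one component must function."""
    return 1.0 - reduce(lambda a, b: a * b, ((1 - p) for p in ps), 1.0)

if __name__ == "__main__":
    print("series  :", series_reliability([0.95, 0.90, 0.90]))    # 0.7695
    print("parallel:", parallel_reliability([0.95, 0.90, 0.90]))  # 0.9995
    # Redundancy: replacing a single component of reliability .90 by three
    # identical components in parallel gives 1 - (1 - .90)**3 = .999.
    print("3-fold redundancy of a .90 component:", 1 - (1 - 0.90) ** 3)
```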
Note that in either the series or the parallel network, one can regard the system as functioning if one can take a “path” from the input to the output via functioning components. There is only one such path for components in series, whereas there are n possible paths for the parallel system. The concept of finding paths of components for which the system functions is the basis for predicting the reliability of more complicated systems. Consider n components having only two states of performance; that is, a component is either in a functioning or a failing state. Let X_i take the value 1 if component i is functioning and 0 otherwise. Also define a function ϕ(X), which depends on the state of the n components through the vector X = (X_1, X_2, …, X_n), such that ϕ(X) = 1 if the system functions and ϕ(X) = 0 if the system is not functioning. This function is termed the structure function of the system. The structure functions for series and parallel systems are

series system: ϕ(X) = X_1 X_2 ⋯ X_n;

parallel system: ϕ(X) = 1 − (1 − X_1)(1 − X_2) ⋯ (1 − X_n).
Some examples of other structure functions are shown in Figure 2.
The reliability of the system with structure function ϕ(X) is obtained by taking the expectation of ϕ(X). When the components function independently of one another, the reliability of the system is a function only of the vector p = (p_1, p_2, …, p_n) and can be obtained by replacing each X_i by p_i:

(1) R(p) = Eϕ(X).
Barlow and Proschan (1965) have an excellent exposition of the properties of structure functions, as well as a presentation of optimum redundancy techniques.
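For small systems the expectation of the structure function can be computed directly by summing over all 2^n component states, as in the following sketch; the 2-out-of-3 structure included there is simply an illustrative example of a structure that is neither purely series nor purely parallel, and the component reliabilities are again the illustrative ones used above.

```python
# Sketch: R(p) = E[phi(X)] for independent components, by enumerating all
# 2^n component states (practical only for small n).
from itertools import product

def reliability(phi, p):
    """Expectation of the structure function phi over independent component states."""
    total = 0.0
    for states in product((0, 1), repeat=len(p)):
        prob = 1.0
        for x, pi in zip(states, p):
            prob *= pi if x == 1 else (1 - pi)
        total += prob * phi(states)
    return total

def series(x):
    return int(all(x))

def parallel(x):
    return int(any(x))

def two_out_of_three(x):       # an example of a less simple structure
    return int(sum(x) >= 2)

if __name__ == "__main__":
    p = (0.95, 0.90, 0.90)
    print("series     :", round(reliability(series, p), 4))            # 0.7695
    print("parallel   :", round(reliability(parallel, p), 4))          # 0.9995
    print("2-out-of-3 :", round(reliability(two_out_of_three, p), 4))
```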
Predicting reliability from sample data
The development in the preceding section assumed that the reliability of the individual components was known. In the practical situation, this is not the case. Although experiments may be conducted on each component in order to obtain an estimate of its reliability, it is often more feasible to test subsystems or major components [for a general discussion of these methods, see Estimation].
Suppose that n_i independent components of type i are tested, with the result that s_i components function and n_i − s_i components fail. Then p̂_i = s_i/n_i is an estimate of p_i, the reliability of the component. When such information is available on all components in a system, an estimate of the reliability of the system, R(p̂), is obtained by replacing p_i by p̂_i in (1). However, one often requires an interval estimate, which may take the form of a lower confidence interval; that is, one may wish to find a number R_α(s), s = (s_1, s_2, …, s_n), which is a function of the sample information, such that

Pr{R_α(s) ≤ R(p)} ≥ 1 − α

for all p. This general problem has not yet been solved satisfactorily, although progress has been made in a few special cases for components in series. These special methods are reviewed in Lloyd and Lipow (1962).
A general method of computing confidence intervals is to simulate the operation of the system using the sample information. The simulation consists of “building” a set of systems out of the tested components, using the data from each component only once. The proportion of times ϕ(X) = 1 is an estimate of R(p), and one can use the theory associated with the binomial distribution to calculate a confidence interval for R(p). A good discussion and summary of such simulation techniques is given by Rosenblatt (1963). These same ideas and techniques may be useful for the study of human organizations, particularly in the transmission of information. [See Simulation.]
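The following sketch illustrates the simulation idea for a series system; the component test records are invented for illustration, and the lower confidence limit shown uses a simple normal approximation to the binomial rather than any of the exact methods discussed in the literature cited.

```python
# Sketch: "build" systems from recorded component test outcomes, each outcome
# used only once, and treat the proportion that function as a binomial estimate.
import math
import random

def simulate_systems(phi, test_results):
    """test_results[i] is a list of 0/1 outcomes for component i; the number of
    simulated builds is limited by the scarcest component."""
    shuffled = [random.sample(r, len(r)) for r in test_results]
    builds = min(len(r) for r in test_results)
    working = sum(phi([shuffled[i][b] for i in range(len(test_results))])
                  for b in range(builds))
    return working, builds

def lower_confidence_limit(successes, trials, z=1.645):
    """Approximate one-sided 95% lower limit for a binomial proportion."""
    phat = successes / trials
    return phat - z * math.sqrt(phat * (1 - phat) / trials)

if __name__ == "__main__":
    random.seed(3)
    series = lambda x: int(all(x))
    # Assumed test records: component i functioned s_i times out of n_i trials.
    tests = [[1] * 48 + [0] * 2, [1] * 45 + [0] * 5, [1] * 46 + [0] * 4]
    working, builds = simulate_systems(series, tests)
    print(f"{working} of {builds} simulated systems functioned")
    print("approximate 95% lower confidence limit on R:",
          round(lower_confidence_limit(working, builds), 3))
```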
Time-dependent reliability
Often the reliability of a system or component is defined in terms of the equipment functioning for a given period of time. When a complex system is capable of repair, the system is usually repaired after a failure and the failure characteristics of the system are described by the time between failures. On the other hand, if a part such as a vacuum tube fails, it is replaced (not repaired) and one refers to the time to failure to describe the failure characteristics of a population of nominally identical tubes. Let T be a random variable denoting the time to failure of a component or the time between failures of a system, and define f(t) to be its probability density function. Then the probability of the component functioning to time t (or the time between failures of a system being greater than t) is the survivorship function

S(t) = Pr{T > t} = ∫_t^∞ f(u) du.
A useful quantity associated with the failure distribution is the hazard function defined by
h(t) = f(t)/S(t).
For small positive Δt, h(t)Δt is approximately the probability that a component will fail during the time interval (t, t + Δt), given that the component has been in satisfactory use up to time t; that is,

h(t)Δt ≈ Pr{t < T ≤ t + Δt | T > t}.
Sometimes h(t) is called the instantaneous failure rate or force of mortality [See Life Tables].
When h(t) is an increasing (decreasing) function of t, then the longer the component has been in use, the greater (smaller) the probability of immediate failure. These failure laws are referred to as positive (negative) aging. If h(t) = constant, independent of t, the conditional probability of failure does not depend on the length of time the components have been in use. Such failures are called random failures.
Knowledge of the hazard function enables one to calculate the survivorship function from the relation

S(t) = exp[−H(t)],

where H(t) = ∫_0^t h(u) du. A particularly important class of hazard functions is given by

h(t) = pt^(p−1)/θ^p.

Since H(t) = (t/θ)^p, the survivorship function is

S(t) = exp[−(t/θ)^p].
This distribution is called the Weibull distribution. The parameter p is termed the shape parameter, and θ is the scale parameter. If p > 1 (p < 1) there is positive (negative) aging. The importance of the Weibull distribution arises from the fact that positive, negative, and no aging depend only on the shape parameter p. When p = 1, h(t) = 1/θ (which is independent of t) and the distribution reduces to the simple exponential distribution
S(t) = e^(−t/θ).
[See Distributions, Statistical, article on Special Continuous Distributions, for more information on these distributions and those discussed below.]
Two other distributions that are useful in describing the failure time of components are the gamma and log-normal distributions. With respect to the gamma distribution, negative or positive aging occurs for 0 < p < 1 and p > 1 respectively. The hazard function for the log-normal distribution increases to a maximum and then goes to zero as t → ∞. Buckland (1964), Epstein (1962), Govindarajulu (1964), and Mendenhall (1958) have compiled extensive bibliographies on the topics discussed in this section.
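The Weibull relations above are easily verified numerically, as in the following sketch; the shape and scale parameters are illustrative, and the survivorship function is recovered both from its closed form and from the integral of the hazard function.

```python
# Sketch: Weibull hazard h(t) = p*t^(p-1)/theta^p and survivorship
# S(t) = exp[-(t/theta)^p], with S also recovered numerically from h.
import math

def weibull_hazard(t, shape, scale):
    return shape * t ** (shape - 1) / scale ** shape

def weibull_survival(t, shape, scale):
    return math.exp(-((t / scale) ** shape))

def survival_from_hazard(t, hazard, steps=10_000):
    """Numerical check of S(t) = exp(-integral of h from 0 to t), trapezoidal rule."""
    du = t / steps
    integral = sum(0.5 * (hazard(u) + hazard(u + du)) * du
                   for u in (i * du for i in range(steps)))
    return math.exp(-integral)

if __name__ == "__main__":
    shape, scale, t = 2.0, 100.0, 60.0       # shape > 1: positive aging (illustrative)
    h = lambda u: weibull_hazard(u, shape, scale)
    print("closed form S(t):", round(weibull_survival(t, shape, scale), 4))
    print("from hazard S(t):", round(survival_from_hazard(t, h), 4))
    # With shape = 1 the hazard is the constant 1/theta and S reduces to exp(-t/theta).
    print("exponential case:", round(weibull_survival(t, 1.0, scale), 4),
          round(math.exp(-t / scale), 4))
```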
The exponential distribution
The exponential distribution (sometimes called the negative exponential distribution) has been widely used in applications to describe the failure of components and systems. One reason for its popularity is that the mathematical properties of the distribution are very tractable. The above-mentioned bibliographies cite large numbers of papers dealing with statistical techniques based on the exponential failure law.
One of the properties of the exponential distribution is that the conditional probability of failure does not depend on how long the component has been in use. This is not true for most applications to components and parts; however, the exponential distribution may be appropriate for some complex systems. If a complex system is composed of a large number of components such that the failure of any component of the system will cause the system to fail, then under general mathematical conditions it has been proved that the distribution of times between failures tends to an exponential distribution when the number of components becomes large (cf. Khintchine [1955] 1960, chapter 5). One of these conditions is that the times between failures of the system be relatively short compared to the failure time of each component.
Using the exponential distribution when it is not appropriate may result in estimates, decisions, and conclusions that are seriously in error. This has motivated research into methods that assume only an increasing failure rate (IFR) or decreasing failure rate (DFR). Barlow and Proschan (1965) present a very complete summary of results dealing with IFR and DFR hazard functions.
Marvin Zelen
BIBLIOGRAPHY
Barlow, Richard E.; and Proschan, Frank 1965 Mathematical Theory of Reliability. New York: Wiley. → A very good development of specialized aspects of theory.
Buckland, William R. 1964 Statistical Assessment of the Life Characteristic: A Bibliographic Guide. New York: Hafner. → A large bibliography on life and fatigue testing.
Epstein, Benjamin 1962 Recent Developments in Life Testing. International Statistical Institute Bulletin 39, no. 3:67-72.
Govindarajulu, Zakkula 1964 A Supplement to Mendenhall’s “Bibliography on Life Testing and Related Topics.” Journal of the American Statistical Association 59:1231-1291. → An excellent bibliography covering the period 1958-1962.
Khintchin, Aleksandr Ia. (1955) 1960 Mathematical Methods in the Theory of Queueing. London: Griffin. → First published in Russian.
Lloyd, David K.; and Lipow, Myron 1962 Reliability: Management, Methods, and Mathematics. Englewood Cliffs, N.J.: Prentice-Hall. → Discusses theory and applications.
Mendenhall, William 1958 A Bibliography on Life Testing and Related Topics. Biometrika 45:521-543. → Includes important works up to 1957.
Rosenblatt, Joan R. 1963 Confidence Limits for the Reliability of Complex Systems. Pages 115-137 in Marvin Zelen (editor), Statistical Theory of Reliability. Madison: Univ. of Wisconsin Press.
Zelen, Marvin (editor) 1963 Statistical Theory of Reliability. Madison: Univ. of Wisconsin Press. → Contains survey and expository articles on reliability theory.
QUANTAL RESPONSE
The response of an experimental or survey subject is called “quantal” if it is a dichotomous response—for example, dead-alive or success-failure. Such dichotomous variables are common throughout the social sciences, but the term “quantal response” and the statistical analyses associated with the term refer specifically to situations in which dichotomous observations are taken in a series of groups that are ordered on some underlying metric. The probability of a specific response—for example, dead—is then taken to be a function of this underlying variable, and it is this function (or some characteristic of it) that is of interest. The first developments of techniques for the analyses of quantal response were in psychophysics [See Psychophysics], where the functional relationship between the probability of a psychological response (for example, affirmation of a sensation) and the amount of, or change in, stimulus was of interest.
The purpose of this article is to point out the areas in which the techniques for quantal response find application, to describe the techniques, and to discuss in detail a few of the applications in terms of methods and areas of investigation and of fulfillment of the statistical assumptions.
Quantal response data arise when subjects (more generally, experimental units) are exposed to varying levels of some treatment or stimulus and when it is noted for each subject whether or not a specific response is exhibited. In psychophysical experiments the stimulus might be a sound (with varying levels of intensity measured in decibels) or a light (with variations in color measured in wave lengths). The subject states whether he has heard the sound or can discern that the light is different in color from some standard. In the assay of a tranquilizing drug for toxicity, the treatment variable is the dose of the drug and the response variable might be the presence or absence of nausea or some other dichotomous indication of toxicity.
Descriptive statistics for quantal data
Table 1 contains a typical set of quantal response data. The data were treated by Spearman (1908) and represent the proportion (pi) of times a subject could distinguish a sound of varying intensity (xi) from a given standard sound. As Figure 1 shows, the proportion of positive responses tends to increase from 0 to 1, in an S-shaped, or sigmoid, curve, as the level of x increases.
Table 1 — Proportion of times a sound of varying intensity was judged higher than a standard sound of intensity 1,772 decibels

| Intensity (decibels) | Logarithm of intensity (xi) | Proportion of high judgments (pi) |
| --- | --- | --- |
| 1,078 | 3.03 | .00 |
| 1,234 | 3.09 | .00 |
| 1,402 | 3.15 | .11 |
| 1,577 | 3.20 | .28 |
| 1,772 | 3.25 | .48 |
| 1,972 | 3.30 | .71 |
| 2,169 | 3.34 | .83 |
| 2,375 | 3.37 | .91 |
| 2,579 | 3.41 | .95 |
| 2,793 | 3.45 | .98 |
| 3,011 | 3.48 | 1.00 |

Source: Data from Spearman 1908.
It is often assumed that points like those plotted in Figure 1 approximate an underlying smooth sigmoid curve that represents the “true” circumstances.
The value of x for which the sigmoid curve has value ½ is taken to be the most important descriptive aspect of the curve by most writers. This halfway value of x, denoted by Med, corresponds to the median if the sigmoid curve is regarded as a cumulative frequency curve. For values of x below Med, the probability of response is less than ½; for values of x above Med, the probability of response is greater than ½. Of course, in describing a set of data a distinction must be made between the true Med for the underlying smooth curve and an estimate of Med from the data. The value Med is often called the threshold value in research in sensory perception (see Guilford 1936). This value is not to be confused with the conceptual stimulus value (or increment) that induces no response, that is, the 0 per cent response value, sometimes called the absolute threshold. (See Corso 1963 for a review of these concepts.) In pharmacological research the dose at which 50 per cent of the subjects respond is called the E.D. 50 (effective dose fifty), or L.D. 50 (lethal dose fifty) if the response is death.
A natural method of estimating Med would be to graph the pi against the xi (as in Figure 1), graduate the data with a smooth line by eye, and estimate Med from the graph. In many cases this method is sufficient. The method lacks objectivity, however, and provides no measure of reliability, so that a number of arithmetic procedures have been developed.
Unweighted least squares
One early method of estimating Med was to fit a straight line to the pi by unweighted least squares [See Linear Hypotheses, article on Regression]. This standard procedure leads easily to an estimator of Med,

M̂1 = x̄ + (½ − p̄)/B̂1,

where p̄ and x̄ are averages over the levels of x, and B̂1, the estimated slope of the straight line, is

B̂1 = Σ(xi − x̄)(pi − p̄)/Σ(xi − x̄)².
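As a rough illustration (not part of the original treatment), the calculation of M̂1 for the Table 1 data might be sketched in Python as follows; only the published proportions are used, and the variable names are arbitrary.

```python
# Sketch: unweighted least-squares estimate of Med (M-hat-1) for the
# Table 1 data. Plain Python; no external libraries needed.
x = [3.03, 3.09, 3.15, 3.20, 3.25, 3.30, 3.34, 3.37, 3.41, 3.45, 3.48]
p = [0.00, 0.00, 0.11, 0.28, 0.48, 0.71, 0.83, 0.91, 0.95, 0.98, 1.00]

x_bar = sum(x) / len(x)
p_bar = sum(p) / len(p)

# Estimated slope B-hat-1 of the straight line fitted to the p_i
b1 = (sum((xi - x_bar) * (pi - p_bar) for xi, pi in zip(x, p))
      / sum((xi - x_bar) ** 2 for xi in x))

# Set the fitted line equal to 1/2 and solve for x
med_hat_1 = x_bar + (0.5 - p_bar) / b1
print(round(med_hat_1, 3))   # roughly 3.25 on the log-intensity scale
```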
Minimum normit least squares
The method of unweighted least squares is simple and objective, but it applies simple linear regression to data that usually show a sigmoid trend. Furthermore, the method does not allow for differing amounts of random dispersion in the pi.
Empirical and theoretical considerations have suggested the assumption that the expected value of pi is related to xi by the sigmoidal function Φ[(xi − Med)/σ], where σ is the standard deviation of the underlying normal distribution and Φ denotes the standard normal cumulative distribution [See Distributions, Statistical, article on Special Continuous Distributions]. Correspondingly, the probability of response at any x is assumed to be Φ[(x − Med)/σ]. Notice that this probability equals ½ when x = Med, since Φ(0) = ½.
The inverse function of Φ, Z(p), is called the standard normal deviate of p, or the normit. To eliminate negative numbers, Z(p) + 5 is used by some and is called the probit. The normit, Z(p), is linearly related to x:

(1) Z(P) = (x − Med)/σ.
If xi is plotted against Z(pi) from Table 1, a sensibly linear relationship is indeed obtained (see Figure 2).
Figure 2 — Linear relationship between the normit, Z(p), and the logarithm of sound intensity for Spearman’s data
The minimum normit least squares method (Berkson 1955) is to fit a straight line to the Z(pi) by weighted least squares [See Linear Hypotheses, article on Regression]. The weights are chosen to approximate the reciprocal variances of the Z(pi). Specifically, letting yi = Z(pi) and letting zi be the ordinate of the standard normal distribution with cumulative probability pi, the weight, wi, for yi will be

wi = nizi²/[pi(1 − pi)].
The resultant estimator of Med is

(2) M̂2 = x̄ − ȳ/B̂2,

where x̄ = Σwixi/Σwi and ȳ = Σwiyi/Σwi are weighted means and B̂2 is the estimator of slope,

B̂2 = Σwi(xi − x̄)(yi − ȳ)/Σwi(xi − x̄)².
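A hedged sketch of the minimum normit computation for the Table 1 data is given below. Spearman’s table does not report the group sizes ni, so ni = 100 per level is assumed purely for illustration; the use of statistics.NormalDist is a convenience, not part of the original method.

```python
# Sketch: minimum normit (weighted) least-squares fit for the Table 1 data.
# Group sizes are not reported by Spearman, so n_i = 100 is assumed here.
from statistics import NormalDist

nd = NormalDist()
x = [3.03, 3.09, 3.15, 3.20, 3.25, 3.30, 3.34, 3.37, 3.41, 3.45, 3.48]
p = [0.00, 0.00, 0.11, 0.28, 0.48, 0.71, 0.83, 0.91, 0.95, 0.98, 1.00]
n = [100] * len(x)                      # assumed group sizes

# Levels with p = 0 or 1 have infinite normits and zero weight; drop them.
rows = [(xi, pi, ni) for xi, pi, ni in zip(x, p, n) if 0 < pi < 1]
xs = [xi for xi, _, _ in rows]
y = [nd.inv_cdf(pi) for _, pi, _ in rows]      # normits y_i = Z(p_i)
z = [nd.pdf(yi) for yi in y]                   # ordinates z_i at those normits
w = [ni * zi ** 2 / (pi * (1 - pi))            # weights w_i = n_i z_i^2 / [p_i(1 - p_i)]
     for (_, pi, ni), zi in zip(rows, z)]

sw = sum(w)
x_bar = sum(wi * xi for wi, xi in zip(w, xs)) / sw
y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
b2 = (sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, xs, y))
      / sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, xs)))

med_hat_2 = x_bar - y_bar / b2     # expression (2)
sigma_hat = 1 / b2                 # the slope estimates 1/sigma under the normal model
print(round(med_hat_2, 3), round(sigma_hat, 3))
```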
Maximum likelihood estimators. The method of maximum likelihood, one of general application in statistics, may be applied to quantal response analysis [See Estimation, article on Point Estimation]. A detailed discussion is given by Finney (1947; 1952). In essence, the technique may be regarded as a weighted least squares procedure, similar to the minimum normit procedure, using yi and wi slightly different from the normits and weights of the preceding section. The values are obtained from special tables. The modified weighted least squares procedure is iterated, using new weights stemming from the prior computations, until stability in the estimates is achieved.
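The iteration just described can be sketched as follows. This is a minimal illustration of a probit fit by iterated weighted least squares, using standard working values and weights in place of the special tables mentioned above; the starting values and the assumed group sizes (ni = 100) are illustrative only.

```python
# Sketch: maximum likelihood (probit) fit by iterated weighted least squares,
# using standard working values and weights in place of the special tables.
# Starting values and the assumed group sizes (n_i = 100) are illustrative.
from statistics import NormalDist

nd = NormalDist()
x = [3.03, 3.09, 3.15, 3.20, 3.25, 3.30, 3.34, 3.37, 3.41, 3.45, 3.48]
p = [0.00, 0.00, 0.11, 0.28, 0.48, 0.71, 0.83, 0.91, 0.95, 0.98, 1.00]
n = [100] * len(x)

a, b = -36.0, 11.0                 # rough starting values (e.g. from a graphical fit)
for _ in range(25):                # iterate until the estimates stabilize
    eta = [a + b * xi for xi in x]
    P = [nd.cdf(e) for e in eta]                   # current fitted probabilities
    z = [nd.pdf(e) for e in eta]                   # normal ordinates
    w = [ni * zi * zi / (Pi * (1 - Pi)) for ni, zi, Pi in zip(n, z, P)]
    y = [e + (pi - Pi) / zi                        # working variate
         for e, pi, Pi, zi in zip(eta, p, P, z)]
    sw = sum(w)
    xb = sum(wi * xi for wi, xi in zip(w, x)) / sw
    yb = sum(wi * yi for wi, yi in zip(w, y)) / sw
    b = (sum(wi * (xi - xb) * (yi - yb) for wi, xi, yi in zip(w, x, y))
         / sum(wi * (xi - xb) ** 2 for wi, xi in zip(w, x)))
    a = yb - b * xb

med_hat = -a / b                   # level at which the fitted probability is 1/2
print(round(med_hat, 3), round(1 / b, 3))
```

Unlike the normit fit, this iterative procedure makes use of the levels at which the observed proportion is 0 or 1.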
Estimation using the logistic function
Although the cumulative normal distribution has been most heavily used in quantal response analysis, the logistic function was proposed early and has been advocated for quantal assay in several fields (Baker 1961; Emmens 1940). The function is

P(x) = 1/{1 + exp[−(A + Bx)]}.

Notice that P(x) = ½ for x = −A/B. This function is of sigmoid form and is practically indistinguishable from a cumulative normal, with appropriate choice of A and B.
One advantage of the logistic function is that x can be expressed as a simple function of P, that is,

ln[P/(1 − P)] = A + Bx.

Thus, yi = ln[pi/(1 − pi)] will be approximately linearly related to xi. The method of approximate weighted least squares can be used to fit a straight line to the data. The resulting estimator for Med is

M̂3 = x̄ − ȳ/B̂3,

where ȳ and x̄ represent weighted means of the yi = ln[pi/(1 − pi)] and the xi, with weights wi = nipi(1 − pi), and the estimated slope is

B̂3 = Σwi(xi − x̄)(yi − ȳ)/Σwi(xi − x̄)².
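A sketch of this noniterative logit calculation for the Table 1 data follows, again with ni = 100 per level assumed purely for illustration.

```python
# Sketch: noniterative logit fit (approximate weighted least squares on the
# empirical logits) for the Table 1 data, with n_i = 100 per level assumed.
import math

x = [3.03, 3.09, 3.15, 3.20, 3.25, 3.30, 3.34, 3.37, 3.41, 3.45, 3.48]
p = [0.00, 0.00, 0.11, 0.28, 0.48, 0.71, 0.83, 0.91, 0.95, 0.98, 1.00]
n = [100] * len(x)

# Empirical logits are undefined at p = 0 or 1; drop those levels.
rows = [(xi, pi, ni) for xi, pi, ni in zip(x, p, n) if 0 < pi < 1]
xs = [xi for xi, _, _ in rows]
y = [math.log(pi / (1 - pi)) for _, pi, _ in rows]   # y_i = ln[p_i/(1 - p_i)]
w = [ni * pi * (1 - pi) for _, pi, ni in rows]       # w_i = n_i p_i (1 - p_i)

sw = sum(w)
x_bar = sum(wi * xi for wi, xi in zip(w, xs)) / sw
y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
b3 = (sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, xs, y))
      / sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, xs)))

med_hat_3 = x_bar - y_bar / b3
sigma_hat = math.pi / (math.sqrt(3) * b3)   # slope converted to a standard deviation
print(round(med_hat_3, 3), round(sigma_hat, 3))
```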
The method of maximum likelihood could be applied with the logistic function, but again it proves to be an iterative regression method. However, some simplifications in this case allow the maximum likelihood solution to be graphed; this has been done for certain configurations of x-levels common in drug assay (Berkson 1960).
Spearman-Karber estimation
Finally, an important method of estimating Med was described by Spearman (1908) and attributed by him to the German psychophysicists. The method is most convenient when the difference between x-levels is constant, for then the estimator is

(3) M̂4 = xk + d/2 − d(p1 + p2 + ⋯ + pk),
where xk is the highest level of x and d is the constant difference between levels of x. Note that an additional lower level with p0 = 0 would not change the value of M̂4. Neither would an additional higher level of x with pk+1 = 1, since xk would be increased by d and an additional d would be subtracted. A more general formula applicable in the case of unequal spacing of the xi is easily available. The expression for M̂4 above does not require that the group sizes (ni) be equal. The calculation of M̂4 can be intuitively justified by noting that it is equivalent to reconstructing a histogram from the cumulative distribution formed by graphing the pi against the xi as a broken line increasing from 0 to 1. The proportion in the histogram between xi and xi+1 would be pi+1 − pi. The mean of the histogram would be the sum of the midpoints of the intervals, multiplied by their respective relative frequencies, pi+1 − pi:

M̂ = Σ[(xi + xi+1)/2](pi+1 − pi),

where the sum runs from i = 0 to i = k and p0 = 0, pk+1 = 1, by definition. It is easily shown that M̂ = M̂4 when both are applicable, thus demonstrating that the Spearman estimator is simply the mean of a histogram constructed from the quantal response data. Although M̂4 resembles a mean, it can be regarded as an estimator of the median, Med, for the usual symmetrical sigmoidal curve, since mean and median will be equal in these cases.
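The histogram-mean form of the calculation is easy to program. The sketch below applies it to the Table 1 data; because those data already run from a proportion of 0 to a proportion of 1, and because the log-intensity levels are not quite equally spaced, the general form rather than expression (3) is used.

```python
# Sketch: Spearman-Karber calculation for the Table 1 data in histogram-mean
# form, which accommodates the slightly unequal spacing of the log intensities.
# No extra levels are needed because the proportions already run from 0 to 1.
x = [3.03, 3.09, 3.15, 3.20, 3.25, 3.30, 3.34, 3.37, 3.41, 3.45, 3.48]
p = [0.00, 0.00, 0.11, 0.28, 0.48, 0.71, 0.83, 0.91, 0.95, 0.98, 1.00]

med_hat_4 = sum(((x[i] + x[i + 1]) / 2) * (p[i + 1] - p[i])
                for i in range(len(x) - 1))
print(round(med_hat_4, 3))   # about 3.26, close to the regression estimates
```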
The Spearman-Karber estimator has been criticized because it seems to depend on the possibly fallacious assumption that further x-levels on either end of the series actually used would have resulted in no responses and all responses, respectively. It should be emphasized here that the heuristic justification above makes use of the unobserved values p0 = 0 and pk+1 = 1, but the Spearman-Karber estimator must be judged on the basis of theory and performance, not on the basis of the heuristic reasoning that suggested the method. Theoretical and practical results suggest that the Spearman-Karber method has no superior as a general method for estimating Med from quantal data. (See Brown 1961 for a review and for additional results on this point.) Furthermore, the Spearman-Karber estimator is nonparametric in that no function relating p and x appears in the definition of the estimator.
Other methods
Other methods of analyzing quantal data are described by Finney (1952). None seems as desirable for general use as the methods described above. The more frequently mentioned procedures are (a) methods based on moving average interpolation of the pi, (b) methods based on the angular function as an alternative to the normal or logistic, and (c) the Reed-Muench and Dragstedt-Behrens method used primarily in biology.
Sequential estimation of Med
Occasionally, data on quantal response can be collected most economically in a sequential way and analyzed as they are gathered. As an example, in a clinical trial of a psychotherapeutic drug, individual patients may be allocated to various dose-levels of drug and placebo as they are admitted to, and diagnosed at, a mental hospital. Since the data on treatment effectiveness will become available sequentially and not too rapidly, the evaluation might well be done sequentially. Sequential collection and analysis of quantal response data will yield values of Med (the L.D. 50, or threshold value, etc.) that are more reliable than the fixed-sample-size investigation for comparable numbers of observations. The added precision of the sequential procedure is attained by choosing levels for further observations in the light of the data observed. The result is a concentration of x-levels in the range where the most information on Med is gained—namely, at x levels in the vicinity of Med. Several sources (Cochran & Davis 1964; Wetherill 1963) give appropriate methods of carrying out sequential experimentation and analysis specifically for quantal data. [See Sequential Analysis for a general discussion.]
Estimation of slope of curve
All of the preceding discussion of computational procedures has been concerned with computation of an estimator for Med. Another characteristic of the data shown in Figure 1 is the steepness with which the pi rise from 0 to 1 as the level of x is increased. It has been noted that if the function relating pi to x is regarded as a cumulative frequency function, then Med is the median of the distribution. Similarly, the steepness is related (inversely) to the variability of the frequency function. In particular, if the normal frequency function is used, the slope of the straight line (eq. 1) relating the normal deviate of p to x is simply the inverse of the standard deviation of the normal function. Therefore, if the regression methods that stem from least squares or maximum likelihood are used, the resulting value for the slope, for example, B̂2 in expression (2), could be used to estimate σ. If the logistic functional form is used, the slope can be converted to an estimate of the standard deviation by multiplying the inverse of the slope by a constant, π/√3. The nonparametric Spearman-Karber estimation of Med does not provide as a side product an estimator of slope or standard deviation. If a value for the standard deviation is desired, a corresponding Spearman procedure for estimating the standard deviation is available (Cornfield & Mantel 1950).
Occasionally, an estimator of a quantile other than the median is of interest. If the tolerance distribution is of normal form, the pth quantile, xp, can be estimated by

x̂p = M̂ + zpσ̂,

where M̂ and σ̂ are estimators of the median and standard deviation of the tolerance distribution and zp is the pth quantile of the standard normal distribution.
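As a small worked illustration, the level expected to give 90 per cent response might be computed as follows; the values of M̂ and σ̂ below are merely plausible figures of the size suggested by the fits above, not results quoted from the original.

```python
# Sketch: estimating a quantile other than the median under a normal tolerance
# distribution. The values of M-hat and sigma-hat below are illustrative.
from statistics import NormalDist

med_hat, sigma_hat = 3.255, 0.088           # illustrative estimates
z90 = NormalDist().inv_cdf(0.90)            # 90th percentile of the standard normal
x90 = med_hat + z90 * sigma_hat             # estimated level giving 90 per cent response
print(round(x90, 3))                        # about 3.37 on the log-intensity scale
```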
Reliability of estimators
The computational procedures described above can be carried out on any set of quantal response data for the purpose of summary or concise description. However, it is apparent that the data are subject to random variation, and this variation, in turn, implies that estimators of Med or σ computed from a given set of quantal response data should be accompanied by standard errors to facilitate proper evaluation. Valid standard errors can be computed only on the basis of a careful examination of the sources and nature of the variation in each specific application, but some widely applicable procedures will be discussed in this section.
Measuring reliability of the estimators
The estimators of Med discussed in previous paragraphs will be unbiased, for practical purposes, unless the number of xi levels used is small (say, two or three) or the xi levels are so widely spaced or poorly chosen that the probability of response is either 0 or 1 at each xi. The ideal experiment will have several xi levels, with probabilities of response ranging from 5 per cent to 95 per cent. [See Estimation, article on Point Estimation, for a discussion of unbiasedness.]
A simple method for measuring the reliability of an estimator is to carry out the experiment in several independent replications or complete repetitions. This series of experiments will provide a sequence of estimates for the desired parameter (Med, for example). The mean of the estimates can be taken as the estimator of the parameter, and the standard error of this mean can be computed as the standard deviation of the estimates divided by the square root of the number of estimates. It will be a valid measure of the reliability of the mean.
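A sketch of this replication calculation, with invented replicate estimates of Med, is given below.

```python
# Sketch: measuring reliability from independent replications. The replicate
# estimates of Med below are invented.
import math

estimates = [3.24, 3.27, 3.22, 3.29, 3.25]          # hypothetical replicate estimates
m = sum(estimates) / len(estimates)
sd = math.sqrt(sum((e - m) ** 2 for e in estimates) / (len(estimates) - 1))
se = sd / math.sqrt(len(estimates))                  # standard error of the mean
print(round(m, 3), round(se, 3))
```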
Statistical model
The disadvantage of the procedure described above is that it may not yield error limits as narrow as a method that is tailor-made to the known characteristics of the particular investigation. Furthermore, the statistical properties of the procedure may not be as easily ascertained as for a method based on a more specific mathematical model. The following model seems to describe quite a number of situations involving quantal response; the estimation procedures discussed above are appropriate to this model.
Take Pi to be the expected value of pi at the level xi, with ni subjects at this ith level of x and with ri (= pini) of the ni subjects responding. Assume that the xi are fixed and known without error and that each of the ni subjects at xi has a probability, Pi, of responding, independent of all other subjects, whatever their xi level. This implies that the observed pi are independent, binomially distributed proportions. In particular, each pi is an unbiased estimator of its Pi with variance [Pi(1 − Pi)]/ni.
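The model can be made concrete by simulation. The sketch below generates quantal data from a normal response curve under the independence and binomial assumptions just stated; all parameter values and level choices are invented for illustration.

```python
# Sketch: simulating quantal data from the binomial model just described,
# with a normal response curve P(x) = Phi((x - Med)/sigma). All parameter
# values and dose levels are invented.
import random
from statistics import NormalDist

random.seed(1)
nd = NormalDist()
med, sigma = 3.25, 0.09
x_levels = [3.05, 3.15, 3.25, 3.35, 3.45]
n_per_level = 50

for xi in x_levels:
    Pi = nd.cdf((xi - med) / sigma)                              # true response probability at x_i
    ri = sum(random.random() < Pi for _ in range(n_per_level))   # binomial count of responders
    print(xi, ri / n_per_level)                                  # observed proportion p_i
```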
The assumption of complete independence among all subjects in the investigation must be carefully checked. For example, if the animals at a given dose-level (xi) are littermates, or if the persons at each level of ability (xi) are tested as a group, there may be a serious violation of this assumption, and any assessment of standard error must acknowledge the dependence. (See Finney [1952] 1964, pp. 136-138, for a discussion of procedures appropriate for this type of dependence.) In what follows, complete independence is assumed.
Standard errors and confidence limits
In the parametric regression methods, either maximum likelihood or minimum normit chi-squared, estimated standard errors for the estimators of Med and σ are

SEM̂ = (1/B̂)√{1/Σwi + (M̂ − x̄)²/[Σwi(xi − x̄)²]}

and

SEσ̂ = 1/[B̂²√Σwi(xi − x̄)²],

where B̂, wi, and x̄ are as defined in previous sections. The same formulas apply for logistic applications. If 95 per cent confidence limits are desired, it is usually satisfactory to take the estimate plus and minus two standard errors. A more exact procedure, developed by Fieller, is discussed in detail by Finney (1952).
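The standard-error formulas above are easily programmed. The function below is a sketch based on those formulas as reconstructed here; the commented call shows how it might be used with the quantities from the normit sketch given earlier.

```python
# Sketch: estimated standard errors of Med-hat and sigma-hat from a weighted
# normit or logit fit, following the formulas above as reconstructed here.
import math

def regression_se(b_hat, w, xs, med_hat):
    """Standard errors for a weighted fit with slope b_hat, weights w_i,
    levels x_i, and estimated median med_hat."""
    s_w = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, xs)) / s_w
    s_xx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, xs))
    se_med = (1 / b_hat) * math.sqrt(1 / s_w + (med_hat - x_bar) ** 2 / s_xx)
    se_sigma = 1 / (b_hat ** 2 * math.sqrt(s_xx))
    return se_med, se_sigma

# Usage, continuing the normit sketch given earlier:
# se_med, se_sigma = regression_se(b2, w, xs, med_hat_2)
```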
The standard error for the nonparametric Spearman procedure is simple and rapidly computed. From expression (3) it can be seen that the Spearman estimator involves only the sum of independent binomially distributed random variables. The usual estimator of standard error is therefore

SEM̂4 = d√[Σpi(1 − pi)/ni],
although there is evidence that it is slightly better to replace ni with ni − 1. It is useful to note that the Spearman estimator has a standard deviation that depends on σ, the standard deviation of the curve relating P to x. If the curve is normal or logistic, the relationship (Brown 1961) is approximately

SD(M̂4) ≈ √[σd/(n√π)].
Thus, if σ can be approximated from past experience, this last expression can be used to plan an investigation, with respect to the choice of d and n.
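A sketch of both calculations, with invented proportions, spacing, and group size, follows; the planning relation is the approximate one given above.

```python
# Sketch: the Spearman-Karber standard error for equally spaced levels, and
# the approximate planning relation given above. All figures are invented.
import math

d = 0.05                                           # spacing between x-levels
n = 40                                             # subjects per level
p = [0.00, 0.05, 0.20, 0.45, 0.75, 0.90, 1.00]     # hypothetical proportions

se_spearman = d * math.sqrt(sum(pi * (1 - pi) / n for pi in p))

sigma = 0.09                                       # spread approximated from past experience
sd_planned = math.sqrt(sigma * d / (n * math.sqrt(math.pi)))
print(round(se_spearman, 4), round(sd_planned, 4))
```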
Choice of computational procedure
In many applications a graphical estimate of Med (and, perhaps, σ) seems to be sufficient. However, if a more objective estimator is desired, with a measure of reliability, some one of the computational techniques must be chosen. Finney ([1952] 1964, p. 540) presents a review of work comparing the several computational techniques. The results of this work can be described as follows:
(a) The estimators discussed above give quite comparable values in practice, with no estimator clearly more reasonable than another as a descriptive measure.
(b) The reliabilities of the weighted normit or logit least squares, the maximum likelihood, and the Spearman estimator are the same for practical purposes.
(c) The choice of specific functional form— that is, normal, logistic, or some other—does not affect the estimators much and should be made on the basis of custom in the field of application and mathematical convenience.
In summary, the Spearman-Karber estimator, M̂4, or a noniterative least-squares procedure, M̂1, with standard error given by SEM̂4 or SEM̂, respectively, seems acceptable for many applications. If a parametric estimator is used, the choice of functional form can usually be made on the basis of availability of tables and computational ease.
Some quantal response investigations
Presented below in some detail are several examples of the application of the quantal response techniques discussed above.
Quantal response in sensory perception
A typical experiment in sensory perception will call for the presentation of a continuous sound at 1,000 cycles per second, interrupted regularly by sounds with small increment in cycles per second (see, for example, Stevens et al. 1941). The subject tells whether or not he detected the increment. A sequence of increments of the same size may be followed by a sequence of increments of increased size, ending with a sequence of quite easily detectable, large increments. Modifications of this procedure call for random ordering of the magnitudes of increment, varying the background signal, or varying the conditions of the observer—for example, his motivation, fatigue, or training. Recent modifications involve random time of presentation of the increment and variations in the length of the signal increment. Other types of stimulus present new possibilities for experimental procedure.
In many experimental situations the proportions of positive responses to the sequence of stimuli at varying levels resemble independent binomially distributed proportions, with expected values ranging from 0 for small stimulus (or stimulus increment) to 1 for large stimulus. The proportion of responses as a function of the stimulus level (or increment) often has a sigmoidal shape. Then the methods discussed above for estimating the stimulus (Med, or threshold) that gives 50 per cent positive response, with standard error, are applicable. [See Psychophysics].
The present research emphasis is not on establishing the existence of thresholds and measuring their values, but on determining the factors that cause them to vary. At present, evidence points to great sensitivity of the threshold to the way in which the sequence of stimuli is given, the experimental environment (for example, noise), and the psychological and physiological state of the subject.
The work of Stevens, Morgan, and Volkmann (1941) and that of Miller and Garner (1944), among others, indicate that well-trained observers, tested under ideal conditions for discrimination, can produce linear (rather than sigmoid) functions relating probability of response to stimulus (increment) level. These results have been used as a basis for a new theory of discrimination called the neural quantum theory. The statistical methods that depend on a sigmoidal function do not apply to this type of data but could be adapted so that they would apply.
Quantal data in pharmacological research
The strength of drugs is often measured in terms of the amount of a drug necessary to induce a well-defined quantal response in some biological subject. (Often, dosage is measured on a logarithmic scale, in order to attain an approximately sigmoid curve.) Since biological subjects vary in their responses to the same dose of a drug, the strength must be carefully defined.
If a large number of subjects are divided into subgroups and if these groups are given varying doses, ranging from a relatively small dose for one group to a large dose for another group, the proportion of subjects responding per group will increase gradually from 0 for the smallest-dose group to 1 for the group receiving the highest dose. This curve is called the dose-response curve in biological assay. The E.D. 50, the dose at which 50 per cent of the subjects respond, is used to characterize the strength of a preparation.
If the subjects are randomly allocated to the dose groups and if the dose-response curve is sigmoidal in shape, the proportions of responses in the groups often may be taken to be binomially distributed proportions, and the estimation methods discussed above will be appropriate (Finney 1952).
It is convenient to regard each subject as having an individual absolute threshold for response to the drug. A dose lower than the threshold of a subject would fail to induce a response; a dose higher than the threshold would induce a response. The thresholds or tolerances of the subjects will form a distribution. The E.D. 50 is the median of this tolerance distribution, and the dose-response function is simply the cumulative distribution of the tolerances or absolute thresholds. It should be emphasized that the dose-response curve is directly estimated by the observed proportions, whereas the underlying tolerance distribution is not ordinarily directly observed. In fact, postulation of the tolerance distribution is not essential to the validity of the assay model or estimation procedure.
Experience has shown that characterization of the strength of a drug by its E.D. 50 is unreliable, since the subjects themselves may change in sensitivity over a period of time. It is common practice, at present, to assay a drug by comparing its effect on the subjects to the effect of a standard preparation. The result is expressed as the relative potency of the test preparation compared to the standard preparation. If zT units of the test drug perform like ρzT units of the standard for any zT (that is, if the test drug acts like a dilution of the standard drug), this relation will hold regardless of the fluctuation in the sensitivity of the subjects. This implies that the E.D. 50 for the test preparation will be ρ⁻¹ times the E.D. 50 of the standard. On the log dose scale the log (E.D. 50)’s will differ by log ρ, and the estimation is usually carried out by estimating the E.D. 50’s for test and standard preparations on the log scale, taking the difference and then taking the antilog of the difference. The estimation methods above can be used for each preparation separately, although some refinements are useful in combining the two analyses.
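A minimal numerical illustration of the potency calculation follows; the two log E.D. 50 values are invented.

```python
# Sketch: relative potency from separately estimated E.D. 50's on the log-dose
# scale. Both log10 E.D. 50 values are invented.
log_ed50_standard = 1.30
log_ed50_test = 1.10

log_rho = log_ed50_standard - log_ed50_test    # log relative potency
rho = 10 ** log_rho
print(round(rho, 2))   # about 1.6: one unit of the test acts like 1.6 units of the standard
```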
The use of a standard preparation and the potency concept are especially necessary in clinical trials of tranquilizers or analgesics. In these cases, random variation is large, and stability from one trial to the next is difficult to achieve. Inclusion of doses of standard preparation for comparison of dose-response functions seems to be essential for each trial. [SeeScreening And Selection].
Quantal response in mental testing
The theory of mental tests developed by Lawley (1943) and elaborated by Lord (1952) and by others assumes a probability of correct response to each item that depends on the ability level of the responder. This functional relationship between probability of correct response and ability level was taken by Lawley to be a normal function, but Maxwell (1959) proposed the logistic function as an alternative and Baker (1961) presented empirical evidence that the logistic is more economical to use and fits mental-test data as well as the integrated normal.
In Lawley’s theory of mental testing, each item is characterized by a measure of difficulty, the ability level corresponding to 50 per cent probability of correct response. This is the mean (Med) of the normal cumulative distribution and is called the limen value for the item. The standard deviation of the normal distribution is a measure of the variation in the probability of response over ability levels. The inverse of this standard deviation is defined as the discriminating power of the item.
Although Ferguson (1942) and Baker (1961) have described in detail the methods of item analysis for estimating the limen and the discriminating power, with particular reference to the data typical of mental-test investigations, this method is rarely used at present and is not described in the introductory manuals on test construction.
There is a noteworthy difference between the typical mental-test data and the analogous data of psychophysics or drug assay. Although a detailed discussion of the theory of testing is not in order here, it should be noted that the x variable is ability. There would be difficulties in obtaining groups of subjects at specified ability levels. The usual procedure is to administer the test to a large group and to stratify them with respect to their total scores on the test after it is given. If the number of items in the test is large and if the number of ability groups is not too small, the usual assumptions for quantal assay will be well approximated. [See Psychometrics].
Other uses in the social sciences
There are other areas of the social sciences in which the quantal response model may prove useful. In market analysis, for example, the probability of purchasing specific items will depend on some underlying, continuous variable, such as income, amount of education, or exposure to advertising. Data on this dependence might well be available in the form of proportions of purchasers or of intending purchasers at increasing levels of the continuous variable. Such data could be summarized by estimating the level at which the probability attains some specified value, such as 50 per cent, and the rate at which the probability increases with the continuous variable. This technique could be adapted to the summary of many sample survey collections of data.
The concept of quantal response can be useful in estimating or verifying the validity of an average that might ordinarily be obtained by depending on the memory of persons interviewed. For example, the average age at which a child is able to take several steps without aid might be ascertained by questioning a large group of mothers whose children have been walking for some time. A less obvious, but probably more valid, method of obtaining the estimate would be to interview groups of mothers with children of various ages, ranging from six months to thirty months. The proportion of children walking at each age could be recorded, and this set of quantal response data could be analyzed by the methods discussed above to obtain an estimate of the age at which 50 per cent of the children walk, with a standard error for the estimate. This method has been used to estimate the age of menarche through interviews with adolescent schoolgirls.
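As an illustration of this use, the Spearman-style calculation might be applied to invented survey proportions of children walking at each age, as sketched below.

```python
# Sketch: the Spearman-style (histogram-mean) estimate of the median walking
# age from invented survey proportions of children walking at each age.
ages = [6, 8, 10, 12, 14, 16, 18]                      # months (hypothetical)
walking = [0.00, 0.02, 0.15, 0.55, 0.85, 0.97, 1.00]   # hypothetical proportions walking

med_age = sum(((ages[i] + ages[i + 1]) / 2) * (walking[i + 1] - walking[i])
              for i in range(len(ages) - 1))
print(round(med_age, 2))   # estimated age (months) at which 50 per cent walk
```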
Byron W. Brown, Jr.
[See also Counted Data; Psychophysics.]
BIBLIOGRAPHY
Baker, Frank B. 1961 Empirical Comparison of Item Parameters Based on the Logistic and Normal Functions. Psychometrika 26:239-246.
Berkson, Joseph 1955 Estimate of the Integrated Normal Curve by Minimum Normit Chi-square With Particular Reference to Bio-assay. Journal of the American Statistical Association 50:529-549.
Berkson, Joseph 1960 Nomograms for Fitting the Logistic Function by Maximum Likelihood. Biometrika 47:121-141.
Brown, Byron W. Jr. 1961 Some Properties of the Spearman Estimator in Bioassay. Biometrika 48:293-302.
Cochran, William G.; and Davis, Miles 1964 Stochastic Approximation to the Median Effective Dose in Bioassay. Pages 281-297 in Stochastic Models in Medicine and Biology: Proceedings of a Symposium … 1963. Edited by John Gurland. Madison: Univ. of Wisconsin Press.
Cornfield, Jerome; and Mantel, Nathan 1950 Some New Aspects of the Application of Maximum Likelihood to the Calculation of the Dosage Response Curve. Journal of the American Statistical Association 45:181-210.
Corso, John F. 1963 A Theoretico-Historical Review of the Threshold Concept. Psychological Bulletin 60:356-370.
Emmens, C. W. 1940 The Dose/Response Relation for Certain Principles of the Pituitary Gland, and of the Serum and Urine of Pregnancy. Journal of Endocrinology 2:194-225.
Ferguson, George A. 1942 Item Selection by the Constant Process. Psychometrika 7:19-29.
Finney, David J. (1947) 1962 Probit Analysis: A Statistical Treatment of the Sigmoid Response Curve. 2d ed. Cambridge Univ. Press.
Finney, David J. (1952) 1964 Statistical Method in Biological Assay. 2d ed. New York: Hafner.
Guilford, Joy P. (1936) 1954 Psychometric Methods. 2d ed. New York: McGraw-Hill.
Lawley, D. N. 1943 On Problems Connected With Item Selection and Test Construction. Royal Society of Edinburgh, Proceedings 61A:273-287.
Lord, F. 1952 A Theory of Test Scores. Psychometric Monograph No. 7. New York: Psychometric Society.
Maxwell, A. E. 1959 Maximum Likelihood Estimates of Item Parameters Using the Logistic Function. Psychometrika 24:221-227.
Miller, G. A.; and Garner, W. R. 1944 Effect of Random Presentation on the Psychometric Function: Implications for a Quantal Theory of Discrimination. American Journal of Psychology 57:451-467.
Spearman, C. 1908 The Method of “Right and Wrong Cases” (“Constant Stimuli”) Without Gauss’s Formulae. British Journal of Psychology 2:227-242.
Stevens, S. S.; Morgan, C. T.; and Volkmann, J. 1941 Theory of the Neural Quantum in the Discrimination of Loudness and Pitch. American Journal of Psychology 54:315-335.
Wetherill, G. B. 1963 Sequential Estimation of Quantal Response Curves. Journal of the Royal Statistical Society Series B 25:1-48. → Contains 10 pages of discussion by P. Armitage et al.