Distributions, Statistical
i. Special Discrete Distributions (Frank A. Haight)
ii. Special Continuous Distributions (Donald B. Owen)
iii. Approximations to Distributions (N. L. Johnson)
iv. Mixtures of Distributions (Wallace R. Blischke)
I SPECIAL DISCRETE DISTRIBUTIONS
The continuous distributions mentioned in this article, together with others, are described in detail in the companion article on special continuous distributions.
A discrete probability distribution gives the probability of every possible outcome of a discrete experiment, trial, or observation. By “discrete” it is meant that the possible outcomes are finite (or countably infinite) in number. As an illustration of the ideas involved, consider the act of throwing three coins onto a table. Before trying to assess the probability of various outcomes, it is necessary first to decide (as part of the definition of the experiment) what outcomes are possible. From the point of view of the dynamics of the falling coins, no two throws are exactly identical, and there are thus infinitely many possible outcomes; but if the experimenter is interested only in the heads and tails shown, he may wish to reduce the number of outcomes to the eight shown in Table 1.
Table 1 — Possible outcomes of the experiment of tossing three coins

| Coin 1 | Coin 2 | Coin 3 |
|---|---|---|
Head | Head | Head |
Head | Head | Tail |
Head | Tail | Head |
Head | Tail | Tail |
Tail | Head | Head |
Tail | Head | Tail |
Tail | Tail | Head |
Tail | Tail | Tail |
If the coins are fair, that is, equally likely to show a head or tail, and if they behave independently, the eight possibilities are also all equally probable, and each would have assigned probability 1/8. The list of eight 1/8's assigned to the eight outcomes is a discrete probability distribution that is exactly equivalent to the probabilistic content of the verbal description of the defining experiment.
The same physical activity (throwing three coins on a table) can be used to define another experiment. If the experimenter is interested only in the number of heads showing and not in which particular coins show the heads, he may define the possible outcomes to be only four in number: 0, 1, 2, 3. Since of the eight equally probable outcomes of the first experiment, one yields zero heads and one yields three heads while three yield one head and three yield two heads, the discrete probability distribution defined by the second experiment is p0 = 1/8, p1 = 3/8, p2 = 3/8, p3 = 1/8.
It is frequently convenient to write such a list in more compact form by using mathematical symbols. If pn means the probability of n heads, the second experiment leads to the formula

(1)  pn = C(3, n)(1/2)^3,  n = 0, 1, 2, 3,

where C(N, n), the binomial coefficient, is the mathematical symbol for the number of combinations of N things taken n at a time,

C(N, n) = N!/[n!(N − n)!],

n! (which is read "n-factorial") being equal to n(n − 1)(n − 2) ⋯ 3·2·1. Note that in (1) the form is given for the probabilities and then the domain of definition is stated.
A domain of definition of a probability distribution is merely a list of the possible outcomes of the experiment. In nearly every case, the experimenter finds it convenient to define the experiment so that the domain is numerical, as in the second experiment above. However, there is no reason why the domain should not be nonnumerical. In the first experiment above, the domain could be written (HHH), (HHT), (HTH), (HTT), (THH), (THT), (TTH), (TTT), where T means tails and H means heads.
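The two coin experiments can be checked by direct enumeration; the following minimal Python sketch lists the eight equally likely outcomes of the first experiment and pools them by the number of heads shown, reproducing the distribution (1).

```python
from itertools import product
from fractions import Fraction

# First experiment: the eight equally likely outcomes (HHH), (HHT), ..., (TTT).
outcomes = list(product("HT", repeat=3))
p_outcome = Fraction(1, len(outcomes))        # each outcome has probability 1/8

# Second experiment: count heads, pooling outcomes with the same count.
p_heads = {n: Fraction(0) for n in range(4)}
for outcome in outcomes:
    p_heads[outcome.count("H")] += p_outcome

# 0 and 3 heads each receive probability 1/8; 1 and 2 heads each receive 3/8.
print(p_heads)
```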
An expression for pn like the above needs only two properties to describe a valid discrete probability distribution. It must be nonnegative and the sum over all possible outcomes must be unity: pn ≥0, ∑pn = 1. It is therefore possible to invent any number of discrete probability distributions simply by choosing some function pn satisfying these simple conditions. In both theory and applications, it is more important to study certain probability distributions that correspond to basic probability experiments than to attempt a systematic classification of all functions satisfying these conditions. In the following list of important discrete distributions, the corresponding experiment is stated in very general terms, accompanied in many cases by concrete interpretations.
Discrete distributions
Binomial distribution
Suppose an experiment can have only two possible outcomes, conventionally called success and failure, and suppose the probability of success is p. If the experiment is repeated independently N times under exactly the same conditions, the number of successes, n, can be 0, 1, …, N. A composite experiment consisting of observing the number of successes in the N performances of the individual experiment can therefore have one of the following outcomes: 0, 1, 2, …, N. The probability of observing exactly n successes forms the binomial distribution:

(2)  pn = C(N, n)p^n(1 − p)^(N−n),  n = 0, 1, …, N.

The second experiment mentioned in the introduction is of this form, where N = 3 and p = 1/2.
The mean value of the binomial distribution is Np and its variance is Np(1 − p). The tails of the binomial distribution (and of the negative binomial discussed below) may be expressed as incomplete beta functions.
The following list of interpretations for the abstract words success and failure will indicate the breadth of application of the binomial distribution: male-female; infected-not infected; working-broken; pass-fail; alive-dead; head-tail; right-wrong. If a common test is known to give the right answer (using the last pair as an example) nine times out of ten on the average, and if it is performed ten times, what is the probability that it will give exactly nine right answers and one wrong one?
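As a worked answer to this question, the short sketch below applies formula (2) directly with the values given in the text (N = 10 trials, p = 0.9).

```python
from math import comb

def binomial_pmf(n, N, p):
    """Probability of exactly n successes in N independent trials, formula (2)."""
    return comb(N, n) * p**n * (1 - p)**(N - n)

# Probability of exactly nine right answers and one wrong one in ten trials,
# when the test gives the right answer nine times out of ten on the average.
print(binomial_pmf(9, N=10, p=0.9))   # about 0.387
```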
In using the binomial distribution for this purpose, it is important to remember that the probability of a success must be constant. For example, if each of 30 students took an examination composed of 3 equally difficult parts and each separate part was graded pass or fail, the number of fails out of the 90 grades would not have a binomial distribution since the probability of failure varies from student to student.
If, for the binomial distribution, N is large and p is small, then the simpler Poisson distribution (see next section) often forms a good approximation. The sense of this approximation is a limiting one, with N → ∞ and Np having a finite limit.
The normal distribution is often a good approximation to the binomial, especially when N is large and p not too close to 0 or 1. The limiting sense of the normal approximation is in terms of the standardized variable, n less its mean and divided by its standard deviation.
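The two approximations can be examined numerically; in the sketch below the values of N and p are invented for illustration, and the exact binomial probabilities are set beside the Poisson term e^(−Np)(Np)^n/n! and the normal density evaluated at the standardized value (a rough version of the normal approximation, without a continuity correction).

```python
from math import comb, exp, factorial, sqrt, pi

N, p = 100, 0.03                     # illustrative values: N large, p small
mean, sd = N * p, sqrt(N * p * (1 - p))

for n in range(7):
    binom = comb(N, n) * p**n * (1 - p)**(N - n)
    poisson = exp(-mean) * mean**n / factorial(n)
    normal = exp(-((n - mean) / sd) ** 2 / 2) / (sd * sqrt(2 * pi))
    print(n, round(binom, 4), round(poisson, 4), round(normal, 4))
```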
Poisson distribution
Imagine a machine turning out a continuous roll of wire or thread. Minor defects occur from time to time; their frequency and their statistical behavior are of interest. If the defects occur independently and if the risk of a defect is constant (that is, if the probability of a defect in a very small interval of the wire is a constant, λ, times the length of the interval), then the average number of defects per unit length is λ and the probability that a unit length contains exactly n defects is

(3)  pn = e^(−λ)λ^n/n!,  n = 0, 1, 2, ….
This distribution, named for the French mathematician Poisson, is an important example of a distribution defined over an infinite set of integers. Any value of n is possible, although large values (compared with λ) are unlikely. The mean and variance of the Poisson distribution are both λ. If the segment of unit length is replaced by a segment of length L, this is only a change in scale and one need only substitute Lλ for λ in (3) and in the moments. The tails of the Poisson distribution are incomplete gamma functions.
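A brief numerical sketch of formula (3) and of the change of scale just mentioned; the defect rate and the segment length are invented for illustration.

```python
from math import exp, factorial

def poisson_pmf(n, lam):
    """Probability of exactly n defects when the mean number of defects is lam."""
    return exp(-lam) * lam**n / factorial(n)

lam, L = 2.0, 1.5                       # hypothetical rate per unit length and segment length
probs = [poisson_pmf(n, lam * L) for n in range(6)]
print(probs, sum(probs))                # probabilities of 0..5 defects in a segment of length L
```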
The Poisson distribution has a very wide range of application, not only because of its mathematical simplicity but because it represents one important concept of true randomness. In addition, the Poisson distribution, as has been noted, is useful as an approximation.
The most important interpretations of the Poisson distribution occur when the line is considered to be the axis of real time and the points to be times at which certain events take place. The following is a partial list of events that have been compared with the Poisson distribution: deaths by horse kick, radioactive particles entering a Geiger-Müller counter, arrival of patients into a doctor's waiting room, declarations of war, strikes of lightning, registration of vehicles over a road tape, demands for service at a telephone switchboard, demands for parking space, occurrence of accidents, demands for electric power, instants of nerve excitation, occurrence of suicide, etc. [for a discussion of some of these applications, see Queues].
On the other hand, the interpretation of the Poisson distribution as a criterion of perfect disorder remains valid in space of any number of dimensions, provided only that L is interpreted as length, area, or volume as the case may require. Space examples are perhaps not as numerous, but the following can be noted: one dimension, misprints per line, cars per mile; two dimensions, bomb bursts per square mile, weeds per square yard, stars per square unit of photographic plate; three dimensions, raisins per loaf of raisin bread, bacteria per cubic centimeter of fluid.
Morse distribution
The Poisson distribution is an example of a counting distribution. In a single dimension (whether time or space) it is characterized by the fact that the continuous density function (the gap distribution) of distance between consecutive points is negative exponential (with λ = 1/θ). If the gap distribution is more general (for example, the Pearson Type III, with λ = 1/θ), the counting distribution may be very complicated indeed. With Pearson Type III gaps, the counting probabilities are
where
This distribution takes its name from Philip Morse (1958), and reduces to the Poisson for r = 1. In data that fitted such a function, one would have a mixture of randomness and regularity, suggesting two factors at work.
Negative binomial distribution
There are two traditional important probability models leading to the negative binomial distribution, although others have been suggested. (i) If each member of a population experiences Poisson events (for example, accidents) but with the mean value λ varying statistically from member to member according to the Pearson Type III distribution, then the probability that n events occur in time T among all members of the population is

pn = C(n + r − 1, n)p^r(1 − p)^n,  n = 0, 1, 2, …,

where p = θ/(T + θ) and r and θ are the parameters of the Pearson Type III distribution. (ii) If experiments are performed in sequence with a fixed probability p of success (as for the binomial distribution), then the negative binomial distribution gives the probability that exactly n failures will precede the rth success. This is particularly useful for r = 1, since in many applications one calls a halt after the first "success," for example, in repeated dialings of a telephone that may be busy. The negative binomial distribution takes its name from its formal similarity to the binomial distribution.
The mean of the negative binomial distribution is (1 − p)r/p and its variance is (1 − p)r/p². It is an example of a mixed distribution, obtained by assuming a parameter in one distribution to be itself subject to statistical fluctuation. [Many other such distributions, including those called "contagious," are discussed in Distributions, statistical, article on mixtures of distributions.]
Geometric distribution
In certain types of waiting lines, the probability of n persons in the queue is

pn = (1 − λ/μ)(λ/μ)^n,  n = 0, 1, 2, …,

where λ is the mean arrival rate, μ is the mean service rate, and λ < μ. The mean value of the geometric distribution is λ/(μ − λ), and the variance is λμ/(μ − λ)². The geometric distribution is a special case of the negative binomial distribution with r = 1 and p = (μ − λ)/μ. [These waiting line problems are discussed in Queues.]
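A minimal sketch of these queue-length probabilities, assuming the form pn = (1 − λ/μ)(λ/μ)^n given above; the arrival and service rates are invented.

```python
lam, mu = 3.0, 5.0                       # hypothetical arrival and service rates, lam < mu
rho = lam / mu

p = [(1 - rho) * rho**n for n in range(30)]
mean = sum(n * pn for n, pn in enumerate(p))
print(round(mean, 3), lam / (mu - lam))  # truncated-sum mean vs. the exact value lam/(mu - lam)
```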
Hypergeometric distribution
Suppose a collection of N objects contains k of one kind and N − k of another kind—for example, any of the dichotomies of the binomial distribution model. If exactly r objects are taken at random from the collection, the probability that n of them will be of the first kind is

(4)  pn = C(k, n)C(N − k, r − n)/C(N, r).

The mean of this distribution is kr/N and the variance is kr(N − k)(N − r)/[N²(N − 1)].
The binomial distribution and the hypergeometric distribution can be regarded as arising from analogous experiments, the former by sampling "with replacement" and the latter by sampling "without replacement." This can be seen from either of two facts: (i) if, in the probability experiment described for the hypergeometric distribution, each of the r objects is put back before the next one is chosen, the probability model leading to the binomial distribution would result; (ii) if N → ∞ in equation (4), with k/N replaced by p, the resulting limit is equation (2). Thus, if the total number of objects is very large and the proportions of the two types remain fixed, the hypergeometric model approaches the binomial model. If the number of objects of the first kind is very small in comparison with the total number of objects, the hypergeometric distribution will approach the Poisson. The hypergeometric distribution has important statistical application in Fisher's so-called exact test in a 2 × 2 table [see Counted Data].
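The approach of the hypergeometric model to the binomial model can be seen numerically; the sketch below applies formula (4) with invented values of r and of the proportion k/N and lets N grow.

```python
from math import comb

def hypergeom_pmf(n, N, k, r):
    """Formula (4): probability that n of the r objects drawn are of the first kind."""
    return comb(k, n) * comb(N - k, r - n) / comb(N, r)

def binom_pmf(n, r, p):
    return comb(r, n) * p**n * (1 - p)**(r - n)

r, frac = 5, 0.2                      # draw 5 objects; one fifth of the collection is of the first kind
for N in (25, 100, 1000):             # growing collection, k/N held at 0.2
    k = int(frac * N)
    print(N, [round(hypergeom_pmf(n, N, k, r), 4) for n in range(r + 1)])
print("binomial", [round(binom_pmf(n, r, frac), 4) for n in range(r + 1)])
```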
It is possible to obtain many other distributions, both discrete and continuous, as special or limiting cases of the hypergeometric. Karl Pearson used the probability model leading to the hypergeometric distribution as the basis for his analysis of density functions and obtained 12 separate types as special limiting cases. Of these, only the Pearson Type III is still frequently called by his designation, although many of the most common continuous densities (including the normal) fit into his classification system [see Distributions, statistical, article on approximations to distributions].
Negative hypergeometric distribution
If the parameter p in a binomial distribution is itself beta distributed, then the unconditional distribution of the discrete variate is
where p and q are the parameters of the beta distribution and
This model is the discrete time analog of the first interpretation of the negative binomial distribution; the events (such as accidents) would be counted in short time periods so that only a success or failure would be recorded.
Occupancy distribution
If each of k objects is thrown at random into one of N boxes, the probability that a given box contains n objects is

pn = C(k, n)(1/N)^n(1 − 1/N)^(k−n),  n = 0, 1, …, k.

This distribution has mean k/N and variance (k/N)(1 − 1/N). Problems of this type are called occupancy problems and are important in statistical mechanics.
Busy period distributions
A box contains r objects, and additional objects are thrown into the box at random instants, that is, in accordance with a Poisson distribution (mean λ). If one object is removed in every time interval of length 1/μ, then the probability that exactly n objects will pass through the box before it first becomes empty is

pn = (r/n)e^(−nλ/μ)(nλ/μ)^(n−r)/(n − r)!,  n = r, r + 1, ….

This is the Borel-Tanner distribution, with mean r/(1 − λ/μ) and variance (λr/μ)/(1 − λ/μ)³. If the objects are removed also in accordance with a Poisson distribution (mean μ), then the distribution of the number n that will pass through before the box first becomes empty is called the Narayana distribution. This distribution has mean r/(1 − λ/μ) and variance (λr/μ)(1 + λ/μ)/(1 − λ/μ)³.
Such distributions are important in the theory of queues, where the box represents the collection of people waiting. Then the number n, which has probability pn, is the length of a busy period for the service mechanism, beginning with r in the system [see Queues].
Uniform distribution
An experiment with N equally likely outcomes corresponds to the discrete uniform (or discrete rectangular) distribution:

pn = 1/N,  n = 1, 2, …, N.

For a single throw of a die, N = 6. This distribution has mean (N + 1)/2 and variance (N² − 1)/12.
Yule distribution
Some experimental data suggest that in a long list of names (for example, a telephone book) the probability that a randomly chosen name occurs n times is
Fisher distribution
In the distribution of frequency of plant species, the distribution

pn = ap^n/n,  n = 1, 2, …,

occurs, where 1/a = −log_e(1 − p). The Fisher distribution has mean ap/(1 − p) and variance ap(1 − ap)/(1 − p)². This distribution has also been called the logarithmic distribution because of its close relationship to the Taylor series for −log_e(1 − p). Fisher's distribution can be obtained from the negative binomial by truncating the zero category and letting r approach zero.
Formation of distributions from others
The most important families of secondary distributions are the so-called mixed distributions, in which the parameter in a given distribution is itself subject to statistical fluctuation [see Distributions, statistical, article on mixtures of distributions]. Two examples above (the first case of the negative binomial and the negative hypergeometric) are also mixed. In addition, the following paragraphs illustrate other methods of obtaining secondary distributions.
Truncated distributions
It may happen that the domain of definition of a distribution does not exactly agree with some data, either for theoretical reasons or because part of the data is unobtainable, although the model is in other respects quite satisfactory. The most famous example of this is concerned with albinism. The Poisson distribution gives a satisfactory fit to the number of albino children born to parents genetically capable of producing albinos, except for the value of p0, the probability of no albino children. This frequency cannot be observed, since such parents with no albino children are ordinarily indistinguishable from normal parents. Therefore, a new distribution is formed from the Poisson by removing the zero category and dividing each probability by 1 − p0 so that the total sum remains one:

pn = e^(−λ)λ^n/[n!(1 − e^(−λ))],  n = 1, 2, ….
In certain applications, it has been useful to truncate similarly the binomial distribution. [The practice, which has also been applied to continuous density functions, is discussed in Statistical analysis, special problems of, article on truncation and censorship.]
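A small sketch of the truncation just described, assuming the zero-truncated form given above; the value of λ is invented.

```python
from math import exp, factorial

lam = 1.2                                      # hypothetical mean of the untruncated Poisson
p0 = exp(-lam)

def truncated_poisson(n, lam):
    """Poisson probability renormalized after removing the unobservable zero class."""
    return (exp(-lam) * lam**n / factorial(n)) / (1 - p0)

probs = [truncated_poisson(n, lam) for n in range(1, 25)]
print(round(sum(probs), 6))                    # close to 1: the truncated probabilities sum to one
```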
Bissinger’s system
In connection with an inventory problem, the distribution qn, formed from a given distribution pn by the transformation
has been useful. A more general transformation of this type is
which reduces to the simpler form for k = 1. This transformation is discussed by Bissinger in Patil (1965).
Joint occupancy
Let pn be the probability of n objects of type I in a box, and let β be the probability that an object of type I brings an object of type II with it into the box. Then the probability qn of a total of n objects in the box is
where [k] represents the integer part of k. This form of distribution has been applied to the number of persons in an automobile and to the number of persons in a group buying a railway ticket. The two types of objects might be male-female or adult-child.
Analogy with density functions
It is only necessary to adjust a constant in order to convert a continuous distribution into a discrete one. For example, the negative exponential density λe^(−λx), usually defined over 0 < x < ∞, can be applied to the domain n = 0, 1, 2, …. However, since

Σ from n = 0 to ∞ of e^(−λn) = 1/(1 − e^(−λ)),

the discrete probability distribution will be

pn = (1 − e^(−λ))e^(−λn),  n = 0, 1, ….
Several such discrete analogs, including the normal, have appeared in the literature.
Alternative descriptions of distributions
A probability distribution pn can be transformed in many ways, and, provided the transformation is one-to-one, the result will characterize the distribution equally well. Some of the principal auxiliary functions are
The last two functions are so named because in the corresponding power series expansion the coefficients involve, respectively, the central moments and the factorial moments.
For certain mathematical operations on discrete probability distributions, such as the calculation of moments, generating functions are extremely useful. In many cases the best way to obtain pn from a defining experiment is to calculate φ(s) first. Many examples of this type of argument can be found in Riordan (1958).
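As an illustration of how a generating function carries the moments, the sketch below assumes φ(s) denotes the probability generating function Σ pn s^n, builds it for a binomial distribution with invented N and p, and recovers the mean Np as φ′(1).

```python
from math import comb

N, p = 6, 0.3
pmf = [comb(N, n) * p**n * (1 - p)**(N - n) for n in range(N + 1)]

def phi(s):
    """Probability generating function: phi(s) = sum of p_n * s**n."""
    return sum(pn * s**n for n, pn in enumerate(pmf))

def phi_prime(s):
    """Its derivative, sum of n * p_n * s**(n-1); phi'(1) equals the mean."""
    return sum(n * pn * s**(n - 1) for n, pn in enumerate(pmf) if n > 0)

print(phi(1.0), phi_prime(1.0), N * p)   # 1.0, 1.8, 1.8
```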
Multivariate distributions
There have been a few discrete multivariate distributions proposed, but only one, the multinomial, has been very widely applied.
Multinomial distribution
If an experiment can have r possible outcomes, with probabilities p1, p2, ···, pr, and if it is repeated under the same conditions N times, the probability that the jth outcome will occur nj times, j = 1, 2, ···, r, is

[N!/(n1! n2! ⋯ nr!)] p1^n1 p2^n2 ⋯ pr^nr,  n1 + n2 + ⋯ + nr = N,
which reduces to the binomial for r = 2.
Negative multinomial distribution
If, in the scheme given above, trials stop after the rth outcome has been observed exactly k times, the joint probability that the jth outcome will be observed nj times (j = 1, ···, r − 1) is called the negative multinomial distribution.
Bivariate Poisson distribution
The joint probability
has been proposed as a generalization of the Poisson; when p = 0, n1 and n2 are independently Poisson.
Frank A. Haight
[Other relevant material may be found in Probability and in the biographies of Fisher, R. A.; Pearson; Poisson.]
BIBLIOGRAPHY
By far the best textbook on the theory and application of discrete distributions is Feller 1950. Busy period distributions and other distributions arising from queueing theory will be found in Takács 1962. Riordan 1958 explains the various auxiliary functions and the relationships between them. An index to all distributions with complete references to the literature is given in Haight 1961. The most complete and useful volume of statistical tables is Owen 1962. Further references to tables may be found in Greenwood & Hartley 1962. Computer programs for generating tables are distributed on a cooperative basis by SHARE Distribution Agency, International Business Machines Corporation. Patil 1965 contains several research and expository papers discussing the probabilistic models, structural relations, statistical theory, and methods for many of the discrete distributions mentioned in the list above, together with a bibliography on the subject by the editor.
Feller, William 1950–1966 An Introduction to Probability Theory and Its Applications. 2 vols. New York: Wiley. → A second edition of Volume I was published in 1957.
Greenwood, Joseph A.; and HARTLEY, H. O. 1962 Guide to Tables in Mathematical Statistics. Princeton Univ. Press.
Haight, Frank A. 1961 Index to the Distributions of Mathematical Statistics. U.S. National Bureau of Standards, Journal of Research Series B: Mathematics and Mathematical Physics 65B: –60.
Morse, Philip M. 1958 Queues, Inventories and Maintenance: The Analysis of Operational Systems With Variable Demand and Supply. New York: Wiley.
Owen, Donald B. 1962 Handbook of Statistical Tables. Reading, Mass.: Addison-Wesley. → A list of addenda and errata is available from the author.
Patil, Ganapati P. (editor) 1965 Classical and Contagious Discrete Distributions. Proceedings of the International Symposium held at McGill University, Montreal, Canada, August 15-August 20, 1963. Calcutta (India): Statistical Publishing Society; distributed by Pergamon Press. → See especially pages –17 on “A Type-resisting Distribution Generated From Considerations of an Inventory Decision Model” by Bernard H. Bissinger.
Riordan, John 1958 An Introduction to Combinatorial Analysis. New York: Wiley.
Takács, Lajos 1962 Introduction to the Theory of Queues. New York: Oxford Univ. Press.
II SPECIAL CONTINUOUS DISTRIBUTIONS
This article describes, and gives the more important properties of, the major continuous distributions that arise in statistics. It is intended as both an overview for the reader generally interested in distributions and as a reference for a reader seeking the form of a particular distribution.
Technical terms, such as “density function,” “cumulative distribution function,” etc., are explained elsewhere [seeProbability].
The present article is, with a few exceptions, restricted to univariate distributions. Further specific references to numerical tabulations of distributions are not generally given here; most of the distributions are tabulated in Owen (1962) or in Pearson and Hartley (1954), and full references to other tabulations are given in Greenwood and Hartley (1962). Tables of many functions discussed here and an extensive reference list to tables are given in Zelen and Severo (1964). An index to properties of distributions is given in Haight (1961).
Normal distributions
The most important family of continuous distributions is that of the normal (or Gaussian) probability distributions. [See Probability, article on formal probability, for more of the many properties of the normal distributions.]
The normal probability density function is

f(x) = [1/(σ√(2π))] exp[−(x − μ)²/(2σ²)],  −∞ < x < +∞,

where μ is the mean (and also here the median and the mode) and σ is the standard deviation. Figure 1 shows the shape of this density. Note that it is symmetric about μ, that is, f(μ + x) = f(μ − x), and that the density is essentially zero for x > μ + 3σ and x < μ − 3σ. The normal distribution is sometimes said to be "bell-shaped," but note that there are many nonnormal bell-shaped distributions. The normal distribution is "standardized" or "normalized" by the transformation z = (x − μ)/σ, which gives the standard-normal density,

f(z) = [1/√(2π)] exp(−z²/2).
A standardized normal random variable is also referred to as a "unit" normal since the mean of the standardized form is zero and the variance is one. The cumulative distribution function for the normal distribution in standardized form is

Φ(z) = [1/√(2π)] ∫ from −∞ to z of exp(−t²/2) dt.
Thus, to find the probability that a normal random variable with mean μ and standard deviation σ is less than x, first compute the number of standard deviations x is away from μ, that is, let z = (x − μ)/σ. Probabilities associated with x then may be read from tables of the standardized distribution. Care must be exercised to determine what is tabulated in any particular table. Various tables give the following: the cumulative probability; the probability in the right-hand tail only; the sum of the probabilities in the two tails; the central probability, that is, the probability that the absolute value of the random variable is less than the argument; and others.
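A short sketch of this standardization step, with the cumulative probability Φ(z) evaluated through the error function; the mean and standard deviation are invented.

```python
from math import erf, sqrt

def normal_cdf(x, mu, sigma):
    """P(X <= x) for a normal variable: standardize, then evaluate Phi(z)."""
    z = (x - mu) / sigma
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

mu, sigma = 100.0, 15.0                          # hypothetical mean and standard deviation
print(round(normal_cdf(130.0, mu, sigma), 4))    # P(X <= mu + 2*sigma), about 0.9772
```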
A great many probability distributions are derived from the normal distribution; see Eisenhart and Zelen (1958); Kendall and Stuart ([–1946] –1961); Korn and Korn (1961); Zelen and Severo (1964). These references also list additional continuous distributions not covered here.
Normal distributions are sometimes called Gaussian, sometimes Laplace-Gauss distributions, and sometimes distributions following Laplace’s second law. This terminology reflects a tangled and often misstated history [see Walker 1929, chapter 2; see alsoGauss; Laplace].
Chi-square distributions
If X is unit normally distributed, X² has a chi-square distribution with one degree of freedom. If X1, X2, ···, Xf are all unit normally distributed and independent, then the sum of squares X1² + X2² + ⋯ + Xf² has a chi-square distribution with f degrees of freedom.
The probability density function for a chi-square random variable is

f(x) = x^(f/2 − 1) exp(−x/2)/[2^(f/2)Γ(f/2)],  0 < x < ∞.
This is also a probability density function when f is positive, but not integral, and a chi-square distribution with fractional degrees of freedom is simply defined via the density function.
The mean of a chi-square distribution is f, and its variance is 2f. For f > 2, the chi-square distributions have their mode at f – 2. For 0 < f ≤ 2, the densities are J-shaped and are maximum at the origin. Figure 2 shows the shape of the chi-square distributions.
If Y1, Y2 are independent and chi-square distributed with f1, f2 degrees of freedom, then Y1 + Y2 is chi-square with f1 + f2 degrees of freedom. This additivity property extends to any finite number of independent, chi-square summands.
If Y has a chi-square distribution with f degrees of freedom, then Y/f has a mean-square distribution, or a chi-square divided by degrees of freedom distribution.
The cumulative distribution function for the chi-square random variable with an even number of degrees of freedom (equal to 2a) is related to the Poisson cumulative distribution function with mean λ as follows:

Pr{χ²f ≤ 2λ} = 1 − Σ from n = 0 to a − 1 of e^(−λ)λ^n/n!,

where f = 2a.
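This relation can be verified numerically; in the sketch below the chi-square cumulative probability is obtained by crude numerical integration of the density (an illustrative shortcut, not a recommended computing method) and compared with the Poisson sum. The values of a and λ are invented.

```python
from math import exp, factorial, gamma

def chi2_cdf(x, f, steps=200_000):
    """P(chi-square with f d.f. <= x), by midpoint integration of the density."""
    h = x / steps
    total = 0.0
    for i in range(1, steps + 1):
        t = (i - 0.5) * h
        total += t**(f / 2 - 1) * exp(-t / 2)
    return total * h / (2**(f / 2) * gamma(f / 2))

a, lam = 4, 3.0                                   # f = 2a degrees of freedom, Poisson mean lam
poisson_tail = sum(exp(-lam) * lam**n / factorial(n) for n in range(a))
print(round(1 - chi2_cdf(2 * lam, 2 * a), 6), round(poisson_tail, 6))   # the two agree
```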
Gamma or Pearson Type III distributions
The gamma distribution, a generalization of the chi-square distribution, has probability density function

f(x) = x^(r−1) exp(−x/θ)/[θ^r Γ(r)],

where 0 < θ < ∞, 0 < r < ∞, and 0 < x < ∞. Note that r does not have to be an integer. The mean of this distribution is rθ, and the variance is rθ². A simple modification permits shifting the left endpoint from zero to any other value in this distribution and in several others discussed below. If r = f/2 and θ = 2, the gamma distribution reduces to a chi-square distribution with f degrees of freedom. If X has a chi-square distribution with f degrees of freedom, then (θ/2)X has a gamma distribution with parameters θ and r = f/2. If Y has a gamma distribution with parameters θ, r, then 2θ^(−1)Y has a chi-square distribution with 2r degrees of freedom.
Negative exponential distributions
Negative exponential distributions are special cases of gamma distributions with r = 1. The probability density is (1/θ) exp(−x/θ) for 0 ≤ x < ∞, 0 < θ < ∞; the mean is θ and the variance is θ². If θ = 2, the negative exponential distribution reduces to a chi-square distribution with 2 degrees of freedom.
The cumulative distribution is
1 – exp (– x/θ), 0 ≤ x < ∞, 0 < θ < ∞.
This distribution has been widely used to represent the distribution of lives of certain manufactured goods, for example, light bulbs, radio tubes, etc. [see Quality control, statistical, article on reliability and life testing].
Suppose that the probability that an item will function over the time period t to t + Δt is independent of t, given that the item is functioning at time t. In other words, suppose that the age of an item does not affect the probability that it continues to function over any specified length of future time provided the item is operating at present. In still other terms, if X is a random variable denoting length of life for an item, suppose that
Pr{X ≥ x | X ≥ ξ} = Pr{X ≥ x − ξ}
for all ξ and x > ξ. This "constant risk property" obtains if, and only if, X has a negative exponential distribution. The underlying temporal process is called a stationary Poisson process; and, for a Poisson process, the number of failures (or deaths) in any given interval of time has a Poisson distribution [see Queues and the biography of Poisson].
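A simulation sketch of the constant risk property: among simulated exponential lifetimes, the lives remaining beyond an age ξ behave like fresh lifetimes. The values of θ and ξ are invented.

```python
import random

random.seed(1)
theta, xi = 2.0, 1.5                       # hypothetical mean life and conditioning age
lives = [random.expovariate(1 / theta) for _ in range(200_000)]

survivors = [x - xi for x in lives if x >= xi]
print(round(sum(lives) / len(lives), 3),           # mean life, near theta
      round(sum(survivors) / len(survivors), 3))   # mean remaining life past xi, also near theta
```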
Noncentral chi-square distributions
If X1, X2, ···, Xf are all normally distributed and independent, and if Xi has mean μi and variance one, then the sum X1² + X2² + ⋯ + Xf² has a noncentral chi-square distribution with f degrees of freedom and noncentrality parameter λ = μ1² + μ2² + ⋯ + μf². This family of distributions can be extended to nonintegral values of f by noting that the density function (not given here) obtained for integral values of f is still a density function for the nonintegral values of f. The mean of the noncentral chi-square distribution is f + λ and the variance is 2(f + 2λ).
Perhaps the main statistical use of the noncentral chi-square distribution is in connection with the power of standard tests for counted data [seeCounted Data].
The distribution also arises in bombing studies. For example, the proportion of a circular target destroyed by a bomb with a circular effects region may be obtained from the noncentral chi-square distribution with two degrees of freedom if the aiming errors follow a circular normal distribution. (A circular normal distribution is a bivariate normal distribution, discussed below, with ρ = 0 and σX = σY.)
Noncentral chi-square distributions have an additivity property similar to that of (central) chi-square distributions.
Weibull distributions
A random variable X has a Weibull distribution if its probability density function is of the form

f(x) = (r/θ)(x/θ)^(r−1) exp[−(x/θ)^r],

where 0 ≤ x < ∞, 0 < θ < ∞, and r ≥ 1. This means that random variables with Weibull distributions can be obtained by starting with negative exponential random variables and raising them to the power 1/r. If in particular r = 1, the Weibull distribution reduces to a negative exponential distribution. The Weibull distributions are widely used to represent the distribution of lives of various manufactured products. The mean of X is θΓ[(r + 1)/r], and θ²{Γ[(r + 2)/r] − Γ²[(r + 1)/r]} is the variance of X.
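Under the parameterization given above, if E is a standard (mean one) negative exponential variable then θE^(1/r) has the Weibull distribution; the sketch below checks the sample mean of such variables against θΓ[(r + 1)/r], with invented θ and r.

```python
import random
from math import gamma

random.seed(2)
theta, r = 3.0, 2.0                         # hypothetical scale and shape, r >= 1
samples = [theta * random.expovariate(1.0) ** (1 / r) for _ in range(200_000)]

print(round(sum(samples) / len(samples), 3),        # simulated mean
      round(theta * gamma((r + 1) / r), 3))         # theoretical mean, theta * Gamma((r+1)/r)
```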
Student (or t-) distributions
If X has a unit normal distribution and if Y is distributed independently of X according to a chi-square distribution with f degrees of freedom, then X/√(Y/f) has a Student (or t-) distribution with f degrees of freedom. Note that f need not be an integer since the degrees of freedom for chi-square need not be integral. The only restriction is 0 < f < ∞. The density for a random variable having the Student distribution is

f(x) = {Γ[(f + 1)/2]/[√(fπ)Γ(f/2)]}(1 + x²/f)^(−(f+1)/2),

where −∞ < x < +∞. A graph of the density functions for f = 1 and f = 3 is shown in Figure 1. Note that the density is symmetric about zero. As f approaches ∞, the Student density approaches the unit normal density.
The rth moment of the Student distribution exists if and only if r < f. Thus for f ≥ 2, the mean is zero; and for f ≥ 3, the variance is f/(f – 2). The median and the mode are zero for all f.
The Student distribution is named after W. S. Gosset, who wrote under the pseudonym Student. Gosset's development of the t-distribution, as it arises in dealing with normal means, is often considered to be the start of modern "small sample" mathematical statistics [see Gosset].
The sample correlation coefficient, r, based on n pairs of observations from a bivariate normal population, may be reduced to a Student t-statistic with n − 2 degrees of freedom, when the population correlation coefficient is zero, by the transformation t = r√(n − 2)/√(1 − r²) [see Multivariate analysis, articles on correlation].
Cauchy distributions
The Cauchy distribution is an example of a distribution for which no moments exist. The probability density function is

f(x) = 1/{πβ[1 + ((x − λ)/β)²]},

where −∞ < x < +∞, β > 0, and −∞ < λ < +∞. The cumulative probability distribution is

F(x) = 1/2 + (1/π) arctan[(x − λ)/β].

The median and the mode of this distribution are at x = λ. For β = 1 and λ = 0, the Cauchy distribution is also a Student t-distribution with f = 1 degree of freedom.
Noncentral t-distributions
If X is a unit normal random variable and if Y is distributed independently of X according to a chi-square distribution with f degrees of freedom, then (X + δ)/√(Y/f) has a noncentral t-distribution with f degrees of freedom and noncentrality parameter δ, where 0 < f < ∞ and −∞ < δ < ∞. The mean of this distribution, for f ≥ 2, is c11δ, and the variance, for f ≥ 3, is c22δ² + c20, where c11 = √(f/2)Γ[(f − 1)/2]/Γ(f/2), c22 = [f/(f − 2)] − c11², and c20 = f/(f − 2). If δ = 0, the noncentral t-distribution reduces to the Student t-distribution. Note that the moments do not exist for f = 1; despite this similarity, the noncentral t-distribution with δ ≠ 0 is not a Cauchy distribution. The noncentral t-distribution arises when considering one-sided tolerance limits on a normal distribution and in power computations for the Student t-test.
F-distributions
If Y1 has a chi-square distribution with f1 degrees of freedom and Y2 has a chi-square distribution with f2 degrees of freedom, and Y1 and Y2 are independent, then (Y1/f1)/(Y2/f2) has an F-distribution with f1 degrees of freedom for the numerator and f2 degrees of freedom for the denominator. The F-distributions are also known as Snedecor's F-distributions and variance ratio distributions. They arise as the distributions of the ratios of many of the mean squares in the analysis of variance [see Linear hypotheses, article on analysis of variance].
The density of F is

f(x) = {Γ[(f1 + f2)/2]/[Γ(f1/2)Γ(f2/2)]}(f1/f2)^(f1/2) x^(f1/2 − 1)[1 + (f1/f2)x]^(−(f1+f2)/2),

where 0 < x < ∞, 0 < f1 < ∞, and 0 < f2 < ∞. Figure 3 shows a plot of this density for four cases: f1 = f2 = 1, f1 = f2 = 2, f1 = 3, f2 = 5, and f1 = f2 = 10. Of these the cases of f1 = 3, f2 = 5, and f1 = f2 = 10 are most typical of F-distributions.

The mean of this distribution is f2/(f2 − 2) for f2 > 2; for f2 > 4 the variance is 2f2²(f1 + f2 − 2)/[f1(f2 − 2)²(f2 − 4)], and the mode is f2(f1 − 2)/[f1(f2 + 2)] for f1 > 2. The F-distribution is J-shaped if f1 ≤ 2. When f1 and f2 are greater than 2, the F-distribution has its mode below x = 1 and its mean above x = 1 and, hence, is positively skew.
Let Ff1,f2 represent a random variable having an F-distribution with f1 degrees of freedom for the numerator and f2 degrees of freedom for the denominator. Then Pr{Ff1,f2 ≤ c} = 1 − Pr{Ff2,f1 ≤ 1/c}. Hence the F-distribution is usually tabulated for one tail (usually the upper tail), as the other tail is easily obtained from the one tabulated.

The chi-square and t-distributions are related to the F-distributions as follows: f2/F∞,f2 has a chi-square distribution with f2 degrees of freedom; f1Ff1,∞ has a chi-square distribution with f1 degrees of freedom; F1,f2 is distributed as the square of a Student-distributed random variable with f2 degrees of freedom; 1/Ff1,1 is distributed as the square of a Student-distributed random variable with f1 degrees of freedom.
Let E(n, r, p) = Σ from j = r to n of C(n, j)p^j(1 − p)^(n−j), that is, let E(n, r, p) be the probability of r or more successes for a binomial probability distribution where p is the probability of success of a single trial [see Distributions, statistical, article on special discrete distributions]. Let c_{v,f1,f2} be defined by Pr{Ff1,f2 ≤ c_{v,f1,f2}} = v. Then the following relationship exists between the binomial and F-distributions. If E(n, r, p) = v, then

p = r/[r + (n − r + 1)c_{1−v, 2(n−r+1), 2r}].
A slight variation on the F-distribution obtained by the change of variable Y = [f1/f2]X is often referred to as the inverted beta distribution, the beta distribution of the second kind, or the beta prime distribution. These names are also occasionally applied to the F-distribution itself. Another variation on the F-distribution is Fisher's z-distribution, which is that of ½ ln F.
Beta distributions
A random variable, X, has a beta distribution with parameters p and q if its density is of the form

f(x) = [Γ(p + q)/(Γ(p)Γ(q))] x^(p−1)(1 − x)^(q−1),

where 0 < x < 1; p, q > 0. If the transformation X = f1·Ff1,f2/(f2 + f1·Ff1,f2) is made, then X has a beta distribution with p = f1/2 and q = f2/2.
The cumulative beta distribution is known as the incomplete beta function,

Ix(p, q) = [Γ(p + q)/(Γ(p)Γ(q))] ∫ from 0 to x of t^(p−1)(1 − t)^(q−1) dt.
The incomplete beta function has been tabulated by Karl Pearson (see Pearson 1934). The relationship Ix(p,q) = 1 – I1-x(q,p) is often useful. The mean of the beta distribution is p/(p + q), and the variance is pq/[( p + q)2 (p + q + 1)].
The mode for p ≥ 1 and q ≥ 1, but p and q not both equal to 1, is (p − 1)/(p + q − 2). For p ≤ 1 and q > 1, the density is in the shape of a reversed J; for p > 1 and q ≤ 1 the density is J-shaped; and for p < 1 and q < 1, the density is U-shaped. For both p and q = 1, the density takes the form of the density of the rectangular distributions with a = 0 and b = 1. The beta distribution is also known as the beta distribution of the first kind or the incomplete beta distribution.
Let E(n, r, x) = Σ from j = r to n of C(n, j)x^j(1 − x)^(n−j), that is, let E(n, r, x) be the probability of r or more successes in n trials for a binomial probability distribution where x is the probability of success of a single trial. Then Ix(p, q) = E(p + q − 1, p, x). In other words, partial binomial sums are expressible directly in terms of the incomplete beta function. To solve Ix(p, q) = γ for x, with γ, p, and q fixed, find x from x = p/[p + q·c_{1−γ,2q,2p}], where c_{γ,f1,f2} is defined by Pr{Ff1,f2 ≤ c_{γ,f1,f2}} = γ, that is, c_{γ,f1,f2} is a percentage point of the F-distribution.
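The identity Ix(p, q) = E(p + q − 1, p, x) is easily verified numerically for integer p and q; in the sketch below the incomplete beta function is evaluated by crude midpoint integration (an illustrative shortcut), and the values of p, q, and x are invented.

```python
from math import comb, gamma

def incomplete_beta(x, p, q, steps=200_000):
    """I_x(p, q): the cumulative beta distribution, by midpoint integration."""
    const = gamma(p + q) / (gamma(p) * gamma(q))
    h = x / steps
    return const * h * sum(((i - 0.5) * h) ** (p - 1) * (1 - (i - 0.5) * h) ** (q - 1)
                           for i in range(1, steps + 1))

def binomial_upper_tail(n, r, x):
    """E(n, r, x): probability of r or more successes in n trials."""
    return sum(comb(n, j) * x**j * (1 - x) ** (n - j) for j in range(r, n + 1))

p, q, x = 3, 5, 0.4
print(round(incomplete_beta(x, p, q), 6), round(binomial_upper_tail(p + q - 1, p, x), 6))
```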
If Y1 and Y2 are independent random variables having gamma distributions with equal values of θ, r = n1 for Y1, and r = n2 for Y2, then the variable X = Y1/(Y1 + Y2) has a beta distribution with p = n1 and q = n2.
Bivariate beta distributions
The random variables X and Y are said to have a joint bivariate beta distribution if the joint probability density function for X and Y is given by

f(x, y) = {Γ(f1 + f2 + f3)/[Γ(f1)Γ(f2)Γ(f3)]} x^(f1−1) y^(f2−1) (1 − x − y)^(f3−1),

for x, y > 0 and x + y < 1, 0 < f1 < ∞, 0 < f2 < ∞, and 0 < f3 < ∞. This distribution is also known as the bivariate Dirichlet distribution. The mean of X is given by f1/f+, and the mean of Y by f2/f+, where f+ represents (f1 + f2 + f3). The variance of X is f1(f+ − f1)/[f+²(f+ + 1)]; the variance of Y is f2(f+ − f2)/[f+²(f+ + 1)]; the correlation between X and Y is −√{f1f2/[(f+ − f1)(f+ − f2)]}. The conditional distribution of X/(1 − Y), given Y, is a beta distribution with p = f1 and q = f3. The sum X + Y has a beta distribution with p = f1 + f2 and q = f3.
Noncentral F-distributions
If Y1 has a noncentral chi-square distribution with f1 degrees of freedom and noncentrality parameter λ, and if Y2 has a (central) chi-square distribution with f2 degrees of freedom and is independent of Y1, then (Y1/f1)/(Y2/f2) has a noncentral F-distribution with f1 degrees of freedom for the numerator and f2 degrees of freedom for the denominator and noncentrality parameter λ. The cumulative distribution function of noncentral F may be closely approximated by a central F cumulative distribution function as follows:
where "Bivariate beta distributions" has a noncentral F-distribution and .Ff* f2, has a central F-distribution and f* denotes (f1 + λ)2 /(f1 + 2λ). The mean of the noncentral F-distribution is f2(f1 + λ)/[(f2 – 2)f1] for f2 > 2, and the variance is
for f2 > 4. The means do not exist if f2 ≤ 2 and the variances do not exist if f2 ≤ 4.
The ratio of two noncentral chi-square random variables also arises occasionally. This distribution has not yet been given a specific name. The noncentral F-distribution is, of course, a special case of the distribution of this ratio. The noncentral F-distribution arises in considering the power of analysis of variance tests.
Bivariate normal distributions
To say that X and Y have a joint (nonsingular) normal distribution with means μX and μY, variances σX² and σY², and correlation ρ is to say that the joint probability density function for X and Y is

f(x, y) = {1/[2πσXσY√(1 − ρ²)]} exp{−[((x − μX)/σX)² − 2ρ((x − μX)/σX)((y − μY)/σY) + ((y − μY)/σY)²]/[2(1 − ρ²)]}.
The cumulative distribution function occurs in many problems; it is a special case (two-dimensional) of the multivariate normal distribution. A fundamental fact is that X and Y are jointly normal if and only if aX + bY is normal for every a and b [seeMultivariate analysis].
Distributions of the sum of normal variables
Let X1, X2, ···, Xn be jointly normally distributed random variables so that Xi has mean μi and variance σi², and the correlation between Xi and Xj is ρij. Then the distribution of the weighted sum, a1X1 + a2X2 + ⋯ + anXn, where a1, a2, ···, an are any real constants (positive, negative, or zero), is normal with mean a1μ1 + a2μ2 + ⋯ + anμn and variance

Σ over i of ai²σi² + 2 Σ over i < j of aiajρijσiσj.
If the normality assumption is dropped, then the means and variances remain as stated, but the form of the distribution of the sum a1X1 + a2X2 + ⋯ + anXn is often different from the distribution of the X's. If a linear function of a finite number of independent random variables is normally distributed, then each of the random variables X1, X2, ···, Xn is also normally distributed. Note that in this instance independence of the random variables is required.
In particular, if X1 and X2 are jointly normally distributed, the sum X1 + X2 is normally distributed with mean μ1 + μ2 and variance σ1² + σ2² + 2ρσ1σ2. The difference X1 − X2 is normally distributed with mean μ1 − μ2 and variance σ1² + σ2² − 2ρσ1σ2. If X1, X2, ···, Xn are jointly normally distributed with common mean μ and variance σ² and are independent, then the mean of the X's, that is, (X1 + X2 + ⋯ + Xn)/n, is normally distributed with mean μ and variance σ²/n.
Rectangular distributions
The rectangular (or uniform) distribution has the following density: it is zero for x < a; it is 1/(b − a) for a ≤ x ≤ b; and it is zero for x > b, where a and b are real constants. In other words, it has a graph that is a rectangle with base of length b − a and height of 1/(b − a). The cumulative distribution function is zero for x < a; it is (x − a)/(b − a) for a ≤ x ≤ b; and it is one for x > b. The mean and the median of this distribution are both (a + b)/2, and the variance is (b − a)²/12.
One of the principal applications of the rectangular distribution occurs in conjunction with the probability integral transformation. This is the transformation Y = F(X), where F(X) is the cumulative distribution function for a continuous random variable X. Then Y is rectangularly distributed with a = 0 and b = 1. Many distribution-free tests of fit have been derived starting with this transformation [see Goodness of fit; Nonparametric statistics].
If Y has the rectangular distribution with a = 0, b = 1, then −2 ln Y has the chi-square distribution with 2 degrees of freedom. It follows, from the additivity of chi-square distributions, that if Y1, Y2, ···, Yn are jointly and independently distributed according to rectangular distributions with a = 0 and b = 1, then the sum, −2 Σ ln Yi, has a chi-square distribution with 2n degrees of freedom.
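A simulation sketch of these two facts, checking the mean and variance of the sum −2 Σ ln Yi against the chi-square values 2n and 4n; the value of n is invented.

```python
import math
import random

random.seed(3)
n, trials = 5, 100_000
# 1.0 - random.random() lies in (0, 1], so the logarithm is always defined.
sums = [sum(-2.0 * math.log(1.0 - random.random()) for _ in range(n)) for _ in range(trials)]

mean = sum(sums) / trials
var = sum((s - mean) ** 2 for s in sums) / trials
print(round(mean, 2), round(var, 2))   # near 2n = 10 and 4n = 20
```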
There is also a discrete form of the rectangular distribution.
Pareto distributions
The density functions for Pareto distributions take the form: zero for x < b; (a/b)(b/x)^(a+1) for b ≤ x < ∞, where a and b are positive real constants (not zero). The cumulative distribution function is zero for x < b and is equal to 1 − (b/x)^a for b ≤ x < ∞. For a > 1, the mean of the Pareto distribution is ab/(a − 1), and for a > 2 the variance is ab²/[(a − 1)²(a − 2)]. The median for a > 0 is at x = 2^(1/a)b, and the mode is at x = b. The Pareto distribution is related to the negative exponential distribution by the transformation Y = θa ln(X/b), where Y has the negative exponential distribution and X has the Pareto distribution.
Pareto distributions have been employed in the study of income distribution [see Income distribution].
Laplace distributions
A random variable X has the Laplace distribution if its probability density function takes the form

f(x) = [1/(2θ)] exp(−|x − λ|/θ),

where −∞ < λ < +∞, 0 < θ < ∞, and −∞ < x < +∞. The mean of this distribution is λ, and the variance is 2θ². This distribution is also known as the double exponential, since the graph has the shape of an exponential function for x > λ and it is a reflection (about the line x = λ) of the same exponential function for x < λ. The Laplace distribution is sometimes called Laplace's first law of error, the second being the normal distribution.
Lognormal distributions
A random variable X is said to have a logarithmic normal distribution (or lognormal distribution) if the logarithm of the variate is normally distributed. Let Y = ln X be normally distributed with mean μ and variance σ². The mean of X is exp(μ + ½σ²), and the variance of X is (exp σ² − 1) exp(2μ + σ²). Note that 0 < X < ∞, while −∞ < Y < +∞. The base of the logarithm may be any number greater than one (or between zero and one), and the bases 2 and 10 are often used. If the base a is used, then the mean of X is a^(μ + ½(ln a)σ²), and the variance of X is (a^(σ² ln a) − 1)a^(2μ + (ln a)σ²).
Logistic distributions
The logistic curve y = λ/[1 + γe^(−κx)] is used frequently to represent the growth of populations. It may also be used as a cumulative probability distribution function. A random variable, X, is said to have the logistic probability distribution if the density of X is given by

f(x) = [π/(σ√3)] e^(−π(x − μ)/(σ√3))/[1 + e^(−π(x − μ)/(σ√3))]²,

where −∞ < x < +∞; μ is the mean of X and σ² is the variance of X. As with the normal distribution, the transformation z = (x − μ)/σ gives a standardized form to the distribution. The cumulative distribution function for the standardized variable is

F(z) = 1/[1 + exp(−πz/√3)],

where −∞ < z < +∞. The shape of this cumulative distribution so nearly resembles the normal distribution that samples from normal and logistic distributions are difficult to distinguish from one another.
The exponential family of distributions
The one-parameter, single-variate, exponential density functions are those of the form
c(θ)exp[θA(x) + B(x)],
where c, A, and B are functions usually taken to satisfy regularity conditions. If there are several parameters, θl, · · ·, θr, the exponential form is
c(θ1, ···, θr) exp[Σi θiAi(x) + B(x)].
Analogous forms may be considered for the multivariate case and for discrete distributions. Most of the standard distributions (normal, binomial, and so on) are exponential, but reparameterization may be required to express them in the above form.
The exponential distributions are important in theoretical statistics, and they arise naturally in discussions of sufficiency. Under rather stringent regularity conditions, an interesting sufficient statistic exists if, and only if, sampling is from an exponential distribution; this relationship was first explored by Koopman (1936), Darmois (1935), and Pitman (1936), so that the exponential distributions are sometimes eponymously called the Koopman-Darmois or Koopman-Pitman distributions [see Sufficiency].
A discussion of the exponential family of distributions, and its relation to hypothesis testing, is given by Lehmann (1959, especially pp. –54).
It is important to distinguish between the family of exponential distributions and that of negative exponential distributions. The latter is a very special, although important, subfamily of the former.
Donald B. Owen
BIBLIOGRAPHY
Darmois, Georges 1935 Sur les lois de probabilité à estimation exhaustive. Académie des Sciences, Paris, Comptes rendus hebdomadaires 200:–1266.
Eisenhart, Churchill; and Zelen, Marvin 1958 Elements of Probability. Pages –164 in E. U. Condon and Hugh Odishaw (editors), Handbook of Physics. New York: McGraw-Hill.
Greenwood, Joseph A.; and Hartley, H. O. 1962 Guide to Tables in Mathematical Statistics. Princeton Univ. Press.
Haight, Frank A. 1961 Index to the Distributions of Mathematical Statistics. U.S. National Bureau of Standards, Journal of Research Series B: Mathematics and Mathematical Physics 65B: –60.
Kendall, Maurice G.; and Stuart, Alan (–1946) –1966 The Advanced Theory of Statistics. 3 vols. New ed. New York: Hafner; London: Griffin. → The first edition was written by Kendall alone.
Koopman, B. O. 1936 On Distributions Admitting a Sufficient Statistic. American Mathematical Society, Transactions 39:–409.
Korn, Granino A.; and Korn, Theresa M. 1961 Mathematical Handbook for Scientists and Engineers: Definitions, Theorems, and Formulas for Reference and Review. New York: McGraw-Hill. → See especially pages –586 on "Probability Theory and Random Processes" and pages –626 on "Mathematical Statistics."
Lehmann, Erich L. 1959 Testing Statistical Hypotheses. New York: Wiley.
Owen, Donald B. 1962 Handbook of Statistical Tables. Reading, Mass.: Addison-Wesley. → A list of addenda and errata is available from the author.
Pearson, Egon S.; and Hartley, H. O. (editors), (1954) 1958 Biometrika Tables for Statisticians. 2 vols., 2d ed. Cambridge Univ. Press. → See especially Volume 1.
Pearson, Karl (editor) (1922) 1951 Tables of the Incomplete Γ-function. London: Office of Biometrika.
Pearson, Karl (editor) 1934 Tables of the Incomplete Beta-function. London: Office of Biometrika.
Pitman, E. J. G. 1936 Sufficient Statistics and Intrinsic Accuracy. Cambridge Philosophical Society, Proceedings 32:–579.
U.S. National Bureau of Standards 1953 Tables of Normal Probability Functions. Applied Mathematics Series, No. 23. Washington: The Bureau.
U.S. National Bureau of Standards 1959 Tables of the Bivariate Normal Distribution Function and Related Functions. Applied Mathematics Series, No. 50. Washington: The Bureau.
Walker, Helen M. 1929 Studies in the History of Statistical Method, With Special Reference to Certain Educational Problems. Baltimore: Williams & Wilkins.
Zelen, Marvin; and Severo, Norman C. (1964) 1965 Probability Functions. Chapter 26 in Milton Abramowitz and I. A. Stegun (editors), Handbook of Mathematical Functions: With Formulas, Graphs, and Mathematical Tables. New York: Dover. → First published as National Bureau of Standards, Applied Mathematics Series, No. 55. A list of errata is available from the National Bureau of Standards.
III APPROXIMATIONS TO DISTRIBUTIONS
The term approximation refers, in general, to the representation of “something” by “something else” that is expected to be a useful replacement for the “something.” Approximations are sometimes needed because it is not possible to obtain an exact representation of the “something”; even when an exact representation is possible, approximations may simplify analytical treatment.
In scientific work, approximations are in constant use. For example, much scientific argument, and nearly all statistical analysis, is based on mathematical models that are essentially approximations. This article, however, is restricted to approximations to distributions of empirical data and to theoretical probability distributions.
When approximating empirically observed distributions—for example, a histogram of frequencies of different words in a sample of speech—the primary objectives are those of compact description and smoothing. These are also the primary objectives of much approximation in demographic and actuarial work [see Life tables].
On the other hand, approximation to a theoretical distribution is often needed when exact treatment is too complicated to be practicable. For example, an econometrician who has developed a new estimator of price elasticity may well find the exact distribution of his estimator quite intractable; he will probably resort to large-sample (asymptotic) methods to find an approximate distribution.
It may also happen that a distribution arising not from statistical considerations, but from another kind of mathematical model, requires approximation to improve understanding. For example, a psychologist may use a probabilistic model of the learning process, a model that leads to a theoretical distribution for the number of trials needed to reach a specified level of performance. This distribution may be so complicated that an approximate form will markedly increase appreciation of its meaning.
The final section of this article discusses a general requirement, the measurement of the goodness of any particular approximation; this is especially important in the comparison of different approximations.
Approximation to empirical distributions
This section deals with approximations to distributions of numerical data representing measurements on each of a group of individuals. (Usually the group is a sample of some kind.) Among important techniques not discussed here are those that are purely mathematical (such as numerical quadrature and iterative solutions of equations) and those associated with the analysis of time series (such as trend fitting and periodogram and correlogram analysis) [see Time series].
As a specific example, data on the distributions of diseases by frequency of diagnosis (for males) in the teaching hospitals of England and Wales for the year 1949 (based on Herdan 1957) are presented in Table 1. The figures mean, for example, that out of 718 different diseases of males reported during the year, 120 occurred between one and five times each, while two were reported between 4,000 and 5,000 times each.
The figures shown in the table have already been grouped but present a rather irregular appearance; it would be useful to summarize the data in a more readily comprehensible form. A quick way to do
Table 1 — Diseases by frequency of diagnosis in the teaching hospitals of England and Wales for 1949

| Frequency of diagnosis | Number of diseases, observed | Number of diseases, fitted |
|---|---|---|
| 1–5 | 120 | 103 |
| 6–10 | 55 | 74 |
| 11–15 | 44 | 74 |
| 16–20 | 31 | 43 |
| 21–25 | 28 | 34 |
| 26–30 | 32 | 28 |
| 31–40 | 45 | 45 |
| 41–50 | 37 | 35 |
| 51–60 | 33 | 28 |
| 61–70 | 22 | 23 |
| 71–80 | 23 | 19 |
| 81–90 | 22 | 17 |
| 91–100 | 15 | 14 |
| 101–119 | 21 | 22 |
| 120–139 | 18 | 19 |
| 140–159 | 19 | 15 |
| 160–179 | 19 | 13 |
| 180–199 | 12 | 13 |
| 200–239 | 19 | 17 |
| 240–279 | 24 | 13 |
| 280–319 | 13 | 11 |
| 320–359 | 4 | 8.5 |
| 360–399 | 9 | 7.0 |
| 400–499 | 15 | 13.1 |
| 500–599 | 13 | 9.1 |
| 600–699 | 5 | 6.7 |
| 700–799 | 6 | 5.0 |
| 800–899 | 2 | 4.0 |
| 900–999 | 2 | 3.1 |
| 1,000–1,999 | 6 | 14.0 |
| 2,000–2,999 | 1 | 4.1 |
| 3,000–3,999 | 1 | 1.9 |
| 4,000–4,999 | 2 | 1.0 |
| ≥ 5,000 | – | 2.25 |
| Total | 718 | 717.75 |

Source: Herdan 1957.
this is to group further and to form a histogram (as in Figure 1). [Further information on this method is presented in Graphic presentation; Statistical analysis, special problems of, article on grouped observations.]
Grouping is in itself a kind of approximation, since it does not reproduce all features of the original data. For concise description, however, representation by a formula can be more useful. This is effected by fitting a frequency curve. If the fitted curve is simple enough and the effectiveness of approximation (“goodness of fit”) is adequate, considerable benefit can be derived by replacing a large accumulation of data, with its inevitable irregularities, with a simple formula that can be handled with some facility and is a conveniently brief way of summarizing the data.
The present-day decline in the importance of fitting observed frequency distributions may well be only a temporary phenomenon. In the years 1890 to 1915 (roughly) there was a need to demonstrate that statistical methods did apply to real physical situations. The χ² test developed by Karl Pearson (1900) demonstrated clearly that the normal distribution, which had previously been assumed to be of rather general application, was not applicable to much observed data. It was desirable, therefore, to show, if possible, that some reasonably simple mathematical formula could give an adequate fit to the data.
Subsequent development of the theory of mathematical statistics has, on the one hand, been very much concerned with clarification of the logical principles underlying statistical method (assuming that there is some fairly well-established mathematical representation of the distributions involved); and on the other hand it has produced “distribution free” procedures, particularly significance tests, that eliminate the need for considering the actual form of distribution (to any but the broadest detail) [see Nonparametric statistics].
Both these lines of work tend to reduce interest in description of actual distributions in as precise detail as the data allow. However, from the more general viewpoint of scientific inquiry, the neglect of systematic study of distributional form in favor of application of formal techniques to arbitrarily hypothetical situations can represent a wasteful use of the data.
In the data of the example, the frequency of occurrences is naturally discrete—it takes only integer values—and, therefore, prime consideration should be given to formulas appropriate to discrete variables. In view of the wide range of variation of frequency, however, formulas appropriate to continuous variables may also give good approximations, and they are worth considering if they offer substantially simpler results.
A number of families of frequency curves have been found effective in approximating observed sets of data. Provided that a suitable form of curve has been chosen, fitting a maximum of four parameters gives a reasonably effective approximation. Fitting the curve is equivalent to estimating these parameters; this may be effected in various ways, among which are (i) the method of maximum likelihood; (ii) the method of moments, in which certain lower moments of the fitted curve are made to agree with the corresponding values calculated from the observed data; and (iii) the method of percentile points, in which the fitted curve is made to give cumulative proportions agreeing with those for the observed data at certain points (7%, 25%, 75%, and 93% are often recommended) [see Estimation, article on point estimation].
In cases where a clear probabilistic model can be established, there is usually good reason to prefer the method of maximum likelihood. When it is not possible to establish such a model, another method is required.
Fitting is sometimes facilitated by using appropriately designed “probability paper,” with the ordinate and abscissa scales such that if cumulative frequency (ordinate) is plotted against variable values (abscissa), a straight line relationship should be obtained [see Graphic presentation].
Among the more commonly used families of frequency curves are those representing the normal, exponential, and gamma distributions. These specify the proportion less than, or equal to, a fixed number x (the cumulative distribution function, often denoted by the symbol F(x)) as a function depending on two or three parameters (and also, of course, on x) [see Distributions, statistical, article on special continuous distributions]. The best-known system of curves is the Pearson system (Elderton 1906), which is derived from solutions of the differential equation
dy/dx = (c + x)y/(a0 + a1x + a2x²).
Originally (Pearson 1895) this equation was arrived at by (i) considering limiting cases of sampling from a finite population and (ii) requiring the curve y = dF/dx to satisfy certain natural conditions of unimodality, smoothness, etc. (Note that there are four parameters, c, a0, a1, a2, in this equation.)
Methods have been worked out (Elderton 1906) for estimating the values of these parameters from the first four moments of the distributions. In certain cases, the procedures can be much facilitated by using Table 42 of the Biometrika tables (Pearson & Hartley 1954) or the considerably extended version of these tables in Johnson et al. (1963). Entering these tables with the values of the moment ratios (√β1 and β2), it is possible to read directly standardized percentage points (X′P) for a number of values of P. The percentage points XP (such that F(XP) = P) are then calculated as Mean + X′P × (Standard Deviation).
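As a concrete illustration of the quantities involved, the following sketch (in Python; the data and function names are illustrative, not from the original) computes the moment ratios √β1 = μ3/μ2^(3/2) and β2 = μ4/μ2² from raw observations and converts a standardized percentage point X′P, as read from such tables, to the original scale.

```python
import math

def moment_ratios(xs):
    """Moment ratios sqrt(beta1) = mu3/mu2**1.5 and beta2 = mu4/mu2**2,
    where mu_r is the r-th central moment of the observations."""
    n = len(xs)
    mean = sum(xs) / n
    mu2 = sum((x - mean) ** 2 for x in xs) / n
    mu3 = sum((x - mean) ** 3 for x in xs) / n
    mu4 = sum((x - mean) ** 4 for x in xs) / n
    return mu3 / mu2 ** 1.5, mu4 / mu2 ** 2

def percentage_point(mean, sd, standardized_xp):
    """Convert a standardized percentage point X'_P read from the tables into
    a percentage point on the original scale: Mean + X'_P * (Standard Deviation)."""
    return mean + standardized_xp * sd

# Illustrative (invented) observations
sample = [1.2, 0.7, 2.3, 3.1, 0.4, 1.8, 5.2, 0.9, 2.6, 1.1]
sqrt_beta1, beta2 = moment_ratios(sample)
print(round(sqrt_beta1, 3), round(beta2, 3))
```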
Other systems are based on “transformation to normality” with z = a1 + a2·g[c(x + a0)], where z is a normal variable with zero mean and unit standard deviation, and g[ ] is a fairly simple explicit function. Here c, a0, a1, and a2 are again parameters. If g[y] = log y, one gets a lognormal curve. Such a curve has been fitted to the distribution of frequency of diagnosis in Figure 1 (for clarity, up to frequency 100 only). Since z = a1 + a2 log[c(x + a0)] can be written as z = a′1 + a2 log(x + a0), where now a′1 = a1 + a2 log c, there are in fact only three separate parameters in this case. In fitting the curve in Figure 1, a0 has been taken equal to –0.5, leaving only a′1 and a2 to be estimated. This has been done by the method of percentile points (making the fitted frequencies less than 11 and less than 200 agree with the observed frequencies, 175 and 596 respectively). The fitted formula is (rounded)
Number less than, or equal to, x (an integer)
= 718 Φ(0.55 ln x – 1.95)
where Φ is the cumulative distribution function of the unit-normal distribution [see Distributions, statistical, article on special continuous distributions]. Numerical values of the fitted frequencies are shown in the last column of Table 1.
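The fitted formula is easy to check numerically. The following sketch (the helper names are illustrative) evaluates 718Φ(0.55 ln x – 1.95) and reproduces, to within the rounding of the constants, the cumulative counts used in the percentile-point fit as well as the first fitted entry of Table 1.

```python
import math

def unit_normal_cdf(z):
    """Cumulative distribution function of the unit-normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def fitted_cumulative(x):
    """Fitted number of diseases diagnosed x times or fewer, from the rounded
    formula 718 * Phi(0.55 ln x - 1.95)."""
    return 718.0 * unit_normal_cdf(0.55 * math.log(x) - 1.95)

# The percentile-point fit matched the observed cumulative counts 175
# (frequency less than 11) and 596 (frequency less than 200); the rounded
# constants reproduce these only approximately.
print(round(fitted_cumulative(10), 1))   # about 177
print(round(fitted_cumulative(199), 1))  # about 597
print(round(fitted_cumulative(5), 1))    # about 103, the first fitted entry of Table 1
```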
While a reasonable fit is obtained in the center of the distribution, the fit is poor for the larger frequencies of diagnosis. As has been noted, although the data are discrete, a continuous curve has been fitted; however, this is not in itself reason for obtaining a poor fit. Indeed many standard approximations— such as the approximation to a binomial distribution by a normal distribution — are of this kind.
Another system of curves is obtained by taking the approximate value of F(x) to be
a0Φ[(x – ξ)/σ] + a1Φ(1)[(x – ξ)/σ] + · · · + akΦ(k)[(x – ξ)/σ],
where Φ(j) is the jth derivative of Φ, and where the parameters are a0, a1, · · ·, ak, ξ, and σ > 0. The number of parameters depends on the value of k, which is usually not taken to be greater than 3 or 4. This system of curves is known as the Gram-Charlier system; a modified form is known as the Edgeworth system (Kendall & Stuart 1943–1946, chapter 6). If all a’s after a0 are zero (and a0 is equal to one), then F(x) is simply the normal cumulative distribution. The terms after the first can be regarded as “corrections” to the simple normal approximation. It should not automatically be assumed, however, that the more corrections added the better. If a high value of k is used, the curve may present a wavy appearance; even with a small value of k it is possible to obtain negative values of fitted frequencies (Barton & Dennis 1952).
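For readers who want something computable, the sketch below evaluates a common density-level version of the type A series with k = 4, in which the corrections are expressed through the probabilists' Hermite polynomials with the usual moment-based coefficients (skewness/6 and excess kurtosis/24). This particular truncation and parameterization is an assumption of the sketch, not a prescription from the text, and it exhibits the possibility of negative fitted values noted above.

```python
import math

def gram_charlier_density(z, skewness, excess_kurtosis):
    """Four-moment (k = 4) Gram-Charlier type A approximation to the density of a
    standardized variable, built from the unit-normal density and the probabilists'
    Hermite polynomials He3 and He4. It may go negative for strongly non-normal
    inputs (Barton & Dennis 1952)."""
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # unit-normal density
    he3 = z ** 3 - 3.0 * z
    he4 = z ** 4 - 6.0 * z ** 2 + 3.0
    return phi * (1.0 + (skewness / 6.0) * he3 + (excess_kurtosis / 24.0) * he4)

# Illustrative coefficients (assumed): modest skewness and excess kurtosis
for z in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(z, round(gram_charlier_density(z, skewness=0.5, excess_kurtosis=0.3), 4))
```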
Similar expansions can be constructed replacing Φ(x) by other standard cumulative distributions (see Cramér 1945, sec. 20.6). If the Φ(j)’s are replaced by cumulative distribution functions Φj(x) (with, of course, a0 + a1 + a2 + · · · + ak = 1), then F(x) is represented by a “mixture” of these distributions [see Distributions, statistical, article on mixtures of distributions]. When joint distributions of two or more variates are to be fitted, the variety of possible functional forms can be an embarrassment. A convenient mode of attack is to search for an effective normalizing transformation for each variate separately. If such a transformation can be found for each variate, then a joint multinormal distribution can be fitted to the set of transformed variates, using the means, variances, and correlations of the transformed variates. A discussion of some possibilities, in the case of two variates, will be found in Johnson (1949).
Approximation to one theoretical distribution by another
Approximation is also useful even when it is not necessary to deal with observed data. An important field of application is the replacement of complicated formulas for the theoretical distributions of statistics by simpler (or more thoroughly investigated) distributions. Just as when approximating to observed data, it is essential that the approximation be sufficiently effective to give useful results. In this case, however, the problem is more definitely expressible in purely mathematical terms; there are a number of results in the mathematical theory of probability that are often used in constructing approximations of this kind.
Among these results, the “Central Limit” group of theorems has the broadest range of applicability. The theorems in this group state that, under appropriate conditions, the limiting distribution of certain statistics, Tn, based on a number, n, of random variables, as n tends to infinity, is a normal distribution. [See Probability for a discussion of the Central Limit theorems.]
If the conditions of a Central Limit theorem are satisfied, then the distribution of the standardized statistic [Tn – E(Tn)]/σ(Tn), where E(Tn) is the mean and σ(Tn) the standard deviation of Tn, may be approximated by a “unit normal distribution.” Then the probability Pr[Tn ≥ τ] may be approximated by using tables of the normal integral with argument [τ – E(Tn)]/σ(Tn).
There are many different Central Limit theorems; the most generally useful of these relate to the special case when the X’s are mutually independent and Tn = X1 + X2 + · · · + Xn. The simplest set of conditions is that each X should have the same distribution (with finite expected value and standard deviation), but weaker sets of conditions can replace the requirement that the distributions be identical. These conditions, roughly speaking, ensure that none of the variables Xi have such large standard deviations that they dominate the distribution of Tn.
Central Limit results are used to approximate distributions of test criteria calculated from a “large” number of sample values. The meaning of “large” depends on the way in which the effectiveness of the approximation is measured and the accuracy of representation required. In turn, these factors depend on the use that is to be made of the results. Some of the problems arising in the measurement of effectiveness of approximation will be described in the next section.
Very often there is a choice of quantities to which Central Limit type results can be applied. For example, any power of a chi-square random variable tends to normality as the number of degrees of freedom increases. For many purposes, the one-third power (cube root) gives the most rapid convergence to normality (see Moore 1957), although it may not always be the best power to use. While many useful approximations can be suggested by skillful use of probability theory, it should always be remembered that sufficient numerical accuracy is an essential requirement. In some cases (for example, Mallows 1956; Gnedenko et al. 1961; Wallace 1959), theoretical considerations may provide satisfactory evidence of accuracy; in other cases, ad hoc investigations, possibly using Monte Carlo methods, may be desirable. Interesting examples of these kinds of investigation are described by Goodman and Kruskal (1963) and Pearson (1963). [A discussion of Monte Carlo methods may be found in Random numbers.]
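As an illustration of such a choice, the sketch below uses the Wilson–Hilferty form of the cube-root transformation, under which (X/ν)^(1/3) for a chi-square variable X with ν degrees of freedom is treated as normal with mean 1 – 2/(9ν) and variance 2/(9ν); these specific constants are a standard assumption supplied here, not taken from the text above.

```python
import math

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def chi_square_cdf_cube_root(x, df):
    """Approximate Pr[chi-square(df) <= x] by treating (X/df)**(1/3) as normal with
    mean 1 - 2/(9 df) and variance 2/(9 df) -- the Wilson-Hilferty form of the
    cube-root transformation (an assumed concrete version, not spelled out here)."""
    z = ((x / df) ** (1.0 / 3.0) - (1.0 - 2.0 / (9.0 * df))) / math.sqrt(2.0 / (9.0 * df))
    return normal_cdf(z)

# The tabled upper 5% point of chi-square with 10 degrees of freedom is about 18.31;
# the cube-root approximation gives an upper-tail probability very close to 0.05.
print(round(1.0 - chi_square_cdf_cube_root(18.31, 10), 4))
```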
Another type of result (Cramér 1945, sec. 20.6), useful in approximating theoretical distributions and, in particular, in extending the field of Central Limit results, may be stated formally: “if the cumulative distribution function of Xn tends to F(x) and Yn tends to η in probability, then the cumulative distribution function of Xn + Yn tends to F(x – η),” or less formally, “the limiting distribution of Xn + Yn is the same as that of Xn + η.” Similar statements hold for multiplication, and this kind of useful result may be considerably generalized. A fuller discussion is given in the Appendix of Goodman and Kruskal (1963).
A classic and important example of approximation of one theoretical distribution by another is that of approximating a binomial distribution by a normal one. If the random variable X has a binomial distribution with parameters p, n, then the simplest form of the approximation is to take Pr{X ≤ c} as about equal to
Φ[(c – np)/√(np(1 – p))],
where c = 0, 1, 2, · · ·, n. One partial justification for this approximation is that the mean and standard deviation of X are, respectively, np and √(np(1 – p)); the asymptotic validity of the approximation follows from the simplest Central Limit theorem.
The approximation may be generally improved by replacing c – np, in the numerator of the argument of Φ, with c + ½ – np. This modification, which has no asymptotic effect, is called the “continuity correction.” Together with similar modifications, continuity corrections may often be used effectively to improve approximations at little extra computational cost.
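The practical effect of the continuity correction is easily seen numerically; in the sketch below the parameter values n = 20, p = 0.3, c = 8 are illustrative assumptions.

```python
import math

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def binomial_cdf_exact(c, n, p):
    """Exact Pr{X <= c} for a binomial(n, p) random variable."""
    return sum(math.comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(c + 1))

def binomial_cdf_normal(c, n, p, continuity_correction=True):
    """Normal approximation to Pr{X <= c}; with the continuity correction the
    numerator c - np is replaced by c + 1/2 - np."""
    numerator = (c + 0.5 if continuity_correction else c) - n * p
    return normal_cdf(numerator / math.sqrt(n * p * (1 - p)))

# Illustrative values (assumed): n = 20, p = 0.3, c = 8
n, p, c = 20, 0.3, 8
print(round(binomial_cdf_exact(c, n, p), 4))
print(round(binomial_cdf_normal(c, n, p, continuity_correction=False), 4))
print(round(binomial_cdf_normal(c, n, p, continuity_correction=True), 4))
```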
The methods described in the section on approximating numerical data can also be used, with appropriate modification, in approximating theoretical distributions. Different considerations may arise, however, in deciding on the adequacy of an approximation in the two different situations.
Measuring the effectiveness of an approximation
The effectiveness of an approximation is its suitability for the purpose for which it is used. With sufficiently exhaustive knowledge of the properties of each of a number of approximations, it would be possible to choose the best one to use in any given situation, with little risk of making a bad choice. However, in the majority of cases, the attainment of such knowledge (even if possible) would entail such excessively time-consuming labor that a main purpose for the use of approximations —the saving of effort—would be defeated. Considerable insight into the properties of approximations can, however, be gained by the careful use of certain representative figures, or indexes, for their effectiveness. Such an index summarizes one aspect of the accuracy of an approximation.
Since a single form of approximation may be used in many different ways, it is likely that different indexes will be needed for different types of application. For example, the approximation obtained by representing the distribution of a statistic by a normal distribution may be used, inter alia, (a) for calculating approximate significance levels for the statistic, (b) for designing an experiment to have specified sensitivity, and (c) for combining the results of several experiments. In case (a), effectiveness will be related particularly to accuracy of representation of probabilities in the tail(s) of the distribution, but this will not be so clearly the case in (b) and (c). It is important to bear in mind the necessity for care in choosing an appropriate index for a particular application.
This article has been concerned with approximations to theoretical or empirical distribution functions. The primary purpose of such approximations is to obtain a useful representation, say F*(x), of the actual cumulative distribution function F(x). It is natural to base an index of accuracy on some function of the difference F*(x) – F(x). Here are some examples of indexes that might be used:
(1) the maximum absolute difference, the maximum over x of ǀF*(x) – F(x)ǀ;
(2) the maximum absolute error in interval probabilities, the maximum over pairs x1, x2 of ǀ[F*(x2) – F*(x1)] – [F(x2) – F(x1)]ǀ;
(3) the maximum proportional error, the maximum over x of ǀF*(x) – F(x)ǀ/F(x);
(4) the maximum proportional error in interval probabilities, the maximum over pairs x1, x2 of ǀ[F*(x2) – F*(x1)] – [F(x2) – F(x1)]ǀ/ǀF(x2) – F(x1)ǀ.
The third and fourth of these indexes are modifications of the first and second indexes respectively. A further pair of definitions could be obtained by replacing the intervals (x1, x2) in (2) and (4) by sets, ω, of possible values, replacing ǀF*(x1) – F*(x2)ǀ and ǀF(x1) – F(x2)ǀ by the approximate and actual values of Pr[X in ω], and replacing the maximum over x1, x2 by the maximum over ω. Indexes of type (3) and (4) are based on the proportional error in the approximation, while indexes of type (1) and (2) are based on absolute error in the approximation. While the former are the more generally useful, it should be remembered that they may take infinite values. Very often, in such cases, the difficulty may be removed by choosing ω to exclude extreme values of x.
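As a small worked instance of an index of type (1) as listed above, the sketch below takes F(x) to be an exact binomial cumulative distribution function and F*(x) its continuity-corrected normal approximation; the parameter values are illustrative assumptions.

```python
import math

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def binomial_cdf(c, n, p):
    return sum(math.comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(c + 1))

def max_abs_difference(n, p):
    """Index of type (1): the maximum of |F*(x) - F(x)| over the support, where F is
    the exact binomial c.d.f. and F* its continuity-corrected normal approximation."""
    sd = math.sqrt(n * p * (1 - p))
    return max(
        abs(normal_cdf((c + 0.5 - n * p) / sd) - binomial_cdf(c, n, p))
        for c in range(n + 1)
    )

# Illustrative parameters (assumed); the index shrinks as n increases
for n in (10, 40, 160):
    print(n, round(max_abs_difference(n, 0.3), 4))
```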
In particular instances ω may be quite severely restricted. For example, the 5% and 1% levels may be regarded as of paramount importance. In such a case, comparison of F*(x) and F(x) in the neighborhood of these values may be all that is needed. It has been suggested in such cases that actual values between 0.04 and 0.06, corresponding to a nominal 0.05, and between 0.007 and 0.013, corresponding to a nominal 0.01, can be regarded as satisfactory for practical purposes.
In measuring the accuracy of approximation to empirical distributions, it is quite common to calculate χ². This is, of course, a function of the differences F*(x) – F(x). Even when circumstances are such that no probabilistic interpretation of the statistic is possible, indexes of relative accuracy of approximation have been based on the magnitude of χ². A similar index for measuring accuracy of approximations to theoretical distributions can be constructed. Such indexes are based on more or less elaborate forms of average (as opposed to maximum) size of error in approximation.
N. L. Johnson
[See also Goodness of fit.]
BIBLIOGRAPHY
Barton, D. E.; and Dennis, K. E. 1952 The Conditions Under Which Gram-Charlier and Edgeworth Curves Are Positive Definite and Unimodal. Biometrika 39: –427.
Cramér, Harold (1945) 1951 Mathematical Methods of Statistics. Princeton Mathematical Series, No. 9. Princeton Univ. Press.
Elderton, W. Palin (1906) 1953 Frequency Curves and Correlation. 4th ed. Washington: Harren.
Gnedenko, B. V. et al. 1961 Asymptotic Expansions in Probability Theory. Volume 2, pages 153–170 in Berkeley Symposium on Mathematical Statistics and Probability, Fourth, University of California, 1960, Proceedings. Berkeley: Univ. of California Press.
Goodman, Leo A.; and Kruskal, William H. 1963 Measures of Association for Cross-classifications: III. Approximate Sampling Theory. Journal of the American Statistical Association 58:–364.
Herdan, G. 1957 The Mathematical Relation Between the Number of Diseases and the Number of Patients in a Community. Journal of the Royal Statistical Society Series A 120:–330.
Johnson, Norman L. 1949 Bivariate Distributions Based on Simple Translation Systems. Biometrika 36:–304.
Johnson, Norman L.; Nixon, Eric; Amos, D. E.; and Pearson, Egon S. 1963 Table of Percentage Points of Pearson Curves, for Given √β1 and β2, Expressed in Standard Measure. Biometrika 50:–498.
Kendall, Maurice G.; and Stuart, Alan (1943–1946) 1958–1961 The Advanced Theory of Statistics. 2 vols. New ed. New York: Hafner; London: Griffin. → Volume 1: Distribution Theory, 1958. Volume 2: Inference and Relationship, 1961. Third volume in preparation.
Mallows, C. L. 1956 Generalizations of Tchebycheff’s Inequalities. Journal of the Royal Statistical Society Series B 18:–168.
Moore, P. G. 1957 Transformations to Normality Using Fractional Powers of the Variable. Journal of the American Statistical Association 52:–246.
Pearson, Egon S. 1963 Some Problems Arising in Approximating to Probability Distributions, Using Moments. Biometrika 50:–112.
Pearson, Egon S.; and Hartley, H. O. (editors) (1954) 1958 Biometrika Tables for Statisticians. 2d ed., 2 vols. Cambridge Univ. Press.
Pearson, Karl 1895 Contributions to the Mathematical Theory of Evolution: II. Skew Variations in Homogeneous Material. Royal Society of London, Philosophical Transactions Series A 186:–414.
Pearson, Karl 1900 On the Criterion That a Given System of Deviations From the Probable in the Case of a Correlated System of Variables Is Such That It Can Be Reasonably Supposed to Have Arisen From Random Sampling. Philosophical Magazine 5th Series 50:–175.
Wallace, David L. 1959 Bounds on Normal Approximations to Student’s and the Chi-square Distributions. Annals of Mathematical Statistics 30:–1130.
IV MIXTURES OF DISTRIBUTIONS
A mixture of distributions is a weighted average of probability distributions with positive weights that sum to one. The distributions thus mixed are called the components of the mixture. The weights themselves comprise a probability distribution called the mixing distribution. Because of this property of the weights, a mixture is, in particular, again a probability distribution.
As an example, suppose that the probability distribution of heights of 30-year-old men in New York is approximately a normal distribution, while that of 30-year-old women in New York is another approximately normal distribution. Then the probability distribution of heights of 30-year-old people in New York will be, to the same degree of approximation, a mixture of the two normal distributions. The two separate normal distributions are the components, and the mixing distribution is the simple one on the dichotomy male-female, with the weights given by the relative frequencies of men and women in New York.
Probability distributions of this type arise when observed phenomena can be the consequence of two or more related, but usually unobserved, phenomena, each of which leads to a different probability distribution. Mixtures and related structures often arise in the construction of probabilistic models; for example, models for factor analysis. Mixtures also arise in a number of statistical contexts. A general problem is that of “decomposing” a mixture on the basis of a sample, that is, of estimating the parameters of the mixing distribution and those of the components.
Mixtures occur most commonly when the parameter, θ, of a family of distributions, given, say, by the density or frequency functions f(x; θ), is itself subject to chance variation. The mixing distribution, say g(θ), is then a probability distribution on the parameter of the distributions f(x; θ).
The components of a mixture may be discrete, continuous, or some of each type. Mixtures are classified, in accordance with the number of their components, as finite, countable, or noncountably infinite.
The generic formula for the most common form of finite mixture is
(1) f(x; θ1)g(θ1) + f(x; θ2)g(θ2) + · · · + f(x; θk)g(θk);
the infinite analogue (in which g is a density function) is
(2) ∫ f(x; θ)g(θ) dθ.
As an illustration of the above ideas, consider the following simple example. Two machines produce pennies, which are fed into a common bin. The pennies are identical except that those produced by machine 1 have probability θ1 of showing a head when tossed, while those produced by machine 2 have probability θ2 ≠θ1 of showing a head. Let α be the proportion of coins produced by machine 1, and 1 – α the proportion produced by machine 2.
A coin chosen at random from the bin is tossed n times. By the basic rules of probability theory, the probability of observing x heads is
(3) α[n!/(x!(n – x)!)]θ1ˣ(1 – θ1)ⁿ⁻ˣ + (1 – α)[n!/(x!(n – x)!)]θ2ˣ(1 – θ2)ⁿ⁻ˣ,  x = 0, 1, · · ·, n.
This is a mixture of two binomial distributions (the components) with mixing distribution g(θ1) = α, g(θ2) = 1 – α. [See Probability, article on formal probability, for a discussion of the rules giving rise to (3).]
By contrast, if θ1 = θ2 = θ, that is to say if the two machines produce exactly identical coins, then (3) reduces to [n!/(x!(n – x)!)]θˣ(1 – θ)ⁿ⁻ˣ,
which is the simple binomial. Similarly if the experimenter selected a coin from a particular machine, say machine 2, rather than choosing a coin at random, the distribution of outcomes would be a binomial with parameter θ2. A simple binomial also results if n distinct coins are chosen and each is tossed a single time. In this case, however, the binomial parameter is αθ1 + (1 – α)θ2.
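A short sketch of the mixture (3) for the coin experiment follows; the numerical values of n, α, θ1, and θ2 are illustrative assumptions, not taken from the text.

```python
import math

def binomial_pmf(x, n, theta):
    """Pr(x heads in n tosses) for a coin with head probability theta."""
    return math.comb(n, x) * theta ** x * (1 - theta) ** (n - x)

def mixed_binomial_pmf(x, n, alpha, theta1, theta2):
    """Mixture (3): the tossed coin comes from machine 1 with probability alpha
    and from machine 2 with probability 1 - alpha."""
    return alpha * binomial_pmf(x, n, theta1) + (1 - alpha) * binomial_pmf(x, n, theta2)

# Illustrative parameter values (assumed, not from the article)
n, alpha, theta1, theta2 = 10, 0.6, 0.3, 0.8
dist = [mixed_binomial_pmf(x, n, alpha, theta1, theta2) for x in range(n + 1)]
print([round(p, 4) for p in dist], round(sum(dist), 6))  # the probabilities sum to 1
```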
A generalization of the above example is a mixture of two trinomial distributions. It is applicable to description of the sex distribution of twins (cf. Goodman & Kruskal 1959, p. 134; Strandskov & Edelen 1946). Twin pairs fall into three classes: MM, MF, and FF, where M denotes male and F female. This leads to the trinomial distribution. Since, in addition, twins may be dizygotic or monozygotic, a mixture of trinomials results. Because the sexes of individual dizygotic twins are independent, the corresponding trinomial has parameters p², 2p(1 – p), and (1 – p)², where p is the probability of a male. Monozygotic twins, however, are genetically identical, so that the corresponding trinomial has parameters p, 0, and q. The mixing distribution is determined by the relative frequencies of monozygotic and dizygotic twin births.
Some properties of mixtures
Mixtures are themselves probability distributions, and hence their density or frequency functions are nonnegative and sum or integrate to unity.
While the definitions (1) and (2) are given in terms of the density or frequency functions, precisely the same relationships hold with regard to the corresponding cumulative distribution functions. Similarly, the moments (about zero) of a mixture are weighted averages (that is, mixtures) of the moments of the component distributions. Thus, for a finite mixture, if μ′r(i) is the rth moment about zero of the ith component, then the rth moment about zero of the mixture is the corresponding weighted average g(θ1)μ′r(1) + g(θ2)μ′r(2) + · · · + g(θk)μ′r(k).
The same is true of the factorial moments. A similar, although slightly more complicated, relationship holds for moments about the mean.
An important property of many common mixtures of discrete distributions is the relationship imposed upon the generating functions: if φ1(s) and φ2(s) are the generating functions of the mixing distribution and the component distributions, respectively, then φ1(φ2(s)) is the generating function of the mixture. This greatly simplifies certain calculations. [See Distributions, statistical, article on special discrete distributions, for a discussion of generating functions.]
Mixtures of standard distributions may possess interesting geometric properties. They may be bimodal or multimodal and are ordinarily more dispersed than are the components. Properties such as these account, at least in part, for the fact that mixtures frequently fit data more satisfactorily than do standard distributions.
Conditionality
The concept of conditionality underlies the definitions (1) and (2), for the function f(x; θ) is the conditional probability distribution given the value of θ, while the product of f(x; θ) and g(θ) is the joint probability distribution of x and θ, and the sum (or integral) is the unconditional distribution of x.
The importance of conditionality in applications can be seen in the coin example: the machine on which the randomly chosen coin was produced may be thought of as an auxiliary chance variable (taking on the values 1 and 2 with respective probabilities α and 1 – α ); conditional on the value of the auxiliary chance variable, the distribution of heads is a simple binomial.
The auxiliary chance variable is usually unobservable and hence cannot play a direct role in data analysis. The merit of introducing the auxiliary chance variable in this context is, therefore, not that it results in simpler data analyses but that it yields a proper understanding of the underlying mathematical structure and simplifies the derivation of the probability distribution.
This situation is typical; the conditional distribution of the chance variable under investigation, given the value of a related unobservable chance variable, is either known exactly or is of a relatively simple form. Because of the inability to observe the related chance variable, however, the unconditional distribution is a mixture and is usually much more complex.
Some applications
Because of the immense complexity of living organisms, variables that the investigator can neither control nor observe frequently arise in investigations of natural phenomena. Some examples are attitude, emotional stability, skills, genotype, resistance to disease, etc. Since mixtures of distributions result when such variables are related to the variable under investigation, mixtures have many applications in the social and biological sciences.
Mixtures have played an important role in the construction of so-called contagious distributions. These distributions are deduced from models in which the occurrence of an event has the effect of changing the probability of additional such occurrences. An early example of a contagious distribution was given by Greenwood and Yule (1920) in an analysis of accident data. Greenwood and Yule derived the classical negative binomial distribution as a mixture of distributions. Other contagious distributions that are mixtures are given by Neyman (1939, the Neyman types A, B, and C) and Gurland (1958).
Mixtures of distributions also arise in dealing with unusual observations or outliers. One approach to the problem of outliers is predicated upon writing the underlying probability distribution as a mixture of distributions (see Dixon 1953). [For further discussion, see Statistical analysis, special problems of, article on Outliers.]
Mixtures of distributions also arise in the Bayesian approach to statistical analysis. Bayes’ procedures are constructed under the assumption that the parameters of the underlying probability distribution are themselves chance variables. In this context, the mixing distribution is called the a priori distribution [see Bayesian inference].
The following are a few additional applications of mixtures:
(1) Life testing of equipment subject to sudden and delayed failures (Kao 1959).
(2) Acceptance testing, when the proportion of defectives varies from batch to batch (Vagholkar 1959).
(3) Latent structure analysis based on mixtures of multivariate distributions with independence between the variates within each component (Lazarsfeld 1950; Green 1952). Factor analysis also comes under this general description. [See Factor analysis; Latent structure.]
(4) Construction of a learning model as a weighted average of two simpler such models (Sternberg 1959).
(5) A model for spatial configurations in map analysis (Dacey 1964).
Special mixed distributions
Using the definitions (1) and (2), it is easy to generate large numbers of probability distributions. Furthermore, the process of mixing of distributions may be repeated; that is, new mixtures, in which the components are themselves mixtures of distributions, may be formed by repeated application of (1) and (2). Thus, it is possible to form an unlimited number of probability distributions by this relatively simple process. In addition, many classical probability distributions can be represented nontrivially as mixtures. Teicher (1960) gives two representations of the normal distribution as a mixture of distributions. [See Distributions, statistical, article on special discrete distributions, for both classical and mixture representations of the negative binomial distribution.]
It is customary, when both component and mixing distribution are well known, to call the mixture by the names of the distributions involved.
The following are a few specific examples of mixtures:
Mixture of two normal distributions. The density function of a mixture of two normal distributions is
[α1/(σ1√2π)] exp[–(x – μ1)²/2σ1²] + [α2/(σ2√2π)] exp[–(x – μ2)²/2σ2²],
where –∞ < x < ∞, 0 < α1 < 1, 0 < α2 < 1, α1 + α2 = 1, and μ1, μ2 and σ1², σ2² are the respective means and variances of the component normal distributions. This example is due to Pearson (1894). The distribution has mean α1μ1 + α2μ2 and variance α1σ1² + α2σ2² + α1α2(μ1 – μ2)². An example of this mixture with μ1 = 0, μ2 = 4, and α1 = 0.4 is given in Figure 1.
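The following sketch evaluates this two-component normal mixture and its mean and variance for the parameters of the Figure 1 example; unit component variances are assumed, since the text does not state them.

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def two_normal_mixture_pdf(x, alpha1, mu1, sigma1, mu2, sigma2):
    """Density of a mixture of two normal distributions with weights alpha1, 1 - alpha1."""
    return alpha1 * normal_pdf(x, mu1, sigma1) + (1.0 - alpha1) * normal_pdf(x, mu2, sigma2)

# Parameters of the Figure 1 example; unit component variances are an assumption,
# since they are not stated in the text.
alpha1, mu1, mu2, sigma1, sigma2 = 0.4, 0.0, 4.0, 1.0, 1.0
alpha2 = 1.0 - alpha1
mean = alpha1 * mu1 + alpha2 * mu2
variance = alpha1 * sigma1 ** 2 + alpha2 * sigma2 ** 2 + alpha1 * alpha2 * (mu1 - mu2) ** 2
print(mean, variance)  # 2.4 and 4.84
print(round(two_normal_mixture_pdf(0.0, alpha1, mu1, sigma1, mu2, sigma2), 4))
```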
Mixture of two Poisson distributions. The frequency function of a mixture of two Poisson distributions is
α1 exp(–λ1)λ1ˣ/x! + α2 exp(–λ2)λ2ˣ/x!,  x = 0, 1, 2, · · ·,
where α1 and α2 are as before and λ1 ≠ λ2 are the parameters of the components. This distribution has mean α1λ1 + α2λ2 and variance α1λ1 + α2λ2 + α1α2(λ1 – λ2)². An example of this discrete mixed distribution with λ1 = 1, λ2 = 8, and α1 = 0.5 is given in Figure 2.
Poisson-binomial distribution. The frequency function of the Poisson-binomial distribution is
Σj [exp(–λ)λʲ/j!][(jn)!/(x!(jn – x)!)]θˣ(1 – θ)ʲⁿ⁻ˣ (the sum extending over j = 0, 1, 2, · · · with jn ≥ x),  x = 0, 1, 2, · · ·,
where 0 < θ < 1, 0< λ <∞, and n is an integer. The components are binomial distributions; the mixing distribution is Poisson. The probabilities can also be obtained by successive differentiation of the generating function
φ(s) = exp{λ[(1 – θ + θs)ⁿ – 1]}, or from the recursion formulas
The mean is nθλ; the variance is nθλ[1 + (n – 1)θ].
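The probabilities can equally be computed directly from the mixture representation, as in the sketch below, which truncates the Poisson sum after a fixed number of terms and checks the mean against nθλ; the parameter values and truncation point are illustrative assumptions.

```python
import math

def poisson_binomial_pmf(x, lam, n, theta, j_max=60):
    """Pr(X = x) for the Poisson-binomial distribution, computed directly as a
    Poisson(lam) mixture of binomial(j*n, theta) components; the Poisson sum is
    truncated after j_max terms, which is ample for moderate lam."""
    total = 0.0
    weight = math.exp(-lam)              # Pr(J = 0)
    for j in range(j_max):
        if j * n >= x:
            total += weight * math.comb(j * n, x) * theta ** x * (1 - theta) ** (j * n - x)
        weight *= lam / (j + 1)          # Pr(J = j + 1) from Pr(J = j)
    return total

# Illustrative parameters (assumed); the mean should be close to n * theta * lam
lam, n, theta = 2.0, 3, 0.4
pmf = [poisson_binomial_pmf(x, lam, n, theta) for x in range(60)]
print(round(sum(pmf), 6), round(sum(x * p for x, p in enumerate(pmf)), 4), n * theta * lam)
```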
Poisson-negative binomial distribution. The frequency function of the Poisson–negative binomial distribution is
where 0<θ<1, 0<λ<∞, and 0 < n < ∞. This is a Poisson mixture of negative binomial distributions. Here
φ(s) = exp{λ[θⁿ(1 – s + sθ)⁻ⁿ – 1]};
the recursion formulas are
The mean and variance are nλ(1 – θ)/θ and nλ(1 – θ)(n – nθ + 1)/θ², respectively.
Neyman Type A contagious distribution. The frequency function of the Neyman Type A distribution is
[exp(–λ1)λ2ˣ/x!] Σj (λ1 exp(–λ2))ʲ jˣ/j! (the sum extending over j = 0, 1, 2, · · ·),  x = 0, 1, 2, · · ·,
where 0 < λ1 < ∞ and 0 < λ2 < ∞. The Neyman Type A is a Poisson-Poisson. The generating function
φ(s) = exp{λ1[e^(λ2(s–1)) – 1]};
the recursion formulas are
The mean is λ1λ2, and λ1λ2(1 + λ2) is the variance.
Point probability–negative exponential. An example of a mixture of a discrete and a continuous distribution is written in terms of its cumulative distribution function,
F(x) = 0 | if x < 0 |
= α1 | if x = 0 |
= α1 + α2(1 – e^(–x/θ)) | if x > 0,
where θ > 0 and α1 and α2 are as in previous examples. An application in economics is given by Aitchison (1955). The distribution has mean α2θ and variance α2(1 + α1)θ².
Identifiability
The subject of identifiability, that is, of unique characterization, is of concern on two levels. On both levels, identifiability, or lack thereof, has very important practical implications.
The first level is in model construction. Here the question revolves around the existence of a one-to-one correspondence between phenomena observed in nature and their corresponding mathematical models. That such one-to-one correspondences need not exist is apparent in the application of contagious distributions to accident data. Greenwood and Yule (1920) devised two models, proneness and apparent proneness, for accidents. Both lead to the negative binomial distribution. As noted by Feller (1943), even complete knowledge of the underlying negative binomial distribution therefore would not enable the experimenter to distinguish between proneness and apparent proneness as the cause of accidents.
The question of identifiability is an important consideration in mathematical modeling generally. It is particularly crucial in attempting to distinguish between competing theories. Such distinctions can be made through properly designed experiments. For example, proneness and apparent proneness can be distinguished in follow-up studies or by sampling in several time periods.
The question of identifiability also arises in a purely mathematical context. A mixture is called identifiable if there exists a one-to-one correspondence between the mixing distribution and the resulting mixture. Mixtures that are not identifiable cannot be expressed uniquely as functions of component and mixing distributions. This is true, for example, of a mixture of two binomials when n = 2. For example, both α1 = .4, p1 = .3, p2 = .6 and α1 = .6, p1 = .36, p2 = .66 result in identical mixed distributions.
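The non-identifiability can be verified directly, as in the following sketch, which evaluates the two parameter sets quoted above for n = 2 and obtains the same mixed distribution.

```python
import math

def mixed_binomial_pmf(x, n, alpha, p1, p2):
    """Mixture of two binomial(n, .) distributions with weights alpha and 1 - alpha."""
    return (alpha * math.comb(n, x) * p1 ** x * (1 - p1) ** (n - x)
            + (1 - alpha) * math.comb(n, x) * p2 ** x * (1 - p2) ** (n - x))

# For n = 2 the two parameter sets quoted in the text give the same mixture.
n = 2
for x in range(n + 1):
    first = mixed_binomial_pmf(x, n, 0.4, 0.30, 0.60)
    second = mixed_binomial_pmf(x, n, 0.6, 0.36, 0.66)
    print(x, round(first, 6), round(second, 6))
```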
The derivation of conditions under which mixtures are identifiable in this sense is difficult. Teicher (1963) derives some such conditions and discusses identifiability for many common mixtures. Identifiability in this sense has important statistical implications in that it is not possible to estimate or test hypotheses about the parameters of unidentifiable mixtures. [For further discussion, seeStatistical identifiability.]
All of the mixtures listed in the section on “Special mixed distributions” are identifiable. The mixture of two binomial distributions given in equation (3) is identifiable if, and only if, n ≥ 3.
Estimation
The construction of estimates for the parameters of mixtures of distributions is difficult. Procedures such as maximum likelihood and minimum χ², which are known to be optimal, at least for large sample sizes, require the solution of systems of equations that are typically intractable for mixtures. The problem has been somewhat alleviated with the advent of high-speed electronic computers, but it is by no means resolved. In any case the distributions of the estimators are likely to be difficult to work with. [For further discussion, see Estimation.]
Because of this complexity, it is not uncommon to choose estimation procedures almost solely on the basis of computational simplicity. Moment estimators are often used even though they may be inefficient in that they require larger sample sizes to attain a given degree of accuracy than do maximum likelihood estimators. Moment estimators are constructed by equating sample moments to population moments, the latter being written in terms of the parameters of the underlying distribution, then solving the resulting system of equations for the parameters. For example, for the Poisson-binomial distribution (with n regarded as known), the moment estimators based on the sample mean, x̄, and variance, s², are obtained by equating x̄ to the mean nθλ and s² to the variance nθλ[1 + (n – 1)θ], which gives
θ̂ = (s² – x̄)/[(n – 1)x̄],  λ̂ = x̄/(nθ̂).
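A sketch of these moment estimators follows; the numerical values fed in are illustrative, chosen so that the population mean and variance are recovered exactly.

```python
def poisson_binomial_moment_estimates(sample_mean, sample_variance, n):
    """Moment estimators of theta and lambda for the Poisson-binomial distribution,
    with the binomial index n regarded as known, obtained by equating the sample
    mean and variance to n*theta*lambda and n*theta*lambda*[1 + (n - 1)*theta]."""
    theta_hat = (sample_variance - sample_mean) / ((n - 1) * sample_mean)
    lambda_hat = sample_mean / (n * theta_hat)
    return theta_hat, lambda_hat

# With n = 3, theta = 0.4, lambda = 2 (illustrative values), the population mean is
# 2.4 and the population variance is 2.4 * [1 + 2 * 0.4] = 4.32; feeding these back
# in recovers the parameters.
print(poisson_binomial_moment_estimates(2.4, 4.32, 3))  # (0.4, 2.0)
```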
Even moment estimators can become quite complex. Moment estimates for mixtures of binomial distributions can require formidable calculations (see Blischke 1964). Pearson’s (1894) solution for a mixture of two normal distributions requires extraction of the roots of a ninth-degree equation. In the normal case, the problem is greatly simplified if the two normal components are assumed to have the same variance (see Rao 1952, section 8b.6).
Moment estimators and/or maximum likelihood estimators for the distributions mentioned above are given in the references. For a summarization of these results and some additional comments on the estimation problem, see Blischke (1963).
Another aspect of the estimation problem that can be troublesome in practice occurs if two components of a finite mixture are nearly identical. If this is the case, extremely large sample sizes are required to estimate the parameters of the mixture with any degree of accuracy. Such mixtures are, for all practical purposes, unidentifiable (cf. Chiang 1951).
Nomenclature
It is important to distinguish between the concepts of mixing of distributions and the distribution of a sum of random variables. The latter distribution is called a convolution and, except for special cases or particular notational conventions, is not a mixture of distributions.
There are, however, several additional terms sometimes used as synonyms for “mixture of distributions.” These include “compound distribution,” “mixed distribution,” “probability mixture,” “superposition,” “composite distribution,” and “sum of distributions.” The most common terms are “mixture,” “mixed distribution,” and “compound distribution.” In addition, certain “generalized distributions” are mixtures of some specific structure (cf. Feller 1943).
The terms “dissection” and “decomposition” are sometimes used in connection with estimation for finite mixtures. These terms are descriptive since the estimates give information about the components from observations on the composite.
Wallace R. Blischke
BIBLIOGRAPHY
Aitchison, John 1955 On the Distribution of a Positive Random Variable Having a Discrete Probability Mass at the Origin. Journal of the American Statistical Association 50:–908.
Blischke, Wallace R. (1963) 1965 Mixtures of Discrete Distributions. Pages –372 in Ganapati P. Patil (editor), Classical and Contagious Discrete Distributions. Proceedings of the International Symposium held at McGill University, Montreal, Canada, August –20, 1963. Calcutta: Statistical Publishing Society; Distributed by Pergamon Press.
Blischke, Wallace R. 1964 Estimating the Parameters of Mixtures of Binomial Distributions. Journal of the American Statistical Association 59:–528.
Chiang, Chin Long 1951 On the Design of Mass Medical Surveys. Human Biology 23:–271.
Dacey, Michael F. 1964 Modified Poisson Probability Law for a Point Pattern More Regular Than Random. Association of American Geographers, Annals 54:–565.
Dixon, W. J. 1953 Processing Data for Outliers. Biometrics 9:–89.
Feller, W. 1943 On a General Class of “Contagious” Distributions. Annals of Mathematical Statistics 14: –400.
Goodman, Leo A.; and Kruskal, William H. 1959 Measures of Association for Cross Classifications: II. Further Discussion and References. Journal of the American Statistical Association 54:–163.
Green, Bert F. JR. 1952 Latent Structure Analysis and Its Relation to Factor Analysis. Journal of the American Statistical Association 47:–76.
Greenwood, Major; and Yule, G. Udny 1920 An Inquiry Into the Nature of Frequency Distributions Representative of Multiple Happenings With Particular Reference to the Occurrence of Multiple Attacks of Disease or of Repeated Accidents. Journal of the Royal Statistical Society 83:–279.
Gurland, John 1958 A Generalized Class of Contagious Distributions. Biometrics 14:–249.
Kao, John H. K. 1959 A Graphical Estimation of Mixed Weibull Parameters in Life-testing of Electron Tubes. Technometrics 1:–407.
Lazarsfeld, Paul F. 1950 The Logical and Mathematical Foundation of Latent Structure Analysis. Pages 362–412 in Samuel A. Stouffer et al., Measurement and Prediction. Studies in Social Psychology in World War II, Vol. 4. Princeton Univ. Press.
Neyman, J. 1939 On a New Class of “Contagious” Distributions, Applicable in Entomology and Bacteriology. Annals of Mathematical Statistics 10:–57.
Pearson, Karl 1894 Contributions to the Mathematical Theory of Evolution. Royal Society of London, Philosophical Transactions Series A 185:–110.
Rao, C. Radhakrishna 1952 Advanced Statistical Methods in Biometric Research. New York: Wiley.
Sternberg, Saul H. 1959 A Path-dependent Linear Model. Pages 308–339 in Robert R. Bush and William K. Estes (editors), Studies in Mathematical Learning Theory. Stanford Univ. Press.
Strandskov, Herluf H.; and Edelen, Earl W. 1946 Monozygotic and Dizygotic Twin Birth Frequencies in the Total, the “White” and the “Colored” U.S. Populations. Genetics 31:438–446.
Teicher, Henry 1960 On the Mixture of Distributions. Annals of Mathematical Statistics 31:1265–73.
Teicher, Henry 1963 Identifiability of Finite Mixtures. Annals of Mathematical Statistics 34:–1269.
Vagholkar, M. K. 1959 The Process Curve and the Equivalent Mixed Binomial With Two Components. Journal of the Royal Statistical Society Series B 21:63–66.