Probability Theory
Sociologists, as much as researchers in any field perhaps, use a variety of approaches in the investigation of their subject matter. Quite successful and important are the historical and exegetical approaches and those in the traditions of anthropology and philosophy. Also of great importance are the systematic approaches that use mathematical models. Here the social investigator proposes a model, a mathematical depiction of social phenomena. A successful mathematical model can be very powerful, providing not only confidence in the theory from which the model was derived, giving us an explanation of the phenomena, but producing as well a method for predicting, giving us a practical means for controlling or affecting the social phenomena.
The social mathematical model is first of all a description of the relationship of the properties of social objects—groups, states, institutions, organizations, even people. If the model is derived from a theory, or if it contains features implied by a theory, and if the model fits data (i.e., has been found to satisfy some criterion of performance), the model can in addition be regarded as evidence to support that theory. In this case we can think of a true, underlying model that generated the observations we are studying and a proposed model that will be tested against data. Quantitative analysis begins, then, with some theoretical understanding of the properties of groups of social objects; this understanding leads to the specification of a model of the interaction of these properties, after which observations of these properties on a sample of the objects are collected. The performance of the model is then evaluated to determine to what degree the model truly describes the underlying process.
The measurement of a property of a social object is called a variable. Variables can be either fixed or random. Fixed variables are those determined by the investigator; they usually occur in experiments and will not be of concern in this chapter. All other variables are random. The random nature of these variables is the unavoidable consequence of two things; first, the fact that our observations are samples, that is, groups of instances of social objects drawn from a population (that is, a very large number of possible instances to be observed); second, the fact that our theories and data collection are often unable to account for all the relevant variables affecting the variables included in the analysis. Probability theory in social models, or, equivalently, random variables in social models, will derive from these two subtopics: sampling and the specification of residual or excluded variables in the models.
A certain philosophical difference of opinion arises among probability theorists about the nature of the true source of the randomness in nature. One group argues that these features are inherent in reality, and another argues they are simply the consequence of ignorance. The primary modeling tool of the former group is the stochastic process (Chung 1974), while that of the latter is the Bayesian statistical model (de Finetti 1974).
MAIN CONCEPTS
Probability is a name assigned to the relative frequency of an event in an event space, that is, a set of possible events. For example, we might define the event space as the two sides of a single coin labeled heads (H) and tails (T): {T, H}. The actual outcome of a coin flip is a random variable, X, say, and the probability of the outcome H is P(X = H). The probability distribution function (or PDF) assigns a quantity to this probability. By definition, for a fair coin P(X = H) = .5. Since the event space is composed of only two events, then P(X = T) + P(X = H) = 1, that is, one or the other event occurs for certain, and P(X = T) = 1 - P(X = H) = .5. Thus the probability of T is equal to the probability of H and the coin is fair.
In general we assign numbers to the events in our event space, allowing us to use mathematical language to describe the probabilities. For example, the event space of the number of people arriving at a bank's automatic teller machine (ATM) is {0, 1, 2, . . .} over a given time interval Δt. Given certain assumptions, such as that the arrival time of each person is independent of anyone else's, we can derive a theoretical PDF. For a given time interval Δt, the probability of the number of people X can be shown to be the Poisson distribution,

P(X = k) = e^(-λ) λ^k / k!,

where λ is the mean rate of people arriving at the ATM over the time interval Δt, and k = 0, 1, 2, . . .
Suppose from bank records we are able to determine that 100 people per hour complete a transaction at a particular ATM during normal working hours. For Δt equal to one minute, or 1/60 of an hour, λ = 100/60 ≈ 1.67, and the PDF is

P(X = k) = e^(-1.67) (1.67)^k / k!.

For some selected values of k we have
k | P(X = k)
0 | .1889
1 | .3148
2 | .2623
3 | .1457
4 | .0607
... | ...
If we assume that each person spends about a minute at the ATM, we should expect one or more people standing in line behind someone at the ATM about 50 percent of the time, since the probability of two or more people arriving during a one-minute interval is

P(X ≥ 2) = 1 − P(X = 0) − P(X = 1) = 1 − .1889 − .3148 ≈ .50.

The event spaces for the examples above are discrete, but continuous event spaces are also widely used. A common PDF for continuous event spaces is the normal distribution:

f(x) = (1/(σ√(2π))) e^(−(x − μ)²/(2σ²)),

where μ and σ², commonly called the mean and variance respectively, are parameters of the distribution, and x is a real number greater than minus infinity and less than plus infinity. The normal PDF is the most widely used distribution in social models, first because it has advantageous mathematical properties and second because its specification in many cases can be justified on the basis of the central limit theorem (Hogg and Tanis 1977, p. 155).
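The ATM arrival probabilities above can be reproduced with a few lines of Python. This is a minimal sketch; the function name poisson_pmf is ours, not from the text, and the only input is the stated rate of 100 arrivals per hour.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k arrivals when the mean number per interval is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

lam = 100 / 60  # 100 arrivals per hour -> mean arrivals per one-minute interval

# Reproduce the tabled values for k = 0, 1, 2, 3, 4
for k in range(5):
    print(k, round(poisson_pmf(k, lam), 4))

# Probability of two or more arrivals in a one-minute interval
p_two_or_more = 1 - poisson_pmf(0, lam) - poisson_pmf(1, lam)
print(round(p_two_or_more, 2))  # about .50, as claimed in the text
```

The loop prints the same four-decimal probabilities as the table, and the final line confirms the roughly fifty-fifty chance of a queue forming.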
Other important concepts in probability theory are the cumulative distribution function (or CDF), joint distributions (distributions involving more than one variable), and conditional distributions. The CDF is the probability of X being less than or equal to x, that is, Pr(X ≤ x). An accessible introduction to probability may be found in Hogg and Tanis (1977).
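Because the normal PDF has no closed-form antiderivative, its CDF is obtained from tables or numeric routines. A minimal sketch, using the error function in Python's standard library (the helper name normal_cdf is ours):

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Pr(X <= x) for a normal variable with mean mu and standard deviation sigma."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

print(normal_cdf(0.0))    # .5 by symmetry about the mean
print(normal_cdf(1.96))   # about .975, the familiar two-sided 5 percent point
```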
SAMPLING
In physics all protons behave similarly. To determine their properties, any given instance of a proton will do. Social objects, on the other hand, tend to be complex, and their properties can vary considerably from instance to instance. It is not possible to draw conclusions about all instances of a social object from a single instance in the manner we might from a single proton. Given equivalent circumstances, we cannot expect everyone to respond the same way to a question about their attitudes toward political issues or to behave the same way when presented with a set of options.
For example, suppose we wish to determine the extent to which a person's education affects his or her attitudes towards abortion. Let Ai represent a measurement of the attitude of some person, labeled the ith person, scored 0 if they are opposed to abortion or 1 if not. Let Bi be the measurement of the person's education, scored 0 for less than high school, 1 for high school but no college, or 2 for at least some college.
Given measurements on a sample of people, we would find that they would be distributed in some fashion across all six possible categories of the two variables. Dividing the number that fall into each category by the total number in the sample gives us estimates of the empirical distribution for the probabilities: PR(A = 0, B = 1), PR(A = 0, B = 2), and so on. We might also model this distribution. For example, an important type of model is the loglinear model (Goodman 1972; Haberman 1979; Agresti 1990), which models the log of the probability:

log PR(A = i, B = j) = λ + λai + λbj + λabij,

where λai, λbj, and λabij are parameters (actually sets of parameters). In this model the λabij parameters represent the associations between A and B, and an estimate of these quantities might have important implications for a theory.
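The empirical distribution described above can be sketched directly. The cell counts below are hypothetical, invented only to illustrate the computation; the log-probabilities in the last step are the raw material that loglinear models decompose into the lambda parameters.

```python
import math

# Hypothetical cross-tabulation: rows are attitude A (0 = opposed, 1 = not),
# columns are education B (0, 1, 2). Counts are illustrative only.
counts = [[40, 50, 30],
          [20, 60, 100]]
n = sum(sum(row) for row in counts)

# Empirical joint distribution: PR(A = i, B = j) = count_ij / n
probs = [[c / n for c in row] for row in counts]
print(probs[0][1])  # estimate of PR(A = 0, B = 1)

# Log-probabilities, which a saturated loglinear model reproduces exactly
logp = [[math.log(p) for p in row] for row in probs]
```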
Given a sample distribution, computing an estimate of λabij is straightforward (Bishop, Fienberg, and Holland 1975). It is important, however, to realize that such an estimate is itself a random variable; that is, we can expect the estimate to vary with every sample of observations we produce. If the sample is properly selected, in particular if it is a simple random sample in which each person has an equal chance of being included, it can be shown that the estimates of λabij have, in large samples at least, a normal distribution (Haberman 1973). Our estimates, then, are themselves parameters of a distribution, usually the means of a normal distribution. It follows that the fundamental parameters on which a theory depends can never be directly observed and that we must infer their true values from sample data.
All research on social objects is unavoidably research on samples of observations. Therefore all such research will necessarily entail at the very least a probabilistic sampling model, and the conclusions drawn will require properly conceived statistical inference.
MODELS WITH EXCLUDED VARIABLES
The Regression Model. The best-known and most widely used statistical model is the regression model. It is a simple linear hypersurface model with an added feature: a disturbance term, which represents the effects on the dependent variable of variables that have not been measured. To the extent that the claims or implications of a theory can be put into linear form, or at least transformed into linear form, the parameters (or regression coefficients) may be estimated and statistical inferences drawn by making some reasonably benign assumptions about the behavior of the variables that have been excluded from measurement. The key assumption is that the excluded variables are uncorrelated with the included variables. The failure of this assumption gives rise to spurious effects: parameters may be under- or overestimated, resulting in faulty conclusions. The statistical inference also requires homogeneity of variance of the disturbance variables, called homoscedasticity: the variation of the excluded variables must be the same across the range of the independent variables. This assumption is less critical, because its violation produces inefficiency rather than the bias produced by spurious effects. Moreover, the underlying process generating the heteroscedasticity may itself be specified, which would yield efficient estimates, or a modified inference may be computed, based on revised estimates of the variances of the distribution of the parameter estimates (White 1980).
For example, a simple regression of income, say, on years of education may be described yi = b0 + b1xi + εi, where yi and xi are observations on income and years of education, respectively, of the ith person, b0 and b1 are regression coefficients, and εi is the disturbance term. Estimates of b0 and b1 may be found (without making any assumptions about the functional form of the distribution of εi) by using perhaps the most celebrated theorem in statistics, the Gauss-Markov theorem, and they are usually called ordinary least squares estimates.
If we gather the observations into matrices, we can rewrite the regression equation as a matrix equation, Y = XB + E, where Y is an N x 1 vector of observations on the dependent variable, X an N x K matrix of observations on K independent variables, B a K x 1 vector of regression coefficients, and E an N x 1 vector of disturbances. With this notation the estimates in B may be described B̂ = (X'X)^(-1)X'Y, where the "^" over the B emphasizes that these are estimates of the parameters.
Our observations are samples, and since our estimates of B will vary from sample to sample, it follows that these estimates will themselves be random variables. Appealing again to the Gauss-Markov theorem, it is possible to show that the ordinary least squares estimates have a normal distribution with variance-covariance matrix VarCov(B̂) = σ²ε(X'X)^(-1), where σ²ε is estimated by σ̂²ε = (Y − XB̂)'(Y − XB̂)/(N − K − 1).
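The ordinary least squares computation can be illustrated with the closed-form estimates for the simple one-regressor case, which are what the matrix formula reduces to when X holds a constant and a single variable. The education and income figures below are invented for illustration only.

```python
# A minimal sketch of OLS for the simple regression y_i = b0 + b1*x_i + e_i.
# The data are hypothetical: years of education and income in $1,000s.
xs = [10, 12, 12, 14, 16, 16, 18, 20]
ys = [22, 27, 30, 33, 41, 38, 47, 52]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form estimates: b1 = Sxy / Sxx, b0 = ybar - b1 * xbar
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
b1 = sxy / sxx
b0 = mean_y - b1 * mean_x

# Residual variance estimate with N - K - 1 = n - 2 degrees of freedom
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
s2 = sum(e * e for e in residuals) / (n - 2)

print(b0, b1, s2)
```

With these made-up data the fitted line is income = -8 + 3(education), so each additional year of schooling is associated with an estimated $3,000 of income; s2 estimates the disturbance variance σ²ε.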
Other Models. The regression model in the previous section is a "single equation" model, that is, it contains one dependent variable. A generalization of the regression model incorporates multiple dependent variables. This model may be represented in matrix notation as BY = ΓX + Z, where Y is an N x L matrix of L endogenous variables (i.e., variables that are dependent in at least one equation), B is an L x L matrix of coefficients relating endogenous variables among themselves, X is an N x K matrix of K exogenous variables (i.e., variables that are never dependent), Γ is an L x K matrix of coefficients relating the exogenous variables to the endogenous variables, and Z is an N x L matrix of disturbances. Techniques have been developed to produce estimates and statistical inference for these kinds of models (Judge et al. 1982; Fox 1984).
Measurement error is another kind of excluded variable, and models have been developed to incorporate it into the regression and simultaneous equation models. One method for handling measurement error is to use multiple measures of an underlying latent variable (Bollen 1989; Jöreskog and Sörbom 1988). A model that incorporates both measurement error and excluded-variable disturbances may be described in the following way:

η = Bη + Γξ + ζ
Y = Λyη + εy
X = Λxξ + εx

where Y and X are our observations on the endogenous and exogenous variables respectively, Λy and Λx are coefficient matrices relating the underlying variables to the observed variables, η and ξ are the latent endogenous and exogenous variables respectively, B and Γ are coefficient matrices relating the latent variables among themselves, εy and εx are the measurement error disturbances, and ζ is the excluded-variable disturbance.
This model incorporates three sources of randomness: measurement error disturbance, excluded-variable disturbance, and sampling error. Models of the future may contain a fourth source of randomness: a structural disturbance in the coefficients. These latter models are called random coefficient models and are a special case of the most general kind of probabilistic model, the mixture model (Judge et al. 1982; Everitt 1984).
The models described to this point have been linear. Linearity can be a useful approximation that renders a problem tractable. Nonlinearity may be an important aspect of a theoretical specification, however, and methods to incorporate nonlinearity in large-scale models have been developed (Amemiya 1985). Moreover, most social measures are not the continuous, real-valued variables assumed by the regression and simultaneous equation models described above. Thus much work is now being devoted to the development of models that may be used with measures that are limited in a variety of ways: they are categorical, ordinal, truncated, or censored, for example (Muthén 1984; Maddala 1983). Limited variable methods also include methods for handling variations on the simple random method of sampling.
Probability theory has had a profound effect on the modeling of social processes. It has helped solve the sampling problem, permitted the specification of models with excluded variables, and provided a method for handling measurement.
REFERENCES
Agresti, A. 1990 Categorical Data Analysis. New York: John Wiley.
Amemiya, T. 1985 Advanced Econometrics. Cambridge, Mass.: Harvard University Press.
Bishop, Y. M. M., S. E. Fienberg, and P. W. Holland 1975 Discrete Multivariate Analysis: Theory and Practice. Cambridge, Mass.: MIT Press.
Bollen, K. A. 1989 Structural Equations with Latent Variables. New York: John Wiley.
Chung, K. L. 1974 Elementary Probability Theory with Stochastic Processes. Berlin: Springer-Verlag.
de Finetti, B. 1974 Theory of Probability, 2 vols. New York: John Wiley.
Everitt, B. S. 1984 An Introduction to Latent Variable Models. London: Chapman and Hall.
Fox, J. 1984 Linear Statistical Models and Related Methods. New York: John Wiley.
Goodman, L. A. 1972 "A General Model for the Analysis of Surveys." American Journal of Sociology 37:28–46.
Haberman, S. J. 1973 "Loglinear Models for Frequency Data: Sufficient Statistics and Likelihood Equations." Annals of Statistics 1:617–632.
——1979 Analysis of Qualitative Data, vol. 2, New Developments. Orlando, Fla.: Academic.
Hogg, Robert V., and Elliot A. Tanis 1977 Probability and Statistical Inference. New York: Macmillan.
Jöreskog, K. G., and Dag Sörbom 1988 LISREL VII. Chicago: SPSS.
Judge, George G., R. Carter Hill, William Griffiths, Helmut Lutkepohl, and Tsoung-Chao Lee 1982 Introduction to the Theory and Practice of Econometrics. New York: John Wiley.
Maddala, G. 1983 Limited-Dependent and Qualitative Variables in Econometrics. Cambridge, England: Cambridge University Press.
Muthén, B. 1984 "A General Structural Equation Model with Dichotomous, Ordered Categorical, and Continuous Latent Variable Indicators." Psychometrika 49:115–132.
Tuma, Nancy Brandon, and Michael T. Hannan 1984 Social Dynamics: Models and Methods. Orlando, Fla.: Academic.
White, H. 1980 "A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity." Econometrica 48:817–838.
Ronald Schoenberg