VALIDITY
In the simplest sense, a measure is said to be valid to the degree that it measures what it is hypothesized to measure (Nunnally 1967, p. 75). More precisely, validity has been defined as the degree to which a score derived from a measurement procedure reflects a point on the underlying construct it is hypothesized to reflect (Bohrnstedt 1983). In the most recent Standards for Educational and Psychological Testing (American Psychological Association 1985), it is stated that validity "refers to the appropriateness, meaningfulness, and usefulness of the specific inferences made from . . . scores." The emphasis is clear: Validity refers to the degree to which evidence supports the inferences drawn from a score rather than the scores or the instruments that produce the scores. Inferences drawn from a given measure may be valid for one population but not for others. As will be shown below, evidence for inferences about validity can be accumulated in a variety of ways. In spite of this variety, validity is a unitary concept. The varied types of inferential evidence relate to the validity of a particular measure under investigation.
Several important points related to validity should be noted:
- Validity is a matter of degree rather than an all-or-none matter (Nunnally 1967, p. 75; Messick 1989).
- Since the constructs of interest in sociology (normlessness, religiosity, economic conservatism, etc.) generally are not amenable to direct observation, validity can be ascertained only indirectly.
- Validation is a dynamic process; the evidence for or against the validity of the inferences that can be drawn from a measure may change with accumulating evidence. Validity in this sense is always a continuing and evolving matter rather than something that is fixed once and for all (Messick 1989).
- Validity is the sine qua non of measurement; without it, measurement is meaningless.
In spite of the clear importance of validity in making defensible inferences about the reasonableness of theoretical formulations, the construct more often than not is given little more than lip service in sociological research. Measures are assumed to be valid because they "look valid," not because they have been subjected to procedures that yield statistical estimates of validity. In this article, the different meanings of validity are introduced and methods for estimating the various types of validity are discussed.
TYPES OF VALIDITY
The Standards produced jointly by the American Psychological Association, the American Educational Research Association, and the National Council on Measurement in Education distinguish among three types of evidence related to validity: (1) criterion-related, (2) content, and (3) construct evidence (American Psychological Association 1985).
Criterion-Related Evidence for Validity. Criterion-related evidence for validity is assessed by the correlation between a measure and a criterion variable of interest. The criterion varies with the purpose of the researcher and/or the client for the research. Thus, in a study to determine the effect of early childhood education, a criterion of interest might be how well children perform on a standardized reading test at the end of the third grade. In a study for an industrial client, it might be the number of years it takes to reach a certain job level. The question that is always asked when one is accumulating evidence for criterion-related validity is: How accurately can the criterion be predicted from the scores on a measure? (American Psychological Association 1985).
Since the criterion variable may be one that exists in the present or one that a researcher may want to predict in the future, evidence for criterion-related validity is classified into two major types: predictive and concurrent.
Evidence for predictive validity is assessed by examining the future standing on a criterion variable as predicted from the present standing on a measure of interest. For example, if one constructs a measure of work orientation, evidence of its predictive validity for job performance might be ascertained by administering that measure to a group of new hires and correlating it with a criterion of success (supervisors' ratings, regular advances within the organization, etc.) at a later point in time. The evidence for the validity of a measure is not limited to a single criterion. There are as many validities as there are criterion variables to be predicted from that measure. The preceding example makes this clear. In addition, the example shows that the evidence for the validity of a measure varies depending on the time at which the criterion is assessed. Generally, the closer in time the measure and the criterion are assessed, the higher the validity, but this is not always true.
Evidence for concurrent validity is assessed by correlating a measure and a criterion of interest at the same point in time. A measure of the concurrent validity of a measure of religious belief, for example, is its correlation with concurrent attendance at religious services. Just as is the case for predictive validity, there are as many concurrent validities as there are criteria to be explained; there is no single concurrent validity for a measure.
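To make the computation concrete, both forms of criterion-related evidence reduce to a correlation between scores on the measure and scores on a criterion; the two differ only in when the criterion is observed. The following sketch, written in Python with simulated data and invented variable names (work_orientation, supervisor_rating, coworker_rating_now), is purely illustrative and is not drawn from any actual study.

```python
# Illustrative only: criterion-related validity as a correlation between a
# measure and a criterion. All data are simulated; variable names are invented.
import numpy as np

rng = np.random.default_rng(0)
n = 200

work_orientation = rng.normal(size=n)  # measure administered to new hires
# Criterion observed one year later (predictive) and at the same time (concurrent).
supervisor_rating = 0.5 * work_orientation + rng.normal(scale=0.9, size=n)
coworker_rating_now = 0.4 * work_orientation + rng.normal(scale=1.0, size=n)

# Predictive validity: correlation of the measure with a future criterion.
predictive_validity = np.corrcoef(work_orientation, supervisor_rating)[0, 1]

# Concurrent validity: correlation of the measure with a criterion assessed now.
concurrent_validity = np.corrcoef(work_orientation, coworker_rating_now)[0, 1]

print(f"predictive validity coefficient: {predictive_validity:.2f}")
print(f"concurrent validity coefficient: {concurrent_validity:.2f}")
```

Each additional criterion would yield its own coefficient, which is the sense in which a measure has as many criterion-related validities as it has criteria.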
Concurrent validation also can be evaluated by correlating a measure of X with extant measures of X, for instance, correlating one measure of self-esteem with a second one. It is assumed that the two measures reflect the same underlying construct. Two measures may both be labeled self-esteem, but if one contains items that deal with one's social competence and the other contains items that deal with how one feels and evaluates oneself, it will not be surprising to find no more than a modest correlation between the two.
Evidence for validity based on concurrent studies may not square with evidence for validity based on predictive studies. For example, a measure of an attitude toward a political issue may correlate highly in August with the political party one believes one will vote for in November but may correlate rather poorly with the actual vote cast in November.
Many of the constructs of interest to sociologists do not have criteria against which the validity of a measure can be ascertained easily. When they do, the criteria may be so poorly measured that the validity coefficients are badly attenuated by measurement error. For these reasons, sociological researchers have rarely computed criterion-related validities.
Content Validity. One can imagine a domain of meaning that a construct is intended to measure. Content validity provides evidence for the degree to which one has representatively sampled from that domain of meaning (Bohrnstedt 1983). One also can think of a domain as having various facets (Guttman 1959), and just as one can use stratification to obtain a sample of persons, one can use stratification principles to improve the evidence for content validity.
While content validity has received close attention in the construction of achievement and proficiency measures in psychology and educational psychology, it usually has been ignored by sociologists. Many sociological researchers have instead been satisfied to construct a few items on an ad hoc, one-shot basis in the apparent belief that they are measuring what they intended to measure. In fact, the construction of good measures is a tedious, arduous, and time-consuming task.
Because domains of interest cannot be enumerated in the same way that a population of persons or objects can, the task of assuring the content validity of one's measures is less rigorous than one would hope. While an educational psychologist can sample four-, five-, or six-letter words in constructing a spelling test, no such clear criteria exist for a sociologist who engages in social measurement. However, some guidelines can be provided. First, the researcher should search the literature carefully to determine how various authors have used the concept that is to be measured. There are several excellent handbooks that summarize social measures in use, including Robinson and Shaver's Measures of Social Psychological Attitudes (1973); Robinson et al.'s Measures of Political Attitudes (1968); Robinson et al.'s Measures of Occupational Attitudes and Occupational Characteristics (1969); Shaw and Wright's Scales for the Measurement of Attitudes (1967); and Miller's Handbook of Research Design and Social Measurement (1977). These volumes not only contain lists of measures but provide existing data on the reliability and validity of those measures. However, since these books are out of date as soon as they go to press, researchers developing their own methods must do additional literature searches. Second, sociological researchers should rely on their own observations and insights and ask whether they yield additional facets to the construct under consideration.
Using these two approaches, one develops sets of items, one to capture each of the various facets or strata within the domain of meaning. There is no simple criterion by which one can judge whether a domain of meaning has been sampled properly. However, a few precautions can be taken to help ensure the representation of the various facets within the domain.
First, the domain can be stratified into its major facets. One first notes the most central meanings of the construct, making certain that the stratification is exhaustive, that is, that all major meaning facets are represented. If a facet appears to involve a complex of meanings, it should be subdivided further into substrata. The more one refines the strata and substrata, the easier it is to construct the items later and the more complete the coverage of meanings associated with the construct will be. Second, one should write several items or locate several extant indicators to reflect the meanings associated with each stratum and substratum. Third, after the items have been written, they should be tried out on very small samples composed of persons of the type the items will eventually be used with, using cognitive interviewing techniques, in which subjects are asked to "think aloud" as they respond to the items. This technique for the improvement of items, while quite new in survey research, is very useful for improving the validity of items (Sudman et al. 1995). For example, Levine et al. (1997) have shown how cognitive interviewing helped improve the measurement of school staffing resources, as did Levine (1996) in describing the development of background questionnaires for use with large-scale cognitive assessments. Fourth, after the items have been refined through the use of cognitive laboratory techniques, the newly developed items should be field-tested on a sample similar to that with which one intends to examine the main research questions. The field-test sample should be large enough to examine whether the items are operating as planned vis-à-vis the constructs they are putatively measuring, using multivariate tools such as confirmatory factor analysis (Jöreskog 1969) and item response theory methods (Hambleton and Swaminathan 1985).
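As a minimal sketch of the kind of first-pass check a field test might include before formal confirmatory factor analysis or item response theory modeling, the following Python fragment computes corrected item-rest correlations and Cronbach's alpha for a set of items written for a single stratum. The data are simulated, and this simpler check is offered as a stand-in for, not a substitute for, the multivariate tools cited above.

```python
# First-pass field-test check on a respondents-by-items score matrix:
# corrected item-rest correlations and Cronbach's alpha. Data are simulated.
import numpy as np

def item_rest_correlations(scores: np.ndarray) -> np.ndarray:
    """Correlation of each item with the sum of the remaining items."""
    n_items = scores.shape[1]
    total = scores.sum(axis=1)
    out = np.empty(n_items)
    for j in range(n_items):
        rest = total - scores[:, j]
        out[j] = np.corrcoef(scores[:, j], rest)[0, 1]
    return out

def cronbach_alpha(scores: np.ndarray) -> float:
    """Classical internal-consistency estimate for items from one stratum."""
    n_items = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)
    total_var = scores.sum(axis=1).var(ddof=1)
    return (n_items / (n_items - 1)) * (1 - item_vars.sum() / total_var)

# Simulated field-test data: 300 respondents, 6 items reflecting one true score.
rng = np.random.default_rng(1)
true_score = rng.normal(size=(300, 1))
items = true_score + rng.normal(scale=1.0, size=(300, 6))

print("item-rest correlations:", np.round(item_rest_correlations(items), 2))
print("Cronbach's alpha:", round(cronbach_alpha(items), 2))
```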
Finally, after the items are developed, the main study should employ a sampling design that takes into account the characteristics of the population about which generalizations are to be made (ethnicity, gender, region of country, etc.). The study also should be large enough to generate stable parameter estimates when one is using multivariate techniques such as multiple regression (Bohrnstedt and Knoke 1988) and structural equation techniques (Bollen 1989).
It can be argued that what the Standards call content validity is not a separate method for assessing validity. Instead, it is a set of procedures for sampling content domains that, if followed, can help provide evidence for construct validity (see the discussion of construct validity below). Messick (1989), in a similar stance, states that so-called content validity does not meet the definition of validity given above, since it does not deal directly with scores or their interpretation. This position can be better understood in the context of construct validity.
Construct Validity. The 1974 Standards state: "A construct is. . . a theoretical idea developed to explain and to organize some aspects of existing knowledge. . . It is a dimension understood or inferred from its network of interrelationships" (American Psychological Association 1985). The Standards further indicate that in developing evidence for construct validity,
the investigator begins by formulating hypotheses about the characteristics of those who have high scores on the [measure] in contrast to those who have low scores. Taken together, such hypotheses form at least a tentative theory about the nature of the construct the [measure] is believed to be measuring.
Such hypotheses or theoretical formulations lead to certain predictions about how people. . . will behave. . . in certain defined situations. If the investigator's theory. . . is correct, most predictions should be confirmed. (p. 30)
The notion of a construct implies hypotheses of two types. First, it implies that items from one stratum within the domain of meaning correlate together because they all reflect the same underlying construct or "true" score. Second, whereas items from one domain may correlate with items from another domain, the implication is that they do so only because the constructs themselves are correlated. Furthermore, it is assumed that there are hypotheses about how measures of different domains correlate with one another. To repeat, construct validation involves two types of evidence. The first is evidence for theoretical validity (Lord and Novick 1968): an assessment of the relationship between items and an underlying, latent unobserved construct. The second involves evidence that the underlying latent variables correlate as hypothesized. If either or both sets of these hypotheses fail, evidence for construct validation is absent. If one can show evidence for theoretical validity but evidence about the interrelations among those constructs is missing, that suggests that one is not measuring the intended construct or that the theory is wrong or inadequate. The more unconfirmed hypotheses one has involving the constructs, the more one is likely to assume the former rather than the latter.
The discussion above makes clear the close relationship between construct validation and theory validation. To be able to show construct validity assumes that the researcher has a clearly stated set of interrelated hypotheses between important theoretical constructs, which in turn can be measured by sets of indicators. Too often in sociology, one or both of these components are missing.
Campbell (1953, 1956) uses a multitrait–multimethod matrix, a useful tool for assessing the construct validity of a set of measures collected using differing methods. Thus, for example, one might collect data using multiple indicators of three constructs, say, prejudice, alienation, and anomie, using three different data collection methods: a face-to-face interview, a telephone interview, and a questionnaire. To the degree that different methods yield the same or a very similar result, the construct demonstrates what Campbell (1954) calls convergent validity. Campbell argues that in addition, the constructs must not correlate too highly with each other; that is, to use Campbell and Fiske's (1959) term, they must also exhibit discriminant validity. Measures that meet both criteria provide evidence for construct validity.
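A minimal sketch of the multitrait–multimethod logic follows, using simulated data with invented construct and method labels. Convergent validity appears as high correlations between measures of the same construct obtained by different methods (the "validity diagonal"), and discriminant validity as those correlations exceeding the correlations between different constructs. In this toy simulation the constructs are generated as uncorrelated, so the contrast is sharper than one would expect with real data.

```python
# Hypothetical MTMM sketch: three constructs, each measured by three methods.
# Construct and method labels, and all data, are invented for illustration.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 500
traits = {"prejudice": rng.normal(size=n),
          "alienation": rng.normal(size=n),
          "anomie": rng.normal(size=n)}
methods = ["face_to_face", "telephone", "questionnaire"]

# Each trait-method measure = true trait score + method-specific noise.
data = pd.DataFrame({f"{t}_{m}": score + rng.normal(scale=0.7, size=n)
                     for t, score in traits.items() for m in methods})

mtmm = data.corr()  # the multitrait-multimethod correlation matrix

# Convergent validity: same trait measured by different methods.
convergent = [mtmm.loc[f"{t}_{m1}", f"{t}_{m2}"]
              for t in traits
              for i, m1 in enumerate(methods)
              for m2 in methods[i + 1:]]

# Discriminant check: different traits, any pair of methods.
discriminant = [mtmm.loc[f"{t1}_{m1}", f"{t2}_{m2}"]
                for t1 in traits for t2 in traits if t1 < t2
                for m1 in methods for m2 in methods]

print("mean convergent correlation:  ", round(float(np.mean(convergent)), 2))
print("mean discriminant correlation:", round(float(np.mean(discriminant)), 2))
```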
VALIDITY GENERALIZATION
An important issue for work in educational and industrial settings is the degree to which the criterion-related evidence for validity obtained in one setting generalizes to other settings (American Psychological Association 1985). The point is that evidence for the validity of an instrument in one setting in no way guarantees its validity in any other setting. By contrast, the more evidence there is of consistency of findings across settings that are maximally different, the stronger the evidence for validity generalization is.
Evidence for validity generalization generally is garnered in one of two ways. The usual way is simply to do a nonquantitative review of the relevant literature; then, on the basis of that review, a conclusion about the generalizability of the measure across a variety of settings is made. More recently, however, meta-analytic techniques (Hedges and Olkin 1985) have been employed to provide quantitative evidence for validity generalization.
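As one hedged illustration of the quantitative approach, the sketch below pools invented validity coefficients from several settings with a fixed-effect Fisher-z average, a generic procedure in the spirit of Hedges and Olkin (1985) rather than a reproduction of any particular published analysis.

```python
# Hypothetical validity-generalization sketch: pool correlation coefficients
# from several settings. Coefficients and sample sizes are invented.
import numpy as np

r = np.array([0.31, 0.42, 0.27, 0.38, 0.35])   # validity coefficients from five settings
n = np.array([120, 85, 240, 60, 150])          # corresponding sample sizes

z = np.arctanh(r)        # Fisher's z transformation of each correlation
w = n - 3                # inverse sampling variance of z
z_bar = np.sum(w * z) / np.sum(w)
se = np.sqrt(1 / np.sum(w))

pooled_r = np.tanh(z_bar)                                   # back-transform
ci = np.tanh([z_bar - 1.96 * se, z_bar + 1.96 * se])        # approximate 95% CI

print(f"pooled validity coefficient: {pooled_r:.2f}")
print(f"approximate 95% CI: [{ci[0]:.2f}, {ci[1]:.2f}]")
```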
Variables that may affect validity generalization include the particular criterion measure used, the sample to which the instrument is administered, the time period during which the instrument was used, and the setting in which the assessment is done.
Differential Prediction. In using a measure in different demographic groups that differ in experience or that have received different treatments (e.g., different instructional programs), the possibility exists that the relationship between the criterion measure and the predictor will vary across groups. To the degree that this is true, a measure is said to display differential prediction.
Closely related is the notion of predictive bias. While there is some dispute about the best definition, the most commonly accepted definition states that predictive bias exists if different regression equations are needed for different groups and if predictions result in decisions for those groups that are different from the decisions that would be made based on a pooled groups regression analysis (American Psychological Association 1985). Perhaps the best example to differentiate the two concepts is drawn from examining the relationship between education and income. It has been shown that that relationship is stronger for whites than it is for blacks; that is, education differentially predicts income. If education were then used as a basis for selection into jobs at a given income level, education would be said to have a predictive bias against blacks because they would have to have a greater number of years of education to be selected for a given job level compared to whites.
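The sketch below illustrates the distinction with simulated data: the education-income slope is constructed to be steeper in one group, so group-specific regressions differ (differential prediction), and a pooled equation systematically over- and under-predicts by group, which is the condition under which selection decisions based on the pooled equation would be biased. The groups, slopes, and data are invented for illustration.

```python
# Hypothetical differential-prediction sketch: education predicts income more
# strongly in group A than in group B. All data are simulated for illustration.
import numpy as np

rng = np.random.default_rng(3)
n = 1000
education = rng.normal(13, 2, size=n)     # years of schooling
group_a = rng.random(n) < 0.5             # membership in one of two groups

# Income depends on education more strongly in group A (differential prediction).
income = np.where(group_a, 3.0, 1.5) * education + rng.normal(scale=5.0, size=n)

def fit_line(x, y):
    """Ordinary least squares intercept and slope for y = a + b*x."""
    b, a = np.polyfit(x, y, deg=1)
    return a, b

a_pooled, b_pooled = fit_line(education, income)
_, b_A = fit_line(education[group_a], income[group_a])
_, b_B = fit_line(education[~group_a], income[~group_a])
print(f"slope pooled: {b_pooled:.2f}  group A: {b_A:.2f}  group B: {b_B:.2f}")

# Predictive bias: the pooled equation misses in opposite directions by group,
# so selection on pooled predictions would treat the groups differently than
# selection based on the group-specific equations.
pred_pooled = a_pooled + b_pooled * education
errors = [np.mean(income[g] - pred_pooled[g]) for g in (group_a, ~group_a)]
print("mean prediction error (group A, group B):", [round(e, 2) for e in errors])
```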
Differential prediction should not be confused with differential validity, a term used in the context of job placement and classification. Differential validity refers to the ability of a measure or, more commonly, a battery of measures to differentially predict success or failure in one job compared to another. Thus, the armed services use the battery of subtests in the Armed Services Vocational Aptitude Battery (U.S. Government Printing Office 1989; McLaughlin et al. 1984) in making the initial assignment of enlistees to military occupational specialties.
MORE RECENT FORMULATIONS OF VALIDITY
More recent definitions of validity have been even broader than that used in the 1985 Standards. Messick (1989) defines validity as an evaluative judgment about the degree to which "empirical and theoretical rationales support the adequacy and appropriateness of inferences and actions based on . . . scores or other modes of assessment" (p. 13). For Messick, validity is more than a statement of the existing empirical evidence linking a score to a latent construct; it is also a statement about the evidence for the appropriateness of using and interpreting the scores. While most measurement specialists separate the use of scores from their interpretation, Messick (1989) argues that the value implications and social consequences of testing are inextricably bound to the issue of validity:
[A] social consequence of testing, such as adverse impact against females in the use of a quantitative test, either stems from a source of test invalidity or a valid property of the construct assessed, or both. In the former case, this adverse consequence bears on the meaning of the test scores and, in the latter case, on the meaning of the construct. In both cases, therefore, construct validity binds social consequences to the evidential basis of test interpretation and use. (p. 21)
Whether this broader view of the interpretation and social consequences of the uses of measures becomes widely adopted (i.e., is incorporated in the next edition of the Standards) remains to be seen. Messick's (1989) definition does reinforce the idea that although there are many facets to and methods for garnering evidence for inferences about validity, it remains a unitary concept; evidence bears on inferences about a single measure or instrument.
REFERENCES
American Psychological Association 1985 Standards for Educational and Psychological Testing. Washington, D.C.: American Psychological Association.
Bohrnstedt, G. W. 1983 "Measurement." In P. H. Rossi, J. D. Wright, and A. B. Anderson, eds., Handbook of Survey Research. New York: Academic Press.
Bohrnstedt, G. W., and D. Knoke 1988 Statistics for Social Data Analysis. Itasca, Ill.: F. E. Peacock.
Bohrnstedt, G. W. 1992 "Reliability." In E. F. Borgatta, ed., Encyclopedia of Sociology, 1st ed. New York: Macmillan.
Bollen, K. A. 1989 Structural Equations with Latent Variables. New York: Wiley.
Campbell, D. T. 1953 A Study of Leadership Among Submarine Officers. Columbus: Ohio State University Research Foundation.
—— 1954 "Operational Delineation of What Is Learned' via the Transportation Experiment." psychological review 61:167–174.
—— 1956 Leadership and Its Effects upon the Group. Monograph no. 83. Columbus: Ohio State University Bureau of Business Research.
——, and D. W. Fiske 1959 "Convergent and Discriminant Validation by the Multitrait-Multimethod Matrix." Psychological Bulletin 56:81–105.
Guttman, L. 1959 "A Structural Theory for Intergroup Beliefs and Action." American Sociological Review 24:318–328.
Hambleton, R., and H. Swaminathan 1985 Item Response Theory: Principles and Applications. Norwell, Mass.: Kluwer Academic.
Hedges, L. V., and I. Olkin 1985 Statistical Methods for Meta-Analysis. Orlando, Fla.: Academic Press.
Jöreskog, K. G. 1969 "A General Approach to Confirmatory Maximum Likelihood Factor Analysis." Psychometrika 36:409–426.
Levine, R. 1998 "What Do Cognitive Labs Tell Us about Student Knowledge?" Paper presented at the 28th Annual Conference on Large Scale Assessment sponsored by the Council of Chief State School Officers, Colorado Springs, Colo. Palo Alto, Calif.: American Institutes for Research.
Levine, R., J. Chambers, I. Duenas, and C. Hikido 1997 Improving the Measurement of Staffing Resources at the School Level: The Development of Recommendations for NCES for the Schools and Staffing Surveys (SASS). Palo Alto, Calif.: American Institutes for Research.
Lord, F. M., and M. R. Novick 1968 Statistical Theories of Mental Test Scores. Reading, Mass.: Addison-Wesley.
McLaughlin, D. H., P. G. Rossmeissl, L. L. Wise, D. A. Brandt, and M. Wang 1984 Validation of Current and Alternative Armed Services Vocational Aptitude Battery (ASVAB) Area Composites. Washington, D.C.: U.S. Army Research Institute for the Behavioral and Social Sciences.
Messick, S. 1989 "Validity." In R. L. Linn, ed., Educational Measurement, 3rd ed. New York: Macmillan.
Miller, D. 1977 Handbook of Research Design and Social Measurement, 3rd ed. New York: David McKay.
Nunnally, J. C. 1967 Psychometric Theory. New York: McGraw-Hill.
Robinson, J. P., R. Athanasiou, and K. B. Head 1969 Measures of Occupational Attitudes and Occupational Characteristics. Ann Arbor, Mich.: Institute for Social Research.
——, J. G. Rusk, and K. B. Head 1968 Measures of Political Attitudes. Ann Arbor, Mich.: Institute for Social Research.
Robinson, J. P., and P. R. Shaver 1973 Measures of Social Psychological Attitudes. Ann Arbor, Mich.: Institute for Social Research.
Shaw, M., and J. Wright 1967 Scales for the Measurement of Attitudes. New York: McGraw-Hill.
Sudman, S., N. Bradburn, and N. Schwarz 1995 Thinking About Answers: The Application of Cognitive Processes to Survey Methodology. San Francisco: Jossey-Bass.
U.S. Government Printing Office 1989 A Brief Guide: ASVAB for Counselors and Educators. Washington, D.C.: U.S. Government Printing Office.
George W. Bohrnstedt
validity
Rules of thumb have been developed which rule out certain types of question completely. For example, it is generally held to be pointless enquiring long after the event about the attitudes and reasons linked to a decision or choice made many years ago, on the grounds that views tend to be reconstructed with the benefit of hindsight. Arguments of validity rule out proxy interviews for anything except the most basic factual data, such as someone's occupation, if even that. Logical validation, checking for ‘face validity’ in theoretical or commonsense terms, continues to be the most important tool, which is strengthened by employing as wide a range of people as possible to make the checks. This can be extended into the use of panels of experts, judges or juries who are in fact ordinary people who have close familiarity with the topic in question, and can judge whether questions and classifications of replies cover all the situations that arise, and are appropriately worded. Another approach is to present the research instrument to groups of people who are known to have particular views or experience, and see whether it differentiates adequately between the groups. However, the ultimate test is whether the research tools, and the results obtained, are accepted by other scholars as having validity. It is rare for researchers to submit their work to the scrutiny of the research subjects themselves, though this sometimes happens in policy research. Population Censuses are unique in having post-enumeration surveys after each census to check data validity and general quality.
There are many different definitions of validity in the available literature. It is clear that different authors use the terminology in different ways. Part of the problem is that most of the relevant discussion takes place among psychologists, who also provide many of the examples and established procedures for testing for validity, but it is not clear that these always translate easily into sociological contexts.
One elementary and useful distinction sometimes made is that between criterion and construct validity. The former refers to the closeness of fit between a measure (let us say a concept) and the reality that it is supposed to reflect. For example, the Goldthorpe class scheme is intended as and claims to be a measure of ‘conditions and relations of employment’, in particular the material and other benefits of the ‘service relationship’ as against that of the ‘labour contract’. However, for practical reasons its operationalization is effected via data about each individual's occupational title (teacher, nurse, or whatever) on the one hand, and employment status on the other (manager, employee, self-employed, and so on). One could therefore investigate the criterion validity of the Goldthorpe scheme by drawing a sample of individuals from within each of the social classes represented in the classification, and then collecting independent data relating to the actual conditions and relations of employment of these respondents, for example evidence about their location on an established career ladder and incremental salary scale, enjoyment of enhanced pension rights, and of a certain autonomy in how they use their time at work. In other words, one would examine the degree to which the Goldthorpe classes measure those aspects of employment that they are said to measure, using independent criteria of the concept under investigation.
Construct validation, on the other hand, involves an assessment of whether or not a particular measure (a concept or whatever) relates to other variables in ways that would be predicted by the theory behind the concept. For example, in the case of Goldthorpe (or any other) social classes, one would expect—if this measurement of class was valid—that the social classes so identified would be readily differentiated in terms of (say) voting behaviour, levels of educational attainment, and inequalities in health (or literal ‘life-chances’). These are the sorts of things that, from our understanding of the theory of social class, we anticipate will be associated with the class location of individuals. For this reason construct validity is sometimes also referred to as ‘predictive validity’—rather misleadingly, perhaps, since one is here simply looking for correlations, rather than offering predictions about how weak or strong these correlations will be in actuality. Moreover, the term ‘construct validation’ is itself sometimes applied to the whole validation process, including those aspects earlier discussed under the label of criterion validity.
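As a hedged sketch of such a construct-validity check (the class labels, attainment rates, and data below are all invented), one might simply ask whether an outcome such as educational attainment is clearly differentiated across the classes:

```python
# Hypothetical construct-validation sketch: do class categories differentiate
# an outcome (here, a binary indicator of post-secondary attainment)?
# Class labels, rates, and data are invented for illustration.
import pandas as pd
import numpy as np

rng = np.random.default_rng(4)
classes = ["service", "intermediate", "working"]
n_per_class = 400

frames = []
for cls, p in zip(classes, [0.65, 0.40, 0.20]):   # assumed attainment rates
    frames.append(pd.DataFrame({
        "social_class": cls,
        "degree": rng.random(n_per_class) < p,
    }))
data = pd.concat(frames, ignore_index=True)

# Evidence for construct validity: attainment should differ systematically by class.
print(data.groupby("social_class")["degree"].mean().round(2))
```

A fuller analysis would of course examine several such outcomes and model the strength of the associations rather than compare raw proportions.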
Arguably, all definitions and concepts of validity are to some extent circular, in the sense that one is attempting to confirm that a sociological construct (a classification, concept, or variable) actually measures what it claims to measure, by comparing that construct with something else (other indicators) that one hopes and assumes are independent of the original measurement. A useful discussion of the many complexities raised by this troublesome notion will be found in R. A. Zeller and E. G. Carmines, Measurement in the Social Sciences (1980). For elaboration of the example provided above, and an illustration of how tests for validity are conducted in practice, see Geoffrey Evans and Colin Mills, ‘Identifying Class Structure: A Latent Class Analysis of the Criterion-Related and Construct Validity of the Goldthorpe Class Scheme’, European Sociological Review (1998).
See also RELIABILITY; VARIABLE.