Psychometrics
Psychometrics, broadly defined, includes all aspects of the science of measurement of psychological variables and all research methodologies related to them. In addition to this article, the area of psychometrics is also discussed in Achievement Testing; Aptitude Testing; Experimental Design; Factor Analysis; Intelligence And Intelligence Testing; Latent Structure; Personality Measurement; Psychophysics; Quantal Response; Response Sets; Scaling; Sociometry; Vocational Interest Testing. Also of relevance are Attitudes and Projective Methods.
This article deals with five major topics: measurement, item analysis, reliability, validity, and test norms.
Measurement
Measurement is generally considered to be any procedure whereby numbers are assigned to individuals (used herein to mean persons, objects, or events) according to some rule. The rule usually specifies the categories of an attribute or some quantitative aspect of an observation, and hence defines a scale. A scale is possible whenever there exists a one-to-one relationship between some of the properties of a group of numbers and a set of operations (the measurement procedure) which can be performed on or observed in the individuals. Scales of measurement are commonly classified as nominal scales, ordinal scales, interval scales, and ratio scales; the variables they measure can be discrete (i.e., providing distinct categories that vary from each other in a perceptibly finite way) or continuous (i.e., not readily providing distinct categories; varying by virtually imperceptible degrees).
Nominal scales
In a nominal scale the numbers merely identify individuals or the categories of some attribute by which individuals can be classified. Letters or words or arbitrary symbols would do just as well. Simple identification is illustrated by the assignment of numbers to football players; classification by assigning numbers to such attributes as sex, occupation, national origin, or color of hair. We can cross-classify according to the categories of two or more attributes; e.g., sex by occupation, or sex by occupation by national origin. With nominal scales that classify, the variables are always treated as discrete. Sex, occupation, and national origin are genuine discrete variables. Color of hair, on the other hand, is a multidimensional continuous variable. If, for instance, we treat it as a discrete variable by establishing the categories blond, brunette, and redhead, the measure becomes unreliable to some degree, and some individuals will be misclassified. Where such misclassification can occur, the scale may be termed a quasi-nominal scale. Subject to the limitation of unreliability, it has the properties and uses of any other nominal scale.
The basic statistics used with nominal scales are the numbers, percentages, or proportions of individuals in the categories of an attribute or in the cells of a table of cross classification. Hypotheses about the distribution of individuals within the categories of one attribute, or about the association of attributes and categories in a table of cross classification, are usually tested with the chi-square test. Descriptive statistics used include the mode (the category which includes the largest number of individuals) and various measures of association, the commonest of which is the contingency coefficient. These statistics remain invariant when the order of the categories of each attribute is rearranged and when the numbers that identify the categories are changed. If the categories of an attribute have a natural order, this order is irrelevant to nominal scaling, and nominal-scale statistics do not use the information supplied by any such order. [See Counted Data; Statistics, Descriptive, article on Association; Survey Analysis, article on The Analysis Of Attribute Data.]
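As a rough illustration of these nominal-scale statistics (not drawn from the original text, and using made-up counts), the chi-square test of association and the contingency coefficient can be computed as follows.

```python
# Hypothetical 2 x 3 cross classification (e.g., sex by occupation category).
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[30, 45, 25],
                  [50, 35, 15]])

chi2, p_value, dof, expected = chi2_contingency(table)

# Contingency coefficient C = sqrt(chi2 / (chi2 + N)), the measure of
# association mentioned above.
n = table.sum()
contingency_coeff = np.sqrt(chi2 / (chi2 + n))
print(chi2, p_value, contingency_coeff)
```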
Ordinal scales
An ordinal scale is defined by a set of postulates. We first introduce the symbol “>,” and define it broadly as almost any asymmetrical relation; it may mean “greater than,” “follows,” “older than,” “scratches,” “pecks,” “ancestor of,” etc. We also define ≠ to mean “is unequal to” or “is different from.” Then, given any class of elements (say, a, b,c,d,…) the relation > must obey the following postulates:
If a ≠ b, then a > b or b > a.
If a > b, then a ≠ b.
If a > b and b > c, then a> c.
If a, b, c, …, are the positive integers and > means “greater than,” these postulates define the ordinal numerals. If a, b, c, …, are chicks, and > means “pecks,” the postulates define the behavioral conditions under which a “pecking order” exists. If a,b,c, …, are minerals and > means “scratches,” conformity to the postulates determines the existence of a unique scratching order. If it is suggested that aggression implies pecking or that hardness implies scratching, conformity of the behavior of chicks or minerals to the postulates indicates whether aggression or hardness, each assumed to be a single variable, can be measured on an ordinal scale in terms of observations of pecking or by scratching experiments.
Where the variable underlying the presumed order is discrete or where it is possible for two or more individuals to have identical or indistinguishably different amounts of this variable, we must enlarge the concept of order to include equality. To do so we define ≯ (“not greater than,” “does not follow,” “not older than,” “does not scratch,” “does not peck,” “is not an ancestor of,” etc.) and add the postulate:
If a ≯ b and b ≯ a, then a = b.
For example, two chicks are equally aggressive if neither one pecks the other, and two minerals are equally hard if neither scratches the other.
For an ordinal scale, the relation between the ordinal numerals and the attribute they measure is monotonic. If one individual has more of the attribute than another does, he must have a higher rank in any group of which they are both members. But differences in this attribute are not necessarily associated with proportionately equal differences in ordinal numerals. The measurements may in fact be replaced by their squares or their logarithms (if the measurements are positive), their cube roots, or any one of many other monotonic functions without altering their ordinal positions in the series.
Medians and percentiles of ordinal distributions are themselves seldom of interest because each of them merely designates an individual, or two individuals adjacent in the order. Hypotheses involving ordinal scale data may be tested by the Wilcoxon-Mann-Whitney, Kruskal-Wallis, Siegel-Tukey, and other procedures. (Interval-scale data are often converted by ranking into ordinal scales to avoid the assumption of normality.) Both the Kendall and the Spearman rank correlation procedures apply to ordinal scales. [See Nonparametric Statistics, articles on Order Statistics and Ranking Methods.]
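As a brief illustration (hypothetical data, not part of the original article), the rank correlations and the Wilcoxon-Mann-Whitney test mentioned above can be computed with standard library routines.

```python
# Ordinal-scale statistics on hypothetical data: Spearman and Kendall rank
# correlations between two rankings, and a Wilcoxon-Mann-Whitney test
# comparing two independent groups.
from scipy.stats import spearmanr, kendalltau, mannwhitneyu

ranks_judge_1 = [1, 2, 3, 4, 5, 6]
ranks_judge_2 = [2, 1, 3, 5, 4, 6]

rho, p_rho = spearmanr(ranks_judge_1, ranks_judge_2)
tau, p_tau = kendalltau(ranks_judge_1, ranks_judge_2)

group_a = [3, 5, 6, 9, 12]
group_b = [1, 2, 4, 7, 8]
u_stat, p_u = mannwhitneyu(group_a, group_b, alternative="two-sided")

print(rho, tau, u_stat, p_u)
```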
Quasi-ordinal scales. Rankings made by judges are of course subject to errors. These are of two types: within-judge errors, where judges fail to discriminate consistently, and between-judge errors, where judges disagree in their rankings.
Within-judge discrimination can be evaluated, and the errors partially “averaged out,” by the method of paired comparisons. A set of judgments such as a > b, b > c, c > a, which violates the third postulate, is termed a circular triad. If a judge’s rankings provide some but not too many circular triads, a “best” single ranking is obtained by assigning a score of 1 to an individual each time he is judged better than another, summing the scores for each individual, and ranking the sums. This procedure yields a quasi-ordinal scale. A true ordinal scale exists only if there are no circular triads.
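The scoring just described can be sketched as follows; the judgments are hypothetical and the example is not taken from the original article.

```python
# Paired-comparison scoring: wins[i][j] = 1 means object i was judged
# "better" than object j. Each object's score is its number of wins, and
# the "best" single ranking orders the objects by these sums.
import numpy as np

labels = ["a", "b", "c", "d"]
wins = np.array([
    [0, 1, 0, 1],   # a > b, a > d (but c > a: part of a circular triad)
    [0, 0, 1, 1],   # b > c, b > d
    [1, 0, 0, 1],   # c > a, c > d
    [0, 0, 0, 0],   # d is judged worse than every other object
])

scores = wins.sum(axis=1)
order = [labels[i] for i in np.argsort(-scores)]
print(dict(zip(labels, scores)), order)

# Count circular triads (i > j, j > k, k > i); each one shows up in three
# cyclic orders, so the raw count is divided by 3.
n = len(labels)
cycles = sum(wins[i, j] * wins[j, k] * wins[k, i]
             for i in range(n) for j in range(n) for k in range(n))
print("circular triads:", cycles // 3)
```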
Between-judge agreement can be estimated by the coefficient of concordance or the average rank correlation, when several judges have ranked the same set of individuals on the same attribute. A true ordinal scale exists if all judges assign the same ranks to all individuals, so that the coefficient of concordance and the average rank correlation are both unity. If these coefficients are not unity but are still fairly high, the “best” single ranking of the individuals is obtained by summing all ranks assigned to each individual by the several judges, and then ranking these sums. This case is again a quasi-ordinal scale: an ordinal scale affected by some unreliability.
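A minimal sketch of these between-judge statistics follows, again with hypothetical rankings; the coefficient of concordance is computed by the usual formula for untied ranks.

```python
# Kendall's coefficient of concordance W for several judges ranking the
# same individuals, and the "best" single ranking from the rank sums.
import numpy as np

# rows = judges, columns = individuals; entries are ranks (1 = best)
ranks = np.array([
    [1, 2, 3, 4, 5],
    [2, 1, 3, 5, 4],
    [1, 3, 2, 4, 5],
])
m, n = ranks.shape

rank_sums = ranks.sum(axis=0)
# W = 12 S / (m^2 (n^3 - n)), where S is the sum of squared deviations of
# the rank sums from their mean (no-ties case).
s = ((rank_sums - rank_sums.mean()) ** 2).sum()
w = 12 * s / (m ** 2 * (n ** 3 - n))

best_order = np.argsort(rank_sums) + 1   # individuals, best to worst (1-indexed)
print(w, rank_sums, best_order)
```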
All statistical procedures which apply to ordinal scales apply also to quasi-ordinal scales, with the reservation that the results will be attenuated by scale unreliability.
Interval scales
An interval scale has equal units, but its zero point is arbitrary. Two interval scales measuring the same variable may use different units as well as different arbitrary zero points, but within each scale the units are equal. The classic examples are the Fahrenheit and centigrade scales of temperature used in physics.
For an interval scale, the relation between the scale numbers and the magnitudes of the attribute measured is not only monotonic; it is linear. Hence, if two interval scales measure the same variable in different units and from different zero points, the relation between them must be linear also. The general linear equation, for variables X and Y, is of the form Y = a + bX. Thus, for the Fahrenheit and centigrade scales we have
°F = 32 + 1.8°C; °C = .5556°F - 17.7778.
On interval scales, differences in actual magnitudes are reflected by proportional differences in scale units. Thus, if two temperatures are 18 units apart on the Fahrenheit scale, they will be 10 units apart on the centigrade scale, no matter where they are located on these scales (e.g., 0° to 18°F or 200° to 218°F).
Interval-scale units may be added and subtracted, but they may not be multiplied or divided. We cannot say that a Fahrenheit temperature of 64° is twice as hot as one of 32°. Almost all ordinary statistical procedures may be applied to interval-scale measurements, with the reservation that measures of central tendency must be interpreted as depending upon the arbitrary zero points. Almost all other statistical procedures are functions of deviations of the measures from their respective means and, hence, involve only addition and subtraction of the scale units. [See Multivariate Analysis, articles on Correlation; Statistics, Descriptive, article on Location And Dispersion.]
Quasi-interval scales. Suppose that to each successive unit of an interval scale we apply a random stretch or compression. In this context randomness means that if the actual length of each unit is plotted against its ordinal position, there would be no trend of any sort: the larger units would not occur more frequently at one end, at both ends, in the middle, or in either or both intermediate regions. If the largest unit is small compared with the range of the variable in the group measured, we have a quasi-interval scale. All ordinary statistical procedures apply to quasi-interval scales, with the reservation that they have reduced reliability: errors of measurement are built into the scale units.
Ratio scales
A ratio scale has the properties of an interval scale and in addition a true zero point, the scale-value zero meaning absence of any amount of the variable measured. Classic examples from physics are length and weight. The relation between the actual quantities and the scale values is linear, and the equations, moreover, have no constants and are of the form Y = bX. If two ratio scales measure the same variable in different units, any measurement on one scale is the same multiple or fraction of the corresponding measurement on the other scale. Thus, if the length of an object is X inches, its length is also 2.54X centimeters. And if the length of another object is Y centimeters, its length is also .3937Y inches.
Ratio-scale units may be multiplied and divided as well as added and subtracted. A man who is six feet high is exactly twice as tall as is one who is three feet high. Statistics applicable to ratio scales include geometric means, harmonic means, coefficients of variation, standard scores, and most of the common transformations (square root, logarithmic, arcsine, etc.) used to achieve improved approximations to normality of data distributions, homogeneity of variances, and independence of sampling distributions from unknown parameters. [See Statistical Analysis, Special Problems Of, article on Transformations Of Data.]
Quasi-ratio scales. A quasi-ratio scale is a ratio scale with random stretches and compressions applied to its units, in much the same way described for quasi-interval scales. All of the statistics appropriate to ratio scales apply also to quasi-ratio scales, with the reservation of reduced reliability.
Operational definitions
For many of the variables of the physical sciences, and some of the variables of the social sciences, the variable itself is defined by the operations used in measuring it or in constructing the measuring instrument. Thus length can be defined as what is measured by a ruler or yardstick. If we have first a “standard inch,” say two scratches on a piece of metal (originally, according to story, the length of the first joint of a king’s index finger), we can lay off successive inches on a stick by using a pair of dividers. For smaller units we can subdivide the inch by successive halving, using the compass and straightedge of classical Euclidean geometry. Height, in turn, may be defined as length measured vertically, with vertical defined by a weight hanging motionless on a string.
Psychological and social variables, on the other hand, can less often be defined in such direct operational terms. For example, psychophysics and scaling have as major concerns the reduction of sensory, perceptual, or judgmental data to interval scale, quasi-interval scale, ratio scale, or quasi-ratio scale form. [See Psychophysics; Scaling.]
Test scores as measurements
In the context of this discussion, a test is usually simply a set of questions, often with alternative answers, printed in a paper booklet, together with the instructions given by an examiner; a test performance is whatever an examinee does in the test situation. The record of his test performance consists of his marks on the test booklet or on a separate answer sheet. If he is well motivated and has understood the examiner’s instructions, we assume that the record of his test performance reflects his knowledge of and ability in the field covered by the test questions. For the simpler types of items, with a simple scoring procedure, we credit him with the score +1 for each item correctly marked and 0 for each item incorrectly marked. We know that the organization of knowledge of an area in a human mind is complex and that an examinee’s answer to one question is, in consequence, not independent of his answers to other questions. His score on the test is supposed to represent his total knowledge of the field represented by all the items (questions and alternative answers). In order to justify using such a score, we must be able to make at least two assumptions: (1) that knowledge of the area tested is in some sense cumulative along a linear dimension and (2) that there is at least a rough one-to-one relation between the amount of knowledge that each individual possesses and his test score.
The scores on a test, then, should form a quasi-ordinal scale. But suppose we have a 100-item, five-alternative, multiple-choice test with items arranged in order of difficulty from very easy to extremely hard. Individual A gets the first 50 right and a random one-fifth of the remainder. Individual B gets all the odd-numbered items right and a random one-fifth of all the even-numbered items. If a simple scoring formula that credits each correct response with a score of +1 is used, the score of each is 60. Yet most persons would tend to say that individual B has more of the ability measured by the test than has individual A. These cases are of course extreme, but in general we tend to attribute higher ability to an individual who gets more hard items right, even though he misses several easy items, than to one who gets very few hard items right but attains the same score by getting more easy items right. Thorndike (1926) distinguishes between altitude of intellect and range of intellect (at any given altitude) as two separate but correlated variables. He discusses ways of measuring them separately, but his suggestions have so far had little impact on the mainstream of psychometric practice. [Additional discussion of test scores as measurements is provided in the section “Item analysis,” subsection on “Indexes of difficulty”; see also Thorndike.]
Correction for guessing
When an objective test is given to an individual, the immediate aim is to assess his knowledge of the field represented by the test items or, more generally, his ability to perform operations of the types specified by these items. But with true-false, multiple-choice, matching, and other recognition-type items, it is also possible for the examinee to mark right answers by guessing. It is known that the guessing tendency is a variable on which large individual differences exist, and the logical purpose of the correction for guessing is to reduce or eliminate the expected advantage of the examinee who guesses blindly instead of omitting items about which he knows nothing. [See Response Sets.]
The earliest derivations were based on the all-or-none assumption, which holds that an examinee either knows the right answer to a given item with certainty and marks it correctly, or else he knows nothing whatever about it and the mark represents a blind guess, with probability 1/a of being right (where a is the number of alternatives, only one of which is correct). Under this assumption we infer that when the examinee has marked a − 1 items incorrectly, there were really a items whose answers he did not know, and that he guessed right on one of them and wrong on the other a − 1. Hence, for every a − 1 wrong answers we deduct one from the number of right answers for the one he presumably got right by guessing. The correction formula is then
S = R − W/(a − 1),
where S is the corrected score (the number of items whose answers the examinee presumably knew), R is the number right, and W is the number wrong. It is assumed that this formula will be correct on the average, although in any particular case an examinee may guess the correct answers to more or fewer than W/(a − 1) items.
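As a small worked illustration (with hypothetical numbers), the correction can be applied directly:

```python
# Correction for guessing, S = R - W / (a - 1), under the all-or-none
# assumption described above. Numbers are hypothetical.
def corrected_score(right, wrong, alternatives):
    """Estimated number of items whose answers the examinee actually knew."""
    return right - wrong / (alternatives - 1)

# An examinee marks 60 items right and 20 wrong on a five-alternative
# test, omitting the rest.
print(corrected_score(right=60, wrong=20, alternatives=5))   # 55.0
```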
There is empirical evidence (Lord 1964) that correction for guessing corrects fairly well for high guessing tendency, but not so well for extreme caution, since the examinee is credited with zero knowledge for every item he omits. If an examinee omits items about which he has some, but not complete, knowledge, he will still be penalized. Hence, instructions should emphasize the point that an item should be omitted only if an answer would be a pure guess. If an examinee has a “hunch,” he should always play it; and if he can eliminate one alternative, he should always guess among the remainder. This is a matter of ethics applying to all tests whose items have “right” answers; an examinee should never be able to increase his most probable score by disobeying the examiner’s instructions.
A timed power test should begin with easy items and continue to items too hard for the best examinee. The time limit should be so generous that every examinee can continue to the point where his knowledge becomes substantially 0 for every remaining item. In this case the correction formula will cancel, on the average, the advantage that would otherwise accrue to those examinees who, near the end of the test period, give random responses to all remaining items.
There are only three conditions under which the correction for guessing need not be used: (1) there is no time limit, and examinees are instructed to mark every item; (2) the time limit is generous, examinees are instructed to mark every item as they come to it, and a minute or two before the end of the session the examiner instructs them to record random responses to all remaining items; (3) the test is a pure speed test, with no item having any appreciable difficulty for any examinee. In this case, errors occur only when an examinee works too fast.
Item analysis
In the construction of standardized tests, item analysis consists of the set of procedures by which the items are pretested for difficulty and discrimination by giving them in an experimental edition to a group of examinees fairly representative of the target population for the test, computing an index of difficulty and an index of discrimination for each item, and retaining for the final test those items having the desired properties in greatest degree. Difficulty refers to how hard an item is, to how readily it can be answered correctly. A test item possesses discrimination to the extent that “superior” examinees give the right answer to it oftener than do “inferior” examinees. “Superior” and “inferior” are usually defined by total scores on the experimental edition itself. This is termed internal consistency analysis. When the less discriminating items are eliminated, the test becomes more homogeneous. A perfectly homogeneous test is one in which the function or combination of related functions measured by the whole test is also measured by every item. A test may, however, be designed for a specific use—e.g., to predict college freshman grades—in which case “superior” and “inferior” may be defined externally by the freshman grade-point average.
Wherever possible, the experimental edition is administered to the item-analysis sample without time limit and with instructions to the examinees to mark every item. If the experimental session has a time limit, the subset of examinees who mark the last few items form a biased subsample, and there is no satisfactory way to correct for this bias (Wesman 1949).
The experimental group for an item analysis should be reasonably representative of the target population, and particularly representative with regard to age, school grade(s), sex, socioeconomic status, residence (city or country), and any other variables that might reasonably be expected to correlate substantially with total scores. Its range of ability should be as great as that of the target population. Beyond this, it does not have to be as precisely representative as does a norms sample. Item analyses based on a group from a single community (e.g., a city and the surrounding countryside) are often quite satisfactory if this community is representative of the target population on all of the associated variables.
There are two major experimental designs for item analysis. The first is called the upper-and-lower groups (ULG) design. On the basis of the total scores (or some external criterion scores) an upper group and a lower group are selected: usually the upper and lower 27 per cent of the whole sample, since this percentage is optimal. With the ULG design, the only information about the total scores (or the external criterion scores) that is used is the subgroup membership. Hence, this design calls for large experimental samples.
In the second design all the information in the data is used: for each item the distribution of total scores (or external criterion scores) of those who mark it correctly is compared with the distribution of those who mark it incorrectly. This is the item-total score (ITS) design. Here the sample size can be smaller.
Indexes of difficulty
With either design, a quasi-ordinal index of the difficulty of an item is provided by the per cent of the total sample who respond correctly. With the ULG design, a very slightly biased estimate is given by the average per cent correct in the upper and lower groups; a correction for this bias is found in the tables compiled by Fan (1952). For many purposes, however, an index of difficulty with units which in some sense form a quasi-interval scale is desired. With free-choice items, and the assumption that the underlying ability is normally distributed in the experimental sample, the normal deviate corresponding to the per cent correct yields a quasi-interval scale of difficulty. But under this assumption the distribution of difficulty of a recognition-type item will be skewed, with amount of skewness depending on the number of alternatives. The precise form of this distribution is not known. Common practice involves discarding items with difficulties not significantly higher than chance, even if they show high discrimination; redefining per cent correct as per cent correct above chance, p′ = (p − 1/a)/(1 − 1/a), where p and p′ are now proportions rather than percentages; and treating as a very rough quasi-interval scale the normal deviates corresponding to these adjusted proportions.
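A minimal sketch of this adjustment, with hypothetical values and one common sign convention for the normal-deviate index, follows.

```python
# Chance-adjusted proportion correct p' = (p - 1/a) / (1 - 1/a) and the
# corresponding normal-deviate difficulty index. Values are hypothetical.
from scipy.stats import norm

def adjusted_proportion(p, alternatives):
    return (p - 1 / alternatives) / (1 - 1 / alternatives)

p = 0.60        # proportion of the sample answering the item correctly
a = 5           # number of alternatives
p_adj = adjusted_proportion(p, a)     # 0.50
difficulty = -norm.ppf(p_adj)         # larger values correspond to harder items
print(p_adj, difficulty)
```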
Another method first replaces the raw total scores (or external criterion scores) with normalized standard scores to form a quasi-interval score scale. The distributions of these normalized standard scores for those who pass and fail the item are then formed and smoothed, and the difficulty of the item is taken as the score corresponding to the point of intersection of the two distributions. This is strictly an ITS procedure.
When item difficulties have a rectangular distribution ranging from easy for the least able examinee to hard for the most able, and when items are all equally discriminating on the average, the distribution of the test scores will be approximately the same as the distribution of the ability which underlies them; and these scores will form a quasi-interval scale. Almost the only tests which actually are so constructed are those for the measurement of general intelligence, such as the Stanford-Binet. Most tests have roughly normal, or at best mildly flat-topped, distributions of item difficulties. When applied to a group for which the mean item difficulty corresponds to the mean ability level and in which the ability is approximately normally distributed, the resulting score distribution tends to be flat-topped. Empirical data support this theoretical conclusion.
Tests constructed with all items of almost equal difficulty are useful for selection purposes; they have maximum reliability at the given ability level. With a rectangular distribution of item difficulties, a test is equally reliable at all scale levels, but its reliability at any one level is relatively low. With a normal or near-normal distribution of item difficulties, the reliability is at a maximum in the region of the modal difficulty and decreases toward the tails, but this decrease is less marked than it is in the case of a test whose items are all equally difficult.
Although scores on tests with near-normal distributions of item difficulties are frequently treated as forming quasi-interval scales, they should more properly be treated as forming only quasi-ordinal scales. All the strictures against treating percentile ranks as interval scales apply to such raw-score scales with only slightly diminished force.
Indexes of discrimination in ULG design
For some purposes we need only to eliminate items for which the number of right answers is not significantly greater in the upper group than in the lower group, using the chi-square test of association. This procedure is often used in the selection of items for an achievement test.
In other cases we may wish, say, to select the 100 “best” items from an experimental test of 150. Here “best” implies a quasi-ordinal index of discrimination for each item. Widespread-tails tetrachoric correlations are often employed (Fan 1952; Flanagan 1939). The correlation indexes are statistically independent of the item difficulties. Where we may need quasi-interval scales, the Fisher z′-transformation is commonly applied to the widespread-tails tetrachoric correlation, yielding at least a crude approximation to an interval scale.
A less common procedure is to use the simple difference between the per cents correct in upper and lower groups as the index of discrimination. This index is precisely the percentage of cases in which the item will discriminate correctly between a member of the upper group and a member of the lower group (Findley 1956).
Indexes of discrimination in ITS design
With the ITS design, a t-test may be used to test the hypothesis that the mean total (or external criterion) scores of those who do and do not mark the right answer to the item are equal. If we cannot assume normality of the score distribution, we can replace the raw scores with their ranks and use the two-group Wilcoxon-Mann-Whitney test with only slight loss of efficiency. [See Linear Hypotheses, article on Analysis Of Variance, for a discussion of the t-test; see Nonparametric Statistics, article on Ranking Methods, for a discussion of the Wilcoxon-Mann-Whitney test.]
To obtain an ordinal index of discrimination, the biserial, point-biserial, or Brogden biserial correlation (1949) between the item and the total (or external criterion) scores may be used. A crude approximation to interval scaling is given by applying the Fisher z′-transformation to the biserial or Brogden biserial correlations. [See Multivariate Analysis, articles on Correlation.]
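The point-biserial index can be sketched as follows; the item and total scores are hypothetical, and the Fisher z′-transformation is applied as a rough interval-scale index.

```python
# Point-biserial correlation between a dichotomous item score (1 = right,
# 0 = wrong) and the total test score, with the Fisher z'-transformation.
import numpy as np
from scipy.stats import pointbiserialr

item = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 0])
total = np.array([55, 40, 60, 58, 38, 62, 45, 57, 65, 35])

r_pb, p_value = pointbiserialr(item, total)
fisher_z = np.arctanh(r_pb)
print(r_pb, fisher_z)
```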
Item analysis with wide-range groups
Some tests are designed to be used over several consecutive ages or grades, and the mean growth of the underlying variable may be assumed to be roughly linear. In such cases an item may be very hard at the lowest level but very easy at the highest, and highly discriminating at one level but quite undiscriminating at another. In such cases we may plot for each item the per cent correct at successive ages or grades, or the intersections of the score distributions for those who pass and fail the item at each age or grade. Before using this latter procedure, the raw scores may be scaled by first assigning normalized standard scores at each age or grade, assuming that the underlying variable is normally distributed at each level, and then combining them into a single scale by adjusting for the mean differences from age to age or grade to grade. An item is then retained if it shows a regular increase from age to age or grade to grade, or if it shows a large increase from any one age or grade to the next and no significant decrease at any other level. The scale difficulty of each item is the score level at which the per cent correct is 50, or for recognition items, the score level at which p′, defined as above, is .50.
Two-criterion item analysis
When a test is designed to predict a single external criterion, such as freshman grade-point average, success in a technical training course, or proficiency on a given job, we can do somewhat better than merely to select items on the basis of their correlations with the criterion measure. The original item pool for the experimental edition is deliberately made complex, in the hope that items of different types will assess different aspects of the criterion performance. The best subset of items will then be one in which the items have relatively high correlations with the criterion and relatively low correlations with the total score on the experimental edition. Methods of item selection based on this principle have been discussed by Gulliksen (1950, chapter 21) and by Horst (1936), using in both cases the ITS design.
Inventory items
Aptitude, interest, attitude, and personality inventories usually measure several distinct traits or include items of considerable diversity, subsets of which are scored to indicate similarity between the examinee’s answer pattern and those of a number of defined groups. The items are usually single words or, more commonly, short statements, and the examinee marks them as applicable or inapplicable to him, true or false, liked or disliked, or statements with which he agrees or disagrees. The scoring may be dichotomous (like or dislike), trichotomous (Yes, ?, No), or on a scale of more than three points (agree strongly, agree moderately, uncertain, disagree moderately, disagree strongly). Often the statements are presented in pairs or triplets and the examinee indicates which he likes most and which least, or which is most applicable to him and which least applicable. The distinction between inventories based on internal analysis and those based on external criteria is a major one.
For internal analysis, the items are first allocated by judgment to preliminary subscales, often on the basis of some particular theoretical formulation. Each item is then correlated with every subscale and reallocated to the subscale with which it correlates highest; items which have low correlations with all subscales are eliminated. If the subscales are theoretically determined, all items which do not correlate higher with the subscales to which they were assigned than with any other are eliminated. If the subscales are empirically determined, new subscale scores are computed after items are reallocated, and new item-subscale correlations are obtained; this process is repeated until the subscales “stabilize.” Purely empirical subscales may also be constructed by rough factor analyses of the item data or by complete factor analyses of successive subsets of items. [See Factor Analysis.]
For a normative scale, the job is finished at this point. (All aptitude and achievement tests form normative scales.) But when the statements are presented in pairs or triplets, they form an ipsative or partly ipsative scale. For a perfectly ipsative scale, the items of each subscale must be paired in equal numbers with the items of every other subscale, and only differences among subscale scores are interpretable. The California Test of Personality and the Guilford and Guilford-Martin inventories are examples of normative scales. The Edwards Personal Preference Schedule is perfectly ipsative, and the Kuder Preference Record is partly ipsative.
In filling out inventories of these types, whose items do not have right or wrong answers, we want examinees to be honest and accurate. Normative inventories are easily fakeable; ipsative inventories somewhat less so. Response sets also affect inventory scores much more than they do aptitude and achievement test scores. Most of the better inventories therefore have special scales to detect faking and to correct for various types of response sets. In forming pairs and triplets, efforts are made to equalize within-set social desirability or general popularity, while each statement represents a different subscale. [See Achievement Testing; Aptitude Testing; Personality Measurement, articles on Personality Inventories and The Minnesota Multiphasic Personality Inventory; Response Sets; Vocational Interest Testing.]
Inventories constructed on the basis of external criteria use a base group, usually large (“normal” individuals, “normal” individuals of given sex, professional men in general, high school seniors in general, high school seniors of given sex, and the like), and a number of special groups (hospital or clinic patients with the same clear diagnosis, or men or women in a given occupation). An answer (alternative) is scored for a special group if the people in that group mark it significantly more often (scored positively) or significantly less often (scored negatively) than do the people in the base group. In some inventories the more highly significant or highly correlated answers are given greater positive or negative weights than are the less significant or less highly correlated answers. Inventories of this type are almost always normative, and new subscales can be developed whenever new special groups can be identified and tested. The outstanding examples are the Strong Vocational Interest Blank and the Minnesota Multiphasic Personality Inventory.
In inventories of this type, the same item may be scored for several subscales. In consequence there are inter-key correlations, and the reliabilities of differences between pairs of subscale scores vary with the numbers of common items. A further consequence is that general subscales based on factor analyses of individual subscale intercorrelations are difficult to evaluate, since the individual subscale scores are not experimentally independent. Similar difficulties arise in the interpretation of factor analyses of ipsative and partly ipsative scale scores.
Reliability
Reliability is commonly defined as the accuracy with which a test measures whatever it does measure. In terms of the previous discussion, it might be defined in some cases as a measure of how closely a quasi-ordinal or quasi-interval scale, based on summation of item scores, approximates a true ordinal or interval scale. The following treatment assumes quasi-interval scales, since reliability theories based entirely on the allowable operations of ordinal arithmetic, which do not define the concepts of variance and standard deviation, have not been worked out. However, definitions and results based on correlations probably apply to quasi-ordinal scales if the correlations are Spearman rank correlations. [See Multivariate Analysis, articles on Correlation.]
The raw score of an individual on a test may be thought of as consisting of the sum of two components: a true score representing his real ability or achievement or interest level or trait level, and an error of measurement. Errors of measurement are of two major types. One type reflects the limitation of a test having a finite number of items. Using ability and aptitude tests as the basis for discussion, the individual’s true score would be his score on a test consisting of all items “such as” the items of the given test. On the finite test he may just happen to know the right answers to a greater or lesser proportion of the items than the proportion representing his true score. Errors of this type are termed inconsistency errors. A second type of error reflects the fact that the working ability of an individual fluctuates about his true ability. On some occasions he can “outdo himself”: his working ability exceeds his true ability. On other occasions he cannot “do justice to himself”: his working ability is below his true ability. Working ability fluctuates about true ability as a result of variations in such things as motivation, inhibitory processes, physical well-being, and external events that are reflected in variation in concentration, cogency of reasoning, access to memory, and the like. Such fluctuations occur in irregular cycles lasting from a second or two to several months. Errors of this type are termed instability errors.
If a second test samples the same universe of items as does the first and in the same manner (random sampling or stratified random sampling with the same strata), the two tests are termed parallel forms of the same test. Parallel forms measure the same true ability, but with different inconsistency errors.
The basic theorem which underlies all formulas of reliability, and of empirical validity as well, may be stated as follows: In a population of individuals, the errors of measurement in different tests and in different forms of the same test are uncorrelated with one another and are uncorrelated with the true scores on all tests and forms.
Coefficient and index of reliability
The reliability coefficient, R, may be defined as the ratio of the true score variance to the raw score variance; it is also the square of the correlation between the raw scores and the true scores. The index of reliability is the square root of the reliability coefficient; it is the ratio of the standard deviation of the true scores to the standard deviation of the raw scores, or the correlation between the true scores and the raw scores. These definitions are purely conceptual. They are of no computational value because the true scores cannot be measured directly.
Furthermore, where RA and RB are the reliability coefficients of the two parallel forms and ρAB is the correlation between them, it is implied in the basic theorem that ρAB = √(RA RB). If, moreover, as is usually the case, the two forms are equally reliable, RA = RB = ρAB; i.e., the correlation between the two forms is the reliability coefficient of each of them. When we estimate ρAB by computing rAB, the correlation in a sample, the estimate is not unbiased, but the bias is usually small if the sample is reasonably large.
Consistency coefficient. If two equally reliable parallel forms of a test are administered simultaneously (e.g., by merging them as odd and even items in the same test booklet), the reliability coefficient becomes a consistency coefficient, since instability errors affect both forms equally.
The split-half correlation (e.g., the correlation between odd and even items) provides the consistency coefficient of each of the half-tests. The consistency coefficient of the whole test, as estimated from the sample, is then derived from the Spearman-Brown formula:
R = 2rAB/(1 + rAB),

where rAB is the correlation between the half-tests. The more generalized Spearman-Brown formula is

Rn = nrAB/[1 + (n − 1)rAB],

where rAB is again the correlation between the half-tests and Rn is the consistency coefficient of a parallel form n times as long as one half-test. In deriving the Spearman-Brown formula, we must assume that the half-tests are equally variable as well as equally reliable, but these requirements are not very stringent. Kuder and Richardson (1937) also present several formulas for the consistency of one form of a homogeneous test. Their most important formula (formula 20) was generalized and discussed at some length by Cronbach (1951).
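As a brief numerical sketch (simulated item responses, not data from the original article), the split-half consistency with the Spearman-Brown step, and coefficient alpha as the generalization discussed by Cronbach (1951), can be computed as follows.

```python
# Split-half consistency stepped up by the Spearman-Brown formula, and
# coefficient alpha, on simulated dichotomous item data.
import numpy as np

rng = np.random.default_rng(0)
n_examinees, n_items = 200, 40
ability = rng.normal(size=(n_examinees, 1))
difficulty = rng.normal(size=(1, n_items))
prob_correct = 1 / (1 + np.exp(-(ability - difficulty)))
items = (rng.random((n_examinees, n_items)) < prob_correct).astype(int)

odd_half = items[:, 0::2].sum(axis=1)
even_half = items[:, 1::2].sum(axis=1)
r_half = np.corrcoef(odd_half, even_half)[0, 1]
consistency = 2 * r_half / (1 + r_half)          # Spearman-Brown, n = 2

k = n_items
item_vars = items.var(axis=0, ddof=1)
total_var = items.sum(axis=1).var(ddof=1)
alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)

print(consistency, alpha)
```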
Interform reliability coefficient. If two equally reliable parallel forms of a test are administered to the same group of examinees at two different times, the correlation between them is an interform reliability coefficient. The interform reliability is lower than the consistency because it is affected by instability errors, which increase with time. Reports of interform reliability should include the length of the time interval.
Stability coefficient. Instability errors are related to the interval between testings and are independent of the inconsistency errors. The stability coefficient may be defined as the interform reliability that would be found if both forms of the test were perfectly consistent. It may be estimated by the formula
Stability = rAB/√(CA CB),

where rAB is the interform reliability coefficient and CA and CB are the consistency coefficients of the two forms, each computed from the split-half correlation and the Spearman-Brown formula, or from the Kuder-Richardson formula 20. Its value is independent of the lengths of the two forms of the test but dependent upon the time interval separating their administration.
The increase in interform reliability resulting from increase in test length may be estimated by the formula
Rn = nrAB/[1 + (n − 1)CAB],

where rAB is the interform reliability and CAB is the consistency of each form. The two forms must be assumed equally consistent, and CAB is computed as the average of their two consistency coefficients. Then Rn is an estimate of the interform reliability of a parallel form n times as long as one of the two actual forms. If n = 2, the formula

R2 = 2rAB/(1 + CAB)

gives the interform reliability of scores on the two forms combined.
Test-retest correlation. When the same form is given to the same examinees on two different occasions, the correlation is not a stability coefficient, and it would not be a stability coefficient even if every examinee had total amnesia of the first testing (and of nothing else) on the second occasion. In addition to the quantitative fluctuations in working ability which give rise to instability errors, there are qualitative fluctuations in perceptual organization, access to memory, and reasoning-procedure patterns. In consequence, the same set of items, administered on different occasions, gives rise to different reactions; and in consequence there are still some inconsistency errors. Perseveration effects, including but not limited to memory on the second occasion of some of the responses made to particular items on the first occasion, introduce artificial consistency, in varying amounts for different examinees. In consequence, test-retest coefficients cannot be clearly interpreted in terms of reliability theory.
Standard error of measurement
If several parallel forms, all equally reliable and with identical distributions of item difficulty and discrimination, could be given simultaneously to one examinee, the standard deviation of his scores would be the standard error of measurement of one form for him. Such a standard error of measurement is an estimate of the average standard error of measurement for all members of an examinee group. The formula is
SEm = s√(1 − rAB),

where s is the standard deviation of the total scores and rAB is their consistency, computed by the split-half correlation and the Spearman-Brown formula.
The standard error of measurement may also be defined, for the whole sample or population of examinees, as the standard deviation of the inconsistency errors or the standard deviation of the differences between raw scores and the corresponding true scores. With this last definition we can also compute a standard error of measurement which includes instability errors over a given time period, by letting s represent a pooled estimate of the standard deviation of one form based on the data for both forms, and rAB an interform reliability coefficient. In this case, SEm is the standard error of measurement of one (either) form.
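A small worked example of the formula, with hypothetical values:

```python
# Standard error of measurement SEm = s * sqrt(1 - r), with a hypothetical
# score standard deviation and reliability (consistency or interform).
import math

s = 10.0     # standard deviation of total scores
r = 0.91     # reliability coefficient

se_m = s * math.sqrt(1 - r)
print(se_m)                       # 3.0
# A rough band of one SEm around an observed score of 72:
print(72 - se_m, 72 + se_m)       # 69.0 to 75.0
```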
Reliability and variability
The variance of true scores increases with the variability of ability in the group tested, while the variance of the errors of measurement remains constant or almost constant. By the variance-ratio definition of the reliability coefficient, it follows that this coefficient increases as the range of ability of the group measured increases. The reliability coefficient of a particular test is higher for a random sample of all children of a given age than for a sample of all children in a given grade, and lower for all children in a single class. When a reliability coefficient is reported, therefore, the sample on which it was computed should be described in terms which indicate as clearly as possible the range of ability of the subjects.
The formula relating variability to reliability is
s²(1 − rAB) = S²(1 − RAB), or equivalently RAB = 1 − s²(1 − rAB)/S²,

where s² and rAB are the variance and the reliability coefficient of the test for one group, and S² and RAB are the variance and reliability coefficient for another group. The group means should be similar enough to warrant the assumption that the average standard error of measurement is the same in both groups. If a test author reports rAB and s² in his manual, a user of the test need only compute S² for his group, and RAB for that group can then be computed from the formula. Note that this formula applies exactly only if the test-score units form a quasi-interval scale over the whole score range of both groups.
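A minimal sketch of this computation, with hypothetical manual and user-group values:

```python
# Reliability in a new group, assuming a constant standard error of
# measurement: R = 1 - s^2 (1 - r) / S^2. Values are hypothetical.
def reliability_in_new_group(s2, r, S2):
    return 1 - s2 * (1 - r) / S2

# The manual reports r = 0.90 with variance 100; the user's narrower
# group has variance 64.
print(reliability_in_new_group(s2=100, r=0.90, S2=64))   # 0.84375
```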
Reliability at different score levels
If the test-score units do not form a quasi-interval scale, the standard error of measurement will be different at different score levels. If two forms of the test or two halves of one form are equivalent, and the experimental sample is large enough, the standard error of measurement may be computed for any given score level. Two parallel forms or half-tests are equivalent if their joint distributions of item difficulty and item discrimination are essentially identical. In this case, their score distributions will also be essentially identical.
To compute the standard error of measurement at a given score level, we select from a large experimental sample a subgroup whose total scores on the two forms or half-tests combined are equal within fairly narrow limits. The standard error of measurement of the total scores is then the standard deviation of the differences between the half-scores. When, as is usually the case, the half-scores are based on splitting one form into equivalent halves administered simultaneously, the standard errors of measurement at different score levels are based only on inconsistency errors.
The reliability coefficient at a given score level, still referred to the variability of the whole group, is given by
R = 1 − SEm²/s²,

where SEm is defined as above and s² is the total-score variance of the whole group.
Comparability
Two forms of a test, or two tests measuring different true-score functions, are termed comparable if and only if their units of measurement are equal. If the units do not form quasi-interval scales, they can be made comparable only if their score distributions are of the same shape and their standard errors of measurement are proportional at all score levels. Only equivalent forms have comparable raw-score units.
If two different tests have proportionally similar joint distributions of item difficulties and discriminations, they will meet these conditions. Meaningful interpretations of profiles of scores on different tests can be made only if the scores are comparable.
Validity
Test validity has to do with what a test measures and how well it measures what it is intended to measure, or what it is used to measure in any particular application if it is a multiple-use test.
Content validity
Content validity applies mainly to achievement tests, where the questions themselves define the function or combination of related functions measured and there is no external criterion of achievement with which the scores can be compared. The test developer should provide a detailed outline of both the topics and the processes (such as information, comprehension, application, analysis, synthesis, evaluation, etc.) that the test measures. A more detailed list of processes, with illustrative test questions from several fields, is given in Taxonomy of Educational Objectives (1963-1964). The item numbers of the test items are then entered in the cells of the outline, along with their indexes of difficulty and discrimination.
Evaluations of content validity are essentially subjective. The prospective user of the test may agree or disagree to a greater or lesser extent with the outline or the basis on which it was constructed, with the allocation of items to topics and processes, or with the author’s classification of some of the items. If all such evaluations are positive, the test’s validity is equal, for all practical purposes, to its interform reliability over some reasonable time period.
In constructing an achievement test, item analysis ordinarily consists only of the elimination of nondiscriminating items. If the test is to yield a single score, the various topics and processes must be sufficiently homogeneous to permit every item to correlate positively and significantly with the total score. All further elimination occurs in the balancing of the item contents against the requirements of the topic-by-process outline. In discussing school achievement tests, content validity is often termed curricular validity.
Empirical validity
Empirical validity is concerned with how well a test, either alone or in combination with others, measures what it is used to measure in some particular application. The empirical validity of a test is evaluated in terms of its correlation with an external criterion measure: an experimentally independent assessment of the trait or trait complex to be predicted. The term “prediction” is used here, without regard to time, to designate any estimate made from a regression equation, expectancy table, or successive-hurdles procedure. The term “forecast” will be used when we explicitly predict a future criterion. Empirical validity is also termed statistical validity and criterion validity.
There are two basically different types of criteria. The first may be termed sui generis criteria, criteria that exist without any special effort made to predict them. Examples include persistence in college, success or failure in a training course, dollar volume of sales, years of service in a company, and salary level. The unreliability of the criterion measure sets a natural upper limit for the validity of any predictor. The validity of a predictor or predictor battery is simply its correlation or multiple correlation with the criterion. We term such a correlation an index of raw validity.
The second type of criteria may be termed constructed criteria; these are developed on the basis of a trait concept such as academic ability, job proficiency, or sales accomplishment. For academic achievement such a criterion might be grade-point average in academic subjects only. For job proficiency it might be based on quantity of output, quality of output, material spoilage, and an estimate of cooperation with other workers. For sales accomplishment it might be based on number of sales, dollar volume, new customers added, and an estimate of the difficulty of the territory. In any event, it must be accepted as essentially an operational definition of the trait concept and, hence, intrinsically content-valid. And since the error of measurement is no part of the operational definition of a trait concept, it is evident that we should predict true criterion scores rather than raw scores.
Concurrent and forecast validity. We can recognize two types of assessments of true validity: concurrent true validity and forecast true validity. In the first case, the criterion measure is usually one which is expensive or difficult to obtain, and the predictor is designed to be a substitute measure for it. In this case the predictor test or battery should be administered at the middle of the time interval over which the criterion behavior is observed; otherwise, instability errors will distort validity estimates.
For forecast true validity, the predictor test or battery is administered at some “natural” time: at or shortly before college entrance or admission to training or initial employment, and the criterion data should cover the later time period over which they will be most valid in terms of the criterion trait concept. In this case, the instability errors resulting from the earlier administration of the predictor test or battery are intrinsic to the prediction enterprise.
To obtain a quick rough estimate of forecast value, an investigator often tests present employees, students, or trainees at essentially the same time that the criterion data are obtained. This procedure should not be termed concurrent validation but, rather, something like retroactive validation.
Test selection and cross-validation. A common type of empirical validation study consists in administering to the experimental sample more predictors than are to be retained for the final battery. The latter then consists of that subset of predictors, of manageable length, whose multiple correlation most nearly approximates that of the whole experimental battery. Predictors commonly include scored biographical inventories, reference rating scales, and interview rating scales as well as tests.
When a subset of predictors is selected by regression procedures, its multiple correlation with the criterion is inflated: sampling error determines in part which predictors will be selected and what their weights will be. In cross-validation, the reduced predictor battery is applied to a second criterion group, using the weights developed from the first group. The aggregate correlation in the second group (now no longer a multiple correlation) is an unbiased estimate of the battery validity. In estimating forecast true validity, two criterion measures for each examinee are required only in the cross-validation sample. [See Linear Hypotheses, article on Regression.]
The same situation arises in even more exaggerated form when predictor items are selected on the basis of their correlations with an external criterion measure. Each predictor requires a separate item-analysis sample. A different sample is required to determine predictor weights. And a still different sample is required to estimate the validity of the whole predictor battery.
Various split-sample methods have been devised to use limited data more effectively. Thus, test selection may be carried out on two parallel samples, keeping finally the subset of tests selected by both samples. The validity of the battery is estimated in each sample by using the weights from the other. The average of the two validity indexes is then a lower bound for the battery validity when the weights used are the averages of the weights from the two samples.
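The basic cross-validation idea can be sketched as follows; the predictors, criterion, and sample sizes are simulated, and the regression step uses ordinary least squares.

```python
# Weights are estimated in a derivation sample and the shrunken validity
# is checked in a second sample, as in the cross-validation described above.
import numpy as np

rng = np.random.default_rng(1)
n, k = 300, 6
predictors = rng.normal(size=(n, k))
true_weights = np.array([0.5, 0.4, 0.3, 0.0, 0.0, 0.0])
criterion = predictors @ true_weights + rng.normal(scale=1.0, size=n)

half = n // 2
X1, y1 = predictors[:half], criterion[:half]      # derivation sample
X2, y2 = predictors[half:], criterion[half:]      # cross-validation sample

# Least-squares regression weights (with an intercept) from sample 1.
design1 = np.column_stack([np.ones(half), X1])
weights, *_ = np.linalg.lstsq(design1, y1, rcond=None)

# Multiple correlation in the derivation sample (inflated by chance).
r_derivation = np.corrcoef(design1 @ weights, y1)[0, 1]

# Aggregate correlation in sample 2 using the sample-1 weights.
design2 = np.column_stack([np.ones(n - half), X2])
r_cross = np.corrcoef(design2 @ weights, y2)[0, 1]

print(r_derivation, r_cross)    # the cross-validated value is typically lower
```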
Validation procedures. The commonest methods of test selection and use are those described above, using multiple regression as the basic procedure. These procedures assume that the criterion elements are all positively and substantially correlated, and can be combined with suitable weights into a single criterion measure of at least moderate homogeneity. It is then further assumed that a low score on one predictor can be compensated for by high scores on others, a weighted total score on all predictors being accepted as a single predictor of criterion performance.
Some criteria, however, consist of elements which are virtually uncorrelated. In such cases the elements must be predicted separately. In practice there is usually one predictor for each element, although it is possible to predict an element by its multiple regression on two or more predictors. In this situation, the preferred procedure is the multiple cutoff procedure. Each criterion-element measure is dichotomized at a critical (pass-fail) level, the corresponding predictor level is determined, and a successful applicant must be above the critical levels on all predictors. A further refinement consists in rating the criterion elements on their importance to the total job and requiring a successful applicant to be above the critical levels on the predictors of all the more important elements, but permitting him to be a little (but not far) below this level on one or two of the predictors of the less important elements.
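A minimal sketch of the multiple cutoff rule just described, using invented scores, cutoffs, importance ratings, and an assumed tolerance for the less important elements:

```python
import numpy as np

# Illustrative predictor scores for five applicants on three predictors,
# each predictor matched to one criterion element (hypothetical data).
scores = np.array([
    [62, 55, 71],
    [48, 60, 66],
    [70, 40, 80],
    [55, 58, 52],
    [66, 61, 75],
])
cutoffs = np.array([50, 50, 60])           # critical (pass-fail) levels, assumed
important = np.array([True, True, False])  # which elements are rated "important"
tolerance = 5                              # allowed shortfall on less important predictors

above = scores >= cutoffs
near = scores >= cutoffs - tolerance

# Basic rule: above the critical level on every predictor.
strict_pass = above.all(axis=1)

# Refinement: must clear all important predictors; may fall a little
# (but not far) below the level on the less important ones.
refined_pass = above[:, important].all(axis=1) & near[:, ~important].all(axis=1)
print(strict_pass, refined_pass)
```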
In predicting a dichotomous criterion, the most accurate predictions are made when the predictor cutoff score is at the point of intersection of the smoothed frequency curves of predictor scores for the upper and lower criterion groups. If the applicant group is large, however, the predictor cutoff score may be set one or two standard errors of measurement above this point.
Correction for attenuation. Correction for attenuation is a procedure for estimating what the correlation between two variables would be if both of them were perfectly reliable or consistent; i.e., if the correlation were based on the true scores of both variables. The unreliabilities of the variables attenuate (reduce) this correlation. To determine the proper correction, the experimental design must be such that the instability errors in the intercorrelation(s) are identical with those in the reliability or consistency coefficients.
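In its familiar form the correction divides the observed correlation by the square root of the product of the two reliability (or consistency) coefficients. A small sketch, with illustrative values only:

```python
import math

def disattenuate(r_xy, r_xx, r_yy=1.0):
    """Estimate the correlation between true scores.

    r_xy : observed correlation between the two variables
    r_xx : reliability (or consistency) coefficient of the first variable
    r_yy : reliability of the second; leave at 1.0 to correct one variable only
    """
    return r_xy / math.sqrt(r_xx * r_yy)

# e.g. an observed validity of .45 with predictor reliability .80 and
# criterion reliability .70 (illustrative values)
print(round(disattenuate(0.45, 0.80, 0.70), 3))   # about 0.601
```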
Index and coefficient. In discussing formulas for validity, the term “index” rather than the term “coefficient” has been used, although the latter is the term commonly used. The square of each of these correlations, however, is a coefficient of determination (of raw criterion scores by raw predictor scores, true criterion scores by raw predictor scores, or true criterion scores by true predictor scores). The reliability coefficient is also a coefficient of determination (of true scores by raw scores on the same variable), and the index of reliability is its square root.
As the intrinsic validity (the true validity of a perfectly reliable predictor) approaches unity, the index of true validity approaches the index of reliability of the predictor, since in this case the predictor and criterion true scores are identical; and an unreliable predictor cannot predict anything better than it predicts its own true scores. The statement “The upper limit of a test’s validity is the square root of its reliability” is erroneous: the upper limit of its validity is its reliability when both are expressed either as indexes or as coefficients. The error is due to the common practice of calling “indexes” of validity “coefficients” of validity.
Synthetic validity
Synthetic validation is test selection without a criterion variable; and synthetic validity is an estimate of empirical validity, also without a criterion variable. The procedure is based on job analysis and the accumulated experience from many previous empirical validation studies. The number of possible jobs, and the number of real jobs as well, greatly exceeds the number of distinct job elements and job-qualification traits. If previous studies have shown which qualification traits are required for each job element of a given job, and what predictors best predict each of these traits, predictors can be selected for the given job on the basis of a job analysis without a new empirical study, and rough estimates can even be made of the probable validity of the prediction battery. The procedures of job analysis are by now fairly well refined; there are substantial bodies of data on the qualification requirements for many job elements; and there are also fairly large amounts of data on the correlations between predictors and qualification traits and the intercorrelations among such predictors. Hence, synthetic validation is by now at least a practical art, and its methodology is approaching the status of an applied science. Synthetic validation is the only procedure which can be used when the number of positions in a given job category is too small to furnish an adequate sample for an empirical validation study.
Factor analysis of validity
Factor analysis provides a way of answering the question “What does this test measure?” The simplest answer is merely a collection of correlations between the given test and a variety of other better-known tests. However, if the given test and several others have all been administered to the same large sample, the factor structure of the given test provides a much better answer.
When a large number of tests are administered serially in a factor analysis study, the instability errors are greater for tests far apart in the series than for tests administered consecutively. This leads to the generation of small instability factors and complicates making the decision about when to stop factoring. If there are two forms of each test, all the A forms might be given serially, and then after an interval of a week or more all the B forms might be given in a different serial order. Two parallel factor analyses could then be performed, one (for tests 1 and 2, say) using the correlation between 1A and 2B; the other, the correlation between 1B and 2A. The correlations between 1A and 2A and between 1B and 2B would not be used. Interform reliabilities consistent with the intercorrelations would be given by the correlations between 1A and 1B and between 2A and 2B. [See Factor Analysis.]
Construct validity
Construct validation is an attempt to answer the question “Does this test measure the trait it was designed to measure?”when no single criterion measure or combination of criterion measures can be considered a well-defined, agreed-upon, valid measure of the trait; and there is in fact no assurance that the postulated trait is sufficiently unitary to be measurable. We start with a trait construct: a hypothesis that a certain trait exists and is measurable. Then we build a test we hope will measure it. There are two questions, to be answered simultaneously: (1) Does the trait construct actually represent a measurable trait? (2) Does the test measure that trait rather than some other trait?
From the trait construct we draw conclusions about how the test should correlate with other variables. With some it ought to correlate fairly highly, and if in every such case it does, the resulting evidence is termed convergent validity. Variables of this type are in a sense partial criteria. With other variables it should have low correlations, and if in every such case it does, the resulting evidence is termed discriminant validity. Consider the trait construct “general intelligence” and the Stanford-Binet Scale.
(a) General intelligence should increase with age during the period of growth. Mean scores on the Stanford-Binet increase regularly throughout this period, but so do height and weight.
(b) General intelligence can be observed and rated with some validity by teachers. Children rated bright by their teachers make higher Stanford-Binet scores than do children of the same age rated dull. There is some judgmental invalidity, however; docile and cooperative children are overrated, and classroom nuisances are underrated.
(c) The extremes of general intelligence are more certainly identifiable. Individuals judged mentally deficient make very low scores, but so do prepsychotics and children with intense negative attitudes toward school. Outstanding scholars, scientists, writers, and musical composers make high scores. Equally outstanding statesmen, executives, military leaders, and performing artists make somewhat lower but still quite high scores. This finding also agrees with the hypothesis, for people in the latter categories need high, but not so very high, general intelligence; and they also need special talents not so highly related to general intelligence.
(d) Items measuring diverse cognitive traits should correlate fairly highly with one another and should generate a large general factor if general intelligence is indeed a relatively unitary measurable trait. The items of the Stanford-Binet clearly do measure diverse cognitive traits, and their intercorrelations do generate a large general factor.
(e) Reliable homogeneous tests of clearly cognitive traits should correlate fairly highly with general intelligence. Tests of vocabulary, verbal and nonverbal reasoning, arithmetic problems, and visual-space manipulation correlate fairly highly with the Stanford-Binet, and not quite so highly with one another.
(f) Tests judged “borderline cognitive”should have positive but moderate correlations with general intelligence. Tests of rote memory, verbal fluency, mechanical knowledge, and the like do have positive but moderate correlations with the Stanford-Binet.
(g) Wholly noncognitive tests should have near-zero correlations with general intelligence. Tests of writing speed, visual acuity, physical strength, and the like do have near-zero correlations with the Stanford-Binet.
The full combination of these predictions and results, along with others not cited above, leads us to place considerable confidence in the trait status of the construct “general intelligence”and in the Stanford-Binet Scale (along with others) as a measure of this trait.
If there is even a single glaring discrepancy between theory and data, either the theory (the trait construct) must be revised, or the test must be considered invalid. A test of “social intelligence” was shown to correlate as highly with a test of general-verbal intelligence as other tests of general-verbal intelligence did. From this one finding we must conclude either that social intelligence is not a trait distinct from general-verbal intelligence, or that the test in question is not a discriminatingly valid measure of social intelligence because it is a valid measure of general-verbal intelligence. [See Intelligence And Intelligence Testing.]
Construct validity is, in the end, a matter of judgment and confidence. The greater the quantity of supporting evidence, the greater the confidence we can place in both the trait construct and the test. But the possibility of a single glaring discrepancy is never wholly ruled out by any quantity of such evidence (see especially Cronbach & Meehl 1955).
Test norms
Since the raw scores on educational and psychological tests consist of arbitrary units with arbitrary zero points—units which in most cases vary systematically as well as randomly with score level —these individual scores can be interpreted intelligently only by comparing them with the distributions of scores of defined groups, or norms. The comparisons are facilitated by using various types of score transformations based on these distributions.
Norms may be local, regional, or national; they may refer to the whole population or to defined subgroups, such as sex and occupation. Local norms are usually determined by testing everyone in a particular group: all children in certain grades of a school system, all freshmen applying to or admitted to a college or university, all employees in specified jobs in a company, etc. Regional and national norms must be determined by sampling. Since random sampling of the whole defined population is never practical, much care is necessary in the design of the sampling procedure and in the statistical analysis of the data to assure the representativeness of the results. The principles and procedures of survey sampling are beyond the scope of this article, but one caution should be noted. Use of a “pick-up”sample, depending upon the vagaries of cooperation in various communities, however widespread over regions, rural and urban communities, etc., followed by weighting based on census data, is never wholly satisfactory, although it is the method commonly employed. When norms are based on a sample which omits some major regions or population subgroups entirely, no weighting system can correct the bias, and such norms must be used with extreme caution in these other regions and with omitted subgroups. [SeeSample Surveys.]
Grade norms
For elementary and junior high school achievement tests, grade norms are commonly employed. The data used are the mean or median scores of children in successive grades, and the unit is one-tenth of the difference between these averages for successive grades. There is probably some error in assuming one month of educational growth over the summer vacation, but this error in interpreting grade scores is small in comparison with the standard errors of measurement of even the best achievement tests.
Age norms
Age scores are used mainly with general intelligence tests, where they are termed mental ages. The data are the average scores of children within one or two months (plus or minus) of a birthday, or the average scores of all children between two given birthdays, and the unit is one-twelfth of the difference between the averages for successive ages; i.e., one month of mental age. For ages above 12 or 13, extrapolations and corrections of various sorts are made, so that mental ages above 12 or 13 no longer represent age averages but, rather, estimates of what these averages would be if mental growth did not slow down during the period from early adolescence to maturity. Thus, when we say that the mental age of the average adult is 15, we do not mean that for most people mental growth ceases at age 15. What we do mean is that the total mental growth from age 12 to maturity is about half the mental growth from age 6 to age 12; so that at age 15 the average person would have reached the mental level he actually reached at maturity (in the early or middle twenties) if mental growth from 12 to 15 had continued at the same average rate as from 6 to 12.
Age norms have been used in the past with elementary and junior high school achievement tests also, defining “arithmetic age,” “reading age,” “educational age,” etc., in like manner; but they and the corresponding “arithmetic quotient,” “reading quotient,” and “educational quotient” are no longer used.
Quotient scores
The IQ as defined originally, for children up to age 12 or 13, is 100(MA/CA), the ratio of mental age to chronological age, multiplied by 100 to rid it of decimals. For older ages the divisor is modified so that the “equivalent chronological age” goes from 12 or 13 to 15 as the actual age goes from 12 or 13 to mental maturity. The de facto corrections have usually been crude.
The IQ would have the same standard deviation at all ages if the mental growth curve were a straight line, up to age 12 or 13, passing through true zero at birth, and if the standard deviation of mental ages were a constant fraction of the mean mental age at all chronological ages. These conditions are met very roughly by age scales such as the Stanford-Binet and are not met at all well by any other tests. In consequence, and because of the troubles encountered in the extrapolation of mental age scales and the derivation of equivalent chronological ages beyond age 12 or 13, IQs are no longer computed for most tests by the formula 100(MA/CA). They have been replaced quite generally by “deviation IQs,” which are standardized scores or normal-standardized scores for age groups; and even the names “intelligence” and “IQ” are tending to be replaced by other names because of their past widespread misinterpretation as measuring innate intellectual capacity. [See Intellectual Development; Intelligence And Intelligence Testing.]
Modal-age grade scores
As we proceed from the first grade to the ninth, the age range within a grade increases, and the distribution of ages becomes skewed, with the longer tail in the direction of increased age. The reason is that retardation is considerably more common than acceleration, and the consequence is that in the upper grades the grade averages on tests are no longer equal to the averages for pupils making normal progress (one grade per year) through school. Modal-age grade scores are used to compare the level of an individual child’s performance with the level representing normal school progress. The data are the means or medians, not of all children in a given grade but, rather, of those children whose ages are within six to nine months of the modal age (not the median or mean age) for the grade. The units are otherwise the same as those of total-group grade scores, with all the interpretive difficulties noted previously. Modal-age grade scores are recommended for judging the progress of an individual child; total-group grade scores for comparing classes, schools, and school systems.
When total-group grade scores are based on grade medians, they are about one-third closer to the corresponding modal-age grade scores than when they are based on grade means.
Standard scores and standardized scores
When individual scores are to be compared with those of a single distribution, rather than with the means or medians of successive groups (as is the case with grade and age scores), scores based on the mean and standard deviation of the distribution are frequently employed. Originally, standard scores were defined by the formula Z = (x − x̄)/s, where x is a raw score, x̄ is the group mean, and s is the standard deviation. Thus Z-scores have a mean of zero and a standard deviation of unity; the variance is also unity, and the product-moment correlation between two sets of Z-scores is the same as the covariance: rAB = ΣZAZB/N. For raw scores below the mean, Z-scores are negative. Thus the Z-score 1.2 corresponds to the raw score which is 1.2 standard deviations above the mean, and the Z-score −.6 corresponds to the raw score which is six-tenths of a standard deviation below the mean.
Because of the inconveniences in using negative scores and decimals, Z-scores are usually converted via a linear transformation into some other system having an arbitrary mean and an arbitrary standard deviation. These other systems are commonly termed standard-score systems, but the present writer prefers, like Ghiselli (1964), to reserve the term “standard score” for Z-scores and to call the other systems “standardized scores.”
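A brief sketch of both conventions, assuming an arbitrary standardized scale with mean 50 and standard deviation 10 (one common choice; the article prescribes no particular values):

```python
import numpy as np

raw = np.array([12, 15, 15, 18, 20, 23, 27], dtype=float)   # illustrative raw scores

# Standard scores (Z-scores): mean 0, standard deviation 1.
z = (raw - raw.mean()) / raw.std()

# Standardized scores: a linear transformation of Z to an arbitrary mean
# and standard deviation, here mean 50 and SD 10.
standardized = 50 + 10 * z
print(np.round(z, 2))
print(np.round(standardized, 1))
```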
The units of a standardized-score system form a quasi-interval scale if and only if the raw scores form such a scale. When the item-difficulty distribution is roughly normal rather than rectangular, the units are smallest near the score corresponding to the modal difficulty, and they become progressively larger with distance in either direction from this level.
Normal-standardized scores
Normal-standardized scores are standard or standardized scores for a normal distribution having the same mean and standard deviation as the raw-score distribution. They are found by looking up in a table of the normal distribution the Z-scores corresponding to the percentile ranks of the raw scores in the actual distribution, and then subjecting these Z-scores to any desired arbitrary linear transformation. This procedure corrects for departure of the score distribution from normality, but it does not insure equality of units in any practical sense, even if the distribution of the underlying ability is also normal.
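A sketch of the look-up procedure, using SciPy's normal distribution in place of a printed table and the midpoint convention for percentile ranks described later in this article; the raw scores and the final mean-50, SD-10 transformation are illustrative assumptions:

```python
import numpy as np
from scipy.stats import norm, rankdata

raw = np.array([12, 15, 15, 18, 20, 23, 27], dtype=float)   # illustrative raw scores

# Percentile ranks: percentage below plus half the percentage at the score.
n = len(raw)
pr = (rankdata(raw, method="average") - 0.5) / n

# "Table look-up": the Z-score of a normal distribution at each percentile rank.
z_normal = norm.ppf(pr)

# Any desired arbitrary linear transformation may follow, e.g. mean 50, SD 10.
normal_standardized = 50 + 10 * z_normal
print(np.round(normal_standardized, 1))
```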
The phrase “normal-standardized scores” is to be preferred to the more common “normalized standard scores.” To a mathematician, “standardizing” means reducing to Z-scores, and “normalizing” means producing scores each equal to x/√(Σx²), with sum of squares (instead of standard deviation) unity; it has no reference to the normal distribution.
Percentiles and percentile ranks
A percentile is defined as that score (usually fractional) below which lies a given percentage of a sample or population. The median is the 50th percentile: half the group make scores lower than the median and half make scores higher. The lower quartile is the 25th percentile and the upper quartile is the 75th percentile. All score distributions are grouped distributions, even though the grouping interval may be only one score unit. Percentiles are computed by interpolation under the assumption that the abilities represented by the scores within an interval are evenly distributed across that interval.
Percentile ranks are the percentiles corresponding to given scores. Since a single score represents an interval on a continuous scale, its percentile rank should be the percentage of individuals who make lower scores plus half the percentage who make the given score. In practice they are frequently computed as simply the percentage who make lower scores, and occasionally as the percentage who make the same or lower scores. Neither of these errors is large in comparison with the error of measurement.
Percentiles and percentile ranks are sometimes given for grade groups, age groups, or modal-age grade groups with elementary and junior high school achievement tests and intelligence tests. They are used more commonly with high school and college tests, with the reference group being all students in a given class (grade), or all college applicants, in the case of college entrance tests. For tests in particular subject areas, the reference groups are more commonly all students who have studied the subject for a given number of years in high school or college.
Strict warnings are commonly given against treating percentile ranks as though they form quasi-interval scales; but as noted above, raw scores, standardized scores, and normal-standardized scores may be little, if any, better in this respect when item-difficulty distributions are far from rectangular. It is quite possible, in fact, that for some not uncommon item-difficulty distributions, the percentile ranks may have more nearly the properties of an interval scale than the raw scores have.
Score regions
Centiles and deciles are the regions between adjacent percentiles and sets of ten percentiles. The first centile is the score region below the first percentile. The 100th centile is the region above the 99th percentile. The kth centile is the region between the (k − 1)th and kth percentiles. The first decile is the region below the tenth percentile, sometimes termed the first decile point. The tenth decile is the region above the 90th percentile or ninth decile point. The kth decile is the region between the 10(k − 1)th and 10kth percentiles, or the (k − 1)th and kth decile points.
The term “quartile”is often used also to represent a region. The lower quartile, the median, and the upper quartile are the three quartile points. The first quartile is the region below the lower quartile, the second quartile the region between the lower quartile and the median, the third quartile the region between the median and the upper quartile, and the fourth quartile the region above the upper quartile.
Centiles, deciles, and quartiles are equal-frequency score regions. Stanines and stens define equal standard score or normal-standard score regions, with unequal frequencies.
Scaled scores
A few intelligence and achievement tests and test batteries are designed to cover wide ranges of ability, e.g., grades 3–9 inclusive. More commonly, however, they are issued for successive levels, such as primary, elementary, advanced (junior high school), and in some cases secondary (senior high school) and adult. In achievement test batteries, additional subject areas are usually included at the higher levels; and at the primary level, picture tests may replace tests which at other levels require reading. The successive levels usually have similar materials differing in average difficulty, but with the harder items at one level overlapping in difficulty the easier items at the next higher level.
With wide-range tests and tests issued at several levels, grade scores, age scores, modal-age grade scores, and grade, age, or modal-age grade percentile ranks or standardized scores may represent quite unequal units at different levels. Scaled score systems are designed to have units which are equal in some sense, at least on the average, from level to level throughout the range. They are based on assumptions about the shape of the underlying ability distribution within a grade or age group, and are derived by considering both the score distributions at successive grades or ages and the mean or median gains from grade to grade or age to age. None of these methods are wholly satisfactory.
Further problems arise when attempts are made to scale tests of different abilities in comparable units, since the relations between mean gains and within-grade or within-age variability are quite different for different functions. Thus, mean annual gain in reading is a much smaller fraction of within-group standard deviation than is mean annual gain in arithmetic; or, stated in terms of growth units, variability in reading is much greater than in arithmetic.
Equating
Before the scores on two tests, or even the scores on two forms of the same test, can be compared, the relations between their score scales must be established. The preferred experimental design is to give both tests to the same group of examinees: to half the group in the A-B order and to the other half in the B-A order.
The simplest method of establishing comparable scores is termed line-of-relation equating. Scores on test A and test B are considered comparable if they correspond to the same standard score. This method is satisfactory only if the item-difficulty distributions are of the same shape and are equally variable, which is seldom the case. The preferred method is termed equipercentile equating. Scores are considered comparable if they correspond to the same percentile. Selected percentiles are computed for each distribution, such as percentiles 1, 2, 3, 5, 10, 15, 20, 30, 40, 50, 60, 70, 80, 85, 90, 95, 97, 98, and 99. A two-way chart is prepared, with the scores on one form as ordinates and the scores on the other form as abscissas. For each of the selected percentiles a point is plotted representing the scores on the two tests corresponding to this percentile, and a smooth curve is drawn as nearly as possible through all these points. If, but only if, this curve turns out to be a straight line, will line-of-relation equating have been satisfactory.
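A compact sketch of equipercentile equating under the single-group design, with simulated score distributions; linear interpolation between the paired percentile points stands in for the hand-drawn smooth curve, and all numerical values are invented for illustration:

```python
import numpy as np

# Illustrative score distributions on forms A and B for the same group.
rng = np.random.default_rng(1)
scores_a = rng.binomial(60, 0.55, size=2000)
scores_b = rng.binomial(60, 0.45, size=2000)

# Selected percentiles (as in the text) computed for each distribution.
pctls = [1, 2, 3, 5, 10, 15, 20, 30, 40, 50, 60, 70, 80, 85, 90, 95, 97, 98, 99]
pa = np.percentile(scores_a, pctls)
pb = np.percentile(scores_b, pctls)

# Scores are comparable if they fall at the same percentile: a form-A score is
# converted to the form-B scale by interpolating along these paired points.
def a_to_b(score_a):
    return np.interp(score_a, pa, pb)

print(round(float(a_to_b(35)), 1))
```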
If the two distributions are first smoothed, the equipercentile points are more likely to lie on a smooth curve, and the accuracy of the equating is improved.
This method of equating is satisfactory if the two tests are equally consistent. If they are not, the scores on each test should all be multiplied by the consistency coefficient of that test, and the resulting “estimated true scores”should be equated.
When new forms of a test are issued annually, a full norms study is usually conducted only once every five or ten years, and the norms for successive forms are prepared by equating them to the “anchor form”used in the last norms study.
Standards
In a few cases, standards of test performance can be established without reference to the performances of members of defined groups. Thus, in typing, 120 words per minute with not more than one error per 100 words is a fairly high standard.
“Quality scales,” in areas such as handwriting and English composition, are sets of specimens at equal intervals of excellence. A standard of handwriting legibility can be set by measuring the speed with which good readers read the various specimens. The standard would be the poorest specimen which is not read significantly slower than the best specimen. Units above this standard would then represent mainly increases in beauty; units below the standard, decreases in legibility. In English composition, the poorest specimen written in substantially correct grammar could be identified by a consensus of English teachers. Then units above the standard would represent mainly improvements in style; units below the standard, decreases in grammatical correctness.
Research is in progress to determine standards for multiple-choice tests of subject-matter achievement. When the items of such a test are arranged in order of actual difficulty, experienced teachers expert in the subject might be able to agree on the item which barely passing students should get right half the time. Given this item and the item-analysis data, the passing score for the test can be determined fairly readily.
Expectancy tables
When the regression of a test or battery on a criterion has been determined from a representative sample of the same population, norms for the population can be expressed in terms of expected criterion scores. The predictor scores are usually expressed in fairly broad units. Then, if the criterion is dichotomous, the expectancy table gives for each score level the probability that a person at that score level will be in the upper criterion group. If the criterion is continuous, the criterion scores are grouped into quartiles, deciles, stanines, stens, or grade levels; and for each predictor score level the table gives probabilities for the several criterion levels. Thus if the criterion is a grade in a course, the expectancy table will show for each predictor score level the probabilities that the grade will be A, B, C, D, or F.
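A small sketch of building such a table from simulated predictor scores and course grades; the score levels, grade cut points, and data are all invented for illustration:

```python
import numpy as np
import pandas as pd

# Illustrative data: predictor scores and course grades for one sample.
rng = np.random.default_rng(2)
n = 500
predictor = rng.normal(50, 10, n)
ability = predictor + rng.normal(0, 12, n)            # criterion related to predictor
grades = pd.cut(ability, bins=[-np.inf, 35, 45, 55, 65, np.inf],
                labels=list("FDCBA"))

# Group predictor scores into fairly broad levels, then tabulate the
# probability of each grade at each level (rows sum to 1).
levels = pd.cut(predictor, bins=[0, 40, 50, 60, 100],
                labels=["<40", "40-49", "50-59", "60+"])
expectancy = pd.crosstab(levels, grades, normalize="index").round(2)
print(expectancy)
```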
Edward E. Cureton
[Directly related are the entries Intelligence And Intelligence Testing; Mathematics; Personality Measurement; Scaling. Other relevant material may be found in Factor Analysis; Multivariate Analysis, articles on Correlation; Nonparametric Statistics; Statistics, Descriptive; and in the biographies of Binet; Pearson; Spearman; Thorndike; Thurstone.]
BIBLIOGRAPHY
The standard work in psychometrics is Guilford 1936. It includes psychophysics and scaling, as well as the topics covered in this article. A somewhat more elementary treatment is given in Ghiselli 1964, and a somewhat more advanced treatment in Gulliksen 1950. Another general work is Stevens 1951, which treats with some care the foundations of measurement, a topic not covered by Guilford, Ghiselli, or Gulliksen.
Brogden, Hubert E. 1949 A New Coefficient: Application to Biserial Correlation and to Estimation of Selective Efficiency. Psychometrika 14:169-182.
Cronbach, Lee J. 1951 Coefficient Alpha and the Internal Structure of Tests. Psychometrika 16:297-334.
Cronbach, Lee J.; and Meehl, P. E. (1955)1956 Construct Validity in Psychological Tests. Pages 174–204 in Herbert Feigl and Michael Scriven (editors), The Foundations of Science and the Concepts of Psychology and Psychoanalysis. Minneapolis: Univ. of Minnesota Press. → First published in Volume 52 of the Psychological Bulletin.
Fan, Chung-teh 1952 Item Analysis Table. Princeton, N.J.: Educational Testing Service.
Findley, Warren G. 1956 A Rationale for Evaluation of Item Discrimination Statistics. Educational and Psychological Measurement 16:175-180.
Flanagan, John C. 1939 General Considerations in the Selection of Test Items and a Short Method of Estimating the Product-moment Coefficient From the Data at the Tails of the Distribution. Journal of Educational Psychology 30:674-680.
Ghiselli, Edwin E. 1964 Theory of Psychological Measurement. New York: McGraw-Hill.
Guilford, Joy P. (1936) 1954 Psychometric Methods. 2d ed. New York: McGraw-Hill.
Gulliksen, Harold 1950 Theory of Mental Tests. New York: Wiley.
Horst, Paul 1936 Item Selection by Means of a Maximizing Function. Psychometrika 1:229-244.
Kuder, G. F.; and Richardson, M. W. 1937 The Theory of the Estimation of Test Reliability. Psychometrika 2:151-160.
Lord, Frederic M. 1964 An Empirical Comparison of the Validity of Certain Formula-scores. Journal of Educational Measurement 1:29-30.
Lyerly, Samuel B. 1951 A Note on Correcting for Chance Success in Objective Tests. Psychometrika 16:21-30.
Stevens, S. S. (1951)1958 Mathematics, Measurement, and Psychophysics. Pages 1–49 in S. S. Stevens (editor), Handbook of Experimental Psychology. New York: Wiley.→ See especially pages 13–15 and 21-30.
Taxonomy of Educational Objectives. Edited by Benjamin S. Bloom. 2 vols. 1956–1964. New York: McKay. → Handbook 1: The Cognitive Domain, by B. S. Bloom and D. R. Krathwohl, 1956. Handbook 2: The Affective Domain, by D. R. Krathwohl, B. S. Bloom, and B. B. Masia, 1964.
Thorndike, Edward L. et al. 1926 The Measurement of Intelligence. New York: Columbia Univ., Teachers College.
Thorndike, Robert L. 1951 Reliability. Pages 560–620 in E. F. Lindquist (editor), Educational Measurement. Washington: American Council on Education.
Wesman, Alexander G. 1949 Effect of Speed on Item-Test Correlation Coefficients. Educational and Psychological Measurement 9:51-57.
Psychometrics
Psychometrics
Psychometrics literally means “psychological measurement.” It is the methodology that deals with the design, administration, and interpretation of measurements of individuals’ constructs such as abilities, attitudes, personality, knowledge, quality of life, and so on. There are several major components in psychometric theory, including classical test theory, item response theory, factor analysis, structural equation modeling, and statistical methods and computing. A psychometrician is an expert who practices psychometrics. He or she usually holds a postgraduate degree in either educational measurement or quantitative psychology.
CLASSICAL TEST THEORY
Reliability, which concerns the consistency of measurement, is a major concern for any kind of measurement. Classical test theory (CTT) views the score of an individual as a random variable X that can be decomposed into a fixed true score T plus an error: X = T + E, where the expected value of X is T, that is, E[X] = T and E[E] = 0. Moreover, for a given population, σX² = σT² + σE²; if σE² = 0, then the measurement X is perfectly reliable. Thus, the reliability of a test, often denoted as ρXX′, can be defined as ρXX′ = σT²/σX².
According to the above, σE = σX √(1 − ρXX′). Therefore, for a student who received score X, a 68 percent confidence interval for his or her true score is X ± 1(σE). Adequately quantifying indices to measure reliability is at the core of CTT. Let ρXT be the correlation coefficient between X and T. It can be proved that ρXT² = σT²/σX² = ρXX′, where ρXT is the Pearson correlation between X and T. In statistics, ρXT² is interpreted as the proportion of variation in T that is related to the variation in X. Thus, the larger the value of ρXT, the more reliable the measurement. Because T is unobservable, methods were proposed to estimate ρXX′. For example, if two parallel forms of a test are administered simultaneously to the same population, it can be proved that ρXX′ = σT²/σX², where X and X′ are the scores on the two tests and ρXX′ is their correlation. This implies that reliability can be estimated from two parallel tests. When there is only one single test available, the correlation between its odd and even items, say ρYY′, measures the internal consistency between the half-tests and is called the “split-half reliability coefficient.” According to the Spearman-Brown formula, the reliability for the entire test can be obtained from that of the half-test: ρXX′ = 2ρYY′/(1 + ρYY′).
In the early days of psychometric research, numerous methods were proposed to estimate reliability indices, including Cronbach’s coefficient-α, Kuder-Richardson’s KR-20, and so on.
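A brief sketch of two of these estimates, split-half reliability stepped up by the Spearman-Brown formula and Cronbach's coefficient α, computed on simulated item responses (all data and the test length are invented):

```python
import numpy as np

# Illustrative 0/1 item responses: 200 examinees by 20 items.
rng = np.random.default_rng(3)
ability = rng.normal(size=(200, 1))
items = (ability + rng.normal(size=(200, 20)) > 0).astype(float)

# Split-half reliability: correlate odd-item and even-item half-test scores,
# then step up with the Spearman-Brown formula.
odd, even = items[:, 0::2].sum(axis=1), items[:, 1::2].sum(axis=1)
r_halves = np.corrcoef(odd, even)[0, 1]
split_half = 2 * r_halves / (1 + r_halves)

# Cronbach's coefficient alpha for the same test.
k = items.shape[1]
alpha = k / (k - 1) * (1 - items.var(axis=0, ddof=1).sum()
                       / items.sum(axis=1).var(ddof=1))
print(round(split_half, 3), round(alpha, 3))
```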
Another important concept in psychometrics is validity, which concerns the purpose of measurement; a measure is valid if it measures what it purports to measure. A pivotal task in psychometric research is to search for adequate methods to assess all kinds of validities, such as content validity, criterion-related validity, and construct validity. Though they all were created to gauge the correlation between a measure and its purpose, each one plays a unique role. Criterion-related validity is used to demonstrate the accuracy of a measure with respect to a criterion that has been demonstrated to be valid. For example, suppose a job-screening test has been shown to be an accurate predictor of job performance. Let X be the job-screening test score collected last year and Y be the job-performance rating of this year, where X can be called a predictor score and Y a criterion score. A straightforward quantity for assessing the validity of using X to predict Y is the Pearson correlation between X and Y, ρXY. The larger the value of ρXY, the more valid it is to use the test score X to predict Y. Y can be predicted from X by least-squares linear regression:

Ŷ = Ȳ + ρXY(σY/σX)(X − X̄),

where Ŷ is the predicted criterion score, X̄ and Ȳ are the sample means, and σY and σX are the sample standard deviations.
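A small sketch of criterion-related validity and this regression prediction, on simulated predictor and criterion scores (all values illustrative):

```python
import numpy as np

# Illustrative predictor (screening-test) and criterion (rating) scores.
rng = np.random.default_rng(4)
x = rng.normal(100, 15, 300)
y = 0.6 * x + rng.normal(0, 12, 300)

r_xy = np.corrcoef(x, y)[0, 1]                 # criterion-related validity

def predict(new_x):
    """Least-squares prediction of the criterion from the predictor."""
    return y.mean() + r_xy * (y.std(ddof=1) / x.std(ddof=1)) * (new_x - x.mean())

print(round(r_xy, 3), round(float(predict(115)), 1))
```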
Construct validity is a more recently developed form of validity. It refers to the degree to which the measure reflects the construct it was designed to measure. According to Mary Allen and Wendy Yen (1979), establishing construct validity is an ongoing process that involves the verification of predictions made about the test scores. Suppose a new test is proposed to measure a construct; according to theory, males and females should perform similarly if they are at the same construct level. This hypothesis needs to be tested. If the hypothesis is supported by the data analysis, the construct validity is enhanced.
FACTOR ANALYSIS
Factor analysis (FA) is another important component in psychometric theory. It has been commonly used to examine the structure of correlations among a set of observed scores. These scores can be either subscores of a test or the scores of several different tests. When FA is conducted, tests that are influenced by the same factor are shown to have high factor loadings on that factor. By conducting FA, researchers can identify factors that explain a variety of results among subtests or different tests. For example, a potential research question could be “How many traits are these tests measuring?” There have been many applications and generalizations, including both confirmatory factor analysis and exploratory factor analysis. FA was originally developed within psychometrics for the study of human intelligence testing, but it has become a frequently used methodology in many areas of psychology, the social sciences, business, economics, engineering, and biology.
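A minimal sketch of exploratory factor analysis on simulated subtest scores, using scikit-learn's FactorAnalysis; the two-factor structure and loadings are invented for illustration:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Illustrative scores on six subtests driven by two latent factors.
rng = np.random.default_rng(5)
factors = rng.normal(size=(400, 2))
loadings = np.array([[0.8, 0.0], [0.7, 0.1], [0.9, 0.0],
                     [0.0, 0.8], [0.1, 0.7], [0.0, 0.9]])
scores = factors @ loadings.T + rng.normal(scale=0.5, size=(400, 6))

fa = FactorAnalysis(n_components=2, random_state=0).fit(scores)
print(np.round(fa.components_.T, 2))   # estimated loadings: subtests by factors
```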
ITEM RESPONSE THEORY
Item response theory (IRT) is a relatively new methodology in psychometric theory. It is also referred to as “latent trait theory.” In CTT the true score of an examinee, which can be interpreted as the examinee’s ability level, is test dependent. When the test is easy, the examinee tends to have a high true score; when the test is difficult, the examinee tends to have a lower true score. The difficulty level for either an item or a test is population dependent. In order to overcome the shortcomings of CTT, IRT was created in an attempt to incorporate certain desirable features, such as examinee ability estimates that are not test dependent and item characteristics that are not group dependent. IRT is based on certain fundamental assumptions: (1) for a dichotomously scored item, the performance of an examinee on the item can be predicted by knowing his or her latent trait (or set of latent traits); and (2) the relationship between examinees’ item performance and the latent trait (or traits) required to perform on the item can be described by a response function (see, e.g., Hambleton, Swaminathan, and Rogers 1991). For a polytomously scored item, a set of response functions is needed.
Different models have been proposed. The most commonly used model for dichotomously scored items is the three-parameter logistic model. Let Xj be the score of a randomly selected examinee on the jth item: Xj = 1 if the answer is correct and Xj = 0 if incorrect, so that Xj = 1 with probability Pj(θ) and Xj = 0 with probability 1 − Pj(θ), where Pj(θ) denotes the probability of a correct response for a randomly chosen examinee of latent trait θ; that is, Pj(θ) = P{Xj = 1 | θ}, where θ is unknown and has the domain (−∞, ∞) or some subinterval of (−∞, ∞). When the three-parameter logistic model (3PL) is used, the probability becomes

Pj(θ) = cj + (1 − cj) / [1 + exp(−aj(θ − bj))],

where aj is the item discrimination parameter, bj is the difficulty parameter, and cj is the guessing parameter.
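A direct transcription of this response function into code, with illustrative parameter values (some formulations also include a scaling constant of 1.7, omitted here):

```python
import numpy as np

def p_3pl(theta, a, b, c):
    """Probability of a correct response under the three-parameter logistic model."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

# An item with moderate discrimination, average difficulty, and some guessing.
thetas = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print(np.round(p_3pl(thetas, a=1.2, b=0.0, c=0.2), 3))
```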
Polytomous IRT modeling is another important application of IRT. Assume that the response of an examinee to the jth item falls into one of a set of m + 1 categories uj0, uj1, …, ujm; that is, Xj = ujk for some k = 0, 1, …, m. In other words,

Pjk(θ) ≡ Prob{Xj = ujk | θ},

where Σk Pjk(θ) = 1, and Pjk(θ) is referred to as the item category response function. When the category sequence is an increasing order uj0 < uj1 < … < ujm, the model is referred to as an “ordered polytomous model.” There are numerous IRT models, such as the graded response, partial credit, and nominal models.
Popular applications of IRT include latent trait estimation, item parameter calibration, modeling and detection of differential item functioning (DIF), linking and equating, and computerized adaptive testing.
NEW DEVELOPMENTS IN PSYCHOMETRIC THEORY
Perhaps one of the biggest challenges for psychometricians today is how to keep abreast of the rapid developments in technology. Computerized Adaptive Testing (CAT) and Internet-Based Testing (IBT) are undergoing rapid growth. Although the implementation of new technologies has led to many advantages, such as new question formats, new types of skills that can be measured, easier and faster data analysis, faster score reporting, and continuous testing, many research questions remain unanswered. For example, how does one improve CAT test security without sacrificing estimation efficiency? How does one detect cheating behavior from an examinee’s response pattern? How does one automatically grade examinees’ performance in a large scale performance based assessment? How does one use test scores to make inferences about examinees’ cognitions? Another big challenge stems from the fact that an unprecedented number of people are taking tests daily, from K-12 educational assessments, college admissions tests, job application and placement tests, professional licensing exams, survey research, psychiatric evaluations, and medical diagnostic tests. As such, the need for new methods is apparent in many aspects of psychometric development. Examples of some new developments are discussed below.
Large-Scale Automated Test Assembly Large-scale application of computer-based achievement tests and credentialing exams has generated many challenges to test development. One of these challenges, maintaining content representation in multiple forms, is central to test defensibility and validity. Manually assembling parallel test forms is not only time consuming but also infeasible when a great number of forms are needed. Utilizing automated test-assembly (ATA) procedures reduces the workload of test developers and ensures the quality of the assembled test forms. ATA can be formulated as constrained combinatorial optimization, which concerns how best to arrange the controllable elements of a large complex system in order to achieve a specified goal. A typical test-assembly problem can be treated by selecting items so that the assembled test satisfies a certain reliability index (objective function) subject to constraints such as test length and content coverage. Several methods have been proposed. According to Wim van der Linden (2005), binary linear programming seems to be a popular method for test assembly for two reasons: (1) the techniques are well developed within the field of operations research, and (2) some commercial software packages are readily available and user-friendly. Other promising ATA methods include sampling and stratification, weighted deviation, network flow, and optimization methods.
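As a toy illustration of the binary linear programming approach (not any particular operational system), the following sketch assembles a fixed-length form from an invented item bank using SciPy's mixed-integer solver; the information values, content areas, and constraint levels are all assumptions:

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

# Illustrative item bank: an information value and a content area per item.
rng = np.random.default_rng(6)
n_items = 60
info = rng.uniform(0.2, 2.0, n_items)        # objective: total information of the form
area = rng.integers(0, 3, n_items)           # three content areas

test_length, per_area_min = 12, 3
constraints = [LinearConstraint(np.ones(n_items), test_length, test_length)]
for a in range(3):                           # content-coverage constraints
    constraints.append(LinearConstraint((area == a).astype(float), per_area_min, np.inf))

# Binary decision variables: 1 if the item goes into the form.  milp minimizes,
# so the information objective is negated.
res = milp(c=-info, constraints=constraints,
           integrality=np.ones(n_items), bounds=Bounds(0, 1))
print(np.flatnonzero(res.x > 0.5))           # indices of the selected items
```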
Automatically Scoring Performance-Based Assessment A new trend in large-scale assessment is to increase the proportion of performance-based tasks in standardized testing. With the rapid development of computer technologies, more and more computer-based tests (CBT) for a variety of innovative constructed-response tasks have become available. Examples of such tasks include writing an essay or diagnosing a computer-simulated patient. However, grading these complex tasks demands tremendous effort. Because of the subjectivity in human readers’ scoring process, each task requires two or more content experts to review, which is very time consuming and expensive. Moreover, oftentimes there are several thousand or more students taking a given exam. How do we grade such examinations? Can we use automated scoring systems to address the cost issues and make the scoring more consistent? One of the most innovative developments in psychometrics over the last decade is automated scoring of complex tasks.
The Electronic Essay Rater (E-rater) is a technology developed by the Educational Testing Service (ETS) to score essays automatically based on holistic scoring guide criteria (see Burstein et al. 1998). The E-rater is designed to provide a distinct scoring model for each new essay topic. In the first step, a few hundred essays on the same topic are randomly sampled as a “training sample” and then scored by well-trained human raters. The human scores are treated as the values of a dependent variable Y. Second, a set of variables is derived statistically from the training sample, either through Natural Language Processing (NLP) techniques or by simple “counting” procedures. These feature variables, say X1, …, Xn, are treated as a set of “independent variables,” and a linear regression model is established
Y = B0 + B1X1 + … + BnXn + ε.
Third, a stepwise linear regression analysis is performed on the feature variable X 1, …, Xn extracted from the training dataset to identify a subset of features that parsimoniously explains observed variation in the consensus scores. Lastly, the final score prediction for cross-validation sets is performed using these estimated equations. Once the scoring model is established, the scores of the examinees outside the learning sample can be “predicted” by the linear model. Several million essays have been scored by E-rater since it was adopted for scoring the GMAT in 1999, and the technology is being considered for use with the Graduate Record Examination, for graduate school admissions, and the Test of English as a Foreign Language, which assesses the English proficiency of foreign students entering U.S. schools (Mathews 2004).
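A much-simplified sketch of this regression step, using ordinary least squares on invented feature variables; the stepwise feature selection and the NLP feature extraction of the operational system are omitted, and all names and data here are hypothetical:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative "training sample": human holistic scores and a few
# automatically extracted feature variables (hypothetical features).
rng = np.random.default_rng(7)
n_essays = 300
features = rng.normal(size=(n_essays, 5))     # e.g. length, error rate, topical overlap ...
human_scores = (3 + features @ np.array([0.8, 0.5, 0.0, 0.3, 0.0])
                + rng.normal(scale=0.5, size=n_essays))

model = LinearRegression().fit(features, human_scores)

# Essays outside the training sample are then "scored" by the fitted model.
new_essays = rng.normal(size=(3, 5))
print(np.round(model.predict(new_essays), 2))
```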
Measuring Patient-Reported Outcomes Conventional clinical measures of disease such as x-rays and lab results do not fully capture information about chronic diseases and how treatments affect patients. To supply this missing information, self-completed questionnaires are often administered to patients to assess their subjective experiences such as symptom severity, social well-being, and perceived level of health. Such measurement of patient-reported outcomes (PROs) is important for disease intervention. Many psychometric approaches can be used to meet the needs of clinical researchers across a wide variety of chronic disorders and diseases.
In particular, the CAT technology developed in educational testing can be used innovatively in health-related quality-of-life (HQOL) measures, in which the next item is selected based on the response the patient has given to the previous question. According to Frederic Lord (1971), an examinee is measured most effectively when test items are neither too “difficult” nor too “easy.” Heuristically, if the examinee answers an item correctly, the next item selected should be more difficult; if the answer is incorrect, the next item should be easier. Note that in HQOL applications, the term difficulty is analogous to severity. For example, asking a patient if it is difficult to climb the stairs might measure a lower level of severity than asking if it is difficult to walk 1 mile. Thus, items are tailored to the individual with greater estimation precision and content validity. According to the National Institutes of Health (2003), such a CAT HQOL system would be useful in clinical practice to assess response to interventions and to inform modification of treatment plans.
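A minimal sketch of this adaptive selection rule, assuming a two-parameter logistic model and choosing the unadministered item with maximum Fisher information at the current trait (severity) estimate; the item bank and parameter values are invented:

```python
import numpy as np

def p_2pl(theta, a, b):
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a two-parameter logistic item at trait level theta."""
    p = p_2pl(theta, a, b)
    return a**2 * p * (1.0 - p)

# Illustrative bank of severity-graded items (a = discrimination, b = severity).
a = np.array([1.0, 1.4, 0.9, 1.8, 1.2])
b = np.array([-1.5, -0.5, 0.0, 0.8, 1.6])
administered = {2}                      # items already asked

def next_item(theta_hat):
    """Pick the unadministered item that is most informative at the current estimate."""
    info = item_information(theta_hat, a, b)
    info[list(administered)] = -np.inf
    return int(np.argmax(info))

print(next_item(theta_hat=0.6))
```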
Psychometric application in HQOL shares many similarities with its use in educational testing, with, for example, reliability and validity being the highest priorities for both. However, different perspectives do exist. For example, the length of the assessment for a particular domain in HQOL is a much greater concern, especially when many domains must be assessed in a population of patients with a chronic disease, because patient burden must be carefully considered. The need for more psychometric research in developing, evaluating, and applying HQOL measures is growing, and undoubtedly, this will advance the field of psychometrics.
SEE ALSO Cliometrics; Eugenics; Factor Analysis; Galton, Francis; Intelligence; Measurement; Pearson, Karl; Probability; Psychology; Reliability, Statistical; Scales; Spearman Rank Correlation Coefficient; Statistics; Statistics in the Social Sciences; Structural Equation Models; Validity, Statistical
BIBLIOGRAPHY
Allen, Mary J., and Wendy Yen. 1979. Introduction to Measurement Theory. Monterey, CA: Brooks Cole.
Burstein, Jill, Lisa Braden-Harder, Martin Chodorow, et al. 1998. Computer Analysis of Essay Content for Automated Score Prediction: A Prototype Automated Scoring System for GMAT Analytical Writing Assessment Essays. ETS Research Report 98-15. http://www.ets.org/research/researcher/RR-98-15.html.
Hambleton, Ronald K., Hariharan Swaminathan, and H. Jane Rogers. 1991. Fundamentals of Item Response Theory. Newbury Park, CA: Sage.
Lord, Frederic M. 1971. Robbins-Monro Procedures for Tailored Testing. Educational and Psychological Measurement 31: 3–31.
Mathews, Jay. 2004. Computer Weighing in on the Elements of Essay. Washington Post, August 1.
National Institutes of Health. 2003. Dynamic Assessment of Patient-Reported Chronic Disease Outcomes. RFA-RM-04–011. http://grants.nih.gov/grants/guide/rfa-files/RFA-RM-04-011.html.
Van der Linden, Wim J. 2005. Linear Models for Optimal Test Design. New York: Springer.
Hua-Hua Chang
Psychometry
Psychometry
A faculty, claimed by many psychics and mediums, of becoming aware of the characters, surroundings, or events connected with an individual by holding or touching an object, such as a watch or ring, that the individual possessed or that was strongly identified with the person. Medium Hester Dowden described psychometry as "a psychic power possessed by certain individuals which enables them to divine the history of, or events connected with, a material object with which they come in close contact."
No doubt such an ability has been manifest from ancient times, but it was first named and discussed in modern history by the American scientist Joseph Rhodes Buchanan in 1842. The term derives from the Greek psyche (soul) and metron (measure) and signifies "soul-measuring," or measurement by the human soul. Buchanan's theory was based on the belief that everything that has ever existed—every object, scene, or event that has occurred since the beginning of the world—has left on the ether, or astral light, a trace of its being. This trace is indelible while the world endures and is impressed not only on the ether but on more palpable objects, such as trees and stones. Sounds and perfumes also leave impressions on their surroundings, said Buchanan. Just as a photograph may be taken on film or plate and remain invisible until it has been developed, so may those psychometric "photographs" remain impalpable until the developing process has been applied. That which can bring them to light is the psychic faculty and mind of the medium, he said.
Buchanan claimed that this faculty operated in conjunction with what he termed a community of sensation of varying intensity. The psychometric effect of medicines in Buchanan's experiments as a physician was similar to their ordinary action. When an emetic was handed to a subject, the subject could only avoid vomiting by suspending the experiment. Buchanan's earliest experiments, with his own students, showed that some of them were able to distinguish different metals merely by holding them in their hands. Later he found that some among them could diagnose a patient's disease simply by holding his hand. Many of his acquaintances, on pressing a letter against their foreheads, could tell the character and surroundings of the writer, the circumstances under which the letter was written, and other details.
Many mediums who have practiced psychometry have since become famous in this line. As has been said, their method is to hold in the hand or place against the forehead some small object, such as a fragment of clothing, a letter, or a watch; appropriate visions are then seen or sensations experienced.
While on rare occasions a psychometrist may be entranced, normally he or she is in a condition scarcely varying from the normal. The psychometric pictures, presumably somehow imprinted on the objects, have been likened to pictures carried in the memory, seemingly faded, yet ready to start into vividness when the right spring is touched. Some have suggested, for example, that the rehearsal of bygone tragedies so frequently witnessed in haunted houses is really a psychometric picture that, during the original occurrence, impressed itself on the room. The same may be said of the sounds and smells that haunt certain houses.
The psychological effect of the experimental objects appears to be very strong. When a Mrs. Cridge, William Denton's subject, examined a piece of lava from the Kilauea volcano she was seized with terror and the feeling did not pass for more than an hour.
On examining a fragment of a mastodon tooth, Elizabeth Denton said,
"My impression is that it is a part of some monstrous animal, probably part of a tooth. I feel like a perfect monster, with heavy legs, unwieldy head, and very large body. I go down to a shallow stream to drink. I can hardly speak, my jaws are so heavy. I feel like getting down on all fours. What a noise comes through the wood! I have an impulse to answer it. My ears are very large and leathery, and I can almost fancy they flap my face as I move my head. There are some older ones than I. It seems, too, so out of keeping to be talking with these heavy jaws. They are dark brown, as if they had been completely tanned. There is one old fellow, with large tusks, that looks very tough. I see several young ones; in fact, there is a whole herd."
She derived further impressions from a fragment of a meteorite: "It carries my eyes right up. I see an appearance of misty light. I seem to go miles and miles very quickly, up and up. Streams of light come from the right, a great way off…. Light shining at a vast distance."
Some negative impressions can prostrate the psychic and cause illness. On occasion, if the impressions are too antagonistic, the psychic will refuse to handle the object. Some psychometrists have been known, when given an object belonging to a deceased person, to take on the personal appearance and mannerisms of the owner and even to suffer from his or her ailments.
Eugene Crowell, in The Identity of Primitive Christianity and Modern Spiritualism (2 vols., 1875-79), writes of a sentry box in Paris in which the sentry on duty committed suicide by hanging. Another soldier was assigned to the same duty, and within three weeks took his life by similar means. Still another succeeded to the post, and in a short time met a similar fate. When these events were reported to Emperor Louis Napoleon, he ordered the sentry box removed and destroyed.
There are many instances on record in which corpses have been traced through psychometric influence. Attempts have also been made to employ it in criminology with varying results. In his book Thirty Years of Psychical Research (1923), Charles Richet narrates the experience of a Dr. Dufay with a nonprofessional somnambulist called Marie. He handed her something in several folds of paper. She said that the paper contained something that had killed a man. A rope? No. A necktie, she continued. The necktie had belonged to a prisoner who hanged himself because he had committed a murder, killing his victim with a gouet (a woodman's hatchet). Marie indicated the spot where the gouet was thrown on the ground. The gouet was found in the place indicated.
While most psychometrists give their readings in a normal state, a few are hypnotized. Maria Reyes de Z. of Mexico, with whom Gustav Pagenstecher conducted a series of successful experiments, belongs to the latter class. From a shell picked up on the beach of Vera Cruz she gave the following reading: "I am under water and feel a great weight pressing upon my body. I am surrounded by fishes of all kinds, colors, shapes, and sizes. I see white and pink coral. I also see different kinds of plants, some of them with large leaves. The water has a dark green, transparent colour. I am among the creatures but they do not seem to notice my presence, as they are not afraid of me in spite of touching me as they pass by."
Many psychometrists in the Spiritualist community have asserted that they are simply instruments and that spirits do the reading. Trance mediums often ask for objects belonging to the dead to establish contact. It was a habit with Leonora Piper. But other psychics, like Pascal Forthuny, repudiated the theory of spirit intervention and considered psychometry a personal gift, a sensitivity to the influence of the objects possessed. This influence, or emanation, was likened by Waldemar Wasielawski to the "rhabdic force" that he believed bends the rod of the water-witcher while dowsing.
William T. Stead suggested that very slight contact would suffice to impart such personal influence. On one occasion he cut pieces of blank paper from the bottom pages of letters of eminent people, just below the signature of each, and sent them to a Miss Ross marked "No. 1. Lady," "No. 2. Gentleman." The readings were very successful (see Stead's journal, Borderland, October 1895).
The psychometric vision sometimes comes in quickly flashed images and requires an effort of will to slow down, say mediums. According to D'Aute Hooper in Spirit Psychometry, "It would be impossible to follow up and write the impressions as they pass through my consciousness. It is far too rapid. They are like cinematographic pictures. I seem to fly, and at other times I seem to be the piece of stone, without thinking power but seeing things and happenings around me."
The scope of the visions has been described as ranging from something small to something encompassing the whole room. There is no definite order in their emergence. The picture is kaleidoscopic, oscillating between periods of time, but the images of more important events seem to hold greater sway, say mediums.
The exercise of the faculty requires a relaxed, receptive mind. After the object is touched, some psychometrists feel they are immediately at the location; others mentally travel there first. Some may tear off a piece of paper from an envelope and put it into their mouths. Others are satisfied to handle an object, or hold it wrapped up in their hands.
As a rule, a clue containing an "influence" is indispensable for psychometric readings. But experiments with exceptional psychics led Joseph Buchanan to the conclusion that the clue may be supplanted by an index, for instance, by a name written on a piece of paper. Such cases appear to be rare.
It is usually said that a medium cannot get a reading for himself or herself by psychometry. An incident told some years ago in the journal Light is therefore very interesting. E. A. Cannock was handed, without her knowing the origin, a broad piece of elastic that was actually her own. She not only gave a character reading of herself, but also made a prediction that proved to be correct.
It is said that glass which has covered an engraving retains an image of it and that by some processes, such as the use of mercury vapor, this image can be developed. There is a suggestion of some similar effect in an incident related by Elizabeth Denton. She had entered a car from which the passengers had gone to dinner and was surprised to see all the seats occupied. She later recalled:
"Many of them were sitting perfectly composed, as if, for them, little interest was attached to this station, while others were already in motion (a kind of compressed motion), as if preparing to leave. I thought this was somewhat strange, and was about turning to find a vacant seat in another car, when a second glance around showed me that the passengers who had appeared so indifferent were really losing their identity, and, in a moment more, were invisible to me. I had sufficient time to note the personal appearance of several; and taking a seat, I awaited the return of the passengers, thinking it more than probable I might find them the prototypes of the faces and forms I had a moment before so singularly beheld. Nor was I disappointed. A number of those who returned to the cars I recognized as being, in every particular, the counterparts of their late, but transient representatives."
Psychometric impressions may come so spontaneously as to seriously distract the medium in the daily course of life. The British medium Bessie Williams complained of this trouble. The Dutch psychometrist Lotte Plaat said she could not go into the British Museum in London because she felt that the exhibits were literally shouting their history. By a strong effort of will, however, such impressions can usually be dispelled.
Buchanan made a suggestion to test direct writing by spirits by submitting it to psychometric reading. He thought that if the writing was purely the product of the medium, the reading would give the medium's character; if not, the character of the spirit author would be described. The experiments were unsuccessful, however, because he had seemingly overlooked the complications of the ectoplasm from which the "spirit" hand was said to be formed. If the writing was done by a materialized hand built out of the bodily substance of the medium, it might bear as little impression of the spirit as a dictated text bears of the dictator, he reasoned.
As already mentioned, psychometry has been utilized to gain information about hauntings. "That the victim of some century old villainy," writes Sir Arthur Conan Doyle in his book The Edge of the Unknown (1930), "should still in her ancient garments frequent in person the scene of her former martyrdom, is indeed, hard to believe. It is more credible, little as we understand the details, that some thought-form is used and remains visible at the spot where great mental agony has been endured." But he was not unmindful of the difficulties of such speculation, adding, "Why such a thought-form should only come at certain hours, I am compelled to answer that I do not know." The psychometric impression should always be there and should always be perceived, if the theory is correct. The ghost apparently is not; its ways are strange.
Searching for Explanations
Psychometry was identified by Buchanan and entered into the terminology of Spiritualism at a time when a somewhat elaborate and detailed understanding of the spirit world was being conceived in order to explain the many varied phenomena emerging in the séance room. Many of these ideas were offered in an attempt to explain one mystery, such as psychometry, by another, such as ectoplasm. Much of that speculation disappeared along with the mass of physical phenomena. Stephan Ossowiecki, a prominent modern psychometrist, has noted correctly that should the psychometric speculation be even partially true, it would explain nothing. Psychometry is just a word and not an explanation, he said. Its essential nature, its exercise, is a mystery. He writes:
"I begin by stopping all reasoning, and I throw all my inner power into perception of spiritual sensation. I affirm that this condition is brought about by my unshakable faith in the spiritual unity of all humanity. I then find myself in a new and special state in which I see and hear outside time and space…. Whether I am reading a sealed letter, or finding a lost object, or psychometrising, the sensations are nearly the same. I seem to lose some energy; my temperature becomes febrile, and the heartbeats unequal. I am confirmed in this supposition because, as soon as I cease from reasoning, something like electricity flows through my extremities for a few seconds. This lasts a moment only, and then lucidity takes possession of me, pictures arise, usually of the past. I see the man who wrote the letter, and I know what he wrote. I see the object at the moment of its loss, with the details of the event; or again, I perceive or feel the history of the thing I am holding in my hands. The vision is misty and needs great tension. Considerable effort is required to perceive some details and conditions of the scenes presented. The lucid state sometimes arises in a few minutes, and sometimes it takes hours of waiting. This largely depends on the surroundings; skepticism, incredulity, or even attention too much concentrated on my person, paralyses quick success in reading or sensation."
Illuminating as this subjective account is, it conveys little about the specific nature of psychometric influence. Gustav Pagenstecher conjectured as follows:
"The associated object which practically witnessed certain events of the past, acting in the way of a tuning fork, automatically starts in our brain the specific vibrations corresponding to the said events; furthermore, the vibrations of our brain once being set in tune with certain parts of the Cosmic Brain already stricken by the same events, call forth sympathetic vibrations between the human brain and the Cosmic Brain, giving birth to thought pictures which reproduce the events in question."
Spiritualist Sir Arthur Conan Doyle, in plainer language, compared psychometric impressions to shadows on a screen. The screen is the ether, "the whole material universe being embedded in and interpenetrated by this subtle material which would not necessarily change its position since it is too fine for wind or any coarser material to influence it." Doyle himself, although by no means psychic, would always be conscious of a strange effect—almost a darkening of the landscape with a marked sense of heaviness—when he was on an old battlefield. A more familiar example of the same faculty may be suspected in the gloom that gathers over the mind of even an average person upon entering certain houses. Such sensitivity may find expression in more subtle and varied forms. "Is not the emotion felt on looking at an old master [painting] a kind of thought transference from the departed?" asked Sir Oliver Lodge. The query cannot be answered conclusively, since the labels attached to psychic phenomena are purely arbitrary.
Akashic Records
Attempts at such a synthesis have been made by Theosophists. In his introduction to W. Scott-Elliot's The Story of Atlantis and the Lost Lemuria (1904), the first book drawn from the so-called akashic records, A. P. Sinnett explains that the pictures of memory are imprinted on some nonphysical medium; they are photographed by nature on some imperishable page of superphysical matter. They are accessible, but the interior spiritual capacities of ordinary humanity are as yet too imperfectly developed to establish touch, he says. He further notes:
"But in a flickering fashion, we have experience in ordinary life of efforts that are a little more effectual. Thought-transference is a humble example. In that case, 'impressions on the mind' of one person, Nature's memory pictures with which he is in normal relationship, are caught up by someone else who is just able, however unconscious of the method he uses, to range Nature's memory under favourable conditions a little beyond the area with which he himself is in normal relationship. Such a person has begun, however slightly, to exercise the faculty of astral clairvoyance."
Such highly speculative ideas are beyond the scope of psychical research, but the concept of the akashic records in its philosophical depths can be partly supported by an astronomical analogy. Because of the vastness of interstellar distances it takes hundreds of thousands of years for light, traveling at the enormous speed of 186,000 miles per second, to reach us from distant stars. Anyone who could look at the Earth from such a distant star would witness, at the present moment, the primeval past. From various distances the creation of our world could be seen as a present reality. Theoretically, therefore, astronomy admits the existence of a scenic record of the world's history. The concept of this cosmic picture gallery and that of the akashic records is similar.
There is no generally validated method of access to such records in sublimated psychometry. However, Theosophist G. R. S. Mead, in his book Did Jesus Live 100 B.C.? (1903), asserted the following regarding akashic research:
"It would be as well to have it understood that the method of investigation to which I am referring does not bring into consideration any question of trance, either self-induced, or mesmerically or hypnotically effected. As far as I can judge, my colleagues are to all outward seeming in quite their normal state. They go through no outward ceremonies, or internal ones for that matter, nor even any outward preparation but that of assuming a comfortable position; moreover, they not only describe, as each normally has the power of description, what is passing before their inner vision in precisely the same fashion as one would describe some objective scene, but they are frequently as surprised as their auditors that the scenes or events they are attempting to explain are not at all as they expected to see them, and remark on them as critically, and frequently as sceptically, as those who cannot 'see' for themselves but whose knowledge of the subject from objective study may be greater than theirs."
Simultaneous Perception of "Memory Records"
One need not go to occultists for psychic experiences in which there is a clear suggestion of memory records existing independently of individual powers of cognition. Something of that nature has been perceived by several people simultaneously, thus suggesting some sort of objectivity.
The Battle of Edge Hill (on the borders of Warwickshire and Oxfordshire, England) was fought on October 22, 1642. Two months later a number of shepherds and village people witnessed an aerial reenactment of the battle, with all the noises of the guns, the neighing of the horses, and the groans of the wounded. The vision lasted for hours and was witnessed by people of reputation for several consecutive days. When rumors of it reached the ears of Charles I, a commission was sent out to investigate. The commission not only reported having seen the vision on two occasions, but actually recognized fallen friends of theirs among the fighters; one was Sir Edmund Varney.
A similar instance was recorded by Pausanias (second century C.E.), according to whom on the plains of Marathon, four hundred years after the great battle, the neighing of horses, the shouts of the victors, the cries of the vanquished, and all the noise of a well-contested conflict were frequently to be heard.
Patrick Walker, the Scottish Presbyterian covenanter, is quoted in Biographia Presbyteriana (1827) as stating that in 1686, about two miles below Lanark, on the water of Clyde, "many people gathered together for several afternoons, where there were showers of bonnets, hats, guns and swords, which covered the trees and ground, companies of men in arms marching in order, upon the waterside, companies meeting companies … and then all falling to the ground and disappearing, and other companies immediately appearing in the same way." But Patrick Walker himself saw nothing unusual occur. About two-thirds of the crowd saw the phenomena; the others saw nothing strange. "Patrick Walker's account," states Andrew Lang in his book Cock Lane and Common Sense (1896), "is triumphantly honest and is, perhaps, as odd a piece of psychology as any on record, thanks to his escape from the prevalent illusion, which, no doubt, he would gladly have shared."
Under the pseudonyms Miss Morrison and Miss Lamont, Anne Moberly, daughter of the bishop of Salisbury, and Eleanor Jourdain published in 1911 a remarkable book entitled An Adventure, in which they claim that in 1901 and 1902 they had a simultaneous vision, on the grounds of Versailles, of the place as it was in 1789. Some time after the first publication of their account of their Versailles adventure, testimony was given by people who lived in the neighborhood of Versailles that they also had seen the mysterious appearances, the strange phenomena being witnessed only on the anniversary of the attack on Versailles during the French Revolution. The most inexplicable feature of the story is that the people of the eighteenth century saw, heard, and spoke to the people of the twentieth century, who never doubted at the time that they were in communication with real individuals.
Psychometric Premonitions
Another class of phenomena could be classified as psycho-metric foreshadowings of the future. The report on the Census of Hallucinations made by the Society for Psychical Research in Great Britain in 1889 recorded one incident concerning a solitary excursion to a lake. The individual noted:
"My attention was quite taken up with the extreme beauty of the scene before me. There was not a sound or movement, except the soft ripple of the water on the sand at my feet. Presently I felt a cold chill creep through me, and a curious stiffness of my limbs, as if I could not move, though wishing to do so. I felt frightened, yet chained to the spot, and as if impelled to stare at the water straight in front of me. Gradually a black cloud seemed to rise, and in the midst of it I saw a tall man, in a suit of tweed, jump into the water and sink. In a moment the darkness was gone, and I again became sensible of the heat and sunshine, but I was awed and felt eery…. A week afterwards Mr. Espie, a bank clerk (unknown to me) committed suicide by drowning in that very spot. He left a letter for his wife, indicating that he had for some time contemplated death."
Princess Karadja narrates in the Zeitschrift für Metapsychische Forschung (March 15, 1931) a story of a personal experience of the late Count Buerger Moerner that contains this incident:
"Passing through the little garden and glancing in at the window as he approached the house (looking for public refreshment) the Count was horrified to see the body of an old woman hanging from a ceiling beam. He burst into the room with a cry of horror, but once across the threshold was stunned with amazement to find the old woman rising startled from her chair, demanding the reason of his surprising intrusion. No hanging body was to be seen and the old lady herself was not only very much alive, but indignant as well…. Some days later, being again in that locality, he decided to visit the hut once more, curious to see if by some peculiarity of the window pane he might not have been observing an optical illusion. Nearing the hut through the garden as before, the same terrible sight met his eye. This time, however, the Count stood for some minutes studying the picture, then after some hesitation knocked at the door. No answer, even to repeated knocks, until at length Count Moerner opened the door and entered to find what he saw this time was no vision. The old woman's body was indeed hanging from the beam. She had committed suicide."
Psychometry remains a popular practice in both psychic and Spiritualist circles. There has been little work done on it in parapsychology since it is difficult to quantify results and many consider it but a variation on clairvoyance. It may also be seen as merely a helpful tool to assist the psychic into the proper state for receiving clairvoyant impressions.
Sources:
Buchanan, J. Rhodes. Manual of Psychometry: The Dawn of a New Civilization. Boston: Dudley M. Holman, 1885.
Butler, W. E. How to Develop Psychometry. London: Aquarian Press; New York: Samuel Weiser, 1971.
Denton, William, and Elizabeth Denton. Nature's Secrets, or Psychometric Researches. London: Houston & Wright, 1863.
Ellis, Ida. Thoughts on Psychometry. Blackpool, England, 1899.
[Hooper, T. D'Aute]. Spirit Psychometry and Trance Communications by Unseen Agencies. London: Rider, 1914.
Pagenstecher, Gustav. "Past Events Seership." Proceedings of the American Society for Psychical Research 16 (January 1922).
Prince, Walter Franklin. "Psychometrical Experiments with Senora Maria Reyes de Z." Proceedings of the American Society for Psychical Research 15 (1921).
Richet, Charles. Thirty Years of Psychical Research. N.p., 1923.
Verner, A. Practical Psychometry (pamphlet). Blackpool, England, 1903.
Psychometry
Psychometry
Psychometry or psychometrics is a field of psychology which uses tests to quantify psychological aptitudes, reactions to stimuli, types of behavior, etc., in an effort to develop reliable scientific models that can be applied to larger populations.
Reliability
Reliability refers to the consistency of a test, or the degree to which the test produces approximately the same results over time under similar conditions. Ultimately, reliability can be seen as a measure of a test’s precision.
A number of different methods for estimating reliability can be used, depending on the types of items on the test, the characteristic(s) a test is intended to measure, and the test user’s needs. The most commonly used methods to assess reliability are the test-retest, alternate form, and split-half methods. Each of these methods attempts to isolate particular sources and types of error.
Error is defined as variation due to extraneous factors. Such factors may be related to the test-taker; for instance, being tired or ill on the day of the test may affect the score. Error may also be due to environmental factors in the testing situation, such as an uncomfortable room temperature or distracting noise.
Test-retest methods look at the stability of test scores over time by giving the same test to the same people after a reasonable time interval. These methods try to separate out the amount of error in a score related to the passing of time. In test-retest studies, scores from the first administration of a test are compared mathematically through correlation with later score(s).
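To make the procedure concrete, the following minimal Python sketch (using made-up scores for five test-takers) computes a test-retest reliability coefficient as the Pearson correlation between two administrations of the same test.

```python
# Minimal sketch with hypothetical data: test-retest reliability as the
# Pearson correlation between two administrations of the same test.
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

first_administration = [12, 15, 9, 20, 17]    # hypothetical raw scores
second_administration = [11, 16, 10, 19, 18]  # same people, retested later

print(round(pearson_r(first_administration, second_administration), 3))
# A coefficient near 1.0 indicates scores that are stable over time.
```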
Test-retest methods have some serious limitations, one of the most important being that the first test-taking experience may affect performance on the second test administration. For instance, the individual may perform better at the second testing, having learned from the first experience. Moreover, tests rarely show perfect test-retest reliability because many factors unrelated to the tested characteristic may affect the test score. In addition, test-retest methods are only suitable to use with tests of characteristics that are assumed to be stable over time, such as intelligence. They are unsuitable for tests of unstable characteristics like emotional states such as anger or anxiety.
The alternate-form method of assessing reliability is very similar to test-retest reliability except that a different form of the test in question is administered the second time. Here two forms of a test are created to be as similar as possible, so that individual test items cover the same material at the same level of ease or difficulty. The tests are administered to a sample and the scores on the two tests are correlated to yield a coefficient of equivalence. A high coefficient of equivalence indicates the overall test is reliable in that most or all of the items seem to be assessing the same characteristic. Low coefficients of equivalence indicate the two test forms are not assessing the same characteristic.
Alternate form administration may be varied by the time interval between testing. Alternate form with immediate testing tries to assess error variance in scores due to various errors in content sampling. Alternate form with delayed administration tries to separate out error variance due to both the passage of time and to content sampling. Alternate-form reliability methods have many of the same limitations as test-retest methods.
Split-half reliability methods consist of a number of methods used to assess a test’s internal consistency, or the degree to which all of the items are assessing the same characteristic. In split-half methods a test is divided into two forms and scores on the two forms are correlated with each other. This correlation coefficient is called the coefficient of reliability. The most common way to split the items is to correlate even-numbered items with odd-numbered items.
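In practice, the correlation between two half-tests is usually adjusted upward with the Spearman-Brown formula, since each half is shorter, and therefore less reliable, than the full test. The following Python sketch, using hypothetical right/wrong item responses, illustrates the odd-even split described above together with that adjustment.

```python
# Illustrative sketch with hypothetical data: split-half reliability from
# 0/1 item responses. The odd- and even-numbered item halves are scored,
# correlated, and the half-test correlation is adjusted with the
# Spearman-Brown formula to estimate full-length reliability.
from math import sqrt

def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return cov / (sqrt(sum((a - mean_x) ** 2 for a in x)) *
                  sqrt(sum((b - mean_y) ** 2 for b in y)))

# Rows are test-takers, columns are items (1 = correct, 0 = incorrect).
responses = [
    [1, 1, 0, 1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 0, 1, 1, 1, 0],
]

odd_half = [sum(row[0::2]) for row in responses]   # items 1, 3, 5, 7
even_half = [sum(row[1::2]) for row in responses]  # items 2, 4, 6, 8

r_half = pearson_r(odd_half, even_half)
full_length = (2 * r_half) / (1 + r_half)          # Spearman-Brown correction
print(round(r_half, 3), round(full_length, 3))
```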
Validity
Validity refers to how well a test measures what it intends to, along with the degree to which a test validates intended inferences. Thus a test of achievement motivation should assess what the researcher defines as achievement motivation. In addition, results from the test should, ideally, support the psychologist’s insights into, for example, the individual’s level of achievement in school, if that is what the test constructors intended for the test. Most psychometric research on tests focuses on their validity. Because psychologists use tests to make different types of inferences, there are a number of different types of validity. These include content validity, criterion-related validity, and construct validity.
Content validity refers to how well a test covers the characteristic(s) it is intended to measure. Thus test items are assessed to see if they are: (a) tapping into the characteristic(s) being measured; (b) comprehensive in covering all relevant aspects; and (c) balanced in their coverage of the characteristic(s) being measured. Content validity is usually assessed by careful examination of individual test items and their relation to the whole test by experts in the characteristic(s) being assessed.
Content validity is a particularly important issue in tests of skills. Test items should tap into all of the relevant components of a skill in a balanced manner, and the number of items for various components of the skill should be proportional to how they make up the overall ability. Thus, for example, if it is thought that addition makes up a larger portion of mathematical abilities than division, there should be more items assessing addition than division on a test of mathematical abilities.
Criterion-related validity deals with the extent to which test scores can predict a certain behavior, referred to as the criterion. Concurrent and predictive validity are two types of criterion-related validity. Predictive validity looks at how well scores on a test predict certain behaviors, such as achievement, or scores on other tests. For instance, to the extent that scholastic aptitude tests predict success in future education, they will have high predictive validity. Concurrent validity is essentially the same as predictive validity except that the criterion data are collected at about the same time as the predictor test scores. The correlation between test scores and the researcher’s designated criterion variable indicates the degree of criterion-related validity. This correlation is called the validity coefficient.
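The sketch below, with entirely hypothetical aptitude scores and later grade-point averages, shows how a predictive validity coefficient reduces to a simple correlation between predictor and criterion (it uses statistics.correlation, available in Python 3.10 and later).

```python
# Illustrative sketch only: a criterion-related (predictive) validity
# coefficient is the correlation between predictor test scores and a
# later criterion measure. All figures are hypothetical.
from statistics import correlation

aptitude_scores = [520, 610, 450, 700, 580, 490]  # predictor test scores
first_year_gpa = [2.9, 3.4, 2.5, 3.8, 3.1, 2.8]   # criterion, collected later

validity_coefficient = correlation(aptitude_scores, first_year_gpa)
print(round(validity_coefficient, 3))
```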
Construct validity deals with how well a test assesses the characteristic(s) it is intended to assess. Thus, for example, with a test intended to assess an individual’s sense of humor one would first ask “What are the qualities or constructs that comprise a sense of humor?” and then, “Do the test items seem to tap those qualities or constructs?” Issues of construct validity are central to any test’s worth and utility, and they usually play a large part in the early stage of constructing a test and initial item construction. There is no single method for assessing a test’s construct validity. It is assessed using many methods and the gradual accumulation of data from various studies. In fact, estimates of construct validity change constantly with the accumulation of additional information about how the test and its underlying construct relate to other variables and constructs.
KEY TERMS
Coefficient— In statistics, a number that expresses the degree of relationship between variables. It is most commonly used with a qualifying term that further specifies its meaning, as in “correlation coefficient.”
Correlation— A statistical measure of the degree of relationship between two variables.
Error variance— The amount of variability in a set of scores that cannot be assigned to controlled factors.
Normative data— A set of data collected to establish values representative of a group, such as the mean, range, and standard deviation of their scores. It is also used to get a sense of how a skill or characteristic is distributed in a group.
Norms— Values that are representative of a group and that may be used as a baseline against which subsequently collected data is compared.
Reliability— The consistency of a test, or the degree to which the test produces approximately the same results under similar conditions over time.
Representative sample— Any group of individuals that accurately reflects the population from which it was drawn on some characteristic(s).
Sample— Any group of people, animals, or things taken from a particular population.
Validity— How well a test measures what it intends to, as well as the degree to which a test validates scientific inferences.
Variance— A measure of variability in a set of scores that may be due to many factors, such as error.
In assessing construct validity, researchers often look at a test’s discriminant validity, the degree to which scores on the test do not correlate with factors that, theoretically, they should not be related to. For example, scores on a test designed to assess artistic ability might not be expected to correlate very highly with scores on a test of athletic ability. A test’s convergent validity refers to the degree to which its scores do correlate with factors with which they theoretically should. Many different types of studies can be done to assess an instrument’s construct validity.
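As a rough illustration with invented data, the following sketch compares a hypothetical new test of artistic ability with a second art measure (convergent evidence) and with an athletic measure (discriminant evidence). It uses statistics.correlation, available in Python 3.10 and later.

```python
# Illustrative sketch only (hypothetical data): convergent validity is
# supported when the new test correlates substantially with a measure of
# the same construct, and discriminant validity when it correlates only
# weakly with a measure of an unrelated construct.
from statistics import correlation

new_art_test = [14, 22, 9, 18, 25, 11, 20]
other_art_test = [15, 20, 10, 17, 24, 12, 22]  # same construct
athletic_test = [21, 18, 25, 10, 22, 15, 19]   # unrelated construct

print("convergent r:", round(correlation(new_art_test, other_art_test), 2))
print("discriminant r:", round(correlation(new_art_test, athletic_test), 2))
```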
Item analysis
In constructing various tests, researchers perform numerous item analyses for different purposes. As mentioned previously, at the initial stages of test construction, construct validity is a major concern, so items are analyzed to see if: (a) they tap the characteristic(s) in question, and (b) taken together, the items comprehensively capture the qualities of the characteristic being tested. After the items have been designed and written, they will often be administered to a small sample to see if they are understood as the researcher intended, to examine whether they can be administered with ease, and to see if any unexpected problems crop up. Often the test will need to be revised.
The test, revised as needed, is then administered to the sample of interest, and the difficulty of the items is assessed by noting the number of incorrect and correct responses to individual items. Often the proportion of test takers correctly answering an item will be plotted in relation to their overall test scores. This provides an indication of item difficulty in relation to an individual’s ability, knowledge, or particular characteristics. Item analysis procedures are also used to see if any items are biased toward or against certain groups. This is done by identifying those items certain groups of people tend to answer incorrectly.
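The sketch below, again with hypothetical right/wrong responses, computes each item's difficulty as the proportion of test-takers answering it correctly and a simple item-total correlation as a crude index of how well the item tracks overall performance (Python 3.10 or later for statistics.correlation).

```python
# Illustrative sketch only: item difficulty as the proportion of test-takers
# answering each item correctly, plus a simple item-total correlation as a
# rough discrimination index. Responses are hypothetical.
from statistics import correlation

responses = [        # rows = test-takers, columns = items (1 = correct)
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 0],
]

total_scores = [sum(row) for row in responses]

for item in range(len(responses[0])):
    item_scores = [row[item] for row in responses]
    difficulty = sum(item_scores) / len(item_scores)     # proportion correct
    item_total = correlation(item_scores, total_scores)  # crude discrimination
    print(f"item {item + 1}: difficulty = {difficulty:.2f}, "
          f"item-total r = {item_total:.2f}")
```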
It should be noted that in test construction, test refinement continues until validity and reliability are adequate for the test’s goals. Thus item analysis, validity, or reliability data may prompt the researcher to return to earlier stages of the test design process to further revise the test.
Normative data
When the researcher is satisfied with the individual items of a test, and reliability and validity are established at levels suitable to the intended purposes of the test, normative data is collected. Normative data is obtained by administering the test to a representative sample in order to establish norms. Norms are values that are representative of a group and that may be used as a baseline against which subsequently collected data is compared. Normative data helps get a sense of the distribution or prevalence of the characteristic being assessed in the larger population. By collecting normative data, various levels of test performance are established and raw scores from the test are translated into a common scale.
Common scales are created by transforming raw test scores into a common scale using various mathematical methods. Common scales allow comparison between different sets of scores and increase the amount of information a score communicates. For example, intelligence tests typically use a common scale in which 100 is the average score and standard deviation units are 15 or 16.
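For example, the following sketch converts a set of hypothetical raw scores to a scale with a mean of 100 and a standard deviation of 15, the convention mentioned above for many intelligence tests.

```python
# Illustrative sketch only: rescaling hypothetical raw scores to a common
# scale with mean 100 and standard deviation 15.
from statistics import mean, stdev

raw_scores = [34, 41, 29, 50, 38, 45, 33]   # hypothetical raw test scores

m, s = mean(raw_scores), stdev(raw_scores)  # sample mean and standard deviation
scaled = [round(100 + 15 * (x - m) / s) for x in raw_scores]

print(scaled)  # each score now expresses distance from the mean in SD units
```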
Current research/trends
Currently many new psychometric theories and statistical models are being proposed that will probably lead to changes in test construction. In addition, the use of computers to administer tests interactively is on the rise. Finally, studies of test bias and attempts to diminish it will likely increase in response to lawsuits challenging various occupational and school decisions based on test results.
Resources
BOOKS
Groth-Marnat, Gary. Handbook of Psychological Assessment. 4th ed. New York: John Wiley and Sons, 2003.
Kaplan, Robert M. and Dennis P. Saccuzzo. Psychological Testing: Principles, Applications, and Issues. Belmont, CA: Wadsworth Publishing, 2004.
Mitchell, Joel. An Introduction to the Logic of Psychological Measurement. Hillsdale, NJ: Erlbaum, 1990.
Marie Doorey
psychometry
psy·chom·e·try / sīˈkämətrē/ • n. 1. the supposed ability to discover facts about an event or person by touching inanimate objects associated with them. 2. another term for psychometrics. DERIVATIVES: psy·chom·e·trist / -trist/ n.
psychometrics
psy·cho·met·rics / ˌsīkəˈmetriks/ • pl. n. [treated as sing.] the science of measuring mental capacities and processes. DERIVATIVES: psy·chom·e·tri·cian / -məˈtrishən/ n.