Standardized Tests
Standardized testing as a gateway to higher education was first established in the United States with the development of the College Entrance Examination Board in 1900. This board created a test designed to standardize admissions to elite universities in the northeastern United States and to encourage the development of a common curriculum among elite boarding schools (Chandler 1999; Lemann 1999). The original test consisted of essays and was not designed for mass testing. The College Board, however, created a broader test of IQ in 1926, the Scholastic Aptitude Test, commonly known today as the SAT I. This test was intended to help elite schools identify high-achieving students in nonelite high schools. In the early years, it also distinguished between white-collar students who were suitable for college and blue-collar students presumed to be ill prepared for such an education (Blau et al. 2003). By the mid-1950s, the demand for college education soared, spawning the development of the American College Testing Program (currently known as the ACT) in 1959. This test is the main alternative to the SAT. The origins of the SAT and ACT clarify their differing approaches. The SAT test was originally based on Army IQ tests as a measure of intelligence, whereas the ACT was historically designed to measure achievement rather than intelligence or aptitude.
Despite these differences in intent, the tests are similar in structure. The SAT I (also known as the SAT Reasoning Test) is designed to measure students’ critical thinking and problem-solving skills. The test consists of three sections. The critical-reading section includes questions on analytical reading, reading comprehension, and sentence completion. The writing section evaluates students’ ability to write clearly, concisely, and competently. It also assesses students’ ability to critically assess sentence and paragraph structure, as well as grammar. Finally, the mathematics section includes questions covering algebra, geometry, statistics, and advanced data analysis.
The ACT is similar to the SAT, but it has four broad sections and an optional writing test. The English section evaluates writing and rhetorical skills. The mathematics section includes questions on algebra, geometry, and trigonometry. The reading section measures reading comprehension. The science section tests scientific skills including reasoning, analysis, and problem solving. Finally, the optional writing section tests writing skills.
The SAT and ACT are widely utilized among students and colleges. In 2006 about 1.5 million high school seniors took the SAT and approximately 1.2 million students took the ACT. Most colleges accept either the SAT or ACT for admissions since standards for comparing these scores are easily accessible.
The importance to higher education of standardized testing persists into graduate school, but testing tools are more diverse for graduate admissions than for undergraduate admissions. Professional schools require standardized tests that emphasize skills required by their specific disciplines. These include the Law School Admission Test (LSAT), the Medical College Admission Test (MCAT), and the Graduate Management Admission Test (GMAT). A more general and widely used standardized testing tool is the Graduate Record Examination (GRE). In 2005 nearly 500,000 persons took the GRE, accounting for 35 percent of persons with bachelor’s degrees (National Center for Education Statistics 2006). The GRE has three sections. The verbal-reasoning section tests the respondent’s ability to recognize concepts and to analyze information and relationships among parts of sentences. The quantitative-reasoning section tests algebra, geometry, data analysis, and quantitative reasoning. Finally, the analytical-writing section assesses the respondent’s ability to write clearly, effectively, logically, coherently, and competently.
Key criticisms of standardized testing that have generated widespread sociological interest are: (1) the neglect of environmental differences among students, particularly those associated with cultural and racial differences; and (2) testing bias and validity. Criticisms of cultural and racial bias abound within the literature. One notable example is Tukufu Zuberi’s Thicker than Blood (2001). Zuberi contends that the IQ test, the predecessor of modern standardized testing, developed out of the eugenics movement. This movement was committed to identifying biological differences between the races and classes. Proponents posited that racial inequalities in society were biologically determined because whites were perceived to be genetically superior. According to Zuberi, IQ tests provided statistical support for eugenics because white students scored higher on these tests than black and immigrant students.
During this period, many scholars argued that IQ tests, which measured math and verbal skills, accurately reflected biological differences in intelligence. Scholars influenced by this tradition contend that differences in test scores between blacks and whites reflect inherent biological differences between the races (Herrnstein and Murray 1994). However, sufficient data have not been provided to support this hypothesis. Today, most scholars acknowledge that standardized testing is biased and reflects more than biological differences between students.
Christopher Jencks, an influential scholar in this debate, has identified multiple biases in standardized testing. First, he argues that standardized tests neglect environmental differences between students, which creates bias. Comparisons of scores among racial groups are problematic because the IQ test was originally designed to compare the mental ability of students who were raised in comparable environments with similar levels of educational opportunity. Yet mass testing neglects environmental differences between students. This proposition has received widespread empirical support. William Rodgers and William Spriggs (1996) offer one of the most methodologically sophisticated assessments of environmental background factors by showing that a consideration of family and educational background reduces racial differences in test scores. However, they also find that the impact of the environment on test scores varies by race. Furthermore, racial biases exist in the measurement of standardized tests because components of these tests have different long-term effects on individuals’ wages, depending on race and gender. Thus, Rodgers and Spriggs argue that standardized tests are racially biased because they measure different factors for different races.
Standardized tests are also biased in content (Jencks 1998). This is obvious when considering the language of the test. For students whose primary language is not English, standardized testing measures both English proficiency and scholastic achievement. As a result, these tests do not accurately reflect the achievement or readiness for college of language-minority students (LaCelle-Peterson 2000). Less-obvious content biases are prevalent in vocabulary words and essay topics.
Standardized tests are also biased methodologically if they claim (or are assumed) to measure ability because groups historically subjugated in society, including blacks, women, and the poor, are disadvantaged in this situation (Jencks 1998). Indeed, researchers have found that these tests create anxiety among African Americans and students of low socioeconomic status, who underperform on tests perceived to measure intellectual ability (Croizet and Dutrevis 2004; Steele and Aronson 1998). Additionally, women underperform when gender stereotypes are made salient (Benbow 1988).
Judith Blau and colleagues argue that blacks and whites place different significance on achievement tests. Whites believe that these tests measure ability, while blacks perceive unfair discrimination in testing practices. Thus, they conclude that black students and their parents place less weight on standardized test scores when considering postsecondary educational goals. Blau finds that test scores are a better predictor of educational attainment for white students than for black students. Furthermore, low-scoring black students are more likely than low-scoring white students to pursue postsecondary education. Thus, low scores are more likely to discourage white students, suggesting cultural differences in the value placed on tests (Blau et al. 2003). Further research is needed to determine how Blau’s theory applies to gender and class issues. Preliminary research suggests that females place less value on mathematical portions of standardized test scores due to stereotype threats (Lesko and Corpus 2006).
Differences in the ability of standardized tests to predict future outcomes highlight an additional criticism of standardized testing: The tests are not valid because they are not accurate predictors of students’ success in college or graduate school (Jencks 1998). Indeed, many scholars have found that standardized test scores do not predict grade point average in college (Gandara and Lopez 1998; Fleming 2000, 2002) or in graduate school (Oldfield 1994, 1996), and they do not predict success in the labor market (Blackburn 2004; Rodgers and Spriggs 1996).
The effect of the debates on standardized testing is evident. The title of the SAT has changed multiple times from the Scholastic Aptitude Test (a test of ability) to the Scholastic Assessment Test (this more general term suggests that the test measures more than ability) and finally to simply the SAT. In addition, the College Board has altered testing questions on the SAT to reduce cultural bias introduced from disparate knowledge and interests between groups in society. Furthermore, it has cut sections of the test to reduce reliance on vocabulary and increase reliance on verbal problem-solving skills. Even with these changes, however, racial and gender disparities persist. In 2004 the average SAT verbal score was 508 for college-bound high school seniors, ranging from 430 for black seniors to 451 and 528 for Mexican American and white seniors, respectively. Similarly, mathematics scores ranged from an average of 427 for black students to 531 for white students. ACT scores also vary by race. Average English ACT scores were 20.4 in 2004, ranging from 17.2 for black students to 18.3 for Mexican-American students and 22.5 for white students (Freeman and Fox 2005).
The persistence of the race gap is attributable to differences in family background and educational opportunity. Black students are generally raised in families with fewer resources than white students. Indeed, according to Melvin Oliver and Thomas Shapiro (1995), 63 percent of black households have zero or negative financial assets, meaning that their debt outweighs their assets. Only 28 percent of white families have negative financial assets. Furthermore, white median net worth (defined as the sum of all assets minus debt) is nearly twelve times black median net worth. This has important implications for test scores because students raised in families with greater wealth have the financial resources to prepare for standardized testing and attend college. Indeed, parental wealth and education are the two most important predictors of college attendance (Conley 1999).
The racial gap in test scores also persists because black students have fewer opportunities to prepare for the test. Schooling in the United States is highly segregated by race and socioeconomic status. Roslyn Mickelson (2006) found that predominantly black schools offer fewer SAT prep courses than integrated or predominantly white schools. Furthermore, even when black and white students study in the same schools (i.e., in integrated schools), they are offered different educational opportunities because they are grouped into classes by ability. Black students are more likely to be assigned to “lower-ability” classes than white students with the same grades and test scores. These classes are often taught by less-experienced teachers, and the courses offer a more general education rather than a college-preparatory education. Thus socioeconomic resources and educational opportunities explain the existing gap in standardized test scores by race.
As for gender, the standardized test score disparity is not uniform. Historically, boys and girls had equivalent verbal scores, but boys scored higher in math (Benbow 1988). The math score gap has diminished over time, in part because girls’ educational opportunities have expanded, and they are taking more advanced math courses in high school. In 2004 boys and girls scored 538 and 504, respectively, on the math section of the SAT. Much of this remaining gender gap in test scores develops during high school because girls continue to take less rigorous math courses, and they are less likely than boys to participate in mathematically oriented extracurricular activities (Leahey and Guo 2001; Pallas and Alexander 1983; Vogt Yuan 2005).
It is important to understand what standardized tests measure because standardized testing has gained national recognition with the passage of the No Child Left Behind Act in 2002. This policy initiative requires standardized testing for students in the third through eighth grades and at least once during high school. The primary goal of the legislation is to reduce achievement gaps between students, particularly by race, poverty status, disability, ethnicity, and English proficiency. This act magnifies the significance of standardized testing. If the environmental differences, testing biases, and validity issues discussed here are neglected, standardized testing will be of limited use to educators and policymakers as they seek to close achievement gaps.
SEE ALSO Education, USA; Race and Education
BIBLIOGRAPHY
ACT. http://www.act.org.
Benbow, Camilla Persson. 1988. Sex Differences in Mathematical Reasoning Ability in Intellectually Talented Preadolescents: Their Nature, Effects, and Possible Causes. Behavioral and Brain Sciences 11: 169–183.
Blackburn, M. L. 2004. The Role of Test Scores in Explaining Race and Gender Differences in Wages. Economics of Education Review 23: 555–576.
Blau, Judith, Stephanie Moller, and Lyle V. Jones. 2003. Going to College. In Race in the Schools: Perpetuating White Dominance? ed. Judith R. Blau, 177–202. Boulder, CO: Lynne Rienner.
Chandler, Michael, dir. 1999. Frontline: Secrets of the SAT. Boston: WGBH Educational Foundation. http://www.pbs.org/wgbh/pages/frontline/shows/sats/.
College Board. http://www.collegeboard.com.
Conley, Dalton. 1999. Being Black, Living in the Red: Race, Wealth, and Social Policy in America. Berkeley: University of California Press.
Croizet, Jean-Claude, and Marion Dutrevis. 2004. Socioeconomic Status and Intelligence: Why Test Scores Do Not Equal Merit. Journal of Poverty 8: 91–107.
Educational Testing Service: GRE—Graduate Record Examinations. http://www.ets.org/gre.
Fleming, Jacqueline. 2000. Affirmative Action and Standardized Test Scores. Journal of Negro Education 69: 27–37.
Fleming, Jacqueline. 2002. Who Will Succeed in College? When the SAT Predicts Black Students’ Performance. Review of Higher Education 25: 281–296.
Freeman, Catherine, and Mary Ann Fox. 2005. Status and Trends in the Education of American Indians and Alaska Natives. NCES 2005–108. Washington, DC: National Center for Education Statistics, Department of Education.
Gandara, Patricia, and Elias Lopez. 1998. Latino Students and College Entrance Exams: How Much Do They Really Matter? Hispanic Journal of Behavioral Sciences 21: 17–38.
Herrnstein, Richard J., and Charles Murray. 1994. The Bell Curve: Intelligence and Class Structure in American Life. New York: Free Press.
Jencks, Christopher. 1998. Racial Bias in Testing. In The Black-White Test Score Gap, eds. Christopher Jencks and Meredith Phillips, 55–85. Washington, DC: Brookings Institution Press.
LaCelle-Peterson, Mark. 2000. Choosing Not to Know: How Assessment Policies and Practices Obscure the Education of Language Minority Students. In Assessment: Social Practice and Social Product, ed. Ann Filer, 27–42. London: Routledge.
Leahey, Erin, and Guang Guo. 2001. Gender Differences in Mathematical Trajectories. Social Forces 80: 713–732.
Lemann, Nicholas. 1999. The Big Test: The Secret History of the American Meritocracy. New York: Farrar, Straus and Giroux.
Lesko, Alexandra, and Jennifer H. Corpus. 2006. Discounting the Difficult: How High Math-Identified Women Respond to Stereotype Threat. Sex Roles: A Journal of Research 54: 113–125.
Mickelson, Roslyn. 2006. Segregation and the SAT. Ohio State Law Journal 67: 157–199.
National Center for Education Statistics. 2006. Digest of Education Statistics: 2005. NCES 2006–030. Washington, DC: Department of Education. http://nces.ed.gov/programs/digest/d05.
National Center for Education Statistics. 2005. Trends in Educational Equity of Girls and Women: 2004. NCES 2005–016. Washington, DC: Department of Education. http://nces.ed.gov/pubsearch/pubsinfo.asp?pubid=2005016.
Oldfield, Kenneth. 1994. On the Importance of Informing Students about the Potential Risk Associated with Taking the Graduate Record Exam. Journal of Thought 29: 61–70.
Oldfield, Kenneth. 1996. The Political and Economic Reasons the Graduate Record Examination Persists Despite Its Generally Low Predictive Validity. Journal of Thought 31: 55–68.
Oliver, Melvin L., and Thomas M. Shapiro. 1995. Black Wealth/White Wealth: A New Perspective on Racial Inequality. New York: Routledge.
Pallas, Aaron M., and Karl L. Alexander. 1983. Sex Differences in Quantitative SAT Performance: New Evidence on the Differential Coursework Hypothesis. American Educational Research Journal 20: 165–182.
Rodgers, William M., and William E. Spriggs. 1996. What Does the AFQT Really Measure: Race, Wages, Schooling, and the AFQT Score. Review of Black Political Economy 24: 13–47.
Steele, Claude, and Joshua Aronson. 1998. Stereotype Threat and the Test Performance of Academically Successful African Americans. In The Black-White Test Score Gap, eds. Christopher Jencks and Meredith Phillips, 401–430. Washington, DC: Brookings Institution Press.
Vogt Yuan, Anastasia. 2005. Sex Differences in School Performance During High School: Puzzling Patterns and Possible Explanations. Sociological Quarterly 46: 299–321.
Zuberi, Tukufu. 2001. Thicker than Blood: How Racial Statistics Lie. Minneapolis: University of Minnesota Press.
Stephanie Moller
Stephanie Potochnick
Standardized Tests
Standardized tests have a long history in American education. Beginning in the late nineteenth century, these tests were largely used to make decisions about college admission and high school graduation. Following World War II, standardized tests were administered more broadly. Many educators welcomed the tests as tools that would create empirical data about student performance, thus making educational decision-making processes more objective and scientific. After the Brown v. Board of Education decision in 1954, however, tests began to serve a different function. As increasing numbers of African-American and Latino students were integrated into white schools, tests began to serve gatekeeper functions against minority students. In the early 1970s, the College Board, the governing body for the Scholastic Aptitude Test (SAT), began to keep statistics on race. What was largely suspected was then made evident: There was a significant gap between the test scores of blacks and whites. In addition, this gap was not confined to the SAT but manifested across the standardized test world, and while shrinking at times, it has failed to narrow significantly. Furthermore, due to political and social changes in the 1980s and 1990s, standardized tests are having a greater impact on the lives of American students than ever before. The negative impact on African American and Latino students has been significant.
NO CHILD LEFT BEHIND
No Child Left Behind (NCLB) became national law on January 8, 2002. The purpose of the legislation was to increase American academic standards and performance and to shrink the performance gap between minority and white students. The law has been met with sharp criticism from parents, students, administrators, politicians, teachers, and concerned communities. A key point of contention is the sole use of high-stakes tests to make critical decisions about students’ education. The law requires that, as of the 2005–2006 school year, students be tested in reading and mathematics in the third through eighth grades, with science tests added in the 2007–2008 term. Before implementation of NCLB, only six states tested students at this rate. NCLB mandates these tests despite caveats from the preeminent guide for determining test validity, the Standards for Educational and Psychological Testing, which states that no long-term decisions about a student’s education should be made on the basis of a single score.
The pressure to increase scores on high-stakes tests has affected all of American education, but it has had a particularly significant effect on the overall education experience of African Americans and Latinos. Teachers are more likely to “teach to the test,” spending valuable classroom time on test preparation in an effort to meet testing goals. Research has demonstrated, however, that this preparation actually has negative effects. Some schools suspend the established curriculum for a month or more to prepare for these tests. Moreover, teachers who have a large percentage of students of color report that high-stakes tests affect their teaching styles more than do teachers whose students are predominantly white.
Those students who fail the third- or eighth-grade tests are forced to repeat the grade. Theoretically, this would offer another opportunity to learn the skills and subject matter necessary for academic success at the next level. Retention, however, is highly correlated with dropout rates. Since instituting high-stakes tests, the state of Texas has forced African-American and Latino students to repeat grades at a rate almost twice that of whites. Moreover, the dropout rates for those students held back are more than twice those of students who have never been kept back.
Student success on high-stakes tests is highly correlated with teacher experience. Students of color, however, are more likely to be taught by teachers who have less education than teachers of white students. A Stanford University study found that, nationwide, in schools with the highest concentrations of students of color, students have less than a 50 percent chance of being taught math or science by a teacher who has credentials in those fields. A study in California found that schools with the lowest passing rates on California’s high school exit exam have high minority enrollment and double the number of uncertified teachers. Those students who need strong, capable teachers the most, and who are most prone to the negative effects of high-stakes testing, are thus less likely to get the help they need.
Moreover, for students who are not motivated by tests, the inclusion of more test-specific materials and curricular changes has a negative ripple effect. Increased test pressure for a student who is not motivated by test outcomes can serve to cast the tests, and subsequently the entire educational enterprise, in a negative light, thus making academic achievement a low priority and ultimately leading to what Jason Osborne calls academic disidentification. This means that students, to preserve their self-esteem, stop identifying with academic success and disengage from the academic process altogether.
There are also funding implications tied to testing. California and Texas award schools incentive funding for performance on standardized tests. But schools with high concentrations of students of color tend to receive half as many performance awards as schools with high concentrations of whites. At first, this may appear to be the just result of a meritorious policy. But there is a correlation between academic performance and school funding. So the performance awards then serve to exacerbate the problem of performance at the schools by not distributing critical funds to those schools that need them most.
Perhaps most discouraging is that the law’s efficacy and the mandated tests’ effectiveness are highly questionable. According to the National Assessment of Educational Progress, the pressure of NCLB and mandated tests produced no significant gains in reading scores at any grade level in 2003.
TRACKING
Most American public schools have some method of differentiating instruction according to perceptions of ability. The common method is a three-tiered tracking system, with gifted, general, and special education tracks. Grades and teacher input strongly affect placement, but only within the general track. For the special education and gifted tracks, however, tests are relied upon most heavily. Therefore, lower performance on IQ tests directly contributes to the disproportionate representation of African Americans and Latinos on these tracks. Students of color are overrepresented in special education classes in forty-five out of fifty states. African Americans are four times as likely as whites to be designated mentally retarded or “special needs” in five states.
A chief indicator of college success is the rigor of one’s secondary curriculum. When students are deprived of this level of education, their chances of success in college are sharply diminished. African-American and Latino students are less likely to attend schools where advanced or honors/accelerated curricula are taught. Even in schools with such curricula, students of color are less likely to be enrolled in these classes. An over-reliance on test scores to make the determination is a likely culprit. Studies have demonstrated that racial bias on the part of teachers and administrators also plays a significant role in the selection process. In a study of a San Jose school, Latino students were half as likely to take accelerated courses as whites with similar scores.
More subtle forms of tracking still persist in schools that have abolished traditional test-based tracking systems. Parents of middle-class students often successfully make demands of administrators to have their students placed with highly qualified teachers with fewer students of color. As a result, schools that are racially diverse often harbor hidden in-school stratification systems.
THE SAT
Since the College Board began to keep data on race, the gap between the scores of African Americans and Latinos and those of whites has been a lightning rod, drawing attention from parents, politicians, and educators alike. The problem is persistent and remains salient for myriad reasons. African Americans tend to score, on average, 100 points lower on the Scholastic Aptitude Test (SAT) than whites.
One of the chief predictors for success on the SAT, as is the case with postsecondary success, is the rigor of the secondary curriculum. But given the presence of test-based tracking systems, many African Americans are situated on tracks that do not offer instruction in higher-order skills, making success on the SAT problematic in the extreme.
The SAT, despite subtle class-related challenges, is considered by most researchers to be a valid test. The problem occurs with its use. Many colleges and universities rely too heavily on the SAT in making determinations about admissions. This over-reliance has put diversity efforts in jeopardy. Opponents of affirmative action and other programs designed to make adjustments for a history of American racist policies and institutions have used this over-reliance to challenge the constitutionality of affirmative action.
On the opposite end of the spectrum, African Americans who attend elite postsecondary institutions tend to have lower freshman grade point averages than whites with comparable SAT scores. The reasons for this phenomenon are unclear. However, the fact remains that the SAT alone, in those situations, does not conclusively predict success at those schools. Other factors must be examined to explain this difference in performance.
The Stanford University psychologist Claude Steele has researched African-American test performance and discovered a condition he terms “stereotype threat.” Steele found that among the most capable African-American students, test performance is negatively impacted by a desire to avoid being characterized by prevailing perceptions and stereotypes of intellectual inferiority. For those students who are most invested in academics, the possibility of failure creates inordinate psychological and even physiological stresses, which impair performance. Steele also found that even asking students to state their race on a standardized test is enough to reduce notions of efficacy and cause a decline in performance.
High-stakes tests are becoming hardened fixtures in American education, and they are used to make long-term determinations about the educational futures of students. While there are inherent problems in test design, more problems occur in the systems in which these tests exist. Students of color will continue to languish in this system as long as tests play a considerable role in decisions concerning their futures.
BIBLIOGRAPHY
Jencks, Christopher, and Meredith Phillips. 1998. The Black–White Test Score Gap. Washington, DC: Brookings Institution Press.
Osborne, Jason W. 1997. “Race and Academic Disidentification.” Journal of Educational Psychology 89 (4): 728–735.
Ramist, Leonard, Charles Lewis, and Laura McCamley-Jenkins. 1994. Student Group Differences in Predicting College Grades: Sex, Language, and Ethnic Groups. College Board Report No. 93-1. New York: College Board.
Steele, Claude M., and Joshua Aronson. 1995. “Stereotype Threat and the Intellectual Test Performance of African Americans.” Journal of Personality and Social Psychology 69: 797–811.
Bruce Webb
Standardized Tests
Standardized tests are administered in order to measure the aptitude or achievement of the people tested. A distribution of scores for all test takers allows individual test takers to see where their scores rank among others. Well-known examples of standardized tests include "IQ" (Intelligence Quotient) tests, the PSAT (Preliminary Scholastic Aptitude Test) and SAT (Scholastic Aptitude Test) tests taken by high school students, the GRE (Graduate Record Examination) test taken by college students applying to graduate school, and the various admission tests required for business, law, and medical schools.
The "Normal" Curve
The mathematics behind the distribution of scores on standardized tests comes from the fields of probability theory and mathematical statistics. A cornerstone of this mathematical theory is the "Central Limit Theorem," which states that the sum or average of a large number of independent observations tends to follow the bell-shaped normal probability curve illustrated below, whatever the distribution of the individual observations. Because a test score is built up from many separate questions, the distribution of scores across a large group of test takers is often close to normal. This means that most of the scores will cluster symmetrically around the mean or average value of all the scores, with fewer scores farther away from the mean value.
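The clustering described above can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: the function names, the number of test takers, and the choice of 40 items scored uniformly between 0 and 1 are all invented for the example and do not describe any real test.

```python
import random
import statistics


def simulate_total_scores(num_takers=10000, num_items=40, seed=0):
    """Return simulated total scores, each the sum of num_items item results.

    Assumption: every item result is uniform on [0, 1], a flat distribution
    that is not bell-shaped by itself.
    """
    rng = random.Random(seed)
    return [
        sum(rng.random() for _ in range(num_items))
        for _ in range(num_takers)
    ]


def share_within(scores, k):
    """Fraction of scores lying within k standard deviations of the mean."""
    mean = statistics.mean(scores)
    sd = statistics.pstdev(scores)
    return sum(abs(s - mean) <= k * sd for s in scores) / len(scores)


if __name__ == "__main__":
    totals = simulate_total_scores()
    for k in (1, 2, 3):
        print(k, round(share_within(totals, k), 3))
```

Although each simulated item is far from normal, the totals should cluster around their mean in roughly the proportions expected of a normal curve.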
One measure of the spread or dispersion of the observations is called the standard deviation. According to the statistical theory of the normal curve, about 68 percent of all observations will lie within plus or minus one standard deviation of the mean; about 95 percent will lie within plus or minus two standard deviations of the mean (see graph below); and about 99.7 percent will lie within plus or minus three standard deviations of the mean. Standardized test scores are examples of observations that approximately have this property.
Consider, for example, a standardized test for which the mean score is 500 and the standard deviation is 100. This means that about 68 percent of all test takers will have scores that fall between 400 and 600; 95 percent will have scores between 300 and 700; and virtually all of the scores will fall between 200 and 800. In fact, many standardized tests, including the PSAT and SAT, have just such a scale on which 200 and 800 are the minimum and maximum scores, respectively, that will be given.
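The percentages in this example can be reproduced directly from the normal curve. The following is a minimal sketch, assuming scores are exactly normal with mean 500 and standard deviation 100; the function names are hypothetical and chosen only for this illustration.

```python
import math


def normal_cdf(x, mean, sd):
    """Probability that a normal observation is less than or equal to x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def share_between(low, high, mean=500.0, sd=100.0):
    """Share of test takers expected to score between low and high."""
    return normal_cdf(high, mean, sd) - normal_cdf(low, mean, sd)


if __name__ == "__main__":
    print(round(share_between(400, 600), 4))  # within one standard deviation
    print(round(share_between(300, 700), 4))  # within two standard deviations
    print(round(share_between(200, 800), 4))  # within three standard deviations
```

The three results come out close to 0.68, 0.95, and 0.997, which is why a 200 to 800 scale covers virtually all test takers under these assumptions.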
Scaled Scores
The "standardized" in standardized tests means that similar scores must represent the same level of performance from year to year. Statisticians and test creators work together to ensure that if, for example, a student scores 650 on one version of the SAT as a junior and 700 on a different version as a senior, the difference truly represents a gain in achievement rather than one version of the test being more difficult than the other.
By "embedding" some questions that are identical in all versions of a test and analyzing the performance of each group on those common questions, test creators can ensure a level of standardization. If one group scores significantly lower on the common questions, this is interpreted to mean that the lower scoring group is not as strong as the higher scoring group.
If group A scores higher than group B on the questions common to both tests but then scores the same as or lower than group B on the complete test, it would be assumed that the test given to group A was more difficult than that given to group B. Statisticians can develop a mathematical formula that will correct for such a difference in the difficulty of tests.
Such a formula would be applied to the "raw" scores of the test takers in order to obtain "scaled" scores for both groups. These scaled scores could then be compared. A scaled score of 580 on version A means the same thing as a scaled score of 580 on version B, even though the raw scores may be different. In this sense the scores are said to have been "standardized."
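The article does not give the correction formula itself. One textbook possibility is linear equating, sketched below under a loud assumption: the common questions have already shown the two groups to be equally able, so any difference in raw-score means and standard deviations is attributed to the test versions. The function name and all numbers are hypothetical, and this is not presented as the method used for any particular test.

```python
def linear_equate(raw_a, mean_a, sd_a, mean_b, sd_b):
    """Express a raw score from version A on the raw-score scale of version B.

    Linear equating matches the two versions' means and standard deviations.
    Assumes the groups taking A and B are equally able (hypothetical setup).
    """
    return mean_b + (sd_b / sd_a) * (raw_a - mean_a)


# Hypothetical summary statistics: version A was harder, so its mean is lower.
MEAN_A, SD_A = 40.0, 8.0
MEAN_B, SD_B = 44.0, 8.0

# A raw score of 48 on version A sits one standard deviation above A's mean,
# so it is mapped to one standard deviation above B's mean.
print(linear_equate(48, MEAN_A, SD_A, MEAN_B, SD_B))  # prints 52.0
```

Under these assumed figures, a raw 48 on the harder version and a raw 52 on the easier version would receive the same scaled score.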
Statistical Scores
A second meaning of "standardized" is more subtle, more mathematically involved, and not well understood by the general public. This meaning has to do with the bell-shaped normal probability curve mentioned at the beginning of this article. Theoretically, there are an infinite number of normal curves, one for each different set of observations that might be made. Mathematicians would say that there is an entire "family" of normal curves, and the members of the normal curve family share similarities as well as differences.
All bell-shaped curves are high in the middle and slope down to long "tails" to the right and left. Although different types of observations will have different mean values, those mean values will always occur at the middle of the distributions. They may also have different standard deviations as discussed earlier, but the percentage of values lying between plus or minus one of those standard deviations will still be about 68 percent, the percentage of values lying between plus or minus two standard deviations will still be about 95 percent, and so on.
In order to make the analysis of normal distributions simpler, statisticians have agreed upon one particular normal curve that will represent all the rest. This special normal curve has a mean of 0 and a standard deviation of 1 and is called the "standard normal curve." A "standardized" test result, therefore, is one based on the use of a standard normal curve as its reference.
The advantage of having the standard normal curve represent all the other normal curves is that statisticians can then construct a single table of probabilities that can be applied to all normal distributions. This can be done by "mapping" those distributions onto the standard normal curve and making use of its probability table. The term "mapping" in mathematics refers to the transformation of one set of values to a different set of values.
To illustrate, consider the test with a mean of 500 and a standard deviation of 100. The mean of this set of scores lies 500 units to the right of the standard normal distribution's mean of 0. So to "map" the mean of the test scores onto the standard normal mean, 500 is subtracted from all the test scores. Now there is a new distribution with the correct mean but the wrong standard deviation.
To correct this, all of the scores in the new distribution are divided by 100, since $\frac{100}{100} = 1$, which is the standard deviation of the standard normal distribution. The two distributions are now identical. In mathematical terms the test scores have been "mapped" onto the standard normal values.
This mapping is composed of two transformations: a translation of 500 to the left and a scale change of 1/100. This composition can be represented by $z = \frac{x - 500}{100}$, where x is any test score and z is the corresponding standard score.
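The two-step mapping, a shift by the mean followed by a scale change by the standard deviation, can be written as a one-line function. In this minimal sketch the name `to_standard_score` and the default values are chosen for the example test only.

```python
def to_standard_score(x, mean=500.0, sd=100.0):
    """Map a test score onto the standard normal scale.

    First translate by the mean, then rescale by the standard deviation.
    """
    shifted = x - mean      # translation: the mean moves to 0
    return shifted / sd     # scale change: the standard deviation becomes 1


for score in (300, 500, 650, 800):
    print(score, "->", to_standard_score(score))
```

The mean score of 500 maps to 0, and a score one standard deviation above the mean (600) maps to 1, as the standard normal curve requires.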
Building on this example, suppose one wants to know the percentage of test takers who scored 650 or above. First, compute $z = \frac{650 - 500}{100} = 1.5$. Then go to a standard normal table, look up a standard score of 1.5, and see that about 6.68 percent of standard normal scores are at 1.5 or above. This means that about 6.68 percent of the test scores are 650 or higher. This procedure may be used with any normally distributed data set for which the mean and standard deviation are known.
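The table lookup can also be reproduced in software. The sketch below uses `NormalDist` from Python's standard `statistics` module in place of a printed table; the function name `share_at_or_above` is invented for illustration.

```python
from statistics import NormalDist


def share_at_or_above(score, mean=500.0, sd=100.0):
    """Fraction of a normally distributed population scoring at or above `score`."""
    z = (score - mean) / sd                # map onto the standard normal curve
    return 1.0 - NormalDist().cdf(z)       # area under the curve to the right of z


print(f"{share_at_or_above(650):.2%}")  # prints 6.68%
```

The same function applies to any normally distributed data set once its mean and standard deviation are supplied.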
see also Central Tendency, Measures of; Mapping, Mathematical; Statistical Analysis; Transformations.
Stephen Robinson
Standardized Test
A test administered to a group of subjects under exactly the same experimental conditions and scored in exactly the same way.
Standardized tests are used in psychology, as well as in everyday life, to measure intelligence, aptitude, achievement, personality, attitudes and interests. Attempts are made to standardize tests in order to eliminate biases that may result, consciously or unconsciously, from varied administration of the test. Standardized tests are used to produce norms, or statistical standards, that provide a basis for comparisons among individual members of the group of subjects. Tests must be standardized, reliable (give consistent results), and valid (measure what they are intended to measure) before they can be considered useful psychological tools.
Standardized tests are highly controversial both in psychological circles and particularly in education because true standardization is difficult to attain. Certain requirements must be rigidly enforced. For example, subjects must be given exactly the same amount of time to take the test. Directions must be given using precisely the same wording from group to group, with no embellishments, encouragement, or warnings. Scoring must be exact and consistent. Even an unwitting joke by the test administrator that relaxes the subjects, or a testing room that is too hot or too cold, could be considered a violation of standardization specifications. Because of the difficulty of meeting such stringent standards, standardized tests are widely criticized.
Critics of the use of standardized tests for measuring educational achievement or classifying children object on other grounds as well. First, they say the establishment of norms does not give enough specific information about what children know; rather, the norms reveal the average level of knowledge. Second, critics contend that such tests encourage educators and the public to focus their attention on groups rather than on individuals. Improving test scores to enhance public image or achieve public funding becomes more of a focus than teaching individual children the skills they need to advance. Another criticism is that the tests, by nature, cannot measure knowledge of complex skills such as problem solving and critical thinking. "Teaching to the test" (drilling students in how to answer fill-in-the-blank or multiple-choice questions) takes precedence over instruction in more practical, less objective skills such as writing or logic.
Achievement tests and I.Q. tests such as the Stanford-Binet intelligence scales are examples of widely used standardized tests.