Reliability, Validity and Norms...OH! MY!

Lisa Cranford

on 14 October 2015

Compare and rank test takers in relation to one another
Where in our reading you will find information on these topics...
Gallavan - pages 9-10, 32-36, 48-49; performance-based 9, 10 (table), 17

Your assessments will be valid only if they are reliable
Your assessment appears to satisfy the requirements of these three aspects of validity...
Gallavan defines validity as asking learners to demonstrate outcomes that correspond directly to the objectives of your teaching.
Norm-referenced tests are specifically designed to rank test takers on a “bell curve,” or a distribution of scores that resembles, when graphed, the outline of a bell—i.e., a small percentage of students performing well, most performing average, and a small percentage performing poorly. To produce a bell curve each time, test questions are carefully designed to accentuate performance differences among test takers, not to determine if students have achieved specified learning standards, learned certain material, or acquired specific skills and knowledge.
Content Validity
Criterion Validity

Construct Validity
When your assessment matches your curriculum and instruction
Assessments anticipate the learner's degree of success
Shows the purpose for learning
Face Validity
the quality of being logically or factually sound; soundness or cogency.
"one might question the validity of our data"
the state of being legally or officially binding or acceptable.
"return travel must be within the validity of the ticket"
You can depend on the assessment to give a consistent range of results every time
Stability Reliability
The assessment gives information needed to make decisions regarding teaching, learning and schooling
Alternative Reliability
Having multiple versions of the same assessment that will produce the same results
Validity is more important than reliability...
If the assessment is VALID it will be RELIABLE

Definition of RELIABILITY. 1. : the quality or state of being reliable. 2. : the extent to which an experiment, test, or measuring procedure yields the same results on repeated trials.
In order to establish VALIDITY in your assessment you need to be sure that the assessment is assessing the knowledge...
What students need to know
Think about preassessment
How students demonstrate or show knowledge
Why the learners are learning the identified outcomes connected to the world around them
Valuable information that guides your planning and teaching
Norm-referenced refers to standardized tests that are designed to compare and rank test takers in relation to one another. Norm-referenced tests report whether test takers performed better or worse than a hypothetical average student, which is determined by comparing scores against the performance results of a statistically selected group of test takers, typically of the same age or grade level, who have already taken the exam.
Norm-Referenced vs. Criterion-Referenced Tests
Tests that measure performance against a fixed set of standards or criteria are called criterion-referenced tests.
Calculating norm-referenced scores is called the “norming process,” and the comparison group is known as the “norming group.” Norming groups typically comprise only a small subset of previous test takers, not all or even most previous test takers. Test developers use a variety of statistical methods to select norming groups, interpret raw scores, and determine performance levels.
The following are a few representative examples of how norm-referenced tests and scores may be used:
To determine a young child’s readiness for preschool or kindergarten. These tests may be designed to measure oral-language ability, visual-motor skills, and cognitive and social development.
To evaluate basic reading, writing, and math skills. Test results may be used for a wide variety of purposes, such as measuring academic progress, making course assignments, determining readiness for grade promotion, or identifying the need for additional academic support.
To identify specific learning disabilities, such as autism, dyslexia, or nonverbal learning disability, or to determine eligibility for special-education services.
To make program-eligibility or college-admissions decisions (in these cases, norm-referenced scores are generally evaluated alongside other information about a student). Scores on SAT or ACT exams are a common example.
Norm-referenced scores are generally reported as a percentage or percentile ranking. For example, a student who scores in the seventieth percentile performed as well as or better than seventy percent of other test takers of the same age or grade level, and thirty percent of students performed better (as determined by norming-group scores).
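The percentile idea above can be sketched in a few lines of Python. This is only an illustration: the function name `percentile_rank`, the norming-group scores, and the student score are all made up for the example, and it assumes the "as well as or better than" definition used in the paragraph above (test publishers may use other variants).

```python
from bisect import bisect_right


def percentile_rank(score, norming_scores):
    """Percentage of norming-group scores that are at or below `score`."""
    ordered = sorted(norming_scores)
    # bisect_right counts how many scores in the sorted list are <= score
    at_or_below = bisect_right(ordered, score)
    return 100.0 * at_or_below / len(ordered)


# Hypothetical norming group of ten raw scores (not real test data)
norming_group = [52, 55, 58, 61, 63, 66, 70, 74, 79, 85]
student_score = 70

print(percentile_rank(student_score, norming_group))  # 70.0
```

Here the student did as well as or better than seven of the ten norming-group scores, so the result is the seventieth percentile, and three of the ten scored higher.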
Norm-referenced tests often use a multiple-choice format, though some include open-ended, short-answer questions. They are usually based on some form of national standards, not locally determined standards or curricula. IQ tests are among the most well-known norm-referenced tests, as are developmental-screening tests, which are used to identify learning disabilities in young children or determine eligibility for special-education services. A few major norm-referenced tests include the California Achievement Test, Iowa Test of Basic Skills, Stanford Achievement Test, and TerraNova.
Criterion-referenced test results are often based on the number of correct answers provided by students, and scores might be expressed as a percentage of the total possible number of correct answers. On a norm-referenced exam, however, the score would reflect how many more or fewer correct answers a student gave in comparison to other students. Hypothetically, if all the students who took a norm-referenced test performed poorly, the least-poor results would rank students in the highest percentile. Similarly, if all students performed extraordinarily well, the least-strong performance would rank students in the lowest percentile.
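A small sketch may make the contrast concrete. The numbers below are invented, the test length of 50 items is an assumption, and both helper functions are hypothetical names used only for this illustration. The same raw scores are scored two ways: as percent correct (criterion-referenced style) and as a percentile rank within the group (norm-referenced style).

```python
def percent_correct(correct, total_items):
    """Criterion-referenced style score: share of items answered correctly."""
    return 100.0 * correct / total_items


def percentile_rank(score, group_scores):
    """Norm-referenced style score: share of the group at or below `score`."""
    at_or_below = sum(1 for s in group_scores if s <= score)
    return 100.0 * at_or_below / len(group_scores)


TOTAL_ITEMS = 50  # assumed test length

# Everyone performs poorly: the least-poor student still ranks highest
weak_group = [8, 10, 12, 15, 18]
best = max(weak_group)
print(percent_correct(best, TOTAL_ITEMS))  # 36.0
print(percentile_rank(best, weak_group))   # 100.0

# Everyone performs very well: the least-strong student still ranks lowest
strong_group = [44, 46, 47, 49, 50]
worst = min(strong_group)
print(percent_correct(worst, TOTAL_ITEMS))   # 88.0
print(percentile_rank(worst, strong_group))  # 20.0
```

The student with 36 percent correct sits at the top of the weak group, while the student with 88 percent correct sits at the bottom of the strong group. The relative ranking says nothing about how much of the material either student has mastered.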
It should be noted that norm-referenced tests cannot measure the learning achievement or progress of an entire group of students, but only the relative performance of individuals within a group. For this reason, criterion-referenced tests are used to measure whole-group performance.
Norm-referenced tests have historically been used to make distinctions among students, often for the purposes of course placement, program eligibility, or school admissions. Yet because norm-referenced tests are designed to rank student performance on a relative scale (i.e., in relation to the performance of other students), norm-referenced testing has been abandoned by many schools and states in favor of criterion-referenced tests, which measure student performance in relation to a common set of fixed criteria or standards.
It should be noted that norm-referenced tests are typically not the form of standardized test widely used to comply with state or federal policies—such as the No Child Left Behind Act—that are intended to measure school performance, close “achievement gaps,” or hold schools accountable for improving student learning results. In most cases, criterion-referenced tests are used for these purposes because the goal is to determine whether schools are successfully teaching students what they are expected to learn.
Similarly, the assessments being developed to measure student achievement of the Common Core State Standards are also criterion-referenced exams. However, some test developers promote their norm-referenced exams—for example, the TerraNova Common Core—as a way for teachers to “benchmark” learning progress and determine if students are on track to perform well on Common Core–based assessments.
While norm-referenced tests are not the focus of ongoing national debates about “high-stakes testing,” they are nonetheless the object of much debate. The essential disagreement is between those who view norm-referenced tests as objective, valid, and fair measures of student performance, and those who believe that relying on relative performance results is inaccurate, unhelpful, and unfair, especially when making important educational decisions for students. While part of the debate centers on whether or not it is ethically appropriate, or even educationally useful, to evaluate individual student learning in relation to other students (rather than evaluating individual performance in relation to fixed and known criteria), much of the debate is also focused on whether there is a general overreliance on standardized-test scores in the United States, and whether a single test, no matter what its design, should be used—in exclusion of other measures—to evaluate school or student performance.
It should be noted that perceived performance on a standardized test can potentially be manipulated, regardless of whether a test is norm-referenced or criterion-referenced. For example, if a large number of students are performing poorly on a test, the performance criteria—i.e., the bar for what is considered “passing” or “proficient”—could be lowered to “improve” perceived performance, even if students are not learning more or performing better than past test takers. For example, if a standardized test administered in eleventh grade uses proficiency standards that are considered to be equivalent to eighth-grade learning expectations, it will appear that students are performing well, when in fact the test has not measured learning achievement at a level appropriate to their age or grade. For this reason, it is important to investigate the criteria used to determine “proficiency” on any given test—and particularly when a test is considered “high stakes,” since there is greater motivation to manipulate perceived test performance when results are tied to sanctions, funding reductions, public embarrassment, or other negative consequences.
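The effect of moving the bar can be shown with a short sketch. The scores and both cut scores below are invented for illustration, and `proficiency_rate` is a hypothetical helper, not part of any real testing system.

```python
def proficiency_rate(scores, cut_score):
    """Percentage of students whose score meets or exceeds the cut score."""
    passing = sum(1 for s in scores if s >= cut_score)
    return 100.0 * passing / len(scores)


# Hypothetical scores for ten students; the scores themselves never change
scores = [41, 48, 53, 57, 62, 66, 71, 75, 83, 90]

original_cut = 65  # assumed original bar for "proficient"
lowered_cut = 50   # assumed lowered bar

print(proficiency_rate(scores, original_cut))  # 50.0
print(proficiency_rate(scores, lowered_cut))   # 80.0
```

The reported proficiency rate rises from 50 percent to 80 percent even though no student answered a single additional question correctly, which is why the cut score behind any "proficient" label is worth checking.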
The following are representative of the kinds of arguments typically made by proponents of norm-referenced testing:
Norm-referenced tests are relatively inexpensive to develop, simple to administer, and easy to score. As long as the results are used alongside other measures of performance, they can provide valuable information about student learning.
The quality of norm-referenced tests is usually high because they are developed by testing experts, piloted, and revised before they are used with students, and they are dependable and stable for what they are designed to measure.
Norm-referenced tests can help differentiate students and identify those who may have specific educational needs or deficits that require specialized assistance or learning environments.
The tests are an objective evaluation method that can decrease bias or favoritism when making educational decisions. If there are limited places in a gifted and talented program, for example, one transparent way to make the decision is to give every student the same test and allow the highest-scoring students to gain entry.
The following are representative of the kinds of arguments typically made by critics of norm-referenced testing:
Although testing experts and test developers warn that major educational decisions should not be made on the basis of a single test score, norm-referenced scores are often misused in schools when making critical educational decisions, such as grade promotion or retention, which can have potentially harmful consequences for some students and student groups.
Norm-referenced tests encourage teachers to view students in terms of a bell curve, which can lead them to lower academic expectations for certain groups of students, particularly special-needs students, English-language learners, or minority groups. And when academic expectations are consistently lowered year after year, students in these groups may never catch up to their peers, creating a self-fulfilling prophecy.
Multiple-choice tests—the dominant norm-referenced format—are better suited to measuring remembered facts than more complex forms of thinking. Consequently, norm-referenced tests promote rote learning and memorization in schools over more sophisticated cognitive skills, such as writing, critical reading, analytical thinking, problem solving, or creativity.
Overreliance on norm-referenced test results can lead to inadvertent discrimination against minority groups and low-income student populations, both of which tend to face more educational obstacles than non-minority students from higher-income households. For example, many educators have argued that the overuse of norm-referenced testing has resulted in a significant overrepresentation of minority students in special-education programs. On the other hand, using norm-referenced scores to determine placement in gifted and talented programs, or other “enriched” learning opportunities, leads to the underrepresentation of minority and lower-income students in these programs. Similarly, students from higher-income households may have an unfair advantage in the college-admissions process because they can afford expensive test-preparation services.
An overreliance on norm-referenced test scores undervalues important achievements, skills, and abilities in favor of the narrower set of skills measured by the tests.
