
# Transforming & Equating

by DAUN JEONG on 25 May 2011

#### Transcript of Transforming & Equating

Transforming & Equating

Jin-sun, Yoo and Da-un, Jeong in SWU

Tests are used to make important decisions at the individual level, at the institutional level, and at the public policy level.

Tests can be administered on multiple occasions. Tests can be administered over many years to track educational trends over time. In these situations, we need equating!

Different test forms on different test dates might differ somewhat in difficulty. Equating is a statistical process that is used to adjust scores on test forms so that scores on the forms can be used interchangeably. The process of equating is used in situations where such alternate forms of a test exist and scores earned on different forms are compared to each other. Equating adjusts for differences in difficulty, not for differences in content.

Transforming, linking, scaling, equating:

- Equating: scores are interchangeable; forms are similar in content and statistical characteristics.
- Scaling: scores are comparable; tests differ in content and level.

Scaling & Equating process

Score scales typically are established using a single test form. For subsequent test forms, the scale is maintained through an equating process that places raw scores from subsequent forms on the established score scale. Typically, raw scores on the new form are equated to raw scores on the old form, and these equated raw scores are then converted to scale scores using the raw-to-scale score transformation for the old form.

1. Decide on the purpose for equating

As these steps in the equating process suggest, individuals responsible for conducting equating make choices about designs, operational definitions, statistical techniques, and evaluation procedures.

Properties of equating: symmetry, same specifications, equity, observed score equating, group invariance.

The symmetry property requires that the function used to transform a score on Form X to the Form Y scale be the inverse of the function used to transform a score on Form Y to the Form X scale.
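The symmetry requirement can be checked numerically. The sketch below is only an illustration under stated assumptions: the paired scores are hypothetical, the function names are invented for this example, and linear equating (matching means and standard deviations) stands in for an equating function. It converts a score from Form X to Form Y and back again, once with linear equating and once with a pair of least-squares regression lines.

```python
from statistics import mean, pstdev

def linear_equate(from_scores, to_scores):
    """Return a function mapping a score on one form to the other form's
    scale by matching means and standard deviations (linear equating)."""
    m_from, s_from = mean(from_scores), pstdev(from_scores)
    m_to, s_to = mean(to_scores), pstdev(to_scores)
    return lambda score: m_to + (s_to / s_from) * (score - m_from)

def regress(pred_scores, crit_scores):
    """Return the least-squares line predicting crit_scores from pred_scores."""
    m_pred, m_crit = mean(pred_scores), mean(crit_scores)
    sxy = sum((p - m_pred) * (c - m_crit) for p, c in zip(pred_scores, crit_scores))
    sxx = sum((p - m_pred) ** 2 for p in pred_scores)
    slope = sxy / sxx
    return lambda score: m_crit + slope * (score - m_pred)

# Hypothetical scores for the same examinees on both forms (assumed data).
form_x = [10, 12, 15, 18, 20, 23, 25]
form_y = [12, 13, 18, 17, 24, 22, 29]

x_to_y, y_to_x = linear_equate(form_x, form_y), linear_equate(form_y, form_x)
reg_x_to_y, reg_y_to_x = regress(form_x, form_y), regress(form_y, form_x)

score = 22.0
# Linear equating: the two functions are inverses, so the round trip
# recovers the starting score up to floating-point rounding.
equating_round_trip = y_to_x(x_to_y(score))
# Regression: the two lines are not inverses unless the correlation is
# perfect, so the round trip is pulled toward the Form X mean.
regression_round_trip = reg_y_to_x(reg_x_to_y(score))

print(equating_round_trip, regression_round_trip)
```

With imperfectly correlated scores, the regression round trip shrinks the score toward the mean by a factor equal to the squared correlation, which is why the two regression lines cannot be inverses of each other.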

This property rules out regression as an equating method.

Same specifications: test forms must be built to the same content and statistical specifications if they are to be equated.

Equity: Lord's equity property holds if examinees with a given true score have the same distribution of converted scores on Form X as they would on Form Y. This property implies that examinees with a given true score would have identical observed score distributions for converted scores on Form X and scores on Form Y. Using Lord's equity property as the criterion, equating is either impossible or unnecessary.

Observed score equating: the converted scores on Form X have the same distribution as scores on Form Y. The equipercentile equating property implies that the cumulative distribution of equated scores on Form X is equal to the cumulative distribution of scores on Form Y.

Group invariance: under the group invariance property, the equating relationship is the same regardless of the group of examinees used to conduct the equating.

Equating designs:

- Random groups design
- Single group design
- Single group design with counterbalancing
- Common-item nonequivalent groups design
- NAEP reading anomaly: problems with common items

Error in estimating equating relationships

Evaluating the results of equating

Random groups design

Examinees are randomly assigned the form to be administered.

Spiraling: Form X - Form Y - Form X - Form Y ... The spiraling process typically leads to comparable, randomly equivalent groups.

Practical features: Each examinee takes only one form of the test, thus minimizing testing time relative to a design in which examinees take more than one form. More than one new form can be equated at the same time by including the additional new forms in the spiraling process.

Limitations: All the forms must be available and administered at the same time. Large sample sizes are typically needed.

Single group design

The same examinees are administered both Form X and Form Y. Order effects such as fatigue and familiarity can occur. Because these are typically present, this design is rarely used in practice.

Single group design with counterbalancing

One way to deal with order effects in the single group design: Form X + Form Y - Form Y + Form X - Form X + Form Y ...

The effect of taking Form X after taking Form Y = the effect of taking Form Y after taking Form X. If the equating relationships differ from each other (a differential order effect), the data for the form that is taken second might need to be disregarded.

In practice, the single group design with counterbalancing might be used instead of the random groups design when administering two forms to examinees is operationally possible, differential order effects are not expected to occur, and it is difficult to obtain participation of a sufficient number of examinees.

ASVAB problems with a single group design

The Armed Services Vocational Aptitude Battery (ASVAB) is a battery of ability tests that is used in the process of selecting individuals for the military.

The scores on the old form were used for selection; the scores on the new form were not used for selection. Many examinees could distinguish between the old and the new forms, and also knew that only the scores on the old form were to be used for selection purposes. The examinees were likely more motivated when taking the old form than when taking the new form.

The result of Maier's study (1993): motivation differences caused the scale scores on the new form to be too high when the new form was used to make selection decisions for examinees.

Error in estimating equating relationships

Estimated equating relationships typically contain estimation error.
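A small simulation can make estimation error concrete. The sketch below is an illustration under assumptions, not a procedure from the presentation: it uses mean equating (shifting scores by the difference between form means) as a deliberately simple stand-in for an equating method, and the population means, standard deviations, sample sizes, and function names are all hypothetical. It repeats the estimation many times and measures how much the estimated shift varies from sample to sample.

```python
import random
from statistics import mean, pstdev

def estimate_shift(n, rng):
    """Draw n examinees per form from assumed populations and return the
    estimated mean-equating shift (Form Y mean minus Form X mean)."""
    form_x = [rng.gauss(70, 10) for _ in range(n)]  # assumed Form X population
    form_y = [rng.gauss(73, 10) for _ in range(n)]  # assumed Form Y population
    return mean(form_y) - mean(form_x)

def empirical_standard_error(n, replications=500, seed=0):
    """Standard deviation of the estimated shift over repeated samples
    of size n: an empirical stand-in for the standard error of equating."""
    rng = random.Random(seed)
    shifts = [estimate_shift(n, rng) for _ in range(replications)]
    return pstdev(shifts)

se_small = empirical_standard_error(n=50)
se_large = empirical_standard_error(n=800)
print(f"n = 50 per form:  {se_small:.2f}")
print(f"n = 800 per form: {se_large:.2f}")
```

Under these assumptions the spread of the estimate is roughly proportional to one over the square root of the sample size, so a sixteenfold increase in sample size cuts it to about a quarter.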

A major goal in designing and conducting equating is to minimize such equating error.

Random equating error is present whenever samples from populations of examinees are used to estimate parameters (e.g., means, standard deviations). It is quantified by the standard error of equating. Sample size matters: as the sample size becomes larger, the standard error of equating becomes smaller.

Systematic equating error results from violations of the assumptions and conditions of equating. Although the amount of random error can be quantified using the standard error of equating, systematic error is much more difficult to quantify. Systematic error can arise:

- In the random groups design, if the spiraling process is inadequate for achieving group comparability.
- In the single group design, if differential order effects cannot be controlled.
- In the common-item nonequivalent groups design, if the assumptions of the statistical methods used to separate form and group differences are not met.

After the equating is conducted, the results should be evaluated.

The criteria for equating: Standard errors of equating can be used to estimate random error (consistency of results). The properties of equating also can be used to develop evaluative criteria. Observed score equating properties are especially important when equating is evaluated from an institutional perspective.

We will discuss several different types of raw-score transformations that aid in the interpretation of test scores.

The raw-score problem:

- Difficult to interpret: an isolated raw score does not give any information about how one examinee's performance is related to the performance of the other examinees.
- Difficult to compare across tests.

Two basic types of transforming scores:

- Linear transformation, Y = aX + b: standard and standardized scores and some formula scores. The shape of the distribution of the transformed scores is the same as the shape of the distribution of the raw scores.
- Nonlinear transformation, e.g. Y = X²: percentiles, age and grade scores, expectancy tables, normalized scores, equal-interval scales, and some formula scores. These change correlations and the shape of the score distribution, so that the transformed-score distribution can be very different from the raw-score distribution.

Monotonic transformations will not alter an examinee's rank order in the sample.
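The contrast between the two types of transformation can be sketched with a few numbers. In the example below the raw scores, the constants a = 2 and b = 10, and the helper names are all assumptions chosen for illustration; skewness is used as one simple summary of distribution shape.

```python
from statistics import mean, pstdev

def skewness(values):
    """Population skewness, used here as a simple summary of shape."""
    m, s = mean(values), pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

def ranks(values):
    """Rank of each examinee (0 = lowest score); assumes no tied scores."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result

raw = [3, 5, 6, 8, 9, 12, 20]        # hypothetical raw scores
linear = [2 * x + 10 for x in raw]   # Y = aX + b with a = 2, b = 10
squared = [x ** 2 for x in raw]      # Y = X^2, monotonic for positive scores

# The linear transformation leaves the shape unchanged; squaring does not.
print(skewness(raw), skewness(linear), skewness(squared))
# Both transformations are monotonic here, so rank order is preserved.
print(ranks(raw) == ranks(linear) == ranks(squared))
```

Squaring is monotonic only because all the scores are positive; a nonlinear transformation that is not monotonic over the score range could reorder examinees.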

Norm-referenced test: comparing an examinee's performance to the performance of other examinees (the norm group).

Criterion-referenced test: whether the examinee has reached a certain specific criterion performance or mastered a specific task. It does not require transformation.

Types of transformations: Percentiles; Age and Grade Scores; Expectancy Tables; Standard and Standardized Scores; Normalized Scores; Corrections for Guessing and Omissions; Equal-Interval Scales.

Vertical equating: to equate different levels of a test so that an examinee will get the same score regardless of whether a level of the test is harder or easier (ex. the test on grade 4 - the test on grade 5).

Horizontal equating: to equate test forms within a specified difficulty level (ex. the test on grade 4 in 2010 - the test on grade 4 in 2011).

Percentiles

A norm group is a specified sample of examinees. The percentile rank of a trait value is defined as the percentage of people in a norm group who have trait values less than or equal to that particular trait value.

Limitations:

- Percentiles can be assumed to form ordinal scales. Thus, arithmetical manipulations of percentiles can produce misleading results.
- The distribution of percentiles within the norm group is rectangular, not normal. Unlike the bell-shaped normal curve, a rectangular distribution curve looks like a horizontal line. Therefore, researchers who desire to use common statistical techniques that assume normal distributions should avoid the use of percentiles.
- Percentile scores may lead to exaggerated interpretations of small differences, especially when the test is short.
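The definition of percentile rank translates directly into a short computation. This is a minimal sketch: the norm group values and the function name are hypothetical.

```python
def percentile_rank(value, norm_group):
    """Percentage of the norm group with trait values less than or equal
    to the given value."""
    at_or_below = sum(1 for score in norm_group if score <= value)
    return 100.0 * at_or_below / len(norm_group)

# Hypothetical norm group of ten examinees.
norm_group = [12, 15, 15, 18, 20, 22, 25, 27, 30, 34]

print(percentile_rank(20, norm_group))  # 5 of 10 scores are <= 20, so 50.0
print(percentile_rank(34, norm_group))  # the highest score, so 100.0
```

Because the result depends only on how many examinees fall at or below a value, it carries ordinal information: equal differences in percentile rank need not correspond to equal differences in the underlying scores.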

Age or grade equivalents

Ex. A third-grader may be said to read at the fifth-grade level or have the mental ability of a 10-year-old.

Limitations:

- These scores are assumed to form ordinal scales, so arithmetical manipulations of these scores can lead to misleading results.
- The interpretation of these scores is not as straightforward as it appears.
- Score distributions for adjacent grades typically tend to have increasing overlap as grade level increases.
- Schools may differ in their curricula and introduce topics at different rates.
- The use of age or grade scores is only reasonable when the trait being measured increases (or decreases) monotonically with age or grade.
- Interpolation between tests may be inaccurate.

NAEP reading anomaly: problems with common items

The National Assessment of Educational Progress (NAEP) is the survey of the educational achievement of students in American schools. The reading results showed a surprisingly large decrease from 1984 at age 17 and, to a lesser degree, at age 9... (Zwick, 1991)

1. In 1984, the test booklets administered to examinees contained reading and writing sections. In 1986, the booklets administered to examinees contained reading, mathematics, and/or science sections at age 9, and reading, computer science, history, and/or literature at age 17.
2. The composition of the reading sections differed in 1984 and 1986: the order of common items and the available time to complete common items.

Context effects can lead to very misleading results.

Common-item nonequivalent groups design

This design is often used when more than one form per test date cannot be administered because of test security or other practical concerns.

Internal common items: when the score on the set of common items contributes to the examinee's score on the test.

External common items: when the score on the set of common items does not contribute to the examinee's score on the test.

To accurately reflect group differences, the set of common items should be a miniversion, in content and statistical characteristics, of the total test form.

To help common items behave similarly, each common item should occupy a similar location (item number) in the two forms and be exactly the same in the old and new forms.

Differences between means on Form X and Form Y can result from a combination of examinee group differences and test form differences. The central task in equating using this design is to separate group differences and test form differences.

Which of the two forms is easier?

What would have been the mean on Form X for Group 2 had Group 2 taken Form X? Group 2 might be expected to correctly answer 10% more of the Form X items than would Group 1. The mean for Group 2 on Form X would be expected to be 82 = 72 + 10. Because Group 2 earned a mean of 77 on Form Y and has an expected mean of 82 on Form X, Form X appears to be 5 points easier than Form Y.

The larger the differences between examinee groups, the more difficult it becomes for the statistical methods to separate the group and form differences.
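The arithmetic of this example can be written out step by step. The means of 72 and 77 and the 10% advantage come from the example itself; the 100-item test length is an assumption, implied by 10% more items translating into 10 score points, and the variable names are illustrative.

```python
# Values from the worked example. The test length is an assumption:
# 10% more items correct equals 10 points only on a 100-item form.
n_items = 100
group1_mean_form_x = 72          # Group 1 took Form X
group2_mean_form_y = 77          # Group 2 took Form Y
group2_advantage_percent = 10    # Group 2 expected to answer 10% more items

# Step 1: remove the group difference by projecting Group 2 onto Form X.
group2_expected_form_x = group1_mean_form_x + n_items * group2_advantage_percent / 100

# Step 2: what remains is the form difference for the same group.
form_x_easier_by = group2_expected_form_x - group2_mean_form_y

print(group2_expected_form_x)  # 82.0
print(form_x_easier_by)        # 5.0
```

The two steps mirror the central task of the design: the first accounts for the difference between groups, and the second attributes the remainder to the difference between forms.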

Expectancy tables

An expectancy table gives the conditional distribution of criterion scores for different test scores.

Ex. A counselor would advise a student with a high pre-course test score to take the course, perhaps with a warning that a few students from this test-score level still did poorly in the course.

An expectancy table may not be feasible because of time or monetary considerations, or because a clear-cut criterion is not available.

The expectancy table illustrates the probabilistic nature of psychological prediction.
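An expectancy table can be built by counting criterion outcomes within each test-score level. The sketch below assumes hypothetical records of pre-course score level and course grade; the function name and the two score levels are illustrative only.

```python
from collections import Counter, defaultdict

def expectancy_table(records):
    """Map each test-score level to the proportion of examinees at that
    level who obtained each criterion outcome (a conditional distribution)."""
    counts = defaultdict(Counter)
    for score_level, outcome in records:
        counts[score_level][outcome] += 1
    return {
        level: {outcome: n / sum(outcomes.values()) for outcome, n in outcomes.items()}
        for level, outcomes in counts.items()
    }

# Hypothetical (pre-course score level, course grade) pairs.
records = [
    ("high", "A"), ("high", "A"), ("high", "B"), ("high", "A"), ("high", "D"),
    ("low", "C"), ("low", "D"), ("low", "B"), ("low", "D"),
]

table = expectancy_table(records)
for level, distribution in table.items():
    print(level, distribution)
```

With so few examinees per level the proportions would change substantially if a single record changed, which is why a table like this needs a large sample before its probabilities can be trusted.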

A high pre-course test score does not guarantee an "A."

Limitations: the sample must be large enough to ensure that the probabilities in the table are reasonably stable.

Standard and standardized scores

Standard scores, often called Z scores, are linear transformations of raw scores. Half the scores are negative; standardized scores (standard-score equivalents) eliminate the problem involved with negative numbers.

Limitations: unless two frequency distributions have the same shape, it is difficult to compare scores on two standardized scales, and it is particularly risky to interpret small differences in standard scores.

Normalized scores

① ... hypothesizes that the continuous trait being measured by a test has a normal distribution in some specified population.

The transformation to normalized scores involves forcing the distribution of transformed scores to be as close as possible to a normal distribution by smoothing out, stretching, or condensing irregularities and departures from normality in the raw-score distribution.

Two normalized scores are:

① T scores: normalized scores with mean = 50, standard deviation = 10.
② Stanines: one-digit normalized scores with mean = 5, standard deviation = approximately 2.

‣ Check the manual carefully to see whether scores are normalized or standardized.
‣ The use of normalized scores may not be reasonable if the underlying trait has a very non-normal distribution.

Corrections for guessing and omissions

Transformations of scores can also be made to adjust for the effects of guessing and the effects of omitting items.

These transformations can aid in the interpretation of an examinee's performance and in the comparison of the performances of different examinees.

On multiple-choice tests, examinees can get an item correct, without knowing the right answer, simply by guessing.

Formula scores: transformations that take into account guessing or omissions.

When there are no omitted items: F1, a linear function of X, is perfectly correlated with X, and has the same reliability and validity as X.

When there are omitted items: F1 and X are not perfectly correlated. F2 is the estimated number of items that would be correct if every blank item were replaced by a random guess.
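The transcript names F1 and F2 without giving their formulas. The sketch below therefore assumes the conventional corrections, F1 = R - W / (k - 1) and F2 = R + O / k, where R, W, and O are the numbers of right, wrong, and omitted items and k is the number of options per item; these formulas, the function names, and the example counts are assumptions rather than content from the presentation.

```python
def formula_score_f1(right, wrong, options):
    """Assumed rights-minus-wrongs correction: F1 = R - W / (k - 1)."""
    return right - wrong / (options - 1)

def formula_score_f2(right, omitted, options):
    """Assumed omission correction: F2 = R + O / k, the expected score if
    each omitted item were answered by a random guess among k options."""
    return right + omitted / options

# Hypothetical examinee on a 40-item test with 4 options per item.
right, wrong, omitted, options = 26, 8, 6, 4
f1 = formula_score_f1(right, wrong, options)    # 26 - 8/3
f2 = formula_score_f2(right, omitted, options)  # 26 + 6/4 = 27.5
print(round(f1, 2), f2)

# With no omitted items, W = n - R, so F1 changes by a constant amount
# for each additional right answer: F1 is a linear function of R.
n_items = 40
step = formula_score_f1(30, n_items - 30, options) - formula_score_f1(29, n_items - 29, options)
print(step)  # close to k / (k - 1) = 4/3
```

Under this assumed form, omissions break the fixed link between rights and wrongs, which is consistent with the statement that F1 and X are no longer perfectly correlated when items are omitted.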

Equal-interval scales

Equal-interval scales are particularly useful for measuring growth or change in a trait or behavior. Raw scores can be transformed into a set of scores that does have equal intervals.

Thurstone's absolute scaling method hypothesizes that raw scores on the test are monotonically related to trait values. (A monotonic relationship is one in which every increase in the raw score reflects an increase in the trait value.)

Steps in the equating process (continued):

2. Construct alternate forms
3. Choose a design for data collection
4. Implement the data collection design
5. Choose one or more operational definitions of equating
6. Choose one or more statistical estimation methods
7. Evaluate the results of equating
