Accounting for unit of scale in standard setting methodologies
Heldsinger, Sandra (2006) Accounting for unit of scale in standard setting methodologies. Professional Doctorate thesis, Murdoch University.
Substantial sums of money are invested by governments in state, national and international testing programs. Australia in particular engages at all three levels. These programs serve a number of purposes, one of which is to report student performance against standards.
Standard setting exercises with respect to a particular assessment are commonly used by testing programs where there is a requirement to determine the point at which it can be said that students have demonstrated achievement of a standard. Several methodologies have been devised that use expert judgements to derive a numerical cut-score on an achievement scale. A commonly used standard setting methodology is one proposed by Angoff (1971).
The kernel of the Angoff procedure is the independent judgement of the probability that a minimally competent person can or cannot answer a dichotomously scored item correctly. This methodology typically involves three stages: orientation and training, a first round of performance estimation followed by feedback, and then a second round of performance estimation. In the orientation session, judges are asked to define a hypothetical target group. This definition is dependent upon the judges' tacit understanding of the standard. For example, in the context of a mathematics test, judges would be asked to agree the skills the students should be expected to have mastered. Then they would be asked to envisage a student with those skills and to estimate the proportion of a hypothetical group of equally competent students (as defined by the expected standard) who would answer each item correctly. This proportion is the estimate of the required probability. Then the sum of these probabilities is taken as the raw cut-score on a test composed of the items.
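The arithmetic of the Angoff procedure described above can be sketched in a few lines of Python. This is a minimal illustration only: the judge names and ratings are hypothetical, and averaging the judges' cut-scores into a single panel value is an assumption (a common convention), not something stated in the abstract.

```python
from statistics import mean, stdev

def angoff_cut_score(ratings):
    """Sum one judge's item probabilities to give that judge's raw cut-score."""
    for p in ratings:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
    return sum(ratings)

# Hypothetical data: three judges each rate the same five items.
panel = {
    "judge_a": [0.9, 0.7, 0.5, 0.4, 0.2],
    "judge_b": [0.8, 0.8, 0.6, 0.3, 0.1],
    "judge_c": [1.0, 0.6, 0.6, 0.5, 0.3],
}

# One raw cut-score per judge, as in the procedure described above.
per_judge = {judge: angoff_cut_score(r) for judge, r in panel.items()}

# Assumption: the panel cut-score is the mean of the judges' cut-scores.
panel_cut = mean(per_judge.values())

# The spread between judges is the variability that later rounds try to reduce.
judge_spread = stdev(per_judge.values())
```

The per-judge sums make the between-judge variability visible before any panel value is formed.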
Several studies, however, question the validity of the Angoff methodology because of the finding that judges were unable to perform the fundamental task required of them, namely to estimate the probability that a student would answer an item correctly (Shepard, 1995), even for groups of students who are well known to them (Impara and Blake, 1996).
In addition, standard-setting exercises invariably take place in situations where the reporting of educational standards has a high profile and is of political importance. To address the accountability requirements that accompany such a task, a wide range of stakeholders are invited to act as judges in the exercises. Inevitably, however, variability between the judges' conceptions of the standard, as represented by the cut-score set by each of them, causes concern. Can the public have confidence in the standard set if the judges themselves cannot agree? Several studies report the introduction of further rounds of performance estimation and more refined feedback in an attempt to obtain greater consistency between the judges' ratings (Impara and Blake, 2000; McGinty and Neel, 1996; Reckase, 2000).
More recently, Green, Trimble and Lewis (2003) report a study in which three standard-setting procedures were implemented to set cut-scores and judges were then required to synthesise the results to establish final cut-points. Green et al. cite studies such as Impara and Blake (2000), in which convergence of results among multiple standard settings is used as evidence of the validity of cut-scores, but note that while convergence may occur to a reasonable degree when variations of the same method are used, there are few reports of convergence when different procedures are used.
What distinguishes this study from the standard-setting exercises reported in the literature, which rely on judges' tacit understanding of the standard, is the existence of an explicitly and operationally defined standard. In 1996 the Australian Ministers for Education agreed to a national framework for reporting student achievement in literacy and numeracy, and arising from this decision was the drafting of benchmark standards against which the achievement of students in years 3, 5, 7 and 9 could be reported. The benchmark standards are articulated in two components: criteria describe the skills that students need to have acquired if it is to be said that they have achieved the standard, and samples of work exemplify these criteria.
The setting of standards independently of placing them on a scale permitted a more rigorous assessment of the effects of different designs on the setting of cut-scores. Two different standard-setting methodologies have been employed in this study to translate descriptions of the standards into cut-scores. One draws on the Angoff method and involves the use of a rating scale. Judges consider the items of a test and indicate the probability that a student at the cut-score will answer each item correctly. The probabilities are in increments of 0.10, ranging from 0.0 to 1.0. The sum of the probabilities that a judge gives to the items is taken as the raw score cut-score from that judge. The second study involves a method of pairwise comparison of the same items together with items that are operationalised to be benchmark items. The judge has to decide which of each pair of items is the more difficult.
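To illustrate how "which item of the pair is more difficult" judgements can be turned into relative item difficulties, the sketch below fits a Bradley-Terry-type model with a simple iterative update. The abstract does not say which model the thesis used, so the choice of model, the function name `pairwise_difficulties` and the comparison data are all assumptions made for illustration.

```python
import math

def pairwise_difficulties(items, comparisons, iterations=200):
    """Estimate centred log-difficulties from (harder, easier) judgements.

    Assumes a Bradley-Terry-type model (an illustrative choice) and that
    every item is judged harder at least once and easier at least once.
    """
    wins = {i: 0 for i in items}   # times each item was judged harder
    counts = {}                    # unordered pair -> number of comparisons
    for harder, easier in comparisons:
        wins[harder] += 1
        key = frozenset((harder, easier))
        counts[key] = counts.get(key, 0) + 1

    strength = {i: 1.0 for i in items}
    for _ in range(iterations):
        updated = {}
        for i in items:
            denom = 0.0
            for key, n in counts.items():
                if i in key:
                    (j,) = key - {i}   # the other item in the pair
                    denom += n / (strength[i] + strength[j])
            updated[i] = wins[i] / denom
        total = sum(updated.values())
        strength = {i: v * len(items) / total for i, v in updated.items()}

    logs = {i: math.log(s) for i, s in strength.items()}
    centre = sum(logs.values()) / len(logs)
    return {i: v - centre for i, v in logs.items()}

# Hypothetical judgements: each pair compared four times.
judgements = (
    [("A", "B")] * 3 + [("B", "A")]
    + [("B", "C")] * 3 + [("C", "B")]
    + [("A", "C")] * 3 + [("C", "A")]
)
difficulty = pairwise_difficulties(["A", "B", "C"], judgements)
```

Only the differences between the estimates are meaningful, which is why they are centred on zero. Including the benchmark items in the same set of comparisons would locate them on the same relative scale as the test items.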
The results of the two benchmark setting designs appear to support findings from other standard-setting exercises reported in the literature, namely:
i. Judges were unable to estimate absolute item difficulty for a student of prescribed ability.
ii. Where two different designs were used, there is no convergence in results.
iii. Ratings from different judges within each design varied widely.
To indicate the resultant discrepancy in setting the benchmark on the same test, the rating methodology gives a value of 16.08 and the pairwise methodology a value of 7.10 on ostensibly the same scale. A closer examination of the judges' ratings, however, suggests that despite the evidence of dramatically different cut-scores between the two exercises, the judges were highly consistent in their interpretation of relative item difficulty. Two lines of evidence indicate this high level of internal consistency: (i) the reliability index for the pairwise data; and (ii) the correlation between the item estimates obtained from the rating and pairwise exercises, which was 0.95. In addition, the correlations of the relative item difficulties with those obtained from students responding to the same items were a satisfactory 0.80 and 0.74 for the rating and pairwise designs, respectively.
The high correlation between judgements across the two exercises, in conjunction with the relatively high correlation of the item difficulties from the judges' data and from the student data, suggests that the problems observed in the literature do not arise because judges cannot differentiate the relative difficulties of the items. Accordingly, the units of scale, as assessed by the standard deviations of the item difficulties, were calculated and examined.
The standard deviation of the item difficulties from judges in the likelihood design was half that of the item difficulties from the student responses, and the standard deviation of the item difficulties from the pairwise design was over twice that of the student scale. The substantial difference between the standard deviations suggests a difference between the units of scale, which presents a fundamental problem for common equating. In general, and in the literature, the unit of scale as evidenced by the standard deviations does not appear to be considered; it seems simply to be assumed that the unit of scale produced by the students and by the judges is the same, and that the unit produced by each design is the same. Then, if different modes of data collection do not arrive at the same or very similar cut-scores, it is not considered that this might be solely a result of different units of scale.
In retrospect, it is not surprising that different formats for data collection produce different units of scale, and that different cut-scores result. In addition, it is not surprising that these might also produce a different unit of scale from that produced by the responses of the students. The reasons that the different designs are likely to produce different units of scale are considered in the thesis.
Differences in the unit of scale will inevitably have an impact on the location of the benchmark or cut-score. When the difference in standard deviation is accounted for, and the cut-scores are placed on the same scale as that produced by the students, the two exercises provide similar locations of the benchmark cut-score. Importantly, the thesis shows that these locations can be substantiated qualitatively as representing the defined standard. There are two main conclusions of the study. First, some of the problems reported in the literature in setting benchmarks can be attributed to difference in the units of scale in the various response formats of judges relative to those of students. Second, this difference in unit of scale needs to be taken into account when locating the standard on the student scale.
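One way to picture what accounting for the unit of scale involves is a linear rescaling that matches the mean and standard deviation of the judges' item difficulties to those of the student-derived difficulties for the same items. The abstract does not give the exact transformation used in the thesis, so the sketch below is an assumption, and the function name and all numbers in it are hypothetical.

```python
from statistics import mean, stdev

def rescale_to_student_scale(judge_items, student_items, judge_cut):
    """Map a cut-score from the judges' scale onto the student scale.

    Assumption: a mean and standard deviation matching (linear)
    transformation over the items common to both scales.
    """
    if len(judge_items) != len(student_items):
        raise ValueError("both lists must describe the same items")
    # Ratio of the units of scale, as evidenced by the standard deviations.
    slope = stdev(student_items) / stdev(judge_items)
    return mean(student_items) + slope * (judge_cut - mean(judge_items))

# Hypothetical difficulties for five common items. The judges' scale
# here has twice the spread of the student scale.
judge_items = [-2.0, -1.0, 0.0, 1.0, 2.0]
student_items = [-1.0, -0.5, 0.0, 0.5, 1.0]

# A cut-score of 1.0 on the judges' scale, re-expressed in student units.
student_cut = rescale_to_student_scale(judge_items, student_items, 1.0)
```

In this toy case the judges' unit is twice the student unit, so the cut-score moves to half its distance from the mean. Cut-scores from two designs with different units can be compared only after each has been rescaled in this way.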
This thesis describes in detail the two cut-score setting designs for the data collection, and the transformations that are necessary in order to locate the benchmark on the same scale as that produced by the responses of the students.
Publication Type: Thesis (Professional Doctorate)
Murdoch Affiliation: School of Education