
Constructing and interpreting achievement scales using polytomously scored items: A comparison between the Rasch and Thurstone models

van Wyke, John (2003) Constructing and interpreting achievement scales using polytomously scored items: A comparison between the Rasch and Thurstone models. Professional Doctorate thesis, Murdoch University.

PDF - Whole Thesis
Available upon request


The Department of Education in Western Australia conducts an annual testing program across all state schools for the purposes of demonstrating and improving the quality of the government school system. The program is called Monitoring Standards in Education. Each year a sample of students in their third, seventh and tenth years of schooling, drawn from schools across the state, are tested in two of the eight learning areas which make up the school curriculum. All learning areas are included on a cyclical basis, although English and Mathematics, with their links to the critical areas of literacy and numeracy, are the main focus of the testing program.

The testing program is designed to provide both descriptive and comparative information about the performance of the sampled students. Descriptive information is given in terms of the knowledge, skills and understandings described in the Western Australian standards framework, the Student Outcome Statements. Comparative information is also provided about their performance in order to monitor their improvement in educational outcomes in successive testing programs.

Provision of both descriptive and comparative kinds of information requires the construction of a suitable measurement scale, in order that the knowledge, skills and understandings required by the testing program, as well as the performance of the students who have been tested, can be mapped and compared. The Rasch measurement model provides the means of establishing such a scale.

Testing programs in Mathematics were conducted in 1992 and 1996. These programs are of particular interest because in each of these years a different testing organisation, with different interpretations of the Rasch model, was contracted by the Education Department of Western Australia to conduct the testing program. In particular, one class of test items was treated quite differently in the different testing programs. These were polytomous items with ordered categories, in which the thresholds - the points at which the probability of a response in one of two adjacent categories is equal - were not in their natural order.

According to one interpretation, the values of the thresholds were required to follow a natural order for categories to be treated as working as intended. Reversed threshold estimates were taken as evidence that an item was failing to function as intended. As a result, such items were either rescored or rewritten, so that they did work as intended, or they were removed from the testing program.

In the other interpretation, there was no requirement that threshold values be ordered. Reversed threshold estimates were not taken to mean that there was a problem with the way an item was functioning, but only that there was a problem in the way the item was displayed. The problem of using such items to construct the achievement scale was resolved by calculating an alternative set of thresholds based on the Thurstone Cumulative Probability Model, using parameter estimates produced by the Rasch model. The thresholds derived from the Thurstone model are always in a natural order, irrespective of the order of the thresholds produced by the Rasch model. Using this procedure, items with disordered thresholds were routinely included in the testing program.
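The relationship between the two kinds of threshold can be illustrated with a minimal sketch. It is not the procedure used by either testing organisation. It assumes the partial credit form of the polytomous Rasch model and the common definition of a Thurstone threshold as the ability at which the cumulative probability of scoring in category k or higher equals 0.5. The function names and the example threshold values are hypothetical.

```python
import math

def category_probs(theta, thresholds):
    """Category probabilities under the polytomous Rasch model
    (partial credit parameterisation, assumed here)."""
    # Exponent for category x is the sum of (theta - tau_k) for k = 1..x.
    exponents = [0.0]
    for tau in thresholds:
        exponents.append(exponents[-1] + (theta - tau))
    peak = max(exponents)  # subtract the maximum for numerical stability
    weights = [math.exp(e - peak) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]

def prob_at_or_above(theta, thresholds, k):
    """Cumulative probability of a response in category k or higher."""
    return sum(category_probs(theta, thresholds)[k:])

def thurstone_thresholds(thresholds, lo=-20.0, hi=20.0, tol=1e-9):
    """Ability values at which P(X >= k) = 0.5, found by bisection.
    Assumes each solution lies between lo and hi."""
    result = []
    for k in range(1, len(thresholds) + 1):
        a, b = lo, hi
        while b - a > tol:
            mid = (a + b) / 2
            if prob_at_or_above(mid, thresholds, k) < 0.5:
                a = mid
            else:
                b = mid
        result.append((a + b) / 2)
    return result

# Hypothetical three-category item with reversed Rasch thresholds.
disordered = [0.5, -0.5]
thurstone = thurstone_thresholds(disordered)
print("Rasch thresholds:    ", disordered)
print("Thurstone thresholds:", [round(t, 3) for t in thurstone])
```

Because the probability of scoring in category k or higher always exceeds the probability of scoring in category k + 1 or higher, the abilities at which these cumulative probabilities reach 0.5 must increase with k. The derived thresholds therefore come out ordered even though the Rasch thresholds supplied to the function are reversed.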

This study is an attempt to understand the effect of these two different procedures on the analysis of the different testing programs. Its focus is not on the theoretical differences, however. There are also practical implications of the different methods. The process of building the achievement scales on which the 1992 and 1996 test results were reported and compared was based on different interpretations of the Rasch model, particularly in the use of Thurstone thresholds to make sense of items where reversal of Rasch thresholds had occurred.

The purpose of this study was to analyse how the use of Thurstone thresholds affected the way that the achievement scale was constructed and interpreted.

The analysis is in two parts. The first part focuses on the general effect that the use of Thurstone thresholds in place of Rasch thresholds has on the location of item thresholds, and the effect of this on the construction and interpretation of the achievement scale. The second part narrows this focus to a number of specific items with a view to gaining a greater understanding of the significance of disordered thresholds as an indication of item functioning, and the consequences for retaining or rejecting such items.

In order to do this, three classes of polytomous items, based on the way each is scored, were identified. Each class contains items with ordered thresholds as well as items with disordered thresholds.

The first class consists of hierarchically scored items, where the marking key identifies a hierarchy of responses, and provides a description of the response at each level in the hierarchy. Scoring is based on comparing the student's work with the range of descriptions and allocating a score according to the level in the hierarchy that the response matches.

The second class consists of incrementally scored items, where the marking key specifies a number of elements which go together to make up a correct response. The level of response is determined by the number of elements that the student has correct. Scoring is based on incrementing the student's score for each element that the student has correct.

The third class consists of decrementally scored items, where the marking key specifies the correct response together with a range of partially correct responses, for which marks are deducted. Typically, marks are deducted because the student's response, although substantially correct, contains computational errors, is inaccurate, or is incomplete. Scoring is based on decrementing the student's score for each of these omissions.

Within each class, one example of an item with ordered thresholds and one example of an item with disordered thresholds are examined, with a view to drawing out the significance of disordered thresholds. One of the key points to emerge from this analysis is that disordered Rasch thresholds frequently point to misunderstandings about the way the underlying ability continuum is constructed. The use of Thurstone thresholds masks such misunderstandings, and prevents the opportunity to learn from this important feature of the Rasch model.

Publication Type: Thesis (Professional Doctorate)
Murdoch Affiliation: School of Education
Supervisor: Andrich, David