
My (more or less coherent) views on model fit

Ivailo Partchev

Why measure

Your psychometric model should fit the data, they keep telling me, or else you are in trouble. I find the idea bizarre. I am certainly not out there to explain the exam with a model, I just want to grade it. My (Rasch) model fits the scoring rule, and it is there basically to serve as an equating tool for multiple test forms, an alternative to equipercentile or kernel equating. If I am modeling anything, it is not the data but a particular social situation: mutual agreement that the sum score is a reasonable, acceptable, sensible, optimal way to grade the exam. Some theory but, ultimately, decades of collective experience keep me convinced that the model will also generally fit the data, as there is precious little useful information that is not already in the sum score. After all, the test was made this way.

Hence, in the realm of practical testing (as opposed to research, which is a different story altogether), item fit is primarily a quality control thing. If an item has no correlation with the sum score, it is badly written. If the correlation is negative, the scoring key is wrong. Such situations can usually be identified and corrected quite easily. When the items are decently written and correctly scored, the Rasch model will fit the data grosso modo, and differences in discrimination will tend to cancel out – certainly at test level but possibly even when we put together a small number of items as a subscale.
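
To make the quality-control angle concrete, here is a minimal base-R sketch with simulated 0/1 data (not any real exam) that computes item-rest correlations, which is essentially what I mean by the correlation of an item with the sum score; in practice these classical statistics come straight out of something like dexter's tia_tables().

```r
# Minimal sketch: item-rest correlations as a quality-control check.
# `resp` is simulated 0/1 scored data, purely for illustration.
set.seed(1)
resp <- matrix(rbinom(200 * 10, 1, 0.6), nrow = 200,
               dimnames = list(NULL, sprintf("item%02d", 1:10)))

item_rest <- sapply(colnames(resp), function(i) {
  rest <- rowSums(resp[, colnames(resp) != i, drop = FALSE])  # sum score without the item
  cor(resp[, i], rest)
})

sort(item_rest)  # near zero: badly written item; negative: probably a wrong key
```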

How to measure

As a quality control measure or otherwise, it is of course a good idea to look at item fit. In dexter, our preferred approach is a visual inspection of the item-total regressions. These provide a detailed picture of fit over the whole ability range, involving the observed data, the calibration model, and the interaction model. An overall number might be useful, if anything, to sort the plots for the individual items so that we look at the worst (or best) fitting items first.
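
For the record, a sketch of how such plots are produced in dexter, using the verbal aggression example data that ships with the package; the exact argument names (for instance show.observed) are as I recall them from the vignettes and may differ between versions.

```r
library(dexter)

# Sketch with the verbal aggression example data shipped with dexter.
db <- start_new_project(verbAggrRules, "verbAggression.db")
add_booklet(db, verbAggrData, "agg")

# Fit the interaction model (the Rasch model comes along for free) for one booklet,
# then plot the item-total regressions: observed data, Rasch, and interaction model.
mi <- fit_inter(db, booklet_id == "agg")
plot(mi, items = "S1DoCurse", show.observed = TRUE)
```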

As we dexter people have stated elsewhere, the interaction model captures and reproduces the three aspects of the data that play a role in classical test theory: item difficulty, the item-total correlation, and the score distribution. Moreover, all our practical experience to date shows that the observed data tend to cluster around the interaction model as the sample size increases. As long as we use the interaction model to summarize everything of interest in the data, and the Rasch model to calibrate (i.e., estimate and equate) our multiple test forms, I would only be interested in a numerical summary of the disagreement between these two regressions.

A practice-based gauge (excellent – good – … – check and revise item) on that summary might be useful. But I am not keen to know whether, at my sample size, I have enough evidence to reject a particular statistical hypothesis because, to start with, there is no particular a priori hypothesis in which I am interested. Some sort of variance breakdown of the data between the two models might be feasible, but I am not sure I want to look at the test. All I want is to be able to arrange those lovely item-total regressions by the degree of departure between the calibration model and the systematic component in the data.
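
For what it is worth, here is a hypothetical sketch of the kind of summary I have in mind: a score-distribution-weighted discrepancy between the two item-total regressions, used only to order the plots. The inputs (E_calib, E_inter, score_freq) are stand-ins, not the output of any particular dexter function.

```r
# Hypothetical sketch, not a dexter function: order items by the departure between
# the calibration-model and interaction-model item-total regressions.
# E_calib, E_inter: (total score x item) matrices of expected item scores under each model.
# score_freq: observed frequency of each total score, used as weights.
rank_items_by_departure <- function(E_calib, E_inter, score_freq) {
  w <- score_freq / sum(score_freq)
  departure <- colSums(w * abs(E_calib - E_inter))  # weighted mean absolute difference
  sort(departure, decreasing = TRUE)                # worst agreement first
}
```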