Research on rater bias in classroom observation


The research literature is divided into three areas: studies of rater bias, studies of rater background, and studies of differential rater functioning. See also Designing Teacher Evaluation Systems.

Studies of Rater Bias

The examination of rater bias addresses features of the observers (raters) and the observed (teachers and classroom settings) as they relate to scoring accuracy. Previous studies in educational and psychological measurement have examined these issues in the context of essay scoring or assessment of speaking ability (Ling, Mollaun, & Chen, 2011; Park & DeCarlo, 2011; Xi & Mollaun, 2009). In the medical literature, observation has been used to assess the performance of doctors being trained to diagnose patients (Colliver & Williams, 1993; van der Vleuten & Swanson, 1990). The main concern in using raters to assess performance is the large variability in scores. For example, in a classic study by Diederich, French, and Carlton (1961), in which three hundred essays were judged by fifty-three raters on a nine-point scale, 94 percent of the essays received at least seven different scores. Researchers have identified differences in rater severity as one factor leading to differences in assigned scores (Shohamy, Gordon, & Kraemer, 1992): some raters are more stringent or lenient than others. Other studies have attributed differences among raters to scoring precision, that is, how well raters can discriminate between categories of the scoring rubric (DeCarlo, 2005); when raters have low scoring precision, they cannot reliably distinguish a high score from a low one, and this can obscure the true meaning of their scores.
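The distinction between severity and precision can be illustrated with a small simulation (a hypothetical sketch, not an analysis of MET data): severity shifts a rater's scores systematically up or down, while low precision adds noise that blurs the boundaries between rubric categories even when the average is unaffected.

```python
import random

random.seed(42)  # reproducible illustration

def rate(true_score, severity=0.0, noise_sd=0.0, lo=1, hi=9):
    """Simulate one rater's score on a nine-point rubric: the true score is
    shifted by the rater's severity (negative = harsher) plus Gaussian noise
    representing limited scoring precision, then clipped and rounded."""
    raw = true_score + severity + random.gauss(0, noise_sd)
    return max(lo, min(hi, round(raw)))

def mean(xs):
    return sum(xs) / len(xs)

true_scores = [random.uniform(1, 9) for _ in range(1000)]

lenient   = [rate(t, severity=+1.0, noise_sd=0.3) for t in true_scores]
severe    = [rate(t, severity=-1.0, noise_sd=0.3) for t in true_scores]
imprecise = [rate(t, severity=0.0,  noise_sd=2.0) for t in true_scores]

print(f"true mean       {mean(true_scores):.2f}")
print(f"lenient rater   {mean(lenient):.2f}")    # shifted upward
print(f"severe rater    {mean(severe):.2f}")     # shifted downward
print(f"imprecise rater {mean(imprecise):.2f}")  # similar mean, blurred categories
```

Note that the imprecise rater looks unbiased on average, which is why monitoring mean scores alone cannot detect low scoring precision.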

Studies that have noted differences in rater characteristics have called for rigid protocols within scoring systems to train and monitor rater performance (Congdon & McQueen, 2000). These studies have implications for rater training and for the measurement of performance-based tasks and behaviors. However, to date, no study has investigated these characteristics for observations of teaching effectiveness with the scoring rigor used in the MET study. To improve consistency and minimize rating errors, the literature asserts that raters need to (1) be familiar with the measures they are using, (2) understand the sequence of operation, and (3) be trained on how to interpret the scoring rubric (Coffman, 1971). Several classic studies support the effectiveness of these strategies. For example, in a study by Latham, Wexley, and Purcell (1975), employment interviewers were trained to reduce rater effects, and the training used by Pulakos (1986), which focused on the type, interpretation, and usage of data, yielded greater inter-rater reliability. Furthermore, Shohamy, Gordon, and Kraemer (1992) found that overall reliability coefficients were higher for trained raters than for untrained raters, whereas the background of the raters did not affect their reliability. Although rater training may alleviate rater differences to a degree, studies have shown that completely overcoming them is difficult (Hoskens & Wilson, 2001; Wilson & Case, 2000).

Studies of Rater Background

Beyond examining scoring characteristics of raters with respect to severity and scoring precision, a number of studies have investigated how raters’ backgrounds may affect their scoring performance. Most of these studies were conducted in the context of language tests, such as those for writing or speaking (e.g., Brown, 1991; Hamp-Lyons, 2003; Hinkel, 1994; Pula & Huot, 1993; Schoonen, Vergeer, & Eiting, 1997; Weigle, 2002; Xi & Mollaun, 2009). These studies provide no consensus on how rater background affects scoring performance. For instance, some researchers (e.g., Johnson & Lim, 2009; Myford, Marr, & Linacre, 1996) found no strong, consistent correlation between raters’ native language background and measures of their performance in scoring oral and written responses. However, other studies (Brown, 1995; Eckes, 2008) found that rater background variables, such as native linguistic background, partially accounted for some scoring differences. Carey, Mannell, and Dunn (2011) examined raters’ familiarity with accented English speech and found that a significant proportion of non-native-speaker raters scored candidates from their home country higher than candidates from elsewhere. Little research has been conducted on the effect of raters’ professional background on their performance in scoring video.


Studies of Differential Rater Functioning

Compared to studies on rater background, few studies have examined differential rater functioning, which can occur when a rater exercises differential scoring behavior, such as greater severity toward a specific gender or ethnicity (Engelhard, 2007; Tamanini, 2008). Chase (1986) examined how interactions among student gender, race, reader expectations, and quality of penmanship affected raters’ perceptions of essays. Using essays of two different qualities of penmanship, eighty-three in-service teachers who varied in ethnicity and gender scored packets of essays containing records and pictures of the students, so that rater expectations could be investigated. Using an analysis of variance model, Chase found that the interactions had a significant effect on scores. In studies of medical training, the interaction between the gender of the patient and that of the doctor has been examined, with mixed results regarding the significance of the interaction effect (Colliver, Vu, Marcy, Travis, & Robbs, 1993; Furman, Colliver, & Galofre, 1993; Stillman, Regan, Swanson, & Haley, 1992).
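The logic of an interaction analysis like Chase's can be sketched in miniature: the question is whether the effect of one factor (here, rater expectation) on scores differs across levels of another (penmanship quality), which an interaction contrast captures. The cell data below are entirely hypothetical, chosen only to illustrate the computation.

```python
# Hypothetical essay scores (0-10) in a 2x2 design:
# factor 1 = penmanship (good/poor), factor 2 = rater expectation (high/low).
scores = {
    ("good", "high"): [8, 7, 9, 8, 8],
    ("good", "low"):  [7, 7, 8, 6, 7],
    ("poor", "high"): [6, 7, 6, 7, 6],
    ("poor", "low"):  [4, 5, 4, 5, 4],
}

def mean(xs):
    return sum(xs) / len(xs)

cell = {k: mean(v) for k, v in scores.items()}

# Simple effects of expectation at each level of penmanship.
expectation_effect_good = cell[("good", "high")] - cell[("good", "low")]
expectation_effect_poor = cell[("poor", "high")] - cell[("poor", "low")]

# Interaction contrast: a nonzero value means the expectation effect
# depends on penmanship, i.e., the two factors interact.
interaction = expectation_effect_good - expectation_effect_poor

print(f"expectation effect, good penmanship: {expectation_effect_good:.1f}")
print(f"expectation effect, poor penmanship: {expectation_effect_poor:.1f}")
print(f"interaction contrast: {interaction:.1f}")
```

In a full analysis of variance this contrast would be tested against within-cell variability; the sketch shows only the quantity being tested.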


The findings from these studies of raters emphasize the need to train and monitor raters. The same principle applies to scoring classroom observations, which this chapter investigates. Although the assessment content differs between previous studies of essay scoring and studies of teaching effectiveness, performance scoring in both settings relies on raters who may be subject to bias. In fact, measuring teaching quality may be subject to an even greater array of bias-inducing factors, because scoring is based on observations that involve not only teachers but also various characteristics of classroom settings. For these reasons, prior research on rater bias carries over to the measurement of teaching effectiveness, where characteristics of teachers and classrooms suggest areas for training and for monitoring the scores raters assign.


Investigating rater bias has particular value, because classroom observations can be influenced by various subjective factors. Given that scores assigned by raters can have significant impact on teacher evaluation, paying attention to rater bias becomes important and necessary. To minimize rater bias, the MET study implemented a rigorous scoring process that involved training and monitoring of rater performance.
Examining characteristics of raters, teachers, and classroom settings in the MET data provided limited evidence of significant and meaningful rater bias affecting scoring quality. Furthermore, in general, the group-level behavior of raters was relatively invariant to construct-irrelevant factors.
Among rater characteristics, background variables such as gender, race/ethnicity, experience, and educational level did not have a significant influence on scoring accuracy. Factors such as self-reported levels of familiarity, clarity, or understanding of the instruments also did not generate any meaningful effects on scoring accuracy. Attention to detail and raters’ ability to follow directions were not found to affect scoring accuracy. For classroom settings and teacher characteristics, most factors had weak correlations with rater agreement. Finally, there was no conclusive evidence to support meaningful effects of interactions between rater and classroom/teacher characteristics.
The following policy recommendations for states, school districts, and local agencies can be made.

Develop a Scoring Protocol That Trains and Monitors Rater Performance

The MET study implemented a scoring system that outlined specific requirements for raters through the hiring, training, certification, and recalibration stages; there were ongoing efforts to provide feedback and remediation for raters who performed poorly on calibration and validation cases relative to other raters. These efforts to track rater performance cannot be ignored and deserve greater emphasis. Given the evidence in the literature that differences in rater behavior are reflected in score variability, the bias training and scoring protocol developed by ETS may have contributed significantly to minimizing rater effects.

Implement Ongoing Statistical Monitoring of Raters

Although this study found very little evidence of rater bias, ongoing statistical monitoring of raters should be conducted.
Conducting statistical monitoring of rater performance requires agencies that collect scores from classroom observations to have a scoring system that provides readily accessible data for routine, operational analysis. This means implementing a protocol for routine monitoring of raters that outlines the types of analysis to be executed and the personnel to conduct the statistical work. Operational methods for monitoring raters can include examining measures of agreement with expert observers and agreement from double-scored classrooms. Although ongoing monitoring of raters is necessary, some analyses require larger sample sizes. A technical advisory panel is also recommended to review and advise on patterns or trends in rater performance, including identifying analyses that can be conducted frequently and studies that can serve as periodic checks on rater accuracy, following industry standards in testing.
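One common agreement statistic for double-scored or expert-scored observations is Cohen's (1960) kappa, which corrects raw agreement for the agreement expected by chance. A minimal pure-Python sketch follows; the lesson scores are hypothetical, chosen only to show the computation an operational monitoring system might run routinely.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's (1960) kappa: chance-corrected agreement between two raters
    assigning nominal categories to the same set of observations."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed proportion of exact agreements.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if the raters' category frequencies were independent.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical scores on a four-category rubric for ten double-scored lessons.
expert = [3, 2, 4, 3, 1, 2, 3, 4, 2, 3]
rater  = [3, 2, 4, 2, 1, 2, 3, 4, 3, 3]
print(f"kappa vs. expert: {cohen_kappa(expert, rater):.2f}")
```

For ordinal rubrics, weighted kappa (Cohen, 1968), which penalizes large disagreements more than adjacent ones, is the natural extension of this calculation.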

Provide Individual Feedback and Remediation for Raters

Although most factors associated with classroom and teacher characteristics were not significant, identifying specific raters who are not accurate observers requires systems to monitor raters. The MET study used calibration scores, validation cases, and double-scored data as sources for identifying raters who need remediation. When such raters are identified, diagnostic information and feedback can be provided to improve training.
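Operationally, the remediation step can begin with a simple screen: flag any rater whose agreement with expert scores on validation cases falls below a cutoff, then direct feedback to those raters. The rater IDs, agreement rates, and the 0.75 threshold below are all hypothetical, not MET values.

```python
# Hypothetical per-rater agreement rates with expert scores on validation cases.
validation_agreement = {
    "R001": 0.92,
    "R002": 0.71,
    "R003": 0.88,
    "R004": 0.64,
    "R005": 0.81,
}

# Illustrative cutoff; an operational system would set this from its own
# certification standard and the sampling error of the agreement estimate.
THRESHOLD = 0.75

def needs_remediation(agreement, threshold=THRESHOLD):
    """Return the sorted IDs of raters whose agreement with expert scores
    falls below the threshold, i.e., candidates for feedback and retraining."""
    return sorted(r for r, a in agreement.items() if a < threshold)

flagged = needs_remediation(validation_agreement)
print("raters flagged for remediation:", flagged)
```

A screen like this only identifies who needs attention; the diagnostic value comes from pairing each flag with the specific rubric dimensions on which the rater diverged from the experts.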
In summary, this chapter provides basic principles that districts should consider implementing in the development of their scoring systems. However, these guidelines do not necessarily indicate that the exact procedures implemented in the MET study (e.g., thirty-four hours of training and certification testing) should be followed. The most important points are that raters should be given high-quality training and should demonstrate their ability to score accurately before scoring when the stakes are high. Moreover, scores assigned by trained raters should be monitored on a regular and frequent basis. Given varying degrees of resource constraints and feasibility concerns, districts should prioritize and weigh the consequences of each implementation decision.


American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

Bejar, I. (2012). Rater cognition: Implications for validity. Educational Measurement: Issues and Practice, 31(3), 2–9.

Bill & Melinda Gates Foundation, Measures of Effective Teaching (MET). (2012). Gathering feedback for teaching: Combining high-quality observations with student surveys and achievement gains. Seattle, WA: Author.

Brown, A. (1995). The effect of rater variables in the development of an occupation-specific language performance test. Language Testing, 12, 1–15.

Brown, J. D. (1991). Do English and ESL faculties rate writing samples differently? TESOL Quarterly, 25, 587–603.

Carey, M. D., Mannell, R. H., & Dunn, P. K. (2011). Does a rater’s familiarity with a candidate’s pronunciation affect the rating in oral proficiency interviews? Language Testing, 28, 201–219.

Chase, C. I. (1986). Essay test scoring: Interaction of relevant variables. Journal of Educational Measurement, 23, 33–41.

Coffman, W. E. (1971). Essay examinations. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 271–302). Washington, DC: American Council on Education.

Cohen, J. A. (1960). Coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46.

Cohen, J. A. (1968). Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70, 213–220.

Colliver, J. A., Vu, N. V., Marcy, M. L., Travis, T. A., & Robbs, R. S. (1993). The effects of examinee and standardized-patient gender and their interaction on standardized-patient ratings of interpersonal and communication skills. Academic Medicine, 68(2), 153–157.

Colliver, J. A., & Williams, R. G. (1993). Technical issues: Test application. Academic Medicine, 68(6), 454–463.

Congdon, P. J., & McQueen, J. (2000). The stability of rater severity in large-scale assessment programs. Journal of Educational Measurement, 37, 163–178.

DeCarlo, L. T. (2005). A model of rater behavior in essay grading based on signal detection theory. Journal of Educational Measurement, 42(1), 53–76.

Diederich, P. B., French, J. W., & Carlton, S. T. (1961). Factors in judgments of writing ability (Research Bulletin No. RB-61–15). Princeton, NJ: Educational Testing Service.

Eckes, T. (2008). Rater types in writing performance assessments: A classification approach to rater variability. Language Testing, 25(2), 155–185.

Engelhard, G. (2007). Differential rater functioning. Rasch Measurement Transactions, 21(3), 1124.

Furman, G., Colliver, J. A., & Galofre, A. (1993). Effects of student gender and standardized-patient gender in a single case using a male and a female standardized patient. Academic Medicine, 68, 301–303.

Hamp-Lyons, L. (2003). Writing teachers as assessors of writing. In B. Kroll (Ed.), Exploring the dynamics of second language writing (pp. 162–189). Cambridge, UK: Cambridge University Press.

Hill, P. L., & Roberts, B. W. (2011). The role of adherence in the relationship between conscientiousness and perceived health. Health Psychology, 30, 797–804.

Hinkel, E. (1994). Native and nonnative speakers’ pragmatic interpretations of English texts. TESOL Quarterly, 28, 353–376.

Hoskens, M., & Wilson, M. (2001). Real-time feedback on rater drift in constructed-response items: An example from the Golden State Examination. Journal of Educational Measurement, 38(2), 121–145.

Hoyt, W. T. (2000). Rater bias in psychological research: When is it a problem and what can we do about it? Psychological Methods, 5(1), 64–86.

Jackson, J. J., Wood, D., Bogg, T., Walton, K. E., Harms, P. D., & Roberts, B. W. (2010). What do conscientious people do? Development and validation of the behavioral indicators of conscientiousness (BIC). Journal of Research in Personality, 44, 501–511.

Joe, J. N., Tocci, C. M., Holtzman, S. L., & Williams, J. C. (2013). Foundations of observation: Considerations for developing a classroom observation system that helps districts achieve consistent and accurate scores. MET Project Policy and Practice Brief. Seattle, WA: Bill & Melinda Gates Foundation.

Johnson, J. S., & Lim, G. S. (2009). The influence of rater language background on writing performance assessment. Language Testing, 26(4), 485–505.

Latham, G. P., Wexley, K. N., & Purcell, E. D. (1975). Training managers to minimize rating errors in the observation of behavior. Journal of Applied Psychology, 60, 550.

Ling, G., Mollaun, P., & Chen, L. (2011). An investigation of factors that contribute to speaking responses with human rating disagreement. Unpublished manuscript.

Myford, C. (2012). Rater cognition research: Some possible directions for the future. Educational Measurement: Issues and Practice, 31(3), 48–49.

Myford, C. M., Marr, D. B., & Linacre, J. M. (1996). Reader calibration and its potential role in equating for the Test of Written English (TOEFL Research Report No. 52). Princeton, NJ: Educational Testing Service.

Myford, C. M., & Wolfe, E. W. (2003). Detecting and measuring rater effects using many-facet Rasch measurement: Part I. Journal of Applied Measurement, 4, 386–422.

Park, Y. S., & DeCarlo, L. T. (2011, April). Effects on classification accuracy under rater drift via latent class signal detection theory and item response theory. Paper presented at the Annual Meeting of the American Educational Research Association, New Orleans, LA.

Pula, J. J., & Huot, B. A. (1993). A model of background influences on holistic raters. In M. M. Williamson & B. A. Huot (Eds.), Validating holistic scoring for writing assessment: Theoretical and empirical foundations (pp. 237–265). Cresskill, NJ: Hampton Press.

Pulakos, E. D. (1986). The development of training programs to increase accuracy of different rating forms. Organizational Behavior and Human Decision Processes, 37, 76–91.

Rudner, L. M. (1992). Reducing errors due to the use of judges. Practical Assessment, Research & Evaluation, 3(3). Retrieved from

Schaeffer, G. A., Briel, J. B., & Fowles, M. E. (2001). Psychometric evaluation of the new GRE writing assessment (Research Report No. RR-01–18). Princeton, NJ: Educational Testing Service.

Schoonen, R., Vergeer, M., & Eiting, M. (1997). The assessment of writing ability: Expert readers versus lay readers. Language Testing, 14, 157–184.

Shohamy, E., Gordon, C. M., & Kraemer, R. (1992). The effects of raters’ background and training on the reliability of direct writing tests. The Modern Language Journal, 76, 27–33.

Stillman, P. L., Regan, M. B., Swanson, D. B., & Haley, H. A. (1992). Gender differences in clinical skills as measured by an examination using standardized patients. In I. Hart, R. M. Harden, & J. Des Marchais (Eds.), Current developments in assessing clinical competence (pp. 390–395). Montreal, Canada: Can-Heal.

Tamanini, K. B. (2008). Evaluating differential rater functioning in performance ratings: Using a goal-based approach (Unpublished doctoral dissertation). Ohio University, Athens, OH.

van der Vleuten, C. P., & Swanson, D. B. (1990). Assessment of clinical skills with standardized patients: State of the art. Teaching and Learning in Medicine, 2(2), 58–76.

Weigle, S. C. (2002). Assessing writing. Cambridge, UK: Cambridge University Press.

Wilson, M., & Case, H. (2000). An examination of variation in rater severity over time: A study in rater drift. In M. Wilson & G. Engelhard (Eds.), Objective measurement: Theory into practice (Vol. V, pp. 113–133). Stamford, CT: Ablex.

Xi, X., & Mollaun, P. (2009). How do raters from India perform in scoring the TOEFL iBT® speaking section and what kind of training helps? (TOEFL iBT Research Series No. 11). Princeton, NJ: Educational Testing Service.
