Quality of judgments of inspectors

(See below for English version)

The quality of inspectors’ judgments in health care
In her dissertation, Saskia Tuin examined the reliability and validity of judgments in the regulation of health care within the system of risk-based supervision. The research describes the extent to which inspectors assign the same judgment in similar situations (the reliability of the judgments) and the extent to which these judgments correspond with the standards that the Dutch Health Care Inspectorate (Inspectie voor de Gezondheidszorg, IGZ) has developed for its supervision (the validity of the judgments). It also examined which interventions are effective in improving both the reliability and the validity of inspectors’ judgments. Monitoring and improving the reliability and validity of inspectors’ judgments is an important component of supervision by the IGZ.

Introduction

This dissertation begins by introducing the meaning of interrater reliability and the validity of judgments. Reliable and valid judgments are of great importance in supervision. On the basis of inspectors’ judgments, health care institutions must take measures, where necessary, to improve the quality of their care. If these improvements are not adequate, the IGZ can take measures. If regulatory judgments are not reliable, comparable institutions are judged differently. It is then hard to justify why some institutions have to improve their care while other institutions with comparable care do not. Under comparable circumstances, the same judgments should be given. Interrater reliability has received attention in various professions since the seventeenth century.
The concept of interrater reliability has been studied extensively in, for example, education (the teaching of Dutch), medicine and insurance medicine, the administration of justice, and financial auditing. It is not only important that regulatory judgments are reliable; the validity of judgments is essential as well. If judgments are not valid, inspectors may assign the same judgment to institutions with the same characteristics, but that judgment does not correspond with the regulator’s standards. In the case of false-positive judgments, the judgment given is too positive relative to the standard, and there is a risk that institutions are not required to take measures to improve their care when they actually should have been.

The research questions underlying this dissertation are the following:

• Do IGZ inspectors differ systematically in their judgments of institutions with the same characteristics?
• Do IGZ inspectors’ judgments of institutions correspond with the standards the IGZ uses for its supervision?
• Does the type of regulatory instrument affect the interrater reliability and validity of inspectors’ judgments?
• Which interventions are effective in increasing the interrater reliability of professionals?
• Which interventions are effective in increasing the reliability and validity of IGZ inspectors’ judgments?

Not all judgments are the same

Chapter two describes the analysis of the reliability of inspectors’ judgments on criteria for care in nursing homes. These judgments were assigned in daily regulatory practice in 2005/2006. The regulatory instrument that supports the formation of judgments consists of criteria used to examine the quality of care. These criteria combine measurements of structure, process, and outcome on a number of topics regarded as indicators of good-quality, safe care. One of these criteria is “pressure ulcers” (decubitus). For this criterion, inspectors examine whether the presence of pressure ulcers is recorded by the staff (process) and whether the staff has access to a protocol for preventing and treating pressure ulcers (structure). During regulatory visits, inspectors examine and judge the quality of care on the basis of these criteria. They judge on a four-point scale: “absent,” “present,” “operational,” and “fulfilled.” The regulatory instrument prescribes exactly which judgment applies in which situation.
Inspectors’ judgments on the quality of health care institutions diverge when inspectors examine institutions: comparable care in institutions is not always judged in the same way. The regulatory instrument asks inspectors to provide grounds for their judgment. Whether grounds are provided turns out to depend both on the individual inspector and on the nature of the judgment, that is, whether it is negative or positive. Some inspectors provide grounds for their judgments, while others do not. Positive judgments are substantiated less often than negative ones. Interrater reliability in the IGZ’s supervision of nursing home care is not optimal.
The follow-up research aims to gain insight into the stringency of judgments. This answers the question of whether the validity of the judgments explains the rater differences found.

The relationship between standards and judgments

This part of the research describes the analysis of the validity of judgments in the supervision of nursing home care. Judgments and the accompanying grounds were examined for the following four criteria: “pressure ulcers,” “sufficient help with eating and drinking,” “continuous supervision in living rooms,” and “the extent to which care is needed.” The analysis covered the extent to which the grounds for the judgments correspond with the IGZ standards for the supervision of nursing home care. It also examined the extent to which the actual judgments correspond with the judgments that should have been given under strict application of the IGZ standards. The research shows that inspectors’ judgments do not always conform to the IGZ standards. Comparing the judgments given with the IGZ standards shows that about half of the analyzed judgments are too positive. The percentage of false-positive judgments depends on the criterion judged, but all inspectors assign false-positive judgments to a greater or lesser degree. These findings provide insight into the validity of inspectors’ judgments: the degree of correspondence between the judgments given and the IGZ standards. Neither the reliability nor the validity of the judgments is optimal. The type of regulatory instrument used by the IGZ varies by health care sector. The next phase of the research aims to gain insight into the relationship between the type of regulatory instrument and the reliability and validity of inspectors’ judgments.

Not all instruments are the same

This part of the research centers on the analysis of the reliability and validity of inspectors’ judgments assigned with two different types of regulatory instruments. Judgments assigned with a highly structured instrument (HSI), used in the supervision of nursing home care, were compared with judgments assigned with a lightly structured instrument (LSI), used in the supervision of hospitals. An HSI consists of a fixed number of criteria that are judged at every regulatory visit. The HSI describes exactly which judgment on the criteria applies when. An LSI consists of a fixed number of criteria, or so-called indicators. When an institution has a deviant score on one of these indicators (for example, the institution scores strikingly well or strikingly poorly, or the data show a particular trend over several years), there is a signal on that indicator, and the indicator must be discussed during a regulatory visit. The LSI does not describe which judgment applies when.
The research shows that, with an LSI, the number of indicators inspectors discuss in a regulatory visit to a hospital varies widely. As a result, the reliability and validity of inspectors’ judgments assigned with an LSI cannot be calculated: there are not enough data to compare institutions with the same characteristics. The average number of criteria discussed during regulatory visits to nursing homes with the HSI varies much less. In contrast to the LSI, where the indicators left undiscussed differ each time, the criteria left undiscussed with the HSI are generally the same ones. This means that institutions judged with an HSI are examined with the same set of criteria.

The analysis of the judgments given with an LSI also shows that more indicators without a signal were discussed than indicators with a signal: inspectors choose the indicators they discuss in a regulatory visit on the basis of their individual professional assessment, not on the basis of a signal. This contrasts with the HSI, with which virtually all criteria are discussed in regulatory visits to nursing homes. The results reveal problems in the reliability and validity of the judgments assigned with the HSI, but with the HSI all institutions are at least measured by the same yardstick. Using an HSI is therefore preferable to using an LSI, because it makes it easier to account for regulatory decisions.
Although an HSI is preferable to an LSI, using an HSI also has limitations with respect to the reliability and validity of judgments. Using such an instrument may not be the only way to improve the reliability and validity of inspectors’ judgments. How do other professionals improve their interrater reliability? To answer this question, a systematic literature review was carried out.

Can agreement between judgments be improved: a meta-analytic review

In the literature on interrater reliability, improving the quality of the instrument is central. A systematic literature review and meta-analysis were carried out to examine whether additional training of the raters is a valuable complement to this approach. Because interrater reliability plays a role in many different kinds of professions, literature was searched in both medical and social science databases. The interventions were categorized into three groups: training of the professionals, improving the diagnostic instrument, and a combination of training and improving the diagnostic instrument. Only articles on interventions to improve the interrater reliability of (para)medical professionals were found. No empirical studies were found on interventions to improve the interrater reliability of other professionals, such as judges, teachers, or inspectors.
The effect of all three types of intervention (adjusting the instrument, training the raters, and the combination of both) is significant. Improving (technical) instruments has the largest effect on interrater reliability, but training also increases agreement between raters. Two of these three interventions were subsequently examined in an experimental case study among IGZ inspectors.

Can the reliability and validity of judgments be improved: an experiment

In an experimentally designed case study, the effect of two interventions on the reliability and validity of inspectors’ judgments on nursing home care was examined: adjusting the regulatory instrument and having inspectors take part in a consensus meeting. The effect of the number of judging inspectors on the reliability and validity of the judgments was examined as well. To examine the effect of adjusting the regulatory instrument, a randomized design with a control group was used, in which inspectors were assigned to one of the two groups at random (by lot). One group discussed and judged the cases with the unadjusted instrument (the control group), the other group with the adjusted instrument. The instrument was adjusted in two respects. First, the description of the aspects of risk was formulated positively rather than negatively, so that both the description of the standard and the description of the aspects were formulated positively. Second, checking off the aspects of risk was made mandatory.

The effect of the consensus meeting was examined by carrying out a pretest and a posttest and by comparing the two groups with each other. In the consensus meeting, inspectors discuss cases and try to reach agreement on the judgment. Inspectors discuss a number of criteria, which they have to rank from low risk to high risk. To examine the effect of the consensus meeting, all inspectors involved in the supervision of nursing home care took part in it. Its purpose was to identify, together, common sources of variation in judgments. Inspectors were asked to reach consensus on the order of two sets of four cases, which they had to rank from low to high risk. After the meeting, inspectors examined cases that closely resembled the pretest cases but were not exactly the same, to prevent learning effects from the earlier cases.

Both the reliability and the validity of inspectors’ judgments were highest after the consensus meeting. The results also show that increasing the number of inspectors who judge a case increases both the reliability and the validity of the judgments. In this case study, inspectors were not able to confer with each other about their judgments. This was a given in the analysis of the effect of increasing the number of inspectors who judge a case. Under these experimental conditions, increasing the number of inspectors leads to a substantial increase in both the reliability and the validity of the judgments. In practice, inspectors who visit institutions in pairs or teams will discuss their findings and judgments with each other. It is reasonable to expect that the increase in the reliability of inspectors’ judgments will therefore be larger than in the experimental situation. However, it is not certain whether this expected increase in reliability and validity can simply be generalized to practice, where inspectors do (or can) confer about their judgment. This could be examined further.

Discussion

The outcomes of these studies show that the structuring of assessment instruments and the use of these instruments play an important role in achieving (more) reliable and valid inspectors’ judgments. Focusing on the instrument alone, however, seems too narrow: continuous training in the use of regulatory instruments can prevent inspectors from individualizing their decision process too much. What are the implications of the research for regulatory practice? Improvements are possible in both the professional context and the organizational context. The theory of “reflection-in-action” assumes that professionals acquire knowledge implicitly in daily practice by reflecting on that practice.

Consensus meetings aimed at forming judgments stimulate and frame this reflection through the exchange of experiences and ideas. The way learning organizations work also offers opportunities for the IGZ: monitoring and improving the reliability and validity of the judgments is a characteristic of an organization that wants to keep developing continuously. In a learning organization, the conditions are in place for turning individual learning into team learning. The Academische Werkplaats Toezicht (an academic collaborative center for supervision) brings together the worlds of research and regulatory practice. This stimulates team learning within a strong knowledge structure.

English Summary

This dissertation (proefschrift) examined the reliability and validity of regulatory judgments within the system of risk-based supervision. It describes the correspondence between regulatory judgments and provides insight into the extent to which health care inspectors assign similar judgments to similar situations (reliability) and whether these judgments correspond with the standards developed by the regulatory authority (the Dutch Health Care Inspectorate, IGZ) for its regulatory task (validity). It also examined which interventions are effective for improving the reliability and validity of regulatory judgments. Monitoring and improving the reliability and validity of the judgments can be considered a component of the overall performance of the IGZ.

General introduction

This dissertation starts by defining interrater reliability and validity of judgments. Reliable and valid judgments are important in the regulation of health care. Based on the judgments of its inspectors, the IGZ asks health care institutions to improve the quality of the care they deliver when necessary. If the improvements are not satisfactory, the IGZ can impose administrative sanctions and initiate penal measures. When regulatory judgments are not reliable, institutions with similar characteristics may be judged differently. When this happens, it is hard to explain why some institutions have to improve the quality of their care while others with similar characteristics do not.

However, it is not only the reliability of regulatory decisions that is important – it is equally important that these decisions be valid. When regulatory judgments are not valid, even though inspectors might all assign the same judgment to institutions with similar characteristics, this judgment will not correspond with the regulatory authority’s corporate standards. In the case of false-positive judgments, there is the risk that institutions will not be asked to improve their care, while in fact this should have happened. Interrater reliability has been discussed since the seventeenth century, and the subject is a common one in a variety of professions.
The concept of observer error has been studied extensively in the fields of education, medicine, medical insurance science, penal regulation, and accounting and auditing. The research questions of this thesis are:
1 Do IGZ inspectors systematically differ in the regulatory judgments they assign to similar health care institutions?
2 Do IGZ inspectors assign judgments to health care institutions that conform to the corporate standards and thus result in valid judgments?
3 Do the reliability and validity of the regulatory judgments of IGZ inspectors vary between two types of regulatory instruments?
4 Which interventions are effective for increasing the interrater reliability of professionals?
5 Which interventions are effective for increasing the reliability and validity of the regulatory judgments of IGZ inspectors?

Not all judgments are the same

This chapter describes the analysis of the interrater reliability of the regulatory judgments of nursing home care inspectors. These judgments were assigned to criteria for nursing home care in 2005/2006. The regulatory instrument consisted of criteria for examining the quality of care. These criteria were a combination of measurements of structure, processes, and outcomes. One of these criteria was “pressure ulcers.” For this criterion, inspectors assessed whether the prevalence of pressure ulcers is recorded by the staff (process) as well as whether the staff has a protocol for pressure ulcers (structure). During regulatory visits, inspectors examined the quality of care using these criteria, and assigned scores to the criteria on a four-point scale: “absent,” “present,” “operational,” or “fulfilled.” The regulatory instrument describes exactly which judgment applies in which situation. The results indicated that inspectors’ regulatory judgments vary when examining institutions: institutions with similar characteristics with regard to health care indicators are judged differently.
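
As an illustration of what interrater reliability on such a four-point scale can look like in numbers, the sketch below computes raw agreement and Cohen’s kappa for two inspectors who judged the same cases. The summary does not state which statistic the dissertation used, so the choice of Cohen’s kappa, the function names, and the judgments in the example are all assumptions made for illustration.

```python
from collections import Counter

# Four-point scale named in the text, from least to most positive.
SCALE = ["absent", "present", "operational", "fulfilled"]

def percent_agreement(a, b):
    """Share of cases on which two inspectors assign the same judgment."""
    if len(a) != len(b) or not a:
        raise ValueError("need two equally long, non-empty lists of judgments")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Agreement between two raters, corrected for agreement expected by chance."""
    n = len(a)
    observed = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: product of the two raters' marginal proportions per category.
    expected = sum(count_a[c] * count_b[c] for c in SCALE) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one and the same category throughout
    return (observed - expected) / (1 - expected)

# Hypothetical judgments by two inspectors on the same six cases.
inspector_1 = ["present", "operational", "operational", "fulfilled", "absent", "present"]
inspector_2 = ["present", "operational", "fulfilled", "fulfilled", "absent", "operational"]

agreement = percent_agreement(inspector_1, inspector_2)
kappa = cohens_kappa(inspector_1, inspector_2)
print(f"agreement={agreement:.3f}, kappa={kappa:.3f}")
```

Plain kappa treats the four categories as unordered; a weighted variant would give partial credit for near misses such as “operational” versus “fulfilled.”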

Moreover, inspectors have to provide grounds for their judgments. The presence of grounds for the judgments seems to depend on both the individual inspector and the judgment assigned. Some inspectors provide grounds for their judgment while others do not. In addition, grounds are provided less often for positive judgments than for negative ones. Suboptimal interrater agreement is a cause for concern in the regulation of nursing home care. The next step in this research would be to gain insight into the level of stringency of the regulatory judgments. This could clarify whether the validity of the judgments could also be considered a source of variation.

The relationship between the employment of standards and judgments

This part of the dissertation describes the analysis of the validity of regulatory judgments on nursing home care. Judgments and the grounds for such judgments were selected for four criteria: “pressure ulcers,” “sufficient help with eating and drinking,” “continuous supervision in living rooms,” and “the extent of care needed.” We analyzed the extent to which the arguments contained in the grounds for the judgments corresponded with the IGZ regulatory standards. We also studied the extent to which the actual judgments corresponded with the judgments that should have been assigned based on the arguments presented and the strict employment of the IGZ standards (corporate judgments).
The results indicated that inspectors do not always formulate their judgments according to the corporate standards. About half of the analyzed judgments were too positive compared with the judgments that would have been assigned if the corporate standards had been strictly employed. Although the percentage of false-positive judgments depended on the criterion being judged, they were assigned by all inspectors. These findings provide insight into the validity of the regulatory judgments: the correspondence between the judgments assigned and the corporate standards. The results indicated there are problems with both the reliability and the validity of these judgments. The type of regulatory instrument varies between health care sectors within the IGZ. The next step in this research will be to gain insight into the relationship between the types of regulatory instruments and the reliability and validity of regulatory judgments.
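
The comparison described above can be sketched as follows, assuming each assigned judgment is paired with the judgment that strict application of the standards would yield (the corporate judgment). The ordering of the scale, the function names, and the example pairs are hypothetical and do not reproduce the dissertation’s data.

```python
# Judgment scale ordered from least to most positive (order assumed from the text).
SCALE = ["absent", "present", "operational", "fulfilled"]
RANK = {label: position for position, label in enumerate(SCALE)}

def classify(assigned, corporate):
    """Label an assigned judgment relative to the corporate judgment."""
    difference = RANK[assigned] - RANK[corporate]
    if difference > 0:
        return "too positive"  # a false-positive judgment
    if difference < 0:
        return "too negative"
    return "conforms"

def share_too_positive(pairs):
    """Fraction of (assigned, corporate) pairs where the assigned judgment is too positive."""
    if not pairs:
        raise ValueError("no judgments to compare")
    return sum(classify(a, c) == "too positive" for a, c in pairs) / len(pairs)

# Hypothetical (assigned, corporate) pairs for one criterion.
pairs = [
    ("operational", "present"),
    ("fulfilled", "fulfilled"),
    ("present", "absent"),
    ("present", "present"),
    ("present", "operational"),
]
share = share_too_positive(pairs)
print(f"{share:.0%} of these judgments are too positive")
```

Computing this share per criterion and per inspector would show how false-positive judgments are distributed, which is the kind of breakdown the paragraph above reports.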

Not all instruments are the same

During this part of the research, we studied the reliability and validity of regulatory judgments assigned with two different types of regulatory instruments. Judgments assigned using a highly structured instrument (HSI) for the regulation of nursing home care were compared with the judgments assigned using a lightly structured instrument (LSI) for the regulation of hospital care in the Netherlands.
An HSI consists of a non-variable set of criteria that are examined and scored (judged) during every regulatory visit to a nursing home. An HSI describes exactly when a judgment should be assigned. An LSI consists of a permanent set of indicators. If an institution has a deviant score for one of these indicators (the indicator contains a warning signal), this indicator should be discussed during a regulatory visit. The LSI does not describe exactly when a judgment should be assigned.
The results showed that with the LSI, the number of indicators discussed varied widely between inspectors, and reliability and validity could not be calculated. Not enough data were available to compare institutions with similar characteristics. In contrast to the LSI, the average number of criteria discussed using the HSI varied less, and the criteria that were not discussed were generally the same ones.

There was no relationship between the presence of a warning signal in an indicator and the discussion of that indicator during a regulatory visit: more indicators without signals were discussed than indicators with signals. Inspectors select the indicators to be discussed at their own discretion. With the HSI, by contrast, nearly all of the criteria are discussed during regulatory visits. Although there are problems with the reliability and validity of the judgments assigned with the HSI, at least the same set of criteria is used to compare all of the institutions. Using an HSI is therefore preferable, because it makes it possible to account for regulatory decisions.

The results showed that using an HSI has limitations as well. Because of this, an HSI does not seem to be the only solution for improving reliability and validity. How do other professionals improve their interrater reliability? To answer this question, a systematic review of the scientific literature was performed.

Improving interrater reliability: a meta-analytic review

According to the literature on reliability, the central approach for improving reliability seems to be improving the quality of the instrument. A systematic review and meta-analysis were performed to find out whether additional training of the raters could be a valuable complement to this approach. Because interrater variability occurs in a wide variety of professions, we searched medical and sociological databases. The interventions were categorized into three groups: training of professionals, improving the diagnostic instrument, and a combination of training and improving the instrument.

The results of our searches contained only articles about interventions for improving reliability among health care professionals. No empirical studies were found on interventions for increasing reliability among other professionals, such as judges, teachers, or inspectors.

The results indicated that all three types of interventions have a significant effect on agreement. Improving highly technical instruments (such as CT scans) has the largest effect. It can be concluded that although all types of interventions are effective, improving the instrument seems to be most effective, especially when it concerns highly technical instruments. Because training also increases agreement, this review provides solid arguments for complementing the instrument-centered focus of the literature and of practice with training for the users of the instrument.

To gain insight into whether these outcomes can be generalized to IGZ health care inspectors, the next step in this research was to perform an experimental case study.

Improving interrater reliability and validity: an experiment

We used a case study to investigate the effect of two interventions on the reliability and validity of judgments of nursing home care inspectors: adjustment of the regulatory instrument for the regulation of nursing home care and participation of inspectors in a consensus meeting. Moreover, we explored the effect of an increase in the number of inspectors on the reliability and validity of regulatory judgments.

A randomized controlled trial was used to examine the effect of the adjustment of the regulatory instrument. A before-and-after case study was used to examine the effect of the consensus meeting. Inspectors were randomly assigned to two groups, and they examined cases with either the adjusted or the unadjusted instrument.

The instrument was adjusted in two ways. First, we formulated the description of the aspects of risk positively rather than negatively. As a result, the descriptions of both the standard and the aspects of risk were formulated positively. Second, we made it mandatory to check off the aspects of risk.

In a consensus meeting, professionals come together to discuss cases and try to reach consensus about a judgment. Inspectors discuss a set of cases that they have to rank from “no risk” to “high risk.” To examine the effect of a consensus meeting, all nursing home care inspectors attended one. The purpose of this meeting was to identify common sources of variation. Therefore, the inspectors had to reach consensus about the order of two sets of four cases.

After the consensus meeting, the inspectors examined cases that were very similar, but not identical, to those used in the pretest, to prevent learning effects from the earlier cases. The results showed that the reliability and validity of the judgments were highest after the consensus meeting. Increasing the number of inspectors who judge a case also increased both the reliability and the validity of the regulatory judgments. These calculations presume that inspectors assign scores under the same conditions as in the case study, that is, without talking with each other about their scores when examining the cases. However, it seems unrealistic to expect that inspectors visiting in pairs or teams will not discuss their observations with each other. It therefore seems reasonable to expect a greater increase in the reliability of the regulatory judgments in actual practice, when inspectors do talk with each other about their scores. Whether this expected increase in reliability and validity can be unconditionally generalized to daily practice could be examined in the future.
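
To show why adding independent raters can raise reliability, the sketch below applies the Spearman-Brown formula, a standard psychometric projection for the reliability of the average of several independent ratings. The summary does not say how the dissertation calculated this effect, so the use of Spearman-Brown here is an assumption, and the single-inspector reliability of 0.5 is a hypothetical value.

```python
def spearman_brown(single_rater_reliability, n_raters):
    """Projected reliability of the mean of n independent ratings of the same case."""
    r = single_rater_reliability
    if not 0.0 <= r <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    if n_raters < 1:
        raise ValueError("need at least one rater")
    return n_raters * r / (1 + (n_raters - 1) * r)

# Hypothetical reliability of a single inspector's judgment.
single = 0.5
for n in (1, 2, 3, 4):
    print(n, round(spearman_brown(single, n), 3))
```

The formula assumes that raters judge independently, which matches the case-study condition that inspectors do not confer; it says nothing about what happens once inspectors discuss their scores.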

General discussion

The results of this study showed that the level of structure of regulatory instruments and the use of these instruments are important factors in arriving at reliable and valid regulatory judgments. However, focusing only on the instrument would seem to be too narrow. Continuous education in the use of the regulatory instruments may prevent inspectors from excessively individualizing their regulatory decision process.
What are the implications of this study for daily regulatory practice? Improvements are possible in both the professional and the organizational context. In the “reflection-in-action” theory, the professional acquires knowledge in an implicit manner in daily practice. In the “reflection-on-action” theory he or she learns in an explicit way by reflecting on daily practice. To be able to reflect on their actions, their interpretation of the regulatory observations, and the accompanying regulatory judgments, it is important that the inspectors share their experiences and ideas. “Reflection-on-action” can be facilitated by organizing consensus meetings. Continuous improvement implies constant transformation, which is a characteristic of learning organizations.
The method of thinking used by learning organizations offers opportunities for the IGZ as well: monitoring and improving the reliability and validity of the judgments can be considered a characteristic of an organization that aims to develop itself continuously. Within a learning organization there must be mechanisms for transmitting individual learning so that it becomes organizational learning, or what is known as team learning. Structures that facilitate team learning and that feature boundary crossing and openness are important characteristics of learning organizations.
In the Netherlands this has given rise to organizations like academic collaborative centers, which aim to bring the worlds of research and regulatory practice closer together and facilitate team learning within a strong knowledge infrastructure.

Frans J.G. Janssens

University of Twente/Inter-Continental University of the Caribbean
