Direct Answer

Inter-rater reliability is the degree to which two or more independent raters give consistent assessments of the same person. In hiring, it answers a critical question: if two different references evaluate the same candidate, do they arrive at similar conclusions? High inter-rater reliability means the assessment is capturing something real about the person, not just reflecting the idiosyncratic perspective of one rater.

Why It Matters

Imagine asking two of a candidate’s former managers to rate the candidate’s teamwork skills. If one gives a rating of 9 out of 10 and the other gives 4 out of 10, you have a reliability problem. Either the question means different things to different people, the two managers observed very different behavior, or one of them is simply not providing accurate information.

Low inter-rater reliability makes it impossible to trust the data. Even if a tool has strong predictive validity in theory, that validity is undermined if different raters produce wildly different scores for the same individual. Reliability sets the ceiling for validity — a tool cannot predict outcomes better than it can measure consistently.
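
In classical test theory this ceiling is explicit: the observed correlation between a predictor and an outcome cannot exceed the square root of the product of their reliabilities, so an unreliable measure caps the validity you can ever observe. The sketch below works through that arithmetic with illustrative reliability values, not figures taken from any particular study.

```python
import math

def max_observed_validity(predictor_reliability: float,
                          criterion_reliability: float = 1.0) -> float:
    """Upper bound on an observed validity coefficient under classical test
    theory: r_observed <= sqrt(r_xx * r_yy)."""
    return math.sqrt(predictor_reliability * criterion_reliability)

# Illustrative reliabilities only: a single reference versus an aggregate
# of several references (assumed values, not study results).
print(round(max_observed_validity(0.12), 2))  # 0.35
print(round(max_observed_validity(0.65), 2))  # 0.81
```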

The Science Behind It

Inter-rater reliability is typically measured with the intraclass correlation coefficient (ICC), which, depending on the form chosen, indexes either the consistency or the absolute agreement among raters. In multisource feedback, where multiple people rate the same individual, Greguras and Robie (1998) established benchmark ICCs of .38 for supervisors, .57 for peers, and .48 for direct reports in developmental feedback contexts.
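
To make the statistic concrete, the sketch below estimates a one-way random-effects ICC(1) for a small matrix of hypothetical 1–10 ratings, with candidates as rows and raters as columns. The data, scale, and model choice are assumptions for illustration, not the exact specification used in the studies cited here.

```python
import numpy as np

def icc_oneway(ratings: np.ndarray) -> float:
    """One-way random-effects ICC(1): n targets (rows) each rated by k
    raters (columns), estimated from the one-way ANOVA mean squares."""
    n, k = ratings.shape
    grand_mean = ratings.mean()
    row_means = ratings.mean(axis=1)

    # Between-target and within-target mean squares.
    ms_between = k * ((row_means - grand_mean) ** 2).sum() / (n - 1)
    ms_within = ((ratings - row_means[:, None]) ** 2).sum() / (n * (k - 1))

    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Hypothetical ratings: 4 candidates, each rated 1-10 by 3 references.
ratings = np.array([
    [7, 8, 7],
    [4, 5, 5],
    [9, 8, 9],
    [6, 6, 7],
])
print(round(icc_oneway(ratings), 2))  # ~0.89 for this toy data
```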

For structured employment references specifically, Hedricks et al. (2013) examined inter-rater reliability using generalizability theory in a sample of 4,236 candidates rated by 20,822 references across 33 companies. With approximately five references per candidate, typical inter-rater reliability ranged from .37 to .42 depending on the competency scale. For manager references specifically (averaging three per candidate), reliability ranged from .36 to .41. Importantly, Fisher et al. (2022) found that a non-correlation-based agreement index — the Finn coefficient — was consistently above .90 for structured references, suggesting that the moderate ICCs reflect range restriction inherent to high-stakes contexts rather than genuine disagreement among raters.
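
Fisher et al.’s (2022) exact computations are not reproduced here, but in its basic form the Finn coefficient compares the variance actually observed among raters to the variance expected if they answered uniformly at random across the scale, so tightly clustered ratings score near 1 even when range restriction keeps correlation-based indices modest. A simplified sketch under those assumptions:

```python
import numpy as np

def finn_coefficient(ratings: np.ndarray, scale_points: int) -> float:
    """Basic Finn (1970) agreement index for one target rated by several
    raters: 1 minus observed rater variance divided by the variance of a
    uniform distribution over the scale. Published studies may use variants."""
    expected_var = (scale_points ** 2 - 1) / 12.0
    observed_var = np.var(ratings, ddof=1)
    return 1.0 - observed_var / expected_var

# Hypothetical example: five references rate one candidate on a 1-10 scale.
print(round(finn_coefficient(np.array([8, 8, 9, 8, 7]), scale_points=10), 2))  # 0.94
```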

Adding more raters improves reliability substantially. Hedricks et al. (2013) demonstrated that generalizability coefficients rose from .11–.13 with a single reference to .61–.68 with 15 references for all rater types, and from .16–.19 to .72–.76 for manager references alone.
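
The Hedricks et al. (2013) figures come from a generalizability analysis, but the classical Spearman-Brown prophecy formula, which projects the reliability of an average of several parallel raters from a single-rater reliability, reproduces the same qualitative pattern. The starting value of .12 below is an assumed illustration drawn from the .11–.13 single-reference range quoted above.

```python
def spearman_brown(single_rater_reliability: float, n_raters: int) -> float:
    """Spearman-Brown prophecy formula: projected reliability of the mean
    of n_raters parallel ratings, given one rater's reliability."""
    r = single_rater_reliability
    return n_raters * r / (1 + (n_raters - 1) * r)

# Assumed single-rater reliability of .12; reliability climbs as raters are added.
for n in (1, 3, 5, 10, 15):
    print(n, round(spearman_brown(0.12, n), 2))
# 1 -> 0.12, 3 -> 0.29, 5 -> 0.41, 10 -> 0.58, 15 -> 0.67
```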

Common Misconceptions

People sometimes assume that disagreement between references means someone is being dishonest. In reality, moderate inter-rater agreement is normal and expected — different observers see different slices of a person’s behavior. A manager may observe your performance under deadline pressure, while a colleague sees your collaborative style in team meetings. Both perspectives are valid. The goal is not perfect agreement, but enough consistency to demonstrate that the assessment is capturing real behavioral patterns rather than random noise.

How This Connects to Better Hiring

Inter-rater reliability explains why collecting references from multiple sources is not just a nice-to-have — it is a psychometric necessity. A single reference, no matter how thoughtful, provides limited reliability. Aggregating structured ratings from three to five independent observers produces a substantially more dependable signal, the same way averaging multiple measurements in any scientific context produces a more accurate estimate than a single reading.