What Makes a Professional Reference Useful?

Not all references are created equal. A glowing testimonial from a colleague who barely worked with the candidate tells you far less than a candid, structured assessment from a direct manager who observed them daily for two years. Yet traditional reference processes make almost no distinction between these sources — treating all references as interchangeable tokens of endorsement.

This article examines what separates a useful professional reference from a useless one, drawing on I-O psychology research, psychometric principles, and real-world evidence from structured reference programmes.

The Reference Quality Spectrum

Professional references exist on a spectrum from highly informative to essentially noise. Understanding this spectrum is the first step toward designing a reference process that actually predicts performance.

At one end is a structured, behaviourally specific assessment from someone who directly supervised the candidate's work over a meaningful period, completed anonymously using validated instruments. At the other is a brief, off-the-cuff telephone endorsement from a friend the candidate happened to work near.

The difference in predictive value between these two extremes is enormous — yet traditional processes treat them identically.

Six Dimensions of Reference Quality

1. Relationship Proximity

The single strongest predictor of reference usefulness is the referee's proximity to the candidate's actual work. Research consistently shows that direct supervisors provide more valid performance ratings than peers, who in turn provide more valid ratings than skip-level managers or external contacts.

Schmidt and Hunter's (1998) landmark meta-analysis of selection methods found that supervisor ratings achieve criterion validity of around r = .35, while peer ratings reach approximately r = .27. The closer the referee is to the work, the more signal they can provide.

What to look for

  • Did the referee directly oversee the candidate's work?
  • Did they collaborate closely or receive deliverables from the candidate?
  • For how long, and in what capacity?

2. Observation Duration and Recency

A referee who worked with the candidate for six months two years ago provides a fundamentally different signal from one who supervised them for three years and left the role last month.

Psychometric research on rater accuracy demonstrates two clear patterns:

  • Longer observation periods reduce random error and capture a wider range of behaviours.
  • More recent observation better predicts current capability, as skills and behaviours evolve over time.

The ideal reference comes from someone with both extended and recent exposure to the candidate's work.

What to look for

  • Total length of working relationship (in months/years).
  • How recently they worked together.
  • Whether the referee observed the candidate across different projects, teams, or conditions.

3. Behavioural Specificity

Vague endorsements ("She's great," "He's a real team player") carry almost zero predictive value. Research on behaviourally anchored rating scales (BARS) demonstrates that ratings grounded in specific, observable behaviours are substantially more reliable and valid than global impressions.

A useful reference provides concrete examples, such as:

  • "She managed a cross-functional team of 12 through a product launch that delivered on time and 15% under budget."
  • "He struggled with prioritisation when managing more than three concurrent projects, often missing intermediate deadlines."

These behavioural observations can be meaningfully compared across candidates in ways that "She's fantastic" cannot.

What to look for

  • Descriptions of specific tasks, projects, and outcomes.
  • Clear behavioural indicators (what the candidate did, not just what they are like).
  • Examples of both strengths and development areas.

4. Dimension Coverage

A reference that only speaks to one aspect of performance — for example, technical competence — tells you nothing about interpersonal effectiveness, initiative, adaptability, or integrity.

Research on 360-degree feedback consistently finds that multi-dimensional assessment is more predictive of overall job performance than single-dimension ratings. The same principle applies to references: breadth of coverage increases diagnostic utility.

A well-designed reference instrument maps to competency frameworks derived from job analysis, ensuring that each referee is prompted to evaluate the dimensions that matter most for the role.

What to look for

  • Coverage of multiple, clearly defined competencies (e.g. problem-solving, collaboration, communication, execution, leadership, integrity).
  • Alignment between the dimensions assessed and the requirements of the target role.
  • Space for referees to comment on both role-specific and general professional behaviours.

5. Calibration and Honesty

Some referees are generous raters; others are strict. Without calibration, a "4 out of 5" from one referee may be equivalent to a "3 out of 5" from another. This is the classic problem of rater leniency and severity in performance appraisal research.

Structured reference systems address this through:

  • Forced distribution items that require referees to rank behaviours rather than simply rate them.
  • Relative anchoring, such as "Compared to other professionals at a similar career stage, how would you rate this person on…?"
  • Statistical norm-referencing that adjusts individual referee tendencies against population baselines.
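The norm-referencing idea above can be sketched as within-referee standardisation: each referee's raw ratings are converted to z-scores against that referee's own rating history, so a lenient rater's "4" and a strict rater's "3" become comparable. This is a minimal illustration, not any particular platform's method; the data and function name are hypothetical, and it assumes each referee has rated enough candidates to estimate a personal baseline.

```python
from statistics import mean, stdev

def adjust_for_leniency(ratings_by_referee):
    """Standardise each referee's raw 1-5 ratings against their own
    rating history, neutralising leniency/severity differences."""
    adjusted = {}
    for referee, ratings in ratings_by_referee.items():
        mu = mean(ratings.values())
        sigma = stdev(ratings.values()) or 1.0  # guard against zero spread
        adjusted[referee] = {
            candidate: (score - mu) / sigma
            for candidate, score in ratings.items()
        }
    return adjusted

# Hypothetical data: a lenient referee (A) and a strict one (B)
# rating the same three candidates on a 1-5 scale.
history = {
    "referee_A": {"cand_1": 5, "cand_2": 4, "cand_3": 5},
    "referee_B": {"cand_1": 4, "cand_2": 2, "cand_3": 3},
}
adjusted = adjust_for_leniency(history)
```

After adjustment, referee A's 5 for cand_1 and referee B's 4 for the same candidate both come out as moderately above each rater's own baseline, which is the comparison a hiring team actually wants.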

Honesty is equally critical. A reference is only useful to the extent that the referee provides candid, accurate information rather than socially desirable answers. Anonymity protections and process design significantly affect candour.

What to look for

  • Clear rating anchors that define what each scale point means.
  • Questions framed in relative, not absolute, terms.
  • Processes that protect referee anonymity and reduce fear of repercussions.

6. Independence from the Candidate

Candidate-selected references carry an inherent conflict of interest: the referee was chosen precisely because they are expected to be positive. While completely eliminating this selection effect is impractical in most settings, its impact can be mitigated.

Structured programmes reduce bias by:

  • Collecting more references (typically 3–5 per candidate) to dilute individual bias.
  • Requesting specific relationship types (e.g. most recent direct manager, a peer, a cross-functional stakeholder).
  • Using structured instruments that make it harder to provide uniformly inflated responses without appearing inconsistent.
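The dilution effect in the first point is simple statistics: averaging n independent referee ratings shrinks the random (and idiosyncratic) component of the score by a factor of √n. A quick sketch, using an assumed per-referee error spread of 0.8 points on a 1-5 scale purely for illustration:

```python
import math

rating_sd = 0.8  # assumed spread of individual referee error (illustrative)

for n in (1, 3, 5):
    se = rating_sd / math.sqrt(n)  # standard error of the mean rating
    print(f"{n} referee(s): standard error of mean rating = {se:.2f}")
```

Moving from one referee to five roughly halves the noise around the candidate's true score, which is why structured programmes typically collect 3–5 references rather than one.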

What to look for

  • Diversity of referee roles and perspectives.
  • Evidence that not all referees were hand-picked purely for positivity.
  • Question formats that surface nuance rather than blanket praise.

What Research Tells Us About Useful References

Taylor et al. (2004) conducted one of the few rigorous studies of structured telephone reference checks. They found that structured references achieved criterion validity of around r = .36 for predicting supervisory ratings of job performance — comparable to cognitive ability tests and substantially better than unstructured interviews.

Critically, this validity was achieved only when references were:

  • Collected using a standardised protocol.
  • Scored quantitatively.
  • Aggregated across multiple referees.

When any of these conditions was absent, validity dropped sharply. The structure — not just the reference itself — is what creates the predictive power.

More recently, work by Woehr and colleagues on multi-source feedback validity has reinforced that the combination of structured instruments and multiple rater perspectives produces assessment data that is both more reliable and more valid than any single-source approach.

Practical Implications for Employers

Design for quality, not quantity

Three well-structured references from appropriate sources will always outperform seven unstructured character endorsements. Focus on referee selection criteria and instrument quality rather than volume.

Specify referee relationships

Avoid generic requests for "three references of your choosing." Instead, specify relationship types that ensure diverse, relevant perspectives on the candidate's performance, such as:

  • Most recent direct manager.
  • A peer or close collaborator.
  • A stakeholder from another team or function.

Use validated instruments

Off-the-shelf reference check templates from generic HR platforms are rarely validated. Invest in instruments that are grounded in job analysis and psychometric research, with:

  • Behaviourally anchored rating scales.
  • Clear competency definitions.
  • Standardised questions and scoring rules.

Score and benchmark

Convert reference data into quantitative scores that can be compared across candidates, roles, and time. Without scoring, references remain anecdotal and hard to interpret.

Where possible:

  • Aggregate scores across multiple referees.
  • Compare candidates against relevant norms.
  • Track how reference scores relate to subsequent job performance.
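The first two steps above can be sketched in a few lines: average each candidate's referee scores, then place the aggregate on a percentile scale against historical norms. Equal weighting of referees and the norm data here are assumptions for illustration, not a prescribed methodology.

```python
from statistics import mean
from bisect import bisect_left

def candidate_score(referee_scores):
    """Aggregate one candidate's structured reference scores by
    averaging across referees (equal weighting assumed)."""
    return mean(referee_scores)

def percentile_vs_norms(score, norm_scores):
    """Percentage of historical aggregate scores falling below this
    candidate's aggregate score."""
    norms = sorted(norm_scores)
    return 100.0 * bisect_left(norms, score) / len(norms)

# Illustrative data: three referees rating one candidate on a 1-5 scale,
# benchmarked against hypothetical historical aggregates for similar roles.
candidate = candidate_score([4.2, 3.8, 4.0])
norms = [3.1, 3.4, 3.6, 3.9, 4.1, 4.3, 4.5, 4.8]
pct = percentile_vs_norms(candidate, norms)
```

Even this crude percentile turns "her references seemed strong" into "her aggregate reference score sits at the 50th percentile of past hires", which can then be tracked against subsequent job performance.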

Protect referee anonymity

When referees know their individual responses will not be shared with the candidate, they provide more honest, more differentiated, and more useful feedback. This is one of the most consistent findings in the feedback literature.

Design your process so that:

  • Individual responses are confidential and only aggregated data is shared.
  • Referees are clearly informed about anonymity protections.
  • There is no expectation that the candidate will see verbatim comments tied to specific names.

Conclusion

A useful professional reference is not a character endorsement — it is a structured, multi-dimensional performance assessment from someone with direct knowledge of the candidate's work.

The six quality dimensions — relationship proximity, observation duration, behavioural specificity, dimension coverage, calibration, and independence — together determine whether a reference adds genuine signal or merely noise.

Organisations that understand these dimensions can redesign their reference processes to extract substantially more value from a practice they are already conducting. The tools exist; the question is whether the commitment to evidence-based hiring extends to the final mile of the selection process.