Direct Answer

A behavioral observation scale (BOS) is a rating format that asks evaluators to assess how frequently a person exhibits specific, observable work behaviors, rather than making global judgments about traits or overall performance. Closely related to the behaviorally anchored rating scale (BARS), behavioral rating formats ground evaluations in concrete actions that raters have actually witnessed, not in abstract impressions they have formed.

Why It Matters

Think about the difference between these two questions:

"How would you rate this person's leadership?" (global judgment)

"How often does this person clearly communicate priorities to team members before a deadline?" (behavioral observation)

The first question invites the rater to draw on their overall impression — which may be shaped by the halo effect, recency bias, leniency, or personal liking. The second question directs the rater's attention to a specific, observable action. This distinction is not trivial. It is one of the most effective ways to improve the quality of performance ratings.
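To make the format concrete, here is a minimal sketch of how a BOS is typically scored: each item is an observable behavior rated on a frequency scale, and the overall score is the mean (or sum) across items. The item wording and the 1–5 scale below are illustrative assumptions, not drawn from any published instrument.

```python
# Toy BOS: each item names an observable behavior, rated on a 1-5
# frequency scale (1 = almost never, 5 = almost always).
BOS_ITEMS = [
    "Clearly communicates priorities to team members before a deadline",
    "Documents decisions where the whole team can find them",
    "Asks clarifying questions before committing to an estimate",
]

def score_bos(ratings):
    """Mean frequency rating across BOS items (one 1-5 rating per item)."""
    if len(ratings) != len(BOS_ITEMS):
        raise ValueError("one rating per item required")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must be on the 1-5 frequency scale")
    return sum(ratings) / len(ratings)

print(score_bos([4, 3, 5]))  # 4.0
```

Note what the rater never does here: form an overall impression. Each judgment is anchored to a single behavior and a frequency, and aggregation happens in the scoring, not in the rater's head.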

The Science Behind It

Behavioral rating formats were developed to address persistent problems with traditional graphic rating scales, which typically use vague anchors like "poor," "average," and "excellent." Smith and Kendall (1963) introduced behaviorally anchored rating scales (BARS), which provide specific behavioral descriptions at each point on the scale. Latham and Wexley (1977) developed the behavioral observation scale (BOS), which asks raters to indicate how frequently specific behaviors occur.

The evidence for behavioral formats is nuanced but generally positive. Lubbe and Nitsche (2019) found that BARS reduced assimilation effects — the tendency to rate subsequent candidates similarly to previous ones — by 40% compared to unanchored scales. Drawing on earlier work by Melchers and colleagues, they also reported that anchored scales dramatically improved inter-rater reliability for untrained raters: ICCs rose from .40 with unanchored scales to .71 with anchored scales. For trained raters, the improvement was smaller but still meaningful (.72 to .78).
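The inter-rater reliabilities above are intraclass correlations. As a self-contained sketch, the one-way random-effects form, ICC(1,1), can be computed from a ratees-by-raters matrix via one-way ANOVA mean squares. (The cited studies may have used a different ICC variant; this is the simplest form, shown for intuition.)

```python
import numpy as np

def icc_oneway(ratings):
    """ICC(1,1): one-way random-effects intraclass correlation.

    ratings: 2-D array-like, rows = ratees (targets), cols = raters.
    Returns (MS_between - MS_within) / (MS_between + (k-1) * MS_within).
    """
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    grand_mean = ratings.mean()
    row_means = ratings.mean(axis=1)
    # One-way ANOVA: between-target and within-target mean squares
    ms_between = k * ((row_means - grand_mean) ** 2).sum() / (n - 1)
    ms_within = ((ratings - row_means[:, None]) ** 2).sum() / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Perfect agreement between two raters across three ratees -> ICC = 1.0
print(icc_oneway([[1, 1], [2, 2], [3, 3]]))  # 1.0
```

Intuitively, the statistic asks how much of the rating variance reflects real differences between ratees rather than disagreement between raters, which is exactly what behavioral anchors are meant to increase.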

Hoffman et al. (2012) demonstrated that frame-of-reference scales (FORS), an evolution of behavioral formats, increased the variance attributable to actual performance dimensions while decreasing error variance and dimension overlap. Their laboratory study found that FORS produced accuracy levels comparable to frame-of-reference training — a well-established but time-intensive intervention — without requiring any rater training.

More recent innovations include the computerized adaptive rating scale (CARS), which uses algorithms to sequentially present pairs of behavioral statements. Darr et al. (2017) found that CARS achieved a 20–25% reduction in the standard error of measurement (i.e., greater measurement precision) compared to traditional BARS in a military leadership assessment.
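The adaptive logic can be sketched roughly as follows. The operational CARS procedure uses Bayesian ideal-point scoring; this toy version simply keeps a running effectiveness estimate, presents the statements bracketing it, and nudges the estimate toward whichever statement the rater endorses. All statement texts and scale values below are invented for illustration.

```python
# Invented behavioral statements with pre-scaled effectiveness values
# (1 = low, 7 = high). A real CARS item pool is much larger.
STATEMENTS = [
    (1.0, "Misses deadlines without warning teammates"),
    (2.5, "Communicates priorities only when asked"),
    (4.0, "Usually shares priorities before deadlines"),
    (5.5, "Proactively communicates priorities and tracks follow-through"),
    (7.0, "Builds team routines that keep priorities visible to everyone"),
]

def next_pair(estimate):
    """Present the statements just below and just above the current estimate."""
    below = max((s for s in STATEMENTS if s[0] <= estimate), default=STATEMENTS[0])
    above = min((s for s in STATEMENTS if s[0] > estimate), default=STATEMENTS[-1])
    return below, above

def update(estimate, endorsed, step=0.5):
    """Move the estimate partway toward the endorsed statement's value."""
    return estimate + step * (endorsed[0] - estimate)

below, above = next_pair(4.0)
print(update(4.0, above))  # rater endorsed the higher statement -> 4.75
```

Each pair is chosen to be maximally informative near the current estimate, which is where the precision gain over a fixed BARS form comes from.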

Common Misconceptions

A common concern is that behavioral formats are too rigid — that by specifying behaviors, you miss important aspects of performance that do not fit neatly into predefined categories. There is some truth to this. Borman (1979) noted that raters sometimes struggle to match observed behavior to specific BARS anchors, forcing them to make inferences. Modern approaches like FORS address this by providing behavioral context without rigidly linking specific examples to scale points (Hoffman et al., 2012).

How This Connects to Better Hiring

Behavioral observation scales represent a core principle: good measurement starts with good questions. When you ask raters to evaluate specific, observable behaviors rather than make global trait judgments, you get data that is more reliable, less biased, and more useful for decision-making. This principle applies directly to reference checking — structuring reference questions around observable work behaviors, rather than asking for general impressions, is one of the most effective ways to improve the quality of reference information.