Lab Wiki

Path A — Inter-Rater Reliability

Whether to use IRR, which measure to pick, how to read it, and the kappa paradox.

Inter-rater reliability (IRR) quantifies how much independent coders agree, corrected for the agreement you’d expect by chance. It belongs to the codebook/coding-reliability path — and, importantly, not every study should report it (McDonald et al., 2019) (O'Connor & Joffe, 2020).

The word that does the work is independent. A reliability coefficient is only meaningful on a set the two coders code separately, before discussing it — so the consolidate-as-you-go loop from the previous lesson (which builds the codebook by consensus) can’t produce one. If you need a number, reserve a fresh, un-reconciled set and code it independently first.
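One way to reserve such a set is to draw it at random before any joint discussion. The sketch below is only an illustration: the function name `reserve_reliability_set`, the 20% fraction, and the integer segment IDs are assumptions, not something this lesson prescribes.

```python
import random


def reserve_reliability_set(segment_ids, fraction=0.2, seed=0):
    """Split segment IDs into (reliability_set, remainder).

    The reliability set is meant to be coded separately by each coder
    before any discussion. The default fraction is an arbitrary
    placeholder, not a recommendation.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(segment_ids)
    rng.shuffle(shuffled)
    k = max(1, round(fraction * len(shuffled)))
    return sorted(shuffled[:k]), sorted(shuffled[k:])


# Hypothetical corpus of 50 segments, identified by integers.
held_out, remainder = reserve_reliability_set(range(50), fraction=0.2, seed=1)
print(len(held_out), len(remainder))
```

Fixing the seed keeps the split auditable: both coders can confirm they coded the same held-out segments.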

First question: should you use IRR at all?

If themes are the interpretive product of a reflexive researcher, a reliability coefficient measures the wrong thing (Braun & Clarke, 2019). Use this guide:

Each branch lands on a measure, described by its equation, what the symbols mean, when to use it, and the source. The entry for Cohen’s κ:

Cohen's κ

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

where $p_o$ = observed agreement; $p_e$ = agreement expected by chance, from each coder’s own category rates.

Observed agreement minus what two coders would reach by chance, rescaled by the room left above chance.

Use it for: Two coders, unordered (nominal) categories.

Source: Cohen, 1960
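As a minimal sketch of the definition above, the following computes κ from two coders' labels over the same segments. The function name `cohens_kappa` and the example labels ("barrier", "support", "other") are hypothetical and chosen only for illustration.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders and nominal categories."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must label the same non-empty set")
    n = len(coder_a)
    # Observed agreement: share of segments given the same label.
    p_o = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement: product of each coder's own category rates,
    # summed over categories (a missing category counts as rate 0).
    counts_a, counts_b = Counter(coder_a), Counter(coder_b)
    categories = counts_a.keys() | counts_b.keys()
    p_e = sum((counts_a[c] / n) * (counts_b[c] / n) for c in categories)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels for ten independently coded segments.
coder_a = ["barrier", "barrier", "support", "support", "other",
           "barrier", "support", "other", "barrier", "support"]
coder_b = ["barrier", "support", "support", "support", "other",
           "barrier", "support", "other", "barrier", "other"]
print(round(cohens_kappa(coder_a, coder_b), 2))  # 0.7
```

Here the coders agree on 8 of 10 segments (0.80 observed) and chance agreement is 0.34, so κ = 0.46 / 0.66 ≈ 0.70.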

Hands-on tutorial for computing these: (Hallgren, 2012).

Compute Cohen’s κ — and see the paradox

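Two coders mark whether a code is present or absent on each segment. The calculator takes the four counts of the 2×2 table (rows: coder A present or absent; columns: coder B present or absent), which you can edit directly or load from a preset. It reports the number of segments (n), observed agreement ($p_o$), agreement expected by chance ($p_e$), Cohen’s κ, and Gwet’s AC1.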

Try the Balanced preset, then the Kappa paradox preset: observed agreement stays at about 85%, but κ collapses. That is the kappa paradox: when one category dominates, the agreement expected by chance is inflated, so chance-corrected agreement is deflated even though the raters mostly agree (Feinstein & Cicchetti, 1990) (Cicchetti & Feinstein, 1990). The calculator reports Gwet’s AC1 right beside κ (Gwet, 2008). On the paradox preset AC1 stays high while κ drops sharply, which is why AC1 is the fairer summary under skew.
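The effect can be reproduced offline with a short sketch. The two sets of counts below are hypothetical tables that both give 85% observed agreement; they are not necessarily the calculator's actual presets. The function name `agreement_2x2` is likewise an assumption, and AC1 uses the two-rater, two-category form of Gwet's chance term, 2π(1 − π), with π the mean of the two coders' "present" rates.

```python
def agreement_2x2(a, b, c, d):
    """Agreement statistics for a 2x2 present/absent table.

    a = both present, b = A present and B absent,
    c = A absent and B present, d = both absent.
    """
    n = a + b + c + d
    if n == 0:
        raise ValueError("table is empty")
    p_o = (a + d) / n
    a_present = (a + b) / n  # coder A's rate of "present"
    b_present = (a + c) / n  # coder B's rate of "present"
    # Cohen: chance agreement from each coder's own marginal rates.
    p_e = a_present * b_present + (1 - a_present) * (1 - b_present)
    # Gwet (two categories): chance term from the averaged "present" rate.
    pi = (a_present + b_present) / 2
    p_e_ac1 = 2 * pi * (1 - pi)
    kappa = (p_o - p_e) / (1 - p_e) if p_e < 1 else float("nan")
    ac1 = (p_o - p_e_ac1) / (1 - p_e_ac1)
    return {"n": n, "p_o": p_o, "p_e": p_e, "kappa": kappa, "ac1": ac1}


# Hypothetical counts, both with 85% observed agreement.
balanced = agreement_2x2(40, 9, 6, 45)  # kappa ~ 0.70, AC1 ~ 0.70
skewed = agreement_2x2(80, 10, 5, 5)    # kappa ~ 0.32, AC1 ~ 0.81
for name, r in [("balanced", balanced), ("skewed", skewed)]:
    print(name, round(r["p_o"], 2), round(r["kappa"], 2), round(r["ac1"], 2))
```

In the skewed table Cohen's chance term rises to 0.78, leaving little room above chance, while Gwet's chance term falls to about 0.22. That difference is what pulls the two coefficients apart.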

Reading the number

A common (and explicitly arbitrary) benchmark set (Landis & Koch, 1977):

| κ | Strength of agreement |
| --- | --- |
| < 0.00 | Poor |
| 0.00–0.20 | Slight |
| 0.21–0.40 | Fair |
| 0.41–0.60 | Moderate |
| 0.61–0.80 | Substantial |
| 0.81–1.00 | Almost perfect |
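For reporting, the benchmark bands can be looked up with a small helper. This is a sketch: the name `landis_koch_label` is hypothetical, and treating each band's upper bound as inclusive (so a value such as 0.205 falls in "Fair") is an assumption, since the table leaves the gaps between bands undefined.

```python
def landis_koch_label(kappa):
    """Map a kappa value to the Landis & Koch (1977) verbal label."""
    if not -1.0 <= kappa <= 1.0:  # also rejects NaN
        raise ValueError("kappa must lie in [-1, 1]")
    if kappa < 0.0:
        return "Poor"
    # Upper bounds are treated as inclusive (an assumption, see above).
    bands = [
        (0.20, "Slight"),
        (0.40, "Fair"),
        (0.60, "Moderate"),
        (0.80, "Substantial"),
        (1.00, "Almost perfect"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label


print(landis_koch_label(0.32))  # Fair
print(landis_koch_label(0.70))  # Substantial
```

Because the cut-offs are arbitrary, it is safer to report the coefficient itself alongside any verbal label.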