Path A — Inter-Rater Reliability
Whether to use IRR, which measure to pick, how to read it, and the kappa paradox. · 14 min
Inter-rater reliability (IRR) quantifies how much independent coders agree, corrected for the agreement you’d expect by chance. It belongs to the codebook/coding-reliability path, and, importantly, not every study should report it (McDonald et al., 2019; O'Connor & Joffe, 2020).
The word that does the work is independent. A reliability coefficient is only meaningful on a set the two coders code separately, before discussing it — so the consolidate-as-you-go loop from the previous lesson (which builds the codebook by consensus) can’t produce one. If you need a number, reserve a fresh, un-reconciled set and code it independently first.
First question: should you use IRR at all?
If themes are the interpretive product of a reflexive researcher, a reliability coefficient measures the wrong thing (Braun & Clarke, 2019). Use this guide:
Each branch lands on a measure. For each one, the entries below give its equation, what the symbols mean, when to use it, and the source:
Cohen's κ
$$\kappa = \frac{p_o - p_e}{1 - p_e}$$
where $p_o$ = observed agreement; $p_e$ = agreement expected by chance, from each coder’s own category rates.
Observed agreement minus what two coders would reach by chance, rescaled by the room left above chance.
Use it for: Two coders, unordered (nominal) categories.
Source: Cohen, 1960
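As a rough illustration, κ = (po − pe) / (1 − pe) can be computed from two coders' label lists in a few lines of plain Python. The function name `cohen_kappa` and the ten labels are hypothetical, chosen only for this sketch.

```python
from collections import Counter

def cohen_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders' nominal labels on the same items."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(coder_a)
    # Observed agreement: share of items both coders labelled identically.
    p_o = sum(x == y for x, y in zip(coder_a, coder_b)) / n
    # Chance agreement: each coder's OWN category rates, multiplied and summed.
    rate_a, rate_b = Counter(coder_a), Counter(coder_b)
    p_e = sum(rate_a[c] * rate_b[c] for c in rate_a) / n**2
    if p_e == 1.0:
        raise ValueError("kappa is undefined when both coders use a single category")
    return (p_o - p_e) / (1 - p_e)

# Hypothetical labels for ten segments.
a = ["yes", "yes", "no", "no", "yes", "no", "yes", "no", "no", "yes"]
b = ["yes", "no", "no", "no", "yes", "no", "yes", "yes", "no", "yes"]
print(round(cohen_kappa(a, b), 3))  # 0.6
```

Here the coders agree on 8 of 10 segments (po = 0.8) and each uses "yes" half the time (pe = 0.5), so κ = 0.3 / 0.5 = 0.6.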
Weighted κ
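$$\kappa_w = 1 - \frac{\sum_{i,j} w_{ij}\, o_{ij}}{\sum_{i,j} w_{ij}\, e_{ij}}$$
where $w_{ij}$ = disagreement weight between categories $i$ and $j$; $o_{ij}$ / $e_{ij}$ = observed / expected counts.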
Disagreements are penalised in proportion to how far apart the categories are, so near-misses cost less than gross errors.
Use it for: Two coders, ordered (ordinal) categories where some disagreements are worse than others.
Source: Cohen, 1968
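A minimal sketch of weighted κ as one minus the ratio of weighted observed to weighted expected disagreement. It assumes disagreement weights of the form |i − j| raised to a power (1 for linear, 2 for quadratic), which is one common choice and not the only one; the function name, the `levels` argument, and the ratings are hypothetical.

```python
def weighted_kappa(coder_a, coder_b, levels, power=1):
    """Weighted kappa with disagreement weights |i - j| ** power.

    levels: category labels in order, lowest to highest.
    power=1 gives linear weights, power=2 quadratic weights.
    """
    n = len(coder_a)
    q = len(levels)
    idx = {label: i for i, label in enumerate(levels)}
    # Observed cross-tabulation o[i][j]: coder A said level i, coder B said level j.
    observed = [[0] * q for _ in range(q)]
    for x, y in zip(coder_a, coder_b):
        observed[idx[x]][idx[y]] += 1
    row = [sum(r) for r in observed]
    col = [sum(observed[i][j] for i in range(q)) for j in range(q)]
    num = den = 0.0
    for i in range(q):
        for j in range(q):
            w = abs(i - j) ** power
            num += w * observed[i][j]        # weighted observed disagreement
            den += w * row[i] * col[j] / n   # weighted expected disagreement
    if den == 0:
        raise ValueError("weighted kappa is undefined when only one level is used")
    return 1 - num / den

# Hypothetical 3-point ratings of ten segments.
ra = [1, 1, 2, 2, 3, 3, 1, 2, 3, 2]
rb = [1, 2, 2, 2, 3, 2, 1, 3, 3, 1]
print(round(weighted_kappa(ra, rb, levels=[1, 2, 3]), 3))           # 0.524
print(round(weighted_kappa(ra, rb, levels=[1, 2, 3], power=2), 3))  # 0.667
```

All four disagreements in this example are one step apart, so quadratic weights, which reserve the heaviest penalty for two-step misses, yield the higher coefficient.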
Scott's π
$$\pi = \frac{p_o - p_e}{1 - p_e}$$
where $p_o$ = observed agreement; $p_e = \sum_k \bar{p}_k^2$, using the pooled category proportions $\bar{p}_k$ shared across both coders.
Like kappa, but chance is estimated from one shared category distribution (the pooled marginals) rather than each coder’s own.
Use it for: Two coders, nominal — a simple alternative to kappa using pooled chance.
Source: Scott, 1955
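The only change from Cohen's κ is the chance term, which a short sketch makes visible. The function name `scott_pi` and the labels are hypothetical.

```python
from collections import Counter

def scott_pi(coder_a, coder_b):
    """Scott's pi: chance agreement from ONE pooled category distribution."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(coder_a)
    p_o = sum(x == y for x, y in zip(coder_a, coder_b)) / n
    # Pool both coders' labels (2n assignments) before taking proportions.
    pooled = Counter(coder_a) + Counter(coder_b)
    p_e = sum((count / (2 * n)) ** 2 for count in pooled.values())
    if p_e == 1.0:
        raise ValueError("pi is undefined when only one category is used")
    return (p_o - p_e) / (1 - p_e)

# Hypothetical data: coder A says "yes" 6 times, coder B only 4 times.
sa = ["yes"] * 6 + ["no"] * 4
sb = ["yes"] * 4 + ["no"] * 6
print(round(scott_pi(sa, sb), 3))  # 0.6
```

On the same data Cohen's chance term would be 0.6 × 0.4 + 0.4 × 0.6 = 0.48 rather than the pooled 0.5, so the two coefficients diverge whenever the coders' category rates differ.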
Fleiss' κ
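$$\kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}$$
where $\bar{P}$ = mean per-item agreement across coders; $\bar{P}_e = \sum_j p_j^2$, with $p_j$ the proportion of all assignments that fall in category $j$.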
Scott's π generalised to any number of coders — agreement is averaged over items and corrected for the overall category split.
Use it for: Three or more coders, nominal (coders need not be the same across items).
Source: Fleiss, 1971
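A sketch of Fleiss' κ, assuming every item receives the same number of ratings (the raters themselves may differ from item to item). The function name, the input layout (one list of labels per item), and the data are hypothetical.

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa. ratings: one list of labels per item, equal length m >= 2."""
    n_items = len(ratings)
    m = len(ratings[0])
    if m < 2 or any(len(item) != m for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    totals = Counter()   # category counts over all items and raters
    per_item = []        # share of agreeing rater pairs within each item
    for item in ratings:
        counts = Counter(item)
        totals.update(counts)
        agreeing_pairs = sum(c * c for c in counts.values()) - m
        per_item.append(agreeing_pairs / (m * (m - 1)))
    p_bar = sum(per_item) / n_items
    p_e = sum((t / (n_items * m)) ** 2 for t in totals.values())
    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical data: four items, three ratings each.
items = [["a", "a", "a"], ["a", "a", "b"], ["b", "b", "b"], ["a", "b", "b"]]
print(round(fleiss_kappa(items), 3))  # 0.333
```

Two items are unanimous and two split 2-to-1, giving a mean per-item agreement of 2/3 against a chance term of 0.5.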
Krippendorff's α
$$\alpha = 1 - \frac{D_o}{D_e}$$
where $D_o$ = observed disagreement; $D_e$ = disagreement expected by chance, via a difference function $\delta$ matched to the data’s level.
Works on disagreement: one minus the ratio of observed to expected disagreement, with δ chosen for nominal, ordinal, or interval data.
Use it for: Any measurement level, two or more coders, and/or missing data — the most general single coefficient.
Source: Hayes & Krippendorff, 2007
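A sketch of α = 1 − Do / De for the nominal case only, where δ is 0 for matching labels and 1 otherwise. It builds the usual coincidence matrix and skips units with fewer than two labels, which is how missing data is accommodated. The function name, the use of `None` for a missing label, and the data are hypothetical.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Nominal alpha. units: one list per item with each coder's label, None if missing."""
    coincidences = Counter()
    for unit in units:
        values = [v for v in unit if v is not None]
        m = len(values)
        if m < 2:
            continue  # a lone label cannot be paired with anything
        # Every ordered pair of labels from different coders, weighted by 1/(m - 1).
        for c, k in permutations(values, 2):
            coincidences[(c, k)] += 1 / (m - 1)
    totals = Counter()
    for (c, _k), weight in coincidences.items():
        totals[c] += weight
    n = sum(totals.values())
    d_o = sum(w for (c, k), w in coincidences.items() if c != k) / n
    d_e = sum(totals[c] * totals[k]
              for c in totals for k in totals if c != k) / (n * (n - 1))
    if d_e == 0:
        raise ValueError("alpha is undefined when only one category is used")
    return 1 - d_o / d_e

# Hypothetical data: two coders, ten units, plus one unit coder B skipped.
coder_a = [0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1]
coder_b = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0, None]
units = [list(pair) for pair in zip(coder_a, coder_b)]
print(round(krippendorff_alpha_nominal(units), 3))  # 0.095
```

The coders match on 6 of the 10 pairable units, yet α is close to zero because most of that agreement falls on the dominant category. Ordinal or interval data would need a different δ, which this sketch does not cover.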
Gwet's AC1
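$$\mathrm{AC}_1 = \frac{p_o - p_e}{1 - p_e}$$
where $p_o$ = observed agreement; $p_e = \frac{1}{q - 1} \sum_k \pi_k (1 - \pi_k)$; $q$ = number of categories; $\pi_k$ = mean proportion classified into category $k$.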
Same shape as kappa, but the chance term shrinks when one category dominates, so high agreement is no longer punished by skew.
Use it for: Skewed / high-prevalence coding where kappa is paradoxically low despite high agreement.
Source: Gwet, 2008
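A two-coder sketch of AC1, with pe = (1 / (q − 1)) · Σ πk(1 − πk). The category list is passed in explicitly so that q counts categories the coders could have used, not just those they happened to use; that design choice, the function name, and the data are assumptions of this sketch.

```python
from collections import Counter

def gwet_ac1(coder_a, coder_b, categories):
    """Gwet's AC1 for two coders; categories lists every allowed label."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("need two equal-length, non-empty label lists")
    q = len(categories)
    if q < 2:
        raise ValueError("AC1 needs at least two categories")
    n = len(coder_a)
    p_o = sum(x == y for x, y in zip(coder_a, coder_b)) / n
    pooled = Counter(coder_a) + Counter(coder_b)
    # pi_k: mean of the two coders' proportions for category k.
    pi = [pooled[k] / (2 * n) for k in categories]
    p_e = sum(p * (1 - p) for p in pi) / (q - 1)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical skewed data: "no" dominates for both coders.
ga = ["no"] * 9 + ["yes"]
gb = ["no"] * 8 + ["yes", "no"]
print(round(gwet_ac1(ga, gb, categories=["yes", "no"]), 3))  # 0.756
```

With 80% observed agreement and a 90/10 split, the AC1 chance term is only 0.18. Cohen's chance term on the same data is 0.82, which pushes κ slightly below zero.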
ICC
$$\mathrm{ICC}(1) = \frac{MS_b - MS_w}{MS_b + (k - 1)\, MS_w}$$
where $MS_b$ / $MS_w$ = between-target / within-target mean squares; $k$ = number of raters. (One-way, single-rating form; six ICC variants exist.)
The share of total variance reflecting real differences between targets rather than rater noise; the exact form depends on your design.
Use it for: Continuous or interval ratings rather than categories.
Source: Shrout & Fleiss, 1979; McGraw & Wong, 1996; Koo & Li, 2016
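A sketch of the one-way, single-rating form only, (MSb − MSw) / (MSb + (k − 1) · MSw), computed from a one-way ANOVA decomposition. It assumes a complete design with the same number of ratings per target; the function name and the scores are hypothetical, and the other ICC variants need different mean squares.

```python
def icc_oneway(ratings):
    """One-way, single-rating ICC. ratings: one list of k numeric scores per target."""
    n = len(ratings)
    k = len(ratings[0])
    if n < 2 or k < 2 or any(len(r) != k for r in ratings):
        raise ValueError("need >= 2 targets, each with the same number (>= 2) of ratings")
    grand_mean = sum(sum(r) for r in ratings) / (n * k)
    target_means = [sum(r) / k for r in ratings]
    # Between-target mean square: spread of target means around the grand mean.
    ms_b = k * sum((m - grand_mean) ** 2 for m in target_means) / (n - 1)
    # Within-target mean square: spread of ratings around their own target mean.
    ms_w = sum((x - m) ** 2
               for r, m in zip(ratings, target_means) for x in r) / (n * (k - 1))
    return (ms_b - ms_w) / (ms_b + (k - 1) * ms_w)

# Hypothetical data: four targets, two ratings each on a 1-5 scale.
scores = [[4, 5], [2, 3], [5, 5], [1, 2]]
print(round(icc_oneway(scores), 3))  # 0.871
```

The targets differ far more from one another than the two ratings of any single target do, so most of the variance is attributed to real differences between targets.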
Hands-on tutorial for computing these: (Hallgren, 2012).
Compute Cohen’s κ — and see the paradox
Two coders mark whether a code is present or absent on each segment. Edit the 2×2 counts, or load a preset:
| | B: present | B: absent |
|---|---|---|
| A: present | | |
| A: absent | | |
Try Balanced then Kappa paradox: observed agreement stays ~85%, but κ collapses. That’s the kappa paradox: when one category dominates, the chance term pe climbs toward the observed agreement, so chance-corrected agreement is deflated even though raters mostly agree (Feinstein & Cicchetti, 1990; Cicchetti & Feinstein, 1990). The calculator reports Gwet’s AC1 right beside κ (Gwet, 2008). On the paradox preset AC1 stays sensible while κ craters, which is why AC1 is the fairer summary under skew.
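The paradox can be reproduced offline with a small function over the four cell counts. The counts below are hypothetical stand-ins chosen to give 85% observed agreement; they are not the calculator's actual presets, and the function and argument names are likewise made up for this sketch.

```python
def agreement_summary(both_present, a_only, b_only, both_absent):
    """Observed agreement, Cohen's kappa and Gwet's AC1 from a 2x2 table of counts."""
    n = both_present + a_only + b_only + both_absent
    p_o = (both_present + both_absent) / n
    a_present = (both_present + a_only) / n   # coder A's "present" rate
    b_present = (both_present + b_only) / n   # coder B's "present" rate
    # Kappa: chance agreement from each coder's own rates.
    pe_kappa = a_present * b_present + (1 - a_present) * (1 - b_present)
    # AC1 with q = 2 categories: the sum of pi_k * (1 - pi_k) is 2 * pi * (1 - pi).
    pi_present = (a_present + b_present) / 2
    pe_ac1 = 2 * pi_present * (1 - pi_present)
    return {
        "p_o": p_o,
        "kappa": (p_o - pe_kappa) / (1 - pe_kappa),
        "ac1": (p_o - pe_ac1) / (1 - pe_ac1),
    }

balanced = agreement_summary(40, 9, 6, 45)   # hypothetical balanced table
skewed = agreement_summary(80, 10, 5, 5)     # hypothetical skewed table
for name, result in [("balanced", balanced), ("skewed", skewed)]:
    print(name, {key: round(value, 3) for key, value in result.items()})
```

Both tables have po = 0.85. On the balanced one κ and AC1 are both about 0.70; on the skewed one κ drops to about 0.32 while AC1 is about 0.81.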
Reading the number
A common (and explicitly arbitrary) benchmark set (Landis & Koch, 1977):
| κ | Strength of agreement |
|---|---|
| < 0.00 | Poor |
| 0.00–0.20 | Slight |
| 0.21–0.40 | Fair |
| 0.41–0.60 | Moderate |
| 0.61–0.80 | Substantial |
| 0.81–1.00 | Almost perfect |
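If you want the label attached automatically, the table maps to a simple lookup. The function name is hypothetical, and treating each band's upper bound as inclusive is an assumption, since the table leaves values such as 0.205 unassigned.

```python
def landis_koch_label(kappa):
    """Map a kappa value to the Landis & Koch (1977) descriptive label."""
    if kappa < 0:
        return "Poor"
    bands = [
        (0.20, "Slight"),
        (0.40, "Fair"),
        (0.60, "Moderate"),
        (0.80, "Substantial"),
        (1.00, "Almost perfect"),
    ]
    for upper, label in bands:
        if kappa <= upper:   # upper bounds treated as inclusive (an assumption)
            return label
    raise ValueError("kappa cannot exceed 1")

print(landis_koch_label(0.32))  # Fair
print(landis_koch_label(0.70))  # Substantial
```

Because the cut-offs are arbitrary, the label is best reported alongside the coefficient itself, not in place of it.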