about.txt

Why this exists

The NBA's play-by-play API puts the calling official's name in the description field of every foul. It's an unstructured string that programmatic consumers skip entirely. I parsed it across 13,278 games. The result is the first whistle-by-whistle attribution of shooting fouls to individual NBA officials across full games. It was hiding in plain sight.
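A minimal sketch of the parsing idea, not the pipeline's actual code. It assumes the official's name is the last parenthesized group in a foul description and that a foul-count group such as "(P2.T3)" can come before it. The sample strings, the name "J.Doe", and the function name `extract_official` are made up for illustration.

```python
import re
from typing import Optional

# ASSUMPTION: the official's name is the final "(...)" group in the description,
# e.g. "Tatum S.FOUL (P2.T3) (J.Doe)". The real feed may differ.
_LAST_PAREN_RE = re.compile(r"\(([^()]+)\)\s*$")

# ASSUMPTION: foul-count groups look like "P2.T3" or "P3.PN" (start with P + digits).
_FOUL_COUNT_RE = re.compile(r"^P\d+")


def extract_official(description: str) -> Optional[str]:
    """Return the trailing parenthesized name, or None if there isn't one."""
    match = _LAST_PAREN_RE.search(description)
    if match is None:
        return None
    candidate = match.group(1).strip()
    # If the last group is the foul counter, no official was attached.
    if _FOUL_COUNT_RE.match(candidate):
        return None
    return candidate


if __name__ == "__main__":
    print(extract_official("Tatum S.FOUL (P2.T3) (J.Doe)"))
    print(extract_official("Tatum S.FOUL (P2.T3)"))
```

Returning None instead of raising keeps rows with no attributable official in the data, so they can be counted rather than silently dropped.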

The first question I tried to answer was whether refs who travel more and sleep less have higher error rates. They don't. So I asked a different one: what situations produce the errors that do happen? That became the Attention Load model.

The Harden study is more personal. The 2023 Sixers-Celtics series left a permanent mark. Harden put the team on his back and willed two wins in Games 1 and 4, then completely shit the bed in Games 6 and 7. How does that happen? Does it happen to anyone else? After ruling out the obvious answers, one thing holds: you're more likely to have a terrible playoff game if your free throw attempts disappear. What causes the FTA loss, whether defense, refereeing, or something else, I don't know yet. I'm working on it.

What I've done so far

  • Per-official suppressor/amplifier profiles for 101 officials (ANOVA p = 0.000003)
  • Defense-adjusted FTA/36 deltas for 40 high-FTA players across 3,846 player-official pairs
  • Predictive crew models that forecast individual FTA deviation from crew assignment (Spearman ρ = 0.406)
  • Attention Load model on 51,130 L2M (Last Two Minute report) events. Errors cluster by situation, not by referee
  • 300 shooting foul clips graded by hand. LLMs couldn't do it. They topped out at 55% precision
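
For readers unfamiliar with the FTA/36 deltas in the list above, here is a rough sketch of the unadjusted arithmetic. It leaves out the defense adjustment entirely. The function names and sample numbers are hypothetical, and the sign convention (negative means fewer free throws with that official) is an assumption.

```python
def fta_per_36(fta: int, minutes: float) -> float:
    """Free throw attempts scaled to a 36-minute game."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return fta / minutes * 36.0


def fta36_delta(pair_fta: int, pair_minutes: float,
                base_fta: int, base_minutes: float) -> float:
    """Player's FTA/36 with one official minus the player's overall FTA/36.

    ASSUMPTION: a raw difference with no defense adjustment; negative values
    would read as "suppressor", positive as "amplifier".
    """
    return fta_per_36(pair_fta, pair_minutes) - fta_per_36(base_fta, base_minutes)


if __name__ == "__main__":
    # Hypothetical numbers: 30 FTA in 180 minutes with one official,
    # 800 FTA in 3,600 minutes overall.
    print(round(fta36_delta(30, 180, 800, 3600), 2))
```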

What I killed

The trigger taxonomy doesn't replicate. The box-score architecture model failed (R² = 0.128). The timing axis for foul classification was killed by the Giannis counterexample. I'm including the stuff that didn't work too. That's part of the answer.

Open data

Dataset: CC-BY-4.0. Code: MIT. Officials are named by design. Anonymizing them would kill the point of the dataset.

Reproducibility

The ref-ball pipeline runs from the repo Makefile. This site is generated by site/scripts/build_site_data.py. The frontend never reads parquet.

Source on GitHub →

Author: Harris Gordon