Win Probability Probabilities

Posted by Dan Cervone on March 8, 2017

Win probability models (WPMs) have come under siege in the past year or so, with a seemingly unusual barrage of "improbable" events happening under the brightest spotlights. These include Leicester City's Premier League championship, both the Cleveland Cavaliers and Chicago Cubs overcoming 3-1 finals deficits (and breaking their sports' longest championship droughts in the process), the dramatic finish of the Clemson/Alabama College Football Playoff title game, and the New England Patriots' comeback victory in Super Bowl LI.

In this post, I'll introduce a few tools that can help us evaluate WPMs probabilistically. All WPMs assume some kind of probabilistic evolution of the game's future given the present situation. Because the sequence of game events is random, and win probability is a function of those events (e.g., the score, time remaining, etc.), win probabilities are themselves random variables: randomness in, randomness out. This allows us to ask questions like "what's the probability that the winning team's win probability could dip as low as 5%?"

To help with this, we have NFL in-game win probability data for the past 8 seasons from Pro Football Reference, thanks to Maksim Horowitz's data scraping package. There are more sophisticated WPMs out there, but this one already has public historical data. We can also simulate win probabilities for a mathematical game where we know the true win probability exactly. If you're interested in how that works, see the end of this post for a technical postscript. Simulated win probability helps us illustrate how some theoretical aspects of WPMs should play out in reality. If our NFL WPM curves behave differently, then that's a bad sign for our NFL WPM.

1. Teams win 23% of the times win probability is 23%

OK, that sounds really obvious, but it's worth talking a bit about what this really means. When a team is given a win probability of 65% at halftime and ends up winning the game, that may look like a "correct" prediction. However, if we take all the teams historically that have a 65% win probability at halftime and find out that 80% of them actually end up winning, this tells us our WPM is miscalibrated---its predictions don't match reality. Calibration is a check on whether probabilities predicted by WPMs match the eventual winning frequencies. Crucially in debating the merits of WPMs, calibration asks us to make sure rare events are actually rare (and also not impossible!).

All WPMs should be properly calibrated across the entire probability spectrum (0% to 100%) at all points in the game (end of the first quarter, halftime, right before the last play of the game, etc.).
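A calibration check like the one described above can be sketched in a few lines. This is a minimal illustration, not the code behind this post: `calibration_table` is a hypothetical helper, and the predictions are synthetic (drawn so that they are calibrated by construction) rather than NFL data. In practice `pred` would hold every team's win probability at one fixed point in the game, and `won` whether that team went on to win.

```python
import numpy as np

def calibration_table(pred, won, n_bins=10):
    """Bin predicted win probabilities and compare them to observed win rates."""
    pred = np.asarray(pred, dtype=float)
    won = np.asarray(won, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each prediction to a bin; clip so that pred == 1.0 lands in the last bin.
    idx = np.clip(np.digitize(pred, edges) - 1, 0, n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            rows.append((edges[b], edges[b + 1], int(mask.sum()),
                         float(pred[mask].mean()), float(won[mask].mean())))
    return rows

# Synthetic, perfectly calibrated "model": each outcome is a coin flip with
# the predicted probability (an assumption made purely for illustration).
rng = np.random.default_rng(0)
pred = rng.uniform(size=50_000)
won = rng.uniform(size=pred.size) < pred

table = calibration_table(pred, won)
for lo, hi, n, mean_pred, win_rate in table:
    print(f"[{lo:.1f}, {hi:.1f}): n={n:5d}  predicted={mean_pred:.3f}  observed={win_rate:.3f}")
```

For a calibrated model the predicted and observed columns should agree up to sampling noise in every bin, including the bins near 0 and 1 where the "improbable" events live.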

The simulated win probability calibration looks really good, as expected. In the beginning of the "game", probabilities tend to be bunched up around 0.5 (they start at 0.5), whereas near the end they are close to 0 or 1, as that's where they'll eventually end. We see the same in the NFL, where the win probability model we use has pretty good calibration.

2. WPMs do not ride the hot hand

Momentum may exist in sports, but not in in-game win probability numbers. Explicitly, this means that future changes in win probability are uncorrelated with past changes; if the win probability increased by 0.05 on the most recent play, that tells us nothing about how it will change on the next play. So if momentum actually exists in some aspects of a sport, a perfect WPM would already have that baked in.

We can measure this by calculating the autocorrelation of successive changes in a win probability curve. This autocorrelation at lag $k$ is the correlation in the win probability added of events $k$ plays apart. For instance, a positive autocorrelation at a lag of 1 would be a "hot hand" effect: good win-probability-added plays tend to follow other good win-probability-added plays. Because WPMs do not ride the hot hand or exhibit any other kind of momentum, the autocorrelation should be 0 across the board.
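As a rough sketch of that calculation, the snippet below pools win-probability-added pairs across games and computes their correlation at a given lag. The function name `wpa_autocorrelation` and the two toy curves are hypothetical, chosen only to show what negative and positive autocorrelation look like; they are not real games.

```python
import numpy as np

def wpa_autocorrelation(curves, lag):
    """Correlation between win probability changes `lag` plays apart, pooled over games."""
    earlier, later = [], []
    for curve in curves:
        wpa = np.diff(np.asarray(curve, dtype=float))  # win probability added per play
        if len(wpa) > lag:
            earlier.append(wpa[:-lag])
            later.append(wpa[lag:])
    return float(np.corrcoef(np.concatenate(earlier), np.concatenate(later))[0, 1])

# Toy curves (made up for illustration).
zigzag = [0.5, 0.6, 0.5, 0.6, 0.5, 0.6, 0.5]          # every gain is given straight back
streaky = [0.5, 0.55, 0.6, 0.65, 0.6, 0.55, 0.5, 0.45]  # gains follow gains, losses follow losses

print(wpa_autocorrelation([zigzag], lag=1))   # strongly negative
print(wpa_autocorrelation([streaky], lag=1))  # positive, a "hot hand" pattern
```

A WPM that behaves as it should would give values near 0 at every lag once enough games are pooled.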

We get this for simulated win probability, but NFL win probabilities show a bit of negative autocorrelation. Instead of a hot hand effect, it's more like a "when your right hand is hot, use your left" effect, but it's still bad news. Given what we know about football, this effect isn't too surprising to see, since NFL offenses do things like trade short gains in the running game for a future play-action pass. The NFL WPM we're looking at should have this effect built in, but it doesn't.

3. Hindsight is 20/80

WPM skepticism usually only comes up after the fact. We'll question a 1% win probability if that team ends up winning, but probably ignore it had that team actually lost, even though in theory the 1% number's correctness depends only on the data up until that point in the game, and not the eventual game outcome. The problem here is that, even given the fact that a team wins, there was almost certainly a point during the game where that team had below 50% win probability. In fact, the probability that the winning team's in-game win probability dropped below $c$ at one point is approximately $c/(1-c)$. Let's call this the comeback rule. Thus there is actually a $20/80 = 25\%$ chance that a winning team had a win probability below 20% at some point.

First, the comeback rule actually reveals a comeback bias: it's more common to see a winning team face a win probability of $c$ at some point than it is for a team with a win probability of $c$ to eventually win. But the comeback rule also allows us to check whether "improbable" comebacks happen more often than they should. For instance, we'd predict that $5/95 = 5.26\%$ (higher than 5%, but still long odds) of winning teams would face a win probability as low as 5%; thus, if we actually saw that 20% of winning teams faced such a win probability deficit at some point, that'd signal a problem with the WPM.
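To make the check concrete, here is a small sketch that compares the comeback rule with the share of winners who actually faced a deficit that deep. The helper names and the four toy games are hypothetical; each curve is one team's win probability and ends at 1 (that team won) or 0 (it lost). The `p` argument uses the more general form of the rule from the technical notes, $(1-p)c / (p(1-c))$, which reduces to $c/(1-c)$ at $p = 0.5$.

```python
def winner_min_wp(curve):
    """Lowest win probability the eventual winner faced during the game."""
    if curve[-1] == 1:
        return min(curve)
    # The tracked team lost, so flip the curve to the opponent's point of view.
    return min(1.0 - x for x in curve)

def comeback_fraction(curves, c):
    """Share of winners whose win probability dropped to c or below at some point."""
    mins = [winner_min_wp(curve) for curve in curves]
    return sum(m <= c for m in mins) / len(mins)

def comeback_rule(c, p=0.5):
    """Comeback rule for a team with pre-game win probability p."""
    return (1.0 - p) * c / (p * (1.0 - c))

# Toy games (made up for illustration).
games = [
    [0.5, 0.4, 0.15, 0.6, 1.0],  # winner dipped to 15%
    [0.5, 0.7, 0.9, 1.0],        # winner never trailed
    [0.5, 0.3, 0.1, 0.0],        # tracked team lost; opponent never trailed
    [0.5, 0.8, 0.6, 0.0],        # tracked team lost; opponent dipped to about 20%
]

print(comeback_fraction(games, c=0.25))  # observed share among these four games
print(comeback_rule(0.2))                # about 0.25
print(comeback_rule(0.05))               # about 0.0526
```

With real data, an observed share well above the rule's value at small $c$ would be the warning sign described above.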

For simulated win probability, the eventual winner's minimum in-game win probability lines up very well with the $c / (1-c)$ comeback rule. Once again, we're seeing that the chance of the eventual winner facing a win probability below $c$ is $c / (1-c)$. For the NFL, winning teams are noticeably less likely to have faced long odds at some point during the game, but this is actually still consistent with the comeback rule. We called $c / (1-c)$ an approximate formula, and the nature of that approximation agrees with the behavior of the NFL curves relative to the simulated curves (for details on that, see the technical material at the end of this post). So other than some weirdness in the 45% area, the NFL results don't raise any red flags.

4. If a WPM has a pulse, it's dead

Tightly contested nail-biters and back-and-forth slugfests both keep viewers on the edge of their seats, since they're in suspense about who will win. For this reason, both should have win probability graphs that are actually pretty smooth, even though the amount of scoring could be very different. Win probability graphs should not look like EKGs, spiking up and down repeatedly, like this NFL game from 2013. (They also need to stay between 0 and 1!)

As with our earlier points, we actually have a formula for how much oscillation we should expect to see in a win probability graph. The expected number of times a team goes from having win probability less than $c$ to having win probability greater than $1 - c$ should be no bigger than $c / (1 - 2c)$. Let's call this the switching rule. For instance, for $c = 1/3$, we get $c / (1 - 2c) = 1$, thus we can expect one time during the game where a team that once had a win probability below $1/3$ now has win probability above $2/3$---maybe zero, but not likely more than one. As with our other formulas, the switching rule should hold true across all sports and win probability models.
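Counting switches in a single curve can be sketched as a small state machine: remember which extreme zone (at or below $c$, or at or above $1-c$) the curve visited last, and count each move to the opposite zone. The function names and the toy curves are hypothetical, and the bound assumes $c < 0.5$ and evenly matched teams.

```python
def count_switches(curve, c):
    """Count moves between the zone at or below c and the zone at or above 1 - c."""
    low, high = c, 1.0 - c
    zone = None          # last extreme zone visited; None until the curve reaches one
    switches = 0
    for x in curve:
        if x <= low:
            new_zone = "low"
        elif x >= high:
            new_zone = "high"
        else:
            continue     # in the middle band, nothing changes
        if zone is not None and new_zone != zone:
            switches += 1
        zone = new_zone
    return switches

def switching_bound(c):
    """Switching rule: upper limit on the expected number of switches per game."""
    return c / (1.0 - 2.0 * c)

# Toy curves (made up for illustration).
ekg = [0.5, 0.3, 0.7, 0.75, 0.2, 0.5, 0.8, 1.0]
smooth = [0.5, 0.55, 0.6, 0.7, 0.9, 1.0]

c = 1 / 3
print(count_switches(ekg, c), count_switches(smooth, c), switching_bound(c))
```

One game with several switches proves nothing by itself, since the rule limits the average; the comparison to make is between the mean of `count_switches` over many games and `switching_bound(c)`.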

Here, we see that when we simulate win probability, unsurprisingly teams switch about as often as we expect, following the $c/(1-2c)$ switching rule. For the NFL, teams switch win probabilities much less often, which is okay because $c/(1-2c)$ is an upper limit.

So is Win Probability broken?

WPMs are surely not always right in every situation, but calling them "broken" would be a dramatic overstatement. As Michael Lopez recently wrote, "all WPMs are wrong but some are useful." By far, the most important check on a WPM is the calibration one we did in step 1. That didn't pick up any red flags for the NFL WPM we used, and neither did our checks of the comeback and switching rules. The NFL WPM we looked at did show some autocorrelation, but it was pretty minor. Not failing our tests is not the same as passing them, however, so the haters can keep hating.

Of course, other WPMs in the NFL or other sports may fare differently on the checks we explored in this post. Also, WPMs could fail conditional on specific game situations despite looking okay overall. For instance, it's reasonable to think an NFL WPM would underestimate the New England Patriots' true comeback probability in Super Bowl LI, as most models do not factor in the statistical strength of New England's offense and the weakness of Atlanta's defense. If we looked at calibration or any of these other checks team-by-team, we might uncover some biases in different WPMs. With this post, we now have some tools to evaluate WPMs, which can help us diagnose bad WPMs and ensure future WPMs play by the rules.

Technical notes

The NFL data and code for this post are on github.

Our simulated win probability comes from simulating a Gaussian random walk (Brownian motion) for $T = 10000$ steps, with a team "winning" if the Brownian motion $B_t$ is greater than 0 at termination, $B_T > 0$. At time $0 \leq t \leq T$, we thus get simulated win probability from $X_t = \Phi(B_t / \sqrt{T - t})$, where $\Phi$ is the Gaussian CDF. Note that we could start simulated win probability at $X_0 = p\:$ (instead of 0.5) using $B_t + b$, where $b$ satisfies $p = \Phi(b/\sqrt{T})$.
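The simulation described above can be sketched as follows. This is an illustrative reimplementation, not the code from the repository: the function names are hypothetical, $T$ and the number of games are kept small for speed (the post uses $T = 10000$), and $\Phi$ is built from the standard library's `math.erf`. The final column is set directly from the sign of $B_T$, since the formula divides by zero at $t = T$.

```python
import math
import numpy as np

# Vectorized standard normal CDF built from the standard library's erf.
_erf = np.vectorize(math.erf)

def norm_cdf(z):
    return 0.5 * (1.0 + _erf(z / math.sqrt(2.0)))

def simulate_wp(n_games, T, rng):
    """Return an (n_games, T + 1) array of win probability curves X_0, ..., X_T."""
    steps = rng.standard_normal((n_games, T))
    # B_0 = 0, then the running sum of Gaussian steps.
    B = np.concatenate([np.zeros((n_games, 1)), np.cumsum(steps, axis=1)], axis=1)
    remaining = T - np.arange(T)             # T - t for t = 0, ..., T - 1 (all positive)
    X = np.empty((n_games, T + 1))
    X[:, :T] = norm_cdf(B[:, :T] / np.sqrt(remaining))
    X[:, T] = B[:, T] > 0                    # the game is decided at t = T
    return X

rng = np.random.default_rng(2017)
X = simulate_wp(n_games=4000, T=400, rng=rng)

# Comeback rule check at c = 0.2: view each game from the winner's side.
won = X[:, -1] == 1.0
winner_curve = np.where(won[:, None], X, 1.0 - X)
c = 0.2
frac = float(np.mean(winner_curve.min(axis=1) <= c))
print(f"winners who dipped to {c:.0%} or below: {frac:.3f} (rule: {c / (1 - c):.3f})")
```

Because this walk moves in discrete steps rather than continuously, the observed share should sit near, and typically a little below, the $c/(1-c)$ value.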

Any win probability sequence through time is a martingale that converges almost surely to a Bernoulli random variable. This characterization automatically implies the calibration property of (1), as well as the "momentum" property of (2) since martingale differences are uncorrelated.

To show (3), let $X_t$ represent a team's win probability, with $X_t \stackrel{a.s.}{\rightarrow} X_T \sim \text{Bern}(p)$. For some $c$, let

$$\tau = \min \{t : X_t \leq c \text{ or } X_t = 1\}$$

be a stopping time. By the optional stopping theorem, $E[X_{\tau}] = E[X_0] = p$. We also know

$$p = E[X_{\tau}] \leq (1 - \pi_c) + \pi_c c$$

where $P(X_{\tau} \leq c) = \pi_c$. Thus, $\pi_c \leq \frac{1 - p}{1-c}$. The comeback rule represents

$$P(\min_{0 \leq t \leq T} X_t \leq c | X_T = 1),$$

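which, because a team whose win probability is at or below $c$ at time $\tau$ goes on to win with probability at most $c$, can be written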

$$\frac{P(\min_{0 \leq t \leq T} X_t \leq c \text{ and }X_T = 1)}{P(X_T = 1)} \leq \frac{c \pi_c}{p} \leq \frac{(1-p) c}{p (1-c)}.$$

If $X_t$ is continuous (not the case for WPMs that update after each play/possession), equality holds throughout the derivation above; this is the sense in which the comeback rule is "approximate". Note that assuming $p = 0.5$ (teams have equal win probability at the start of the game), we get $c / (1-c)$.

Now, to show the switching rule, let $U(c)$ be the number of upcrossings for $X_t$ from $c\:$ to $1-c$. Doob's upcrossing lemma gives

$$ E[U(c)] \leq \frac{\max_{0 \leq t \leq T} E[(c - X_t) \vee 0]}{1 - 2c}. $$

Since $X_t$ is a martingale, $(c - X_t) \vee 0$ is a submartingale, so its expectation is nondecreasing in $t$ and the maximum is attained at $t = T$:

$$ \max_{0 \leq t \leq T} E[(c - X_t) \vee 0] = E[(c - X_T) \vee 0] = (1-p) c.$$

Thus, $E[U(c)] \leq \frac{(1-p)c}{1 - 2c}$. Assuming $p = 0.5$, given that a "switch" corresponds to an upcrossing for either team (i.e., the number of switches in a game is $U_1(c) + U_2(c)$), we get the $c / (1-2c)$ rule.