Modelling Team Scores as Weibull Distributions
Back in February I wrote about the use of the Pythagorean Expectation formula for modelling end-of-season VFL/AFL team winning percentages and highlighted its efficacy for that purpose. More recently I described how the Pythagorean Expectation formula could be interpreted as making a probabilistic statement about the relationship between a team's probability of victory and the ratio of the points it's expected to concede to the points it's expected to score in a particular game.
A recent paper on arXiv provided a statistical motivation for that interpretation of the Pythagorean Expectation formula by showing that it can be derived if we consider the two teams' scores in a contest to be distributed as independent Weibull variables under certain assumptions.
THE WEIBULL DISTRIBUTION
The Weibull distribution is not one I've written about before in MatterOfStats, and not one I've otherwise seen come up in discussions about the statistical modelling of sports.
It's a reasonably flexible, continuous distribution, defined for all real values of its input greater than or equal to 0, and in one of its most common forms includes two real parameters, a shape parameter k and a scale parameter lambda, that together allow it to take on a variety of shapes.
Though the Weibull is typically associated with survival analysis - that is, with modelling the lifetime of something - there's no particular reason to exclude it as a plausible model of a team's score in an AFL game. In its favour is the fact that it's defined only for values of 0 or greater, though its continuous nature means that some discretisation would be necessary for its use in practice.
In the context of the result from the paper cited earlier, the conditions we require to derive the Pythagorean Expectation equation are that the Weibull distributions for the two teams' scores are independent and share the same value of the shape parameter k.
If those assumptions are met we can derive the Pythagorean Expectation equation linking a team's victory probability to the ratio of its opponent's and its own expected score. I've described this result a little more mathematically at right.
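For readers who'd like to check the result numerically, here's a minimal sketch. It assumes the plain two-parameter Weibull with no location shift, in which case the probability that one team outscores the other works out to 1 / (1 + (opponent's expected score / own expected score)^k). The function names are mine, and the simulation is only a sanity check on the closed form, not a reproduction of anything in the paper.

```python
import math
import random

def pythagorean_win_prob(own_mean, opp_mean, k):
    """Closed-form P(own score > opponent score) for independent Weibulls
    sharing shape k. With a common k, the ratio of the scale parameters
    equals the ratio of the expected scores."""
    return 1.0 / (1.0 + (opp_mean / own_mean) ** k)

def simulated_win_prob(own_mean, opp_mean, k, n=200_000, seed=1):
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    g = math.gamma(1.0 + 1.0 / k)      # mean = scale * gamma(1 + 1/k)
    own_scale, opp_scale = own_mean / g, opp_mean / g
    # random.weibullvariate takes (scale, shape) in that order
    wins = sum(
        rng.weibullvariate(own_scale, k) > rng.weibullvariate(opp_scale, k)
        for _ in range(n)
    )
    return wins / n

exact = pythagorean_win_prob(90, 70, 4)
approx = simulated_win_prob(90, 70, 4)
print(round(exact, 3), round(approx, 3))
```

The two figures should agree to within simulation noise, which is a couple of tenths of a percentage point at this sample size.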
EMPIRICAL PLAUSIBILITY
It's an interesting result, no doubt, that we can derive the Pythagorean Expectation formula by assuming the teams' scores are bivariate independent Weibull, but is there any empirical support for the notion?
First, let's simulate the situation where the Home Team score is distributed as a Weibull with mean 90 and the Away Team score is distributed as a Weibull with mean 70, each distribution having parameter k equal to 4, which is, roughly speaking, the Pythagorean Exponent we found applied, on average, across the modern eras.
The chart for that situation appears at left, and the striking feature of it is the Normal-like nature of the simulated scores for the Home and the Away teams, and for the Home Team victory margin. The Weibull distribution, it turns out, is roughly symmetric for values of k near 3.6, which is close to what I've used for the simulations charted here.
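A simulation along these lines can be sketched with nothing more than the Python standard library. This isn't necessarily how the chart was produced; the sample size, seed and helper names are my own choices, and I've used sample skewness as a rough numerical stand-in for "Normal-like" symmetry.

```python
import math
import random
import statistics

def weibull_sample(mean, k, n, rng):
    """Draw n Weibull scores with the given mean and shape k."""
    scale = mean / math.gamma(1.0 + 1.0 / k)
    return [rng.weibullvariate(scale, k) for _ in range(n)]

def skewness(xs):
    """Population skewness: the mean cubed standardised deviation."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs, mu=m)
    return statistics.fmean([((x - m) / s) ** 3 for x in xs])

rng = random.Random(2024)              # arbitrary seed, for repeatability
home = weibull_sample(90, 4, 100_000, rng)
away = weibull_sample(70, 4, 100_000, rng)
margin = [h - a for h, a in zip(home, away)]

# Print the sample mean and skewness of each simulated series
for name, xs in (("home", home), ("away", away), ("margin", margin)):
    print(name, round(statistics.fmean(xs), 1), round(skewness(xs), 2))
```

Skewness values close to zero for all three series are what we'd expect if the distributions are roughly symmetric.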
We've also seen before the Normal-like nature of Home Team handicap-adjusted margins, mirrored here by the black line tracking the home team victory margin. As well, we've found that home team and away team scores, conditioned on some transformation of bookmaker prices, can be modelled reasonably well by the Normal distribution.
If we look, instead, at situations where the Home and the Away team expected scores are different from 90 and 70 but still plausible as expected final scores, the overall picture remains the same, and the Normal-like nature of the distributions remains striking.
In operationalising the Weibull distribution it is useful to know how, in particular, the mean and variance of the distribution relate to its parameters. The relevant formulae appear at left and include the gamma function, which is described here.
So, for example, assuming that k = 4, for a team whose expected score is S we should set the lambda of its Weibull distribution equal to S / gamma(1 + 1/4), which is about 1.103 x S.
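In code, that conversion might look like the sketch below. It relies on the standard Weibull results that the mean is lambda x gamma(1 + 1/k) and the variance is lambda^2 x [gamma(1 + 2/k) - gamma(1 + 1/k)^2]; the function names are hypothetical.

```python
import math

def weibull_scale(expected_score, k):
    """Scale (lambda) that gives a Weibull with shape k the required mean."""
    return expected_score / math.gamma(1.0 + 1.0 / k)

def weibull_mean_sd(scale, k):
    """Mean and standard deviation of a Weibull(scale, k) variable."""
    g1 = math.gamma(1.0 + 1.0 / k)
    g2 = math.gamma(1.0 + 2.0 / k)
    return scale * g1, scale * math.sqrt(g2 - g1 ** 2)

lam = weibull_scale(90, 4)             # about 1.103 x 90
mean, sd = weibull_mean_sd(lam, 4)     # recovers the mean of 90
print(round(lam, 1), round(mean, 1), round(sd, 1))
```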
Assuming Weibull distributions with fixed k for both the Home and the Away teams in a contest, we can calculate the probability of a Home team victory given different expected scores for each team. Those calculations for k = 4 and k = 3.5 appear below.
The first thing to note is how similar the probability estimates are regardless of whether we assume k is 4 or 3.5: the differences are no more than about 3 percentage points.
If we compare the k = 4 scenario to the scenario where we assume the difference between the home team and away team scores is, instead, distributed as a Normal random variate with mean equal to the difference in the expected scores and with standard deviation of 37 points, we can see that these probability estimates are also very similar, especially for the most-plausible combinations of home team and away team expected scores.
In most cases, the differences are no more than about 3 percentage points, so in that regard there's little difference in assuming that the team scores are distributed as independent Weibulls versus assuming that the difference in these scores is distributed as a Normal with constant variance.
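A quick way to check the size of those differences is to evaluate the closed-form Weibull win probability over a grid of expected scores for both values of k. The grid below (60 to 120 points in steps of 10) is my own assumption rather than the exact grid used for the tables.

```python
def home_win_prob(home_mean, away_mean, k):
    """P(home win) for independent Weibull scores with common shape k."""
    return 1.0 / (1.0 + (away_mean / home_mean) ** k)

scores = range(60, 121, 10)            # hypothetical grid of expected scores

# Largest absolute gap between the k = 4 and k = 3.5 probabilities
max_gap = max(
    abs(home_win_prob(h, a, 4.0) - home_win_prob(h, a, 3.5))
    for h in scores
    for a in scores
)
print(round(100 * max_gap, 1), "percentage points")
```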
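The comparison can be sketched as follows, using the closed-form Weibull probability for k = 4 and a Normal margin with a standard deviation of 37 points. The function names and the three example match-ups are illustrative assumptions only.

```python
from statistics import NormalDist

def weibull_home_win_prob(home_mean, away_mean, k=4.0):
    """P(home win) for independent Weibull scores with common shape k."""
    return 1.0 / (1.0 + (away_mean / home_mean) ** k)

def normal_home_win_prob(home_mean, away_mean, sd=37.0):
    """P(margin > 0) when the margin is Normal(home - away, sd)."""
    return 1.0 - NormalDist(mu=home_mean - away_mean, sigma=sd).cdf(0.0)

# A few hypothetical combinations of expected home and away scores
for h, a in ((90, 70), (100, 80), (85, 95)):
    w = weibull_home_win_prob(h, a)
    n = normal_home_win_prob(h, a)
    print(h, a, round(w, 3), round(n, 3), round(100 * (w - n), 1))
```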
That constant variance assumption, however, does not hold if we assume that the Home and Away team scores are distributed as Weibulls. In that case, the variance of the individual scores varies with the parameters of the relevant distribution, as reflected in the formulae shown above, and the variance of the difference in those scores also varies with the parameters assumed for both distributions. The formula below provides the details for this second result.
The 0.281 coefficient in that equation is based on a value of k equal to 4. As a rule of thumb, if the expected home score is approximately equal to the expected away score, this formula gives a standard deviation for the difference between the home and away score of about 0.4 times the expected home (or away) team score.
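Here's my reading of where a coefficient of that size comes from, offered as a sketch rather than a restatement of the formula itself. For a Weibull with shape k the ratio of standard deviation to mean depends only on k, and for independent scores the variances add, so the standard deviation of the margin is that ratio times the square root of (home expected score squared plus away expected score squared). The function names are hypothetical.

```python
import math

def weibull_cv(k):
    """Ratio of standard deviation to mean for a Weibull with shape k."""
    g1 = math.gamma(1.0 + 1.0 / k)
    g2 = math.gamma(1.0 + 2.0 / k)
    return math.sqrt(g2 / g1 ** 2 - 1.0)

def margin_sd(home_mean, away_mean, k=4.0):
    """SD of (home - away) for independent Weibull scores with shape k."""
    return weibull_cv(k) * math.hypot(home_mean, away_mean)

print(round(weibull_cv(4.0), 3))             # the coefficient for k = 4
print(round(margin_sd(90, 90) / 90, 2))      # rule of thumb for equal scores
```

With equal expected scores the square root contributes a factor of about 1.41, which is what turns 0.281 into roughly 0.4.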
Evaluating this equation for a range of different values of expected home and away team scores produces the table shown below, in which I've reduced the font size for combinations that are less common empirically.
Focussing solely on the combinations of home and away scores that are more likely to occur, we can see that the estimated standard deviations range from about 30 to about 44 points, which is broadly consistent with values we've estimated previously based on an assumption of Normality.
In short then, the assumption of Weibullness in AFL team scores can't be ruled out a priori on empirical grounds. In a future post I'll test that assumption further by applying it to some historical team scoring data and comparing the fit obtained using the Weibull with that achieved using the more standard assumption of Normality.