Here's a simple question: what are the characteristics of games that tend to produce the highest aggregate scores?
In investigating this question I'm going to ignore "on-the-day" factors, such as the weather, wind conditions, game surface and so on, and consider only the pre-game bookmaker head-to-head prices, team Ratings and whether or not the contest forces one or other of the participants to travel interstate.
A priori, I could concoct a few hypotheses about which games are more likely to be high-scoring:
- they'll be games that pit a strong team against a weak team, because these games are most likely to produce a blowout margin for the strong team, which will inflate the aggregate score
- they'll be games between two roughly evenly matched teams as this scenario maximises the expected points scored by both teams combined
- they'll be games where the home team is strong and the away team weak as these games are most likely to produce a blowout margin for the home team, which will inflate the aggregate score. Where the opposite is true, and it's the away team that's the stronger, the home team crowd will serve to reduce the margin and reduce the aggregate points scored, so these games won't tend to produce as high an aggregate
To come up with a final statistical model, I flirted with some non-linear formulations - using Eureqa and fitting a random forest in R - but it was a linear model that ultimately proved strongest. I used data from all games from the 2006 to 2011 seasons, splitting these games into model training and model testing samples of roughly equal sizes. The variables I used were:
- Season - a variable that took on the values 2006, 2007, 2008, 2009, 2010 or 2011
- Round - a variable that took on values equal to the round number of the game concerned. Finals took on consecutive round numbers following on from those of the home and away season. So, for example, if the final round of the home and away season was Round 22, then the first week of the Finals was Round 23
- Interstate Status - the usual variable, taking on a value of +1 if the home team was in its home state and the away team playing interstate, 0 if both the home and away teams were in their home state or both were interstate, and -1 if the home team was playing out of its home state and the away team was playing in its home state (a rare occurrence)
- Home MARS Rating and Away MARS Rating - the MARS Ratings of the teams
- Absolute Difference of the Home and Away Ratings - a measure of the relative difference in the strengths of the teams
- Home MARS Rating / Away MARS Rating - another measure of relative team strength
- Home and Away Team Prices - as per the TAB Sportsbet market
- Bookmaker Implicit Home Team Probability - defined as Away Team Price / (Home Team Price + Away Team Price)
- Absolute Probability Difference - given by the absolute value of (2 x Bookmaker Implicit Home Team Probability - 1), and serving as a measure of the relative chances of the two teams
- Odds Ratio - (Home Price - 1) / (Away Price - 1) - another measure of the relative chances of the teams
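As a rough sketch of how the three price-based variables follow from a pair of head-to-head prices, here is a small Python function (Python rather than the R used for the modelling itself). The function name, the dictionary keys and the example prices of $1.50 and $2.50 are made up for illustration, and taking the absolute value in the probability difference is my reading of the variable's name.

```python
def market_features(home_price: float, away_price: float) -> dict:
    """Derive the price-based regressors from head-to-head prices (illustrative only)."""
    # Implicit home team probability: away price over the sum of both prices.
    # This is the same as normalising the inverse prices so they sum to 1.
    home_prob = away_price / (home_price + away_price)
    return {
        "home_prob": home_prob,
        # Gap between the two teams' implicit probabilities.
        # Assumption: the absolute value is taken, as the variable name implies.
        "abs_prob_diff": abs(2 * home_prob - 1),
        # Ratio of net odds; below 1 when the home team is the favourite.
        "odds_ratio": (home_price - 1) / (away_price - 1),
    }


# Hypothetical prices: home team at $1.50, away team at $2.50.
features = market_features(1.50, 2.50)
print(features)
```

Note that the Odds Ratio is undefined if the away price is exactly $1.00, a case this sketch does not guard against.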
Here are the results for the linear models that I fitted:
(With 12 regressors I could have fitted 2^12 = 4,096 models, but life's too short. The 15 that I fitted were selected on the basis of convenience - the model with all 12 regressors and the model produced by using the stepAIC function in R - enlightened trial-and-error, and whimsy. There's no guarantee that my "best" model is absolutely the best, but it is better than any random forest I could produce and better than the model that Eureqa Formulize eventually settled on, so it's certainly "good" if not definitively "best".)
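The 4,096 figure can be checked by brute force: each regressor is either in or out of a model, so enumerating every subset (including the empty, intercept-only model) gives 2^12. The short labels below are my own shorthand for the twelve variables listed earlier, not names from any dataset.

```python
from itertools import combinations

# Shorthand labels (hypothetical) for the 12 candidate regressors.
regressors = [
    "season", "round", "interstate_status",
    "home_mars", "away_mars", "abs_mars_diff", "mars_ratio",
    "home_price", "away_price",
    "home_prob", "abs_prob_diff", "odds_ratio",
]

# Every subset of size 0 (intercept only) through 12 (all regressors).
candidate_models = [
    subset
    for size in range(len(regressors) + 1)
    for subset in combinations(regressors, size)
]

print(len(candidate_models))  # equals 2 ** 12
```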
In the table:
- the asterisks denote the variables that were included in the relevant linear model
- the R-squared provides the proportion of variance in aggregate game score explained by the model for the model training games
- the R-squared on holdout provides the same number but for the holdout games
- the mean APE on Holdout provides the average absolute prediction error of the model's predictions of the aggregate score in the holdout games
- the Prob(Pick Higher Scoring Holdout Game) provides an estimate of the probability that the ordering of the aggregate game score predictions of the model for two randomly chosen holdout games matches the ordering of the actual aggregate scores for those two games
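To make the two error-based holdout metrics concrete, here is a minimal Python sketch. It assumes that R-squared on holdout is computed as one minus the ratio of residual to total sum of squares on the holdout games, and that mean APE is a plain mean absolute error in points; the post does not spell out either formula, and the scores below are invented.

```python
def holdout_r_squared(actual, predicted):
    """Proportion of holdout variance explained (assumed: 1 - SS_res / SS_tot)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def mean_abs_prediction_error(actual, predicted):
    """Average absolute gap, in points, between predicted and actual aggregates."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Invented aggregate scores for four holdout games and a model's predictions.
actual_scores = [180, 200, 160, 220]
predicted_scores = [185, 190, 175, 200]

print(holdout_r_squared(actual_scores, predicted_scores))
print(mean_abs_prediction_error(actual_scores, predicted_scores))
```

Computed this way, a holdout R-squared can be negative if a model predicts worse than simply guessing the holdout mean.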
The first thing to note is how little of the variability in aggregate scores any of the models explains. Even the model with all 12 regressors, which mathematically must have the highest R-squared on the training sample, explains only about 5% of the variability in aggregate scores there. On the holdout games this model explains even less: just over 3%. No model does better than about 4% on the holdout games.
This suggests that the overwhelming majority of the variability in aggregate game scores is due to on-the-day factors or to factors I've not included in the models - for example, team specific factors such as the quality of their attack and defence.
Which of these models is deemed best depends on the metric you choose. If it's the generalisability of the model that you care about, then it's the model's performance on the holdout sample that you should focus on, which means the bottom three rows of the table.
My preferred metric is the one on the last row of the table. It measures the probability that the model's predicted aggregate scores for two randomly selected games will be in the same order as the actual scores for those two games. So, for example, if a model predicted that the aggregate score for game A would be higher than for game B, and the actual aggregate score in game A was higher than in game B, then the model would be right for that pair of games.
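One way to estimate this pairwise-ordering probability is simply to check every pair of holdout games, sketched below in Python. The treatment of ties is my assumption (pairs with equal actual scores are skipped, and tied predictions count as misses), and the scores are invented.

```python
from itertools import combinations


def prob_pick_higher(actual, predicted):
    """Share of game pairs whose predicted ordering matches the actual ordering."""
    correct = 0
    total = 0
    for i, j in combinations(range(len(actual)), 2):
        if actual[i] == actual[j]:
            continue  # assumption: pairs with tied actual scores are ignored
        total += 1
        # Same sign of the two differences means the pair is ordered correctly.
        # A tied prediction gives a product of zero and so counts as a miss.
        if (predicted[i] - predicted[j]) * (actual[i] - actual[j]) > 0:
            correct += 1
    if total == 0:
        raise ValueError("no pairs with distinct actual scores")
    return correct / total


# Invented aggregate scores for four holdout games and a model's predictions.
actual_totals = [180, 200, 160, 220]
predicted_totals = [190, 185, 175, 200]

print(prob_pick_higher(actual_totals, predicted_totals))
```

Here the model misorders only the first two games, so five of the six pairs are ordered correctly. A model with no ordering ability would hover around one half.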
On this metric, the best model is the one in the second column and includes the interstate status variable, the MARS Ratings of each team, and the absolute difference in the team probabilities as reflected in the pre-game TAB market. It's a quite parsimonious model yet it correctly orders the aggregate scores for any randomly selected pair of holdout games almost 58% of the time, which is considerably better than the chance result of 50%, and is remarkably high when you consider the tiny R-squared for this model.
That "best" model is the following:
It suggests that the games most likely to produce high aggregate scores are those that are played in both teams' home state (or, better yet, played interstate from the home team's perspective only), that pit two relatively weak teams against one another, but where there's a fairly strong favourite.
An interesting question to ask is: relatively speaking, how does each of the four variables in the final model contribute towards the (pitiably small proportion of) variability of aggregate game score explained by the whole model? We can answer this question using the relaimpo package in R and the lmg measure - though the results for the pmvd measure are not much different - and state that almost 80% of the explained variability is due to the Ratings of the two teams. This is one of those rare occasions where the MARS Ratings do a better job of explaining the variability in some football metric of interest than do the TAB bookmaker prices.
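For intuition about what the lmg measure does, here is a deliberately simplified Python sketch for a model with just two regressors (the final model has four, and relaimpo handles the general case). The lmg share of a regressor is the increase in R-squared when it is added, averaged over every order in which the regressors could enter. With two regressors there are only two orders, and the R-squared values can be written in terms of simple correlations. The function names and the toy data are mine, not from the analysis above.

```python
import math


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    s_ab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    s_aa = sum((x - mean_a) ** 2 for x in a)
    s_bb = sum((y - mean_b) ** 2 for y in b)
    return s_ab / math.sqrt(s_aa * s_bb)


def lmg_two_regressors(y, x1, x2):
    """lmg shares of R-squared for y ~ x1 + x2 (two-regressor special case)."""
    r1, r2, r12 = pearson(y, x1), pearson(y, x2), pearson(x1, x2)
    # R-squared of the two-regressor model in terms of simple correlations.
    r2_full = (r1 ** 2 + r2 ** 2 - 2 * r1 * r2 * r12) / (1 - r12 ** 2)
    # Average each regressor's R-squared gain over the two entry orders:
    # entering first (gain = its own R-squared) and entering second.
    share_x1 = 0.5 * (r1 ** 2 + (r2_full - r2 ** 2))
    share_x2 = 0.5 * (r2 ** 2 + (r2_full - r1 ** 2))
    return share_x1, share_x2, r2_full


# Toy data: two uncorrelated regressors and a response equal to 3*x1 + x2.
x1 = [1, -1, 1, -1]
x2 = [1, 1, -1, -1]
y = [3 * a + b for a, b in zip(x1, x2)]

share_x1, share_x2, r2_full = lmg_two_regressors(y, x1, x2)
print(share_x1, share_x2, r2_full)
```

By construction the two shares always sum to the full model's R-squared, which is what allows statements of the form "X% of the explained variability is due to these variables". The sketch breaks down if the two regressors are perfectly correlated.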