Predicting Bookmaker Head-to-Head Prices: Five Years On

Recently, in light of the discussions about the validity of the season simulations written up over on the Simulations journal, I got to thinking about modelling the Bookmaker's price-setting behaviour and how it might be expected to respond to the outcomes of earlier games in the season. It's a topic I've investigated before, but not for a while.

Back in July of 2010 I used Eureqa to fit an empirical model to Bookmaker head-to-head price data from 1999 to 2010, eventually landing on a logistic model with 5 coefficients and an R-squared of about 83%.

Since then I've added new tools to my analytical toolbox so I thought it time to revisit this particular modelling challenge to investigate what benefits, if any, time and experience have afforded me.

THE DATA

These days I just don't trust the Bookmaker data I have for games prior to 2006 (it's data I found on the web rather than captured myself), so for this analysis I'll be using only data from Round 1 of 2006 through to Round 22 of 2015.

My target variable will be the TAB Bookmaker's implicit home team probability, which I'll calculate using the Overround Equalising methodology (a sketch of the calculation appears after the list of regressors below). I usually capture these prices around the time the line markets go up which, until the last few years, was typically the Wednesday before the game but is now more often the Tuesday. My regressors will be:

  • Home Team MARS Rating (MoS' own ELO-style Team Ratings)
  • Away Team MARS Rating (as above) 
  • Home Team Venue Experience (number of games played at the venue in the past 12 months)
  • Away Team Venue Experience (as above)
  • Interstate Status (+1 if the Home team is playing an out-of-State team; 0 if not; very occasionally -1 when the designated Home team flies interstate to play the designated Away team in the Away team's home State)
  • Home Team Average For and Against Differential in Last 1 game
  • Home Team Average For and Against Differential in Last 2 games 
  • (Same for Last 3 to 11 games)
  • Away Team Average For and Against Differential in Last 1 game
  • Away Team Average For and Against Differential in Last 2 games
  • (Same for Last 3 to 11 games)
  • Home Team
  • Venue

Note that the For and Against variables are permitted to include games only from the current season. So, for example, the notional average for the last 5 games will include, and average over, only the games from Rounds 1 to 4 if the game in question is being played in Round 5.
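For concreteness, here's a minimal sketch in R of how those two derived inputs can be constructed. The function and column values are my own inventions for illustration, not anything from the actual MoS codebase; the Overround Equalising calculation simply assumes the same overround is levied on both teams.

```r
# Overround Equalising: assume the same overround applies to both teams,
# so the implicit home probability is the home team's share of the
# inverse decimal prices.
implied_home_prob <- function(home_price, away_price) {
  (1 / home_price) / (1 / home_price + 1 / away_price)
}

implied_home_prob(1.60, 2.35)  # about 0.595

# Within-season rolling For-and-Against differential: average the scoring
# differential over (at most) the last k games, using only games already
# played in the current season.
rolling_diff <- function(points_for, points_against, k) {
  diffs <- points_for - points_against
  sapply(seq_along(diffs), function(i) {
    prior <- diffs[seq_len(i - 1)]       # current-season games already played
    if (length(prior) == 0) return(NA)   # no history yet (Round 1)
    mean(tail(prior, k))                 # average of at most the last k games
  })
}
```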

One of the most obvious changes in my modelling over the past few years is that I worry less now about including more variables in a model, since I'm more often concerned with predicting rather than describing football phenomena, and many of the modelling algorithms I use are not unduly affected by the addition of (possibly highly multicollinear) variables.

THE MODEL

I've been genuinely surprised at the efficacy of the new margin-predicting Ensemble models introduced in 2015, so it seemed logical to try the same technique on the current problem as well, once again using the caretEnsemble package in R.

For today, my ensemble will include one model of each of the following types:

  • Bagged MARS (bagearth)
  • Ordinary Least Squares (ols)
  • Random Forest (rf)
  • Conditional Forest (cf)
  • Gradient Boosted Regression Trees (blackboost)
  • Generalized Boosted Regression Model (gbm)
  • K-Nearest Neighbours (kknn)
  • MARS (earth)
  • Partial Least Squares (pls)
  • Projection Pursuit Regression (ppr)

Each base model is tuned over a grid of sensible tuning parameter values using 5 repeats of 5-fold cross-validation.
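As a rough sketch of what that setup looks like with caretList, assuming a data frame train_data that holds the implicit probability in a column named Home_Prob alongside the regressors (the object and column names here are mine, not necessarily what I actually used):

```r
library(caret)
library(caretEnsemble)

set.seed(2015)

# Shared resampling indices so the base models can later be ensembled:
# 5 repeats of 5-fold cross-validation.
folds <- createMultiFolds(train_data$Home_Prob, k = 5, times = 5)
ctrl  <- trainControl(method = "repeatedcv", number = 5, repeats = 5,
                      index = folds, savePredictions = "final")

# caret method strings corresponding to the base models listed above
base_methods <- c("bagEarth", "lm", "rf", "cforest", "blackboost",
                  "gbm", "kknn", "earth", "pls", "ppr")

model_list <- caretList(Home_Prob ~ ., data = train_data,
                        trControl  = ctrl,
                        methodList = base_methods,
                        tuneLength = 5)  # a sensible default grid per model
```

(Note that caret's method strings differ slightly from the shorthand labels above: lm for Ordinary Least Squares and cforest for the Conditional Forest.)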

Both a linear and a greedy ensemble were then fitted using the tuned base models, trained on a sample of 1,000 games with the remaining 931 games set aside for testing purposes.
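The ensembling step itself is short, at least in the version of caretEnsemble that I understand to have been current at the time (again, just a sketch under those assumptions):

```r
# Greedy ensemble: base-model weights chosen by greedy optimisation of RMSE
greedy_ens <- caretEnsemble(model_list)

# Linear ensemble: base-model predictions blended with a linear model
linear_ens <- caretStack(model_list, method = "lm",
                         trControl = trainControl(method = "cv", number = 5))

summary(greedy_ens)  # reports the weights given to each base model
```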

The weights of the underlying models appear in the table at right and reveal that the Projection Pursuit Regression, Bagged MARS and Gradient Boosted Regression Trees models carry the heaviest weights in both ensembles.

(Note that this is not the first time we've seen the Projection Pursuit Regression algorithm perform well in the context of football prediction.)

A comparison of the performance of ensembles and the underlying models on the holdout sample is instructive. It reveals that the three underlying models with the lowest (ie best) RMSE and MAPE on the holdout are also the three models with the largest weightings in the ensembles.

It also shows that the two ensembles have the lowest RMSEs and MAPEs of all, and that, very marginally, the greedy ensemble is superior to the linear ensemble. 

Given that, the analyses that follow all use the greedy ensemble.

One metric not recorded in the table above is the greedy ensemble's R-squared on the holdout sample. It's 89.2%, which you might take to signify that the model we've developed here is superior to the one developed in 2010, albeit they were each built on different samples and the R-squared for the 2010 model was not calculated on a holdout.
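For reference, the holdout RMSE, MAPE and R-squared quoted here are of the standard kind, computable along these lines (assuming predict() returns the greedy ensemble's fitted probabilities as a vector, and with test_data being the 931 holdout games; the R-squared shown is 1 minus the ratio of residual to total sums of squares):

```r
preds  <- predict(greedy_ens, newdata = test_data)
actual <- test_data$Home_Prob

rmse <- sqrt(mean((preds - actual)^2))
mape <- mean(abs(preds - actual) / actual) * 100   # in percentage terms
r_sq <- 1 - sum((actual - preds)^2) / sum((actual - mean(actual))^2)
```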

Ensemble models don't lend themselves to ready exposition of their content so let's focus instead on understanding the performance of the greedy ensemble.

In the chart below I've plotted the relationship between the ensemble's fitted Home team probability (y-axis) and the TAB Bookmaker's implied actual Home team probability (x-axis), where the points are coloured depending on the actual game outcome and the chart is faceted by the portion of the season in which the game took place.

What we see is a good fit between the fitted and actual probabilities for all "slices" of the season and across the entire range of actual and fitted probabilities. By this I mean that the lines are all roughly at 45 degrees and run through, or close to, the (50%, 50%) point in each facet.

We might also wonder whether the fitted probabilities represent better or worse probability estimates than the Bookmaker's, which we can measure using standard probability scores, here the Brier Score and the Logarithmic Score (both of which are defined such that lower scores are better).
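With the actual result coded as 1 for a Home win and 0 for a Home loss (treating a draw as 0.5 is my assumption here), the two scores for a probability estimate p take the standard form:

```r
# Lower is better for both scores.
brier_score <- function(p, result) mean((p - result)^2)

log_score <- function(p, result) {
  -mean(result * log(p) + (1 - result) * log(1 - p))
}
```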

The table at right presents the average Brier and Logarithmic scores of the Bookmaker's probability estimates and of the estimates emanating from the fitted model, for groups of games defined by the Bookmaker's pre-game implicit home probability. The results here have been calculated for the entire sample, not just the holdout.

It reveals that, barring those games where the Bookmaker assessed the Home team as between 49.5% and 63.5% chances, the Bookmaker's probability estimates are superior to the model's.

A hint about a possible cause of this is given by the data on the far right of the table, which shows the mean and mean absolute errors in the fitted model outputs relative to the actual Bookmaker probabilities. We see that the model tends to overestimate the Bookmaker's probability assessments when those assessments are low (hence the negative average model errors) and underestimate them when those assessments are high (hence the positive average model errors). So, if the Bookmaker is well-calibrated in these regions relative to actual results, by definition the model cannot also be well-calibrated there.
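Summaries of that type are straightforward grouped aggregations. A sketch using dplyr, where all_games is assumed to be the full sample with the ensemble's fitted probability already attached as model_prob, and the bin boundaries are illustrative only:

```r
library(dplyr)

all_games %>%
  mutate(error    = model_prob - Home_Prob,
         prob_bin = cut(Home_Prob, breaks = seq(0, 1, by = 0.1))) %>%
  group_by(prob_bin) %>%
  summarise(mean_error = mean(error),       # signed bias versus the Bookmaker
            mae        = mean(abs(error)),  # mean absolute error
            games      = n())
```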

It's also instructive to explore the model's mean and mean absolute errors grouping the games in other ways.

Firstly, let's look at the view on a season-by-season basis. We see that, in a Mean Absolute Error sense, the model fitted best in the seasons 2009 to 2012, and fitted worst in the seasons immediately before and after that period. We also see that the model tended to overestimate the Bookmaker's probabilities in those years where the mean errors have been largest (ie 2006 to 2008, and 2013 to 2015), and to underestimate them in other years.

One way of improving the fit then might be to understand what was different about those middle seasons from 2009 to 2012.

Next, let's group games based on the portion of the season in which they were played.

Doing that we find that errors are largest at the start of the Home-and-Away season and in Finals. I'd speculate that a significant contributor to larger-than-average early-season errors is the difference between team abilities as assessed by MARS Rating and as assessed by the Bookmakers. These differences should narrow as the season proceeds, which is why we find the average model errors do too.

Average model errors might increase in Finals because of a changed importance - I'd speculate a lessening - of home ground status in these contests. That effect might be ameliorated by including a regressor to reflect the portion of the season in which the game is being played.

Next, we look at grouping games by venue, the first of the variables that actually appears in the model.

We find, surprisingly, that some of the most commonly used venues have above-average mean absolute errors, including the MCG and Docklands, which are the two most common venues of all. One thing to note about many of the venues with above-average mean absolute errors is that they are home grounds for more than one team, which perhaps makes their "value" more difficult to model.

Some further supporting evidence for this view is the fact that the two lowest mean absolute error figures for venues where more than 30 games have been played are for venues where there is a single home team.

The fact that the mean errors for a number of commonly used venues are significantly non-zero suggests that the model might not have fully captured this dimension.

Finally, let's consider the teams themselves, playing as the designated Home or designated Away team. When playing at Home we find that the probabilities for teams such as Geelong, GWS and Sydney have been most accurately fitted, while those for the Dees, Lions, Roos, Tigers, Dogs and Saints have been least accurately fitted.

Playing Away, the Tigers' and Roos' probabilities have been most accurately fitted, and those of the Suns, Dockers, Pies and Saints least accurately fitted.

Looking instead at the mean errors, both the Home and Away views reveal a number of teams for which under- or overestimation of their probabilities has persisted. This is far more the case when we look at the Away as opposed to the Home view, however. Note that the model currently includes a regressor for the Home team but not one for the Away team.

SO WHAT, AND WHAT NEXT?

Above all else, I think the most instructive finding from the modelling exercise is that almost 90% of the variability in the TAB Bookmaker's implied home team probabilities can be explained by an ensemble model taking as inputs nothing more than the form and estimated ability of the competing teams, and the venue of the contest. Most notably, it includes nothing about player rosters for any single game.

The implication of this is that variability in team selections that makes a team "weaker" or "stronger" than average logically cannot explain more than 10% of the variability in Bookmaker-assessed team chances. Given that the model doubtless excludes other relevant variables, such as weather and team motivation levels, that 10% must be recognised as an upper bound on the contribution of team composition.

As a secondary point, I'd note that the more-sophisticated ensemble modelling approach has been able to explain a significantly larger proportion of the variability in implicit Bookmaker probabilities than the model I created back in 2010. So at least I can claim not to have wasted the last 5 years in my AFL modelling career.

Finally, I think the modelling exercise and the analysis of model errors has hinted at some avenues for further improvement. It suggested, for example, that including the identity of the Away team, and including a variable to reflect the portion of the season in which the game was played, are both likely to result in an improved model.

I have actually rebuilt the ensembles including these variables and can confirm that there is a small improvement - the R-squared on the holdout increases to 89.4%, the RMSE decreases by 0.45% and the MAPE by 1.01%. As well, the Brier and Log Probability Scores improve, and the Mean Absolute Errors are smaller for 9 of the 10 seasons, all of the season "slices", 8 of the 10 most-commonly used venues, 14 of the teams when they are the designated Home team, and 12 of the teams when they are the designated Away team. 

But I'm sure there's more improvement still to be had.

One thing that encourages me about refining this approach further is the historical in-market performance of the refined model (ie the one including season "slice" and Away team identity), even as it stands, when we use it to Kelly-stake in the Head-to-Head market.
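For completeness, the Kelly-staking rule I have in mind here is the standard head-to-head version: stake a fraction of the bank equal to the edge divided by the net odds, and bet nothing when the edge is non-positive. A minimal sketch:

```r
# p: the model's assessed probability; d: the decimal head-to-head price.
# Stake the fraction (p * d - 1) / (d - 1) of the bank when the edge
# (p * d - 1) is positive, otherwise stake nothing.
kelly_fraction <- function(p, d) {
  edge <- p * d - 1
  ifelse(edge > 0, edge / (d - 1), 0)
}

kelly_fraction(0.60, 1.80)  # edge of 8%, so stake 10% of the bank
```

The low-probability Away-team filter discussed below would simply be an extra condition on when the stake is allowed to be non-zero.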

It's somewhat curious that the model performs better out-of-sample than in-sample, but I'd much rather that outcome than the opposite. The results also show the now-familiar pattern of superior returns to wagering on Home team versus Away teams (though additional analysis reveals that the gap can be reduced a little, and a profit made on the holdout sample, if Away teams with estimated probabilities under about 25% - which are those likely to be carrying excessive overround - are avoided).

Some work for the off-season, I think.