There Must Be 50 Ways to Build a Model (Reprise)
Okay, this posting is going to be a lot longer and a little more technical than the average MAFL blog (and it's not as if the standard fare around here could be fairly characterised as short and simple).
Anyway, over the years of MAFL, people have asked me about the process of building a statistical model in sufficient number and with such apparent interest that I felt it was time to write a blog about it.
Step one in building a model is, as in life, finding a purpose and the purpose of the model I'll be building for this blog is to predict AFL victory margins, surely about as noble a purpose as a model can aspire to. Step two is deciding on the data that will be used to build that model, a decision heavily influenced by expedience; often it's more a case of 'what have I already got that might be predictive?' rather than 'what will I spend the next 4 weeks of my life trying to source because I've an inkling it might help?'.
Expediently enough, the model I'll be building here will use a single input variable: the TAB Sportsbet price of the home team, generally at noon on Wednesday before the game. I have this data going back to 1999, but I've personally recorded prices only since 2006. The remainder of the data I sourced from a website built to demonstrate the efficacy of the site-owner's subscription-based punting service, which makes me trust this data about as much as I trust on-site testimonials from 'genuine' customers. We'll just be using the data for the seasons 2006 to 2009.
Fitting the Simplest Model
The first statistical model I'll fit to the data is what's called an ordinary least-squares regression - surely a name to cripple the self-esteem of even the most robust modelling technique - and is of the form Predicted Margin = a + (b / Home Team Price).
The ordinary least-squares method chooses a and b to minimise the sum of the squared differences between the actual victory margins and the margins the model predicts and, in this sense, 'fits' the data best of all the possible choices of a and b that we could make.
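For anyone who'd like to see the mechanics, here's a minimal sketch in Python of how such a fit might look; the file name and the columns home_price and margin are hypothetical stand-ins for whatever the real data set contains.

```python
# A minimal sketch of the ordinary least-squares fit; 'afl_2006_2009.csv',
# 'home_price' and 'margin' are hypothetical placeholders for the real data.
import pandas as pd
import statsmodels.api as sm

games = pd.read_csv("afl_2006_2009.csv")

X = sm.add_constant(1.0 / games["home_price"])   # predictor is 1 / Home Team Price
y = games["margin"]                              # home team's victory margin (negative = loss)

fit = sm.OLS(y, X).fit()                         # picks a and b to minimise squared errors
print(fit.params)                                # a is the constant, b the 1/price coefficient
print(fit.rsquared)                              # about 0.24 on the 2006-2009 data
```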
We saw the result of fitting this model to the 2006-2009 data in an earlier blog; it was:
Predicted Margin = -49.17 + 96.31 / Home Team Price
This model fits the data for seasons 2006 to 2009 quite well. The most common measure of how well a model of this type fits is what's called the R-squared and, for this model, it's 0.236, meaning that the model explains a little less than one-quarter of the variability in margins across games.
But this is a difficult measure to which to attach any intuitive meaning. Better perhaps is to know that, on average, the predictions of this model are wrong by 29.3 points per game and that, for one-half of the games it is within 24.1 points of the actual result, and for 27% of the games it is within 12 points.
These results are all very promising but it would be a rookie mistake to start using this model in 2010 with the expectation that it will explain the future as well as it has explained the past. It's quite common for a statistical model to fit existing data well but to forecast as poorly as a surprised psychic ('Jeez, I didn't see that coming!').
Why? Because forecasting and fitting are two very different activities. When we build the model we deliberately make the fit as good as it can be and this can mean that the model we create doesn't faithfully represent the process that created that data. This is known in statistical circles - which, I guess, are only round on average - as 'overfitting' the data and it's one of the many things over which we obsess.
Overfitting is less likely to be a problem for the current model since it has only one variable in it and overfitting is more commonly a disease of multi-variable models, but it's something that it's always wise to check. A bit like checking that you've turned the stove off before you leave home.
Testing the Model
The biggest problem with modelling the future is that it hasn't happened yet (with apologies to whoever I stole or paraphrased that from). In modelling, however, we can create an artificial reality where, as far as our model's concerned, the future hasn't yet happened. We do this by fitting the model to just a part of the data we have, saving some for later as it were.
So, here we could fit the 2006 season's data and use the resulting model to predict the 2007 results. We could then repeat this by fitting a model to the 2007 data only and then use that model to predict the 2008 results, and then do something similar for 2009. Collectively, I'll call the models that I've fitted using this approach "Single Season" models.
Each Single Season model's forecasting ability can be calculated from the difference between the predictions it makes and the results of the games in the subsequent season. If the Single Season models overfit the data then they'll tend to fit the data well but predict the future badly.
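In code, the Single Season test amounts to fitting on one season's games and scoring the predictions against the next season's results; a sketch, with the same hypothetical column names as before plus a season column, might look like this.

```python
# A sketch of fitting on one season and testing on the next; 'season',
# 'home_price' and 'margin' are hypothetical column names.
import pandas as pd
import statsmodels.api as sm

games = pd.read_csv("afl_2006_2009.csv")

def single_season_test(fit_year, test_year):
    train = games[games["season"] == fit_year]
    test = games[games["season"] == test_year]

    fit = sm.OLS(train["margin"],
                 sm.add_constant(1.0 / train["home_price"])).fit()
    preds = fit.predict(sm.add_constant(1.0 / test["home_price"]))

    abs_errors = (test["margin"] - preds).abs()
    return abs_errors.mean(), abs_errors.median()

for fit_year, test_year in [(2006, 2007), (2007, 2008), (2008, 2009)]:
    print(fit_year, "->", test_year, single_season_test(fit_year, test_year))
```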
The results of fitting and using the Single Season models are as follows:
The first column, for comparative purposes, shows the results for the simple model fitted to the entire data set (ie all of 2006 to 2009), and the next three columns show the results for each of the Single Season models. The final column averages the results for all the Single Season models and provides results that are the most directly comparable with those in the first column.
On balance, our fears of overfitting appear unfounded. The average and median prediction errors are very similar, although the Single Season models are a little worse at making predictions that are within 3 goals of the actual result. Still, the predictions they produce seem good enough.
What Is It Good For?
The Single Season approach looks promising. One way that it might have a practical value is if it can be used to predict the handicap winners of each game at a rate sufficient to turn a profit.
Unfortunately, it can't. In 2007 and 2008 it does slightly better than chance, predicting 51.4% of handicap winners, but in 2009 it predicts only 48.1% of winners. None of these performances is good enough since, at prices of $1.90, you need to tip winners at better than 52.6% just to turn a profit.
In retrospect, this is not entirely surprising. Using a bookie's own head-to-head prices to beat him on the line market would be just too outrageous.
Hmmm. What next then?
Working with Windows
Most data, in a modelling context, has a brief period of relevance that fades and, eventually, expires. In attempting to predict the result of this week's Geelong v Carlton game, for example, it's certainly relevant to know that Geelong beat St Kilda last week and that Carlton lost to Melbourne. It's also probably relevant to know that Geelong beat Carlton when they last played 11 weeks ago, but it's almost certainly irrelevant to know that Carlton beat Collingwood in 2007. Finessing this data relevance envelope by tweaking the weights of different pieces of data depending on their age is one of the black arts of modelling.
All of the models we've constructed so far in this blog have a distinctly black-and-white view of data. Every game in the data set that the model uses is treated equally regardless of whether it pertains to a game played last week, last month, or last season, and every game not in the data set is ignored.
There are a variety of ways to deal with this bipolarity, but the one I'll be using here for the moment is what I call the 'floating window' approach. Using it, a model is always constructed using the most recent X rounds of data. That model is then used to predict for just the current week then rebuilt next week for the subsequent week's action. So, for example, if we built a model with a 6-round floating window then, in looking to predict the results for Round 8 of a given season we'd use the results for Rounds 2 through 7 of that season. The next week we'd use the results for Rounds 3 through 8, and so on. For the early rounds of the season we'd reach back and use last year's results, including finals.
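A sketch of how such a floating window model might be run appears below; it assumes a round_index column that numbers rounds consecutively across seasons so that a window can reach back into the previous year, and the other column names are again hypothetical.

```python
# A sketch of the floating window approach: refit on the most recent
# 'window' rounds, predict only the coming round, then roll forward.
import pandas as pd
import statsmodels.api as sm

def floating_window_forecasts(games, window):
    # 'round_index' numbers rounds consecutively across seasons so that a
    # window can span a season boundary; other column names are placeholders.
    out = []
    for rnd in sorted(games["round_index"].unique()):
        train = games[(games["round_index"] >= rnd - window) &
                      (games["round_index"] < rnd)]
        test = games[games["round_index"] == rnd]
        if len(train) < 2:
            continue
        fit = sm.OLS(train["margin"],
                     sm.add_constant(1.0 / train["home_price"])).fit()
        preds = fit.predict(sm.add_constant(1.0 / test["home_price"],
                                            has_constant="add"))
        out.append(pd.DataFrame({"round_index": rnd,
                                 "predicted": preds,
                                 "actual": test["margin"]}))
    return pd.concat(out, ignore_index=True)
```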
So, next, I've created 47 models using floating windows ranging from 6-round to 52-round. Their performance across seasons 2008 and 2009 is summarised in the following charts.
First let's look at the mean and median APEs:
Broadly what we see here is that, in terms of mean APE, larger floating windows are better than smaller ones, but the improvement is minimal from about an 11-round window onwards. The median APE story is quite different. There is a marked minimum with a 9-round floating window, and 8-round and 10-round floating windows also perform well.
Next let's take a look at how often the 47 models produce predictions close to the actual result:
The top line charts the percentage of time that the relevant model produces predictions that are 3-goals or less distant from the actual result. The middle line is similarly constructed but for a 2-goal distance, and the bottom line is for a 1-goal distance.
Floating windows in the 8- to 11-round range all perform well on all three metrics, consistent with their strong performance in terms of median APE. The 16-round, 17-round and 18-round floating window models also perform well in terms of frequently producing predictions that are within 2-goals of the actual victory margin.
Next let's look at how often the 47 models produce predictions that are very wrong:
In this chart, unlike the previous chart, lower is better. Here we again find that larger floating windows are better than smaller ones, but only to a point, the effect plateauing with floating windows in the 30s.
Again, though, to consider each model's potential punting value, we can look at its handicap betting performance.
On this measure, only the model with an 11-round floating window seems to have any exploitable potential.
But, like Columbo, we just have one more question to ask of the data ...
Dynamic Weighted Floating Windows
(Warning: This next bit hurts my head too.)
We now have 47 floating window models offering an opinion on the likely outcomes of the games in any round. What if we pooled those opinions? But not all opinions are of equal value, so which should we include and which should we ignore? What if we determined which opinions to pool based on the ability of different subsets of those 47 models to fit the results of, say, the last 26 rounds before the one we're trying to predict? And what if we updated those weights each round based on the latest results?
Okay, I've done all that (and yes it took a while to conceptualise and code, and my first version, previously published here, had an error that caused me to overstate the predictive power of one of the pooled models, but I got there eventually). Here's the APE data again now including a few extra models based on this pooling idea:
(The dynamic floating window model results are labelled "Dynamic Linear I (22+35)" and "Dynamic Linear II (19+36+39+52)". The numbers in brackets are the floating window sizes of the models whose forecasts have been pooled to form each Dynamic Linear model. So, for example, the Dynamic Linear I model pools only the opinions of the Floating Window models based on a 22-round and a 35-round window. It determines how best to weight the opinions of these two Floating Window models by optimising over the past 26 rounds.
I've also shown the results for the Single Season models - they're labelled 'All of Prev Season' - and for a model that always uses all data from the start of 2006 up to but excluding the current round, labelled 'All to Current'.)
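For the curious, here's a rough sketch of the pooling idea as I've described it, not the code I actually used: each round, the weights for a chosen subset of floating window models are re-estimated by regressing the actual margins of the previous 26 rounds on those models' forecasts, and the weighted combination is then applied to the coming round.

```python
# A sketch of pooling floating window forecasts: fit linear weights over the
# previous 26 rounds, then apply them to the coming round's forecasts.
# 'window_forecasts' is a hypothetical DataFrame with one column per floating
# window model (e.g. 'fw22', 'fw35') plus 'round_index' and 'actual' margins.
import statsmodels.api as sm

def dynamic_linear_forecast(window_forecasts, model_cols, current_round, lookback=26):
    recent = window_forecasts["round_index"].between(current_round - lookback,
                                                     current_round - 1)
    fit = sm.OLS(window_forecasts.loc[recent, "actual"],
                 sm.add_constant(window_forecasts.loc[recent, model_cols])).fit()

    this_round = window_forecasts["round_index"] == current_round
    return fit.predict(sm.add_constant(window_forecasts.loc[this_round, model_cols],
                                       has_constant="add"))
```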
The mean APE results suggest that, for this performance metric at least, models with more data tend to perform better than models with less. The best Dynamic Linear model I could find, for all its sophistication, still only managed to produce a mean APE 0.05 points per game lower than the simple model that used all the data since the start of 2006, weighting each game equally.
It is another Dynamic Linear model that shoots the lights out on the median APE results, however. The Dynamic Linear model that optimally combines the opinions of 19-, 36-, 39- and 52-round floating windows produces forecasts with a median APE of just 22.54 points per game.
The next couple of charts show that this superior performance stems from this Dynamic Linear model's all-around ability - it isn't best in terms of producing the most APEs under 7 points nor in terms of producing the fewest APEs of 36 points or more.
Okay, here's the clincher. Do either of the Dynamic Linear models do much of a job predicting handicap winners?
Nope. The best models for predicting handicap winners are the 11-round floating window model and the model formed by using all the data since the start of 2006. They each manage to be right just over 53% of the time - a barely exploitable edge.
The Moral So Far ...
What we've seen in these results is consistent with what I've found over the years in modelling the footy. Models tend to be highly specialised, so one that performs well in terms of, say, mean APE, won't perform well in terms of median APE.
Perhaps no surprise then that none of the models we've produced so far have been any good at predicting handicap margin winners. To build such a model we need to start out with that as the explicit modelling goal, and that's a topic for a future blog.
Predicting Margins Using Market Prices and MARS Ratings
Imagine that you allowed me to ask you for just one piece of data about an upcoming AFL game. Armed with that single piece of data I contend that I will predict the margin of that game and, on average, be within 5 goals of the actual margin. Further, one-half of the time I'll be within 4 goals of the final margin and one-third of the time I'll be within 3 goals. What piece of data do you think I am going to ask you for?
I'd ask you for the bookies' price for the home team, true or notional, and I'd plug that piece of data into this equation:
Predicted Margin = -49.17 + 96.31 x (1 / Home Team Price)
(A positive margin means that the Home Team is predicted to win, a negative margin that the Away Team is predicted to win. So, at a Home Team price of $1.95 the Home Team is predicted to win; at $1.96 the Away Team is predicted to squeak home.)
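There's nothing magical about the crossover; it's simply where the predicted margin changes sign, which a one-line check confirms.

```python
# Where the predicted margin changes sign: the home team is predicted to win
# whenever -49.17 + 96.31 / price is positive.
print(round(96.31 / 49.17, 3))   # about 1.959, i.e. between $1.95 and $1.96
```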
Over the period 2006 to 2009 this simple equation has performed as I described in the opening paragraph and explains 23.6% of the variability in the victory margins across games.
Here's a chart showing the performance of this equation across seasons 2006 to 2009.
The red line shows the margin predicted by the formula and the crosses show the actual results for each game. You can see that the crosses are fairly well described by the line, though the crosses are dense in the $1.50 to $2.00 range, so here's a chart showing only those games with a home team price of $4 or less.
How extraordinary to find a model so parsimonious yet so predictive. Those bookies do know a thing or two, don't they?
Now what if I was prohibited from asking you for any bookie-related data but, as a trade-off, was allowed two pieces of data rather than one? Well, then I'd be asking you for my MARS Ratings of the teams involved (though quite why you'd have my Ratings and I'd need to ask you for them spoils the narrative a mite).
The equation I'd use then would be the following:
Predicted Margin = -69.79 + 0.779 x MARS Rating of Home Team - 0.702 x MARS Rating of Away Team
Switching from the bookies' brains to my MARS' mindless maths makes surprisingly little difference. Indeed, depending on your criterion, the MARS Model might even be considered superior, your Honour.
The prosecution would point out that the MARS Model explains about 1.5% less of the overall variability in victory margins, but the case for the defence would counter that it predicts margins that are within 6 points of the actual margin over 15% of the time, more than 1.5% more often than the bookies' model does, and would also avow that the MARS model predictions are 6 goals or more different from the actual margin less often than are the predictions from the bookies' model.
So, if you're looking for a model that better fits the entire set of data, then percent of variability explained is your metric and the bookies' model is your winner. If, instead, you want a model that's more often very close to the true margin and less often very distant from it, then the MARS Model rules.
Once again we have a situation where a mathematical model, with no knowledge of player ins and outs, no knowledge of matchups or player form or player scandals, with nothing but a preternatural recollection of previous results, performs at a level around or even above that of an AFL-obsessed market-maker.
A concept often used in modelling is that of information. In the current context we can say that a bookie's home team price contains information about the likely victory margin. We can also say that my MARS Ratings contain information about likely victory margins. One interesting question is whether the bookie's price carries essentially the same information as my MARS Ratings or whether there's some additional information in their combination.
To find out we fit a model using all three variables - the Home Team price, the Home Team MARS Rating, and the Away Team MARS Rating - and we find that all three variables are statistically significant at the 10% level. On that basis we can claim that all three variables contain some unique information that helps to explain a game's victory margin.
The model we get, which I'll call the Combined Model, is:
Predicted Margin = -115.63 + 67.02 / Home Team Price + 0.31 x MARS Rating of Home Team - 0.22 x MARS Rating of Away Team
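A sketch of how the Combined Model might be fitted, and the significance of its coefficients checked, appears below; the file and column names are hypothetical placeholders.

```python
# A sketch of fitting the Combined Model; 'afl_2006_2009.csv', 'home_price',
# 'home_mars', 'away_mars' and 'margin' are hypothetical placeholders.
import pandas as pd
import statsmodels.api as sm

games = pd.read_csv("afl_2006_2009.csv")

X = pd.DataFrame({"inv_home_price": 1.0 / games["home_price"],
                  "home_mars": games["home_mars"],
                  "away_mars": games["away_mars"]})
fit = sm.OLS(games["margin"], sm.add_constant(X)).fit()

print(fit.params)            # should roughly echo the coefficients above
print(fit.pvalues < 0.10)    # all three predictors significant at the 10% level
```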
A summary of this model and the two we covered earlier appears in the following table:
The Combined Model - the one that uses the bookie price and MARS ratings - explains over 24% of the variability in victory margin and has an average absolute prediction error of just 29.2 points. It produces these more accurate predictions not by being very close to the actual margin more often - in fact, it's within 6 points of the actual margin only about 13% of the time - but, instead, by being a long way from the actual margin less often.
Its margin prognostications are sufficiently accurate that, based on them, the winning team on handicap betting is identified a little over 53% of the time. Of course, it's one thing to fit a dataset that well and another thing entirely to convert that performance into profitable forecasts.
What Price the Saints to Beat the Cats in the GF?
If the Grand Final were to be played this weekend, what prices would be on offer?
We can answer this question for the TAB Sportsbet bookie using his prices for this week's games, his prices for the Flag market and a little knowledge of probability.
Consider, for example, what must happen for the Saints to win the flag. They must beat the Dogs this weekend and then beat whichever of the Cats or the Pies wins the other Preliminary Final. So, there are two mutually exclusive ways for them to win the Flag.
In terms of probabilities, we can write this as:
Prob(St Kilda Wins Flag) =
Prob(St Kilda Beats Bulldogs) x Prob (Geelong Beats Collingwood) x Prob(St Kilda Beats Geelong) +
Prob(St Kilda Beats Bulldogs) x Prob (Collingwood Beats Geelong) x Prob(St Kilda Beats Collingwood)
We can write three more equations like this, one for each of the other three Preliminary Finalists.
Now if we assume that the bookie's overround has been applied to each team equally then we can, firstly, calculate the bookie's probability of each team winning the Flag based on the current Flag market prices which are St Kilda $2.40; Geelong $2.50; Collingwood $5.50; and Bulldogs $7.50.
If we do this, we obtain:
- Prob(St Kilda Wins Flag) = 36.8%
- Prob(Geelong Wins Flag) = 35.3%
- Prob(Collingwood Wins Flag) = 16.1%
- Prob(Bulldogs Win Flag) = 11.8%
Next, from the current head-to-head prices for this week's games, again assuming equally applied overround, we can calculate the following probabilities:
- Prob(St Kilda Beats Bulldogs) = 70.3%
- Prob(Geelong Beats Collingwood) = 67.8%
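Backing those probabilities out of the prices is straightforward under the equal-overround assumption; here's a minimal sketch.

```python
# A sketch of converting a market's prices into probabilities, assuming the
# overround has been applied equally to every team in that market.
def implied_probs(prices):
    inverses = {team: 1.0 / price for team, price in prices.items()}
    total = sum(inverses.values())            # 1 + overround
    return {team: p / total for team, p in inverses.items()}

flag = implied_probs({"St Kilda": 2.40, "Geelong": 2.50,
                      "Collingwood": 5.50, "Bulldogs": 7.50})
print(flag)   # roughly 36.8%, 35.3%, 16.1% and 11.8%

# Applying the same function to each Preliminary Final's head-to-head prices
# gives the 70.3% and 67.8% figures above.
```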
Armed with those probabilities and the four equations of the form of the one above in bold we come up with a set of four equations in four unknowns, the unknowns being the implicit bookie probabilities for all the possible Grand Final matchups.
To lapse into the technical side of things for a second, we have a system of equations Ax = b that we want to solve for x. But, it turns out, the A matrix is rank-deficient. Mathematically this means that there are an infinite number of solutions for x; practically it means that we need to define one of the probabilities in x and we can then solve for the remainder.
Which probability should we choose?
I feel most confident about setting a probability - or a range of probabilities - for a St Kilda v Geelong Grand Final. St Kilda surely would be slight favourites, so let's solve the equations for Prob(St Kilda Beats Geelong) equal to 51% to 57%.
Each column of the table above provides a different solution and is obtained by setting the probability in the top row and then solving the equations to obtain the remaining probabilities.
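To make the mechanics concrete, here's a sketch of the back-substitution once Prob(St Kilda Beats Geelong) is fixed; it uses the rounded probabilities quoted above, so it needn't reproduce the table's figures exactly.

```python
# A sketch of solving the system after fixing Prob(St Kilda beats Geelong).
# Rounded inputs are used, so small differences from the table are expected.
f_stk, f_geel, f_coll = 0.368, 0.353, 0.161   # Flag probabilities from above
q_sb = 0.703                                  # Prob(St Kilda beats Bulldogs)
q_gc = 0.678                                  # Prob(Geelong beats Collingwood)

def matchup_probs(p_stk_geel):
    # From Prob(StK Flag) = q_sb * (q_gc * p_stk_geel + (1 - q_gc) * p_stk_coll)
    p_stk_coll = (f_stk / q_sb - q_gc * p_stk_geel) / (1 - q_gc)
    # From Prob(Geel Flag) = q_gc * (q_sb * (1 - p_stk_geel) + (1 - q_sb) * p_geel_dogs)
    p_geel_dogs = (f_geel / q_gc - q_sb * (1 - p_stk_geel)) / (1 - q_sb)
    # From Prob(Coll Flag) = (1 - q_gc) * (q_sb * (1 - p_stk_coll) + (1 - q_sb) * p_coll_dogs)
    p_coll_dogs = (f_coll / (1 - q_gc) - q_sb * (1 - p_stk_coll)) / (1 - q_sb)
    return p_stk_coll, p_geel_dogs, p_coll_dogs

for p in (0.51, 0.52, 0.53, 0.54, 0.55, 0.56, 0.57):
    print(p, [round(x, 3) for x in matchup_probs(p)])
```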
The solutions in the first 5 columns all have the same characteristic, namely that the Saints are considered more likely to beat the Cats than they are to beat the Pies. To steal a line from Get Smart, I find that hard to believe, Max.
Inevitably then we're drawn to the last two columns of the table, which I've shaded in gray. Either of these solutions, I'd contend, is a valid possibility for the TAB Sportsbet bookie's true current Grand Final matchup probabilities.
If we turn these probabilities into prices, add a 6.5% overround to each, and then round up or down as appropriate, this gives us the following Grand Final matchup prices.
St Kilda v Geelong
- $1.80/$1.95 or $1.85/$1.90
St Kilda v Collingwood
- $1.75/$2.00 or $1.70/$2.10
Geelong v Bulldogs
- $1.50/$2.45 or $1.60/$2.30
Collingwood v Bulldogs
- $1.65/$2.20 or $1.50/$2.45
And the Last Shall be First (At Least Occasionally)
So far we've learned that handicap-adjusted margins appear to be normally distributed with a mean of zero and a standard deviation of 37.7 points. That means that the unadjusted margin - from the favourite's viewpoint - will be normally distributed with a mean equal to minus the handicap and a standard deviation of 37.7 points. So, if we want to simulate the result of a single game we can generate a random Normal deviate (surely a statistical contradiction in terms) with this mean and standard deviation.
Alternatively, we can, if we want, work from the head-to-head prices if we're willing to assume that the overround attached to each team's price is the same. If we assume that, then a team's probability of victory is its opponent's head-to-head price divided by the sum of the two teams' head-to-head prices.
So, for example, if the market was Carlton $3.00 / Geelong $1.36, then Carlton's probability of victory is 1.36 / (3.00 + 1.36) or about 31%. More generally let's call the probability we're considering P%.
Working backwards then we can ask: what value of x for a Normal distribution with mean 0 and standard deviation 37.7 puts P% of the distribution on the left? This value will be the appropriate handicap for this game.
Again an example might help, so let's return to the Carlton v Geelong game from earlier and ask what value of x for a Normal distribution with mean 0 and standard deviation 37.7 puts 31% of the distribution on the left? The answer is -18.5. This is the negative of the handicap that Carlton should receive, so Carlton should receive 18.5 points start. Put another way, the head-to-head prices imply that Geelong is expected to win by about 18.5 points.
With this result alone we can draw some fairly startling conclusions.
In a game with prices as per the Carlton v Geelong example above, we know that 69% of the time this match should result in a Geelong victory. But, given our empirically-based assumption about the inherent variability of a football contest, we also know that Carlton, as well as winning 31% of the time, will win by 6 goals or more about 1 time in 14, and will win by 10 goals or more a little less than 1 time in 50. All of which is ordained to be exactly what we should expect when the underlying stochastic framework is that Geelong's victory margin should follow a Normal distribution with a mean of 18.5 points and a standard deviation of 37.7 points.
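Those figures fall straight out of the Normal assumption; a short sketch of the working follows.

```python
# A sketch of the working: back out Carlton's probability from the prices,
# convert it to an implied margin via the Normal(0, 37.7) model, then look
# at how often Carlton wins big.
from scipy.stats import norm

carlton_price, geelong_price = 3.00, 1.36
p_carlton = geelong_price / (carlton_price + geelong_price)    # about 0.31

sigma = 37.7
expected_geelong_margin = -norm.ppf(p_carlton) * sigma         # about 18.5 points
print(round(expected_geelong_margin, 1))

# Geelong's margin ~ Normal(18.5, 37.7), so Carlton winning by 36+ or 60+ is:
print(norm.cdf(-36, loc=expected_geelong_margin, scale=sigma))   # about 1 in 14
print(norm.cdf(-60, loc=expected_geelong_margin, scale=sigma))   # a little less than 1 in 50
```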
So, given only the head-to-head prices for each team, we could readily simulate the outcome of the same game as many times as we like and marvel at the frequency with which apparently extreme results occur. All this is largely because 37.7 points is a sizeable standard deviation.
Well if simulating one game is fun, imagine the joy there is to be had in simulating a whole season. And, following this logic, if simulating a season brings such bounteous enjoyment, simulating say 10,000 seasons must surely produce something close to ecstasy.
I'll let you be the judge of that.
Anyway, using the Wednesday noon (or nearest available) head-to-head TAB Sportsbet prices for each of Rounds 1 to 20, I've calculated the relevant team probabilities for each game using the method described above and then, in turn, used these probabilities to simulate the outcome of each game after first converting these probabilities into expected margins of victory.
(I could, of course, have just used the line betting handicaps but these are posted for some games on days other than Wednesday and I thought it'd be neater to use data that was all from the one day of the week. I'd also need to make an adjustment for those games where the start was 6.5 points as these are handled differently by TAB Sportsbet. In practice it probably wouldn't have made much difference.)
Next, armed with a simulation of the outcome of every game for the season, I've formed the competition ladder that these simulated results would have produced. Since my simulations are of the margins of victory and not of the actual game scores, I've needed to use points differential - that is, total points scored in all games less total points conceded - to separate teams with the same number of wins. As I've shown previously, this is almost always a distinction without a difference.
Lastly, I've repeated all this 10,000 times to generate a distribution of the ladder positions that might have eventuated for each team across an imaginary 10,000 seasons, each played under the same set of game probabilities, a summary of which I've depicted below. As you're reviewing these results keep in mind that every ladder has been produced using the same implicit probabilities derived from actual TAB Sportsbet prices for each game and so, in a sense, every ladder is completely consistent with what TAB Sportsbet 'expected'.
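For anyone who'd like to try this at home, here's a stripped-down sketch of the simulation loop; the fixture columns are hypothetical, and implied_margin stands for the expected home team margin backed out of the head-to-head prices as described above.

```python
# A sketch of the season simulation: draw each game's margin from a Normal
# with the bookie-implied mean and a 37.7-point standard deviation, tally
# wins, and break ties on wins with aggregate points difference.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
SIGMA = 37.7

def simulate_ladder_positions(fixture, n_sims=10_000):
    # fixture: one row per game with hypothetical columns 'home', 'away' and
    # 'implied_margin' (the expected home team margin from the prices).
    teams = sorted(set(fixture["home"]) | set(fixture["away"]))
    positions = []
    for _ in range(n_sims):
        margins = rng.normal(fixture["implied_margin"], SIGMA)
        wins = pd.Series(0.0, index=teams)
        diff = pd.Series(0.0, index=teams)
        for home, away, m in zip(fixture["home"], fixture["away"], margins):
            wins[home] += m > 0
            wins[away] += m < 0
            diff[home] += m
            diff[away] -= m
        ladder = (pd.DataFrame({"wins": wins, "diff": diff})
                  .sort_values(["wins", "diff"], ascending=False))
        positions.append(pd.Series(range(1, len(teams) + 1), index=ladder.index))
    return pd.DataFrame(positions)

# e.g. simulate_ladder_positions(fixture).mean() gives average ladder positions
```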
The variability you're seeing in teams' final ladder positions is not due to my assuming, say, that Melbourne were a strong team in one season's simulation, an average team in another simulation, and a very weak team in another. Instead, it's because even weak teams occasionally get repeatedly lucky and finish much higher up the ladder than they might reasonably expect to. You know, the glorious uncertainty of sport and all that.
Consider the row for Geelong. It tells us that Geelong ranks 1st on the basis of its average ladder position across the 10,000 simulations, which was 1.5. The barchart in the 3rd column shows the aggregated results for all 10,000 simulations, the leftmost bar showing how often Geelong finished 1st, the next bar how often they finished 2nd, and so on.
The column headed 1st tells us in what proportion of the simulations the relevant team finished 1st, which, for Geelong, was 68%. In the next three columns we find how often the team finished in the Top 4, the Top 8, or Last. Finally we have the team's current ladder position and then, in the column headed Diff, a comparison of each team's current ladder position with its ranking based on the average ladder position from the 10,000 simulations. This column provides a crude measure of how well or how poorly teams have fared relative to TAB Sportsbet's expectations, as reflected in their head-to-head prices.
Here are a few things that I find interesting about these results:
- St Kilda miss the Top 4 about 1 season in 7.
- Nine teams - Collingwood, the Dogs, Carlton, Adelaide, Brisbane, Essendon, Port Adelaide, Sydney and Hawthorn - all finish at least once in every position on the ladder. The Bulldogs, for example, top the ladder about 1 season in 25, miss the Top 8 about 1 season in 11, and finish 16th a little less often than 1 season in 1,650. Sydney, meanwhile, top the ladder about 1 season in 2,000, finish in the Top 4 about 1 season in 25, and finish last about 1 season in 46.
- The ten most-highly ranked teams from the simulations all finished in 1st place at least once. Five of them did so about 1 season in 50 or more often than this.
- Every team from ladder position 3 to 16 could, instead, have been in the Spoon position at this point in the season. Six of those teams had better than about a 1 in 20 chance of being there.
- Every team - even Melbourne - made the Top 8 in at least 1 simulated season in 200. Indeed, every team except Melbourne made it into the Top 8 about 1 season in 12 or more often.
- Hawthorn have either been significantly overestimated by the TAB Sportsbet bookie or been deucedly unlucky, depending on your viewpoint. They are 5 spots lower on the ladder than the simulations suggest they should expect to be.
- In contrast, Adelaide, Essendon and West Coast are each 3 spots higher on the ladder than the simulations suggest they should be.
(In another blog I've used the same simulation methodology to simulate the last two rounds of the season and project where each team is likely to finish.)
Waiting on Line
Hmmm. (Just how many ms are there in that word?)
It's Tuesday evening around 7pm and there's still no Line market up on TAB Sportsbet. In the normal course this market would go up at noon on Monday, at least when the first match is on Friday night. This week, though, the first game is 24 hours earlier than normal and the Line market looks as though it'll be delayed by 48 hours, perhaps more.
Curiouser still is the fact that the Head-to-Head market has been up since early March (at least) and there's a strong historical and mathematical relationship between Head-to-Head prices and the Line market, as the following chart shows.
The dark line overlaid on the chart fits the empirical data very well. As you can see, the R-squared is 0.944, which is an R-squared I'd be proud to present to any client.
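The fitted equation itself lives in the chart rather than the text, but one way to derive a comparable mapping is to combine the equal-overround price-to-probability conversion with the Normal(0, 37.7) margin model from earlier posts; the 6.5% total overround in the sketch below is an assumption, and the empirically fitted curve may well differ from it.

```python
# A sketch of mapping a favourite's head-to-head price to a predicted points
# start, using an assumed 6.5% total overround and the Normal(0, 37.7) model.
from scipy.stats import norm

SIGMA = 37.7
OVERROUND = 0.065   # assumed total overround; not taken from the fitted curve

def predicted_start(favourite_price):
    p_fav = 1.0 / ((1 + OVERROUND) * favourite_price)   # favourite's implied probability
    return norm.ppf(p_fav) * SIGMA                       # points start for the underdog

for price in (1.10, 1.25, 1.40, 1.60, 1.80):
    print(price, round(predicted_start(price), 1))
```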
Using the fitted equation gives the following table of Favourite's Price and Predicted Points Start:
Anyway, back to waiting for the TAB to set the terms of our engagement for the weekend ...