Okay, this posting is going to be a lot longer and a little more technical than the average MAFL blog (and it's not as if the standard fare around here could be fairly characterised as short and simple).
Anyway, over the years of MAFL, enough people have asked me about the process of building a statistical model, and with enough apparent interest, that I felt it was time to write a blog about it.
Step one in building a model is, as in life, finding a purpose, and the purpose of the model I'll be building for this blog is to predict AFL victory margins, surely about as noble a purpose as a model can aspire to. Step two is deciding on the data that will be used to build that model, a decision heavily influenced by expedience; often it's more a case of 'what have I already got that might be predictive?' than 'what will I spend the next 4 weeks of my life trying to source because I've an inkling it might help?'.
Expediently enough, the model I'll be building here will use a single input variable: the TAB Sportsbet price of the home team, generally at noon on Wednesday before the game. I have this data going back to 1999, but I've personally recorded prices only since 2006. The remainder of the data I sourced from a website built to demonstrate the efficacy of the site-owner's subscription-based punting service, which makes me trust this data about as much as I trust on-site testimonials from 'genuine' customers. We'll just be using the data for the seasons 2006 to 2009.
Fitting the Simplest Model
The first statistical model I'll fit to the data is what's called an ordinary least-squares regression - surely a name to cripple the self-esteem of even the most robust modelling technique - and it takes the form Predicted Margin = a + (b / Home Team Price).
The ordinary least-squares method chooses a and b to minimise the sum, across all games, of the squared differences between the actual victory margin and the margin that the model would predict using those values of a and b. In this sense, the chosen values 'fit' the data best of all the possible choices of a and b that we could make.
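For anyone who'd like to see the mechanics, here is a minimal sketch of that fit in plain Python. It is not the code used for this blog: the function names are made up for illustration, and the five games in the example are invented numbers, not real TAB Sportsbet prices or results. The trick is that the model is linear in 1 / Price, so the textbook closed-form least-squares solution applies.

```python
def fit_ols(prices, margins):
    """Fit Predicted Margin = a + b / Price by ordinary least squares."""
    x = [1.0 / p for p in prices]  # the regressor is the reciprocal of the price
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(margins) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, margins))
    b = sxy / sxx             # slope on 1 / Price
    a = mean_y - b * mean_x   # intercept
    return a, b


def sum_squared_errors(a, b, prices, margins):
    """The quantity that least squares makes as small as possible."""
    return sum((m - (a + b / p)) ** 2 for p, m in zip(prices, margins))


# Invented example data (NOT real prices or margins).
toy_prices = [1.25, 1.50, 2.00, 3.00, 4.00]
toy_margins = [35, 10, 5, -20, -30]

a, b = fit_ols(toy_prices, toy_margins)
print(f"a = {a:.2f}, b = {b:.2f}")
print(f"SSE at the fitted values: {sum_squared_errors(a, b, toy_prices, toy_margins):.1f}")
```

Nudging either a or b away from the fitted values and recomputing the sum of squared errors should always give a larger number, which is all that 'best fit' means here.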
We saw the result of fitting this model to the 2006-2009 data in an earlier blog:
Predicted Margin = -49.17 + 96.31 / Home Team Price
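To get a feel for what this equation says, it can help to plug in a few prices. The sketch below does only that arithmetic with the two coefficients quoted above; the function name and the sample prices are illustrative choices, and reading a negative result as a predicted home-team loss is an assumption about how the margin is signed.

```python
A, B = -49.17, 96.31  # coefficients from the fitted model quoted above


def predicted_margin(home_price):
    """Predicted margin (in points) for a given home team price."""
    return A + B / home_price


# Illustrative prices only.
for price in (1.20, 1.50, 2.00, 3.00):
    print(f"${price:.2f} -> {predicted_margin(price):+.1f} points")

# Price at which the predicted margin is exactly zero: a + b / price = 0.
break_even_price = -B / A
print(f"Predicted margin is zero at about ${break_even_price:.2f}")
```

On these numbers a home team at $1.50 is predicted to win by about 15 points, a home team at $2.00 is predicted to lose by about a point, and the predicted margin crosses zero at a price of roughly $1.96.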
This model fits the data for seasons 2006 to 2009 quite well. The most common measure of how well a model of this type fits is what's called the R-squared and, for this model, it's 0.236, meaning that the model explains a little less than one-quarter of the variability in margins across games.
But this is a difficult measure to which to attach any intuitive meaning. Better, perhaps, is to know that, on average, the predictions of this model are wrong by 29.3 points per game, that for one-half of the games its prediction is within 24.1 points of the actual result, and that for 27% of the games it is within 12 points.
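All four of these fit measures can be computed from nothing more than the list of actual margins and the list of predicted margins. The following is a rough sketch under a couple of assumptions: the function and key names are invented for illustration, 'within 12 points' is treated as inclusive of exactly 12, and the four games in the example are made-up numbers rather than real results.

```python
from statistics import mean, median


def fit_summary(actual, predicted, within=12):
    """R-squared plus three absolute-error summaries for a set of predictions."""
    errors = [a - p for a, p in zip(actual, predicted)]
    abs_errors = [abs(e) for e in errors]
    mean_actual = mean(actual)
    ss_res = sum(e ** 2 for e in errors)                    # unexplained variation
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)    # total variation
    return {
        "r_squared": 1 - ss_res / ss_tot,
        "mean_abs_error": mean(abs_errors),        # 'wrong by X points on average'
        "median_abs_error": median(abs_errors),    # half of all games are within this
        # Assumption: 'within' includes an error of exactly `within` points.
        "share_within": sum(e <= within for e in abs_errors) / len(abs_errors),
    }


# Invented example (NOT real results).
toy_actual = [10, -20, 30, 0]
toy_predicted = [4, -8, 12, 9]
for name, value in fit_summary(toy_actual, toy_predicted).items():
    print(f"{name}: {value:.3f}")
```

The mean absolute error corresponds to the 'wrong by 29.3 points per game' figure, the median absolute error to the 24.1 points, and the share within 12 points to the 27%.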
These results are all very promising but it would be a rookie mistake to start using this model in 2010 with the expectation that it will explain the future as well as it has explained the past. It's quite common for a statistical model to fit existing data well but to forecast as poorly as a surprised psychic ('Jeez, I didn't see that coming!').
Why? Because forecasting and fitting are two very different activities. When we build the model we deliberately make the fit as good as it can be and this can mean that the model we create doesn't faithfully represent the process that created that data. This is known in statistical circles - which, I guess, are only round on average - as 'overfitting' the data and it's one of the many things over which we obsess.
Overfitting is less likely to be a problem for the current model since it has only one variable in it and overfitting is more commonly a disease of multi-variable models, but it's something that it's always wise to check. A bit like checking that you've turned the stove off before you leave home.
Testing the Model
The biggest problem with modelling the future is that it hasn't happened yet (with apologies to whoever I stole or paraphrased that from). In modelling, however, we can create an artificial reality where, as far as our model's concerned, the future hasn't yet happened. We do this by fitting the model to just a part of the data we have, saving some for later as it were.
So, here we could fit a model to the 2006 season's data only and use the resulting model to predict the 2007 results. We could then repeat this by fitting a model to the 2007 data only and using it to predict the 2008 results, and again by fitting a model to the 2008 data only and using it to predict the 2009 results. Collectively, I'll call the models that I've fitted using this approach "Single Season" models.
Each Single Season model's forecasting ability can be calculated from the difference between the predictions it makes and the results of the games in the subsequent season. If the Single Season models overfit the data then they'll tend to fit the data well but predict the future badly.
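The fit-on-one-season, predict-the-next procedure described above can be sketched as a short loop. This is an illustration rather than the code behind the results in this blog: the function names and the dictionary layout are assumptions, mean absolute error is used as the yardstick purely for convenience, and the handful of games per season are invented numbers.

```python
from statistics import mean


def fit_inverse_price_model(games):
    """Least-squares fit of margin = a + b / price for a list of (price, margin)."""
    x = [1.0 / price for price, _ in games]
    y = [margin for _, margin in games]
    mean_x, mean_y = mean(x), mean(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    return mean_y - b * mean_x, b


def mean_abs_error(games, a, b):
    """Average size of the prediction error, in points, over a list of games."""
    return mean(abs(margin - (a + b / price)) for price, margin in games)


def single_season_backtest(games_by_season):
    """Fit on each season alone, then score that model on the following season."""
    seasons = sorted(games_by_season)
    results = {}
    for fit_season, next_season in zip(seasons, seasons[1:]):
        a, b = fit_inverse_price_model(games_by_season[fit_season])
        results[(fit_season, next_season)] = {
            "a": a,
            "b": b,
            "fit_mae": mean_abs_error(games_by_season[fit_season], a, b),
            "forecast_mae": mean_abs_error(games_by_season[next_season], a, b),
        }
    return results


# Invented games as (home_price, margin) pairs (NOT real data).
toy_games = {
    2006: [(1.30, 40), (1.80, -5), (2.50, -12), (3.50, -30)],
    2007: [(1.20, 25), (1.70, 18), (2.20, -20), (4.00, -15)],
    2008: [(1.40, 12), (1.90, 3), (2.80, -35), (3.20, -8)],
    2009: [(1.25, 50), (1.60, -2), (2.40, 6), (3.80, -28)],
}

for (fit_season, next_season), r in single_season_backtest(toy_games).items():
    print(f"fit {fit_season}, forecast {next_season}: "
          f"fit MAE {r['fit_mae']:.1f}, forecast MAE {r['forecast_mae']:.1f}")
```

A forecast error that is much larger than the fit error for the same model is the signature of overfitting that this test is designed to expose.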
The results of fitting and using the Single Season models are as follows: