What Types of Statistical Models Does MoS Use for Prediction?
Over the years MoS has explored a wide variety of statistical modelling techniques, perhaps best exemplified by this post from 2011, in which I pitted a number of them head-to-head in predicting game margins.
One of the more exciting recent developments in the field of machine learning has been the emergence of ensemble learning techniques, in which the predictions of a number of base learners, each itself a statistical model, are combined to form overall predictions. MoS now makes extensive use of these for its Tipsters and Predictors.
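As a rough illustration of the idea, here is a minimal Python sketch that combines the margin predictions of two toy base learners by averaging them, optionally with weights. The base learners, the field names in the `game` dictionary and the 12-point heuristic margin are all hypothetical, invented purely for the example, and averaging is only the simplest possible combination rule; this is not the ensemble method that any MoS Tipster or Predictor actually uses.

```python
from statistics import mean

def higher_on_ladder(game):
    """Toy heuristic learner: tip the team higher on the ladder by a fixed margin.
    The 12-point margin is an arbitrary assumption for illustration."""
    return 12.0 if game["home_ladder_pos"] < game["away_ladder_pos"] else -12.0

def rating_difference(game):
    """Toy rating-based learner: predicted margin is the gap in (hypothetical) team ratings."""
    return game["home_rating"] - game["away_rating"]

def ensemble_margin(game, base_learners, weights=None):
    """Combine base learners' margin predictions by simple or weighted averaging."""
    predictions = [learner(game) for learner in base_learners]
    if weights is None:
        return mean(predictions)
    return sum(w * p for w, p in zip(weights, predictions)) / sum(weights)

# Hypothetical game: home team is 3rd on the ladder, away team is 10th.
game = {
    "home_ladder_pos": 3,
    "away_ladder_pos": 10,
    "home_rating": 1012.0,
    "away_rating": 1004.0,
}

learners = [higher_on_ladder, rating_difference]
simple_average = ensemble_margin(game, learners)             # (12 + 8) / 2 = 10.0
weighted_average = ensemble_margin(game, learners, [1, 3])   # (12 + 24) / 4 = 9.0
print(simple_average, weighted_average)
```

More sophisticated ensembles differ mainly in how the base learners are built (for example, on different samples of the data) and in how their outputs are weighted, but the combining step has the same basic shape.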
Other Tipsters and Predictors apply less complex methodologies, such as the use of simple heuristics (for example, predict the winning team to be the one higher on the competition ladder) or Elo-based team rating systems (such as MoSSBODS and ChiPS).
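For readers unfamiliar with Elo-style ratings, the sketch below shows the textbook Elo update in Python: each team has a rating, the rating gap implies an expected result, and ratings move in proportion to how far the actual result departed from that expectation. The scale of 400 and the K-factor of 32 are conventional chess-style defaults used here as assumptions; MoSSBODS and ChiPS have their own formulations and parameters, which this example does not attempt to reproduce.

```python
def expected_score(rating_a, rating_b, scale=400.0):
    """Expected result for team A (between 0 and 1) under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))

def update_ratings(rating_a, rating_b, score_a, k=32.0):
    """Return updated (rating_a, rating_b) after a game.

    score_a is 1.0 for a win by A, 0.5 for a draw and 0.0 for a loss.
    k controls how quickly ratings respond to new results (assumed value).
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta

# Two equally rated teams: the winner gains what the loser gives up.
new_a, new_b = update_ratings(1500.0, 1500.0, score_a=1.0)
print(new_a, new_b)  # 1516.0 1484.0
```

Because the update is zero-sum, the average rating across all teams stays constant, which is part of what makes ratings comparable over time.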
Prior to 2016, three types of statistical model were in regular use in MoS:
- Conditional Inference Tree Forests: this technique is used to create probability predictions for the current Head-to-Head and Line Funds, and for the WinPred and ProPred Predictors. Empirically, Conditional Inference Tree Forests appear capable of capturing the non-linear relationships inherent in football data without toppling into the abyss of overfitting.
You can think of a single conditional inference tree as a decision tree built using a subset of the available data and employing specific algorithmic rules designed to produce ensembles that are both accurate and robust. For more details, try googling "conditional inference tree".
- Neural Networks: up until 2011, to the best of my recollection I'd never used a neural network for MoS. That changed with the introduction of two such networks in that year, each designed to produce Margin predictions. Both were created using Phil Brierly's Tiberius tool (which, unfortunately, appears to no longer be available).
- Empirical Loss Minimisers: aside from the two neural networks, the models used by all Margin predictors were created using Eureqa Formulize, which is described on its website as "a scientific data mining software package that searches for mathematical patterns hidden in your data". The inputs to Eureqa were the actual Margins from a number of games and the predictions from other statistical and empirical models - for example the probability predictions of the WinPred and ProPred.
To fit a model using Eureqa, an analyst needs to select a loss function, or error metric, which Eureqa uses to assess how well a candidate model fits the training data. To be honest, I don't recall exactly which error metric I used to create the Margin predictors with Eureqa.
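To see why the choice of error metric matters, consider the following Python sketch, which scores two sets of margin predictions using mean absolute error (MAE) and root mean squared error (RMSE). These two metrics are common choices picked for illustration only; they are not necessarily the ones used for the MoS Margin predictors, and the margins and model names are made up.

```python
from math import sqrt

def mean_absolute_error(actual, predicted):
    """Average size of the prediction errors, ignoring sign."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def root_mean_squared_error(actual, predicted):
    """Square root of the average squared error; penalises large misses more heavily."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Hypothetical actual margins and two candidate models' predictions.
actual_margins = [10, -20, 35, 4]
model_a = [12, -15, 30, 0]   # small errors on every game
model_b = [10, -20, 20, 4]   # perfect on three games, one large miss

for name, preds in [("A", model_a), ("B", model_b)]:
    mae = mean_absolute_error(actual_margins, preds)
    rmse = root_mean_squared_error(actual_margins, preds)
    print(f"Model {name}: MAE = {mae:.2f}, RMSE = {rmse:.2f}")
```

Here Model B has the lower MAE (3.75 versus 4.00) while Model A has the lower RMSE (about 4.18 versus 7.50), so a search guided by one metric could settle on a different model from a search guided by the other.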
Another modelling technique that's regularly trotted out in the pages of MoS is the binary logit, which is used to model binary outcomes such as winning versus losing. I've tended to more often use this to fit historical data rather than to make predictions, but that distinction's probably moot.
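For the curious, a binary logit models the probability of a win as a logistic function of one or more inputs. The Python sketch below fits a one-variable logit by plain gradient ascent on the log-likelihood. The data, the single input (a made-up "rating difference"), the learning rate and the iteration count are all assumptions chosen for illustration; in practice you would use a statistics package rather than hand-rolled code like this.

```python
from math import exp

def logistic(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + exp(-z))

def fit_logit(xs, ys, learning_rate=0.1, n_iter=2000):
    """Fit P(win) = logistic(b0 + b1 * x) by gradient ascent on the mean log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(n_iter):
        residuals = [y - logistic(b0 + b1 * x) for x, y in zip(xs, ys)]
        b0 += learning_rate * sum(residuals) / n
        b1 += learning_rate * sum(r * x for r, x in zip(residuals, xs)) / n
    return b0, b1

# Hypothetical data: x is a rating difference, y is 1 for a win and 0 for a loss.
rating_diffs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
results = [0, 0, 1, 0, 1, 0, 1, 1]

intercept, slope = fit_logit(rating_diffs, results)
prob_strong_favourite = logistic(intercept + slope * 2.0)
prob_strong_underdog = logistic(intercept + slope * -2.0)
print(intercept, slope, prob_strong_favourite, prob_strong_underdog)
```

The fitted slope is positive, so a larger rating difference translates into a higher estimated win probability, which is the kind of relationship a logit is typically used to quantify.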