The 2011 Performance of the MARS, Colley and Massey Ratings Systems

I was curious - and that's rarely a portent of lazy evenings - as to which of the three ratings systems we've been tracking since Round 14 is best, so I set about finding out.

First, though, we need to decide what we mean by "best", and in the end I decided on the following protocol (a code sketch of which follows the list):

  • Fit predictive models to the results for Rounds 3 to 12 (n=77) using only the ratings of the participating teams, then use the fitted model to predict the results for Rounds 13 to 21 (n=70). Fit three separate models, one using only MARS Ratings, another using only Colley Ratings, and the third using only Massey Ratings
  • Fit two types of models: a regression model to predict the margin of victory by the home team, and a classification model to predict whether or not the home team will win
  • For the regression models, use R-squared in the prediction period as the performance metric
  • For the classification models, use predictive accuracy in the prediction period
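To make that protocol concrete, here's a minimal sketch in R using plain linear and logistic regression and only the MARS Ratings. The results data frame and its column names (Round, Home.MARS, Away.MARS, Home.Margin, Home.Win) are assumptions for illustration, not the actual dataset.

```r
# A minimal sketch of the evaluation protocol, assuming a data frame `results`
# with hypothetical columns Round, Home.MARS, Away.MARS, Home.Margin and
# Home.Win (the last a factor with levels "Loss" and "Win").

train.data   <- subset(results, Round >= 3  & Round <= 12)   # fitting period,    n = 77
predict.data <- subset(results, Round >= 13 & Round <= 21)   # prediction period, n = 70

# Regression: predict the home team's margin from the two teams' Ratings
margin.model     <- lm(Home.Margin ~ Home.MARS + Away.MARS, data = train.data)
predicted.margin <- predict(margin.model, newdata = predict.data)

# Performance metric: R-squared in the prediction period
prediction.r2 <- cor(predicted.margin, predict.data$Home.Margin)^2

# Classification: predict whether or not the home team wins
win.model <- glm(Home.Win ~ Home.MARS + Away.MARS,
                 data = train.data, family = binomial)
predicted.win <- ifelse(predict(win.model, newdata = predict.data,
                                type = "response") > 0.5, "Win", "Loss")

# Performance metric: predictive accuracy in the prediction period
prediction.accuracy <- mean(predicted.win == predict.data$Home.Win)
```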

There's no particular reason to favour any one predictive modelling technique over another, so again I used R's caret package to deploy a wide variety of regression and classification techniques.
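A loop over caret methods might look something like the sketch below, reusing the assumed train.data and predict.data split from the previous snippet. The list of method names here is a small, illustrative subset rather than the full set reported in the table.

```r
# Fit the same formula with a variety of caret methods and record R-squared
# in both the fitting and the prediction periods.
library(caret)

methods.to.try <- c("lm", "svmPoly", "svmRadial", "rf", "parRF",
                    "treebag", "cforest", "gbm", "knn")

r2 <- function(predicted, actual) cor(predicted, actual)^2

performance <- sapply(methods.to.try, function(m) {
  fit <- train(Home.Margin ~ Home.MARS + Away.MARS,
               data      = train.data,
               method    = m,
               trControl = trainControl(method = "cv", number = 5))
  c(Training   = r2(predict(fit, train.data),   train.data$Home.Margin),
    Prediction = r2(predict(fit, predict.data), predict.data$Home.Margin))
})

round(t(performance), 3)   # one row per method: training vs prediction R-squared
```

The same loop, with a classification formula and accuracy in place of R-squared, covers the second model type.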

Here are the results for the regression models in which we're trying to predict the home team's margin of victory:

I've sorted the results based on each method's performance in the prediction period for the model created using only MARS Ratings, but that choice is arbitrary.

It's fascinating to note once again how poorly a model's fit to the training data reflects its fit to the prediction data. Though they're not shown in the table, the correlations between the models' training and prediction performances are +0.39 for the models built using only MARS Ratings, -0.28 (sic) for those built using only Colley Ratings, and +0.27 for those built using only Massey Ratings.
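For the record, those correlations are taken across methods: pair each method's training-period R-squared with its prediction-period R-squared and correlate the two. A sketch, assuming a hypothetical summary data frame regression.results with one row per method:

```r
# Correlation, across methods, between training and prediction R-squared for
# each Ratings System. Column names are assumed for illustration.
with(regression.results, cor(MARS.Training,   MARS.Prediction))    # quoted as +0.39
with(regression.results, cor(Colley.Training, Colley.Prediction))  # quoted as -0.28
with(regression.results, cor(Massey.Training, Massey.Prediction))  # quoted as +0.27
```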

The tree- and forest-based methods again provide some of the starkest examples of apparent overfitting. For example, the random forest methods parRF and rf (which, on further research, might well be the same underlying algorithm, their differing results being due only to the stochastic nature of their construction), plus treebag and cforest are all in the top 5 in terms of performance on the MARS Ratings training data, but in the bottom 16 in terms of performance in the prediction period.

Conversely, the Support Vector Machine method svmPoly, which is the best-performing method in the prediction period using the MARS or Colley Ratings data, ranks 31st and 35th of 38 respectively in the fitting period.

The implications of this are quite profound, at least in the current instance: selecting the best of a variety of methods fitted to your training data is unlikely to identify a method that performs well in your prediction period. In fact, you're probably better off selecting a model that does only moderately well in the training period, in the knowledge that it's likely also to perform moderately well in the prediction period.
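One way to see this concretely is to pick the method with the best training-period R-squared and then check where that same method ranks out of sample. A sketch against the same hypothetical regression.results data frame, with an assumed Method column holding the caret method names:

```r
# Choose the training-period winner (here among the MARS-only models) and
# report its rank, 1 = best, on prediction-period R-squared.
chosen <- regression.results$Method[which.max(regression.results$MARS.Training)]
rank(-regression.results$MARS.Prediction)[regression.results$Method == chosen]
```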

My initial curiosity, you'll recall, was about the relative predictive efficacy of the three sets of Ratings data, and the column on the far right of the table records, for each method, which of the three Ratings Systems' data produced the best model in the prediction period. The overall results are summarised at the bottom of that column, where you'll see that the MARS Ratings produce the superior model for 27 of the 39 methods tested, or almost 70%. Massey Ratings produce superior models for 11 of the 39 methods, or just under 30%, and Colley Ratings produce a superior model only if svmRadial happened to be your method of choice.
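That tally amounts to asking, for each method, which Ratings System's data produced the highest prediction-period R-squared, then counting the winners. A sketch, again against the hypothetical regression.results data frame:

```r
# For each method (row), find the Ratings System with the highest
# prediction-period R-squared, then count how often each System wins.
pred.period.r2 <- regression.results[, c("MARS.Prediction",
                                         "Colley.Prediction",
                                         "Massey.Prediction")]

best.system <- c("MARS", "Colley", "Massey")[apply(pred.period.r2, 1, which.max)]
table(best.system)   # the tallies quoted above: MARS 27, Massey 11, Colley 1
```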

Further, using Massey or Colley data, no method produces a model with an R-squared greater than 72% in the prediction period; 24 of the 38 methods produce an R-squared higher than this when using the MARS Ratings.

That's compelling evidence for the relative superiority of the MARS Ratings in the chosen prediction period.

But what of the efficacy of the three Ratings Systems for predicting only winners and losers, not margins?

Here it's not MARS but Massey that's superior. Across the 51 model types that I fitted to the results data, Massey proved outright superior in 35 cases and at least jointly superior in 5 more.
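The classification comparison uses the same caret machinery, just with a factor outcome and accuracy as the metric. Here's a sketch for a single method using Massey Ratings; Home.Massey and Away.Massey are assumed column names, analogous to the MARS columns used earlier.

```r
# Classification version of the protocol: fit on Rounds 3 to 12, then measure
# predictive accuracy on Rounds 13 to 21. Home.Win must be a factor.
library(caret)

win.fit <- train(Home.Win ~ Home.Massey + Away.Massey,
                 data      = train.data,
                 method    = "svmRadial",
                 trControl = trainControl(method = "cv", number = 5))

predicted.class <- predict(win.fit, predict.data)
mean(predicted.class == predict.data$Home.Win)   # accuracy in the prediction period
```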

Once more the relationship between training and prediction performance is quite weak: the correlations are +0.24 for models based on MARS Ratings, +0.30 for those based on Colley Ratings, and +0.18 for those based on Massey Ratings. Some of the random forest and tree-based methods again show signs of overfitting, though this time they give the game away a little by being perfect on the training data.

So, in summary:

  • recognise that relative and absolute in-sample performance can be a very poor indicator of relative and absolute out-of-sample performance
  • MARS Ratings are best for predicting home team victory margins and Massey Ratings are best for helping you to win your local tipping competition