Predicting the Home Team's Final Margin: A Competition Amongst Predictive Algorithms

With fewer than half-a-dozen home-and-away rounds to be played, it's time I was posting to the Simulations blog, but this year I wanted to see if I could find a better algorithm than OLS for predicting the margins of victory for each of the remaining games.

Here's the setup: 

  • Use R's caret package to tune a variety of regression-type algorithms
  • Fit each to the results for seasons 2000-2010 and then test them on the results for season 2011 to date (to the end of Round 19)
  • Use as the target variable the margin of victory for the Home team
  • Use only the following regressors: Home Team MARS Rating, Away Team MARS Rating, Home Team Venue Experience, Away Team Venue Experience, and a binary reflecting the Interstate status of the contest from the Home team's viewpoint (+1 if the Home team is in its home state and its opponent is not, 0 if both teams or neither team is in its home state, and -1 on those rare occasions where the Home team is playing in the Away team's home state)
  • Use RMSE as the metric for tuning each algorithm (which caret does automatically)
  • Rank each algorithm on the basis of its Mean Absolute Prediction Error for the Test period (2011 to the end of R19)
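
To make the setup concrete, here's a minimal sketch of how it might look using R's caret package. The data frame results and all of the column names are my own inventions - the real dataset will differ - and I've listed only a handful of the 52 methods.

    library(caret)

    ## Fitting period: 2000-2010; Test period: 2011 to the end of Round 19.
    ## The data frame results, and all column names below, are assumed for illustration.
    train_data <- subset(results, Season >= 2000 & Season <= 2010)
    test_data  <- subset(results, Season == 2011)

    regressors <- c("Home.MARS", "Away.MARS", "Home.Venue.Exp",
                    "Away.Venue.Exp", "Interstate")
    form <- reformulate(regressors, response = "Home.Margin")

    ## caret tunes each algorithm on RMSE using its default settings
    ## (bootstrap resampling of the Fitting data)
    methods <- c("lm", "ppr", "pls", "spls", "enet", "rf")   # a handful of the 52

    fits <- lapply(methods, function(m)
      train(form, data = train_data, method = m, metric = "RMSE"))
    names(fits) <- methods

    ## Rank each algorithm on its Mean Absolute Prediction Error for the Test period
    test_mae <- sapply(fits, function(fit)
      mean(abs(predict(fit, newdata = test_data) - test_data$Home.Margin)))
    sort(test_mae)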

The results of following this process for 52 different regression algorithms are summarised in the following table (which can - nay must - be clicked on to display a larger version):

(If you're at all curious about the details of any of the algorithms used, please Google the name of the algorithm plus the character R (eg "icr R"); here I'll only be talking to a handful of the algorithms.)

I've sorted the table on the basis of each algorithm's mean absolute prediction error performance for 2011, on which measure ppr (Projection Pursuit Regression) leads out. Its score of 29.17 points per game would place it 3rd on MAFL's Margin Predictor Leaderboard, an especially impressive performance in light of the fact that it's had no help from bookmaker starting prices, unlike all the Predictors that have performed better than it.

The ppr algorithm also has the best RMSE for the Testing period and the third-best R-squared. It was, however, only 14th-best in terms of RMSE, 13th-best in terms of MAPE, and 15th-best in terms of R-squared during the Fitting period (2000-2010), neatly demonstrating the dangers of judging an algorithm solely on the basis of its fit.
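
That gap between Fitting and Test performance is easy to make visible by putting each algorithm's in-sample and out-of-sample errors side by side; a sketch, reusing the fitted list from the earlier code:

    ## In-sample (Fitting period) versus out-of-sample (Test period) MAE;
    ## a large gap between the two is the signature of overfitting
    fit_vs_test <- t(sapply(fits, function(fit) {
      c(Fit.MAE  = mean(abs(predict(fit, newdata = train_data) - train_data$Home.Margin)),
        Test.MAE = mean(abs(predict(fit, newdata = test_data)  - test_data$Home.Margin)))
    }))
    round(fit_vs_test[order(fit_vs_test[, "Test.MAE"] - fit_vs_test[, "Fit.MAE"],
                            decreasing = TRUE), ], 2)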

Next best in terms of MAPE for the Testing period were Partial Least Squares and Sparse Partial Least Squares, both of which are described in their promotional material as being well-suited to situations where regressors are correlated, a situation that's clearly applicable here since the Interstate binary is positively correlated with Home Team Venue Experience (+0.69) and negatively correlated with Away Team Venue Experience (-0.69). Partial Least Squares is also highly-ranked on RMSE for the Testing period and moderately well-ranked on R-squared for this period. Sparse Partial Least Squares ranks lower on RMSE but higher on R-squared in the Testing period. Neither performs particularly well on any performance measure in the Fitting period.
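
Those correlations are simple enough to verify directly; a quick check, using the assumed column names from the earlier sketch:

    ## Pairwise correlations amongst three of the regressors
    ## (column names as assumed in the earlier sketch)
    round(cor(train_data[, c("Interstate", "Home.Venue.Exp", "Away.Venue.Exp")]), 2)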

A hair's breadth behind this pair on MAPE for the Test period is the Generalized Additive Model, though it does relatively poorly in terms of the RMSE and R-squared metrics for the Test period, ranking only 19th and 21st respectively on these measures.

Thereafter follows a slew of algorithms with closely-bunched MAPEs for the Test period, starting from 29.33 points per game. Amongst the bunch, the Elastic Net algorithm is clearly superior, ranking in the top 5 on MAPE, RMSE and R-squared for the Test period, the only algorithm other than Projection Pursuit Regression to do this. Also in this group is the OLS algorithm, which turns in a creditable performance by finishing top 10 on all three metrics for the Test period.

On the results for the remainder of the algorithms I offer the following observations: 

  • The Conditional Inference Tree forest, cforest, performs surprisingly poorly, especially considering its ranking of 7th on all three metrics for the Fitting period. Once again there's evidence of this algorithm's tendency to overfit for some problems and datasets despite its in-built "protections" against this behaviour.
  • cforest's close relative, the randomForest, shows even stronger evidence of overfitting, finishing top 2 on all three metrics in the Fitting period but mid-pack on the same metrics for the Test period. Two other members of the family, quantile regression forest and parRF, seem similarly afflicted, though ctree and ctree2 do not, as their Fitting and Test period results are broadly similar, which is to say, poor.
  • One other algorithm that appears to have overfit the Fitting data is the Relevance Vector Machine with radial kernel, rvmRadial, which finished top 3 on all three metrics in the Fitting period but 45th or worse on all three metrics in the Test period.
  • Two of the fitted Support Vector Machine variants, svmLinear and svmPoly, do exceptionally well on the R-squared metric for the Test period, finishing 1st and 2nd, but do poorly on the other two metrics. The other fitted SVM, svmRadial, does poorly on all three metrics in the Test period.
  • The last seven algorithms appear to have collectively missed the memo about the purpose of fitting a model, returning RMSEs on the Test data of 42 or more. 

I should point out that I made no attempts to manually tune any of these algorithms, preferring instead to use caret's default settings to do this for me. It's possible, therefore, that any of the poorly performing algorithms, especially those that are highly tunable, might have suffered from my lethargy in this regard.
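
Had I done any manual tuning, caret makes it a one-line change: supply a tuneGrid (or a larger tuneLength) to train. Here's a hedged sketch of what that might have looked like for the Elastic Net, whose caret implementation tunes a ridge penalty (lambda) and a LASSO fraction - the grid values are purely illustrative, and the code reuses form, train_data and test_data from the earlier sketch.

    ## Manual tuning of the Elastic Net via an explicit grid
    ## (grid values are illustrative only)
    enet_grid <- expand.grid(lambda   = c(0, 0.01, 0.1),
                             fraction = seq(0.1, 1, length = 10))

    enet_fit <- train(form, data = train_data, method = "enet",
                      metric = "RMSE", tuneGrid = enet_grid)

    ## Test-period MAE for the manually-tuned Elastic Net
    mean(abs(predict(enet_fit, newdata = test_data) - test_data$Home.Margin))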

The last thing to which I'll turn your attention is the row beneath the main body of the table, in which are recorded correlation coefficients. The first five of these are pairwise correlations across all predictors for the same metric in the Test and the Fitting period. So, for example, the correlation between predictors' RMSEs in the Test and Fitting period is just +0.27, suggesting that a predictor's performance on this metric in the Fitting period was a poor indicator of its performance in the Test period - a result that's not surprising in light of the high levels of apparent overfitting that I highlighted earlier.

Given this result I wondered if the bootstrapped versions of the RMSE and R-squared metrics, which caret provides by default on the basis of 25 resamples of the Fitting data, might be better indicators of an algorithm's ex-sample performance. This is quite clearly the case for the bootstrapped RMSE metric, which is correlated +0.95 with predictors' ex-sample RMSE performance, but equally clearly not the case for the bootstrapped R-squared measure.
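
Those bootstrapped estimates sit inside each caret train object, so the comparison is straightforward to reproduce; a sketch, again reusing the fitted list from above:

    ## Bootstrapped RMSE for each model's chosen tuning parameters
    ## (caret's default resampling for train() is 25 bootstrap resamples)
    boot_rmse <- sapply(fits, function(fit) getTrainPerf(fit)$TrainRMSE)

    ## Test-period (ex-sample) RMSE for each algorithm
    test_rmse <- sapply(fits, function(fit)
      sqrt(mean((predict(fit, newdata = test_data) - test_data$Home.Margin)^2)))

    ## Correlation between the bootstrapped and ex-sample RMSEs
    cor(boot_rmse, test_rmse)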

Using the bootstrapped RMSE as the basis for selecting the preferred algorithm would, in hindsight, have been a moderately successful approach. Amongst the algorithms with the 10 best bootstrapped RMSEs - of which there are 12, due to ties - none is outside the top 20 on the MAPE metric for the Test period, and only 1 is outside the top 20 on the RMSE and R-squared metrics for the Test period.

The conclusion is that we'll be using the Projection Pursuit Regression algorithm to predict the margins for the purposes of simulating the remainder of the season.
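
In practice that last step amounts to little more than the following, assuming a data frame remaining_games (another invented name) holding the same regressor columns for the unplayed fixtures:

    ## Predicted Home team margins for the remaining home-and-away games,
    ## using the Projection Pursuit Regression model fitted earlier
    remaining_games$Predicted.Margin <- predict(fits[["ppr"]], newdata = remaining_games)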