We Need to Talk About MoSHBODS ...
Last year’s men’s season results for MoSHBODS and MoSSBODS - as forecasters and as opinion sources for wagering - were at odds with what had gone before.
Other analyses have suggested that the MoS twins might have been a bit unlucky in the extent to which 2024 differed from bookmaker expectations, and I’ve never been one for knee-jerk reactions to single events. Nonetheless, the performance has made me think more deeply about the algorithms underpinning the two Rating Systems, more details on which were provided in this blog from 2020, and in the blogs to which it links.
The natural enemy of the predictive modeller is overfitting, and the current versions of both MoSSBODS and MoSHBODS risk falling afoul of that folly because they:
Split the history of men’s footy into three eras, and allow separate model parameters for each
Split each season into sections for the purposes of applying different k’s (which are used for updating ratings based on the most recent results) to each section
That permissiveness lends itself to good fits to the underlying data, but risks overfitting: it could easily produce models that describe history well, but do so at the expense of describing the future poorly.
Another feature of the extant MoS models is that they output ratings in terms of current-season points (MoSHBODS) or scoring shots (MoSSBODS), but average team scores have varied markedly across the history of men’s AFL, which means that a 10-point above average team in 1910 was, relatively speaking, more likely to defeat an average team of its day than was a 10-point above average team in 2024. So, if we want to be able to even loosely compare teams across the entirety of the men’s competition, we need to find a way to incorporate these historical differences in scoring.
THE NEW MOSHBODS
To address the issues just raised, we’re going to rebuild MoSHBODS by:
Removing era-specific parameters - that is, we’ll be using the same model parameters for all seasons from 1897 onwards
Using a flexible functional form to determine what k to use for each home-and-away round. This will produce smooth changes in the k used from one round to the next, rather than the step-function changes used previously, which allocated one k to a block of rounds and then a different k to the next block
Using a fixed k for all Finals
Converting scores and scoring shots to standardised values - that is, (number - mean)/SD - where the mean is an average per team across some fixed time period. So, for example, if a team scored 80 points in an era where the average was 75 points and the standard deviation around that average 20 points, the standardised score would be (80-75)/20 = 0.25. Similarly, if a team registered 22 scoring shots in an era where the average was 25 scoring shots and the standard deviation 5 scoring shots, the standardised scoring shots would be (22-25)/5 = -0.6.
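For concreteness, here’s a minimal Python sketch of that standardisation, using the figures from the example above (the function name is mine, not part of MoSHBODS):

```python
def standardise(value, mean, sd):
    """Convert a raw score or scoring shot count to standard deviations from the mean."""
    return (value - mean) / sd

# Worked examples from the text
print(standardise(80, mean=75, sd=20))  # points:        0.25
print(standardise(22, mean=25, sd=5))   # scoring shots: -0.6
```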
Everything else about MoSHBODS will be the same, including that it will still provide an offensive and a defensive rating for each team, with the combined rating defined as their sum. It’s also still true that the average combined rating will always be zero.
The approach used reduces the number of free parameters from 16 x 3 eras = 48 to just 16, and they are:
Alpha_P1, Alpha_P2, Alpha_P3, and Alpha_Finals - these are used to determine the k value for a round. For home and away rounds we use k = Alpha_P2 + Alpha_P1 * exp(Alpha_P3 * (Round Number/Max Round Number)), and for Finals we use k = Alpha_Finals. (Note that Max Round Number includes Finals.) A code sketch of this and the following calculations appears after this list.
Adjustment_Factor - this determines how the scoring and scoring shot data, in standard deviation form, are combined into an adjusted score. The final adjusted score is: Adjustment_Factor x Standardised Scoring Shots + (1 - Adjustment_Factor) x Standardised Points Score. It’s best not to think too hard about what units this adjusted score should have; let’s just say they’re in standard deviations.
Carryover - this determines what fraction of the end-of-season ratings, offensive and defensive, is carried forward to the next season
Games_Before_Average - this determines the number of games over which the average adjusted overperformance of a team at a given venue (after adjusting for relative abilities) is dragged towards zero before only the actual average adjusted overperformance is used. The VPV used for a team playing in its home region is the Average Adjusted Excess Performance at the Venue times (Games Played at the Venue / Games_Before_Average), where Games Played at the Venue is capped at Games_Before_Average. In essence, we’re assuming that the Average Adjusted Excess Performance in the “missing” games between Games Played at the Venue and Games_Before_Average is zero. Where a team is playing outside its home region, we add an amount equal to (1 - Games Played at the Venue / Games_Before_Average) times Out_Of_Region_Default_VPV (again ensuring that the ratio never exceeds one); in this case we’re assuming that the Average Adjusted Excess Performance in those missing games is equal to the Out_Of_Region_Default_VPV. Note that all calculations are performed within the historical window defined by Days_To_Average_for_VPV.
(The Average Adjusted Excess Performance is the actual Average Excess Performance multiplied by Mean_Reg.)
Days_To_Average_for_Score - this determines the number of days that should be included in the calculation of average team score and average team scoring shots
Mean_Reg - this determines what proportion of the actual average overperformance of a team at a given venue is used as the input to its Venue Performance Value
Days_To_Average_for_VPV - this determines how much history is included in the calculation of Venue Performance Values
Final_VPV_Fraction_Same_State, Final_VPV_Fraction_Diff_State, Grand_Final_VPV_Fraction_Same_State, Grand_Final_VPV_Fraction_Diff_State - these determine the multiplier that should be applied to Venue Performance Values for Finals and Grand Finals
Out_Of_Region_Default_VPV - this determines the base adjustment (in standard deviations) that should be made when a team is playing at a venue not in its region
Total_Multiplier_HA, Total_Multiplier_Final - these are used to adjust the raw forecasts of team scores in home and away and in Finals fixtures. In making the adjustments, the forecast margins are preserved and only the forecast team scores and total are altered.
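To make the mechanics concrete, here is a minimal Python sketch of the k schedule, the adjusted-score blend, and the VPV shrinkage just described. The function names are mine, the fitted Alpha_P1 to Alpha_P3 values are only shown in the charts below (so they’re left as inputs), and this is an illustration of the formulas rather than the actual MoSHBODS code.

```python
import math

def k_for_round(round_number, max_round_number,
                alpha_p1, alpha_p2, alpha_p3, alpha_finals, is_final=False):
    """Update weight k: a smooth function of round number for home and away
    rounds, and a single fixed value for Finals."""
    if is_final:
        return alpha_finals
    return alpha_p2 + alpha_p1 * math.exp(alpha_p3 * (round_number / max_round_number))

def adjusted_score(std_points, std_scoring_shots, adjustment_factor):
    """Blend standardised scoring shots and standardised points."""
    return (adjustment_factor * std_scoring_shots
            + (1 - adjustment_factor) * std_points)

def venue_performance_value(avg_excess, games_at_venue, games_before_average,
                            mean_reg, out_of_region_default_vpv, in_home_region):
    """Venue Performance Value, shrunk according to games played at the venue.

    avg_excess is the team's actual average excess performance at the venue
    (in standard deviations) within the Days_To_Average_for_VPV window.
    """
    adjusted_excess = mean_reg * avg_excess  # Mean_Reg regularisation
    weight = min(games_at_venue, games_before_average) / games_before_average
    if in_home_region:
        # "missing" games up to Games_Before_Average are assumed to have
        # zero excess performance, dragging the VPV towards zero
        return weight * adjusted_excess
    # out of region, the missing games are instead assumed to sit at the
    # out-of-region default
    return weight * adjusted_excess + (1 - weight) * out_of_region_default_vpv
```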
For MoSHBODS, the objective function to be optimised is the all-time mean absolute error in the margin predictions, where margin error is calculated in terms of actual margin and not standardised margin (ie after we convert our forecast margin in standard deviations back into a margin in points). This means that, unlike in previous MoSHBODS optimisations, the margin error in a game from 1897 is weighted equally to one from 2024.
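As a small sketch of that objective (my own illustration; points_sds stands in for whatever per-game scaling converts a standardised margin back into points for the relevant era):

```python
def margin_mae(forecast_std_margins, actual_margins, points_sds):
    """All-time MAE of margin forecasts, converting each forecast from
    standard deviations back into points before taking the error."""
    errors = [abs(actual - std * sd)
              for std, actual, sd in zip(forecast_std_margins, actual_margins, points_sds)]
    return sum(errors) / len(errors)
```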
The estimated optimal values of Alpha_P1, Alpha_P2, and Alpha_P3 for MoSHBODS mean that the k’s for each home and away round are as shown in the chart below.
That makes the k’s relatively flat from about the point where 60% of the season is completed. The optimal k value for Finals, Alpha_Finals, has been estimated at 0.059, which is a couple of percentage points lower than the base of this chart.
Best for Adjustment_Factor is 0.5, which means we take a simple average of the standardised score in Points and the standardised score in Scoring Shots to arrive at an adjusted score (ie Adjusted Score = 0.5 x Standardised Score in Points + 0.5 x Standardised Score in Scoring Shots)
For Carryover, we choose 0.65, which means that teams carry about two-thirds of their offensive and defensive ratings through from the end of the previous season to the start of the next.
The optimum for Games_Before_Average is a surprisingly high 65 games at a venue, which means that estimates are dragged towards zero (or the out of region VPV) for quite a long time, in most cases even for teams’ home grounds.
Days_To_Average_for_Score is optimal at 5.5 years.
Best for Mean_Reg is 0.605, meaning that about 60% of the actual average excess performance is included in the VPV calculation
Optimal for Days_To_Average_for_VPV is 8.5 years, meaning that any result at a given venue within this window is included in the Venue Performance Value calculation
Respectively, the optimal values of Final_VPV_Fraction_Same_State, Final_VPV_Fraction_Diff_State, Grand_Final_VPV_Fraction_Same_State, and Grand_Final_VPV_Fraction_Diff_State are 0.75, 1.75, 3.5, and 0. That means, in Finals other than Grand Finals that involve teams from the same State, the VPV to use is 0.75 times the base VPV, and in Finals other than Grand Finals that involve teams from different States, it is 1.75 times the base VPV. For Grand Finals, we use 3.5 times the base VPV for same-State teams, and no VPV at all for teams from different States.
The optimal value for Out_Of_Region_Default_VPV is -0.26 standard deviations, which is about 6.3 points when the standard deviation of team scores is at its current level of 24.4 points.
Total_Multiplier_HA and Total_Multiplier_Final are used to adjust the raw team scores implied by the team ratings and Venue Performance Values. The calculated optima are 0.775 and 0.865. So, for example, if the base forecasts came out with a 72 to 65 scoreline for a home and away game, that’s a total of 137. That needs to become 0.775 x 137, or 106.175, so the total adjustment is 137 - 106.175 = 30.825. We apportion half of that adjustment to each team and therefore arrive at 72 - 15.4125 = 56.5875 and 65 - 15.4125 = 49.5875, thus preserving the original margin forecast and obtaining the desired total.
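Here’s that worked example as a short Python function (my own sketch of the arithmetic just described):

```python
def apply_total_multiplier(home_score, away_score, multiplier):
    """Rescale forecast team scores so the total shrinks by `multiplier`
    while the forecast margin is preserved."""
    total = home_score + away_score
    adjustment = total - multiplier * total  # amount to remove from the total
    per_team = adjustment / 2                # split evenly, preserving the margin
    return home_score - per_team, away_score - per_team

# Worked example from the text: a 72-65 home and away forecast
print(apply_total_multiplier(72, 65, 0.775))  # (56.5875, 49.5875) -> total 106.175
```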
THE NEW MOSSBODS
The old MoSSBODS, you might recall, differed from the old MoSHBODS in that it looked only at Scoring Shots and not points. That is still the case with the new MoSSBODS, which uses exactly the same parameters as MoSHBODS, but with Adjustment_Factor set equal to 1 (thus ensuring it looks only at Scoring Shots).
Unlike MoSHBODS - and unlike previous years - the objective function that MoSSBODS now looks to minimise is the mean absolute error in Totals. We’ve therefore made it a Totals expert at the expense of minimising the MAE of its margin forecasts.
For it, the optimal values of some of the free parameters differ and are described below:
The optima for Alpha_P1, Alpha_P2, and Alpha_P3 mean that the k’s for each home and away round are as shown in the chart below.
It is even flatter than the chart for MoSHBODS and sees k’s ranging from about 0.655 to 0.06 across the home and away season. The optimal k value for Finals, Alpha_Finals, has been estimated at 0.016, which is over four percentage points lower than the base of the chart for the home and away season.
For Carryover, we choose 0.925, which means that teams carry just over 90% of their offensive and defensive ratings through from the end of the previous season to the start of the next.
The optimum for Games_Before_Average is a surprisingly high 65 games at a venue - the same as for MoSHBODS.
Days_To_Average_for_Score is optimal at 4.5 years, which is 1 year less than for MoSHBODS.
Best for Mean_Reg is 0.605 - also the same as for MoSHBODS.
Optimal for Days_To_Average_for_VPV is 8.5 years - also the same as for MoSHBODS.
Lastly, the optimal values of Final_VPV_Fraction_Same_State, Final_VPV_Fraction_Diff_State, Grand_Final_VPV_Fraction_Same_State, Grand_Final_VPV_Fraction_Diff_State are all the same as for MoSHBODS.
The optimal value for Out_Of_Region_Default_VPV is -0.26 - again the same as for MoSHBODS.
Best for Total_Multiplier_HA and Total_Multiplier_Final are 0.735 and 0.795, both of which are a little lower than their MoSHBODS equivalents.
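Collecting the fitted values quoted above, the two new systems differ in only a handful of parameters. Here’s a summary in Python dict form (the Alpha_P1 to Alpha_P3 values are only shown in the charts, so they’re omitted):

```python
MOSHBODS = {
    "Adjustment_Factor": 0.5,            # equal blend of points and scoring shots
    "Carryover": 0.65,
    "Games_Before_Average": 65,
    "Days_To_Average_for_Score": 5.5,    # years
    "Mean_Reg": 0.605,
    "Days_To_Average_for_VPV": 8.5,      # years
    "Final_VPV_Fraction_Same_State": 0.75,
    "Final_VPV_Fraction_Diff_State": 1.75,
    "Grand_Final_VPV_Fraction_Same_State": 3.5,
    "Grand_Final_VPV_Fraction_Diff_State": 0.0,
    "Out_Of_Region_Default_VPV": -0.26,  # standard deviations
    "Alpha_Finals": 0.059,
    "Total_Multiplier_HA": 0.775,
    "Total_Multiplier_Final": 0.865,
}

# MoSSBODS shares everything with MoSHBODS except the following
MOSSBODS = {**MOSHBODS,
    "Adjustment_Factor": 1.0,            # scoring shots only
    "Carryover": 0.925,
    "Days_To_Average_for_Score": 4.5,
    "Alpha_Finals": 0.016,
    "Total_Multiplier_HA": 0.735,
    "Total_Multiplier_Final": 0.795,
}
```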
PERFORMANCE
Given that the 2024 versions of MoSHBODS and MoSSBODS had three times the number of free parameters, it’s probably unrealistic to expect that either of these new algorithms will outperform their 2024 analogues on Margin MAE.
In the table at right we simply count the number of seasons for which a given algorithm - New or Old MoSHBODS, or New or Old MoSSBODS - was best on a given metric, be it Margin MAE, Total MAE, Margin RMSE, or Total RMSE.
The top half of the table records the all-time results and shows that:
ALL TIME
the old algorithms were overall superior on Margin MAEs (with the old MoSHBODS algorithm best of all)
the new algorithms are overall superior on Total MAEs (with the new MoSSBODS algorithm, which has been optimised for forecasting Totals, best of all)
the old MoSHBODS algorithm is best on Margin RMSEs, although the new MoSHBODS algorithm, with far fewer free parameters, is not far behind
the new algorithms are overall superior on Total RMSEs (with the new MoSSBODS algorithm, again, best of all)
FROM 1997 TO 2024
(In the previous versions of the MoS models, the final era was defined as from 1997 to 2024, and more emphasis was placed on it in choosing optimal parameters)
the old algorithms remain overall superior on Margin MAEs (with the old MoSHBODS algorithm best of all)
the new algorithms remain overall superior on Total MAEs (with the new MoSSBODS algorithm, as we’d hope, best of all)
the new MoSHBODS algorithm is best on Margin RMSEs, slightly ahead of the old MoSHBODS algorithm
the new algorithms remain overall superior on Total RMSEs (with the new MoSSBODS algorithm, narrowly, best of all)
Taking all of those results into consideration, the new MoS models seem to me to be fit for purpose and less likely to be overfit.
As well, they allow for better comparison of teams from different eras, albeit relative to the teams that were faced in that era and not relative to teams from other eras.
HIGHEST AND LOWEST RATED TEAMS OF ALL TIME
We’ll finish with two comparisons, firstly looking at the 30 highest and lowest rated teams of all time, based on their end-of-season ratings, according to the old and the new MoSHBODS algorithms (noting that the new algorithm’s ratings are in standard deviations whereas the old ratings are in points).
First, consider the 30 highest rated teams of all time according to the old and new MoSHBODS Rating Systems.
Whilst there is a fair degree of agreement between the two lists, with 19 teams appearing on both, the 11 differing teams highlight how the new MoSHBODS algorithm gives teams from earlier, lower-scoring eras a better chance of making the cut. Conversely, the Hawthorn teams from the late 1980s and 1990 miss out under the new algorithm because they come from a high-scoring era, when higher ratings - measured in points - were easier to achieve.
Amongst the 11 new teams that make the list, nine are from before the 1980 season and seven pre-date World War II.
Lastly, let’s look at the 30 lowest rated teams under the two algorithms.
Again, there is a high degree of agreement, with 19 teams common to both lists. The teams dropping out are all from 1985 or later, and the teams taking their place under the new algorithm are all from 1897 to 1963.
FINAL 2024 RATINGS
Lastly, let’s look at how different the new and old algorithms are by seeing what they produce for the final 2024 Team Ratings.
First we’ll compare new and old MoSHBODS
They are broadly similar in terms of team rankings with only Collingwood and Sydney differing by more than two spots.
Next, we’ll compare new and old MoSSBODS, where we might expect greater divergence because of the change in objective function from Margin MAE to Total MAE (and the adoption of a whole-of-history approach).
As expected, there is larger divergence, with six teams moving more than two spots in the rankings, but there is still a relatively high level of overall agreement.
We can also look at the correlations between the underlying Ratings.
For the defensive ratings, the highest correlations are between the Old MoSHBODS and Old MoSSBODS, and between the New MoSHBODS and New MoSSBODS. The lowest is between the Old and New MoSSBODS, but is still +0.965.
For the offensive ratings, the highest correlation is between the Old MoSHBODS and Old MoSSBODS and the lowest are between the Old MoSHBODS and the New MoSSBODS, and between the Old and New MoSSBODS, but these are both greater than +0.9.
The pattern for the combined ratings broadly mirrors that for the defensive ratings.
CONCLUSION
I like the new algorithms but recognise they’ve not been tested in a live situation. I look forward to testing them in 2025 and welcome any comments, thoughts or suggestions you might have.
Some of the things still to do are to map MoSHBODS (and maybe MoSSBODS) margin forecasts to probability estimates, and to create a new MoSHPlay, combining the new MoSHBODS with player data.