Building a Score-by-Score Men's AFL Simulator: Part I

This past week, in between some pieces of client work, I’ve been coming up with a workable methodology for creating a score-by-score men’s AFL simulator that will be as faithful as possible to the actual scoring behaviour we’ve observed in recent seasons.

Over the next few blogs I’ll be describing the process of building this simulator which, let me stress immediately, is not yet finished. From what I’ve been able to create so far, I think the general approach I’m following is viable, but we’ll only see just how viable when it’s done.

Let’s start.

THE DATA

The heart of the approach is score progression data sourced from the AFLtables site for all games from 2001 to 2019. For our initial model we’ll use only the data for the home and away games of the last three seasons, which includes 594 games and 27,714 scoring events.

THE METHODOLOGY

STEP 1: PREDICT SCORING EVENTS

The first step is to create a predictive model that will provide an estimate of the probability that some scoring event - a home goal, a home behind, an away goal, an away behind, or no score at all - occurs in the next portion of the game.

To facilitate that we need to take the raw scoring progression data and discretise it into fixed length periods. Now the shortest time between any two scoring events in the historical data is nine seconds between two behinds for the Brisbane Lions against Melbourne in 2010. If we assume that a game runs for roughly 7200 seconds (ie 30 minutes x 60 seconds x 4 quarters), then nine seconds represents about 0.125% of a game.

We want to ensure that, at most, one scoring event occurs within a single defined period, so we choose 0.1% of a game as our period length, which means that we divide each game up into 1,000 equally-long periods. Within each of those periods we record whether there was a home score, an away score, or no score at all. With 594 games we have 594,000 such data points that can serve as the targets for our predictive models.
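As a concrete sketch, the discretisation step might look something like this (illustrative Python rather than the R used for the actual work; the event times, labels, and data format here are hypothetical, not the actual AFLtables structure):

```python
# Illustrative only: bucket a game's scoring events into 1,000 equal periods,
# recording at most one event per period.
GAME_LENGTH = 7200   # assumed nominal game length in seconds
N_PERIODS = 1000     # one period = 0.1% of a game

def discretise(events):
    """events: list of (seconds_elapsed, label) pairs for one game."""
    periods = ["None"] * N_PERIODS
    for t, label in events:
        idx = min(int(t / GAME_LENGTH * N_PERIODS), N_PERIODS - 1)
        periods[idx] = label
    return periods

sample = [(95, "Home Goal"), (310, "Away Behind"), (7180, "Home Behind")]
print(discretise(sample).count("None"))   # 997 of the 1,000 periods are empty
```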

But those models also need inputs, so what should we provide?

MODEL INPUTS

So far, I’ve opted for the following:

  • Bookmaker Expected Margin: the TAB bookmaker’s pre-game expected margin from the home team’s viewpoint

  • MoSH2020 Expected Home Score, Expected Away Score: the pre-game expected scores for the two teams according to the MoSH2020 System

  • MoSH2020 Expected Home Scoring Shots, Expected Away Scoring Shots: the pre-game expected scoring shot counts for the two teams according to the MoSH2020 System

  • Game Fraction: the proportion of the game that had been completed as at the end of the previous period

  • Previous Score Type: what happened in the immediately preceding period (a home goal, a home behind, an away goal, an away behind, or nothing). We include this in the hope that it will discourage the algorithms from providing high probability estimates for back-to-back scoring events, especially goals

  • Previous Lead: the home team’s lead at the end of the previous period

  • Previous Leader: a categorical recording whether the home team or the away team led at the end of the previous period, or whether the game was tied

  • Previous Scoring Rate, Previous Home Scoring Rate, Previous Away Scoring Rate: the average total points per period recorded up to the end of the previous period (and the decomposition into those for the home team, and for the away team)

  • Previous Scoring Shot Rate, Previous Home Scoring Shot Rate, Previous Away Scoring Shot Rate: defined equivalently to those above but using scoring shots rather than points

  • Previous Conversion Rate, Previous Home Conversion Rate, Previous Away Conversion Rate: the proportion of all scoring shots that have been registered as goals up to the end of the previous period (and the individual figures for the home and for the away team)

  • Previous Home Score Run, Previous Away Score Run: the number of consecutive points scored by the home/away team without any score being recorded by the away/home team, as at the end of the previous period. Note that the end of a quarter resets the run to zero.

  • Previous Home Scoring Shot Run, Previous Away Scoring Shot Run: defined equivalently to those above but using scoring shots rather than points

  • Previous Bookie Expected Progressive Margin: the expected home team lead as at the end of the previous period, assuming a linear relationship between time and lead (ie if 30% of the game is completed then the expected lead is 30% of the final lead)

  • Previous MoSH2020 Expected Home Score, Previous MoSH2020 Expected Away Score: the expected home team and away team scores as at the end of the previous period, assuming a linear relationship between time and scoring

  • Previous MoSH2020 Expected Home Scoring Shots, Previous MoSH2020 Expected Away Scoring Shots: defined equivalently to those above but using scoring shots rather than points

  • Previous Time Since Last Scoring Event: how long it had been, at the end of the previous period, since the last recorded scoring event by either team. This is also included partly as a means of suppressing scoring events too proximate in time

  • Previous Home Score Last 10PC, Previous Away Score Last 10PC, Previous Home Scoring Shots Last 10PC, Previous Away Scoring Shots Last 10PC: these record how many points or scoring shots each of the teams has registered in the most-recent 10% of the game, as at the end of the previous period. If less than 10% of the game has been played they are simply the scores or scoring shots so far.

  • Previous Home Score Last 25PC, Previous Away Score Last 25PC, Previous Home Scoring Shots Last 25PC, Previous Away Scoring Shots Last 25PC: these are defined as above but look at the most-recent 25% of the game, instead.
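To make a few of those definitions concrete, here is an illustrative sketch (again Python, not the R used in practice) of how some of the simpler "previous period" regressors could be computed from the discretised outcome sequence; the function and key names are mine, not the model's:

```python
# Illustrative only: compute a handful of the regressors described above
# from a list of per-period outcomes ("Home Goal", "Home Behind",
# "Away Goal", "Away Behind", or "None").
POINTS = {"Home Goal": 6, "Home Behind": 1, "Away Goal": 6, "Away Behind": 1}

def features_at(periods, i):
    """Regressor values available at the start of period i (0-indexed)."""
    home = sum(POINTS[p] for p in periods[:i] if p.startswith("Home"))
    away = sum(POINTS[p] for p in periods[:i] if p.startswith("Away"))
    lead = home - away
    return {
        "Game Fraction": i / 1000,
        "Previous Score Type": periods[i - 1] if i > 0 else "None",
        "Previous Lead": lead,
        "Previous Leader": "Home" if lead > 0 else "Away" if lead < 0 else "Tied",
        "Previous Scoring Rate": (home + away) / i if i > 0 else 0.0,
    }

periods = ["Home Goal", "None", "Away Behind", "Home Behind"]
print(features_at(periods, 3))
```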

The key question is, to what extent do these variables provide algorithms with all of the information they need to come up with reasonable estimates of scoring event probabilities for the next period?

If you’ve suggestions for other inputs to try, please let me know.

MODEL SELECTION

We’ll be using the caret package in R, which provides us with a large number of candidate algorithms suitable for fitting when the target variable is a multi-class categorical like ours.

Chosen Algorithms

We’ll use the following algorithms: kknn, knn, C50, avNNet, xgbLinear, xgbTree, rpart, ctree.

(Ideally I’d have included cforest or rf, but these are very slow to fit with large sample sizes. I will probably investigate these and report back in a later blog.)

Train vs Test Samples

The available data will be randomly split into a 5% training sample and a 95% testing sample. Having a relatively small training sample facilitates faster model fitting and means that, where necessary, evaluating the fitted model on the entire data set is not all that different to evaluating it on the test data alone.

That random split preserves the significant class imbalance in the data - the scoring outcome for more than 95% of the observations is “None”, and only a little over 2% each are “Home Score” and “Away Score” - but I’ve chosen not to take any steps to address this for now. In a future blog we might come back to this and, for example, try SMOTE sampling.
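The split itself is done with caret in R; purely to illustrate the idea, a stratified 5/95 split that preserves class proportions could be sketched like this (toy Python, with made-up class counts):

```python
import random

def stratified_split(labels, train_frac=0.05, seed=42):
    """Return (train_idx, test_idx) with each class contributing
    roughly train_frac of its rows to the training sample."""
    rng = random.Random(seed)
    by_class = {}
    for i, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(i)
    train = set()
    for idx in by_class.values():
        rng.shuffle(idx)
        train.update(idx[:round(len(idx) * train_frac)])
    test = [i for i in range(len(labels)) if i not in train]
    return sorted(train), test

labels = ["None"] * 950 + ["Home Score"] * 25 + ["Away Score"] * 25  # toy counts
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```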

Performance Metric for Parameter Tuning

The metric used for tuning will be accuracy, mostly for reasons of convenience. Later we might investigate using some probability score for this purpose.

Performance Metric for Model Selection

The metric used for model selection will be the multi-class Brier Score, measured using only the test data. With it, we narrowly choose the xgbTree model over ctree and avNNet.
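A multi-class Brier Score of the usual form - the mean over observations of the squared distance between the predicted probability vector and the one-hot actual outcome, lower being better - can be sketched like this (illustrative Python; the toy prediction at the bottom is made up):

```python
# Illustrative multi-class Brier score: classes absent from a prediction
# dict are treated as having probability zero.
def brier_score(probs, actuals, classes):
    total = 0.0
    for p, actual in zip(probs, actuals):
        total += sum((p.get(c, 0.0) - (1.0 if c == actual else 0.0)) ** 2
                     for c in classes)
    return total / len(actuals)

classes = ["Home Goal", "Home Behind", "Away Goal", "Away Behind", "None"]
preds = [{"None": 0.96, "Home Goal": 0.02, "Away Goal": 0.02}]
print(brier_score(preds, ["None"], classes))   # small, since "None" occurred
```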

There are some other interesting metrics that we can calculate to better understand the behaviour of our chosen algorithm. For example, if we sum the probabilities for home scoring events and for away scoring events across all 1000 periods in each game, we can derive an estimate for each game of the expected number of scoring shots for the home and for the away teams. Those estimates can then be compared with the actual number of scoring shots recorded.
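That diagnostic amounts to nothing more than summing probabilities across periods, as in this sketch (illustrative Python with hypothetical, flat per-period probabilities):

```python
# Illustrative only: a model's implied expected scoring shot counts for one
# game are the sums of its per-period scoring-event probabilities.
def expected_shots(period_probs):
    """period_probs: one dict per period with keys 'Home Goal', 'Home Behind',
    'Away Goal', 'Away Behind' (probabilities)."""
    home = sum(p["Home Goal"] + p["Home Behind"] for p in period_probs)
    away = sum(p["Away Goal"] + p["Away Behind"] for p in period_probs)
    return home, away

# Hypothetical flat 1.3% home / 1.1% away scoring-event probability per period
flat = [{"Home Goal": 0.007, "Home Behind": 0.006,
         "Away Goal": 0.006, "Away Behind": 0.005}] * 1000
print(expected_shots(flat))   # roughly 13 home and 11 away expected shots
```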

When we do this we find that the xgbTree algorithm, on average, underestimates home team scoring shots by about 0.9 shots per game, and overestimates away team scoring shots by about 0.1 shots per game. On average, then, home teams tend to win by about 1 scoring shot more than the xgbTree algorithm would estimate.

The equivalent figures for ctree are home teams underestimated by 1.1 scoring shots, and away teams overestimated by 0.1 scoring shots, and for avNNet are home teams underestimated by 0.6 scoring shots, and away team overestimated by 0.5 scoring shots.

STEP 2: PREDICT CONVERSION

So far, with the xgbTree model, we have the ability to estimate the likelihood that the next period produces a scoring shot for the home team, for the away team, or for neither.

What we need next are models - one for the home team and one for the away team, because, historically, there is a meaningful difference in their base conversion rates - that estimate the probability that a scoring shot, conditional on one having been determined to occur, is a goal rather than a behind.

For this purpose we proceed as we did in Step 1, but include only those periods that include a home team scoring event (for the home team model) or an away team scoring event (for the away team model).

The target variable here takes on only two values: “Goal” or “Behind”.

We use exactly the same regressors as we did in Step 1.

MODEL SELECTION

With a binary target variable we have an even larger pool of available algorithms, but we’ll not avail ourselves of any others for now.

Chosen Algorithms

We’ll use the same algorithms: kknn, knn, C50, avNNet, xgbLinear, xgbTree, rpart, ctree.

Train vs Test Samples

We’ll also use the same random 5/95 split of the available data. Note that here there is no significant issue with class imbalances because goals and behinds occur with almost equal frequency.

Performance Metric for Parameter Tuning

Accuracy will again be the metric used for parameter tuning.

Performance Metric for Model Selection

The metric used for model selection will be the area under the curve (AUC), calculated using only the test data and with the algorithm provided in the ROCR package.
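The AUC calculation itself uses the ROCR package in R; conceptually, it is the Mann-Whitney rank statistic - the probability that a randomly chosen "Goal" receives a higher predicted probability than a randomly chosen "Behind" - which can be sketched from scratch like this (illustrative Python with made-up scores):

```python
# Illustrative AUC via the rank (Mann-Whitney) formulation; ties between a
# positive and a negative score count as half a "win".
def auc(scores, labels, positive="Goal"):
    pos = [s for s, l in zip(scores, labels) if l == positive]
    neg = [s for s, l in zip(scores, labels) if l != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.52, 0.50, 0.55, 0.48]          # hypothetical predicted probabilities
labels = ["Goal", "Behind", "Goal", "Behind"]
print(auc(scores, labels))   # 1.0 here: every Goal outranks every Behind
```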

For the home team scoring events, none of the models produced an AUC above about 0.504, and for the away team scoring events, none produced an AUC over 0.516.

These are barely, and perhaps only coincidentally, above the 0.5 you would expect from random guessing.

If we include MoSH2020’s offensive and defensive team ratings as regressors, we increase some of the models’ AUCs, but by only a tiny amount.

The overwhelming conclusion is that the scoring progression data provides little to no help in predicting whether a given scoring shot is more or less likely to be a goal. Adding information about the teams’ offensive and defensive abilities provides little additional predictive ability. If there is any, the effect size is very small.

Given that, there’s no compelling reason to use anything other than a fixed probability for all home team and for all away team scoring shots. Across the three years in the sample, home teams converted 51.9% of their scoring shots, and away teams 53.1%. Those are the values we’ll use for the simulations.

THE SIMULATIONS

Simulation 1

Next, we’ll use our scoring shot (xgbTree) and conversion (fixed probabilities) models to simulate games between teams of varying ability.

Specifically, we’ll simulate 5,000 games where the expected scoring shot values for each replicate are chosen, at random, from the actual home and away team scoring shot data for games played across the period 2017 to 2019. In other words, each replicate will use as its expectations the actual home and away team scoring shot results from one of the games in the sample.

This expected scoring shot data will be converted to expected scoring by assuming a conversion rate of 51.9% for the home teams, and 53.1% for the away teams. The bookmaker expected margin will then be the expected difference between the home and away team scores.
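Since a goal is worth 6 points and a behind 1, that conversion is straightforward: with conversion rate c, each scoring shot is worth 6c + (1 - c) points on average. As a quick illustrative check in Python:

```python
# Each scoring shot is worth 6 points with probability c (a goal)
# and 1 point with probability 1 - c (a behind).
def expected_score(shots, conversion):
    return shots * (6 * conversion + (1 - conversion))

home = expected_score(27, 0.519)   # e.g. 27 expected home scoring shots
away = expected_score(20, 0.531)   # e.g. 20 expected away scoring shots
print(round(home - away, 1))       # implied expected margin, about 24 points
```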

Each simulation will proceed as follows:

Game Fraction 1: use the xgbTree model to provide estimates of a home scoring shot, an away scoring shot, or no scoring shot at all in the first 1/1000th of the game. Use a random draw from a uniform distribution on the interval (0,1) and cutoffs based on the xgbTree outputs to determine if a home team or an away team scoring shot has occurred (or neither). If a scoring shot is deemed to have occurred determine whether it is a goal or a behind using a random draw from a uniform distribution on the interval (0,1) and cutoffs based on the historical conversion rates for the home and away teams provided earlier.

In other words, for example, if the xgbTree model provides a 2.1% estimate of a home team scoring shot, and a 1.8% estimate of an away team scoring shot, and the first draw from the uniform distribution produces a value less than or equal to 0.021, then a home team scoring event has occurred. Treat that as a goal if the second draw from the uniform distribution has value 0.519 or lower. Otherwise, treat the scoring event as a home team behind.

Next, update all of the regressor variables based on the event that just occurred.

Game Fraction 2: repeat the same steps as in Game Fraction 1.

...

Game Fraction 1000: repeat the same steps as in Game Fraction 1.
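Put together, one replicate of that protocol looks like this sketch (Python for illustration only; shot_probs is a stand-in returning hypothetical flat probabilities, where the real thing would call the fitted xgbTree model with all of the regressors, updated after every period):

```python
import random

HOME_CONV, AWAY_CONV = 0.519, 0.531   # historical conversion rates

def shot_probs(state):
    """Stand-in for the xgbTree model's per-period estimates; it ignores
    the game state here, which the real model would not."""
    return {"Home": 0.0135, "Away": 0.0115}   # hypothetical flat values

def simulate_game(seed=None):
    rng = random.Random(seed)
    score = {"Home": 0, "Away": 0}
    for fraction in range(1000):
        probs = shot_probs(score)
        u = rng.random()                      # first uniform draw: who scores?
        if u <= probs["Home"]:
            team, conv = "Home", HOME_CONV
        elif u <= probs["Home"] + probs["Away"]:
            team, conv = "Away", AWAY_CONV
        else:
            continue                          # no scoring event this period
        # second uniform draw: goal or behind?
        score[team] += 6 if rng.random() <= conv else 1
    return score

print(simulate_game(seed=1))
```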

The chart below is a scatter plot based on the 5,000 simulation replicates following this protocol.

There is, as we would hope, a positive relationship between the simulated margins and the pre-game expected margins, although the relationship is not one-to-one. Ideally, teams that were expected to win by X points would, on average, win by X points, but here they do not.

In fact, across the 5,000 replicates, the home team margin is, on average, only about 73% of the pre-game expected margin. By way of reference, across the three-year period we’ve used for the modelling, home teams’ final margins were about 92% of the TAB bookmaker’s expectations. So, not the 100% the bookmaker would’ve hoped for, but closer to it than xgbTree gets.

The average conversion rates across the 5,000 replicates were, however, at 51.9% for home teams, and 53.1% for away teams, very close to what we expected.

One other interesting difference between the simulated and actual results is the correlation between home team and away team scores. For the actual data the correlation is -0.32 while for the simulated data it is -0.56. This results in more home team blowouts in the simulated data than in the actual data, which you can detect in the chart above by looking at the relative number of points above compared to below the regression line for expected margins over 50.

I’m not yet sure what is causing this, but it might be that the xgbTree model is prone to overstating the effects of home-team momentum which, if true, will be a huge irony given my views on that topic. Anyway, something more for me to investigate.

Simulation 2

We’ll do one final simulation for today that will bring into sharper focus the specific issues with the xgbTree outputs.

For this run of 1,000 replicates we’ll fix the expected home scoring shots at 27 and the expected away scoring shots at 20.

The results are as follows:

  • Expected Margin: 24 points

  • Actual Average Margin: 15.4 points

  • Expected Conversion Rates: Home 51.8%/Away 52.9%

  • Actual Conversion Rates: Home 51.9%/Away 53.1%

  • Expected Scoring Shots: Home 27/Away 20

  • Actual Average Scoring Shots: Home 25.2/Away 20.6

So, the issue with lower-than-expected home team margins comes down to xgbTree’s generating too few scoring shots for home teams, and too many for away teams. This is the same pattern that we saw in the modelling phase, though more pronounced here.

EXAMPLE SIMULATION

To finish, just to give you a concrete idea about what we can do with the simulation model as it stands, here’s a simulation of a game where the home team was expected to register 30 scoring shots, and the away team 21.

Whilst it looks, for the most part, like a fairly reasonable score progression, there are three goals in about 24 seconds in that late Quarter 3 flurry, which is clearly unrealistic.

CONCLUSION

There’s obviously some more work to be done on this, but I was keen to get something posted in the hope that it might spark some conversations and encourage some of you to offer your thoughts and suggestions.

In the meantime, I hope you and all your families are doing okay and keeping well in this weird period of time in our history.