The Predictability of Men’s AFL Crowds: Adding Weather, Temperature and Membership Numbers
Sometimes, data comes at you fast ...
In the past 24 hours I’ve been directed towards some game-by-game weather and temperature information for 2010 onwards (see Tweet below), and had a suggestion to include each team’s membership counts, which I’ve since sourced from the footyindustry website.
This much shorter blog post will review what happens when we add this data to our previous OLS model.
The new weather data provides, for each game, a description of the prevailing weather in one of 10 categories, three of which we combine as shown in the table that follows.
We see that, on average, attendance is highest at games where the weather is described as one of “Clear”, “Clear Night” or “Roof Closed”, and lowest for the very small number of games where the weather is described as “Thunderstorms”.
Though some of the results here are what you might expect - for example, rainy games tend to draw smaller crowds than mostly sunny games - there are some obvious interactions with venue that are influencing the numbers.
The MCG, for example, while hosting almost one-quarter of the games in the sample, includes less than 16% of those where the weather is described as “sunny”. In contrast, Carrara hosts less than 5% of the games, but almost 10% of those with sunny weather. Even more starkly, the Gabba, which hosts less than 6% of games, represents about 14% of games in “clear” weather.
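For anyone wanting to replicate that sort of tabulation, here’s a minimal sketch in Python, assuming a hypothetical games DataFrame with weather_desc and attendance columns. The particular three raw categories collapsed into one (“Clear”, “Clear Night” and “Roof Closed”) are my reading of the table above, so treat the mapping as an assumption rather than the definitive grouping.

```python
import pandas as pd

# Hypothetical games DataFrame: one row per game, with a raw weather description
# and the attendance. Column names are assumptions for illustration only.
# Assumed grouping: collapse "Clear Night" and "Roof Closed" into "Clear".
WEATHER_MAP = {
    "Clear Night": "Clear",
    "Roof Closed": "Clear",
}

games["weather_cat"] = games["weather_desc"].replace(WEATHER_MAP)

# Average attendance and game counts by (collapsed) weather category,
# mirroring the descriptive table discussed above.
summary = (games.groupby("weather_cat")["attendance"]
                .agg(["mean", "size"])
                .sort_values("mean", ascending=False))
print(summary)
```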
The new prevailing temperature data is provided to the nearest whole degree in the raw data, but in the table at left I’ve grouped that data into 5 degree Celsius blocks.
(I know that some of you hardy Victorian people in particular will scoff at my labelling 11 to 15 degrees as “Cool”, but I’m sure I can find an online source to defend my choice should I need to.)
Here too the interaction effects with venue are significant. The MCG, for example, has 35% of the games with a “cool” temperature, while Docklands has 43% of those with a “cold” temperature (but only 23% of all games in the sample). Subiaco has about 9% of all games in the sample, but 27% of those with a “hot” temperature.
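If you wanted to reproduce that banding, a minimal sketch using the same hypothetical games DataFrame might look like the following. Only the 11 to 15 degree “Cool” band comes from the text above; the other bin edges and labels are my assumptions.

```python
import pandas as pd

# Group whole-degree temperatures into 5-degree Celsius bands.
# Only the 11-15 "Cool" band is taken from the post; other edges and labels are assumptions.
edges = [-10, 0, 5, 10, 15, 20, 25, 30, 45]
labels = ["0 or below", "1-5", "6-10", "11-15 (Cool)", "16-20", "21-25", "26-30", "31+"]

games["temp_band"] = pd.cut(games["temperature"], bins=edges, labels=labels)

# Average attendance and game counts by temperature band
print(games.groupby("temp_band")["attendance"].agg(["mean", "size"]))
```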
Of course, when we include these variables in the regression we will control for venue (and time of day) effects, and obtain a better idea of their independent effect on attendance.
The last of the new data, club membership figures, appears in the chart below. In preparing the data for modelling, we attach a team’s membership count to each of its games in a particular season.
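In code, that attachment is just a season-level join. A minimal sketch, assuming a hypothetical memberships table with one row per team and season (all column names are illustrative):

```python
import pandas as pd

# memberships: hypothetical table with columns season, team, members (one row per team-season).
# Attach each club's season membership count to its home games and its away games respectively.
games = games.merge(
    memberships.rename(columns={"team": "home_team", "members": "home_members"}),
    on=["season", "home_team"], how="left",
)
games = games.merge(
    memberships.rename(columns={"team": "away_team", "members": "away_members"}),
    on=["season", "away_team"], how="left",
)

# Express memberships in thousands so coefficients read as "attendees per 1,000 members"
games["home_members_000s"] = games["home_members"] / 1_000
games["away_members_000s"] = games["away_members"] / 1_000
```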
THE MODEL
Newly included in this first block of coefficients are those related to the respective membership counts of the home and away teams.
We find that the home team membership figure is highly significant, whilst that for the away team is near zero and not significant. Roughly speaking, for every additional 1,000 home team members, we expect about 200 extra attendees.
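For concreteness, here is a heavily simplified sketch of fitting this kind of OLS model with statsmodels. The formula lists only a subset of the regressors discussed in this post, and every column name is an assumption, so this is a sketch of the approach rather than the actual model.

```python
import statsmodels.formula.api as smf

# A cut-down version of the attendance regression: categorical venue, day, start time,
# period and weather terms, plus favourite strength, temperature, and the new membership counts.
fit = smf.ols(
    "attendance ~ home_members_000s + away_members_000s"
    " + C(venue) + C(day_of_week) + C(start_time) + C(period)"
    " + C(weather_cat) + temperature + fav_margin",
    data=games,
).fit()

print(fit.summary())

# With membership measured in thousands, a home_members_000s coefficient of about 200
# corresponds to roughly 200 extra attendees per 1,000 additional home team members.
```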
As a result of including the membership counts, the other coefficients in the Home Team and Favourites blocks are all quite different from what they were in the previous model. Also, fewer of them are statistically significant, and those that are tend to be less significant.
The coefficients in the Opponent block are far less affected, however, because of the relatively small estimated impact of Away team membership figures.
Adding in home and away team membership in the way we have here explains about an additional 0.4% of the overall variability in attendance figures.
The next block of coefficients, which covers venue, day of week, start time, favourite strength and period, includes no new regressors, but the period variable now groups seasons slightly differently.
Period 1 spans just the 2010 and 2011 seasons, Period 2 covers 2012 to 2016, and Period 3 covers 2017 to 2019, with these splits designed to group seasons with broadly similar average attendance figures.
The coefficients for venues have changed, but the signs of all significant variables are the same, as is the general pattern of significance.
Those for the days of the week and time of day change more dramatically - the latter probably as a result of including the weather and temperature data - but the overall pattern of (in)significance also remains.
The impact of favourite strength is assessed as being of a similar magnitude - that is, about 50 people for every additional point of favouritism.
Attendance levels for the two later periods are generally about 3,000 to 3,500 lower than for the reference 2010 to 2011 period.
The third block of coefficients, which covers the impact of “special” days, includes values very similar to those we had in the previous model and also reveals a very similar pattern of statistical significance.
At the MCG, for example, ANZAC Day is still associated with larger crowds, Mothers’ Day with smaller crowds, the Queen’s Birthday Monday with larger crowds, Good Friday with largely unchanged crowds, Easter Saturday and Easter Sunday with smaller crowds, and Easter Monday with larger crowds.
The fourth and final block of coefficients includes those for the teams’ State of origin and current ladder position, as well as the month of the year in which the game was played, and the new weather and temperature variables.
Those related to the teams’ States are broadly similar to what we had in the previous model, while those related to ladder position and month are quite different, presumably due to the inclusion of weather and temperature data, which will be correlated with month. That said, the estimated incremental effects of ladder position and month are not all that different in the new model.
We now have, for example, that:
A clash between two teams in the Top 8, played in August, is now expected to draw an incremental 2,000 rather than 3,500 fans
The same clash, but where neither team is in the Top 8, is now expected to draw about 8,000 rather than 8,500 fewer fans (ie the coefficient for August)
The largest increases (still) come for games played in July or August and involving two teams in the Top 8
The largest decreases (still) come for games played in August or September and involving two teams that are outside the Top 8
Games where only the Away team is in the Top 8 are generally expected to draw crowds about 1,000 to 2,500 lower than games where only the Home team is in the Top 8. This figure leaps to nearer 4,000, however, if the game is in August. These numbers are also similar to those we calculated with the previous model.
The coefficients on the weather variables broadly make sense. Compared to when the weather is mostly clear, we expect:
About the same number of fans when the weather is sunny, overcast, or windy
About 1,300 more fans when the weather is clear
About 1,900 fewer fans when there is rain
About 2,300 fewer fans when there are thunderstorms
About 3,100 fewer fans when there is wind and rain
And, finally, the coefficient on the temperature variable tells us that expected attendance increases by about 120 fans for every 1 degree Celsius increase in temperature.
Combined, the weather and temperature variables add about 0.25% to the model’s explained variance.
We can summarise the effects of including our new variables as follows:
With neither membership nor weather and temperature data: 87.98% of variance explained
With membership but not weather or temperature data: 88.39% of variance explained
With weather and temperature but not membership data: 88.23% of variance explained
With both membership and weather and temperature data: 88.66% of variance explained
The combined increase is therefore about 0.7%, which is not as large as it might otherwise be because the base model already had proxies for membership (the home team variable) and for weather and temperature (the month and time of day variables).
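One way to produce that comparison is to fit the nested specifications and compare their R-squared values. The sketch below uses the same hypothetical column names as earlier, with base_formula standing in for the full set of existing regressors, so it illustrates the bookkeeping rather than reproducing the exact percentages above.

```python
import statsmodels.formula.api as smf

# Stand-in for the full existing model; the real model has many more terms.
base_formula = ("attendance ~ C(venue) + C(day_of_week) + C(start_time)"
                " + C(period) + fav_margin")

specs = {
    "base": base_formula,
    "plus membership": base_formula + " + home_members_000s + away_members_000s",
    "plus weather/temperature": base_formula + " + C(weather_cat) + temperature",
    "plus both": base_formula + " + home_members_000s + away_members_000s"
                                " + C(weather_cat) + temperature",
}

for name, formula in specs.items():
    r2 = smf.ols(formula, data=games).fit().rsquared
    print(f"{name:25s} R-squared = {r2:.4f}")
```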
The new model provides fitted attendance figures that are:
within 610 of the actual attendance 10% of the time
within 1,500 of the actual attendance 25% of the time
within 3,160 of the actual attendance 50% of the time
within 5,635 of the actual attendance 75% of the time
within 8,840 of the actual attendance 90% of the time
Lastly, if we chart the fitted and actual attendance levels we see that the new model is equally good in all three defined periods, and now does a little better with large-crowd games played on days other than “special” days.
Having created a model that explains close to 90% of the variability in attendance, I suspect we’re rapidly approaching (if we haven’t yet arrived at) the point where most of what is left is noise. On that basis, that’ll do for now.