Matter of Stats

Scoring Shot Conversion in the VFL/AFL - Quantifying Team, Venue and Era Effects

In the previous blog we reviewed the Conversion Rate history of the VFL/AFL competition looking, in turn, at how it has varied across eras for different venues and for different teams. That blog provides some useful context for this one and you might find it helpful to review it before proceeding. 

As we noted in that blog, the venue-by-venue view for a particular era was likely to be influenced by the mix of teams that played on each ground during that era and, similarly, the team-by-team view for a particular era was likely to be influenced by the venues at which each team mostly played during that era. For today's blog I'm going to construct a linear statistical model in an attempt to separately quantify the effects of era, venue and team on the observed game-by-game Conversion Rates.

THE DATA AND THE MODEL

The model will use the individual team Conversion Rate, defined as Goals/(Goals + Behinds), for every team in every game ever played in the VFL/AFL competition. That gives us 29,572 observations to fit, 2 for each of the 14,786 games.
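As a minimal sketch of that definition (the function name and the treatment of a team with no scoring shots are assumptions made for illustration, not details taken from the original analysis):

```python
def conversion_rate(goals, behinds):
    """Goals as a proportion of all scoring shots (goals plus behinds)."""
    scoring_shots = goals + behinds
    if scoring_shots == 0:
        # Assumption: a team with no scoring shots has no defined rate.
        return None
    return goals / scoring_shots


# Each game contributes one observation per team.
games = 14_786
observations = 2 * games

print(conversion_rate(12, 12))  # 0.5
print(observations)             # 29572
```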

All the regressors will be categoricals and comprise for each observation:

  • The Team whose Conversion Rate we're fitting
  • The Venue on which the game was being played
  • The Era in which the game was being played, where 23 eras are defined as 1897-1905, 1906-1910, 1911-1915, 1916-1920, ... , 2001-2005, 2006-2010, 2011-2015.
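The Era definition above can be sketched as a small lookup. The names `era_for_year` and `ERAS` are hypothetical, chosen only for this illustration:

```python
FIRST_ERA = "1897-1905"  # the opening Era covers nine seasons


def era_for_year(year):
    """Return the Era label for a season between 1897 and 2015."""
    if not 1897 <= year <= 2015:
        raise ValueError("season outside the range covered by the model")
    if year <= 1905:
        return FIRST_ERA
    # Every later Era is a five-season block starting in 1906.
    start = 1906 + 5 * ((year - 1906) // 5)
    return f"{start}-{start + 4}"


# All distinct Era labels, in chronological order.
ERAS = [FIRST_ERA] + [f"{s}-{s + 4}" for s in range(1906, 2015, 5)]

print(len(ERAS))           # 23
print(era_for_year(1958))  # 1956-1960
```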

Though a number of functional forms might be postulated for the model, the one we'll use is the following:

Conversion Rate = Constant + Venue + Era + Era x Team + Error

This form implies that:

  • Venues have a constant effect on Conversion Rates across time
  • Conversion Rates vary across Eras but not within them
  • Team Conversion Rates vary across Eras but not within them and not across Venues

All of those assumptions are debatable, but one major factor in their adoption was my desire to somewhat limit the number of coefficients being estimated in an effort to control the level of overfit. Even this form requires 333 parameters (1 Constant, 43 Venue effects, 22 Era effects, and 267 Team x Era effects, the latter figure reduced by the fact that not all teams played in every Era). 
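To see where a parameter count like that comes from, here is a rough sketch of how the dummy columns for this model form might be enumerated. The function `design_columns`, the column labels and the toy rows are all hypothetical, and this is one plausible way of counting rather than a description of what R does internally:

```python
def design_columns(rows, ref_team, ref_venue, ref_era):
    """List the dummy columns needed for Constant + Venue + Era + Era x Team.

    rows is an iterable of (team, venue, era) tuples. Reference levels get
    no column of their own, and an Era x Team column exists only when that
    team actually appears in that Era.
    """
    columns = {"Constant"}
    for team, venue, era in rows:
        if venue != ref_venue:
            columns.add(f"Venue[{venue}]")
        if era != ref_era:
            columns.add(f"Era[{era}]")
        if team != ref_team:
            columns.add(f"Era[{era}] x Team[{team}]")
    return sorted(columns)


# Hypothetical toy rows, not taken from the real data set.
toy_rows = [
    ("Collingwood", "MCG", "1897-1905"),
    ("Carlton", "MCG", "1897-1905"),
    ("Collingwood", "Lake Oval", "1906-1910"),
    ("Carlton", "Lake Oval", "1906-1910"),
    ("Richmond", "MCG", "1906-1910"),
]

toy_columns = design_columns(toy_rows, "Collingwood", "MCG", "1897-1905")
print(len(toy_columns))  # 6

# The counts quoted in the text add up the same way.
print(1 + 43 + 22 + 267)  # 333
```

In the toy rows Richmond appears in only one Era, so it contributes only one Era x Team column, which is the same mechanism that keeps the real Team x Era count below the full teams-by-eras grid.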

The form also requires that we choose reference categories for each variable, for which purpose I've selected 1897-1905 as the reference Era, the MCG as the reference Venue, and Collingwood as the reference Team. This means that the coefficients on the Era variables are to be interpreted relative to 1897-1905, the coefficients on the Venue variables are to be interpreted relative to the MCG, and the coefficients on the Era x Team variables are to be interpreted relative to Collingwood during the given Era.
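Written out in symbols, the model and its reference categories might look like the following. The notation is an assumption introduced here for illustration: CR is the Conversion Rate of team t in game g, v(g) and e(g) are the Venue and Era of game g, and the Greek letters are the coefficients being estimated.

```latex
\[
\mathrm{CR}_{g,t} = \beta_0 + \alpha_{v(g)} + \gamma_{e(g)} + \delta_{e(g),\,t} + \varepsilon_{g,t},
\]
\[
\alpha_{\mathrm{MCG}} = 0, \qquad
\gamma_{1897\text{-}1905} = 0, \qquad
\delta_{e,\,\mathrm{Collingwood}} = 0 \ \text{for every Era } e .
\]
```

Setting those three sets of coefficients to zero is what makes the Constant an estimate for Collingwood at the MCG in 1897-1905, with every other coefficient measured as a difference from that baseline.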

RESULTS

After pausing for a little while, R's lm function eventually provided a model that explained just over 13% of the game-to-game, team-to-team variability in Conversion Rates. That, I'll confess, is a much higher proportion than I expected. 
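Assuming the "proportion of variability explained" is the usual R-squared statistic, it can be sketched as below. The function name and the three data points are hypothetical and serve only to show the calculation:

```python
def r_squared(actual, fitted):
    """Proportion of the variance in actual that is accounted for by fitted."""
    mean_actual = sum(actual) / len(actual)
    ss_total = sum((y - mean_actual) ** 2 for y in actual)
    ss_residual = sum((y - f) ** 2 for y, f in zip(actual, fitted))
    return 1 - ss_residual / ss_total


# Hypothetical Conversion Rates and fitted values, for illustration only.
actual = [0.40, 0.50, 0.60]
fitted = [0.45, 0.50, 0.55]
print(round(r_squared(actual, fitted), 2))  # 0.75
```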

Let's look then at the model coefficients, firstly, and least interestingly, at the Constant term, which we can interpret as an estimate for Collingwood's Conversion Rate in a game played during the 1897-1905 period on the MCG.

In such a game, the model suggests, Collingwood might be expected to convert at about a 41% rate. Those were, indeed, very different times.

Next we consider the Era effects, the pattern of which broadly mirrors the trajectory of historical season-by-season Conversion Rates that we saw in the chart in the previous blog.

Recall that these estimates are all relative to the 1897-1905 Era, which means, for example, that the 2.1% figure in the first row should be interpreted as an estimate of the incremental Conversion Rate likely to be achieved in a game played during the 1906-1910 Era relative to a game involving the same Team at the same Venue played during the 1897-1905 Era. The absence of anything in the column headed Sig tells us that this coefficient is not statistically significantly different from zero.
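A fitted value under this model is just the Constant plus whichever coefficients apply, which the following sketch illustrates. The function name is hypothetical, and the Constant is taken from the "about 41%" quoted earlier, so the result is approximate:

```python
def expected_rate(constant, venue_effect=0.0, era_effect=0.0, team_era_effect=0.0):
    """Add the relevant coefficients to the Constant to get a fitted rate.

    Any effect left at 0.0 corresponds to the reference category
    (the MCG, the 1897-1905 Era, or Collingwood).
    """
    return constant + venue_effect + era_effect + team_era_effect


CONSTANT = 0.41        # "about 41%" from the text, so only approximate
ERA_1906_1910 = 0.021  # the 2.1% figure for the 1906-1910 Era

# Collingwood at the MCG during 1906-1910: only the Era effect is non-zero.
print(round(expected_rate(CONSTANT, era_effect=ERA_1906_1910), 3))  # 0.431
```

For any other team or ground, the corresponding Venue and Era x Team coefficients would be passed in as well.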

All of the Eras from 1916-1920 onwards have estimates that are both positive and statistically significant, and so are shaded green. The Era with the highest estimated Conversion Rate is the 1996-2000 Era, where the estimated increment relative to 1897-1905 is over 12 percentage points.

In the Conversion Rate history chart from the previous blog we saw a decline in Conversion Rates from about the early 1940s to the late 1950s, and that decline is encapsulated by the model in the relatively low coefficient estimates for the four Eras spanning 1941 to 1960.

So far, none of this has revealed anything much that we didn't already suspect. Next though we'll investigate the Venue effects, which now are estimated controlling for the mix of teams that have played at a particular Venue over the course of history.

The estimates here have been ordered highest-to-lowest, Venues with an estimated positive effect on Conversion Rate relative to the MCG appearing on the left, and those with an estimated negative effect on the right. Venues and coefficients shaded green are both positive and statistically significantly different from zero, while those shaded red are both negative and statistically significantly different from zero.

Just three Venues fall into the former category: Albury, which has a massive coefficient but which has only ever seen a single game; Docklands, whose positive incremental effect on Conversion Rates, suggested in the previous blog, is now confirmed; and Lake Oval, which has the smallest of the statistically significant and positive effects, but whose estimate is based on 704 games (which, of course, aids its ability to achieve statistical significance).

Docklands, in fact, has the highest positive coefficient estimate of any ground used on more than a handful of occasions. The Sydney Showground, which has been used 29 times, has the second-highest positive coefficient estimate amongst grounds used more than occasionally. After it, the remaining grounds from the positive column which can also claim to have been used on 25 occasions or more are only Lake Oval (704, as noted), Adelaide Oval (45), Manuka Oval (38), Subiaco (500), Kardinia Park (646), and Glenferrie Oval (443).

Six grounds have negative and statistically significant coefficients, the most negative of these belonging to the WACA, on which 72 games have been played. The estimate for Cazaly's Stadium is more negative still, but it is a ground on which only 5 games have been played. Other grounds with estimated negative effects but on which fewer than 25 games have been played are Bellerive Oval (9) and Olympic Park (3).

Amongst the grounds currently in use, Stadium Australia and the Gabba have the most negative coefficients. Both coefficients are around -1 percentage point, but neither is statistically significantly different from zero.

Reviewed as a whole, two features emerge:

  • Amongst the grounds used most regularly, more have negative coefficients than have positive coefficients. This means there are more grounds that have a negative effect on Conversion Rates relative to the MCG than there are grounds that have a positive effect
  • Venue effects for current and regularly used grounds are generally small, with Docklands the only venue showing a statistically significant positive effect, none showing a statistically significant negative effect, and only the Gabba and Stadium Australia showing a negative effect of any practically meaningful size. This means that, for more-recent eras, the differences in Conversion Rates we were seeing for many grounds were predominantly due to the teams playing on them during the Era and not due to any specific characteristic of the Venue

So then, let's take a look at those Team x Era effects. These are, by far, the most complex to review and interpret. Adelaide's +0.1% figure for the 2011-2015 Era, for example, should be interpreted as the estimated difference between Adelaide's expected Conversion Rate and Collingwood's for a game played during the 2011-2015 Era at any specified Venue.

In this table too I've followed the convention of shading green those coefficients that are positive and statistically significant, and shading red those that are negative and statistically significant.

A team-by-team review will uncover other features that I'll not list here, but some that caught my eye were:

  • The slew of negative coefficients for most teams for the Eras spanning 1916 to 1940, reflecting the superiority of Collingwood's Conversion Rate performance across that period
  • The preponderance of negative coefficients (though generally not statistically significant) for Carlton from the start of the competition until about 1965, after which positive coefficients have been most common
  • The pattern for Essendon, which is very similar to Carlton's
  • Geelong's tendency to produce positive coefficients from about the end of WWII to the present day
  • Hawthorn's seven consecutive positive coefficients, four of them statistically significant, from 1981-1985 to 2011-2015
  • North Melbourne's / Kangaroos' record of 12 positive coefficients from the 15 most-recent Eras, including 7 positive coefficients from the 8 most-recent Eras
  • Melbourne's strong performance in the 1941 to 1970 period, missing a positive coefficient for only one Era in that time
  • Richmond's streaky performance profile across history, with 5 negative Eras spanning 1916 to 1940 being followed by 4 positive Eras, and with 4 other negative Eras spanning 1981 to 2000 being followed by 3 positive Eras from then until now.
  • St Kilda's poor Conversion Rate history from the start of the competition until the 1981 to 1985 Era, during which it returned a positive coefficient in only 3 Eras and including a string of negative coefficients spanning 1916 to 1965. In 5 of the 6 most-recent Eras, however, it has produced positive coefficients
  • South Melbourne's / Sydney's similarly poor record until 1990, after which it has produced a string of 5 unbroken positive coefficient Eras
  • Footscray's / Western Bulldogs' poor run of 5 successive negative Eras from 1951 to 1975, followed by a string of mostly positive Eras, broken only by the 1986-1990 Era and ending with an unbroken run of 5 successive positive Eras.

One other particularly important feature of this table is the average size of the coefficients it contains. These imply that, over the history of the sport, the combination of Team and Era has had a much larger influence than Venue alone on the Conversion Rate observed in a game. We should, of course, note that coefficient magnitudes are heavily influenced by the choices made for reference categories, but the persistence and size of the effect seen here suggest that the broad inference I've just drawn is justified.

In other words, what we're claiming is that knowing when a game was played and who was playing in it likely tells us much more about the Conversion Rate seen in that game than does knowing only where that game was played. That's maybe not all that startling a claim when you realise that, in the first instance, we're altering two variables (Team and Era), whereas in the second instance we're altering only one (Venue).

(Note that a random forest model fitted to the same data suggests, based on the node impurity reduction variable importance measure, that, taken alone, Era is the most important variable, then Venue, then Team, so it seems that it's only by considering Team and Era varying together that we obtain such an elevation in predictive value. That said, it's known that random forest variable importance measures favour factors with more levels, which gives Venue, with its 44 values, an advantage over Team, with its 20 values. Ultimately, it's the vibe of the thing ...)

I'll finish with another version of that last table, this one ignoring statistical significance and colour-coding the coefficients based solely on their value. This view helps make runs of negative or positive coefficients for any single team more obvious, and presents the data in a format more similar to that used in the previous blog.