Finding Non-Linear Relationships Between AFL Variables: Alternative Measures to MIC
The Maximal Information Coefficient (MIC) that we explored in the previous blog is not the only non-linear measure of the pairwise relationship between any two continuous variables.
Far from it, in fact.
As I was Googling for information about the MIC, I came across this review of it by Alexander W Blocker, in which he compares and contrasts the performance of the MIC measure to that of the Brownian Distance Covariance measure. There is, of course, an R package that provides the ability to calculate this distance measure and it's called energy.
One other, related notion that kept cropping up while I was investigating the MIC was that of Mutual Information. Again, there's a package for calculating that measure and it's called bioDist.
Note that, while the MIC and Brownian Distance measures range from 0 to 1, Mutual Information ranges from 0 to infinity. In all cases, larger values imply a stronger relationship or, if you prefer, a higher level of mutual information.
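To make the point about ranges concrete, here is a minimal Python sketch of a binned ("plug-in") Mutual Information estimate. It is an illustration only and not the bioDist implementation: the function name, the equal-width binning and the use of natural logs are all assumptions made for this example. Note too that once the data are discretised into a fixed number of bins, the estimate cannot exceed the log of the number of bins, so it is the underlying continuous quantity that is unbounded.

```python
import math
from collections import Counter


def binned_mutual_information(x, y, nbin=10):
    """Plug-in mutual information (in nats) after equal-width binning.

    Hypothetical helper for illustration; not the bioDist code.
    """
    if len(x) != len(y) or not x:
        raise ValueError("x and y must be non-empty and of equal length")

    def to_bins(values):
        lo, hi = min(values), max(values)
        width = (hi - lo) / nbin or 1.0  # fall back to 1.0 for constant input
        # The largest value is clamped into the last bin.
        return [min(int((v - lo) / width), nbin - 1) for v in values]

    bx, by = to_bins(x), to_bins(y)
    n = len(x)
    joint, px, py = Counter(zip(bx, by)), Counter(bx), Counter(by)

    mi = 0.0
    for (i, j), count in joint.items():
        # p_ij * log(p_ij / (p_i * p_j)), written in terms of raw counts
        mi += (count / n) * math.log(count * n / (px[i] * py[j]))
    return mi


x = list(range(100))
# A series compared with itself: every point falls on the diagonal of the grid,
# so the estimate is log(10), about 2.303, which is well above 1.
print(round(binned_mutual_information(x, x, nbin=10), 3))
```

Unlike MIC and the Brownian Distance measure, nothing here rescales the result to lie between 0 and 1.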
Each of these measures has tuning parameters:
- MIC has the exponent, exp, that determines how many cells are allowed in the grids considered by the algorithm, which are used to "bin" the underlying variables. The default value for this parameter is 0.6. For this blog I've investigated values in the range 0.4 to 0.8.
- Brownian Distance has an index variable that's used as the power on the distance measure and that can range from 0 (though not including it) to 2. For this blog I've investigated values in the range 0.25 to 2. When the index is equal to 2, the Brownian Distance measure for a pair of single variables reduces to the absolute value of the usual Pearson correlation measure.
- Mutual Information has a variable nbin, which is the number of bins into which the underlying variables are discretised. It plays a similar role to the exp parameter in the MIC measure. For this blog I've investigated the values 5, 10 and 20.
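The role of the index in the Brownian Distance measure is perhaps easiest to see in code. What follows is a small, unoptimised Python sketch of the sample distance correlation, assuming the standard double-centring definition; it is not the code from the energy package, and the function names are made up for this example.

```python
import math


def _centred_distances(values, index):
    """Matrix of |a - b|**index, double-centred by row, column and grand mean."""
    n = len(values)
    d = [[abs(a - b) ** index for b in values] for a in values]
    row_means = [sum(row) / n for row in d]
    grand_mean = sum(row_means) / n
    # d is symmetric, so the column means are the same as the row means.
    return [[d[i][j] - row_means[i] - row_means[j] + grand_mean
             for j in range(n)] for i in range(n)]


def distance_correlation(x, y, index=1.0):
    """Sample distance correlation of two equal-length series (hypothetical helper)."""
    if not 0 < index <= 2:
        raise ValueError("index must be greater than 0 and at most 2")
    if len(x) != len(y):
        raise ValueError("x and y must be the same length")
    n = len(x)
    a, b = _centred_distances(x, index), _centred_distances(y, index)

    def mean_product(p, q):
        return sum(p[i][j] * q[i][j] for i in range(n) for j in range(n)) / (n * n)

    denom = math.sqrt(mean_product(a, a) * mean_product(b, b))
    if denom == 0:
        return 0.0  # at least one series is constant
    # max() guards against tiny negative values caused by rounding.
    return math.sqrt(max(mean_product(a, b), 0.0) / denom)


x, y = [1, 2, 3, 4], [1, 3, 2, 4]
# The Pearson correlation of this toy data is 0.8, and index=2 reproduces it.
print(round(distance_correlation(x, y, index=2), 6))
print(round(distance_correlation(x, y, index=1), 6))
```

The nested loops make this quadratic in the number of observations, which is fine for a few hundred games but not something to run on very large samples.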
The data I've used for this analysis is a subset of the data I used in the previous blog. Here I've looked only at MARS-related and TAB Bookmaker-related variables for the period 2006 to 2011, and investigated their measured relationship with the Margin variable (i.e. Home Score less Away Score). Once again, what I want the correlation measures to tell me is which variables have the greatest predictive value if I'm trying to predict game Margin.
Here are the results, both as raw correlations and as rankings.
On the left are the MIC values. Note how the measured correlation of every variable with the Margin variable increases as we increase the value of exp. In the supporting online material for the original Reshef paper it's mentioned that the choice of exp is important because setting it "too high can lead to non-zero scores even for random data because each data point gets its own cell, while setting ... [it] ... too low means we are searching only for simple patterns." When we set exp to 0.8 we see particularly high values of MIC for every variable, which seems to confirm the initial part of this comment, and when we set exp to 0.4 we obtain very different rankings across the variables, which might be explained by the latter part.
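To get a feel for how much room exp gives the algorithm: in the Reshef paper the grids considered are those whose number of cells (columns times rows) stays below n raised to the power exp, where n is the number of data points. The short Python sketch below tabulates that budget. The sample size of 1,000 is a made-up round number, not the number of games in this data set, and mic_cell_budget is a hypothetical helper name.

```python
def mic_cell_budget(n_samples, exp):
    """Upper bound n**exp on the number of grid cells (columns x rows)."""
    return n_samples ** exp


n_samples = 1000  # assumed, purely illustrative sample size
for exp in (0.4, 0.5, 0.6, 0.7, 0.8):
    budget = mic_cell_budget(n_samples, exp)
    print(f"exp={exp:.1f}: grids of up to about {budget:.0f} cells")
```

With 1,000 points the budget runs from about 16 cells at an exp of 0.4 to about 251 cells at 0.8, so the higher setting lets the algorithm carve the data far more finely.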
Underneath the raw correlations are the rankings of the 12 variables in terms of these correlations. We can see that the three highest-ranked variables are Home_MARS_on_Away_MARS, log_Home_MARS_less_Away_MARS, and sqrt_Home_MARS_on_Away_MARS for all values of exp except 0.40. At that value, where we might only be searching for "simple patterns", three of the Bookmaker-related variables take the top rankings.
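Turning a column of raw correlations into rankings is a simple sort. Here is a minimal Python sketch; the variable names come from this blog, but the scores are invented purely for illustration and are not the values in the table, and rank_descending is a hypothetical helper.

```python
def rank_descending(scores):
    """Map each variable name to its rank, where 1 is the largest score.

    Ties are broken by the order in which the variables were supplied.
    """
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}


# Made-up scores for illustration only.
example_scores = {
    "Home_MARS_Rating": 0.10,
    "Bookie_Prob_Diff": 0.30,
    "Home_MARS_on_Away_MARS": 0.25,
}
for name, rank in sorted(rank_descending(example_scores).items(), key=lambda kv: kv[1]):
    print(rank, name)
```

Applying the same function to each column of correlations gives one column of rankings per parameter setting.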
In the middle section of the table we have the correlations and related rankings using the Brownian Distance measure. Here we find, for smaller values of the exponent on the distance measure, the Bookmaker-related variables being assessed as most related to game Margin. As we increase the exponent, which penalises greater differences in the underlying distributions more heavily, some of the MARS-related variables rise in the rankings, though the Bookie_Prob_Diff measure remains top for all values of the exponent of 0.75 or higher.
Finally, the section on the right of the table relates to the Mutual Information measure. Whilst there are some differences as we move from using 5 to 20 bins in discretising the underlying variables, the general picture is very similar. Bookie_Prob_Diff is assessed as being most related to game Margin regardless of the number of bins used, and a set of six MARS-related variables fills the next six ranks, in an order that depends on whether 5, 10 or 20 bins are employed.
Looking across all the various measures of correlation/mutual information, I'd draw the following conclusions:
- Bookie_Prob_Diff appears to have the highest assessed level of information in common with game Margin.
- Home_MARS_on_Away_MARS probably ranks next, followed by either log_Home_MARS_less_Away_MARS or sqrt_Home_MARS_on_Away_MARS.
- The Home_MARS_Rating and Away_MARS_Rating variables, taken alone, have the least to tell us about the game Margin.
In closing, I thought it would also be interesting to pairwise correlate (using the Pearson measure since you asked) the sets of correlation values provided by each of the metrics used in this blog.
From this table we can see, for example, that the set of correlations of all 12 variables with the Margin variable that we obtain using the MIC with an exp of 0.40 has a correlation of +0.81 with the set we obtain using the MIC with an exp of 0.60. In other words, these two parameter settings produce quite similar measured correlations.
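For anyone wanting to build a table like this themselves, the calculation is just a Pearson correlation between every pair of columns of scores. A minimal Python sketch follows; the column labels and the numbers are invented for illustration and are not the values from this analysis, and both function names are hypothetical.

```python
import math
from itertools import combinations


def pearson(a, b):
    """Ordinary Pearson correlation of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def pairwise_pearson(score_sets):
    """Correlate every pair of score columns; keys are (label_1, label_2)."""
    return {(m1, m2): pearson(score_sets[m1], score_sets[m2])
            for m1, m2 in combinations(score_sets, 2)}


# Made-up scores: one list per metric setting, one entry per variable.
score_sets = {
    "MIC exp=0.40": [0.12, 0.15, 0.11, 0.09],
    "MIC exp=0.60": [0.21, 0.26, 0.22, 0.17],
    "dCor index=1": [0.40, 0.44, 0.31, 0.35],
}
for (m1, m2), r in pairwise_pearson(score_sets).items():
    print(f"{m1} vs {m2}: {r:+.2f}")
```

Each list must hold the scores for the same variables in the same order, otherwise the correlations are meaningless.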
A few things to note here are:
- The relatively low correlation between the results for the MIC measure with low exponents and those with high exponents. It's as if we're measuring two quite different notions of mutual information.
- The generally low correlation between the results using MIC with any exponent and any of those using the Brownian Distance or the Mutual Information measure. The only exception is where we select low exponents for the MIC measure (0.55 or under) and low exponents for the Brownian Distance measure (0.75 or under); in that case we see correlations of about +0.75 and higher.
- The generally high correlation between the results obtained using the Brownian Distance measure for any pair of exponents, especially for values of the exponent of 0.75 or above.
- Similarly, the generally high correlation between the results obtained using the Brownian Distance measure and those using the Mutual Information measure with any number of bins, again especially for values of the exponent in the Brownian measure of 0.75 or above.