Specialist Margin Prediction: Epsilon Insensitive Loss Functions

In the last blog we looked at Margin Prediction using what I called "bathtub" loss functions. For the current blog I've extended the range of loss functions to include what are called epsilon-insensitive loss functions, which are similar to the "bathtub" loss functions except that they don't treat absolute errors of size greater than M points equally.
Read More

Specialist Margin Prediction: "Bathtub" Loss Functions

We know that we can build quite simple, non-linear models to predict the margin of AFL games that will, on average, be within about 30 points of the actual result. So, if you found a bet type for which general margin prediction accuracy was important - where every point of error contributed to your loss - then this would be your model. This year we'll be moving into margin betting though, where the goal is to predict within X points of the actual result and being in error by X+1 points is no different from being wrong by X+100 points. In that environment, our all-purpose model might not be the right choice. In this blog I'll be describing a process for creating margin predicting models that specialise in predicting within X points of the final outcome.
Read More

Explaining More of the Variability in the Victory Margin of Finals

This morning while out walking I got to wondering about two of the results from the latest post on the Wagers & Tips blog. First that teams from higher on the ladder have won 20 of the 22 Semi Finals between 2000 and 2010, and second that the TAB bookmaker has installed the winning team as favourite in only 64% of these contests. Putting those two facts together it's apparent that, in Semi Finals at least, the bookmaker's often favoured the team that finished lower on the ladder, and these teams have rarely won.
Read More

Deconstructing The 2011 TAB Sportsbet Bookmaker

To what extent can the head-to-head prices set by the TAB Sportsbet Bookmaker in 2011 be modelled using only the competing teams' MAFL MARS Ratings, their respective Venue Experiences, and the Interstate Status of the fixture?
Read More

Projecting the Favourite's Final Margin

In a couple of earlier blogs I created binary logit models to predict the probability that the favourite would win given a specified lead at a quarter break and the bookmaker's assessed pre-game probability for the favourite. These models allow you to determine what a fair in-running price would be for the favourite. You might instead want to know what the favourite's projected victory margin is given the same input data, so in this blog I'll be providing some simple linear regressions that provide this information.
Read More

An Empirical Review of the Favourite In-Running Model

In the previous blog we reviewed a series of binary logits that modelled a favourite's probability of victory given its pre-game bookmaker-assessed head-to-head probability and its lead at the end of a particular quarter. There I provided just a single indication of the quality of those models: the accuracy with which they correctly predicted the final result of the game. That's a crude and very broad measure. In this blog we'll take a closer look at the empirical model fits to investigate their performance in games with different leads and probabilities.
Read More

Hanging Onto a Favourite: Assessing a Favourite's In-Running Chances of Victory

Over the weekend I was paying particular attention to the in-running odds being offered on various games and remain convinced that punters overestimate the probability of the favourite ultimately winning, especially when the favourite trails.
Read More

The Drivers of Overround

What features of a contest, I wondered this week, led to it having a larger or smaller overround than an average game? In which games might the bookie be able to grab another quarter or half a percent, and in which might he be forced to round down the overround?
Read More

Margin Prediction for 2011

We've fresh tipsters for 2011, fresh Funds for 2011, so now we need fresh margin predictors for 2011. This year, all of the margin predictors are based on models that produce probability forecasts, which includes the algorithms powering ProPred, WinPred and the Head-to-Head Fund and the "model" that is the TAB Sportsbet bookmaker. The process for creating the margin predictors was to let Eureqa loose on the historical data for seasons 2007 to 2010 to produce equations that fitted previous home team margins of victory as a function of these models' probabilities.
Read More

Can We Do Better Than The Binary Logit?

To say that there's 'a bit in this blog' is like declaring the Hundred Years' War 'a bit of a skirmish'.

I'll start by broadly explaining what I've done. In a previous blog I constructed 12 models, each attempting to predict the winner of an AFL game. The 12 models varied in two ways, firstly in terms of how the winning team was described ...
Read More

Why It Matters Which Team Wins

In conversation - and in interrogation, come to think of it - the key to getting a good answer is often in the framing of the question.

So too in statistical modelling, where one common method for asking a slightly different question of the data is to take the variables you have and transform them.

Consider for example the following results for four binary logits, each built to provide an answer to the question 'Under what circumstances does the team with the higher MARS Rating tend to win?'.
Read More

Drawing On Hindsight

When sports journos wait until after a contest has been decided before declaring a group of winning punters to be "savvy", I find it hard not to be at least a little cynical about the aptness of the label.

So when, on Sunday, I read in the online version of the SMH that a posse of said savvy punters had foxed the bookies and cleaned up on the draw, collectively winning as I recall about $1m at prices ranging from $34 to $51, I did wonder how many column-inches would have been devoted to those same punters had the margin been anything different when the final siren sounded on Saturday. I'm fairly certain it would have been the number that has '1' as its next-door, up the road neighbour on Integer Street.
Read More

Adding Some Spline to Your Models

Creating the recent blog on predicting the Grand Final margin based on the difference in the teams' MARS Ratings set me off once again down the path of building simple models to predict game margin.

It usually doesn't take much.

Firstly, here's a simple linear model using MARS Ratings differences that repeats what I did for that recent blog post but uses every game since 1999, not just Grand Finals.

2010 - MARS Ratings vs Score Difference.png

It suggests that you can predict game margins - from the viewpoint of the home team - by completing the following steps:

  1. subtract the away team's MARS Rating from the home team's MARS Rating
  2. multiply this difference by 0.736
  3. add 9.871 to the result you get in 2.
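The steps above can be sketched as code (Python here for convenience, rather than the R I use for the modelling itself; the coefficients are those from the fitted model):

```python
def predict_margin(home_mars_rating, away_mars_rating):
    """Predict the home team's victory margin from MARS Ratings.

    Fitted linear model: margin = 0.736 * (home - away) + 9.871
    """
    rating_difference = home_mars_rating - away_mars_rating  # step 1
    return 0.736 * rating_difference + 9.871                 # steps 2 and 3

# Two equally rated teams: the prediction is just the home advantage
print(round(predict_margin(1000, 1000), 3))  # 9.871
```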

One interesting feature of this model is that it suggests that home ground advantage is worth about 10 points.

The R-squared number that appears on the chart tells you that this model explains 21.1% of the variability in game margins.

You might recall we've found previously that we can do better than this by using the home team's victory probability implied by its head-to-head price.

2010 - Bookie Probability vs Score Difference.png

This model says that you can predict the home team margin by multiplying its implicit probability by 105.4 and then subtracting 48.27. It explains 22.3% of the observed variability in game margins, or a little over 1% more than we can explain with the simple model based on MARS Ratings.

With this model we can obtain another estimate of the home team advantage by forecasting the margin with a home team probability of 50%. That gives an estimate of 4.4 points, which is much smaller than we obtained with the MARS-based model earlier.
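As a quick check of that arithmetic, here's the bookie-probability model as a Python sketch (coefficients as quoted above):

```python
def predict_margin_from_prob(home_prob):
    """Home team margin from the bookie's implicit home-win probability.

    Fitted linear model: margin = 105.4 * prob - 48.27
    """
    return 105.4 * home_prob - 48.27

# Evaluating at a 50% probability gives the 'unexpected' home advantage
print(round(predict_margin_from_prob(0.5), 2))  # 4.43
```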

(EDIT: On reflection, I should have been clearer about the relative interpretation of this estimate of home ground advantage in comparison to that from the MARS Rating based model above. They're not measuring the same thing.

The earlier estimate of about 10 points is a more natural estimate of home ground advantage. It's an estimate of how many more points a home team can be expected to score than an away team of equal quality based on MARS Rating, since the MARS Rating of a team for a particular game does not include any allowance for whether or not it's playing at home or away.

In comparison, this latest estimate of 4.4 points is a measure of the "unexpected" home ground advantage that has historically accrued to home teams, over-and-above the advantage that's already built into the bookie's probabilities. It's a measure of how many more points home teams have scored than away teams when the bookie has rated both teams as even money chances, taking into account the fact that one of the teams is (possibly) at home.

It's entirely possible that the true home ground advantage is about 10 points and that, historically, the bookie has priced only about 5 or 6 points into the head-to-head prices, leaving the excess of 4.4 that we're seeing. In fact, this is, if memory serves me, consistent with earlier analyses that suggested home teams have been receiving an unwarranted benefit of about 2 points per game on line betting.

Which, again, is why MAFL wagers on home teams.)

Perhaps we can transform the probability variable and explain even more of the variability in game margins.

In another earlier blog we found that the handicap a team received could be explained by using what's called the logit transformation of the bookie's probability, which is ln(Prob/(1-Prob)).

Let's try that.

2010 - Bookie Probability vs Score Difference - Logit Form.png

We do see some improvement in the fit, but it's only another 0.2% to 22.5%. Once again we can estimate home ground advantage by evaluating this model with a probability of 50%. That gives us 4.4 points, the same as we obtained with the previous bookie-probability based model.

A quick model-fitting analysis of the data in Eureqa gives us one more transformation to try: exp(Prob). Here's how that works out:

2010 - Bookie Probability vs Score Difference - Exp Form.png

We explain another 0.1% of the variability with this model as we inch our way to 22.6%. With this model the estimated home-ground advantage is 2.6 points, which is the lowest we've seen so far.

If you look closely at the first model we built using bookie probabilities you'll notice that there seem to be more points above the fitted line than below it for probabilities from somewhere around 60% onwards.

Statistically, there are various ways that we could deal with this, one of which is by using Multivariate Adaptive Regression Splines.

(The algorithm in R - the statistical package that I use for most of my analysis - with which I created my MARS models is called earth since, for legal reasons, it can't be called MARS. There is, however, another R package that also creates MARS models, albeit in a different format. The maintainer of the earth package couldn't resist calling the function that converts from one model format to the other mars.to.earth. Nice.)

The benefit that MARS models bring us is the ability to incorporate 'kinks' in the model and to let the data determine how many such kinks to incorporate and where to place them.

Running earth on the bookie probability and margin data gives the following model:

Predicted Margin = 20.7799 + if(Prob > 0.6898155, 162.37738 x (Prob - 0.6898155),0) + if(Prob < 0.6898155, -91.86478 x (0.6898155 - Prob),0)

This is a model with one kink at a probability of around 69%, and it does a slightly better job at explaining the variability in game margins: it gives us an R-squared of 22.7%.
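The fitted equation can be written compactly as a pair of hinge functions. A minimal Python sketch (coefficients exactly as quoted above):

```python
def mars_predicted_margin(prob):
    """Piecewise-linear (MARS-style) margin model with one knot at ~0.69.

    Predicted Margin = 20.7799
                       + 162.37738 * max(prob - 0.6898155, 0)
                       - 91.86478  * max(0.6898155 - prob, 0)
    """
    knot = 0.6898155
    return (20.7799
            + 162.37738 * max(prob - knot, 0.0)
            - 91.86478 * max(knot - prob, 0.0))

# At the knot itself only the intercept contributes
print(round(mars_predicted_margin(0.6898155), 4))  # 20.7799
```

Note how the slope is much steeper to the right of the knot (about 162 points of margin per unit of probability) than to the left (about 92), which is what lets the model track strong favourites more closely.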

When you overlay it on the actual data, it looks like this.

2010 - Bookie Probability vs Score Difference - MARS.png

You can see the model's distinctive kink in the diagram, by virtue of which it seems to do a better job of dissecting the data for games with higher probabilities.

It's hard to keep all of these models based on bookie probability in our head, so let's bring them together by charting their predictions for a range of bookie probabilities.

2010 - Bookie Probability vs Score Difference - Predictions.png

For probabilities between about 30% and 70%, which approximately equates to prices in the $1.35 to $3.15 range, all four models give roughly the same margin prediction for a given bookie probability. They differ, however, outside that range of probabilities, by up to 10-15 points. Since only about 37% of games have bookie probabilities outside that central range, none of the models is penalised too heavily for producing errant margin forecasts for these probability values.

So far then, the best model we've produced has used only bookie probability and a MARS modelling approach.

Let's finish by adding the other MARS back into the equation - my MARS Ratings, which bear no resemblance to the MARS algorithm, and just happen to share a name. A bit like John Howard and John Howard.

This gives us the following model:

Predicted Margin = 14.487934 + if(Prob > 0.6898155, 78.090701 x (Prob - 0.6898155),0) + if(Prob < 0.6898155, -75.579198 x (0.6898155 - Prob),0) + if(MARS_Diff < -7.29, 0, 0.399591 x (MARS_Diff + 7.29))

The model described by this equation is kinked with respect to bookie probability in much the same way as the previous model. There's a single kink located at the same probability, though the slope to the left and right of the kink is smaller in this latest model.

There's also a kink for the MARS Rating variable (which I've called MARS_Diff here), but it's a kink of a different kind. For MARS Ratings differences below -7.29 Ratings points - that is, where the home team is rated 7.29 Ratings points or more below the away team - the contribution of the Ratings difference to the predicted margin is 0. Then, for every 1 Rating point increase in the difference above -7.29, the predicted margin goes up by about 0.4 points.

This final model, which I think can still legitimately be called a simple one, has an R-squared of 23.5%. That's a further increase of 0.8%, which can loosely be thought of as the contribution of MARS Ratings to the explanation of game margins over and above that which can be explained by the bookie's probability assessment of the home team's chances.
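For completeness, the final two-variable model as a Python sketch (coefficients as quoted; MARS_Diff is the home team's MARS Rating minus the away team's):

```python
def final_predicted_margin(prob, mars_diff):
    """Two-variable MARS-style model: bookie probability plus the
    MARS Ratings difference (home minus away)."""
    knot = 0.6898155
    prob_part = (78.090701 * max(prob - knot, 0.0)
                 - 75.579198 * max(knot - prob, 0.0))
    # Ratings differences below -7.29 contribute nothing; above that,
    # each extra Ratings point adds about 0.4 points to the margin
    ratings_part = 0.399591 * max(mars_diff + 7.29, 0.0)
    return 14.487934 + prob_part + ratings_part
```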

All You Ever Wanted to Know About Favourite-Longshot Bias ...

Previously, on at least a few occasions, I've looked at the topic of the Favourite-Longshot Bias and whether or not it exists in the TAB Sportsbet wagering markets for AFL.

A Favourite-Longshot Bias (FLB) is said to exist when favourites win at a rate in excess of their price-implied probability and longshots win at a rate less than their price-implied probability. So if, for example, teams priced at $10 - ignoring the vig for now - win at a rate of just 1 time in 15, this would be evidence for a bias against longshots. In addition, if teams priced at $1.10 won, say, 99% of the time, this would be evidence for a bias towards favourites.

When I've considered this topic in the past I've generally produced tables such as the following, which are highly suggestive of the existence of such an FLB.

2010 - Favourite-Longshot Bias.png

Each row of this table, which is based on all games from 2006 to the present, corresponds to the results for teams with price-implied probabilities in a given range. The first row, for example, is for all those teams whose price-implied probability was less than 10%. This equates, roughly, to teams priced at $9.50 or more. The average implied probability for these teams has been 9%, yet they've won at a rate of only 4%, less than one-half of their 'expected' rate of victory.

As you move down the table you need to arrive at the second-last row before you come to one where the win rate exceeds the expected rate (ie the average implied probability). That's fairly compelling evidence for an FLB.

This empirical analysis is interesting as far as it goes, but we need a more rigorous statistical approach if we're to take it much further. And heck, one of the things I do for a living is build statistical models, so you'd think that by now I might have thrown such a model at the topic ...

A bit of poking around on the net uncovered this paper which proposes an eminently suitable modelling approach, using what are called conditional logit models.

In this formulation we seek to explain a team's winning rate purely as a function of (the natural log of) its price-implied probability. There's only one parameter to fit in such a model and its value tells us whether or not there's evidence for an FLB: if it's greater than 1 then there is evidence for an FLB, and the larger it is the more pronounced is the bias.

When we fit this model to the data for the period 2006 to 2010 the fitted value of the parameter is 1.06, which provides evidence for a moderate level of FLB. The following table gives you some idea of the size and nature of the bias.

2010 - Favourite-Longshot Bias - Conditional Logit.png

The first row applies to those teams whose price-implied probability of victory is 10%. A fair-value price for such teams would be $10 but, with a 6% vig applied, these teams would carry a market price of around $9.40. The modelled win rate for these teams is just 9%, which is slightly less than their implied probability. So, even if you were able to bet on these teams at their fair-value price of $10, you'd lose money in the long run. Because, instead, you can only bet on them at $9.40 or thereabouts, in reality you lose even more - about 16c in the dollar, as the last column shows.
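Those figures can be reproduced from the fitted model. In this conditional logit formulation, a team with price-implied probability p facing an opponent with probability 1-p has a modelled win probability of p^β / (p^β + (1-p)^β), where β is the single fitted parameter. A Python sketch (using the fitted β = 1.06 and an assumed 6% vig):

```python
def modelled_win_rate(implied_prob, beta=1.06):
    """Conditional-logit win rate for a team with the given price-implied
    probability, facing an opponent with the complementary probability.
    beta > 1 implies a favourite-longshot bias."""
    p, q = implied_prob, 1.0 - implied_prob
    return p ** beta / (p ** beta + q ** beta)

# A 10% implied-probability team: modelled win rate is a little under 9%
rate = modelled_win_rate(0.10)
print(round(rate, 3))  # 0.089

# With a 6% vig its market price is about 1/(0.10 * 1.06) = $9.43, so a
# level-stake $1 bet returns rate * price: a loss of about 16c in the dollar
market_price = 1.0 / (0.10 * 1.06)
print(round(1.0 - rate * market_price, 2))  # 0.16
```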

We need to move all the way down to the row for teams with 60% implied probabilities before we reach a row where the modelled win rate exceeds the implied probability. The excess is not, regrettably, enough to overcome the vig, which is why the rightmost entry for this row is also negative - as, indeed, it is for every other row underneath the 60% row.

Conclusion: there has been an FLB on the TAB Sportsbet market for AFL across the period 2006-2010, but it hasn't been generally exploitable (at least to level-stake wagering).

The modelling approach I've adopted also allows us to consider subsets of the data to see if there's any evidence for an FLB in those subsets.

I've looked firstly at the evidence for FLB considering just one season at a time, then considering only particular rounds across the five seasons.

2010 - Favourite-Longshot Bias - Year and Round.png

So, there is evidence for an FLB for every season except 2007. For that season there's evidence of a reverse FLB, which means that longshots won more often than they were expected to and favourites won less often. In fact, in that season, the modelled success rate of teams with implied probabilities of 20% or less was sufficiently high to overcome the vig and make wagering on them a profitable strategy.

That year aside, 2010 has been the year with the smallest FLB. One way to interpret this is as evidence for an increasing level of sophistication in the TAB Sportsbet wagering market, from punters or the bookie, or both. Let's hope not.

Turning next to a consideration of portions of the season, we can see that there's tended to be a very mild reverse FLB through rounds 1 to 6, a mild to strong FLB across rounds 7 to 16, a mild reverse FLB for the last 6 rounds of the season and a huge FLB in the finals. There's a reminder in that for all punters: longshots rarely win finals.

Lastly, I considered a few more subsets, and found:

  • No evidence of an FLB in games that are interstate clashes (fitted parameter = 0.994)
  • Mild evidence of an FLB in games that are not interstate clashes (fitted parameter = 1.03)
  • Mild to moderate evidence of an FLB in games where there is a home team (fitted parameter = 1.07)
  • Mild to moderate evidence of a reverse FLB in games where there is no home team (fitted parameter = 0.945)

FLB: done.

Divining the Bookie Mind: Singularly Difficult

It's fun this time of year to mine the posted TAB Sportsbet markets in an attempt to glean what their bookie is thinking about the relative chances of the teams in each of the four possible Grand Final pairings.

Three markets provide us with the relevant information: those for each of the two Preliminary Finals, and that for the Flag.

From these markets we can deduce the following about the TAB Sportsbet bookie's current beliefs (making my standard assumption that the overround on each competitor in a contest is the same, which should be fairly safe given the range of probabilities that we're facing with the possible exception of the Dogs in the Flag market):

  • The probability of Collingwood defeating Geelong this week is 52%
  • The probability of St Kilda defeating the Dogs this week is 75%
  • The probability of Collingwood winning the Flag is about 34%
  • The probability of Geelong winning the Flag is about 32%
  • The probability of St Kilda winning the Flag is about 27%
  • The probability of the Western Bulldogs winning the Flag is about 6%

(Strictly speaking, the last probability is redundant since it's implied by the three before it.)

What I'd like to know is what these explicit probabilities imply about the implicit probabilities that the TAB Sportsbet bookie holds for each of the four possible Grand Final matchups - that is for the probability that the Pies beat the Dogs if those two teams meet in the Grand Final; that the Pies beat the Saints if, instead, that pair meet; and so on for the two matchups involving the Cats and the Dogs, and the Cats and the Saints.

It turns out that the six probabilities listed above are insufficient to determine a unique solution for the four Grand Final probabilities I'm after - in mathematical terms, the relevant system that we need to solve is singular.

That system is (approximately) the following four equations, which we can construct on the basis of the six known probabilities and the mechanics of which team plays which other team this week and, depending on those results, in the Grand Final: 

  • 52% x Pr(Pies beat Dogs) + 48% x Pr(Cats beat Dogs) = 76%
  • 52% x Pr(Pies beat Saints) + 48% x Pr(Cats beat Saints) = 63.5%
  • 75% x Pr(Pies beat Saints) + 25% x Pr(Pies beat Dogs) = 66%
  • 75% x Pr(Cats beat Saints) + 25% x Pr(Cats beat Dogs) = 67.5%

(If you've a mathematical bent you'll readily spot the reason for the singularity in this system of equations: the coefficients in every equation sum to 1, as they must since they're complementary probabilities.)

Whilst there's not a single solution to those four equations - actually there's an infinite number of them, so you'll be relieved to know that I won't be listing them all here - the fact that probabilities must lie between 0 and 1 puts constraints on the set of feasible solutions and allows us to bound the four probabilities we're after.
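To see the singularity, and to trace out a feasible solution, here's a small numpy sketch (the coefficients and right-hand sides are those of the four equations above):

```python
import numpy as np

# Unknowns, in order: Pr(Pies beat Dogs), Pr(Cats beat Dogs),
#                     Pr(Pies beat Saints), Pr(Cats beat Saints)
A = np.array([[0.52, 0.48, 0.00, 0.00],
              [0.00, 0.00, 0.52, 0.48],
              [0.25, 0.00, 0.75, 0.00],
              [0.00, 0.25, 0.00, 0.75]])
b = np.array([0.76, 0.635, 0.66, 0.675])

# The system is singular: only 3 of the 4 equations are independent
print(np.linalg.matrix_rank(A))  # 3

# Fixing one unknown pins down the rest. With Pr(Pies beat Dogs) = 0.80,
# equations 1, 3 and 4 give the other three directly:
pies_dogs = 0.80
cats_dogs = (0.76 - 0.52 * pies_dogs) / 0.48        # ~0.72
pies_saints = (0.66 - 0.25 * pies_dogs) / 0.75      # ~0.61
cats_saints = (0.675 - 0.25 * cats_dogs) / 0.75     # ~0.66
print(round(cats_dogs, 2), round(pies_saints, 2), round(cats_saints, 2))
```

The fixed-point solution also (approximately) satisfies the remaining equation, which is a useful consistency check on the bookie-derived probabilities.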

So, I can assert that, as far as the TAB Sportsbet bookie is concerned:

  • The probability that Collingwood would beat St Kilda if that were the Grand Final matchup - Pr(Pies beats Saints) in the above - is between about 55% and 70%
  • The probability that Collingwood would beat the Dogs if that were the Grand Final matchup is higher than 54% and, of course, less than or equal to 100%.
  • The probability that Geelong would beat St Kilda if that were the Grand Final matchup is between 57% and 73%
  • The probability that Geelong would beat the Dogs if that were the Grand Final matchup is higher than 50.5% and less than or equal to 100%.

One straightforward implication of these assertions is that the TAB Sportsbet bookie currently believes the winner of the Pies v Cats game on Friday night will start as favourite for the Grand Final. That's an interesting conclusion when you recall that the Saints beat the Cats in week 1 of the Finals.

We can be far more definitive about the four probabilities if we're willing to set the value of any one of them, as this then uniquely defines the other three.

So, let's assume that the bookie thinks that the probability of Collingwood defeating the Dogs if those two make the Grand Final is 80%. Given that, we can say that the bookie must also believe that:

  • The probability that Collingwood would beat St Kilda if that were the Grand Final matchup is about 61%.
  • The probability that Geelong would beat St Kilda if that were the Grand Final matchup, is about 66%.
  • The probability that Geelong would beat the Dogs if that were the Grand Final matchup is about 72%.

Together, that forms a plausible set of probabilities, I'd suggest, although the Geelong v St Kilda probability is higher than I'd have guessed. The only way to reduce that probability though is to also reduce the probability of the Pies beating the Dogs.

If you want to come up with your own rough numbers, choose your own probability for the Pies v Dogs matchup and then adjust the other three probabilities using the four equations above or using the following approximation:

For every 5% that you add to the Pies v Dogs probability:

  • subtract 1.5% from the Pies v Saints probability
  • add 2% to the Cats v Saints probability, and
  • subtract 5.5% from the Cats v Dogs probability

If you decide to reduce rather than increase the probability for the Pies v Dogs game then move the other three probabilities in the direction opposite to that prescribed in the above. Also, remember that you can't drop the Pies v Dogs probability below 55% nor raise it above 100% (no matter how much better than the Dogs you think the Pies are, the laws of probability must still be obeyed.)

Alternatively, you can just use the table below if you're happy to deal only in 5% increments of the Pies v Dogs probability. Each row corresponds to a set of the four probabilities that is consistent with the TAB Sportsbet markets as they currently stand.

2010 - Grand Final Probabilities.png

I've highlighted the four rows in the table that I think are the ones most likely to match the actual beliefs of the TAB Sportsbet bookie. That narrows each of the four probabilities into a 5-15% range.

At the foot of the table I've then converted these probability ranges into equivalent fair-value price ranges. You should take about 5% off these prices if you want to obtain likely market prices.

Line Betting : A Codicil

While contemplating the result from an earlier blog, which was that home teams had higher handicap-adjusted margins and won at a rate significantly higher than 50% on line betting - virtually regardless of the start they were giving or receiving - I wondered if the source of this anomaly might be that the bookie gives home teams a slightly better deal in setting line margins.
Read More

Super Smart is Taking Heed of Bookies

Across a series of blogs now we've explored the Super Smart Model (SSM) and investigated its ability to predict victory margins. In this blog we'll look more closely at which variables most influence SSM's forecasts.
Read More

Trialling The Super Smart Model

The best way to trial a potential Fund algorithm, I'm beginning to appreciate, is to publish each week the forecasts that it makes. This forces me to work through the mechanics of how it would be used in practice and, importantly, to set down what restrictions should be applied to its wagering - for example should it, like most of the current Funds, only bet on Home Teams, and in which round of the season should it start wagering.
Read More

What Do Bookies Know That We Don't?

Bookies, I think MAFL has comprehensively shown, know a lot about football, but just how much more do they know than what you or I might glean from a careful review of each team's recent results and some other fairly basic knowledge about the venues at which games are played?
Read More