Friday, December 16, 2011

Beating The Spread: Statistical Models of NFL Power Rankings and Point Spreads (Part 3)

by Tim Rubin

**Information on this site is collected from outside sources and/or is opinion and is offered "as is" without warranties of accuracy of any kind. Under no circumstances, and under no cause of action or legal theory, shall the owners, creators, associates or employees of this website be liable to you or any other person or entity for any direct, indirect, special, incidental, or consequential damages of any kind whatsoever. This information is not intended to be used for purposes of gambling, illegal or otherwise.**
_____________________________________________________________________

In Part 1 of this series I discussed the prevalence of statistical models in ratings systems for team strengths in things like Chess, Halo, etc.  In Part 2, I introduced a simple “Margin-of-Victory” style model for NFL game spreads, and I showed the power-ratings that this model learned, as well as its predictions for the Week 14 games.

In Parts 3 and 4 (which will be posted today and tomorrow), we will be looking at the Margin-Of-Victory model in a little more detail, so that we can understand how this model can be used to predict (in addition to the margin of victory), the probability of different outcomes (e.g., the probability that the home-team wins, both straight up and against the spread).  We will also be looking back to see how accurate the model’s predictions were for last week’s games.

After we have looked under the hood of this model, we will take a step back in order to look at some of the pros and cons of the model, as well as ways in which we can improve upon it.
_____________________________________________________________________
The simple Margin-of-Victory Model (Brief Review)
In Part 2 of this series, I introduced the simple Margin-of-Victory model we’ve been using.  As a reminder, here is the basic idea of the model:

Points(HomeTeam) – Points(AwayTeam) = Rating(HomeTeam) – Rating(AwayTeam) + HomefieldAdvantage + Error
  
Applied to the NFL, this model consists of 33 parameters: a “rating” for each of the 32 teams, and a single “Homefield Advantage”, which we assume to be equal for all teams and games.
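In code, the model's prediction rule is a one-liner; here is a minimal sketch with invented ratings (the homefield value of 2.5 below stands in for whatever the fitted value turns out to be):

```python
# Hedged sketch: predict_margin and the example ratings are invented for
# illustration; 2.5 is a placeholder for the fitted homefield advantage.
def predict_margin(ratings, home, away, hfa=2.5):
    """Predicted (home - away) scoring margin for one game."""
    return ratings[home] - ratings[away] + hfa

ratings = {"GB": 12.2, "CHI": 1.0}  # illustrative values, not the fitted ones
print(predict_margin(ratings, "GB", "CHI"))  # 12.2 - 1.0 + 2.5 = 13.7
```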

After fitting the model last week, I showed what the model’s week 13 power-ratings were for each team, as well as the model’s predictions for week 14.*1

Some Nice Features of this model

At a later time, we will do a more critical analysis of this model, looking at its pros and cons, etc.  But for now, I just want to point out a couple of its nice properties.  In particular, the “parameter values” of the model (i.e., each team’s Rating, and the “Homefield Advantage”) have an extremely intuitive, real-world interpretation.

First, this model expresses team ratings on the same scale as the game score.  For example, on a neutral field (i.e., with no team having a home-field advantage), a team with a +10 rating would be expected to beat a team with a +5 rating by 5 points.  The value of the “Homefield Advantage” in the model (let’s call it h) has a similarly intuitive interpretation: it can be treated as adding h to the home team’s rating.

A second nice feature is that we can arbitrarily “shift” all team ratings up or down as we please, since the only thing that matters is the difference between team ratings.  For example, we could add +100 to each team’s rating, and it wouldn’t change our predictions.  In fact, as I mentioned last week, the Sagarin “Pure Point” ratings are extremely similar to this model’s ratings, except that he “shifts” all of the team ratings so that the average rating is 20.  To illustrate this, I’ve created a plot here, in which I’ve aligned the Margin-of-Victory model’s ratings through week 13 with Sagarin’s week 13 ratings (by setting the average rating of our model to 20).  You can see in the plot that the ratings generated by the two models are generally quite similar.
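This shift-invariance is easy to verify in code (the ratings and homefield value below are invented for illustration):

```python
# The model only ever uses rating DIFFERENCES, so adding a constant to
# every rating (here +100, purely for illustration) changes nothing.
def predict_margin(ratings, home, away, hfa=2.5):
    return ratings[home] - ratings[away] + hfa

ratings = {"A": 4.0, "B": -3.0}                       # hypothetical ratings
shifted = {team: r + 100 for team, r in ratings.items()}

print(predict_margin(ratings, "A", "B"))   # 9.5
print(predict_margin(shifted, "A", "B"))   # 9.5 -- identical prediction
```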

For our Margin-of-Victory model, I chose to set the average rating to 0, since I think this leads to the most easily interpreted set of ratings (by far).  Specifically: any team with a positive rating is better than average, and any team with a negative rating is worse than average.  The more positive the rating, the better, and vice versa. 

Even better, our ratings then have a real-world interpretation: each team’s rating corresponds to how we would expect it to perform against a league-average team on a neutral field (i.e., with no team having a home-field advantage).  For example, using the power-ratings that were fit last week, the Packers’ rating of +12.2 means that they would beat an average team by about 12 points on a neutral field (or by about 15 at home), which seems like a fairly reasonable statement.

Now that you hopefully have a pretty clear intuition for how to interpret this model, let’s take a closer look at some of the underlying details.  If this sounds intimidating, don’t worry.  I will literally make no assumptions about your background.  If anything is unclear, feel free to leave a comment and I’ll do my best to either give you an answer directly or clarify the issue in the post itself.
___________________________________________________________________________

The Margin-Of-Victory Model, In Pictures

Let’s take a step back, and consider one key thing: uncertainty.  In our model, we express uncertainty using the “error” term on the right side of the equation.  But where does this error come from, and what does it mean? 

One way to think about the model is that each team has a true rating corresponding to how good the team is, and that each team’s performance always corresponds exactly to that true rating.  Under this view, any deviation of a game’s outcome from what our model predicts (i.e., any deviation from our equation:

Points(HomeTeam) – Points(AwayTeam) = Rating(HomeTeam) – Rating(AwayTeam) + HomefieldAdvantage

) is due to randomness inherent to football games.  However, this is probably not the right way to think about things.  Although there certainly is a lot of randomness in football games, the notion that each team always plays at the same rating-level seems like a stretch.  One good example of why this idea is problematic relates to injuries: in the NFL, key players have to sit out for a series, a game, or multiple games due to injury, all the time.  So each team’s starting roster is constantly in flux, and it seems highly unlikely that, with all that personnel change going on, the true rating of a team stays constant.

A more realistic way to think of the model is as follows:  each team has a true rating corresponding to their team strength, but the level at which the team actually performs will fluctuate from game to game (i.e., some of the time they play worse than their rating indicates, some of the time they play better, but on average they play at the level corresponding to their true rating). 

Modeling Variations in a Team’s Performance

A good way to model the random variation in the performance of each team is with our old friend, the normal distribution.  In other words: we assume that each team has some underlying “rating”, but that its performance across games varies according to a normal distribution centered on that rating.
The normal distribution is specified by two parameters: a mean (indicated by the Greek letter mu, μ) and a standard deviation (indicated by the Greek letter sigma, σ).  It is often more convenient to talk about a normal distribution in terms of its variance, which is simply the standard deviation squared, or σ².
Here's what the two parameters of the normal distribution do:
- The Mean (μ): This parameter controls the "location" of the distribution.  The mean of a normal distribution is also its median and mode (the peak of its curve).  In our model, each team's rating is equal to the mean of the team's distribution.
- The Standard Deviation (σ) / Variance (σ²): This controls the "spread" of the distribution.  As the standard deviation or variance of a distribution gets larger, it becomes much more likely to observe numbers that are far from the mean.
A helpful shorthand for describing a normal distribution is "normal(μ, σ²)", which denotes a normal distribution with a mean equal to μ and a variance of σ².
So, for example, a normal(0, .1) is a normal distribution with a mean of 0 and a variance of .1 (which is very small).  A normal(0, .1) distribution will mostly generate numbers very close to zero, and almost all of the numbers will fall between -1 and 1.  A normal with a mean of 0 and a standard deviation of 100, on the other hand, will generate a huge range of numbers (very few of which will fall between -1 and 1, despite this range containing the "peak" of the distribution).
The image below gives a nice feel for how likely different values are, given the parameters of a normal distribution: about 68% of all numbers generated by a normal distribution will fall within 1 standard deviation (1 sigma) of the mean, about 95% will fall within 2 standard deviations, and nearly all (99.7%) will fall within 3 standard deviations.  This is sometimes known as the 68-95-99.7 rule, or the 3-sigma rule.
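If you want to check the 68-95-99.7 rule for yourself, a quick simulation will do it (standard library only):

```python
import random

random.seed(0)
samples = [random.gauss(0, 1) for _ in range(100_000)]  # normal(0, 1) draws

# Fraction of samples within 1, 2, and 3 standard deviations of the mean.
within_1 = sum(abs(x) <= 1 for x in samples) / len(samples)
within_2 = sum(abs(x) <= 2 for x in samples) / len(samples)
within_3 = sum(abs(x) <= 3 for x in samples) / len(samples)
print(within_1, within_2, within_3)  # roughly 0.68, 0.95, 0.997
```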


Probabilities of different values for a normal(μ, σ²).  Courtesy of Wikipedia.
In our model, we have 32 team “ratings”, each of which corresponds to a specific team’s average performance across games.  Since we will model each team's performance across games using a normal distribution, with each team's “rating” describing the team's average (a.k.a. mean) performance, we will say that the “rating” for team i is μi.
To write this using the shorthand notation above: for each game that the ith team plays, it “samples” its game-performance from a normal distribution with parameters (μi, σ²).

Don’t worry if this isn’t totally clear yet.  The pictures below will help a lot in terms of understanding this idea.
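To make the “sampling” idea concrete, here is a minimal simulation sketch.  The σ of 10 points and the homefield value of 2.5 are assumed placeholders, not fitted values:

```python
import random

random.seed(1)
SIGMA = 10.0  # assumed per-game performance spread -- illustrative, not fitted
HFA = 2.5     # assumed homefield advantage -- also illustrative

def simulate_margin(mu_home, mu_away):
    # Each team "samples" a performance from normal(mu, SIGMA^2);
    # the game's margin is the difference, plus homefield advantage.
    home_perf = random.gauss(mu_home, SIGMA)
    away_perf = random.gauss(mu_away, SIGMA)
    return home_perf - away_perf + HFA

# Simulate many games between a +12.2 team (at home) and an average team.
margins = [simulate_margin(12.2, 0.0) for _ in range(50_000)]
print(sum(margins) / len(margins))  # averages out near 12.2 + 2.5 = 14.7
```

Any single simulated margin can be far from 14.7, but the average over many simulated games recovers the rating difference plus homefield advantage.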

Bean Machine Prognostication
After the AFL/NFL merger, Sir Francis Galton's infamous Bean Machine Prognosticator simply became too cumbersome for practical use.

To help visualize what it means for each team to “sample” its performance from a normal, it may be helpful to take a look at the Plinko-like machine (referenced in the first part of this series) called the “Bean Machine”.  The Bean Machine was created by Sir Francis Galton as a mechanical way to approximate samples from a normal distribution:
The Bean Machine
As balls (or “beans”) are dropped into this device, the frequency with which they land in different bins approximates a normal distribution.  Note that although the normal distribution is continuous (it doesn’t just generate integers), this machine has discrete bins, so it is in fact just a rough approximation of a normal.  Note, however, that drawn on the machine is a smooth, bell-curve-like line running through the set of bins; this line illustrates the probability distribution of the normal that the machine approximates.  Here’s a super high-speed video of one of these machines in action.
                                                                                                       
To go further with this visualization, we can think of our model in terms of “Bean Machines”, since each team is represented simply as a normal distribution.  Each team would have its own bean machine, and the full model would be composed of 32 bean machines, each placed at a different location corresponding to that team’s rating.  To simulate a game between two teams, you would drop a bean into each team’s machine and compute the difference in the outputs (you could even adjust the locations of the machines to account for homefield advantage).
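If you'd rather simulate than build one: a bean machine with n rows of pegs is just a binomial(n, 1/2) sampler, which approximates a normal as the number of rows grows.  A minimal sketch:

```python
import random

def bean_machine(rows=12, beans=20_000, seed=0):
    # Each bean bounces left or right at each row of pegs; its final bin
    # is the number of rightward bounces -- a binomial(rows, 1/2) draw,
    # which approximates a normal distribution as the rows grow.
    rng = random.Random(seed)
    bins = [0] * (rows + 1)
    for _ in range(beans):
        bin_index = sum(rng.random() < 0.5 for _ in range(rows))
        bins[bin_index] += 1
    return bins

bins = bean_machine()
print(bins.index(max(bins)))  # the peak lands in or near the middle bin (6)
```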
_____________________________________________________________________
And Now for Something Completely Different

If you are not used to thinking about things in terms of probability distributions, this stuff can be tough to wrap your mind around.  In the next post, I’ll be giving further details (and pictures). 

But for now, let’s move away from the math, and talk about evaluating model performance.
_____________________________________________________________________
Assessing Model Performance

In terms of assessing model performance, there are many things we could look at.  I am going to stick to the following two questions:

  • How accurately did we predict the margin of victory, compared to the Vegas line?
  • How did our model do, if we were to use it to pick against the spread?
Before we get into looking at results of our model's performance from last week, here is some general food-for-thought, in terms of using models to beat the spread.

1.)  We have no estimate of how reliable our model is for the purpose of gambling:
Before using any model for real-world prediction, you need an estimate of how well it can make predictions.  This is something that we can do, but it’s not as simple as it may at first seem: you can’t simply look at how our current model has performed on past games (because our model has already observed those outcomes, so that would be cheating).  You actually need to do something a little more complicated, along the lines of “cross-validation”.
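For the curious, “cross-validation” can be sketched as follows: hold out one week of games, fit the model only on the remaining weeks, and score the predictions for the held-out week.  Everything below (the crude gradient-descent fitting routine, the three-week toy “season”) is an invented stand-in, with homefield advantage omitted for brevity:

```python
# Toy sketch of leave-one-week-out cross-validation; the fitting routine
# and data are hypothetical stand-ins, not the model fit used in the post.

def fit_ratings(games, n_iters=200, lr=0.05):
    # Crude least-squares fit by repeated small corrections.
    ratings = {}
    for home, away, _ in games:
        ratings.setdefault(home, 0.0)
        ratings.setdefault(away, 0.0)
    for _ in range(n_iters):
        for home, away, margin in games:
            err = (ratings[home] - ratings[away]) - margin
            ratings[home] -= lr * err  # nudge both ratings toward the result
            ratings[away] += lr * err
    return ratings

def predict(ratings, home, away):
    return ratings.get(home, 0.0) - ratings.get(away, 0.0)

season = {  # week -> [(home, away, home margin), ...]  (invented scores)
    1: [("A", "B", 7), ("C", "D", -3)],
    2: [("A", "C", 4), ("B", "D", -6)],
    3: [("A", "D", 10), ("B", "C", -2)],
}

squared_errors = []
for held_out_week in season:
    train = [g for wk, games in season.items() if wk != held_out_week
             for g in games]
    ratings = fit_ratings(train)  # the held-out week is never seen here
    for home, away, margin in season[held_out_week]:
        squared_errors.append((predict(ratings, home, away) - margin) ** 2)

print(sum(squared_errors) / len(squared_errors))  # out-of-sample MSE
```

The key point is that each prediction is scored on games the fitted model never saw, so the error estimate isn't contaminated by "cheating".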

2.)  The model is not designed to beat the spread: the model we’ve been discussing is designed for two things: estimating team strength, and predicting game outcomes in terms of point-differentials.  So, really, our model is better thought of as a handicapping system (i.e., a system to set the spreads, not beat them).

And sure, you could use our model’s predictions to choose which team to bet on.  And, in fact, you could do a lot worse (for example, by always picking “public teams”*2).  And there are certainly dumber methods for picking teams (though I’m not sure I’d want to challenge the octopus in a picks contest; it kinda feels like picking against Tebow right now).

And there’s a more fundamental point here.  Let’s think about what our model’s input is, what it does with this input, and what its output is:

- MODEL INPUT:  For each game, the only game data that the model ever sees is the home and away team names, and their scoring differential.  Our model never sees what the Vegas line is.
- MODEL FITTING:  The model takes this data, and then optimizes the parameters with respect to team-strengths and homefield advantage.  It does not optimize anything with respect to the spreads.  
- MODEL OUTPUT:  The model output is a rating for each team, and the value of homefield advantage.

At no point in this process does the model touch, look at, or even recognize the existence of such a thing as a Vegas spread.  So, if we wanted to use this model to pick against the spread, the model sure isn't going to do it on its own.  Furthermore, it is probably not optimized with respect to picking against the spread (in fact, as discussed in Part 2, it is optimized with respect to minimizing the squared error of its predictions of game outcomes).  So just because this is a good model doesn't mean it's a good model for beating the spread.
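To make the fitting step concrete, here is a hedged sketch of one standard way to set up the least-squares problem (not necessarily the exact routine behind the posted ratings): each game becomes one row of a design matrix, with +1 in the home team's column, -1 in the away team's column, and a final column of ones for homefield advantage.  The teams and scores are invented; note that the Vegas line appears nowhere:

```python
import numpy as np

# Invented toy data: (home, away, home margin).  No spreads anywhere.
teams = ["A", "B", "C", "D"]
games = [("A", "B", 9), ("B", "A", -5), ("C", "D", -1),
         ("D", "C", 5), ("A", "C", 6), ("B", "D", -4)]
idx = {t: i for i, t in enumerate(teams)}

X = np.zeros((len(games), len(teams) + 1))
y = np.zeros(len(games))
for g, (home, away, margin) in enumerate(games):
    X[g, idx[home]] = 1.0   # home team column
    X[g, idx[away]] = -1.0  # away team column
    X[g, -1] = 1.0          # homefield-advantage column
    y[g] = margin

# lstsq returns the minimum-norm least-squares solution, which resolves
# the arbitrary overall shift in the ratings; we then center them at 0.
params, *_ = np.linalg.lstsq(X, y, rcond=None)
ratings = dict(zip(teams, params[:-1] - params[:-1].mean()))
homefield = params[-1]
print(ratings, homefield)  # A=3, B=-4, C=-1, D=2, homefield=2 (toy data)
```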

3.)  Ogres are like onions, and “models” are like Rainman.  Statistical modeling is extremely powerful, and models can do amazing things.  Models can accurately predict the outcome of elections, or predict the mass of elementary particles before they have even been observed (granted, this second example comes from physics, and doesn’t really apply or even belong here.  But it’s still pretty cool).

But even if a statistical model is accurate at predicting one thing, it does not necessarily mean that it is a good model to use for your intended purpose.  To make a model that is well-suited for your own purposes, you first need to think about what it is that you want to understand or predict.  You then need to formulate the model in such a way that it provides you with the information that you need.  And this is where Rainman comes in.

OK, so if you aren’t familiar with the movie Rainman…well, I really have nothing to say to you other than that you should go see it.  It’s fantastic. It’s so good that you forget all about the fact that Tom Cruise is completely insane (do I really need a link for this?  Probably not, but whatever).

Anyway, think of Tom Cruise’s character as “the gambler” (young guy, looking for any edge, even if it involves taking advantage of his autistic brother).  And think of Rainman (the aforementioned autistic brother, a savant who can track every card in a 6-deck shoe of blackjack) as “the statistical model”.

Now, if the Gambler were to give Rainman no instructions and sit him in front of the blackjack table with $10,000, Rainman would have no idea what to do, and would probably lose all of the Gambler's money.  That is because Rainman doesn’t know what he’s supposed to do unless you explicitly tell him.  So instead, the Gambler tells Rainman to keep track of the tens in the deck, and to bet more money when there are lots of tens remaining (this is how card counters get an edge in blackjack).  And by doing this (namely, by explicitly telling Rainman what information he needs to provide), Rainman and the Gambler are able to win lots of money in a classic 1980s musical montage.

The point of this little metaphor is this: if you want to use statistical models to help you win money betting against the spread, you’d better make sure that you are not just blindly using the model’s output to place bets.  For example, our Margin-of-Victory model gives us a totally reasonable estimate of the outcome in terms of score-differential.  Now let’s say it predicts a home-team victory by 5 points, and the Vegas spread is 2.5.  The question is: should we bet on the home team?

The answer, for now, is that we simply don’t know.  The problem is that the model doesn’t give us an estimate of the probability that the home team will win by at least 3 points.  As of now, all we can say for sure is that the model’s estimate suggests this bet would win more often than not (that is, that the home team has a greater than 50% chance of winning by at least 3 points).

But that is not enough information to make money on a bet.  To really beat the spread in Vegas (in the sense of turning a profit), you need your picks to win more than about 52.4% of the time, because at standard -110 odds you risk $11 to win $10; the house’s cut on your winning bets is what pushes the breakeven rate above 50%.
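The breakeven rate falls straight out of the odds:

```python
# At standard -110 odds you risk $11 to win $10, so you break even only
# when your winning fraction equals risk / (risk + payout).
risk, payout = 11.0, 10.0
breakeven = risk / (risk + payout)
print(round(breakeven, 4))  # 0.5238
```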

Now, it turns out that we can in fact use our model to estimate probabilities of specific outcomes (such as the probability that a team will win by at least x points, or the probability that a team will just flat-out win).  And in Part 4, we will be looking at just that.
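As a preview of Part 4, here is a hedged sketch of how such a probability can be computed: treat the game margin as normally distributed around the model's prediction, and read the tail probability off the CDF.  The margin standard deviation of 13 points below is an assumed placeholder, not a fitted value:

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    # Normal CDF via the error function (math.erf is in the stdlib).
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# Hypothetical numbers: the model predicts a home win by 5, and we assume
# a margin standard deviation of 13 points (illustrative, not fitted).
predicted_margin, margin_sd = 5.0, 13.0

# P(home team beats the 2.5-point spread, i.e. wins by 3 or more)
p_cover = 1.0 - normal_cdf(2.5, predicted_margin, margin_sd)
print(round(p_cover, 3))  # about 0.58 with these assumed numbers
```

Note that under these assumed numbers the cover probability clears 50% but only barely clears the ~52.4% breakeven rate, which is exactly why the raw predicted margin alone isn't enough to justify a bet.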

_____________________________________________________________________
Stay Tuned for Part 4 of this series.  In part 4, we will look at (1) how our model performed last week, (2) some more pretty pictures, and (3) how to compute the probability of different outcomes using the Margin-of-Victory Model.
_____________________________________________________________________
Footnotes:
*1  I don't want to beat a dead horse here, but remember: this model's predictions are predictions about the line.  They are not a system for beating the spread, nor do they even (on their own) tell you anything about which bets would give you a positive return on investment.  So for now, unless you are in a weekly competition to guess the lines with Bill Simmons on The B.S. Report, I would hold off on using this model to make any bets until we've had a chance to look a little more deeply at it.
*2  I got the term “public team” from Chad Millman.  I highly recommend checking out some of his articles, or tweets, about this idea (and more generally for his insight into Vegas handicapping and related topics).  He's also a fairly frequent guest on Bill Simmons' podcast.

