**Information on this site is collected from outside sources and/or is opinion and is offered "as is" without warranties of accuracy of any kind. Under no circumstances, and under no cause of action or legal theory, shall the owners, creators, associates or employees of this website be liable to you or any other person or entity for any direct, indirect, special, incidental, or consequential damages of any kind whatsoever. This information is not intended to be used for purposes of gambling, illegal or otherwise.**

_______________________________________________________________________________________

In Part 1 of this series I discussed the prevalence of statistical models in ratings systems for team strengths in things like Chess, Halo, etc. In Part 2, I introduced a simple “Margin-of-Victory” style model for NFL game spreads, and I discussed how we “fit” the model, and then generated power-ratings with the model, using the results from this year up to week 13. I then used these to generate predictions for the Week 14 games. In Part 3, I discussed using normal distributions to model each team’s variation in performance.

In Part 4, we will look at how to model game outcomes and estimate win probabilities, both against the spread and straight up. I know that in Part 3 it may have been unclear why I spent so much time talking about normal distributions and bean machines, and where all of that was going. This post should (I hope) clarify all that.

**NOTE:** I’m providing an Excel spreadsheet to accompany this post. You don’t need it to follow anything in this post. But I’ll be demonstrating some things in Excel, and it may be helpful—especially if you are a hands-on learner—to go ahead and download it here so you can follow along (and see exactly what I did) when I refer to it later.

**UPDATE:** Here is an updated version of the spreadsheet. It doesn't go along with the examples used in the post. However, I've updated it to correspond to a real-world scenario (tonight's game); it incorporates the home-field advantage and team ratings (for the Saints and Falcons) from Sagarin's "Pure-Point" model, the Vegas spread on the game, and more realistic variance estimates.

_______________________________________________________________________________________


The title of this post (and the corresponding figures) gives a rough outline of the starting and end points of what we’ll be covering in the post: we will start with the simple Margin-of-Victory model that was introduced a couple of posts back, and show how this model can be used to estimate win probabilities for any matchup, either straight up or against the spread.

**_______________________________________________________________________________________**

__The Story So Far__

Let’s review everything we’ve covered so far as quickly as possible, just to refresh everyone’s memory.

This is the equation that I gave in Part 2 to describe our Margin-of-Victory model:

Points_{HomeTeam} – Points_{AwayTeam} = Rating_{HomeTeam} – Rating_{AwayTeam} + HomefieldAdvantage + Error

Essentially everything you need to understand about the model can be captured by a couple of simple graphs, which are shown below.

Note that to keep things from getting too messy, I’m only showing 4 of the 32 teams here (furthermore, these aren’t the true estimates of these teams’ ratings; I chose these values for illustrative purposes).


**Team Power Ratings (top graph)**

Our model assigns each team a *rating*. This rating is a single real-numbered value: teams with positive ratings (Rating > 0) are better than an average team; teams with negative ratings are worse than an average team.

*Ratings* are on the same scale as *points*. What this means—in terms of understanding a game outcome—is that a team with a rating of +6 is expected to beat a team with a rating of +1 by 5 points.

All team ratings are constrained so that the average rating is 0. This gives the team ratings some inherent meaning outside of specific matchups: a team with a +6 rating would be expected to beat an *average* team by 6 points on a neutral field.

So, in the image
above, the distance between any pair of teams would give an estimate of the
outcome of a game between the two teams (on a neutral field). And the
Packers and 49ers are above-average teams, while the Seahawks and Vikings are
below-average teams.

Our model also learns a *homefield advantage*, which is on the same scale as ratings and points, and so can simply be added to the expected game outcome, expressed in terms of Points_{Home} – Points_{Away}. So if the value of homefield advantage equals 3 points, a team rated +6 would be expected to beat a team with a +1 rating by 8 points at home, 5 points on a neutral field, and 2 points on the road.

**Team Performance as normal distributions (bottom graph)**

To account for
the fact that each team’s performance will vary across games, we model each
team using a normal distribution. This is illustrated in the bottom plot of the
figure above.

The mean (or center) of each team’s normal distribution is equivalent to its *rating*. For each team, I’ve shown the team *rating* using a dashed vertical line, and the team’s rating distribution (which captures the variability in the team’s performance across games) using a solid line; this solid line corresponds to a normal distribution.

This is how we use normal distributions to model team ratings:

When describing a normal distribution, we only need two parameters: a mean (μ) and a variance (σ^{2}). With these parameters we can then define any normal distribution using the notation:

Normal(μ, σ^{2})
- The μ parameter is the mean of the normal distribution; its value is the location of the center or “peak” of the distribution. The mean for each team (μ_{TEAM}) is that team’s *rating*.

- The σ^{2} parameter is the *variance* of the normal distribution; it captures how “spread out” the distribution is, or the amount of uncertainty in the distribution. The higher the variance, the more likely it is that the distribution will generate values that are far from the mean. In terms of our model, a higher variance would indicate that teams’ performances vary more from game to game.
If we increased the variance of the distributions in the
figure above, there would be more of an overlap between the team distributions,
and if we decreased the variance there would be more separation between the
distributions.

In fact, if we started with the plot on the bottom and kept decreasing the variance of the normal distributions, the width of the distributions would shrink until eventually the bottom plot became equivalent to the top plot. When the variance = 0, the only value with any probability is the mean (μ) itself—and this would no longer be a “normal distribution” but instead a Delta Function.

For the sake of illustration, I’ll be assuming a small team variance (σ^{2} = 4) throughout this post. But it’s important to note that this is a *significantly* smaller variance than we actually get when fitting the model; the actual estimate for team variance is closer to 100.^{1} It is more convenient for illustrative purposes to use a variance of 4, but remember that there is *much* more uncertainty in actual football than the figures in this post will indicate.
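The review above can be sketched in a few lines of Python, using the standard library’s NormalDist. The four team ratings below are hypothetical stand-ins for the figure’s values (which aren’t given in the text); the variance is this post’s toy value of 4.

```python
# A minimal sketch: each team's performance is modeled as Normal(rating, variance).
from statistics import NormalDist

VARIANCE = 4
SIGMA = VARIANCE ** 0.5

# Hypothetical ratings, constrained so that the average rating is 0.
ratings = {"Packers": 5.0, "49ers": 3.0, "Seahawks": -3.0, "Vikings": -5.0}
teams = {name: NormalDist(mu=r, sigma=SIGMA) for name, r in ratings.items()}

# The mean (center) of each team's distribution is its rating:
assert teams["49ers"].mean == 3.0
assert sum(ratings.values()) == 0.0

# A larger variance spreads a team's performances farther from its rating;
# e.g. less of the probability mass stays within +/- 2 points of the mean:
def mass_within_2(dist):
    return dist.cdf(dist.mean + 2) - dist.cdf(dist.mean - 2)

assert mass_within_2(NormalDist(0, 2)) > mass_within_2(NormalDist(0, 10))
```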

**_______________________________________________________________________________________**

*That is the heart of it. Now begin in the middle, and later learn the beginning; the end will take care of itself.*

- Harlan Ellison

I realize that each part of this series has introduced more concepts, and if this sort of thing is new to you it may feel a bit overwhelming. So let me make a couple of points before moving on:

I’ve been doing my best to (1) assume no background knowledge on the part of the reader, and (2) keep each post as self-contained as possible. But that doesn’t mean everything is going to instantly make sense in your head. In fact, I really wouldn’t expect it to (it took me a *long* time doing statistical modeling before I really started to feel comfortable with this sort of thing). So I don’t expect everything to have totally clicked in your head yet.
That said, I
think that the best thing to do is to push forward. Even if you only have the
general gist of what we have covered so far, you can probably get a good feel
for what we’ll be covering in this post: that is, how we can model game
outcomes, and use this to understand win probabilities.

This is why I’ve
put the Harlan Ellison quote above.
We’re going to be covering stuff that draws on material from previous
posts. But, you shouldn’t feel like you
need a full handle on everything from the previous posts (or even that you need
to have read them) to follow this post; learning doesn’t always happen in a
totally linear fashion.

So let’s move forward,
getting into the middle of things now.
But don’t sweat the details if it isn’t all immediately clear; once you
have the broader picture of how this all works, I’ll bet that earlier pieces
that didn’t totally click for you will start to fall into place.

**_______________________________________________________________________________________**

__MODELING GAME OUTCOME PROBABILITIES__

Here, we are going to make the key step in moving from modeling individual teams to modeling game outcomes.

To make the
ideas here more concrete, let’s use a single example of a game:

**Example: Modeling the outcome of a single game**

Suppose that we are modeling the outcome of a game between the 49ers and the Seahawks. In order to help define the game, we will say that the 49ers are the “home team”, but to keep things simple we will assume that there is *no homefield advantage* (i.e., we will set the value Homefield-Advantage = 0).
What “modeling
the outcome” of a game means is that we want to consider the relative
probabilities of all of the different possible game outcomes, in terms of score
differential (more explicitly: we want to model game outcomes using a
probability distribution).

It takes two
steps to get our probability distribution for the game outcome: (1) define the matchup,
and the teams’ rating distributions, and (2) use these distributions to
estimate a distribution of game-outcomes.

__Step 1: Define our Game__

What we start with is the normal distributions that describe each team. Normally, we would first “fit” the model (i.e., learn all the parameter values from data) and then use these values to define our game. I’ll be demonstrating how to fit the model in Excel in our next post; for now, I’m just going to assume some parameter values that will make the example as straightforward as possible.

So, for this
example we will say that the 49ers have a rating of +4, and the Seahawks have a
rating of -2.

We will also
assume the following:

(1) All
teams’ variances are equal to 4

(2) That
homefield-advantage = 0.*

* Note that ignoring homefield advantage is *completely* for the purpose of keeping the example as simple as possible; it is very easy (and important) to account for a homefield advantage, but here it would just be a nuisance. For example, if we *did* include homefield advantage and set its value equal to +3, the equivalent game could be defined by simply lowering the 49ers rating by 3 points. *
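The equivalence in the footnote above can be checked in a couple of lines. This is just a sketch of the model’s expected-margin equation; `expected_margin` is a helper name for illustration, not something from the spreadsheet.

```python
# Expected margin under the Margin-of-Victory model:
# Points_home - Points_away = Rating_home - Rating_away + HomefieldAdvantage.
def expected_margin(home_rating, away_rating, homefield=0.0):
    return home_rating - away_rating + homefield

# Our example (49ers +4 at "home" vs. Seahawks -2, no homefield advantage)...
no_hfa = expected_margin(4, -2, homefield=0)

# ...is equivalent to a game with homefield = +3 and the 49ers lowered by 3:
with_hfa = expected_margin(4 - 3, -2, homefield=3)

assert no_hfa == with_hfa == 6
```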
So, we have two
teams. For each team we have a rating and a variance. These define
a normal distribution for each team, which captures how their performance
varies across games.

As a reminder, we can use the shorthand notation to describe these two teams’ normal distributions:

Rating_{49ers} ~ Normal(+4, 4)

Rating_{Seahawks} ~ Normal(-2, 4)

[Read as, e.g.: “The Seahawks rating is normally distributed, with a mean of -2 and a variance of 4.”]

Now that we have defined our game, we are ready to model the *outcome* of the game. Since the 49ers are our “home team”, this means that we want to estimate the following:

Points_{49ers} – Points_{Seahawks} = *Game Outcome*

We already know how to get a single value (a *point estimate*) of the game outcome. But now we want to think about this in terms of a *probability distribution*; we want to know the probabilities of all possible game outcomes.

The reason we need the probabilities of different game outcomes is that they allow us to estimate the probability of winning a specific bet (e.g., the probability that the 49ers beat the spread).

__Step 2: Model the game outcomes__

Let’s quickly think about what it *means* to model the outcome of the game. The most intuitive way of thinking about this may be in terms of *simulating* game outcomes.

**Step 2; Version 1: Simulating game-outcomes**

Let’s imagine that we *simulate* a bunch of games, and look at the distribution of outcomes. That is: we will “sample” a value from each of the teams’ normal distributions (think of drawing a sample from a “bean machine” if this helps you visualize it), and use these to get a bunch of simulated game outcomes.
Each sample from a team’s normal distribution gives us that team’s performance level (rating) for the specific game we are simulating. And to determine how these sampled ratings translate to point differentials, we simply plug them into that original equation:

Points_{HomeTeam} – Points_{AwayTeam} = Rating_{HomeTeam} – Rating_{AwayTeam}
So suppose we
start randomly sampling from both teams’ rating distributions, and computing
the outcomes for each simulation. In the figure below, I’ve done just
that:

After repeating this process for a while, we would have a bunch of *samples* of outcomes that our model generates. We could then use all these samples to estimate different probabilities; for example, we could estimate the probability that the 49ers win by counting the proportion of our samples that have a positive value, or estimate the probability that the 49ers beat the spread by counting the proportion of samples with values greater than the spread.
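Here is a sketch of that sampling procedure in Python (mirroring what the companion spreadsheet does, but not taken from it). With enough simulated games, the counted proportions settle near the exact values this post computes later.

```python
# Simulate the example game many times: sample each team's performance from
# its rating distribution; the outcome is the difference of the samples.
import random

random.seed(0)        # fixed seed so the estimates are reproducible
N = 100_000           # number of simulated games
SIGMA = 4 ** 0.5      # team variance of 4 => standard deviation of 2

outcomes = [random.gauss(4, SIGMA) - random.gauss(-2, SIGMA) for _ in range(N)]

# Proportion of simulations where the 49ers win outright (outcome > 0),
# and where they beat a spread of 4 (outcome > 4):
p_win = sum(o > 0 for o in outcomes) / N
p_cover = sum(o > 4 for o in outcomes) / N

assert abs(p_win - 0.983) < 0.01
assert abs(p_cover - 0.760) < 0.01
```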
Luckily, there
is a much easier and more direct way to estimate the outcome probabilities: we
can simply define the game outcome using a probability distribution.

**Step 2; Version 2: Directly defining the probability distribution of the game outcome**

Here, what we
want to do is to define a probability distribution of game outcomes, just as we
defined the distribution of team Ratings. In other words, we want a
probability distribution that gives the relative probability of all possible
game outcomes.

This happens to be particularly easy in our case—thanks to a convenient property of the normal distribution:

*The difference between two samples generated from normal distributions is normally distributed!*^{2}
In other words,
since each individual team’s rating is sampled from a normal distribution, and
the game-outcome is simply the difference between the teams’ ratings, the game
outcome can be modeled using a normal distribution.

Furthermore, the parameters of the normal distribution that defines the game outcome are extremely easy to compute directly from the parameters of the team distributions. These are the rules for computing the mean and variance of the game outcome from the team distributions:

- The mean (μ) of the game outcome = [ mean of the home team ] – [ mean of the away team ]

- The variance (σ^{2}) of the game outcome = [ variance of the home team ] + [ variance of the away team ]

Put simply: the mean of the distribution of outcomes equals the difference of the team means. The variance of the game outcomes is the sum of the individual team variances (and in our case, since all team variances are equal, this is simply two times the team variance).

Applying these rules to our example game:

- μ_{OUTCOME} = μ_{49ers} – μ_{Seahawks} = 4 – (-2) = 6

- σ^{2}_{OUTCOME} = σ^{2}_{49ers} + σ^{2}_{Seahawks} = 4 + 4 = 8
That’s all there is to it. We can now define the normal distribution for game outcomes:

*Game Outcome* ~ Normal(6, 8)

Remembering that the game outcome is expressed in terms of point differential:

*Game Outcome* = Points_{49ers} – Points_{Seahawks}
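The two combination rules, applied to the example’s numbers, can be written out directly (again using the standard library’s NormalDist):

```python
# Combine the team distributions into the game-outcome distribution:
# mean = difference of team means; variance = sum of team variances.
from statistics import NormalDist

mu_49ers, var_49ers = 4, 4
mu_seahawks, var_seahawks = -2, 4

mu_outcome = mu_49ers - mu_seahawks      # 4 - (-2) = 6
var_outcome = var_49ers + var_seahawks   # 4 + 4 = 8

game_outcome = NormalDist(mu=mu_outcome, sigma=var_outcome ** 0.5)

assert game_outcome.mean == 6
assert abs(game_outcome.variance - 8) < 1e-9
```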

__Example Game: Team Distributions and Game Outcome Probabilities__

Now, let’s look at plots of the probabilities for the teams and the game outcomes, to see how this all comes together:

So, it should hopefully be pretty easy to see how we get from the first picture to the second. The mean of the game outcomes is equal to the 49ers’ average *rating* minus the Seahawks’ average *rating*. The variance of the game outcome is larger than the variance of the team distributions (2x larger, to be precise).
Since the notion of “variance” is less straightforward than the mean, here’s an example to help you understand why the game outcome is going to have more variance than the individual teams.

Think of modeling each individual team’s performance as a single die. If this were the case, each team could only generate the six values between 1 and 6. Now imagine that we model the game’s outcome using the sum of the values of the two dice (this also works using the difference, but let’s use the sum since everyone is familiar with it). Obviously, there are more possible outcomes for the sum of the dice than for each individual die (11 rather than 6, since two dice can sum to anywhere from 2 to 12). This same principle is at work when we are working with the difference (or sum) of normal distributions.

In both cases, *most of the time, you aren’t going to get an extreme value*. The roll of two dice will on average sum to seven (which is simply the sum of the average value of each individual die), just as the difference between two normals will on average be equal to the difference of the individual normal distributions’ means.
However, because
we now have two randomly varying values, in rare cases, you will get extreme
values from both distributions, which is what causes the increase in
variance. With the sum of two dice rolls, you get extreme values (e.g., 2
or 12) when both dice are either high or low.

Now, looking at the pictures above, you can see the equivalent situation for the normal distributions that would lead to extreme values. If the 49ers perform above their average, and the Seahawks perform below their average, we would get a game outcome off on the far right of the outcome distribution (say, greater than 12). On the other hand, if the 49ers performed well below average and the Seahawks performed well above average, the outcome would be on the far left end of the distribution (say, less than 0).
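The dice claim above is easy to verify by brute force, enumerating all 36 equally likely rolls:

```python
# Variance of the sum of two independent dice is twice the variance of one die
# (the same additivity that doubles the variance of our game-outcome model).
from itertools import product
from statistics import pvariance

one_die = [1, 2, 3, 4, 5, 6]
two_dice = [a + b for a, b in product(one_die, one_die)]  # all 36 rolls

assert sorted(set(two_dice)) == list(range(2, 13))   # 11 outcomes: 2 through 12
assert abs(pvariance(two_dice) - 2 * pvariance(one_die)) < 1e-12
```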

__Game Outcome Distributions: Point Spreads and MoneyLines__

Once we have defined the probability distribution that represents the game outcome in terms of point differential, it is straightforward to compute the probability of *any outcome you are interested in*.
For obvious reasons, the two
probabilities that people are most interested in are (1) the probability that
the outcome is greater or less than a specific value (i.e. the point-spread),
and (2) the probability that each team will win/lose the game (corresponding to
the “moneyline”).

To do this, let’s look again at
the distribution of outcomes for our example game:

**The Money-Line**

It’s fairly straightforward to think about the win probabilities here, since values greater than zero correspond to outcomes in which the 49ers win, and values less than zero correspond to outcomes in which the Seahawks win. In other words, the probability of, e.g., the 49ers winning equals the probability that this distribution generates a positive value.

**The Spread**

It’s not much harder to think about how we could use this model to estimate the probability of beating the spread. For example, suppose that the spread of this game had the 49ers favored by 4. It should be clear that in that case we would want to pick the 49ers, since the mean of the distribution is actually 6. And the *probability* of beating the spread would be equal to the probability that this distribution generates a value greater than 4.
So, the question now is: how do we compute these probabilities from the game outcome distribution?

**The relationship between the spread and the moneyline**

We can use this same idea of the
distribution of game outcomes to understand the deep connection between the
“moneyline odds” and the probability of “beating the spread”. If we had
to estimate what the moneyline odds should be, we simply consider the
probability of the outcome being less than or greater than 0. The
probability of beating the spread is essentially the same thing, just instead
of comparing this distribution to the value 0, we compare the distribution to
the value of the spread.

To make this connection clear,
let’s think about things in terms of our odds of winning a bet, or our
“win-odds”. Imagine that all we care about is whether or not the game
outcome is greater than or equal to the dark vertical line (currently at
“0”). If we are making a money-line bet on the 49ers, this means
that we are betting the 49ers to win straight-up, so the “win-odds” of this bet
are equivalent to probability of the distribution generating a value greater
than 0.

Now, what if we are betting against the spread? To think about this, just imagine that we slide the dark vertical line either to the left or to the right. If the 49ers were favored by 6, we would slide the dark line over to the value +6. Our “win-odds” on a bet for the 49ers against the spread would then be equal to the odds of generating a value greater than 6 from the distribution of game outcomes. This is equal to 50%, since the mean of this distribution is 6. And, since the house takes a cut of your winnings, this is obviously a bad bet (unless you make the bet against a friend, in which case it’s a fair bet, just not a *good* bet per se).
However, what if the 49ers were
only favored by 4? To get a picture for how to think about this, imagine
that we slide that dark vertical line to +4, which I’ve done for this picture:

From the picture, it’s clear that a bet on the 49ers against a spread of 4 would have a greater than 50% chance of winning here. But just how good *are* our chances of winning that bet? In other words, how do we compute the probability of the outcome being greater than or less than a specific number?

**Side-Note: What's a good bet?**

It’s worth pointing out here that with the standard Vegas odds on a bet against the spread, you need to win 11 out of every 21 bets you make to break even (or about 53% of the time). To see this, imagine we were making a series of bets against the spread, in increments of $110:

The standard odds (in sportsbook
terminology) of “-110” means that you win $100 on a bet of $110. So, if we lost 10 straight $110 bets, we’d be
down a total of $1,100. To make this
money back, we’d have to win 11 straight $110 bets, since we only get back $100
on each bet.

In terms of our model, this roughly
means that we need to have at least a .53 probability of the game outcome being
on the side of the spread we pick, in order to break even.
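The break-even arithmetic above can be written out directly. This is just a sketch; the -110 convention is standard, but the function name is mine, for illustration.

```python
# At -110 odds you risk $110 to win $100; you break even when expected profit
# is zero:  p * 100 == (1 - p) * 110   =>   p = 110 / 210 = 11/21.
def breakeven_rate(risk=110.0, win=100.0):
    return risk / (risk + win)

p = breakeven_rate()
assert abs(p - 11 / 21) < 1e-12   # win 11 of every 21 bets
assert 0.52 < p < 0.53            # i.e., roughly the 53% quoted above
```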

__Computing outcome probabilities__

First, let’s think about the indirect method we discussed previously: **simulation**. We could generate a bunch of values from the distribution of game outcomes, and use the proportion of values that were less than or greater than zero to estimate each team’s win probability (or less than or greater than the *spread* to estimate your odds of beating the spread).
Although this is an indirect method, it is a totally valid estimation method. Furthermore, it is fairly easy, since we can generate tens of thousands of random samples from a normal distribution extremely fast on a computer. In the Excel file that serves as a companion to this post, I’ve implemented this exact simulation for our example game (note that there are three “sheets” in the Excel file; the sheet labeled “GAME_OUTCOME_SIMULATION” is the one I’ll be discussing here).

This file simulates the game we’ve been talking about 2,500 times. For each game it samples a random value for each team’s rating from their rating distribution. It also computes the proportion of simulated games which result in (1) the 49ers winning outright, and (2) the 49ers beating a spread of 4.

You can adjust various parameter settings on the left to see how they affect the game simulations. The parameters that you can change without things getting all screwy are the *mean* and *variance* for either of the teams, and the *game spread*. Essentially, everything in the big table on the right deals with simulating the games (i.e., generating samples from the team distributions), and the parameters and outcome summaries are in the 4 tables on the left.
Since
simulation is not the method I’m recommending, I’m not going to discuss all the
details of that spreadsheet. However, I’ve done my best to label things
so that it is fairly clear what everything is doing, if you wish to look at
it.

The reason that I’m not recommending this method is that there is a much easier and more direct way to do this: *we can directly compute the probabilities from the normal distribution of the game outcomes.*
Look at the following images:

The top image corresponds to the
“moneyline” betting scenario discussed above, and the bottom image corresponds
to the betting “against the spread”. In each image, the area under
the curve for the region of interest (shown in red for the 49ers, and blue for
the Seahawks) corresponds to the win probabilities for those teams. So rather than simulating games, all we need
to do is compute the area of these regions.

Ok, so I know that “calculating the area under the curve” tends to bring to mind bad memories in many people, myself included. So let’s make something clear: *we will not be doing any calculus here* (not directly, anyway). No integrals, no derivatives.

**Computing win probabilities directly from the normal distribution**

*NOTE: The second sheet of the Excel file, labeled “DIRECT_COMPUTATION”, goes along with this section.*

As I said, the way to directly
compute the probabilities for any outcome of interest here is by computing the
area under the curve for the region you are interested in.

So, for example, suppose that we want to compute the probability of the Seahawks beating the spread of 4. What we want to do in this case is compute the total probability of the game-outcome distribution generating a value smaller than 4 (i.e., the probability that Points_{49ers} – Points_{Seahawks} will be less than 4). The way we calculate this is by computing the area under the normal distribution of the game outcome, for the region to the left of 4 (corresponding to the blue region in the 2nd plot of the last figure).
The reason we don’t need calculus to do this is that the normal distribution is *extremely common*, so there are plenty of tools out there that will do this for you. In fact, if you’ve ever taken an intro stats class, you’ve probably looked up a “significance level” in a big table like this, which is just a giant table of values based on integrating a normal distribution. We will do something a little more sophisticated than looking at a table: we can compute the area under the curve of a region of a normal distribution in Excel, using the following command:
=NORMDIST(x, μ, σ, TRUE)

In English, this command essentially means: “compute the area under the curve of a normal distribution (with parameters μ and σ) in the region to the left of *x*.” If you have some familiarity—or faint memory—of calculus, this is equivalent to saying “integrate a normal (with parameters μ and σ) from negative infinity to x”.
In the language of Excel, you can
think about the command I’ve given above as follows:

- The parameters μ and σ are the parameters of the normal distribution.**

- The *x* parameter is used to define the region of interest: if you set it to 0, it will give you the area in blue in the first plot above (the probability that the Seahawks win the game), and if you set it to 4, it will give you the blue region in the second plot (the probability of the Seahawks beating the spread).

- The “TRUE” argument is used to indicate that we wish to *integrate* the normal distribution. The only other valid value to use here is “FALSE”, which will give you the *height* of the normal distribution at that location (which really isn’t useful here, except for plotting the distribution).
** Note that Excel uses σ and not σ^{2} to define the normal distribution. The parameter σ is the “standard deviation” of the normal, which is just the square root of the variance σ^{2}. **
I’ve used this Excel command to
directly compute all of the win-probabilities we have discussed for our example
game on the second sheet of our Excel file. So in that file, you can see the exact
commands I’ve used to compute any of the regions from the plots above.

As a single example of how to use this Excel command, enter the following in any cell of a spreadsheet to compute the probability that the Seahawks beat the spread in our example game:

= NORMDIST(4, 6, 2.83, TRUE)

where x=4 because we are
interested in the region to the left of the spread of 4; 6 is the mean of the
game-outcome distribution, and 2.83 is the standard deviation of this distribution
(i.e. the square root of the variance, 8).
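If you’d rather not use Excel, the same computation is available in Python’s standard library: NormalDist(μ, σ).cdf(x) plays the role of NORMDIST(x, μ, σ, TRUE). For our example game:

```python
# Area under the game-outcome distribution to the left of the spread (4):
# the probability that the Seahawks beat the spread.
from statistics import NormalDist

game_outcome = NormalDist(mu=6, sigma=8 ** 0.5)   # sigma = sqrt(8), about 2.83

p_seahawks_cover = game_outcome.cdf(4)    # region to the left of 4
p_49ers_cover = 1 - p_seahawks_cover      # the complementary region

assert round(p_seahawks_cover, 2) == 0.24
assert round(p_49ers_cover, 2) == 0.76
```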

To compute the region of interest
for the 49ers, you just need to subtract the Seahawks win probability from 1. This is because the total area under the
normal distribution (or any probability distribution) is equal to 1. Alternatively, we know that one of the two
teams has to win, so that:

Win-Probability_{49ers} + Win-Probability_{Seahawks} = 1

which can be re-written as:

Win-Probability_{49ers} = 1 – Win-Probability_{Seahawks}
In other words, subtracting the area of the blue region from 1 is *equivalent* to computing the area of the red region.
To see all of the outcome
probabilities (and the command used to compute each of them), just look at the
table labeled “outcome probabilities” in the spreadsheet.

**_______________________________________________________________________________________**

__A brief summary, and a look at what's ahead__

So, to briefly sum things up:

- Starting with the rating
distributions for two teams, we looked at how one can compute a
probability distribution over possible game outcomes (both through simulation,
and by directly computing the normal distribution over outcomes).

- We then looked at how you can use the probability distribution over game outcomes to compute each team’s win-odds, both against the spread and straight up (i.e., on a moneyline wager).

__In the next post:__ I will give a tutorial on fitting this model using Excel, and I will show how we can then apply the concepts from this post to get *actual estimates* of win probabilities for the following week's games.

**_______________________________________________________________________________________**

__A Final Word Of Caution About Variance__

For the sake of illustration, I’ve assumed that all team variances were equal to 4 throughout this entire post. But—and this is extremely important to understand—the actual variance of teams (or of game outcomes) is *much* larger: based on the model I fit in Part 2 of this series, the estimated variance of the distribution of game outcomes was about σ^{2} = 140 (corresponding to a variance of 70 for each team, if we assume all teams have equal variance).
And, keeping in mind that
variance is a measure of uncertainty (or error), the upshot of all this is that
there is much more uncertainty about the outcomes of actual NFL games than in
the example I’ve used in this post.

To give a sense of how this
increase in variance changes our estimates of outcome probabilities: in our
example game the estimated 49ers win-probability drops from .983 to .694 on a
money-line bet, and from .760 to .567 against the spread. You can see
this for yourself by setting the values of the team variances to 70 on the
Excel spreadsheet.
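The comparison above can be reproduced directly. The helper function below is illustrative (it is not from the spreadsheet); the probabilities come straight from the outcome distribution’s cdf.

```python
# Recompute the example game's win probabilities under the toy variance (4)
# and the more realistic fitted variance (70 per team).
from statistics import NormalDist

def win_probs(mu=6, team_variance=4, spread=4):
    # Game-outcome variance is the sum of the two (equal) team variances.
    d = NormalDist(mu=mu, sigma=(2 * team_variance) ** 0.5)
    return 1 - d.cdf(0), 1 - d.cdf(spread)   # (moneyline, against the spread)

toy_ml, toy_ats = win_probs(team_variance=4)
real_ml, real_ats = win_probs(team_variance=70)

assert (round(toy_ml, 3), round(toy_ats, 3)) == (0.983, 0.760)
assert (round(real_ml, 3), round(real_ats, 3)) == (0.694, 0.567)
```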

Or, to put this into pictures, here are more realistic versions of some of the figures we’ve looked at in this post (i.e., with team variances set to 70).

**_______________________________________________________________________________________**

__Footnotes:__

^{1} There is a simple way to estimate the team variance from our model, but it is irrelevant to the concepts I cover in this post. In the next post (where we will be fitting the model in Excel), we'll also look at how to estimate the team variance in this model.

^{2} This is a somewhat rough description of this property. First, I phrased it to be specific to the fact that we are thinking about the difference, rather than the sum, of the scores (it applies to both). Secondly, for this property to hold, the two distributions need to be *independent*. What this basically means is that the two distributions do not interact. For example, in our model this means that the 49ers rating distribution is equivalent no matter what team they are playing. Since this is precisely one of the assumptions in the model, I didn’t want to go into this issue in depth. Whether this assumption is a reasonable one is a wholly different issue (and something we can examine later), but it is not something we need to worry about for now. Note that, if you are unfamiliar with modeling, this assumption may feel *very* wrong, and in fact you probably have the right instinct. But, for now, just trust me when I say: although this assumption of independence is almost certainly incorrect, there are lots of good reasons to use it (one being that it is a way to deal with the fact that there is very little data for our model, due to the NFL season being so short).

Loving these posts!

Any idea when Part 5 is coming? Thanks :)

Part 5 should be coming sometime next week (and definitely before the super bowl).

Definitely!?!

;)

Er... there were some unfortunate delays in getting the next part posted before the super bowl (due to grad-school stuff getting in the way). But I will write a follow-up.

And for what it's worth, the upside is that this model would have performed quite poorly in the playoffs/super bowl :-/

Hey Tim, just found your post and found it fascinating. I was wondering if you got around to posting part 5? Thanks

Part 5??