A Simple Improvement to FiveThirtyEight’s NBA Elo Model

By Erik Johnsson

A couple of weeks ago, while I was watching James Harden lead the Houston Rockets to a stunning overtime victory over the Golden State Warriors, I was curious to see how the highly popular Elo and CARMELO models at Nate Silver’s FiveThirtyEight ranked each of the NBA’s 30 teams. When I looked at their current 2018-2019 predictions, I noticed something I thought was a little strange: the CARMELO model (the more sophisticated of the two) ranked the Utah Jazz and the New Orleans Pelicans as the 9th and 10th strongest teams in the league. In fact, FiveThirtyEight’s simulations showed that the 20-21 Jazz, who then sat at 10th in the West, had a whopping 93% chance of making the playoffs. Furthermore, it ranked the 19-22 Pelicans (12th in the West) ahead of teams like the Pacers, the Lakers, and the Trail Blazers.

Of course, as one would expect from FiveThirtyEight, these models are very accurate. In the 2017-2018 season, the CARMELO system was the second-most accurate of all the major models tracked on the APBRmetrics message board. And given these teams’ performances last year, there is good reason to believe that both will return to form in the second half of the season. Plus, the newest version of CARMELO was designed to favor historically good teams (like the Warriors) over teams that are “hot” (like the Clippers). Still, the results from the CARMELO model got me wondering how games early in the season should be valued against games late in the season, particularly when it comes to Elo models. So, I decided to investigate.

Note: The 2018-2019 version of CARMELO no longer incorporates an Elo format, so this post will primarily discuss FiveThirtyEight’s pure Elo model. All information about these models can be found here.

Luckily, a post by Matteo Hoch on his blog ergosum.co conveniently details the specifics of FiveThirtyEight’s Elo model so that it can be easily replicated. As proof that his formulae are correct, below is a scatterplot of each team’s Elo rating at the end of the 2018 season from FiveThirtyEight’s model vs. the ratings from the replicated model. As you can see, these points almost perfectly form a straight line (the small deviations are the result of floating-point arithmetic and rounding errors over thousands of games).

Hoch_Comparison

In case you are unfamiliar with how Elo models work, I will briefly explain in the context of the NBA: each team is assigned a numeric rating that acts as an indicator of team strength. When a team wins a game, its rating goes up; when it loses, its rating goes down. The amount by which the winning team’s rating increases is the same amount by which the losing team’s rating decreases (the system is zero-sum). At first, these ratings will be fairly inaccurate, since the model has not had sufficient time to learn. As teams play more games, however, the ratings will more accurately reflect each team’s true ability. These ratings can then be used to estimate the percentage chance that a team will win a given matchup.
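As a minimal sketch of the standard Elo update (the 400-point scale is the usual Elo convention, and K = 20 here is just a stand-in for the moving K-factor discussed below):

```python
def expected_score(rating_a, rating_b):
    """Probability that team A beats team B implied by the two ratings."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a, rating_b, a_won, k=20.0):
    """Zero-sum update: the winner gains exactly what the loser drops."""
    e_a = expected_score(rating_a, rating_b)
    delta = k * ((1.0 if a_won else 0.0) - e_a)
    return rating_a + delta, rating_b - delta
```

For two evenly rated 1500 teams, the winner moves to 1510 and the loser to 1490; an upset win over a much stronger opponent moves the ratings further.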

The component of FiveThirtyEight’s Elo model that I will be focusing on is the “K-factor”. Essentially, the K-factor in any Elo model is what decides how much a team’s rating should change after a single game. A K-factor that is too big will make the model overreact to teams that are just “hot”, or over-performing due to chance. A K-factor that is too small will fail to learn quickly enough to make accurate predictions. Some Elo models have a fixed K, so that every game is weighted equally. Others, like FiveThirtyEight’s, have a moving K. This particular model has a K-factor that is dependent on 1) the margin of victory and 2) the difference in Elo rating between the two teams.
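Per FiveThirtyEight’s published methodology (and Hoch’s replication of it), the base K of 20 is scaled by a margin-of-victory multiplier that also shrinks when the Elo favorite is the one winning big. The constants below are my recollection of that write-up and are worth double-checking against it:

```python
def game_k(margin_of_victory, winner_elo_diff, base_k=20.0):
    """FiveThirtyEight-style moving K: larger margins move ratings more,
    dampened when the winner was already the Elo favorite.
    winner_elo_diff = winner's pregame Elo minus loser's (HCA-adjusted)."""
    mov_multiplier = ((abs(margin_of_victory) + 3) ** 0.8) / (7.5 + 0.006 * winner_elo_diff)
    return base_k * mov_multiplier
```

Note how the two dependencies pull in opposite directions: a 20-point blowout moves ratings more than a 5-point squeaker, but a heavy favorite winning by 5 earns less than an underdog doing the same.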

What I propose is that making this K-factor additionally depend on the point in the season at which the game is played will improve the model’s prediction accuracy. In particular, I hypothesize that this Elo model does not do enough to account for off-season changes in team strength, and thus should make greater rating adjustments for games early in the season than for games late in the season. FiveThirtyEight’s only attempt to account for this is to adjust each team’s rating at the beginning of every season by reverting it toward a mean of 1505, using the following formula:

elo_season_adjustment
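As a sketch of that adjustment, partial mean reversion looks like the following; the reversion fraction of 0.25 is my recollection of FiveThirtyEight’s write-up, not a value stated in this post, so treat it as an assumption:

```python
def season_reset(rating, mean=1505.0, reversion=0.25):
    """Pull each team's carried-over rating part of the way back toward
    the league mean at the start of a new season."""
    return (1 - reversion) * rating + reversion * mean
```

A 1705-rated team would open the next season at 1655, while a team already at the mean stays put.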

To test this, I incorporated an extra multiplier into FiveThirtyEight’s K-factor that decays at a certain rate for every 100 games played league-wide in a season. At the end of every season, the multiplier resets. To find the best possible decay factor, I ran the model on all games in the 2004–2009 seasons as the model’s “burn-in” period, and then used Bayesian Optimization to validate and select the model’s new parameters on all games in the 2010–2018 seasons. In short, Bayesian Optimization is a learning algorithm that minimizes a given loss function by first guessing random values for each parameter, then building a probability model based on the results. As it learns, it makes guesses closer to the values that minimize the loss function. Here, our loss function (i.e. our accuracy metric) is log-loss, shown below, where y = 1 if the home team won (0 if they lost) and p is the predicted probability that the home team would win:

LogLoss = −Σᵢ [ yᵢ log(pᵢ) + (1 − yᵢ) log(1 − pᵢ) ], summed over all games i
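In code, the per-game log-loss can be computed like this (natural logs, with y equal to 1 for a home win and p the predicted home-win probability):

```python
import math

def log_loss(y, p):
    """Per-game log-loss: near zero when a confident prediction is right,
    large when a confident prediction is wrong."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# When the home team wins, a 99% prediction scores far better than a 51% one:
# log_loss(1, 0.99) ≈ 0.010 vs. log_loss(1, 0.51) ≈ 0.673
```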

For each game, the model outputs an estimated probability that the home team will win. Here, log-loss is a better metric than pure prediction accuracy because it rewards model outputs of .51 and .99 differently. For example: consider two models that are trying to predict which team will win a game. One model says the home team has a 51% chance of winning, and the other says the home team has a 99% chance of winning. If the home team wins, pure accuracy would treat both models as equally accurate, since both predicted a home win. Log-loss, on the other hand, would score the second model as more accurate, because it was more confident that the home team would win. Minimizing total log-loss, Bayesian Optimization finds that the following are the optimal values of the new parameters. (Additionally, using the same algorithm, a new Home-Court Advantage adjustment was fit. FiveThirtyEight’s model simply adds 100 to the home team’s rating to account for HCA, but since my model introduces a new K-decay factor, a different adjustment is likely optimal.)

best_params
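For concreteness, here is one hypothetical way to implement a multiplier that “decays at a certain rate every 100 league-wide games”; the exponential form and the 0.95 rate are illustrative assumptions on my part, not the fitted values shown above:

```python
def k_decay_multiplier(league_games_played, decay_rate=0.95):
    """Hypothetical seasonal K-decay: shrink the K-factor by `decay_rate`
    for every 100 games played league-wide; reset to 1.0 each new season."""
    return decay_rate ** (league_games_played // 100)
```

With 1,230 regular-season games in a full NBA season, a multiplier like this steps down roughly a dozen times before resetting, so early-season results move ratings noticeably more than late-season ones.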

These values indicate that it was indeed optimal for the model’s K-factor to decay over the course of a season. To see how the algorithm optimized each parameter, here are the plots of total log-loss against different values of each parameter. You can clearly see how some values yield lower losses than others do.

bayes_opt2.png

So now that we have an optimal model, how does it compare to FiveThirtyEight’s? Let’s look at the difference in each model’s total information loss over the last nine seasons (or 11,578 games):

Loss_Difference_Intervals

Over those nine seasons, though it performs slightly worse in three of them, the K-Decay Elo model generally outperforms FiveThirtyEight’s pure Elo model. Of course, the apparent flaw here is that the 95% confidence interval bounds totally encapsulate y = 0, which means that there is no single season in which the K-Decay model significantly outperforms FiveThirtyEight’s model.

However, we’re not just testing the two models for their performance in a single season – we’re testing their performance over many seasons. This is because it is quite possible for a worse model to out-predict a better one, just by chance, in a single season. However, over a longer stretch of time, that is decreasingly likely to happen. So, to better quantify the difference in predictive strength of these two models, we can calculate the Akaike Information Criterion (AIC) of each one over all games from 2010 to 2018. Here are the AICs of the two models (lower is better):

K-Decay Elo AIC:                  13,895.90

FiveThirtyEight Elo AIC:    13,903.78

AIC is useful in that it measures the amount of information loss in a model while also accounting for the number of parameters (or predictors) in the model. Since statistical models almost always improve when you add more parameters, it is important to make sure that the resulting improvement is substantial enough to make the additional parameters worthwhile, without overfitting the model. Here, since we are adding a K-decay rate to the traditional Elo model, we need to test whether the new parameters lower the AIC of the pure Elo model. Indeed, they do.
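As a sketch of the relationship being used here: with log-loss totaled in nats over all games, the log-likelihood is simply its negative, so AIC reduces to a one-liner (k counts the fitted parameters):

```python
def aic(total_log_loss, num_params):
    """AIC = 2k - 2*ln(L). With log-loss summed in nats over all games,
    ln(L) = -total_log_loss, so this reduces to 2k + 2*total_log_loss."""
    return 2 * num_params + 2 * total_log_loss
```

Note that this only works if the log-loss is a natural-log total over all games, not a per-game average.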

Better yet, these AICs allow us to run a hypothesis test of whether the K-Decay Elo model is significantly better than FiveThirtyEight’s Elo model. Under the null hypothesis that the two models have equal information loss, and the alternative hypothesis that the K-Decay model has less information loss, we can calculate a p-value for whether to reject the null in favor of the alternative. We will reject the null if the p-value is below the standard .05 Type I error rate. Here is how we calculate the p-value:

p = exp((AIC_K-Decay − AIC_538) / 2) = exp((13,895.90 − 13,903.78) / 2) ≈ .019

This p-value tells us that FiveThirtyEight’s model is only .019 times as likely as the K-Decay model to minimize information loss, which means that we can be reasonably certain that, on the whole, the K-Decay model makes significantly better predictions than the FiveThirtyEight model does.
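The relative-likelihood calculation can be checked directly from the two AIC values reported above:

```python
import math

def relative_likelihood(aic_other, aic_best):
    """exp((AIC_best - AIC_other) / 2): how likely the higher-AIC model is,
    relative to the lower-AIC one, to minimize information loss."""
    return math.exp((aic_best - aic_other) / 2)

print(round(relative_likelihood(13903.78, 13895.90), 3))  # → 0.019
```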

Of course, this enhanced version of FiveThirtyEight’s Elo model still leaves room for improvement. For example: the K-Decay model does not treat playoff games any differently than it treats regular season games (even though there are known predictive differences between the two), nor does it have a decay rate on Home-Court Advantage (which is known to have decreased over time). Regardless, perhaps this idea of differently valuing early-season and late-season games could even further boost the predictive capabilities of CARMELO and other advanced NBA projection systems.

Note: Many of the ideas presented in this post are linked to the Glicko rating system, which basically adds a “ratings deviation” parameter to the Elo rating system. Look out for a later post that dives into the Glicko model, and how it can be used to further improve NBA predictions.

If you have any questions for Erik, please feel free to reach out to him by email at ejohnsson@college.harvard.edu.

Just for fun, here is how the new K-Decay model currently ranks each of the NBA’s 30 teams, compared to the pure FiveThirtyEight Elo model that it is based on (as of January 16, 2019):

all_new_rankings
