By: Jake Levene and Zach Diamandis
Around this time every year, a familiar but always exciting debate reignites: who should be the NBA’s MVP? There are familiar faces, such as LeBron James, who has finished in the top 10 of MVP voting every year since his second season in 2004-05, and exciting newcomers, like former Sixth Man of the Year winner James Harden. Since we are at the end of the season, it is fairly easy to determine a shortlist of candidates; however, we wanted to investigate, via regression analysis, how well MVP voting can be modeled. The hypothesis that motivated this question was the following: while quantitative performance statistics have some influence on the selection of the MVP, the most important factor in determining the winner in any given year is “narrative.” Narrative is naturally a hard-to-define concept, but by it we mean the kinds of qualitative, hard-to-model factors that often influence voters. A player making a big leap or receiving a lot of media attention can certainly sway opinion on the vote. Because we use only quantitative measures, the worse our regression performs, the stronger this hypothesis appears.
To test this hypothesis, we ran a series of regressions and judged them by the number of MVP winners they correctly predicted over the last 20 seasons. For each season, we used the top 5 MVP vote getters, ranked by MVP score: the weighted sum of a player’s first, second, third, fourth and fifth place votes (a first place vote counts as 10 points, a second place vote as 7, a third place vote as 5, a fourth place vote as 3, and a fifth place vote as 1). Our best model correctly predicted the MVP in 16 of the last 20 seasons.
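To make the scoring and the evaluation criterion concrete, here is a minimal Python sketch. It assumes the MVP score is the points total implied by the weights above, and the function names and the toy data (players "A" and "B" with made-up scores) are purely illustrative, not taken from our dataset.

```python
# Points per ballot position as described in the text:
# 1st = 10, 2nd = 7, 3rd = 5, 4th = 3, 5th = 1.
VOTE_POINTS = (10, 7, 5, 3, 1)


def mvp_score(votes):
    """Return the points total for five vote counts (1st through 5th place)."""
    if len(votes) != len(VOTE_POINTS):
        raise ValueError("expected five vote counts, first through fifth place")
    return sum(points * count for points, count in zip(VOTE_POINTS, votes))


def count_correct_winners(seasons):
    """Count seasons where the top predicted score matches the actual winner.

    `seasons` maps a season label to a list of
    (player, actual_score, predicted_score) tuples (hypothetical layout).
    """
    correct = 0
    for candidates in seasons.values():
        actual_winner = max(candidates, key=lambda c: c[1])[0]
        predicted_winner = max(candidates, key=lambda c: c[2])[0]
        correct += int(actual_winner == predicted_winner)
    return correct


# Made-up example: the prediction misses in season "s1" and hits in "s2".
toy_seasons = {
    "s1": [("A", 900, 800), ("B", 700, 850)],
    "s2": [("A", 500, 400), ("B", 900, 950)],
}
print(mvp_score([1, 1, 1, 1, 1]))          # one vote at each position
print(count_correct_winners(toy_seasons))  # seasons predicted correctly
```

Under this criterion a model is only rewarded for ranking the right player first in a season; how close the predicted score is to the actual score does not matter.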
Model Building Process:
To arrive at this final model, we tried a number of different approaches. Initially, we ran a rather large regression with 17 different predictors, which included several counting, shooting, and team-level statistics. We also included advanced individual performance metrics such as win shares, value over replacement player (VORP), and box plus/minus (BPM). This full model suffered from collinearity, so we performed a backwards-stepwise regression, which resulted in the regression below. Another strategy we tried was to run several regressions using R’s “allpossregs” command and select a model based on its Mallows’ Cp. Our best model using this strategy predicts 14 of 20 winners correctly. By that measure (predictive ability), as well as by adjusted R squared and residual standard error, this model is quite similar but slightly inferior to the model we ultimately used.
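For readers unfamiliar with Mallows’ Cp, the sketch below shows the standard textbook formula, Cp = SSE_p / s² − n + 2p, where s² is the mean squared error of the full model and p counts the parameters (including the intercept) in the candidate model. The function name and the sums of squared errors are made up for illustration; the 100 observations (20 seasons times 5 players) and 18 parameters (17 predictors plus an intercept) are our reading of the setup described above, not output from R.

```python
def mallows_cp(sse_subset, n_params_subset, sse_full, n_params_full, n_obs):
    """Mallows' Cp for a candidate model nested in a full model.

    Parameter counts include the intercept. Values of Cp close to the
    candidate's own parameter count suggest little bias from the
    predictors that were dropped.
    """
    if n_obs <= n_params_full:
        raise ValueError("need more observations than full-model parameters")
    full_mse = sse_full / (n_obs - n_params_full)  # s^2 from the full model
    return sse_subset / full_mse - n_obs + 2 * n_params_subset


# Illustrative numbers only (the SSE values are invented):
n_obs = 100
print(mallows_cp(82.0, 18, 82.0, 18, n_obs))  # full model against itself
print(mallows_cp(95.0, 6, 82.0, 18, n_obs))   # 5 predictors + intercept
```

By construction the full model’s Cp equals its own parameter count, so the statistic is only informative when comparing the smaller candidate models against one another.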
Discussion of Final Model and Predictions:
A number of interesting conclusions can be drawn from our final model. One salient point is the importance of team wins as a predictor of MVP score. Team wins proved to be by far the most important predictor, and that is undoubtedly part of the reason that our model did not correctly predict Russell Westbrook’s 2017 MVP win. The Oklahoma City Thunder won 47 games that year, an unusually low total for an MVP’s team. Also, many experts regarded Westbrook’s performance last season as quantitatively inferior to James Harden’s from a strictly basketball perspective. This suggests that narrative exerts influence, as Westbrook’s triple double average won the day despite Harden’s more robust statistical performance. On the other hand, the model’s largest predicted MVP score belonged to Stephen Curry’s historic 2016 MVP campaign, in which he became the first unanimous MVP ever. Curry put together one of the greatest individual seasons in history, with shooting splits of 50-45-90 (only reached by two other players ever), while breaking the record for most 3 pointers made in a season as his team broke the record for wins.
Another topic that merits attention is the four seasons for which the analysis predicted the wrong MVP winner. These were Russell Westbrook’s aforementioned 2017 campaign, Dirk Nowitzki’s 2007 victory, the second of Steve Nash’s back to back wins in 2006, and Allen Iverson’s 2001 victory. A plot of the Cook’s distances of all points is below, and these four are, of course, the four most obviously unusual points:
We initially hypothesized that Westbrook would be the biggest outlier because of the controversy surrounding his victory, which stemmed largely from his team’s win total, unprecedentedly low for an MVP winner. Although the model did predict James Harden as 2017’s winner over Westbrook, by all accounts Westbrook was not an outlier. Neither Westbrook’s 2017 season nor Nowitzki’s 2007 campaign was seen to be overly influential; however, our model flagged Nash’s 2006 season as a potentially overly influential point. This is likely because he averaged only 19 points per game, a rather low total for an MVP winner, and was fairly unremarkable in all other categories besides assists. Nash’s defense is notoriously underrated and difficult to capture in simple statistics, and that season is one in which “narrative” clearly played a role. Nash was the leader of the so-called “7 seconds or less” Suns teams that revolutionized today’s NBA, but was not known as a defensive stalwart. Shaquille O’Neal, who won only one MVP, has repeatedly claimed he deserved to win both of Nash’s MVPs. Our model does not support this: O’Neal did not finish inside the top 5 in MVP voting in 2006, and he has a substantially lower predicted MVP score than Nash in Nash’s 2005 winning season, lending no credence to that claim. As a final note, the predicted winners for these four seasons (2017, 2007, 2006, and 2001) would have been James Harden, Steve Nash, Chauncey Billups, and Shaquille O’Neal, respectively.
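For reference, Cook’s distance for a single observation combines the size of its residual with its leverage. The sketch below uses the standard formula D = (e² / (p · MSE)) · h / (1 − h)², where e is the residual, h the leverage, p the number of model parameters, and MSE the model’s mean squared error. The function name and the input values are illustrative assumptions, not numbers from our regression.

```python
def cooks_distance(residual, leverage, n_params, mse):
    """Cook's distance for one observation in an ordinary least squares fit.

    residual: observed minus fitted value for the observation
    leverage: the observation's hat-matrix diagonal entry, in [0, 1)
    n_params: number of fitted parameters, including the intercept
    mse:      the model's mean squared error
    """
    if not 0 <= leverage < 1:
        raise ValueError("leverage must be in the interval [0, 1)")
    if n_params <= 0 or mse <= 0:
        raise ValueError("n_params and mse must be positive")
    return (residual ** 2 / (n_params * mse)) * leverage / (1 - leverage) ** 2


# Same residual, different leverage (made-up values): the point with an
# unusual predictor profile is far more influential.
print(cooks_distance(2.0, 0.05, 4, 1.0))
print(cooks_distance(2.0, 0.20, 4, 1.0))
```

This is why a wrongly predicted season is not automatically an influential one: a large residual matters much more when the player’s statistical profile is also unusual relative to the other candidates in the data.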
Sources of Error:
The first and most obvious source of error is incomplete data. While Basketball-Reference gives full summary statistics, the NBA has stated multiple times that there are no publicly available defensive statistics that fully and satisfactorily capture players’ defensive skill. Thus, our model likely under-accounts for defense when assessing MVP candidates. Further, we did not add all available statistics into the model as regressors, so it is possible that a different selection of statistics would lead to different results. Finally, as mentioned earlier, we posit that narrative is highly influential, which is supported by a relatively low R squared value. To improve this model, we could account for narrative by using a variable like positive news coverage or the number of Google searches for a player’s name together with “MVP”.
A second source of error is that the MVP score is not computed on a consistent basis each year. Some years have a different number of voters than others, so MVP scores from any two years are not exactly comparable. There are also two lockout years in the data, which we addressed by scaling these years in proportion to the number of games missed. Some of our significant predictors probably scaled fairly accurately, as most individual statistics tend not to change all that much after 66 and 50 games (the totals for the two lockout years). However, games played and team wins might have turned out differently over a full season because of injuries or other sudden changes.
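The lockout adjustment amounts to a simple proportional rescaling. The sketch below assumes counting statistics such as team wins are prorated to a standard 82-game schedule; the function name and the example inputs are hypothetical and only illustrate the arithmetic.

```python
FULL_SEASON_GAMES = 82  # standard NBA regular-season length


def scale_to_full_season(value, season_games,
                         full_season_games=FULL_SEASON_GAMES):
    """Prorate a counting statistic from a shortened season to a full one."""
    if season_games <= 0:
        raise ValueError("season_games must be positive")
    return value * full_season_games / season_games


# Made-up win totals for the 50-game and 66-game lockout seasons.
print(scale_to_full_season(33, 50))  # wins in a 50-game season
print(scale_to_full_season(50, 66))  # wins in a 66-game season
```

Rate statistics such as shooting percentages or per-game averages need no adjustment of this kind, which is part of why they are less affected by the shortened seasons than games played and team wins.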
Prediction for This Year:
As a final note, one additional test of our model will be how well it predicts this year’s MVP voting, so we used the relevant statistics to make MVP score predictions for some of the players we expect to be among the highest vote getters this year. The model predicts that James Harden will have the highest MVP score and win the award, with a score of 1198. In descending order, the players that followed him were LeBron James, Kevin Durant, Russell Westbrook, DeMar DeRozan, Anthony Davis, and Giannis Antetokounmpo. Many NBA media outlets and personalities are predicting a win for Harden, so this bodes well for our model and will almost certainly result in a new model success rate of 17/21, or 81%. A future area of interest could be applying our model midseason to see how it fares with incomplete data and how robust it is to swings in team and player performance.
Editor’s Note: This analysis was initially conducted as a final project for the class Statistics 109, “Intro To Statistical Modelling,” and has been adapted for this blog. If you have any questions about this article, please feel free to reach out to Jake and Zach at jlevene@college.harvard.edu and z_diamandis@college.harvard.edu.